RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision

RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision
Robert LiKamWa, Yunhui Hou, Yuan Gao, Mia Polansky, Lin Zhong
roblkw@rice.edu, houyh@rice.edu, yg18@rice.edu, mia.polansky@rice.edu, lzhong@rice.edu
likamwa@asu.edu, julianyg@stanford.edu

A vision of vision...
Sense, Compute, Interact
Energy efficiency goal: 10 mW
Idle power consumption of smartphone
Week-long use of small battery (2 Wh)
Opens door to energy-harvesting solutions
... continuous mobile vision!
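
The "week-long use" figure follows from the two numbers on this slide. A minimal sketch of the arithmetic, assuming a constant 10 mW draw and a fully usable 2 Wh battery (the function name is illustrative, not from the talk):

```python
def battery_life_hours(capacity_wh: float, power_mw: float) -> float:
    """Hours a battery of `capacity_wh` watt-hours lasts at a constant `power_mw` draw."""
    power_w = power_mw / 1000.0  # milliwatts to watts
    return capacity_wh / power_w


hours = battery_life_hours(capacity_wh=2.0, power_mw=10.0)
days = hours / 24.0
print(f"{hours:.0f} hours, about {days:.1f} days")
```

Under these assumptions the battery lasts a little over eight days, which is consistent with the week-long claim.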

Vision demands energy
Sense: 1 nJ per pixel
Ultra-low-power CMOS imager (Himax 2016)
Compute: 12 nJ per data movement
Quantifying Energy Cost of [Mobile] Data Movement (Pandiyan, Wu, IISWC 2014)

Key Idea: Shift processing into the analog domain!
[Diagram labels: Process, Sense + Sense, Compute]
Analog Challenges:
Design complexity
Noisy signal fidelity

Challenge #1: Design complexity
No bus for control/data
Analog exchanges data on pre-routed interconnects
Congestion and overlap cause parasitics
Source: Wikipedia
Complexity limits the extent of analog computing
RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision, LiKamWa et al. [ISCA 16]

Challenge #2: Noisy signal fidelity
Analog circuits suffer from thermal noise ($v_n^2 = k_B T / C$) or energy cost ($E = C V^2 / 2$)
Low C: high noise, low energy
High C: low noise, high energy
Accumulating signal noise limits the extent of efficient analog computing
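
The trade-off between the two formulas on this slide can be made concrete with a short calculation. This is a sketch under stated assumptions, not a model of the RedEye circuits: the 300 K temperature, the 1 V signal level, and the 10 fF and 100 fF capacitor values are all hypothetical, and the function names are illustrative.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def thermal_noise_power(c_farads: float, temp_k: float = 300.0) -> float:
    """Mean-square kT/C noise voltage (V^2) sampled onto a capacitor."""
    return K_B * temp_k / c_farads


def charging_energy(c_farads: float, v_volts: float) -> float:
    """Energy E = C V^2 / 2 stored when charging the capacitor to v_volts."""
    return 0.5 * c_farads * v_volts ** 2


def snr_db(c_farads: float, v_signal: float = 1.0, temp_k: float = 300.0) -> float:
    """SNR in dB of a signal with mean-square value v_signal^2 against kT/C noise."""
    return 10.0 * math.log10(v_signal ** 2 / thermal_noise_power(c_farads, temp_k))


for c in (10e-15, 100e-15):  # hypothetical 10 fF and 100 fF capacitors
    print(f"C = {c * 1e15:.0f} fF: SNR = {snr_db(c):.1f} dB, "
          f"E = {charging_energy(c, 1.0) * 1e15:.1f} fJ")
```

Because both noise power and energy scale with C (inversely and directly, respectively), each 10x increase in capacitance buys 10 dB of SNR and costs 10x the energy in this simple model.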

Complexity and noise limit the efficiency of prior analog architectures
Analog neural processing (St. Amant et al. @ UT-Austin, 2014)
ADC accounts for >90% of energy consumption

Insight #1: Vision is highly structured
[Figure: example image labeled "key"; ConvNet blocks: Convolution...]
Repetitive building blocks
Reusable structure
Patch-based operations
Data locality
Dataflow bandwidth reduces with processing
Feed-forward

What about noise?
[Figure label: jigsaw]

Insight #2: Noisy images are okay for vision
[Figure: GoogLeNet accuracy (%) and log10 of computational energy (J) versus computational noise SNR (dB), 10 to 50 dB; example labels: jigsaw, key]

RedEye vision sensor architecture
Programmable analog ConvNet execution
Low-complexity modules for design scalability
Noise mechanisms to trade accuracy/efficiency
Reduce readout energy by 100x

RedEye vision sensor architecture
[Block diagram. Analog: Optical input, Columns, Modules. Digital Control Plane: System Bus (Programming), Program SRAM, Digitally Clocked Controller (Kernel config, Noise tuning, Flow control), Feature SRAM, System Bus (Send Results), Output]

[Figure: one pixel column and its modules, top to bottom]
Pixel Column
Correlated Double Sampling (CDS) (Capacitance Noise Tuning)
Analog Memory (module 1: Storage)
Vertical Weighted Averaging (Column Vector of Kernel)
Accumulation (H columns) (module 2: Convolutional)
Vertical (H columns) (module 3)
Cyclic Flow Control
Analog-to-Digital Conversion (Quantization Noise Tuning) (module 4: Quantization)
Output
Reusable Modules: Programmable kernel; Cyclic flow for reuse

[Figure: five identical column pipelines side by side, each with Pixel Column, Correlated Double Sampling (CDS), Analog Memory, Vertical Weighted Averaging, Accumulation, Vertical, Cyclic Flow Control, Analog-to-Digital Conversion, Output]
Column-parallel topology for streaming data locality
Reusable Modules: Programmable kernel; Cyclic flow for reuse
Data locality for patches
Streaming processing
Column topology

Streaming patch-based access
Vertical access through temporal buffering
Horizontal access through column interconnects
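
The access pattern on this slide can be sketched in software. The following is a behavioral illustration only, not the analog circuit: each column keeps its last K samples (temporal buffering), weights them by one column vector of the kernel, and the partial sums of K neighboring columns are then accumulated across the column interconnects. The function names are hypothetical, and the sketch uses plain weighted sums without the normalization an averaging circuit would apply.

```python
def column_convolve(image, kernel):
    """Valid-mode 2D correlation computed column by column.

    image:  list of rows (lists of numbers), streamed top to bottom
    kernel: K x K list of lists
    """
    k = len(kernel)
    height, width = len(image), len(image[0])
    out = []
    for top in range(height - k + 1):
        # Temporal buffering: every column holds its k most recent samples.
        buffers = [[image[top + r][c] for r in range(k)] for c in range(width)]
        row_out = []
        for left in range(width - k + 1):
            acc = 0.0
            for j in range(k):
                # Vertical step inside column (left + j): weight the buffered
                # samples by column j of the kernel.
                col = buffers[left + j]
                partial = sum(kernel[r][j] * col[r] for r in range(k))
                # Horizontal step: accumulate partial sums from k neighbors.
                acc += partial
            row_out.append(acc)
        out.append(row_out)
    return out


def direct_correlate(image, kernel):
    """Reference: the same result computed patch by patch."""
    k = len(kernel)
    return [[sum(kernel[r][c] * image[top + r][left + c]
                 for r in range(k) for c in range(k))
             for left in range(len(image[0]) - k + 1)]
            for top in range(len(image) - k + 1)]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(column_convolve(image, kernel))
```

Splitting the kernel into column vectors means no column ever needs random access to the full patch; it only needs its own buffered samples and its neighbors' partial sums.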

Noise-tuning mechanisms
Mixed-signal Multiply-Accumulate w/tunable fidelity vs. efficiency
SAR ADC w/tunable resolution vs. efficiency
Source: Wikipedia
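
To illustrate why resolution is a natural tuning knob for a SAR ADC, here is a behavioral sketch of successive approximation. It is an idealized model with hypothetical names, not the RedEye design: it assumes an ideal comparator and DAC, and it uses the textbook rule of thumb that an ideal N-bit quantizer gives roughly 6.02 N + 1.76 dB of SNR for a full-scale sine input.

```python
def sar_convert(vin: float, vref: float, bits: int) -> int:
    """Binary-search `vin` against an ideal DAC; one comparison per bit."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        # Ideal DAC output for the trial code.
        if vin >= trial * vref / (1 << bits):
            code = trial  # keep the bit
    return code


def ideal_sqnr_db(bits: int) -> float:
    """Rule-of-thumb quantization SNR for a full-scale sine input."""
    return 6.02 * bits + 1.76


for n in (4, 6, 8):
    print(f"{n} bits: code = {sar_convert(0.6, 1.0, n)}, "
          f"comparisons = {n}, ideal SQNR = {ideal_sqnr_db(n):.2f} dB")
```

Each bit dropped removes one comparison cycle and about 6 dB of quantization SNR, which is the kind of resolution-versus-efficiency trade the slide describes.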

Estimation and Evaluation
Cadence Spectre: Noise, Power, Timing
Parametrized Behavioral Model
RedEye-caffe Sim. Framework: Quantized Weights, Processing Noise Layer, Quantization Noise Layer, GoogLeNet_v1
https://github.com/julianyg/redeye_sim
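
The two noise layers named on this slide can be approximated in a few lines. This is a hedged sketch of the general idea, not the API of the redeye_sim repository: the function names, the Gaussian noise model, and the uniform quantizer over a 0 to 1 range are all assumptions made for illustration.

```python
import math
import random


def add_noise_at_snr(values, snr_db, rng):
    """Add zero-mean Gaussian noise so the result has the requested SNR in dB."""
    signal_power = sum(v * v for v in values) / len(values)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in values]


def quantize(values, bits, full_scale=1.0):
    """Uniformly quantize values in [0, full_scale] to 2**bits levels."""
    levels = (1 << bits) - 1
    step = full_scale / levels
    return [min(max(round(v / step), 0), levels) * step for v in values]


def measured_snr_db(clean, noisy):
    """SNR in dB estimated from a clean signal and its noisy copy."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


rng = random.Random(0)  # fixed seed so the run is repeatable
clean = [0.5 + 0.5 * math.sin(0.01 * i) for i in range(20000)]
noisy = add_noise_at_snr(clean, snr_db=40.0, rng=rng)
print(f"measured SNR: {measured_snr_db(clean, noisy):.1f} dB")
print(quantize([0.1, 0.4, 0.9], bits=4))
```

Sweeping `snr_db` and `bits` in a model like this is one way to reproduce the accuracy-versus-noise experiments in software before committing to circuit parameters.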

Admitting noise saves energy! (but our current process limits us to 40 dB)
[Figure, two panels: GoogLeNet Top-5 accuracy (%) versus computational noise SNR (dB); log10 of processing energy consumption (J) versus computational noise SNR (dB), both over 10 to 50 dB]

RedEye reduces readout energy by >100x
[Figure: readout energy (mJ, log axis) for an image sensor (IS) and for GoogLeNet on RedEye at depths 1 to 5]

RedEye reduces readout energy by >100x at expense of processing energy
[Figure: readout and processing energy (mJ, log axis) for an image sensor (IS) and for RedEye at depths 1 to 5]

RedEye can help state-of-the-art ConvNet processing efficiency by 2x
Eyeriss [ISCA 16, ISSCC 16], Chen et al.
Eyeriss + Image Sensor:
Eyeriss (Conv Layers): 5.9 mJ
Image Sensor: 1.0 mJ
Eyeriss (Full Layers): 2.1 mJ
Total: 9.0 mJ
Eyeriss + RedEye:
RedEye (Analog Conv): 2.5 mJ
Readout: 0.001 mJ
Eyeriss (Full Layers): 2.1 mJ
Total: 4.6 mJ
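
The 2x figure can be checked directly from the per-component energies on this slide. A small sketch of that arithmetic (the dictionary keys are descriptive labels chosen here, and the values are the slide's numbers in mJ):

```python
# Energy per frame in mJ, as listed on the slide.
eyeriss_with_image_sensor = {
    "Eyeriss (conv layers)": 5.9,
    "Image sensor": 1.0,
    "Eyeriss (full layers)": 2.1,
}
eyeriss_with_redeye = {
    "RedEye (analog conv)": 2.5,
    "Readout": 0.001,
    "Eyeriss (full layers)": 2.1,
}

total_baseline = sum(eyeriss_with_image_sensor.values())
total_redeye = sum(eyeriss_with_redeye.values())
improvement = total_baseline / total_redeye

print(f"baseline: {total_baseline:.1f} mJ")
print(f"with RedEye: {total_redeye:.1f} mJ")
print(f"improvement: {improvement:.2f}x")
```

The totals match the slide (9.0 mJ and 4.6 mJ after rounding), and the ratio comes out slightly under 2x.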

RedEye limitations (and opportunities!)
RedEye is bounded to 40 dB SNR (limits energy savings)
Unit capacitance of process technology
ConvNet not optimized for RedEye architecture
RedEye is strictly feed-forward (no recurrence, e.g., LSTM nets)

Realizing RedEye chip
Silicon validation in 65 nm TSMC
Non-idealities: noise, non-linearity, offset, process variation
Opportunities: voltage scaling, sub-threshold circuits

Raw image privacy through noisy degradation?
[Diagram labels: App, ADC, ConvNet Features, Vision Info, Image Reconstruction]
Idea: App can have vision info, not image data.
Degrade image and features (e.g., insert noise)
Ensure vision usability, but image privacy
[Figure: reversed images at Depth 1 through Depth 5]
Understanding Deep Representations by Inverting Them, Mahendran et al.

Related Work
Hardware ConvNet acceleration

Reconfigurable flexibility
NeuFlow: Dataflow vision processing system-on-a-chip (Pham et al., MSCS 2012)
Origami: A convolutional network accelerator (Cavigelli et al., GLSVLSI 2012)
A dynamically configurable coprocessor for convolutional neural networks (Chakradhar et al., SIGARCH News 2010)

Data movement reduction
Convolution engine: balancing efficiency & flexibility in specialized computing (Qadeer et al., SIGARCH News 2013)
Memory-centric accelerator design for convolutional neural networks (Peemen et al., ICCD 2013)
DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning (Chen et al., ASPLOS 2014)
PRIME: A Novel Processing-in-memory Architecture for NN Computation in ReRAM-based Main Memory (Chi et al., ISCA 2016)
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars (Shafiee et al., ISCA 2016)
EIE: Efficient Inference Engine on Compressed Deep Neural Network (Han et al., ISCA 2016)
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks (Chen et al., ISCA 2016)

Limited-precision ConvNets
General-purpose code acceleration with limited-precision analog computation (St. Amant et al., ISCA 2014)
Continuous real-world inputs can open up alternative accelerator designs (Belhadj et al., SIGARCH News 2013)
Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators (Reagen et al., ISCA 2016)

RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision
Robert LiKamWa, Yunhui Hou, Yuan Gao, Mia Polansky, Lin Zhong
likamwa@asu.edu, houyh@rice.edu, julianyg@stanford.edu, mia.polansky@rice.edu, lzhong@rice.edu
Programmable analog ConvNet execution
Modules for design scalability
Tunable noise for accuracy and efficiency
Programmability for flexibility
Open-source simulation framework: https://github.com/julianyg/redeye_sim