Power Efficient Architectures to Accelerate Deep Convolutional Neural Networks for Edge Computing and IoT
Giuseppe Desoli, ST Central Labs, STMicroelectronics
Artificial Intelligence is Everywhere

Analysis:
- Where am I? Scene classification (audio, video, environmental sensors)
- Which objects are in the scene, and where are they? Video object detection/classification
- What am I doing? Activity recognition (audio, video, inertial sensors)
- What's happening? Event recognition (audio, video, inertial sensors, environmental sensors)

User interaction:
- Command detection (audio)
- Speech recognition (audio)
- Gesture recognition (inertial sensors, video)
- User identification and mood detection (audio, video)

Continuous learning:
- How can I detect unpredictable, unclassified events in dynamic environments? Recurrent networks (audio, video, inertial sensors, environmental sensors)

And many more...
Distributed Intelligence: Why?

Artificial neural networks place intelligence where it is required, for scalability, responsiveness, and service enablement.

Centralized approach (all raw data flows upward; big data, collected-data analytics, global optimization):
- Center: 10000 sensors, 100 Mb-1 Tb/sec, 10 TOPS-10000 TOPS
- Concentrator (real time): 100 sensors, 1 Mb-10 Gb/sec, 100 MOPS-100 TOPS
- Unit: 1 sensor, 10 Kb-100 Mb/sec, 10 KOPS-100 MOPS

Distributed approach (smart units process locally, sending only results):
- Center: 10000 sensors, ~10 Mb/sec, ~10 GOPS
- Concentrator: 100 sensors, ~100 Kb/sec, ~100 MOPS
- Intelligent sensor: 100 MOPS-1 TOPS
Why Artificial Neural Networks?

- The power and usefulness of ANNs have been demonstrated in many applications
- A specific kind of neural network, the Deep Convolutional Neural Network (DCNN), has proven very effective, achieving human-like performance in selected cases
- In the last decade, breakthroughs made neural networks practical: better training algorithms, Moore's law, and big-data availability
- For IoT, the current challenge is to achieve low power and adequate cost with sufficient performance for edge computing
A Typical DCNN Structure

- The artificial neuron is a processing unit with a close connection to neurobiology
- DCNNs are composed of multiple layers of neurons
- Each layer performs feature extraction with learned filters, reduction of input resolution, and non-linear operations
- Multiple layers compress each image into denser information
- Depth indicates the number of layers of the specific DCNN
- Up to millions of parameters per layer can be involved
- Parameters are learned by supervised or unsupervised training algorithms processing large training sets
(Images source: Stanford)
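To make the layer operation concrete, here is a minimal C sketch of the core computation behind one convolutional layer: a multiply-accumulate over a KxK window across all input channels, followed by a ReLU non-linearity. Array layouts and names are illustrative assumptions, not taken from any specific framework.

    /* Compute one output activation of a convolutional layer.
     * in:   input feature maps, laid out as in[ci][y][x]       (Cin x H x W)
     * w:    learned filters,    laid out as w[co][ci][ky][kx]  (Cout x Cin x K x K)
     * bias: one bias per output channel. Stride 1, no padding. */
    static float conv_relu_pixel(const float *in, const float *w, const float *bias,
                                 int Cin, int H, int W, int K,
                                 int co, int oy, int ox)
    {
        float acc = bias[co];
        for (int ci = 0; ci < Cin; ci++)              /* learned filter taps */
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    acc += in[(ci * H + oy + ky) * W + (ox + kx)] *
                           w[((co * Cin + ci) * K + ky) * K + kx];
        return acc > 0.0f ? acc : 0.0f;               /* ReLU non-linearity */
    }

Pooling (resolution reduction) would then subsample the resulting map; repeating these stages is what compresses the image into progressively denser features.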
Deep Learning on Images

- Image classification
- Object localization
- Object detection
- Image segmentation
- Action recognition
- Image generation
(Images source: ImageNet)
Beyond Recognition: Semantic Captioning

Example machine-generated captions:
- "a giraffe has its head up to a small tree."
- "a giraffe in a pen standing under a tree."
- "giraffe standing next to a wooden treelike structure."
- "a tall giraffe standing next to a tree"
- "a giraffe in an enclosure standing next to a tree."
(Courtesy of COCO: Common Objects in Context, Microsoft)
Deep Learning for Speech

- Speech recognition
- Natural language processing
- Speech translation
- Audio generation
(Image: https://www.flickr.com/photos/tevk/5429390495/)
Deep Learning for Autonomous Driving

- Simultaneous detection and identification of objects (cars, pedestrians, signals)
- Semantic segmentation
- Multiple sensory inputs (visual, radar, lidar, proximity, etc.)
- End-to-end processing and actuation
Convolutional NN Complexity Evolution

Network (year)       Layers   Operations (GOPS)   Parameters (millions)
ANNs (1997-2007)     3        0.0002              0.01
AlexNet (2012)       8        1.0                 60
GoogLeNet (2014)     22       1.5                 ~7
VGG19 (2014)         19       19.6                138
ResNet (2015)        152      11.3                ~60
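As a sanity check on these totals, the operation count of a single convolutional layer follows directly from its dimensions:

    MACs = H_out x W_out x C_out x C_in x K^2

For example, VGG's conv1_2 layer (224x224 output, 64 input and 64 output channels, 3x3 kernels) alone costs 224 x 224 x 64 x 64 x 9 ≈ 1.85 GMACs ≈ 3.7 GOPS, which is a large share of why the VGG19 total reaches 19.6 GOPS.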
HW for Deep Learning: A Few Examples

- Intel Xeon Phi 7285: 1.4 GHz, 68 cores, TDP 250 W, 14 nm, perf > 3.4 TOPS (TBA); price $2036 (Sept. 2017); deep-learning instructions (AVX512-4VNNIW, AVX512-4FMAPS)
- NVIDIA Xavier: 8-core CPU, 512-core Volta GPU, 30 TOPS, TDP 30 W, 16 nm
- Mobileye (now Intel) EyeQ5: 7 nm, 12 TOPS peak, 2.4 DL TOPS
- Movidius (now Intel) Myriad X: 16 128-bit VLIW vector processors, Neural Compute Engine, 4 TOPS, TDP 2 W, 16 nm
Exploiting Parallelism (We Need Special HW)

Two broad classes of architectures can be identified, each with pros and cons:
- Temporal architectures (SIMD/SIMT)
- Spatial (data-flow) architectures

Specialized HW is needed to achieve power consumption and cost compatible with IoT applications. Memory access is the key aspect: energy per word access grows roughly as 1x for local SRAM, 10x for on-chip SRAM, and 100x for external LPDDR. (Courtesy of MIT Eyeriss project)
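The 1x/10x/100x gap is exactly why accelerators tile the computation so that each word fetched from external memory is reused many times out of a local buffer. A minimal C sketch of the idea, assuming a single input channel and illustrative buffer sizes:

    #define TILE 32                   /* tile edge, sized to fit local SRAM */
    #define K    3                    /* kernel size                        */

    /* Process one input tile: fetch it from external memory once, then
     * reuse it for every output channel, turning each costly external
     * access into many cheap local-buffer accesses.                      */
    void conv_tile(const float *ext_in, const float *w, float *ext_out,
                   int H, int W, int Cout, int ty, int tx)
    {
        static float buf[TILE + K - 1][TILE + K - 1];   /* local SRAM buffer */

        for (int y = 0; y < TILE + K - 1; y++)          /* one external fetch */
            for (int x = 0; x < TILE + K - 1; x++)
                buf[y][x] = ext_in[(ty + y) * W + (tx + x)];

        for (int co = 0; co < Cout; co++)               /* heavy reuse phase */
            for (int y = 0; y < TILE; y++)
                for (int x = 0; x < TILE; x++) {
                    float acc = 0.0f;
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += buf[y + ky][x + kx] * w[(co * K + ky) * K + kx];
                    ext_out[(co * H + ty + y) * W + (tx + x)] = acc;
                }
    }

Each input word fetched once from LPDDR is read roughly Cout x K x K times from the local buffer, amortizing the 100x external-access energy.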
An Ultra-Low-Power Example: Orlando SoC

[Block diagram] Main blocks:
- 8x dual DSP clusters (16 cores), each with instruction caches, local data/instruction memories, and shared memory
- HW co-processors: color conversion, crop, scale, integral image, and convolution accelerators (CAs); each CA pairs a feature strip buffer and kernel registers with columns of MACs and adders (see the sketch after this list)
- Image stream processors and image sensor IF/ISP inputs; video out (DVI) interface
- Stream switch (8 ports) with stream engines and bus access arbiter/IF controller
- 4 x 1 MB global RAM; STBUS T3 (64-bit) interconnect with T3-AXI bridges
- Host subsystem (e.g. ARM, peripherals, memory, IFs), interrupt controller, shared mailbox/timer, debug controller, JTAG
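The "feature strip buffer / kernel registers / MAC columns" arrangement in the diagram suggests a classic line-buffer convolution datapath. Below is a behavioral C sketch of that generic technique, in which K-1 line buffers plus a KxK window register file let one new pixel per cycle feed KxK parallel MACs; the sizes and structure are illustrative assumptions, since the actual CA micro-architecture is not publicly documented.

    #define K 3                           /* kernel size              */
    #define W 640                         /* line width               */

    static float lines[K - 1][W];         /* feature strip buffer     */
    static float win[K][K];               /* sliding window registers */
    static float kern[K][K];              /* kernel registers         */

    /* Accept one input pixel per call (column x of the current line)
     * and return one convolved output; in HW the K*K multiply-adds
     * below would be a parallel MAC column, one result per cycle.    */
    float ca_push_pixel(float px, int x)
    {
        for (int ky = 0; ky < K; ky++)            /* shift window left */
            for (int kx = 0; kx < K - 1; kx++)
                win[ky][kx] = win[ky][kx + 1];

        for (int ky = 0; ky < K - 1; ky++)        /* new right column  */
            win[ky][K - 1] = lines[ky][x];
        win[K - 1][K - 1] = px;

        for (int ky = 0; ky < K - 2; ky++)        /* rotate line bufs  */
            lines[ky][x] = lines[ky + 1][x];
        lines[K - 2][x] = px;

        float acc = 0.0f;                         /* K*K parallel MACs */
        for (int ky = 0; ky < K; ky++)
            for (int kx = 0; kx < K; kx++)
                acc += win[ky][kx] * kern[ky][kx];
        return acc;
    }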
Reconfigurable Accelerator Framework

[Block diagram] A stream switch connects the image sensor IF/ISP inputs, color convert, cropper, H264 and MJPEG codecs, display out (DVI), the convolution accelerators CA0..CA7, and the bus arbiter/system bus interface; streams carry compressed images, RGB images, batches, feature maps, and kernels.

Virtual stream links:
- Ferry data to/from accelerators, interfaces, and engines
- A flow-control mechanism is provided
- Streams can be multicast to multiple destinations
- More flexible than hard-wired data paths
- More power efficient than a bus
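As a rough illustration of how such a run-time reconfigurable switch could be programmed, the sketch below multicasts one source stream to two destinations. All register names, offsets, and port/source IDs here are hypothetical, invented for illustration; the real Orlando programming model is not public.

    #include <stdint.h>

    /* Hypothetical stream-switch registers: one select register per
     * output port choosing its input source. Multicast falls out for
     * free: several output ports simply select the same source.      */
    #define SWITCH_BASE  0x40010000u                    /* invented address */
    #define OUT_PORT(n)  (*(volatile uint32_t *)(SWITCH_BASE + 4u * (n)))

    enum { SRC_CAMERA_IF = 1, SRC_COLOR_CONV = 2 };     /* invented IDs */
    enum { PORT_CA0 = 0, PORT_MJPEG = 1 };

    static void route_camera_multicast(void)
    {
        /* Fan the camera stream out to both the convolution
         * accelerator CA0 and the MJPEG encoder.                     */
        OUT_PORT(PORT_CA0)   = SRC_CAMERA_IF;
        OUT_PORT(PORT_MJPEG) = SRC_CAMERA_IF;
    }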
Prototype Chip: FD-SOI 28 nm

[Die photo: 5598.2 um x 6239.2 um; OTP, PLL, high-speed camera IF, chip-to-chip links, Cortex-M4, HW accelerator subsystem, (DSP) cores and local memories, global memory subsystem]

Technology                   FD-SOI 28 nm
Package                      FBGA 15x15x1.83
Frequency                    200 MHz - 1.175 GHz
Supply voltages              0.575 - 1.1 V digital, 1.8 V I/O
Power (**)                   41 mW @ 42 FPS (@ 200 MHz, 0.575 V, 8 CAs)
On-chip RAM                  4x1 MB + 8x192 KB + 128 KB
Host                         ARM Cortex-M4
No. of DSPs                  16
Peak DSP perf                75 GOPS (2x 16-bit MACs (*))
No. of CAs                   8
CA perf (1.175 GHz, 1.1 V)   676 GOPS peak (*)

(*) 1 MAC defined as 2 OPS (ADD + MUL)
(**) HW accelerator average power for AlexNet
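The headline figures cross-check against the table's own footnote (1 MAC = 2 OPS):

    DSP peak: 16 DSPs x 2 MACs/cycle x 2 OPS/MAC x 1.175 GHz ≈ 75.2 GOPS
    CA peak:  676 GOPS / (2 OPS/MAC x 1.175 GHz) ≈ 288 MACs/cycle,
              i.e. about 36 MACs per cycle per CA across the 8 CAs

The 2 MACs/cycle per DSP follows from the stated dual 16-bit MAC datapath; the per-CA MAC count is inferred from the peak number, not from published micro-architecture details.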
Ultra-Wide DVFS Range

- LVT design with heterogeneous poly-bias levels to trade performance vs. leakage
- GALS and low-insertion-delay clock networks to minimize on-chip variation margins
- Single-supply memories with fine-grained power switches and sleep mode
- DVFS energy-efficiency improvements via body bias

Vdd (V)       0.575   0.6    0.7    0.825   1.0    1.1
Freq (MHz)    200     266    450    650     950    1175
GOPS/W        2930    2691   1977   1423    969    801
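The trend in the table follows the usual CMOS power model:

    P ≈ a·C·V^2·f + V·I_leak(V, V_bb)
    GOPS/W ∝ f / P ≈ 1 / (a·C·V^2)    (when dynamic power dominates)

So efficiency improves roughly as 1/V^2 at lower supply: (1.1/0.575)^2 ≈ 3.7, matching the 2930/801 ≈ 3.7 ratio between the two endpoints. Body bias (native to FD-SOI) then shifts this trade-off at run time: forward bias recovers speed at low Vdd, while reverse bias cuts I_leak when performance is not needed.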
Application Example: AlexNet

Pipeline (from the block diagram): input image -> sensor I/F -> RGB->YUV -> crop (CA) -> 227x227 input feature map -> 2 chained CAs (with kernel and JPEG memories) plus DSPs -> output feature map -> host + SPI to PC

- 37.5 mW @ 200 MHz, 0.6 V
- 10 FPS (38 ms on DSPs, 62 ms on CAs)
- Dynamic power: 10 mW CAs + 17 mW system
- Static power: 0.6 mW CAs + 9.9 mW system
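These figures are self-consistent:

    Latency: 38 ms (DSPs) + 62 ms (CAs) = 100 ms/frame -> 10 FPS
    Power:   (10 + 17) mW dynamic + (0.6 + 9.9) mW static = 37.5 mW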
Orlando CNN Inference Engine Performance

[Chart: VGG16 FPS vs. power (mW) across the Vdd/frequency range (0.575 V/200 MHz to 1.1 V/1175 MHz) for 1, 2, 4, 8, and 16 CAs; CAs with 8bpp MACs, 16 kernels in parallel; up to ~140 FPS at the top operating point]

Compared (unfairly) to NVIDIA Tegra X1 @ 1200 MHz, FP32: 83 FPS, cost $500-1000, TDP > 200 W.
Orlando at Work

- Left: Orlando running the Pico Yolo CNN for object detection and classification
- Top: Orlando running a CNN trained to drive a simulated car
- Bottom: Orlando identifying faces and classifying expressions