Power Efficient Architectures to Accelerate Deep Convolutional Neural Networks for edge computing and IoT


Giuseppe Desoli, ST Central Labs, STMicroelectronics

Slide 2: Artificial Intelligence is Everywhere
Analysis, i.e.:
Where am I? Scene classification (audio, video, environmental sensors)
Which objects are in the scene, and where are they? Video object detection/classification
What am I doing? Activity recognition (audio, video, inertial sensors)
What's happening? Event recognition (audio, video, inertial sensors, environmental sensors)
User interaction:
Command detection (audio)
Speech recognition (audio)
Gesture recognition (inertial sensors, video)
User identification and mood detection (audio, video)
Continuous learning, i.e.:
How can I detect unpredictable, unclassified events in dynamic environments? Recurrent networks (audio, video, inertial sensors, environmental sensors)
And many more.

Slide 3: Distributed Intelligence: why?
Artificial neural networks place intelligence where it is required.
[Figure: centralized versus distributed architectures across three tiers (service, concentrator, unit), each unit with communication and sensors. Labels: scalability, big data, service enablement, global optimization, responsiveness, smart units, collected data analytics, local analytics, real time.]
Tier (sensors): centralized / distributed
Service (10000 sensors): 100 Mb/s to 1 Tb/s, 10 TOPS to 10000 TOPS / ~10 Mb/s, ~10 GOPS
Concentrator (100 sensors): 1 Mb/s to 10 Gb/s, 100 MOPS to 100 TOPS / ~100 Kb/s, ~100 MOPS
Unit (1 sensor): 10 Kb/s to 100 Mb/s, 10 KOPS to 100 MOPS / 1 intelligent sensor, 100 MOPS to 1 TOPS

Slide 4: Why Artificial Neural Networks?
The power and usefulness of ANNs have been demonstrated in several applications.
A specific kind of neural network, the Deep Convolutional Neural Network (DCNN), has proven very effective, achieving human-like performance in selected cases.
In the last decade, breakthroughs made neural networks practical: better training algorithms, Moore's law and big data availability.
For IoT, the current challenge is to achieve low power and adequate cost with sufficient performance for edge computing.

Slide 5: A Typical DCNN Structure
The artificial neuron is a processing unit with a close connection to neurobiology.
DCNNs are composed of multiple layers of neurons.
Each layer performs feature extraction with learned filters, reduction of input resolution and non-linear operations.
Multiple layers compress each image into denser information.
Depth indicates the number of layers of the specific DCNN.
Each layer of the network can involve up to millions of parameters.
Parameters are determined by supervised or unsupervised training algorithms that process large training sets.
Image source: Stanford
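The three per-layer operations named on this slide (filtering, a non-linear operation, resolution reduction) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the single-channel input, the ReLU non-linearity and the 2x2 max pooling are assumptions chosen for the example, not details taken from the presentation.

```python
def conv2d_valid(image, kernel):
    """Slide a k x k kernel over a 2-D image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Non-linear operation: clamp negative responses to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Resolution reduction: keep the maximum of each 2 x 2 block."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


def dcnn_layer(image, kernel):
    """One layer: feature extraction, non-linearity, then pooling."""
    return max_pool_2x2(relu(conv2d_valid(image, kernel)))


# Hypothetical 5 x 5 input with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written kernel that responds to a left-to-right intensity increase.
kernel = [[-1, 1], [-1, 1]]
features = dcnn_layer(image, kernel)
```

In a trained network the kernel values would be learned rather than written by hand, and each layer would apply many kernels across many input channels.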

Slide 6: Deep Learning on images
Image classification
Object localization
Object detection
Image segmentation
Action recognition
Image generation
Image source: ImageNet

Slide 7: Beyond recognition: semantic captioning
Example captions for one image:
a giraffe has it's head up to a small tree.
a giraffe in a pen standing under a tree.
giraffe standing next to a wooden treelike structure.
a tall giraffe standing next to a tree
a giraffe in an enclosure standing next to a tree.
Courtesy of COCO: Common Objects in Context, Microsoft

Slide 8: Deep Learning for Speech
Speech recognition
Natural language processing
Speech translation
Audio generation
Image source: https://www.flickr.com/photos/tevk/5429390495/

Slide 9: Deep Learning for autonomous driving
Simultaneous detection and identification of objects (cars, pedestrians, signals)
Semantic segmentation
Multiple sensory inputs (visual, radar, lidar, proximity, etc.)
End-to-end processing and actuation

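Slide 10: Convolutional NN Complexity Evolution
[Chart: operations (GOPS) and parameters (millions) for ANNs (1997-2007, 3 layers), AlexNet (2012, 8 layers), GoogLeNet (2014, 22 layers), VGG19 (2014, 19 layers) and ResNet (2015, 152 layers). Values appearing on the chart, not paired with networks in this transcript: 138, 19.6, 150, 60, 50, 0.0002, 1.0, 1.5, 0.01, 11.3.]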
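To make the two chart axes concrete, the sketch below counts parameters and operations for a single convolutional layer. The helper name and its arguments are illustrative; the worked layer uses the commonly cited shape of AlexNet's first convolution (96 kernels of 11x11x3 producing a 55x55 output map), which is not stated on the slide. One MAC is counted as two operations, matching the convention the presentation uses later for the prototype chip.

```python
def conv_layer_cost(in_ch, out_ch, kernel, out_h, out_w, bias=True):
    """Return (parameters, MACs, operations) for one convolutional layer.

    Each output value needs in_ch * kernel * kernel multiply-accumulates,
    and there are out_ch * out_h * out_w output values.
    """
    weights = out_ch * in_ch * kernel * kernel
    params = weights + (out_ch if bias else 0)
    macs = weights * out_h * out_w
    ops = 2 * macs  # 1 MAC = 1 multiply + 1 add
    return params, macs, ops


# Commonly cited shape of AlexNet's first layer (an assumption here).
params, macs, ops = conv_layer_cost(in_ch=3, out_ch=96, kernel=11,
                                    out_h=55, out_w=55)
print(f"parameters: {params:,}")
print(f"MACs:       {macs:,}")
print(f"operations: {ops / 1e9:.3f} G")
```

Summing this over every layer of a network gives the kind of totals plotted on the chart; fully connected layers add parameters without the spatial reuse factor.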

Slide 11: HW for Deep Learning: a few examples
Intel Xeon Phi 7285: 1.4 GHz, 68 cores, TDP 250 W, 14 nm, performance > 3.4 TOPS (TBA), price $2036 (Sept 2017), instructions for deep learning (AVX512-4VNNIW, AVX512-4FMAPS)
NVIDIA Xavier: 8-core CPU, 512-core Volta GPU, 30 TOPS, TDP 30 W, 16 nm
Mobileye (now Intel) EyeQ5: 7 nm, 12 TOPS peak, 2.4 DL TOPS
Movidius (now Intel) Myriad X: 16 vector 128-bit VLIW, Neural Compute Engine, 4 TOPS, TDP 2 W, 16 nm

Slide 12: Exploiting parallelism (we need special HW)
Two broad classes of architectures can be identified: temporal architectures (SIMD/SIMT) and spatial (data-flow) architectures (SIMD).
Both have pros and cons.
Specialized HW is needed to achieve power consumption and cost compatible with IoT applications.
Memory access is the key aspect.
Energy/power per word access: local SRAM 1x, on-chip SRAM 10x, LPDDR 100x.
Courtesy of MIT Eyeriss project
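The 1x / 10x / 100x ratios above explain why data reuse dominates accelerator design. The sketch below weights word accesses by those ratios for two made-up access patterns; the function name, the dictionary keys and both access mixes are hypothetical and only illustrate the arithmetic.

```python
# Relative energy per word access, as quoted on the slide.
RELATIVE_ENERGY = {"local_sram": 1, "on_chip_sram": 10, "lpddr": 100}


def relative_access_energy(access_counts):
    """Sum word accesses weighted by the relative cost of each memory level."""
    return sum(RELATIVE_ENERGY[level] * count
               for level, count in access_counts.items())


# Hypothetical schedule A: every one of 1000 operand fetches goes to LPDDR.
no_reuse = relative_access_energy({"lpddr": 1000})

# Hypothetical schedule B: same 1000 fetches, but most hit closer memories.
with_reuse = relative_access_energy(
    {"lpddr": 10, "on_chip_sram": 90, "local_sram": 900}
)

print(no_reuse, with_reuse, round(no_reuse / with_reuse, 1))
```

Under these assumed counts the reuse-friendly schedule spends roughly one thirty-fifth of the memory energy for the same number of fetches.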

Slide 13: An ultra low power example: Orlando SoC
[Block diagram. Recoverable blocks: 8x dual cluster (16 cores), each core with DM, IM and I$ and a shared SM per cluster; HW co-processors (color conversion, crop, scale, integral, convolution accelerators) attached to a stream switch with stream engines 0 to n; two image stream processors; video out (DVI) interface; bus access arbiter and IF controller; STBUS T3 full (64 bits), 8 ports; STBUS slave interface (type 1); 4 x 1 MB global RAM; T3-AXI and AXI-T3 bridges to the host subsystem (e.g. ARM, peripherals, memory, interfaces); interrupt controller; T3-APB; shared mailbox/timer; debug controller; JTAG.]
[Convolution accelerator detail. Labels: feature strip buffer, n - 1 line buffers, kernel registers (m * n), n (m/h) x col MACs, adder.]
Slide footer: Presentation Title, 3/2/2018

Slide 14: Reconfigurable Accelerator Framework
[Block diagram: a stream switch connecting image sensor IF & ISP blocks, color convert, cropper, H264, MJPEG, control registers, stream engines E0 to E15, convolution accelerators CA 0 to CA 7, the bus arbiter & system bus interface, and the display out (DVI) interface. Stream labels: comp. image, RGB image, batch, batch - 1, feature, kernel.]
Virtual stream links:
Ferry data to/from accelerators, interfaces and engines
A flow control mechanism is provided
Streams can be multicast to multiple destinations
More flexible than hardware data paths
More power efficient than a bus
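A toy software model can illustrate what "virtual stream links" with multicast mean in practice: routes are configured at run time, and one source stream can feed several consumers. Everything in this sketch is hypothetical (the class, its methods and the port names such as "sensor_if" and "ca0"); it does not model flow control and is not a description of the actual hardware interface.

```python
from collections import defaultdict


class StreamSwitch:
    """Toy model: forwards each word from a source port to its configured sinks."""

    def __init__(self):
        self.routes = defaultdict(list)  # source port -> list of sink ports
        self.inbox = defaultdict(list)   # sink port -> words received so far

    def connect(self, source, sink):
        """Configure a virtual link; repeated calls on one source multicast."""
        if sink not in self.routes[source]:
            self.routes[source].append(sink)

    def send(self, source, word):
        """Deliver one word to every sink linked to source; return fan-out."""
        sinks = self.routes.get(source, [])
        for sink in sinks:
            self.inbox[sink].append(word)
        return len(sinks)


switch = StreamSwitch()
# Hypothetical configuration: one sensor stream multicast to two accelerators.
switch.connect("sensor_if", "ca0")
switch.connect("sensor_if", "ca1")
fan_out = switch.send("sensor_if", 0x7F)
```

Reconfiguring the routes changes the processing chain without changing the blocks themselves, which is the flexibility the slide contrasts with fixed hardware data paths.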

Slide 15: Prototype Chip, FD-SOI 28nm
[Die photo, 5598.2 um x 6239.2 um. Floorplan labels: HW accelerator subsystem, OTP, high speed camera IF, PLL, chip to chip, M4, (DSP) cores and local mems, global memory subsystem.]
Technology: FD-SOI 28nm
Package: FBGA 15x15x1.83
Frequency: 200 MHz to 1.175 GHz
Supply voltages: 0.575 to 1.1 V digital, 1.8 V I/O
Power (**) @ 200 MHz, 0.575 V, 8 CAs: 41 mW @ 42 FPS
On-chip RAM: 4x1MB, 8x192KB, 128KB
Host: ARM Cortex-M4
No. of DSPs: 16
Peak DSP perf: 75 GOPS (2x16b MAC (*))
No. of CAs: 8
CAs perf (1.175 GHz, 1.1 V): 676 GOPS (*) peak
(*) 1 MAC defined as 2 OPS (ADD + MUL)
(**) HW Acc avg power for AlexNet
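The peak figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes that "2x16bMAC" means two 16-bit MACs per DSP per cycle and that the DSP peak is quoted at the 1.175 GHz clock; neither is stated explicitly. The per-CA MAC count it derives is an inference from the quoted 676 GOPS, not a number from the presentation.

```python
OPS_PER_MAC = 2  # slide footnote: 1 MAC = ADD + MUL


def peak_gops(units, macs_per_cycle, freq_ghz):
    """Peak throughput in GOPS for identical units running in parallel."""
    return units * macs_per_cycle * OPS_PER_MAC * freq_ghz


# Assumption: each of the 16 DSPs issues two 16-bit MACs per cycle.
dsp_peak = peak_gops(units=16, macs_per_cycle=2, freq_ghz=1.175)

# Inverse problem: MACs per cycle per CA implied by 676 GOPS across 8 CAs.
implied_ca_macs = 676 / (8 * OPS_PER_MAC * 1.175)

print(f"DSP peak: {dsp_peak:.1f} GOPS (slide quotes 75)")
print(f"implied MACs per CA per cycle: {implied_ca_macs:.1f}")
```

Under these assumptions the DSP figure lands within rounding of the quoted 75 GOPS, and the CA figure corresponds to roughly 36 MACs per accelerator per cycle.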

Slide 16: Ultra-Wide DVFS Range
LVT design with heterogeneous poly-bias levels, trading performance against leakage.
GALS and low insertion delay clock networks to minimize on-chip variation margins.
Mono-supply memories with fine-grained power switches and sleep mode.
DVFS energy efficiency improvements via body bias.
[Chart: wide DVFS range, frequency and GOPS/W versus supply voltage]
Supply (V)   Frequency (MHz)   GOPS/W
0.575        200               2930
0.6          266               2691
0.7          450               1977
0.825        650               1423
1            950               969
1.1          1175              801
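The efficiency figures can be turned into an implied power draw at each operating point. This sketch rests on two assumptions that the slide does not state: the GOPS/W values are paired with the voltage/frequency points in descending order (highest efficiency at the lowest voltage), and peak throughput scales linearly with clock frequency from the 676 GOPS quoted for 1.175 GHz on the prototype chip slide. The names are illustrative.

```python
# (supply V, frequency MHz, GOPS/W); the GOPS/W pairing is an assumption.
OPERATING_POINTS = [
    (0.575, 200, 2930),
    (0.6, 266, 2691),
    (0.7, 450, 1977),
    (0.825, 650, 1423),
    (1.0, 950, 969),
    (1.1, 1175, 801),
]

PEAK_GOPS_AT_MAX = 676.0  # quoted CA peak at 1.175 GHz, 1.1 V
MAX_FREQ_MHZ = 1175.0


def implied_power_mw(freq_mhz, gops_per_watt):
    """Power implied by peak throughput (assumed linear in frequency)."""
    gops = PEAK_GOPS_AT_MAX * freq_mhz / MAX_FREQ_MHZ
    return gops / gops_per_watt * 1000.0


table = [(v, f, round(implied_power_mw(f, e), 1))
         for v, f, e in OPERATING_POINTS]
for volts, mhz, milliwatts in table:
    print(f"{volts:>5} V  {mhz:>5} MHz  {milliwatts:>7} mW")
```

At the lowest operating point the implied power comes out near 40 mW, the same order as the 41 mW quoted for AlexNet at 200 MHz and 0.575 V, which suggests the assumed pairing is at least plausible.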

Slide 17: Application Example: AlexNet
[Data-flow diagram. Labels: input image, sensor I/F, RGB->YUV, JPEG, crop 227x227, kernel memory, memory holding IN FMAP and OUT FMAP, two CAs, DSPs, host + SPI, link to PC.]
37.5 mW @ 200 MHz, 0.6 V
10 FPS (38 ms DSPs, 62 ms CAs), 2 chained CAs
Dynamic: 10 mW CAs + 17 mW system
Static: 0.6 mW CAs + 9.9 mW system
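The numbers on this slide are mutually consistent, as a short calculation shows. The sketch assumes the DSP and CA phases of each frame run one after the other, which the 38 ms + 62 ms = 100 ms total and the 10 FPS figure suggest; the energy-per-frame value is derived here and does not appear on the slide.

```python
# Power breakdown quoted on the slide, in milliwatts.
dynamic_mw = {"CAs": 10.0, "system": 17.0}
static_mw = {"CAs": 0.6, "system": 9.9}

total_mw = sum(dynamic_mw.values()) + sum(static_mw.values())

# Assumption: DSP and CA phases of a frame execute sequentially.
frame_time_ms = 38 + 62
fps = 1000 / frame_time_ms

# Energy per frame: mW * ms = microjoules, so divide by 1000 for millijoules.
energy_per_frame_mj = total_mw * frame_time_ms / 1000

print(f"total power:      {total_mw:.1f} mW")
print(f"frame rate:       {fps:.0f} FPS")
print(f"energy per frame: {energy_per_frame_mj:.2f} mJ")
```

The four components add up to the quoted 37.5 mW, and at 100 ms per frame that corresponds to about 3.75 mJ per AlexNet inference.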

Slide 18: Orlando CNN inference engine performance
[Chart: VGG16 performance (FPS) and power (mW) scaling across the Vdd range, for CAs with 8bpp MACs and 16 kernels in parallel. Series: 1, 2, 4, 8 and 16 CAs. Operating points (Vdd in V / frequency in MHz): 0.575/200, 0.6/266, 0.7/450, 0.825/650, 1/950, 1.1/1175. Individual data values are not recoverable from this transcript.]
Compared (unfairly) to NVIDIA Tegra X1 @ 1200 MHz, FP32: 83 FPS, cost: 500-1000$, TDP: > 200 W

Slide 19: Orlando at work
Left: Orlando running Pico Yolo CNN for object detection and classification
Top: Orlando running a CNN trained to drive a simulated car
Bottom: Orlando identifying faces and classifying expressions