EIE: Efficient Inference Engine on Compressed Deep Neural Network

1 EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han*, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally Stanford University June 20, 2016

2 Deep Learning on Mobile
Phones, Drones, Robots, Glasses, Self Driving Cars
Battery Constrained!

3 Deep Learning on Mobile: Difficulty? Model Size!
Accurate Prediction => Large Model => More Memory Reference => High Power
[Table and bar chart: Operation, Energy [pJ], Relative Cost, Relative Energy Cost. Rows: 32 bit int ADD; 32 bit float ADD; 32 bit Register File (1 pJ, relative cost 10); 32 bit int MULT; 32 bit float MULT; 32 bit SRAM Cache; 32 bit DRAM Memory. The other values are not legible in the source.]

4 Our Past Work: Deep Compression
Problem 1: DNN Model Size too Large
Solution 1: Deep Compression
Smaller Size: 90% zeros in weights, 4-bit weight
Accuracy: No loss of accuracy / Improved accuracy
On-chip: State-of-the-art DNN fit on-chip SRAM

5 Our Past Work: Deep Compression
Network Pruning [1]: 10x fewer weights (60M weights => 6M weights)
Weight Sharing [2]: only 4 bits per remaining weight (32 bit => 4 bit)
[1] Han et al. NIPS 2015
[2] Han et al. ICLR 2016, best paper award
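The two steps on this slide can be sketched in a few lines of NumPy. This is an illustration only, not the authors' pipeline: it prunes by magnitude once and clusters the surviving weights with a plain 1-D k-means, and it leaves out the retraining and Huffman coding that the cited papers use. The function names, the random test matrix and the linear centroid initialization are all assumptions made for the example.

```python
import numpy as np

def prune_by_magnitude(weights, keep_fraction):
    """Zero out all but the largest-magnitude `keep_fraction` of the weights."""
    flat = np.abs(weights).ravel()
    k = max(1, int(round(keep_fraction * flat.size)))
    threshold = np.sort(flat)[-k]            # k-th largest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def share_weights(weights, mask, bits=4, iters=20):
    """Cluster the surviving weights into 2**bits shared values (1-D k-means)."""
    values = weights[mask]
    # Assumed initialization: centroids spaced linearly over the value range.
    codebook = np.linspace(values.min(), values.max(), 2 ** bits)
    for _ in range(iters):
        codes = np.argmin(np.abs(values[:, None] - codebook[None, :]), axis=1)
        for c in range(codebook.size):
            members = values[codes == c]
            if members.size:                 # leave empty clusters where they are
                codebook[c] = members.mean()
    codes = np.argmin(np.abs(values[:, None] - codebook[None, :]), axis=1)
    return codebook, codes

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)     # hypothetical layer
pruned, mask = prune_by_magnitude(w, keep_fraction=0.1)
codebook, codes = share_weights(pruned, mask, bits=4)
print(f"kept {int(mask.sum())} of {w.size} weights, {codebook.size} shared values")
```

Each surviving weight is then stored as a 4-bit code into the 16-entry codebook instead of a 32-bit float.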

6 Deep Compression Results

Network     Original Size  Compressed Size  Compression Ratio  Original Accuracy  Compressed Accuracy
AlexNet     240MB          6.9MB            35x                80.27%             80.30%
VGGNet      550MB          11.3MB           49x                88.68%             89.09%
GoogLeNet   28MB           2.8MB            10x                88.90%             88.92%
SqueezeNet  4.8MB          0.47MB           10x                80.32%             80.35%

No loss of accuracy on ImageNet dataset.
Weights fit in on-chip SRAM, taking 120x less energy than DRAM.
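As a quick sanity check, the compression ratios follow directly from the two size columns. The snippet below recomputes them from the sizes quoted on the slide; the dictionary layout is just a convenience for the example.

```python
# Sizes in MB as quoted on the slide: (original, compressed).
sizes_mb = {
    "AlexNet": (240, 6.9),
    "VGGNet": (550, 11.3),
    "GoogLeNet": (28, 2.8),
    "SqueezeNet": (4.8, 0.47),
}

ratios = {name: orig / comp for name, (orig, comp) in sizes_mb.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Rounded to whole numbers, the results agree with the 35x, 49x, 10x and 10x in the table.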

7 EIE: First Accelerator for Compressed Sparse Neural Network
[Figure: PE layout with blocks SpMat, Act_0, Act_1, Ptr_Even, Arithm, Ptr_Odd, SpMat]
Problem 2: Irregular Computation Pattern
Solution 2: EIE accelerator
Sparse Matrix: 90% static sparsity in the weights, 10x less computation, 5x less memory footprint
Sparse Vector: 70% dynamic sparsity in the activation, 3x less computation
Weight Sharing: 4-bit weights, 8x less memory footprint
Fully fits in SRAM: 120x less energy than DRAM
Savings are multiplicative: 5 x 3 x 8 x 120 = 14,400x theoretical energy improvement.
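The multiplicative estimate can be reproduced directly. The factor labels below paraphrase the slide; only the four numbers come from it.

```python
from math import prod

# Energy-saving factors quoted on the slide.
factors = {
    "memory footprint from static weight sparsity": 5,
    "computation from dynamic activation sparsity": 3,
    "memory footprint from 4-bit weight sharing": 8,
    "SRAM instead of DRAM": 120,
}

total = prod(factors.values())
print(f"theoretical energy improvement: {total:,}x")
```

This is an upper bound from multiplying independent factors, which is why the slide calls it theoretical.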

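8 Distributed Storage and Processing

Logically, b = ReLU(W a), with input a = (0, a1, 0, a3) and the rows of W interleaved across the PEs:

PE0   w0,0  w0,1  0     w0,3
PE1   0     0     w1,2  0
PE2   0     w2,1  0     w2,3
...   (rows 3 to 6 are only partly legible; they include w4,2, w4,3, a w5 entry and w6,3)
      0     w7,1  0     0

W a = (b0, b1, b2, b3, b4, b5, b6, b7); after ReLU the output begins (b0, b1, 0, b3, 0, b5, ...).

Physically, PE0 stores its non-zero weights column by column:
Virtual Weight: W0,0 W0,1 W4,2 W0,3 W4,3
Relative Index: (values not legible in the source)
Column Pointer: (values not legible in the source)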
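A minimal software sketch of this storage scheme may help. Each PE owns every `num_pes`-th row, stores its non-zero weights column by column, records for each one how many zeros were skipped in that column (the relative index), and keeps a column pointer into the weight list. The numeric matrix is made up: its non-zero pattern follows the legible entries of the slide, and the position of the row-5 entry is an assumption. The sketch also ignores the 4-bit limit on the relative index and the shared-weight codebook, and it runs the PEs one after another rather than in parallel.

```python
import numpy as np

def encode_pe(W, pe, num_pes):
    """Encode the rows owned by one PE (rows pe, pe+num_pes, ...) column by column."""
    local = W[pe::num_pes, :]                  # this PE's interleaved rows
    values, rel_index, col_ptr = [], [], [0]
    for j in range(local.shape[1]):
        zeros_skipped = 0                      # restarts at the top of each column
        for i in range(local.shape[0]):
            if local[i, j] != 0:
                values.append(local[i, j])
                rel_index.append(zeros_skipped)
                zeros_skipped = 0
            else:
                zeros_skipped += 1
        col_ptr.append(len(values))
    return values, rel_index, col_ptr

def pe_spmv(encoded, a, rows_local):
    """Multiply one PE's slice by the activation vector, skipping zero activations."""
    values, rel_index, col_ptr = encoded
    acc = np.zeros(rows_local)
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue                           # dynamic sparsity: nothing to do
        row = -1
        for k in range(col_ptr[j], col_ptr[j + 1]):
            row += rel_index[k] + 1            # rebuild the local row address
            acc[row] += values[k] * a_j
    return acc

def eie_layer(W, a, num_pes):
    """Run every PE in turn and apply ReLU to the interleaved result."""
    b = np.zeros(W.shape[0])
    for pe in range(num_pes):
        rows_local = len(range(pe, W.shape[0], num_pes))
        b[pe::num_pes] = pe_spmv(encode_pe(W, pe, num_pes), a, rows_local)
    return np.maximum(b, 0.0)

W = np.array([[1, 2, 0, 3],
              [0, 0, 4, 0],
              [0, 5, 0, 6],
              [0, 0, 0, 0],
              [0, 0, 7, 8],
              [9, 0, 0, 0],
              [0, 0, 0, -1],
              [0, -2, 0, 0]], dtype=float)
a = np.array([0.0, 1.0, 0.0, 2.0])

values, rel_index, col_ptr = encode_pe(W, pe=0, num_pes=4)
b = eie_layer(W, a, num_pes=4)
```

For PE0 the encoder returns five stored weights, in the same order as the Virtual Weight row on the slide (W0,0 W0,1 W4,2 W0,3 W4,3).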

9 PE Architecture
[Figure: an array of PEs around a Central Control unit. Inside each PE: Act Queue (Act Value, Act Index), Leading Non-zero Detect, Even Ptr SRAM Bank and Odd Ptr SRAM Bank (Pointer Read, Col Start/End Addr), Sparse Matrix SRAM (Sparse Matrix Access, Encoded Weight, Relative Index), Weight Decoder, Address Accum (Absolute Address), Arithmetic Unit with Bypass, Dest Act Regs and Src Act Regs, Act SRAM (Act R/W), ReLU.]

10 Benchmark
CPU: Intel Core-i7 5930k
GPU: NVIDIA TitanX
Mobile GPU: NVIDIA Jetson TK1

Layer            Weight Density  Activation Density  FLOP %  Description
Alex-6           9%              35.1%               3%      AlexNet for image classification
Alex-7           9%              35.3%               3%
Alex-8           25%             37.5%               10%
VGG-6            4%              18.3%               1%      VGG-16 for image classification
VGG-7            4%              37.5%               2%
VGG-8            23%             41.1%               9%
NeuralTalk-We    10%             100%                10%     RNN and LSTM for image caption
NeuralTalk-Wd    ?               100%                11%
NeuralTalk-LSTM  ?               100%                11%

The Size column and the entries marked ? are not legible in the source. The weight densities shown are those legible in the Table III residue on slide 17.

11 Scalability
[Figure 11: speedup on a log scale versus number of PEs (1, 2, 4, 8, 16, 32, 64, 128, 256) for Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM]
Figure 11. System scalability. It measures the speedups with different numbers of PEs. The speedup is near-linear.

#PEs     Speedup
64 PEs   64x
128 PEs  124x
256 PEs  210x
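One way to read "near-linear" is as parallel efficiency, that is, speedup divided by the number of PEs. The short calculation below uses only the three data points quoted on the slide.

```python
# Speedups quoted on the slide, keyed by number of PEs.
speedups = {64: 64, 128: 124, 256: 210}

efficiency = {n_pes: s / n_pes for n_pes, s in speedups.items()}
for n_pes, eff in efficiency.items():
    print(f"{n_pes} PEs: {eff:.1%} parallel efficiency")
```

Efficiency stays at 100% for 64 PEs and drops to roughly 97% and 82% at 128 and 256 PEs.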

12 Load Balancing
[Figure 8: load balance (0% to 100%) versus FIFO depth (1, 2, 4, 8, 16, 32, 64, 128, 256) for Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM]
Figure 8. Load efficiency improves as FIFO size increases. Once the FIFO depth reaches 8, the marginal gain quickly diminishes. So we choose FIFO de[...] (caption cut off in the source)
Imbalanced non-zeros among PEs degrade system utilization.
This load imbalance can be solved by a FIFO.
With FIFO depth=16, ALU utilization is > 90%.

13 Result of EIE

Technology        45 nm
# PEs             64
On-chip SRAM      8 MB
Max Model Size    84 Million
Static Sparsity   10x
Dynamic Sparsity  3x
Quantization      4-bit
ALU Width         16-bit
Area              40.8 mm^2
MxV Throughput    81,967 layers/s
Power             586 mW

1. Post layout result
2. Throughput measured on AlexNet FC-7
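Two derived figures follow from the throughput and power rows: the time and the energy spent per layer evaluation. Since the throughput was measured on AlexNet FC-7, the derived numbers apply to that layer only. Dividing power by throughput assumes the 586 mW figure holds while that layer runs, which the slide does not state explicitly.

```python
throughput_layers_per_s = 81_967   # MxV throughput from the slide
power_w = 0.586                    # 586 mW from the slide

latency_us = 1e6 / throughput_layers_per_s              # microseconds per layer
energy_uj = power_w / throughput_layers_per_s * 1e6     # microjoules per layer

print(f"{latency_us:.1f} us and {energy_uj:.2f} uJ per AlexNet FC-7 evaluation")
```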

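14 Energy Breakdown
[Figure: two pie charts. One splits energy by type: memory, register, clock network, combinational. The other splits it by module: Act_queue, SpmatRead, ActRW, PtrRead, ArithmUnit. Percentages shown: 11%, 9%, 13%, 12%, 1%, 20%, 20%, 59%, 54%. Which value belongs to which slice cannot be recovered from the source.]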

15 Prediction Accuracy
[Figure: prediction accuracy and multiplier energy (pJ) with different arithmetic precision: 32b Float, 32b Int, 16b Int, 8b Int]
Mixed Precision: 4 bit index (virtual weight), 16 bit real weight, 16 bit fixed point ALU
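The mixed-precision line can be illustrated with a small integer-only sketch: a 4-bit code selects one of 16 shared weights, the selected weight is a 16-bit fixed-point number, and the multiply-accumulate runs on integers. The number of fractional bits, the codebook values and the helper names are assumptions for the example; the slide only states the bit widths.

```python
import numpy as np

FRAC_BITS = 8  # assumed fixed-point format; the slide only says "16 bit fixed point"

def to_fixed(x):
    """Quantize real values to signed 16-bit fixed point with FRAC_BITS fraction bits."""
    scaled = np.round(np.asarray(x, dtype=np.float64) * (1 << FRAC_BITS))
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def decode_and_mac(weight_codes, codebook_q, activations_q):
    """Look up each 4-bit code in the 16-entry table, then multiply-accumulate."""
    weights_q = codebook_q[weight_codes].astype(np.int64)      # 4-bit index -> 16-bit weight
    acc = np.sum(weights_q * activations_q.astype(np.int64))   # wide accumulator
    return int(acc >> FRAC_BITS)                               # back to FRAC_BITS fraction bits

codebook_q = to_fixed(np.linspace(-1.0, 1.0, 16))   # hypothetical shared weights
codes = np.array([0, 15, 15, 0], dtype=np.uint8)    # every code fits in 4 bits
acts_q = to_fixed([1.0, 2.0, 0.5, 0.25])

result_q = decode_and_mac(codes, codebook_q, acts_q)
print(result_q / (1 << FRAC_BITS))
```

The codes pick the weights -1, 1, 1 and -1, so the real-valued result is -1 + 2 + 0.5 - 0.25.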

16 FC Layer: Speedup on EIE
[Figure 6: bar chart of speedup on a log scale (0.1x to 1000x) for CPU Dense (Baseline), CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed and EIE on Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM and Geo Mean]
Figure 6. Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.
Compared to CPU and GPU: 189x and 13x faster
Baseline:
Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV
NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV

17 FC Layer: Energy Efficiency on EIE
[Figure 7: bar chart of energy efficiency on a log scale for CPU Dense (Baseline), CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed and EIE on Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM and Geo Mean. EIE bars, in benchmark order: 34,522x, 61,533x, 14,826x, 119,797x, 76,784x, 11,828x, 9,485x, 10,904x, 8,053x; Geo Mean 24,207x.]
Figure 7. Energy efficiency of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.
Compared to CPU and GPU: 24,000x and 3,400x more energy efficient
Baseline:
Intel Core i7 5930K: reported by pcm-power utility
NVIDIA GeForce GTX Titan X: reported by nvidia-smi utility
NVIDIA Tegra K1: measured with power-meter, 60% AP+DRAM power

Table III (BENCHMARK FROM STATE-OF-THE-ART DNN MODELS), legible Weight% / Act% / FLOP% entries: 9% / 35.1% / 3%; 9% / 35.3% / 3%; 25% / 37.5% / 10%; 4% / 18.3% / 1%; 4% / 37.5% / 2%; 23% / 41.1% / 9%; 10% / 100% / 10%.

Paper text visible behind the chart (cropped at the left edge): "We placed and routed the PE using the Synopsys IC Compiler (ICC). We used Cacti [25] to get SRAM area and energy numbers. We annotated the toggle rate from the RTL simulation to the gate-level netlist, which was dumped to switching activity interchange format (SAIF), and estimated the power using Prime-Time PX. Comparison Baseline. We compare EIE with three different off-the-shelf computing units: CPU, GPU and mobile GPU. CPU. We use Intel Core i7 5930K CPU, a Haswell-E processor, that has been used in NVIDIA Digits Deep Learning Dev Box as a CPU baseline. To run the benchmark [...]"

18 Comparison: Throughput
[Figure: MxV Throughput (Layers/s) for Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), Da-DianNao (28nm ASIC), True-North (28nm ASIC), EIE (45nm ASIC) 64PEs, EIE (28nm ASIC) 256PEs. Bar values are not legible in the source.]

19 Comparison: Area Efficiency
[Figure: Area Efficiency (Layers/s/mm^2) for the same eight platforms. Bar values are not legible in the source.]

20 Comparison: Energy Efficiency
[Figure: Energy Efficiency (Layers/J) for the same eight platforms. Bar values are not legible in the source.]

21 Where do the savings come from?
Four factors for energy saving:
10x static weight sparsity: less work to do; fewer bricks to carry.
3x dynamic activation sparsity: carry only good bricks; ignore broken bricks.
Weight sharing with only 4 bits per weight: lighter bricks to carry.
DRAM => SRAM, no need to go off-chip: carry bricks from San Francisco to Seoul => from Incheon to Seoul.

22 Conclusion
EIE: first accelerator for compressed, sparse neural networks.
Compression => Acceleration, with no loss of accuracy.
Distributed storage/computation to parallelize and load balance across PEs.
13x faster and 3,400x more energy efficient than GPU.
2.9x faster and 19x more energy efficient than past ASIC.

23 Beyond EIE: a Multi-Dimension Sparse Recipe for Deep Learning
Sparsity gives:
Faster Speed: EIE accelerator
Smaller Size: Deep Compression, SqueezeNet++
Higher Accuracy: DSD regularization

[1] Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
[2] Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, Deep Learning Symposium 2015, ICLR 2016 (best paper award)
[3] Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016
[4] Han et al. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow, arXiv 2016
[5] Iandola, Han, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016
[6] Yao, Han, et al. Hardware-friendly convolutional neural network with even-number filter size, ICLR workshop

24 Backup Slides

25 Sparsity: Pruning AlexNet & VGGNet
CONV: 3x
FC: 10x
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

26 Retrain to Fully Recover Accuracy
[Figure: Accuracy Loss (0.5% to -4.5%) versus Parameters Pruned Away (40% to 100%), with one curve each for: L2 regularization w/o retrain; L1 regularization w/o retrain; L1 regularization w/ retrain; L2 regularization w/ retrain; L2 regularization w/ iterative prune and retrain]
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015

27 Weight Sharing: Accuracy with # Bits

#CONV bits / #FC bits  Top-1 Error  Top-5 Error  Top-1 Error Increase  Top-5 Error Increase
32 bits / 32 bits      42.78%       19.73%       -                     -
? bits / 5 bits        42.78%       19.70%       0.00%                 -0.03%
8 bits / 4 bits        42.79%       19.73%       0.01%                 0.00%
4 bits / 2 bits        44.77%       22.33%       1.99%                 2.60%

? marks a value that is not legible in the source.
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016

28 Deep Compression Result on Major Convnets

Network                Top-1 Error  Top-5 Error  Parameters  Compress Rate
LeNet Ref              1.64%        ?            ? KB
LeNet Compressed       1.58%        -            27 KB       40x
LeNet-5 Ref            0.80%        ?            ? KB
LeNet-5 Compressed     0.74%        -            44 KB       39x
AlexNet Ref            42.78%       19.73%       240 MB
AlexNet Compressed     42.78%       19.70%       6.9 MB      35x
VGG-16 Ref             31.50%       11.32%       552 MB
VGG-16 Compressed      31.17%       10.91%       11.3 MB     49x
SqueezeNet Ref         42.5%        19.7%        4.8 MB
SqueezeNet Compressed  42.5%        19.7%        0.47 MB     10x
GoogLeNet Ref          31.30%       11.10%       28 MB
GoogLeNet Compressed   31.26%       11.08%       2.8 MB      10x

? marks a value that is not legible in the source.
SqueezeNet and GoogLeNet: pruning and quantization alone give 10x compression.
The Inception model is already very efficient for classification, but Deep Compression still makes it an order of magnitude smaller.
Fits in SRAM cache.
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, ICLR 2016

29 Pruning NeuralTalk and LSTM

Original: a basketball player in a white uniform is playing with a ball
Pruned 90%: a basketball player in a white uniform is playing with a basketball

Original: a brown dog is running through a grassy field
Pruned 90%: a brown dog is running through a grassy area

Original: a man is riding a surfboard on a wave
Pruned 90%: a man in a wetsuit is riding a wave on a beach

Original: a soccer player in red is running in the field
Pruned 95%: a man in a red shirt and black and white black shirt is running through a field

Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015 poster

30 With Sparsity Constraint, DSD Training Improves Accuracy (Baseline: NeuralTalk)

Baseline: a boy is swimming in a pool.
Sparse: a small black dog is jumping into a pool.
DSD: a black and white dog is swimming in a pool.

Baseline: a group of people are standing in front of a building.
Sparse: a group of people are standing in front of a building.
DSD: a group of people are walking in a park.

Baseline: two girls in bathing suits are playing in the water.
Sparse: two children are playing in the sand.
DSD: two children are playing in the sand.

Baseline: a man in a red shirt and jeans is riding a bicycle down a street.
Sparse: a man in a red shirt and a woman in a wheelchair.
DSD: a man and a woman are riding on a street.

Baseline: a group of people sit on a bench in front of a building.
Sparse: a group of people are standing in front of a building.
DSD: a group of people are standing in a fountain.

Baseline: a man in a black jacket and a black jacket is smiling.
Sparse: a man and a woman are standing in front of a mountain.
DSD: a man in a black jacket is standing next to a man in a black shirt.

Baseline: a group of football players in red uniforms.
Sparse: a group of football players in a field.
DSD: a group of football players in red and white uniforms.

Baseline: a dog runs through the grass.
Sparse: a dog runs through the grass.
DSD: a white and brown dog is running through the grass.

Han et al. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow, arXiv 2016
