EIE: Efficient Inference Engine on Compressed Deep Neural Network
1 EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han*, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally Stanford University June 20, 2016
2 Deep Learning on Mobile: Phones, Drones, Robots, Glasses, Self-Driving Cars. Battery Constrained!
3 Deep Learning on Mobile: Difficulty? Model Size!
Accurate Prediction => Large Model => More Memory Reference => High Power
[Table: Operation, Energy [pJ], Relative Cost. Rows: 32 bit int ADD, float ADD, Register File, int MULT, float MULT, SRAM Cache, DRAM Memory. The energy values are missing from the transcript.]
[Chart axis: Relative Energy Cost]
4 Our Past Work: Deep Compression
Problem 1: DNN Model Size too Large
Solution 1: Deep Compression
Smaller Size: 90% zeros in weights, 4-bit weights
Accuracy: No loss of accuracy / Improved accuracy
On-chip: State-of-the-art DNN fits in on-chip SRAM
5 Our Past Work: Deep Compression
Network Pruning [1]: 10x fewer weights (60M weights => 6M weights)
Weight Sharing [2]: only 4 bits per remaining weight (32 bit => 4 bit)
[1]. Han et al. NIPS 2015
[2]. Han et al. ICLR 2016, best paper award
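The two steps on this slide can be illustrated with a small sketch. It is not the authors' implementation: the function names, the random test matrix, the linearly spaced k-means initialisation and the iteration count are all assumptions, and the retraining that the backup slides describe is omitted entirely.

```python
import numpy as np

def prune_by_magnitude(weights, keep_fraction=0.1):
    """Zero all but the largest-magnitude `keep_fraction` of the weights."""
    flat = np.abs(weights).ravel()
    k = max(1, int(round(keep_fraction * flat.size)))
    # Value at the position where the k largest magnitudes begin.
    threshold = np.partition(flat, flat.size - k)[flat.size - k]
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def share_weights(weights, mask, bits=4, iters=20):
    """Cluster surviving weights into 2**bits shared values (1-D k-means)."""
    values = weights[mask]
    # Assumption: linearly spaced initial centroids.
    codebook = np.linspace(values.min(), values.max(), 2 ** bits)
    for _ in range(iters):
        codes = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
        for c in range(codebook.size):
            members = values[codes == c]
            if members.size:
                codebook[c] = members.mean()
    codes = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook, codes

# Hypothetical 64x64 layer: keep 10% of the weights, then 4-bit codes.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
pruned, mask = prune_by_magnitude(w, keep_fraction=0.1)
codebook, codes = share_weights(pruned, mask, bits=4)
```

Each surviving weight is then stored as a 4-bit code into the 16-entry codebook rather than as a 32-bit value.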
6 Deep Compression Results
Network      Original Size   Compressed Size   Compression Ratio   Original Accuracy   Compressed Accuracy
AlexNet      240MB           6.9MB             35x                 80.27%              80.30%
VGGNet       550MB           11.3MB            49x                 88.68%              89.09%
GoogleNet    28MB            2.8MB             10x                 88.90%              88.92%
SqueezeNet   4.8MB           0.47MB            10x                 80.32%              80.35%
No loss of accuracy on the ImageNet dataset. Weights fit in on-chip SRAM, taking 120x less energy than DRAM.
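As a quick consistency check, the compression ratios in the table follow from the two size columns. The dictionary below only restates the slide's numbers; the variable names are illustrative.

```python
# (original size, compressed size) in MB, copied from the table above.
sizes_mb = {
    "AlexNet": (240, 6.9),
    "VGGNet": (550, 11.3),
    "GoogleNet": (28, 2.8),
    "SqueezeNet": (4.8, 0.47),
}

# Compression ratio = original size / compressed size.
ratios = {name: original / compressed
          for name, (original, compressed) in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```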
7 EIE: First Accelerator for Compressed Sparse Neural Network
[Figure: layout labels SpMat, Act_0, Act_1, Ptr_Even, Arithm, Ptr_Odd, SpMat]
Problem 2: Irregular Computation Pattern
Solution 2: EIE accelerator
Sparse Matrix: 90% static sparsity in the weights; 10x less computation, 5x less memory footprint
Sparse Vector: 70% dynamic sparsity in the activation; 3x less computation
Weight Sharing: 4-bit weights; 8x less memory footprint
Fully fits in SRAM: 120x less energy than DRAM
Savings are multiplicative: 5 x 3 x 8 x 120 = 14,400x theoretical energy improvement.
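A software sketch may help show where the static and dynamic sparsity savings come from. It stores only the non-zero weights column by column and skips every column whose activation is zero. This is an illustration in NumPy, not a model of the EIE hardware; the helper names and the random test data are assumptions.

```python
import numpy as np

def to_csc(dense):
    """Collect the non-zeros of `dense` column by column (CSC layout)."""
    values, row_idx, col_ptr = [], [], [0]
    for j in range(dense.shape[1]):
        rows = np.nonzero(dense[:, j])[0]
        values.extend(dense[rows, j])
        row_idx.extend(rows)
        col_ptr.append(len(values))
    return np.array(values, dtype=float), np.array(row_idx, dtype=int), col_ptr

def sparse_matvec_relu(values, row_idx, col_ptr, act, n_rows):
    """Compute ReLU(W @ act), touching only non-zero weights and activations."""
    out = np.zeros(n_rows)
    macs = 0
    for j in np.nonzero(act)[0]:            # dynamic sparsity: skip zero activations
        lo, hi = col_ptr[j], col_ptr[j + 1]
        out[row_idx[lo:hi]] += values[lo:hi] * act[j]   # static sparsity: non-zeros only
        macs += hi - lo
    return np.maximum(out, 0.0), macs

# Hypothetical layer: roughly 10% weight density, roughly 30% activation density.
rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32)) * (rng.random((32, 32)) < 0.1)
a = np.maximum(rng.normal(size=32), 0.0) * (rng.random(32) < 0.3)
values, row_idx, col_ptr = to_csc(W)
b, macs = sparse_matvec_relu(values, row_idx, col_ptr, a, n_rows=32)

# The slide's multiplicative estimate.
theoretical_saving = 5 * 3 * 8 * 120
```

The `macs` counter is the number of multiply-accumulates actually performed, which can be compared with the 32 x 32 = 1024 a dense product would need.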
8 Distributed Storage and Processing
[Figure, logical view: b = ReLU(W a), where a is a sparse activation vector and the rows of the sparse weight matrix W are interleaved across the processing elements PE0, PE1, PE2, ...]
[Figure, physical view: each PE stores its share of W as three arrays: Virtual Weight (for PE0: W0,0, W0,1, W4,2, W0,3, W4,3), Relative Index, and Column Pointer.]
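The storage scheme named on this slide can be sketched as follows. The 8x4 matrix and its values are hypothetical, chosen so that PE0 (rows 0 and 4 when there are 4 PEs) holds the same five entries the slide lists as virtual weights. The sketch assumes the relative index counts the zeros skipped within a PE's slice of a column and ignores any bit-width limit on that index.

```python
import numpy as np

def encode_for_pe(dense, pe, n_pes):
    """Encode the rows owned by `pe` (rows pe, pe + n_pes, ...) column by column."""
    local = dense[pe::n_pes, :]
    weights, rel_index, col_ptr = [], [], [0]
    for j in range(local.shape[1]):
        zeros_seen = 0
        for value in local[:, j]:
            if value == 0:
                zeros_seen += 1
            else:
                weights.append(float(value))
                rel_index.append(zeros_seen)   # zeros skipped since the last entry
                zeros_seen = 0
        col_ptr.append(len(weights))
    return weights, rel_index, col_ptr

def pe_matvec(weights, rel_index, col_ptr, act, n_local_rows):
    """One PE's share of ReLU(W @ act), computed from its encoded arrays."""
    acc = np.zeros(n_local_rows)
    for j, a in enumerate(act):
        if a == 0:
            continue                            # zero activations are never processed
        row = -1
        for k in range(col_ptr[j], col_ptr[j + 1]):
            row += rel_index[k] + 1             # rebuild the absolute local row
            acc[row] += weights[k] * a
    return np.maximum(acc, 0.0)

# Hypothetical 8x4 weight matrix and activation vector.
W = np.array([[1, 2, 0, 3],
              [0, 0, 4, 0],
              [0, 5, 0, 6],
              [0, 0, 0, 0],
              [0, 0, 7, 8],
              [9, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 2, 0, 0]], dtype=float)
a = np.array([0.0, 2.0, 0.0, 1.0])
n_pes = 4

pe0 = encode_for_pe(W, pe=0, n_pes=n_pes)
b = np.zeros(W.shape[0])
for pe in range(n_pes):
    enc = encode_for_pe(W, pe, n_pes)
    b[pe::n_pes] = pe_matvec(*enc, a, n_local_rows=len(b[pe::n_pes]))
```

Because each PE owns whole rows, every output element is produced by exactly one PE and no reduction across PEs is needed.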
9 PE Architecture
[Figure: block diagram. A Central Control unit with Leading Non-Zero Detect sits among an array of PEs and supplies Act Value and Act Index. Blocks inside each PE: Act Queue (Act Index, Act Value); Even Ptr SRAM Bank and Odd Ptr SRAM Bank (Pointer Read, Col Start/End Addr); Sparse Matrix SRAM (Sparse Matrix Access, Encoded Weight, Relative Index); Weight Decoder; Address Accum (Absolute Address); Arithmetic Unit with Bypass; Dest Act Regs and Src Act Regs; Act SRAM (Act R/W); ReLU.]
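10 Benchmark
CPU: Intel Core-i7 5930k
GPU: NVIDIA TitanX
Mobile GPU: NVIDIA Jetson TK1
Layer      Weight Density   Activation Density   FLOP %   Description
Alex-6     9%               35.1%                3%       AlexNet for image classification
Alex-7     9%               35.3%                3%       AlexNet for image classification
Alex-8     25%              37.5%                10%      AlexNet for image classification
VGG-6      4%               18.3%                1%       VGG-16 for image classification
VGG-7      4%               37.5%                2%       VGG-16 for image classification
VGG-8      23%              41.1%                9%       VGG-16 for image classification
NT-We      10%              100%                 10%      RNN and LSTM for image caption
NT-Wd      (missing)        100%                 11%      RNN and LSTM for image caption
NT-LSTM    (missing)        100%                 11%      RNN and LSTM for image caption
[The Size column is missing from the transcript.]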
11 Scalability
[Figure 11: speedup versus number of PEs (1 to 256) for Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd and NT-LSTM.]
System scalability. It measures the speedups with different numbers of PEs. The speedup is near-linear.
#PEs ~ Speedup
64 PEs: 64x
128 PEs: 124x
256 PEs: 210x
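One way to read the three data points is as parallel efficiency, that is, speedup divided by the number of PEs. The calculation below uses only the numbers on the slide; the variable names are illustrative.

```python
# Speedups quoted on the slide, keyed by number of PEs.
speedups = {64: 64, 128: 124, 256: 210}

# Parallel efficiency = speedup / number of PEs (1.0 means perfectly linear).
efficiency = {n_pes: s / n_pes for n_pes, s in speedups.items()}

for n_pes, e in efficiency.items():
    print(f"{n_pes} PEs: {e:.1%} efficiency")
```

By this measure scaling is linear up to 64 PEs and falls to about 82% at 256 PEs.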
12 Load Balancing
[Figure 8: load balance (0% to 100%) versus FIFO depth (1, 2, 4, 8, 16, 32, 64, 128, 256) for Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd and NT-LSTM.]
Figure caption (truncated in the transcript): Load efficiency improves as FIFO size increases. When FIFO depth 8, the marginal gain quickly diminishes. So we choose FIFO de [...]
Imbalanced non-zeros among PEs degrade system utilization.
This load imbalance can be solved with a FIFO.
With FIFO depth = 16, ALU utilization is > 90%.
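The effect of FIFO depth can be explored with a toy cycle-level model. Everything about the model is an assumption made for illustration: one activation is broadcast per cycle only when every PE's queue has room, each PE performs one multiply-accumulate per cycle, entries with no work are discarded for free, and the per-PE work counts are random. It is not the simulator behind the figure.

```python
from collections import deque
import numpy as np

def simulate_utilization(work, fifo_depth):
    """Return ALU utilization for `work[pe, col]` MACs per broadcast column."""
    n_pes, n_cols = work.shape
    queues = [deque() for _ in range(n_pes)]
    remaining = [0] * n_pes          # cycles left on each PE's current entry
    next_col = cycles = busy = 0
    while next_col < n_cols or any(queues) or any(remaining):
        # Broadcast stalls unless every FIFO can accept one more entry.
        if next_col < n_cols and all(len(q) < fifo_depth for q in queues):
            for pe in range(n_pes):
                queues[pe].append(int(work[pe, next_col]))
            next_col += 1
        for pe in range(n_pes):
            # Fetch the next entry; entries with zero work cost nothing.
            while remaining[pe] == 0 and queues[pe]:
                remaining[pe] = queues[pe].popleft()
            if remaining[pe] > 0:
                remaining[pe] -= 1
                busy += 1
        cycles += 1
    return busy / (n_pes * cycles)

# Hypothetical workload: 8 PEs, 500 columns, Poisson-distributed non-zero counts.
rng = np.random.default_rng(0)
work = rng.poisson(2.0, size=(8, 500))
utilization = {depth: simulate_utilization(work, depth) for depth in (1, 2, 4, 8, 16)}
```

In this model a deeper FIFO lets a lightly loaded PE run ahead while a heavily loaded one catches up, so utilization cannot get worse as the depth grows.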
13 Result of EIE
Technology: 45 nm
# PEs: 64
On-chip SRAM: 8 MB
Max Model Size: 84 Million
Static Sparsity: 10x
Dynamic Sparsity: 3x
Quantization: 4-bit
ALU Width: 16-bit
Area: 40.8 mm^2
MxV Throughput: 81,967 layers/s
Power: 586 mW
1. Post layout result
2. Throughput measured on AlexNet FC-7
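Two derived figures follow directly from the throughput and power rows: the time and the energy spent per layer. The arithmetic below uses only the slide's numbers and assumes the quoted power applies while running at the quoted throughput.

```python
throughput_layers_per_s = 81_967   # MxV throughput from the table
power_w = 0.586                    # 586 mW

time_per_layer_s = 1.0 / throughput_layers_per_s
energy_per_layer_j = power_w / throughput_layers_per_s
layers_per_joule = throughput_layers_per_s / power_w

print(f"time per layer:   {time_per_layer_s * 1e6:.1f} us")
print(f"energy per layer: {energy_per_layer_j * 1e6:.2f} uJ")
print(f"layers per joule: {layers_per_joule:,.0f}")
```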
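14 Energy Breakdown
[Figure: two pie charts. By component type: memory, register, clock network, combinational. By module: Act_queue, SpmatRead, ActRW, PtrRead, ArithmUnit. Percentages as extracted, not matched to labels: 11%, 9%, 13%, 12%, 1%, 20%, 20%, 59%, 54%.]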
15 Prediction Accuracy
[Figure: prediction accuracy and multiplier energy (pJ) with different arithmetic precision: 32b Float, 32b Int, 16b Int, 8b Int.]
Mixed Precision: 4 bit index (virtual weight), 16 bit real weight, 16 bit fixed point ALU
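The mixed-precision scheme can be sketched in a few lines: a 4-bit index selects a 16-bit fixed-point weight from a 16-entry table, and the multiply-accumulate runs on integers. The number of fractional bits, the codebook values and the sample inputs are assumptions chosen for illustration; the slide does not specify them.

```python
import numpy as np

FRAC_BITS = 8  # assumption: 16-bit fixed point with 8 fractional bits

def to_fixed(x):
    """Quantize real values to 16-bit fixed point."""
    scaled = np.round(np.asarray(x, dtype=float) * (1 << FRAC_BITS))
    return np.clip(scaled, -32768, 32767).astype(np.int16)

# Hypothetical 16-entry codebook of real weights, stored as 16-bit fixed point.
codebook_real = np.linspace(-1.0, 1.0, 16)
codebook_fixed = to_fixed(codebook_real)

# 4-bit indices (virtual weights) and 16-bit fixed-point activations.
indices = np.array([0, 15, 7, 8], dtype=np.uint8)
acts_real = np.array([0.5, 1.0, 2.0, 0.25])
acts_fixed = to_fixed(acts_real)

# Decode the indices, then multiply-accumulate in a wider integer type.
weights_fixed = codebook_fixed[indices]
acc = int(np.sum(weights_fixed.astype(np.int32) * acts_fixed.astype(np.int32)))
result = acc / float(1 << (2 * FRAC_BITS))

# Floating-point reference for comparison.
reference = float(np.dot(codebook_real[indices], acts_real))
```

The fixed-point result differs from the floating-point reference only by the rounding of the codebook entries.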
16 FC Layer: Speedup on EIE
[Figure 6: bar chart, log scale from 0.1x to 1000x, of speedup for CPU Dense (Baseline), CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed and EIE on Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM and their geometric mean.]
Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.
Compared to CPU and GPU: 189x and 13x faster
Baseline:
Intel Core i7 5930K: MKL CBLAS GEMV, MKL SPBLAS CSRMV
NVIDIA GeForce GTX Titan X: cuBLAS GEMV, cuSPARSE CSRMV
NVIDIA Tegra K1: cuBLAS GEMV, cuSPARSE CSRMV
17 FC Layer: Energy Efficiency on EIE
[Figure 7: bar chart, log scale from 1x upward, of energy efficiency for CPU Dense (Baseline), CPU Compressed, GPU Dense, GPU Compressed, mGPU Dense, mGPU Compressed and EIE on Alex-6, Alex-7, Alex-8, VGG-6, VGG-7, VGG-8, NT-We, NT-Wd, NT-LSTM and their geometric mean.]
Energy efficiency of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.
Compared to CPU and GPU: 24,000x and 3,400x more energy efficient
Baseline:
Intel Core i7 5930K: reported by pcm-power utility
NVIDIA GeForce GTX Titan X: reported by nvidia-smi utility
NVIDIA Tegra K1: measured with power-meter, 60% AP+DRAM power
Text from the paper visible behind the figure: We placed and routed the PE using the Synopsys IC Compiler (ICC). We used Cacti [25] to get SRAM area and energy numbers. We annotated the toggle rate from the RTL simulation to the gate-level netlist, which was dumped to switching activity interchange format (SAIF), and estimated the power using Prime-Time PX. Comparison Baseline. We compare EIE with three different off-the-shelf computing units: CPU, GPU and mobile GPU.
Table III: Benchmark from state-of-the-art DNN models (the same layers as on slide 10).
18 Comparison: Throughput
[Figure: MxV Throughput (Layers/s) for Core-i7 5930k (22nm CPU), TitanX (28nm GPU), Tegra K1 (28nm mGPU), A-Eye (28nm FPGA), Da-DianNao (28nm ASIC), True-North (28nm ASIC), EIE (45nm ASIC, 64 PEs) and EIE (28nm ASIC, 256 PEs). The bar values are missing from the transcript.]
19 Comparison: Area Efficiency
[Figure: Area Efficiency (Layers/s/mm^2) for the same eight platforms. The bar values are missing from the transcript.]
20 Comparison: Energy Efficiency
[Figure: Energy Efficiency (Layers/J) for the same eight platforms. The bar values are missing from the transcript.]
21 Where are the savings from?
Four factors for energy saving:
10x static weight sparsity; less work to do; fewer bricks to carry.
3x dynamic activation sparsity; carry only good bricks; ignore broken bricks.
Weight sharing with only 4 bits per weight; lighter bricks to carry.
DRAM => SRAM, no need to go off-chip; carry bricks from San Francisco to Seoul => Incheon to Seoul.
22 Conclusion
EIE: first accelerator for compressed, sparse neural networks.
Compression => Acceleration, no loss of accuracy.
Distributed storage/computation to parallelize/load balance across PEs.
13x faster and 3,400x more energy efficient than GPU.
2.9x faster and 19x more energy efficient than past ASIC.
23 Beyond EIE: a Multi-Dimension Sparse Recipe for Deep Learning
Faster Speed: EIE accelerator
Smaller Size: Deep Compression, SqueezeNet++
Higher Accuracy: DSD regularization
[1]. Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
[2]. Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, Deep Learning Symposium 2015, ICLR 2016 (best paper award)
[3]. Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016
[4]. Han et al. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow, arXiv 2016
[5]. Iandola, Han, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv 2016
[6]. Yao, Han, et al. Hardware-friendly convolutional neural network with even-number filter size, ICLR workshop
24 Backup Slides
25 Sparsity: Pruning AlexNet & VGGNet
CONV: 3x
FC: 10x
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
26 Retrain to Fully Recover Accuracy
[Figure: accuracy loss (0.5% down to -4.5%) versus parameters pruned away (40% to 100%) for five settings: L2 regularization w/o retrain; L1 regularization w/o retrain; L1 regularization w/ retrain; L2 regularization w/ retrain; L2 regularization w/ iterative prune and retrain.]
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015
27 Weight Sharing: Accuracy with # Bits
#CONV bits / #FC bits   Top-1 Error   Top-5 Error   Top-1 Error Increase   Top-5 Error Increase
32 bits / 32 bits       42.78%        19.73%        -                      -
[...] bits / 5 bits     42.78%        19.70%        0.00%                  -0.03%
8 bits / 4 bits         42.79%        19.73%        0.01%                  0.00%
4 bits / 2 bits         44.77%        22.33%        1.99%                  2.60%
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding ICLR
28 Deep Compression Result on Major Convnets
Network                 Top-1 Error   Top-5 Error   Parameters    Compress Rate
LeNet Ref               1.64%         -             [...] KB
LeNet Compressed        1.58%         -             27 KB         40x
LeNet-5 Ref             0.80%         -             [...] KB
LeNet-5 Compressed      0.74%         -             44 KB         39x
AlexNet Ref             42.78%        19.73%        240 MB
AlexNet Compressed      42.78%        19.70%        6.9 MB        35x
VGG-16 Ref              31.50%        11.32%        552 MB
VGG-16 Compressed       31.17%        10.91%        11.3 MB       49x
SqueezeNet Ref          42.5%         19.7%         4.8 MB
SqueezeNet Compressed   42.5%         19.7%         0.47 MB       10x
GoogLeNet Ref           31.30%        11.10%        28 MB
GoogLeNet Compressed    31.26%        11.08%        2.8 MB        10x
SqueezeNet and GoogLeNet: pruning and quantization alone give 10x compression.
The Inception model is already very efficient for classification, but Deep Compression still makes it an order of magnitude smaller. Fits in SRAM cache.
Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding ICLR
29 Pruning NeuralTalk and LSTM
Original: a basketball player in a white uniform is playing with a ball
Pruned 90%: a basketball player in a white uniform is playing with a basketball
Original: a brown dog is running through a grassy field
Pruned 90%: a brown dog is running through a grassy area
Original: a man is riding a surfboard on a wave
Pruned 90%: a man in a wetsuit is riding a wave on a beach
Original: a soccer player in red is running in the field
Pruned 95%: a man in a red shirt and black and white black shirt is running through a field
Han et al. Learning both Weights and Connections for Efficient Neural Networks, NIPS 2015 poster
30 With Sparsity Constraint, DSD Training Improves Accuracy (Baseline: NeuralTalk)
Baseline: a boy is swimming in a pool.
Sparse: a small black dog is jumping into a pool.
DSD: a black and white dog is swimming in a pool.
Baseline: a group of people are standing in front of a building.
Sparse: a group of people are standing in front of a building.
DSD: a group of people are walking in a park.
Baseline: two girls in bathing suits are playing in the water.
Sparse: two children are playing in the sand.
DSD: two children are playing in the sand.
Baseline: a man in a red shirt and jeans is riding a bicycle down a street.
Sparse: a man in a red shirt and a woman in a wheelchair.
DSD: a man and a woman are riding on a street.
Baseline: a group of people sit on a bench in front of a building.
Sparse: a group of people are standing in front of a building.
DSD: a group of people are standing in a fountain.
Baseline: a man in a black jacket and a black jacket is smiling.
Sparse: a man and a woman are standing in front of a mountain.
DSD: a man in a black jacket is standing next to a man in a black shirt.
Baseline: a group of football players in red uniforms.
Sparse: a group of football players in a field.
DSD: a group of football players in red and white uniforms.
Baseline: a dog runs through the grass.
Sparse: a dog runs through the grass.
DSD: a white and brown dog is running through the grass.
Han et al. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow, arXiv 2016
RedEye Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision
Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision Robert LiKamWa Yunhui Hou Yuan Gao Mia Polansky Lin Zhong roblkw@rice.edu houyh@rice.edu yg18@rice.edu mia.polansky@rice.edu lzhong@rice.edu
More informationTODAY computer vision technologies are used with great
ARXIV PREPRINT 1 Origami: A 803 GOp/s/W Convolutional Network Accelerator Lukas Cavigelli, Student Member, IEEE, and Luca Benini, Fellow, IEEE arxiv:1512.04295v2 [cs.cv] 19 Jan 2016 Abstract An ever increasing
More informationGeneralized Pattern Matching Micro-Engine
Generalized Pattern Matching Micro-Engine Yuanwei Fang*, Raihan Rasool, Dilip Vasudevan*, Andrew A. Chien* University of Chicago * Argonne National Laboratory King Faisal University Big Data Applications
More informationLossless Compression Algorithms for Direct- Write Lithography Systems
Lossless Compression Algorithms for Direct- Write Lithography Systems Hsin-I Liu Video and Image Processing Lab Department of Electrical Engineering and Computer Science University of California at Berkeley
More informationPower Efficient Architectures to Accelerate Deep Convolutional Neural Networks for edge computing and IoT
Power Efficient Architectures to Accelerate Deep Convolutional Neural Networks for edge computing and IoT Giuseppe Desoli ST Central Labs STMicroelectronics Artificial Intelligence is Everywhere 2 Analysis,
More informationDiscriminative and Generative Models for Image-Language Understanding. Svetlana Lazebnik
Discriminative and Generative Models for Image-Language Understanding Svetlana Lazebnik Image-language understanding Robot, take the pan off the stove! Discriminative image-language tasks Image-sentence
More informationImplementation of an MPEG Codec on the Tilera TM 64 Processor
1 Implementation of an MPEG Codec on the Tilera TM 64 Processor Whitney Flohr Supervisor: Mark Franklin, Ed Richter Department of Electrical and Systems Engineering Washington University in St. Louis Fall
More informationScalability of MB-level Parallelism for H.264 Decoding
Scalability of Macroblock-level Parallelism for H.264 Decoding Mauricio Alvarez Mesa 1, Alex Ramírez 1,2, Mateo Valero 1,2, Arnaldo Azevedo 3, Cor Meenderinck 3, Ben Juurlink 3 1 Universitat Politècnica
More informationA Low-Power 0.7-V H p Video Decoder
A Low-Power 0.7-V H.264 720p Video Decoder D. Finchelstein, V. Sze, M.E. Sinangil, Y. Koken, A.P. Chandrakasan A-SSCC 2008 Outline Motivation for low-power video decoders Low-power techniques pipelining
More informationTartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability
Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Eploiting Numerical Precision Variability Alberto Delmás Lascorz, Sayeh Sharify, Patrick Judd & Andreas Moshovos
More informationYong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan
Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan Virginia Polytechnic Institute and State University Reverse-engineer the brain National
More informationOptimization of Multi-Channel BCH Error Decoding for Common Cases. Russell Dill Master's Thesis Defense April 20, 2015
Optimization of Multi-Channel BCH Error Decoding for Common Cases Russell Dill Master's Thesis Defense April 20, 2015 Bose-Chaudhuri-Hocquenghem (BCH) BCH is an Error Correcting Code (ECC) and is used
More informationEE5780 Advanced VLSI CAD
EE5780 Advanced VLSI CAD Lecture 11 SRAM and Yield Analysis Zhuo Feng 11.1 Memory Arrays SRAM Architecture SRAM Cell Decoders Column Circuitry Multiple Ports Outline Serial Access Memories 11.2 Memory
More informationThis paper is a preprint of a paper accepted by Electronics Letters and is subject to Institution of Engineering and Technology Copyright.
This paper is a preprint of a paper accepted by Electronics Letters and is subject to Institution of Engineering and Technology Copyright. The final version is published and available at IET Digital Library
More informationHardware Implementation of Block GC3 Lossless Compression Algorithm for Direct-Write Lithography Systems
Hardware Implementation of Block GC3 Lossless Compression Algorithm for Direct-Write Lithography Systems Hsin-I Liu, Brian Richards, Avideh Zakhor, and Borivoje Nikolic Dept. of Electrical Engineering
More informationESE534: Computer Organization. Today. Image Processing. Retiming Demand. Preclass 2. Preclass 2. Retiming Demand. Day 21: April 14, 2014 Retiming
ESE534: Computer Organization Today Retiming Demand Folded Computation Day 21: April 14, 2014 Retiming Logical Pipelining Physical Pipelining Retiming Supply Technology Structures Hierarchy 1 2 Image Processing
More informationESE534: Computer Organization. Previously. Today. Previously. Today. Preclass 1. Instruction Space Modeling
ESE534: Computer Organization Previously Instruction Space Modeling Day 15: March 24, 2014 Empirical Comparisons Previously Programmable compute blocks LUTs, ALUs, PLAs Today What if we just built a custom
More informationOn the Rules of Low-Power Design
On the Rules of Low-Power Design (and How to Break Them) Prof. Todd Austin Advanced Computer Architecture Lab University of Michigan austin@umich.edu Once upon a time 1 Rules of Low-Power Design P = acv
More informationSensor Development for the imote2 Smart Sensor Platform
Sensor Development for the imote2 Smart Sensor Platform March 7, 2008 2008 Introduction Aging infrastructure requires cost effective and timely inspection and maintenance practices The condition of a structure
More informationCacheCompress A Novel Approach for Test Data Compression with cache for IP cores
CacheCompress A Novel Approach for Test Data Compression with cache for IP cores Hao Fang ( 方昊 ) fanghao@mprc.pku.edu.cn Rizhao, ICDFN 07 20/08/2007 To be appeared in ICCAD 07 Sections Introduction Our
More informationAn Introduction to Deep Image Aesthetics
Seminar in Laboratory of Visual Intelligence and Pattern Analysis (VIPA) An Introduction to Deep Image Aesthetics Yongcheng Jing College of Computer Science and Technology Zhejiang University Zhenchuan
More informationTutorial Outline. Typical Memory Hierarchy
Tutorial Outline 8:30-8:45 8:45-9:05 9:05-9:30 9:30-10:30 10:30-10:50 10:50-12:15 12:15-1:30 1:30-2:30 2:30-3:30 3:30-3:50 3:50-4:30 4:30-4:45 Introduction and motivation Sources of power in CMOS designs
More informationHardware Implementation of Block GC3 Lossless Compression Algorithm for Direct-Write Lithography Systems
Hardware Implementation of Block GC3 Lossless Compression Algorithm for Direct-Write Lithography Systems Hsin-I Liu, Brian Richards, Avideh Zakhor, and Borivoje Nikolic Dept. of Electrical Engineering
More informationFurther Details Contact: A. Vinay , , #301, 303 & 304,3rdFloor, AVR Buildings, Opp to SV Music College, Balaji
S.NO 2018-2019 B.TECH VLSI IEEE TITLES TITLES FRONTEND 1. Approximate Quaternary Addition with the Fast Carry Chains of FPGAs 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. A Low-Power
More informationCS 7643: Deep Learning
CS 7643: Deep Learning Topics: Stride, padding Pooling layers Fully-connected layers as convolutions Backprop in conv layers Dhruv Batra Georgia Tech Invited Talks Sumit Chopra on CNNs for Pixel Labeling
More informationA CYCLES/MB H.264/AVC MOTION COMPENSATION ARCHITECTURE FOR QUAD-HD APPLICATIONS
9th European Signal Processing Conference (EUSIPCO 2) Barcelona, Spain, August 29 - September 2, 2 A 6-65 CYCLES/MB H.264/AVC MOTION COMPENSATION ARCHITECTURE FOR QUAD-HD APPLICATIONS Jinjia Zhou, Dajiang
More informationA video signal processor for motioncompensated field-rate upconversion in consumer television
A video signal processor for motioncompensated field-rate upconversion in consumer television B. De Loore, P. Lippens, P. Eeckhout, H. Huijgen, A. Löning, B. McSweeney, M. Verstraelen, B. Pham, G. de Haan,
More informationSequential Logic. Introduction to Computer Yung-Yu Chuang
Sequential Logic Introduction to Computer Yung-Yu Chuang with slides by Sedgewick & Wayne (introcs.cs.princeton.edu), Nisan & Schocken (www.nand2tetris.org) and Harris & Harris (DDCA) Review of Combinational
More informationFOIL it! Find One mismatch between Image and Language caption
FOIL it! Find One mismatch between Image and Language caption ACL, Vancouver, 31st July, 2017 Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, Raffaella Bernardi
More informationHigh Performance Carry Chains for FPGAs
High Performance Carry Chains for FPGAs Matthew M. Hosler Department of Electrical and Computer Engineering Northwestern University Abstract Carry chains are an important consideration for most computations,
More informationLayout Decompression Chip for Maskless Lithography
Layout Decompression Chip for Maskless Lithography Borivoje Nikolić, Ben Wild, Vito Dai, Yashesh Shroff, Benjamin Warlick, Avideh Zakhor, William G. Oldham Department of Electrical Engineering and Computer
More informationDeep learning for music data processing
Deep learning for music data processing A personal (re)view of the state-of-the-art Jordi Pons www.jordipons.me Music Technology Group, DTIC, Universitat Pompeu Fabra, Barcelona. 31st January 2017 Jordi
More informationSlide Set Overview. Special Topics in Advanced Digital System Design. Embedded System Design. Embedded System Design. What does a digital camera do?
Slide Set Overview Special Topics in Advanced Digital System Design by Dr. Lesley Shannon Email: lshannon@ensc.sfu.ca Course Website: http://www.ensc.sfu.ca/~lshannon/ Simon Fraser University Slide Set:
More informationUse of Low Power DET Address Pointer Circuit for FIFO Memory Design
International Journal of Education and Science Research Review Use of Low Power DET Address Pointer Circuit for FIFO Memory Design Harpreet M.Tech Scholar PPIMT Hisar Supriya Bhutani Assistant Professor
More informationAN EFFECTIVE CACHE FOR THE ANYWHERE PIXEL ROUTER
University of Kentucky UKnowledge Theses and Dissertations--Electrical and Computer Engineering Electrical and Computer Engineering 2007 AN EFFECTIVE CACHE FOR THE ANYWHERE PIXEL ROUTER Vijai Raghunathan
More informationAdvanced Video Processing for Future Multimedia Communication Systems
Advanced Video Processing for Future Multimedia Communication Systems André Kaup Friedrich-Alexander University Erlangen-Nürnberg Future Multimedia Communication Systems Trend in video to make communication
More informationA High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System
A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System Zhibin Xiao and Bevan M. Baas VLSI Computation Lab, ECE Department University of California, Davis Outline Introduction to H.264
More informationHardware Design I Chap. 5 Memory elements
Hardware Design I Chap. 5 Memory elements E-mail: shimada@is.naist.jp Why memory is required? To hold data which will be processed with designed hardware (for storage) Main memory, cache, register, and
More informationDay 21: Retiming Requirements. ESE534: Computer Organization. Relative Sizes. Today. State. State Size
ESE534: Computer Organization Day 22: November 16, 2016 Retiming 1 Day 21: Retiming Requirements Retiming requirement depends on parallelism and performance Even with a given amount of parallelism Will
More informationScene Classification with Inception-7. Christian Szegedy with Julian Ibarz and Vincent Vanhoucke
Scene Classification with Inception-7 Christian Szegedy with Julian Ibarz and Vincent Vanhoucke Julian Ibarz Vincent Vanhoucke Task Classification of images into 10 different classes: Bedroom Bridge Church
More informationRegister Transfer Level (RTL) Design Cont.
CSE4: Components and Design Techniques for Digital Systems Register Transfer Level (RTL) Design Cont. Tajana Simunic Rosing Where we are now What we are covering today: RTL design examples, RTL critical
More informationA Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification
INTERSPEECH 17 August, 17, Stockholm, Sweden A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification Yun Wang and Florian Metze Language
More informationInternational Journal of Scientific & Engineering Research, Volume 5, Issue 9, September ISSN
International Journal of Scientific & Engineering Research, Volume 5, Issue 9, September-2014 917 The Power Optimization of Linear Feedback Shift Register Using Fault Coverage Circuits K.YARRAYYA1, K CHITAMBARA
More informationAutomatic Piano Music Transcription
Automatic Piano Music Transcription Jianyu Fan Qiuhan Wang Xin Li Jianyu.Fan.Gr@dartmouth.edu Qiuhan.Wang.Gr@dartmouth.edu Xi.Li.Gr@dartmouth.edu 1. Introduction Writing down the score while listening
More informationAn Efficient Reduction of Area in Multistandard Transform Core
An Efficient Reduction of Area in Multistandard Transform Core A. Shanmuga Priya 1, Dr. T. K. Shanthi 2 1 PG scholar, Applied Electronics, Department of ECE, 2 Assosiate Professor, Department of ECE Thanthai
More informationIEEE Santa Clara ComSoc/CAS Weekend Workshop Event-based analog sensing
IEEE Santa Clara ComSoc/CAS Weekend Workshop Event-based analog sensing Theodore Yu theodore.yu@ti.com Texas Instruments Kilby Labs, Silicon Valley Labs September 29, 2012 1 Living in an analog world The
More informationInternational Journal of Engineering Trends and Technology (IJETT) - Volume4 Issue8- August 2013
International Journal of Engineering Trends and Technology (IJETT) - Volume4 Issue8- August 2013 Design and Implementation of an Enhanced LUT System in Security Based Computation dama.dhanalakshmi 1, K.Annapurna
More informationALONG with the progressive device scaling, semiconductor
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 57, NO. 4, APRIL 2010 285 LUT Optimization for Memory-Based Computation Pramod Kumar Meher, Senior Member, IEEE Abstract Recently, we
More informationMultimedia Communications. Video compression
Multimedia Communications Video compression Video compression Of all the different sources of data, video produces the largest amount of data There are some differences in our perception with regard to
More informationData Storage and Manipulation
Data Storage and Manipulation Data Storage Bits and Their Storage: Gates and Flip-Flops, Other Storage Techniques, Hexadecimal notation Main Memory: Memory Organization, Measuring Memory Capacity Mass
More informationOL_H264e HDTV H.264/AVC Baseline Video Encoder Rev 1.0. General Description. Applications. Features
OL_H264e HDTV H.264/AVC Baseline Video Encoder Rev 1.0 General Description Applications Features The OL_H264e core is a hardware implementation of the H.264 baseline video compression algorithm. The core
More informationPredicting Aesthetic Radar Map Using a Hierarchical Multi-task Network
Predicting Aesthetic Radar Map Using a Hierarchical Multi-task Network Xin Jin 1,2,LeWu 1, Xinghui Zhou 1, Geng Zhao 1, Xiaokun Zhang 1, Xiaodong Li 1, and Shiming Ge 3(B) 1 Department of Cyber Security,
More informationReconfigurable Architectures. Greg Stitt ECE Department University of Florida
Reconfigurable Architectures Greg Stitt ECE Department University of Florida How can hardware be reconfigurable? Problem: Can t change fabricated chip ASICs are fixed Solution: Create components that can
More informationJoint Image and Text Representation for Aesthetics Analysis
Joint Image and Text Representation for Aesthetics Analysis Ye Zhou 1, Xin Lu 2, Junping Zhang 1, James Z. Wang 3 1 Fudan University, China 2 Adobe Systems Inc., USA 3 The Pennsylvania State University,
More informationPredicting the immediate future with Recurrent Neural Networks: Pre-training and Applications
Predicting the immediate future with Recurrent Neural Networks: Pre-training and Applications Introduction Brandon Richardson December 16, 2011 Research preformed from the last 5 years has shown that the
More informationOn the HPDP from architecture to a device. Final Presentation Days ESTEC, May 9 th 2017
On the HPDP from architecture to a device Final Presentation Days ESTEC, May 9 th 2017 Outline Introduction HPDP Architecture Top Design Comparisons Target applications Design Flow Operating Environment
More informationVLSI Design: 3) Explain the various MOSFET Capacitances & their significance. 4) Draw a CMOS Inverter. Explain its transfer characteristics
1) Explain why & how a MOSFET works VLSI Design: 2) Draw Vds-Ids curve for a MOSFET. Now, show how this curve changes (a) with increasing Vgs (b) with increasing transistor width (c) considering Channel
More informationHardware Implementation of Viterbi Decoder for Wireless Applications
Hardware Implementation of Viterbi Decoder for Wireless Applications Bhupendra Singh 1, Sanjeev Agarwal 2 and Tarun Varma 3 Deptt. of Electronics and Communication Engineering, 1 Amity School of Engineering
More informationSinger Traits Identification using Deep Neural Network
Singer Traits Identification using Deep Neural Network Zhengshan Shi Center for Computer Research in Music and Acoustics Stanford University kittyshi@stanford.edu Abstract The author investigates automatic
SoC IC Basics. COE838: Systems on Chip Design
SoC IC Basics COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University Overview SoC
A HIGH THROUGHPUT CABAC ALGORITHM USING SYNTAX ELEMENT PARTITIONING. Vivienne Sze Anantha P. Chandrakasan 2009 ICIP Cairo, Egypt
A HIGH THROUGHPUT CABAC ALGORITHM USING SYNTAX ELEMENT PARTITIONING Vivienne Sze Anantha P. Chandrakasan 2009 ICIP Cairo, Egypt Motivation High demand for video on mobile devices Compression to reduce storage
FPGA-BASED IMPLEMENTATION OF A REAL-TIME 5000-WORD CONTINUOUS SPEECH RECOGNIZER
FPGA-BASED IMPLEMENTATION OF A REAL-TIME 5000-WORD CONTINUOUS SPEECH RECOGNIZER Young-kyu Choi, Kisun You, and Wonyong Sung School of Electrical Engineering, Seoul National University San 56-1, Shillim-dong,
High Performance Microprocessor Design and Automation: Overview, Challenges and Opportunities IBM Corporation
High Performance Microprocessor Design and Automation: Overview, Challenges and Opportunities Introduction About Myself What to expect out of this lecture Understand the current trend in the IC Design
Reconfigurable Neural Net Chip with 32K Connections
Reconfigurable Neural Net Chip with 32K Connections H.P. Graf, R. Janow, D. Henderson, and R. Lee AT&T Bell Laboratories, Room 4G320, Holmdel, NJ 07733 Abstract We describe a CMOS neural net chip with
LOW POWER DIGITAL EQUALIZATION FOR HIGH SPEED SERDES. Masum Hossain University of Alberta
LOW POWER DIGITAL EQUALIZATION FOR HIGH SPEED SERDES Masum Hossain University of Alberta 0 Outline Why ADC-Based receiver? Challenges in ADC-based receiver ADC-DSP based Receiver Reducing impact of Quantization
MPEG has been established as an international standard
1100 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 9, NO. 7, OCTOBER 1999 Fast Extraction of Spatially Reduced Image Sequences from MPEG-2 Compressed Video Junehwa Song, Member,
Motion Video Compression
7 Motion Video Compression 7.1 Motion video Motion video contains massive amounts of redundant information. This is because each image has redundant information and also because there are very few changes
Investigation of Look-Up Table Based FPGAs Using Various IDCT Architectures
Investigation of Look-Up Table Based FPGAs Using Various IDCT Architectures Jörn Gause Abstract This paper presents an investigation of Look-Up Table (LUT) based Field Programmable Gate Arrays (FPGAs)
Decoder Hardware Architecture for HEVC
Decoder Hardware Architecture for HEVC The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Tikekar, Mehul,
Objectives. Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath
Objectives Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath In the previous chapters we have studied how to develop a specification from a given application, and
CMOS Technology for Increasing Efficiency of Clock Gating Techniques Using Tri-State Buffer
Engineering and Physical Sciences CMOS Technology for Increasing Efficiency of Clock Gating Techniques Using Tri-State Buffer Maan HAMEED *, Asem KHMAG, Fakhrul ZAMAN and Abdurrahman RAMLI Department of
SEMICONDUCTOR TECHNOLOGY -CMOS-
SEMICONDUCTOR TECHNOLOGY -CMOS- Fire Tom Wada What is semiconductor and LSIs Huge number of transistors can be integrated in a small Si chip. The size of the chip is roughly the size of nails. Currently,
An FPGA Platform for Demonstrating Embedded Vision Systems. Ariana Eisenstein
An FPGA Platform for Demonstrating Embedded Vision Systems by Ariana Eisenstein B.S., Massachusetts Institute of Technology (2015) Submitted to the Department of Electrical Engineering and Computer Science
11. Sequential Elements
11. Sequential Elements Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 11, 2017 ECE Department, University of Texas at Austin
PROF. TAJANA SIMUNIC ROSING. Midterm. Problem Max. Points Points Total 150 INSTRUCTIONS:
CSE 237A FALL 2006 PROF. TAJANA SIMUNIC ROSING Midterm NAME: ID: Solutions Problem Max. Points Points 1 20 2 20 3 30 4 25 5 25 6 30 Total 150 INSTRUCTIONS: 1. There are 6 problems on 11 pages worth a total
SRAM Based Random Number Generator For Non-Repeating Pattern Generation
Applied Mechanics and Materials Online: 2014-06-18 ISSN: 1662-7482, Vol. 573, pp 181-186 doi:10.4028/www.scientific.net/amm.573.181 2014 Trans Tech Publications, Switzerland SRAM Based Random Number Generator
Sharif University of Technology. SoC: Introduction
SoC Design Lecture 1: Introduction Shaahin Hessabi Department of Computer Engineering System-on-Chip System: a set of related parts that act as a whole to achieve a given goal. A system is a set of interacting
OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0. General Description. Applications. Features
OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0 General Description Applications Features The OL_H264MCLD core is a hardware implementation of the H.264 baseline video compression
Communication Avoiding Successive Band Reduction
Communication Avoiding Successive Band Reduction Grey Ballard, James Demmel, Nicholas Knight UC Berkeley PPoPP 12 Research supported by Microsoft (Award #024263) and Intel (Award #024894) funding and by
Frame Processing Time Deviations in Video Processors
Tensilica White Paper Frame Processing Time Deviations in Video Processors May, 2008 1 Executive Summary Chips are increasingly made with processor designs licensed as semiconductor IP (intellectual property).
TKK S ASIC-PIIRIEN SUUNNITTELU
Design TKK S-88.134 ASIC-PIIRIEN SUUNNITTELU Design Flow 3.2.2005 RTL Design 10.2.2005 Implementation 7.4.2005 Contents 1. Terminology 2. RTL to Parts flow 3. Logic synthesis 4. Static Timing Analysis
Reconfigurable FPGA Implementation of FIR Filter using Modified DA Method
Reconfigurable FPGA Implementation of FIR Filter using Modified DA Method M. Backia Lakshmi 1, D. Sellathambi 2 1 PG Student, Department of Electronics and Communication Engineering, Parisutham Institute
Low Power Estimation on Test Compression Technique for SoC based Design
Indian Journal of Science and Technology, Vol 8(4), DOI: 0.7485/ijst/205/v8i4/6848, July 205 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Low Power Estimation on Test Compression Technique for SoC based
A High Performance VLSI Architecture with Half Pel and Quarter Pel Interpolation for A Single Frame
I J C T A, 9(34) 2016, pp. 673-680 International Science Press A High Performance VLSI Architecture with Half Pel and Quarter Pel Interpolation for A Single Frame K. Priyadarshini 1 and D. Jackuline Moni
Slide Set 9. for ENCM 501 in Winter Steve Norman, PhD, PEng
Slide Set 9 for ENCM 501 in Winter 2018 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 501 Winter 2018 Slide Set 9 slide
Contents Slide Set 6. Introduction to Chapter 7 of the textbook. Outline of Slide Set 6. An outline of the first part of Chapter 7
CM 69 W4 Section Slide Set 6 slide 2/9 Contents Slide Set 6 for CM 69 Winter 24 Lecture Section Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary
LSTM Neural Style Transfer in Music Using Computational Musicology
LSTM Neural Style Transfer in Music Using Computational Musicology Jett Oristaglio Dartmouth College, June 4 2017 1. Introduction In the 2016 paper A Neural Algorithm of Artistic Style, Gatys et al. discovered
Automatic optimization of image capture on mobile devices by human and non-human agents
Automatic optimization of image capture on mobile devices by human and non-human agents 1.1 Abstract Sophie Lebrecht, Mark Desnoyer, Nick Dufour, Zhihao Li, Nicole A. Halmi, David L. Sheinberg, Michael
Low Power Approach of Clock Gating in Synchronous System like FIFO: A Novel Clock Gating Approach and Comparative Analysis
Low Power Approach of Clock Gating in Synchronous System like FIFO: A Novel Clock Gating Approach and Comparative Analysis Abstract- A new technique of clock is presented to reduce dynamic power consumption.
The Multistandard Full HD Video-Codec Engine On Low Power Devices
The Multistandard Full HD Video-Codec Engine On Low Power Devices B.Susma (M. Tech). Embedded Systems. Aurora's Technological & Research Institute. Hyderabad. B.Srinivas Asst. professor. ECE, Aurora's
THE USE OF forward error correction (FEC) in optical networks
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 8, AUGUST 2005 461 A High-Speed Low-Complexity Reed Solomon Decoder for Optical Communications Hanho Lee, Member, IEEE Abstract
LUT Design Using OMS Technique for Memory Based Realization of FIR Filter
International Journal of Emerging Engineering Research and Technology Volume. 2, Issue 6, September 2014, PP 72-80 ISSN 2349-4395 (Print) & ISSN 2349-4409 (Online) LUT Design Using OMS Technique for Memory
Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No. # 29 Minimizing Switched Capacitance-III. (Refer
FPGA Based Implementation of Convolutional Encoder- Viterbi Decoder Using Multiple Booting Technique
FPGA Based Implementation of Convolutional Encoder- Viterbi Decoder Using Multiple Booting Technique Dr. Dhafir A. Alneema (1) Yahya Taher Qassim (2) Lecturer Assistant Lecturer Computer Engineering Dept.
SEMICONDUCTOR TECHNOLOGY -CMOS-
SEMICONDUCTOR TECHNOLOGY -CMOS- Fire Tom Wada 2011/12/19 1 What is semiconductor and LSIs Huge number of transistors can be integrated in a small Si chip. The size of the chip is roughly the size of nails.
Syrah. Flux All rights reserved
Flux 2009. All rights reserved - The Creative adaptive-dynamics processor Thank you for using. We hope that you will get good use of the information found in this manual, and to help you getting acquainted
Design of Memory Based Implementation Using LUT Multiplier
Design of Memory Based Implementation Using LUT Multiplier Charan Kumar.k 1, S. Vikrama Narasimha Reddy 2, Neelima Koppala 3 1,2 M.Tech(VLSI) Student, 3 Assistant Professor, ECE Department, Sree Vidyanikethan
Innovative Fast Timing Design
Innovative Fast Timing Design Solution through Simultaneous Processing of Logic Synthesis and Placement A new design methodology is now available that offers the advantages of enhanced logical design efficiency
Read-only memory (ROM) Digital logic: ALUs Sequential logic circuits. Don't cares. Bus
Digital logic: ALUs Sequential logic circuits CS207, Fall 2004 October 11, 13, and 15, 2004 1 Read-only memory (ROM) A form of memory Contents fixed when circuit is created n input lines for 2^n addressable
arxiv: v1 [cs.lg] 15 Jun 2016
Deep Learning for Music arxiv:1606.04930v1 [cs.lg] 15 Jun 2016 Allen Huang Department of Management Science and Engineering Stanford University allenh@cs.stanford.edu Abstract Raymond Wu Department of