Generalized Pattern Matching Micro-Engine

Size: px
Start display at page:

Download "Generalized Pattern Matching Micro-Engine"

Transcription

1 Generalized Pattern Matching Micro-Engine Yuanwei Fang*, Raihan Rasool, Dilip Vasudevan*, Andrew A. Chien* University of Chicago * Argonne National Laboratory King Faisal University

2 Big Data Applications Deep Packet Inspection Bioinformatics (DNA Alignment) JSON/XML Parsing Signal Triggering 6/24/2014 UNIVERSITY OF CHICAGO 2

3 Deep Packet Inspection High speed network : 100Gb/s Growing number of patterns: 6000 Snort Rules Speed requirement: > 75 Tera DFAops/s Power budget : 200 W Energy efficiency requirement: > 375Gops/J 6/24/2014 UNIVERSITY OF CHICAGO 3

4 Bioinformatics (DNA Alignment) Bioinformatics database: millions of species Speed requirement: > 1 Tera DFAops/s Power budget : 200 W Energy efficiency requirement: > 5 Gops/J Genome size: 130G base pairs 6/24/2014 UNIVERSITY OF CHICAGO 4

5 Deterministic Finite Automata (DFA) 6/24/2014 UNIVERSITY OF CHICAGO 5

6 Programmable Approaches target Intel Xeon E5-2600: 17G DFAops/second with 130W, 0.13Gops/J ; 6/24/2014 UNIVERSITY OF CHICAGO 6

7 Approach Workload M input characters(m DFA transitions) N DFA rules perform on the M input characters Goal Compute N x M transitions efficiently Approach Parallelize DFA execution Fused Instruction 6/24/2014 UNIVERSITY OF CHICAGO 7

8 What Is Micro-Engine Generalized Pattern Matching Micro-Engine ( GenPM ) is one micro-engine of 10x10 approach Local Memory I-Cache Basic RISC CPU I-Cache Microengine 2 I-Cache Microengine 3 I-Cache Microengine 4 I-Cache GenPM I-Cache Microengine 6 I-Cache Microengine 7 I-Cache Microengine 8 Shared L1 Data Cache 6/24/2014 UNIVERSITY OF CHICAGO 8

9 GenPM Micro Architecture 6/24/2014 UNIVERSITY OF CHICAGO 9

10 Fused Instructions: Multi-Step String buffer a b c Acc_Vec 0 1 Current State A D Q 1 Q 4 ALU address Local Mem Accept ENB Next State 6/24/2014 UNIVERSITY OF CHICAGO 10

11 Fused Instructions: Multi-Step String buffer a b c Acc_Vec 0 1 Current State A D Q 1 Q 4 ALU address Local Mem Accept ENB Next State 6/24/2014 UNIVERSITY OF CHICAGO 11

12 Fused Instructions: Multi-Step String buffer a b c Acc_Vec Current State A D Q 1 Q 4 ALU address Local Mem Accept ENB Next State 6/24/2014 UNIVERSITY OF CHICAGO 12

13 Fused Instructions: Multi-Step String buffer a b c Acc_Vec Current State Acc_Vec A D ENB Q 1 Q 4 ALU address Local Mem Accept CHECK Next State 6/24/2014 UNIVERSITY OF CHICAGO 13

14 Parallel DFA: Vector Instruction SSE ADD /24/2014 UNIVERSITY OF CHICAGO 14

15 Parallel DFA: Vector Instruction GMVSNEXT DFAop DFAop DFAop DFAop DFAop DFAop DFAop 6/24/2014 UNIVERSITY OF CHICAGO 15

16 GenPM Code Example Data movement Multi-step parallel DFA execution Find precise matching position 6/24/2014 UNIVERSITY OF CHICAGO 16

17 Methodology Design space: Parallelism and step length Baseline 32-bit 6-stage in-order RISC 4GB DDR3 DRAM 32KB L1 I-cache, 24KB L1 D-cache, 512KB L2 (modeled on Intel Silverthorne) GenPM 1MB Local memory (up to 64 banks) Vector and Fused Instructions Performance/Power Model Core : 32nm synthesis by Synopsys Processor Designer Memories : MARSSX86/CACTI 6 + DRAMSim2 Workload 64 Snort rules from snapshot, 10KB random network dump 6/24/2014 UNIVERSITY OF CHICAGO 17

18 speedup versus RISC Performance GenPM_8way Speedup GenPM_64way step length 6/24/2014 UNIVERSITY OF CHICAGO 18

19 energy improvement versus RISC Energy Efficiency GenPM_8way GenPM_64way step length 6/24/2014 UNIVERSITY OF CHICAGO 19

20 Throughput per watt(gops/j) Throughput/watt (absolute) GenPM_8way Throughput/watt GenPM_64way step length Scale to a 75W chip, GenPM delivers > 2.6 Tera DFAops/second 6/24/2014 UNIVERSITY OF CHICAGO 20

21 total energy Energy Breakdown 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% RISC GenPM_8B_1S GenPM_8B_8S GenPM_8B_16S GenPM_64B_1S GenPM_64B_8S GenPM_64B_16S LM_max = 83% LM L1_I L1_D L2 DRAM Core 6/24/2014 UNIVERSITY OF CHICAGO 21

22 General Comparison 6/24/2014 UNIVERSITY OF CHICAGO 22

23 Related Work ASIC: [Brodie, et.al. ISCA 2006], [Titanic System RXP], [ Cisco SCE ] FPGA: [Yang Xu, et.al. ANCS 2011], [ T Song, et.al. INFOCOM 2008], [I Sourdis et.al. VLSI 2008] CPU: [Mytkowicz et.al. ASPLOS 2014 ], [ Intel HyperScan] GPU: [Vasiliadis G, et.al. CCS 2011], [ Lin CH, et.al. INFOCOM 2012] SoC: [C Johnson et.al. ISSCC 2010 ], [ Cavium Octeon ], [ IBM PowerEN ] 6/24/2014 UNIVERSITY OF CHICAGO 23

24 Summery GenPM is a high performance and energy efficient accelerator for pattern matching workloads ISA exploits parallelism and multi-step execution Scale to a 75W chip, GenPM delivers > 2.6 Tera DFAops/second GenPM approaches ASIC efficiency and integrates it into a programmable core 6/24/2014 UNIVERSITY OF CHICAGO 24

25 Future Work DFA table compression Scale up with multiple GenPM micro-engines Explore more applications 6/24/2014 UNIVERSITY OF CHICAGO 25

26 Acknowledgements Defense Advanced Research Projects Agency (DARPA) Agilent Technologies (now Keysight Technologies) Synopsys Academic program Dr. Tung Hoang and members of the Large Scale Systems Group in the Department of Computer Science 6/24/2014 UNIVERSITY OF CHICAGO 26

High Performance Microprocessor Design and Automation: Overview, Challenges and Opportunities IBM Corporation

High Performance Microprocessor Design and Automation: Overview, Challenges and Opportunities IBM Corporation High Performance Microprocessor Design and Automation: Overview, Challenges and Opportunities Introduction About Myself What to expect out of this lecture Understand the current trend in the IC Design

More information

A CYCLES/MB H.264/AVC MOTION COMPENSATION ARCHITECTURE FOR QUAD-HD APPLICATIONS

A CYCLES/MB H.264/AVC MOTION COMPENSATION ARCHITECTURE FOR QUAD-HD APPLICATIONS 9th European Signal Processing Conference (EUSIPCO 2) Barcelona, Spain, August 29 - September 2, 2 A 6-65 CYCLES/MB H.264/AVC MOTION COMPENSATION ARCHITECTURE FOR QUAD-HD APPLICATIONS Jinjia Zhou, Dajiang

More information

EIE: Efficient Inference Engine on Compressed Deep Neural Network

EIE: Efficient Inference Engine on Compressed Deep Neural Network EIE: Efficient Inference Engine on Compressed Deep Neural Network Song Han*, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally Stanford University June 20, 2016 Deep Learning on

More information

Scalability of MB-level Parallelism for H.264 Decoding

Scalability of MB-level Parallelism for H.264 Decoding Scalability of Macroblock-level Parallelism for H.264 Decoding Mauricio Alvarez Mesa 1, Alex Ramírez 1,2, Mateo Valero 1,2, Arnaldo Azevedo 3, Cor Meenderinck 3, Ben Juurlink 3 1 Universitat Politècnica

More information

Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse

Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse Dark Silicon Workshop Kick-off Talk Harnessing the Four Horsemen of the Coming Dark Silicon Apocalypse Michael B. Taylor Associate Professor (July 2012) University of California, San Diego This Talk The

More information

A Low-Power 0.7-V H p Video Decoder

A Low-Power 0.7-V H p Video Decoder A Low-Power 0.7-V H.264 720p Video Decoder D. Finchelstein, V. Sze, M.E. Sinangil, Y. Koken, A.P. Chandrakasan A-SSCC 2008 Outline Motivation for low-power video decoders Low-power techniques pipelining

More information

Lossless Compression Algorithms for Direct- Write Lithography Systems

Lossless Compression Algorithms for Direct- Write Lithography Systems Lossless Compression Algorithms for Direct- Write Lithography Systems Hsin-I Liu Video and Image Processing Lab Department of Electrical Engineering and Computer Science University of California at Berkeley

More information

Using Embedded Dynamic Random Access Memory to Reduce Energy Consumption of Magnetic Recording Read Channel

Using Embedded Dynamic Random Access Memory to Reduce Energy Consumption of Magnetic Recording Read Channel IEEE TRANSACTIONS ON MAGNETICS, VOL. 46, NO. 1, JANUARY 2010 87 Using Embedded Dynamic Random Access Memory to Reduce Energy Consumption of Magnetic Recording Read Channel Ningde Xie 1, Tong Zhang 1, and

More information

FPGA-BASED IMPLEMENTATION OF A REAL-TIME 5000-WORD CONTINUOUS SPEECH RECOGNIZER

FPGA-BASED IMPLEMENTATION OF A REAL-TIME 5000-WORD CONTINUOUS SPEECH RECOGNIZER FPGA-BASED IMPLEMENTATION OF A REAL-TIME 5000-WORD CONTINUOUS SPEECH RECOGNIZER Young-kyu Choi, Kisun You, and Wonyong Sung School of Electrical Engineering, Seoul National University San 56-1, Shillim-dong,

More information

Highly Parallel HEVC Decoding for Heterogeneous Systems with CPU and GPU

Highly Parallel HEVC Decoding for Heterogeneous Systems with CPU and GPU 2017. This manuscript version (accecpted manuscript) is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/. Highly Parallel HEVC Decoding for Heterogeneous

More information

Time-Saving Features in Economy Oscilloscopes Streamline Test

Time-Saving Features in Economy Oscilloscopes Streamline Test Time-Saving Features in Economy Oscilloscopes Streamline Test Application Note Oscilloscopes are the go-to tool for debug and troubleshooting, whether you work in &, manufacturing or education. Like other

More information

Further Details Contact: A. Vinay , , #301, 303 & 304,3rdFloor, AVR Buildings, Opp to SV Music College, Balaji

Further Details Contact: A. Vinay , , #301, 303 & 304,3rdFloor, AVR Buildings, Opp to SV Music College, Balaji S.NO 2018-2019 B.TECH VLSI IEEE TITLES TITLES FRONTEND 1. Approximate Quaternary Addition with the Fast Carry Chains of FPGAs 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. A Low-Power

More information

32 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 45, NO. 1, JANUARY 2010

32 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 45, NO. 1, JANUARY 2010 32 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 45, NO. 1, JANUARY 2010 A 201.4 GOPS 496 mw Real-Time Multi-Object Recognition Processor With Bio-Inspired Neural Perception Engine Joo-Young Kim, Student

More information

Optimization of Multi-Channel BCH Error Decoding for Common Cases. Russell Dill Master's Thesis Defense April 20, 2015

Optimization of Multi-Channel BCH Error Decoding for Common Cases. Russell Dill Master's Thesis Defense April 20, 2015 Optimization of Multi-Channel BCH Error Decoding for Common Cases Russell Dill Master's Thesis Defense April 20, 2015 Bose-Chaudhuri-Hocquenghem (BCH) BCH is an Error Correcting Code (ECC) and is used

More information

A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System

A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System A High-Performance Parallel CAVLC Encoder on a Fine-Grained Many-core System Zhibin Xiao and Bevan M. Baas VLSI Computation Lab, ECE Department University of California, Davis Outline Introduction to H.264

More information

Low Power Design: From Soup to Nuts. Tutorial Outline

Low Power Design: From Soup to Nuts. Tutorial Outline Low Power Design: From Soup to Nuts Mary Jane Irwin and Vijay Narayanan Dept of CSE, Microsystems Design Lab Penn State University (www.cse.psu.edu/~mdl) ISCA Tutorial: Low Power Design Introduction.1

More information

Mauricio Álvarez-Mesa ; Chi Ching Chi ; Ben Juurlink ; Valeri George ; Thomas Schierl Parallel video decoding in the emerging HEVC standard

Mauricio Álvarez-Mesa ; Chi Ching Chi ; Ben Juurlink ; Valeri George ; Thomas Schierl Parallel video decoding in the emerging HEVC standard Mauricio Álvarez-Mesa ; Chi Ching Chi ; Ben Juurlink ; Valeri George ; Thomas Schierl Parallel video decoding in the emerging HEVC standard Conference object, Postprint version This version is available

More information

Day 21: Retiming Requirements. ESE534: Computer Organization. Relative Sizes. Today. State. State Size

Day 21: Retiming Requirements. ESE534: Computer Organization. Relative Sizes. Today. State. State Size ESE534: Computer Organization Day 22: November 16, 2016 Retiming 1 Day 21: Retiming Requirements Retiming requirement depends on parallelism and performance Even with a given amount of parallelism Will

More information

ESE (ESE534): Computer Organization. Last Time. Today. Last Time. Align Data / Balance Paths. Retiming in the Large

ESE (ESE534): Computer Organization. Last Time. Today. Last Time. Align Data / Balance Paths. Retiming in the Large ESE680-002 (ESE534): Computer Organization Day 20: March 28, 2007 Retiming 2: Structures and Balance Last Time Saw how to formulate and automate retiming: start with network calculate minimum achievable

More information

Motion Compensation Hardware Accelerator Architecture for H.264/AVC

Motion Compensation Hardware Accelerator Architecture for H.264/AVC Motion Compensation Hardware Accelerator Architecture for H.264/AVC Bruno Zatt 1, Valter Ferreira 1, Luciano Agostini 2, Flávio R. Wagner 1, Altamiro Susin 3, and Sergio Bampi 1 1 Informatics Institute

More information

SoC IC Basics. COE838: Systems on Chip Design

SoC IC Basics. COE838: Systems on Chip Design SoC IC Basics COE838: Systems on Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University Overview SoC

More information

A Design Of A Low Power Delay Buffer Using Ring Counter Addressing Schemes

A Design Of A Low Power Delay Buffer Using Ring Counter Addressing Schemes A Design Of A Low Power Delay Buffer Using Ring Counter Addressing Schemes B.R.B Jaswanth St.Theresa Institute of Engineering and Technology Gudiwada, India Abstract This work presents circuit design of

More information

AN EFFECTIVE CACHE FOR THE ANYWHERE PIXEL ROUTER

AN EFFECTIVE CACHE FOR THE ANYWHERE PIXEL ROUTER University of Kentucky UKnowledge Theses and Dissertations--Electrical and Computer Engineering Electrical and Computer Engineering 2007 AN EFFECTIVE CACHE FOR THE ANYWHERE PIXEL ROUTER Vijai Raghunathan

More information

L12: Reconfigurable Logic Architectures

L12: Reconfigurable Logic Architectures L12: Reconfigurable Logic Architectures Acknowledgements: Materials in this lecture are courtesy of the following sources and are used with permission. Frank Honore Prof. Randy Katz (Unified Microelectronics

More information

Jun-Hao Zheng et al.: An Efficient VLSI Architecture for MC of AVS HDTV Decoder 371 ture for MC which contains a three-stage pipeline. The hardware ar

Jun-Hao Zheng et al.: An Efficient VLSI Architecture for MC of AVS HDTV Decoder 371 ture for MC which contains a three-stage pipeline. The hardware ar May 2006, Vol.21, No.3, pp.370 377 J. Comput. Sci. & Technol. An Efficient VLSI Architecture for Motion Compensation of AVS HDTV Decoder Jun-Hao Zheng 1;3 (ΨΞ ), Lei Deng 2 ( Π), Peng Zhang 1;3 (Φ ±),

More information

The main design objective in adder design are area, speed and power. Carry Select Adder (CSLA) is one of the fastest

The main design objective in adder design are area, speed and power. Carry Select Adder (CSLA) is one of the fastest ISSN: 0975-766X CODEN: IJPTFI Available Online through Research Article www.ijptonline.com IMPLEMENTATION OF FAST SQUARE ROOT SELECT WITH LOW POWER CONSUMPTION V.Elanangai*, Dr. K.Vasanth Department of

More information

Low-Power Techniques for Video Decoding. Daniel Frederic Finchelstein

Low-Power Techniques for Video Decoding. Daniel Frederic Finchelstein Low-Power Techniques for Video Decoding by Daniel Frederic Finchelstein Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree

More information

Amdahl s Law in the Multicore Era

Amdahl s Law in the Multicore Era Amdahl s Law in the Multicore Era Mark D. Hill and Michael R. Marty University of Wisconsin Madison August 2008 @ Semiahmoo Workshop IBM s Dr. Thomas Puzak: Everyone knows Amdahl s Law 2008 Multifacet

More information

IEEE Santa Clara ComSoc/CAS Weekend Workshop Event-based analog sensing

IEEE Santa Clara ComSoc/CAS Weekend Workshop Event-based analog sensing IEEE Santa Clara ComSoc/CAS Weekend Workshop Event-based analog sensing Theodore Yu theodore.yu@ti.com Texas Instruments Kilby Labs, Silicon Valley Labs September 29, 2012 1 Living in an analog world The

More information

Computer and Machine Vision

Computer and Machine Vision Computer and Machine Vision Lecture Week 3 Part-1 January 27, 2014 Sam Siewert Outline of Week 3 Processing Images and Moving Pictures High Level View and Computer Architecture for it Linux Platforms for

More information

L11/12: Reconfigurable Logic Architectures

L11/12: Reconfigurable Logic Architectures L11/12: Reconfigurable Logic Architectures Acknowledgements: Materials in this lecture are courtesy of the following people and used with permission. - Randy H. Katz (University of California, Berkeley,

More information

GPU Acceleration of a Production Molecular Docking Code

GPU Acceleration of a Production Molecular Docking Code GPU Acceleration of a Production Molecular Docking Code Bharat Sukhwani Martin Herbordt Computer Architecture and Automated Design Laboratory Department of Electrical and Computer Engineering Boston University

More information

FPGA Implementation of Low Power Self Testable MIPS Processor

FPGA Implementation of Low Power Self Testable MIPS Processor American-Eurasian Journal of Scientific Research 12 (3): 135-144, 2017 ISSN 1818-6785 IDOSI Publications, 2017 DOI: 10.5829/idosi.aejsr.2017.135.144 FPGA Implementation of Low Power Self Testable MIPS

More information

Bubble Razor An Architecture-Independent Approach to Timing-Error Detection and Correction

Bubble Razor An Architecture-Independent Approach to Timing-Error Detection and Correction 1 Bubble Razor An Architecture-Independent Approach to Timing-Error Detection and Correction Matthew Fojtik, David Fick, Yejoong Kim, Nathaniel Pinckney, David Harris, David Blaauw, Dennis Sylvester mfojtik@umich.edu

More information

Comp 410/510. Computer Graphics Spring Introduction to Graphics Systems

Comp 410/510. Computer Graphics Spring Introduction to Graphics Systems Comp 410/510 Computer Graphics Spring 2018 Introduction to Graphics Systems Computer Graphics Computer graphics deals with all aspects of 'creating images with a computer - Hardware (PC with graphics card)

More information

PRACE Autumn School GPU Programming

PRACE Autumn School GPU Programming PRACE Autumn School 2010 GPU Programming October 25-29, 2010 PRACE Autumn School, Oct 2010 1 Outline GPU Programming Track Tuesday 26th GPGPU: General-purpose GPU Programming CUDA Architecture, Threading

More information

Methodology. Nitin Chawla,Harvinder Singh & Pascal Urard. STMicroelectronics

Methodology. Nitin Chawla,Harvinder Singh & Pascal Urard. STMicroelectronics An Algorithm to Silicon ESL Design Methodology Nitin Chawla,Harvinder Singh & Pascal Urard STMicroelectronics SOC Design Challenges:Increased Complexity 992 994 996 998 2 22 24 26 28 2.7.5.35.25.8.3 9

More information

OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0. General Description. Applications. Features

OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0. General Description. Applications. Features OL_H264MCLD Multi-Channel HDTV H.264/AVC Limited Baseline Video Decoder V1.0 General Description Applications Features The OL_H264MCLD core is a hardware implementation of the H.264 baseline video compression

More information

Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Low Power VLSI Circuits and Systems Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No. # 29 Minimizing Switched Capacitance-III. (Refer

More information

An FPGA Implementation of Shift Register Using Pulsed Latches

An FPGA Implementation of Shift Register Using Pulsed Latches An FPGA Implementation of Shift Register Using Pulsed Latches Shiny Panimalar.S, T.Nisha Priscilla, Associate Professor, Department of ECE, MAMCET, Tiruchirappalli, India PG Scholar, Department of ECE,

More information

Design Challenge of a QuadHDTV Video Decoder

Design Challenge of a QuadHDTV Video Decoder Design Challenge of a QuadHDTV Video Decoder Youn-Long Lin Department of Computer Science National Tsing Hua University MPSOC27, Japan More Pixels YLLIN NTHU-CS 2 NHK Proposes UHD TV Broadcast Super HiVision

More information

Parallelization of Multimedia Applications by Compiler on Multicores for Consumer Electronics

Parallelization of Multimedia Applications by Compiler on Multicores for Consumer Electronics Vol. 0 No. 0 1959 TV MPEG2 MP3 JPEG 2000 OSCAR API VLIW 4 FR1000 SH-4A 4 RP1 FR1000 4 1 4 3.27 RP1 4 1 4 3.31 Parallelization of Multimedia Applications by Compiler on Multicores for Consumer Electronics

More information

The Multistandard Full Hd Video-Codec Engine On Low Power Devices

The Multistandard Full Hd Video-Codec Engine On Low Power Devices The Multistandard Full Hd Video-Codec Engine On Low Power Devices B.Susma (M. Tech). Embedded Systems. Aurora s Technological & Research Institute. Hyderabad. B.Srinivas Asst. professor. ECE, Aurora s

More information

Reconfigurable FPGA Implementation of FIR Filter using Modified DA Method

Reconfigurable FPGA Implementation of FIR Filter using Modified DA Method Reconfigurable FPGA Implementation of FIR Filter using Modified DA Method M. Backia Lakshmi 1, D. Sellathambi 2 1 PG Student, Department of Electronics and Communication Engineering, Parisutham Institute

More information

Power Efficient Architectures to Accelerate Deep Convolutional Neural Networks for edge computing and IoT

Power Efficient Architectures to Accelerate Deep Convolutional Neural Networks for edge computing and IoT Power Efficient Architectures to Accelerate Deep Convolutional Neural Networks for edge computing and IoT Giuseppe Desoli ST Central Labs STMicroelectronics Artificial Intelligence is Everywhere 2 Analysis,

More information

Objectives. Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath

Objectives. Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath Objectives Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath In the previous chapters we have studied how to develop a specification from a given application, and

More information

USING FUSION SYSTEM ARCHITECTURE FOR BROADCAST VIDEO. Edward Callway AMD

USING FUSION SYSTEM ARCHITECTURE FOR BROADCAST VIDEO. Edward Callway AMD USING FUSION SYSTEM ARCHITECTURE FOR BROADCAST VIDEO Edward Callway AMD USING PC COMPONENTS FOR BROADCAST VIDEO Video processing from pure analog to digital compute PC Design for video Parallel GPU computing

More information

Exploring Architecture Parameters for Dual-Output LUT based FPGAs

Exploring Architecture Parameters for Dual-Output LUT based FPGAs Exploring Architecture Parameters for Dual-Output LUT based FPGAs Zhenghong Jiang, Colin Yu Lin, Liqun Yang, Fei Wang and Haigang Yang System on Programmable Chip Research Department, Institute of Electronics,

More information

RedEye Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision

RedEye Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision Robert LiKamWa Yunhui Hou Yuan Gao Mia Polansky Lin Zhong roblkw@rice.edu houyh@rice.edu yg18@rice.edu mia.polansky@rice.edu lzhong@rice.edu

More information

ESE534: Computer Organization. Today. Image Processing. Retiming Demand. Preclass 2. Preclass 2. Retiming Demand. Day 21: April 14, 2014 Retiming

ESE534: Computer Organization. Today. Image Processing. Retiming Demand. Preclass 2. Preclass 2. Retiming Demand. Day 21: April 14, 2014 Retiming ESE534: Computer Organization Today Retiming Demand Folded Computation Day 21: April 14, 2014 Retiming Logical Pipelining Physical Pipelining Retiming Supply Technology Structures Hierarchy 1 2 Image Processing

More information

Vlsi Digital Signal Processing Systems Design And Implementation

Vlsi Digital Signal Processing Systems Design And Implementation Vlsi Digital Signal Processing Systems Design And Implementation We have made it easy for you to find a PDF Ebooks without any digging. And by having access to our ebooks online or by storing it on your

More information

FLIP-FLOPS and latches, which we collectively refer to as

FLIP-FLOPS and latches, which we collectively refer to as 1294 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 8, AUGUST 2004 A Test Circuit for Measurement of Clocked Storage Element Characteristics Nikola Nedovic, Member, IEEE, William W. Walker, Member,

More information

TODAY computer vision technologies are used with great

TODAY computer vision technologies are used with great ARXIV PREPRINT 1 Origami: A 803 GOp/s/W Convolutional Network Accelerator Lukas Cavigelli, Student Member, IEEE, and Luca Benini, Fellow, IEEE arxiv:1512.04295v2 [cs.cv] 19 Jan 2016 Abstract An ever increasing

More information

Inside Digital Design Accompany Lab Manual

Inside Digital Design Accompany Lab Manual 1 Inside Digital Design, Accompany Lab Manual Inside Digital Design Accompany Lab Manual Simulation Prototyping Synthesis and Post Synthesis Name- Roll Number- Total/Obtained Marks- Instructor Signature-

More information

Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan

Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan Virginia Polytechnic Institute and State University Reverse-engineer the brain National

More information

Hybrid Discrete-Continuous Computer Architectures for Post-Moore s-law Era

Hybrid Discrete-Continuous Computer Architectures for Post-Moore s-law Era Hybrid Discrete-Continuous Computer Architectures for Post-Moore s-law Era Keynote at the Bi annual HiPEAC Compu6ng Systems Week Mee6ng Barcelona, Spain October 19 th 2010 Prof. Simha Sethumadhavan Columbia

More information

Scalable Lossless High Definition Image Coding on Multicore Platforms

Scalable Lossless High Definition Image Coding on Multicore Platforms Scalable Lossless High Definition Image Coding on Multicore Platforms Shih-Wei Liao 2, Shih-Hao Hung 2, Chia-Heng Tu 1, and Jen-Hao Chen 2 1 Graduate Institute of Networking and Multimedia 2 Department

More information

How to Manage Video Frame- Processing Time Deviations in ASIC and SOC Video Processors

How to Manage Video Frame- Processing Time Deviations in ASIC and SOC Video Processors WHITE PAPER How to Manage Video Frame- Processing Time Deviations in ASIC and SOC Video Processors Some video frames take longer to process than others because of the nature of digital video compression.

More information

Hardware Implementation of Block GC3 Lossless Compression Algorithm for Direct-Write Lithography Systems

Hardware Implementation of Block GC3 Lossless Compression Algorithm for Direct-Write Lithography Systems Hardware Implementation of Block GC3 Lossless Compression Algorithm for Direct-Write Lithography Systems Hsin-I Liu, Brian Richards, Avideh Zakhor, and Borivoje Nikolic Dept. of Electrical Engineering

More information

Technical Note PowerPC Embedded Processors Video Security with PowerPC

Technical Note PowerPC Embedded Processors Video Security with PowerPC Introduction For many reasons, digital platforms are becoming increasingly popular for video security applications. In comparison to traditional analog support, a digital solution can more effectively

More information

ALICE Week Technical Board TPC Intelligent Readout Architecture. Volker Lindenstruth Universität Heidelberg

ALICE Week Technical Board TPC Intelligent Readout Architecture. Volker Lindenstruth Universität Heidelberg ALICE Week 17.11.99 Technical Board TPC Intelligent Readout Architecture Volker Lindenstruth Universität Heidelberg What s new?? TPC occuppancy is much higher than originally assumed New Trigger Detector

More information

Nutaq. PicoDigitizer-125. Up to 64 Channels, 125 MSPS ADCs, FPGA-based DAQ Solution With Up to 32 Channels, 1000 MSPS DACs PRODUCT SHEET. nutaq.

Nutaq. PicoDigitizer-125. Up to 64 Channels, 125 MSPS ADCs, FPGA-based DAQ Solution With Up to 32 Channels, 1000 MSPS DACs PRODUCT SHEET. nutaq. Nutaq Up to 64 Channels, 125 MSPS ADCs, FPGA-based DAQ Solution With Up to 32 Channels, 1000 MSPS DACs PRODUCT SHEET QUEBEC I MONTREAL I N E W YO R K I nutaq.com Nutaq The PicoDigitizer 125-Series is a

More information

Tutorial Outline. Typical Memory Hierarchy

Tutorial Outline. Typical Memory Hierarchy Tutorial Outline 8:30-8:45 8:45-9:05 9:05-9:30 9:30-10:30 10:30-10:50 10:50-12:15 12:15-1:30 1:30-2:30 2:30-3:30 3:30-3:50 3:50-4:30 4:30-4:45 Introduction and motivation Sources of power in CMOS designs

More information

SoC and SiP technology for digital consumer electronic systems

SoC and SiP technology for digital consumer electronic systems and SiP technology for digital consumer electronic systems Akira Matsuzawa Tokyo Institute of Technology 2005 06 20 A. Matsuzawa, Titech 1 Contents Digital consumer electronic systems and technology for

More information

Design and analysis of microcontroller system using AMBA- Lite bus

Design and analysis of microcontroller system using AMBA- Lite bus Design and analysis of microcontroller system using AMBA- Lite bus Wang Hang Suan 1,*, and Asral Bahari Jambek 1 1 School of Microelectronic Engineering, Universiti Malaysia Perlis, Perlis, Malaysia Abstract.

More information

FPGA Based Implementation of Convolutional Encoder- Viterbi Decoder Using Multiple Booting Technique

FPGA Based Implementation of Convolutional Encoder- Viterbi Decoder Using Multiple Booting Technique FPGA Based Implementation of Convolutional Encoder- Viterbi Decoder Using Multiple Booting Technique Dr. Dhafir A. Alneema (1) Yahya Taher Qassim (2) Lecturer Assistant Lecturer Computer Engineering Dept.

More information

Chi Ching Chi, Ben Juurlink A QHD-capable parallel H.264 decoder

Chi Ching Chi, Ben Juurlink A QHD-capable parallel H.264 decoder Powered by TCPDF (www.tcpdf.org) Chi Ching Chi, Ben Juurlink A QHD-capable parallel H.264 decoder Conference Object, Postprint version This version is available at http://dx.doi.org/1.14279/depositonce-634

More information

Layout Decompression Chip for Maskless Lithography

Layout Decompression Chip for Maskless Lithography Layout Decompression Chip for Maskless Lithography Borivoje Nikolić, Ben Wild, Vito Dai, Yashesh Shroff, Benjamin Warlick, Avideh Zakhor, William G. Oldham Department of Electrical Engineering and Computer

More information

IMPLEMENTATION OF X-FACTOR CIRCUITRY IN DECOMPRESSOR ARCHITECTURE

IMPLEMENTATION OF X-FACTOR CIRCUITRY IN DECOMPRESSOR ARCHITECTURE IMPLEMENTATION OF X-FACTOR CIRCUITRY IN DECOMPRESSOR ARCHITECTURE SATHISHKUMAR.K #1, SARAVANAN.S #2, VIJAYSAI. R #3 School of Computing, M.Tech VLSI design, SASTRA University Thanjavur, Tamil Nadu, 613401,

More information

Research Article. Implementation of Low Power, Delay and Area Efficient Shifters for Memory Based Computation

Research Article. Implementation of Low Power, Delay and Area Efficient Shifters for Memory Based Computation International Journal of Modern Science and Technology Vol. 2, No. 5, 2017. Page 217-222. http://www.ijmst.co/ ISSN: 2456-0235. Research Article Implementation of Low Power, Delay and Area Efficient Shifters

More information

Enhancing Performance in Multiple Execution Unit Architecture using Tomasulo Algorithm

Enhancing Performance in Multiple Execution Unit Architecture using Tomasulo Algorithm Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

Clock Tree Power Optimization of Three Dimensional VLSI System with Network

Clock Tree Power Optimization of Three Dimensional VLSI System with Network Clock Tree Power Optimization of Three Dimensional VLSI System with Network M.Saranya 1, S.Mahalakshmi 2, P.Saranya Devi 3 PG Student, Dept. of ECE, Syed Ammal Engineering College, Ramanathapuram, Tamilnadu,

More information

MULTI-CORE SOFTWARE ARCHITECTURE FOR THE SCALABLE HEVC DECODER. Wassim Hamidouche, Mickael Raulet and Olivier Déforges

MULTI-CORE SOFTWARE ARCHITECTURE FOR THE SCALABLE HEVC DECODER. Wassim Hamidouche, Mickael Raulet and Olivier Déforges 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) MULTI-CORE SOFTWARE ARCHITECTURE FOR THE SCALABLE HEVC DECODER Wassim Hamidouche, Mickael Raulet and Olivier Déforges

More information

Retiming Sequential Circuits for Low Power

Retiming Sequential Circuits for Low Power Retiming Sequential Circuits for Low Power José Monteiro, Srinivas Devadas Department of EECS MIT, Cambridge, MA Abhijit Ghosh Mitsubishi Electric Research Laboratories Sunnyvale, CA Abstract Switching

More information

Logic Analysis Basics

Logic Analysis Basics Logic Analysis Basics September 27, 2006 presented by: Alex Dickson Copyright 2003 Agilent Technologies, Inc. Introduction If you have ever asked yourself these questions: What is a logic analyzer? What

More information

Tomasulo Algorithm Based Out of Order Execution Processor

Tomasulo Algorithm Based Out of Order Execution Processor Tomasulo Algorithm Based Out of Order Execution Processor Bhavana P.Shrivastava MAaulana Azad National Institute of Technology, Department of Electronics and Communication ABSTRACT In this research work,

More information

Logic Analysis Basics

Logic Analysis Basics Logic Analysis Basics September 27, 2006 presented by: Alex Dickson Copyright 2003 Agilent Technologies, Inc. Introduction If you have ever asked yourself these questions: What is a logic analyzer? What

More information

High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures

High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures 46 H. Y. SU, M. WEN, J. REN, N. WU, J. CHAI, C.Y. ZHANG, HIGH-EFFICIENT PARALLEL CAVLC ENCODER High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures Huayou SU, Mei WEN, Ju REN,

More information

California State University, Bakersfield Computer & Electrical Engineering & Computer Science ECE 3220: Digital Design with VHDL Laboratory 7

California State University, Bakersfield Computer & Electrical Engineering & Computer Science ECE 3220: Digital Design with VHDL Laboratory 7 California State University, Bakersfield Computer & Electrical Engineering & Computer Science ECE 322: Digital Design with VHDL Laboratory 7 Rational: The purpose of this lab is to become familiar in using

More information

Impact of Intermittent Faults on Nanocomputing Devices

Impact of Intermittent Faults on Nanocomputing Devices Impact of Intermittent Faults on Nanocomputing Devices Cristian Constantinescu June 28th, 2007 Dependable Systems and Networks Outline Fault classes Permanent faults Transient faults Intermittent faults

More information

Performance Evolution of 16 Bit Processor in FPGA using State Encoding Techniques

Performance Evolution of 16 Bit Processor in FPGA using State Encoding Techniques Performance Evolution of 16 Bit Processor in FPGA using State Encoding Techniques Madhavi Anupoju 1, M. Sunil Prakash 2 1 M.Tech (VLSI) Student, Department of Electronics & Communication Engineering, MVGR

More information

Built-In Proactive Tuning System for Circuit Aging Resilience

Built-In Proactive Tuning System for Circuit Aging Resilience IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems Built-In Proactive Tuning System for Circuit Aging Resilience Nimay Shah 1, Rupak Samanta 1, Ming Zhang 2, Jiang Hu 1, Duncan

More information

THE Collider Detector at Fermilab (CDF) [1] is a general

THE Collider Detector at Fermilab (CDF) [1] is a general The Level-3 Trigger at the CDF Experiment at Tevatron Run II Y.S. Chung 1, G. De Lentdecker 1, S. Demers 1,B.Y.Han 1, B. Kilminster 1,J.Lee 1, K. McFarland 1, A. Vaiciulis 1, F. Azfar 2,T.Huffman 2,T.Akimoto

More information

Large Area, High Speed Photo-detectors Readout

Large Area, High Speed Photo-detectors Readout Large Area, High Speed Photo-detectors Readout Jean-Francois Genat + On behalf and with the help of Herve Grabas +, Samuel Meehan +, Eric Oberla +, Fukun Tang +, Gary Varner ++, and Henry Frisch + + University

More information

Timing EECS141 EE141. EE141-Fall 2011 Digital Integrated Circuits. Pipelining. Administrative Stuff. Last Lecture. Latch-Based Clocking.

Timing EECS141 EE141. EE141-Fall 2011 Digital Integrated Circuits. Pipelining. Administrative Stuff. Last Lecture. Latch-Based Clocking. EE141-Fall 2011 Digital Integrated Circuits Lecture 2 Clock, I/O Timing 1 4 Administrative Stuff Pipelining Project Phase 4 due on Monday, Nov. 21, 10am Homework 9 Due Thursday, December 1 Visit to Intel

More information

Optical clock distribution for a more efficient use of DRAMs

Optical clock distribution for a more efficient use of DRAMs Optical clock distribution for a more efficient use of DRAMs D. Litaize, M.P.Y. Desmulliez*, J. Collet**, P. Foulk* Institut de Recherche en Informatique de Toulouse (IRIT), Universite Paul Sabatier, 31062

More information

New Directions in Manufacturing Test

New Directions in Manufacturing Test New Directions in Manufacturing Test Jacob A. Abraham Computer Engineering Research Center The University of Texas at Austin Shanghai Jiao Tong University July 19, 2005 July 19, 2005 1 Research Areas Manufacturing

More information

On the HPDP from architecture to a device. Final Presentation Days ESTEC, May 9 th 2017

On the HPDP from architecture to a device. Final Presentation Days ESTEC, May 9 th 2017 On the HPDP from architecture to a device Final Presentation Days ESTEC, May 9 th 2017 Outline Introduction HPDP Architecture Top Design Comparisons Target applications Design Flow Operating Environment

More information

As for Dec, BSK 2015 Business Line Cards

As for Dec, BSK 2015 Business Line Cards As for Dec, 2014 BSK 2015 Business Line Cards Bigshine Korea History 1989 ~ 1999 1999 Bigshine Chicago 현지법인설립 1998 NexFlash 사와 Agency Contract InNet 사와 Agency Contract 1997 Bigshine Korea Co., Ltd. 상호변경

More information

FOR MULTIMEDIA mobile systems powered by a battery

FOR MULTIMEDIA mobile systems powered by a battery IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 7, NO. 1, FEBRUARY 2005 67 ITRON-LP: Power-Conscious Real-Time OS Based on Cooperative Voltage Scaling for Multimedia Applications Hiroshi Kawaguchi, Member, IEEE,

More information

Certus TM Silicon Debug: Don t Prototype Without It by Doug Amos, Mentor Graphics

Certus TM Silicon Debug: Don t Prototype Without It by Doug Amos, Mentor Graphics Certus TM Silicon Debug: Don t Prototype Without It by Doug Amos, Mentor Graphics FPGA PROTOTYPE RUNNING NOW WHAT? Well done team; we ve managed to get 100 s of millions of gates of FPGA-hostile RTL running

More information

A Novel Macroblock-Level Filtering Upsampling Architecture for H.264/AVC Scalable Extension

A Novel Macroblock-Level Filtering Upsampling Architecture for H.264/AVC Scalable Extension 05-Silva-AF:05-Silva-AF 8/19/11 6:18 AM Page 43 A Novel Macroblock-Level Filtering Upsampling Architecture for H.264/AVC Scalable Extension T. L. da Silva 1, L. A. S. Cruz 2, and L. V. Agostini 3 1 Telecommunications

More information

A Novel VLSI Architecture of Motion Compensation for Multiple Standards

A Novel VLSI Architecture of Motion Compensation for Multiple Standards A Novel VLSI Architecture of Motion Compensation for Multiple Standards Junhao Zheng, Wen Gao, Senior Member, IEEE, David Wu, and Don Xie Abstract Motion compensation (MC) is one of the most important

More information

Chapter 1. Introduction to Digital Signal Processing

Chapter 1. Introduction to Digital Signal Processing Chapter 1 Introduction to Digital Signal Processing 1. Introduction Signal processing is a discipline concerned with the acquisition, representation, manipulation, and transformation of signals required

More information

LOW POWER AND HIGH PERFORMANCE SHIFT REGISTERS USING PULSED LATCH TECHNIQUE

LOW POWER AND HIGH PERFORMANCE SHIFT REGISTERS USING PULSED LATCH TECHNIQUE OI: 10.21917/ijme.2018.0088 LOW POWER AN HIGH PERFORMANCE SHIFT REGISTERS USING PULSE LATCH TECHNIUE Vandana Niranjan epartment of Electronics and Communication Engineering, Indira Gandhi elhi Technical

More information

Low Power Design of the Next-Generation High Efficiency Video Coding

Low Power Design of the Next-Generation High Efficiency Video Coding Low Power Design of the Next-Generation High Efficiency Video Coding Authors: Muhammad Shafique, Jörg Henkel CES Chair for Embedded Systems Outline Introduction to the High Efficiency Video Coding (HEVC)

More information

SRAM Based Random Number Generator For Non-Repeating Pattern Generation

SRAM Based Random Number Generator For Non-Repeating Pattern Generation Applied Mechanics and Materials Online: 2014-06-18 ISSN: 1662-7482, Vol. 573, pp 181-186 doi:10.4028/www.scientific.net/amm.573.181 2014 Trans Tech Publications, Switzerland SRAM Based Random Number Generator

More information

THE new video coding standard H.264/AVC [1] significantly

THE new video coding standard H.264/AVC [1] significantly 832 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 9, SEPTEMBER 2006 Architecture Design of Context-Based Adaptive Variable-Length Coding for H.264/AVC Tung-Chien Chen, Yu-Wen

More information

Frame Processing Time Deviations in Video Processors

Frame Processing Time Deviations in Video Processors Tensilica White Paper Frame Processing Time Deviations in Video Processors May, 2008 1 Executive Summary Chips are increasingly made with processor designs licensed as semiconductor IP (intellectual property).

More information

MPEG decoder Case. K.A. Vissers UC Berkeley Chamleon Systems Inc. and Pieter van der Wolf. Philips Research Eindhoven, The Netherlands

MPEG decoder Case. K.A. Vissers UC Berkeley Chamleon Systems Inc. and Pieter van der Wolf. Philips Research Eindhoven, The Netherlands MPEG decoder Case K.A. Vissers UC Berkeley Chamleon Systems Inc. and Pieter van der Wolf Philips Research Eindhoven, The Netherlands 1 Outline Introduction Consumer Electronics Kahn Process Networks Revisited

More information