Hybrid Discrete-Continuous Computer Architectures for Post-Moore's-Law Era


Hybrid Discrete-Continuous Computer Architectures for Post-Moore's-Law Era. Keynote at the biannual HiPEAC Computing Systems Week Meeting, Barcelona, Spain, October 19th, 2010. Prof. Simha Sethumadhavan, Columbia University, www.cs.columbia.edu/~simha

The FUTURE depends on you!

Executive Summary. Applications demand more: scientific and societal progress depends on better computers. Silicon scaling is slowing down, and energy usage is more important than hardware cost. The changing landscape demands new solutions: multi-cores and accelerators will not scale. A solution: analog-digital computing, a great match for emerging applications. New and improved old technology!

What is Computing? Diagram: P → M → E → G, where P is a problem that needs a solution. 1. Prepare a model of the problem to be solved. 2. Prepare an executable specification. 3. Execute to obtain results.

Digital Computing. Real-world inputs i_r; digital approximation of inputs i_d; outputs o_d; error: difference between O_r and O_d. Diagram: P → M_d → E_d → G_d, where P is a problem that needs a solution. 1. Prepare a model of the problem to be solved (programming language). 2. Prepare an executable specification (compiler). 3. Execute to obtain results (architecture, microarchitecture).

Multicore Computing. Real-world inputs i_r; digital approximation of inputs i_d; outputs o_p; error: difference between O_r and O_p. Diagram: P → M_p → E_p → G_p. Stack: parallel programming language, parallel compiler, parallel architecture, parallel microarchitecture. Multicore promise: reduce power and at the same time increase throughput. Reality: low power and throughput. Serial regions limit throughput. Voltage scaling limits on-chip switching. Die and wafer sizes are not growing, so the number of cores cannot increase. Multicores have merely postponed the power wall.

Accelerator Computing. Real-world inputs i_r; digital approximation of inputs i_d; outputs o_a; error: difference between O_r and O_a. Diagram: P → M_a → E_a → G_a. Stack: programming language, accelerator compiler, accelerator architecture. Accelerators map algorithms on to hardware, e.g., H.264 [Hameed et al., ISCA '10]. They improve energy efficiency through low or no microarchitectural overheads, although some die area can go unused. Further energy efficiency improvements are difficult.

Next energy efficiency leap? Traditional approach: retrofitting model to implementation. Diagram: P → M → E → G, where P is a problem that needs a solution. 1. Prepare a model of the problem to be solved. 2. Prepare an executable specification. 3. Execute to obtain results. ATTACK FUNDAMENTAL ALGORITHMIC OVERHEADS.

Attempt 1: Au naturel Computing. Real-world inputs i_r; reduced inputs i_d. Diagram: P → P. Example: hurricane in a bottle. Fast, but au naturel is also unnatural for computer scientists, although this is what physical scientists do. Great if you can simulate the problem on a reduced scale.

Improvement: Continuous Computing. Use the mathematical formulations used by scientists: differential equations, neural networks, etc. Real-world inputs i_r; reduced real-world inputs i_c (no discretization); outputs o_c; error: difference between O_r and O_c. Diagram: P → M_c → E_c → G_c. No time discretization.

Example Continuous Computer: the Linear Differential Analyzer, http://web.mit.edu/klund/www/analyzer/ Problems: limited accuracy, not a practical general-purpose machine.

The HYBRID Discrete-Continuous Model. Combine discrete and continuous models: a more natural fit for computing. Discrete problems run on a discrete computer (e.g., FSMs); continuous problems run on a continuous computer (e.g., differential equations). Better for programmability, efficiency and accuracy. Real-world inputs i_r; digital approximation of inputs i_d; outputs o_h; error: difference between O_r and O_h. Diagram: P → M_h → E_h → G_h, with an HDCCA programming language and hardware support for continuous computation.

HDCCA Research. Diagram: P → M_h → E_h → G_h. Open layers: programming language? Compiler? Architecture? Microarchitecture? What should the HDCCA computing stack be?

Introduction to HDCCA: Outline. A simple end-to-end example (mini-tutorial). Mini-tutorial goals: solidify understanding of the differences between discrete and continuous computing by studying differential equations; show a continuous implementation with analog hardware. Analog old vs. analog new. Research challenges for HDCCA. Tangible benefits.

A Simple End-to-End Example: Damped Harmonic Motion. A toy, textbook example: a spring with spring constant k is attached to a mass m and acted on by an external force F; y(t) is the position of the mass. We are trying to determine the position at time T. The position is given by the solution of the system's equation of motion.
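
For reference, the textbook equation of motion for a damped, driven mass-spring system can be sketched as follows. The damping coefficient b is an assumption here: it does not appear on this slide and is taken from the b/m term on the continuous-solution slide.

```latex
m\,\ddot{y}(t) + b\,\dot{y}(t) + k\,y(t) = F(t)
```

Here y(t) is the position of the mass, k the spring constant and F the external force, as named on the slide.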

Discrete Solution. Step 1: Write the equations in matrix form. Step 2: Compute the values at time h from the initial values. Steps 3 to n: Compute the values at time 2h from the values at h, and iterate until the result converges to within some error bound.
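
A minimal sketch of these steps, assuming the standard damped mass-spring equation m*y'' + b*y' + k*y = F with a constant force and a forward-Euler update. The function name, the damping coefficient b, the parameter values and the fixed time horizon T (used in place of an error-bound test) are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def euler_damped_oscillator(m, b, k, F, T, h, y0=0.0, v0=0.0):
    """Forward-Euler estimate of y(T) for m*y'' + b*y' + k*y = F (hypothetical helper)."""
    # Step 1: matrix form x' = A x + u with state x = [y, y'].
    A = np.array([[0.0, 1.0],
                  [-k / m, -b / m]])
    u = np.array([0.0, F / m])
    x = np.array([y0, v0], dtype=float)
    # Steps 2..n: advance from the initial values one step of size h at a time.
    for _ in range(int(round(T / h))):
        x = x + h * (A @ x + u)
    return x[0]

if __name__ == "__main__":
    # Assumed demo values; the position should settle near F/k.
    print(euler_damped_oscillator(m=1.0, b=1.0, k=4.0, F=4.0, T=20.0, h=1e-3))
```

Every output value costs T/h sequential updates, which is the discrete time stepping that the continuous approach avoids.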

Continuous Solution. Integrate twice using op-amps. Scale down the R and C values based on 1/m, b/m and k/m; this determines the solution time. Figures: analog circuit for integration; analog hardware circuit.
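
As a sketch of why two integrators suffice, assume the standard damped mass-spring equation with damping coefficient b and an ideal inverting op-amp integrator with input resistor R and feedback capacitor C (the idealized circuit is an assumption; the slide's actual schematic is not reproduced here). Solving for the highest derivative gives the signal to integrate, and each op-amp stage integrates its input once:

```latex
\ddot{y} = \frac{1}{m}F - \frac{b}{m}\dot{y} - \frac{k}{m}y,
\qquad
V_{\mathrm{out}}(t) = -\frac{1}{RC}\int_0^t V_{\mathrm{in}}(\tau)\,d\tau + V_{\mathrm{out}}(0)
```

Feeding the sum on the right-hand side of the first equation into one integrator yields a voltage proportional to the velocity, and a second integrator yields the position. The gains 1/m, b/m and k/m are set through the R and C values, which also fix how fast the circuit settles.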

Comparison of Discrete vs. Continuous. Both produce approximate answers. Measured: time to arrive within ±2.5% of the analytical value. The speedups are due to the fact that we avoided discrete time stepping.

Introduction to HDCCA: Outline. A simple end-to-end example. Analog old vs. analog new: we are using analog to implement the continuous model. But isn't analog dead? Understand why digital superseded analog in the 70s; show that the critical barriers to analog have been solved. Research challenges for HDCCA. Tangible benefits.

Old Vs. New: Accuracy. Old analog was susceptible to noise and error: one cannot get more than 11-12 good bits. New analog devices are not much better, but the application landscape has changed. #1: Many modern apps do not need high accuracy (graphics, optimization). #2: Many modern apps are error-tolerant (games, learning). #3: For high-accuracy applications HDCCA is useful: refine approximate analog values using a digital solver!
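
One way to picture point #3 is iterative refinement of a linear system: a low-precision solver proposes corrections and a digital loop recomputes the residual exactly. This is only a sketch under stated assumptions. The analog solver is simulated in software as an exact solve corrupted to roughly 11 good bits, and the function names and the choice of a linear system are illustrative, not taken from the talk.

```python
import numpy as np

def analog_solve(A, r, rng, bits=11):
    """Simulated analog solver (assumption): exact answer with ~`bits` good bits."""
    exact = np.linalg.solve(A, r)
    noise = rng.uniform(-1.0, 1.0, size=exact.shape) * 2.0 ** (-bits)
    return exact * (1.0 + noise)

def refine(A, b, iterations=8, seed=0):
    """Digital refinement loop around the approximate solver (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = np.zeros_like(b, dtype=float)
    for _ in range(iterations):
        r = b - A @ x                     # residual computed digitally, in full precision
        x = x + analog_solve(A, r, rng)   # approximate correction from the "analog" side
    return x

if __name__ == "__main__":
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x = refine(A, b)
    print(x, np.max(np.abs(A @ x - b)))
```

Each pass shrinks the remaining error by about the relative accuracy of the approximate solver, so a few passes recover digital-level accuracy in this idealized setting.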

Old Vs. New: Design/Implementation. Old: lack of CAD tools and design methodology. New: still a black art/engineering (like parallel programming?!). Note that digital design complexity is approaching analog design complexity, but analog CAD is improving. A recent successful analog design: an off-chip co-processor by Cowan and Tsividis at Columbia, published at ISSCC in 2005.

Old Vs. New: Programmability. Old machines were large and clunky, e.g., the Scanimate graphics system [SIGGRAPH '98]. The most famous video produced by Scanimate: the Death Star.

Old Vs. New: Programmability. http://scanimate.zfx.com/scancpu.html

Old Vs. New: Programmability. Old: had to manually patch wires for programming. New: RC values can be programmed digitally, but more development is needed: compiler toolchain, programming language, development environment, etc.

Introduction to HDCCA: Outline. A simple end-to-end example. Analog old vs. analog new. Research challenges for HDCCA.

Research Challenges: Microarchitecture. Block diagram labels: digital chip, analog accelerator, cache, memory, D/A array, datapath array ×n, A/D array, digital control interface. How much on-chip area should be allocated to analog? What types of functional units should be included? Should the units be connected with circuit or packet switching? How many input/output channels to digital should be created? Should the datapath and ADCs be operated at different speeds?

Research Challenges: Architecture. Analog interfaces: configuration, calibration, compute, completion. Functionality: configure the datapath for a sub-problem; query processor state when computation is carried out; export the types of functional units available; decide when the output values should be sampled. What is the machine model? Should we use the accelerator as a slave or standalone? What are the semantics of the instructions?

Research Challenges: Compiler/PL. Stack: PL, compiler, architecture, microarchitecture. What should the continuous language primitives be? Do we need separate static and dynamic compilers?

Research Challenges: Algorithms. Stack: algorithm developer, PL, compiler, architecture, microarchitecture. How should algorithms be developed to decompose tasks? Can we come up with a formal theory of how errors should be handled?

Introduction to HDCCA: Outline. HDCCA mini-tutorial. Analog old vs. analog new. Research challenges for HDCCA. Tangible benefits of HDCCA: analysis of existing benchmark suites; mapping a non-straightforward problem on to HDCCA.

Analog Accelerator Utility. Examined three benchmark suites: SPEC CFP, Intel RMS and Berkeley Dwarfs. Categorized based on problem, not algorithm. 40% map to HDCCA. Categories: linear algebra, ODE, spectral methods, other domains.

Solving Linear Programming. Soplex in SPEC solves LP using simplex. It cannot be directly mapped on to the analog accelerator; an alternate formulation of the problem is needed. Objective function (linear). Decision variables (DVs), e.g., SPEC: ~920K DVs. Constraints, e.g., SPEC: ~2.5K constraints.

Analog Mapping. Solved using gradient descent instead of simplex. Generate a moving point P in the solution space. Move the point based on the objective function. When P is at a maximum or minimum, the gradient goes to zero and P is the solution. Also works for non-linear constraints!
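
The moving-point idea can be sketched digitally as gradient descent on a penalized objective. The talk does not say how constraints are enforced, so the quadratic penalty, the step size, the penalty weight mu and the tiny example LP below are all assumptions chosen for illustration. An analog implementation would integrate the same gradient continuously rather than in steps.

```python
import numpy as np

def lp_gradient_flow(c, A, b, mu=100.0, step=5e-4, iterations=20000):
    """Approximately maximise c.x subject to A x <= b, x >= 0 (hypothetical helper).

    Constraints are handled with a quadratic penalty (an assumption), so the
    result is slightly infeasible, by an amount that shrinks as mu grows.
    """
    x = np.zeros_like(c, dtype=float)          # the moving point P
    for _ in range(iterations):
        over = np.maximum(A @ x - b, 0.0)      # violation of A x <= b
        neg = np.maximum(-x, 0.0)              # violation of x >= 0
        grad = -c + mu * (A.T @ over) - mu * neg
        x = x - step * grad                    # move P against the gradient
    return x

if __name__ == "__main__":
    # Assumed toy problem: maximise x + y with x + 2y <= 4, 3x + y <= 6, x, y >= 0.
    c = np.array([1.0, 1.0])
    A = np.array([[1.0, 2.0], [3.0, 1.0]])
    b = np.array([4.0, 6.0])
    print(lp_gradient_flow(c, A, b))
```

For this toy problem the exact optimum is the vertex (1.6, 1.2); the penalized point stops very close to it, once the pull of the objective balances the penalty gradient.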

Related Work. Several theses, 1942

CONCLUDING REMARKS

Philosophy of Engineering Innovations. CYCLIC THEORY OF INVENTION vs. LINEAR THEORY OF INVENTION. Use models. Examples: vector, SIMD, MMX; multiprocessor, multicores; networks, networks on chip.

Simha's Philosophy of Engineering Innovations. The right symbolism for computer engineering innovation is at least a helix. Axes: technology, time, use models. Technology: analog improvements (integration, digital interfacing to analog, some automation). New emphasis on energy efficiency. Use models: application changes (error-tolerant, reduced-accuracy applications). To cite this talk, use CUCS-026-10.

Other Research at CASTL. I. Proactive Security Project. Current approach to security: patch flaws reactively. What if we took a ground-up approach? Secure hardware first; build hardware primitives to support SW security; build SW security countermeasures using HW primitives. II. Accelerating Discovery in Computer Systems. We are seeing rapid application development, and traditional methods are too slow to keep pace. Can we use machine learning and crowdsourcing to build better systems?