ESE534: Computer Organization


ESE534: Computer Organization
Day 15: March 24, 2014
Empirical Comparisons

Previously
  Instruction Space Modeling

Previously
  Programmable compute blocks
    LUTs, ALUs, PLAs

Today
  What if we just built a custom circuit?
  What cost are we paying for programmability?
  Can we afford to build custom circuits?
    Can we afford not to?

Today
  Empirically compare artifacts
  Different kind of lecture: messier, real-world artifacts
  Coming at this about 3 different ways
    Bottom up from sizes; with full benchmarks; particular examples

Today
  Empirical Data
    Custom
    Gate Array
    Std. Cell (ASIC)
    Full FPGAs
    Processors
  NRE
  Tasks

Preclass 1
  How big?
    2-LUT?
    2-LUT w/ Flip-flop?
    2-LUT w/ 4 input sources?
    2-LUT w/ 200 input sources?

Empirical Comparisons

Empirical
  Ground modeling in some concretes
  Start sorting out
    custom vs. configurable
    spatial configurable vs. temporal
  Start by reviewing alternatives

Full Custom
  Get to define all layers
  Use any geometry you like
  Only rules are process design rules
  ESE570

Standard Cell Area
  [Figure: row of standard cells: inv, nand3, inv, AOI4, nor3, inv]
  All cells uniform height
  Cell area
  Width of channel determined by routing

Standard Cell Area
  Identify the full custom and standard cell regions on 386DX die
  http://microscope.fsu.edu/chipshots/intel/386dxlarge.html

Standard Cell Area
  What freedom have we removed?
  Impact?

MPGA
  Metal Programmable Gate Array
    Resurrected as Structured ASICs
  Gates pre-placed (poly, diffusion)
  Only get to define metal connections
    Today's structured ASICs: maybe just vias
  Cheap (low NRE): only have to pay for metal mask(s)
  [Wu&Tsai/ISPD2004p103]

Structured ASIC: eASIC
  2011: 45nm, 1.4M LUTs, 500MHz?
  http://www.easic.com/wp-content/uploads/2011/02/easic-nextreme-2t-product-brief.pdf

Structured ASIC
  Maybe think about it as an FPGA with vias instead of configurable switches?
  Ratio of SRAM to via design?

What do we expect?
  Comparing density/delay/energy
    Full custom
    Standard Cell (ASIC)
    MPGA / Structured ASIC
    FPGA
    Processor

Why isn't it trivial?
  Different logic forms
  Interconnect
  Balance of resources
  Mix of requirements in tasks

MPGA vs. Custom?
  AMI CICC 83
    MPGA 1.0
    Std-Cell 0.7
    Custom 0.5
  AMI CICC 04
    Custom 0.6 (DSP)
    Custom 0.8 (DPath)
  Toshiba DSP
    Custom 0.3
  Mosaid RAM
    Custom 0.2
  GE CICC 86
    MPGA 1.0
    Std-Cell 0.4--0.7
    FF/counter 0.7
    FullAdder 0.4
    RAM 0.2

MPGAs
  Metal Programmable Gate Arrays
  Modern: "Sea of Gates"
  Yield 35--70%
  Maybe 1.25KF²/gate? (quite a bit of variance)

Conventional FPGA Tile
  K-LUT (typical k=4) w/ optional output Flip-Flop

Toronto FPGA Model

FPGA Table

(semi) Modern FPGAs
  APEX 20K1500E
    52K LEs
    0.18µm
    24mm × 22mm
    300KF²/LE
  XC2V1000
    10.44mm × 9.90mm [source: Chipworks]
    0.15µm
    11,520 4-LUTs
    1.5Mλ²/4-LUT (~375KF²/4-LUT)
  [Both also have RAM in cited area]
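The F² figures above normalize die area by the square of the feature size so that parts from different process generations can be compared. A minimal sketch of that conversion follows. The helper name `area_in_f2` is hypothetical, and dividing the whole die area by the cell count is an assumption: the slides do not say exactly how their rounded numbers were derived, and both dies also contain RAM.

```python
def area_in_f2(width_mm, height_mm, feature_um, cells):
    """Return die area per cell in units of F^2, where F is the feature size."""
    # Convert each side from mm to um before multiplying.
    area_um2 = (width_mm * 1000.0) * (height_mm * 1000.0)
    # One F^2 is feature_um squared, in um^2.
    return area_um2 / (feature_um ** 2) / cells


# Figures taken from the slide; "52K LEs" is read here as 52,000.
apex = area_in_f2(24, 22, 0.18, 52_000)           # APEX 20K1500E
virtex2 = area_in_f2(10.44, 9.90, 0.15, 11_520)   # XC2V1000

print(f"APEX 20K1500E: {apex / 1e3:.0f}K F^2 per LE")
print(f"XC2V1000:      {virtex2 / 1e3:.0f}K F^2 per 4-LUT")
```

Under these assumptions the whole-die division gives roughly 313KF² per LE and 399KF² per 4-LUT, in the same range as the slide's rounded 300KF² and ~375KF².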

How many gates? (Preclass 3)
  gates in 2-LUT
  Now how many?
  Which gives:
    Higher fraction of gates used?
    More gates/unit area?
    More usable gates?

Gates Required?
  Depth=3, Depth=2048?

Gate metric for FPGAs?
  Day 11: several components for computations
    compute element
    interconnect: space, time
    instructions
  Not all applications need these in the same balance
  Assigning a single capacity number to a device is an oversimplification

MPGA vs. FPGA
  MPGA (SOG GA)
    1.25KF²/gate
    35-70% usable (50%)
    1.5-4KF²/gate net
  Xilinx XC4K
    300KF²/CLB
    17--48 gates (26?)
    6-18KF²/gate net
  Ratio: 2--10 (5)
  Adding ~2x Custom/MPGA, Custom/FPGA ~10x

FPGA vs. Structured ASIC
  Virtex 6
    40nm
    470K 6-LUTs
    Largest device
  eASIC
    45nm
    580K eCells
    Probably smaller die
  http://www.easic.com/high-speed-transceivers-low-cost-power-fpga-nre-asic-45nm-easic-nextreme-2/easic-nextreme-2-look-up-table-lut-architecture/

FPGA vs. Std Cell
  90nm
  FPGA: Stratix II
  STMicro CMOS090 Standard Cell
  Full custom layout, but by tool
  [Kuon/Rose TRCADv26n2p203--215 2007]

MPGA vs. FPGA (Delay)
  MPGA (SOG GA)
    F=1.2µ
    τ_gd ~1ns
  Xilinx XC4K
    F=1.2µ
    1-7 gates in 7ns
    2-3 gates typical
  Ratio: 1--7 (2.5)
  Altera claiming 2 for their Structured ASIC [2007]
  LSI claiming 3 [2005]

FPGA vs. Std. Cell Delay
  90nm
  FPGA: Stratix II
  STMicro CMOS090
  [Kuon/Rose TRCADv26n2p203--215 2007]
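The net area-per-gate figures in the MPGA vs. FPGA area comparison above follow from simple division, which the short sketch below reproduces. The function names are hypothetical; the inputs are the slide's own numbers (1.25KF² per raw MPGA gate at 35-70% usable, 300KF² per XC4K CLB holding 17-48 gates).

```python
def mpga_net_kf2(raw_kf2_per_gate, usable_fraction):
    """Net area per usable gate when only a fraction of raw gates can be used."""
    return raw_kf2_per_gate / usable_fraction


def fpga_net_kf2(clb_kf2, gates_per_clb):
    """Net area per gate when one CLB stands in for gates_per_clb gates."""
    return clb_kf2 / gates_per_clb


# Best case, nominal, and worst case for each technology (units: KF^2 per gate).
mpga = [mpga_net_kf2(1.25, f) for f in (0.70, 0.50, 0.35)]
fpga = [fpga_net_kf2(300, g) for g in (48, 26, 17)]
nominal_ratio = fpga[1] / mpga[1]

print("MPGA net KF^2/gate:", [round(v, 2) for v in mpga])
print("FPGA net KF^2/gate:", [round(v, 2) for v in fpga])
print("Nominal FPGA/MPGA ratio:", round(nominal_ratio, 1))
```

The nominal points (50% usable gates, 26 gates per CLB) give 2.5KF² against about 11.5KF² per gate, a ratio a little under 5, which matches the slide's parenthesized "(5)".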

FPGA vs. Std Cell Energy
  90nm
  FPGA: Stratix II
  STMicro CMOS090
  eASIC (MPGA) claim 20% of FPGA power (best case)
  [Kuon/Rose TRCADv26n2p203--215 2007]

Processors vs. FPGAs

Processors and FPGAs

Component Example
  Single die in 0.35µm
  XC4085XL-09
    3,136 CLBs
    4.6ns
    682 Bit Ops/ns
  Alpha 1996
    2 64b ALUs
    2.3ns
    55.7 Bit Ops/ns
  [1 bit op = 2 gate evaluations]

Processors and FPGAs

Raw Density Summary
  Area
    MPGA 2-3x Custom
    FPGA 5x MPGA
    FPGA:std-cell custom ~ 15-30x
  Area-Time
    Gate Array 6-10x Custom
    FPGA 15-20x Gate Array
    FPGA:std-cell custom ~ 100x
    Processor 10x FPGA
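The bit-ops-per-nanosecond figures in the Component Example above can be recovered by dividing the number of bit-level operators by the cycle time. This sketch assumes one bit operation per CLB per cycle on the FPGA and one per ALU bit per cycle on the processor; that reading is an assumption, although it does reproduce the slide's numbers. The function name is hypothetical.

```python
def peak_bit_ops_per_ns(bit_operators, cycle_ns):
    """Peak bit operations per nanosecond for a device with the given cycle time."""
    return bit_operators / cycle_ns


# XC4085XL-09: 3,136 CLBs at a 4.6 ns cycle (assumed one bit op per CLB).
fpga = peak_bit_ops_per_ns(3136, 4.6)
# Alpha (1996): two 64-bit ALUs at a 2.3 ns cycle.
alpha = peak_bit_ops_per_ns(2 * 64, 2.3)

print(f"FPGA:  {fpga:.0f} bit ops/ns")
print(f"Alpha: {alpha:.1f} bit ops/ns")
print(f"Peak ratio: {fpga / alpha:.1f}x")
```

These are peak rates for single dies in the same 0.35µm generation; they say nothing about how much of the peak a given task can use.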

Raw Density Caveats
  Processor/FPGA may solve more specialized problem
  Problems have different resource balance requirements
    can lead to low yield of raw density

Challenge: NRE

NRE Costs

Economics
  Forcing fewer, more customizable chips
  28-nm SoC development costs doubled over previous node (EE Times 2013)
  28nm +78%, 20nm +48%, 14nm +31%, 10nm +35%

Economics force fewer, more customizable chips
  Mask costs approaching millions of dollars
  Custom IC design NRE: tens of millions of dollars
  Need market of hundreds of millions of dollars to recoup investment
  With fixed or slowly growing total IC industry revenues
    Number of unique chips must decrease

Broadening Picture
  Compare larger computations
  For comparison
    throughput density metric: results/area-time
    normalize out area-time point selection
    high throughput density
      most in fixed area
      least area to satisfy fixed throughput target

Task Comparisons

Multiply

Preclass 4
  Efficiency of 8×8 multiply on 16×16 multiplier?

Multiply

Example: FIR Filtering
  $Y_i = w_1 x_i + w_2 x_{i+1} + \dots$
  Application metric: TAPs = filter taps
    multiply accumulate

Mixed Designs
  Modern FPGAs include hardwired multipliers (Virtex 25x18)
  http://www.xilinx.com/products/silicon-devices/fpga/virtex-6/index.htm

FPGA vs. Std Cell (revisit)
  90nm
  FPGA: Stratix II
  STMicro CMOS090
  [Kuon/Rose TRCADv26n2p203--215 2007]
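To make the FIR example and the Preclass 4 question above concrete, here is a minimal software sketch. `fir` computes the filter exactly as the slide writes it, one multiply-accumulate per tap per output. `multiplier_utilization` is one plausible reading of Preclass 4 and is an assumption: it treats the multiplier as an N×N array of partial-product cells, so its area scales with N². The slides do not state the intended answer, and both names are hypothetical.

```python
def fir(weights, xs):
    """Y_i = w_1*x_i + w_2*x_{i+1} + ... for every i that has a full window."""
    taps = len(weights)
    return [
        sum(w * xs[i + k] for k, w in enumerate(weights))
        for i in range(len(xs) - taps + 1)
    ]


def multiplier_utilization(used_bits, built_bits):
    """Fraction of an NxN partial-product array exercised by an MxM multiply."""
    return (used_bits * used_bits) / (built_bits * built_bits)


ys = fir([1, 2, 3], [1, 0, 2, 1])
macs = 3 * len(ys)  # TAPs metric: one multiply-accumulate per tap per output

print("outputs:", ys)
print("multiply-accumulates:", macs)
print("8x8 on 16x16 utilization:", multiplier_utilization(8, 16))
```

Under the array-multiplier assumption, an 8×8 multiply uses a quarter of a 16×16 multiplier, which is the kind of mismatch between a task's needs and the hardwired resource that the Mixed Designs slide raises.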

Energy
  Pleiades includes hardwired multiply accumulator
  [Abnous et al., The Application of Programmable DSPs in Mobile Communications, Wiley, 2002, pp. 327-360]

FPGA vs. Std Cell Energy (revisit)
  90nm
  FPGA: Stratix II
  STMicro CMOS090
  [Kuon/Rose TRCADv26n2p203--215 2007]

Degrade from Peak
  How do various architectures degrade from peak?
    FPGA?
    Processor?
    Custom?

Degrade from Peak: FPGAs
  Long path length: not run at cycle
  Limited throughput requirement
    bottlenecks elsewhere limit throughput requirement
  Insufficient interconnect
  Insufficient retiming resources (bandwidth)

Degrade from Peak: Processors
  Ops w/ no gate evaluations (interconnect)
  Ops use limited word width
  Stalls waiting for retimed data

Degrade from Peak: Custom/MPGA
  Solve more general problem than required (more gates than really needed)
  Long path length
  Limited throughput requirement
  Not needed or applicable to a problem

Degrade Notes
  We'll cover these issues in more detail as we get into them later in the course

Big Ideas [MSB Ideas]
  Raw densities: custom:ga:fpga:processor
    1:5:100:1000
    close gap with specialization

Admin
  Grading HW5.2 (not touched, yet)
  HW6 due Wednesday
  HW7 out Wednesday
  Reading on web
    Classic Paper

DES Keysearch
  <http://www.cs.berkeley.edu/~iang/isaac/hardware/>

IIR/Biquad
  IIR = Infinite Impulse Response
  Simplest IIR: $Y_i = A X_i + B Y_{i-1}$
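The simplest IIR recurrence above feeds each output back into the next one, so every sample depends on all earlier inputs. A minimal sketch follows; the function name and the choice of a zero initial state are assumptions, not something the slides specify.

```python
def simplest_iir(xs, a, b, y_init=0.0):
    """Apply Y_i = A*X_i + B*Y_{i-1}, starting from Y_{-1} = y_init."""
    ys = []
    y_prev = y_init
    for x in xs:
        y_prev = a * x + b * y_prev  # feedback: the new output reuses the previous one
        ys.append(y_prev)
    return ys


# Impulse response with A = 1 and B = 0.5: each output is half the previous one.
print(simplest_iir([1.0, 0.0, 0.0, 0.0], a=1.0, b=0.5))
```

With a single impulse as input, the response decays geometrically and never reaches exactly zero for nonzero B, which is where the "infinite impulse response" name comes from.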

DNA Sequence Match
  Problem: cost of transform S1 → S2
  Given: cost of insertion, deletion, substitution
  Relevance: similarity of DNA sequences
    evolutionary similarity
    structure predict function
  Typically: new sequence compared to large database
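The slides state the DNA matching problem but not an algorithm. One standard way to compute the cost of transforming one sequence into another is the dynamic-programming edit-distance recurrence, sketched below. The function name and the default unit costs are assumptions chosen for illustration.

```python
def transform_cost(s1, s2, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] is the cost of turning the processed prefix of s1 into s2[:j].
    prev = [j * ins_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * del_cost]  # turning s1[:i] into the empty string
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + del_cost,                          # delete c1
                curr[j - 1] + ins_cost,                      # insert c2
                prev[j - 1] + (0 if c1 == c2 else sub_cost)  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(transform_cost("ACGT", "AGT"))
print(transform_cost("GATTACA", "GATTTACA"))
```

Each table cell depends only on its three neighbors, so the work is regular and local. Comparing one new sequence against a large database repeats this computation once per database entry.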