Day 21: Retiming Requirements. ESE534: Computer Organization. Relative Sizes. Today. State. State Size

ESE534: Computer Organization
Day 22: November 16, 2016
Retiming

Day 21: Retiming Requirements
  Retiming requirement depends on parallelism and performance
  Even with a given amount of parallelism
    Will have a distribution of retiming requirements
    May differ from task to task
    May vary independently from compute/interconnect requirements
  Another balance issue to watch
    Balance with compute, interconnect
  Need a canonical way to measure
    Like Rent?

Today
  Retiming Supply
    Technology
    Structures
    Hierarchy
  or, how do we add memory (state) to architectures

Relative Sizes
  Bit Operator: 3-5K F²
  Bit Operator Interconnect: 200K-250K F²
  Instruction (w/ interconnect): 20K F²
  Memory bit (SRAM): 250-500 F²
  Memory bit (DRAM): 25 F²
  Flip-Flop: 1000 F²

State
  From the sizes above: A(state bit) << A(bit-processing element)

State Size
  A(state bit) << A(bit-processing element)
    Enabler for time-space tradeoffs
  Balance: can afford many bits per bit-processing element
    250K/1K = 250 (flip-flops)
    250K/250 = 1K (SRAM bits)
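The balance arithmetic on the State Size slide can be sketched in a few lines of Python. This is only an illustration: the constant and function names are made up here, and the areas pick the 250K end of the operator range and the 250 end of the SRAM range, which is what the slide's own division uses.

```python
# Areas in F^2, taken from the "Relative Sizes" slide.
# Names are illustrative; range endpoints are chosen to match the slide's division.
BIT_OPERATOR_WITH_INTERCONNECT = 250_000  # upper end of 200K-250K F^2
FLIP_FLOP = 1_000                         # 1000 F^2
SRAM_BIT = 250                            # lower end of 250-500 F^2


def state_bits_per_operator(operator_area: int, bit_area: int) -> int:
    """Number of state bits that fit in the area of one bit-processing element."""
    return operator_area // bit_area


flip_flops = state_bits_per_operator(BIT_OPERATOR_WITH_INTERCONNECT, FLIP_FLOP)
sram_bits = state_bits_per_operator(BIT_OPERATOR_WITH_INTERCONNECT, SRAM_BIT)
print(flip_flops, sram_bits)  # 250 1000
```

With the less favorable end of the SRAM range (500 F² per bit) the same function gives 500 bits per operator, so the conclusion that many state bits fit per bit-processing element does not depend on which endpoint is chosen.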

State Density
  Memory is most dense in large arrays
    Also slow, low bandwidth
  What's expensive
    I/O
    Routing
    Bandwidth to access memory

Reuse Distance Interpretation
  How long between when something is produced and when it is consumed?
    FSM state: produced every cycle, consumed every cycle
    Line buffers for video
  When retiming long delay
    Ratio memory/IO is high
    Can afford/exploit large memory

Retiming Structure Concerns
  Area: F²/bit
  Throughput: bandwidth (bits/time)
  Energy

Optional Output
  Flip-flop (optionally) on output
    flip-flop: 1K F²
    Switch to select: ~1.25K F²
  Area: 1 LUT (~250K F²/LUT)
  Bandwidth: 1b/cycle

Output
  Single Output
    OK if other timings of the signal are not needed
  Multiple Output
    more routing

Input
  More registers (K×)
    2.5K F²/register+mux
    4-LUT => 10K F²/depth
  No more interconnect than unretimed
    open: compare savings to additional reg. cost
  Area: 1 LUT (250K + d*10K F²), get Kd regs
    d=4: 290K F²
  Bandwidth: K/cycle
    1/d-th capacity
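The input-retiming area figure above can be reproduced with a short Python sketch. The function and constant names are hypothetical; the numbers (250K F² per 4-LUT, 2.5K F² per register+mux) are the ones quoted on the Input slide, and the model assumes one register chain of depth d on each of the K LUT inputs.

```python
LUT_AREA = 250_000    # F^2, one 4-LUT with its interconnect (from the slides)
REG_MUX_AREA = 2_500  # F^2 per register + mux (from the slides)


def input_retiming_area(k: int, depth: int) -> int:
    """Area of one K-input LUT with `depth` retiming registers on every input."""
    return LUT_AREA + depth * k * REG_MUX_AREA


def input_retiming_registers(k: int, depth: int) -> int:
    """Total retiming registers provided: K inputs times depth d."""
    return k * depth


area = input_retiming_area(k=4, depth=4)
regs = input_retiming_registers(k=4, depth=4)
print(area, regs)  # 290000 16
```

For a 4-LUT each extra level of depth costs 4 × 2.5K = 10K F², so depth 4 adds 40K F² for 16 registers on top of the 250K F² logic block.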

Preclass 1
  Compare LUT sizing, interconnect p.

Just Logic Blocks
  Most primitive: build flip-flop out of logic blocks
    I = D*/Clk + I*Clk
    Q = Q*/Clk + I*Clk
  Area: 2 LUTs (250K F²/LUT each)
  Bandwidth: 1b/cycle

Separate Flip-Flops
  Network flip-flop w/ own interconnect
    + can deploy where needed
    - requires more interconnect
    + Vary LUT/FF ratio (Arch. Parameter)
  Assume routing inputs 1/4 size of LUT
  Area: 50K F² each
  Bandwidth: 1b/cycle

Virtex SRL16
  Xilinx Virtex 4-LUT
    Use as 16b shiftreg
  Area: ~250K F²/16 ≈ 16K F²/bit
    Does not need CLBs to control
  Bandwidth: 1b/2 cycle (1/2 CLB)
    1/16th capacity

Register File Memory Bank
  From MIPS-X
    250 F²/bit + 125 F²/port
  Area(RF) = (d+6)(w+6)(250 F² + ports*125 F²)

Preclass 2
  Complete Table
  How small can get?
  Compare w=1, d=16, ports=4 case to input retiming
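As a starting point for the Preclass 2 comparison, the register-file area model quoted above can be evaluated directly. This is a sketch rather than the preclass solution: the function names are made up, and the only inputs are the MIPS-X formula from the slide and the w=1, d=16, ports=4 case it asks about.

```python
def rf_area(depth: int, width: int, ports: int) -> int:
    """Register-file area in F^2: (d+6)(w+6)(250 + ports*125), per the slide."""
    return (depth + 6) * (width + 6) * (250 + ports * 125)


def rf_area_per_bit(depth: int, width: int, ports: int) -> float:
    """Total register-file area divided by the number of bits stored."""
    return rf_area(depth, width, ports) / (depth * width)


narrow = rf_area(depth=16, width=1, ports=4)
print(narrow)                                   # 115500
print(rf_area_per_bit(16, 1, 4))                # 7218.75
print(round(rf_area_per_bit(16, 32, 4), 1))     # 1224.6
```

Under this model the 16×1 register file costs 115.5K F² for 16 bits, while the 16 input-retiming registers at d=4 add 40K F² to a 4-LUT. The per-bit cost drops sharply as the word gets wider, because the fixed +6 overhead terms are amortized over more bits.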

Input
  More registers (K×)
    2.5K F²/register+mux
    4-LUT => 10K F²/depth
  No more interconnect than unretimed
    open: compare savings to additional reg. cost
  Area: 1 LUT (250K + d*10K F²), get Kd regs
    d=4: 290K F²
  Bandwidth: K/cycle
    1/d-th capacity

Preclass 2
  Note compactness from wide words (share decoder)

Xilinx CLB
  Xilinx 4K CLB as memory
    works like RF
  Area: 1/2 CLB (160K F²)/16 ≈ 10K F²/bit
    but need 4 CLBs to control
  Bandwidth: 1b/2 cycle (1/2 CLB)
    1/16th capacity

Memory Blocks
  SRAM bit ≈ 300 F² (large arrays)
  DRAM bit ≈ 25 F² (large arrays)
  Bandwidth: W bits / 2 cycles
    usually single read/write
    1/2^A-th capacity

Dual-Ported Block RAMs
  Virtex-6 Series: 36Kb memories
  Stratix-4: 640b, 9Kb, 144Kb
  Stratix-5: 20Kb
  Can put 250K/250 ≈ 1K bits in space of 4-LUT
    Trade few 4-LUTs for considerable memory
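The "trade few 4-LUTs for considerable memory" point can be made concrete with a rough Python sketch. It assumes the slide's figures of 250K F² per 4-LUT and 250 F² per SRAM bit, treats 36Kb as 36 × 1024 bits, and ignores decoder, port, and routing overhead, so the result is a lower bound on the area a real block RAM displaces. The names are illustrative.

```python
LUT_AREA = 250_000   # F^2 per 4-LUT with interconnect (from the slides)
SRAM_BIT_AREA = 250  # F^2 per SRAM bit (from the slides, overhead ignored)


def bits_per_lut_area() -> int:
    """SRAM bits that fit in the area of one 4-LUT."""
    return LUT_AREA // SRAM_BIT_AREA


def luts_displaced(memory_bits: int) -> float:
    """Number of 4-LUT areas occupied by a memory of `memory_bits` bits."""
    return memory_bits * SRAM_BIT_AREA / LUT_AREA


print(bits_per_lut_area())                  # 1000
print(round(luts_displaced(36 * 1024), 3))  # 36.864
```

So, under these assumptions, a 36Kb block occupies roughly the area of 37 4-LUTs.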

Hierarchy/Structure Summary
  Structure   Area (F²)   Bw/capacity
  FF/LUT      250K        1/1
  netff       50K         1/1
  XC          10K         1/16
  RFx1        10K         1/100
  FF/RF       1K          1/100
  RF bit      2K          1/100
  SRAM        300         1/10^5
  DRAM        25          1/10^7

(Area, BW) → Hierarchy
  Big Idea: Memory Hierarchy arises from area/bandwidth tradeoffs
    Smaller/cheaper to store words/blocks (saves routing and control)
    Smaller/cheaper to handle long retiming in larger arrays (reduce interconnect)
    High bandwidth out of shallow memories
    Lower energy out of small memories
  Applications have mix of retiming needs

Clock Cycle Radius
  Radius of logic can reach in one cycle (45 nm)
    Radius 10 (preclass 20: Lseg=5 → 50ps)
    Few hundred PEs
  Chip side 600-700 PE
    400-500 thousand PEs
  100s of cycles to cross

Clock Cycle Radius
  Radius of logic/memory can reach in one cycle (45 nm)
    Radius 10
  Chip side 600-700 PE
    400-500 thousand PEs
  100s of cycles to cross
  Can only reach a small amount of data quickly
  More state → slower access

Capacity vs. Delay, Energy
  How many hops to 3, 15, 31, 63?
  What fraction of memory can reach in same hops as 3, 15, 31, 63?
  Energy to access 63, 31, 15 compared to 3?

Capacity vs. Delay, Energy
  Can only place a few things close
    Slower to access far things
    More energy to access far things
    More energy to select from large number of things
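The clock-cycle-radius numbers can be sanity-checked with a small Python sketch. Two things here are assumptions, not taken from the slides: PEs are laid out on a square grid, and "radius" is measured in Manhattan (hop) distance. The function names are illustrative.

```python
def pes_within_radius(radius: int) -> int:
    """PEs within `radius` Manhattan hops on a square grid, including the center."""
    return 2 * radius * radius + 2 * radius + 1


def chip_pes(side: int) -> int:
    """Total PEs on a square chip with `side` PEs along each edge."""
    return side * side


def cycles_to_cross(distance: int, radius: int) -> int:
    """Clock cycles to cover `distance` hops at `radius` hops per cycle (rounded up)."""
    return -(-distance // radius)


print(pes_within_radius(10))               # 221
print(chip_pes(600), chip_pes(700))        # 360000 490000
print(cycles_to_cross(2 * 600, 10))        # 120
print(cycles_to_cross(2 * 700, 10))        # 140
```

Under these assumptions a radius of 10 reaches a few hundred PEs out of several hundred thousand on the chip, and a corner-to-corner path takes on the order of a hundred cycles or more, which is roughly in line with the figures on the slide.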

Modern FPGAs
  Output Flop (depth 1)
  Use LUT as Shift Register (16, 32)
  Embedded RAMs (9Kb, 20Kb, 36Kb)
  Larger chip RAMs (Xilinx UltraRAM, 100Mbs)
  Interface off-chip DRAM (Gbits)
  Retiming in interconnect (Stratix 10)

Modern Processors
  DSPs have accumulator (depth 1)
  Inter-stage pipelines (depth 1)
    Lots of pipelining in memory path
  Reorder Buffer (4 to 32)
  Architected RF (16, 32, 128)
  Actual RF (256, 512)
  L1 Cache (~64Kb)
  L2 Cache (~1Mb)
  L3 Cache (10-100Mb)
  Main Memory in DRAM (100Gb to 1Tbs)

Big Ideas [MSB Ideas]
  Tasks have a wide variety of retiming distances (depths)
    Within design, among tasks
  Retiming requirements vary independently of compute, interconnect requirements (balance)
  Wide variety of retiming costs
    25 F² to 250K F²
  Routing and I/O bandwidth big factors in costs
  Gives rise to memory (retiming) hierarchy

Admin
  HW9 due today
  Final out now
    1 month exercise
    Milestone deadlines next two Wednesdays
    (Wed. before Thanksgiving is not a Wednesday)