CS 250 VLSI System Design

CS 250 VLSI System Design, Lecture 3: Timing. 2013-9-5. Professor Jonathan Bachrach; today's lecture by John Lazzaro. TA: Ben Keller. www-inst.eecs.berkeley.edu/~cs250/

Today's outline: everything doesn't happen at once.
Timing, the 10,000 ft view: locally synchronous, globally asynchronous.
On the same page: the minimal set of timing concepts you need for the project.
Break.
RTL examples: better timing through micro-architecture.
Electrical details: just so you know.

View from 10,000 Ft. (Google I/O, 2012)

Moore's Law. [Plot: transistors per chip over time, rising from 2 thousand through 1 million to 2.6 billion.] Synchronous logic on a single clock domain is not practical for a 2.6 billion transistor design.

GALS: Globally Asynchronous, Locally Synchronous. Synchronous modules are typically 50K-1M gates, so that the synchronous logic approach works well without requiring heroics. Examples follow.

IBM Power 5 CPU (dynamically scheduled). [Figure: pipeline block diagram running from the program counter, instruction cache, instruction translation, branch history tables, branch prediction, return stack, target cache, and instruction buffers 0 and 1, through thread priority, group formation, instruction decode, dispatch, shared register mappers, dynamic instruction selection, shared issue queues, shared register file reads, and shared execution units (LSU0, FXU0, LSU1, FXU1, FPU0, FPU1, BXU, CRL), to shared register file writes, group completion, data translation, data cache, store queue, and L2 cache. Blocks are marked as shared by two threads, thread 0 resources, or thread 1 resources.] Stars denote FIFOs that create separate synchronous domains. An example of how architecture and circuits work together.

Rocket uses GALS for the accelerator interface. Your project interfaces with the RISC-V pipeline and the memory system using FIFOs. Your timing closure is independent of the CPU logic domain.

Today: Timing insights for your project. What we're not doing: if this class were EE 241 and your project were an SRAM, you could see all the way down to the layout. Timing? Use SPICE on this hand-drawn schematic.

Technology X: The CS 250 timing challenge. What we are doing: logic synthesis. If your accelerator is too slow, there are two options:
Top-down (today): Rework the high-level micro-architecture. Let Technology X keep its job.
Bottom-up: Take control away from logic synthesis. Use HDL as a textual schematic. Also, use command-line tool flags. Sometimes necessary. Ben is the expert; ask in discussion section.

A Logic Circuit Primer. "Models should be as simple as possible, but no simpler." (Albert Einstein)

Inverters: A simple transistor model. Out = NOT In. Truth table: In = 0 gives Out = 1; In = 1 gives Out = 0. Circuit: a PMOS between Vdd and Out, an NMOS between Out and ground. pfet: a switch, on if its gate is grounded. nfet: a switch, on if its gate is at Vdd. This model correctly predicts the logic output for simple static CMOS circuits. Extensions to model subtler circuit families, or to predict timing, have not worked well.

Transistors as water valves (cartoon physics). If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets. An "on" p-fet fills up the capacitor with charge. [Plot: water level rising from 0 to 1 over time.] An "on" n-fet empties the bucket. [Plot: water level falling from 1 to 0 over time.] This model is often good enough.

What is the bucket? A gate's fan-out. [Figures: an inverter and a NAND gate.] Fan-out: the number of gate inputs driven by a gate's output. Driving other gates slows a gate down. Driving wires slows a gate down. Driving its own parasitics slows a gate down.

Fanout

A closer look at fan-out. Driving more gates adds delay. A linear model works for reasonable fan-out. [Plot: delay (axis marked at 0.5 ns) versus Cout, for Out: Low -> High, with fan-outs 1, 2, and 3 marked; slope = 0.0021 ns/fF.] FO4: fanout-of-four delay, the delay time of an inverter driving 4 inverters.
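As a rough sketch of the linear model on this slide, delay can be written as an intrinsic term plus a slope times the load capacitance. The slope is taken here as 0.0021 ns/fF from the plot; the intrinsic delay and the per-input capacitance are made-up illustrative numbers, not data from any real process.

```python
# Linear fan-out delay model: delay = intrinsic + slope * C_out.
SLOPE_NS_PER_FF = 0.0021   # slope read off the slide's plot (ns per fF)
INTRINSIC_NS = 0.02        # ASSUMED delay with no external load
C_IN_FF = 2.0              # ASSUMED input capacitance of one driven gate


def gate_delay_ns(c_out_ff, intrinsic_ns=INTRINSIC_NS, slope=SLOPE_NS_PER_FF):
    """Delay of a gate driving a total load of c_out_ff femtofarads."""
    return intrinsic_ns + slope * c_out_ff


def fanout_delay_ns(fanout, c_in_ff=C_IN_FF):
    """Delay when the gate drives `fanout` identical gate inputs."""
    return gate_delay_ns(fanout * c_in_ff)


if __name__ == "__main__":
    for fanout in (1, 2, 4, 8):
        print(f"fan-out {fanout}: {fanout_delay_ns(fanout):.4f} ns")
```

With these assumed numbers, `fanout_delay_ns(4)` plays the role of an FO4 delay; a real FO4 value has to come from your cell library.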

Propagation delay graphs. Cascaded gates, with 1 -> 0 and 0 -> 1 transitions marked along the chain. [Plot: inverter transfer function, Vout versus Vin.]

Worst-case delay through combinational logic. x = g(a, b, c, d, e, f). If d going 0-to-1 switches x 0-to-1, the delay is T1. If a going 0-to-1 switches x 0-to-1, the delay is T2. T2 might be the worst-case delay path (critical path). It would be surprising if T1 > T2.
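A minimal sketch of how T1 and T2 could be computed as longest paths through a gate network. The netlist below is hypothetical (the slide's actual circuit is not recoverable from the transcript), and every gate is assumed to have unit delay.

```python
# Hypothetical netlist: gate output name -> (gate delay, fan-in signals).
# Signals that are not keys (a..f) are primary inputs.
GATES = {
    "g1": (1.0, ["a", "b"]),
    "g2": (1.0, ["g1", "c"]),
    "g3": (1.0, ["g2", "d"]),
    "x": (1.0, ["g3", "e", "f"]),
}


def path_delay(gates, src, dst):
    """Longest delay from signal src to signal dst, or None if no path exists."""
    if dst == src:
        return 0.0
    if dst not in gates:          # reached a different primary input
        return None
    delay, fanin = gates[dst]
    through = [d for d in (path_delay(gates, src, s) for s in fanin) if d is not None]
    return delay + max(through) if through else None


def worst_case_delay(gates, inputs, dst):
    """Critical-path delay: the largest input-to-output delay."""
    return max(d for d in (path_delay(gates, i, dst) for i in inputs) if d is not None)


if __name__ == "__main__":
    print("T1 (d -> x):", path_delay(GATES, "d", "x"))
    print("T2 (a -> x):", path_delay(GATES, "a", "x"))
    print("worst case:", worst_case_delay(GATES, "abcdef", "x"))
```

This counts gate delays only; the next slide explains why wire delay can change which path is actually critical.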

Why "might"? Looks benign, but wires have delay too. Wire delay: even in those cases where the transmission line effect is negligible, wires possess distributed resistance and capacitance. The time constant associated with distributed RC is proportional to the square of the length. [Figure: voltages v1 through v4 along a wire, plotted against time.] For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important; typically around half of the C of a gate load is in the wires. For long wires on ICs (busses, clock lines, global control signals, etc.), resistance is significant, therefore the distributed RC effect dominates, and signals are typically rebuffered to reduce delay. (From Spring 2003 EECS150 Lec10-Timing, page 16.)

Clocked Logic Circuits

From Delay Models to Timing Analysis. Timing analysis: what is the smallest clock period T that produces correct operation?
f         T
1 MHz     1 μs
10 MHz    100 ns
100 MHz   10 ns
1 GHz     1 ns

Timing Analysis and Logic Delay. A register is an array of flip-flops; combinational logic (CL) sits between registers. If our clock period T > worst-case delay through CL, does this ensure correct operation?

Flip-flops have internal delays. The value of D is sampled on the positive clock edge. Q outputs the sampled value for the rest of the cycle. [Timing diagram: CLK, D, and Q, with t_setup before the clock edge and t_clk-to-q after it.]

Flip-flop delays eat into the time budget. With combinational logic (here, an ALU) between two registers, the time budget is $T \ge \tau_{clk \to Q} + \tau_{CL} + \tau_{setup}$.

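Clock skew also eats into the time budget. [Figure: two register-CL-register circuits, one with the delayed clock CLKd on the sending register and one with CLKd on the receiving register. CLKd differs from CLK by clock skew, the delay in distribution.] As T goes to 0, which circuit fails first? $T \ge T_{CL} + T_{setup} + T_{clk \to Q} + \text{worst-case skew}$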
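The time-budget inequality can be turned into a small calculator. This is only a sketch: the function names are made up, and the delay values in the example are hypothetical, chosen in integer picoseconds to keep the arithmetic exact.

```python
def min_clock_period_ps(t_clk_to_q, t_cl, t_setup, worst_skew=0):
    """Smallest T satisfying T >= T_CL + T_setup + T_clk->Q + worst-case skew."""
    return t_clk_to_q + t_cl + t_setup + worst_skew


def max_frequency_mhz(period_ps):
    """Clock frequency in MHz for a period given in picoseconds."""
    return 1_000_000 / period_ps


if __name__ == "__main__":
    # Hypothetical numbers: 100 ps clk-to-Q, 1500 ps logic, 100 ps setup, 50 ps skew.
    period = min_clock_period_ps(100, 1500, 100, worst_skew=50)
    print(f"T >= {period} ps, f <= {max_frequency_mhz(period):.1f} MHz")
```

Setting `worst_skew=0` recovers the budget from the previous slide.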

Clock Tree Delays, IBM Power CPU. [Figure: clock delay plotted over chip position (x, y) for each level of the distribution network: buffer level 1, buffer level 2, sector buffers, tuned sector trees, and grid.]

Clock Tree Delays, IBM Power. [Plot: clock waveforms, volts (V) versus time (ps), showing 20 ps skew; a multiple-fingered transmission line is labelled.]

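Some flip-flops have hold time. [Timing diagram: CLK, D, and Q, with t_setup, t_inv, and t_hold marked; D must stay stable from t_setup before the clock edge until t_hold after it.] [Circuit: a flip-flop with an inverter, of delay t_inv, in the path to D.] What is the intended function of this circuit? Does flip-flop hold time affect operation of this circuit? Under what conditions? For correct operation: t_clk-to-q + t_inv > t_hold.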
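A hedged sketch of the hold check: the inequality t_clk-to-q + t_inv > t_hold says the new value must not reach D until the hold window has closed. The function names and the example delays (in picoseconds) are hypothetical.

```python
def hold_slack_ps(t_clk_to_q, t_logic_min, t_hold):
    """Margin by which the earliest change at D arrives after the hold window."""
    return t_clk_to_q + t_logic_min - t_hold


def meets_hold(t_clk_to_q, t_logic_min, t_hold):
    """True when t_clk-to-q + t_logic_min > t_hold, as on the slide."""
    return hold_slack_ps(t_clk_to_q, t_logic_min, t_hold) > 0


if __name__ == "__main__":
    # Hypothetical flip-flop: 50 ps clk-to-Q, 60 ps hold; inverter delay varies.
    for t_inv in (5, 20):
        ok = meets_hold(50, t_inv, 60)
        print(f"t_inv = {t_inv} ps -> {'OK' if ok else 'hold violation'}")
```

Unlike the setup check, the clock period T does not appear here, so slowing the clock cannot fix a hold violation.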

Searching for the processor critical path? Timing analysis: what is the smallest T that produces correct operation? We must consider all connected register pairs. Why might I suspect this one?
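Considering all connected register pairs can be sketched as a loop over register-to-register paths, computing the slack of each against a target clock period. The path names and delays below are invented for illustration; a real static timing tool derives them from the netlist.

```python
from typing import NamedTuple


class Path(NamedTuple):
    src: str       # launching register
    dst: str       # capturing register
    t_cl_ps: int   # worst-case combinational delay between them


def min_period_ps(paths, t_clk_to_q, t_setup):
    """Smallest clock period that works for every register pair."""
    return max(t_clk_to_q + p.t_cl_ps + t_setup for p in paths)


def slacks_ps(paths, period, t_clk_to_q, t_setup):
    """Timing slack per path; the smallest slack marks the critical path."""
    return {(p.src, p.dst): period - (t_clk_to_q + p.t_cl_ps + t_setup) for p in paths}


# Hypothetical register pairs and logic delays.
PATHS = [Path("PC", "IR", 800), Path("IR", "ALUout", 1400), Path("ALUout", "RegFile", 500)]

if __name__ == "__main__":
    print("min period:", min_period_ps(PATHS, 100, 100), "ps")
    for pair, slack in slacks_ps(PATHS, 2000, 100, 100).items():
        print(pair, "slack", slack, "ps")
```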

Combinational paths for IBM Power 4 CPU. [Histogram: late-mode timing checks (thousands) versus timing slack (ps), with the critical path marked.] Most wires have hundreds of picoseconds to spare. From "The circuit and physical design of the POWER4 microprocessor," IBM J. Res. and Dev., 46:1, Jan. 2002, J. D. Warnock et al.

How to retime logic. Circles are combinational logic, labelled with delays. The critical path is 5. We want to improve it without changing circuit semantics. [Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5.] Add a register, move one circle. Performance improves by 20% (the critical path drops from 5 to 4). [Figure 2: The example in Figure 1 after retiming. The critical path is reduced from 5 to 4.] Technology X can often do this. Figures from "Post-Placement C-slow Retiming for the Xilinx Virtex FPGA," Nicholas Weaver, Yury Markovskiy, Yatish Patel, and John Wawrzynek, UC Berkeley.
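A minimal sketch of the idea, under a strong simplifying assumption: the logic is a single linear chain rather than the graph in the figure. The chain reuses the six delays from the slide (1, 1, 1, 1, 2, 2), but their ordering and the register positions are hypothetical, picked so that moving one register changes the critical path from 5 to 4.

```python
def critical_path(delays, register_after):
    """Longest register-to-register delay in a linear chain.

    delays: logic delays in order from IN to OUT (fixed registers at both ends).
    register_after: set of indices i with a register right after delays[i].
    """
    worst = current = 0
    for i, delay in enumerate(delays):
        current += delay
        if i in register_after:
            worst = max(worst, current)
            current = 0
    return max(worst, current)


CHAIN = [2, 1, 1, 1, 1, 2]          # ASSUMED ordering of the slide's delays

if __name__ == "__main__":
    before = critical_path(CHAIN, {3})   # segments 2+1+1+1 and 1+2
    after = critical_path(CHAIN, {2})    # segments 2+1+1 and 1+1+2
    print(f"before: {before}, after: {after}, "
          f"clock period reduced by {100 * (before - after) // before}%")
```

The function computed by the circuit is unchanged; only the position of the register between the fixed input and output registers moves.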

Power 4: Timing Estimation, Closure. Timing estimation: predicting a processor's clock rate early in the project. From "The circuit and physical design of the POWER4 microprocessor," IBM J. Res. and Dev., 46:1, Jan. 2002, J. D. Warnock et al.

Power 4: Timing Estimation, Closure. Timing closure: meeting (or exceeding!) the timing estimate. From "The circuit and physical design of the POWER4 microprocessor," IBM J. Res. and Dev., 46:1, Jan. 2002, J. D. Warnock et al.

Floorplanning: essential to meet timing (Intel XScale 80200)

Break

Simple exercises for gaining intuition about timing for your process + EDA tools. Thanks to Bhupesh Dasila, Open-Silicon Bangalore.

Synthesize gate chains using hand-specified library cells. Exercises the cell library and the place-and-route tools. [Plot: delay versus chain length for weak NANDs in a 40 nm process, 29 ps/gate on average, with synthesis constrained to a 2 ns clock.] Lets you know how many levels of logic you can use in the best case. Delay of a chain of 3 inverters with the strongest drive strength: guaranteed not to exceed speed. Helps you see through Technology X. (Bhupesh Dasila)

Force place and route (P&R) to drive a long wire with a known buffer cell. Vary driver strength, wire length, and metal layer. Shows the maximum distance apart that two gates can be placed while still meeting your clock period. That distributed RC delay grows as the square of the length is clearly seen! (Bhupesh Dasila)

Driving Large Loads. Large fanout nets: clocks, resets, memory bit lines, off-chip. A relatively small driver results in a long rise time (and thus a large gate delay). Strategy: staged buffers. Optimal trade-off between delay per stage and total number of stages: fanout of 4-6 per stage.
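The trade-off between delay per stage and number of stages can be explored with a small sketch. It assumes a common normalized model that is not stated on the slide: each stage costs (stage fanout + parasitic) delay units, and the number of stages is log(total fanout) / log(stage fanout), treated as a continuous quantity.

```python
import math


def chain_delay(total_fanout, stage_fanout, parasitic=1.0):
    """Normalized delay of a buffer chain (ASSUMED model, see lead-in)."""
    n_stages = math.log(total_fanout) / math.log(stage_fanout)
    return n_stages * (stage_fanout + parasitic)


def best_stage_fanout(total_fanout, candidates=range(2, 11), parasitic=1.0):
    """Integer stage fanout that minimizes the modeled chain delay."""
    return min(candidates, key=lambda f: chain_delay(total_fanout, f, parasitic))


if __name__ == "__main__":
    for p in (1.0, 3.0):
        best = best_stage_fanout(1024, parasitic=p)
        print(f"parasitic = {p}: best stage fanout = {best}")
```

In this model the optimum does not depend on the total load, and a larger assumed parasitic term pushes the best stage fanout upward, which is in line with the 4-6 range quoted above.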

Register file: Synthesize, or use SRAM? [Figure: register file built from 32-bit registers R1 through R31 with enables (R0 is the constant 0); a write port with 32-bit wd, WE, clk, and a demux driven by the 5-bit sel(ws); two read ports, each a 32-bit mux selected by the 5-bit sel(rs1) or sel(rs2), producing rd1 and rd2.] Speed will depend on how large it lays out.

Synthesized, custom, and SRAM-based register files, 40nm. For small register files, logic synthesis is competitive. Not clear if the SRAM data points include area for register control, etc. [Figure 3, with curves for synthesis, SRAMs, and register file compiler: Using the raw area data, the physical implementation team can get a more accurate area estimation early in the RTL development stage for floorplanning purposes. This shows an example of this graph for a 1-port, 32-bit-wide SRAM.] (Bhupesh Dasila)

Techniques

Pipelining

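Starting point: A single-cycle processor. Challenge: speed up the clock while keeping CPI == 1. $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Cycles per instruction: CPI == 1, this is good. Seconds per cycle: slow, this is bad. [Figure: single-cycle datapath with PC and +0x4 adder, instruction memory, register file (rs1, rs2, ws, wd, WE, rd1, rd2), Ext, ALU, data memory (Addr, Din, Dout, WE), and MemToReg mux.]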

Reminder: How data flows after posedge. [Figure: datapath showing the PC, the +0x4 adder, instruction memory, the register file, the decode logic, and the ALU.]

Next posedge: Update state and repeat. [Figure: the state elements, the PC and the register file.]

Observation: Logic is idle for most of the cycle. For most of the cycle, the ALU is either waiting for its inputs or holding its output. Ideal: a CPU architecture where each part is always working. [Figure: the single-cycle datapath.]

Inspiration: Automobile assembly line. The assembly line moves on a steady clock. Each station does the same task on each car. [Photo labels: the clock, merge station, car body shell, bolting station, car chassis.]

Inspiration: Automobile assembly line. Simpler station tasks mean more cars per hour. Simple tasks take less time, so the clock is faster.

Inspiration: Automobile assembly line. Line speed is limited by the slowest task. Most efficient if all tasks take the same time to do.

Inspiration: Automobile assembly line. Simpler tasks plus a complex car mean a long line! These lines go 24 x 7, and rarely shut down.

Lessons from car assembly lines:
Faster line movement yields more cars per hour off the line.
Faster line movement requires more stages, each doing simpler tasks.
To maximize efficiency, all stages should take the same amount of time (if not, workers in fast stages are idle).
Filling, flushing, and stalling the assembly line are all bad news.
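The lessons above map onto a simple calculation: the clock period of a pipeline is set by its slowest stage, and faster stages sit idle for the difference. A sketch with hypothetical stage delays in picoseconds:

```python
def clock_period_ps(stage_delays_ps, reg_overhead_ps=0):
    """The line moves at the pace of the slowest station (plus register overhead)."""
    return max(stage_delays_ps) + reg_overhead_ps


def idle_fraction(stage_delays_ps):
    """Fraction of each cycle that every stage spends idle."""
    slowest = max(stage_delays_ps)
    return [1 - delay / slowest for delay in stage_delays_ps]


if __name__ == "__main__":
    unbalanced = [200, 350, 150, 300]   # hypothetical: same total work...
    balanced = [250, 250, 250, 250]     # ...split evenly
    for name, stages in (("unbalanced", unbalanced), ("balanced", balanced)):
        idle = ", ".join(f"{f:.0%}" for f in idle_fraction(stages))
        print(f"{name}: period {clock_period_ps(stages)} ps, idle per stage: {idle}")
```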

Key Analogy: The instruction is the car. [Figure: pipeline stages #1 through #5. Stage #1 is instruction fetch (PC, +0x4 adder, instruction memory). An instruction register (IR) at the head of each later stage holds the instruction and controls the hardware in stages 2, 3, 4, and 5.] This is data-stationary control.

Example: Decode & Register Fetch Stage. A sample program: ADD R4,R3,R2; OR R7,R6,R5; SUB R10,R9,R8. [Figure: pipeline snapshot with SUB R10,R9,R8 in stage #1 (instruction fetch), OR R7,R6,R5 in stage #2 (decode & register fetch), and ADD R4,R3,R2 in stage #3; stage #2 reads the register file (rs1, rs2, rd1, rd2, Ext) into the A, M, and B registers.] Register numbers are chosen so that the instructions are independent, like cars on the line.

Hazards: An instruction is not a car. New sample program: ADD R4,R3,R2 followed by OR R5,R4,R2. [Figure: OR R5,R4,R2 is in stage #2 (decode & register fetch) while ADD R4,R3,R2 is in stage #3, so R4 is not written yet.] Oops! The wrong value of R4 is fetched from the RegFile, and the contract with the programmer is broken! This is an example of a hazard. We must (1) detect and (2) resolve all hazards to make a CPU that matches the ISA.
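The "detect" half can be sketched as a scan for read-after-write dependences between nearby instructions. The `Instr` record and the window of 3 instructions are assumptions for illustration (roughly the distance between register read in stage 2 and writeback in stage 5), not the lecture's hardware.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    rd: int    # destination register number
    rs1: int   # first source register number
    rs2: int   # second source register number


def raw_hazards(program, window=3):
    """Return (producer index, consumer index, register) for each RAW hazard.

    window: ASSUMED number of following instructions that can still read
    a stale value before the producer has written the register file.
    """
    hazards = []
    for j, consumer in enumerate(program):
        for i in range(max(0, j - window), j):
            producer = program[i]
            # R0 is the constant 0, so writes to it never create a hazard.
            if producer.rd != 0 and producer.rd in (consumer.rs1, consumer.rs2):
                hazards.append((i, j, producer.rd))
    return hazards


INDEPENDENT = [Instr("ADD", 4, 3, 2), Instr("OR", 7, 6, 5), Instr("SUB", 10, 9, 8)]
DEPENDENT = [Instr("ADD", 4, 3, 2), Instr("OR", 5, 4, 2)]

if __name__ == "__main__":
    print("independent program:", raw_hazards(INDEPENDENT))
    print("dependent program:  ", raw_hazards(DEPENDENT))
```

Resolving the hazard (stalling or forwarding) is a separate step that this sketch does not attempt.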

Performance Equation and Hazards. $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Cycles per instruction: some ways to cope with hazards (stalling the pipeline) make CPI > 1. Seconds per cycle: added logic to detect and resolve hazards increases the clock period. "Software slows the machine down." (Seymour Cray) [Figure: the first three pipeline stages: instruction fetch, decode & register fetch, and stage #3.]
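A small sketch of the performance equation with a stall term. The instruction count, stall rate, and clock periods in the example are hypothetical numbers chosen only to show how the three factors trade off.

```python
def cpi_with_stalls(stall_fraction, stall_cycles, base_cpi=1.0):
    """Average CPI when a fraction of instructions each stall for some cycles."""
    return base_cpi + stall_fraction * stall_cycles


def seconds_per_program(instructions, cpi, seconds_per_cycle):
    """Seconds/Program = Instructions/Program x Cycles/Instruction x Seconds/Cycle."""
    return instructions * cpi * seconds_per_cycle


if __name__ == "__main__":
    n = 1_000_000
    single_cycle = seconds_per_program(n, 1.0, 10e-9)
    # Hypothetical pipeline: 4x faster clock, 20% of instructions stall 1 cycle.
    pipelined = seconds_per_program(n, cpi_with_stalls(0.2, 1), 2.5e-9)
    print(f"single-cycle: {single_cycle * 1e3:.2f} ms")
    print(f"pipelined:    {pipelined * 1e3:.2f} ms")
```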

Superpipelining

Superpipelining: Add more stages. $\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$. Goal: reduce the critical path by adding more pipeline stages. Example: 8-stage ARM XScale, with extra IF, ID, and data cache stages. Difficulties: added penalties for load delays and branch misses. Also, power! Ultimate limiter: as logic delay goes to 0, FF clk-to-q and setup times remain.
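The ultimate limiter can be illustrated with a simple model, assuming the logic divides perfectly evenly across stages (real designs rarely manage this). The 2000 ps of logic and 100 ps of flip-flop overhead are made-up numbers.

```python
def clock_period_ps(total_logic_ps, n_stages, ff_overhead_ps):
    """Period when logic is split evenly; overhead is clk-to-q plus setup."""
    return total_logic_ps / n_stages + ff_overhead_ps


def speedup_limit(total_logic_ps, ff_overhead_ps):
    """Clock speedup over 1 stage as the stage count grows without bound."""
    return (total_logic_ps + ff_overhead_ps) / ff_overhead_ps


if __name__ == "__main__":
    for n in (1, 5, 8, 20, 100):
        print(f"{n:3d} stages: {clock_period_ps(2000, n, 100):7.1f} ps")
    print("speedup can never exceed", speedup_limit(2000, 100))
```

Each added stage buys less than the one before it, while the penalties listed above (load delays, branch misses, power) keep growing.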

Note: Some stages now overlap, and some instructions take extra stages. 5-stage pipeline: IF, ID+RF, EX, MEM, WB. In the 8-stage pipeline: IF now takes 2 stages (pipelined I-cache); ID and RF each get a stage; the ALU is split over 3 stages; MEM takes 2 stages (pipelined D-cache). [Figure: the two pipelines side by side, with IM, Reg, ALU, and DM blocks and IRs between stages.]

Superpipelining techniques:
Split ALU and decode logic over several pipeline stages.
Pipeline memory: use more banks of smaller arrays, and add pipeline stages between decoders and muxes.
Remove rarely-used forwarding networks that are on the critical path. This creates stalls and affects CPI.
Pipeline the wires of frequently used forwarding networks.
Also: clocking tricks (example: use posedge and negedge registers).

Hardware limits to superpipelining? [Plot: CPU clock periods 1985-2005, in FO4 delays (0 to 100) versus year, for Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, and Itanium; Alpha 21064, 21164, and 21264; Sparc, SuperSparc, and Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, and x86-64. Annotations: MIPS 2000, 5 stages; Pentium Pro, 10 stages; Pentium 4, 20 stages.] FO4: how many fanout-of-4 inverter delays fit in the clock period. Historical limit: about 12 FO4s. Power wall: Intel Core Duo has 14 stages. Thanks to Francois Labonte, Stanford.

CPU DB: Recording Microprocessor History. With this open database, you can mine microprocessor trends over the past 40 years. Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University. [Plot: FO4 delays per cycle for processor designs, FO4/cycle (0 to 140) versus year (1985 to 2015).] FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.

Multithreading

Multithreading of Static Pipelines. Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
                        t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW r1, 0(r2)        F  D  X  M  W
T2: ADD r7, r1, r4         F  D  X  M  W
T3: XORI r5, r4, #12          F  D  X  M  W
T4: SW 0(r7), r5                 F  D  X  M  W
T1: LW r5, 12(r1)                   F  D  X  M  W
The last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile. Looks like 4 CPUs, each running at 1/4 clock. [Figure: pipeline with four PCs and four register files (GPR1), I$, IR, X, Y, D$, and a 2-bit thread select.] Many variants.
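The claim that writeback always finishes before the same thread's next register read can be checked with a short sketch. It assumes strict round-robin issue, a register read in stage D (1 cycle after fetch), a write in stage W (4 cycles after fetch), and that a read in the same cycle as the write is not safe; that last point depends on the register file design and is an assumption here.

```python
READ_STAGE = 1    # D: register file read, cycles after fetch
WRITE_STAGE = 4   # W: writeback, cycles after fetch


def issue_cycles(n_threads, instrs_per_thread):
    """Round-robin issue: {thread: [fetch cycle of each of its instructions]}."""
    return {
        t: [t + k * n_threads for k in range(instrs_per_thread)]
        for t in range(n_threads)
    }


def safe_without_bypass(n_threads, instrs_per_thread=3):
    """True if every writeback precedes the same thread's next register read."""
    for cycles in issue_cycles(n_threads, instrs_per_thread).values():
        for prev, nxt in zip(cycles, cycles[1:]):
            if prev + WRITE_STAGE >= nxt + READ_STAGE:
                return False
    return True


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(f"{n} thread(s): safe without bypassing = {safe_without_bypass(n)}")
```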

At the logic level. Synchronous logic we want to multithread: the critical path is 5. [Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5.] 2X multi-threading: double each register. [Figure 3: The example in Figure 2, 2-slowed. This design now operates on 2 independent data streams.] Modern synthesis will retime this as shown: the critical path is now 2. [Figure 4: The example in Figure 3 after retiming. The combination of C-slowing and retiming reduced the critical path from 5 to 2.] Figures from "Post-Placement C-slow Retiming for the Xilinx Virtex FPGA," Nicholas Weaver, Yury Markovskiy, Yatish Patel, and John Wawrzynek, UC Berkeley.
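A brute-force sketch of why extra registers from C-slowing help retiming. It assumes a linear chain built from the slide's six delays (the real figure is a graph, and the ordering here is hypothetical), and the count of three movable registers after 2-slowing is likewise an assumption for illustration.

```python
from itertools import combinations


def critical_path(delays, cuts):
    """Longest segment when registers sit at the given cut positions.

    A cut at position p places a register between delays[p-1] and delays[p].
    """
    bounds = [0, *cuts, len(delays)]
    return max(sum(delays[a:b]) for a, b in zip(bounds, bounds[1:]))


def best_retiming(delays, n_registers):
    """Try every placement of n_registers; return (critical path, cuts)."""
    positions = range(1, len(delays))
    return min(
        (critical_path(delays, cuts), cuts)
        for cuts in combinations(positions, n_registers)
    )


CHAIN = [2, 1, 1, 1, 1, 2]   # ASSUMED ordering of the slide's delays

if __name__ == "__main__":
    print("1 movable register: ", best_retiming(CHAIN, 1))
    print("3 movable registers:", best_retiming(CHAIN, 3))
```

Exhaustive search is only workable for toy examples; synthesis tools use graph algorithms for the same job.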

Good fit for GALS. Two input queues (red and green). The mux control logic implements turn-taking. Outputs are placed into two output queues.

Crossbar Networks

When register files get big, they get slow. Even worse: adding ports slows them down as O(N^2). Why? The number of loads on each Q goes as O(N), and the wire length to the port mux goes as O(N). [Figure: the register file from earlier, with registers R1 through R31 (R0 is the constant 0), the write demux driven by sel(ws), and two read-port muxes driven by sel(rs1) and sel(rs2).]

Crossbar networks: the general case of this problem. Sun Niagara II: 8 cores, 4MB L2, 4 DRAM channels. Each DRAM channel: 50 GB/s read, 25 GB/s write BW. Crossbar BW: 270 GB/s total (read + write). (Also shared by an I/O port, not shown.)

Sun Niagara II 8 x 9 Crossbar. 8 ports on the CPU side (one per core). 8 ports for L2 banks, plus one for I/O. 100-200 wires/port (each way). 4 cycle latency (715ps/cycle): cycles 1-3 are for arbitration, and data is transmitted on cycle 4. Pipelined.

A complete switch transfer (4 epochs):
Epoch 1: All input ports (that are ready to send data) request an output port.
Epoch 2: The allocation algorithm decides which inputs get to write.
Epoch 3: The allocation system informs the winning inputs and outputs.
Epoch 4: The actual data transfer takes place.
Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, for different sets of requests.

Sun Niagara II 8 x 9 Crossbar. Every cross of blue and purple is a pass gate with a unique control signal. 72 control signals (if distributed unencoded).

Sun Niagara II Crossbar Notes:
Low latency: 4 cycles (less than 3 ns).
Uniform latency between all port pairs.
Crossbar defines the floorplan: all port devices should be equidistant to the crossbar.
Did not scale up for 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port between two cores.
Design alternatives to the crossbar?

Clos networks: from the telecom world. Build a high-port switch by tiling fixed-sized shuffle units. Pipeline registers naturally fit between tiles. Trades scalability for latency.

Clos networks: an example route. Numbers on the left and right are port numbers. Colors show routing paths for an exchange. Arbitration is still needed to prevent blocking.

Electrical Details

Flip Flops Revisited

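Recall: Static RAM cell (6 transistors). Cross-coupled inverters hold a value x and its complement. [Figure: the cell, and the two inverter transfer curves between Gnd and Vdd crossing near Vth, with noise marked on each axis.]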

Recall: Positive edge-triggered flip-flop. A flip-flop samples right before the edge, and then holds the value. [Figure: D feeds a sampling circuit, followed by a circuit that holds the value and drives Q; both are switched by clk and its complement.] 16 transistors: makes an SRAM look compact! What do we get for the 10 extra transistors? Clocked logic semantics.

Sensing: When clock is low. A flip-flop samples right before the edge, and then holds the value. [Figure: the flip-flop with clk = 0 (complement = 1). The sampling circuit will capture the new value on posedge; the holding circuit outputs the last value captured.]

Capture: When clock goes high. A flip-flop samples right before the edge, and then holds the value. [Figure: the flip-flop with clk = 1 (complement = 0). The sampling circuit remembers the value just captured; the holding circuit outputs the value just captured.]

Flip-flop delays: clk-to-q? setup? hold? CLK == 0: sense D, but Q outputs the old value. CLK 0->1: capture D, pass the value to Q. [Figure: the flip-flop circuit annotated with the setup, hold, and clk-to-q paths.]

More Detailed Gate Models

Inverters: Circuits and Layout. [Figure: inverter symbol, transistor circuit (Vin, Vout, Vdd), and layout.]

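Inverter: Die Cross Section. [Figure: cross section showing the nfet (n+ diffusions in the p- substrate) and the pfet (p+ diffusions in an n-well, with an n+ well contact), each with gate oxide under Vin, and the drains joined at Vout.]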

Inverters with Vin = Gnd, Vout = Vdd. pfet (on): Is Vsg > Vt? Is Vsd > Vsg - Vt once Vout is Vdd? Isd = k (W/L) [Vsg - Vt] [Vsd]. Vsd goes as close to 0 as it can while still supplying the leakage current. nfet (off): Ids ≈ 0, but really a small leakage current.

Inverters with Vin = Vdd, Vout = Gnd. pfet (off): Isd ≈ 0, but really a small leakage current. nfet (on): Is Vgs > Vt? Is Vds > Vgs - Vt once Vout is Gnd? Ids = k (W/L) [Vgs - Vt] [Vds]. Vds goes as close to 0 as it can while still supplying the leakage current.
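The two questions on this slide (is Vgs > Vt, is Vds > Vgs - Vt) select the operating region of the nfet, and the slide's current expression applies when Vds is small. A sketch with made-up device parameters; k, W/L, Vt, and the leakage current below are illustrative values, not data for any real process.

```python
def nfet_region(vgs, vds, vt):
    """Classify the operating region using the slide's two questions."""
    if vgs <= vt:
        return "off"
    return "saturation" if vds > vgs - vt else "linear"


def ids_small_vds(k, w_over_l, vgs, vds, vt):
    """Slide's expression Ids = k (W/L) [Vgs - Vt] [Vds], valid for small Vds."""
    return k * w_over_l * (vgs - vt) * vds


def vds_for_current(k, w_over_l, vgs, vt, ids):
    """Vds the on-transistor settles at while carrying current ids."""
    return ids / (k * w_over_l * (vgs - vt))


if __name__ == "__main__":
    K, W_OVER_L, VT, VDD = 100e-6, 2.0, 0.3, 1.0   # ASSUMED parameters
    leak = 1e-9                                     # ASSUMED pfet leakage (A)
    vout = vds_for_current(K, W_OVER_L, VDD, VT, leak)
    print("region with Vout near Gnd:", nfet_region(VDD, vout, VT))
    print(f"Vout sits about {vout * 1e6:.2f} microvolts above Gnd")
```

The same reasoning applies to the pfet on the previous slide with Vsg and Vsd in place of Vgs and Vds.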

On Tuesday: Power and Energy. [Figure: heat sink and heat source.]