CS 250 VLSI System Design. Lecture 3: Timing. 2013-9-5. Professor Jonathan Bachrach (today's lecture by John Lazzaro). TA: Ben Keller. www-inst.eecs.berkeley.edu/~cs250/
Everything doesn't happen at once
Timing, the 10,000 ft view: Locally synchronous, globally asynchronous
On the same page: Minimal set of timing concepts you need for the project
Break
RTL examples: Better timing through micro-architecture
Electrical details: Just so you know
View from 10,000 Ft Google I/O, 2012 3
Moore's Law: 2 Thousand, 1 Million, 26 Billion. Synchronous logic on a single clock domain is not practical for a 26 billion transistor design.
GALS: Globally Asynchronous, Locally Synchronous. Synchronous modules are typically 50K-1M gates, so that the synchronous logic approach works well without requiring heroics. Examples:
IBM Power 5 CPU - Dynamically Scheduled
[Figure: Power 5 pipeline block diagram. Front end: program counter, instruction cache, instruction translation, alternate, branch history tables, branch prediction, return stack, target cache, thread priority, instruction buffers 0 and 1, group formation, instruction decode, dispatch. Back end: shared register mappers, dynamic instruction selection, shared issue queues, read shared register files, shared execution units (LSU0, FXU0, LSU1, FXU1, FPU0, FPU1, BXU, CRL), write shared register files, group completion, data translation, data cache, store queue, L2 cache. Legend: shared by two threads, thread 0 resources, thread 1 resources.]
Stars denote FIFOs that create separate synchronous domains. An example of how architecture and circuits work together.
Rocket uses GALS for accelerator interface. Your project interfaces with the RISC-V pipeline and the memory system using FIFOs. Your timing closure is independent of the CPU logic domain.
Today: Timing insights for your project
What we're not doing: If this class were EE 241 and your project were an SRAM, you could "see through" down to the layout. Timing? Use SPICE on this hand-drawn schematic.
Technology X: The CS 250 timing challenge
What we are doing ---> Logic Synthesis
If your accelerator is too slow, two options:
Top-down (today): Rework high-level micro-architecture. Let Technology X keep its job.
Bottom-up: Take control away from logic synthesis. Use HDL as a textual schematic. Also, use command-line tool flags. Sometimes necessary. Ben is the expert; ask in discussion section.
A Logic Circuit Primer
"Models should be as simple as possible, but no simpler." Albert Einstein
Inverters: A simple transistor model
Inverter: Out = NOT In. Truth table: In = 0 gives Out = 1; In = 1 gives Out = 0.
Circuit: a PMOS between Vdd and Out and an NMOS between Out and ground, both gated by In.
pfet: a switch, on if the gate is grounded. nfet: a switch, on if the gate is at Vdd.
Correctly predicts logic output for simple static CMOS circuits. Extensions to model subtler circuit families, or to predict timing, have not worked well.
Transistors as water valves (Cartoon physics)
If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets:
An "on" p-fet (valve open to Vdd) fills up the capacitor with charge: the water level rises from 0 to 1 over time.
An "on" n-fet (valve open to ground) empties the bucket: the water level falls from 1 to 0 over time.
This model is often good enough.
What is the bucket? A gate's fan-out
Fan-out: the number of gate inputs driven by a gate's output (shown for an inverter and a NAND gate).
Driving other gates slows a gate down. Driving wires slows a gate down. Driving its own parasitics slows a gate down.
Fanout 14
A closer look at fan-out
Driving more gates adds delay. A linear model works for reasonable fan-out.
[Figure: delay of the Out low -> high transition vs. Cout for fan-outs of 1, 2, 3; 0.5 ns axis mark; slope = 0.0021 ns/fF.]
FO4: fanout-of-four delay, the delay time of an inverter driving 4 inverters.
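The linear fan-out model can be sketched in a few lines of Python. The slope is the slide's value, read here as 0.0021 ns/fF; the intrinsic delay `t_intrinsic_ns` and the per-input capacitance `c_in_ff` are hypothetical placeholders chosen for illustration, not numbers from the lecture.

```python
def gate_delay_ns(c_out_ff, slope_ns_per_ff=0.0021, t_intrinsic_ns=0.05):
    """Linear delay model: delay grows in proportion to the load capacitance.
    t_intrinsic_ns is an assumed zero-load delay, not a value from the slide."""
    return t_intrinsic_ns + slope_ns_per_ff * c_out_ff

def fanout_delay_ns(fanout, c_in_ff=2.0):
    """Delay when driving `fanout` identical gate inputs (wire load ignored).
    c_in_ff is an assumed input capacitance per driven gate."""
    return gate_delay_ns(fanout * c_in_ff)

fo1 = fanout_delay_ns(1)
fo4 = fanout_delay_ns(4)  # the FO4 case: one inverter driving four inverters
print(f"FO1 = {fo1 * 1000:.1f} ps, FO4 = {fo4 * 1000:.1f} ps")
```

In this model each extra driven input adds the same increment of delay, which is what "linear model works for reasonable fan-out" means.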
Propagation delay graphs
[Figure: cascaded gates with alternating 1 -> 0 and 0 -> 1 transitions; inverter transfer function, Vout vs. Vin.]
Worst-case delay through combinational logic
x = g(a, b, c, d, e, f)
If d going 0-to-1 switches x 0-to-1, delay is T1. If a going 0-to-1 switches x 0-to-1, delay is T2. It would be surprising if T1 > T2.
T2 might be the worst-case delay path (critical path).
Why "might"? Wires have delay too
Looks benign, but:
Even in those cases where the transmission line effect is negligible, wires possess distributed resistance and capacitance. The time constant associated with distributed RC is proportional to the square of the length.
For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important. Typically around half of the C of a gate load is in the wires.
For long wires on ICs (busses, clock lines, global control signals, etc.), resistance is significant, therefore the distributed RC effect dominates. Signals are typically "rebuffered" to reduce delay.
[Figure: RC ladder with nodes v1, v2, v3, v4 and their waveforms over time. Source: Spring 2003 EECS150 Lec10-Timing, page 16.]
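A rough sketch of why long wires get rebuffered, using the standard Elmore estimate for a distributed RC line (delay of about half the total R times the total C). The per-millimeter resistance and capacitance and the buffer delay below are hypothetical values picked only to show the trend; driver resistance and receiver load are ignored.

```python
def wire_delay_ps(r_ohm_per_mm, c_ff_per_mm, length_mm):
    """Elmore-style estimate for a distributed RC wire: 0.5 * R_total * C_total.
    Both totals scale with length, so the delay scales with length squared."""
    r_total = r_ohm_per_mm * length_mm
    c_total = c_ff_per_mm * length_mm
    return 0.5 * r_total * c_total / 1000.0  # 1 ohm * 1 fF = 0.001 ps

def rebuffered_delay_ps(r_ohm_per_mm, c_ff_per_mm, length_mm, segments, t_buf_ps):
    """Split the wire into equal segments with one buffer between neighbours."""
    seg = wire_delay_ps(r_ohm_per_mm, c_ff_per_mm, length_mm / segments)
    return segments * seg + (segments - 1) * t_buf_ps

R, C = 1000.0, 200.0  # assumed ohm/mm and fF/mm, not process data
for length in (1.0, 2.0, 4.0):
    plain = wire_delay_ps(R, C, length)
    buffered = rebuffered_delay_ps(R, C, length, segments=4, t_buf_ps=30.0)
    print(f"{length} mm: {plain:.1f} ps unbuffered, {buffered:.1f} ps in 4 segments")
```

With these made-up numbers, doubling the length quadruples the unbuffered delay, and splitting into buffered segments only pays off once the wire is long enough that the quadratic term outweighs the added buffer delay.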
Clocked Logic Circuits 19
From Delay Models to Timing Analysis
Timing analysis: What is the smallest clock period T that produces correct operation?
f          T
1 MHz      1 μs
10 MHz     100 ns
100 MHz    10 ns
1 GHz      1 ns
Timing Analysis and Logic Delay
Register: an array of flip-flops, feeding combinational logic (CL).
If our clock period T > worst-case delay through CL, does this ensure correct operation?
Flip-flops have internal delays
The value of D is sampled on the positive clock edge. Q outputs the sampled value for the rest of the cycle.
[Timing diagram: CLK, D, Q, with t_setup before the clock edge and t_clk-to-q after it.]
Flip-flop delays eat into the "time budget"
ALU "time budget" (combinational logic between registers): $T \geq \tau_{clk \to q} + \tau_{CL} + \tau_{setup}$
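The time budget can be turned into a small calculator for the minimum clock period and the corresponding maximum frequency. The delay values used below are hypothetical; only the inequality itself comes from the slide.

```python
def min_period_ns(t_clk_to_q_ns, t_cl_ns, t_setup_ns):
    """Smallest T that satisfies T >= t_clk_to_q + t_CL + t_setup."""
    return t_clk_to_q_ns + t_cl_ns + t_setup_ns

def max_freq_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns

# Assumed example delays, in ns.
T = min_period_ns(t_clk_to_q_ns=0.1, t_cl_ns=1.7, t_setup_ns=0.2)
print(f"T >= {T:.2f} ns, f <= {max_freq_mhz(T):.0f} MHz")
```

Note that the flip-flop terms are paid on every register-to-register path, so the combinational logic never gets the whole cycle.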
Clock skew also eats into the "time budget"
CLKd is CLK with clock skew (delay in distribution). As T → 0, which circuit fails first?
$T \geq T_{CL} + T_{setup} + T_{clk \to q} + \text{worst-case skew}$
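One way to explore the "which circuit fails first?" question is to give skew a sign. The sketch below assumes a convention that is not stated on the slide: `skew_ns` is the clock arrival time at the capturing register minus the arrival time at the launching register. The delay values are hypothetical.

```python
def setup_slack_ns(period_ns, t_clk_to_q_ns, t_cl_ns, t_setup_ns, skew_ns):
    """Positive slack means the setup check passes.
    skew_ns = capture clock arrival - launch clock arrival (assumed convention)."""
    return period_ns + skew_ns - (t_clk_to_q_ns + t_cl_ns + t_setup_ns)

def min_period_with_skew_ns(t_clk_to_q_ns, t_cl_ns, t_setup_ns, skew_ns):
    """Smallest period with zero setup slack for a given signed skew."""
    return t_clk_to_q_ns + t_cl_ns + t_setup_ns - skew_ns

# Two hypothetical placements of a 50 ps clock delay.
late_capture = min_period_with_skew_ns(0.1, 1.7, 0.2, skew_ns=+0.05)
late_launch = min_period_with_skew_ns(0.1, 1.7, 0.2, skew_ns=-0.05)
print(f"capture clock late: T >= {late_capture:.2f} ns")
print(f"launch clock late:  T >= {late_launch:.2f} ns")
```

Under this convention, delaying the launching register's clock tightens the setup check, so that arrangement runs out of time first as T shrinks. Since the sign of skew on a given path is usually not known in advance, the budget on the slide charges the worst case.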
Clock Tree Delays, IBM Power CPU
[Figure: clock distribution with buffer level 1, buffer level 2, sector buffers, tuned sector trees, and the grid; delay plotted over the x and y chip coordinates.]
Clock Tree Delays, IBM Power
[Figure: clock waveforms, Volts (V) vs. Time (ps), showing 20 ps skew; delay plotted over the x and y chip coordinates; multiple-fingered transmission line.]
Some flip-flops have "hold" time
[Timing diagram: CLK, D, Q, with t_setup, t_hold, and t_inv marked. D must stay stable during the setup and hold window.]
What is the intended function of this circuit? Does flip-flop hold time affect operation of this circuit? Under what conditions?
For correct operation: t_clk-to-q + t_inv > t_hold
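The hold condition on this slide is a one-line check. The sketch below applies it to two sets of hypothetical delays; the function and parameter names are illustrative only.

```python
def hold_ok(t_clk_to_q_ps, t_inv_ps, t_hold_ps):
    """Hold check for a flip-flop whose D input is fed back from Q through
    an inverter: new data must not reach D until the hold window has closed."""
    return t_clk_to_q_ps + t_inv_ps > t_hold_ps

# Assumed delays in ps.
print(hold_ok(t_clk_to_q_ps=50, t_inv_ps=20, t_hold_ps=60))  # slow enough path
print(hold_ok(t_clk_to_q_ps=20, t_inv_ps=10, t_hold_ps=60))  # path is too fast
```

Unlike the setup check, the clock period T does not appear here: a hold violation cannot be fixed by slowing the clock.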
Searching for the processor critical path
Timing analysis: What is the smallest T that produces correct operation? Must consider all connected register pairs.
Why might I suspect this one?
Combinational paths for IBM Power 4 CPU
[Figure: histogram of late-mode timing checks (thousands) vs. timing slack (ps).]
The critical path sits at the low-slack end. Most wires have hundreds of picoseconds to spare.
From "The circuit and physical design of the POWER4 microprocessor", IBM J. Res. and Dev., 46:1, Jan 2002, J. D. Warnock et al.
How to retime logic
Circles are combinational logic, labelled with delays. Critical path is 5. We want to improve it without changing circuit semantics.
Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5.
Add a register, move one circle. Performance improves by 20%.
Figure 2: The example in Figure 2 after retiming. The critical path is reduced from 5 to 4.
Technology X can often do this.
From "Post-Placement C-slow Retiming for the Xilinx Virtex FPGA", Nicholas Weaver, Yury Markovskiy, Yatish Patel, John Wawrzynek, UC Berkeley, Berkeley, CA.
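To get a feel for retiming, here is a minimal sketch on a simplified stand-in: a linear chain of logic nodes with the same delays as the slide (1, 1, 1, 1, 2, 2) and one movable register. The chain topology and the starting register position are assumptions; the paper's figure is a more general graph.

```python
from itertools import combinations

def stage_delays(delays, cuts):
    """Sum node delays between consecutive registers.
    `cuts` holds indices i meaning "a register sits after node i"."""
    stages, start = [], 0
    for cut in sorted(cuts):
        stages.append(sum(delays[start:cut + 1]))
        start = cut + 1
    stages.append(sum(delays[start:]))
    return stages

def critical_path(delays, cuts):
    """The slowest register-to-register stage sets the clock period."""
    return max(stage_delays(delays, cuts))

def best_cuts(delays, n_registers):
    """Brute-force the register placement that minimizes the critical path."""
    positions = range(len(delays) - 1)
    return min(combinations(positions, n_registers),
               key=lambda cuts: critical_path(delays, cuts))

chain = [1, 1, 1, 1, 2, 2]           # hypothetical chain, not the paper's graph
before = critical_path(chain, (2,))  # register after the third node: stages 3 and 5
after = critical_path(chain, best_cuts(chain, 1))
print(before, after)
```

Moving the register by one node balances the two stages at 4 each, a 20% shorter critical path. Real retiming tools solve this on arbitrary graphs rather than by brute force.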
Power 4: Timing Estimation, Closure
Timing estimation: predicting a processor's clock rate early in the project.
From "The circuit and physical design of the POWER4 microprocessor", IBM J. Res. and Dev., 46:1, Jan 2002, J. D. Warnock et al.
Power 4: Timing Estimation, Closure
Timing closure: meeting (or exceeding!) the timing estimate.
From "The circuit and physical design of the POWER4 microprocessor", IBM J. Res. and Dev., 46:1, Jan 2002, J. D. Warnock et al.
Floorplanning: essential to meet timing (Intel XScale 80200)
Break 35
Simple exercises for gaining intuition about timing for your process + EDA tools. Thanks to Bhupesh Dasila, Open-Silicon Bangalore.
Synthesize gate chains using hand-specified library cells
Exercises the cell library and the place and route tools. Helps you "see through" Technology X.
Example: chains of weak NANDs, 40 nm process, 29 ps/gate average, synthesis constrained to a 2 ns clock, swept over chain lengths. Lets you know how many levels of logic you can use in the best case.
Delay of a chain of 3 inverters with strongest strength: the "guaranteed not to exceed" speed.
Bhupesh Dasila
Force P&R to drive a long wire with a known buffer cell
Vary driver strength, wire length, metal layer. Shows the maximum distance apart two gates can be placed and still meet your clock period.
That distributed RC delay goes as the square of the length is clearly seen!
Bhupesh Dasila
Driving Large Loads
Large fanout nets: clocks, resets, memory bit lines, off-chip.
A relatively small driver results in long rise time (and thus large gate delay).
Strategy: staged buffers. Optimal trade-off between delay per stage and total number of stages: fanout of 4-6 per stage.
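The trade-off between delay per stage and number of stages can be sketched with a simplified model in which each buffer stage costs its fanout plus a fixed parasitic term, in units of a unit-inverter time constant. This model, the `parasitic` value, and the function names are assumptions for illustration, not the lecture's derivation.

```python
def buffer_chain_delay(load_ratio, n_stages, parasitic=1.0):
    """Delay of n equal-ratio stages driving a load `load_ratio` times
    larger than the first stage's input capacitance (normalized units)."""
    fanout_per_stage = load_ratio ** (1.0 / n_stages)
    return n_stages * (fanout_per_stage + parasitic)

def best_stage_count(load_ratio, max_stages=12, parasitic=1.0):
    """Try every stage count and keep the fastest."""
    return min(range(1, max_stages + 1),
               key=lambda n: buffer_chain_delay(load_ratio, n, parasitic))

ratio = 64.0  # hypothetical load: 64x the input capacitance of the first buffer
n = best_stage_count(ratio)
print(n, ratio ** (1.0 / n), buffer_chain_delay(ratio, n))
```

For this load the model picks three stages with a fanout of about 4 each. The minimum is shallow, which is one reason a range of per-stage fanouts works well in practice.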
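Register file: Synthesize, or use SRAM?
[Figure: register file with 32-bit registers R1 to R31; R0 is the constant 0. Write port: the 5-bit sel(ws) drives a demux that gates WE to each register's enable (En); all registers share the 32-bit wd input and clk. Two read ports: muxes selected by the 5-bit sel(rs1) and sel(rs2) produce the 32-bit rd1 and rd2.]
Speed will depend on how large it lays out.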
Synthesized, custom, and SRAM-based register files, 40nm
[Figure: data points for synthesis, SRAMs, and a register file compiler.]
For small register files, logic synthesis is competitive. Not clear if the SRAM data points include area for register control, etc.
Figure 3: Using the raw area data, the physical implementation team can get a more accurate area estimation early in the RTL development stage for floorplanning purposes. This shows an example of this graph for a 1-port, 32-bit-wide SRAM.
Bhupesh Dasila
Techniques 42
Pipelining 43
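Starting point: A single-cycle processor
Challenge: Speed up clock while keeping CPI == 1.
$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
Cycles/Instruction: CPI == 1, this is good. Seconds/Cycle: slow, this is bad.
[Figure: single-cycle datapath. PC register with a +0x4 adder, instruction memory (Addr, Data), register file (rs1, rs2, ws, wd, WE, rd1, rd2), sign extender (Ext), ALU (op), data memory (Addr, Din, Dout, WE), MemToReg mux.]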
Reminder: How data flows after posedge
[Figure: after the clock edge, data flows from the PC through the instruction memory, the register file (5-bit rs1, rs2, ws; 32-bit wd, rd1, rd2), and logic to the 32-bit ALU output.]
Next posedge: Update state and repeat
[Figure: the state elements updated on the clock edge: the PC register and the register file (written through ws, wd, WE).]
Observation: Logic idle most of cycle
For most of the cycle, the ALU is either waiting for its inputs, or holding its output.
Ideal: a CPU architecture where each part is always working.
[Figure: the single-cycle datapath, repeated.]
Inspiration: Automobile assembly line
Assembly line moves on a steady clock. Each station does the same task on each car.
[Photo labels: the clock, car body shell, car chassis, merge station, bolting station.]
Inspiration: Automobile assembly line
Simpler station tasks → more cars per hour. Simple tasks take less time, so the clock is faster.
Inspiration: Automobile assembly line
Line speed is limited by the slowest task. Most efficient if all tasks take the same time to do.
Inspiration: Automobile assembly line
Simpler tasks, complex car → long line! These lines go 24 x 7, and rarely shut down.
Lessons from car assembly lines
Faster line movement yields more cars per hour off the line.
Faster line movement requires more stages, each doing simpler tasks.
To maximize efficiency, all stages should take the same amount of time (if not, workers in fast stages are idle).
Filling, flushing, and stalling the assembly line are all bad news.
Key Analogy: The instruction is the car
[Figure: five pipeline stages. Stage #1 is instruction fetch (PC, +0x4 adder, instruction memory). A copy of the instruction register (IR) travels down the pipeline; the IR in each stage controls the hardware in that stage (stages 2 through 5).]
"Data-stationary control"
Example: Decode & Register Fetch Stage
A sample program:
ADD R4,R3,R2
OR R7,R6,R5
SUB R10,R9,R8
R's chosen so that instructions are independent - like cars on the line.
[Figure: pipeline snapshot. Stage #1 (Instr Fetch) holds SUB R10,R9,R8; Stage #2 (Decode & Reg Fetch) holds OR R7,R6,R5 and reads the RegFile into A and B; Stage #3 holds ADD R4,R3,R2.]
Hazards: An instruction is not a car
New sample program:
ADD R4,R3,R2
OR R5,R4,R2
[Figure: pipeline snapshot. Stage #2 (Decode & Reg Fetch) holds OR R5,R4,R2; Stage #3 holds ADD R4,R3,R2, so R4 is not written yet.]
Oops! The wrong value of R4 is fetched from the RegFile: the contract with the programmer is broken!
An example of a "hazard" -- we must (1) detect and (2) resolve all hazards to make a CPU that matches the ISA.
Performance Equation and Hazards
$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
Cycles/Instruction: some ways to cope with hazards (stalling the pipeline) make CPI > 1.
Seconds/Cycle: added logic to detect and resolve hazards increases the clock period.
Instructions/Program: "Software slows the machine down." Seymour Cray
[Figure: the first three pipeline stages (Instr Fetch, Decode & Reg Fetch, Stage #3), repeated.]
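The performance equation can be evaluated directly to see how a CPI penalty trades off against a faster clock. The instruction count, CPI, and clock periods below are hypothetical numbers, not measurements of any real design.

```python
def seconds_per_program(instructions, cpi, period_ns):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instructions * cpi * period_ns * 1e-9

N = 1_000_000  # assumed dynamic instruction count
single = seconds_per_program(N, cpi=1.0, period_ns=10.0)  # single-cycle design
piped = seconds_per_program(N, cpi=1.3, period_ns=2.5)    # pipelined, with stalls
print(f"speedup = {single / piped:.2f}x")
```

In this made-up comparison the pipelined design wins even though hazards push CPI above 1, because the clock period shrank by more than CPI grew.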
Superpipelining 57
Superpipelining: Add more stages
$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
Goal: Reduce the critical path by adding more pipeline stages.
Example: 8-stage ARM XScale: extra IF, ID, data cache stages.
Difficulties: Added penalties for load delays and branch misses. Also, power!
Ultimate limiter: As logic delay goes to 0, FF clk-to-q and setup remain.
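The "ultimate limiter" can be seen in an idealized model where the logic splits evenly across stages and every stage still pays the flip-flop overhead. The total logic delay and flip-flop delays below are hypothetical; real pipelines do not split this evenly.

```python
def clock_period_ps(t_logic_ps, n_stages, t_clk_to_q_ps=30.0, t_setup_ps=20.0):
    """Ideal pipelining: logic delay divides evenly across n_stages, and each
    stage adds the flip-flop clk-to-q and setup overhead (assumed values)."""
    return t_logic_ps / n_stages + t_clk_to_q_ps + t_setup_ps

OVERHEAD_PS = 30.0 + 20.0
for n in (1, 5, 8, 20, 80):
    t = clock_period_ps(2000.0, n)
    print(f"{n:3d} stages: T = {t:7.1f} ps, flip-flop share = {OVERHEAD_PS / t:5.1%}")
```

As the stage count grows, the period approaches the flip-flop overhead and each added stage buys less, while the load-delay and branch-miss penalties noted on the slide keep growing.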
Note: Some stages now overlap, some instructions take extra stages
5 stage: IF, ID+RF, EX, MEM, WB.
8 stage: IF now takes 2 stages (pipelined I-cache). ID and RF each get a stage. ALU split over 3 stages. MEM takes 2 stages (pipelined D-cache).
Superpipelining techniques
Split ALU and decode logic over several pipeline stages.
Pipeline memory: Use more banks of smaller arrays, add pipeline stages between decoders, muxes.
Remove rarely-used forwarding networks that are on the critical path. Creates stalls, affects CPI.
Pipeline the wires of frequently used forwarding networks.
Also: Clocking tricks (example: use posedge and negedge registers).
Hardware limits to superpipelining?
[Figure: CPU clock periods 1985-2005, in FO4 delays (0 to 100) by year (85 to 05). Series: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64. Annotations: MIPS 2000, 5 stages; Pentium Pro, 10 stages; Pentium 4, 20 stages.]
FO4: how many fanout-of-4 inverter delays fit in the clock period.
Historical limit: about 12 FO4s. Power wall: Intel Core Duo has 14 stages.
Thanks to Francois Labonte, Stanford.
CPU DB: Recording Microprocessor History
With this open database, you can mine microprocessor trends over the past 40 years. Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, Mark Horowitz, Stanford University.
[Figure: FO4 delays per cycle for processor designs, FO4/cycle (0 to 140) vs. year (1985 to 2015).]
FO4 delay per cycle is roughly proportional to the amount of computation completed per cycle.
Multithreading 63
Multithreading of Static Pipelines
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
                        t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LW   r1, 0(r2)      F  D  X  M  W
T2: ADD  r7, r1, r4        F  D  X  M  W
T3: XORI r5, r4, #12          F  D  X  M  W
T4: SW   0(r7), r5               F  D  X  M  W
T1: LW   r5, 12(r1)                 F  D  X  M  W
Last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile.
4 CPUs, each run at 1/4 clock. Many variants.
[Figure: pipeline with four PCs and four GPR files, chosen by a 2-bit thread select; shared I$, IR, X, Y, D$.]
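The interleaving rule can be checked with a short schedule generator. The sketch assumes round-robin issue of one instruction per cycle, that the register file is read in D and written in W, and that a write must finish in a strictly earlier cycle than the dependent read; these are modeling assumptions, not details taken from the slide.

```python
STAGES = "FDXMW"

def schedule(n_threads, n_instrs_per_thread):
    """Round-robin issue: one instruction per cycle, threads taking turns.
    Returns {(thread, instr_index): {stage: cycle}}."""
    table = {}
    cycle = 0
    for i in range(n_instrs_per_thread):
        for t in range(n_threads):
            table[(t, i)] = {s: cycle + k for k, s in enumerate(STAGES)}
            cycle += 1
    return table

def no_hazards(n_threads, n_instrs_per_thread=3):
    """True if every instruction finishes W before the next instruction
    of the same thread reaches D (where the register file is read)."""
    table = schedule(n_threads, n_instrs_per_thread)
    return all(table[(t, i)]["W"] < table[(t, i + 1)]["D"]
               for t in range(n_threads)
               for i in range(n_instrs_per_thread - 1))

for n in (1, 2, 3, 4):
    print(n, "threads:", "safe" if no_hazards(n) else "needs bypassing or stalls")
```

Under these assumptions, four threads is the smallest count for which a 5-stage pipe needs no bypassing or interlocks within a thread.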
At the logic level
Synchronous logic we want to multithread. Critical path is 5.
Figure 1: A small graph before retiming. The nodes represent logic delays, with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5.
2X multi-threading: double each register.
Figure 3: The example in Figure 2 2-slowed. This design now operates on 2 independent data streams.
Modern synthesis will retime this as shown: critical path is now 2.
Figure 4: The example in Figure 3 after retiming. The combination of C-slowing and retiming reduced the critical path from 5 to 2.
From "Post-Placement C-slow Retiming for the Xilinx Virtex FPGA", Nicholas Weaver, Yury Markovskiy, Yatish Patel, John Wawrzynek, UC Berkeley, Berkeley, CA.
Good fit for GALS
Two input queues (red and green). The mux control logic implements turn-taking. Outputs are placed into two output queues.
Crossbar Networks 67
When register files get big, they get slow
Even worse: adding ports slows down as $O(N^2)$.
Why? The number of loads on each Q goes as $O(N)$, and the wire length to the port mux goes as $O(N)$.
[Figure: register file with write demux (sel(ws), WE, wd, clk), registers R1 to R31 with R0 the constant 0, and two read-port muxes (sel(rs1), sel(rs2)) producing rd1 and rd2.]
Crossbar networks: general case of this problem
Sun Niagara II: 8 cores, 4MB L2, 4 DRAM channels.
Each DRAM channel: 50 GB/s read, 25 GB/s write BW.
Crossbar BW: 270 GB/s total (read + write). (Also shared by an I/O port, not shown.)
Sun Niagara II 8 x 9 Crossbar
8 ports on the CPU side (one per core). 8 ports for L2 banks, plus one for I/O.
100-200 wires/port (each way).
4 cycle latency (715 ps/cycle). Cycles 1-3 are for arbitration. Transmit data on cycle 4. Pipelined.
A complete switch transfer (4 epochs)
Epoch 1: All input ports (that are ready to send data) request an output port.
Epoch 2: Allocation algorithm decides which inputs get to write.
Epoch 3: Allocation system informs the winning inputs and outputs.
Epoch 4: Actual data transfer takes place.
Allocation is pipelined: a data transfer happens on every cycle, as do the three allocation stages, for different sets of requests.
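The pipelining of the four epochs can be laid out as a small occupancy table. The epoch names and the idea of numbering request "batches" by the cycle they enter are illustrative assumptions; the 715 ps cycle time is the figure quoted for the Niagara II crossbar.

```python
EPOCHS = ["request", "allocate", "notify", "transfer"]

def pipeline_occupancy(n_batches):
    """Map cycle -> {epoch: batch} when a new batch of requests enters
    the switch on every cycle."""
    occupancy = {}
    for batch in range(n_batches):
        for offset, epoch in enumerate(EPOCHS):
            occupancy.setdefault(batch + offset, {})[epoch] = batch
    return occupancy

def transfer_cycle(batch):
    """Cycle in which a batch that entered on cycle `batch` moves its data."""
    return batch + len(EPOCHS) - 1

LATENCY_PS = len(EPOCHS) * 715  # four cycles at 715 ps each

for cycle, busy in sorted(pipeline_occupancy(6).items()):
    print(cycle, busy)
print("latency:", LATENCY_PS, "ps")
```

Once the pipeline is full, every cycle shows all four epochs busy with different batches, so throughput is one transfer per cycle even though each transfer takes four cycles end to end.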
Sun Niagara II 8 x 9 Crossbar
Every cross of blue and purple is a pass gate with a unique control signal. 8 x 9 = 72 control signals (if distributed unencoded).
Sun Niagara II Crossbar Notes
Low latency: 4 cycles (less than 3 ns). Uniform latency between all port pairs.
Crossbar defines floorplan: all port devices should be equidistant to the crossbar.
Did not scale up for 16-core Rainbow Falls. Rainbow Falls keeps the 8 x 9 crossbar, and shares each CPU-side port with two cores.
Design alternatives to crossbar?
Clos networks: from the telecom world
Build a high-port switch by tiling fixed-sized shuffle units. Pipeline registers naturally fit between tiles. Trades scalability for latency.
Clos networks: an example route
Numbers on left and right are port numbers. Colors show routing paths for an exchange. Arbitration is still needed to prevent blocking.
Electrical Details 77
Flip Flops Revisited 78
Recall: Static RAM cell (6 Transistors)
[Figure: cross-coupled inverters storing x and its complement; transfer curves between Gnd and Vdd with thresholds Vth and noise margins marked.]
Recall: Positive edge-triggered flip-flop
A flip-flop "samples" right before the edge, and then "holds" value.
[Figure: D-to-Q circuit built from a sampling circuit followed by a circuit that holds the value, both switched by clk and its complement. Annotation, truncated in the source: "Clock to Q delay results fr".]
16 transistors: makes an SRAM look compact! What do we get for the 10 extra transistors? Clocked logic semantics.
Sensing: When clock is low
A flip-flop "samples" right before the edge, and then "holds" value.
[Figure: the flip-flop circuit with clk = 0 (complement = 1). The sampling circuit will capture the new value on posedge; the holding circuit outputs the last value captured.]
Capture: When clock goes high
A flip-flop "samples" right before the edge, and then "holds" value.
[Figure: the flip-flop circuit with clk = 1 (complement = 0). The sampling circuit remembers the value just captured; the holding circuit outputs the value just captured.]
Flip-flop delays: clk-to-q? setup? hold?
CLK == 0: Sense D, but Q outputs old value.
CLK 0->1: Capture D, pass value to Q.
[Timing diagram: setup and hold marked on D around the rising CLK edge; clk-to-q marked from the edge to Q.]
More Detailed Gate Models 84
Inverters: Circuits and Layout
[Figure: inverter symbol, circuit, and layout, with Vin, Vout, and Vdd labelled.]
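Inverter: Die Cross Section
[Figure: cross section. NMOS: n+ source and drain in the p- substrate under a gate oxide driven by Vin. PMOS: p+ source and drain in an n-well (with n+ well contact) under a gate oxide driven by Vin. The two drains join at Vout.]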
Inverters with Vin = Gnd, Vout = Vdd
PMOS (on): Is $V_{sg} > V_t$? Is $V_{sd} > V_{sg} - V_t$ once Vout is Vdd? $I_{sd} = k \frac{W}{L} (V_{sg} - V_t) V_{sd}$. $V_{sd}$ goes as close to 0 as it can while still supplying the leakage current.
NMOS (off): $I_{ds} \approx 0$, but really a small leakage current.
Inverters with Vin = Vdd, Vout = Gnd
PMOS (off): $I_{sd} \approx 0$, but really a small leakage current.
NMOS (on): Is $V_{gs} > V_t$? Is $V_{ds} > V_{gs} - V_t$ once Vout is Gnd? $I_{ds} = k \frac{W}{L} (V_{gs} - V_t) V_{ds}$. $V_{ds}$ goes as close to 0 as it can while still supplying the leakage current.
On Tuesday: Power and Energy
[Figure labels: heat sink, heat source.]