EECS150 - Digital Design Lecture 18 - Circuit Timing (2)

March 17, 2010
John Wawrzynek
Spring 2010 EECS150 - Lec18-timing(2)

In General...

For correct operation: $T \ge \tau_{clk \to Q} + \tau_{CL} + \tau_{setup}$ for all paths.

How do we enumerate all paths? Any circuit input or register output to any register input or circuit output?

Note: setup time for outputs is a function of what it connects to. clk-to-q for circuit inputs depends on where it comes from.
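As a minimal sketch of the constraint above, the following Python computes the smallest legal clock period as the maximum of clk-to-Q + CL + setup over an enumerated list of paths. The `Path` record, the path names, and all delay values (in ns) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    t_clk_q: float  # clk-to-Q of the launching register (or input arrival), ns
    t_cl: float     # combinational logic delay along the path, ns
    t_setup: float  # setup time of the capturing register (or output requirement), ns


def path_delay(p: Path) -> float:
    """Total delay that must fit inside one clock period."""
    return p.t_clk_q + p.t_cl + p.t_setup


def min_clock_period(paths) -> float:
    """The period must satisfy T >= delay for ALL paths, so take the maximum."""
    return max(path_delay(p) for p in paths)


# Hypothetical paths: register-to-register, input-to-register, register-to-output.
paths = [
    Path("reg_a -> reg_b", 0.5, 3.0, 0.25),
    Path("input -> reg_a", 1.0, 2.0, 0.25),
    Path("reg_b -> output", 0.5, 1.5, 1.0),
]

t_min = min_clock_period(paths)
print(f"minimum clock period: {t_min} ns, f_max: {1000.0 / t_min:.1f} MHz")
```

The input and output paths reuse the same three fields, with the "clk-to-Q" of an input standing in for its arrival time and the "setup" of an output standing in for whatever the next stage requires.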

Gate Delay is the Result of Cascading

Cascaded gates: transfer curve for inverter.

Delay in Flip-flops

Setup time results from delay through first latch.
Clock to Q delay results from delay through second latch.

[Figure: flip-flop latch circuits with clk inputs]

Wire Delay

Even in those cases where the transmission line effect is negligible:
Wires possess distributed resistance and capacitance.
The time constant associated with distributed RC is proportional to the square of the length.

[Figure: distributed RC wire model with nodes v1, v2, v3, v4]

For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important. Typically around half of the C of a gate load is in the wires.

For long wires on ICs (busses, clock lines, global control signals, etc.), resistance is significant, therefore the distributed RC effect dominates. Signals are typically rebuffered to reduce delay.

[Figure: waveforms of v1, v2, v3, v4 versus time]

Delay and Fan-out

[Figure: gate 1 driving gates 2 and 3]

The delay of a gate is proportional to its output capacitance. Connecting the output of gate 1 to more gate inputs increases its output capacitance. Therefore, it takes increasingly longer for the output of a gate to reach the switching threshold of the gates it drives as we add more output connections.

Driving wires also contributes to fan-out delay.

What can be done to remedy this problem in large fan-out situations?
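The quadratic dependence on wire length, and the reason rebuffering helps, can be sketched numerically. This is a rough model under stated assumptions: the wire delay is approximated as 0.5 * R_total * C_total (an Elmore-style estimate), buffer input and output loading are ignored, and the per-millimeter resistance, capacitance, and buffer delay are made-up illustrative values.

```python
def wire_delay_ns(r_kohm_per_mm, c_pf_per_mm, length_mm):
    """Estimate distributed RC delay as 0.5 * R_total * C_total (kOhm * pF = ns)."""
    r_total = r_kohm_per_mm * length_mm
    c_total = c_pf_per_mm * length_mm
    return 0.5 * r_total * c_total


def rebuffered_delay_ns(r_kohm_per_mm, c_pf_per_mm, length_mm, segments, t_buf_ns):
    """Split the wire into equal segments with a buffer between adjacent segments."""
    per_segment = wire_delay_ns(r_kohm_per_mm, c_pf_per_mm, length_mm / segments)
    return segments * per_segment + (segments - 1) * t_buf_ns


# Hypothetical wire parameters (not from the lecture).
R_KOHM_PER_MM = 0.1
C_PF_PER_MM = 0.2

for length in (5, 10, 20):
    delay = wire_delay_ns(R_KOHM_PER_MM, C_PF_PER_MM, length)
    print(f"{length:2d} mm unbuffered: {delay:.2f} ns")

buffered = rebuffered_delay_ns(R_KOHM_PER_MM, C_PF_PER_MM, 20, segments=4, t_buf_ns=0.05)
print(f"20 mm with 4 segments: {buffered:.2f} ns")
```

Doubling the length quadruples the unbuffered delay, while splitting the wire into N segments divides the wire term by N at the cost of N - 1 buffer delays.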

Critical Path

Critical Path: the path in the entire design with the maximum delay. This could be from state element to state element, from input to state element, from state element to output, or from input to output (unregistered paths).

For example, what is the critical path in this circuit?

Why do we care about the critical path?

Searching for processor critical path

Must consider all connected register pairs, paths from input to register, and paths from register to output. Don't forget the controller.

Design tools help in the search. Synthesis tools report delays on paths, special static timing analyzers accept a design netlist and report path delays, and, of course, simulators can be used to determine timing performance.

Tools that are expected to do something about the timing behavior (such as synthesizers) also include provisions for specifying input arrival times (relative to the clock) and output requirements (set-up times of the next stage).
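To make the search concrete, here is a toy version of what a static timing analyzer does: a longest-path search over an acyclic netlist graph. It is only a sketch, not how any particular tool works. The netlist, node names, and delays (in ns) are hypothetical; clk-to-Q is modeled as the delay of the launching node and setup time as the delay of the capturing node.

```python
def critical_path(fanout, delay):
    """Return (total_delay, node_list) of the slowest path in an acyclic netlist.

    fanout: dict mapping a node to the list of nodes it drives.
    delay:  dict mapping every node to its own delay contribution.
    """
    memo = {}

    def longest_from(node):
        if node not in memo:
            tails = [longest_from(nxt) for nxt in fanout.get(node, [])]
            tail_delay, tail_path = max(tails, key=lambda t: t[0], default=(0.0, []))
            memo[node] = (delay[node] + tail_delay, [node] + tail_path)
        return memo[node]

    # Try every node as a starting point and keep the slowest path found.
    return max((longest_from(n) for n in delay), key=lambda r: r[0])


# Hypothetical netlist: regA drives two gates that reconverge before regB.
delay = {"regA.q": 0.5, "and1": 1.0, "xor1": 1.5, "or1": 1.0, "regB.d": 0.25}
fanout = {
    "regA.q": ["and1", "xor1"],
    "and1": ["or1"],
    "xor1": ["or1"],
    "or1": ["regB.d"],
}

total, path = critical_path(fanout, delay)
print(f"critical path: {' -> '.join(path)} ({total} ns)")
```

Memoizing the result per node keeps the search linear in the number of edges, which matters because the number of distinct paths can grow exponentially with reconvergent fan-out.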

Real Stuff: Timing Analysis

The critical path. Most paths have hundreds of picoseconds to spare.

[Figure: histogram of late-mode timing checks (thousands) versus timing slack (ps)]

From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

Clock Skew

Unequal delay in distribution of the clock signal to various parts of a circuit: if not accounted for, can lead to erroneous behavior.

Comes about because:
clock wires have delay,
circuit is designed with a different number of clock buffers from the clock source to the various clock loads, or
buffers have unequal delay.

[Figure: clock skew, delay in distribution]

All synchronous circuits experience some clock skew: it is more of an issue for high-performance designs operating with very little extra time per clock cycle.

Clock Skew (cont.)

[Figure: two registers with combinational logic (CL) between them; clock skew, delay in distribution]

If clock period $T = T_{CL} + T_{setup} + T_{clk \to Q}$, circuit will fail.

Therefore:
1. Control clock skew
   a) Careful clock distribution. Equalize path delay from clock source to all clock loads by controlling wire delay and buffer delay.
   b) Don't gate clocks in a non-uniform way.
2. $T \ge T_{CL} + T_{setup} + T_{clk \to Q} + \text{worst case skew}$.

Most modern large high-performance chips (microprocessors) control end-to-end clock skew to a small fraction of the clock period.

Clock Skew (cont.)

[Figure: same circuit with the clock buffer reversed]

Note reversed buffer. In this case, clock skew actually provides extra time (adds to the effective clock period). This effect has been used to help run circuits at higher clock rates. Risky business!
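Both skew cases can be captured in one small sketch. The sign convention here is an assumption made for the example: `skew` is the launching register's clock arrival time minus the capturing register's clock arrival time, so a positive value eats into the cycle (the first slide) and a negative value lends extra time (the reversed-buffer slide). All delay values are hypothetical, in ns.

```python
def min_period_with_skew(t_cl, t_setup, t_clk_q, skew):
    """Smallest period satisfying T >= T_CL + T_setup + T_clk->Q + skew.

    skew > 0: capturing clock arrives earlier than the launching clock (less time).
    skew < 0: capturing clock arrives later than the launching clock (more time).
    """
    return t_cl + t_setup + t_clk_q + skew


def meets_setup(period, t_cl, t_setup, t_clk_q, skew):
    """True if the given clock period satisfies the setup constraint."""
    return period >= min_period_with_skew(t_cl, t_setup, t_clk_q, skew)


# Hypothetical path delays.
T_CL, T_SETUP, T_CLK_Q = 3.0, 0.25, 0.5
no_skew_period = min_period_with_skew(T_CL, T_SETUP, T_CLK_Q, 0.0)

# A period sized with no margin fails as soon as harmful skew appears...
print(meets_setup(no_skew_period, T_CL, T_SETUP, T_CLK_Q, skew=0.25))
# ...and still passes when the skew is in the helpful direction.
print(meets_setup(no_skew_period, T_CL, T_SETUP, T_CLK_Q, skew=-0.25))
```

Only the setup side of the constraint is modeled here; the sketch says nothing about what the same skew does to other timing checks.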

Real Stuff: Floorplanning Intel XScale 80200

[Figure: Intel XScale 80200 floorplan]

Clock Tree Delays, IBM Power CPU

[Figure: clock tree delay plotted over die position (x, y); labels: grid, tuned sector trees, sector buffers, buffer level 2, buffer level 1]

Clock Tree Delays, IBM Power

[Figure: clock waveforms, volts (V) versus time (ps), showing 20 ps skew; multiple-fingered transmission line; delay plotted over die position (x, y)]

Timing in Xilinx Designs

From earlier lecture: Virtex-5 slice

[Figure: Virtex-5 slice with four 6-input LUTs (A, B, C, D), each with inputs A[6:1] and output O6, and four optional flip-flops (AQ, BQ, CQ, DQ) clocked by CLK]

6-LUT delay is 0.9 ns: 1.1 GHz toggle speed.
128 x 32b LUT RAM access time is 1.1 ns: 0.909 GHz toggle speed.

But yet... Xilinx CPU runs at 201 MHz... 5x slower!

[Figure: MicroBlaze processor diagram]
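The arithmetic behind these numbers is simple enough to check directly. The sketch below converts the quoted delays to toggle rates and compares the LUT rate with the 201 MHz CPU clock; the function name is made up for the example.

```python
def toggle_ghz(delay_ns):
    """A block with delay d ns can toggle at most once every d ns: f = 1/d GHz."""
    return 1.0 / delay_ns


lut_ghz = toggle_ghz(0.9)      # 6-LUT delay from the slide
lut_ram_ghz = toggle_ghz(1.1)  # 128 x 32b LUT RAM access time from the slide

cpu_mhz = 201.0                          # CPU clock rate from the slide
cpu_period_ns = 1000.0 / cpu_mhz         # about 5 ns per cycle
slowdown = lut_ghz * 1000.0 / cpu_mhz    # LUT toggle rate relative to the CPU clock

print(f"6-LUT: {lut_ghz:.2f} GHz, LUT RAM: {lut_ram_ghz:.3f} GHz")
print(f"CPU period: {cpu_period_ns:.2f} ns, slowdown vs. one LUT: {slowdown:.1f}x")
```

Put another way, one CPU cycle is worth roughly five to six LUT delays, so the rest of the cycle has to be going somewhere other than a single level of logic.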

Major delay source: Interconnect

CLBs define regular connections to the switching fabric, and to slices in CLBs above and below it on the die.

[Figure: array of CLBs, each containing two slices (X0Y0 through X3Y1) attached to a switch matrix, with CIN carry connections between vertically adjacent CLBs; from Xilinx UG190]

Simplified model of interconnect...

[Figure: grid of wires with switch points; "Connect this" to "To this"]

Wires are slow because (1) each green dot is a transistor switch, (2) the path may not be the shortest length, and (3) all wires are too long!

Delays in FPGA designs are particularly layout sensitive. Placement and routing tools spend most of their cycles in timing optimization. When Xilinx designs FPGA chips, wiring channels are optimized for (2) & (3).

What are the green dots?

Set during configuration. One flip-flop and a pass gate for each switch point. In order to have enough wires in the channels to wire up CLBs for most circuits, we need a lot of switch points! Thus, 80%+ of an FPGA is for wiring.

More realistic Virtex-5 model...

1-hop wires to nearest neighbors.

[Figure: Virtex-5 Routing Architecture with Diagonal Interconnects]

Embedded Blocks (from www.xilinx.com, WP246 (v1.2), February 1, 2007)

Virtex-5 devices contain more embedded (or hard IP) blocks than any prior generation FPGA in the industry. FPGA designs that utilize these blocks properly can see additional dramatic dynamic power reductions in comparison to implementing these functions in general purpose FPGA logic. Unlike the FPGA fabric, these hard IP blocks contain only the necessary transistors to implement the required function. There are no programmable interconnects, so routing capacitance is as small as possible. The result is that these hard IP blocks can perform the same function in as little as one-tenth the power of the equivalent implementation in general purpose fabric.

In many cases, embedded blocks that existed in Virtex-4 devices have received significant design overhauls in the Virtex-5 family to improve features, performance, and power consumption. For example, the Virtex-4 family's 18 Kb block RAM has been redesigned. Virtex-5 devices now contain 36 Kb block RAM modules that, logically, can be used as a single 36 Kb memory or two individual 18 Kb memories. But what is more interesting from a power perspective is that each of the logical 18 Kb memory blocks is actually composed of two 9 Kb physical memory arrays.

To minimize dynamic power consumption, most block RAM configurations require only one of the 9 Kb physical memories within each 18 Kb block to be active (powered up) during any given Read or Write operation. Control logic on the address, input, and output ports of the block RAM ensures that the proper 9 Kb physical array is selected for each transaction. In this manner, dynamic power consumption occurs in only one half of the 9 Kb physical arrays at a time. To the user, however, the block RAM appears as one continuous memory. Figure 5 shows the 36 Kb block RAM in Virtex-5 devices.

The embedded DSP elements in Virtex-5 devices have also been redesigned to incorporate more functionality at higher performance and lower power consumption. On a slice versus slice comparison, the new Virtex-5 DSP slice has roughly 40% lower dynamic power consumption relative to the Virtex-4 DSP slice. This is mostly attributable to the voltage and capacitance scaling factors of the 65 nm process that were discussed earlier.

Design Examples

[...] performance and easier design routability. Essentially, the Virtex-5 family interconnect pattern provides fast, predictable routing based on distance.

The figure compares the delays incurred from a source register in one CLB driving a LUT packed with a second register in a surrounding CLB. The goal is to measure the effect of the incremental routing delays for both the Virtex-4 and Virtex-5 family architectures.

Routing Delay Comparison for Virtex-4 and Virtex-5 FPGAs

                    Virtex-4    Virtex-5
1st Ring of CLBs    751 ps      665 ps
2nd Ring of CLBs    906 ps      723 ps

The benefits of the new 6-input LUT architecture are detailed in the following examples.

Multiplexers

One of the easiest examples is a multiplexer. A four-input LUT can implement a 2:1 MUX. Every multiplexer that has more than two inputs requires additional logic resources. A 4:1 MUX needs two 4-input LUTs and a MUXF in Virtex-4 architecture. With the new 6-input LUT, this 4:1 MUX is now implemented with a single LUT.
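The multiplexer example above comes down to counting inputs: a 4:1 MUX has four data inputs and two select inputs, which is exactly six. The sketch below models a 6-input LUT as a 64-entry truth table and fills it with the 4:1 MUX function. The pin assignment (data on inputs 0 to 3, select on inputs 4 and 5) and all function names are assumptions made for illustration, not the actual Virtex-5 mapping.

```python
def build_mux4_table():
    """Truth table of a 4:1 MUX: address bits [3:0] are data, bits [5:4] are select."""
    table = []
    for addr in range(64):
        data = addr & 0b1111
        sel = (addr >> 4) & 0b11
        table.append((data >> sel) & 1)
    return table


def lut6(table, inputs):
    """Look up the output of a 6-input LUT; inputs[i] supplies address bit i."""
    addr = 0
    for i, bit in enumerate(inputs):
        addr |= (bit & 1) << i
    return table[addr]


def mux4(table, d, sel):
    """Drive the LUT with 4 data bits (packed in d) and a 2-bit select value."""
    inputs = [(d >> i) & 1 for i in range(4)] + [sel & 1, (sel >> 1) & 1]
    return lut6(table, inputs)


MUX4_TABLE = build_mux4_table()
print(mux4(MUX4_TABLE, d=0b0100, sel=2))  # selects data bit 2 of 0b0100
```

A 4-input LUT has only 16 table entries, so it runs out of address bits after a 2:1 MUX (two data inputs plus one select), which is why the older architecture needed two LUTs plus a MUXF.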

Timing for small building blocks...

                               Virtex-4 FPGA    Virtex-5 FPGA
6-Input Function (1)           1.1 ns           0.9 ns
Adder, 64-bit                  3.5 ns           2.5 ns
Ternary Adder, 64-bit          4.3 ns           3.0 ns
Barrel Shifter, 32-bit         3.9 ns           2.8 ns
Magnitude Comparator, 8-bit    2.4 ns           1.8 ns
LUT RAM, 128 x 32-bit          1.4 ns           1.1 ns

[...] significantly improved, as shown in Figure 7.

[Figure 7: Multi-Bit Adder Timing Comparison for Virtex-4 and Virtex-5 FPGAs; delay (ns) versus adder width (8-b, 16-b, 32-b, 64-b)]

Clocking

Clock circuits live in center column.

32 global clock wires go down the red column. Any 10 may be sent to a clock region.

Also, regional clocks (restricted functionality).

CS 194-6 L6: Timing, UC Regents, Fall 2008, UCB

Clocks have dedicated wires (low skew)

[Figure: global clock network with GCLK0 through GCLK7 inputs, BUFGMUX blocks, four DCMs, a top spine, a bottom spine, and a horizontal spine, each carrying 8 clock lines]

From: Xilinx Spartan 3 data sheet. Virtex is similar.

Die photo: Xilinx Virtex

Gold wires are the clock tree.

LX110T:
12 Digital Clock Managers (DCM)
6 Phase Locked Loops (PLL)
20 Clock I/O Pads

DCM: Clock deskew, clock phasing

[Figure 2-17: RESET/LOCK Example. DCM_BASE block with inputs CLKIN, CLKFB, RST and outputs CLK0, CLK90, CLK180, CLK270, CLK2X, CLK2X180, CLKDV, CLKFX, CLKFX180, LOCKED; waveforms of CLKIN, RST, CLK0, CLK90, CLK180, CLKFX, CLKFX180, CLKDV, and LOCKED over 3 periods. From Xilinx UG190.]

DCM adjusts its output delay to synchronize the clock signal at the feedback clock input (CLKFB) to the clock signal at the input clock (CLKIN).

Important use is in deskewing on-chip clock distribution relative to input (board level) clock signal.

[Figure 2-8: Standard Usage. DCM_BASE with IBUFG, IBUF, BUFG, and OBUF buffers. From Xilinx UG190.]

How it works: Delay-line feedback

[Figure 2-3: Simplified DLL Circuit. Blocks: variable delay line, control, clock distribution network; signals: CLKIN, CLKOUT, CLKFB. From Xilinx UG190.]
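The delay-line feedback idea can be sketched as a search for the delay setting that lines CLKFB up with CLKIN. This is a deliberately simplified model, not the actual DCM control algorithm: the delay line is assumed to have fixed-size taps, the control logic just adds taps one at a time, and lock is declared when the total delay (delay line plus distribution network) is within one tap of a whole number of clock periods. All times are hypothetical, in integer picoseconds.

```python
def dll_lock(period_ps, net_delay_ps, tap_ps, max_taps):
    """Find the number of delay-line taps that aligns CLKFB with CLKIN.

    CLKFB lags CLKIN by (taps * tap_ps + net_delay_ps). The edges line up when
    that total is a whole number of periods. Returns (taps, residual_ps), where
    residual_ps is the delay still missing to reach the next CLKIN edge, or
    None if the delay line is too short to lock.
    """
    for taps in range(max_taps + 1):
        total = taps * tap_ps + net_delay_ps
        residual = (-total) % period_ps  # distance to the next CLKIN edge
        if residual < tap_ps:
            return taps, residual
    return None


# Hypothetical 200 MHz clock (5000 ps period) and 1800 ps of distribution delay.
result = dll_lock(period_ps=5000, net_delay_ps=1800, tap_ps=50, max_taps=128)
taps, residual = result
print(f"locked with {taps} taps ({taps * 50} ps added), residual error {residual} ps")
```

Once locked, the clock seen at the loads looks as if it had no distribution delay relative to CLKIN, because the delay has been padded out to exactly one full period.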