EECS150 - Digital Design Lecture 18 - Circuit Timing (2)


EECS150 - Digital Design, Lecture 18 - Circuit Timing (2). March 17, 2010. John Wawrzynek. Spring 2010 EECS150 - Lec18-timing(2).

In General...

For correct operation: $T \ge \tau_{clk \to Q} + \tau_{CL} + \tau_{setup}$ for all paths.

How do we enumerate all paths? Any circuit input or register output to any register input or circuit output?

Note: setup time for outputs is a function of what the output connects to. clk-to-Q for circuit inputs depends on where the input comes from.
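The constraint above can be checked numerically once every path has been enumerated. The following is a minimal sketch, not part of the lecture: the `Path` record, the path names, and all delay values are hypothetical, chosen only to show that the slowest path sets the minimum clock period.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One register-to-register (or input-to-register) path. All times in ns."""
    name: str
    t_clk_q: float   # clk-to-Q delay of the launching element
    t_cl: float      # combinational logic delay along the path
    t_setup: float   # setup time of the capturing element


def min_clock_period(paths):
    """Smallest T satisfying T >= t_clk_q + t_cl + t_setup for all paths."""
    return max(p.t_clk_q + p.t_cl + p.t_setup for p in paths)


# Hypothetical paths and delays, for illustration only.
paths = [
    Path("regA->regB", 0.5, 3.0, 0.25),
    Path("regB->regC", 0.5, 4.5, 0.25),
    Path("in->regA", 1.0, 2.0, 0.25),
]

t_min_ns = min_clock_period(paths)
f_max_mhz = 1000.0 / t_min_ns
print(f"minimum period {t_min_ns} ns, maximum clock rate {f_max_mhz:.1f} MHz")
```

Only the worst path matters for the period; the other paths simply have slack.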

Gate Delay is the Result of Cascading

Cascaded gates: transfer curve for inverter.

Delay in Flip-flops

Setup time results from delay through the first latch.

Clock-to-Q delay results from delay through the second latch.

[Figure: flip-flop built from two latches, with clk inputs labeled]

Wire Delay

Even in those cases where the transmission line effect is negligible:
- Wires possess distributed resistance and capacitance.
- The time constant associated with distributed RC is proportional to the square of the length.

[Figure: distributed RC wire model with nodes v1, v2, v3, v4]

For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important. Typically around half of the C of a gate load is in the wires.

For long wires on ICs (busses, clock lines, global control signals, etc.), resistance is significant, so the distributed RC effect dominates. Signals are typically rebuffered to reduce delay.

[Figure: rebuffered wire, nodes v1 through v4 versus time]

Delay and Fan-out

[Figure: gate 1 driving gates 2 and 3]

The delay of a gate is proportional to its output capacitance. Connecting the output of gate 1 to more gate inputs increases its output capacitance. Therefore, it takes increasingly longer for the output of a gate to reach the switching threshold of the gates it drives as we add more output connections.

Driving wires also contributes to fan-out delay.

What can be done to remedy this problem in large fan-out situations?
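The quadratic dependence on length, and the benefit of rebuffering, can be illustrated with a lumped RC ladder. This is a rough sketch under stated assumptions: the per-millimeter resistance and capacitance, the buffer delay, and the function names are all hypothetical, and the model ignores buffer drive resistance and input load. The delay metric used is the Elmore delay of the ladder, a common first-order estimate rather than anything specified in the lecture.

```python
def wire_delay_ps(r_ohm_per_mm, c_ff_per_mm, length_mm, sections=1000):
    """Elmore delay (ps) to the far end of a wire modeled as an RC ladder."""
    r = r_ohm_per_mm * length_mm / sections   # resistance of one section (ohm)
    c = c_ff_per_mm * length_mm / sections    # capacitance of one section (fF)
    # The capacitor at node k is charged through k sections of resistance.
    delay_fs = sum(k * r * c for k in range(1, sections + 1))  # ohm * fF = fs
    return delay_fs / 1000.0


def rebuffered_delay_ps(r_ohm_per_mm, c_ff_per_mm, length_mm, pieces, buffer_delay_ps):
    """Delay when the wire is cut into equal pieces with a buffer between pieces."""
    piece = wire_delay_ps(r_ohm_per_mm, c_ff_per_mm, length_mm / pieces)
    return pieces * piece + (pieces - 1) * buffer_delay_ps


# Hypothetical wire parameters, for illustration only.
R_PER_MM, C_PER_MM = 100.0, 200.0

short = wire_delay_ps(R_PER_MM, C_PER_MM, 5.0)
long_wire = wire_delay_ps(R_PER_MM, C_PER_MM, 10.0)
buffered = rebuffered_delay_ps(R_PER_MM, C_PER_MM, 10.0, pieces=4, buffer_delay_ps=50.0)

print(f"5 mm: {short:.1f} ps, 10 mm: {long_wire:.1f} ps (ratio {long_wire / short:.2f})")
print(f"10 mm rebuffered in 4 pieces: {buffered:.1f} ps")
```

Doubling the length quadruples the unbuffered delay, while the rebuffered wire grows roughly linearly with length, which is why long on-chip wires are split by buffers.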

Critical Path

Critical path: the path in the entire design with the maximum delay. This could be from state element to state element, from input to state element, from state element to output, or from input to output (unregistered paths).

For example, what is the critical path in this circuit?

Why do we care about the critical path?

Searching for processor critical path

Must consider all connected register pairs, paths from input to register, and paths from register to output. Don't forget the controller.

Design tools help in the search:
- Synthesis tools report delays on paths.
- Special static timing analyzers accept a design netlist and report path delays.
- Simulators can, of course, be used to determine timing performance.

Tools that are expected to do something about the timing behavior (such as synthesizers) also include provisions for specifying input arrival times (relative to the clock) and output requirements (set-up times of the next stage).
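The search that a static timing analyzer performs can be sketched as a longest-path computation over an acyclic netlist graph. The example below is a toy illustration, not any real tool's algorithm or API: the graph, node names, and delays are hypothetical, and each edge delay stands for the delay of reaching the next node.

```python
from functools import lru_cache


def critical_path(edges, sources):
    """Return (delay, path) of the slowest path starting at any source.

    edges maps a node to a list of (next_node, delay) pairs and must be acyclic.
    """
    @lru_cache(maxsize=None)
    def longest_from(node):
        best = (0.0, (node,))
        for nxt, delay in edges.get(node, []):
            sub_delay, sub_path = longest_from(nxt)
            candidate = (delay + sub_delay, (node,) + sub_path)
            if candidate[0] > best[0]:
                best = candidate
        return best

    return max((longest_from(s) for s in sources), key=lambda result: result[0])


# Hypothetical netlist: start points are a register output and a circuit input.
edges = {
    "regA": [("add", 0.5)],
    "add": [("mux", 2.0)],
    "in": [("mux", 1.0)],
    "mux": [("regB", 0.75)],
}

delay_ns, path = critical_path(edges, sources=["regA", "in"])
print(f"critical path: {' -> '.join(path)} ({delay_ns} ns)")
```

Memoizing the result per node keeps the search linear in the number of edges, which matters because the number of distinct paths can grow exponentially with circuit depth.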

Real Stuff: Timing Analysis

[Figure: histogram of late-mode timing checks (thousands) versus timing slack (ps), with the critical path marked]

Most paths have hundreds of picoseconds to spare.

From "The circuit and physical design of the POWER4 microprocessor", IBM J. Res. and Dev., 46:1, Jan 2002, J.D. Warnock et al.

Clock Skew

Unequal delay in distribution of the clock signal to various parts of a circuit. If not accounted for, it can lead to erroneous behavior.

It comes about because:
- clock wires have delay,
- the circuit is designed with a different number of clock buffers from the clock source to the various clock loads, or
- buffers have unequal delay.

[Figure: clock skew, delay in distribution]

All synchronous circuits experience some clock skew. It is more of an issue for high-performance designs operating with very little extra time per clock cycle.

Clock Skew (cont.)

[Figure: combinational logic (CL) between two registers; the CLK distribution shows clock skew (delay in distribution)]

If clock period $T = T_{CL} + T_{setup} + T_{clk \to Q}$, circuit will fail.

Therefore:
1. Control clock skew.
   a) Careful clock distribution. Equalize path delay from clock source to all clock loads by controlling wire delay and buffer delay.
   b) Don't gate clocks in a non-uniform way.
2. $T \ge T_{CL} + T_{setup} + T_{clk \to Q} + \text{worst-case skew}$.

Most modern large high-performance chips (microprocessors) control end-to-end clock skew to a small fraction of the clock period.

Clock Skew (cont.)

[Figure: same circuit with the clock buffer reversed]

Note reversed buffer. In this case, clock skew actually provides extra time (adds to the effective clock period). This effect has been used to help run circuits at higher clock rates. Risky business!
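The effect of skew on the period constraint can be worked through with a small calculation. This sketch uses a sign convention that is an assumption of the example, not of the lecture: positive skew means the capturing register's clock edge arrives early relative to the launching register (time is lost), and negative skew models the reversed-buffer case (time is gained). All delay values are hypothetical.

```python
def min_period_with_skew(t_cl, t_setup, t_clk_q, skew):
    """Minimum clock period (ns) from T >= T_CL + T_setup + T_clk->Q + skew."""
    return t_cl + t_setup + t_clk_q + skew


def meets_setup(period, t_cl, t_setup, t_clk_q, skew):
    """True if the given clock period satisfies the setup constraint."""
    return period >= min_period_with_skew(t_cl, t_setup, t_clk_q, skew)


# Hypothetical delays in ns.
T_CL, T_SETUP, T_CLK_Q = 3.0, 0.25, 0.5

no_skew = min_period_with_skew(T_CL, T_SETUP, T_CLK_Q, skew=0.0)
worst_case = min_period_with_skew(T_CL, T_SETUP, T_CLK_Q, skew=0.25)
reversed_buffer = min_period_with_skew(T_CL, T_SETUP, T_CLK_Q, skew=-0.25)

print(no_skew, worst_case, reversed_buffer)
# A clock that just fits with no skew fails once worst-case skew is included.
print(meets_setup(no_skew, T_CL, T_SETUP, T_CLK_Q, skew=0.25))
```

The sketch covers only the setup constraint stated on the slide; it does not model why borrowing time through skew is risky.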

Real Stuff: Floorplanning Intel XScale 80200

[Figure: floorplan of the Intel XScale 80200]

Clock Tree Delays, IBM Power4 CPU

[Figure: clock tree delay plotted over die position (x, y); labels: grid, tuned sector trees, sector buffers, buffer level 2, buffer level 1]

Clock Tree Delays, IBM Power4

[Figure: clock waveforms, volts (V) versus time (ps), showing 20 ps skew; clock tree delay plotted over die position (x, y); label: multiple-fingered transmission line]

Timing in Xilinx Designs

From earlier lecture: Virtex-5 slice

[Figure: Virtex-5 slice with four 6-input LUTs (inputs A[6:1] through D[6:1], outputs O6), each followed by an optional flip-flop (outputs AQ through DQ), sharing CLK]

- 6-LUT delay is 0.9 ns: 1.1 GHz toggle speed.
- 128 x 32b LUT RAM access time is 1.1 ns: 0.909 GHz toggle speed.

But yet... the Xilinx CPU runs at 201 MHz, 4.5x slower!

[Figure: MicroBlaze block diagram]
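The toggle speeds quoted on this slide are simply the reciprocals of the delays, and the slowdown factor follows from dividing by the CPU clock rate. A short check of that arithmetic, using only the 0.9 ns, 1.1 ns, and 201 MHz figures from the slide (the function name is made up for the example):

```python
def toggle_speed_ghz(delay_ns):
    """A block with the given delay can toggle at most once per delay period."""
    return 1.0 / delay_ns


lut_ghz = toggle_speed_ghz(0.9)       # 6-LUT delay
lut_ram_ghz = toggle_speed_ghz(1.1)   # 128 x 32b LUT RAM access time
cpu_mhz = 201.0                       # clock rate of the Xilinx CPU on the slide

slowdown = lut_ram_ghz * 1000.0 / cpu_mhz

print(f"6-LUT: {lut_ghz:.2f} GHz, LUT RAM: {lut_ram_ghz:.3f} GHz")
print(f"CPU is {slowdown:.1f}x slower than the LUT RAM toggle speed")
```

The gap between a single building block and a whole processor is the motivation for the slides that follow on interconnect delay.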

Major delay source: Interconnect

CLBs define regular connections to the switching fabric, and to slices in CLBs above and below them on the die.

[Figure: CLB array with slices X0Y0 through X3Y1, switch matrices, and CIN carry connections (UG190)]

Simplified model of interconnect...

[Figure: routing example, "Connect this ... to this"]

Wires are slow because:
(1) each green dot is a transistor switch,
(2) the path may not be the shortest length,
(3) all wires are too long!

Delay in FPGA designs is particularly layout sensitive. Placement and routing tools spend most of their cycles in timing optimization. When Xilinx designs FPGA chips, wiring channels are optimized for (2) and (3).

What are the green dots?

Set during configuration. One flip-flop and a pass gate for each switch point. In order to have enough wires in the channels to wire up CLBs for most circuits, we need a lot of switch points! Thus, 80%+ of the FPGA is for wiring.

More realistic Virtex-5 model...

[Figure: Virtex-5 Routing Architecture with Diagonal Interconnects (Xilinx WP246)]

Embedded Blocks

Virtex-5 devices contain more embedded (or hard IP) blocks than any prior generation FPGA in the industry. FPGA designs that utilize these blocks properly can see additional dramatic dynamic power reductions in comparison to implementing these functions in general purpose FPGA logic. Unlike the FPGA fabric, these hard IP blocks contain only the necessary transistors to implement the required function. There are no programmable interconnects, so routing capacitance is as small as possible. The result is that these hard IP blocks can perform the same function in as little as one-tenth the power of the equivalent implementation in general purpose fabric.

In many cases, embedded blocks that existed in Virtex-4 devices have received significant design overhauls in the Virtex-5 family to improve features, performance, and power consumption. For example, the Virtex-4 family's 18 Kb block RAM has been redesigned. Virtex-5 devices now contain 36 Kb block RAM modules that, logically, can be used as a single 36 Kb memory or two individual 18 Kb memories. But what is more interesting from a power perspective is that each of the logical 18 Kb memory blocks is actually composed of two 9 Kb physical memory arrays.

To minimize dynamic power consumption, most block RAM configurations require only one of the 9 Kb physical memories within each 18 Kb block to be active (powered up) during any given Read or Write operation. Control logic on the address, input, and output ports of the block RAM ensures that the proper 9 Kb physical array is selected for each transaction. In this manner, dynamic power consumption occurs in only one half of the 9 Kb physical arrays at a time. To the user, however, the block RAM appears as one continuous memory. Figure 5 shows the 36 Kb block RAM in Virtex-5 devices.

The embedded DSP elements in Virtex-5 devices have also been redesigned to incorporate more functionality at higher performance and lower power consumption. On a slice versus slice comparison, the new Virtex-5 DSP slice has roughly 40% lower dynamic power consumption relative to the Virtex-4 DSP slice. This is mostly attributable to the voltage and capacitance scaling factors of the 65 nm process that were discussed earlier.

Source: Xilinx WP246 (v1.2), February 1, 2007, www.xilinx.com.

Routing delay (Xilinx WP245)

... performance and easier design routability. Essentially, the Virtex-5 family interconnect pattern provides fast, predictable routing based on distance.

The figure compares the delays incurred from a source register in one CLB driving a LUT packed with a second register in a surrounding CLB. The goal is to measure the effect of the incremental routing delays for both the Virtex-4 and Virtex-5 family architectures.

Routing Delay Comparison for Virtex-4 and Virtex-5 FPGAs

                    Virtex-4    Virtex-5
1st Ring of CLBs    751 ps      665 ps
2nd Ring of CLBs    906 ps      723 ps

Design Examples

The benefits of the new 6-input LUT architecture are detailed in the following examples.

Multiplexers

One of the easiest examples is a multiplexer. A four-input LUT can implement a 2:1 MUX. Every multiplexer that has more than two inputs requires additional logic resources. A 4:1 MUX needs two 4-input LUTs and a MUXF in Virtex-4 architecture. With the new 6-input LUT, this 4:1 MUX is now implemented with a single LUT.
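The ring delays in the routing comparison above can be reduced to two derived quantities: the incremental cost of reaching the second ring, and the relative improvement between families. The sketch below assumes the older family in the comparison is Virtex-4 (the family name is damaged in the extracted text) and uses only the four delay values quoted there; the dictionary layout and key names are made up for the example.

```python
# Register-to-register delays through one LUT, in ps, as quoted in the comparison.
# Assumption: the older family is Virtex-4.
ring_delay_ps = {
    "Virtex-4": {"ring1": 751, "ring2": 906},
    "Virtex-5": {"ring1": 665, "ring2": 723},
}

# Extra delay paid for routing one ring further out.
incremental_ps = {
    family: d["ring2"] - d["ring1"] for family, d in ring_delay_ps.items()
}

# Percentage reduction in delay going from the older family to Virtex-5.
improvement_pct = {
    ring: 100.0 * (ring_delay_ps["Virtex-4"][ring] - ring_delay_ps["Virtex-5"][ring])
    / ring_delay_ps["Virtex-4"][ring]
    for ring in ("ring1", "ring2")
}

print(incremental_ps)
print({ring: round(pct, 1) for ring, pct in improvement_pct.items()})
```

The incremental cost of the second ring is the more telling number: it is what makes routing delay "predictable based on distance" in the newer fabric.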

Timing for small building blocks...

                                Virtex-4 FPGA    Virtex-5 FPGA
6-Input Function                1.1 ns           0.9 ns
Adder, 64-bit                   3.5 ns           2.5 ns
Ternary Adder, 64-bit           4.3 ns           3.0 ns
Barrel Shifter, 32-bit          3.9 ns           2.8 ns
Magnitude Comparator, 48-bit    2.4 ns           1.8 ns
LUT RAM, 128 x 32-bit           1.4 ns           1.1 ns

[Figure 7: Multi-Bit Adder Timing Comparison for Virtex-4 and Virtex-5 FPGAs; delay (ns) versus adder width (8-b, 16-b, 32-b, 64-b) (Xilinx WP245)]

Clocking
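The table above can be summarized as a speedup per building block. This sketch uses only the three rows whose delay values survive intact in the extracted text; the row labels are shortened and the helper name is made up for the example.

```python
# (older family delay, Virtex-5 delay) in ns, taken from the rows of the table
# whose numbers are fully legible in the extracted text.
delays_ns = {
    "6-Input Function": (1.1, 0.9),
    "Adder": (3.5, 2.5),
    "Barrel Shifter, 32-bit": (3.9, 2.8),
}


def speedup(old_ns, new_ns):
    """How many times faster the new block is than the old one."""
    return old_ns / new_ns


speedups = {name: round(speedup(old, new), 2) for name, (old, new) in delays_ns.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor}x faster")
```

The wide arithmetic blocks improve more than the single LUT, consistent with the gains coming from routing and carry structure as well as from the LUT itself.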

Clock circuits live in the center column. 32 global clock wires go down the red column. Any 10 may be sent to a clock region. Also, regional clocks (restricted functionality).

CS 194-6 L6: Timing, UC Regents Fall 2008 UCB

Clocks have dedicated wires (low skew)

[Figure: clock network with global clock inputs GCLK0 through GCLK7, BUFGMUX clock buffers, four DCMs, and top, bottom, and horizontal spines]

From: Xilinx Spartan 3 data sheet. Virtex is similar.

Die photo: Xilinx Virtex

Gold wires are the clock tree.

LX110T:
- 12 Digital Clock Managers (DCM)
- 6 Phase Locked Loops (PLL)
- 20 Clock I/O Pads

DCM: Clock deskew, clock phasing

[Figure: DCM_BASE primitive. Inputs: CLKIN, CLKFB, RST. Outputs: CLK0, CLK90, CLK180, CLK270, CLK2X, CLK2X180, CLKDV, CLKFX, CLKFX180, LOCKED]

[Figure 2-17: RESET/LOCK Example (UG190). Waveforms for CLKIN, RST, CLK0, CLK90, CLK180, CLKFX, CLKFX180, CLKDV, and LOCKED over periods 1, 2, 3]

The DCM adjusts its output delay to synchronize the clock signal at the feedback clock input (CLKFB) to the clock signal at the input clock (CLKIN).

An important use is deskewing on-chip clock distribution relative to the input (board-level) clock signal.

How it works: Delay-line feedback

[Figure 2-8: Standard Usage (UG190). DCM_BASE with IBUFG, IBUF, BUFG, and OBUF buffers]

[Figure 2-3: Simplified DLL Circuit (UG190). Blocks: variable delay line from CLKIN to CLKOUT, clock distribution network, control, and CLKFB feedback]
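The delay-line feedback idea can be sketched as a loop that lengthens a variable delay until the fed-back clock lines up with the input clock, that is, until the delay line plus the distribution network add up to a whole clock period. This is a toy model under stated assumptions, not the DCM's actual control logic: the function name, the fixed step size, and the example delays are all hypothetical.

```python
def lock_dll(period_ps, distribution_delay_ps, step_ps=10):
    """Return the delay-line setting (ps) at which CLKFB aligns with CLKIN.

    The feedback clock lags CLKIN by the delay line plus the clock distribution
    network. The loop adds delay one step at a time until that total lag is a
    whole number of periods, to within one step.
    """
    delay_line_ps = 0
    while delay_line_ps < period_ps:
        total_lag_ps = delay_line_ps + distribution_delay_ps
        phase_error_ps = total_lag_ps % period_ps
        if phase_error_ps < step_ps:
            return delay_line_ps      # locked: edges coincide
        delay_line_ps += step_ps      # control block requests more delay
    raise RuntimeError("no lock found within one clock period")


# Hypothetical numbers: 200 MHz clock (5000 ps period), 1200 ps distribution delay.
setting = lock_dll(period_ps=5000, distribution_delay_ps=1200)
print(f"delay line locks at {setting} ps; total lag {setting + 1200} ps")
```

Once locked, the clock seen at the loads is delayed by exactly one period relative to the input, so it appears aligned with the board-level clock even though the distribution network itself has delay.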