Clock Generation and Distribution for High-Performance Processors
  Stefan Rusu, Senior Principal Engineer
  Enterprise Microprocessor Division, Intel Corporation
  stefan.rusu@intel.com
Outline
  Clock Distribution Trends
  Distribution Networks
  De-skew Circuits
  Jitter Reduction Techniques
  Clock Power Dissipation
  Future Directions
  Summary
  SoC 2004, Stefan Rusu
Clock Definition and Parameters
  The clock is a periodic synchronization signal used as a time reference for data transfers in synchronous digital systems.
  Skew (t_skew): spatial variation of the clock signal as distributed through the chip. Global vs. local skew.
  Clock jitter (t_jitter): temporal variation of the clock with respect to a reference edge. Long-term vs. cycle-to-cycle jitter.
  Duty cycle variation (t_high, t_low): 50/50 design target.
  [Figure: Ref Clk and End Clk waveforms annotated with t_skew, t_jitter, t_high and t_low.]
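The three parameters above can be estimated from edge timestamps. The sketch below is one common way to do it, not a method taken from the slides; the function names, the sample timestamps and the choice of "largest deviation" as the jitter metric are all assumptions made for illustration.

```python
from statistics import mean

def skew_ps(ref_edges, end_edges):
    """Average arrival-time difference between matching edges at two points."""
    return mean(e - r for r, e in zip(ref_edges, end_edges))

def cycle_to_cycle_jitter_ps(edges):
    """Largest change in period between two consecutive cycles."""
    periods = [b - a for a, b in zip(edges, edges[1:])]
    return max(abs(q - p) for p, q in zip(periods, periods[1:]))

def long_term_jitter_ps(edges, nominal_period):
    """Largest deviation of any edge from its ideal position,
    measured from the first edge."""
    first = edges[0]
    return max(abs((t - first) - i * nominal_period)
               for i, t in enumerate(edges))

def duty_cycle(t_high, t_low):
    """Fraction of the period spent high; 0.5 is the 50/50 target."""
    return t_high / (t_high + t_low)

# Hypothetical rising-edge timestamps in ps for a 2 GHz clock (500 ps period).
ref = [0, 500, 1000, 1500, 2000]
end = [30, 535, 1025, 1530, 2030]

print(skew_ps(ref, end))                # 30
print(cycle_to_cycle_jitter_ps(end))    # 15
print(long_term_jitter_ps(end, 500))    # 5
print(duty_cycle(260, 240))             # 0.52
```

Skew compares two locations at the same time, while both jitter metrics compare one location against itself over time, which is why the jitter functions take a single edge list.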
Processor Frequency Trend
  [Figure: processor frequency in MHz (log scale, 10 to 10000) versus year (1987 to 2005), with data points labeled 386, 486, Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4.]
Clock Skew Trend
  [Figure: clock skew in ps (0 to 600) versus processor frequency in MHz (100 to 10000).]
  Source: ISSCC and JSSC papers
Relative Clock Skew
  Clock skew accounts on average for ~5% of the cycle time.
  [Figure: clock skew as a percentage of cycle time (0 to 10%) versus processor frequency in MHz (100 to 10000).]
  Source: ISSCC and JSSC papers
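As a worked illustration of the ~5% figure, the sketch below converts a frequency into a cycle time and then into a skew budget. The helper names and the example frequencies are hypothetical; only the 5% fraction comes from the slide.

```python
def cycle_time_ps(freq_mhz):
    """Cycle time in picoseconds for a frequency given in MHz."""
    return 1e6 / freq_mhz

def relative_skew_percent(skew_ps, freq_mhz):
    """Skew expressed as a percentage of the cycle time."""
    return 100.0 * skew_ps / cycle_time_ps(freq_mhz)

def skew_budget_ps(freq_mhz, fraction=0.05):
    """Skew allowed if it may consume `fraction` of the cycle."""
    return fraction * cycle_time_ps(freq_mhz)

# Hypothetical operating points: the absolute budget shrinks with frequency.
for f in (100, 1000, 2000):
    print(f, "MHz ->", round(skew_budget_ps(f), 1), "ps")
```

Holding the percentage constant means the absolute skew must fall in proportion to the cycle time, which matches the downward trend of the clock skew plot.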
Sources of Clock Skew
  With a perfectly balanced distribution, device mismatch is the largest contributor to the clock skew.
  [Figure: bar chart of contribution in percent (0 to 60) for temperature mismatch, load mismatch, supply mismatch and device mismatch (Le).]
  Geannopoulos, ISSCC-1998
Clock Jitter Trend
  [Figure: clock jitter in ps (0 to 500) versus processor frequency in MHz (100 to 10000).]
  Source: ISSCC and JSSC papers
Clock Distribution Networks
  [Figure: network topologies labeled Tree, Mesh, Grid, H-Tree, X-Tree and Tapered H-Tree.]
Inductance Effect
  Xanthopoulos, ISSCC-2001
Itanium Processor Clock Hierarchy
  [Figure: clock hierarchy from the reference clock (CLKP, CLKN, VCC/2) through the PLL and main clock to the global, regional and local distribution levels. Block labels: DSK, RCD, DLCLK, OTB.]
  Rusu, ISSCC-2000
Local Clock Distribution
  Local clock distribution enables flexible skew management to support:
    Intentional clock skew insertion for timing optimization
    Clock gating for power reduction
  [Figure: regional clock grid, normal local clock buffers, intentional skew buffer and combinatorial block.]
  Rusu, ISSCC-2000
Itanium 2 Processor Clock Distribution
  First level: pseudo-differential, impedance-matched branching, balanced H-tree.
  Second level: balanced, width- and length-tuned binary H-tree.
  Second Level Clock Buffers (SLCBs): adjustable delay buffers.
  Gaters: all present constant input loading, with load-tuned drive strength.
  [Figure: primary driver, repeaters (5), SLCBs (33) and gaters. Each SLCB has ~70 tap points of ~8 gaters each.]
  Anderson, ISSCC-2002
Optical Skew Probing
  Clock edge generates infrared photon emission.
  Emission peak indicates clock transition edge.
  [Figure: circuit with Vin, Idn and Vout labels, emitting photons.]
  Tam, VLSI Symposium, 2003
Optical Probing Results
  Tam, VLSI Symposium, 2003
130nm Itanium 2 Skew Profile
  [Figure: relative delay in ps (-10 to 70) versus clock zone (1 to 21) for three cases: default, fuse adjusted and SCAN adjusted.]
  Tam, VLSI Symposium, 2003
Pentium 4 Processor Clock Network
  2GHz triple-spine clock distribution (180nm).
  [Figure: clock network with the PLL.]
  Kurd, JSSC-2001
90nm Clock Distribution
  Sub-10ps clock skew demonstrated in a 90nm processor using clock tree averaging.
  [Figure: skew map, values in ps.]
  Bindal, ISSCC 2003
Pentium 4 Processor Clock Skew
  130nm Pentium 4 Processor: 22ps.
  90nm Pentium 4 Processor: 7ps.
  The 90nm design has 3x lower clock skew than the 130nm design.
  Schutz, ISSCC 2004
Alpha* Processors Clocking
  Product       21064         21164         21264
  Frequency     166MHz        300MHz        600MHz
  Transistors   1.7M          9.3M          9.3M
  Process       0.75um, 4ML   0.5um, 4ML    0.35um, 6ML
  Power         25W           50W           72W
  Clock load    2.75nF        3.75nF        2.8nF
  [Figure: clock floorplan per product (PLL, pre-driver, final drivers) and clock skew plot (skew in ps, 0 to 75, over the chip horizontal and vertical axes).]
  * Other names and brands may be claimed as the property of others
  Gronowski, JSSC 1998
1.2GHz Alpha* Processor Clock
  [Figure: block diagram with a PLL, three DLLs and clocks labeled NCLK, GCLK, L2LCLK and L2RCLK.]
  Xanthopoulos, ISSCC-2001
Power4* Clock Distribution
  Dual core, SOI process, 174M transistors.
  Measured clock skew below 25ps.
  [Figure: PLL and clock distribution feeding the global clock grid, with labels Ref clk in, Ref clk out, Bypass, PLL out and Feedback.]
  Restle, ISSCC-2002
Power4* - 3D Skew Visualization
  [Figure: 3D plot of delay in ps (100 to 800) over the chip X and Y axes, showing buffer levels 1 to 4, sector buffers, tuned sector trees and the grid.]
  Restle, ISSCC-2002
Dual-Zone Clock Deskew
  [Figure: block diagram with labels X Clk, FB Clk, Clk_Gen, Delay Line (x2), Delay SR (x2), Deskew Ctl, PD, CL, Left Spine, Core and Right Spine.]
  Geannopoulos, ISSCC-1998
Itanium Processor Clock Deskew
  Distributed array of deskew buffers to reduce process-related skew.
  8 deskew clusters, each holding up to 4 buffers.
  30 deskew zones.
  [Figure: die map of deskew clusters and the controller. DSK = cluster of 4 deskew buffers; CDC = Central Deskew Controller.]
  Rusu, ISSCC-2000
Itanium Processor Deskew Buffer
  Step size = 8.5ps. Deskew range = 170ps.
  Small step size enables fine skew control over a wide range.
  TAP read/write access to the control register enables faster timing debug and performance tuning.
  [Figure: deskew buffer with Input, Output and Enable# signals, a 20-bit delay control register and a TAP interface (TAP I/F).]
  Rusu, ISSCC-2000
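The stated range and step size imply 170 / 8.5 = 20 delay steps, which matches the 20-bit control register. The sketch below maps a requested delay to a register value under the assumption that the register is thermometer-coded (one bit per step); the slide does not state the encoding, so that choice and all function names are hypothetical.

```python
STEP_PS = 8.5      # from the slide
RANGE_PS = 170.0   # from the slide
N_STEPS = round(RANGE_PS / STEP_PS)  # 20, same as the register width

def delay_to_steps(delay_ps):
    """Nearest number of delay steps for a requested delay."""
    if not 0.0 <= delay_ps <= RANGE_PS:
        raise ValueError("requested delay is outside the deskew range")
    return round(delay_ps / STEP_PS)

def thermometer_code(steps, width=N_STEPS):
    """ASSUMED encoding: `steps` low-order bits set, the rest clear."""
    if not 0 <= steps <= width:
        raise ValueError("step count does not fit the register")
    return (1 << steps) - 1

steps = delay_to_steps(40.0)           # 40 ps requested
print(steps, steps * STEP_PS)          # 5 steps, 42.5 ps applied
print(format(thermometer_code(steps), "020b"))
```

With this quantization the residual error after adjustment is at most half a step, about 4.25ps.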
Pentium 4 Processor Deskew
  [Figure: logical diagram of the skew optimization circuit, with the phase detector network.]
  Kurd, JSSC-2001
Deskew Techniques Summary
  Author        Source    Clock Zones  Skew Before  Skew After  Step Size
  Geannopoulos  ISSCC-98  2            60ps         15ps        12ps
  Rusu          ISSCC-00  30           110ps        28ps        8ps
  Kurd          ISSCC-01  47           64ps         16ps        8ps
  Stinson       ISSCC-03  23           60ps         7ps         7ps
  Clock deskew techniques compensate for device and interconnect within-die variations.
  Deskew circuits cut clock skew to about a quarter of the original value or less.
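The "quarter of the original value" observation can be checked directly from the table. The short sketch below uses only the before and after figures listed above; the variable and function names are illustrative.

```python
# (author, skew before in ps, skew after in ps), copied from the table.
ROWS = [
    ("Geannopoulos", 60, 15),
    ("Rusu", 110, 28),
    ("Kurd", 64, 16),
    ("Stinson", 60, 7),
]

def remaining_fraction(before_ps, after_ps):
    """Fraction of the original skew left after deskew."""
    return after_ps / before_ps

ratios = {name: remaining_fraction(b, a) for name, b, a in ROWS}
for name, ratio in ratios.items():
    print(f"{name:13s} {100 * ratio:5.1f}% of original skew remains")
```

Three of the four designs land at roughly 25%, and the most recent one (Stinson) goes well below that.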
Useful Clock Skew
  Use de-skew buffers to insert intentional skew to maximize the processor operating frequency.
  Larger benefit achieved in early steppings.
  [Figure: frequency improvement in MHz per sample, for the initial stepping (samples 1 to 4, axis 0 to 300) and a subsequent stepping (samples 1 to 3, axis 0 to 40).]
  Tam, VLSI Symposium, 2003
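To see why intentional skew can raise frequency, consider a simplified two-stage pipeline A -> B -> C with unbalanced logic delays. The sketch below is a textbook-style model, not the procedure used on the product: it ignores setup, hold and clock-to-output times, and the delay values are hypothetical.

```python
def min_period_ps(d1_ps, d2_ps, skew_ps=0.0):
    """Minimum cycle time when the clock at the middle flop B arrives
    `skew_ps` later than at A and C. Stage A->B gains that time,
    stage B->C loses it."""
    return max(d1_ps - skew_ps, d2_ps + skew_ps)

def max_frequency_mhz(d1_ps, d2_ps, skew_ps=0.0):
    return 1e6 / min_period_ps(d1_ps, d2_ps, skew_ps)

def ideal_skew_ps(d1_ps, d2_ps):
    """Skew that balances the two stages."""
    return (d1_ps - d2_ps) / 2.0

d1, d2 = 600.0, 500.0                  # hypothetical stage delays
s = ideal_skew_ps(d1, d2)              # 50 ps
print(round(max_frequency_mhz(d1, d2)))      # without intentional skew
print(round(max_frequency_mhz(d1, d2, s)))   # with intentional skew
```

In this model the benefit is largest when the stages are badly unbalanced, which is consistent with the larger gains reported for early steppings, before critical paths have been fixed in the design itself.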
Pentium 4 Processor Jitter Reduction
  RC-filtered power supply for clock drivers reduces clock distribution jitter.
  10% dip in core supply, 2% dip in filtered supply.
  [Figure: RC filter schematic (Vcc, R, C, load current I, filtered supply Vcc - IR) and plot of jitter in ps (-60 to 60) versus cycle number (0 to 50), with and without the filter.]
  Kurd, JSSC-2001
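A first-order RC low-pass model gives a feel for the filter sizing. The sketch below treats the supply dip as a sinusoid at a single frequency, which is an idealization of the transient shown on the slide; the noise frequency, resistance and current values are hypothetical, and only the 10% to 2% reduction (a gain of 0.2) and the Vcc - IR static drop come from the slide.

```python
import math

def lowpass_gain(freq_hz, r_ohm, c_farad):
    """Magnitude response of a first-order RC low-pass filter."""
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * freq_hz * r_ohm * c_farad) ** 2)

def rc_for_gain(freq_hz, gain):
    """RC time constant (seconds) needed to reach `gain` at `freq_hz`."""
    return math.sqrt(1.0 / gain ** 2 - 1.0) / (2.0 * math.pi * freq_hz)

def dc_drop_v(i_amp, r_ohm):
    """Static IR drop across the filter resistor."""
    return i_amp * r_ohm

noise_hz = 100e6                      # hypothetical supply-noise frequency
rc = rc_for_gain(noise_hz, 0.2)       # 10% dip reduced to 2%
r = 1.0                               # hypothetical filter resistance, ohms
print(f"RC = {rc * 1e9:.2f} ns, C = {rc / r * 1e9:.2f} nF at R = {r} ohm")
print(f"gain check: {lowpass_gain(noise_hz, r, rc / r):.3f}")
print(f"static drop at 50 mA: {dc_drop_v(0.05, r) * 1e3:.0f} mV")
```

The design trade-off is visible in the model: a larger R improves filtering for a given C but increases the static IR drop seen by the clock drivers.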
Alpha* Processor Voltage Regulator
  Voltage regulator ensures optimum DLL tracking.
  Supply noise frequencies over 1MHz are attenuated by more than 15dB.
  [Figure: regulator schematic (2.5V and 1.5V supplies, LPF, amplifier, DLL) and plot of PSRR in dB (0 to -60) versus frequency in Hz (1.0E+02 to 1.0E+10).]
  Xanthopoulos, ISSCC-2001
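For readers less used to decibels, the sketch below converts the 15dB attenuation figure into a voltage ratio using the standard 20*log10 convention for amplitudes. The 100mV ripple amplitude is a hypothetical input chosen only to make the numbers concrete.

```python
def db_to_amplitude_ratio(db):
    """Voltage (amplitude) ratio for a gain expressed in dB."""
    return 10.0 ** (db / 20.0)

def attenuated_ripple_mv(ripple_mv, attenuation_db):
    """Ripple remaining after `attenuation_db` of supply rejection."""
    return ripple_mv * db_to_amplitude_ratio(-attenuation_db)

print(round(db_to_amplitude_ratio(-15.0), 3))        # about 0.178
print(round(attenuated_ripple_mv(100.0, 15.0), 1))   # about 17.8 mV
```

So "more than 15dB" means that less than about 18% of the supply noise amplitude above 1MHz reaches the DLL supply.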
On-Die Clock Jitter Detector
  [Figure: block diagram with labels Internal Clock, Phase bins, 0.5 * DL (x2), clk, ref, Array Phase Detector, Post Process Circuitry + Registers, Counter (inc/dec) and Digital LPF (up/dn#).]
  Kuppuswamy, VLSI Symposium 2001
Array Phase Detector
  7 elements above and below center, with increasing positive and negative built-in offset away from center.
  Phase offset created by progressively delaying data with respect to the clock.
  [Figure: array of flip-flops (FF) with clk and ref inputs.]
Histogram Mode Operation
  [Figure: Array Phase Detector, XOR Logic and Error Detection Logic producing a histogram of jitter error count versus bins.]
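A behavioral model can make the array phase detector and its histogram mode more concrete. The sketch below is an assumption-heavy illustration, not the published circuit: it models 15 flip-flops (a center element plus 7 on each side) with evenly spaced built-in offsets, XORs adjacent outputs to locate the transition, and counts hits per bin. The 5ps bin width, the even spacing, the handling of out-of-range samples and all names are hypothetical.

```python
from collections import Counter

N_SIDE = 7       # elements above and below center (from the slide)
BIN_PS = 5.0     # hypothetical offset between neighboring elements

# Built-in offsets: -35 ps ... 0 ... +35 ps, 15 elements in total.
OFFSETS = [k * BIN_PS for k in range(-N_SIDE, N_SIDE + 1)]

def sample_array(phase_error_ps):
    """Each element outputs 1 if the phase error exceeds its offset,
    giving a thermometer-like pattern of ones followed by zeros."""
    return [1 if phase_error_ps > off else 0 for off in OFFSETS]

def transition_bin(bits):
    """XOR adjacent outputs to find the 1-to-0 boundary.
    Bin k covers (k * BIN_PS, (k + 1) * BIN_PS]. Returns None when the
    error is outside the array range (no transition found)."""
    xors = [a ^ b for a, b in zip(bits, bits[1:])]
    if 1 not in xors:
        return None
    return xors.index(1) - N_SIDE

def histogram(phase_errors_ps):
    """Count how many samples fall in each bin (None = out of range)."""
    counts = Counter()
    for err in phase_errors_ps:
        counts[transition_bin(sample_array(err))] += 1
    return counts

samples = [12.0, 13.0, -3.0, 2.0, 100.0]   # hypothetical phase errors, ps
print(sorted(histogram(samples).items(), key=lambda kv: (kv[0] is None, kv[0])))
```

Graph mode would log the bin index of each sample over time instead of accumulating counts, so the same `transition_bin` output could serve both modes in this model.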
Graph Mode Operation
  [Figure: Array Phase Detector, XOR Logic and Error Detection Logic producing a plot of jitter error (encoded bins) versus time.]
Clock Power Breakdown Example
  30% of the total power is attributed to clock.
  Most of the clock power is used in the final clock buffers and flip-flops.
  [Figure: pie chart with values 2.1%, 1.5%, 26.2% and 70.2% and legend entries 1st Level, 2nd Level, 3rd Level and Rest of chip.]
  Anderson, ISSCC-2002
Clock Power Reduction
  Clock Power = f * C * V^2
  Reduce clock frequency:
    Multiple frequency domains
    Dual edge triggered flip-flops
  Reduce voltage swing:
    Low swing clocks
  Reduce clock loading:
    Clock gating
    Clock-on-demand flip-flop
    Optimized routing
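The three levers in the power formula can be compared numerically. The sketch below evaluates f * C * V^2 for a baseline and for each lever in isolation; the baseline values (1GHz, 1nF, 1.2V) and the 40% gated fraction are hypothetical and chosen only for illustration.

```python
def clock_power_w(freq_hz, cap_farad, swing_v):
    """Clock power from the slide's formula: f * C * V^2."""
    return freq_hz * cap_farad * swing_v ** 2

# Hypothetical baseline: 1 GHz clock, 1 nF clock load, 1.2 V swing.
F, C, V = 1e9, 1e-9, 1.2
base = clock_power_w(F, C, V)

half_freq = clock_power_w(F / 2, C, V)     # e.g. dual-edge flip-flops
half_swing = clock_power_w(F, C, V / 2)    # e.g. low-swing clocking
gated = clock_power_w(F, 0.6 * C, V)       # e.g. 40% of the load gated off

for label, p in [("baseline", base), ("half frequency", half_freq),
                 ("half swing", half_swing), ("40% load gated", gated)]:
    print(f"{label:15s} {p:.3f} W ({100 * p / base:.0f}% of baseline)")
```

Because the swing enters quadratically, halving it cuts this idealized estimate to a quarter, while frequency and load reductions scale linearly.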
Half Swing Clocking
  Requires four clock signals.
  Two clock phases with a swing between Vdd and Vdd/2 drive the PMOS devices.
  The other two phases, with a swing between Gnd and Vdd/2, drive the NMOS transistors.
  Experimental power savings of 67% were demonstrated on a 0.5µm CMOS test chip with only 0.5ns speed degradation.
  Requires additional area for the special clock drivers and suffers from skew problems between the four phases.
  Kojima, JSSC 1995
Clock-on-demand Flip-Flop
  Activates the internal clock only when the input data will change the output; equivalent to single-bit clock gating.
  Longer setup time and sensitive to hold time violations.
  Hamada, ISSCC 1999
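The power benefit of a clock-on-demand flip-flop depends on how often the data actually changes. The sketch below is a purely behavioral model, not the published circuit: it counts how many external clock cycles would trigger the internal clock for a given bit stream. The function name and the sample data are hypothetical.

```python
def internal_clock_pulses(data_stream, initial_q=0):
    """Count cycles in which the internal clock fires, i.e. cycles where
    the sampled input differs from the currently stored output."""
    q = initial_q
    pulses = 0
    for d in data_stream:          # one input sample per external cycle
        if d != q:                 # data would change the output
            pulses += 1
            q = d
    return pulses

data = [0, 0, 1, 1, 1, 0, 0, 0]    # hypothetical low-activity input
pulses = internal_clock_pulses(data)
print(pulses, "internal clock pulses in", len(data), "cycles")   # 2 in 8
```

For low-activity nodes most cycles leave the internal clock idle, which is where the savings come from; for data that toggles every cycle the scheme saves nothing and still pays the setup-time penalty noted above.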
XScale Processor Clock Gating
  Three hierarchical clock gating levels:
    Top level: stop clock
    Unit level: 83 enables
    Local clock buffers: 400 unique enables
  [Figure: clock spine (CLK SPINE, M5) and EGCLK (M6) with gated clocks labeled GCLK_DA1, GCLK_DA2, GCLK_DA9, GCLK_DA10, GCLK_IA1, GCLK_IA2, GCLK_IA9, GCLK_IA10, GCLK_DC1, GCLK_IC1, GCLK_RF1 and GCLK_MA2, and enables labeled DA_BNK1_EN, DA_BNK1_EN#, IA_BNK1_EN, IA_BNK1_EN# and GLB_MA_EN.]
  Clark, JSSC 11/2001
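A hierarchical gating scheme can be modeled as a chain of enables, where a local clock toggles only if every level above it allows it. The sketch below assumes the three levels combine as a simple AND, which the slide does not spell out; the function and parameter names are hypothetical.

```python
def local_clock_active(stop_clock, unit_enable, local_enable):
    """ASSUMED combination: the local clock runs only when the top-level
    stop clock is deasserted and both lower-level enables are set."""
    return (not stop_clock) and unit_enable and local_enable

def active_fraction(stop_clock, units):
    """Fraction of local clock buffers toggling in one cycle.
    `units` maps a unit enable to the list of its local enables."""
    total = sum(len(local) for _, local in units)
    active = sum(local_clock_active(stop_clock, unit_en, en)
                 for unit_en, local in units for en in local)
    return active / total

# Hypothetical snapshot: two units, the second one disabled at unit level.
units = [(True, [True, False, True, True]), (False, [True, True])]
print(active_fraction(False, units))   # 0.5
print(active_fraction(True, units))    # 0.0
```

Gating higher in the hierarchy removes more load per enable, while the many local enables allow finer control; the model shows how both contribute to the fraction of clock load that toggles.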
Dual Edge Triggered Flip-Flop
  Operates at half the clock frequency.
  Requires tight control of the clock duty cycle.
  [Figure: transistor-level schematic with two first stages (X and Y) and a second stage; devices Mp1 to Mp8 and Mn1 to Mn10, inverters I1 to I3 and Inv1 to Inv4, load CL, and clocks CLK and CLK1 to CLK4.]
  Nedovic, ESSCIRC 2002
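The "half the clock frequency" property can be illustrated with a behavioral model that counts captures and clock transitions. The sketch below is not a model of the transistor-level circuit; the sampled waveforms and function names are hypothetical.

```python
def capture_indices(clk, dual_edge):
    """Sample indices at which data is captured.
    Single-edge: rising edges only. Dual-edge: every clock transition."""
    hits = []
    for i in range(1, len(clk)):
        rising = clk[i - 1] == 0 and clk[i] == 1
        falling = clk[i - 1] == 1 and clk[i] == 0
        if rising or (dual_edge and falling):
            hits.append(i)
    return hits

def transitions(clk):
    """Number of clock transitions, a proxy for clock switching power."""
    return sum(a != b for a, b in zip(clk, clk[1:]))

clk_full = [0, 1, 0, 1, 0, 1, 0, 1, 0]   # full-rate clock
clk_half = [0, 0, 1, 1, 0, 0, 1, 1, 0]   # same time span, half the frequency

single = capture_indices(clk_full, dual_edge=False)
dual = capture_indices(clk_half, dual_edge=True)
print(len(single), len(dual))                         # 4 4
print(transitions(clk_full), transitions(clk_half))   # 8 4
```

Both schemes capture the same number of data values over the window, but the half-rate clock toggles half as often. Because captures now depend on both edges, any duty cycle error shifts every other capture point, which is why the duty cycle needs tight control.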
Rotary Clock Distribution
  Transmission line based, self-regenerating rotary clock generator.
  Wood, ISSCC-2001
Standing Wave Oscillator
  O'Mahony, ISSCC-2003
10GHz Clock Grid Test Chip
  Fabricated in a 0.18µm 1.8V 6M CMOS process.
  Very low clock skew and power consumption.
  Attractive alternative for 10GHz clocking and beyond.
Optical Clock Distribution
  Board-level guided-wave H-tree distribution.
  Monolithic silicon-based detection.
  Couplers provide tolerance for horizontal and vertical misalignment of the flip-chip assembly.
  Optical transmission is immune to process variations, power-grid noise and temperature.
  J.D. Meindl, Georgia Institute of Technology, 2000
Summary
  High performance processors require a low skew and jitter clock distribution network.
  Clock distribution techniques are optimized to achieve the best skew and jitter with reduced area and power consumption.
  Deskew techniques are demonstrated to cut the skew to about ¼ of its original value.
  On-die supply filters are used to reduce jitter.
  Intensive research focuses on novel clock distribution techniques.