Tutorial Outline

8:30-8:45    Introduction and motivation
8:45-9:05    Sources of power in CMOS designs
9:05-9:30    Power analysis tools and techniques
9:30-10:30   Gate & functional unit design issues & techniques
10:30-10:50  BREAK
10:50-12:15  Architectural level issues and techniques
12:15-1:30   LUNCH
1:30-2:30    Low power memory system design
2:30-3:30    Software level issues and techniques
3:30-3:50    BREAK
3:50-4:30    Software level issues and techniques, con't
4:30-4:45    Future challenges

ISCA Tutorial: Low Power Design Gate.1

Design Levels

Abstraction Level       Analysis   Analysis   Analysis   Analysis    Energy
                        Capacity   Accuracy   Speed      Resources   Savings
Application             Most       Worst      Fastest    Least       Most
Behavioral
Architectural (RTL)
Logic (Gate)
Transistor (Circuit)    Least      Best       Slowest    Most        Least

Basic Principles of Low Power Design

$P = C_L V_{DD}^2 f_{0 \to 1} + t_{sc} V_{DD} I_{peak} f_{0 \to 1} + V_{DD} I_{leakage}$

Reduce switching (supply) voltage
  » quadratic effect -> dramatic savings
  » negative effect on performance
Reduce capacitance
Reduce switching frequency
  » switching activity
  » clock rate
Reduce glitching
Reduce short circuit currents (slope engineering)
Reduce leakage currents

Low Energy Gates: Transistor Sizing

Use the smallest transistors that satisfy the delay constraints
  » slack time - difference between required time and arrival time of a signal at a gate output
  » positive slack - size down
  » negative slack - size up
Make gates that toggle more frequently smaller
Size for slope engineering to reduce short circuit currents
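
The three terms of the power equation on the Basic Principles slide (switching, short circuit, leakage) can be evaluated separately to see the quadratic effect of supply voltage. The sketch below is only an illustration: the function name `cmos_power` and every numeric value are hypothetical, not taken from the tutorial.

```python
def cmos_power(c_load, v_dd, f_01, t_sc, i_peak, i_leak):
    """Return (switching, short_circuit, leakage, total) power in watts.

    c_load : switched load capacitance C_L (F)
    v_dd   : supply voltage V_DD (V)
    f_01   : rate of 0->1 transitions (Hz)
    t_sc   : short-circuit conduction time per transition (s)
    i_peak : peak short-circuit current (A)
    i_leak : leakage current (A)
    """
    switching = c_load * v_dd ** 2 * f_01
    short_circuit = t_sc * v_dd * i_peak * f_01
    leakage = v_dd * i_leak
    return switching, short_circuit, leakage, switching + short_circuit + leakage


# Hypothetical operating point, chosen only for illustration.
params = dict(c_load=50e-15, f_01=100e6, t_sc=50e-12, i_peak=1e-4, i_leak=1e-9)

base = cmos_power(v_dd=2.5, **params)
half = cmos_power(v_dd=1.25, **params)

# Halving V_DD cuts the switching term to one quarter (quadratic effect),
# while the short-circuit and leakage terms in this model only halve.
switching_ratio = half[0] / base[0]
leakage_ratio = half[2] / base[2]
print(switching_ratio, leakage_ratio)
```

The model holds `i_peak` and `t_sc` fixed as the supply changes, which is a simplification; in a real circuit both depend on V_DD and on the signal slopes.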

Low Energy Gates: Transistor Pin Ordering

Logically equivalent inputs may not have identical energy/delay characteristics

[Figure: transistor-level gate with inputs A and B, output Out, output capacitance C_out, and internal node capacitance C_i]

To conserve energy (and improve speed), connect inputs so that the most active input is nearest the output
Need to know signal statistics

Low Energy Gates: Dynamic Gate Pin Ordering

Dynamic gates exhibit higher switching activity (and add to clock load) but are fast

[Figure: two dynamic gate implementations with select inputs SelA, SelB, SelC, one with data inputs !A, !B, !C and one with data inputs A, B, C]

If A, B, and C have low signal probability

Low Energy Gates: Gate Restructuring

Logically equivalent CMOS gates may not have identical energy/delay characteristics

Low Energy Gate Networks: Balanced Delay Paths

Reduce glitching by balancing the delay paths

[Figure: networks of blocks F1, F2, and F3 annotated with path delays]

Equalize lengths of timing paths through logic

Low Energy Gate Networks: Network Restructuring

Consider logic topology alternatives

[Figure: chain and tree implementations of F with input signal probabilities A = B = C = D = 0.5.
 Chain: switching activity 3/16 at W, 7/64 at X, 15/256 at F.
 Tree: switching activity 3/16 at Y, 3/16 at Z, 15/256 at F.]

Chain implementation has a lower overall switching activity than the tree implementation
Ignores glitching effects

Network Restructuring, con't

Logically equivalent gate networks may not have identical energy/delay characteristics

[Figure: technology mapping alternatives for F = ABCD, compared by delay, area, and energy]
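
The switching activities quoted for the chain and tree versions of F = ABCD can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the gates are taken to be AND gates, the inputs are statistically independent, and the activity of a node with signal probability p is taken as p(1 - p). The helper names `and_gate` and `activity` are illustrative.

```python
from fractions import Fraction


def and_gate(*input_probs):
    """Signal probability of an AND output, assuming independent inputs."""
    p = Fraction(1)
    for q in input_probs:
        p *= q
    return p


def activity(p):
    """Switching activity of a node with signal probability p: p * (1 - p)."""
    return p * (1 - p)


half = Fraction(1, 2)  # A = B = C = D = 0.5

# Chain: W = A.B, X = W.C, F = X.D
w = and_gate(half, half)
x = and_gate(w, half)
f_chain = and_gate(x, half)
chain_nodes = [activity(w), activity(x), activity(f_chain)]
chain_total = sum(chain_nodes)

# Tree: Y = A.B, Z = C.D, F = Y.Z
y = and_gate(half, half)
z = and_gate(half, half)
f_tree = and_gate(y, z)
tree_nodes = [activity(y), activity(z), activity(f_tree)]
tree_total = sum(tree_nodes)

print(chain_nodes, chain_total)  # 3/16, 7/64, 15/256 -> 91/256
print(tree_nodes, tree_total)    # 3/16, 3/16, 15/256 -> 111/256
```

The totals (91/256 for the chain, 111/256 for the tree) count zero-delay transitions only, so glitching is ignored here just as on the slide.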

Low Energy Gate Networks: Network Input Ordering

[Figure: two input orderings of a two-gate network computing F, with signal probabilities A = 0.5, B = 0.2, C = 0.1.
 A and B combined first (X), then C: activity at X = (1 - 0.5 x 0.2) x (0.5 x 0.2) = 0.09
 B and C combined first (X), then A: activity at X = (1 - 0.2 x 0.1) x (0.2 x 0.1) = 0.0196]

Beneficial to postpone the introduction of signals with a high transition rate (signals with signal probability close to 0.5)

Dual Supply Voltages

Use two V_DDs (e.g., 2.5 V and 1.5 V)
  » use the higher supply for gates on the critical path
  » use the lower supply for gates off the critical path
Reduces energy without a performance loss
Cons
  » slight area penalty
  » increased design time
  » need level converters to interconnect gates on different supplies (to avoid static currents)
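
A rough feel for the dual-supply saving follows from the quadratic dependence of switching energy on V_DD. The sketch below uses the two example supplies from the slide; the fraction of switched capacitance that sits off the critical path (`off_path_fraction`) is a hypothetical number, and level-converter overhead is deliberately left out.

```python
def relative_switching_energy(v_low, v_high):
    """Switching energy at v_low relative to v_high (energy scales as V^2)."""
    return (v_low / v_high) ** 2


def design_energy(off_path_fraction, low_supply_ratio):
    """Energy of the dual-supply design relative to a single high supply.

    off_path_fraction : share of switched capacitance moved to the low supply
                        (hypothetical, design dependent)
    low_supply_ratio  : per-gate energy ratio from relative_switching_energy
    """
    return (1.0 - off_path_fraction) + off_path_fraction * low_supply_ratio


ratio = relative_switching_energy(1.5, 2.5)   # (1.5 / 2.5)^2 = 0.36
total = design_energy(0.6, ratio)             # 0.4 + 0.6 * 0.36 = 0.616
print(ratio, total)
```

With these assumed numbers a gate moved to the 1.5 V supply uses about 36% of its original switching energy, and the whole design about 62%, before subtracting the cost of the level converters.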

Dual Threshold Voltages

Use two V_Ts (e.g., 0.6 V and 0.3 V for V_DD = 2.5 V)
  » use the lower threshold for gates on the critical path
  » use the higher threshold for gates off the critical path
Improves performance without an increase in power
Cons
  » increased fabrication complexity
  » increased design time
  » beware of increased leakage in the low V_T portion of the circuit - could end up with increased power!

Functional Unit Energy Optimization

Key processor core functional units
  » latches and (pipeline) registers
  » ALUs - adders, multipliers, barrel shifters
  » control logic (FSMs)
  » interconnect
  » multi-ported register file
On-chip memories (ROMs, caches, SRAMs, eDRAMs)
MMU, TLB
Clock generation and distribution
Off-chip interconnect (pads)

Flipflops and Pipeline Registers

Consume a lot of energy because they are clocked every cycle
  » Clock energy (E_c): energy dissipated when the ff is clocked with stable data
  » Data energy (E_d): energy dissipated when the ff is clocked and the data has changed so that the ff changes state
  » Typically the data rate (f_d) is much lower than the clock rate (f_c)
Also impacts clock energy since a large portion of clock energy is used to drive the sequential elements

Power Consumption in Latches

[Figure: percentage of latch power consumed by data and by clock versus latch data activity factor (AF), 0.1 to 0.5]
From Tiwari, 1998
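
The clock-energy and data-energy definitions for flipflops suggest a simple per-cycle model. The sketch below assumes that every clock cycle costs either E_c (data stable) or E_d (state changes), and that the fraction of cycles in which the state changes is f_d / f_c. The model and the energy values are illustrative assumptions, not figures from the tutorial.

```python
def ff_energy_per_cycle(e_clock, e_data, data_rate_ratio):
    """Average flipflop energy per clock cycle.

    e_clock         : E_c, energy when clocked with stable data
    e_data          : E_d, energy when clocked and the state changes
    data_rate_ratio : f_d / f_c, fraction of cycles in which the state changes
    """
    if not 0.0 <= data_rate_ratio <= 1.0:
        raise ValueError("f_d / f_c must lie between 0 and 1")
    return (1.0 - data_rate_ratio) * e_clock + data_rate_ratio * e_data


def stable_cycle_share(e_clock, e_data, data_rate_ratio):
    """Share of the average energy spent in cycles where the data did not change."""
    total = ff_energy_per_cycle(e_clock, e_data, data_rate_ratio)
    return (1.0 - data_rate_ratio) * e_clock / total


# Hypothetical energies in femtojoules.
E_C, E_D = 10.0, 30.0
avg = ff_energy_per_cycle(E_C, E_D, 0.1)    # 0.9 * 10 + 0.1 * 30 = 12 fJ
share = stable_cycle_share(E_C, E_D, 0.1)   # 9 / 12 = 0.75
print(avg, share)
```

Under these assumed numbers three quarters of the flipflop energy is spent in cycles that do no useful work, which is the motivation for gating the clock when the data is stable.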

Some Typical CMOS FFs

[Figure: schematics of four flipflops: static TG FF, dynamic C2MOS FF, dynamic precharged TSPC FF, and dynamic non-precharged TSPC FF]

FF Power Comparison

[Figure: relative power consumption versus latch data activity factor (AF) for TGFF, C2MOS, PTSPC, and NPTSPC]
From Svenson, 1996

Energy Efficient Flipflops

[Figure: schematics of two flipflops.
 PowerPC 603 FF: 16 transistors, 4 clock loads each
 StrongARM SA110 FF: 20 transistors, 3 clock loads]

EDP of Some Low Power FFs

[Figure: EDPtot (fJ) for HLFF, SDFF, PowerPC, mC2MOS, SA110 FF, and K6 ETL, with High, Low, and Average series]
From Stojanovic, 1998

Self-Gating FF

When the ff input is equal to its output, suppress internal clocking to conserve energy
  » gating function is derived within the FF

[Figure: self-gating flipflop schematic with input D, output Q, and clock Φ]

Strict rules on when D can change wrt the clock

Power of Self-Gated FF

[Figure: power dissipation of the self-gated (SG) FF and a regular FF versus data switching rate f_d/f_c]
From Reyes, 1996

Double Edge Triggered FF

Loads data at both rising and falling clock edges

[Figure: double edge triggered flipflop schematic with input D and output Q]

DETFF Pros and Cons

Advantages
  » Clock frequency can be halved to achieve the same computational throughput: $P_d = 0.84 P_s$
  » Also get a 2X energy savings in the clock network
Disadvantages
  » About 15% larger in transistor count
  » Maximum operating frequency less
  » Strict requirements on clock skew
  » Requires a strict 50% duty cycle
  » Larger clock load

Adders (Subtractors)

[Figure: adder taxonomy tree. Synchronous word parallel adders divide into ripple carry adders (RCA), T = O(n), A = O(n), and carry prop min adders. Carry prop min adders divide into signed-digit adders, fast carry prop adders, and residue adders. Fast carry prop adders comprise Manchester carry chain, carry select, carry lookahead, conditional sum, and carry skip. Complexity annotations on the remaining nodes, in slide order: T = O(1), A = O(n); T = O(n), A = O(n); T = O(log n), A = O(n log n); T = O(n**1/2), A = O(n)]

PDP of Different Adders

[Figure: power-delay product of RCA, MCCA, CSkA, VSkA, CSlA, CLA, BKA, and ELMA adders at 8, 16, 32, 48, and 64 bits]
From Nagendra, 1996

Parallel Prefix Computation

Brent-Kung (CLA) Adder

[Figure: 16-bit Brent-Kung prefix network taking inputs (g_15, p_15) ... (g_0, p_0) and producing carries c_16 ... c_1; annotated T = log_2 n - 1, A = 2 log_2 n, T = log_2 n, A = n/2]

BK and ELM Adder Optimization

[Figure: EDP (pJ) of BK Classic, BK Hybrid, ELM Classic, and ELM Hybrid adders at 16, 32, and 64 bits]
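
The Brent-Kung carry computation can be sketched in software as a parallel prefix over (generate, propagate) pairs. This is a minimal behavioural model, not a gate-level description: it assumes the word length is a power of two and a carry-in of zero, and the function names are illustrative. In this sketch the up-sweep has log2(n) levels and the down-sweep has log2(n) - 1 levels.

```python
def combine(hi, lo):
    """Prefix operator: merge a more significant (g, p) group with a less significant one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def brent_kung_prefix(gp):
    """Return x where x[i] is the group (g, p) over bits i..0 (index 0 = LSB)."""
    n = len(gp)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    x = list(gp)
    # Up-sweep: build groups of width 2, 4, 8, ... (log2 n levels).
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = combine(x[i], x[i - d])
        d *= 2
    # Down-sweep: fill in the remaining prefixes (log2 n - 1 levels).
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            x[i] = combine(x[i], x[i - d])
        d //= 2
    return x


def prefix_add(a, b, n=16):
    """Add two n-bit unsigned integers using Brent-Kung carries (carry-in = 0)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    prefix = brent_kung_prefix(gp)
    carries = [0] + [g for g, _ in prefix]  # carries[i] is the carry into bit i
    total = 0
    for i in range(n):
        total |= (gp[i][1] ^ carries[i]) << i
    return total | (carries[n] << n)


print(hex(prefix_add(0xFFFF, 0x0001)))  # carry ripples through all 16 bits
```

Because the loops run sequentially, the code only shows which nodes are combined at each level; in hardware all combinations within one level operate in parallel.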

Parallel Multipliers

Form partial product array in parallel and add it in parallel
  » can use multiplier recoding to reduce the height of the partial product array by half
  » recoding may cost more energy than it saves!
  » use delay balancing to reduce glitching
Array multipliers (regularity)
Pipelined multipliers (higher throughput, longer latency, less glitching but adds to clock load)

Parallel Multiplier Structure

[Figure: Q (multiplier) and D (multiplicand) feed the multiple forming circuits, followed by the partial product array reduction tree and a fast CPA that produces P (product). Delay: muxes + tree reduction (log n) + CPA]

PP Array Reduction Process

[Figure: the partial product array formed from the multiplicand and multiplier is reduced by (4,2) counters to a reduced partial product array, which goes to the CPA]

(4,2) Counters

Built out of (3,2) counters (FAs)

[Figure: (4,2) counters, each built from two (3,2) counters, tiled side by side]

Tiles with neighboring (4,2) counters
Can use delay balancing in cell design and interconnect to reduce glitching
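
A (4,2) counter built from two (3,2) counters can be checked exhaustively in a few lines. The wiring below (first full adder on three inputs, second on its sum, the fourth input, and the carry from the neighbouring cell) is one common arrangement and is an assumption here; the slide does not spell out the connections.

```python
from itertools import product


def full_adder(a, b, c):
    """(3,2) counter: returns (sum, carry) with a + b + c == sum + 2 * carry."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def counter_4_2(x1, x2, x3, x4, cin):
    """(4,2) counter from two full adders.

    Returns (sum, carry, cout). cout goes to the neighbouring cell's cin and
    does not depend on this cell's cin, so tiled cells do not form a ripple chain.
    """
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout


# Exhaustive check of the arithmetic identity over all 32 input combinations.
all_ok = all(
    sum(bits) == s + 2 * (carry + cout)
    for bits in product((0, 1), repeat=5)
    for s, carry, cout in [counter_4_2(*bits)]
)
print(all_ok)
```

The identity x1 + x2 + x3 + x4 + cin = sum + 2 (carry + cout) is what lets a column of four partial product bits be reduced to two.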

PP Array Reduction Tree Structure

[Figure: multiple generators, driven by the multiplicand and by multiple selection signals from the multiplier, feed successive levels of (4,2) counter slices that end in the CPA]

Glitch Reduction by Pipelining

Glitches are dependent on the logic depth of the circuit
Nodes logically deeper are more prone to glitching
  » arrival times of the gate inputs are more spread due to delay imbalances
  » usually affected by more primary input switching
Reduce depth by adding pipeline registers

Pipelined Parallel Multiplier

[Figure: Q (multiplier) and D (multiplicand) feed the multiple forming circuits, followed by the partial product array reduction tree and a fast CPA that produces P (product), with pipeline registers clocked by clk]

Helps to reduce glitching but adds to the clock load

CSA Array Multiplier

[Figure: 4 x 4 CSA array multiplier with multiplier bits q_3 ... q_0, multiplicand bits d_0 ... d_3, cells M_ij, and product bits p_0 ... p_7. Each CSA cell has a sum input, a carry in, a carry out, and a sum output]

Longest delay path: n + n - 1 = 2n - 1

Multiplier Cell Structure

[Figure: multiplier cell with inputs A_i and B_j and a full adder with sum input, carry in, and carry out; delay elements 1D and 2D on the cell inputs]

Add delay elements to minimize glitching

Pipelined CSA Array Multiplier

[Figure: pipelined 4 x 4 CSA array multiplier clocked by clk, with multiplier bits q_3 ... q_0, multiplicand bits d_0 ... d_3, cells M_ij, and product bits p_0 ... p_7]

Barrel Shifters

[Figure: average power (mW) and delay (ns) of four barrel shifter designs: log PT, Array PT, log static, and log dynamic]

Influence of architecture (Logarithmic, Array) and gate type (Pass Transistor, Dynamic/Static Mux)
From Acken, 1996

Control Unit Design

[Figure: finite state machine with Inputs, Combinational Logic, Outputs, and State FFs]

State Encoding

One of the most important factors determining area, speed, and energy of the resulting control logic
n! different possible encodings (n states)

[Figure: state transition graph with encoded states and input/output edge labels]

Energy State Encoding Heuristic

Area driven -> try to reduce the distance in Boolean n-space between related states
Energy driven -> try to minimize number of bit transitions in the state register
  » fewer transitions in state register
  » fewer transitions propagated to combinational logic

[Figure: state transition graph whose edges are labeled with the probability that the transition will occur (0.1, 0.3, 0.4, 0.1, 0.1); the sum of all edges equals unity]

Caveat

Lowest E[M] may not be lowest in energy: it could require more gates and/or signal transitions in the combinational logic
Experiments show that the area and energy dissipation of a state machine are correlated when the state encoding is varied
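
The energy-driven heuristic can be illustrated by scoring candidate encodings with the expected number of state-register bit transitions per clock. The sketch below assumes that this is what E[M] denotes, computes it as the sum over edges of edge probability times Hamming distance between the two state codes, and searches all n! assignments of n distinct codes. The four-state machine, its state names, and its edge probabilities are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def hamming(a, b):
    """Number of bit positions in which two state codes differ."""
    return bin(a ^ b).count("1")


def expected_transitions(encoding, edge_probs):
    """Sum over edges of (edge probability) x (bits flipped in the state register)."""
    return sum(p * hamming(encoding[s], encoding[t]) for (s, t), p in edge_probs.items())


# Hypothetical machine; edge probabilities sum to one.
states = ["S0", "S1", "S2", "S3"]
edge_probs = {
    ("S0", "S1"): Fraction(4, 10),
    ("S1", "S0"): Fraction(3, 10),
    ("S1", "S2"): Fraction(1, 10),
    ("S2", "S3"): Fraction(1, 10),
    ("S3", "S0"): Fraction(1, 10),
}

# Straight binary assignment: S0=00, S1=01, S2=10, S3=11.
binary = dict(zip(states, range(4)))
binary_cost = expected_transitions(binary, edge_probs)

# Exhaustive search over all 4! = 24 assignments of the four 2-bit codes.
best_codes = min(
    permutations(range(4)),
    key=lambda codes: expected_transitions(dict(zip(states, codes)), edge_probs),
)
best_cost = expected_transitions(dict(zip(states, best_codes)), edge_probs)

print(binary_cost, best_cost)  # 6/5 for straight binary, 1 for the best assignment
```

Exhaustive search is only practical for very small machines, since the number of assignments grows factorially. As the caveat above notes, the encoding with the lowest score is not guaranteed to give the lowest total energy once the combinational logic is included.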

State Encoding Effects

[Figure: power versus area of a state machine for different state encodings]
From Yeap, 1997

Practical Considerations

Balance area-energy by forced encoding of only a subset of states that span the high probability edges
  » leave assignment of remaining states to the logic synthesis system for area optimization
  » fortunately, in practice, most state machines have this characteristic
Unlike area encoding, energy encoding requires knowledge of probabilities of state transitions and input signals

A Low Power Processor Core Example

M CORE Architecture

[Figure: M CORE block diagram. Register files: GP reg file (32 bit x 16), Alt reg file (32 bit x 16), Control reg file (32 bit x 13), with X port and Y port. Datapath blocks: Scale, Immed, Sign ext, PC increment, Branch adder, Barrel shift / FF1, ALU / priority encode / detect. Control: Instr pipeline, Instr decoder. Buses: Address bus, Data bus, Writeback bus, H/W acc bus]

M CORE Power Distribution

[Figure: two pie charts. First chart: Datapath, Clock, Control, with slices of 28%, 36%, and 36%. Second chart: Reg File, Addr/Data Bus, Inst Reg, Barrel Shifter, X MUX, Y MUX, Addr Gen, Other, with slices of 5%, 9%, 6%, 7%, 8%, 9%, 14%, and 42%]

Key References

Hossain, Low power design using double edge triggered flipflop, IEEE Trans. on VLSI Systems, 2(2):261-265, 1994.
Motorola, M CORE Architecture microRISC Engine, MCORE 1/D, www.mot.com/sps/mcore/info_documentation.htm
Mutsunori, Low power design method using multiple supply voltages, SLPED, 1997.
Rabaey, Digital Integrated Circuits, Prentice-Hall, 1996.
Reyes, Low Power FF Circuit and Method Thereof, Patent No 5,498,988, 1996.
Roy, Power analysis and design at the system level, Low Power Design in Deep Submicron Electronics, Nebel and Mermet, Eds., Kluwer, 1997.
Sakuta, Delay balanced multipliers for low power, SLPE, 1995.
Scott, Designing the Low-Power M CORE Architecture, Proc. Inter. Symp. Computer Architecture Power Driven Microarchitecture Workshop, June 1998.
Stojanovic, A unified approach in the analysis of latches and FFs for low power systems, ISLPED, 1998.
Tiwari, Reducing power in high-performance microprocessors, DAC, 1998.
Yeap, CPU controller optimization for HDL logic synthesis, CICC, 1997.
Yeap, Practical Low Power Digital VLSI Design, KAP, 1998.