ISLPED 2004, 8/10/2004. Power-Optimal Pipelining in Deep Submicron Technology. Seongmoo Heo and Krste Asanović, Computer Architecture Group, MIT CSAIL
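Traditional Pipelining. Goal: maximum performance. [Figure: pipeline timing diagram at supply voltage Vdd; each clock period (Clk) is labeled with Clk-Q delay, propagation delay, and setup time.]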
Pipelining as a Low-Power Tool. Goal: low power, fixed throughput. [Figure: pipeline timing diagram at supply voltage Vdd; each clock period (Clk) is labeled with Clk-Q delay, propagation delay, setup time, and the remaining time slack.] The time slack is traded for power (supply voltage scaling).
Pipelining as a Low-Power Tool (*clock frequency fixed). [Figure: power versus delay.] Pipelining adds flip-flop overhead but creates time slack; supply voltage scaling then converts that slack into a power saving.
Power-Optimal Pipelining. Power reduction from pipelining is limited by the power overhead of the increased number of flip-flops. [Figure: power versus delay, marking too-shallow pipelining, too-deep pipelining, and the optimal pipelining point with its optimal power saving.]
Contribution. Pipelining is an old idea, and research has focused on its performance impact. The idea of using pipelining to lower power [Chandrakasan 92] has not been fully explored in deep submicron technology. This work: analysis and circuit-level simulation of power-optimal pipelining for different regimes of Vth, activity factor, and clock gating.
Bottom-to-Top Approach. (1) Impact of pipelining on each power component. (2) Impact of pipelining on total power (with/without clock gating). [Figures: total power over time across active, inactive, and active periods, shown once for the clock-gated case and once for the non-clock-gated case; components labeled switching, leakage, and idle power.] *Idle power = power consumed when the circuit is idle and not clock-gated.
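The composition of total power from its components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the activity factor is taken to be the fraction of time the circuit is active, the idle power is taken to already include leakage, and the function name and example numbers are hypothetical rather than taken from the slides.

```python
def average_total_power(p_switching, p_leakage, p_idle,
                        activity_factor, clock_gated):
    """Time-averaged total power of one block (arbitrary units).

    Assumptions (not from the slides):
    - activity_factor is the fraction of time the block is active.
    - While active, the block burns switching + leakage power.
    - While inactive, a clock-gated block burns leakage only,
      and a non-gated block burns the idle power instead.
    """
    if not 0.0 <= activity_factor <= 1.0:
        raise ValueError("activity_factor must be in [0, 1]")
    active = p_switching + p_leakage
    inactive = p_leakage if clock_gated else p_idle
    return activity_factor * active + (1.0 - activity_factor) * inactive


# Hypothetical component values, chosen only to exercise the function.
gated = average_total_power(10.0, 2.0, 5.0, 0.5, clock_gated=True)
ungated = average_total_power(10.0, 2.0, 5.0, 0.5, clock_gated=False)
print(gated, ungated)
```

Under these assumptions the two cases differ only in what is charged to the inactive periods, which is why the gap between them shrinks as the activity factor approaches 1.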
Methodology. Target digital system: fixed throughput, highly parallel computation, logic-dominant. Process: BPTM (Berkeley Predictive Technology Model) 70nm, with LVT (0.17/-0.2), MVT (0.19/-0.22), and HVT (0.21/-0.24). Hspice simulation at 100 °C, clock = 2 GHz. [Figure: test bench; one pipeline stage consists of N FO4 inverters (N = 2 ~ 24) between TG flip-flops; a baseline is also labeled.]
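Pipelining and Switching Power: Analytical Trend. [Figure: switching power versus number of FO4 per stage, N, marking the optimal FO4 and the optimal power saving.] Left-hand side of the curve: flip-flop overhead, O(1/N). Right-hand side: O(N^2), a quadratic reduction of logic switching power, since $V_{dd}^2 \propto N^2$.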
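The two trends on this slide can be combined into a toy model to see why an interior optimum exists. The sketch below assumes, purely for illustration, that relative switching power is the sum of a logic term proportional to N^2 and a flip-flop overhead term proportional to 1/N. The additive form, the function names, and the coefficients are assumptions; the coefficients are picked so the toy optimum lands near 6 and are not fitted to any simulation.

```python
def relative_switching_power(n, logic_coeff, ff_coeff):
    """Toy model: logic term ~ N^2 plus flip-flop overhead ~ 1/N."""
    return logic_coeff * n ** 2 + ff_coeff / n


def optimal_depth_closed_form(logic_coeff, ff_coeff):
    """Minimizer of a*N^2 + b/N over real N > 0: N* = (b / (2a))^(1/3)."""
    return (ff_coeff / (2.0 * logic_coeff)) ** (1.0 / 3.0)


def optimal_depth_sweep(logic_coeff, ff_coeff, n_min=2, n_max=24):
    """Integer sweep over the stage depths, mirroring a simulation sweep."""
    return min(range(n_min, n_max + 1),
               key=lambda n: relative_switching_power(n, logic_coeff, ff_coeff))


# Hypothetical coefficients (illustrative only).
a, b = 1.0, 432.0
print(optimal_depth_closed_form(a, b))  # real-valued optimum of the toy model
print(optimal_depth_sweep(a, b))        # best integer depth in 2..24
```

In this model, deeper pipelining than the optimum is dominated by the 1/N flip-flop term and shallower pipelining by the N^2 logic term, which matches the shape described on the slide.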
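Pipelining and Leakage Power: Analytical Trend. [Figure: leakage power versus number of FO4 per stage, N, marking the optimal FO4 and the optimal power saving.] Left-hand side of the curve: flip-flop overhead, O(1/N). Right-hand side: $O(N^{\alpha})$ with $1 < \alpha < 2$, a superlinear reduction of logic leakage power, since $V_{dd} \cdot e^{\eta V_{dd}} \propto N^{\alpha}$ (DIBL effect).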
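As a worked illustration of how the exponent shifts the optimum, assume (this additive form and the positive constants $a$ and $b$ are assumptions, not part of the slides) that relative power is the sum of a logic term $a N^{\alpha}$ and a flip-flop overhead term $b/N$. Setting the derivative to zero gives the optimal depth:

```latex
\begin{align*}
P(N) &= a N^{\alpha} + \frac{b}{N}, \qquad a, b > 0,\ \alpha > 1 \\
\frac{dP}{dN} &= a \alpha N^{\alpha - 1} - \frac{b}{N^{2}} = 0
  \;\Longrightarrow\; N^{\alpha + 1} = \frac{b}{a \alpha} \\
N^{*} &= \left( \frac{b}{a \alpha} \right)^{\frac{1}{\alpha + 1}} \\
\frac{d^{2}P}{dN^{2}} &= a \alpha (\alpha - 1) N^{\alpha - 2} + \frac{2b}{N^{3}} > 0
\end{align*}
```

The positive second derivative confirms a minimum. With $\alpha = 2$ the expression reduces to $N^{*} = (b / 2a)^{1/3}$, the quadratic case.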
Pipelining and Idle Power: Analytical Trend. Clock gating is not always possible: it increases control complexity, and the clock enable signal may have insufficient setup time. Idle power = leakage + flip-flop switching. Its scaling falls between leakage power scaling and flip-flop switching power scaling, depending on the leakage level.
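Pipelining and Idle Power: Analytical Trend. [Figure: two plots of relative idle power versus number of FO4 per stage, N, each marking the optimal FO4 and the optimal power saving.] Leakage scale: O(1/N) on the left-hand side, $O(N^{\alpha})$ with $1 < \alpha < 2$ on the right-hand side. Flip-flop switching scale: O(1/N) on the left-hand side, O(N) on the right-hand side, a linear reduction of flip-flop switching power, since $\frac{1}{N} \cdot V_{dd}^2 \propto N$.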
Simulation Results: Power Components (fixed throughput @ 2 GHz)
Component    Right-hand-side curve        Saving*                  N*
Switching    O(N^2)                       79 (HVT) ~ 82 (LVT) %    6
Leakage      O(N^α) (1<α<2)               70 (LVT) ~ 75 (HVT) %    6
Idle         O(N) or O(N^α) (1<α<2)       55 (HVT) ~ 70 (LVT) %    8
N = number of FO4 inverters per stage. N* = optimal N. Saving* = optimal power saving by pipelining. (Not including flip-flop delay.)
Optimal Power Saving. [Figure: two plots of relative power versus activity factor at 2 GHz. Clock gating: optimal FO4 = 6. No clock gating: optimal FO4 = 6~8. Successive builds of the slide label the switching, leakage, and idle power components and highlight LVT.] *Flip-flop delay not included in optimal FO4.
Discussion. LVT can be fast and power-efficient: it enables lower $V_{dd}$. Flip-flop delay is more important than flip-flop power for power-optimal pipelining.
Limitation of This Work. Factors: super-linear growth of flip-flops, additional memory, reduced glitches, parasitic wire capacitance. Each has an effect on optimal logic depth and on optimal power saving.
Conclusion. Pipelining is an effective low-power tool when used to support voltage scaling in digital systems implementing highly parallel computation. Optimal logic depth: 6-8 FO4 (approximately 8-10 FO4 including flip-flop delay). Optimal power saving: 55~80%, depending on Vth, activity factor (AF), and clock gating. Insights: Pipelining is more effective with high AF, because it is most effective at saving switching power. Pipelining is more effective with lower Vth, except when leakage power is dominant. Pipelining is more effective with clock gating, which reduces flip-flop overhead.
Acknowledgments. Thanks to SCALE group members and anonymous reviewers. Funded by NSF CAREER award CCR-0093354, NSF ITR award CCR-0219545, and a donation from Intel Corporation.
BACKUP SLIDES