Energy Recovering ASIC Design

Conrad H. Ziesler, Joohee Kim, Marios C. Papaefthymiou
Advanced Computer Architecture Laboratory
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor, MI, USA
{cziesler, jooheek, marios}@eecs.umich.edu

Abstract

Dissipation in the clock tree and state elements of ASIC designs is often a significant fraction of total energy consumption. We propose a methodology for recovering most of this energy by using a novel energy recovering flip-flop and a novel single-phase resonant clock generator. Because our state element has near-zero energy consumption when the input data is not switching, it provides the savings of clock gating without the additional complexity of implementing clock gating in the design. To complement this near-zero idle energy property of the flip-flop, our resonant clock generator can decide, on a per-cycle basis, whether or not the resonant clock needs to be replenished on the next cycle, thus automatically reducing energy consumption when most of the state elements are idling. ASICs designed with our methodology can achieve sub-CV² dissipations on the clock network at frequencies of 200-500MHz and operating voltages of 1.0-1.5V in a 0.25 µm process. To evaluate our methodology, we simulated a dual-mode (conventional and energy recovering) ASIC module to directly compare energy savings between the energy recovering and conventional clocking schemes. Our simulations demonstrate savings of over a factor of 4 for the energy recovering mode versus the conventional mode for low switching activities.

I. INTRODUCTION

A popular approach to low-energy, high-throughput VLSI system design is voltage-scaled static CMOS with aggressive pipelining. This approach is often combined with clock gating to reduce the dissipation of idle flip-flops and branches of the clock tree. In these systems, due to the large number of flip-flops and the loading of the clock tree, the dissipation of the clock tree and state elements (flip-flops) can often be a substantial fraction of total system dissipation.

We propose a novel design methodology for energy recovery utilizing our new PMOS energy recovering flip-flop (pterf) and a novel single-phase resonant clock generator. Our single-phase approach has several advantages over other energy recovering methodologies, including simple clock generation and distribution, no need for phase balancing, skew tolerance, single-inductor tuning, and low transistor count. Effectively, our energy recovering technique enables a substantial energy reduction with minimal designer effort.

Our new energy recovering flip-flop has several properties that make it well suited for deeply pipelined low-voltage static CMOS systems. First, our flip-flop exhibits near-zero energy consumption while idle (i.e., when the data input switching activity is zero). This property eliminates the need for clock-gating logic, yielding savings similar to fine-grained clock gating at every flip-flop in the design. With constant input data, our flip-flop dissipation is a remarkable 4.9fJ/cycle at 500MHz/1.5V, or 1.8fJ/cycle at 200MHz/1.0V, in a 0.25 µm process. Second, the energy consumption of our flip-flop for unit switching activity is 75fJ/cycle at 200MHz, making it competitive with other low-energy, high-speed flip-flops [1], [2], [3]. Third, our flip-flop is very compact, containing only 14 transistors in its minimal configuration, or 18 transistors with an embedded two-input XOR gate, for example. Furthermore, our flip-flop is capable of operating with several different energy recovering power-clock waveforms. An energy recovering power-clock enables the entire clock tree to operate at sub-CV² dissipation levels when driven by a resonant clock generator.

Fig. 1. Example system configuration and clock waveform for (a) energy recovery, and (b) no energy recovery.
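To make the idle and active figures quoted above concrete, the following sketch estimates the average per-cycle energy of one flip-flop at an intermediate switching activity. The linear weighting between the idle and active numbers is an assumption made here for illustration, not a model stated by the authors. The function names are hypothetical, and the defaults read the low-frequency operating point as 200 MHz with 1.8 fJ idle and 75 fJ active energy per cycle.

```python
def avg_energy_fj(activity, e_idle_fj=1.8, e_active_fj=75.0):
    """Activity-weighted energy per cycle in fJ.

    Assumption (not from the paper): energy scales linearly between the
    idle (activity 0) and active (activity 1) figures.
    """
    if not 0.0 <= activity <= 1.0:
        raise ValueError("activity must lie in [0, 1]")
    return (1.0 - activity) * e_idle_fj + activity * e_active_fj


def avg_power_uw(activity, freq_mhz=200.0, e_idle_fj=1.8, e_active_fj=75.0):
    """Average power in microwatts.

    fJ/cycle times MHz gives nW, so divide by 1000 to obtain uW.
    """
    energy = avg_energy_fj(activity, e_idle_fj, e_active_fj)
    return energy * freq_mhz / 1000.0


if __name__ == "__main__":
    for a in (0.0, 0.1, 0.5, 1.0):
        print(f"activity={a:.1f}: {avg_energy_fj(a):6.2f} fJ/cycle, "
              f"{avg_power_uw(a):7.3f} uW")
```

Under this assumed model, a flip-flop whose input toggles in one cycle out of ten would average roughly 9 fJ per cycle, which shows how strongly the idle figure dominates at low activity.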
The substantial reduction in dissipation achieved by our flip-flop is due to energy recovery. The basic premise of energy recovery is to recycle the charge stored in circuit capacitance using a power-clock signal in conjunction with an inductor or stepwise capacitor driver. Figure 1 contrasts typical energy recovering and non-energy recovering synchronous systems. Several energy recovering clock generator options are available for our flip-flop, such as blip [4], [5] and sinusoid [6]. More advanced energy recovering clock driving techniques, such as harmonic rail drivers, are discussed in [7]. Other recent advances in the energy recovery literature are the clock-powered AC-x series microprocessors [8] and the source coupled adiabatic logic family [9].

Our single-phase resonant clock generator has several features that make it well suited to drive systems of our pterf flip-flops. First, the power topology of our clock generator enables the large resonant currents to bypass the main power switch, allowing the switch to be much smaller than in topologies where the entire current is conducted by the switch. Second, the gate drive of the main switch uses an efficient dynamic circuit that optimizes the turn-on and turn-off rates to minimize dissipation. Third, a compact and fast control circuit enables the clock generator to decide, on a cycle-by-cycle basis, whether or not to replenish the resonating power-clock energy. This capability allows the clock generator to maintain a stable power-clock amplitude while consuming as little energy as possible.

To evaluate our energy recovering design methodology, we re-implemented an existing conventional ASIC design using our energy recovering flip-flop and resonant clock generator. To enable direct comparisons and to simplify testing, we combined a conventional static CMOS flip-flop with our pterf flip-flop and a multiplexer to select between the conventional and energy recovering flip-flops. A conventional buffered clock tree was used for the static CMOS flip-flops, while a wide metal distribution network driven by the resonant power-clock generator was used for the pterf flip-flops. Our simulation results for this voltage-scaled system indicate greater than 4-fold savings for low switching activity (when the state elements dominate dissipation) and a 2% savings for high switching activity (when the combinational logic dominates dissipation).

The remainder of this paper is organized as follows: Section II describes the structure, operation, timing, and energy characteristics of our energy recovering flip-flop. Section III describes the structure and operation of our resonant clock generator. Section IV explains our test ASIC implementation and gives our simulation results. We describe ongoing work in Section V.

II. PTERF FLIP-FLOP

This section describes our novel single-phase, energy recovering flip-flop. Key properties of our flip-flop include near-zero dissipation when the input data is held constant, low overall dissipation when the input is changing, low-voltage operation, compact layout, and a D-Q delay which is inversely proportional to frequency.

A. Structure

The energy recovering flip-flop we describe in this paper consists of an energy recovering dynamic buffer driving a pair of cross-coupled NOR gates as the static latch element. Figure 2 shows the schematic of the PMOS version, which latches on rising pulses of the power-clock node. The power-clock node supplies both power and timing information to the circuit, in contrast to conventional clock nodes, which supply only timing information. Note that correct operation depends on the ratioing of the pull-down NMOS and the pull-up cross-coupled PMOS.

Fig. 2. Schematic of PMOS energy recovering flip-flop (pterf).

Fig. 3. Schematic of NMOS energy recovering flip-flop (nterf).

Fig. 4. Schematic of PMOS energy recovering flip-flop with embedded AND gate (pterfand).

Alternatively, we can derive the complement version using NMOS in place of PMOS devices, as seen in Figure 3. The resulting circuit latches on falling pulses of the power-clock, using cross-coupled NAND gates as the state element. Another interesting option is to replace the two pull-down NFETs in pterf with a pull-down tree that computes some given logic function. Figure 4 shows a flip-flop with an embedded dual-rail AND gate. A low overhead reset may be obtained by adding an extra reset transistor to either or.

B. Operation

As shown in Figure 5, the operation of pterf begins with the data input changing at a suitable time before the rising edge of the power-clock. The two inverters buffer the input and derive its complement, and these signals are applied to the gates of the NMOS pull-down transistors. When the rising edge of the power-clock arrives, the cross-coupled PMOS devices sense and latch the appropriate value of the data input onto the nodes X and Y. Since the cross-coupled NOR gates form a simple set/reset latch, positive pulses on X or Y will cause the latch to set or reset, respectively. When the data input is not changing, either X or Y will remain low, with the other node oscillating in phase with the power-clock in an energy recovering manner, that is, transferring charge to/from the power-clock signal.
This charge-recycling probe operation is the key to the ultra-low energy consumption at zero input switching activity. Changes in the data input that occur while the power-clock is low have no effect on the output, since the transitions on X or Y are monotonically decreasing. Changes in the data input that occur while the power-clock is high have no effect on the output because of the ratioing between the PMOS and NMOS. This case is rare, however, and could only occur at low frequencies with high and a very short combinational logic path.

Fig. 5. Operational waveforms for pterf. (Two panels, voltage in V versus time in ns: "pterf operation", showing Pclk, and "pterf operation (internal signals)".)

C. Timing and Energy

TABLE I
TIMING AND ENERGY VARIABLES

          Energy for or transitions
          Energy for or transitions
Ts        Time before the reference time at which the data input must be valid
Th        Time after the reference time at which the data input may change
Tq        Time after the reference time at which the output becomes valid
Tcycle    Cycle time

Fig. 6. Timing diagram of pterf using sinusoidal clock.

Figure 6 shows the timing definitions used for our timing analysis. The pterf flip-flop samples its input on the rising edge of the power-clock, so we define the midpoint crossing of that rising edge as the reference time and define the usual timing variables as indicated in Table I. In addition, we define two average energy numbers, the active and idle energy consumption, which refer to an input switching activity of 1 and 0, respectively.

Fig. 7. Circuit for timing simulations.

Figure 7 shows the test circuit we used to measure the timing and energy values for pterf. The inputs and outputs are buffered/loaded with typical inverters. We measure the timing values as the midpoint crossings of the indicated data input, power-clock, and output signals. There are many different ways to account for the energy consumption of a flip-flop: considering dissipation of the input and output loads, considering the internal dissipation, including or excluding complementary outputs, or assuming complementary inputs. There have been many debates as to which accounting scheme is most realistic. For simplicity, we measure the energy dissipation of the circuits within the dashed line, minus the energy required to drive the primary input.

Fig. 8. pterf D-Q delay (Ts+Tq). (Plot of Tdq in ps at the operating points 200MHz/1.0V, 333MHz/1.2V, and 500MHz/1.5V.)

Figure 8 plots the total flip-flop delay as a function of operating frequency. By total flip-flop delay we mean the setup time plus the clock-to-output time. At 200MHz and 1.0V, pterf requires 1,28ps, while at 500MHz and 1.5V, it requires 57ps. At 500MHz, 1.8V (not shown on the graph), pterf requires only 46ps. Notice the trend that the flip-flop delay decreases with increasing frequency. This behavior is due to the sinusoidal shape of the energy recovering waveform. At all frequencies, pterf consumes roughly a quarter of the total clock period. Increasing the voltage reduces this fraction slightly. For comparison, one fan-out-of-four delay using 2-input NAND gates in this technology is 21ps at 1.5V and 53ps at 1.0V. So, pterf consumes roughly 2-3 fan-out-of-four delays out of the 9-10 available per cycle at these low voltages.

Fig. 9. Active and idle energy consumption of pterf as function of frequency. Active energy consumption varies with hold times. (Plot of energy per operation in fJ at the operating points 200MHz/1.0V, 333MHz/1.2V, and 500MHz/1.5V, with idle energies annotated as 1.7fJ, 2.6fJ, and 4.9fJ.)

Figure 9 shows the energy consumption of pterf as a function of operating frequency for two different input data conditions, idle (never switching) and active (always switching). The idle energy consumption is near zero at all frequencies, with a dissipation of 1.7fJ at 200MHz and 4.89fJ at 500MHz. The active energy consumption was measured both at the minimum hold time (for the case of cascaded flip-flops) and at a nominal hold time (for the case of logic between flip-flops). For all frequencies, the lower point on the error bar indicates the energy dissipation at the nominal hold time.

III. POWER-CLOCK GENERATOR

The generator converts energy from the DC supplies into AC energy using a lumped inductor and an on-chip NMOS switch. The power-clock signal is distributed to all of the flip-flops in the ASIC core. By using a single sinusoidal waveform, the clock distribution problem is drastically simplified. Furthermore, the entire capacitance of the clock distribution wires resonates with the generator inductor, thus eliminating the dissipation of a conventional clock tree.

Fig. 10. Block diagram of power-clock generator.

The generator is composed of a control circuit, the large NMOS power transistor and associated drive circuitry, and a lumped inductor connected to a DC supply which is half that of the logic supply, as shown in Figure 10. Synchronization occurs with an input reference square-wave clock, which is fed into 3 delay lines that generate the appropriate timing signals for the control circuit. The controller compares the peak value of the power-clock with a reference voltage. For each cycle, it decides whether or not the inductor current needs replenishing. The output of the controller is a pulse which is buffered and inverted by two ratioed inverters before connecting to the gates of a PMOS pull-up and an NMOS pull-down. These two devices drive the gate terminal of the main NMOS power switch. The sizes of the transistors in the ratioed inverters are chosen so that the pull-up and pull-down are never on at the same time. In addition, the main switch is turned on slowly, but turned off quickly, thus minimizing dissipation due to the PMOS pull-up capacitance. The main NMOS power switch is turned on at the time when the voltage difference between the power-clock and ground is small, replenishing the current in the inductor from the DC supply.

Fig. 11. Single cycle controller.

The single cycle controller is built around a two-stage clocked comparator circuit connected to a set-reset latch, as shown in Figure 11. A low-to-high transition on d3 causes the difference between the power-clock and the reference voltage to be amplified by the cross-coupled inverters. The result of this comparison toggles the set-reset latch. The phase difference between d1 and d2 is used to generate a pulse which is gated by the current state of the set-reset latch and fed to the output. Thus the controller efficiently implements single cycle feedback control.

Fig. 12. Power-clock generator operation. (Plot of voltage versus time in ns, with traces including vdd and g.)

Figure 12 shows typical operational waveforms of the clock generator. The g signal is the gate drive of the main power switch. The Pclk signal is the power-clock signal. At 100ns, the load connected to the power-clock undergoes a step increase in dissipation. As a result, the clock generator, which was operating under a 2-on, 2-off periodicity, switches behavior to full-on. In this particular case, the power-clock generator circuits are running at 1.0V while the power-clock is being driven to 1.5V. At 1.0V and 333MHz, the power-clock generator has a two-cycle control latency rather than the desired single cycle.

IV. ASIC DESIGN EXAMPLE

To evaluate our new energy recovery ASIC design methodology, we implemented an existing ASIC design using our pterf flip-flop and resonant clock generator. The module implements a multilevel discrete wavelet transform, used as the first stage in a neural signal processing chip. It consists of two pipelined multipliers, several pipelined adders, a FIFO, and some control circuits. The original design was done as part of a VLSI course project and used a standard-cell synthesis, place-and-route synchronous design methodology, targeting 6MHz in a 0.18 µm technology at 2.5 volts. For fabrication reasons, we targeted a 0.25 µm technology available through MOSIS, choosing to aggressively voltage scale the design while trying to meet a 333MHz throughput goal.

Fig. 13. Schematic of conventional flip-flop used for comparisons.

We took the Verilog sources from the course project and reduced the bitwidth of the datapath in order to accommodate our limited HSPICE simulation facilities. We resynthesized the entire modified design for our custom 0.25 µm standard cell library, which includes our energy recovering flip-flop. Our custom library contains only a small set of gates and transistor sizes, so the synthesis results are by no means optimal. After synthesis, we manually replaced each flip-flop with a dual flip-flop multiplexer construct that includes a conventional flip-flop and our pterf flip-flop.
By changing a global select ff signal, we could switch the design between conventional and energy recovering state elements, thus affording direct energy consumption comparisons. A schematic of the conventional flip-flop we used is shown in Figure 13.

TABLE II
TEST ASIC MODULE, PRE-LAYOUT ESTIMATES

gates     flip-flops    pre-layout Tcritical @ 1.2V
3,897     387           1,974ps

TABLE III
TEST ASIC MODULE, POST-LAYOUT ESTIMATES

                            1.3V       1.4V       1.5V
Tcritical Logic             2,33ps     2,62ps     1,941ps
Tdq Energy Recovery ff      753ps      74ps       654ps
Tdq Conventional ff         817ps      731ps      63ps

Fig. 14. Layout plot of test ASIC module and power-clock generator.

Our final structural netlist was placed and routed using PLACE and WROUTE. We placed Vdd, Ground, and Power-Clock distribution wires in repeated stripes on the top metal layer over the entire ASIC core. WROUTE then routed each gate connection to the nearest stripe. The ends of the stripes were connected by a wide distribution bus tied to several bond pads along with the clock generator. For a much larger design, the stripe ends would connect to the arms of a global H-tree network. As this design was only approximately 4,000 gates, an H-tree was unnecessary in this case. Figure 14 is a layout plot of the combined ASIC module and power-clock generator.

Table II summarizes the given test ASIC module. While the design is relatively small, its complexity is representative of much larger designs. Our test ASIC module is composed of 22 synthesizable Verilog files comprising over 7,000 lines of code.

Table III summarizes the post-layout timing estimates of the test ASIC module. These results are estimates derived from analysis of HSPICE simulation traces on netlists with post-layout extracted capacitances. At 1.3V, both the conventional and energy recovering flip-flops fail to meet timing for the target 3ns clock period, due to the long delay through the critical path in the logic.
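The timing check behind this statement can be sketched as follows. It assumes, as a simplification made here, that a design meets timing when the critical-path logic delay plus the total flip-flop delay Tdq fits within one clock period. The function names are hypothetical. The first example call uses the 1.5V energy recovering entries of Table III with the 3ns target period, and the second uses made-up delays purely to show a failing case.

```python
def timing_slack_ps(t_cycle_ps, t_logic_ps, t_dq_ps):
    """Slack in ps: clock period minus (logic delay + flip-flop D-Q delay)."""
    return t_cycle_ps - (t_logic_ps + t_dq_ps)


def meets_timing(t_cycle_ps, t_logic_ps, t_dq_ps):
    """True when the path fits in one cycle under the assumed check."""
    return timing_slack_ps(t_cycle_ps, t_logic_ps, t_dq_ps) >= 0


if __name__ == "__main__":
    # 1.5V, energy recovering flip-flop: values from Table III, 3ns target.
    print(timing_slack_ps(3000, 1941, 654))   # positive slack
    print(meets_timing(3000, 1941, 654))

    # Hypothetical delays (not from the paper) illustrating a failing path.
    print(meets_timing(3000, 2400, 753))
```

This sketch leaves out the delay of the test multiplexer, so it describes the design without the dual flip-flop construct.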
The voltage-scaling limit for this design would therefore be 1.4V, as determined by the amount of pipelining in the design. However, since we added an additional mux to select between the energy recovering and standard flip-flops for testing purposes, the actual minimum voltage limit is 1.5V. Notice that the conventional flip-flop is faster than pterf at high voltages, but slower at low voltages. This trend arises because the conventional flip-flop speeds up with increasing voltage, while the pterf delay is dominated by the rise time of the power-clock, so only slight reductions with increasing voltage are possible.

TABLE IV
ENERGY RECOVERING VS. CONVENTIONAL MODE @ 333MHz, 1.5V

mode               low activity    high activity    Average
Logic Only         0.9pJ           55.28pJ          28.1pJ
Energy Recovery    6.74pJ          68.47pJ          37.6pJ
Conventional       29.72pJ         78.28pJ          54.0pJ

We simulated our ASIC module from a post-layout extracted netlist using HSPICE on a Sun Blade 1. The minimum voltage for correct operation was 1.5V in both the energy recovering and conventional modes. Simulations took approximately 12MB of RAM each and 6 hours worth of computing time. In each mode, we simulated the ASIC module for 2ns with two regions of switching activity. The period immediately after reset has several cycles with low switching activity, before the module begins its self-test mode. The period towards the end of the 2ns of simulated time is representative of typical high activity, as the ASIC module is undergoing a pseudo-random self-test. Our simulation measurements represent the total energy drawn from the DC supplies, per cycle, averaged over several cycles.

Our primary findings are summarized in Table IV. We simulated the ASIC module in both conventional and energy recovering modes at a frequency of 333MHz and a 1.5V supply. In addition, we separate out the losses in the combinational logic. At low activities, the total dissipation represents primarily the losses due to the clock, as the logic is not switching. At higher activities, the total dissipation is dominated by the logic dissipation. Notice that the combined energy recovering mode with clock generator dissipates less than the conventional mode in all cases. The total energy recovery dissipation (including clock generator) is only 23% of the conventional case at low activity and is only 88% of the conventional case at typical activity.

V. CONCLUSION

We have presented a novel single-phase energy recovering methodology for low-voltage ASIC design. Our methodology complements voltage-scaling approaches by enabling further energy savings once the minimum voltage for a given throughput has been reached. A key benefit of our methodology for ASIC designers is the minimal designer overhead needed to implement it. All steps in transforming an existing ASIC design to be energy recovering have been automated. In contrast, clock gating requires significant designer effort to implement correctly.

Our methodology uses a single-phase sinusoidal energy recovering flip-flop and a single-cycle feedback control resonant power-clock generator. These two components are specifically designed to complement each other for low energy consumption. First, the energy recovering flip-flop, driven by a sinusoidal waveform, has near-zero dissipation while the input data is constant. This waveform is easily and efficiently distributed over the chip using wide top metal wires, since all of the charge stored in the wiring capacitance is recovered by the power-clock generator. The power-clock generator, in turn, has a topology that enables all of the resonant currents to bypass the main power switch, thus enabling it to drive large capacitive loads with low dissipation. In addition, the power-clock generator enables its main power switch only when the resonant system actually needs additional energy, thus effectively idling when the majority of the flip-flops are not switching.

We applied our methodology to an existing ASIC design, using a completely automated implementation of the energy recovery feature. We compared, in simulation using post-layout extracted parasitics, a conventional implementation with our energy recovering implementation. The results of this comparison debunk widely held misperceptions about energy recovery (a.k.a. adiabatic) circuits. First, energy recovery need not be slow to be efficient.
Our flip-flop and power-clock generator perform efficiently at speeds of 200-500MHz in a 0.25 µm process. Second, energy recovery complements, rather than competes with, voltage scaling. Our flip-flop works efficiently at voltages down to 1.0V while still running at 200MHz. And finally, energy recovery need not be complex and difficult to design. Our flip-flop contains only 14 transistors, while the clock generator uses less than 1. More importantly, our technique allows for full automation of implementing energy recovery with an existing ASIC design using simple custom tools in addition to industry standard synthesis, placement, and routing tools. We are currently awaiting fabrication to validate our energy recovering methodology on silicon.

VI. ACKNOWLEDGMENTS

This research was supported in part by the US Army Research Office under AASERT Grant No. AAG55-97-1-25 and Grant No. AA19-99-1-34.

REFERENCES

[1] B. S. Kong, S. S. Kim, and Y. H. Jun, "Conditional-capture flip-flop for statistical power reduction," IEEE Journal of Solid-State Circuits, vol. 36, no. 8, pp. 1263-1271, Aug. 2001.
[2] V. Stojanovic and V. G. Oklobdzija, "Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems," IEEE Journal of Solid-State Circuits, vol. SC-34, no. 4, pp. 536-548, Apr. 1999.
[3] J. Yuan and C. Svensson, "New single-clock CMOS latches and flip-flops with improved speed and power savings," IEEE Journal of Solid-State Circuits, vol. SC-32, no. 1, pp. 62-69, Jan. 1997.
[4] W. C. Athas, L. J. Svensson, J. G. Koller, N. Tzartzanis, and Y. Chou, "Low-power digital systems based on adiabatic-switching principles," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 398-406, Dec. 1994.
[5] D. Maksimovic, V. G. Oklobdzija, B. Nikolic, and K. W. Current, "Clocked CMOS adiabatic logic with integrated single-phase power-clock supply," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 4, pp. 460-463, Aug. 2000.
[6] C. H. Ziesler, S. Kim, and M. C. Papaefthymiou, "A power-clock generator for true single-phase adiabatic logic," in Proceedings of International Symposium on Low-Power Electronics and Design, Aug. 2001, pp. 159-164.
[7] J. S. Moon, W. C. Athas, and P. A. Beerel, "Theory and practical implementation of harmonic resonant rail drivers," in Proceedings of the International Symposium of Low-Power Electronics and Design, Aug. 2001, pp. 153-158.
[8] W. Athas, N. Tzartzanis, W. Mao, L. Peterson, R. Lal, K. Chong, J. S. Moon, L. Svensson, and M. Bolotski, "The design and implementation of a low-power clock-powered microprocessor," IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp. 1561-1570, Nov. 2000.
[9] S. Kim, C. H. Ziesler, and M. C. Papaefthymiou, "A true single-phase adiabatic multiplier," in Proceedings of 38th Design Automation Conference, June 2001, pp. 758-763.