An Adaptive Technique for Reducing Leakage and Dynamic Power in Register Files and Reorder Buffers

Shadi T. Khasawneh and Kanad Ghose
Department of Computer Science, State University of New York, Binghamton, NY 13902-6000, USA
{shadi, ghose}@cs.binghamton.edu

Abstract. Contemporary superscalar processors, designed with a one-size-fits-all philosophy, grossly overcommit significant portions of datapath resources that remain unnecessarily activated in the course of program execution. We present a simple scheme for selectively activating regions within the register file and the reorder buffer for reducing leakage as well as dynamic power dissipation. Our techniques result in power savings in excess of 60% in these components, on the average with no performance loss.

1 Introduction

Modern superscalar processors exploit instruction level parallelism by using a variety of techniques to support out-of-order execution, along with the dispatching of multiple instructions per cycle. Register renaming is at the core of all such techniques. In register renaming, a new physical register is allocated for every instruction that produces a result into a register to maintain the true data dependencies. The allocation of the register takes place as instructions are fetched in program order, decoded and dispatched into an issue queue, regardless of the availability of the source register values. As register values are generated, instructions waiting in the issue queue are notified. An instruction is ready to start execution or issue when all of its source operands are available (or are soon to become available). Such instructions are issued to the execution units and read their source operands from the physical register file and/or from a bypass network. In high end microprocessors, to support a large instruction window for the sake of performance, the physical register files typically have a large number of registers, often as many as 80 to 128 integer or floating point registers.
As performance demands continue to grow, the number of such registers is likely to go up. Furthermore, in an S-way superscalar processor that can issue S instructions per cycle, the integer or floating point register file (RF) needs to accommodate at least 2S read ports and S write ports. Moreover, contemporary register file structures are heavily accessed. All of these factors make the register file a significant source of power dissipation and a localized hot spot within the die. The hot spot, a localized high-temperature area within the die, results from the large power dissipation within the relatively small area of the register file.

As device sizes continue to shrink, leakage power becomes a significant part, often as much as 50%, of the total power dissipation within a CMOS device [7, 3]. Furthermore, as leakage power increases dramatically with temperature, hot spots such as the physical register file are likely to become hotter with an increase in the leakage dissipation. This is essentially a positive feedback system that can cause thermal runaway, leading to temperature increases that can cause the devices to malfunction as the junction temperatures exceed safe operating limits [3].

The reorder buffer (ROB), a large, multi-ported, register-file-like structure, also dissipates a significant amount of leakage power. The techniques presented in this paper emphasize a simple implementation and have no impact on performance.

2 Relevant Circuit/Organizational Techniques

The circuit techniques employed by our scheme for reducing static power dissipation in the bitcells used within the register file (RF) and the reorder buffer (ROB) borrow from the drowsy cache design of [4]. A drowsy cache design effectively switches the supply voltage to a bitcell between a lower (drowsy) level, which preserves the contents of the bitcell but reduces leakage, and a normal (higher) operating level. To access the contents of a bitcell, the supply voltage has to be raised to the normal level. The transition time between the drowsy state and the normal state can be limited to a single cycle [4]. We extend this mechanism by adding a second, lower supply level (close to zero) at which the bitcell loses its contents but can be switched back to the normal mode in two to three cycles. Leakage power is reduced further in this mode (the deep-sleep mode) compared to the drowsy state.
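The three supply levels described above can be summarized as a small state table. The following is only an illustrative sketch: the names `ZoneState`, `WAKEUP_CYCLES`, `PRESERVES_CONTENTS` and `wakeup_latency` are hypothetical, and the deep-sleep wake-up time is taken as the upper end (3 cycles) of the two-to-three cycle range quoted in the text.

```python
from enum import Enum


class ZoneState(Enum):
    ACTIVE = "active"          # normal supply level, contents accessible
    DROWSY = "drowsy"          # reduced supply level, contents preserved
    DEEP_SLEEP = "deep_sleep"  # supply close to zero, contents lost


# Cycles needed to return to ACTIVE: one cycle from the drowsy state,
# two to three from deep sleep (3 is assumed here as the upper bound).
WAKEUP_CYCLES = {
    ZoneState.ACTIVE: 0,
    ZoneState.DROWSY: 1,
    ZoneState.DEEP_SLEEP: 3,
}

# Whether the bitcells retain their contents in each state.
PRESERVES_CONTENTS = {
    ZoneState.ACTIVE: True,
    ZoneState.DROWSY: True,
    ZoneState.DEEP_SLEEP: False,
}


def wakeup_latency(state: ZoneState) -> int:
    """Return the assumed number of cycles needed to reach ACTIVE."""
    return WAKEUP_CYCLES[state]
```

The table makes the trade-off explicit: the deeper the low-power state, the longer the wake-up and the weaker the guarantee on stored data.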
For the simulated data on the 0.07 micron CMOS process used in our studies, the normal operating voltage was 1.0 volts, the state-preserving drowsy supply voltage was assumed to be 0.3 volts and the ultra-low leakage, non-state-preserving supply voltage was 0.15 volts. (The technology data was scaled from a 0.13 micron process using the linear scaling approach taken in [5]. We also used a modified version of the e-cacti tool [5] to compute leakage and dynamic dissipation in the bitcell arrays and other components.) Leakage within other components of the RF and the ROB (address decoders, drivers, prechargers, sense amps, and control circuitry for reads and writes) is assumed to be minimized using devices with two different threshold voltages. The e-cacti tool also models these dissipations.

Fig. 1. A register file with four zones (diagram labels: zone address decoder/zone activation control; one decoder per zone; Zones 0 to 3; super bit lines/through buses; sense amps/drivers/prechargers; I/O drivers for RF)

For the purpose of our coarse-grained leakage reduction technique, we re-organize the RF and ROB from their usual monolithic design to a segmented design with multiple zones, as shown in Figure 1 (for a RF). Registers within the monolithic RF structure are broken down into 4 groups, or zones, in Figure 1, each zone having its own drivers, decoders, sense amps and local control logic. Each zone contains the same number of contiguous registers. A register address is broken down into two parts: a zone address and a zone-local register address. The zone address is decoded using a decoder external to the zones, as shown. This decoder also includes additional logic to control the state of all the zones. A single zone can be in one of the following states: normal or active, content-preserving standby, and volatile or deep-sleep. This structure permits zones to be kept in different states to minimize the overall leakage power.

The zoned structure also reduces dynamic dissipation, as one set of super bit lines and through buses is used by each active and accessed zone; inactive zones do not load up these buses. (In a monolithic implementation, bitcells in all registers load up the common bit lines.) Further reduction in dynamic power occurs through the use of smaller prechargers, sense amps and buffers within each zone: these components are activated only within a zone that is accessed. If the number of registers within a zone is small, one can altogether dispense with the sense amps for the zones. The reorder buffer can be segmented into zoned structures in a similar fashion.

3 Reducing Leakage in the Register File

The technique proposed for reducing leakage dissipation in the register file exploits the following observations:

1. A significant number of cycles elapse between the time of register allocation (at instruction dispatch) and the time a result is written into the register. As the register does not hold any valid information during this period, it can be kept in a deep-sleep mode to avoid leakage dissipation.

2. After a register has been written and read out by instructions that consume its value, a significant number of cycles elapse before the register is deallocated.
However, in this period the register contains a potentially useful value. In this case, we reduce leakage dissipation by keeping the register in a standby mode that preserves its contents but also reduces the leakage.

3.1 Activating RF Zones Dynamically

In this scheme, called the on-off scheme, zones within a register file are either in an active state or in the (volatile) deep-sleep state. Initially, all zones are turned off. With the use of register renaming, a new physical register (or two for multiple precision instructions) is allocated for each instruction being dispatched. The allocation of the new register is done in the decode stage. However, the first access to the register is made when the dispatched instruction writes the computed result into this register from the writeback stage. To reduce RF leakage power, we thus attempt to allocate the destination registers at the time of instruction dispatch within a zone that is already active, to minimize any overhead associated with turning on an inactive zone. If an active zone is not available for an allocation, one (or more) zone(s) in the deep-sleep state are used for the allocation and these zones are then activated. Once activated, a zone remains in that state till it is completely free, i.e., till it can be deallocated. The two cycle delay in activating the zone has no consequence on performance, as the dispatched instructions do not produce a result into their destination for at least two cycles following the dispatch (the time needed to reach the writeback stage).

A 1-bit wide auxiliary register file is maintained, with a single entry for each zone to indicate the status of that zone (active or in the deep-sleep mode). The logic for looking up a free register is adapted with very little change to permit us to make allocations preferentially within a targeted zone. Our studies show that the policy of allocating a new register to a zone has little impact on the overall power savings. We therefore use a policy that is easy to implement in hardware: registers are allocated within the first active (FA) zone that is found in the free register list. If all active zones are full, or none are active, then the first inactive zone (in the deep-sleep state) is activated.

3.2 Putting RF Zones into the Standby Mode

The main idea in this scheme, called the standby scheme, is to put all zones in standby mode and activate zones on an as-needed basis. Registers for destinations are allocated at the time of instruction dispatch within zones that are kept in the standby mode till the first access to that zone is needed, that is, when the first of the dispatched instructions issues and writes its result into the zone. To reduce leakage within a zone containing valid data, we keep the zone activated for only a small number of cycles, say M cycles, before we revert it back to the standby mode. (We have used values of M = 2 and M = 3 in our studies.) This is done by using a small 2-bit counter with each zone; these counters are part of the status array that holds the status of each zone.
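The first-active (FA) allocation policy of Section 3.1 can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware free-list logic: the function names `allocate_first_active` and `release`, the sorted Python list standing in for the free register list, and the use of integer division to extract the zone address (justified only by zones holding equal numbers of contiguous registers) are all hypothetical.

```python
import bisect


def allocate_first_active(free_regs, zone_active, regs_per_zone):
    """Allocate one register, preferring zones that are already active.

    free_regs: sorted list of free physical register numbers (mutated).
    zone_active: list of booleans, one per zone (mutated).
    Returns (register, activated_zone); activated_zone is None when no
    zone had to be woken up, and register is None when nothing is free.
    """
    # First choice: the first free register that lies in an active zone.
    for reg in free_regs:
        if zone_active[reg // regs_per_zone]:
            free_regs.remove(reg)
            return reg, None
    if not free_regs:
        return None, None
    # Otherwise wake the first inactive zone that has a free register.
    reg = free_regs.pop(0)
    zone = reg // regs_per_zone
    zone_active[zone] = True
    return reg, zone


def release(reg, free_regs, zone_active, regs_per_zone):
    """Free a register; turn its zone off once the zone is completely free."""
    bisect.insort(free_regs, reg)
    zone = reg // regs_per_zone
    first = zone * regs_per_zone
    if all(r in free_regs for r in range(first, first + regs_per_zone)):
        zone_active[zone] = False


# Example: 8 registers in 4 zones of 2, all zones initially off.
free = list(range(8))
active = [False] * 4
a = allocate_first_active(free, active, 2)  # wakes zone 0
b = allocate_first_active(free, active, 2)  # reuses zone 0
c = allocate_first_active(free, active, 2)  # zone 0 full, wakes zone 1
```

In the example, the second allocation lands in the zone woken by the first, and a new zone is woken only once the active zone has no free register left.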
Performance penalties are avoided in this scheme by making simple modifications to two pipeline stages. First, the writeback stage needs to be able to identify the zone being written to by each instruction one (or a few) cycles before the actual writeback takes place. Such a requirement is not unusual and is routinely implemented in high-clock-rate superscalar machines, where the destination address (wakeup tag) is broadcast to the issue queue one (or a few) cycles before the result is actually needed. The only change needed in the writeback stage is to have it look up the status of the target zone from a status array (similar to the scheme described in Section 3.1) and activate that zone before the result is written to it in a later cycle. For the zone sizes used in this study, it takes just one cycle to change the state of a zone from standby to active; thus the transition time can be completely hidden with no impact on performance.

The second simple modification is to the issue logic. The issue logic needs to identify instructions that need to read the register file to access one or more source operands. (These are ready instructions that could not be selected for issue in the cycle following the broadcast of the tag that woke up the instruction.) As such instructions are selected for issue, the selection logic reads the status of the zones that contain registers to be read and activates them if they are in the standby mode. If such zones are already active and are to remain active for an additional cycle, no additional steps are needed. If the zone is found to be active for just the current cycle, then the zone's associated counter is reset to M to guarantee that the zone remains active till the cycle where the source operands are read out. Doing this ensures that a zone remains in an active state when back-to-back requests to access the zone happen to occur. Switching glitches caused by frequent switching between the standby state and the active state are thus avoided when requests to access a zone are clustered over an interval that exceeds M.

Fig. 2. Timing associated with instruction issue and zone activation (diagram labels, over Cycle 1 and Cycle 2: tag broadcast; wakeup and selection request propagation; selection and propagation of grant signal to IQ entry; instruction read out from IQ; source register address decoded; delay of word line driver; in parallel: read out zone address from source register bank; status array lookup and activation delay)

The one cycle delay in transitioning a zone from the standby mode to the active mode, to allow an issuing instruction to read source register(s) from the zone, is effectively hidden by overlapping this transition with the one cycle needed to move the instruction to the execution unit. This is possible for the following reasons. As soon as the selection logic grants the request for a ready instruction to issue, it starts activating the required zone from the standby state to the active state. This is possible because the zone addresses of the source registers are kept in a dedicated RAM bank, adjacent to the issue queue (IQ) entries; the remaining parts of the register addresses are within the IQ entries. As the grant signal comes down the selection tree and the selected instruction is read out on the IQ port and moves to the execution unit, the issued instruction presents the source register addresses to the register file and the register addresses are decoded.
In parallel with all of these events, and starting with the propagation of the grant signal down the selection tree, the narrow bank containing the zone addresses is read out and the required zone is activated if needed, requiring an additional cycle (Figure 2). Thus, by the time the word lines for bitcells in the RF are to be driven, these bitcells are already activated. Consequently, the one cycle needed for activating a zone is effectively hidden and there is no impact on performance. We are assuming a contemporary issue mechanism where wakeup, selection and issue are spread over two clock cycles.

The standby scheme provides more savings than the on-off scheme, as shown in Section 5. Finally, we discuss a hybrid scheme where the on-off and standby schemes are combined by putting any unused zone into the off (deep-sleep) mode.
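The per-zone counter behaviour described for the standby scheme can be illustrated with a small behavioural model. This is a sketch under assumptions, not the authors' circuit: the class name `StandbyZone` and its methods `request` and `tick` are hypothetical, a counter value of 0 is assumed to encode the standby state, and the overlap of the one-cycle wake-up with the issue pipeline is not modelled.

```python
class StandbyZone:
    """Behavioural model of one zone under the standby scheme."""

    def __init__(self, m=3):
        # m is the activation period M; M = 2 or 3 fits in a 2-bit counter.
        self.m = m
        self.counter = 0  # assumed encoding: 0 means standby

    @property
    def active(self):
        return self.counter > 0

    def request(self):
        """Handle an access request; return True if the zone was woken up."""
        woke_up = self.counter == 0
        # Reload the counter if the zone is in standby or would be active
        # for only the current cycle; otherwise no action is needed.
        if self.counter <= 1:
            self.counter = self.m
        return woke_up

    def tick(self):
        """Advance one cycle; the zone reverts to standby at zero."""
        if self.counter > 0:
            self.counter -= 1


# Example: two clustered requests keep the zone active without re-waking it.
zone = StandbyZone(m=3)
first = zone.request()   # zone was in standby, so it is woken up
zone.tick()
zone.tick()              # one active cycle left
second = zone.request()  # counter reloaded, no wake-up needed
```

The second request arrives while the zone has a single active cycle left, so the counter is reloaded rather than letting the zone drop to standby and wake again.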

3.3 Extending the RF Leakage Management Scheme to the ROB

The standby scheme can also be applied to the ROB in a fashion similar to that deployed for the register file. In a P4-style pipeline, the ROB is accessed in the dispatch, writeback and commit stages. Assuming a 4-way CPU and a ROB with a total of 18 zones (as studied here), in the worst case 4 zones could be accessed from each of these stages; thus up to 12 zones can be active each cycle, providing a minimum of 22.22% reduction in ROB usage and the associated leakage power. In the dispatch stage, a ROB entry is allocated for each instruction, and since a zone needs 1 cycle to be activated, the allocation is done in fetch/decode, where the activation is triggered so that the zone will be ready 1 cycle later, maintaining a 0% IPC loss. Similarly, the ROB entries corresponding to the instructions in the writeback stage are activated 1 cycle before writeback. At commit, all possible commit entries are activated to simplify the circuitry needed to maintain performance; these entries could span 1 or 2 zones. Each zone is assumed to be active for M cycles (see Section 3.2). The first cycle is for the transition from standby into active mode. The read/write access is done in the second cycle. The third cycle is for the transition from active to standby, unless the same zone is being accessed by a different instruction; in that case, the zone is assumed (in the simulations) to be active for 3 more cycles. The allocation of ROB entries is done in a circular fashion, and thus there is no room to optimize this policy to gain extra power savings.

4 Experimental Results

We used a modified version of the well-known SimpleScalar simulator for our studies. We simulated a superscalar CPU with a fetch width of 4, an issue width of 4, a dispatch width of 4 and a commit width of 4. The IQ had 64 entries and the ROB had 144 entries.
The RF configuration used had 80 registers in each of the integer and floating point RFs (80 INT + 80 FP registers). The size of the load/store buffer was set at 128. A large subset of the integer and floating point benchmarks of SPEC2000 was used and executed for 100 million cycles, after skipping the first 400 million cycles.

4.1 Register File On/Off Results

Fig. 3. Average # of Cycles between Register Allocation and Access

The average number of cycles between register allocation and actual usage is 20.26 cycles, as shown in Figure 3. In Figure 4, we show the impact of using alternative allocation policies: FA, first active zone (see Section 3.1); MRU, allocate within the most recently used zone first; and BF, allocate within the zone that best matches the allocation size. Figure 4 shows that the average number of turned-off zones for MRU, FA and BF is 4.52 (28.25%), 4.35 (27.19%) and 4.35 (27.19%) respectively. Figure 5 shows the number of turned-off zones for a RF configuration with 16 zones each in the integer and floating point RF. Here, for MRU, FA and BF, the average number of turned-off zones is 10.03 (31.34%), 10.08 (31.5%) and 10.07 (31.47%) respectively.

Fig. 4. Average number of zones turned off (8 INT + 8 FP)

Fig. 5. Average number of zones turned off (16 INT + 16 FP)

4.2 RF and ROB Standby Scheme Results

In this section we show the results for the standby scheme along with the combined hybrid scheme for the RF and ROB.

Fig. 6. Average number of standby/off zones for register file.

Figure 6 shows the results for the standby mode (the entire bar). It also shows how many of these zones can be turned off (upper half of each bar). There are 16 zones in each of the integer and floating point RFs. The hybrid scheme provides the same total number of standby/off zones but realizes added power savings by putting the unallocated zones into the off mode instead of the standby mode. The total average number of standby zones is 15.56 (77.8%), and for the on/off scheme it is 4.39 (21.95%). The hybrid scheme provides an average of 15.56 (77.8%) standby/off zones, of which 11.17 (55.85%) are provided by the standby mode alone and the other 4.39 (21.95%) are zones that can be turned off.

Fig. 7. Average number of Standby Zones in ROB

Figure 7 shows the average number of standby zones for the ROB, partitioned into 18 zones. On the average, 14.49 (72.45%) of the zones are in the standby mode. This percentage is slightly lower than that for the RF.

5 Power Savings

We modified the e-cacti tool of [5], which is designed for estimating the dynamic and leakage power of caches, to measure the dynamic and leakage power of the RF and the ROB. Assumptions made in this regard were noted in Section 2. All of the reported measurements are for a 0.07 micron CMOS technology.

Fig. 8. On/Off RF Results using MRU, FA and BF allocation schemes.

Fig. 9. Standby Scheme Power Savings Percentage for Register File (with FA)

Figure 8 shows the RF leakage power savings for different allocation schemes. FA provides the best results and it is also simpler to implement than the other allocation schemes. FA is used in all of the subsequent results for the RF. Figure 9 shows the leakage power savings for the standby scheme and how they vary with the activation period. Extending the activation period of a zone to 5 cycles decreases the savings to 56.34% (from 57.85%), 61.01% (from 64.09%) and 64.15% (from 66.41%) for 16, 32 and 48 zones respectively.

Fig. 10. Register File Leakage and Dynamic Power Savings for Standby Mode

The standby scheme provides more power savings than the on-off scheme. The total average leakage power savings is 59.81%, as shown in Figure 10, and the dynamic power savings is 45.56%. Turning off the unused zones (the hybrid scheme), as shown in Figure 11, increases the leakage power savings to 64.89% (a relative improvement of 8.49%) compared to using the standby mode alone, which is expected since turned-off zones do not leak power.

Fig. 11. Register File Leakage Power for the Hybrid Scheme

Fig. 12. Leakage and Dynamic Power Savings in ROB

Figure 12 shows average savings of 61.99% in leakage power and 43.26% in dynamic power for the ROB. It is also possible to use the hybrid approach in the ROB to increase the savings (as in the RF). Furthermore, the commit logic could also be enhanced to activate the commit zones only if they contain ready-to-commit entries.
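The 8.49% figure quoted for the hybrid scheme in the register file is consistent with a relative gain over the standby-only leakage savings, not a difference in percentage points. The short check below uses only the two savings figures reported in this section; the function name `relative_gain` is a hypothetical helper.

```python
def relative_gain(base_pct, improved_pct):
    """Return the relative improvement of improved_pct over base_pct, in %."""
    return (improved_pct - base_pct) / base_pct * 100.0


standby_savings = 59.81  # RF leakage savings, standby scheme alone (%)
hybrid_savings = 64.89   # RF leakage savings, hybrid scheme (%)

# Absolute difference in percentage points and the relative improvement.
point_gain = hybrid_savings - standby_savings
rel_gain = relative_gain(standby_savings, hybrid_savings)
```

The absolute difference is about 5.08 percentage points, which corresponds to a relative improvement of about 8.49%.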

6 Conclusions and Related Work

We proposed a set of simple microarchitectural techniques for reducing leakage power dissipation in the register file and the reorder buffer of contemporary superscalar processors. The techniques proposed achieve a leakage power reduction in the range of 47% to well over 60% in the register file and the reorder buffer with no performance penalty. Dynamic power dissipation is also reduced through the use of a multi-zoned organization.

A large body of work exists on the use of circuit techniques for reducing the leakage energy within bitcells, such as [1, 4, 6, 7]. Our approach is based on the use of circuit techniques similar to those of [4] in conjunction with the use of microarchitectural techniques. Some leakage reduction techniques for register files/bitcells also exploit microarchitectural statistics [1, 3], such as the predominance of zeros within the stored data. A plethora of work exists on reducing the dynamic power dissipation in register files. The work of [2] proposes a fine-grained technique for shutting down unused registers to save leakage power. Once a register is activated, it stays in this mode whether the contents are accessed or not. The work presented here relies on a coarse-grained approach that controls the state of zones within the register file as active, drowsy or deep-sleep, and thus saves additional power by putting zones that are not being accessed into the drowsy mode when they contain useful data. The work of [3] proposes a cell design that permits fine-grained activation and deactivation of bitcells to reduce leakage dissipation and shows how energy savings are possible using such bitcells in register file banks and caches. Our approach, in contrast to the work of [3], uses standard bitcells with support for supply voltage management, as used in drowsy caches [4].
We also achieve dynamic power savings in our techniques because of the use of multi-segmented structures for the register file and the reorder buffer.

References

1. Azizi, N. et al., "Low-leakage Asymmetric-cell SRAM", in Proc. ISLPED 2002, pp. 48-51.
2. Goto, M. and Sato, T., "Leakage Energy Reduction in Register Renaming", in Proc. 1st Int'l Workshop on Embedded Computing Systems (ECS), held in conjunction with 24th ICDCS, pp. 890-895, March 2004.
3. Heo, S. et al., "Dynamic Fine-grain Leakage Reduction using Leakage-biased Bitlines", in Proc. ISCA 2002, pp. 137-147.
4. Kim, N. S. et al., "Drowsy Instruction Caches - Leakage Power Reduction using Dynamic Voltage Scaling and Subbank Prediction", in Proc. MICRO-35, 2002, pp. 219-230.
5. Mamidipaka, M. and Dutt, N., "eCACTI: An Enhanced Power Estimation Model for On-chip Caches", University of California, Irvine, Center for Embedded Computer Systems, TR-04-28, September 2004.
6. Narendra, S. et al., "Scaling of Stack Effect and its Application for Leakage Reduction", in Proc. ISLPED 2001, pp. 195-200.
7. Powell, M. et al., "Gated Vdd - A Circuit Technique to Reduce Leakage in Deep Submicron Cache Memories", in Proc. ISLPED 2000, pp. 90-95.