Characterizing the Voltage Scaling Limitations of Razor-based Designs


John Sartori and Rakesh Kumar
Coordinated Science Laboratory, 1308 West Main St, Urbana, IL 61801

Abstract

Worst-case processor designs have high yields, but are expensive in terms of area and power. Better-than-worst-case designs like Razor allow processors to be designed for the average case and still maintain high yields. One benefit that is often claimed about better-than-worst-case designs like Razor is that they allow low-power processing, as the processors can now be run at voltages significantly lower than their nominal input voltage. In this paper, we show that the power benefits of Razor due to voltage scaling are greatly determined by the design of the circuit it is trying to protect. We show that the benefits can be small if the underlying circuit has a small range of timing paths, as such circuits produce catastrophic failures in the face of voltage overscaling (undervolting). Benefits of Razor can be severely limited even for circuits with a wide range of timing paths due to short path and long path constraints. In general, Razor-based designs are shown to be not very effective in the face of aggressive undervolting. The results motivate the need for alternative techniques to enable significant undervolting.

1 Introduction

Many factors introduce variation into the behavior of CMOS-based processor designs. Non-idealities such as voltage fluctuations in the power supply network, temperature fluctuations in the operating environment, manufacturing variations in parameters such as gate length and doping concentration, and cross-coupling noise all affect the timing behavior of a processor, making it statistical in nature rather than deterministic. In traditional processor designs, great pains are taken to ensure that a processor always produces correct results, even when subjected to a worst-case combination of non-idealities.
This means that conservative guardbands are incorporated into design parameters to ensure correct behavior in all possible scenarios. Design for such a conservative operating point incurs a considerable overhead in terms of power spent to ensure correctness. Making matters worse, variation in CMOS-based circuits is expected to increase in coming technology generations [1], resulting in the need for even more conservative designs. Consequently, the already expensive cost of pristine computation will continue to increase in the future.

Although processors are traditionally designed to tolerate worst-case conditions, it is unlikely that all non-idealities will take effect at once, pushing a processor to the brink of erroneous behavior. Thus, there exists a considerable potential to increase the power efficiency of processors by relaxing traditional, conservative requirements for correctness in the worst case and instead designing processors for the average case. Such better-than-worst-case designs work normally in the average case and have recovery mechanisms to ensure correct operation when errors occur.

Several better-than-worst-case (BTWC) designs [6, 8] have been proposed recently that allow tradeoffs between reliability and performance/power. Razor [6], for example, is a circuit-level technique to detect and correct timing errors due to frequency, temperature, and voltage variations. Razor detects timing violations by supplementing critical flip-flops with a shadow latch that strobes the output of a logic stage at a fixed delay after the main flip-flop. Thus, if a timing violation does occur, the main flip-flop and shadow latch will have different values, signaling the need for correction. Correction involves recovery using the correct value(s) stored in the shadow latch(es). Similarly, system-level techniques such as Algorithmic Noise Tolerance [8] have been proven to be effective in overcoming timing errors in specific domains.
Such techniques allow timing errors due to frequency/voltage overscaling to propagate to the system or the application. The applications have algorithmic and/or cognitive noise tolerance and, therefore, perform application-level error correction. Application or system-level error detection and correction is also assumed for recently proposed probabilistic SOCs [3] and stochastic processor architectures [10, 11], which are BTWC designs.

One benefit that is often claimed about Razor-like BTWC designs [6] is that they allow low-power processing, as a processor can now be run at voltages significantly lower than the nominal input voltage. Any timing violation that occurs in a Razor-based design, for example, is detected and corrected by the Razor latch. Razor, in fact, is often touted as a power reduction technique as well.

While the potential of power benefits from Razor may be well-understood, the generality of the approach is not very clear. Specifically, it is not clear how the effectiveness of Razor would change for different circuits in terms of its ability to perform error detection and correction when the input voltage is changed. In this paper, we try to understand the generality of Razor-based designs. We show that the power benefits of Razor are greatly determined by the design of the circuit it is trying to protect. We characterize two different kinds of adder circuits (Kogge-Stone and Ripple Carry) and show that the benefits can be small if the underlying circuit has a small range of delay for timing paths (e.g., for Kogge-Stone), as such circuits produce catastrophic failures in the face of voltage overscaling [12]. We also show that the benefits of Razor can be severely limited even for circuits with a wide range of timing paths (e.g., Ripple Carry adder) due to short path and long path constraints. In general, Razor-based designs are shown to be ineffective in the face of voltage overscaling, demonstrating the need for alternative techniques to take full advantage of power benefits achievable through aggressive voltage scaling.

The rest of the paper is organized as follows. In Section 2, we provide an overview of existing better-than-worst-case design techniques and discuss their limitations. We present a brief description of Razor in Section 3. We identify the limitations of Razor that may hinder its effectiveness in the face of undervolting in Section 4. We present methodological details of our study in Section 5, our characterization of the limitations of Razor in Section 6, and concluding remarks in Section 7.

2 Related Work

A number of better-than-worst-case (BTWC) designs have been proposed in the past to allow circuits to save power by operating under normal conditions rather than conservative worst-case limits. One class of BTWC techniques specifies multiple safe voltage and frequency levels that a design may operate at and allows for switching between these states. Examples in this class are correlating VCO [2, 7] and Design-Time DVS [4]. As voltage changes, correlating VCO adapts the frequency of a circuit to a level slightly below the critical frequency.
The clock frequency for a given voltage is selected to incorporate a margin for process and temperature variations, as well as noise in the power supply network. Thus, scaling past the critical point is not allowed. Similarly, Design-Time DVS provides the capability to switch between multiple voltage/frequency operating points to match user or application requirements. As with correlating VCO, each operating point incorporates a conservative margin to ensure that errors do not occur.

Another class of BTWC designs uses canary circuits to detect when arrival at the critical point is imminent, thus revealing the extent of safe scaling. Delay line speed detectors [5] work by propagating a signal transition down a path that is slightly longer than the critical path of a circuit. Scaling is allowed to proceed until the point where the transition no longer reaches the end of the delay line before the clock period expires. While this circuit enables scaling, no scaling is allowed past the critical path delay plus a safety margin. Another similar circuit technique uses multiple latches which strobe a signal in close succession to locate the critical operating point of a design. The third latch of a triple-latch monitor [9] is always assumed to capture the correct value, while the first two latches indicate how close the current operating point is to the critical point. A design is considered to be tuned when the values in the first two latches do not match but the values in the last two latches do match, indicating that the setup time of the third latch is longer than the critical delay of the circuit by a small margin.

Figure 1: The Razor flip-flop.

All the BTWC techniques mentioned above have similar limitations. They allow for scaling up to, but never beyond, the critical operating point. However, with increasing variability in circuits, there is also high potential for benefit (in terms of power, for example)
when scaling is allowed to proceed past the critical point. Razor [6] actually allows voltage scaling past the critical point, since it incorporates error detection and correction mechanisms to handle the case when errors occur. Thus, for all subsequent comparison against BTWC designs, we use Razor as the point of comparison, since it represents the least conservative design point. An overview of Razor is provided in the following section.

3 Razor Basics

Razor is a circuit-level technique to detect and correct timing errors. It detects timing violations by supplementing critical flip-flops with a shadow latch that strobes the output of a logic stage at a fixed delay (which we refer to as skew) after the main flip-flop. Thus, if a timing violation does occur, the main flip-flop and shadow latch will have different values, signaling the need for correction. The skew between the main flip-flop and the shadow latch is often chosen to be half a cycle.

Error correction in Razor-based designs involves recovery using the correct value(s) stored in the shadow latch(es). A pipeline restore signal is generated by ORing together error signals of individual Razor flip-flops. The signal overwrites the shadow latch data into the errant flip-flop. Recovery mechanisms for Razor-based designs include the use of clock gating [14] and a counter-flow pipeline [13]. The occurrence of metastability at the main flip-flop output is flagged using an additional detector. Figure 1 shows the Razor flip-flop. More details on the design and operation of Razor can be found in [6].

4 Razor Limitations

In this section, we discuss the conditions under which Razor may not function correctly. Designs such as Razor that allow scaling past the critical operating point [12] must be mindful of two aspects of error recovery: error detection and correction. Razor detects an error when the value latched by the shadow latch differs from the value latched by the main flip-flop. This happens when the logic signal has not settled to its final value before the setup time of the main flip-flop. If the signal transitions again before the shadow latch latches, an error will be detected. For error correction, the Razor flip-flop must not only detect a timing violation, but must also latch the correct value in the shadow latch. This simply implies that the correct value must arrive by the setup time of the shadow latch for all Razor flip-flops in a design. So, Razor may not be able to correct errors if a) detection fails (i.e., both the main flip-flop and the shadow latch have the same incorrect value), or b) detection succeeds, but the value latched in the shadow latch is not the correct value.

To guarantee correctness, Razor requires two conditions to be met on the circuit delay behavior: the short path constraint and the long path constraint. The long path constraint (eqn. 1) states that the maximum delay through a logic stage protected by Razor must be less than the clock period (T) plus the skew between the two clocks (the clock for the main flip-flop and the clock for the shadow latch).

$$\mathrm{delay}_{max} < T + \mathrm{skew} \qquad (1)$$

If the long path constraint is not satisfied, false negative detections can occur when a timing violation causes both the main flip-flop and shadow latch to latch the incorrect value. The short path constraint (eqn. 2) states that there must not be a short path through a logic stage protected by Razor that can cause the output of the logic to change before the shadow latch latches the previous output.

$$\mathrm{delay}_{min} > \mathrm{skew} + \mathrm{hold} \qquad (2)$$

Failure to satisfy the short path constraint leads to false positive error detections when the logic output changes in response to new circuit inputs before the shadow latch has sampled the previous output. Combination of the short and long path constraints (eqn. 4) demonstrates that Razor can only guarantee correctness when the range of possible delays for a circuit output falls within a window of size $T - \mathrm{hold}$.

$$\mathrm{skew} + \mathrm{hold} < \mathrm{delay} < T + \mathrm{skew} \qquad (3)$$

$$\mathrm{delay}_{max} - \mathrm{delay}_{min} < T - \mathrm{hold} \qquad (4)$$

Note that equation 4 implies a tradeoff between the limit of Razor protection and the range of Razor usability. While increasing skew can reduce the number of uncorrectable errors by protecting longer path delays, this also leads to a reduction in the range over which Razor can be applied to correct errors due to violation of the short path constraint. Section 6 characterizes the voltage scaling limitations for Razor-based designs for two canonical circuits.

5 Methodology

In this section, we first discuss the two circuits that we use to characterize the voltage scaling limitations of Razor-based designs. The two circuits are chosen to be canonical examples of two contrasting design philosophies. Then, we present methodological details of our study.

Figure 2: Traditional designs exhibit a critical operating point. Scaling beyond this point results in catastrophic failure. (Critical Operating Point Hypothesis [12])

5.1 Simulated Circuits

The recently posited Critical Operating Point hypothesis (COP) [12] claims the following regarding general-purpose processors: In large CMOS circuits there exists a Critical Operating Frequency, F_c, and Critical Voltage, V_c, for a fixed ambient temperature T, such that
- Any frequency above F_c causes massive errors
- Any voltage below V_c causes massive errors
- For any frequency below F_c or voltage above V_c, no process related errors occur

The undervolting experiments in [12] confirm the above hypothesis for current high-performance microprocessors and suggest that voltage/frequency scaling techniques are limited by a fixed threshold beyond which the processor will become unusable.
One explanation of critical point behavior is that circuit-level design techniques used for processors produce circuits in which all timing paths through a logic stage are bunched together to match the critical path (to save power/area, for example). So, timing violations, when they occur, are massive and often catastrophic (see Figure 2).

The first circuit that we use to characterize the limitations of Razor-based designs is the Kogge-Stone adder (KSA). The KSA architecture (Figure 3) has timing paths of nearly the same length, and therefore exhibits a critical operating point (confirmed in Section 6) akin to traditional high performance processor designs. The effectiveness of Razor for a Kogge-Stone adder should therefore be representative of the effectiveness of Razor in the face of voltage scaling for traditional high performance processor designs.

Figure 3: The Kogge-Stone adder is designed such that nearly all path delays are clustered around the critical path delay of the circuit.

The second circuit that we used to evaluate the effectiveness of Razor in the face of undervolting is the ripple-carry adder (RCA). The RCA architecture (Figure 4) consists of timing paths whose delays depend on the length of the carry chain. So, while the path corresponding to the LSB has the least delay, the path corresponding to the MSB has the longest potential delay. Timing violations for such designs may not be massive (confirmed in Section 6) in the face of undervolting. Razor may, therefore, be most effective for such designs. Note that smooth gradation in path lengths has recently been advocated for high performance processor designs in the context of design of stochastic processors [10, 11]. So, our evaluations would also determine the effectiveness of Razor-based techniques for such processor designs.

Figure 4: The Ripple-Carry adder is designed such that there is a smooth gradation in terms of path delays. Critical path corresponds to the MSB.

5.2 Simulation Details

For circuit characterization, we implement the ripple carry and Kogge-Stone adders using IBM9SF 90nm CMOS technology. Adder circuits are modeled in HSPICE, and exhaustive simulations are run to characterize circuit path delays for different supply voltages. Delay data is then used to supplement RTL descriptions of adder architectures, and RTL-level simulations are run in Cadence to characterize dynamic delay behavior. For RTL-level simulations, the input set is composed of 180K random input samples. Timing results from RTL-level simulations are processed to determine various error characteristics of circuits under test. For example, to determine error rate at a particular voltage, we simulate for a long clock period and measure the time required for an operation to produce a stable, correct result at the circuit output. This time is compared to the testing clock period to determine when a timing violation has occurred.
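The error-rate measurement described above can be sketched in a few lines. This is a minimal illustration rather than the authors' tooling: the function name `timing_error_rate`, the list of per-operation settle times, and the strict `>` comparison (which ignores flip-flop setup time) are all assumptions made for the example.

```python
from typing import Sequence


def timing_error_rate(settle_times: Sequence[float], clock_period: float) -> float:
    """Fraction of operations whose output settles after the clock period.

    settle_times: time each operation needs to produce a stable, correct
        result, as measured with a long (non-limiting) clock period.
    clock_period: the clock period under test, in the same time unit.
    """
    if not settle_times:
        raise ValueError("need at least one simulated operation")
    # Assumption: an operation is a timing violation when its output is
    # still settling once the tested clock period has elapsed.
    violations = sum(1 for t in settle_times if t > clock_period)
    return violations / len(settle_times)
```

With the hypothetical settle times `[100, 400, 600, 700]` and a clock period of `550`, two of the four operations settle too late, so the function returns `0.5`.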
6 Results

Figure 5 shows the effect of undervolting on the reliability of a Kogge-Stone adder. Reliability is measured in terms of the percentage of the 180K samples that resulted in incorrect outputs (error rate). The results are shown for a Kogge-Stone adder without Razor support (error rate T=550) and with Razor support (Razor Uncorrectable). There are errors in the face of Razor if the conditions outlined in Section 4 are not met. Results are also shown for different values of skew between the clocks for the main latch and the shadow latch.

Figure 5: The Kogge-Stone adder has a critical operating point (1.2V for these design parameters). When voltage is scaled below the critical operating point, catastrophic failure occurs. Razor can correct errors over some range, but the extent of scaling is limited due to the critical wall characteristic of the circuit. Increasing clock skew between the clocks of the main and shadow latches actually decreases the range of Razor correction, since it makes Razor unusable at higher voltages without extending Razor's useful range equivalently into the lower voltage range.

There are several things to note in the figure. First, the Kogge-Stone adder is indeed representative of the time delay distribution of high performance processors as it shows critical operating point behavior. That is, as shown in the error curve of Figure 5, scaling beyond a certain voltage point leads to a catastrophic failure of the adder (i.e., 100% error rate). Aggressive voltage scaling, therefore, is not possible for such designs. Second, Razor can provide error correction only over a limited region for KSA, represented by zero uncorrectable errors. This is because in all other regions there are uncorrectable errors due to violation of long path constraints. Even in the region that has zero uncorrectable errors, the power consumption may actually increase drastically in spite of voltage scaling.
This is because the absolute error rate is high (close to 100%) and the overhead of error recovery for Razor is roughly an order of magnitude more expensive than the overhead of executing an instruction normally [6]. So, for designs like KSA where timing paths are bunched up (like in traditional high performance processor designs), Razor may not be very effective in terms of power reduction through undervolting (i.e., scaling beyond the voltage for which the first timing violation appears). While some power can be saved by eliminating the voltage guardband, scaling past the critical operating point results in nearly 100% erroneous computations.

Another thing to note in the figure is that it is not always a good idea to keep Razor turned on. This is because of potential short path constraint violations, especially for large skews. Failure to satisfy the short path constraint leads to false positive error detections when the logic output changes in response to new circuit inputs before the shadow latch has sampled the previous output. The practical result for the KSA circuit is that an error is triggered for every operation until the short path constraint is met, making Razor unusable over a range of voltages, as demonstrated by the Razor-induced errors in Figure 5. So, the use of Razor in architectures should be optional and determined by the skew, clock period, and input voltage.

Figure 6 breaks error recovery into detection and correction, demonstrating that the range over which Razor can detect errors extends past the range over which Razor can correct errors for designs like the KSA. However, since Razor correction is always on, even when the long path constraint is not met, these extra detections represent wasted power. These facts motivate the need for new design techniques that do not fail catastrophically and error correction techniques that take advantage of the extended window of detection without forcing erroneous corrections.

Figure 6: Error detection and correction for the KSA.

The ripple carry adder (RCA) architecture is not subject to catastrophic failure in response to scaling past the point of first error. Instead, as Figure 7 demonstrates, error rate increases gradually as voltage decreases. Although the minimum delay for any path of the RCA equals the delay of the sum path of a full adder, operational delay ultimately depends on adder inputs, which generate carry chains from lower to higher order bits. The RCA exhibits maximum delay when the carry chain extends from the least significant bit to the most significant bit.
However, on average, carry chains are much shorter, leaving extensive room for aggressive scaling past the point where errors begin to occur. In fact, the error rate reaches close to 100% only at very low voltages. The above behavior of RCA may be a desirable behavior for high performance processor designs to enable significant power savings through undervolting. Recent attempts [10, 11] at processor designs that produce graceful degradation in reliability in the face of voltage scaling try to mimic this behavior.

Figure 7: The ripple carry adder exhibits a wide range of path delays, characteristic of a circuit that fails gracefully. Error rate for the circuit increases gradually as supply voltage decreases.

Figure 8: Error correction rates for Razor decrease as voltage scales down. Note that these rates assume buffering of short paths to meet constraints.

Error detection for the RCA circuit is 100%. This is due to the wide range of delay paths that affect circuit outputs, eliminating the occurrence of false negative errors when the long path constraint is not met. However, Figure 8 demonstrates that correction rates for the RCA can be low in the face of aggressive scaling. These facts demonstrate the error detection advantage of designs that fail gracefully as well as the need for new techniques that can provide enhanced error correction under aggressive scaling.
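The observation that carry chains in a ripple carry adder are, on average, much shorter than the adder width can be checked with a small experiment. The sketch below is illustrative only: the function names, the chain-length definition (a carry generated at one bit and propagated through consecutive bits where exactly one operand bit is set), and the use of uniformly random operands are assumptions, not the paper's simulation setup.

```python
import random


def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Length in bits of the longest carry chain when adding a and b."""
    longest = 0
    run = 0  # length of the carry chain currently in flight (0 = no carry)
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:
            run = 1          # generate: a new carry starts at this bit
        elif (ai ^ bi) and run:
            run += 1         # propagate: the incoming carry travels one more bit
        else:
            run = 0          # kill, or no carry to propagate
        longest = max(longest, run)
    return longest


def mean_longest_chain(width: int, samples: int, seed: int = 0) -> float:
    """Average longest carry chain over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        total += longest_carry_chain(rng.getrandbits(width),
                                     rng.getrandbits(width), width)
    return total / samples
```

For instance, `longest_carry_chain(0xFF, 0x01, 8)` exercises the worst case, where the carry generated at the least significant bit ripples through all 8 bits, while `mean_longest_chain(32, 2000)` gives an estimate of the typical case for a 32-bit adder under this random-input assumption.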
One may be tempted to conclude from our previous discussion of critical operating point behavior that better-than-worst-case design techniques such as Razor should perform well for architectures that fail gracefully, since such designs do not have a wall of criticality. However, analysis of the results in Figure 7 reveals some serious limitations of using Razor, even in such architectures. The limitations arise from the potential short path and long path constraint violations discussed in Section 4. If the long path constraint is not satisfied, false negative detections can occur when the main flip-flop and shadow latch both latch the incorrect value. In Figure 7, this condition is demonstrated by the uncorrectable errors for Razor. Similarly, the failure to meet short path constraints makes Razor unusable over a range of voltages, as demonstrated by the Razor-induced errors in Figure 7. In fact, the same factor that makes the error behavior of the RCA graceful (a wide range of path delays) makes Razor less effective, because Razor requires the variation in delay to be less than a threshold (see Section 4).
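The interaction between the two constraints can be made concrete with a simplified timing model. The sketch below is an assumption-laden illustration, not the formulation of Section 4: it assumes the main flip-flop samples at the clock period, the shadow latch samples one skew interval later, and the hold time is a hypothetical parameter defaulting to zero. All names are illustrative, and delays are expressed in units of the clock period.

```python
def classify_operation(delay: float, period: float, skew: float) -> str:
    """Classify one operation by its data arrival time (simplified model).

    Assumes the main flip-flop samples at `period` and the shadow latch
    samples at `period + skew`.
    """
    if delay <= period:
        return "correct"  # both storage elements capture the right value
    if delay <= period + skew:
        return "detected_and_correctable"  # only the shadow latch is right
    return "uncorrectable"  # long path violation: the shadow latch is wrong too


def short_path_ok(min_delay: float, skew: float, hold: float = 0.0) -> bool:
    """True when new inputs cannot reach the shadow latch before it samples."""
    return min_delay > skew + hold


def feasible_skew_range(min_delay, max_delay, period, hold=0.0):
    """Return (low, high) bounds on skew that meet both constraints, or None.

    The long path constraint needs skew >= max_delay - period; the short
    path constraint needs skew < min_delay - hold. A feasible skew exists
    only when the delay spread is small enough relative to the period.
    """
    low = max(0.0, max_delay - period)
    high = min_delay - hold
    return (low, high) if low < high else None


if __name__ == "__main__":
    # Narrow delay spread: a skew satisfying both constraints exists.
    print(feasible_skew_range(min_delay=0.4, max_delay=1.2, period=1.0))
    # Wide delay spread (RCA-like after scaling): no skew satisfies both.
    print(feasible_skew_range(min_delay=0.1, max_delay=1.5, period=1.0))
```

In this model, a skew that satisfies both constraints exists only when the spread between maximum and minimum delay is smaller than roughly one clock period, which is one way to read the threshold on delay variation mentioned above.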

The variation in delay is significantly larger for an RCA design than for a KSA design. Figure 9 shows the ranges of possible delay for the KSA and RCA architectures at different voltages.

Figure 9: The delay characteristics of the Kogge-Stone adder demonstrate its critical wall behavior and unsuitability for aggressive scaling. The wide range of delays for the ripple carry adder demonstrates its capacity to fail gracefully, as well as the large margin for power savings with aggressive scaling.

In order to make Razor work for circuits that fail gracefully, buffering must be used to increase the delay of short paths, thus shifting them into the window of correction. First, this buffering adds area and power overheads to a design, negating some of the power savings afforded by better-than-worst-case design. Second, the required buffering increases the delay on short paths, transforming a circuit from one that fails gracefully into one that fails catastrophically, thus limiting the extent of possible scaling. So, while Razor is ineffective for circuits like the KSA because of massive timing violations in the face of undervolting, it is also not very effective for circuits like the RCA because of the large span between the maximum and minimum circuit delays. These results demonstrate the inadequacies of current better-than-worst-case design methodologies (like Razor) in terms of voltage scaling, motivating the need for new techniques for processor design and error handling.

7 Summary and Conclusion

Better-than-worst-case designs like Razor help improve yield, but are often considered good for power reduction as well because of reduced voltage margins. In this paper, we examined the effectiveness of undervolting for two Razor-based designs. The first design was a Kogge-Stone adder (KSA) that demonstrates critical operating point behavior similar to modern high-performance microprocessors. The other design was a ripple carry adder (RCA) that degrades gracefully in reliability in the face of undervolting. Our experiments showed that Razor is ineffective in the face of undervolting for designs in which the timing paths are bunched up. This is due to massive timing errors upon breaching the critical operating point, as well as violation of the short and long path constraints. The effectiveness of Razor is limited even for designs with spread-out delay distributions, because timing variation within a circuit must be less than a threshold for Razor to work for that circuit. The limitations of Razor in the face of aggressive voltage scaling, coupled with the expectation of high variability in coming technology generations, motivate the need for alternative techniques that will allow full extraction of the power benefits available from voltage scaling.

References

[1] International Technology Roadmap for Semiconductors 2008, http://public.itrs.net.
[2] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen. A dynamic voltage scaled microprocessor system. IEEE Journal of Solid-State Circuits, 35:1571-1580, 2000.
[3] L. N. Chakrapani, P. Korkmaz, B. E. S. Akgul, and K. V. Palem. Probabilistic system-on-a-chip architectures. volume 12, pages 1-28, New York, NY, USA, 2007. ACM.
[4] Intel Corporation. Enhanced Intel SpeedStep technology for the Intel Pentium M processor, 2004.
[5] S. Dhar, D. Maksimović, and B. Kranzen. Closed-loop adaptive voltage scaling controller for standard-cell ASICs. In ISLPED '02: Proceedings of the 2002 International Symposium on Low Power Electronics and Design, pages 103-107, New York, NY, USA, 2002. ACM.
[6] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. Razor: A low-power pipeline based on circuit-level timing speculation. In MICRO 36: Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, page 7, Washington, DC, USA, 2003. IEEE Computer Society.
[7] V. Gutnik and A. Chandrakasan. An efficient controller for variable supply-voltage low power processing. pages 158-159, Jun 1996.
[8] R. Hegde and N. R. Shanbhag. Energy-efficient signal processing via algorithmic noise-tolerance. In ISLPED '99: Proceedings of the 1999 International Symposium on Low Power Electronics and Design, pages 30-35, New York, NY, USA, 1999. ACM.
[9] T. Kehl. Hardware self-tuning and circuit performance monitoring. pages 188-192, Oct 1993.
[10] R. Kumar. Stochastic processors. In NSF Workshop on Science of Power Management, 2009.
[11] S. Narayanan, G. Lyle, R. Kumar, and D. Jones. Testing the critical operating point (COP) hypothesis using FPGA emulation of timing errors in over-scaled soft-processors. In SELSE 5 Workshop - Silicon Errors in Logic - System Effects, 2009.
[12] J. Patel. CMOS process variations: A critical operation point hypothesis, 2008.
[13] R. F. Sproull, I. E. Sutherland, and C. E. Molnar. The counterflow pipeline processor architecture. volume 11, pages 48-59, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press.
[14] Q. Wu, M. Pedram, and X. Wu. Clock-gating and its application to low power design of sequential circuits. volume 47, pages 415-420, 2000.