Timing Error Detection: An Adaptive Scheme To Combat Variability

EE241 Final Report
Nathan Narevsky and Richard Ott
{nnarevsky, tomott}@berkeley.edu

Abstract

With the reduction of feature sizes, more sources of variability arise, forcing designers to build extra timing margin into their circuits and increasing design overhead. Several approaches address this problem by optimizing the system in real time to achieve the best performance or the minimum power while meeting the required specifications. We explore different timing error detection architectures, which use logic to detect when data does not meet the necessary setup time constraints, and examine how these architectures detect, and in some cases correct, the errors. It is often difficult to compare these architectures directly, but we examine how they can minimize power for a given throughput and potentially increase the operating frequency while remaining resilient to errors, regardless of the nature of those errors. The goal is to identify which architectures are better suited to a given set of operating conditions.

1. Introduction

There is an ever-increasing demand for low-power circuits due to portable, ubiquitous electronic devices that are mostly battery powered. One approach to achieving the lowest power consumption is to reduce the supply voltage as far as possible. However, with large variations due to process corners, designs are often stuck at voltages projected for the worst case plus margin so that the circuit functions with a high parametric yield. One recent and growing field that mitigates this problem is resilient, or timing error detection, circuits [1-5]. These circuits, which are more robust flip-flops and latches, attempt to detect [1,3-5], and in some cases correct [3,5], circuit timing errors through different methods.
These circuits attempt to detect setup time failures through various methods. The point at which a circuit starts to fail timing constraints is called the point of first failure (PoFF), and adaptive schemes that change the supply voltage and/or the operating frequency can often push the design to that point. Some architectural changes enable designs to operate correctly beyond that point by borrowing time from other paths or replaying instructions when an error occurs.

Timing error detection (TED) circuits provide an interesting field for low-power research. With TED circuits, it can be possible to reduce the supply voltage for optimum power savings. In this project, we investigate several different TED architectures. We compare their effectiveness in identifying the lowest operating voltage of a given circuit and in identifying errors in general within their respective error detection windows. Since not all TED architectures involve error correction, we focus solely on the ability of TED circuits to recognize timing errors and on which TED circuits allow the lowest voltage scaling. We also compare the power usage of these circuits at both the nominal and the lowest operating voltage.

2. Timing Error Detection Architectures

Some of the early advancements in TED circuits are summarized well in [4]. There are three main types of error detection circuits covered: the Razor flip-flop (RFF), the transition detector with time borrowing (TDTB), and double sampling with time borrowing (DSTB).

Figure 1: Architectures shown in [4]

The RFF works by comparing the output of the (possibly metastable) flip-flop with that of a latch in order to detect data that changes after the rising edge of the clock, a setup time violation that makes the circuit fail timing and operate incorrectly. The transition detector circuit operates on a different principle: it compares the data with a delayed version of itself, and with proper timing (using the clock signal) the window in which transitions are flagged can be controlled to catch a certain range of setup time violations. While this approach has a somewhat limited window of operation, the window can be designed so that timing violations are very likely to be caught. Put differently, if the timing errors of the circuit under test exceed the designed detection window, the circuit is not operating within specification and there is nothing to do except slow down operation. The DSTB circuit works similarly to the TDTB, replacing the transition detector with a shadow latch.

These circuits simply detect when an error occurs; in these implementations they have no means of letting the system recover gracefully from it. One example of an architectural means of such correction is presented in [1], which shows an ARM ISA processor with Razor flip-flops and a closed-loop scheme that attempts to correct any errors that occur during normal operation. The correction uses dynamic frequency scaling (DFS) implemented on chip, together with a power virus that artificially causes voltage droops in the supply to simulate a typical microprocessor workload.
They also have a dynamic voltage controller, which can change the supply voltage to minimize power while maintaining a fixed throughput. This dynamic voltage scaling (DVS) lets the processor minimize its power consumption at constant throughput without overdesigning the timing margins of the critical paths. The benefits of DVS apply to any generic architecture, so we explore how changing the supply voltage affects the different TED architectures in terms of error detection windows and power consumption. The power savings achieved in [1] are a strong motivation for ensuring that these circuits operate across a range of voltages without degrading the accuracy of error detection. As shown below, for a fixed throughput, the circuit's energy can be reduced very effectively if there is a way to detect how much voltage is necessary to meet the required clock period. As noted in Figure 2, this technique can also enable chips to work that would otherwise fail under nominal conditions.

Figure 2: Power Savings using DVS from [1]

Another approach to both error detection and correction is to use time borrowing to artificially extend the clock period, absorbing minor timing delays that would otherwise stop the circuit from functioning properly [3]. The TIMBER (TIMe Borrowing and Error Relaying) flip-flop can select among different delayed clocks to give extra time to logic that needs it. The TIMBER latch has similar options and can operate as a master-slave latch pair if configured properly.

One newer implementation of these timing error detection schemes is an adaptation of previous designs [1,4-5] in a slightly different context, where errors are propagated through the circuit until the system fully recovers and normal operation can continue [5]. This small penalty decreases throughput slightly in order to operate beyond the PoFF, which means that the circuit will operate, albeit occasionally slightly slower, at a point where it would normally produce errors that corrupt the program's operation. While this is a novel idea, it is really just an application of previous work and as such will not be tested in this project, simply due to time and complexity constraints.

Some groups have even started integrating timing error detection circuitry into automated design flows [2]. This has reduced the number of error detection elements required in designs that target only the critical paths, lowering both the power and area overhead of the detection and correction schemes. While this work is still being optimized, promising first steps have been taken, and we believe it will be integrated into design automation tools in the near future.

3. Analysis

For the Timber flip-flop shown above, the timing window is given by:

$T_{window} = T_{TB} - t_{mux} - t_{nand}$

The delay of the time-borrowing line, $T_{TB}$, is used together with the error detection to generate the extra time that lets the flip-flop detect and correct errors.
This generated delayed clock has to pass through a 4-input multiplexer and one NAND gate before reaching the logic that combines the clock and the shifted clock to generate a new gating signal. The data has to make it through the transmission gate while it is still open. The authors of [3] claim that they avoid metastability by using a level-sensitive sampling element, but when operating right at the edge of the error detection window, metastability could still be a problem.

We implemented this delay using a chain of inverters that were all sized according to the unit cell for that particular circuit. There is a tradeoff between the accuracy of the delay element, the power it takes to generate the delay, and how much delay is actually needed; all of these are up to the designer to tailor to the particular application.

For the Timber latch, the timing window is given by:

$T_{window} = T_{clk} - T_{dl} - t_{nand}$

Since it is a latch, the transparency while the clock is high is part of the error detection window. Enabling the latch turns off F and makes the latch use the path through the transmission gate controlled by the signal L. The generation of that signal goes through the delay of an inverter chain, represented here by $T_{dl}$, and is combined with the original clock signal to produce a shifted window in which to detect timing errors.

For the Razor flip-flop, the timing window is:

$T_{window} = T_{delay} - 2t_{dyn} + t_{inv}$

The window is shortened on both sides of the delay window by the delay of the dynamic OR gate, since the edges need to propagate while both nck and CK are high in order to pull down the DYN node. The length of the delay line plus the inverter sets the overlap between nck and CK, and therefore gates the amount of time in which an error can be detected.

For the Bubble Razor design, the window is:

$T_{window} = T_{clk} - t_{xor} - t_{transmission} - t_{inv}$

The latch is nominally transparent when the clock is high, and the error signal has to pass through an exclusive-OR gate, a transmission gate, and an inverter to reach the output (following the most critical path), so the window is fairly close to the time during which the clock is high. This allows both normal latch operation, in which consecutive sequential blocks are clocked by opposite phases, and pulsed-clock operation, which would let the circuit act as a flip-flop without a hard rising edge.

4. Test Circuits and Design Procedure

4.1 Sizing

To compare the circuits described above fairly, we adopted the following sizing criteria. In the 32nm predictive technology, we first simulated designs to find the optimal PMOS-to-NMOS ratio, which we found to be about 1.2:1 for this technology node. We found the stack effect to be negligible in these transistor models, most likely because they are predictive technology models and might not model this effect correctly. Using this sizing information, we sized the gates of each circuit to produce the same clock-to-Q delay to within 0.1%. This normalizes the flip-flops and latches so that they are interchangeable: the same critical path of logic can operate at the same frequency with any of the circuits above, so all of the circuits operate under the same conditions.

4.2 Error Detection Window

After the sizing was completed, we set up and ran a testbench to measure the error detection/correction window for each design. The edge of the input data was swept around the rising edge of the clock, and the output was monitored to see whether the circuit detected and/or corrected the error that was introduced. This test was performed across a range of supply voltages to determine the minimum operating voltage, defined as the supply voltage at which the detection window has shrunk by 10% of its nominal value. Since the latch timing is a function of the clock period, note that the simulations used a 50% duty cycle clock operating at 200MHz.

4.3 Power Consumption

After sizing the gates in the different architectures and determining the minimum supply at which they function normally, we measured their overall power consumption, an important metric given that one of the main goals of error-detecting flip-flops is power reduction. For comparison, we found the power consumed by each circuit at its lowest operating voltage for the given sizing, as well as at 1V as a nominal reference. It is important to compare these circuits at a nominal voltage as well as at their lowest functional operating voltage, since in a given implementation it may be difficult to synthesize the circuit with a different supply than the rest of the design.

5. Results

A summary of our results is presented in Table 1.
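The minimum-operating-voltage criterion of Section 4.2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the testbench used for the measurements: the function name `min_operating_voltage`, the sample sweep data, and the reading of the criterion as "the lowest swept supply at which the window is still at least 90% of its nominal value" are all hypothetical.

```python
def min_operating_voltage(sweep, nominal_v=1.0, max_shrink=0.10):
    """Return the lowest supply that keeps the detection window within spec.

    sweep: dict mapping supply voltage (V) to measured window width (s).
    The window at nominal_v is the reference; a supply passes while its
    window is at least (1 - max_shrink) of that reference (assumed reading
    of the 10% criterion).
    """
    threshold = (1.0 - max_shrink) * sweep[nominal_v]
    vmin = nominal_v
    # Walk downward from the nominal supply and stop at the first failure.
    for v in sorted((v for v in sweep if v <= nominal_v), reverse=True):
        if sweep[v] < threshold:
            break
        vmin = v
    return vmin


# Hypothetical sweep data, invented for illustration only.
example_sweep = {1.0: 2.40e-9, 0.9: 2.38e-9, 0.8: 2.30e-9,
                 0.7: 2.05e-9, 0.6: 1.20e-9}

print(min_operating_voltage(example_sweep))  # 0.8
```

Stopping at the first failing supply, rather than taking the minimum over all passing points, avoids reporting a voltage below a region where the window had already collapsed.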
The figure of merit (FOM) of each timing error detection architecture is given by the formula:

$FOM = \dfrac{T_{window,nom}}{Power\ @\ V_{min}}$

Note: this figure of merit may not be a valid comparison between a latch and a flip-flop, but we believe it is a fair comparison among flip-flops and among latches. The values have been scaled for readability. All tests are at 200MHz with a load capacitance of 100fF.

                    Timber Latch   Razor Latch   Timber FF   Razor FF
Clk->q delay        300.6p         300.7p        300.6p      300.5p
Power @1V           30.35uW        23.8uW        32.03uW     43.9uW
V_min               810mV          720mV         750mV       780mV
Power @V_min        19.1uW         14.1uW        17.63uW     26.5uW
Nominal T_window    2.31n          2.43n         37p         65p
FOM                 1.209          1.723         2.099       2.453

Table 1

For a circuit whose main function is to detect errors, the most important figure of comparison is the window in which the circuit is able to detect them. In both cases, the Razor design was able to detect an error over a longer period of time (given our attempts at normalizing the circuits for comparison). The latches were able to detect an error over a longer period than the flip-flops, but this follows from being a latch rather than a flip-flop, not from the comparison circuit. The Razor flip-flop could detect an error 18ps longer than the Timber design, and the Razor latch could detect an error 22ps longer than the Timber latch design, both advantages.
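As a cross-check, the FOM row of Table 1 can be recomputed from the tabulated window and power values. The sketch below is illustrative only: the dictionary layout, the function name `figure_of_merit`, and the per-type scale factors are assumptions. The scale factors were chosen because they reproduce the FOM row; the report itself only states that the values were scaled for readability.

```python
# Values copied from Table 1: nominal window in seconds, power at V_min in watts.
TABLE_1 = {
    "Timber Latch": {"window": 2.31e-9, "power_vmin": 19.1e-6, "kind": "latch"},
    "Razor Latch":  {"window": 2.43e-9, "power_vmin": 14.1e-6, "kind": "latch"},
    "Timber FF":    {"window": 37e-12,  "power_vmin": 17.63e-6, "kind": "ff"},
    "Razor FF":     {"window": 65e-12,  "power_vmin": 26.5e-6, "kind": "ff"},
}

# Assumed readability scaling (not stated in the report):
# latches as (ns / uW) * 10, flip-flops as ps / uW.
SCALE = {"latch": 1e4, "ff": 1e6}


def figure_of_merit(name):
    """Nominal detection window divided by power at V_min, then scaled."""
    row = TABLE_1[name]
    raw = row["window"] / row["power_vmin"]
    return round(raw * SCALE[row["kind"]], 3)


for name in TABLE_1:
    print(f"{name}: {figure_of_merit(name)}")
```

Because latches and flip-flops use different assumed scale factors here, the resulting numbers are only comparable within each group, which matches the caveat attached to the figure of merit.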

The circuits all proved that they could operate at lower voltages, but dropping the supply voltage increases the delay in the critical path, so the timing margin necessary to ensure proper operation only increases, meaning the circuit overall operates worse under these conditions. The designs are very close in terms of minimum operating voltage. The Razor latch operates at the lowest supply voltage, 720mV. The Timber FF operates at 750mV, the lower of the two flip-flops, but not much lower than the Razor flip-flop, which could operate at 780mV. The Timber latch was by far the worst, with a lowest possible supply voltage of 810mV. An interesting property of the latches was their ability to operate at increasingly lower supply voltages at the cost of a narrower detection window. This behavior is shown below.

Figure: Error Detection Window Versus Supply Voltage (window width in ns versus supply voltage in V, for the Timber latch and the Bubble Razor latch)

Power consumption is another important measured value. At the nominal supply voltage, the Timber latch consumed 27.5% more power than the Razor latch. As the supplies were scaled down to their minimum values (without loss of detection window), the discrepancy only grew: the Timber latch's power consumption was up to 37% more than the Razor latch's. Given that the Razor latch can operate at a supply 90mV lower than the Timber latch, these results are consistent with supply scaling. For the flip-flop designs, the Timber design consumed less power than the Razor design. The Razor design consumed 37% more at the nominal supply and 50% more at the lowest possible operating supply. Given that their minimum supplies are similar, this discrepancy is very large.

6. Conclusion

Through the various metrics proposed in the previous section, such as power and timing overheads, we hope to gain a grasp of the capabilities of these circuits under real device deployment.
These metrics attempt to compare the different designs, although the comparisons are not entirely straightforward because it is difficult to ensure that they are fair. Depending on the operating conditions and the design itself, different architectures have their own strengths, so it is up to the designer to determine the best choice for a particular design point, given the advantages and disadvantages of these circuits. This is a developing field, and there are many future designs to come.

REFERENCES

[1] D. Bull et al., "A power-efficient 32b ARM ISA processor using timing-error detection and correction for transient-error tolerance and adaptation to PVT variation," ISSCC 2010.
[2] M. Kurimoto et al., "Phase-adjustable error detection flip-flops with 2-stage hold-driven optimization, slack-based grouping scheme and slack distribution control for dynamic voltage scaling," ACM Trans. Des. Autom. Electron. Syst., vol. 15, no. 2, pp. 17:1-17:17, Mar. 2010.
[3] M. Choudhury et al., "TIMBER: time borrowing and error relaying for online timing error resilience," Design, Automation and Test in Europe 2010.
[4] K. A. Bowman et al., "Energy-Efficient and Metastability-Immune Timing-Error Detection and Instruction-Replay-Based Recovery Circuits for Dynamic-Variation Tolerance," ISSCC 2008.
[5] M. Fojtik et al., "Bubble Razor: An Architecture-Independent Approach to Timing-Error Detection and Correction," ISSCC 2012.