Product Level MTBF Calculation

2014 Fifth International Conference on Intelligent Systems, Modelling and Simulation
2166-0662/14 $31.00 © 2014 IEEE, DOI 10.1109/ISMS.2014.137

Ang Boon Chong
eASIC Corp
bang@easic.com

Abstract

Synchronizers are used to sample asynchronous data in digital circuits and protect chips from metastability failure. Mean time between failures (MTBF) degrades with technology scaling, while chip performance increases, more clock domains are placed on a chip and synchronizer chain usage grows, so MTBF requirements are becoming harder to meet. The objective of this paper is to share the calculation of the proper number of synchronizer stages, N, the derivation of product level MTBF, and the caveats of the traditional product level MTBF estimation. It is hoped that this will benefit readers.

Keywords: Metastability, Synchronization, Chip

I. INTRODUCTION

In a synchronous design, all sequential elements must satisfy setup and hold requirements to ensure that a valid state propagates. Metastability happens when asynchronous signals are transferred and the resulting output goes into an undetermined state. A common example is data violating the setup and hold requirement of a sequential element. This is illustrated in Figure 1 (asynchronous transfer) and Figure 2 (flip-flop behavior during metastability).

Figure 1 Asynchronous Transfer [1]-[3]

Figure 2 Metastability FF Behavior

When a flip-flop is in a metastable state, its output hovers at a voltage level between high and low, causing the output transition to be delayed beyond the specified clock-to-output delay ($T_{co}$), as shown in Figure 2. During metastability, the output resolves only after an extra resolution time ($T_W$) beyond the normally specified clock-to-output delay, $T_{co}$. If this extra time is not covered by additional timing slack, system failures may occur.

The duration of a metastable condition is probabilistic, so there is no guaranteed maximum time. The probability of entering a metastable state depends on the clock frequency, the transition frequency of the asynchronous data signal, and a constant that defines the window in which a transition can cause metastability. Metastability can appear as a flip-flop that switches late or does not switch at all. It can also produce a brief pulse or oscillations at the flip-flop output. Any of these conditions can cause system failure. Once a flip-flop is in a metastable state, the value to which it resolves cannot be determined. This is analogous to a ball rolling over a hill, as shown in Figure 3. Each side of the hill represents a stable state and the top of the hill represents the metastable state.

Figure 3 Metastability Analogy

The problem with metastable events is not merely their occurrence but that, without proper synchronization, an event can cause inconsistent values to be latched into subsequent flip-flops, as shown in Figure 4.

Figure 4 Metastability Propagation and Mitigation

In this example, if one flip-flop latches a value of 1 while another flip-flop latches a value of 0, the design can become unpredictable and may fail. This situation may occur because the two paths shown can have different routing delays.

The evolution of MTBF models is summarized in Table 1.

TABLE 1 SUMMARY OF MULTISTAGE SYNCHRONIZER'S MTBF MODELS [4]-[12]

In this paper, the scope of discussion is the product level MTBF formulation and the caveats of the traditional product level MTBF estimation, together with a brief introduction to MTBF measurement techniques and an analysis of the metaharden flip-flop.

II. MTBF MEASUREMENT TECHNIQUE

A typical flip-flop is shown in Figure 5.

Figure 5 Flip-Flop Schematic

Based on Figure 5, consider the flip-flop entering metastability: assume the clock is low, node A is at logic 1 and input D transitions from low to high. As a result, node A is falling and node B is rising. When the clock, CLK2, rises, it disconnects the input from node A and closes the A-B loop. If A and B happen to be near their metastable levels, they take a long time to diverge towards legal digital values. During metastability, the voltage levels of nodes A and B of the master latch are roughly midway between logic 1 and logic 0. The exact voltage levels depend on transistor sizing and process variations, and are not necessarily equal for the two nodes.

The settling time required for a flip-flop to resolve metastability can be derived by plotting the data arrival time at node D against the clock-to-output delay at node B for a fixed clock and data slew, as shown in Table 2.

TABLE 2 DATA ARRIVAL VERSUS CLKTOOUT DELAY SPICE SIMULATION

To accurately determine the settling time, $T_W$, Table 2 can be further processed into Table 3.

TABLE 3 PROCESSED DATA

In Table 3, $T_{dc}$ marks the region in which the clock-to-output delay is longer than the usual clock-to-output delay, $T_{co}$, for a fixed data and clock slew, and $T_{extra}$ is the extra clock-to-output delay required. $T_{dc}$ is negative because of the setup violation that occurs as the data edge is shifted from the left towards the leading edge of the clock transition. In Table 3, the clock-to-output delay starts to increase when the data arrives at 4.96 ns. $T_{dc}$ is extracted as the difference between 4.96 ns and the data arrival time at which no data is captured at the flop's output. The plot of $T_{dc}$ versus $T_{extra}$ is shown in Figure 6.

Figure 6 $T_{dc}$ versus $T_{extra}$

From Figure 6, as $T_{dc}$ approaches 0, $T_{extra}$ rises steeply, while as the magnitude of $T_{dc}$ increases, $T_{extra}$ falls to 0. The relationship can be expressed as follows [4]:

$$T_{extra} = \begin{cases} -\tau \ln\left(\dfrac{|T_{dc}|}{T_W}\right), & |T_{dc}| \le T_W \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

The value of $T_W$ can be derived from equation (1) using sampling points of $T_{dc}$ at -33 ps and -13 ps.

During metastability exit, the two inverters operate in the linear region of their transfer function. This can be modeled in small signal with the inverters as negative amplifiers, each driving an output resistance $R$ and a capacitive load $C$, as shown in Figure 7.

Figure 7 Master Latch Small Signal Model

Based on Figure 7, the model can be written as

$$\frac{-A_1 V_B - V_A}{R} = C \frac{dV_A}{dt} \tag{2}$$

$$\frac{-A_2 V_A - V_B}{R} = C \frac{dV_B}{dt} \tag{3}$$

Assuming symmetric cross-coupled inverters, $A_1 = A_2 = A$, subtracting equation (3) from equation (2) and defining $V = V_A - V_B$ gives

$$(A - 1) V = RC \frac{dV}{dt} \tag{4}$$

Equation (4) can be simplified as

$$V = \frac{RC}{A - 1} \frac{dV}{dt} \tag{5}$$

Assuming $A \gg 1$, equation (5) can be simplified as

$$V = \tau \frac{dV}{dt} \tag{6}$$

where $\tau = RC / A$. Hence the solution is

$$V(t) = K e^{t/\tau} \tag{7}$$

where $K$ is the voltage difference at $t = 0$.

The simulation setup for the circuit exiting metastability is shown in Figure 8. In the simulation, the switch starts closed with a small supply chosen to match the non-equilibrium state based on the voltage transfer curve. Once the required equilibrium voltage is determined, it is set as the DC supply value. The switch is closed (shorted) initially and then opened at around 1 ns to allow the latch to resolve.
The anticipated voltage transfer curve is shown in Figure 9.
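Returning to equation (1), the extraction of $T_W$ from two sampling points can be sketched in a few lines of Python. This is a minimal sketch, assuming that equation (1) has the logarithmic form $T_{extra} = -\tau \ln(|T_{dc}| / T_W)$ inside the metastability window. The function name `fit_tau_and_tw` and the numeric values of `TAU_PS` and `T_W_PS` are hypothetical and are not taken from Table 3; only the sampling points at -33 ps and -13 ps come from the text.

```python
import math


def fit_tau_and_tw(sample_a, sample_b):
    """Solve the assumed form of equation (1) for tau and T_W.

    Each sample is a (T_dc, T_extra) pair in the same time unit. Both samples
    must lie inside the metastability window (|T_dc| <= T_W, T_extra > 0).
    """
    (tdc_a, textra_a), (tdc_b, textra_b) = sample_a, sample_b
    # Subtracting the two instances of equation (1) eliminates T_W.
    tau = (textra_a - textra_b) / math.log(abs(tdc_b) / abs(tdc_a))
    # Back-substituting into either instance recovers T_W.
    t_w = abs(tdc_a) * math.exp(textra_a / tau)
    return tau, t_w


# Hypothetical flop parameters in picoseconds, used only to generate test data.
TAU_PS = 20.0
T_W_PS = 80.0


def model_t_extra(t_dc_ps):
    """Assumed equation (1) inside the window |T_dc| <= T_W."""
    return -TAU_PS * math.log(abs(t_dc_ps) / T_W_PS)


# Sampling points at T_dc = -33 ps and -13 ps, as in the text.
samples = [(-33.0, model_t_extra(-33.0)), (-13.0, model_t_extra(-13.0))]
tau_fit, t_w_fit = fit_tau_and_tw(*samples)
print(f"tau = {tau_fit:.3f} ps, T_W = {t_w_fit:.3f} ps")
```

With real data, the two `(T_dc, T_extra)` pairs would come from the processed SPICE results rather than from `model_t_extra`. Using more than two points with a least-squares fit would reduce sensitivity to simulation noise.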

Figure 8 Exiting Metastability Simulation Circuit Setup

Figure 9 Voltage Transfer Curve

From Figure 9, it is observed that the voltage transfer curve is not symmetric. Hence, an initial voltage supply of 0.037 V is required to bring the voltages of nodes A and B to the equilibrium state. If the metastable state values are equal, the latch is symmetric. For a symmetric latch, the initial voltage supply, v ...

The plot of the voltage difference versus time on a natural log scale is shown in Figure 10.

Figure 10 Output Voltage Difference Versus Time

Based on equation (7),

$$V(t_1) = K e^{t_1/\tau} \tag{8}$$

$$V(t_2) = K e^{t_2/\tau} \tag{9}$$

Dividing equation (9) by equation (8),

$$\tau = \frac{t_2 - t_1}{\ln\left(V(t_2) / V(t_1)\right)} \tag{10}$$

Hence the value of $\tau$ can be derived from the linear slope of Figure 10.

III. FLOP LEVEL MTBF REQUIREMENT

To derive the flop level mean time between failures, MTBF(S), for a typical two-stage synchronizer as in Figure 1, the notation is simplified as shown in Figure 11.

Figure 11 Async Flop Transfer

From equation (1), we can derive the failure rate of a register in the presence of a data source whose transition times are uncorrelated with the clock input, $f_{clk}$. Let $T_{slack}$ be the timing slack given to the synchronizer registers to resolve the metastable state. When $T_{extra}$ is greater than the available timing slack, $T_{slack}$, errors may propagate to downstream logic and cause a system error.

Hence the probability of failure after the sampling edge of the capturing clock can be expressed as:

$$p(\text{failure}) = p(\text{enter metastability}) \times p(T_{extra} > T_{slack}) \tag{11}$$

Substituting equation (1) into equation (11):

$$p(\text{failure}) = T_W \, f_{clk} \, e^{-T_{slack}/\tau} \tag{12}$$

Taking into account the asynchronous data rate, $f_{data}$, at which metastability can be entered (Figure 11), the expected failure rate is:

$$\text{failure rate} = T_W \, f_{clk} \, f_{data} \, e^{-T_{slack}/\tau} \tag{13}$$

For an N-stage synchronizer, where N is greater than 2 flip-flops, the expected failure rate can be written as:

$$\text{failure rate} = T_W \, f_{clk} \, f_{data} \, e^{-(N-1) T_{slack}/\tau} \tag{14}$$

The mean time between failures, MTBF, for an N-stage synchronizer is the inverse of the expected failure rate:

$$MTBF(S) = \frac{e^{(N-1) T_{slack}/\tau}}{T_W \, f_{clk} \, f_{data}} \tag{15}$$

From equation (15), it is observed that for N-stage synchronizers, the cumulative slack, $(N-1) T_{slack}$, is critical for achieving a better MTBF(S) for the synchronizer chain.
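The relationship between the number of stages and MTBF can be explored numerically. The following is a minimal sketch, assuming the standard exponential form MTBF(S) = exp((N-1) T_slack / tau) / (T_W f_clk f_data) for equation (15) and the two-sample slope estimate for tau. All function names and numeric parameter values are hypothetical and chosen only for illustration.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def tau_from_samples(t1, v1, t2, v2):
    """Estimate tau from two points of V(t) = K * exp(t / tau).

    The slope of ln(V) versus t is 1 / tau.
    """
    return (t2 - t1) / math.log(v2 / v1)


def synchronizer_mtbf(n_stages, t_slack, tau, t_w, f_clk, f_data):
    """MTBF in seconds of an N-stage chain (assumed form of equation (15))."""
    cumulative_slack = (n_stages - 1) * t_slack
    return math.exp(cumulative_slack / tau) / (t_w * f_clk * f_data)


def min_stages(target_mtbf_s, t_slack, tau, t_w, f_clk, f_data, max_stages=16):
    """Smallest N >= 2 whose MTBF meets the target, searched up to max_stages."""
    for n in range(2, max_stages + 1):
        if synchronizer_mtbf(n, t_slack, tau, t_w, f_clk, f_data) >= target_mtbf_s:
            return n
    raise ValueError("target MTBF not reachable within max_stages")


# Hypothetical parameters: tau = 20 ps, T_W = 80 ps, 1 GHz clock,
# 100 MHz data toggle rate, 0.5 ns of slack per stage.
params = dict(t_slack=0.5e-9, tau=20e-12, t_w=80e-12, f_clk=1e9, f_data=100e6)

for n in (2, 3, 4):
    years = synchronizer_mtbf(n, **params) / SECONDS_PER_YEAR
    print(f"N = {n}: MTBF = {years:.3e} years")

# Number of stages needed for a 1000-year target under these assumptions.
required = min_stages(1000 * SECONDS_PER_YEAR, **params)
print("stages required:", required)
```

Because the slack appears in the exponent, each additional stage multiplies the MTBF by exp(T_slack / tau), which is why a small increase in N is usually enough to meet a target.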

IV. METAHARDENING FLOP TRADE OFF

A comparison of the metaharden flop with the normal flop is shown in Table 4.

TABLE 4: METAHARDEN FLOP COMPARISON

            Setup  Hold  Setup + Hold  %diff  Clk2Out  %diff  Tw  Tau    Leakage  Dynamic
Original    61     4     65                   187             78  1.055  0.298    7.019
Metaharden  58     4     62            4.62%  182      2.67%  84  0.338  0.732    7.666

From Table 4, it is observed that although the metaharden flop provides a better settling time, the clock-to-output delay is degraded as the loading from the master latch increases. To improve the clock-to-output delay, further optimization of the slave latch is required. The increase in leakage power with the metaharden flip-flop is expected, because the best settling time, $T_W$, of the metaharden flop is obtained by using low threshold voltage (LVT) cells.

V. CHIP LEVEL MTBF REQUIREMENT

Equation (15) provides a formula for a single synchronizer chain. As a design may consist of M synchronizer chains, denoted $S_1, S_2, \ldots, S_M$, the effective mean time between failures of the entire chip can be derived as [4]:

$$MTBF(C) = \frac{1}{\sum_{i=1}^{M} \dfrac{1}{MTBF(S_i)}} \tag{16}$$

Equation (16) is valid under the hypothesis of independent failures, which may not always be the case. Based on equation (16), for a chip with M synchronizer chains, the effective mean time between failures corresponds to $1/M$ of the harmonic mean of the MTBFs of all synchronizer chains, and the chip's MTBF is dominated by the worst chain in the design.

The traditional method derives the chip MTBF, MTBF(C), from the worst synchronizer chain's MTBF, MTBF(S), divided by the total number of synchronizer chains, M, in the chip. This results in an extremely pessimistic chip level MTBF(C), as illustrated by the synchronizer chain conditions shown in Table 5.
TABLE 5 CHIP'S MEAN TIME BETWEEN FAILURE

Destination clock  Chains per clock domain  MTBF single chain (Year)  MTBF per clock domain (Year)
clock A            200                      3.19E+06                  1.60E+04
clock B            100                      1.13E+06                  1.13E+04
clock C            20                       5.65E+03                  2.83E+02
clock D            100                      1.40E+06                  1.40E+04
clock E            10                       1.40E+06                  1.40E+05
total              430                      chip MTBF                 2.65E+02

From Table 5, based on the traditional method, the calculated chip MTBF would be MTBF(C) = 5.65E+03 / 430 = 13 years, while the actual chip MTBF is 265 years. Hence the traditional chip MTBF calculation is pessimistic and results in overdesign: either more metaharden flops than necessary or an unnecessarily high number of synchronizer stages per chain, which degrades system performance.

VI. TOTAL PRODUCTS MTBF REQUIREMENT

For a product that consists of L chips with different chip level MTBFs, denoted $C_1, C_2, \ldots, C_L$, the product's mean time between failures, MTBF(P), can be derived as:

$$MTBF(P) = \frac{1}{\sum_{i=1}^{L} \dfrac{1}{MTBF(C_i)}} \tag{17}$$

Equation (17) is valid under the hypothesis of independent failures, which may not always be the case. Based on equation (17), for a total of L chips in a product, the effective mean time between failures corresponds to $1/L$ of the harmonic mean of the MTBFs of all chips, and the product's MTBF is dominated by the worst chip MTBF in the product. An example of the product MTBF is shown in Table 6.

TABLE 6 PRODUCT'S MEAN TIME BETWEEN FAILURE

Chip    Chip count  Single chip MTBF (Year)  MTBF per chip per product (Year)
chip A  1           3.19E+03                 3.19E+03
chip B  2           1.13E+03                 5.65E+02
chip C  3           5.65E+03                 1.88E+03
chip E  1           1.40E+03                 1.40E+03
total   7           product MTBF             3.00E+02

From Table 6, it is seen that the effective chip MTBF, MTBF(C), is reduced when more than one instance of a chip is used in a product.

If a total of K identical products are operating in the field, the total product mean time between failures, MTBF(TP), can be derived as:

$$MTBF(TP) = \frac{MTBF(P)}{K} \tag{18}$$

Assuming a total of 80 products delivered and operating in the field, the effective chip MTBF per total product is shown in Table 7.
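The chip level and product level figures in Tables 5 and 6 can be reproduced with a short script. This is a sketch under the independence hypothesis stated above, in which failure rates add. The helper name `combined_mtbf` and the dictionary layout are illustrative choices, while the counts and MTBF values are taken from Tables 5 and 6.

```python
def combined_mtbf(entries):
    """Combine independent failure sources: 1/MTBF_total = sum(count / MTBF_i).

    `entries` is an iterable of (count, mtbf_years) pairs.
    """
    return 1.0 / sum(count / mtbf for count, mtbf in entries)


# Table 5: (chains per clock domain, MTBF of a single chain in years).
TABLE_5 = {
    "clock A": (200, 3.19e6),
    "clock B": (100, 1.13e6),
    "clock C": (20, 5.65e3),
    "clock D": (100, 1.40e6),
    "clock E": (10, 1.40e6),
}

chip_mtbf = combined_mtbf(TABLE_5.values())

# Traditional estimate: worst single-chain MTBF divided by total chain count.
total_chains = sum(count for count, _ in TABLE_5.values())
worst_chain = min(mtbf for _, mtbf in TABLE_5.values())
traditional_mtbf = worst_chain / total_chains

# Table 6: (chip count per product, MTBF of a single chip in years).
TABLE_6 = {
    "chip A": (1, 3.19e3),
    "chip B": (2, 1.13e3),
    "chip C": (3, 5.65e3),
    "chip E": (1, 1.40e3),
}

product_mtbf = combined_mtbf(TABLE_6.values())

print(f"chip MTBF (Table 5):       {chip_mtbf:.0f} years")
print(f"traditional chip estimate: {traditional_mtbf:.0f} years")
print(f"product MTBF (Table 6):    {product_mtbf:.0f} years")
```

The script gives about 265 years for the chip, 13 years for the traditional estimate and 300 years for the product, matching the tables. The gap between 265 and 13 years arises because the traditional method treats all 430 chains as if each were as weak as the 20 chains in clock domain C.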

TABLE 7 PRODUCT'S MEAN TIME BETWEEN FAILURE

Chip    Chip count  Single chip MTBF (Year)  MTBF per chip per product (Year)  MTBF per total chip in total product (Year)
chip A  1           3.19E+03                 3.19E+03                          3.99E+01
chip B  2           1.13E+03                 5.65E+02                          7.06E+00
chip C  3           5.65E+03                 1.88E+03                          2.35E+01
chip E  1           1.40E+03                 1.40E+03                          1.75E+01
total   7           product MTBF             3.00E+02                          3.76E+00

From Table 7, it is observed that chip B has the worst chip level MTBF among the chips. The effective total product MTBF is merely 3.76 years for a total of 80 products shipped and operating in the field. This implies that, on average, one product in the field will experience a mysterious failure roughly every 3.76 years.

During the chip design phase, an IC supplier typically has visibility of the chip consumption in the targeted market segment as well as the required operating time in each market segment. The required operating time in various market segments is shown in Table 8.

TABLE 8 RELIABILITY REQUIREMENT FOR DIFFERENT MARKET SEGMENT

If each chip's MTBF, MTBF(C), is improved by a factor equal to the total count of that chip across all products, the resulting MTBF for the total product is shown in Table 9.

TABLE 9 IMPROVED PRODUCT'S MEAN TIME BETWEEN FAILURE

Chip    Chip count  Single chip MTBF (Year)  MTBF per chip per product (Year)  MTBF per total chip in total product (Year)
chip A  1           2.55E+05                 2.55E+05                          3.19E+03
chip B  2           1.81E+05                 9.04E+04                          1.13E+03
chip C  3           1.36E+06                 4.52E+05                          5.65E+03
chip E  1           1.12E+05                 1.12E+05                          1.40E+03
total   7           product MTBF             3.83E+04                          4.79E+02

From Table 9, it is observed that if each chip's MTBF(C) is improved by a factor equal to the total count of that chip across all products, the total product MTBF, MTBF(TP), improves to 479 years. Hence a product in the field will experience a mysterious failure only about once every 479 years. With the improved MTBF values, asynchronous transfer reliability is no longer a concern for the product.
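The fleet level numbers in Tables 7 and 9 follow from the same arithmetic. The sketch below assumes independent failures and K = 80 identical products, as in the text. The function names `product_mtbf` and `fleet_mtbf` are hypothetical, and the improvement factor is taken to be the chip count per product multiplied by K, which reproduces the single chip values listed in Table 9.

```python
def product_mtbf(chips):
    """Product MTBF in years from (count, single_chip_mtbf_years) pairs."""
    return 1.0 / sum(count / mtbf for count, mtbf in chips)


def fleet_mtbf(product_mtbf_years, products_in_field):
    """Total product MTBF for K identical products in the field."""
    return product_mtbf_years / products_in_field


K = 80  # products delivered and operating in the field

# Single chip MTBF values from Table 7: name -> (count per product, years).
CHIPS = {
    "chip A": (1, 3.19e3),
    "chip B": (2, 1.13e3),
    "chip C": (3, 5.65e3),
    "chip E": (1, 1.40e3),
}

baseline_fleet = fleet_mtbf(product_mtbf(CHIPS.values()), K)

# Table 9 scenario: each chip's MTBF is scaled by the total number of
# instances of that chip across all K products (count * K).
improved_chips = {
    name: (count, mtbf * count * K) for name, (count, mtbf) in CHIPS.items()
}
improved_fleet = fleet_mtbf(product_mtbf(improved_chips.values()), K)

print(f"baseline fleet MTBF: {baseline_fleet:.2f} years")
print(f"improved fleet MTBF: {improved_fleet:.0f} years")
```

The baseline evaluates to about 3.76 years and the improved case to about 479 years, in agreement with Tables 7 and 9. Scaling each chip's target by its total shipped count is one simple way to budget reliability, and a supplier could equally start from the required operating time per market segment and work backwards.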
VII. CONCLUSION

The key points on product level mean time between failures are:

- The chip level MTBF needs to account for the total number of chips shipped across all products to ensure that the product can operate reliably in the field.
- Deriving a single chip's MTBF from the worst synchronizer chain's MTBF divided by the total synchronizer chain count results in overdesign, with more metaharden flip-flops or more synchronizer stages than necessary, which degrades system performance.

ACKNOWLEDGEMENTS

Thanks to Lai Kok Keong and Massimo Verita for the support given.

REFERENCES

[1] D. Kinniment, K. Heron and G. Russell, Measuring Deep Metastability, ASYNC, 10 pp.-11, 2006.
[2] C. Dike and E. Burton, Miller and noise effects in synchronizing flip-flop, JSSC, 34(6):849-855, 1999.
[3] S. Beer, R. Ginosar, M. Priel, R. Dobkin and A. Kolodny, An on-chip metastability measurement circuit to characterize synchronization behavior in 65nm, ISCAS, pp. 2593-2596, 2011.
[4] D. Chen, D. Singh et al., A comprehensive approach to modeling, characterizing and optimizing for metastability in FPGAs, FPGA, 2010.
[5] L. Kleeman and A. Cantoni, Metastable behavior in Digital Systems, IEEE Design and Test of Computers, 4(6):4-19, 1987.
[6] C. Brown and K. Feher, Measuring metastability and its effect on communication signal processing systems, IEEE Transactions on Instrumentation and Measurement, 46(1), 1997.
[7] D. Kinniment, Synchronization and Arbitration in Digital Systems, Wiley, 2007.
[8] S. Beer, R. Ginosar et al., The Devolution of Synchronizers, ASYNC, 2010.
[9] Terrence Mak, Truncation Error Analysis of MTBF Computation for Multi-Latch Synchronizers, Elsevier, Microelectronics Journal, pp. 1-10, 2011.
[10] T.J. Gabara, G.J. Cyr and C.E. Stroud, Metastability of CMOS Master-Slave flip-flops, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, pp. 734-740, 1992.
[11] C. Myers, E. Mercer and H. Jacobson, Verifying synchronization strategies, in Formal Methods for Globally Asynchronous Locally Synchronous (GALS) Architecture, 2003.
[12] I.W. Jones, S. Yang and M. Greenstreet, Synchronizer Behavior and Analysis, ASYNC, pp. 117-126, 2009.