Evaluating BIST Architectures for Low Power

C.P. Ravikumar
Department of Electrical Engineering
Indian Institute of Technology
New Delhi 110016
rkumar@ee.iitd.ernet.in

N. Satya Prasad *
Cadence India
NEPZ, Noida, U.P., India
snitala@cadence.com

Abstract

The "system-on-chip" revolution has posed a number of new challenges to test engineers. We address the issue of high power dissipation during testing, which can reach levels beyond the safe upper limit associated with the chosen packaging technology. A study undertaken by Zorian reveals that test power can be 200% or more of the normal power. In the test mode, input vectors are normally applied in an uncorrelated manner, leading to an increase in the average Hamming distance between successive vectors. This implies larger switching activity and, for CMOS circuits, larger power dissipation. In this paper, we look at Built-in Self-Test architectures from the viewpoint of power dissipation, fault coverage, area, and test length. We report experimental results for a CORDIC chip. Our results indicate that BIST architectures differ significantly from one another in terms of power dissipation, giving the test designer an opportunity to address the problem of excessive heating during testing.

1 Introduction

Problem. Complementary MOS technology has been deemed a low-power technology, since power is dissipated in a CMOS circuit only when one or more nodes make a state transition. The static power in a CMOS circuit is due to the flow of reverse-bias current and the "short-circuit" current which flows when both the P and N parts of the circuit conduct simultaneously, e.g., during a state transition at an output node. As the number of devices packed into a single chip continues to follow Moore's law, it is doubtful whether CMOS will continue to be referred to as a "low-power" technology.
Microprocessors such as Digital's Alpha, Intel's Pentium, and Sun's UltraSPARC are known to dissipate power of the order of 30 to 50 Watts. Not only do packaging costs mount when the power dissipation of the chip increases, the reliability of the product is also adversely affected. The "system-on-chip" (SOC) revolution, which packs tens (or even hundreds) of millions of transistors on the same chip, poses new challenges to the design team, the packaging team, and the test team. The problem of increased power dissipation in SOC designs takes a turn for the worse when it comes to testing, as explained below. Circuit designers have taken advantage of the fact that the primary inputs to a circuit come from a non-uniform distribution [1]. For example, in a speech signal processing circuit, the input vectors behave in a predictable manner, with the least significant bits more likely to change than the most significant bits. The transition probability of a signal line is the probability that the signal will make a transition from logic 0 to logic 1 (or vice versa). If the primary input transition probabilities are known a priori, the circuit design can be optimized to reduce the average transition probability of the internal nodes. Similarly, technology mapping, placement, and routing can be carried out to reduce the effective switched capacitance S, S = Σ_n C_n T_n, where C_n is the capacitance of node n in the physical realization of the circuit and T_n is the transition probability of node n. For a fixed clock frequency and supply voltage, S is a measure of the dynamic power dissipation in a CMOS circuit. In anything other than functional testing, input vectors are applied in an uncorrelated manner, i.e., there is no definite relationship between one vector and the next.

* This work was carried out when the second author was an M.Tech student in the Department of Electrical Engineering, IIT Delhi.
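The effective switched capacitance defined above is easy to prototype in software. The helper below is a sketch; the node names, capacitance values, and transition probabilities are hypothetical numbers chosen for illustration, not data from the paper.

```python
# Sketch: effective switched capacitance S = sum_n C_n * T_n.
def switched_capacitance(cap, trans_prob):
    """cap: node -> C_n (fF); trans_prob: node -> T_n (transitions/cycle)."""
    return sum(cap[n] * trans_prob[n] for n in cap)

# Hypothetical three-node circuit: the LSB-like node toggles most often.
cap = {"n1": 10.0, "n2": 25.0, "n3": 5.0}   # node capacitances, fF
tp  = {"n1": 0.50, "n2": 0.10, "n3": 0.25}  # transition probabilities

S = switched_capacitance(cap, tp)
print(S)  # 10*0.5 + 25*0.1 + 5*0.25 = 8.75
```

For a fixed supply voltage Vdd and clock frequency f, dynamic power scales as 0.5 * S * Vdd^2 * f, which is why reducing the average transition probability of internal nodes reduces power.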
For example, in Built-in Self-Test, which is a predominant test strategy in Application-Specific System-On-Chip (ASSOC) designs, test vectors are applied using a pseudo-random test pattern generator. Even when an ATPG+Scan combination is employed, there is no definite correlation between successive test vectors. In summary, the Hamming distance between two successive test vectors is expected to be larger than that between two successive input vectors during the normal mode of operation. The increased switching activity during testing can result in a much larger power dissipation. In a study undertaken by Zorian [13], he observed that test power can be up to 200% larger than normal-mode power. The painstaking power optimization that the circuit designer undertakes may become meaningless if the package selection must be carried out on the basis of test power. Conversely, if the package selection was based on the power estimate of the circuit designer, then the circuit is very likely to burn out when it is tested.

Solutions. One straightforward solution to the above problem is to lower the test clock frequency. This would be an acceptable solution if the fault model were limited to static faults such as stuck-at faults. Delay testing, which is a necessity for high-performance circuits, must be carried out at the normal clock speed. Even for DC testing, reducing the test clock frequency by a factor of k amounts to increasing the total test time by a factor of k. Chou, Saluja and Agrawal [2] suggested a way of scheduling tests in a BIST environment. The idea is to reduce the amount of concurrency in testing. If there are N sub-circuits to be tested, we schedule their testing such that at most N_safe sub-circuits are concurrently tested, where N_safe is dictated by the selected package type. This solution is actually practiced, but has the drawback that it increases the total test time. Wang and Gupta [11] suggested a technique to reduce heat dissipation during testing in full-scan circuits. It exploits the don't cares (DCs) during scan shifting, test application, and response capture to reduce the overall switching in the CUT. During scan shifting, the DCs are used to block the gates that may cause transitions during shifting. The DCs at the inputs are assigned 0/1 values so as to minimize transitions. The authors suggested a technique to generate vectors that have a large number of DCs. The above techniques address the problem at the algorithmic level; in this paper, our interest is to explore architectural-level solutions.
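The claim that uncorrelated test vectors raise the average Hamming distance between successive vectors can be illustrated with a quick experiment. The 16-bit width and the counter-style "functional" stream below are assumptions chosen only to make the contrast visible.

```python
# Sketch: average Hamming distance between successive vectors for a
# correlated (counter-like) stream vs. uncorrelated random vectors.
import random

def avg_hamming(vectors):
    """Mean number of bit flips between consecutive vectors."""
    pairs = list(zip(vectors, vectors[1:]))
    return sum(bin(a ^ b).count("1") for a, b in pairs) / len(pairs)

random.seed(0)
WIDTH = 16
functional = list(range(1000))                           # correlated inputs
test = [random.getrandbits(WIDTH) for _ in range(1000)]  # uncorrelated vectors

# A binary counter averages just under 2 bit flips per step, while random
# 16-bit vectors average about WIDTH/2 = 8 flips per step.
print(avg_hamming(functional), avg_hamming(test))
```

Roughly a fourfold increase in flips per vector translates directly into higher switching activity, and hence higher dynamic power, in test mode.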
To the best of our knowledge, ours is the first such attempt. We restrict ourselves to BIST architectures. It is well known that the choice of the BIST architecture influences the fault coverage, test application time, and the area overhead. Our experiments show that the test architecture can also influence the power dissipation significantly. In the past, the choice of test architecture has been influenced mainly by fault coverage [3, 6]. In the Intel 80386 processor, a BIST architecture resulted in 2^19 test vectors [3]. Nozuyama et al. report test lengths of 2^31 for the TX1 processor [6]. In each of the above examples, the test lasted a few seconds. We believe that the choice of the test architecture for the future generation of ASSOCs will be governed by the power dissipation during test mode. We suggest that the product of area overhead, switching activity, and test time be used as a measure of testability of a circuit. We refer to this measure as the AST measure. Section 2 describes some terminology and a few typical Built-in Self-Test architectures used in modern practice. A comparative evaluation of these architectures is presented in Section 3. We performed the comparison on the example of a CORDIC computer [9], which is useful in evaluating trigonometric functions in a fast and accurate manner. Our experimental results are given in Section 4 and conclusions are presented in Section 5.

2 BIST Architectures

BIST relies on the addition of two components, a pattern generator (PG) and a response analyzer (RA). Input patterns generated by the PG are applied to the circuit under test (CUT) and the responses of the CUT are compressed by the RA. Typically, a linear feedback shift register (LFSR) is used both as a PG and as an RA. An n-bit LFSR consists of n D-type flip-flops connected in the form of a shift-right register. The D-input to the left-most flip-flop is of the form

D_1 = C_1 Q_1 ⊕ C_2 Q_2 ⊕ ... ⊕ C_n Q_n    (1)

where Q_i is the output of flip-flop i and C_i is a 0/1 variable.
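The feedback rule of equation (1) can be prototyped in a few lines of software. The sketch below simulates a shift-right LFSR; the 4-bit size, tap positions, and seed are illustrative choices (not the registers used in the paper), with the taps picked to correspond to a primitive polynomial.

```python
# Sketch: software model of a shift-right LFSR per equation (1).
def lfsr_states(seed, taps, n, count):
    """Each clock: feedback = XOR of tapped outputs Q_t; all stages shift
    right; the feedback bit enters the left-most flip-flop."""
    state, out = seed, []
    for _ in range(count):
        out.append(state)
        fb = 0
        for t in taps:                       # XOR of the tapped bits C_t*Q_t
            fb ^= (state >> (t - 1)) & 1
        state = (fb << (n - 1)) | (state >> 1)
    return out

# 4-bit LFSR with taps at stages 1 and 2 (a primitive-polynomial choice),
# non-zero seed 0001.
seq = lfsr_states(seed=0b0001, taps=(1, 2), n=4, count=20)
period = seq.index(seq[0], 1)
print(period)  # 15, the maximal period 2^4 - 1
```

With a primitive characteristic polynomial and any non-zero seed, the sequence cycles through all 2^n - 1 non-zero states, which is why the test length is limited by 2^n.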
The polynomial P(x) = 1 + C_1 x + C_2 x^2 + C_3 x^3 + ... + C_n x^n defines the values of C_i and is known as the characteristic polynomial of the LFSR. The polynomial P*(x) = 1 + C_{n-1} x + C_{n-2} x^2 + ... + C_1 x^{n-1} + x^n is called the reciprocal of P(x). An LFSR which uses P*(x) as its characteristic polynomial is known as the reciprocal LFSR of an LFSR whose characteristic polynomial is P(x). The initial content of the LFSR is called the seed. Depending on the characteristic polynomial and the seed, the LFSR generates an n-bit pseudo-random test sequence at the outputs (Q_1 Q_2 ... Q_n). The length of the pseudo-random test sequence (before the sequence repeats itself) is limited by 2^n. The length of the pseudo-random test sequence governs the test application time as well as the achievable fault coverage. There are several variations of the LFSR, e.g., to (a) produce a biased distribution of 0's and 1's in the output [12], (b) load different seeds at intermediate points [7], and so on. In an LFSR-SR approach, an n_1-bit LFSR is cascaded with an n_2-bit shift register, i.e., the output Q_{n_1} of the LFSR is connected as the input D_1 of the shift register. A CUT is called an (n, w) CUT if it has n primary inputs and w is the number of input variables on which any output of the CUT is dependent. When w ≤ n/2, an LFSR-SR can exhaustively test the CUT. For w > n/2, a condensed LFSR [10] can be chosen. An LFSR can be used as a response analyzer by integrating the output(s) of the CUT. The content of the response analyzer at the end of the application of the pseudo-random test sequence is known as the signature of the CUT. More
often than not, the signature of a faulty CUT is different from that of a fault-free CUT. Aliasing is said to have resulted when the faulty signature is identical to the signature of the fault-free CUT. The probability of aliasing depends on the size of the LFSR, the characteristic polynomial, and the seed. In the CORDIC example of Section 4, the CUT is either a 12-bit carry lookahead adder (n = 24) or a 12-bit redundant binary adder (n = 48).

Test Architectures. We considered three test architectures for comparison. The first scheme uses an LFSR to exhaustively test the CUT; see Figure 1(a). A primitive polynomial is used as the characteristic polynomial of the LFSR. For an (n, w) CUT with m outputs, we require an n-bit LFSR as a pattern generator and an m-bit LFSR as a signature register. Exhaustive testing of an n-input CUT requires us to apply 2^n patterns. Since this can be a very large number, we limit the test length to a value which gives adequate fault coverage. The fault coverage is estimated through fault simulation. When we truncate the test sequence before it reaches a length of 2^n, the seed will play a role in determining the fault coverage and the power dissipation.

The second architecture, shown in Figure 1(b), uses two LFSRs of size n/2 to test an n-input CUT. We assume n is a multiple of 2. The characteristic polynomials of the LFSRs are reciprocals of one another. The scheme can be generalized to use 2k LFSRs, each of which is n/2k bits. The apparent advantage of partitioning a large LFSR into smaller ones is to increase the amount of "randomness" in the n-bit patterns. A smaller number of test patterns in this scheme may suffice to obtain a specified fault coverage. The same seed is loaded in both LFSRs. Test generation is terminated when the desired fault coverage has been obtained. Once again, the seed is likely to influence the test length for a specified fault coverage.

Test architecture 3 is shown in Figure 1(c) and is based on the LFSR-SR scheme for pattern generation. A k-bit LFSR is cascaded with an (n - k)-bit shift register. The scheme can be generalized to the use of r LFSR-SR combinations, each of size n/r. We shall compare the above three BIST architectures for an example circuit, namely, the CORDIC computer [9]. We shall consider two variations of the CORDIC, namely, CORDIC-CLA and CORDIC-RBA, which make use of two different types of adders. These details are given in Section 4.

Figure 1: BIST Architectures

3 Comparative Evaluation

In order to compare the three test architectures, we must measure the fault coverage, test length, test area overhead, and the power dissipated during test mode. We used fault coverage as a constraint and the AST product (see Section 1) as the measure of testability. Two specified fault coverage values were employed in our experiments.

We generated behavioral and structural VHDL models of the components. The behavioral model was used to verify the functionality of the circuit components. The Synopsys Design Compiler tool was used to compile the structural VHDL descriptions into layouts using 0.8 μm gate array libraries available from VTI. For all three test architectures mentioned in Section 2, we used the Built-in Logic Block Observer (BILBO) [5] to realize the PRPG and MISR blocks. The BILBO can be reconfigured to function as either a PRPG or an MISR. The CORDIC computer which we considered as an example uses 12-bit precision. Thus, we realized a 12-bit carry look-ahead adder and a 12-bit redundant binary adder. The other data path elements which we needed were barrel shifters, parallel-in parallel-out registers, and shift registers. Whenever possible, the registers of the data path were reconfigured as BILBOs to reduce the test area overhead.

We had access to the Verifault fault simulator. Structural descriptions of the components were implemented in Verilog HDL to make use of the Verifault tool. For each test architecture, the PRPGs were initialized using a seed value and simulated to obtain a large trace of test vectors. The structural models of the data path elements (such as the adders) were given as input to the fault simulator. We stopped the simulation after a specified fault coverage was obtained. The power dissipation in any component was estimated by an 8-valued logic simulation program which we implemented for this purpose. The logic values supported in the
simulation are 0, 1, a 0-to-1 transition, a 1-to-0 transition, a 0-1-0 glitch, a 1-0-1 glitch, a 0-1-0-1 glitch, and a 1-0-1-0 glitch. We implemented a compiled-code simulator. The switching activity measured by the program is a pessimistic estimate of the actual value, since we assume that glitches will occur whenever they are likely to occur. For instance, if one input of a 2-input AND gate is 1 and the other carries a 0-1-0 glitch, then we assume that a 0-1-0 glitch will occur at the output. Thus our simulator may overestimate the glitch power. Another approximation in our simulator is that we limit the glitch patterns to only four patterns. A 0-1-0-1-0 glitch will be treated as a 0-1-0 glitch, thus underestimating the number of 0-1 transitions. A more accurate simulation would consider gate and interconnect delays to predict whether glitching will actually occur and the pattern of the glitch.

4 Results

CORDIC Computer. Volder [9] introduced the CORDIC technique to compute the sine and cosine of an angle accurately without requiring a multiplication operation. The CORDIC core is shown in Figure 2. The essential computation in CORDIC is

X_{i+1} = X_i - d_i (Y_i >> i)    (2)
Y_{i+1} = Y_i + d_i (X_i >> i)    (3)

where >> i stands for "rotate right by i bits" and d_i = ±1 selects the direction of rotation. The adder used in the data path is either a carry look-ahead adder (CLA) or a redundant binary adder (RBA). In the latter case, the X and Y registers hold the intermediate values in redundant binary form [8]. Redundant binary representation uses three logic values, namely, 0, 1, and -1. A CORDIC algorithm based on redundant binary arithmetic executes faster, since the time to add two redundant binary numbers is constant. The CLA takes O(log2 n) average time to add two n-bit numbers due to carry propagation. There is no carry propagation in an RBA. An n-input CLA is a (2n + 1, 2n + 1) CUT, since the output carry depends on both n-bit numbers and the input carry.

Table 1: Comparison of CLA and RBA

                          CLA      RBA
Area                      459.70   1657.2
Critical path delay (ns)  4.96     4.81
Detectable faults         456      1148

Figure 2: CORDIC Computer
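The iterations (2)-(3) can be prototyped in a few lines. The floating-point version below is a behavioral sketch only: the chip uses a 12-bit fixed-point datapath with barrel shifters, whereas here shifts become multiplications by 2^-i, and the standard CORDIC scale factor K is folded into the initial X value.

```python
# Behavioral sketch of the CORDIC rotation iterations (2)-(3).
import math

def cordic_sin_cos(theta, iters=12):
    # Pre-compute the accumulated CORDIC gain and compensate up front.
    K = 1.0
    for i in range(iters):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = K, 0.0, theta
    for i in range(iters):
        d = 1.0 if z >= 0 else -1.0            # d_i steers rotation toward z = 0
        x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
        z -= d * math.atan(2.0 ** (-i))        # angle-table entry atan(2^-i)
    return y, x                                # (sin(theta), cos(theta))

s, c = cordic_sin_cos(math.pi / 6)
print(s, c)  # close to 0.5 and 0.866
```

Twelve iterations mirror the 12-bit precision of the example datapath; each extra iteration adds roughly one bit of accuracy.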
The n-bit RBA, on the other hand, is a (4n, 6) CUT. The influence of the RBA on power dissipation is not obvious. Since an RBA requires a data path twice as wide as that of a CLA, one would expect the power to double as well. But since there is no carry propagation, the average switching activity in an RBA is much smaller than in a CLA. In a separate study, we have shown that the power-delay product of a redundant CORDIC is smaller than that of a CLA-CORDIC implementation [4].

Test Plan. The X and Y registers of the CORDIC datapath (Figure 2) are implemented as BILBO registers. During test mode, these registers are configured as pseudo-random pattern generators. The amount of shift is set to 0 in both barrel shifters. The registers XNEW and YNEW are configured as signature registers. One adder is tested during one test session. Table 1 shows the relative comparison of a 12-bit CLA and a 12-bit RBA. The RBA occupies about four times the area of a CLA and has a fault set which is about three times as large as that of a CLA. The critical path delay of an RBA, however, is smaller. In fact, the critical path of an RBA of arbitrary size will be the same as that of the RBA considered in Table 1, whereas the critical path of a larger CLA will be larger. The results of the comparison of test architectures for CORDIC-CLA are given in Table 2. For each test architecture, we report the area overhead (A), switching activity (SA), test length (TL), and AST metric for two specified values of fault coverage (FC). The AST metric is scaled by a factor of 10^8. Similar results are reported for CORDIC-RBA in Table 3. When area overhead is the point of comparison, Architecture 3 is superior, since a shift register occupies a smaller area than a similarly sized BILBO. However, this architecture could not achieve the higher fault coverage in the case of CORDIC-CLA.
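The AST metric used in Tables 2 and 3 is a simple product. The sketch below shows the computation; the architecture numbers are illustrative placeholders, not measurements from the paper, and the 10^8 scaling matches the tables.

```python
# Sketch: AST = area overhead x switching activity x test length,
# scaled by 10^8 as in Tables 2 and 3.
def ast_metric(area, switching_activity, test_length, scale=1e8):
    return area * switching_activity * test_length / scale

# Two hypothetical architectures that meet the same fault coverage:
arch_a = ast_metric(area=660, switching_activity=11024, test_length=120)
arch_b = ast_metric(area=670, switching_activity=149505, test_length=2037)

# With fault coverage held as a constraint, the lower-AST architecture
# is preferred.
print(arch_a < arch_b)  # True
```

Because the three factors multiply, a test architecture that trades a small area increase for a large drop in switching activity or test length still wins on AST.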
Thus, in both datapaths, Architecture 1 offers a lower AST value than Architecture 2 and may be considered superior when high fault coverage is required. We may also notice that the RBA circuit offers a lower value of AST. There is a marked difference between the switching activities and test lengths of the RBA and CLA circuits for the same test architecture, implying that the RBA circuit will dissipate less power during testing. This highlights the importance of selecting a suitable datapath architecture when the test power is an important consideration. The importance of selecting a test architecture for reducing test power is evident from the fact that the switching activity can increase 14-fold from one test architecture to another (compare the switching activity for Architecture 2 with that for Architecture 1).

Table 2: BIST Architectures for CORDIC-CLA. Each architecture is reported for two seeds (Sd); in each cell, the paired entries correspond to the two specified fault coverage (FC) values. SA = switching activity, TL = test length; AST is scaled by 10^8.

                FC              SA              TL           AST
arch 1, Sd 1    -               11024 / 3857    120 / 40     8.73 / 1.01
arch 1, Sd 2    -               1.5M / 15342    1446 / 1420  1 / 15.4
arch 2, Sd 1    -               3330 / 2787     36 / 31      0.81 / 0.58
arch 2, Sd 2    -               2.1M / 149505   2037 / 1286  2880 / 1290
arch 3, Sd 1    -               4334 / 2030     46 / 32      0.95 / 0.31
arch 3, Sd 2    99.8% / 99.8%   15324 / 16487   146 / 173    10.7 / 13.7

5 Conclusions

In this paper, our aim has been to study the effect of the BIST architecture on the test power. Test architectures have been compared on the grounds of fault coverage, test length, and test area overhead. We suggest that the area × switching activity × test time product may be used as a metric to compare test architectures, with fault coverage as a constraint. We selected three test architectures and compared their AST metric for two different datapaths, two seeds, and two fault coverages. Our results indicate that no architecture is universally superior. We have shown that the test power can vary by as much as 14 times across test architectures. Similarly, we have highlighted the fact that the choice of the datapath for minimizing test power is an important problem. This work needs to be carried further by considering other test architectures and datapath examples to further study their influence on the AST metric.

References

[1] A.P. Chandrakasan and R.W. Brodersen. Low Power Digital CMOS Design. Kluwer Academic Publishers, 1995.

[2] R.M. Chou, K.K. Saluja, and V.D. Agrawal. Scheduling tests for VLSI systems under power constraints.
IEEE Transactions on VLSI Systems, 5(2), June 1997.

[3] P. Gelsinger. Design and test of the 80386. IEEE Design and Test of Computers, pages 42-50, June 1987.

[4] Bindu John and C.P. Ravikumar. Low power CORDIC realization using redundant binary arithmetic. Manuscript, 1997.

[5] B. Koenemann, J. Mucha, and G. Zwiehoff. Built-in logic block observation techniques. In Proceedings of the International Test Conference, 1979.

[6] Y. Nozuyama, A. Nashimura, and J. Iwamura. Design for testability of a 32-bit microprocessor, the TX1. In Proceedings of the International Test Conference, pages 172-182, 1988.

[7] J. Savir and W.H. McAnney. A multiple seed linear feedback shift register. IEEE Transactions on Computers, C-41(2):250-252, 1992.

[8] N. Takagi et al. Redundant CORDIC chip with a constant scale factor for sine and cosine calculations. IEEE Transactions on Computers, C-40(9), 1991.

[9] J.E. Volder. The CORDIC trigonometric computing technique. IRE Transactions on Electronic Computers, EC-8:330-334, 1959.

[10] L.T. Wang and E.J. McCluskey. Condensed linear feedback shift register testing - a pseudoexhaustive test technique. IEEE Transactions on Computers, pages 367-369, 1986.

[11] S. Wang and S.K. Gupta. ATPG for power dissipation minimization during scan testing. In Proceedings of the Design Automation Conference, pages 614-619, 1997.

[12] H.-J. Wunderlich. Multiple distributions of biased random test patterns. In Proceedings of the International Test Conference, pages 236-244, 1988.

[13] Y. Zorian. A distributed BIST control scheme for complex VLSI devices. In Proceedings of the 11th IEEE VLSI Test Symposium, pages 4-9, 1993.

Table 3: BIST Architectures for CORDIC-RBA. Each architecture is reported for two seeds (Sd); paired entries correspond to the two specified fault coverage values. SA = switching activity, TL = test length; AST is scaled by 10^8.

                SA              TL          AST
arch 1, Sd 1    25717 / 18791   95 / 83     32.7 / 20.8
arch 1, Sd 2    30250 / 080     108 / 183   45.0 / 127.0
arch 2, Sd 1    16974 / 21183   72 / 80     16.00 / 22.90
arch 2, Sd 2    90268 / 58964   276 / 186   337.0 / 148.0
arch 3, Sd 1    39976 / 34769   132 / 114   43.7 / 32.8
arch 3, Sd 2    69629 / 58083   212 / 173   122.0 / 83.2