Modular redundancy without voters decreases complexity of restoring organ

by P. T. DESOUSA
Rockwell International
Richardson, Texas
and
F. P. MATHUR
Wayne State University
Detroit, Michigan

ABSTRACT

Fault-tolerant modules have usually been implemented through static fault-masking or dynamic spare-switching. A new class of Modular Redundancy (MR), the Responsive schemes, promises higher reliability levels and more efficient implementations for medium to high degrees of redundancy. In particular, Siftout Modular Redundancy (SMR) does not use voters and provides 2-out-of-N redundancy with a very simple restoring organ. The complexity of implementation is analyzed for several MR's, and reliability figures are compared for three 2-out-of-N schemes. SMR is shown to have the best performance.

INTRODUCTION

Fault-tolerant digital systems are usually achieved by the use of Modular Redundancy (MR) techniques. The module to be made fault-tolerant is replicated a number of times. Each one of the replicas will be called a channel. The number of channels is the degree of redundancy. These identical channels constitute the "executive organ." The "restoring organ" is made up of the additional circuits necessary to perform the functions of fault-masking and/or recovery over the executive channels.

Fault-masking and spare-switching are the two best known fault tolerance techniques. Static MR provides fault-masking. All channels are on-line throughout the mission time. The failure of a channel is "masked" by the good channels, keeping the overall structure output correct. Selective MR provides spare-switching. There is a functional core of on-line channels and a standby bank of spare channels. Whenever an on-line channel fails, one of the spare channels replaces it.

Responsive MR schemes do not quite fit either of these two categories. There are no spare channels and all channels start the mission on-line.
But upon the occurrence of a failure, the structure reconfigures itself. The contribution of the failed channel is reduced or eliminated. Pierce [1] was the first to propose a scheme of this type, using an adaptive restoring organ. The system output is a weighted vote of the channel outputs. The weight depends on the error probability of the corresponding channel. A weighted-input vote-taker is implemented with linear threshold elements that perform "linearly separable Boolean functions" [2]. Adaptation circuits estimate the error probability of each channel and use the estimate to set the vote-weight. The threshold vote-taker and the adaptation circuitry make up the restoring organ, called "decision element" by Pierce.

Three questions arise:

(1) What vote-weight to use?
    (a) Continuous. In proportion to the error probabilities;
    (b) Quantized. The vote-weight is either 0 (the channel is disconnected) or 1 (the channel is connected).
(2) How to estimate the error probability?
    (a) Using reliability information generated in the same source that generates the digital information;
    (b) Counting the errors that occurred in a time cycle and setting the weight accordingly;
    (c) Counting the errors periodically and incrementing the previous cycle data;
    (d) (For quantized vote-weights) Disconnecting a channel whenever the error count exceeds a given threshold.

National Computer Conference, 1977

(3) How to detect errors?
    (a) From conditions in the channel itself. Correctly functioning circuits can be arranged to display properties different from those of circuits with faults;
    (b) By comparison with an externally supplied correct answer;
    (c) By comparison with the output of the restoring organ (feedback of information). The output of the restoring organ is assumed to be correct.

Pierce analyzed all these alternatives. Answers 1(b), 2(d) and 3(c) have been the most appealing. Goldberg, et al. [3] and Losq [4] designed implementations for schemes with those answers. Alternative 3(b) is used, for example, in the model-assisted BMR of Devaney [5]. Siftout MR [6] answers question 3 with a new alternative:

    (d) By comparing the outputs of the channels with one another.

Adaptive redundancy uses a threshold rule in the restoring organ. Other Responsive configurations (Monitored Majority structures) use a majority rule. NMR/Simplex schemes [7] are examples of Monitored Majority Redundancy. Siftout MR does not use a vote-taker. The restoring organ merely discards any channel that does not agree with the majority. The module output is thus equal to the output of any of the channels that remain on-line.

SIFTOUT MODULAR REDUNDANCY (SMR)

When using a Siftout configuration the system is organized into N identical channels, where N is any integer. The channels are synchronized with one another and perform simultaneous operations. Each channel is active as long as it is fault-free. Whenever one of the channels fails, its contribution to the system output stops. The system becomes an (N-1) redundancy scheme. Upon the occurrence of a new failure, the process repeats itself.

SMR has a fault tolerance F=N-2: (N-2) channels can fail and the module will still operate correctly. When the module is reduced to two channels and one of them fails, the system is unable to detect which one failed. SMR is a 2-out-of-N structure or, more emphatically, an N-down-to-two redundancy.

SMR IMPLEMENTATION

To implement a Siftout redundant structure, a Checking Unit is placed at the outputs of the N channels. The Checking Unit compares the output signals. If one of the signals disagrees with the others, the corresponding channel is "sifted out." The signal of the "good" channels is selected without need for voting. The diagram of Figure 1 shows the main parts of the Checking Unit: the Comparator, the Detector and the Collector.

Figure 1-Siftout redundancy

The Comparator is a set of $\binom{N}{2}$ EXCLUSIVE OR gates that checks the N channels against each other. It is depicted in Figure 2 for the case of N=4.

Figure 2-Comparator for a 4-channel siftout redundancy

The Detector (Figure 3) is a sequential circuit with $\binom{N}{2}$ OR gates and N AND gates. The signal $F_i$ is equal to 0 when channel i is fault-free and equal to 1 when channel i has failed. For example, let channel 1 be the first channel to fail. It will disagree with the other channels, causing lines $E_{12}$, $E_{13}$ and $E_{14}$ to hold a logical value 1. Line $F_1$ will then be set to 1 and the feedback loop will force it to stay that

way. A flip-flop can be added in the feedback loop if a reset-retry procedure is desired. Such a flip-flop would make the structure tolerant to transient failures and would facilitate initial checkout.

Figure 3-Detector for a 4-channel siftout redundancy

The final step is the Collector, with N OR gates and one AND gate. Each good channel feeds one input to the AND gate. Each bad channel provides a logical value 1 as input to the AND gate. The output of the AND gate is the correct output of the system, provided that at least two channels are good. Figure 4 shows the Collector when N=4.

Figure 4-Collector for a 4-channel siftout redundancy

OTHER MR TECHNIQUES

SMR is now compared with other redundancy techniques that have been used to provide ultra-reliable digital systems.

Static MR

TMR (Triple Modular Redundancy)

In the basic TMR configuration, the system is organized into three identical channels that feed a voting element. The voting element compares the output signals of the three channels and selects the signal on which the majority of the channels agree. The TMR organization is one of the oldest forms of redundancy and has been considered the most promising for universal application [8].

However, the process that makes TMR fault-tolerant also makes it difficult to maintain. To analyze the performance of a malfunctioning system, error detection and fault isolation are necessary. The TMR majority voting mechanism masks a bad channel but at the same time complicates the detection of the error. To overcome this difficulty, extra hardware has been incorporated into TMR-organized computer systems [8, 9].

A Siftout configuration with three channels (Figure 5) has the same fault tolerance as a TMR configuration. It already has the built-in capability of automatic error detection and fault isolation.
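The Comparator, Detector and Collector described above can be sketched as a small bit-level simulation. This is only an illustrative sketch: the function names are hypothetical, channels are numbered from 0 rather than 1, and the Detector wiring (one OR gate per pair combining $E_{ij}$, $F_i$ and $F_j$, and one AND gate per channel) is an assumption chosen to match the stated gate counts and the latching behavior, not a transcription of Figure 3.

```python
from itertools import combinations

def comparator(outputs):
    """One XOR per channel pair: E[(i, j)] is 1 when channels i and j disagree."""
    return {(i, j): outputs[i] ^ outputs[j]
            for i, j in combinations(range(len(outputs)), 2)}

def detector(E, F):
    """Update the fault flags F from the disagreement lines E.

    Assumed wiring: one OR gate per pair (E_ij OR F_i OR F_j) and one AND
    gate per channel over that channel's pair lines. Feeding F back into
    the OR gates latches a flag once set and removes a failed channel
    from the comparisons made on the remaining channels.
    """
    n = len(F)
    pair_or = {(i, j): e | F[i] | F[j] for (i, j), e in E.items()}
    return [int(all(pair_or[(min(i, j), max(i, j))] for j in range(n) if j != i))
            for i in range(n)]

def collector(outputs, F):
    """N OR gates (channel output OR fault flag) feeding a single AND gate."""
    return int(all(o | f for o, f in zip(outputs, F)))

def smr_step(outputs, F):
    """One clock step: returns (module output, updated fault flags)."""
    F = detector(comparator(outputs), F)
    return collector(outputs, F), F

# Four channels; channel 0 fails while the correct output bit is 1.
F = [0, 0, 0, 0]
out, F = smr_step([0, 1, 1, 1], F)
print(out, F)  # prints: 1 [1, 0, 0, 0]
```

Under these assumptions a flagged channel stays flagged on later steps, and the module output follows the channels that remain on-line without any voting.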
The value of the variables $F_i$ provides immediate information about the state of channel i ("good" if $F_i=0$, "bad" if $F_i=1$). This is an important advantage when redundancy is considered for easing maintenance operations and improving availability.

Figure 5-Siftout redundancy with three channels

NMR (N-tuple modular redundancy)

In an NMR system each nonredundant module is replicated an odd number (N) of times. The N identical channels feed a majority voting element. The structure works as long as a majority of the channels is fault-free. The fault tolerance of an NMR configuration is only F=(N-1)/2, whereas the fault tolerance of a Siftout configuration with the same number of channels is F=N-2. Comparing the NMR voting unit with the Siftout checking unit, the voter is found to be less complex than the checker for small values of N, but the situation inverts as N increases (see Table I). In addition, NMR has the same disadvantages as TMR, of which NMR is a generalization. However, NMR

can mask some multiple failures, while SMR requires that no more than one channel fail at a time. If the channels have been dormant long enough for several failures to have developed, then several of the signals may be erroneous when the system becomes active. Under this circumstance, voting becomes valuable for systems with fivefold or higher redundancy.

TABLE I-Equivalent Number of Gates for Restoring Organs

                HMR           HMR           HMR
                Majority      Majority      Threshold
                Voter,        Voter,        Voter,        Self-Purging   Siftout
 N     NMR      TMR Core      5MR Core      TMR Core      MR             MR
 3       4        -             -             -             34             13
 4       -       71             -            57             47             21
 5      13       91             -            80             63             31
 6       -      111           144           102             78             43
 7      41      131           172           126             95             57
 8       -      151           200           151            113             73
 9     145      171           228           177            132             93

Selective MR

HMR (hybrid modular redundancy)

Hybrid redundancy has been developed as a means to achieve greater reliability and longer times of failure-free operation than those achieved by TMR or NMR systems [10]. It consists of an NMR core and S standby spare channels. The restoring organ includes, besides the NMR vote-taker, a disagreement detector and a switching network. If the disagreement detector finds that the output of a channel in the NMR core does not match the output of the vote-taker, the switching network replaces that channel with one of the standby channels.

Hybrid redundant systems have the advantages of NMR systems (instant internal fault-masking) and of Standby systems (increased reliability for long missions). They yield a more efficient hardware utilization than NMR systems, due to a greater fault tolerance. The implementation of the restoring organ of a Hybrid system is not straightforward and requires a fairly complicated switch [11].

Siftout Redundancy appears as a real challenger. Its fault tolerance is as high as or higher than that of Hybrid Redundancy, and its implementation is simpler. HMR is able to use dormant spare channels, but that capability does not provide a significant increase in reliability.
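The hybrid mechanism described above (vote, detect disagreement, switch in a spare) can be illustrated with a short behavioral sketch. It is not the iterative cell switch of [11]; the function name `hmr_step`, the channel identifiers and the first-spare-first replacement order are all assumptions made for illustration.

```python
from collections import Counter

def hmr_step(core, spares, outputs):
    """Return (voted output, new core, remaining spares) for one step.

    core    -- ids of the on-line channels feeding the voter (odd count)
    spares  -- ids of the standby channels, in assumed replacement order
    outputs -- dict mapping each on-line channel id to its output bit
    """
    # Majority vote-taker over the core channels.
    voted = Counter(outputs[c] for c in core).most_common(1)[0][0]
    remaining = list(spares)
    new_core = []
    for c in core:
        # Disagreement detector: compare each core channel with the vote.
        if outputs[c] != voted and remaining:
            new_core.append(remaining.pop(0))  # switching network: spare in
        else:
            new_core.append(c)                 # kept (merely masked if faulty)
    return voted, new_core, remaining

# TMR core with two spares; channel 1 disagrees with the vote.
voted, core, spares = hmr_step([0, 1, 2], [3, 4], {0: 1, 1: 0, 2: 1})
print(voted, core, spares)  # prints: 1 [0, 3, 2] [4]
```

Once the spares are exhausted, the sketch falls back to plain fault-masking by the core voter.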
Responsive MR

Self-purging redundancy

In Self-Purging MR [4], there are N on-line channels feeding a threshold vote-taker with threshold equal to 2. The vote-taker output is compared with the channel outputs for disagreement detection. If a disagreement is detected, the faulty channel output is forced to a logical zero. Like Siftout MR, this is a 2-out-of-N strategy. The Self-Purging restoring organ requires N flip-flops, which substantially increase its complexity. However, these flip-flops can also be used to handle transient errors and for restart procedures.

COMPLEXITY OF RESTORING ORGANS

The complexity of the restoring organs for the MR's discussed in the paper is displayed in Table I. The following assumptions were used:

(a) A gate is considered to be any one of the following logic functions: AND, OR, NAND, NOR, Exclusive OR, and Inverter [12].
(b) A J-K or R-S flip-flop is equivalent to 8 gates.
(c) NAND/NOR gates are available with up to 8 inputs.

The equivalent numbers of gates were calculated using the following expressions:

(i) NMR (all-NAND implementation):

$$\binom{N}{M} + 1, \quad \text{where } M = (N+1)/2 \tag{1}$$

(ii) HMR (iterative cell array implementation [11]):

TMR core (S=N-3), majority voter:

$$\left[\binom{3}{2} + 1\right] + (3S+8) + (9S+27) + (8S+12) = 51 + 20S \tag{2}$$

TMR core (S=N-3), threshold voter:

$$\left[\binom{3+S}{2} + 1\right] + (3+S) + (9S+27) + (7S+3) = 34 + 17S + \binom{3+S}{2} \tag{3}$$

5MR core (S=N-5), majority voter:

$$\left[\binom{5}{3} + 1\right] + (5S+17) + (9S+45) + (14S+43) = 116 + 28S \tag{4}$$

(iii) Self-Purging [4]:

$$(8+2)N + \binom{N}{2} + 1 = (N^2 + 19N + 2)/2 \tag{5}$$

(iv) Siftout MR:

$$2\binom{N}{2} + 2N + 1 = N^2 + N + 1 \tag{6}$$
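As a quick check of expressions (2) to (6), the closed forms can be evaluated directly. The sketch below uses hypothetical function names and implements only the nominal formulas. Some Table I entries are a few gates larger than these values (for example 63 rather than 61 for Self-Purging with N=5), which may reflect the 8-input limit of assumption (c); that reading is an inference, not something the text states.

```python
from math import comb

def hmr_tmr_majority(n):
    """Expression (2): HMR, TMR core, majority voter, S = N - 3 spares."""
    s = n - 3
    return (comb(3, 2) + 1) + (3 * s + 8) + (9 * s + 27) + (8 * s + 12)

def hmr_tmr_threshold(n):
    """Expression (3): HMR, TMR core, threshold voter, S = N - 3 spares."""
    s = n - 3
    return (comb(3 + s, 2) + 1) + (3 + s) + (9 * s + 27) + (7 * s + 3)

def hmr_5mr_majority(n):
    """Expression (4): HMR, 5MR core, majority voter, S = N - 5 spares."""
    s = n - 5
    return (comb(5, 3) + 1) + (5 * s + 17) + (9 * s + 45) + (14 * s + 43)

def self_purging(n):
    """Expression (5): N flip-flops at 8 gates each plus the remaining logic."""
    return (8 + 2) * n + comb(n, 2) + 1

def siftout(n):
    """Expression (6): comparator XORs, detector ORs/ANDs, collector ORs/AND."""
    return 2 * comb(n, 2) + 2 * n + 1

for n in range(4, 10):
    print(n, hmr_tmr_majority(n), hmr_tmr_threshold(n),
          self_purging(n), siftout(n))
```

For every N in this range the Siftout count is the smallest of the four, in line with the ordering shown in Table I.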

RELIABILITY ANALYSIS

Reliability of 2-out-of-N systems

Out of the schemes shown in Table I, Siftout MR, Self-Purging MR, and Hybrid MR with TMR core are all 2-out-of-N strategies. Regarding the restoring organ as a series element in the reliability block diagram, the reliability of a 2-out-of-N MR is:

$$R = R_E R_R$$

$$R = \{1 - [(1-R_0)^N + N(1-R_0)^{N-1} R_0]\} R_R = \{1 - (1-R_0)^{N-1}[1 + (N-1)R_0]\} R_R \tag{7}$$

where $R_E$ is the reliability of the executive organ, $R_0$ is the reliability of a single channel and $R_R$ is the reliability of the restoring organ.

Applicability bounds

The crossover point is the minimum value of the reliability of a nonredundant unit for which the use of a redundant system improves the reliability. It is geometrically interpreted as the point where the curves for the redundant and the nonredundant system cross each other. The reliability of a nonredundant unit (simplex system) is equal to the reliability of a channel, $R_0$. The crossover point ($R_{cp}$) of a 2-out-of-N system is the nontrivial root of the equation:

$$R_E = 1 - (1-R_0)^{N-1}[1 + (N-1)R_0] = R_0 \tag{8}$$

$R_{cp}$ gives the lower bound of applicability of a 2-out-of-N system. The reliability cannot be improved by using redundancy when $R_0 < R_{cp}$, whatever the value of $R_R$. Similarly, there is a value of $R_R$ below which the reliability cannot be improved over the simplex design, whatever the value of $R_0$. This lower bound for $R_R$ is the minimum of the function $R_0/R_E$. Table II shows this minimal $R_R$ and the minimal $R_0$ ($R_{cp}$) for several values of N.

TABLE II-Applicability Bounds for 2-out-of-N MR's

 N    Minimal $R_0$    Minimal $R_R$
 3    0.5              0.8889
 4    0.2324           0.7248
 5    0.1311           0.6028
 6    0.0836           0.5137

Reliability comparison

In order to compare the performance of the three 2-out-of-N MR's, an analysis was made based on the values from Table I. If r is the reliability of a single gate and each channel has G gates,

$$R_0 = r^G \tag{9}$$

If g is the number of gates in the restoring organ,

$$R_R = r^g \tag{10}$$

Tables III to V present values of the reliability R for the three MR's discussed, with selected values of $R_0$. It was assumed that G=1000, which is generally considered a typical value [13]. The HMR implementation considered was the TMR core with threshold voting.

TABLE III-Reliability of 2-out-of-4 Systems with 1000 Gates/Channel

                                     R
 $R_0$    $R_E$            Siftout     Hybrid      Self-Purging
 0.5      0.6875           0.6776      0.6590      0.6655
 0.7      0.9163           0.9095      0.8966      0.9011
 0.9      0.9963           0.9941      0.9899      0.9914
 0.95     0.9995           0.9984      0.9964      0.9971
 0.99     0.999 996        0.9998      0.9994      0.9995
 0.999    0.999 999 996    0.999 98    0.999 94    0.999 95

TABLE IV-Reliability of 2-out-of-5 Systems with 1000 Gates/Channel

                                     R
 $R_0$    $R_E$            Siftout     Hybrid      Self-Purging
 0.5      0.8125           0.7952      0.7676      0.7778
 0.7      0.9692           0.9586      0.9413      0.9477
 0.9      0.9995           0.9963      0.9909      0.9929
 0.95     0.999 97         0.9984      0.9958      0.9967

TABLE V-Reliability of 2-out-of-6 Systems with 1000 Gates/Channel

                                     R
 $R_0$    $R_E$            Siftout     Hybrid      Self-Purging
 0.5      0.8906           0.8645      0.8287      0.8438
 0.7      0.9891           0.9740      0.9530      0.9619
 0.9      0.999 95         0.9954      0.9890      0.9918

The numbers show that, given a fixed $R_0$ and a fixed G, there is for any MR a maximum number of channels $N_{max}$ to consider. Increasing the degree of redundancy above $N_{max}$ will not increase the system reliability. This result has long been known for the HMR [13]. Although there are no drastic differences among the three MR's discussed, Siftout presents the best reliability performance.

CONCLUSIONS

Responsive schemes, and in particular SMR, have been shown to perform significantly better than Static or Selective schemes. They provide higher fault tolerance than Static schemes, and have the added value of fault detection capability. Their implementation is much simpler than that of the Selective schemes, enabling higher limits of reliability. SMR does not use voters, and has a very straightforward

implementation. It compares favorably with the older TMR, NMR, HMR, and Self-Purging schemes. SMR is particularly suitable for systems with high availability requirements and systems with high reliability requirements over a long period of time.

The SMR reliability was compared with two other 2-out-of-N schemes, Self-Purging and the iterative cell array, threshold voter implementation of TMR + spares. Lower bounds were presented for the channel and the restoring organ reliabilities. There is a maximal number of channels for each instant of time, above which the reliability of any scheme starts to degrade. This means that there is an upper limit for the reliability that can be achieved with an MR over each mission time, irrespective of the degree of redundancy.

REFERENCES

1. Pierce, W. H., "Adaptive Vote-Takers Improve the Use of Redundancy," in Redundancy Techniques for Computing Systems, pp. 229-250. Edited by R. H. Wilcox and W. C. Mann. Washington: Spartan Books, 1962.
2. Pierce, W. H., Failure-Tolerant Computer Design. New York: Academic Press, 1965.
3. Goldberg, J., K. N. Levitt, and R. A. Short, "Techniques for the Realization of Ultrareliable Spaceborne Computers," Final Report-Phase I, Stanford Research Institute Project 5580, Menlo Park, California, September 1966.
4. Losq, J., "A Highly Efficient Redundancy Scheme: Self-Purging Redundancy," IEEE Transactions on Computers, Vol. C-25, June 1976, pp. 569-578.
5. Devaney, M. J., "Fault Diagnosis and Self-Repair in Operational Synchronous Digital Systems," Ph.D. Dissertation, University of Missouri-Columbia, June 1971.
6. deSousa, P. T. and F. P. Mathur, "Sift-out Modular Redundancy," submitted for publication.
7. Mathur, F. P. and P. T. deSousa, "Reliability Models of NMR Systems," IEEE Transactions on Reliability, Vol. R-24, June 1975, pp. 108-113.
8. Ball, M. and F. Hardie, "Self-Repair in a TMR Computer," Computer Design, Vol. 8, February 1969, pp. 54-57.
9. Hight, S. L. and D. P. Petersen, "Dissent in a Majority Voting System," IEEE Transactions on Computers, Vol. C-22, February 1973, pp. 168-171.
10. Mathur, F. P. and A. Avizienis, "Reliability Analysis and Architecture of a Hybrid-Redundant Digital System: Generalized Triple Modular Redundancy with Self-Repair," AFIPS Conference Proceedings (Spring Joint Computer Conference), Vol. 36, May 1970, pp. 375-383.
11. Siewiorek, D. P. and E. J. McCluskey, "An Iterative Cell Switch Design for Hybrid Redundancy," IEEE Transactions on Computers, Vol. C-22, March 1973, pp. 290-297.
12. "Reliability Prediction of Electronic Equipment," Military Standardization Handbook MIL-HDBK-217B, Department of Defense, U.S.A., September 1974.
13. Ogus, R. C., "Fault-tolerance of the Iterative Cell Array Switch for Hybrid Redundancy," IEEE Transactions on Computers, Vol. C-23, July 1974, pp. 667-681.
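
As an illustration of the reliability model of equation (7), the sketch below evaluates the executive-organ reliability, the system reliability, and the crossover point numerically. The function names and the bisection search are assumptions made for this example. The restoring-organ reliability is taken as $R_R = r^g$ with $r = R_0^{1/G}$, which is the reading of equations (9) and (10) that reproduces the Siftout column of Table III.

```python
def executive_reliability(r0, n):
    """R_E of a 2-out-of-N executive organ, from equation (7)."""
    return 1.0 - (1.0 - r0) ** (n - 1) * (1.0 + (n - 1) * r0)

def system_reliability(r0, n, g, gates_per_channel=1000):
    """R = R_E * R_R, assuming R_R = r**g with gate reliability r = R_0**(1/G)."""
    r_r = r0 ** (g / gates_per_channel)
    return executive_reliability(r0, n) * r_r

def crossover(n, iters=200):
    """Nontrivial root of R_E(R_0) = R_0, found by bisection on (0, 1)."""
    lo, hi = 1e-6, 1.0 - 1e-6
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if executive_reliability(mid, n) < mid:
            lo = mid  # redundancy still worse than simplex: root is above
        else:
            hi = mid
    return (lo + hi) / 2.0

# Siftout with N = 4 uses g = 21 gates in the restoring organ (Table I).
print(round(system_reliability(0.5, 4, 21), 4))
print(round(crossover(4), 4))
```

With these assumptions the first value agrees with the Siftout entry of Table III for a channel reliability of 0.5, and the second with the minimal channel reliability listed in Table II for N=4.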