Chapter 1 Introduction to FT Computing/Computer Reliability Engineering


Three dimensions of fault tolerant computer systems:
1. Physical: hardware (h/w), software (s/w), system
2. Time: life of a fault tolerant (FT) system (manufacture, operation, maintenance)
3. Cost: money ($), customer requirements/satisfaction

Definition of Fault Tolerant Computing: the correct execution of a specified algorithm in the presence of defects. This nominally requires a systems approach to FT computing that encompasses numerous disciplines to achieve a desired form of reliability.

Definition of a Fault Tolerant Computer: a computer system that possesses the capability to execute a set of programs correctly in the presence of certain specified faults in the system, including hardware failures and software errors.

Correct execution of programs
- Programs are not halted or modified by faults in the computer
- Results do not contain errors caused by faults

Achievement of Fault Tolerance (methods)
- Hardware replication
- Information redundancy: error correcting codes
- Software replication
- Time redundancy: rollback and recovery
- Operational discipline: environment, maintenance, man/machine interface, risk analysis

Causes of Faults
- Design errors
  - Imperfect or incomplete specifications
  - Imperfect implementation of specifications
- Component failures
- Environmental impacts

Characterization of Faults
- Duration: permanent, transient, intermittent
- Extent: local, catastrophic (global)
- Models (some examples): stuck (open/short), unidirectional, indeterminate
- Operator induced/human faults

CENG Chapter 1 Introduction

Why Fault Tolerant Computer Systems? (knowing that most computer system implementations are digital versus analog or optical)

1. Typical Requirements for FT Systems
   a. Deep-space vehicles (long mission times): Mars Exploration Rovers Spirit, Opportunity and Curiosity; Hubble Telescope; International Space Station (ISS)
   b. FAA traffic control (loss of life, economic impact of a long-term shutdown)
   c. Aircraft reliance on computers: inherently unstable aircraft; minimization of loss of life, where acceptable failure rates are 10^-6 per hour or better; Shuttle; Boeing 7x7-family airliners (including the 767 and 777) with Cat 0 landing capability; B-2 stealth bomber; DoD drones
   d. Reliance on communications: Internet, stock markets, the Bell Telephone ESS (Electronic Switching System) with its two hours of downtime during its 40-year lifetime

2. System Complexity: Implications of Moore's Law
   Moore's Law deals with complexity as represented by the number of transistors in a microprocessor/integrated circuit (the number of transistors on an IC doubles approximately every two years, although the period today is more often quoted as 18 months). The downside of this increasing complexity is that with so many components, the probability of a hardware failure is far from negligible. For example, given a pc board with 40 transistors/active devices, each with a 1% initial failure rate, the probability that the board is not defective is (0.99)^40 = 0.669 (a 33% chance that it fails at turn-on).

3. Cost
   A more fault tolerant system (which will cost more than a lesser FT system) can actually reduce the cost of ownership: a higher initial investment can save money over the lifetime of the system.

4. Social and Economic Considerations
   Quality of life; impact of computers on society (the Information Revolution); flexibility for growth and change (different mission objectives using the same basic hardware); the difficult task of managing very complex systems; society's reliance on computers (life/death situations).

SPEED and MONEY are probably the two most important aspects of computer systems. In dealing with fault tolerance, money is probably the primary concern. The economic aspects of fault tolerant computing can be depicted with a simple example of the cost of ownership for two different computer systems, one more reliable than the other. The cost of ownership, or the cost of downtime, can be related to maintenance and the time value of money (discount rate).
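The board-level estimate in the System Complexity item above can be checked with a few lines of Python. This is only a sketch: it assumes 40 independent devices with a 1% defect probability each (the values that reproduce the 0.669 figure), and the function name is illustrative.

```python
def prob_all_good(n_devices: int, p_defective: float) -> float:
    """Probability that none of n independent devices is defective."""
    return (1.0 - p_defective) ** n_devices


# Assumed example values: 40 devices, 1% initial failure rate each.
p_ok = prob_all_good(40, 0.01)
print(f"P(board works at turn-on) = {p_ok:.3f}")
print(f"P(board is defective)     = {1.0 - p_ok:.3f}")
```

Because the per-device probabilities multiply, the yield drops quickly as the device count grows, which is the point the Moore's Law discussion makes.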

The cost of owning a computer system for n years can be expressed as

\[ C = I + \sum_{i=1}^{n} \frac{S_i P_i}{(1 + D)^i} \]

(the equation is really nothing more than the time value of money), where

n = system lifetime (assumed operational life, no salvage value at end)
I = initial cost of equipment (purchase price)
S_i = cost of one maintenance operation in year i (the cost of each service call)
P_i = the expected number of failures during year i
D = discount rate (time value of money for the customer)

Assume that a computer system has a 5-year life, its failure rate is constant over time, a service call costs $300 and the discount rate is 12%. Expressing failures as λ failures per million hours of operation and noting that there are 8760 hours in a year results in

n = 5-year lifetime
S_i = $300 cost of each service call
D = 12% discount rate
λ = failures/10^6 hours (the failure rate, assumed to be constant)
I = initial cost of equipment (purchase price)

\[ C = I + 300 \left( 8760 \ \text{hours/year} \right) \left( \lambda \ \text{failures}/10^{6} \ \text{hours} \right) \sum_{i=1}^{5} \frac{1}{(1 + 0.12)^i} \]

\[ \sum_{i=1}^{5} \frac{1}{(1 + 0.12)^i} = 0.893 + 0.797 + 0.712 + 0.636 + 0.567 = 3.605 \]

For these nominal assumptions, the cost of ownership for 5 years is

\[ C = I + 9.47 \lambda \]

where λ is in failures per million hours.

System #1 (cheaper, less reliable)
I = $20K
λ = 6,000 failures per 10^6 hours
C ≈ $76,800
Since λ is constant, MTTF = 1/λ = 166.67 hours, or approximately 263 service calls in a 5-year period (MTTF = mean time to failure; the relationship is valid only for constant λ).

System #2 (expensive but more reliable)
I = $30K
λ = 4,000 failures per 10^6 hours
MTTF = 1/λ = 250 hours, or approximately 175 service calls in a 5-year period
C ≈ $67,900
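The cost-of-ownership comparison can be reproduced with a short script. This is a sketch under stated assumptions: a 5-year life, a $300 service call, a 12% discount rate, and the two systems taken as $20K at 6,000 failures per million hours and $30K at 4,000 failures per million hours. The function and parameter names are illustrative, not part of the course material.

```python
HOURS_PER_YEAR = 8760


def discount_sum(years: int, rate: float) -> float:
    """Sum of 1 / (1 + D)^i for i = 1..n (present-value factor)."""
    return sum(1.0 / (1.0 + rate) ** i for i in range(1, years + 1))


def cost_of_ownership(initial_cost: float,
                      failures_per_1e6_hours: float,
                      service_call_cost: float = 300.0,
                      years: int = 5,
                      discount_rate: float = 0.12) -> float:
    """C = I + sum_i S_i * P_i / (1 + D)^i with constant S_i and P_i."""
    failures_per_year = failures_per_1e6_hours * 1e-6 * HOURS_PER_YEAR
    yearly_service_cost = service_call_cost * failures_per_year
    return initial_cost + yearly_service_cost * discount_sum(years, discount_rate)


c1 = cost_of_ownership(20_000, 6_000)  # cheaper, less reliable system
c2 = cost_of_ownership(30_000, 4_000)  # costlier, more reliable system
print(f"discount factor sum = {discount_sum(5, 0.12):.3f}")
print(f"System 1: C = ${c1:,.0f}")
print(f"System 2: C = ${c2:,.0f}")
print(f"saving with System 2 = {100 * (c1 - c2) / c1:.1f}%")
```

The script keeps full precision in the discount factors, so its totals can differ by a few tens of dollars from figures worked by hand with rounded factors.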

System #2 is 50% more expensive initially (I) but is more reliable (higher MTTF). The costlier system results in an 11.6% reduction in the cost of ownership over a 5-year period, which is a direct result of the more reliable System #2 avoiding the extra service calls.

Reliability, Availability and Risk

These terms can be viewed as probabilistic or deterministic (an outcome of the laws of nature). We'll concentrate on the probabilistic characterization of these terms.

Reliability is the ability to operate under designated conditions for a period of time. Ability will be designated as a probability or determined deterministically (from empirical evidence such as failure mechanisms/analysis, testing/inspection, operational performance, etc.).

Availability takes down-time into consideration. It can be viewed as a combination of reliability and maintainability. Conversely, reliability can be considered as instantaneous availability where no maintenance or repair is performed.

Risk is a more systemic term, a big-picture viewpoint that has a relationship to reliability analysis. Risk in qualitative terms is the potential of loss or injury from exposure to a hazard (danger). The more safeguards against exposure to hazards, the lower the risk. Quantitative risk analysis involves the probability of loss combined with the probability of hazard occurrence. Risk analysis asks the following questions:
1. What can go wrong if exposed to a hazard?
2. How likely is this to happen?
3. If it does, what are the expected consequences?

Example (given without proof at this stage)

Life tests show that a component fails at a constant failure rate: 100 items are tested for 1,000 hours and 4 of these fail in that period. The failure rate is

\[ \lambda = \frac{4 \ \text{failures}}{100 \ \text{items} \times 1{,}000 \ \text{hours}} = 4 \times 10^{-5} \ \text{failures/hour} \]

based on the important statement that the failure rate is constant.

The reliability function for this type of failure mode (constant failure rate λ), which represents the probability of no failures within a given operational period (1,000 hours in this case), is

\[ R(t) = e^{-\lambda t} = e^{-(4 \times 10^{-5} \ \text{failures/hr})(1{,}000 \ \text{hrs})} = e^{-0.04} \approx 0.9608 \]

(the probability of no failures in 1,000 hours).

For this failure mode (constant failure rate of λ = 4 x 10^-5 failures/hr) it is also known that the mean time to failure for the single component is

\[ \text{MTTF} = \frac{1}{\lambda} = 25{,}000 \ \text{hours} \]

Even though these parameters look very good for a single item, consider the complexity of using n of these items in a system in which all of the items must work in order for the system to work. The reliability of the system R_sys becomes

\[ R_{sys}(t) = [R(t)]^{n} = [e^{-\lambda t}]^{n} = e^{-n \lambda t} \]

So the overall system reliability for 1,000 hours with just 50 of these items would be

\[ R_{sys}(t = 1{,}000 \ \text{hours}) = e^{-n \lambda t} = e^{-50 \times 0.00004 \times 1{,}000} = e^{-2} \approx 0.135 \]

or not much of a chance that the system would survive the first 1,000 hours (an 87% chance of failing in the first 1,000 hours).

Reliability is a figure citing the probability of an object/system working until it fails; that is, the probability of no failures in a given interval. No repair is considered during the 1,000-hour interval, nor are any alternatives to the failure considered once it has failed.

Availability [A(t)] is a measure of performance that does take into account the possibility of repair of a detected failure. It is the probability that a system is operational at a specific and given instant of time. Such activities as preventive maintenance and repair reduce the time that the system is available to the user, but hopefully these functions can be performed without serious impact to the system.

A descriptive formula for availability:

\[ A(t) = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} \]

It will be shown in this class that such things as repair can add significantly to the desired operational characteristics of a complex system. Although Reliability R(t) and Availability A(t) are radically different figures of merit for a reliable system, they are both based on the same probabilistic measures.
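The constant-failure-rate relationships above can be collected into a small script. It is a sketch only, using example values assumed here (4 failures among 100 items tested for 1,000 hours, and 50 items in series); the function names are illustrative.

```python
import math


def failure_rate(failures: int, items: int, hours: float) -> float:
    """Constant failure rate estimated from a life test (failures per hour)."""
    return failures / (items * hours)


def reliability(lam: float, t: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure in [0, t]."""
    return math.exp(-lam * t)


def mttf(lam: float) -> float:
    """Mean time to failure; valid only for a constant failure rate."""
    return 1.0 / lam


def series_reliability(lam: float, t: float, n: int) -> float:
    """n identical items that must all work: R_sys = exp(-n * lambda * t)."""
    return math.exp(-n * lam * t)


def availability(uptime: float, downtime: float) -> float:
    """A = uptime / (uptime + downtime)."""
    return uptime / (uptime + downtime)


lam = failure_rate(failures=4, items=100, hours=1000.0)  # assumed test data
print(f"lambda        = {lam:.1e} failures/hour")
print(f"R(1000 h)     = {reliability(lam, 1000.0):.4f}")
print(f"MTTF          = {mttf(lam):,.0f} hours")
print(f"R_sys, n = 50 = {series_reliability(lam, 1000.0, 50):.3f}")
```

Raising n in `series_reliability` shows how fast a series system degrades even when each item is individually very reliable.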

Some Interesting History

The 1st FT digital computer was SAPO, which was built in Prague, Czechoslovakia in 1950 to 1954. It was a 32-bit floating point architecture that was motivated by the very poor component quality and political sensitivity to a project failure. It was based on TMR (triple modular redundancy), which is a system that relies on comparison of results, or voting.

[Figure: three redundant modules feeding a 2-of-3 voter whose output gives R_sys]

\[ R_{sys}(3, 0) = 3R_m^2 - 2R_m^3 \]

where R(3, 0) depicts a redundancy level of 3 (triple) with 0 spares and R_m is the reliability of a single (duplicated) unit.

The term fault tolerance is shown in this TMR system since it masks faults by a majority voting scheme (easy to conceptualize, extremely difficult to implement). It does not repair the faults; it tolerates the faults. Note that if the TMR system permanently votes out a module (removes it from the TMR system), then it must revert to a simplex (one) module operation. There is no way to have a majority voting scheme with just two modules.

The basis of the equation for R_sys(3, 0) can be shown as the reliability of all three modules working, R_m^3, plus the reliability of only 2 out of 3 modules working, R_m^2 (1 - R_m) = R_m^2 - R_m^3, for which there are three possible combinations of 2-out-of-3, or 3(R_m^2 - R_m^3); thus

\[ R_{sys}(3, 0) = R_m^3 + 3R_m^2 - 3R_m^3 = 3R_m^2 - 2R_m^3 \]

This equation also assumes a perfect voter (no failures), so if we consider the reliability R_v of the one voter, which we'll consider to be in series with the three redundant modules R_m as shown above, then

\[ R_{sys}(3, 0) = R_v \left( 3R_m^2 - 2R_m^3 \right) \]
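The TMR expression can be cross-checked by enumerating every working/failed combination of the three modules. The sketch below assumes independent module failures and an optional voter in series; the function names and the sample value R_m = 0.9 are illustrative.

```python
from itertools import product


def tmr_reliability(r_m: float, r_v: float = 1.0) -> float:
    """Closed form: R_sys(3, 0) = R_v * (3 R_m^2 - 2 R_m^3)."""
    return r_v * (3.0 * r_m ** 2 - 2.0 * r_m ** 3)


def tmr_by_enumeration(r_m: float) -> float:
    """Sum the probability of every module state with at least 2 of 3 working."""
    total = 0.0
    for states in product([True, False], repeat=3):
        if sum(states) >= 2:  # a majority of modules is still correct
            p = 1.0
            for working in states:
                p *= r_m if working else (1.0 - r_m)
            total += p
    return total


r_m = 0.9  # assumed single-module reliability
print(f"closed form      : {tmr_reliability(r_m):.4f}")
print(f"enumeration      : {tmr_by_enumeration(r_m):.4f}")
print(f"with voter R_v=0.99: {tmr_reliability(r_m, 0.99):.4f}")
```

Evaluating `tmr_reliability` for R_m below 0.5 gives a value lower than R_m itself, so the masking benefit of TMR holds only while the individual modules are reasonably reliable.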

Some Applications of FT Techniques
- Apollo vehicles (CM, LM, Saturn V computer: LVDC)
- Bell Telephone ESS
- Communication networks
- Voyager satellite
- Kepler Telescope
- Mars Rovers
- SIFT (software implemented FT)
- FTMP (FT multiprocessor)
- C.mmp, Cm*, C.vmp: Carnegie Mellon University systems
- Commercial systems: Tandem/Compaq/HP, Stratus, Sun
- New York Stock Exchange
- India's stock exchange
- Personal computers implemented with RAID
- Boeing 777, Dreamliner

My involvement as an employee with the MIT Instrumentation Laboratory/Draper Laboratory in FT systems started with an R&D project for NASA Headquarters executed at Cambridge, Massachusetts and the Johnson Space Center. The project was called AIPS. (This government project was also the genesis of this course at UHCL.)

Advanced Information Processing System (AIPS)
- Develop and demonstrate a FT system that will satisfy a broad spectrum of future NASA missions
  - LaRC: advanced aircraft
  - JSC: Space Station Freedom, Orbital Transfer Vehicles, Space Shuttle Upgrade/Block II
- Digital system for cost advantages and flexibility
- Design the system for growth and change through system modularity
- Evaluate the system in a flight environment
- Compare the AIPS primarily hardware implementation with software techniques used to achieve fault tolerance
- Incorporate other technology options into the system as desired

This eventually led to reliable computer systems for the Shuttle and the X-38/ISS Crew Return Vehicle.

The Shuttle's redundant (but not formally fault-tolerant) computer system can be explained by looking at a proposed upgrade to the computer system, the Shuttle Cockpit Avionics Upgrade (CAU), which was desired primarily for crew safety (loss of vehicle) and to reduce the crew's workload (more pertinent/graphical display of critical data).

The project was executed through the CDR (Critical Design Review) phase and then cancelled when it was decided to terminate the Shuttle Program with the last Shuttle flight in 00, which completed the major construction phase of the International Space Station (ISS). The overall computer system concept of complementing the existing PFS (Primary Flight System) with CDPs (Command & Data Processors) was demonstrated by USA (United Space Alliance) in 00 at JSC.

[Figure: Shuttle Computer Configuration (4 redundant-set computers, 1 backup computer)]

[Figure: Cockpit Avionics Upgrade (CAU), proposed upgrade to the Shuttle's computer system, Rev F Architecture]

X-38 Vehicle Computer built for NASA JSC utilizing a fault-tolerant parallel processor (FTPP) configuration with Network Elements

[Figure: FTPP architecture: channels on VME buses, each with FCP, ICP, MPCC, Decom, Digital I/O and Analog Out boards, joined by Network Elements]

- Each channel forms a fault containment region
- Input data is distributed to each channel for data congruency (same data at same time)
- Redundant processing channels execute the same instruction sequence on congruent data at the same time
- Results are voted and output for execution
- Errors are detected; failed items are removed and/or reset (brought back into the set, a repair feature)
- Processing elements are configured in groups to obtain a balance of throughput and redundancy
  - Multiple simplex groups provide the high throughput of parallel processing
  - Redundant groups (triplex or quadruplex) provide fault tolerance (mixed levels of redundancy)
- Processing elements: Flight Critical Processors (FCP) and Instrumentation Control Processors (ICP)
- I/O devices can be hosted by a processing element
- Five fault-containment regions (FCRs): four Flight Critical Processors (FCP), with a fifth unit made up of a Network Element (NE)
- One Network Element (NE) per Fault Containment Region (FCR)
- Nine Processing Elements (FCP + ICP) configured in 6 processing groups
- System can accommodate arbitrary non-simultaneous faults
- Software implements fault recovery/repair during non-critical periods
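The vote-and-remove behavior in the list above can be illustrated with a minimal majority voter. This is a sketch, not the FTPP implementation: it assumes exact-match voting on already congruent outputs, and the channel names and the `vote` function are hypothetical.

```python
from collections import Counter


def vote(outputs: dict) -> tuple:
    """Return (majority_value, disagreeing_channels).

    outputs maps a channel name to the value that channel produced.
    Raises ValueError when no value is held by a strict majority.
    """
    counts = Counter(outputs.values())
    value, n_agree = counts.most_common(1)[0]
    if n_agree <= len(outputs) // 2:
        raise ValueError("no majority: the fault cannot be masked")
    failed = sorted(ch for ch, out in outputs.items() if out != value)
    return value, failed


# Triplex group with one faulty channel (hypothetical channel names).
value, failed = vote({"A": 42, "B": 42, "C": 41})
print(value, failed)  # the majority value and the channel to remove or reset

# With only two channels left, a disagreement cannot be resolved.
try:
    vote({"A": 42, "B": 41})
except ValueError as err:
    print(err)
```

The two-channel case shows why a redundant group that has voted out a module cannot keep masking faults by majority alone.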

Why We've Got a Long Way to Go with Computer Fault Tolerance (August 06)
