DESIGNING AN ECU CPU FOR RADIATION ENVIRONMENT

Matthew G. M. Yee
College of Engineering
University of Hawai`i at Mānoa
Honolulu, HI 96822

ABSTRACT

NASA's objective is to colonize the planet Mars, for the planet has shown possible signs of life. Colonizing Mars requires a vehicle for everyday tasks such as traveling from point A to point B and exploration, so such a vehicle is essential to the objective. However, another planet of our solar system will not have the same environment as Earth. Radiation is detrimental to electronics because external radiation can inject energy into transistors, which can cause a logic bit to flip. Bit flips can cause serious faults: a processor that receives the wrong logic value can behave erratically, causing miscalculations or, worse, a shutdown of life systems. With no way to maintain the vehicle's electronics, reliability of the electronic control unit (ECU) is the top priority, so radiation hardening the hardware is key to protecting the bits throughout the hardware. A radiation-hardening technique known as Triple Modular Redundancy (TMR) is effective against single event upsets (SEUs). Constructing a robust fault-tolerant module for the CPU of an ECU is the focus of this project.

INTRODUCTION

The electronic control unit is the part of a vehicle's system that manages the vehicle's vital systems. Its functions can range from managing a motor's status to taking temperature readings. Every electronic device has processors that contain logic gates. These logic gates are composed of transistors, one of which is illustrated in Figure 1. The transistors that make up the logic gates behave like switches: depending on the base current, a transistor settles into a stable on or off state, which gives it flip-flop behavior and makes it a simple memory device that stores a zero when off or a one when on.
[1] Many logic gates are combined to form a processor that can compute values or store data in bytes. A computer understands only 1 or 0 (on or off), and these bytes carry information that is valuable to us. The information in a byte is then processed by other systems, also composed of logic gates, that convert it into numbers, text, or signals: things that we can use.

An electronic control unit (ECU) is a module mainly used for management and control of a system. For example, an electric car's ECU monitors the battery energy level, the stiffness or comfort of the electronically controlled suspension, the user interface, and the throttle input. If radiation is constantly injecting energy into the ECU, the ECU will receive an unexpected current that flips a logic bit, resulting in unexpected loss of power, random error codes, false sensor readings, or, worse, the ECU destroying itself.
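To make the effect of a single bit flip on a sensor reading concrete, the following is a minimal sketch in Python. The 8-bit temperature value, the bit position, and the helper name `flip_bit` are assumptions chosen for illustration; they do not come from the ECU design itself.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted, modeling a single event upset."""
    return value ^ (1 << bit)


reading = 25                   # hypothetical 8-bit temperature reading
upset = flip_bit(reading, 6)   # assume the upset hits bit 6

print(f"{reading:08b} -> {upset:08b}")   # 00011001 -> 01011001
print(reading, "->", upset)              # 25 -> 89
```

One flipped bit is enough to turn a plausible reading into a false one, which is the kind of fault the rest of this paper tries to mask.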
The radiation-hardened hardware in this project is constructed and simulated using Xilinx ISE WebPACK and ModelSim. Although there are multiple techniques for radiation hardening electronics, this project focuses on the technique known as Triple Modular Redundancy, which is explained next.

Figure 1. Transistor

TRIPLE MODULAR REDUNDANCY

Triple Modular Redundancy (TMR) is a fault-tolerant form of N-modular redundancy. Its purpose is for the system to continue operating in the event of a failure. TMR achieves fault tolerance by implementing a majority-vote system across three systems. The three systems accept the same input from one line of data, process it, and send their outputs to the majority-vote system. If any one of the three systems fails, the other two can correct and mask the fault. The probability of the three digital circuits failing depends on the components used to construct the circuit. The tools may change to make the design robust, so for this step let's assume that the chosen component has zero chance of failing. Figure 2 illustrates the TMR flowchart. The next section establishes the statistical analysis used to calculate the reliability of the TMR.
Figure 2. Triple Modular Redundancy Flowchart (blocks: Input, three Digital circuits, Vote, Output).

TRIPLE MODULAR REDUNDANCY: STATISTICAL ANALYSIS (RELIABILITY)

First, digital circuit 1, digital circuit 2, and digital circuit 3 are denoted by the letters A, B, and C respectively. Circuits A, B, and C are subsets of the total, S. The first step is to calculate the probability P of overall system failure, which is described by the expression below. [7]

$$P = P(A \cap B) + P(B \cap C) + P(A \cap C) - 2P(A \cap B \cap C)$$

This is the probability that at least two of the three events occur together. Since the desired system must be reliable, the goal is for at most one of the circuits to fail: circuit A, B, or C may fail, but not two of them. As long as this is achieved, the system will function correctly. Therefore, assuming the three circuits are independent and each operates correctly with probability R, the variable R can be used in the overall probability expression to obtain the probability that the system is reliable. [7]

$$P = R^2 + R^2 + R^2 - 2R^3 = 3R^2 - 2R^3$$

This expression considers only the three circuits, so it also needs a variable representing the reliability of the majority voter, $R_V$. The expression can therefore be rewritten as below. [7]

$$R_{TMR} = R_V(3R^2 - 2R^3)$$

With the probability expressions for both failure and reliability established, the next part is designing the TMR module.

TRIPLE MODULAR REDUNDANCY: DIGITAL DESIGN

The TMR is essentially a majority vote. The behavior of the voter is simple: if all three inputs agree, it outputs their common value, and if two of the three agree, it outputs the majority value. The truth table is given below as Figure 3, and the circuit of the vote module as Figure 4. Although the TMR is fault tolerant, it can still fail. What is needed is a way to detect these errors so that we know which circuit failed as CPU data passes through, so an error detection module is needed. The error truth table is given below as Figure 5, and the circuit of the error detection module as Figure 6. [7]
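As a numerical check on the reliability expression $R_{TMR} = R_V(3R^2 - 2R^3)$ from the statistical analysis, the following Python sketch evaluates the closed form and compares it with a direct enumeration of all eight working/failed patterns of the three circuits. It assumes independent circuits with equal reliability; the function names and the sample values of R are illustrative assumptions, not part of the design.

```python
from itertools import product


def tmr_reliability(r: float, r_voter: float = 1.0) -> float:
    """Closed form R_TMR = R_V * (3R^2 - 2R^3) for three independent circuits."""
    return r_voter * (3 * r**2 - 2 * r**3)


def tmr_reliability_enumerated(r: float, r_voter: float = 1.0) -> float:
    """Sum the probability of every pattern in which at least two circuits work."""
    total = 0.0
    for states in product([True, False], repeat=3):
        if sum(states) >= 2:              # majority of circuits working
            p = 1.0
            for works in states:
                p *= r if works else (1.0 - r)
            total += p
    return r_voter * total


# Assumed sample reliabilities, with an ideal voter (R_V = 1)
for r in (0.5, 0.9, 0.99):
    print(r, tmr_reliability(r), tmr_reliability_enumerated(r))
```

With an ideal voter, a circuit reliability of 0.9 gives a TMR reliability of about 0.972, while at R = 0.5 the expression returns 0.5, so redundancy only helps when each circuit is more likely to work than to fail.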
A  B  C  |  Vote
0  0  0  |  0
0  0  1  |  0
0  1  0  |  0
0  1  1  |  1
1  0  0  |  0
1  0  1  |  1
1  1  0  |  1
1  1  1  |  1

Figure 3. Vote Truth Table [7]

The result of the vote module truth table is expressed as a Boolean equation and as the circuit in Figure 4.

$$\mathrm{VOTE} = AB + AC + BC$$

Figure 4. Vote Module [7]
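The vote equation can be checked against the truth table of Figure 3 with a short Python sketch. The function name `vote` and the sample data words are assumptions for illustration; the actual module in this project is a hardware circuit, not software.

```python
from itertools import product


def vote(a: int, b: int, c: int) -> int:
    """Majority vote, VOTE = AB + AC + BC, applied bitwise."""
    return (a & b) | (a & c) | (b & c)


# Truth table of Figure 3: (A, B, C) -> Vote
VOTE_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 1,
}

for bits in product((0, 1), repeat=3):
    assert vote(*bits) == VOTE_TABLE[bits]

# Because the operators are bitwise, the same function votes on whole words.
good = 0b1011
corrupted = 0b0011            # assumed upset in the top bit of one copy
print(bin(vote(good, good, corrupted)))   # 0b1011
```

Since the equation uses only AND and OR, the same expression masks a fault in any single copy of a multi-bit word, one bit position at a time.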
A  B  C  |  Error  ea  eb  ec
0  0  0  |  0      0   0   0
0  0  1  |  1      0   0   1
0  1  0  |  1      0   1   0
0  1  1  |  1      1   0   0
1  0  0  |  1      1   0   0
1  0  1  |  1      0   1   0
1  1  0  |  1      0   0   1
1  1  1  |  0      0   0   0

Figure 5. Error Module truth table [7]

The result of the error module truth table is expressed as below, where ea, eb, and ec in Figure 5 correspond to $A_{error}$, $B_{error}$, and $C_{error}$.

$$\mathrm{ERROR} = (A \oplus B) + (A \oplus C) + (B \oplus C)$$
$$A_{error} = (A \oplus B)(A \oplus C)$$
$$B_{error} = (A \oplus B)(B \oplus C)$$
$$C_{error} = (A \oplus C)(B \oplus C)$$

Figure 6. Error Detection Module [7]
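The error detection logic can be sketched in Python in the same way, using XOR to compare each pair of inputs. The function name `detect_error` and the tuple return format are assumptions for illustration only.

```python
def detect_error(a: int, b: int, c: int) -> tuple[int, int, int, int]:
    """Return (ERROR, A_error, B_error, C_error) for one-bit inputs A, B, C."""
    ab, ac, bc = a ^ b, a ^ c, b ^ c   # pairwise disagreement
    error = ab | ac | bc               # any pair disagrees
    a_err = ab & ac                    # A disagrees with both B and C
    b_err = ab & bc                    # B disagrees with both A and C
    c_err = ac & bc                    # C disagrees with both A and B
    return error, a_err, b_err, c_err


print(detect_error(0, 0, 1))   # (1, 0, 0, 1): circuit C is the odd one out
print(detect_error(1, 1, 1))   # (0, 0, 0, 0): all circuits agree
```

A circuit is flagged only when it disagrees with both of the others, so each row of Figure 5 raises at most one of the three per-circuit flags.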
The two modules are built purely from combinational logic, and that is the problem: if radiation hits these components and a bit flip occurs, the outputs of both designs are affected immediately, causing an SEU and imminent failure afterward. Jeremy Chan informed me of a technique known as scrubbing, which allows errors in memory to be corrected by using redundant data. A component that can emulate scrubbing is a D flip-flop with latch features. A flip-flop stores data, and its clock (clk) input places many restrictions on when the inputs can change the output. After analyzing the truth tables of various D flip-flops, the best choice is an edge-triggered D flip-flop: the inputs are ignored unless the clock signal is on its rising edge [5]. This allows us to scrub errors with a component that stores the data; if an error happens in the combinational logic, the correct data is fed into the voting module again. The circuit diagram of the full TMR module is given below as Figure 7, and the result of our test as Figure 8.

Figure 7. Complete TMR module
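The interaction between edge-triggered storage, voting, and scrubbing can be sketched behaviorally in Python. This is not the circuit of Figure 7: the class names, the feedback of the voted value into all three registers, and the injected upset are assumptions made to illustrate the idea.

```python
class DFlipFlop:
    """Rising-edge-triggered D flip-flop: q changes only when clk goes 0 -> 1."""

    def __init__(self) -> None:
        self.q = 0
        self._prev_clk = 0

    def tick(self, d: int, clk: int) -> int:
        if clk == 1 and self._prev_clk == 0:   # rising edge only
            self.q = d
        self._prev_clk = clk
        return self.q


def vote(a: int, b: int, c: int) -> int:
    """Majority vote, VOTE = AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)


class ScrubbedTMR:
    """Three registers behind a voter (assumed arrangement, for illustration)."""

    def __init__(self) -> None:
        self.regs = [DFlipFlop() for _ in range(3)]

    def output(self) -> int:
        return vote(*(r.q for r in self.regs))

    def clock(self, d: int, clk: int) -> int:
        """Present the same input d to all three registers."""
        for r in self.regs:
            r.tick(d, clk)
        return self.output()

    def scrub(self, clk: int) -> int:
        """Feed the voted value back into every register."""
        voted = self.output()
        for r in self.regs:
            r.tick(voted, clk)
        return self.output()


tmr = ScrubbedTMR()
tmr.clock(d=1, clk=1)            # rising edge: all registers store 1
tmr.clock(d=0, clk=1)            # no edge: the new input is ignored
tmr.regs[1].q ^= 1               # inject an upset into one register
print(tmr.output())              # 1, the fault is masked by the voter
tmr.clock(d=0, clk=0)            # falling edge: nothing changes
tmr.scrub(clk=1)                 # rising edge: corrupted register is rewritten
print([r.q for r in tmr.regs])   # [1, 1, 1]
```

The voter hides the upset immediately, but only the scrub on the next rising edge removes it, which matters because a second upset in another register before scrubbing would outvote the correct value.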
Figure 8. Test bench result

Unfortunately, I wasn't able to integrate this module with an open source core; however, Jeremy suggested that an open source CPU provided by OpenRISC is the best choice. If I were to perform the test, I would test with that core first.

CONCLUSION

With the mission to Mars underway, the astronauts who want to go will be on a one-way trip with little chance of returning. This requires a reliable vehicle that uses electric motors. Space and planetary exploration are hostile, and radiation varies from planet to planet. Therefore, the hardware must be robust enough to operate even when errors occur. The current design of the TMR module can scrub data. In conclusion, the current design should be tuned for other events, like SEUs, that could harm electronics.

ACKNOWLEDGEMENTS

I would like to thank the NASA Hawai`i Space Grant Consortium for giving me the opportunity to gain valuable experience doing hands-on design and applying my computer engineering concepts. I would also like to thank Jeremy Chan for taking the time to guide and help me throughout the course of my project.
REFERENCES

C. Woodford, "How do transistors work?", Explain that Stuff, 2016. [Online]. Available: http://www.explainthatstuff.com/howtransistorswork.html. [Accessed: 3 March 2016].

"D Flip-Flop w/ Enable", Cypress.com, 2012. [Online]. Available: http://www.cypress.com/file/133031/download. [Accessed: 13 May 2016].

"FLIP", Gitam.edu. [Online]. Available: http://www.gitam.edu/eresource/comp/gvr/6.1.htm. [Accessed: 19 March 2016].

"Introduction to Combinational Logic Functions : Combinational Logic Functions - Electronics Textbook", Allaboutcircuits.com, 2016. [Online]. Available: http://www.allaboutcircuits.com/textbook/digital/chpt-9/combinational-logic-functions/. [Accessed: 13 March 2016].

J. Ryckman, "Triple Modular Redundancy Engineering for Reliability", Sites.google.com, 2006. [Online]. Available: https://sites.google.com/site/judsonryckman/tmr. [Accessed: 21 March 2016].

J. Ryckman, "tmrdesign - judsonryckman", Sites.google.com, 2006. [Online]. Available: https://sites.google.com/site/judsonryckman/tmrdesign. [Accessed: 22 March 2016].

M. Berg, C. Poivey, D. Petrick, D. Espinosa, A. Lesea, K. LaBel, M. Friendlich, H. Kim and A. Pham, "Effectiveness of Internal vs. External SEU Scrubbing Mitigation Strategies in a Xilinx FPGA: Design, Test, and Analysis", nasa.gov, 2008. [Online]. Available: http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20080040134.pdf. [Accessed: 17 March 2016].

"OpenRISC - OpenRISC", Openrisc.io, 2016. [Online]. Available: http://openrisc.io/. [Accessed: 25 April 2016].