Soft Errors Re-examined
Jamil R. Mazzawi, Founder and CEO
Optima Design Automation Ltd
www.optima-da.com
v1.2

Topics:
- Soft errors: definitions
- FIT rate
- Why the soft-error problem is growing in new nodes
- Logical masking and derating
- Mitigation techniques
- Flip-flop selection
- CosmicASIC

Soft errors
- Cosmic particles influence our chips
- Particles can flip the values in flops and memory bits

Measuring soft errors: FIT rate
- FIT: Failure In Time, the number of failures in 1 billion ($10^{9}$) hours
- $\mathrm{FIT} = 10^{9} / \mathrm{MTBF}$, with MTBF in hours
- FIT of a system: $\mathrm{FIT}_{\text{system}} = \sum_i \mathrm{FIT}_i$, where $i$ runs over all its components
- FIT of a server farm = sum of the FIT of all its servers, routers, etc.
- FIT of a chip = sum of the FIT of all its flops, memory bits, combinational logic, etc.

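The two relations above (FIT from MTBF, and system FIT as a sum over components) can be sketched in a few lines of Python. The function names and the component values are assumptions made up for illustration; they are not figures from this talk.

```python
HOURS_PER_BILLION = 1e9  # FIT counts failures per 10^9 hours


def fit_from_mtbf(mtbf_hours: float) -> float:
    """Convert an MTBF given in hours to a FIT rate."""
    return HOURS_PER_BILLION / mtbf_hours


def mtbf_from_fit(fit: float) -> float:
    """Convert a FIT rate back to an MTBF in hours."""
    return HOURS_PER_BILLION / fit


def system_fit(component_fits) -> float:
    """FIT of a system is the sum of the FIT of its components."""
    return sum(component_fits)


# Hypothetical per-component FIT values, for illustration only.
components = {"flops": 3000.0, "memories": 500.0, "combo_logic": 50.0}
total = system_fit(components.values())
print(f"chip FIT = {total}, MTBF = {mtbf_from_fit(total):.0f} hours")
```

Because FIT values add, a budget for a whole system can be split across its components, which is what the next example does.
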
Example: FIT requirement of a chip
- Server farm for bank XYZ, with 1000 servers
- Required MTBF(farm) = 1 year, so MTBF(each server) = 1000 years
  - This includes the power supply, fan, memory, the CPU chip and other chips
- MTBF(CPU chip) = 1200 years
- $\mathrm{FIT}(\text{CPU}) = \dfrac{10^{9}}{1200 \times 365.25 \times 24} = \dfrac{114077}{1200} \approx 95$
- Given: FIT(single flip-flop) = 0.01 (@NYC)
- Given: the chip has 300,000 flops
- FIT(all flops) = 300,000 × 0.01 = 3,000 > 95: we have a problem
- This does not include:
  1. Derating factors
  2. Other components of the chip (e.g. memories)

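The arithmetic of this example can be checked with a short script. It uses only the numbers given on the slide. The helper name `fit_budget` is an assumption, and so is the even split of the farm budget across identical servers.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, as used in the example


def fit_budget(mtbf_years: float) -> float:
    """FIT rate that corresponds to a required MTBF given in years."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)


farm_fit = fit_budget(1)        # farm MTBF = 1 year
server_fit = farm_fit / 1000    # assumption: 1000 identical servers share the budget
cpu_budget = fit_budget(1200)   # roughly 95 FIT
raw_flop_fit = 300_000 * 0.01   # 3,000 FIT before any derating
overshoot = raw_flop_fit / cpu_budget

print(f"CPU budget:   {cpu_budget:.0f} FIT")
print(f"Raw flop FIT: {raw_flop_fit:.0f} FIT ({overshoot:.1f}x over budget)")
```

The per-server value obtained by dividing the farm FIT by 1000 matches `fit_budget(1000)`, which is the 1000-year server MTBF stated in the example.
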
The problem is getting worse these days
- Newer technologies are more sensitive
  - Smaller transistor dimensions => smaller critical charge => the charge deposited by a particle is larger relative to the critical charge
- Two effects that cancel each other:
  - Smaller area per transistor decreases the per-transistor FIT rate
  - More transistors per mm² increase the total FIT of the chip
  - Together, they have almost no influence on the FIT rate

Where is it important:
- Memories: the only area that needed protection in older nodes. Solution: ECC protection. (Solved problem!)
- Flip-flops: flops must be protected in newer technology nodes. (Hottest unsolved problem!)
- Combinatorial logic: second-degree problem. (Not a problem yet)

Single Event Upset vs. soft error
- SEU: a particle caused a flip-flop or memory bit to flip its value
- Soft error: an SEU has propagated and caused a system failure seen from outside
- Most SEUs do not convert to soft errors

Most SEUs do not convert to soft errors (Ilan Beer, IBM, HVC 2008)
- Definition: FIT rate with derating factors
  - FIT calculated taking into account vanishing SEUs, i.e. SEUs that never propagate to a failure

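One way to read this definition is to weight each flop's raw FIT by the probability that an SEU on that flop becomes a soft error. The sketch below assumes exactly that multiplicative model; the function name and the probabilities are hypothetical, and only the 0.01 raw FIT per flop comes from the earlier example.

```python
def derated_fit(raw_fits, propagation_probs):
    """Sum raw per-flop FIT, each weighted by the chance its SEU becomes a soft error."""
    if len(raw_fits) != len(propagation_probs):
        raise ValueError("need one propagation probability per flop")
    return sum(f * p for f, p in zip(raw_fits, propagation_probs))


# Four flops with the raw FIT of 0.01 from the earlier example.
raw = [0.01] * 4
# Hypothetical propagation probabilities: 1.0 = never masked, 0.0 = always masked.
probs = [1.0, 0.5, 0.1, 0.0]

raw_total = sum(raw)
derated_total = derated_fit(raw, probs)
print(f"raw FIT = {raw_total:.3f}, derated FIT = {derated_total:.3f}")
```

Under this model the derated FIT can never exceed the raw FIT, and flops whose upsets always vanish contribute nothing.
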
Common mitigation methods:
- TMR with majority voting
- DMR with C-element
- Soft-error detectors
- Soft-error detection with a parity tree
- And more

Solution 1: TMR with majority voter
- TMR: Triple Modular Redundancy
- Extra area ~ +205%, extra power ~ +205%, FIT = 0 (-100%)

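A minimal behavioural sketch of a TMR storage element with a majority voter follows. The class and method names are assumptions for illustration, and the model ignores timing and upsets in the voter itself.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each output bit takes the value held by at least two inputs."""
    return (a & b) | (b & c) | (a & c)


class TmrFlop:
    """Three copies of one stored bit, read through a majority voter."""

    def __init__(self, value: int = 0):
        self.copies = [value, value, value]

    def write(self, value: int) -> None:
        self.copies = [value, value, value]

    def upset(self, index: int) -> None:
        """Simulate an SEU by flipping one of the three copies."""
        self.copies[index] ^= 1

    def read(self) -> int:
        return majority(*self.copies)


flop = TmrFlop(1)
flop.upset(0)                          # single upset: outvoted by the two good copies
print(flop.read())
flop.write(flop.read())                # rewriting the voted value clears the bad copy
```

A single flipped copy is always outvoted. In this model, two copies flipped before the next write would win the vote, so the scheme relies on such double upsets being very unlikely.
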
Solution 2: DMR with C-element
- DMR: Dual Modular Redundancy, using an additional C-element
- Additional area and power > +100%, FIT reduced to 5% of its original value

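A C-element passes its inputs through when they agree and holds its previous output when they disagree, which is how a single flipped copy is kept from reaching the output. The sketch below is a behavioural model only; the class name is an assumption, and it does not model how long a real circuit can hold the old value.

```python
class CElement:
    """Behavioural model of a C-element fed by the two copies of a DMR flop."""

    def __init__(self, initial: int = 0):
        self.out = initial

    def update(self, a: int, b: int) -> int:
        # Inputs agree: follow them. Inputs disagree: keep the previous output.
        if a == b:
            self.out = a
        return self.out


c = CElement(0)
print(c.update(1, 1))  # both copies hold 1, the output follows
print(c.update(0, 1))  # one copy upset, the output holds its previous value
print(c.update(0, 0))  # both copies agree again, the output follows
```
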
Solution 3: Soft-error detectors
- These techniques are usually used for detecting single-bit flips in pipeline storage elements.
- One simple method is to duplicate the critical node and connect the two outputs to an XOR gate.
- Additional area and power are about +100%.

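The duplicate-and-XOR method can be sketched on a multi-bit register as follows. The function name and the register value are assumptions for illustration.

```python
def detect_mismatch(primary: int, shadow: int) -> int:
    """Per-bit XOR of a register and its duplicate; non-zero means an upset was seen."""
    return primary ^ shadow


value = 0b1011_0010
primary = value
shadow = value                 # duplicated storage node

primary ^= 1 << 4              # simulate an SEU on bit 4 of the primary copy
flags = detect_mismatch(primary, shadow)
error_detected = flags != 0
print(error_detected, bin(flags))
```

The XOR only detects the mismatch; it cannot tell which copy is wrong, so this method needs a recovery step such as a pipeline flush or replay.
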
Summary of different solutions
(family: technique description; extra area; extra power; FIT)
- TMR: triplicated storage elements with a majority voter at the output (Triple Modular Redundancy with majority voting), or three time-delayed storage nodes; +200%; +200%; down to 0
- DMR: Dual Modular Redundancy with a C-element; +105%; +100%; down by 95%
- Error detection: copy of the storage element, using the already existing scan design-for-testability logic; +20%; ~+15%
- Error detection: duplicated storage element with XOR; +103%; ~+100%
- Parity tree: parity tree, using a transient detector; used in pipelines and recoverable models; ---; ---; down to 0; performance penalty, not always possible
- Etc.

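For the parity-tree entry, a minimal sketch of parity-based detection is given below. It assumes one stored parity bit per word and models the XOR tree with a sequential reduction, which gives the same result as a hardware tree. The function names are hypothetical.

```python
from functools import reduce
from operator import xor


def parity(bits) -> int:
    """XOR of all bits; a hardware parity tree computes the same value."""
    return reduce(xor, bits, 0)


def write_word(bits):
    """Store the data bits together with their parity bit."""
    bits = list(bits)
    return bits, parity(bits)


def check_word(bits, stored_parity: int) -> bool:
    """True if the stored parity no longer matches the data."""
    return parity(bits) != stored_parity


data, p = write_word([1, 0, 1, 1])
data[2] ^= 1                      # simulate a single-bit SEU
print(check_word(data, p))
```

A single parity bit flags any odd number of flipped bits. An even number of flips in the same word goes unnoticed.
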
Flip-flop selection is needed
- Hardening all flops is not viable
  - Silicon cost: 25%-35%
  - Influence on: unit cost, NRE cost and power
- Solution: apply these techniques selectively
  - Harden the flops that are more sensitive to SEUs
  - A flop being sensitive to SEUs means: an SEU on this flop has a higher probability of converting to a soft error

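Selective hardening can be sketched as a greedy selection: harden the flops with the highest derated FIT first, and stop once the chip meets its budget. This is only an illustration of the idea, not a description of any particular tool. The flop names, FIT values, budget and the `residual` parameter are all hypothetical.

```python
def select_flops_to_harden(flop_fit, budget, residual=0.0):
    """Pick flops to harden, most sensitive first, until total FIT <= budget.

    flop_fit: dict mapping flop name to its derated FIT.
    residual: fraction of a flop's FIT that remains after hardening
              (0.0 models a TMR-like fix, 0.05 a DMR-like one).
    Returns (names of hardened flops, resulting total FIT).
    """
    total = sum(flop_fit.values())
    chosen = []
    ranked = sorted(flop_fit.items(), key=lambda kv: kv[1], reverse=True)
    for name, fit in ranked:
        if total <= budget:
            break
        total -= fit * (1.0 - residual)
        chosen.append(name)
    return chosen, total


# Hypothetical derated FIT per flop.
flops = {"fsm_state": 0.008, "ctrl_valid": 0.006, "dbg_cnt": 0.0005, "spare": 0.0001}
hardened, remaining = select_flops_to_harden(flops, budget=0.002)
print(hardened, round(remaining, 4))
```

In this toy case two of the four flops carry most of the derated FIT, so hardening only those two meets the budget.
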
Existing selection methods: error-injection simulation
- Run a lot of simulations
- Each simulation injects a single error on a random flop, at a random cycle (simulating an SEU)
- If the test-bench detects an error, this SEU is a soft error
- How many simulations to run?
  - Option 1: loop over all flops and all cycles
  - Option 2: select random flops and random cycles to inject errors on (lower accuracy)

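The two options can be sketched on a toy design. Everything in this example is an assumption made for illustration: the four-flop design, the golden-trace comparison standing in for the test-bench, and the function names. A real flow would drive an RTL or gate-level simulator instead.

```python
import random

FLOPS = ["cnt0", "cnt1", "tmp", "dbg"]


def step(state, cycle):
    """One clock cycle of a toy design; returns (next_state, observed_output)."""
    cnt = state["cnt0"] | (state["cnt1"] << 1)
    # The test-bench observes the counter every cycle, and tmp on odd cycles only.
    out = (cnt, state["tmp"] if cycle % 2 else 0)
    nxt = (cnt + 1) & 0b11
    next_state = {
        "cnt0": nxt & 1,
        "cnt1": nxt >> 1,
        "tmp": state["cnt0"],     # overwritten every cycle
        "dbg": state["dbg"] ^ 1,  # never observed
    }
    return next_state, out


def run(n_cycles, inject=None):
    """Simulate; inject=(flop, cycle) flips that flop at the start of that cycle."""
    state = {name: 0 for name in FLOPS}
    trace = []
    for cycle in range(n_cycles):
        if inject is not None and inject[1] == cycle:
            state[inject[0]] ^= 1
        state, out = step(state, cycle)
        trace.append(out)
    return trace


def is_soft_error(flop, cycle, n_cycles, golden):
    """An SEU counts as a soft error if the observed trace differs from the golden run."""
    return run(n_cycles, (flop, cycle)) != golden


def exhaustive(n_cycles):
    """Option 1: inject on every flop at every cycle."""
    golden = run(n_cycles)
    return {
        f: sum(is_soft_error(f, c, n_cycles, golden) for c in range(n_cycles)) / n_cycles
        for f in FLOPS
    }


def sampled(n_cycles, n_runs, seed=0):
    """Option 2: inject on random flops at random cycles."""
    rng = random.Random(seed)
    golden = run(n_cycles)
    hits = {f: 0 for f in FLOPS}
    tries = {f: 0 for f in FLOPS}
    for _ in range(n_runs):
        flop, cycle = rng.choice(FLOPS), rng.randrange(n_cycles)
        tries[flop] += 1
        hits[flop] += is_soft_error(flop, cycle, n_cycles, golden)
    return {f: (hits[f] / tries[f] if tries[f] else None) for f in FLOPS}


print(exhaustive(16))
print(sampled(16, 400, seed=1))
```

Option 1 needs flops × cycles simulations, which is why it does not scale beyond small designs, while Option 2 trades accuracy for run time. In the toy design the counter flops always produce a soft error, `tmp` does so only when the upset lands in a cycle where it is observed, and `dbg` never does, so `dbg` would be the last candidate for hardening.
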
Error-injection simulation: benefits and drawbacks
- Benefits:
  - Almost the only available option now
- Drawbacks:
  - Time consuming: 2-4 weeks, even with a low sample rate
  - Compute-resource consuming: 2-4 weeks x 10-20 machines during peak project time
  - Internal/in-house solution: needs someone to develop and maintain it
  - Solution available only to big companies

Introducing: CosmicASIC
- 1000x faster than existing solutions
- Plug-and-play solution
- 100% accuracy

Summary
- The soft-error problem is growing
- Mitigation techniques exist, but can cost 25%-35% in silicon, NRE and power
- Flip-flop selection is a must
  - Solves the soft-error problem at a fraction of the cost
- CosmicASIC: flip-flop selection EDA tool
- Visit us at booth A03 in the exhibition area, or at http://www.optima-da.com