Impact of Intermittent Faults on Nanocomputing Devices

Cristian Constantinescu
June 28th, 2007
Dependable Systems and Networks

Outline
- Fault classes
  - Permanent faults
  - Transient faults
  - Intermittent faults
- Field fault/error data collection
- Intermittent faults
  - Impact of scaling
- Mitigation techniques
  - HW vs. SW solutions
- Summary
- Q&A

Fault Classes
- Permanent faults, e.g. stuck-at, bridges, opens
  - Reflect irreversible physical changes
  - Occur at the same location, are always active
- Transient faults, e.g. particle-induced SEU, noise, ESD
  - Induced by temporary environmental conditions
  - Occur at different locations, at random time instances
- Intermittent faults, e.g. manufacturing residues, oxide breakdown
  - Occur due to unstable, marginal hardware
  - Occur at the same location
  - May be activated and deactivated
  - Induce bursts of errors
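The location and timing criteria above can be turned into a rough triage rule for logged errors. The sketch below is an illustration only: the event format, the function names `count_bursts` and `triage`, the verdict labels and the one-hour `burst_gap` threshold are all assumptions, not something taken from the presentation. It cannot separate a permanent fault from a single long intermittent burst, so that case is reported with a neutral label.

```python
from collections import defaultdict

def count_bursts(timestamps, burst_gap):
    """Count groups of errors separated by quiet periods longer than burst_gap."""
    ordered = sorted(timestamps)
    bursts = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > burst_gap:
            bursts += 1
    return bursts

def triage(events, burst_gap=3600.0):
    """events: list of (timestamp_seconds, location) pairs (hypothetical log format).

    burst_gap is an assumed threshold: a quiet period longer than this
    is taken to mean that the fault was deactivated in between.
    """
    by_location = defaultdict(list)
    for timestamp, location in events:
        by_location[location].append(timestamp)

    verdicts = {}
    for location, stamps in by_location.items():
        if len(stamps) == 1:
            # One isolated error at this location.
            verdicts[location] = "transient candidate"
        elif count_bursts(stamps, burst_gap) > 1:
            # Same location, activated, deactivated and activated again.
            verdicts[location] = "intermittent candidate"
        else:
            # Same location, one continuous run of errors.
            verdicts[location] = "persistent at one location"
    return verdicts

# Made-up events for illustration.
events = [
    (10.0, "DIMM3"), (12.0, "DIMM3"), (90000.0, "DIMM3"), (90001.0, "DIMM3"),
    (500.0, "DIMM7"),
    (20.0, "FSB0"), (21.0, "FSB0"), (22.0, "FSB0"),
]
verdicts = triage(events)
print(verdicts)
```

A real classifier would also need workload and environmental data, since intermittent faults may only be activated under specific voltage, frequency or temperature conditions.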

Fault/Error Data Collection

Fault/Error Data Collection Study
- Servers from two manufacturers were instrumented to collect errors
  - Manufacturer A: 193 servers, 16 months
  - Manufacturer B: 64 servers, 10 months
- Examples of reported errors
  - Memory
  - Front side bus
- Failure analysis performed when possible
Source: C. Constantinescu, SELSE 2006

Server Instrumentation
[Figure: block diagram showing Event Log, CI Service, CI Device Driver, MCH, HAL, CPU and CHIPSET]
- HAL: hardware abstraction layer
- MCH: machine check handler
- CI: component instrumentation
- Instrumentation validated by fault injection

Corrected Memory Errors
[Figure: histogram of the number of systems (y-axis, 0 to 140) versus the number of single-bit errors (x-axis bins: 0, 1 to 5, 6 to 10, 11 to 50, 51 to 100, 101 to 1000, >1000); 310.7 server years]
- Servers experiencing intermittent faults: 16 out of 257, i.e. 6.2%
- Corrected single-bit errors (SBE) induced by intermittent faults: 12990 out of 16069, i.e. 80.8%
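The figures on this slide can be cross-checked against the study description (193 servers for 16 months and 64 servers for 10 months). The short sketch below redoes the arithmetic; it assumes every server was observed for the full period stated for its manufacturer, which the slides do not say explicitly.

```python
# (server count, months observed) per manufacturer, from the study slide
servers = {"A": (193, 16), "B": (64, 10)}

total_servers = sum(count for count, _ in servers.values())
# Assumption: each server contributes its manufacturer's full observation period.
server_years = sum(count * months / 12 for count, months in servers.values())

faulty_server_share = 100 * 16 / total_servers   # servers with intermittent faults
intermittent_sbe_share = 100 * 12990 / 16069     # SBE caused by intermittent faults

print(total_servers,
      round(server_years, 1),
      round(faulty_server_share, 1),
      round(intermittent_sbe_share, 1))
```

Under that assumption the totals reproduce the 257 servers, 310.7 server years, 6.2% and 80.8% quoted above, so a small fraction of the servers accounts for most of the corrected errors.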

Typical Signature of Memory Intermittent Faults
[Figure: daily number of corrected SBE (y-axis, 0 to 120) versus time in days (x-axis ticks: 80, 86, 89, 92, 95, 135, 138, 344, 445, 448)]
- Failure analysis: SBE induced intermittently by poly residue within memory chips
Source: Hynix Semiconductor

Processor Front Side Bus Errors
- Front side bus (FSB) errors
  - Bursts of single-bit errors (SBE) on data path
  - SBE detected and corrected (data path protected by ECC)

    Server 1                  Server 2
    P0     P1    P2    P3     P0     P1     P2    P3
    3264   15    0     0      108    121    97    101
    7104   20    0     0      -      -      -     -

- Servers experiencing FSB intermittent faults: 2 out of 64 (3%)
- Burst duration examples: 7104 errors in 3 sec; 3264 errors in 18 sec
- Failure analysis: intermittent contacts at solder joints
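To give a feel for how dense these bursts are, the sketch below converts the two burst examples into an average error rate and an average time between errors. Only the error counts and durations come from the slide; the variable names are illustrative, and the averages say nothing about how errors were actually distributed within each burst.

```python
# (errors, burst duration in seconds), from the burst duration examples
bursts = [(7104, 3.0), (3264, 18.0)]

# Average errors per second within each burst.
rates_per_s = [errors / seconds for errors, seconds in bursts]
# Average spacing between consecutive errors, in milliseconds.
mean_gap_ms = [1000 * seconds / errors for errors, seconds in bursts]

# Share of instrumented servers that showed FSB intermittent faults.
affected_share = 100 * 2 / 64

for (errors, seconds), rate, gap in zip(bursts, rates_per_s, mean_gap_ms):
    print(f"{errors} errors in {seconds:g} s: {rate:.1f} errors/s, "
          f"one error every {gap:.2f} ms on average")
print(f"affected servers: {affected_share:.3f}%")
```

The faster burst averages 2368 errors per second, i.e. roughly one corrected error every 0.42 ms.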

More on Intermittent Faults

Timing Violations
- BLM delamination
- Timing violations due to increased resistance; slow rise and fall times
- Intermittent behavior occurs before the fault becomes permanent; specific to the 90 nm node and beyond
- Permanent failures for previous technology nodes
Source: C. Constantinescu, SELSE 2006

Crosstalk Induced Errors
- Pulse induced by the affecting line into a victim line
- Timing violations due to crosstalk: signal speedup or delay
  - Signal speedup: two adjacent lines switch in the same direction
  - Signal delay: two adjacent lines switch in opposite directions
- Process, voltage and temperature (PVT) variations amplify crosstalk-induced skew
- Crosstalk increases with interconnect scaling and higher clock frequencies
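The speedup and delay cases can be illustrated with a common first-order model that is not part of the presentation: the coupling capacitance to one neighbor is scaled by a switching factor of 0 when the neighbor switches in the same direction, 1 when it is quiet, and 2 when it switches in the opposite direction. The sketch below applies that factor to a lumped RC delay estimate. The function name `wire_delay` and the resistance and capacitance values are hypothetical, chosen only to show the trend.

```python
import math

# Assumed switching factors applied to the coupling capacitance.
SWITCHING_FACTOR = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}

def wire_delay(r_ohm, c_ground_f, c_couple_f, neighbor):
    """50% delay of a lumped RC wire with one coupled neighbor (first-order model)."""
    c_effective = c_ground_f + SWITCHING_FACTOR[neighbor] * c_couple_f
    return math.log(2) * r_ohm * c_effective

# Hypothetical values: 1 kOhm driver plus wire, 50 fF to ground, 30 fF coupling.
R, C_GROUND, C_COUPLE = 1_000.0, 50e-15, 30e-15

delays = {case: wire_delay(R, C_GROUND, C_COUPLE, case)
          for case in ("same", "quiet", "opposite")}

for case, delay in delays.items():
    print(f"neighbor {case:>8}: {delay * 1e12:.1f} ps")
```

In this model, a path timed against a quiet neighbor arrives early when the neighbor switches in the same direction and late when it switches in the opposite direction, which is how crosstalk can turn into a hold or setup violation that appears only for certain data patterns.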

Ultra-thin Oxide Faults
- Ultrathin oxide reliability
  - Rate of defect generation decreases with supply voltage
  - Tunnel current increases exponentially with decreasing gate oxide thickness
- Soft breakdown (SBD)
  - Intermittent fluctuating current, high leakage
- SBD examples
  - Erratic erasure of flash memory cells
  - Erratic fluctuations of Vmin in SRAM
[Figure: SRAM Vmin [V] (y-axis, 0.5 to 0.8) versus time [s] (x-axis, 0 to 1500), 90 nm technology]
Source: M. Agostinelli et al, IEDM 2005

Scaling Trend of the Vmin Sensitivity
[Figure: Vmin sensitivity to gate leakage; Vmin [a.u.] (y-axis, 0 to 16) versus Rg [Ohms] (x-axis, 1.00E+07 down to 1.00E+05) for the 45 nm, 65 nm and 90 nm nodes, annotated "Increased cell sensitivity"]
Source: M. Agostinelli et al, IEDM 2005

Impact of Process Variations
- Increasingly difficult to accurately control device parameters
  - Channel length and width
  - Oxide thickness
  - Doping profile
- Intra-die variations, e.g. different transistor threshold voltages within the same SRAM cell
  - Intermittent failure of read/write operations
- Impact of process variations is increasing with scaling

Activation of Intermittent Faults
[Figure: voltage and frequency shmoo; supply voltage (y-axis, 1.20V to 1.70V) versus clock period (x-axis, 40ns to 80ns); letter-coded cells appear only in the rows between 1.45V and 1.20V]
- Voltage
- Frequency
- Temperature
- Workload

Mitigation Techniques

HW Solutions: IBM G5/G6 CPU
[Figure: two mirrored I & E-UNITS feeding a COMPARATOR, with the R-UNIT and CACHE]
- Mirrored Instruction and Execution units (I & E-units)
- Comparator and register unit (R-unit)
- Compare outputs in the n-1 instruction pipeline stage
  - No error: update checkpoint array (register content and instruction address into R-unit) in last pipeline stage and continue normal execution
  - Error detected: reset CPU (except R-unit), purge cache and its directory, reload last correct state from checkpoint array, retry
- Transient faults are recovered from
- Error threshold can be used for intermittent faults
- Permanent faults require activation of a spare CPU under OS control
Source: L. Spainhower, T. A. Gregg, IBM JR&D, 1999
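The compare, checkpoint and retry flow described above can be mimicked in a few lines of software. The sketch below is a toy model under stated assumptions, not the G5/G6 implementation: the two execution units are plain callables, the checkpoint is a single integer standing in for the R-unit checkpoint array, and the way the error threshold escalates to a spare CPU (a cumulative retry count raising an exception) is an assumption made for illustration. All names are hypothetical.

```python
def run_with_retry(program, unit_a, unit_b, error_threshold=3):
    """Execute program on two mirrored units, comparing every result.

    program: list of integer operands; unit_a, unit_b: callables (state, op) -> state.
    Returns (final_state, retries). Raises RuntimeError once the cumulative
    number of retries exceeds error_threshold (assumed escalation policy).
    """
    checkpoint = 0   # last known-good state (stand-in for the checkpoint array)
    retries = 0
    pc = 0
    while pc < len(program):
        out_a = unit_a(checkpoint, program[pc])
        out_b = unit_b(checkpoint, program[pc])
        if out_a == out_b:
            checkpoint = out_a   # no error: commit the new state
            pc += 1
        else:
            retries += 1         # mismatch: discard both results, retry from checkpoint
            if retries > error_threshold:
                raise RuntimeError("error threshold exceeded: activate spare CPU")
    return checkpoint, retries

def good_unit(state, op):
    return state + op

class FlakyUnit:
    """Adds like good_unit but flips the low result bit on selected calls."""
    def __init__(self, bad_calls):
        self.calls = 0
        self.bad_calls = set(bad_calls)

    def __call__(self, state, op):
        self.calls += 1
        result = state + op
        return result ^ 1 if self.calls in self.bad_calls else result

# A short burst of two bad results is absorbed by retrying.
final, retries = run_with_retry([1, 2, 3, 4], good_unit, FlakyUnit({2, 3}))
print(final, retries)

# A unit that is always wrong exceeds the threshold.
try:
    run_with_retry([1, 2, 3, 4], good_unit, FlakyUnit(range(1, 100)))
    escalated = False
except RuntimeError:
    escalated = True
print(escalated)
```

The first run finishes with the correct sum after two retries, while the always-faulty unit trips the threshold, mirroring the distinction the slide draws between recoverable bursts and faults that need a spare CPU.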

HW Solutions: IBM G5/G6 CPU
- Pros
  - Lower design complexity
  - Shorter development and validation time
  - No performance penalty (compare and detect cycles are overlapped)
- Cons
  - Total circuit overhead about 40%
  - It may not scale well with frequency

SW Solutions: AR-SMT
- Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT)
- Two copies of the same program run concurrently, using the SMT microarchitecture
- Results of the two threads are compared
  - A-STREAM errors are detected with a delay
  - R-STREAM errors are detected before commit
- Recovery from transient faults (e.g. particle-induced soft error) is possible
  - Use committed state of R-STREAM
[Figure: A-STREAM and R-STREAM running from FETCH to COMMIT, linked by a DELAY BUFFER]
Source: E. Rotenberg, FTCS, 1999

SW Solutions: AR-SMT
- Pros
  - AR-SMT relies on existing micro-architectural features, e.g. SMT
  - No HW overhead
- Cons
  - Increased execution time, 10% - 30%
  - Increased performance penalty or even failure in the case of bursts of high-frequency errors

Comparing Fault/Error Handling Techniques
- HW implementations are fast (e.g. ECC) and can handle bursts of errors induced by intermittent faults
- SW detection and recovery is slower
  - Performance penalty in the case of large bursts of errors
  - Near-coincident fault scenario in the case of high-rate bursts of errors: SW fault/error handling may fail before recovery is completed
- SW solutions are better suited for failure prediction and resource reconfiguration
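The near-coincident fault scenario can be made concrete with a back-of-the-envelope estimate. The sketch below takes the burst rate from the front side bus example (7104 errors in 3 sec) and asks how likely it is that another error arrives while a recovery is still in progress. The Poisson arrival model and the two recovery latencies (10 ns and 1 ms) are assumptions chosen purely for illustration; the presentation gives no recovery times.

```python
import math

def p_error_during_recovery(error_rate_hz, recovery_s):
    """Probability of at least one new error during recovery (assumed Poisson arrivals)."""
    return 1.0 - math.exp(-error_rate_hz * recovery_s)

burst_rate_hz = 7104 / 3.0   # average rate of the faster FSB burst

# Hypothetical recovery latencies, not measured values.
latencies_s = {"hardware-like (10 ns)": 10e-9, "software-like (1 ms)": 1e-3}

probabilities = {label: p_error_during_recovery(burst_rate_hz, latency)
                 for label, latency in latencies_s.items()}

for label, probability in probabilities.items():
    print(f"{label}: P(new error before recovery completes) = {probability:.6f}")
```

With these assumed numbers a nanosecond-scale correction almost never overlaps the next error, while a millisecond-scale recovery is interrupted about nine times out of ten, which is the situation in which software handling may fail before recovery is completed.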

Summary
- Semiconductor technology is a double-edged sword
  - Lower dimensions and voltages and higher frequencies have led to tremendous performance gains
  - Intermittent and transient faults have become a serious challenge to developers and manufacturers
- Designing for particle-induced soft errors is too narrowly focused
- Software-only techniques cannot effectively handle bursts of errors occurring at a high rate
- FAULT TOLERANT CHIPS ARE THE FUTURE

Q & A
Performance, Dependability