A Practical Look at SEU, Effects and Mitigation

A Practical Look at SEU, Effects and Mitigation
Ken Chapman
FPGA Network: Safety, Certification & Security
University of Hertfordshire, 19th May 2016

Premium Bonds
- Each Bond is £1
- Each stays in the system until you cash it in (or die!)
- Each Bond takes part in a monthly draw
- These 5 Bonds are still worth £5 and have taken part in over 570 monthly draws

ERNIE picks the winning bonds each month
- ERNIE: Electronic Random Number Indicator Equipment
- ERNIE 1 was unveiled in 1957
- Generated bond numbers based on signal noise from neon tubes
- Now on display at the Science Museum in London

Every month ERNIE picks the winning bonds
- 1 in every 30,000 Bonds
- There are ~60 billion Bonds in the system

Statistics
Odds = 1 in 30,000
If you have 30,000 Bonds*, does it guarantee that you win a prize every month?
- Win nothing = 37%
- 1 prize = 37%
- 2 prizes = 18%
- 3 prizes = 6%
- 4 prizes = 1%
But over a year you'll probably win ~12 prizes, and over 10 years you'll win ~120 prizes.
* Maximum permitted holding is £50,000
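The monthly percentages above can be checked with a short binomial calculation. This is only a sketch: it assumes every bond wins independently with a 1 in 30,000 chance per draw, and the function name `prize_count_probability` is made up for illustration. Under that assumption the model gives about 37%, 37%, 18% and 6% for zero to three prizes, and roughly 1.5% for four.

```python
from math import comb


def prize_count_probability(bonds: int, odds: int, prizes: int) -> float:
    """Binomial probability of winning exactly `prizes` prizes in one draw.

    Assumes each bond wins independently with probability 1/odds.
    """
    p = 1.0 / odds
    return comb(bonds, prizes) * p**prizes * (1.0 - p) ** (bonds - prizes)


if __name__ == "__main__":
    # 30,000 bonds at odds of 1 in 30,000: one prize per month on average
    for k in range(5):
        print(f"{k} prizes: {prize_count_probability(30_000, 30_000, k):.1%}")
```

The same reasoning applies to upsets: a mean of one event per interval does not mean exactly one event in every interval.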

Which prize will ERNIE give you?
Will it be a life-changing £1,000,000, or average good fortune?
12 × £25 = £300, a 1% tax-free return on investment
[Chart labels: Value; No of Prizes; 03%; 64%; 933% of Prizes]

What Did The Space Program Ever Do For Me?
- Great for space and very special situations, but is this practical?
- Engineering solutions!
- Standard products do benefit from the space program
[Figure labels: MTBF; 9,926yrs; 75 days]

Only Soft Errors
NO SEL (Single Event Latch-up)
- Proprietary design techniques, >40 patents
- Immunity to latch-up confirmed continuously by Xilinx testing
- Continuous monitoring of devices
- No reports from customers (significant quantities of devices are monitored 24/7)
- Beam testing at high energy levels
NO SEFIs (Single Event Functional Interrupts) observed
- Only significant in space (<0.04 device FIT terrestrially)
NO SETs (Single Event Transients) observed
- Large RCs on logic & DFF nets prevent occurrence
NO subtle device behaviour changes observed
- No performance or frequency degradation
- Negligible effects on power consumption
Upsets only occur in memory cells
- Values flip from 0 to 1 or from 1 to 0
Soft Errors Only

Over 17 Years of Rosetta and Beam Testing

Being Practical Begins and Ends With UG116
- Use the known to deal with the unknown!
- Always use the latest version: http://www.xilinx.com/support/documentation/user_guides/ug116.pdf
- But what does this mean in practical terms?

Some Xilinx SEU History
- Xilinx is the only FPGA vendor that openly publishes SEU and Soft Error Rate measurements (see UG116)
- Observations and experiences of devices in the real atmosphere, as well as during beam experiments, have enabled Xilinx to understand the susceptibility of our devices
- Improvements are generally by design. We didn't just get lucky!
- Use known published data to make informed and relevant decisions about today's devices
[Timeline labels: 1998 (250nm); 2003; 2012 (Now); 2015 (Now)]

7-Series FIT Rate
- FIT: Failures In Time, where Time = 10^9 hours = 114,155 years
- SER (Soft Error Rate): frequency of soft error occurrences
- 81 upsets in 114,155 years for every 1 million bits of configuration memory
- 135 × 36 × 1024 = 4,976,640 bits are BRAM contents
- 30,606,304 - 4,976,640 = 25,629,664 fixed configuration bits
- 81 × 25.6 = 2,074 FIT **
- 10^9 / 2,074 = 482,160 hours = 20,090 days = 55 years
** This is close enough for an estimate
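The arithmetic on this slide can be written as a short sketch. The figures (81 FIT per million bits, the bit counts, 10^9 hours) are taken from the slide; the function names are illustrative only. The slide rounds the bit count to 25.6 Mb and gets 2,074 FIT, while the unrounded value gives about 2,076 FIT. Both come to roughly 55 years.

```python
FIT_HOURS = 1e9        # one FIT = one event per 10^9 device hours
FIT_PER_MBIT = 81.0    # upsets per 10^9 hours per million configuration bits (slide figure)


def configuration_fit(config_bits: int, fit_per_mbit: float = FIT_PER_MBIT) -> float:
    """Nominal FIT rate for a given number of configuration bits."""
    return config_bits / 1e6 * fit_per_mbit


def mean_years_between_upsets(fit: float) -> float:
    """Convert a FIT rate to mean time between upsets in years (365-day years)."""
    return FIT_HOURS / fit / 24 / 365


if __name__ == "__main__":
    total_bits = 30_606_304        # total bits quoted on the slide
    bram_bits = 135 * 36 * 1024    # BRAM contents, excluded from the estimate
    fit = configuration_fit(total_bits - bram_bits)
    print(f"{fit:.0f} FIT, {mean_years_between_upsets(fit):.0f} years between upsets")
```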

What Do The 7-Series Figures Tell Us?
Operating the following devices at sea level in New York, the mean time between upsets will be:
- Artix 7A100T: 55 Years
- Artix 7A200T: 22 Years
- Kintex 7K70T: 74 Years
- Kintex 7K325T: 19 Years
- Virtex 7VX690T: 8 Years
- Virtex 7V2000T: 4 Years
Now you know why real data collection takes lots of devices and time.
Now you know why Xilinx also goes beam testing.

Scaling Factors
http://www.seutest.com/cgi-bin/fluxcalculator.cgi
- Real figures should be scaled for the working environment
- Reference: Sea Level New York, Relative Flux 1.00
- Xilinx also provide an SEU FIT Rate Calculator

Scaling For Ground Based Products
(Includes aircraft operating at lower altitudes)
Useful scaling to remember: 17× covers anywhere on the surface of the Earth
Reference: Longmont, Colorado, 4,978ft amsl, Flux 3.52
The following devices, operated anywhere normal on the surface of the Earth, will experience upsets less frequently than:
- Artix 7A100T: 1,181 Days (3 Years)
- Artix 7A200T: 470 Days
- Kintex 7K70T: 1,583 Days (4 Years)
- Kintex 7K325T: 403 Days
- Virtex 7VX690T: 172 Days
- Virtex 7V2000T: 76 Days
But a ground based product may need to operate 24/7 for many years

Altitude 40,000ft Anywhere
Useful scaling to remember: 500× covers 40,000ft anywhere
Operating the following devices at 40,000 feet, the mean time between upsets will be:
- Artix 7A100T: 40 Days
- Artix 7A200T: 16 Days
- Kintex 7K70T: 54 Days
- Kintex 7K325T: 14 Days
- Virtex 7VX690T: 6 Days (that's a long time to sit in economy)
- Virtex 7V2000T: 3 Days
A device in a high utilization long haul aircraft could expect to experience a few flights a year in which an upset occurs
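The scaling for the ground (17) and 40,000ft (500) cases amounts to multiplying the nominal FIT rate by a relative flux. A minimal sketch follows, using the 6,087 FIT figure quoted later in the talk for the XC7K325T. The function name is hypothetical, and the slides appear to round some entries differently, so not every table value will match to the day.

```python
def scaled_days_between_upsets(nominal_fit: float, relative_flux: float) -> float:
    """Mean days between upsets after scaling the nominal FIT rate.

    nominal_fit is the sea-level New York rate; relative_flux is the
    environment multiplier (17 and 500 are the talk's rules of thumb).
    """
    hours = 1e9 / (nominal_fit * relative_flux)
    return hours / 24


if __name__ == "__main__":
    k325t_fit = 6087.0  # nominal XC7K325T rate quoted later in the talk
    print(f"Ground, worst case: {scaled_days_between_upsets(k325t_fit, 17):.0f} days")
    print(f"40,000ft:           {scaled_days_between_upsets(k325t_fit, 500):.0f} days")
```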

SEU Detection
- Built-in Readback CRC continuously scans the configuration cells
- Can be completely independent of user design
- When CRC is incorrect at end of scan, the INIT_B pin is driven Low (INIT_B=0, CRCERROR=1)
- Scan time depends on device size and clock frequency (4.6ms to 54.1ms)
- XC7A200T scan time 18.3ms at FMAX
- XC7K325T scan time 23.5ms at FMAX
Example with a 20ms scan:
- What is the longest time between an upset occurring and the error being reported? 40ms
- What is the shortest time between an upset occurring and the error being reported? 0ms
- What is the average time between an upset occurring and the error being reported? 20ms

Error Correction
- 7-Series also has error correction built-in
- Automatically corrects all single bit per frame upsets (the most common type)
- Readback CRC mechanism still used to scan the device; CRC provides redundancy for ECC
- Each frame (101 × 32 = 3,232 bits) has an Error Correcting Code (ECC)
- Detects an error as the frame containing the error is scanned: 50% reduction in average detection time
- Identifies location of a single bit error within that frame
- Correction time <1ms
[Timing diagram labels: 20ms; ECCERROR=1; INIT_B=1; CRCERROR=0]
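The detection-time answers for the 20ms example, and the 50% reduction attributed to ECC, can be reproduced with a simple timing model. This sketch assumes the upset time and the scan position of the affected cell are both uniformly distributed over the scan period, that the CRC result is reported only at the end of the first scan that reads the flipped cell, and that ECC flags the error the moment the affected frame is read. All function names are hypothetical.

```python
import random


def crc_latency_ms(scan_ms: float, upset_ms: float, cell_ms: float) -> float:
    """Time from upset to CRC report at the end of the scan that reads the flipped cell.

    upset_ms and cell_ms are offsets within the current scan period.
    """
    if upset_ms <= cell_ms:
        return scan_ms - upset_ms        # cell not yet read: caught by the current scan
    return 2 * scan_ms - upset_ms        # cell already read: caught by the next scan


def ecc_latency_ms(scan_ms: float, upset_ms: float, cell_ms: float) -> float:
    """Time from upset until the scanner next reads the affected frame."""
    return (cell_ms - upset_ms) % scan_ms


def simulate(latency_fn, scan_ms: float = 20.0, trials: int = 100_000, seed: int = 1):
    """Return (min, mean, max) latency over random upset times and locations."""
    rng = random.Random(seed)
    samples = [
        latency_fn(scan_ms, rng.uniform(0.0, scan_ms), rng.uniform(0.0, scan_ms))
        for _ in range(trials)
    ]
    return min(samples), sum(samples) / trials, max(samples)


if __name__ == "__main__":
    for name, fn in (("CRC", crc_latency_ms), ("ECC", ecc_latency_ms)):
        lo, mean, hi = simulate(fn)
        print(f"{name}: min {lo:.1f}ms, mean {mean:.1f}ms, max {hi:.1f}ms")
```

Under these assumptions the CRC case averages one full scan period and the ECC case half a scan period, which is where the halving of average detection time comes from.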

When ECC alone is not enough!
- Single Bit Error (SBE)
- Adjacent Frame Double Bit Error = 2 SBE
- Same Frame Double Bit Error (DBE)

What Effect Does An Upset Have On My Design?
Error injection is a VERY powerful tool (partial reconfiguration)
- Not available in ASIC or fixed configuration devices (only pre-defined error injection points are practical within an ASIC design)
- It's like having a proton beam on my desk, but better
Evaluate SEU susceptibility of a particular design
- What proportion of upsets affect the design?
- What happens when they do?
- How many upsets are critical to operation? Where and what is the weakest link?
Evaluate and test all your mitigation strategies
- Does your system correctly handle and report errors?
- Does your TMR scheme really see you through (hard and soft errors)?

The Proton Beam for Your Desk!
XC7K325T on KC705 Board
400 Break Me Modules represent a design filling ~90% of the device
[Block diagram: KCPSM6 with 4K ROM, UART_RX6 and UART_TX6 (with FIFOs) connected to the SEM IP through its Status, Error Injection and Monitor interfaces (icap_grant, status_heartbeat); ICAPE2 and FRAME_ECC2 primitives; CRCERROR and dedicated INIT_B outputs; a 24-bit counter driving led[7:0] from bits [23:16]; an 8-bit counter]

Break Me Module! (× 400)
- DSP circuits: ~57 Slices, 2 DSP48E1. Two identical chains (counter, 256 × 18 ROM, LFSR, DSP48E1) are compared; any mismatch is latched as DSP Failure
- ROM CRC circuit: 8 Slices, 1 BRAM (36kb). A CRC calculator checks the ROM contents; any mismatch is latched as ROM Failure
- PicoBlaze: 32 Slices. KCPSM6 with a 2K program ROM in dual port BRAM and DEFAULT_JUMP; failure is latched as KCPSM6 Failure
- Other logic: ~13 Slices
Total size of each module: 1 BRAM, 2 DSP48E1, ~110 Slices

The Proton Beam for Your Desk! Today's target
Target Device : xc7k325t
Design Summary
--------------
Number of occupied Slices: 44,405 out of 50,950 87%
Number of RAMB36E1/FIFO36E1s: 411 out of 445 92%
Number of RAMB18E1/FIFO18E1s: 4 out of 890 1%
Number of DSP48E1s: 800 out of 840 95%
Area by design feature:
- DSP circuits ~52%
- PicoBlaze circuits ~40%
- ROM CRC circuits ~7%
- SEM IP and system controller ~1%
For an XC7K325T, each simulated SEU (arrow!) is equivalent to:
- 19 Years at Sea Level New York
- 403 Days worst case anywhere on the surface of the Earth
- 14 Days worst case anywhere at 40,000ft

Results From My Desk!
Different circuits have different susceptibility
- 500 simulated SEU are equivalent to 18 Years of worst case continuous operation at 40,000ft
- Each dot represents a frame in which an error was injected
- Red dots represent upsets that resulted in disturbance to operation of a Break Me circuit
- Most SEU have no effect
- Simulating SEU in your design helps you to observe the susceptibility of your circuits and focus on the effects on the important ones
Relative Susceptibility (percentage of observed disruptions to operation, normalised to area occupied by feature):
- DSP circuits: 59%
- PicoBlaze circuits: 17%
- ROM CRC calculator: 24%
- SEM IP and system controller: 0%

Break Me: Designed To Break AND Report It!
- Latches any difference between two identical circuits
- Just 1 bit for 1 clock cycle is captured and reported
- 100% known input data!
- Matching 48-bit results every clock cycle
[Diagram: two identical chains (counter, 256 × 18 ROM, LFSR, DSP48E1 with 48-bit result) feed a comparator whose output is latched as DSP Failure]
However, a real DSP algorithm (e.g. FIR filter or FFT):
- Computes results for sets of data samples which are unknown variables
- Most calculation errors will be completely indistinguishable from signal noise
- The upset will be temporary (e.g. <23ms)
- Naturally flushes with clean data and results following correction
Very low probability of any meaningful or observable disturbances

Break Me: PicoBlaze Susceptibility or DVF
- PicoBlaze + interfacing logic = ~50 Slices (similar to a typical application)
- 400 PicoBlaze circuits occupy ~40% of the XC7K325T device
- 500 simulated SEU resulted in 16 disturbances to PicoBlaze operation
- PicoBlaze circuit susceptibility = (100% / 40%) × (16/500) = 0.08
- Design Vulnerability Factor (DVF) = 8%, i.e. only 1 in 12.5 SEU landing within the area occupied by a PicoBlaze circuit has an effect
- One (1) PicoBlaze circuit occupies ~0.1% of XC7K325T Slices
- Nominal SEU rate of XC7K325T device is 6,087 FIT (19 Years)
- 1 PicoBlaze circuit = 6,087 × 0.1% × 8% = 0.49 FIT (234,424 Years)
- Anywhere on Earth (17×): PicoBlaze circuit = 8 FIT (13,789 Years)
- 40,000ft anywhere (500×): PicoBlaze circuit = 245 FIT (469 Years)
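The DVF calculation on this slide can be expressed as a few lines of arithmetic. The input figures come from the slide; the function names are illustrative only. The slide's year values appear to be rounded in different directions, so the sketch matches them to within a year rather than exactly.

```python
def design_vulnerability_factor(area_fraction: float, disturbances: int, injections: int) -> float:
    """Fraction of upsets landing in a circuit's area that disturb its operation."""
    return (disturbances / injections) / area_fraction


def circuit_fit(device_fit: float, circuit_area_fraction: float, dvf: float,
                relative_flux: float = 1.0) -> float:
    """Effective FIT rate of one circuit, optionally scaled for the environment."""
    return device_fit * circuit_area_fraction * dvf * relative_flux


def years_between_events(fit: float) -> float:
    """Convert FIT (events per 10^9 hours) to mean years between events."""
    return 1e9 / fit / (24 * 365)


if __name__ == "__main__":
    dvf = design_vulnerability_factor(0.40, 16, 500)   # 400 circuits in ~40% of the device
    for label, flux in (("Nominal", 1), ("Anywhere on Earth", 17), ("40,000ft", 500)):
        fit = circuit_fit(6087.0, 0.001, dvf, flux)    # one circuit is ~0.1% of the device
        print(f"{label}: {fit:.2f} FIT, {years_between_events(fit):,.0f} years")
```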

Categorisation of Events
Observed results for a variety of real applications (normalised for device utilisation)
- 100% detection: all SEU are detected and reported
- 60-80% completely miss the design: these upsets will never impact operation
- 10-40% touch the design but either have no effect on operation at all, or no effect could be observed
- <10% will be observed to have any effect, e.g. PicoBlaze ~8% (in Break Me design); 2-5% is a typical observation rate
- <1% impact product functionality (i.e. the ones that actually matter)

Typical Design Operational Disturbance Rates
Kintex 7K325T (continuous operation of >80% utilized device)
Environment | SEU Detection Rate | Operational Disturbance Rate
Nominal | 19 Years | 190 to 950 Years
Anywhere on Earth (17×) | 403 Days | 10 to 51 Years
Anywhere at 40,000ft (500×) | 14 Days | 137 Days to 2 Years
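The nominal row is consistent with between 10% and 2% of detected upsets actually disturbing operation (19 / 0.10 = 190 and 19 / 0.02 = 950). Those two fractions are inferred from the arithmetic rather than stated on the slide, and the function name below is hypothetical. The other rows follow the same pattern but use the slide's own rounding, so they are not reproduced exactly here.

```python
def disturbance_interval(detection_interval: float, disturbing_fraction: float) -> float:
    """Mean time between operational disturbances.

    detection_interval: mean time between detected upsets (any time unit).
    disturbing_fraction: assumed share of upsets that disturb operation (0 to 1).
    """
    return detection_interval / disturbing_fraction


if __name__ == "__main__":
    nominal_years = 19.0
    low = disturbance_interval(nominal_years, 0.10)   # assumed pessimistic fraction
    high = disturbance_interval(nominal_years, 0.02)  # assumed typical fraction
    print(f"Nominal: {low:.0f} to {high:.0f} years between disturbances")
```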

How Do So Many SEU Miss My Design?
- Break Me design fills ~90% of the device, but what does "used" actually mean?
- In a typical real design only 20% to 40% of configuration bits are used
- So that means 60% to 80% of upsets miss the design altogether (false alarms?)

What Happens to X If...
[Circuit diagram: flip-flops Enable, A, B and C; a 3-input LUT (inputs I2, I1, I0; output O); flip-flop X with D, CE and R inputs; Reset]

Nothing Happens to X Unless...
When the upset is present (e.g. a 23ms window):
- Enable = 1
- A changes state
- B = 0 and C = 1
- Reset = 0
[Circuit diagram: flip-flops Enable, A, B and C; a 3-input LUT (inputs I2, I1, I0; output O); flip-flop X with D, CE and R inputs; Reset]
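The masking idea can be illustrated with a toy model. This is a sketch under several assumptions that the slide does not confirm: the upset flips a single truth-table bit of the LUT, A, B and C drive I2, I1 and I0 respectively, the reset is synchronous, and the LUT contents (`0xE8`) are an arbitrary example. It counts how many input and state combinations let the flipped bit change what X captures.

```python
from itertools import product


def lut_output(init: int, i2: int, i1: int, i0: int) -> int:
    """Read one bit of an 8-entry LUT truth table addressed by (i2, i1, i0)."""
    index = (i2 << 2) | (i1 << 1) | i0
    return (init >> index) & 1


def next_x(x: int, init: int, a: int, b: int, c: int, enable: int, reset: int) -> int:
    """Next state of flip-flop X (assumed synchronous reset, then clock enable)."""
    if reset:
        return 0
    if enable:
        return lut_output(init, a, b, c)
    return x


def count_visible_upsets(init: int, flipped_bit: int):
    """Count the cases in which a flipped LUT bit changes the value X captures."""
    upset_init = init ^ (1 << flipped_bit)
    visible = total = 0
    for x, a, b, c, enable, reset in product((0, 1), repeat=6):
        total += 1
        if next_x(x, init, a, b, c, enable, reset) != next_x(x, upset_init, a, b, c, enable, reset):
            visible += 1
    return visible, total


if __name__ == "__main__":
    # Bit 1 is the entry addressed by A=0, B=0, C=1
    visible, total = count_visible_upsets(0xE8, flipped_bit=1)
    print(f"Upset visible in {visible} of {total} cases")
```

In this model the upset only reaches X when Enable = 1, Reset = 0 and the inputs address the flipped entry, which is why most upsets that touch a design still have no observable effect.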

Risk Assessment: Whole Device
Let's take a look at the XC7K325T, which is a mid-range Kintex-7 device
- 326,000 logic cells (i.e. not small!)
- 0.5Mb of available user flip-flops
- 75.1Mb of static configuration
- 16.5Mb of available user RAM
[Figure scale label: 1Mb]

Risk Assessment: Resources Actually Used
Every design is different, so it is obviously better to work with actual values. But let's accept some typical figures for now:
- 0.15Mb of used flip-flops
- 30Mb of used or "Essential Bits"
- 12Mb of used RAM
- 1-7Mb of "Critical Bits"

Risk Assessment: Not all flip-flops are the same!
Flip-flops and RAM can easily be 10 to 4,000 times more susceptible
- 0.15Mb of flip-flops fabricated using a typical ASIC process
- 0.15Mb of used flip-flops
- 30Mb of used or "Essential Bits"
- 12Mb of used RAM
- 1-7Mb of "Critical Bits"
Know what the figures are and what they mean before you make a decision

Being Practical Begins and Ends With UG116
- Use the known to deal with the unknown!
- Always use the latest version: http://www.xilinx.com/support/documentation/user_guides/ug116.pdf