Post Silicon Electrical Validation Lecture 2 Tony Muilenburg
Agenda Topics for Today Homework Q&A Escapes Future outlook Product lifecycle Manufacturing flow Validation Disciplines Electrical Validation Disciplines
Homework Q&A: Questions about Homework 1? Homework 1 is due Tuesday, Jan 14 at 7pm. Each assignment is due seven days later, at 7pm, before class starts. Please send your preferred email address as soon as possible. Reading assignments are most useful when completed before the next class.
Recalls / Escapes
Effective SI is Pre-Product Release. It costs less here. Why? Time = $. [Figure: cost of failure (M$), 0 to 50, across Preprototype, Validation, and Post Release Introduction. Credit: Richard Mellitz]
The Cost of Being Wrong
FDIV: The Pentium flaw was fixed, but some flawed chips made it into the field. From a calculator: type 4195835 / 3145727 * 3145727 =, hit the +/- key, then + 4195835 =. The result is 256 (instead of 0). This is a floating point error in the chip. Chips were initially replaced, but later entire systems were replaced.
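The calculator sequence above evaluates x - (x / y) * y, which should be 0. A minimal sketch of the arithmetic follows. The flawed quotient is the value widely reported for affected Pentium parts; it is an assumption here and does not come from the slides.

```python
from fractions import Fraction

X, Y = 4195835, 3145727

# Exact rational arithmetic: x - (x / y) * y is exactly 0.
exact = X - Fraction(X, Y) * Y

# Quotient widely reported for flawed Pentium parts (assumption, not from
# the slides). The correct quotient is approximately 1.33382.
FLAWED_QUOTIENT = 1.333739068902037589
flawed = X - FLAWED_QUOTIENT * Y

print(exact)          # 0
print(round(flawed))  # 256
```

The error in the quotient is under one part in ten thousand, yet multiplying back by a seven-digit divisor turns it into a visible difference of 256.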
SNB Recall SATA may degrade over time Cost around $700 Million ($300M + $400M)
Boeing Jet Batteries
Firestone Tires
Bad press, not handled well
Use Cases http://www.youtube.com/watch?v=nfsr52ocylq
Future Outlook
Moore's Law Alive and Well
Some Players
Product Lifecycle
Product Life Cycle Validation cycle integrated in the Product Life Cycle (PLC) Validation coverage at each stage Specifications pushed to limits Product Release Qualification (PRQ) Granted if reliability and quality requirements met
PLC Details
Wafer to CPU
Manufacturing Flow
What Happens at Sort
Defect Examples
What Happens at Class
Product Platform Verification
Quality Assurance
Catching Defects
Survivability
Validation Disciplines
Post Silicon Validation Disciplines Performance validation, others
Validation Goals
Communicating Status and Health
Example Activities

| Test | WW Planned | # of Repeats | # of Cases | Test Time (h) | Overall Time (hr) | Overall Time (d) | % Canceled | % Blocked | % Complete |
|---|---|---|---|---|---|---|---|---|---|
| Script checkout | 1 | 1 | 8 | 0.25 | 2 | 0.25 | 0% | 0% | 100% |
| Repeatability | 2 | 10 | 80 | 0.25 | 20 | 2.5 | 0% | 0% | 100% |
| BIOS Regression | 3 | 1 | 24 | 0.25 | 6 | 0.75 | 0% | 0% | 40% |
| DIMM configuration - DQ | 4 | 1 | 100 | 0.25 | 20 | 2.5 | 0% | 0% | 100% |
| DIMM configuration - CA | 5 | 1 | 30 | 0.25 | 20 | 1 | 0% | 0% | 0% |
| Debug | 6 | 1 | 100 | 0.25 | 20 | 2.5 | 0% | 0% | 60% |
| VT sweep | 7 | 1 | 72 | 0.25 | 18 | 2.25 | 0% | 0% | 0% |
| Board baseline | 8 | 1 | 40 | 0.25 | 10 | 1.25 | 0% | 0% | 20% |
| Eye test | 9 | 1 | 32 | 1 | 32 | 1.6 | 0% | 0% | 100% |
| Automation executive checkout | 10 | 1 | 8 | 0.25 | 2 | 0.25 | 0% | 0% | 80% |
| Termination impedance checkout | 11 | 1 | 32 | 0.25 | 8 | 1 | 0% | 0% | 0% |
| Voltage check | 12 | 1 | 32 | 0.25 | 8 | 1 | 0% | 0% | 0% |
| Totals | | | 558 | | 166 | 16.85 | 0% | 0% | 55.4% |
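The totals row can be reproduced with a short script. This is a sketch using the numbers from the Example Activities table. The slide does not say how the 55.4% figure is weighted; weighting each activity by its overall hours happens to reproduce it, so that weighting is an inference, not something the slide states.

```python
# (test, cases, overall_hours, overall_days, percent_complete) from the table
activities = [
    ("Script checkout", 8, 2, 0.25, 100),
    ("Repeatability", 80, 20, 2.5, 100),
    ("BIOS Regression", 24, 6, 0.75, 40),
    ("DIMM configuration - DQ", 100, 20, 2.5, 100),
    ("DIMM configuration - CA", 30, 20, 1, 0),
    ("Debug", 100, 20, 2.5, 60),
    ("VT sweep", 72, 18, 2.25, 0),
    ("Board baseline", 40, 10, 1.25, 20),
    ("Eye test", 32, 32, 1.6, 100),
    ("Automation executive checkout", 8, 2, 0.25, 80),
    ("Termination impedance checkout", 32, 8, 1, 0),
    ("Voltage check", 32, 8, 1, 0),
]

total_cases = sum(a[1] for a in activities)
total_hours = sum(a[2] for a in activities)
total_days = round(sum(a[3] for a in activities), 2)

# Inferred weighting: each activity counts in proportion to its overall hours.
hours_weighted = sum(a[2] * a[4] for a in activities) / total_hours
# Equal weighting: every activity counts the same.
equal_weighted = sum(a[4] for a in activities) / len(activities)

print(total_cases, total_hours, total_days)               # 558 166 16.85
print(round(hours_weighted, 1), round(equal_weighted, 1))  # 55.4 50.0
```

The gap between the two completion figures shows why the weighting assumption should be stated whenever status is reported.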
Homework 2: Find the exercise on d2l.pdx.edu under the Lecture 2 module. Create a plot of percent complete vs planned. No need for the blocked or cancelled columns. Due Jan 16th, 7pm. Weighting assumption: same for each activity.
Schedule (weeks 1 to 15): Station/Platform setup, Stability, Repeatability, Parameter sweep, DIMM Config, VTZ(s), Monte Carlo/Stim comp, Volume, DOE(s), BER, MRC Regression, UPM
Electrical Validation 101
Agenda Introduction Discipline Summary SMV SIV Bench DV Tester DV
Introduction Multiple disciplines Many interfaces Teams across many geographies Tune recipes Optimization knobs Survivability knobs Find bugs Quantify health Provide UPM estimate
Importance of Electrical Validation Quantify system robustness over HVM Relatively limited data Predict over time Predict over spec. limits (voltage, temperature, etc) EV plays a major role in time to market Debug Bug fixes Interface speed limited by signal integrity EV can predict risk of pushing very high data transfer rates
Example EV Scope
Agenda Introduction Summary of Disciplines SMV SIV Bench DV Tester DV
EV Disciplines System Margin Validation (SMV) Signal Integrity Validation (SIV) Bench Design Validation (BDV) Others
Sample Size SMV: Quick test time, larger sample size Dozens or hundreds of CPUs Model and extrapolate to predict HVM SIV: Longer experiments, longer setup time Handful of CPUs BDV: Early engagement Handful of CPUs Focus is more on functionality
Example System
Example of Challenges (DDR) Dependence on manufacturer Many suppliers Different designs Many Configurations Single-sided vs double-sided ECC vs non-ECC UDIMM vs RDIMM vs SODIMM Motherboard dependence Routing differences Trace length, impedance differences Material differences (halogen free, lead free, etc.)
Serial Interface Challenges Voltage swing is typically lower Measurement tools require low noise floors Load from instrument probes can corrupt the signal ISI risk due to jitter Very high speed signals Many bits, higher likelihood of bit errors
Agenda Introduction Discipline Summary SMV SIV Bench DV Others
System Margining (SMV) Interface robustness is characterized by: Measuring how far a reference can be offset until the system fails (margin) Examples include: Signal sampling timing Voltage reference level for logic high/low Many things impact margin Examples: Temperature, impedance, etc. Parameters are varied, performance variation captured Performance over HVM predicted Design modified if predicted failure rate is unacceptable
Typical Margining Equipment Examples of margining equipment include: Voltage supply and measurement Current supply Timing Equipment Frequency generator Intel's AVMC card can do all of this Temperature control CPU, PCH, DIMMs Silicon thermal, USTC In-Target Probe Step through execution, enable test modes, report errors
Example SMV Test Station
Selecting Experiments Many operating conditions impact system performance Often there are interactions between conditions Often non-linear response Cannot test every combination of every parameter Parameter discovery experiments Run early on Reduce variables Make experiments tractable
Input and Response Experiments over operating conditions Results analyzed Failure rate predicted for HVM
Margining Parameters are swept from default settings to the point where data becomes corrupted (margining) Examples: Voltage reference Data vs clock timing Spec limits pushed Temperature Impedance Trace length Termination
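The sweep described above can be sketched as a loop that steps a setting away from its default until the link first fails. Everything named here is hypothetical: `link_passes` stands in for a real traffic test, and the register range, default code, and passing window are made-up values.

```python
def sweep_margin(passes, default, step, lo_limit, hi_limit):
    """Return (low_margin, high_margin): how far the setting can move from
    `default` in each direction before `passes` first reports a failure."""
    def one_direction(direction, limit):
        setting = default
        while True:
            candidate = setting + direction * step
            beyond_range = candidate > limit if direction > 0 else candidate < limit
            if beyond_range or not passes(candidate):
                return abs(setting - default)
            setting = candidate

    return one_direction(-1, lo_limit), one_direction(+1, hi_limit)


def link_passes(vref_code):
    # Hypothetical stand-in for running traffic and checking for corrupted data.
    return 20 <= vref_code <= 41


low, high = sweep_margin(link_passes, default=32, step=1, lo_limit=0, hi_limit=63)
print(low, high, low + high)  # 12 9 21
```

In this made-up case the default sits off-center in the passing window, which is the kind of result that would feed recipe tuning.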
Significance of Margin Interface robustness is characterized by margining Healthy systems have wide margin Still operate when design limits are pushed Tight margin indicates the system is close to failing Many systems will likely fail when high volumes are produced This error rate is quantified via statistical modeling Relevant operating conditions are determined via: Pre silicon experiments Historical knowledge Parameter sweep experiments
PVT Corners Process, Voltage, and Temperature are historically significant parameters These are of particular interest when picking test points Spec limits are pushed, margin is captured Parameters are co-varied to capture interactions Process can cover many properties Want to capture the distribution A corner describes a set of parameters set to spec limit Example: high impedance, cold temperature
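A corner set can be enumerated by taking every combination of low and high spec limits. This is a minimal sketch; the parameter names and limit values below are hypothetical placeholders, and real limits come from the product specification.

```python
from itertools import product

# Hypothetical spec limits as (low, high) pairs.
SPEC_LIMITS = {
    "voltage_v": (1.14, 1.26),
    "temperature_c": (0, 85),
    "impedance_ohm": (34, 46),
}


def pvt_corners(limits):
    """Return one dict per corner, with every parameter at a spec limit."""
    names = list(limits)
    combos = product(*(limits[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


corners = pvt_corners(SPEC_LIMITS)
print(len(corners))  # 8
for corner in corners:
    print(corner)
```

The count doubles with each added parameter, which is one reason parameter discovery experiments are used to trim the list before corner testing.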
Design of Experiment DOE methodology is used to: Pick test points intelligently using a software package Model the system After interpolating results, the model can predict Best and worst margin cases High volume response (average, distribution, etc.) Operating conditions are varied Timing or voltage margin response is measured
Example DOE Table
Margin Result Per Lane
Interpreting Results Response Surface Modeling (RSM) Typically used for system characterization Only a small set of measurements is needed Interpolation can be used to predict conditions Monte Carlo is an alternative Results from all experiments contribute to a model A statistically significant number of CPUs should be used to predict response over HVM
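To illustrate predicting an untested condition from a small set of measurements, here is a one-factor sketch: a parabola through three measured points, evaluated at a temperature that was not tested. This is plain quadratic interpolation, not a full response surface fit from a DOE package, and the margin values are invented for illustration.

```python
def quadratic_through(points):
    """Return a function for the parabola through three (x, y) points
    (Lagrange interpolation)."""
    (x0, y0), (x1, y1), (x2, y2) = points

    def predict(x):
        return (
            y0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
            + y1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
            + y2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1))
        )

    return predict


# Invented timing margin (ps) measured at three temperatures (deg C).
measured = [(0, 40.0), (50, 46.0), (100, 36.0)]
margin_at = quadratic_through(measured)

print(margin_at(25))  # 45.0
```

A real model would cover several factors and their interactions, and its predictions should be checked against new measurements.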
Margin Distribution A margin distribution is generated from the model The model comprehends parameter weightings Guardband is added for parameters not measured The model should comprehend all significant effects
Defects per Million DPM can be estimated from the margin distribution Model accuracy should be verified Customer data can be included Test the model by verifying predictions with new measurements Results are communicated to stakeholders
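As a sketch of how DPM might be estimated from a margin distribution, the example below assumes margin is normally distributed across parts and counts a part as defective when its margin, less a guardband, falls to zero or below. The normality assumption, the function name, and all numbers are illustrative, not taken from the lecture.

```python
from statistics import NormalDist


def estimate_dpm(mean_margin, sigma, guardband=0.0):
    """Defects per million, assuming normally distributed margin.

    A part is counted as failing when (margin - guardband) <= 0.
    """
    failing_fraction = NormalDist(mean_margin - guardband, sigma).cdf(0.0)
    return failing_fraction * 1e6


# Invented numbers: 30 ps mean margin, 6 ps standard deviation.
print(estimate_dpm(30.0, 6.0))                 # roughly 0.29 DPM
print(estimate_dpm(30.0, 6.0, guardband=6.0))  # roughly 32 DPM
```

A modest guardband moves the estimate by about two orders of magnitude here, so the tail of the distribution deserves as much scrutiny as the mean.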
Coverage
PRQ Risk Assessment Coverage Known Issues UPM / Margin
PRQ Risk Assessment Coverage Main methodology focus Known Issues UPM / Margin
PRQ Risk Assessment Coverage Known Issues Generally comprehended UPM / Margin
PRQ Risk Assessment Inadequate Coverage May miss key bugs May miss key margin / UPM impacts Known Issues UPM / Margin
What does coverage mean? Traditional (i.e. well comprehended) Process, Voltage, Temperature, Board Z, Trace length, package impedance, etc. Fairly well covered in today's methods Non-traditional (perhaps not well comprehended) Circuit robustness Training stability Interoperability System concurrency Out of guideline designs Are we covering these areas consistently and adequately?
Agenda Introduction Discipline Summary SMV SIV Bench DV Others
Signal Integrity Validation (SIV) Signal quality measured vs spec Example Specs DDR JEDEC PCIe CEM Common tools High Speed Oscilloscope Bit error rate testers (BERT) Vector network analyzers (VNA) Time domain reflectometry (TDR) Debug
Signal Integrity Validation (SIV) Activities include: SMV results verification Measurements SMV cannot cover due to lack of hooks Specification compliance Prove robustness of: Transmitter Receiver Link Debug signal integrity issues Comparison of results to customer results Special tests / designer experiment requests
Signal Integrity Issues Some issues that reduce margin include: Impedance mismatch Poor termination Crosstalk Ground Bounce Intersymbol interference Simultaneous switching Poor routing Poor materials Insufficient decoupling Cheap packaging Insufficient power delivery
Example Waveform
Coverage Both serial and parallel interfaces System level measurements This captures issues with: Packaging Reflections Routing Noise budget Increasing interface speeds Decreasing operating voltages
Methods Eye captures over different: Operating conditions Multiple CPUs Compliance testing for: Standard industry interfaces (DDR, PCIe, etc.) Well documented Non-standard interfaces (DMI, QPI, etc.) Not well documented Many similarities between interfaces
Selecting Experiments Knowledge of potential issues is key Examples: jitter, signal loss, circuit characteristics Factors at each level Chip Package Board Volume vs corner checking
Eye Mask Eye health Shape is an indicator Mask use is common Example of a good and bad eye The signal on the right is attenuated Large amount of jitter
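A mask test can be sketched as checking whether any waveform sample lands inside a keep-out region at the center of the eye. The rectangular mask, its dimensions, and the sample points below are simplifying assumptions; real compliance masks are defined by the relevant specification and are usually not rectangular.

```python
def mask_violations(samples, t_min=0.3, t_max=0.7, v_min=-0.1, v_max=0.1):
    """Return the (time_ui, volts) samples that fall inside the keep-out mask.

    Time is in unit intervals (0 to 1); voltage is differential, in volts.
    """
    return [
        (t, v)
        for t, v in samples
        if t_min <= t <= t_max and v_min <= v <= v_max
    ]


# Invented samples: an open eye, and an attenuated eye with jittery crossings.
good_eye = [(0.5, 0.40), (0.5, -0.38), (0.1, 0.05), (0.9, -0.02)]
bad_eye = [(0.5, 0.08), (0.5, -0.06), (0.35, 0.02), (0.1, 0.05)]

print(len(mask_violations(good_eye)))  # 0
print(len(mask_violations(bad_eye)))   # 3
```

In the bad eye, two hits come from low amplitude at the center and one from a crossing that has drifted into the mask, matching the attenuation and jitter symptoms listed above.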
Running Experiments More debug early on Manual overrides Stepping through code Recipe tuning Lots of BIOS images High level automation as stability is achieved 24 hour testing Infinite persistence captures Still requires human intervention for: Card or board swapping CPU swapping if a robotic handler is not available Results are stored to a database Post processed using JMP, summarizers, etc.
Example BERT Setup for PCIe Signal generation and error checking Two RF switches Compliance load board (CLB) System board (DUT) Clock generator Clean clock testing
Interpreting Results Spec comparison Margin analysis BER and UPM leveraged from SMV results Difference between margin and spec limit Long dwell infinite persistence Probe or cable loss comprehended Compare to projects with similar interfaces Use sightings to track issues Tune recipe Impedance compensation (margin vs power) Equalization
Agenda Introduction Discipline Summary SMV SIV Bench DV Others
Bench DV Introduction Covers analog behavior of silicon Silicon that can go in many platforms Does the part meet external specs (EMTS, EDS) Analog Components Supporting circuitry Thermal Power delivery Clocking Works closely with SMV Covers PVT, but often pushes past spec limits Pushing past spec limits will be representative of worst case silicon at spec limits Possible to run with no OS by injecting traffic
BDV Cycle Gather spec requirements Test plan Generation Include tuning, calibration hook, hardware in the design Generate test conditions Run experiments, opening bug sightings Generate yield analysis Close sightings Feed results into PRQ recommendation
Methods More specialized testing than other disciplines Especially for new interfaces A set of traditional tests still exist Difficult to automate due to unique experiments Circuitry with impact on external behavior Example: IO swing calibration Different implementations - unique test requirements Recipe tuning Degradation over time Yield concerns
Close Ties with Design Deep understanding of circuits Often designers own BDV planning and/or execution Work closely with other disciplines Careful not to duplicate effort Often ask other disciplines to carry out experiments Recipe tuning regression is an example
Execution Examples Effect of power supply noise on band gap reference Glitch filter performance on a slow I/O Receiver clock bias setting
Agenda Introduction Discipline Summary SMV SIV Bench DV Others
Tester DV Component level design validation Does not focus on platform interactions Focus is on characterization of design goals in: Controlled environment Pristine conditions for power delivery and signal integrity PVT coverage Close ties with HVM HVM tests every part shipped to customers Wafer level and post packaging Inline DV (ILDV) large amount of data collected per part Great for data mining
Power Delivery Validation (PDV) Sufficient power Stable power All parts of the die
An Eye is Born
Reading assignment High speed digital designs chapter 1 Link: Heck and Hall (This chapter is free online)