Self-calibrating Online Wearout Detection
Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke
MICRO-40, December 3, 2007
University of Michigan, Electrical Engineering and Computer Science
Motivation
- "Designing Reliable Systems from Unreliable Components…" - Shekhar Borkar (Intel)
- More failures to come
- Failures will be wearout induced
[Srinivasan, DSN '04] [Borkar, MICRO '05]
Current Approaches
- Traditional
  - Design margins
  - Burn-in
- Detection: based on replication of computation
  - TMR (Tandem/HP NonStop servers)
  - DIVA (Bower, MICRO '05)
- Prediction: utilizes precise analytical models and/or sensors
  - Canary circuits (Sentinel Silicon, RidgeTop)
  - RAMP (Srinivasan, UIUC/IBM)
[Slide callouts: Impractical, Static, Costly]
Wearout Mechanisms
- Many failure mechanisms have been shown to be progressive
  - Hot carrier injection (HCI)
  - Electromigration (EM)
  - Negative bias temperature instability (NBTI)
  - Oxide breakdown (OBD)
Objective
- Propose a failure prediction technique that exploits the progressive nature of wearout
- Monitor impact on path delays

Prediction                              Detection
Monitors evolution of wearout           Identifies existing fault
Proactive                               Reactive
Enables failure avoidance/mitigation    Enables failure recovery
Continuous feedback                     End-of-life feedback
False negatives and positives           False negatives
Oxide Breakdown (OBD)
- Accumulation of defects leads to a conductive path
[Figure: percolation model] [Stathis, JAP '06]
OBD HSPICE Model
- Post-breakdown leakage modeling
[BSIM 4.6.0, '06] [Rodriguez, Stathis, Linder, IRPS '03]
Characterization Testbench
- 90 nm standard cell library
[Figure: testbench schematic with labeled delays t_circuit and t_cell]
Impact on Propagation Delay
Delay Profiling Unit (DPU)
[Figure: an input signal from a uArch module feeds a latency sampling stage; sampled bit values (0/1) are shown along the signal path]
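
The figure shows only sampled 0/1 values, so the following is a minimal sketch of one plausible reading, not the authors' circuit. It assumes the monitored signal is captured through a chain of progressively longer delay taps, that a tap reads 1 when it latched the correct value before the clock edge, and that the count of leading 1s approximates the remaining timing slack. The function name, the bit convention, and the per-tap delay are all hypothetical.

```python
from typing import Sequence


def slack_estimate_ps(captured_ok: Sequence[int], tap_delay_ps: float) -> float:
    """Estimate timing slack from a vector of tap capture results.

    captured_ok[i] is 1 if tap i (delayed by (i + 1) * tap_delay_ps) still
    latched the correct value, else 0. This encoding is an assumption made
    for illustration; the slide does not define it.
    """
    if tap_delay_ps <= 0:
        raise ValueError("tap_delay_ps must be positive")
    ok_taps = 0
    for bit in captured_ok:
        if bit != 1:
            break  # first failing tap bounds the slack from above
        ok_taps += 1
    return ok_taps * tap_delay_ps


# Hypothetical sample: two of four taps still capture correctly.
sample = [1, 1, 0, 0]
estimate = slack_estimate_ps(sample, tap_delay_ps=10.0)
```

Under these assumptions, a shrinking estimate over time would indicate that the monitored path is slowing down.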
TRIX Analysis
- The magnitude of the divergence between TRIX_global and TRIX_local reflects the amount of degradation
TRIX Analysis Details
- Exponential Moving Average (EMA)
- Triple-smoothed Exponential Moving Average
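
The slide's equations did not survive extraction, so the sketch below uses the standard EMA recurrence, out[t] = alpha * x[t] + (1 - alpha) * out[t-1], and applies it three times to obtain a triple-smoothed average. Seeding the average with the first sample and the alpha values used in the example are assumptions, not values taken from the talk.

```python
from typing import Iterable, List


def ema(samples: Iterable[float], alpha: float) -> List[float]:
    """Exponential moving average of a sample stream.

    Standard recurrence: out[t] = alpha * x[t] + (1 - alpha) * out[t-1].
    The first output is seeded with the first sample (an assumption).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    out: List[float] = []
    for x in samples:
        prev = out[-1] if out else x
        out.append(alpha * x + (1.0 - alpha) * prev)
    return out


def triple_ema(samples: Iterable[float], alpha: float) -> List[float]:
    """Triple-smoothed EMA: the EMA of the EMA of the EMA."""
    return ema(ema(ema(samples, alpha), alpha), alpha)


# Hypothetical latency samples with one noisy spike.
latencies = [100.0, 100.0, 108.0, 100.0, 100.0]
smoothed = triple_ema(latencies, alpha=0.5)
```

A smaller alpha weights history more heavily, so the same routine can produce either a slow-moving or a fast-moving trend line from one latency stream.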
Noisy Latency Profile
[Figure: percent nominal delay (%) versus increasing age]
DPU with TRIX Hardware
[Figure: input signal -> latency sampling -> TRIX_l calculation and TRIX_g calculation -> prediction]
Wearout Detection Unit (WDU)
[Figure: latency sampling feeds the TRIX_l and TRIX_g calculations, whose outputs are combined to form the prediction]
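
As a rough illustration of how such a prediction stage could work, the sketch below computes a fast-moving (local) and a slow-moving (global) triple-smoothed average over the same latency samples and raises a warning once their relative divergence exceeds a threshold. The smoothing factors, the threshold, and the function names are hypothetical; the slides do not give the actual values or the exact comparison rule.

```python
from typing import List, Optional, Sequence


def _triple_ema(samples: Sequence[float], alpha: float) -> List[float]:
    """Apply the EMA recurrence three times (seeded with the first sample)."""
    out = list(samples)
    for _ in range(3):
        smoothed: List[float] = []
        for x in out:
            prev = smoothed[-1] if smoothed else x
            smoothed.append(alpha * x + (1.0 - alpha) * prev)
        out = smoothed
    return out


def first_wearout_warning(
    latencies: Sequence[float],
    alpha_local: float = 0.5,    # assumed: reacts quickly to recent samples
    alpha_global: float = 0.05,  # assumed: tracks the long-term baseline
    threshold: float = 0.02,     # assumed: 2% relative divergence
) -> Optional[int]:
    """Return the index of the first sample whose local/global divergence
    exceeds the threshold, or None if it never does."""
    local = _triple_ema(latencies, alpha_local)
    global_ = _triple_ema(latencies, alpha_global)
    for i, (loc, glo) in enumerate(zip(local, global_)):
        if glo > 0 and abs(loc - glo) / glo > threshold:
            return i
    return None


# Hypothetical profile: stable delay, then a steady upward drift.
profile = [100.0] * 50 + [100.0 + k for k in range(1, 51)]
warning_index = first_wearout_warning(profile)
```

Because the global average is derived from the device's own history, the comparison needs no externally supplied nominal delay, which is consistent with the "self-calibrating" framing of the talk.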
Evaluation Framework
[Figure: evaluation flow. Block labels: Gate-level Processor Simulator; OR1200 Verilog; Synthesis and Place and Route; 90 nm Library; Fully Synthesized, P&R, OR1200 Core; Monte Carlo; Workload Simulator; MediaBench Suite; Simulator; Timing, Power, and Temperature Simulations; HSPICE Simulations; OBD Wearout Model; Wearout Simulator]
WDU Accuracy
WDU Overhead
WDU Overhead (continued)
Long-term Vision
- Introspective Reliability Management (IRM)
  - Intelligent reliability management directed by on-chip sensor feedback
  - Prospective sensors
    - Delay (WDU)
    - Leakage/Vt
    - Temperature
Introspective Reliability Management
Conclusions
- Many progressive wearout phenomena impact device-level performance
  - It is possible to characterize this impact and anticipate failures
- WDU performance
  - Failure predicted within 20% of end of life (tunable)
  - Area overhead < 3% (hybrid)
- Low-level sensors can be used to enable intelligent reliability management
Questions?