SWAT: Designing Resilient Hardware by Treating Software Anomalies

Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu, Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign
swat@cs.uiuc.edu

Motivation
• Hardware failures will happen in the field
  – Aging, soft errors, inadequate burn-in, design defects, …
  – Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
  – Traditional redundancy (e.g., nMR) too expensive
  – Piecemeal solutions for specific fault models too expensive
  – Must incur low area, performance, power overhead
Today: low-cost solution for multiple failure sources

Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
⇒ Watch for software anomalies (symptoms)
  – Hardware fault detection ~ software bug detection
  – Zero to low overhead "always-on" monitors
  – Diagnose cause after symptom detected
  – May incur high overhead, but rarely invoked
⇒ SWAT: SoftWare Anomaly Treatment

SWAT Framework Components
• Detection: Symptoms of S/W misbehavior, minimal backup H/W
• Recovery: Hardware/software checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware
[Figure: timeline of Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair]

SWAT
1. Detectors w/ hardware support [ASPLOS '08]
2. Detectors w/ software support [Sahoo et al., DSN '08]
3. Trace-based fault diagnosis [Li et al., DSN '08]
4. Accurate fault modeling
[Figure: timeline of Checkpoint, Fault, Error, Symptom detected, Recovery, Diagnosis, Repair, annotated with items 1 to 4]

Hardware-Only Symptom-Based Detection
• Observe anomalous symptoms for fault detection
  – Incur low overheads for "always-on" detectors
  – Minimal support from hardware
• Fatal traps generated by hardware
  – Division by zero, RED state, etc.
• Hangs detected using simple hardware hang detector
• High OS activity detected with performance counter
  – Typical OS invocations take 10s or 100s of instructions
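The high-OS-activity detector above can be sketched in software as a counter of consecutive OS-mode instructions. This is only an illustration of the idea, not the hardware design from the talk: the class name, the `first_symptom` helper, and the default threshold of 30,000 instructions are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class HighOSActivityDetector:
    """Flags a symptom when too many consecutive instructions retire in OS mode."""

    threshold: int = 30_000  # assumed value, not taken from the slides
    run_length: int = 0      # consecutive OS-mode instructions seen so far

    def observe(self, in_os_mode: bool) -> bool:
        """Account for one retired instruction; return True if a symptom fires."""
        if in_os_mode:
            self.run_length += 1
        else:
            self.run_length = 0  # returning to the application resets the counter
        return self.run_length > self.threshold


def first_symptom(trace, detector):
    """Return the index of the first instruction that raises a symptom, or None."""
    for index, in_os_mode in enumerate(trace):
        if detector.observe(in_os_mode):
            return index
    return None
```

Because normal OS invocations are short, a fault-free run keeps resetting the counter, so the monitor can stay always-on at negligible cost.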

Experimental Methodology
• Microarchitecture-level fault injection
  – GEMS timing models + Simics full-system simulation
  – SPEC workloads on Solaris-9 OS
• Permanent fault models
  – Stuck-at, bridging faults in latches of 8 arch structures
  – 12,800 faults, <0.3% error @ 95% confidence
• Simulate impact of fault in detail for 10M instructions
[Figure: after the fault is injected, timing simulation runs for 10M instr; if no symptom appears in 10M instr, functional simulation runs to completion. Outcomes: app masked, symptom > 10M, or silent data corruption (SDC).]

Efficacy of Hardware-Only Detectors
• Coverage: Percentage of unmasked faults detected
  – 98% of faults detected, 0.4% give SDC (w/o FPU)
    * Additional support required for FPU-like units
  – 66% of detected faults corrupt OS state, need recovery
    * Despite low OS activity in fault-free execution
• Latency: Number of instr between activation and detection
  – HW recovery for up to 100k instr, SW for longer latencies
  – App in 87% of detections recoverable using HW
  – OS recoverable in virtually all detections using HW
    * OS recovery using SW is hard

Improving SWAT Detection Coverage
Can we improve coverage and SDC rate further?
• SDC faults primarily corrupt data values
  – Illegal control/address values caught by other symptoms
  – Need detectors to capture "semantic" information
• Software-level invariants capture program semantics
  – Use when higher coverage desired
  – Sound program invariants require expensive static analysis
  – We use likely program invariants

Likely Program Invariants
• Likely program invariants
  – Hold on all observed inputs, expected to hold on others
  – But suffer from false positives
  – Use SWAT diagnosis to detect false positives on-line
• iSWAT: Compiler-assisted symptom detectors
  – Range-based value invariants [Sahoo et al., DSN '08]
  – Check MIN ≤ value ≤ MAX on data values
  – Disable invariant when diagnosed as false positive
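A minimal sketch of a range-based likely invariant, assuming one invariant per monitored program site. The class and function names are hypothetical, and `is_false_positive` is a stand-in callback for the SWAT diagnosis step; the real iSWAT checks are inserted by a compiler pass rather than called by hand.

```python
class RangeInvariant:
    """Likely invariant MIN <= value <= MAX, learned from training runs."""

    def __init__(self):
        self.lo = None
        self.hi = None
        self.enabled = True

    def train(self, value):
        """Widen the range to include a value observed during training."""
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def holds(self, value):
        """Disabled or untrained invariants never report a violation."""
        if not self.enabled or self.lo is None:
            return True
        return self.lo <= value <= self.hi


def check(invariants, site, value, is_false_positive):
    """Check one value; disable the invariant if diagnosis says false positive."""
    invariant = invariants[site]
    if invariant.holds(value):
        return "ok"
    if is_false_positive(site, value):  # hypothetical hook into diagnosis
        invariant.enabled = False
        return "false_positive"
    return "fault_detected"
```

Disabling an invariant after one false positive trades a little coverage for never paying the diagnosis cost on that site again.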

iSWAT Implementation: Training Phase
[Figure: a compiler pass in LLVM adds invariant-monitoring code to the application. The instrumented application runs on test, train, and external inputs; each input (#1 … #n) yields a set of ranges, which are combined into the invariant ranges.]

iSWAT Implementation: Fault Detection Phase
[Figure: following the training phase, a compiler pass in LLVM adds invariant-checking code to the application, using the invariant ranges. The application runs on the ref input in full-system simulation. An invariant violation invokes SWAT diagnosis, which reports either a fault detection or a false positive (disable invariant).]

iSWAT Results
• Explored SWAT with 5 apps on previous methodology
• Undetected faults reduced by 30%
• Invariants reduce SDCs by 73% (33 to 9)
• Overheads: 5% on x86, 14% on UltraSPARC IIIi
  – Reasonably low overheads on some machines
  – Un-optimized invariants used, can be further reduced
• Exploring more sophistication for coverage, overheads

Fault Diagnosis
• Symptom-based detection is cheap but
  – High latency from fault activation to detection
  – Difficult to diagnose root cause of fault
  – How to diagnose SW bug vs. transient vs. permanent fault?
• For permanent fault within core
  – Disable entire core? Wasteful!
  – Disable/reconfigure µarch-level unit?
  – How to diagnose faults to µarch unit granularity?
• Key ideas
  – Single-core fault model, multicore ⇒ fault-free core available
  – Checkpoint/replay for recovery ⇒ replay on good core, compare
  – Synthesizing DMR, but only for diagnosis

SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
[Figure: decision flow]
Symptom detected ⇒ rollback on faulty core
  – No symptom: transient or non-deterministic s/w bug ⇒ continue execution
  – Symptom: false positive (iSWAT), deterministic s/w bug, or permanent h/w fault ⇒ rollback/replay on good core
    * No symptom: permanent h/w fault, needs repair!
    * Symptom: false positive (iSWAT) or deterministic s/w bug, send to s/w layer
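The decision flow on this slide can be written as a small function. This is a sketch under the assumption that each replay is exposed as a zero-argument callable that rolls back to the checkpoint, re-executes, and returns whether the symptom recurred; the enum and parameter names are illustrative.

```python
from enum import Enum


class Diagnosis(Enum):
    TRANSIENT_OR_NONDETERMINISTIC_SW = "continue execution"
    PERMANENT_HW = "needs repair"
    DETERMINISTIC_SW_OR_FALSE_POSITIVE = "send to software layer"


def diagnose(symptom_on_faulty_core, symptom_on_good_core):
    """Classify a detected symptom by replaying from the last checkpoint.

    Both arguments are callables (assumed interface) that perform one
    rollback/replay and return True if the symptom reappears.
    """
    # Step 1: replay on the core where the symptom was first seen.
    if not symptom_on_faulty_core():
        return Diagnosis.TRANSIENT_OR_NONDETERMINISTIC_SW
    # Step 2: the symptom is repeatable, so replay on a known-good core.
    if not symptom_on_good_core():
        return Diagnosis.PERMANENT_HW
    return Diagnosis.DETERMINISTIC_SW_OR_FALSE_POSITIVE
```

The second replay runs only when the first one reproduces the symptom, which keeps the common transient case cheap.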

Diagnosis Framework
[Figure: Symptom detected ⇒ Diagnosis ⇒ software bug, transient fault, or permanent fault; a permanent fault ⇒ microarchitecture-level diagnosis ⇒ "Unit X is faulty"]

Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected ⇒ invoke TBFD ⇒ faulty core execution compared (=?) with fault-free core execution ⇒ diagnosis algorithm]

Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected ⇒ invoke TBFD. Roll back the faulty core to the checkpoint, replay execution, and collect info. Load the checkpoint on a fault-free core for fault-free instruction execution. Compare the two (=?) and feed the diagnosis algorithm.]
• What info to collect?
• What info to compare?
• What to do on divergence?

Can a Divergent Instruction Lead to Diagnosis?
Simpler case: ALU fault

Instruction       dst (fault-free x faulty)   Faulty HW used: dec   alu   preg
add r1, r3, r5    5 x 3                                       0     1     12
sub r6, r1, r2    2 x 9                                       2     1     7

Both divergent instructions used the same ALU ⇒ ALU 1 faulty
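For this simple case, the reasoning can be sketched as: find the instructions whose results differ between the faulty and fault-free traces, then intersect the hardware resources they used. The trace record layout (`dst_value`, `used`) and the function names are assumptions for illustration, and the values loosely follow the slide's example.

```python
def divergent_indices(faulty_trace, good_trace):
    """Indices of instructions whose destination values differ between traces."""
    return [
        i
        for i, (faulty, good) in enumerate(zip(faulty_trace, good_trace))
        if faulty["dst_value"] != good["dst_value"]
    ]


def common_resources(faulty_trace, indices):
    """Resources, as (kind, id) pairs, used by every divergent instruction."""
    if not indices:
        return set()
    used = [set(faulty_trace[i]["used"].items()) for i in indices]
    return set.intersection(*used)


# Illustrative traces: both divergent instructions went through ALU 1.
faulty_trace = [
    {"instr": "add r1, r3, r5", "dst_value": 3, "used": {"dec": 0, "alu": 1, "preg": 12}},
    {"instr": "sub r6, r1, r2", "dst_value": 9, "used": {"dec": 2, "alu": 1, "preg": 7}},
]
good_trace = [
    {"instr": "add r1, r3, r5", "dst_value": 5},
    {"instr": "sub r6, r1, r2", "dst_value": 2},
]

suspects = common_resources(faulty_trace, divergent_indices(faulty_trace, good_trace))
```

Here `suspects` contains only the ALU entry. This shortcut works only when the divergent instructions themselves exercise the faulty unit, which does not hold for faults such as a corrupted register alias table entry.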

Can a Divergent Instruction Lead to Diagnosis?
• Complex example: Fault in register alias table (RAT) entry
[Figure: instructions I_A: r3 ← r2 + r2 and I_B: r1 ← r5 * r2, shown with the RAT (logical-to-physical register mapping) and the register file (physical register values). Fault-free r1 = 12, but I_B diverges, even though I_B does not use the faulty HW.]
• Divergent instructions do not directly lead to faulty unit
• Instead, look backward/forward in instruction stream
  – Need to collect and analyze instruction trace

Diagnosing Permanent Fault to µarch Granularity
• Trace-based fault diagnosis (TBFD)
  – Compare instruction trace of faulty vs. good execution
  – Divergence ⇒ faulty hardware used ⇒ diagnosis clues
• Diagnose faults to µarch units of processor
  – Check µarch-level invariants in several parts of processor
  – Front-end, meta-datapath, datapath faults
  – Diagnosis in out-of-order logic (meta-datapath) is complex
• Results
  – 98% of the faults detected by SWAT successfully diagnosed
  – TBFD flexible for other detectors/granularity of repair

SWAT-Sim: Fast and Accurate Fault Models
• Need accurate µarch-level fault models
  – Gate-level injections accurate but too slow
  – µarch (latch) level injections fast but inaccurate
• Can we achieve gate-level accuracy at µarch-level speed?
• Mixed-mode (hierarchical) simulation
  – µarch-level + gate-level simulation
  – Simulate only faulty component at gate level, on demand
  – Invoke gate-level sim online for permanent faults
    * Simulating fault effect with real-world vectors

SWAT-Sim: Gate-Level Accuracy at µarch Speeds
[Figure: µarch simulation of an instruction r3 ← r1 op r2. Faulty unit used? No: µarch-level simulation produces the output. Yes: the inputs (stimuli) go to gate-level fault simulation, the output response r3 comes back with the fault propagated to the output, and µarch simulation continues.]
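The mixed-mode flow in this figure can be sketched with a toy 8-bit adder: operations on healthy units use a fast functional model, while operations routed to the faulty unit are evaluated gate by gate with a stuck-at fault on one carry net. The adder, the fault location, and all names are assumptions for illustration; the actual SWAT-Sim uses Verilog gate-level models.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def gate_level_add(a, b, stuck_carry_bit=None, stuck_value=0):
    """Ripple-carry adder evaluated bit by bit, with an optional stuck-at carry."""
    carry = 0
    result = 0
    for i in range(WIDTH):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i          # sum bit of full adder i
        carry = (x & y) | (carry & (x ^ y))     # carry-out of full adder i
        if i == stuck_carry_bit:
            carry = stuck_value                 # inject the stuck-at fault
    return result


def simulate_add(a, b, unit, faulty_unit, stuck_carry_bit, stuck_value=0):
    """Use the fast model unless the instruction uses the faulty unit."""
    if unit != faulty_unit:
        return (a + b) & MASK                   # µarch-level (functional) path
    return gate_level_add(a, b, stuck_carry_bit, stuck_value)
```

With the carry out of bit 0 stuck at 0 on the faulty unit, `simulate_add(1, 1, ...)` on that unit yields 0 instead of 2, while `simulate_add(2, 4, ...)` still yields 6 because those inputs never activate the fault. Activation that depends on real operand values is what simple latch-level stuck-at models tend to miss.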

Results from SWAT-Sim
• SWAT-Sim implemented within full-system simulation
  – NCVerilog + VPI for gate-level sim of ALU/AGEN modules
• SWAT-Sim: High accuracy at low overheads
  – 100,000x faster than gate-level, same modeling fidelity
  – 2x slowdown over µarch-level, at higher accuracy
• Accuracy of µarch models using SWAT coverage/latency
  – µarch stuck-at models generally inaccurate
  – Differences in activation rate, multi-bit flips
• Complex manifestations ⇒ hard to derive better models
  – Need SWAT-Sim, at least for now

SWAT Summary
• SWAT: SoftWare Anomaly Treatment
  – Handle all and only faults that matter
  – Low, amortized overheads
  – Holistic systems view enables novel solutions
  – Customizable and flexible
• Prior results:
  – Low-cost h/w detectors gave high coverage, low SDC rate
• This talk:
  – iSWAT: Higher coverage w/ software-assisted detectors
  – TBFD: µarch-level fault diagnosis by synthesizing DMR
  – SWAT-Sim: Gate-level fault accuracy at µarch-level speed

Future Work
• Recovery: hybrid, application-specific
• Aggressive use of software reliability techniques
  – Leverage diagnosis mechanism
• Multithreaded software
• Off-core faults
• Post-silicon debug and test
  – Use faulty trace as fault-model-oblivious test vector
• Validation on FPGA (w/ Michigan)
• Hardware assertions to complement software symptoms

BACKUP SLIDES

Breakdown of Detections by SW Symptoms
• 98% of unmasked faults detected within 10M instr (w/o FPU)
  – Need HW support or SW monitoring for FPU

SW Components Corrupted
• 66% of faults corrupt system state before detection
  – Need to recover system state

Latency from Application Mismatch
• 86% of faults detected under 100k instr
  – 42% detected under 10k instr

Latency from OS Mismatch
• 99% of faults detected under 100k instr

Trace-Based Fault Diagnosis (TBFD)
[Figure: Permanent fault detected ⇒ invoke diagnosis. Roll back the faulty core to the checkpoint, replay execution, and collect µarch info to form the faulty trace. Load the checkpoint on a fault-free core for fault-free instruction execution to form the test trace. Compare faulty trace and test trace (=?). TBFD then classifies front-end faults, meta-datapath faults, and datapath faults.]

Fault Diagnosability
• 98% of detected faults are diagnosed
  – 89% diagnosed to unique unit/array entry
  – Meta-datapath faults in out-of-order exec mislead TBFD

Accuracy of Existing Fault Models
• SWAT-Sim implemented within full-system simulator
  – NCVerilog + VPI to simulate gate-level ALU and AGEN
• Existing µarch-level fault models inaccurate
  – Differences in activation rate, multi-bit flips
• Accurate models hard to derive ⇒ need SWAT-Sim!

Summary: SWAT Advantages
• Handles all faults that matter
  – Oblivious to low-level failure modes & masked faults
• Low, amortized overheads
  – Optimize for common case, exploit s/w reliability solutions
• Holistic systems view enables novel solutions
  – Invariant detectors use diagnosis mechanisms
  – Diagnosis uses recovery mechanisms
• Customizable and flexible
  – Firmware-based control affords hybrid, app-specific recovery (TBD)
• Beyond hardware reliability
  – SWAT treats hardware faults as software bugs
    * Long-term goal: unified system (hw + sw) reliability at lowest cost
  – Potential applications to post-silicon test and debug

Transient Fault Results
• 6400 transient faults injected across 8 structures
• 83% of unmasked faults detected within 10M instr
• Only 0.4% of injected faults result in SDCs