CS 7810 Lecture 25 DIVA A Reliable Substrate
CS 7810 Lecture 25
DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design
T. Austin, Proceedings of MICRO-32, November 1999
Redundancy
• If a processor’s output is error-prone, reliability can be provided with redundancy
[Figure: Input Program, Primary Core, Checker Core, Verify & Commit, and a second Checker Core]
• One checker can detect errors. For recovery, we may need another checker or some other form of redundancy
Why Redundancy?
• Soft Errors: A high-energy particle can strike a device and deposit enough charge to flip the value
Ø Cosmic rays
Ø Alpha particles
[Figure: Input Program, Primary Core, Checker Core, Verify & Commit]
Why Redundancy?
• Soft Errors: voltage spikes or noise
Ø Crosstalk
Ø di/dt
Ø Lower voltages
Why Redundancy?
• Allows unverified or aggressively clocked primary cores
Ø Functionally incorrect core: some corner case slips through
Ø Electrically incorrect core: high temperature causes a circuit to not meet the timing constraint
DIVA Microarchitecture
[Figure: BPred, I-$, Dec/Ren, IQ, ALU, D-$, Rename Regs, Arch Regs; example instruction LR 3 + LR 7 → LR 15 with values 4, 8, 12]
• Storage Check: Read LR 3 and LR 7 from Arch Regs and confirm they equal 4 and 8
• ALU Check: Add 4 + 8 and confirm the sum equals 12
• If both checks succeed, write 12 into LR 15
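The two checks in the example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the design from the paper: the `CommitEntry` layout, the `OPS` table, and the `check_and_commit` name are assumptions made for this sketch.

```python
from dataclasses import dataclass

@dataclass
class CommitEntry:
    """What the primary core hands to the checker at commit (hypothetical layout)."""
    op: str          # operation name, e.g. "add"
    src_regs: tuple  # logical source register numbers
    src_vals: tuple  # operand values the primary core used
    dst_reg: int     # logical destination register
    result: int      # result the primary core computed

# Operations the checker knows how to recompute (illustrative subset).
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}

def check_and_commit(entry, arch_regs):
    """Write the result into arch_regs and return True only if both checks pass."""
    # Storage check: re-read the operands from architected state and compare
    # them with the values the primary core claims to have used.
    storage_ok = all(arch_regs[r] == v
                     for r, v in zip(entry.src_regs, entry.src_vals))
    # ALU check: recompute the result from those operand values.
    alu_ok = OPS[entry.op](*entry.src_vals) == entry.result
    if storage_ok and alu_ok:
        arch_regs[entry.dst_reg] = entry.result
        return True
    return False

# The slide's example: LR 3 = 4, LR 7 = 8, LR 3 + LR 7 -> LR 15.
arch_regs = {3: 4, 7: 8, 15: 0}
ok = check_and_commit(CommitEntry("add", (3, 7), (4, 8), 15, 12), arch_regs)
```

Note that the ALU check works on the operand values supplied by the primary core, which is why the checker never waits on a data dependence.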
Microarchitecture Details
• Instructions are fed to the checker in order during commit
• The logic and storage checks detect errors in ALUs and datapath
• The checker core is a simple in-order pipeline – easy to design and verify
• An error in an earlier stage (LR 3 instead of LR 2) can be detected by also adding a ren/decode stage to the checker
• The in-order checker core has no stalls (it needs a bypass for the register file) – no data dependences, cache misses, or branch mispredicts
• Contention for the register file and data cache can degrade the primary thread
Recovery
• The architected register file and data cache are ECC protected – when an error is detected, it is assumed that the checker and architected state are correct
• Primary core is restarted from the faulting instruction
• A fault in the primary core may result in deadlock, e.g., the instruction that produces R 5 is waiting for R 5 to be produced (instead of R 4). A timeout in the checker signals an error
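A rough sketch of the recovery decision, including the timeout that catches a deadlocked primary core. Everything here is an assumption made for illustration: the `WATCHDOG_LIMIT` value, the per-cycle `commit_stream` encoding, and the simplification that instructions are numbered consecutively (so the instruction after `pc` is `pc + 1`, ignoring branches).

```python
WATCHDOG_LIMIT = 4  # hypothetical number of idle cycles the checker tolerates

def run_checker(commit_stream, watchdog_limit=WATCHDOG_LIMIT):
    """Decide whether the primary core must be restarted.

    commit_stream holds one item per cycle: None if nothing reached commit,
    or (pc, matches) where matches says whether the checker agreed with the
    primary core's result. Returns ("ok", None) or ("restart", pc), where pc
    is the instruction the primary core should restart from.
    """
    idle_cycles = 0
    next_pc = 0  # oldest instruction not yet verified
    for item in commit_stream:
        if item is None:
            idle_cycles += 1
            # Timeout: the primary core has produced nothing for too long,
            # so treat it as a fault (for example, a deadlock).
            if idle_cycles >= watchdog_limit:
                return ("restart", next_pc)
            continue
        idle_cycles = 0
        pc, matches = item
        if not matches:
            # Mismatch: checker and architected state are assumed correct,
            # so restart the primary core from the faulting instruction.
            return ("restart", pc)
        next_pc = pc + 1
    return ("ok", None)

clean = run_checker([(0, True), (1, True)])
mismatch = run_checker([(0, True), (1, False)])
deadlock = run_checker([(0, True), None, None, None, None])
```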
Redundant Multi-Threading
• Execute two threads in parallel (CMP or SMT) – each thread maintains its own register state
• Threads execute as in a conventional processor, except
Ø trailing thread commits after verifying result
Ø leading thread commits stores to a buffer – these get written to cache/memory only after verification
Ø load values of the leading thread are sent to trailing thread, so trailing thread never accesses data cache
Ø branch outcomes are also sent to trailing thread
[Figure: Leading Thread → Trailing Thread, labeled "Reg results, load values, branch outcomes" and "Store values"]
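The leading/trailing protocol above can be sketched as follows. This is a simplified model under several assumptions: a made-up three-instruction ISA (`load`, `addi`, `store`), the two threads run one after the other instead of concurrently, branch outcomes are left out, and `fault_at` is a hypothetical knob that flips one bit in a leading-thread result.

```python
from collections import deque

def run_rmt(program, memory, fault_at=None):
    """Return None if every result verifies, else the index of the first mismatch.

    program is a list of (op, a, b) tuples:
      ("load",  reg, addr)  reg <- memory[addr]
      ("addi",  reg, imm)   reg <- reg + imm
      ("store", reg, addr)  memory[addr] <- reg
    """
    lead_regs, trail_regs = {}, {}   # each thread keeps its own register state
    load_queue = deque()             # load values forwarded to the trailing thread
    store_buffer = deque()           # leading-thread stores awaiting verification
    lead_results = []                # register results forwarded for comparison

    # Leading thread: the only thread that reads the data cache (memory).
    for i, (op, a, b) in enumerate(program):
        if op == "load":
            lead_regs[a] = memory[b]
            load_queue.append(memory[b])
        elif op == "addi":
            lead_regs[a] = lead_regs[a] + b
            if i == fault_at:
                lead_regs[a] ^= 1    # injected single-bit error
        elif op == "store":
            store_buffer.append((b, lead_regs[a]))
        lead_results.append(lead_regs[a])

    # Trailing thread: re-executes using forwarded load values only.
    for i, (op, a, b) in enumerate(program):
        if op == "load":
            trail_regs[a] = load_queue.popleft()
        elif op == "addi":
            trail_regs[a] = trail_regs[a] + b
        if trail_regs[a] != lead_results[i]:
            return i                 # error detected; pending stores stay buffered
        if op == "store":
            addr, value = store_buffer.popleft()
            memory[addr] = value     # store released only after verification
    return None

program = [("load", "r1", 0), ("addi", "r1", 5), ("store", "r1", 1)]
clean_memory = {0: 10, 1: 0}
clean_result = run_rmt(program, clean_memory)
faulty_memory = {0: 10, 1: 0}
faulty_result = run_rmt(program, faulty_memory, fault_at=1)
```

In the faulty run the mismatch is caught at the `addi`, so the buffered store never reaches memory.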
Fault Model
• A single error in either core can be detected
• Since loads are not replicated, the load/store datapath must be ECC protected
• For recovery, a second checker thread is required
• ECC in the checker register file will enable recovery in most cases without a second checker
RMT on SMT/CMP
+ SMT does not require inter-core traffic: values can be read from the shared register file/data cache
– Single-thread performance may be degraded
– Each redundant instr executes on a high-power pipeline
+ Trailing CMP core can be a simple in-order processor: low power/area overheads
+ Trailing core’s frequency can be independently controlled
+ Heterogeneous CMP where cores can be dynamically employed for throughput/reliability
+ Lower probability for errors
Parallelization of Trailing Thread
[Figure: Sequential Thread alongside Parallel Threads 1 to 4]
• Is it more power-efficient to execute the verification thread in parallel?
• If the trailing cores are frequency-scaled, dynamic power does not change, but leakage power increases
• If the trailing cores are frequency-and-voltage scaled, dynamic power decreases, and leakage power increases
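A back-of-the-envelope model makes the two cases concrete. It assumes the common first-order approximations that dynamic power per core is C·V²·f and that leakage power grows with the number of powered cores and with supply voltage; the function name, the constants, and the arbitrary units are all illustrative and not taken from the lecture.

```python
def trailing_power(n_cores, v_scale=1.0, c=1.0, v=1.0, f=1.0, i_leak=0.2):
    """Return (dynamic, leakage) power for n_cores trailing cores.

    Each core runs at frequency f / n_cores and voltage v * v_scale.
    Units are arbitrary and every constant is an illustrative assumption.
    """
    v_eff = v * v_scale
    # Dynamic power: n cores, each at 1/n of the frequency.
    dynamic = n_cores * c * v_eff ** 2 * (f / n_cores)
    # Leakage power: every powered core leaks, regardless of frequency.
    leakage = n_cores * v_eff * i_leak
    return dynamic, leakage

one_core = trailing_power(1)                      # single sequential trailing core
four_freq = trailing_power(4)                     # 4 cores, frequency scaled only
four_freq_volt = trailing_power(4, v_scale=0.7)   # 4 cores, frequency and voltage scaled
```

With frequency scaling alone, the four-core dynamic power equals the one-core value while leakage is four times larger. With voltage also scaled to 0.7, dynamic power drops, and in this model leakage still exceeds the one-core case because four cores at 0.7 V leak more than one core at 1.0 V.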
Error Types
Acronyms!!
• MTTF & MTBF: mean time to/between failures
• Errors are either SDC (silent data corruption) or DUE (detected unrecoverable errors)
Many errors get masked:
• ACE bits: these bits are required for architecturally correct execution
• un-ACE bits: these bits do not affect the final output
• AVF: architectural vulnerability factor (the percentage of time/space that a structure holds ACE state)
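As a small worked example of the AVF definition above, the sketch below computes AVF from a per-cycle count of ACE bits in a structure (returned as a fraction; multiply by 100 for a percentage), then uses it to derate a raw error rate. The occupancy trace and the raw rate are made-up numbers, and treating MTTF as the reciprocal of a constant error rate is an assumption of this sketch.

```python
def avf(ace_bits_per_cycle, total_bits):
    """Fraction of bit-cycles in which the structure held ACE state."""
    cycles = len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / (total_bits * cycles)

def mttf_hours(raw_errors_per_hour, structure_avf):
    """MTTF assuming a constant error rate, counting only strikes on ACE state."""
    return 1.0 / (raw_errors_per_hour * structure_avf)

# Hypothetical 64-bit structure observed for four cycles.
trace = [64, 32, 0, 32]            # ACE bits held in each cycle
structure_avf = avf(trace, 64)     # 128 ACE bit-cycles out of 256
derated_mttf = mttf_hours(0.001, structure_avf)
```

Because only half of the bit-cycles hold ACE state here, half of the raw errors are masked and the MTTF doubles relative to assuming every strike matters.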
Partial Coverage
• RMT covers faults in the entire core (almost!)
• If that is too expensive, provide error coverage in specific structures to reduce error probabilities
• Are there ways to ensure that an instruction spends less time in architecturally vulnerable structures?