Transient Fault Detection via Simultaneous Multithreading

Steven K. Reinhardt (stever@eecs.umich.edu), Electrical Engineering & Computer Science, University of Michigan, Ann Arbor, Michigan
Shubhendu S. Mukherjee (Shubu.Mukherjee@compaq.com), VSSAD, Alpha Technology, Compaq Computer Corporation, Shrewsbury, Massachusetts

27th Annual International Symposium on Computer Architecture (ISCA 2000)
Transient Faults

Faults that persist for a "short" duration.
- Cause: cosmic rays (e.g., neutrons)
- Effect: knock off electrons, discharge a capacitor
- No solution: there is no practical absorbent for cosmic rays
- Estimated fault rate: 1 fault per 1000 computers per year
- The future is worse: smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins
Slide 2
Fault Detection in the Compaq Himalaya System

- Replicated microprocessors run in cycle-by-cycle lockstep, each executing the same instruction (e.g., R1 <- (R2))
- Input replication and output comparison occur at the microprocessor boundary
- The rest of the system is protected by coding: memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC
Slide 3
Fault Detection via Simultaneous Multithreading

- Replace the replicated, lockstepped microprocessors with redundant threads, each executing the same instruction stream (e.g., R1 <- (R2) in each thread)
- Open question: how do threads get input replication and output comparison?
- The rest of the system is still protected by coding: memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC
Slide 4
Simultaneous Multithreading (SMT)

Instructions from multiple threads (Thread 1, Thread 2) share one instruction scheduler and one set of functional units in the same cycle. Example: Alpha 21464.
Slide 5
Simultaneous & Redundantly Threaded Processor (SRT)

SRT = SMT + fault detection
+ Less hardware than replicated microprocessors: SMT needs only ~5% more hardware than a uniprocessor, and SRT adds very little hardware overhead to an existing SMT
+ Better performance than complete replication: better use of shared resources
+ Lower cost: avoids complete replication and exploits the market volume of SMT & SRT
Slide 6
SRT Design Challenges

- Lockstepping doesn't work: an SMT core may issue the same instruction from the redundant threads in different cycles
- Instructions from the redundant threads must be carefully fetched and scheduled around branch mispredictions and cache misses
- Disclaimer: this talk focuses only on fault detection, not recovery
Slide 7
Contributions & Outline

- Sphere of Replication (SoR)
- Output comparison for SRT
- Input replication for SRT
- Performance optimizations for SRT
- SRT outperforms on-chip replicated microprocessors
- Related work
- Summary
Slide 8
Sphere of Replication (SoR)

The logical boundary of redundant execution within a system: two execution copies sit inside the sphere, with input replication applied to everything entering it and output comparison applied to everything leaving it for the rest of the system.
- Trade-off between information, time, & space redundancy
Slide 9
Example Spheres of Replication

- Compaq Himalaya: the SoR encloses the two microprocessors, so input replication and output comparison happen at the microprocessor boundary; memory covered by ECC, RAID array by parity, ServerNet by CRC
- ORH-Dual (On-Chip Replicated Hardware, similar to the IBM G5): the SoR encloses two pipelines on one chip; the instruction and data caches sit outside it, each covered by ECC, with memory, RAID, and ServerNet protected as above
Slide 10
Sphere of Replication for SRT

Thread 0 and Thread 1 flow through shared Fetch, Decode, and Register Rename stages (PC, RUU, integer and FP register files) into shared integer, FP, and load/store units, with instructions from both threads (e.g., R1 <- (R2), R3 = R1 + R7, R8 = R7 * 2) in flight together.
- The SoR excludes the instruction and data caches
- Alternate SoRs are possible (e.g., exclude the register file)... not in this talk
Slide 11
Output Comparison in SRT

Compare & validate output before sending it outside the SoR.
Slide 12
Output Comparison

- Stores: compare the <address, data> pairs from the redundant threads, validated at commit time in the store queue before the store goes to the data cache
- Uncached loads: compare the <address> from the redundant threads
- Cached loads: <address> comparison is not required
- Other output comparisons depend on the boundary of the SoR
Slide 13
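The commit-time store check above can be sketched as a small software model (names like `StoreComparator` are hypothetical; the real mechanism is hardware in the store queue, not software):

```python
# Sketch of SRT-style output comparison for stores: each redundant thread
# commits its stores, and a store leaves the sphere of replication only
# after both copies agree on the <address, data> pair.
from collections import deque

class StoreComparator:
    def __init__(self):
        # stores committed by the leading thread, awaiting their twins
        self.pending = deque()

    def commit(self, thread, address, data):
        """Return 'sent' when a matched pair is released to the cache,
        'fault' on a mismatch, or 'wait' while the twin is outstanding."""
        if thread == 0:                      # leading thread: buffer the store
            self.pending.append((address, data))
            return "wait"
        lead_addr, lead_data = self.pending.popleft()  # trailing thread: compare
        if (lead_addr, lead_data) == (address, data):
            return "sent"                    # validated; leaves the SoR
        return "fault"                       # transient fault detected
```

A matched pair is forwarded; any divergence between the threads surfaces as a detected fault before the bad value can escape the sphere.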
Input Replication in SRT

Replicate & deliver the same input (coming from outside the SoR) to the redundant threads.
Slide 14
Input Replication

- Cached load data:
  - pairing loads from the redundant threads: too slow
  - allowing both loads to probe the cache: false faults with I/O or multiprocessors
- Load Value Queue (LVQ): with pre-designated leading & trailing threads, the leading thread's load (e.g., R1 <- (R2) between an add and a sub) probes the cache; the trailing thread's copy of the load reads its value from the LVQ instead
Slide 15
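The LVQ idea can be sketched as follows (a software model with hypothetical names; in hardware the LVQ is a FIFO consumed in order by the trailing thread's non-speculative loads):

```python
# Sketch of a Load Value Queue (LVQ): the pre-designated leading thread
# probes the cache and deposits <address, value>; the trailing thread's
# load consumes the queued value instead of re-accessing memory, so both
# threads observe identical inputs even if memory changes in between
# (I/O devices, other processors).
from collections import deque

class LoadValueQueue:
    def __init__(self, memory):
        self.memory = memory          # dict standing in for the data cache
        self.queue = deque()

    def leading_load(self, address):
        value = self.memory[address]  # only the leading thread probes the cache
        self.queue.append((address, value))
        return value

    def trailing_load(self, address):
        addr, value = self.queue.popleft()  # consumed in order
        assert addr == address              # address mismatch would be a fault
        return value
```

Because the trailing thread never touches memory, an intervening external write cannot make the threads diverge and trigger a false fault.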
Input Replication (contd.)

- Cached load data: an alternate solution is the Active Load Address Buffer
- Special cases: cycle- or time-sensitive instructions; external interrupts
Slide 16
Outline

- Sphere of Replication (SoR)
- Output comparison for SRT
- Input replication for SRT
- Performance optimizations for SRT
- SRT outperforms on-chip replicated microprocessors
- Related work
- Summary
Slide 17
Performance Optimizations

- Slack fetch: maintain a constant slack of instructions between the leading and trailing threads
  + the leading thread prefetches cache misses for the trailing thread
  + the leading thread prefetches correct branch outcomes for the trailing thread
- Branch Outcome Queue: feed branch outcomes from the leading thread to the trailing thread
- Combine the above two
Slide 18
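The two optimizations above can be sketched together (a simplified software model with hypothetical names and a made-up slack value; the paper's fetch policy is a hardware heuristic, not this exact code):

```python
# Sketch of slack fetch combined with a branch outcome queue: the fetch
# stage favors the leading thread until it is a fixed number of
# instructions ahead; the trailing thread then replays resolved branch
# outcomes instead of predicting them, so it never mispredicts.
from collections import deque

class SlackFetch:
    def __init__(self, slack=256):
        self.slack = slack            # target instruction slack (illustrative)
        self.lead_count = 0
        self.trail_count = 0
        self.branch_outcomes = deque()  # leading -> trailing thread

    def choose_thread(self):
        # fetch from the leading thread until the slack target is reached
        if self.lead_count - self.trail_count < self.slack:
            return "leading"
        return "trailing"

    def fetch_leading(self, taken=None):
        self.lead_count += 1
        if taken is not None:         # resolved branch: record its outcome
            self.branch_outcomes.append(taken)

    def fetch_trailing(self):
        self.trail_count += 1

    def trailing_branch(self):
        # the trailing thread replays the leading thread's resolved outcome
        return self.branch_outcomes.popleft()
```

The slack gives the leading thread time to resolve misses and branches, so by the time the trailing thread reaches the same point, its inputs are already warm and its branch outcomes are known.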
Baseline Architecture Parameters

- L1 instruction cache: 64K bytes, 4-way associative, 32-byte blocks, single ported
- L1 data cache: 64K bytes, 4-way associative, 32-byte blocks, four read/write ports
- Unified L2 cache: 1M bytes, 4-way associative, 64-byte blocks
- Branch predictor: hybrid local/global (like the 21264); 13-bit global history register indexing an 8K-entry global PHT and an 8K-entry choice table; 2K 11-bit local history registers indexing a 2K-entry local PHT; 4K-entry BTB; 16-entry RAS (per thread)
- Fetch/decode/issue/commit width: 8 instructions/cycle (fetch can span 3 basic blocks)
- Function units: 6 int ALU, 2 int multiply, 4 FP add, 2 FP multiply
- Fetch-to-decode latency: 5 cycles; decode-to-execution latency: 10 cycles
Slide 19
Target Architectures

- SRT: SMT + fault detection
  - output comparison
  - input replication (Load Value Queue)
  - slack fetch + branch outcome queue
- ORH-Dual: On-Chip Replicated Hardware
  - each pipeline of the dual has half the resources of SRT
  - the two pipelines share the fetch stage (including the branch predictor)
Slide 20
Performance Model & Benchmarks

- SimpleScalar 3.0, modified to support SMT by Steve Raasch, U. of Michigan; the SMT/SimpleScalar model further modified to support SRT
- Benchmarks: a subset of the SPEC95 suite (11 benchmarks), compiled with gcc 2.6 at full optimization; for each benchmark, skipped between 300 million and 20 billion instructions, then simulated 200 million instructions
Slide 21
SRT vs. ORH-Dual

Average improvement = 16%, maximum = 29%.
Slide 22
Recent Related Work

- Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998: first to propose the use of SMT for fault detection
- AR-SMT, Rotenberg, FTCS, 1999: forwards values from the leading thread to a checker thread
- DIVA, Austin, MICRO, 1999: converts the checker thread into a simple processor
Slide 23
Improvements over Prior Work

- Sphere of Replication (SoR)
  - e.g., AR-SMT's register file must be augmented with ECC
  - e.g., DIVA must handle uncached loads in a special way
- Output comparison: AR-SMT & DIVA compare all instructions; SRT compares only selected ones, based on the SoR
- Input replication: AR-SMT & DIVA can detect false transient faults; SRT avoids this problem using the LVQ
- Slack fetch
Slide 24
Summary

- Simultaneous & Redundantly Threaded Processor (SRT): SMT + fault detection
- Sphere of replication
  - output comparison of committed store instructions
  - input replication via the load value queue
- Slack fetch & branch outcome queue
- SRT outperforms equivalently-sized on-chip replicated hardware by 16% on average & up to 29%
Slide 25