BulletProof: A Defect-Tolerant CMP Switch Architecture
Kypros Constantinides‡, Stephen Plaza‡, Jason Blome‡, Bin Zhang†, Valeria Bertacco‡, Scott Mahlke‡, Todd Austin‡, Michael Orshansky†
‡Advanced Computer Architecture Lab, University of Michigan
†Department of Electrical and Computer Engineering, University of Texas at Austin
HPCA, Austin, Texas, February 13 2006

Introduction
• Reliability is a critical aspect of any computer design
• System designers target very small failure rates
• Today, reliability targets are met with fault-avoidance design techniques
– use of conservative design margins
• For future process technologies it would be impossible to avoid system failures with conservative design margins alone
– need defect-tolerant design techniques
[Figure: transistor reliability versus transistor lifetime (years), now and in the future]

Reliable System Design Space
[Figure: design-space matrix. One axis is the type of defect (manufacturing defect, wear-out defect, transient error); the other is the design feature (NO-DETECTION, +CORRECTION, +REPAIR). Entries are grouped as mainstream, specialized, research-stage, and high-end solutions.]
Entries shown in the matrix:
– Untestable defects
– Testing
– Post-manufacturing recovery
– ECC (memory)
– System fails in unpredictable way
– System glitch manifests in unpredictable way
– Component terminates at first error
– Component terminates, hard-reset restore
– DMR, TMR, Diva, Razor, ECC
– Online defect recovery
– Post-manufacturing reconfiguration
– Cache-line swap-out
– Transient fault recovery
– Online repair
– Memory-array spares
– BulletProof
• Need for cost- and performance-efficient techniques that can provide high reliability in the presence of unreliable components
– “BulletProof”

CMP Switch Architecture
• Goal: a defect-tolerant CMP switch design
• Baseline switch architecture is provided by Li-Shiuan Peh
• Implements the routing and flow-control functions required for transmitting packets in a 2D torus network
• Wormhole switch pipelined at the flit level (32-bit flits)
• Dimension-order routing
• Specified in Verilog and synthesized to a gate-level netlist
– ~9K logic gates and 1700 sequential elements

Soft Errors (SEU) Vulnerability
• In earlier work we studied the vulnerability of the switch architecture to soft errors
– Only 3.2% of faults eventually cause an error
• Age-related wear-out silicon defects are a more challenging reliability threat for future technologies
• In this work we focus on solutions for in-field silicon defects
• These solutions also provide soft-error tolerance to the design

Self-Repairing Systems
• Defect-tolerant self-repairing systems need to support:
– Error Detection
– System Diagnosis (locate the origin of the error)
– System Repair
– System Recovery
• Key idea:
– error detection must be performance efficient
  • continuously checks execution for errors
– diagnosis, repair and recovery are not performance-critical
  • invoked only when an error is detected (rare scenario)
  • trade off performance for more cost-efficient techniques

Traditional Defect-Tolerant Techniques
• Traditional techniques for designing defect-tolerant systems:
– Triple Modular Redundancy (TMR)
  • Forward recovery
  • Applicable to both combinational and sequential logic
  • Cannot tolerate more than one defective module
  • Area and power overhead ~3X
– Error Correction Codes (ECC)
  • Lower overhead solution
  • Applicable only to state-holding structures and busses
[Figure: three modules (M) feeding a voter (V); an 8-bit code word combining ECC bits (R) and data bits (D)]

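The voting step behind TMR can be sketched in a few lines. This is a minimal illustration, assuming each module output is an integer bit vector; the name `tmr_vote` is hypothetical and does not come from the slides.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three module outputs.

    Each result bit takes the value that at least two of the three
    inputs agree on, so one faulty module is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


good = 0b1010

# One defective module: the two healthy copies win the vote.
one_bad = tmr_vote(good, good, 0b0110)

# Two defective modules that agree on a wrong value: the vote is wrong.
two_bad = tmr_vote(good, 0b0110, 0b0110)
```

The second call shows why TMR cannot tolerate more than one defective module: two matching faulty outputs form the majority.
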
Error Detection: Low-Cost Domain-Specific Technique
[Figure: switch datapath (input buffers, routing logic, cross-bar, cross-bar controller, arbiter) with the added checkers: CRC checkers, a buffer checker, and a flit error signal]
• The synthesized netlist of the added components accounts for ~10% of the total switch area
• Provides error detection for both hard and soft errors

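A CRC check on a flit might look like the sketch below. The slides do not state the CRC width or polynomial, so the 8-bit CRC with polynomial 0x07 is an assumption, and the helper names `crc8`, `attach_crc`, and `check_flit` are hypothetical.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 (assumed polynomial x^8 + x^2 + x + 1, zero initial value)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def attach_crc(flit: int):
    """Return the 32-bit flit together with its CRC, as the sender would."""
    return flit, crc8(flit.to_bytes(4, "big"))


def check_flit(flit: int, crc: int) -> bool:
    """Recompute the CRC at a checker and compare it with the attached one."""
    return crc8(flit.to_bytes(4, "big")) == crc


flit, crc = attach_crc(0xDEADBEEF)
clean_ok = check_flit(flit, crc)                 # unmodified flit passes
corrupted_ok = check_flit(flit ^ (1 << 5), crc)  # a single flipped bit is caught
```

The checker does not need to know whether the flipped bit came from a soft error or a silicon defect, which is why the same mechanism covers both.
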
Adding Defect Resiliency With Lower Cost
• Automatic Cluster Decomposition
• Balanced recursive min-cut heuristic algorithm
  Input: a) the design's gate-level netlist b) number of partitions
  Output: a partitioned netlist
  Goal:
  – Balance partition sizes: smaller partitions give higher resilience
  – Minimize cut edges: reduces cost overhead and vulnerable logic
• Partitions can have both combinational and sequential logic
[Figure: example netlist with nodes A to J grouped into partitions]

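The idea of balanced recursive bisection can be illustrated on a toy graph. The sketch below is not the heuristic used in the work: it substitutes an exhaustive search over balanced splits, which is only practical for tiny graphs, and the graph, the function names, and the restriction to power-of-two partition counts are all assumptions made for illustration.

```python
from itertools import combinations


def cut_size(edges, part):
    """Count edges with exactly one endpoint inside `part`."""
    return sum(1 for u, v in edges if (u in part) != (v in part))


def bisect(nodes, edges):
    """Split `nodes` into two equal halves with the fewest cut edges (brute force)."""
    nodes = sorted(nodes)
    members = set(nodes)
    inside = [(u, v) for u, v in edges if u in members and v in members]
    half = len(nodes) // 2
    best = min(combinations(nodes, half), key=lambda p: cut_size(inside, set(p)))
    return set(best), members - set(best)


def partition(nodes, edges, k):
    """Recursively bisect until k partitions exist (k assumed to be a power of two)."""
    if k == 1:
        return [set(nodes)]
    left, right = bisect(nodes, edges)
    return partition(left, edges, k // 2) + partition(right, edges, k // 2)


# Hypothetical netlist: two tightly connected clusters joined by one wire (D-E).
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("E", "F"), ("E", "G"), ("E", "H"), ("F", "G"), ("F", "H"), ("G", "H"),
         ("D", "E")]
parts = partition("ABCDEFGH", EDGES, 2)
```

On this graph the split falls on the single D-E wire, so only one signal crosses the partition boundary.
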
Partition Sparing – Silicon Protection Factor
• Partition sparing:
– Only one spare is active for each partition of the switch
– Replace voting logic with spare swapping logic
– Lower power overhead
– A defect is fatal if it hits the last spare of a partition or the spare swapping logic
[Figure: partitioned netlist (nodes A to J) with 1 extra spare per partition: 15.8X more tolerated defects, 7.6X more defects tolerated per unit area]
• SPF – Defect Tolerance
  Silicon Protection Factor (SPF) = Mean Defects to Failure / Area Overhead
– The number of defects in a design is proportional to the design's area
– Enables comparison of different defect-tolerant designs

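The SPF ratio can be explored with a small Monte Carlo model. The sketch below is a simplified illustration, not the evaluation method of the work: it assumes equal-area partitions, defects that land uniformly at random on any copy, and no defects in the spare swapping logic. The function names and parameters are hypothetical.

```python
import random


def mean_defects_to_failure(partitions, spares, trials=1000, seed=0):
    """Average number of random defects until some partition has no working copy.

    Each partition has `spares + 1` copies. A defect kills one copy chosen
    uniformly at random (hitting an already dead copy changes nothing).
    """
    rng = random.Random(seed)
    copies = spares + 1
    total = 0
    for _ in range(trials):
        dead = [set() for _ in range(partitions)]
        defects = 0
        while True:
            defects += 1
            p = rng.randrange(partitions)
            dead[p].add(rng.randrange(copies))
            if len(dead[p]) == copies:  # last working copy of partition p is gone
                break
        total += defects
    return total / trials


def spf(mean_defects, area_overhead):
    """Silicon Protection Factor = mean defects to failure / area overhead."""
    return mean_defects / area_overhead


unprotected = mean_defects_to_failure(partitions=1, spares=0)  # always fails on defect 1
spared = mean_defects_to_failure(partitions=12, spares=1)
```

Dividing by area overhead matters because a larger design also attracts proportionally more defects, so raw defect counts alone would favor designs that simply add area.
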
System Recovery
• Add a Recovery Pointer to each input buffer
• Recovery pointers advance 4 cycles after the input controller grants the requesting output channel
– Guarantees that the flit is CRC checked
• On error detection:
– All CRC checkers drop outgoing flits
– Switch pipeline is flushed
– Head pointers are set to recovery pointers
– Restart execution
[Figure: input buffer holding flits a to e with Tail, Head, and Recovery pointers; routed flits pass through CRC checkers, whose error detection signal drives the interconnect switch recovery logic]
  a: Correctly routed flit
  b, c: In the switch pipeline
  d: Next flit to be routed
  e: Last flit buffered

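The pointer behavior described on this slide can be modeled in software. The sketch below is a simplified illustration: it uses a growing list instead of a circular hardware buffer, and the class and method names are hypothetical. Only the 4-cycle delay comes from the slide.

```python
from collections import deque

CHECK_DELAY = 4  # cycles from grant until the flit is known to be CRC checked


class InputBuffer:
    def __init__(self):
        self.flits = []           # buffered flits, oldest first
        self.head = 0             # index of the next flit to route
        self.recovery = 0         # index of the oldest flit not yet confirmed
        self.in_flight = deque()  # remaining cycles for each granted flit

    def push(self, flit):
        self.flits.append(flit)

    def grant(self):
        """Send the head flit into the switch pipeline."""
        flit = self.flits[self.head]
        self.head += 1
        self.in_flight.append(CHECK_DELAY)
        return flit

    def tick(self):
        """Advance one cycle and move the recovery pointer past confirmed flits."""
        self.in_flight = deque(c - 1 for c in self.in_flight)
        while self.in_flight and self.in_flight[0] == 0:
            self.in_flight.popleft()
            self.recovery += 1

    def on_error(self):
        """Flush the pipeline and rewind the head pointer to the recovery pointer."""
        self.in_flight.clear()
        self.head = self.recovery


buf = InputBuffer()
for f in "abcde":
    buf.push(f)
buf.grant()                  # a enters the pipeline
for _ in range(CHECK_DELAY):
    buf.tick()               # a is now confirmed; recovery pointer moves to b
buf.grant(); buf.tick()      # b in the pipeline
buf.grant(); buf.tick()      # c in the pipeline, d is next to be routed
head_before = buf.head
buf.on_error()               # CRC checker raised the error signal
replayed = buf.grant()       # execution restarts from b
```

This reproduces the situation in the figure: a is safely routed, b and c are discarded in flight and replayed, and d and e were never at risk.
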
System Diagnosis and Repair
• Iterative trial-and-error technique
– Recover to the last correct state of the switch
– For partition i, swap in the spare for the current copy and restart execution
– Error detected? No: continue execution. Yes: increase i
– i < # partitions? Yes: repeat with the next partition. No: fatal defect
• Built-In-Self-Test (BIST)
– For each partition, keep automatically generated test vectors in ROM
– Apply test vectors to each partition through scan chains to locate the defective partition

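The trial-and-error loop can be sketched as follows. This is an illustration under assumptions, not the hardware implementation: `replay_ok` is a hypothetical callback standing in for "recover, restart, and watch the CRC checkers", and undoing an unsuccessful swap before moving on is an assumption the slide does not state.

```python
def diagnose_and_repair(active, copies, replay_ok):
    """Try one spare at a time until the replay runs without an error.

    active[i]  : index of the copy currently used for partition i
    copies[i]  : total number of copies (original plus spares) of partition i
    replay_ok  : callable(active) -> True if the replay raised no error
    Returns the repaired partition index, or None for a fatal defect.
    """
    for i in range(len(active)):
        if active[i] + 1 >= copies[i]:
            continue                 # no spare left for this partition
        active[i] += 1               # swap in the next spare
        if replay_ok(active):
            return i                 # defect was in partition i; keep the spare
        active[i] -= 1               # assumption: undo the swap, try the next one
    return None                      # every candidate failed: fatal defect


# Hypothetical defect map: copy 0 of partition 2 is broken.
bad = {(2, 0)}


def replay(cfg):
    return all((p, c) not in bad for p, c in enumerate(cfg))


active = [0, 0, 0, 0]
repaired = diagnose_and_repair(active, [2, 2, 2, 2], replay)
```

The loop is slow compared with BIST because each trial replays real traffic, but it needs no test-vector ROM or scan chains, and it only runs in the rare case that an error is detected.
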
Exploring Defect-Tolerant CMP Switch Designs
[Figure: design-space plot of area overhead versus SPF, separating Pareto optimal from Pareto sub-optimal designs, with arrows toward cheaper designs and toward more robust designs. Labeled design points include TMR, spare input controllers, 1 spare per partition, 2 spares per partition, and 1 spare per cmp, with diagnosis by iterative replay or Built-In-Self-Test. Area values shown: 1.76X, 2.3X, 3.04X, 3.16X, 3.4X. SPF values shown: 1.54, 2.53, 5.54, 7.6, 11.1.]
• How do these techniques affect the system's lifetime?

“Bathtub Curve”: A Model for Semiconductor Hard Failures
• The lifetime failure rate of semiconductor systems follows what is known as the bathtub curve
• Trend for future process technologies:
– Failure rate during the grace period gets larger
– Breakdown period starts earlier in the system's lifetime
[Figure: failure rate (FIT) versus time across the infant period, grace period, and breakdown period, with a second curve for future process technologies]

System Lifetime – A Post-65nm Technology Case Scenario
[Figure: failure rate (FIT), axis from 12000 to 120000, over system lifetime for four designs: TMR (SPF=1.54); 3/5 spare IC, 1 spare rest (SPF=3.01); 1 spare (SPF=7.63); 2 spares (SPF=11.11). Annotation: 1 defect every two years]

Conclusions – Future Work
Conclusions
• Traditional mechanisms are insufficient for tolerating moderate numbers of defects
• Domain-specific techniques along with resource sparing, iterative diagnosis and reconfiguration are more effective
• Decomposing the design into modest-sized partitions is the most effective granularity to apply redundancy
Future Work
• Use of spare components based on component wear-out profiles
• Explore low-cost defect-tolerant techniques for microprocessors

Questions?