Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware

- Slides: 29

Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware
Smruti R. Sarangi, Abhishek Tiwari, Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

Can a Processor have a Design Defect?
- "No way!"
- Yes, and it is a major challenge.

A Major Challenge?
- 50-70% of effort spent on debugging
- 1-2 year verification times
- Massive computational resources
- Some defects still slip through to production silicon

Defects slip through?
- 1994: Pentium defect costs Intel $475 million
- 1999: Defect leads to stoppage in shipping Pentium III servers
- 2004: AMD Opteron defect leads to data loss
- 2005: A version of Itanium 2 recalled
- Does not look like it will stop: increasing features on chip
- Conventional approaches are ineffective:
  - Micro-code patching
  - Compiler workarounds
  - OS hacks
  - Firmware

Vision
Processors include programmable HW for patching design defects:
- Vendor discovers a new defect
- Vendor characterizes the conditions that exercise the defect
- Vendor sends a defect signature to processors in the field
- Customers patch the HW defect

Additional Advantage: Reduced Time to Market
[Figure: % of defects detected; Pentium-M, Silas et al., 2003]
- 8 weeks: reduced time to market
- Vital ingredient of profitability

Outline
- Analysis and Characterization
- Architecture for Hardware Patching
- Evaluation

Defects in Deployed Systems
We studied public-domain errata documents for 10 current processors:
- Intel Pentium III, IV, M, and Itanium I and II
- AMD K6, Athlon 64
- IBM G3 (PPC 750FX), Motorola G4 (MPC 7457)

Dissecting a Defect (from an errata document)
- Module: L1, ALU, memory, etc.
- Type of error: hang, data corruption, IO failure, wrong data
- Condition: A (B C D), a combination of signals
- Signal: snoop, L1 hit, IO request, low power mode
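To make the idea of a defect condition concrete, here is a minimal sketch of a condition evaluated over a snapshot of signals. The slide gives the condition only as "A (B C D)"; the reading A AND (B OR C OR D) is an assumption, as are the function names and the pairing of these particular signals into one condition.

```python
from typing import Callable, Dict

# A snapshot maps a signal name to its value in one cycle.
Snapshot = Dict[str, bool]


def make_condition(a: str, b: str, c: str, d: str) -> Callable[[Snapshot], bool]:
    """Build a predicate of the ASSUMED form A AND (B OR C OR D)."""
    def condition(snapshot: Snapshot) -> bool:
        # Signals missing from the snapshot are treated as deasserted.
        def get(name: str) -> bool:
            return snapshot.get(name, False)
        return get(a) and (get(b) or get(c) or get(d))
    return condition


# Signal names come from the slide; combining them into one condition is hypothetical.
hang_condition = make_condition("low_power_mode", "snoop", "l1_hit", "io_request")

print(hang_condition({"low_power_mode": True, "io_request": True}))  # True
print(hang_condition({"snoop": True, "l1_hit": True}))               # False
```

The second call returns False because the assumed form requires signal A (here, low power mode) to be asserted regardless of the other three.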

Types of Defects
- Non-Critical
  - Performance counters
  - Error reporting registers
  - Breakpoint support
- Critical
  - Defects in memory, IO, etc.
  - Concurrent: all signals occur at the same time
  - Complex: signals occur at different times

Characterization
[Figure: defect breakdown, 31% vs. 69%]

When can the defects be detected?
[Figure: timeline showing Signals, Condition, Detector, and Defect; units labeled ALU and Memory, IO]
- Pre Defect (63%)
- Post Defect (37%): Local, Pipeline, Other

Outline
- Analysis and Characterization
- Architecture for Hardware Patching
- Evaluation

Phoenix Conceptual Design
- Signature Buffer
  - Stores defect signatures obtained from the vendor
  - Programs the on-chip reconfigurable logic
- Signal Selection Unit (SSU), reconfigurable logic
  - Taps signals from units
  - Selects a subset
- Bug Detection Unit (BDU), reconfigurable logic
  - Collects signals from SSUs
  - Computes defect conditions
- Global Recovery Unit
  - Initiates recovery if a defect condition is true
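The flow from tapped signals to a recovery decision can be sketched in software as follows. This is only an illustrative model of the roles listed above, not the hardware design: the class names, method names, and the example signature are hypothetical, and the real units are programmable logic rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Signals = Dict[str, bool]


@dataclass
class SignalSelectionUnit:
    """Taps a unit's signals and forwards only the programmed subset."""
    selected: List[str]

    def select(self, tapped: Signals) -> Signals:
        return {name: tapped.get(name, False) for name in self.selected}


@dataclass
class BugDetectionUnit:
    """Collects SSU outputs and evaluates the programmed defect conditions."""
    conditions: Dict[str, Callable[[Signals], bool]] = field(default_factory=dict)

    def detect(self, ssu_outputs: List[Signals]) -> List[str]:
        merged: Signals = {}
        for out in ssu_outputs:
            merged.update(out)
        # Return the names of all conditions that are true this cycle.
        return [name for name, cond in self.conditions.items() if cond(merged)]


def global_recovery(fired: List[str]) -> str:
    """Initiate recovery if any defect condition is true."""
    return "recover" if fired else "continue"


# Hypothetical signature: which signals to select and which condition to check.
ssu = SignalSelectionUnit(selected=["l2_hit", "low_power_mode"])
bdu = BugDetectionUnit(conditions={
    "defect_1": lambda s: s["l2_hit"] and s["low_power_mode"],
})

tapped = {"l2_hit": True, "low_power_mode": True, "alu_access": False}
fired = bdu.detect([ssu.select(tapped)])
print(fired, global_recovery(fired))  # ['defect_1'] recover
```

In this model, loading a new signature amounts to replacing `selected` and `conditions`, which mirrors the Signature Buffer programming the reconfigurable logic.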

Distributed Design of Phoenix
[Figure: a neighborhood of subsystems; each subsystem has an SSU and a BDU, the SSUs connect to a HUB, and each BDU connects to the Recovery Unit]
Examples of subsystems: Inst. Cache, FP ALU, Virtual Mem., Fetch Unit, L1 Cache, IO Cntrl.

Overall Design
[Figure: inside the chip boundary, several neighborhoods, each with a HUB, and a Global Recovery Unit]

Type of Defect:
- Local Post: reset module
- Pipeline Post + Pre: flush pipeline
- Rest of Post: if there is checkpointing support, rollback; if not, interrupt to OS
Software Recovery Handler: turn condition off, continue
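The recovery choice can be written as a small decision function. This is a sketch based on one reading of the flattened flowchart above; the enum and function names are hypothetical, and the mapping from defect type to action is an assumption about how the original diagram was laid out.

```python
from enum import Enum


class DefectType(Enum):
    PRE = "pre"
    LOCAL_POST = "local post"
    PIPELINE_POST = "pipeline post"
    REST_OF_POST = "rest of post"


def recovery_action(defect: DefectType, has_checkpointing: bool) -> str:
    """Pick a recovery action, following the assumed reading of the flowchart."""
    if defect is DefectType.LOCAL_POST:
        return "reset module"
    if defect in (DefectType.PIPELINE_POST, DefectType.PRE):
        return "flush pipeline"
    # Rest of Post: the action depends on checkpointing support.
    return "rollback" if has_checkpointing else "interrupt to OS"


print(recovery_action(DefectType.PRE, has_checkpointing=False))          # flush pipeline
print(recovery_action(DefectType.REST_OF_POST, has_checkpointing=True))  # rollback
```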

Designing Phoenix for a New Processor
- List of signals (training data)
  - Generic: learn from other processors
  - Specific: processor data sheets
- Sizes of structures (training data)
  - Scatter plot of sizes vs. # of signals in unit
  - Derive rules of thumb

Designing Phoenix for a New Proc. – II
- Generate list of signals to tap
- Decide on breakdown of subsystems and neighborhoods
- Place BDUs, SSUs, and HUBs
- Size structures using the rules of thumb
- Route all signals and realize the logic function of defects

Outline
- Analysis and Characterization
- Architecture for Hardware Patching
- Evaluation

Signals Tapped
Generic + Specific: 150-270 signals
- Generic signals: L2 hit, low power mode, ALU access, etc.
- Specific signals: A20 pin set in Pentium 4, BAT mode in IBM 750FX

Defect Coverage Results
- Training set: Intel P3, P4, P-M, Itanium I & II; AMD K6, K7, Opteron; IBM G3; Motorola G4
  - All defects, detect: Concurrent 69%, Complex 31%
  - All defects, recover: Pre 63%, Post 37%
- Test set: UltraSparc II, Intel IXP 1200, Intel PXA 270, PPC 970, Pentium D
  - Detection coverage: 65%
  - Recovery coverage: 60%

Overheads
- Area: 0.05%
  - Programmable logic (PLA & interconnect)
  - Estimated using PLA layouts (Khatri et al.)
- Wiring: 0.48%
  - Wires to route signals
  - Estimated using Rent's rule
- Timing: none
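For readers unfamiliar with Rent's rule, it relates the number of external terminals T of a logic block to its gate count G as T = t * G^p, where t and p are empirical constants. The sketch below only illustrates the formula; the values of t and p are illustrative and do not come from the slides, and the slides do not say how the rule was applied to obtain the wiring estimate.

```python
def rent_terminals(gates: int, t: float, p: float) -> float:
    """Rent's rule: T = t * G**p, the terminal count of a block with G gates."""
    return t * gates ** p


# Illustrative constants only (not taken from the slides).
print(rent_terminals(10_000, t=4.0, p=0.5))  # 400.0
```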

Impact of Training Set Size
- The training set only needs to have 7 processors
- Coverage in new processors is very high

Conclusion
- We analyzed the defects in 10 processors
- Phoenix: novel on-chip programmable HW
- Evaluated impact:
  - 150-270 signals tapped
  - Negligible area, wiring, and performance overhead
  - Defect coverage: 69% detected, 63% recovered
- Algorithm to automatically size Phoenix for new procs
- We can now live with defects!

Backup

Phoenix Algorithm for New Processors
- Generate signal list
- Place an SSU-BDU pair in each subsystem
- Use k-means clustering to group subsystems in neighborhoods
- Size hardware using the rules of thumb
Defect Coverage for New Processors
- Map signals in errata to signals in the list
- Route all signals and realize the logic function
- Similar results obtained for 9 Sun processors: UltraSparc III, III++, IIIi, IIIe, IV+, Niagara I and II
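The k-means grouping step in the algorithm can be sketched as below. This is plain Lloyd's algorithm in pure Python, with fixed initial centroids so the result is deterministic. The slides do not say which features are clustered; using 2D positions of subsystems is an assumption, and the coordinates and function name are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def kmeans(points: List[Point], centroids: List[Point], iterations: int = 10) -> List[int]:
    """Plain k-means (Lloyd's algorithm); returns a cluster index per point."""
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        centroids = new_centroids
    return assignment


# Hypothetical subsystem positions (assumed feature: location on the chip).
subsystems = {
    "fetch_unit": (0.0, 0.0),
    "inst_cache": (1.0, 0.0),
    "l1_cache": (9.0, 9.0),
    "io_cntrl": (10.0, 9.0),
}
labels = kmeans(list(subsystems.values()), centroids=[(0.0, 0.0), (10.0, 10.0)])
print(dict(zip(subsystems, labels)))
# {'fetch_unit': 0, 'inst_cache': 0, 'l1_cache': 1, 'io_cntrl': 1}
```

Each resulting cluster would correspond to one neighborhood sharing a HUB.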

Where are the Critical defects?
- The core is well debugged
- Most of the defects are in the mem. system