Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware





























Phoenix: Detecting and Recovering from Permanent Processor Design Bugs with Programmable Hardware
Smruti R. Sarangi, Abhishek Tiwari, Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Can a Processor Have a Design Defect?
- No way!!!
- Yes, it is a major challenge.
A Major Challenge???
- 50-70% of effort spent on debugging
- 1-2 year verification times
- Massive computational resources
- Some defects still slip through to production silicon
Defects Slip Through???
- 1994: Pentium defect costs Intel $475 million
- 1999: Defect leads to stoppage in shipping Pentium III servers
- 2004: AMD Opteron defect leads to data loss
- 2005: A version of Itanium 2 recalled
- Increasing features on chip: does not look like it will stop
- Conventional approaches are ineffective
  - Micro-code patching
  - Compiler workarounds
  - OS hacks
  - Firmware
Vision
Processors include programmable HW for patching design defects
- Vendor discovers a new defect
- Vendor characterizes the conditions that exercise the defect
- Vendor sends a defect signature to processors in the field
- Customers patch the HW defect
Additional Advantage: Reduced Time to Market
[Figure: % of defects detected, Pentium-M (Silas et al., 2003); annotation: 8 weeks]
- Reduced time to market
- Vital ingredient of profitability
Outline
- Analysis and Characterization
- Architecture for Hardware Patching
- Evaluation
Defects in Deployed Systems
[Figure: % of defects detected; axis ticks 50, 100%]
- We studied public-domain errata documents for 10 current processors
  - Intel Pentium III, IV, M, and Itanium I and II
  - AMD K6, Athlon 64
  - IBM G3 (PPC 750FX), Motorola G4 (MPC 7457)
Dissecting a Defect – from Errata doc.
- Module: L1, ALU, Memory, etc.
- Type of Error: hang, data corruption, IO failure, wrong data
- Condition: A (B C D)
- Signal: snoop, L1 hit, IO request, low power mode
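One way to picture an erratum dissected along these lines is as a small record holding the module, the error type, and a Boolean condition over named signals. The sketch below is illustrative only: `DefectEntry`, its field names, the signal names, and in particular the Boolean operators in the example condition are assumptions, not taken from the slides or from any errata document.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Signals = Dict[str, bool]  # signal name -> current value


@dataclass(frozen=True)
class DefectEntry:
    """Hypothetical record for one erratum: where, what goes wrong, and when."""
    module: str
    error_type: str
    condition: Callable[[Signals], bool]

    def is_exercised(self, signals: Signals) -> bool:
        # The defect is exercised when its condition holds on the signals.
        return self.condition(signals)


# Hypothetical entry; the AND/OR structure is chosen for illustration only.
example = DefectEntry(
    module="L1",
    error_type="hang",
    condition=lambda s: s["snoop"]
    and (s["l1_hit"] or s["io_request"] or s["low_power_mode"]),
)

quiet = {"snoop": False, "l1_hit": True, "io_request": False, "low_power_mode": False}
risky = {"snoop": True, "l1_hit": True, "io_request": False, "low_power_mode": False}

print(example.is_exercised(quiet), example.is_exercised(risky))
```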
Types of Defects
- Design Defect
  - Non-Critical
    - Performance counters
    - Error reporting registers
    - Breakpoint support
  - Critical
    - Defects in memory, IO, etc.
    - Concurrent: all signals at the same time
    - Complex: signals at different times
Characterization
[Chart: 31% / 69% split]
When Can the Defects Be Detected?
- Pre Defect (63%)
- Post Defect (37%): Local, Pipeline, Other
[Diagram labels: Signals, Condition, Detector, Defect, time; ALU; Memory, IO]
Outline
- Analysis and Characterization
- Architecture for Hardware Patching
- Evaluation
Phoenix Conceptual Design
- Signature Buffer
  - Store defect signatures obtained from vendor
  - Program the on-chip reconfigurable logic
- Reconfigurable Logic
  - Signal Selection Unit (SSU)
    - Tap signals from units
    - Select a subset
  - Bug Detection Unit (BDU)
    - Collect signals from SSUs
    - Compute defect conditions
- Global Recovery Unit
  - Initiate recovery if a defect condition is true
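The data flow on this slide (tap and select signals, compute defect conditions, initiate recovery) can be sketched in software. This is a behavioral illustration under stated assumptions, not the hardware design: the function names, the signal names, and the example condition are all hypothetical.

```python
from typing import Callable, Dict, List

Signals = Dict[str, bool]
Condition = Callable[[Signals], bool]


def ssu_select(tapped: Signals, selected: List[str]) -> Signals:
    """Signal Selection Unit: keep only the programmed subset of tapped signals."""
    return {name: tapped[name] for name in selected}


def bdu_detect(ssu_outputs: List[Signals], conditions: Dict[str, Condition]) -> List[str]:
    """Bug Detection Unit: merge SSU outputs and return the conditions that hold."""
    merged: Signals = {}
    for output in ssu_outputs:
        merged.update(output)
    return [name for name, cond in conditions.items() if cond(merged)]


def global_recovery(flagged: List[str]) -> str:
    """Global Recovery Unit: initiate recovery if any defect condition is true."""
    return "recover" if flagged else "continue"


# Hypothetical tapped signals and one hypothetical defect signature.
tapped = {"l2_hit": True, "low_power_mode": True, "alu_access": False}
conditions = {"defect_1": lambda s: s["l2_hit"] and s["low_power_mode"]}

selected = ssu_select(tapped, ["l2_hit", "low_power_mode"])
flagged = bdu_detect([selected], conditions)
print(flagged, global_recovery(flagged))
```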
Distributed Design of Phoenix
[Diagram: a neighborhood of subsystems, each with an SSU and a BDU, connected through a HUB; BDUs signal the Recovery Unit]
- Examples of subsystems: Inst. Cache, FP ALU, Virtual Mem., Fetch Unit, L1 Cache, IO Cntrl.
Overall Design
[Diagram: chip boundary enclosing neighborhoods, each with a HUB, and the Global Recovery Unit]
Software Recovery Handler
- Type of defect?
  - Pipeline Post + Pre: flush pipeline
  - Local Post: reset module
  - Rest of Post: checkpointing support?
    - Yes: rollback
    - No: interrupt to OS
- Turn condition off
- Continue
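Under one reading of the flowchart labels on this slide, the handler's choice of action can be sketched as a small decision function. The enum, the function name, and the returned strings are hypothetical; the slide does not show how the handler is actually implemented.

```python
from enum import Enum


class DefectKind(Enum):
    PRE = "pre"
    PIPELINE_POST = "pipeline post"
    LOCAL_POST = "local post"
    REST_OF_POST = "rest of post"


def recovery_action(kind: DefectKind, has_checkpointing: bool) -> str:
    """Pick a recovery action following the flowchart as read from the slide."""
    if kind in (DefectKind.PRE, DefectKind.PIPELINE_POST):
        return "flush pipeline"
    if kind is DefectKind.LOCAL_POST:
        return "reset module"
    # Remaining Post defects depend on whether checkpointing support exists.
    return "rollback" if has_checkpointing else "interrupt to OS"


for kind in DefectKind:
    print(kind.value, "->", recovery_action(kind, has_checkpointing=False))
```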
Designing Phoenix for a New Processor
- List of signals
  - Generic: learn from other processors (training data)
  - Specific: processor data sheets
- Sizes of structures (training data)
  - Scatter plot of sizes vs. # of signals in unit
  - Derive rules of thumb
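The slide does not say what form the rules of thumb take. As an illustration only, the sketch below assumes a straight-line relation between the number of signals in a unit and the size of a structure, and fits it by ordinary least squares. The function names and the data points are synthetic placeholders, not measurements from the talk.

```python
from typing import List, Tuple


def fit_rule_of_thumb(samples: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares line through (num_signals, structure_size) points."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic placeholder points lying on size = 2 * signals + 1.
training = [(10.0, 21.0), (20.0, 41.0), (30.0, 61.0)]
slope, intercept = fit_rule_of_thumb(training)


def predicted_size(num_signals: float) -> float:
    """Apply the fitted rule to a unit of a new processor."""
    return slope * num_signals + intercept


print(slope, intercept, predicted_size(40.0))
```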
Designing Phoenix for a New Proc. – II
- Generate list of signals to tap
- Decide on breakdown of subsystems and neighborhoods
- Place BDUs, SSUs, and HUBs
- Size structures using the rules of thumb
- Route all signals and realize the logic function of defects
Outline
- Analysis and Characterization
- Architecture for Hardware Patching
- Evaluation
Signals Tapped
- Generic + Specific: 150-270
- Generic signals
  - L2 hit, low power mode
  - ALU access, etc.
- Specific signals
  - A20 pin set in Pentium 4
  - BAT mode in IBM 750FX
Defect Coverage Results
- All defects
  - Detect: Concurrent 69%, Complex 31%
  - Recover: Pre 63%, Post 37%
- Training set: Intel P3, P4, P-M, Itanium I & II; AMD K6, K7, Opteron; IBM G3; Motorola G4
- Test set: UltraSparc II, Intel IXP1200, Intel PXA270, PPC 970, Pentium D
- Test processors
  - Detection coverage: 65%
  - Recovery coverage: 60%
Overheads
- Area: 0.05%
  - Programmable logic (PLA & interconnect)
  - Estimated using PLA layouts (Khatri et al.)
- Wiring: 0.48%
  - Wires to route signals
  - Estimated using Rent's rule
- Timing: none
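For reference, Rent's rule in its usual textbook form relates the number of external connections of a logic block to the number of gates it contains. The slide does not give the parameter values used for the wiring estimate, so the symbols below are generic.

```latex
% Rent's rule (general form):
%   T = number of external terminals of a block
%   g = number of gates in the block
%   t = average number of terminals per gate
%   p = Rent exponent, with 0 < p < 1
T = t \, g^{p}
```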
Impact of Training Set Size
- Training set only needs to have 7 processors
- Coverage in new processors is very high
Conclusion
- We analyzed the defects in 10 processors
- Phoenix: novel on-chip programmable HW
- Evaluated impact:
  - 150-270 signals tapped
  - Negligible area, wiring, and performance overhead
  - Defect coverage: 69% detected, 63% recovered
- Algorithm to automatically size Phoenix for new procs
- We can now live with defects!!!
Backup
Phoenix Algorithm for New Processors / Defect Coverage for New Processors
- Generate signal list
- Place an SSU-BDU pair in each subsystem
- Use k-means clustering to group subsystems into neighborhoods
- Size hardware using the rules of thumb
- Map signals in errata to signals in the list
- Route all signals and realize the logic function
- Similar results obtained for 9 Sun processors: UltraSparc III, III++, IIIi, IIIe, IV+, Niagara I and II
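The k-means step can be sketched as follows. The slide does not say which features the subsystems are clustered on; this example assumes hypothetical 2-D floorplan coordinates and uses plain Lloyd's algorithm with the first k points as initial centroids. The coordinates and the `kmeans` helper are illustrative, not from the talk.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def kmeans(points: List[Point], k: int, iterations: int = 20) -> List[int]:
    """Lloyd's algorithm; returns a cluster index for each point."""
    centroids = list(points[:k])  # deterministic init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


# Hypothetical subsystem positions (arbitrary units).
subsystems: Dict[str, Point] = {
    "Inst. Cache": (0.0, 0.0),
    "Fetch Unit": (1.0, 0.0),
    "FP ALU": (0.0, 1.0),
    "L1 Cache": (10.0, 10.0),
    "Virtual Mem.": (11.0, 10.0),
    "IO Cntrl.": (10.0, 11.0),
}

labels = kmeans(list(subsystems.values()), k=2)
neighborhoods = dict(zip(subsystems, labels))
print(neighborhoods)
```

Each distinct label then corresponds to one neighborhood, served by its own HUB.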
Where Are the Critical Defects?
- The core is well debugged
- Most of the defects are in the mem. system