
Fault Analysis Using Pin
Srilatha (Bobbie) Manne, Intel

Copyright © 2006 Intel Corporation. All Rights Reserved.

What are we trying to do?

• Purpose: Simulate the occurrence of transient (or persistent) faults and analyze their impact on applications.
• Why Pin?
  - Easy to model faults and measure their impact
  - Relatively fast (5-10 minutes per fault injection)
  - Provides full program analysis

Pros & Cons

[Figure: fault-analysis approaches (Software Instrumentation, Architectural Simulator, RTL, Silicon) compared on Accuracy and Ease of Use.]

Pin's View of the World

[Figure: diagram with labels Arch Reg, uArch State, Memory.]

Modeling Microarchitectural Faults in Pin

• Accuracy of fault methodology depends on the complexity of the underlying microarchitecture
  - Easier to model faults in an in-order, single-issue machine
• Build a microarchitectural model into Pin
  - A low-fidelity model may suffice
  - Adds complexity and slows down simulation
• Mimic certain types of microarchitectural faults in Pin

Example: Destination Register Transmission Fault

Fault occurs in latches when forwarding instruction output.

• Change architectural value of destination register at the instruction where the fault occurs
• NOTE: This is different from inserting a fault into the register file, because the destination is selected based on the instruction where the fault occurs

[Figure: block diagram with labels Exec Unit, Latches, Bypass Logic, ROB, RS.]

Example: Load Data Transmission Faults

Fault occurs when loading data from the memory system.

• Before load instruction, insert fault into memory
• Execute load instruction
• After load instruction, remove fault from memory (Cleanup)
• NOTE: This models a fault occurring in the transmission of data from the STB or L1 cache

[Figure: block diagram with labels Load Buffer, Latches, STB, DCache.]
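The insert, load, cleanup sequence above can be sketched outside Pin as follows. This is a minimal illustration, not the deck's tool: `faulty_load` and its parameters are hypothetical names, the "memory" is an ordinary byte buffer, and the load is assumed to be 32 bits wide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Perform a 32-bit load from `addr` as if one bit had been corrupted in
// transit. `bit` indexes the 32 bits of the loaded word in memory order.
std::uint32_t faulty_load(unsigned char* addr, unsigned bit) {
    assert(bit < 32);
    const std::size_t byte = bit / 8;
    const unsigned char mask = static_cast<unsigned char>(1u << (bit % 8));

    addr[byte] ^= mask;                          // 1. insert fault into memory
    std::uint32_t loaded;
    std::memcpy(&loaded, addr, sizeof loaded);   // 2. execute the load
    addr[byte] ^= mask;                          // 3. cleanup: restore memory
    return loaded;
}
```

Only the loaded value carries the flipped bit. Memory is back to its original contents afterwards, which is what separates a transmission fault from a fault in the stored data itself.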

Five Step Program for Fault Analysis

1. Determine 'when' the fault occurs
2. Determine 'where' the fault occurs
3. Inject Fault
4. Cleanup (Optional)
5. Determine Outcome

Step 1: WHEN

• Reality:
  - Assuming that environmental conditions stay the same, transient faults can occur with equal probability at any time during the run of the application.
• Approximation:
  - Transient faults occur on any dynamic instruction with equal probability
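Under this approximation, choosing when to inject amounts to drawing one dynamic instruction index uniformly at random. A minimal sketch, assuming the total dynamic instruction count is already known from a profiling run; `pick_fault_instruction` and `total_dyn_insts` are hypothetical names, not part of the deck's tools.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Draw the dynamic instruction at which the fault will be injected.
// Every index in [1, total_dyn_insts] is equally likely.
std::uint64_t pick_fault_instruction(std::uint64_t total_dyn_insts,
                                     std::mt19937_64& rng) {
    assert(total_dyn_insts > 0);
    std::uniform_int_distribution<std::uint64_t> dist(1, total_dyn_insts);
    return dist(rng);
}
```

Seeding the generator with a fixed value makes an injection campaign repeatable, which helps when a particular run needs to be re-examined.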

Step 1: WHEN

• Sample Pin Tool: InstCount.C
  - Purpose: Efficiently determines the number of dynamic instances of each static instruction.
• Output: For each static instruction
  - Function name
  - Dynamic instructions per static instruction

    IP: 135000941 Count: 492714322 Func: propagate_block.104
    IP: 135000939 Count: 492714322 Func: propagate_block.104
    IP: 135000961 Count: 492701800 Func: propagate_block.104
    IP: 135000959 Count: 492701800 Func: propagate_block.104
    IP: 135000956 Count: 492701800 Func: propagate_block.104
    IP: 135000950 Count: 492701800 Func: propagate_block.104
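The slides do not say how this output is consumed. One plausible use, shown here purely as an assumption, is to parse the lines and sum the counts to get the range from which a fault point is drawn. `InstCountEntry`, `parse_entry`, and `total_dynamic_instructions` are hypothetical helpers, and the IP is assumed to be printed in decimal as in the sample.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct InstCountEntry {
    std::uint64_t ip = 0;      // address of the static instruction
    std::uint64_t count = 0;   // dynamic executions of that instruction
    std::string func;          // enclosing function name
};

// Parse one line of the form "IP: <ip> Count: <n> Func: <name>".
// Returns false if the line does not match that layout.
bool parse_entry(const std::string& line, InstCountEntry& out) {
    std::istringstream in(line);
    std::string ip_key, count_key, func_key;
    if (!(in >> ip_key >> out.ip >> count_key >> out.count >> func_key >> out.func)) {
        return false;
    }
    return ip_key == "IP:" && count_key == "Count:" && func_key == "Func:";
}

// Sum the per-instruction counts of the parsed entries.
std::uint64_t total_dynamic_instructions(const std::vector<InstCountEntry>& entries) {
    std::uint64_t total = 0;
    for (const auto& e : entries) total += e.count;
    return total;
}
```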

Step 2: WHERE

• Reality:
  - Where the transient fault occurs is a function of the size of the structure on the chip.
  - Faults can occur in both architectural and microarchitectural state.
• Approximation:
  - Pin only provides architectural state, not microarchitectural state (no uops, for instance)
  - Either inject faults only into architectural state
  - Or build an approximation for some microarchitectural state

Step 3: Injecting Fault

• Pass context and other relevant information to analysis routine to modify the architectural state
• Inject fault
• Flush code cache to force immediate reinstrumentation
• Force execution at a particular point using the context

Step 4: Cleanup

• Cleanup is an optional step and is only necessary for modeling microarchitectural faults, not architectural faults
  - Example: modeling a fault in the transmission of data to a load op

Step 5: Determining Outcome

• Outcomes that can be tracked:
  - Did the program complete?
  - Did the program complete and have the correct IO result?
  - If the program crashed, how many instructions were executed after fault injection before the program crashed?
  - If the program crashed, why did it crash (trapping signals)?
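The tracked outcomes can be folded into a small classification. The sketch below is illustrative only: the `Outcome` categories, the `RunResult` fields, and the rule that "did not complete" means "crashed" are assumptions (a hung run, for example, would need its own category).

```cpp
#include <cstdint>

enum class Outcome { CompletedCorrectOutput, CompletedWrongOutput, Crashed };

struct RunResult {
    bool completed;                  // did the program run to completion?
    bool output_matches;             // does its IO match a fault-free run?
    int signal;                      // trapping signal if it crashed, else 0
    std::uint64_t post_fault_insts;  // instructions executed after injection
};

// Map one fault-injection run onto the outcome categories above.
Outcome classify(const RunResult& r) {
    if (!r.completed) return Outcome::Crashed;
    return r.output_matches ? Outcome::CompletedCorrectOutput
                            : Outcome::CompletedWrongOutput;
}
```

For crashed runs, `signal` and `post_fault_insts` answer the last two questions on the slide: why the program crashed and how long it survived after the fault.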

Fault Insertion State Diagram

[Figure: state diagram divided into Pre-Fault and Post-Fault phases. States and decisions shown: START, Count By Basic Block, Reached Threshold?, Clear Code Cache, Count Every Instruction, Found Inst?, Insert Fault, Restart Using Context, Cleanup?, Cleanup Fault, Count Insts After Fault, Reached CheckPoint?, Print HB & Update Checkpoint Counter, Reached Max HB?, Detach From Pin & Run to Completion.]

Register Fault Pin Tool: RegFault.C

MAIN

    int main(int argc, char * argv[])
    {
        if (PIN_Init(argc, argv)) { return Usage(); }

        out_file.open(KnobOutputFile.Value().c_str());
        faultInst = KnobFaultInst.Value();

        // Instrumentation callbacks: coarse (trace) and fine (instruction) counting
        TRACE_AddInstrumentFunction(Trace, 0);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);

        // Trap the signals that indicate a crash after fault injection
        PIN_AddSignalInterceptFunction(SIGSEGV, SigFunc, 0);
        PIN_AddSignalInterceptFunction(SIGFPE, SigFunc, 0);
        PIN_AddSignalInterceptFunction(SIGILL, SigFunc, 0);
        PIN_AddSignalInterceptFunction(SIGSYS, SigFunc, 0);

        PIN_StartProgram();
        return 0;
    }

TRACE Instrumentation

    if (fineGrainCount == false) {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
            BBL_InsertIfCall(bbl, IPOINT_BEFORE, (AFUNPTR)FindFineGrainThreshold,
                             IARG_UINT32, BBL_NumIns(bbl), IARG_END);
            BBL_InsertThenCall(bbl, IPOINT_BEFORE, (AFUNPTR)SwitchToFineGrainCounting,
                               IARG_END);
        }
    }

TRACE Analysis

    UINT32 FindFineGrainThreshold(UINT32 i)
    {
        curDynInst += i;
        return ( curDynInst >= (faultInst - fineGrainTrigger) );
    }

    VOID SwitchToFineGrainCounting()
    {
        if (fineGrainCount == false) {
            fineGrainCount = true;
            PIN_RemoveInstrumentation();
        }
    }
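The two-phase counting scheme can be exercised without Pin. The sketch below mirrors the analysis routines as a plain class; `TwoPhaseCounter` and its method names are hypothetical. It assumes the trigger distance exceeds the largest basic block, so the coarse phase cannot step past the fault instruction. The threshold test is written as an addition rather than the subtraction used on the slide so it cannot wrap around when the trigger exceeds the fault instruction number.

```cpp
#include <cstdint>

class TwoPhaseCounter {
public:
    TwoPhaseCounter(std::uint64_t fault_inst, std::uint64_t fine_grain_trigger)
        : fault_(fault_inst), trigger_(fine_grain_trigger) {}

    // Coarse phase: add a whole basic block's instruction count.
    // Returns true once counting should switch to per-instruction mode.
    bool count_block(std::uint32_t num_ins) {
        cur_ += num_ins;
        return cur_ + trigger_ >= fault_;
    }

    // Fine phase: count one instruction.
    // Returns true when the fault instruction has been reached.
    bool count_instruction() {
        ++cur_;
        return cur_ >= fault_;
    }

    std::uint64_t current() const { return cur_; }

private:
    std::uint64_t cur_ = 0;   // dynamic instructions counted so far
    std::uint64_t fault_;     // dynamic instruction at which to inject
    std::uint64_t trigger_;   // distance before fault_ at which to go fine-grained
};
```

With a fault at instruction 100, a trigger of 20, and blocks of 10 instructions, the switch happens on the eighth block (count 80) and the fault instruction is found 20 single-instruction steps later.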

Instruction Instrumentation

    VOID Instruction(INS ins, VOID *v)
    {
        if (fineGrainCount == true) {
            if (faultDone == 0) {
                INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)FindFaultInst, IARG_END);
                INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)InsertFault,
                                   IARG_CONTEXT, IARG_END);
            }
            if (faultDone == 1) {
                ….

Instruction Analysis

    INT32 FindFaultInst()
    {
        curDynInst++;
        return ( curDynInst >= faultInst );
    }

Fault Insertion Analysis Routine

    VOID InsertFault(CONTEXT* _ctxt)
    {
        // Seed with the dynamic instruction count so the choice is repeatable
        srand(curDynInst);
        GetFaultyBit(_ctxt, &faultReg, &faultBit);

        UINT32 old_val;
        UINT32 new_val;

        // Flip the selected bit of the selected register
        old_val = PIN_GetContextReg(_ctxt, faultReg);
        faultMask = (1 << faultBit);
        new_val = old_val ^ faultMask;
        PIN_SetContextReg(_ctxt, faultReg, new_val);

        // Clear the code cache, then resume from the modified context
        PIN_RemoveInstrumentation();
        faultDone = 1;
        PIN_ExecuteAt(_ctxt);
    }
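The bit flip at the heart of this routine is a single XOR with a one-bit mask. The standalone sketch below uses the register values reported in the deck's sample output (FBIT 24 on bffeca90, FBIT 20 on 0); `fault_mask` and `flip_bit` are hypothetical helper names, and registers are assumed to be 32 bits wide as in the slide's code.

```cpp
#include <cassert>
#include <cstdint>

// Build the single-bit mask for a fault at bit position `fault_bit`.
std::uint32_t fault_mask(unsigned fault_bit) {
    assert(fault_bit < 32);
    return std::uint32_t{1} << fault_bit;
}

// Return `old_val` with exactly one bit inverted.
std::uint32_t flip_bit(std::uint32_t old_val, unsigned fault_bit) {
    return old_val ^ fault_mask(fault_bit);
}
```

Because XOR is its own inverse, applying the same mask a second time restores the original value, which is what the cleanup step relies on when a fault has to be removed again.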

Post Fault Instruction Instrumentation

    VOID Instruction(INS ins, VOID *v)
    {
        if (fineGrainCount == true) {
            if (faultDone == 0) {
                ….
            }
            if (faultDone == 1) {
                // Count every instruction executed after the fault,
                // on both the fall-through and the taken-branch path
                if (INS_HasFallThrough(ins)) {
                    INS_InsertCall(ins, IPOINT_AFTER, (AFUNPTR)PrintHeartbeat, IARG_END);
                }
                if (INS_IsBranchOrCall(ins)) {
                    INS_InsertCall(ins, IPOINT_TAKEN_BRANCH, (AFUNPTR)PrintHeartbeat, IARG_END);
                }
            }
        }
    }

Post Fault Analysis

    VOID PrintHeartbeat()
    {
        postFaultInsts++;
        if (postFaultInsts & dumpMask) {
            out_file << "H: " << dec << dumpMask << endl;
            out_file.flush();
            dumpMask = dumpMask << 1;
        }
        if (dumpMask > maxHB) {
            PIN_Detach();
        }
    }
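Since `dumpMask` doubles after every print, heartbeats land at instruction counts 1, 2, 4, 8, and so on, giving a logarithmic record of how long the program survived. The Pin-free sketch below replays that logic; the `Heartbeat` struct and its fields are hypothetical stand-ins for the tool's globals, and a `detach` flag takes the place of the `PIN_Detach()` call.

```cpp
#include <cstdint>
#include <vector>

struct Heartbeat {
    std::uint64_t post_fault_insts = 0;  // instructions since the fault
    std::uint64_t dump_mask = 1;         // next heartbeat threshold
    std::uint64_t max_hb;                // last heartbeat before detaching
    std::vector<std::uint64_t> printed;  // heartbeats emitted so far
    bool detach = false;                 // set where the tool would detach

    explicit Heartbeat(std::uint64_t max) : max_hb(max) {}

    // Called once per instruction executed after the fault.
    void tick() {
        ++post_fault_insts;
        if (post_fault_insts & dump_mask) {
            printed.push_back(dump_mask);
            dump_mask <<= 1;
        }
        if (dump_mask > max_hb) detach = true;
    }
};
```

A run that dies after 38 post-fault instructions emits heartbeats 1 through 32 and never reaches 64, which matches the crash case in the deck's sample output.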

OUTPUT

Fault Masked

    IP: 8192fcf COUNT: 937440391 REG: esi FBIT: 24 MASK: 1000000 OLD: bffeca90 NEW: befeca90
    H: 1
    H: 2
    H: 4
    H: 8
    ...
    H: 8388608

Program Failure

    IP: 80babc0 COUNT: 92958481 REG: ebp FBIT: 20 MASK: 100000 OLD: 0 NEW: 100000
    H: 1
    H: 2
    H: 4
    H: 8
    H: 16
    H: 32
    Signal: 11
    PostFaultInsts: 38

Sample Results

[Figure: not recoverable from the transcript.]

Step 5: Determining Outcome, Extreme Edition

• In the InjectFault step (STEP 3)
  - Fork a process and inject fault into one process (parent process)
  - Communicate information between processes (mkfifo)
• After fault injection, keep track of all writes to memory
• At each checkpoint, compare architectural state and stores
• What if there's a control deviation?
  - For every control operation, compare the next IP between processes
  - If the control flow deviates, then wait until both routines return from the function where the deviation occurred before checking state.
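The checkpoint comparison can be pictured as a diff between two snapshots, one from the faulty process and one from the fault-free process. The sketch below is an assumption-laden illustration: the `Checkpoint` layout (eight 32-bit registers plus a byte-granular store log), `StateDiff`, and `compare_checkpoints` are all hypothetical, and the fork and mkfifo plumbing is left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>

struct Checkpoint {
    std::array<std::uint32_t, 8> regs{};           // architectural registers
    std::map<std::uint64_t, std::uint8_t> stores;  // address -> byte written since the fault
};

struct StateDiff {
    std::size_t regs = 0;    // registers that differ
    std::size_t bytes = 0;   // stored bytes that differ or exist on one side only
    bool clean() const { return regs == 0 && bytes == 0; }
};

// Count how many registers and stored bytes differ between the two runs.
StateDiff compare_checkpoints(const Checkpoint& faulty, const Checkpoint& reference) {
    StateDiff d;
    for (std::size_t i = 0; i < faulty.regs.size(); ++i) {
        if (faulty.regs[i] != reference.regs[i]) ++d.regs;
    }
    for (const auto& kv : faulty.stores) {
        auto it = reference.stores.find(kv.first);
        if (it == reference.stores.end() || it->second != kv.second) ++d.bytes;
    }
    for (const auto& kv : reference.stores) {
        if (faulty.stores.find(kv.first) == faulty.stores.end()) ++d.bytes;
    }
    return d;
}
```

A clean diff at a checkpoint suggests the fault has been masked so far, while non-zero counts give a rough measure of how far it has spread.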

Step 5: Extreme Edition

• Adding this fork and compare feature takes time but it can be done.
• What does it buy?
  - Does the fault propagate? How far does it propagate?
  - How many registers and bytes of memory does it impact?
  - What happens when there is a control deviation? Is there a higher incidence of program failure or IO error in the presence of a control deviation?

Pin Based Fault Checker

[Figure: the fault insertion state diagram repeated, with a "No Change" annotation.]

Fault Checker: Fault Insertion

[Figure: flowchart with a Parent / Child / Both legend. Nodes shown: Fork Process & Setup Communication Links, Parent Process?, Insert Fault, Restart Using Context, Cleanup Required?, Cleanup Fault, Post Fault.]

Fault Checker: Post Fault

[Figure: flowchart with a Parent / Child / Both legend. Nodes shown: Get Next Inst & Count Insts, Store OP?, Old Data != New Data?, Save Data, Ctrl OP?, Parent IP != Child IP?, Ctrl Deviation, CheckPoint?, Checkpoint Comparison, Parent?, Read Info From Child & Compare State, Parent State == Child State, Done Or Cont?, Send Continue Signal to Child, Send Done Signal to Child & Detach, Communicate Reg & Store Data to Parent, Done?, Detach & Exit.]

Fault Checker: Ctrl Deviation

[Figure: flowchart with a Parent / Child / Both legend. Nodes shown: Call Counter = 0, Get Next Inst, Store OP?, Old Data != New Data?, Save Data, Function Call (Call Counter++), Function Return? (Call Counter--), Call Counter < 0?, Checkpoint Comparison.]
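The call counter in this flowchart detects when a process has returned from the function in which the control deviation occurred: nested calls raise the counter, returns lower it, and the first return that drives it below zero leaves the deviating function. A minimal sketch over an abstract instruction stream follows; the `Op` enum and `find_deviation_exit` are hypothetical names, and store tracking is omitted.

```cpp
#include <cstddef>
#include <vector>

enum class Op { Call, Return, Other };

// Scan the instructions executed after a control deviation and return the
// index of the return that exits the function where the deviation occurred.
// Returns ops.size() if that function never returns within the stream.
std::size_t find_deviation_exit(const std::vector<Op>& ops) {
    long call_counter = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (ops[i] == Op::Call) {
            ++call_counter;                  // entering a nested call
        } else if (ops[i] == Op::Return) {
            if (--call_counter < 0) return i;  // left the deviating function
        }
    }
    return ops.size();
}
```

Returns that merely unwind nested calls bring the counter back to zero and are skipped. Only the return that takes it negative triggers the checkpoint comparison.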

Fault Checker: Additional Info

• Cannot check faults beyond a system call
  - Kill child process and detach parent process from Pin
  - Run parent/faulty process to completion
• Although not shown in the flow chart, the Pin tool detaches after reaching a max number of checkpoints
• Providing tighter bounds on ctrl deviation:
  - May take a long time before returning from function call
  - On a control deviation
    - For both parent and child processes, save each store address and data
    - For the parent process, tag the store with the number of instructions executed since control deviation occurred.
    - After control merges and if architectural state is the same between the two processes, walk the list of stores from oldest to youngest and determine where the two processes matched.

Conclusion

• Fault insertion using Pin is a great way to determine the impact faults have within an application
  - Easy to use
  - Enables full program analysis
  - Accurately describes fault behavior once it has reached architectural state