FORMAL DIAGNOSIS OF HARDWARE TRANSIENT ERRORS IN PROGRAMS

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan
The Electrical and Computer Engineering Department, The University of British Columbia

Contributions
• Software-driven diagnosis of hardware transient errors
  – Diagnosis: “isolate the first affected instruction”
• Program-level analysis
  – Guarantees on the diagnosis
    • Completeness
    • Accuracy

Why Software-Driven Diagnosis?
• No expensive hardware modifications.
• Minimal software instrumentation.
• Diagnose faults which manifest at the program level only.
• Direct access to the affected device is not required.

Diagnosis Approach
[Figure: a transient error triggers a detector; the resulting dump file (e.g. failing detector, register file) is passed to error diagnosis, which outputs the faulty instruction.]

Diagnosis Approach
[Figure: a transient error triggers a detector; the resulting dump file (e.g. failing detector, register file) is passed to model checking, which outputs the faulty instruction.]

Model Checking Using SymPLFIED
• Formal model for analyzing programs [DSN'08]
  – Evaluate the effect of transient hardware errors on programs.
• Symbolic error propagation technique
  – Represent errors using a single symbol (err) to avoid state space explosion.
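The benefit of a single error symbol can be illustrated with a small Python sketch. It is not SymPLFIED code: the names `concrete_faults`, `symbolic_faults` and `mul`, the 32-bit register width, and the rule that any arithmetic on err yields err are all assumptions made for illustration. Enumerating single-bit flips gives 32 faulty values per register value, while the symbolic model tracks one.

```python
ERR = "err"  # the single symbol standing for any corrupted value


def concrete_faults(value, width=32):
    """All values reachable by flipping exactly one bit (assumed fault model)."""
    return {value ^ (1 << bit) for bit in range(width)}


def symbolic_faults(value):
    """Symbolic model: every possible corruption collapses into one symbol."""
    return {ERR}


def mul(a, b):
    """Assumed simplification: any arithmetic that touches err yields err."""
    return ERR if ERR in (a, b) else a * b


print(len(concrete_faults(5)), "concrete states vs", len(symbolic_faults(5)), "symbolic state")
print(mul(ERR, 3), mul(2, 3))
```

Under this assumed fault model, 13 is one of the single-bit corruptions of 5 (bit 3 flipped), which is the kind of fault used in the factorial example.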

Example: Factorial Program
 1        movi $2, #1              Result variable
 2        read $1                  User input
 3        mov $3, $1
 4        movi $4, #1
 5  loop: setgt $5, $3, $4         Loops while $3 > $4
 6        beq $5, #0, exit
 7        mult $2, $3
 8        subi $3, #1
 9        assert($3 < $1 + 1)      Error detector
10        beq $0, #0, loop
11  exit: prints "Factorial = "
12        print $2

Example: Error Propagation
 1        movi $2, #1
 2        read $1                  $1 = 5
 3        mov $3, $1               A transient fault, $3 = 13
 4        movi $4, #1
 5  loop: setgt $5, $3, $4
 6        beq $5, #0, exit
 7        mult $2, $3
 8        subi $3, #1
 9        assert($3 < $1 + 1)      Detector is triggered
10        beq $0, #0, loop
11  exit: prints "Factorial = "
12        print $2

Example: Error Propagation
 2        read $1                  $1 = 5
 3        mov $3, $1               A transient fault, $3 = 13
 9        assert($3 < $1 + 1)      Detector is triggered

Dump file:
  Detector triggered
  $1 = 5
  $2 = 13
  $3 = 12
  $4 = 1
  $5 = 1
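The dump file values can be reproduced by replaying the factorial program by hand. The sketch below is a plain Python rendering of the twelve-line program, not the authors' tooling. The function name `run_factorial` and the `fault=(line, register, value)` hook are hypothetical, and the fault is assumed to overwrite the register right after the given line first executes.

```python
def run_factorial(user_input, fault=None):
    """Replay the factorial program; fault = (line, register, value) or None."""
    regs = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0}
    pending = [fault]  # one-shot fault, consumed the first time its line runs

    def inject(line):
        if pending[0] is not None and pending[0][0] == line:
            _, reg, value = pending[0]
            regs[reg] = value
            pending[0] = None

    regs[2] = 1; inject(1)                       # 1  movi $2, #1
    regs[1] = user_input; inject(2)              # 2  read $1
    regs[3] = regs[1]; inject(3)                 # 3  mov $3, $1
    regs[4] = 1; inject(4)                       # 4  movi $4, #1
    while True:
        regs[5] = int(regs[3] > regs[4]); inject(5)   # 5  setgt $5, $3, $4
        if regs[5] == 0:                              # 6  beq $5, #0, exit
            return "exit", regs
        regs[2] *= regs[3]; inject(7)                 # 7  mult $2, $3
        regs[3] -= 1; inject(8)                       # 8  subi $3, #1
        if not regs[3] < regs[1] + 1:                 # 9  assert($3 < $1 + 1)
            return "detector triggered", regs
        # 10 beq $0, #0, loop: unconditional jump back to line 5


status, dump = run_factorial(5, fault=(3, 3, 13))
print(status, dump)
```

With input 5 and $3 forced to 13 after line 3, the first loop iteration sets $2 = 13 and $3 = 12, and the assertion fails because 12 is not less than 6. Without a fault the same call returns 120 in $2.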

Example: Error Diagnosis
 3        mov $3, $1               A transient fault, $3 = err
 6        beq $5, #0, exit         True: Exit; False: Line 7, $2 = err
 9        assert($3 < $1 + 1)      True: Line 10; False: Detector triggered
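The forking at the branch and at the assertion can be sketched as a small bounded search in Python. This is an illustration under stated assumptions, not SymPLFIED itself: `fork`, `arith` and `explore` are hypothetical names, err is injected into $3 right after line 3, arithmetic on err yields err, a comparison on err forks both ways, and loop unrolling is capped by `max_iters`.

```python
ERR = "err"  # single symbol for any corrupted value


def fork(a, b, cmp):
    """Possible results of a comparison; one that involves err forks both ways."""
    return [True, False] if ERR in (a, b) else [cmp(a, b)]


def arith(a, b, op):
    """Assumed simplification: arithmetic that touches err yields err."""
    return ERR if ERR in (a, b) else op(a, b)


def explore(user_input, max_iters=2):
    """Explore the factorial program with $3 = err injected after line 3."""
    start = {1: user_input, 2: 1, 3: ERR, 4: 1, 5: 0}
    outcomes, work = [], [(start, 0)]
    while work:
        regs, n = work.pop()
        if n == max_iters:                         # bound the loop unrolling
            continue
        for greater in fork(regs[3], regs[4], lambda a, b: a > b):   # line 5
            r = dict(regs)
            r[5] = int(greater)
            if r[5] == 0:                          # line 6: branch to exit
                outcomes.append(("exit", r))
                continue
            r[2] = arith(r[2], r[3], lambda a, b: a * b)             # line 7
            r[3] = arith(r[3], 1, lambda a, b: a - b)                # line 8
            limit = arith(r[1], 1, lambda a, b: a + b)
            for holds in fork(r[3], limit, lambda a, b: a < b):      # line 9
                if holds:
                    work.append((r, n + 1))        # line 10: back to loop
                else:
                    outcomes.append(("detector triggered", r))
    return outcomes


for label, regs in explore(5):
    print(label, regs)
```

Among the explored paths is one where the detector fires with $1 = 5, $2 = err, $3 = err and $4 = 1, which is the state the diagnosis later compares against the dump file.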

Example: Error Diagnosis
 3        mov $3, $1               A transient fault, $3 = err
 6        beq $5, #0, exit         True: Exit; False: Line 7, $2 = err
 9        assert($3 < $1 + 1)      True: Line 10; False: Detector triggered

Dump file:
  Detector triggered
  $1 = 5
  $2 = 13
  $3 = 12
  $4 = 1
  $5 = 1

SymPLFIED’s Solution:
  Instruction 3 Injected
  Detector triggered
  $1 = 5
  $2 = err
  $3 = err
  $4 = 1

Example: Error Diagnosis
The crash dump file can be used to identify the faulty instruction.
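One way to read the comparison between the dump file and a symbolic state is to treat err as a wildcard: a symbolic state is compatible with the dump if every register that is not err holds the dumped value. The Python sketch below shows only that matching step. The function `matches` is a hypothetical name, the $5 = 1 entry of the symbolic state is assumed from the path taken (the loop branch was not taken), and the alternative state for an injection at line 8 was derived by hand for contrast.

```python
ERR = "err"  # single symbol for any corrupted value


def matches(symbolic, dump):
    """True if every register is either err or equal to the dumped value."""
    return all(symbolic.get(reg) in (ERR, value) for reg, value in dump.items())


# Values taken from the dump file in the example.
dump_file = {"$1": 5, "$2": 13, "$3": 12, "$4": 1, "$5": 1}

# State at the detector when err is injected at instruction 3 ($5 is assumed).
fault_at_3 = {"$1": 5, "$2": ERR, "$3": ERR, "$4": 1, "$5": 1}

# Hand-derived contrast: err injected at line 8, detector fires in the same iteration.
fault_at_8 = {"$1": 5, "$2": 5, "$3": ERR, "$4": 1, "$5": 1}

print(matches(fault_at_3, dump_file))  # compatible with the dump
print(matches(fault_at_8, dump_file))  # $2 = 5 contradicts the dumped $2 = 13
```

Because a single symbol discards how corrupted values relate to one another, matching this coarse can accept more than one injection point on longer paths; how SymPLFIED narrows the candidates is not shown in the slides.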

Experimental Methodology
• Enhance SymPLFIED to diagnose errors.
• Modify SimpleScalar simulator to inject faults.
• Evaluate for Matrix Multiply and Insertion Sort.

[Flowchart: Instructions that trigger a detector → More inst? (N: Done; Y: Inject at a random bit in SimpleScalar) → Detector triggered? (N: back to More inst?; Y: Create a dump file → Error diagnosis)]
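The injection loop in the flowchart can be sketched in Python as follows. Nothing here is SimpleScalar code: `flip_random_bit`, `campaign`, the 32-bit width and the toy `toy_run` stand-in for a simulator run are all hypothetical, chosen only to show the shape of the loop (inject one random bit flip per instruction, keep a dump whenever a detector fires).

```python
import random


def flip_random_bit(value, width=32, rng=random):
    """Flip one randomly chosen bit of a width-bit value."""
    return value ^ (1 << rng.randrange(width))


def campaign(instructions, run_with_fault, rng=random):
    """Inject one fault per instruction; keep a dump for each detection."""
    dumps = []
    for inst in instructions:                          # "More inst?"
        corrupt = lambda v: flip_random_bit(v, rng=rng)
        triggered, dump = run_with_fault(inst, corrupt)
        if triggered:                                  # "Detector triggered?"
            dumps.append((inst, dump))                 # "Create a dump file"
    return dumps                                       # handed to error diagnosis


def toy_run(inst, corrupt):
    """Hypothetical stand-in for a simulator run: corrupt the value 5, check a bound."""
    value = corrupt(5)
    return value > 6, {"inst": inst, "value": value}


for inst, dump in campaign([3, 7, 8], toy_run, random.Random(0)):
    print("detected fault at instruction", inst, dump)
```

Passing a seeded `random.Random` keeps a campaign repeatable, which makes it possible to re-run the exact fault that produced a given dump.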

Results for Matrix Multiply (SS = SimpleScalar)
Number of detectors                 1     4     6
Number of faults injected in SS     167   275   286
Number of faults detected in SS     74    135   150
Diagnosed faults (%)                100   77    80
Undiagnosed faults (%)              0     23    20
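The table reports diagnosis rates relative to detected faults. The share of injected faults that were detected is not stated on the slide, but it follows from the two count rows; the Python sketch below derives it. The function names are hypothetical, and the diagnosed counts are only approximate because the slide gives rounded percentages.

```python
# Values copied from the matrix multiply table.
detectors = [1, 4, 6]
injected = [167, 275, 286]
detected = [74, 135, 150]
diagnosed_pct = [100, 77, 80]


def detection_rate(detected, injected):
    """Percentage of injected faults that triggered a detector."""
    return [round(100 * d / i, 1) for d, i in zip(detected, injected)]


def diagnosed_count(detected, diagnosed_pct):
    """Approximate number of diagnosed faults, from the rounded percentages."""
    return [round(d * p / 100) for d, p in zip(detected, diagnosed_pct)]


for n, rate, count in zip(detectors,
                          detection_rate(detected, injected),
                          diagnosed_count(detected, diagnosed_pct)):
    print(f"{n} detector(s): {rate}% of injected faults detected, about {count} diagnosed")
```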

Results for Matrix Multiply (1)
• The proposed technique diagnoses 77% to 100% of the detected errors for the matrix multiply program.
• The undiagnosed errors are implementation artifacts of the SymPLFIED tool.

Results for Matrix Multiply (2)
• The number of faults injected in SimpleScalar increases with the number of detectors (167, 275 and 286 faults for 1, 4 and 6 detectors).
• Adding more detectors increases the diagnosis accuracy (77% with 4 detectors, 80% with 6).

Conclusions and Future Work
• Software diagnosis of hardware faults is possible and can be automated using formal techniques.
  – Our diagnosis method is able to diagnose a significant number of errors using a few detectors.
• Future Work
  – Investigate improvements with limited hardware support.
  – Improve scalability using heuristics.
  – Extend to intermittent and permanent faults.

Backup Slides

Related Work
Hardware Fault Diagnosis
• Hardware-Based Techniques
• Probabilistic Techniques
• Formal Methods
• Periodic-Testing Techniques

Results for Insertion Sort (SS = SimpleScalar)
Number of detectors                 1     4     7
Number of faults injected in SS     11    165   198
Number of faults detected in SS     8     64    83
Diagnosed faults (%)                100   87    89
Undiagnosed faults (%)              0     13    11