FORMAL DIAGNOSIS OF HARDWARE TRANSIENT ERRORS IN PROGRAMS





Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan
The Electrical and Computer Engineering Department, The University of British Columbia
Contributions
• Software-driven diagnosis of hardware transient errors
  – Diagnosis: “isolate the first affected instruction”
• Program-level analysis
  – Guarantees on the diagnosis
    • Completeness
    • Accuracy
Why Software-Driven Diagnosis?
• No expensive hardware modifications.
• Minimal software instrumentation.
• Diagnoses faults that manifest only at the program level.
• Direct access to the affected device is not required.
Diagnosis Approach
Transient Error → Detector Triggered → Dump File (e.g. failing detector, register file) → Error Diagnosis → Faulty instruction
Diagnosis Approach
Transient Error → Detector Triggered → Dump File (e.g. failing detector, register file) → Model Checking → Faulty instruction
Model Checking Using SymPLFIED
• Formal model for analyzing programs [DSN ’08]
  – Evaluates the effect of transient hardware errors on programs.
• Symbolic error propagation technique
  – Represents errors using a single symbol (err) to avoid state space explosion.
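The idea of a single error symbol can be illustrated with a minimal Python sketch. This is not SymPLFIED's implementation: the names `ERR`, `sym_mul` and `sym_gt` are hypothetical, and the propagation rules are deliberately simplified (any corrupted operand corrupts an arithmetic result, and a comparison on a corrupted operand may go either way).

```python
class Err:
    """Hypothetical marker type: one symbol standing for any corrupted value."""
    def __repr__(self):
        return "err"

ERR = Err()

def sym_mul(a, b):
    """Multiply two values; a corrupted operand yields a corrupted result."""
    if a is ERR or b is ERR:
        return ERR
    return a * b

def sym_gt(a, b):
    """Return the possible outcomes of a > b.

    With a corrupted operand both outcomes are possible, so a checker
    would have to explore both branches.
    """
    if a is ERR or b is ERR:
        return [True, False]
    return [a > b]

print(sym_mul(1, ERR))   # corrupted operand, corrupted product
print(sym_gt(ERR, 1))    # both branch outcomes remain possible
print(sym_gt(5, 1))      # concrete operands give a single outcome
```

Because every corrupted value maps to the same symbol, the checker does not enumerate each possible wrong value separately, which is the state-space saving the slide refers to.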
Example: Factorial Program
     1        movi  $2, #1             ; result variable
     2        read  $1                 ; user input
     3        mov   $3, $1
     4        movi  $4, #1
     5  loop: setgt $5, $3, $4         ; loops while $3 > $4
     6        beq   $5, #0, exit
     7        mult  $2, $3
     8        subi  $3, #1
     9        assert($3 < $1 + 1)      ; error detector
    10        beq   $0, #0, loop
    11  exit: prints "Factorial = "
    12        print $2
Example: Error Propagation
• Line 2: user input, $1 = 5
• Line 3: a transient fault, $3 = 13
• Line 9: detector is triggered
Example: Error Propagation
• Line 2: user input, $1 = 5
• Line 3: a transient fault, $3 = 13
• Line 9: detector is triggered
• Dump file: detector triggered, $1 = 5, $2 = 13, $3 = 12, $4 = 1, $5 = 1
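The propagation in this example can be reproduced with a small Python model of the 12-line factorial program. This is only a sketch: the function `run_factorial` and the `fault` triple (instruction number, register, corrupted value) are hypothetical names introduced here, and the model applies the fault once, right after the named instruction first executes.

```python
def run_factorial(user_input, fault=None):
    """Run instructions 1-10 of the factorial program.

    fault is an optional (instruction, register, value) triple (hypothetical
    format): the register is overwritten once, after that instruction runs.
    Returns (status, registers), where registers plays the role of the dump.
    """
    r = {i: 0 for i in range(6)}          # registers $0..$5
    pc = 1
    while pc <= 10:
        executed, next_pc = pc, pc + 1
        if pc == 1:
            r[2] = 1                      # movi $2, #1
        elif pc == 2:
            r[1] = user_input             # read $1
        elif pc == 3:
            r[3] = r[1]                   # mov $3, $1
        elif pc == 4:
            r[4] = 1                      # movi $4, #1
        elif pc == 5:
            r[5] = int(r[3] > r[4])       # setgt $5, $3, $4
        elif pc == 6:
            if r[5] == 0:                 # beq $5, #0, exit
                next_pc = 11
        elif pc == 7:
            r[2] *= r[3]                  # mult $2, $3
        elif pc == 8:
            r[3] -= 1                     # subi $3, #1
        elif pc == 9:
            if not r[3] < r[1] + 1:       # assert($3 < $1 + 1)
                return "detector triggered", r
        elif pc == 10:
            next_pc = 5                   # beq $0, #0, loop
        if fault is not None and fault[0] == executed:
            r[fault[1]] = fault[2]        # the transient fault strikes once
            fault = None
        pc = next_pc
    return "exit", r

status, dump = run_factorial(5, fault=(3, 3, 13))
print(status, dump)
```

With input 5 and `$3` corrupted to 13 after instruction 3, the model stops at the detector with the same register values as the dump file on the slide; without a fault it exits with `$2 = 120`.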
Example: Error Diagnosis
• Line 3: a transient fault, $3 = err
• Line 6 (branch on $5): True → Exit; False → Line 7, $2 = err
• Line 9 (assert): True → Line 10; False → Detector triggered
Example: Error Diagnosis
• Line 3: a transient fault, $3 = err
• Line 6 (branch on $5): True → Exit; False → Line 7, $2 = err
• Line 9 (assert): True → Line 10; False → Detector triggered
• SymPLFIED’s solution: instruction 3 injected, detector triggered; $1 = 5, $2 = err, $3 = err, $4 = 1
• Dump file: detector triggered; $1 = 5, $2 = 13, $3 = 12, $4 = 1, $5 = 1
Example: Error Diagnosis
• The crash dump file can be used to identify the faulty instruction.
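The branch exploration and the comparison against the dump file can be sketched in Python for the single candidate shown in the example (err injected into `$3` at instruction 3). This is an illustration under simplifying assumptions, not SymPLFIED itself: `explore`, `consistent` and `max_iters` are hypothetical names, the sketch does not search over other injection points, and it does not track any constraints on `err`.

```python
ERR = "err"  # single symbol standing for any corrupted value

def explore(regs, max_iters):
    """Explore the loop (instructions 5-10) from its head.

    Returns a list of (outcome, registers) pairs; max_iters bounds the
    number of loop iterations so the exploration terminates.
    """
    if max_iters == 0:
        return [("bound reached", regs)]
    outcomes = []
    # 5: setgt $5, $3, $4 (a corrupted operand makes both results possible)
    results = [1, 0] if ERR in (regs[3], regs[4]) else [int(regs[3] > regs[4])]
    for gt in results:
        s = {**regs, 5: gt}
        if gt == 0:                                   # 6: beq $5, #0, exit
            outcomes.append(("exit", s))
            continue
        s[2] = ERR if ERR in (s[2], s[3]) else s[2] * s[3]   # 7: mult $2, $3
        s[3] = ERR if s[3] == ERR else s[3] - 1              # 8: subi $3, #1
        # 9: assert($3 < $1 + 1), forked when an operand is corrupted
        checks = [True, False] if ERR in (s[3], s[1]) else [s[3] < s[1] + 1]
        for passed in checks:
            if passed:
                outcomes.extend(explore(s, max_iters - 1))   # 10: back to loop
            else:
                outcomes.append(("detector triggered", s))
    return outcomes

def consistent(state, dump):
    """A symbolic state matches the dump if every non-err register agrees."""
    return all(v == ERR or v == dump[reg] for reg, v in state.items())

# State after instructions 1-4 with $1 = 5 and err injected into $3.
# The initial value of $5 is an assumption; instruction 5 overwrites it.
start = {1: 5, 2: 1, 3: ERR, 4: 1, 5: 0}
dump = {1: 5, 2: 13, 3: 12, 4: 1, 5: 1}

matches = [s for outcome, s in explore(start, max_iters=1)
           if outcome == "detector triggered" and consistent(s, dump)]
print(matches)
```

The one matching state has `$1 = 5`, `$2 = err`, `$3 = err` and `$4 = 1`, which agrees with the solution listed in the example. A full diagnosis would repeat this search over every candidate injection point.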
Experimental Methodology
• Enhance SymPLFIED to diagnose errors.
• Modify the SimpleScalar simulator to inject faults.
• Evaluate for Matrix Multiply and Insertion Sort.
• Fault-injection loop over the instructions that trigger a detector:
  – More instructions? If no, done.
  – If yes, inject at a random bit in SimpleScalar.
  – Detector triggered? If no, move on to the next instruction.
  – If yes, create a dump file and run error diagnosis.
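The "inject at a random bit" step can be sketched as a single-bit flip of a register value. This is a minimal Python illustration, not the SimpleScalar modification used in the experiments; the function name `flip_random_bit` and the 32-bit register width are assumptions.

```python
import random

def flip_random_bit(value, width=32, rng=random):
    """Flip one uniformly chosen bit of a width-bit value.

    Returns the corrupted value and the index of the flipped bit.
    The 32-bit default width is an assumption for illustration.
    """
    bit = rng.randrange(width)
    return value ^ (1 << bit), bit

rng = random.Random(0)               # seeded so a run can be repeated
corrupted, bit = flip_random_bit(5, rng=rng)
print(f"flipped bit {bit}: 5 -> {corrupted}")

# Flipping bit 3 of 5 gives 13, the corrupted value in the factorial example.
print(5 ^ (1 << 3))
```

Passing a seeded `random.Random` instance keeps an injection campaign reproducible, which helps when a particular fault needs to be replayed for diagnosis.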
Results for Matrix Multiply

| Number of detectors | 1 | 4 | 6 |
| --- | --- | --- | --- |
| Number of faults injected in SimpleScalar (SS) | 167 | 275 | 286 |
| Number of faults detected in SS | 74 | 135 | 150 |
| Diagnosed faults (%) | 100 | 77 | 80 |
| Undiagnosed faults (%) | 0 | 23 | 20 |
Results for Matrix Multiply (1)
• The proposed technique diagnoses 77% to 100% of the detected errors for the matrix multiply program.
• The undiagnosed errors are implementation artifacts of the SymPLFIED tool.
Results for Matrix Multiply (2)
• The number of faults injected in SimpleScalar grows with the number of detectors (167, 275 and 286 faults for 1, 4 and 6 detectors).
• Adding more detectors increases the diagnosis accuracy from 77% with 4 detectors to 80% with 6; the single-detector run diagnosed all of its detected faults.
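The injected and detected counts in the matrix multiply results also give the share of injected faults that triggered a detector. The short Python calculation below derives that figure from the reported counts; the detection rate is not reported in the slides, and `detection_rate` is a hypothetical helper name.

```python
# Matrix multiply results: detectors -> (faults injected, faults detected)
matrix_multiply = {1: (167, 74), 4: (275, 135), 6: (286, 150)}

def detection_rate(injected, detected):
    """Percentage of injected faults that triggered a detector."""
    return round(100 * detected / injected, 1)

rates = {n: detection_rate(*counts) for n, counts in matrix_multiply.items()}
print(rates)
```

The derived rates are about 44.3%, 49.1% and 52.4% for 1, 4 and 6 detectors.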
Conclusions and Future Work
• Software diagnosis of hardware faults is possible and can be automated using formal techniques.
  – Our method diagnoses a significant number of errors using only a few detectors.
• Future Work
  – Investigate improvements with limited hardware support.
  – Improve scalability using heuristics.
  – Extend to intermittent and permanent faults.
Backup Slides
Related Work
• Hardware Fault Diagnosis
  – Hardware-Based Techniques
  – Probabilistic Techniques
  – Formal Methods
  – Periodic-Testing Techniques
Results for Insertion Sort

| Number of detectors | 1 | 4 | 7 |
| --- | --- | --- | --- |
| Number of faults injected in SimpleScalar (SS) | 11 | 165 | 198 |
| Number of faults detected in SS | 8 | 64 | 83 |
| Diagnosed faults (%) | 100 | 87 | 89 |
| Undiagnosed faults (%) | 0 | 13 | 11 |