Error Detection in Hardware, VO Hardware-Software-Codesign, Philipp Jahn

Error Detection in Hardware
VO Hardware-Software-Codesign
Philipp Jahn
6.6.2007

Error detection
§ How to detect errors with hardware methods during system operation
§ Conditions
  § Coverage (probability that an error is detected)
  § Latency (time between the start of an error and its detection)
  § Performance
Slide from VO „Echtzeitsysteme“, H. Kopetz

Hardware-based error detection
§ Hardware redundancy
  § Passive (TMR, majority voting)
  § Active (duplication and comparison, standby)
  § Hybrid
§ Information redundancy
  § Parity
  § Checksums
  § Arithmetic codes
§ Time redundancy
§ Watchdog timers
§ Checking
  § Capability checking
  § Consistency checking
  § Control-flow checking

Information redundancy (1)
§ Detection / Correction
§ Hamming distance
  § X = (1001), Y = (0111)
  § d(X, Y) = 3
§ SEC-DED
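
The distance on this slide can be checked with a few lines of Python. This is only a minimal sketch; the function name `hamming_distance` is an illustrative choice, not something the slides define. As background, a code with minimum distance d can detect up to d-1 bit errors and correct up to floor((d-1)/2) of them, which is why SEC-DED (single-error correction, double-error detection) needs a minimum distance of 4.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the bit positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))


# The example from the slide: X = 1001, Y = 0111
print(hamming_distance("1001", "0111"))  # 3
```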

Information redundancy (2)
§ Parity
  § One extra bit (even / odd)
  § Decoding circuit (set of XOR gates)
  § Routine checking in buses, memory and registers
  § Detects single-bit errors (no stuck-at faults)
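
The XOR tree that a parity checker implements in hardware can be mimicked in software. The sketch below assumes even parity by default; the helper names (`parity_bit`, `encode`, `parity_ok`) are hypothetical and chosen only for illustration. It also shows the known limit of a single parity bit: an even number of flipped bits goes unnoticed.

```python
from functools import reduce
from operator import xor


def parity_bit(bits, odd=False):
    """Even parity by default: the XOR of all data bits."""
    p = reduce(xor, bits, 0)
    return p ^ 1 if odd else p


def encode(bits):
    """Append an even-parity bit to the data bits."""
    return list(bits) + [parity_bit(bits)]


def parity_ok(word):
    """An even-parity word is valid when all of its bits XOR to 0."""
    return reduce(xor, word, 0) == 0


word = encode([1, 0, 0, 1])   # two ones, so the parity bit is 0
single = word.copy()
single[2] ^= 1                # one flipped bit: detected
double = single.copy()
double[0] ^= 1                # two flipped bits: not detected
print(parity_ok(word), parity_ok(single), parity_ok(double))  # True False True
```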

Information redundancy (3)
§ Overlapping parity
§ m-of-n codes
§ Duplication codes
§ Cyclic redundancy checks
  § Sender and receiver agree upon a generator polynomial G(x)
  § Append the checksum (n-k bits) to the end of the data frame (k bits)
  § A received frame whose remainder modulo G(x) is 0 is taken as correct
  § Simple implementation (linear feedback shift register and XOR gates)
  § Detects single-bit errors, multiple adjacent bit errors affecting fewer than n-k bits, and burst transient errors
  § Highly successful in serial transmission (communication channels: Ethernet, Token Ring)
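
The CRC procedure can be sketched as a bitwise polynomial division over GF(2), which is what the shift register with XOR gates computes. The generator G(x) = x^3 + x + 1 and the data bits below are assumptions picked for a small example, not values from the slides, and the function names are hypothetical. The checksum has as many bits as the degree of G(x).

```python
def mod2_remainder(bits, generator):
    """Remainder of bits(x) / generator(x) over GF(2); bit lists, MSB first."""
    work = list(bits)
    for i in range(len(work) - len(generator) + 1):
        if work[i]:  # leading bit set: subtract (XOR) the generator here
            for j, g in enumerate(generator):
                work[i + j] ^= g
    return work[len(work) - len(generator) + 1:]


def crc_encode(data, generator):
    """Append the checksum so the whole frame is divisible by G(x)."""
    degree = len(generator) - 1
    return list(data) + mod2_remainder(list(data) + [0] * degree, generator)


def crc_ok(frame, generator):
    """The receiver accepts a frame whose remainder is all zeros."""
    return not any(mod2_remainder(frame, generator))


G = [1, 0, 1, 1]  # assumed generator: x^3 + x + 1
data = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]
frame = crc_encode(data, G)
print(frame[-3:], crc_ok(frame, G))  # [1, 0, 0] True

corrupted = frame.copy()
corrupted[5] ^= 1  # single-bit transmission error
print(crc_ok(corrupted, G))  # False
```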

Information redundancy (4)
§ Checksums

Information redundancy (5)
§ Arithmetic codes
  § Detect errors in arithmetic units (parity would not be preserved)
  § Separate or nonseparate
  § Examples
    § AN codes
    § Residue codes
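
Both example codes can be illustrated with integer arithmetic. The following is a sketch under assumptions: the check constant A = 3 and the modulus 3 are chosen for illustration, and all function names are hypothetical. An AN code is nonseparate (the value x is stored as A*x), while a residue code is separate (the check part x mod m travels next to x). Both survive addition, which a parity bit does not.

```python
A = 3  # assumed AN-code constant
M = 3  # assumed residue-code modulus


def an_encode(x):
    """Nonseparate code: the stored word is A * x."""
    return A * x


def an_valid(word):
    """A valid AN codeword is a multiple of A."""
    return word % A == 0


def residue_check(result, rx, ry):
    """Separate code: residue of the sum must equal the sum of the residues."""
    return result % M == (rx + ry) % M


# AN code: A*x + A*y = A*(x + y), so the sum is again a codeword
total = an_encode(5) + an_encode(7)
print(an_valid(total), total // A)   # True 12
print(an_valid(total ^ 0b1))         # False: a flipped bit in the adder output

# Residue code: the check symbols are added separately, modulo M
x, y = 5, 7
print(residue_check(x + y, x % M, y % M))        # True
print(residue_check((x + y) ^ 0b100, x % M, y % M))  # False
```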

Time redundancy (1)
§ Repeat a computation two or more times and compare the results (detection, or correction by majority)
§ If an error is detected, the computation may be retried
§ Good for detecting transient faults
§ Does not protect against errors resulting from permanent faults
§ No extra hardware needed, but longer processing time
§ Suited to non-time-critical applications
§ Alternating logic also detects permanent faults (self-checking circuits with f(x) = f'(x'), where ' denotes the complement)
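
The alternating-logic idea can be sketched with a full adder, whose sum and carry outputs are both self-dual: complementing all inputs complements both outputs. The fault model (a carry output stuck at a fixed value) and the function names are assumptions made for this illustration. A stuck output yields the same value in both passes, so the two results are no longer complementary and the permanent fault shows up.

```python
def full_adder(a, b, cin, stuck_carry=None):
    """One-bit full adder; stuck_carry simulates a permanent output fault."""
    s = a ^ b ^ cin
    carry = (a & b) | (a & cin) | (b & cin)
    if stuck_carry is not None:
        carry = stuck_carry
    return s, carry


def alternating_check(a, b, cin, stuck_carry=None):
    """Run once with true inputs and once with complemented inputs."""
    first = full_adder(a, b, cin, stuck_carry)
    second = full_adder(1 - a, 1 - b, 1 - cin, stuck_carry)
    ok = all(x == 1 - y for x, y in zip(first, second))
    return first, ok


print(alternating_check(1, 0, 1))                 # ((0, 1), True)
print(alternating_check(1, 0, 1, stuck_carry=0))  # ((0, 0), False)
```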

Time redundancy (2)
§ Handle permanent faults by encoding the operands of the second computation (the encoding must not alter the result), e.g. k-shift
§ Errors in k-1 consecutive bits of an arithmetic or logical operation are detected
§ Additional hardware (two shifters, storage register, comparator)
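
Recomputing with shifted operands can be simulated as follows. This is a sketch under stated assumptions: a 16-bit adder whose result bit is permanently stuck at 0, operands small enough that the left shift does not overflow, and hypothetical function names. Plain repetition hits the same faulty bit twice and sees no difference; after the shift, the fault lands on a different bit of the result, so the comparison fails.

```python
WIDTH = 16                  # assumed word width
MASK = (1 << WIDTH) - 1


def faulty_add(a, b, stuck_at_zero_bit=None):
    """Adder model; optionally one result bit is permanently stuck at 0."""
    result = (a + b) & MASK
    if stuck_at_zero_bit is not None:
        result &= ~(1 << stuck_at_zero_bit)
    return result


def reso_add(a, b, k=1, stuck_at_zero_bit=None):
    """Add twice: plain, then with operands shifted left by k and shifted back."""
    plain = faulty_add(a, b, stuck_at_zero_bit)
    shifted = faulty_add((a << k) & MASK, (b << k) & MASK, stuck_at_zero_bit) >> k
    return plain, plain == shifted


print(reso_add(5, 6))                       # (11, True)
print(reso_add(5, 6, stuck_at_zero_bit=0))  # (10, False): fault detected

# Simple repetition does not reveal the permanent fault
print(faulty_add(5, 6, 0) == faulty_add(5, 6, 0))  # True
```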

Watchdog timers
§ Implemented in hardware (external timer) or software (process)
§ If the timer expires, the system is reset or recovers
§ Detect only a very specific type of error: control-flow errors
  § If an error occurs but the timer is still reset, it is not detected
  § Difficult to determine the runtime
  § High detection latency
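
A software watchdog can be modeled with a down-counter that the monitored program must reload before it reaches zero. The class below is a minimal, tick-driven sketch; the class name, the timeout of 3 ticks and the simulated hang at step 4 are all assumptions for illustration. The gap between the hang and the expiry shows the detection latency mentioned on the slide.

```python
class WatchdogTimer:
    """Down-counter that fires a callback unless it is reloaded in time."""

    def __init__(self, timeout_ticks, on_expire):
        self.timeout_ticks = timeout_ticks
        self.on_expire = on_expire
        self.remaining = timeout_ticks

    def kick(self):
        """Called by the healthy program to reload the counter."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Called once per time unit, e.g. from a timer interrupt."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.on_expire()
            self.remaining = self.timeout_ticks


events = []
watchdog = WatchdogTimer(3, lambda: events.append("reset"))
for step in range(6):
    if step < 4:       # the program hangs from step 4 on and stops kicking
        watchdog.kick()
    watchdog.tick()
    events.append(step)

print(events)  # [0, 1, 2, 3, 4, 'reset', 5]
```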

Capability & Consistency Checking
§ Capability checking limits access to objects (e.g. memory segments) to authorized users (processes)
  § Implemented in hardware (error traps) or software (firewall)
  § e.g. checking of address validity by the MMU
§ Consistency checking determines whether states or results are reasonable
  § e.g. range checking, address checking, opcode checking
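
Range and opcode checks of this kind reduce to simple membership tests. The sketch below uses a made-up opcode set and a made-up address segment; neither comes from the slides or from a real instruction set.

```python
VALID_OPCODES = {0x01, 0x02, 0x03}      # hypothetical instruction set
ADDRESS_RANGE = range(0x1000, 0x2000)   # hypothetical memory segment


def check_instruction(opcode, address):
    """Return the list of consistency violations for one instruction."""
    errors = []
    if opcode not in VALID_OPCODES:
        errors.append("illegal opcode")
    if address not in ADDRESS_RANGE:
        errors.append("address out of range")
    return errors


print(check_instruction(0x02, 0x1800))  # []
print(check_instruction(0x7F, 0x2400))  # ['illegal opcode', 'address out of range']
```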

Control-Flow Checking (1)
§ Hardware scheme
  § Divide the application program into blocks
  § Each block has a single entry and exit point
  § A reference signature represents an encoding of the correct execution
  § A watchdog processor validates the application program by comparing the run-time signature with the reference signature
§ 70% of transient faults lead to control-flow errors
§ Limitations
  § Only suitable for processors running a single program (not multiple processes or threads)
  § Reduced coverage if transmission errors occur on the bus to the watchdog processor
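
The comparison done by the watchdog processor can be sketched in software. The example assumes that a block's signature is the CRC-32 of its instruction bytes (computed with Python's standard `zlib.crc32`); the block names and byte values are invented for illustration, and a real scheme may encode the execution differently.

```python
import zlib


def signature(block: bytes) -> int:
    """Assumed signature: CRC-32 over the instruction bytes of a block."""
    return zlib.crc32(block)


# Hypothetical program: two single-entry, single-exit blocks
program = {"B0": b"\x01\x02\x03", "B1": b"\x04\x05"}

# Reference signatures, computed ahead of time
reference = {name: signature(code) for name, code in program.items()}


def watchdog_check(name: str, executed: bytes) -> bool:
    """Compare the run-time signature of a block with its reference."""
    return signature(executed) == reference[name]


print(watchdog_check("B0", b"\x01\x02\x03"))  # True: block executed as expected
print(watchdog_check("B0", b"\x01\x12\x03"))  # False: corrupted instruction
```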

Control-Flow Checking (2)
§ Signatured Instruction Stream (SIS)
  § Hardware: watchdog processor with cyclic code signature generator
  § Software: modified assembler and loader
§ Control-flow checking using shadow processing

Summary
§ Hardware-based detection has low error latency
§ Hardware is more expensive
  § e.g. massively parallel multiprocessors
§ Combine error detection mechanisms

References
§ Ravishankar K. Iyer, Zbigniew Kalbarczyk - Hardware and Software Error Detection - Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign
§ Hermann Kopetz - Real-Time Systems, Design Principles for Distributed Embedded Applications - 1997, 356 p., Hardcover, ISBN: 978-0-79239894-3
§ Alireza Vahdatpour, Mahdi Fazeli, Seyed Ghassem Miremadi - Transient Error Detection in Embedded Systems Using Reconfigurable Components - IES, October 2006
§ M. Dal Cin, W. Hohl, E. Michel, A. Pataricza - Error Detection Mechanisms for Massively Parallel Multiprocessors - IEEE Proceedings, 1993
§ Evaluation of error detection coverage and fault-tolerance of digital plant protection system in nuclear power plants
§ http://robotics.ee.uwa.edu.au/courses/faulttolerant/notes/FT2b.pdf
§ A. Steininger, C. Scherrer - Identifying Efficient Combinations of Error Detection Mechanisms Based on Results of Fault Injection Experiments - IEEE Transactions on Computers, Vol. 51, No. 2, February 2002