Chapter 3: Fault-Tolerant Design
EE 141 System-on-Chip Test Architectures

What is this chapter about?
- Gives Overview of Fault-Tolerant Design
- Focus on
  - Basic Concepts in Fault-Tolerant Design
  - Metrics Used to Specify and Evaluate Dependability
  - Review of Coding Theory
  - Fault-Tolerant Design Schemes
    - Hardware Redundancy
    - Information Redundancy
    - Time Redundancy
  - Examples of Fault-Tolerant Applications in Industry

Fault-Tolerant Design
- Introduction
- Fundamentals of Fault Tolerance
- Fundamentals of Coding Theory
- Fault Tolerant Schemes
- Industry Practices
- Concluding Remarks

Introduction
- Fault Tolerance
  - Ability of system to continue error-free operation in presence of unexpected fault
- Important in mission-critical applications
  - E.g., medical, aviation, banking, etc.
  - Errors very costly
- Becoming important in mainstream applications
  - Technology scaling causing circuit behavior to become less predictable and more prone to failures
  - Needing fault tolerance to keep failure rate within acceptable levels

Faults
- Permanent Faults
  - Due to manufacturing defects, early life failures, wearout failures
  - Wearout failures due to various mechanisms
    - e.g., electromigration, hot carrier degradation, dielectric breakdown, etc.
- Temporary Faults
  - Only present for short period of time
  - Caused by external disturbance or marginal design parameters

Temporary Faults
- Transient Errors (Non-recurring errors)
  - Caused by external disturbance
    - e.g., radiation, noise, power disturbance, etc.
- Intermittent Errors (Recurring errors)
  - Caused by marginal design parameters
  - Timing problems
    - e.g., races, hazards, skew
  - Signal integrity problems
    - e.g., crosstalk, ground bounce, etc.

Redundancy
- Fault Tolerance requires some form of redundancy
  - Time Redundancy
  - Hardware Redundancy
  - Information Redundancy

Time Redundancy
- Perform same operation twice
  - See if get same result both times
  - If not, then fault occurred
  - Can detect temporary faults
  - Cannot detect permanent faults
    - Would affect both computations
- Advantage
  - Little to no hardware overhead
- Disadvantage
  - Impacts system or circuit performance

Hardware Redundancy
- Replicate hardware and compare outputs
  - From two or more modules
  - Detects both permanent and temporary faults
- Advantage
  - Little or no performance impact
- Disadvantage
  - Area and power for redundant hardware

Information Redundancy
- Encode outputs with error detecting or correcting code
  - Code selected to minimize redundancy for class of faults
- Advantage
  - Less hardware to generate redundant information than replicating module
- Drawback
  - Added complexity in design

Failure Rate
- λ(t) = Component failure rate
  - Measured in FITS (failures per 10^9 hours)

System Failure Rate
- System constructed from components
- No Fault Tolerance
  - Any component fails, whole system fails
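The slide's equation did not survive extraction. Under the usual assumption that component failures are independent, a system that fails whenever any component fails has a failure rate equal to the sum of the component rates. The index i and the component count n are notation assumed here:

```latex
\lambda_{\text{sys}} = \sum_{i=1}^{n} \lambda_i
```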

Reliability
- If component working at time 0
  - R(t) = Probability still working at time t
- Exponential Failure Law
  - If failure rate assumed constant
    - Good approximation if past infant mortality period
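With a constant failure rate λ, the exponential failure law is commonly written R(t) = e^(-λt). A minimal Python sketch, assuming the rate is supplied in FITs and time in hours (the function name and units handling are illustrative, not from the slides):

```python
import math

def reliability(failure_rate_fits: float, hours: float) -> float:
    """R(t) = exp(-lambda * t), with lambda given in FITs (failures per 1e9 hours)."""
    lam_per_hour = failure_rate_fits / 1e9
    return math.exp(-lam_per_hour * hours)

# Example: a hypothetical 1000-FIT component over ten years of operation.
ten_years = 10 * 365 * 24
r_ten_years = reliability(1000, ten_years)
```

Because the exponent is the product λt, halving the failure rate buys the same reliability over twice the mission time.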

Reliability for Series System
- Series System
  - All components need to work for system to work

System Reliability with Redundancy
- System reliability with component B in Parallel
  - Can tolerate one component B failing
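The series and parallel cases can be sketched as follows, assuming independent component failures. The system below (component A in series with two parallel copies of B) and its reliability values are hypothetical, chosen only to illustrate the calculation:

```python
from math import prod

def series_reliability(*component_r: float) -> float:
    """All components must work: multiply the individual reliabilities."""
    return prod(component_r)

def parallel_reliability(*copy_r: float) -> float:
    """At least one copy must work: 1 minus the probability that all copies fail."""
    return 1.0 - prod(1.0 - r for r in copy_r)

# Hypothetical system: A in series with two parallel copies of B.
r_a, r_b = 0.95, 0.90
r_sys = series_reliability(r_a, parallel_reliability(r_b, r_b))
```

Duplicating B raises its contribution from 0.90 to 0.99, so the weakest remaining link becomes A.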

Mean-Time-to-Failure (MTTF)
- Average time before system fails
  - Equal to area under reliability curve
- For Exponential Failure Law
  - MTTF = 1/λ
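Since the slide's equation was lost in extraction, the standard derivation is reproduced here as a sketch. It assumes a constant failure rate λ, so that R(t) = e^(-λt):

```latex
\text{MTTF} = \int_{0}^{\infty} R(t)\,dt
            = \int_{0}^{\infty} e^{-\lambda t}\,dt
            = \frac{1}{\lambda}
```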

Maintainability
- If system failed at time 0
  - M(t) = Probability repaired and operational at time t
- System repair time divided into
  - Passive repair time
    - Time for service engineer to travel to site
  - Active repair time
    - Time to locate failing component, repair/replace, and verify system operational
    - Can be improved through designing system so easy to locate failed component and verify

Repair Rate and MTTR
- μ = rate at which system repaired
  - Analogous to failure rate
- Maintainability often modeled as M(t) = 1 - e^(-μt)
- Mean-Time-to-Repair (MTTR) = 1/μ

Availability
[Figure: system state S (1 or 0) versus time t, with normal system operation and failures marked at t0 through t4]
- System Availability
  - Fraction of time system is operational

Availability
- Telephone Systems
  - Required to have system availability of 0.9999 ("four nines")
- High-Reliability Systems
  - May require 7 or more nines
- Fault-Tolerant Design
  - Needed to achieve such high availability from less reliable components
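To make the "nines" concrete, the sketch below uses the common steady-state expression availability = MTTF / (MTTF + MTTR). That formula is assumed here, since the slide's own equation was not preserved, and the function names are illustrative:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time the system is operational."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60

four_nines_downtime = downtime_minutes_per_year(0.9999)  # roughly 52.6 minutes
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why seven nines leaves only a few seconds per year.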

Coding Theory
- Coding
  - Using more bits than necessary to represent data
  - Provides way to detect errors
    - Errors occur when bits get flipped
- Error Detecting Codes
  - Many types
  - Detect different classes of errors
  - Use different amounts of redundancy
  - Ease of encoding and decoding data varies

Block Code
- Message = Data Being Encoded
- Block code
  - Encodes m messages with n-bit codeword
- If no redundancy
  - m messages encoded with ⌈log2(m)⌉ bits
  - minimum possible

Block Code
- To detect errors, some redundancy needed
  - Space of 2^n distinct blocks partitioned into codewords and non-codewords
- Can detect errors that cause codeword to become non-codeword
- Cannot detect errors that cause codeword to become another codeword

Separable Block Code
- Separable
  - n-bit blocks partitioned into
    - k information bits directly representing message
    - (n-k) check bits
  - Denoted (n, k) Block Code
- Advantage
  - k-bit message directly extracted without decoding
- Rate of Separable Block Code = k/n

Example of Separable Block Code
- (4, 3) Parity Code
  - Check bit is XOR of 3 message bits
  - message 101 → codeword 1010
- Single Bit Parity

Example of Non-Separable Block Code
- One-Hot Code
  - Each codeword has single 1
  - Example of 8-bit one-hot
    - 10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001
  - Redundancy = 1 - log2(8)/8 = 5/8

Linear Block Codes
- Special class
  - Modulo-2 sum of any 2 codewords also codeword
  - Null space of (n-k)×n Boolean matrix
    - Called Parity Check Matrix, H
- For any n-bit codeword c
  - cH^T = 0
  - All-0 codeword exists in any linear code

Linear Block Codes
- Generator Matrix, G
  - k×n Matrix
- Codeword c for message m
  - c = mG
- GH^T = 0

Systematic Block Code
- First k bits correspond to message
  - Last n-k bits correspond to check bits
- For Systematic Code
  - G = [I_(k×k) : P_(k×(n-k))]
  - H = [P^T_((n-k)×k) : I_((n-k)×(n-k))]
- Example
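As the slide's own example matrix was lost, here is a small stand-in using the (4, 3) parity code from earlier in the chapter. Its G is the 3×3 identity followed by a column of ones, and its H is a single all-ones row. The helper names are illustrative:

```python
def encode(message, G):
    """c = mG over GF(2): each codeword bit is a modulo-2 dot product."""
    n = len(G[0])
    return [sum(m * row[j] for m, row in zip(message, G)) % 2 for j in range(n)]

def syndrome(word, H):
    """Multiply the word by each row of H over GF(2); all zeros means codeword."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]

# (4, 3) parity code: identity block followed by a column of ones.
G = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]
H = [[1, 1, 1, 1]]

codeword = encode([1, 0, 1], G)   # message 101
```

Every row of G is itself a codeword, so checking that each row has a zero syndrome is a quick way to confirm GH^T = 0.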

Distance of Code
- Distance between two codewords
  - Number of bits in which they differ
- Distance of Code
  - Minimum distance between any two codewords in code
  - If n=k (no redundancy), distance = 1
  - Single-bit parity, distance = 2
- Code with distance d
  - Detect up to d-1 errors
  - Correct up to ⌊(d-1)/2⌋ errors
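For small codes the distance can be found by brute force over all codeword pairs. A minimal sketch, with codewords represented as bit strings (an assumption made for readability):

```python
from itertools import combinations

def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(codewords) -> int:
    """Minimum distance over every pair of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

# 3-bit even-parity code: distance 2, so it detects 1 error and corrects none.
parity_code = ["000", "011", "101", "110"]
d = code_distance(parity_code)
detectable, correctable = d - 1, (d - 1) // 2
```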

Error Correcting Codes
- Code with distance 3
  - Called single error correcting (SEC) code
- Code with distance 4
  - Called single error correcting and double error detecting (SEC-DED) code
- Procedure for constructing SEC code
  - Described in [Hamming 1950]
  - Any H-matrix with all columns distinct and no all-0 column is SEC

Hamming Code
- For any value of n
  - SEC code constructed by
    - setting each column in H equal to binary representation of column number (starting from 1)
  - Number of rows in H equal to ⌈log2(n+1)⌉
- Example of SEC Hamming Code for n=7

Error Correction in Hamming Code
- Syndrome, s
  - s = Hv^T for received vector v
  - If v is codeword
    - Syndrome = 0
  - If v non-codeword and single-bit error
    - Syndrome will match one of columns of H
    - Will contain binary value of bit position in error

Example of Error Correction
- For (7, 4) Hamming Code
  - Suppose codeword 0110011 has one-bit error changing it to 1110011
  - Syndrome = 001, the column for bit position 1, so bit 1 is flipped back
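The example can be replayed in a few lines. Because column j of H is the binary representation of j, the syndrome equals the XOR of the positions that hold a 1. This sketch assumes bit positions are numbered from 1 at the left of the printed word:

```python
def hamming_syndrome(word) -> int:
    """XOR of the (1-based) positions holding a 1; 0 means the word is a codeword."""
    s = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            s ^= position
    return s

def correct_single_error(word):
    """Flip the bit named by the syndrome, if any. Returns (corrected word, syndrome)."""
    s = hamming_syndrome(word)
    fixed = list(word)
    if 0 < s <= len(fixed):
        fixed[s - 1] ^= 1
    return fixed, s

received = [1, 1, 1, 0, 0, 1, 1]           # 0110011 with bit 1 flipped
corrected, s = correct_single_error(received)
```

Here the syndrome is 1, pointing at the first bit, and flipping it restores 0110011.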

SEC-DED Code
- Make SEC Hamming Code SEC-DED
  - By adding parity check over all bits
  - Extra parity bit
    - 1 for single-bit error
    - 0 for double-bit error
  - Makes possible to detect double-bit error
    - Avoid assuming single-bit error and miscorrecting it

Example of Error Correction
- For (7, 3) SEC-DED Hamming Code
  - Suppose codeword 0110011 has two-bit error changing it to 1010011
    - Doesn't match any column in H
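A sketch of how the extra parity check separates the two cases. It assumes the SEC-DED H-matrix is the n=7 Hamming matrix with one all-ones row appended; the slide's actual matrix was not preserved, so treat this layout and the function name as assumptions:

```python
def secded_decode(word):
    """Classify a 7-bit word using the Hamming syndrome plus overall parity."""
    syndrome = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= position
    odd_parity = sum(word) % 2 == 1      # extra parity check over all bits
    if syndrome == 0 and not odd_parity:
        return "no error", None
    if odd_parity and syndrome != 0:
        return "single error", syndrome  # position to flip
    if not odd_parity:
        return "double error", None      # detected, not corrected
    return "uncorrectable", None         # odd parity with zero syndrome

result = secded_decode([1, 0, 1, 0, 0, 1, 1])   # 0110011 with bits 1 and 2 flipped
```

Without the parity check, the nonzero syndrome (3) would be mistaken for a single error in bit 3 and the word would be miscorrected.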

Hsiao Code
- Weight of column
  - Number of 1's in column
- Constructing n-bit SEC-DED Hsiao Code
  - First use all possible weight-1 columns
    - Then all possible weight-3 columns
    - Then weight-5 columns, etc.
  - Until n columns formed
  - Number of check bits is ⌈log2(n)⌉ + 1
  - Minimizes number of 1's in H-matrix
    - Less hardware and delay for computing syndrome
    - Disadvantage: Correction logic more complex

Example of Hsiao Code
- (7, 3) Hsiao Code
  - Uses weight-1 and weight-3 columns

Unidirectional Errors
- Errors in block of data which only cause 0→1 or 1→0, but not both
  - Any number of bits in error in one direction
- Example
  - Correct codeword 111000
  - Unidirectional errors could cause
    - 001000, 000000, 101000 (only 1→0 errors)
  - Non-unidirectional errors
    - 101001, 011011 (both 1→0 and 0→1)

Unidirectional Error Detecting Codes
- All unidirectional error detecting (AUED) Codes
  - Detect all unidirectional errors in codeword
  - Single-bit parity is not AUED
    - Cannot detect even number of errors
  - No linear code is AUED
    - All linear codes must contain all-0 vector, so cannot detect all 1→0 errors

Two-Rail Code
- Two-Rail Code
  - One check bit for each information bit
    - Equal to complement of information bit
  - Two-Rail Code is AUED
  - 50% Redundancy
- Example of (6, 3) Two-Rail Code
  - Message 101 has Codeword 101010
  - Set of all codewords
    - 000111, 001110, 010101, 011100, 100011, 101010, 110001, 111000

Berger Codes
- Lowest redundancy of separable AUED codes
  - For k information bits, ⌈log2(k+1)⌉ check bits
  - Check bits equal to binary representation of number of 0's in information bits
- Example
  - Information bits 1000101
    - ⌈log2(7+1)⌉ = 3 check bits
    - Check bits equal to 100 (4 zeros)

Berger Codes
- Codewords for (5, 3) Berger Code
  - 00011, 00110, 01010, 01101, 10010, 10101, 11001, 11100
- If unidirectional errors
  - Contain 1→0 errors
    - increase 0's in information bits
    - can only decrease binary number in check bits
  - Contain 0→1 errors
    - decrease 0's in information bits
    - can only increase binary number in check bits
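The encoding rule is short enough to check directly against the codeword list above. A minimal sketch with words as bit strings; the function names are illustrative:

```python
def berger_encode(info: str) -> str:
    """Append the count of 0's in the information bits as a binary check field."""
    check_len = len(info).bit_length()   # equals ceil(log2(k + 1)) for k >= 1
    return info + format(info.count("0"), f"0{check_len}b")

def berger_is_codeword(word: str, k: int) -> bool:
    """A word is valid if re-encoding its first k bits reproduces it."""
    return berger_encode(word[:k]) == word

codewords_5_3 = [berger_encode(format(m, "03b")) for m in range(8)]
```

A unidirectional 1→0 error turning 10101 into 00001 is caught: the information bits now hold three 0's, but the check field reads 01 rather than 11.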

Berger Codes
- If 8 information bits
  - Berger code requires ⌈log2(8+1)⌉ = 4 check bits
- (16, 8) Two-Rail Code
  - Requires 50% redundancy
- Redundancy advantage of Berger Code
  - Increases as k increased

Constant Weight Codes
- Constant Weight Codes
  - Non-separable, but lower redundancy than Berger
  - Each codeword has same number of 1's
- Example: 2-out-of-3 constant weight code
  - 110, 011, 101
- AUED code
  - Unidirectional errors always change number of 1's

Constant Weight Codes
- Number of codewords in m-out-of-n code = C(n, m) = n! / (m! (n-m)!)
- Codewords maximized when m as close to n/2 as possible
  - n/2-out-of-n when n even
  - (n/2 - 0.5 or n/2 + 0.5)-out-of-n when n odd
  - Minimizes redundancy of code

Example
- 12-bit constant weight code
  - 6-out-of-12: C(12, 6) = 924 codewords
- 12-bit Berger Code
  - Only 2^8 = 256 codewords
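The comparison is easy to reproduce. This sketch counts the codewords of an m-out-of-n code with the binomial coefficient and sets it against a 12-bit Berger code, which has 8 information bits and 4 check bits; the function name is illustrative:

```python
from math import comb

def constant_weight_codewords(n: int, m: int) -> int:
    """Number of n-bit words containing exactly m ones."""
    return comb(n, m)

six_of_twelve = constant_weight_codewords(12, 6)
berger_12_bit = 2 ** 8                      # 8 information bits + 4 check bits

# The count peaks when m is as close to n/2 as possible.
best_m = max(range(13), key=lambda m: constant_weight_codewords(12, m))
```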

Constant Weight Codes
- Advantage
  - Less redundancy than Berger codes
- Disadvantage
  - Non-separable
  - Need decoding logic
    - to convert codeword back to binary message

Burst Error
- Burst Error
  - Common, multi-bit errors tend to be clustered
    - Noise source affects contiguous set of bus lines
  - Length of burst error
    - number of bits between first and last error
  - Wrap around from last to first bit of codeword
- Example: Original codeword 00000000
  - 00111100 is burst error length 4
  - 00110100 is burst error length 4
    - Any number of errors between first and last error
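One way to compute burst length with wrap-around is to find the largest cyclic run of error-free positions and subtract it from the word length. This is a sketch of that interpretation of the definition above, not a method taken from the slides:

```python
def burst_length(original: str, received: str) -> int:
    """Length of the shortest cyclic window covering every bit in error."""
    n = len(original)
    errors = [i for i in range(n) if original[i] != received[i]]
    if not errors:
        return 0
    # Error-free run between each error and the next one, going around cyclically.
    gaps = [(errors[(j + 1) % len(errors)] - errors[j] - 1) % n
            for j in range(len(errors))]
    return n - max(gaps)

length_a = burst_length("00000000", "00111100")
length_b = burst_length("00000000", "00110100")
wrapped = burst_length("00000000", "10000001")   # errors in last and first bit
```

Both slide examples give length 4, and the wrapped case gives 2 because the last and first bits are treated as adjacent.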

Cyclic Codes
- Special class of linear code
  - Any codeword shifted cyclically is another codeword
  - Used to detect burst errors
  - Less redundancy required to detect burst error than general multi-bit errors
    - Some distance 2 codes can detect all burst errors of length 4
    - detecting all possible 4-bit errors requires distance 5 code

Cyclic Redundancy Check (CRC) Code
- Most widely used cyclic code
  - Uses binary alphabet based on GF(2)
- CRC code is (n, k) block code
  - Formed using generator polynomial, g(x)
    - called code generator
    - degree n-k polynomial (same degree as number of check bits)

Message | m(x)              | g(x)    | c(x)                  | Codeword
0000    | 0                 | x^2 + 1 | 0                     | 000000
0001    | 1                 | x^2 + 1 | x^2 + 1               | 000101
0010    | x                 | x^2 + 1 | x^3 + x               | 001010
0011    | x + 1             | x^2 + 1 | x^3 + x^2 + x + 1     | 001111
0100    | x^2               | x^2 + 1 | x^4 + x^2             | 010100
0101    | x^2 + 1           | x^2 + 1 | x^4 + 1               | 010001
0110    | x^2 + x           | x^2 + 1 | x^4 + x^3 + x^2 + x   | 011110
0111    | x^2 + x + 1       | x^2 + 1 | x^4 + x^3 + x + 1     | 011011
1000    | x^3               | x^2 + 1 | x^5 + x^3             | 101000
1001    | x^3 + 1           | x^2 + 1 | x^5 + x^3 + x^2 + 1   | 101101
1010    | x^3 + x           | x^2 + 1 | x^5 + x               | 100010
1011    | x^3 + x + 1       | x^2 + 1 | x^5 + x^2 + x + 1     | 100111
1100    | x^3 + x^2         | x^2 + 1 | x^5 + x^4 + x^3 + x^2 | 111100
1101    | x^3 + x^2 + 1     | x^2 + 1 | x^5 + x^4 + x^3 + 1   | 111001
1110    | x^3 + x^2 + x     | x^2 + 1 | x^5 + x^4 + x^2 + x   | 110110
1111    | x^3 + x^2 + x + 1 | x^2 + 1 | x^5 + x^4 + x + 1     | 110011
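Each row of the table is the product c(x) = m(x)g(x) over GF(2). A minimal sketch that represents a polynomial as an integer whose bit i is the coefficient of x^i (a representation assumed here for convenience):

```python
def gf2_multiply(a: int, b: int) -> int:
    """Carry-less multiplication: polynomial product with coefficients modulo 2."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current shifted copy of a
        a <<= 1
        b >>= 1
    return result

g = 0b101                                   # g(x) = x^2 + 1
codewords = [format(gf2_multiply(m, g), "06b") for m in range(16)]
```

Note that the message bits do not appear verbatim in these codewords, which is why this form of the code is not systematic.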

CRC Code
- Linear block code
  - Has G-matrix and H-matrix
  - G-matrix shifted version of generator polynomial

CRC Code Example
- (6, 4) CRC code generated by g(x) = x^2 + 1

Systematic CRC Codes
- To obtain systematic CRC code
  - codewords formed using Galois division
    - nice because LFSR can be used for performing division

Galois Division Example
- Encode m(x) = x^2 + x with g(x) = x^2 + 1
  - Requires dividing m(x)x^(n-k) = x^4 + x^3 by g(x)
  - Remainder r(x) = x + 1
    - c(x) = m(x)x^(n-k) + r(x) = (x^2 + x)(x^2) + x + 1 = x^4 + x^3 + x + 1

Message | m(x)              | g(x)    | r(x)  | c(x)                  | Codeword
0000    | 0                 | x^2 + 1 | 0     | 0                     | 000000
0001    | 1                 | x^2 + 1 | 1     | x^2 + 1               | 000101
0010    | x                 | x^2 + 1 | x     | x^3 + x               | 001010
0011    | x + 1             | x^2 + 1 | x + 1 | x^3 + x^2 + x + 1     | 001111
0100    | x^2               | x^2 + 1 | 1     | x^4 + 1               | 010001
0101    | x^2 + 1           | x^2 + 1 | 0     | x^4 + x^2             | 010100
0110    | x^2 + x           | x^2 + 1 | x + 1 | x^4 + x^3 + x + 1     | 011011
0111    | x^2 + x + 1       | x^2 + 1 | x     | x^4 + x^3 + x^2 + x   | 011110
1000    | x^3               | x^2 + 1 | x     | x^5 + x               | 100010
1001    | x^3 + 1           | x^2 + 1 | x + 1 | x^5 + x^2 + x + 1     | 100111
1010    | x^3 + x           | x^2 + 1 | 0     | x^5 + x^3             | 101000
1011    | x^3 + x + 1       | x^2 + 1 | 1     | x^5 + x^3 + x^2 + 1   | 101101
1100    | x^3 + x^2         | x^2 + 1 | x + 1 | x^5 + x^4 + x + 1     | 110011
1101    | x^3 + x^2 + 1     | x^2 + 1 | x     | x^5 + x^4 + x^2 + x   | 110110
1110    | x^3 + x^2 + x     | x^2 + 1 | 1     | x^5 + x^4 + x^3 + 1   | 111001
1111    | x^3 + x^2 + x + 1 | x^2 + 1 | 0     | x^5 + x^4 + x^3 + x^2 | 111100
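The systematic codewords can be regenerated by Galois division. In this sketch a polynomial is an integer whose bit i is the coefficient of x^i; that representation and the function names are assumptions for illustration:

```python
def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of polynomial division over GF(2)."""
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        # Cancel the leading term by XOR-ing in an aligned copy of the divisor.
        dividend ^= divisor << (dividend.bit_length() - divisor_len)
    return dividend

def crc_encode_systematic(message: int, generator: int) -> int:
    """c(x) = m(x) * x^(n-k) + r(x), where r(x) is the remainder modulo g(x)."""
    check_bits = generator.bit_length() - 1
    shifted = message << check_bits
    return shifted | gf2_mod(shifted, generator)

g = 0b101                                        # g(x) = x^2 + 1
codeword = crc_encode_systematic(0b0110, g)      # m(x) = x^2 + x
table = [format(crc_encode_systematic(m, g), "06b") for m in range(16)]
```

The message bits now appear unchanged as the first four bits of every codeword, and every codeword leaves a zero remainder when divided by g(x).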

Generating Check Bits for CRC Code
- Use LFSR
  - With characteristic polynomial equal to g(x)
  - Append n-k 0's to end of message
- Example: m(x) = x^2 + x + 1 and g(x) = x^3 + x + 1

Checking CRC Codeword
- Checking Received Codeword for Errors
  - Shift codeword into LFSR
    - with same characteristic polynomial as used to generate it
  - If final state of LFSR non-zero, then error
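A behavioral model of the serial division may help here. It is a sketch of the bit-by-bit remainder computation, not a gate-level copy of any specific LFSR circuit from the slides, and the bit-string interface is an assumption:

```python
def lfsr_remainder(bits: str, generator: int) -> int:
    """Shift bits (most significant first) through a serial divider for g(x)."""
    r = generator.bit_length() - 1           # register length = degree of g(x)
    mask = (1 << r) - 1
    low_taps = generator & mask              # g(x) without its leading term
    state = 0
    for ch in bits:
        feedback = (state >> (r - 1)) & 1    # bit about to be shifted out
        state = ((state << 1) | int(ch)) & mask
        if feedback:
            state ^= low_taps
    return state

g = 0b1011                                   # g(x) = x^3 + x + 1
check = lfsr_remainder("111" + "000", g)     # m(x) = x^2 + x + 1 with 3 zeros appended
codeword = "111" + format(check, "03b")
ok = lfsr_remainder(codeword, g) == 0
```

Generation and checking use the same routine: appending zeros yields the check bits, and shifting in the full codeword leaves the register at zero unless an error is present.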

Selecting Generator Polynomial
- Key issue for CRC Codes
  - If first and last bit of polynomial are 1
    - Will detect burst errors of length n-k or less
  - If generator polynomial is multiple of (x+1)
    - Will detect any odd number of errors
  - If g(x) = (x+1)p(x) where p(x) primitive of degree n-k-1 and n < 2^(n-k-1)
    - Will detect single, double, triple, and odd errors

Commonly Used CRC Generators
CRC code                       | Generator Polynomial
CRC-5 (USB token packets)      | x^5 + x^2 + 1
CRC-12 (Telecom systems)       | x^12 + x^11 + x^3 + x^2 + x + 1
CRC-16-CCITT (X.25, Bluetooth) | x^16 + x^12 + x^5 + 1
CRC-32 (Ethernet)              | x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
CRC-64 (ISO)                   | x^64 + x^4 + x^3 + x + 1

Fault Tolerance Schemes
- Adding Fault Tolerance to Design
  - Improves dependability of system
  - Requires redundancy
    - Hardware
    - Time
    - Information

Hardware Redundancy
- Involves replicating hardware units
  - At any level of design
    - gate-level, module-level, chip-level, board-level
- Three Basic Forms
  - Static (also called Passive)
    - Masks faults rather than detects them
  - Dynamic (also called Active)
    - Detects faults and reconfigures to spare hardware
  - Hybrid
    - Combines active and passive approaches

Static Redundancy
- Masks faults so no erroneous outputs
  - Provides uninterrupted operation
  - Important for real-time systems
    - No time to reconfigure or retry operation
  - Simple and self-contained
    - No need to update or rollback system state

Triple Module Redundancy (TMR)
- Well-known static redundancy scheme
  - Three copies of module
  - Use majority voter to determine final output
  - Error in one module out-voted by other two
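The voter itself is a simple function. A minimal sketch of a bitwise majority vote over three module outputs, modeled as integers (the software framing is an assumption; in hardware this is one AND-OR structure per output bit):

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Each output bit takes the value held by at least two of the three inputs."""
    return (a & b) | (b & c) | (a & c)

good = 0b1010
faulty = 0b0110                       # one module with two corrupted bits
voted = majority_vote(good, good, faulty)
```

The vote masks the faulty module regardless of which of the three positions it occupies.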

TMR Reliability and MTTF
- TMR works if any 2 modules work
  - Rm = reliability of each module
  - Rv = reliability of voter
- MTTF for TMR
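The slide's equations were lost in extraction. The standard expressions are sketched below, assuming independent module failures, and for the MTTF a perfect voter and modules following the exponential failure law with rate λ:

```latex
R_{\text{TMR}} = R_v \left[ R_m^3 + 3 R_m^2 (1 - R_m) \right]
               = R_v \left( 3 R_m^2 - 2 R_m^3 \right)

\text{MTTF}_{\text{TMR}} = \int_{0}^{\infty} \left( 3 e^{-2\lambda t} - 2 e^{-3\lambda t} \right) dt
                         = \frac{3}{2\lambda} - \frac{2}{3\lambda}
                         = \frac{5}{6\lambda}
```

The result is 5/6 of the simplex value 1/λ.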

Comparison with Simplex
- Neglecting fault rate of voter
- TMR has lower MTTF, but
  - Can tolerate temporary faults
  - Higher reliability for short mission times

Comparison with Simplex
- Crossover point
- R_TMR > R_simplex when
  - Mission time shorter than 70% of MTTF
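The 70% figure can be checked numerically. With a perfect voter, TMR beats simplex exactly when the module reliability exceeds 0.5, which under the exponential failure law means t < ln(2)/λ, about 0.69 of the simplex MTTF. The sketch below assumes those conditions; the function names are illustrative:

```python
import math

def r_simplex(lam: float, t: float) -> float:
    """Single module under the exponential failure law."""
    return math.exp(-lam * t)

def r_tmr(lam: float, t: float) -> float:
    """TMR with a perfect voter: at least 2 of 3 modules working."""
    rm = r_simplex(lam, t)
    return 3 * rm**2 - 2 * rm**3

lam = 1.0                              # normalised so simplex MTTF = 1
crossover = math.log(2) / lam          # about 0.693 of the simplex MTTF
```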

N-Modular Redundancy (NMR)
- NMR
  - N modules along with majority voter
    - TMR special case
  - Number of failed modules masked = (N-1)/2
  - As N increases, MTTF decreases
    - But, reliability for short missions increases
- If goal only to tolerate temporary faults
  - TMR sufficient
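The NMR trade-off can be explored with a binomial sum over the number of working modules. This sketch assumes odd N, independent modules, and a perfect voter; none of these helper names come from the slides:

```python
from math import comb

def nmr_reliability(n: int, r_module: float) -> float:
    """Probability that a majority of n independent modules are working."""
    majority = n // 2 + 1
    return sum(comb(n, i) * r_module**i * (1 - r_module)**(n - i)
               for i in range(majority, n + 1))

short_mission = [nmr_reliability(n, 0.9) for n in (1, 3, 5)]   # modules still reliable
long_mission = [nmr_reliability(n, 0.4) for n in (1, 3, 5)]    # modules mostly worn out
```

With reliable modules, adding copies helps. Once module reliability drops below 0.5, adding copies makes things worse, which is consistent with MTTF decreasing as N grows.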

Interwoven Logic
- Replace each gate
  - with 4 gates using interconnection pattern that automatically corrects errors
- Traditionally not as attractive as TMR
  - Requires lots of area overhead
  - Renewed interest by researchers investigating emerging nanoelectronic technologies

Interwoven Logic with 4 NOR Gates

Example of Error on Third Y Input

Dynamic Redundancy q Involves § Detecting fault § Locating faulty hardware unit § Reconfiguring system to use spare fault-free hardware unit

Unpowered (Cold) Spares q Advantage § Extends lifetime of spares q Equations § Assume spare not failing until powered § Perfect reconfiguration capability

Unpowered (Cold) Spares q One cold spare doubles MTTF § Assuming faults always detected and reconfiguration circuitry never fails q Drawbacks of cold spare § Extra time to power and initialize § Cannot be used to help in detecting faults § Fault detection requires either – periodic offline testing – online testing using time or information redundancy
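The doubling of MTTF can be derived in a few lines. This is a sketch under the slide's assumptions (spare does not fail until powered, perfect detection and reconfiguration), plus the assumption that the active module and the spare share a constant failure rate \lambda.

```latex
% System survives to time t if the first module survives, or if it fails at
% time \tau and the spare then survives the remaining t - \tau
R(t) = e^{-\lambda t}
     + \int_0^t \lambda e^{-\lambda \tau} \, e^{-\lambda (t - \tau)} \, d\tau
     = (1 + \lambda t) \, e^{-\lambda t}

% Area under the reliability curve
MTTF = \int_0^\infty (1 + \lambda t) \, e^{-\lambda t} \, dt
     = \frac{1}{\lambda} + \frac{1}{\lambda}
     = \frac{2}{\lambda}
```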

Powered (Hot) Spares q Can use spares for online fault detection q One approach is duplicate-and-compare § If outputs mismatch then fault occurred – Run diagnostic procedure to determine which module is faulty and replace with spare § Any number of spares can be used

Pair-and-a-Spare q Avoids halting system to run diagnostic procedure when fault occurs

TMR/Simplex q When one module in TMR fails § Disconnect one of remaining modules § Improves MTTF while retaining advantages of TMR when 3 good modules q TMR/Simplex § Reliability always better than either TMR or Simplex alone

Comparison of Reliability vs Time

Hybrid Redundancy q Combines both static and dynamic redundancy § Masks faults like static § Detects and reconfigures like dynamic

TMR with Spares q If TMR module fails § Replace with spare – can be either hot or cold spare § While system has three working modules – TMR will provide fault masking for uninterrupted operation

Self-Purging Redundancy q Uses threshold voter instead of majority voter § Threshold voter outputs 1 if number of inputs that are 1 is greater than threshold – Otherwise outputs 0 § Requires hot spares
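A behavioral sketch of self-purging follows, assuming single-bit module outputs, a purged module contributing 0 to the voter, and a module being purged as soon as it disagrees with the voted output. The class name `SelfPurging` and the default threshold of 1 are hypothetical choices for a 5-module system.

```python
class SelfPurging:
    """Threshold voter with modules that purge themselves on disagreement."""

    def __init__(self, n_modules: int, threshold: int = 1):
        self.active = [True] * n_modules
        self.threshold = threshold

    def vote(self, outputs: list[int]) -> int:
        # Purged modules are masked out, so they contribute 0 to the count.
        ones = sum(o for o, a in zip(outputs, self.active) if a)
        result = 1 if ones > self.threshold else 0
        # Any active module that disagrees with the voted output is purged.
        for i, o in enumerate(outputs):
            if self.active[i] and o != result:
                self.active[i] = False
        return result
```

With five modules and failures arriving one at a time, three faulty modules can be purged while the two remaining good modules still exceed the threshold. Two modules failing in the same cycle with output 1 also exceed the threshold, so that case is not masked.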

Self-Purging Redundancy

Self-Purging Redundancy q Compared with 5MR § Self-purging with 5 modules – Tolerates up to 3 failing modules (5MR cannot) – Cannot tolerate two modules failing simultaneously (5MR can) q Compared with TMR with 2 spares § Self-purging with 5 modules – Simpler reconfiguration circuitry – Requires hot spares (TMR with spares can use either hot or cold spares)

Time Redundancy q Advantage § Less hardware q Drawback § Cannot detect permanent faults q If error detected § System needs to roll back to known good state before resuming operation

Repeated Execution q Repeat operation twice § Simplest time redundancy approach § Detects temporary faults occurring during one execution (but not both) – Causes mismatch in results § Can reuse same hardware for both executions – Only one copy of functional hardware needed

Repeated Execution q Requires mechanism for storing and comparing results of both executions § In processor, can store in memory or on disk and use software to compare q Main cost § Additional time for redundant execution and comparison
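In software, repeated execution might look like the following minimal sketch. The helper name `run_twice` is hypothetical, and the operation is assumed to be deterministic and free of side effects.

```python
def run_twice(op, *args):
    """Execute op twice on the same inputs and compare the stored results."""
    first = op(*args)    # result of first execution is kept for comparison
    second = op(*args)   # redundant execution on the same hardware
    if first != second:
        raise RuntimeError("result mismatch: temporary fault suspected")
    return first


total = run_twice(lambda a, b: a + b, 2, 3)
```

A permanent fault that corrupts both executions in the same way produces matching wrong results, so it passes this check undetected.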

Multi-threaded Redundant Execution q Can use in processor-based system that can run multiple threads § Two copies of thread executed concurrently § Results compared when both complete § Take advantage of processor’s built-in capability to exploit processing resources – Reduce execution time – Can significantly reduce performance penalty

Multiple Sampling of Outputs q Done at circuit-level § Sample once at end of normal clock cycle § Sample again after delay of Δt § Two samples compared to detect mismatch – Indicates error occurred § Detects fault whose duration is less than Δt § Performance overhead depends on – Size of Δt relative to normal clock period

Multiple Sampling of Outputs q Simple approach using two latches

Multiple Sampling of Outputs q Approach using stability checker at output

Diverse Recomputation q Use same hardware, but perform computation differently second time § Can detect permanent faults that affect only one computation q For arithmetic or logical operations § Shift operands when performing second computation [Patel 1982] § Detects permanent fault affecting only one bit-slice
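Recomputing with shifted operands can be sketched as follows. The model is an assumption for illustration only: a 10-bit adder whose output bit 3 is stuck at 0, with operands small enough that the shifted sum still fits. The names `faulty_add`, `good_add`, and `reso_add` are hypothetical.

```python
WIDTH = 10
MASK = (1 << WIDTH) - 1


def good_add(a: int, b: int) -> int:
    """Fault-free adder of WIDTH bits."""
    return (a + b) & MASK


def faulty_add(a: int, b: int, stuck_bit: int = 3) -> int:
    """Adder with a permanent stuck-at-0 fault in one output bit-slice."""
    return good_add(a, b) & ~(1 << stuck_bit)


def reso_add(a: int, b: int, adder) -> int:
    """Add normally, then again with operands shifted left by one bit."""
    first = adder(a, b)
    second = adder(a << 1, b << 1) >> 1  # shift result back for comparison
    if first != second:
        raise RuntimeError("mismatch: fault in one bit-slice suspected")
    return first
```

Because the shift moves each result bit into a different slice on the second pass, the stuck bit corrupts different result bits in the two computations, and any corruption shows up as a mismatch.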

Information Redundancy q Based on Error Detecting and Correcting Codes q Advantage § Detects both permanent and temporary faults § Implemented with less hardware overhead than using multiple copies of module q Disadvantage § More complex design

Error Detection q Error detecting codes used to detect errors § If error detected – Roll back to previous known error-free state – Retry operation

Rollback q Requires adding storage to save previous state § Amount of rollback depends on latency of error detection mechanism § Zero-latency error detection – rollback implemented by preventing system state from updating § If errors detected after n cycles – need rollback restoring system to state at least n clock cycles earlier

Checkpoint q Execution divided into set of operations § Before each operation executed – checkpoint created where system state saved § If any error detected during operation – rollback to last checkpoint and retry operation § If multiple retries fail – operation halts and system flags that permanent fault has occurred
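The checkpoint-and-retry flow can be sketched in software as below. This is a minimal illustration, assuming the system state fits in a Python object and the operation raises an exception when an error is detected. The names `DetectedError`, `PermanentFault`, and `run_with_checkpoint` are hypothetical.

```python
import copy


class DetectedError(Exception):
    """Raised by an operation when its error detection mechanism fires."""


class PermanentFault(Exception):
    """Raised when every retry from the checkpoint has failed."""


def run_with_checkpoint(state, operation, max_retries: int = 3):
    checkpoint = copy.deepcopy(state)          # save state before the operation
    for _ in range(max_retries):
        working = copy.deepcopy(checkpoint)    # always start from the checkpoint
        try:
            operation(working)
        except DetectedError:
            continue                           # rollback: discard partial updates
        return working                         # commit the updated state
    raise PermanentFault("operation failed on every retry")
```

Running the operation on a copy means a failed attempt leaves the checkpoint untouched, so the rollback is simply dropping the working copy.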

Error Detection q Encode outputs of circuit with error detecting code § Non-codeword output indicates error

Self-Checking Checker q Has two outputs § Normal error-free case (1, 0) or (0, 1) § If equal to each other, then error (0, 0) or (1, 1) § Cannot have single error indicator output – Stuck-at 0 fault on output could never be detected

Totally Self-Checking Checker q Requires three properties § Code Disjoint – all codeword inputs mapped to codeword outputs (and non-codeword inputs to non-codeword outputs) § Fault Secure – for all codeword inputs, checker in presence of fault will either produce correct codeword output or non-codeword output (not incorrect codeword) § Self-Testing – For each fault, at least one codeword input gives error indication

Duplicate-and-Compare q Equality checker indicates error § Undetected error can occur only if common-mode fault affects both copies § Only faults after fanout stems are detected § Over 100% overhead (including checker)

Single-Bit Parity Code q Totally self-checking checker formed by removing final gate from XOR tree

Single-Bit Parity Code q Cannot detect errors affecting an even number of bits § Can ensure no even-bit errors by generating each output with independent cone of logic – Only single-bit errors can occur due to single-point fault – Typically requires a lot of overhead
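The detection limit of a single parity bit is easy to demonstrate with a short sketch, assuming even parity. The function names are hypothetical.

```python
def parity(bits: list[int]) -> int:
    """XOR of all bits."""
    p = 0
    for b in bits:
        p ^= b
    return p


def encode(data: list[int]) -> list[int]:
    """Append an even-parity check bit to the data bits."""
    return data + [parity(data)]


def is_codeword(word: list[int]) -> bool:
    """A codeword has overall even parity."""
    return parity(word) == 0


word = encode([1, 0, 1, 1])
one_error = word.copy()
one_error[0] ^= 1           # single-bit error: detected
two_errors = one_error.copy()
two_errors[2] ^= 1          # second bit error: parity is even again
```

The second flip restores even parity, so the corrupted word is accepted as a codeword.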

Parity-Check Codes q Each check bit is parity for some set of output bits q Example: 6 outputs and 3 check bits

Parity-Check Codes q For c check bits and k functional outputs § 2^(ck) possible parity-check codes § Can choose code based on structure of circuit to minimize undetected error combinations § Fanouts in circuit determine possible error combinations due to single-point fault

Checker for Parity-Check Codes q Constructed from single-bit parity checkers and two-rail checkers

Two-Rail Checkers q Totally self-checking two-rail checker
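The slide's circuit diagram is not available here. The sketch below models the logic function commonly used for a two-rail checker cell: it takes two input pairs and produces one output pair that is complementary only when both input pairs are complementary. The function names are hypothetical, and the code models only the fault-free behavior, not the gate-level self-checking properties.

```python
from functools import reduce


def two_rail(x: tuple[int, int], y: tuple[int, int]) -> tuple[int, int]:
    """Combine two two-rail pairs into a single two-rail pair."""
    x0, x1 = x
    y0, y1 = y
    z0 = (x0 & y0) | (x1 & y1)
    z1 = (x0 & y1) | (x1 & y0)
    return z0, z1


def check_all(pairs: list[tuple[int, int]]) -> tuple[int, int]:
    """Chain checker cells to compress many pairs into one error indication."""
    return reduce(two_rail, pairs)


ok = check_all([(0, 1), (1, 0), (0, 1)])    # every pair complementary
bad = check_all([(0, 1), (1, 1), (0, 1)])   # one non-codeword pair
```

A result of (0, 1) or (1, 0) means no error was seen; (0, 0) or (1, 1) is the error indication, matching the two-output convention of a self-checking checker.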

Berger Codes q Inverter-free circuit § Inverters only at primary inputs § Can be synthesized using only algebraic factoring [Jha 1993] § Only unidirectional errors possible for single point faults – Can use unidirectional code – Berger code gives 100% coverage
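Why a Berger code catches unidirectional errors can be seen in a small sketch, assuming the usual construction in which the check bits hold the binary count of 0s in the information bits. The function names are hypothetical.

```python
def berger_encode(info: list[int]) -> list[int]:
    """Append the binary count of 0s in the information bits (MSB first)."""
    zeros = info.count(0)
    width = len(info).bit_length()  # ceil(log2(k + 1)) check bits
    check = [(zeros >> i) & 1 for i in reversed(range(width))]
    return info + check


def berger_check(word: list[int], k: int) -> bool:
    """True if the check bits match the count of 0s in the first k bits."""
    return berger_encode(word[:k]) == word


word = berger_encode([1, 0, 1, 1])
# Unidirectional 1 -> 0 errors in two information bits and one check bit.
all_to_zero = [0, 0, 0, 1, 0, 0, 0]
# Unidirectional 0 -> 1 error in one information bit.
all_to_one = [1, 1, 1, 1, 0, 0, 1]
```

Errors from 1 to 0 can only raise the count of 0s in the information bits and lower the stored count, and errors from 0 to 1 do the opposite, so the two can never agree again after a unidirectional error.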

Constant Weight Codes q Non-separable with lower redundancy § Drawback: need decoding logic to convert codeword back to its original binary value § Can use for encoding states of FSM – No need for decoding logic

Error Correction q Information redundancy can also be used to mask errors § Not as attractive as TMR because logic for predicting check bits very complex § However, very good for memories – Check bits stored with data – Errors do not propagate in memories as in logic circuits, so SEC-DED usually sufficient

Error Correction q Memories very dense and prone to errors § Especially due to single-event upsets (SEUs) from radiation q SEC-DED check bits stored in memory § 32-bit word, SEC-DED requires 7 check bits – Increases size of memory by 7/32 = 21.9% § 64-bit word, SEC-DED requires 8 check bits – Increases size of memory by 8/64 = 12.5%
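The check-bit counts quoted above can be reproduced with a short sketch, assuming a Hamming-style construction: single-error correction needs the smallest c with 2^c >= k + c + 1, and double-error detection adds one overall parity bit. The function names are hypothetical.

```python
def sec_check_bits(k: int) -> int:
    """Smallest c such that c check bits can point at any single-bit error."""
    c = 1
    while 2**c < k + c + 1:
        c += 1
    return c


def secded_check_bits(k: int) -> int:
    """SEC check bits plus one overall parity bit for double-error detection."""
    return sec_check_bits(k) + 1


overhead_32 = secded_check_bits(32) / 32
overhead_64 = secded_check_bits(64) / 64
```

The relative overhead falls as the word gets wider, because the number of check bits grows only logarithmically with the word size.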

Memory ECC Architecture

Hamming Code for ECC RAM
Bit | Parity Group 1 | Parity Group 2 | Parity Group 3 | Parity Group 4
Z1 | 1 | 1 | 0 | 0
Z2 | 1 0 (only two of four entries survive in source)
Z3 | 0 | 1 | 1 | 0
Z4 | 1 | 1 | 1 | 0
Z5 | 1 | 0 | 0 | 1
Z6 | 0 1 (only two of four entries survive in source)
Z7 | 1 | 1 | 0 | 1
Z8 | 0 | 0 | 1 | 1
c1 | 1 | 0 | 0 | 0
c2 | 0 | 1 | 0 | 0
c3 | 0 | 0 | 1 | 0
c4 | 0 | 0 | 0 | 1

Memory ECC q SEC-DED generally very effective § Memory bit-flips tend to be independent and uniformly distributed § If bit-flip occurs, gets corrected next time memory location accessed § Main risk is if memory word not accessed for long time – Multiple bit-flips could accumulate

Memory Scrubbing q Every location in memory read on periodic basis § Reduces chance of multiple errors accumulating in a memory word § Can be implemented by having memory controller cycle through memory during idle periods

Multiple-Bit Upsets (MBU) q Can occur due to single SEU § Typically occur in adjacent memory cells q Memory interleaving used § To prevent MBUs from resulting in multiple bit errors in same word
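One way interleaving spreads an MBU across words is sketched below. The layout is hypothetical: physical columns in a row are assigned to logical words round-robin, so physically adjacent cells belong to different words. The function name and the 4-way interleave factor are assumptions for illustration.

```python
def physical_to_logical(column: int, interleave: int) -> tuple[int, int]:
    """Map a physical column to (word index, bit index within that word)."""
    return column % interleave, column // interleave


# An upset hitting three adjacent cells in a row with 4-way interleaving.
hit_columns = [8, 9, 10]
hit = [physical_to_logical(c, 4) for c in hit_columns]
words_hit = {word for word, _bit in hit}
```

Each affected word sees only a single-bit error, which SEC-DED can correct, as long as the upset spans no more adjacent cells than the interleave factor.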

Type | Issues | Goal | Examples | Techniques
Long-Life Systems | Difficult or Expensive to Repair | Maximize MTTF | Satellites; Spacecraft; Implanted Biomedical | Dynamic Redundancy
Reliable Real-Time Systems | Error or Delay Catastrophic | Fault Masking Capability | Aircraft; Nuclear Power Plant; Air Bag Electronics; Radar | TMR
High Availability Systems | Downtime Very Costly | High Availability | Reservation System; Stock Exchange; Telephone Systems | No Single Point of Failure; Self-Checking Pairs; Fault Isolation
High Integrity Systems | Data Corruption Very Costly | High Data Integrity | Banking; Transaction Processing; Database | Checkpointing; Time Redundancy; ECC; Redundant Disks
Mainstream Low-Cost Systems | Reasonable Level of Failures Acceptable | Meet Failure Rate Expectations at Low Cost | Consumer Electronics; Personal Computers | Often None; Memory ECC; Bus Parity; Changing as Technology Scales

Concluding Remarks q Many different fault-tolerant schemes q Choosing scheme depends on § Types of faults to be tolerated – Temporary or permanent – Single or multiple point failures – etc. § Design constraints – Area, performance, power, etc.

Concluding Remarks q As technology scales § Circuits increasingly prone to failure § Achieving sufficient fault tolerance will be major design issue