Computer Engineering
Self Repair Technology for Logic Circuits: Architecture, Overhead and Limitations
Heinrich T. Vierhaus, BTU Cottbus, Computer Engineering Group
CREDES / ZUSYS / DAAD Summer School 2011, Tallinn

Outline
1. Introduction: Nano Structure Problems
2. The Problem of Wear-Out
3. Repair for Memory and FPGAs
4. Basic Logic Repair Strategies & Structures
5. Test and Repair Administration
6. De-Stressing Strategies
7. Cost, Overhead, Single Points of Failure
8. Summary and Conclusions

1. Introduction
A bunch of new problems from nano structures...

Nanoelectronic Problems
Lithography: The wavelength used to "map" structural information from masks to wafers is larger (4 times or more) than the minimum structural features (193 nm versus 90 / 65 / 45 nm). Layouts must be adapted to correct the resulting mapping faults.
Statistical parameter variations: The number of doping atoms in MOS transistor channels becomes so small that statistical variations of doping densities have an impact on device parameters such as threshold voltages.

New Problems with Nano-Technologies
Figure: lithography setup with light source (wavelength: 193 nm), mask (reticle), resist, exposed resist and wafer. Feature size: down to 28 nm.

Layout Correction
Modified layout for compensation of mapping faults.
Compensation is critical and non-ideal.
Faults are not random but correlated!
Requires fast fault diagnosis.

Doping Fluctuations in MOS Transistors
Figure labels: poly-Si, n regions, doping atoms, p-substrate.
Density and distribution of doping atoms cause shifts in transistor threshold voltages!

Nanostructure Problems
Individual device characteristics such as Vth depend more strongly on statistical variations of underlying physical features such as doping profiles. Primary relevance: yield.
A significant share of basic devices will be "out of specs" and need replacement by backup elements for yield improvement after production. Primary relevance: yield.
Smaller features mean higher stress (field strength, current density) and foster new mechanisms of early wear-out. Primary relevance: lifetime.
Recognizing and compensating transient errors "in time" is becoming a must, e.g. because charged particles can discharge circuit nodes. Primary relevance: dependability.

Fault Tolerant Computing
Levels at which a fault event can be handled:
Software-based fault detection & compensation: works only for transient faults! (specific)
HW logic & RT-level detection & compensation: typically works for transient and permanent faults! (universal)
Transistor- and switch-level compensation: typically works for specific types of transient faults only! (very specific)

2. Wear-Out Problems and Mechanisms
Structures on ICs used to live longer than either their application or even their users. Not any more...

IC Structures May Get Tired
"Wear-out" effects in nano-electronic ICs are likely to appear much earlier, causing a lot of problems for dependable long-time applications!

Fault Effects on ICs

Wear-Out Mechanisms
Metal migration: Metal atoms (Al, Cu) tend to migrate under high current density and high temperature.
Stress migration: Migration effects may be enhanced under mechanical stress conditions.
Effect: Migration in metal lines and vias may actually cause line interrupts. The effect is partly reversible by changing current directions.

Metal Migration

Transistor Degradation
Negative Bias Temperature Instability (NBTI): Reduced switching speed for p-channel MOS transistors that have operated under long-time constant negative gate bias. The effect is partly reversible.
Hot Carrier Injection (HCI): Reduced switching speed for n-channel MOS transistors, induced by positive gate bias and frequent switching. Not reversible.
Gate oxide deterioration: Induced by high field strength. Not reversible.
Dielectric breakdown: Insulating layers between metal lines may break, causing shorts between signal lines.
Design technology has to include a prospective "life time budget"!

Management of Wear-Out by "Fault Tolerant Computing"?
Built-in fault tolerance and error compensation are needed in nanotechnologies anyway for the management of transient faults.
Wear-out induced faults may show up as "intermittent" faults first, which become more and more frequent.
Faults in synchronous circuits and systems are detected "by clock cycle". Hence, for many types of fault tolerant architecture, the detection does not even recognize whether a fault is permanent or not.

Triple Modular Redundancy
Figure labels: input signal, Execution Units 1 to 3, comparator / voter, result out (majority), error detect.
Can detect and compensate almost any type of fault.
Overhead about 200-300 %, plus additional signal delays.
The voter itself is not covered and must be a "self checking checker".
Standard (by law) in avionics applications!
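
The voting step can be sketched in a few lines of Python. This is only an illustration of majority voting over three results; the function names and the faulty unit are hypothetical, and the voter is assumed to be fault-free, which the slide explicitly warns is not guaranteed in hardware.

```python
from collections import Counter

def tmr_vote(results):
    """Return (majority_value, error_detected) for the outputs of three units."""
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        # All three units disagree: no majority exists.
        raise RuntimeError("no majority among the three units")
    return value, count < 3

def good_adder(a, b):
    return a + b

def faulty_adder(a, b):
    # Hypothetical unit whose lowest output bit is inverted.
    return (a + b) ^ 1

units = [good_adder, good_adder, faulty_adder]
result, error = tmr_vote([unit(2, 3) for unit in units])
print(result, error)
```

A single deviating unit is outvoted and flagged; a second deviating unit with a different wrong value leaves no majority at all.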

Error Detecting / Correcting Codes
Figure labels: data, signature, transmission / storage, signature comparison, fault detect, error correction.
Often applicable to 1- or 2-bit faults only.
Often limited to certain fault models (uni-directional).
Becomes expensive if applied to computational units.
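
As one concrete instance of such a code, the sketch below uses the classic Hamming(7,4) code, which corrects any single-bit error in a 7-bit word. The slide does not name a specific code, so this choice is an assumption made for illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits into 7 code bits (bit positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome = index of the flipped bit
    if position:
        c[position - 1] ^= 1          # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                          # inject a single-bit fault at position 5
print(hamming74_decode(word))
```

With two flipped bits the syndrome points at a wrong position and the decoder miscorrects, which matches the limitation to small numbers of faulty bits stated on the slide.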

Can TMR and Codes Compensate Permanent Faults?
Fault / error detection circuitry typically works on a clock-cycle basis. It does not "know" whether a fault is transient or permanent.
A permanent fault is a fault event that occurs repeatedly in several to many successive clock cycles.
Error correction technology can detect and compensate such permanent faults as well as transient faults.
A critical condition occurs if transient faults occur on top of permanent faults. Then the superposition of fault effects is likely to exceed the system's fault handling capacity.
System components that run actively "in parallel" suffer from the same wear-out effects. Therefore there is an increase in dependability before wear-out limits, but no significant life time extension!
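
A minimal sketch of how a per-cycle error flag could be turned into a transient / permanent classification by counting successive faulty cycles. The threshold of three cycles is an arbitrary assumption, not a value from the slides.

```python
def classify_fault(error_flags, threshold=3):
    """Classify a per-clock-cycle error trace.

    error_flags: iterable of truthy/falsy values, one per clock cycle.
    threshold:   assumed number of successive faulty cycles after which
                 the fault is treated as permanent.
    """
    longest = run = 0
    for flag in error_flags:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    if longest == 0:
        return "none"
    return "permanent" if longest >= threshold else "transient"

print(classify_fault([0, 1, 0, 0, 0]))
print(classify_fault([0, 1, 1, 1, 1]))
```

An intermittent wear-out fault would first look transient under this rule and be reclassified once its error bursts grow longer.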

Redundancy and Wear-Out
During the normal life time of the system, duplication or triplication can enhance reliability significantly.
But area and power consumption are about tripled as well.
And by the end of normal operating time (out of fuel / steam) all three systems will fail shortly one after the other!
Reliability enhancement is not equal to life time extension!
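
The point can be illustrated with the textbook reliability formula for TMR, R_TMR = 3R^2 - 2R^3, where R is the reliability of a single unit. The sketch assumes independent unit failures and a perfect voter; both are simplifications, and correlated wear-out as described above makes the real picture worse.

```python
def tmr_reliability(r):
    """Probability that at least two of three independent units work."""
    return 3 * r**2 - 2 * r**3

for r in (0.99, 0.9, 0.5, 0.3):
    print(f"single unit: {r:.2f}   TMR: {tmr_reliability(r):.4f}")
```

While single units are still reliable, TMR improves on them; once unit reliability drops below 0.5 near the end of life, the triplicated system is less reliable than a single unit.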

Self Repair?
Levels at which a fault event can be handled:
Software-based fault detection & compensation: works only for transient faults! (specific)
HW logic & RT-level detection & compensation: typically works for transient and permanent faults! (universal)
Transistor- and switch-level compensation: typically works for specific types of transient faults only! (very specific)
Self repair for permanent faults!

3. Repair for Memory and FPGAs
Compensation of transient faults is not enough. Some technologies for transient compensation can handle permanent faults, too, but not in the long run and not with additional transient faults!

Memory Test & Repair
Figure labels: line address, lines, read / write lines, spare columns.

Memory Test & Repair (2)
Figure labels: line address, lines, read / write lines, spare columns, memory BIST controller.
... is already state-of-the-art!
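
The bookkeeping behind spare-column repair can be sketched as a small remapping table. The class and method names are hypothetical, and the sketch assumes that the BIST controller reports faulty column indices and that spare columns are themselves fault-free.

```python
class RepairableMemory:
    """Column address remapping for a memory array with spare columns."""

    def __init__(self, columns, spares):
        self.columns = columns
        # Spare columns are addressed directly behind the regular ones.
        self.free_spares = list(range(columns, columns + spares))
        self.remap = {}

    def mark_faulty(self, column):
        """Assign a spare to a faulty column; return False if none is left."""
        if column in self.remap:
            return True
        if not self.free_spares:
            return False
        self.remap[column] = self.free_spares.pop(0)
        return True

    def physical_column(self, column):
        """Translate a logical column address to the column actually used."""
        return self.remap.get(column, column)

memory = RepairableMemory(columns=8, spares=2)
memory.mark_faulty(3)
print(memory.physical_column(3), memory.physical_column(4))
```

Repair capacity is bounded by the number of spares: a third faulty column in this example can no longer be replaced.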

FPGA-based Self Repair

In-System FPGA Repair

Repair Mechanism: Row/Line-Shift
Little overhead for the re-configuration process.
Loss of many "good" CLBs for every fault.

Distributed Backup CLBs
Minimum loss of functional CLBs.
High effort for re-wiring requires massive "embedded" computing power (32-bit CPU, 500 MHz).

Self Repair within FPGA Basic Blocks
Heterogeneous repair strategies are required (memory, logic).
Logic blocks may use methods known from memory BISR.
Additional repair strategies are necessary for logic elements.
The basic overhead of FPGAs versus standard logic (a factor of about 10) increases further.
Repair strategies for logic may use some features already used in FPGAs (e.g. switched interconnects).

Structure of a CLB Slice

FPGAs for a Solution?
The granularity of re-configurable logic blocks (CLBs) in most FPGAs is on the order of several thousand transistors.
Replacement strategies must be based on a block granularity in the range of 100-500 transistors for fault densities between 0.01 % and 0.1 %.
An efficient FPGA repair mechanism requires detailed fault diagnosis plus specific repair schemes, which cannot be kept as pre-computed reconfiguration schemes.
Computation of specific repair schemes requires "in-system EDA" (re-placement and routing) with a massive demand for computing power.
There is no source of such "always available" computing power.

Self-Repairing FPGA?
Figure labels: Reconfigurable Logic (array of CLB / WB pairs), New-Config. Memory, Virtual CPU, Config. Scheme, Program.

Advanced FPGA Structures
... are only partly re-configurable for performance reasons!

FPGA / CPLD Repair
Looks pretty easy at first glance because of the regular architecture!
Requires lines / columns of switches for configuration at inputs and between AND / OR matrices.
Requires additional programmability of cross-points by double-gate transistors as in EEPROMs or Flash memory.
Not fully compatible with standard CMOS.
Limited number of (re-) configurations.
Floating gate (FAMOS) transistors are fault-sensitive!

4. Basic Logic Repair Strategies
Repair techniques that replace failing building blocks by redundant elements from a "silent" storage are not new. IBM has been selling such computer systems specifically for applications in banks for decades. But always with few (2-10) backup elements (CPUs), assuming a small number of failures (< 10) within years.

Mainframes
... will often contain "redundant" CPUs for eventual fault compensation. But one faulty transistor then "costs" a whole CPU, limiting the fault handling to a few (about 10) permanent fault cases.

Granularity of Replacement

Repair Overhead versus Element Loss
Figure: repair procedure overhead and functioning elements lost versus size of replaced blocks (granularity, 1 to 10 M), with regions labeled "prohibitive overhead", "new methods and architectures" and "prohibitive fault density".

Built-in Self Repair (BISR)
BISR is well understood for highly regular structures such as embedded memory blocks.
BISR essentially depends on built-in self test (BIST) with high diagnostic resolution.
Steps: fault detection, fault diagnosis, fault isolation, redundancy allocation, fault / redundancy management.
Redundancy management must monitor faults, replacements and available redundancy, and must also re-establish a "working" system state after power-down states.

Levels of Repair
Transistor / switch level: replace transistors or transistor groups.
  Losses by reconfiguration (switched-off "good" devices): potentially small (20-50 %) for transistor faults.
  Overhead for test and diagnosis: very high. Repair overhead will dominate reliability!
Gate level: replace gates or logic cells.
  Losses by reconfiguration: medium (60 to 90 %) for single transistor faults.
  Overhead for test and diagnosis: high.
Macro-block level: replace functional macros (ALU, FPU, CPU).
  Losses by reconfiguration: high, 99 % or more.
  Overhead for test and diagnosis: maybe acceptable.

The Fault Isolation Problem
Figure labels: driver, gate short, Load 1, Load 2.
GND-shorts of input gates affect the whole fan-in network and make redundancy obsolete!

Block-Level Repair
Blocks of logic / RT elements (gates and larger) each contain a redundant element that can replace a faulty unit.

Switching Concept (1)

Switching Concept (2)
Figure labels: inputs, outputs, Functional Blocks 1 to 3, Replacement Block, test in, test out.

A Regular Switching Scheme
The scheme is regular and scalable by nature, always comprising k functional blocks of the same nature plus 1 additional block for backup.
Building blocks are separated by (pass-) transistor switches at inputs and outputs, providing full isolation of a faulty block.
There are always 2 additional pass-transistors between two functional blocks.
The reconfiguration scheme is regular in shifting functionality between blocks, which results in a simple scheme of administration.
The functional access to the "spare" block can be used for testing purposes.
In any state of (re-) configuration, the potentially "faulty" block is connected to test input / output terminals.
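
The shifting of functionality can be sketched as a mapping from logical functions to physical blocks. The numbering (physical blocks 0..k, with block k as the spare in the normal state) and the function name are assumptions made for illustration.

```python
def block_assignment(k, faulty=None):
    """Map logical function i (0..k-1) to a physical block (0..k).

    k:      number of functional blocks; block index k is the spare.
    faulty: index of the physical block to bypass, or None if all are good.
    """
    if faulty is None:
        faulty = k          # nothing to bypass: the spare stays idle
    if not 0 <= faulty <= k:
        raise ValueError("faulty block index out of range")
    # Functions below the faulty block stay put, all others shift by one.
    return [i if i < faulty else i + 1 for i in range(k)]

print(block_assignment(3))            # normal state
print(block_assignment(3, faulty=1))  # block 1 isolated, functions shifted
```

For k = 3 this yields four configurations in total, one per physical block that can be left out of the functional path.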

Overhead Depending on Block Size
Transistors (3/4: 3 functional blocks plus 1 backup):
Basic element   Functional   Backup   Norm. switch   Ext. switch
3/4 2-NAND      12           4        18             24
3/4 2-AND       18           6        18             24
3/4 2-XOR       18           6        18             24
H-Adder         36           12       24             30
F-Adder         90           30       30             36
For small basic blocks, the switches make up the essential overhead (200 %)!
For larger basic blocks, the overhead can be reduced to about 30-50 %,
... not counting test and administration overhead!
Extract larger basic units from seemingly irregular logic netlists!
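
The trend in the table can be recomputed directly from the transistor counts. The sketch assumes that overhead is defined as (backup + switches) / functional transistors; the slide does not spell out its exact definition, so the resulting percentages are only indicative.

```python
# Transistor counts copied from the table:
# (functional, backup, normal switches, extended switches)
BLOCKS = {
    "2-NAND": (12, 4, 18, 24),
    "2-AND": (18, 6, 18, 24),
    "2-XOR": (18, 6, 18, 24),
    "H-Adder": (36, 12, 24, 30),
    "F-Adder": (90, 30, 30, 36),
}

def overhead_percent(functional, backup, switches):
    """Extra transistors (backup + switches) relative to the functional ones."""
    return 100.0 * (backup + switches) / functional

for name, (functional, backup, norm_sw, ext_sw) in BLOCKS.items():
    print(f"{name:8s} norm: {overhead_percent(functional, backup, norm_sw):6.1f} %"
          f"  ext: {overhead_percent(functional, backup, ext_sw):6.1f} %")
```

Under this definition the 2-NAND group lands close to the 200 % quoted above, while the full adder is already well below 100 %; blocks larger than those in the table would be needed to reach the 30-50 % range.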

Overhead

5. Test and Repair Administration
Figure labels, centralized control: test generator, test analyzer, system monitoring, configurator and status memory, RLB logic. May be faulty!
Figure labels, de-centralized test and control: RLBs, each with its own BIST and configuration (Conf.) logic.

Blocks, Switching, Administration
Figure labels: local (re-) configuration, remote (re-) configuration, columns of switches, F-Units, Red.-Unit, Conf.-Units, decoder, global control unit.

Combining Test and Re-Configuration
Figure labels: test input, logic under test, reference, test out, compare, fault detect, next state, config. memory / counter.

Test and Administration
Each of the elements in a block is testable via specific test inputs.
Test is done by comparison with reference outputs.
The system is run through states of re-configuration with the same input test pattern applied.
At test, a functional unit is always removed from normal operation and connected to test I/Os.
In case of a "fault detect", the system is fixed in the current status.
Such a procedure of self-test and self-reconfiguration can run at every system start-up, avoiding a central "fault memory".
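
The start-up procedure can be sketched as a loop over configurations that stops at the first detected fault. The blocks and the reference are modelled as plain Python callables, which is a hypothetical stand-in for the hardware; what happens when no fault is found (falling back to the normal state with the last block as idle spare) is also an assumption.

```python
def startup_self_test(blocks, test_patterns, reference):
    """Step through all configurations; return the index of the isolated block.

    blocks:        list of k + 1 callables (k functional blocks plus the spare)
    test_patterns: inputs applied to the block currently connected to test I/O
    reference:     callable delivering the expected output for a pattern
    """
    for under_test, block in enumerate(blocks):
        for pattern in test_patterns:
            if block(pattern) != reference(pattern):
                return under_test      # fix at fault: keep this block isolated
    return len(blocks) - 1             # no fault found: last block stays idle

def and2(p):
    return p[0] & p[1]

def stuck_at_one(p):
    return 1                           # hypothetical faulty block

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(startup_self_test([and2, and2, stuck_at_one, and2], patterns, and2))
```

Because the loop is rerun at every start-up, the isolated block is rediscovered each time and no persistent fault memory is needed.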

Controller for (Re-) Configuration
Controller minimum complexity: 80 transistors (3 + 1 configuration).
A controller may drive one or several re-configurable blocks in parallel, depending on their size.

Local Interconnects
The block-based repair scheme so far cannot cover faults on wires between re-configurable blocks.
For small basic blocks (such as logic gates) the majority of wiring is between re-configurable units and not covered.
For larger (RT-level) basic blocks the majority of wiring is within basic blocks and covered.
Schemes that can also cover inter-block wiring are possible, but require FPGA-like configurable switching and complex switching schemes.

Essentials of the Repair Scheme
Logic self repair is feasible at a cost below triple modular redundancy (TMR).
There is a trade-off between the size of the reconfigurable logic blocks (RLBs) and the maximum tolerable fault density.
Administration, not redundancy, makes the critical overhead.
Efforts can be saved by administrating several RLBs in parallel.
Low-level interconnects between RLBs make for the essential "single point of failure" in the repair scheme!
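
The trade-off between block size and tolerable fault density can be made tangible with a simple probability model. The sketch assumes independent transistor faults with a uniform density and ignores faults in switches, controller and wiring; all of these are simplifications, and correlated faults would change the numbers.

```python
def block_fault_probability(transistors, fault_density):
    """Probability that a block of the given size contains at least one fault."""
    return 1.0 - (1.0 - fault_density) ** transistors

def rlb_survival_probability(transistors, fault_density, k=3):
    """Probability that an RLB (k functional blocks + 1 spare) is repairable,
    i.e. that at most one of its k + 1 blocks is faulty."""
    q = block_fault_probability(transistors, fault_density)
    n = k + 1
    return (1 - q) ** n + n * q * (1 - q) ** (n - 1)

for size in (100, 500, 5000):
    for density in (0.0001, 0.001):
        p = rlb_survival_probability(size, density)
        print(f"block size {size:5d}, fault density {density:.2%}: {p:.3f}")
```

In this model, growing the blocks at a fixed fault density quickly makes a second faulty block in the same RLB likely, which is why the tolerable fault density shrinks with block size.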

6. De-Stressing

The Purpose of De-Stressing
Building blocks of equal type in digital systems may be more or less heavily used.
Blocks running with the highest dynamic load and at the highest temperature are candidates for early failure.
Using otherwise "silent" resources to relieve such units from stress periodically may extend the overall life time of the system.
The re-configuration scheme developed for repair may also serve this purpose with slight modifications,
... and the scheme must be compatible with repair architectures!

The Scheme of De-Stressing

Modified Control Scheme
For de-stressing, functions have to be shifted while the system is in "hot" operation.
As long as all building blocks are fully functional, running two functional blocks in parallel serving the same inputs and outputs is possible.
With a total of k building blocks (including the spare one) there are k "stable" states of re-configuration (1 normal, 3 repairs) and (k-1) intermediate states for "handover" in case of de-stressing.
No extra switches are necessary, but there is additional overhead in state management and state decoding.
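
The state count can be checked with a short enumeration. The state names are hypothetical; the sketch only assumes, as stated above, that handover happens between neighbouring stable configurations.

```python
def configuration_states(total_blocks):
    """List stable and handover states for `total_blocks` blocks (spare included).

    Stable state "idle=i": block i carries no function (spare or isolated).
    Handover state "handover i<->i+1": blocks i and i+1 run the same function
    in parallel while the idle role moves from one to the other.
    """
    stable = [f"idle={i}" for i in range(total_blocks)]
    handover = [f"handover {i}<->{i + 1}" for i in range(total_blocks - 1)]
    return stable, handover

stable, handover = configuration_states(4)
print(len(stable), "stable +", len(handover), "handover states")
```

For 3 functional blocks plus 1 spare this gives 4 stable and 3 handover states, i.e. the controller needs seven states instead of four.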

FSM including Transitional States
If a "flying" transition between repair states becomes necessary, the control logic will have seven states instead of four!

Control Logic Functionality

Extended Control Logic

7. Overhead and Limitations
BISR requires additional overhead. The inevitable extra circuitry used for fault administration is not fault-free by definition.
But we can assume that such circuitry, if fabricated correctly, is not in heavy use all the time and will exhibit much reduced failure from stress.
Memory cells used for repair state administration are prone to transient fault effects from particle radiation.
With suitable state encoding (1-out-of-n code), a parity check can be applied.
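
Why a parity check suffices for a 1-out-of-n encoded state register can be seen in a few lines. This is a sketch of the principle only; the actual controller encoding is not given in the slides.

```python
def parity_ok(state_bits):
    """A valid 1-out-of-n word has exactly one bit set, hence odd parity."""
    return sum(state_bits) % 2 == 1

def single_bit_upsets(state_bits):
    """Yield every word that differs from state_bits in exactly one bit."""
    for i in range(len(state_bits)):
        corrupted = list(state_bits)
        corrupted[i] ^= 1
        yield corrupted

state = [0, 1, 0, 0]                      # one of four repair states
print(parity_ok(state))
print(all(not parity_ok(word) for word in single_bit_upsets(state)))
```

Any single bit flip produces a word with zero or two bits set and therefore even parity, so it is detected. A double flip can restore odd parity and would escape this check.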

Overhead

Cost / Overhead (3 functional blocks plus 1 backup in RLB)
* with / without extensions for de-stressing; controller design optimized for supervision by parity control.

Sources of Overhead
Overhead in % of the functional transistors ("-": no value in the source):
Basic block   Complexity (trans.)   Redund.   Switches   Control   Ctrl/destr.
2-NAND        4                     33        250        675       1666
H-Adder       12                    33        111        225       555
F-Adder       30                    33        55         90        222
2 Bit ALU     352                   -         13         7.6       18.9
4 Bit ALU     699                   -         8.5        3.8       9.5
8 Bit ALU     1367                  -         6.2        2         4.8
Switches and control overhead dominate; a reasonable lower bound for the complexity of basic blocks is around 100-200 transistors.
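
The control column can be approximated by spreading one controller over the functional transistors of an RLB. The sketch assumes the 80-transistor minimum controller mentioned earlier and 3 functional blocks per RLB; the real controller behind the table may differ slightly, so the results only roughly match the listed values.

```python
def control_overhead_percent(block_transistors,
                             controller_transistors=80,
                             functional_blocks=3):
    """Controller size relative to the functional transistors of one RLB."""
    functional = functional_blocks * block_transistors
    return 100.0 * controller_transistors / functional

for name, size in [("2-NAND", 4), ("H-Adder", 12), ("F-Adder", 30),
                   ("2 Bit ALU", 352), ("4 Bit ALU", 699), ("8 Bit ALU", 1367)]:
    print(f"{name:10s} {control_overhead_percent(size):7.1f} %")
```

The fixed controller cost is amortized only for blocks of a few hundred transistors, which is consistent with the stated lower bound of about 100-200 transistors per basic block.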

Overhead and Block Size

The Switching Problem (1)
Figure captions: compensates "always on"; compensates "always off"; compensates "always on" and "always off",
... always in one single transistor.

Single Points of Failure: Transistor Switches

Pass Transistor Faults
A short condition between the signal input (Usign) and the control input (Uctrl) may be solved by designing the gate input line (Rbr) as a fuse. Then one additional transistor is needed as a "power sink".

Blowing Fuses

8. Summary and Conclusions
Logic self-repair is not impossible, but not cheap either.
The lower bound for logic blocks is about 100 transistors.
Experience shows that most logic designs "yield" some potential for logic extraction.
Repair technologies work even (much) better for regular processor architectures such as VLIW processors.
In real-life designs, a large part of the system (memory, 50-90 %; functional units, 10-40 %) is regular. Only a small fraction is truly "irregular" and needs higher overhead.
There is no such strategy yet for analog and mixed signal circuits!

Real Embedded Systems
Figure labels: CPU data path, memory control, DSP, cache, memory, control, mixed signal / RF.
... only a small fraction of the real system is truly irregular and needs "expensive" logic repair!

Regular Processor Architectures
Figure labels: control logic (needs logic BISR), register file, multiple parallel processing units (Add).
Regular processor structures with multiple parallel units need expensive logic (self-) repair only for their control logic.
Reconfiguration of data-path elements can be arranged by software, which does not have wear-out!

Design for Repairability
Figure (design flow) labels: RT netlist; extract obvious regular blocks; random logic; find and extract regular entities; random rest logic; compose RT-RLBs; compose gate-level RLBs; compose RLB control scheme; RLB control circuitry; estimate reliability; done.
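
The "compose RLBs" step can be sketched as grouping identical netlist elements into sets of k functional blocks, each of which would then receive one spare. The netlist representation and all names are hypothetical, and the sketch ignores placement and wiring, which a real flow would have to consider.

```python
from collections import defaultdict

def compose_rlbs(netlist, k=3):
    """Group instances of the same cell type into RLBs of k functional blocks.

    netlist: list of (instance_name, cell_type) pairs
    Returns (rlbs, rest): rlbs is a list of (cell_type, [k instance names]),
    rest lists instances that did not fill a complete group.
    """
    by_type = defaultdict(list)
    for name, cell_type in netlist:
        by_type[cell_type].append(name)
    rlbs, rest = [], []
    for cell_type, names in by_type.items():
        full = len(names) - len(names) % k
        for start in range(0, full, k):
            rlbs.append((cell_type, names[start:start + k]))
        rest.extend(names[full:])
    return rlbs, rest

netlist = [("fa0", "FA"), ("fa1", "FA"), ("fa2", "FA"), ("fa3", "FA"), ("m0", "MUX")]
print(compose_rlbs(netlist))
```

Instances left in the rest list correspond to the random rest logic, which would need gate-level RLBs or remain unprotected.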

This is the END!
Thank you for not falling asleep! (I would have...)