1 Introduction 1 Faults and their manifestation 4

  • Slides: 29

1. Introduction
   1. Faults and their manifestation (4)
   2. Analysis of faults (12)
   3. Classification of tests (5)
   4. Fault coverage requirements (3)
   5. Test economics (4)

1.1 Faults and their manifestation
Definition of the terms: Failure, Error and Fault
• Failure: A system failure is present when the service of the system differs from the expected service
  – A failure is caused by an error
• Error: There is an error in the system when its state differs from the state required to deliver the expected service
  – An error is caused by a fault
• Fault: A fault is present when there is a physical difference between the correct system and the current system

1.2 Faults and their manifestation: Example
A car cannot be used due to a flat tire
• Failure: The car cannot be driven due to a flat tire
  – I.e., the service differs from the expected service
  – The failure is caused by an error
• Error: The air pressure has an erroneous state
  – The error is caused by a fault
• Fault: A puncture, causing an erroneous air-pressure state
  – I.e., the puncture is the difference between the correct system and the current system
Note: A fault may not immediately result in a failure, e.g., as is the case with a slowly leaking tire

1.3 Fault manifestation
According to the way faults manifest themselves in time, they can be divided into permanent and non-permanent faults
• Permanent fault: Affects the system’s functional behavior permanently
  – Permanent faults are also referred to as solid or hard faults
  – Examples: Broken wires, functional design errors, etc.
• Non-permanent fault: Affects the system’s functional behavior only part of the time

1.4 Non-permanent faults
Non-permanent faults are only present part of the time
• They occur at random moments and affect the system behavior for finite periods of time
• Therefore, their detection and localization is difficult
These faults consist of two groups
• Transient faults
  – Caused by environmental conditions
  – They are also referred to as soft errors
  – Examples: cosmic rays, α-particles, temperature, pressure, vibration
• Intermittent faults
  – Caused by non-environmental conditions
  – Examples: Loose connections, deteriorating or aging components

2.1 Analysis of faults
The following topics explain this subject
• Analyze the frequency of occurrence of faults
• Analyze the system failure rate over its lifetime
• Show failure rates of series and parallel systems
• Explain physical and electrical causes of faults
  – These are referred to as failure mechanisms

2.2 Frequency of occurrence of faults (1)
Can be explained using reliability theory
• The point in time t at which a fault occurs can be considered a random variable
• The probability of a failure before time t, F(t), is the unreliability of the system
• The reliability of a system, R(t), is the probability of a correctly functioning system at time t:
    $R(t) = 1 - F(t)$, or alternatively: $F(t) = 1 - R(t)$
It is assumed that:
• $F(0) = 0$: Initially the system will be operable
• $F(\infty) = 1$: Ultimately the system will fail
• $F(t) + R(t) = 1$: System is either operable or failing

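2.3 Frequency of occurrence of faults (2)
The derivative of F(t), f(t), is called the failure probability density function:
    $f(t) = \frac{dF(t)}{dt} = -\frac{dR(t)}{dt}$
Hence:
    $F(t) = \int_0^t f(\tau)\,d\tau$ and $R(t) = \int_t^\infty f(\tau)\,d\tau$
The failure rate, z(t), is defined as the conditional probability, per unit time, that the system fails during the period (t, t+Δt), given that the system was operational at time t:
    $z(t) = \lim_{\Delta t \to 0} \frac{F(t+\Delta t) - F(t)}{\Delta t} \cdot \frac{1}{R(t)}$
Alternatively, z(t) can be expressed as follows:
    $z(t) = \frac{f(t)}{R(t)}$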

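2.4 Frequency of occurrence of faults (3)
R(t) can be expressed in terms of z(t) as follows (using $R(0) = 1$):
    $\int_0^t z(\tau)\,d\tau = -\int_{R(0)}^{R(t)} \frac{dR}{R} = -\ln R(t)$
or,
    $R(t) = e^{-\int_0^t z(\tau)\,d\tau}$
The average lifetime of a system, θ, can be expressed as the mathematical expectation of t:
    $\theta = \int_0^\infty t \cdot f(t)\,dt$
For a non-maintained system, θ is called the Mean Time To Failure, MTTF. Using partial integration, and assuming $\lim_{t \to \infty} t \cdot R(t) = 0$:
    $MTTF = \theta = \int_0^\infty R(t)\,dt$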
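The partial-integration step for the MTTF can be written out as below. This is a sketch under the usual assumptions $f(t) = -dR(t)/dt$ and $t \cdot R(t) \to 0$ for $t \to \infty$; the symbol θ for the average lifetime is a naming assumption.

```latex
\begin{align*}
\theta &= \int_0^\infty t\, f(t)\, dt
        = -\int_0^\infty t\, \frac{dR(t)}{dt}\, dt \\
       &= -\Big[\, t\, R(t) \Big]_0^\infty + \int_0^\infty R(t)\, dt \\
       &= \int_0^\infty R(t)\, dt
          \qquad \text{since } \lim_{t \to \infty} t\, R(t) = 0
\end{align*}
```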

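2.5 Frequency of occurrence of faults (4)
Given a system with the following reliability:
    $R(t) = e^{-\lambda t}$
The failure rate, z(t), of that system is computed below, and has a constant value:
    $z(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$
Assuming failures occur randomly with a constant rate λ, the MTTF can be expressed as:
    $MTTF = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$
Example: R(t) & F(t) of Dutch male population (over years: 1976–1980)
Note: The number of people older than 100 years is too small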
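The constant-failure-rate case can be checked numerically. The sketch below assumes $R(t) = e^{-\lambda t}$, estimates $z(t) = f(t)/R(t)$ with a finite difference, and integrates R(t) to approximate the MTTF. The function names and the value λ = 0.001 failures per hour are illustrative assumptions, not taken from the slides.

```python
import math


def reliability(t, lam):
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-lam * t)


def failure_rate(t, lam, dt=1e-3):
    """z(t) = f(t) / R(t), with f(t) = -dR/dt from a central difference."""
    f = -(reliability(t + dt, lam) - reliability(t - dt, lam)) / (2.0 * dt)
    return f / reliability(t, lam)


def mttf(lam, horizon_factor=50.0, steps=200_000):
    """Trapezoidal integration of R(t) from 0 to horizon_factor / lambda."""
    upper = horizon_factor / lam
    h = upper / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(upper, lam))
    total += sum(reliability(i * h, lam) for i in range(1, steps))
    return total * h


LAMBDA = 1e-3  # hypothetical failure rate, in failures per hour
print(failure_rate(100.0, LAMBDA))  # close to LAMBDA at any t
print(mttf(LAMBDA))                 # close to 1 / LAMBDA
```

The estimated z(t) stays at λ regardless of t, and the integral of R(t) comes out at 1/λ, matching the closed-form results.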

2.6 Frequency of occurrence of faults (5)
R(t) & F(t) of Dutch male population (figure panels: z(t), f(t))
Note: Infant mortality rate
Note: Increase of z(t) & f(t) between ages 18–20 due to driving accidents

2.7 Failure rate over product lifetime (1)
A well-known graphical representation of the failure rate, z(t), is the bathtub curve (figure inset: z(t) of Dutch males). It consists of three regions:
1. Infant mortality
   Failures in this region are termed infant mortalities. They are attributed to poor quality due to variations in the production process
2. Working life; Constant failure rate: z(t) = λ
   Failures are considered to occur randomly in time
3. Wear out; Increasing failure rate
   This represents the end-of-life period of a system
A system should be shipped after it has passed the infant mortality period, in order to reduce the number of field returns.

2.8 Failure rate over product lifetime (2)
Shipping a system after the infant mortality period can be done by:
1. Aging the system for that period (this can be several months)
2. Aging the system under stress
   – This accelerates the aging process
An important stress condition is increased temperature: Burn-In
The accelerating effect of temperature follows Arrhenius’ equation:
    $\lambda_{T_2} = \lambda_{T_1} \cdot e^{\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)}$
• $T_1$ and $T_2$ are absolute temperatures (in kelvin, K)
• $\lambda_{T_1}$ and $\lambda_{T_2}$ are the failure rates at $T_1$ and $T_2$, respectively
• $E_a$ is the activation energy; a constant expressed in electron-volts, eV
• k is Boltzmann’s constant; $k = 8.617 \cdot 10^{-5}$ eV/K
The equation shows that the failure rate is exponentially dependent on the temperature

2.9 Failure rate over product lifetime (3)
Example of use of Arrhenius’ equation
• Assume Burn-In takes place at 150 °C = 423 K; i.e., $T_2 = 423$
• Note: Room temperature is 30 °C = 303 K; i.e., $T_1 = 303$
• Given that the $E_a$ for the targeted failure mechanism is: $E_a = 0.6$ eV
• Then the acceleration factor is:
    $\frac{\lambda_{T_2}}{\lambda_{T_1}} = e^{\frac{0.6}{8.617 \cdot 10^{-5}}\left(\frac{1}{303} - \frac{1}{423}\right)} \approx 678$
This means that the 150 °C temperature stress reduces the aging time by a factor of 678.
Note: Every failure mechanism has its typical $E_a$ value
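The burn-in example can be reproduced with a few lines of Python. This is a minimal sketch that assumes the Arrhenius acceleration factor has the form $e^{(E_a/k)(1/T_1 - 1/T_2)}$; the function and constant names are illustrative.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K, as on the slide


def acceleration_factor(ea_ev, t1_kelvin, t2_kelvin):
    """Ratio of the failure rate at t2_kelvin to the failure rate at t1_kelvin."""
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t1_kelvin - 1.0 / t2_kelvin)
    return math.exp(exponent)


# Values from the example: Ea = 0.6 eV, T1 = 303 K (30 C), T2 = 423 K (150 C)
af = acceleration_factor(0.6, 303.0, 423.0)
print(f"acceleration factor: {af:.0f}")
```

With these inputs the factor comes out at roughly 678. Because the dependence on $E_a$ is exponential, a different failure mechanism at the same two temperatures can give a very different factor.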

2.10 Failure rates of series and parallel systems
A series system is a system of which all components have to be operational in order for the system to be operational
Consider that the system consists of n components with reliability $R_i(t)$, then the reliability of the system, R(t), is:
    $R(t) = \prod_{i=1}^{n} R_i(t)$
It can be shown that
    $z(t) = \sum_{i=1}^{n} z_i(t)$
A parallel system is a system which is operational as long as one of its n components is operational.
The unreliability is:
    $F(t) = \prod_{i=1}^{n} F_i(t)$
The reliability is:
    $R(t) = 1 - \prod_{i=1}^{n} \left(1 - R_i(t)\right)$
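A small numeric illustration of series versus parallel reliability follows. The sketch assumes independent component failures, so that the system values are plain products; the function names and the 0.9 component reliabilities are illustrative.

```python
import math


def series_reliability(component_reliabilities):
    """All components must work: R = product of R_i."""
    return math.prod(component_reliabilities)


def parallel_reliability(component_reliabilities):
    """At least one component must work: R = 1 - product of (1 - R_i)."""
    unreliability = math.prod(1.0 - r for r in component_reliabilities)
    return 1.0 - unreliability


components = [0.9, 0.9, 0.9]  # hypothetical component reliabilities at some time t
print(series_reliability(components))    # about 0.729
print(parallel_reliability(components))  # about 0.999
```

Three components in series are less reliable than any single one of them, while the same three in parallel are considerably more reliable.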

2.11 Failure mechanisms
Failure mechanisms describe the physical and electrical causes for faults. They can be divided into 3 classes:
1. Electrical stress
   Poor design leading to electrical overstress, or careless handling causing static damage
2. Intrinsic failure mechanisms
   Inherent to the semiconductor material itself. Examples: Crystal defects, dislocations and processing defects
3. Extrinsic failure mechanisms
   Originate in the packaging and interconnection process. Examples: Poor bonding, corrosion, etc.

2.12 Failure mechanisms
Failure mechanisms per failure mechanism class:
Electrical stress
• Electrical overstress
• Electrostatic discharge
Intrinsic failure mechanisms
• Gate oxide breakdown
• Ionic contamination
• Surface charge spreading
• Charge effects
  – Slow trapping
  – Hot electrons
  – Secondary slow trapping
• Piping
• Dislocations
Extrinsic failure mechanisms
• Packaging
• Metallization
  – Corrosion
  – Electromigration
  – Contact migration
  – Microcracks
• Bonding (purple plague)
• Die attachment failure
• Particle contamination
• Radiation
  – External
  – Intrinsic

3.1 Classification of tests
A test is a procedure which allows one to distinguish between good and bad parts
Tests can be classified according to:
1. The technology they are designed for
2. The parameters they measure
3. The purpose for which the test results are used
4. The test application method

3.2 Technology aspects
The type of test depends heavily on the technology of the circuit to be tested:
1. Analog tests
   The domain of input and output signal values is analog; i.e., they can take on any value within a given range (Ex.: a range of 0–5 V)
   Analog tests aim at determining the values of analog parameters such as voltage and current levels, frequency response, bandwidth, etc.
   The generation of the input stimuli and the measurement of the responses are inherently imprecise. Therefore, a range of values is used to determine the operational correctness
2. Digital tests
   The input and output signals are digital (0 or 1); hence, precise. The tests are called logical or digital tests.
3. Mixed signal tests
   The domain of either the input or the output values is analog, while the other is digital. Typically used for testing digital-to-analog and analog-to-digital converters

3.3 Measured parameter aspects
The nature of the measured parameter can be:
1. Logical: Logical tests aim at detecting faults causing a change in the logical behavior of the system (e.g., a 0 is expected, while a 1 is measured)
2. Electrical: Electrical tests measure the values of electrical parameters (voltage and current levels) as well as their behavior over time; they can be divided into parametric and dynamic tests
   • Parametric tests are concerned with the external behavior of the circuit
     Ex.: Voltage & current levels & delays on the input & output pins
     – DC parametric tests are concerned with the time-independent properties of the input and output values
       IDDQ tests are a special class of DC parametric tests; they are concerned with the leakage currents during the quiescent state of the circuit
     – AC parametric tests are concerned with the time-dependent properties of the input and output values
   • Dynamic tests aim at faults which are time-dependent and internal to the chip

3.4 Purpose of test results
• The most obvious use of the test results is to distinguish between good and bad parts. This can be done with a test which detects faults.
• In case of repair, a test capable of locating faults is required.
• Testing can be done during normal use of the system; this is referred to as concurrent testing. For example, parity checking is a simple form of concurrent testing.
• Alternatively, non-concurrent tests cannot be performed during normal use of the system, because they do not preserve the application data. However, they usually have a higher fault detection capability.
• Design-for-Testability (DFT) includes extra circuitry on the to-be-tested chip; it allows non-concurrent tests to be performed faster and/or with a higher fault coverage.
• Built-In Self-Test (BIST) includes extra circuitry on the to-be-tested chip, to the extent that the complete test function can be performed on chip, without external tester support.
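Parity checking, mentioned above as a simple concurrent test, can be illustrated in a few lines. The sketch assumes even parity over a data word stored together with one parity bit; the function names and the example word are illustrative.

```python
def parity_bit(bits):
    """Even-parity bit: 1 if the word holds an odd number of 1s, else 0."""
    return sum(bits) % 2


def check_word(bits, stored_parity):
    """Return True if the word is consistent with its stored parity bit."""
    return parity_bit(bits) == stored_parity


data = [1, 0, 1, 1, 0, 0, 1, 0]
stored = parity_bit(data)  # computed when the word is written

single_flip = data.copy()
single_flip[3] ^= 1        # one bit error

double_flip = data.copy()
double_flip[3] ^= 1        # two bit errors
double_flip[5] ^= 1

print(check_word(data, stored))         # True: no error
print(check_word(single_flip, stored))  # False: single error detected
print(check_word(double_flip, stored))  # True: double error goes unnoticed
```

The check runs on every access without disturbing the application data, but it misses any even number of bit errors, which illustrates why concurrent tests tend to have a lower fault detection capability than non-concurrent ones.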

3.5 Test application methods
Tests can also be classified according to the way the test stimuli are applied and the test responses are evaluated
• External test: Automatic Test Equipment ‘ATE’ is used to apply the test stimuli and evaluate the test responses
  At the board level the stimuli can be applied:
  – Via the regular board connectors
    Allows for a simple interface with the ATE and for at-speed testing. However, not all circuits are easy to reach. Manual test program design is required; the resulting tests are called functional tests
  – Via a special fixture (set of connectors)
    That way each component’s pins become accessible. Structural tests, which can be generated automatically, can now be used.
• Internal test (BIST)
  The ATE function is completely integrated on the to-be-tested chip. This requires extra silicon area; however, no ATE is required and the chip can be tested at speed.

4.1 Fault coverage requirements (1)
Given a chip with potential defects, how extensive do the tests have to be? This question can be answered in terms of the chip’s defect level and the yield of the fabrication process.
• Defect Level ‘DL’ is the fraction of the parts passing all tests that are nevertheless bad
  – Values for DL are usually expressed in Parts Per Million ‘PPM’
• Process Yield ‘Y’ is the fraction of the manufactured parts that is fault free. The exact value is hard to establish. Therefore, Y is approximated as follows:
    $Y \approx \frac{\text{number of parts passing the tests}}{\text{total number of parts manufactured}}$
• Fault Coverage ‘FC’ is a measure of the quality of a test. It is defined as:
    $FC = \frac{\text{number of detected faults}}{\text{total number of faults}}$
In practice it is impossible to have a complete test (FC = 1), because of:
1. Imperfect fault modeling: An actual fault may not correspond with a modeled fault
2. Data dependency of faults (e.g., the carry function in an ALU)
3. Testability limitations (e.g., ATE pin and/or speed limitations)
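To make the notion of fault coverage concrete, the sketch below simulates single stuck-at faults on a 2-input AND gate and counts how many of them a given set of test vectors detects. It assumes FC is the number of detected faults divided by the total number of modeled faults; the gate, the line names and the function names are illustrative and not taken from the slides.

```python
LINES = ("a", "b", "y")  # inputs a, b and output y of the hypothetical AND gate


def and_gate(a, b, fault=None):
    """Evaluate y = a AND b, optionally with one stuck-at fault (line, value)."""
    if fault is not None:
        line, stuck_value = fault
        if line == "a":
            a = stuck_value
        elif line == "b":
            b = stuck_value
    y = a & b
    if fault is not None and fault[0] == "y":
        y = fault[1]
    return y


def fault_coverage(vectors):
    """Fraction of the six single stuck-at faults detected by the vectors."""
    faults = [(line, value) for line in LINES for value in (0, 1)]
    detected = {
        fault
        for fault in faults
        for (a, b) in vectors
        if and_gate(a, b, fault) != and_gate(a, b)  # response differs from good gate
    }
    return len(detected) / len(faults)


print(fault_coverage([(1, 1)]))                  # 0.5: only stuck-at-0 faults seen
print(fault_coverage([(1, 1), (0, 1), (1, 0)]))  # 1.0: all six faults detected
```

For a single gate a complete test is easy to find; the reasons listed above explain why this does not carry over to real chips.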

4.2 Fault coverage requirements (2)
Because tests may not be complete, a defective chip may pass the tests.
• Assume that a chip has exactly n Stuck-At Faults ‘SAFs’
  – A SA0 fault causes a 0 value on a line; a SA1 fault causes a 1 value
• Let m be the number of detected faults ($m \le n$)
• Assume that the probability of a fault is independent of the occurrence of another fault (i.e., there is no fault clustering) and that all faults are equally likely with probability p
• Assume that: A is the event that a part is free of defects, and B is the event that a part has been tested for m defects while none were found.
Then:
• The Fault Coverage of a test is defined as: $FC = m/n$
• The Process Yield is defined as: $Y = P(A) = (1-p)^n$
• $P(B) = (1-p)^m$
• $P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)}{P(B)} = (1-p)^{n-m} = Y^{1-FC}$
• DL can now be expressed as: $DL = 1 - P(A|B) = 1 - Y^{1-FC}$

4.3 Fault coverage requirements (3)
DL is expressed as (see figure):
    $DL = 1 - Y^{1-FC}$
For large values of Y (i.e., a manufacturing process with a high yield), DL as a function of FC approaches a straight line
Example: Assume a manufacturing process with Y = 0.5 and FC = 0.8, then:
    $DL = 1 - 0.5^{(1-0.8)} \approx 0.1294$
This means that about 12.9% of the shipped parts are defective!
If a DL = 200 PPM (i.e., DL = 0.0002) is required, given Y = 0.5, then:
    $FC = 1 - \frac{\ln(1 - DL)}{\ln Y} = 1 - \frac{\ln(0.9998)}{\ln(0.5)} \approx 0.99971$
This is an FC of 99.971%
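Both example calculations can be reproduced with the short sketch below. It assumes the relation $DL = 1 - Y^{1-FC}$ and its inverse $FC = 1 - \ln(1 - DL)/\ln Y$; the function names are illustrative.

```python
import math


def defect_level(process_yield, fault_coverage):
    """DL = 1 - Y**(1 - FC): fraction of shipped parts that are defective."""
    return 1.0 - process_yield ** (1.0 - fault_coverage)


def required_fault_coverage(process_yield, target_dl):
    """Solve DL = 1 - Y**(1 - FC) for FC, given a target defect level."""
    return 1.0 - math.log(1.0 - target_dl) / math.log(process_yield)


dl = defect_level(0.5, 0.8)
fc = required_fault_coverage(0.5, 200e-6)
print(f"DL = {dl:.4f}")         # defect level for Y = 0.5, FC = 0.8
print(f"FC = {100 * fc:.3f}%")  # coverage needed for DL = 200 PPM at Y = 0.5
```

The first result is a defect level of roughly 13%, the second a required fault coverage of about 99.971%, in line with the example.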

5.1 Test economics
Repair cost during the product phases
• A move from one product phase to the next causes the volume of parts and the test & repair cost to increase by a factor of 10
• This is the rule-of-ten
Economics and liability of testing. Good tests
• reduce test & repair cost (see above rule-of-ten)
• can reduce development time & time-to-market
• can reduce field maintenance costs
• reduce personal injury and lawsuits
There is an optimum in test development cost and its contribution to profit: Too many tests require a long test development time and a high test cost

5.2 Total profit
The lifetime of a product has several economic phases
• The development phase
  – Product design takes place
  – No income; only expenses
  – Area under zero-line is development cost
• The market growth phase
  – Market acceptance increases with time
• The market decline phase
  – Product becomes less attractive
  – Market share decreases
  – Price may have to be reduced
The total profit over the lifetime of a product is the area above the zero-line (revenue) minus the area below the zero-line (development cost)
In case of a delay ‘D’ in product development, the development cost is higher, while the revenue is reduced, because the obsolescence point will not change

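5.3 Product development delay cost
Assuming M is the maximum market growth, which is reached after time W, and a triangular market model with the obsolescence point at 2W, the revenue lost due to a delay D (hatched area) can be computed as follows:
• The Expected Revenue ‘ER’ is: $ER = \frac{1}{2} \cdot 2W \cdot M = W \cdot M$
• The Revenue of the Delayed Product ‘RDP’ is: $RDP = \frac{1}{2} \cdot (2W - D) \cdot \frac{W - D}{W} \cdot M$
• The Lost Revenue ‘LR’ is: $LR = ER - RDP = ER \cdot \frac{D(3W - D)}{2W^2}$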
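The lost revenue can be evaluated for concrete numbers. The sketch below assumes a triangular market model: the market grows linearly to its maximum M at time W and declines linearly to zero at the fixed obsolescence point 2W, while a product delayed by D grows at the same slope and therefore peaks at (1 − D/W)·M. This model, the function names and the example values are assumptions made for illustration.

```python
def expected_revenue(m, w):
    """Area of the on-time triangle: base 2W, height M."""
    return w * m


def delayed_revenue(m, w, d):
    """Area of the delayed triangle: base 2W - D, height M * (W - D) / W."""
    if not 0 <= d <= w:
        raise ValueError("delay must satisfy 0 <= D <= W")
    return 0.5 * (2 * w - d) * m * (w - d) / w


def lost_revenue(m, w, d):
    """LR = ER * D * (3W - D) / (2 * W**2)."""
    if not 0 <= d <= w:
        raise ValueError("delay must satisfy 0 <= D <= W")
    return expected_revenue(m, w) * d * (3 * w - d) / (2 * w * w)


# Hypothetical product: peak after W = 24 months, delay D = 3 months, M = 1.0
er = expected_revenue(1.0, 24)
lr = lost_revenue(1.0, 24, 3)
print(f"lost fraction of revenue: {lr / er:.1%}")
```

Under these assumptions a delay of one eighth of the growth period already costs close to 18% of the expected revenue, because the obsolescence point does not move.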

5.4 Life-cycle cost
The cost of a product over its lifetime consists of:
1. The design cost
   This typically is on the order of 5% of the product cost
2. The manufacturing cost
   This is the cost associated with the production and sales of the product
3. The maintenance cost
   The cost associated with repair, calibration, etc. This may be the largest cost factor
   Note: Product life can be 30 years; e.g., for a telephone exchange