Thurel Yves CERN for POCPA 2014 09 Brookhaven
Thurel Yves, CERN, for POCPA 2014-09 @ Brookhaven National Laboratory
Introduction to reliability modelling & survivability, towards prediction of accelerator-type power converter lifetime for maintenance optimization.
te-epc-lpc, 31/10/2020
Goal and motivation of this presentation
It would take 15 minutes to try to explain the title (if it means something sensible at all, which is far from certain). Better to illustrate the motivation behind this talk.
Does it remind you of something?
After 10 years with barely any failures on 300 converters of the same family, 5 failures occurred last year.
§ How many failures should we expect next year? Are they covered by spares?
§ Is it possible to reach the next long shutdown?
§ Perhaps you should check… something??
Does it remind you of something?
§ We spent 1 M$ upgrading 500 units last year, following your recommendations! Already 2 units have failed. Is that NORMAL??
§ God!!! Are these failures the famous “statistics” failures? How to be sure? What should I check on failed…
Does it remind you of something?
§ Believe me. These converters are the best we have designed so far. We have a new design team with great potential…
§ Be sure I will find a way to validate the reliability of your units, which I don’t trust at all!!!
§ OK, but how to do so?
Context of the presentation
For sure, we won’t answer all these questions in 20 minutes. Nevertheless, I will try to give you a taste of:
§ Some basic principles, supported by examples
§ What you need to conduct a reliability analysis using data from tests or from the field
§ What can be expected from such analyses:
  § Demonstrating reliability
  § Forecasting, predicting life…
Part 1/5
• Context
• “When the failures make life harder”
Context of the presentation
What do we usually understand (our standard background)?
§ The bathtub curve concept is OK (remember, we will all experience it…)
§ Let’s describe it from an accelerator engineer’s point of view
[Figure: bathtub curve, failure rate versus time, with three phases]
§ Initial failures: “Well, didn’t expect these initial failures, but this justifies the commissioning phase… at least.”
§ Useful product life (high reliability): “The operation is a success because we did a good job.”
§ End of life: “Oops!! We need a crash program… Call an expert!!!” “It will cost, since you called me too late!” This is the ultimately critical phase!!!
Part 2/5
• Understand distributions through Weibull plots and practical examples
Internet is your friend…
All the background is available on the web, so please refresh your mathematical knowledge by fetching the correct formulas and demonstrations from the right place.
As for me, I decided to illustrate some principles you can experience yourself at home or at your office!
Playing with distributions…
“Game 1”: Random-In-Time Failures
§ 8 units in operation (no replacement of failed units)
§ At each cycle, a die attached to each unit is rolled
§ Die result = 5 → unit death
§ What do we already know?
  § Each die result is correlated neither with the previous one nor with the next one.
  § The chance of getting a 5 on a fair die is 1/6.
§ What we perhaps don’t grasp as quickly:
  § The shape of the Cumulative Distribution Function which, because of the constant failure rate (die), follows an exponential law.
  § How many failures do we need to “see” this special function?
Playing with distributions…
Let’s go for real (I really used a die (not 8!) for this sequence!!)
[Figure: the 8 units under operation (rows 1 to 8) versus cycles (1 to 16)]
Playing with distributions…
Let’s compute with basic mathematics
§ Mean Time Between Failures: 39 / 8 = 4.875 cycles
Now trace the Cumulative Distribution Function (CDF)
§ CDF? Just sum the number of failures encountered… up to 8, or 100 %
Can you find the exponential law hidden in this graph???
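The die game can also be replayed numerically. The sketch below is only an illustration, not the procedure used for the slides: the function names are invented for this example, and it assumes a fair die and independent rolls, as stated for “Game 1”. Under these assumptions the lifetime in cycles follows a geometric law (the discrete counterpart of the exponential law), with CDF 1 - (5/6)^n and a theoretical mean of 6 cycles.

```python
import random

P_FAIL = 1 / 6  # chance of rolling a 5 on a fair die (assumption of Game 1)


def play_game(n_units=8, rng=None):
    """Roll one die per unit and per cycle; return the sorted lifetimes in cycles."""
    rng = rng or random.Random()
    lifetimes = []
    for _ in range(n_units):
        cycle = 1
        while rng.randint(1, 6) != 5:  # the unit survives this cycle
            cycle += 1
        lifetimes.append(cycle)  # cycle at which the unit died
    return sorted(lifetimes)


def empirical_cdf(lifetimes):
    """Cumulated fraction of dead units after each failure (lifetimes sorted)."""
    n = len(lifetimes)
    return [(t, (i + 1) / n) for i, t in enumerate(lifetimes)]


def theoretical_cdf(cycle):
    """Probability that a unit is dead after `cycle` cycles: 1 - (5/6)^cycle."""
    return 1 - (1 - P_FAIL) ** cycle


if __name__ == "__main__":
    lifetimes = play_game()
    print("mean life [cycles]:", sum(lifetimes) / len(lifetimes))
    for t, fraction in empirical_cdf(lifetimes):
        print(t, round(fraction, 3), round(theoretical_cdf(t), 3))
```

With only 8 units, the observed mean (4.875 cycles in the real game) scatters widely around the theoretical 6 cycles; rerunning `play_game` a few times shows how large that scatter is.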
Playing with distributions…
OK, we are among gentlemen, let’s go for a Weibull graph!!
How to create a Weibull plot?
§ Simply collect the failure dates
§ Enter them in dedicated software (here R + the Abrem libraries) and get the plot (bonus: 90 % confidence bounds)
Why use Weibull?
§ It is just another plot of the Cumulative Distribution Function, but displaying lines instead of curves, which makes goodness-of-fit evaluation a lot easier
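Why does a Weibull plot display lines? A short worked step, assuming the two-parameter Weibull CDF used throughout this talk (beta = shape, eta = characteristic life); the axis transformation shown is the standard one and is not spelled out on the slide itself:

```latex
F(t) = 1 - \exp\left[-\left(\frac{t}{\eta}\right)^{\beta}\right]
\quad\Longrightarrow\quad
\ln\bigl(-\ln(1 - F(t))\bigr) = \beta \ln t - \beta \ln \eta
```

Plotting ln(-ln(1 - F)) against ln(t) therefore gives a straight line of slope beta, which crosses the level F = 63.2 % at t = eta.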
Playing with distributions…
“Game 2”: End-Of-Life Failures
§ 8 units in operation (no replacement of failed units)
§ At each cycle, a paper clip is bent 360° on itself
§ Paper clip breaks = unit death
§ What do we already know?
  § For sure, failure is intrinsically and strongly linked to the cycle number, i.e. to the unit’s history (fatigue process)!
§ What we perhaps don’t grasp as quickly:
  § How many failures do we need to “see” this end-of-life function?
Playing with distributions…
Let’s go for real (I really did destroy 8 paper clips!!)
[Figure: the 8 units under operation (rows 1 to 8) versus cycles (1 to 11)]
Playing with distributions…
Let’s compute with basic mathematics
§ Mean Time Between Failures: 42 / 8 = 5.25 cycles (4.9 for the die test, not that far…)
Now trace the Cumulative Distribution Function
§ Failures “take time” to occur. In electronics, we call it an end-of-life failure mechanism (fans, electrolytic capacitors)
Playing with distributions…
[Figure: Weibull plots of the two games: Beta = 1.1 (die) and Beta = 2.3 (paper clips)]
Playing with distributions…
What we just demonstrated:
§ A low number of events (8 is not that high!!) still allows us to guess the nature of the acting process: random-in-time or end-of-life. It is possible to get a correct feeling with only 5 to 10 events!
§ Weibull plots are easy to work with (open source software even exists!), since they display lines for any of the distributions widely used in the reliability domain
§ Only 2 parameters (a line!) define a Weibull Cumulative Distribution Function, and both have a physical meaning:
  § Beta parameter: shape of the curve (its slope on the Weibull plot), which reveals its statistical nature
  § Eta parameter: gives the time at which 63 % of the total population is dead. Very close to the Mean Time To Failure.
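The two statements about eta can be checked numerically. This is a minimal sketch using the standard two-parameter Weibull formulas; the function names and the value eta = 1000 are chosen for illustration only.

```python
import math


def weibull_cdf(t, beta, eta):
    """Fraction of the population failed at time t (two-parameter Weibull)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def weibull_mttf(beta, eta):
    """Mean Time To Failure of a Weibull law: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)


if __name__ == "__main__":
    eta = 1000.0  # hypothetical characteristic life, any time unit
    for beta in (0.5, 1.0, 2.3, 4.0):
        dead_at_eta = weibull_cdf(eta, beta, eta)
        ratio = weibull_mttf(beta, eta) / eta
        print(f"beta={beta}: F(eta)={dead_at_eta:.3f}  MTTF/eta={ratio:.3f}")
```

Whatever the beta, F(eta) = 1 - 1/e, i.e. about 63 %. The MTTF equals eta for beta = 1 and stays within roughly 12 % of eta for beta between 1 and 4, but for beta = 0.5 it is twice eta, so “very close” holds for random and wear-out shapes rather than for infant-mortality ones.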
Playing with distributions…
How to use this graph?
Always put the observed events in perspective with the actual population, which sets the weight of each failure.
§ The die curve gives 12.5 % of the population expected to fail at the 1st cycle.
§ 12.5 % = 1 failure only (1 / 8 = 12.5 %)
§ In our game, we actually got 2 events at cycle 1.
[Figure: Weibull plot of the die game, 12.5 % marked at the 1st cycle]
Playing with distributions…
[Figure: Weibull plot with a Beta = 1 line and a Beta = 3 line; marked values: 2 %, 10 %, 20 hours, 180 hours]
t = time (or cycles), N = total number of units, r = failed units
Playing with distributions…
Going even further
§ It is not necessary to wait until the entire population has died to get a feeling for the distribution in action
§ This feature is particularly interesting for forecasting and for planning a crash program
§ Already 15 % of the population has died; what about the remaining 85 %?
§ I have some ideas for the remaining 85 % of units… better contact the engineer… soon!
Playing with distributions…
Cumulative Distribution Function? Why is it used so often?
§ Compare the same data presented differently: make your choice!
§ The random-in-time function, i.e. 1 - exp(-t/eta), clearly appears on the cumulative plot (with 8 points only)!
[Figure: two plots of the same data, occurrences versus cycles]
Part 3/5
• What data should be used for reliability analysis from the field?
Looking for Gold data
Time to failure
§ All of us use and populate large and complex databases which, for sure, contain the “date of the failure”, but what is really needed is the real “Time to Failure”, which means:
  § Counting all the hours during initial testing and commissioning
  § Not counting the time during which the unit is OFF, or in a state where no stress is applied (think about a high-power converter being driven by an auxiliary power supply: they certainly don’t share the same duration of operation)
Importance of the “Time To Failure”
Let’s consider a family of 10 converters in operation.
§ Each converter is immediately replaced if it fails.
§ Our maintenance service analyses faults every year and alerts us if something “unusual” happens.
§ We have a good database, with the date of each failure / replacement
§ We have plenty of spare converters from the same initial batch (production of, let’s say, 50 units)
§ An initial population of 50 units will give 5 infant populations keeping the same distribution property.
[Figure: the 10 converters in operation (0 to 9) and the initial batch of 50 units]
Importance of the “Time To Failure”
[Figure: failures of the 10 converters (0 to 9) plotted year by year, Year 01 to Year 11, against the “normal” yearly level of failures]
§ 4 failures a year seems to be the “normal” rate, a sign of safe conditions. WRONG!!!
§ Not at all: the converters die in a wear-out way, but you cannot see it year by year!!
Importance of the “Time To Failure”
Optimized / wise approach:
§ Following a Time To Failure analysis (and not a yearly one), it was determined that the distribution actually indicated an end-of-life failure type as the main failure mode.
§ It was then decided to define a “rated lifetime”, and to exchange/update the converters before they fail, as soon as they reach their rated lifetime.
[Figure: converters 0 to 9 with the deduced rated lifetime]
§ If we replace them before they die, it will consume more spare parts!!!
§ Correct, but we can then refresh them before they crash, and we will save a lot of operation stops!
Time To Failure of replaced units?
Replaced units case
§ We often use spares to replace failed units. How to handle it? The population to be considered becomes the entire population of units that have been used, not only the ones in operation.
§ Simply re-align all the units suffering from the same failure mode on a common start time t0, still accounting for the time accumulated by the still-alive units (they become suspended units)
§ There is much more information in faulty units than in live ones!
[Figure: “TO DO” timeline from t0 to t_analyze; suspended units = units still alive when performing the analysis]
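One common way to place failures on a Weibull plot when suspended units are present is the adjusted-rank method described in Weibull textbooks. The sketch below is an assumption about how such data can be handled, not necessarily what the software used for the slides does; the function name and the sample data are invented for illustration.

```python
def adjusted_median_ranks(records):
    """records: list of (time, failed); failed is False for a suspended unit.

    Returns [(time, adjusted_rank, median_rank)] for the failures only.
    """
    n = len(records)
    ordered = sorted(records, key=lambda record: record[0])
    results = []
    previous_rank = 0.0
    for position, (time, failed) in enumerate(ordered, start=1):
        if not failed:
            continue  # a suspension gets no rank, but it shifts the later failures
        reverse_rank = n - position + 1
        rank = (reverse_rank * previous_rank + n + 1) / (reverse_rank + 1)
        median_rank = (rank - 0.3) / (n + 0.4)  # Bernard's approximation
        results.append((time, rank, median_rank))
        previous_rank = rank
    return results


if __name__ == "__main__":
    # Hypothetical times already re-aligned on t0: 5 failures, 3 suspended units.
    sample = [(10, False), (30, True), (45, False), (49, True),
              (82, True), (90, True), (96, True), (100, False)]
    for time, rank, median_rank in adjusted_median_ranks(sample):
        print(time, round(rank, 4), round(median_rank, 3))
```

The suspended units never appear as points on the plot, but each of them pushes the following failures to a higher rank, which is how the time they accumulated is taken into account.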
Time To Failure of replaced units?
Repaired units case
§ Things become more complex… with repair.
§ If the repair is transparent with respect to a failure mode not yet encountered on the unit, the Time To Failure remains the time of real/complete operation of the unit, even once it is back in operation.
§ For a given failure mode, a repaired unit can for sure no longer be considered together with the others.
§ For a better prediction capability on failures, I propose that we no longer repair the converters. We then have to buy 50 % of spares instead of the usual 5 to 10 %.
§ Are you serious?? Do you want to know a very likely prediction of your…
Looking for Gold data
Failure Mode
§ All of us use and populate large and complex databases, with for sure many details entered by the team solving the issue, but what is really needed is the real “Reason For Failure”, which means:
  § Indicating how the failure occurred, and the circumstances, but…
  § …also digging into the failure to obtain the exact failure mode: capacitor exploded? Diode in short? Water leak? Etc.
Importance of the “Failure Mode”
Statement 3-a:
§ Let’s illustrate what can be expected from failure analysis.
Family 1: 400 converters running 10 000 hours (failed unit replaced). Failures recorded in 2013 and 2014:
Failure N° | Time To Failure [hours] | Failure Mode
1  | 100   | Bad solder on control card
2  | 230   | Bolt not correctly tightened
3  | 400   | Auxiliary power supply
4  | 700   | Mouse entered converter
5  | 1 200 | Bad command
6  | 1 600 | Dead IGBT on power side
7  | 2 300 | Water leak on top of converter
8  | 5 000 | Output capacitor exploded
9  | 8 900 | Main busbar insulation lost
10 | 9 500 | Fans stopped
§ Medium-quality converter! MTBF is 400 000 hours (10 failures over 400 units running 10 000 hours).
§ In reality, not a bad converter, with some infant faults. Don’t spend time trying to analyze the faults too…
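The MTBF figure and the “infant faults” remark can be checked with a few lines of arithmetic. This is only a sketch built on the numbers of the table; the helper names and the choice of the first 1 000 hours as the “early” window are assumptions made for illustration.

```python
UNITS_IN_SERVICE = 400   # failed units are replaced, so 400 units always run
HOURS_OBSERVED = 10_000
TIMES_TO_FAILURE = [100, 230, 400, 700, 1200, 1600, 2300, 5000, 8900, 9500]


def mtbf(units, hours, n_failures):
    """Cumulated operating hours divided by the number of failures."""
    return units * hours / n_failures


def failure_rate(times, start, end, units):
    """Failures per unit-hour observed in the window ]start, end]."""
    n_failures = sum(1 for t in times if start < t <= end)
    return n_failures / (units * (end - start))


if __name__ == "__main__":
    print("MTBF [h]:", mtbf(UNITS_IN_SERVICE, HOURS_OBSERVED, len(TIMES_TO_FAILURE)))
    early = failure_rate(TIMES_TO_FAILURE, 0, 1_000, UNITS_IN_SERVICE)
    late = failure_rate(TIMES_TO_FAILURE, 1_000, HOURS_OBSERVED, UNITS_IN_SERVICE)
    print("early / late failure rate:", early / late)
```

A single MTBF number hides the fact that the failure rate during the first 1 000 hours is several times higher than afterwards, which is the signature of infant failures rather than of a constant-rate process.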
Trap: Ignoring different Failure Modes
Analysis ignoring 2 different failure modes = Danger!!!
§ Which risk? Predicting that 63 % of the population will have failed at t = 22 000 hours, instead of t = 4 300 hours in reality!!!! (Imagine the consequence!!! Crash program planned 2 years late!)
[Figure: Weibull plot. A single line fits all the points nicely, but it hides 2 failure modes; the hidden 2nd failure mode is wear-out]
§ Too fast: “D’oh! Nice fit!! No hurry: 63 % will fail at 22 000 hours.” Nice fit, but wrong: 2 modes.
§ Wear-out will dominate!! My guess: 63 % will fail at 4 300 hours.
Different Failure modes?
Different failure modes: which Time To Failure to use?
§ Units that failed from Failure Mode (a) become suspended units for the analysis of Failure Mode (b).
§ Treat each failure mode separately and get a model of each one. Then combine the models into the global cumulative distribution.
[Figure: “TO DO” timeline from t0 to t_analyze, each unit marked (a) or (b) according to the failure mode that killed it]
§ Dead units: died from Failure Mode (a), but still suspended units for Failure Mode (b). Dead twice!! Is it the Schrödinger’s cat case? Dead and alive?
§ Suspended units: still alive when performing the analysis
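The two steps (per-mode data preparation, then recombination) can be sketched as follows. This is an illustration under the assumption that the failure modes act independently, so that a unit survives only if it survives every mode; the helper names and all numerical values are hypothetical.

```python
import math


def records_for_mode(units, mode):
    """units: list of (time, cause), cause being None for a still-alive unit.

    Returns (time, failed) records for one mode: units killed by another
    mode, or still alive, count as suspended units.
    """
    return [(time, cause == mode) for time, cause in units]


def weibull_reliability(t, beta, eta):
    """Fraction of units surviving one failure mode up to time t."""
    return math.exp(-((t / eta) ** beta))


def combined_cdf(t, modes):
    """modes: iterable of (beta, eta), one pair per independent failure mode."""
    survival = 1.0
    for beta, eta in modes:
        survival *= weibull_reliability(t, beta, eta)
    return 1.0 - survival


if __name__ == "__main__":
    units = [(500, "a"), (1200, "b"), (2000, None), (2600, "b")]  # hypothetical
    print(records_for_mode(units, "b"))
    modes = [(1.0, 20_000.0), (3.0, 5_000.0)]  # hypothetical random + wear-out
    for t in (1_000, 3_000, 5_000):
        print(t, round(combined_cdf(t, modes), 4))
```

At early times the random mode dominates the combined curve, while the wear-out mode takes over later, which is why a single line fitted through both can be badly wrong about the future.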
Looking for Gold data
Stress Level
§ This is an extra parameter, which will refine and help to understand the Weibull plot when it is not trivial (sum of lines): the real “Stress level”
Note:
§ In our accelerators, the operating temperature is generally known and controlled or, at least, does not reach extreme ranges (compare with the car industry!)
§ The level of stress (current, voltage, AC mains, cycles) is normally perfectly known per unit!!!
• This is one of the reasons we should invest in reliability analysis, since we know very well the stress level on our units in operation.
Part 4/5
• Practical Cases
Practical things? That means photos? Cool!
Example N° 1: Demonstrating Reliability
113 Power Supply units (same reference) delivered
§ A PSU-LAB is in charge of the qualification / reception
§ 2 different tests organized to check the units’ reliability:
  § 1st test: all units running for 24 hours (nominal conditions)
  § 2nd test: 5 units running until failure, or for up to 6 months.
[Photo: test bench with up to 10 units under test conditions and test duration counters (1 per unit under test)]
Example N° 1: Demonstrating Reliability
Test phase 1
§ All 113 units tested at full power for 20 hours on average (range: [9; 120] hours)
§ RESULT: no failure encountered
  • This test required a lot of work (in reality, 5 units tested at once)
  • Total cumulated running time = 2 265 hours
§ Remember the Weibull graph and the distribution “line” cases
§ The result obtained is of the exclusion type, and is relatively poor.
[Figure: Weibull plot. The test point was reached without any failure; the distributions beyond it are still possible. For sure, the field of potential futures is still wide open]
Example N° 1: Demonstrating Reliability
Test phase 1: can we believe that math approach??
§ OK, let’s simplify, and now take 100 units running 20 hours each (which is not exactly our case). We arrive at this result from the equation given below the table:
Assumed Beta (distribution shape) | Confidence Level % | Minimum Deduced Eta [hours] (eta = f(beta, conf. level)) | Comments
0.5 | 90 | 37 722 | Early infant failure distribution type
1.0 | 90 | 868    | Random failure distribution type
2.0 | 90 | 131    | End-of-life failure distribution type
4.0 | 90 | 51     | End-of-life failure distribution type
§ We know that a Weibull distribution is defined by:
  • F(t) = 1 - EXP[-((t/eta)^beta)], F(t) being the fraction failing up to time t
§ At time 20 hours, each distribution gives the same result, i.e. about 2 %:
  • 0.02 ≈ 1 - EXP(-((20/37 722)^0.5)) ≈ 1 - EXP(-((20/868)^1)) ≈ 1 - EXP(-((20/131)^2)) …
  • In other words, with any of these distributions, testing 100 units for 20 hours each should statistically produce about 2 failures (2 % of 100 units = 2 units)
  • There is just mathematics behind these possibilities, nothing more.
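The four eta values of the table can be recomputed with the zero-failure Weibayes bound quoted later in the talk, written here with a logarithm instead of CHIINV (the two forms are equivalent for 2 degrees of freedom). This is a sketch; the function name is invented for the example, and beta is assumed to be known in advance, as in the table.

```python
import math


def weibayes_min_eta(durations, beta, confidence):
    """Lower bound on eta after a test without any failure, for an assumed beta."""
    total = sum(t ** beta for t in durations)
    return (total / -math.log(1.0 - confidence)) ** (1.0 / beta)


if __name__ == "__main__":
    durations = [20.0] * 100  # simplified case: 100 units, 20 hours each
    for beta in (0.5, 1.0, 2.0, 4.0):
        eta = weibayes_min_eta(durations, beta, 0.90)
        failed_at_20h = 1.0 - math.exp(-((20.0 / eta) ** beta))
        print(f"beta={beta}: eta >= {eta:.1f} h, F(20 h) = {failed_at_20h:.4f}")
```

All four lines give the same F(20 h), close to 2.3 %, which is exactly the fraction for which the chance of seeing no failure among 100 units is 10 %; this is where the 90 % confidence level comes from.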
Example N° 1: Demonstrating Reliability
Test phase 1: a last comment on this 1st phase example
Because of week-ends, and some longer runs for some pieces, the real test duration distribution is shown below:
[Figure: histogram, number of units versus duration [hours], from 9 to 120 hours]
Does it make a big difference? Yes it does, remember the real time to failure!
Assumed Beta (distribution shape) | Confidence Level % | “Real data” minimum deduced Eta [hours] | “Mean data” minimum deduced Eta [hours]
0.5 | 90 | 39 700 | 48 274
1.0 | 90 | 984 (same for both) | 984 (same for both)
2.0 | 90 | 221    | 140
4.0 | 90 | 143    | 53
Precise data gives precise results: only a few long runs (8) significantly impact the high-beta result (beta = 4.0: eta = 143 vs 53)
Example N° 1: Demonstrating Reliability
Test phase 2
§ Only 5 of the 113 units were tested in a 5-month long run
§ RESULT: all failed, at these times [hours]: [960, 1560, 1944, 2976, 4728]
§ Beta = 1.729
§ Eta = 2792 hours
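These two parameters can be approximately recovered from the five failure times with a few lines of code. The slide does not say which fitting method the software used; the sketch below assumes a median rank regression of ln(t) on the plotting position, with Bernard's approximation for the median ranks, and the function name is invented for the example.

```python
import math


def fit_weibull_mrr(failure_times):
    """Median rank regression (X on Y) for a complete sample, no suspended unit."""
    times = sorted(failure_times)
    n = len(times)
    xs = [math.log(t) for t in times]
    ys = []
    for i in range(1, n + 1):
        median_rank = (i - 0.3) / (n + 0.4)  # Bernard's approximation
        ys.append(math.log(-math.log(1.0 - median_rank)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    slope = s_xy / s_yy  # ln(t) regressed on the plotting position
    beta = 1.0 / slope
    eta = math.exp(x_mean - slope * y_mean)  # time at which the line reaches 63.2 %
    return beta, eta


if __name__ == "__main__":
    beta, eta = fit_weibull_mrr([960, 1560, 1944, 2976, 4728])
    print(f"beta = {beta:.3f}, eta = {eta:.0f} hours")
```

Under these assumptions the result lands very close to the values of the slide; the small remaining difference on beta is plausibly due to exact median ranks being used instead of Bernard's approximation.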
Example N° 1: Demonstrating Reliability
OK, let’s summarize the results of this entrance test:
§ Test 1 (average duration of 20 hours for 113 units):
  § All our units seem operational (no immediate failure), at least…
§ Test 2 (5 units tested through a long run):
  § Failure distribution = Weibull distribution [beta = 1.729 & eta = 2792 hours]*
  * 72 units (63 % of 113) would be dead within [75; 208] days @ 90 % confidence
Next?
§ All 113 units were sent back to the manufacturer and retested 6 months later, once upgraded
§ A 0-fail test plan was put in place on 10 units, to ensure that the discovered failure mode had indeed disappeared. No failure shall be encountered when testing n units, each for a duration of:
  failure.mode.eta ⋅ ((-1/nb.units) ⋅ LN(1 - confidence.level))^(1/failure.mode.beta)
§ Minimum duration for 0-fail @ 90 % = 1 194 hours, but we tested them for 3 400 hours
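The 1 194 hours can be recomputed directly from the test-plan formula. A minimal sketch, with an invented function name, using the beta and eta found in test phase 2 and assuming that the old failure mode, if still present, would follow that same Weibull law:

```python
import math


def zero_failure_test_duration(eta, beta, n_units, confidence):
    """Duration each of n_units must survive to exclude the (beta, eta) mode.

    If the mode were still present, the chance that all units survive this
    long would be only 1 - confidence.
    """
    return eta * ((-1.0 / n_units) * math.log(1.0 - confidence)) ** (1.0 / beta)


if __name__ == "__main__":
    duration = zero_failure_test_duration(eta=2792.0, beta=1.729, n_units=10, confidence=0.90)
    print(f"minimum 0-fail duration: {duration:.0f} hours per unit")
```

Testing fewer units is possible, but the required duration per unit then grows as (1/n)^(1/beta), so the trade-off between bench space and calendar time depends on the beta of the mode to be excluded.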
Illustrated conclusion
Let’s plot the 0-failure result.
§ Even if testing only 10 units makes the 1st failure weigh more in the distribution (10 %), the test duration was long, and the exclusion of steep-beta Cumulative Distribution Functions becomes possible.
[Figure: Weibull plot. The point at 3 400 hours was reached without any failure (34 000 hours in total; 1 failure over 10 tested units = 10 %); the distributions beyond it are still possible]
§ Compare with 100 units running 340 hours! (1 failure over 100 tested units would represent 1 %, but the exclusion of high beta is very poor)
Example N° 1: Demonstrating Reliability
Conclusion of this 0-fail test
§ 10 units were used for the 0-fail plan, with success.
§ The updated conclusion can be expressed as:
Assumed Beta (distribution shape) | Confidence Level % | Minimum Deduced Eta [hours] (eta = f(beta, conf. level)) | Comments
0.5 | 90 | 63 751 | Early infant failure distribution type
1.0 | 90 | 14 679 | Random failure distribution type
2.0 | 90 | 7 044  | End-of-life failure distribution type
4.0 | 90 | 4 879  | End-of-life failure distribution type
How to obtain these possible distributions?
• Using mathematical formulas (you don’t need to fully understand them)
• The Weibayes xls formula is the following (when no failure occurred):
  eta = (2 / CHIINV((1 - confidence.level), 2))^(1/beta) ⋅ [∑ (duration_n^beta)]^(1/beta)
But what we gained is that we didn’t pollute our accelerator complex with initially unreliable devices… which would have been difficult to retrieve afterwards (thousands of these are used at CERN)
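For readers without a spreadsheet at hand, the CHIINV term of the Weibayes formula can be replaced by a logarithm. The step below relies on a standard property of the chi-square law with 2 degrees of freedom; C stands for the confidence level and t_n for the duration of unit n (symbols chosen here for compactness):

```latex
\mathrm{CHIINV}(p,\,2) = -2\ln p
\quad\Longrightarrow\quad
\frac{2}{\mathrm{CHIINV}(1 - C,\,2)} = \frac{1}{-\ln(1 - C)}

\eta_{\min} = \left[\frac{\sum_{n} t_n^{\beta}}{-\ln(1 - C)}\right]^{1/\beta}
```

As a rough check, 10 units at exactly 3 400 hours each with beta = 1 and C = 0.90 give 34 000 / 2.3026 ≈ 14 766 hours, within 1 % of the 14 679 hours of the table; the small gap presumably comes from real durations slightly different from 3 400 hours.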
Example N° 2: Urgent forecasting!!
2nd case: after 3-4 years without any issues, we encountered several diodes failing in short circuit on the same converter family.
Example N° 2: Urgent forecasting!!
Weibull analysis: the death plot, or the costly plot
§ An ultra-high beta appears (3.5!). It is difficult to find real cases with beta > 4…
§ The decision to launch the replacement of 10 000 diodes on power converters installed in the LHC galleries (100 m below ground level) was taken based on this forecasting plot!!!
§ Already 15 % of the population has really died, 4 months before CERN LS1. Will we survive with spares?
Example N° 2: Urgent forecasting!!
Weibull analysis: the result seen on a standard time plot
§ The future does not seem encouraging: we are entering the wear-out phase.
§ Thanks to this analysis, and despite the huge pressure on our necks, we got the assurance (@ 90 % confidence) that we would survive with our spare stock.
[Figure: cumulated failures versus time, with these markers]
§ 2012-07-01: DAI on 2000 new diodes
§ 2012-12-01, t = 1096: end of modelling
§ 2012-12-11, t = 1108: +1.7 events expected vs end of model
§ 2013-02-15, t = 1172 (shutdown removed): +6 events expected vs end of model
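A forecast such as “+6 events expected” can be obtained from a fitted Weibull law by asking how many of the units still alive today are expected to fail before a future date. The sketch below shows the principle only: the function names are invented, all units are assumed to have the same age, and the population size and eta are hypothetical values, since the fitted parameters of the diode case are not given here (only beta = 3.5 is quoted).

```python
import math


def weibull_cdf(t, beta, eta):
    """Fraction of the population failed at time t."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def expected_new_failures(n_alive, t_now, t_future, beta, eta):
    """Expected failures between t_now and t_future among units alive at t_now."""
    failed_now = weibull_cdf(t_now, beta, eta)
    failed_future = weibull_cdf(t_future, beta, eta)
    conditional = (failed_future - failed_now) / (1.0 - failed_now)
    return n_alive * conditional


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    forecast = expected_new_failures(n_alive=1000, t_now=1096, t_future=1172,
                                     beta=3.5, eta=3000.0)
    print(f"expected additional failures: {forecast:.1f}")
```

Comparing such a forecast, taken at its upper confidence bound, with the number of spares available is what turns the Weibull plot into a go / no-go decision.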
Thurel Yves, CERN, CAS Pwr Conv. 2014
Conclusion
Are you still with me?
Conclusion
Well, I really hope that you enjoyed the practical examples and small demonstrations I shared with you.*
The best conclusion for an introduction is to give you some possible next steps:
§ A course on reliability is for sure required before playing with database numbers, and I would recommend this expert:
  § Chet Haibel
§ You can create your 1st Weibull plot with open source software:
  § R + the adequate Abrem libraries
§ I would really recommend this book: “The New Weibull Handbook”
* To be honest, I am not 100 % confident that everything is correct in this presentation from a statistician’s point of view. If an expert is in the audience, please remember I claim a right to make mistake(s) ;-)
And finally
Thank you for your attention!
This presentation: EDMS N° 1323578
Spare Slides (Cumulative Distribution Function)
Weibull plots: lines representing complex distributions
§ Compare these distributions and their Weibull representations
Spare Slides (Probability Density Function)
Weibull plots: lines representing complex distributions
§ Compare these distributions and their Weibull representations