MAP Project Reliability Program Overview

Michael Bay, MAP System Engineer
Michael.Bay@gsfc.nasa.gov
Jackson and Tull Aerospace Engineering Division

How FMEA, FTA, RBD and PRAs were used on MAP, and how they fit into overall mission success in the Faster Better Cheaper environment.

9 August 2000, with updates September 2000
Current Environment
How to Do It?
Understand:
– What makes a Mission Successful?
  • Proper Execution of the Basics in Engineering and Project Management
  • Attention to Detail
  • Appropriate Discipline and Rigor
  • Test, and Retest after changes
– What makes a Mission Unsuccessful?
  • Recent failures have not been due to the unexpected behavior of a breakthrough new technology
  • Recent failures have been due to missing an important detail at more than one level of assembly or test
  • Most of the recent failures would not have been caught by classical FMEAs, FTAs or PRAs alone. However, it is the mindset, the systems thinking, and all the questions asked in the process that would surface the issues.
– The Devil is in the Details
  • Expect the Unexpected
  • Do not become complacent
  • Beware of inappropriate application of “Heritage” and assumptions that do not fit
How to Do More with Less
• Do not forget the basics of what goes into a Successful Mission
• Understand the Risks in both the Flight/Ground Segments and Project Execution
  – Important Distinction for Risk Management
  – Flight/Ground Segments are the end products performing their desired function in their operational environment
    • FMEA, FTA, RBD and PRAs are good tools here
  – Project Execution is the ability to deliver the desired product meeting requirements, on time and within cost
    • Classical FMEA, FTA and RBD are not usually applied here, although the techniques could be
    • PRAs are appropriate here
Orchestrating a Balance
Risk Management
• Risk is the Uncertainty of Performance, Reliability, Cost or Schedule
  – To quantify risk, look at the likelihood and consequence of an event
• Risk Management
  – What Can Go Wrong?
  – How Will We Know Something Has Gone Wrong?
  – When Will We Know that Something Has Gone Wrong?
  – What Will We Do About It?
  – Expect the Unexpected so that The Unexpected Becomes Expected
• These Questions, Asked Globally Every Day from Design through Manufacturing, Test and Operations, will do the most to Assure Mission Success
• Recent Red Team Activities ask these types of questions
  – Try to quantify Risk
  – Attempt to identify if anything was missed in the basics
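A minimal sketch of quantifying risk from likelihood and consequence, as the slide suggests. The `Risk` class, the 1 to 5 consequence scale, the product used as an "exposure" score, and all entries are assumptions made for illustration; the slides do not give a scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: float   # assumed probability of the event, 0.0 to 1.0
    consequence: int    # assumed severity score, 1 (minor) to 5 (mission loss)

    def exposure(self) -> float:
        """Likelihood multiplied by consequence, one common way to rank risks."""
        return self.likelihood * self.consequence


def rank_risks(risks):
    """Return risks ordered from highest to lowest exposure."""
    return sorted(risks, key=lambda r: r.exposure(), reverse=True)


# Hypothetical entries, for illustration only.
risks = [
    Risk("late delivery of a unit", likelihood=0.30, consequence=2),
    Risk("short on a power line", likelihood=0.05, consequence=5),
    Risk("minor telemetry dropout", likelihood=0.10, consequence=1),
]
ranked = rank_risks(risks)
for r in ranked:
    print(f"{r.name}: exposure {r.exposure():.2f}")
```

A single product score hides the difference between a likely nuisance and an unlikely mission loss, so a real assessment would usually keep both numbers visible alongside the ranking.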
MAP Observatory: Microwave Anisotropy Probe
• Map the Cosmic Microwave Background Radiation
• Follow-on to COBE with 50 times the resolution
• Medium Size Explorer, MIDEX
• Operate at L2, store and forward data every day
• 3-axis, scan sky at 1 revolution every 2 minutes
• 840 kg estimate, approx. 3.6 meters tall, 5.1 across
• 400 Watts, 72 kg fuel
MAP Reliability Philosophy
• Maximize Science Return for given Cost and Schedule
• AO Direction
  – Due to cost limits, systems are predominantly nonredundant or “single string”
  – Selective redundancy encouraged where resources allow
  – Redundancy is up to each mission and the PI
• Reliability Designed in from the Beginning
  – MAP Assurance Requirements cover the Total Program:
    • Design with proper parts application
    • Grade 3 Parts program selected as best value for the MAP Program
    • Workmanship Inspection program to NHB 5300 or equivalent
  – A Peer Reviewed, Simple and Robust Design providing graceful degradation
  – Failure Modes and Effects Analysis, Fault Tree Analysis, Reliability Predictions and Probabilistic Risk Assessment used to identify mission-ending failures; designs adjusted where possible to shift the effect from “mission ending” to “degraded mission”
  – Test program accumulating significant mission-specific test time
  – Constant drive to identify and strengthen “weak links” to mission success
Identify Weak Links
The Basics, Launch Readiness Flow
Reliability Analysis Flow
Reliability Improvement Approach
• Identify Weak Links
  – Estimate failure rates for each subsystem element (component or card)
    • Compute failure rates using MIL-HDBK-217 techniques
    • Collect measured failure rates from flight, life test, or vendor data
    • Average failure rates where multiple sources exist
  – Evaluate effects/consequence of failure, revisit Failure Modes and Effects Analysis
• Evaluate possible mitigation approaches
  – Total Redundancy
  – Minimal “point design” hardware to augment existing system
  – “Back door” paths to allow backup functions
• Compute System Reliability improvement for each mitigation approach
• Study resources necessary to implement each mitigation approach
  – Mass, Power, Cost, Parts Availability, Schedule, ability to “descope”, Manpower
• Select mitigation approaches to maximize efficiency of total program
  – Total System Reliability Improvement vs Required Resources
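The approach above can be sketched numerically. The example below averages failure-rate estimates per element, computes a single-string system reliability, and ranks candidate mitigations by reliability gained per unit of resource. Several things here are assumptions and not from the slides: the constant-failure-rate model R = exp(-λt), treating "total redundancy" as two independent identical units, mass as the only resource considered, and every number and element name.

```python
import math

MISSION_HOURS = 20_000.0  # hypothetical mission duration


def average_rate(rates):
    """Average failure-rate estimates where multiple sources exist."""
    return sum(rates) / len(rates)


def reliability(rate_per_hour, hours=MISSION_HOURS):
    """Constant-failure-rate model: R = exp(-lambda * t)."""
    return math.exp(-rate_per_hour * hours)


def redundant_pair(r):
    """Two independent identical units, either one sufficient."""
    return 1.0 - (1.0 - r) ** 2


def system_reliability(element_r):
    """Series (single string) system: every element must work."""
    return math.prod(element_r.values())


# Hypothetical failure-rate estimates (failures/hour) from multiple sources.
estimates = {
    "element_a": [2e-6, 4e-6],
    "element_b": [1e-6],
    "element_c": [5e-6, 5e-6, 8e-6],
}
# Hypothetical resource cost (kg) of making each element redundant.
mass_cost_kg = {"element_a": 4.0, "element_b": 2.0, "element_c": 6.0}

element_r = {name: reliability(average_rate(r)) for name, r in estimates.items()}
baseline = system_reliability(element_r)

options = []
for name, r in element_r.items():
    improved = dict(element_r, **{name: redundant_pair(r)})
    gain = system_reliability(improved) - baseline
    options.append((name, gain, gain / mass_cost_kg[name]))

# Rank mitigations by reliability gained per unit of resource spent.
options.sort(key=lambda o: o[2], reverse=True)
for name, gain, efficiency in options:
    print(f"{name}: gain {gain:.4f}, gain per kg {efficiency:.5f}")
```

In practice the resource side would combine mass, power, cost, schedule and the other items listed on the slide; a single scalar is used here only to keep the ranking step visible.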
Probabilistic Risk Assessment
Reliability Process Overview
• Design
  – FMEA to distinguish mission-threatening failures from mission degradation
  – Revise designs to convert “loss of mission” failures into “loss of function” or mission degradation
  – Reliability failure rate analysis used to weigh the relative benefit of one design implementation versus another
  – Verification of proper parts application in design
  – Peer Review Process
• Manufacturing
  – Workmanship Inspection to verify as-built hardware meets designer’s intent
  – Materials and Process Control
• Testing
  – Verify as-built hardware meets designer’s requirements in the intended application
  – Sufficient test time to find infant mortality failures
• Operations
  – Onboard Fault Detection and Correction to safe the spacecraft, giving the ground time to react and potentially recover from an anomaly
  – Operational Contingency Procedures and Backup Plans for mission critical and recoverable failures
• Reliability Philosophy and MAP Mission Assurance Requirements communicated to MAP Hardware Suppliers (very important to assure a supplier is not a weak link)
Reliability Process: Design and Analysis Phase
1. Perform System Level FMEA and FTA to determine failures that result in mission loss versus mission degradation.
2. Adjust design or implementation such that failures categorized as mission loss are moved to the degraded mission category. The overall goal is to reduce the number of potential mission failures.
3. Reliability failure rate analysis and Reliability Block Diagrams are used to weigh the relative benefit of one design implementation versus another.
4. Where failures result in graceful degradation and require rapid ground intervention or changes in operational plans to save the spacecraft, prepare contingency procedures or software loads to implement them.
5. Critically review the design of the spacecraft power bus. A short on the primary power bus can take out the whole spacecraft. The power bus is designed so that shorts are considered not credible.
6. Peer Review process for both Hardware and Software to identify potential design and/or implementation problems.
Design and Analysis Phase: Failure Modes and Effects Analysis
• Reliability and Failure Modes and Effects Analysis have different goals for redundant and single string spacecraft.
• As a single string spacecraft, MAP strives to minimize the effects of a failure, whereas a redundant spacecraft strives to avoid single point failures.
• For a single string mission, a large number of faults can result in mission loss. However, there are also many failures that may result in partial loss of function or in a reduction in performance. These types of failures result in “graceful degradation”. Look at interfaces and down to the circuit level.
• A redundant spacecraft design focuses primarily on preventing single point failures and focuses less on designing in graceful degradation. The analysis usually stops at interfaces, to assure faults do not propagate to the redundant unit.
• For MAP, designing in graceful degradation is much more important since there are minimal redundant units available for backup.
• The FMEA is synchronized with the Fault Tree at the major component functional level (e.g. Transponder Receiver, ACE Safehold, PSE Load Switching)
Integrated Mission Fault Tree
• White Box - Failure Propagation
• Red Colored Box - Single Point Failures
• Yellow Colored Box - Graceful Failures
• Green Colored Box - Redundancy Failures
• Yellow Outline Box - Ground Contingency Procedure
• Blue Outline Box - Onboard FDC
Design and Analysis Phase: Fault Tree Analysis
• Fault Tree starts with “Loss of Mission” as the top block.
• Key to this top block is understanding what defines a mission loss: the Mission Success Criteria.
• Knowing the design of the system and how it will be operated, postulate the faults that could result in loss of mission.
• Faults are logically combined and further decomposed until the lowest desired level is reached.
• Lowest level should overlap and be consistent with the FMEA. Typically this is the component major function level (e.g. Transponder Receiver, ACE Safehold, PSE Load Switching).
• Requirements for Contingency Procedures and Onboard Autonomous Switching (Fault Detection and Correction) should be included to show where action is required.
• The Fault Tree provides a graphical format for organizing postulated failures, understanding their consequence on the system, and understanding their relationship to other systems and subsystems.
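To illustrate how faults are "logically combined" under a top block, here is a small sketch that evaluates a fault tree built from OR and AND gates. It assumes independent basic events, and the tree shape, event names and probabilities are hypothetical; they are not taken from the MAP fault tree.

```python
import math


def probability(node):
    """Probability of a fault-tree node, assuming independent basic events."""
    if "p" in node:                      # basic event (leaf)
        return node["p"]
    child_p = [probability(c) for c in node["children"]]
    if node["gate"] == "OR":             # any child fault causes this fault
        return 1.0 - math.prod(1.0 - p for p in child_p)
    if node["gate"] == "AND":            # all child faults must occur
        return math.prod(child_p)
    raise ValueError(f"unknown gate: {node['gate']}")


# Hypothetical tree; names and probabilities are illustrative only.
loss_of_mission = {
    "name": "Loss of Mission",
    "gate": "OR",
    "children": [
        {"name": "single point failure X", "p": 0.01},
        {"name": "function Y lost", "gate": "AND", "children": [
            {"name": "primary path fails", "p": 0.1},
            {"name": "backup path fails", "p": 0.2},
        ]},
        {"name": "single point failure Z", "p": 0.005},
    ],
}
top = probability(loss_of_mission)
print(f"P(Loss of Mission) = {top:.6f}")
```

The AND gate is where a backup path, contingency procedure or onboard FDC action shows its value: the branch contributes only when both the primary and the backup fail.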
MAP Reliability Block Diagram
Design and Analysis Phase: Reliability Block Diagram
• The absolute number from a total mission reliability prediction is uncertain, with large error bars
• Relative comparisons between approaches or implementations are fairly good
  – Indicates the relative improvement of redundancy
  – Comparisons allow selection of more reliable solutions
• Computations based on Schematics and MIL-HDBK-217
• Some historical data available from operations database
• Averaging of computations and historical data possible
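The point about large error bars on the absolute number versus a fairly good relative comparison can be shown with a toy block diagram. The sketch below scales all failure rates by a common uncertainty factor and compares a single-string design with one that duplicates a single block. The rates, the mission duration, the exponential reliability model and the choice of which block to duplicate are all assumptions for illustration.

```python
import math

MISSION_HOURS = 20_000.0         # hypothetical mission duration
BASE_RATES = [3e-6, 6e-6, 1e-6]  # hypothetical failures/hour for three blocks


def block(rate_per_hour, hours=MISSION_HOURS):
    """Reliability of one block under a constant failure rate."""
    return math.exp(-rate_per_hour * hours)


def series(blocks):
    """All blocks must work."""
    return math.prod(blocks)


def parallel(blocks):
    """At least one block must work (independent failures assumed)."""
    return 1.0 - math.prod(1.0 - r for r in blocks)


def single_string(scale):
    return series(block(scale * rate) for rate in BASE_RATES)


def selective_redundancy(scale):
    a, b, c = (block(scale * rate) for rate in BASE_RATES)
    return series([a, parallel([b, b]), c])   # duplicate the weakest block


results = []
for scale in (0.5, 1.0, 2.0):   # failure rates halved, nominal, doubled
    single = single_string(scale)
    redundant = selective_redundancy(scale)
    results.append((scale, single, redundant))
    print(f"scale {scale}: single {single:.3f}, redundant {redundant:.3f}")
```

The absolute predictions move substantially as the rates are scaled, but the redundant design comes out ahead at every scale, which is why the comparison is more trustworthy than either number alone.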
Reliability Improvement Study Results
Representative FMEA/PRA Summary
PRA, Graphical Tree Format
• White Box - Failure Propagation
• Red Colored Box - High Risk Failure
• Yellow Colored Box - Medium Risk Failure
• Green Colored Box - Low Risk Failures
• Yellow Outline Box - Ground Contingency Procedure
• Blue Outline Box - Onboard FDC
Design and Analysis Phase: New Technology and Mission Success
• Select Approach to Mitigate New Technology
  – Risks to Technical Performance in End Item Application
  – Risks to Project Execution
• Use Risk Management Techniques to weigh benefit of new technology versus the consequence of it not being ready or not working
• Mitigation Steps
  – Test working hardware/software as soon as possible
  – Early verification through Engineering Test Units (ETUs)
  – Define Alternate or Backup Sources
  – Descope Plan: prepare to scale back to minimum mission requirements
Reliability Process (cont.): Manufacturing and Inspection Phase
1. Failures are viewed as mechanical. Whenever an item fails it usually means that something moved, whether internal to a chip, on a circuit card or in a harness. If it worked once and then does not, something moved. (EMI is the exception.)
2. Stress relief against vibration, mechanical motion, and thermal expansion.
3. Clearance to protect against shorts. Close inspection as lower level subassemblies are assembled.
4. The power system electronics are carefully inspected during assembly to screen for potential shorts. Shorts on the power bus are considered not credible following inspection.
5. Eliminate sources of, and provide barriers to, contamination that could cause shorts or degrade the surface properties of instruments or thermal control surfaces.
6. Walkdowns and inspections for critical items dependent on workmanship. RF shields and grounding for ESD protection are examples.
7. Manufacturing process control and inspection are as important as they are on a redundant spacecraft. Manufacturing process control may even be more important because there is only one chance to get it right.
Reliability Process (cont.): Test Phase
1. Accumulate sufficient test time to gain confidence that the infant mortality failure period has passed. Goal is on the order of 1000 hours total, with the last 100 failure free.
2. Test and/or execute the sequences planned for the mission. Perform steps and send commands in the expected sequence with the expected timing.
3. Command sequences are verified prior to first-time execution on orbit. If a sequence is performed on orbit for the first time, analysis should exist that indicates the item will work. Items are tested “in pieces or in steps” instead of relying on analysis alone.
4. Critically test flight and ground software against requirements as well as the intended end item function independent of the “requirements”.
5. Exercise the hardware and software together during environmental test in the modes they are operated in during the mission.
6. Specifically seek out “What is not Tested in Flight Configuration”. Review assumptions made in the verification program, especially where verification is accomplished in pieces or by simulation.
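The test-time goal in item 1 can be tracked with a simple check. This is a sketch under assumptions: the function name and log format are hypothetical, and the thresholds are treated as hard limits even though the slide gives them only as "on the order of" 1000 hours.

```python
def meets_test_time_goal(total_hours, failure_hours,
                         goal_total=1000.0, goal_failure_free=100.0):
    """Check accumulated test time against the stated goal.

    total_hours:   operating hours accumulated so far
    failure_hours: accumulated-hour marks at which failures were logged
    """
    last_failure = max(failure_hours, default=0.0)
    failure_free = total_hours - last_failure
    return total_hours >= goal_total and failure_free >= goal_failure_free


# Hypothetical test logs, for illustration only.
print(meets_test_time_goal(1050.0, [120.0, 400.0]))  # early failures only
print(meets_test_time_goal(1050.0, [980.0]))         # failure 70 hours ago
print(meets_test_time_goal(600.0, []))               # not enough total time
```

A failure late in the campaign resets the failure-free clock, which is the reason the second case does not meet the goal despite exceeding the total.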
Basics, What is not Tested?
• Identify items that cannot be tested in the flight configuration and environment
• Assure that simulations and assumptions are appropriate
• Typical areas applicable to most projects
  – End-to-end Instrument Optical or RF check at flight temperature (tested in pieces)
  – Loaded Propulsion System and Thruster firings (component test)
  – Solar Array deployment in zero G, vacuum, and temperature with gradients (tested in pieces)
  – Power System working with illuminated solar arrays (verified against simulator with sufficient margin)
  – ACS operating in closed loop with flight hardware and flight software (hardware tested with stimulators open loop, software verified with HDS closed loop, sensor actuator end to end phasing verification)
  – Launch, ascent, separation, and acquisition sequence with the correct timing of external environmental events: vibration, thermal, vacuum, solar, etc. (verified in pieces)
  – Inability to test radiation (SEU in particular) environment (parts testing based on design and engineering judgement)
  – Inability to test surface or internal charging environment (materials usage and testing based on design and engineering judgement)
Reliability Process (cont.): Operations Phase
1. Utilize a simple subset of the total Spacecraft electronics suite to provide an ACS Safehold that allows additional time for the ground to recover from an anomaly
2. Onboard failure detection to minimize the impact of mission threatening anomalies
3. Spacecraft informs ground of serious off-nominal situations
4. Contingency procedures prepared for critical subsystems and mission events
5. Ground system capable of identifying adverse trends and/or off-nominal performance
6. Training and exercising of the flight and ground systems during prelaunch mission simulations
7. Separation/Deployment and Propulsion Maneuvers performed within ground contact
Summary
• Overall Reliability process addresses total program lifecycle including:
  – Design, Manufacturing, Test and Operations phases
• Reliability built in from the beginning
• FMEA, FTA, RBD, and PRA used as tools in an overall Reliability Assurance Program to optimize the architecture and design
• The PRA is maintained and updated with test results and other changes throughout the project life cycle
• Failure mitigators address Moving Parts, Parts Application, Environments, Software/Operations, Workmanship, and Random failure causes
• As part of the total reliability program, MAP has implemented designs that provide graceful degradation and backups in selective areas
• MAP has achieved a balance of Performance, Reliability, Cost and Schedule within the available resources
Acronym List
ACS     Attitude Control System
ACE     Attitude Control Electronics
AEU     Analog Electronics Unit (Part of Instrument Electronics)
CSS     Coarse Sun Sensors
DEU     Digital Electronics Unit (Part of Instrument Electronics)
FBC     Faster, Better, Cheaper
FMEA    Failure Modes and Effects Analysis
FTA     Fault Tree Analysis
LVPC    Low Voltage Power Converter
(acronym missing)  MIDEX Attitude and C&DH
MAP     Microwave Anisotropy Probe
PDU     Power Distribution Unit (Part of Instrument Electronics)
PRA     Probabilistic Risk Assessment
PSE     Power System Electronics
RBD     Reliability Block Diagram
XRSN    Transponder Remote Services Node