Say Goodbye to Post Mortems
Say Hello to Effective Problem Management
Charles T. Foy
Siemens Medical Solutions USA, Inc., Health Services Division
charles.foy@siemens.com
Copyright © 2008 Siemens Medical Solutions USA, Inc. All rights reserved.
Company: Siemens AG
§ Our division: healthcare software
§ Our department: application hosting
§ Mainframe, mid-range, open systems, distributed systems
§ All operating systems (except Tandem)
§ My role
Caution!
This company was founded by former employees of International Business Machines (IBM).
A proclivity for acronyms is part of the culture.
Proclivity: “a natural or habitual inclination or tendency; propensity; predisposition”
You have been warned…
Agenda
§ What drove creation of a Problem Management System?
§ First steps
§ Give it a name?
§ Got Lucky!
§ Build versus Buy
§ It’s a Defect!
§ What to track? Classifications?
§ Database Structure
§ The Process
§ Trending
§ Benefits
What drove creation of a Problem Management System?
§ Disparate, inconsistent ‘post-mortems’
§ Usually driven by customer demand for an explanation
§ Needed a defined process
§ Consistent across the company
§ Communicates to the customer – internal and external
First Steps
Launch:
§ Assigned to a small group
§ Two service delivery managers
§ One consultant (employee #26)
§ Quality Assurance and Process Definition expert
§ No detailed marching orders other than “standard post-mortem process”
First Steps
Started with…
[Diagram labels: Standardized Text Document (Root Cause); Standardized Text Document (Root Cause, Follow-up); Database (Root Cause Field, Follow-up Field, Document)]
First Steps
Defined our own goal:
Redefined project outcomes:
§ reduces unscheduled outages
§ increases availability
§ communicates the root cause and preventive measures implemented to internal and external audiences
Has to:
§ Drive to the root cause
§ In a searchable manner, track: outage details, root causes, corrective actions, customer communications, preventive measures, implementation status, etc.
First Steps
Give it a Name?
Needed a new name
§ No longer a “Post Mortem” process
§ “Post Mortem” didn’t sit well
§ Before fully ITIL-aware
How about a working title for our project?
§ Perhaps the Post Event Analysis Process, a.k.a. PEAP?
§ Could always change it later on
Never happened!
Thus, PEAP was born!
And if the Post Event Analysis Process produces a Report, it of course would be called…
First Steps
Post Mortem Report new name:
The Post Event Analysis Report
Or PEAR
Define the database and process
Database needs:
1. Description, short term resolution, root cause
2. Customers impacted, length of outage
3. Corrective actions implemented & their status
4. Etc.
Process:
1. Capture the root cause
2. Ensure the corrective action was implemented
3. Communicate all the above
Seemed straightforward, linear, one to one…
Next Steps – define the database requirements
We Got Lucky! Ran into a friend…
§ Provided us with an excellent service outage to use as our model
§ Decided to use it as proof of concept
Slowdown affecting almost all his applications; response time dropped to zero within 5 minutes…
Started looking for commonalities – network was suspect
§ A Configuration Management Database (CMDB) would have helped!
Started looking like it was the Storage Area Network (SAN)
Problem cleared up 45 minutes into the event
The Outage Incident
§ Look up - Jake San Technician
§ Fixes the problem!
§ Not!
§ Battery Swap!
§ 45 minutes ago, looks good!
§ Here’s what happened…
Root cause: Battery was going to go bad and was swapped out.
§ So Hardware is the root cause
But wait… is it really a Hardware issue?
§ Battery didn’t actually die… it was Jake San Technician! Human Error!
But wait… is it really a “Human Error” issue?
§ Jake was doing his job
OK, a… “Rules” issue – “always swap batteries off peak”
Root cause?
Aren’t these ‘contributing’ root causes?
§ They didn’t know the battery was alerting
§ SAN vendor knew
§ SAN technician walked in and worked without their knowledge
§ SAN technician education
§ Data center employees education
§ No battery swap rule/process
Root cause?
What would we put as our root cause?
Do we need to track all these ‘root’ causes?
Do we need to track the corrective actions for each?
Don’t most outages have multiple root causes?
Conclusion: MULTIPLE root causes
Multiple root causes, multiple follow-ups.
This would be complex.
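One way to picture the "multiple root causes, multiple follow-ups" conclusion is as a one-to-many-to-many record. The sketch below is only an illustration: the class names (`OutageRecord`, `RootCause`, `FollowUp`), the field names, and the follow-up descriptions are assumptions, not the schema of the actual defect-tracking tool. The two root-cause descriptions are taken from the battery-swap example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FollowUp:
    """A corrective action tied to one root cause (hypothetical shape)."""
    description: str
    implemented: bool = False


@dataclass
class RootCause:
    """One of possibly many root causes of a single outage."""
    description: str
    follow_ups: List[FollowUp] = field(default_factory=list)


@dataclass
class OutageRecord:
    """An outage with any number of root causes."""
    summary: str
    root_causes: List[RootCause] = field(default_factory=list)

    def open_follow_ups(self) -> List[FollowUp]:
        # Flatten follow-ups across all root causes, keeping the unfinished ones.
        return [f for rc in self.root_causes for f in rc.follow_ups
                if not f.implemented]


# Illustrative data; the follow-up wording is invented for this example.
outage = OutageRecord(summary="SAN slowdown during battery swap")
rule = RootCause("No battery swap rule/process")
rule.follow_ups.append(FollowUp("Define an off-peak battery swap rule"))
education = RootCause("SAN technician education")
education.follow_ups.append(FollowUp("Brief technicians on the rule", implemented=True))
outage.root_causes.extend([rule, education])
```

Calling `outage.open_follow_ups()` then lists only the follow-ups that still need work, regardless of which root cause they belong to.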
Build a database?
Designed requirements, got a resource time estimate
§ Presented to upper management
§ Anything on the shelf?
Tools and Methodology Manager:
• Hardware that breaks
• Software that breaks
• Humans that make errors…
Essentially, you’re tracking defects!
Defect Tracking
Company standard defect tracking application
§ Fully implemented and operational
Subject Matter Expert (SME)
§ Does 90% of what you need
§ Easy to implement
§ What are your major defect categories?
Defect Tracking
To build this, you need Classifications…
What are your major defect areas?
How granular?
The Classifications
Asked our peers
§ Specific type of hardware
§ Specific type of software
§ Human error
The Classifications
How much detail?
§ Major category (hardware)
§ The thing that broke (server)
§ Thing that caused it to break (bad power supply)
§ Model that broke (Fleetwood XL 340)
Human Error
Does that work for Human Error?
Example: Jeff mistyped a static route in a backup router. Primary router fails. Backup router kicks in but does not recover all the interfaces…
§ Major category (human error)
§ The thing that broke (typing)
§ Thing that caused it to break (not enough sleep)
§ Model that broke (Jeff)
Human Error?
§ Do we really want to say “human error”?
§ What does it mean to make a human error?
§ Failure To Follow A Process? …FTFAP
Eureka! A five-letter acronym!
Classifications
Euphemism at first, then…
The “Process” category was born!
§ Process Not Followed (a.k.a. Human Error)
§ Process Incomplete
§ Process Incorrect (covers the “need to change the Rule” root cause)
§ Documentation wrong
More items to track
§ Version and vendor of the software/hardware?
§ Name of the Human?
§ Impacted application(s)? Impacted customer(s)?
§ O/S level? 3rd-party software? Something we wrote?
§ Was this tested before it was put into production?
§ Did it happen before?
§ What is the air-speed velocity of an unladen swallow?
Database Structure
Supports Multiple Levels of Classification
Global Keyword: allows for over-all groupings
1. Hardware
2. Software
3. Process
Keyword 1 answers “What broke?”
§ Answer: Server
Keyword 2 answers “What thing within KW 1 broke?”
§ Answer: Power Supply
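The three-level classification can be sketched as a nested lookup that rejects combinations outside the agreed keyword lists. This is a minimal sketch under assumptions: the `TAXONOMY` dictionary and the `classify` function are hypothetical names, and only the Hardware / Server / Power Supply path comes from the slide; a real keyword list would be far larger and would live in the defect-tracking tool.

```python
from typing import Dict, Set, Tuple

# The three global keywords named on the slide.
GLOBAL_KEYWORDS: Set[str] = {"Hardware", "Software", "Process"}

# Hypothetical allowed values: global keyword -> Keyword 1 -> set of Keyword 2.
TAXONOMY: Dict[str, Dict[str, Set[str]]] = {
    "Hardware": {"Server": {"Power Supply"}},
}


def classify(global_kw: str, kw1: str, kw2: str) -> Tuple[str, str, str]:
    """Return the classification triple, or raise ValueError if it is not allowed."""
    if global_kw not in GLOBAL_KEYWORDS:
        raise ValueError(f"unknown global keyword: {global_kw}")
    kw2_options = TAXONOMY.get(global_kw, {}).get(kw1)
    if kw2_options is None:
        raise ValueError(f"unknown Keyword 1 for {global_kw}: {kw1}")
    if kw2 not in kw2_options:
        raise ValueError(f"unknown Keyword 2 for {kw1}: {kw2}")
    return (global_kw, kw1, kw2)
```

Restricting entries to a fixed vocabulary like this is what makes later searching and trending possible, since free-text categories rarely group cleanly.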
Keyword Grouping Samples: Hardware
Keyword 1 / Keyword 2 entries: Router, Chassis, Server, Cable, Memory, CPU, NIC Card, Hard Drive, NPE, HBA, Pwr Supply, Memory, Mthr. Board, Pwr Supply
Keyword Grouping Samples: Software
Keyword 1 / Keyword 2 entries: Application A, Print Subsys, Server, BIOS, CICS, GSM, Term Svcs, RSA, DHCP, Service Pack, Firewall, Configuration, IIS, Dayend Flow, LDAP, MODS, Virus-Wm, PTF
Keyword Grouping Samples: Process
Keyword 1 / Keyword 2 entries: Process Incomplete, Process Incorrect, Process Not Followed, Documentation Incorrect
Database Structure
Database Structure: all root causes and keywords
Database Structure: all root causes and follow-ups
The Process
Who will own the process?
Owner? PEAP Owner role? (PO?)
We need action in the title… PEAP Driver (PD?)
How about a PEAP Owner/Driver?
A POD!
The Process
POD role
§ ID all root causes
§ Describe Preventive Action
The Process
Assign follow-ups…
The Process
Document and Communicate
§ Document all in the database
§ Communicate:
  § Internally
  § Externally
§ Drive the process to completion
Surprisingly, nobody wants to be a POD!
Actually a good thing…
If your area contributed to or caused an outage, you get to be POD.
Incentive not to have outages
The Process - details to work out
§ How to define an outage?
§ When is the outage over?
§ Who is best to drive this process?
§ How does the process get initiated?
The Process - details
Existing Outage Management Process
§ Existing outage definition
§ Knowledge of incident
§ Communicates incident status to customers
Eureka!
§ Outage Manager can launch PEAP
§ Assign POD = manager of group that fixed the outage
The Process
ITIL Terminology:
§ Incident: Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service.
§ Problem: unknown underlying cause of one or more incidents.
- from ITIL Foundations by ITpreneurs B.V. 2006
At the end of the Incident Management process, the item is moved to the Problem Management Process
The Process
Incident to Problem Transition
Outage incident:
§ Details in incident tracking system
§ Outage resolved, incident ticket is solved
Interface to PEAP – Problem Management system:
§ Details transferred to a defect record
§ Defect assigned to an owner – the POD
§ Updates to defect record pass back to incident ticket
Post Event Analysis Process in a nutshell
§ Outage ends, incident transferred to PEAP, assigned to a manager (POD)
§ POD notified automatically by e-mail
§ POD:
  § gathers information, determines root causes
  § enters findings in database and internal post-mortem (PEAR)
  § assigns follow-ups as needed (new records created)
§ PEAR sent internally
§ Customer Letter is created, reviewed, sent to affected customers
§ Corrective actions implemented
§ PEAR reviewed by senior management
§ PEAP solved
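The nutshell steps can be read as an ordered workflow. The sketch below models them as a simple linear sequence; the `Step` names and the `advance` helper are hypothetical, and strict linearity is an assumption made for illustration, since in practice some steps (for example corrective actions and the customer letter) may overlap.

```python
from enum import Enum


class Step(Enum):
    """Stages of a PEAP record, in the order the slide lists them."""
    TRANSFERRED = 1
    POD_NOTIFIED = 2
    FINDINGS_ENTERED = 3
    PEAR_SENT = 4
    CUSTOMER_LETTER_SENT = 5
    CORRECTIVE_ACTIONS_IMPLEMENTED = 6
    MANAGEMENT_REVIEWED = 7
    SOLVED = 8


def advance(current: Step) -> Step:
    """Move a record to the next stage; a solved record cannot advance."""
    if current is Step.SOLVED:
        raise ValueError("PEAP is already solved")
    return Step(current.value + 1)


def remaining_steps(current: Step) -> int:
    """How many stages are left before the record is solved."""
    return Step.SOLVED.value - current.value
```

A tracking tool could use something like `remaining_steps` to drive the administrative reminders mentioned in the rollout slides.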
Post Event Analysis Process - Rollout
§ Collateral
  § PEAR template
  § Customer Letter templates
  § Process user guide – database navigation, process steps
§ Education class
  § Overview of root cause determination
  § Overview of process
  § Navigating the database
Post Event Analysis Process - Rollout
Limited scope initially
§ Multi-customer outages > 15 min
§ All multi-customer outages
§ All outages
Quality Management System – central location
§ Process description
§ User Guide
§ PEAR and Customer Letter templates
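The widening scope can be expressed as a small predicate that decides whether a given outage triggers a PEAP in each rollout phase. This is a sketch under assumptions: the phase numbers 1 to 3, the `Outage` fields, and the reading of "multi-customer" as "more than one customer affected" are illustrative choices, not definitions from the process documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    customers_affected: int
    duration_minutes: float


def in_scope(outage: Outage, phase: int) -> bool:
    """Return True if the outage requires a PEAP in the given rollout phase."""
    multi_customer = outage.customers_affected > 1  # assumed definition
    if phase == 1:
        # Multi-customer outages longer than 15 minutes
        return multi_customer and outage.duration_minutes > 15
    if phase == 2:
        # All multi-customer outages
        return multi_customer
    if phase == 3:
        # All outages
        return True
    raise ValueError(f"unknown rollout phase: {phase}")
```

For example, a 10-minute outage affecting three customers would be out of scope in phase 1 but in scope from phase 2 onward.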
Post Event Analysis Process - Rollout
Challenges:
1. Process defined in QMS but lengthy
§ “Checklist” with links to QMS section
2. Original process – too many steps
§ Gathered feedback
§ Reduced the number of steps
§ Second round of education – new process, value
Post Event Analysis Process - Rollout
Challenges:
3. Culture change – gaps in compliance
§ Phased roll-out
§ Re-education
§ Administrative reminders
§ Senior management support
4. Not all root causes identified
§ Weekly reviews with senior management
§ “5 Whys”
Benefits
1. Decreased downtime
§ “one-fers” (that aren’t) are identified
§ across platforms
§ across time spans
2. Increased customer satisfaction
§ Many “customers” of ours are CIOs or IT staff
§ Explain to their own customers
§ Knowledge of cause and remediation
Benefits
3. Adjust monitoring focus
§ Identify gaps – component level
§ Identify gaps – end-user experience
4. Fewer outages due to late-running maintenance
§ More precise estimates, smaller scopes
§ Avoid effort to complete PEAP
Trending
Trending: charts labeled by Root Cause
Trending
Major Category, Keyword 1, Keyword 2
Root Cause
Part that failed, Vendor
Customers Impacted, Duration
Applications Impacted
Corrective Actions
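With records classified by the fields above, trending reduces to counting records per combination of fields. The sketch below uses Python's `collections.Counter`; the `trend` helper, the dictionary keys, and the sample records are all made up for illustration and are not data from the presentation.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical sample records; real ones would come from the defect database.
records: List[Dict[str, str]] = [
    {"category": "Hardware", "kw1": "Server", "kw2": "Power Supply"},
    {"category": "Hardware", "kw1": "Server", "kw2": "Power Supply"},
    {"category": "Process", "kw1": "Process Not Followed", "kw2": ""},
    {"category": "Hardware", "kw1": "Router", "kw2": "Chassis"},
]


def trend(rows: List[Dict[str, str]], *fields: str) -> Counter:
    """Count records grouped by the given fields (one tuple key per group)."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


by_category = trend(records, "category")
by_full_path = trend(records, "category", "kw1", "kw2")
```

Grouping by `category` alone gives the coarse picture, while grouping by all three levels surfaces repeat offenders such as a specific failing component.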
Trending
What can you discover?
Conclusion
Methodology – works for IT shops of all sizes
§ Robust defect-tracking database
  § critical for large shops
  § smaller scale - standard document, keywords
§ No group per landscape - someone is responsible
Integration of PEAP into workflows
§ Phased roll-out, repeat education
§ Admin to assist with tracking and notifications
Problem manager to ‘own’ the process?
How to categorize keywords – ongoing refinements
Thank you!