Root Cause Analysis: Principles and Practice in OpenStack & beyond

Root Cause Analysis: Principles and Practice in OpenStack & beyond
Elisha Rosensweig, Ifat Afek
November 2016
© Nokia 2016

Overview
• Root Cause Analysis (RCA) – What is it, and why should we care
• Vitrage - building an RCA engine for OpenStack
• Demo
• Looking to the future - Vitrage Roadmap

Root Cause Analysis (RCA) – What is it, and why should we care
• Root Cause - “A factor is considered a Root Cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring” [wikipedia]
• Root Cause Analysis - RCA is the method of identifying the Root Cause(s) of system events, usually failures
• Knowing the Root Cause has many uses, shown on the slide along an axis from Past / Reactive to Future / Proactive:
  - Insight: Understand your system better
  - Fast Reaction: Effective fault recovery
  - Accountability: Regulatory & customer req.
  - Operations: Treat problem, not symptom
  - Prediction: Based on RCA of past events
Nokia Internal Use

RCA approaches
(arranged on the slide along a "Data Required" axis, from Less to More)
• Expert Judgement
  - Relies on & reflects the expertise of the RCA investigator
  - Subjective bias
  - Data – RCA investigator experience
• Statistical Techniques
  - Detect whether the presence/absence of certain factors impacts the probability of a fault occurring
  - Data - statistical correlations & chronological ordering of events
• Analytical Techniques
  - Formal approaches for analyzing causal dependencies
  - Rely on counterfactual reasoning
  - “If X had not occurred, would Y have occurred?”
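The statistical approach on this slide can be illustrated with a minimal Python sketch. It assumes a hypothetical log of past observations, each noting whether some factor was present and whether a fault followed, and compares the fault rate with and without the factor. The `Observation` record, its field names and the sample data are invented for illustration; they come from neither the presentation nor Vitrage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One hypothetical monitoring record (field names are assumptions)."""
    factor_present: bool  # e.g. "host CPU overloaded" was observed
    fault_occurred: bool  # e.g. "instance unresponsive" followed


def fault_rate(observations, factor_present):
    """Fraction of observations with the given factor value that ended in a fault."""
    matching = [o for o in observations if o.factor_present == factor_present]
    if not matching:
        return None  # no data for this case, so no estimate
    return sum(o.fault_occurred for o in matching) / len(matching)


# Invented sample data: 4 observations with the factor, 4 without.
observations = [
    Observation(True, True), Observation(True, True),
    Observation(True, True), Observation(True, False),
    Observation(False, True), Observation(False, False),
    Observation(False, False), Observation(False, False),
]

rate_with = fault_rate(observations, factor_present=True)
rate_without = fault_rate(observations, factor_present=False)
print(f"fault rate with factor:    {rate_with:.2f}")
print(f"fault rate without factor: {rate_without:.2f}")
```

A large gap between the two rates only indicates correlation. That is presumably why the slide also lists chronological ordering as input data, and why the analytical techniques add counterfactual reasoning on top.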

Vitrage - building an RCA engine for OpenStack
• The Vitrage approach – Phase I: Automating expert judgment
[Diagram: Vitrage Datasources (Nova, Neutron, Nagios, Zabbix, Heat); Vitrage Entity Graph (nodes labeled H, S, V, Ap); Expert Judgment; Vitrage Templates]

Vitrage - building an RCA engine for OpenStack
• The Vitrage approach – Phase I: Automating expert judgment
[Diagram: entity graphs (nodes labeled H, S, V, Ap) annotated with Root Cause Analysis and Deduced Alarms]

Vitrage - building an RCA engine for OpenStack
• Benefits of automatic expert judgment
  - Holistic view
    • Many systems are well understood locally
    • Automates the process of putting it all together
  - Propagate the RCA insights throughout the system
    • Deduced Alarms & states expose faults to all relevant users
• The natural first step in the RCA journey
  - Fast ramp-up: Stat. & Analytical techniques require much more data to be effective
  - Configurable: Allows users/vendors to contribute to the effectiveness of Vitrage
    • Every system is different, with different needs

Vitrage Architecture & Flow – putting it all together
[Diagram: Data Sources, Entity Graph, Evaluator & Templates, Notifiers, UI / API. The example entity graph has nodes Host, Switch, Instance and Alarm, with edges labeled "attached", "contains" and "on".]
Example template scenario shown on the slide:

    scenario:
      condition: alarm_on_host and host_contains_instance
      actions:
        - action:
            type: set_state
            target: instance
            properties:
              state: suboptimal
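To make this flow concrete, here is a minimal Python sketch of how an evaluator might apply the slide's scenario to a tiny entity graph. It is not Vitrage's implementation: the `Entity` class, the edge-tuple representation and the `apply_scenario` function are hypothetical, and the condition `alarm_on_host and host_contains_instance` is hard-coded rather than parsed from a template.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """A node in a toy entity graph (hypothetical, not Vitrage's data model)."""
    name: str
    kind: str            # "host", "instance" or "alarm"
    state: str = "ok"


def apply_scenario(entities, edges):
    """Hard-coded version of the slide's scenario.

    condition: alarm_on_host and host_contains_instance
    action:    set_state(target=instance, state=suboptimal)

    `edges` is a set of (source_name, label, target_name) tuples.
    Returns the names of the instances whose state was changed.
    """
    by_name = {e.name: e for e in entities}
    # Hosts that have at least one alarm attached via an "on" edge.
    alarmed_hosts = {
        dst for src, label, dst in edges
        if label == "on"
        and by_name[src].kind == "alarm"
        and by_name[dst].kind == "host"
    }
    changed = []
    for src, label, dst in sorted(edges):
        if (label == "contains" and src in alarmed_hosts
                and by_name[dst].kind == "instance"):
            by_name[dst].state = "suboptimal"
            changed.append(dst)
    return changed


# Invented example data: only host-1 has an alarm on it.
entities = [
    Entity("host-1", "host"),
    Entity("host-2", "host"),
    Entity("vm-1", "instance"),
    Entity("vm-2", "instance"),
    Entity("cpu-alarm", "alarm"),
]
edges = {
    ("host-1", "contains", "vm-1"),
    ("host-2", "contains", "vm-2"),
    ("cpu-alarm", "on", "host-1"),
}

changed = apply_scenario(entities, edges)
states = {e.name: e.state for e in entities if e.kind == "instance"}
print(changed, states)
```

Only `vm-1` is marked suboptimal, because it is the only instance contained in a host that has an alarm on it; `vm-2` on the healthy host is left untouched.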

Vitrage Support – to-date
• Datasources
  - OpenStack Projects
    • Nova, Cinder, Neutron, Heat, Aodh
  - External Projects
    • Nagios, Zabbix
• Notifications
  - Mark host down – nova notification
  - POC - Vitrage Alarms notified to Aodh (Ceilometer)
• Templates support
  - YAML format, human readable
  - Support for complex condition (and/or clauses)
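As a rough illustration of what and/or clauses in a template condition involve, the following Python sketch evaluates a flat condition string against a dictionary of facts. The function `evaluate_condition` and the fact name `alarm_on_switch` are hypothetical; Vitrage's actual template parser is not described in the slides and may work quite differently.

```python
def evaluate_condition(condition, facts):
    """Evaluate a flat and/or condition against a dict of known facts.

    "and" binds tighter than "or"; parentheses and "not" are not supported.
    Terms missing from `facts` count as False.
    """
    return any(
        all(facts.get(term.strip(), False) for term in clause.split(" and "))
        for clause in condition.split(" or ")
    )


# Invented facts, as they might be derived from an entity graph.
facts = {
    "alarm_on_host": True,
    "host_contains_instance": True,
    "alarm_on_switch": False,
}

both = evaluate_condition("alarm_on_host and host_contains_instance", facts)
either = evaluate_condition("alarm_on_switch or alarm_on_host", facts)
neither = evaluate_condition("alarm_on_switch and alarm_on_host", facts)
print(both, either, neither)
```

Splitting on "or" first and "and" second gives "and" the higher precedence, which matches the usual convention for boolean expressions.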

Demo

Vitrage Roadmap – Functionality
• Alarms aggregation
  - Organize alarms to improve system clarity
  - Aggregate using various categories - root cause, impact, resource
  - Drill-down functionality
• Alarms history
  - Store past alarms in persistent (graph?) DB
  - Laying groundwork for long-term system analysis
• Auto detection of causal dependencies
  - i.e., auto generation of Vitrage templates
[Diagram: alarm hierarchy with Host Failure, Instance Failure (three times) and Application Failure; entity graph with nodes labeled H, S, V, Ap]
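Aggregation by root cause, one of the categories named on this slide, could look roughly like the Python sketch below. It assumes each alarm records at most one direct cause; the `caused_by` mapping, the alarm names and both functions are hypothetical and only mirror the failure labels in the slide's diagram.

```python
# Hypothetical alarms: name -> name of the alarm that directly caused it
# (None means no known cause).
caused_by = {
    "host-failure": None,
    "instance-failure-1": "host-failure",
    "instance-failure-2": "host-failure",
    "application-failure": "instance-failure-1",
    "disk-warning": None,
}


def root_cause(alarm, caused_by):
    """Follow the causal links upward until an alarm with no known cause."""
    seen = set()
    while caused_by[alarm] is not None:
        if alarm in seen:
            raise ValueError(f"causal cycle detected at {alarm!r}")
        seen.add(alarm)
        alarm = caused_by[alarm]
    return alarm


def aggregate_by_root_cause(caused_by):
    """Group alarms under their root cause, for a drill-down style view."""
    groups = {}
    for alarm in sorted(caused_by):
        root = root_cause(alarm, caused_by)
        members = groups.setdefault(root, [])  # roots with no symptoms still appear
        if alarm != root:
            members.append(alarm)
    return groups


groups = aggregate_by_root_cause(caused_by)
for root, symptoms in groups.items():
    print(root, "->", symptoms)
```

An operator would then see two top-level entries (`host-failure` and `disk-warning`) instead of five separate alarms, and could drill down into `host-failure` to see the instance and application failures it explains.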

Vitrage Roadmap – Usability
• Taking the Entity Graph UX to the next level
  - Display system with thousands of entities – in a meaningful way
• UI for time-sensitive RCA
  - E.g., inactive alarm which caused an active alarm
• Templates CRUD – API & UI
  - Making it easy to add, modify and remove Vitrage templates
• Templates expressibility
  - Supporting “not”, “for all” and higher-level logic

Vitrage Roadmap – More Input, More Output
• In Vitrage, more data means more insights!
• New datasources
  - OpenStack projects
  - External monitors
• New consumers
  - Propagate alarms & states to other OpenStack projects
• Must-have use-cases
  - Prepared template libraries with mission-critical use cases
  - Make Vitrage indispensable

New Contributors are Welcome to Join!
• More info about Vitrage can be found at:
  - Wiki: https://wiki.openstack.org/wiki/Vitrage
  - Launchpad: https://launchpad.net/vitrage
  - GitHub: https://github.com/openstack/vitrage
• Contact us:
  - IRC channel: #openstack-vitrage
  - Send an email to the OpenStack mailing list with the [vitrage] tag
  - Ifat Afek: ifat.afek@nokia.com
  - Elisha Rosensweig: elisha.rosensweig@nokia.com
