
Probabilistic Model-Driven Recovery in Distributed Systems Kaustubh R. Joshi, Matti A. Hiltunen, William H. Sanders, and Richard D. Schlichting May 2, 2012 Presented by Weiwei Qiu

2 Background • Approaches for high availability are typically based on a combination of redundancy and detection and repair by human operators. • Automating recovery is challenging in practice. ▫ inaccurate fault diagnosis ▫ poor fault localization ▫ false positives ▫ action selection

3 Objective • Present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. ▫ using a theoretically well-founded model-based mechanism for automated failure detection, diagnosis, and recovery ▫ combining the recovery actions with diagnosis ▫ detecting when a problem is beyond its diagnosis and recovery capabilities

4 Approach Overview • Diagnose system problems using the output of any existing monitors and choose the recovery actions that are most likely to restore the system to a proper state at minimum cost. ▫ determine which combinations of component faults can occur in the system in a short window of time (fault hypothesis);

5 Approach Overview (cont.) ▫ specify the coverage of each monitor m in the system with regard to each fault hypothesis; ▫ specify the effects of each recovery action according to how it modifies the system state and fault hypothesis.

6 Motivating Example Enterprise Messaging Network (EMN)

Simplified EMN configuration: implements a Company Object Lookup (COL) system

8 System Model • Fault hypothesis ▫ A fault hypothesis is a Boolean expression that, when true, indicates the presence of one or more faults in the system. ▫ Examples: Down, Crash, Value ▫ Down(r): host r has crashed ▫ Crash(c): component c has crashed ▫ Value(c): component c is alive but does not provide correct service

9 Monitors • Each monitor returns true if it suspects a fault, and false otherwise. • A system may include a variety of monitoring techniques, including: ▫ Heartbeat-based monitors ▫ Test-based monitors ▫ End-to-end monitors ▫ Error logs ▫ Statistical monitors

10 Monitor Coverage • Monitor coverage, P[m|h], is the probability that monitor m returns true given that fault hypothesis h is true.
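A minimal sketch of how a coverage table could be stored and used to score a set of monitor outputs against one fault hypothesis. The monitor names, hypothesis names, and coverage values are hypothetical, and the product form assumes monitor outputs are conditionally independent given the hypothesis, which this slide does not state.

```python
from fractions import Fraction

# Hypothetical coverage table: coverage[m][h] = P[m | h].
# Monitor names, hypothesis names, and values are illustrative only.
coverage = {
    "ping_monitor": {"Crash(S1)": Fraction(1), "Value(S1)": Fraction(0)},
    "path_monitor": {"Crash(S1)": Fraction(1, 2), "Value(S1)": Fraction(1, 4)},
}


def output_likelihood(outputs, hypothesis):
    """Return P[outputs | hypothesis].

    Assumes (not stated on the slide) that monitor outputs are
    conditionally independent given the fault hypothesis.
    """
    likelihood = Fraction(1)
    for monitor, fired in outputs.items():
        p_true = coverage[monitor][hypothesis]
        # A monitor that returned false contributes 1 - P[m | h].
        likelihood *= p_true if fired else 1 - p_true
    return likelihood


observed = {"ping_monitor": False, "path_monitor": True}
for h in ("Crash(S1)", "Value(S1)"):
    print(h, output_likelihood(observed, h))
```

Under these made-up numbers, a silent ping monitor rules out Crash(S1) entirely, because that monitor is assumed to have full coverage of crashes.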

12 Recovery Actions • The application-specific recovery actions A provide the only way for the controller to change the truth value of fault hypotheses. • An action a is specified in terms of its “fault hypothesis effect” function, its mean duration a.t(h), and the monitors it invokes, a.M. Examples:

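13 Bayesian Diagnosis • Consider the subset of monitors invoked in the current round. • Let o_m denote the current output of monitor m, and collect the outputs of all invoked monitors into the current set of monitor outputs. • The vector of posterior probabilities of the fault hypotheses, given these outputs, is the diagnosis vector.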
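The update behind the diagnosis vector is Bayes' rule. The symbols below are assumptions, since the slide's own notation did not survive: H is the set of fault hypotheses, o_M is the set of current monitor outputs, and P[h] is the prior probability of hypothesis h.

```latex
\[
P[h \mid o_M] \;=\; \frac{P[o_M \mid h]\, P[h]}{\sum_{h' \in H} P[o_M \mid h']\, P[h']}
\]
```

The diagnosis vector then collects P[h | o_M] for every h in H.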

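14 • Fault hypotheses: {Value(HG), Value(S1), Value(S2)} • Prior: P[h] = 1/3 for each hypothesis h • Monitor output likelihoods: P[o_m|Value(HG)] = 1, P[o_m|Value(S1)] = 1/4, P[o_m|Value(S2)] = 1/4 • Posterior: P[Value(HG)|o_m] = 2/3, P[Value(S1)|o_m] = 1/6, P[Value(S2)|o_m] = 1/6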
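The numbers on this slide can be checked with a short Bayes update. This is a sketch only: the function name `posterior` and the dictionary layout are illustrative choices, not part of the presented system.

```python
from fractions import Fraction


def posterior(prior, likelihood):
    """Bayes update: P[h | o] is proportional to P[o | h] * P[h]."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())  # P[o], the normalizing constant
    if evidence == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / evidence for h, p in joint.items()}


# Values taken from the slide: uniform prior over three Value faults.
prior = {h: Fraction(1, 3) for h in ("Value(HG)", "Value(S1)", "Value(S2)")}
likelihood = {
    "Value(HG)": Fraction(1),
    "Value(S1)": Fraction(1, 4),
    "Value(S2)": Fraction(1, 4),
}

diagnosis = posterior(prior, likelihood)
for h, p in diagnosis.items():
    print(h, p)  # Value(HG) 2/3, Value(S1) 1/6, Value(S2) 1/6
```

The evidence term is 1/3 + 1/12 + 1/12 = 1/2, so each joint probability is doubled to give the posterior.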

15 Recovery Algorithm

16 Recovery Action Selection • Single-Step Lookahead (SSLRecover) ▫ accepts a cost metric a.cost as input for each action ▫ greedily makes its choice by looking only one recovery action ahead ▫ cannot use actions whose outcomes depend on the order in which they are applied
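A rough sketch of what a greedy, one-action-ahead choice could look like. The scoring rule used here (probability mass of the hypotheses an action repairs, divided by its cost) is a stand-in chosen for illustration and is not claimed to be SSLRecover's actual criterion; the action names, costs, and the `repairs` sets are hypothetical.

```python
from fractions import Fraction

# Diagnosis vector (posterior over fault hypotheses) from the slide 14 example.
diagnosis = {
    "Value(HG)": Fraction(2, 3),
    "Value(S1)": Fraction(1, 6),
    "Value(S2)": Fraction(1, 6),
}

# Hypothetical actions: a cost and the set of hypotheses each one repairs.
actions = {
    "restart_HG": {"cost": Fraction(2), "repairs": {"Value(HG)"}},
    "restart_S1": {"cost": Fraction(1), "repairs": {"Value(S1)"}},
    "restart_S2": {"cost": Fraction(1), "repairs": {"Value(S2)"}},
}


def greedy_choice(diagnosis, actions):
    """Pick the action with the best one-step score.

    Score = P[action repairs the fault] / cost. This is an illustrative
    heuristic, not the criterion from the presented work.
    """
    def score(name):
        spec = actions[name]
        p_fix = sum(diagnosis.get(h, Fraction(0)) for h in spec["repairs"])
        return p_fix / spec["cost"]

    return max(actions, key=score)


print(greedy_choice(diagnosis, actions))  # restart_HG
```

Even at twice the cost, restarting HG scores (2/3)/2 = 1/3, ahead of 1/6 for either server restart. Because the choice looks only one action ahead, it ignores what happens if the chosen action fails to fix the fault.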

17 Recovery Action Selection (cont.) • Multistep Lookahead (MSLRecover) ▫ Extended system model: a state model, in which each recovery action a is represented by a precondition and a state effect, in addition to the fault hypothesis effect ▫ Optimal action selection: transform the system model into a Partially Observable Markov Decision Process (POMDP) with a cost criterion.
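One way to picture the extended action model is as a record with a precondition and a state effect alongside the fault hypothesis effect. The sketch below is purely illustrative: the class, field names, state encoding, and example actions are all hypothetical, and it does not build or solve the POMDP.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical encoding, e.g. {"hostA": "up"}


@dataclass(frozen=True)
class RecoveryAction:
    name: str
    precondition: Callable[[State], bool]    # may the action run in this state?
    state_effect: Callable[[State], State]   # how the action changes the state
    hypothesis_effect: Callable[[str], str]  # how it changes a fault hypothesis


def applicable(actions: List[RecoveryAction], state: State) -> List[str]:
    """Names of the actions whose precondition holds in the given state."""
    return [a.name for a in actions if a.precondition(state)]


# Illustrative actions: restarting a component needs its host to be up.
restart_c = RecoveryAction(
    name="restart_component",
    precondition=lambda s: s["hostA"] == "up",
    state_effect=lambda s: dict(s),
    hypothesis_effect=lambda h: "NoFault" if h == "Crash(c)" else h,
)
reboot_host = RecoveryAction(
    name="reboot_host",
    precondition=lambda s: True,
    state_effect=lambda s: {**s, "hostA": "up"},
    hypothesis_effect=lambda h: "NoFault" if h in ("Down(hostA)", "Crash(c)") else h,
)

print(applicable([restart_c, reboot_host], {"hostA": "down"}))  # ['reboot_host']
```

Preconditions like this are what make action order matter, which is why a single-step greedy choice is not enough for this model.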

Automatic Recovery Architecture

Experimental Results (1) Availability under Fault Injection

Experimental Results (2) Recovery Benchmarks

21 Related Work • System diagnosis ▫ sequential diagnosis ▫ error propagation analysis ▫ Bayesian models / Hidden Markov Models • Automatic recovery ▫ microreboots ▫ Markov decision theory ▫ learning repair strategies

22 Future Work • Modeling limitations ▫ does not allow for transient faults ▫ considers only one fault hypothesis at a time • System extensions ▫ integrating additional monitoring and recovery mechanisms into the framework ▫ automatically constructing the coverage, action, and cost models ▫ capturing operator domain knowledge regarding the effect of recovery actions

Thank You!