A Proposal of Application Failure Detection and Recovery

A Proposal of Application Failure Detection and Recovery in the Grid
Marian Bubak 1,2, Tomasz Szepieniec 2, Marcin Radecki 2
1 Institute of Computer Science, AGH
2 Academic Computer Centre CYFRONET
Institute of Computer Science AGH

Outline
  • Motivation & introduction
  • Services useful in fault recovery approach
  • Overview of our proposal
  • Problems & workflow approach
  • Summary

Motivation
  • Environment and application sizes increase steadily
  • A crash is more expensive for a large application
  • Reliability of a single component does not rise considerably
  • The risk that an application crashes is higher
The fault tolerance problem becomes important in the Grid

Demands vs. Reality
  Demands                               Reality
  Minimal overhead                      Checkpointing is costly
  Automatic, quick recovery             Often restarting whole application
  Scalability                           Many global operations
  Transparent                           Additional developer’s effort is required
  Porting to any kind of application    Application-specific methods

Two classes of FT approaches
  • Application built-in FT
      • Algorithm/structure profile can be exploited; FT activity can be done more efficiently, e.g. checkpointing
      • Naturally fault-tolerant problem class, e.g. genetic algorithms
      • Fault Tolerant MPI
    but... all must be done by the developer
  • FT realized by external services
      • automatic
      • middleware services
      • no developer effort required
    but... limited functionality
  • It would be beneficial to combine these two

Services useful in FT approach
  • Monitoring services
      • For fault detection in hardware and software
      • e.g. check if a process is still running
  • Checkpointing, logging, redundancy services
      • For preparing recovery
      • e.g. store the current state of the application
  • Recovery services
      • In case of failure
      • e.g. roll back from the last checkpoint
  • Scheduler and resource broker
      • For knowledge about the started application
      • For re-scheduling or re-brokering a job or its part
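The checkpointing and recovery services listed on this slide can be illustrated with a minimal sketch. It assumes a file-backed store that pickles the application state; the class name CheckpointStore, its methods and the file naming scheme are hypothetical and are not part of the proposal.

```python
import pickle
import tempfile
from pathlib import Path


class CheckpointStore:
    """Hypothetical file-backed store for application state snapshots."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.counter = 0

    def save(self, state: dict) -> Path:
        """Store the current state of the application and return the file path."""
        self.counter += 1
        # Zero-padded counter so that lexicographic order equals creation order.
        path = self.directory / f"checkpoint-{self.counter:06d}.pkl"
        with path.open("wb") as handle:
            pickle.dump(state, handle)
        return path

    def rollback(self) -> dict:
        """Return the state stored in the most recent checkpoint."""
        checkpoints = sorted(self.directory.glob("checkpoint-*.pkl"))
        if not checkpoints:
            raise LookupError("no checkpoint available")
        with checkpoints[-1].open("rb") as handle:
            return pickle.load(handle)


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        store = CheckpointStore(Path(tmp))
        store.save({"iteration": 10, "partial_sum": 55})
        store.save({"iteration": 20, "partial_sum": 210})
        # Simulated failure: the in-memory state is lost, so roll back.
        return store.rollback()
```

Calling demo() returns the state from the second checkpoint. A real service would also have to deal with distributed state and consistent cuts, which this sketch ignores.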

How to make it work together?
  • A component that manages these services is needed
      • part of middleware
      • job companion
      • co-ordinates actions of FT services
  • Recovery action taken is more appropriate, because:
      • whole job state is considered
      • the most suitable of available services could be used
[Diagram: Fault Tolerant Manager; Infrastructure Mon. Services; Application Mon. Services; Checkpointing Services; Scheduler Services; Recovery Services]

FT Manager – Architecture
[Diagram. Fault Tolerant Manager components: Job Supervisor, Decision Maker, Recovery Scenario Executor. Services: Infrastructure Mon. Services, Application Mon. Services, Checkpointing Services, Scheduler Services, Recovery Services. Other labels: Application, Infrastructure, Application Monitoring, Checkpointing]

Job Supervisor (1)
  • Main functionality:
      • Monitors job execution
      • Manages (or stores information about) checkpointing
      • When something is wrong, generates a Fault Alarm
  • A Fault Alarm contains not only the information about what is wrong, but also the status of the job (e.g. last checkpoint)
  • The Job Supervisor can be asked by the Decision Maker to perform more checking
[Diagram: Fault Tolerant Manager with Job Supervisor, Fault Alarm, Decision Maker, Recovery Scenario Executor]
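One possible shape for the Fault Alarm described on this slide is sketched below. All names here (FaultAlarm, JobSupervisor, report, record_checkpoint, the notify callback standing in for the Decision Maker) are assumptions made for illustration; the slides do not define an interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class FaultAlarm:
    """What is wrong, plus the job status needed to plan recovery."""
    job_id: str
    component: str
    fault: str                      # e.g. "process crash"
    last_checkpoint: Optional[str]  # None if no checkpoint was taken yet


class JobSupervisor:
    """Hypothetical supervisor: tracks checkpoints and raises Fault Alarms."""

    def __init__(self, job_id: str, notify: Callable[[FaultAlarm], None]) -> None:
        self.job_id = job_id
        self.notify = notify  # stands in for the Decision Maker's entry point
        self.last_checkpoint: Optional[str] = None

    def record_checkpoint(self, checkpoint_id: str) -> None:
        self.last_checkpoint = checkpoint_id

    def report(self, component: str, fault: Optional[str]) -> Optional[FaultAlarm]:
        """Process one monitoring result; fault=None means the component is healthy."""
        if fault is None:
            return None
        alarm = FaultAlarm(self.job_id, component, fault, self.last_checkpoint)
        self.notify(alarm)
        return alarm


received: List[FaultAlarm] = []
supervisor = JobSupervisor("job-42", received.append)
supervisor.record_checkpoint("ckpt-0007")
supervisor.report("worker-3", None)             # healthy, no alarm
supervisor.report("worker-3", "process crash")  # generates a Fault Alarm
```

After this runs, received holds a single alarm that carries both the fault and the last known checkpoint, so the receiver does not need to query the job status separately.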

Job Supervisor (2) – Faults
  • Typical examples of faults:
      • process crash
      • node is not responding
      • lost connection (link is down)
  • Extended fault characteristics:
      • Occurrence and duration characteristics
      • Severity for the application
          • E.g. a master fault is more dangerous than a slave fault
          • A fault is not only a lost connection, but also a dramatic decrease in performance
      • Sophisticated performance monitoring is required
[Diagram: Fault Tolerant Manager with Job Supervisor, Fault Alarm, Decision Maker, Recovery Scenario Executor]
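The extended fault characteristics can be sketched as two small functions: one that treats a dramatic performance drop as a fault, and one that ranks severity by the role of the failed process. The function names, the role strings and the 20% threshold are assumptions chosen for illustration only.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    LOW = 1
    HIGH = 2


def severity_for(role: str) -> Severity:
    """A master fault is treated as more dangerous than a slave fault."""
    return Severity.HIGH if role == "master" else Severity.LOW


def detect_fault(
    connected: bool,
    throughput: float,
    baseline: float,
    degradation_threshold: float = 0.2,  # assumed: below 20% of baseline is a fault
) -> Optional[str]:
    """Return a fault name, or None if the component looks healthy."""
    if not connected:
        return "lost connection"
    if baseline > 0 and throughput < degradation_threshold * baseline:
        return "performance degradation"
    return None
```

With a baseline of 100 units, a measured throughput of 5 is reported as a performance fault even though the connection is still up, while a throughput of 90 is not.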

Decision Maker
  • Main functionality:
      • Analyzes the situation when it gets a fault alarm
      • Prepares recovery scenarios and sends the best of them for execution
  • Issues to be considered:
      • What is possible
      • The cost of each recovery scenario
      • A do-nothing or wait scenario is always possible and sometimes beneficial
          • E.g. in case of a problem with a network link, when the only recovery is to restart the whole application
      • Historical data and probabilistic methods should be used
[Diagram: Fault Tolerant Manager with Job Supervisor, Fault Alarm, Decision Maker, Recovery Scenario Executor]
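A simple way to compare scenarios, including the wait scenario, is by expected cost. The sketch below assumes each scenario has a direct cost, a success probability estimated from historical data, and a fallback cost paid if it fails. This cost model, the Scenario class and the numbers are illustrative assumptions, not the method of the proposal.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Scenario:
    name: str
    cost: float                 # cost of carrying out the scenario, e.g. minutes
    success_probability: float  # assumed to come from historical data
    fallback_cost: float        # extra cost if the scenario fails

    def expected_cost(self) -> float:
        return self.cost + (1.0 - self.success_probability) * self.fallback_cost


def choose_scenario(scenarios: Sequence[Scenario]) -> Scenario:
    """Pick the scenario with the lowest expected cost."""
    if not scenarios:
        raise ValueError("at least one scenario (e.g. do-nothing) is required")
    return min(scenarios, key=lambda s: s.expected_cost())


# Network link problem: wait 5 minutes, or restart the whole application (60 minutes).
restart = Scenario("restart application", cost=60.0, success_probability=1.0, fallback_cost=0.0)
wait_likely = Scenario("wait", cost=5.0, success_probability=0.8, fallback_cost=60.0)
wait_unlikely = Scenario("wait", cost=5.0, success_probability=0.05, fallback_cost=60.0)

best_when_link_usually_returns = choose_scenario([restart, wait_likely])
best_when_link_rarely_returns = choose_scenario([restart, wait_unlikely])
```

If the link usually comes back, waiting has the lower expected cost and is chosen; if it rarely does, the restart wins.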

Recovery Scenario Executor
  • Main functionality:
      • Executes actions from scenario
      • Supervises recovery process
  • Recovery Scenario contains several actions that could be performed by different recovery services
  • In case of failure in scenario execution, Decision Maker is alarmed
[Diagram: Fault Tolerant Manager with Job Supervisor, Decision Maker, Recovery Scenario Executor]
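The behaviour described on this slide can be sketched as a loop over actions that stops and alarms the Decision Maker at the first failure. The class, the callback and the action names are hypothetical; in a real system each action would call a recovery service.

```python
from typing import Callable, List, Tuple

# (description, callable returning True on success); each callable stands in
# for a call to some recovery service.
Action = Tuple[str, Callable[[], bool]]


class RecoveryScenarioExecutor:
    """Hypothetical executor that runs scenario actions in order."""

    def __init__(self, alarm_decision_maker: Callable[[str], None]) -> None:
        self.alarm_decision_maker = alarm_decision_maker

    def execute(self, scenario: List[Action]) -> bool:
        """Run actions in order; stop and alarm on the first failure."""
        for name, action in scenario:
            try:
                succeeded = action()
            except Exception:
                succeeded = False
            if not succeeded:
                self.alarm_decision_maker(name)
                return False
        return True


alarms: List[str] = []
performed: List[str] = []


def step(name: str, result: bool) -> Action:
    def run() -> bool:
        performed.append(name)
        return result
    return (name, run)


executor = RecoveryScenarioExecutor(alarms.append)
ok = executor.execute([
    step("allocate new node", True),
    step("restart process", False),  # simulated failure
    step("restore checkpoint", True),
])
```

Here the second action fails, so the third is never attempted and the Decision Maker callback receives the name of the failed action.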

Problems
  • Many classes of services to cooperate with
  • Many interfaces
  • How to obtain information about the application?
  • Which services are available?
  • Semantic specification for monitoring and recovery services is needed

Feasibility – Workflows
  • Grid-Services-based approach could help to solve our problems
  • Knowledge about application architecture is accessible
  • Workflow description details are welcome
  • Exchanging a single component is better than restarting the whole application
  • Directives for FT Manager could be included in job description
  • Interfaces are unified
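As a rough illustration of directives embedded in a job description, the sketch below attaches a per-component recovery directive to a workflow and falls back to a whole-application restart when none is given. The component names, directive names and dictionary layout are invented for this example; the slides do not specify a format.

```python
from typing import Dict

# Hypothetical job description: each workflow component carries FT directives.
JOB_DESCRIPTION: Dict[str, Dict[str, object]] = {
    "preprocess": {"on_failure": "restart_component", "max_retries": 3},
    "solver": {"on_failure": "rollback_to_checkpoint", "max_retries": 1},
    "visualise": {"on_failure": "ignore", "max_retries": 0},
}


def recovery_directive(
    job: Dict[str, Dict[str, object]],
    component: str,
    default: str = "restart_application",
) -> str:
    """Return the directive for a component, or the default if none is given."""
    return str(job.get(component, {}).get("on_failure", default))
```

With such directives, a failure in the solver component leads to a rollback of that component only, while a component without directives still triggers the coarse default.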

Summary
  • Fault tolerance issues become more and more important in the Grid
  • A service for fault tolerance management has been proposed
  • ... which enables more sophisticated fault tolerance for the Grid
  • A workflow-based framework facilitates the task
  • But this is a proposal only...
Your comments and remarks are welcome!