Recovery fault causes erroneous state leads to failure

  • Slides: 11

Recovery

[Figure: a fault causes an erroneous state (error), which leads to failure; recovery takes the erroneous state back to a valid state.]

An error is a manifestation of a fault that can lead to a failure.

Failure recovery:
  • backward recovery
      • operation based (do-undo-redo logs)
      • state based (checkpoints)
  • forward recovery
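The two backward-recovery styles can be contrasted with a small sketch. The `Store` class and its method names are hypothetical and only illustrate the idea: an undo log reverses individual operations, while a checkpoint restores a whole saved state.

```python
import copy


class Store:
    """Toy key-value state supporting both styles of backward recovery."""

    def __init__(self):
        self.data = {}
        self.undo_log = []    # operation based: (key, key_existed, old_value)
        self.checkpoint = {}  # state based: full copy of the state

    def write(self, key, value):
        # Record enough information to undo this write later.
        self.undo_log.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def undo_last(self):
        # Operation-based recovery: reverse the most recent write.
        key, existed, old = self.undo_log.pop()
        if existed:
            self.data[key] = old
        else:
            del self.data[key]

    def take_checkpoint(self):
        # State-based recovery: save the entire state.
        self.checkpoint = copy.deepcopy(self.data)
        self.undo_log.clear()

    def restore_checkpoint(self):
        self.data = copy.deepcopy(self.checkpoint)
        self.undo_log.clear()


def demo_backward_recovery():
    s = Store()
    s.write("a", 1)
    s.take_checkpoint()
    s.write("a", 2)
    s.write("b", 3)
    s.undo_last()                 # undoes only the write to "b"
    after_undo = dict(s.data)
    s.restore_checkpoint()        # discards everything since the checkpoint
    return after_undo, dict(s.data)
```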

Domino Effect

[Figure: timelines of processes X, Y and Z with checkpoints x1, x2, x3 on X, y1, y2 on Y, and z1, z2 on Z, and messages m and n.]

Cases:
  • X fails after x3
  • Y fails after sending message m
  • Z fails after sending message n

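Orphan Message

[Figure: processes X and Y with checkpoints x1 and y1 and message m.]

The checkpoint set {x1, y1} is inconsistent.

Lost Messages

[Figure: processes X and Y with checkpoints x1 and y1 and message m.]

The checkpoint set {x1, y1} is inconsistent.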
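A rough way to detect both situations for a given checkpoint set is sketched below. It assumes the usual reading of the terms: an orphan message has its receipt recorded in the receiver's checkpoint but not its send in the sender's, and a lost message is the reverse. The `Msg` record and the use of per-process local event counts as timestamps are assumptions made for illustration.

```python
from typing import NamedTuple


class Msg(NamedTuple):
    name: str
    sender: str
    send_at: int    # sender's local event count at the send
    receiver: str
    recv_at: int    # receiver's local event count at the receipt


def orphan_messages(ckpt, msgs):
    """Messages whose receipt is checkpointed but whose send is not."""
    return [m.name for m in msgs
            if m.recv_at <= ckpt[m.receiver] and m.send_at > ckpt[m.sender]]


def lost_messages(ckpt, msgs):
    """Messages whose send is checkpointed but whose receipt is not."""
    return [m.name for m in msgs
            if m.send_at <= ckpt[m.sender] and m.recv_at > ckpt[m.receiver]]
```

Here `ckpt` maps each process name to the local event count at which its checkpoint was taken, so `{"X": 5, "Y": 5}` stands in for the set {x1, y1}.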

Strongly Consistent Checkpoints

[Figure: processes X, Y and Z with checkpoints x1, x2 on X, y1, y2 on Y, and z1, z2 on Z, and message m. Annotation: "No messages can be in transit during this interval".]

The set of checkpoints {x1, y1, z1} is strongly consistent: no recovery ever needs to roll back past this set. The set stops the domino effect.

{x2, y2, z2} is a consistent set.

Checkpoint Notation

Each node maintains:
  • a monotonically increasing counter with which each message from that node is labelled.
  • records of the last message from/to and the first message to all other nodes.

[Figure: message m.l (a message m and its label l) between X and Y, annotated with last_label_rcvd_X[Y] and last_label_sent_X[Y] at X and first_label_sent_Y[X] at Y.]

Note: “sl” denotes a “smallest label” that is < any other label, and “ll” denotes a “largest label” that is > any other label.
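The bookkeeping above could be held in a structure like the following. This is only a sketch: the `Node` class, its method names, the use of infinities for sl and ll, and the reset in `reset_after_checkpoint` (based on the "since my last checkpoint" wording of the algorithm) are assumptions, not part of the slides.

```python
SL = float("-inf")  # "smallest label": less than any real label
LL = float("inf")   # "largest label": greater than any real label


class Node:
    def __init__(self, name, peers):
        self.name = name
        self.counter = 0  # monotonically increasing message label
        self.last_label_rcvd = {p: SL for p in peers}
        self.last_label_sent = {p: SL for p in peers}
        self.first_label_sent = {p: SL for p in peers}

    def send(self, dest):
        """Label an outgoing message and update the send records."""
        self.counter += 1
        label = self.counter
        if self.first_label_sent[dest] == SL:
            self.first_label_sent[dest] = label
        self.last_label_sent[dest] = label
        return label

    def receive(self, src, label):
        """Record the label of the latest message received from src."""
        self.last_label_rcvd[src] = label

    def reset_after_checkpoint(self):
        # Assumption: "since my last checkpoint" records start over at sl.
        for p in self.last_label_rcvd:
            self.last_label_rcvd[p] = SL
            self.first_label_sent[p] = SL
```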

Checkpoint Algorithm

(1) When must I take a checkpoint?
(2) Who else has to take a checkpoint when I do?

[Figure: processes X, Y and Z with checkpoints x1, x2 on X, y1, y2 on Y, and z1, z2 on Z; x2 is a tentative checkpoint; message m.]

(1) When I (Y) have sent a message to the checkpointing process, X, since my last checkpoint:
    last_label_rcvd_X[Y] >= first_label_sent_Y[X] > sl
(2) Any other process from whom I have received messages since my last checkpoint:
    ckpt_cohort_X = {Y | last_label_rcvd_X[Y] > sl}
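The two conditions translate almost directly into code. In this sketch the function names are hypothetical, sl is modelled as negative infinity, and the label records are plain dictionaries keyed by process name.

```python
SL = float("-inf")  # stand-in for the "smallest label" sl


def must_take_checkpoint(last_label_rcvd_x_y, first_label_sent_y_x):
    """Condition (1): Y must checkpoint when X asks.

    True when Y sent X a message since Y's last checkpoint and X has
    received a message from Y at least as recent as that first one.
    """
    return last_label_rcvd_x_y >= first_label_sent_y_x > SL


def ckpt_cohort(last_label_rcvd):
    """Condition (2): every process we received from since our last checkpoint."""
    return {p for p, label in last_label_rcvd.items() if label > SL}
```

For example, `ckpt_cohort({"Y": 3, "Z": SL})` yields only Y, because nothing has been received from Z since the last checkpoint.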

Checkpoint Algorithm

Initiator process Pi:
    for all p in ckpt_cohort_Pi
        send TakeTentativeCheckpoint(Pi, last_label_rcvd_Pi[p]);
    if all cohorts replied “yes” then
        for all p in ckpt_cohort_Pi
            send MakeCheckpointPermanent;
    else
        for all p in ckpt_cohort_Pi
            send UndoTentativeCheckpoint;

Checkpoint Algorithm

A cohort process, p:

On receiving TakeTentativeCheckpoint(q, last_label_rcvd_q[p]):
    if OK_to_take_ckpt_p = “yes” and last_label_rcvd_q[p] >= first_label_sent_p[q] > sl then
        take a tentative checkpoint;
        for all r in ckpt_cohort_p
            send TakeTentativeCheckpoint(p, last_label_rcvd_p[r]);
        if all cohorts replied “yes” then OK_to_take_ckpt_p := “yes”;
        else OK_to_take_ckpt_p := “no”;
    send(p, OK_to_take_ckpt_p) to q;

On receiving MakeCheckpointPermanent:
    make checkpoint permanent;
    for all r in ckpt_cohort_p
        send MakeCheckpointPermanent;

On receiving UndoTentativeCheckpoint:
    undo tentative checkpoint;
    for all r in ckpt_cohort_p
        send UndoTentativeCheckpoint;
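The initiator and cohort steps can be tried out in a single-threaded simulation where message passing is replaced by function calls. This is a sketch under several assumptions that are not in the slides: the `Proc` class and function names are hypothetical, the initiator takes its own tentative checkpoint first, a process that already holds a tentative checkpoint simply replies "yes", and the final commit or undo is applied to every tentative holder in one loop instead of being forwarded cohort by cohort.

```python
SL = float("-inf")  # stand-in for the "smallest label" sl


class Proc:
    def __init__(self, name):
        self.name = name
        self.ok_to_take_ckpt = True
        self.last_label_rcvd = {}   # peer -> last label received since last checkpoint
        self.first_label_sent = {}  # peer -> first label sent since last checkpoint
        self.tentative = False
        self.permanent = 0          # count of permanent checkpoints

    def cohort(self):
        return [q for q, label in self.last_label_rcvd.items() if label > SL]


def on_take_tentative(procs, p, q, label_q_got_from_p):
    """Cohort p handles a tentative-checkpoint request from q; returns its reply."""
    if p.tentative:
        return True                 # assumption: already part of this round
    if not p.ok_to_take_ckpt:
        return False
    if not (label_q_got_from_p >= p.first_label_sent.get(q.name, SL) > SL):
        return True                 # willing, but no checkpoint is needed
    p.tentative = True
    return all([on_take_tentative(procs, procs[r], p, p.last_label_rcvd[r])
                for r in p.cohort()])


def initiate_checkpoint(procs, name):
    """Initiator: collect replies, then commit or undo all tentative checkpoints."""
    pi = procs[name]
    pi.tentative = True             # assumption: initiator checkpoints first
    ok = all([on_take_tentative(procs, procs[c], pi, pi.last_label_rcvd[c])
              for c in pi.cohort()])
    for p in procs.values():        # simplification of the second phase
        if p.tentative:
            p.tentative = False
            if ok:
                p.permanent += 1
                p.last_label_rcvd.clear()
                p.first_label_sent.clear()
    return ok
```

If any process in the chain has `ok_to_take_ckpt` set to false, the "no" propagates back to the initiator and no checkpoint becomes permanent.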

Rollback Algorithm

(1) When must I roll back?
(2) Who else might have to roll back when I do?

[Figure: processes X, Y and Z with checkpoints x1, x2 on X, y1, y2 on Y, and z1, z2 on Z.]

(1) When I, Y, have received a message from the restarting process, X, since X's last checkpoint:
    last_label_rcvd_Y[X] > last_label_sent_X[Y]
(2) Any other process to whom I can send messages:
    roll_cohort_X = {Y | X can send a message to Y}
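Both rollback conditions are simple enough to state as functions. The names below are hypothetical, and the `channels` mapping (process name to the processes it can send to) is an assumed way of representing "X can send a message to Y".

```python
def must_roll_back(last_label_rcvd_y_x, last_label_sent_x_y):
    """Condition (1): Y has seen a message that X sent after X's checkpoint."""
    return last_label_rcvd_y_x > last_label_sent_x_y


def roll_cohort(channels, x):
    """Condition (2): every process that x can send messages to."""
    return set(channels.get(x, ()))
```

For instance, if X's checkpoint records label 3 as the last message sent to Y but Y has received label 5 from X, then `must_roll_back(5, 3)` is true: Y holds a message that X, once restarted, will not remember sending.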

Rollback Algorithm

Initiator process Pi:
    for all p in roll_cohort_Pi
        send PrepareToRollback(Pi, last_label_sent_Pi[p]);
    if all cohorts replied “yes” then
        for all p in roll_cohort_Pi
            send Roll_back message;
    else
        for all p in roll_cohort_Pi
            send DoNot_Roll_Back message;

Rollback Algorithm

A cohort process, p:

On receiving PrepareToRollback(q, last_label_sent_q[p]):
    if willing_to_roll_p and last_label_rcvd_p[q] > last_label_sent_q[p] and resume_execution_p then
        resume_execution_p := false;
        for all r in roll_cohort_p
            send PrepareToRollback(p, last_label_sent_p[r]);
        if all r in roll_cohort_p replied “yes” then willing_to_roll_p := yes;
        else willing_to_roll_p := no;
    send(p, willing_to_roll_p) to q;

On receiving Roll_back and if resume_execution_p = false:
    restart from p's permanent checkpoint;
    for all r in roll_cohort_p
        send Roll_back;

On receiving DoNot_Roll_Back:
    resume execution;
    for all r in roll_cohort_p
        send DoNot_Roll_Back;
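The rollback protocol can be exercised in the same style as a single-threaded simulation. Again this is only a sketch: the `Proc` class and function names are hypothetical, a missing label record is treated as sl (an assumption), and the final Roll_back or DoNot_Roll_Back decision is applied to all suspended processes in one loop instead of being forwarded along the cohorts.

```python
SL = float("-inf")  # stand-in for the "smallest label" sl


class Proc:
    def __init__(self, name, out_peers):
        self.name = name
        self.roll_cohort = list(out_peers)  # processes this one can send to
        self.willing_to_roll = True
        self.resume_execution = True
        self.last_label_rcvd = {}  # sender -> last label received
        self.last_label_sent = {}  # receiver -> last label sent before the checkpoint
        self.rolled_back = False


def on_prepare(procs, p, q, last_sent_q_to_p):
    """Cohort p handles a prepare-to-rollback request from q; returns its reply."""
    if not p.willing_to_roll:
        return False
    needs_roll = p.last_label_rcvd.get(q.name, SL) > last_sent_q_to_p
    if needs_roll and p.resume_execution:
        p.resume_execution = False          # suspend until the decision arrives
        return all([on_prepare(procs, procs[r], p, p.last_label_sent.get(r, SL))
                    for r in p.roll_cohort])
    return True


def initiate_rollback(procs, name):
    """Initiator: collect replies, then roll back or resume every suspended process."""
    pi = procs[name]
    pi.resume_execution = False
    ok = all([on_prepare(procs, procs[c], pi, pi.last_label_sent.get(c, SL))
              for c in pi.roll_cohort])
    for p in procs.values():                # simplification of the second phase
        if not p.resume_execution:
            if ok:
                p.rolled_back = True        # restart from the permanent checkpoint
            p.resume_execution = True
    return ok
```

A process that received nothing newer than what the requester's checkpoint records stays out of the rollback, which is what keeps the set of restarted processes small.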