Recovery

- Slides: 11

Recovery

[Figure: a fault causes an erroneous state, which leads to a failure; recovery moves the system from the erroneous state back to a valid state.]

An error is a manifestation of a fault that can lead to a failure.

Failure recovery:
• backward recovery
  • operation based (do-undo-redo logs)
  • state based (checkpoints)
• forward recovery
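
The two flavours of backward recovery listed above can be contrasted with a minimal sketch. The class names `UndoLogStore` and `CheckpointStore` and their methods are hypothetical, chosen only for illustration; the sketch covers the undo part of a do-undo-redo log and keeps all state in memory.

```python
import copy


class UndoLogStore:
    """Operation-based backward recovery: log the old value before each write."""

    def __init__(self):
        self.data = {}
        self.undo_log = []  # entries: (key, key_existed, old_value)

    def write(self, key, value):
        # Record enough information to reverse this single operation.
        self.undo_log.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def commit(self):
        # Committed writes can no longer be undone.
        self.undo_log.clear()

    def rollback(self):
        # Undo uncommitted writes in reverse order.
        while self.undo_log:
            key, existed, old_value = self.undo_log.pop()
            if existed:
                self.data[key] = old_value
            else:
                del self.data[key]


class CheckpointStore:
    """State-based backward recovery: save the whole state, restore it later."""

    def __init__(self):
        self.data = {}
        self._saved = {}

    def write(self, key, value):
        self.data[key] = value

    def checkpoint(self):
        self._saved = copy.deepcopy(self.data)

    def rollback(self):
        self.data = copy.deepcopy(self._saved)
```

The undo log pays a small cost on every write but restores only what changed; the checkpoint pays nothing per write but copies the full state each time it is taken or restored.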
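
Domino Effect

[Figure: timelines for processes X, Y, and Z, with checkpoints x1, x2, x3 on X, y1, y2 on Y, and z1, z2 on Z, and messages m and n exchanged between the processes.]

Cases:
• X fails after x3
• Y fails after sending message m
• Z fails after sending message n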
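
The cascading rollback behind the three cases can be reproduced with a small simulation. This is a sketch under stated assumptions: the timeline below (checkpoint times, message times, and the extra messages `a` and `b`) is hypothetical and is not read off the original figure, events are placed on one global clock with distinct times, and the function name `rollback_points` is illustrative. A process is forced back whenever it has recorded the receipt of a message whose send has been undone.

```python
from bisect import bisect_left, bisect_right


def rollback_points(checkpoints, messages, failed, fail_time):
    """Return {process: checkpoint time it restarts from} for affected processes.

    checkpoints: {process: sorted checkpoint times, starting with 0}
    messages:    list of (sender, send_time, receiver, recv_time)
    """
    def last_before(proc, t):
        times = checkpoints[proc]
        return times[bisect_left(times, t) - 1]

    # inf means "keeps its current state, no rollback".
    point = {p: float("inf") for p in checkpoints}
    times = checkpoints[failed]
    point[failed] = times[bisect_right(times, fail_time) - 1]

    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Orphan: the send is undone but the receipt is still recorded.
            if sent > point[sender] and received < point[receiver]:
                point[receiver] = last_before(receiver, received)
                changed = True
    return {p: t for p, t in point.items() if t != float("inf")}


# Hypothetical timeline (assumption, not taken from the slide's figure).
CHECKPOINTS = {"X": [0, 30, 70], "Y": [0, 40], "Z": [0, 20]}
MESSAGES = [
    ("Y", 10, "X", 15),  # a
    ("Y", 12, "Z", 18),  # b
    ("Y", 45, "X", 55),  # m
    ("Z", 25, "Y", 35),  # n
]

print(rollback_points(CHECKPOINTS, MESSAGES, "X", 80))
print(rollback_points(CHECKPOINTS, MESSAGES, "Y", 50))
print(rollback_points(CHECKPOINTS, MESSAGES, "Z", 62))
```

With this timeline, a failure of X touches only X, a failure of Y after sending m also drags X back one checkpoint, and a failure of Z after sending n pushes all three processes back to their initial checkpoints.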
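
Orphan Message

[Figure: timelines for X and Y with checkpoints x1 and y1 and a message m between them.]

The checkpoint set {x1, y1} is inconsistent.

Lost Messages

[Figure: timelines for X and Y with checkpoints x1 and y1 and a message m between them.]

The checkpoint set {x1, y1} is inconsistent.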

Strongly Consistent Checkpoints

[Figure: timelines for X, Y, and Z with checkpoints x1, x2, y1, y2, z1, z2 and a message m; one interval is annotated "No messages can be in transit during this interval".]

• The set of checkpoints {x1, y1, z1} is strongly consistent: no recovery ever needs to roll back past this set, so the set stops the domino effect.
• {x2, y2, z2} is a consistent set.
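
A checkpoint set can be tested mechanically for orphan messages and for messages in transit. The sketch below assumes the usual definitions: a message is an orphan when its receipt is recorded in the receiver's checkpoint but its send is not recorded in the sender's, and it is in transit (lost on rollback) when the send is recorded but the receipt is not. The function names and the sample times are hypothetical, and events are placed on a single global clock for simplicity.

```python
def orphans(checkpoint_time, messages):
    """Messages whose receipt is recorded but whose send is not."""
    return [
        (s, sent, r, received)
        for s, sent, r, received in messages
        if received < checkpoint_time[r] and not sent < checkpoint_time[s]
    ]


def in_transit(checkpoint_time, messages):
    """Messages whose send is recorded but whose receipt is not."""
    return [
        (s, sent, r, received)
        for s, sent, r, received in messages
        if sent < checkpoint_time[s] and not received < checkpoint_time[r]
    ]


def is_strongly_consistent(checkpoint_time, messages):
    """No orphan messages and no messages in transit across the set."""
    return not orphans(checkpoint_time, messages) and not in_transit(
        checkpoint_time, messages
    )


# Message m: X sends at time 6, Y receives at time 8 (hypothetical times).
m = ("X", 6, "Y", 8)

print(orphans({"X": 5, "Y": 10}, [m]))                # m is an orphan
print(in_transit({"X": 7, "Y": 7}, [m]))              # m is in transit
print(is_strongly_consistent({"X": 9, "Y": 9}, [m]))  # both events recorded
```

Each checkpoint set is given as a map from process name to the time its checkpoint was taken, so the same message can be classified against several candidate sets.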

Checkpoint Notation

Each node maintains:
• a monotonically increasing counter, used to label each message sent from that node
• records of the last message received from, the last message sent to, and the first message sent to every other node

[Figure: a message m.l (a message m and its label l) between X and Y, annotated with last_label_rcvd_X[Y], last_label_sent_X[Y], and first_label_sent_Y[X].]

Note: “sl” denotes a “smallest label” that is < any other label, and “ll” denotes a “largest label” that is > any other label.
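
The bookkeeping described above can be sketched as a small class. The names `Node`, `send`, and `receive` are hypothetical; `SL` and `LL` model the "sl" and "ll" sentinels with infinities. Starting every record at `SL` is an assumption of this sketch, since the slide does not say which sentinel each record starts from.

```python
import math

SL = -math.inf  # "sl": smaller than any real label
LL = math.inf   # "ll": larger than any real label


class Node:
    def __init__(self, name, peers):
        self.name = name
        self.counter = 0  # monotonically increasing label source
        # Assumption: all records start at SL ("no message yet").
        self.last_label_rcvd = {p: SL for p in peers}
        self.last_label_sent = {p: SL for p in peers}
        self.first_label_sent = {p: SL for p in peers}

    def send(self, dest):
        """Label an outgoing message and update the send records."""
        self.counter += 1
        label = self.counter
        if self.first_label_sent[dest] == SL:
            self.first_label_sent[dest] = label
        self.last_label_sent[dest] = label
        return label

    def receive(self, src, label):
        """Record the label of the latest message received from src."""
        self.last_label_rcvd[src] = label


x = Node("X", ["Y"])
y = Node("Y", ["X"])
x.receive("Y", y.send("X"))  # first message from Y to X
x.receive("Y", y.send("X"))  # second message from Y to X
print(y.first_label_sent["X"], y.last_label_sent["X"], x.last_label_rcvd["Y"])
```

Because labels come from a per-node counter, comparing a label received by X from Y with a label recorded by Y for X is meaningful: both refer to Y's own numbering.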

Checkpoint Algorithm

(1) When must I take a checkpoint?
(2) Who else has to take a checkpoint when I do?

[Figure: timelines for X, Y, and Z with checkpoints x1, y1, y2, z1, z2, a tentative checkpoint x2 on X, and a message m.]

(1) When I (Y) have sent a message to the checkpointing process, X, since my last checkpoint:
    last_label_rcvd_X[Y] >= first_label_sent_Y[X] > sl
(2) Any other process from whom I have received messages since my last checkpoint:
    ckpt_cohort_X = {Y | last_label_rcvd_X[Y] > sl}

Checkpoint Algorithm

Initiator process Pi:
    for all p ∈ ckpt_cohort_Pi
        send TakeTentativeCheckpoint(Pi, last_label_rcvd_Pi[p]);
    if all cohorts replied “yes” then
        for all p ∈ ckpt_cohort_Pi send MakeCheckpointPermanent;
    else
        for all p ∈ ckpt_cohort_Pi send UndoTentativeCheckpoint;

Checkpoint Algorithm

A cohort process, p:

On receiving TakeTentativeCheckpoint(q, last_label_rcvd_q[p]):
    if OK_to_take_ckpt_p = “yes” and last_label_rcvd_q[p] >= first_label_sent_p[q] > sl then
        take a tentative checkpoint;
        for all r ∈ ckpt_cohort_p
            send TakeTentativeCheckpoint(p, last_label_rcvd_p[r]);
        if all cohorts replied “yes” then OK_to_take_ckpt_p := “yes”;
        else OK_to_take_ckpt_p := “no”;
    send (p, OK_to_take_ckpt_p) to q;

On receiving MakeCheckpointPermanent:
    make checkpoint permanent;
    for all r ∈ ckpt_cohort_p send MakeCheckpointPermanent;

On receiving UndoTentativeCheckpoint:
    undo tentative checkpoint;
    for all r ∈ ckpt_cohort_p send UndoTentativeCheckpoint;
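
The initiator and cohort roles above can be exercised in a single-threaded sketch. Several things here are assumptions rather than part of the slides: messages are delivered instantly by direct method calls, the `Process` class and its `system` registry are hypothetical, the initiator takes its own tentative checkpoint before contacting its cohort, label records are reset when a checkpoint becomes permanent, and the `tentative` flag is used to stop requests from looping when two processes have exchanged messages in both directions.

```python
import math

SL = -math.inf  # "sl": smaller than every real label


class Process:
    def __init__(self, name, system):
        self.name = name
        self.system = system        # hypothetical registry: name -> Process
        system[name] = self
        self.counter = 0
        self.last_label_rcvd = {}   # sender -> label, since last checkpoint
        self.first_label_sent = {}  # receiver -> label, since last checkpoint
        self.ok_to_take_ckpt = True
        self.tentative = False
        self.permanent = 0          # number of permanent checkpoints taken

    def send(self, dest):
        """Deliver a labelled message to dest immediately (no network model)."""
        self.counter += 1
        self.first_label_sent.setdefault(dest, self.counter)
        self.system[dest].last_label_rcvd[self.name] = self.counter

    def cohort(self):
        return [q for q, label in self.last_label_rcvd.items() if label > SL]

    def initiate(self):
        self.tentative = True
        replies = [
            self.system[q].on_take_tentative(self.name, self.last_label_rcvd[q])
            for q in self.cohort()
        ]
        if all(replies):
            self.on_make_permanent()
        else:
            self.on_undo()
        return all(replies)

    def on_take_tentative(self, q, last_label_rcvd_q):
        ok = self.ok_to_take_ckpt
        first_sent = self.first_label_sent.get(q, SL)
        if ok and not self.tentative and last_label_rcvd_q >= first_sent > SL:
            self.tentative = True
            ok = all([
                self.system[r].on_take_tentative(self.name, self.last_label_rcvd[r])
                for r in self.cohort()
            ])
        return ok  # the reply sent back to q

    def on_make_permanent(self):
        if not self.tentative:
            return
        self.tentative = False
        self.permanent += 1
        cohort = self.cohort()
        self.last_label_rcvd.clear()   # a new checkpoint interval starts
        self.first_label_sent.clear()
        for r in cohort:
            self.system[r].on_make_permanent()

    def on_undo(self):
        if not self.tentative:
            return
        self.tentative = False
        for r in self.cohort():
            self.system[r].on_undo()


system = {}
x, y, z = Process("X", system), Process("Y", system), Process("Z", system)
y.send("X")  # X has received from Y since Y's last checkpoint
z.send("Y")  # Y has received from Z since Z's last checkpoint
print(x.initiate(), x.permanent, y.permanent, z.permanent)
```

In this run X's request reaches Y, Y's request reaches Z, every cohort answers "yes", and all three tentative checkpoints become permanent. Setting `ok_to_take_ckpt = False` on any cohort before calling `initiate` makes the round end with every tentative checkpoint undone.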

Rollback Algorithm

(1) When must I roll back?
(2) Who else might have to roll back when I do?

[Figure: timelines for X, Y, and Z with checkpoints x1, x2, y1, y2, z1, z2.]

(1) When I (Y) have received a message from the restarting process, X, since X's last checkpoint:
    last_label_rcvd_Y[X] > last_label_sent_X[Y]
(2) Any other process to whom I can send messages:
    roll_cohort_X = {Y | X can send a message to Y}

Rollback Algorithm

Initiator process Pi:
    for all p ∈ roll_cohort_Pi
        send PrepareToRollback(Pi, last_label_sent_Pi[p]);
    if all cohorts replied “yes” then
        for all p ∈ roll_cohort_Pi send Roll_back;
    else
        for all p ∈ roll_cohort_Pi send DoNot_Roll_Back;

Rollback Algorithm

A cohort process, p:

On receiving PrepareToRollback(q, last_label_sent_q[p]):
    if willing_to_roll_p and last_label_rcvd_p[q] > last_label_sent_q[p] and resume_execution_p then
        resume_execution_p := false;
        for all r ∈ roll_cohort_p
            send PrepareToRollback(p, last_label_sent_p[r]);
        if all r ∈ roll_cohort_p replied “yes” then willing_to_roll_p := “yes”;
        else willing_to_roll_p := “no”;
    send (p, willing_to_roll_p) to q;

On receiving Roll_back, if resume_execution_p = false:
    restart from p's permanent checkpoint;
    for all r ∈ roll_cohort_p send Roll_back;

On receiving DoNot_Roll_Back:
    resume execution;
    for all r ∈ roll_cohort_p send DoNot_Roll_Back;
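
The rollback handlers above can be traced with a single-threaded sketch in the same spirit. The assumptions are labelled here and in the comments: messages are delivered instantly by method calls, the `Process` class, its `system` registry, and the `restarts` counter are hypothetical, `last_label_sent` is snapshotted when `checkpoint()` is called and is treated as sl when nothing was sent before the checkpoint (the slides do not state this default), and the `resume_execution` flag is also used to stop `Roll_back` and `DoNot_Roll_Back` from circulating forever in a cyclic topology.

```python
import math

SL = -math.inf  # "sl": smaller than every real label


class Process:
    def __init__(self, name, system, roll_cohort=()):
        self.name = name
        self.system = system          # hypothetical registry: name -> Process
        system[name] = self
        self.roll_cohort = list(roll_cohort)  # processes this one can send to
        self.counter = 0
        self.last_label_rcvd = {}     # sender -> last label received
        self.last_label_sent = {}     # receiver -> last label sent so far
        self.ckpt_sent = {}           # last_label_sent as of the checkpoint
        self.ckpt_rcvd = {}           # last_label_rcvd as of the checkpoint
        self.willing_to_roll = True
        self.resume_execution = True
        self.restarts = 0

    def send(self, dest):
        self.counter += 1
        self.last_label_sent[dest] = self.counter
        self.system[dest].last_label_rcvd[self.name] = self.counter

    def checkpoint(self):
        """Record a permanent checkpoint (taken locally, for illustration)."""
        self.ckpt_sent = dict(self.last_label_sent)
        self.ckpt_rcvd = dict(self.last_label_rcvd)

    def _prepare_cohort(self):
        # Assumption: SL stands for "nothing sent before the checkpoint".
        return [
            self.system[r].on_prepare(self.name, self.ckpt_sent.get(r, SL))
            for r in self.roll_cohort
        ]

    def initiate_rollback(self):
        self.resume_execution = False
        agreed = all(self._prepare_cohort())
        if agreed:
            self.on_roll_back()
        else:
            self.on_do_not_roll_back()
        return agreed

    def on_prepare(self, q, last_label_sent_q):
        willing = self.willing_to_roll
        received = self.last_label_rcvd.get(q, SL)
        if willing and received > last_label_sent_q and self.resume_execution:
            self.resume_execution = False
            willing = all(self._prepare_cohort())
        return willing  # the reply sent back to q

    def on_roll_back(self):
        if self.resume_execution:
            return  # not prepared, or already restarted
        self.last_label_sent = dict(self.ckpt_sent)
        self.last_label_rcvd = dict(self.ckpt_rcvd)
        self.restarts += 1
        self.resume_execution = True
        for r in self.roll_cohort:
            self.system[r].on_roll_back()

    def on_do_not_roll_back(self):
        if self.resume_execution:
            return
        self.resume_execution = True
        for r in self.roll_cohort:
            self.system[r].on_do_not_roll_back()


system = {}
x = Process("X", system, ["Y"])
y = Process("Y", system, ["Z"])
z = Process("Z", system)
for p in (x, y, z):
    p.checkpoint()
x.send("Y")  # sent after X's checkpoint
y.send("Z")  # sent after Y's checkpoint
print(x.initiate_rollback(), x.restarts, y.restarts, z.restarts)
```

Here Y has received a message that X's checkpoint does not record as sent, and Z is in the same position with respect to Y, so all three processes restart. If every process checkpoints again after the messages are exchanged, a later rollback of X leaves Y and Z running, because the labels they received no longer exceed the labels recorded in the senders' checkpoints.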