Recovery in Distributed Systems: Transaction Recovery (see Coulouris et al.)
Transaction Recovery-1
• The atomic property of transactions means that the effect of performing a transaction on behalf of one client is free from interference from concurrent transactions performed on behalf of other clients
• It also requires that the effects of all committed transactions are reflected in the data items, and that none of the effects of incomplete or aborted transactions are
Transaction Recovery-2
• Two aspects to consider
– Durability: requires that data items are saved in permanent storage and remain available indefinitely at the servers or storage sites
– Failure atomicity: requires that the effects of a transaction are atomic even when the server fails
– These two aspects are not completely independent; both can be handled by a single mechanism, the recovery manager, which must also support the two-phase commit protocol
Recovery Manager (RM)-1
• Restores the server’s database from the Recovery File (RF) after a crash; the RF itself needs to be resilient to media failure (stable storage)
• Reorganizes the RF to improve the performance of recovery
• Reclaims storage space in the RF, through the execution of the application
Recovery Manager (RM)-2
• The Recovery File (kept as a log) is used to recover a server involved in a distributed transaction
• The RF contains:
– The Trans Id and the status of the transaction (prepared, committed, or aborted)
– The data items that are part of the transaction, and their values
– The intentions list for the transaction
Recovery Manager (RM)-3
• The RF is a log containing the history of all the transactions performed
– It contains a checkpoint
– The order of entries reflects the order in which transactions have prepared, committed, and aborted
Intentions List
• Contains a list of data item names and the positions in the RF where the values of the data items altered by that transaction reside
– When a server is prepared to commit a transaction, the RM must save the intentions list in the RF; this ensures the server can carry out the commitment later, even if it crashes in the interim
– When a transaction is aborted, the RM uses the intentions list to delete all the tentative versions of data items made by that transaction
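The relationship between RF entries and the intentions list can be sketched in a few lines of Python. This is only an illustration under assumptions: the names `ObjectEntry`, `StatusEntry`, `RecoveryFile` and `prepare` are hypothetical, the RF is modelled as an in-memory list rather than a file on stable storage, and the data values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class ObjectEntry:
    """A tentative value of one data item written to the RF."""
    name: str
    value: int


@dataclass
class StatusEntry:
    """Transaction status entry; a prepared entry carries the intentions list."""
    trans_id: str
    status: str  # "prepared", "committed" or "aborted"
    # Intentions list: (data item name, position of its value in the RF)
    intentions: List[Tuple[str, int]] = field(default_factory=list)


class RecoveryFile:
    """Append-only log; the list index plays the role of a position P0, P1, ..."""

    def __init__(self) -> None:
        self.entries: List[Union[ObjectEntry, StatusEntry]] = []

    def append(self, entry: Union[ObjectEntry, StatusEntry]) -> int:
        self.entries.append(entry)
        return len(self.entries) - 1  # position of the new entry


def prepare(rf: RecoveryFile, trans_id: str,
            tentative: Dict[str, int]) -> List[Tuple[str, int]]:
    """Write the tentative values, then the prepared status with the intentions list."""
    intentions = [(name, rf.append(ObjectEntry(name, value)))
                  for name, value in tentative.items()]
    rf.append(StatusEntry(trans_id, "prepared", intentions))
    return intentions


rf = RecoveryFile()
intentions = prepare(rf, "T", {"A": 80, "B": 220})  # illustrative values
```

Because the prepared entry is written only after the values it points to, a server that crashes after preparing can still find every tentative value through the intentions list.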
Example-1
• Recovery File (as a log): Fig. 15.1, for the banking service transactions T and U (refer to Fig. 12.6)
– In Fig. 15.1, the part left of the double line is the checkpoint starting at P0, which represents a snapshot of the values of A, B, and C before transactions T and U started
– The server crashes after the RM has recorded that U is prepared to commit and has written its intentions list
– In this case, the values of A, B, and C must be restored
Example-2
• The RM is responsible for restoring the data items so that they include the effects of all the committed transactions and none of the effects of incomplete or aborted transactions
– The RM starts recovery from the end of the log, at entry P7
– It concludes that U has not committed and that its effects can be ignored
– It moves to P4 and concludes that T has committed
– To recover the data items affected by T, it moves to entry P3 and finds the intentions list for T
– It restores data items A and B from the values at P1 and P2
– To restore C, it moves to P0 and uses the checkpoint value
Example-3
• For each transaction left with status prepared, the RM adds an aborted status entry; it then completes a new checkpoint and creates a new RF
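The backward scan described in the example can be sketched as follows. This is a minimal sketch, not the textbook's algorithm verbatim: the function name `recover`, the dictionary layout of the log entries, and the numeric values are assumptions for illustration, and the checkpoint is simplified to a single entry at P0.

```python
def recover(log):
    """Scan the log from its end back to the checkpoint and rebuild data items."""
    restored = {}        # data item name -> recovered value
    outcome = {}         # trans id -> "committed" or "aborted", seen later in the log
    prepared_only = []   # transactions that prepared but never resolved

    for pos in range(len(log) - 1, -1, -1):
        entry = log[pos]
        if entry["kind"] == "checkpoint":
            # Items not touched by a committed transaction keep the snapshot value.
            for name, value in entry["values"].items():
                restored.setdefault(name, value)
            break
        if entry["kind"] != "status":
            continue  # object entries are reached only through intentions lists
        tid, status = entry["trans"], entry["status"]
        if status in ("committed", "aborted"):
            outcome[tid] = status
        elif status == "prepared":
            if outcome.get(tid) == "committed":
                # setdefault keeps the most recent committed value of each item
                for name, p in entry["intentions"]:
                    restored.setdefault(name, log[p]["value"])
            elif tid not in outcome:
                prepared_only.append(tid)

    # Single-server case: unresolved prepared transactions are marked aborted.
    for tid in prepared_only:
        log.append({"kind": "status", "trans": tid, "status": "aborted"})
    return restored


# Log positions P0..P7 as in the example; the values are illustrative.
log = [
    {"kind": "checkpoint", "values": {"A": 100, "B": 200, "C": 300}},      # P0
    {"kind": "object", "name": "A", "value": 80},                          # P1
    {"kind": "object", "name": "B", "value": 220},                         # P2
    {"kind": "status", "trans": "T", "status": "prepared",
     "intentions": [("A", 1), ("B", 2)]},                                  # P3
    {"kind": "status", "trans": "T", "status": "committed"},               # P4
    {"kind": "object", "name": "C", "value": 278},                         # P5
    {"kind": "object", "name": "B", "value": 242},                         # P6
    {"kind": "status", "trans": "U", "status": "prepared",
     "intentions": [("C", 5), ("B", 6)]},                                  # P7
]
state = recover(log)
```

With these assumed values, A and B come from T's entries at P1 and P2, C comes from the checkpoint, and U's tentative values at P5 and P6 are never read.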
Checkpointing
• The process of writing the current committed values of a server’s data items to a new RF, together with the transaction status entries and intentions lists of transactions that have not yet been fully resolved
• Its purpose is to reduce the number of transactions to be dealt with during recovery and to reclaim file space
• A failed checkpoint must itself be recoverable too…
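A rough sketch of checkpointing, under the same assumed log layout (a list of dictionaries whose index is the position in the RF). The function name `checkpoint` and the values are hypothetical. The sketch builds the new RF without modifying the old one, so the old file stays usable if the checkpoint does not complete.

```python
def checkpoint(old_log, committed_values):
    """Build a new RF: committed values first, then unresolved transactions."""
    resolved = {e["trans"] for e in old_log
                if e["kind"] == "status" and e["status"] in ("committed", "aborted")}

    new_log = [{"kind": "checkpoint", "values": dict(committed_values)}]
    for e in old_log:
        if (e["kind"] == "status" and e["status"] == "prepared"
                and e["trans"] not in resolved):
            # Copy the tentative values and re-point the intentions list
            # at their positions in the new file.
            new_intentions = []
            for name, old_pos in e["intentions"]:
                new_log.append(dict(old_log[old_pos]))
                new_intentions.append((name, len(new_log) - 1))
            new_log.append({"kind": "status", "trans": e["trans"],
                            "status": "prepared", "intentions": new_intentions})
    return new_log


# Illustrative old RF: T has committed, U is still only prepared.
old_log = [
    {"kind": "checkpoint", "values": {"A": 100, "B": 200, "C": 300}},
    {"kind": "object", "name": "A", "value": 80},
    {"kind": "object", "name": "B", "value": 220},
    {"kind": "status", "trans": "T", "status": "prepared",
     "intentions": [("A", 1), ("B", 2)]},
    {"kind": "status", "trans": "T", "status": "committed"},
    {"kind": "object", "name": "C", "value": 278},
    {"kind": "object", "name": "B", "value": 242},
    {"kind": "status", "trans": "U", "status": "prepared",
     "intentions": [("C", 5), ("B", 6)]},
]
new_log = checkpoint(old_log, {"A": 80, "B": 220, "C": 300})
```

The new file is shorter than the old one because the entries of the fully resolved transaction T have been folded into the checkpoint values.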
Recovery of Two-Phase Commit Protocol-1
• In a distributed transaction, each server (worker or coordinator) keeps its own RF
• Recovery management must be extended to deal with distributed transactions that are being performed using the two-phase commit protocol at the time a server fails
• The RM at the coordinator records a coordinator entry (Trans Id, list of workers) in the coordinator’s RF
Recovery of Two-Phase Commit Protocol-2
• RMs use two new transaction status values, done and uncertain, which can be written to the RF. Both done and uncertain are used when the RF is reorganized
• The RM of the coordinator uses done to indicate that the two-phase commit is complete
• The RM of a worker uses uncertain to indicate that the worker has voted Yes but does not yet know the outcome
• The RM at the coordinator records a coordinator entry (Trans Id, list of workers) in the coordinator’s RF
• The RM at a worker records a worker entry (Trans Id, coordinator) in the worker’s RF
Recovery of Two-Phase Commit Protocol-3
• During Phase 1 (voting)
– When the coordinator is prepared to commit, its RM writes prepared and a coordinator entry to the RF
– If a worker votes Yes, its RM writes prepared, a worker entry, and uncertain to the RF
– If a worker votes No, its RM writes aborted to the RF
Recovery of Two-Phase Commit Protocol-4
• During Phase 2 (completion)
– The RM of the coordinator writes either committed or aborted to the RF, according to the decision made
– The RMs of the workers write committed or aborted to their RFs, depending on the message received from the coordinator
– The RM of the coordinator writes done to the RF once the coordinator has received a “have committed” message from all its workers
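The status entries written in the two phases can be traced with a small simulation. This is a sketch under assumptions: no server fails and no message is lost, the function name `two_phase_commit_log` and the tuple layout of entries are hypothetical, and done is written only in the commit case, since that is the case the slides describe.

```python
def two_phase_commit_log(trans_id, worker_votes, coordinator="C"):
    """Return (coordinator RF, {worker: RF}) for one failure-free 2PC run.

    worker_votes maps a worker name to True (votes Yes) or False (votes No).
    """
    workers = list(worker_votes)

    # Phase 1 (voting)
    coord_rf = [("prepared", trans_id), ("coordinator", trans_id, workers)]
    worker_rfs = {}
    for w, votes_yes in worker_votes.items():
        if votes_yes:
            worker_rfs[w] = [("prepared", trans_id),
                             ("worker", trans_id, coordinator),
                             ("uncertain", trans_id)]
        else:
            worker_rfs[w] = [("aborted", trans_id)]

    # Phase 2 (completion)
    decision = "committed" if all(worker_votes.values()) else "aborted"
    coord_rf.append((decision, trans_id))
    for w, votes_yes in worker_votes.items():
        if votes_yes:
            # Workers that voted Yes learn the outcome from the coordinator.
            worker_rfs[w].append((decision, trans_id))
    if decision == "committed":
        # Assumes every worker has replied that it has committed.
        coord_rf.append(("done", trans_id))
    return coord_rf, worker_rfs


coord_rf, worker_rfs = two_phase_commit_log("T", {"W1": True, "W2": True})
```

Calling the function with one vote set to False shows the other path: the worker that voted No has only an aborted entry, and the workers that voted Yes move from uncertain to aborted.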
Recovery of Two-Phase Commit Protocol-5
• Recovery of the two-phase commit protocol
– Refer to Fig. 15.2, which shows the entries in an RF for two transactions: T, where the server is the coordinator, and U, where the server is a worker
– The action of the RM when a server restarts after a crash is shown in Fig. 15.3
– Reorganization of the RF when performing a checkpoint:
• coordinator entries of transactions without status done are not removed
• worker entries with status uncertain are not removed
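The two retention rules for reorganizing the RF can be written as a simple filter. This is only a sketch: the function name `entries_to_keep` and the tuple layout `(kind, trans_id, ...)` are assumptions, and "status" is taken to mean the last status entry written for a transaction.

```python
STATUSES = {"prepared", "committed", "aborted", "done", "uncertain"}


def entries_to_keep(rf):
    """Return the coordinator and worker entries that a checkpoint must retain."""
    # The latest status written for each transaction decides its fate.
    last_status = {}
    for entry in rf:
        if entry[0] in STATUSES:
            last_status[entry[1]] = entry[0]

    keep = []
    for entry in rf:
        kind, tid = entry[0], entry[1]
        if kind == "coordinator" and last_status.get(tid) != "done":
            keep.append(entry)   # two-phase commit not yet complete
        elif kind == "worker" and last_status.get(tid) == "uncertain":
            keep.append(entry)   # voted Yes, outcome still unknown
    return keep


rf = [
    ("prepared", "T"), ("coordinator", "T", ["W1"]), ("committed", "T"),
    ("prepared", "U"), ("worker", "U", "C"), ("uncertain", "U"),
    ("prepared", "V"), ("coordinator", "V", ["W2"]), ("committed", "V"),
    ("done", "V"),
]
kept = entries_to_keep(rf)
```

In this made-up RF, T's coordinator entry survives because T has not reached done, U's worker entry survives because U is still uncertain, and V's coordinator entry can be dropped.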