Outline
• Announcements
• Fault Tolerance
2/20/2021 COP 5611 - Operating Systems 1

Announcements
• Class evaluation at the beginning of next class
  – Please come on time so that we still have enough time to cover the materials we need to cover
• Discussions
  – Homework #4
  – Quiz #2
• Decisions
  – Final exam: open book or closed book?
  – Lab 2: Extension?
  – Quiz #3: A week from today

Motivations
• A system is fault-tolerant
  – If it can mask failures
    • It continues to perform its specified function in the event of a failure
    • Mainly through redundancy
  – Or it exhibits a well-defined failure behavior in the event of failure
    • Distributed commit: either all sites commit a particular operation or none of them does

Fault Tolerance Through Redundancy
• The key approach to fault tolerance is redundancy
  – Three kinds of redundancy
    • Information redundancy
    • Time redundancy
    • Physical redundancy
  – A system can have
    • Multiple processes
    • Multiple hardware components
    • Multiple copies of data

Failure Resilient Processes
• A process is resilient if it masks failures and guarantees progress despite a certain number of system failures
• Backup processes
  – In this approach, each resilient process is implemented by a primary process and one or more backup processes
  – The state of the primary process is saved at intervals
  – If the primary terminates, one of the backup processes becomes active and takes over

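As a rough illustration of the backup-process approach above, the following Python sketch saves a primary's state at intervals and lets a backup resume from the last saved state. The class names (CounterProcess, ResilientProcess), the checkpoint_every parameter, and the in-memory checkpoint standing in for stable storage are assumptions made for this example; they do not come from the lecture.

```python
import copy


class CounterProcess:
    """Toy process whose entire state is one counter (hypothetical)."""

    def __init__(self, state=None):
        self.state = state if state is not None else {"count": 0}

    def handle(self, amount):
        self.state["count"] += amount
        return self.state["count"]


class ResilientProcess:
    """A primary plus a saved checkpoint from which a backup can take over."""

    def __init__(self, checkpoint_every=2):
        self.primary = CounterProcess()
        # Assumption: this copy stands in for state kept on stable storage.
        self.checkpoint = copy.deepcopy(self.primary.state)
        self.checkpoint_every = checkpoint_every
        self.handled = 0

    def handle(self, amount):
        result = self.primary.handle(amount)
        self.handled += 1
        if self.handled % self.checkpoint_every == 0:
            self.checkpoint = copy.deepcopy(self.primary.state)
        return result

    def fail_over(self):
        # The backup becomes active, starting from the last checkpoint.
        self.primary = CounterProcess(copy.deepcopy(self.checkpoint))


rp = ResilientProcess(checkpoint_every=2)
for amount in (1, 2, 3):
    rp.handle(amount)
rp.fail_over()
print(rp.primary.state["count"])  # 3: the request handled after the last checkpoint is lost
```

The sketch makes the trade-off visible: a longer checkpoint interval means less overhead but more work lost when the backup takes over.
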
Failure Resilient Processes – cont.
• Replicated execution
  – Several processes execute the same program concurrently
  – It can increase the reliability and availability
  – It requires that all requests be processed at all processes in the same order
  – Nonidempotent operations need to be taken care of

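One common way to take care of nonidempotent operations is to tag each request with a unique ID and skip IDs that were already executed. The sketch below assumes that technique; the Replica class, the request IDs, and the deposit-style operation are hypothetical, and the requests are assumed to reach every replica in the same order.

```python
class Replica:
    """One copy of a replicated account service (hypothetical example)."""

    def __init__(self):
        self.balance = 0
        self.applied_ids = set()  # IDs of requests already executed

    def apply(self, request_id, amount):
        # A deposit is nonidempotent: executing a retransmitted copy
        # would change the balance twice, so duplicates are skipped.
        if request_id in self.applied_ids:
            return self.balance
        self.applied_ids.add(request_id)
        self.balance += amount
        return self.balance


# Every replica sees the same requests in the same order.
requests = [("r1", 10), ("r2", 5), ("r1", 10)]  # the last entry is a retry of r1
replicas = [Replica(), Replica(), Replica()]
for replica in replicas:
    for request_id, amount in requests:
        replica.apply(request_id, amount)

balances = [replica.balance for replica in replicas]
print(balances)  # [15, 15, 15]
```
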
Distributed Commit
• The distributed commit problem involves having an operation performed by every member of a process group or by none at all
  – This is referred to as global atomicity
• Commit protocols
  – Given that each site has a recovery strategy at the local level, commit protocols ensure that all the sites either commit or abort the transaction unanimously, even in the presence of multiple and repetitive failures

One-phase Commit Protocol
• One-phase commit protocol
  – One site is designated as a coordinator
  – The coordinator tells all the other processes whether or not to locally perform the operation in question
  – This scheme, however, is not fault tolerant

Two-Phase Commit Protocol
• In this protocol, one of the processes acts as a coordinator
  – Other processes are referred to as cohorts
    • Cohorts are assumed to be executing at different sites
  – Stable storage is available at each site
  – The write-ahead log protocol is used
  – There are two phases involved in the protocol

Two-Phase Commit Protocol – cont.
• Site failures handling
  – Suppose the coordinator crashes before having written the COMMIT record
    • On recovery, the coordinator broadcasts an ABORT message to all the cohorts
  – Suppose the coordinator crashes after writing the COMMIT record but before writing the COMPLETE record
    • On recovery, the coordinator broadcasts a COMMIT message
  – Suppose the coordinator crashes after writing the COMPLETE record
    • On recovery, there is nothing to be done for the transaction

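The three coordinator recovery cases above amount to a lookup on the coordinator's log. A minimal sketch, assuming the log is available as a list of record names; the function name and the list representation are illustrative, not part of the lecture.

```python
def coordinator_recovery_action(log_records):
    """Return the message to broadcast on recovery, or None if nothing is left to do.

    log_records: names of the records found in the coordinator's log
    for one transaction, e.g. ["COMMIT"] or ["COMMIT", "COMPLETE"].
    """
    records = set(log_records)
    if "COMPLETE" in records:
        return None        # transaction fully finished before the crash
    if "COMMIT" in records:
        return "COMMIT"    # decision was made; re-broadcast it
    return "ABORT"         # crashed before deciding to commit


print(coordinator_recovery_action([]))                      # ABORT
print(coordinator_recovery_action(["COMMIT"]))              # COMMIT
print(coordinator_recovery_action(["COMMIT", "COMPLETE"]))  # None
```
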
Two-Phase Commit Protocol – cont.
• Site failures handling - continued
  – If a cohort crashes in Phase I, the coordinator aborts the transaction because it does not receive a reply from the crashed cohort
  – If a cohort crashes in Phase II (after writing its UNDO and REDO log)
    • On recovery, the cohort will check with the coordinator whether to abort or to commit the transaction

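The cohort-failure cases can be sketched from both sides. The code below is illustrative only: the function names, the use of None for a reply that never arrived, the record names in the cohort log, and the ask_coordinator callback are assumptions. In particular, the idea that a cohort logs the final COMMIT or ABORT locally, and that a cohort with no UNDO/REDO records can safely abort, are assumptions of this sketch rather than statements from the slides.

```python
def phase_one_decision(replies):
    """Coordinator side: replies maps cohort name -> "AGREED", "ABORT",
    or None when no reply arrived (for example, the cohort crashed)."""
    if replies and all(reply == "AGREED" for reply in replies.values()):
        return "COMMIT"
    return "ABORT"


def cohort_recovery(cohort_log, ask_coordinator):
    """Cohort side: decide the outcome after a crash from the local log."""
    for record in reversed(cohort_log):
        if record in ("COMMIT", "ABORT"):
            return record              # assumption: outcome was logged locally
    if "UNDO" in cohort_log and "REDO" in cohort_log:
        return ask_coordinator()       # crashed in Phase II: outcome unknown locally
    return "ABORT"                     # assumption: never agreed, so no commit is possible


print(phase_one_decision({"site_a": "AGREED", "site_b": None}))  # ABORT
print(cohort_recovery(["UNDO", "REDO"], lambda: "COMMIT"))       # COMMIT
```
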
Two-Phase Commit Protocol – cont.
• Limitation
  – It is a blocking protocol
    • Whenever the coordinator fails, cohort sites will have to wait for its recovery
    • This is undesirable as these sites may be holding locks on resources
    • It cannot be used if transactions must be resilient to site failures
  – This leads to non-blocking commit protocols

Non-blocking Commit Protocols
• To be non-blocking in the event of site failures
  – Operational sites should agree on the outcome of the transaction by examining their local states
  – Failed sites, upon recovery, must also reach the same conclusion regarding the outcome of the transaction as operational sites do
• Independent recovery refers to the situation in which the recovering sites can decide the final outcome of the transaction based solely on their local state

Three-Phase Commit Protocol – cont.

Three-Phase Commit Protocol for Single Site Failure

Three-Phase Commit Protocol – cont.
• Phase I - Identical to that of the two-phase commit protocol, except in the event of a site’s failure
  – If a cohort fails, the coordinator times out waiting for the Agreed message, aborts the transaction, and sends Abort messages to all the cohorts
• Phase II - The coordinator sends a Prepare message to all the cohorts if all the cohorts have sent an Agreed message in Phase I
  – Otherwise, it sends an Abort message

Three-Phase Commit Protocol – cont.
• Phase III
  – On receiving acknowledgments to the Prepare messages from all the cohorts, the coordinator sends a Commit message to all the cohorts
  – On receiving a Commit message, a cohort commits the transaction

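The coordinator's side of the three phases can be summarized as the sequence of messages it broadcasts. The sketch below is a simplification: the function name and the boolean dictionaries are hypothetical, timeouts are modeled as False entries, and the choice to abort when an acknowledgment to Prepare is missing is an assumption, since the slides do not state that case.

```python
def three_phase_commit(agreed, acked):
    """Return the messages the coordinator broadcasts, in order.

    agreed: cohort name -> True if its Agreed message arrived in Phase I
    acked:  cohort name -> True if it acknowledged the Prepare message
    """
    sent = []
    # Phases I and II: Prepare only if every cohort agreed, otherwise Abort.
    if not all(agreed.values()):
        sent.append("ABORT")
        return sent
    sent.append("PREPARE")
    # Phase III: Commit once every cohort has acknowledged Prepare.
    if all(acked.get(name, False) for name in agreed):
        sent.append("COMMIT")
    else:
        sent.append("ABORT")  # assumption: not specified in the slides
    return sent


everyone = {"site_a": True, "site_b": True}
print(three_phase_commit(everyone, everyone))                    # ['PREPARE', 'COMMIT']
print(three_phase_commit({"site_a": True, "site_b": False}, {}))  # ['ABORT']
```
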
Three-Phase Commit Protocol – cont.
• Theoretical results
  – Rules 1 and 2 are sufficient for designing commit protocols resilient to a single site failure during a transaction
  – There exists no protocol using independent recovery that is resilient to arbitrary failures by two sites
  – There exists no protocol resilient to network partitioning when messages are lost
  – There exists no protocol resilient to multiple network partitioning

Voting Protocols
• Distributed commit protocols are resilient to single site failures
  – But they are not resilient to multiple site failures, communication failures, and network partitioning
• Voting protocols are more fault tolerant
  – They allow data accesses under network failures, multiple site failures, and message losses without compromising the integrity of the data
  – The basic idea is that each replica is assigned some number of votes and a majority of votes must be collected before a process can access a replica

Static Voting
• System model
  – The replicas of files are stored at different sites
  – Every file access operation requires that an appropriate lock is obtained
    • The lock rule allows either “one writer and no readers” or “multiple readers and no writers”
  – Every file is associated with a version number
    • Indicates the number of times the file has been updated
    • Version numbers are stored on stable storage
    • Every write operation updates its version number

Static Voting – cont.
• Basic idea
  – Every replica is assigned a certain number of votes
    • This information is stored on stable storage
  – A read or write operation is permitted if a certain number of votes, read quorum or write quorum, are collected by the requesting process

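A small sketch of quorum-based access with version numbers follows. The Replica dataclass and the function names are hypothetical. The quorum constraints checked in valid_quorums (r + w greater than the total votes, and w greater than half the total) are the conditions usually stated for weighted voting and are an assumption here, since the slides that would state them did not survive. Locking is omitted.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    votes: int
    version: int = 0   # number of times this copy has been updated
    value: str = ""


def valid_quorums(total_votes, r, w):
    # Assumed constraints: every read quorum overlaps every write quorum,
    # and any two write quorums overlap.
    return r + w > total_votes and 2 * w > total_votes


def read(reachable, r):
    """Return the most recent value, or None if the read quorum is not met."""
    if sum(rep.votes for rep in reachable) < r:
        return None
    return max(reachable, key=lambda rep: rep.version).value


def write(reachable, w, value):
    """Update all reachable replicas if the write quorum is met."""
    if sum(rep.votes for rep in reachable) < w:
        return False
    new_version = max(rep.version for rep in reachable) + 1
    for rep in reachable:
        rep.version, rep.value = new_version, value
    return True


a, b, c = Replica(votes=2), Replica(votes=1), Replica(votes=1)  # 4 votes in total
r, w = 2, 3
quorums_ok = valid_quorums(4, r, w)
wrote = write([a, b], w, "v1")   # 3 votes collected: permitted
denied = write([c], w, "v2")     # 1 vote collected: refused
value = read([b, c], r)          # 2 votes; b holds the newest version
print(quorums_ok, wrote, denied, value)  # True True False v1
```

Because the read quorum overlaps the write quorum, the read reaches at least one replica that carries the latest version number, even though replica c was never updated.
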
Vote Assignment

Vote Assignment Examples

Reliable Communication
• In a system using replicated data, it is important that data managers behave identically
  – The data managers are required to have an identical view of the events
• Atomic broadcast

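One way to give data managers an identical view of events is to route every message through a single sequencer and have each data manager deliver messages strictly in sequence-number order. The sketch below assumes that technique; the Sequencer and DataManager classes are hypothetical, and the sketch covers only the ordering aspect, not message loss or sequencer failure.

```python
import itertools


class Sequencer:
    """Assigns a global sequence number to every broadcast message."""

    def __init__(self):
        self._counter = itertools.count()

    def stamp(self, message):
        return (next(self._counter), message)


class DataManager:
    """Delivers stamped messages in sequence order, whatever the arrival order."""

    def __init__(self):
        self.next_seq = 0
        self.holdback = {}    # seq -> message that arrived too early
        self.delivered = []

    def receive(self, stamped):
        seq, message = stamped
        self.holdback[seq] = message
        while self.next_seq in self.holdback:
            self.delivered.append(self.holdback.pop(self.next_seq))
            self.next_seq += 1


sequencer = Sequencer()
stamped = [sequencer.stamp(m) for m in ("write x", "write y", "write z")]

dm1, dm2 = DataManager(), DataManager()
for s in stamped:
    dm1.receive(s)
for s in reversed(stamped):   # same messages, opposite arrival order
    dm2.receive(s)

print(dm1.delivered == dm2.delivered)  # True
```
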
Summary
• Fault tolerance means masking failures or behaving in a well-defined way when failures occur
  – The key approach to failure masking is redundancy
• Failure resilient processes
  – Distributed commit protocols guarantee global atomicity
    • Either all sites commit an operation or none of them does