SafetyNet: Improving the Availability of Shared Memory Multiprocessors
SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
D. Sorin, M. Martin, M. Hill, D. Wood
03/05/2010
Presented by Akin Olugbade
Motivation
- Faster processors and smaller process technology make chips more susceptible to errors
- Systems need high availability
  - Many internet servers are shared memory multiprocessors
  - Rebooting or crashing the system is an undesirable way to deal with errors
SafetyNet Design
- Create globally consistent checkpoints that the system can recover to if an error is detected
  - Save the architected state: processor registers, memory state, and coherence state
- Validate that a checkpoint is fault free
- Recover to the most recent validated checkpoint when an error is detected
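The validate-then-recover flow can be illustrated with a small sketch. It assumes checkpoints are identified by increasing integers and that each component reports the newest checkpoint it has found fault free; the class `RecoveryPointTracker` and its method names are hypothetical, chosen for illustration, and do not describe the actual hardware protocol.

```python
class RecoveryPointTracker:
    """Illustrative model: tracks the newest checkpoint validated by every component."""

    def __init__(self, components):
        # Assumption: checkpoint 0 (the initial state) is trusted by everyone.
        self.validated = {name: 0 for name in components}
        self.recovery_point = 0

    def report_fault_free(self, component, checkpoint):
        """Record that `component` saw no errors up to `checkpoint`."""
        self.validated[component] = max(self.validated[component], checkpoint)
        # A checkpoint is validated only when all components agree on it,
        # so the recovery point is the minimum over all reports.
        self.recovery_point = min(self.validated.values())
        return self.recovery_point

    def on_error(self):
        """On a detected error, roll back to the most recent validated checkpoint."""
        return self.recovery_point


tracker = RecoveryPointTracker(["cpu0", "cpu1", "memory"])
tracker.report_fault_free("cpu0", 3)
tracker.report_fault_free("cpu1", 2)
tracker.report_fault_free("memory", 2)
rollback_target = tracker.on_error()
```

In this model a single slow or silent component holds the recovery point back, which is why older checkpoints must be retained until validation completes.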
SafetyNet Design
- Reduced logging space
  - A change to a given register, memory block, or coherence permission is logged only once per checkpoint interval
- Point of atomicity
  - The requestor does not increment its recovery point until all of its outstanding requests have completed
  - Consistent logical time ensures global consistency of checkpoints
- Validation
  - All components must agree that a checkpoint is fault free before it is validated
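The once-per-interval logging rule can be sketched in software. The model below is a simplification under stated assumptions: memory is a dictionary, checkpoint `n` denotes the state at the start of interval `n`, and recovery is only requested for a checkpoint whose logs have not been discarded. The `LogBuffer` class and its methods are hypothetical names, not the paper's design.

```python
_ABSENT = object()  # marks "block had no value before this interval"


class LogBuffer:
    """Illustrative undo log: at most one entry per block per checkpoint interval."""

    def __init__(self):
        self.memory = {}       # block address -> current value
        self.current = 0       # number of the current checkpoint interval
        self.logs = {0: {}}    # interval number -> {address: value before interval}

    def store(self, addr, value):
        log = self.logs[self.current]
        if addr not in log:
            # First write to this block in this interval: save the old value.
            log[addr] = self.memory.get(addr, _ABSENT)
        # Later writes in the same interval are not logged again.
        self.memory[addr] = value

    def new_checkpoint(self):
        self.current += 1
        self.logs[self.current] = {}
        return self.current

    def validate(self, checkpoint):
        # Logs of intervals before a validated checkpoint are no longer needed.
        for n in [n for n in self.logs if n < checkpoint]:
            del self.logs[n]

    def recover(self, checkpoint):
        # Undo intervals newest-first until the state at `checkpoint` is restored.
        for n in sorted((n for n in self.logs if n >= checkpoint), reverse=True):
            for addr, old in self.logs.pop(n).items():
                if old is _ABSENT:
                    self.memory.pop(addr, None)
                else:
                    self.memory[addr] = old
        self.current = checkpoint
        self.logs[checkpoint] = {}


buf = LogBuffer()
buf.store(0x10, 1)
ckpt = buf.new_checkpoint()
buf.store(0x10, 2)
buf.store(0x10, 3)   # second write in the interval adds no log entry
buf.recover(ckpt)    # memory[0x10] is 1 again
```

Because only the first write per block is logged, the log grows with the number of distinct blocks touched in an interval rather than with the number of stores.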
Logical Time
Evaluation
Evaluation
Conclusion
+ The checkpoint/recovery system can be independent of the error detection mechanism
+ Negligible performance overhead in the error-free common case
+ Storage and bandwidth overhead can be reduced greatly by increasing the checkpoint interval
Questions
- Does the validation latency matter in the case of output commit?
- How do we deal with stores when the CLB (checkpoint log buffer) fills up?
- Is SafetyNet suitable for mission-critical situations?
- If validation is fast enough, would we want to shorten the checkpoint interval?