Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Checkpointing in Shared-Memory MPs
[Figure: timeline of P1–P4 saving a system-wide checkpoint; on a fault, all processors roll back to the saved checkpoint]
• HW-based schemes for small CMPs use global checkpointing
  – All procs participate in system-wide checkpoints
• Global checkpointing is not scalable
  – Synchronization, bursty movement of data, loss in rollback…
R. Agarwal, P. Garg, J. Torrellas Rebound: Scalable Checkpointing 2
Alternative: Coordinated Local Checkpointing
• Idea: threads coordinate their checkpointing in groups
• Rationale:
  – Faults propagate only through communication
  – Interleaving between non-communicating threads is irrelevant
[Figure: P1–P5 timelines contrasting one global checkpoint with local checkpoints taken by groups of communicating processors]
+ Scalable: checkpoint and rollback in processor groups
– Complexity: record inter-thread dependences dynamically
Contributions
Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  • Delaying the write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%
  • Compared to 15% for global checkpointing
Background: In-Memory Checkpt with ReVive [Prvulovic-02]
[Figure: during execution, P1–P3 log the old values of displaced dirty cache lines to an in-memory log; at a checkpoint (CHK), the application stalls while registers are dumped and all dirty cache lines are written back to memory]
Background: In-Memory Checkpt with ReVive [Prvulovic-02]
[Figure: on a fault, registers are restored from the checkpoint, caches are invalidated, and memory lines are reverted from the log]
• ReVive: global checkpoints with a broadcast protocol
• Rebound: local, coordinated checkpoints with a scalable (directory) protocol
Coordinated Local Checkpointing Rules
[Figure: P1 writes x, P2 reads x; a checkpoint by the consumer forces the producer to checkpoint, and a rollback by the producer forces the consumer to roll back]
• P checkpoints ⇒ P's producers checkpoint
• P rolls back ⇒ P's consumers roll back
• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre 96]
Rebound Fault Model
[Figure: chip multiprocessor with off-chip main memory holding the log (in SW)]
• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no faults on their own (e.g., NVM)
• Fault detection is outside our scope:
  • Fault-detection latency has an upper bound of L cycles
Rebound Architecture
[Figure: CMP with per-processor P+L1, an L2 controller holding the Dep registers (MyProducers, MyConsumers), a directory with a per-entry LW-ID field, and off-chip main memory]
• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of procs that produced data consumed by the local proc
  • MyConsumers: bitmap of procs that consumed data produced by the local proc
• Processor ID in each directory entry:
  • LW-ID: last writer to the line in the current checkpoint interval
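The bookkeeping above can be sketched in software (a minimal model of the Dep registers, not the actual hardware; the names `dep_regs_t` and `record_dependence` are illustrative):

```c
#include <stdint.h>

/* Illustrative model of Rebound's Dep registers: one producer bitmap and
   one consumer bitmap per processor (bit i = processor i). */
typedef struct {
    uint64_t my_producers;  /* procs that produced data this proc consumed */
    uint64_t my_consumers;  /* procs that consumed data this proc produced */
} dep_regs_t;

/* When a read by 'reader' reaches a line whose directory LW-ID names
   'writer' (a write from the current interval), record the dependence in
   both processors' Dep registers. */
static void record_dependence(dep_regs_t regs[], int reader, int writer) {
    if (reader == writer) return;  /* self-dependences are irrelevant */
    regs[reader].my_producers |= 1ULL << writer;
    regs[writer].my_consumers |= 1ULL << reader;
}
```

On a write, the directory simply sets LW-ID to the writer; the Dep registers only change when another processor later consumes that line.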
Recording Inter-Thread Dependences: P1 writes
[Figure: P1 writes a line; the line becomes Dirty in P1's cache and the directory sets LW-ID = P1; no Dep registers change yet]
Assume MESI protocol
Recording Inter-Thread Dependences: P2 reads
[Figure: P2 reads the line; P1 writes it back (the old value is logged), the line becomes Shared, and the dependence is recorded: P1 is added to P2's MyProducers and P2 to P1's MyConsumers]
Recording Inter-Thread Dependences: P1 writes again
[Figure: P1 writes the line again; it upgrades from Shared back to Dirty while the directory's LW-ID remains P1]
Recording Inter-Thread Dependences: P1 checkpoints
[Figure: P1 writes back its dirty lines (logging the old values) and clears its Dep registers and its LW-IDs in the directory; an LW-ID should remain set until the line is checkpointed]
Lazily Clearing Last Writers
• Clearing all LW-IDs eagerly at a checkpoint is an expensive process
• Instead, a Write Signature encodes all line addresses that the processor has written (or read exclusively) in the current interval
• At a checkpoint, processors just clear their Write Signature
  – The directory may then hold potentially stale LW-IDs
Lazily Clearing Last Writers
[Figure: P2 reads a line whose directory LW-ID = P1 is stale; the address is checked against P1's Write Signature and misses, so the LW-ID is cleared and no dependence is recorded]
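The lazy-clear check can be mimicked with a Bloom-filter-style signature (a sketch only; the 64-bit width and the hash are arbitrary choices, not Rebound's actual encoding):

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy write signature: a 64-bit superset encoding of the line addresses
   written this interval. False positives are possible; false negatives
   are not, which is what makes the stale-LW-ID check safe. */
typedef uint64_t wsig_t;

static int sig_bit(uint64_t line_addr) {
    line_addr ^= line_addr >> 17;                              /* cheap mixing */
    return (int)((line_addr * 0x9E3779B97F4A7C15ULL) >> 58);   /* top 6 bits */
}

static void sig_insert(wsig_t *s, uint64_t line_addr) {
    *s |= 1ULL << sig_bit(line_addr);
}

/* A directory LW-ID naming processor P is believed only if P's signature
   may contain the address; on a miss, the LW-ID is stale and is cleared. */
static bool lwid_is_valid(wsig_t s, uint64_t line_addr) {
    return (s >> sig_bit(line_addr)) & 1;
}
```

Clearing the signature at a checkpoint (one register write) then invalidates every LW-ID the processor left behind, without touching the directory.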
Distributed Checkpointing Protocol in SW
• Interaction Set[Pi]: set of producer processors (transitively) for Pi
  – Built using MyProducers
[Figure: P1 initiates a checkpoint; Ck? requests propagate transitively through MyProducers, and P2 and P3 Accept; P4 Declines and stays out, so InteractionSet = {P1, P2, P3}]
• Checkpointing is a 2-phase commit protocol
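The transitive construction of the interaction set can be sketched as a closure over the MyProducers bitmaps (an illustrative software model; the real protocol builds the set with Ck? messages rather than a central loop):

```c
#include <stdint.h>
#include <stdbool.h>

/* Transitive closure over MyProducers: starting from the initiator, keep
   adding the producers of every processor already in the set until it
   stops growing. Bit i of each bitmap stands for processor i. */
static uint64_t interaction_set(const uint64_t my_producers[], int nprocs,
                                int initiator) {
    uint64_t set = 1ULL << initiator;
    bool changed = true;
    while (changed) {
        changed = false;
        for (int p = 0; p < nprocs; p++) {
            if ((set >> p) & 1) {
                uint64_t grown = set | my_producers[p];
                if (grown != set) { set = grown; changed = true; }
            }
        }
    }
    return set;
}
```

Mirroring the slide's example (0-indexed): if P0's producer is P1 and P1's producer is P2, initiating at P0 yields {P0, P1, P2}, and P3 stays out.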
Distributed Rollback Protocol in SW
• Rollback is handled similarly to the checkpointing protocol:
  – The interaction set is built transitively using MyConsumers
• Rollback involves:
  – Clearing the Dep registers and Write Signature
  – Invalidating the processor caches
  – Restoring the data and register context from the logs up to the latest checkpoint
• No Domino Effect
Optimization 1: Delayed Writebacks
[Figure: timelines contrasting a stall to write back dirty lines at each checkpoint vs. overlapping the writebacks with execution of the next interval]
• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
  • Processors synchronize and resume execution
  • Hardware automatically writes back dirty lines in the background
  • The checkpoint completes only when all delayed data is written back
  • Still need to record inter-thread dependences on delayed data
Delayed Writeback Pros/Cons
+ Significant reduction in checkpoint overhead
- Additional support:
  • Each processor has two sets of Dep registers and Write Signatures
  • Each cache line has a Delayed bit
- Increased vulnerability: a rollback event forces both intervals to roll back
Delayed Writeback Protocol
[Figure: P2 reads a line that P1 wrote in the previous interval but has not yet written back; the address hits in P1's previous-interval Write Signature (WSig0) and misses in the current one (WSig1), so the dependence is recorded in the previous interval's Dep registers (MyProducers0/MyConsumers0); the line is then written back and logged]
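The two-signature lookup can be sketched as follows (illustrative only; `interval_ctx_t` and the exact lookup order are assumptions based on the slide, and the "signature" here is an exact address bitmap for simplicity):

```c
#include <stdint.h>

/* Per-processor state for the two live intervals: index 0 is the previous
   interval (whose dirty data may still be draining), index 1 the current. */
typedef struct {
    uint64_t wsig;          /* write signature for the interval */
    uint64_t my_consumers;  /* consumers recorded against the interval */
} interval_ctx_t;

/* A consumer's read probes both of the producer's signatures: a hit in the
   previous interval's signature attributes the dependence to that interval,
   otherwise it belongs to the current one. Returns the chosen interval. */
static int record_delayed_consumer(interval_ctx_t ctx[2], uint64_t addr_bit,
                                   int consumer) {
    int iv = (ctx[0].wsig & addr_bit) ? 0 : 1;
    ctx[iv].my_consumers |= 1ULL << consumer;
    return iv;
}
```

Attributing the dependence to the right interval is what lets a later rollback of only the previous interval still find all of its consumers.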
Optimization 2: Multiple Checkpoints
• Problem: fault detection is not instantaneous
  – A checkpoint is safe only after the max fault-detection latency (L)
[Figure: a fault at tf is detected after Ckpt2 has been taken, so Ckpt2 is not yet safe and rollback must reach back to Ckpt1; each checkpoint keeps its own Dep registers]
• Solution: keep multiple checkpoints
  – On a fault, roll back interacting processors to safe checkpoints
• No Domino Effect
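Picking the rollback target can be sketched as: with detection-latency bound L, only checkpoints that were already at least L cycles old when the fault was detected are known fault-free (the timestamp array and linear scan are illustrative, not Rebound's mechanism):

```c
#include <stdint.h>

/* Return the index of the most recent checkpoint taken early enough that a
   fault detected at t_detect (with latency bound L cycles) cannot have
   tainted it, or -1 if no kept checkpoint qualifies. Timestamps are in
   cycles, ordered oldest first. */
static int safe_ckpt_index(const uint64_t t_ckpt[], int n,
                           uint64_t t_detect, uint64_t L) {
    for (int i = n - 1; i >= 0; i--)  /* scan newest first */
        if (t_ckpt[i] + L <= t_detect)
            return i;
    return -1;
}
```

The same bound governs recycling: a checkpoint's resources can be freed once it is older than L, because any fault that could force a rollback to it would already have been detected.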
Multiple Checkpoints: Pros/Cons
+ Realistic system: supports non-instantaneous fault detection
- Additional support:
  • Each checkpoint has its own Dep registers; they can be recycled only after the fault-detection latency
  • Need to track communication across checkpoints
  • Combination with Delayed Writebacks: one more Dep register set
Optimization 3: Hiding Chkpt behind Global Barrier
• Global barriers require that all processors communicate
  – Leads to global checkpoints
• Optimization:
  – Proactively trigger a global checkpoint at a global barrier
  – Hide the checkpoint overhead behind barrier-imbalance spins
Hiding Checkpoint behind Global Barrier

  Lock
    count++
    if (count == numProc)
      Iam_last = TRUE   /* local var */
  Unlock
  if (Iam_last) {
    count = 0
    flag = TRUE
    ...
  } else
    while (!flag) {}
Hiding Checkpoint behind Global Barrier
[Figure: on the barrier code, P1 arrives first, performs its update, and initiates the checkpoint (Bar.CK?), with ICHK = {P1, P3}; P2 updates, Notifies, and spins on flag, with ICHK = {P2, P3}; P3 arrives last, updates, and sets flag = TRUE, with ICHK = {P3}]
• The first arriving processor initiates the checkpoint
• Others: HW writes back data as execution proceeds to the barrier
• The checkpoint commits as the last processor arrives
• After the barrier: few interacting processors
Evaluation Setup
• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled several environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without the delayed writebacks
Avg. Interaction Set: Set of Producer Processors
[Chart: average interaction-set size per application on 64 processors; most sets are small, with averages reaching up to 38]
• Most apps: the interaction set is small
  – Justifies coordinated local checkpointing
  – Averages brought up by global barriers
Checkpoint Execution Overhead
[Chart: % checkpoint overhead of Global, Rebound_NoDWB, and Rebound for the SPLASH-2 applications (Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water.Sp, Water.Nsq, Radiosity, Ocean, Raytrace) and the SP2 average]
• Rebound's avg checkpoint execution overhead is 2%
  – Compared to 15% for Global
• Delayed Writebacks complement local checkpointing
Rebound Scalability
[Chart: checkpoint overhead vs. processor count, constant problem size]
• Rebound is scalable in checkpoint overhead
• Delayed Writebacks help scalability
Also in the Paper
• Delayed writebacks are also useful in Global
• The barrier optimization is effective but not universally applicable
• Power increase due to hardware additions: < 2%
• Rebound leads to only a 4% increase in coherence traffic
Conclusions
Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol
• Boosts checkpointing efficiency:
  • Delayed write-backs
  • Multiple checkpoints
  • Barrier optimization
• Avg. execution overhead for 64 procs: 2%
• Future work:
  • Apply Rebound to non-hardware-coherent machines
  • Scalability to hierarchical directories