Reliable and Scalable Checkpointing Systems for Distributed Computing

Reliable and Scalable Checkpointing Systems for Distributed Computing Environments
Final exam of Tanzima Zerin Islam
School of Electrical & Computer Engineering, Purdue University, West Lafayette, IN
Date: April 8, 2013

Distributed Computing Environments
High Performance Computing (HPC): projected MTBF of 3-26 minutes at exascale; failures come from hardware and software.
Grid: cycle-sharing system (e.g., @Notre Dame, @Purdue, @Indiana U. over the Internet); highly volatile environment; failure means eviction of guest jobs.

Fault-tolerance with Checkpoint-Restart
Checkpoints are execution states.
System-level: memory state; compressible.
Application-level: selected variables; hard to compress. Example:
struct ToyGrp {
  float Temperature[1024];
  int Pressure[20][30];
};

Challenges in Checkpointing Systems
HPC: scalability of checkpointing systems.
Grid: use of dedicated checkpoint servers (@Notre Dame, @Purdue, @Indiana U. over the Internet).

Contributions of This Thesis (timeline 2007-2013)
FALCON: reliable checkpointing system in Grid [Best Student Paper Nomination, SC'09].
Compression on multi-core [2nd Place, ACM Student Research Competition'10].
MCRENGINE: scalable checkpointing system in HPC [Best Student Paper Nomination, SC'12].
MCRCLUSTER: benefit-aware clustering [unpublished prelim].
Agenda
[MCRENGINE] Scalable checkpointing system for HPC
[MCRCLUSTER] Benefit-aware clustering
Future directions

A Scalable Checkpointing System using Data-Aware Aggregation and Compression Collaborators: Kathryn Mohror, Adam Moody, Bronis de Supinski

Big Picture of HPC
Compute nodes reach the parallel file system through gateway nodes, causing network contention; clusters such as Atlas and Hera also contend for shared file system resources and with other clusters.

Checkpointing in HPC
MPI applications take globally coordinated checkpoints asynchronously.
Application-level checkpoints use a high-level data format for portability (HDF5, ADIOS, netCDF, etc.), written through a data-format API and I/O library to the parallel file system (PFS).
Checkpoint writing modes: N→1 (Funneled) is not scalable; N→M (Grouped) is the best compromise but complex; N→N (Direct) is easiest but causes contention on the PFS.
Example application-level checkpoint:
struct ToyGrp {
  float Temperature[1024];
  short Pressure[20][30];
};
Corresponding HDF5 layout:
HDF5 checkpoint {
  Group "/" {
    Group "ToyGrp" {
      DATASET "Temperature" { DATATYPE H5T_IEEE_F32LE; DATASPACE SIMPLE {(1024) / (1024)} }
      DATASET "Pressure"    { DATATYPE H5T_STD_U8LE;  DATASPACE SIMPLE {(20, 30) / (20, 30)} }
    }
  }
}
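As a concrete illustration of application-level checkpointing through a data-format API, the sketch below writes the ToyGrp variables into an HDF5 file. It is a minimal, hedged example (the file name, the exact dataset types, and the lack of error handling are my assumptions, not part of the original slides):

```c
#include <hdf5.h>

/* Minimal sketch: write ToyGrp's variables as HDF5 datasets. */
int write_toygrp_checkpoint(const float *temperature, const short *pressure)
{
    hid_t file = H5Fcreate("ckpt.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t grp  = H5Gcreate2(file, "/ToyGrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t tdims[1] = {1024};
    hid_t tspace = H5Screate_simple(1, tdims, NULL);
    hid_t tset   = H5Dcreate2(grp, "Temperature", H5T_IEEE_F32LE, tspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(tset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, temperature);

    hsize_t pdims[2] = {20, 30};
    hid_t pspace = H5Screate_simple(2, pdims, NULL);
    hid_t pset   = H5Dcreate2(grp, "Pressure", H5T_STD_I16LE, pspace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(pset, H5T_NATIVE_SHORT, H5S_ALL, H5S_ALL, H5P_DEFAULT, pressure);

    H5Dclose(pset); H5Sclose(pspace);
    H5Dclose(tset); H5Sclose(tspace);
    H5Gclose(grp);  H5Fclose(file);
    return 0;
}
```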

Impact of Load on PFS at Large Scale
IOR, Direct (N→N): 78 MB per process.
Observations: (−) large average write time; (−) large average read time. The consequence is less frequent checkpointing and poor application performance.
[Figure: average write time (s) and average read time (s) vs. number of processes N, from 128 to 15,408]

What is the Problem?
Today's checkpoint-restart systems will not scale: the number of concurrent transfers is increasing, and so is the volume of checkpoint data.

Our Contributions
Data-aware aggregation: reduces the number of concurrent transfers; improves compressibility of checkpoints by using semantic information.
Data-aware compression: improves compression ratio by up to 115% compared to concatenation plus general-purpose compression.
Design and development of mcrEngine: a Grouped (N→M) checkpointing system that improves checkpointing frequency and application performance.

Naïve Solution: Data-Agnostic Compression
Agnostic scheme: concatenate checkpoints C1, C2, compress with pGzip (first phase), then write to the PFS.
Agnostic-block scheme: interleave fixed-size blocks (C1[1..B], C2[1..B], C1[B+1..2B], C2[B+1..2B], ...), compress with pGzip, then write to the PFS.
Observations: (+) easy; (−) low compression ratio.
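For concreteness, the sketch below interleaves two checkpoint buffers in fixed-size blocks before handing the result to a general-purpose compressor; the buffer names and block size B are illustrative assumptions, not the slides' implementation:

```c
#include <string.h>
#include <stddef.h>

/* Agnostic-block merging sketch: interleave c1 and c2 in B-byte blocks.
 * 'out' must hold at least n1 + n2 bytes; returns the merged length. */
size_t interleave_blocks(const char *c1, size_t n1,
                         const char *c2, size_t n2,
                         size_t B, char *out)
{
    size_t o = 0, i1 = 0, i2 = 0;
    while (i1 < n1 || i2 < n2) {
        size_t take1 = (n1 - i1 < B) ? n1 - i1 : B;   /* next block of C1 */
        memcpy(out + o, c1 + i1, take1); o += take1; i1 += take1;
        size_t take2 = (n2 - i2 < B) ? n2 - i2 : B;   /* next block of C2 */
        memcpy(out + o, c2 + i2, take2); o += take2; i2 += take2;
    }
    return o;  /* the merged buffer is then compressed (e.g., with pGzip) */
}
```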
Our Solution
[Step 1] Identify similar variables across processes. Example: P0 declares Group ToyGrp { float Temperature[1024]; int Pressure[20][30]; } while P1 declares Group ToyGrp { float Temperature[100]; int Pressure[10][50]; }. Matching uses meta-data: (1) name, (2) data-type, (3) class (array or atomic).
[Step 2] Merging.
Scheme I (Aware): concatenate similar variables, e.g., C1.T C2.T followed by C1.P C2.P.
Scheme II (Aware-Block): interleave similar variables, taking the first and next 'B' bytes of Temperature, then of Pressure.
[Step 3] Data-Aware Aggregation & Compression
Aware scheme: concatenate similar variables; Aware-block scheme: interleave similar variables.
First phase: data-type aware compression of the merged variables (e.g., FPC for double variables, Lempel-Ziv for others) into an output buffer.
Second phase: pGzip over the output buffer, then write to the PFS.
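A minimal sketch of the two-phase idea, showing only the type-based dispatch of the first phase; the callers supply the actual compressors (FPC, fpzip, Lempel-Ziv), and this is an assumed structure rather than mcrEngine's real interface:

```c
#include <stddef.h>

typedef enum { T_DOUBLE, T_FLOAT, T_OTHER } var_type_t;
typedef size_t (*compress_fn)(const void *in, size_t n, void *out);

/* First phase: pick a type-specific compressor for each merged variable group.
 * FPC for doubles, fpzip for single-precision floats, Lempel-Ziv for the rest;
 * the selection logic is what matters here. */
size_t first_phase(var_type_t type, const void *merged, size_t n, void *out,
                   compress_fn fpc, compress_fn fpzip, compress_fn lz)
{
    switch (type) {
    case T_DOUBLE: return fpc(merged, n, out);
    case T_FLOAT:  return fpzip(merged, n, out);
    default:       return lz(merged, n, out);
    }
}
/* Second phase (not shown): the concatenated first-phase output is compressed
 * once more with a general-purpose compressor such as pGzip before writing. */
```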

How MCRENGINE Works
CNC: compute node component; ANC: aggregator node component.
Processes form rank-ordered groups and use Grouped (N→M) transfer.
Within each group, the CNC identifies "similar" variables; the aggregator requests their meta-data, then applies data-aware aggregation and compression (merging variables such as T, P, H, D across the group, followed by pGzip) and writes the result to the PFS.

Evaluation
Applications: ALE3D (4.8 GB per checkpoint set), Cactus (2.41 GB), Cosmology (1.1 GB), Implosion (13 MB).
Experimental test-bed: LLNL's Sierra, a 261.3 TFLOP/s Linux cluster with 23,328 cores and a 1.3 PB Lustre file system.
Compression algorithms: FPC [1] for double-precision floats, fpzip [2] for single-precision floats, Lempel-Ziv for all other data types, pGzip for general-purpose compression.

Evaluation Metrics
Effectiveness of data-aware compression: What is the benefit of multiple compression phases? How does group size affect compression ratio?
Compression ratio = uncompressed size / compressed size.
Performance of mcrEngine: overhead of the checkpointing phase; overhead of the restart phase.

Multiple Phases of Data-Aware Compression are Beneficial; No Benefit with Data-Agnostic Double Compression
Data-agnostic double compression is not beneficial, because the resulting data format is non-uniform and incompressible.
Data-type aware compression improves compressibility: the first phase changes the underlying data format, so the second phase still helps.
[Figure: compression ratio after the first and second compression phases, data-agnostic vs. data-aware, for ALE3D, Cactus, Cosmology, and Implosion]

Impact of Group Size on Compression Ratio
Different merging schemes are better for different applications; a larger group size is beneficial for certain applications.
ALE3D: improvement of 8% from group size 2 to 32.
[Figure: compression ratio vs. group size (2-32) for the Aware and Aware-Block schemes, ALE3D and Cactus]

Data-Aware Technique Always Wins over Data-Agnostic
The data-aware technique always yields a better compression ratio than the data-agnostic technique (relative improvement of 98-115%).
[Figure: compression ratio vs. group size (2-32) for the Aware, Aware-Block, Agnostic, and Agnostic-Block schemes, ALE3D and Cactus]

Summary of Effectiveness Study
Data-aware compression always wins; it reduces gigabytes of data for Cactus.
Larger group sizes may improve compression ratio.
Different merging schemes suit different applications.
Compression ratio follows the course of the simulation.

Impact of Data-Aware Compression on Latency
IOR with Grouped (N→M) transfer, groups of 32 processes; data-aware: 1.2 GB, data-agnostic: 2.4 GB.
Data-aware compression improves I/O performance at large scale: 43%-70% improvement during writes, 48%-70% during reads.
[Figure: average transfer time (s) vs. number of processes (128 to 28,672) for agnostic vs. aware writes and reads]

Impact of Aggregation & Compression on Latency
Used IOR. Direct (N→N): 87 MB per process; Grouped (N→M): group size 32, 1.21 GB per aggregator.
[Figure: average write time (s) and average read time (s) vs. number of processes (128 to 15,408), N→N vs. N→M]

End-to-End Checkpointing Overhead
15,408 processes; group size of 32 for N→M schemes; each process takes a checkpoint.
Data-aware compression converts a network-bound operation into a CPU-bound one, reducing checkpointing overhead by 51%-87%.
[Figure: total checkpointing overhead (s), split into transfer and CPU overhead, for Direct without compression, Grouped without compression, Grouped with agnostic compression, and Grouped with aware compression, ALE3D and Cactus]

End-to-End Restart Overhead
Reduced overall restart overhead; reduced network load and transfer time.
[Figure: total recovery overhead (s), split into transfer and CPU overhead, for Direct without compression, Grouped without compression, Grouped with agnostic compression, and Grouped with aware compression, ALE3D and Cactus; I/O and recovery overhead reductions between 43% and 71%]

Summary of Scalable Checkpointing System
Developed a data-aware checkpoint compression technique: relative improvement in compression ratio of up to 115%; investigated different merging techniques; demonstrated effectiveness using real-world applications.
Designed and developed MCRENGINE: reduces recovery overhead by more than 62%; reduces checkpointing overhead by up to 87%; improves scalability of checkpoint-restart systems.

Benefit-Aware Clustering of Checkpoints from Parallel Applications Collaborators: Todd Gamblin, Kathryn Mohror, Adam Moody, Bronis de Supinski

Our Goal & Contributions
Goal: Can suitably grouping checkpoints increase compressibility?
Contributions: design a new metric for "similarity" of checkpoints; use this metric for clustering checkpoints; evaluate the benefit of the clustering on checkpoint storage.

Different Clustering Schemes
[Figure: 16 checkpoints grouped three ways: rank-wise (consecutive ranks), random, and our data-aware solution]

Research Questions
How to cluster checkpoints? Does clustering improve compression ratio?

Benefit-Aware Clustering
Similarity metric: improvement in reduction (the benefit matrix β).
Goal: minimize the total compressed size.
[Figure: benefit matrix β of Cactus]

Novel Dissimilarity Metric
Two factors determine the dissimilarity between two checkpoints: their direct pairwise benefit, and how differently they benefit from all other checkpoints:
Δ(i, j) = (1 / β(i, j)) × Σ_{k=1}^{N} [β(i, k) − β(j, k)]²
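A small sketch of this metric, assuming the benefit matrix β has already been computed; the row-major array layout and the guard against division by zero are my additions:

```c
#include <stddef.h>

/* Dissimilarity between checkpoints i and j, given an N x N benefit matrix
 * stored row-major in 'beta': a large pairwise benefit beta[i][j] lowers the
 * dissimilarity, while differing benefit profiles across all checkpoints
 * raise it. */
double dissimilarity(const double *beta, size_t N, size_t i, size_t j)
{
    double sum = 0.0;
    for (size_t k = 0; k < N; k++) {
        double d = beta[i * N + k] - beta[j * N + k];
        sum += d * d;
    }
    double b = beta[i * N + j];
    if (b == 0.0)                      /* guard (assumption): zero benefit */
        return sum == 0.0 ? 0.0 : 1.0e308;  /* treat as maximally dissimilar */
    return sum / b;
}
```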
How Benefit-Aware Clustering Works
Each process's checkpoint (e.g., double T[3000]; double V[10]; double P[5000]; double D[4000]; double R[100];) is first filtered and ordered to keep the dominant variables (such as T, P, D), then sampled by chunking or a wavelet transform.
The benefit matrix β is computed on the samples, pairwise similarity is derived from it, and the processes P1-P5 are clustered accordingly (e.g., Cluster 1 = {P1, P3, P4}, Cluster 2 = {P2, P5}).
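As an illustration of the sampling step, the sketch below draws fixed-size chunks at regular strides from a variable's data; the chunk size and stride are illustrative parameters, not values from the slides:

```c
#include <string.h>
#include <stddef.h>

/* Chunking-based sampling sketch: copy the first chunk out of every group of
 * 'stride' consecutive 'chunk'-byte blocks of 'data' into 'sample'.
 * Returns the sampled length; 'sample' must be large enough to hold it. */
size_t sample_by_chunking(const char *data, size_t n,
                          size_t chunk, size_t stride, char *sample)
{
    size_t out = 0;
    for (size_t off = 0; off < n; off += chunk * stride) {
        size_t take = (n - off < chunk) ? n - off : chunk;
        memcpy(sample + out, data + off, take);
        out += take;
    }
    return out;
}
```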

Structure of MCRCLUSTER
[Figure: compute-node components on processes P1-P5 (filter, order, sample, cluster stages) feed two aggregators A1 and A2, which write to the PFS]

Evaluation
Applications: IOR (synthetic checkpoints), Cactus.
Experimental test-bed: LLNL's Sierra, a 261.3 TFLOP/s Linux cluster with 23,328 cores and a 1.3 PB Lustre file system.
Evaluation metrics: macro benchmark (effectiveness of clustering); micro benchmark (effectiveness of sampling).

Effectiveness of MCRCLUSTER
IOR: 32 checkpoints; odd-ranked processes write 0, even-ranked processes write <rank> | 1234567.
Result: 29% more compression compared to rank-wise grouping, 22% more compared to random grouping.

Effectiveness of Sampling
Take-away: the chunking method preserves benefit relationships most closely.
[Figure: for each variable (x axis), the range of benefit values (y axis) under chunking vs. wavelet-transform sampling]

Contributions of MCRCLUSTER
Designed similarity and distance metrics; demonstrated significant results on synthetic data: 22% and 29% improvement compared to random and rank-wise clustering, respectively.
Future directions (for a first-year Ph.D. student): study impact on real applications; design a scalable clustering technique.

Applicability of My Research
Condor systems; compression for scientific data.

Conclusions
This thesis addresses the reliability of checkpointing-based recovery in large-scale computing.
It proposes three novel systems:
FALCON: distributed checkpointing system for Grids.
MCRENGINE: data-aware compression and scalable checkpointing system for HPC.
MCRCLUSTER: benefit-aware clustering of checkpoints.
Together they provide a good foundation for further research in this field.

Questions?

Future Directions: Reliability
Similarity-based process grouping for better compression: group processes based on similarity instead of rank [ongoing].
Analytical solution to group size selection.
Variable streaming.
Integrating mcrEngine with SCR.

Future Directions: Performance
Cache usage analysis and optimization: developed a user-level tool for analyzing cache utilization [Summer'12].
Short-term goals: apply to real applications; automate analysis.
Long-term goals: suggest potential code optimizations; automate application tuning.

Contact Information
Tanzima Islam (tislam@purdue.edu)
Website: web.ics.purdue.edu/~tislam

Effectiveness of mcrCluster

Backup Slides
[Backup Slide] Failures in HPC
"A Large-scale Study of Failures in High-performance Computing Systems", by Bianca Schroeder and Garth Gibson.
[Figures: breakdown of root causes of failures; breakdown of downtime into root causes]
[Backup Slide] Failures in HPC
"Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm", by Laxmikant Kalé et al.
Disparity between network bandwidth and memory size.
[Backup Slides] Falcon
[Backup Slide] Breakdown of Overheads
Performance scales with checkpoint sizes; lower network transfer overhead.
[Figure: checkpointing overhead (s) and recovery overhead (s), split into transfer and CPU overhead, for 500 MB, 946 MB, and 1677 MB checkpoints under schemes D and F]
[Backup Slide] Parallel Falcon
67% improvement in CPU time.
[Figure: checkpoint storing overhead (s) and recovery overhead (s) for 500 MB, 946 MB, and 1677 MB checkpoints under schemes D, F, and PF]
[Backup Slides] mcrEngine
[Backup Slide] How to Find Similarity
Inside source code: variables are represented as members of a group; a group can be thought of as the C construct "struct".
P0: Group ToyGrp { float Temperature[1024]; short Pressure[20][30]; int Humidity; };
P1: Group ToyGrp { float Temperature[50]; short Pressure[2][6]; double Unit; int Humidity; };
Inside a checkpoint, variables are annotated with metadata (name, data-type, class), from which a hash key is generated for matching:
P0: ToyGrp/Temperature (F32LE, Array1D[1024]) → ToyGrp/Temperature_F32LE_Array1D; ToyGrp/Pressure (S8LE, Array2D[20][30]) → ToyGrp/Pressure_S8LE_Array2D; ToyGrp/Humidity (I32LE, Atomic) → ToyGrp/Humidity_I32LE_Atomic.
P1: ToyGrp/Temperature (F32LE, Array1D[50]) → ToyGrp/Temperature_F32LE_Array1D; ToyGrp/Pressure (S8LE, Array2D[2][6]) → ToyGrp/Pressure_S8LE_Array2D; ToyGrp/Unit (F64LE, Atomic) → ToyGrp/Unit_F64LE_Atomic (no match); ToyGrp/Humidity (I32LE, Atomic) → ToyGrp/Humidity_I32LE_Atomic.
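A minimal sketch of how such a matching key could be formed from a variable's metadata; the field names and formatting helper are illustrative, not mcrEngine's actual code:

```c
#include <stdio.h>

/* Illustrative metadata for one checkpointed variable. */
struct var_meta {
    const char *name;  /* e.g., "ToyGrp/Temperature"  */
    const char *type;  /* e.g., "F32LE"               */
    const char *kind;  /* e.g., "Array1D" or "Atomic" */
};

/* Build a matching key such as "ToyGrp/Temperature_F32LE_Array1D".
 * Variables on different processes with the same key are "similar",
 * regardless of their array lengths. */
int make_match_key(const struct var_meta *v, char *key, size_t len)
{
    return snprintf(key, len, "%s_%s_%s", v->name, v->type, v->kind);
}
```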
[Backup Slide] Compression Ratio Follows Course of Simulation
The data-aware technique always yields better compression.
[Figure: compression ratio over simulation time-steps for the Aware-Block, Aware, Agnostic-Block, and Agnostic schemes, for Cactus, Cosmology, and Implosion]
[Backup Slide] Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme

Application | Total Size (GB) | Data Types DF (%) | SF (%) | Int (%) | Aware-Block (%) | Aware (%)
ALE3D       | 4.8   | 88.8  | 11.2 | ~0    | 6.6 - 27.7  | 6.6 - 12.7
Cactus      | 2.41  | 33.94 | 0    | 66.06 | 10.7 - 11.9 | 98 - 115
Cosmology   | 1.1   | 24.3  | 67.2 | 8.5   | 20.1 - 25.6 | 20.6 - 21.1
Implosion   | 0.013 | 0     | 74.1 | 25.9  | 36.3 - 38.4 | 36.3 - 38.8

References
1. M. Burtscher and P. Ratanaworabhan, "FPC: A High-Speed Compressor for Double-Precision Floating-Point Data".
2. P. Lindstrom and M. Isenburg, "Fast and Efficient Compression of Floating-Point Data".
3. L. Reinhold, "QuickLZ".

Reliable and Efficient System for Storing Checkpoints in Grid
Execution environment: Grid

State-of-the-Art: Checkpointing in Grid with Dedicated Storage
Jobs submitted across sites (@Notre Dame, @Purdue, @Indiana U.) over the Internet store checkpoints on a dedicated storage server.
Problems: (−) high transfer latency; (−) contention on servers; (−) stress on shared network resources.

Research Question
Can we improve the performance of applications by storing checkpoints on the grid resources?

Overview of Our Solution: Checkpointing in Grid with Distributed Storage
Checkpoints are stored on the grid resources themselves (@Notre Dame, @Purdue, @Indiana U. over the Internet) rather than on a dedicated server.
Q1. Which storage nodes? Q2. How to balance load? Q3. How to store and retrieve efficiently?
Constraint: all components must be user-level.

Answer to Q1: Storage Host Selection
Build a failure model for storage resources: compute correlated temporal reliability based on historical data.
Rank machines based on reliability, load, and network overhead; output: (m+k) storage hosts for the compute host (this also addresses Q2).
Objective function: checkpoint storing overhead − benefit from restart.
[Figure: compute host selecting Storage Host 1 and Storage Host 2, avoiding hosts that are down]
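A small sketch of how candidate hosts might be ranked by such criteria; the fields and weights are illustrative assumptions, not Falcon's actual model:

```c
#include <stdlib.h>

/* Illustrative per-host statistics used for ranking (fields are assumptions). */
struct host {
    const char *name;
    double reliability;       /* correlated temporal reliability, 0..1    */
    double load;              /* current load; higher is worse            */
    double network_overhead;  /* estimated transfer cost; higher is worse */
};

/* Higher score = more attractive storage host (weights are assumptions). */
static double score(const struct host *h)
{
    return 2.0 * h->reliability - h->load - h->network_overhead;
}

static int by_score_desc(const void *a, const void *b)
{
    double sa = score((const struct host *)a), sb = score((const struct host *)b);
    return (sa < sb) - (sa > sb);
}

/* Sort candidates best-first; the caller keeps the top (m + k) as storage hosts. */
void rank_storage_hosts(struct host *candidates, size_t n)
{
    qsort(candidates, n, sizeof candidates[0], by_score_desc);
}
```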

Checkpoint-Recovery Scheme
Checkpoint storing phase: the original checkpoint on disk is compressed, then erasure-encoded into (m+k) fragments, which are distributed to storage hosts 1..m+k.
Recovery phase: any m fragments are retrieved, erasure-decoded, and decompressed to reconstruct the checkpoint.
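To make the (m+k) idea concrete, the sketch below shows the degenerate case k = 1 with simple XOR parity: m data fragments plus one parity fragment, where any single lost fragment can be rebuilt from the remaining m. Falcon uses a general erasure code, not this toy scheme:

```c
#include <stddef.h>

/* Toy (m+1) erasure code: fragments[0..m-1] hold the data, fragments[m]
 * receives the XOR parity. All fragments are 'len' bytes. */
void encode_parity(unsigned char **fragments, size_t m, size_t len)
{
    for (size_t b = 0; b < len; b++) {
        unsigned char p = 0;
        for (size_t i = 0; i < m; i++)
            p ^= fragments[i][b];
        fragments[m][b] = p;
    }
}

/* Rebuild one missing fragment (index 'lost', in 0..m) by XOR-ing the others. */
void rebuild_fragment(unsigned char **fragments, size_t m, size_t len, size_t lost)
{
    for (size_t b = 0; b < len; b++) {
        unsigned char p = 0;
        for (size_t i = 0; i <= m; i++)
            if (i != lost)
                p ^= fragments[i][b];
        fragments[lost][b] = p;
    }
}
```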

Evaluation Setup
Two different applications with four input sets: MCF (SPEC CPU 2006) and TIGR (BioBench); system-level checkpoints.
Macro benchmark experiment: average job makespan.
Micro benchmark experiments: efficiency of checkpoint and restart; efficiency in handling simultaneous clients; efficiency in handling multiple failures.

Checkpoint Storing & Recovery Overhead
Performance scales with checkpoint sizes; lower network transfer overhead.
[Figure: checkpointing overhead (s) and recovery overhead (s), split into transfer and CPU overhead, for 500 MB, 946 MB, and 1677 MB checkpoints under schemes D and F]

Overall Performance Comparison
Performance improvement between 11% and 44%.
[Figure: average makespan time (min) for mcf and tigr under a remote dedicated server, a local dedicated server, and Falcon with distributed storage]

Summary of Reliable Checkpointing System
Developed a reliable checkpoint-recovery system, FALCON: selects reliable storage hosts, preferring lightly loaded ones; compresses and encodes checkpoints; stores and retrieves them efficiently.
Ran experiments with FALCON in DiaGrid: performance improvement between 11% and 44%.

Checkpointing in HPC
[Figure: compute nodes reach the parallel file system through gateway nodes; network contention, contention for shared file system resources, and contention with other clusters (e.g., Atlas, Hera) for the file system]

2-D vs N-D Compression


Challenge in Extreme-Scale: Increase in Failure Rate
[Figure: projected peak performance growth from 1 Gflop/s to 1 Eflop/s between 1996 and 2020 for the N=1 and N=500 systems]

Towards Online Clustering
Reduce the dimension of β.
Reduce the number of variables: keep a representative data type and variables whose number of elements exceeds a threshold [example: 100 double-type variables cover 80% of the data].
Reduce the amount of data: sampling by random selection, chunking, or wavelet transform.