"mcrEngine": A Scalable Checkpointing System using Data-Aware Aggregation and Compression
Tanzima Z. Islam, Saurabh Bagchi, Rudolf Eigenmann (Purdue University); Kathryn Mohror, Adam Moody, Bronis R. de Supinski (Lawrence Livermore National Laboratory)
Background
- Checkpoint-restart is widely used by MPI applications, which take globally coordinated checkpoints.
- Application-level checkpoints are written in high-level I/O formats such as HDF5, ADIOS, and netCDF, through a data-format API and I/O library.
- Checkpoint-writing strategies:
  1. N->1: not scalable
  2. N->N: easiest, but causes contention on the parallel file system (PFS)
  3. N->M: best compromise, but complex
Example: an application structure and its HDF5 checkpoint layout.

  Group Toy.Grp {
    float Temperature[1024];
    short Pressure[20][30];
  };

  HDF5 checkpoint {
    Group "/" {
      Group "Toy.Grp" {
        DATASET "Temperature" {
          DATATYPE  H5T_IEEE_F32LE
          DATASPACE SIMPLE {(1024) / (1024)}
        }
        DATASET "Pressure" {
          DATATYPE  H5T_STD_U8LE
          DATASPACE SIMPLE {(20, 30) / (20, 30)}
        }
  }}}

Tanzima Islam (tislam@purdue.edu), mcrEngine: Data-aware Aggregation & Compression
Impact of Load on PFS at Large Scale
- IOR benchmark, 78 MB of data per process, N->N checkpoint transfer.
- Observations: (-) large average write time; (-) large average read time.
- Consequence: less frequent checkpointing and poor application performance.
[Figure: average write time (s) and average read time (s) versus number of processes N, from 128 up to 15,408]
What is the Problem?
Today's checkpoint-restart systems will not scale, because of:
- the increasing number of concurrent transfers
- the increasing volume of checkpoint data
Our Contributions
- Data-aware aggregation: reduces the number of concurrent transfers and improves the compressibility of checkpoints.
- Data-aware compression: reduces data almost 2x more than simply concatenating checkpoints and compressing them.
- Design and development of mcrEngine, an N->M checkpointing system that decouples checkpoint-transfer logic from applications and improves application performance.
Overview
- Background
- Problem
- Data aggregation & compression
- Evaluation
Data-Agnostic Schemes
- Agnostic scheme: concatenate the checkpoints (C1, C2, ...) in a first phase, then Gzip the result and write it to the PFS.
- Agnostic-block scheme: interleave fixed-size B-byte blocks (C1[1..B], C2[1..B], C1[B+1..2B], C2[B+1..2B], ...), then Gzip and write to the PFS.
Observations: (+) easy; (-) low compression ratio.
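The two schemes above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: zlib substitutes for Gzip, and the 4096-byte default block size is an arbitrary choice.

```python
# Sketch of the two data-agnostic schemes; zlib stands in for Gzip,
# and the inputs are raw per-process checkpoint byte strings.
import zlib

def agnostic(checkpoints):
    """Agnostic scheme: concatenate C1, C2, ... and compress once."""
    return zlib.compress(b"".join(checkpoints))

def agnostic_block(checkpoints, block=4096):
    """Agnostic-block scheme: interleave fixed-size B-byte blocks
    (C1[0:B], C2[0:B], C1[B:2B], C2[B:2B], ...) before compressing."""
    interleaved = bytearray()
    longest = max(len(c) for c in checkpoints)
    for offset in range(0, longest, block):
        for c in checkpoints:
            interleaved += c[offset:offset + block]
    return zlib.compress(bytes(interleaved))
```

Block interleaving tries to place corresponding regions of different processes' checkpoints near each other so the compressor can exploit cross-process similarity, but without metadata the blocks rarely line up with variable boundaries.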
Aware Scheme: Identify Similar Variables Across Processes
P0: Group Toy.Grp { float Temperature[1024]; int Pressure[20][30]; };
P1: Group Toy.Grp { float Temperature[100];  int Pressure[10][50]; };
Variables are matched using three pieces of meta-data:
1. Name
2. Data-type
3. Class (array, atomic)
The aware scheme then concatenates similar variables: C1.T with C2.T, and C1.P with C2.P.
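The matching step can be sketched as below. This is a hypothetical illustration of the slide's rule, not the mcrEngine code: two variables are "similar" when their name, data-type, and class agree, while their sizes may differ (Temperature[1024] vs. Temperature[100]).

```python
# Hypothetical sketch of meta-data matching across two processes'
# checkpoints: name, data-type, and class must all agree; payload
# sizes are allowed to differ.
def match_similar(vars_p0, vars_p1):
    """Each map is name -> (dtype, klass, payload_bytes).
    Returns [(name, [payload_p0, payload_p1]), ...] for matches."""
    groups = []
    for name, (dtype, klass, data0) in vars_p0.items():
        if name in vars_p1:
            dtype1, klass1, data1 = vars_p1[name]
            if (dtype, klass) == (dtype1, klass1):
                groups.append((name, [data0, data1]))
    return groups
```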
Aware-Block Scheme
Using the same meta-data matching (name, data-type, class) between P0's Toy.Grp { float Temperature[1024]; int Pressure[20][30]; } and P1's Toy.Grp { float Temperature[100]; int Pressure[10][50]; }, the aware-block scheme interleaves similar variables instead of concatenating them: the first 'B' bytes of each process's Temperature, then the next 'B' bytes, and likewise for Pressure.
Data-Aware Aggregation & Compression
- Aware scheme: concatenate similar variables (C1.T with C2.T, C1.P with C2.P).
- Aware-block scheme: interleave similar variables.
First phase: data-type-aware compression of each merged stream, e.g. FPC for the Temperature stream and Lempel-Ziv for the Pressure stream.
Second phase: Gzip over the aggregated output buffer, then write to the PFS.
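The two-phase pipeline can be sketched as follows. This is a stand-in, not the authors' implementation: zlib substitutes for the real first-phase compressors (FPC, fpzip, Lempel-Ziv), and the 8-byte length prefix framing the buffer is an assumption for illustration.

```python
# Sketch of the two compression phases of the aware scheme.
import zlib

def data_aware_compress(groups):
    """groups: [(name, dtype, [payloads from each process]), ...].
    Phase 1: merge similar variables and compress each merged stream
    with a type-appropriate compressor (zlib stands in here).
    Phase 2: Gzip-style compression over the aggregated buffer."""
    buffer = bytearray()
    for name, dtype, payloads in groups:
        merged = b"".join(payloads)          # aware scheme: concatenate
        stream = zlib.compress(merged)       # stand-in for FPC / Lempel-Ziv
        buffer += len(stream).to_bytes(8, "little") + stream
    return zlib.compress(bytes(buffer))      # second phase
```

Merging similar variables before the first phase is what improves compressibility: each type-specific compressor sees one long, homogeneous stream instead of many short heterogeneous ones.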
How mcrEngine Works
- CNC: compute-node component; ANC: aggregator-node component.
- Rank-order groups with group size = 4; N->M checkpointing.
- Each ANC requests meta-data (e.g. for T, P) from the CNCs in its group, identifies "similar" variables, applies data-aware aggregation and compression, Gzips the aggregate, and writes it to the PFS.
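Rank-order grouping, the grouping policy named above, can be sketched as follows (an illustrative sketch, not the mcrEngine code): consecutive MPI ranks share one aggregator, so N processes produce only ceil(N/G) concurrent writers.

```python
# Sketch of rank-order grouping for N->M checkpointing: with group
# size G, consecutive ranks share one aggregator (ANC), reducing N
# concurrent PFS writers to ceil(N/G).
def rank_order_groups(nprocs, group_size=4):
    """Return the list of rank groups, one aggregator per group."""
    return [list(range(start, min(start + group_size, nprocs)))
            for start in range(0, nprocs, group_size)]
```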
Overview
- Background
- Problem
- Data aggregation & compression
- Evaluation
Evaluation
Applications:
- ALE3D: 4.8 GB per checkpoint set
- Cactus: 2.41 GB per checkpoint set
- Cosmology: 1.1 GB per checkpoint set
- Implosion: 13 MB per checkpoint set
Experimental test-bed:
- LLNL's Sierra: 261.3 TFLOP/s Linux cluster, 15,408 cores, 1.3-petabyte Lustre file system
Compression algorithms:
- FPC [1] for double-precision floats
- fpzip [2] for single-precision floats
- Lempel-Ziv for all other data-types
- Gzip for general-purpose compression
Evaluation Metrics
Effectiveness of data-aware compression:
- What is the benefit of multiple compression phases?
- How does group size affect compression ratio?
- How does compression ratio change as a simulation progresses?
Compression ratio = uncompressed size / compressed size
Performance of mcrEngine:
- Overhead of the checkpointing phase
- Overhead of the restart phase
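The metric above, stated as code for concreteness:

```python
# Compression ratio as defined on the slide: uncompressed size over
# compressed size. Higher is better; a ratio of 2 halves the data
# that must be transferred to the PFS.
def compression_ratio(uncompressed_size, compressed_size):
    return uncompressed_size / compressed_size
```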
Multiple Phases of Data-Aware Compression are Beneficial; Data-Agnostic Double Compression is Not
- Data-type-aware compression improves compressibility: the first phase changes the underlying data format.
- Data-agnostic double compression is not beneficial, because after the first pass the data format is non-uniform and incompressible.
[Figure: first-phase and second-phase compression ratios (0 to 4) for the data-agnostic and data-aware schemes on ALE3D, Cactus, Cosmology, and Implosion]
Impact of Group Size on Compression Ratio
- Different merging schemes are better for different applications.
- A larger group size is beneficial for certain applications; ALE3D improves by 8% from group size 2 to 32.
[Figure: compression ratio versus group size (1 to 128) for ALE3D, Cactus, Cosmology, and Implosion under the Aware-Block and Aware schemes]
Data-Aware Technique Always Wins over Data-Agnostic
The data-aware techniques always yield a better compression ratio than the data-agnostic technique; for Cactus, the improvement reaches 98-115%.
[Figure: compression ratio versus group size for ALE3D, Cactus, Cosmology, and Implosion under the Aware-Block, Aware, and Agnostic schemes]
Compression Ratio Follows the Course of the Simulation
The data-aware techniques yield better compression throughout the run.
[Figure: compression ratio versus simulation time-step for Cactus, Cosmology, and Implosion under the Aware-Block, Aware, Agnostic-Block, and Agnostic schemes]
Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme

  Application   Total Size (GB)   Aware-Block (%)   Aware (%)
  ALE3D         4.8               6.6 - 27.7        6.6 - 12.7
  Cactus        2.41              10.7 - 11.9       98 - 115
  Cosmology     1.1               20.1 - 25.6       20.6 - 21.1
  Implosion     0.013             36.3 - 38.4       36.3 - 38.8
Impact of Aggregation on Scalability
- Used IOR. N->N: each process transfers 78 MB. N->M: group size 32, 1.21 GB per aggregator.
[Figure: average write and read times (s) versus number of processes N (128 to 15,408) for N->N and N->M transfers]
Impact of Data-Aware Compression on Scalability
- IOR with N->M transfer, groups of 32 processes; data-aware: 1.2 GB per aggregator, data-agnostic: 2.4 GB.
- Data-aware compression improves I/O performance at large scale: writes improve by 43% - 70%, reads by 48% - 70%.
[Figure: average transfer time (s) versus number of processes N for Agnostic and Aware writes and reads]
End-to-End Checkpointing Overhead
- 15,408 processes; group size of 32 for the N->M schemes; each process takes a checkpoint.
- Data-aware aggregation and compression reduce checkpointing overhead by 87% (ALE3D) and 51% (Cactus), converting a network-bound operation into a CPU-bound one.
[Figure: total checkpointing overhead (s), split into CPU overhead and transfer overhead to the PFS, for ALE3D and Cactus under No Comp.+N->N, No Comp.+N->M, Indiv. Comp.+N->M, Agnostic+Agg, and Aware+Agg]
End-to-End Restart Overhead
- Reduced overall restart overhead, network load, and transfer time.
- Recovery overhead is reduced by 62% - 64%, and I/O overhead by 43% - 71%, across ALE3D and Cactus.
[Figure: total recovery overhead (s), split into CPU overhead and transfer overhead to the PFS, for ALE3D and Cactus under No Comp.+N->N, No Comp.+N->M, Indiv. Comp.+N->M, Agnostic+Agg, and Aware+Agg]
Conclusion
- Developed a data-aware checkpoint compression technique with relative improvements in compression ratio of up to 115%; investigated different merging techniques and evaluated their effectiveness on real-world applications.
- Designed and developed a scalable framework that implements N->M checkpointing, improves application performance, and transforms checkpointing into a CPU-bound operation.
Contact Information
Tanzima Islam (tislam@purdue.edu), website: web.ics.purdue.edu/~tislam
Acknowledgement
- Purdue: Saurabh Bagchi (sbagchi@purdue.edu), Rudolf Eigenmann (eigenman@purdue.edu)
- Lawrence Livermore National Laboratory: Kathryn Mohror (kathryn@llnl.gov), Adam Moody (moody20@llnl.gov), Bronis R. de Supinski (bronis@llnl.gov)
Backup Slides
[Backup Slide] Failures in HPC
Source: "A Large-Scale Study of Failures in High-Performance Computing Systems" by Bianca Schroeder and Garth Gibson.
- Breakdown of root causes of failures
- Breakdown of downtime into root causes
Future Work
- An analytical solution to group-size selection?
- A better way than rank-order grouping?
- Variable streaming? (an engineering challenge)
References
1. M. Burtscher and P. Ratanaworabhan, "FPC: A High-Speed Compressor for Double-Precision Floating-Point Data".
2. P. Lindstrom and M. Isenburg, "Fast and Efficient Compression of Floating-Point Data".
3. L. Reinhold, "QuickLZ".