Efficient Checkpointing of Java Software using Context-Sensitive Capture and Replay
Guoqing Xu, Atanas Rountev, Yan Tang, Feng Qin
Ohio State University
ESEC/FSE 07
PRESTO: Program Analyses and Software Tools Research Group, Ohio State University

Outline
Motivation
- Challenges for checkpointing/replaying Java software
- Summary of our approach
Contributions
- Static analyses
- Multiple execution regions
- Experimental evaluation
Conclusions

Motivation
Checkpointing/replaying has been used for a variety of purposes at system level
- Originally designed to support fault tolerance
- Debugging of OS and of parallel and distributed software
Checkpointing can benefit a number of software engineering tasks
- Reduce the cost of manual debugging and testing
- Support for automated techniques for debugging and testing: e.g., dynamic slicing and delta-debugging
- Inspired by both system-level checkpointing [Pan-PDD 88, Dunlap-OSDI 02, King-USENIX 05] and “saving-and-restoring” software engineering techniques [Saff-ASE 05, Orso-WODA 06, Elbaum-FSE 06]

Challenges
Ease of use and deployment
- Application-level checkpointing: no JVM/runtime support, just code analysis and instrumentation
- Challenge: no direct access to the call stack; no control over thread scheduling or external resources (files, etc.)
Reduce the size of the recorded state
- Dumping the entire heap may be prohibitively expensive, especially for large programs
- Challenge: static analyses to prune redundant state
Static and dynamic overhead
- Static analysis cost is amortized over multiple runs
- Approach is intended for long-running applications

Summary of Our Approach
Tool input: program + checkpoint definition
Performs static analyses and code instrumentation
Tool output: two program versions
First, an augmented checkpointing version is executed once to record (parts of) the run-time program states
- At the checkpoint: heap objects, static fields, locals
- At certain points along the call chain leading to the checkpoint
Next, a pruned replaying version is executed multiple times
- Restore variables saved at the checkpoint
- Restore variables saved at points along the call chain
How do we resume execution from the checkpoint?
- Step 1: control flow quickly reaches the checkpoint
- Step 2: recover state at checkpoint
- Step 3: incrementally recover state after call sites along the call chain leading to the checkpoint

Definitions
Crosscut call chain (CC-chain)
- A programmer-specified call chain that leads to the method that contains the checkpoint
- E.g., main(44) -> run(28)
Decision points
- A call site on the CC-chain (e.g., m.run), due to polymorphism
- A predicate on which a decision point or the checkpoint is control-dependent
At a decision point, the checkpointing version records the control-flow outcome
The replaying version uses this info to force the control flow to reach the checkpoint
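The record/replay of decision-point outcomes described above can be sketched as a simple ordered log. This is a minimal illustration, not the tool's implementation: the class `DecisionLog` and its method names are hypothetical, and the log is kept in memory here, whereas a real checkpointing run would have to persist it so that a later replaying run can read it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper (not from the slides): stores decision-point outcomes
// in the order they occur and returns them in that same order.
class DecisionLog {
    private final Deque<Object> outcomes = new ArrayDeque<>();

    // Checkpointing version: record the predicate outcome, return it unchanged.
    boolean save(boolean outcome) {
        outcomes.addLast(outcome);
        return outcome;
    }

    // Checkpointing version: record the run-time type of a call-site receiver.
    Class<?> saveType(Object receiver) {
        Class<?> type = receiver.getClass();
        outcomes.addLast(type);
        return type;
    }

    // Replaying version: return the next recorded predicate outcome.
    boolean readBoolean() {
        return (Boolean) outcomes.removeFirst();
    }

    // Replaying version: return the next recorded receiver type.
    Class<?> readType() {
        return (Class<?>) outcomes.removeFirst();
    }
}
```

Because outcomes are consumed in recording order, the replaying version must visit the decision points in the same order as the checkpointing version did.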

Replaying, Step 1: Recover the Call Stack
Predicate decision point: recover boolean value
Call site decision point o.m(a1, …, an)
- Recover the run-time type of the receiver object; instantiated during replaying using sun.misc.Unsafe
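Instantiating a receiver of a recorded type without running its constructor can be sketched with `sun.misc.Unsafe.allocateInstance`, which does exist in the JDK. Everything else here is an assumption for illustration: the class names `ReceiverRecovery` and `Worker` are hypothetical, and obtaining the `Unsafe` singleton reflectively through its private `theUnsafe` field is a common idiom, not something the slides describe.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

class ReceiverRecovery {
    // Hypothetical receiver class; the constructor sets a flag so that
    // skipping it is observable.
    static class Worker {
        boolean constructorRan;
        Worker() { constructorRan = true; }
        String run() { return "Worker.run"; }
    }

    // Allocates an instance of the recorded type without calling any
    // constructor, so all fields hold default values until state is restored.
    static Object allocate(Class<?> recordedType) {
        try {
            Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
            Field singleton = unsafeClass.getDeclaredField("theUnsafe");
            singleton.setAccessible(true);
            Object unsafe = singleton.get(null);
            Method allocateInstance =
                unsafeClass.getMethod("allocateInstance", Class.class);
            return allocateInstance.invoke(unsafe, recordedType);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Unsafe allocation unavailable", e);
        }
    }
}
```

The allocated object has the right dynamic type, so a virtual call on it dispatches to the same method as in the checkpointing run, which is all Step 1 needs to rebuild the call stack.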

Checkpointing Version

    static void main(String[] args) {
      Main m = new Main();
      boolean b = args.length != 0;
      => save(b);
      if (b) { // DP
        => save(type_of(m));
        m.run(args); // DP
      }
      ...
    }

    void run(String[] args) {
      processCmdLine(args);
      loadNecessaryClasses();
      Set wp_packs = getWpacks();
      Set body_packs = getBpacks();
      boolean b = Options.v().whole_jimple();
      => save(b);
      if (b) { // DP
        getPack("cg").apply();
        // --- checkpoint ---
        => save(…);
        getPack("wjtp").apply();
        getPack("wjop").apply();
        getPack("wjap").apply();
      }
      retrieveAllBodies();
      …
    }

Replaying Version

    static void main(String[] args) {
      Main m = new Main();
      boolean b = args.length != 0;
      => read(b);
      if (b) { // DP
        => read(type_of(m));
        => unsafe.allocate(m);
        => args = null;
        m.run(args); // DP
      }
    }

    void run(String[] args) {
      processCmdLine(args);
      loadNecessaryClasses();
      Set wp_packs = getWpacks();
      Set body_packs = getBpacks();
      boolean b = Options.v().whole_jimple();
      => read(b);
      if (b) { // DP
        getPack("cg").apply();
        // --- checkpoint ---
        => read(…);
        getPack("wjtp").apply();
        getPack("wjop").apply();
        getPack("wjap").apply();
      }
      retrieveAllBodies();
      …
    }

Step 2: Recover at the Checkpoint
Our static analysis selects locals for recording (for checkpointing)/recovering (for replaying) when
- They are written before the checkpoint
- They are read after the checkpoint
Record primitive-typed values or entire object graphs on the heap (all reachable objects)
Static fields are selected based on the same idea

    void run(String[] args) {
      processCmdLine(args);
      loadNecessaryClasses();
      Set wp_packs = getWpacks();
      Set body_packs = getBpacks();
      if (Options.v().whole_jimple()) {
        getPack("cg").apply();
        // --- checkpoint ---
        getPack("wjtp").apply();
        getPack("wjop").apply();
        getPack("wjap").apply();
      }
      retrieveAllBodies();
      for (Iterator i = body_packs.iterator(); i.hasNext(); ) { // body_packs is read here
        …
      }
      …
    }
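The selection rule for locals (written before the checkpoint and read after it) amounts to a set intersection. The sketch below assumes the two sets have already been computed by some data-flow analysis; the class `LocalSelection` and its method are hypothetical names, and variables are represented simply by their names.

```java
import java.util.Set;
import java.util.TreeSet;

class LocalSelection {
    // A local must be recorded/recovered only if it is both written before
    // the checkpoint and read after it.
    static Set<String> select(Set<String> writtenBefore, Set<String> readAfter) {
        Set<String> selected = new TreeSet<>(writtenBefore);
        selected.retainAll(readAfter);
        return selected;
    }
}
```

Applied to the run method on the slide, wp_packs and body_packs are both written before the checkpoint, but only body_packs is read in the code shown after it, so only body_packs would be selected.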

Selection of Static Fields
A whole program Mod/Use analysis
- A static field is “written” if its value is changed, or any heap object reachable from it is mutated
- A static field is “read” if its value is directly read
Analysis algorithm
- Context-sensitive and flow-insensitive; uses the points-to solution and the call graph from Spark [Lhotak CC-03]
- Bottom-up traversal of the SCC-DAG of the call graph
- For each method m, a set Cm is maintained to contain all objects from which a mutated object can be reached
- Propagate backwards the objects in Cm that escape a callee method to its callers
- Select a static field fld if PointsToSet(fld) ∩ Cm ≠ ∅
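The bottom-up propagation and the final selection test can be sketched over plain sets. This is a simplified illustration under stated assumptions: abstract heap objects and methods are named by strings, each SCC of the call graph is assumed to be already collapsed into a single node, the per-method local sets and escape sets are given as inputs, and context sensitivity is omitted. The class `StaticFieldSelection` and its parameters are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class StaticFieldSelection {
    // Computes Cm for every method. bottomUpOrder lists callees before callers.
    // localC.get(m): objects from which m itself can reach a mutated object.
    // escaping.get(m): objects that escape m to its callers.
    static Map<String, Set<String>> propagate(
            List<String> bottomUpOrder,
            Map<String, List<String>> callees,
            Map<String, Set<String>> localC,
            Map<String, Set<String>> escaping) {
        Map<String, Set<String>> c = new HashMap<>();
        for (String m : bottomUpOrder) {
            Set<String> cm = new HashSet<>(
                localC.getOrDefault(m, Collections.emptySet()));
            for (String callee : callees.getOrDefault(m, Collections.emptyList())) {
                // Only objects in the callee's set that escape it reach the caller.
                Set<String> fromCallee = new HashSet<>(
                    c.getOrDefault(callee, Collections.emptySet()));
                fromCallee.retainAll(
                    escaping.getOrDefault(callee, Collections.emptySet()));
                cm.addAll(fromCallee);
            }
            c.put(m, cm);
        }
        return c;
    }

    // A static field is selected if PointsToSet(fld) and Cm intersect.
    static boolean isSelected(Set<String> pointsToSetOfField, Set<String> cm) {
        return !Collections.disjoint(pointsToSetOfField, cm);
    }
}
```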

Step 3: Recover after the Checkpoint
Replaying only at decision points and the checkpoint is not enough to guarantee correct execution after the checkpoint
Additionally record/recover local variables in the CC-chain that will be read after each call site

    void main() {
      Set hs = new HashSet();
      B b = new B(hs);
      //-- reco/rest (type_of(b))
      b.m();
      //-- extra reco/rest (hs)
      if (hs == b.s) { … }  // hs uninitialized
    }

    class B {
      Set s;
      void m() {
        B r0 = this;
        r0.s = new HashSet();
        //-- checkpoint
        //-- reco/rest (r0)
        r0.s.add("");
      }
    }

Additional Issues
A checkpoint can have multiple run-time instances
If a method in the CC-chain has callers that are not in the chain, it has to be replicated
Currently do not support multi-threaded programs
Our technique does not guarantee the correctness of the execution when the post-checkpoint part of the program
- Depends on external resources, such as files, databases
- Depends on unique-per-execution values, such as clock
- Is modified with new cross-checkpoint dependencies
Multiple execution regions
- Designated by a starting point and an ending point
- Specified by two CC-chains

Study 1: Static Analysis

    Program      #R   #IP
    compress socksproxy 1 3 6 11
    socksecho    3    14
    raytrace     3    10
    soot-2.2.3   10   35
    muffine      3    20
    sablecc      4    11
    jess         3    8
    violet       4    9
    javacup      4    9
    jtar-1.21    2    4
    db           2    5
    jflex        2    8
    jb-6.1       3    5
    jlex-1.2.6   3    8

Static Analysis: Locals Reduction [figure]

Static Analysis: Static Fields Reduction [figure]

Static Analysis: Removed/Inserted Statements [figure]

Static Analysis Cost
Phase 1: Soot infrastructure cost
- Between 1.64 ms and 30.6 ms per thousand Jimple statements
- On average, 11.1 ms/1000 statements
Phase 2: Our analysis cost
- Between 1.67 ms and 26.6 ms per thousand Jimple statements
- On average, 9.4 ms/1000 statements
This should be amortized across multiple runs of the replaying version

Study 2: Run-Time Performance (compress)
Original program: compressing and decompressing 5 big tar files several times
Evaluated for five checkpoint definitions
- One checkpoint, close to the beginning of the program
- Two regions of compression and decompression
- A region containing the process of decompression
- One checkpoint, close to the end of the program

compress Performance [figures: normalized running times; normalized size of captured program state]

Study 2: Run-Time Performance (soot)
Input: soot-2.2.3 itself containing 2227333 methods
Phases
- Enabling cg.spark, wjtp, wjop.ji, wjap.uft, jtp, jop.cp
Evaluated for six checkpoint definitions
- Before whole-program packs
- After cg
- After wjtp
- After wjop
- After wjap
- After body packs

soot Performance [figures: normalized running times; normalized captured program state]

Study 2: Run-Time Performance (jflex-1.4.1)
Input: a .flex grammar file corresponding to a DFA containing 21769 states
Evaluated for four checkpoint definitions
- After NFA is generated DFA is generated to DFA minimization emission

jflex Performance [figures: normalized running time; normalized size of captured state]

Summary of Evaluation
Static analysis successfully reduces the size of program state recorded and recovered
It is more meaningful to checkpoint/replay long-running programs
Checkpoints are better taken after a phase of long-running computation with (relatively) small output state
- √ compress: small program state, short running time
- √ soot: large program state, but very long computation time
- X jflex: large program state, short running time

Conclusions
A static-analysis-based checkpointing/replaying technique
An implementation and an evaluation that shows our technique can be an interesting candidate for testing, debugging, and dynamic slicing of long-running programs
Future work
- Language-level checkpointing/replaying of multi-threaded programs
- More precise static analyses could be employed to reduce the size of program state to be captured
- The run-time support for object reading and writing could be improved

Questions?

compress

    Run  #Objects  Space    %Heap   Time_c (s) (%wio)  Time_r (s) (%rio)
    1    31        471 by   0.17%   4.19 (0.74%)       4.14 (0.38%)
    2    545       89.7 M   28.8%   5.22 (10.4%)       3.19 (11.8%)
    3    22        89.7 by  28.9%   5.38 (9.0%)        2.17 (12.8%)
    4    578       89 M     26.7%   4.70 (12.3%)       1.39 (24.7%)
    5    31        296 by   0.008%  4.17 (8.1%)        47 (34.0%)

Original running time: 4.05 s

soot

    Run  #Objects  Space    %Heap  Time_c (s) (%wio)  Time_r (s) (%rio)
    1    461058    36.2 M   36.3%  4695.3 (0.4%)      4643.5 (0.5%)
    2    65648481  745 M    73.2%  4712.2 (7.2%)      4410.5 (9.1%)
    3    65648481  745 M    73.2%  4688.4 (6.9%)      4387.3 (8.7%)
    4    77739391  806.4 M  79.0%  4770.1 (8.0%)      511.5 (95.2%)
    5    77767256  806.5 M  63.5%  4972.8 (8.0%)      533.1 (97.8%)
    6    75668735  795.3 M  72.8%  4661.6 (8.0%)      411.5 (96.5%)

Original running time: 4665.7 s
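To make the amortization argument concrete, the soot numbers for run 4 (checkpointing run 4770.1 s, replaying run 511.5 s, original 4665.7 s) can be plugged into a small calculation. The cost model is an assumption for illustration only: it compares re-running the original program k times against one checkpointing run plus k replaying runs, and it ignores the one-time static analysis cost. The class `ReplayPayoff` is a hypothetical name.

```java
class ReplayPayoff {
    // How many times faster one replaying run is than one original run.
    static double speedup(double originalSec, double replaySec) {
        return originalSec / replaySec;
    }

    // Smallest number of replays k for which
    // checkpointSec + k * replaySec < k * originalSec.
    // Assumes replaySec < originalSec; otherwise replaying never pays off.
    static int breakEvenReplays(double originalSec, double checkpointSec,
                                double replaySec) {
        double savedPerReplay = originalSec - replaySec;
        return (int) Math.floor(checkpointSec / savedPerReplay) + 1;
    }
}
```

For run 4 this gives a speedup of roughly 9x per replay, and the checkpointing run is paid back from the second replay onward under the assumed model.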

jflex

    Run  #Objects  Space    %Heap    Time_c (s) (%wio)  Time_r (s) (%rio)
    1    6606489   259.8 M  86.1%    64.9 (8.0%)        68.8 (18.3%)
    2    6695173   385.1 M  68.1%    65.2 (12.3%)       55.6 (26.1%)
    3    6695172   385.1 M  68.1%    63.9 (12.1%)       55.4 (26.0%)
    4    21        2 K      0.0003%  56.2 (0.14%)       0.063 (50.8%)

Original running time: 52.6 s