The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure
Yongle Zhang, Kirk Rodrigues, Yu Luo, Michael Stumm, Ding Yuan
The Cost of Debugging
Share of programming time (pie chart): Coding, Design, Debugging
$312 Billion [Britton et al. 2013]
Root Cause – A Major Goal of Debugging
Bug Report -> [Failure Reproduction] -> Failure Execution -> [Root Cause Localization] -> Root Cause
Failure Reproduction is Well Studied
• Deterministic replay
• Tracing
• Reproduction using log/core-dump
• Hardware support (Intel-PT)
Root Cause Localization is Challenging
Failure Execution: millions ~ billions of instructions
Our Goal – Automating Root Cause Localization
Failure Execution -> [Root Cause Localization] -> Root Cause
State of The Art – Probabilistic Approach
• Instruments program to record predicates
• E.g., whether a branch condition (a!=0) is satisfied
• Collects a large number of traces (failed executions and successful executions)
• Reports the predicates with the strongest statistical correlation
Limitations:
• Result is probabilistic
• Requires a large number of failure executions
• Lacks execution context
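To make the predicate-correlation idea concrete, here is a minimal Java sketch. The scoring formula (failure rate among runs where the predicate held, minus the overall failure rate) and the names `PredicateScore`, `Run` and `score` are assumptions chosen for illustration; the slides do not specify which statistic the probabilistic tools use.

```java
import java.util.List;

public class PredicateScore {
    /** One recorded run: was the predicate observed true, and did the run fail? */
    public static final class Run {
        final boolean predicateTrue;
        final boolean failed;

        public Run(boolean predicateTrue, boolean failed) {
            this.predicateTrue = predicateTrue;
            this.failed = failed;
        }
    }

    /**
     * Illustrative score (an assumption of this sketch):
     * P(fail | predicate true) - P(fail).
     * A larger value means the predicate is more strongly associated with failure.
     */
    public static double score(List<Run> runs) {
        int total = runs.size();
        int failed = 0;
        int predTrue = 0;
        int predTrueFailed = 0;
        for (Run r : runs) {
            if (r.failed) failed++;
            if (r.predicateTrue) {
                predTrue++;
                if (r.failed) predTrueFailed++;
            }
        }
        if (total == 0 || predTrue == 0) return 0.0; // nothing observed, no evidence
        return (double) predTrueFailed / predTrue - (double) failed / total;
    }

    public static void main(String[] args) {
        // Predicate "a != 0" held in both failed runs and in neither successful run.
        List<Run> runs = List.of(
            new Run(true, true), new Run(true, true),
            new Run(false, false), new Run(false, false));
        System.out.println(score(runs)); // 1.0 - 0.5 = 0.5
    }
}
```

The score is only an estimate over the collected runs, which is why this style of analysis needs many failed executions and says nothing about the order in which instructions ran.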
Towards Analytical Root Cause Localization
What is the fundamental property of root cause that allows us to build a tool to automatically search for it?
What is a Root Cause?
"The most basic reason for a failure which, if corrected, would have prevented the failure from occurring." -- Wilson et al. ASQ Quality Press 1993
What is a Root Cause?
1. The root cause is an instruction in the failure execution which, if changed, would have resulted in a correct execution.
2. The root cause is the most basic cause of the failure.
What is a Root Cause? (example, time flows downward)
Thread 0: a=0;
Thread 1: a=-1;
Thread 0: if(a!=0) FAIL;
What is a Root Cause? (example, changed execution)
Thread 0: a=0;
Thread 0: if(a!=0) (condition is false, so no FAIL)
Thread 1: a=-1;
What is a Root Cause? (example with a configuration check)
Thread 0: a=0; if(a!=0) FAIL;
Thread 1: if(CONFIG) a=-1;
Another possible change: CONFIG = FALSE;
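The two-thread example can be replayed deterministically to check which changes avoid the failure. The sketch below is an illustration only: it flattens the two threads into one global order of operations, and the names `ToySchedule`, `Op` and `fails` are hypothetical.

```java
import java.util.List;

public class ToySchedule {
    /** The three operations of the slide example, as one op each. */
    public enum Op { SET_ZERO, SET_MINUS_ONE_IF_CONFIG, CHECK }

    /**
     * Runs the ops in the given global order and returns true if the
     * check "if(a!=0) FAIL" fires. The initial value a = 0 is an assumption.
     */
    public static boolean fails(List<Op> schedule, boolean config) {
        int a = 0;
        for (Op op : schedule) {
            switch (op) {
                case SET_ZERO:
                    a = 0;                  // Thread 0: a=0;
                    break;
                case SET_MINUS_ONE_IF_CONFIG:
                    if (config) a = -1;     // Thread 1: if(CONFIG) a=-1;
                    break;
                case CHECK:
                    if (a != 0) return true; // Thread 0: if(a!=0) FAIL;
                    break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Op> failing = List.of(Op.SET_ZERO, Op.SET_MINUS_ONE_IF_CONFIG, Op.CHECK);
        List<Op> reordered = List.of(Op.SET_ZERO, Op.CHECK, Op.SET_MINUS_ONE_IF_CONFIG);
        System.out.println(fails(failing, true));    // true: the write lands before the check
        System.out.println(fails(reordered, true));  // false: the write lands after the check
        System.out.println(fails(failing, false));   // false: CONFIG = FALSE skips the write
    }
}
```

Both the reordering and the CONFIG change avoid the failure, which is why definition 1 alone does not single out one instruction.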
Inflection Point Hypothesis
Failure execution:
Thread 0: a=0; if(a!=0) FAIL;
Thread 1: if(CONFIG) a=-1;
Inflection Point Hypothesis – Inflection Point
Failure execution: if(CONFIG) a=0; a=-1; if(a!=0) FAIL;
Non-failure execution: if(CONFIG) a=0; if(a!=0) a=-1;
Common prefix: if(CONFIG) a=0;
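A small Java sketch of the common-prefix comparison follows. It assumes the hypothesis is read as: the inflection point is the first instruction of the failure execution that follows the longest prefix shared with any non-failure execution. Executions are modelled as lists of instruction strings, and `InflectionPoint`, `commonPrefixLength` and `inflectionIndex` are hypothetical names. This is not Kairux's search, which has to construct the non-failure executions in the first place.

```java
import java.util.List;

public class InflectionPoint {
    /** Number of leading instructions that two executions have in common. */
    public static int commonPrefixLength(List<String> x, List<String> y) {
        int n = Math.min(x.size(), y.size());
        int i = 0;
        while (i < n && x.get(i).equals(y.get(i))) i++;
        return i;
    }

    /**
     * Index (into the failure execution) of the first instruction after the
     * longest prefix shared with any non-failure execution. Returns -1 when
     * there is no non-failure execution or the whole failure execution is shared.
     */
    public static int inflectionIndex(List<String> failure, List<List<String>> nonFailures) {
        int best = -1;
        for (List<String> nf : nonFailures) {
            best = Math.max(best, commonPrefixLength(failure, nf));
        }
        return (best >= 0 && best < failure.size()) ? best : -1;
    }

    public static void main(String[] args) {
        List<String> failure = List.of("if(CONFIG)", "a=0", "a=-1", "if(a!=0)", "FAIL");
        List<List<String>> nonFailures = List.of(
            List.of("if(CONFIG)", "a=0", "if(a!=0)", "a=-1"),   // shares 2 instructions
            List.of("a=0", "if(a!=0)", "if(CONFIG)", "a=-1"));  // shares 0 instructions
        int idx = inflectionIndex(failure, nonFailures);
        System.out.println(failure.get(idx)); // a=-1
    }
}
```

Taking the maximum over all non-failure executions is what selects the "most basic" candidate: a non-failure execution that diverges earlier shares a shorter prefix and is ignored.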
Kairux – Automated Root Cause Localization
Key ideas:
• Use unit tests
Real-World Example – HDFS-10453
ReplicationMonitor thread:
void replicateBlocks() {
  for (Block b : neededReplications) {
    int numNeeded = 3 - b.nReplica;
    while (numNeeded > 0) {
      node = chooseNextRandomNode(..);
      if (b.size <= node.capacity) {
        // replicate b to this node
        numNeeded--;
      }
    }
    if (numNeeded > 0)
      throw new NotEnoughReplicasException();
  }
}
Real-World Example – HDFS-10453
ReplicationMonitor thread:
void replicateBlocks() {
  for (Block b : neededReplications) {
    int numNeeded = 3 - b.nReplica;
    while (numNeeded > 0) {
      node = chooseNextRandomNode(..);
      if (b.size <= node.capacity) {
        // replicate b to this node
        numNeeded--;
      }
    }
    if (numNeeded > 0)
      throw new NotEnoughReplicasException();
  }
}
DeleteBlock thread:
void deleteBlock(Block blk) {
  blk.size = Long.MAX_VALUE;
  neededReplications.remove(blk);
}
Real-World Example – HDFS-10453
Excerpt of the actual HDFS code shown alongside the simplified ReplicationMonitor and DeleteBlock threads:
if (storage.getStorageType() != storageType) {
  return false;
}
if (storage.getState() == State.READ_ONLY_SHARED) {
  return false;
}
DatanodeDescriptor node = storage.getDatanodeDescriptor();
// check if the node is (being) decommissioned
if (node.isDecommissionInProgress() || node.isDecommissioned()) {
  return false;
}
if (avoidStaleNodes) {
  if (node.isStale(this.staleInterval)) {
    return false;
  }
}
final long requiredSize = blockSize * HdfsConstants.MIN_BLOCKS_FOR_WRITE;
final long scheduledSize = blockSize * node.getBlocksScheduled();
if (requiredSize > node.getRemaining() - scheduledSize) {
  return false;
}
// check the communication traffic of the target machine
if (considerLoad) {
  double avgLoad = 0;
  if (stats != null) {
    int size = stats.getNumDatanodesInService();
    if (size != 0) {
      avgLoad = (double)stats.getTotalLoad()/size;
    }
  }
  if (node.getXceiverCount() > (2.0 * avgLoad)) {
    return false;
  }
}
Real-World Example – HDFS-10453
• Repeatedly reproduced
• Numerous printf debugging rounds
• One month to diagnose
HDFS Cluster: ReplicationMonitor thread, DeleteBlock thread
Real-World Example – HDFS-10453
HDFS Cluster: ReplicationMonitor thread, DeleteBlock thread
ReplicationMonitor thread:
void replicateBlocks() {
  for (Block b : neededReplications) {
    int numNeeded = 3 - b.nReplica;
    while (numNeeded > 0) {
      node = chooseNextRandomNode(..);
      if (b.size <= node.capacity) {
        numNeeded--;
      }
    }
    if (numNeeded > 0)
      throw new NotEnoughReplicasException();
  }
}
DeleteBlock thread:
void deleteBlock(Block blk) {
  blk.size = Long.MAX_VALUE;
}
Real-World Example – HDFS-10453
Unit tests:
testChangeColdRep: createFile("/foo", 3); setReplication("/foo", 5);
testRemove: createFile("/foo", 3); deleteFile("/foo");
combinedTest: createFile("/foo", 3); setReplication("/foo", 5); deleteFile("/foo");
Root Cause (marked in the DeleteBlock thread): blk.size = Long.MAX_VALUE;
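The effect of combining the two tests can be illustrated with a single-threaded Java sketch that fixes one ordering of the race: the delete lands before the capacity check. Everything here is a simplification under stated assumptions. `ReplicationRaceSketch`, `Block` and `replicate` are hypothetical names, the replication target is passed as a parameter, and a finite list of node capacities replaces the slide's `while` loop so that the sketch always terminates.

```java
import java.util.List;

public class ReplicationRaceSketch {
    /** Minimal stand-in for a block: only the fields the slide code reads. */
    public static final class Block {
        long size;
        int nReplica;

        public Block(long size, int nReplica) {
            this.size = size;
            this.nReplica = nReplica;
        }
    }

    /** Mirrors the slide's deleteBlock: the size is overwritten with Long.MAX_VALUE. */
    public static void deleteBlock(Block blk) {
        blk.size = Long.MAX_VALUE;
    }

    /**
     * Tries each candidate node once and returns true if enough replicas
     * could be placed. Returning false corresponds to the slide's
     * NotEnoughReplicasException.
     */
    public static boolean replicate(Block b, int target, List<Long> nodeCapacities) {
        int numNeeded = target - b.nReplica;
        for (long capacity : nodeCapacities) {
            if (numNeeded <= 0) break;
            if (b.size <= capacity) {
                numNeeded--; // this node can hold the block
            }
        }
        return numNeeded <= 0;
    }

    public static void main(String[] args) {
        List<Long> capacities = List.of(1024L, 1024L, 1024L);

        // Like testChangeColdRep: 3 replicas exist, 5 are wanted, no delete.
        Block kept = new Block(128L, 3);
        System.out.println(replicate(kept, 5, capacities)); // true

        // Like combinedTest, with the delete ordered before the capacity check.
        Block deleted = new Block(128L, 3);
        deleteBlock(deleted);
        System.out.println(replicate(deleted, 5, capacities)); // false
    }
}
```

Once the size is Long.MAX_VALUE no node capacity can satisfy the check, so in this sketch the only difference between the passing and failing run is the size assignment in `deleteBlock`.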
Evaluation
• Evaluated Kairux on 10 cases from JVM distributed systems
• HDFS, HBase, ZooKeeper
• One case with noisy operations generated from manual reproduction
Effectiveness of Kairux
• Successfully finds inflection points for 7 of the 10 cases
• 3 unsuccessful cases:
  - root cause location cannot be reached by modifying unit tests
• Reduces the # of instructions to diagnose

Metric                        Avg.
# instr. in dynamic slice     309
# instr. in longest prefix    165
# tests combined              1.43
Related Work
• Statistical debugging
• Delta Debugging
• Hybrid approaches
  • Triage: deterministic replay + statistical debugging
  • Failure Sketching: hardware (Intel-PT) + statistical debugging
Conclusion
• Inflection Point Hypothesis: enables a principled search for the root cause
• Kairux: transforms the inflection point hypothesis into a practical tool
• Kairux is effective in locating the root cause in real distributed system failures
Thanks!
Refined Definition of Root Cause
Caveats
Identifying root cause is subjective.
Thread 0: a=0; if(a!=0) FAIL;
Thread 1: if(CONFIG) a=-1;