Delta Debugging
Advanced Software Tools Seminar
Levin Stella, May 2005

Andreas Zeller
- Professor at Universität des Saarlandes in Saarbrücken, Germany
- Research in software engineering: construction and evolution of large, complex software systems at reasonable cost and with high reliability
- Analysis of these systems: why they fail to work as they should
- Future work: automated debugging, self-healing programs, software evolution, experimental program analysis

Contents
- Delta Debugging: find failure-inducing
  - Input
  - Program state
  - Thread schedules
  - Code changes
- Others
  - Don’t Program on Fridays
  - Praktomat
The presentation uses slides and materials from the home page of Andreas Zeller.

Causes and Effects
- A cause is an event preceding another event, without which the event in question (the effect) would not have occurred.
- A defect is propagated and causes the failure if the failure would not have occurred without the defect.
- Not every defect causes a failure.
- The process tests for a specific failure.
- Debugging = the search for a defect that causes the failure.

Find Failure-Inducing Input

    double mult(double z[], int n)
    {
        int i, j;

        i = 0;
        for (j = 0; j < n; j++) {
            i = i + j + 1;
            z[i] = z[i] * (z[0] + 1.0);
        }
        return z[n];
    }

Figure 1: The fail.c program that crashes GCC

    $ gcc -O fail.c
    gcc: Internal compiler error: program cc1 got fatal signal 11
    $ _

Binary Search
 #   GCC input                                                   Test
 1   double mult(...) { int i, j; i = 0; for (...) { ... }
 2   /* empty file */
 3   double mult(...) { }
 4   double mult(...) { int i, j; i = 0; }
 5   double mult(...) { int i, j; i = 0; for (...) { ... } }
 6   double mult(...) { int i, j; i = 0; for (...) { } }
 ...
 18  ... z[i] = z[i] * (z[0] + 1.0); ...
 19  ... z[i] = z[i] * (z[0]); ...
 20  ... z[i] = z[i] * (z[0] 1.0); ...                           ?
 21  ... z[i] = z[i] * (z[0] + ); ...                            ?
Table 1: Isolating Failure-Inducing Input

Definitions
- The difference δ between the passed and failed tests can be decomposed into atomic differences: δ = δ1 • δ2 • δ3 • … • δn
- A composition of m < n differences, C = δ_{k1} • δ_{k2} • … • δ_{km}, is called a test case; Cp is the passed test case, Cf the failed test case
- A failing test case C is minimal if removing any single δi would cause the failure to disappear

Simplification
Basic idea: take the difference between the passed test Cp and the failed test Cf and split it until a minimal difference remains.
Properties:
- Makes each part of the test case relevant
- Removing any part makes the failure go away
Problem: it is expensive to simplify the entire input.
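The simplification idea can be sketched in a few lines of Python. This is only a minimal sketch in the spirit of the ddmin algorithm from the Zeller and Hildebrandt paper listed in the bibliography; the names `ddmin` and `fails_on_select`, the string outcomes "FAIL" and "PASS", and the toy failure condition (the input contains `<SELECT`) are assumptions made for illustration, not the seminar's code.

```python
def ddmin(failing, test):
    """Shrink the list `failing` until removing any single element stops the failure."""
    n = 2  # granularity: how many chunks the input is split into
    while len(failing) >= 2:
        chunk = max(len(failing) // n, 1)
        subsets = [failing[i:i + chunk] for i in range(0, len(failing), chunk)]
        reduced = False
        # First try each chunk on its own.
        for subset in subsets:
            if test(subset) == "FAIL":
                failing, n, reduced = subset, 2, True
                break
        # Then try the input with one chunk removed.
        if not reduced:
            for i in range(len(subsets)):
                complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                if test(complement) == "FAIL":
                    failing, n, reduced = complement, max(n - 1, 2), True
                    break
        if not reduced:
            if n >= len(failing):
                break  # chunks were single elements: nothing more can be removed
            n = min(2 * n, len(failing))  # retry with smaller chunks
    return failing


def fails_on_select(chars):
    """Hypothetical test: the program 'crashes' whenever the input contains <SELECT."""
    return "FAIL" if "<SELECT" in "".join(chars) else "PASS"


html = '<SELECT NAME="priority" MULTIPLE SIZE=7>'
print("".join(ddmin(list(html), fails_on_select)))  # <SELECT
```

Every remaining character is needed for the failure, which is the "each part is relevant" property; the price is that many tests are spent on parts that turn out to be irrelevant.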

Simplifying vs. Isolating
Alternative approach: we do not simplify the entire input; instead, we narrow the difference between the current passed and failed tests.

Isolation
Basic idea: update the current passed test Cp and failed test Cf at each run, and split the difference between them until a minimal difference remains.
Properties:
- Finds one relevant part of the test case
- Removing this particular part makes the failure go away

Isolating Algorithm: DDBIN
1. Initialize: Cp, Cf, D = Cf − Cp = δ
2. If |D| = 1 then STOP
3. Split D into 2 disjoint subsets ∆1, ∆2 with |∆1| ≈ |∆2|, and test each Cp + ∆i
4. If Cp + ∆1 passes then Cp = Cp + ∆1, D = ∆2
5. Else if Cp + ∆1 fails then Cf = Cp + ∆1, D = ∆1
6. Else if Cp + ∆2 passes then Cp = Cp + ∆2, D = ∆1
7. Else if Cp + ∆2 fails then Cf = Cp + ∆2, D = ∆2
8. Continue with step 2
Cp [ ∆1 | ∆2 ] Cf
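The steps above can be sketched as follows. This is a hedged illustration, not the original implementation: configurations are modeled as plain lists of deltas, and the function name `ddbin`, the outcome strings, and the toy test `fails_with_delta_6` are assumptions.

```python
def ddbin(c_pass, c_fail, test):
    """Binary isolation: narrow the difference between c_pass and c_fail to one delta."""
    c_pass, c_fail = list(c_pass), list(c_fail)
    diff = [d for d in c_fail if d not in c_pass]
    while len(diff) > 1:
        half = len(diff) // 2
        d1, d2 = diff[:half], diff[half:]
        # Try Cp + d1 first, then Cp + d2.
        for part, rest in ((d1, d2), (d2, d1)):
            candidate = c_pass + part
            outcome = test(candidate)
            if outcome == "PASS":
                c_pass, diff = candidate, rest  # failure cause lies in the rest
                break
            if outcome == "FAIL":
                c_fail, diff = candidate, part  # failure cause lies in this part
                break
        else:
            break  # both halves unresolved: binary search cannot continue
    return c_pass, c_fail, diff


def fails_with_delta_6(config):
    """Hypothetical test: the failure occurs whenever delta 6 is applied."""
    return "FAIL" if 6 in config else "PASS"


print(ddbin([], [1, 2, 3, 4, 5, 6, 7, 8], fails_with_delta_6))
# ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6], [6])
```

Unlike simplification, the result is a passing and a failing configuration that differ in a single delta; the other deltas in Cp are not claimed to be relevant.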

Mozilla Crash
What is relevant here?

Simplified Input
- Required 12 tests only
- 896 lines → 1 line: <SELECT NAME="priority" MULTIPLE SIZE=7>

Isolated Input
 #   Mozilla input                                 Test
 1   <SELECT NAME="priority" MULTIPLE SIZE=7>
 4   <SELECT NA ty" MULTIPLE SIZE=7>
 7   SELECT NA
 6   ELECT NA
 5   CT NA
 3
 2   ty" MULTIPLE SIZE=7>

Simplify vs. Isolate
- Isolation: 7 steps
- Simplification: 26 steps
- Error: <SELECT>

Unresolved Test Outcomes
Problem: what should we do if both test cases are unresolved?

Change Granularity
                           | Increase granularity of test case | Decrease granularity of test case
Failure of the test case   | More chances                      | Less chances
Progress of the search     | Slower, reduced by amount < ½     | Faster, reduced by amount > ½

General DD Algorithm
Basic idea:
1. Start with few and large changes first: ∆1 ∆2
2. If all alternatives are unresolved, apply more and smaller changes: ∆1 ∆2 ∆3 ∆4 ∆5 ∆6 ∆7 ∆8

General DD Algorithm
1. Initialize k = 2, Cp, Cf, D = Cf − Cp = δ
2. Split D into k disjoint subsets ∆1 … ∆k with |∆i| ≈ |D|/k
3. If some Cp + ∆i fails, then k = 2, Cf = Cp + ∆i, D = ∆i; go to step 2
4. Else if some Cf − ∆i passes, then k = 2, Cp = Cf − ∆i, D = ∆i; go to step 2
5. Else if some Cp + ∆i passes, then k = max(k − 1, 2), Cp = Cp + ∆i, D = D \ ∆i; go to step 2
6. Else if some Cf − ∆i fails, then k = max(k − 1, 2), Cf = Cf − ∆i, D = D \ ∆i; go to step 2
7. Else if |D| > k, then k = min(2k, |D|); go to step 2
8. Else STOP
Cp => ∆1 ∆2 ∆3 ∆4 ∆5 ∆6 ∆7 ∆8 <= Cf
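A minimal Python sketch of the general algorithm may help in following the steps. It assumes that configurations are sets of sortable deltas and that `test` returns "PASS", "FAIL", or "UNRESOLVED"; the names `dd`, `run`, `first`, and `needs_3_and_6` are hypothetical, and the result cache is an addition for convenience (Python 3.8+ for the `:=` operator).

```python
def dd(c_pass, c_fail, test):
    """Return (c_pass, c_fail) whose remaining difference cannot be narrowed further."""
    c_pass, c_fail = frozenset(c_pass), frozenset(c_fail)
    cache = {}

    def run(config):
        # Remember outcomes so that a configuration is never tested twice.
        if config not in cache:
            cache[config] = test(config)
        return cache[config]

    k = 2
    while len(c_fail - c_pass) > 1:
        diff = sorted(c_fail - c_pass)
        k = min(k, len(diff))
        subsets = [frozenset(diff[i * len(diff) // k:(i + 1) * len(diff) // k])
                   for i in range(k)]

        def first(make, wanted):
            # First configuration built from a subset that yields the wanted outcome.
            return next((make(s) for s in subsets if run(make(s)) == wanted), None)

        if (c := first(lambda s: c_pass | s, "FAIL")) is not None:
            c_fail, k = c, 2                  # step 3
        elif (c := first(lambda s: c_fail - s, "PASS")) is not None:
            c_pass, k = c, 2                  # step 4
        elif (c := first(lambda s: c_pass | s, "PASS")) is not None:
            c_pass, k = c, max(k - 1, 2)      # step 5
        elif (c := first(lambda s: c_fail - s, "FAIL")) is not None:
            c_fail, k = c, max(k - 1, 2)      # step 6
        elif k < len(diff):
            k = min(2 * k, len(diff))         # step 7: finer granularity
        else:
            break                             # step 8
    return c_pass, c_fail


def needs_3_and_6(config):
    """Hypothetical test: the failure needs deltas 3 and 6 together."""
    return "FAIL" if {3, 6} <= config else "PASS"


c_pass, c_fail = dd(set(), set(range(1, 9)), needs_3_and_6)
print(sorted(c_fail - c_pass))  # [3]
```

In this toy run, delta 6 ends up inside the passing configuration, so the isolated difference is delta 3 alone: isolation reports one difference that flips the outcome, not every delta involved in the failure.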

Properties of DD
- C is minimal if removing any single δi would cause the failure to disappear
- The output δ = Cf − Cp is minimal
- Number of tests in the worst case (Cf is the input): |Cf|² + 3|Cf|
- Number of tests in the best case (no unresolved outcomes): ≤ log2(|Cf|)
- Size of the output δ = 1 in the best case
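To get a feel for these bounds, the short calculation below plugs in the 896-line Mozilla input from the earlier slide. Pairing that number with the two formulas is only an illustration, and the function names are hypothetical.

```python
import math


def worst_case_tests(n):
    """Upper bound from the slide: n^2 + 3n tests for a difference of n deltas."""
    return n * n + 3 * n


def best_case_tests(n):
    """Bound from the slide when no test is unresolved: log2(n), rounded up."""
    return math.ceil(math.log2(n))


n = 896
print(worst_case_tests(n))  # 805504
print(best_case_tests(n))   # 10
```

The gap between the two numbers is why unresolved outcomes matter so much in practice: each one pushes the search away from the logarithmic case.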

Applications
- Isolating Failure-Inducing Input
- Isolating Failure-Inducing Program State
- Isolating Failure-Inducing Thread Schedules
- Isolating Failure-Inducing Code Changes

Failure-Inducing Program State
- Program state: variables and their values
- Program execution: a sequence of program states
- Basic idea: examine and alter the program state using a debugger (GDB) so that the failure doesn’t occur
- Comparable states: the program counter and call stack are the same

GCC Example

    double mult(double z[], int n)
    {
        int i, j;

        i = 0;
        for (j = 0; j < n; j++) {
            i = i + j + 1;
            z[i] = z[i] * (z[0] + 1.0);
        }
        return z[n];
    }

Figure 1: The fail.c program that crashes GCC

    $ gcc -O fail.c
    gcc: Internal compiler error: program cc1 got fatal signal 11
    $ _

"+ 1.0" is the failure cause.

Small Cause - Big Effect
Problem: differences accumulate during execution.
How do we isolate the relevant state differences?

GCC states
 #   reg_rtx_no   cur_insn_uid   last_linenum   first_loop_store_insn   test
 1   32           74             15             0x81fc4e4
 2   31           70             14             0x81fc4a0
 3   32           74             14
 4   32           74             14
 5   32           74             15
Problem: dealing with pointers (?, crash)

Memory Graph
- Vertex: variable value
- Edge:
  - Pointer dereferencing
  - Struct member access
  - Array element access

Differences in variable sets
Problem: the passed and failed runs have different sets of variables
- Determine the common subgraph (approximation)
- Any vertex that is not in the common subgraph has to be inserted or deleted
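As a rough illustration of the insert/delete idea, the sketch below flattens each program state into a dictionary from variable name to value. This is a strong simplification and an assumption: the approach on the slide matches memory graphs, not flat dictionaries, and the function name `state_deltas`, the operation names, and the sample values are hypothetical.

```python
def state_deltas(passing, failing):
    """List atomic operations that turn the passing state into the failing state."""
    deltas = []
    for name in sorted(passing.keys() | failing.keys()):
        if name not in failing:
            deltas.append(("delete", name, None))           # only in the passing run
        elif name not in passing:
            deltas.append(("insert", name, failing[name]))  # only in the failing run
        elif passing[name] != failing[name]:
            deltas.append(("set", name, failing[name]))     # common, but value differs
    return deltas


# Hypothetical states; the variable names are borrowed from the GCC example.
passing = {"reg_rtx_no": 31, "cur_insn_uid": 70, "last_linenum": 14}
failing = {"reg_rtx_no": 32, "cur_insn_uid": 70, "last_linenum": 15, "link": "link"}
for delta in state_deltas(passing, failing):
    print(delta)
```

Each tuple is one atomic difference that Delta Debugging could apply or leave out when building a mixed state.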

GCC States
- The difference was 871 nodes (variables)
- DD altered 436 variables and crashed
- Increase granularity: 218 … crashed …
- 44 tests reduce the difference to one vertex

GCC States before failure
- 1224 nodes (= variables) are different
- Every second test fails, so the difference is reduced quickly
- 15 tests reduce the difference to one command: set variable link→fld[0].rtx = link
⇒ Cycle in the RTL tree
⇒ Endless recursion

Automatic Report
1. Execution reaches main
   Since the program was invoked as “cc1 -O fail.i”, variable argv[2] is now “fail.i”.
2. Execution reaches combine_instructions
   Since argv[2] was “fail.i”, variable *first_loop_store_insn→fld[1].rtx→fld[3].rtx→fld[1].rtx is now (new rtx_def).
3. Execution reaches if_then_else_cond
   Since *first_loop_store_insn→fld[1].rtx→fld[3].rtx→fld[1].rtx was (new rtx_def), variable link→fld[0].rtx is now link.
4. Execution ends
   Since variable link→fld[0].rtx was link, the program now terminates with a SIGSEGV signal.
5. The program fails
In total: 59 tests; GCC runtime 6 sec, GDB runtime 90 min

Challenges
1. How do we capture C program state accurately?
   Does p point to something, and if so, to how many of them?
   - Today: query memory allocation routines + heuristics
   - Future: use program analysis, greater program state
2. How do we determine where to check state?
   Why focus on, say, combine_instructions?
   - Today: start with the stack of the failing run
   - Future: focus on anomalies + transitions; user interaction
3. How do we know a failure is the failure?
   Can’t our changes just induce new failures?
   - Today: the outcome is “original” only if the stacks match
   - Future: also match output, time, code coverage
4. And finally: when does this actually work?

www.askigor.org
AskIgor is Open Source

Failure-Inducing Thread Schedules
- For a fixed input, the application may or may not fail
- The reason: non-determinism in the thread schedule
- Such bugs are hard to reproduce and hard to isolate!

DejaVu
- DEterministic JAVa Replay Utility
- Part of the Jalapeño research JVM (IBM)
- Diagram: the application's input and schedule are recorded or replayed in Recording Mode, Replaying Mode, and Mixed Mode

How does record-replay work
1. Synchronization events
   ⇒ Replay the entire JVM state with the lock state of each thread + the queue of threads
2. Timed events
   ⇒ Record wall-clock time (sleep, timed wait)
3. Time slot interrupts
   ⇒ Record the counter of yield points, where threads may be preempted

Generate altered schedules
- Schedule T = <t1, t2, …, tn>
- DejaVu generator
- Fuzzed schedule T' = <f(t1), …, f(tn)>
- f(t) follows a Gaussian distribution
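Fuzzing a schedule in this way could look like the sketch below. It assumes that a schedule is a list of non-negative integer yield-point counts and that f(t) is centered on the recorded value t; the function name `fuzz_schedule`, the `sigma` and `seed` parameters, and the sample numbers are hypothetical.

```python
import random


def fuzz_schedule(schedule, sigma=5.0, seed=None):
    """Return a copy of the schedule with Gaussian noise added to each entry."""
    rng = random.Random(seed)  # a seed makes the altered schedule reproducible
    # Round to whole yield points and never go below zero.
    return [max(0, round(t + rng.gauss(0.0, sigma))) for t in schedule]


recorded = [254, 1200, 87]
print(fuzz_schedule(recorded, sigma=5.0, seed=1))
```

Replaying many such altered schedules is one way to obtain both a passing and a failing run for the same input.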

Schedule Differences

Isolate Schedule Differences
- A failing difference ∆ is minimal if removing any single atomic δi (±1) would cause the failure to disappear
- ∆1 = |278 − 254|; t_{1,1} = t1 + δ_{1,1} = 255; t_{1,2} = t_{1,1} + δ_{1,2} = 256; …
- Delta Debugging isolates the differences ∆ = Cf − Cp = {δ_{1,1}, δ_{1,2}, …, δ_{n,k}}
- DejaVu replays the generated schedule
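The decomposition into ±1 steps can be sketched as follows, assuming two schedules of equal length with integer entries. The function names and the tuple layout (entry index, step number, ±1) are hypothetical, and materializing every delta like this is only practical for small differences.

```python
def atomic_deltas(passing, failing):
    """Split the difference between two schedules into atomic steps of +1 or -1."""
    deltas = []
    for i, (p, f) in enumerate(zip(passing, failing)):
        step = 1 if f > p else -1
        # One delta per unit of distance between the two entries.
        deltas.extend((i, j, step) for j in range(1, abs(f - p) + 1))
    return deltas


def apply_deltas(passing, deltas):
    """Apply a subset of atomic deltas to the passing schedule."""
    schedule = list(passing)
    for i, _, step in deltas:
        schedule[i] += step
    return schedule


passing, failing = [254, 100], [278, 97]
deltas = atomic_deltas(passing, failing)
print(len(deltas))                        # 27
print(apply_deltas(passing, deltas[:2]))  # [256, 100]
```

Applying all deltas reproduces the failing schedule, and applying any subset gives a mixed schedule that can be replayed and tested.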

A Real Program
Two schedules with 3,842,577,240 atomic differences

A Real Program
Delta Debugging isolates a single difference after 50 tests.

Yesterday, my Program Worked
DD applied to code changes:
- Assumption: the failure is caused by some of the changes made between “yesterday” and “today”
- DD isolates the failure-inducing change

Eclipse Plug-Ins
These plug-ins are integrated with JUnit tests and automatically determine the cause of a failed test.
- DDinput: Failure-Inducing Input (available)
  The program fails when the input contains <SELECT>
- DDstate: Failure-Inducing States (later, fall 2005)
  First, argc was 3; therefore, a[2] became 0, and thus the output contained "0", and that's why the program failed
- DDchange: Failure-Inducing Changes (summer 2005)
  The change in Line 45 makes the program fail
Eclipse Innovation Grant 2003, 2004, 2005

Don’t Program on Fridays

Praktomat
1. Submit program (automatic compilation and testing)
2. Review another program
3. Receive own reviews

Praktomat
- Personalized assignment:
  Use the ifdef(V1, Quicksort, Mergesort) algorithm to sort the entries from the ifdef(V2, lowest, greatest) to the ifdef(V2, greatest, lowest) value.
- Maximum distance: Praktomat chooses the assignment with maximum distance from the reviewer’s assignment (not below a minimum distance)
- Enormous success from the students' point of view:
  - 63.5% effectiveness of receiving reviews
  - 61.5% effectiveness of making reviews
  - 57.7% effectiveness of automatic testing
- Programming style improved much faster than in earlier courses

Bibliography
- http://www.st.cs.uni-sb.de/~zeller/
- A. Zeller, Isolating Cause-Effect Chains from Computer Programs, ACM FSE 2002
- A. Zeller and R. Hildebrandt, Simplifying and Isolating Failure-Inducing Input, IEEE 2.02
- J.-D. Choi and A. Zeller, Isolating Failure-Inducing Thread Schedules, ACM ISSTA 2002
- J. Sliwerski, T. Zimmermann, and A. Zeller, When Do Changes Induce Fixes?, MSR 2005
- A. Zeller, Making Students Read and Review Code, ACM 2002
- Delta Debugging web site, http://www.st.cs.uni-sb.de/dd/
- AskIgor web site, http://www.askigor.org/