Mitigating the Effects of Flaky Tests on Mutation Testing

- Slides: 21
Mitigating the Effects of Flaky Tests on Mutation Testing
August Shi, Jonathan Bell, Darko Marinov
ISSTA 2019, Beijing, China, 7/18/2019
CCF-1421503, CNS-1646305, CNS-1740916, CCF-1763788, CCF-1763822, OAC-1839010

Mutation Testing
[Figure: the test suite (test 1, test 2, test 3) runs against the code under test with Mutant 1 and with Mutant 2 applied. Mut 1 is Killed and Mut 2 Survived. A mutant-test matrix records Killed or Survived per test and mutant; test 3 is Killed on Mut 1 and Survived on Mut 2.]
• Compare test suites by mutation score
• Guide testing based on mutant-test matrix

Mutation Testing with Flaky Tests
[Figure: the same setup with flaky tests, executed in Run 1 and Run 2. Every result is now in question: Mut 1 "Killed?", Mut 2 "Survived?", and each cell of the mutant-test matrix "Killed?" or "Survived?".]
• Get test suite with deterministic outcomes
  • Debug/fix flaky tests [1]
  • Remove/ignore flaky tests
[1] August Shi et al. "iFixFlakies: A Framework for Automatically Fixing Order-Dependent Tests". ESEC/FSE 2019

Flaky Coverage Example
    public class WatchDog {
      ...
      public void run() {
        ...
        synchronized (this) {
          long timeLeft = timeout - (System.currentTimeMillis() - startTime);
          isWaiting = timeLeft > 0;
          while (isWaiting) {
            ...
            wait(timeLeft);
            ...
          }}
        ...
      }}

    public void test() { new WatchDog().run(); ... }

Variable/Call          Value (Run 1)   Value (Run 2)
timeout                5000 (both runs)
startTime              300000          500000
currentTimeMillis()    300300          510000
timeLeft               4700            -5000
isWaiting              true            false
TEST OUTCOME           PASS (both runs)

• Other reasons for flakiness:
  • Concurrency
  • Randomness
  • I/O
  • Order dependency

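The two columns of the table can be replayed with a small deterministic sketch. The class `TimeLeftDemo` and its method names are hypothetical, and the clock reading is passed in as a parameter instead of being read from `System.currentTimeMillis()`, so both runs are repeatable.

```java
/** Hypothetical sketch: replays both runs from the table with fixed clock readings. */
final class TimeLeftDemo {

    /** Same arithmetic as the timeLeft assignment in WatchDog.run(). */
    static long timeLeft(long timeout, long startTime, long now) {
        return timeout - (now - startTime);
    }

    /** The while loop, and the wait(timeLeft) call inside it, runs only if this is true. */
    static boolean isWaiting(long timeout, long startTime, long now) {
        return timeLeft(timeout, startTime, now) > 0;
    }

    public static void main(String[] args) {
        // Run 1: 300 ms elapsed since startTime, so the loop body is covered.
        System.out.println(timeLeft(5000, 300000, 300300) + " " + isWaiting(5000, 300000, 300300));
        // Run 2: 10000 ms elapsed, so the loop body is skipped.
        System.out.println(timeLeft(5000, 500000, 510000) + " " + isWaiting(5000, 500000, 510000));
    }
}
```

The test passes in both runs, yet only the first run executes the loop body, which is what makes the coverage flaky while the outcome stays stable.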
Motivating Study
• Measure flakiness of coverage
• 30 open-source GitHub projects from prior work
  • No flaky test outcomes! (all 35,850 tests pass in 17 runs)
• Rerun tests and measure differences in coverage
  • 113,356 (22%) statements with different tests covering across runs
  • 5,736 (16%) tests cover different statements across runs
• Lots of flakiness in coverage, even without flaky outcomes!

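One way to compute the two measurements above is sketched below. It assumes each run is available as a map from test name to the set of statement identifiers it covered; that representation, the class `CoverageFlakiness`, and its method names are hypothetical and not taken from the study's tooling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: finds tests and statements whose coverage differs across runs. */
final class CoverageFlakiness {

    /** Keys whose value set in some run differs from the first run. */
    static Set<String> keysWithDifferences(List<Map<String, Set<String>>> runs) {
        Set<String> differing = new TreeSet<>();
        if (runs.isEmpty()) {
            return differing;
        }
        Map<String, Set<String>> first = runs.get(0);
        for (Map<String, Set<String>> run : runs) {
            Set<String> keys = new HashSet<>(first.keySet());
            keys.addAll(run.keySet());
            for (String key : keys) {
                Set<String> a = first.getOrDefault(key, Collections.emptySet());
                Set<String> b = run.getOrDefault(key, Collections.emptySet());
                if (!a.equals(b)) {
                    differing.add(key);
                }
            }
        }
        return differing;
    }

    /** Turns test -> statements into statement -> tests. */
    static Map<String, Set<String>> invert(Map<String, Set<String>> run) {
        Map<String, Set<String>> out = new HashMap<>();
        run.forEach((test, statements) -> {
            for (String s : statements) {
                out.computeIfAbsent(s, k -> new HashSet<>()).add(test);
            }
        });
        return out;
    }

    /** Tests that cover different statements across runs. */
    static Set<String> flakyTests(List<Map<String, Set<String>>> runs) {
        return keysWithDifferences(runs);
    }

    /** Statements covered by different tests across runs. */
    static Set<String> flakyStatements(List<Map<String, Set<String>>> runs) {
        List<Map<String, Set<String>>> inverted = new ArrayList<>();
        for (Map<String, Set<String>> run : runs) {
            inverted.add(invert(run));
        }
        return keysWithDifferences(inverted);
    }
}
```

Comparing every run against the first is enough here: if all runs match the first, they all match each other.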
Mutation Testing with Flaky Coverage
    public class WatchDog {
      ...
      public void run() {
        ...
        synchronized (this) {
          long timeLeft = timeout - (System.currentTimeMillis() - startTime);
          isWaiting = timeLeft > 0;
          while (isWaiting) {
            ...
            wait(timeLeft);
            ...
          }}
        ...
      }}

    public void test() { new WatchDog().run(); ... }

Mutation: delete call

Variable/Call          Value (Run 1)   Value (Mut Run)
timeout                5000 (both runs)
startTime              300000          500000
currentTimeMillis()    300300          510000
timeLeft               4700            -5000
isWaiting              true            false

In the mutant run isWaiting is false, so the loop body never executes: Mutation not covered!

Mutation Testing Results are Unreliable
• Flakiness can shift mutation testing results
  • Mutation scores may be inflated/deflated
  • Mutant-test matrix unreliable
• Need to mitigate the effects of flakiness on mutation testing!
• Mitigation strategies based on reruns and isolation [2]
• Implemented on PIT, a popular mutation testing tool for Java
https://doi.org/10.6084/m9.figshare.8226332
https://github.com/hcoles/pitest/pull/534
https://github.com/hcoles/pitest/pull/545
[2] Jonathan Bell et al. "DeFlaker: Automatically Detecting Flaky Tests". ICSE 2018

Mitigating Flakiness in Mutation Testing
Traditional mutation testing               Improvements to cope with flakiness
Full test-suite coverage collection        Rerun and isolate tests
  (produces: mutants to test)
Test-mutant prioritization                 Run tests with least flaky coverage first (see paper)
  (produces: sorted tests per mutant)
Mutant execution                           Track mutations covered; rerun/isolate tests

Coverage Collection
                        All tests in same JVM   Each test in own JVM
Run Once                Default                 Isolation
Rerun Multiple Times    Default-Reruns          Isolation-Reruns
• When running multiple times, union coverage
  • More lines covered means more mutants generated
• Run tests in isolation to remove test-order dependencies

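Taking the union of coverage over reruns can be sketched as follows. The per-run representation (test name mapped to covered line numbers) and the class `CoverageUnion` are assumptions made for illustration; PIT's internal coverage data structures are not shown here.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: merges per-test line coverage from several reruns. */
final class CoverageUnion {

    /** A line counts as covered by a test if any rerun saw that test cover it. */
    static Map<String, Set<Integer>> union(List<Map<String, Set<Integer>>> runs) {
        Map<String, Set<Integer>> merged = new TreeMap<>();
        for (Map<String, Set<Integer>> run : runs) {
            run.forEach((test, lines) ->
                    merged.computeIfAbsent(test, k -> new TreeSet<>()).addAll(lines));
        }
        return merged;
    }
}
```

Because the union can only grow with more reruns, it never drops a line that some run covered, which is why more reruns can lead to more mutants being generated.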
Executing Tests on Mutants
• Monitor if tests actually execute mutated bytecode
• Traditionally, mutant-test pair has status Killed or Survived
  • Only applicable if test executes the mutated bytecode
• Mutant-test pair with test that does not execute mutated bytecode has new status Unknown
  • Test can potentially cover mutation, based on prior coverage
[Figure: mutant-test matrix for Mut 1 and Mut 2 against test 1, test 2, and test 3, with some cells now Unknown instead of Survived.]

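The three pair statuses described above can be written as a small classification function. This is a minimal sketch; the class `PairClassifier` and the two boolean inputs are hypothetical names, and how the execution of mutated bytecode is actually monitored is not shown.

```java
/** Hypothetical sketch: assigns a status to one mutant-test pair. */
final class PairClassifier {

    enum PairStatus { KILLED, SURVIVED, UNKNOWN }

    /**
     * Killed and Survived only apply when the test executed the mutated bytecode;
     * otherwise the outcome says nothing about the mutant and the pair is Unknown.
     */
    static PairStatus classify(boolean executedMutatedBytecode, boolean testFailed) {
        if (!executedMutatedBytecode) {
            return PairStatus.UNKNOWN;
        }
        return testFailed ? PairStatus.KILLED : PairStatus.SURVIVED;
    }
}
```

Note that a failing test that never reached the mutation is still Unknown under this rule, since the failure cannot be attributed to the mutant.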
New Status for Mutants
• Overall mutant status depends on status of all mutant-test pairs run for the mutant
  • Killed + Covered
  • Survived + Covered
  • Unknown (not covered)
• Need to reduce number of Unknown mutants and pairs

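The slide does not spell out how pair statuses combine into one mutant status, so the sketch below uses an assumed rule: a mutant is Killed if any pair is Killed, Survived if no pair is Killed but at least one covered pair Survived, and Unknown if no pair covered the mutation. The paper may define this differently; the class `MutantStatusAggregator` is hypothetical.

```java
import java.util.List;

/** Hypothetical sketch: derives an overall mutant status from its pair statuses. */
final class MutantStatusAggregator {

    enum Status { KILLED, SURVIVED, UNKNOWN }

    /** Assumed rule: any Killed wins, then any Survived, otherwise Unknown. */
    static Status overall(List<Status> pairStatuses) {
        boolean anySurvived = false;
        for (Status s : pairStatuses) {
            if (s == Status.KILLED) {
                return Status.KILLED;
            }
            if (s == Status.SURVIVED) {
                anySurvived = true;
            }
        }
        return anySurvived ? Status.SURVIVED : Status.UNKNOWN;
    }
}
```

A stricter alternative would report Survived only when every pair is Survived; which rule is appropriate depends on how conservatively Unknown pairs should be treated.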
Rerunning Mutant-test Pairs
• While status of mutant-test pair is Unknown, rerun
• Change isolation level during reruns
Isolation level   Mutant-test pairs
Default           Mutant-test pairs for mutants in same class in same JVM
More Isolation    Mutant-test pairs for same mutant in same JVM
Most Isolation    Mutant-test pairs in own JVM
Cost increases from Default to Most Isolation.

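The rerun loop can be sketched as an escalation over the three isolation levels. The schedule used here (a fixed number of reruns per level before moving to the next, costlier level) is an assumption, as are the class `IsolationReruns` and the `PairRunner` callback that stands in for actually launching a test against a mutant.

```java
/** Hypothetical sketch: reruns an Unknown pair with increasing isolation. */
final class IsolationReruns {

    enum Isolation { DEFAULT, MORE, MOST }

    enum Status { KILLED, SURVIVED, UNKNOWN }

    /** Stand-in for executing one mutant-test pair at a given isolation level. */
    interface PairRunner {
        Status run(Isolation level, int attempt);
    }

    /**
     * Tries the cheapest level first and escalates only while the pair stays Unknown.
     * Stops at the first Killed or Survived result.
     */
    static Status resolve(PairRunner runner, int rerunsPerLevel) {
        for (Isolation level : Isolation.values()) {
            for (int attempt = 1; attempt <= rerunsPerLevel; attempt++) {
                Status status = runner.run(level, attempt);
                if (status != Status.UNKNOWN) {
                    return status;
                }
            }
        }
        return Status.UNKNOWN;
    }
}
```

Ordering the levels from cheapest to most expensive means the cost of a separate JVM per pair is paid only for pairs that the cheaper levels could not resolve.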
Experimental Setup
• Evaluate on same 30 projects in motivating study
• All modifications on top of PIT mutation testing tool
• RQ1: Flakiness in traditional mutation testing?
• RQ2: Effect of coverage on mutants generated? (See paper)
• RQ3: Effect of re-executing tests on mutant status?
• RQ4: Prioritize tests for mutant-test executions? (See paper)

RQ1: Flakiness in Traditional Mutation Testing
Mutants by Status
          Killed   Survived   Unknown   Total    Mut. Score
Overall   51,687   11,965     2,866     66,518   77.7%-82.0%
• Max difference up to 23 pp!
• Must improve mutation scores more than this variance!
Mutant-Test Pairs by Status
          Killed      Survived    Unknown   Total
Overall   1,569,658   1,097,506   255,194   2,922,358
• 9% of mutant-test pairs are Unknown (max up to 55%)!
• Matrix results can be unreliable

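The reported score range is consistent with treating every Unknown mutant as Survived for the lower bound and as Killed for the upper bound. That reading is an assumption checked against the counts above, not something the slide states; the class `MutationScoreRange` is hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch: bounds the mutation score when some mutants are Unknown. */
final class MutationScoreRange {

    /** Lower bound in percent: every Unknown mutant is counted as Survived. */
    static double lowerBound(long killed, long survived, long unknown) {
        return 100.0 * killed / (killed + survived + unknown);
    }

    /** Upper bound in percent: every Unknown mutant is counted as Killed. */
    static double upperBound(long killed, long survived, long unknown) {
        return 100.0 * (killed + unknown) / (killed + survived + unknown);
    }

    public static void main(String[] args) {
        long killed = 51_687, survived = 11_965, unknown = 2_866;
        // prints 77.7%-82.0%
        System.out.println(String.format(Locale.ROOT, "%.1f%%-%.1f%%",
                lowerBound(killed, survived, unknown),
                upperBound(killed, survived, unknown)));
    }
}
```

The width of the range is exactly the share of Unknown mutants (2,866 of 66,518, about 4.3 percentage points), so reducing Unknown mutants directly tightens the score.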
RQ3: Mutant Re-execution Results
                            Before    After    Reduction
Unknown Mutants             2,866     591      2,275 (79.4%)
Unknown Mutant-Test Pairs   255,194   30,321   224,873 (88.1%)

Add. Covered Pairs (Overall), by rerun
                          1        2        3        4        5
Default Reruns            61,437   41,302   18,762   14,787   6,590
More Isolation Reruns     46,819   14,072   3,872    1,000    629
Most Isolation Reruns     15,594   5        4        0        0
• Increasing isolation greatly increases covered pairs
• Unnecessary to rerun too often with the most isolation

Discussion
• Flakiness can have negative effects beyond mutation testing
• Tools/studies that rely on coverage must consider flakiness
  • Fault localization, program repair, test prioritization, test-suite reduction, test selection, test generation, runtime verification, …
• Mitigation strategies applicable beyond mutation testing
  • Different isolation strategies for different tasks

Conclusions
• Even seemingly non-flaky tests have flaky coverage
  • 22% of statements not covered consistently!
• We present problems in mutation testing due to flakiness
• We propose techniques to mitigate effects
  • Different combinations of reruns and isolation
  • We reduce Unknown mutants/pairs by 79.4%/88.1%
• Flakiness can have negative effects beyond mutation testing
https://doi.org/10.6084/m9.figshare.8226332
awshi2@illinois.edu

BACKUP
Prioritizing Tests for Mutants
• Run mutant-test pairs in the order that gets the overall mutant status faster, more reliably
  • Once mutant status known, no need to run more
• Prioritize tests per mutant based on coverage
  • Tests with more "stable" coverage on mutant prioritized earlier
  • Later prioritize based on time
• When to rerun?
  • Immediately rerun pair?
  • Run all pairs first before rerunning?

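The ordering described above can be sketched as a two-key sort. Here "stable" is assumed to mean the fraction of coverage runs in which a test covered the mutated line, with test running time as the tie-breaker; that definition, the class `TestPrioritizer`, and the `Candidate` fields are illustrative assumptions rather than the paper's exact metric.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: orders tests for one mutant by coverage stability, then time. */
final class TestPrioritizer {

    static final class Candidate {
        final String test;
        final int runsCovering; // coverage runs in which the test covered the mutated line
        final int totalRuns;    // total coverage runs collected
        final long millis;      // test running time

        Candidate(String test, int runsCovering, int totalRuns, long millis) {
            this.test = test;
            this.runsCovering = runsCovering;
            this.totalRuns = totalRuns;
            this.millis = millis;
        }

        /** Assumed stability measure: share of runs that covered the mutated line. */
        double stability() {
            return totalRuns == 0 ? 0.0 : (double) runsCovering / totalRuns;
        }
    }

    /** Most stable coverage first; among equally stable tests, fastest first. */
    static List<String> order(List<Candidate> candidates) {
        Comparator<Candidate> byStability = Comparator.comparingDouble(Candidate::stability);
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(byStability.reversed().thenComparingLong(c -> c.millis));
        List<String> names = new ArrayList<>();
        for (Candidate c : sorted) {
            names.add(c.test);
        }
        return names;
    }
}
```

Running the most reliably covering tests first makes it more likely that an early pair yields a Killed or Survived status, after which the remaining pairs for that mutant can be skipped.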
RQ2: Coverage and Mutant Generation
Number of Mutants
          Default     Isolated    Default Reruns   Isolated Reruns
Overall   70,773      70,993      70,877           71,112
Number of Mutant-Test Pairs
          Default     Isolated    Default Reruns   Isolated Reruns
Overall   3,089,051   3,162,138   3,101,314        3,165,527
• Not much difference in numbers of mutants and pairs
• Can potentially use Default for mutant generation

RQ4: Prioritizing Tests
Running Time for Immediately Rerun (s)
          Random     Coverage   PIT        Best       Worst
Overall   84,013.0   51,821.8   51,804.9   42,333.4   299,203.5
Running Time for Not Immediately Rerun (s)
          Random     Coverage   PIT        Best       Worst
Overall   90,479.0   60,810.3   60,793.3   52,014.6   284,820.7