Mitigating the Effects of Flaky Tests on Mutation Testing

Mitigating the Effects of Flaky Tests on Mutation Testing
August Shi, Jonathan Bell, Darko Marinov
ISSTA 2019, Beijing, China, 7/18/2019
Grants: CCF-1421503, CNS-1646305, CNS-1740916, CCF-1763788, CCF-1763822, OAC-1839010

Mutation Testing
[Figure: test 1, test 2, and test 3 are run on the code under test, then on Mutant 1 (result: Killed) and on Mutant 2 (result: Survived).]
• Compare test suites by mutation score
• Guide testing based on mutant-test matrix
Mutant-test matrix (columns: Mut 1, Mut 2):
  test 1: Survived
  test 2: Survived
  test 3: Killed, Survived

Mutation Testing with Flaky Tests
[Figure: test 1, test 2, and test 3 are run on the code under test and on Mutant 1 and Mutant 2, but outcomes can differ between Run 1 and Run 2, so every result is uncertain: Mut 1 Killed?, Mut 2 Survived?]
• Get test suite with deterministic outcomes
• Debug/fix flaky tests [1]
• Remove/ignore flaky tests
Mutant-test matrix (columns: Mut 1, Mut 2):
  test 1: Survived?
  test 2: Survived?
  test 3: Killed?, Survived?
[1] August Shi et al. “iFixFlakies: A Framework for Automatically Fixing Order-Dependent Tests”. ESEC/FSE 2019

Flaky Coverage Example
• Other reasons for flakiness:
  • Concurrency
  • Randomness
  • I/O
  • Order dependency

    public class WatchDog {
      ...
      public void run() {
        ...
        synchronized (this) {
          long timeLeft = timeout - (System.currentTimeMillis() - startTime);
          isWaiting = timeLeft > 0;
          while (isWaiting) {
            ...
            wait(timeLeft);
            ...
          }}
        ...
      }}

    public void test() { new WatchDog().run(); ... }

Variable/Call          Value (Run 1)   Value (Run 2)
timeout                5000            5000
startTime              300000          500000
currentTimeMillis()    300300          510000
timeLeft               4700            -5000
isWaiting              true            false
TEST OUTCOME           PASS            PASS
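The two runs in the table can be replayed with a few lines of Java. This is only a sketch: `WatchDogTiming` and its methods are hypothetical helpers, not part of the `WatchDog` class on the slide, and the clock reading is passed in as a parameter so that both runs are deterministic.

```java
// Hypothetical helper that replays the slide's arithmetic with a fixed clock value.
final class WatchDogTiming {
    // Mirrors: timeLeft = timeout - (System.currentTimeMillis() - startTime)
    static long timeLeft(long timeout, long startTime, long now) {
        return timeout - (now - startTime);
    }

    // The while-loop body is executed (and therefore covered) only when this is true.
    static boolean entersWaitLoop(long timeout, long startTime, long now) {
        return timeLeft(timeout, startTime, now) > 0;
    }

    public static void main(String[] args) {
        // Run 1 from the slide: 300 ms elapsed since startTime.
        System.out.println(timeLeft(5000, 300000, 300300));       // 4700
        System.out.println(entersWaitLoop(5000, 300000, 300300)); // true
        // Run 2 from the slide: 10,000 ms elapsed since startTime.
        System.out.println(timeLeft(5000, 500000, 510000));       // -5000
        System.out.println(entersWaitLoop(5000, 500000, 510000)); // false
    }
}
```

The test passes in both runs, yet the loop body is covered only in the first one, which is the sense in which coverage is flaky while the outcome is not.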

Motivating Study
• Measure flakiness of coverage
• 30 open-source GitHub projects from prior work
• No flaky test outcomes! (all 35,850 tests pass in 17 runs)
• Rerun tests and measure differences in coverage
• 113,356 (22%) statements with different tests covering across runs
• 5,736 (16%) tests cover different statements across runs
• Lots of flakiness in coverage, even without flaky outcomes!
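The two measurements (tests that cover different statements across runs, and statements covered by different tests across runs) could be computed roughly as follows. This is a sketch under assumptions: coverage for each run is taken to be a map from test name to the set of covered statement identifiers, and the class and method names are hypothetical, not taken from the study's tooling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: each run maps a test name to the statements it covered.
final class CoverageFlakiness {
    // Keys whose associated set is not identical in every run.
    // Comparing every run against the first one is enough to detect any difference.
    static Set<String> unstableKeys(List<Map<String, Set<String>>> runs) {
        Set<String> unstable = new TreeSet<>();
        if (runs.isEmpty()) {
            return unstable;
        }
        Map<String, Set<String>> first = runs.get(0);
        for (Map<String, Set<String>> run : runs) {
            Set<String> keys = new HashSet<>(first.keySet());
            keys.addAll(run.keySet());
            for (String key : keys) {
                Set<String> a = first.getOrDefault(key, Collections.<String>emptySet());
                Set<String> b = run.getOrDefault(key, Collections.<String>emptySet());
                if (!a.equals(b)) {
                    unstable.add(key);
                }
            }
        }
        return unstable;
    }

    // Turns test -> covered statements into statement -> covering tests.
    static Map<String, Set<String>> invert(Map<String, Set<String>> testToStatements) {
        Map<String, Set<String>> inverted = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : testToStatements.entrySet()) {
            for (String statement : e.getValue()) {
                inverted.computeIfAbsent(statement, k -> new HashSet<>()).add(e.getKey());
            }
        }
        return inverted;
    }

    // Tests that cover different statements across runs.
    static Set<String> testsWithFlakyCoverage(List<Map<String, Set<String>>> runs) {
        return unstableKeys(runs);
    }

    // Statements that are covered by different tests across runs.
    static Set<String> statementsWithFlakyCoverage(List<Map<String, Set<String>>> runs) {
        List<Map<String, Set<String>>> inverted = new ArrayList<>();
        for (Map<String, Set<String>> run : runs) {
            inverted.add(invert(run));
        }
        return unstableKeys(inverted);
    }
}
```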

Mutation Testing with Flaky Coverage

    public class WatchDog {
      ...
      public void run() {
        ...
        synchronized (this) {
          long timeLeft = timeout - (System.currentTimeMillis() - startTime);
          isWaiting = timeLeft > 0;
          while (isWaiting) {
            ...
            wait(timeLeft);
            ...
          }}
        ...
      }}

    public void test() { new WatchDog().run(); ... }

Mutation: delete call

Variable/Call          Value (Run 1)   Value (Mut Run)
timeout                5000            5000
startTime              300000          500000
currentTimeMillis()    300300          510000
timeLeft               4700            -5000
isWaiting              true            false

Mutation not covered!

Mutation Testing Results are Unreliable
• Flakiness can shift mutation testing results
• Mutation scores may be inflated/deflated
• Mutant-test matrix unreliable
• Need to mitigate the effects of flakiness on mutation testing!
• Mitigation strategies based on reruns and isolation [2]
• Implemented on PIT, a popular mutation testing tool for Java
https://doi.org/10.6084/m9.figshare.8226332
https://github.com/hcoles/pitest/pull/534
https://github.com/hcoles/pitest/pull/545
[2] Jonathan Bell et al. “DeFlaker: Automatically Detecting Flaky Tests”. ICSE 2018

Mitigating Flakiness in Mutation Testing
Traditional mutation testing:
  Full test-suite coverage collection -> (mutants to test) -> Test-mutant prioritization -> (sorted tests per mutant) -> Mutant execution
Improvements to cope with flakiness:
• Coverage collection: rerun and isolate tests
• Test-mutant prioritization: run tests with least flaky coverage first
• Mutant execution: track mutations covered; rerun/isolate tests
See paper

Coverage Collection
                        All tests in same JVM   Each test in own JVM
Run once                Default                 Isolation
Rerun multiple times    Default-Reruns          Isolation-Reruns
• When running multiple times, union coverage
• More lines covered means more mutants generated
• Run tests in isolation to remove test-order dependencies
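Taking the union of coverage over several runs can be sketched as below. The data layout (test name mapped to covered line numbers) and the name `CoverageUnion` are assumptions made for illustration; they are not PIT's internal representation.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: merge per-run coverage so that a line counts as covered
// by a test if that test covered it in at least one run.
final class CoverageUnion {
    static Map<String, Set<Integer>> union(List<Map<String, Set<Integer>>> runs) {
        Map<String, Set<Integer>> merged = new TreeMap<>();
        for (Map<String, Set<Integer>> run : runs) {
            for (Map.Entry<String, Set<Integer>> e : run.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }
}
```

Because the union can only grow with more runs, a line covered in any single run becomes eligible for mutation, which is why more reruns can mean more mutants.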

Executing Tests on Mutants
• Monitor if tests actually execute mutated bytecode
• Traditionally, mutant-test pair has status Killed or Survived
• Only applicable if test executes the mutated bytecode
• Mutant-test pair with test that does not execute mutated bytecode has new status Unknown
• Test can potentially cover mutation, based on prior coverage
Mutant-test matrix (columns: Mut 1, Mut 2):
  test 1: Survived
  test 2: Survived, Unknown
  test 3: Unknown

New Status for Mutants
• Overall mutant status depends on status of all mutant-test pairs run for the mutant
  Killed + Covered
  Survived + Covered
  Unknown (not covered)
• Need to reduce number of Unknown mutants and pairs
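The per-pair and per-mutant statuses might be modeled as in the sketch below. The pair rule follows the slides directly (a pair is Killed or Survived only if the test executed the mutated bytecode). The aggregation rule is an assumption: the slide does not say how a mutant with a mix of Survived and Unknown pairs is classified, and this sketch treats any covered, non-killing pair as enough for Survived. All names are hypothetical.

```java
import java.util.List;

// Hypothetical sketch of the three-valued status for pairs and mutants.
final class MutantStatus {
    enum Status { KILLED, SURVIVED, UNKNOWN }

    // Pair rule from the slides: without executing the mutated bytecode,
    // neither Killed nor Survived is meaningful, so the pair is Unknown.
    static Status pairStatus(boolean executedMutatedBytecode, boolean testFailed) {
        if (!executedMutatedBytecode) {
            return Status.UNKNOWN;
        }
        return testFailed ? Status.KILLED : Status.SURVIVED;
    }

    // ASSUMPTION: Killed wins over Survived, and Survived wins over Unknown.
    static Status overall(List<Status> pairStatuses) {
        if (pairStatuses.contains(Status.KILLED)) {
            return Status.KILLED;
        }
        if (pairStatuses.contains(Status.SURVIVED)) {
            return Status.SURVIVED;
        }
        return Status.UNKNOWN; // no pair covered the mutation
    }
}
```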

Rerunning Mutant-test Pairs
• While status of mutant-test pair is Unknown, rerun
• Change isolation level during reruns
Isolation levels, in order of increasing cost:
  Default:         mutant-test pairs for mutants in same class in same JVM
  More Isolation:  mutant-test pairs for same mutant in same JVM
  Most Isolation:  mutant-test pairs in own JVM
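The rerun loop could look roughly like the sketch below. It assumes that isolation is escalated from cheapest to most expensive and that each level gets a fixed rerun budget; both choices, along with the `PairRunner` interface and every other name here, are illustrative assumptions rather than PIT's API.

```java
// Hypothetical sketch: rerun an Unknown pair, escalating isolation when needed.
final class IsolationReruns {
    enum Isolation { DEFAULT, MORE, MOST }          // ordered by increasing cost
    enum PairStatus { KILLED, SURVIVED, UNKNOWN }

    // Stand-in for whatever actually executes a test on a mutant.
    interface PairRunner {
        PairStatus run(String mutant, String test, Isolation level);
    }

    static PairStatus rerunUntilKnown(String mutant, String test,
                                      PairRunner runner, int rerunsPerLevel) {
        for (Isolation level : Isolation.values()) {
            for (int i = 0; i < rerunsPerLevel; i++) {
                PairStatus status = runner.run(mutant, test, level);
                if (status != PairStatus.UNKNOWN) {
                    return status; // the test executed the mutated bytecode
                }
            }
        }
        return PairStatus.UNKNOWN; // still not covered after all reruns
    }
}
```

Starting with the cheapest level keeps the common case fast and reserves the one-JVM-per-pair cost for pairs that stay Unknown.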

Experimental Setup
• Evaluate on same 30 projects in motivating study
• All modifications on top of PIT mutation testing tool
• RQ1: Flakiness in traditional mutation testing?
• RQ2: Effect of coverage on mutants generated? (See paper)
• RQ3: Effect of re-executing tests on mutant status?
• RQ4: Prioritize tests for mutant-test executions? (See paper)

RQ1: Flakiness in Traditional Mutation Testing
Mutants by status (overall):
  Killed    Survived   Unknown   Total     Mut. Score
  51,687    11,965     2,866     66,518    77.7%-82.0%
• Max difference up to 23 pp!
• Must improve mutation scores more than this variance!
Mutant-test pairs by status (overall):
  Killed      Survived    Unknown   Total
  1,569,658   1,097,506   255,194   2,922,358
• 9% of mutant-test pairs are Unknown (max up to 55%)!
• Matrix results can be unreliable
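The reported score range is consistent with counting every Unknown mutant as Survived for the lower bound and as Killed for the upper bound. That reading is an assumption that happens to reproduce the numbers on the slide; the sketch below (with a hypothetical class name) just does the arithmetic.

```java
import java.util.Locale;

// Hypothetical sketch: mutation-score bounds when some mutants are Unknown.
final class MutationScoreBounds {
    // Lower bound: every Unknown mutant is counted as not killed.
    static double lowerPercent(long killed, long survived, long unknown) {
        return 100.0 * killed / (killed + survived + unknown);
    }

    // Upper bound: every Unknown mutant is counted as killed.
    static double upperPercent(long killed, long survived, long unknown) {
        return 100.0 * (killed + unknown) / (killed + survived + unknown);
    }

    public static void main(String[] args) {
        long killed = 51687, survived = 11965, unknown = 2866; // counts from the slide
        System.out.println(String.format(Locale.ROOT, "%.1f%%",
                lowerPercent(killed, survived, unknown))); // 77.7%
        System.out.println(String.format(Locale.ROOT, "%.1f%%",
                upperPercent(killed, survived, unknown))); // 82.0%
    }
}
```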

RQ3: Mutant Re-execution Results
                              Before     After     Reduction
Unknown mutants (overall)     2,866      591       2,275 (79.4%)
Unknown mutant-test pairs     255,194    30,321    224,873 (88.1%)
Additional covered pairs over reruns 1 to 5 (overall; the per-rerun column alignment did not survive in this transcript, so values are listed as they appear):
  Default reruns:          61,437  41,302  14,787  6,590  18,762
  More Isolation reruns:   46,819  14,072  1,000  629  3,872
  Most Isolation reruns:   15,594  5  4  0  0
• Increasing isolation greatly increases covered pairs
• Unnecessary to rerun too often with the most isolation
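As a consistency check, the additional covered pairs at the three isolation levels add up to the reported reduction in Unknown pairs, and the percentages follow from the before/after counts. The helper names below are hypothetical; the numbers are the ones on the slide.

```java
// Hypothetical sketch: cross-check the RQ3 totals and reduction percentages.
final class RerunReduction {
    static long sum(long... values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double reductionPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        long byDefault = sum(61437, 41302, 14787, 6590, 18762); // Default reruns
        long byMore = sum(46819, 14072, 1000, 629, 3872);       // More Isolation reruns
        long byMost = sum(15594, 5, 4, 0, 0);                   // Most Isolation reruns
        System.out.println(byDefault + byMore + byMost);        // 224873
        System.out.println(255194 - 30321);                     // 224873
    }
}
```

With these helpers, `reductionPercent(2866, 591)` is about 79.4 and `reductionPercent(255194, 30321)` is about 88.1, matching the slide.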

Discussion
• Flakiness can have negative effects beyond mutation testing
• Tools/studies that rely on coverage must consider flakiness
• Fault localization, program repair, test prioritization, test-suite reduction, test selection, test generation, runtime verification, …
• Mitigation strategies applicable beyond mutation testing
• Different isolation strategies for different tasks

Conclusions
• Even seemingly non-flaky tests have flaky coverage
• 22% of statements not covered consistently!
• We present problems in mutation testing due to flakiness
• We propose techniques to mitigate effects
• Different combinations of reruns and isolation
• We reduce Unknown mutants/pairs by 79.4%/88.1%
• Flakiness can have negative effects beyond mutation testing
https://doi.org/10.6084/m9.figshare.8226332
awshi2@illinois.edu

BACKUP

Prioritizing Tests for Mutants
• Run mutant-test pairs in the order that gets the overall mutant status faster, more reliably
• Once mutant status known, no need to run more
• Prioritize tests per mutant based on coverage
• Tests with more “stable” coverage on mutant prioritized earlier
• Later prioritize based on time
• When to rerun?
• Immediately rerun pair?
• Run all pairs first before rerunning?
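One way to read this ordering is sketched below. The slide does not define “stable”; here stability is assumed to be the fraction of coverage-collection runs in which the test covered the mutated code, with ties broken by shorter running time. The class, fields, and metric are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: order the tests for one mutant by coverage stability, then time.
final class TestPrioritizer {
    static final class Candidate {
        final String test;
        final int runsCoveringMutant; // coverage runs in which the test reached the mutated code
        final int coverageRuns;       // total coverage runs (must be > 0)
        final long millis;            // test running time

        Candidate(String test, int runsCoveringMutant, int coverageRuns, long millis) {
            this.test = test;
            this.runsCoveringMutant = runsCoveringMutant;
            this.coverageRuns = coverageRuns;
            this.millis = millis;
        }

        // ASSUMPTION: stability = share of runs in which the mutated code was covered.
        double stability() {
            return (double) runsCoveringMutant / coverageRuns;
        }
    }

    static List<String> order(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> {
            int byStability = Double.compare(b.stability(), a.stability()); // more stable first
            return byStability != 0 ? byStability : Long.compare(a.millis, b.millis); // then faster first
        });
        List<String> names = new ArrayList<>();
        for (Candidate c : sorted) {
            names.add(c.test);
        }
        return names;
    }
}
```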

RQ2: Coverage and Mutant Generation
Overall counts by coverage-collection configuration:
                               Default     Isolated    Default Reruns   Isolated Reruns
Number of mutants              70,773      70,993      70,877           71,112
Number of mutant-test pairs    3,089,051   3,162,138   3,101,314        3,165,527
• Not much difference in numbers of mutants and pairs
• Can potentially use Default for mutant generation

RQ4: Prioritizing Tests
Overall running time (s) by test ordering:
                           Random     Coverage   PIT        Best       Worst
Immediately rerun          84,013.0   51,821.8   51,804.9   42,333.4   299,203.5
Not immediately rerun      90,479.0   60,810.3   60,793.3   52,014.6   284,820.7