Testing Concurrent Programs to Achieve High Synchronization Coverage
Haoran Hou, Xi Wang, Bo Man, Ruslan Ryzhkov

Problem being addressed
• Effectiveness of software testing: measure the coverage of some aspect of the software
• Little research on increasing coverage for concurrent programs
• How to achieve high coverage of concurrent programs?

Fact: There’s a strong correlation between test suites with high coverage and the defect-detection ability of those test suites.

Background & Prior Work

Figure 1: Overview of the thread-scheduling technique
• Input: program P + test case
• Estimation phase: identifies coverage requirements R, e.g. {(l1, l2), (l1, l3), (l2, l3), …}
• Testing phase: generates thread schedules to execute the coverage requirements in R

Definitions M: Thread Model

M: Thread Model

The precedence relation represents ordering constraints between actions of two different threads t and t′. The ordering constraints are imposed at the time of thread creation.

Interleaved Execution Model

Synchronization Coverage Definition 1. Synchronization-Pair (SP) Coverage Requirement

Definition 2. SP Coverage Satisfaction Criteria

Prior Work
• Stress tests and random tests: such testing did not reveal a known concurrency bug even when executing the software for one week. These techniques may produce the same interleavings for many executions, and may not reveal concurrency bugs that occur only under a specific interleaving.
• Bug-directed random tests: they are tailored to specific bug patterns, and may not explore diverse interleavings to reveal other kinds of faults.

Related Work: Random testing
• Runs the program many times while injecting artificial delays into thread schedules to produce different interleavings, e.g. RSTest/ConTest
• There may be many duplicate interleavings, which may be inadequate for covering previously uncovered interleavings.

Related Work: Testing based on concurrency bug analysis
• e.g. CalFuzzer/RaceFuzzer/DeadlockFuzzer
• Fuzz testing, or fuzzing, is a software testing technique used to discover coding errors and security loopholes in software, operating systems or networks by inputting massive amounts of random data, called fuzz, to the system in an attempt to make it crash. If a vulnerability is found, a tool called a fuzz tester (or fuzzer) indicates potential causes. Fuzz testing was originally developed by Barton Miller at the University of Wisconsin in 1989.

Related Work: Testing based on concurrency bug analysis
• Identify potential concurrency bugs using static analysis, or using dynamic analysis obtained from a program trace.
• The techniques then run the program while manipulating the thread scheduler to trigger the possible bugs.
• Limitation: they only manipulate interleavings near possibly buggy code points.

Related Work: Systematic testing
• Explores distinct interleavings of the program in each different run.
• The inherent problem of these techniques is that the interleaving space to explore is exponentially large.

Related Work
(1) How to achieve higher coverage faster
• Coverage criteria: def-use pair coverage, synchronization-pair coverage, event-pair coverage, etc.
• This paper: directly controls thread scheduling to increase coverage, and specifically aims at reaching high synchronization-pair coverage faster.
(2) How much testing is enough to guarantee quality
• Saturation-based testing: monitors the number of executed coverage requirements until the rate of increase of covering new requirements is less than a threshold (i.e., the coverage reaches a saturation point).
• This testing is used as the stopping criterion in our empirical studies in Section 3.
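The saturation-based stopping criterion described above can be illustrated with a small Java sketch. It is only one possible reading: the slides do not say how the rate of increase is measured, so the sliding `window`, the `minNew` threshold, and the class and method names are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a saturation-based stopping criterion (not the paper's code). */
final class SaturationSketch {
    /**
     * Each element of runs is the set of coverage requirements one execution covered.
     * Returns the 1-based index of the execution at which testing would stop, i.e. the
     * first execution after which fewer than minNew new requirements were covered in
     * the last `window` executions. Returns runs.size() if that never happens.
     * window and minNew are assumed parameters, not values taken from the slides.
     */
    static int saturationPoint(List<Set<String>> runs, int window, int minNew) {
        Set<String> covered = new HashSet<>();
        int[] newPerRun = new int[runs.size()];
        for (int i = 0; i < runs.size(); i++) {
            int before = covered.size();
            covered.addAll(runs.get(i));
            newPerRun[i] = covered.size() - before;   // requirements first covered in run i
            if (i + 1 >= window) {
                int recent = 0;
                for (int j = i + 1 - window; j <= i; j++) {
                    recent += newPerRun[j];
                }
                if (recent < minNew) {
                    return i + 1;                     // saturation point reached
                }
            }
        }
        return runs.size();
    }
}
```

With `window = 2` and `minNew = 1`, testing stops after the first two consecutive executions that cover nothing new.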

Challenges
• Create a new way of increasing the coverage of concurrent programs
• Build new models and phases for the new technique
• Show through experiments that the new technique is better

Goals of Paper
• Achieve high coverage of concurrent programs by generating thread schedules to cover uncovered coverage requirements.
• Present a description of a prototype tool implemented in Java.

Technique
• Estimation Phase
– Identify coverage requirements R (SP requirements) that can be satisfied by possible thread interleavings
• Testing Phase
– Generate thread schedules to execute the coverage requirements in R

Estimation Phase
• Execute P once
• Create every possible pair of lock statements
• Generate a thread model M
• Filter out some infeasible pairs
– Acceptance Condition (AC)

Example
• Execute P once
• Create every possible pair of lock statements
• Generate a thread model M
• Filter out some infeasible pairs: <4a, 2a>, <4b, 2b>

Testing Phase
• Invokes the scheduling controller before each lock action
• Tracks covered and uncovered requirements
• Determines which rule to apply
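The pair-generation and filtering steps of the estimation phase can be sketched in Java as follows. This is a simplified illustration under stated assumptions: the slides do not spell out the acceptance condition or the thread model, so the `feasible` predicate stands in for both, and restricting pairs to lock statements on the same lock executed by different threads is an assumption. The `LockAction` type, its fields, and the lock names used in examples are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical sketch of the estimation phase (not the paper's implementation). */
final class EstimationSketch {
    /** One lock statement observed in the single estimation run. */
    static final class LockAction {
        final String id;      // statement label, e.g. "2a"
        final String thread;  // thread that executed the statement
        final String lock;    // lock object it acquired
        LockAction(String id, String thread, String lock) {
            this.id = id;
            this.thread = thread;
            this.lock = lock;
        }
    }

    /**
     * Builds candidate requirements <first, second> from every ordered pair of observed
     * lock statements, then drops pairs the feasible predicate rejects.
     * The predicate is a placeholder for the thread model and acceptance condition.
     */
    static List<String> estimate(List<LockAction> trace,
                                 BiPredicate<LockAction, LockAction> feasible) {
        List<String> requirements = new ArrayList<>();
        for (LockAction first : trace) {
            for (LockAction second : trace) {
                if (first == second) continue;
                if (first.thread.equals(second.thread)) continue;  // assumed: different threads
                if (!first.lock.equals(second.lock)) continue;     // assumed: same lock
                if (!feasible.test(first, second)) continue;       // filter infeasible pairs
                String pair = "<" + first.id + ", " + second.id + ">";
                if (!requirements.contains(pair)) {
                    requirements.add(pair);
                }
            }
        }
        return requirements;
    }
}
```

The resulting list plays the role of R, which the testing phase then tries to cover.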

P=2a; paused={2a}; Uncovered: (none). Output: covered: empty; uncovered: same; paused={2a}

P=2b; paused={2a, 2b}; Uncovered: <2a, 2b>; Paused={2b}. Output: covered: empty; uncovered: same; paused={2b}

P=2a; paused={2a, 2b}; Uncovered: <2a, 2b>; Paused={2b}. Output: covered: empty; uncovered: same; paused={2b}; execute 2a

P=3a; Uncovered: (none). Output: covered: empty; uncovered: same; paused={2b}; execute 3a

P=4a; paused={4a, 2b}; Uncovered: <2a, 2b>; Paused={4a}. Output: covered: empty; uncovered: same; paused={4a}

P=2b; paused={4a, 2b}; Uncovered: <2a, 2b>; Paused={4a}. Output: covered: {<2a, 2b>}; uncovered: rest; paused={4a}; execute 2b

P=2a; paused={2a}; Uncovered: (none). Output: covered: same; uncovered: same; paused={2a}

P=2b; paused={4a, 2b}; Uncovered: (none); remove 2b; execute 2b. Output: covered: same; uncovered: same; paused={4a}

Methodology
• Types of software engineering research questions: methods and means of development
• Types of software engineering research results: procedure and technique
• Types of software engineering research validation: evaluation

Contributions
• What is the contribution? What is new?
– Presents a new technique that aims at achieving high coverage faster for concurrent programs
– Implements the technique in a prototype tool
– Shows that the estimation-based heuristic contributes to the efficiency and effectiveness

Take Home Ideas
• The technique
• The algorithm
• The tool
• The implementation method
– Implemented on top of the CalFuzzer framework

Experiments Bo Man

Goals
• To evaluate our technique through a prototype tool in Java and by performing several empirical studies with the tool on a number of Java subjects
• 3 Steps
– (1) Experimental setup
– (2) Studies
– (3) Threats to validity

Experimental setup: Implementation
• Take the CalFuzzer framework in Java
• Modify both the instrumentation and the scheduling-controller modules and create new modules
• Insert probes before every synchronization operation, shared-data access and thread-related operation
• Run the program once, generate SP coverage requirements and store them in a file
• Take the file and execute program multiple times to achieve high SP

Experimental setup Subjects • Java Library: a set of classes extracted from package

Experimental setup: Variables
• The independent variable is the thread-schedule generation technique
– TSA: thread-scheduling algorithm
– TSA-h: TSA without Rule 3
– 15 varieties of the random testing technique
– RND-y: insert a yield() call at the shared resource accesses and synchronization operations
– RND-s10: insert a random delay of up to 10 milliseconds with sleep()
– RND-s100
• Above s100, effectiveness decreases
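The RND-y and RND-s variants can be pictured as a probe that runs before each instrumented operation. The Java sketch below is an assumption about how such a probe might look, not the presenters' or the paper's code; the class name, the seed parameter, and the convention that a maximum delay of 0 selects the yield variant are all hypothetical. Only `Thread.yield()` and `Thread.sleep(long)` are standard Java calls.

```java
import java.util.Random;

/** Hypothetical noise-injection probe for random testing (RND-y / RND-s style). */
final class RandomNoiseSketch {
    private final Random random;
    private final int maxDelayMillis;  // assumed convention: 0 selects the yield variant

    RandomNoiseSketch(long seed, int maxDelayMillis) {
        this.random = new Random(seed);
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Picks the next delay, uniformly in [0, maxDelayMillis]. */
    int nextDelayMillis() {
        return maxDelayMillis == 0 ? 0 : random.nextInt(maxDelayMillis + 1);
    }

    /** Called before a synchronization operation or shared-resource access. */
    void beforeOperation() {
        if (maxDelayMillis == 0) {
            Thread.yield();                    // RND-y style: hint the scheduler to switch
            return;
        }
        try {
            Thread.sleep(nextDelayMillis());   // RND-s style: random bounded delay
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt for the caller
        }
    }
}
```

Under these assumptions, `new RandomNoiseSketch(seed, 10)` would correspond to RND-s10 and `new RandomNoiseSketch(seed, 100)` to RND-s100.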

Experimental setup: Variables
• Dependent variables
– the number of covered SP coverage requirements
– the execution time to attain a certain goal
– the number of feasible SP coverage requirements (compared with the first variable in Study 3)

Studies and Results
• 4 studies
– effectiveness
– efficiency
– precision of estimation
– impact of the estimation-based heuristic

Study 1: Effectiveness
• To investigate whether TSA achieves higher coverage than random testing.
– (a) run the estimation phase and, for each subject, create a set of SP coverage requirements
– (b) run TSA and each of the 15 random testing techniques 30 times
– (c) for each run, execute the program 500 times, then calculate the average number of covered SP coverage requirements

Result of Study 1
• SP requirements: TSA ≥ MAX
• RND-y < RND-s100 < TSA
• Random testing: results vary a lot

Result of Study 1

Study 2
• To investigate the efficiency of the technique compared to random testing techniques.
– (a) same as Study 1 (a) and (b)
– (b) for each of the 30 runs, execute the program for 30 minutes and record the average saturation point and the number of covered SP coverage requirements

Result of Study 2
• Saturation-based testing: threshold

Result of Study 2 • TSA always reaches saturation point faster and covers a greater number of SP coverage requirements than random testing.

Study 3
• To investigate how precisely our technique estimates a set of SP coverage requirements
– (a) same as Study 1 (a) and (b)
– (b) take the union of the accumulated SP coverage requirements and compare it with the SP requirements estimated in the estimation phase
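The comparison in step (b) amounts to two set differences. The Java sketch below illustrates that arithmetic only; the class and method names are hypothetical, and treating "estimated but never covered" as a false positive is an approximation, since a requirement that was never covered during testing is not thereby proven infeasible.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper comparing estimated and actually covered SP requirements. */
final class PrecisionSketch {
    /** Requirements the estimation produced that no test execution ever covered. */
    static Set<String> falsePositives(Set<String> estimated, Set<String> covered) {
        Set<String> result = new TreeSet<>(estimated);
        result.removeAll(covered);
        return result;
    }

    /** Requirements covered during testing that the estimation did not produce. */
    static Set<String> falseNegatives(Set<String> estimated, Set<String> covered) {
        Set<String> result = new TreeSet<>(covered);
        result.removeAll(estimated);
        return result;
    }
}
```

Here `covered` would be the union of the requirements accumulated over all runs, and `estimated` the set written out by the estimation phase.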

Result of Study 3

Result of Study 3
• False positives: the estimation technique is not precise enough to filter out infeasible coverage requirements
• False negatives: the estimation technique is dynamic
– one source is locks in a loop that do not appear in an estimation but do appear in a testing execution
– another source is the aliasing problem. A lock statement appears in more than one

Study 4
• To investigate the impact of the estimation-based heuristic on the efficiency of the testing phase.
– same as Study 2, except that the 15 random testing techniques are replaced with TSA-h

Result of Study 4
• the ratio of the application of Rule 3 over all in Algorithm 1
• estimation-based heuristics is the key asset of our

Results - Threats to Validity
• Threats to external validity:
– programs not representative
– solution: try to cover different kinds, such as library classes and server applications
• Threats to internal validity:
– unknown bugs in the prototype
– solution: build the tool on top of the publicly available CalFuzzer tool

Methodology
• Controlled variables, reference groups, comparison through experiments
• Large number of repeated trials
• All-around statistical analysis from tables and figures (effectiveness, efficiency and precision)

Pros?

Pros
• New thread scheduling technique to achieve high coverage in concurrent programs
• Prototype implemented in Java
§ 1910 lines of code modified
§ Publicly available
• Defined terms well
• Step-by-step instructions for the thread-scheduling algorithm
• Thread-scheduling algorithm can be used by developers without varying any parameters
• Estimation-based heuristic improved performance

Cons?

Cons
• Only 13 sample programs were used
• Scalability claim is not well-supported
§ Only one test case was over 20K lines of code
• No computational complexity mentioned for the thread-scheduling algorithm
• Used their own implementation of random testing techniques instead of existing ones
• Numerical mistake when discussing Table 2/Figure 4
• Issues with false positives and false negatives
• Bias towards the estimation-based heuristic

Next Steps?

Next Steps
• Investigate the relationship between coverage and fault-detection ability
• Test the thread-scheduling algorithm on a greater variety of programs
• Extend the technique to satisfy other criteria besides synchronization-pair coverage
• Fix issues with false positives and false negatives
• Show the computational complexity of the algorithm and whether or not it can be improved