Diagnosing and Fixing Concurrency Bugs
Presented by Tao Wang. Credits to Dr. Guoliang Jin, Computer Science, NC STATE
We need reliable software
§ People’s daily life now depends on reliable software
§ Software companies spend lots of resources on debugging
  § More than 50% of effort goes to finding and fixing bugs
  § Around $300 billion per year
Concurrency bugs hurt
§ It is an increasingly parallel world
§ Concurrency bugs in history
Multi-threaded program
§ Concurrent programs under the shared-memory model
§ Programs execute multiple interacting threads in parallel
§ Threads communicate via shared memory
§ Shared-memory accesses should be well-synchronized
[Figure: threads 1 to 4 running on cores 1 to 4 of a multicore chip, with a cache and shared memory]
An example of a concurrency bug: the interleaving space
  Thread 1: if (ptr != NULL) { ptr->field = 1; }
  Thread 2: ptr = NULL;
§ The interleaving space is huge
§ Bad interleaving: Thread 1 checks ptr != NULL, Thread 2 then executes ptr = NULL, and Thread 1 executes ptr->field = 1, which causes a segmentation fault
§ Previous research focuses on finding
Bug fixing
§ Software quality does not improve until bugs are fixed
§ Manual concurrency bug fixing is
  § time-consuming: 73 days on average
  § error-prone: 39% of patches are buggy in the first release
§ CFix: automated concurrency-bug fixing [PLDI’11*, OSDI’12]
  § Program behaves correctly if bad interleavings do not occur
  § Fix concurrency bugs by disabling bad interleavings
§ *SIGPLAN: “one of the first papers to attack the problem of automated bug fixing”
The interleaving space (again)
[Figure: the huge interleaving space, with some regions of bad interleavings disabled and one region of bad interleavings remaining]
§ Bad interleavings lead to production-run failures
Failure diagnosis
§ Failures still happen in production runs
§ The reason behind a failure needs to be understood
  § Tools dealing with production runs demand low overhead
  § Diagnostic information needs to be informative
§ Production-run concurrency-bug failure diagnosis
  § Design new monitoring schemes and sampling strategies
  § CCI: a pure software solution [OOPSLA’10]
  § PBI, LXR: hardware-assisted solutions [ASPLOS’13 & 14]
My work on concurrency bugs
§ Bug detection and software testing: ConSeq [ASPLOS’11]
§ Production-run failure diagnosis: CCI/PBI/LXR [OOPSLA’10, ASPLOS’13 & 14]
§ Automated concurrency-bug fixing: CFix [PLDI’11*, OSDI’12]
*Received a SIGPLAN CACM nomination
Outline
§ Motivation and Overview
§ Automated Concurrency-Bug Fixing
  § The problem and idea
  § Overview
  § Internals of CFix
  § Evaluation and summary
Automated fixing is difficult
§ Description (symptom, triggering condition, …) → ? → Patch (correctness, performance, simplicity)
§ What is the correct behavior?
  § Usually requires developers’ knowledge
§ How to get the correct behavior?
  § Correct program states under bug-triggering inputs
  § No change to program states under other inputs
CFix’s insights
§ Description (symptom, triggering condition, …) → ? → Patch (correctness, performance, simplicity)
§ What is the correct behavior?
  § The program state is correct as long as the buggy interleaving does not occur
§ How to get the correct behavior?
  § Only need to disable failure-inducing interleavings
  § Can leverage well-defined synchronization operations
Description: interleavings that lead to software failure (symptom, triggering condition, …)
Patch: correctness, performance, simplicity
§ Atomicity violation detectors (report p, r, c): Park ASPLOS’09, Flanagan POPL’04, Lu ASPLOS’06, Chew EuroSys’10
§ Order violation detectors (report A, B): Zhang ASPLOS’10, Lucia MICRO’09, Yu ISCA’09, Gao ASPLOS’11
§ Data race detectors (report I1, I2): Sen PLDI’08, Savage TOCS’97, Yu SOSP’05, Erickson OSDI’10, Kasikci ASPLOS’10
§ Abnormal data flow detectors (report Wb, R, Wg): Zhang ASPLOS’11, Shi OOPSLA’10
§ How to get a general solution that generates good patches?
CFix
§ Input: bug reports (interleavings that lead to software failure) and source code
§ Fix-Strategy Design: mutual exclusion, ..., order
§ Synchronization Enforcement: produces patched binaries
§ Patch Testing & Selection: produces selected binaries
§ Patch Merging: produces a merged binary
§ Run-time Support: produces the final patched binary
§ Output: a patch with correctness, performance, simplicity
Fix-strategy design: what to fix
Challenges:
§ Huge variety of bugs
Two types of concurrency bugs
§ Atomicity violation
§ Order violation
§ Why these two?
  § Real-world concurrency bug characteristics study [SHAN ASPLOS’08]: 97% are either atomicity violations or order violations
  § Either can be fixed by mutual exclusion or order enforcement
Fix-strategy design: how to fix
Challenges:
§ Inaccurate root cause
Atomicity violation
  Thread 1: if (ptr != NULL) {    (p)
              ptr->field = 1;     (c)
            }
  Thread 2: ptr = NULL;           (r)
Fix-strategy for atomicity violation
  Thread 1: if (ptr != NULL) { ptr->field = 1; }
  Thread 2: ptr = NULL;
CFix: fix-strategy design
Challenges:
§ Inaccurate root cause
§ Huge variety of bugs
Solution:
§ A combination of mutual exclusion & order relationship enforcement
Fix-strategies overview
§ AV detector: reports p, r, c
§ OV detector: reports A, B
§ Race detector: reports I1, I2
§ DU detector: reports Wb, R, Wg
CFix: synchronization enforcement
Challenges:
§ Correctness
§ Performance
§ Simplicity
Solution:
§ Mutual exclusion enforcement: AFix [PLDI’11]
§ Order relationship enforcement: OFix [OSDI’12]
Fixing atomicity violations
§ Input: three statements (p, c, r) with contexts
  Thread 1: if (ptr != NULL) {    (p)
              ptr->field = 1;     (c)
            }
  Thread 2: ptr = NULL;           (r)
§ Idea: make the code region from p to c mutually exclusive with r
Mutual exclusion enforcement: AFix
§ Approach: lock
§ Goal:
  § Correctness: paired lock acquisition and release operations
  § Performance: make the critical section as small as possible
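The mutual-exclusion idea can be sketched with POSIX threads. This is a minimal illustration under stated assumptions, not AFix's generated patch: the lock name `afix_lock`, the `struct node` type, and the driver `run_patched_example` are hypothetical names introduced here.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node { int field; };

static struct node storage;
static struct node *ptr = &storage;
/* Hypothetical lock: stands in for the lock a patch would introduce. */
static pthread_mutex_t afix_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread 1: the region from p (check) to c (use) is one critical section. */
static void *thread1(void *arg) {
    (void)arg;
    pthread_mutex_lock(&afix_lock);
    if (ptr != NULL) {      /* p */
        ptr->field = 1;     /* c */
    }
    pthread_mutex_unlock(&afix_lock);
    return NULL;
}

/* Thread 2: r takes the same lock, so it cannot run between p and c. */
static void *thread2(void *arg) {
    (void)arg;
    pthread_mutex_lock(&afix_lock);
    ptr = NULL;             /* r */
    pthread_mutex_unlock(&afix_lock);
    return NULL;
}

/* Runs both threads once; returns 0 on success, -1 if a thread cannot start. */
int run_patched_example(void) {
    pthread_t t1, t2;
    storage.field = 0;
    ptr = &storage;
    if (pthread_create(&t1, NULL, thread1, NULL) != 0) return -1;
    if (pthread_create(&t2, NULL, thread2, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    assert(ptr == NULL);    /* r always executes, in either order */
    return 0;
}
```

Either order of the two critical sections is allowed; only the interleaving that places r between p and c is disabled, which matches the slide's goal of paired lock and unlock operations around a small region.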
A naïve solution
§ Add lock on edges reaching p
§ Add unlock on edges leaving c
§ Potential new bugs
  § Could lock without unlock
  § Could unlock without lock
  § etc.
The AFix solution
§ Assume p and c are in the same function f
§ Step 1: find protected nodes in the critical section
§ Step 2: add lock operations
  § on edges from an unprotected node to a protected node
  § and unlock operations on edges from a protected node to an unprotected node
§ Avoids the potential bugs mentioned
[Figure: control-flow graph between p and c with protected and unprotected nodes marked]
Subtle details
§ p and c adjustment when they are in different functions
  § Observation: people put lock and unlock in one function
  § Find the longest common prefix of p’s and c’s stack traces
  § Adjust p and c accordingly
§ Put r into a critical section
  § Do nothing if we can reach r from the p–c critical section
§ Lock type:
  § Lock with timeout: if the critical section has blocking operations
  § Reentrant lock: if recursion is possible within the critical section
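The two lock types named above both exist in POSIX threads, and the following sketch shows how each could be obtained. It assumes a POSIX.1-2008 system; the helper names `make_reentrant_lock`, `lock_with_timeout`, and `demo_lock_types` are hypothetical and are not taken from AFix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Initializes *m as a reentrant (recursive) mutex. Returns 0 on success. */
int make_reentrant_lock(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0) rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Waits at most `ms` milliseconds for the lock.
 * Returns 0 when the lock is acquired, or an error code such as ETIMEDOUT. */
int lock_with_timeout(pthread_mutex_t *m, long ms) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);   /* timedlock uses an absolute time */
    deadline.tv_sec += ms / 1000;
    deadline.tv_nsec += (ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    return pthread_mutex_timedlock(m, &deadline);
}

/* The same thread acquires the reentrant lock twice, as recursion would. */
int demo_lock_types(void) {
    pthread_mutex_t m;
    if (make_reentrant_lock(&m) != 0) return -1;
    int first = lock_with_timeout(&m, 100);
    int second = lock_with_timeout(&m, 100);    /* nested acquisition */
    assert(first == 0 && second == 0);
    pthread_mutex_unlock(&m);
    pthread_mutex_unlock(&m);                   /* one unlock per acquisition */
    pthread_mutex_destroy(&m);
    return (first == 0 && second == 0) ? 0 : -1;
}
```

A timed acquisition lets a patched program make progress when the critical section blocks, and a recursive mutex avoids self-deadlock when the critical section re-enters itself.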
OFix: two order relationships
§ allA-B: B must wait for all instances of A (A1 … An), e.g. every use before destroy
§ firstA-B: B must wait for the first instance of A (A1), e.g. initialization before read
OFix allA-B enforcement
§ Approach: condition variable and flag
  § Insert signal operations in A-threads
  § Insert wait operation before B
§ Rules
  § An A-thread signals exactly once, when it will not execute more A
  § An A-thread signals as soon as possible
  § B proceeds when each A-thread has signaled
OFix allA-B enforcement: A side
§ How to identify the last A instance in one thread
  ...;
  for (...)
    ...; // A
  ...;
§ Each thread that executes A signals exactly once, as soon as it can execute no more A
OFix allA-B enforcement: A side
§ How to identify the last thread that executes A: keep a counter of the threads that still have to signal
  § The counter starts at 1 and is incremented at each thread_create
  void main() { for (...) thread_create(thr_main); ...; }
  void thr_main() { for (...) ...; // A
                    ...; }
  void ofix_signal() { mutex_lock(L); counter--; if (counter == 0) cond_broadcast(con); mutex_unlock(L); }
OFix allA-B enforcement: B side
§ B is safe to execute only when the counter is 0
  void ofix_wait() { mutex_lock(L); if (counter != 0) cond_timedwait(con, L, t); mutex_unlock(L); }
§ Give up if OFix knows that it introduces a new deadlock
§ Timed wait operation to mask potential deadlocks
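The A side and B side can be put together in a small runnable sketch with POSIX threads. Several things here are assumptions made for illustration rather than OFix's actual output: the counter is named `counter`, the creating thread itself signals once after it has created all A-threads (one reading of the counter starting at 1), B waits in a plain `pthread_cond_wait` loop instead of the slide's timed wait, and `a_done`, `b_main`, and `run_all_a_before_b` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define N_THREADS 4
#define A_PER_THREAD 3

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t con = PTHREAD_COND_INITIALIZER;
static int counter = 1;   /* threads that still have to signal; 1 = the creator */
static int a_done = 0;    /* how many A instances have executed */

static void must(int rc) { assert(rc == 0); (void)rc; }

static void ofix_signal(void) {
    pthread_mutex_lock(&L);
    if (--counter == 0) pthread_cond_broadcast(&con);
    pthread_mutex_unlock(&L);
}

static void ofix_wait(void) {
    pthread_mutex_lock(&L);
    while (counter != 0) pthread_cond_wait(&con, &L);
    pthread_mutex_unlock(&L);
}

static void *thr_main(void *arg) {
    (void)arg;
    for (int i = 0; i < A_PER_THREAD; i++) {
        pthread_mutex_lock(&L);
        a_done++;                     /* A */
        pthread_mutex_unlock(&L);
    }
    ofix_signal();                    /* this thread can execute no more A */
    return NULL;
}

static void *b_main(void *out) {
    ofix_wait();                      /* inserted before B */
    pthread_mutex_lock(&L);
    *(int *)out = a_done;             /* B: observes every A */
    pthread_mutex_unlock(&L);
    return NULL;
}

/* Returns the number of A instances that B observed. */
int run_all_a_before_b(void) {
    pthread_t a[N_THREADS], b;
    int seen = -1;
    pthread_mutex_lock(&L);
    counter = 1;
    a_done = 0;
    pthread_mutex_unlock(&L);
    must(pthread_create(&b, NULL, b_main, &seen));
    for (int i = 0; i < N_THREADS; i++) {
        pthread_mutex_lock(&L);
        counter++;                    /* count the thread before it exists */
        pthread_mutex_unlock(&L);
        must(pthread_create(&a[i], NULL, thr_main, NULL));
    }
    ofix_signal();                    /* the creator will start no more A-threads */
    for (int i = 0; i < N_THREADS; i++) must(pthread_join(a[i], NULL));
    must(pthread_join(b, NULL));
    return seen;
}
```

Incrementing the counter before each thread is created, and letting the creator signal last, keeps the counter from reaching 0 while A-threads may still be started, so B cannot slip through early.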
OFix firstA-B
§ Basic enforcement: signal after A, wait before B
§ When A may not execute
  § Add a safety net of signals with the allA-B algorithm
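A firstA-B enforcement can be sketched as a flag plus a condition variable. The sketch below is an illustration under assumptions: the names `signaled`, `resource`, `ofix_first_signal`, `ofix_first_wait`, and `run_first_a_before_b` are hypothetical, and the safety net is reduced to one extra signal at the point where the thread can no longer execute A, rather than the full allA-B counting.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t con = PTHREAD_COND_INITIALIZER;
static int signaled = 0;   /* set once B is allowed to proceed */
static int resource = 0;   /* written by A, read by B */

static void must(int rc) { assert(rc == 0); (void)rc; }

static void ofix_first_signal(void) {
    pthread_mutex_lock(&L);
    signaled = 1;
    pthread_cond_broadcast(&con);
    pthread_mutex_unlock(&L);
}

static void ofix_first_wait(void) {
    pthread_mutex_lock(&L);
    while (!signaled) pthread_cond_wait(&con, &L);
    pthread_mutex_unlock(&L);
}

static void *a_thread(void *arg) {
    int do_init = *(int *)arg;
    if (do_init) {
        resource = 42;         /* A: initialization */
        ofix_first_signal();   /* basic enforcement: signal right after A */
    }
    ofix_first_signal();       /* safety net: A can no longer execute here */
    return NULL;
}

static void *b_thread(void *out) {
    ofix_first_wait();         /* inserted before B */
    *(int *)out = resource;    /* B: read */
    return NULL;
}

/* Returns what B read; do_init selects whether A executes at all. */
int run_first_a_before_b(int do_init) {
    pthread_t a, b;
    int seen = -1;
    signaled = 0;
    resource = 0;
    must(pthread_create(&b, NULL, b_thread, &seen));
    must(pthread_create(&a, NULL, a_thread, &do_init));
    must(pthread_join(a, NULL));
    must(pthread_join(b, NULL));
    return seen;
}
```

When A executes, B reads the initialized value; when A is skipped, the safety-net signal still releases B instead of leaving it blocked forever.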
CFix: patch testing & selection
Challenge:
§ Multi-thread software testing
Solution:
§ CFix-patch oriented testing
Patch testing principles
§ Two ideas:
  § No exhaustive testing, but patch-oriented testing
  § Leverage existing techniques, with extra heuristics
§ The workflow
  § Step 1: prune incorrect patches
    • Patches causing failures due to wrong fix strategies, etc.
  § Step 2: prune slow patches
  § Step 3: prune complicated patches
Run once without external perturbation
§ Reject the patch if there is a time-out or failure
§ Catches patches that fix the wrong root cause
  § Such patches make the software fail deterministically
  Thread 1: ptr->field = 1;
  Thread 2: ptr = NULL;
Implicit bad patch
§ A failure in patch_b implies a failure in patch_a
  § If patch_a is less restrictive than patch_b
§ Helpful to prune patch_a
  § Traditional testing may not find the failure in patch_a
[Figure: patches a, b, and c under mutual exclusion and order relationships]
CFix: patch merging
Challenge:
§ One single programming mistake usually leads to multiple bug reports
Solution:
§ Heuristics to merge patches
An example with multiple reports
  void buf_write() {
    int tmp = buf_len + str_len;       // p1
    if (tmp > MAX) return;
    memcpy(buf[buf_len], str_len);     // c1, p2
    buf_len = tmp;                     // r1, c2, r2
  }
§ Too many lock/unlock operations
§ Potential new deadlocks
§ May hurt performance and simplicity
Related patch: a case of AFix
§ Merge if p, c, or r is in some other patch’s critical sections
§ Before merging: lock(L1) p1 lock(L2) p2 c1 unlock(L1) c2 unlock(L2); lock(L1) r1 unlock(L1) lock(L2) r2 unlock(L2)
§ After merging: lock(L1) p1 p2 c1 c2 unlock(L1); lock(L1) r1 r2 unlock(L1)
The merged patch for the example
  void buf_write() {
    int tmp = buf_len + str_len;       // p1
    if (tmp > MAX) {
      return;
    }
    memcpy(buf[buf_len], str_len);     // c1, p2
    buf_len = tmp;                     // c2, r1, r2
  }
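One way to picture the merged patch is a single lock that covers the whole p1-to-c2 region, released on both exit paths. The sketch below is an assumption-laden illustration, not CFix's output: the lock name `merged_lock`, the `str` parameter (the slide's `memcpy` call omits the source argument), the buffer size, and the driver `run_buf_write_demo` are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX 64
#define N_WRITERS 8

static char buf[MAX];
static int buf_len = 0;
/* Hypothetical single lock standing in for the merged critical section. */
static pthread_mutex_t merged_lock = PTHREAD_MUTEX_INITIALIZER;

void buf_write(const char *str, int str_len) {
    pthread_mutex_lock(&merged_lock);
    int tmp = buf_len + str_len;                  /* p1 */
    if (tmp > MAX) {
        pthread_mutex_unlock(&merged_lock);       /* unlock on the early return */
        return;
    }
    memcpy(&buf[buf_len], str, (size_t)str_len);  /* c1, p2 */
    buf_len = tmp;                                /* c2, r1, r2 */
    pthread_mutex_unlock(&merged_lock);
}

static void *writer(void *arg) {
    (void)arg;
    buf_write("abcd", 4);
    buf_write("efgh", 4);
    return NULL;
}

/* Eight writers append 8 bytes each; returns the final buf_len, or -1. */
int run_buf_write_demo(void) {
    pthread_t t[N_WRITERS];
    buf_len = 0;
    for (int i = 0; i < N_WRITERS; i++) {
        if (pthread_create(&t[i], NULL, writer, NULL) != 0) {
            for (int j = 0; j < i; j++) pthread_join(t[j], NULL);
            return -1;
        }
    }
    for (int i = 0; i < N_WRITERS; i++) pthread_join(t[i], NULL);
    buf_write("overflow", 8);   /* rejected: the buffer is already full */
    return buf_len;
}
```

With one lock there is a single lock/unlock pair per path instead of two nested pairs, which is the simplicity and deadlock argument the merging slides make.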
CFix: run-time support
§ To understand whether there is a deadlock underlying a timeout
§ Low overhead, and suitable for production runs
Evaluation methodology
§ Applications: PBZIP2, x264, FFT, HTTrack, Mozilla-1, transmission, ZSNES, Apache, MySQL-1, MySQL-2, Mozilla-2, Cherokee, Mozilla-3
§ Detectors: AV Detector, OV Detector, RA Detector, DU Detector
Evaluation result
[Table: one row per application (PBZIP2, x264, FFT, HTTrack, Mozilla-1, transmission, ZSNES, Apache, MySQL-1, MySQL-2, Mozilla-2, Cherokee, Mozilla-3); columns AV Detector, OV Detector, RA Detector, DU Detector, # of Ops. Cells are ✓ marks, with a single ✗ in the MySQL-2 row; # of Ops ranges from 2 to 9.]
Summary
§ Software reliability is critical
§ Fixing concurrency bugs is costly and error-prone
§ CFix uses some heuristics, with good results in practice
  § A combination of mutual exclusion and order enforcement
  § Use testing to select the best patch
  § Fix root cause without requiring detectors to report it
  § Small overhead and good simplicity
Questions? Thank you