
TxRace: Efficient Data Race Detection Using Commodity Hardware Transactional Memory
Tong Zhang, Dongyoon Lee, Changhee Jung
Computer Science Department

Data Races in Multithreaded Programs
• Two threads access the same shared location (at least one access is a write)
• The accesses are not ordered by synchronization
Example (MySQL bug #3596): one thread executes p = NULL while another executes if (p) { fput(p, …) }. If the write lands between the check and the use, the program crashes.
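The two conditions above can be sketched as a predicate over a pair of memory accesses. This is a minimal illustration, not the detector used in the talk: the `Access` record, the vector-clock encoding of "ordered by synchronization", and the address `P` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    thread: int      # id of the accessing thread
    addr: int        # accessed memory location
    is_write: bool   # True for a write, False for a read
    clock: tuple     # vector clock at the access, one entry per thread


def happens_before(a, b):
    """True if a is ordered before b: a's clock is <= b's everywhere and differs."""
    return all(x <= y for x, y in zip(a.clock, b.clock)) and a.clock != b.clock


def is_data_race(a, b):
    """Same location, different threads, at least one write, no ordering."""
    return (a.thread != b.thread
            and a.addr == b.addr
            and (a.is_write or b.is_write)
            and not happens_before(a, b)
            and not happens_before(b, a))


P = 0x1000  # hypothetical address of the shared pointer p

# One thread sets p = NULL, another reads p in "if (p)", with no synchronization.
write_null = Access(thread=0, addr=P, is_write=True, clock=(1, 0))
read_p = Access(thread=1, addr=P, is_write=False, clock=(0, 1))

# The same read, but performed after synchronizing with the writer.
ordered_read = Access(thread=1, addr=P, is_write=False, clock=(1, 2))

racy = is_data_race(write_null, read_p)
not_racy = is_data_race(write_null, ordered_read)
```

Here `racy` is true because neither clock dominates the other, while the synchronized read is ordered after the write and is therefore not reported.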

Race Conditions Caused Severe Problems
• Northeast Blackout of 2003
  - 50+ million people lost power
  - Cost an estimated $6 billion
• Stock Price Mismatch in 2012
  - About 30 million shares' worth of trading was affected
  - Cost an estimated $13 million

State-of-the-Art Dynamic Data Race Detectors
Software-based solutions
• FastTrack [PLDI'09]
• Intel Inspector XE
• Google ThreadSanitizer
• ...
✔ Sound (no false negatives)
✔ Complete (no false positives)
✗ High overhead (10-100x)
Hardware-based solutions
• ReEnact [ISCA'03]
• CORD [HPCA'06]
• SigRace [ISCA'09]
• …
✔ Low overhead
✗ Custom hardware

Our Approach
• Hybrid SW + (existing) HW solution
• Leverage the data conflict detection mechanism of Hardware Transactional Memory (HTM) in commodity processors for lightweight data race detection
✔ Low overhead
✔ No custom hardware

Outline
• Motivation
• Background: Transactional Memory
• TxRace: Design and Implementation
• Experiments
• Conclusion

Transactional Memory (TM)
• Allows a group of instructions (a transaction) to execute atomically
Diagram: Thread 1 reads X inside a transaction while Thread 2 writes X. The data conflict aborts the transaction, which rolls back and reads X again before reaching Transaction end.
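The conflict-then-abort behavior can be mimicked with per-transaction read and write sets. The sketch below is a toy model under stated assumptions: the class names are hypothetical, and the "requester wins" policy (the thread making the new access survives) is chosen for simplicity rather than taken from any particular processor.

```python
class Transaction:
    def __init__(self, name):
        self.name = name
        self.read_set = set()
        self.write_set = set()
        self.aborted = False


class ToyTM:
    """Toy transactional memory with eager, requester-wins conflict detection."""

    def __init__(self):
        self.active = []

    def begin(self, name):
        tx = Transaction(name)
        self.active.append(tx)
        return tx

    def _abort_conflicting(self, requester, conflicts_with):
        # Abort every other live transaction that conflicts with this access.
        for other in self.active:
            if other is not requester and not other.aborted and conflicts_with(other):
                other.aborted = True       # rollback: speculative state is discarded
                other.read_set.clear()
                other.write_set.clear()

    def read(self, tx, addr):
        # A read conflicts with another transaction's write to the same address.
        self._abort_conflicting(tx, lambda o: addr in o.write_set)
        tx.read_set.add(addr)

    def write(self, tx, addr):
        # A write conflicts with another transaction's read or write.
        self._abort_conflicting(
            tx, lambda o: addr in o.read_set or addr in o.write_set)
        tx.write_set.add(addr)

    def end(self, tx):
        """Return True if the transaction committed, False if it was aborted."""
        self.active.remove(tx)
        return not tx.aborted


tm = ToyTM()
t1 = tm.begin("Thread 1")
t2 = tm.begin("Thread 2")
tm.read(t1, "X")    # Thread 1: Read(X)
tm.write(t2, "X")   # Thread 2: Write(X) conflicts with Thread 1's read
t1_committed = tm.end(t1)
t2_committed = tm.end(t2)
```

In this run Thread 1's transaction is the one rolled back; a real system would typically retry it or fall back to another execution path.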

Challenge 1: Unable to Pinpoint Racy Instructions
• When a transaction gets aborted, we know that there was a data conflict between transactions
  - Diagram: Thread 1 Read(X) → Abort; Thread 2 Write(X)
• However, we DO NOT know WHY and WHERE
  - e.g., which instruction? At which address? Which transaction caused the conflict?

Challenge 2: False Conflicts → False Positives
• HTM detects data conflicts at cache-line granularity → false positives
Diagram: Thread 1 executes Read(X) and Thread 2 executes Write(Y). X and Y share a cache line, so a transaction is falsely aborted even though there is no data race.
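The difference between a cache-line conflict and a real conflict can be shown with a few lines of address arithmetic. This sketch assumes 64-byte cache lines (the line size on Intel Haswell); the addresses `X` and `Y` are made up for the example.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def cache_line(addr):
    """Index of the cache line that contains addr."""
    return addr // CACHE_LINE_BYTES


def htm_conflict(addr_a, a_is_write, addr_b, b_is_write):
    """Conflict as the hardware sees it: same cache line, at least one write."""
    return cache_line(addr_a) == cache_line(addr_b) and (a_is_write or b_is_write)


def true_conflict(addr_a, a_is_write, addr_b, b_is_write):
    """Conflict at the level of program variables: same address, at least one write."""
    return addr_a == addr_b and (a_is_write or b_is_write)


X = 0x1000  # hypothetical address of X
Y = 0x1008  # hypothetical address of Y, 8 bytes later on the same line

# Thread 1: Read(X), Thread 2: Write(Y)
hw_sees_conflict = htm_conflict(X, False, Y, True)
really_conflicts = true_conflict(X, False, Y, True)
```

The hardware reports a conflict for this pair while the variable-level check does not, which is exactly the false positive that a precise second phase has to filter out.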

Challenge 3: Non-conflict Aborts
• Best-effort (non-ideal) HTM with limitations
  → Transactions may get aborted without data conflicts
  → False negatives (if ignored)
• “Capacity” abort: the accesses of a transaction (Read(X), Write(Y), Read(Z), ...) overflow the hardware buffer
• “Unknown” abort: a transaction (Read(X), Write(Y)) executes an I/O syscall()
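Software can tell these abort kinds apart from the status word that Intel RTM returns when a transaction aborts. The bit positions below follow the `_XABORT_*` constants in Intel's RTM intrinsics header; the `classify_abort` helper and its category names are an illustrative assumption, not the TxRace runtime.

```python
# Abort status bits as defined for Intel RTM (_XABORT_* in immintrin.h).
XABORT_EXPLICIT = 1 << 0   # aborted by an explicit xabort instruction
XABORT_RETRY = 1 << 1      # the transaction may succeed if retried
XABORT_CONFLICT = 1 << 2   # data conflict with another logical processor
XABORT_CAPACITY = 1 << 3   # internal buffer overflowed


def classify_abort(status):
    """Map an abort status word to a coarse category (hypothetical helper)."""
    if status & XABORT_CONFLICT:
        return "conflict"
    if status & XABORT_CAPACITY:
        return "capacity"
    if status & XABORT_EXPLICIT:
        return "explicit"
    # No informative bit set, e.g. after a system call or an interrupt.
    return "unknown"


examples = {
    "conflict with retry hint": classify_abort(XABORT_CONFLICT | XABORT_RETRY),
    "buffer overflow": classify_abort(XABORT_CAPACITY),
    "no bits set": classify_abort(0),
}
```

Only the first category indicates a possible race directly; the other two give no conflict information, which is why ignoring them would risk false negatives.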


TxRace: Two-phase Data Race Detection
Fast-path (HTM-based): Intel Haswell (RTM)
✔ Fast
✗ Unable to pinpoint races
✗ False sharing (false positives)
✗ Non-conflict aborts (false negatives)
→ Potential data races are handed from the fast-path to the slow-path
Slow-path (SW-based): Google ThreadSanitizer (TSan)
✔ Sound (no false negatives)
✔ Complete (no false positives)
✗ Slow

Compile-time Instrumentation
• Fast-path: convert sync-free regions into transactions
• Slow-path: add Google TSan checks
Diagram: in each thread, every sync-free region (the code between synchronization operations such as Lock() and Unlock(), e.g., X=1 and X=2) is wrapped in Transaction begin / Transaction end.
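The fast-path transformation can be sketched as a pass over a linear list of operations that wraps each maximal sync-free run in a transaction. This is a simplification for illustration: the real instrumentation works on compiler IR, and the operation names and the `SYNC_OPS` set here are hypothetical.

```python
SYNC_OPS = {"lock", "unlock"}  # hypothetical set of synchronization operations


def instrument(trace):
    """Wrap every maximal run of non-synchronization operations in a transaction."""
    out = []
    region = []

    def flush():
        # Emit the pending sync-free region, if any, as one transaction.
        if region:
            out.append("tx_begin")
            out.extend(region)
            out.append("tx_end")
            region.clear()

    for op in trace:
        if op in SYNC_OPS:
            flush()          # a synchronization operation ends the current region
            out.append(op)   # the operation itself stays outside any transaction
        else:
            region.append(op)
    flush()
    return out


instrumented = instrument(["lock", "X=1", "unlock", "X=2"])
```

For the trace above, `X=1` and `X=2` end up in two separate transactions, with `lock` and `unlock` left outside both.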

Fast-path HTM-based Detection
• Leverage HW-based data conflict detection in HTM
• Problem: on a conflict, one transaction gets aborted, but all others just proceed → the slow-path misses racy transactions
Diagram (Threads 1-3): of the two conflicting transactions (X=1 and X=2), only one is aborted; the other has already passed.

Fast-path HTM-based Detection
• Leverage HW-based data conflict detection in HTM
• Problem: on a conflict, one transaction gets aborted, but all others just proceed → cannot switch to slow-path
• Solution: abort in-flight transactions artificially
Diagram (Threads 1-3): every transaction reads a shared flag, R(TxFail). When the conflict between X=1 and X=2 aborts a transaction, W(TxFail) is performed, which aborts the remaining in-flight transactions so that all of them roll back.
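The artificial-abort idea can be simulated with read and write sets: because every transaction has read `TxFail`, a single write to it conflicts with all of them at once. The sketch below is a toy model; only the name `TxFail` comes from the slide, while the `Tx` class, the `write` helper, and the choice of which transaction loses the initial conflict are assumptions.

```python
TXFAIL = "TxFail"  # shared flag; the name is taken from the slide


class Tx:
    def __init__(self, thread):
        self.thread = thread
        self.read_set = {TXFAIL}   # R(TxFail): done by every transaction at its start
        self.write_set = set()
        self.aborted = False


def write(writer, addr, in_flight):
    """Write addr and abort every other live transaction that has addr in its sets.

    writer is None for a non-transactional write. Returns the aborted thread ids.
    """
    victims = []
    for tx in in_flight:
        if tx is writer or tx.aborted:
            continue
        if addr in tx.read_set or addr in tx.write_set:
            tx.aborted = True
            victims.append(tx.thread)
    if writer is not None:
        writer.write_set.add(addr)
    return victims


t1, t2, t3 = Tx(1), Tx(2), Tx(3)
in_flight = [t1, t2, t3]

write(t1, "X", in_flight)                       # Thread 1: X=1
first_abort = write(t2, "X", in_flight)         # Thread 2: X=2 conflicts with Thread 1

# The aborted thread's fallback code writes TxFail outside any transaction,
# which conflicts with the R(TxFail) of every transaction still in flight.
forced_aborts = write(None, TXFAIL, in_flight)
```

After the second step only one transaction has been aborted; the write to `TxFail` then takes down the other two, so every thread can re-execute its region under the software detector.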

Slow-path SW-based Detection
• Use SW-based sound and complete data race detection
  - Pinpoint racy instructions
  - Filter out false positives (due to false sharing)
  - Handle non-conflict (e.g., capacity) aborts conservatively
Diagram (Threads 1-3): after the transactions containing X=1 and X=2 abort, the threads run under SW-based detection.

Implementation
Two-phase data race detection
• Fast-path: Intel's Haswell processor
• Slow-path: Google's ThreadSanitizer
Instrumentation
• LLVM compiler framework
• Compile-time & profile-guided optimizations
Evaluation
• PARSEC benchmark suite with simlarge input
• Apache web server with 300K requests from 20 clients
• 4 worker threads (4 hardware transactions)

Outline
• Motivation
• Background: Hardware Transactional Memory
• TxRace: Design and Implementation
• Experiments
  1) Performance
  2) Soundness (Detection capability)
  3) Cost-effectiveness
• Conclusion

1. Performance Overhead
Figure: runtime overhead (y-axis, 0 to 30) of TSan and TxRace for blackscholes, fluidanimate, swaptions, freqmine, vips, raytrace, ferret, x264, bodytrack, facesim, streamcluster, dedup, canneal, apache, and the geometric mean.
• Off-scale callouts: TSan 1195x, TxRace 63x (>10x reduction)
• Geometric mean: 11.68x (TSan) vs. 4.65x (TxRace)

2. Soundness (Detection Capability)
Figure: number of races detected (y-axis, 0 to 10) by TSan and TxRace, with false negatives marked, for blackscholes, fluidanimate, swaptions, freqmine, vips, raytrace, ferret, x264, bodytrack, facesim, streamcluster, dedup, canneal, and apache.
• Off-scale bar labels: 112, 79, 64, 64
• Recall: 0.95

False Negatives
• Due to non-overlapped transactions
Diagram: on the time axis, the transaction containing X=1 runs from Transaction begin to Transaction end before X=2 executes, so the two accesses never overlap.

False Negatives Case Study in vips
• Repeat the experiment to exploit different interleavings
Figure: number of detected data races (y-axis, 0 to 120) against number of iterations (x-axis, 1 to 7), with an "All detected" annotation.
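The effect of repeating runs can be sketched as a running union of the races reported by each run. The race identifiers below are invented for illustration and are not data from the vips experiment.

```python
def cumulative_races(runs):
    """Return, for each iteration, how many distinct races have been seen so far."""
    seen = set()
    totals = []
    for reported in runs:
        seen |= set(reported)   # races found in any earlier run stay counted
        totals.append(len(seen))
    return totals


# Hypothetical reports from three runs with different thread interleavings.
runs = [
    {"race_a", "race_b"},
    {"race_b", "race_c"},
    {"race_c"},
]
totals = cumulative_races(runs)
```

The count can only grow or stay flat, so it plateaus once every race reachable under the tested interleavings has appeared at least once.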

3. Cost-effectiveness Compared to Sampling
• TxRace vs. TSan with sampling
• TxRace's overhead is equivalent to naïve sampling at 25.5%
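Naïve sampling can be sketched as checking each memory access with a fixed probability and skipping the rest. This is an assumption about what the baseline does, written as a minimal illustration; the function name and the seeded generator are choices made for reproducibility of the example.

```python
import random


def sampled_accesses(accesses, rate, seed=0):
    """Keep each access independently with probability rate (naive sampling)."""
    rng = random.Random(seed)   # seeded so the example is reproducible
    return [a for a in accesses if rng.random() < rate]


accesses = list(range(1000))            # stand-ins for 1000 memory accesses
checked = sampled_accesses(accesses, 0.255)
checked_fraction = len(checked) / len(accesses)
```

Only the sampled accesses would be passed to the race detector, so a race is missed whenever either of its two accesses falls outside the sample.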

Recall compared to sampling
• TxRace: less overhead + high recall
• Callout: spend 25.5%, get 47.2%

Conclusion
TxRace
• HTM-based fast-path (most of the time)
• SW-based slow-path (on-demand)
TxRace vs. TSan
• Performance: 11.68x (TSan) -> 4.65x (TxRace)
• Soundness: recall 0.95
• Completeness

Q&A
Thank you!

Performance overhead
Figure: runtime overhead (y-axis, 0 to 12) for blackscholes, fluidanimate, swaptions, freqmine, vips, raytrace, ferret, x264, bodytrack, facesim, streamcluster, dedup, canneal, apache, and the geometric mean. Series: baseline, xbegin/xend.
• Transaction overhead is low (callout: 1.16)
• Callout: large number of short transactions

Performance overhead
Figure: same benchmarks and axis (runtime overhead, 0 to 12). Series: baseline, xbegin/xend, conflict aborts.
• Callout: 2.73

Performance overhead
Figure: same benchmarks and axis (runtime overhead, 0 to 12). Series: baseline, xbegin/xend, conflict aborts, capacity aborts.

Performance overhead
Figure: same benchmarks and axis (runtime overhead, 0 to 12). Series: baseline, xbegin/xend, conflict aborts, capacity aborts, unknown aborts.
• Off-scale callouts: 63.3x, 31.6x

False Negative
• Transactions finish and escape before being artificially aborted
Diagram (T1-T3): the transactions perform R(TxFail); X=1 and X=2 conflict and cause an abort, but by the time W(TxFail) is issued another transaction has already finished.