Scheduling Memory Transactions

Synchronization alternatives: Transactional Memory
• A (memory) transaction is a sequence of memory reads and writes executed by a single thread that either commits or aborts
• If a transaction commits, all its reads and writes appear to have executed atomically
• If a transaction aborts, none of its operations take effect
• A transaction's operations are not visible to other threads until it commits (if it does)
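
The commit/abort semantics above can be illustrated with a small sketch. It is not taken from the talk: the `Transaction` class, its private write buffer, and the single global commit lock are hypothetical simplifications, and conflict detection between concurrent transactions is left out entirely. The point is only that buffered writes become visible at commit and vanish on abort.

```python
import threading

# Hypothetical shared memory: a dict guarded by one global commit lock.
memory = {"x": 0, "y": 0}
_commit_lock = threading.Lock()


class Transaction:
    """Toy transaction that buffers writes until commit (illustration only)."""

    def __init__(self):
        self.writes = {}   # tentative writes, private to this transaction

    def read(self, key):
        # A transaction sees its own tentative writes first.
        return self.writes.get(key, memory[key])

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # All buffered writes are published in one step.
        with _commit_lock:
            memory.update(self.writes)

    def abort(self):
        # Dropping the buffer means none of the operations take effect.
        self.writes.clear()


t1 = Transaction()
t1.write("x", 1)
t1.write("y", 1)
visible_before_commit = dict(memory)   # other threads still see the old values
t1.commit()

t2 = Transaction()
t2.write("x", 99)
t2.abort()
```

After this runs, `memory` holds the values written by `t1` only; the write made by `t2` never became visible.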

Transactional Memory Implementations
Hardware Transactional Memory
• Transactional Memory [Herlihy & Moss, '93]
• Transactional Memory Coherence and Consistency [Hammond et al., '04]
• Unbounded transactional memory [Ananian, Asanovic, Kuszmaul, Leiserson, Lie, '05]
• …
Software Transactional Memory
• Software Transactional Memory [Shavit & Touitou, '97]
• DSTM [Herlihy, Luchangco, Moir, Scherer, '03]
• RSTM [Marathe et al., '06]
• WSTM [Harris & Fraser, '03], OSTM [Fraser, '04], ASTM [Marathe, Scherer, Scott, '05], SXM [Herlihy]
• …

“Conventional” STM system high-level structure
[Diagram: OS-scheduler-controlled threads run on top of the TM system. Contention detection in the TM system asks the contention manager to arbitrate; the outcome is proceed, abort/retry, or wait.]
Contention managers: Passive, Aggressive, Polite, Karma, Greedy, Polka
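
The "arbitrate" step can be pictured as a function that receives the two conflicting transactions and returns one of the outcomes named on the slide. The sketch below is an assumption-laden illustration, not code from any of the STM systems cited: `Decision`, `Tx`, and the three policy functions are hypothetical names, and the policies are simplified renderings of the Aggressive, Polite, and Greedy ideas (Passive, Karma, and Polka are not shown).

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ABORT_OTHER = "abort_other"
    ABORT_SELF = "abort_self"
    WAIT = "wait"


@dataclass
class Tx:
    start_time: int     # logical timestamp; smaller means older
    attempts: int = 0   # how many times this conflict was already retried


def aggressive(me: Tx, other: Tx) -> Decision:
    # Always abort the conflicting transaction.
    return Decision.ABORT_OTHER


def polite(me: Tx, other: Tx, max_backoffs: int = 4) -> Decision:
    # Wait a bounded number of times (the growing back-off delay is not
    # modeled here), then abort the other transaction.
    if me.attempts < max_backoffs:
        return Decision.WAIT
    return Decision.ABORT_OTHER


def timestamp_based(me: Tx, other: Tx) -> Decision:
    # Simplified Greedy-style rule: the older transaction wins.
    if me.start_time < other.start_time:
        return Decision.ABORT_OTHER
    return Decision.WAIT
```

All three policies decide locally, at the moment of conflict, and none of them controls when the loser runs again.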

Talk outline
• Preliminaries
• Memory Transactions Scheduling: Rationale
• CAR-STM
• Adaptive TM Schedulers
• TM-scheduling OS support

Conventional conflict resolution policies are often insufficient
• Loser resumes execution after a pre-determined waiting period
  - May resume execution too early
  - May resume execution too late
• Repeated collisions occur under high contention
  - Livelocks
  - Performance may become worse than with a single lock
Scheduling-based CM to the rescue.

TM schedulers: rationale
• Transactional threads are controlled by a TM-aware scheduler
  - Kernel-level, user-level
• Richer “tool-box” for reducing and/or preventing transaction conflicts
Improve performance under high contention

The first TM schedulers
• “Adaptive Transaction Scheduling for transactional memory systems”, Yoo & Lee, SPAA '08
• “CAR-STM: Scheduling-based collision avoidance and resolution for software transactional memory”, Dolev, Hendler & Suissa, PODC '08
• “Steal-on-abort: dynamic transaction reordering to reduce conflicts in transactional memory”, Ansari, Jarvis, Kirkham, Kotselidis, Lujan and Watson, HiPEAC '09

Our work
• “CAR-STM: Scheduling-based collision avoidance and resolution for software transactional memory” [Dolev, Hendler & Suissa, PODC '08]
• “On the impact of Serializing Contention Management on STM performance” [Heber, Hendler & Suissa, OPODIS '09]
• “Scheduling support for transactional memory contention management” [Fedorova, Felber, Hendler, Lawall, Maldonado, Marlier, Muller & Suissa, PPoPP '10]

CAR-STM (Collision Avoidance and Resolution for STM) Design Goals
• Limit parallelism to a single transaction per core (or hardware thread)
• Serialize conflicting transactions
• Contention avoidance

CAR-STM high-level architecture
[Diagram. Components: transaction thread, T-Info, dispatcher, collision avoider, TQ threads, serializing contention manager, transaction queues #1 … #k, cores #1 … #k.]

TQ-Entry Structure
[Diagram: the CAR-STM architecture with one transaction-queue entry expanded. Entry fields: wrapper method, transaction data, T-Info, transaction thread, lock and condition variable.]

Transaction dispatching process
• Enqueue the transaction in the most-conflicting queue
• Put the thread to sleep and notify the TQ thread
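
The dispatching steps can be sketched roughly as follows. This is a toy model under stated assumptions, not the CAR-STM implementation: `Dispatcher`, `TQEntry`, and the conflict score are hypothetical, T-Info is assumed to be a set of keys the transaction may touch, and a `threading.Event` stands in for the lock and condition variable of a TQ entry.

```python
import queue
import threading


class TQEntry:
    """One queued transaction: the work to run plus a wake-up signal."""

    def __init__(self, body, t_info):
        self.body = body                    # wrapper around the transaction code
        self.t_info = t_info                # assumed hint: keys it may touch
        self.result = None
        self.finished = threading.Event()   # stands in for lock + condition var


class Dispatcher:
    def __init__(self, num_cores):
        self.queues = [queue.Queue() for _ in range(num_cores)]
        self.pending = [[] for _ in range(num_cores)]   # T-Info of queued work
        self.lock = threading.RLock()
        for i in range(num_cores):
            threading.Thread(target=self._tq_thread, args=(i,), daemon=True).start()

    def pick_queue(self, t_info):
        # Assumed collision-avoider heuristic: per queue, count the queued
        # transactions whose key hints overlap with the new transaction's.
        with self.lock:
            scores = [sum(1 for other in p if other & t_info) for p in self.pending]
        return scores.index(max(scores))    # most-conflicting queue

    def run(self, body, t_info):
        entry = TQEntry(body, frozenset(t_info))
        with self.lock:
            i = self.pick_queue(entry.t_info)
            self.pending[i].append(entry.t_info)
        self.queues[i].put(entry)           # notifies the TQ thread
        entry.finished.wait()               # the calling thread sleeps
        return i, entry.result

    def _tq_thread(self, i):
        while True:
            entry = self.queues[i].get()
            entry.result = entry.body()     # one queue runs its work serially
            with self.lock:
                self.pending[i].remove(entry.t_info)
            entry.finished.set()
```

Placing likely-conflicting transactions in the same queue means they execute one after another on the same core instead of colliding on different cores.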

Transaction execution
[Diagram: the TQ thread of core #i takes an entry from transaction queue #i. Entry fields: wrapper method, transaction data, T-Info, transaction thread, lock and condition variable.]

Serializing Contention Managers
• When two transactions collide, fail the newer transaction and move it to the TQ of the older transaction
• Fast elimination of livelock scenarios
• Two SCMs implemented
  - Basic (BSCM): move the failed transaction to the end of the other transaction's TQ
  - Permanent (PSCM): make the failed transaction a subordinate transaction of the other transaction
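
The difference between the two SCMs can be sketched on plain data structures. The names below (`Txn`, `home_queue`, `bscm`, `pscm`) are hypothetical and threading is left out; the sketch only shows where the losing transaction ends up. For PSCM it also assumes that a loser's existing subordinates follow it to the winner.

```python
from collections import deque


class Txn:
    def __init__(self, name, timestamp):
        self.name = name
        self.timestamp = timestamp   # smaller means older
        self.subordinates = []       # used only by the PSCM sketch


def resolve(a, b):
    """Return (winner, loser): the older transaction wins."""
    return (a, b) if a.timestamp < b.timestamp else (b, a)


def bscm(a, b, home_queue):
    # Basic SCM: the loser goes to the end of the queue that serves the winner.
    winner, loser = resolve(a, b)
    home_queue[winner.name].append(loser)
    return winner


def pscm(a, b):
    # Permanent SCM: the loser and its subordinates become subordinates
    # of the winner, so they stay serialized behind it.
    winner, loser = resolve(a, b)
    winner.subordinates.append(loser)
    winner.subordinates.extend(loser.subordinates)
    loser.subordinates = []
    return winner
```

With BSCM the serialization lasts for one retry; with PSCM the relationship persists, which is the source of the transitive serialization listed later among the shortcomings.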

PSCM
Transactions a and b collide; b is older.
[Diagram: TQ threads on cores #1 … #k with transaction queues #1 … #k holding transactions Ta, Tb, Tc, Td, Te.]

PSCM
[Diagram: the same queues after the collision between Ta and Tb has been resolved.]
The losing transaction and its subordinates are made subordinates of the winning transaction.

Execution time: STMBench7 R/W-dominated workloads

Throughput: STMBench7 R/W-dominated workloads

CAR-STM Shortcomings
• May restrict parallelism too much
  - At most a single transactional thread per core/hardware thread
  - Transitive serialization
• High overhead
• Non-adaptive

Talk outline
• Preliminaries
• Memory Transactions Scheduling: Rationale
• CAR-STM
• Adaptive TM Scheduling
• TM-scheduling OS support

“On the impact of Serializing Contention Management on STM performance”
• CBench: a synthetic benchmark generating workloads with pre-determined length and abort probability
• A low-overhead serialization mechanism
• Better understanding of adaptive serialization algorithms

A Low Overhead Serialization Mechanism (LO-SER)
[Diagram: transactional threads and their condition variables.]

A Low Overhead Serialization Mechanism (cont'd)
1) t identifies a collision with t'
2) t calls the contention manager: ABORT_OTHER
3) t changes the status of t' to ABORT (and writes that t is the winner)
4) t' identifies that it was aborted

A Low Overhead Serialization Mechanism (cont'd)
5) t' rolls back its transaction and goes to sleep on the condition variable of t
6) Eventually t commits and broadcasts on its condition variable…

A Low Overhead Serialization Mechanism (cont'd)
[Diagram: threads t' and t.]

Requirements for serialization mechanism
• A commit broadcasts only if the transaction won a collision since its last broadcast (or since the start of execution)
• No waiting cycles (deadlock-freedom)
• Avoid race conditions
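
Steps 1 to 6 and the first requirement can be sketched with Python condition variables. This is a hedged illustration, not the LO-SER algorithm itself: `TxThread` and its fields are hypothetical, rollback is not modeled, and the commit counter used to avoid a lost wake-up is an assumption of this sketch. Cycle avoidance is not addressed here.

```python
import threading


class TxThread:
    """Per-thread state for the sketch; all field names are hypothetical."""

    def __init__(self, name):
        self.name = name
        self.cond = threading.Condition()
        self.commits = 0         # commit counter, guarded by self.cond
        self.won = False         # won a collision since the last broadcast?
        self.broadcasts = 0
        self.aborted_by = None   # (winner, winner's commit count at abort time)

    def abort_other(self, loser):
        # Steps 2-3: mark the loser aborted and record who the winner is.
        with self.cond:
            self.won = True
            ticket = self.commits
        loser.aborted_by = (self, ticket)

    def wait_for_winner(self):
        # Step 5: sleep on the winner's condition variable until it has
        # committed at least once after the collision.
        winner, ticket = self.aborted_by
        with winner.cond:
            winner.cond.wait_for(lambda: winner.commits > ticket)

    def commit(self):
        # Step 6: broadcast only if a collision was won since the last broadcast.
        with self.cond:
            self.commits += 1
            if self.won:
                self.won = False
                self.broadcasts += 1
                self.cond.notify_all()
```

Because the loser waits on a predicate rather than on the bare notification, it returns immediately if the winner has already committed, and a winner that never collided pays nothing beyond incrementing a counter.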

LO-SER algorithm: data structures

LO-SER algorithm: pseudo-code

LO-SER algorithm: pseudo-code (cont'd)

LO-SER algorithm: pseudo-code (cont'd)

Adaptive algorithms
• Collect (local or global) statistics on the contention level
• Apply serialization only when contention is high; otherwise, apply a “conventional” contention-management algorithm
• We find that stabilized adaptive algorithms perform better
First adaptive TM scheduler: “Adaptive transaction scheduling for transactional memory systems” [Yoo & Lee, SPAA '08]
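
One way to picture an adaptive policy is a per-thread moving average of abort outcomes compared against a threshold. The sketch below is an assumption, not the algorithm evaluated in the talk: the exponential moving average, the parameter values, and the reading of "stabilized" as "switch mode only after several consecutive samples agree" are all choices made for this illustration.

```python
class AdaptiveSerializer:
    """Decides whether a thread should serialize after an abort (sketch)."""

    def __init__(self, alpha=0.5, threshold=0.5, stable_rounds=3):
        self.alpha = alpha                   # weight of history in the average
        self.threshold = threshold
        self.stable_rounds = stable_rounds   # assumed stabilization window
        self.contention = 0.0                # moving average of abort outcomes
        self.serializing = False
        self._streak = 0                     # consecutive votes to switch mode

    def record(self, aborted):
        """Record one transaction outcome and return the current mode."""
        sample = 1.0 if aborted else 0.0
        self.contention = self.alpha * self.contention + (1 - self.alpha) * sample
        wants_serializing = self.contention > self.threshold
        if wants_serializing != self.serializing:
            self._streak += 1
            if self._streak >= self.stable_rounds:
                self.serializing = wants_serializing
                self._streak = 0
        else:
            self._streak = 0
        return self.serializing
```

The streak counter keeps the policy from flipping between serialization and conventional contention management on every short burst of aborts.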

CBench Evaluation
• Always serializing incurs no overhead in the lack of contention
• CAR-STM incurs high overhead as compared with other algorithms
• Always serializing is bad in medium contention
• Always serializing is best in high contention

CBench Evaluation Adaptive serialization fares well for all contention levels

CBench Evaluation Conventional CM performance degrades for high contention

CBench Evaluation (cont'd) CAR-STM has best efficiency but worst throughput

RandomGraph Evaluation
• Stabilized algorithm improves throughput by up to 30%
• Throughput and efficiency of conventional algorithms are bad

Talk outline
• Preliminaries
• Memory Transactions Scheduling: Rationale
• CAR-STM
• Adaptive TM Schedulers
• TM-scheduling OS support

“Scheduling Support for Transactional Memory Contention Management”
• Implement CM scheduling support in the kernel scheduler (Linux & OpenSolaris)
  - (Strict) serialization
  - Soft serialization
  - Time-slice extension
• Different mechanisms for communication between the user-level STM library and the kernel scheduler

TM Library / Kernel Communication via Shared Memory Segment (Ser-k)
• User code notifies the kernel of events such as transaction start, commit, and abort (in which case the thread yields)
• Kernel code handles moving the thread between the ready and blocked queues

Soft Serialization
• Instead of blocking, reduce the loser thread's priority and yield
• Efficient in scenarios where loser transactions may take a different execution path when retrying (non-determinism)
• Priority should be restored upon commit or when the conflicting transactions terminate
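
The priority bookkeeping behind soft serialization can be simulated in user space. This sketch does not touch real scheduler priorities and is not the kernel code from the paper: `SoftSerializer`, the fixed `penalty`, and the "higher number runs first" convention are assumptions made for illustration.

```python
class SoftSerializer:
    """Tracks lowered priorities for losing threads (simulation only)."""

    def __init__(self, penalty=5):
        self.penalty = penalty
        self.priority = {}     # thread id -> current priority (higher runs first)
        self.base = {}         # thread id -> original priority
        self.waiting_on = {}   # loser id -> winner id

    def register(self, tid, priority):
        self.priority[tid] = self.base[tid] = priority

    def on_conflict(self, loser, winner):
        # Instead of blocking the loser, lower its priority; it would then yield.
        self.priority[loser] = self.base[loser] - self.penalty
        self.waiting_on[loser] = winner

    def on_finish(self, tid):
        # Restore the thread's own priority when its transaction commits ...
        self.priority[tid] = self.base[tid]
        self.waiting_on.pop(tid, None)
        # ... and that of losers whose conflicting transaction has terminated.
        for loser, winner in list(self.waiting_on.items()):
            if winner == tid:
                self.priority[loser] = self.base[loser]
                del self.waiting_on[loser]

    def next_to_run(self):
        return max(self.priority, key=self.priority.get)
```

The loser stays runnable throughout, so if no higher-priority thread needs the core it can retry at once, possibly along a path that no longer conflicts.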

Time-slice extension
• Preemption in the midst of a transaction increases the conflict “window of vulnerability”
• Defer preemption of transactional threads
  - Avoid CPU monopolization by bounding the number of extensions and yielding after commit
• May be combined with serialization/soft serialization
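
The bounded-extension rule can be written as two small decision functions. This is a user-space sketch of the policy only; `ThreadState`, `on_timer_tick`, `on_commit`, and the bound `MAX_EXTENSIONS` are hypothetical, and the talk does not state an actual bound.

```python
MAX_EXTENSIONS = 2   # assumed bound; the talk does not give a number


class ThreadState:
    def __init__(self):
        self.in_transaction = False
        self.extensions_used = 0
        self.must_yield_after_commit = False


def on_timer_tick(t: ThreadState) -> str:
    """Decide what to do when the thread's time slice expires."""
    if t.in_transaction and t.extensions_used < MAX_EXTENSIONS:
        t.extensions_used += 1
        t.must_yield_after_commit = True   # pay back the borrowed time later
        return "extend"
    return "preempt"


def on_commit(t: ThreadState) -> str:
    """Decide what to do when the thread's transaction commits."""
    t.in_transaction = False
    t.extensions_used = 0
    if t.must_yield_after_commit:
        t.must_yield_after_commit = False
        return "yield"
    return "continue"
```

Bounding the extensions and yielding after commit keeps a thread from monopolizing the CPU simply by staying inside transactions.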

Evaluation (STMBench7, 16-core machine)
• Conventional CM deteriorates when threads > cores
• Serializing by local spinning is efficient as long as threads ≤ cores

Evaluation - STMBench7 throughput
• Serializing by sleeping on a condition variable is best when threads > cores, since system call overhead is negligible (long transactions)

Evaluation - STMBench7 aborts data

Conclusions
• Scheduling-based CM results in
  - Improved throughput in high contention
  - Improved efficiency in all contention levels
  - LO-SER-based serialization incurs no visible overhead
• Lightweight kernel support can improve performance and efficiency
• Dynamically selecting the best CM algorithm for the workload at hand is a challenging research direction

Thank you. Any questions?
