Transactional Locking Nir Shavit Tel Aviv University Joint work with Dave Dice and Ori Shalev

Concurrent Programming How do we make the programmer’s life simple without slowing computation down to a halt?! [Figure: an object in shared memory]

A FIFO Queue [Figure: queue a, b, c, d between Head and Tail; Dequeue() => a; Enqueue(d)]

A Concurrent FIFO Queue synchronized{} [Figure: a single object lock guards the queue a, b, c, d between Head and Tail; P: Dequeue() => a; Q: Enqueue(d)]
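A minimal sketch of the single-lock queue on this slide, written in Python rather than with Java's `synchronized`. The class name `CoarseQueue` and its `snapshot` helper are hypothetical, chosen only for illustration. Every operation takes the same lock, so P and Q never run inside the queue at the same time.

```python
import threading
from collections import deque


class CoarseQueue:
    """FIFO queue guarded by one object-wide lock (coarse-grained)."""

    def __init__(self, items=()):
        self._lock = threading.Lock()
        self._items = deque(items)

    def enqueue(self, value):
        with self._lock:  # every caller serializes on this lock
            self._items.append(value)

    def dequeue(self):
        with self._lock:
            return self._items.popleft() if self._items else None

    def snapshot(self):
        with self._lock:
            return list(self._items)


queue = CoarseQueue("abc")
results = []
p = threading.Thread(target=lambda: results.append(queue.dequeue()))  # P: Dequeue()
q = threading.Thread(target=queue.enqueue, args=("d",))               # Q: Enqueue(d)
p.start()
q.start()
p.join()
q.join()
```

Whichever thread wins the lock first, P receives `a` and the queue ends up holding `b`, `c`, `d`; the cost is that the two operations can never overlap.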

Fine Grain Locks Better Performance, More Complex Code [Figure: queue a, b, c, d between Head and Tail; P: Dequeue() => a; Q: Enqueue(d)] Worry about deadlock, livelock…
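One well-known way to make the queue fine-grained is the two-lock design of Michael and Scott: a dummy node separates Head from Tail, so a dequeuer and an enqueuer take different locks. The slide does not say which fine-grained scheme it depicts, so treat this as an illustrative sketch; the names `TwoLockQueue` and `_Node` are hypothetical.

```python
import threading


class _Node:
    def __init__(self, value=None):
        self.value = value
        self.next = None


class TwoLockQueue:
    """FIFO queue with separate head and tail locks and a dummy head node."""

    def __init__(self):
        dummy = _Node()
        self._head = dummy  # always points at the current dummy node
        self._tail = dummy
        self._head_lock = threading.Lock()
        self._tail_lock = threading.Lock()

    def enqueue(self, value):
        node = _Node(value)
        with self._tail_lock:  # enqueuers contend only with each other
            self._tail.next = node
            self._tail = node

    def dequeue(self):
        with self._head_lock:  # dequeuers contend only with each other
            first = self._head.next
            if first is None:
                return None  # queue is empty
            value = first.value
            first.value = None  # 'first' becomes the new dummy node
            self._head = first
            return value


queue = TwoLockQueue()
for item in "abc":
    queue.enqueue(item)
first = queue.dequeue()
```

The extra bookkeeping (dummy node, two locks, reasoning about the empty case) is the "more complex code" the slide warns about.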

Lock-Free (JSR-166) Even Better Performance, Even More Complex Code [Figure: queue a, b, c, d between Head and Tail; P: Dequeue() => a; Q: Enqueue(d)] Worry about deadlock, livelock, subtle bugs, hard to modify…

Transactional Memory [Herlihy-Moss] Great Performance, Simple Code [Figure: queue a, b, c, d between Head and Tail; P: Dequeue() => a; Q: Enqueue(d)] Don’t worry about deadlock, livelock, subtle bugs, etc…

TM: How Does It Work [Figure: synchronized{ <sequence of instructions> } becomes atomic{ <sequence of instructions> }] Execute all synchronized instructions as an atomic transaction… Simplicity of a global lock with the granularity of a fine-grained implementation

Hardware TM [Herlihy-Moss] • Limitations: atomic{<~10-20-30? …but not ~1000 instructions>} • Machines will differ in their support • When we build 1000-instruction transactions, it will not be for free…

Software Transactional Memory • Implement transactions in Software • All the flexibility of hardware…today • Ability to extend hardware when it is available (Hybrid TM) • But there are problems: – Performance? – Ease of programming (software engineering)? – Mechanical code transformation?

The Brief History of STM [Timeline figure; years marked: 1993, 1997, 2003, 2004, 2005, 2006; categories: Lock-free, Obstruction-free, Lock-based] STM (Shavit, Touitou) • Trans Support TM (Moir) • WSTM (Fraser, Harris) • DSTM (Herlihy et al) • OSTM (Fraser, Harris) • ASTM (Marathe et al) • T-Monitor (Jagannathan…) • Hybrid TM (Moir) • Meta Trans (Herlihy, Shavit) • McTM (Saha et al) • Lock-OSTM (Ennals) • TL (Dice, Shavit) • AtomJava (Hindman…)

As Good As Fine Grained Postulate (i.e., take it or leave it): If we could implement fine-grained locking with the same simplicity as coarse-grained, we would never think of building a transactional memory. Implication: Let’s try to provide TMs that get as close as possible to hand-crafted fine-grained locking.

Premise of Lock-based STMs 1. Memory Lifecycle: work with GC or any malloc/free 2. Transactification: allow mechanical transformation of sequential code 3. Performance: match fine-grained 4. Safety: work on coherent state. Unfortunately: Hybrid, Ennals, Saha, AtomJava deliver only 2 and 3 (in some cases)…

Transactional Locking • TL2 delivers all four properties • How? – Unlike all prior algorithms: use commit-time locking instead of encounter-order locking – Introduce a Version Clock mechanism for validation

TL Design Choices [Figure: application memory is mapped to an array of versioned write-locks, each holding a version number V#] PS = Lock per Stripe (separate array of locks) PO = Lock per Object (embedded in object)
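The two placements can be sketched as follows. Everything concrete here is an assumption made for illustration: the word layout (version in the upper bits, lock flag in the lowest bit), the 64-byte stripe size, the table size, and all names (`stripe_of`, `pack`, `TrackedObject`, and so on). The slides only say that each lock carries a version number.

```python
NUM_STRIPES = 1 << 10  # assumed size of the PS lock array (power of two)
STRIPE_SHIFT = 6       # assumed 64-byte stripes
LOCK_BIT = 1           # assumed layout: lowest bit is the lock flag


def stripe_of(address):
    """PS mode: map a memory address to a slot in the separate lock array."""
    return (address >> STRIPE_SHIFT) & (NUM_STRIPES - 1)


def pack(version, locked):
    """Build a versioned write-lock word from a version and a lock flag."""
    return (version << 1) | (LOCK_BIT if locked else 0)


def is_locked(word):
    return bool(word & LOCK_BIT)


def version_of(word):
    return word >> 1


class TrackedObject:
    """PO mode: the lock word is embedded in the object itself."""

    def __init__(self, payload):
        self.lock_word = pack(0, False)
        self.payload = payload


ps_locks = [pack(0, False)] * NUM_STRIPES  # PS mode: one word per stripe
word = pack(7, True)                        # version 7, currently locked
```

PS needs no change to object layout but lets unrelated addresses share a lock; PO avoids that sharing at the price of a lock word in every object.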

Encounter Order Locking (Undo Log) [Ennals, Hybrid, Saha, Harris, …]
[Figure: memory locations X and Y with their versioned locks as the transaction proceeds]
1. To Read: load lock + location
2. Check unlocked, add to Read-Set
3. To Write: lock location, store value
4. Add old value to undo-set
5. Validate read-set v#’s unchanged
6. Release each lock with v#+1
Quick read of values freshly written by the reading transaction
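The steps above can be sketched as a small single-threaded model. It is not a real STM: lock acquisition would need an atomic compare-and-swap, and the names `VLock`, `Abort`, and `EncTx` are hypothetical. The point it shows is that the new value reaches memory at write time, before any validation.

```python
class Abort(Exception):
    """Raised when the transaction must give up."""


class VLock:
    def __init__(self):
        self.version = 0
        self.owner = None  # None means unlocked


class EncTx:
    """Encounter-order locking: lock and write in place, keep an undo log."""

    def __init__(self, mem, locks):
        self.mem, self.locks = mem, locks
        self.read_set = {}  # addr -> version seen at first read
        self.undo = []      # (addr, old value) in write order

    def read(self, addr):
        lock = self.locks[addr]
        if lock.owner is self:
            return self.mem[addr]  # quick read of our own fresh write
        if lock.owner is not None:
            raise Abort(addr)      # locked by another transaction
        self.read_set.setdefault(addr, lock.version)
        return self.mem[addr]

    def write(self, addr, value):
        lock = self.locks[addr]
        if lock.owner is not self:
            if lock.owner is not None:
                raise Abort(addr)
            lock.owner = self      # a real STM would use CAS here
        self.undo.append((addr, self.mem[addr]))
        self.mem[addr] = value     # stored in place, before validation

    def rollback(self):
        for addr, old in reversed(self.undo):
            self.mem[addr] = old
        for addr in {a for a, _ in self.undo}:
            self.locks[addr].owner = None
        self.undo.clear()

    def commit(self):
        for addr, seen in self.read_set.items():
            lock = self.locks[addr]
            if lock.version != seen or lock.owner not in (None, self):
                self.rollback()
                raise Abort(addr)
        for addr in {a for a, _ in self.undo}:
            self.locks[addr].version += 1  # release with v#+1
            self.locks[addr].owner = None
        self.undo.clear()


mem = [10, 20]
locks = [VLock(), VLock()]
tx = EncTx(mem, locks)
tx.write(0, 11)
seen_in_place = mem[0]  # the uncommitted value is already in memory
tx.rollback()
```

Locks are held from the first write until commit or rollback, which is the long hold time that commit-time locking avoids.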

Commit Time Locking (Write Buff) [TL, TL2]
[Figure: memory locations X and Y with their versioned locks as the transaction proceeds]
1. To Read: load lock + location
2. Location in write-set? (Bloom Filter)
3. Check unlocked, add to Read-Set
4. To Write: add value to write set
5. Acquire Locks
6. Validate read/write v#’s unchanged
7. Release each lock with v#+1
Hold locks for very short duration
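The steps above can be sketched as a small single-threaded model, under several simplifying assumptions: a plain dict stands in for the Bloom-filter lookup, only the read set is validated, lock acquisition is not atomic, and the names `VLock`, `Abort`, and `ComTx` are hypothetical. Writes stay in a private buffer until commit, so memory is never touched before validation succeeds.

```python
class Abort(Exception):
    """Raised when the transaction must give up."""


class VLock:
    def __init__(self):
        self.version = 0
        self.owner = None  # None means unlocked


class ComTx:
    """Commit-time locking: buffer writes, lock only while committing."""

    def __init__(self, mem, locks):
        self.mem, self.locks = mem, locks
        self.read_set = {}   # addr -> version seen at first read
        self.write_set = {}  # addr -> buffered value

    def read(self, addr):
        if addr in self.write_set:  # stands in for the Bloom filter check
            return self.write_set[addr]
        lock = self.locks[addr]
        if lock.owner is not None:
            raise Abort(addr)
        self.read_set.setdefault(addr, lock.version)
        return self.mem[addr]

    def write(self, addr, value):
        self.write_set[addr] = value  # memory is untouched until commit

    def commit(self):
        acquired = []
        try:
            for addr in sorted(self.write_set):  # acquire locks
                lock = self.locks[addr]
                if lock.owner is not None:
                    raise Abort(addr)
                lock.owner = self  # a real STM would use CAS here
                acquired.append(lock)
            for addr, seen in self.read_set.items():  # validate
                lock = self.locks[addr]
                if lock.version != seen or lock.owner not in (None, self):
                    raise Abort(addr)
        except Abort:
            for lock in acquired:
                lock.owner = None  # nothing was written, just let go
            raise
        for addr, value in self.write_set.items():
            self.mem[addr] = value
            self.locks[addr].version += 1  # release with v#+1
            self.locks[addr].owner = None


mem = [10, 20]
locks = [VLock(), VLock()]
tx = ComTx(mem, locks)
tx.write(0, 11)
before_commit = mem[0]  # still the old value: the write is only buffered
own_read = tx.read(0)   # served from the write set
tx.commit()
```

An abort here needs no undo: the transaction simply drops its buffer and releases whatever locks it had taken.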

Why COM and not ENC? 1. Under low load they perform pretty much the same. 2. COM withstands high loads (small structures or high write %). ENC does not withstand high loads. 3. COM works seamlessly with Malloc/Free. ENC does not work with Malloc/Free.

COM vs. ENC High Load [Chart: Red-Black Tree, 20% Delete, 20% Update, 60% Lookup; series: Hand, COM, ENC, MCS]

COM vs. ENC Low Load [Chart: Red-Black Tree, 5% Delete, 5% Update, 90% Lookup; series: Hand, COM, ENC, MCS]

COM: Works with Malloc/Free [Figure: objects A and B mapped to the PS Lock Array; VALIDATE FAILS IF INCONSISTENT] To free B from transactional space: 1. Wait till its lock is free. 2. Free(B). B is never written inconsistently because any write is preceded by a validation while holding the lock.

ENC: Fails with Malloc/Free [Figure: objects A and B mapped to the PS Lock Array; VALIDATE] Cannot free B from transactional space because the undo log means locations are written after every lock acquisition and before validation. Possible solution: validate after every lock acquisition (yuck).

Problem: Application Safety 1. All current lock based STMs work on inconsistent states. 2. They must introduce validation into user code at fixed intervals or loops, use traps, OS support, … 3. And still there are cases, however rare, where an error could occur in user code…

Solution: TL2’s “Version Clock” • Have one shared global version clock • Incremented by (small subset of) writing transactions • Read by all transactions • Used to validate that state worked on is always consistent. Later: how we learned not to worry about contention and love the clock

Version Clock: Read-Only COM Trans
[Figure: memory locations, their versioned locks, and the VClock]
1. RV ← VClock
2. On Read: read lock, read mem, read lock: check unlocked, unchanged, and v# <= RV
3. Commit.
Reads form a snapshot of memory. No read set!
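A minimal sketch of the read-only path described above, with hypothetical names (`VLock`, `VClock`, `tl2_read`, `read_only_tx`). In this single-threaded model the two lock reads always agree; in a real implementation they are separate loads that bracket the memory read, and a mismatch means a writer interfered.

```python
class Abort(Exception):
    """Raised when a read cannot be part of a consistent snapshot."""


class VLock:
    def __init__(self, version=0, locked=False):
        self.version = version
        self.locked = locked


class VClock:
    def __init__(self, value=0):
        self.value = value


def tl2_read(mem, locks, addr, rv):
    lock = locks[addr]
    pre = (lock.locked, lock.version)   # read lock
    value = mem[addr]                   # read mem
    post = (lock.locked, lock.version)  # read lock again
    # Check unlocked, unchanged, and v# <= RV.
    if pre[0] or pre != post or pre[1] > rv:
        raise Abort(addr)
    return value


def read_only_tx(mem, locks, clock, addrs):
    rv = clock.value  # step 1: sample the global version clock
    # Step 2 for each address; step 3 (commit) needs no further work
    # because no read set was kept.
    return [tl2_read(mem, locks, addr, rv) for addr in addrs]


mem = ["x", "y", "z"]
locks = [VLock(87), VLock(34), VLock(88)]
clock = VClock(100)
snapshot = read_only_tx(mem, locks, clock, [0, 1, 2])
```

Because every accepted location has a version no newer than RV, the values returned all belong to the state that existed when the clock was sampled.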

Version Clock: Writing COM Trans
[Figure: memory locations X and Y with their versioned locks and the VClock, before and after commit]
1. RV ← VClock
2. On Read/Write: check unlocked and v# <= RV then add to Read/Write-Set
3. Acquire Locks
4. WV = F&I(VClock)
5. Validate each v# <= RV
6. Release locks with v# ← WV
Reads+Inc+Writes = Linearizable
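The six steps above can be sketched as follows. This is an illustrative model, not the TL2 implementation: lock acquisition is not atomic, the names `VLock`, `VClock`, `Tl2Tx`, and `increment_and_fetch` are hypothetical, and the sketch assumes WV is the clock value after the increment, so that every committed write carries a version newer than any RV sampled earlier.

```python
import threading


class Abort(Exception):
    """Raised when the transaction must give up."""


class VLock:
    def __init__(self, version=0):
        self.version = version
        self.owner = None  # None means unlocked


class VClock:
    def __init__(self, value=0):
        self.value = value
        self._guard = threading.Lock()  # stands in for an atomic instruction

    def increment_and_fetch(self):
        with self._guard:
            self.value += 1
            return self.value


class Tl2Tx:
    def __init__(self, mem, locks, clock):
        self.mem, self.locks, self.clock = mem, locks, clock
        self.rv = clock.value  # step 1: RV <- VClock
        self.read_set = set()
        self.write_set = {}

    def _check(self, addr):
        lock = self.locks[addr]
        if lock.owner is not None or lock.version > self.rv:
            raise Abort(addr)

    def read(self, addr):
        if addr in self.write_set:
            return self.write_set[addr]
        self._check(addr)  # step 2
        self.read_set.add(addr)
        return self.mem[addr]

    def write(self, addr, value):
        self._check(addr)  # step 2
        self.write_set[addr] = value

    def commit(self):
        acquired = []
        try:
            for addr in sorted(self.write_set):  # step 3
                lock = self.locks[addr]
                if lock.owner is not None:
                    raise Abort(addr)
                lock.owner = self  # a real STM would use CAS here
                acquired.append(lock)
            wv = self.clock.increment_and_fetch()  # step 4
            for addr in self.read_set:  # step 5
                lock = self.locks[addr]
                if lock.version > self.rv or lock.owner not in (None, self):
                    raise Abort(addr)
        except Abort:
            for lock in acquired:
                lock.owner = None
            raise
        for addr, value in self.write_set.items():  # step 6
            self.mem[addr] = value
            self.locks[addr].version = wv
            self.locks[addr].owner = None
        return wv


mem = [1, 2]
locks = [VLock(87), VLock(34)]
clock = VClock(100)
tx = Tl2Tx(mem, locks, clock)
tx.write(1, tx.read(0) + 41)
wv = tx.commit()
```

A transaction that sampled RV before this commit and later reads location 1 sees a version above its RV and aborts, which is how the clock keeps user code from ever running on an inconsistent state.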

Version Clock Implementation • On sys-on-chip like Sun T200™ Niagara: virtually no contention, just CAS and be happy • On others: add TID to VClock; if VClock has changed since last write, can use new value + TID. Reduces contention by a factor of N. • Future: Coherent Hardware VClock that guarantees unique tick per access.

Performance Benchmarks • Mechanically Transformed Sequential Red-Black Tree using TL2 • Compare to STMs and hand-crafted fine-grained Red-Black implementation • On a 16-way Sun Fire™ running Solaris™ 10

Uncontended Large Red-Black Tree [Chart: 5% Delete, 5% Update, 90% Lookup; series: Hand-crafted, TL/PO, TL2/PO, Ennals, TL/PS, TL2/PS, Fraser, Harris Lock-free]

Uncontended Small RB-Tree [Chart: 5% Delete, 5% Update, 90% Lookup; series: TL/PO, TL2/PO]

Contended Small RB-Tree [Chart: 30% Delete, 30% Update, 40% Lookup; series: TL/PO, TL2/PO, Ennals]

Speedup: Normalized Throughput [Chart: Large RB-Tree, 5% Delete, 5% Update, 90% Lookup; series: TL/PO, Hand-crafted]

Overhead • STM scalability is as good if not better than hand-crafted, but overheads are much higher • Overhead is the dominant performance factor – bodes well for HTM • Read set and validation cost (not locking cost) dominates performance

On Sun T200™ (Niagara): maybe a long way to go… [Chart: RB-tree, 5% Delete, 5% Update, 90% Lookup; series: Hand-crafted, STMs]

Conclusions • COM time locking, implemented efficiently, has clear advantages over ENC order locking: – No meltdown under contention – Working seamlessly with malloc/free • VCounter can guarantee safety so we – don’t need to embed repeated validation in user code

What Next? • Further improve performance • Make TL1 and TL2 library available • Mechanical code transformation tool… • Cut read-set and validation overhead, maybe with hardware support? • Add hardware VClock to Sys-on-chip.

Thank You