
Software Transactional Memory. Nir Shavit, Tel-Aviv University and Sun Labs. “Where Do We Come From? What Are We? Where Are We Going?”

Traditional Software Scaling. [Chart: speedup of user code over time on a traditional uniprocessor, rising 1.8x, 3.6x, 7x courtesy of Moore's law.]

Multicore Software Scaling. [Chart: the same 1.8x, 3.6x, 7x speedup curve, now expected from adding cores.] Unfortunately, it is not so simple…

Real-World Multicore Scaling. [Chart: actual speedups of 1.8x, 2x, 2.9x as cores are added.] Parallelization and synchronization require great care…

Why? Amdahl’s Law: Speedup = 1/(Parallel. Part/N + Sequential. Part) Pay for N = 8 cores Sequential. Part = 25% Effect of 25% becomes more accute as Speedup = only 2. 9 times! num of cores grows 2. 3/4, 2. 9/8, 3. 4/16, 3. 7/32……… 4/ ∞
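
To see where these numbers come from, here is the arithmetic as a few lines of Java (a worked check of the slide's figures; the class name is arbitrary):

    public class Amdahl {
        public static void main(String[] args) {
            double s = 0.25;  // sequential fraction: 25%
            for (int n : new int[]{4, 8, 16, 32}) {
                // Amdahl's law: speedup = 1 / (s + (1 - s) / n)
                System.out.printf("%2d cores: %.1fx%n", n, 1.0 / (s + (1 - s) / n));
            }
            System.out.printf(" infinity: %.1fx%n", 1.0 / s);  // limit as n grows
        }
    }

This prints 2.3x, 2.9x, 3.4x, 3.7x and 4.0x, matching the slide.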

Shared Data Structures. The reason we get only 2.9x speedup: 75% of the work is unshared and scales across the cores, while 25% goes through shared data structures. With coarse-grained locking the shared 25% serializes; fine-grained designs try to make it scale.

A FIFO Queue. [Diagram: a linked list a, b, c with Head and Tail pointers; Dequeue() => a at the Head, Enqueue(d) at the Tail.]

A Concurrent FIFO Queue. Coarse-grained: a single object lock protects the whole queue. Simple code, easy to prove correct, but P: Dequeue() => a and Q: Enqueue(d) serialize on that one lock: a contention point and a sequential bottleneck.
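
A minimal Java sketch of this coarse-grained design (illustrative, not the slides' code): one monitor guards both ends, so every operation serializes.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Coarse-grained FIFO queue: a single lock covers Head and Tail.
    // Trivial to verify, but all threads funnel through one monitor.
    public class CoarseQueue<T> {
        private final Deque<T> items = new ArrayDeque<>();

        public synchronized void enqueue(T x) { items.addLast(x); }

        public synchronized T dequeue() {
            return items.pollFirst();  // null on empty; real code might block instead
        }
    }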

Fine Grain Locks. Finer granularity, more complex code: separate Head and Tail locks let P: Dequeue() => a run in parallel with Q: Enqueue(d). A verification nightmare: worry about deadlock, livelock…

Fine Grain Locks. Complex boundary cases: the empty queue and the last item, where P: Dequeue() => a and Q: Enqueue(b) meet at the same node. Worry about how to acquire multiple locks. (A two-lock sketch that sidesteps these cases follows.)
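
A classic way out of these boundary cases is the two-lock queue of Michael and Scott, sketched here in Java (a simplification for illustration): a permanent dummy node keeps an enqueuer and a dequeuer from ever touching the same node, so each side needs only its own lock.

    import java.util.concurrent.locks.ReentrantLock;

    public class TwoLockQueue<T> {
        private static final class Node<U> {
            final U value;
            volatile Node<U> next;  // volatile: published across the two locks
            Node(U value) { this.value = value; }
        }

        private Node<T> head = new Node<>(null);  // permanent dummy node
        private Node<T> tail = head;
        private final ReentrantLock headLock = new ReentrantLock();
        private final ReentrantLock tailLock = new ReentrantLock();

        public void enqueue(T x) {                // touches only Tail
            Node<T> node = new Node<>(x);
            tailLock.lock();
            try { tail.next = node; tail = node; }
            finally { tailLock.unlock(); }
        }

        public T dequeue() {                      // touches only Head
            headLock.lock();
            try {
                Node<T> first = head.next;
                if (first == null) return null;   // empty queue
                head = first;                     // first becomes the new dummy
                return first.value;
            } finally { headLock.unlock(); }
        }
    }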

Lock-Free (JDK 1.5+). Even finer granularity, even more complex code: P: Dequeue() => a and Q: Enqueue(d) coordinate with CAS instead of locks. Worry about starvation, subtle bugs, and how hard the code is to modify…
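
In Java this is what java.util.concurrent has offered since JDK 1.5: ConcurrentLinkedQueue, a CAS-based queue in the Michael-Scott style. Using it is trivial; writing or modifying code like it is not.

    import java.util.concurrent.ConcurrentLinkedQueue;

    public class LockFreeDemo {
        public static void main(String[] args) {
            ConcurrentLinkedQueue<String> q = new ConcurrentLinkedQueue<>();
            q.offer("a");                 // Enqueue(a)
            q.offer("b");                 // Enqueue(b)
            System.out.println(q.poll()); // Dequeue() => a, all without locks
        }
    }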

Real Applications. Complex: move data atomically between structures. P: Dequeue(Q1, a); Enqueue(Q2, a) must move a from one queue to the other as a single atomic step. More than twice the worry…

Transactional Memory [Herlihy, Moss 93]

Promise of Transactional Memory. Great performance, simple code: run P: Dequeue() => a and Q: Enqueue(d) as transactions, and don't worry about deadlock, livelock, subtle bugs, etc…

Promise of Transactional Memory Don’t worry which locks need to cover which variables when…

Promise of Transactional Memory Don’t worry which locks need to cover which variables when… Head a Tail ba P: Dequeue() => a b c d Q: Enqueue(d) TM deals with boundary cases under the hood

For Real Applications. It will be easy to modify multiple structures atomically: P: Dequeue(Q1, a); Enqueue(Q2, a) runs as one transaction. TM provides serializability…

Using Transactional Memory

    enqueue(Q, newnode) {
        Q.tail->next = newnode
        Q.tail = newnode
    }

Using Transactional Memory

    enqueue(Q, newnode) {
        atomic {
            Q.tail->next = newnode
            Q.tail = newnode
        }
    }
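
Languages without a built-in atomic block expose the same idea through a library. A hypothetical Java rendering (STM, Ref and the atomic method are illustrative names, not a real API): the lambda is the transaction body, and the STM runs it to commit, retrying on conflict.

    // All names here are hypothetical; only the shape of the API matters.
    interface STM {
        void atomic(Runnable transactionBody);  // retry the body until it commits
    }

    final class Ref<T> {                  // a transactionally managed memory cell
        private T value;
        T get()       { return value; }   // would be instrumented by the STM
        void set(T v) { value = v; }      // would be instrumented by the STM
    }

    final class Node  { final Ref<Node> next = new Ref<>(); }
    final class Queue { final Ref<Node> tail = new Ref<>(); }

    final class Enqueues {
        // atomic { Q.tail->next = newnode; Q.tail = newnode }
        static void enqueue(STM stm, Queue q, Node newnode) {
            stm.atomic(() -> {
                q.tail.get().next.set(newnode);
                q.tail.set(newnode);
            });
        }
    }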

Transactions Will Solve Many of Locks' Problems. No need to think about what needs to be locked, what not, and at what granularity! No worry about deadlocks and livelocks. No need to think about read-sharing. Can compose concurrent objects in a way that is safe and scalable. But there are problems!

Hardware TM [Herlihy, Moss 93]. Hardware transactions can be 20-30 instructions long, but not ~1000. Different machines: expect different hardware support. And hardware is not flexible: abort policies, retry policies are all application dependent…

Software Transactional Memory [Shavit, Touitou 94]. Delivers the semantics of hardware transactions, today. Tomorrow: serves as a standard interface to hardware, allowing hardware features to be extended as they arrive. Today's focus… Still, we need to have reasonable performance…

The Brief History of STM. [Timeline, 1994-2006, moving from lock-free to obstruction-free to lock-based designs: STM (Shavit, Touitou), Trans Support TM (Moir), WSTM (Fraser, Harris), DSTM (Herlihy et al), OSTM (Fraser, Harris), ASTM (Marathe et al), T-Monitor (Jagannathan et al), SoftTrans (Ananian, Rinard), Meta Trans (Herlihy, Shavit), Hybrid TM (Moir), STM (Ennals), McRT-STM (Saha et al), TL1/2 (Dice, Shalev, Shavit), Atomic Java (Hindman et al).] 2007-9: new lock-based STMs from IBM, Intel, Sun, Microsoft.

As Good As Fine Grained Locking. Postulate (i.e. take it or leave it): if we could implement fine-grained locking with the same simplicity as coarse-grained, we would never think of building a transactional memory. Implication: let's try to provide STMs that get as close as possible to hand-crafted fine-grained locking.

Subliminal Cut

Transactional Consistency
• Memory transactions are collections of reads and writes executed atomically.
• Transactions should maintain internal and external consistency:
– External: with respect to the interleavings of other transactions.
– Internal: the transaction itself should operate on a consistent state.

External Consistency. Invariant: x = 2y; in application memory X = 84, Y = 42. Transaction A: write x, write y. Transaction B: read x, read y, compute z = 1/(x - y) = 1/42. As long as the two transactions serialize, B always sees a pair satisfying the invariant.

Locking STM Design Choices. Map an array of versioned write-locks onto application memory. PS = lock per stripe (a separate array of locks). PO = lock per object (the lock is embedded in the object).

Encounter Order Locking (Undo Log) [Ennals, Saha, Harris, TinySTM, SwissTM…]
1. To read: load the lock + the location.
2. Check it is unlocked; add to the read set.
3. To write: lock the location, store the value.
4. Add the old value to the undo log.
5. At commit, validate that read-set version numbers are unchanged.
6. Release each lock with version number + 1.
(The read steps do not change memory; the write steps do.) Gives quick reads of values freshly written by the reading transaction, since memory is updated in place.
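
The steps above as a minimal Java sketch (illustrative only: one lock word per memory word, no contention manager, aborts signalled by an exception). Bit 0 of each lock word is the lock bit; the remaining bits are the version number.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.atomic.AtomicLongArray;

    public class EncounterSTM {
        static class Abort extends RuntimeException {}

        final long[] mem = new long[1024];
        final AtomicLongArray locks = new AtomicLongArray(1024); // one lock per word

        public class Txn {
            final List<long[]> readSet = new ArrayList<>();        // {addr, version}
            final ArrayDeque<long[]> undoLog = new ArrayDeque<>(); // {addr, old value}
            final Set<Integer> owned = new HashSet<>();            // locks we hold

            public long read(int addr) {                 // steps 1-2: load lock and
                long l = locks.get(addr);                // location, check unlocked
                if ((l & 1) != 0 && !owned.contains(addr)) throw abort();
                readSet.add(new long[]{addr, l >> 1});
                return mem[addr];  // fresh own writes read directly: in-place updates
            }

            public void write(int addr, long value) {    // steps 3-4: lock, store,
                if (!owned.contains(addr)) {             // log the old value
                    long l = locks.get(addr);
                    if ((l & 1) != 0 || !locks.compareAndSet(addr, l, l | 1))
                        throw abort();
                    owned.add(addr);
                }
                undoLog.push(new long[]{addr, mem[addr]});
                mem[addr] = value;
            }

            public void commit() {
                for (long[] r : readSet) {               // step 5: validate versions
                    long l = locks.get((int) r[0]);
                    boolean lockedByOther = (l & 1) != 0 && !owned.contains((int) r[0]);
                    if (lockedByOther || (l >> 1) != r[1]) throw abort();
                }
                for (int a : owned)                      // step 6: release with v# + 1
                    locks.set(a, ((locks.get(a) >> 1) + 1) << 1);
            }

            Abort abort() {                              // roll back, release unchanged
                while (!undoLog.isEmpty()) {
                    long[] u = undoLog.pop();
                    mem[(int) u[0]] = u[1];
                }
                for (int a : owned) locks.set(a, locks.get(a) & ~1L);
                return new Abort();
            }
        }
    }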

Commit Time Locking (Write Log) [TL, TL2, SkySTM…]
1. To read: load the lock + the location.
2. Is the location in the write set? (Bloom filter)
3. Check it is unlocked; add to the read set.
4. To write: add the value to the write set.
5. At commit: acquire the locks.
6. Validate read/write version numbers unchanged.
7. Release each lock with version number + 1.
Locks are held for a very short duration.
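
The same lock-word layout with commit-time locking, again as a rough Java sketch. Writes are buffered, memory is untouched until commit, and locks are held only across the short write-back window; the real systems answer step 2 with a Bloom filter, for which a map stands in here.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicLongArray;

    public class CommitTimeSTM {
        static class Abort extends RuntimeException {}

        final long[] mem = new long[1024];
        final AtomicLongArray locks = new AtomicLongArray(1024);

        public class Txn {
            final List<long[]> readSet = new ArrayList<>();            // {addr, version}
            final Map<Integer, Long> writeSet = new LinkedHashMap<>(); // addr -> value

            public long read(int addr) {
                Long buffered = writeSet.get(addr);      // step 2: in my write set?
                if (buffered != null) return buffered;
                long l = locks.get(addr);                // steps 1, 3: lock, location,
                long value = mem[addr];                  // unlocked and stable
                if ((l & 1) != 0 || locks.get(addr) != l) throw new Abort();
                readSet.add(new long[]{addr, l >> 1});
                return value;
            }

            public void write(int addr, long value) {    // step 4: deferred write
                writeSet.put(addr, value);
            }

            public void commit() {
                List<Integer> acquired = new ArrayList<>();
                try {
                    for (int a : writeSet.keySet()) {    // step 5: acquire locks
                        long l = locks.get(a);
                        if ((l & 1) != 0 || !locks.compareAndSet(a, l, l | 1))
                            throw new Abort();
                        acquired.add(a);
                    }
                    for (long[] r : readSet) {           // step 6: validate versions
                        long l = locks.get((int) r[0]);
                        boolean lockedByOther =
                            (l & 1) != 0 && !writeSet.containsKey((int) r[0]);
                        if (lockedByOther || (l >> 1) != r[1]) throw new Abort();
                    }
                    for (Map.Entry<Integer, Long> w : writeSet.entrySet())
                        mem[w.getKey()] = w.getValue();  // write back
                    for (int a : acquired)               // step 7: release with v# + 1
                        locks.set(a, ((locks.get(a) >> 1) + 1) << 1);
                } catch (Abort e) {
                    for (int a : acquired)               // abort: release unchanged
                        locks.set(a, locks.get(a) & ~1L);
                    throw e;
                }
            }
        }
    }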

COM vs. ENC, High Load. [Chart: red-black tree, 20% delete / 20% update / 60% lookup; hand-crafted locking vs. COM vs. ENC.]

COM vs. ENC, Low Load. [Chart: red-black tree, 5% delete / 5% update / 90% lookup; hand-crafted locking vs. COM vs. ENC.]

Problem: Internal Inconsistency
• A zombie is a currently active transaction that is destined to abort because it saw an inconsistent state.
• If zombies see inconsistent states, errors can occur, and the fact that the transaction will eventually abort does not save us.

Internal Inconsistency. Invariant: x = 2y; initially X = 4, Y = 2 in application memory. Transaction B: read x = 4. Transaction A then commits: write x = 8, write y = 4. Transaction B: read y = 4 (the transaction is now a zombie), compute z = 1/(x - y) = 1/0: DIV by 0 ERROR.

Past Approaches. 1. Design STMs that allow internal inconsistency. 2. To detect zombies, introduce validation into user code at fixed intervals or in loops, or use traps and OS support. 3. Still, there are cases where zombies cannot be detected: infinite loops in user code…

Global Clock [TL2/Snapshot Isolation] [Dice, Shalev, Shavit 06 / Riegel, Felber, Fetzer 06]
• Have a shared global version clock.
• Incremented by writing transactions (as infrequently as possible).
• Read by all transactions.
• Used to validate that the state viewed by a transaction is always consistent.

TL2 Version Clock: Read-Only Transactions
1. RV ← VClock (RV is private; VClock is shared).
2. To read: read the lock, read the location, re-read the lock; check unlocked, unchanged, and v# ≤ RV.
3. Commit. The reads form a snapshot of memory. No read set!
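
A minimal Java sketch of the read-only rule (same versioned-lock layout as the earlier sketches; illustrative, not TL2's actual code):

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicLongArray;

    public class TL2ReadOnly {
        static class Abort extends RuntimeException {}

        final long[] mem = new long[1024];
        final AtomicLongArray locks = new AtomicLongArray(1024); // bit 0 lock, rest v#
        final AtomicLong vclock = new AtomicLong();              // shared global clock

        // Read the given addresses as one read-only transaction.
        public long[] readOnlyTxn(int... addrs) {
            long rv = vclock.get();                 // 1. RV <- VClock (RV is private)
            long[] out = new long[addrs.length];
            for (int i = 0; i < addrs.length; i++) {
                long pre = locks.get(addrs[i]);     // 2. lock, location, lock again
                out[i] = mem[addrs[i]];
                long post = locks.get(addrs[i]);
                if ((post & 1) != 0 || pre != post || (post >> 1) > rv)
                    throw new Abort();              //    unlocked, unchanged, v# <= RV
            }
            return out;                             // 3. commit: nothing to validate,
        }                                           //    the reads are a snapshot at RV
    }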

TL2 Version Clock: Writing Transactions
1. RV ← VClock.
2. To read/write: check unlocked and v# ≤ RV, then add to the read/write set.
3. Acquire the locks.
4. WV = Fetch&Increment(VClock).
5. Validate each read-set v# ≤ RV.
6. Release the locks with v# = WV.
Reads + increment + writes = serializable.
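
A sketch of the full writing transaction under the same assumptions (illustrative Java, not the production TL2): reads are filtered against RV, writes are deferred, and commit locks the write set, advances the shared clock, validates the read set, writes back, and releases the locks stamped with WV. The wv == rv + 1 shortcut, a refinement from the TL2 paper, skips validation when no other writer committed in the interval.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicLongArray;

    public class TL2 {
        static class Abort extends RuntimeException {}

        final long[] mem = new long[1024];
        final AtomicLongArray locks = new AtomicLongArray(1024);
        final AtomicLong vclock = new AtomicLong();

        public class Txn {
            final long rv = vclock.get();                       // 1. RV <- VClock
            final List<Integer> readSet = new ArrayList<>();
            final Map<Integer, Long> writeSet = new LinkedHashMap<>();

            public long read(int addr) {                        // 2. unlocked, v# <= RV
                Long buffered = writeSet.get(addr);
                if (buffered != null) return buffered;
                long pre = locks.get(addr);
                long value = mem[addr];
                long post = locks.get(addr);
                if ((post & 1) != 0 || pre != post || (post >> 1) > rv)
                    throw new Abort();
                readSet.add(addr);
                return value;
            }

            public void write(int addr, long value) { writeSet.put(addr, value); }

            public void commit() {
                List<Integer> acquired = new ArrayList<>();
                try {
                    for (int a : writeSet.keySet()) {           // 3. acquire the locks
                        long l = locks.get(a);
                        if ((l & 1) != 0 || (l >> 1) > rv
                                || !locks.compareAndSet(a, l, l | 1)) throw new Abort();
                        acquired.add(a);
                    }
                    long wv = vclock.incrementAndGet();         // 4. advance VClock
                    if (wv != rv + 1) {                         // (no rival writer:
                        for (int r : readSet) {                 //  skip validation)
                            long l = locks.get(r);              // 5. validate v# <= RV
                            if (((l & 1) != 0 && !writeSet.containsKey(r))
                                    || (l >> 1) > rv) throw new Abort();
                        }
                    }
                    for (Map.Entry<Integer, Long> w : writeSet.entrySet())
                        mem[w.getKey()] = w.getValue();         // write back
                    for (int a : acquired) locks.set(a, wv << 1); // 6. release with WV
                } catch (Abort e) {
                    for (int a : acquired) locks.set(a, locks.get(a) & ~1L);
                    throw e;
                }
            }
        }
    }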

How We Learned to Stop Worrying and Love the Clock. The version clock rate is a progress concern, not a safety concern, so:
– (GV4) if the CAS to increment VClock fails, use the VClock value set by the winner;
– (GV5) use WV = VClock + 2 and increment VClock on abort;
– (GV7) localized clocks… [Avni, Shavit 08]

Uncontended Large Red-Black Tree. [Chart: 5% delete / 5% update / 90% lookup; hand-crafted locking vs. TL/PO, TL2/PO, Ennals, TL/PS, TL2/PS, and the Fraser-Harris lock-free tree.]

Contended Small Red-Black Tree. [Chart: 30% delete / 30% update / 40% lookup; TL/PO vs. TL2/PO vs. Ennals.]

Implicit Privatization [Menon et al]
• In real apps we often want to “privatize” data, then operate on it non-transactionally.
• Many STMs (like TL2) are based on “invisible readers”.
• Invisible readers/writers are a problem if we want implicit privatization…

Privatization Pathology. P privatizes node b, then modifies it non-transactionally:

    P: atomically { a.next = c; }  // b is now private
       b.value = 0;

Privatization Pathology. Invisible reader Q cannot detect the non-transactional modification to node b:

    P: atomically { a.next = c; }  // b is private
       b.value = 0;
    Q: atomically { tmp = a.next; foo = 1/tmp.value; }  // divide by 0 error

Solving the Privatization Problem
• Visible writers: reads are made aware of overlapping writes [Dice, Shavit 07/M4; Gottschlich, Connors 07; Spear, Michael, von Praun 08…].
• Visible readers: writes are made aware of overlapping reads [Ellen, Lev, Luchangco, Moir 07/SNZI; Dice, Shavit 09/ByteLocks…].

Visible Readers
• Use read-write locks: transactions also lock to read.
• Privatization is immediate…
• But RW-locks will make us burn in coherence-traffic hell: a CAS to increment/decrement the reader count.
• Which is why we had invisible readers in the first place. Or is it?

Read-Write ByteLocks [Dice, Shavit 09]
• A new read-write lock for multicores.
• Common case: no CAS, only a store + membar to read.
• Claim: on modern multicores the cost of coherent stores is not too bad…
Map an array of read-write byte-locks onto application memory.

The ByteLock Record
• Writer ID.
• Visible readers:
– a reader count for unslotted threads (traditional: CAS to increment and decrement);
– a reader array for slotted threads: an array of atomically addressable bytes, 48 or 112 slots, one byte per slot, store + membar to modify.
The whole record (writer id, slot array, reader count) fits in a single cache line. (A Java sketch of the record and protocol follows the next three slides.)

ByteLock: writers wait till readers drain out. The writer CASes its id into the wrtid field, then spins until every reader byte is 0 and the unslotted reader count drains. Intel, AMD and Sun machines can read 8 or 16 of the slot bytes at a time.

ByteLock: slotted readers give preference to writers. A slotted reader stores 1 into its byte, then checks for a writer; if there is none, it reads memory. Release is a simple store of 0. On Intel, AMD and Sun, a store to a byte + membar is very fast.

ByteLock: unslotted readers behave as in a traditional RW lock. An unslotted reader increments rdcnt with a CAS; if a writer is present, it decrements with a CAS and waits for the writer to go away.
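
Putting the three preceding slides together, a rough Java approximation of the ByteLock protocol. The real implementation packs one byte per slot into a single cache line and relies on a plain store + membar; Java offers no such layout control, so an AtomicIntegerArray stands in for the byte array, and all names here are illustrative.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicIntegerArray;

    public class ByteLock {
        static final int SLOTS = 48;                      // 48 or 112 in the paper
        final AtomicInteger wrtid = new AtomicInteger(0); // writer id, 0 = none
        final AtomicIntegerArray slots = new AtomicIntegerArray(SLOTS);
        final AtomicInteger rdcnt = new AtomicInteger(0); // unslotted reader count

        void writeLock(int id) {                   // writers wait till readers drain
            while (!wrtid.compareAndSet(0, id)) {} // one CAS claims write ownership
            for (int i = 0; i < SLOTS; i++)
                while (slots.get(i) != 0) {}       // spin until all slot bytes are 0
            while (rdcnt.get() != 0) {}            // ... and unslotted readers leave
        }

        void writeUnlock() { wrtid.set(0); }

        void slottedReadLock(int slot) {           // common case: store, check, go
            while (true) {
                slots.set(slot, 1);                // volatile store doubles as membar
                if (wrtid.get() == 0) return;      // no writer: read lock held
                slots.set(slot, 0);                // writer present: give way
                while (wrtid.get() != 0) {}        // wait for the writer to go away
            }
        }

        void slottedReadUnlock(int slot) { slots.set(slot, 0); } // release: one store

        void unslottedReadLock() {                 // threads without a slot: like a
            while (true) {                         // traditional RW lock, CAS both ways
                rdcnt.incrementAndGet();
                if (wrtid.get() == 0) return;
                rdcnt.decrementAndGet();           // writer present: back off
                while (wrtid.get() != 0) {}
            }
        }

        void unslottedReadUnlock() { rdcnt.decrementAndGet(); }
    }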

ByteLock Performance [Transact 2009]. [Chart: TLRW with a 48-slot ByteLock vs. TLRW with a 128-slot ByteLock vs. TL2/GV6/PS vs. a single mutex vs. TLRW with inc/dec read counters.]

Where we are heading…
• A lot more work on performance
– visible writers, visible readers
• Think GC: the game has just begun
– improve single-threaded performance
– amazing possibilities for compiler optimization
– OS support
• An explosion of new STMs
– ~100 TM papers in the last couple of years

A bit further down the road…
• Transactional languages
– no implicit privatization problem…
– composability
• And when hardware TM arrives…
– contention management
– new possibilities for extending and interfacing…

Need Experience with Apps
• Today
– MSF, Quake, Apache, FenixEDU (a large distributed app), student trials in Germany and the US…
• We need a lot more transactification of applications
– not just rewriting of existing concurrent apps
– but applications parallelized from scratch using TM

Thanks!