Concurrent Programming Without Locks Keir Fraser Tim Harris

Concurrent Programming Without Locks
Keir Fraser & Tim Harris
Adapted from an earlier presentation by Phil Howard

Motivation
  • Locking precludes parallelism
  • Recall “A Lock-Free Multiprocessor OS Kernel” by Massalin et al.
    – Extensive use of CAS2 (aka DCAS, DCADS)
    – The instruction does not exist on today’s CPUs
  • Need a practical and general non-blocking solution

Solutions?
  • Only use data structures that can be implemented with CAS?
    – Limiting
  • RCU
    – Still uses locks for writers
    – Still limited to CAS data structures
  • Software MCAS
  • Transactional Memory

Goals
  • Concreteness
  • Linearizability
  • Non-blocking progress guarantee
  • Disjoint access parallelism
  • Read parallelism
  • Dynamicity
  • Practicable space costs
  • Composability

Caveats
  • “It remains possible for a thread to see a mutually inconsistent view of shared memory if it performs a series of [read] calls.”

Definitions
  • Obstruction-freedom
    – A thread will make progress as long as it doesn’t contend with other threads’ access to any location
  • Lock-freedom
    – The system as a whole will make progress
  • Wait-freedom
    – Every thread makes progress

Focus is on lock-free design
Whole transactions are lock-free, not just the subcomponents
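
As a small illustration of the lock-freedom definition (not taken from the slides), a CAS retry loop on a shared counter is a minimal sketch: a thread only has to retry when some other thread's CAS succeeded, so the system as a whole keeps making progress. The function name `lockfree_increment` is hypothetical, and C11 `<stdatomic.h>` is assumed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: lock-free increment of a shared counter.
   A failed CAS means another thread's CAS succeeded in the meantime,
   so at least one thread always makes progress. */
static long lockfree_increment(_Atomic long *counter)
{
    long seen = atomic_load(counter);
    /* On failure, atomic_compare_exchange_weak stores the current
       value of *counter into `seen`, so the loop retries with it. */
    while (!atomic_compare_exchange_weak(counter, &seen, seen + 1)) {
        /* retry with the freshly observed value */
    }
    return seen + 1;
}
```

Note that this loop is lock-free but not wait-free: an individual thread can, in principle, lose the race repeatedly.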

Design considerations
  • Need to update multiple locations atomically
    – using only “real” instructions
  • The secret?
    – Indirection!
    – Use descriptors to access values

[Figure: memory locations 100 to 107 next to a descriptor with the fields Status, Address, Old Value and New Value; the descriptor entry for address 102 has old value 100 and new value 200]

Implications of Descriptors
  • Commit operation atomically updates status field
  • All accesses are indirect
    – Need to distinguish between a descriptor and a value
    – Need to choose “actual”, “old”, or “new” value
  • Once a descriptor is made visible, only the status field changes
  • Once an outcome is decided, the status value doesn’t change
    – Retries use a new descriptor
  • Descriptors are managed via garbage collection

Other requirements
  • Descriptors must be able to own locations
  • Uncontended commits must work
    – Prepare phase
    – Decision point
    – Update status value
    – Clean up
    – Status values: UNDECIDED, READCHECK, SUCCESSFUL, FAILED

Other Requirements
  • Contended commits must make progress
    – Decided, but not complete
      • Help the other thread complete
    – Undecided, not read-check
      • Abort contending transactions
        – Without contention management can lead to live-lock
      • Help contending transactions
        – Sort memory addresses to prevent looping
    – Read-check
      • Abort at least one contender
      • Prevent live-locks by totally ordering transactions

Algorithms
  • MCAS: Multiple Compare And Swap
  • WSTM: Word Software Transactional Memory
  • OSTM: Object Software Transactional Memory

MCAS

CAS(word *address,           // actual value
    word expected_value,
    word new_value);

(logically)
MCAS(int count,
     word *address[],        // actual values
     word expected_value[],
     word new_value[]);

(but an extra indirection is added)
(pointers must indirect through the descriptor!)
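
The later slides use CAS as a function that returns the value it found at the address. A minimal sketch of that convention on top of C11 atomics could look as follows; the name `cas_word` and the `word` typedef are assumptions made for this example, not part of the original interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uintptr_t word;   /* assumed: one machine word */

/* Hypothetical wrapper: returns the value found at *address.
   The swap took place if and only if the result equals `expected`. */
static word cas_word(_Atomic word *address, word expected, word new_value)
{
    word seen = expected;
    /* On failure the library call overwrites `seen` with the current value;
       on success `seen` is left equal to `expected`. */
    atomic_compare_exchange_strong(address, &seen, new_value);
    return seen;
}
```

Returning the observed value (rather than a boolean) lets the caller inspect what blocked the update, which is what the helping logic in the MCAS code relies on.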

MCAS
  • Operates only on aligned pointers
  • Lower 2 bits used to distinguish value/descriptor
  • Descriptors contain
    – status
    – N
    – address[]
    – expected[]
    – new_value[]
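
One way to picture the low-bit tagging is the sketch below. The specific tag values (`TAG_MCAS`, `TAG_CCAS`) and helper names are assumptions for illustration; the slides only state that the lower 2 bits distinguish values from descriptors.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t word;   /* assumed: one machine word */

/* Assumed tag assignment; pointers are at least 4-byte aligned,
   so the two low bits of a real pointer are always zero. */
enum { TAG_MASK = 0x3, TAG_VALUE = 0x0, TAG_MCAS = 0x1, TAG_CCAS = 0x2 };

static word tag_ptr(void *p, word tag) { return (word)p | tag; }
static void *untag_ptr(word w) { return (void *)(w & ~(word)TAG_MASK); }
static int is_mcas_desc(word w) { return (w & TAG_MASK) == TAG_MCAS; }
static int is_ccas_desc(word w) { return (w & TAG_MASK) == TAG_CCAS; }
```

A reader that loads a word first checks the tag, and only dereferences it as a descriptor after stripping the tag with `untag_ptr`.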

Data Access
[Figure: a location holds either a plain value (300 in the example) or a reference to a descriptor. Two descriptors are shown: Status SUCCESS, address 102, old value 100, new value 200; and Status UNKNOWN, address 105, old value 100, new value 200]

CCAS
Conditional CAS, built from CAS
  - takes effect only if condition == undecided
  - used to insert descriptor references

CCAS(word *address,
     word expected_value,
     word new_value,
     word *condition);

Returns the original value of *address.
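
The intended effect of CCAS can be written down as a single-threaded reference specification. This is only a sketch of the semantics, not an atomic implementation; the name `ccas_spec` and the numeric status encoding are assumptions.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t word;                              /* assumed word type */
enum { UNDECIDED = 0, SUCCESSFUL = 1, FAILED = 2 };  /* assumed encoding */

/* Sequential specification only: the real CCAS must make this whole
   function appear atomic to other threads. */
static word ccas_spec(word *address, word expected_value,
                      word new_value, const word *condition)
{
    word original = *address;
    if (original == expected_value && *condition == UNDECIDED)
        *address = new_value;   /* both tests passed: perform the swap */
    return original;            /* always report what was found */
}
```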

word *MCASRead(word **addr) {
    word *v;
retry_read:
    v = CCASRead(addr);
    if (!IsMCASDesc(v))
        return v;                          // plain value: no MCAS in progress
    mcas_descriptor *d = (mcas_descriptor *)v;
    for (int i = 0; i < d->N; i++) {
        if (d->addr[i] == addr) {
            if (d->status == SUCCESS) {
                if (CCASRead(addr) == v)
                    return d->new[i];      // committed: the new value is current
                else
                    goto retry_read;
            } else {                       // FAILED or UNKNOWN
                if (CCASRead(addr) == v)
                    return d->expected[i]; // not committed: the old value is current
                else
                    goto retry_read;
            }
        }
    }
    return v;
}

MCAS(3, {a, b, c}, {1, 2, 3}, {4, 5, 6})
[Figure: initial memory, a = 1, b = 2, c = 3]

MCAS(3, {a, c, b}, {1, 3, 2}, {4, 6, 5})
[Figure: memory still holds a = 1, b = 2, c = 3; the descriptor has status UNKNOWN, N = 3, and entries sorted by address: (a, 1, 4), (b, 2, 5), (c, 3, 6)]

MCAS(3, {a, b, c}, {1, 2, 3}, {4, 5, 6})
[Figure: the descriptor status is now SUCCESS, N = 3, entries (a, 1, 4), (b, 2, 5), (c, 3, 6); memory is updated from a = 1, b = 2, c = 3 to a = 4, b = 5, c = 6]

bool MCAS(int N, word **a[], word *e[], word *n[]) {
    mcas_descriptor *d = new mcas_descriptor();
    d->N = N;
    d->status = UNDECIDED;
    for (int i = 0; i < N; i++) {
        d->a[i] = a[i];
        d->e[i] = e[i];
        d->n[i] = n[i];
    }
    address_sort(d);        // fixed acquisition order prevents helping loops
    return mcas_help(d);
}
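
The slides do not show `address_sort`. A minimal sketch of the idea, assuming the descriptor keeps its entries as an array of structs (the `mcas_entry` layout and function names here are hypothetical), is to sort the entries by address with `qsort` so that every MCAS acquires locations in the same global order:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t word;   /* assumed word type */

/* Hypothetical entry layout: one (address, expected, new) triple. */
typedef struct {
    word *addr;
    word expected;
    word new_value;
} mcas_entry;

/* Compare two entries by the numeric value of their target address. */
static int by_address(const void *x, const void *y)
{
    uintptr_t a = (uintptr_t)((const mcas_entry *)x)->addr;
    uintptr_t b = (uintptr_t)((const mcas_entry *)y)->addr;
    return (a > b) - (a < b);
}

/* Sort entries so that all operations acquire in ascending address order. */
static void address_sort_entries(mcas_entry *entries, size_t n)
{
    qsort(entries, n, sizeof entries[0], by_address);
}
```

Sorting whole triples keeps each expected and new value attached to its address, which mirrors the reordering of {a, c, b} into (a, b, c) in the example slides.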

bool mcas_help(mcas_descriptor *d) {
    word *v, desired = FAILED;
    bool success;

    // PHASE 1: acquire
    for (int i = 0; i < d->N; i++) {
        while (TRUE) {
            v = CCAS(d->a[i], d->e[i], d, &d->status);
            if (v == d->e[i] || v == d)
                break;                              // location acquired
            if (IsMCASDesc(v))
                mcas_help((mcas_descriptor *)v);    // help the owner, then retry
            else
                goto decision_point;                // unexpected value: fail
        }
    }
    desired = SUCCESS;

mcas_help continued

    // PHASE 2: read – not used by MCAS

decision_point:
    CAS(&d->status, UNDECIDED, desired);

    // PHASE 3: clean up
    success = (d->status == SUCCESS);
    for (int i = 0; i < d->N; i++) {
        CAS(d->a[i], d, success ? d->n[i] : d->e[i]);
    }
    return success;
}

Claiming Ownership
[Figure: MCAS descriptor, Status: UNKNOWN]
  Address   Old Value   New Value
  102       100         200
  104       456         789
  108       999         777
[Figure: CCAS descriptor with address 108, expected value 999, new value &MCAS_Descr, condition &mcas->status]

word *CCAS(word **a, word *e, word *n, word *cond) {
    ccas_descriptor *d = new ccas_descriptor();
    word *v;
    d->a = a;
    d->e = e;
    d->n = n;
    d->cond = cond;
    while ((v = CAS(d->a, d->e, d)) != d->e) {
        if (IsCCASDesc(v))
            CCASHelp((ccas_descriptor *)v);   // finish the other CCAS, then retry
        else
            return v;                         // unexpected value
    }
    CCASHelp(d);
    return v;
}

void CCASHelp(ccas_descriptor *d) {
    bool success = (*d->cond == UNDECIDED);
    CAS(d->a, d, success ? d->n : d->e);
}

word *CCASRead(word **a) {
    word *v = *a;
    while (IsCCASDesc(v)) {
        CCASHelp((ccas_descriptor *)v);
        v = *a;
    }
    return v;
}

Conflicts
[Figure: first MCAS descriptor, Status: UNKNOWN]
  Address   Old Value   New Value
  102       100         200
  104       456         789
  108       999         777
[Figure: second MCAS descriptor, Status: UNKNOWN]
  Address   Old Value   New Value
  108       999         200

Conflicts
[Figure: the first MCAS descriptor, Status: UNKNOWN, with the same entries (102: 100 → 200; 104: 456 → 789; 108: 999 → 777); the memory column now also shows the value 200]

Conflicts
[Figure: first MCAS descriptor, Status: UNKNOWN, covering addresses 102 (old value 100, new value 200), 104 (old value 456) and 108 (old value 999)]
[Figure: second MCAS descriptor, Status: UNKNOWN]
  Address   Old Value   New Value
  104       456         123
  108       999         200

bool mcas_help(mcas_descriptor *d) {
    word *v, desired = FAILED;
    bool success;

    // PHASE 1: acquire
    for (int i = 0; i < d->N; i++) {
        while (TRUE) {
            v = CCAS(d->a[i], d->e[i], d, &d->status);
            if (v == d->e[i] || v == d)
                break;                              // location acquired
            if (!IsMCASDesc(v))
                goto decision_point;                // unexpected value: fail
            mcas_help((mcas_descriptor *)v);        // help the owner, then retry
        }
    }
    desired = SUCCESS;
    // falls through to decision_point (next slide)

mcas_help continued

    // PHASE 2: read – not used by MCAS

decision_point:
    CAS(&d->status, UNDECIDED, desired);

    // PHASE 3: clean up
    success = (d->status == SUCCESS);
    for (int i = 0; i < d->N; i++) {
        CAS(d->a[i], d, success ? d->n[i] : d->e[i]);
    }
    return success;
}

CCAS “failure modes”
  • Someone helped us with the CCAS
    – call CCASHelp with our own descriptor
    – next time around, return MCAS descriptor
    – MCAS continues
  • Someone else beat us to CCAS
    – help them with their CCAS
    – next time around, return their MCAS descriptor
    – Help with their MCAS
    – Our MCAS likely aborts
  • Source value changed
    – return new value
    – MCAS aborts

word *CCAS(word **a, word *e, word *n, word *cond) {
    ccas_descriptor *d = new ccas_descriptor();
    word *v;
    d->a = a;
    d->e = e;
    d->n = n;
    d->cond = cond;
    while ((v = CAS(d->a, d->e, d)) != d->e) {
        if (!IsCCASDesc(v))
            return v;                         // unexpected value
        CCASHelp((ccas_descriptor *)v);       // finish the other CCAS, then retry
    }
    CCASHelp(d);
    return v;
}

void CCASHelp(ccas_descriptor *d) {
    bool success = (*d->cond == UNDECIDED);
    CAS(d->a, d, success ? d->n : d->e);
}

CCASHelp “failure modes”
  • MCAS aborted so status isn’t UNKNOWN
    – old value put back in place
  • MCAS aborted, CCASHelp doesn’t restore value
    – MCAS cleanup will put old value back in place
  • Race: status switches to SUCCESS between check and CAS
    – CAS will fail because CCAS descriptor already removed
    – CCAS return will not cause MCAS failure
  • Race: status switches to FAILED between check and CAS
    – CAS will always fail because for MCAS to fail, someone must have read beyond us

Cost
  • 3N + 1 CAS instructions (plus all the other code)
  • “it is worth noting that the three batches of N updates all act on the same locations”
  • “[improvements] may be useful if there are systems in which CAS operates substantially more slowly than an ordinary write.”
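
The 3N + 1 figure can be read off the code on the earlier slides: in the uncontended case each CCAS issues one CAS to install its descriptor and one more in CCASHelp, the decision point issues one CAS on the status field, and clean-up issues one CAS per location. A small sketch of that bookkeeping (the function name is hypothetical):

```c
#include <assert.h>

/* Counts CAS instructions for an uncontended MCAS over n locations,
   following the structure of mcas_help and CCAS shown earlier. */
static unsigned mcas_uncontended_cas_count(unsigned n)
{
    unsigned acquire = 2 * n;  /* per location: install CCAS descriptor + CCASHelp */
    unsigned decide  = 1;      /* CAS on the status field */
    unsigned release = n;      /* per location: replace descriptor with final value */
    return acquire + decide + release;   /* = 3n + 1 */
}
```

For the three-location example used earlier, this gives 10 CAS instructions.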

Deep Breath

WSTM
  • Remove requirement for space reserved in values being updated
  • WSTM keeps track of locations rather than caller
  • Provides read parallelism
  • Obstruction-free; neither lock-free nor wait-free

Data Structures
[Figure: application memory (values 100, 200, 300, 400), a table of orecs (one holding version 52), and a transaction descriptor with Status: Undecided and entries
  a1: (100, 15) -> (200, 16)
  a2: (200, 52) -> (100, 53)]

Logical contents
  • Orec contains a version number:
    – value comes direct from memory
  • Orec contains a descriptor reference
    – descriptor contains address
      • value comes from descriptor based on status
    – descriptor does not contain address
      • value comes direct from memory
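
The three cases above can be sketched as a single lookup function. All type and field names below (`orec_t`, `wstm_descriptor`, `logical_value`, the fixed entry capacity) are assumptions made for this illustration and are not the data layout of the original implementation; concurrency is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word;   /* assumed word type */
typedef enum { UNDECIDED, READ_CHECK, SUCCESSFUL, FAILED } tx_status;

typedef struct {
    const word *addr;
    word old_value;
    word new_value;
} wstm_entry;

typedef struct {
    tx_status status;
    size_t count;
    wstm_entry entries[4];        /* fixed capacity, for the sketch only */
} wstm_descriptor;

/* An orec holds either a version number (owner == NULL) or an owner. */
typedef struct {
    unsigned version;
    const wstm_descriptor *owner;
} orec_t;

static word logical_value(const orec_t *orec, const word *addr)
{
    if (orec->owner == NULL)
        return *addr;                         /* version number: read memory */
    const wstm_descriptor *d = orec->owner;
    for (size_t i = 0; i < d->count; i++) {
        if (d->entries[i].addr == addr)       /* descriptor contains address */
            return d->status == SUCCESSFUL ? d->entries[i].new_value
                                           : d->entries[i].old_value;
    }
    return *addr;                             /* aliased address: read memory */
}
```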

Transaction Process
  • Call WSTMRead/WSTMWrite to gather/change data
    – Builds transaction data structure, but it’s NOT visible
  • WSTMCommitTransaction
    – Get ownership: update orecs
    – Read-check: check version numbers
    – Decide
    – Clean up

Data Structures
[Figure: the same structures during commit. The descriptor status changes from UNKNOWN to SUCCESS; the entries are a1: (100, 15) -> (200, 16) and a2: (200, 52) -> (100, 53), with orec versions moving from 15 to 16 and from 52 to 53]

Complications
  • Fixed number of orecs
  • Hash collisions lead to false sharing
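
To see why a fixed orec table causes false sharing, consider a sketch of an address-to-orec mapping. The table size `OREC_COUNT` and the hash (drop the low three bits, mask to the table size) are assumptions for this example; the slides do not specify the hash function.

```c
#include <assert.h>
#include <stdint.h>

#define OREC_COUNT 1024u   /* assumed table size (a power of two) */

/* Hypothetical hash: discard the low bits that are always zero for
   8-byte-aligned words, then wrap around the fixed-size orec table. */
static unsigned orec_index(const void *addr)
{
    return (unsigned)(((uintptr_t)addr >> 3) & (OREC_COUNT - 1));
}
```

With this mapping, two words that are `OREC_COUNT * 8` bytes apart share an orec, so transactions touching unrelated data at those addresses still conflict.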

Issues
  • Orec ownership acts like a lock, so the simple scheme is not even obstruction-free
  • Can’t help with “cleanup” because it might overwrite newer data
  • Can’t determine value during READCHECK, so we’re forced to shoot down
  • force_decision() might be circular, causing live-lock
  • Helping requires <complicated> stealing of transactions
  • Uncontended cost is N+2

OSTM
  • Objects are represented as opaque handles
    – can’t use pointers directly
    – must rewrite data structures
  • Get accessible pointers via OSTMOpenForReading/OSTMOpenForWriting
  • Eliminates need for orecs/aliasing

Evaluation
  • “We use … reference-counting garbage collection”
  • Evaluated with one thread/CPU
  • “Since we know the number of threads participating in our experiments…”

Uncontended Performance

Contended Locks

Data Contention

Data/Lock Contention

Spare Slides

word WSTMRead(wstm_transaction *tx, word *addr) {
    if (entry_exists)
        return entry->new_value;
    if (orec->type != descriptor)
        create entry [current value, orec version]
    else {
        force_decision(descriptor);   // can’t be ours: not in commit
        if (descriptor contains our address)
            if (status == SUCCESS)
                create entry [descr.new_val, descr.new_ver]
            else
                create entry [descr.old_val, descr.old_ver]
        else
            create entry [current value, descr.aliased.new_ver]
    }
    if (aliased) {
        if (entry->old_version != aliased->old_version)
            status = FAILED;
        entry->old_version = aliased->old_version;
        entry->new_version = aliased->new_version;
    }
    return entry->new_value;
}

void WSTMWrite(wstm_transaction *tx, word *addr, word new_value) {
    get entry using WSTMRead logic
    entry->new_value = new_value;
    for each aliased entry {
        entry->new_version++;
    }
}

bool WSTMCommit(wstm_transaction *tx) {
    if (tx->status == FAILED)
        return false;
    sort descriptor entries
    desired_status = FAILED;
    for each update
        if (!acquire_orec)
            goto decision_point;
    CAS(&tx->status, UNDECIDED, READ_CHECK);
    for each read
        if (!read_check)
            goto decision_point;
    desired_status = SUCCESS;
    // falls through to decision_point (next slide)

decision_point:
    status = tx->status;
    while (status != FAILED && status != SUCCESS) {
        CAS(&tx->status, status, desired_status);
        status = tx->status;
    }
    if (tx->status == SUCCESS)
        for each update
            *addr = entry->new_value;
    for each update
        release_orec
    return (tx->status == SUCCESS);
}

bool read_check(wstm_transaction *tx, wstm_entry *entry) {
    if (orec is WSTM_descriptor) {
        force_decision()
        if (SUCCESS)
            version = new_version;
        else
            version = old_version;
    } else {
        version = orec_version;
    }
    return (version == entry->old_version);
}

Data Structures
[Figure: addresses a1, a2 and a3 in application memory (values 100, 200, 300, 400), the orec table (one orec holding version 52), and a transaction descriptor with Status: Undecided and entries
  a1: (100, 15) -> (200, 16)
  a2: (200, 52) -> (100, 53)
  a3: (300, 15) -> (300, 16)]