Concurrent Data Structures
Concurrent Algorithms 2016
Tudor David (based on slides by Vasileios Trigonakis)


Data Structures (DSs)
• Constructs for efficiently storing and retrieving data
  – Different types: lists, hash tables, trees, queues, …
• Accessed through the DS interface
  – Depends on the DS type, but always includes: store an element, retrieve an element
• Element
  – Set: just one value
  – Map: key/value pair

Concurrent Data Structures (CDSs)
• Concurrently accessed by multiple threads
  – Through the CDS interface: linearizable operations!
• Really important on multi-cores
• Used in most software systems

What do we care about in practice?
• Sometimes: progress of individual operations
• More often: throughput (Mop/s)
  – Number of operations per second (throughput)
  – The evolution of throughput as we increase the number of threads (scalability)
[Plot: throughput (Mop/s) vs. number of threads, 1 to 40]

DS Example: Linked List
• A sequence of elements (nodes)
• Interface
  – search (aka contains)
  – insert
  – remove (aka delete)
• Node structure: struct node { value_t value; struct node* next; };
[Figure: a sorted list 1 → 2 → 3 → 5 → 6 → 8, with insert(4) linking a new node between 3 and 5 and delete(6) unlinking node 6]
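
To make the interface concrete, here is a minimal sequential (single-threaded) sorted-list sketch; it is not from the slides, and value_t is assumed here to be an ordered integer type.

    #include <stdbool.h>
    #include <stdlib.h>

    typedef long value_t;            /* placeholder for the slide's value_t */

    struct node {
        value_t value;
        struct node* next;
    };

    /* Sequential search in a sorted list. */
    bool list_search(struct node* head, value_t v) {
        struct node* curr = head;
        while (curr != NULL && curr->value < v)
            curr = curr->next;
        return curr != NULL && curr->value == v;
    }

    /* Sequential insert in sorted position; returns false if v is already present. */
    bool list_insert(struct node** head, value_t v) {
        struct node** link = head;
        while (*link != NULL && (*link)->value < v)
            link = &(*link)->next;
        if (*link != NULL && (*link)->value == v)
            return false;
        struct node* n = malloc(sizeof *n);
        if (n == NULL)
            return false;
        n->value = v;
        n->next = *link;
        *link = n;
        return true;
    }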

Search Data Structures
• Interface
  1. search(k)
  2. insert(k)
  3. remove(k)
  – insert and remove are updates; an update(k) consists of parse(k) followed by modify(k)
• Semantics
  – search(k) and parse(k): read-only
  – modify(k): read-write

Optimistic vs. Pessimistic Concurrency
• Pessimistic ("bad" linked list): lock, then traverse
• Optimistic ("good" linked list): traverse, then lock
[Plot: throughput (Mop/s) vs. # cores on a 20-core Xeon, 1024 elements; the optimistic list scales, the pessimistic one does not]
• (Lesson 1) Optimistic concurrency is the only way to get scalability

The Two Problems in Optimistic Concurrency
• Concurrency control: how threads synchronize their writes to the shared memory (e.g., nodes)
  – Locks
  – CAS
  – Transactional memory
• Memory reclamation: how and when threads free and reuse the shared memory (e.g., nodes)
  – Garbage collectors
  – Hazard pointers
  – RCU
  – Quiescent states

Tools for Optimistic Concurrency Control (OCC)
• RCU: slow in the presence of updates
  – (also a memory reclamation scheme)
• STM: slow in general
• HTM: not ubiquitous, not very fast (yet)
• Wait-free algorithms: slow in general
• (Optimistic) lock-free algorithms
• Optimistic lock-based algorithms
⇒ We either need a lock-free or an optimistic lock-based algorithm

Parenthesis: Target platform
• 2-socket Intel Xeon E5-2680 v2 (Ivy Bridge)
  – 20 cores @ 2.8 GHz, 40 hyper-threads
  – 25 MB LLC (per socket)
  – 256 GB RAM

Concurrent Linked Lists – 5% Updates
[Plot: throughput (Mop/s) vs. number of threads for blocking, lock-free, and wait-free lists; 1024 elements, 5% updates]
• The wait-free algorithm is slow

Optimistic Concurrency in Data Structures
• Pattern of an operation:
  – optimistic prepare (non-synchronized) → validate (synchronized) → perform (synchronized)
  – validation detects conflicting concurrent operations; if it fails, the optimistic prepare is restarted
• Example: linked-list insert
  – optimistic prepare: find the insertion spot
  – then validate and perform the insertion
• Validation plays a key role in concurrent data structures

Validation in Concurrent Data Structures
• Lock-free: atomic operations
  – optimistic prepare → validate & perform (atomic ops); on failure, restart
  – marking pointers, flags, helping, …
• Lock-based: locks
  – optimistic prepare → lock → validate → perform → unlock; on failed validation, unlock and restart
  – flags, pointer reversal, parsing twice, …
• Validation is what differentiates algorithms

Let's design two concurrent linked lists: a lock-free one and a lock-based one

Lock-free Sorted Linked List: Naïve
• Search: find the spot, return
• Insert: find the modification spot, CAS the new node in
• Delete: find the modification spot, CAS the node out
• Is this a correct (linearizable) linked list?
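
A sketch of what this naive version looks like in C (illustrative only: CAS_PTR wraps a compare-and-swap builtin, find_spot is the unsynchronized traversal, and a sentinel head node is assumed):

    #include <stdbool.h>
    #include <stdlib.h>

    typedef long value_t;
    struct node { value_t value; struct node* next; };

    #define CAS_PTR(addr, old_val, new_val) \
        __sync_bool_compare_and_swap((addr), (old_val), (new_val))

    /* Unsynchronized traversal: *pred gets the last node with value < v,
       the return value is the first node with value >= v (or NULL). */
    static struct node* find_spot(struct node* head, value_t v, struct node** pred) {
        *pred = head;
        struct node* curr = head->next;
        while (curr != NULL && curr->value < v) {
            *pred = curr;
            curr = curr->next;
        }
        return curr;
    }

    /* Naive lock-free insert: link the new node with a single CAS. */
    bool naive_insert(struct node* head, value_t v) {
        for (;;) {
            struct node *pred, *curr = find_spot(head, v, &pred);
            if (curr != NULL && curr->value == v)
                return false;                 /* already present */
            struct node* n = malloc(sizeof *n);
            n->value = v;
            n->next = curr;
            if (CAS_PTR(&pred->next, curr, n))
                return true;
            free(n);                          /* CAS failed: retry */
        }
    }

    /* Naive lock-free delete: unlink curr with a single CAS on pred->next.
       This is exactly the version shown to be incorrect on the next slide. */
    bool naive_delete(struct node* head, value_t v) {
        for (;;) {
            struct node *pred, *curr = find_spot(head, v, &pred);
            if (curr == NULL || curr->value != v)
                return false;                 /* not found */
            if (CAS_PTR(&pred->next, curr, curr->next))
                return true;                  /* can lose a concurrent insert */
        }
    }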

Lock-free Sorted Linked List: Naïve – Incorrect
• Interleaving: P0: Insert(x) and P1: Delete(y) both find their modification spots; P0's CAS links x after y, P1's CAS unlinks y, and x is lost. Lost update!
• What is the problem?
  – Insert involves one existing node;
  – Delete involves two existing nodes
• How can we fix the problem?

Lock-free Sorted Linked List: Fix
• Idea! To delete a node, make it unusable first…
  – Mark it for deletion so that
    1. Marking fails if someone changes the next pointer;
    2. An insertion fails if the predecessor node is marked.
• In other words, delete in two steps (sketched below):
  1. Mark for deletion: CAS(mark); and then
  2. Physical deletion: CAS(remove)
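
A sketch of the two-step deletion in C, assuming the mark is stored in the lowest bit of the next pointer (the actual encoding is shown on a later slide) and that an unsynchronized traversal has already located pred and curr; all names are illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    typedef long value_t;
    struct node { value_t value; struct node* next; };

    #define CAS_PTR(addr, old_val, new_val) \
        __sync_bool_compare_and_swap((addr), (old_val), (new_val))

    /* Mark = lowest bit of the next pointer. */
    static inline bool is_marked(struct node* p)        { return ((uintptr_t)p & 0x1) != 0; }
    static inline struct node* with_mark(struct node* p){ return (struct node*)((uintptr_t)p | 0x1); }

    /* Two-step delete of curr, whose predecessor is pred:
       1. CAS(mark): logically delete curr by marking its next pointer;
       2. CAS(remove): physically unlink curr from pred.                */
    bool two_step_delete(struct node* pred, struct node* curr) {
        struct node* succ = curr->next;
        /* Step 1 fails if curr->next changed meanwhile (e.g., a concurrent
           insert after curr) or if curr is already marked: restart then.  */
        if (is_marked(succ) || !CAS_PTR(&curr->next, succ, with_mark(succ)))
            return false;
        /* Step 2 may fail; a failed unlink is cleaned up by later traversals. */
        CAS_PTR(&pred->next, curr, succ);
        return true;
    }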

1. Failing Deletion (Marking)
• Same interleaving as before: P0: Insert(x) and P1: Delete(y) both find their modification spots; P0's CAS links x after y, so P1's CAS(mark) on y now fails
• Upon failure, restart the operation
  – Restarting is part of "all" state-of-the-art data structures

2. Failing Insertion due to Marked Node
• This time P1: Delete(y) marks and removes y first (CAS(mark), then CAS(remove)); P0's CAS to insert x after y then returns false because y's next pointer is marked
• Upon failure, restart the operation
  – Restarting is part of "all" state-of-the-art data structures
• How can we implement marking?

Implementing Marking (C Style)
• Pointers on 64-bit architectures are word aligned (8-byte aligned), so the low bits of a next pointer are always 0 and the lowest bit can hold the mark
• Marking a node:
    bool mark(node_t* n) {
      uintptr_t unmarked = (uintptr_t)n->next & ~0x1L;
      uintptr_t marked   = (uintptr_t)n->next | 0x1L;
      return CAS(&n->next, unmarked, marked) == unmarked;
    }
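
The following slides refer to helpers such as is_marked_node and get_unmark_ref; a minimal way to define them for this representation (a sketch; the course code may define them differently):

    #include <stdbool.h>
    #include <stdint.h>

    typedef long value_t;                     /* placeholder for the slide's value_t */
    struct node { value_t value; struct node* next; };

    /* The lowest bit of a next pointer encodes the deletion mark. */
    static inline bool is_marked_ref(struct node* p) {
        return ((uintptr_t)p & (uintptr_t)0x1) != 0;
    }
    static inline struct node* get_unmark_ref(struct node* p) {
        return (struct node*)((uintptr_t)p & ~(uintptr_t)0x1);
    }
    /* A node is logically deleted iff its next pointer is marked. */
    static inline bool is_marked_node(struct node* n) {
        return is_marked_ref(n->next);
    }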

Lock-free List: Putting Everything Together
• Traversal: traverse (requires unmarking nodes)
• Search: traverse
• Insert: traverse, then CAS to insert
• Delete: traverse, then CAS to mark, then CAS to remove
  – What happens if this last CAS fails?
• Garbage (marked) nodes
  – Cleanup while traversing (see the sketch below)
• A pragmatic implementation of lock-free linked lists ("helping", in this course's terms)
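
One way to realize the cleanup-while-traversing idea, reusing struct node, value_t, CAS_PTR, and the marked-pointer helpers sketched above (a sketch, not the exact course implementation):

    /* Traversal that also unlinks marked (logically deleted) nodes it passes.
       Returns the first non-deleted node with value >= v; *pred receives its
       predecessor. Unlinked nodes are NOT freed here: that is the memory-
       reclamation problem discussed later in the lecture. */
    struct node* find_and_cleanup(struct node* head, value_t v, struct node** pred) {
    retry:
        *pred = head;
        struct node* curr = get_unmark_ref(head->next);
        while (curr != NULL) {
            struct node* succ = curr->next;
            if (is_marked_ref(succ)) {
                /* curr is logically deleted: unlink it on behalf of the deleter
                   (the "helping" the slide mentions). If the CAS fails, someone
                   changed pred->next under us, so restart the traversal.        */
                if (!CAS_PTR(&(*pred)->next, curr, get_unmark_ref(succ)))
                    goto retry;
                curr = get_unmark_ref(succ);
                continue;
            }
            if (curr->value >= v)
                break;
            *pred = curr;
            curr = succ;
        }
        return curr;
    }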

What is not Perfect with the Lock-free List?
1. Garbage nodes
  – Increase path length; and
  – Increase complexity: if (is_marked_node(n)) …
2. Unmarking every single pointer
  – Increases complexity: curr = get_unmark_ref(curr->next)
• Can we simplify the design with locks?

Lock-based Sorted Linked List: Naïve
• Search: find the spot, return
• Insert: find the modification spot, lock(predecessor)
• Delete: find the modification spot, lock(predecessor), lock(target)
• Is this a correct (linearizable) linked list?

Lock-based List: Validate After Locking
• Search: find the spot, return
• Insert: find the modification spot, lock(predecessor), then validate:
  – !pred->marked && pred->next did not change
• Delete: find the modification spot, lock(predecessor), lock(curr), then validate:
  – !pred->marked && !curr->marked && pred->next did not change
  – on success, mark(curr) and unlink it
(A sketch of insert with this validation follows below.)
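
A sketch of the insert path with this validate-after-locking scheme, in the style of a lazy list; the per-node lock and marked flag are assumed fields, and all names are illustrative:

    #include <stdbool.h>
    #include <stdlib.h>
    #include <pthread.h>

    typedef long value_t;

    struct node {
        value_t value;
        struct node* next;
        bool marked;                 /* logically deleted? */
        pthread_mutex_t lock;        /* per-node lock      */
    };

    /* Validate after locking: pred is not deleted and still points to curr. */
    static bool validate(struct node* pred, struct node* curr) {
        return !pred->marked && pred->next == curr;
    }

    /* Lock-based insert: optimistic traversal, lock, validate, perform. */
    bool lazy_insert(struct node* head, value_t v) {
        for (;;) {
            /* optimistic prepare: unsynchronized traversal */
            struct node* pred = head;
            struct node* curr = head->next;
            while (curr != NULL && curr->value < v) {
                pred = curr;
                curr = curr->next;
            }
            if (curr != NULL && curr->value == v)
                return false;                   /* already present */
            pthread_mutex_lock(&pred->lock);
            if (validate(pred, curr)) {         /* perform under the lock */
                struct node* n = malloc(sizeof *n);
                n->value = v;
                n->marked = false;
                pthread_mutex_init(&n->lock, NULL);
                n->next = curr;
                pred->next = n;
                pthread_mutex_unlock(&pred->lock);
                return true;
            }
            pthread_mutex_unlock(&pred->lock);  /* validation failed: restart */
        }
    }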

Concurrent Linked Lists – 0% Updates
[Plot: throughput (Mop/s) vs. # cores for the lock-free and the lock-based list; 1024 elements, 0% updates; the lock-based list is faster]
• Just because the lock-based list is not unmarking pointers!
• (Lesson 2) Sequential complexity matters: keep it simple

Optimistic Concurrency Control: Summary
• Lock-free: atomic operations
  – optimistic prepare → validate & perform (atomic ops); on failure, restart
  – marking pointers, flags, helping, …
• Lock-based: locks
  – optimistic prepare → lock → validate → perform → unlock; on failed validation, unlock and restart
  – flags, pointer reversal, parsing twice, …

Word of caution: lock-based algorithms
• The lessons above apply to search data structures
• Queues, stacks, counters, … can behave differently
[Bar chart: throughput (Mop/s) of a queue at 40 threads, lock-based vs. non-blocking]

Memory Reclamation: OCC's Side Effect
• Delete a node: we want to free and reuse its memory
• Subset of the garbage collection problem
• Who is accessing that memory?
• Can we just directly do free(node)?
  – Example race: P0 holds a pointer to x during a search while P1 executes delete(x) and then free(x); P0 now dereferences freed memory
• We cannot directly free the memory! We need memory reclamation

Memory Reclamation Schemes
1. Reference counting
  – Count how many references exist to a node
2. Hazard pointers
  – Tell others what exactly you are reading
3. Quiescent states
  – Wait until it is certain that no one holds references
4. Read-Copy Update (RCU)
  – Quiescent states taken to the extreme

1. Reference Counting
• rc_pointer: a pointer plus a counter
• Dereference:
    rc_dereference(rc_pointer* rcp)
      atomic_increment(&rcp->counter);
      return rcp->pointer;
• "Release":
    rc_release(rc_pointer* rcp)
      atomic_decrement(&rcp->counter);
• Free: only iff counter == 0
• Bad, bad idea: readers write on shared nodes!
• (Lesson 3) Readers cannot write on the shared nodes
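
Expanded into a self-contained sketch (the struct layout is an assumption, and the slide's atomic_increment/atomic_decrement are written with GCC builtins):

    #include <stdint.h>

    typedef struct rc_pointer {
        void*    pointer;            /* the protected node           */
        uint64_t counter;            /* how many readers hold it     */
    } rc_pointer;

    /* Dereference: announce the reference, then read the pointer. */
    static void* rc_dereference(rc_pointer* rcp) {
        __sync_fetch_and_add(&rcp->counter, 1);
        return rcp->pointer;
    }

    /* Release: drop the reference; the memory may be freed once counter == 0. */
    static void rc_release(rc_pointer* rcp) {
        __sync_fetch_and_sub(&rcp->counter, 1);
    }

Every reader performs an atomic write on a shared cache line, which is exactly why the slide calls this a bad idea.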

2. Hazard Pointers (1/2)
• A reference counter is a property of the node
• A hazard pointer is a property of the thread
  – A Multi-Reader Single-Writer (MRSW) register holding an address
• Protect:
    hp_protect(node* n)
      hazard_pointer* hp = hp_get(n);
      hp->address = n;
  – Which hazard-pointer slot hp_get returns depends on the data structure type
• Release:
    hp_release(hazard_pointer* hp)
      hp->address = NULL;

2. Hazard Pointers (2/2)
• To free memory x (see the sketch below):
  1. Collect all hazard pointers
  2. Check if x is accessed by any thread
     – If yes, buffer the free for later
     – If not, free the memory
• Buffering the free is implementation specific
• + lock-free
• − O(data structure size) hazard pointers / hp_protect calls: not scalable
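
A sketch of that free path, assuming a fixed global array with one single-writer hazard-pointer slot per thread; names and sizes are illustrative:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    #define MAX_THREADS 64

    typedef struct hazard_pointer {
        void* address;               /* node currently protected by this thread */
    } hazard_pointer;

    hazard_pointer hp_table[MAX_THREADS];   /* one MRSW slot per thread (sketch) */

    /* Steps 1+2: is x protected by any thread's hazard pointer? */
    static bool hp_is_protected(void* x) {
        for (int t = 0; t < MAX_THREADS; t++)
            if (hp_table[t].address == x)
                return true;
        return false;
    }

    /* Free x now if no thread protects it; otherwise the caller must buffer it
       and retry later (how that buffer is organized is implementation-specific). */
    static bool hp_try_free(void* x) {
        if (hp_is_protected(x))
            return false;
        free(x);
        return true;
    }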

3. Quiescent States
• Keep the memory until it is certain it is not accessed
• Can be implemented in various ways
• Example implementation: wrap each search / insert / delete as
    qs_unsafe();     (I'm accessing shared data)
    …
    qs_safe();       (I'm not accessing shared data)
    return …
• The data written in qs_[un]safe must be mostly thread-local

3. Quiescent States: qs_[un]safe Implementation
• A list of "thread-local" (mostly) counters: one qs_state per thread id
• qs_state (initialized to 0)
  – even: in safe mode (not accessing shared data)
  – odd: in unsafe mode
• qs_safe / qs_unsafe simply do qs_state++;
• How do we free memory?
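
A compact sketch of these per-thread counters (array size and memory barriers are assumptions; a real implementation would also pad each counter to its own cache line):

    #include <stdint.h>

    #define MAX_THREADS 64

    /* One (mostly) thread-local counter per thread: even = safe, odd = unsafe. */
    volatile uint64_t qs_state[MAX_THREADS];

    /* Called by thread `tid` when it starts accessing shared nodes. */
    static inline void qs_unsafe(int tid) {
        qs_state[tid]++;             /* even -> odd */
        __sync_synchronize();        /* publish the transition before the accesses */
    }

    /* Called by thread `tid` when it no longer holds references to shared nodes. */
    static inline void qs_safe(int tid) {
        __sync_synchronize();
        qs_state[tid]++;             /* odd -> even */
    }

Each search/insert/delete is then wrapped between qs_unsafe(tid) and qs_safe(tid), as on the previous slide.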

3. Quiescent States: Freeing Memory
• The same list of per-thread qs_state counters
• Upon qs_free: timestamp the memory with the current vector of counters (vector_ts)
  – Can do this for batches of frees
• Safe to reuse the memory once vector_ts_now has advanced past vector_ts_mem:
    for t in thread_ids
      if (vts_mem[t] is odd && vts_now[t] == vts_mem[t])
        return false;
    return true;
• How do the schemes we have seen perform?

Hazard Pointers vs. Quiescent States
[Plot: throughput (Mop/s) vs. # threads with no reclamation (None), quiescent-state reclamation (QSBR), and hazard pointers (HP); 1024 elements, 0% updates]
• Quiescent-state reclamation is as fast as it gets

4. Read-Copy Update (RCU)
• Quiescent states at their extreme
  – Deletions wait for all readers to reach a safe state
• Introduced in the Linux kernel in ~2002
  – More than 10000 uses in the kernel!
• (Example) Interface
  – rcu_read_lock (= qs_unsafe)
  – rcu_read_unlock (= qs_safe)
  – synchronize_rcu: wait for all readers

4. Using RCU
• Search / Traverse:
    rcu_read_lock()
    …
    rcu_read_unlock()
• Delete:
    …
    physical deletion of x
    synchronize_rcu()
    free(x)
• + simple
• + great for read-only workloads
• − bad for writes
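
The same pattern in C, written against the interface named on the previous slide (a sketch; with the userspace RCU library, liburcu, these calls are available via <urcu.h>, though threads must also register themselves with rcu_register_thread(), which is elided here):

    #include <stdbool.h>
    #include <stdlib.h>
    #include <urcu.h>                /* liburcu: rcu_read_lock/unlock, synchronize_rcu */

    typedef long value_t;
    struct node { value_t value; struct node* next; };

    /* Reader: the whole traversal runs inside an RCU read-side critical section.
       (A full implementation would also load next pointers via rcu_dereference.) */
    bool rcu_list_search(struct node* head, value_t v) {
        rcu_read_lock();
        struct node* curr = head;
        while (curr != NULL && curr->value < v)
            curr = curr->next;
        bool found = (curr != NULL && curr->value == v);
        rcu_read_unlock();
        return found;
    }

    /* Writer: after x has been physically unlinked (e.g., under a lock),
       wait for all pre-existing readers, then reclaim the memory. */
    void rcu_free_after_unlink(struct node* x) {
        synchronize_rcu();           /* no reader that could still see x remains */
        free(x);
    }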

Memory Reclamation: Summary
• How and when to reuse freed memory
• Many techniques, no silver bullet:
  1. Reference counting
  2. Hazard pointers
  3. Quiescent states
  4. Read-Copy Update (RCU)

Summary
• Concurrent data structures are very important
• Optimistic concurrency is necessary for scalability
  – Only recently has there been a lot of active work on CDSs
• Memory reclamation is
  – Inherent to optimistic concurrency;
  – A difficult problem;
  – A potential performance/scalability bottleneck