Algorithm Engineering of Parallel Algorithms and Parallel Data Structures
Philippas Tsigas
(C) Ph. Tsigas 2003-2004

NOBLE: A Library of Non-Blocking Concurrent Data Structures
Philippas Tsigas
Results jointly with: Håkan Sundell and Yi Zhang

Overview
• Introduction
  - Synchronization
  - Non-blocking Synchronization
• Is Non-blocking Synchronization performance-beneficial for Parallel Applications?
• NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?
• Lock-free Skip lists
• Conclusions, Future Work

Systems: SMP
• Cache-coherent distributed shared memory multiprocessor systems:
  - UMA
  - NUMA

Synchronization
• Barriers
• Locks, semaphores, … (mutual exclusion)
• “A significant part of the work performed by today’s parallel applications is spent on synchronization.” ...

Lock-Based Synchronization: Sequential

Non-blocking Synchronization
• Lock-Free Synchronization
  - Optimistic approach
    - Assumes it is alone and prepares an operation, which later takes place (unless interfered with) in one atomic step, using hardware atomic primitives
    - Interference is detected via shared memory
    - Retries until not interfered with by other operations
    - Can cause starvation
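The optimistic retry pattern described on this slide can be sketched in C11 roughly as follows. This is only an illustration under assumptions: the shared counter `shared_total` and the function `lockfree_add` are hypothetical names that do not appear in the slides, and the hardware primitive is assumed to be reachable through `atomic_compare_exchange_weak` from `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared counter, used only for illustration. */
static atomic_long shared_total = 0;

/* Lock-free add: compute the new value privately, then try to publish it
 * in one atomic step. If another thread changed the counter in between,
 * the CAS fails, `seen` is refreshed with the current value, and the
 * operation is retried. */
long lockfree_add(atomic_long *target, long delta)
{
    long seen = atomic_load(target);
    while (!atomic_compare_exchange_weak(target, &seen, seen + delta)) {
        /* interference detected via shared memory; retry with new `seen` */
    }
    return seen + delta; /* the value this call installed */
}
```

Note that the loop has no upper bound on retries, which is the starvation risk mentioned in the last bullet: some thread always makes progress, but a particular thread may keep losing the race.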

Example: Shared Queue (slide provided by Jim Anderson)
The usual approach is to implement operations using retry loops. Here’s an example:
    type Qtype = record v: valtype; next: pointer to Qtype end
    shared var Tail: pointer to Qtype;
    local var old, new: pointer to Qtype
    procedure Enqueue(input: valtype)
        new := (input, NIL);
        repeat
            old := Tail
        until CAS2(&Tail, &(old->next), old, NIL, new)
[Figure: queue nodes with the old, new, and Tail pointers]

Non-blocking Synchronization
• Lock-Free Synchronization
  - Avoids problems that locks have
  - Fast
  - Starvation? (not in the context of HPC)
• Wait-Free Synchronization
  - Always finishes in a finite number of its own steps.
    - Complex algorithms
    - Memory consuming
    - Less efficient on average than lock-free

Overview
• Introduction
  - Synchronization
  - Non-blocking Synchronization
• Is Non-blocking Synchronization performance-beneficial for Parallel Scientific Applications?
• NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?
• Conclusions, Future Work

Non-blocking Synchronisation:
• An alternative approach for synchronisation introduced 25 years ago
• Many theoretical results
Evaluation:
• Micro-benchmarks show better performance than mutual exclusion in real or simulated multiprocessor systems.

Practice
• Non-blocking synchronization is still not used in practical applications
• Non-blocking solutions are often
  - complex
  - equipped with non-standard or unclear interfaces
  - impractical

Practice
Question: “How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?”

Answers
How is the performance of parallel scientific applications affected by the use of non-blocking synchronisation rather than lock-based synchronisation?
• The identification of the basic locking operations that parallel programmers use in their applications.
• The efficient non-blocking implementation of these synchronisation operations.
• The architectural implications on the design of non-blocking synchronisation.
• Comparison of the lock-based and lock-free versions of the respective applications.

Applications
• Ocean: simulates eddy currents in an ocean basin.
• Radiosity: computes the equilibrium distribution of light in a scene using the radiosity method.
• Volrend: renders 3D volume data into an image using a ray-casting method.
• Water: evaluates forces and potentials that occur over time between water molecules.
• Spark98: a collection of sparse matrix kernels. Each kernel performs a sequence of sparse matrix-vector product operations using matrices that are derived from a family of three-dimensional finite element earthquake applications.

Removing Locks in Applications
• Many locks are “Simple Locks”.
  - CAS, FAA and LL/SC can be used to implement a non-blocking version.
• Many critical sections contain shared floating-point variables.
  - Floating-point synchronization primitives are needed. A Double Fetch-and-Add primitive was designed.
• Large critical sections.
  - Efficient non-blocking implementations of big ADTs are used.

Experimental Results: Speedup
[Figure: speedup plots; panels labelled 58P, 32P, 24P, 58P]

SPARK98
Before:
    spark_setlock(lockid);
    w[col][0] += A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2];
    w[col][1] += A[Anext][0][1]*v[i][0] + A[Anext][1][1]*v[i][1] + A[Anext][2][1]*v[i][2];
    w[col][2] += A[Anext][0][2]*v[i][0] + A[Anext][1][2]*v[i][1] + A[Anext][2][2]*v[i][2];
    spark_unsetlock(lockid);
After:
    dfad(&w[col][0], A[Anext][0][0]*v[i][0] + A[Anext][1][0]*v[i][1] + A[Anext][2][0]*v[i][2]);
    dfad(&w[col][1], A[Anext][0][1]*v[i][0] + A[Anext][1][1]*v[i][1] + A[Anext][2][1]*v[i][2]);
    dfad(&w[col][2], A[Anext][0][2]*v[i][0] + A[Anext][1][2]*v[i][1] + A[Anext][2][2]*v[i][2]);
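The slides do not show how `dfad` is implemented. One plausible way to build a double-precision fetch-and-add from a 64-bit CAS is sketched below. Everything here is an assumption made for illustration: the name `dfad_sketch`, the `atomic_double_bits` type, and the choice to store each double as its raw bit pattern in an atomic 64-bit word (the slide code passes plain `double` addresses instead).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Assumed storage: the double lives as raw bits in an atomic 64-bit word. */
typedef _Atomic(uint64_t) atomic_double_bits;

static uint64_t to_bits(double d)
{
    uint64_t b;
    memcpy(&b, &d, sizeof b);   /* well-defined type punning */
    return b;
}

static double from_bits(uint64_t b)
{
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}

/* Atomically adds `delta` to the double stored at `addr` and returns the
 * previous value. Retries when another thread updated the word first. */
double dfad_sketch(atomic_double_bits *addr, double delta)
{
    uint64_t old_bits = atomic_load(addr);
    for (;;) {
        double old_val = from_bits(old_bits);
        uint64_t new_bits = to_bits(old_val + delta);
        /* On failure, old_bits is refreshed with the current contents. */
        if (atomic_compare_exchange_weak(addr, &old_bits, new_bits))
            return old_val;
    }
}
```

With a primitive of this shape, each of the three updates to `w[col][k]` becomes an independent atomic step, so the critical section and its lock are no longer needed.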

Overview
• Introduction
  - Synchronization
  - Non-blocking Synchronization
• Is Non-blocking Synchronization beneficial for Parallel Scientific Applications?
• NOBLE: A Non-blocking Synchronization Interface. How can we make non-blocking synchronization accessible to the parallel programmer?
• Conclusions, Future Work

Practice
• Non-blocking synchronization is still not used in practical applications
• Non-blocking solutions are often
  - complex
  - equipped with non-standard or unclear interfaces
  - impractical

NOBLE: Brings Non-blocking closer to Practice
• Create a non-blocking inter-process communication interface with the properties:
  - Attractive functionality
  - Programmer friendly
  - Easy to adapt existing solutions
  - Efficient
  - Portable
  - Adaptable for different programming languages

NOBLE Design: Portable
• Exported definitions, identical for all platforms:
  - Noble.h: #define NBL...
• Platform independent:
  - QueueLF.c, StackLF.c, …: #include “Platform/Primitives.h”
• Platform dependent:
  - SunHardware.asm: CAS, TAS, Spin-Locks …
  - IntelHardware.asm: CAS, TAS, Spin-Locks ...
  - ...

Using NOBLE
• First create a global variable handling the shared data object, for example a stack:
    Globals:
        #include <noble.h>
        ...
        NBLStack* stack;
• Create the stack with the appropriate implementation:
    Main:
        stack = NBLStackCreateLF(10000);
        ...
• When some thread wants to do some operation:
    Threads:
        NBLStackPush(stack, item);
      or
        item = NBLStackPop(stack);

Using NOBLE
• When the data structure is not in use anymore:
    Globals:
        #include <noble.h>
        ...
        NBLStack* stack;
    Main:
        stack = NBLStackCreateLF(10000);
        ...
        NBLStackFree(stack);

Using NOBLE
• To change the synchronization mechanism, only one line of code has to be changed!
    Globals:
        #include <noble.h>
        ...
        NBLStack* stack;
    Main:
        stack = NBLStackCreateLB();
        ...
        NBLStackFree(stack);
    Threads:
        NBLStackPush(stack, item);
      or
        item = NBLStackPop(stack);

Design: Attractive functionality
• Data structures for multi-threaded usage
  - FIFO Queues
  - Priority Queues
  - Dictionaries
  - Stacks
  - Singly linked lists
  - Snapshots
  - MWCAS
  - ...
• Clear specifications

Status
• Multiprocessor support
  - Sun Solaris (Sparc)
  - Win32 (Intel x86)
  - SGI (Mips)
  - Linux (Intel x86)
• Available for academic use: http://www.noble-library.org/

Did our Work have any Impact?
1) Industry has initiated contacts and uses a test version of NOBLE.
2) Freeware developers have shown interest.
3) Interest from research organisations.
NOBLE is freely available for research and educational purposes.

A Lock-Free Skip list
• H. Sundell, Ph. Tsigas: Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems. 17th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS ’03), May 2003 (TR 2002). Best Paper Award.
• A very similar skip list algorithm will be presented this August at the ACM Symposium on Principles of Distributed Computing (PODC 2004): “Lock-Free Linked Lists and Skip Lists”, Mikhail Fomitchev, Eric Ruppert.

Randomized Algorithm: Skip Lists
• William Pugh: “Skip Lists: A Probabilistic Alternative to Balanced Trees”, 1990
  - Layers of ordered lists with different densities, achieving a tree-like behavior
  - Time complexity: O(log₂ N), probabilistic!
[Figure: skip list from Head to Tail over keys 1 to 7; successive layers hold about 50%, 25%, … of the nodes]
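The layer densities of 50%, 25%, … come from a coin flip per level when a node is inserted. A minimal sketch of that level choice is given below; `MAX_LEVEL` and `random_level` are illustrative names and values, not taken from the slides, and `rand()` stands in for whatever random source an implementation would actually use.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LEVEL 16  /* assumed cap on the height of any node */

/* Picks the height of a new node: every node is on level 1, and each
 * further level is reached with probability 1/2, so roughly half of the
 * nodes reach level 2, a quarter reach level 3, and so on. */
int random_level(void)
{
    int level = 1;
    while (level < MAX_LEVEL && rand() < RAND_MAX / 2)
        level++;
    return level;
}
```

Because each layer is expected to hold half as many nodes as the one below it, a search skips about half of the remaining candidates per layer, which gives the expected logarithmic search cost.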

Our Lock-Free Concurrent Skip List
• Define node state to depend on the insertion status at the lowest level as well as a deletion flag
• Insert from the lowest level going upwards
• Set deletion flag. Delete from the highest level going downwards
[Figure: nodes 1 to 7, each with a deletion flag D; a node p with levels 1 to 3]

Concurrent Insert vs. Delete operations
• Problem: a concurrent Delete and Insert - both nodes are deleted!
  [Figure: list over nodes 1, 2, 3, 4 with steps a) and b) for the concurrent Delete and Insert]
• Solution (Harris et al.): Use bit 0 of pointer to mark deletion status
  [Figure: the same list with node 2 marked (*); steps a), b), c)]
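The marking technique can be sketched in C as follows, assuming nodes are aligned to at least two bytes so that bit 0 of a valid node address is always zero. The struct layout and the helper names (`is_marked`, `get_marked`, `get_unmarked`, `mark_for_deletion`) are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct node {
    int key;
    _Atomic(struct node *) next;  /* bit 0 doubles as the deletion mark */
};

static inline bool is_marked(const struct node *p)
{
    return ((uintptr_t)p & 1u) != 0;
}

static inline struct node *get_marked(struct node *p)
{
    return (struct node *)((uintptr_t)p | 1u);
}

static inline struct node *get_unmarked(struct node *p)
{
    return (struct node *)((uintptr_t)p & ~(uintptr_t)1);
}

/* Logically deletes `n` by setting the mark bit in its own next pointer.
 * Returns true if this call set the mark, false if it was already set. */
bool mark_for_deletion(struct node *n)
{
    struct node *succ = atomic_load(&n->next);
    while (!is_marked(succ)) {
        if (atomic_compare_exchange_weak(&n->next, &succ, get_marked(succ)))
            return true;
        /* CAS failed: succ now holds the current (possibly marked) value */
    }
    return false;
}
```

Once the mark is set, any CAS that tries to insert a new node after `n` while expecting the unmarked successor will fail, which is what prevents the lost insert described in the problem bullet.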

Dynamic Memory Management
• Problem: System memory allocation functionality is blocking!
• Solution (lock-free), IBM freelists:
  - Pre-allocate a number of nodes, link them into a dynamic stack structure, and allocate/reclaim using CAS
[Figure: Head pointing to a stack of nodes Mem 1, Mem 2, …, Mem n; Allocate takes a node from the head, Reclaim returns a used node]
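A free list of this kind can be sketched as a stack of pre-allocated nodes whose head is updated with CAS. The pool size, the node layout and the function names below are assumptions for illustration only. The sketch deliberately omits any protection against pointer reuse, so it is exposed to the ABA problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define POOL_SIZE 4  /* assumed; the slide only says "a number of nodes" */

struct mem_node {
    struct mem_node *next;
    int payload;
};

static struct mem_node pool[POOL_SIZE];
static _Atomic(struct mem_node *) free_head;

/* Links all pre-allocated nodes into one stack. Call once, before use. */
void freelist_init(void)
{
    for (size_t i = 0; i < POOL_SIZE; i++)
        pool[i].next = (i + 1 < POOL_SIZE) ? &pool[i + 1] : NULL;
    atomic_store(&free_head, &pool[0]);
}

/* Allocate: pop the head node with CAS. Returns NULL when the pool is empty.
 * Not ABA-safe as written. */
struct mem_node *freelist_alloc(void)
{
    struct mem_node *head = atomic_load(&free_head);
    while (head != NULL &&
           !atomic_compare_exchange_weak(&free_head, &head, head->next)) {
        /* head was refreshed by the failed CAS; retry */
    }
    return head;
}

/* Reclaim: push the node back onto the stack with CAS. */
void freelist_reclaim(struct mem_node *n)
{
    struct mem_node *head = atomic_load(&free_head);
    do {
        n->next = head;
    } while (!atomic_compare_exchange_weak(&free_head, &head, n));
}
```

Neither operation calls the system allocator after initialization, so a thread that is pre-empted in the middle of an allocation cannot block the others.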

The ABA problem
• Problem: Because of concurrency (pre-emption in particular), the same pointer value does not always mean the same node (i.e. CAS succeeds)!!!
[Figure: list states at Step 1 and Step 2 over nodes 1, 2, 3, 4, 6, 7]
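The effect can be reproduced deterministically by replaying one problematic interleaving in a single thread. The sketch below assumes a simple CAS-based stack; the names `cell` and `aba_demo`, and the particular interleaving, are illustrative and are not the scenario drawn on the slide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct cell { struct cell *next; };

/* Replays an ABA interleaving. Returns true if the stale CAS succeeded
 * and left the head pointing at a node that is no longer in the stack. */
bool aba_demo(void)
{
    struct cell c = { NULL }, b = { &c }, a = { &b };
    _Atomic(struct cell *) head;
    atomic_init(&head, &a);                    /* stack: a -> b -> c */

    /* "Thread 1" starts a pop: reads the head and its successor,
     * then is pre-empted before its CAS. */
    struct cell *t1_old = atomic_load(&head);  /* a */
    struct cell *t1_new = t1_old->next;        /* b */

    /* "Thread 2" runs meanwhile: pops a, pops b, then pushes a back. */
    atomic_store(&head, &c);                   /* a and b are removed */
    a.next = &c;
    atomic_store(&head, &a);                   /* stack: a -> c */

    /* "Thread 1" resumes: head equals a again, so the CAS succeeds... */
    bool swapped = atomic_compare_exchange_strong(&head, &t1_old, t1_new);

    /* ...and the head now points at b, which was already removed. */
    return swapped && atomic_load(&head) == &b;
}
```

The CAS compares only the pointer value, so it cannot tell that the stack changed twice between the read and the update.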

The ABA problem
• Solution (Valois et al.): Add reference counting to each node, in order to prevent nodes that are of interest to some thread from being reclaimed until all threads have left the node
[Figure: New Step 2, with reference counts on the nodes; the CAS fails!]

Helping Scheme
• Threads need to traverse safely
• Need to remove marked-to-be-deleted nodes while traversing: Help!
• Finds the previous node, finishes the deletion and continues traversing from the previous node
[Figure: list 1, 2*, 4 in which node 2 is marked for deletion]

Overlapping operations on shared data
• Example: Insert operation - which of 2 or 3 gets inserted?
  [Figure: list 1, 4 with concurrent Insert 2 and Insert 3]
• Solution: Compare-And-Swap atomic primitive:
    CAS(p: pointer to word, old: word, new: word): boolean
        atomic do
            if *p = old then
                *p := new;
                return true;
            else
                return false;
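How CAS decides between the two competing inserts can be sketched with C11 atomics. The node type `lnode` and the function `try_insert_after` are hypothetical names for this illustration; a full list implementation would also search for the insertion point and retry on failure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct lnode {
    int key;
    _Atomic(struct lnode *) next;
};

/* Tries to link `n` between `prev` and `succ`. The CAS succeeds only if
 * prev->next still equals `succ`; if another insert got there first, it
 * fails and the caller has to re-read the list and retry. */
bool try_insert_after(struct lnode *prev, struct lnode *succ, struct lnode *n)
{
    atomic_store(&n->next, succ);
    return atomic_compare_exchange_strong(&prev->next, &succ, n);
}
```

If two threads both read node 4 as the successor of node 1 and then try to insert 2 and 3 respectively, exactly one CAS succeeds. The other thread sees the failure and retries against the updated list, so neither insert is silently lost.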

Experiments
• 1-30 threads on platforms with different levels of real concurrency
• 10000 Insert vs. DeleteMin operations by each thread. 100 vs. 1000 initial inserts
• Compare with other implementations:
  - Lotan and Shavit, 2000
  - Hunt et al., “An Efficient Algorithm for Concurrent Priority Queue Heaps”, 1996

Full Concurrency

Medium Pre-emption

High Pre-emption

Lessons Learned
• The Non-Blocking Synchronization Paradigm can be suitable and beneficial to large scale parallel applications.
• Experimental Reproducible Work. Many results claimed by simulation are not consistent with what we observed.
• Applications gave us nice problems to look at and do theoretical work on. (IPDPS 2003 Algorithmic Best Paper Award)
• NOBLE helped programmers to trust our implementations.

Future Work
• Extend NOBLE for loosely coupled systems.
• Extend the set of data structures supported by NOBLE based on the needs of the applications.
• Reactive-Synchronisation

Questions?
• Contact Information:
  - Address: Philippas Tsigas, Computing Science, Chalmers University of Technology
  - Email: tsigas@cs.chalmers.se
  - Web: http://www.cs.chalmers.se/~tsigas
         http://www.cs.chalmers.se/~dcs
         http://www.noble-library.org

Pointers:
• NOBLE: A Non-Blocking Inter-Process Communication Library. ACM Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers (LCR ’02).
• Evaluating The Performance of Non-Blocking Synchronization on Shared Memory Multiprocessors. ACM SIGMETRICS 2001/Performance 2001 Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2001).
• Integrating Non-blocking Synchronization in Parallel Applications: Performance Advantages and Methodologies. ACM Workshop on Software and Performance (WOSP ’01).
• A Simple, Fast and Scalable Non-Blocking Concurrent FIFO queue for Shared Memory Multiprocessor Systems. ACM Symposium on Parallel Algorithms and Architectures (SPAA ’01).
• Fast and Lock-Free Concurrent Priority Queues for Multi-Thread Systems. 17th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS ’03).
• Fast, Reactive and Lock-free Multi-word Compare-and-swap Algorithms. 12th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT ’03).
• Scalable and Lock-free Concurrent Dictionaries. Proceedings of the 19th ACM Symposium on Applied Computing (SAC ’04).