A Review of Lightweight Thread Approaches for High Performance Computing

Adrián Castelló, Rafael Mayo, Enrique S. Quintana-Ortí, Antonio J. Peña, Sangmin Seo, Pavan Balaji
Universitat Jaume I de Castelló (Spain), Barcelona Supercomputing Center (Spain), Argonne National Lab (USA)
IEEE Cluster 2016. 13th-15th September. Taipei (Taiwan)

Motivation
• Exascale systems will offer massive concurrent hardware
• On-node parallelism is inevitable
[Figure: cores-per-socket distribution on the Top 500 list, as a percentage of systems per year]
A. Castelló, IEEE Cluster 2016

Current solution
• Pthreads is the standard way to exploit current on-node parallelism
  • Using the Pthreads library directly
  • Using high-level programming models
• Pros:
  • Works well for the hardware characteristics
• Cons:
  • Falls short from the point of view of software requirements
  • Context switches and synchronization are expensive mechanisms

Lightweight Thread Libraries
[Diagram: user-level threads (U) running on top of an OS thread]
• Lightweight threads with low context-switch overhead
• To better overlap computation and communication/IO
• To exploit fine-grained task parallelism

Lightweight Thread Libraries
• High-level programming model: Converse Threads, Nanos++
• Specific OS: Windows Fibers, Solaris Threads
• Specific hardware: Tiny-Threads
• Lightweight thread abstraction: Cilk, Go, Intel TBB
• General purpose: MassiveThreads, Argobots, Qthreads
• Stackless threads: Stackless Python, Protothreads

Lightweight Thread Libraries
Libraries compared: Pthreads, Argobots, Qthreads, MassiveThreads, Converse Threads, Go
• Levels of Hierarchy: 1, 2, 3, 2, 2, 2
• # of Work Unit Types: 1, 2, 1
• Group Control: -, Yes, Yes, Yes
• Global Queue: Yes*, Yes, -, -, Yes
• Private Queue: Yes*, Yes, Yes, -
• Plug-in Scheduler: Yes, Yes, Yes, -
• Stackable Scheduler: -, Yes, -, -
• Group Scheduler: -, Yes, -, -
* By programmer
More flexibility!

Why are these libraries not used?
    #pragma omp parallel
    Some_Code();
• Implemented on top of Pthreads
• Directive-based programming model
• Parallel code with just one code line

Our Target
[Diagram: Go, Argobots, MassiveThreads, Qthreads and Converse Threads, set against one another (VS)]
Hardware: 36-core Intel Xeon E5-2699 v3, 128 GB of memory
Software: Lightweight thread libraries updated to 05-2016; gcc 5.2 and Intel icc 15.0.1 (Intel OpenMP Runtime 20151009)
Code: Sscal BLAS-1 operation
Method:
  1st: Analyze the OpenMP behavior
  2nd: Mimic the OpenMP mechanism with lightweight thread libraries
  3rd: Compare the achieved performance with OpenMP (as baseline)

LWT Programming Model
    #define N 100
    void example(){ printf("Hello\n"); }
    int main(){
        initialization();                      /* 1 */
        for(int i=0; i<N; i++)
            ULT_creation_to(example, dest);    /* 2 */
        yield();                               /* 3 */
        for(int i=0; i<N; i++)
            join();                            /* 4 */
        finalize();                            /* 5 */
    }
1 Environment Initialization
2 ULT/Tasklet creation
3 Context-switch
4 ULT/Tasklet join
5 Environment Finalization

LWT Programming Model II
Step 2: ULT/Tasklet creation
Libraries compared: Pthreads, Argobots, Qthreads, MassiveThreads, Converse Threads, Go
• Result: New Pthread; New ULT or Tasklet; New ULT
• Executed by: OS; Execution Stream; Worker; Thread; Process
• Oversubscription: May be; No; No; No
• ULT Destination: Yes*; Yes; No; Just Tasklet; No
• Work queues: OS; Private (ES); Private (W); Private; Shared
• Own Queue access: (Almost) Free or Mutex, depending on the library
• Load balance: OS, Application, Work-stealing or Work-sharing, depending on the library
• Main Drawback: OS actions, Dispatch step or Contention, depending on the library

Basic Functionality
Creation step
• Dispatch overhead
• One ULT/Tasklet is created for each thread in LWT
• Only the function pointer initialization is measured in OpenMP
Joining step
• One ULT/Tasklet is joined for each thread in LWT
• The join function is measured in OpenMP
• Barrier vs Memory status vs Work unit status

OpenMP microbenchmarks I
For loop
    #pragma omp parallel for
    for(i=0; i<1000; i++)
        code(i);
• One ULT for each thread
• Iterations are divided between ULTs
• Similar to the create + join figures

OpenMP microbenchmarks I
Nested for loop
    #pragma omp parallel for
    for(i=0; i<1000; i++){
        #pragma omp parallel for firstprivate(i)
        for(j=0; j<1000; j++)
            code(i, j);
    }
• The Y-axis values are in seconds
• Converse Threads needs extra scheduler calls
• OpenMP generates oversubscription

Nested parallelism
    omp_set_num_threads(4);
    #pragma omp parallel for                       /* 1 */
    for(i=0; i<1000; i++){
        #pragma omp parallel for firstprivate(i)   /* 2 */
        for(j=0; j<1000; j++){
            code(i, j);
        }                                          /* 3 */
    }
Step 1
• GCC OpenMP: Creates 4 outer loop threads
• ICC OpenMP: Creates 4 outer loop threads
• LWT: Creates 4 outer loop ULTs
Step 2
• GCC OpenMP: Creates 3 inner loop threads for each outer loop thread
• ICC OpenMP: Checks for idle threads and creates 3 new inner loop threads if needed
• LWT: Creates 4 inner loop ULTs for each outer loop ULT
Step 3
• GCC OpenMP: Puts the inner loop threads inside the idle thread pool
• ICC OpenMP: Puts the inner loop threads inside the idle thread pool
• LWT: Joins the 4 inner loop ULTs
OS Threads
• GCC OpenMP: 12,004
• ICC OpenMP: 16
• LWT: 4

OpenMP microbenchmarks II
Tasks in a single region
    #pragma omp parallel
    {
        #pragma omp single
        for(i=0; i<1000; i++){
            #pragma omp task
            code(i);
        }
    }
Tasks in a parallel region
    #pragma omp parallel
    {
        #pragma omp for
        for(i=0; i<1000; i++){
            #pragma omp task
            code(i);
        }
    }
• Each OpenMP task is converted to a ULT or Tasklet in LWT
• Tasklets perform better
• Dispatch effect
• Work-sharing vs Work-stealing
  • GCC only implements one shared task queue
  • ICC employs one task queue for each thread and uses work-stealing

Conclusions
• Lightweight thread solutions can mimic common parallel code patterns
• They achieve performance that is at least as good as that of the OpenMP runtimes
• General-purpose libraries perform better
• Some implementation choices with a strong impact have been identified in the OpenMP runtime systems
• Moreover...

Conclusions II
• We found that the parallel codes can be implemented with a reduced set of LWT functions
Initialization
• Argobots: ABT_init
• Qthreads: qthread_initialize
• MassiveThreads: myth_init
• Converse Threads: ConverseInit
ULT creation
• Argobots: ABT_thread_create
• Qthreads: qthread_fork
• MassiveThreads: myth_create
• Converse Threads: CthCreate
• Go: go function
Yield
• Argobots: ABT_thread_yield
• Qthreads: qthread_yield
• MassiveThreads: myth_yield
• Converse Threads: CthYield
Join
• Argobots: ABT_thread_free
• Qthreads: qthread_readFF
• MassiveThreads: myth_join
• Go: channel
Finalization
• Argobots: ABT_finalize
• Qthreads: qthread_finalize
• MassiveThreads: myth_fini
• Converse Threads: ConverseExit

Current Work
Generic Lightweight Thread (GLT) library
[Diagram: GLT common API on top of Argobots, Qthreads and MassiveThreads]
• Common API for LWT solutions
• Two parts:
  • CORE for common features
  • EXTENDED for specific solution functions
• Two approaches:
  • Stand-alone
  • Headers
• Scheduling relies on the underlying library
• Two types of work-unit support: Tasklet and ULT
www.hpca.uji.es/GLT
github.com/adcastel/GLT.git

Future Work
• To reimplement some Pthreads-based high-level programming models on top of the GLT API
  • OpenMP
  • OmpSs
  • etc.

Thank you!
Adrián Castelló (adcastel@uji.es)

OpenMP microbenchmarks III
Nested tasks
    void code(int i){
        #pragma omp task
        test1();
        #pragma omp task
        test2();
        #pragma omp taskwait
    }
    ...
    #pragma omp parallel
    {
        #pragma omp single
        for(i=0; i<200; i++){
            #pragma omp task
            code(i);
        }
    }