GLTO: On the Adequacy of Lightweight Thread Approaches

GLTO: On the Adequacy of Lightweight Thread Approaches for OpenMP Implementations

Adrián Castelló, Rafael Mayo, Enrique S. Quintana-Ortí, Antonio J. Peña, Sangmin Seo, Pavan Balaji

Universitat Jaume I de Castelló (Spain)
Barcelona Supercomputing Center (Spain)
Argonne National Lab (USA)

ICPP 2017, 14th – 17th August, Bristol (UK)

OpenMP
  • Directive-based programming model
  • Commonly used for shared-memory programming in a node
  • Many different implementations
      • Typically on top of the Pthreads library
      • Intel, GCC, Clang, IBM, etc.
  • Well-known issues:
      • Nested parallelism
      • Fine-grained task parallelism

OpenMP: sequential code vs. parallel code

    #pragma omp parallel for
    for (i = 0; i < INT_MAX; i++) {
        do_something();
    }

[Figure: CPUs within a node; CPUs marked "zz" are idle]

Pthreads
  • Pthreads is the standard way to exploit current on-node parallelism
      • Using the Pthreads library directly
      • Using high-level programming models
  • Pros:
      • Works well with the hardware characteristics
  • Cons:
      • Falls short from the point of view of software requirements
      • Context switches and synchronization are expensive mechanisms

[Figure: OS thread]

Lightweight Thread Libraries
  • User-level thread: lightweight thread with low context-switch overhead
  • Dynamic data scheduling
  • To better overlap computation and communication/IO
  • To exploit fine-grained task parallelism

[Figure: an OS thread and user-level threads (U)]

Lightweight Thread Libraries

Lightweight thread abstraction, by category:
  • Specific OS: Windows Fibers, Solaris Threads
  • Specific hardware: Tiny-Threads
  • High-level programming model: Converse Threads, Nanos++, Cilk, Go, Intel TBB
  • Stackless threads: Stackless Python, Protothreads
  • General purpose: MassiveThreads, Argobots, Qthreads

Generic Lightweight Thread (GLT) library
  • Unified API for LWT solutions
  • Two approaches: stand-alone (static) and headers (dynamic)
  • Programming model:
      • Each thread is composed of:
          • OS thread
          • Work-unit queue
          • Scheduler
      • Scheduling relies on the underlying library
  • Two types of work unit are supported: tasklet and ULT

[Figure: GLT common API on top of Qthreads, MassiveThreads, and Argobots]

* Based on the work "GLT: A Unified API for Lightweight Thread Libraries", to be presented at Euro-Par 2017
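
The thread composition listed above (OS thread, work-unit queue, scheduler) can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: every name in it (`lwt_thread_model`, `model_push`, `model_run`, `add_one`) is hypothetical and none of it is the GLT API. The calling thread stands in for the OS thread, the queue is a fixed-size array, and the scheduler is a plain FIFO loop, whereas in GLT the scheduling is delegated to the underlying library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; these names are NOT part of GLT. */
#define QUEUE_CAPACITY 64

typedef void (*work_fn)(void *arg);

typedef struct {
    work_fn fn;   /* body of the work unit */
    void *arg;    /* argument handed to fn */
} work_unit;

typedef struct {
    work_unit units[QUEUE_CAPACITY]; /* work-unit queue */
    size_t head;                     /* next unit to run */
    size_t tail;                     /* next free slot */
} lwt_thread_model;

/* Enqueue a work unit. Returns 0 on success, -1 if the queue is full. */
static int model_push(lwt_thread_model *t, work_fn fn, void *arg) {
    assert(t != NULL && fn != NULL);
    if (t->tail == QUEUE_CAPACITY) {
        return -1;
    }
    t->units[t->tail].fn = fn;
    t->units[t->tail].arg = arg;
    t->tail++;
    return 0;
}

/* Scheduler: drain the queue in FIFO order on the calling (OS) thread.
 * Returns the number of work units executed. */
static size_t model_run(lwt_thread_model *t) {
    size_t executed = 0;
    assert(t != NULL);
    while (t->head < t->tail) {
        work_unit u = t->units[t->head++];
        u.fn(u.arg);
        executed++;
    }
    return executed;
}

/* Example work unit: increment the integer that arg points to. */
static void add_one(void *arg) {
    (*(int *)arg)++;
}
```

Pushing `add_one` twice with the same counter and then calling `model_run` executes both units on the caller. Each unit here runs to completion, which is closer to a tasklet than to a ULT; yielding in the middle of a unit is beyond this sketch.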

Goals of this work
  • Implement an OpenMP runtime over GLT
  • Analyze several OpenMP scenarios
  • Improve OpenMP weaknesses

GLTO implementation

* The GLTO runtime is based on the BOLT project from Argonne National Laboratory

GLTO validation
OpenUH OpenMP Validation Suite 3.1

                     GNU 6.1   Intel 16.0.1   GLTO
OpenMP constructs    62        62             62
Used tests           123       123            123
Successful tests     118       118            121/122
Failed tests         5         5              2/1

GLTO achieves a higher percentage of successful tests!

GLTO evaluation: hardware & software

Hardware:
  • 2 x Intel Xeon E5-2699 v3 (2.3 GHz)
  • 18 cores
  • 128 GB of RAM

Software:
  • Intel OpenMP runtime 20160808
  • GOMP 6.1
  • GLT 01-2017
  • Argobots 01-2017
  • Qthreads 1.10
  • MassiveThreads 0.95

GLTO evaluation: OpenMP as environment creator
UTS benchmark (T1XXL size)

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        do_things(tid);
    }

GLTO evaluation: OpenMP in compute-bound code
CloverLeaf (clover_bm4.in size)

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        do_things();
    }

GLTO evaluation: OpenMP in compute-bound code
CloverLeaf (clover_bm4.in size)

[Figure: work dispatch time]

114 parallel for loops are executed 2,955 times!

GLTO evaluation: OpenMP in nested parallelism
Nested parallelism (N = 100)

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        #pragma omp parallel for firstprivate(i)
        for (j = 0; j < N; j++) {
            do_things(i, j);
        }
    }

GLTO evaluation: OpenMP in nested parallelism
Nested parallelism (N = 100)

OpenMP   Created threads   Reused threads   Created ULTs
GCC      3,536             0                -
Intel    1,296             2,240            -
GLTO     36                0                3,500
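
The counts in this table can be reproduced with simple arithmetic, sketched below. The model rests on assumptions the slides do not state: a team of 36 threads (read off the GLTO row and the 2 x 18 cores of the test machine), each of the N = 100 inner regions needing team - 1 extra workers because the encountering thread acts as master, and, for the Intel row, each outer thread keeping the inner team from its first inner region and reusing it afterwards. All function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    long created_threads;
    long reused_threads;
    long created_ults;
} nested_counts;

/* Workers needed by all inner regions: (team - 1) helpers per region. */
static long inner_workers(long team, long n) {
    assert(team >= 1 && n >= 0);
    return n * (team - 1);
}

/* Every worker is a fresh OS thread (figures match the GCC row). */
static nested_counts fresh_threads(long team, long n) {
    nested_counts c = { team + inner_workers(team, n), 0, 0 };
    return c;
}

/* Assumed policy: each outer thread creates one inner team and reuses it
 * for later inner regions (figures match the Intel row). */
static nested_counts cached_inner_teams(long team, long n) {
    assert(n >= team); /* every outer thread runs at least one iteration */
    long created_inner = team * (team - 1);
    nested_counts c = { team + created_inner,
                        inner_workers(team, n) - created_inner, 0 };
    return c;
}

/* Inner workers become ULTs on the existing OS threads (GLTO row). */
static nested_counts ult_based(long team, long n) {
    nested_counts c = { team, 0, inner_workers(team, n) };
    return c;
}
```

With team = 36 and n = 100 the three functions give 3,536 created threads; 1,296 created and 2,240 reused threads; and 36 threads plus 3,500 ULTs. That the figures agree does not confirm that the runtimes follow exactly these policies.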

GLTO evaluation: OpenMP in task parallelism
Conjugate Gradient (bmwcra_1 matrix, 14,878 rows)

[Figure: four panels, one each for granularity 10, 20, 50, and 100]
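
To give a feel for what the granularity values imply, here is a minimal sketch that assumes granularity means rows per task; the slides do not define the term, so treat this reading as a guess. `tasks_for` and `scale_rows` are hypothetical helpers, and the loop body is a stand-in vector scaling, not the CG kernel used in the evaluation. The code also compiles without OpenMP support, in which case the pragmas are ignored and it runs serially.

```c
#include <assert.h>

/* Number of tasks when each task handles `granularity` rows
 * (assumed meaning of "granularity"; rounds up). */
static long tasks_for(long rows, long granularity) {
    assert(rows >= 0 && granularity > 0);
    return (rows + granularity - 1) / granularity;
}

/* Scale y[0..rows) by alpha, one OpenMP task per block of rows.
 * Returns the number of tasks created. */
static long scale_rows(double *y, long rows, long granularity, double alpha) {
    long tasks = 0;
    assert(y != 0 && granularity > 0);
    #pragma omp parallel
    #pragma omp single
    {
        for (long start = 0; start < rows; start += granularity) {
            long end = start + granularity < rows ? start + granularity : rows;
            tasks++; /* only the thread inside `single` updates this */
            #pragma omp task firstprivate(start, end)
            for (long r = start; r < end; r++) {
                y[r] *= alpha;
            }
        }
    } /* implicit barrier: all tasks have finished here */
    return tasks;
}
```

Under that reading, 14,878 rows yield 1,488, 744, 298, and 149 tasks for granularities 10, 20, 50, and 100, so the finest setting creates roughly ten times as many tasks as the coarsest.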

GLTO evaluation: OpenMP in task parallelism
Conjugate Gradient (bmwcra_1 matrix, 14,878 rows)

[Figure: four panels, one each for granularity 10, 20, 50, and 100]

Performance loss caused by a cut-off mechanism!
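
The slide names a cut-off mechanism without describing it. One common form, sketched below purely as an assumption, stops deferring tasks once too many are pending and makes the creating thread execute new tasks immediately. The `task_pool` type, its fields, and the threshold are hypothetical and do not describe any specific runtime from the evaluation.

```c
#include <assert.h>

/* Hypothetical bookkeeping for a task cut-off; not a real runtime's API. */
typedef struct {
    long cutoff;   /* maximum number of pending (deferred) tasks */
    long pending;  /* tasks currently queued and not yet executed */
    long deferred; /* total tasks that were queued */
    long inlined;  /* total tasks executed immediately by their creator */
} task_pool;

/* Submit one task: queue it while below the cut-off, else run it inline. */
static void submit_task(task_pool *p) {
    assert(p != 0 && p->cutoff >= 0);
    if (p->pending >= p->cutoff) {
        p->inlined++;   /* creator executes the task itself, serially */
    } else {
        p->pending++;
        p->deferred++;
    }
}

/* A worker finished one queued task. */
static void complete_task(task_pool *p) {
    assert(p != 0);
    if (p->pending > 0) {
        p->pending--;
    }
}
```

With a cut-off of 4 and ten submissions before any task completes, four tasks are deferred and six run inline on the creator. When many fine-grained tasks are created faster than they are consumed, this kind of serialization is one plausible source of the loss the slide reports.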

GLTO evaluation summary

                       Environment   Compute-bound   Nested        Task
                       Creator       Code            Parallelism   Parallelism
Pthread-based OpenMP   X             X               -             Coarse-grained
GLTO                   X             -               X             Fine-grained

There is no clear winner…

GLTO evaluation summary

… but at least users can change the runtime version!

Conclusions & Future work

Conclusions:
  • We have implemented OpenMP on top of the Generic Lightweight Thread (GLT) library
  • We have analyzed GLTO and compared it with current OpenMP implementations
  • GLTO improves some OpenMP scenarios by using lightweight threads instead of Pthreads

Future work:
  • To implement other high-level programming models on top of GLT
  • To analyze the interaction between GLTO and MPI

Thanks!
Contact: adcastel@uji.es
Source code: github.com/adcastel/glto-runtime
