GLTO: On the Adequacy of Lightweight Thread Approaches for OpenMP Implementations
- Slides: 23
GLTO: On the Adequacy of Lightweight Thread Approaches for OpenMP Implementations. Authors: Adrián Castelló, Rafael Mayo, Enrique S. Quintana-Ortí, Antonio J. Peña, Sangmin Seo, Pavan Balaji. Affiliations: Universitat Jaume I de Castelló (Spain), Barcelona Supercomputing Center (Spain), Argonne National Lab (USA). ICPP 2017, 14th–17th August, Bristol (UK)
OpenMP • Directive-based programming model • Commonly used for shared-memory programming within a node • Many different implementations: Intel, GCC, Clang, IBM, etc. • Typically built on top of the Pthreads library • Well-known issues: • Nested parallelism • Fine-grained task parallelism
OpenMP • Parallel code vs. sequential code • Code on slide: #pragma omp parallel for (i=0; i<INT_MAX; i++){ do_something(); } } • [Figure: CPUs within a node, some marked idle (zz)]
Pthreads • Pthreads is the standard way to exploit current on-node parallelism • Directly, using the Pthreads library • Indirectly, using high-level programming models • Pros: • Maps well to hardware characteristics • Cons: • Falls short from the point of view of software requirements • Context switches and synchronization are expensive mechanisms (each Pthread is an OS thread)
Lightweight Thread Libraries • User-level threads (U) run on top of an OS thread • Lightweight threads with low context-switch overhead • Dynamic data scheduling • To better overlap computation and communication/IO • To exploit fine-grained task parallelism
Lightweight Thread Libraries (taxonomy) • Specific OS: Windows Fibers, Solaris Threads • High-level programming model: Converse Threads, Nanos++ • Lightweight thread abstraction: Cilk, Go, Intel TBB • Specific hardware: TinyThreads • Stackless threads: Stackless Python, Protothreads • General purpose: MassiveThreads, Argobots, Qthreads
Generic Lightweight Thread (GLT) library • GLT common API on top of Qthreads, MassiveThreads, and Argobots • Unified API for LWT solutions • Two approaches: • Stand-alone (static) and Headers (dynamic) • Programming model: • Each GLT thread is composed of: • An OS thread • A work-unit queue • A scheduler • Scheduling relies on the underlying library • Two types of work-unit support: Tasklet and ULT • * Based on the work “GLT: A Unified API for Lightweight Thread Libraries”, to be presented at Euro-Par 2017
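The programming model on this slide (a thread made of a work-unit queue plus a scheduler, with Tasklet and ULT work units) can be pictured with a small C sketch. This is not the GLT API: every name below (`work_unit`, `sketch_thread`, `sketch_push`, `sketch_run`, `add_one`) is hypothetical, the FIFO order is only a stand-in for whatever policy the underlying library applies, and the real difference between a ULT and a tasklet (a ULT owns a stack and can yield) is reduced here to a tag.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 64  /* hypothetical fixed capacity for this sketch */

/* The two work-unit types named on the slide (only a tag here). */
typedef enum { WU_TASKLET, WU_ULT } wu_kind;

/* A work unit: a function plus its argument. */
typedef struct {
    wu_kind kind;
    void (*fn)(void *);
    void *arg;
} work_unit;

/* Stand-in for one thread: it owns a queue of pending work units. */
typedef struct {
    work_unit queue[QUEUE_CAP];
    size_t head;  /* next work unit to run */
    size_t tail;  /* next free slot */
} sketch_thread;

/* Enqueues a work unit; returns 0 on success, -1 if the queue is full. */
int sketch_push(sketch_thread *t, wu_kind kind, void (*fn)(void *), void *arg) {
    assert(t != NULL && fn != NULL);
    if (t->tail == QUEUE_CAP)
        return -1;
    t->queue[t->tail++] = (work_unit){ kind, fn, arg };
    return 0;
}

/* Trivial FIFO "scheduler": runs every queued work unit to completion
   and returns how many were executed. */
size_t sketch_run(sketch_thread *t) {
    assert(t != NULL);
    size_t executed = 0;
    while (t->head < t->tail) {
        work_unit wu = t->queue[t->head++];
        wu.fn(wu.arg);
        executed++;
    }
    return executed;
}

/* Example work: increments the integer passed as argument. */
void add_one(void *arg) {
    (*(int *)arg)++;
}
```

A caller would zero-initialize a `sketch_thread`, push a few work units with `sketch_push`, and then drain them with `sketch_run`.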
Goals of this work • Implement an OpenMP runtime over GLT • Analyze several OpenMP scenarios • Address OpenMP weaknesses
GLTO implementation • * The GLTO runtime is based on the BOLT project from Argonne National Laboratory
GLTO implementation
GLTO validation • OpenUH OpenMP Validation Suite 3.1
Metric | GNU 6.1 | Intel 16.0.1 | GLTO
OpenMP constructs | 62 | 62 | 62
Used tests | 123 | 123 | 123
Successful tests | 118 | 118 | 121/122
Failed tests | 5 | 5 | 2/1
GLTO obtains a higher percentage of successful tests!
GLTO evaluation: Hardware & Software • 2 x Intel Xeon E5-2699 v3 (2.3 GHz) • 18 cores • 128 GB of RAM • Intel OpenMP runtime 20160808 • GOMP 6.1 • GLT 01-2017 • Argobots 01-2017 • Qthreads 1.10 • MassiveThreads 0.95
GLTO evaluation: OpenMP as environment creator • UTS benchmark (T1XXL size) • Code pattern: #pragma omp parallel { int tid = omp_get_thread_num(); do_things(tid); }
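A compilable version of the "environment creator" pattern might look like the sketch below. It assumes a team capped at four threads through the `num_threads` clause, and the names `MAX_THREADS`, `do_things` and `run_environment` are illustrative choices rather than code from the UTS benchmark. The `_OPENMP` guards let the same file build serially when OpenMP is disabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 4  /* hypothetical cap on the team size for this sketch */

/* Hypothetical per-thread work: records that thread `tid` took part. */
static void do_things(int tid, int *visited) {
    assert(tid >= 0 && tid < MAX_THREADS);
    visited[tid] = 1;
}

/* Opens a single parallel region, lets every thread run do_things() with
   its own id, and returns how many threads participated. */
int run_environment(void) {
    int visited[MAX_THREADS] = {0};
    int count = 0;

    #pragma omp parallel num_threads(MAX_THREADS)
    {
#ifdef _OPENMP
        int tid = omp_get_thread_num();
#else
        int tid = 0;  /* serial fallback when OpenMP is disabled */
#endif
        do_things(tid, visited);  /* each thread writes a distinct slot */
    }

    for (int i = 0; i < MAX_THREADS; i++)
        count += visited[i];
    return count;
}
```

In this pattern OpenMP only sets up the threads once; all further work distribution is left to the code running inside the region.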
GLTO evaluation: OpenMP in compute-bound code • CloverLeaf (clover_bm4.in size) • Code pattern: #pragma omp parallel for, applied to: for (i=0; i<N; i++){ do_things(); }
GLTO evaluation: OpenMP in compute-bound code • CloverLeaf (clover_bm4.in size) • Work dispatch time: 114 parallel for loops are executed 2,955 times!
GLTO evaluation: OpenMP in nested parallelism • Nested parallelism (N = 100) • Code pattern: outer loop under #pragma omp parallel for: for (i=0; i<N; i++){ inner loop under #pragma omp parallel for firstprivate(i): for (j=0; j<N; j++){ do_things(i, j); } }
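The nested pattern can be made runnable along the following lines. This is a sketch under assumptions: the outer directive is taken to be `parallel for`, the body `do_things(i, j)` is replaced by a hypothetical table fill so the result can be checked, and nesting is enabled explicitly with `omp_set_max_active_levels`, since many runtimes serialize inner regions by default.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Fills an n-by-n row-major table with i * j using two nested parallel
   loops, then returns the sum of all entries (computed serially). */
long nested_fill(int n, long *table) {
    assert(n > 0 && table != NULL);
#ifdef _OPENMP
    omp_set_max_active_levels(2);  /* let the inner region fork its own team */
#endif

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* Each outer iteration opens a new inner parallel region. */
        #pragma omp parallel for firstprivate(i)
        for (int j = 0; j < n; j++) {
            table[(long)i * n + j] = (long)i * j;  /* stand-in for do_things(i, j) */
        }
    }

    long sum = 0;
    for (long k = 0; k < (long)n * n; k++)
        sum += table[k];
    return sum;
}
```

Every outer iteration triggers a fork and join of an inner team, which is why the way a runtime creates or reuses threads matters so much in this scenario.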
GLTO evaluation: OpenMP in nested parallelism • Nested parallelism (N = 100)
OpenMP runtime | Created threads | Reused threads | Created ULTs
GCC | 3,536 | 0 | -
Intel | 1,296 | 2,240 | -
GLTO | 36 | 0 | 3,500
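One reading that reproduces the numbers in this table, offered as an interpretation rather than something the slides state, is that every team has 36 threads (2 sockets x 18 cores), that each of the N = 100 inner regions needs 35 workers besides the thread that opens it, and that the Intel runtime creates one inner team per outer thread and reuses it afterwards. The struct and function names below are hypothetical.

```c
#include <assert.h>

/* Counts derived from n inner regions and teams of `team` threads. */
typedef struct {
    long inner_workers;   /* extra workers requested by all inner regions */
    long total_requests;  /* outer team plus all inner workers */
    long one_team_each;   /* outer team plus one inner team per outer thread */
} nested_counts;

/* Computes the counts under the assumptions stated in the lead-in. */
nested_counts count_nested(long n, long team) {
    assert(n > 0 && team > 0);
    nested_counts c;
    c.inner_workers = n * (team - 1);            /* 100 * 35 = 3,500 */
    c.total_requests = team + c.inner_workers;   /* 36 + 3,500 = 3,536 */
    c.one_team_each = team + team * (team - 1);  /* 36 + 36 * 35 = 1,296 */
    return c;
}
```

Under this reading, GCC creates a fresh thread for all 3,536 requests, Intel creates 1,296 threads and serves the remaining 3,536 - 1,296 = 2,240 requests from existing ones, and GLTO keeps 36 threads and turns the 3,500 inner workers into ULTs.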
GLTO evaluation: OpenMP in task parallelism • Conjugate Gradient (bmwcra_1, 14,878 rows) • [Figure panels: Granularity 10, Granularity 20, Granularity 50, Granularity 100]
GLTO evaluation: OpenMP in task parallelism • Conjugate Gradient (bmwcra_1, 14,878 rows) • [Figure panels: Granularity 10, Granularity 20, Granularity 50, Granularity 100] • Performance loss caused by a cut-off mechanism!
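To make "granularity" concrete, the sketch below splits a vector update into OpenMP tasks of `gran` rows each. It assumes that granularity means rows per task, which the slides do not spell out, and it uses a simple `y += a * x` update as a stand-in for the Conjugate Gradient kernels; `axpy_tasks` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] += a * x[i] for n rows, creating one OpenMP task per
   chunk of `gran` rows. Returns the number of tasks created. */
int axpy_tasks(int n, int gran, double a, const double *x, double *y) {
    assert(n >= 0 && gran > 0);
    assert(n == 0 || (x != NULL && y != NULL));
    int tasks = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* One thread creates the tasks; the whole team executes them. */
        for (int start = 0; start < n; start += gran) {
            int end = (start + gran < n) ? start + gran : n;
            tasks++;  /* only the thread inside `single` updates this */
            #pragma omp task firstprivate(start, end)
            {
                for (int i = start; i < end; i++)
                    y[i] += a * x[i];
            }
        }
    }  /* the implicit barrier here guarantees that all tasks have finished */

    return tasks;
}
```

Smaller values of `gran` yield more and shorter tasks, so task creation and scheduling overhead take up a larger share of the run time.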
GLTO evaluation summary • Scenarios compared: Environment Creator; Compute-bound Code; Nested Parallelism; Task Parallelism (Coarse-grained, Fine-grained) • Marks as extracted from the slide: Pthread-based OpenMP: X X - • GLTO: X - X • There is no clear winner…
GLTO evaluation summary • … but at least users can change the runtime version!
Conclusions & Future work • Conclusions: • We have implemented OpenMP on top of the Generic Lightweight Thread (GLT) library • We have analyzed GLTO and compared it with current OpenMP runtimes • GLTO improves some OpenMP scenarios by using lightweight threads instead of Pthreads • Future work: • To implement other high-level programming models on top of GLT • To analyze the interaction between GLTO and MPI
Thanks! Contact: adcastel@uji.es • Source code: github.com/adcastel/glto-runtime