
Introduction to OpenMP
July 24, 2012
© copyright 2012, Clayton S. Ferner, UNC Wilmington

Shared-Memory Systems

[Diagram: processors connect through a bus interface and processor/memory bus to a memory controller and shared memory.]

All processors can access all of the shared memory.

OpenMP

OpenMP uses compiler directives (similar to Paraguin) to parallelize a program. The programmer inserts #pragma statements into the sequential program to tell the compiler how to parallelize it. This is a higher level of abstraction than pthreads or Java threads. OpenMP was standardized in the late 1990s, and gcc supports it.

Getting Started

To begin

Syntax:

    #pragma omp parallel
    structured_block

omp indicates that the pragma is an OpenMP pragma (other compilers will ignore it).
parallel indicates the directive ("parallel" indicates the start of a parallel region).
structured_block will be either a single statement (such as a for loop) or a block of statements.

A parallel region indicates sections of code that are executed by all threads. At the end of a parallel region, all threads synchronize as if there were a barrier. Code outside a parallel region is executed by the master thread only.

[Diagram: the master thread enters the parallel region, multiple threads execute it, and the threads synchronize at the end of the region.]

Hello World

    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        #pragma omp parallel
        {
            printf("Hello World from thread = %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
    }

Very important: the opening brace must be on a new line.

Compiling and Output

The -fopenmp flag tells gcc to interpret OpenMP directives:

    $ gcc -fopenmp hello.c -o hello
    $ ./hello
    Hello World from thread = 2 of 4
    Hello World from thread = 0 of 4
    Hello World from thread = 3 of 4
    Hello World from thread = 1 of 4
    $

Execution

omp_get_thread_num() – gets the current thread's number
omp_get_num_threads() – gets the total number of threads

The names of these two functions are similar and easy to confuse.

Execution

There are 3 ways to indicate how many threads you want:

Use num_threads within the directive, e.g.
    #pragma omp parallel num_threads(5)

Use the omp_set_num_threads function, e.g.
    omp_set_num_threads(6);

Use the OMP_NUM_THREADS environment variable, e.g.
    $ export OMP_NUM_THREADS=8
    $ ./hello

Shared versus Private Data

Shared versus Private Data

    int main(int argc, char *argv[])
    {
        int x;      /* x is shared by all threads */
        int tid;    /* tid is private -- each thread has its own copy */

        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();
            if (tid == 0) x = 42;
            printf("Thread %d, x = %d\n", tid, x);
        }
    }

Variables declared outside the parallel construct are shared unless otherwise specified.

Shared versus Private Data

    $ ./data
    Thread 3, x = 0
    Thread 2, x = 0
    Thread 1, x = 0
    Thread 0, x = 42
    Thread 4, x = 42
    Thread 5, x = 42
    Thread 6, x = 42
    Thread 7, x = 42

tid has a separate value for each thread; x has the same value for each thread (well… almost).

Another Example: Shared versus Private

    #pragma omp parallel private(tid, n)
    {
        tid = omp_get_thread_num();   /* tid and n are private */
        n = omp_get_num_threads();
        a[tid] = 10*n;                /* a[] is shared */
    }

Or, optionally:

    #pragma omp parallel private(tid, n) shared(a)
    ...

Private Variables

private clause – creates private copies of variables for each thread
firstprivate clause – as the private clause, but initializes each copy to the value the variable had immediately prior to the parallel construct
lastprivate clause – as private, but "the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable's original object."

Work Sharing Constructs

Specifying Work Inside a Parallel Region

There are 4 constructs:

section – each section is executed by a different thread
for – each iteration is executed by a (potentially) different thread
single – executed by a single thread (sequential)
master – executed by the master thread only (sequential)

There is a barrier after each construct (except master) unless a nowait clause is given. These must be used within a parallel region.

Sections

Syntax:

    #pragma omp parallel            /* enclosing parallel region */
    {
        #pragma omp sections
        {
            #pragma omp section     /* sections executed by available threads */
            structured_block
            ...
        }
    }

Sections Example

    #pragma omp parallel shared(a, b, c, d, nthreads) private(i, tid)
    {
        tid = omp_get_thread_num();
        #pragma omp sections nowait    /* threads do not wait after finishing a section */
        {
            #pragma omp section        /* one thread does this */
            {
                printf("Thread %d doing section 1\n", tid);
                for (i = 0; i < N; i++) {
                    c[i] = a[i] + b[i];
                    printf("Thread %d: c[%d] = %f\n", tid, i, c[i]);
                }
            }

            #pragma omp section        /* another thread does this */
            {
                printf("Thread %d doing section 2\n", tid);
                for (i = 0; i < N; i++) {
                    d[i] = a[i] * b[i];
                    printf("Thread %d: d[%d] = %f\n", tid, i, d[i]);
                }
            }
        } /* end of sections */
        printf("Thread %d done\n", tid);
    } /* end of parallel section */

Sections Output

    Thread 0 doing section 1
    Thread 0: c[0] = 5.000000
    Thread 0: c[1] = 7.000000
    Thread 0: c[2] = 9.000000
    Thread 0: c[3] = 11.000000
    Thread 0: c[4] = 13.000000
    Thread 3 done
    Thread 2 done
    Thread 1 doing section 2
    Thread 1: d[0] = 0.000000
    Thread 1: d[1] = 6.000000
    Thread 1: d[2] = 14.000000
    Thread 1: d[3] = 24.000000
    Thread 0 done
    Thread 1: d[4] = 36.000000
    Thread 1 done

Threads do not wait (i.e. there is no barrier).

Sections Output

    Thread 0 doing section 1
    Thread 0: c[0] = 5.000000
    Thread 0: c[1] = 7.000000
    Thread 0: c[2] = 9.000000
    Thread 0: c[3] = 11.000000
    Thread 0: c[4] = 13.000000
    Thread 3 doing section 2
    Thread 3: d[0] = 0.000000
    Thread 3: d[1] = 6.000000
    Thread 3: d[2] = 14.000000
    Thread 3: d[3] = 24.000000
    Thread 3: d[4] = 36.000000
    Thread 3 done
    Thread 1 done
    Thread 2 done
    Thread 0 done

If we remove the nowait, then there is a barrier at the end of the sections: threads wait until they are all done with their sections.

Parallel For

Syntax:

    #pragma omp parallel      /* enclosing parallel region */
    {
        #pragma omp for       /* different iterations executed by available threads */
        for (i = 0; i < N; i++) {
            ...
        }
    }

Must be a simple C for loop, where the lower bound and upper bound are constants.

Parallel For Example

    #pragma omp parallel shared(a, b, c, nthreads) private(i, tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp for    /* without "nowait", threads wait after finishing the loop */
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("Thread %d: i = %d, c[%d] = %f\n", tid, i, i, c[i]);
        }
    } /* end of parallel section */

Parallel For Output

    Thread 1 starting...
    Thread 1: i = 2, c[2] = 9.000000
    Thread 1: i = 3, c[3] = 11.000000
    Thread 2 starting...
    Thread 2: i = 4, c[4] = 13.000000
    Thread 3 starting...
    Number of threads = 4
    Thread 0 starting...
    Thread 0: i = 0, c[0] = 5.000000
    Thread 0: i = 1, c[1] = 7.000000

Iterations of the loop are mapped to threads; in this example, contiguous blocks of iterations are mapped to each thread. There is a barrier here at the end of the loop.

Combining Directives

If a parallel region consists of only one parallel for or parallel sections, they can be combined:

    #pragma omp parallel sections
    #pragma omp parallel for

Combining Directives Example

    /* declares a parallel region and a parallel for */
    #pragma omp parallel for shared(a, b, c, nthreads) private(i, tid)
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

Scheduling a Parallel For

By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to available threads (static mapping):

    Thread 1 starting...
    Thread 1: i = 2, c[2] = 9.000000
    Thread 1: i = 3, c[3] = 11.000000
    Thread 2 starting...
    Thread 2: i = 4, c[4] = 13.000000
    Thread 3 starting...
    Number of threads = 4
    Thread 0 starting...
    Thread 0: i = 0, c[0] = 5.000000
    Thread 0: i = 1, c[1] = 7.000000

Each thread gets the default chunk size; there is a barrier at the end of the loop.

Scheduling a Parallel For

Static – partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion.

    #pragma omp parallel for schedule(static, chunk_size)

Dynamic – uses an internal work queue. A chunk-sized block of iterations is assigned to each thread as it becomes available.

    #pragma omp parallel for schedule(dynamic, chunk_size)

Scheduling a Parallel For

Guided – similar to dynamic, but the chunk size starts large and gets smaller to reduce the time threads spend going back to the work queue.

    #pragma omp parallel for schedule(guided)

Runtime – uses the OMP_SCHEDULE environment variable to specify which of static, dynamic, or guided should be used.

    #pragma omp parallel for schedule(runtime)

Question

Guided scheduling is similar to static except that the chunk sizes start large and get smaller. What is the advantage of using guided versus static?

Answer: guided improves load balance.

Reduction

A reduction applies a commutative operator to an aggregate of values, creating a single value (similar to MPI_Reduce).

    sum = 0;
    #pragma omp parallel for reduction(+: sum)    /* operation : variable */
    for (k = 0; k < 100; k++) {
        sum = sum + funct(k);
    }

A private copy of sum is created for each thread by the compiler. Each private copy is added to sum at the end. This eliminates the need for critical sections here.

Single

    #pragma omp parallel
    {
        ...
        #pragma omp single
        structured_block
        ...
    }

Only one thread executes this section; there is no guarantee of which one.

Single Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d starting...\n", tid);
        #pragma omp single
        {
            printf("Thread %d doing work\n", tid);
            ...
        } /* end of single */
        printf("Thread %d done\n", tid);
    } /* end of parallel section */

Single Results

    Thread 0 starting...
    Thread 0 doing work
    Thread 3 starting...
    Thread 2 starting...
    Thread 1 starting...
    Thread 0 done
    Thread 1 done
    Thread 2 done
    Thread 3 done

Only one thread executes the section. "nowait" was NOT specified, so the threads wait for that one thread to finish (barrier here).

Master

    #pragma omp parallel
    {
        ...
        #pragma omp master
        structured_block
        ...
    }

Only one thread (the master) executes this section. You cannot specify "nowait" here; there is no barrier after this block, so threads will NOT wait.

Master Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d starting...\n", tid);
        #pragma omp master
        {
            printf("Thread %d doing work\n", tid);
            ...
        } /* end of master */
        printf("Thread %d done\n", tid);
    } /* end of parallel section */

Is there any difference between these two approaches?

Master directive:

    #pragma omp parallel
    {
        ...
        #pragma omp master
        structured_block
        ...
    }

Using an if statement:

    #pragma omp parallel private(tid)
    {
        ...
        tid = omp_get_thread_num();
        if (tid == 0)
            structured_block
        ...
    }

Synchronization

Critical Section

A critical section implies mutual exclusion: only one thread is allowed to enter the critical section at a time.

    #pragma omp parallel
    {
        ...
        #pragma omp critical (name)    /* name is optional */
        structured_block
        ...
    }

Critical Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d starting...\n", tid);
        #pragma omp critical (myCS)
        {
            printf("Thread %d in critical section\n", tid);
            sleep(1);
            printf("Thread %d finishing critical section\n", tid);
        } /* end of critical */
        printf("Thread %d done\n", tid);
    } /* end of parallel section */

Critical Results

    Thread 0 starting...
    Thread 0 in critical section
    Thread 3 starting...
    Thread 1 starting...
    Thread 2 starting...
    Thread 0 finishing critical section
    Thread 0 done
    Thread 3 in critical section
    Thread 3 finishing critical section
    Thread 3 done
    Thread 2 in critical section
    Thread 2 finishing critical section
    Thread 2 done
    Thread 1 in critical section
    Thread 1 finishing critical section
    Thread 1 done

There is a 1-second delay while each thread holds the critical section.

Atomic

If the critical section is a simple update of a variable, then atomic is more efficient. It ensures mutual exclusion for the statement.

    #pragma omp parallel
    {
        ...
        #pragma omp atomic
        expression_statement
        ...
    }

Must be a simple statement of the form:

    x = expression
    x += expression
    x -= expression
    ...
    x++;
    x--;
    ...

Barrier

Threads will wait at a barrier until all threads have reached the same barrier. All threads must be able to reach the barrier (i.e. be careful about placing the barrier inside an if statement where some threads may not execute it).

    #pragma omp parallel
    {
        ...
        #pragma omp barrier
        ...
    }

Barrier Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d starting...\n", tid);
        #pragma omp single nowait    /* no barrier at the end of the single block */
        {
            printf("Thread %d busy doing work...\n", tid);
            sleep(10);
        }
        printf("Thread %d reached barrier\n", tid);
        #pragma omp barrier          /* threads wait here, not at the single */
        printf("Thread %d done\n", tid);
    } /* end of parallel section */

Barrier Results

    Thread 3 starting...
    Thread 0 starting...
    Thread 0 reached barrier
    Thread 2 starting...
    Thread 2 reached barrier
    Thread 1 starting...
    Thread 1 reached barrier
    Thread 3 busy doing work...
    Thread 3 reached barrier
    Thread 3 done
    Thread 0 done
    Thread 2 done
    Thread 1 done

Thread 3 sleeps for 10 seconds, so there is a 10-second delay before the "done" lines.

Flush

A synchronization point which causes threads to have a "consistent" view of certain (or all) shared variables in memory. All current read and write operations on the variables are allowed to complete, and values are written back to memory, but any memory operations in code after the flush are not started.

Format:

    #pragma omp flush (variable_list)

Flush

A flush applies only to the thread executing it, not to all threads in the team. (So not all threads have to execute the flush.)

A flush occurs automatically at entry to and exit from parallel and critical directives, and at the exit of for, sections, and single (if a nowait clause is not present).

More information

http://openmp.org/wp/

Questions