ECE 1747 Parallel Programming Shared Memory: OpenMP Environment and Synchronization
What is OpenMP? • Standard for shared memory programming for scientific applications. • Has specific support for scientific application needs (unlike Pthreads). • Rapidly gaining acceptance among vendors and application writers. • See http://www.openmp.org for more info.
OpenMP API Overview • The API is a set of compiler directives inserted in the source program (in addition to some library functions). • Ideally, compiler directives do not affect sequential code. – pragmas in C/C++. – (special) comments in Fortran code.
OpenMP API Example (1 of 2) Sequential code: statement 1; statement 2; statement 3; Assume we want to execute statement 2 in parallel, and statements 1 and 3 sequentially.
OpenMP API Example (2 of 2) OpenMP parallel code: statement 1; #pragma <specific OpenMP directive> statement 2; statement 3; Statement 2 may be executed in parallel. Statements 1 and 3 are executed sequentially.
Important Note • By giving a parallel directive, the user asserts that the program will remain correct if the statement is executed in parallel. • The OpenMP compiler does not check correctness. • Some tools exist to help with that. • TotalView is a good parallel debugger.
API Semantics • The master thread executes the sequential code. • Master and slaves execute the parallel code. • Note: very similar to the fork-join semantics of the Pthreads create/join primitives.
OpenMP Implementation Overview • An OpenMP implementation consists of – a compiler, – a library. • Unlike Pthreads (purely a library).
OpenMP Example Usage (1 of 2) [Diagram] Annotated source → OpenMP compiler → sequential program or parallel program, depending on a compiler switch.
OpenMP Example Usage (2 of 2) • If you give the sequential switch, – comments and pragmas are ignored. • If you give the parallel switch, – comments and/or pragmas are read, and – cause translation into a parallel program. • Ideally, one source for both the sequential and the parallel program (a big maintenance plus).
OpenMP Directives • Parallelization directives: – parallel region – parallel for • Data environment directives: – shared, private, threadprivate, reduction, etc. • Synchronization directives: – barrier, critical
General Rules about Directives • They always apply to the next statement, which must be a structured block. • Examples – #pragma omp … statement – #pragma omp … { statement 1; statement 2; statement 3; }
OpenMP Parallel Region #pragma omp parallel • A number of threads are spawned at entry. • Each thread executes the same code. • Each thread waits at the end. • Very similar to a number of create/joins with the same function in Pthreads.
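A minimal sketch of these semantics (function name and the serial fallback are mine, not from the slides; compile with gcc -fopenmp): every thread in the team executes the region body exactly once, which we can verify by counting entries atomically.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* serial fallback so the sketch also builds without -fopenmp */
static int omp_get_num_threads(void) { return 1; }
#endif

/* Each thread in the team executes the region body exactly once,
 * so the atomic counter ends up equal to the team size. */
int count_team_size(void)
{
    int count = 0;
    int team = 1;
    #pragma omp parallel
    {
        #pragma omp atomic
        count++;
        #pragma omp single
        team = omp_get_num_threads();
    }
    return count == team;  /* 1 if every thread ran the body once */
}
```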
Getting Threads to do Different Things • Through explicit thread identification (as in Pthreads). • Through work-sharing directives.
Thread Identification int omp_get_thread_num() – gets the thread id. int omp_get_num_threads() – gets the total number of threads.
Example #pragma omp parallel { if( !omp_get_thread_num() ) master(); else worker(); }
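A runnable version of the slide's master/worker pattern (the flag variables and function bodies are my illustration; the slides leave master() and worker() abstract). Thread 0, whose id is 0, takes the master role; every other thread in the team becomes a worker.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* serial fallback so the sketch also builds without -fopenmp */
static int omp_get_thread_num(void) { return 0; }
#endif

static int master_ran;   /* set by thread 0 only        */
static int worker_calls; /* one per non-zero thread id  */

static void master(void) { master_ran = 1; }
static void worker(void)
{
    #pragma omp atomic
    worker_calls++;
}

/* !omp_get_thread_num() is true exactly for thread 0. */
void run_master_worker(void)
{
    master_ran = 0;
    worker_calls = 0;
    #pragma omp parallel
    {
        if( !omp_get_thread_num() )
            master();
        else
            worker();
    }
}
```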
Work Sharing Directives • Always occur within a parallel region directive. • Two principal ones are – parallel for – parallel section
OpenMP Parallel For #pragma omp parallel #pragma omp for for( … ) { … } • Each thread executes a subset of the iterations. • All threads wait at the end of the parallel for.
Multiple Work Sharing Directives • May occur within a single parallel region: #pragma omp parallel { #pragma omp for for( ; ; ) { … } #pragma omp for for( ; ; ) { … } } • All threads wait at the end of the first for.
The NoWait Qualifier #pragma omp parallel { #pragma omp for nowait for( ; ; ) { … } #pragma omp for for( ; ; ) { … } } • Threads proceed to the second for without waiting.
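A concrete sketch of nowait (function name and arrays are mine): the two loops write disjoint arrays, so a thread that finishes its share of the first loop may start the second immediately. Correctness is unaffected here precisely because the loops are independent.

```c
#define N 100

/* The two loops touch disjoint arrays, so the barrier at the end of
 * the first work-sharing loop can be dropped with nowait. */
void fill_independent(int a[N], int b[N])
{
    #pragma omp parallel
    {
        #pragma omp for nowait
        for( int i = 0; i < N; i++ )
            a[i] = i;
        #pragma omp for
        for( int i = 0; i < N; i++ )
            b[i] = 2 * i;
    }
}
```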
Parallel Sections Directive #pragma omp parallel { #pragma omp sections { {…} #pragma omp section /* this is a delimiter */ {…} #pragma omp section {…} … } }
A Useful Shorthand #pragma omp parallel #pragma omp for for( ; ; ) { … } is equivalent to #pragma omp parallel for for( ; ; ) { … } (Same for parallel sections.)
Note the Difference between. . . #pragma omp parallel { #pragma omp for for( ; ; ) { … } f(); #pragma omp for for( ; ; ) { … } }
… and. . . #pragma omp parallel for for( ; ; ) { … } f(); #pragma omp parallel for for( ; ; ) { … }
Sequential Matrix Multiply for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=0; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; }
OpenMP Matrix Multiply #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=0; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; }
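A self-contained version of the slide's matrix multiply (the fixed size MMN and function name are mine, for illustration): parallelizing the outer i loop gives each thread whole rows of c, and since no two threads write the same c[i][j], no synchronization beyond the implicit end-of-loop barrier is needed.

```c
#define MMN 3  /* small fixed size for the sketch */

/* Parallelizes the outer i loop; each thread computes entire rows
 * of c, so iterations are fully independent. */
void matmul(double a[MMN][MMN], double b[MMN][MMN], double c[MMN][MMN])
{
    #pragma omp parallel for
    for( int i = 0; i < MMN; i++ )
        for( int j = 0; j < MMN; j++ ) {
            c[i][j] = 0.0;
            for( int k = 0; k < MMN; k++ )
                c[i][j] += a[i][k] * b[k][j];
        }
}
```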
Sequential SOR for some number of timesteps/iterations { for( i=1; i<n; i++ ) for( j=1; j<n; j++ ) temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] ); for( i=1; i<n; i++ ) for( j=1; j<n; j++ ) grid[i][j] = temp[i][j]; }
OpenMP SOR for some number of timesteps/iterations { #pragma omp parallel for for( i=1; i<n; i++ ) for( j=1; j<n; j++ ) temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] ); #pragma omp parallel for for( i=1; i<n; i++ ) for( j=1; j<n; j++ ) grid[i][j] = temp[i][j]; }
Equivalent OpenMP SOR for some number of timesteps/iterations { #pragma omp parallel { #pragma omp for for( i=1; i<n; i++ ) for( j=1; j<n; j++ ) temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] ); #pragma omp for for( i=1; i<n; i++ ) for( j=1; j<n; j++ ) grid[i][j] = temp[i][j]; } }
Some Advanced Features • Conditional parallelism. • Scheduling options. (More can be found in the specification)
Conditional Parallelism: Issue • Oftentimes, parallelism is only useful if the problem size is sufficiently big. • For smaller sizes, overhead of parallelization exceeds benefit.
Conditional Parallelism: Specification #pragma omp parallel if( expression ) #pragma omp for if( expression ) #pragma omp parallel for if( expression ) • Execute in parallel if expression is true, otherwise execute sequentially.
Conditional Parallelism: Example for( i=0; i<n; i++ ) #pragma omp parallel for if( n-i > 100 ) for( j=i+1; j<n; j++ ) for( k=i+1; k<n; k++ ) a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];
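A simpler sketch of the if clause (function name and the threshold 100 are my illustration, not a tuned value): the loop runs in parallel only when n is large enough to pay for the fork/join overhead, and produces the same result either way.

```c
/* Spawns a team only when n exceeds the (arbitrary) threshold;
 * for small n the loop executes sequentially on the master thread. */
void scale(double *a, int n, double factor)
{
    #pragma omp parallel for if( n > 100 )
    for( int i = 0; i < n; i++ )
        a[i] *= factor;
}
```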
Scheduling of Iterations: Issue • Scheduling: assigning iterations to threads. • So far, we have assumed the default, which is block scheduling. • OpenMP allows other scheduling strategies as well, for instance cyclic, gss (guided self-scheduling), etc.
Scheduling of Iterations: Specification #pragma omp parallel for schedule(<sched>) • <sched> can be one of – block (default) – cyclic – gss
Example • Multiplication of two matrices C = A x B, where the A matrix is upper-triangular (all elements below the diagonal are 0). [Figure: upper-triangular matrix A]
Sequential Matrix Multiply Becomes for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=i; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; } Load imbalance with block distribution.
OpenMP Matrix Multiply #pragma omp parallel for schedule( cyclic ) for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=i; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; }
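A note and a sketch (the fixed size TN and function name are mine): the schedule kinds named in these slides map onto standard OpenMP as block = schedule(static), cyclic = schedule(static,1), and gss = schedule(guided); schedule(cyclic) itself is not accepted by standard compilers. Row i of the triangular multiply does n-i units of work, so a round-robin assignment of rows balances the load.

```c
#define TN 4  /* small fixed size for the sketch */

/* Upper-triangular matmul: early rows are much more expensive than
 * late ones, so schedule(static, 1) -- the standard spelling of the
 * deck's "cyclic" -- deals rows to threads round-robin. */
void tri_matmul(double a[TN][TN], double b[TN][TN], double c[TN][TN])
{
    #pragma omp parallel for schedule(static, 1)
    for( int i = 0; i < TN; i++ )
        for( int j = 0; j < TN; j++ ) {
            c[i][j] = 0.0;
            for( int k = i; k < TN; k++ )
                c[i][j] += a[i][k] * b[k][j];
        }
}
```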
Data Environment Directives (1 of 2) • All variables are by default shared. • One exception: the loop variable of a parallel for is private. • By using data directives, some variables can be made private or given other special characteristics.
Reminder: Matrix Multiply #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=0; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; } • a, b, c are shared • i, j, k are private
Data Environment Directives (2 of 2) • Private • Threadprivate • Reduction
Private Variables #pragma omp parallel for private( list ) • Makes a private copy for each thread for each variable in the list. • This and all further examples are with parallel for, but same applies to other region and work-sharing directives.
Private Variables: Example (1 of 2) for( i=0; i<n; i++ ) { tmp = a[i]; a[i] = b[i]; b[i] = tmp; } • Swaps the values in a and b. • Loop-carried dependence on tmp. • Easily fixed by privatizing tmp.
Private Variables: Example (2 of 2) #pragma omp parallel for private( tmp ) for( i=0; i<n; i++ ) { tmp = a[i]; a[i] = b[i]; b[i] = tmp; } • Removes dependence on tmp. • Would be more difficult to do in Pthreads.
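A self-contained version of the privatized swap (function name is mine): with private(tmp), each thread works on its own copy of the scalar, so iterations no longer race through the shared temporary.

```c
/* Each thread gets its own copy of tmp, removing the loop-carried
 * dependence on the shared scalar in the sequential version. */
void swap_arrays(int *a, int *b, int n)
{
    int tmp;
    #pragma omp parallel for private(tmp)
    for( int i = 0; i < n; i++ ) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
}
```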
Private Variables: Alternative 1 for( i=0; i<n; i++ ) { tmp[i] = a[i]; a[i] = b[i]; b[i] = tmp[i]; } • Requires sequential program change. • Wasteful in space, O(n) vs. O(p).
Private Variables: Alternative 2 f() { for( i=from; i<to; i++ ) { int tmp; /* local allocation on stack */ tmp = a[i]; a[i] = b[i]; b[i] = tmp; } }
Threadprivate • Private variables are private on a parallel region basis. • Threadprivate variables are global variables that are private throughout the execution of the program.
Threadprivate #pragma omp threadprivate( list ) Example: #pragma omp threadprivate( x ) • Requires program change in Pthreads. • Requires an array of size p. • Access as x[pthread_self()]. • Costly if accessed frequently. • Not cheap in OpenMP either.
Reduction Variables #pragma omp parallel for reduction( op: list ) • op is one of +, *, -, &, ^, |, &&, or || • The variables in list must be used with this operator in the loop. • The variables are automatically initialized to sensible values.
Reduction Variables: Example #pragma omp parallel for reduction( +: sum ) for( i=0; i<n; i++ ) sum += a[i]; • Sum is automatically initialized to zero.
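A self-contained version of the slide's sum (function name is mine): each thread accumulates into a private copy of sum initialized to 0, the identity of +, and the partial sums are combined when the loop ends.

```c
/* reduction(+: sum) gives each thread a private partial sum
 * (initialized to 0) and combines the partials at loop end. */
long array_sum(const int *a, int n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+: sum)
    for( int i = 0; i < n; i++ )
        sum += a[i];
    return sum;
}
```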
SOR Sequential Code with Convergence for( ; diff > delta; ) { for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { … } diff = 0; for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { diff = max(diff, fabs(grid[i][j] - temp[i][j])); grid[i][j] = temp[i][j]; } }
SOR OpenMP Code with Convergence for( ; diff > delta; ) { #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { … } diff = 0; #pragma omp parallel for reduction( max: diff ) for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { diff = max(diff, fabs(grid[i][j] - temp[i][j])); grid[i][j] = temp[i][j]; } }
SOR OpenMP Code with Convergence for( ; diff > delta; ) { #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { … } diff = 0; #pragma omp parallel for reduction( max: diff ) for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { diff = max(diff, fabs(grid[i][j] - temp[i][j])); grid[i][j] = temp[i][j]; } } Bummer: no reduction operator for max or min.
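The classic workaround for the missing max reduction, as a sketch (function name and the manual absolute value are mine): each thread keeps a private running maximum and merges it once, inside a critical section. Note that OpenMP 3.1 later added max/min reductions for C, so on modern compilers reduction(max: diff) does work as written on the slide.

```c
/* Workaround when reduction(max: ...) is unavailable: a private
 * per-thread maximum, merged once under a critical section. */
double max_abs_diff(const double *x, const double *y, int n)
{
    double diff = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;  /* declared in the region: private */
        #pragma omp for nowait
        for( int i = 0; i < n; i++ ) {
            double d = x[i] - y[i];
            if( d < 0 ) d = -d;          /* manual fabs */
            if( d > local ) local = d;
        }
        #pragma omp critical
        if( local > diff ) diff = local; /* one merge per thread */
    }
    return diff;
}
```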
Synchronization Primitives • Critical #pragma omp critical name – Implements critical sections, by name. Similar to Pthreads mutex locks (name ~ lock). • Barrier #pragma omp barrier – Implements a global barrier.
Synchronization Primitives • Big bummer: no condition variables. • Result: must busy wait for condition synchronization. • Clumsy. • Very inefficient on some architectures.
Code Restructuring Example 1 for( i=0; i<n; i++ ) { tmp = a[i]; a[i] = b[i]; b[i] = tmp; } • Dependence on tmp.
Scalar Privatization #pragma omp parallel for private(tmp) for( i=0; i<n; i++ ) { tmp = a[i]; a[i] = b[i]; b[i] = tmp; } • Dependence on tmp is removed.
Example 2 for( i=0, sum=0; i<n; i++ ) sum += a[i]; • Dependence on sum.
Reduction #pragma omp parallel for reduction(+: sum) for( i=0, sum=0; i<n; i++ ) sum += a[i]; • Dependence on sum is removed.
Example 3 for( i=0, index=0; i<n; i++ ) { index += i; a[i] = b[index]; } • Dependence on index. • Induction variable: can be computed from loop variable.
Induction Variable Elimination #pragma omp parallel for for( i=0; i<n; i++ ) { a[i] = b[i*(i+1)/2]; } • Dependence removed by computing the induction variable from the loop variable.
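A self-contained version (function name is mine): after i iterations the running value index = 0+1+...+i equals i*(i+1)/2, so each iteration can compute its own index without reading the previous one, and the loop parallelizes directly.

```c
/* index = 0+1+...+i has the closed form i*(i+1)/2, so the
 * loop-carried dependence disappears. b must hold at least
 * n*(n-1)/2 + 1 elements. */
void gather_triangular(int *a, const int *b, int n)
{
    #pragma omp parallel for
    for( int i = 0; i < n; i++ )
        a[i] = b[i * (i + 1) / 2];
}
```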
Example 4 for( i=0, index=0; i<n; i++ ) { index += f(i); b[i] = g(a[index]); } • Dependence on induction variable index, but no closed formula for its value.
Loop Splitting for( i=0; i<n; i++ ) { index[i] = (i ? index[i-1] : 0) + f(i); } #pragma omp parallel for for( i=0; i<n; i++ ) { b[i] = g(a[index[i]]); } • Loop splitting has removed the dependence: the first (sequential) loop records every iteration's index value, and the second loop then runs in parallel.
Example 5 for( k=0; k<n; k++ ) for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) a[i][j] += b[i][k] + c[k][j]; • Dependence on a[i][j] prevents k-loop parallelization. • No dependencies carried by i- and j-loops.
Example 5 Parallelization for( k=0; k<n; k++ ) #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) a[i][j] += b[i][k] + c[k][j]; • We can do better by reordering the loops.
Loop Reordering #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) for( k=0; k<n; k++ ) a[i][j] += b[i][k] + c[k][j]; • Larger parallel pieces of work.
Example 6 #pragma omp parallel for for( i=0; i<n; i++ ) a[i] = b[i]; #pragma omp parallel for for( i=0; i<n; i++ ) c[i] = b[i]*b[i]; • Make two parallel loops into one.
Loop Fusion #pragma omp parallel for for( i=0; i<n; i++ ) { a[i] = b[i]; c[i] = b[i]*b[i]; } • Reduces loop startup overhead.
Example 7: While Loops while( *a) { process(a); a++; } • The number of loop iterations is unknown.
Special Case of Loop Splitting for( count=0, p=a; *p; count++, p++ ); #pragma omp parallel for for( i=0; i<count; i++ ) process( a[i] ); • Count the number of loop iterations. • Then parallelize the loop.
Example 8 for( i=0, wrap=n; i<n; i++ ) { b[i] = a[i] + a[wrap]; wrap = i; } • Dependence on wrap. • Only first iteration causes dependence.
Loop Peeling b[0] = a[0] + a[n]; #pragma omp parallel for for( i=1; i<n; i++ ) { b[i] = a[i] + a[i-1]; }
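A self-contained version of the peeled loop (function name is mine; as in the slide, a must hold n+1 elements because iteration 0 reads a[n]): only iteration 0 uses the wrap value, and every later iteration i reads a[i-1], so peeling iteration 0 leaves a fully parallel loop.

```c
/* Iteration 0 reads the "wrap" element a[n]; iterations 1..n-1
 * read a[i-1]. Peeling iteration 0 removes the dependence. */
void wrap_add(double *b, const double *a, int n)
{
    b[0] = a[0] + a[n];          /* peeled first iteration */
    #pragma omp parallel for
    for( int i = 1; i < n; i++ )
        b[i] = a[i] + a[i-1];
}
```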
Example 9 for( i=0; i<n; i++ ) a[i+m] = a[i] + b[i]; • Dependence if m<n.
Another Case of Loop Peeling if( m > n ) { #pragma omp parallel for for( i=0; i<n; i++ ) a[i+m] = a[i] + b[i]; } else { … cannot be parallelized }
Summary • Reorganize code such that – dependences are removed or reduced – large pieces of parallel work emerge – loop bounds become known –… • Code can become messy … there is a point of diminishing returns.