Chapter 8: Programming with Shared Memory

Chapter 8: Programming with Shared Memory
• Shared Memory Multiprocessors
• Programming Shared Memory Multiprocessors
• Process-Based Programming: UNIX Processes
• Parallel Programming Languages and Constructs
• Directives-Based Programming: OpenMP
• Performance Issues

Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved.

Shared Memory Multiprocessor System
• Any memory location is accessible by any of the processors.
• A single address space exists: each memory location is given a unique address within a single range of addresses.
• A common architecture is the single-bus architecture.

Address Space Organization: UMA and NUMA
• UMA: Uniform Memory Access
  • Each processor has equal access to all of the system memory.
  • Each processor also maintains its own on-board data caches.
  • Cache coherence and bus contention limit scalability.
• NUMA: Non-Uniform Memory Access
  • Model designed in the 1990s to increase scalability while assuring memory coherency and integrity.
  • Each processor has direct access to a private area of main memory.
  • Remote memory access is slower than local memory access.
  • A processor accesses the private memory of other processors using the system bus.

Shared Memory vs Distributed Memory

Shared memory:
• Attractive to the programmer: convenient for sharing data.
• Data sharing is mediated via the shared memory.
• Synchronization is achieved via shared variables (e.g., semaphores).
• Synchronization mechanisms (which can increase the execution time of a parallel program) are necessary: concurrent access to data must be controlled.
• Employs a single address space (the shared memory).

Distributed memory:
• Scales more easily than shared memory.
• Data sharing is achieved through message passing.
• Synchronization is achieved through message passing.
• Message passing is error-prone, makes programs difficult to debug, and data must be copied instead of shared.
• Each processor has its own local memory.

Programming Shared Memory Multiprocessors
• Using heavyweight processes
  • Typically assumes all data associated with a process is private.
  • Important for providing protection in multiuser systems.
  • This protection is not necessary for processes cooperating to solve a problem.
  • Example: UNIX processes.
• Using threads
  • Examples: Pthreads, Java threads.
• Using a completely new programming language designed for parallel programming
  • Requires learning a new language from scratch; not popular these days.
  • Example: Ada.

Programming Shared Memory (cont'd)
• Using library routines with an existing sequential programming language.
• Modifying the syntax of an existing sequential programming language to create a parallel programming language.
  • Example: UPC.
• Using an existing sequential programming language supplemented with compiler directives for specifying parallelism.
  • Example: OpenMP.

Using Heavyweight Processes
• Operating systems are often based upon the notion of a process.
• Processor time is shared between processes, switching from one process to another. This might occur at regular intervals or when an active process becomes delayed.
• Offers the opportunity to deschedule processes blocked from proceeding for some reason, e.g., waiting for an I/O operation to complete.
• The concept could be used for parallel programming. It is not much used because of the overhead, but the fork/join concepts are used elsewhere.

FORK-JOIN Construct

UNIX System Calls
• SPMD model with different code for the master process and the forked slave process.

Unix Processes: Example Program
• To sum the elements of an array, a[1000]:

    int sum, a[1000];
    sum = 0;
    for (i = 0; i < 1000; i++)
        sum = sum + a[i];

Unix Processes: Example (cont'd)
• The calculation will be divided into two parts, one doing the even i and one doing the odd i:

    Process 1:
    sum1 = 0;
    for (i = 0; i < 1000; i = i + 2)
        sum1 = sum1 + a[i];

    Process 2:
    sum2 = 0;
    for (i = 1; i < 1000; i = i + 2)
        sum2 = sum2 + a[i];

• Each process will add its result (sum1 or sum2) to an accumulating result, sum:

    sum = sum + sum1;    /* process 1 */
    sum = sum + sum2;    /* process 2 */

• sum will need to be shared and protected by a lock. A shared data structure is created:

Shared Memory Locations for UNIX Program Example

Unix Processes: Example (cont'd)

    *sum = 0;
    for (i = 0; i < array_size; i++)
        *(a + i) = i + 1;

    pid = fork();
    if (pid == 0) {                     /* child: sums even-indexed elements */
        partial_sum = 0;
        for (i = 0; i < array_size; i = i + 2)
            partial_sum += *(a + i);
    } else {                            /* parent: sums odd-indexed elements */
        partial_sum = 0;
        for (i = 1; i < array_size; i = i + 2)
            partial_sum += *(a + i);
    }

    P(&s);                              /* lock */
    *sum += partial_sum;                /* add partial sum to the shared total */
    V(&s);                              /* unlock */
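For context, a minimal self-contained sketch of the complete program follows. It is not from the slides: it assumes System V shared memory (shmget/shmat) for sum and the array, and a System V semaphore implementing the P() and V() operations used above; the helper names P and V, the segment layout, and the array size are my choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/sem.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define ARRAY_SIZE 1000

    static int semid;                      /* semaphore protecting *sum */

    static void P(void) {                  /* wait (lock) */
        struct sembuf op = {0, -1, 0};
        semop(semid, &op, 1);
    }
    static void V(void) {                  /* signal (unlock) */
        struct sembuf op = {0, +1, 0};
        semop(semid, &op, 1);
    }

    int main(void) {
        /* one shared segment holds sum followed by the array */
        int shmid = shmget(IPC_PRIVATE, (ARRAY_SIZE + 1) * sizeof(int), IPC_CREAT | 0600);
        int *sum = (int *)shmat(shmid, NULL, 0);
        int *a = sum + 1;
        int i, partial_sum;

        semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        semctl(semid, 0, SETVAL, 1);       /* binary semaphore, initially 1 */

        *sum = 0;
        for (i = 0; i < ARRAY_SIZE; i++)
            a[i] = i + 1;

        pid_t pid = fork();
        partial_sum = 0;
        if (pid == 0)                      /* child: even indices */
            for (i = 0; i < ARRAY_SIZE; i += 2) partial_sum += a[i];
        else                               /* parent: odd indices */
            for (i = 1; i < ARRAY_SIZE; i += 2) partial_sum += a[i];

        P(); *sum += partial_sum; V();     /* critical section */

        if (pid == 0) { shmdt(sum); exit(0); }
        wait(NULL);                        /* parent waits for child */
        printf("sum = %d\n", *sum);        /* expect 500500 */
        shmdt(sum);
        shmctl(shmid, IPC_RMID, NULL);
        semctl(semid, 0, IPC_RMID);
        return 0;
    }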

Language Constructs for Parallelism
• A few specific language constructs for shared memory parallel programs are:
• Shared data may be specified as shared, for example:
    shared int x;
• Different concurrent statements may be introduced using the par statement:
    par { S1; S2; ...; Sn }
• Similar concurrent statements may be introduced using the forall statement:
    forall (i = 0; i < n; i++) { body }
  which generates n concurrent instances of the body, one for each value of i.

Dependency Analysis
• An automatic technique used by parallelizing compilers to detect which processes/statements may be executed in parallel.
• In some cases it is easy to see that there are no dependencies, e.g.
    forall (i = 0; i < 5; i++)
        a[i] = 0;
• In general, dependencies may not be that obvious.
• An algorithmic way of recognizing dependencies is needed for automation by compilers.
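As a quick illustration (not from the slides; the wrapper function, array size and arithmetic are my choices), the first loop below carries a dependency between iterations and cannot safely be parallelized as written, while the second has fully independent iterations:

    #define N 1000

    void dependency_demo(double a[N], double b[N]) {
        /* Loop-carried dependency: iteration i reads a[i-1], which is written
           by iteration i-1, so these iterations cannot run in parallel as-is. */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];

        /* Independent iterations: each touches only its own a[i] and b[i],
           so a parallelizing compiler (or forall / omp for) may run them in parallel. */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i];
    }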

Bernstein Conditions
• A set of sufficient conditions for two processes to be executed simultaneously (a concurrent-read/exclusive-write situation).
• Define two sets of memory locations for each process Pi:
  • Ii – the set of memory locations read by process Pi
  • Oi – the set of memory locations written by process Pi
• Two processes P1 and P2 may be executed simultaneously if
    I1 ∩ O2 = ∅
    I2 ∩ O1 = ∅
    O1 ∩ O2 = ∅
• In words: the writing locations are disjoint, and neither process reads from a location where the other process writes.

Bernstein Conditions: An Example
• Consider the following pair of statements (in C):
    a = x + y;
    b = x + z;
• From these we have
    I1 = {x, y}    O1 = {a}
    I2 = {x, z}    O2 = {b}
• The three conditions are satisfied:
    I1 ∩ O2 = ∅    I2 ∩ O1 = ∅    O1 ∩ O2 = ∅
• Hence, the statements a = x + y; and b = x + z; can be executed simultaneously.
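For instance, once the Bernstein conditions are known to hold, the two statements could be assigned to different threads, here using OpenMP sections (a sketch of mine, not part of the original slide; the variable values are arbitrary):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int x = 1, y = 2, z = 3, a, b;

        /* The two statements satisfy the Bernstein conditions,
           so they can safely run on different threads. */
        #pragma omp parallel sections shared(x, y, z, a, b)
        {
            #pragma omp section
            a = x + y;
            #pragma omp section
            b = x + z;
        }
        printf("a = %d, b = %d\n", a, b);   /* a = 3, b = 4 */
        return 0;
    }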

Directives-Based Parallel Programming: OpenMP
• An API for programming shared memory multiprocessors.
• The OpenMP API consists of three primary components:
  • Compiler directives
  • Runtime library routines
  • Environment variables
• OpenMP is standardized: jointly defined and endorsed by a group of major computer hardware and software vendors (at the time of these slides, expected to become an ANSI standard).
• OpenMP stands for Open specifications for Multi-Processing.
• The base languages for OpenMP are C, C++ and Fortran.

OpenMP Programming Model
• Explicit parallelism
  • OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.
• Compiler-directive based
  • Virtually all OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code.
• Thread-based parallelism
  • A shared memory process can consist of multiple threads.
  • OpenMP is based upon the existence of multiple threads in the shared memory programming paradigm.
  • Makes it easy to create multi-threaded (MT) programs in Fortran, C and C++.

OpenMP Programming Model (cont'd)
• OpenMP uses the fork-join model of parallel execution:
  • Programs begin as a single process: the master thread.
  • The master thread executes sequentially until the first parallel region construct is encountered.
  • FORK: the master thread then creates a team of parallel threads.
  • The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.
  • JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

OpenMP Program Structure

    #include <omp.h>

    int main() {
        int var1, var2, var3;

        /* Serial code */
        ...

        /* Beginning of parallel section: fork a team of threads
           and specify variable scoping. */
        #pragma omp parallel private(var1, var2) shared(var3)
        {
            /* Parallel section executed by all threads */
            ...
            /* All threads join the master thread and disband */
        }

        /* Resume serial code */
        ...
    }

OpenMP Programming
• For C/C++, the OpenMP directives are contained in #pragma statements.
• The OpenMP #pragma statements have the format:
    #pragma omp directive_name ...
  where omp is an OpenMP keyword.
• There may be additional parameters (clauses) after the directive name for different options.
• Some directives require code to be specified in a structured block (a statement or statements) that follows the directive; the directive and the structured block together form a "construct".

Anatomy of OpenMP
• OpenMP's constructs fall into 5 categories:
  1. Parallel regions
  2. Work-sharing
  3. Data environment
  4. Synchronization
  5. Runtime functions/environment variables

Parallel Regions: The parallel Directive
• This is the fundamental OpenMP parallel construct:
    #pragma omp parallel
        structured_block
• When a thread reaches a parallel directive, it creates a team of threads and becomes the master of the team.
• The master is a member of that team and has thread number 0 within it.
• Each thread in the team executes the specified structured_block.
• structured_block is either a single statement or a compound statement created with { ... }, with a single entry point and a single exit point.
• There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point.

The parallel Directive: Detailed Form

    #pragma omp parallel [clause ...] newline
        if (scalar_expression)
        private (list)
        shared (list)
        default (shared | none)
        firstprivate (list)
        reduction (operator: list)
        copyin (list)
      structured_block

Example 1: Parallel Salaam Shabab

    #include <stdio.h>
    #include <omp.h>

    int main() {
        int nthreads, tid;

        #pragma omp parallel private(nthreads, tid)
        {
            tid = omp_get_thread_num();       /* obtain and print thread id */
            printf("Salaam Shabab from thread = %d\n", tid);
            if (tid == 0) {                   /* only the master thread does this */
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        }   /* all threads join the master thread and terminate */
        return 0;
    }

Number of Threads in a Team
• Established by, in order of precedence:
  1. a num_threads clause after the parallel directive, or
  2. the omp_set_num_threads() library routine being previously called, or
  3. the environment variable OMP_NUM_THREADS, if defined, or
  4. the implementation default.
• The number of threads available can also be altered automatically to achieve the best use of system resources by a "dynamic adjustment" mechanism.
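A small sketch of the first two mechanisms (my example, not from the slides; the thread counts 4 and 2 are arbitrary):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(4);                /* request 4 threads for later regions */

        #pragma omp parallel                   /* team size taken from the call above */
        {
            if (omp_get_thread_num() == 0)
                printf("first region: %d threads\n", omp_get_num_threads());
        }

        #pragma omp parallel num_threads(2)    /* clause overrides the earlier setting */
        {
            if (omp_get_thread_num() == 0)
                printf("second region: %d threads\n", omp_get_num_threads());
        }
        return 0;
    }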

Example 2: Vector Addition

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;          /* first index for this thread */
        iend = (id + 1) * N / Nthrds;      /* one past the last index */
        for (i = istart; i < iend; i++)
            a[i] = a[i] + b[i];
    }

Work-Sharing
• Three constructs in this classification, namely:
  • sections
  • for
  • single
• Note:
  • There is an implicit barrier at the end of each of these constructs, unless a nowait clause is included.
  • These constructs do not start a new team of threads; that is done by an enclosing parallel construct.

The sections Directive
• The construct
    #pragma omp sections
    {
        #pragma omp section
            structured_block
        ...
    }
  causes the structured blocks to be shared among the threads in the team.
• Notes:
  • #pragma omp sections precedes the whole set of structured blocks.
  • #pragma omp section prefixes each structured block.
  • The first section directive is optional.

The sections Directive: Detailed Form

    #pragma omp sections [clause ...] newline
        private (list)
        firstprivate (list)
        lastprivate (list)
        reduction (operator: list)
        nowait
    {
        #pragma omp section newline
            structured_block
        ...
    }

Example 3: The sections Directive

    #pragma omp parallel shared(a, b, c) private(i)
    {
        #pragma omp sections nowait
        {
            #pragma omp section
            for (i = 0; i < N/2; i++)
                c[i] = a[i] + b[i];

            #pragma omp section
            for (i = N/2; i < N; i++)
                c[i] = a[i] + b[i];
        }   /* end of sections */
    }   /* end of parallel section */

The for Directive
• The construct
    #pragma omp for
        for_loop
  causes the for_loop to be divided into parts, with the parts shared among the threads in the team.
• The for loop must be of a simple form.
• The way the for_loop is divided can be specified by an additional schedule clause.
• Example: the clause schedule(static, chunk_size) causes the for_loop to be divided into chunks of size chunk_size, allocated to the threads in a round-robin fashion.

The for Directive: Detailed Form

    #pragma omp for [clause ...] newline
        schedule (type [, chunk])
        ordered
        private (list)
        firstprivate (list)
        lastprivate (list)
        shared (list)
        reduction (operator: list)
        nowait
      for_loop

The schedule Clause
• The schedule clause affects how loop iterations are mapped onto threads (a short sketch contrasting two of the schedules follows this list).
• schedule(static [, chunk])
  • Deals out blocks of iterations of size chunk to each thread.
  • The default for chunk is ceiling(N/p).
• schedule(dynamic [, chunk])
  • Each thread grabs chunk iterations off a queue until all iterations have been handled.
  • The default for chunk is 1.
• schedule(guided [, chunk])
  • Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size chunk as the calculation proceeds.
  • The default for chunk is ceiling(N/p).
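A minimal sketch (mine, not from the slides) contrasting static and dynamic scheduling; N, the printed output and the chunk size of 4 are arbitrary choices:

    #include <stdio.h>
    #include <omp.h>
    #define N 16

    int main(void) {
        /* static: iterations are split into chunks of 4 and dealt out
           round-robin, so the mapping is fixed before the loop starts. */
        #pragma omp parallel for schedule(static, 4)
        for (int i = 0; i < N; i++)
            printf("static  i=%2d on thread %d\n", i, omp_get_thread_num());

        /* dynamic: each thread grabs the next chunk of 4 off a queue as it
           finishes, which helps when iteration costs vary. */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < N; i++)
            printf("dynamic i=%2d on thread %d\n", i, omp_get_thread_num());

        return 0;
    }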

The schedule Clause (cont'd)
• schedule(runtime)
  • The schedule and chunk size are taken from the OMP_SCHEDULE environment variable.
  • A chunk size cannot be specified with runtime.
• Example of run-time specified scheduling:
    setenv OMP_SCHEDULE "dynamic,2"
• Which of the scheduling methods is better in terms of:
  • Lower runtime overhead?
  • Better data locality?
  • Better load balancing?

Example 4: The for Directive

    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a, b, c, chunk) private(i)
    {
        #pragma omp for schedule(dynamic, chunk) nowait
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }   /* end of parallel section */

Combined Parallel Work-Sharing Constructs
• A parallel directive followed by a single for directive can be combined into:
    #pragma omp parallel for
        for_loop
• A parallel directive followed by a single sections directive can be combined into:
    #pragma omp parallel sections
    {
        #pragma omp section
            structured_block
        ...
    }
• In both cases, the nowait clause is not allowed.

Example 5: Parallel Region and Work-Sharing

    #pragma omp parallel
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++)
        a[i] = a[i] + b[i];

• What is the chunk size distributed?
• How is the distribution done?

The reduction Clause
• reduction (op: list)
• The reduction clause performs a reduction, using the operator op, on the variables that appear in its list.
• A private copy of each list variable is created for each thread.
• At the end of the reduction, the reduction operator is applied to the private copies of all threads, and the final result is written to the global shared variable.

Example 6: The reduction Clause

    #pragma omp parallel for default(shared) private(i) schedule(static, chunk) reduction(+: result)
    for (i = 0; i < n; i++)
        result = result + (a[i] * b[i]);

    printf("Final result = %f\n", result);

• How many threads execute the printf statement?

The master Directive
• The master directive:
    #pragma omp master
        structured_block
  causes only the master thread to execute the structured block.
• It differs from the directives in the work-sharing group in that there is no implied barrier at the end of the construct (nor at the beginning).
• Other threads encountering this directive ignore it and the associated structured block, and move on.
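A small sketch of the master directive (mine, not from the slide; the shared value 42 is arbitrary). Because master implies no barrier, an explicit barrier is used before the other threads read the value:

    #include <stdio.h>
    #include <omp.h>

    int shared_value;                    /* written by the master, read by all */

    int main(void) {
        #pragma omp parallel
        {
            #pragma omp master
            shared_value = 42;           /* only thread 0 executes this; no implied barrier */

            #pragma omp barrier          /* make the master's write visible to every thread */

            printf("thread %d sees %d\n", omp_get_thread_num(), shared_value);
        }
        return 0;
    }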

Data Environment
• Most variables are shared by default. Examples: file-scope variables and static variables in C.
• But not everything is shared:
  • Stack variables in subprograms called from parallel regions are private.
  • Automatic variables within a statement block are private.
• The default status can be modified with:
  • default (private | shared | none)
• The defaults can be changed selectively with:
  • shared
  • private
  • firstprivate
  • threadprivate
• The value of a private variable inside a parallel loop can be transmitted to a global value outside the loop with:
  • lastprivate
  (a sketch illustrating firstprivate and lastprivate follows this list)
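The sketch below is mine, not from the slides; N, the initial value 100 and the update are arbitrary:

    #include <stdio.h>
    #include <omp.h>
    #define N 8

    int main(void) {
        int t = 100;                     /* firstprivate: copied into each thread's private t */
        int last = -1;                   /* lastprivate: receives the value from the last iteration */
        int i;

        #pragma omp parallel for firstprivate(t) lastprivate(last)
        for (i = 0; i < N; i++) {
            t = t + i;                   /* private to each thread, starting at 100 */
            last = t;                    /* after the loop, 'last' holds the value assigned
                                            in the sequentially final iteration (i == N-1);
                                            the number depends on how iterations were distributed */
        }
        printf("last = %d\n", last);
        return 0;
    }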

Synchronization Constructs
• Five constructs in this classification, namely:
  1. critical: defines a critical section
       #pragma omp critical [name]
           structured_block
  2. barrier: defines a barrier (which must be reachable by all threads)
       #pragma omp barrier
  3. atomic: a lightweight form of critical section for a statement that increments, decrements, or performs some other simple arithmetic update (a short sketch follows this list)
       #pragma omp atomic
           expression_statement
  4. flush: a synchronization point that causes the thread to have a "consistent" view of the shared variables in variable_list in memory
       #pragma omp flush (variable_list)
  5. ordered: used with for and parallel for to order the execution of a loop as in the equivalent sequential code
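A minimal sketch of the atomic directive (mine, not from the slides; the loop bound is arbitrary):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int count = 0;

        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            /* atomic protects this single update without the overhead
               of a full critical section */
            #pragma omp atomic
            count++;
        }
        printf("count = %d\n", count);   /* expect 1000 */
        return 0;
    }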

Example 7: The critical Directive

    #include <omp.h>

    int main() {
        int x;
        x = 0;
        #pragma omp parallel shared(x)
        {
            #pragma omp critical
            x = x + 1;
        }   /* end of parallel section */
    }

Example 8: The barrier Directive

    #pragma omp parallel shared(A, B, C) private(id)
    {
        id = omp_get_thread_num();
        A[id] = big_calc1(id);

        #pragma omp barrier

        #pragma omp for
        for (i = 0; i < N; i++) { C[i] = big_calc3(i, A); }   /* implicit barrier */

        #pragma omp for nowait
        for (i = 0; i < N; i++) { B[i] = big_calc2(C, i); }   /* no barrier (nowait) */

        A[id] = big_calc3(id);
    }   /* implicit barrier at the end of the parallel region */

OpenMP: Library Routines
• Lock routines (a short sketch follows this list)
    omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
• Runtime environment routines:
  • Modify/check the number of threads
      omp_set_num_threads(), omp_get_thread_num(), omp_get_max_threads()
  • Turn on/off nesting and dynamic mode
      omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
  • Are we in a parallel region?
      omp_in_parallel()
  • How many processors are in the system?
      omp_get_num_procs()
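A minimal sketch of the lock routines (mine, not from the slides); it guards the same kind of shared accumulation used in the earlier examples, and the per-thread contribution is arbitrary:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_lock_t lock;
        int total = 0;

        omp_init_lock(&lock);            /* create the lock */

        #pragma omp parallel
        {
            int my_part = omp_get_thread_num() + 1;   /* arbitrary per-thread work */

            omp_set_lock(&lock);         /* acquire: only one thread at a time */
            total += my_part;
            omp_unset_lock(&lock);       /* release */
        }

        omp_destroy_lock(&lock);         /* clean up */
        printf("total = %d\n", total);
        return 0;
    }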

OpenMP: Environment Variables
• Control how "omp for schedule(runtime)" loop iterations are scheduled:
    OMP_SCHEDULE "schedule[, chunk_size]"
• Set the default number of threads to use:
    OMP_NUM_THREADS int_literal
• Can the program use a different number of threads in each parallel region?
    OMP_DYNAMIC TRUE | FALSE
• Will nested parallel regions create new teams of threads, or will they be serialized?
    OMP_NESTED TRUE | FALSE

Example 9: The threadprivate Directive

    #include <stdio.h>
    #include <omp.h>

    int alpha[10], beta[10], i;
    #pragma omp threadprivate(alpha)

    int main() {
        /* First parallel region */
        #pragma omp parallel private(i, beta)
        for (i = 0; i < 10; i++)
            alpha[i] = beta[i] = i;

        /* Second parallel region */
        #pragma omp parallel
        printf("alpha[3] = %d and beta[3] = %d\n", alpha[3], beta[3]);
    }

Shared Memory Programming: Performance Issues
• Regardless of the programming model, these include:
  1. Shared data access
  2. Shared memory synchronization
  3. Sequential consistency
• 1. Shared data access: design parallel algorithms to
  • fully exploit caching characteristics, and
  • avoid false sharing.
• Exploiting caching characteristics involves
  • caching data to mitigate processor-memory data access speed differences, and
  • modifying cached data.
• Caches may need to be made coherent more frequently due to false sharing, which can lead to a serious reduction in performance.

Shared Memory Programming: Performance Issues (cont'd)
• Cache coherence protocols: needed to assure that processors have coherent caches; two solutions:
  • Update policy – copies in all caches are updated whenever one copy is modified.
  • Invalidate policy – when a datum in one copy is modified, the same datum in the other caches is invalidated (by resetting a bit); the data is updated in those caches only when the processor needs it.
• False sharing may appear when different processors alter different data within the same block.
• Solution for false sharing:
  • The compiler can alter the layout of the data stored in main memory, separating data altered only by one processor into different blocks.

False Sharing
• Different parts of a cache block are required by different processors, but not the same bytes.
• If one processor writes to one part of the block, copies of the complete block in other caches must be updated or invalidated even though the actual data is not shared.
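A common remedy at the program level is to pad per-thread data so that each element occupies its own cache block. The sketch below is mine, not from the slides; the 64-byte block size and thread count are assumptions, and the padding only helps if the array itself starts on a block boundary:

    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS   4
    #define CACHE_LINE 64                 /* assumed cache-block size in bytes */

    /* Each thread's counter is padded to a full cache block, so updates by
       different threads never fall in the same block (no false sharing). */
    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];
    };

    int main(void) {
        struct padded_counter counts[NTHREADS] = {{0}};

        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < 1000000; i++)
                counts[id].value++;       /* each thread writes only its own block */
        }

        long total = 0;
        for (int t = 0; t < NTHREADS; t++)
            total += counts[t].value;
        printf("total = %ld\n", total);   /* expect 4000000 */
        return 0;
    }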

2. Shared Memory Synchronization
• Too much use of synchronization primitives can be a major source of reduced performance in shared memory programs.
• Shared memory synchronization is used for three main purposes:
  1. Mutual exclusion synchronization: used to control access to critical sections.
     • Have as few critical sections as possible.
  2. Process/thread synchronization: used to make threads wait for each other (sometimes needlessly).
     • Have as few barriers as possible.
  3. Event synchronization: used to tell a thread (e.g., through a shared flag) that a certain event, such as updating a variable, has occurred.
     • Reduce busy waiting in and around critical sections.

3. Sequential Consistency
• Definition (Lamport, 1979): A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program.
• That is, the overall effect of a parallel program is not changed by any arbitrary interleaving of instruction execution in time.
• Is sequential consistency a property parallel programs should have?
• How does sequential consistency affect performance?

Sequential Consistency

Sequential Consistency (cont'd)
• Programs in a sequentially consistent parallel system
  • must produce the results that the program is designed to produce, even though the individual requests from the processors can be interleaved in any order, and
  • enable us to reason about the result of the program.
• Example:

    Process P1:
    .
    data = new;
    flag = TRUE;
    .

    Process P2:
    .
    while (flag != TRUE) { };
    data_copy = data;
    .

• Under sequential consistency, data_copy is guaranteed to receive the value new, because P1 writes data before it sets flag in program order.
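On a system (or with a compiler) that does not guarantee sequential consistency, the same pattern needs explicit synchronization. Below is a hedged OpenMP sketch of mine using flush; the value 42, the thread roles and the exact placement of the flushes are my choices and are meant only to illustrate the idea, not to be the canonical way to write a producer/consumer in modern OpenMP:

    #include <stdio.h>
    #include <omp.h>

    int data = 0;
    int flag = 0;

    int main(void) {
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {       /* producer, like P1 */
                data = 42;                         /* arbitrary "new" value */
                #pragma omp flush(data, flag)      /* publish data before flag */
                flag = 1;
                #pragma omp flush(flag)
            } else {                               /* consumer, like P2 */
                int done = 0;
                while (!done) {
                    #pragma omp flush(flag)        /* re-read flag from memory */
                    done = flag;
                }
                #pragma omp flush(data, flag)
                printf("data_copy = %d\n", data);  /* sees 42 */
            }
        }
        return 0;
    }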

How Sequential Consistency Affects Performance
• Sequential consistency requires that the "operations of each individual processor ... occur in the order specified by its program", i.e., in program order.
• Clever compilers sometimes reorder statements to improve performance (see the example in the textbook).
• Similarly, modern high-performance processors typically reorder machine instructions to improve performance (see the example in the textbook).
• Enforcing sequential consistency can therefore significantly limit compiler optimizations and processor performance.
• Some processors have special machine instructions to achieve sequential consistency while improving performance (see the examples in the textbook).

Additional References
• Introduction to OpenMP, Lawrence Livermore National Laboratory: www.llnl.gov/computing/tutorials/workshop/openMP/MAIN.html
• Parallel Computing with OpenMP on distributed shared memory platforms, National Research Council Canada: www.sao.nrc.ca/~gabriel/openmp