Advanced Parallel Programming with OpenMP

Tim Mattson, Intel Corporation, Computational Sciences Laboratory
Rudolf Eigenmann, Purdue University, School of Electrical and Computer Engineering

SC'2000 Tutorial Agenda
• OpenMP: A Quick Recap
• OpenMP Case Studies
  – including performance tuning
• Automatic Parallelism and Tools Support
• Common Bugs in OpenMP Programs
  – and how to avoid them
• Mixing OpenMP and MPI
• The Future of OpenMP

OpenMP: Recap
OpenMP: An API for Writing Multithreaded Applications
• A set of compiler directives and library routines for parallel application programmers
• Makes it easy to create multi-threaded (MT) programs in Fortran, C and C++
• Standardizes the last 15 years of SMP practice
(The original slide shows sample OpenMP syntax in the background, e.g. C$OMP FLUSH, C$OMP THREADPRIVATE(/ABC/), #pragma omp critical, CALL OMP_SET_NUM_THREADS(10), C$OMP PARALLEL DO SHARED(A, B, C), call omp_test_lock(jlok), call OMP_INIT_LOCK(ilok), C$OMP ATOMIC, C$OMP MASTER, setenv OMP_SCHEDULE "dynamic", C$OMP ORDERED, C$OMP PARALLEL REDUCTION(+: A, B), C$OMP SECTIONS, C$OMP SINGLE PRIVATE(X), #pragma omp parallel for private(A, B), C$OMP PARALLEL COPYIN(/blk/), C$OMP BARRIER, C$OMP DO LASTPRIVATE(XX), Nthrds = OMP_GET_NUM_PROCS(), omp_set_lock(lck).)

OpenMP: Supporters*
• Hardware vendors
  – Intel, HP, SGI, IBM, SUN, Compaq
• Software tools vendors
  – KAI, PGI, PSR, APR
• Applications vendors
  – ANSYS, Fluent, Oxford Molecular, NAG, DOE ASCI, Dash, Livermore Software, and many others
*These vendor names were taken from the OpenMP web site (www.openmp.org). We have made no attempt to confirm OpenMP support, verify conformity to the specifications, or measure the degree of OpenMP utilization.

OpenMP: Programming Model
Fork-Join Parallelism:
• The master thread spawns a team of threads as needed.
• Parallelism is added incrementally: i.e., the sequential program evolves into a parallel program.
(Diagram on the original slide: the master thread forking and joining teams of threads across successive parallel regions; a minimal C example follows.)
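To make the fork-join picture concrete, here is a minimal hand-written C example (not from the original slides): the master thread runs alone until the parallel pragma forks a team, and the team joins back at the end of the region.

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      printf("Serial part: only the master thread runs here\n");

      /* Fork: a team of threads executes this block. */
      #pragma omp parallel
      {
          int id = omp_get_thread_num();
          printf("Hello from thread %d of %d\n", id, omp_get_num_threads());
      }   /* Join: implicit barrier, then the master continues alone. */

      printf("Serial part again: back to the master thread\n");
      return 0;
  }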

OpenMP: How is OpenMP typically used? (in C)
• OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops.
  – Split them up between threads.

Sequential program:
  void main()
  {
    double Res[1000];
    for (int i = 0; i < 1000; i++) {
      do_huge_comp(Res[i]);
    }
  }

Parallel program (split this loop between multiple threads):
  #include "omp.h"
  void main()
  {
    double Res[1000];
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
      do_huge_comp(Res[i]);
    }
  }

OpenMP: How is OpenMP typically used? (Fortran)
• OpenMP is usually used to parallelize loops:
  – Find your most time-consuming loops.
  – Split them up between threads.

Sequential program:
        program example
        double precision Res(1000)
        do I = 1, 1000
          call huge_comp(Res(I))
        end do
        end

Parallel program (split this loop between multiple threads):
        program example
        double precision Res(1000)
  C$OMP PARALLEL DO
        do I = 1, 1000
          call huge_comp(Res(I))
        end do
        end

OpenMP: How do threads interact?
• OpenMP is a shared memory model.
  – Threads communicate by sharing variables.
• Unintended sharing of data causes race conditions:
  – race condition: when the program's outcome changes as the threads are scheduled differently.
• To control race conditions:
  – Use synchronization to protect data conflicts (see the sketch below).
• Synchronization is expensive, so:
  – Change how data is accessed to minimize the need for synchronization.
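The sketch below (hand-written, not part of the tutorial) shows the simplest form of this trade-off: the update of the shared counter is a data conflict, and the critical section synchronizes it.

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      int hits = 0;                      /* shared variable: a data conflict */

      #pragma omp parallel for
      for (int i = 0; i < 100000; i++) {
          if (i % 3 == 0) {
              /* Without this critical section the read-modify-write of
                 'hits' races between threads and updates can be lost. */
              #pragma omp critical
              hits++;
          }
      }

      printf("hits = %d\n", hits);
      return 0;
  }

In real code a reduction(+:hits) clause would be cheaper than a critical section inside the loop body; the point here is only to make the synchronization explicit.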

Summary of OpenMP Constructs
• Parallel Region
  – C$OMP PARALLEL          #pragma omp parallel
• Worksharing
  – C$OMP DO                #pragma omp for
  – C$OMP SECTIONS          #pragma omp sections
  – C$OMP SINGLE            #pragma omp single
  – C$OMP WORKSHARE         (Fortran only)
• Data Environment
  – directive: THREADPRIVATE
  – clauses: SHARED, PRIVATE, LASTPRIVATE, REDUCTION, COPYIN, COPYPRIVATE
• Synchronization
  – directives: CRITICAL, BARRIER, ATOMIC, FLUSH, ORDERED, MASTER
• Runtime functions/environment variables

SC'2000 Tutorial Agenda
• OpenMP: A Quick Recap
• OpenMP Case Studies
  – including performance tuning
• Automatic Parallelism and Tools Support
• Common Bugs in OpenMP Programs
  – and how to avoid them
• Mixing OpenMP and MPI
• The Future of OpenMP

Performance Tuning and Case Studies with Realistic Applications
1. Performance tuning of several benchmarks
2. Case study of a large-scale application

Performance Tuning Example 1: MDG
• MDG: A Fortran code from the "Perfect Benchmarks".
• Automatic parallelization does not improve this code.
• The following performance improvements were achieved through manual tuning on a 4-processor Sun Ultra (speedup chart on the original slide).

MDG: Tuning Steps
• Step 1: Parallelize the most time-consuming loop. It consumes 95% of the serial execution time. This requires:
  – array privatization
  – reduction parallelization
• Step 2: Balance the iteration space of this loop.
  – The loop is "triangular"; by default the assignment of iterations to processors is unbalanced (see the sketch below; the actual MDG code is on the next slide).
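As a simpler, hand-written illustration of Step 2 (in C, with made-up array names), a cyclic schedule such as schedule(static, 1) interleaves the long and short rows of a triangular loop across the team:

  #include <omp.h>

  #define N 1000
  static double a[N][N];

  void triangular_update(void)
  {
      int i, j;
      /* Iteration i touches N - i elements, so the default blocked
         schedule gives the first threads far more work than the last.
         A cyclic schedule (chunk size 1) balances the load. */
      #pragma omp parallel for private(j) schedule(static, 1)
      for (i = 0; i < N; i++) {
          for (j = i; j < N; j++) {
              a[i][j] = 0.5*a[i][j] + 1.0;
          }
      }
  }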

MDG Code Sample
Structure of the most time-consuming loop in MDG:

Original:
      c1 = x(1) > 0
      c2 = x(1:10) > 0
      DO i = 1, n
        DO j = i, n
          IF (c1) THEN rl(1:100) = …
          …
          IF (c2) THEN … = rl(1:100)
          sum(j) = sum(j) + …
        ENDDO
      ENDDO

Parallel (tuned):
      c1 = x(1) > 0
      c2 = x(1:10) > 0
      Allocate(xsum(1:#proc, n))
C$OMP PARALLEL DO
C$OMP+ PRIVATE (i, j, rl, id)
C$OMP+ SCHEDULE (STATIC, 1)
      DO i = 1, n
        id = omp_get_thread_num()
        DO j = i, n
          IF (c1) THEN rl(1:100) = …
          …
          IF (c2) THEN … = rl(1:100)
          xsum(id, j) = xsum(id, j) + …
        ENDDO
      ENDDO

C$OMP PARALLEL DO
      DO i = 1, n
        sum(i) = sum(i) + xsum(1:#proc, i)
      ENDDO

Performance Tuning Example 2: ARC2D
• ARC2D: A Fortran code from the "Perfect Benchmarks".
• ARC2D is parallelized very well by available compilers. However, the mapping of the code to the machine could be improved.

ARC2D: Tuning Steps
• Step 1: Loop interchanging increases cache locality through stride-1 references.
• Step 2: Move parallel loops to outer positions.
• Step 3: Move synchronization points outward.
• Step 4: Coalesce loops.

ARC2D: Code Samples
Loop interchanging increases cache locality:

Before:
!$OMP PARALLEL DO
!$OMP+PRIVATE(R1, R2, K, J)
      DO j = jlow, jup
        DO k = 2, kmax-1
          r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
          r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
          coef(j, k) = ABS(r1/r2)
        ENDDO
      ENDDO
!$OMP END PARALLEL DO

After (loops interchanged, stride-1 inner references):
!$OMP PARALLEL DO
!$OMP+PRIVATE(R1, R2, K, J)
      DO k = 2, kmax-1
        DO j = jlow, jup
          r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
          r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
          coef(j, k) = ABS(r1/r2)
        ENDDO
      ENDDO
!$OMP END PARALLEL DO

ARC2D: Code Samples
Increasing parallel loop granularity through the NOWAIT clause:

!$OMP PARALLEL
!$OMP+PRIVATE(LDI, LD2, LD1, J, LD, K)
      DO k = 2+2, ku-2, 1
!$OMP DO
        DO j = jl, ju
          ld2 = a(j, k)
          ld1 = b(j, k) + (-x(j, k-2))*ld2
          ld  = c(j, k) + (-x(j, k-1))*ld1 + (-y(j, k-1))*ld2
          ldi = 1./ld
          f(j, k, 1) = ldi*(f(j, k, 1) + (-f(j, k-2, 1))*ld2 + (-f(j, k-1, 1))*ld1)
          f(j, k, 2) = ldi*(f(j, k, 2) + (-f(j, k-2, 2))*ld2 + (-f(j, k-1, 2))*ld1)
          x(j, k) = ldi*(d(j, k) + (-y(j, k-1))*ld1)
          y(j, k) = e(j, k)*ldi
        ENDDO
!$OMP END DO NOWAIT
      ENDDO
!$OMP END PARALLEL

ARC2D: Code Samples
Increasing parallel loop granularity through loop coalescing:

Before:
!$OMP PARALLEL DO
!$OMP+PRIVATE(n, k, j)
      DO n = 1, 4
        DO k = 2, kmax-1
          DO j = jlow, jup
            q(j, k, n) = q(j, k, n) + s(j, k, n)
            s(j, k, n) = s(j, k, n)*phic
          ENDDO
        ENDDO
      ENDDO
!$OMP END PARALLEL DO

After (outer loops coalesced):
!$OMP PARALLEL DO
!$OMP+PRIVATE(nk, n, k, j)
      DO nk = 0, 4*(kmax-2)-1
        n = nk/(kmax-2) + 1
        k = MOD(nk, kmax-2) + 2
        DO j = jlow, jup
          q(j, k, n) = q(j, k, n) + s(j, k, n)
          s(j, k, n) = s(j, k, n)*phic
        ENDDO
      ENDDO
!$OMP END PARALLEL DO

Performance Tuning Example 3: EQUAKE
• EQUAKE: A C code from the new SPEC OpenMP benchmarks.
• EQUAKE is hand-parallelized with relatively few code modifications. It achieves excellent speedup.

EQUAKE: Tuning Steps
• Step 1: Parallelize the four most time-consuming loops.
  – inserted OpenMP pragmas for parallel loops and private data
  – array reduction transformation
• Step 2: A change in memory allocation.

EQUAKE Code Samples

/* malloc w1[numthreads][ARCHnodes][3] */

#pragma omp parallel for
for (j = 0; j < numthreads; j++)
  for (i = 0; i < nodes; i++) {
    w1[j][i][0] = 0.0;
    ...;
  }

#pragma omp parallel private(my_cpu_id, exp, ...)
{
  my_cpu_id = omp_get_thread_num();
#pragma omp for
  for (i = 0; i < nodes; i++)
    while (...) {
      ...
      exp = loop-local computation;
      w1[my_cpu_id][...][1] += exp;
      ...
    }
}

#pragma omp parallel for
for (j = 0; j < numthreads; j++) {
  for (i = 0; i < nodes; i++) {
    w[i][0] += w1[j][i][0];
    ...;
  }
}

What Tools Did We Use for Performance Analysis and Tuning?
• Compilers
  – The starting point for our performance tuning of Fortran codes was always the compiler-parallelized program.
  – It reports: parallelized loops, data dependences.
• Subroutine and loop profilers
  – Focusing attention on the most time-consuming loops is absolutely essential.
• Performance tables
  – Typically comparing performance differences at the loop level.

Guidelines for Fixing "Performance Bugs"
• The methodology that worked for us:
  – Use compiler-parallelized code as a starting point.
  – Get a loop profile and a compiler listing.
  – Inspect time-consuming loops (biggest potential for improvement).
    – Case 1: Check for parallelism where the compiler could not find it.
    – Case 2: Improve parallel loops where the speedup is limited.

Performance Tuning Case 1: the loop is not parallelized automatically
• Check for parallelism:
  – Read the compiler explanation.
  – A variable may be independent even if the compiler detects dependences (compilers are conservative).
  – Check whether a conflicting array is privatizable (compilers don't perform array privatization well).
• If you find parallelism, add OpenMP parallel directives, or make the information explicit for the parallelizer (see the sketch below).
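A hand-written C illustration of this case (the routine and array names are made up): the scratch array is fully rewritten in every iteration, so the iterations are independent, but a conservative compiler may report a dependence on the shared array; privatizing it by hand makes the loop parallel.

  #include <omp.h>

  #define N 4096
  #define M 128

  void smooth(double out[N], const double in[N])
  {
      int i, k;
      double work[M];                      /* scratch array, reused every iteration */

      #pragma omp parallel for private(k, work)
      for (i = 0; i < N; i++) {
          for (k = 0; k < M; k++)
              work[k] = in[i] * (k + 1);   /* define the whole array ... */
          out[i] = 0.0;
          for (k = 0; k < M; k++)
              out[i] += work[k];           /* ... then use it */
      }
  }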

Performance Tuning Case 2: the loop is parallel but does not perform well
Consider several optimization factors (diagram on the original slide: serial program vs. parallel program, with parallelization overhead and spreading overhead between memory and CPUs). High overheads are caused by:
• parallel startup cost
• small loops
• additional parallel code
• over-optimized inner loops
• less optimization for parallel code
• load imbalance
• synchronized sections
• non-stride-1 references
• many shared references
• low cache affinity
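Two of these overheads can often be addressed directly in the directive. The hand-written sketch below (heavy_kernel is a hypothetical, uneven-cost function, not from the tutorial) uses an if clause to skip parallel startup for small trip counts and a dynamic schedule to counter load imbalance:

  #include <omp.h>

  extern double heavy_kernel(double);   /* hypothetical, cost varies per call */

  void process(double *a, int n)
  {
      int i;
      /* Go parallel only when the loop is large enough to amortize the
         startup cost; hand out chunks dynamically because iterations
         are assumed to have very different costs. */
      #pragma omp parallel for if(n > 10000) schedule(dynamic, 64)
      for (i = 0; i < n; i++) {
          a[i] = heavy_kernel(a[i]);
      }
  }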

Case Study of a Large-Scale Application
Converting a Seismic Processing Application to OpenMP
• Overview of the Application
• Basic Use of OpenMP
• OpenMP Issues Encountered
• Performance Results

Overview of Seismic
• Representative of modern seismic processing programs used in the search for oil and gas.
• 20,000 lines of Fortran. C subroutines interface with the operating system.
• Available in a serial and a parallel variant.
• Parallel code is available in a message-passing and an OpenMP form.
• Is part of the SPEChpc benchmark suite. Includes 4 data sets: small to x-large.

Seismic: Basic Characteristics
• Program structure:
  – 240 Fortran and 119 C subroutines.
• Main algorithms:
  – FFT, finite difference solvers
• Running time of Seismic (at 500 MFlops):
  – small data set: 0.1 hours
  – x-large data set: 48 hours
• I/O requirement:
  – small data set: 110 MB
  – x-large data set: 93 GB

Basic OpenMP Use: Parallelization Scheme
• Split the work into p parallel tasks (p = number of processors).

      Program Seismic
        ...                          ! initialization done by the master processor only
C$OMP PARALLEL
        call main_subroutine()       ! main computation enclosed in one large parallel region
C$OMP END PARALLEL
                                     ! SPMD execution scheme

Basic OpenMP Use: Data Privatization
• Most data structures are private, i.e., each thread has its own copy.
• Syntactic forms:

      Program Seismic
        ...
C$OMP PARALLEL
C$OMP+PRIVATE(a)
        a = "local computation"
        call x()
C$OMP END PARALLEL

      Subroutine x()
      common /cc/ d
c$omp threadprivate (/cc/)
      real b(100)
        ...
        b() = "local computation"
        d = "local computation"
        ...

Basic OpenMP Use: Synchronization and Communication

      compute
      communicate:  copy to shared buffer
                    barrier synchronization
                    copy from shared buffer
      compute
      communicate
      ...

• The copy-synchronize scheme corresponds to message send/receive operations in MPI programs (see the sketch below).
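A hand-written C sketch of the copy-synchronize idea (buffer layout, sizes and names are made up; the routine must be called by all threads of the parallel region):

  #include <omp.h>

  #define MAXTHREADS 8                  /* assumed upper bound on the team size */
  #define CHUNK 1024

  static double shared_buf[MAXTHREADS][CHUNK];   /* shared exchange buffer */

  void exchange_with_right_neighbor(double *my_data)
  {
      int me = omp_get_thread_num();
      int right = (me + 1) % omp_get_num_threads();
      int i;

      /* "send": publish my data in the shared buffer */
      for (i = 0; i < CHUNK; i++)
          shared_buf[me][i] = my_data[i];

      /* every thread must finish writing before anyone reads */
      #pragma omp barrier

      /* "receive": read the neighbor's slot, like an MPI send/receive pair */
      for (i = 0; i < CHUNK; i++)
          my_data[i] = shared_buf[right][i];

      /* keep the buffer safe for reuse in the next communication step */
      #pragma omp barrier
  }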

OpenMP Issues: Mixing Fortran and C
• Bulk of the computation is done in Fortran.
• Utility routines are in C:
  – I/O operations
  – data partitioning routines
  – communication/synchronization operations
• OpenMP-related issues:
  – If a C/OpenMP compiler is not available, data privatization must be done through "expansion".
  – Mixing Fortran and C is implementation dependent.

Data privatization in OpenMP/C:
  #pragma omp threadprivate (item)
  float item;
  void x() { ... = item; }

Data expansion in the absence of an OpenMP/C compiler:
  float item[num_proc];
  void x() {
    int thread;
    thread = omp_get_thread_num_();
    ... = item[thread];
  }

OpenMP Issues: Broadcast Common Blocks

      common /cc/ cdata
      common /dd/ ddata
c     initialization
      cdata = ...
      ddata = ...
C$OMP PARALLEL
C$OMP+COPYIN(/cc/, /dd/)
      call main_subroutine()
C$OMP END PARALLEL

• Issue in Seismic: at the start of the parallel region it is not yet known which common blocks need to be copied in.
• Solution: copy in all common blocks (at the cost of some overhead).

OpenMP Issues: Multithreading I/O and malloc
• I/O routines and memory allocation are called within parallel threads, inside C utility routines.
• OpenMP requires all standard libraries and intrinsics to be thread-safe. However, the implementations are not always compliant; system-dependent solutions need to be found (see the sketch below).
• The same issue arises if standard C routines are called inside a parallel Fortran region or in non-standard syntax. Standard C compilers do not know anything about OpenMP and the thread-safety requirement.
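One common workaround, sketched by hand here rather than prescribed by the tutorial, is to serialize the calls whose thread safety is in doubt with a named critical section:

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  /* If malloc() or stdio in a given runtime is not reliably thread-safe,
     a blunt but portable workaround is to serialize those calls. */
  double *alloc_and_log(size_t n, int thread_id)
  {
      double *p;

      #pragma omp critical (memory_and_io)
      {
          p = (double *) malloc(n * sizeof(double));
          fprintf(stderr, "thread %d allocated %lu doubles\n",
                  thread_id, (unsigned long) n);
      }
      return p;
  }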

OpenMP Issues: Processor Affinity
• OpenMP currently does not specify or provide constructs for controlling the binding of threads to processors.
• Tasks may migrate between processors as a result of an OS event, causing overhead. This behavior is system-dependent. System-dependent solutions may be available.
(Diagram on the original slide: a parallel region with tasks 1-4 running on processors 1-4, and tasks migrating.)

Performance Results
Speedups of Seismic on an SGI Challenge system.
(Charts on the original slide: MPI vs. OpenMP versions for the small and medium data sets.)

SC'2000 Tutorial Agenda
• OpenMP: A Quick Recap
• OpenMP Case Studies
  – including performance tuning
• Automatic Parallelism and Tools Support
• Common Bugs in OpenMP Programs
  – and how to avoid them
• Mixing OpenMP and MPI
• The Future of OpenMP

Generating OpenMP Programs
Two paths lead to an OpenMP program: an automatically parallelizing compiler inserts the directives, or the user inserts them; in either case the user then tunes the program.
• Source-to-source restructurers:
  – F90 to F90/OpenMP
  – C to C/OpenMP
• Examples:
  – SGI F77 compiler (-apo -mplist option)
  – Polaris compiler

The Basics About Parallelizing Compilers
• Loops are the primary source of parallelism in scientific and engineering applications.
• Compilers detect loops that have independent iterations.

      DO I = 1, N
        A(expression1) = ...
        ... = A(expression2)
      ENDDO

The loop is independent if, for different iterations, expression1 is always different from expression2 (see the sketch below).
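A hand-written C illustration of the same test (not from the slides): in the first loop the written and read elements can never coincide across iterations, in the second they always do.

  #define N 1000
  static double a[2 * N];

  void independence_examples(void)
  {
      int i;

      /* Independent: iteration i writes a[i] and reads a[i + N]; for any
         two different iterations the index expressions never match, so
         the iterations can run in parallel. */
      for (i = 0; i < N; i++)
          a[i] = 2.0 * a[i + N];

      /* Not independent: iteration i reads a[i - 1], which iteration
         i - 1 wrote, so the iterations must execute in order. */
      for (i = 1; i < N; i++)
          a[i] = a[i - 1] + 1.0;
  }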

Basic Program Transformations
Data privatization:

  Before:
      DO i = 1, n
        work(1:n) = ...
        ...
        ... = work(1:n)
      ENDDO

  After:
  C$OMP PARALLEL DO
  C$OMP+ PRIVATE (work)
      DO i = 1, n
        work(1:n) = ...
        ...
        ... = work(1:n)
      ENDDO

Each processor is given a separate copy of the private data, so there is no sharing conflict.

Basic Program Transformations
Reduction recognition:

  Before:
      DO i = 1, n
        ...
        sum = sum + a(i)
        ...
      ENDDO

  After:
  C$OMP PARALLEL DO
  C$OMP+ REDUCTION (+:sum)
      DO i = 1, n
        ...
        sum = sum + a(i)
        ...
      ENDDO

Each processor accumulates a partial sum; the partial sums are combined at the end of the loop.

Basic Program Transformations
Induction variable substitution:

  Before:
      i1 = 0
      i2 = 0
      DO i = 1, n
        i1 = i1 + 1
        B(i1) = ...
        i2 = i2 + i
        A(i2) = ...
      ENDDO

  After:
  C$OMP PARALLEL DO
      DO i = 1, n
        B(i) = ...
        A((i**2 + i)/2) = ...
      ENDDO

The original loop contains data dependences: each processor modifies the shared variables i1 and i2.

Compiler Options
Examples of options from the KAP parallelizing compiler (KAP includes some 60 options):
• Optimization levels
  – optimize: simple analysis, advanced analysis, loop interchanging, array expansion
  – aggressive: pad common blocks, adjust data layout
• Subroutine inline expansion
  – inline all, specific routines, how to deal with libraries
• Try specific optimizations
  – e.g., recurrence and reduction recognition, loop fusion
  – (These transformations may degrade performance.)

More About Compiler Options
• Limits on the amount of optimization:
  – e.g., size of optimization data structures, number of optimization variants tried
• Make certain assumptions:
  – e.g., array bounds are not violated, arrays are not aliased
• Machine parameters:
  – e.g., cache size, line size, mapping
• Listing control
Note: compiler options can be a substitute for advanced compiler strategies. If the compiler has limited information, the user can help out.

Inspecting the Translated Program
• Source-to-source restructurers:
  – transformed source code is the actual output
  – Example: KAP
• Code-generating compilers:
  – typically have an option for viewing the translated (parallel) code
  – Example: SGI f77 -apo -mplist
• This can be the starting point for code tuning.

Compiler Listing
The listing gives many useful clues for improving the performance:
• Loop optimization tables
• Reports about data dependences
• Explanations about applied transformations
• The annotated, transformed code
• Calling tree
• Performance statistics
The type of reports to be included in the listing can be set through compiler options.

Performance of Parallelizing Compilers
(Chart on the original slide: speedups of ARC2D, BDNA, FLO52Q, HYDRO2D, MDG, SWIM, TOMCATV, and TRFD on a 5-processor Sun Ultra SMP, comparing the native parallelizer, Polaris generating native directives, and Polaris generating OpenMP.)

Tuning Automatically-Parallelized Code
• This task is similar to explicit parallel programming.
• Two important differences:
  – The compiler gives hints in its listing, which may tell you where to focus attention (e.g., which variables have data dependences).
  – You don't need to perform all transformations by hand. If you expose the right information to the compiler, it will do the translation for you (e.g., C$assert independent).

Why Tune Automatically-Parallelized Code?
Hand improvements can pay off because:
• compiler techniques are limited
  – e.g., array reductions are parallelized by only a few compilers (see the sketch below)
• compilers may have insufficient information
  – e.g., the loop iteration range may be input data
  – e.g., variables are defined in other subroutines (no interprocedural analysis)
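As a hand-written illustration of the array-reduction case (C, with made-up names and an assumed bound on the team size): each thread accumulates into its own copy of the histogram and the copies are merged afterwards, which is exactly the kind of transformation only a few compilers of that era applied automatically.

  #include <string.h>
  #include <omp.h>

  #define NBINS       64
  #define NDATA       100000
  #define MAXTHREADS  64        /* assumed upper bound on the team size */

  void histogram(const int *bin_of, double *hist)
  {
      double local[MAXTHREADS][NBINS];   /* one private copy per thread */
      int i, t, b;
      int nthreads = omp_get_max_threads();

      memset(local, 0, sizeof(local));

      #pragma omp parallel private(i)
      {
          int me = omp_get_thread_num();
          #pragma omp for
          for (i = 0; i < NDATA; i++)
              local[me][bin_of[i]] += 1.0;   /* race-free: thread 'me' owns its row */
      }

      /* merge the per-thread partial histograms into the shared result */
      for (t = 0; t < nthreads; t++)
          for (b = 0; b < NBINS; b++)
              hist[b] += local[t][b];
  }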

Performance Tuning Tools
Whether the directives were inserted by the user or by a parallelizing compiler, the user still has to tune the resulting OpenMP program; for this we need tool support.

Profiling Tools
• Timing profiles (subroutine or loop level)
  – show the most time-consuming program sections
• Cache profiles
  – point out memory/cache performance problems
• Data-reference and transfer volumes
  – show performance-critical program properties
• Input/output activities
  – point out possible I/O bottlenecks
• Hardware counter profiles
  – large number of processor statistics

KAI GuideView: Performance Analysis
• Speedup curves
  – Amdahl's Law vs. actual times
• Whole-program time breakdown
  – Productive work vs. parallel overheads
• Compare several runs
  – Scaling with processors
• Breakdown by section
  – Parallel regions
  – Barrier sections
  – Serial sections
• Breakdown by thread
• Breakdown of overhead
  – Types of runtime calls
  – Frequency and time

GuideView
• Analyze each parallel region.
• Find serial regions that are hurt by parallelism.
• Sort or filter regions to navigate to hotspots.
(www.kai.com)

SGI SpeedShop and WorkShop
• Suite of performance tools from SGI
• Measurements based on
  – pc-sampling and call-stack sampling
    – based on time [prof, gprof]
    – based on R10K/R12K hardware counters
  – basic block counting [pixie]
• Analysis on various domains
  – program graph, source and disassembled code
  – per-thread as well as cumulative data

SpeedShop and WorkShop
Addresses the performance issues:
• Load imbalance
  – call-stack sampling based on time (gprof)
• Synchronization overhead
  – call-stack sampling based on time (gprof)
  – call-stack sampling based on hardware counters
• Memory hierarchy performance
  – call-stack sampling based on hardware counters

WorkShop: Call Graph View
(Screenshot on the original slide.)

WorkShop: Source View
(Screenshot on the original slide.)

Purdue Ursa Minor/Major
• Integrated environment for compilation and performance analysis/tuning
• Provides browsers for many sources of information:
  – call graphs, source and transformed program, compilation reports, timing data, parallelism estimation, data reference patterns, performance advice, etc.
• www.ecn.purdue.edu/ParaMount/UM/

Ursa Minor/Major
(Screenshots on the original slide: Program Structure View and Performance Spreadsheet.)

TAU Tuning Analysis Utilities
• Performance analysis environment for C++, Java, C, Fortran 90, HPF, and HPC++
• compilation facilitator
• call graph browser
• source code browser
• profile browsers
• speedup extrapolation
• www.cs.uoregon.edu/research/paracomp/tau/

TAU Tuning Analysis Utilities
(Screenshot on the original slide.)

SC'2000 Tutorial Agenda
• OpenMP: A Quick Recap
• OpenMP Case Studies
  – including performance tuning
• Automatic Parallelism and Tools Support
• Common Bugs in OpenMP Programs
  – and how to avoid them
• Mixing OpenMP and MPI
• The Future of OpenMP

SMP Programming Errors
• Shared memory parallel programming is a mixed bag:
  – It saves the programmer from having to map data onto multiple processors. In this sense, it's much easier.
  – It opens up a range of new errors coming from unanticipated shared resource conflicts.

2 Major SMP Errors
• Race conditions
  – The outcome of a program depends on the detailed timing of the threads in the team.
• Deadlock
  – Threads lock up waiting on a locked resource that will never become free.

Race Conditions

C$OMP PARALLEL SECTIONS
      A = B + C
C$OMP SECTION
      B = A + C
C$OMP SECTION
      C = B + A
C$OMP END PARALLEL SECTIONS

• The result varies unpredictably based on the detailed order of execution of each section.
• Wrong answers are produced without warning!

Race Conditions: A Complicated Solution

      ICOUNT = 0
C$OMP PARALLEL SECTIONS
      A = B + C
      ICOUNT = 1
C$OMP FLUSH (ICOUNT)
C$OMP SECTION
 1000 CONTINUE
C$OMP FLUSH (ICOUNT)
      IF (ICOUNT .LT. 1) GO TO 1000
      B = A + C
      ICOUNT = 2
C$OMP FLUSH (ICOUNT)
C$OMP SECTION
 2000 CONTINUE
C$OMP FLUSH (ICOUNT)
      IF (ICOUNT .LT. 2) GO TO 2000
      C = B + A
C$OMP END PARALLEL SECTIONS

• In this example, we choose the assignments to occur in the order A, B, C.
  – ICOUNT forces this order.
  – FLUSH so each thread sees updates to ICOUNT. Note: you need the flush on each read and each write.

Race Conditions

C$OMP PARALLEL SHARED (X)
C$OMP& PRIVATE(TMP)
      ID = OMP_GET_THREAD_NUM()
C$OMP DO REDUCTION(+:X)
      DO 100 I = 1, 100
         TMP = WORK(I)
         X = X + TMP
 100  CONTINUE
C$OMP END DO NOWAIT
      Y(ID) = WORK(X, ID)
C$OMP END PARALLEL

• The result varies unpredictably because the value of X isn't dependable until the barrier at the end of the DO loop.
• Wrong answers are produced without warning!
• Solution: be careful when you use NOWAIT.

Race Conditions

      REAL TMP, X
C$OMP PARALLEL DO REDUCTION(+:X)
      DO 100 I = 1, 100
         TMP = WORK(I)
         X = X + TMP
 100  CONTINUE
C$OMP END DO
      Y(ID) = WORK(X, ID)
C$OMP END PARALLEL

• The result varies unpredictably because access to the shared variable TMP is not protected.
• Wrong answers are produced without warning!
• The user probably wanted to make TMP private.
• I lost an afternoon to this bug last year. After spinning my wheels and insisting there was a bug in KAI's compilers, the KAI tool Assure found the problem immediately!

Deadlock

      CALL OMP_INIT_LOCK (LCKA)
      CALL OMP_INIT_LOCK (LCKB)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      CALL OMP_SET_LOCK(LCKB)
      CALL USE_A_and_B (RES)
      CALL OMP_UNSET_LOCK(LCKB)
      CALL OMP_UNSET_LOCK(LCKA)
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKB)
      CALL OMP_SET_LOCK(LCKA)
      CALL USE_B_and_A (RES)
      CALL OMP_UNSET_LOCK(LCKA)
      CALL OMP_UNSET_LOCK(LCKB)
C$OMP END SECTIONS

• This shows a race condition and a deadlock.
  – If A is locked by one thread and B by another, you have deadlock.
  – If the same thread gets both locks, you get a race condition, i.e., different behavior depending on the detailed interleaving of the threads.
• Avoid nesting different locks.

Deadlock

      CALL OMP_INIT_LOCK (LCKA)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      IVAL = DOWORK()
      IF (IVAL .EQ. TOL) THEN
         CALL OMP_UNSET_LOCK (LCKA)
      ELSE
         CALL ERROR (IVAL)
      ENDIF
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      CALL USE_B_and_A (RES)
      CALL OMP_UNSET_LOCK(LCKA)
C$OMP END SECTIONS

• This shows a race condition and a deadlock.
  – If A is locked in the first section and the IF statement branches around the unset lock, threads running the other section deadlock waiting for the lock to be released.
• Make sure you release your locks.

OpenMP Death-Traps
• Are you using thread-safe libraries?
• I/O inside a parallel region can interleave unpredictably.
• Make sure you understand what your constructors are doing with private objects.
• Private variables can mask globals.
• Understand when shared memory is coherent. When in doubt, use FLUSH.
• NOWAIT removes implied barriers.

Navigating Through the Danger Zones
• Option 1: Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results.
  – This can be prohibitively difficult due to the explosion of possible interleavings.
  – Tools like KAI's Assure can help.

Navigating Through the Danger Zones
• Option 2: Write SMP code that is portable and equivalent to the sequential form.
  – Use a safe subset of OpenMP.
  – Follow a set of "rules" for sequential equivalence.

Portable Sequential Equivalence
• What is Portable Sequential Equivalence (PSE)?
  – A program is sequentially equivalent if its results are the same with one thread and many threads.
  – For a program to be portable (i.e., to run the same on different platforms/compilers) it must execute identically whether the OpenMP constructs are used or ignored.

Portable Sequential Equivalence
• Advantages of PSE:
  – A PSE program can run on a wide range of hardware and with different compilers, which minimizes software development costs.
  – A PSE program can be tested and debugged in serial mode with off-the-shelf tools, even if they don't support OpenMP.

2 Forms of Sequential Equivalence
• Two forms of sequential equivalence, based on what you mean by the phrase "equivalent to the single-threaded execution":
  – Strong SE: bitwise identical results.
  – Weak SE: equivalent mathematically, but due to the quirks of floating-point arithmetic, not bitwise identical.

Strong Sequential Equivalence: Rules
• Control data scope with the base language.
  – Avoid the data scope clauses.
  – Only use PRIVATE for scratch variables local to a block (e.g., temporaries or loop control variables) whose global initialization doesn't matter.
• Locate all cases where a shared variable can be written by multiple threads.
  – The access to the variable must be protected.
  – If multiple threads combine results into a single value, enforce sequential order.
  – Do not use the REDUCTION clause.

Strong Sequential Equivalence: Example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO ORDERED
      DO 100 I = 1, NDIM
         TMP = ALG_KERNEL(I)
C$OMP ORDERED
         CALL COMBINE (TMP, RES)
C$OMP END ORDERED
 100  CONTINUE
C$OMP END PARALLEL

• Everything is shared except I and TMP. These can be private since they are not initialized and they are not used outside the loop.
• The summation into RES occurs in the sequential order, so the result from the program is bitwise compatible with the sequential program.
• Problem: this can be inefficient if threads finish in an order that is greatly different from the sequential order.

Weak Sequential Equivalence
• For weak sequential equivalence, only mathematically valid constraints are enforced.
  – Floating-point arithmetic is not associative and not commutative.
  – In most cases, no particular grouping of floating-point operations is mathematically preferred, so why take a performance hit by forcing the sequential order?
    – In most cases, if you need a particular grouping of floating-point operations, you have a bad algorithm.
• How do you write a program that is portable and satisfies weak sequential equivalence?
  – Follow the same rules as in the strong case, but relax the sequential ordering constraints.

Weak Equivalence: Example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO
      DO 100 I = 1, NDIM
         TMP = ALG_KERNEL(I)
C$OMP CRITICAL
         CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
 100  CONTINUE
C$OMP END PARALLEL

• The summation into RES occurs one thread at a time, but in any order, so the result is not bitwise compatible with the sequential program.
• Much more efficient, but some users get upset when low-order bits vary between program runs.

Sequential Equivalence Isn't a Silver Bullet

C$OMP PARALLEL
C$OMP& PRIVATE(I, ID, TMP, RVAL)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
      RVAL = RAND ( ID )
C$OMP DO
      DO 100 I = 1, NDIM
         RVAL = RAND (RVAL)
         TMP = RAND_ALG_KERNEL(RVAL)
C$OMP CRITICAL
         CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
 100  CONTINUE
C$OMP END PARALLEL

• This program follows the weak PSE rules, but it is still wrong.
• In this example, RAND() may not be thread safe. Even if it is, the pseudo-random sequences might overlap, thereby throwing off the basic statistics.

SC'2000 Tutorial Agenda
• OpenMP: A Quick Recap
• OpenMP Case Studies
  – including performance tuning
• Automatic Parallelism and Tools Support
• Common Bugs in OpenMP Programs
  – and how to avoid them
• Mixing OpenMP and MPI
• The Future of OpenMP

What is MPI? The Message Passing Interface
• MPI was created by an international forum in the early 90's.
• It is huge -- the union of many good ideas about message-passing APIs.
  – over 500 pages in the spec
  – over 125 routines in MPI 1.1 alone
  – possible to write programs using only a couple of dozen of the routines
• MPI 1.1 -- MPICH reference implementation.
• MPI 2.0 -- exists as a spec; full implementations?

How Do People Use MPI? The SPMD Model
• Start from a sequential program working on a data set.
• Replicate the program, break up the data, and add glue code.
• The result is a parallel program working on a decomposed data set, with coordination by passing messages.

Pi Program in MPI

#include <mpi.h>
static long num_steps = 100000;
void main (int argc, char *argv[])
{
    int i, my_id, numprocs, my_steps;
    double x, pi, step, sum = 0.0;
    step = 1.0/(double) num_steps;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    my_steps = num_steps/numprocs;
    for (i = my_id*my_steps; i < (my_id+1)*my_steps; i++)
    {
        x = (i+0.5)*step;
        sum += 4.0/(1.0+x*x);
    }
    sum *= step;
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
}

How Do People Mix MPI and OpenMP?
• Start from a sequential program working on a data set.
• Create the MPI program with its data decomposition (replicate the program, break up the data, add glue code).
• Use OpenMP inside each MPI process.

Pi Program in MPI and OpenMP

#include <mpi.h>
#include "omp.h"
static long num_steps = 100000;
void main (int argc, char *argv[])
{
    int i, my_id, numprocs, my_steps;
    double x, pi, step, sum = 0.0;
    step = 1.0/(double) num_steps;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    my_steps = num_steps/numprocs;
#pragma omp parallel for private(x) reduction(+:sum)
    for (i = my_id*my_steps; i < (my_id+1)*my_steps; i++)
    {
        x = (i+0.5)*step;
        sum += 4.0/(1.0+x*x);
    }
    sum *= step;
    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
}

• Get the MPI part done first, then add OpenMP pragmas where it makes sense to do so.

Mixing OpenMP and MPI: Let the Programmer Beware!
• Messages are sent to a process on a system, not to a particular thread.
  – Safest approach: only do MPI inside serial regions.
  – ... or do it inside MASTER constructs (see the sketch below).
  – ... or do it inside SINGLE or CRITICAL constructs.
    – But this only works if your MPI is really thread safe!
• Environment variables are not propagated by mpirun. You'll need to broadcast OpenMP parameters and set them with the library routines.
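A hand-written sketch of the "MPI inside MASTER" pattern (illustrative only; the buffer and routine names are made up, and the function must be called by all threads of the parallel region, since MASTER has no implied barrier):

  #include <mpi.h>
  #include <omp.h>

  /* Exchange a halo value with a neighboring rank while an OpenMP team
     is active: only the master thread talks to MPI, and explicit
     barriers keep the other threads from touching the buffer too early. */
  void halo_exchange(double *halo, int my_rank, int numprocs)
  {
      #pragma omp barrier              /* make sure 'halo' is ready to send */
      #pragma omp master
      {
          int partner = (my_rank + 1) % numprocs;
          MPI_Status status;
          MPI_Sendrecv_replace(halo, 1, MPI_DOUBLE, partner, 0,
                               partner, 0, MPI_COMM_WORLD, &status);
      }
      #pragma omp barrier              /* other threads wait for the new value */
  }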

SC'2000 Tutorial Agenda
• OpenMP: A Quick Recap
• OpenMP Case Studies
  – including performance tuning
• Automatic Parallelism and Tools Support
• Common Bugs in OpenMP Programs
  – and how to avoid them
• Mixing OpenMP and MPI
• The Future of OpenMP

OpenMP Futures: The ARB
• The future of OpenMP is in the hands of the OpenMP Architecture Review Board (the ARB):
  – Intel, KAI, IBM, HP, Compaq, Sun, SGI, DOE ASCI
• The ARB resolves interpretation issues and manages the evolution of new OpenMP APIs.
• Membership in the ARB is open to any organization with a stake in OpenMP:
  – Research organizations (e.g., DOE ASCI)
  – Hardware vendors (e.g., Intel or HP)
  – Software vendors (e.g., KAI)

The Future of OpenMP
• OpenMP is an evolving standard. We will see to it that it is well matched to the changing needs of the shared memory programming community.
• Here's what's coming in the future:
  – OpenMP 2.0 for Fortran:
    – A major update of OpenMP for Fortran 95.
    – Status: specification released at SC'00.
  – OpenMP 2.0 for C/C++:
    – Work to begin in January 2001.
    – Specification complete by SC'01.
• To learn more about OpenMP 2.0, come to the OpenMP BOF on Tuesday evening.

Reference Material on OpenMP
Homepage: www.openmp.org — the primary source of information about OpenMP and its development.

Books:
Chandra R, et al. Parallel Programming in OpenMP. San Francisco, CA: Morgan Kaufmann; London: Harcourt, 2000. ISBN 1558606718.

Research papers:
Sosa CP, Scalmani C, Gomperts R, Frisch MJ. Ab initio quantum chemistry on a ccNUMA architecture using OpenMP. III. Parallel Computing, vol. 26, no. 7-8, July 2000, pp. 843-56. Elsevier, Netherlands.
Bova SW, Breshears CP, Cuicchi C, Demirbilek Z, Gabb H. Nesting OpenMP in an MPI application. Proceedings of the ISCA 12th International Conference on Parallel and Distributed Systems, ISCA, 1999, pp. 566-71. Cary, NC, USA.
Gonzalez M, Serra A, Martorell X, Oliver J, Ayguade E, Labarta J, Navarro N. Applying interposition techniques for performance analysis of OpenMP parallel applications. Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS 2000), IEEE Computer Society, 2000, pp. 235-40. Los Alamitos, CA, USA.
Bull JM, Kambites ME. JOMP: an OpenMP-like interface for Java. Proceedings of the ACM 2000 Conference on Java Grande, 2000, pp. 44-53.
Chapman B, Mehrotra P, Zima H. Enhancing OpenMP with features for locality control. Proceedings of the Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology: Towards Teracomputing, World Scientific Publishing, 1999, pp. 301-13. Singapore.
Cappello F, Richard O, Etiemble D. Performance of the NAS benchmarks on a cluster of SMP PCs using a parallelization of the MPI programs with OpenMP. Parallel Computing Technologies: 5th International Conference (PaCT-99), Proceedings (Lecture Notes in Computer Science, vol. 1662), Springer-Verlag, 1999, pp. 339-50. Berlin, Germany.
Couturier R, Chipot C. Parallel molecular dynamics using OpenMP on a shared memory machine. Computer Physics Communications, vol. 124, no. 1, Jan. 2000, pp. 49-59. Elsevier, Netherlands.
Bova SW, Breshears CP, Cuicchi CE, Demirbilek Z, Gabb HA. Dual-level parallel analysis of harbor wave response using MPI and OpenMP. International Journal of High Performance Computing Applications, vol. 14, no. 1, Spring 2000, pp. 49-64. Sage Science Press, USA.
Scherer A, Lu H, Gross T, Zwaenepoel W. Transparent adaptive parallelism on NOWs using OpenMP. ACM SIGPLAN Notices, vol. 34, no. 8, Aug. 1999, pp. 96-106. USA.
Ayguade E, Martorell X, Labarta J, Gonzalez M, Navarro N. Exploiting multiple levels of parallelism in OpenMP: a case study. Proceedings of the 1999 International Conference on Parallel Processing, IEEE Computer Society, 1999, pp. 172-80. Los Alamitos, CA, USA.
Lu H, Hu YC, Zwaenepoel W. OpenMP on networks of workstations. Proceedings of ACM/IEEE SC98: 10th Anniversary High Performance Networking and Computing Conference, IEEE Computer Society, 1998, 13 pp. Los Alamitos, CA, USA.
Throop J. OpenMP: shared-memory parallelism from the ashes. Computer, vol. 32, no. 5, May 1999, pp. 108-9. IEEE Computer Society, USA.
Hu YC, Lu H, Cox AL, Zwaenepoel W. OpenMP for networks of SMPs. Proceedings of the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP 1999), IEEE Computer Society, 1999, pp. 302-10. Los Alamitos, CA, USA.
Bova SW, Breshears CP, Gabb H, Eigenmann R, Gaertner G, Kuhn B, Magro B, Salvini S. Parallel programming with message passing and directives. SIAM News, vol. 32, no. 9, Nov. 1999.
Still CH, Langer SH, Alley WE, Zimmerman GB. Shared memory programming with OpenMP. Computers in Physics, vol. 12, no. 6, Nov.-Dec. 1998, pp. 577-84. AIP, USA.
Chapman B, Mehrotra P. OpenMP and HPF: integrating two paradigms. Euro-Par'98 Parallel Processing: 4th International Euro-Par Conference, Proceedings, Springer-Verlag, 1998, pp. 650-8. Berlin, Germany.
Dagum L, Menon R. OpenMP: an industry-standard API for shared-memory programming. IEEE Computational Science & Engineering, vol. 5, no. 1, Jan.-March 1998, pp. 46-55. IEEE, USA.
Clark D. OpenMP: a parallel standard for the masses. IEEE Concurrency, vol. 6, no. 1, Jan.-March 1998, pp. 10-12. IEEE, USA.