Parallel Computing Explained
Porting Issues
Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
  3.1 Recompile
  3.2 Word Length
  3.3 Compiler Options for Debugging
  3.4 Standards Violations
  3.5 IEEE Arithmetic Differences
  3.6 Math Library Differences
  3.7 Compute Order Related Differences
  3.8 Optimization Level Too High
  3.9 Diagnostic Listings
  3.10 Further Information

Porting Issues
In order to run a computer program that presently runs on a workstation, a mainframe, a vector computer, or another parallel computer on a new parallel computer, you must first "port" the code.
After porting the code, it is important to have some benchmark results you can use for comparison. To do this, run the original program on a well-defined dataset, and save the results from the old or "baseline" computer. Then run the ported code on the new computer and compare the results.
If the results are different, don't automatically assume that the new results are wrong; they may actually be better. There are several reasons why this might be true, including:
  Precision Differences - the new results may actually be more accurate than the baseline results.
  Code Flaws - porting your code to a new computer may have uncovered a hidden flaw in the code that was already there.
Detection methods for finding code flaws, solutions, and

Recompile
Some codes just need to be recompiled to get accurate results. The compilers available on the NCSA computer platforms are shown in the following table:

Language                   SGI Origin 2000   IA-32 Linux                      IA-64 Linux
                           MIPSpro           Intel   GNU   Portland Group     Intel   GNU   Portland Group
Fortran 77                 f77               ifort   g77   pgf77              ifort   g77
Fortran 90                 f90               ifort         pgf90              ifort
Fortran 95                 f95               ifort                            ifort
High Performance Fortran                                   pghpf                            pghpf
C                          cc                icc     gcc   pgcc               icc     gcc
C++                        CC                icpc    g++   pgCC               icpc    g++

Word Length
Code flaws can occur when you are porting your code to a computer with a different word length.
For C, the size of an integer variable differs depending on the machine and how the code is compiled. On the IA-32 and IA-64 Linux clusters, the size of a long integer variable is 4 and 8 bytes, respectively (a plain int is 4 bytes on both). On the SGI Origin 2000, the corresponding value is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag.
For Fortran, the SGI MIPSpro and Intel compilers contain the following flags to set the default variable size:
  -in, where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
  -rn, where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.
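Before porting, it can help to check which word lengths a given machine actually uses. The following is a minimal sketch in Python (not part of the original slides) that reports the sizes of a few C types through the standard ctypes module. The function name c_type_sizes is hypothetical, and the numbers reflect the C ABI that the Python interpreter itself was built for, not what a flag such as -n32 or -64 would select for your own code.

```python
import ctypes


def c_type_sizes():
    """Return the size in bytes of a few C types on this platform.

    The sizes are those of the ABI the running Python interpreter
    was compiled for (an assumption worth checking on each machine).
    """
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "long long": ctypes.sizeof(ctypes.c_longlong),
        "void *": ctypes.sizeof(ctypes.c_void_p),
    }


if __name__ == "__main__":
    for name, size in c_type_sizes().items():
        print(f"{name:10s} {size} bytes")
```

Running the script on the baseline and on the new computer and comparing the two listings shows at a glance which types changed size.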

Compiler Options for Debugging
On the SGI Origin 2000, the MIPSpro compilers include debugging options via the -DEBUG: group. The syntax is as follows:
  -DEBUG:option1[=value1]:option2[=value2]...
Two examples are:
  Array-bound checking: check for subscripts out of range at runtime.
    -DEBUG:subscript_check=ON
  Force all uninitialized stack, automatic and dynamically allocated variables to be initialized.
    -DEBUG:trap_uninitialized=ON

Compiler Options for Debugging
On the IA-32 Linux cluster, the Fortran compiler is equipped with the following -C flags for runtime diagnostics:
  -CA: pointers and allocatable references
  -CB: array and subscript bounds
  -CS: consistent shape of intrinsic procedure
  -CU: use of uninitialized variables
  -CV: correspondence between dummy and actual arguments

Standards Violations
Code flaws can occur when the program has non-ANSI standard Fortran coding. ANSI standard Fortran is a set of rules for compiler writers that specify, for example, the value of the do loop index upon exit from the do loop.
Standards Violations Detection
To detect standards violations on the SGI Origin 2000 computer, use the -ansi flag. This option generates a listing of warning messages for the use of non-ANSI standard coding.
On the Linux clusters, the -ansi[-] flag enables/disables assumption of ANSI conformance.

IEEE Arithmetic Differences
Code flaws can occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not. The IEEE Arithmetic Standard is a set of rules governing arithmetic roundoff and overflow behavior. For example, it prohibits the compiler writer from replacing x/y with x*recip(y), since the two results may differ slightly for some operands.
You can make your program strictly conform to the IEEE standard. To do so on the SGI Origin 2000 computer, use:
  f90 -OPT:IEEE_arithmetic=n ... prog.f
where n is 1, 2, or 3. This option specifies the level of conformance to the IEEE standard, where 1 is the most stringent and 3 is the most liberal.
On the Linux clusters, the Intel compilers can achieve conformance to the IEEE standard at a stringent level with the -mp flag, or at a slightly relaxed level with the -mp1 flag.
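The x/y versus x*recip(y) difference is easy to observe in double precision. The sketch below (an illustration in Python, not taken from the slides) counts how many small integer operand pairs give different answers for the two forms; the function name count_reciprocal_mismatches and the operand range are arbitrary choices made for this example.

```python
def count_reciprocal_mismatches(limit=100):
    """Count operand pairs where x / y and x * (1 / y) differ.

    Both expressions are evaluated in IEEE double precision; the second
    one rounds twice (once for the reciprocal, once for the product).
    """
    mismatches = 0
    for i in range(1, limit):
        for j in range(1, limit):
            x, y = float(i), float(j)
            if x / y != x * (1.0 / y):
                mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("3/10      =", 3.0 / 10.0)
    print("3*(1/10)  =", 3.0 * (1.0 / 10.0))
    print("mismatching pairs below 100:", count_reciprocal_mismatches())
```

The differences are in the last bit of the result, which is why they usually show up only when results are compared digit for digit against the baseline.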

Math Library Differences
Most high-performance parallel computers are equipped with vendor-supplied math libraries.
On the SGI Origin 2000 platform, the SGI/Cray Scientific Library (SCSL) and Complib.sgimath are available. SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK, and Fast Fourier Transform (FFT) routines. SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version. The complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version.
The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.

Math Library Differences
On the IA-32 Linux cluster, the libraries to link to are:
  For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
  For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
When calling MKL routines from C/C++ programs, you also need to link with -lF90.
On the IA-64 Linux cluster, the corresponding libraries are:
  For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
  For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
When calling MKL routines from C/C++ programs, you

Compute Order Related Differences
Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The order in which the threads will run cannot be guaranteed. For example, in a data parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program.
Note: If your algorithm depends on data being computed in a specific order, your code is inappropriate for a parallel computer.
Use the following method to detect compute order related differences. If your loop looks like
  DO I = 1, N
change it to
  DO I = N, 1, -1
The results should not change if the iterations are independent.
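The loop reversal test can be tried out on a small scale before touching the real program. The following Python sketch (an illustration, not from the slides) runs the same loop body in ascending and descending index order and reports whether the results differ. The helper names run_loop, independent, dependent, and order_sensitive are hypothetical.

```python
def run_loop(body, n, reverse=False):
    """Apply body(a, i) for i = 1..n, ascending or descending."""
    a = [float(i) for i in range(n + 1)]  # a[0] serves as a boundary value
    indices = range(n, 0, -1) if reverse else range(1, n + 1)
    for i in indices:
        body(a, i)
    return a


def independent(a, i):
    """Iteration i touches only element i."""
    a[i] = 2.0 * a[i]


def dependent(a, i):
    """Iteration i reads the element that iteration i-1 writes."""
    a[i] = a[i] + a[i - 1]


def order_sensitive(body, n=10):
    """Return True if reversing the loop changes the results."""
    return run_loop(body, n) != run_loop(body, n, reverse=True)


if __name__ == "__main__":
    print("independent loop changes when reversed:", order_sensitive(independent))
    print("dependent loop changes when reversed:  ", order_sensitive(dependent))
```

A loop that passes this test is not guaranteed to be safe under every thread schedule, but a loop that fails it clearly carries a dependence between iterations.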

Optimization Level Too High
Code flaws can occur when the optimization level has been set too high, thus trading accuracy for speed. The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at higher optimization levels.
Setting the Optimization Level
Both the SGI Origin 2000 computer and the IBM Linux clusters provide Level 0 (no optimization) to Level 3 (most aggressive) optimization, using the -O{0, 1, 2, or 3} flag. One should bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking correctness and precision of calculation is highly recommended when -O3 is used.
For example, on the Origin 2000
  f90 -O0 ... prog.f
turns off all optimizations.
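One reason reordering can change answers is that floating-point addition is not associative. The small Python sketch below (an illustration only; it does not reproduce what any particular compiler does at -O3) evaluates the same three-term sum with two different groupings. The function names are hypothetical.

```python
def sum_grouped_left(a, b, c):
    """Evaluate (a + b) + c, as written in the source."""
    return (a + b) + c


def sum_grouped_right(a, b, c):
    """Evaluate a + (b + c), as a reordering transformation might."""
    return a + (b + c)


if __name__ == "__main__":
    left = sum_grouped_left(0.1, 0.2, 0.3)
    right = sum_grouped_right(0.1, 0.2, 0.3)
    print("(0.1 + 0.2) + 0.3 =", repr(left))
    print("0.1 + (0.2 + 0.3) =", repr(right))
    print("difference        =", left - right)
```

The two groupings differ only in the last bit here, but in long reductions such differences can accumulate, which is why comparing against a lower optimization level is worthwhile.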

Optimization Level Too High
Isolating Optimization Level Problems
You can sometimes isolate optimization level problems using the method of binary chop. To do this, divide your program prog.f into halves. Name them prog1.f and prog2.f. Compile the first half with -O0 and the second half with -O3:
  f90 -c -O0 prog1.f
  f90 -c -O3 prog2.f
  f90 prog1.o prog2.o
  a.out > results
If the results are correct, the optimization problem lies in prog1.f.
Next divide prog1.f into halves. Name them prog1a.f and prog1b.f. Compile prog1a.f with -O0 and prog1b.f with -O3:
  f90 -c -O0 prog1a.f
  f90 -c -O3 prog1b.f
  f90 prog1a.o prog1b.o prog2.o
  a.out > results
Continue in this manner until you have isolated the section of code that is producing incorrect results.
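The binary chop procedure can be expressed as a short search loop. The following Python sketch is an illustration under stated assumptions, not a tool from the slides: it assumes exactly one source unit misbehaves when optimized, and it relies on a hypothetical callback results_ok(optimized_units) that you would write yourself to compile the listed units at -O3 and all others at -O0, run the program, and compare against the baseline results. Unlike the manual steps above, units already cleared are rebuilt at -O0 rather than left at -O3.

```python
def find_bad_unit(units, results_ok):
    """Binary chop: return the unit whose optimization breaks the results.

    units      -- list of source unit names (hypothetical, e.g. file names)
    results_ok -- hypothetical callback; given the units to optimize,
                  it returns True if the program output is still correct
    Assumes exactly one unit is at fault.
    """
    suspects = list(units)
    while len(suspects) > 1:
        half = len(suspects) // 2
        first, second = suspects[:half], suspects[half:]
        # Optimize only the second half of the current suspects.
        if results_ok(second):
            suspects = first   # second half is fine, so the fault is in the first
        else:
            suspects = second  # optimizing the second half broke the results
    return suspects[0]


if __name__ == "__main__":
    # Stand-in for a real build-and-compare step: pretend "prog1b.f" is at fault.
    def fake_results_ok(optimized):
        return "prog1b.f" not in optimized

    files = ["prog1a.f", "prog1b.f", "prog2a.f", "prog2b.f"]
    print("problem isolated to:", find_bad_unit(files, fake_results_ok))
```

With n units the search needs about log2(n) build-and-run cycles, which is the point of chopping in halves rather than testing units one at a time.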

Diagnostic Listings
The SGI Origin 2000 compiler will generate all kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:
  f90 -listing ...
  f90 -fullwarn ...
  f90 -showdefaults ...
  f90 -version ...
  f90 -help ...

Further Information
SGI
  man f77/f90/cc
  man debug_group
  man math
  man complib.sgimath
  MIPSpro 64-Bit Porting and Transition Guide
  Online Manuals
Linux clusters pages (IA-32, IA-64, Intel 64)
  Intel Fortran Compiler for Linux
  Intel C/C++ Compiler for Linux
  ifort/icc/icpc -help

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
  4.1 Aggressive Compiler Options
  4.2 Compiler Optimizations
  4.3 Vendor Tuned Code
  4.4 Further Information

Scalar Tuning
If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime. This chapter describes many of these techniques:
  The use of the most aggressive compiler options
  The improvement of loop unrolling
  The use of subroutine inlining
  The use of vendor supplied tuned code
The detection of cache problems and their solution are presented in the Cache Tuning chapter.

Aggressive Compiler Options
For the SGI Origin 2000 and the Linux clusters, the main optimization switch is -On, where n ranges from 0 to 3.
  -O0 turns off all optimizations.
  -O1 and -O2 do beneficial optimizations that will not affect the accuracy of results.
  -O3 specifies the most aggressive optimizations. It takes the most compile time, may produce changes in accuracy, and turns on software pipelining.

Aggressive Compiler Options
It should be noted that -O3 might carry out loop transformations that produce incorrect results in some codes. It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower-level optimization.
On the SGI Origin 2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n=1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to the IEEE standard at different levels.
On the SGI Origin 2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for