Compiler Ecosystem 11/1/2020 Computation Products Group 1

Compiler Comparisons Table
Critical Features Supported by x86 Compilers

Compilers compared: PGI, GNU, Intel, Pathscale, Absoft, SUN, Microsoft

Features compared:
  • Vector SIMD Support
  • Peels Vector Loops
  • Global IPA
  • OpenMP
  • Links ACML Libraries
  • Aligns Vector Loops
  • Parallel Debuggers
  • Profile Guided Feedback
  • Large Array Support
  • Medium Memory Model

Intel CPUID Checks
How to determine if they exist in a binary

  • The CPUID instruction reports:
      - Types of x86/x86-64 instructions supported (SSE, SSE2, SSE3)
      - Vendor of the processor (GenuineIntel or AuthenticAMD)
  • The runtime libraries of Intel's C and FORTRAN compilers check the "Vendor of Processor" and, on a non-Intel processor, run down an alternate code path that either:
      - segmentation faults, because Intel doesn't support non-Intel processors, or
      - executes legacy code optimized for Pentium Pro, PII or PIII
  • CPUID checks also exist in Intel's Math Kernel Library (MKL)
      - applications calling FFTs or linear algebra are strongly impacted
      - ISVs and customers must utilize ACML (likely a 2x performance boost)
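To make the difference between checking the vendor and checking the instruction set concrete, here is a minimal Python sketch. It decodes the vendor ID from the register values that CPUID leaf 0 returns and reads the SSE feature bits from leaf 1. The two dispatch functions and the path names are hypothetical illustrations, not code from any Intel library, and the register values are supplied by hand rather than read from hardware.

```python
import struct

# Feature bits reported by CPUID leaf 1.
SSE_BIT_EDX = 1 << 25
SSE2_BIT_EDX = 1 << 26
SSE3_BIT_ECX = 1 << 0


def vendor_string(ebx, edx, ecx):
    """Decode the 12-byte vendor ID that CPUID leaf 0 returns in EBX, EDX, ECX."""
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


def supported_sse_levels(edx, ecx):
    """List the SSE levels advertised by the leaf-1 feature registers."""
    levels = []
    if edx & SSE_BIT_EDX:
        levels.append("SSE")
    if edx & SSE2_BIT_EDX:
        levels.append("SSE2")
    if ecx & SSE3_BIT_ECX:
        levels.append("SSE3")
    return levels


def dispatch_by_vendor(vendor, edx, ecx):
    """Hypothetical vendor-gated dispatch: the fast path needs the vendor to match."""
    if vendor == "GenuineIntel" and "SSE2" in supported_sse_levels(edx, ecx):
        return "sse2_path"
    return "legacy_path"


def dispatch_by_feature(edx, ecx):
    """Hypothetical feature-gated dispatch: only the instruction set matters."""
    return "sse2_path" if "SSE2" in supported_sse_levels(edx, ecx) else "legacy_path"


# Hand-written register values for a processor that reports AuthenticAMD
# and supports SSE and SSE2 but not SSE3.
AMD_LEAF0 = (0x68747541, 0x69746E65, 0x444D4163)
LEAF1_EDX = SSE_BIT_EDX | SSE2_BIT_EDX
LEAF1_ECX = 0

vendor = vendor_string(*AMD_LEAF0)
print(vendor, dispatch_by_vendor(vendor, LEAF1_EDX, LEAF1_ECX),
      dispatch_by_feature(LEAF1_EDX, LEAF1_ECX))
```

With the same feature bits, the vendor-gated function falls back to the legacy path while the feature-gated one selects the SSE2 path.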

Intel CPUID Checks
How to determine if they exist in a binary

  • How to check if CPUID checks exist in a binary:
      - Dump all assembly instructions in the binary to a text file:
            objdump -d "binary" > binary.txt
      - Search binary.txt for lines containing cpuid instructions:
            grep "cpuid" binary.txt
      - The search prints the instruction address at the beginning of each line containing cpuid
      - cpuid is located in a function called "Intel.Processor.Identification.Function"; determine how many times it is called in binary.txt by typing:
            grep "Intel.Processor.Identification.Function" binary.txt
  • Illustrating to ISVs and customers the practices employed by Intel at the user's inconvenience builds rapport and confidence between them and AMD
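The grep step can also be scripted when many binaries need checking. The sketch below is one possible approach in Python: it scans the text produced by objdump -d and collects the address of every line whose mnemonic is cpuid. It assumes the usual tab-separated objdump layout (address, raw bytes, instruction); the function name and the sample lines are made up for illustration.

```python
def find_cpuid_addresses(disassembly_lines):
    """Return the addresses of all cpuid instructions in objdump -d output.

    Assumes each instruction line looks like
    '  401132:<TAB>0f a2   <TAB>cpuid', which is an assumption about the
    objdump text layout, not something stated on the slide.
    """
    addresses = []
    for line in disassembly_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # section headers, labels, blank lines
        mnemonic = fields[2].split()
        if mnemonic and mnemonic[0] == "cpuid":
            addresses.append(int(fields[0].strip().rstrip(":"), 16))
    return addresses


# Made-up disassembly fragment for illustration.
SAMPLE = [
    "0000000000401130 <probe>:\n",
    "  401130:\t31 c0                \txor    %eax,%eax\n",
    "  401132:\t0f a2                \tcpuid  \n",
    "  401134:\te8 f7 ff ff ff       \tcall   401130 <cpuid_probe>\n",
]

for address in find_cpuid_addresses(SAMPLE):
    print(hex(address))
```

Matching on the mnemonic field rather than on the whole line avoids counting calls to functions whose names merely contain "cpuid". To scan a real dump, pass an open file object for binary.txt in place of SAMPLE.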

Intel Compiler and MKL on Opteron
Threat Assessment of using Intel Compilers

  • The compiler is a weapon: its maker can control the code generated and run upon its own chip and its competitor's
      - working with PGI and NAG, we can address a customer's performance and functionality issues by modifying the compiler or ACML
  • CPUID checks: the Vendor ID is checked rather than instruction compatibility
      - AMD platform issues are not supported unless reproducible on Intel platforms
      - CPUID checks are placed into code because Intel doesn't trust users' intellect
        http://support.intel.com/support/performancetools/c/sb/cs-009787.htm
  • Issues on AMD platforms cannot be addressed and will not be reproducible, since we do not issue the same VENDOR ID in the CPUID instruction
  • ISVs and customers draw the conclusion that AMD platforms aren't dependable

Intel Compiler and MKL on Opteron
Threat Assessment of using Intel Compilers

  • The AMD Core Math Library (ACML) cannot be linked with the Intel 8.1 AMD64 compiler; the only option is Intel's MKL
      - Opteron runs many Intel MKL routines at 25-75% of the rate at which it runs the counterpart ACML routines (ex: CFFT1D, CFFT2D, DGEMM, …)
      - ISVs and customers whose applications are performance bound by FFTs, BLAS or LAPACK are strongly impacted (ex: ANSYS performance increased 43% moving to 64-bit using ACML rather than MKL)
  • Necessitates increasing the number of compilers and binaries required to support both AMD and Intel platforms
      - PGI creates both AMD (-tp k8-64) and Intel (-tp p7-64) tuned binaries
      - work done by AMD tuning the PGI compiler is also leveraged in Intel binaries
  • On LS-DYNA, the PGI 64-bit binary targeted towards Xeon with -tp p7-64 is faster than the Intel 8.1 binary by 4%

Intel Compiler and MKL on Opteron
Threat Assessment of using Intel Compilers

  • Intel has stated at the link below that in the 8.1 Intel compilers the switches to target chips without SSE2 or SSE3 will no longer function
    http://support.intel.com/support/performancetools/c/sb/cs-009787.htm
      - Opteron lacks SSE3 support until Jackhammer in Q2 '05
      - The user will be unable to tell the compiler not to utilize SSE3 instructions
  • ISVs and customers will have no way to use binaries built by Intel compilers on Opteron
  • Occurrences such as this will continue every time Intel introduces a new instruction set for x86 based systems (SSE4?)
  • Users presently using the Intel compiler on Opteron based systems, or ISVs supporting customers in a similar manner, will have no method of optimizing code for an AMD based system other than compiling without optimization

Tuning Performance with Compilers
Maintaining Stability while Optimizing

  • STEP 0: Build the application using the following procedure:
      - compile all files with the most aggressive optimization flags:
            -tp k8-64 -fastsse
      - if compilation fails or the application doesn't run properly, turn off vectorization:
            -tp k8-64 -fast -Mscalarsse
      - if problems persist, drop to the lowest optimization level:
            -tp k8-64 -O0 (or -O1)
  • STEP 1: Profile the binary and determine the performance critical routines
  • STEP 2: Repeat STEP 0 on the performance critical functions, one at a time, and run the binary after each step to check stability
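The STEP 0 fallback can be read as a ladder of flag sets that is walked from most to least aggressive until one produces a stable binary. A minimal Python sketch of that loop follows. The builds_and_runs callback is a hypothetical stand-in for whatever build-and-validate step a project uses; only the flag strings come from the slide.

```python
# Flag sets from STEP 0, ordered from most to least aggressive.
FLAG_LADDER = (
    "-tp k8-64 -fastsse",
    "-tp k8-64 -fast -Mscalarsse",
    "-tp k8-64 -O0",
)


def most_aggressive_stable_flags(builds_and_runs, ladder=FLAG_LADDER):
    """Return the first flag set for which the build compiles and runs correctly.

    builds_and_runs is a caller-supplied function (hypothetical here) that
    takes a flag string and returns True when the resulting binary is stable.
    Returns None if every rung of the ladder fails.
    """
    for flags in ladder:
        if builds_and_runs(flags):
            return flags
    return None


# Illustration only: pretend that vectorized builds of this file are unstable.
def fake_build(flags):
    return "-fastsse" not in flags


print(most_aggressive_stable_flags(fake_build))
```

For STEP 2 the same function would be called once per performance critical source file, with a callback that rebuilds only that file and reruns the application.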

Tuning Memory IO Bandwidth
Optimizing large streaming operations

  • 2 methods of writing to memory in x86/x86-64:
      - traditional memory stores cause write allocates to cache
            mov %rax, [%rdi]
            movsd %xmm0, [%rdi]
            movapd %xmm0, [%rdi]
          1. page to be modified is read into cache
          2. cache is modified, then written to memory when a new memory page is loaded
          3. to write N bytes, 2N bytes of bandwidth are generated
      - non-temporal stores bypass cache and write directly to memory
          1. no write allocate to cache; to write N bytes, N bytes of bandwidth are generated
          2. data is not backed up into cache; do not use with often reused data
  • Use only on functions which write L2/2 bytes of data or more (at least half the L2 cache size), which normally assures little cache reuse value
  • Group all eligible routines into a common file so as to simplify the compilation procedure
  • Enable non-temporal stores in the PGI compiler with the -Mnontemporal compiler option
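The bandwidth arithmetic and the size rule on this slide can be written down as a small Python sketch. It reads the threshold as "at least half the L2 cache size", which is an interpretation of the slide, and the 1 MiB L2 size in the example is an assumed value chosen only for illustration.

```python
def store_traffic_bytes(n_bytes, non_temporal):
    """Memory traffic generated by writing n_bytes.

    A traditional store first reads the target into cache (write allocate)
    and later writes it back, so N bytes written cost 2N bytes of traffic.
    A non-temporal store skips the read, so the cost is N bytes.
    """
    return n_bytes if non_temporal else 2 * n_bytes


def is_nontemporal_candidate(bytes_written, l2_cache_bytes):
    """Apply the slide's rule of thumb: the routine writes at least L2/2 bytes."""
    return 2 * bytes_written >= l2_cache_bytes


L2_BYTES = 1024 * 1024  # assumed 1 MiB L2 cache, for illustration only

for written in (256 * 1024, 512 * 1024, 8 * 1024 * 1024):
    print(
        written,
        is_nontemporal_candidate(written, L2_BYTES),
        store_traffic_bytes(written, non_temporal=False),
        store_traffic_bytes(written, non_temporal=True),
    )
```

The rule only screens by size. A routine that passes it but rereads its output soon afterwards is still a poor candidate, since non-temporal stores leave nothing in cache.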

PGI Compiler Flags
Optimization Flags

Below are 3 different sets of recommended PGI compiler flags for flag mining application source bases:

  • Most aggressive: -tp k8-64 -fastsse -Mipa=fast
      - enables instruction level tuning for Opteron, O2 level optimizations, SSE scalar and vector code generation, inter-procedural analysis, LRE optimizations and unrolling
      - strongly recommended for any single precision source code
  • Middle of the road: -tp k8-64 -fast -Mscalarsse
      - enables all of the most aggressive optimizations except vector code generation, which can reorder loops and generate slightly different results
      - a good substitute in double precision source bases, since Opteron has the same throughput on both scalar and vector code
  • Least aggressive: -tp k8-64 -O0 (or -O1)

PGI Compiler Flags
Functionality Flags

  • -mcmodel=medium
      - use if your application statically allocates a net sum of data structures greater than 2 GB
  • -Mlarge_arrays
      - use if any array in your application is greater than 2 GB
  • -KPIC
      - use when linking to shared object (dynamically linked) libraries
  • -mp
      - process OpenMP/OpenMP SGI directives/pragmas (build multi-threaded code)
  • -Mconcur
      - attempt auto-parallelization of your code on SMP system with OpenMP
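The two size-related rules above can be checked mechanically once the array sizes are known. The Python sketch below is only an illustration of those rules: the function name is invented, 2 GB is taken to mean 2 × 1024³ bytes (an assumption), and for simplicity every array passed in is treated as statically allocated.

```python
TWO_GB = 2 * 1024**3  # assumes the slide's "2 GB" means 2 * 1024**3 bytes


def pgi_size_flags(static_array_bytes):
    """Suggest PGI size-related flags from a list of static array sizes in bytes.

    -mcmodel=medium when the net sum of static data exceeds 2 GB,
    -Mlarge_arrays when any single array exceeds 2 GB.
    """
    flags = []
    if sum(static_array_bytes) > TWO_GB:
        flags.append("-mcmodel=medium")
    if any(size > TWO_GB for size in static_array_bytes):
        flags.append("-Mlarge_arrays")
    return flags


GB = 1024**3
print(pgi_size_flags([1 * GB]))          # small application
print(pgi_size_flags([1 * GB, 2 * GB]))  # large in total, no single large array
print(pgi_size_flags([3 * GB]))          # one array above the limit
```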

Absoft Compiler Flags
Optimization Flags

Below are 3 different sets of recommended Absoft compiler flags for flag mining application source bases:

  • Most aggressive: -O3
      - loop transformations, instruction preference tuning, cache tiling, & SIMD code generation (CG). Generally provides the best performance but may cause compilation failure or slow performance in some cases
      - strongly recommended for any single precision source code
  • Middle of the road: -O2
      - enables most of the options enabled by -O3, including SIMD CG, instruction preferences, common sub-expression elimination, & pipelining and unrolling
      - a good substitute in double precision source bases, since Opteron has the same throughput on both scalar and vector code
  • Least aggressive: -O1

Absoft Compiler Flags
Functionality Flags

  • -mcmodel=medium
      - use if your application statically allocates a net sum of data structures greater than 2 GB
  • -g77
      - enables full compatibility with g77 produced objects and libraries (must use this option to link to GNU ACML libraries)
  • -fpic
      - use when linking to shared object (dynamically linked) libraries
  • -safefp
      - performs certain floating point operations in a slower manner that avoids overflow and underflow and assures proper handling of NaNs

Pathscale Compiler Flags
Optimization Flags

  • Most aggressive: -Ofast
      - Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno
  • Aggressive: -O3
      - optimizations for highest quality code enabled at cost of compile time
      - some generally beneficial optimizations included may hurt performance
  • Reasonable: -O2
      - Extensive conservative optimizations
      - Optimizations almost always beneficial
      - Faster compile time
      - Avoids changes which affect floating point accuracy

Pathscale Compiler Flags
Functionality Flags

  • -mcmodel=medium
      - use if static data structures are greater than 2 GB
  • -ffortran-bounds-check
      - (Fortran) check array bounds
  • -shared
      - generate position independent code for calling shared object libraries
  • Feedback Directed Optimization
      - STEP 0: Compile binary with -fb_create fbdata
      - STEP 1: Run code to collect data
      - STEP 2: Recompile binary with -fb_opt fbdata
  • -march=(opteron|athlon64fx)
      - Optimize code for selected platform (Opteron is default)

Microsoft Compiler Flags
Optimization Flags

  • Recommended flags: /O2 /Ob2 /GL /fp:fast
      - /O2 turns on several general optimizations
      - /Ob2 enables inline expansion
      - /fp:fast allows the compiler to use a fast floating point model
      - /GL enables inter-procedural optimizations
  • Feedback Directed Optimization
      - STEP 0: Compile binary with /LTCG:PGI
      - STEP 1: Run code to collect data
      - STEP 2: Recompile binary with /LTCG:PGO
  • Turn off Buffer Overrun Checking
      - The compiler enables /GS by default to check for buffer overruns. Turning off checking by specifying /GS- may result in additional performance

Microsoft Compiler Flags
Functionality Flags

  • /GR
      - enables run-time type information
  • /GT
      - supports fiber safety for data allocated using static thread-local storage
  • /Wp64
      - detects most 64-bit portability problems
  • /LD
      - creates a dynamic-link library
  • /Oa
      - assumes no aliasing
  • /Ow
      - assumes aliasing across function calls but not inside functions

64-Bit Operating Systems
Recommendations and Status

  • SUSE SLES 9 with latest Service Pack available
      - Has technology for supporting latest AMD processor features
      - Widest breadth of NUMA support, enabled by default
      - Oprofile system profiler installable as an RPM and modularized
      - complete support for static & dynamically linked 32-bit binaries
  • Red Hat Enterprise Server 3.0 Service Pack 2 or later
      - NUMA feature support not as complete as that of SUSE SLES 9
      - Oprofile installable as an RPM, but installation is not modularized and may require a kernel rebuild if the RPM version isn't satisfactory
      - only SP 2 or later has complete 32-bit shared object library support (a requirement to run all 32-bit binaries in 64-bit)
      - POSIX threading library changed between 2.1 and 3.0; may require users to rebuild applications