Optimizing Nuclear Physics Codes on the XT5

Optimizing Nuclear Physics Codes on the XT5
Rebecca J. Hartman-Baker (hartmanbakrj@ornl.gov) and Hai Ah Nam (namha@ornl.gov)
Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory

Outline
I. Why Nuclear Physics?
II. Nuclear Physics Codes
III. Performance Profiling
IV. Compiler Optimization
V. Compiler Feedback & Hand Tuning
VI. Conclusions/Future Work

I. WHY NUCLEAR PHYSICS? (Image: the band Why?; source: http://www.clashmusic.com/artists/why)

I. Why Nuclear Physics?
• Nuclear physics is important to a fundamental understanding of the universe
• Useful for broad applications

Scaling the World
[scale diagram: from a "Big @$$ Supercomputer" and its energy consumption down to the atom, its nucleus, and the hadrons (protons/neutrons) inside, at ~10^-15 meters, aka 1 femtometer, aka 1 fermi = 1 fm]

Matter?
You know the periodic table… but nuclear physicists view the elements differently:
• Why are certain combinations of protons and neutrons stable, while others are unstable?
• In which energy states are the protons and neutrons in a stable nucleus?

Broad Applications: Nuclear Physics For All
• National Security
• Art: artifact analysis
• Medicine: proton therapy
• Nuclear Energy: clean, safe & renewable
• Astrophysics: evolution of the cosmos

A whole lot of nuclei to study
• Experiment: there are nuclei we can't measure or phenomena we can't explain.
• Theory:
  – Light Nuclei: NCSM, GFMC, CC
  – Medium Mass: CI, CC
  – Heavy Nuclei: DFT

II. NUCLEAR PHYSICS CODES ("Code Talkers," xkcd comic: http://xkcd.com/257/)

II. Nuclear Physics Codes
• Theoretical and Computational Methods
• Representative Codes
  – NUCCOR
  – Bigstick

Complementary Methods

Application | Code Suite Description               | Current Production Run Sizes                     | Resource
AGFMC       | Argonne Green's Function Monte Carlo | 131,072 cores @ 20 hours (1 trial wave function) | Intrepid
MFDn        | Many Fermion Dynamics, nuclear       | 200,000 cores @ 5 hours (1 model space size)     | Jaguar
NUCCOR      | Nuclear Coupled-Cluster Oak Ridge    | 20,000 cores @ 5 hours (1 nucleus)               | Jaguar
DFT         | Density Functional Theory            | 100,000 cores @ 10 hours (entire mass table)     | Jaguar

• Validation & Verification
  – Enhance realistic predictive capability by comparing methods in overlap areas
• Input from ab initio methods can be used to optimize/direct nuclear DFT
• Each method has its own strengths and weaknesses

The Need for HPC
[bar chart: INCITE utilization, CPU-hours (millions), 2008-2013: Jaguar allocation vs. usage (OLCF, including Early Science) and Intrepid allocation vs. usage with the Early Science request (ALCF); labeled values include 148, 488, and 700+]

NUCCOR
• Nuclear Coupled-Cluster Oak Ridge
• Developed by David Dean et al.
• Solves the nuclear many-body problem using the coupled-cluster approximation
  – Single and double excitations computed, plus a third-order correction
  – Solves the resulting system of nonlinear equations with Broyden's method (see the sketch below)
• Polynomial scaling with the number of particles and single-particle states
• 16,000 lines of Fortran 90
• Parallelization with MPI; exclusively collective communication
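
A minimal Fortran sketch of Broyden's method (the "good" rank-1 update) on a hypothetical two-equation toy system, to illustrate the solver strategy named above; this is an illustration under simplifying assumptions, not NUCCOR source:

  program broyden_demo
    implicit none
    integer, parameter :: n = 2
    real(8) :: x(n), fx(n), dx(n), df(n), B(n,n), denom
    integer :: it, i

    x = (/ 1.0d0, 1.0d0 /)          ! hypothetical starting guess
    B = 0.0d0
    do i = 1, n
       B(i,i) = 1.0d0               ! identity as initial approximate inverse Jacobian
    end do
    fx = residual(x)
    do it = 1, 100
       dx = -matmul(B, fx)          ! quasi-Newton step: dx = -B*F(x)
       x  = x + dx
       df = residual(x) - fx
       fx = fx + df
       denom = dot_product(dx, matmul(B, df))
       ! "good Broyden" rank-1 update of the inverse Jacobian approximation
       B = B + matmul(reshape(dx - matmul(B, df), (/ n, 1 /)), &
                      reshape(matmul(dx, B),      (/ 1, n /))) / denom
       if (maxval(abs(fx)) < 1.0d-12) exit
    end do
    print *, 'iterations:', it, '  solution:', x
  contains
    function residual(y) result(r)  ! toy system: x^2 + y^2 = 4 and x = y
      real(8), intent(in) :: y(:)
      real(8) :: r(size(y))
      r(1) = y(1)**2 + y(2)**2 - 4.0d0
      r(2) = y(1) - y(2)
    end function residual
  end program broyden_demo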

Bigstick
• Developed by Calvin Johnson at San Diego State University and Erich Ormand at Lawrence Livermore National Laboratory
• Configuration Interaction (shell model)
• Performs on-the-fly recalculation of the Hamiltonian (smaller memory footprint than typical CI codes)
• Lanczos diagonalization to solve for the eigenvalues and eigenvectors representing the ground and excited states of the system (see the sketch below)
• Written in Fortran
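
A minimal Fortran sketch of the Lanczos iteration, under simplifying assumptions (dense symmetric matrix, no reorthogonalization); in Bigstick the matrix-vector product would be the on-the-fly Hamiltonian application rather than a stored matrix:

  ! Builds an m x m tridiagonal projection of symmetric H whose extreme
  ! eigenvalues approximate the lowest (ground/excited) states.
  subroutine lanczos_tridiag(n, m, H, alpha, beta)
    implicit none
    integer, intent(in)  :: n, m
    real(8), intent(in)  :: H(n,n)
    real(8), intent(out) :: alpha(m)   ! diagonal entries
    real(8), intent(out) :: beta(m)    ! off-diagonal entries; beta(1) unused
    real(8) :: v(n), v_prev(n), w(n)
    integer :: k

    call random_number(v)
    v = v / sqrt(dot_product(v, v))    ! normalized random start vector
    v_prev = 0.0d0
    beta(1) = 0.0d0
    do k = 1, m
       w = matmul(H, v)                ! Bigstick recomputes H*v on the fly
       w = w - beta(k)*v_prev
       alpha(k) = dot_product(v, w)
       w = w - alpha(k)*v
       if (k < m) then
          beta(k+1) = sqrt(dot_product(w, w))
          v_prev = v
          v = w / beta(k+1)
       end if
    end do
  end subroutine lanczos_tridiag

The eigenvalues of the resulting tridiagonal matrix approximate the extreme eigenvalues of H, with the ground state converging first.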

III. PERFORMANCE PROFILING (Image: http://www.bmwblog.com/wp-content/uploads/bmw-135i-performance-package-at-albert-park-australia.jpg)

III. Performance Profiling
• CrayPAT
  – About CrayPAT
  – Profiling NUCCOR
• VampirTrace and Vampir
  – About VampirTrace/Vampir
  – Profiling Bigstick

CrayPAT
• Package for instrumenting and tracing codes on Cray systems
• Run the instrumented code to obtain an overview of code behavior
• Re-run with refined profiling to trace the most important subroutines
• Analyzed NUCCOR with CrayPAT

CrayPAT: j-Coupled NUCCOR
• We first profiled the j-coupled version of NUCCOR with CrayPAT
• Discovered it spent >50% of its time sorting in the test benchmark
  – Found it was using a highly inefficient bubble-sort-like algorithm
  – Replaced the "Frankenstein sort" with heapsort, reducing sorting to ~3% of runtime (see the sketch below)
• Asked collaborators what they were sorting, and why
  – Their response: "We're sorting something?"
• Removed the sorting altogether; the code worked just fine
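
For reference, a minimal Fortran heapsort sketch of the kind described above (hypothetical code, not the NUCCOR routine); heapsort is O(n log n) in the worst case, versus O(n^2) for a bubble-sort-like algorithm:

  subroutine heapsort(a, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(inout) :: a(n)
    integer :: i
    real(8) :: tmp

    do i = n/2, 1, -1                 ! build a max-heap in place
       call sift_down(a, i, n)
    end do
    do i = n, 2, -1                   ! repeatedly move the max to the end
       tmp = a(1);  a(1) = a(i);  a(i) = tmp
       call sift_down(a, 1, i-1)
    end do
  contains
    subroutine sift_down(h, start, last)
      real(8), intent(inout) :: h(:)
      integer, intent(in) :: start, last
      integer :: root, child
      real(8) :: t
      root = start
      do while (2*root <= last)       ! restore the heap property downward
         child = 2*root
         if (child < last .and. h(child) < h(child+1)) child = child + 1
         if (h(root) >= h(child)) return
         t = h(root);  h(root) = h(child);  h(child) = t
         root = child
      end do
    end subroutine sift_down
  end subroutine heapsort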

CrayPAT: NUCCOR
• We next profiled the standard version
• Discovered it spent nearly 70% of its time in a single subroutine: t2_eqn_store_p_or_n
• This subroutine became the focus of subsequent work with NUCCOR

VampirTrace/Vampir
• VampirTrace: instruments the code to produce trace files
  – Compile with the VT wrapper, run the code, and obtain trace output files
• Vampir: used to visualize the trace
  – Run in an interactive job of nearly the same size as the job that produced the trace files
  – A server in the interactive job acts as the analysis engine for the local front-end
• Analyzed Bigstick with VampirTrace and Vampir

Top-Level Overview
[Vampir trace screenshot]

Three Steps within Triangle
[Vampir trace screenshot]

Block Reduce Phase
[Vampir trace screenshot]

Bigstick: Analysis
• The triangular pattern in the overview is reminiscent of a sequential algorithm applied across processors
  – Digging deeper shows that in the orthogonalization phase, processors are held up by a single processor writing to the Lanczos vector file
  – Suggestion: reduce the amount of orthogonalization performed
• Disproportionate time spent in MPI_Barrier (~30%)
  – Indicative of load imbalance
  – The barriers sit inside the clocker subroutine, used for performance timings, obscuring evidence of the load imbalance (see the sketch below)
• Majority of time in the block reduce phase spent in …
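
A sketch of the timing pattern described above, with hypothetical routine names: a barrier inside a timing routine makes every rank charge its wait time to MPI_Barrier, so the timed regions look uniform and the real imbalance is hidden:

  ! Hypothetical clocker-style timer with an embedded barrier.
  subroutine start_timer(t0)
    use mpi
    implicit none
    real(8), intent(out) :: t0
    integer :: ierr
    ! Every rank waits here for the slowest one, so the region timed
    ! afterwards appears balanced: the imbalance is absorbed by the barrier.
    call MPI_Barrier(MPI_COMM_WORLD, ierr)
    t0 = MPI_Wtime()
  end subroutine start_timer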

IV. COMPILER OPTIMIZATION (Image source: http://img.domaintools.com/blog/dt-improved-performance.jpg)

IV. Compiler Optimization
• Motivation
• Experiments
• Results

Motivation
• Observed anomalous behavior in NUCCOR using the Intel compiler on Jaguar
• Turned out to be a compiler bug
• The nuclear physics community relies heavily on the Intel compiler
• Question: How will NUCCOR perform when compiled with each of the five compilers available on Jaguar?

Experiments: Compiler Tests
• Compiled NUCCOR with all 5 compilers available on Jaguar (Cray, GNU, Intel, Pathscale, PGI)
• For each compiler, 6 different optimization levels (-O0, -O1, -O2, -O3, default, high)
• Ran each executable 3 times on a 16O benchmark, 441 processors, using 8 cores/node on Jaguarpf:
  – aprun -n 441 -S 4 nuccor.exe
• Checked that each executable produced correct results (what good is getting the wrong answer quickly?)

Experiments: Compiler Information

Compiler  | Version            | High Optimization Flags
Cray      | 7.1.5 (default)    | -O3
GNU       | 4.4.2 (default)    | -O2 -ffast-math -fomit-frame-pointer -mfpmath=sse
Intel     | 11.1.046 (default) | -O3
Pathscale | 3.2 (default)      | -Ofast
PGI       | 9.0.4 (default)    | -fast

Results: -O0 Optimization Level
[bar chart: elapsed time and iteration time (seconds) for each compiler: Cray, GNU, Intel, Pathscale, PGI]

Results: -O1 Optimization Level
[bar chart: elapsed time and iteration time (seconds) for each compiler: Cray, GNU, Intel, Pathscale, PGI]

Results: -O2 Optimization Level
[bar chart: elapsed time and iteration time (seconds) for each compiler: Cray, GNU, Intel, Pathscale, PGI]

Results: -O3 Optimization Level
[bar chart: elapsed time and iteration time (seconds) for each compiler: Cray, GNU, Intel, Pathscale, PGI]

Results: Default Optimization Level
[bar chart: elapsed time and iteration time (seconds) for each compiler: Cray, GNU, Intel, Pathscale, PGI]

Results: High Optimization Level
[bar chart: elapsed time and iteration time (seconds) for each compiler: Cray, GNU, Intel, Pathscale, PGI]

Results: Aggregate Performance Results
[grouped bar chart: Compiler Performance by Optimization Level; elapsed time (seconds) for each compiler (Cray, GNU, Intel, Pathscale, PGI) at -O0, -O1, -O2, -O3, default, and high]

Results: Best Performance at Each Optimization Level

Level    | Top Performer | 2nd Best Performer | % Difference in Walltime (1st vs. 2nd)
-O0      | PGI           | Cray               | 3.54
-O1      | Intel         | Cray               | 5.88
-O2      | Cray          | PGI                | 24.9
-O3      | Cray          | PGI                | 26.2
No flags | Cray          | Intel              | 26.4
Optimal  | Pathscale     | Cray               | 20.5

Overall Champion: the Cray compiler!

V. COMPILER FEEDBACK & HAND TUNING (Image: tuning forks, http://www.phys.cwru.edu/ccpi/Tuning_fork.html)

V. Compiler Feedback & Hand Tuning
• Motivation
• Loop Optimization in NUCCOR
• Compiler Feedback
• Loop Reordering
  – Experiments
  – Results

Motivation
• Discussed results with Jeff Larkin, including the surprise Pathscale victory
• He suggested sending the code to the Cray compiler developers, so they could improve their compiler
• I soon received a very nice, very detailed analysis of where the Cray compiler did not optimize
  – Cray also opened a ticket against this issue, and fixed it in the next release
• The lessons can be applied to the code to improve performance across all compilers

Loop Optimization
• NUCCOR contains many deeply nested loops (depth 4)
• Loops written symmetrically for readability
• But not easy for the compiler to optimize:

  ii = 0
  do b = below_ef+1, tot_orbs
    do j = 1, below_ef
      ii = ii + 1
      jj = 0
      do a = below_ef+1, tot_orbs
        do i = 1, below_ef
          jj = jj + 1
          t2_ccm_eqn%f5d(a,b,i,j) = t2_ccm_eqn%f5d(a,b,i,j) + tmat7(ii,jj)
          t2_ccm_eqn%f5d(b,a,i,j) = t2_ccm_eqn%f5d(b,a,i,j) - tmat7(ii,jj)
          t2_ccm_eqn%f5d(a,b,j,i) = t2_ccm_eqn%f5d(a,b,j,i) - tmat7(ii,jj)
          t2_ccm_eqn%f5d(b,a,j,i) = t2_ccm_eqn%f5d(b,a,j,i) + tmat7(ii,jj)
          ops_cnt = ops_cnt + 4
        end do
      end do
    end do
  end do

Problems in the Loop
• The loop is easy for humans to read
• But its strides through memory cause cache thrashing and increased bandwidth use
• With below_ef = 16 and tot_orbs = 336, each cache line of t2_ccm_eqn%f5d will have to be reloaded 8 times
• Also, the array tmat7(ii,jj) is referenced through its 2nd subscript, so poor stride (Fortran is column-major, so the 1st subscript should vary fastest; see the demo below)
• All compilers on Jaguar (except maybe Pathscale with -Ofast) fail to interchange the loop nesting
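
A small, self-contained Fortran sketch (hypothetical array size) demonstrating the stride effect: the loop whose inner index is the first subscript touches memory with unit stride, while the other order jumps a whole column per access:

  program stride_demo
    implicit none
    integer, parameter :: n = 2000
    real(8), allocatable :: x(:,:)
    real(8) :: s, t0, t1
    integer :: i, j

    allocate(x(n,n))
    call random_number(x)

    s = 0.0d0
    call cpu_time(t0)
    do j = 1, n              ! good stride: inner index i is the 1st subscript
       do i = 1, n
          s = s + x(i,j)
       end do
    end do
    call cpu_time(t1)
    print *, 'unit stride:', t1 - t0, 's  (sum =', s, ')'

    s = 0.0d0
    call cpu_time(t0)
    do i = 1, n              ! poor stride: inner loop jumps n elements each step
       do j = 1, n
          s = s + x(i,j)
       end do
    end do
    call cpu_time(t1)
    print *, 'stride-n:   ', t1 - t0, 's  (sum =', s, ')'
  end program stride_demo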

Loop Memory Access (Poor Stride)
[memory access diagram]

Loop Memory Access (Good Stride)
[memory access diagram]

Compiler Optimizations
• The Cray compiler will output an annotated version of the source file
  – ftn -rm mycode.f90
  – Outputs mycode.lst
• Examine the annotated file to figure out what's going on

Loopmark legend:

Primary Loop Type       Modifiers
A - Pattern matched     a - vector atomic memory operation
C - Collapsed           b - blocked
D - Deleted             c - conditional and/or computed
E - Cloned              f - fused
I - Inlined             i - interchanged
M - Multithreaded       m - partitioned
P - Parallel            p - partial
R - Redundant           r - unrolled
V - Vectorized          s - shortloop
                        t - array syntax temp used
                        w - unwound

Annotated Loop

  371.                    ii = 0
  372.  1------<          do b = below_ef+1, tot_orbs
  373.  1 2-----<           do j = 1, below_ef
  374.  1 2                   ii = ii + 1
  375.  1 2                   jj = 0
  376.  1 2 3----<            do a = below_ef+1, tot_orbs
  377.  1 2 3 r8--<             do i = 1, below_ef       <-- unrolled 8 times
  378.  1 2 3 r8                  jj = jj + 1
  379.  1 2 3 r8                  t2_ccm_eqn%f5d(a,b,i,j) = … + tmat7(ii,jj)
  380.  1 2 3 r8                  t2_ccm_eqn%f5d(b,a,i,j) = … - tmat7(ii,jj)
  381.  1 2 3 r8                  t2_ccm_eqn%f5d(a,b,j,i) = … - tmat7(ii,jj)
  382.  1 2 3 r8                  t2_ccm_eqn%f5d(b,a,j,i) = … + tmat7(ii,jj)
  383.  1 2 3 r8                  ops_cnt = ops_cnt + 4
  384.  1 2 3 r8->              end do
  385.  1 2 3---->            end do
  386.  1 2------>          end do
  387.  1-------->        end do

(the r8 loopmark on the inner loop indicates it was unrolled 8 times)

Loop Reordering: Two Things to Try
• Improve stride: reorder so that tmat7 is accessed by consecutive row, not column
• Loop fission: put all the f5d(a,b,:,:) updates in one loop and all the f5d(b,a,:,:) updates in another
• Test these two ideas in a simple stand-alone test code

Test Code: Original Loop

  ii = 0
  do b = abmin, abmax
    do j = ijmin, ijmax
      ii = ii + 1
      jj = 0
      do a = abmin, abmax
        do i = ijmin, ijmax
          jj = jj + 1
          f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
          f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
          f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
          f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
        end do
      end do
    end do
  end do

Test Code: Improved Stride

  do i = ijmin, ijmax
    jj = 0
    do a = abmin, abmax
      do j = ijmin, ijmax
        jj = jj + 1
        ii = 0
        do b = abmin, abmax
          ii = ii + 1
          f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
          f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
          f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
          f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
        end do
      end do
    end do
  end do

Test Code: Loop Fission

  ii = 0
  do j = ijmin, ijmax
    do b = abmin, abmax
      ii = ii + 1
      jj = 0
      do i = ijmin, ijmax
        do a = abmin, abmax
          jj = jj + 1
          f5d(a,b,i,j) = f5d(a,b,i,j) + tmat7(ii,jj)
          f5d(a,b,j,i) = f5d(a,b,j,i) - tmat7(ii,jj)
        end do
      end do
    end do
  end do

  jj = 0
  do i = ijmin, ijmax
    do a = abmin, abmax
      jj = jj + 1
      ii = 0
      do j = ijmin, ijmax
        do b = abmin, abmax
          ii = ii + 1
          f5d(b,a,i,j) = f5d(b,a,i,j) - tmat7(ii,jj)
          f5d(b,a,j,i) = f5d(b,a,j,i) + tmat7(ii,jj)
        end do
      end do
    end do
  end do

Test Code: Cray Compiler Behavior
• Original Loop: unrolled 8 times
• Improved Stride: conditionally vectorized, unrolled 2 times
• Loop Fission: 1st loop vectorized, partially unrolled 4 times; 2nd loop vectorized, unrolled 4 times

Test Code: Performance of All Compilers
[bar chart: wall clock time for 2 iterations (seconds) for the Original Loop, Improved Stride, and Loop Fission variants, compiled with Cray, GNU, Intel, Pathscale, Pathscale -Ofast, and PGI]

VI. CONCLUSIONS & FUTURE WORK (Cartoon from Toothpaste for Dinner: http://www.toothpastefordinner.com/052804/see-into-the-future.gif)

VI. Conclusions & Future Work
• Much potential for optimization in nuclear physics codes
• To remain competitive, nuclear physicists must continue to evolve their codes to run efficiently on HPC resources
  – Relatively simple changes can drastically improve performance
  – In-depth measures must be taken for further performance gains, especially as we move to hybrid systems
  – Need to get away from manager-worker, centralized paradigms
• In the process of implementing insights from this work

Questions
What does a nuclear physicist eat for lunch? Fission Chips!

Resources
• Hai Ah Nam (2010) Femtoscale on Petascale: Nuclear Physics in HPC, LCF Seminar Series, ORNL.
• Rebecca Hartman-Baker (2010) Try Anything Once: A Case Study Using NUCCOR, OLCF Hexcore Workshop, http://www.nccs.gov/wp-content/uploads/2010/02/try_anything2.pdf.
• OLCF software:
  – CrayPAT: http://www.olcf.ornl.gov/kb_articles/software-jaguar-craypat/
  – Vampir: http://www.nccs.gov/computing-resources/jaguar/software/?software=vampir
  – VampirTrace: http://www.nccs.gov/computing-resources/jaguar/software/?software=vampirtrace

Resources
• UNEDF Nuclear Physics collaboration: http://www.unedf.org
• G. F. Bertsch, D. J. Dean, and W. Nazarewicz, SciDAC Rev. 6, 42 (2007)
• D. J. Dean, G. Hagen, M. Hjorth-Jensen, and T. Papenbrock, "Computational aspects of nuclear coupled-cluster theory," Comput. Sci. Disc. 1, 015008 (2008)
• Nuclear Science Advisory Committee, "The Frontiers of Nuclear Science: A Long Range Plan," http://www.sc.doe.gov/np/nsac/docs/Nuclear.Science.Low-Res.pdf

Acknowledgments
• A very big thank-you to:
  – David Pigg (Vanderbilt)
  – Jeff Larkin, Nathan Wichmann, Vince Graziano (Cray)
• This work was supported by the US Department of Energy under contract number DE-AC05-00OR22725 (Oak Ridge National Laboratory, managed by UT-Battelle, LLC).
• This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.