TAU PARALLEL PERFORMANCE SYSTEM

Allen D. Malony, Sameer Shende, Alan Morris, Wyatt Spear, Kevin Huck, Aroon Nataraj, Scott Biersdorff
Performance Research Lab, University of Oregon
{malony, sameer, amorris, wspear, khuck, anataraj, scottb}@cs.uoregon.edu
http://www.cs.uoregon.edu/research/tau

TAU Architecture (diagram labels)

  • Program Analysis: PDT
  • Performance Monitoring: TAUoverSupermon
  • Performance Data Mining: PerfDMF, PerfExplorer
  • Parallel Profile Analysis: ParaProf
  • Parallel Trace Generation: OTF
  • Parallel Trace Analysis: Vampir Server
  • Kernel-Level Performance: KTAU

S3D: flow solver for direct numerical simulation of turbulent combustion

Results from participation in PERI tiger team:

1. Scaling study (Cray XT3, XT4)
  • 1-6400 processors (XT3)
  • 12000 processors (XT3+XT4)
2. Event correlation with time
  • WRITE_SAVEFILE and MPI_Wait are highly correlated
3. 6400 processor XT3+XT4 run shows MPI_Wait times unbalanced and two slow cores
4. 6400 processor XT3+XT4 4D scatterplot shows two distinct clusters: XT3 cores vs. XT4 cores
5. Use process metadata to identify processor/core mapping to machine type
6. Compare mean performance between XT3 and XT4 on 6400 processors
7. MPI_Wait times reduced and more balanced when running only on XT4 cores
8. XT4 only MPI_Wait times less clustered

Performance data available here: http://www.cs.uoregon.edu/research/tau/s3d

GTC: particle-in-cell simulation of turbulence in toroidal fusion plasmas

Results from participation in PERI tiger team:

1. Scaling behavior on Cray XT3 per event
2. Per event relative percentage of total time
  • PUSHI and CHARGEI are dominant
  • SMOOTH increases in importance
3. Correlation of significant events with total time
4. Positive, linear correlation between CHARGEI and PUSHI is shown, as well as MPI_Allreduce and SHIFTI interaction
5. Relative distribution (histogram over min-max range) of significant events for 2048 processors
  • MPI_Allreduce has a heavy head and long tail
  • PUSHI and CHARGEI have heavy tails
6. Same information relative to normal distribution
7. Effect of compiler options on main computation loops in older GTC version, showing how metadata can be used as a comparison parameter (run on BG/L machine with 64 processors)

Performance data available here: http://www.cs.uoregon.edu/research/tau/gtc

CUBE/Expert
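
The event correlations reported above (for example, WRITE_SAVEFILE against MPI_Wait in S3D, or CHARGEI against PUSHI in GTC) can be illustrated with a Pearson correlation coefficient computed over per-process event times. The sketch below is only an illustration of that statistic: it does not use TAU, PerfDMF, or PerfExplorer, the helper name `pearson` is hypothetical, and the timing values are made-up numbers, not measurements from the S3D or GTC runs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean (unnormalized covariance
    # and variances; the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-process times in seconds (illustrative only, one value
# per process; NOT data from the poster).
write_savefile = [1.0, 2.0, 3.0, 4.0]
mpi_wait = [2.1, 3.9, 6.2, 7.8]

r = pearson(write_savefile, mpi_wait)
print(f"Pearson r between the two events: {r:.3f}")
```

A value of r close to +1 indicates the kind of positive, linear relationship the poster describes; a value near 0 would indicate no linear relationship between the two events across processes.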