Phase-Based Parallel Performance Profiling
Allen D. Malony, Sameer Shende, Alan Morris
{malony, sameer, amorris}@cs.uoregon.edu
Department of Computer and Information Science
Performance Research Laboratory
NeuroInformatics Center
University of Oregon

Outline of Talk
- Motivation
  - Models in parallel scientific applications
  - Phases and performance mapping
- Problem description
  - Motivating example
- Profiling techniques
  - Flat, callpath, phase profiling
- Approach and implementation
- Applications
- Future work and concluding remarks
ParCo 2005 Phase-Based Parallel Performance Profiling

Motivation
- Scientific applications designed based on models
  - Computational: structural, logical, numerical models, …
  - Correctness: execution order, data consistency, …
  - Performance: expected, factors, parallelism/scalability, …
- Computational models form developer's "mental" model
  - How the program is intended to behave and perform
- Want to relate performance model to computation model
  - View performance data with respect to "mental" model
  - Better identify problems and guide tuning decisions
- Must link computational abstractions to performance
  - Bridge the semantic gap between measurements and the "mental" model

Performance Mapping
- General problem of linking performance to computation
  - Performance mapping (Irvin and Miller, '96; Shende, '01)
- Associate (map) measured performance data
  - To higher level, semantic representations
  - Those with model significance to the user
- What is the difficulty of making the association?
  - Depends on performance information
    - performance events/state visible from instrumentation
    - what performance data can be measured
  - How the performance information is used in mapping
- Difficulty in how performance information is presented
  - Model-based views (LeBlanc et al., '90)

Phases and Performance Mapping
- Like to support the association between model and data
- Concept of "phases" is common in scientific applications
  - How developers think about structure, logic, numerics
  - How performance can be interpreted (Worley, '92)
- Worthwhile to consider support for phases
  - In performance measurement
  - Bridge semantic gap in parallel performance mapping?
    - tracing has long demonstrated the benefits! (Heath, '91)
    - phase-based analysis and interpretation
- Main contribution
  - Support for phases in parallel performance profiling

Problem Description
- Performance measured as a consequence of events
  - Events represent actions that occur during execution
  - Events of interest determine performance information
  - Events have semantics and context (pragmatics)
- Semantics
  - Defines what the event represents
  - Example: subroutine entry
- Context
  - Properties of the state in which event occurred
  - Example: subroutine's calling parent
- Interrogate context to map event performance data

Motivating Example – Multi-Physics Application
- Assembly of physical objects
  - Different shapes
  - Different materials
- Calculate physics
  - Heat transfer
  - Mechanical stress
- Within / between objects
- Iterate to error tolerance
[Figure: routines heat(), MPIrecv(), stress(), MPIsend(), other routines]
- How is performance attributed?
  - Between events (e.g., routines) and execution components
  - With respect to computational objects (e.g., data objects)

Context and Standard Profiling
- Flat profiles
  - Context is whole program (i.e., program code)
  - Performance distribution across (static) program structure
  - Cannot differentiate dynamics (e.g., callpath or objects)
- Callgraph / callpath profiles
  - Identify parent-child calling relationships at execution
  - Context is calling (event) parent / calling (event) path
  - Extend event semantics to encode context
    - create new event with callpath name
    - requires dynamic event creation for complex callpaths
    - burdens event mechanisms for context identification
    - simple performance associations require many events

Context and Phase Profiling
- View the program execution as collection of phases
  - Transition between phases (sequenced, nested)
    - easiest to think of as phase hierarchy (or phase graph)
  - Phases are not events
    - phase boundaries can mark entry/exit events
- Context is the current phase
  - How do we know what phase we are in?
  - Phases are identified separately from events
    - phases are not encoded in event names
    - event mechanisms are not overloaded
- A phase profile is event performance attributed to phases
  - Phase-specific performance profiles (flat or callpath)

Approach (Flat Profile)
- Create a profile object for each entry/exit event
  - Each profile object has a name
  - Static profile object (static event)
    - event has a single instance (single name)
  - Dynamic profile object (dynamic event)
    - event can have multiple instances (created dynamically)
- Inclusive and exclusive performance statistics
  - Must maintain an event stack (or callstack)
- Contexts are generally thought of as code locations
- Dynamic events do allow for dynamic context awareness
  - User code can check "state" and create new events
  - BUT only see one level of event!

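The flat-profile bookkeeping on this slide can be sketched in a few lines of Python. This is an illustrative model only, not TAU's implementation: the class `FlatProfiler`, its methods, and the timestamps are hypothetical, and explicit timestamps stand in for a real clock so the run is reproducible. The routine names are borrowed from the multi-physics example.

```python
from collections import defaultdict


class FlatProfiler:
    """Toy flat profiler: one profile entry per event name (hypothetical)."""

    def __init__(self):
        # Event stack entries are [name, start_time, time_spent_in_children].
        self.stack = []
        self.inclusive = defaultdict(float)
        self.exclusive = defaultdict(float)
        self.calls = defaultdict(int)

    def enter(self, name, now):
        self.stack.append([name, now, 0.0])

    def exit(self, name, now):
        if not self.stack or self.stack[-1][0] != name:
            raise ValueError(f"unmatched exit for {name!r}")
        _, start, child_time = self.stack.pop()
        elapsed = now - start
        # Recursion is not handled: a recursive event would be double counted.
        self.inclusive[name] += elapsed
        self.exclusive[name] += elapsed - child_time
        self.calls[name] += 1
        if self.stack:
            # Charge this event's inclusive time to its parent.
            self.stack[-1][2] += elapsed


# Made-up timeline: MPIrecv is called once under heat and once under stress.
prof = FlatProfiler()
prof.enter("main", 0.0)
prof.enter("heat", 1.0)
prof.enter("MPIrecv", 2.0)
prof.exit("MPIrecv", 5.0)
prof.exit("heat", 6.0)
prof.enter("stress", 6.0)
prof.enter("MPIrecv", 7.0)
prof.exit("MPIrecv", 8.0)
prof.exit("stress", 10.0)
prof.exit("main", 11.0)

for name in prof.inclusive:
    print(name, prof.calls[name], prof.inclusive[name], prof.exclusive[name])
```

In this run MPIrecv ends up with a single profile entry covering both calls, so nothing in the flat profile says how much of its time was incurred under heat() versus stress(). That is the one-level limitation the slide points out.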
Approach (Callpath Profile)
- Show event calling (nesting) relationships
- Create a profile object for each event calling context
  - Each profile object has a name that encodes the callpath
  - Static profile object
    - callpath has a single instance (single name)
  - Dynamic profile object
    - callpath can have multiple instances (created dynamically)
- Reuse event mechanisms
  - Interrogate the event stack to form event names
    - "main => f1 => f2 => MPI_Send"
  - Inclusive and exclusive performance statistics
- Callpath length and callgraph depth options

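Forming a callpath event name from the event stack, with an optional length limit, might look like the following Python sketch. The function `callpath_name` and its `depth` parameter are hypothetical names chosen for illustration; only the "main => f1 => f2 => MPI_Send" naming pattern comes from the slide.

```python
def callpath_name(event_stack, depth=None):
    """Join the innermost `depth` events of the stack into a callpath name.

    `depth=None` keeps the whole path; `depth=2` keeps parent and child only.
    """
    if depth is not None and depth < 1:
        raise ValueError("depth must be at least 1")
    path = event_stack if depth is None else event_stack[-depth:]
    return " => ".join(path)


stack = ["main", "f1", "f2", "MPI_Send"]
full = callpath_name(stack)          # full callpath
two_level = callpath_name(stack, 2)  # 2-level (parent => child) callpath
print(full)
print(two_level)
```

Because the context is encoded in the name itself, every distinct path to MPI_Send becomes a separate event, which is why callpath profiling can require many events for simple associations.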
Approach (Phase Profile)
- A phase is an execution abstraction
- Two questions
  - How to inform the measurement systems about phases?
  - How to collect the performance data?
- Create a phase object when new phase is created
  - Each phase object has a name
  - Static and dynamic phase objects
- Phase relationships
  - Phases may be nested (cannot overlap)
  - "Active" phase object follows scoping rules
  - Default (top-level) phase is outermost event (e.g., main)

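The phase relationships listed above (nesting allowed, overlap not allowed, a default top-level phase) behave like a stack discipline. A small Python sketch under that assumption follows; `PhaseStack` and its methods are hypothetical and are not part of the TAU API.

```python
class PhaseStack:
    """Toy model of active phases: nested phases only, no overlap."""

    def __init__(self, default="main"):
        # The default (top-level) phase is always at the bottom.
        self.active = [default]

    def start(self, name):
        self.active.append(name)

    def stop(self, name):
        if len(self.active) == 1:
            raise RuntimeError("the default phase cannot be stopped")
        if self.active[-1] != name:
            # Stopping an outer phase first would make two phases overlap.
            raise RuntimeError(
                f"cannot stop {name!r} while {self.active[-1]!r} is active"
            )
        self.active.pop()

    def current(self):
        return self.active[-1]


phases = PhaseStack()
phases.start("iterate")
phases.start("heat phase")
phases.stop("heat phase")        # properly nested: allowed
phases.start("stress phase")
try:
    phases.stop("iterate")       # would overlap "stress phase"
    overlap_rejected = False
except RuntimeError:
    overlap_rejected = True
phases.stop("stress phase")
print(phases.current(), overlap_rejected)
```

The innermost entry plays the role of the "active" phase: it changes only at phase start and stop, following the same scoping as nested blocks.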
Approach (Phase Profile - API)
- Phase creation
    TAU_PHASE_CREATE_STATIC(var, name, type, group)
    TAU_PHASE_CREATE_DYNAMIC(var, name, type, group)
    TAU_GLOBAL_PHASE_EXTERNAL(var)
  - Global phases have global scope (accessible anywhere)
  - External declarations for defined phases outside file scope
- Phase control
    TAU_PHASE_START(var)
    TAU_PHASE_STOP(var)
    TAU_GLOBAL_PHASE_START(var)
    TAU_GLOBAL_PHASE_STOP(var)
- Collects a callgraph profile (depth 2) PER PHASE!
- Phases default to standard events (when disabled)

Approach (Phase Profile - Data Collection)
- Leverages performance mapping and callpath profiling
- Phase entry
  - Phase object pushed to measurement (event) callstack
- Phase / event entry
  - Need to determine (event, phase) tuple
    - traverse callstack to find enclosing phase
    - construct key for (event, phase) tuple
  - Maintain global map
    - new keys for new (event, phase) tuples put into global map
      - create new profile object for every (event, phase) tuple
    - search global map to determine if tuple occurred before
- Use mapping support to store performance data on exit

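The data-collection steps above can be modeled in Python as follows. This is a sketch of the idea under simplifying assumptions, not TAU's code: `PhaseProfiler`, its methods, and the timestamps are hypothetical, only inclusive time is kept, and a plain dictionary stands in for the global map. The phase and routine names follow the multi-physics example.

```python
class PhaseProfiler:
    """Toy model: inclusive time per (event, phase) tuple (hypothetical)."""

    def __init__(self, default_phase="main"):
        # Global map: (event, phase) key -> accumulated inclusive time.
        self.profiles = {}
        # Callstack entries are (name, is_phase, start_time, key).
        # The default phase only provides context, so it has no key.
        self.callstack = [(default_phase, True, 0.0, None)]

    def enclosing_phase(self):
        # Traverse the callstack from the top to find the nearest phase.
        for name, is_phase, _start, _key in reversed(self.callstack):
            if is_phase:
                return name
        raise RuntimeError("no phase on the callstack")

    def enter(self, name, now, is_phase=False):
        key = (name, self.enclosing_phase())
        # Create a profile entry the first time this tuple is seen.
        self.profiles.setdefault(key, 0.0)
        self.callstack.append((name, is_phase, now, key))

    def exit(self, name, now):
        if len(self.callstack) == 1:
            raise RuntimeError("the default phase cannot be exited")
        if self.callstack[-1][0] != name:
            raise RuntimeError(f"unmatched exit for {name!r}")
        _name, _is_phase, start, key = self.callstack.pop()
        # Store the measurement under the key computed at entry.
        self.profiles[key] += now - start


# Made-up timeline for one iteration of the multi-physics example.
prof = PhaseProfiler()
prof.enter("iterate", 0.0, is_phase=True)
prof.enter("heat phase", 0.0, is_phase=True)
prof.enter("heat", 0.0)
prof.enter("MPIrecv", 1.0)
prof.exit("MPIrecv", 4.0)
prof.exit("heat", 5.0)
prof.exit("heat phase", 5.0)
prof.enter("stress phase", 5.0, is_phase=True)
prof.enter("stress", 5.0)
prof.enter("MPIrecv", 6.0)
prof.exit("MPIrecv", 7.0)
prof.exit("stress", 9.0)
prof.exit("stress phase", 9.0)
prof.exit("iterate", 9.0)

for (event, phase), seconds in prof.profiles.items():
    print(f"{event:13s} in {phase:13s} {seconds:4.1f}")
```

MPIrecv keeps a single event name, yet its time is split between ("MPIrecv", "heat phase") and ("MPIrecv", "stress phase"). The context comes from the enclosing phase, not from an event name that encodes the callpath.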
Multi-Physics Example Instrumentation
[Figure: instrumentation of the multi-physics example. Phases: iterate phase, heat phase, stress phase. Events: heat(), MPIrecv(), stress(), MPIsend(), other routines. Annotation: "only two events!"]

Implementation
- Parallel profiling in the TAU performance system
  - Flat profiling
  - Callpath and callgraph (2-level callpath) profiling
  - Phase profiling
- Multiple performance metrics
  - Execution time
  - Hardware performance counters (using PAPI)
- Scalable to tens of thousands of processors
- Profile analysis and data management tools
  - ParaProf parallel profile analyzer / visualizer
  - PerfDMF parallel profile database

Application – NAS Parallel Benchmarks
- Phase profiling can provide more refined profile results
  - Specific to phase localities
- Defining phases is an application-specific issue
  - Apply understanding of computational models
- Unfortunately, we were not the application developers
  - How to decide on phases and phase instrumentation?
  - Informed by application documentation and code
- Look at NAS parallel benchmark application suite
  - Identify benchmarks with phase behavior
  - SP, BT, LU (simulated CFD codes) and CG
  - Focus on BT

NAS BT – Phase Analysis
- Emulates a CFD application
  - System of linear equations
  - Implicit finite-difference discretization of Navier-Stokes
  - Solve three sets of uncoupled systems of equations
    - in X, Y, Z directions
  - Block tridiagonal with 5x5 blocks
  - Square number of processors
- Phase analysis
  - Highlight performance for each solution direction
  - Identified in code by three main functions
    - x_solve, y_solve, z_solve
  - Static phases

NAS BT – Instrumentation
      call TAU_PHASE_CREATE_STATIC(xsolvephase, 'x_solve phase')
      call TAU_PHASE_START(xsolvephase)
      call x_solve
      call TAU_PHASE_STOP(xsolvephase)

      call TAU_PHASE_CREATE_STATIC(ysolvephase, 'y_solve phase')
      call TAU_PHASE_START(ysolvephase)
      call y_solve
      call TAU_PHASE_STOP(ysolvephase)

      call TAU_PHASE_CREATE_STATIC(zsolvephase, 'z_solve phase')
      call TAU_PHASE_START(zsolvephase)
      call z_solve
      call TAU_PHASE_STOP(zsolvephase)

NAS BT – Flat Profile
[Figure: flat profile of NAS BT. Annotations: "Application routine names reflect phase semantics"; "How is MPI_Wait() distributed relative to solver direction?"]

NAS BT – Phase Profile (Main and X, Y, Z)
[Figure: phase profiles for the main phase and the X, Y, Z phases. Annotation: "Main phase shows nested phases and immediate events"]

Application – MFIX
- Multiphase Flow with Interphase eXchanges (MFIX)
  - National Energy Technology Laboratory (NETL)
  - Study physical/chemistry properties in fluid-solid systems
    - hydrodynamics, heat transfer, chemical reactions
  - Characteristic of large-scale iterative simulations
    - major loop executed as simulation advances in time
- Testcase
  - Models ozone decomposition in a bubbling fluidized bed
  - Demonstrate dynamic phases
[Figures: flat profile; iterate phase profile]

MFIX – Phase Instrumentation (ITERATE)
      SUBROUTINE ITERATE(IER, NIT)
      character(11) taucharary
      integer tauiteration / 0 /
      integer profiler(2) / 0, 0 /
      save profiler, tauiteration

!     Build a distinct phase name for each call from the iteration counter
      write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
      tauiteration = tauiteration + 1
!     A dynamic phase is created per call, so each iteration is profiled separately
      call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
      call TAU_PHASE_START(profiler)
      ! WORK
      call TAU_PHASE_STOP(profiler)
      END SUBROUTINE ITERATE

MFIX – Phase Profile (MPI_Waitall)
[Figure: MPI_Waitall time per dynamic phase, one phase per iteration]
- In the 51st iteration, time spent in MPI_Waitall was 85.81 secs
- Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations

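To put the two reported numbers in proportion, the short Python calculation below derives the mean MPI_Waitall time per iteration and compares the 51st iteration against it. Only the three input values come from the slide; the variable names and the comparison itself are added here for illustration.

```python
# Values reported on the slide.
total_waitall_secs = 4137.9   # MPI_Waitall time across all iterations
iterations = 92
iteration_51_secs = 85.81     # MPI_Waitall time in the 51st iteration

mean_secs = total_waitall_secs / iterations
ratio = iteration_51_secs / mean_secs

print(f"mean per iteration: {mean_secs:.2f} secs")
print(f"iteration 51 vs mean: {ratio:.2f}x")
```

The mean works out to roughly 45 secs per iteration, so the 51st iteration spends about 1.9 times the average in MPI_Waitall. An aggregate flat profile would hide this kind of per-iteration variation.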
MFIX Iterate Phase Behavior

Concluding Discussion and Future Work
- Phase-based profiling can help to bridge semantic gap
  - Between computational models and performance measurements
  - Application-specific performance analysis
- Implemented phase profiling in TAU
- Demonstrated phase profiling
  - NAS BT benchmark and MFIX application
  - Also used in S3D, Uintah, Flash on large-scale platforms
- Requires application-specific knowledge
- Might be possible to link to auto phase identification
  - Based on memory tracing or application state change
- Can this idea be extended to global parallel phases?
- Working on better ways to present phase performance

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASCI Level 1 sub-contract
  - ASC/NNSA Level 3 contract
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
  - Programming Environment and Training (PET)
- NSF
- Research Centre Juelich
- Los Alamos National Laboratory
www.cs.uoregon.edu/research/paracomp/tau