
Hardware performance/soft counter measurements hands-on
VI-HPS Team
SC’13: Hands-on Practical Hybrid Parallel Application Performance Engineering


Advanced Measurement Configuration: Metrics

• If Score-P has been built with performance metric support, it is capable of recording performance counter information
• Requested counters will be recorded with every enter/exit event
• Supported metric sources
  – PAPI
  – Resource usage statistics

Note: Additional memory is needed to store metric values. Therefore, you may have to adjust SCOREP_TOTAL_MEMORY, for example as reported using “scorep-score -c”.


Advanced Measurement Configuration: Metrics

• Recording hardware counters via PAPI

  % export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_FP_INS
  % OMP_NUM_THREADS=4 mpiexec -n 4 ./bt-mz_W.4
   NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark
   [... More application output ...]

• Also possible to record them only per rank

  % export SCOREP_METRIC_PAPI_PER_PROCESS=PAPI_L3_DCM
  % OMP_NUM_THREADS=4 mpiexec -n 4 ./bt-mz_W.4
   NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark
   [... More application output ...]


Advanced Measurement Configuration: Metrics

• Available PAPI metrics
  – Preset events: common set of events deemed relevant and useful for application performance tuning
    • Abstraction from specific hardware performance counters, mapping onto available events done by PAPI internally

    % papi_avail

  – Native events: set of all events that are available on the CPU (platform dependent)

    % papi_native_avail

Note: Due to hardware restrictions
  - the number of concurrently measured events is limited
  - there may be unsupported combinations of concurrent events
  - use the papi_event_chooser tool to test event combinations


Advanced Measurement Configuration: Metrics

• Recording operating system resource usage

  % export SCOREP_METRIC_RUSAGE=ru_stime
  % OMP_NUM_THREADS=4 mpiexec -n 4 ./bt-mz_W.4
   NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark
   [... More application output ...]

• Also possible to record them only per rank

  % export SCOREP_METRIC_RUSAGE_PER_PROCESS=ru_maxrss
  % OMP_NUM_THREADS=4 mpiexec -n 4 ./bt-mz_W.4
   NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark
   [... More application output ...]
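
To make the two resource metrics named above more concrete, the following C sketch reads ru_stime and ru_maxrss directly through getrusage. It is only an illustration of what these fields contain: the helper names (tv_seconds, system_time, peak_rss, print_scopes) are hypothetical, the per-thread query assumes Linux with glibc (RUSAGE_THREAD is not portable), and nothing here is meant to describe how Score-P collects the values internally.

```c
#define _GNU_SOURCE          /* exposes RUSAGE_THREAD on Linux/glibc */
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical helper: system CPU time (ru_stime) for the given scope,
 * e.g. RUSAGE_SELF. Returns 0 on success, -1 if getrusage fails. */
int system_time(int who, double *seconds)
{
    struct rusage ru;

    assert(seconds != NULL);
    if (getrusage(who, &ru) != 0)
        return -1;
    *seconds = tv_seconds(ru.ru_stime);
    return 0;
}

/* Hypothetical helper: peak resident set size of the whole process.
 * Linux reports kilobytes; other platforms may use different units. */
int peak_rss(long *maxrss)
{
    struct rusage ru;

    assert(maxrss != NULL);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *maxrss = ru.ru_maxrss;
    return 0;
}

/* Print the values for the process and, where available, the calling thread. */
void print_scopes(void)
{
    double seconds;
    long rss;

    if (system_time(RUSAGE_SELF, &seconds) == 0)
        printf("process ru_stime: %.6f s\n", seconds);
#ifdef RUSAGE_THREAD
    if (system_time(RUSAGE_THREAD, &seconds) == 0)
        printf("thread  ru_stime: %.6f s\n", seconds);
#endif
    if (peak_rss(&rss) == 0)
        printf("process ru_maxrss: %ld\n", rss);
}
```

Calling print_scopes() from a program's main function prints one line per scope. In a multi-threaded run the per-thread system time is typically smaller than the per-process value, whereas the peak resident set size describes the process as a whole.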


Advanced Measurement Configuration: Metrics

• Available resource usage metrics

  % man getrusage
   [... Output ...]
   struct rusage {
       struct timeval ru_utime; /* user CPU time used */
       struct timeval ru_stime; /* system CPU time used */
       long   ru_maxrss;        /* maximum resident set size */
       long   ru_ixrss;         /* integral shared memory size */
       long   ru_idrss;         /* integral unshared data size */
       long   ru_isrss;         /* integral unshared stack size */
       long   ru_minflt;        /* page reclaims (soft page faults) */
       long   ru_majflt;        /* page faults (hard page faults) */
       long   ru_nswap;         /* swaps */
       long   ru_inblock;       /* block input operations */
       long   ru_oublock;       /* block output operations */
       long   ru_msgsnd;        /* IPC messages sent */
       long   ru_msgrcv;        /* IPC messages received */
       long   ru_nsignals;      /* signals received */
       long   ru_nvcsw;         /* voluntary context switches */
       long   ru_nivcsw;        /* involuntary context switches */
   };
   [... More output ...]

Note:
  (1) Not all fields are maintained on each platform.
  (2) Check scope of metrics (per process vs. per thread)
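
One way to follow up on note (1) is to dump the counter fields on the target platform and see which ones report anything. The C sketch below does that with getrusage(RUSAGE_SELF). The type rusage_field, the macro RUSAGE_LONG_FIELDS, and the two functions are hypothetical names introduced for illustration. A zero value is only a hint: it may mean that the field is not maintained on this platform, or simply that no such event has occurred yet.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* One named counter taken from struct rusage (hypothetical helper type). */
struct rusage_field {
    const char *name;
    long value;
};

/* Number of long-valued fields in struct rusage (the two timevals excluded). */
#define RUSAGE_LONG_FIELDS 14

/* Fill `out` with the long-valued rusage fields of the calling process.
 * Returns the number of fields written, or -1 if getrusage fails. */
int collect_rusage_fields(struct rusage_field out[RUSAGE_LONG_FIELDS])
{
    struct rusage ru;
    int i;

    assert(out != NULL);
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    const struct rusage_field fields[RUSAGE_LONG_FIELDS] = {
        {"ru_maxrss", ru.ru_maxrss},     {"ru_ixrss", ru.ru_ixrss},
        {"ru_idrss", ru.ru_idrss},       {"ru_isrss", ru.ru_isrss},
        {"ru_minflt", ru.ru_minflt},     {"ru_majflt", ru.ru_majflt},
        {"ru_nswap", ru.ru_nswap},       {"ru_inblock", ru.ru_inblock},
        {"ru_oublock", ru.ru_oublock},   {"ru_msgsnd", ru.ru_msgsnd},
        {"ru_msgrcv", ru.ru_msgrcv},     {"ru_nsignals", ru.ru_nsignals},
        {"ru_nvcsw", ru.ru_nvcsw},       {"ru_nivcsw", ru.ru_nivcsw},
    };
    for (i = 0; i < RUSAGE_LONG_FIELDS; i++)
        out[i] = fields[i];
    return RUSAGE_LONG_FIELDS;
}

/* Print every field and flag those that are currently zero. */
void print_rusage_fields(void)
{
    struct rusage_field fields[RUSAGE_LONG_FIELDS];
    int n = collect_rusage_fields(fields);
    int i;

    for (i = 0; i < n; i++)
        printf("%-12s %12ld%s\n", fields[i].name, fields[i].value,
               fields[i].value == 0 ? "  (zero: possibly not maintained)" : "");
}
```

Running print_rusage_fields() once on the machine where measurements are planned gives a quick impression of which names are worth requesting. The getrusage manual page of that platform remains the authoritative list.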