Model-based Automatic Parallel Performance Diagnosis

Allen D. Malony, Li Li
{malony, lili}@cs.uoregon.edu
Performance Research Laboratory
Department of Computer and Information Science
University of Oregon
Outline
- Parallel performance diagnosis process
- Model-based performance diagnosis approach
- Generation of performance diagnosis knowledge
  - Computational model representation
  - Performance modeling
  - Model-specific performance metric definition
  - Modeling inference steps
- Hercule and performance diagnosis validation system
- Experiments
- Conclusion and future directions
Parallel Performance Diagnosis
- Performance tuning process
  - Process to find and report performance problems
  - Performance diagnosis: detect and explain problems
  - Performance optimization: performance problem repair
- Experts approach it systematically and use experience
  - Hard to formulate and automate expertise
  - Performance optimization is fundamentally hard
- Focus on the performance diagnosis problem
  - Characterize diagnosis processes
  - How it integrates with performance experimentation
  - Understand the knowledge engineering
Parallel Performance Diagnosis Process
[process diagram]
Obstacles to Performance Diagnosis Systems
- Lack of theoretical justification
  - No formal way to describe or compare how expert programmers solve their performance diagnosis problems
  - Lack of a theory of what methods work and why
  - No accepted approach to understanding diagnosis system features and fitting them to particular needs
- Conflict between automation and adaptability
  - Automated diagnosis is difficult to generalize
  - Systems must meet specific requirements to be used
  - Highly automated systems are hard to change
  - Highly adaptive systems are harder to automate
Separation of Process Concerns
- Separate observation tools from diagnosis methods
- Performance diagnosis method
  - Policies used to make (guide) decisions about:
    - what performance data to collect
    - which performance data features to judge significant
    - which hypotheses to pursue
    - how to prioritize and explore candidate hypotheses
- Performance diagnosis system
  - Supports automation of a diagnosis method
  - Supports automation in a performance tool environment
Performance Diagnosis System Architecture
[architecture diagram]
Poirot Project (B. Helm)
- Define a theory of performance diagnosis processes
  - Compare and analyze performance diagnosis systems
  - Use the theory to create a system that is automated / adaptable
- Poirot performance diagnosis (theory, architecture)
  - Heuristic classification: match to characteristics
  - Heuristic search: look up solutions based on problem knowledge
- Problems
  - Assumes pre-enumeration of the hypothesis space
  - Lacks case specificity and sensitivity
Problems in Existing Diagnosis Approaches
- Descriptive performance feedback at low levels
  - Performance data without high-level context
  - Unable to relate back to "first principles" (parallel models of structure and behavior)
- Separated from program semantics
  - Hard to interpret without the context of program semantics
  - Performance behavior not tied to operational parallelism
- Insufficient explanation power
  - Novices lack a diagnostic strategy to interpret bugs
- Lack of automation in the iterative diagnosis process
  - Performance experiment design and execution
  - Analysis and diagnostic evaluation of performance data
Model-Based Approach
- Knowledge-based performance diagnosis
  - Capture knowledge about performance problems
  - Capture knowledge about how to detect and explain them
- Where does the knowledge come from?
  - Extract it from parallel computational models (structural and operational characteristics)
  - Associate computational models with performance models
- Do parallel computational models help in diagnosis?
  - Enable better understanding of problems
  - Enable more specific experimentation
  - Enable more effective hypothesis testing and search
Implications
- Models benefit performance diagnosis
  - Base instrumentation on program semantics
  - Capture performance-critical features based on the model
  - Enable explanations close to the user's understanding
    - of computation operation
    - of performance behavior
  - Reuse performance analysis expertise
    - on the commonly used models
    - on case variants of specific models
- Model examples
  - Master-worker
  - Pipeline
  - Divide-and-conquer
  - Domain decomposition
  - Phase-based
  - Compositional
Generic Performance Diagnosis Process
- Design and run performance experiments
  - Observe performance under a specific circumstance
  - Generate the desired performance evaluation data
- Find symptoms
  - Observations deviating from performance expectations
  - Detect by evaluating performance metrics
- Infer causes from symptoms
  - Interpret symptoms at different levels of abstraction
- Iterate the process to refine the performance bug search
  - Refine the performance hypothesis based on symptoms found
  - Generate performance data to validate the hypothesis
Model-based Performance Diagnosis Process
[process diagram]
Approach
- Make use of model knowledge to diagnose performance
  - Start with commonly used computational models
  - Engineer model knowledge
  - Integrate model knowledge with the performance measurement system
  - Build a cause inference system
    - define "causes" at the parallelism level
    - build a causality relation between the low-level "effects" and the "causes"
Three Levels of Performance Knowledge
- Model-specific knowledge (generic)
  - Fundamental performance knowledge needed to diagnose parallel programs using the model
  - Foundation from which to derive a performance diagnosis process tailored to realistic program implementations
- Algorithm-specific knowledge
  - Performance aspects of the model-based algorithm
- Implementation-specific knowledge
  - Captures software / hardware idiosyncrasies
Four Types of Knowledge Extraction
- Representing computational models
- Modeling performance based on the representations
- Defining performance metrics and evaluation rules
- Modeling inference steps
  - from low-level performance measurements
  - to high-level program design factors
Model-based Performance Knowledge Generation
[diagram: four knowledge-generation layers -- computational modeling (abstract events), performance modeling, performance metrics, and inference modeling -- each extended from generic model knowledge to algorithm-specific and then instantiated as implementation-specific knowledge, feeding metric-driven performance bug search and cause inference]
Master-Worker (M-W) Computation Model
[model diagram]
Computational Model Representation
- Abstract events
  - Describe the behavioral model and model performance characteristics
  - Abstract sequences of primitive events on different processing units
  - Focus on coordination among the processing units and the performance properties of the behavior
- Behavioral model description
  - Adapted from EBBA (Event-Based Behavioral Abstraction)
  - Expression: a regular-expression specification of constituent events
  - Constituent event descriptors: describe constituent event format
  - Associated abstract events: list related abstract event types
  - Constraining clauses: indicate what values an event instance must possess to fit the constituent event or associated abstract event
  - Performance attributes: present performance properties distinct to the behavioral model
M-W Model Representation (Abstract Events)

AbstractEvent TaskLifeCycle(id, pid) {
  Expression (over constituent events; ∆ = concurrent, o = sequence):
    (WorkerSendReq MasterRecv) ((MasterSetupTask MasterSendTask) WorkerRecv) WorkerCompute
  Constituent event descriptor:
    (EventComponent <name> <pid> <entering_time> <execution_time> [source] [dest])
    (<...> required, [...] optional)
  Associated abstract events:
    TaskLifeCycle preTask, nextTask
  Constraints (define conditions for abstract events; used later for performance metric evaluation and inference):
    WorkerSendReq.name == "MPI_Send";
    MasterRecv.name == "MPI_Recv";
    MasterRecv.source == WorkerSendReq.pid;
    MasterSetupTask.name == "setup";
    MasterSendTask.name == "MPI_Send";
    WorkerRecv.name == "MPI_Recv";
    WorkerRecv.pid == WorkerSendReq.pid;
    ......
  Performance attributes:
    IsWorkerSendLate := true if WorkerSendReq.entering_time > MasterRecv.entering_time; false otherwise
    IsMasterSendLate := true if MasterSendTask.entering_time > WorkerRecv.entering_time; false otherwise
    WorkerWaitingTimeForTheTask :=
      MasterSetupTask.execution_time, if IsWorkerSendLate;
      MasterRecv.entering_time - WorkerSendReq.entering_time + MasterSetupTask.execution_time, otherwise
    ......
}
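The description above is the model-level specification. As an editorial illustration only, one way such a recognized TaskLifeCycle instance and its WorkerWaitingTimeForTheTask attribute could be encoded in CLIPS is sketched below; the template name, slot names, and fact layout are assumptions, not Hercule's actual encoding, and only the attribute logic follows the slide.

    ; Hypothetical CLIPS encoding of a recognized abstract-event instance.
    (deftemplate task-life-cycle
       (slot id) (slot pid)
       (slot req-send-time)      ; WorkerSendReq.entering_time
       (slot master-recv-time)   ; MasterRecv.entering_time
       (slot setup-time))        ; MasterSetupTask.execution_time

    (defrule derive-worker-waiting-time
       "Derive WorkerWaitingTimeForTheTask from a recognized abstract event."
       (task-life-cycle (id ?id) (pid ?pid)
                        (req-send-time ?ts) (master-recv-time ?tr)
                        (setup-time ?setup))
       =>
       (if (> ?ts ?tr)
          then (assert (worker-waiting-time ?id ?pid ?setup))                  ; worker sent its request late
          else (assert (worker-waiting-time ?id ?pid (+ (- ?tr ?ts) ?setup))))) ; worker waited for recv + setup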
Performance Modeling
- Identify performance overhead categories
  - Based on behavioral models represented by abstract events
- Performance modeling approaches
  - Breadth decomposition (BD): decompose performance cost according to computational components
  - Concurrency coupling (CC): formulate performance coupling among interacting processes
  - Parallelism overhead formulation (POF): formulate parallelism-related management overhead (e.g., task scheduling, workload migration)
- M-W performance model example
  - BD => t_worker = t_init + t_comp + t_comm + t_wait + t_final
  - BD => t_master = t_init + t_setup + t_comm + t_idle + t_final
  - CC => t_wait = t_seq + t_w-setup + t_w-bn + t_w-final
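For readability, the M-W decomposition can be restated in LaTeX. The formulas are those on the slide; the glosses of t_seq, t_w-bn, and t_w-final are editorial interpretations based on the wait-time breakdown used in the inference tree later (master_seq%, wait_bn%, wait_final%), not wording from the source.

    % M-W performance model restated; symbol glosses are interpretations.
    \begin{align*}
    t_{\text{worker}} &= t_{\text{init}} + t_{\text{comp}} + t_{\text{comm}} + t_{\text{wait}} + t_{\text{final}} \\
    t_{\text{master}} &= t_{\text{init}} + t_{\text{setup}} + t_{\text{comm}} + t_{\text{idle}} + t_{\text{final}} \\
    t_{\text{wait}}   &= t_{\text{seq}} + t_{\text{w-setup}} + t_{\text{w-bn}} + t_{\text{w-final}}
    \end{align*}
    % Plausibly: t_seq is worker waiting caused by the master's sequential
    % initialization/finalization, t_w-setup is waiting while the master sets
    % up tasks, t_w-bn is waiting in a master bottleneck behind other workers'
    % requests, and t_w-final is waiting during finalization.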
Performance Metric Definition
- Evaluate the performance overhead categories identified
  - Traditional performance metrics, with no relation to program semantics, have little explanation power
  - Model-specific performance metrics
    - Reflect characteristics of the computation and process coordination model
    - Defined with reference to program semantics
    - Close to users' understanding of their programs
- M-W examples
  - Worker efficiency
  - Worker waiting time for the master setting up tasks:
    t_w-setup := Σ_{i=1..M} t_setup^i,
    where M is the number of tasks assigned to the worker and t_setup^i is the time the master spends setting up the i-th task for the worker
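The worker-efficiency formula did not survive extraction. A minimal sketch of a definition consistent with the breadth decomposition above, assuming efficiency means the useful-computation fraction of a worker's running time, is:

    % Hedged reconstruction -- an assumed definition, not the slide's formula.
    \text{worker efficiency} \approx \frac{t_{\text{comp}}}{t_{\text{worker}}}
      = \frac{t_{\text{comp}}}{t_{\text{init}} + t_{\text{comp}} + t_{\text{comm}} + t_{\text{wait}} + t_{\text{final}}}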
Modeling Inference Steps
- Identify high-level performance factors
  - Program design parameters (e.g., the number of workers in the M-W model)
  - Form candidate causes for explaining performance symptoms
  - Direct the programmer to poor design decisions
- Bottom-up inference process
  - Start by evaluating a metric at a low level of program abstraction
  - Refine performance modeling to look at model-specific metrics
  - Stop when the symptom can be interpreted by a high-level factor
- Use an inference tree to represent the process
  - Root: the symptom to be diagnosed
  - Branch nodes: intermediate observations obtained so far
  - Leaf nodes: an explanation of the root symptom in terms of high-level performance factors
  - Incorporate algorithm-specific inference knowledge by adding branches at the appropriate tree levels
Performance Diagnosis Inference Tree (M-W)

[inference tree diagram; legend: symptoms, intermediate observations, causes, inference steps]

The root symptom is low efficiency. Intermediate observations refine it through metrics such as init.%, communication% (frequency, volume), wait_time%, final.%, master_seq%, wait_setup%, wait_bn%, and wait_final%. Branch conditions compare the master idle time t_idle against a tolerance for the severity of master idleness, and the number of queued worker requests N_req against a tolerance for the severity of master bottleneck.

Causes (leaf explanations):
- c1: Sequential initialization and finalization on the master account for lost cycles on the worker.
- c2: Master task-setup overhead is significant, which means the master processes requests slowly.
- c3: During the execution the master was rarely idle. The worker spent considerable time waiting in master bottlenecks, but only a few worker requests got stuck in them. This indicates that the master processes requests slowly; in other words, task setup cost is expensive relative to task computation cost.
- c4: During this execution the master was rarely idle, which implies there is little room for rescheduling to improve performance. The worker spent considerable time waiting in master bottlenecks and a number of worker requests got stuck at the master, which implies that more workers are used than needed, given the master's processing speed and the input problem size.
- c5: Master idle time is significant, so there is room for improving performance by rescheduling. Adjust the task assignment order to keep the master busy while avoiding bottlenecks.
- c6: Time imbalance is significant. Try to balance the last-task finish times of all workers.
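As an illustration of how one such branch might be encoded for the inference engine, here is a hypothetical CLIPS rule for cause c2 (significant task-setup waiting). The fact layout, the metric name wait_setup_pct, and the threshold fact are assumptions, not rules taken from Hercule.

    (defrule diagnose-expensive-master-setup
       "Hypothetical encoding of branch c2: significant task-setup waiting."
       (symptom low-efficiency ?worker)
       (metric wait_setup_pct ?worker ?v)
       (threshold wait_setup ?limit)
       (test (> ?v ?limit))
       =>
       (assert (cause c2 ?worker
          "Master task-setup overhead is significant; the master processes requests slowly.")))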
CLIPS Performance Knowledge Representation
- Encode the inference tree with CLIPS
  - Open-source tool for building expert systems
    - extensible and low-cost
    - used by a wide range of applications in industry and academia
- CLIPS production rule syntax:

    (defrule <rule-name> <condition-element>* => <action>*)

  where
    condition-element ::= <pattern-ce> | <assigned-pattern-ce> | <not-ce> | <and-ce> | <or-ce> |
                          <logical-ce> | <test-ce> | <exists-ce> | <forall-ce>
    action ::= <constant> | <variable> | <function-call>
Master-Worker Model in CLIPS
- An example M-W rule to distinguish the master from the workers:

    (defrule identify_master_workers
       ; a process that communicates with two others that never
       ; communicate with each other is taken to be the master
       (communicate ?p ?p1)
       (communicate ?p ?p2)
       (not (communicate ?p1 ?p2))
       =>
       (assert (master ?p))
       (assert (worker ?p1))
       (assert (worker ?p2)))
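A possible session with assumed facts (the process names p0, p1, p2 are made up for illustration): if p0 exchanges messages with p1 and p2, which never communicate with each other, the rule infers that p0 is the master.

    (deffacts observed-communication
       (communicate p0 p1)
       (communicate p0 p2))

    ; CLIPS> (reset)
    ; CLIPS> (run)
    ; After running, the fact list additionally contains:
    ;   (master p0)  (worker p1)  (worker p2)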
CLIPS Inference Engine
- Repeatedly fires rules against original and derived assertions
- Invokes performance knowledge only when needed by the current hypothesis validation
- Supports automatic performance reasoning and search
  - Breadth-first
  - Stops when no new facts can be generated
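A small sketch of this forward chaining (the rule, fact, and metric names are invented for illustration): the first rule derives a new fact, which enables the second rule, and (set-strategy breadth) selects CLIPS's breadth-first conflict-resolution strategy.

    (set-strategy breadth)

    (defrule observe-symptom
       (metric efficiency ?p ?e)
       (test (< ?e 0.5))                        ; illustrative threshold
       =>
       (assert (symptom low-efficiency ?p)))    ; derived assertion ...

    (defrule refine-hypothesis
       (symptom low-efficiency ?p)              ; ... matched by the next rule
       =>
       (assert (hypothesis communication-overhead ?p)))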
Hercule Parallel Performance Diagnosis (Li Li)
- Goals: automation, adaptability, and validation
[system overview diagram]
Hercule Prototype
- Knowledge base
  - Abstract event descriptions
  - Performance metric set and evaluation rules
  - High-level performance factors
  - Adaptation to algorithm variants of computational models
- Event recognizer
  - Fits event instances from the event trace into user-defined, algorithm-specific abstract event descriptions
- Metric evaluator
  - Calculates performance attributes associated with abstract events
  - Computes algorithm-specific metrics w.r.t. user-defined evaluation rules
  - Feeds the metrics into the inference engine to fuel performance bug search
- Inference rules
  - Extend the inference tree with new branches
  - Translate the branches to CLIPS rules to be fired by the inference engine
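To make the event recognizer and metric evaluator roles concrete, here is a hypothetical CLIPS-style sketch (the trace-event template, its slot names, and the abstract-event fact layout are assumptions; the slides do not show Hercule's actual recognizer): two low-level trace records satisfying the TaskLifeCycle constraints are fused into one abstract-event fact that a metric rule can then consume.

    (deftemplate trace-event
       (slot name) (slot pid) (slot entering-time) (slot source (default none)))

    (defrule recognize-task-request
       ; worker ?w sends a request; some process ?m receives it from ?w
       (trace-event (name "MPI_Send") (pid ?w) (entering-time ?ts))
       (trace-event (name "MPI_Recv") (pid ?m) (source ?w) (entering-time ?tr))
       =>
       ; hand the matched pair to the metric evaluator as one abstract event
       (assert (abstract-event task-request ?w ?m ?ts ?tr)))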
Performance Diagnosis Validation System
- "Black box" validation
  - Inject known (model- or algorithm-level) performance problems into a parallel program
  - Hercule is informed of the computational model but unaware of the injected performance problems
  - Run experiments as specified by Hercule
  - Hercule diagnoses using the experiment performance data
  - Compare Hercule's diagnosis result against the injected fault
Master-Worker Experiment
- Diagnose a Master-Worker program
  - A synthetic parallel MPI benchmark using the M-W model
  - Test platform: distributed-memory Linux cluster, 16 dual-processor nodes and a gigabit Ethernet switch
  - Performance fault injection: expensive task setup at the master
- Real-world M-W experiment
  - ASCI SPhot benchmark
M-W Execution Trace
[trace visualization]
Diagnosis Results Output (M-W)
[annotated output: automation, experiments, factor analysis, cause inference, specialized explanations]
2D Pipeline Experiment
- Diagnose the Sweep3D ASCI benchmark
  - Uses the wavefront (2D pipeline) model for parallelization
    - A process communicates only with its neighbors
    - The wavefront propagates across the 2D process mesh
    - Eight sweeps with pipeline direction changes
  - Test platform: IBM pSeries 690 SMP cluster with 16 processors
  - Performance fault injection: load imbalance
    - Data distribution assigns more data to one process than to the others
Sweep3D Execution Trace
[trace visualization: sweeps 1-4 and 5-8, highlighting the sweep-direction-change delay]
Diagnosis Results Output (Sweep3D)

dyna6-221:~/PerfDiagnosis lili$ testWF WF.clp
Begin diagnosing wavefront program...
Level 1 experiment -- generate performance profiles with respect to computation and communication.
_______________________________________________
do experiment 1...
_______________________________________________
Among the participating processes, process 7 spent 30.5% of running time on communication. Around 97% of the communication time is due to waiting time at MPI_Recv function calls. Next let's look at what caused the waiting time.
_______________________________________________
Level 2 experiment -- generate performance data with respect to pipeline filling up, emptying, handshaking, and sweep-direction-change.
_______________________________________________
do experiment 2...
_______________________________________________
Process 7 spent 6.01% of communication time in pipeline filling-up, and 9.34% in pipeline emptying. Handshaking delay comprises 20.03% of communication time. Sweep direction change comprises 60.8%.
_______________________________________________
Level 3 experiment -- generate a performance event trace with respect to sweep direction change and handshaking.
_______________________________________________
do experiment 3...
_______________________________________________
In this wavefront program execution, pipeline sweep-direction-change delay is significant in process 7, especially between sweeps 4 and 5. Of the idle time, 89.44% is spent waiting for successive pipeline stages in sweep 4 to finish up, and 10.73% waiting for the pipeline to fill up in sweep 5. In sweep 4, process 7 is at pipeline stage 1; the next sweep head, process 4, is at stage 4. Due to the pipeline working mechanism, process 7 has to wait for process 4 to finish its computation before the next sweep begins. There is a pronounced workload difference among pipeline stages. The load imbalance accounts for 80.77% of the long sweep-direction-change delay in process 7.
Conclusions and Future Directions
- Model-based automatic performance diagnosis approach
  - Systematic approach to extracting performance diagnosis knowledge from computational models
  - Represent and encode the knowledge in a manner that supports diagnosis with minimum user intervention
  - Hercule automatic performance diagnosis system
- Future directions
  - Create knowledge bases for additional models (fork-join, phase-based, ...)
  - Include compositional model knowledge generation
  - Enhance Hercule and the validation system
Support Acknowledgements
- Department of Energy (DOE) Office of Science contracts
- University of Utah ASCI Level 1
- ASC/NNSA Level 3 contract (Lawrence Livermore National Laboratory)
- Department of Defense (DoD) HPC Modernization Office (HPCMO) Programming Environment and Training (PET)
- NSF Software and Tools for High-End Computing
- Research Centre Juelich
- Los Alamos National Laboratory

www.cs.uoregon.edu/research/paracomp/tau