Load Balancing Techniques II CS 320 Spring 2003

Laxmikant Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory, Dept. of Computer Science
University of Illinois at Urbana-Champaign

How to diagnose load imbalance?
• Often hidden in statements such as:
  – MPI_Barrier is too slow
  – MPI_Reduce is too slow
  – Very high synchronization overhead
    • Most processors are waiting at a reduction
• Count the total amount of computation (ops/flops) per processor
  – In each phase!
  – Because the balance may change from phase to phase

Golden Rule of Load Balancing
Fallacy: the objective of load balancing is to minimize the variance in load across processors.
Example: 50,000 tasks of equal size, 500 processors:
  A: All processors get 99 tasks, except the last 5, which get 100 + 99 = 199 each
  OR
  B: All processors get 101 tasks, except the last 5, which get 1 each
Identical variance, but situation A is much worse!
Golden Rule: It is OK if a few processors idle, but avoid having processors that are overloaded with work.
Finish time = max over i of {time on the i’th processor}, excepting data dependence and communication overhead issues.
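
The arithmetic behind the example can be checked with a short Python sketch. It assumes every task costs one time unit, so a processor's finish time equals its task count; the helper name `summarize` is made up for illustration.

```python
from statistics import pvariance

def summarize(loads):
    """Total work, population variance, and finish time (max load)."""
    return {
        "total": sum(loads),
        "variance": pvariance(loads),
        "finish_time": max(loads),  # assumption: one time unit per task
    }

# Scenario A: 495 processors hold 99 tasks, the last 5 hold 199 each.
loads_a = [99] * 495 + [199] * 5
# Scenario B: 495 processors hold 101 tasks, the last 5 hold 1 each.
loads_b = [101] * 495 + [1] * 5

summary_a = summarize(loads_a)
summary_b = summarize(loads_b)
print(summary_a)
print(summary_b)
```

Both scenarios distribute the same 50,000 tasks with the same variance, yet A finishes at 199 time units and B at 101, which is why the maximum load, not the variance, is the quantity to watch.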

Amdahl’s Law and grainsize
• Before we get to load balancing:
• Original “law”:
  – If K% of a program is sequential, then speedup is limited to 100/K
    • Even if the rest of the program is parallelized completely
• Grainsize corollary:
  – If any individual piece of work takes more than K time units, and the sequential program takes Tseq,
    • Speedup is limited to Tseq / K
• So:
  – Examine performance data via histograms to find the sizes of remappable work units
  – If some are too big, change the decomposition method to make smaller units
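
Both bounds are one-line formulas. The Python sketch below evaluates them for made-up numbers (a 5% sequential section; a 60-second sequential run whose largest work unit takes 0.1 seconds). The function names are hypothetical.

```python
def amdahl_limit(sequential_percent):
    """Upper bound on speedup when K percent of the program is sequential."""
    return 100.0 / sequential_percent

def grainsize_limit(t_seq, largest_grain):
    """Upper bound on speedup when the largest indivisible work unit
    takes `largest_grain` time units and the sequential run takes `t_seq`."""
    return t_seq / largest_grain

def largest_allowed_grain(t_seq, processors):
    """Largest grain that still permits a speedup equal to `processors`."""
    return t_seq / processors

print(amdahl_limit(5))                    # illustrative: 5% sequential
print(grainsize_limit(60.0, 0.1))         # illustrative: 60 s run, 0.1 s grain
print(largest_allowed_grain(60.0, 1000))  # illustrative: target of 1000 processors
```

The third helper inverts the corollary: given a target processor count, it returns the size above which a work unit should be split.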

Grainsize Example: Molecular Dynamics
• In the molecular dynamics program NAMD:
  – While trying to scale it to 2000 processors
  – Sequential step time was 57 seconds
  – To run on 2000 processors, no object should take more than about 28 msecs (57 s / 2000)
  – Analysis using Projections showed the following histogram:

Grainsize analysis via Histograms
[Figure: grainsize histogram, with the oversized objects labeled “Problem”]
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms

Grainsize reduced

Grainsize: LeanMD for Blue Gene/L
• BG/L is a planned IBM machine with 128k processors
• Here, we need even more objects:
  – Generalize the hybrid decomposition scheme from 1-away to k-away
    • 2-away: cubes are half the size

[Figure panels: 5,000 vps; 256,000 vps; 76,000 vps]

Load Balancing Strategies
• Classified by when it is done:
  – Initially
  – Dynamic: Periodically
  – Dynamic: Continuously
• Classified by whether decisions are taken with global information:
  – Fully centralized
    • Quite a good choice when the load balancing period is long
  – Fully distributed
    • Each processor knows only about a constant number of neighbors
    • Extreme case: totally local decision (send work to a random destination processor, with some probability)
  – Use aggregated global information, and detailed neighborhood info

Load Balancing: Unrestricted Exchange
• This is an initial OR periodic strategy
• Each processor i reads (or has) Ni particles
• Before doing interesting things with the data, we want to distribute it equally across processors
• It doesn’t matter where each piece of data goes
  – No constraints
• Issues:
  – How to decide who sends data to whom
  – How to minimize communication overhead in the process

Balancing number of data items: contd
• Find the average (avg) using a reduction
  – Each processor now knows whether it is above or below avg
  – Collect this information (load vector) globally
• Then:
  – Sort all donors (Li > avg) by decreasing Li
  – Sort all receivers (Li < avg) by decreasing need: (avg – Li)
  – For each donor: assign the destination for its extra data
    • Using the largest-need receiver first
  – This tends to produce the fewest messages
    • But only as a heuristic
  – Each processor can replicate this calculation!
    • Assuming each received the load vector
    • No need to broadcast results
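
The matching step can be sketched in Python as below. This is a minimal serial sketch, assuming every processor already holds the full load vector; the function name `plan_transfers` and the integer (floor) average are choices made for this illustration.

```python
def plan_transfers(loads):
    """Return (donor, receiver, count) messages that move surplus items
    from above-average processors to below-average ones."""
    avg = sum(loads) // len(loads)  # assumption: floor of the average
    # Donors by decreasing load, receivers by decreasing need.
    donors = sorted(((load - avg, i) for i, load in enumerate(loads) if load > avg),
                    key=lambda t: (-t[0], t[1]))
    receivers = [[avg - load, i] for i, load in enumerate(loads) if load < avg]
    receivers.sort(key=lambda t: (-t[0], t[1]))

    transfers = []
    r = 0  # index of the current (largest remaining need) receiver
    for extra, donor in donors:
        while extra > 0 and r < len(receivers):
            need, dest = receivers[r]
            moved = min(extra, need)
            transfers.append((donor, dest, moved))
            extra -= moved
            receivers[r][0] -= moved
            if receivers[r][0] == 0:
                r += 1  # this receiver is full; move to the next one
    return transfers

loads = [10, 2, 8, 4, 6, 0]
transfers = plan_transfers(loads)
balanced = list(loads)
for donor, dest, count in transfers:
    balanced[donor] -= count
    balanced[dest] += count
print(transfers)
print(balanced)
```

Because the plan is a deterministic function of the load vector, every processor that runs it arrives at the same list of messages, which is what removes the need for a broadcast.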

Balancing using Dimensional Exchange
• Log P phases: exchange info and then data with each neighbor
  – Send a message saying how many items you have
  – Compare your number with your neighbor’s
    • Calculate the average
    • Send the overage to them
  – Load is balanced at the end of the log P phases
    • In each phase, the two halves are perfectly balanced
    • After the first phase, the two planes in the figure are equally loaded
      – No need to return to exchanging data across planes (via the red links)
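
A serial simulation of the phases is sketched below in Python. It assumes P is a power of two and that processor i's neighbor in a phase is i with one bit flipped (a hypercube numbering); the function name is hypothetical. With integer item counts, odd pair totals leave a difference of one item, so the final result can be off by up to log P items.

```python
def dimensional_exchange(loads):
    """Simulate log P pairwise averaging phases on a hypercube of P processors."""
    p = len(loads)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes P is a power of two"
    loads = list(loads)
    bit = 1
    while bit < p:  # one phase per hypercube dimension
        for i in range(p):
            j = i ^ bit  # neighbor across the current dimension
            if i < j:    # handle each pair once
                total = loads[i] + loads[j]
                loads[i] = total // 2
                loads[j] = total - total // 2
        bit <<= 1
    return loads

result = dimensional_exchange([80, 0, 0, 0, 0, 0, 0, 0])
print(result)
```

After the phase for a given dimension, the two halves of the machine separated by that dimension hold equal totals, so later phases never need to move data across it again.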

Dynamic Load Balancing Scenarios:
• Examples representing typical classes of situations
  – Particles distributed over simulation space
    • Dynamic: because particles move
    • Cases:
      – Highly non-uniform distribution (cosmology)
      – Relatively uniform distribution
  – Structured grids, with dynamic refinements/coarsening
  – Unstructured grids, with dynamic refinements/coarsening

Example Case: Particles
• Orthogonal Recursive Bisection (ORB)
  – At each stage: divide particles equally
    • With a specified tolerance
  – The number of processors doesn’t need to be a power of 2:
    • Divide in proportion
      – 2:3 with 5 processors
  – How to choose the dimension along which to cut?
    • Choose the longest one
  – How to draw the line?
    • All data on one processor?
      – Sort along each dimension
    • Otherwise:
      – Run a distributed histogramming algorithm
      – To find the line recursively
  – Find the entire tree, and then do all data movement at once
    • Or do it in a few (two or three) steps
    • But there is no reason to redistribute particles after drawing each line
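
For the case where all data sits on one processor, ORB can be sketched in a few lines of Python. This is an illustrative serial version (the function name `orb` is hypothetical): it sorts along the longest dimension and splits the particles in proportion to the processor counts on each side. The distributed histogramming variant is not shown.

```python
def orb(points, nprocs):
    """Split `points` (tuples of coordinates) into `nprocs` groups by
    orthogonal recursive bisection. Returns a list of nprocs lists."""
    if nprocs == 1:
        return [list(points)]
    if not points:
        return [[] for _ in range(nprocs)]
    ndims = len(points[0])
    # Cut along the dimension with the largest extent.
    extents = [max(p[d] for p in points) - min(p[d] for p in points)
               for d in range(ndims)]
    axis = extents.index(max(extents))
    # Divide processors, then particles in the same proportion (e.g. 2:3 for 5).
    left_procs = nprocs // 2
    right_procs = nprocs - left_procs
    ordered = sorted(points, key=lambda p: p[axis])
    cut = len(ordered) * left_procs // nprocs
    return orb(ordered[:cut], left_procs) + orb(ordered[cut:], right_procs)

grid = [(i % 10, i // 10) for i in range(100)]  # 100 particles on a 10 x 10 grid
parts = orb(grid, 5)
print([len(part) for part in parts])
```

The recursion computes the whole tree before anything moves, matching the advice to find the entire tree first and do the data movement at the end.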


Particles: Oct/Quad Trees
• In ORB, each chunk has a brick shape, with a non-square aspect ratio
  – Oct trees (Quad trees in 2D) lead to cubic boxes
• How to distribute particle data into Oct trees?
  – Assume data is distributed (randomly)
  – Build a small top-level tree across processors
    • 2 or 3 levels deep
  – Send particles to their box
    • Let each box create children if it has more than a threshold number of particles, and send particles to them
    • Continue recursively
• Note that the tree is non-uniform (unlike ORB)
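
The threshold-driven splitting can be sketched as a serial quad tree in Python. This is a 2D illustration only: the dictionary layout, the function names, and the `min_size` guard against coincident points are assumptions of this sketch, and the distributed top-level tree is not modeled.

```python
def build_quadtree(points, x0, y0, size, threshold, min_size=1e-9):
    """Recursively split the square [x0, x0+size) x [y0, y0+size) until
    each leaf holds at most `threshold` points."""
    node = {"origin": (x0, y0), "size": size, "points": list(points), "children": []}
    if len(points) <= threshold or size <= min_size:
        return node  # leaf: few enough particles (or box too small to split)
    half = size / 2.0
    quadrants = {(0, 0): [], (1, 0): [], (0, 1): [], (1, 1): []}
    for x, y in points:
        quadrants[(int(x >= x0 + half), int(y >= y0 + half))].append((x, y))
    node["points"] = []  # particles are sent down to the children
    for (ix, iy), subset in quadrants.items():
        node["children"].append(
            build_quadtree(subset, x0 + ix * half, y0 + iy * half,
                           half, threshold, min_size))
    return node

def leaf_counts(node):
    """Number of particles in each leaf box, in traversal order."""
    if not node["children"]:
        return [len(node["points"])]
    counts = []
    for child in node["children"]:
        counts.extend(leaf_counts(child))
    return counts

points = [(i / 64.0, (i * i % 64) / 64.0) for i in range(64)]
tree = build_quadtree(points, 0.0, 0.0, 1.0, threshold=4)
counts = leaf_counts(tree)
print(counts)
```

Dense regions end up deeper in the tree than sparse ones, which is the non-uniformity the slide contrasts with ORB.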

Particles: Space-filling curves
• Sort all particles using a key that mixes the x, y and z coordinates
  – So particles with similar values for the most significant bits of the X, Y, Z coordinates are clustered together
• Snip this linearized list into equal-size chunks
• This is almost like an Oct-tree,
  – Except nearby boxes have been collected together, for load balance
  – Particles whose first 3k key bits are identical belong to the same oct-tree node at the k’th level
• But:
  – Sorting is relatively expensive to do every time
  – Partitions don’t have a regular shape
  – Because the space-filling curve jumps around, there is no real guarantee of communication minimization
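
One common key of this kind is the Morton (Z-order) key, which interleaves the coordinate bits. The Python sketch below uses it as an assumed example; the slide does not name a specific curve. Coordinates are taken to be non-negative integers that fit in `bits` bits, and both function names are hypothetical.

```python
def morton_key(x, y, z, bits=10):
    """Interleave the top-to-bottom bits of x, y, z into one integer key."""
    key = 0
    for b in range(bits - 1, -1, -1):  # most significant bit first
        key = (key << 3) | (((x >> b) & 1) << 2) | (((y >> b) & 1) << 1) | ((z >> b) & 1)
    return key

def partition_by_curve(particles, nchunks, bits=10):
    """Sort particles along the curve and snip the list into equal chunks."""
    ordered = sorted(particles, key=lambda p: morton_key(p[0], p[1], p[2], bits))
    n = len(ordered)
    return [ordered[i * n // nchunks:(i + 1) * n // nchunks] for i in range(nchunks)]

corners = [(x, y, z) for x in (0, 3) for y in (0, 3) for z in (0, 3)]
chunks = partition_by_curve(corners, 4, bits=2)
print(chunks)
```

Shifting a key right by 3 × (bits − k) leaves its first 3k bits, which identify the particle's oct-tree node at level k.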

Particles: Virtualization
• You can apply virtualization to all the above methods:
  – It becomes a two-level strategy
  – Particles are grouped into a large number of boxes
    • Many more than P
    • Cubes (oct-tree) or bricks (ORB)
  – The “system” maps these boxes to processors
• Advantages:
  – You can use a higher tolerance for imbalance (both oct-tree and ORB) during tree formation
  – Particles can migrate among existing boxes, and load balancing can be done by just moving boxes across processors
    • With a lower load balancing overhead
    • Less frequently, you can re-form the tree, if needed
  – You can also locally split and coarsen it

Structured and Unstructured Grids/Meshes
• Similar considerations apply to these
  – Libraries like Metis partition unstructured meshes
  – ORB and space-filling curves are options for structured grids
• Virtualization:
  – Again, virtualization helps by reducing the cost of load balancing
    • Use any scheme to partition data into a large number of chunks
    • Use a dynamic load balancer to map chunks to processors
  – It can also
    • Decide whether communication costs are significant or not, and
    • Tune itself to communication patterns better

Dynamic Load Balancing using Objects
• Object-based decomposition (i.e. virtualized decomposition) helps
  – Allows the RTS to remap objects to balance load
  – But how does the RTS decide where to map objects?
  – Just move objects away from overloaded processors to underloaded processors
Just??

Measurement Based Load Balancing
• Principle of persistence
  – Object communication patterns and computational loads tend to persist over time
  – In spite of dynamic behavior
    • Abrupt but infrequent changes
    • Slow and small changes
• Runtime instrumentation
  – Measures communication volume and computation time
• Measurement-based load balancers
  – Use the instrumentation database periodically to make new decisions
  – Many alternative strategies can use the database

Periodic Load Balancing Strategies
• Stop the computation?
• Centralized strategies:
  – The Charm RTS collects data (on one processor) about:
    • Computational load of each object, and communication for each pair of objects
  – If you are not using AMPI/Charm, you can do the same instrumentation and data collection
  – Partition the graph of objects across processors
    • Take communication into account
      – Point-to-point, as well as multicast over a subset
      – As you map an object, add to the load on both the sending and the receiving processor
    • The red communication (in the figure) is free, if it is a multicast

Object partitioning strategies
• You can use graph partitioners like METIS, K-R
  – BUT: graphs are smaller, and optimization criteria are different
• Greedy strategies
  – If communication costs are low: use a simple greedy strategy
    • Sort objects by decreasing load
    • Maintain processors in a heap (by assigned load)
    • In each step:
      – Assign the heaviest remaining object (O) to the least loaded processor
  – With small-to-moderate communication cost:
    • Same strategy, but add communication costs as you add an object to a processor
  – Always add a refinement step at the end:
    • Swap work from the most heavily loaded processor to “some other processor”
    • Repeat a few times or until no improvement
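
The simple greedy strategy for the low-communication case can be sketched in Python with the standard `heapq` module. The function name and the dictionary input format are assumptions of this sketch, and the final refinement step is left out.

```python
import heapq

def greedy_assign(object_loads, nprocs):
    """Assign each object to the currently least loaded processor,
    taking objects in order of decreasing load.
    Returns (assignment dict, list of per-processor loads)."""
    heap = [(0, proc) for proc in range(nprocs)]  # (assigned load, processor)
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        proc_load, proc = heapq.heappop(heap)     # least loaded processor
        assignment[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    proc_loads = [0] * nprocs
    for proc_load, proc in heap:
        proc_loads[proc] = proc_load
    return assignment, proc_loads

assignment, proc_loads = greedy_assign({"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}, 2)
print(assignment)
print(proc_loads)
```

Placing the heaviest objects first leaves the small ones to fill in the gaps at the end, which is why the sort order matters more than the heap itself.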

Object partitioning strategies
• When communication cost is significant:
  – Still use the greedy strategy, but:
    • At each assignment step, choose between assigning O to the least loaded processor and to the processor that already has the objects that communicate most with O
      – Based on the degree of difference in the two metrics
      – Two-stage assignments:
        » In early stages, consider communication costs as long as the processors are in the same (broad) load “class”
        » In later stages, decide based on load
• Branch-and-bound
  – Searches for the optimal solution, but can be stopped after a fixed time
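
One possible reading of the communication-aware choice is sketched below in Python. The rule used here, "prefer the processor with the most communication to O if its load exceeds the least loaded processor's by at most `load_slack`", is an assumption made for this sketch, as are the function name, the `comm` dictionary of pairwise volumes, and the `load_slack` parameter.

```python
def comm_aware_greedy(object_loads, comm, nprocs, load_slack):
    """Greedy assignment that weighs load against communication affinity.
    `comm` maps (obj_a, obj_b) pairs to a communication volume."""
    proc_loads = [0] * nprocs
    assignment = {}
    for obj in sorted(object_loads, key=lambda o: -object_loads[o]):
        least = min(range(nprocs), key=lambda p: proc_loads[p])
        # Communication volume between obj and objects already on each processor.
        affinity = [0] * nprocs
        for (a, b), volume in comm.items():
            other = b if a == obj else (a if b == obj else None)
            if other in assignment:
                affinity[assignment[other]] += volume
        friend = max(range(nprocs), key=lambda p: affinity[p])
        # Assumed rule: follow communication only if the load penalty is small.
        if affinity[friend] > 0 and proc_loads[friend] - proc_loads[least] <= load_slack:
            chosen = friend
        else:
            chosen = least
        assignment[obj] = chosen
        proc_loads[chosen] += object_loads[obj]
    return assignment, proc_loads

loads = {"a": 4, "b": 4, "c": 2, "d": 2}
comm = {("a", "c"): 10, ("b", "d"): 10}
placement, final_loads = comm_aware_greedy(loads, comm, nprocs=2, load_slack=2)
print(placement)
print(final_loads)
```

Setting `load_slack` to a negative value disables the affinity branch and reduces this to the plain load-only greedy strategy, which corresponds to deciding purely on load in the later stages.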

Crack Propagation
[Figure: decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle]
As computation progresses, the crack propagates and new elements are added, leading to more complex computations in some chunks.

Load balancer in action
Automatic Load Balancing in Crack Propagation
1. Elements Added
2. Load Balancer Invoked
3. Chunks Migrated

Distributed Load Balancing
• Centralized strategies
  – Still OK for 3000 processors for NAMD
• Distributed balancing is needed when:
  – The number of processors is large and/or
  – Load variation is rapid
• Large machines:
  – Need to handle locality of communication
    • Topology-sensitive placement
  – Need to work with scant global information
    • Approximate or aggregated global information (average/max load)
    • Incomplete global info (only “neighborhood”)
    • Work diffusion strategies (1980’s work by author and others!)
  – Achieving global effects by local action…

Building on Object-based Load Balancing
• Application-induced load imbalances
• Environment-induced performance issues:
  – Dealing with extraneous loads on shared machines
  – Vacating workstations
  – Heterogeneous clusters
  – Shrinking and expanding the set of processors allocated to a job!
• Automatic checkpointing
  – Restart on a different number of processors
• Pre-fetch capability
  – Out-of-core execution
  – Optimizing cache performance

Electronic Structures using CP
• Car-Parrinello method
• Based on pinyMD
  – Glenn Martyna, Mark Tuckerman
• Data structures:
  – A bunch of states (say 128)
  – Represented as
    • 3D arrays of coefficients in g-space, and
    • also 3D arrays in real space
  – Real-space probability density
  – S-matrix: one number for each pair of states
    • For orthonormalization
  – Nuclei
• Computationally:
  – Transformation from g-space to real space
    • Uses multiple parallel 3D FFTs
  – Sums up real-space densities
  – Computes energies from density
  – Computes forces
  – Normalizes g-space wave function