Customized Dynamic Load Balancing for a Network of Workstations
Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy
Computer Science Department, University of Rochester
June 1997
Presenter: Jacqueline Ewell

Static vs. Dynamic Load Balancing
Static Load Balancing:
• allows the programmer to delegate work before runtime
• can accommodate heterogeneous processors and nonuniform loops
• avoids runtime scheduling overheads
• needs to know all information about the workstations ahead of time
Dynamic Load Balancing:
• delegates work based on the runtime performance of a Network of Workstations (NOW)
• is the more logical choice when the system is subject to transient external loads from multiple users, heterogeneous processors, and varying memory availability, network bandwidth and contention, and software

Dynamic Load Balancing Strategies
• Task Queue Model: a centralized queue of work
• Diffusion Model:
  • all work is delegated to the processors
  • when a processor detects an imbalance between itself and its neighbor, work is moved
• Predict future performance from past performance (exchange of performance information):
  • Global Distributed Scheme
  • Global Centralized Scheme
  • Local Distributed Scheme
  • Local Centralized Scheme
[Figure: the four schemes classified along two axes, Local vs. Global and Centralized vs. Distributed]
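The diffusion model above can be illustrated with a small sketch. The slide does not specify a topology or how much work moves, so the ring of processors, the `threshold` parameter, and the "move half the difference" rule are all assumptions made for illustration.

```python
def diffusion_step(loads, threshold=1):
    """One sweep over a ring of processors (hypothetical topology).

    For each neighboring pair, if the load difference exceeds `threshold`,
    half of the difference (rounded toward zero) moves from the heavier
    processor to the lighter one. Both rules are illustrative assumptions.
    """
    n = len(loads)
    new = list(loads)
    for i in range(n):
        j = (i + 1) % n                      # right-hand neighbor on the ring
        diff = new[i] - new[j]
        if abs(diff) > threshold:            # imbalance detected
            moved = diff // 2 if diff > 0 else -((-diff) // 2)
            new[i] -= moved                  # negative `moved` means work flows j -> i
            new[j] += moved
    return new


if __name__ == "__main__":
    print(diffusion_step([12, 0, 0, 0]))
```

Repeated sweeps spread the work outward from the overloaded processor; the total amount of work is preserved at every step.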

Dynamic Load Balancing Strategies
• Global - all load balancing is done on a global scale
• Local - processors are divided into groups (size = K) and load balancing decisions are made within a group
• Centralized - the load balancer is located on one processor
• Distributed - the load balancer is replicated on every processor
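As a rough illustration of the Global vs. Local distinction above, the sketch below computes which processors take part in a balancing decision with a given processor. How groups are formed is not stated on the slide; grouping consecutive processor ids into blocks of K is an assumption, and the function names are hypothetical.

```python
def make_groups(num_processors, k):
    """Split processor ids 0..num_processors-1 into consecutive groups of size k.

    Consecutive-id grouping is an assumption; the last group may be smaller
    when k does not divide num_processors.
    """
    if k <= 0:
        raise ValueError("group size k must be positive")
    ids = list(range(num_processors))
    return [ids[i:i + k] for i in range(0, num_processors, k)]


def balancing_domain(pid, num_processors, scope, k=None):
    """Return the processors that balance load together with processor `pid`.

    scope="global": every processor; scope="local": only pid's group of size k.
    """
    if scope == "global":
        return list(range(num_processors))
    if scope == "local":
        for group in make_groups(num_processors, k):
            if pid in group:
                return group
        raise ValueError("pid out of range")
    raise ValueError("scope must be 'global' or 'local'")


if __name__ == "__main__":
    print(make_groups(10, 4))
    print(balancing_domain(5, 8, "local", k=4))
```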

Dynamic Load Balancing Strategies
[Figure: four diagrams (Global Centralized, Global Distributed, Local Centralized, Local Distributed) showing where the load balancer sits relative to processors P1 ... Pn and groups G1, G2]

Strategy Tradeoffs
Global vs. Local
• Global scheme - global information is available at synchronization time; therefore, work distribution is optimal
• Global scheme - synchronization and communication cost is much higher
• Local scheme - groups may sit idle while other groups are overloaded
Centralized vs. Distributed
• Centralized scheme - a single load balancer hurts scalability
• Centralized scheme - distribution calculations are on one processor; therefore, they are done sequentially
• Distributed scheme - "all-to-all" exchange of performance profiles; therefore, network contention could be a problem

DLB Modeling & Decision Process
Modeling Parameters:
Processor Parameters
• number of processors
• normalized processor speed
• number of neighbors
Program Parameters
• data size
• number of loop iterations
• work per iteration
• number of bytes to be communicated per iteration
• time per iteration
Network Parameters
• network latency & bandwidth
• network topology
External Load Modeling
• maximum load
• duration of persistence of load

DLB Modeling & Decision Process (cont.)
Total DLB Cost =
  Synchronization Cost
  + Cost of Calculating New Distribution
  + Cost of Sending Instructions*
  + Cost of Data Movement
*only applies to centralized schemes

DLB Modeling & Decision Process (cont.)
Synchronization Cost (P = number of processors, K = group size):
• GCDLB: one-to-all(P) + all-to-one(P)
• GDDLB: one-to-all(P) + all-to-all(P^2)
• LCDLB: one-to-all(K) + all-to-one(K)
• LDDLB: one-to-all(K) + all-to-all(K^2)
Cost of Calculating New Distribution: usually very small
Cost of Sending Instructions: Number of Send Messages * Latency
Cost of Data Movement: Number of Messages * Latency + (Number of Iterations Moved * Number of Bytes Communicated per Iteration) / Bandwidth
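The cost terms above can be combined into a small calculator. This is a minimal sketch, not the authors' model: it assumes that one-to-all(n), all-to-one(n), and all-to-all(n^2) stand for n, n, and n^2 messages, and that each synchronization message costs one network latency. All function and parameter names are hypothetical.

```python
CENTRALIZED = ("GCDLB", "LCDLB")
DISTRIBUTED = ("GDDLB", "LDDLB")


def sync_messages(scheme, p, k):
    """Message count for synchronization (assumed reading of the slide's terms)."""
    n = p if scheme in ("GCDLB", "GDDLB") else k   # global uses P, local uses K
    if scheme in CENTRALIZED:
        return n + n          # one-to-all(n) + all-to-one(n)
    if scheme in DISTRIBUTED:
        return n + n * n      # one-to-all(n) + all-to-all(n^2)
    raise ValueError(f"unknown scheme: {scheme}")


def data_movement_cost(num_messages, iterations_moved, bytes_per_iteration,
                       latency, bandwidth):
    """Messages * latency + (iterations moved * bytes per iteration) / bandwidth."""
    return num_messages * latency + iterations_moved * bytes_per_iteration / bandwidth


def total_dlb_cost(scheme, p, k, latency, bandwidth, calc_cost,
                   instruction_messages, data_messages,
                   iterations_moved, bytes_per_iteration):
    """Sum of the four cost terms; instruction cost applies to centralized schemes only."""
    sync = sync_messages(scheme, p, k) * latency   # assumption: one latency per message
    instr = instruction_messages * latency if scheme in CENTRALIZED else 0.0
    move = data_movement_cost(data_messages, iterations_moved,
                              bytes_per_iteration, latency, bandwidth)
    return sync + calc_cost + instr + move


if __name__ == "__main__":
    for s in CENTRALIZED + DISTRIBUTED:
        print(s, total_dlb_cost(s, p=16, k=4, latency=1e-3, bandwidth=1e6,
                                calc_cost=0.0, instruction_messages=3,
                                data_messages=2, iterations_moved=100,
                                bytes_per_iteration=8))
```

Under these assumptions the P^2 term dominates the distributed global scheme as P grows, which matches the tradeoff that global distributed schemes pay more for synchronization.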

DLB Modeling & Decision Process (cont.)
Initially: work is divided equally among all processors
Synchronization:
• 1/P of the work has been done
• the load function is known
• the average effective speed is known
Performance Metric: number of iterations per second
• the load function and other parameters are plugged into the model to select the best strategy
Work Movement: work is moved only if the amount to be moved is above a threshold
Profitability Analysis: work is moved only if it yields at least a 10% improvement in execution time
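The decision steps above might be sketched as follows. This is an illustrative assumption rather than the authors' algorithm: it redistributes the remaining iterations in proportion to each processor's measured speed (iterations per second), predicts execution time as that of the slowest processor to finish, and applies the movement threshold and the 10% profitability test. The names `min_move` and `min_gain` are hypothetical parameters.

```python
def proportional_shares(remaining_iterations, speeds):
    """Split remaining iterations in proportion to measured speeds (iterations/second)."""
    total_speed = sum(speeds)
    shares = [int(remaining_iterations * s / total_speed) for s in speeds]
    # Iterations lost to rounding go to the fastest processor (assumed tie-break).
    fastest = max(range(len(speeds)), key=lambda i: speeds[i])
    shares[fastest] += remaining_iterations - sum(shares)
    return shares


def predicted_time(shares, speeds):
    """Execution time is set by the processor that finishes last."""
    return max(share / speed for share, speed in zip(shares, speeds))


def decide_rebalance(current, speeds, dlb_cost, min_move=1, min_gain=0.10):
    """Return (move_work, distribution) after the threshold and profitability checks."""
    proposed = proportional_shares(sum(current), speeds)
    moved = sum(max(c - n, 0) for c, n in zip(current, proposed))
    if moved < min_move:                      # too little work to be worth moving
        return False, current
    t_old = predicted_time(current, speeds)
    t_new = predicted_time(proposed, speeds) + dlb_cost
    if t_new <= (1.0 - min_gain) * t_old:     # at least a 10% improvement by default
        return True, proposed
    return False, current


if __name__ == "__main__":
    print(decide_rebalance([100, 100], [1.0, 3.0], dlb_cost=5.0))
```

In the usage line, the slower processor would need 100 seconds for its equal share, while the proposed split finishes in 55 seconds including the balancing cost, so the move passes the profitability test.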

Experiment
• Global schemes are best, since the computation/communication ratio is high
• More processors -> more synchronization cost, which favors the Local schemes
• Global is still better at 16 processors
• The centralized master, sequential redistribution, instruction sends, and delay factors add significant overhead to the Centralized schemes

Experiment
• The amount of work per iteration is small, so Local Distributed is favored
• As data size increases, Global Distributed does better
• On 16 processors, Local Distributed is the best
• Local is better than Global, since the computation/communication ratio is small
• Distributed is better than Centralized

Modeling Results

Conclusions
• Different schemes are best for different applications
• Customized Dynamic Load Balancing is essential when transient external loads are introduced
• Given the model, it is possible to select a good scheduling scheme
Future Work
• Other Dynamic Load Balancing schemes (not lying on the extremes) need to be incorporated into the model
• Instead of Local Centralized, have one master per group
• In Local schemes, work should be exchanged between different groups
• Dynamic group memberships