Hierarchical Load Balancing for Large Scale Supercomputers
Gengbin Zheng
Parallel Programming Lab, UIUC
Charm++ Workshop 2010
Outline
- Dynamic load balancing framework in Charm++
- Motivations
- Hierarchical load balancing strategy
Charm++ Dynamic Load Balancing Framework
- One of the most popular reasons to use Charm++/AMPI
- Fully automatic
- Adaptive
- Application independent
- Modular and extensible
Principle of Persistence
- Once an application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
- This holds in spite of dynamic behavior:
  - Abrupt and large, but infrequent, changes (e.g. AMR)
  - Slow and small changes (e.g. particle migration)
- A parallel analog of the principle of locality
- A heuristic that holds for most CSE applications
Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation (LB database):
  - Records communication volume and computation time
- Measurement-based load balancers:
  - Use the database periodically to make new decisions
- Many alternative strategies can use the database:
  - Centralized vs. distributed
  - Greedy vs. refinement
  - Taking communication into account
  - Taking dependencies into account (more complex)
  - Topology-aware
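As a concrete illustration of how a measurement-based strategy can consume the database, here is a minimal sketch of a greedy rebalancer. It is not the Charm++ implementation, and all names are illustrative: the heaviest measured objects are assigned first, each to the currently least-loaded processor.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical sketch of a greedy, measurement-based rebalancer.
// objLoad[i] is the measured execution time of object i (as the LB database
// would record); the result maps each object to a processor.
std::vector<int> greedyAssign(const std::vector<double>& objLoad, int nprocs) {
    // Min-heap of (current processor load, processor id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (int p = 0; p < nprocs; ++p) procs.push({0.0, p});

    // Visit objects from heaviest to lightest.
    std::vector<int> order(objLoad.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoad[a] > objLoad[b]; });

    // Greedily place each object on the least-loaded processor so far.
    std::vector<int> assignment(objLoad.size());
    for (int obj : order) {
        auto [load, p] = procs.top();
        procs.pop();
        assignment[obj] = p;
        procs.push({load + objLoad[obj], p});
    }
    return assignment;
}
```

This is the "greedy" end of the greedy-vs.-refinement spectrum above: it rebuilds the whole mapping from scratch, ignoring where objects currently live.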
Load Balancer Strategies
- Centralized:
  - Object load data are sent to processor 0
  - Integrated into a complete object graph
  - Migration decisions are broadcast from processor 0
  - Requires a global barrier
- Distributed:
  - Load balancing among neighboring processors
  - Builds only a partial object graph
  - Migration decisions are sent to neighbors
  - No global barrier
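The distributed flavor can be sketched as a diffusion step, where each processor adjusts its load against its neighbors only. This is a simplified illustration, not the framework's actual neighbor-exchange protocol, and `alpha` is a hypothetical diffusion coefficient:

```cpp
#include <cassert>
#include <vector>

// Hedged sketch of one step of a distributed, diffusion-style strategy:
// each processor compares its load only with its neighbors and absorbs a
// fraction `alpha` of each pairwise difference.
std::vector<double> diffuseStep(const std::vector<double>& load,
                                const std::vector<std::vector<int>>& nbrs,
                                double alpha) {
    std::vector<double> next(load);
    for (std::size_t p = 0; p < load.size(); ++p)
        for (int q : nbrs[p])
            next[p] += alpha * (load[q] - load[p]);  // pull from heavier neighbors, shed to lighter ones
    return next;
}
```

With symmetric neighbor lists the total load is conserved, but a single hot spot spreads only one hop per step, which previews the convergence problem discussed below.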
Limitations of Centralized Strategies
- Consider an application with 1M objects on 64K processors
- Inherently not scalable:
  - The central node becomes a memory and communication bottleneck
  - Decision-making algorithms tend to be very slow
- We demonstrate these limitations using the simulator we developed
Memory Overhead
(Figure: simulation results with lb_test, run on Lemieux on 64 processors. The lb_test benchmark is a parameterized program that creates a specified number of communicating objects in a 2D mesh.)
Load Balancing Execution Time
(Figure: execution time of load balancing algorithms in a 64K-processor simulation.)
Limitations of Distributed Strategies
- Each processor periodically exchanges load information and migrates objects among neighboring processors
- Performance improves only slowly
- Lack of global information
- Difficult to converge quickly to as good a solution as a centralized strategy
(Figure: results with NAMD on 256 processors.)
A Hybrid Load Balancing Strategy
- Divide processors into independent groups, with groups organized in hierarchies (decentralized):
  - Aggressive load balancing within sub-groups, combined with
  - Refinement-based cross-group load balancing
- Each group has a leader (its central node) that performs centralized load balancing:
  - Reuses existing centralized load balancing strategies
Hierarchical Tree (an example)
(Figure: a 64K-processor hierarchical tree. A single Level-2 root sits above 64 Level-1 group leaders (processors 0, 1024, ..., 63488, 64512), each leading a Level-0 group of 1024 processors: 0-1023, 1024-2047, ..., 63488-64511, 64512-65535.)
- Apply different strategies at each level
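The tree arithmetic in the example above is simple to express. The following sketch (illustrative, not the actual tree-construction code) maps a processor to its Level-1 group leader, assuming groups of 1024 led by their first processor:

```cpp
#include <cassert>

// Illustrative arithmetic for the two-level example tree: 65536 processors
// in groups of 1024, each group led by its first (lowest-numbered) processor.
int groupLeader(int pe, int groupSize) {
    return (pe / groupSize) * groupSize;  // first processor of pe's group
}

int numGroups(int npes, int groupSize) {
    return (npes + groupSize - 1) / groupSize;  // Level-1 nodes under the root
}
```

For example, processor 64511 belongs to the group led by 63488, and 65536 processors in groups of 1024 yield the 64 Level-1 leaders shown in the figure.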
Issues
- Load data reduction:
  - Semi-centralized load balancing scheme
- Reducing data movement:
  - Token-based local balancing
- Topology-aware tree construction
Token-Based HybridLB Scheme
(Figure: load data flows up from each Level-0 group to its Level-1 leader (processors 0, 1024, ..., 63488, 64512) as an object communication graph (OCG). Greedy-based load balancing runs within each group; refinement-based load balancing runs across groups at the root. Tokens stand in for objects until final migration decisions are made.)
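The cross-group refinement step can be sketched with tokens: the root shifts lightweight (object id, load) tokens from the heaviest group toward the lightest as long as a single token move narrows the gap, so no object data moves until decisions are final. This is an illustrative simplification of the scheme, not the HybridLB code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A token is a lightweight stand-in for an object: just its id and
// measured load, not the object's data.
struct Token { int objId; double load; };

double groupLoad(const std::vector<Token>& g) {
    double s = 0;
    for (const auto& t : g) s += t.load;
    return s;
}

// Refinement across groups: repeatedly move the smallest token from the
// heaviest group to the lightest, stopping when no single move would
// narrow the gap between them.
void refineGroups(std::vector<std::vector<Token>>& groups) {
    while (true) {
        std::size_t hi = 0, lo = 0;
        for (std::size_t i = 1; i < groups.size(); ++i) {
            if (groupLoad(groups[i]) > groupLoad(groups[hi])) hi = i;
            if (groupLoad(groups[i]) < groupLoad(groups[lo])) lo = i;
        }
        double gap = groupLoad(groups[hi]) - groupLoad(groups[lo]);
        auto it = std::min_element(groups[hi].begin(), groups[hi].end(),
            [](const Token& a, const Token& b) { return a.load < b.load; });
        if (it == groups[hi].end() || 2 * it->load >= gap) break;  // no improving move
        groups[lo].push_back(*it);
        groups[hi].erase(it);
    }
}
```

Because only tokens travel during decision-making, the actual (expensive) object migration happens once, after the final destinations are known.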
Performance Study with a Synthetic Benchmark
(Figure: peak memory usage (MB) of CentralizedLB vs. Hierarchical at 4096, 8192, and 16384 processors; lb_test benchmark on the Ranger cluster with 1M objects.)
Load Balancing Time (lb_test)
(Figure: load balancing time (s) of CentralizedLB vs. Hierarchical at 4096, 8192, and 16384 processors; lb_test benchmark on the Ranger cluster.)
Performance (lb_test)
(Figure: lb_test step time with no LB, CentralizedLB, and Hierarchical at 4096, 8192, and 16384 processors; lb_test benchmark on the Ranger cluster.)
NAMD Hierarchical LB
- NAMD implements its own specialized load balancing strategies:
  - Based on the Charm++ load balancing framework
- Extended NAMD's comprehensive and refinement-based strategies:
  - Work on subsets of processors
NAMD LB Time
(Figure: NAMD time/step with the comprehensive and refinement strategies at 256, 512, 1024, and 2048 processors.)
NAMD LB Time (Comprehensive)
(Figure: load balancing time (s), comprehensive vs. hierarchical, at 512 to 8192 processors; log scale.)
NAMD LB Time (Refinement)
(Figure: load balancing time (s), refinement vs. hierarchical, at 512 to 8192 processors; log scale.)
NAMD Performance
(Figure: NAMD time/step, centralized vs. hierarchical, at 512 to 8192 processors.)
Conclusions
- Scalable load balancers are needed for large machines such as BG/P
- Hierarchical load balancing avoids the memory and communication bottleneck
- It achieves results similar to the more expensive centralized load balancers
- Processor topology can be taken into account