Charm++ Tutorial
Parallel Programming Laboratory, UIUC
Overview
Introduction
– Virtualization
– Data-driven execution in Charm++
– Object-based parallelization
Charm++ features with simple examples
– Chares and chare arrays
– Parameter marshalling
– Structured Dagger constructs
AMPI
Load balancing
Tools
– Parallel debugger
– Projections
– LiveViz
Technical Approach
Seek an optimal division of labor between the "system" and the programmer: decomposition is done by the programmer, everything else (mapping, scheduling) is automated.
(Figure: a spectrum from full automation to full specialization; Charm++ sits between, automating mapping and scheduling while leaving decomposition to the programmer.)
Object-Based Parallelization
The user is concerned only with the interaction between objects; the system handles the implementation.
(Figure: user view of interacting objects vs. the system's implementation on processors.)
Virtualization: Object-Based Decomposition
Divide the computation into a large number of pieces
– Independent of the number of processors
– Typically larger than the number of processors
Let the system map objects to processors.
Chares – Concurrent Objects
Chares can be dynamically created on any available processor, can be accessed from remote processors, send messages to each other asynchronously, and contain "entry methods".
"Hello World!"
The .ci file:
  mainmodule hello {
    mainchare mymain {
      entry mymain(CkArgMsg *m);
    };
  };
Generates hello.decl.h and hello.def.h.
The .C file:
  #include "hello.decl.h"
  class mymain : public CBase_mymain {
  public:
    mymain(CkArgMsg *m) {
      ckout << "Hello World" << endl;
      CkExit();
    }
  };
  #include "hello.def.h"
Compile and Run the Program
Compiling:
  charmc <options> <source file>
  Common options: -o, -g, -language, -module, -tracemode
Example makefile rules:
  pgm: pgm.ci pgm.h pgm.C
      charmc pgm.ci
      charmc pgm.C
      charmc -o pgm pgm.o -language charm++
To run a Charm++ program named "pgm" on four processors, type:
  charmrun pgm +p4 <params>
Nodelist file (for network architectures): a list of machines to run the program; each entry is host <hostname> <qualifiers>. Example nodelist file:
  group main ++shell ssh
  host Host1
  host Host2
Data-Driven Execution in Charm++
(Diagram: objects x and y on a processor; an invocation y->f() becomes a message in the scheduler's message queue, and the scheduler drives execution by delivering queued messages; CkExit() ends the program.)
Charm++ Solution: Proxy Classes
A proxy class is generated for each chare class
– For instance, CProxy_Y is the proxy class generated for chare class Y
– Proxy objects know where the real object is
– Methods invoked on a proxy simply put the data in an "envelope" and send it out to the destination
Given a proxy p, you can invoke methods: p.method(msg);
Ring Program
An array of objects of the same kind; each one communicates with the next one. Individual chares are cumbersome and not practical here.
Chare array:
– a single global name for the collection
– each member addressed by an index
– mapping of element objects to processors handled by the system
Chare Arrays
(Figure: user's view: elements A[0], A[1], A[2], A[3], A[..] form one logical collection; system view: the same elements distributed across processors.)
Array Hello
The .ci file:
  mainmodule m {
    readonly CProxy_mymain mainProxy;
    readonly int nElements;
    mainchare mymain { ... };
    array [1D] Hello {
      entry Hello(void);
      entry void sayHi(int hiNo);
    };
  };
Class declaration:
  class Hello : public CBase_Hello {
  public:
    Hello(CkMigrateMessage *m) {}
    Hello();
    void sayHi(int hiNo);
  };
In mymain::mymain():
  nElements = 4;
  mainProxy = thisProxy;
  CProxy_Hello p = CProxy_Hello::ckNew(nElements);
  // Have element 0 say "hi"
  p[0].sayHi(12345);
Array Hello (contd.)
  void Hello::sayHi(int hiNo) {
    ckout << hiNo << " from element " << thisIndex << endl;  // thisIndex: element index
    if (thisIndex < nElements-1)
      thisProxy[thisIndex+1].sayHi(hiNo+1);  // pass the hello on via the array proxy
    else
      mainProxy.done();  // we've been around once -- we're done (mainProxy is a read-only)
  }
  void mymain::done(void) {
    CkExit();
  }
Sorting Numbers
Sort n integers in increasing order.
Create n chares, each keeping one number.
In every odd iteration, chare 2i swaps with chare 2i+1 if required.
In every even iteration, chare 2i swaps with chare 2i-1 if required.
After each iteration, all chares report to the mainchare.
After everybody reports, the mainchare signals the next iteration.
Sorting completes in n iterations.
(Figure: even-round and odd-round exchange patterns.)
Array Sort
sort.ci:
  mainmodule sort {
    readonly CProxy_myMain mainProxy;
    readonly int nElements;
    mainchare myMain {
      entry myMain(CkArgMsg *m);
      entry void swapdone(void);
    };
    array [1D] sort {
      entry sort(void);
      entry void setValue(int myvalue);
      entry void swap(int round_no);
      entry void swapReceive(int from_index, int value);
    };
  };
sort.h:
  class sort : public CBase_sort {
  private:
    int myValue;
  public:
    sort();
    sort(CkMigrateMessage *m);
    void setValue(int number);
    void swap(int round_no);
    void swapReceive(int from_index, int value);
  };
myMain::myMain() {
  swapcount = 0;
  roundsDone = 0;
  mainProxy = thishandle;
  CProxy_sort arr = CProxy_sort::ckNew(nElements);
  for (int i = 0; i < nElements; i++)
    arr[i].setValue(rand());
  arr.swap(0);
}
Array Sort (contd.)
  void sort::swap(int roundno) {
    bool sendright = false;
    if ((roundno%2==0 && thisIndex%2==0) || (roundno%2==1 && thisIndex%2==1))
      sendright = true;
    // sendright is true if I have to send to the right
    if ((sendright && thisIndex==nElements-1) || (!sendright && thisIndex==0))
      mainProxy.swapdone();
    else {
      if (sendright)
        thisProxy[thisIndex+1].swapReceive(thisIndex, myValue);
      else
        thisProxy[thisIndex-1].swapReceive(thisIndex, myValue);
    }
  }
  void sort::swapReceive(int from_index, int value) {
    if (from_index==thisIndex-1 && value>myValue) myValue = value;
    if (from_index==thisIndex+1 && value<myValue) myValue = value;
    mainProxy.swapdone();
  }
  void myMain::swapdone(void) {
    if (++swapcount==nElements) {
      swapcount = 0;
      roundsDone++;
      if (roundsDone==nElements)
        CkExit();
      else
        arr.swap(roundsDone);
    }
  }
Error!! (This version is broken; see the next slide.)
Remember: message passing is asynchronous, and messages can be delivered out of order.
(Diagram: a swapReceive carrying 3 races ahead of a swap for the next round; the value 2 is lost!)
Array Sort (correct)
  void sort::swap(int roundno) {
    bool sendright = false;
    if ((roundno%2==0 && thisIndex%2==0) || (roundno%2==1 && thisIndex%2==1))
      sendright = true;
    // sendright is true if I have to send to the right
    if ((sendright && thisIndex==nElements-1) || (!sendright && thisIndex==0))
      mainProxy.swapdone();
    else if (sendright)
      thisProxy[thisIndex+1].swapReceive(thisIndex, myValue);
    // !sendright, non-boundary: wait for the left neighbor's message
  }
  void sort::swapReceive(int from_index, int value) {
    if (from_index == thisIndex-1) {        // request from the left neighbor
      if (value > myValue) {                // swap needed: return my smaller value
        thisProxy[thisIndex-1].swapReceive(thisIndex, myValue);
        myValue = value;
      } else {                              // no swap: return the sender's own value
        thisProxy[thisIndex-1].swapReceive(thisIndex, value);
      }
    } else if (from_index == thisIndex+1) { // reply from the right neighbor
      myValue = value;
    }
    mainProxy.swapdone();
  }
  void myMain::swapdone(void) {
    if (++swapcount==nElements) {
      swapcount = 0;
      roundsDone++;
      if (roundsDone==nElements)
        CkExit();
      else
        arr.swap(roundsDone);
    }
  }
Array Sort II: A Different Approach
Do not have the chares do work unless it is needed.
All processing is message driven (the result of receiving a message; no 'for' loops).
Do not continue the sort unless there is work to be done…
Quiescence Detection: "…the state in which no processor is executing an entry point, and no messages are awaiting processing…" --- Charm Manual
Uses a callback function (more on callback functions later). For now: when quiescence is detected, the callback function will be called and perform the desired task (in this case, print the sorted array and call CkExit() to end the program).
Array Sort II (cont.)
Member functions:
– initSwapSequenceWith(int index): received when the receiving chare should perform a swap with the chare at index (used to start the sort; each chare is told to check both of its neighbors)
– requestSwap(int reqIndex, int value): received when the chare at reqIndex wants to swap values
– denySwap(int index): received in response to a requestSwap() call; the request is denied
– acceptSwap(int index, int value): received in response to a requestSwap() call; the request is accepted
– checkForPending(): used to check if a request for a swap was received and buffered while the chare was already busy taking care of another swap
Protocol: a chare initiating a swap sends a requestSwap() to a neighboring chare and must receive either an accept or a deny in return; what happens next depends on the response. While a chare is busy processing a swap, requests from its other neighbor are answered with denySwap()s. When the array is sorted, all swap requests are denied, no further messages are queued, the remaining messages drain from the queues, and quiescence occurs.
Array Sort II (cont.)
  Main::Main(CkArgMsg *m) {
    ///// CODE REMOVED TO SAVE ROOM : Read Command Line Parameters /////
    ///// Setup the Quiescence Detection /////
    Ck Callback is registered before any work begins:
    CkCallback callback(CkIndex_Main::quiescenseHandler(), thishandle);
    CkStartQD(callback);
    ///// Start the Computation /////
    // Print out a message to let the user know the computation is about to start
    CkPrintf("Running Bubble on %d processors for %d elements...\n",
             CkNumPes(), nElements);
    // Set mainProxy to the proxy for this chare
    mainProxy = thishandle;
    // Create the array of chares (each element being a number in the array)
    arr = CProxy_Bubble::ckNew(nElements);
    // Tell each element in the array to check its neighbors
    for (int i = 1; i < nElements; i += 2) {
      if (i > 0) arr[i].initSwapSequenceWith(i - 1);
      if (i < nElements - 1) arr[i].initSwapSequenceWith(i + 1);
    }
  }
  void Bubble::initSwapSequenceWith(int index) {
    ///// Verify the Parameter /////
    if (index < 0 || index >= nElements || index == thisIndex)
      return;  // do nothing
    // Check to see if this element is already in a swap sequence
    if (isSwappingWith >= 0) {
      pendingInitIndex = index;  // buffer the index for later
      return;  // finished for now... will be processed later...
    }
    // Initiate the swap sequence with the specified index
    isSwappingWith = index;
    thisProxy[index].requestSwap(thisIndex, myValue);
  }
  void Bubble::requestSwap(int reqIndex, int value) {
    ///// CODE REMOVED TO SAVE ROOM : Verify the Parameters /////
    ///// Process the Request /////
    // Two neighbors may send requestSwap() messages to each other at the same
    // time. If so, one of them (either one) can drop/ignore the request.
    if (reqIndex == isSwappingWith) {
      if (thisIndex % 2 == 0)
        isSwappingWith = -1;  // the odd index will understand no response is coming
      else
        return;  // the even index ignores the request for a swap
    }
    // Check to see if this element is already taking part in a swap sequence
    if (isSwappingWith >= 0) {
      pendingRequestIndex = reqIndex;  // buffer the index and value for later
      pendingRequestValue = value;
      return;  // finished for now... the request will be processed later...
    }
    // Check to see if this value should be swapped for our own
    if ((reqIndex < thisIndex && value > myValue) ||
        (reqIndex > thisIndex && value < myValue)) {
      // A swap is needed: inform reqIndex and swap values
      thisProxy[reqIndex].acceptSwap(thisIndex, myValue);
      myValue = value;
      // Since this element's value just changed, request a swap with the other
      // neighbor (as long as there is not a pending request or init for it)
      int oni = thisIndex + ((thisIndex > reqIndex) ? 1 : -1);
      if (oni >= 0 && oni < nElements &&
          pendingRequestIndex != oni && pendingInitIndex != oni) {
        isSwappingWith = oni;
        thisProxy[oni].requestSwap(thisIndex, myValue);
      }
    } else {
      // No swap is needed: inform reqIndex and exit the swapping sequence
      thisProxy[reqIndex].denySwap(thisIndex);
    }
    // Check for pending items as long as this element is not already
    // trying to swap with another element
    if (isSwappingWith < 0) checkForPending();
  }
  void Bubble::acceptSwap(int index, int value) {
    // Swap is needed, so replace myValue with the other element's value
    myValue = value;
    isSwappingWith = -1;
    // Since this element's value just changed, request a swap with the other
    // neighbor (as long as there is not a pending request or init for it)
    int oni = thisIndex + ((thisIndex > index) ? 1 : -1);
    if (oni >= 0 && oni < nElements &&
        pendingRequestIndex != oni && pendingInitIndex != oni) {
      isSwappingWith = oni;
      thisProxy[oni].requestSwap(thisIndex, myValue);
    }
    if (isSwappingWith < 0) checkForPending();
  }
  void Bubble::denySwap(int index) {
    // Finished with the swap, so exit the swap sequence
    isSwappingWith = -1;
    checkForPending();
  }
  void Bubble::checkForPending() {
    // Check to see if there is a pending initiate swap
    if (pendingInitIndex >= 0) {
      // (Note: initSwapSequenceWith() does not call this function, so it is
      //  safe to do a standard call to it from here.)
      initSwapSequenceWith(pendingInitIndex);
      pendingInitIndex = -1;
    }
    // Check to see if there is a pending request for a swap
    if (pendingRequestIndex >= 0) {
      // Resend the request to self. (This function clears pendingRequestIndex
      // and pendingRequestValue before calling requestSwap(), so when
      // requestSwap() calls this function again at the end, execution will not
      // re-enter this if statement: there is no infinite loop of calls back
      // and forth between the two functions, as one might think at first
      // glance. Also note that isSwappingWith will be -1 if this function is
      // called.)
      int tempIndex = pendingRequestIndex;
      int tempValue = pendingRequestValue;
      pendingRequestIndex = -1;
      pendingRequestValue = -1;
      thisProxy[thisIndex].requestSwap(tempIndex, tempValue);
    }
  }
  void Bubble::displayValue() {
    if (thisIndex == 0) CkPrintf("\n");
    CkPrintf("Final --- Bubble[%06d].displayValue() - myValue = %d\n",
             thisIndex, myValue);
    fflush(stdout);
    if (thisIndex < nElements - 1)
      thisProxy[thisIndex + 1].displayValue();
    else
      mainProxy.done();
  }
  void Main::quiescenseHandler() {
    // After all the activity has stopped, start the final sequence of
    // printing the array and exiting the program
    arr[0].displayValue();
  }
  void Main::done(void) {
    CkPrintf("All Done\n");
    CkExit();
  }
Basic Entities in Charm++ Programs: Review Sequential Objects - ordinary sequential C++ code and objects Chares - concurrent objects Chare Arrays - an indexed collection of chares 23
Illustrative Example: 5-Point Stencil
Hot temperature on two sides will slowly spread across the entire grid.
Illustrative Example: 5-Point 2-D Stencil
Input: a 2D array of values with boundary conditions.
In each iteration, each array element is computed as the average of itself and its neighbors (an average over 5 points).
Iterations are repeated until some threshold difference value is reached.
Parallel Solution! 26
Parallel Solution!
Slice up the 2D array into sets of columns; chare = computation on one set.
At the end of each iteration
– Chares exchange boundaries
– Determine the maximum change in the computation
Output the result at each step or when the threshold is reached.
Arrays as Parameters
An array cannot be passed as a pointer; specify the length of the array in the interface file:
  entry void bar(int n, double arr[n])
Here n is the size of arr[].
Stencil Code
  void Ar1::doWork(int sendersID, int n, double arr[]) {
    maxChange = 0.0;
    if (sendersID == thisIndex-1) {
      leftmsg = 1;   // set boolean to indicate we received the left message
    } else if (sendersID == thisIndex+1) {
      rightmsg = 1;  // set boolean to indicate we received the right message
    }
    // Rest of the code on a following slide ...
  }
Reduction
Apply a single operation (add, max, min, ...) to data items scattered across many processors, and collect the result in one place.
To reduce x across all elements:
  contribute(sizeof(x), &x, CkReduction::sum_int);
You must create and register a callback that will receive the final value, in the main chare.
Callbacks
A generic way to transfer control to a chare after a library (such as reduction) has finished.
After finishing a reduction, the results have to be passed to some chare's entry method. To do this, create an object of type CkCallback with the chare's ID and entry method index.
There are different types of callbacks; one commonly used type:
  CkCallback cb(<chare's entry method>, <chare's proxy>);
Code Continued
  void Ar1::doWork(int sendersID, int n, double arr[]) {
    // Code on previous slide ...
    if (((rightmsg == 1) && (leftmsg == 1)) ||
        ((thisIndex == 0) && (rightmsg == 1)) ||
        ((thisIndex == K-1) && (leftmsg == 1))) {
      // Both messages have been received and we can now compute the new
      // values of the matrix ...
      // Use a reduction to determine if the maximum change on every
      // processor is below our threshold value.
      contribute(sizeof(double), &maxChange, CkReduction::max_double);
    }
  }
Types of Reductions
Predefined reductions: a number of reductions are predefined, including ones that
– Sum values or arrays
– Calculate the product of values or arrays
– Calculate the maximum contributed value
– Calculate the minimum contributed value
– Calculate the logical AND of contributed integer values
– Calculate the logical OR of contributed integer values
– Form a set of all contributed values
– Concatenate the bytes of all contributed values
Plus, you can create your own.
Structured Dagger 34
Structured Dagger
What is it? A coordination language built on top of Charm++.
Motivation: to reduce the complexity of program development without adding any overhead; keeping flags and buffering manually can complicate code in the Charm++ model.
Structured Dagger Constructs (to be covered in the Advanced Charm++ session)
atomic {code}
overlap {code}
when <entrylist> {code}
if/else/for/while
foreach
Stencil Example Using Structured Dagger
stencil.ci:
  array [1D] Ar1 {
    ...
    entry void GetMessages() {
      when rightmsgEntry(), leftmsgEntry() {
        atomic {
          CkPrintf("Got both left and right messages\n");
          doWork(right, left);
        }
      }
    };
    entry void rightmsgEntry();
    entry void leftmsgEntry();
    ...
  };
Adaptive MPI 38
AMPI = Adaptive MPI
What is it? An MPI implementation built on Charm++.
Motivation:
– To provide the benefits of the Charm++ runtime system to standard MPI programs: load balancing, checkpointing, adaptivity to a dynamic number of physical processors
– Ease of programming for those who know MPI
– Porting MPI code to AMPI is easy
Sample AMPI Program (also a valid MPI program)
  #include <stdio.h>
  #include "mpi.h"
  int main(int argc, char** argv) {
    int ierr, rank, np, myval = 0;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &np);
    if (rank < np-1)
      MPI_Send(&myval, 1, MPI_INT, rank+1, 1, MPI_COMM_WORLD);
    if (rank > 0)
      MPI_Recv(&myval, 1, MPI_INT, rank-1, 1, MPI_COMM_WORLD, &status);
    printf("rank %d completed\n", rank);
    ierr = MPI_Finalize();
  }
AMPI Compilation
Compile: charmc sample.c -language ampi -o sample
Run: charmrun ./sample +p16 +vp128 [args]
Instead of the traditional MPI equivalent: mpirun ./sample -np 128 [args]
Processor Virtualization
AMPI utilizes the processor-virtualization features of Charm++ for standard MPI programs
– User specifies interaction between objects (VPs)
– Runtime system maps VPs onto physical processors
– Typically, # virtual processors > # processors
(Figure: user view vs. system implementation.)
Comparison to Native MPI
• AMPI performance: similar to native MPI, when not utilizing any other features of AMPI (load balancing, etc.)
• AMPI flexibility: AMPI runs on any number of physical processors (e.g., 19, 33, 105); native MPI needs a cube number
Problem setup: 3D stencil calculation of size 240^3 run on Lemieux.
Shrink/Expand
Problem: the availability of the computing platform may change. Fit the application to the platform by object migration.
(Figure: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600.)
Current AMPI Capabilities
Automatic checkpoint/restart mechanism (robust implementation available)
Load balancing and "process" migration
MPI 1.1 compliant; most of MPI-2 implemented
Interoperability
– With frameworks
– With Charm++
Performance visualization
Overview of Tools
Debugging support
Projections
Using load balancing strategies
LiveViz
Parallel Debugging Support
The parallel debugger (charmdebug) allows the programmer to view the changing state of the parallel program. Based on CCS, with a Java GUI client.
Debugger features Provides a means to easily access and view the major programmer visible entities, including array elements and messages in queues, during program execution Provides an interface to set and remove breakpoints on remote entry points, which capture the major programmer-visible control flows 48
Debugger features (contd. ) Provides the ability to freeze and unfreeze the execution of selected processors of the parallel program, which allows a consistent snapshot Provides a way to attach a sequential debugger (like GDB) to a specific subset of processes of the parallel program during execution, which keeps a manageable number of sequential debugger windows open 49
Alternative Debugging Support
Uses gdb for debugging: runs each node under gdb in an xterm window, prompting the user to begin execution. The Charm++ program has to be compiled using '-g' and run with '++debug' as a command-line option.
Projections: Quick Introduction Projections is a tool used to analyze the performance of your application The tracemode option is used when you build your application to enable tracing You get one log file per processor, plus a separate file with global information These files are read by Projections so you can use the Projections views to analyze performance 51
Screen Shots – Load Imbalance
Jacobi 2048 x 2048, threshold 0.1, 32 chares, 4 processors
Timelines – load imbalance 53
Migration
Array objects can migrate from one PE to another. To migrate, an object must implement a pack/unpack, or pup, method. This is needed because migration creates a new object on the destination processor while destroying the original.
pup combines three functions into one
– Data structure traversal: compute the message size, in bytes
– Pack: write the object into the message
– Unpack: read the object out of the message
Basic contract: here are my fields (types, sizes, and a pointer).
Pup – How to Write It?
  class ShowPup {
    double a;
    int x;
    char y;
    unsigned long z;
    float q[3];
    int *r;  // heap-allocated memory
  public:
    // ... other methods ...
    void pup(PUP::er &p) {
      p | a;    // notice that you can use either the | operator
      p | x;
      p | y;
      p(z);     // or ()
      p(q, 3);  // but you need () for arrays
      if (p.isUnpacking())
        r = new int[ARRAY_SIZE];
      p(r, ARRAY_SIZE);
    }
  };
Load Balancing
All we need is a working pup routine. The idea is to use object migration to balance the workload among processors effectively.
– Principle of persistence
– Measurement-based load balancing
Two major types of strategies
• Centralized
• Distributed
Centralized Load Balancing Uses information about activity on all processors to make load balancing decisions Advantage: since it has the entire object communication graph, it can make the best global decision Disadvantage: Higher communication costs/latency, since this requires information from all running chares 57
Neighborhood Load Balancing Load balances among a small set of processors (the neighborhood) to decrease communication costs Advantage: Lower communication costs, since communication is between a smaller subset of processors Disadvantage: Could leave a system which is globally poorly balanced 58
Centralized Load Balancing Strategies
GreedyCommLB: a "greedy" load balancing strategy which uses the process load and communication graph to map the processes with the highest load onto the processors with the lowest load, while trying to keep communicating processes on the same processor.
RefineLB: moves objects off overloaded processors to under-utilized processors to reach the average load.
Others: the Charm++ manual discusses several other load balancers which are not used as often but may be useful in some cases; more are being developed.
Neighborhood Load Balancing Strategies
NeighborLB: each processor tries to average out its load only among its neighbors.
WSLB: for workstation clusters; can detect load changes on desktops (and other timeshared processors) and adjust load without interfering with others' use of the desktop.
When to Re-balance Load?
Default: the load balancer will migrate when needed.
Programmer control: AtSync load balancing. The AtSync method enables load balancing at a specific point
– Object ready to migrate
– Re-balance if needed
– AtSync() is called when your chare is ready to be load balanced; load balancing may not start right away
– ResumeFromSync() is called when load balancing for this chare has finished
Using a Load Balancer
Link an LB module
– -module <strategy>
– RefineLB, NeighborLB, GreedyCommLB, others…
– EveryLB will include all load balancing strategies
Compile-time option (specify the default balancer)
– -balancer RefineLB
Runtime option
– +balancer RefineLB
Load Balancing in Jacobi 2 D Main: Setup worker array, pass data to them Workers: Start looping Send messages to all neighbors with ghost rows Wait for all neighbors to send ghost rows to me Once they arrive, do the regular Jacobi relaxation Calculate maximum error, do a reduction to compute global maximum error If timestep is a multiple of 64, load balance the computation. Then restart the loop. 63
Load Balancing in Jacobi 2D (cont.)
  worker::worker(void) {
    // Initialize other parameters
    usesAtSync = CmiTrue;
  }
  void worker::doCompute(void) {
    // do all the Jacobi computation
    syncCount++;
    if (syncCount % 64 == 0)
      AtSync();
    else
      contribute(1*sizeof(float), &errorMax, CkReduction::max_float);
  }
  void worker::ResumeFromSync(void) {
    contribute(1*sizeof(float), &errorMax, CkReduction::max_float);
  }
Processor Utilization: After Load Balance 65
Timelines: Before and After Load Balancing 66
LiveViz – What Is It?
A Charm++ library and visualization tool: inspect your program's current state. The Java client runs on any machine; you code the image generation. 2D and 3D modes.
LiveViz – Monitoring Your Application
LiveViz allows you to watch your application's progress, and doesn't slow down computation when there is no client.
LiveViz – Compilation
LiveViz is part of the standard Charm++ distribution; when you build Charm++, you also get LiveViz.
Running LiveViz
Build and run the server:
  cd pgms/charm++/ccs/liveViz/pollserver
  make
  ./run_server.sh
Or in detail…
Running LiveViz
Run the client:
  liveViz [<host> [<port>]]
You should get a result window:
LiveViz Request Model
(Diagram: the client sends a "get image" request to the LiveViz server code; chunk requests are passed to the workers, each poll request returns work, the image buffer combines the image chunks, and the combined image is sent back to the client.)
Jacobi 2 D Example Structure Main: Setup worker array, pass data to them Workers: Start looping Send messages to all neighbors with ghost rows Wait for all neighbors to send ghost rows to me Once they arrive, do the regular Jacobi relaxation Calculate maximum error, do a reduction to compute global maximum error If timestep is a multiple of 64, load balance the computation. Then restart the loop. 73
LiveViz Setup
  #include <liveVizPoll.h>
  void main::main(...) {
    // Do misc initialization stuff
    // Create the workers and register with liveViz
    CkArrayOptions opts(0);          // by default allocate 0 array elements
    liveVizConfig cfg(true, true);   // color image = true and animate image = true
    liveVizPollInit(cfg, opts);      // initialize the library
    // Now create the (empty) jacobi 2D array
    work = CProxy_matrix::ckNew(opts);
    // Distribute work to the array, filling it as you do
  }
Adding LiveViz To Your Code
  void matrix::serviceLiveViz() {
    liveVizPollRequestMsg *m;
    while ((m = liveVizPoll((ArrayElement *)this, timestep)) != NULL) {
      requestNextFrame(m);
    }
  }
  void matrix::startTimeSlice() {
    // Send ghost rows north, south, east, west, ...
    sendMsg(dims.x-2, NORTH, dims.x+1, 1, +0, -1);
    // Now having sent all our ghosts, service liveViz
    // while waiting for neighbors' ghosts to arrive.
    serviceLiveViz();
  }
Generate an Image For a Request
  void matrix::requestNextFrame(liveVizPollRequestMsg *m) {
    // Compute the dimensions of the image bit we'll send ...
    // Compute the image data of the chunk we'll send: image data is just a
    // linear array of bytes in row-major order. For greyscale it's 1 byte
    // per pixel, for color it's 3 bytes (rgb).
    // The liveViz library routine colorScale(value, min, max, *array)
    // will rainbow-color your data automatically.
    // Finally, return the image data to the library
    liveVizPollDeposit((ArrayElement *)this, timestep, m,
                       loc_x, loc_y, width, height, imageBits);
  }
Link With The LiveViz Library
  OPTS=-g
  CHARMC=charmc $(OPTS)
  LB=-module RefineLB
  OBJS = jacobi2d.o

  all: jacobi2d

  jacobi2d: $(OBJS)
          $(CHARMC) -language charm++ -o jacobi2d $(OBJS) $(LB) -lm -module liveViz

  jacobi2d.o: jacobi2d.C jacobi2d.decl.h
          $(CHARMC) -c jacobi2d.C
LiveViz Summary
An easy-to-use visualization library. Simple code handles any number of clients; doesn't slow computation when there are no clients connected; works in parallel, with load balancing, etc.
Advanced Features Groups Node Groups Priorities Entry Method Attributes Communications Optimization Checkpoint/Restart 79
Conclusions
Better software engineering
– Logical units decoupled from the "number of processors"
Message-driven execution
– Adaptive overlap between computation and communication
– Predictability of execution
Flexible and dynamic mapping to processors
– Flexible mapping on clusters
– Change the set of processors for a given job
– Automatic checkpointing
Principle of persistence
More Information
http://charm.cs.uiuc.edu
– Manuals
– Papers
– Download files
– FAQs
ppl@cs.uiuc.edu