Charm++ Tutorial
Parallel Programming Laboratory, UIUC
Overview
• Introduction
  – Virtualization
  – Data Driven Execution in Charm++
  – Object-based Parallelization
• Charm++ features with simple examples
  – Chares and Chare Arrays
  – Parameter Marshalling
  – Structured Dagger Construct
• Load Balancing
• Tools
  – Projections
  – LiveViz
Technical Approach
• Seek an optimal division of labor between the “system” and the programmer
• Decomposition is done by the programmer; everything else is automated
[Figure: chart with labels Automation, Decomposition, Mapping, Scheduling, Specialization, Charm++]
Object-based Parallelization
• The user is only concerned with the interaction between objects
[Figure: User View versus System implementation]
Virtualization: Object-based Decomposition
• Divide the computation into a large number of pieces
  – Independent of the number of processors
  – Typically larger than the number of processors
• Let the system map objects to processors
Chares – Concurrent Objects
• Can be dynamically created on any available processor
• Can be accessed from remote processors
• Send messages to each other asynchronously
• Contain “entry methods”
“Hello World!”

.ci file:
    mainmodule hello {
      mainchare main {
        entry main();
      };
    };

Generates hello.decl.h and hello.def.h

.C file:
    #include "hello.decl.h"
    class main : public CBase_main {
    public:
      main(int argc, char **argv) {
        ckout << "Hello World" << endl;
        CkExit();
      }
    };
    #include "hello.def.h"
Compile and run the program

Compiling
• charmc <options> <source file>
• Options: -o, -g, -language, -module, -tracemode

Example makefile rule:
    pgm: pgm.ci pgm.h pgm.C
    	charmc -c pgm.ci
    	charmc -c pgm.C
    	charmc -o pgm pgm.o -language charm++

To run a Charm++ program named "pgm" on four processors, type:
    charmrun pgm +p4 <params>

Nodelist file (for network architecture)
• list of machines to run the program
• host <hostname> <qualifiers>
Data Driven Execution in Charm++
[Figure labels: objects x and y, the invocation y->f(), Scheduler, Message Q, CkExit()]
Charm++ solution: proxy classes
• Proxy class generated for each chare class
  – For instance, CProxy_D is the proxy class generated for chare class D.
  – Proxy objects know where the real object is
  – Methods invoked on this object simply put the data in an “envelope” and send it out to the destination
• Given a proxy p, you can invoke methods
  – p.method(msg);
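The proxy idea can be illustrated in plain C++, without the Charm++ runtime. The following is a minimal single-process sketch under stated assumptions: `Scheduler`, `Counter`, and `CounterProxy` are hypothetical names invented here, whereas real proxy classes are generated from the .ci file and can deliver to remote processors.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical single-process stand-in for a per-processor scheduler:
// it holds a queue of pending invocations and runs them one at a time.
class Scheduler {
public:
    void enqueue(std::function<void()> msg) { queue_.push(std::move(msg)); }

    // Deliver messages until the queue is empty; returns how many ran.
    int run() {
        int delivered = 0;
        while (!queue_.empty()) {
            std::function<void()> msg = std::move(queue_.front());
            queue_.pop();
            assert(static_cast<bool>(msg));  // every queued message is callable
            msg();                           // may enqueue further messages
            ++delivered;
        }
        return delivered;
    }

private:
    std::queue<std::function<void()>> queue_;
};

// The "real" object; its method runs only when a message is delivered.
struct Counter {
    int total = 0;
    void add(int amount) { total += amount; }
};

// Hypothetical hand-written proxy: invoking a method does not call the
// object directly. It wraps the arguments in a closure (the "envelope")
// and hands that to the scheduler.
class CounterProxy {
public:
    CounterProxy(Scheduler &sched, Counter &target)
        : sched_(&sched), target_(&target) {}

    void add(int amount) const {
        Counter *target = target_;
        sched_->enqueue([target, amount] { target->add(amount); });
    }

private:
    Scheduler *sched_;
    Counter *target_;
};
```

Calling `add` on the proxy returns immediately; the counter only changes once `run()` delivers the queued invocations, which mirrors the asynchronous behavior described on the slide.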
Ring program
• Array of objects of the same kind
• Each one communicates with the next one
• Individual chares: cumbersome and not practical
A collection of chares,
  – with a single global name for the collection
  – each member addressed by an index
  – mapping of element objects to processors handled by the system
Chare Arrays
[Figure: User's view: A[0], A[1], A[2], A[3], A[..]. System view: A[0], A[1].]
Array Hello

(.ci) file:
    mainmodule m {
      readonly CProxy_main mainProxy;
      mainchare main { …. }
      array [1D] Hello {
        entry Hello(void);
        entry void sayHi(int hiNo);
      };
    };

Class declaration:
    class Hello : public CBase_Hello {
    public:
      Hello(CkMigrateMessage *m) {}
      Hello();
      void sayHi(int hiNo);
    };

In main::main():
    int nElements = 4;
    mainProxy = thisProxy;
    CProxy_Hello p = CProxy_Hello::ckNew(nElements);
    // Have element 0 say "hi"
    p[0].sayHi(12345);
Array Hello

    void Hello::sayHi(int hiNo) {
      ckout << hiNo << " from element " << thisIndex << endl;
      if (thisIndex < nElements-1)
        // Pass the hello on:
        thisProxy[thisIndex+1].sayHi(hiNo+1);
      else
        // We've been around once -- we're done.
        mainProxy.done();
    }

    void main::done(void) {
      CkExit();
    }

Annotations: thisIndex is the element index, thisProxy is the array proxy, and mainProxy is a read-only variable.
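To see the order in which the ring elements run, the message passing can be simulated in plain C++. This is only a sketch: `SayHiMsg` and `runRing` are hypothetical names, a single in-process queue stands in for the Charm++ scheduler, and nothing here uses the real runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// One pending sayHi invocation addressed to an array index.
struct SayHiMsg {
    std::size_t index;
    int hiNo;
};

// Simulates the ring with nElements elements and returns the lines that
// each element would print, in delivery order.
std::vector<std::string> runRing(std::size_t nElements, int firstHiNo) {
    assert(nElements > 0);
    std::vector<std::string> log;
    std::deque<SayHiMsg> queue;
    queue.push_back({0, firstHiNo});  // plays the role of p[0].sayHi(firstHiNo)

    while (!queue.empty()) {
        SayHiMsg msg = queue.front();
        queue.pop_front();
        log.push_back(std::to_string(msg.hiNo) + " from element " +
                      std::to_string(msg.index));
        if (msg.index < nElements - 1) {
            // Pass the hello on to the next element.
            queue.push_back({msg.index + 1, msg.hiNo + 1});
        }
        // Otherwise the last element was reached; the real program
        // would call mainProxy.done() at this point.
    }
    return log;
}
```

With four elements and a starting value of 12345, the log holds four entries, from "12345 from element 0" to "12348 from element 3".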
Basic Entities in Charm++ Programs
• Sequential objects
  – ordinary sequential C++ code and objects
• Read-only variables
  – initialized in main::main()
  – used as “global” variables
• Chares (concurrent objects)
• Chare arrays (an indexed collection of chares)
Illustrative example: Jacobi 1D
• Input: 2D array of values with boundary conditions
• In each iteration, each array element is computed as the average of itself and its neighbors
• Iterations are repeated until some threshold error value is reached
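The sequential computation described on this slide can be sketched as follows. The slide does not say which neighbors are averaged, so the five-point stencil (the cell plus its four direct neighbors) is an assumption, as are the names `Grid`, `jacobiSweep`, and `jacobiSolve`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// One Jacobi sweep: each interior cell becomes the average of itself and
// its four direct neighbors (assumed stencil); boundary cells stay fixed.
// Returns the largest absolute change made to any cell.
double jacobiSweep(Grid &grid) {
    assert(!grid.empty());
    const Grid old = grid;  // read the old values, write into grid
    double maxChange = 0.0;
    for (std::size_t i = 1; i + 1 < old.size(); ++i) {
        for (std::size_t j = 1; j + 1 < old[i].size(); ++j) {
            double updated = (old[i][j] + old[i - 1][j] + old[i + 1][j] +
                              old[i][j - 1] + old[i][j + 1]) / 5.0;
            maxChange = std::max(maxChange, std::fabs(updated - old[i][j]));
            grid[i][j] = updated;
        }
    }
    return maxChange;
}

// Repeat sweeps until the maximum change drops below the threshold.
// Returns the number of sweeps performed.
int jacobiSolve(Grid &grid, double threshold) {
    assert(threshold > 0.0);  // a zero threshold would never terminate
    int iterations = 0;
    double maxChange;
    do {
        maxChange = jacobiSweep(grid);
        ++iterations;
    } while (maxChange >= threshold);
    return iterations;
}
```

The value returned by `jacobiSweep` is the per-iteration "maximum change" that the parallel version combines across chares.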
Jacobi 1D: Parallel Solution!
• Slice up the 2D array into sets of columns
• Chare = computations in one set
• At the end of each iteration
  – Chares exchange boundaries
  – Determine maximum change in computation
• Output result when threshold is reached
Arrays as Parameters
• An array cannot be passed as a pointer
• Specify the length of the array in the interface file
  – entry void bar(int n, double arr[n])
Jacobi Code

    void Ar1::doWork(int sendersID, int n, double arr[n]) {
      maxChange = 0.0;
      if (sendersID == thisIndex-1) {
        leftmsg = 1;   // set boolean to indicate we received the left message
      } else if (sendersID == thisIndex+1) {
        rightmsg = 1;  // set boolean to indicate we received the right message
      }
      // Rest of the code on the next slide
      …
    }
Reduction
• Like a barrier in MPI
• Apply a single operation (add, max, min, ...) to data items scattered across many processors
• Collect the result in one place
• Reduce x across all elements
  – contribute(sizeof(x), &x, CkReduction::sum_int, processResult);
  – Function “processResult()”
• All contribute calls from one array must name the same function
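The contribute-then-callback pattern can be mimicked in plain C++ to show when the result function runs. This sketch makes several assumptions: `MaxReduction` is a hypothetical class, the operation is fixed to max, and everything happens in one process, whereas a real reduction combines values across processors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical stand-in for a max reduction over a fixed number of
// contributors; the client callback runs once, after the last contribution.
class MaxReduction {
public:
    MaxReduction(std::size_t expected, std::function<void(double)> client)
        : expected_(expected), client_(std::move(client)) {
        assert(expected_ > 0);
    }

    void contribute(double value) {
        assert(received_ < expected_);
        current_ = (received_ == 0) ? value : std::max(current_, value);
        ++received_;
        if (received_ == expected_) {
            received_ = 0;      // ready for the next reduction round
            client_(current_);  // deliver the combined result in one place
        }
    }

private:
    std::size_t expected_;
    std::size_t received_ = 0;
    double current_ = 0.0;
    std::function<void(double)> client_;
};
```

No contributor sees the result until every element has contributed, which is why the slide compares a reduction to a barrier.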
Jacobi Code Continued

    void Ar1::doWork(int sendersID, int n, double arr[n]) {
      // Code on previous slide
      …
      if (((rightmsg == 1) && (leftmsg == 1)) ||
          ((thisIndex == 0) && (rightmsg == 1)) ||
          ((thisIndex == K-1) && (leftmsg == 1))) {
        // Both messages have been received and we can now compute the
        // new values of the matrix
        …
        // Use a reduction to determine whether the maximum change on
        // every processor is below our threshold value.
        contribute(sizeof(double), &maxChange, CkReduction::max_double, cb);
      }
    }
Structured Dagger
• What is it?
  – A coordination language built on top of Charm++
• Motivation:
  – To reduce the complexity of program development without adding any overhead
Structured Dagger Constructs
• atomic {code}
  – Specifies that no structured dagger constructs appear inside the code, so it executes atomically.
• overlap {code}
  – Enables all of its component constructs concurrently; they can execute in any order.
• when <entrylist> {code}
  – Specifies dependencies between computation and message arrival.
Structured Dagger Constructs Continued
• if / else / while / for
  – These are the same as their C++ counterparts, except that they can contain when blocks in their respective code segments. Hence execution can be suspended while they wait for messages.
• forall
  – Functions like a for statement, but enables its component constructs for its entire iteration space at once. As a result it doesn't need to execute its iteration space in strict sequence.
Jacobi Example Using Structured Dagger

jacobi.ci:
    array [1D] Ar1 {
      …
      entry void GetMessages(MyMsg *msg) {
        when rightmsgEntry(MyMsg *right), leftmsgEntry(MyMsg *left) {
          atomic {
            CkPrintf("Got both left and right messages\n");
            doWork(right, left);
          }
        }
      };
      entry void rightmsgEntry(MyMsg *m);
      entry void leftmsgEntry(MyMsg *m);
      …
    };

In a 1D Jacobi that doesn't use structured dagger, the code in the .C file for the doWork function is much more complex: it has to check manually, with if/else statements, whether both messages have been received. With structured dagger, the doWork function is not called until both messages have been received. The compiler translates the structured dagger code into code that does the appropriate checks, making the programmer's job simpler.
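For comparison, the bookkeeping that the when construct spares the programmer might look like the following plain C++ sketch. It is not the code the Charm++ translator actually generates; `BothNeighbors` is a hypothetical class, and doubles stand in for the MyMsg pointers.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical hand-written equivalent of
//   when rightmsgEntry(right), leftmsgEntry(left) { doWork(right, left); }
// Each arrival is buffered; the work runs only once both are present.
class BothNeighbors {
public:
    explicit BothNeighbors(std::function<void(double, double)> doWork)
        : doWork_(std::move(doWork)) {
        assert(static_cast<bool>(doWork_));
    }

    void rightmsgEntry(double value) {
        right_ = value;
        haveRight_ = true;
        tryFire();
    }

    void leftmsgEntry(double value) {
        left_ = value;
        haveLeft_ = true;
        tryFire();
    }

private:
    void tryFire() {
        if (haveRight_ && haveLeft_) {
            haveRight_ = false;  // clear the flags for the next iteration
            haveLeft_ = false;
            doWork_(right_, left_);
        }
    }

    std::function<void(double, double)> doWork_;
    double right_ = 0.0;
    double left_ = 0.0;
    bool haveRight_ = false;
    bool haveLeft_ = false;
};
```

The two flags and the `tryFire` check are exactly the kind of manual state that grows with every additional dependency, which is the complexity the slide refers to.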
Screenshots – Load Imbalance
Jacobi 2048 x 2048, threshold 0.1, 32 chares, 4 processors
[Figure: screenshot showing load imbalance]
Timelines – Load Imbalance
Migration
• Array objects can migrate from one PE to another
• To migrate, must implement pack/unpack or pup method
• pup combines 3 functions into one
  – Data structure traversal: compute message size, in bytes
  – Pack: write object into message
  – Unpack: read object out of message
• Basic contract: here are my fields (types, sizes and a pointer)
Pup – How to write it?

    class foo {
      double a;
      int x;
      char y;
      unsigned long z;
      float q[3];
      int *r;  // heap allocated memory
    public:
      … other methods …
      void pup(PUP::er &p) {
        p(a);
        p(x);
        p(y);
        p(z);
        p(q, 3);
        if (p.isUnpacking())
          r = new int[ARRAY_SIZE];
        p(r, ARRAY_SIZE);
      }
    };
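How one routine can serve sizing, packing, and unpacking is easier to see with a toy implementation. The sketch below is a plain C++ illustration only: `Pupper` is a hypothetical, heavily simplified stand-in for PUP::er that handles only trivially copyable fields, and `Foo` is a cut-down version of the class above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical, much-simplified stand-in for PUP::er. Each Pupper is in
// exactly one mode: sizing, packing, or unpacking.
class Pupper {
public:
    enum class Mode { Sizing, Packing, Unpacking };

    explicit Pupper(Mode mode, std::vector<char> *buffer = nullptr)
        : mode_(mode), buffer_(buffer) {
        assert(mode_ == Mode::Sizing || buffer_ != nullptr);
    }

    bool isUnpacking() const { return mode_ == Mode::Unpacking; }
    std::size_t size() const { return offset_; }

    // Visit `count` trivially copyable values starting at `data`.
    template <typename T>
    void operator()(T *data, std::size_t count) {
        const std::size_t bytes = sizeof(T) * count;
        if (mode_ == Mode::Packing) {
            buffer_->resize(offset_ + bytes);
            std::memcpy(buffer_->data() + offset_, data, bytes);
        } else if (mode_ == Mode::Unpacking) {
            assert(offset_ + bytes <= buffer_->size());
            std::memcpy(data, buffer_->data() + offset_, bytes);
        }
        offset_ += bytes;  // sizing mode only counts bytes
    }

    // Visit a single value.
    template <typename T>
    void operator()(T &value) { (*this)(&value, 1); }

private:
    Mode mode_;
    std::vector<char> *buffer_;
    std::size_t offset_ = 0;
};

struct Foo {
    double a = 0.0;
    int x = 0;
    float q[3] = {0.0f, 0.0f, 0.0f};

    // The same routine describes the fields for all three modes.
    void pup(Pupper &p) {
        p(a);
        p(x);
        p(q, 3);
    }
};
```

Calling `pup` with a sizing Pupper yields the byte count, with a packing Pupper it fills the buffer, and with an unpacking Pupper it restores the fields, so the field list is written only once.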
Load Balancing
• All you need is a working pup
• Link a LB module
  – -module <strategy>
  – CommLB, GreedyRefLB, HeapCentLB, MetisLB, NeighborLB
• Runtime option
  – +balancer CommLB
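The slide names the strategies without describing them. As a rough illustration of what a centralized greedy strategy could do, here is a plain C++ sketch of one common heuristic: place the heaviest object first, always on the least-loaded processor. This is an assumption made for illustration and is not taken from any Charm++ load balancer; `greedyAssign` is a hypothetical function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <queue>
#include <utility>
#include <vector>

// Assign each object (given its measured load) to a processor:
// heaviest object first, always onto the currently least-loaded processor.
// Returns assignment[i] = processor chosen for object i.
std::vector<std::size_t> greedyAssign(const std::vector<double> &loads,
                                      std::size_t numProcs) {
    assert(numProcs > 0);

    // Object indices sorted by decreasing load.
    std::vector<std::size_t> order(loads.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return loads[a] > loads[b];
                     });

    // Min-heap of (current load, processor id).
    using Entry = std::pair<double, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (std::size_t p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<std::size_t> assignment(loads.size());
    for (std::size_t obj : order) {
        Entry least = procs.top();
        procs.pop();
        assignment[obj] = least.second;
        least.first += loads[obj];
        procs.push(least);
    }
    return assignment;
}
```

For loads 4, 3, 2, 1 on two processors this places objects 0 and 3 on processor 0 and objects 1 and 2 on processor 1, giving each a total load of 5. Because there are many more objects than processors, the system has the freedom to even out load this way.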
When to Re-balance Load?
• Default: load balancer will migrate when needed
• Programmer control
  – AtSync method: enable load balancing at a specific point
    · Object ready to migrate
    · Re-balance if needed
    · AtSync(), ResumeFromSync()
  – Manual trigger: specify when to do load balancing
    · All objects ready to migrate
    · Re-balance now
    · TurnManualLBOn(), StartLB()
Processor Utilization: After Load Balance
Timelines: Before and After Load Balancing
Other tools: LiveViz
LiveViz – What is it?
• Charm++ library
• Visualization tool
• Inspect your program's current state
• Client runs on any machine (Java)
• You code the image generation
• 2D and 3D modes
LiveViz – Monitoring Your Application
• LiveViz allows you to watch your application's progress
• Can use it from work or home
• Doesn't slow down computation when there is no client
LiveViz – Compilation
• Compile the LiveViz library itself
  – Must have built Charm++ first!
• From the charm directory, run:
  – cd tmp/libs/ck-libs/liveViz
  – make
Running LiveViz
• Build and run the server
  – cd pgms/charm++/ccs/liveViz/serverpush
  – make
  – ./run_server
• Or in detail…
Running LiveViz
• Run the client
  – cd pgms/charm++/ccs/liveViz/client
  – ./run_client [<host> [<port>]]
• Should get a result window:
[Figure: client result window]
LiveViz Request Model
[Figure: request flow between Client, LiveViz Server Code, and Parallel Application. Labels: Get Image, Request Passed to Server, Buffer Request, Poll for Request, Poll Returns Work, Image Chunk, Server Combines Image Chunks, Send Image to Client]
Jacobi 2D Example Structure
Main: set up the worker array, pass data to the workers
Workers: start looping
  – Send messages to all neighbors with ghost rows
  – Wait for all neighbors to send ghost rows to me
  – Once they arrive, do the regular Jacobi relaxation
  – Calculate maximum error, do a reduction to compute global maximum error
  – If timestep is a multiple of 64, load balance the computation. Then restart the loop.
LiveViz Setup

    #include <liveVizPoll.h>

    main::main(. . .) {
      // Do misc initialization stuff
      // Create the workers and register with liveViz
      CkArrayOptions opts(0);         // By default allocate 0 array elements.
      liveVizConfig cfg(true, true);  // color image = true and animate image = true
      liveVizPollInit(cfg, opts);     // Initialize the library
      // Now create the (empty) jacobi 2D array
      work = CProxy_matrix::ckNew(opts);
      // Distribute work to the array, filling it as you do
    }
Adding LiveViz To Your Code

    void matrix::serviceLiveViz() {
      liveVizPollRequestMsg *m;
      while ((m = liveVizPoll((ArrayElement *)this, timestep)) != NULL) {
        requestNextFrame(m);
      }
    }

    void matrix::startTimeSlice() {
      // Send ghost row north, south, east, west, . . .
      sendMsg(dims.x-2, NORTH, dims.x+1, 1, 1, +0, -1);
      // Now having sent all our ghosts, service liveViz
      // while waiting for neighbor's ghosts to arrive.
      serviceLiveViz();
    }
Generate an Image For a Request

    void matrix::requestNextFrame(liveVizPollRequestMsg *m) {
      // Compute the dimensions of the image bit we'll send
      // Compute the image data of the chunk we'll send:
      // image data is just a linear array of bytes in row-major order.
      // For greyscale it's 1 byte per pixel, for color it's 3 bytes (rgb).
      // The liveViz library routine colorScale(value, min, max, *array)
      // will rainbow-color your data automatically.
      // Finally, return the image data to the library
      liveVizPollDeposit((ArrayElement *)this, timestep, m,
                         loc_x, loc_y, width, height, imageBits);
    }
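The image layout mentioned in the comments (a linear, row-major byte array with 3 bytes per color pixel) can be shown with a small plain C++ sketch. `makeRgbChunk` is a hypothetical helper, and its blue-to-red ramp is a made-up mapping for illustration; it is not the liveViz colorScale routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fill a row-major RGB buffer (3 bytes per pixel) from a width x height
// block of values. The blue-to-red ramp is a made-up mapping used only
// for illustration; it is not liveViz's colorScale.
std::vector<unsigned char> makeRgbChunk(const std::vector<double> &values,
                                        std::size_t width, std::size_t height,
                                        double min, double max) {
    assert(values.size() == width * height);
    assert(max > min);
    std::vector<unsigned char> image(3 * width * height);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            double t = (values[y * width + x] - min) / (max - min);
            if (t < 0.0) t = 0.0;  // clamp out-of-range values
            if (t > 1.0) t = 1.0;
            unsigned char level = static_cast<unsigned char>(t * 255.0 + 0.5);
            std::size_t pixel = 3 * (y * width + x);  // row-major offset
            image[pixel + 0] = level;                                    // red
            image[pixel + 1] = 0;                                        // green
            image[pixel + 2] = static_cast<unsigned char>(255 - level);  // blue
        }
    }
    return image;
}
```

A buffer like this, together with the chunk's position and size, corresponds to the kind of data handed to liveVizPollDeposit in the code above.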
Link With The LiveViz Library

    OPTS=-g
    CHARMC=charmc $(OPTS)
    LB=-module RefineLB
    OBJS = jacobi2d.o

    all: jacobi2d

    jacobi2d: $(OBJS)
    	$(CHARMC) -language charm++ -o jacobi2d $(OBJS) $(LB) -lm -module liveViz

    jacobi2d.o: jacobi2d.C jacobi2d.decl.h
    	$(CHARMC) -c jacobi2d.C
LiveViz Summary
• Easy-to-use visualization library
• Simple code handles any number of clients
• Doesn't slow computation when there are no clients connected
• Works in parallel, with load balancing, etc.
Advanced Features
• Priorities
  – Each message (method invocation) can have a priority
  – Integer or bit vector
• Complex index types for arrays
  – 2D, 3D … 6D
  – User-defined indexes, e.g. oct-tree
• Object groups
  – Similar to arrays but with exactly one element on each processor
  – Find local member via function call
• Synchronous method invocation
Entry Method Attributes
    entry [attribute1, ..., attributeN] void EntryMethod(parameters);
Attributes:
• threaded
  – entry methods which are run in their own non-preemptible threads
• sync
  – methods return a message as a result
Advanced Features: Reductions
• Callbacks
  – transfer control back to a client after a library has finished
  – various pre-defined callbacks, e.g. CkExit the program
• Callbacks in reductions
  – contribute(sizeof(x), &x, CkReduction::sum_int, processResult);
  – myProxy.ckSetReductionClient(new CkCallback(...));
• User-defined reductions
  – performing a user-defined operation on user-defined data
Benefits of Virtualization
• Better software engineering
  – Logical units decoupled from “number of processors”
• Message-driven execution
  – Adaptive overlap between computation and communication
  – Predictability of execution
• Flexible and dynamic mapping to processors
  – Flexible mapping on clusters
  – Change the set of processors for a given job
  – Automatic checkpointing
• Principle of persistence
More Information
• http://charm.cs.uiuc.edu
  – Manuals
  – Papers
  – Download files
  – FAQs
• ppl@cs.uiuc.edu