BigSim Tutorial
Presented by Gengbin Zheng and Eric Bohm
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
Charm++ Workshop 2004

Outline
- Overview
- BigSim Emulator
- Charm++ on the Emulator
- Simulation framework
  - Online mode simulation
  - Post-mortem simulation
  - Network simulation (Eric)
- Performance analysis/visualization

Simulation-based Performance Prediction
- Extremely large parallel machines are being built with enormous compute power
  - Very large numbers of processors, with petaflops-level peak performance
- Are existing software environments ready for these new machines?
  - How do we write a peta-scale parallel application?
  - What will the performance be like? Can these applications scale?

BigSim Objective
- Develop techniques and methods that facilitate the development of efficient peta-scale applications on very large parallel machines
- Based on performance prediction via simulation

Simulation-based Performance Prediction
- Focus on the Charm++ and AMPI programming models
- Performance prediction is based on Parallel Discrete Event Simulation (PDES)
- Simulation is challenging; it aims at different levels of fidelity
  - Processor prediction
  - Network prediction
- Two approaches
  - Direct execution (online mode)
  - Trace-driven (post-mortem mode)

Architecture of BigSim (postmortem mode)
[Figure: architecture diagram. Components: Charm++ and MPI applications; BigSim Emulator; Charm++ runtime; load balancing module; online PDES engine; performance counters; instruction simulator (RSim, IBM, ...); simple network model; simulation output trace logs; offline PDES network simulator BigNetSim (POSE); performance visualization (Projections).]

Outline
- Overview
- BigSim Emulator
- Charm++ on the Emulator
- Simulation framework
  - Online mode simulation
  - Post-mortem simulation
  - Network simulation (Eric)
- Performance analysis/visualization

Emulator
- Emulates the full machine on existing parallel machines
  - Actually runs a parallel program with multi-million-way parallelism
- Started by mimicking the Blue Gene/C low-level API
- Machine layer abstraction
  - Many multiprocessor (SMP) nodes connected via message passing

BigSim Emulator: Functional View
[Figure: one real processor, driven by the Converse scheduler and its Converse queue, hosts several emulated target nodes. Each target node contains communication processors, worker processors, an incoming buffer (inBuff), affinity message queues, non-affinity message queues, and a correction queue (CorrectionQ).]

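To make the queue structure in this figure concrete, here is a small C++ model of how one emulated node might route messages. It is a sketch under our own assumptions, not BigSim source: the names EmuMsg and EmulatedNode are hypothetical, and we assume that a message naming a worker thread goes to that worker's affinity queue, while any other message goes to a node-wide non-affinity queue that any worker may drain. The correction queue is omitted.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical message record: which handler to run and, optionally,
// which worker thread must run it.
struct EmuMsg {
  int handlerId;
  int threadId;  // worker thread this message is bound to; -1 = any worker
};

class EmulatedNode {
 public:
  explicit EmulatedNode(int numWorkers) : affinity_(numWorkers) {
    assert(numWorkers > 0);
  }

  // Role of a communication thread: take a message from inBuff and queue it.
  void deliver(const EmuMsg& m) {
    if (m.threadId >= 0) {
      assert(m.threadId < static_cast<int>(affinity_.size()));
      affinity_[m.threadId].push_back(m);  // bound to one worker
    } else {
      nonAffinity_.push_back(m);           // any worker may run it
    }
  }

  // Role of worker w: drain its own affinity queue first, then shared work.
  bool nextFor(int w, EmuMsg* out) {
    std::deque<EmuMsg>* q = &affinity_[w];
    if (q->empty()) q = &nonAffinity_;
    if (q->empty()) return false;
    *out = q->front();
    q->pop_front();
    return true;
  }

 private:
  std::vector<std::deque<EmuMsg>> affinity_;  // one queue per worker thread
  std::deque<EmuMsg> nonAffinity_;            // shared by the node's workers
};
```

The split matters because a message bound to a thread (for example, one resuming a specific user-level thread) cannot be picked up by an arbitrary worker, whereas unbound work can be load-shared.
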
BigSim Programming API
- Machine initialization
  - Set/get machine configuration
  - Get node ID: (x, y, z)
- Message passing
  - Register handler functions on a node
  - Send packets to other nodes (x, y, z) with a handler ID

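The register-then-send pattern above can be illustrated with a standalone toy. ToyMachine, registerHandler and sendPacket are hypothetical stand-ins, not BgRegisterHandler()/BgSendPacket(), and there is no real network or parallelism here; the point is only that a packet carries a destination (x, y, z) plus a handler ID, and delivery means looking up and calling that handler.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// A handler receives the destination coordinates and a small payload.
using Handler = std::function<void(int x, int y, int z, int payload)>;

class ToyMachine {
 public:
  // Store the handler and return its ID (its index in the table).
  int registerHandler(Handler h) {
    handlers_.push_back(h);
    return static_cast<int>(handlers_.size()) - 1;
  }

  // Queue a packet addressed to node (x, y, z) naming a handler ID.
  void sendPacket(int x, int y, int z, int handlerId, int payload) {
    pending_.push(Packet{x, y, z, handlerId, payload});
  }

  // Deliver packets in FIFO order by invoking the named handler.
  // Handlers may send further packets; they are delivered in turn.
  void run() {
    while (!pending_.empty()) {
      Packet p = pending_.front();
      pending_.pop();
      assert(p.handlerId >= 0 &&
             p.handlerId < static_cast<int>(handlers_.size()));
      handlers_[p.handlerId](p.x, p.y, p.z, p.payload);
    }
  }

 private:
  struct Packet { int x, y, z, handlerId, payload; };
  std::vector<Handler> handlers_;
  std::queue<Packet> pending_;
};
```
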
User's API
- BgEmulatorInit(), BgNodeStart()
- BgGetXYZ()
- BgGetSize(), BgSetSize()
- BgGetNumWorkThread(), BgSetNumWorkThread()
- BgGetNumCommThread(), BgSetNumCommThread()
- BgGetNodeData(), BgSetNodeData()
- BgGetThreadID(), BgGetGlobalThreadID()
- BgGetTime()
- BgRegisterHandler()
- BgSendPacket(), etc.
- BgShutdown()

Examples
- charm/examples/bigsim/emulator
  - ring
  - jacobi3D
  - maxReduce
  - prime
  - octo
  - line
  - littleMD

BigSim Application Example: Ring

    // Every emulator message starts with the emulator's message header.
    typedef struct {
      char core[CmiBlueGeneMsgHeaderSizeBytes];
      int data;
    } RingMsg;

    // iter, MAXITER, passRingID (the handler ID for passRing) and nextxyz()
    // are defined elsewhere in the ring example (not shown on this slide).
    void BgNodeStart(int argc, char **argv) {
      int x, y, z, nx, ny, nz;
      BgGetXYZ(&x, &y, &z);
      nextxyz(x, y, z, &nx, &ny, &nz);    // next node (nx, ny, nz) in the ring
      if (x == 0 && y == 0 && z == 0) {   // node (0,0,0) starts the ring
        RingMsg *msg = new RingMsg;
        msg->data = 888;
        BgSendPacket(nx, ny, nz, passRingID, LARGE_WORK, sizeof(RingMsg), (char *)msg);
      }
    }

    // Handler: forward the message to the next node; node (0,0,0) counts laps.
    void passRing(char *msg) {
      int x, y, z, nx, ny, nz;
      BgGetXYZ(&x, &y, &z);
      nextxyz(x, y, z, &nx, &ny, &nz);
      if (x == 0 && y == 0 && z == 0) {
        if (++iter == MAXITER) BgShutdown();
      }
      BgSendPacket(nx, ny, nz, passRingID, LARGE_WORK, sizeof(RingMsg), msg);
    }

Emulator Compilation
- Emulator libraries are implemented on top of the Converse/machine layer:
  - libconv-bluegene.a
  - libconv-bluegene-logs.a
- Compile normal Charm++ with the "bluegene" target
  - ./build bluegene net-linux
- Compile an application with the emulator API
  - charmc -o ring.C -language bluegene

Execute Application on the Emulator
- Define the machine configuration
  - Function API
    - BgSetSize(x, y, z), BgSetNumWorkThread(), BgSetNumCommThread()
  - Command line options
    - +x +y +z +cth +wth
    - E.g. charmrun +p 4 ring +x 10 +y 10 +z 10 +cth 2 +wth 4
  - Config file
    - +bgconfig

Running with a bgconfig File
- +bgconfig ./bg_config

    x 10
    y 10
    z 10
    cth 2
    wth 4
    stacksize 4000
    timing walltime
    #timing bgelapse
    #timing counter
    #cpufactor 1.0
    fpfactor 5e-7
    traceroot /tmp
    log yes
    correct no
    network bluegene

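If you generate or check bgconfig files with your own tooling, a reader for this layout is short. The C++ sketch below assumes one "key value" pair per line and treats '#' as commenting out the rest of the line, which matches the sample above; it is not the parser Charm++ uses, and parseBgConfig is a hypothetical name.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Read "key value" lines; '#' disables the rest of a line, blank or
// incomplete lines are skipped, and a later key overrides an earlier one.
std::map<std::string, std::string> parseBgConfig(std::istream& in) {
  std::map<std::string, std::string> cfg;
  std::string line;
  while (std::getline(in, line)) {
    std::string::size_type hash = line.find('#');
    if (hash != std::string::npos) line.erase(hash);  // drop commented text
    std::istringstream fields(line);
    std::string key, value;
    if (fields >> key >> value) cfg[key] = value;
  }
  assert(in.eof());  // we stop only at end of input
  return cfg;
}
```

With the file above, "timing" maps to "walltime" and the commented-out "#cpufactor 1.0" line contributes nothing.
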
Ring Output

    clarity> ./ring 2 2 2
    Charm++: standalone mode (not using charmrun)
    BG info> Simulating 2 x 2 x 2 nodes with 2 comm + 2 work threads each.
    BG info> Network type: bluegene.
    alpha: 1.000000e-07 packetsize: 1024 CYCLE_TIME_FACTOR: 1.000000e-03.
    CYCLES_PER_HOP: 5 CYCLES_PER_CORNER: 75.
    0 0 0 => 0 0 1 => 0 1 0 => 0 1 1 => 1 0 0 => 1 0 1 => 1 1 0 => 1 1 1 => 0 0 0
    BG> BlueGene emulator shutdown gracefully!
    BG> Emulation took 0.000265 seconds!
    Program finished.

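The visiting order printed above (z varies fastest, then y, then x, wrapping from 1 1 1 back to 0 0 0) is exactly what a row-major successor function produces. The helper below is a hypothetical reimplementation of that idea with the machine size passed explicitly; the ring example's own nextxyz() is not shown on these slides and may be written differently.

```cpp
#include <cassert>

// Successor of node (x, y, z) on an sx x sy x sz machine in row-major
// order (z fastest), wrapping from the last node back to (0, 0, 0).
void nextxyzSketch(int x, int y, int z, int sx, int sy, int sz,
                   int* nx, int* ny, int* nz) {
  assert(sx > 0 && sy > 0 && sz > 0);
  int linear = (x * sy + y) * sz + z;        // rank of (x, y, z)
  int next = (linear + 1) % (sx * sy * sz);  // successor, with wraparound
  *nz = next % sz;
  *ny = (next / sz) % sy;
  *nx = next / (sy * sz);
}
```
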
Outline
- Overview
- BigSim Emulator
- Charm++ on the Emulator
- Simulation framework
  - Online mode simulation
  - Post-mortem simulation
  - Network simulation (Eric)
- Performance analysis/visualization

BigSim Charm++/AMPI
- Need a high-level programming language such as Charm++/AMPI
- Charm++/AMPI are implemented on top of the BigSim emulator, using it as another machine layer
- Supported frameworks and libraries
  - Load balancing framework
  - Communication optimization library (comlib)
  - FEM
  - Multiphase Shared Array (MSA)

BigSim Charm++
[Figure: software layer stack, top to bottom: NS Selector; Charm++; BGConverse; Emulator; Converse; UDP/TCP, MPI, Myrinet, etc.]

Build Charm++ on BigSim
- Compile Charm++ on top of the BigSim emulator
  - Build option "bluegene"
  - E.g.
    - Charm++: ./build bluegene net-linux bluegene
    - AMPI: ./build bgampi net-linux bluegene

Running Charm++/AMPI Applications
- Compile Charm++/AMPI applications
  - Same as normal Charm++/AMPI
  - Just use charm/net-linux-bluegene/bin/charmc
- Running
  - Same as running BigSim applications on the emulator:
    - Use command line options, or
    - Use a bgconfig file

Example: AMPI Cjacobi3D
- cd charm/net-linux-bluegene/pgms/charm++/ampi/Cjacobi3D
- Make
  - charmc -o jacobi.o -language ampi -module EveryLB

    ./charmrun +p 2 ./jacobi 2 2 2 +vp 8 +bgconfig ~/bg_config +balancer GreedyLB +LBDebug 1
    [0] GreedyLB created
    iter 1 time: 1.022634 maxerr: 2020.200000
    iter 2 time: 0.814523 maxerr: 1696.968000
    iter 3 time: 0.787009 maxerr: 1477.170240
    iter 4 time: 0.825189 maxerr: 1319.433024
    iter 5 time: 1.093839 maxerr: 1200.918072
    iter 6 time: 0.791372 maxerr: 1108.425519
    iter 7 time: 0.823002 maxerr: 1033.970839
    iter 8 time: 0.818859 maxerr: 972.509242
    iter 9 time: 0.826524 maxerr: 920.721889
    iter 10 time: 0.832437 maxerr: 876.344030
    [GreedyLB] Load balancing step 0 starting at 11.647364 in PE 0
    n_obj: 8 migratable: 8 ncom: 24
    GreedyLB: 5 objects migrating.
    [GreedyLB] Load balancing step 0 finished at 11.777964
    [GreedyLB] duration 0.130599 s
    memUsage: LBManager: 800 KB CentralLB: 0 KB
    iter 11 time: 1.627869 maxerr: 837.779089
    iter 12 time: 0.951551 maxerr: 803.868831
    iter 13 time: 0.960144 maxerr: 773.751705
    iter 14 time: 0.952085 maxerr: 746.772667
    iter 15 time: 0.956356 maxerr: 722.424056
    iter 16 time: 0.965365 maxerr: 700.305763
    iter 17 time: 0.947866 maxerr: 680.097726
    iter 18 time: 0.957245 maxerr: 661.540528
    iter 19 time: 0.961152 maxerr: 644.421422
    iter 20 time: 0.960874 maxerr: 628.564089
    BG> BlueGene emulator shutdown gracefully!
    BG> Emulation took 36.762261 seconds!

Outline
- Overview
- BigSim Emulator
- Charm++ on the Emulator
- Simulation framework
  - Online mode simulation
  - Post-mortem simulation
  - Network simulation
- Performance analysis/visualization

Performance Prediction
- How do we predict performance?
  - Different levels of fidelity
  - Processor model:
    - User-supplied timing expression
    - Wallclock time
    - Performance counters
    - Instruction-level simulation (not supported yet)
  - Network model:
    - Simple latency-based network model
    - Contention-based network simulation

How to Ensure Simulation Accuracy
- The idea:
  - Take advantage of the inherent determinacy of the application
  - No rollback is needed, so each user function is executed only once
  - In case of out-of-order delivery, only the timestamps of events are adjusted

Timestamp Correction (Jacobi 1D)
[Figure: three timelines for one processor. Original timeline: events at T(e2) and T(e1). Incorrect updated timeline: T(e2) and T''(e1). Correct updated timeline: T(e2) and T'''(e1). Legend: getStripFromRight (e1), getStripFromLeft (e2), doWork.]

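The following C++ sketch illustrates the correction idea on one target processor in a simplified form of our own, not BigSim's algorithm: handlers are never re-run and their recorded durations are kept; only start and end timestamps are recomputed from corrected arrival times. It assumes the two `when` clauses of an `overlap` block are independent, so they are laid out in order of corrected arrival, and that doWork depends on both. LoggedEvent and correctOverlap are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct LoggedEvent {
  const char* name;
  double recvTime;    // corrected arrival time of the triggering message
  double duration;    // execution time recorded during emulation
  double start, end;  // filled in by correction
};

// Place independent events in order of corrected arrival, then put the
// dependent event after all of them. Returns the dependent event's end time.
double correctOverlap(std::vector<LoggedEvent>& overlap,
                      LoggedEvent& dependent, double clock) {
  assert(dependent.duration >= 0.0);
  std::sort(overlap.begin(), overlap.end(),
            [](const LoggedEvent& a, const LoggedEvent& b) {
              return a.recvTime < b.recvTime;
            });
  for (LoggedEvent& e : overlap) {
    // An event starts once the processor is free and its message has arrived.
    e.start = std::max(clock, e.recvTime);
    e.end = e.start + e.duration;
    clock = e.end;
  }
  dependent.start = clock;
  dependent.end = clock + dependent.duration;
  return dependent.end;
}
```

For example, if the left strip's message is corrected to arrive at t = 2 and the right strip's at t = 5 (each handler taking 1 time unit), the left handler runs over [2, 3], the right over [5, 6], and a 4-unit doWork over [6, 10].
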
Structured Dagger (Jacobi 1D)

    entry void jacobiLifeCycle() {
      for (i=0; i<MAX_ITER; i++) {
        atomic { sendStripToLeftAndRight(); }
        overlap {
          when getStripFromLeft(Msg *leftMsg) {
            atomic { copyStripFromLeft(leftMsg); }
          }
          when getStripFromRight(Msg *rightMsg) {
            atomic { copyStripFromRight(rightMsg); }
          }
        }
        atomic { doWork(); /* Jacobi Relaxation */ }
      }
    }

Sequential Time: BgElapse
- BgElapse() declares the sequential cost of a block explicitly; here doWork() is charged 10e-3 instead of being measured

    entry void jacobiLifeCycle() {
      for (i=0; i<MAX_ITER; i++) {
        atomic { sendStripToLeftAndRight(); }
        overlap {
          when getStripFromLeft(Msg *leftMsg) {
            atomic { copyStripFromLeft(leftMsg); }
          }
          when getStripFromRight(Msg *rightMsg) {
            atomic { copyStripFromRight(rightMsg); }
          }
        }
        atomic { doWork(); BgElapse(10e-3); }
      }
    }

Sequential Time: Using Wallclock
- Wallclock measurement of the time can be used via a suitable multiplier (scale factor)
- Run the application with +bgwalltime and +bgcpufactor, or
- +bgconfig ./bgconfig, with:

    timing walltime
    cpufactor 0.7

- Good for predicting a larger machine using a fraction of the machine

Sequential Time: Performance Counters
- Count floating-point, integer, memory and branch instructions (for example) with hardware counters
  - With a simple heuristic, use the expected time of each of these operations on the target machine to give the predicted total computation time. Cache performance and memory footprint effects can be approximated by the percentage of memory accesses and the cache hit/miss ratio.
- Perfex and PAPI are supported
- Example of use, for a floating-point-intensive code: +bgconfig ./bg_config, with:

    timing counter
    fpfactor 5e-7

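The three ways of charging sequential time described on the last three slides can be summarized as simple arithmetic. This C++ sketch is an illustration under stated assumptions: BgElapse mode uses the programmer's declared value, walltime mode multiplies measured host time by cpufactor, and counter mode is reduced to floating-point operations times fpfactor (read here as seconds per operation on the target). The real counter model also weighs integer, memory and branch counts. TimingMode, HandlerProfile and predictedSeconds are hypothetical names.

```cpp
#include <cassert>

enum class TimingMode { BgElapse, WallTime, Counter };

struct HandlerProfile {
  double declaredElapse;  // value passed to BgElapse() by the programmer
  double measuredWall;    // wall-clock seconds measured on the host
  double fpOps;           // floating-point operations from hardware counters
};

// Predicted sequential time of one handler on the target machine.
double predictedSeconds(TimingMode mode, const HandlerProfile& p,
                        double cpufactor, double fpfactor) {
  assert(cpufactor > 0.0 && fpfactor > 0.0);
  switch (mode) {
    case TimingMode::BgElapse: return p.declaredElapse;
    case TimingMode::WallTime: return p.measuredWall * cpufactor;
    case TimingMode::Counter:  return p.fpOps * fpfactor;
  }
  return 0.0;  // unreachable for valid modes
}
```

With cpufactor 0.7, a handler measured at 2 ms on the host is charged 1.4 ms; with fpfactor 5e-7, 4000 floating-point operations are charged 2 ms.
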
Simple Network Model
- No contention modeling
  - Latency only
- Built-in and topology-based network models for
  - Quadrics (Lemieux)
  - Blue Gene/C
  - Blue Gene/L

Choose the Network Model at Run Time
- Command line option:
  - +bgnetwork
- BigSim config file: +bgconfig ./bg_config, containing:

    network bluegenel

How to Add a New Network Model
- Inherit from this base class, defined in blue_network.h:

    class BigSimNetwork {
    protected:
      double alpha;   // cpu overhead of sending a message
      char *myname;   // name of this network
    public:
      inline double alphacost() { return alpha; }
      inline char *name() { return myname; }
      virtual double latency(int ox, int oy, int oz, int nx, int ny, int nz, int bytes) = 0;
      virtual void print() = 0;
    };

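A minimal sketch of a new model, assuming a contention-free 3D mesh where latency is a per-hop delay times the Manhattan distance, plus serialization time at a fixed bandwidth. SimpleMeshNetwork and its parameters are hypothetical, not one of BigSim's built-in models, and how a new class is hooked into network selection is not covered here. The base class is repeated from the slide so the block compiles on its own; we add a virtual destructor, which a polymorphic base normally needs.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Base class as on the slide, plus a virtual destructor.
class BigSimNetwork {
 protected:
  double alpha;  // cpu overhead of sending a message
  char *myname;  // name of this network
 public:
  virtual ~BigSimNetwork() {}
  inline double alphacost() { return alpha; }
  inline char *name() { return myname; }
  virtual double latency(int ox, int oy, int oz, int nx, int ny, int nz,
                         int bytes) = 0;
  virtual void print() = 0;
};

// Hypothetical contention-free mesh: hops * hopDelay + bytes / bandwidth.
class SimpleMeshNetwork : public BigSimNetwork {
  double hopDelay;   // seconds per hop
  double bandwidth;  // bytes per second
 public:
  SimpleMeshNetwork(double alpha_, double hopDelay_, double bandwidth_)
      : hopDelay(hopDelay_), bandwidth(bandwidth_) {
    assert(bandwidth_ > 0.0);
    alpha = alpha_;
    myname = const_cast<char *>("simplemesh");  // never modified
  }
  double latency(int ox, int oy, int oz, int nx, int ny, int nz,
                 int bytes) override {
    int hops = std::abs(ox - nx) + std::abs(oy - ny) + std::abs(oz - nz);
    return hops * hopDelay + bytes / bandwidth;
  }
  void print() override {
    std::printf("%s: alpha=%g hopDelay=%g bandwidth=%g\n", myname, alpha,
                hopDelay, bandwidth);
  }
};
```

For example, with a 50 ns hop delay and 1 GB/s links, a 1000-byte message from (0,0,0) to (1,2,3) takes 6 x 50 ns + 1000 ns = 1.3 microseconds; the sending CPU overhead is reported separately through alphacost().
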
How to Obtain Predicted Time
- BgGetTime()
  - Printing its value to stdout is not actually useful,
  - because the time printed during execution is not final.
  - The final timestamp can only be obtained after timestamp correction (simulation) finishes.

How to Obtain Predicted Time (cont.)
- BgPrint(char *)
  - Bookmarks events
  - E.g. BgPrint("start at %f\n");
  - Output goes to bgPrintFile.0 when the simulation finishes
    - The simulator looks back at these bookmarks
    - and replaces "%f" with the committed time

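The substitution step can be pictured as formatting the stored BgPrint() string with the committed time once correction is done. renderBookmark is a hypothetical helper, not BigSim code; the "[pe] " prefix is an assumption that mirrors the lines of bgPrintFile.0 in the Cjacobi3D example.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Format a stored bookmark string (expected to contain one %f) with the
// event's committed time, prefixed by the processor number.
std::string renderBookmark(int pe, const char* fmt, double committedTime) {
  assert(fmt != nullptr);
  char body[256];
  std::snprintf(body, sizeof body, fmt, committedTime);
  return "[" + std::to_string(pe) + "] " + body;
}
```
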
Running Applications with the Simulator
- Two modes
  - With the simple network model (timestamp correction)
    - +bgcorrect
  - Partial prediction only (no timestamp correction)
    - +bglog
    - Generates trace logs for post-mortem simulation

With bgconfig
- +bgconfig ./bg_config

    x 64
    y 32
    z 32
    cth 1
    wth 1
    stacksize 4000
    timing walltime
    #timing bgelapse
    #timing counter
    cpufactor 1.0
    #fpfactor 5e-7
    traceroot /tmp
    log yes
    correct no
    network bluegene

BigSim Trace Log
- Execution of messages on each target processor is stored in trace logs (binary format)
  - Named bgTrace[#], where # is the simulating processor number
- Can be used for
  - Visualization/performance study
  - Post-mortem simulation with different network models
- Loadlog tool
  - Converts the binary format to human-readable ASCII
  - charm/examples/bigsim/tools/loadlog

ASCII Log Sample

    [22] 0x80a7a60 name: msgep (srcnode: 0 msgID: 21) ep: 1
    [[ recvtime: 0.000498 startTime: 0.000498 endTime: 0.000498 ]]
    backward:
    forward: [0x80a7af0 23]
    [23] 0x80a7af0 name: Chunk_atomic_0 (srcnode: -1 msgID: -1) ep: 0
    [[ recvtime: -1.000000 startTime: 0.000498 endTime: 0.000503 ]]
    msgID: 3 sent: 0.000498 recvtime: 0.000499 dstPe: 7 size: 208
    msgID: 4 sent: 0.000500 recvtime: 0.000501 dstPe: 1 size: 208
    backward: [0x80a7a60 22]
    forward: [0x80a7ca8 24]
    [24] 0x80a7ca8 name: Chunk_overlap_0 (srcnode: -1 msgID: -1) ep: 0
    [[ recvtime: -1.000000 startTime: 0.000503 endTime: 0.000503 ]]
    backward: [0x80a7af0 23]
    forward: [0x80a7dc8 25] [0x80a8170 28]

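When post-processing ASCII logs like the sample above, a useful quantity is each event's execution time, endTime - startTime. This C++ sketch assumes the "key: value" spelling seen in the sample; readField and eventDuration are hypothetical helpers, and the full loadlog format may carry fields not shown here.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Parse the number that follows `key` (e.g. "startTime:") in `line`.
// Returns false if the key is absent or no number follows it.
bool readField(const char* line, const char* key, double* value) {
  assert(line != nullptr && key != nullptr);
  const char* at = std::strstr(line, key);
  if (at == nullptr) return false;
  const char* numStart = at + std::strlen(key);
  char* numEnd = nullptr;
  *value = std::strtod(numStart, &numEnd);  // skips leading whitespace
  return numEnd != numStart;
}

// Execution time of the event whose timing fields appear on this line.
bool eventDuration(const char* line, double* seconds) {
  double start = 0.0, end = 0.0;
  if (!readField(line, "startTime:", &start) ||
      !readField(line, "endTime:", &end))
    return false;
  *seconds = end - start;
  return true;
}
```

Applied to event [23] above, this gives 0.000503 - 0.000498, i.e. about 5 microseconds for Chunk_atomic_0.
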
Example (AMPI Cjacobi3D)
- cd charm/examples/ampi/Cjacobi3D
- Make
- bgconfig:

    x 2
    y 2
    z 1
    cth 1
    wth 1
    stacksize 10000
    timing walltime
    #timing bgelapse
    #timing counter
    cpufactor 1.0
    traceroot .
    log yes
    correct yes
    network lemieux
    #projections 2,4-8

Output (using BgPrint)

    ./charmrun +p 3 jacobi 2 2 2 10 +vp 8 +bgconfig ./bg_config +bgelapse
    Reading Bluegene Config file ./bg_config...
    BG info> Simulating 2 x 2 x 1 nodes with 1 comm + 1 work threads each.
    BG info> Network type: lemieux.
    bandwidth: 2.560000e+08; alpha: 8.000000e-06.
    BG info> cpufactor is 1.000000.
    BG info> BG stack size: 10000 bytes.
    BG info> Using BgElapse for timing method.
    BG info> Generating timing log.
    BG info> Perform timestamp correction.
    BG info> bgTrace root is .//.
    interation starts at 0.000235
    interation starts at 0.000790
    interation starts at 0.001347
    interation starts at 0.001903
    interation starts at 0.002459
    interation starts at 0.003015
    interation starts at 0.003572
    interation starts at 0.004128
    interation starts at 0.004685
    interation starts at 0.005241
    BG> BlueGene emulator shutdown gracefully!
    BG> Emulation took 25.381299 seconds!

Final Predictions (using BgPrint)

    clarity> cat bgPrintFile.0
    [0] interation starts at 0.000217
    [0] interation starts at 0.000756
    [0] interation starts at 0.001295
    [0] interation starts at 0.001835
    [0] interation starts at 0.002374
    [0] interation starts at 0.002913
    [0] interation starts at 0.003452
    [0] interation starts at 0.003992
    [0] interation starts at 0.004531
    [0] interation starts at 0.005070

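The committed bookmarks give the predicted time per iteration directly: ten timestamps span nine intervals. For this run, (0.005070 - 0.000217) / 9 is about 0.000539 s, compared with (0.005241 - 0.000235) / 9, about 0.000556 s, from the uncorrected times printed during emulation. A small helper (meanInterval is a hypothetical name) does the same arithmetic.

```cpp
#include <cassert>
#include <vector>

// Mean spacing between the first and last of a series of bookmark times.
double meanInterval(const std::vector<double>& t) {
  assert(t.size() >= 2);
  return (t.back() - t.front()) / static_cast<double>(t.size() - 1);
}
```
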
Outline
- Overview
- BigSim Emulator
- Charm++ on the Emulator
- Simulation framework
  - Online mode simulation
  - Post-mortem simulation
  - Network simulation (Eric)
- Performance analysis/visualization

Postmortem Simulation
- Run the application once, get trace logs, and run the simulation with those logs for a variety of network configurations
- Implemented on the POSE simulation framework

How to Obtain Predicted Time
- Use BgPrint(char *) in a similar way
  - Each BgPrint() called at execution time in online execution mode is stored in the BgLog as a printing event
- In postmortem simulation, the string associated with a BgPrint event is printed when the event is committed
- "%f" in the string is replaced by the committed time

Compile Postmortem Simulator
- Compile POSE
  - Use normal Charm++
  - cd charm/net-linux/tmp
  - make pose
- Compile the HiSim simulator
  - cd charm/net-linux/pgms/pose/HiSim/BlueGene
  - make

Example (AMPI Cjacobi3D, cont.)

    charm/net-linux/examples/pose/HiSim/tmp/BGHiSim 0 0
    bgtrace: totalBGProcs=4 X=2 Y=2 Z=1 #Cth=1 #Wth=1 #Pes=3
    Opts: netsim on: 0
    Initializing POSE...
    POSE initialization complete.
    Using Inactivity Detection for termination.
    Starting simulation...
    256 4 1024 1.750000 9 1000000 0 1 0 0 0 8 16 4
    Info> timing factor 1.000000e+08...
    Info> invoking startup task from proc 0...
    [0:AMPI_Barrier_END] interation starts at 0.000217
    [0:RECV_RESUME] interation starts at 0.000755
    [0:RECV_RESUME] interation starts at 0.001292
    [0:RECV_RESUME] interation starts at 0.001829
    [0:RECV_RESUME] interation starts at 0.002367
    [0:RECV_RESUME] interation starts at 0.002904
    [0:RECV_RESUME] interation starts at 0.003441
    [0:RECV_RESUME] interation starts at 0.003978
    [0:RECV_RESUME] interation starts at 0.004516
    [0:RECV_RESUME] interation starts at 0.005053
    Simulation inactive at time: 587350
    Final GVT = 587351

Outline
- Overview
- BigSim Emulator
- Charm++ on the Emulator
- Simulation framework
  - Online mode simulation
  - Post-mortem simulation
  - Network simulation (Eric)
- Performance analysis/visualization

Big Network Simulator
- Used when message-passing performance is critical and network contention must be considered

BigNetSim Overview
- Networks
- Configuration
- Design
  - Modular
    - Mix and match architecture, topology, routing
  - POSE
- Catalog of Network Simulations
- Building
- Running HiSim
- Using the Transceiver
- Extensibility

Networks
[Figure: examples of an indirect network and a direct network.]

HiSim Design
- Features
  - packetization
  - flit
  - processor level (QsNet)
  - routing
  - topology
  - contention
  - latency
- Components
  - Node
  - Switch
  - Channel
  - Virtual Channel
  - Network Interface
  - Routing protocol

Implementation
- Post-mortem network simulators are Parallel Discrete Event Simulations
  - Parallel Object Simulation Environment (POSE)
  - Network layer constructs (NIC, Switch, Node, etc.) are implemented as poser simulation objects
  - Network data constructs (message, packet, etc.) are implemented as event methods on simulation objects

POSE
[Figure: POSE overview diagram.]

NetSims
- Direct
  - HiSim/Bluegene
  - SimpleNetSim
  - BigNetSim (original Bluegene network)
  - TCSim: time stamp correction
- Indirect
  - QsNet for transceiver
  - QsNetTrace

Building POSE
- POSE
  - cd charm
  - ./build pose net-linux
  - Options are set in pose_config.h
    - stats enabled by POSE_STATS_ON=1
    - user event tracing: TRACE_DETAIL=1
    - more advanced configuration options
      - speculation
      - checkpoints
      - load balancing

Building HiSim
- Build HiSim/Bluegene
  - cd net-linux/examples/HiSim/Bluegene
  - make
  - For the sequential simulator:
    - cd ../tmp
    - make clean; make SEQUENTIAL=1

Running
- charmrun +p 4 BGHiSim 1 1
- Parameters
  - The first parameter controls detailed network simulation
    - 1 uses the detailed model
    - 0 uses simple latency
  - The second parameter controls simulation skip
    - 1 skips forward to the timestamp set during trace creation
    - 0 if it was not set, or if network startup is of interest

Configuring HiSim

    Parameter            Value    Description
    USE_TRANSCEIVER      0        For network analysis: ignore the trace and generate random traffic
    NUM_NODES            0        Number of nodes; taken from the trace file, or set for the transceiver
    MAX_PACKET_SIZE      256      Maximum packet size
    SWITCH_VC            4        Number of switch virtual channels
    SWITCH_PORT          8        Number of ports in a switch; calculated automatically for direct networks
    SWITCH_BUF           1024     Size in memory of each virtual channel
    CHANNELBW            1.75     Bandwidth, in units of 100 MB/s
    CHANNELDELAY         9        Delay, in units of 10 ns (so 9 => 90 ns)
    RECEPTION_SERIAL     0        For direct networks where access to the reception FIFO has to be serialized
    INPUT_SPEEDUP        8        Limits simultaneous access by VCs in a port; should be less than or equal
                                  to the number of VCs. Currently used only for bluegene.
    ADAPTIVE_ROUTING     0        Additional flag to select adaptive/deterministic routing
    COLLECTION_INTERVAL  1000000  Collection interval * 10 ns gives the statistics bin size
    DISPLAY_LINK_STATS   0        Display statistics for each link
    DISPLAY_MESSAGE_DELAY 0       Display message delay statistics

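The units in this table are easy to misread, so here is the arithmetic for the sample values: CHANNELBW 1.75 means 175 MB/s, CHANNELDELAY 9 means 90 ns, and COLLECTION_INTERVAL 1000000 gives 10 ms statistics bins. The sketch below also estimates the time for one 256-byte packet to cross one channel as serialization plus channel delay, about 1463 + 90 = 1553 ns. That per-packet formula, and taking 1 MB as 10^6 bytes, are our assumptions, not HiSim's model; HiSimParams and the helper names are hypothetical.

```cpp
#include <cassert>

struct HiSimParams {
  int maxPacketSize;        // MAX_PACKET_SIZE, bytes
  double channelBw;         // CHANNELBW, units of 100 MB/s
  int channelDelay;         // CHANNELDELAY, units of 10 ns
  long collectionInterval;  // COLLECTION_INTERVAL, units of 10 ns
};

// Link bandwidth in bytes per nanosecond (1 MB taken as 10^6 bytes).
double bytesPerNs(const HiSimParams& p) { return p.channelBw * 100e6 / 1e9; }
double channelDelayNs(const HiSimParams& p) { return p.channelDelay * 10.0; }
double statsBinMs(const HiSimParams& p) {
  return p.collectionInterval * 10.0 / 1e6;  // 10 ns units -> ms
}

// Serialization of one full packet plus one channel delay, in ns.
double packetChannelNs(const HiSimParams& p) {
  assert(p.channelBw > 0.0);
  return p.maxPacketSize / bytesPerNs(p) + channelDelayNs(p);
}
```
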
Output
- Completion time for the trace run
- Turn on -tproj to get a simple updated trace of network performance
- POSE trace for Projections output
  - Of limited value to the end user
- Coming soon: Projections output displaying user events in simulation time (like BigSim)

HiSim Modules
- Build your own interconnect
  - Topology
    - 3DMesh
    - FatTree
    - Torus
    - HyperCube
  - Architecture
    - Bluegene
    - HypCubeArch
    - OB: Output Buffered
    - IB: Input Buffered
    - RedStorm (incomplete)
  - Routing
    - Adaptive
    - Static
  - Hybrid
    - Fat Tree + Torus

Transceiver
- Generates traffic patterns instead of using trace files
- Additional command line parameters: Pattern, Frequency
- Pattern
  - 1 kshift
  - 2 ring
  - 3 bittranspose
  - 4 bitreversal
  - 5 bitcomplement
  - 6 poisson
- Frequency
  - 0 linear
  - 1 uniform
  - 2 exponential

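For reference, the shift and bit-permutation patterns have standard definitions in the interconnection-network literature. The C++ sketch below uses those textbook definitions for N = 2^bits nodes; HiSim's transceiver may implement them differently (for example in how kshift chooses k), and poisson is omitted because it is a random arrival process rather than a fixed destination mapping. All function names are hypothetical.

```cpp
#include <cassert>

// Destination = source shifted by k positions, modulo n nodes.
unsigned kshiftDest(unsigned src, unsigned k, unsigned n) { return (src + k) % n; }

// Each node sends to its successor.
unsigned ringDest(unsigned src, unsigned n) { return (src + 1) % n; }

// Flip every address bit.
unsigned bitComplementDest(unsigned src, unsigned bits) {
  return ~src & ((1u << bits) - 1u);
}

// Bit i of the source becomes bit (bits - 1 - i) of the destination.
unsigned bitReversalDest(unsigned src, unsigned bits) {
  unsigned dst = 0;
  for (unsigned i = 0; i < bits; ++i)
    if (src & (1u << i)) dst |= 1u << (bits - 1 - i);
  return dst;
}

// Swap the upper and lower halves of the address bits.
unsigned bitTransposeDest(unsigned src, unsigned bits) {
  assert(bits % 2 == 0);
  unsigned half = bits / 2, mask = (1u << half) - 1u;
  return ((src & mask) << half) | (src >> half);
}
```

With 16 nodes (bits = 4), node 1 sends to node 14 under bitcomplement, node 8 under bitreversal, and node 4 under bittranspose.
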
HiSim Design
[Figure: HiSim design diagram.]

HiSim: Data Flow
[Figure: HiSim data flow diagram.]

HiSim API: Extensibility
[Figure: HiSim API extensibility diagram.]

Future
- Projections trace logs of user events in simulation time
- Hybrid networks: both direct and indirect
- Improved scalability
  - Adaptive strategies
  - Load balancing
- Collective communication
- Representative collection of netconfig files

Case Study: NAMD
- Molecular dynamics simulation application
- Compile BigSim Charm++:
  - ./build bluegene net-linux bluegene
- Compile NAMD:
  - Get the source code from: http://charm.cs.uiuc.edu/~gzheng/namd-bg.tar.gz
  - ./config fftw Linux-i686-g++

Validation with Simple Network Model
NAMD Apo-Lipoprotein A1 with 92K atoms; performance simulation using 8 Lemieux processors.

    Processors              128     256     512    1024
    Actual time (ms)       71.5    40.3    23.9    17.6
    Predicted time (ms)    75.8    43.6    25.1    20.8

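Reading the table as relative error, (predicted - actual) / actual, the simple model overestimates at every size: about 6.0%, 8.2%, 5.0% and 18.2% at 128, 256, 512 and 1024 processors. A one-line helper (relativeErrorPct is a hypothetical name) makes the arithmetic explicit.

```cpp
#include <cassert>

// Relative error of a prediction, in percent (positive = overestimate).
double relativeErrorPct(double actual, double predicted) {
  assert(actual > 0.0);
  return (predicted - actual) / actual * 100.0;
}
```
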
Network Communication Pattern Analysis
- NAMD with apoa1
- 15 timesteps

Network Communication Pattern Analysis
[Figure: data transferred (KB) in a single time step.]

Outline
- Overview
- BigSim Emulator
- Charm++ on the Emulator
- Simulation framework
  - Online mode simulation
  - Post-mortem simulation
  - Network simulation (Eric)
- Performance analysis/visualization

Performance Analysis/Visualization
- trace-projections is available for BigSim
- Challenge:
  - The number of log files can be overwhelming

Generate Projections Logs
- Link the application with -tracemode projections
- Select a subset of processors in bgconfig:
  - projections 0-100, 2000, 3100-3200
- With timestamp correction, two sets of Projections logs are generated
  - Before and after timestamp correction

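Processor selections like the one above can be expanded with a few lines of C++. This is a sketch assuming comma-separated items, each either a single number or an inclusive a-b range; expandProcList is a hypothetical helper, and the real bgconfig parser may accept other forms.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Expand e.g. "0-100, 2000, 3100-3200" into explicit processor numbers.
std::vector<int> expandProcList(const std::string& spec) {
  std::vector<int> procs;
  std::stringstream items(spec);
  std::string item;
  while (std::getline(items, item, ',')) {
    std::string::size_type dash = item.find('-');
    int lo = std::stoi(item.substr(0, dash));  // stoi skips leading spaces
    int hi = (dash == std::string::npos) ? lo : std::stoi(item.substr(dash + 1));
    assert(lo <= hi);
    for (int p = lo; p <= hi; ++p) procs.push_back(p);
  }
  return procs;
}
```

The selection above expands to 101 + 1 + 101 = 203 traced processors out of the whole machine.
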
Generate Projections Logs (the hideous secret)
- Problem:
  - The Projections tracing function maintains a fixed-size buffer for storing Projections logs
  - The buffer is flushed to disk when it fills up, and disk I/O can affect the predicted time
- Solution:
  - Use the +logsize runtime option to provide a large Projections buffer
- In fact, in online mode simulation, the simulation aborts when disk I/O occurs.

Projections with Jacobi
- cd charm/examples/bigsim/sdag/jacobi-no-redn
- ./charmrun +p 4 ./jacobi 16384 10 8192 +bgconfig ./bg_config
- Config file:

    x 32
    y 16
    z 16
    cth 1
    wth 1
    stacksize 10000
    #timing walltime
    timing bgelapse
    #timing counter
    cpufactor 1.0
    fpfactor 5e-7
    traceroot .
    log yes
    correct yes
    network lemieux
    projections 0,1000,8189-8191

make bgtest
- With 16 processors

Performance Analysis Tool: Projections

Projections with Postmortem Simulation

Thank You!
Free download of Charm++ and BigSim at http://charm.cs.uiuc.edu
Send comments to ppl@charm.cs.uiuc.edu