Adaptive MPI Tutorial
Chao Huang
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

Motivation
- Challenges
  - New-generation parallel applications are dynamically varying: load shifting, adaptive refinement
  - Typical MPI implementations are not naturally suitable for dynamic applications
  - The set of available processors may not match the natural expression of the algorithm
- AMPI: MPI with virtualization of processors (VPs)

Outline
- MPI basics
- Charm++/AMPI introduction
- How to write an AMPI program
  - Running with virtualization
- How to convert an MPI program
- Using AMPI features
  - Automatic load balancing
  - Checkpoint/restart mechanism
  - Interoperability with Charm++
  - ELF and global variables

MPI Basics
- Standardized message-passing interface
  - Passing messages between processes
  - The standard contains the technical features proposed for the interface
  - Minimally, 6 basic routines:

    int MPI_Init(int *argc, char ***argv)
    int MPI_Finalize(void)
    int MPI_Comm_size(MPI_Comm comm, int *size)
    int MPI_Comm_rank(MPI_Comm comm, int *rank)
    int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

MPI Basics
- MPI-1.1 contains 128 functions in 6 categories:
  - Point-to-Point Communication
  - Collective Communication
  - Groups, Contexts, and Communicators
  - Process Topologies
  - MPI Environmental Management
  - Profiling Interface
- Language bindings: Fortran, C
- 20+ implementations reported

MPI Basics
- The MPI-2 standard contains:
  - Further corrections and clarifications for the MPI-1 document
  - Completely new types of functionality:
    - Dynamic processes
    - One-sided communication
    - Parallel I/O
  - Added bindings for Fortran 90 and C++
  - Lots of new functions: 188 for the C binding

Example: Hello World!

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int size, myrank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        printf("[%d] Hello, parallel world!\n", myrank);
        MPI_Finalize();
        return 0;
    }

Example: Send/Recv

    ...
    double a[2] = {0.3, 0.5};
    double b[2] = {0.7, 0.9};
    MPI_Status sts;
    if (myrank == 0) {
        MPI_Send(a, 2, MPI_DOUBLE, 1, 17, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(b, 2, MPI_DOUBLE, 0, 17, MPI_COMM_WORLD, &sts);
    }
    ...

[Demo…]

Charm++
- Basic idea of processor virtualization:
  - User specifies interaction between objects (VPs)
  - RTS maps VPs onto physical processors
  - Typically, # virtual processors > # processors

[Figure: user view of interacting VPs vs. system implementation of VPs mapped onto physical processors]

Charm++
- Charm++ features:
  - Data-driven objects
  - Asynchronous method invocation
  - Mapping of multiple objects per processor
  - Load balancing, static and at run time
  - Portability
- TCharm features:
  - User-level threads that do not block the CPU
  - Lightweight: context-switch time ~1 μs
  - Migratable threads

AMPI: MPI with Virtualization
- Each virtual process is implemented as a user-level thread embedded in a Charm++ object

[Figure: MPI "processes" implemented as virtual processes (user-level migratable threads) mapped onto real processors]

Comparison with Native MPI
- Performance
  - Slightly worse without optimization
  - Being improved
- Flexibility
  - Big runs on a few processors
  - Fits the nature of algorithms

Problem setup: 3D stencil calculation of size 240³ run on Lemieux. AMPI runs on any # of PEs (e.g., 19, 33, 105); native MPI needs a cube number of processors.

Build Charm++/AMPI
- Download from the website:
  - http://charm.cs.uiuc.edu/download.html
  - Please register for better support
- Build Charm++/AMPI:
  - > ./build <target> <version> <options> [charmc-options]
  - To build AMPI:
    > ./build AMPI net-linux -g

How to write an AMPI program (1)
- Write your normal MPI program, and then…
- Link and run with Charm++
  - Build your Charm++ with target AMPI
  - Compile and link with charmc (charm/bin/mpicc|mpiCC|mpif77|mpif90):
    > charmc -o hello hello.c -language ampi
  - Run with charmrun:
    > charmrun hello

How to write an AMPI program (1)
- Now we can run an MPI program with Charm++
- Demo: Hello World!

How to write an AMPI program (2)
- Do not use global variables
- Global variables are dangerous in multithreaded programs
  - Global variables are shared by all the threads on a processor and can be changed by another thread

Example interleaving, with both threads sharing one global count:
    Thread 1: count = 1; block in MPI_Recv
    Thread 2: count = 2; block in MPI_Recv
    Thread 1: b = count   (reads 2, not the expected 1)

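The same hazard in C form: a minimal sketch, assuming two AMPI VPs share one physical processor (the helper compute and tag 17 are illustrative, not from the tutorial code):

    #include <mpi.h>

    /* BAD: a global has one copy per process image, shared by every
       AMPI virtual processor (user-level thread) on that processor. */
    int count;

    void compute(MPI_Comm comm, int myrank)
    {
        int b, buf;
        MPI_Status sts;

        count = myrank + 1;   /* VP 0 writes 1, VP 1 writes 2 */
        MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 17, comm, &sts);
        /* While this VP was blocked in MPI_Recv, another VP on the same
           processor may have run and overwritten count. */
        b = count;            /* b may no longer equal myrank + 1 */
        (void)b;
    }
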
How to write an AMPI program (2)
- Now we can run multiple threads on one processor
- Running with many virtual processors:
  - +vp command-line option:
    > charmrun +p 3 hello +vp 8
- Demo: Hello Parallel World!
- Demo: 2D Jacobi Relaxation

How to convert an MPI program
- Remove global variables
  - Pack them into a struct/TYPE or class
  - Allocate it in heap or stack

Original code:
    MODULE shareddata
      INTEGER :: myrank
      DOUBLE PRECISION :: xyz(100)
    END MODULE

AMPI code:
    MODULE shareddata
      TYPE chunk
        INTEGER :: myrank
        DOUBLE PRECISION :: xyz(100)
      END TYPE
    END MODULE

How to convert an MPI program

Original code:
    PROGRAM MAIN
      USE shareddata
      INCLUDE 'mpif.h'
      INTEGER :: i, ierr
      CALL MPI_Init(ierr)
      CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
      DO i = 1, 100
        xyz(i) = i + myrank
      END DO
      CALL subA
      CALL MPI_Finalize(ierr)
    END PROGRAM

AMPI code:
    SUBROUTINE MPI_Main
      USE shareddata
      USE AMPI
      INTEGER :: i, ierr
      TYPE(chunk), POINTER :: c
      CALL MPI_Init(ierr)
      ALLOCATE(c)
      CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
      DO i = 1, 100
        c%xyz(i) = i + c%myrank
      END DO
      CALL subA(c)
      CALL MPI_Finalize(ierr)
    END SUBROUTINE

How to convert an MPI program

Original code:
    SUBROUTINE subA
      USE shareddata
      INTEGER :: i
      DO i = 1, 100
        xyz(i) = xyz(i) + 1.0
      END DO
    END SUBROUTINE

AMPI code:
    SUBROUTINE subA(c)
      USE shareddata
      TYPE(chunk) :: c
      INTEGER :: i
      DO i = 1, 100
        c%xyz(i) = c%xyz(i) + 1.0
      END DO
    END SUBROUTINE

- C examples can be found in the manual

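For C, the same conversion looks roughly like this. This is a sketch, not copied from the manual; the struct name chunk mirrors the Fortran TYPE above:

    #include <stdlib.h>
    #include <mpi.h>

    /* Before: int myrank; double xyz[100]; as globals (unsafe under AMPI).
       After: state packed into a heap-allocated struct passed around. */
    typedef struct {
        int myrank;
        double xyz[100];
    } chunk;

    void subA(chunk *c)
    {
        int i;
        for (i = 0; i < 100; i++)
            c->xyz[i] += 1.0;
    }

    int main(int argc, char *argv[])
    {
        int i;
        chunk *c;
        MPI_Init(&argc, &argv);
        c = malloc(sizeof(chunk));
        MPI_Comm_rank(MPI_COMM_WORLD, &c->myrank);
        for (i = 0; i < 100; i++)
            c->xyz[i] = i + c->myrank;
        subA(c);
        free(c);
        MPI_Finalize();
        return 0;
    }
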
How to run an AMPI program
- Use virtual processors:
  - Run with the +vp option
  - Specify the stack size with the +tcharm_stacksize option
- Demo: big stack

How to convert an MPI program
- Fortran entry point becomes MPI_Main:

    Original:            AMPI:
    program pgm          subroutine MPI_Main
    ...                  ...

AMPI Extensions
- Automatic load balancing
- Non-blocking collectives
- Checkpoint/restart mechanism
- Multi-module programming
- ELF and global variables

Automatic Load Balancing
- Load imbalance in dynamic applications hurts performance
- Automatic load balancing: MPI_Migrate()
  - Collective call informing the load balancer that the thread is ready to be migrated, if needed
  - If there is a load balancer present:
    - First sizing, then packing on the source processor
    - Sending the stack and PUP'd data to the destination
    - Unpacking the data on the destination processor

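Because MPI_Migrate() is collective, it typically goes at an iteration boundary so every thread reaches it together. A minimal sketch; compute_step and the interval of 20 are illustrative, not from the tutorial code:

    #include <mpi.h>

    void compute_step(void);   /* stands in for the application's real work */

    void run(int nsteps)
    {
        int step;
        for (step = 0; step < nsteps; step++) {
            compute_step();
            if (step % 20 == 19)    /* illustrative migration interval */
                MPI_Migrate();      /* collective: let the balancer move threads */
        }
    }
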
Automatic Load Balancing
- To use the automatic load balancing module:
  - Link with LB modules:
    > charmc -o pgm hello.o -language ampi -module EveryLB
  - Run with the +balancer option:
    > ./charmrun pgm +p 4 +vp 16 +balancer GreedyCommLB

Automatic Load Balancing
- The link-time flag -memory isomalloc makes heap-data migration transparent
  - Special memory allocation mode, giving allocated memory the same virtual address on all processors
  - Ideal on 64-bit machines
  - Fits most cases and is highly recommended

Automatic Load Balancing
- Limitations of isomalloc:
  - Memory waste
    - 4 KB minimum granularity
    - Avoid small allocations
  - Limited address space on 32-bit machines
- Alternative: PUPer

Automatic Load Balancing
- Group your global variables into a data structure
- Pack/UnPack routine (a.k.a. PUPer):
  - Heap data –(pack)–> network message –(unpack)–> heap data
- Demo: load balancing

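A minimal PUPer sketch for the chunk structure from the conversion example. It assumes AMPI's C PUP interface of this era (the pup_c.h header, pup_er, pup_int, pup_doubles, and registration via MPI_Register); treat these names as assumptions and check the manual:

    #include "pup_c.h"   /* assumed header for the C PUP interface */

    typedef struct {
        int myrank;
        double xyz[100];
    } chunk;

    /* Called by the runtime for sizing, packing, and unpacking alike;
       the pup_* calls do the right thing in each phase. */
    void chunk_pup(pup_er p, void *d)
    {
        chunk *c = (chunk *)d;
        pup_int(p, &c->myrank);
        pup_doubles(p, c->xyz, 100);
    }

    /* At startup, register the data and its PUPer with the runtime:
       MPI_Register(c, chunk_pup);  */
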
Collective Operations
- Problem with collective operations:
  - Complex: involve many processors
  - Time-consuming: designed as blocking calls in MPI

[Figure: time breakdown of a 2D FFT benchmark, in ms]

Collective Communication Optimization
- Computation is only a small proportion of the elapsed time
- A number of optimization techniques have been developed to improve collective communication performance

[Figure: time breakdown of an all-to-all operation using the Mesh library]

Asynchronous Collectives
- Our implementation is asynchronous:
  - Collective operation posted
  - Test/wait for its completion
  - Meanwhile, useful computation can utilize the CPU (see the sketch after this slide)

    MPI_Request req;
    MPI_Status sts;
    MPI_Ialltoall( … , &req);
    /* other computation */
    MPI_Wait(&req, &sts);

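A fuller sketch of the overlap pattern, assuming AMPI's non-blocking all-to-all extension (the call later became standard as MPI_Ialltoall in MPI-3); buffer sizes and local_work are illustrative:

    #include <stdlib.h>
    #include <mpi.h>

    void local_work(void);   /* illustrative computation to overlap */

    void exchange(int nranks, int n)
    {
        double *sendbuf = malloc(nranks * n * sizeof(double));
        double *recvbuf = malloc(nranks * n * sizeof(double));
        MPI_Request req;
        MPI_Status sts;

        /* Post the collective and return immediately. */
        MPI_Ialltoall(sendbuf, n, MPI_DOUBLE,
                      recvbuf, n, MPI_DOUBLE, MPI_COMM_WORLD, &req);

        local_work();           /* overlap: the CPU does useful work while
                                   the all-to-all progresses */

        MPI_Wait(&req, &sts);   /* block only when the data is needed */
        free(sendbuf);
        free(recvbuf);
    }
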
Asynchronous Collectives
- VPs implemented as threads
- Overlapping computation with the waiting time of collective operations
- Total completion time reduced

[Figure: time breakdown of the 2D FFT benchmark, in ms]

Checkpoint/Restart Mechanism
- Large-scale machines suffer from failures
- Checkpoint/restart mechanism:
  - State of applications checkpointed to disk files
  - Capable of restarting on a different # of PEs
  - Facilitates future efforts on fault tolerance

Checkpoint/Restart Mechanism
- Checkpoint with a collective call:
  - MPI_Checkpoint(DIRNAME)
  - Synchronous checkpoint
- Restart with a run-time option:
  - > ./charmrun pgm +p 4 +vp 16 +restart DIRNAME
- Demo: checkpoint/restart an AMPI program

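Like MPI_Migrate(), the checkpoint call is collective and fits at an iteration boundary. A sketch; the directory name, interval, and the exact C signature of MPI_Checkpoint are assumptions to verify against the manual:

    #include <mpi.h>

    void compute_step(void);   /* illustrative application work */

    void run(int nsteps)
    {
        int step;
        for (step = 0; step < nsteps; step++) {
            compute_step();
            if (step % 100 == 99)
                MPI_Checkpoint("ckpt");   /* all threads write state to ./ckpt */
        }
    }
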
Interoperability with Charm++
- Charm++ has a collection of support libraries
- You can make use of them by running Charm++ code in AMPI programs
- Also, we can run MPI code in Charm++ programs
- Demo: interoperability with Charm++

ELF and Global Variables
- Global variables are not thread-safe
  - Can we switch global variables when we switch threads?
- The Executable and Linking Format (ELF):
  - The executable has a Global Offset Table (GOT) containing global data
  - The GOT pointer is stored in the %ebx register
  - Switch this pointer when switching between threads
  - Supported on Linux, Solaris 2.x, and more
- Integrated in Charm++/AMPI
  - Invoked by the compile-time option -swapglobals
- Demo: thread-safe global variables

Future Work
- Performance prediction via direct simulation
  - Performance tuning without continuous access to a large machine
- Support for visualization
  - Facilitating debugging and performance tuning
- Support for the MPI-2 standard
  - ROMIO as the parallel I/O library
  - One-sided communication

Thank You!
Free download and manual available at http://charm.cs.uiuc.edu/
Parallel Programming Lab at University of Illinois