Collective Communication in MPI and Advanced Features

Pacheco's book, Chapter 3. T. Yang, CS 240A, 2016. Part of the slides are from the textbook, from CS 267 (K. Yelick, UC Berkeley), and from B. Gropp (ANL).

Outline
• Collective group communication
• Extra application examples
  – Pi computation
  – Summation of long vectors
  – Matrix-vector multiplication
  – Performance evaluation
  – Parallel sorting
• Safety and other MPI issues

What MPI functions are commonly used
For simple applications, these are common:
• Startup
  – MPI_Init()
  – MPI_Finalize()
• Information on the processes
  – MPI_Comm_rank()
  – MPI_Comm_size()
  – MPI_Get_processor_name()
• Point-to-point communication
  – MPI_Send() & MPI_Recv()
  – MPI_Isend() & MPI_Irecv(), MPI_Wait()
• Collective communication
  – MPI_Allreduce(), MPI_Bcast(), MPI_Allgather()
• http://mpitutorial.com/mpi-broadcast-and-collective-communication/

Blocking Message Passing
The call waits until the data transfer is done.
• MPI_Send() – the sending process waits until all data are transferred to the system buffer.
• MPI_Recv() – the receiving process waits until all data are transferred from the system buffer to the receive buffer.
• Buffers can then be freely reused.

Blocking Message Send
MPI_Send(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm);
  buf    Specifies the starting address of the buffer.
  count  Indicates the number of buffer elements.
  dtype  Denotes the datatype of the buffer elements.
  dest   Specifies the rank of the destination process in the group associated with the communicator comm.
  tag    Denotes the message label.
  comm   Designates the communication context that identifies a group of processes.

Blocking Message Send
• Standard: MPI_Send() – the sending process returns when the system can buffer the message, or when the message has been received and the buffer is ready for reuse.
• Buffered: MPI_Bsend() – the sending process returns when the message is buffered in an application-supplied buffer.
• Synchronous: MPI_Ssend() – the sending process returns only if a matching receive is posted and the receiving process has started to receive the message.
• Ready: MPI_Rsend() – the message is sent as soon as possible (ASAP); a matching receive must already be posted.

Blocking Message Receive
MPI_Recv(void *buf, int count, MPI_Datatype dtype, int source, int tag, MPI_Comm comm, MPI_Status *status);
  buf     Specifies the starting address of the buffer.
  count   Indicates the number of buffer elements.
  dtype   Denotes the datatype of the buffer elements.
  source  Specifies the rank of the source process in the group associated with the communicator comm.
  tag     Denotes the message label.
  comm    Designates the communication context that identifies a group of processes.
  status  Returns information about the received message.

Example
…
if (rank == 0) {
    for (i = 0; i < 10; i++) buffer[i] = i;
    MPI_Send(buffer, 10, MPI_INT, 1, 123, MPI_COMM_WORLD);
} else if (rank == 1) {
    for (i = 0; i < 10; i++) buffer[i] = -1;
    MPI_Recv(buffer, 10, MPI_INT, 0, 123, MPI_COMM_WORLD, &status);
    for (i = 0; i < 10; i++)
        if (buffer[i] != i)
            printf("Error: buffer[%d] = %d but is expected to be %d\n", i, buffer[i], i);
}
…
More examples at http://mpi.deino.net/mpi_functions/index.htm

Non-blocking Message Passing
• Returns immediately after the data transfer is initiated.
• Allows computation to overlap with communication.
• Need to be careful, though: if the send or receive buffer is updated before the transfer is over, the result will be wrong.
• An MPI_Request object represents a handle on a non-blocking operation. It can be passed to the wait functions
  – MPI_Wait()
  – MPI_Waitall()
  – MPI_Waitany()
  – MPI_Waitsome()
  to find out when the non-blocking operation completes.

Non-blocking Message Passing
MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req);
MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int source, int tag, MPI_Comm comm, MPI_Request *req);
MPI_Wait(MPI_Request *req, MPI_Status *status);
• A call to MPI_Wait returns when the operation identified by the request is complete.
• req specifies the request handle used by a completion routine (such as MPI_Wait) to complete the operation.
• Blocking calls and their non-blocking counterparts: MPI_Send/MPI_Isend, MPI_Bsend/MPI_Ibsend, MPI_Ssend/MPI_Issend, MPI_Rsend/MPI_Irsend, MPI_Recv/MPI_Irecv.

Non-blocking Message Passing
…
right = (rank + 1) % nproc;
left = rank - 1;
if (left < 0) left = nproc - 1;
MPI_Irecv(buffer, 10, MPI_INT, left, 123, MPI_COMM_WORLD, &request);
MPI_Isend(buffer2, 10, MPI_INT, right, 123, MPI_COMM_WORLD, &request2);
MPI_Wait(&request, &status);
MPI_Wait(&request2, &status);
…

Collective communications
• A single call handles the communication between all the processes in a communicator.
• There are 3 types of collective communications:
  – Data movement (e.g. MPI_Bcast)
  – Reduction (e.g. MPI_Reduce)
  – Synchronization (e.g. MPI_Barrier)
• More examples at http://mpi.deino.net/mpi_functions/index.htm

Broadcast
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
• One process (root) sends data to all the other processes in the same communicator.
• Must be called by all the processes with the same arguments.
[Diagram: before MPI_Bcast only the root P1 holds the buffer A B C D; afterwards every process P1–P4 holds A B C D.]
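As a concrete illustration (not from the original slides; the variable names are made up for the example), here is a minimal sketch of broadcasting a parameter from rank 0 to all ranks:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) n = 100;              /* only the root knows the value initially */
    /* every process calls MPI_Bcast with the same root and count */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Rank %d now has n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}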

Gather
int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)
• One process (root) collects data from all the other processes in the same communicator.
• Must be called by all the processes with the same arguments.
[Diagram: before MPI_Gather, P1–P4 hold A, B, C, D respectively; afterwards the root P1 holds A B C D.]
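A small usage sketch (not from the slides; buffer names are illustrative): each rank contributes one int and rank 0 receives one int per rank.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * rank;                  /* each process contributes one value */
    int *all = NULL;
    if (rank == 0)
        all = malloc(size * sizeof(int));    /* only the root needs the receive buffer */

    /* recvcnt is the count received FROM EACH process, not the total */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("value from rank %d: %d\n", i, all[i]);
        free(all);
    }
    MPI_Finalize();
    return 0;
}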

Gather to All
int MPI_Allgather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm)
• All processes collect data from all the other processes in the same communicator.
• Must be called by all the processes with the same arguments.
[Diagram: before MPI_Allgather, P1–P4 hold A, B, C, D respectively; afterwards every process holds A B C D.]

Reduction
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
• One process (root) collects data from all the other processes in the same communicator and performs an operation on the data.
• Predefined operations include MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD, logical AND, OR, XOR, and a few more.
• MPI_Op_create(): user-defined operator.
[Diagram: P1–P4 each hold a value A, B, C, D; after MPI_Reduce with MPI_SUM the root P1 holds A+B+C+D.]
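A minimal sketch (illustrative names, not from the slides): each rank contributes a local partial result and rank 0 receives the global sum.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank + 1;      /* pretend this is a locally computed partial result */
    int global = 0;            /* only meaningful on the root after the call */

    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of (rank+1) over all ranks = %d\n", global);
    MPI_Finalize();
    return 0;
}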

Reduction to All
int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
• All processes collect data from all the other processes in the same communicator and perform an operation on the data.
• Predefined operations include MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD, logical AND, OR, XOR, and a few more.
• MPI_Op_create(): user-defined operator.
[Diagram: P1–P4 each hold a value A, B, C, D; after MPI_Allreduce with MPI_SUM every process holds A+B+C+D.]
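For instance, a distributed dot product can finish with MPI_Allreduce so that every rank holds the result. This is a sketch with made-up names, assuming each rank already owns local_n elements of x and y:

#include <mpi.h>

/* sketch: each process owns local_n elements of x and y */
double parallel_dot(const double *x, const double *y, int local_n, MPI_Comm comm) {
    double local = 0.0, global = 0.0;
    for (int i = 0; i < local_n; i++)
        local += x[i] * y[i];
    /* every process gets the same global sum back */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}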

Synchronization
int MPI_Barrier(MPI_Comm comm)

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Hello, world. I am %d of %d\n", rank, nprocs);
    MPI_Finalize();
    return 0;
}

For more functions…
• http://www.mpi-forum.org
• http://www.llnl.gov/computing/tutorials/mpi/
• http://www.nersc.gov/nusers/help/tutorials/mpi/intro/
• http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
• http://www-unix.mcs.anl.gov/mpi/tutorial/
• MPICH (http://www-unix.mcs.anl.gov/mpich/)
• Open MPI (http://www.open-mpi.org/)
• MPI descriptions and examples are referred from:
  – http://mpi.deino.net/mpi_functions/index.htm
  – Stéphane Ethier (PPPL)'s PICSciE/PICASso Mini-Course slides

MPI Collective Communication
• Collective routines provide a higher-level way to organize a parallel program.
  – Each process executes the same communication operations.
  – Communication and computation are coordinated among a group of processes in a communicator.
  – Tags are not used.
  – No non-blocking collective operations (in MPI-2; MPI-3 later added them, e.g. MPI_Ibcast).
• Three classes of operations: synchronization, data movement, collective computation.

Synchronization
• MPI_Barrier(comm)
• Blocks until all processes in the group of the communicator comm call it.
• Not used often. Sometimes used in measuring performance and load balancing.

Collective Data Movement: Broadcast, Scatter, and Gather
[Diagram: Broadcast copies A from P0 to all of P0–P3; Scatter splits A B C D on P0 so that P0–P3 each get one piece; Gather is the inverse, collecting the pieces back onto P0.]
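Scatter is not shown elsewhere in this deck, so here is a minimal sketch (not from the slides; buffer names are illustrative): rank 0 distributes one int to each rank.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *chunks = NULL;
    if (rank == 0) {                          /* only the root fills the send buffer */
        chunks = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) chunks[i] = 10 * i;
    }
    int mine;
    /* sendcnt is the count sent TO EACH process */
    MPI_Scatter(chunks, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, mine);

    if (rank == 0) free(chunks);
    MPI_Finalize();
    return 0;
}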

Broadcast
• Data belonging to a single process is sent to all of the processes in the communicator.

Comments on Broadcast
• All collective operations must be called by all processes in the communicator.
• MPI_Bcast is called by both the sender (called the root process) and the processes that are to receive the broadcast.
  – MPI_Bcast is not a "multi-send".
  – The "root" argument is the rank of the sender; this tells MPI which process originates the broadcast and which processes receive it.

Collective Data Movement: Allgather and Alltoall
[Diagram: Allgather – P0–P3 start with A, B, C, D respectively and all end up with A B C D. Alltoall – P0–P3 start with rows A0..A3, B0..B3, C0..C3, D0..D3 and end up with the corresponding columns, e.g. P0 holds A0 B0 C0 D0 and P1 holds A1 B1 C1 D1.]

Collective Computation: Reduce vs. Scan
[Diagram: P0–P3 hold A, B, C, D. Reduce leaves R(ABCD) on the root. Scan (prefix reduction) leaves R(A) on P0, R(AB) on P1, R(ABC) on P2, and R(ABCD) on P3.]

MPI_Reduce

Predefined reduction operators in MPI

MPI Scan
MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
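A short sketch of the prefix (scan) semantics in practice, with illustrative values: after the call, rank r holds the sum of the contributions from ranks 0..r.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int mine = rank + 1;   /* contribution of this rank: 1, 2, 3, ... */
    int prefix = 0;
    /* inclusive prefix sum: rank r receives 1 + 2 + ... + (r+1) */
    MPI_Scan(&mine, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}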

MPI_Allreduce
• Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger computation.

MPI Collective Routines: Summary
• Many routines: Allgather, Allgatherv, Allreduce, Alltoall, Alltoallv, Bcast, Gatherv, Reduce, Reduce_scatter, Scan, Scatterv
• "All" versions deliver results to all participating processes.
• "V" versions allow the chunks to have variable sizes.
• Allreduce, Reduce_scatter, and Scan take both built-in and user-defined combiner functions.
• MPI-2 adds Alltoallw, Exscan, and intercommunicator versions of most routines.
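As an illustration of a "V" routine (not from the slides; all names are made up), MPI_Gatherv lets each rank contribute a different number of elements, with explicit per-rank counts and displacements supplied on the root:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mycount = rank + 1;                      /* rank r contributes r+1 elements */
    int *mydata = malloc(mycount * sizeof(int));
    for (int i = 0; i < mycount; i++) mydata[i] = rank;

    int *counts = NULL, *displs = NULL, *all = NULL;
    if (rank == 0) {
        counts = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
        int total = 0;
        for (int r = 0; r < size; r++) {         /* per-rank counts and offsets */
            counts[r] = r + 1;
            displs[r] = total;
            total += counts[r];
        }
        all = malloc(total * sizeof(int));
    }
    /* counts, displs, and all are only significant on the root */
    MPI_Gatherv(mydata, mycount, MPI_INT, all, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("gathered variable-sized pieces from all ranks\n");
    free(mydata); free(counts); free(displs); free(all);
    MPI_Finalize();
    return 0;
}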

Example Extra: Self Study • MPI PI program

Example of MPI PI program using 6 functions
• Using basic MPI functions:
  – MPI_INIT
  – MPI_FINALIZE
  – MPI_COMM_SIZE
  – MPI_COMM_RANK
• Using MPI collectives:
  – MPI_BCAST
  – MPI_REDUCE
Slide source: Bill Gropp, ANL

Midpoint Rule for f(x)
[Figure: midpoint rule on the interval [a, b] with midpoint xm.]

Example: PI in C - 1
#include "mpi.h"
#include <math.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        /* input and broadcast parameters */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;
(continued on the next slide)
Slide source: Bill Gropp, ANL

Example: PI in C - 2
        /* compute local pi values */
        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;
        /* compute summation */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
Slide source: Bill Gropp, ANL

Collective vs. Point-to-Point Communications
• All the processes in the communicator must call the same collective function.
• Will this program work?

if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Recv(&a, MPI_INT, MPI_SUM, 0, 0, MPI_COMM_WORLD);

Collective vs. Point-to-Point Communications
• All the processes in the communicator must call the same collective function.
• For example, a program that attempts to match a call to MPI_Reduce on one process with a call to MPI_Recv on another process is erroneous, and, in all likelihood, the program will hang or crash.

if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Recv(&a, MPI_INT, MPI_SUM, 0, 0, MPI_COMM_WORLD);

Collective vs. Point-to-Point Communications
• The arguments passed by each process to an MPI collective communication must be "compatible."
• Will this program work?

if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);

Collective vs. Point-to-Point Communications
• The arguments passed by each process to an MPI collective communication must be "compatible."
• For example, if one process passes in 0 as the dest_process and another passes in 1, then the outcome of a call to MPI_Reduce is erroneous, and, once again, the program is likely to hang or crash.

if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);

Example of MPI_Reduce execution
Multiple calls to MPI_Reduce with MPI_SUM and Proc 0 as destination (root).
[Table from the textbook, not reproduced here: each process has a = 1 and c = 2, but the processes issue their two MPI_Reduce calls with &a (into b) and &c (into d) in different orders.]
Is b = 3 on Proc 0 after two MPI_Reduce() calls? Is d = 6 on Proc 0?

Example: Output results
• However, the names of the memory locations are irrelevant to the matching of the calls to MPI_Reduce.
• The order of the calls determines the matching, so the value stored in b will be 1+2+1 = 4, and the value stored in d will be 2+1+2 = 5.

Example: Parallel Matrix-Vector Multiplication (a Collective Communication Application)
Textbook pp. 113-116

Matrix-vector multiplication: y = A * x

Partitioning and Task Graph for Matrix-Vector Multiplication
Task i: yi = Row Ai * x

Execution Schedule and Task Mapping
yi = Row Ai * x

Data Partitioning and Mapping for y = A*x

SPMD Code for y = A*x

Evaluation: Parallel Time
• Ignore the cost of local address calculation.
• Each task performs n additions and n multiplications.
• Each addition/multiplication costs ω.
• The parallel time is approximately as sketched below.
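The formula itself was an image on the original slide and is not in this transcript. Under the assumptions listed above, and additionally assuming a block-row distribution with n/p rows (tasks) per process, the local computation term works out to roughly:

T_{\mathrm{parallel}} \approx \frac{n}{p}\,(n\omega + n\omega) = \frac{2 n^2 \omega}{p}

The cost of redistributing x (the allgather) is treated separately on the following slides.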

How is the initial data distributed?
• Assume that initially the matrix A and the vector x are distributed evenly among the processes.
• We need to redistribute vector x to everybody in order to perform the parallel computation.
• What MPI collective communication is needed?

Communication Pattern for Data Redistribution
• Data requirement for Process 0: MPI_Gather
• Data requirement for all processes: MPI_Allgather

MPI Code for Gathering Data
[Code figure: a gather that collects x onto Process 0; repeated for all processes, this becomes an allgather.]

Allgather
• Concatenates the contents of each process' send_buf_p and stores this in each process' recv_buf_p.
• As usual, recv_count is the amount of data being received from each process.
[Diagram: processes holding A, B, C, D each end up with the concatenation A B C D.]

MPI SPMD Code for y = A*x
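The code on these slides is an image that did not survive the transcript. The following is a hedged sketch of what an SPMD y = A*x using MPI_Allgather typically looks like (all names here are illustrative, not the textbook's): each process owns n/p rows of A and n/p entries of x, gathers the full x, and then computes its n/p entries of y.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: A is stored by block rows; local_A has local_n x n entries (row-major),
   local_x and local_y have local_n entries, and n = local_n * number_of_processes. */
void mat_vect_mult(double *local_A, double *local_x, double *local_y,
                   int local_n, int n, MPI_Comm comm) {
    double *x = malloc(n * sizeof(double));

    /* every process gets the whole vector x */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);

    /* each process computes its own block of y */
    for (int i = 0; i < local_n; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i * n + j] * x[j];
    }
    free(x);
}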

Performance Evaluation of Matrix-Vector Multiplication

How to measure elapsed parallel time
• Use MPI_Wtime(), which returns the number of seconds that have elapsed since some time in the past.
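A common timing idiom (a sketch, not the textbook's exact code): synchronize first so all ranks start the timer together, time the region, then report the maximum across ranks as the parallel time.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* line everybody up before starting the clock */
    double start = MPI_Wtime();

    /* ... the code being timed goes here ... */

    double local_elapsed = MPI_Wtime() - start;
    double elapsed;
    /* report the slowest process's time as the parallel time */
    MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Elapsed time = %e seconds\n", elapsed);

    MPI_Finalize();
    return 0;
}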

Measure elapsed sequential time in Linux
• This code works on Linux without using MPI functions.
• Use GET_TIME(), which stores the time in seconds (with microsecond resolution) elapsed from some point in the past.
• Sample code for GET_TIME():

#include <sys/time.h>
/* The argument now should be a double (not a pointer to a double) */
#define GET_TIME(now) { struct timeval t; gettimeofday(&t, NULL); now = t.tv_sec + t.tv_usec/1000000.0; }

Measure elapsed sequential time

Use MPI_Barrier() before time measurement
Start timing only after every process in the communicator has reached the same point in the program (see the timing sketch above).

Run-times of serial and parallel matrix-vector multiplication (seconds)

Speedup and Efficiency
• Speedup: S(n, p) = T_serial(n) / T_parallel(n, p)
• Efficiency: E(n, p) = S(n, p) / p

Speedups of Parallel Matrix-Vector Multiplication

Efficiencies of Parallel Matrix-Vector Multiplication

Scalability
• A program is scalable if the problem size can be increased at a rate such that the efficiency doesn't decrease as the number of processes increases.
• Programs that can maintain a constant efficiency without increasing the problem size are sometimes said to be strongly scalable.
• Programs that can maintain a constant efficiency if the problem size increases at the same rate as the number of processes are sometimes said to be weakly scalable.

Safety Issues in MPI Programs

Safety in MPI programs
• Is this a safe program? (Assume tags/process IDs are assigned properly.)

  Process 0      Process 1
  Send(1)        Send(0)
  Recv(1)        Recv(0)

Safety in MPI programs
• Is this a safe program? (Assume tags/process IDs are assigned properly.)

  Process 0      Process 1
  Send(1)        Send(0)
  Recv(1)        Recv(0)

• It may be unsafe, because the MPI standard allows MPI_Send to behave in two different ways:
  – it can simply copy the message into an MPI-managed buffer and return,
  – or it can block until the matching call to MPI_Recv starts.

Buffering a message implicitly during MPI_Send()
• When you send data, where does it go? One possibility is:
  Process 0: user data → local buffer → the network → Process 1: local buffer → user data
Slide source: Bill Gropp, ANL

Avoiding Buffering
• Avoiding copies uses less memory.
• May use more time.
  Process 0: user data → the network → Process 1: user data
• MPI_Send() waits until a matching receive is executed.
Slide source: Bill Gropp, ANL

Safety in MPI programs
• Many implementations of MPI set a threshold at which the system switches from buffering to blocking:
  – relatively small messages will be buffered by MPI_Send;
  – larger messages will cause it to block.
• If the MPI_Send() executed by each process blocks, no process will be able to start executing a call to MPI_Recv, and the program will hang or deadlock.
  – Each process is blocked waiting for an event that will never happen.

Example of unsafe MPI code with possible deadlocks
• Send a large message from process 0 to process 1.
  – If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive).

  Process 0      Process 1
  Send(1)        Send(0)
  Recv(1)        Recv(0)

• This may be "unsafe" because it depends on the availability of system buffers in which to store the data sent until it can be received.
Slide source: Bill Gropp, ANL

Safety in MPI programs
• A program that relies on MPI-provided buffering is said to be unsafe.
• Such a program may run without problems for various sets of input, but it may hang or crash with other sets.

How can we tell if a program is unsafe?
• Replace MPI_Send() with MPI_Ssend().
• The extra "s" stands for synchronous, and MPI_Ssend is guaranteed to block until the matching receive starts.
• If the new program does not hang or crash, the original program is safe.
• MPI_Send() and MPI_Ssend() have the same arguments.

Some Solutions to the "unsafe" Problem
• Order the operations more carefully:

  Process 0      Process 1
  Send(1)        Recv(0)
  Recv(1)        Send(0)

• Use a simultaneous send and receive in one call:

  Process 0      Process 1
  Sendrecv(1)    Sendrecv(0)

Slide source: Bill Gropp, ANL

Restructuring communication in odd-even sort

Use MPI_Sendrecv() to conduct a blocking send and a receive in a single call.
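For reference, here is a hedged sketch of the earlier ring exchange written with MPI_Sendrecv (buffer names are illustrative). Because the send and the receive are combined in one call, MPI schedules them so the exchange cannot deadlock:

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int right = (rank + 1) % nproc;
    int left  = (rank + nproc - 1) % nproc;

    int sendbuf[10] = {0}, recvbuf[10];
    /* send to the right neighbor and receive from the left neighbor in one call */
    MPI_Sendrecv(sendbuf, 10, MPI_INT, right, 123,
                 recvbuf, 10, MPI_INT, left, 123,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}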

More Solutions to the "unsafe" Problem
• Supply your own space as the buffer for the send:

  Process 0      Process 1
  Bsend(1)       Bsend(0)
  Recv(1)        Recv(0)

• Use non-blocking operations:

  Process 0      Process 1
  Isend(1)       Isend(0)
  Irecv(1)       Irecv(0)
  Waitall        Waitall
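A minimal sketch of the non-blocking variant (not from the slides; it assumes exactly two processes and uses made-up buffer names): both operations are posted before either side waits, so the exchange completes regardless of how MPI_Send would have buffered.

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, other, sendbuf, recvbuf = 0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;              /* assumes exactly two processes, ranks 0 and 1 */
    sendbuf = rank;

    /* post both operations first, then wait: no deadlock regardless of buffering */
    MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}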

Concluding Remarks (1)
• MPI works in C, C++, or Python.
• A communicator is a collection of processes that can send messages to each other.
• Many parallel programs use the SPMD approach.
• Most serial programs are deterministic: if we run the same program with the same input, we'll get the same output.
  – Parallel programs often don't possess this property.
• Collective communications involve all the processes in a communicator.

Concluding Remarks (2)
• Performance evaluation:
  – Use elapsed time or "wall clock time".
  – Speedup = sequential time / parallel time.
  – Efficiency = Speedup / p.
  – If it's possible to increase the problem size (n) so that the efficiency doesn't decrease as p is increased, the parallel program is said to be scalable.
• An MPI program is unsafe if its correct behavior depends on the fact that MPI_Send is buffering its input.