Topic Overview
• Matrix-Matrix Multiplication
• Block Matrix Operations
• A Simple Parallel Matrix-Matrix Multiplication
• Cannon's Algorithm
• Overlapping Communication with Computation
Sahalu Junaidu, ICS 573: High Performance Computing
Matrix-Matrix Multiplication
• Building on our matrix-vector multiplication (Quinn's Chapter 8), we now consider matrix-matrix multiplication: multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B.
• For simplicity, we use the standard serial algorithm, sketched below.
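• A minimal sketch of that triple-nested-loop serial algorithm, assuming the n x n matrices are stored as row-major arrays of doubles (function and variable names are illustrative):

/* Serial matrix-matrix multiplication: C = A x B for n x n row-major matrices. */
void serial_matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];   /* C(i,j) = sum over k of A(i,k) * B(k,j) */
            C[i * n + j] = sum;
        }
}

• The three nested loops perform n³ multiplications and n³ additions in total.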
Block Matrix Operations
• Matrix computations involving scalar algebraic operations on the matrix elements can be expressed in terms of identical operations on submatrices of the original matrix.
• Such algebraic operations on the submatrices are called block matrix operations.
  – useful in matrix multiplication as well as in a variety of other matrix algorithms
• In this view, an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix.
• We perform q³ block matrix multiplications, each involving (n/q) x (n/q) matrices.
  – each requiring (n/q)³ additions and (n/q)³ multiplications (see the sketch below)
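• As an illustration of the block view, a minimal sketch of blocked matrix multiplication, assuming q divides n and the same row-major layout as above (names are illustrative):

/* Blocked version of C = A x B: the q x q x q loop over blocks mirrors the
   scalar i, j, k loops; C must be zero-initialized by the caller. */
void block_matmul(int n, int q, const double *A, const double *B, double *C)
{
    int bs = n / q;   /* each block is an (n/q) x (n/q) submatrix */
    for (int bi = 0; bi < q; bi++)
        for (int bj = 0; bj < q; bj++)
            for (int bk = 0; bk < q; bk++)
                /* C(bi,bj) += A(bi,bk) x B(bk,bj), a block matrix operation */
                for (int i = bi * bs; i < (bi + 1) * bs; i++)
                    for (int j = bj * bs; j < (bj + 1) * bs; j++)
                        for (int k = bk * bs; k < (bk + 1) * bs; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}

• The q³ block multiplications each perform (n/q)³ scalar multiply-add operations, so the total work is unchanged.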
A Simple Parallel Matrix-Matrix Multiplication Algorithm
• Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.
• All-to-all broadcast blocks of A along rows and blocks of B along columns.
• Perform local submatrix multiplication (a sketch of both steps follows below).
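• A condensed sketch of these two steps, assuming the p processes form a √p x √p grid communicator created with MPI_Cart_create, that nlocal = n/√p, and that a, b, c point to this process's local blocks (all names, including the matmul_add helper, are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* c += a * b for nlocal x nlocal row-major blocks (illustrative helper). */
static void matmul_add(int nlocal, const double *a, const double *b, double *c)
{
    for (int i = 0; i < nlocal; i++)
        for (int j = 0; j < nlocal; j++)
            for (int k = 0; k < nlocal; k++)
                c[i * nlocal + j] += a[i * nlocal + k] * b[k * nlocal + j];
}

/* Simple parallel algorithm: all-to-all broadcast of A-blocks along process
   rows and B-blocks along process columns, then local multiplication. */
void simple_parallel_matmul(double *a, double *b, double *c,
                            int nlocal, int sqrt_p, MPI_Comm grid_comm)
{
    MPI_Comm row_comm, col_comm;
    int keep_cols[2] = {0, 1};   /* same row coordinate    -> my process row    */
    int keep_rows[2] = {1, 0};   /* same column coordinate -> my process column */
    MPI_Cart_sub(grid_comm, keep_cols, &row_comm);
    MPI_Cart_sub(grid_comm, keep_rows, &col_comm);

    int blksz = nlocal * nlocal;
    double *arow = malloc((size_t)sqrt_p * blksz * sizeof(double));
    double *bcol = malloc((size_t)sqrt_p * blksz * sizeof(double));

    /* After these two calls, every process holds sqrt_p blocks of A and of B. */
    MPI_Allgather(a, blksz, MPI_DOUBLE, arow, blksz, MPI_DOUBLE, row_comm);
    MPI_Allgather(b, blksz, MPI_DOUBLE, bcol, blksz, MPI_DOUBLE, col_comm);

    for (int k = 0; k < sqrt_p; k++)   /* C(i,j) = sum over k of A(i,k) * B(k,j) */
        matmul_add(nlocal, arow + k * blksz, bcol + k * blksz, c);

    free(arow); free(bcol);
    MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
}

• The two MPI_Allgather calls are the all-to-all broadcasts whose cost is analyzed on the next slide; they are also the reason each process ends up storing √p blocks of each matrix.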
Matrix-Matrix Multiplication: Performance Analysis
• The two all-to-all broadcasts, each among √p processes with messages of size n²/p, take time 2(ts log √p + tw (n²/p)(√p − 1)).
• The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices, taking √p x (n/√p)³ = n³/p time.
• The parallel run time is approximately TP = n³/p + ts log p + 2 tw n²/√p.
Drawback of the Simple Parallel Algorithm
• A major drawback of this algorithm is that it is not memory optimal.
• Each process has √p blocks of both matrices A and B at the end of the communication phase.
  – Since each block requires Θ(n²/p) memory, each process requires Θ(n²/√p) memory.
• The total memory requirement over all the processes is Θ(n²√p), i.e., √p times the memory requirement of the sequential algorithm.
Matrix-Matrix Multiplication: Cannon's Algorithm
• Cannon's algorithm is a memory-efficient version of the simple parallel algorithm
  – with a total memory requirement of Θ(n²).
• Matrices A and B are partitioned into p square blocks, as in the simple parallel algorithm.
• Although every process in the ith row requires all √p submatrices Ai,k, the all-to-all broadcast can be avoided by
  – scheduling the computations of the processes of the ith row such that, at any given time, each process is using a different block Ai,k;
  – systematically rotating these blocks among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.
• If an identical schedule is applied to the columns of B, then no process holds more than one block of each matrix at any time.
Communication Steps in Cannon's Algorithm (figure: the initial alignment of the blocks of A and B, followed by the sequence of single-step shifts)
Communication Steps in Cannon's Algorithm
• First, align the blocks of A and B in such a way that each process multiplies its local submatrices:
  – shift submatrices Ai,j to the left (with wraparound) by i steps;
  – shift submatrices Bi,j up (with wraparound) by j steps.
• After alignment (Figure 8.3 c):
  – process Pi,j has submatrices Ai,(i+j) mod √p and B(i+j) mod √p,j;
  – perform the local block matrix multiplication.
• Next, each block of A moves one step left and each block of B moves one step up (again with wraparound).
• Perform the next block multiplication, add to the partial result, and repeat until all √p blocks have been multiplied (the schedule is sketched below).
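• To make the rotation schedule concrete, a small serial simulation (q = √p blocks per dimension is an assumption of the sketch; the real algorithm runs the inner two loops in parallel, one (i, j) pair per process):

#include <stdio.h>

/* In step s of Cannon's algorithm, process P(i,j) multiplies A(i,k) by B(k,j)
   with k = (i + j + s) mod q.  For a fixed step, the processes of row i all
   use different blocks A(i,k), and those of column j all use different B(k,j). */
void print_cannon_schedule(int q)
{
    for (int s = 0; s < q; s++)
        for (int i = 0; i < q; i++)
            for (int j = 0; j < q; j++) {
                int k = (i + j + s) % q;
                printf("step %d: P(%d,%d) computes C(%d,%d) += A(%d,%d) * B(%d,%d)\n",
                       s, i, j, i, j, i, k, k, j);
            }
}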
Cannon's Algorithm: An Example
• Consider two matrices A and B to be multiplied.
• Assume that these matrices are partitioned into 4 square blocks, A0,0 through A1,1 and B0,0 through B1,1, distributed over a 2 x 2 process grid.
• After the initial alignment, row 1 of A has been shifted left by one block and column 1 of B has been shifted up by one block.
Cannon's Algorithm: An Example
• After this alignment:
  – P0,0 ends up with A0,0 and B0,0 and should compute C0,0;
  – P0,1 ends up with A0,1 and B1,1 and should compute C0,1;
  – P1,0 ends up with A1,1 and B1,0 and should compute C1,0;
  – P1,1 ends up with A1,0 and B0,1 and should compute C1,1.
• Each process then performs its first local block matrix multiplication.
Cannon's Algorithm: An Example
• Shift 1: shift each block of A one step to the left and each block of B one step up (with wraparound).
• Next, each process Pi,j performs the second block multiplication, updating Ci,j.
Cannon's Algorithm: Performance Analysis
• In the alignment step, the maximum distance over which a block shifts is √p − 1.
  – The two shift operations require a total of 2(ts + tw n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes ts + tw n²/p time.
• The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is n³/p.
• The parallel time is approximately TP = n³/p + 2√p ts + 2 tw n²/√p.
MPI_Cart_shift Function
• Shifting data along the dimensions of the 2-D mesh is a frequent operation in Cannon's algorithm.
  – MPI provides the function MPI_Cart_shift for this purpose.
int MPI_Cart_shift(
    MPI_Comm comm_cart,   /* communicator with Cartesian structure (handle)           */
    int dir,              /* coordinate dimension along which to shift                */
    int s_step,           /* shift size/displacement (> 0: up shift, < 0: down shift) */
    int *rank_source,     /* rank of source process                                   */
    int *rank_dest)       /* rank of destination process                              */
• An example program exercising this function is sketched below.
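• A minimal sketch, assuming a periodic 2-D process grid: each process finds its neighbours along dimension 1 and circularly shifts an integer one step to the left (it uses MPI_Sendrecv_replace, introduced on the next slide; variable names are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int p, grid_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Build a periodic 2-D grid; dims = {0, 0} lets MPI choose its shape. */
    int dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Dims_create(p, 2, dims);
    MPI_Comm grid_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);
    MPI_Comm_rank(grid_comm, &grid_rank);

    /* Displacement -1 along dimension 1: rank_dest is my left neighbour,
       rank_source is the right neighbour that will send to me. */
    int src, dst;
    MPI_Cart_shift(grid_comm, 1, -1, &src, &dst);

    int value = grid_rank;
    MPI_Sendrecv_replace(&value, 1, MPI_INT, dst, 0, src, 0,
                         grid_comm, MPI_STATUS_IGNORE);
    printf("process %d now holds %d\n", grid_rank, value);

    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}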
Sending and Receiving Messages Simultaneously
• To exchange messages, MPI provides the following function:
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                 int dest, int sendtag, void *recvbuf, int recvcount,
                 MPI_Datatype recvdatatype, int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)
• The arguments include the arguments to the send and receive functions.
• If we wish to use the same buffer for both send and receive, we can use:
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                         int dest, int sendtag, int source, int recvtag,
                         MPI_Comm comm, MPI_Status *status)
• A parallel program for Cannon's algorithm built on these functions is sketched below.
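• A condensed sketch of such a program, under the same assumptions as the earlier sketches (nlocal x nlocal local blocks a and b, an accumulated block c that starts at zero, a periodic √p x √p grid communicator, and the illustrative matmul_add helper from before):

#include <mpi.h>

void matmul_add(int nlocal, const double *a, const double *b, double *c);  /* helper from the earlier sketch */

void cannon(double *a, double *b, double *c, int nlocal,
            int sqrt_p, MPI_Comm grid_comm)
{
    int rank, coords[2];
    MPI_Comm_rank(grid_comm, &rank);
    MPI_Cart_coords(grid_comm, rank, 2, coords);

    int blksz = nlocal * nlocal;
    int left, right, up, down, src, dst;
    MPI_Cart_shift(grid_comm, 1, -1, &right, &left);  /* row neighbours    */
    MPI_Cart_shift(grid_comm, 0, -1, &down, &up);     /* column neighbours */

    /* Initial alignment: shift A(i,j) left by i steps and B(i,j) up by j steps. */
    MPI_Cart_shift(grid_comm, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, blksz, MPI_DOUBLE, dst, 1, src, 1, grid_comm, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid_comm, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, blksz, MPI_DOUBLE, dst, 1, src, 1, grid_comm, MPI_STATUS_IGNORE);

    /* sqrt_p compute-and-shift steps. */
    for (int step = 0; step < sqrt_p; step++) {
        matmul_add(nlocal, a, b, c);                  /* c += a * b      */
        MPI_Sendrecv_replace(a, blksz, MPI_DOUBLE, left, 1, right, 1,
                             grid_comm, MPI_STATUS_IGNORE);   /* A one step left */
        MPI_Sendrecv_replace(b, blksz, MPI_DOUBLE, up, 1, down, 1,
                             grid_comm, MPI_STATUS_IGNORE);   /* B one step up   */
    }

    /* Undo the initial alignment to restore the original block distribution. */
    MPI_Cart_shift(grid_comm, 1, coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, blksz, MPI_DOUBLE, dst, 1, src, 1, grid_comm, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid_comm, 0, coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, blksz, MPI_DOUBLE, dst, 1, src, 1, grid_comm, MPI_STATUS_IGNORE);
}

• Each MPI_Sendrecv_replace call blocks until its block has been both sent and received, which is exactly the behaviour the next slides set out to overlap with computation.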
Overlapping Communication with Computation
• Our MPI programs so far used blocking send/receive operations to perform point-to-point communication.
• As discussed earlier,
  – a blocking send operation remains blocked until the message has been copied out of the send buffer;
  – a blocking receive operation returns only after the message has been received and copied into the receive buffer.
• In Cannon's algorithm, for example, each process blocks on MPI_Sendrecv_replace
  – until the specified matrix block has been sent and received by the corresponding processes.
• Note that the blocks of matrices A and B do not change as they are shifted among the processes.
  – Thus, we can overlap the transmission of these blocks with the computation for the matrix-matrix multiplication.
  – Many recent distributed-memory parallel computers have dedicated communication controllers that can perform the transmission of messages without interrupting the CPUs.
Non-Blocking Communication Operations
• In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking send and receive operations:
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)
• These operations return before the communication has necessarily completed.
• Function MPI_Test tests whether or not the non-blocking send or receive operation identified by its request has finished:
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
• MPI_Wait waits for the operation to complete (an example is sketched below):
int MPI_Wait(MPI_Request *request, MPI_Status *status)
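• A minimal sketch: rank 0 posts a non-blocking send and rank 1 a non-blocking receive, both can do unrelated work while the message is in flight, and both call MPI_Wait before touching the buffer again (buffer size and names are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1000];
    MPI_Request req = MPI_REQUEST_NULL;   /* ranks other than 0 and 1 post nothing */

    if (rank == 0) {
        for (int i = 0; i < 1000; i++) buf[i] = i;   /* fill the send buffer */
        MPI_Isend(buf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(buf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* ... computation that does not touch buf could be overlapped here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* buf may be read or reused only after this */
    if (rank == 1)
        printf("rank 1 received %g ... %g\n", buf[0], buf[999]);

    MPI_Finalize();
    return 0;
}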
Cannon's Algorithm Using Non-Blocking Operations
• A parallel program for Cannon's algorithm using non-blocking operations is sketched below.
• There are two main differences between this program and the earlier one using blocking operations:
  1. Additional arrays, a_buffers and b_buffers, are used for the blocks of A and B that are being received while the computation involving the previous blocks is performed.
  2. In the main computational loop, it first starts the non-blocking send operations to send the locally stored blocks of A and B to the processes to the left and up in the grid, and then starts the non-blocking receive operations to receive the blocks for the next iteration from the processes to the right and down in the grid.
• After starting these four non-blocking operations, it proceeds to perform the matrix-matrix multiplication of the blocks it currently stores.
• Finally, before it proceeds to the next iteration, it uses MPI_Wait to wait for the send and receive operations to complete.
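• A condensed sketch of that main loop, under the same assumptions as the earlier Cannon sketch (neighbour ranks left/right/up/down already obtained with MPI_Cart_shift, the initial alignment already performed, a_buffers[0]/b_buffers[0] holding the aligned blocks, and the illustrative matmul_add helper):

#include <mpi.h>

void matmul_add(int nlocal, const double *a, const double *b, double *c);  /* helper from the earlier sketch */

/* Compute-and-shift loop with non-blocking operations and double buffering:
   the blocks being multiplied are never the ones currently in transit. */
void cannon_main_loop(double *a_buffers[2], double *b_buffers[2], double *c,
                      int nlocal, int sqrt_p,
                      int left, int right, int up, int down, MPI_Comm grid_comm)
{
    int blksz = nlocal * nlocal;
    MPI_Request reqs[4];

    for (int step = 0; step < sqrt_p; step++) {
        int cur = step % 2, next = (step + 1) % 2;

        /* Start sending the current blocks left/up and receiving the next
           blocks from the processes to the right/below. */
        MPI_Isend(a_buffers[cur],  blksz, MPI_DOUBLE, left,  1, grid_comm, &reqs[0]);
        MPI_Isend(b_buffers[cur],  blksz, MPI_DOUBLE, up,    1, grid_comm, &reqs[1]);
        MPI_Irecv(a_buffers[next], blksz, MPI_DOUBLE, right, 1, grid_comm, &reqs[2]);
        MPI_Irecv(b_buffers[next], blksz, MPI_DOUBLE, down,  1, grid_comm, &reqs[3]);

        /* Overlap: multiply the blocks held in the current buffers while the
           transfers for the next iteration are in flight. */
        matmul_add(nlocal, a_buffers[cur], b_buffers[cur], c);

        /* All four operations must complete before the buffers are reused. */
        for (int i = 0; i < 4; i++)
            MPI_Wait(&reqs[i], MPI_STATUS_IGNORE);
    }
}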