Message Passing Models
Miodrag Bolic

Overview
• Hardware model
• Programming model
• Message Passing Interface

Generic Model of a Message-Passing Multicomputer [5]
[Figure: nodes connected by a message-passing direct interconnection network. Credit: Gyula Fehér]

Generic Node Architecture [5]
[Figure: structure of a node. Labels: node processor (processor + local memory + ...), communication processor, router (switch unit + ...), internal channel(s), external channels. Credit: Gyula Fehér]
• Fat node
  – powerful processor
  – large memory
  – many chips
  – costly per node
  – moderate parallelism
• Thin node
  – small processor
  – small memory
  – one to a few chips
  – cheap per node
  – high parallelism

Generic Organization Model [5]
[Figure: organization variants built from P+M units, CP units, and switches (S) around a switching network. Surviving panel titles: (b) Decentralized, (c) Centralized. Credit: Gyula Fehér]

Message Passing Properties [1]
• Complete computer as building block, including I/O
• Programming model: directly access only private address space (local memory)
• Communication via explicit messages (send/receive)
• Communication integrated at I/O level, not memory system, so no special hardware
• Resembles a network of workstations (which can actually be used as multiprocessor systems)

Message Passing Program [1]
• Problem: Sum all of the elements of an array of size n.

    INITIALIZE;  // assign proc_num and num_procs
    if (proc_num == 0)  // processor 0 is the master: it sends out the work and sums the results
    {
        read_array(array_to_sum, size);  // read the array and array size from file
        size_to_sum = size / num_procs;  // assumes size is a multiple of num_procs
        for (current_proc = 1; current_proc < num_procs; current_proc++)
        {
            lower_ind = size_to_sum * current_proc;
            upper_ind = size_to_sum * (current_proc + 1);
            SEND(current_proc, size_to_sum);
            SEND(current_proc, array_to_sum[lower_ind : upper_ind]);
        }
        // master node sums its own part of the array
        sum = 0;
        for (k = 0; k < size_to_sum; k++)
            sum += array_to_sum[k];
        global_sum = sum;
        // collect and add the partial sums from the slaves
        for (current_proc = 1; current_proc < num_procs; current_proc++)
        {
            RECEIVE(current_proc, local_sum);
            global_sum += local_sum;
        }
        printf("sum is %d\n", global_sum);
    }
    else  // any processor other than proc_num 0 is a slave
    {
        sum = 0;
        RECEIVE(0, size_to_sum);
        RECEIVE(0, array_to_sum[0 : size_to_sum]);
        for (k = 0; k < size_to_sum; k++)
            sum += array_to_sum[k];
        SEND(0, sum);
    }
    END;
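The program above computes size_to_sum as size / num_procs, so when size is not a multiple of num_procs the trailing elements are never summed. One possible remedy, sketched here in C, is a small helper that hands the first size % num_procs processes one extra element. The name chunk_bounds and its signature are hypothetical and not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: compute the half-open range [*lo, *hi) owned by
   process `proc` when `size` elements are split across `num_procs`
   processes. The first (size % num_procs) processes get one extra element. */
void chunk_bounds(size_t size, size_t num_procs, size_t proc,
                  size_t *lo, size_t *hi)
{
    size_t base, extra;
    assert(num_procs > 0 && proc < num_procs);
    base = size / num_procs;   /* elements every process receives */
    extra = size % num_procs;  /* leftover elements to hand out */
    *lo = proc * base + (proc < extra ? proc : extra);
    *hi = *lo + base + (proc < extra ? 1 : 0);
}
```

With size = 10 and num_procs = 4 the four ranges are [0,3), [3,6), [6,8) and [8,10), so every element is assigned to exactly one process.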

Message Passing Program (cont.) [1]
Multiprocessor software functions provided:
• INITIALIZE – assigns a number (proc_num) to each processor in the system and assigns the total number of processors (num_procs).
• SEND(receiving_processor_number, data) – sends data to another processor.
• RECEIVE(sending_processor_number, data) – receives data from another processor (the counterpart of SEND, as used in the program).
• BARRIER(n_procs) – when a BARRIER is encountered, a processor waits at that BARRIER until n_procs processors have reached it; then execution can proceed.

Advantages and Disadvantages [1]
• Advantages
  – Easier to build than scalable shared memory machines
  – Easy to scale (but topology is important)
  – Programming model more removed from basic hardware operations
  – Coherency and synchronization are the responsibility of the user, so the system designer need not worry about them.
• Disadvantages
  – Large overhead: copying of buffers requires large data transfers (this will kill the benefits of multiprocessing if not kept to a minimum).
  – Programming is more difficult.
  – Blocking nature of SEND/RECEIVE can cause increased latency and deadlock issues.

Message-Passing Interface – MPI [3]
• Standardization – MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries.
• Portability – There is no need to modify your source code when you port your application to a different platform that supports the MPI standard.
• Performance opportunities – Vendor implementations should be able to exploit native hardware features to optimize performance.
• Functionality – Over 115 routines are defined.
• Availability – A variety of implementations are available, both vendor and public domain.

MPI basics [3]
• Start processes
• Send messages
• Receive messages
• Synchronize
With these four capabilities, you can construct any program.

Communicators [3]
• Provide a named set of processes for communication:
  – System-allocated unique tags for processes
  – All processes can be numbered from 0 to n-1
  – Allow construction of libraries: application creates communicators
• MPI_COMM_WORLD: the predefined communicator that contains all processes
  – MPI uses objects called communicators and groups to define which collection of processes may communicate with each other.
  – Provide functions (split, duplicate, ...) for creating communicators from other communicators
  – Functions (size, my_rank, ...) for finding out about all processes within a communicator
• Blocking vs. non-blocking

Hello world example [3]

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char** argv)
    {
        int my_PE_num;  /* rank of this process */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);
        printf("Hello from %d.\n", my_PE_num);
        MPI_Finalize();
        return 0;
    }

Hello world example [3]
Output from a run with eight processes (the order is not deterministic):

    Hello from 5.
    Hello from 3.
    Hello from 1.
    Hello from 2.
    Hello from 7.
    Hello from 0.
    Hello from 6.
    Hello from 4.

MPMD [3]
Use MPI_Comm_rank:

    if (my_PE_num == 0)
        Routine 1
    else if (my_PE_num == 1)
        Routine 2
    else if (my_PE_num == 2)
        Routine 3
    ...

Blocking Sending and Receiving Messages [3]

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char** argv)
    {
        int my_PE_num, numbertoreceive, numbertosend = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_PE_num);

        if (my_PE_num == 0) {
            /* rank 0 posts a single receive, from any sender and with any tag */
            MPI_Recv(&numbertoreceive, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("Number received is: %d\n", numbertoreceive);
        }
        else
            /* every other rank sends one int to rank 0 with tag 10 */
            MPI_Send(&numbertosend, 1, MPI_INT, 0, 10, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Non-Blocking Message Passing Routines [4]

    #include "mpi.h"
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int numtasks, rank, next, prev, buf[2], tag1 = 1, tag2 = 2;
        MPI_Request reqs[4];
        MPI_Status stats[4];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* neighbours in a ring, wrapping around at both ends */
        prev = rank - 1;
        next = rank + 1;
        if (rank == 0) prev = numtasks - 1;
        if (rank == (numtasks - 1)) next = 0;

        /* post the receives and sends; none of these calls blocks */
        MPI_Irecv(&buf[0], 1, MPI_INT, prev, tag1, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&buf[1], 1, MPI_INT, next, tag2, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&rank, 1, MPI_INT, prev, tag2, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&rank, 1, MPI_INT, next, tag1, MPI_COMM_WORLD, &reqs[3]);

        /* do some work while the messages are in flight */

        /* wait until all four requests have completed */
        MPI_Waitall(4, reqs, stats);

        MPI_Finalize();
        return 0;
    }

Collective Communications [3]
• The communicator specifies a process group to participate in a collective communication
• MPI implements various optimized functions:
  – Barrier synchronization
  – Broadcast
  – Reduction operations, with one destination or with all processes in the group as destination
• Collective operations are blocking (MPI-3 later added non-blocking variants such as MPI_Ibarrier and MPI_Ibcast)

Comparison: MPI vs. OpenMP

    Features                               OpenMP                MPI
    Apply parallelism in steps             yes                   no
    Scale to large number of processors    maybe                 yes
    Code complexity                        Small increase        Major increase
    Runtime environment                    Expensive compilers   Free
    Cost of hardware                       Very expensive        Cheap

References
1. J. Kowalczyk, "Multiprocessor Systems," Xilinx, 2003.
2. D. Culler, J. P. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.
3. MPI Basics
4. Message Passing Interface (MPI)
5. D. Sima, T. Fountain and P. Kacsuk, Advanced Computer Architectures: A Design Space Approach, Pearson, 1997.