Types of Parallelism

• Overt
  – Parallelism is visible to the programmer
  – Difficult to do (right)
  – Large improvements in performance
• Covert
  – Parallelism is not visible to the programmer
  – Compiler is responsible for parallelism
  – Easy to do
  – Small improvements in performance
Parallel Architectures

• For a long time parallel programs were written with a specific architecture in mind
  – Programs would only run on one type of machine, maybe even only one machine
  – A programmer would have to fit a problem to a specific architecture
• Over time, programmers started writing programs in a particular style
  – The programs are then mapped onto a specific machine by a compiler
Problem Architectures

• Synchronous (SIMD)
  – The same operation is performed on all data points at the same time
• Loosely Synchronous (SPMD)
  – The same operations are performed by all processors, but they need not be done at exactly the same time
  – Not synchronized at the computer clock cycle but rather only macroscopically, “every now and then”
• Asynchronous (MPMD)
  – Every processor executes its own instructions on its own data
SIMD

[Diagram: one controller runs a single program and drives processors P0 … Pn-1 over an interconnection network]
SPMD

Same program - but no longer strictly synchronized

[Diagram: processors P0 … Pn-1 each run their own copy of the same program over an interconnection network]
MPMD

[Diagram: processors P0 … Pn-1 each run a different program (Prog 0 … Prog n-1) over an interconnection network]
Processes

• One can view a parallel program as consisting of a number of independent processes
  – These processes are mapped to the physical processors
    • Ideally(?) one process per processor
  – You can also think of these as threads, although technically threads are a different sort of beast
  – For program development we do not really care about the mapping
• Two ways to create processes
  – Static
    • All processes are specified before execution
    • The system executes a fixed number of processes
    • In a world where there is a mapping between process and processor, this is the only view that makes sense
  – Dynamic
    • Processes can be created at runtime (see the sketch below)
    • More powerful, but incurs overhead at runtime
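As a concrete illustration of dynamic process creation, a minimal sketch using MPI-2's MPI_Comm_spawn (MPI itself is introduced later in these slides; the executable name "worker" and the count of 4 are hypothetical choices for this sketch):

#include <mpi.h>

int main( int argc, char **argv ) {
    MPI_Comm workers;
    int errcodes[ 4 ];

    MPI_Init( &argc, &argv );

    /* Launch four new worker processes at runtime; "worker" is a
       hypothetical executable built separately from this program */
    MPI_Comm_spawn( "worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                    0, MPI_COMM_SELF, &workers, errcodes );

    /* ... talk to the workers through the "workers" intercommunicator ... */

    MPI_Finalize();
    return 0;
}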
Communication

• Communication is vital in any kind of distributed application
• Initially most people wrote their own protocols
  – Tower of Babel effect
• Eventually standards appeared
  – Parallel Virtual Machine (PVM)
  – Message Passing Interface (MPI)
Message Passing

• In basic message passing, processes coordinate activities by explicitly sending and receiving messages
  – Commonly used in distributed-memory MIMD systems
• Programming in a message-passing environment can be achieved by
  – Designing a special parallel language
    • Occam
  – Extending an existing sequential language to include message-passing constructs
    • Inmos C
  – Using a middleware layer that, in conjunction with an existing language, provides message-passing facilities
    • MPI
    • Parallel Java
    • PVM
Synchronous Message Passing

[Timing diagram: two processes rendezvous at each exchange]

  send(2, x)   - blocks until the matching recv() is complete (sync point)
  recv(x)      - blocks until the matching send() is complete
  send(2, y)   - sync point
  recv(y)
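This rendezvous behavior maps onto MPI's synchronous send, MPI_Ssend, which does not complete until the matching receive has started (the MPI calls themselves are introduced later in these slides). A minimal sketch, assuming two processes:

#include <mpi.h>
#include <stdio.h>

int main( int argc, char **argv ) {
    int myRank, x = 0;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );

    if ( myRank == 0 ) {
        x = 42;
        /* Synchronous send: blocks until rank 1 has begun its receive */
        MPI_Ssend( &x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );
    } else if ( myRank == 1 ) {
        /* Blocks until the matching send arrives */
        MPI_Recv( &x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
        printf( "received %d\n", x );
    }

    MPI_Finalize();
    return 0;
}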
Asynchronous Message Passing

[Timing diagram: the sender copies into a buffer and continues]

  send(2, x)   - copies to a buffer and continues; no synchronization point
  recv(x)      - may or may not block
  send(2, y)
  recv(y)
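MPI's buffered send, MPI_Bsend, gives exactly this copy-and-continue behavior: the message is copied into a user-supplied buffer and the call returns without waiting for the receiver. A minimal sketch, assuming two processes:

#include <mpi.h>
#include <stdlib.h>

int main( int argc, char **argv ) {
    int myRank, x = 42, bufSize;
    char *buf;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );

    if ( myRank == 0 ) {
        /* Attach a user buffer; MPI_Bsend copies into it and returns
           immediately, without waiting for the receiver */
        MPI_Pack_size( 1, MPI_INT, MPI_COMM_WORLD, &bufSize );
        bufSize += MPI_BSEND_OVERHEAD;
        buf = malloc( bufSize );
        MPI_Buffer_attach( buf, bufSize );

        MPI_Bsend( &x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD );

        MPI_Buffer_detach( &buf, &bufSize );
        free( buf );
    } else if ( myRank == 1 ) {
        MPI_Recv( &x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    }

    MPI_Finalize();
    return 0;
}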
Broadcast

[Diagram: P0 sends the same data from its buffer to P1, P2, P3]

May or may not be synchronous
Multicast

[Diagram: P0 sends the same data from its buffer to a subset of the processes (P1, P2, P3)]

May or may not be synchronous
Scatter

[Diagram: P0 splits its data and sends a different piece to each of P1, P2, P3]

May or may not be synchronous
Gather

[Diagram: P0 collects a piece of data from each of P1, P2, P3 into one buffer]

May or may not be synchronous
Reduction

• Method to calculate a value under a commutative, associative operation (e.g., sum, product, minimum, maximum, …) in log P steps
• Think of summing the values in a tree:

                 56
              /      \
            34        22
           /  \      /  \
         15    19  14    8
         /\    /\  /\    /\
       10  5 15  4 6  8 1  7
Reduction

• If each node is a process…
• Pairs of nodes at the bottom add their values and pass the result to their parent
• Pairs at the next level do the same
• Repeat until the total (56, in the tree on the previous slide) reaches the root
Reduction

• Instead of a tree, consider the reduction in a group of 8 processors

  Processor:  0   1   2   3   4   5   6   7
  Value:     10   5  15   4   6   8   1   7
Reduction

• Step 1: sum to the even processors - each odd rank sends its value to the even rank just below it

  P1 -> P0: 10+5 = 15    P3 -> P2: 15+4 = 19
  P5 -> P4: 6+8 = 14     P7 -> P6: 1+7 = 8
Reduction

• Step 2: repeat among the even ranks

  P2 -> P0: 15+19 = 34   P6 -> P4: 14+8 = 22
Reduction

• Step 3: repeat one last time

  P4 -> P0: 34+22 = 56   (the total ends up on processor 0)
Think Binary

• Number the 8 processors in binary:

  111  110  101  100  011  010  001  000
Step 1

• Every processor whose low-order bit is 1 sends to the processor with that bit cleared:

  001 -> 000    011 -> 010    101 -> 100    111 -> 110
Step 2

• Of the processors still active, those whose second bit is 1 send to the processor with that bit cleared:

  010 -> 000    110 -> 100
Step 3

• Finally the high-order bit:

  100 -> 000    (processor 000 holds the result)
Programming It

• Mask: 001
• Bitwise AND of each rank with the mask tells a processor what to do this step:
  – rank AND 001 = 001: send (ranks 001, 011, 101, 111)
  – rank AND 001 = 000: receive
Programming It

• Mask: 010
• Among the still-active ranks (000, 010, 100, 110):
  – rank AND 010 = 010: send (ranks 010, 110)
  – rank AND 010 = 000: receive (ranks 000, 100)
Programming It

• Mask: 100
• Between the remaining ranks (000 and 100):
  – rank AND 100 = 100: send (rank 100)
  – rank AND 100 = 000: receive (rank 000)
Reduction

• Okay, now I know who sends when, but…
• How do I know who to send to?
Programming It

• Mask: 001
• Bitwise AND still picks the senders; bitwise XOR with the mask gives the destination:

  111 XOR 001 = 110    101 XOR 001 = 100
  011 XOR 001 = 010    001 XOR 001 = 000
Programming It

• Mask: 010

  110 XOR 010 = 100    010 XOR 010 = 000
Programming It

• Mask: 100

  100 XOR 100 = 000
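Putting the AND/XOR pieces together, a minimal sketch of the whole sum reduction (assuming the number of processes is a power of two; the MPI calls used here are introduced properly later in these slides):

#include <mpi.h>
#include <stdio.h>

int main( int argc, char **argv ) {
    int rank, size, value, partial, mask;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );   /* assumed a power of two */

    value = rank + 1;                         /* something to sum */

    /* Walk the mask through bits 001, 010, 100, ... */
    for ( mask = 1; mask < size; mask <<= 1 ) {
        if ( rank & mask ) {
            /* My bit is set: send my partial sum to rank XOR mask and stop */
            MPI_Send( &value, 1, MPI_INT, rank ^ mask, 0, MPI_COMM_WORLD );
            break;
        } else if ( ( rank ^ mask ) < size ) {
            /* My bit is clear: receive from my partner and accumulate */
            MPI_Recv( &partial, 1, MPI_INT, rank ^ mask, 0,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE );
            value += partial;
        }
    }

    if ( rank == 0 )
        printf( "sum = %d\n", value );
    MPI_Finalize();
    return 0;
}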
Reduce

[Diagram: P1, P2, P3 each send their data to P0, which combines the values (here with +) into its buffer]

May or may not be synchronous
What is MPI?

• A message passing library specification
  – Message-passing model
  – Not a compiler specification (i.e., not a language)
  – Not a specific product
• Designed for parallel computers, clusters, and heterogeneous networks
• Lets users, tool writers, and library developers concentrate on their own code as opposed to low-level communication code
  – API
  – Middleware
The MPI Process

• Development began in early 1992
• Open process / broad participation
  – IBM, Intel, TMC, Meiko, Cray, Convex, nCUBE
  – PVM, p4, Express, Linda, …
  – Laboratories, universities, government
• Final version of the draft in May 1994
• Public and vendor implementations are now widely available
Why Message Passing?

• Message passing is a mature paradigm
  – CSP was developed in 1978
  – Well understood
• Relatively easy to match to distributed hardware
• The goal was to provide a full-featured, portable system
  – Modularity
  – Peak performance
  – Portability
  – Heterogeneity
  – Performance measurement tools
Features

• Communicators
  – A collection of processes that can send messages to each other
• Point-to-point communication
• Collective communication
  – Barrier synchronization
  – Broadcast
  – Gather/scatter data
  – All-to-all exchange of data
  – Global reduction
  – Scan across all members of a communicator
Bare bones MPI Program

#include <mpi.h>

int main( int argc, char **argv ) {
    // Non-MPI stuff can go here

    MPI_Init( &argc, &argv );

    // Your parallel code goes here

    MPI_Finalize();

    // Non-MPI stuff can go here
    return 0;
}
Odds and Ends

• Even though the processes are running on different processors, you can print using printf()
  – No promise about ordering
  – Very useful for debugging
• Supposedly scanf() works as well
  – Be sure to use the -i option
• Although it appears that argc and argv do what you expect, in some implementations they do not work
  – Send messages instead
• Be careful with random number generators
  – If every process seeds with the same value, the numbers will not be very random (see the sketch below)
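A common fix, sketched below: mix the rank into the seed so every process draws a different stream (the multiplier 31 is an arbitrary choice for this sketch):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>

int main( int argc, char **argv ) {
    int myRank;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );

    /* Perturb the seed with the rank so processes diverge */
    srand( (unsigned)time( NULL ) + 31 * myRank );
    printf( "process %d drew %d\n", myRank, rand() );

    MPI_Finalize();
    return 0;
}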
Communicators

• Many MPI calls require a communicator
• A communicator is a collection of processes that can send messages to each other
  – Think of a communicator as defining a group
• Only processes in the same communicator can communicate
• Allows you to segment your communication traffic
• Every process belongs to the MPI_COMM_WORLD communicator
Getting Information

• You can gather information about your environment
• MPI_Comm_rank( communicator, &retVal );
  – Returns this process's rank (0 through size-1)
• MPI_Get_processor_name( str_array, &length );
  – Returns information about the processor
  – The buffer should be MPI_MAX_PROCESSOR_NAME characters long
HelloWorldPrint.c

#include <stdio.h>
#include <mpi.h>

int main( int argc, char **argv ) {
    int myRank;
    int nameLen;
    char myName[ MPI_MAX_PROCESSOR_NAME ];

    /* Initialize MPI */
    MPI_Init( &argc, &argv );

    /* Obtain information about the process */
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
    MPI_Get_processor_name( myName, &nameLen );

    /* Standard print */
    printf( "Hello world from process #%d on %s\n", myRank, myName );

    /* Terminate MPI */
    MPI_Finalize();
    return 0;
}
Compiling Parallel Programs

• All clusters within the CS department are running Sun's HPC software
  – Contains a variety of tools - including MPI
  – Everything (including documentation) is in /opt/SUNWhpc
  – Executables are in /opt/SUNWhpc/bin
    • You should probably add that to your path
  – Note that only the "clusters" have this software installed
  – See http://www.cs.rit.edu/~ark/runningpj.shtml for details
• Compile MPI C programs using

  mpcc HelloWorldPrint.c -o hello -lmpi
CS Parallel Resources

• SMP parallel computers
  – paradise/parasite -- 8 processors, 1.35 GHz clock, 16 GB RAM
  – paradox/paragon -- 4 processors, 450 MHz clock, 4 GB RAM
• Cluster parallel computer
  – paranoia.cs.rit.edu (296 MHz clock, 192 MB RAM)
    • 32 backend computers (thug01 through thug32) -- each an UltraSPARC-IIe CPU, 650 MHz clock, 1 GB RAM
    • 100-Mbps switched Ethernet backend interconnection network
• Hybrid SMP cluster parallel computer (not for class use)
  – tardis.cs.rit.edu (CPU, 650 MHz clock, 512 MB RAM)
    • 10 backend computers (dr00 through dr09) -- each with two four-core AMD Opteron processors, 2.6 GHz clock, 8 GB RAM
    • 1-Gbps switched Ethernet backend interconnection
Running Parallel Programs

• Rules of engagement
  – Use the paradox and paradise machines to run SMP parallel programs
  – Use the java mprun command on the paranoia machine to run MPI cluster parallel programs; do not use the mprun command directly
  – Run Parallel Java cluster programs on the paranoia machine
  – Details at: http://www.cs.rit.edu/~ark/runningpj.shtml
• Account setup
  – You need to set up your account so you can ssh to the parallel machines without specifying a password
  – You need to include the Parallel Java libraries in your classpath
Sample Run

paranoia> mpcc HelloWorldPrint.c -o hello -lmpi
paranoia> java mprun -np 6 hello
Job 2, thug05, thug06, thug07, thug08, thug09, thug10
Hello world from process #1 on thug06 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
Hello world from process #2 on thug07 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
Hello world from process #3 on thug08 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
Hello world from process #4 on thug09 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
Hello world from process #5 on thug10 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
Hello world from process #0 on thug05 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems
paranoia>
Sending/Receiving Messages

• MPI places messages inside "envelopes"
• Point-to-point messages are sent/received using
  – MPI_Send( buffer, count, type, dest, tag, comm );
  – MPI_Recv( buffer, count, type, src, tag, comm, &status );
• These are blocking calls
  – They return when the buffer is available/full
Send/Recv Parameters

• buffer
  – Where the data is; typically an array
• count
  – Number of items in the buffer (not bytes - items)
• type
  – MPI type of the data
• dest
  – Rank where the message is being sent
• tag
  – An integer label attached to the message (see the next slides)
• comm
  – The communicator
MPI Data Types

  MPI Type     C Type
  ----------   ---------
  MPI_CHAR     char
  MPI_SHORT    short int
  MPI_INT      int
  MPI_LONG     long
  MPI_FLOAT    float
  MPI_DOUBLE   double
Source and Tag

• source
  – Who you want to receive messages from
• tag
  – The tag on messages you are willing to receive
• recv() will filter messages
  – Source and tag must match the message in order to receive it
  – Things that do not match are received and buffered
• You can specify wild cards (see the sketch below)
  – MPI_ANY_SOURCE
  – MPI_ANY_TAG
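A minimal sketch of the wild cards in action: every nonzero rank sends its rank to process 0, which accepts the messages in whatever order they arrive and uses the status structure to see who each one came from:

#include <stdio.h>
#include <mpi.h>

int main( int argc, char **argv ) {
    int myRank, numProcesses, data, i;
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
    MPI_Comm_size( MPI_COMM_WORLD, &numProcesses );

    if ( myRank != 0 ) {
        MPI_Send( &myRank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD );
    } else {
        for ( i = 1; i < numProcesses; i++ ) {
            /* Accept from anyone, with any tag; status tells us who/what */
            MPI_Recv( &data, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &status );
            printf( "got %d from rank %d (tag %d)\n",
                    data, status.MPI_SOURCE, status.MPI_TAG );
        }
    }

    MPI_Finalize();
    return 0;
}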
HelloWorldMsg.c

#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define MAX_MESSAGE_SIZE 100

void sender( int myRank );
void receiver( int numProcesses );

int main( int argc, char **argv ) {
    int myRank;
    int numProcesses;

    /* Start up MPI and get basic info */
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );
    MPI_Comm_size( MPI_COMM_WORLD, &numProcesses );

    /* Everyone except process 0 sends a message */
    if ( myRank != 0 )
        sender( myRank );
    else
        receiver( numProcesses );

    /* Terminate MPI */
    MPI_Finalize();
    return 0;
}
HelloWorldMsg.c

void sender( int myRank ) {
    int nameLen;
    char myName[ MPI_MAX_PROCESSOR_NAME ];
    char message[ MAX_MESSAGE_SIZE ];

    /* Obtain information about the process */
    MPI_Get_processor_name( myName, &nameLen );

    /* Prepare the message */
    sprintf( message, "Hello from process #%d on %s", myRank, myName );

    MPI_Send( message, strlen( message ) + 1, MPI_CHAR,
              0,                  /* destination */
              0,                  /* tag */
              MPI_COMM_WORLD );
}
HelloWorldMsg.c

void receiver( int numProcesses ) {
    int source;
    char message[ MAX_MESSAGE_SIZE ];
    MPI_Status status;

    /* Read a message from every other process */
    for ( source = 1; source < numProcesses; source++ ) {
        MPI_Recv( message, MAX_MESSAGE_SIZE, MPI_CHAR, source,
                  0,              /* tag */
                  MPI_COMM_WORLD, &status );
        printf( "%s (status = %d)\n", message, status.MPI_ERROR );
    }
}
Sample Run

paranoia> mpcc HelloWorldMsg.c -o hello -lmpi
paranoia> java mprun -np 8 hello
Job 4, thug23, thug24, thug25, thug26, thug27, thug28, thug29, thug30
Hello from process #1 on thug24 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems (status = 0)
Hello from process #2 on thug25 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems (status = 0)
Hello from process #3 on thug26 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems (status = 0)
Hello from process #4 on thug27 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems (status = 0)
Hello from process #5 on thug28 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems (status = 0)
Hello from process #6 on thug29 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems (status = 0)
Hello from process #7 on thug30 SunOS 5.9 SUNW,Sun-Blade-100 Sun_Microsystems (status = 0)
paranoia>
Hot Potato Program

• A manager generates messages to send to workers
  – Messages are integer arrays
    • msg[0]      message id
    • msg[1]      number of workers to visit
    • msg[2]      workers visited so far
    • msg[3..n-1] ranks of the workers visited
  – The manager sends a fixed number of messages
    • A fixed number can be outstanding at a time
• Workers receive messages
  – A worker records its visit in the message
  – If the message has been passed around the specified number of times
    • The message is sent back to the manager
  – Else the message is sent to a random worker
• Forms a general template for manager/worker programs
Shutting Things Down

• The workers do not know when to shut down
• The manager needs to tell them
• Two message tags are used in the program (see the sketch below)
  – WORK_TAG       normal message to process
  – TERMINATE_TAG  shutdown message
• The manager sends a terminate message to each worker after it has received all of the messages back from the workers
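A sketch of the worker side of this protocol; the tag names match the slide, while the tag values and the MAX_MSG bound are assumptions for this sketch:

#include <mpi.h>

#define WORK_TAG      1    /* tag values are an assumption */
#define TERMINATE_TAG 2
#define MAX_MSG     100    /* assumed upper bound on message length */

void worker( void ) {
    int msg[ MAX_MSG ];
    MPI_Status status;

    for ( ;; ) {
        /* Accept the next potato from anyone, with either tag */
        MPI_Recv( msg, MAX_MSG, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status );
        if ( status.MPI_TAG == TERMINATE_TAG )
            break;                /* the manager says we are done */
        /* ... record this visit, then send the message on to the
           manager or to a random worker, as the slides describe ... */
    }
}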
MPI Collective Communications

• MPI collective communication routines differ in many ways from MPI point-to-point communication routines:
  – Involve coordinated communication within a group of processes identified by an MPI communicator
  – Substitute for a more complex sequence of point-to-point calls
  – All routines block until they are locally complete
  – Communications may, or may not, be synchronized (implementation dependent)
  – In some cases, a root process originates or receives all data
  – The amount of data sent must exactly match the amount of data specified by the receiver
  – Many variations on the basic categories
  – No message tags are needed
• MPI collective communication can be divided into three subsets: synchronization, data movement, and global computation
Data Movement

• MPI provides several kinds of collective data movement routines:
  – broadcast
  – gather
  – scatter
  – allgather
  – alltoall
• Let's take a look at the functionality and syntax of these routines
Broadcast

• Exactly what it says
  – The implementation will do whatever is most efficient for the given hardware (it might use a tree, like a reduction in reverse)

  MPI_Bcast( buffer, count, datatype, root, communicator );

• What might catch you by surprise is that the receiving processes call MPI_Bcast() as well
• I often use broadcast to distribute parameters (see the sketch below)
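For example, a minimal sketch of distributing a couple of run parameters with MPI_Bcast (the parameter values here are made up):

#include <mpi.h>

int main( int argc, char **argv ) {
    int myRank;
    double params[ 2 ];          /* e.g., a tolerance and a step size */

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &myRank );

    if ( myRank == 0 ) {         /* only the root fills in the values */
        params[ 0 ] = 1.0e-6;
        params[ 1 ] = 0.01;
    }
    /* Every process, root included, makes the same call; afterwards
       all processes hold the root's values */
    MPI_Bcast( params, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD );

    MPI_Finalize();
    return 0;
}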
Reduce

• MPI provides functions that perform standard reductions across processes

  int MPI_Reduce( void *operand, void *result, int count,
                  MPI_Datatype datatype, MPI_Op operator,
                  int root, MPI_Comm comm );

  Operation Name   Meaning
  --------------   ----------------
  MPI_MAX          Maximum
  MPI_MIN          Minimum
  MPI_SUM          Sum
  MPI_PROD         Product
  MPI_LAND         Logical and
  MPI_BAND         Bitwise and
  MPI_LOR          Logical or
  MPI_BOR          Bitwise or
  MPI_LXOR         Logical xor
  MPI_BXOR         Bitwise xor
  MPI_MAXLOC       Max and location
  MPI_MINLOC       Min and location
Dot Product

• The dot product of two vectors is defined as
  – x · y = x_0 y_0 + x_1 y_1 + x_2 y_2 + … + x_{n-1} y_{n-1}
• Imagine having two vectors, each containing n elements, stored on p processors
  – Each processor will have N = n/p elements
• Let's assume a block distribution of the data, meaning
  – P0 has x_0, x_1, …, x_{N-1} and y_0, y_1, …, y_{N-1}
  – P1 has x_N, x_{N+1}, …, x_{2N-1} and y_N, y_{N+1}, …, y_{2N-1}
  – …
Serial_dot

float Serial_dot( float x[], float y[], int n ) {
    int i;
    float sum = 0.0;

    for ( i = 0; i < n; i++ ) {
        sum = sum + x[ i ] * y[ i ];
    }
    return sum;
}
Parallel_dot

float Parallel_dot( float local_x[], float local_y[], int local_n ) {
    float local_dot;
    float dot = 0.0;

    local_dot = Serial_dot( local_x, local_y, local_n );
    MPI_Reduce( &local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0,
                MPI_COMM_WORLD );
    return dot;    /* only process 0 will have the result */
}
MPI_Allreduce

• Note that MPI_Reduce() leaves the result in the root process only
• What if you wanted the result everywhere?
  – You could reduce and then broadcast
• Instead, consider the modified reduction, MPI_Allreduce(), sketched below
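A sketch of the dot product from the previous slides rewritten with MPI_Allreduce; note there is no root argument, and every rank gets the result:

/* Assumes Serial_dot() from the earlier slide is in scope */
float Parallel_dot_all( float local_x[], float local_y[], int local_n ) {
    float local_dot, dot;

    local_dot = Serial_dot( local_x, local_y, local_n );
    MPI_Allreduce( &local_dot, &dot, 1, MPI_FLOAT, MPI_SUM,
                   MPI_COMM_WORLD );   /* no root argument */
    return dot;                        /* valid on all ranks */
}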
MPE

• The MPI Parallel Environment (MPE) is a software package that contains a number of useful tools
  – Profiling library
  – Viewers for logfiles
  – Parallel X graphics library
  – Debugger setup routines
• MPE is not part of the Sun HPC package, but it works with it; I have compiled and installed it in my account
X Routines

• MPE_Open_graphics - (collectively) opens an X Windows display
• MPE_Draw_point - draws a point on an X Windows display
• MPE_Draw_points - draws points on an X Windows display
• MPE_Draw_line - draws a line on an X11 display
• MPE_Fill_rectangle - draws a filled rectangle on an X11 display
• MPE_Update - updates an X11 display
• MPE_Close_graphics - closes an X11 graphics device
• MPE_Xerror( returnVal, functionName )
• MPE_Make_color_array - makes an array of color indices
• MPE_Num_colors - gets the number of available colors
• MPE_Draw_circle - draws a circle
• MPE_Draw_logic - sets the logical operation for laying down new pixels
• MPE_Line_thickness - sets the thickness of lines
• MPE_Add_RGB_color( graph, red, green, blue, mapping )
• MPE_Get_mouse_press - waits for a mouse button press
• MPE_Iget_mouse_press - checks for a mouse button press
• MPE_Get_drag_region - gets a "rubber-band box" region (or circle)
Using X Routines

#include "mpe.h"
#include "mpe_graphics.h"

MPE_XGraph win;

MPE_Open_graphics( &win,                    // Display handle
                   MPI_COMM_SELF,           // Communicator
                   (char *)0,               // X display
                   -1, -1,                  // Location on screen
                   500, 500,                // Size
                   MPE_GRAPH_INDEPENDENT ); // Collective?

MPE_Draw_point( win,        // Display handle
                col, row,   // Coordinates of the point
                color );    // Color to use

MPE_Close_graphics( &win ); // Display handle
Compiling

• A little more to compiling:

  mpcc -I/home/fac/ptt/pub/mpe/include
       -L/home/fac/ptt/pub/mpe/lib
       -o mandel Mandelbrot.c
       -lmpi -lmpe -lX11

• Run it the same way
Timing Parallel Programs

• There are two things you can time
  – The actual time your program takes to run: wall time
    • On a multi-tasking computer this will include the time the processor spends executing other programs or performing system-related tasks
  – The time the processor actually spends running your program: CPU time
    • Preferable, since it will not be affected by system load
• Wall time is what you are trying to reduce in order to increase your throughput, and hence is the most common measurement people use
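A sketch of measuring both quantities in C, using gettimeofday() for wall time and the standard clock() for CPU time (the dummy loop just stands in for real work):

#include <stdio.h>
#include <time.h>
#include <sys/time.h>

int main( void ) {
    struct timeval w0, w1;
    clock_t c0, c1;
    double wall, cpu;
    long i;
    volatile double x = 0.0;

    gettimeofday( &w0, NULL );      /* wall-clock start */
    c0 = clock();                   /* CPU-time start */

    for ( i = 0; i < 100000000L; i++ )  /* stand-in for real work */
        x += 1.0;

    c1 = clock();
    gettimeofday( &w1, NULL );

    wall = ( w1.tv_sec - w0.tv_sec ) + ( w1.tv_usec - w0.tv_usec ) / 1.0e6;
    cpu  = ( c1 - c0 ) / (double)CLOCKS_PER_SEC;
    printf( "wall = %.3f s, CPU = %.3f s\n", wall, cpu );
    return 0;
}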
time

• Unix systems provide the time command, which can be used to time the execution of a program

  parasite> time mprun -np 4 pmonte
  PI 3.143032

  real    0m2.622s    wall time
  user    0m0.030s    CPU time
  sys     0m0.160s    system time
Comments

• On a heavily loaded system, wall time will be much larger than CPU time
  – On a lightly loaded system they should be about the same
• Some multi-processor systems report the sum of the CPU times
  – Ours appears not to
• Timing resolution is often restricted
  – So the time command is only useful for long runs
• You can only get a global perspective
  – It can't be used to time sections of code
MPI Timers

• MPI defines library routines that can be used for timing:
  – MPI_Wtime()
    • Returns the time in seconds since some time in the past
  – MPI_Wtick()
    • Queries the resolution of the timer
• The values returned by MPI_Wtime() are not synchronized across processes
  – It only makes sense to compare values from the same processor
• Unfortunately there is no CPU timer in MPI, probably for the reasons discussed above
Using MPI Timers

double start_time, end_time;

MPI_Init( &argc, &argv );
start_time = MPI_Wtime();

// Parallel stuff happens here

end_time = MPI_Wtime();
if ( myRank == 0 ) {
    printf( "Elapsed time: %.16f\n", end_time - start_time );
    printf( "Clock resolution: %.16f\n", MPI_Wtick() );
}
MPI_Finalize();
Sample Output

parasite> mprun -np 4 pmonte
PI 3.144288
Elapsed time: 0.4880635365843773
Clock resolution: 0.000040322581

parasite> time mprun -np 4 pmonte
PI 3.140424
Elapsed time: 0.4884441774338484
Clock resolution: 0.000040322581

real    0m2.707s
user    0m0.020s
sys     0m0.190s