Paraguin Compiler Communication
July 24, 2012
© copyright 2012, Clayton S. Ferner, UNC Wilmington

I/O
Now that we have parallelized the main loop nest of the program, we need to get the input data from a file and distribute it to all of the processors. So how do we do this?

I/O
First, we will have only the master thread read the input from a file, because we are not using parallel file I/O. This is easily done by putting the file I/O outside of a parallel region.

I/O

    if (argc < 2) {
        fprintf(stderr, usage, argv[0]);
        return -1;
    }

    if ((fd = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "%s: Cannot open file %s for reading.\n", argv[0], argv[1]);
        fprintf(stderr, usage, argv[0]);
        return -1;
    }

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            fscanf(fd, "%f", &a[i][j]);

    ;
    #pragma paraguin begin_parallel
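
For reference, here is a minimal sketch of the declarations this fragment assumes; the names fd, a, usage, N, i, and j come from the slides, while the types, sizes, and file scope shown here are assumptions for illustration:

    #include <stdio.h>

    #define N 512                                   /* problem size (assumed value) */

    const char *usage = "usage: %s <inputfile>\n";  /* error/usage message */
    FILE *fd;                                       /* input file handle */
    float a[N+2][N+2];                              /* sized so the later elimination step can
                                                       index rows 1..N and columns up to N+1 */
    int i, j;                                       /* loop indices */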

How do we get the data to the other processors?
MPI provides a number of ways to do this. Since MPI commands can be put directly in the source code, you can still do everything in Paraguin that you can do in MPI. For example:

    #ifdef PARAGUIN
        MPI_Scatter(a, blksz, MPI_FLOAT, b, blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);
    #endif
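
The PARAGUIN guard presumably makes the MPI call visible only when the Paraguin compiler processes the file, so an ordinary sequential build ignores it (that reading is an inference from the slide). A minimal sketch of how the scatter might be set up, where blksz, NP, and rank are illustrative names rather than anything Paraguin requires:

    #ifdef PARAGUIN
        int NP, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &NP);     /* number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */

        int blksz = (N * N) / NP;               /* elements per process; assumes NP divides N*N evenly */

        /* Rank 0 sends blksz consecutive floats of a to each process's buffer b */
        MPI_Scatter(a, blksz, MPI_FLOAT, b, blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);
    #endif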

Paraguin only provides a broadcast. This sends the entire data set to the other processors. Its complexity is O(log2 NP), as opposed to O(NP), where NP is the number of processors. For example:

    ;
    #pragma paraguin bcast a

This pragma should be inside a parallel region.
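
Putting the pragmas seen so far together, a minimal sketch of where the broadcast sits is below. The empty statements before the pragmas follow the slides' own examples; the end_parallel pragma is an assumption about how the region is closed, since it does not appear on these slides:

    /* The master reads the matrix a from the file here, outside the parallel region */

    ;
    #pragma paraguin begin_parallel

    /* Send the entire array a to every processor: O(log2 NP) communication steps */
    ;
    #pragma paraguin bcast a

    /* ... parallelized loop nest that uses a ... */

    ;
    #pragma paraguin end_parallel               /* assumed closing pragma */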

Gather
Paraguin does provide a mechanism for gather; however, it does not use MPI_Gather and is not as efficient: Paraguin sends an individual message from each processor back to the master thread. To use the Paraguin gather, you need to specify which array accesses and values are the final values.

Syntax:

    #pragma paraguin gather <array reference> <matrix>

where <matrix> is a system of inequalities that identifies which iterations produce the final values.

Example: array reference 0 and the iterations where i = j - 1:

    #pragma paraguin gather 0x0
             C    i    j    k
            -1   -1    1    0
             1    1   -1    0

So what is the array reference? Array references are enumerated starting at 0. The iterations where i = j - 1 are the ones in which the value of the array element is written for the last time and not changed again (the final value).
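
Reading the reconstructed matrix row by row makes the constraint explicit. Assuming each row lists the coefficients (C, i, j, k) of an inequality of the form C + c_i·i + c_j·j + c_k·k >= 0 (an assumption about the convention, consistent with this example), the two rows say:

\[
\begin{aligned}
-1 - i + j &\ge 0 &&\iff\ i \le j - 1,\\
 1 + i - j &\ge 0 &&\iff\ i \ge j - 1,
\end{aligned}
\]

and together they force i = j - 1, exactly the iterations that produce the final values.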

Consider the elimination step of Gaussian Elimination:

    for (i = 1; i <= N; i++)
        for (j = i+1; j <= N; j++)
            for (k = N+1; k >= i; k--)
                a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

The five array references in the assignment are enumerated left to right: the write a[j][k] on the left-hand side is reference 0, and the reads a[j][k], a[i][k], a[j][i], and a[i][i] are references 1, 2, 3, and 4.

You Can Still Use MPI Commands
Remember, you can still put MPI commands in directly:

    #ifdef PARAGUIN
        MPI_Gather(a, blksz, MPI_FLOAT, b, blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);
    #endif

Dependency and Communication
If there is a data dependence across iterations (a loop-carried dependency), then that data must be transmitted from the processor that computes the value to the processor that needs it. Paraguin can generate the MPI code to perform the communication, but you need to tell it about the dependency.

Data Dependency
Consider the elimination step of Gaussian Elimination again:

    for (i = 1; i <= N; i++)
        for (j = i+1; j <= N; j++)
            for (k = N+1; k >= i; k--)
                a[j][k] = a[j][k] - a[i][k] * a[j][i] / a[i][i];

The dependence of interest here is between reference 0 (the write to a[j][k]) and reference 2 (the read of a[i][k]).

Data Dependency
This is called a loop-carried dependence.

    <i, j, k>
    <1, 2, 5>: a[2][5] = a[2][5] - a[1][5] * a[2][1] / a[1][1]
    <1, 2, 4>: a[2][4] = a[2][4] - a[1][4] * a[2][1] / a[1][1]
    <1, 2, 3>: a[2][3] = a[2][3] - a[1][3] * a[2][1] / a[1][1]
    <1, 2, 2>: a[2][2] = a[2][2] - a[1][2] * a[2][1] / a[1][1]
    <1, 2, 1>: a[2][1] = a[2][1] - a[1][1] * a[2][1] / a[1][1]
    . . .
    <2, 3, 5>: a[3][5] = a[3][5] - a[2][5] * a[3][2] / a[2][2]
    <2, 3, 4>: a[3][4] = a[3][4] - a[2][4] * a[3][2] / a[2][2]
    <2, 3, 3>: a[3][3] = a[3][3] - a[2][3] * a[3][2] / a[2][2]
    <2, 3, 2>: a[3][2] = a[3][2] - a[2][2] * a[3][2] / a[2][2]

For example, a[2][5] is written at iteration <1, 2, 5> and then read (as a[i][k]) at iteration <2, 3, 5>.

Data Dependency
So we need to tell Paraguin which array references depend on which references, and the mapping of the read iteration instance to the write iteration instance. The format of a dependence pragma is:

    #pragma paraguin dep <write array reference> <read array reference> <matrix>

where <matrix> is a system of inequalities that maps the read iteration instance to the write iteration instance.

Data Dependency
We are going to skip the details of specifying loop-carried data dependences. Why? Because it is complicated, AND ...

Results from Gaussian Elimination
[Chart: execution time in seconds (0 to 12) versus problem size (100 to 1000), for the sequential version and for 4, 8, 12, 16, 20, 24, 28, 32, and 36 processors.]

I have been trying to figure out how to get better performance on problems like LU Decomposition and Gaussian Elimination. Even with the progress made, we still cannot get speedup when there is inter-processor communication from loop-carried dependencies.

“Until inter-processor communication latency is at least as fast as memory access latency, we won’t achieve speedup on a distributed-memory system for problems that require inter-process communication beyond scattering the input data and gathering the partial results.” [Ferner: 2012]

So what good are distributed-memory systems?
Question: What can we run on distributed-memory systems and achieve speedup?
Answer: Parallel programs that do not need inter-processor communication beyond the scatter and gather steps. In other words: “embarrassingly parallel” applications.

Communication Patterns Like This
After the scattering of input data and before the gathering of partial results, the processors work independently.

[Diagram: process 0 scatters the input to processes 0 through 7, each process computes independently, and the partial results are gathered back to process 0.]
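
To make the pattern concrete, here is a minimal plain-MPI sketch of scatter, independent computation, and gather. This is not the code Paraguin generates; N, blksz, and the element-wise squaring are placeholder choices for illustration:

    #include <stdio.h>
    #include <mpi.h>

    #define N 1024                          /* total number of elements (assumed) */

    int main(int argc, char *argv[]) {
        int rank, np;
        static float a[N], b[N], result[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        int blksz = N / np;                 /* elements per process; assumes np divides N */

        if (rank == 0) {
            for (int i = 0; i < N; i++) a[i] = (float)i;   /* master prepares the input */
        }

        /* Scatter: each process receives its blksz-element block into b */
        MPI_Scatter(a, blksz, MPI_FLOAT, b, blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);

        /* Independent work: no inter-processor communication here */
        for (int i = 0; i < blksz; i++) b[i] = b[i] * b[i];

        /* Gather: partial results are collected back on the master */
        MPI_Gather(b, blksz, MPI_FLOAT, result, blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("result[1] = %f\n", result[1]);

        MPI_Finalize();
        return 0;
    }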

Examples
Matrix Multiplication (the obvious algorithm), Sobel Edge Detection, Monte Carlo algorithms, the Traveling Salesman Problem, and several SPEC benchmark algorithms.
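
As one concrete example from the list above, a Monte Carlo estimate of pi fits this pattern: each process counts random samples independently, and a single reduction at the end plays the role of the gather. A minimal sketch (the sample count, seeding scheme, and the use of MPI_Reduce rather than MPI_Gather are illustrative choices, not taken from the slides):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, np;
        long local_samples = 1000000;       /* samples per process (arbitrary) */
        long local_hits = 0, total_hits = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        srand(1234 + rank);                 /* different random stream per process */

        /* Independent work: count points that fall inside the unit quarter circle */
        for (long i = 0; i < local_samples; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0) local_hits++;
        }

        /* The only communication: combine the per-process counts on the master */
        MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            double pi = 4.0 * (double)total_hits / ((double)local_samples * np);
            printf("estimated pi = %f\n", pi);
        }

        MPI_Finalize();
        return 0;
    }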