Patterns
Paraguin Compiler Version 2.1

Patterns
As of right now, there are only two patterns implemented in Paraguin:
- Scatter/Gather (also known as master/slave)
- Stencil
September 1, 2013 © copyright 2012, Clayton S. Ferner, UNC Wilmington

Scatter/Gather
- Master prepares input
- Input is scattered to all processors
- Processors work independently (no communication)
- Partial results are gathered together to build the final result
[Figure: processor 0 scatters the input to processors 0 through 7; the results are gathered back to processor 0]

Scatter/Gather
This pattern is done as a template rather than a single pragma:
1. Master prepares input
2. Scatter input
3. Compute partial results
4. Gather partial results into the final result

Scatter/Gather Example: Matrix Addition

    int main(int argc, char *argv[])
    {
        int i, j, error = 0;
        double a[N][N], b[N][N], c[N][N];
        char *usage = "Usage: %s file\n";
        FILE *fd;

        // Make sure we have the correct number of arguments
        if (argc < 2) {
            fprintf (stderr, usage, argv[0]);
            error = -1;
        }

        // Make sure we can open the input file
        if (!error && (fd = fopen (argv[1], "r")) == NULL) {
            fprintf (stderr, "%s: Cannot open file %s for reading.\n",
                     argv[0], argv[1]);
            fprintf (stderr, usage, argv[0]);
            // The variable error is used to stop the other processors
            error = -1;
        }

Scatter/Gather Example: Matrix Addition

        // The error code is broadcast to all processors so that they know to
        // exit. If we just had a "return -1" in the above two if statements,
        // then only the master would exit and the workers would not, causing
        // a deadlock.
        #pragma paraguin begin_parallel
        #pragma paraguin bcast error
        if (error) return error;
        #pragma paraguin end_parallel

        // 1. Master prepares input
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                fscanf (fd, "%lf", &a[i][j]);

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                fscanf (fd, "%lf", &b[i][j]);

        fclose(fd);

Scatter/Gather Example: Matrix Addition

        #pragma paraguin begin_parallel

        // 2. Scatter input
        #pragma paraguin scatter a b

        // 3. Compute partial results
        // Parallelize the following loop nest assigning iterations
        // of the outermost loop (i) to different partitions.
        // Since this is a forall loop, each processor will compute a
        // partition of the rows of the result.
        #pragma paraguin forall
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                c[i][j] = a[i][j] + b[i][j];
            }
        }

Scatter/Gather Example: Matrix Addition

        // 4. Gather partial results into the final result
        // This semicolon is here to prevent the gather pragma from being
        // placed INSIDE the above for loop nest.
        ;
        #pragma paraguin gather c

        #pragma paraguin end_parallel
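
The four template steps of the matrix addition example can be mimicked in plain C without Paraguin or MPI. The following is only a sketch under stated assumptions: `NPROCS`, `block_range`, and `simulated_matrix_add` are hypothetical names, and the contiguous block partitioning of rows is an assumed scheme, since the slides do not say how the forall pragma assigns iterations to processors.

```c
#include <assert.h>

#define N 8        /* matrix dimension (illustrative) */
#define NPROCS 3   /* number of simulated processors (assumption) */

/* Compute the half-open row range [*lo, *hi) owned by `rank` when n rows
 * are split into contiguous blocks of ceil(n / nprocs) rows. */
static void block_range(int n, int nprocs, int rank, int *lo, int *hi)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int chunk = (n + nprocs - 1) / nprocs;   /* ceil(n / nprocs) */
    *lo = rank * chunk;
    *hi = *lo + chunk;
    if (*lo > n) *lo = n;
    if (*hi > n) *hi = n;
}

/* Serial stand-in for scatter / forall / gather: each simulated rank adds
 * only its own rows; writing into the shared c plays the role of the gather. */
static void simulated_matrix_add(double a[N][N], double b[N][N], double c[N][N])
{
    for (int rank = 0; rank < NPROCS; rank++) {
        int lo, hi;
        block_range(N, NPROCS, rank, &lo, &hi);
        for (int i = lo; i < hi; i++)
            for (int j = 0; j < N; j++)
                c[i][j] = a[i][j] + b[i][j];
    }
}
```

Because the row ranges are disjoint and together cover all N rows, every element of c is written exactly once, which is what makes the gather step a simple concatenation of partial results.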

More on Scatter/Gather
The scatter/gather pattern can also use broadcast, reduction, or both:
1. Master prepares input
2. Broadcast input
3. Compute partial results
4. Reduce partial results into the final result

Integration
To demonstrate Broadcast/Reduce, consider the problem of integrating a function using rectangles.
[Figure: the curve y = f(x) between a and b, with a rectangle of width h spanning x to x+h below the points f(x) and f(x+h)]
As h approaches zero, the area of the rectangles approaches the area under the curve between a and b.

Scatter/Gather Example: Integration

    // Let f(x) = 4 sin(1.5x) + 5
    double f(double x)
    {
        return 4.0 * sin(1.5*x) + 5;
    }

    int main(int argc, char *argv[])
    {
        char *usage = "Usage: %s a b N\n";
        int i, error = 0, N;
        double a, b, x, y, h, area, overall_area;

        // Make sure we have the correct number of arguments
        if (argc < 4) {
            fprintf (stderr, usage, argv[0]);
            // The variable error is used to stop the other processors
            error = -1;
        } else {

Scatter/Gather Example: Integration

            // 1. Master prepares input
            a = atof(argv[1]);
            b = atof(argv[2]);
            N = atoi(argv[3]);
            if (b <= a) {
                fprintf (stderr, "a should be smaller than b\n");
                error = -1;
            }
        }

        // The error code is broadcast to all processors so that they
        // know to exit.
        #pragma paraguin begin_parallel
        #pragma paraguin bcast error
        if (error) return error;

Scatter/Gather Example: Integration

        // 2. Broadcast input
        // This semicolon is here to prevent the bcast pragma from being
        // placed INSIDE the above if statement.
        ;
        #pragma paraguin bcast a b N

        // 3. Compute partial results
        h = (b - a) / N;
        area = 0.0;

        // Since this is a forall loop, each processor will compute a
        // partition of the N rectangles.
        #pragma paraguin forall
        for (i = 0; i < N; i++) {
            x = a + i * h;
            y = f(x);
            area += y * h;
        }

Scatter/Gather Example: Integration

        // 4. Reduce partial results into the final result
        // This semicolon is here to prevent the reduce pragma from being
        // placed INSIDE the above for loop.
        ;
        #pragma paraguin reduce sum area overall_area

        #pragma paraguin end_parallel

        // Final area is in overall_area
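
The broadcast/compute/reduce structure of the integration example can be checked serially. The sketch below is an illustration under assumptions, not Paraguin's generated code: `partial_area`, `integrate`, and `square` are hypothetical names, the cyclic assignment of rectangles to ranks is an assumed partitioning, and the integrand is passed as a function pointer so that the example does not depend on the math library (the slides' f(x) could be passed in the same way).

```c
#include <assert.h>

/* Hypothetical stand-in integrand, f(x) = x^2, used only to keep the
 * sketch free of the math library. */
static double square(double x)
{
    return x * x;
}

/* Partial area computed by one simulated rank. The rank handles the
 * left-endpoint rectangles i = rank, rank + nprocs, rank + 2*nprocs, ...
 * (cyclic assignment is an assumption; any disjoint split of 0..n-1
 * yields the same total). */
static double partial_area(double (*fn)(double), double a, double b,
                           int n, int rank, int nprocs)
{
    assert(n > 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    double h = (b - a) / n;
    double area = 0.0;
    for (int i = rank; i < n; i += nprocs)
        area += fn(a + i * h) * h;
    return area;
}

/* Stand-in for "reduce sum": add up every rank's partial area. */
static double integrate(double (*fn)(double), double a, double b,
                        int n, int nprocs)
{
    double overall_area = 0.0;
    for (int rank = 0; rank < nprocs; rank++)
        overall_area += partial_area(fn, a, b, n, rank, nprocs);
    return overall_area;
}
```

For example, `integrate(square, 0.0, 3.0, 30000, 4)` approximates the exact area of 9 between 0 and 3, and the total does not depend on the number of simulated processors beyond floating-point rounding.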

Stencil Pattern

Jacobi Iteration

Jacobi Iteration

    int main()
    {
        int i, j, time;
        double A[N][M], B[N][M];

        // A is initialized with data somehow

        for (time = 0; time < MAX_ITERATION; time++) {

            // Skip the boundary values.
            // Newly computed values are placed in a new array.
            for (i = 1; i < N-1; i++)
                for (j = 1; j < M-1; j++)
                    // Multiplying by 0.25 is faster than dividing by 4.0
                    B[i][j] = (A[i-1][j] + A[i+1][j] +
                               A[i][j-1] + A[i][j+1]) * 0.25;

            // Then copied back to the original.
            for (i = 1; i < N-1; i++)
                for (j = 1; j < M-1; j++)
                    A[i][j] = B[i][j];
        }
        ...
    }

Improved Jacobi Iteration

    int main()
    {
        int i, j, time, current, next;
        // Add another dimension of size 2 to A.
        // A[0] is the old A and A[1] is the old B.
        double A[2][N][M];

        // A[0] is initialized with data somehow and duplicated into A[1]

        current = 0;
        next = (current + 1) % 2;

        for (time = 0; time < MAX_ITERATION; time++) {
            for (i = 1; i < N-1; i++)
                for (j = 1; j < M-1; j++)
                    A[next][i][j] = (A[current][i-1][j] + A[current][i+1][j] +
                                     A[current][i][j-1] + A[current][i][j+1]) * 0.25;

            // We toggle between copies of the array.
            // This avoids copying values back into the original array.
            current = next;
            next = (current + 1) % 2;
        }

        // Final result is in A[current]
        ...
    }
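
The toggling scheme can be packaged as a small self-contained function. This is a minimal sketch, assuming illustrative sizes `N` and `M` and a hypothetical function name `jacobi`; it follows the loop structure of the improved version and returns the index of the copy that holds the final result.

```c
#include <assert.h>

#define N 6   /* rows (illustrative) */
#define M 6   /* columns (illustrative) */

/* Run `iterations` Jacobi sweeps over A, toggling between A[0] and A[1].
 * Both copies must hold the same data on entry so that the boundary
 * values are present in each. Returns the index of the copy that holds
 * the final result, which equals iterations % 2. */
static int jacobi(double A[2][N][M], int iterations)
{
    assert(iterations >= 0);
    int current = 0, next = 1;
    for (int t = 0; t < iterations; t++) {
        /* Skip the boundary values */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < M - 1; j++)
                A[next][i][j] = (A[current][i-1][j] + A[current][i+1][j] +
                                 A[current][i][j-1] + A[current][i][j+1]) * 0.25;
        /* Toggle instead of copying the new values back */
        current = next;
        next = (current + 1) % 2;
    }
    return current;
}
```

With every boundary value set to 1.0 and the interior set to 0.0, repeated sweeps drive the interior toward 1.0, which makes a convenient sanity check.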

Row versus Block Partitioning

Row versus Block Partitioning
- With block partitioning, we will need to communicate data across both rows and columns
  - This will result in too much communication (too fine granularity)
- With row partitioning, each processor only needs to communicate with at most 2 other processors

Communication Pattern with Row Partitioning

Paraguin Stencil Pragma
A stencil pattern is done with a stencil pragma (all on one line):

    #pragma paraguin stencil <data> <#rows> <#cols> <max_iterations> <fname>

Where:
- <data> is a 3-dimensional array of size 2 x #rows x #cols
- <max_iterations> is the number of iterations of the time loop
- <fname> is the name of a function to perform each calculation

Paraguin Stencil Pragma
The function to perform each calculation should be declared as:

    <type> <fname> (<type> <data>[ ][ ], int i, int j)

Where <type> is the base type of the array.
The function should calculate and return the new value for location <data>[i][j]. It should not modify the value stored there, but simply return the new value.

Paraguin Stencil Pragma

    int __guin_current = 0;   // This is needed to access the last
                              // copy of the data

    // Function to compute each value
    double computeValue (double A[][M], int i, int j)
    {
        return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
    }

Paraguin Stencil Pragma

    int main()
    {
        int i, j, n, m, max_iterations;
        double A[2][N][M];   // A has a third dimension of size 2

        // A[0] is initialized with data somehow and duplicated into A[1]

        #pragma paraguin begin_parallel

        // All pragma parameters must be literals or variables.
        // No preprocessor constants.
        n = N;
        m = M;
        max_iterations = TOTAL_TIME;

        #pragma paraguin stencil A n m max_iterations computeValue

        #pragma paraguin end_parallel

        // Final result is in A[__guin_current] or A[max_iterations % 2]
    }

The Stencil Pragma is Replaced with Code to do:
1. The 3-dimensional array given as an argument to the stencil pragma is broadcast to all available processors.
2. __guin_current is set to zero and __guin_next is set to one.
3. A loop is created to iterate max_iterations times. Within that loop, code is inserted to perform the following steps:

The Stencil Pragma is Replaced with Code to do:
   a. Each processor (except the last one) will send its last row to the processor with rank one more than its own rank.
   b. Each processor (except the first one) will receive the last row from the processor with rank one less than its own rank.
   c. Each processor (except the first one) will send its first row to the processor with rank one less than its own rank.
   d. Each processor (except the last one) will receive the first row from the processor with rank one more than its own rank.

The Stencil Pragma is Replaced with Code to do:
   e. Each processor will iterate through the values of the rows for which it is responsible and use the function provided to compute the next value.
   f. __guin_current and __guin_next toggle.
4. The data is gathered back to the root processor (rank 0).
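
The steps above (broadcast, exchange of first and last rows, compute, toggle, gather) can be simulated serially to see why the row exchange is sufficient. The sketch below is only an illustration under assumptions: it uses no MPI, each simulated rank keeps a private copy of the array in `grid`, the names `owned_rows`, `compute_value`, and `simulated_stencil` are hypothetical, and the block partitioning of the interior rows is an assumed scheme rather than Paraguin's documented one.

```c
#include <assert.h>
#include <string.h>

#define N 10       /* rows (illustrative) */
#define M 6        /* columns (illustrative) */
#define NPROCS 3   /* simulated processors (assumption) */

/* One private copy of the data per simulated rank, as after a broadcast. */
static double grid[NPROCS][2][N][M];

/* Four-neighbor average, returning the new value for location (i, j). */
static double compute_value(double A[][M], int i, int j)
{
    return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
}

/* Interior rows [*lo, *hi) owned by `rank` (assumed block partitioning). */
static void owned_rows(int rank, int *lo, int *hi)
{
    int chunk = (N - 2 + NPROCS - 1) / NPROCS;
    *lo = 1 + rank * chunk;
    *hi = *lo + chunk;
    if (*hi > N - 1) *hi = N - 1;
    assert(*lo < *hi);   /* every rank must own at least one row */
}

/* Returns the index of the copy of A that holds the final result. */
static int simulated_stencil(double A[2][N][M], int max_iterations)
{
    int current = 0, next = 1, lo, hi, nlo, nhi;
    size_t row = M * sizeof(double);

    for (int r = 0; r < NPROCS; r++)              /* step 1: broadcast */
        memcpy(grid[r], A, sizeof grid[r]);

    for (int t = 0; t < max_iterations; t++) {    /* step 3: time loop */
        for (int r = 0; r + 1 < NPROCS; r++) {
            owned_rows(r, &lo, &hi);
            owned_rows(r + 1, &nlo, &nhi);
            /* a, b: rank r sends its last row to rank r + 1 */
            memcpy(grid[r + 1][current][hi - 1], grid[r][current][hi - 1], row);
            /* c, d: rank r + 1 sends its first row to rank r */
            memcpy(grid[r][current][nlo], grid[r + 1][current][nlo], row);
        }
        for (int r = 0; r < NPROCS; r++) {        /* e: compute own rows */
            owned_rows(r, &lo, &hi);
            for (int i = lo; i < hi; i++)
                for (int j = 1; j < M - 1; j++)
                    grid[r][next][i][j] = compute_value(grid[r][current], i, j);
        }
        current = next;                           /* f: toggle */
        next = 1 - current;
    }

    for (int r = 0; r < NPROCS; r++) {            /* step 4: gather */
        owned_rows(r, &lo, &hi);
        for (int i = lo; i < hi; i++)
            memcpy(A[current][i], grid[r][current][i], row);
    }
    return current;
}
```

Each simulated rank only ever reads its own rows plus the one row above and the one row below, so exchanging a single row with each neighbor per iteration is enough to reproduce the serial result.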

Stopping the Iterations Based Upon a Condition
- The stencil pattern will execute a fixed number of iterations.
- What if we want to continue until the data converges to a solution?
- For example, we might stop when the maximum difference between the values in A[0] and A[1] is less than a tolerance, like 0.0001.
- The problem with doing this in parallel is that it requires communication.

Why Communication is Needed to Test For a Termination Condition
There are 2 reasons inter-processor communication is needed to test for a termination condition:
1. The data is scattered across processors; and
2. The processors need to all agree whether to continue or terminate.
   - Parts of the data may converge faster than others
   - Some processors may decide to stop and others do not
   - Without agreement, there will be a deadlock
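
The agreement requirement can be illustrated with a small serial sketch. It is an assumption-based illustration, not Paraguin output: `reduce_max` and `all_done` are hypothetical names, and the array `local_max` stands in for the per-processor maximum differences that a real run would combine with a reduction and then broadcast.

```c
#include <assert.h>

/* Stand-in for a "max" reduction: combine every rank's local maximum
 * difference into one global maximum. */
static double reduce_max(const double local_max[], int nprocs)
{
    assert(nprocs > 0);
    double global = local_max[0];
    for (int r = 1; r < nprocs; r++)
        if (local_max[r] > global)
            global = local_max[r];
    return global;
}

/* Every rank applies the same test to the same (broadcast) value, so all
 * ranks reach the same decision. Returns 1 to terminate, 0 to continue. */
static int all_done(const double local_max[], int nprocs, double tol)
{
    double diff = reduce_max(local_max, nprocs);   /* reduce, then bcast */
    return diff <= tol;
}
```

With local maxima of 0.00001, 0.5, and 0.00002 and a tolerance of 0.0001, two of the three partitions have converged on their own, yet `all_done` returns 0 for everyone, so no processor stops while another continues.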

Stopping the Iterations Based Upon a Condition
We could put the stencil pragma inside a loop:
- Each processor computes the max difference between the old and new values
- We reduce this to a final max value
- Broadcast it back out to decide whether or not to continue
Problem: The stencil pattern has a built-in Broadcast and Gather.
Solution: stencilLite

Paraguin Stencil Pragma with Termination Condition

    // This part is the same.
    int __guin_current = 0;   // This is needed to access the last copy
                              // of the data

    // Function to compute each value
    double computeValue (double A[][M], int i, int j)
    {
        return (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) * 0.25;
    }

Paraguin Stencil Pragma with Termination Condition

    int main()
    {
        // New variables: done, aPtr, diff, max_diff, tol
        int i, j, n, m, max_iterations, done;
        double A[2][N][M], *aPtr, diff, max_diff, tol;

        // A[0] is initialized with data somehow and duplicated into A[1]

        #pragma paraguin begin_parallel

        // tol is used to determine if the termination condition is met.
        // When the changes in values are ALL less than tol, the values
        // have converged sufficiently.
        tol = 0.0001;

        n = N;
        m = M;
        max_iterations = TOTAL_TIME;

        // Broadcast is now done by the user BEFORE the while loop begins
        #pragma paraguin bcast A

Paraguin Stencil Pragma with Termination Condition

        done = 0; // false

        // Need a logical-controlled loop
        while (!done) {
            ; // This is to make sure the following pragma is inside the while
            #pragma paraguin stencilLite A n m max_iterations computeValue

            // All processors determine the maximum absolute difference
            // between the old values and the newly computed values.
            // Each processor determines the maximum change in values of
            // the partition for which it is responsible. The loop bounds need
            // to be 1 and n-1 to match the bounds of the stencil. Otherwise,
            // the partitioning will be incorrect.
            max_diff = 0.0;
            #pragma paraguin forall
            for (i = 1; i < n - 1; i++) {
                for (j = 1; j < m - 1; j++) {
                    diff = fabs(A[__guin_current][i][j] - A[__guin_next][i][j]);
                    if (diff > max_diff) max_diff = diff;
                }
            }

Paraguin Stencil Pragma with Termination Condition

            ; // This is needed to prevent the pragma from being located in the
              // above loop nest

            // Reduce the max_diff's from all processors to find the
            // maximum difference across all processors.
            // The variable diff is being reused here.
            #pragma paraguin reduce max max_diff diff

            // Broadcast the diff so that all processors will agree to continue
            // or terminate
            #pragma paraguin bcast diff

            // Termination condition if the maximum change in values is less
            // than the tolerance.
            if (diff <= tol) done = 1; // true
        }

Paraguin Stencil Pragma with Termination Condition

        // The first and last rows of the array were not included in the
        // partitioning. This is why we use a pointer.
        aPtr = &A[__guin_current][1][0];
        n = (N - 2) * M * sizeof(double);
        #pragma paraguin gather aPtr( n )

        #pragma paraguin end_parallel

        // Final result is in A[__guin_current]
        // Cannot use max_iterations % 2
    }
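
The pointer and size used for the final gather can be examined in isolation. The sketch below is a plain C illustration under assumptions: `copy_interior_rows` is a hypothetical helper, `memcpy` merely stands in for the gather, and the sizes `N` and `M` are illustrative. It shows that starting at row 1 and covering (N - 2) * M values of type double spans exactly the interior rows, leaving the first and last rows untouched.

```c
#include <assert.h>
#include <string.h>

#define N 5   /* rows (illustrative) */
#define M 4   /* columns (illustrative) */

/* Copy only rows 1 .. N-2 of the selected copy of A into dst, using the
 * same starting address and byte count as in the slide. */
static void copy_interior_rows(double dst[N][M], double A[2][N][M], int current)
{
    assert(current == 0 || current == 1);
    double *aPtr = &A[current][1][0];                  /* start of row 1 */
    size_t n = (size_t)(N - 2) * M * sizeof(double);   /* interior rows, in bytes */
    memcpy(&dst[1][0], aPtr, n);
}
```

Because the rows of a C array are stored contiguously, one pointer and one length describe the whole interior block, which is why a pointer is convenient when the partitioned region is not the entire array.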