Hybrid Parallel Programming Introduction
ITCS 4145/5145, Parallel Programming
C. Ferner and B. Wilkinson, March 13, 2014 (hybrid.ppt)
Hybrid Systems
Since most computers are multi-core, most clusters have both shared memory and distributed memory.
[Figure: multi-core computers, each with cores sharing a common memory, connected by an interconnection network]
Hybrid (MPI-OpenMP) Parallel Computing
• We can use MPI to run processes concurrently on each computer
• We can use OpenMP to run threads concurrently on each core of a computer
• Advantage: we can make use of shared memory where communication is required
• Why? Because inter-computer message passing is an order of magnitude slower than thread synchronization within a computer
Message-passing routines are used to pass messages between computer systems, and threads execute on each computer system, using the multiple cores on that system.
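A minimal sketch of this model, assuming one MPI process is started per computer and each process spawns OpenMP threads on that computer's cores (this program is illustrative, not taken from the original slides):

#include <stdio.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char **argv) {
   int rank;
   MPI_Init(&argc, &argv);                  /* one process per computer (distributed memory, MPI) */
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   #pragma omp parallel                     /* threads share memory within the computer (OpenMP) */
   printf("process %d, thread %d\n", rank, omp_get_thread_num());
   MPI_Finalize();
   return 0;
}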
How to create a hybrid OpenMP-MPI program
• Write source code with both MPI routines and OpenMP directives/routines
• mpicc uses gcc linked with the appropriate MPI libraries, and gcc supports OpenMP with the -fopenmp option, so we can use:
  mpicc -fopenmp -o hybrid hybrid.c
• Execute as an MPI program. For example, on the UNCC cluster cci-gridgw.uncc.edu:
  mpiexec.hydra -f <machinesfile> -n <number of processes> ./hybrid
  (VERY IMPORTANT -- NOT FROM cci-grid05)
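Note that mpiexec.hydra sets only the number of MPI processes; the number of OpenMP threads each process creates is controlled separately, for example with the OMP_NUM_THREADS environment variable or a call to omp_set_num_threads(). A minimal sketch (the count of 4 threads is an arbitrary illustration, not a value from these slides):

#include <stdio.h>
#include <omp.h>

int main(void) {
   omp_set_num_threads(4);   /* request 4 threads for subsequent parallel regions */
   #pragma omp parallel
   printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
   return 0;
}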
Example

#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <stdlib.h>
#include "mpi.h"
#define CHUNKSIZE 10
#define N 100

void openmp_code() { ... }   // next slide

int main(int argc, char **argv) {
   char message[20];
   int i, rank, size, type = 99;
   MPI_Status status;
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   if (rank == 0) {
      strcpy(message, "Hello, world");
      for (i = 1; i < size; i++)
         MPI_Send(message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD);
   } else
      MPI_Recv(message, 20, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);

   openmp_code();   // all MPI processes run OpenMP code, no message passing

   printf("Message from process = %d : %.13s\n", rank, message);
   MPI_Finalize();
}
void openmp_code() {
   int nthreads, tid, i, chunk;
   float a[N], b[N], c[N];
   for (i = 0; i < N; i++)
      a[i] = b[i] = i * 1.0;   // initialize arrays
   chunk = CHUNKSIZE;
   #pragma omp parallel shared(a, b, c, nthreads, chunk) private(i, tid)
   {
      tid = omp_get_thread_num();
      if (tid == 0) {
         nthreads = omp_get_num_threads();
         printf("Number of threads = %d\n", nthreads);
      }
      printf("Thread %d starting...\n", tid);
      #pragma omp for schedule(dynamic, chunk)
      for (i = 0; i < N; i++) {
         c[i] = a[i] + b[i];
         printf("Thread %d: c[%d] = %f\n", tid, i, c[i]);
      }
   }  /* end of parallel section */
}
Parallelizing a double for loop

int main(int argc, char *argv[]) {
   int i, j, blksz, rank, P, tid;
   char *usage = "Usage: %s\n";
   FILE *fd;
   char message[80];
   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &P);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   blksz = (int) ceil(((double) N) / P);
   #pragma omp parallel private(tid, i, j)
   {
      tid = omp_get_thread_num();
      for (i = rank*blksz; i < min((rank + 1) * blksz, N); i++) {   // loop i parallelized across computers (MPI)
         #pragma omp for                                            // loop j parallelized across threads (OpenMP)
         for (j = 0; j < N; j++) {
            printf("rank %d, thread %d: executing loop iteration i=%d j=%d\n", rank, tid, i, j);
         }
      }
   }
}

Code and results from Dr. Ferner
Simple Example Result

rank 0, thread 4: executing loop iteration i=0 j=4
rank 0, thread 2: executing loop iteration i=0 j=2
rank 0, thread 1: executing loop iteration i=0 j=1
rank 1, thread 0: executing loop iteration i=2 j=0
rank 1, thread 4: executing loop iteration i=2 j=4
rank 1, thread 2: executing loop iteration i=2 j=2
rank 1, thread 3: executing loop iteration i=2 j=3
rank 0, thread 0: executing loop iteration i=0 j=0
rank 1, thread 1: executing loop iteration i=2 j=1
rank 0, thread 3: executing loop iteration i=0 j=3
rank 2, thread 2: executing loop iteration i=4 j=2
rank 2, thread 0: executing loop iteration i=4 j=0
rank 2, thread 3: executing loop iteration i=4 j=3
rank 2, thread 4: executing loop iteration i=4 j=4
rank 2, thread 1: executing loop iteration i=4 j=1
rank 0, thread 2: executing loop iteration i=1 j=2
rank 0, thread 4: executing loop iteration i=1 j=4
rank 0, thread 3: executing loop iteration i=1 j=3
rank 0, thread 0: executing loop iteration i=1 j=0
rank 0, thread 1: executing loop iteration i=1 j=1
rank 1, thread 0: executing loop iteration i=3 j=0
rank 1, thread 2: executing loop iteration i=3 j=2
rank 1, thread 3: executing loop iteration i=3 j=3
rank 1, thread 1: executing loop iteration i=3 j=1
rank 1, thread 4: executing loop iteration i=3 j=4
Hybrid (MPI-OpenMP) Parallel Computing
Caution: the hybrid approach does not necessarily improve performance; the benefit depends strongly on the application.
Matrix Multiplication, C = A * B, where A is an n x l matrix and B is an l x m matrix (so C is an n x m matrix).
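Each element of C is the dot product of a row of A and a column of B; in the array notation used in the code that follows:

   c[i][j] = a[i][0]*b[0][j] + a[i][1]*b[1][j] + ... + a[i][l-1]*b[l-1][j]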
One way to parallelize matrix multiplication using the hybrid approach

for (i = 0; i < N; i++)
   for (j = 0; j < N; j++) {
      c[i][j] = 0.0;
      for (k = 0; k < N; k++) {
         c[i][j] += a[i][k] * b[k][j];
      }
   }

The i loop is partitioned among the computers with MPI; the j loop is partitioned among the cores within each computer, using OpenMP.
Matrix Multiplication

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &P);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
blksz = (int) ceil(((double) N) / P);
MPI_Scatter(a, N*blksz, MPI_FLOAT, a, N*blksz, MPI_FLOAT, 0, MPI_COMM_WORLD);   /* receive arguments restored (assumed: each process receives its block of rows into a) */
MPI_Bcast(b, N*N, MPI_FLOAT, 0, MPI_COMM_WORLD);
#pragma omp parallel private(tid, i, j, k)
{
   for (i = 0; i < blksz && rank * blksz < N; i++) {   // i loop partitioned among the computers with MPI
      #pragma omp for nowait                           // j loop partitioned among the threads on each computer with OpenMP
      for (j = 0; j < N; j++) {
         c[i][j] = 0.0;
         for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
         }
      }
   }
}

Code and results from Dr. Ferner
Matrix Multiplication Results

$ diff out MMULT.o5356
1c1
< elapsed_time= 1.525183 (seconds)      <-- sequential execution time
---
> elapsed_time= 0.659652 (seconds)      <-- hybrid execution time

$ diff out MMULT.o5357
1c1
< elapsed_time= 1.525183 (seconds)      <-- sequential execution time
---
> elapsed_time= 0.626821 (seconds)      <-- MPI-only execution time

Hybrid did not do better than MPI only.
Perhaps we could do better by parallelizing the i loop with both MPI and OpenMP:

#pragma omp parallel private(tid, i, j, k)
{
   #pragma omp for nowait
   for (i = 0; i < blksz && rank * blksz < N; i++) {   // i loop partitioned among the computers/threads with MPI and OpenMP
      for (j = 0; j < N; j++) {
         c[i][j] = 0.0;
         for (k = 0; k < N; k++) {
            c[i][j] += a[i][k] * b[k][j];
         }
      }
   }
}

But this loop is too complicated for OpenMP -- the compound test is not in the canonical loop form that the omp for directive requires -- and the j loop is no longer parallelized.
An if statement can simplify the loop:

#pragma omp parallel private(tid, i, j, k)
{
   #pragma omp for nowait
   for (i = 0; i < blksz; i++) {
      if (rank * blksz < N) {
         for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++) {
               c[i][j] += a[i][k] * b[k][j];
            }
         }
      }
   }
}

Moving the extra test inside the loop body leaves the i loop in the simple form that OpenMP can partition.
Matrix Multiplication Results

$ diff out MMULT.o5356
1c1
< elapsed_time= 1.525183 (seconds)      <-- sequential execution time
---
> elapsed_time= 0.688119 (seconds)      <-- hybrid execution time

Still not better.
Discussion Point
• Why does the hybrid approach not outperform MPI-only for this problem?
• For what kinds of problems might a hybrid approach do better?
Hybrid Parallel Programming with the Paraguin compiler
• The Paraguin compiler can also create hybrid programs
• Because it uses mpicc, it passes the OpenMP pragmas through to the resulting source code
Compiling
• First we need to compile to source code:
  scc -DPARAGUIN -D__x86_64__ matrixmult.c -.out.c
• Then we can compile with MPI and OpenMP:
  mpicc -fopenmp matrixmult.out.c -o matrixmult.out
Hybrid Matrix Multiplication using Paraguin

#pragma paraguin begin_parallel
#pragma paraguin scatter a
#pragma paraguin bcast b
#pragma paraguin forall
for (i = 0; i < N; i++) {                                    // the i loop will be partitioned among the computers
   #pragma omp parallel for private(tID, j, k) num_threads(4)
   for (j = 0; j < N; j++) {                                 // the j loop will be partitioned among the 4 cores within a computer
      c[i][j] = 0.0;
      for (k = 0; k < N; k++) {
         c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
   }
}
Debug Statements

<pid 0, thread 1>: c[0][1] += a[0][0] * b[1][0]
<pid 0, thread 1>: c[0][1] += a[0][1] * b[1][1]
<pid 0, thread 1>: c[0][1] += a[0][2] * b[1][2]
<pid 0, thread 2>: c[0][2] += a[0][0] * b[2][0]
<pid 0, thread 2>: c[0][2] += a[0][1] * b[2][1]
<pid 0, thread 2>: c[0][2] += a[0][2] * b[2][2]
<pid 1, thread 1>: c[1][1] += a[1][0] * b[1][0]
<pid 1, thread 1>: c[1][1] += a[1][1] * b[1][1]
<pid 1, thread 1>: c[1][1] += a[1][2] * b[1][2]
<pid 2, thread 1>: c[2][1] += a[2][0] * b[1][0]
<pid 2, thread 1>: c[2][1] += a[2][1] * b[1][1]
<pid 2, thread 1>: c[2][1] += a[2][2] * b[1][2]
<pid 0, thread 0>: c[0][0] += a[0][0] * b[0][0]
<pid 0, thread 0>: c[0][0] += a[0][1] * b[0][1]
<pid 0, thread 0>: c[0][0] += a[0][2] * b[0][2]
<pid 2, thread 0>: c[2][0] += a[2][0] * b[0][0]
<pid 2, thread 0>: c[2][0] += a[2][1] * b[0][1]
<pid 2, thread 0>: c[2][0] += a[2][2] * b[0][2]
<pid 1, thread 0>: c[1][0] += a[1][0] * b[0][0]
<pid 1, thread 0>: c[1][0] += a[1][1] * b[0][1]
<pid 1, thread 0>: c[1][0] += a[1][2] * b[0][2]
<pid 1, thread 2>: c[1][2] += a[1][0] * b[2][0]
<pid 1, thread 2>: c[1][2] += a[1][1] * b[2][1]
<pid 1, thread 2>: c[1][2] += a[1][2] * b[2][2]
<pid 2, thread 2>: c[2][2] += a[2][0] * b[2][0]
<pid 2, thread 2>: c[2][2] += a[2][1] * b[2][1]
<pid 2, thread 2>: c[2][2] += a[2][2] * b[2][2]
What does not work with Paraguin
• Consider:
  #pragma omp parallel
  structured_block
• Example:
  #pragma omp parallel private(tID) num_threads(4)
  {
     tID = omp_get_thread_num();
     printf("<pid %d>: tid = %d\n", __guin_rank, tID);
  }
• Very important: the opening brace must be on a new line
What does not work with Paraguin
• The SUIF compiler removes the braces because they are not associated with a control structure
• A #pragma is not a control structure, but rather a preprocessor directive
• After compiling with scc, the braces are removed:
  #pragma omp parallel private(tID) num_threads(4)
  tID = omp_get_thread_num();
  printf("<pid %d>: tid = %d\n", __guin_rank, tID);
The Fix
• The trick is to put in a control structure that basically does nothing:

  dummy = 0;
  #pragma omp parallel private(tID) num_threads(4)
  if (dummy == 0) {                 // this if statement is always true
     tID = omp_get_thread_num();
     printf("<pid %d>: tid = %d\n", __guin_rank, tID);
  }

  This code is left basically intact.
• Note: "if (1)" does not work, presumably because the constant condition is folded away and the braces are again discarded.
Questions?