
Decomposition
• Data Decomposition
  – Dividing the data into subgroups and assigning each piece to a different processor
  – Example: Embarrassingly parallel applications
• Functional Decomposition
  – Dividing an algorithm into its functional pieces and executing the pieces on separate processors
  – Example: Pipelining

Pipelined Computations
• Divide a problem into a series of tasks
• Each processor completes its task sequentially and pipes the result to the next processor
[Figure: Example of Summing Groups of Numbers. A running sum starts at zero and passes through P0 to P5; each processor P_k adds its group sum ∑A[i_k], and the total leaves the last processor.]
Question: Is this data decomposition or functional decomposition?

Where is Pipelining Applicable?
• Type 1: More than one instance of a problem
  – Example: Multiple simulations with different parameter settings
• Type 2: Series of data items with multiple operations
  – Example: Signal filter or Sieve of Eratosthenes
• Type 3: Partial results passed on while processing continues
  – Example: Solving sets of linear equations
Considerations
• Is there a series of sequential tasks?
• Is the processing of each task approximately equal?
• Can items be grouped to minimize communication cost?
• If stages exceed processors
  o Group stages
  o Wrap the last stage back to the first
• Determine where the result will be at the end of the process

Summing Numbers Example
Process P0
    send(&number, P1);
Process P_i, 0 < i < N-1
    recv(&sum, P_{i-1});
    sum += number;
    send(&sum, P_{i+1});
Process P_{N-1}
    recv(&sum, P_{N-2});
    sum += number;
    Save or display result
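The summing pipeline above can be sketched in Python. This is a minimal illustration, assuming one thread per process and `queue.Queue` objects standing in for send/recv; the names `pipeline_sum` and `stage` are hypothetical and do not come from the slides.

```python
import threading
import queue

def pipeline_sum(numbers):
    """Sum `numbers` with one pipeline stage per number.

    Threads and queues stand in for processes and message passing;
    that substitution is an assumption of this sketch.
    """
    n = len(numbers)
    if n == 0:
        return 0
    # links[i] carries the partial sum from stage i to stage i + 1;
    # the last link delivers the total to the caller.
    links = [queue.Queue(maxsize=1) for _ in range(n)]

    def stage(rank):
        # P0 has nothing to receive; every other stage waits for its predecessor.
        partial = 0 if rank == 0 else links[rank - 1].get()
        partial += numbers[rank]
        links[rank].put(partial)

    threads = [threading.Thread(target=stage, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    total = links[n - 1].get()
    for t in threads:
        t.join()
    return total

print(pipeline_sum([4, 3, 1, 2, 5]))
```

Each stage blocks until its predecessor delivers a partial sum, so the stages run strictly one after another; a single sum gains nothing from the pipeline, which only pays off when many sums stream through it.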

Application
• Remove frequencies from a signal
  – Sequential algorithm: Fourier analysis, O(N lg N)
  – Parallel: Apply filters to the signal with convolution, O(N * FilterLength)
  – Filter examples: Chebyshev, Butterworth, etc.
  – Derive filter: Set the z-domain poles and zeroes, then perform the inverse transformation
  – Filters can be useful to manipulate signals, detect patterns, etc.
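To make the O(N * FilterLength) cost concrete, here is a small Python sketch of filtering by direct convolution. The coefficients are an assumed 2-tap averaging filter chosen for illustration, not a Chebyshev or Butterworth design, and `fir_filter` is a hypothetical name.

```python
def fir_filter(signal, taps):
    """Convolve `signal` with `taps` (direct form, O(N * len(taps)))."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:          # samples before the start count as zero
                acc += h * signal[n - k]
        out.append(acc)
    return out

# Assumed coefficients: a 2-tap average, which cancels a component
# that alternates sign on every sample.
taps = [0.5, 0.5]
# A constant level of 1.0 plus an alternating +/-1 component.
signal = [1.0 + (-1) ** n for n in range(8)]
print(fir_filter(signal, taps))
```

In a Type 2 pipeline, each processor would run one such filter on the samples it receives and pass its output to the next processor.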

Chebyshev Filter Design
[Figures: Chebyshev filter in the z-domain; Chebyshev frequency response]
Note: Depending on the placement of the poles (+) and zeroes (0), the filter will affect a signal differently

Type 1: Multiple Instances
[Figure: Space-time diagram. Instances 1 to 5 move through processors P0 to P5 over time.]
Sequential execution: t_1 = m * t_m
Parallel processing: (m + p - 1) * t_m / p
Parallel communication: (m + p - 1) * (t_start + n * t_data)
Speed-up: S = m * t_m / ((m + p - 1) * (t_m / p + t_start + n * t_data))
Notation
1. m = instances, p = processors
2. t_start = latency (message startup time), t_data = time to transmit one data item (the inverse of bandwidth)
3. n = data items transmitted per instance
4. t_m = total time to process an instance
5. Total pipeline cycles = m + p - 1
6. Assume: Equal processing per stage
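The Type 1 timing formulas can be evaluated directly. The Python sketch below simply codes them up; the function name `pipeline_times` and the sample parameter values are made up for illustration.

```python
def pipeline_times(m, p, t_m, t_start, t_data, n):
    """Return (sequential, parallel compute, parallel communication, speed-up).

    m: instances, p: processors (equal work per stage is assumed),
    t_m: time to process one instance, t_start: message latency,
    t_data: time per data item, n: data items sent per instance.
    """
    cycles = m + p - 1                         # total pipeline cycles
    t_seq = m * t_m                            # sequential execution
    t_comp = cycles * t_m / p                  # parallel processing
    t_comm = cycles * (t_start + n * t_data)   # parallel communication
    speedup = t_seq / (t_comp + t_comm)
    return t_seq, t_comp, t_comm, speedup

# Hypothetical values: 100 instances on 6 processors.
t_seq, t_comp, t_comm, speedup = pipeline_times(
    m=100, p=6, t_m=6.0, t_start=0.1, t_data=0.01, n=10)
print(f"sequential={t_seq:.1f} parallel={t_comp + t_comm:.1f} speed-up={speedup:.2f}")
```

As m grows relative to p, the fill time of p - 1 cycles matters less and the speed-up approaches t_m / (t_m / p + t_start + n * t_data).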

Type 2: Multiple Data Elements
[Figure: Data items d0 to d9 stream into the pipeline]
Example: Signal Filter
• Each process removes one or more frequencies from a digitized signal
[Figure: The Unfiltered Signal passes through filters f0 to f5 on processors P0 to P5 and emerges as the Filtered Signal]

Type 2 Timing Diagram

Type 3: Partial Processing
• The next stage receives information to continue processing
• Additional processing continues at the source processor
[Figures: "Linear Equations" and "A More Balanced Load". Space-time diagrams for P0 to P5 marking each processor as idle or executing.]
Question: How do we determine speed-up?

Operation at each processor (Types 1 and 2)
• Processor with rank r = 0
  – Generate the instance (type 1) or the data (type 2) to process
  – Process appropriately
  – Send message to the processor with rank 1
• Processors with rank r = 1, 2, ..., p-2
  – Receive message from the processor with rank r-1
  – Process appropriately
  – Send message to the processor with rank r+1
• Processor with rank r = p-1
  – Receive message from the processor with rank r-1
  – Process appropriately
  – Output final results
Examples
1) Adding numbers: n1 -> n1+n2 -> n1+n2+n3 -> ...
2) Frequency removal: f(t) -> f0; f(t-f0) -> f1; f(t-f0-f1) -> ...

Parallel Pipeline Sort
• Pseudo code
    Receive x_i
    IF x_i < x_max
        Send x_i
    ELSE
        Send x_max
        x_max = x_i
• Trace for the numbers 4, 3, 1, 2, 5 (the rightmost number enters first; each entry is the number a processor holds after the step)

Step  Numbers        P0  P1  P2  P3  P4
1     4, 3, 1, 2, 5
2     4, 3, 1, 2     5
3     4, 3, 1        5
4     4, 3           5   2
5     4              5   2
6                    5   3   1
7                    5   4   2
8                    5   4   3   1
9                    5   4   3   2
10                   5   4   3   2   1

Note: Processors can hold blocks of numbers for better efficiency
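The compare-and-forward rule of the pipeline sort can be checked with a step-by-step simulation. The Python sketch below assumes one number per processor and synchronous steps; `pipeline_sort`, `held` and `in_flight` are illustrative names.

```python
def pipeline_sort(numbers):
    """Simulate the pipeline sort; stage i ends up holding the i-th largest number.

    Returns the list of held numbers and the count of communication steps.
    """
    p = len(numbers)
    held = [None] * p         # number kept by each stage
    in_flight = [None] * p    # in_flight[i] = number arriving at stage i this step
    pending = list(numbers)   # numbers still waiting to enter; the last one enters first
    steps = 0
    while pending or any(x is not None for x in in_flight):
        if pending:
            in_flight[0] = pending.pop()
        arriving = in_flight
        in_flight = [None] * p
        for i, x in enumerate(arriving):
            if x is None:
                continue
            if held[i] is None:
                held[i] = x                    # first number seen: keep it
            elif x < held[i]:
                in_flight[i + 1] = x           # pass the smaller number on
            else:
                in_flight[i + 1] = held[i]     # pass the old maximum on
                held[i] = x                    # keep the new maximum
        steps += 1
    return held, steps

held, steps = pipeline_sort([4, 3, 1, 2, 5])
print(held, steps)
```

The simulation counts only steps in which a number moves, so it does not count the initial state before any number has entered the pipeline.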

Bi-Directional Pipeline
• Use the pipeline to return results to the master
  – Useful for line, ring, or hypercube topologies
[Figure: Processors P0 to P5 over time, showing the sorting phase followed by the gather phase]
Example: Sorting
• Sort phase
    If (myid == 0) generate number
    Else receive(&number, P_{myid-1})
    If (number > maxSoFar and myid < P-1) {
        send(maxSoFar, P_{myid+1});
        maxSoFar = number;
    }
    Else if (myid < P-1) send(number, P_{myid+1});
• Gather phase
    If (myid < P-1) receive sorted numbers from P_{myid+1}
    If (myid > 0) send sorted numbers to P_{myid-1}
• Time: N (generate steps) + N-1 (propagate steps) + N-1 (return steps) = 3N-2

Sieve of Eratosthenes

Prime Number Generation
Sieve of Eratosthenes (Type 2 pipeline)
• Concept
  – Each processor filters blocks of non-primes from the flow of data
  – The "potential" prime numbers pass through to the next processor
• Pseudo-code
    The master processor generates an array of n odd numbers
    In a loop, after receiving a group of numbers:
        Filter the group; pass the unfiltered numbers down the pipeline
    Gather all of the primes
• Notes
  – Wrapping the pipeline in a ring could help maintain load balance
  – A termination message determines when the pipeline empties
Question: What range of numbers should each processor get?

Implementation
Sequential code
    for (i = 2; i < n; i++) prime[i] = 1;
    for (i = 2; i <= sqrt_n; i++)
        if (prime[i] == 1)
            for (j = i + i; j < n; j = j + i)
                prime[j] = 0;
Sequential time: O(n^2) as a loose upper bound (the sieve itself runs in O(n log log n))
Parallel code, processor P_i, i > 0
    recv(number, rank-1);
    PRIME = TRUE;
    FOR (int x = MIN; x < MAX; x += MIN)
        IF ((number % x) == 0) { PRIME = FALSE; BREAK; }
    IF (PRIME) send(number, rank+1);
Termination
    recv(number, rank-1);
    send(number, rank+1);
    IF (number == terminator) break;
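The filtering idea can be tried out with a sequential simulation of the pipeline. In this Python sketch each stage holds a small block of primes and drops their multiples. Two things are assumptions made for illustration and not taken from the slides: the block size, and the rule of adding a new stage when the last one is full. No real message passing is involved.

```python
def pipeline_sieve(limit, block=2):
    """Return the primes below `limit` using a simulated filter pipeline.

    Each stage stores up to `block` primes and drops their multiples.
    A number that survives every stage is prime and is stored in the
    last stage (a new stage is appended when the last one is full).
    """
    stages = [[]]   # stages[k] = primes held by stage k
    # The master generates 2 followed by the odd numbers below `limit`.
    candidates = ([2] if limit > 2 else []) + list(range(3, limit, 2))
    for number in candidates:
        survived = True
        for primes in stages:              # the number flows down the pipeline
            if any(number % q == 0 for q in primes):
                survived = False           # filtered out at this stage
                break
        if survived:
            if len(stages[-1]) == block:
                stages.append([])
            stages[-1].append(number)
    return [q for primes in stages for q in primes]

print(pipeline_sieve(30))
```

Early stages see every candidate while later stages see only the survivors, which is the load imbalance that wrapping the pipeline into a ring is meant to reduce.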

Upper Triangular Matrix
• All entries below the diagonal are zero
• Useful for solving N equations in N unknowns

Solving Sets of Linear Equations
This is a type 3 pipeline example
• Upper Triangular Form
    a_{n-1,0} x_0 + a_{n-1,1} x_1 + … + a_{n-1,n-1} x_{n-1} = b_{n-1}
    a_{n-2,0} x_0 + a_{n-2,1} x_1 + … + a_{n-2,n-2} x_{n-2} = b_{n-2}
    …
    a_{1,0} x_0 + a_{1,1} x_1 = b_1
    a_{0,0} x_0 = b_0
• Back Substitution
    x_0 = b_0 / a_{0,0}
    x_1 = (b_1 - a_{1,0} x_0) / a_{1,1}
    x_2 = (b_2 - a_{2,0} x_0 - a_{2,1} x_1) / a_{2,2}
• General solution for x_i
    x_i = (b_i - ∑_{j=0}^{i-1} a_{i,j} x_j) / a_{i,i}
• Sequential code
    x[0] = b[0]/a[0][0];
    for (i = 1; i < n; i++) {
        sum = 0;
        for (j = 0; j < i; j++) sum += a[i][j]*x[j];
        x[i] = (b[i] - sum)/a[i][i];
    }
• Parallel code for P_i where 1 <= i < n
    sum = 0;
    for (j = 0; j < i; j++) {
        recv(&x[j], P_{i-1});
        sum += a[i][j]*x[j];
        send(&x[j], P_{i+1});
    }
    x[i] = (b[i] - sum)/a[i][i];
    send(&x[i], P_{i+1});
  Note: a[i][j] and b[i] are constants
• Parallel pseudo code (forward all values first, then compute)
    for (j = 0; j < i; j++) {
        recv(x[j], P_{i-1});
        send(x[j], P_{i+1});
    }
    sum = 0;
    for (j = 0; j < i; j++) sum = sum + a[i][j]*x[j];
    x[i] = (b[i] - sum)/a[i][i];
    send(x[i], P_{i+1});
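As a reference for the parallel versions, the sequential substitution can be written as a short Python function. It assumes the indexing used on the slide, where row i has non-zero coefficients a[i][0] through a[i][i]; `solve_triangular` and the 3x3 system are hypothetical.

```python
def solve_triangular(a, b):
    """Solve a triangular system where row i uses a[i][0..i] and b[i]."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # Sum of the terms that use already-known unknowns x[0..i-1].
        s = sum(a[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / a[i][i]
    return x

# Hypothetical system built so that the solution is x = [1, 2, 3].
a = [[2, 0, 0],
     [1, 3, 0],
     [4, 5, 6]]
b = [2, 7, 32]
print(solve_triangular(a, b))
```

Row i needs i multiply-add operations, so the work grows with the row index; this is the source of the load-balance question raised for the pipelined version.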

Pipeline Solution
    DO
        IF p ≠ master, receive x_j from the previous processor
        IF p ≠ P-1, send x_j to the next processor
        Back substitute x_j
    UNTIL x_i evaluated
    IF p ≠ P-1, send x_i to the next processor
• Notes:
  1. Processing continues after sending values down the pipeline
  2. Is the load imbalanced?
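A runnable sketch of this Type 3 pipeline, assuming Python threads as processors and `queue.Queue` objects as links. Unlike the pseudo code, the last stage here also sends its values on, to a final queue read by the caller, so that the results can be collected; `pipeline_solve` is a hypothetical name.

```python
import threading
import queue

def pipeline_solve(a, b):
    """Solve the triangular system with one pipeline stage per unknown."""
    n = len(b)
    # links[i] connects stage i to stage i + 1; the last link feeds the caller.
    links = [queue.Queue() for _ in range(n)]

    def stage(i):
        total = 0.0
        for j in range(i):
            xj = links[i - 1].get()   # receive x_j from the previous stage
            links[i].put(xj)          # forward it first so later stages can start
            total += a[i][j] * xj     # then continue the local substitution
        links[i].put((b[i] - total) / a[i][i])   # send x_i down the pipeline

    threads = [threading.Thread(target=stage, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    x = [links[n - 1].get() for _ in range(n)]   # x_0 ... x_{n-1} arrive in order
    for t in threads:
        t.join()
    return x

print(pipeline_solve([[2, 0, 0], [1, 3, 0], [4, 5, 6]], [2, 7, 32]))
```

Forwarding each value before using it is what makes this a Type 3 pipeline: downstream stages begin their own sums while the current stage is still computing.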

Illustration of Type 3 Solution
[Figure: P0 computes x0 and passes it on; P1 forwards x0 and computes x1; P2 forwards x0 and x1 and computes x2; P3 receives x0, x1 and x2 and computes x3. A space-time diagram shows P0 to P5 against time.]
How balanced is this load?