Parallel Prefix and Data Parallel Operations
Motivation: basic parallel operations that occur repeatedly.
Let ⊕ be an associative operation: (a_1 ⊕ a_2) ⊕ a_3 = a_1 ⊕ (a_2 ⊕ a_3).
How can we compute a_1 ⊕ a_2 ⊕ … ⊕ a_n in parallel in O(log n) time?
Approach 1
Each entry [i:j] denotes a_i + a_{i+1} + … + a_j.
initial:    a_0   a_1   a_2   a_3   a_4   a_5   a_6   a_7
after d=1:  [0:0] [0:1] [1:2] [2:3] [3:4] [4:5] [5:6] [6:7]
after d=2:  [0:0] [0:1] [0:2] [0:3] [1:4] [2:5] [3:6] [4:7]
after d=4:  [0:0] [0:1] [0:2] [0:3] [0:4] [0:5] [0:6] [0:7]
Assume that n = 2^k.
for i = 0 to k-1
    for j = 0 to n-1-2^i do in parallel
        x[j + 2^i] = x[j] + x[j + 2^i]
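The doubling scheme of Approach 1 can be simulated sequentially. The following is a minimal sketch, not a parallel implementation: the name prefix_scan and the snapshot list old are illustrative choices, and the snapshot stands in for the "do in parallel" semantics, in which every processor reads the values from before the step.

```python
import operator

def prefix_scan(values, op=operator.add):
    """Inclusive prefix by recursive doubling (hypothetical helper name)."""
    x = list(values)
    n = len(x)
    d = 1  # d plays the role of 2^i in the slide
    while d < n:
        # Snapshot: all updates in one step read the values from before the step.
        old = list(x)
        for j in range(n - d):
            x[j + d] = op(old[j], old[j + d])
        d *= 2
    return x

print(prefix_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

The earlier operand is always passed on the left, so the sketch also works for associative operations that are not commutative, such as string concatenation.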
How to do on Tree Architecture?
for each node
    if there is a signal from left (Sl) and right (Sr), St <- Sl + Sr
    if there is a signal R, send R to both its children
    if the node is a leaf and there is a signal R, X <- X + R
[Figure: a tree node with upward signal St, downward signal R, and child signals Sl and Sr]
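The tree scheme can be sketched as an upward pass followed by a downward pass. The slide does not spell out how a leaf learns the sum of everything to its left, so this sketch assumes the common formulation in which the left child receives R unchanged and the right child receives R + Sl. The heap-style array layout and the name tree_prefix are also assumptions made for illustration.

```python
def tree_prefix(leaves):
    """Inclusive prefix sums on a complete binary tree (len must be a power of two)."""
    n = len(leaves)
    # Heap layout (an assumption): node 1 is the root, node v has children
    # 2v and 2v+1, and the leaves sit at positions n .. 2n-1.
    s = [0] * (2 * n)
    s[n:] = leaves
    # Upward pass: St <- Sl + Sr.
    for v in range(n - 1, 0, -1):
        s[v] = s[2 * v] + s[2 * v + 1]
    # Downward pass: r[v] is the sum of all leaves to the left of v's subtree.
    r = [0] * (2 * n)
    for v in range(1, n):
        r[2 * v] = r[v]                   # left child receives R
        r[2 * v + 1] = r[v] + s[2 * v]    # right child receives R + Sl (assumed)
    # Leaves: X <- X + R.
    return [s[n + i] + r[n + i] for i in range(n)]

print(tree_prefix([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

Both passes touch each tree level once, which corresponds to O(log n) parallel steps on a tree with one processor per node.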
How to do on a Hypercube
A complete binary tree can be embedded into a hypercube.
Simpler solution: each node computes both its prefix (x) and the total sum (sum).
for i = 0 to k-1
    for j = 0 to n-1 do in parallel
        if the i-th bit of j = 1, x[j] = x[j] + sum[j(i)]
        sum[j] = sum[j] + sum[j(i)]
where j(i) and j have the same binary representation except in the i-th bit: the i-th bit of j(i) is the complement of the i-th bit of j.
Prefix on Hypercube
Each entry [i:j] denotes a_i + … + a_j. Node j starts with x[j] = sum[j] = a_j.
node          0     1     2     3     4     5     6     7
d=1   x     [0:0] [0:1] [2:2] [2:3] [4:4] [4:5] [6:6] [6:7]
      sum   [0:1] [0:1] [2:3] [2:3] [4:5] [4:5] [6:7] [6:7]
d=2   x     [0:0] [0:1] [0:2] [0:3] [4:4] [4:5] [4:6] [4:7]
      sum   [0:3] [0:3] [0:3] [0:3] [4:7] [4:7] [4:7] [4:7]
d=4   x     [0:0] [0:1] [0:2] [0:3] [0:4] [0:5] [0:6] [0:7]
      sum   [0:7] [0:7] [0:7] [0:7] [0:7] [0:7] [0:7] [0:7]
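The hypercube update rule can be checked with a small sequential simulation. This is a sketch under stated assumptions: hypercube_prefix is a hypothetical name, total stands for the slide's sum array, and the copy old_total models the fact that partners exchange their sums before either one is updated.

```python
def hypercube_prefix(a):
    """Simulate prefix sums on a hypercube with n = 2^k nodes (n >= 1)."""
    n = len(a)
    k = n.bit_length() - 1
    assert n == 1 << k, "n must be a power of two"
    x = list(a)      # x[j] ends as a_0 + ... + a_j
    total = list(a)  # total[j] ends as the sum of all inputs
    for i in range(k):
        old_total = list(total)  # values exchanged along dimension i
        for j in range(n):
            partner = j ^ (1 << i)  # same index with the i-th bit complemented
            if j & (1 << i):
                x[j] = x[j] + old_total[partner]
            total[j] = old_total[j] + old_total[partner]
    return x, total

x, total = hypercube_prefix([1, 2, 3, 4, 5, 6, 7, 8])
print(x)      # [1, 3, 6, 10, 15, 21, 28, 36]
print(total)  # [36, 36, 36, 36, 36, 36, 36, 36]
```

Each of the k iterations uses only communication between neighbors along one hypercube dimension.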
Applications of Data Parallel Operations
Any associative operation. Examples:
- min, max
- adding two binary numbers
- finite state automata
- radix sort
- segmented prefix sum
- routing
  • packing
  • unpacking
  • broadcast (copy-scan)
- solving recurrence equations
- straight line computation (parallel arithmetic evaluation)
Adding two n-bit numbers as parallel prefix
• a = a_{n-1} … a_0, b = b_{n-1} … b_0, s = a + b
• note that s_i = a_i XOR b_i XOR c_{i-1}, with c_{-1} = 0
• to compute c_i, define g and p as: g_i = a_i AND b_i, p_i = a_i XOR b_i
• define ⊗ as: (g, p) ⊗ (g', p') = (g OR (p AND g'), p AND p')
Then the carry bit c_i can be computed by:
(G_i, P_i) = (g_i, p_i) ⊗ (g_{i-1}, p_{i-1}) ⊗ … ⊗ (g_0, p_0), and G_i = c_i
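The carry computation can be written as a prefix over (g, p) pairs. In this sketch the propagate bit is taken to be a_i XOR b_i, and itertools.accumulate is a sequential stand-in for a parallel scan; because the operator is associative, a parallel prefix can replace it. The function names are illustrative.

```python
from itertools import accumulate

def combine(hi, lo):
    """(g, p) ⊗ (g', p') = (g OR (p AND g'), p AND p'); hi is the more significant pair."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

def add_by_prefix(a, b, n):
    """Add two n-bit non-negative integers (n >= 1) via a carry prefix."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # generate
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # propagate (XOR form assumed)
    # (G_i, P_i) = (g_i, p_i) ⊗ ... ⊗ (g_0, p_0); the new pair goes on the left.
    prefix = list(accumulate(zip(g, p), lambda acc, cur: combine(cur, acc)))
    carries = [G for G, _ in prefix]  # c_i = G_i
    s_bits = [p[i] ^ (carries[i - 1] if i > 0 else 0) for i in range(n)]
    # The final carry becomes bit n of the result.
    return sum(bit << i for i, bit in enumerate(s_bits)) + (carries[-1] << n)

print(add_by_prefix(13, 11, 4))  # 24
```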
Hardware circuit of recursive look-ahead adder
[Figure: 16-bit adder circuit with input pairs a_15 b_15, a_14 b_14, …, a_0 b_0]
Parsing a regular language
Transition function δ (qr: reject state):
δ(q0, b) = q2, δ(q0, c) = q1,
δ(q1, b) = q0, δ(q1, c) = qr,
δ(q2, b) = qr, δ(q2, c) = q0
Each input symbol maps every state to a next state:
b: q0 -> q2, q1 -> q0, q2 -> qr
c: q0 -> q1, q1 -> qr, q2 -> q0
Composition of these maps is associative, so the maps for all prefixes of the input can be computed by a parallel prefix.
[Figure: state diagram of the automaton and a tree that combines the per-symbol state maps]
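The automaton can be run as a prefix over state maps. The sketch makes several assumptions for illustration: states are encoded as integers (q0 = 0, q1 = 1, q2 = 2, qr = 3), q0 is the start state, and qr maps to itself on every symbol. itertools.accumulate stands in for a parallel scan.

```python
from itertools import accumulate

# State map of each symbol: entry s is the state reached from state s.
SYMBOL_MAP = {
    "b": (2, 0, 3, 3),  # q0->q2, q1->q0, q2->qr, qr->qr
    "c": (1, 3, 0, 3),  # q0->q1, q1->qr, q2->q0, qr->qr
}

def compose(first, second):
    """Map obtained by applying `first` and then `second` (associative)."""
    return tuple(second[first[s]] for s in range(len(first)))

def states_after_each_prefix(text, start=0):
    """State reached from `start` after every prefix of `text`."""
    prefix_maps = accumulate((SYMBOL_MAP[ch] for ch in text), compose)
    return [m[start] for m in prefix_maps]

print(states_after_each_prefix("bccb"))  # [2, 0, 1, 0]
print(states_after_each_prefix("bbc"))   # [2, 3, 3]
```

The second call shows that once the reject state (3) is reached, every later prefix stays there.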
Segmented Prefix operation
before:  1  2 |  3  4  5  6 |  7  8
after:   1  3 |  3  7 12 18 |  7 15
( | marks a segment boundary)
Segmented Prefix computation
Let ⊕ be any associative operation. For the segmented operation of ⊕, define ⊕' as follows, where |x denotes an element x that starts a segment:
left \ right     b            |b
a                a ⊕ b        |b
|a               |(a ⊕ b)     |b
Then ⊕' is associative, and we can compute the segmented operation in O(log n) time.
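The segmented operator can be sketched on (value, starts_segment) pairs. The pair encoding and the function names are assumptions for illustration, and itertools.accumulate again stands in for a parallel scan.

```python
import operator
from itertools import accumulate

def seg_op(left, right, op):
    """Segmented version of `op` on (value, starts_segment) pairs."""
    left_value, left_flag = left
    right_value, right_flag = right
    if right_flag:
        # The right operand starts a segment, so nothing carries over.
        return (right_value, True)
    return (op(left_value, right_value), left_flag)

def segmented_scan(values, flags, op=operator.add):
    """Inclusive scan that restarts wherever flags[i] is true."""
    pairs = zip(values, (bool(f) for f in flags))
    scanned = accumulate(pairs, lambda l, r: seg_op(l, r, op))
    return [value for value, _ in scanned]

print(segmented_scan([1, 2, 3, 4, 5, 6, 7, 8], [1, 0, 1, 0, 0, 0, 1, 0]))
# [1, 3, 3, 7, 12, 18, 7, 15]
```

The example uses segments 1 2 | 3 4 5 6 | 7 8, matching the "after" row of the segmented prefix slide.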
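Enumerating
data         = [5 6 3 1 8 3 7 5 9 2]
active procs = [1 0 1 1 0 0 1 0 1 0]
enumerated   = [0 x 1 2 x x 3 x 4 x]
Each active processor receives the number of active processors to its left (an exclusive prefix sum of the active flags).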
Packing
data         = [5 6 3 1 8 3 7 5 9 2]
active procs = [1 0 1 1 0 0 1 0 1 0]
enumerated   = [0 x 1 2 x x 3 x 4 x]
packed data  = [5 3 1 7 9 x x x x x]
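Enumerating and packing can be sketched together. The sketch assumes the active flags cover all ten positions, as the enumerated row implies, and uses None where the slide writes x. The function names are hypothetical.

```python
def enumerate_active(active):
    """Exclusive prefix count of active flags; inactive positions get None."""
    ranks, count = [], 0
    for flag in active:
        ranks.append(count if flag else None)
        count += 1 if flag else 0
    return ranks

def pack(data, active):
    """Move the data of active positions to the front, preserving order."""
    packed = [None] * len(data)
    for i, rank in enumerate(enumerate_active(active)):
        if rank is not None:
            packed[rank] = data[i]  # the rank is the destination address
    return packed

data = [5, 6, 3, 1, 8, 3, 7, 5, 9, 2]
active = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(enumerate_active(active))  # [0, None, 1, 2, None, None, 3, None, 4, None]
print(pack(data, active))        # [5, 3, 1, 7, 9, None, None, None, None, None]
```

The loop in enumerate_active is a sequential prefix sum; on a parallel machine it would be one prefix operation.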
Packing and Unpacking on Hypercube
Packing
• adjust bit 0
• adjust bit 1
• adjust bit 2
• ...
• adjust bit k-1
Unpacking
• adjust bit k-1
• adjust bit k-2
• ...
• adjust bit 1
• adjust bit 0
How about in the order of adjust bit 0, 1, ..., k-1 for packing?
Unpacking
address       =  0 1 2 3 4 5 6 7 8 9
data          = [6 2 3 5 9 x x x x x]
active procs  = [1 0 1 1 0 0 1 0 1 0]
enumerated    = [0 x 1 2 x x 3 x 4 x]
destination   = [0 2 3 6 8 x x x x x]
unpacked data = [6 x 2 3 x x 5 x 9 x]
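Unpacking is the inverse of packing: the r-th packed item goes to the address of the r-th active processor. A minimal sketch follows, again assuming ten active flags and None for the slide's x; the function name unpack is hypothetical.

```python
def unpack(packed, active):
    """Send the r-th packed item to the r-th active position."""
    unpacked = [None] * len(active)
    rank = 0  # number of active positions seen so far (the enumerate value)
    for address, flag in enumerate(active):
        if flag:
            unpacked[address] = packed[rank]
            rank += 1
    return unpacked

packed = [6, 2, 3, 5, 9, None, None, None, None, None]
active = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(unpack(packed, active))
# [6, None, 2, 3, None, None, 5, None, 9, None]
```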
Copy Scan (broadcast)
address       =   0 1 2 3 4 5 6 7 8 9
data          = [ 6 2 3 5 9 4 1 7 8 10]
segmented bit = [ 1 0 1 1 0 0 1 0 1 0]
result        = [ 6 6 3 5 5 5 1 1 8 8]
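Copy scan can be viewed as a segmented prefix whose underlying operation keeps its left operand. The sketch assumes the segment bits cover all ten positions, with a segment starting at address 8 as the result row implies. itertools.accumulate is a sequential stand-in for a parallel scan.

```python
from itertools import accumulate

def copy_scan(data, segment_start):
    """Broadcast the first value of each segment to the rest of that segment."""
    def keep_left(left, right):
        # A right operand that starts a segment resets the running value;
        # otherwise the value from the left is copied forward.
        return right if right[1] else left
    pairs = zip(data, (bool(f) for f in segment_start))
    return [value for value, _ in accumulate(pairs, keep_left)]

data = [6, 2, 3, 5, 9, 4, 1, 7, 8, 10]
bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(copy_scan(data, bits))  # [6, 6, 3, 5, 5, 5, 1, 1, 8, 8]
```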
Radix Sort
for j = 0 to k-1    // x has k bits; least significant bit first, so that the stable split sorts correctly
    for all i in [0..n-1] do in parallel {
        if j-th bit of x[i] is 0 {
            y[i] = enumerate
            c = count
        }
        if j-th bit of x[i] is 1
            y[i] <- enumerate + c
        x[y[i]] = x[i]
    }
Radix sort, another code
for j = 0 to k-1    // x has k bits
    for all i in [0..n-1] do in parallel {
        pack left x[i] if j-th bit of x[i] is 0
        pack right x[i] if j-th bit of x[i] is 1
    }
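The split step of radix sort can be sketched with two running ranks that play the role of enumerate. The sketch processes bits from least significant to most significant, which a stable split needs in order to produce a sorted result. It runs sequentially, and the names are illustrative.

```python
def radix_sort(x, k):
    """Sort non-negative integers of at most k bits by repeated stable splits."""
    x = list(x)
    n = len(x)
    for j in range(k):  # least significant bit first
        bits = [(v >> j) & 1 for v in x]
        c = bits.count(0)      # number of keys whose j-th bit is 0
        rank0 = rank1 = 0      # running enumerate values for the two groups
        y = [0] * n            # destination address of each key
        for i in range(n):
            if bits[i] == 0:
                y[i] = rank0
                rank0 += 1
            else:
                y[i] = c + rank1
                rank1 += 1
        moved = [None] * n
        for i in range(n):
            moved[y[i]] = x[i]  # x[y[i]] = x[i], done out of place
        x = moved
    return x

print(radix_sort([5, 6, 3, 1, 8, 3, 7, 5, 9, 2], 4))
# [1, 2, 3, 3, 5, 5, 6, 7, 8, 9]
```

The out-of-place move models the parallel assignment, where all reads of x happen before any write.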
Quick Sort
1. Pick a pivot p
2. Broadcast p
3. For all PE i, compare A[i] with p {
       if A[i] < p, pack left A[i] in the segment
       if A[i] >= p, pack right A[i] in the segment
   }
4. Mark the segment boundary
5. For each segment, quick sort recursively
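The segment-based quick sort can be simulated round by round. This sketch deviates from the steps above in two labeled ways: the pivot is taken to be the first element of each segment, and keys equal to the pivot form their own middle group, so that every split makes progress even with duplicate keys. It is a sequential illustration, not a parallel implementation.

```python
def segmented_quicksort(a):
    """Sort by repeatedly splitting every segment around its own pivot."""
    a = list(a)
    n = len(a)
    start = [False] * n  # start[i] is True if a segment begins at i
    if n:
        start[0] = True
    while True:
        bounds = [i for i in range(n) if start[i]] + [n]
        changed = False
        for lo, hi in zip(bounds, bounds[1:]):
            segment = a[lo:hi]
            p = segment[0]  # pivot choice is an assumption
            if all(v == p for v in segment):
                continue  # segment is already sorted
            left = [v for v in segment if v < p]    # pack left
            mid = [v for v in segment if v == p]    # assumed middle group
            right = [v for v in segment if v > p]   # pack right
            a[lo:hi] = left + mid + right
            start[lo + len(left)] = True            # mark segment boundaries
            if right:
                start[lo + len(left) + len(mid)] = True
            changed = True
        if not changed:
            return a

print(segmented_quicksort([5, 6, 3, 1, 8, 3, 7, 5, 9, 2]))
# [1, 2, 3, 3, 5, 5, 6, 7, 8, 9]
```

All segments are processed in the same round, which mirrors how the parallel version handles every segment simultaneously with segmented operations.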
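Solving Linear Recurrence Equations
f_n = a_{n-1} f_{n-1} + a_{n-2} f_{n-2}
In matrix form:
[ f_n     ]   [ a_{n-1}  a_{n-2} ] [ f_{n-1} ]
[ f_{n-1} ] = [ 1        0       ] [ f_{n-2} ]
Matrix multiplication is associative, so the products of these 2x2 matrices, and hence all f_n, can be computed by parallel prefix.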
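The recurrence can be evaluated through prefix products of 2x2 matrices. The sketch assumes the matrix M_m = [[a_{m-1}, a_{m-2}], [1, 0]], so that (f_m, f_{m-1}) = M_m (f_{m-1}, f_{m-2}). The function names and the use of itertools.accumulate in place of a parallel scan are illustrative choices.

```python
from itertools import accumulate

def mat_mul(A, B):
    """Product of two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(A[r][t] * B[t][c] for t in range(2)) for c in range(2))
        for r in range(2)
    )

def solve_recurrence(a, f0, f1, n):
    """Return [f_0, ..., f_n] for f_m = a[m-1]*f_{m-1} + a[m-2]*f_{m-2}."""
    # One matrix per step m = 2 .. n.
    mats = [((a[m - 1], a[m - 2]), (1, 0)) for m in range(2, n + 1)]
    # Prefix products P_m = M_m * ... * M_2 (each new matrix multiplies on the left).
    prefix = accumulate(mats, lambda acc, M: mat_mul(M, acc))
    # f_m is the first row of P_m applied to (f_1, f_0).
    return [f0, f1] + [P[0][0] * f1 + P[0][1] * f0 for P in prefix]

# With all coefficients equal to 1 the recurrence is the Fibonacci sequence.
print(solve_recurrence([1] * 9, 0, 1, 9))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```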
Pointer Jumping and Tree Computation
How to compute a prefix on a linked list?
Repeat in parallel for every node i:
    if NEXT[i] != NIL then
        X[i] <- X[i] + X[NEXT[i]]
        NEXT[i] <- NEXT[NEXT[i]]
initial:        1  2  3  4  5  6  7
after step 1:   3  5  7  9 11 13  7
after step 2:  10 14 18 22 18 13  7
after step 3:  28 27 25 22 18 13  7
How to make the order 1 3 6 10 15 21 28?
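Pointer jumping can be simulated with arrays. In this sketch None plays the role of NIL, node indices are 0-based, and copies of X and NEXT are taken each round so that all nodes read the values from before the round. The function name is hypothetical.

```python
def pointer_jump(x, nxt):
    """Each node ends with the sum of its own value and all values after it."""
    x, nxt = list(x), list(nxt)
    while any(p is not None for p in nxt):
        old_x, old_next = list(x), list(nxt)  # state before this round
        for i in range(len(x)):
            if old_next[i] is not None:
                x[i] = old_x[i] + old_x[old_next[i]]
                nxt[i] = old_next[old_next[i]]
    return x

values = [1, 2, 3, 4, 5, 6, 7]
successor = [1, 2, 3, 4, 5, 6, None]
print(pointer_jump(values, successor))    # [28, 27, 25, 22, 18, 13, 7]

# One possible answer to the ordering question: follow predecessor links.
predecessor = [None, 0, 1, 2, 3, 4, 5]
print(pointer_jump(values, predecessor))  # [1, 3, 6, 10, 15, 21, 28]
```

The pointer distance doubles every round, so a list of n nodes needs about log2(n) rounds.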
Application: Tree computation
Pre-order numbering
[Figure: tree with labels "each node 1" and "leaf node 1"]
Can also be applied to in-order and post-order numbering, number of children, depth, etc.
Bi-component, etc. also
Recurrence Equation Example: LU decomposition of a tridiagonal matrix