Lecture 18 SIMD: Vector Processing © W. W. Hwu and S. J. Patel, 2005 ECE 511, University of Illinois

General-purpose to Specific Application Domains
• General-purpose computing presents tough problems in architecture.
• One pathway to better architectures is to know the application domain.
• Example: Scientific applications

DAXPY: A common kernel in scientific codes
    Y = a*X + Y
• Vector times a scalar plus a vector
• Many elements of the arrays need to go through the same processing
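
The DAXPY kernel above can be written as a short scalar reference. This is a minimal sketch in Python, assuming plain lists stand in for the arrays X and Y; the function name `daxpy` is chosen here for illustration only.

```python
def daxpy(a, x, y):
    """Return a*x + y element-wise; x and y must have the same length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Every element goes through the same multiply-add, which is
    # what makes this kernel a natural fit for SIMD execution.
    return [a * xi + yi for xi, yi in zip(x, y)]


print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```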

SIMD at work

Types of Vector Processing
• Attached co-processor to improve scientific application performance
  – TI ASC, CDC STAR 100, IBM 3838, FPS-164
• Supercomputers designed to run scientific applications
  – CRAY-1, Cyber 205, CRAY X-MP, CRAY-2, CRAY Y-MP, Fujitsu VP 100/200, Hitachi S 810/820, NEC SX/2
• Minicomputers designed to give better price/performance than supercomputers
  – CONVEX C-1, Alliant FX-8
• Instruction set extension to improve performance
  – IBM 3090, VAX 6000, x86 MMX, 3DNow!, etc.

Vector Processing Topics
• Vectorization
• Memory systems design
• Vector length and stride
• Chaining
• Memory units (Cray X-MP)
• Scatter/gather, conditional execution

Typical Vector Architecture
• A vector unit typically consists of:
  – a vector instruction processor
  – a collection of vector registers (e.g., eight 64-entry registers in CRAY-1)
  – a vector length register (e.g., 6 bits in CRAY-1); the vector length is implicit in MMX
  – a mask register (e.g., 64 bits in CRAY-1)
  – a set of pipelined function units (e.g., load/store, FP add, FP multiply, FP reciprocal, integer add, logic, shift in CRAY-1)

Vector Code
• Vector code generated for a register-to-register vector architecture:
    VL ← N        (vector length: the number of elements processed by each subsequent vector instruction)
    v0 ← Load B
    v1 ← Load C
    v2 ← v0 + v1
    Store A ← v2
• An outer loop may be required if N is greater than the maximum vector length allowed; details are discussed later.
• If N is sufficiently large, each vector instruction takes about N cycles to execute.
  – With an aggressive design that uses chaining, all the vector instructions can overlap and finish in about N cycles in total.

Loop Distribution
• Basic transformation for vectorization
  – Transform a multi-statement loop into a sequence of single-statement loops.
  – Each single-statement loop becomes the basis of a vector instruction.
• Example:
    DO I = 1, N
      S1
      …
      SN
    END DO
• Becomes:
    DO I = 1, N
      S1
    END DO
    …
    DO I = 1, N
      SN
    END DO

Forming Vector Instructions
• Convert single-statement loops into vector statements
    DO I = 1, N
      A(I) = B(I) + C(I)
    END DO
• Vector statement: A(1:N) = B(1:N) + C(1:N)
  – The 1st through Nth elements of A, B, and C are processed.
• Converting to actual vector instructions:
    VL ← N
    v1 ← LOAD(B, 1)      (1 is the stride, to be explained later)
    v2 ← LOAD(C, 1)
    v3 ← v1 + v2
    STORE(A, 1) ← v3
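
The instruction sequence above can be mimicked with a toy software model. The sketch below is only an illustration: the helpers `vload`, `vadd`, and `vstore` are hypothetical names, Python lists play the role of memory and vector registers, and indices are 0-based rather than the 1-based Fortran indices of the slide.

```python
def vload(mem, base, stride, vl):
    """Model of a vector load: read vl elements starting at base."""
    return [mem[base + k * stride] for k in range(vl)]


def vadd(va, vb):
    """Model of a vector add on two vector registers."""
    return [x + y for x, y in zip(va, vb)]


def vstore(mem, base, stride, vreg):
    """Model of a vector store: write every element of vreg to memory."""
    for k, value in enumerate(vreg):
        mem[base + k * stride] = value


def vector_add(a, b, c, n):
    """A(1:N) = B(1:N) + C(1:N), following the slide's instruction order."""
    vl = n                    # VL <- N
    v1 = vload(b, 0, 1, vl)   # v1 <- LOAD(B, 1)
    v2 = vload(c, 0, 1, vl)   # v2 <- LOAD(C, 1)
    v3 = vadd(v1, v2)         # v3 <- v1 + v2
    vstore(a, 0, 1, v3)       # STORE(A, 1) <- v3


a = [0, 0, 0, 0]
vector_add(a, [1, 2, 3, 4], [10, 20, 30, 40], 4)
print(a)  # [11, 22, 33, 44]
```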

Loop Carried Dependence
• Dependence between instructions from different iterations of the same loop.
    DO I = 1, N
      S: A(I+1) = A(I) * B(I)
    END DO
• If we unwind the loop, the execution of S in different iterations looks like the following:
    A(2) = A(1) * ...    Iteration 1
    A(3) = A(2) * ...    Iteration 2
    A(4) = A(3) * ...    Iteration 3
• There is a flow dependence from iteration i to iteration i+1.
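
To see why this flow dependence matters, the sketch below compares the scalar loop with a naive vector statement that loads all of A before storing any result. It is an illustrative model with hypothetical function names and 0-based indexing, not code from the lecture.

```python
def scalar_loop(a, b):
    """A(I+1) = A(I) * B(I), executed one iteration at a time."""
    a = list(a)
    for i in range(len(b)):
        a[i + 1] = a[i] * b[i]   # reads the value written one iteration earlier
    return a


def naive_vector(a, b):
    """Same statement treated as one vector operation (incorrect here)."""
    a = list(a)
    n = len(b)
    old = a[0:n]                 # the vector load reads every A(I) up front
    a[1:n + 1] = [x * y for x, y in zip(old, b)]
    return a


print(scalar_loop([1, 1, 1, 1], [2, 3, 4]))   # [1, 2, 6, 24]
print(naive_vector([1, 1, 1, 1], [2, 3, 4]))  # [1, 2, 3, 4]
```

The two results differ because the vector version never sees the values produced by earlier iterations.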

Example of Backward Loop Carried Dependence
    DO I = 1, N
      S1: D(I) = A(I-1) * D(I)
      S2: A(I) = B(I) + C(I)
    END DO
• The execution of S1 and S2 in different iterations:
    S1: D(1) = A(0) * D(1)
    S2: A(1) = B(1) + C(1)
    S1: D(2) = A(1) * D(2)
    S2: A(2) = B(2) + C(2)
• There is a flow dependence from S2 of iteration i to S1 of iteration i+1.

Problem with backward loop carried dependences
• Multi-statement loops with backward loop carried dependences cannot be distributed.
    DO I = 1, N
      S1: C(I) = A(I-1) + ...
      S2: A(I) = ...
    END DO
• The execution of iterations looks like:
    C(1) = A(0) + ...    S1 of iteration 1
    A(1) = ...           S2 of iteration 1
    C(2) = A(1) + ...    S1 of iteration 2
    A(2) = ...           S2 of iteration 2
• S2 in iteration i delivers its result to S1 in iteration i+1.

Problem (Cont.)
• Loop distribution generates single-statement loops:
    DO I = 1, N
      S1: C(I) = A(I-1) + ...
    END DO
    DO I = 1, N
      S2: A(I) = ...
    END DO
• All iterations of S1 are now done before those of S2.
  – The result of S2 in iteration i can no longer be delivered to S1 in iteration i+1. Therefore, the execution is invalid after loop distribution.

Overcoming Backward Dependence
• Statement reordering: If S2 does not depend on S1 in the same iteration, one can reorder the syntactic ordering of S1 and S2:
• Before:
    DO I = 1, N
      S1: C(I) = A(I-1) + ...
      S2: A(I) = ...
    END DO
• After:
    DO I = 1, N
      S2: A(I) = ...
      S1: C(I) = A(I-1) + ...
    END DO

Overcoming Backward Dependence (Cont.)
• Now with statement reordering and loop distribution, the reordered loop becomes:
    DO I = 1, N
      S2: A(I) = ...
    END DO
    DO I = 1, N
      S1: C(I) = A(I-1) + ...
    END DO
• Note that all results of S2 are now generated before the execution of S1. The execution results remain valid after loop distribution.
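
The difference between distributing the loop as written and distributing it after reordering can be checked with a small experiment. In the sketch below the right-hand sides (`A(I-1) + 1` and `B(I) * 2`) are made-up stand-ins for the elided expressions on the slides, and the function names are hypothetical. Index 0 holds A(0), so the arrays have N+1 entries.

```python
def original(a, b, n):
    """Single loop: S1 then S2 in every iteration."""
    a, c = list(a), [0] * (n + 1)
    for i in range(1, n + 1):
        c[i] = a[i - 1] + 1   # S1 (assumed right-hand side)
        a[i] = b[i] * 2       # S2 (assumed right-hand side)
    return a, c


def distributed_as_written(a, b, n):
    """All of S1 first, then all of S2: S1 misses the results of S2."""
    a, c = list(a), [0] * (n + 1)
    for i in range(1, n + 1):
        c[i] = a[i - 1] + 1
    for i in range(1, n + 1):
        a[i] = b[i] * 2
    return a, c


def reordered_then_distributed(a, b, n):
    """All of S2 first, then all of S1: every A(I-1) is ready in time."""
    a, c = list(a), [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] * 2
    for i in range(1, n + 1):
        c[i] = a[i - 1] + 1
    return a, c


a0, b0 = [5, 0, 0, 0], [0, 1, 2, 3]
print(original(a0, b0, 3))                    # ([5, 2, 4, 6], [0, 6, 3, 5])
print(distributed_as_written(a0, b0, 3))      # ([5, 2, 4, 6], [0, 6, 1, 1])
print(reordered_then_distributed(a0, b0, 3))  # ([5, 2, 4, 6], [0, 6, 3, 5])
```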

Problem of Cyclic Dependence
• Cyclic dependence: A loop cannot be distributed if there is a cyclic loop-carried dependence.
  – A cyclic dependence is typically formed by one or more forward dependence arcs within the same iteration followed by a backward loop carried dependence.
  – A cyclic dependence can also be formed by a backward loop carried dependence from one statement to itself.
• Loop statements involved in the cyclic dependence cannot be distributed and are thus not vectorizable.
• Question: Can we increase the success rate of vectorization in the presence of cyclic loop-carried dependence?

Common Solution
• Loop interchange: Reverse the roles of the inner and outer loops
• In the example, the inner loop has a cyclic loop-carried dependence but the outer loop does not.
    DO I = 1, N
      DO J = 1, N
        S: A(I, J+1) = A(I, J) * B(I, J)
      END DO
    END DO
• With the cyclic dependence, the inner loop cannot be converted to a vector statement.

Loop Interchange
• However, with loop interchange:
    DO J = 1, N
      DO I = 1, N
        S: A(I, J+1) = A(I, J) * B(I, J)
      END DO
    END DO
• The inner loop can now be a vector statement:
    DO J = 1, N
      A(1:N, J+1) = A(1:N, J) * B(1:N, J)
    END DO
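
A quick way to convince oneself that the interchange is legal is to run both loop orders on the same data. The sketch below is an illustration with hypothetical function names; A is stored as a list of rows with N+1 columns, indices are 0-based, and the inner vector statement is modeled by computing a whole column before writing it back.

```python
def original_order(a, b, n):
    """Outer loop over I, inner loop over J (inner loop carries the dependence)."""
    a = [row[:] for row in a]
    for i in range(n):
        for j in range(n):
            a[i][j + 1] = a[i][j] * b[i][j]
    return a


def interchanged_vector(a, b, n):
    """Outer loop over J; the inner I loop is executed as one vector statement."""
    a = [row[:] for row in a]
    for j in range(n):
        # All reads of column j happen before any write to column j+1,
        # as they would in a vector instruction.
        column = [a[i][j] * b[i][j] for i in range(n)]
        for i in range(n):
            a[i][j + 1] = column[i]
    return a


a0 = [[1, 0, 0], [2, 0, 0]]
b0 = [[3, 4], [5, 6]]
print(original_order(a0, b0, 2))       # [[1, 3, 12], [2, 10, 60]]
print(interchanged_vector(a0, b0, 2))  # [[1, 3, 12], [2, 10, 60]]
```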

Limited-Length Vector Registers
    DO I = 1, N
      A(I) = B(I) * C(I)
    END DO
• Ideally, the loop could be converted to the following sequence of vector instructions:
    VL ← N
    v1 ← LOAD(B, 1)
    v2 ← LOAD(C, 1)
    v3 ← v1 * v2
    STORE(A, 1) ← v3

Limited-Length Vector Registers (Cont.)
• If each vector register can only hold n elements and n < N, then we need to use loop sectioning (or strip mining) to use these vector registers of limited length.
• Example with n = 64:
    q = N / 64
    DO %SEC = 1, q
      DO %BL = 1, 64
        I = (%SEC - 1) * 64 + %BL
        A(I) = B(I) * C(I)
      END DO
    END DO
    IF (N - 64*q) ≠ 0 THEN
      DO %BL = 1, (N - 64*q)
        I = 64*q + %BL
        A(I) = B(I) * C(I)
      END DO
    END IF

Limited-Length Vector Registers (Cont.)
• The sectioned loop can be vectorized as follows:
    q = N / 64
    DO %SEC = 1, q
      I = (%SEC - 1) * 64
      A(I+1:I+64) = B(I+1:I+64) * C(I+1:I+64)
    END DO
    r = N - 64*q
    IF r ≠ 0 THEN
      A(64*q+1:N) = B(64*q+1:N) * C(64*q+1:N)
    END IF
• This leads to the following vector instructions:
    VL ← 64
    q = N / 64
    DO %SEC = 1, q
      I = (%SEC - 1) * 64
      v1 ← LOAD(B+I, 1)
      v2 ← LOAD(C+I, 1)
      v3 ← v1 * v2
      STORE(A+I, 1) ← v3
    END DO
    r = N - 64*q
    IF r ≠ 0 THEN
      …
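
Strip mining as described above can be sketched in a few lines. This is a minimal illustration rather than the lecture's code: the function name and the `mvl` parameter (maximum vector length, 64 on the slide) are assumptions, list slices stand in for vector registers, and indices are 0-based.

```python
def strip_mined_multiply(b, c, mvl=64):
    """Compute A = B * C element-wise in sections of at most mvl elements."""
    n = len(b)
    a = [0] * n
    q = n // mvl                      # number of full sections
    for sec in range(q):
        i = sec * mvl
        # One full-length vector operation (VL = mvl).
        a[i:i + mvl] = [x * y for x, y in zip(b[i:i + mvl], c[i:i + mvl])]
    r = n - mvl * q                   # leftover elements
    if r != 0:
        i = mvl * q
        # One shorter vector operation (VL = r) for the remainder.
        a[i:n] = [x * y for x, y in zip(b[i:n], c[i:n])]
    return a


print(strip_mined_multiply(list(range(10)), [2] * 10, mvl=4))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```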

Vector Loads and Stores
• LOAD: R ← A(x1 : n [: stride])
• STORE: A(x1 : n [: stride]) ← R
  – A - array base address
  – x1 - starting element index
  – n - number of elements accessed (usually implicitly the vector length)
  – stride - increment after each access
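
The role of the stride is easiest to see on a matrix stored row by row. The sketch below models strided access under the assumption of 0-based indices and a flat Python list as memory; `vload` and `vstore` are hypothetical helper names.

```python
def vload(mem, x1, n, stride=1):
    """R <- A(x1 : n : stride): read n elements, stepping by stride."""
    return [mem[x1 + k * stride] for k in range(n)]


def vstore(mem, x1, stride, vreg):
    """A(x1 : n : stride) <- R: write every element of vreg, stepping by stride."""
    for k, value in enumerate(vreg):
        mem[x1 + k * stride] = value


# A 3x4 matrix stored in row-major order: element (row, col) is at row*4 + col.
matrix = list(range(12))
column_1 = vload(matrix, 1, 3, stride=4)   # walk down column 1
print(column_1)                            # [1, 5, 9]

vstore(matrix, 1, 4, [0, 0, 0])            # clear column 1
print(matrix)  # [0, 0, 2, 3, 4, 0, 6, 7, 8, 0, 10, 11]
```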

Gather and Scatter
• Also known as COMPRESS/EXPAND.
• GATHER: A LOAD with a mask bit provided for each element.
  – The element is loaded if the corresponding mask bit is true.
  – The destination vector register thus loaded will contain only the elements whose mask bits are true.
  – E.g., if MASK = 1 (for A(4)) 0 (for A(3)) 1 (for A(2)) 0 (for A(1)), GATHER: R ← A(1, 4, 1) will result in R = (A(4), A(2)).
• The mask is usually generated by a previous vector compare instruction.

Gather and Scatter (Cont.)
• SCATTER: A STORE with a mask bit provided for each element.
  – The elements of the source vector register are scattered into the memory elements whose mask bits are true.
  – E.g., if MASK is the same as above, SCATTER will store the first element of R into A(2) and the second element of R into A(4), leaving A(1) and A(3) untouched.
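
The masked GATHER and SCATTER described on these two slides can be modeled as follows. This is a sketch under stated assumptions: the function names and argument order are made up for illustration, indices are 0-based, and the register is listed lowest index first (the slide writes R with A(4) first).

```python
def gather(mem, x1, n, stride, mask):
    """Masked load: keep only elements whose mask bit is true, packed together."""
    return [mem[x1 + k * stride] for k in range(n) if mask[k]]


def scatter(mem, x1, n, stride, mask, vreg):
    """Masked store: spread the packed register over the true mask positions."""
    values = iter(vreg)
    for k in range(n):
        if mask[k]:
            mem[x1 + k * stride] = next(values)


a = [10, 20, 30, 40]                 # A(1)..A(4)
mask = [False, True, False, True]    # mask bits for A(1)..A(4)

r = gather(a, 0, 4, 1, mask)
print(r)                             # [20, 40], i.e. A(2) and A(4)

scatter(a, 0, 4, 1, mask, [-1, -2])
print(a)                             # [10, -1, 30, -2]
```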

Indirect Load and Store
• A vector of indices is provided to the indirect LOAD: R ← A(X, n)
  – X: index vector
  – n: number of elements (usually implicitly the vector length)
• INDIRECT-LOAD: R ← A(X, 4)
  – will deposit (A(X(4)), A(X(3)), A(X(2)), A(X(1))) into R.
• Often used for sparse matrices
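
As an illustration of the sparse-matrix use case, the sketch below models an indirect load and uses it for a sparse dot product. The storage layout (a list of nonzero values plus a list of their column indices) and the function names are assumptions made for this example; indices are 0-based.

```python
def indirect_load(mem, index_vector, n):
    """R <- A(X, n): load mem[X[0]], mem[X[1]], ..., mem[X[n-1]]."""
    return [mem[index_vector[k]] for k in range(n)]


def sparse_dot(values, columns, x):
    """Dot product of one sparse row (values at the given columns) with dense x."""
    picked = indirect_load(x, columns, len(columns))   # fetch only needed x entries
    return sum(v * p for v, p in zip(values, picked))


x = [1.0, 5.0, 7.0, 4.0]
print(indirect_load(x, [0, 3], 2))          # [1.0, 4.0]
print(sparse_dot([2.0, 3.0], [0, 3], x))    # 14.0
```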

Masks
• Used to support conditional branches
  – A mask vector is provided.
  – The exception condition and the destination write for an element are canceled if the corresponding mask bit is false.

Masks (cont.)
    DO I = 1, s
      IF B(I) ≠ 0 THEN
        C(I) = A(I) / B(I)
      ELSE
        C(I) = MAXNUMBER
      END IF
    END DO
• Vector code with masked arithmetic instructions:
    VL ← s
    v0 ← Load (A, 1)
    v1 ← Load (B, 1)
    v2 ← MAXNUMBER
    MASK ← v1 ≠ 0
    where (MASK) v2 ← v0 / v1
    Store (C, 1) ← v2
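
The masked vector code above can be traced with a small model. In this sketch `MAXNUMBER` is assumed to be the largest finite double (`sys.float_info.max`), which the slide does not specify, and the function name is hypothetical. The point is that the division is skipped, and no exception raised, wherever the mask bit is false.

```python
import sys

MAXNUMBER = sys.float_info.max   # assumption: the slide does not define MAXNUMBER


def masked_divide(a, b):
    """C(I) = A(I) / B(I) where B(I) != 0, else MAXNUMBER."""
    vl = len(a)                        # VL <- s
    v0 = list(a)                       # v0 <- Load (A, 1)
    v1 = list(b)                       # v1 <- Load (B, 1)
    v2 = [MAXNUMBER] * vl              # v2 <- MAXNUMBER
    mask = [x != 0 for x in v1]        # MASK <- v1 != 0
    for k in range(vl):
        if mask[k]:                    # where (MASK) v2 <- v0 / v1
            v2[k] = v0[k] / v1[k]
    return v2                          # Store (C, 1) <- v2


print(masked_divide([6.0, 1.0, 9.0], [3.0, 0.0, 2.0]))
```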

Alternative Vectorization using Gather/Scatter
    VL ← s
    v1 ← Load (B, 1)
    MASK ← v1 ≠ 0
    where (MASK) v0 ← Load (A, 1)
    where (MASK) v1 ← Load (B, 1)
    v2 ← MAXNUMBER
    Store (C, 1) ← v2
    v2 ← v0 / v1
    where (MASK) Store (C, 1) ← v2
• Note that the division no longer needs a mask.
• Gather/scatter pays off when many steps of computation need to be performed once the arrays are loaded.
  – Not in this particular example
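
The same computation can be modeled with gather and scatter: the operands are compressed first, the division runs unmasked on the shorter vectors, and the results are expanded back into C. This is a sketch with a hypothetical function name, again assuming `MAXNUMBER` is `sys.float_info.max`.

```python
import sys

MAXNUMBER = sys.float_info.max   # assumption: the slide does not define MAXNUMBER


def gather_scatter_divide(a, b):
    """C(I) = A(I) / B(I) where B(I) != 0, else MAXNUMBER, via gather/scatter."""
    s = len(b)
    mask = [x != 0 for x in b]                   # MASK <- v1 != 0
    v0 = [a[k] for k in range(s) if mask[k]]     # masked (gather) load of A
    v1 = [b[k] for k in range(s) if mask[k]]     # masked (gather) load of B
    c = [MAXNUMBER] * s                          # unmasked store of MAXNUMBER
    v2 = [x / y for x, y in zip(v0, v1)]         # division needs no mask
    packed = iter(v2)
    for k in range(s):                           # masked (scatter) store into C
        if mask[k]:
            c[k] = next(packed)
    return c


print(gather_scatter_divide([6.0, 1.0, 9.0], [3.0, 0.0, 2.0]))
```

Only two divisions are performed for the three-element input, since the zero divisor was never loaded into the compressed registers.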

Chaining
• The results of a vector instruction are forwarded to a subsequent vector instruction on an element-by-element basis.
  – If there is a flow dependence from I1 to I2, I2 does not have to wait until I1 finishes.
  – It only needs to wait until the first output element of I1 is generated.
• Chaining slot:
  – The clock cycle in which the first element of I1 arrives at the destination register.
  – In CRAY-1, if I2 cannot start in this cycle due to FU conflicts, I2 must wait until I1 finishes.
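
The benefit of chaining can be estimated with a deliberately simplified timing model. The sketch below assumes each vector instruction in a dependent sequence has a startup latency and then delivers one element per cycle; the function names and the latency values in the example are hypothetical and do not describe any particular machine.

```python
def cycles_without_chaining(startups, n):
    """Each dependent instruction waits for its predecessor to finish completely."""
    return sum(s + n for s in startups)


def cycles_with_chaining(startups, n):
    """Each dependent instruction starts once the first element of its
    predecessor arrives, so only the startup latencies add up."""
    return sum(startups) + n


# Hypothetical startup latencies for a load, a multiply, and an add, with N = 64.
latencies = [12, 7, 6]
print(cycles_without_chaining(latencies, 64))  # 217
print(cycles_with_chaining(latencies, 64))     # 89
```

Under this model the chained sequence finishes in roughly N cycles plus the combined startup time, which matches the "about N cycles" estimate given earlier for chained vector code.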