CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture #13 – Other Spatial Styles

Systolic Architectures
• Goal: a general methodology for mapping computations into hardware (spatial computing) structures
• Composition:
    • Simple compute cells (e.g., add, sub, max, min)
    • Regular interconnect pattern
    • Pipelined communication between cells
    • I/O at the boundaries
[Figure: example compute cells (labels include +, x, c, and min)]
October 2, 2007, CprE 583 – Reconfigurable Computing

Example – Matrix-Vector Product

    for (i=1; i<=n; i++)
        for (j=1; j<=n; j++)
            y[i] += a[i][j] * x[j];

Matrix-Vector Product (cont.)
Input schedule for the matrix elements (one column per cell):

Time    x1     x2     x3     x4    …  xn
t=1     a11    –      –      –
t=2     a21    a12    –      –
t=3     a31    a22    a13    –
t=4     a41    a32    a23    a14

Outputs: y1 at t = n, y2 at t = n+1, y3 at t = n+2, y4 at t = n+3

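The skewed schedule above can be sketched as a cycle-by-cycle simulation in C. This is only a minimal illustration under stated assumptions: cell j holds x[j], element a[i][j] reaches cell j on cycle i+j (0-indexed), and partial sums move one cell to the right per cycle. The name `systolic_matvec` and the fixed size `N` are hypothetical, not from the lecture.

```c
#include <assert.h>

#define N 4  /* illustrative array size (assumption) */

/* Simulate a linear systolic array of N cells.
   Cell j holds x[j]; on cycle t it consumes a[t-j][j] and adds the
   product to the partial sum handed over by cell j-1 on the previous cycle. */
void systolic_matvec(int a[N][N], const int x[N], int y[N]) {
    int psum[N] = {0};  /* pipeline register at the output of each cell */

    for (int t = 0; t < 2 * N - 1; t++) {
        /* Walk right to left so psum[j-1] still holds last cycle's value. */
        for (int j = N - 1; j >= 0; j--) {
            int i = t - j;                 /* row reaching cell j this cycle */
            if (i < 0 || i >= N)
                continue;                  /* cell is idle this cycle */
            int y_in = (j == 0) ? 0 : psum[j - 1];
            psum[j] = y_in + a[i][j] * x[j];
            if (j == N - 1)
                y[i] = psum[j];            /* y[i] leaves the array on cycle i+N-1 */
        }
    }
}
```

With 1-indexed rows and cycles, the last line corresponds to y1 leaving at t = n, y2 at t = n+1, and so on, which matches the output times listed on the slide.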
Banded Matrix-Vector Product
[Figure: banded matrix with band dimensions p and q]

    for (i=1; i<=n; i++)
        for (j=1; j<=p+q-1; j++)
            y[i] += a[i][j-q-i] * x[j];

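For comparison, here is a hedged C sketch of a banded product under one common convention, which may differ from the indexing used on the slide: the matrix is stored densely, the band has q-1 sub-diagonals and p-1 super-diagonals (p+q-1 diagonals in total), and entries outside the band are never touched. The name `banded_matvec` and the size `N` are illustrative assumptions.

```c
#include <assert.h>

#define N 5  /* illustrative matrix size (assumption) */

/* y = A * x for a band matrix stored densely.
   Assumed convention: row i has nonzeros only in columns
   i-(q-1) .. i+(p-1), i.e. p+q-1 diagonals in total. */
void banded_matvec(int a[N][N], const int x[N], int y[N], int p, int q) {
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int k = 0; k < p + q - 1; k++) {
            int j = i + k - (q - 1);   /* column of the k-th diagonal in row i */
            if (j < 0 || j >= N)
                continue;              /* diagonal falls off the matrix edge */
            y[i] += a[i][j] * x[j];
        }
    }
}
```

The inner loop runs p+q-1 times regardless of n, which is why the systolic version needs only p+q-1 cells.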
Banded Matrix-Vector Product (cont.)
Matrix input schedule (one column per cell):

t=1    –      –      –      –
t=2    –      a11    –      –
t=3    a12    –      a21    –
t=4    –      a22    –      a31
t=5    a23    –      a32    –

x input schedule: x1 at t=1, – at t=2, x2 at t=3, – at t=4, x3 at t=5
Output: yi
[Figure: cell with ports ain, xin, xout, yin, yout]

Outline
• Recap
• Non-Numeric Systolic Examples
• Systolic Loop Transformations
    • Data dependencies
    • Iteration spaces
    • Example transformations
• Reading – Cellular Automata
• Reading – Bit-Serial Architectures

Example – Relational Database
• A relation is a collection of tuples that all have the same attributes
• A tuple is a fixed number of objects
• Represented in a table

Tuple #   Name          School         Age   QB Rating
0         T. Brady      Michigan       30    141.8
1         T. Romo       E. Illinois    27    112.9
2         J. Delhomme   LA-Lafayette   32    111.9
3         P. Manning    Tennessee      31    110.4
4         R. Grossman   Florida        27    45.2

Database Operations
• Intersection: A ∩ B, all records that are in both relation A and relation B
• Must compare all |A| × |B| pairs of tuples
• Compare via sequence compare

        B1     B2     B3
A1      T11    T12    T13
A2      T21    T22    T23
A3      T31    T32    T33

• OR along a row or column to get the inclusion bit vector
[Figure: A tuples and B tuples streaming through the comparison array, producing results such as T12 and T21]

Database Operations (cont.)
• Tuple comparison
    • Problem: tuples are long, so comparison time might limit the computation rate
    • Strategy: perform the comparison in a pipelined manner, field by field
        • Stagger the fields
        • Arrange to compute field i on the cycle after field i-1
    • Cell: tout = tin AND (ain XNOR bin)
[Figure: chain of four comparison cells seeded with True; cell k compares A[i,k] with B[j,k]]

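The comparison cell can be sketched in C as follows. This is a minimal software model, assuming integer-valued fields so that XNOR on bits becomes an equality test on fields; `compare_cell`, `tuples_equal`, and `FIELDS` are hypothetical names introduced for illustration. The loop stands in for the pipeline: iteration k plays the role of the cell that handles field k one cycle after field k-1.

```c
#include <assert.h>
#include <stdbool.h>

#define FIELDS 4  /* illustrative number of fields per tuple (assumption) */

/* One cell: t_out = t_in AND (a_in XNOR b_in).
   For multi-bit fields the XNOR reduces to an equality test. */
bool compare_cell(bool t_in, int a_in, int b_in) {
    return t_in && (a_in == b_in);
}

/* Chain the cells: the first cell is seeded with true, and each
   later cell receives the running result from its predecessor. */
bool tuples_equal(const int a[FIELDS], const int b[FIELDS]) {
    bool t = true;
    for (int k = 0; k < FIELDS; k++)
        t = compare_cell(t, a[k], b[k]);
    return t;
}
```

Once any field mismatches, the running result stays false for the rest of the chain, so the last cell's output is the tuple-level match bit.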
Database Intersection
[Figure: two-dimensional systolic comparison array. Fields a_ik of the A tuples and b_jk of the B tuples enter in staggered order, each comparison chain is seeded with True, and the array emits match results such as T12, T21, and T22.]

Database Intersection (cont.)
[Figure: the same comparison array extended with a chain of OR cells seeded with FALSE; the OR cells accumulate the match results into outputs T1, T2, and T3.]

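Functionally, the intersection array computes every match bit T[i][j] and ORs them along each row. The C sketch below models only that function, not the timing; the names `tuple_match`, `intersect_bitvector`, and `FIELDS` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define FIELDS 3  /* illustrative number of fields per tuple (assumption) */

/* T[i][j]: true when every field of the two tuples agrees. */
static bool tuple_match(const int *a, const int *b) {
    bool t = true;                        /* chain seeded with True */
    for (int k = 0; k < FIELDS; k++)
        t = t && (a[k] == b[k]);
    return t;
}

/* in_both[i] = OR over j of T[i][j]; returns the number of A tuples
   that also appear in B. */
int intersect_bitvector(int na, int a[][FIELDS],
                        int nb, int b[][FIELDS], bool in_both[]) {
    int count = 0;
    for (int i = 0; i < na; i++) {
        bool acc = false;                 /* OR chain seeded with FALSE */
        for (int j = 0; j < nb; j++)
            acc = acc || tuple_match(a[i], b[j]);
        in_both[i] = acc;
        count += acc;
    }
    return count;
}
```

The resulting bit vector selects which tuples of A belong to A ∩ B.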
Database Operations (cont.)
• Unique: remove duplicate elements in multirelation A
    • Intersect A with A
• Union: A ∪ B, one copy of all tuples in A and B
    • Concatenate A and B
    • Use Unique to remove duplicates
• Projection: collapse A by removing select fields of every tuple
    • Sample fields in A'
    • Use Unique to remove duplicates

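One way to read "intersect A with A" is to compare each tuple against the tuples before it and drop it if any of them match. The C sketch below follows that reading, which is an assumption about the intended scheme; `same_tuple`, `unique_mask`, and `FIELDS` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>

#define FIELDS 2  /* illustrative number of fields per tuple (assumption) */

static bool same_tuple(const int *a, const int *b) {
    for (int k = 0; k < FIELDS; k++)
        if (a[k] != b[k])
            return false;
    return true;
}

/* keep[i] is true when tuple i has no equal tuple at a lower index.
   Only the pairs with j < i are tested, so the first copy of each
   tuple always survives. Returns the number of tuples kept. */
int unique_mask(int n, int a[][FIELDS], bool keep[]) {
    int kept = 0;
    for (int i = 0; i < n; i++) {
        bool dup = false;
        for (int j = 0; j < i; j++)
            dup = dup || same_tuple(a[i], a[j]);
        keep[i] = !dup;
        kept += keep[i];
    }
    return kept;
}
```

Union and projection can then reuse the same mask after concatenating relations or dropping fields.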
Database Join
• Join: A J_{CA,CB} B, where columns CA in A match columns CB in B, concatenate tuples Ai and Bj
• Match CA of A with CB of B
    • Keep all Ti,j
    • Filter the pairs i, j for which Ti,j = true
    • Construct the join from the matched pairs
• Claim: typically, |A J_{CA,CB} B| << |A| |B|

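The match-filter-construct steps can be sketched in C as below. This is a simplified model that assumes a single integer key column on each side; `Match`, `join_matches`, `FA`, and `FB` are hypothetical names, and the caller is assumed to supply an output buffer large enough for the worst case of na × nb pairs.

```c
#include <assert.h>

#define FA 2  /* fields per tuple of A (assumption) */
#define FB 2  /* fields per tuple of B (assumption) */

typedef struct { int i, j; } Match;   /* indices of a matched pair (Ai, Bj) */

/* Compare column ca of every A tuple with column cb of every B tuple
   and record the (i, j) pairs for which T[i][j] is true.
   Returns the number of matched pairs written to out[]. */
int join_matches(int na, int a[][FA], int ca,
                 int nb, int b[][FB], int cb, Match out[]) {
    int count = 0;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (a[i][ca] == b[j][cb]) {
                out[count].i = i;
                out[count].j = j;
                count++;
            }
    return count;
}
```

The joined relation is built by concatenating Ai and Bj for each recorded pair; in the small test case used to check this sketch, 3 of the 12 candidate pairs match.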
Database Summary
• Input database: O(n) data
• Operations require O(n^2) work
    • O(n) if sorted first
    • O(n log(n)) to sort
• Systolic implementation: works on O(n) processing elements in O(n) time
• Typical database [KunLoh80A]:
    • 1500-bit tuples
    • 10,000 records in a relation
    • ~1 4-LUT per bit-compare
    • ~1600 XC4062 FPGAs
    • ~84 XC4LX200 FPGAs

Systolic Loop Transformations
• Automatically restructure code for
    • Parallelism
    • Locality
• Driven by dependency analysis

Defining Dependencies

Dependence           Access order   Symbol   Kind
Flow dependence      W then R       δf       true
Anti-dependence      R then W       δa       false
Output dependence    W then W       δo       false
Input dependence     R then R       δi

    S1) a = 0;
    S2) b = a;
    S3) c = a + d + e;
    S4) d = b;
    S5) b = 5+e;

Example Dependencies

    S1) a = 0;
    S2) b = a;
    S3) c = a + d + e;
    S4) d = b;
    S5) b = 5+e;

• S1 δf S2 due to a
• S1 δf S3 due to a
• S2 δf S4 due to b
• S3 δa S4 due to d
• S4 δa S5 due to b
• S2 δo S5 due to b
• S3 δi S5 due to e
[Figure: dependence graph over statements 1 to 5]

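The classification can be checked mechanically for straight-line code. The C sketch below is an illustration under stated assumptions: each statement is reduced to one written variable and a set of read variables encoded as bit masks, and a dependence is only counted if the variable is not overwritten in between. The names `Stmt`, `DepCount`, `count_deps`, and the `VAR_*` constants are hypothetical. Note that it also reports the read-read pair S2, S3 on a, which the list above does not show.

```c
#include <assert.h>

enum { VAR_A = 1, VAR_B = 2, VAR_C = 4, VAR_D = 8, VAR_E = 16 };

typedef struct {
    unsigned write;   /* bit mask of the variable written */
    unsigned reads;   /* bit mask of the variables read */
} Stmt;

typedef struct { int flow, anti, output, input; } DepCount;

/* Count dependent statement pairs (Si, Sj), i < j, by kind. */
DepCount count_deps(const Stmt *s, int n) {
    DepCount c = {0, 0, 0, 0};
    for (int i = 0; i < n; i++) {
        unsigned killed = 0;  /* variables overwritten strictly between Si and Sj */
        for (int j = i + 1; j < n; j++) {
            unsigned live_w = s[i].write & ~killed;
            unsigned live_r = s[i].reads & ~killed;
            if (live_w & s[j].reads) c.flow++;    /* write then read  */
            if (live_r & s[j].write) c.anti++;    /* read then write  */
            if (live_w & s[j].write) c.output++;  /* write then write */
            if (live_r & s[j].reads) c.input++;   /* read then read   */
            killed |= s[j].write;
        }
    }
    return c;
}
```

For S1 to S5 this yields three flow, two anti, one output, and two input dependences.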
Data Dependencies in Loops
• A dependence can flow across iterations of the loop
• Dependence information is annotated with iteration information
• If a dependence crosses iterations it is loop carried; otherwise it is loop independent

    for (i=0; i<n; i++) {
        A[i] = B[i];      /* B[i] was written as B[i+1] one iteration earlier: δf, loop carried */
        B[i+1] = A[i];    /* A[i] was written earlier in this iteration: δf, loop independent */
    }

Unroll Loop to Find Dependencies

    for (i=0; i<n; i++) {
        A[i] = B[i];      /* δf, loop carried (via B) */
        B[i+1] = A[i];    /* δf, loop independent (via A) */
    }

Unrolled:

    A[0] = B[0];   B[1] = A[0];    /* i=0 */
    A[1] = B[1];   B[2] = A[1];    /* i=1 */
    A[2] = B[2];   B[3] = A[2];    /* i=2 */
    ...

• The distance/direction of a dependence is also important

Thought Exercise
• Consider the Laplace Transformation:

    for (i=1; i<N; i++) {
        for (j=1; j<N; j++) {
            c = -4*a[i][j] + a[i-1][j] + a[i+1][j];
            c += a[i][j+1] + a[i][j-1];
            b[i][j] = c;
        }
    }

• In teams of two, try to determine the flow dependencies, anti-dependencies, output dependencies, and input dependencies
• Use loop unrolling to find dependencies
• The team that finds the most dependencies gets a prize

Iteration Space
• Every iteration generates a point in an n-dimensional space, where n is the depth of the loop nest

    for (i=0; i<n; i++) {
        ...                /* example point: [4] */
    }

    for (i=0; i<n; i++) {
        for (j=0; j<5; j++) {
            ...            /* example point: [3; 2] */
        }
    }

Distance Vectors

    for (i=0; i<n; i++) {
        A[i] = B[i];
        B[i+1] = A[i];
    }

Unrolled:

    A[0] = B[0];   B[1] = A[0];    /* i=0 */
    A[1] = B[1];   B[2] = A[1];    /* i=1 */
    A[2] = B[2];   B[3] = A[2];    /* i=2 */
    ...

• The distance vector is the difference between the target and source iterations: d = It - Is
• It is exactly the distance of the dependence, i.e., Is + d = It

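The definition d = It - Is can be illustrated by running the loop and recording which iteration last wrote each array element. The C sketch below does this for the A/B loop above; `flow_distances`, the writer arrays, and the -1 sentinel for "no dependence observed" are assumptions introduced for the example, and only array indices are tracked, not values.

```c
#include <assert.h>

#define N 8  /* illustrative trip count (assumption) */

/* Execute the loop symbolically and report the flow-dependence
   distance through A and through B (d = It - Is). */
void flow_distances(int *dist_a, int *dist_b) {
    int writer_a[N];       /* iteration that last wrote A[k], or -1 */
    int writer_b[N + 1];   /* iteration that last wrote B[k], or -1 */

    for (int k = 0; k < N; k++)  writer_a[k] = -1;
    for (int k = 0; k <= N; k++) writer_b[k] = -1;
    *dist_a = -1;
    *dist_b = -1;

    for (int i = 0; i < N; i++) {
        /* A[i] = B[i]; reads B[i], writes A[i] */
        if (writer_b[i] >= 0)
            *dist_b = i - writer_b[i];
        writer_a[i] = i;

        /* B[i+1] = A[i]; reads A[i], writes B[i+1] */
        if (writer_a[i] >= 0)
            *dist_a = i - writer_a[i];
        writer_b[i + 1] = i;
    }
}
```

The dependence through A has distance 0 (loop independent) and the one through B has distance 1 (loop carried).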
Distance Vectors Example

    for (i=0; i<n; i++) {
        for (j=0; j<m; j++) {
            A[i,j] = ...;
            ... = A[i,j];
            B[i,j+1] = ...;
            ... = B[i,j];
            C[i+1,j] = ...;
            ... = C[i,j+1];
        }
    }

• A yields [0; 0]
• B yields [0; 1]
• C yields [1; -1]
[Figure: iteration space with axes i and j; each point lists the A, B, and C elements it writes and reads]

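For accesses of this uniform form, the distance vector follows directly from the index offsets: a write at iteration Is touches element Is + w, a read at iteration It touches It + r, and the two coincide when It - Is = w - r. The C sketch below encodes just that subtraction; `Vec2` and `distance_vector` are illustrative names, and the approach assumes both accesses use the loop indices plus a constant offset.

```c
#include <assert.h>

typedef struct { int i, j; } Vec2;

/* Distance vector d = It - Is for a write to X[i+w.i, j+w.j]
   and a read of X[i+r.i, j+r.j]: d = w - r. */
Vec2 distance_vector(Vec2 write_off, Vec2 read_off) {
    Vec2 d = { write_off.i - read_off.i, write_off.j - read_off.j };
    return d;
}
```

Applied to the loop above: A has offsets (0,0) and (0,0), B has (0,1) and (0,0), and C has (1,0) and (0,1), giving [0; 0], [0; 1], and [1; -1].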
FIR Distance Vectors

    for (i=0; i<n; i++)
        for (j=0; j<m; j++)
            Y[i] = Y[i] + X[i+j]*W[j];

• Y yields δa [0; 0]
• Y yields δf [0; 1]
• X yields δi [1; -1]
• W yields δi [1; 0]
[Figure: iteration space for i = 0..3, j = 0..3; point (i, j) writes Y[i] and reads Y[i], X[i+j], and W[j]]

Re-label / Pipeline Variables
• Remove anti-dependencies and input dependencies by relabeling or pipelining variables

    for (i=0; i<n; i++) {
        for (j=0; j<m; j++) {
            W_i[j] = W_{i-1}[j];
            X_i[i+j] = X_{i-1}[i+j];
            Y_j[i] = Y_{j-1}[i] + X_i[i+j]*W_i[j];
        }
    }

• Creates new flow dependencies
• Removes anti/input dependencies
[Figure: the Y, W, and X dependence vectors]

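The relabelled loop can be written out in C by giving every version its own storage. This is a sketch under assumptions: versions are modelled as an extra array dimension, and values that no earlier iteration produced (version -1 of W and Y, and each X element on its first use) are taken from the inputs at the boundary. The names `fir_relabelled`, `N`, and `M` are hypothetical.

```c
#include <assert.h>

#define N 4  /* number of outputs (assumption) */
#define M 3  /* number of taps (assumption) */

/* FIR in single-assignment form: every array element is written once. */
void fir_relabelled(const int x[N + M - 1], const int w[M], int y[N]) {
    int Wv[N][M];            /* Wv[i][j]   is W_i[j]   */
    int Xv[N][N + M - 1];    /* Xv[i][i+j] is X_i[i+j] */
    int Yv[M][N];            /* Yv[j][i]   is Y_j[i]   */

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            /* W flows along i, distance [1; 0]; enters at i == 0. */
            Wv[i][j] = (i == 0) ? w[j] : Wv[i - 1][j];
            /* X flows from (i-1, j+1), distance [1; -1]; enters at the boundary. */
            Xv[i][i + j] = (i == 0 || j == M - 1) ? x[i + j] : Xv[i - 1][i + j];
            /* Y accumulates along j, distance [0; 1]; starts from 0. */
            Yv[j][i] = ((j == 0) ? 0 : Yv[j - 1][i]) + Xv[i][i + j] * Wv[i][j];
        }
    }
    for (int i = 0; i < N; i++)
        y[i] = Yv[M - 1][i];   /* final version of each output */
}
```

Each right-hand side now reads only values produced by an earlier iteration or supplied at the boundary, so all remaining dependences are flow dependences.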
FIR Dependencies
[Figure: iteration space with axes i and j for the relabelled FIR loop; each point shows the versions of Y, X, and W that it writes and the earlier versions it reads, together with the Y, W, and X dependence vectors]

Transforming to Time and Space
• Using the data dependencies, find T
• T defines a mapping of the iteration space into a time component, π, and a space component, S
    • T = [π; S]
• If π·I1 = π·I2, then I1 and I2 execute at the same time
• π·d is the number of time units available to move a data item along dependence d (π·d > 0)
• Any S can be picked that makes T a bijection
• See [Mol83A] for more details

Calculating T for FIR
• For π = [p1 p2]
• Since π·d > 0 for every dependence vector d, we see that:
    • p2 > 0 (from Y, d = [0; 1])
    • p1 > 0 (from W, d = [1; 0])
    • p1 > p2 (from X, d = [1; -1])
• Smallest solution: π = [2 1]
• S can be [1 0], [0 1], or [1 1]
[Figure: the Y, W, and X dependence vectors]

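The constraints above are small enough to solve by exhaustive search. The C sketch below is an illustration, not the method of [Mol83A]: it assumes "smallest" means minimal |p1| + |p2| within a search bound, and it uses a nonzero determinant of T = [π; S] as the check that distinct iterations map to distinct (time, space) points. The names `is_valid_schedule`, `smallest_pi`, and `det_T` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if pi . d > 0 for every dependence vector d. */
static int is_valid_schedule(int p1, int p2, int d[][2], int nd) {
    for (int k = 0; k < nd; k++)
        if (p1 * d[k][0] + p2 * d[k][1] <= 0)
            return 0;
    return 1;
}

/* Search |p1|, |p2| <= bound for the valid pi with the smallest
   |p1| + |p2|. Returns 1 and fills pi[] if one exists, else 0. */
int smallest_pi(int d[][2], int nd, int bound, int pi[2]) {
    int best = -1;
    for (int p1 = -bound; p1 <= bound; p1++)
        for (int p2 = -bound; p2 <= bound; p2++) {
            int cost = abs(p1) + abs(p2);
            if (is_valid_schedule(p1, p2, d, nd) && (best < 0 || cost < best)) {
                best = cost;
                pi[0] = p1;
                pi[1] = p2;
            }
        }
    return best >= 0;
}

/* Determinant of T = [pi; S]; nonzero means T is one-to-one. */
int det_T(const int pi[2], const int s[2]) {
    return pi[0] * s[1] - pi[1] * s[0];
}
```

With the three FIR dependence vectors the search returns π = [2 1], and each of the listed S choices gives a nonzero determinant (-1, 2, and 1 respectively).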
An Example Transformation
[Figure: iteration points (0,0) through (3,4) plotted on Space and Time axes, with the Y, W, and X dependence vectors]

An Example Transformation (cont.)
[Figure: the same iteration points (0,0) through (3,4) after the transformation, plotted on Space and Time axes, with the Y, W, and X dependence vectors]

Summary
• Non-numeric (database ops) example of systolic computing
    • Multiple use of each input data item
    • Concurrency
    • Regular data and control flow
• Loop transformations
    • Data dependency analysis
    • Restructure code for parallelism, locality