Systolic Architectures: Why is RC fast? Greg Stitt, ECE Department, University of Florida
Why are microprocessors slow?
- Von Neumann architecture
  - "Stored-program" machine: one memory holds the instructions (and data)
Von Neumann architecture
- Summary
  1) Fetch instruction
  2) Decode instruction, fetch data
  3) Execute
  4) Store results
  5) Repeat from 1 until end of program
- Problem: inherently sequential
  - Only executes one instruction at a time
  - Does not take into consideration the parallelism of the application
Von Neumann architecture
- Problem 2: the Von Neumann bottleneck
  - Constantly reading/writing data for every instruction requires high memory bandwidth
  - (Diagram: RAM feeding control and datapath; the bandwidth is not sufficient)
  - Performance is limited by the bandwidth of memory
Improvements
- Increase resources in the datapath to execute multiple instructions in parallel
  - VLIW (very long instruction word): the compiler encodes parallelism into "very long" instructions
  - Superscalar: the architecture determines parallelism at run time (out-of-order instruction execution)
- The Von Neumann bottleneck is still a problem
Why is RC fast?
- RC implements custom circuits for an application
  - Circuits can exploit a massive amount of parallelism
- VLIW/superscalar parallelism: ~5 instructions/cycle in the best case (which rarely occurs)
- RC: potentially thousands of operations, as many as will fit in the device
  - Also supports different types of parallelism
Types of Parallelism: Bit-level
- C code for bit reversal:

    x = (x >> 16)               | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

- Compiled binary (MIPS, excerpt):

    sll $v1, $v0, 0x10
    srl $v0, $v0, 0x10
    or  $v0, $v1, $v0
    srl $v1, $v0, 0x8
    and $v1, $v1, $t5
    sll $v0, $v0, 0x8
    and $v0, $v0, $t4
    or  $v0, $v1, $v0
    srl $v1, $v0, 0x4
    and $v1, $v1, $t3
    sll $v0, $v0, 0x4
    and $v0, $v0, $t2
    ...

- Circuit for bit reversal: just wires, mapping the original x value to the bit-reversed x value
- Processor: requires between 32 and 128 cycles
- FPGA: requires only 1 cycle, a speedup of 32x to 128x for the same clock
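The shift-and-mask sequence above can be checked in software; a minimal sketch, assuming a 32-bit unsigned x (the function name bitrev32 is ours, not from the slides):

```c
#include <stdint.h>
#include <assert.h>

/* 32-bit bit reversal via the shift-and-mask steps listed above.
 * Software needs this whole instruction sequence; the FPGA version
 * is just rewired bits, hence the single-cycle claim. */
uint32_t bitrev32(uint32_t x) {
    x = (x >> 16)                | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}
```

Bit reversal is an involution, so applying it twice returns the original value.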
Types of Parallelism: Arithmetic-level
- C code:

    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i];

- Circuit: 128 multipliers feeding an adder tree
- Processor: 1000's of instructions, several thousand cycles
- FPGA: ~7 cycles, speedup > 100x for the same clock
Types of Parallelism: Pipeline-level
- C code:

    for (j=0; j < n; j++) {
        y = a[j]; x = b[j];
        for (i=0; i < 128; i++)
            y += c[i] * x[i];
        // output y
        y = 0;
    }

- Start a new inner loop every cycle
- After filling up the pipeline, performs 128 multiplies + 127 adds every cycle
Types of Parallelism: Task-level
- e.g., MPEG-2: execute each task in parallel
How to Exploit Parallelism?
- General idea
  - Identify tasks
  - Create a circuit for each task
  - Communication between tasks goes through buffers
- How to create the circuit for each task?
  - Want to exploit bit-level, arithmetic-level, and pipeline-level parallelism
  - Solution: systolic architectures (also called systolic arrays or systolic computing)
Systolic Architectures
- Systolic definition: the rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole
- Analogy with the heart pumping blood
  - We want an architecture that pumps data through efficiently
  - "Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory." [Hung]
Systolic Architecture
- General idea: a fully pipelined circuit, with I/O at the top and bottom levels
- Local connections: each element communicates only with elements at the same level or the level below
- Inputs arrive each cycle; outputs depart each cycle once the pipeline is full
Systolic Architecture: Simple Example
- Create a DFG (data flow graph) for the body of the loop
  - Represents the data dependencies of the code

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

- DFG: b[i] and b[i+1] feed the first adder; its sum and b[i+2] feed the second adder, which produces a[i]
Simple Example
- Add pipeline stages to each level of the DFG (a register after the first adder and a delay register for b[i+2])
Simple Example
- Allocate one resource (adder, ALU, etc.) for each operation in the DFG
- Resulting systolic architecture, cycle 1: inputs b[0], b[1], b[2] enter the array
Simple Example
- Cycle 2: inputs b[1], b[2], b[3] enter; the first stage holds b[0]+b[1] and the delayed b[2]
Simple Example
- Cycle 3: inputs b[2], b[3], b[4] enter; the stages hold b[1]+b[2], the delayed b[3], and b[0]+b[1]+b[2]
Simple Example
- Cycle 4: inputs b[3], b[4], b[5] enter; the first output a[0] appears: it takes 4 cycles to fill the pipeline
Simple Example
- Cycle 5: inputs b[4], b[5], b[6] enter; output a[1] appears
- One output per cycle from this point on, 99 more until completion
- Total cycles: 4 to fill the pipeline + 99 = 103
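The cycle walkthrough above can be reproduced with a small cycle-accurate model; this is our own illustrative sketch (the register names s1, d1, s2, out are ours, not from the slides):

```c
#include <assert.h>

#define N 100  /* iterations of a[i] = b[i] + b[i+1] + b[i+2] */

/* Cycle-accurate sketch of the 2-adder systolic array:
 * s1 latches b[i]+b[i+1], d1 delays b[i+2], s2 latches the full sum,
 * out is the output register. First output in cycle 4, then one per cycle. */
int run_systolic(const int b[N + 2], int a[N]) {
    int s1 = 0, d1 = 0, s2 = 0, out = 0;
    int cycle = 0, produced = 0;
    while (produced < N) {
        cycle++;
        if (cycle >= 4)              /* pipeline is full after 3 stages */
            a[produced++] = out;
        int i = cycle - 1;           /* iteration whose inputs arrive now */
        /* simultaneous register update at the clock edge:
         * each line reads only pre-update values */
        out = s2;
        s2  = s1 + d1;
        s1  = (i < N) ? b[i] + b[i + 1] : 0;
        d1  = (i < N) ? b[i + 2] : 0;
    }
    return cycle;                    /* 4 to fill + 99 more = 103 */
}
```

Running it confirms the slide's count: 103 cycles for 100 outputs.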
uP Performance Comparison
- Assumptions: 10 instructions for the loop body, CPI = 1.5, uP clock 10x faster than the FPGA
- Total SW cycles: 100 * 10 * 1.5 = 1,500 cycles
- RC speedup: (1500/103) * (1/10) = 1.46x
uP Performance Comparison
- What if the uP clock is 15x faster? (e.g., 3 GHz vs. 200 MHz)
  - RC speedup: (1500/103) * (1/15) = 0.97x, so RC is slightly slower
- But!
  - RC requires much less power: several watts vs. ~100 watts
  - SW may be practical for embedded uPs (low power), where the clock may be just 2x faster: (1500/103) * (1/2) = 7.3x faster
  - RC may be cheaper, depending on the area needed; this example would certainly be cheaper
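The speedup arithmetic on this and the previous slide follows a single formula; a throwaway helper (ours, not from the slides) to check the numbers:

```c
#include <assert.h>
#include <math.h>

/* Speedup of RC over software: the cycle-count ratio scaled by the
 * clock ratio (uP clock = clk_ratio times the FPGA clock). */
double rc_speedup(double sw_cycles, double rc_cycles, double clk_ratio) {
    return (sw_cycles / rc_cycles) / clk_ratio;
}
```

With 1,500 SW cycles and 103 RC cycles, the three clock ratios (10x, 15x, 2x) reproduce the slide's 1.46x, 0.97x, and ~7.3x figures.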
Simple Example, Cont.
- Improvement to the systolic array: why not execute multiple iterations at the same time?
  - No data dependencies between iterations, so the loop can be unrolled
- Unrolled DFG: a second pair of adders computes a[i+1] from b[i+1], b[i+2], and b[i+3]
Simple Example, Cont.
- How much to unroll? Limited by memory bandwidth and area
  - Must read all inputs once per cycle
  - Must write all outputs once per cycle
  - Must have sufficient area for all operations in the DFG
Unrolling Example
- Original circuit: the 1st iteration requires 3 inputs (b[0], b[1], b[2])
Unrolling Example
- Each unrolled iteration requires only one additional input, since consecutive iterations overlap: iterations 0 and 1 together need b[0] through b[3]
Unrolling Example
- Each cycle brings in 4 inputs (instead of 6) for the two iterations
Performance After Unrolling
- How much unrolling?
  - Assume b[] elements are 8 bits
  - The first iteration requires 3 elements = 24 bits; each unrolled iteration requires 1 more element = 8 bits, due to the overlapping inputs
  - Assume memory bandwidth = 64 bits/cycle: can perform 6 iterations in parallel (24 + 8 + 8 + 8 + 8 + 8 = 64 bits)
- New performance
  - The unrolled systolic architecture requires 4 cycles to fill the pipeline + 100/6 iterations, or ~21 cycles
  - With unrolling, RC is (1500/21) * (1/15) = 4.8x faster than the 3 GHz microprocessor!
Importance of Memory Bandwidth
- Performance with wider memories: a 128-bit bus allows 14 iterations in parallel
  - 64 extra bits / 8 bits per iteration = 8 more parallel iterations + the 6 original = 14 total
  - Total cycles = 4 to fill the pipeline + 100/14 = ~11
  - Speedup: (1500/11) * (1/15) = 9.1x
  - Doubling the memory width increased the speedup from 4.8x to 9.1x!
- Important point
  - The performance of hardware is often limited by memory bandwidth
  - More bandwidth => more unrolling => more parallelism => BIG SPEEDUP
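The bandwidth-to-unrolling relationship above reduces to one line of arithmetic; a sketch (our helper, assuming the slide's 8-bit elements and overlapping inputs):

```c
#include <assert.h>

/* Iterations that can run in parallel for a given bus width (bits):
 * the first iteration reads 3 overlapping 8-bit inputs (24 bits),
 * and each further iteration adds just one new 8-bit input. */
int parallel_iterations(int bus_bits) {
    return 1 + (bus_bits - 24) / 8;
}
```

A 64-bit bus gives 6 parallel iterations and a 128-bit bus gives 14, matching the two slides.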
Delay Registers
- Common mistake: forgetting to add registers for values not used during a cycle
  - Such values must be "delayed" (passed along through registers) until needed
  - (Figure: the second adder's extra input needs a delay register; without one the circuit is incorrect)
Delay Registers
- Illustration of incorrect delays (no delay register for b[i+2]):
  - Cycle 1: inputs b[0], b[1], b[2] enter
  - Cycle 2: the second adder pairs the registered b[0]+b[1] with the live input b[3] instead of the b[2] it needed, producing ???
  - Cycle 3: the wrong sum b[0]+b[1]+b[3] is latched, and the misalignment repeats every cycle
Another Example: Your Turn

    short b[1004], a[1000];
    for (i=0; i < 1000; i++)
        a[i] = avg( b[i], b[i+1], b[i+2], b[i+3], b[i+4] );

- Steps
  - Build the DFG for the body of the loop
  - Add pipeline stages
  - Map operations to hardware resources (assume divide takes one cycle)
  - Determine the maximum amount of unrolling (memory bandwidth = 128 bits/cycle)
  - Determine performance compared to the uP (assume 15 instructions per iteration, CPI = 1.5, uP clock 15x faster than the RC)
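For the unrolling step, the same bandwidth arithmetic as before applies; a hedged starting point (our calculation, not a full solution): b[] is short (16 bits), the first iteration reads 5 overlapping inputs (80 bits), and each further iteration adds one new 16-bit input.

```c
#include <assert.h>

/* Parallel iterations for the avg() example: 5 overlapping 16-bit
 * inputs for the first iteration, one new 16-bit input per extra
 * iteration (helper name is ours, for illustration only). */
int avg_parallel_iterations(int bus_bits) {
    return 1 + (bus_bits - 5 * 16) / 16;
}
```

With the stated 128-bit bus, this allows 4 iterations in parallel; the pipeline depth and uP comparison are left as the exercise asks.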
Another Example, Cont.
- What if the divider takes 20 cycles, but is fully pipelined? Calculate the effect on performance.
- In systolic architectures, performance is usually dominated by the throughput of the pipeline, not its latency.
Dealing with Dependencies
- op2 is dependent on op1 when an input of op2 is an output of op1
  - Problem: dependencies limit arithmetic parallelism and increase latency, i.e., op2 can't execute before op1
- Serious problem: FPGAs rely on parallelism for performance
  - Little parallelism = bad performance
Dealing with Dependencies
- Partial solution: parallelizing transformations, e.g., tree height reduction
  - a + b + c + d as a serial chain: depth = # of adders
  - As a balanced tree: depth = log2(# of adders)
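The depth comparison can be made concrete: summing n operands through a serial chain takes n - 1 adder levels, while a balanced tree takes ceil(log2 n) levels. A small sketch (our formulas for the picture above):

```c
#include <assert.h>

/* Adder depth for summing n operands: serial chain vs balanced tree. */
int chain_depth(int n) { return n - 1; }

int tree_depth(int n) {            /* ceil(log2(n)) */
    int d = 0;
    while ((1 << d) < n) d++;
    return d;
}
```

For the 128-product sum in the earlier arithmetic-level example, the tree depth of 7 is where the "~7 cycles" figure comes from.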
Dealing with Dependencies
- Simple example with an inter-iteration dependency, a potential problem for systolic arrays: the pipeline can't be kept full

    a[0] = 0;
    for (i=1; i < 8; i++)
        a[i] = b[i] + b[i+1] + a[i-1];

- The 2nd iteration can't execute until the 1st completes: limited arithmetic parallelism, increased latency
Dealing with Dependencies
- But systolic arrays also have pipeline-level parallelism, so latency is less of an issue
- Chain the iterations: the adder producing each a[i] feeds the next iteration's adder, with b[i+1]+b[i+2] waiting in delay registers until a[i] arrives
- Add pipeline stages => systolic array (outputs not shown)
- Only works if the loop is fully unrolled, and requires sufficient memory bandwidth
Dealing with Dependencies: Your Turn

    char b[1006];
    for (i=0; i < 1000; i++) {
        acc = 0;
        for (j=0; j < 6; j++)
            acc += b[i+j];
    }

- Steps
  - Build the DFG for the inner loop (note the dependencies on acc)
  - Fully unroll the inner loop (check whether memory bandwidth allows it; assume bandwidth = 64 bits/cycle)
  - Add pipeline stages
  - Map operations to hardware resources
  - Determine performance compared to the uP (assume 15 cycles per iteration, CPI = 1.5, uP clock 15x faster than the RC)
Dealing with Control
- If statements:

    char b[1006], a[1000];
    for (i=0; i < 1000; i++) {
        if (i % 2 == 0)
            a[i] = b[i] * b[i+1];
        else
            a[i] = b[i+2] + b[i+3];
    }

- Can't wait for the result of the condition: that stalls the pipeline
- Solution: convert control into computation ("if conversion"): compute both b[i] * b[i+1] and b[i+2] + b[i+3] every cycle, and let a MUX controlled by i % 2 select which one becomes a[i]
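In software terms, if conversion replaces the branch with both computations plus a select; a minimal sketch (ours) of the transformed loop:

```c
#include <assert.h>

/* If-converted loop: both sides execute every iteration and a
 * "mux" (the ?: select) picks the result, so nothing stalls. */
void kernel(const char b[], char a[], int n) {
    for (int i = 0; i < n; i++) {
        char t = (char)(b[i] * b[i + 1]);      /* then-branch value */
        char f = (char)(b[i + 2] + b[i + 3]);  /* else-branch value */
        a[i] = (i % 2 == 0) ? t : f;           /* 2:1 mux on i % 2 */
    }
}
```

The cost is that both operators occupy hardware every cycle; the benefit is that the pipeline never waits on the condition.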
Dealing with Control
- If conversion is not always so easy: here the branches write different arrays

    char b[1006], a[1000], a2[1000];
    for (i=0; i < 1000; i++) {
        if (i % 2 == 0)
            a[i] = b[i] * b[i+1];
        else
            a2[i] = b[i+2] + b[i+3];
    }

- Now each output (a[i] and a2[i]) needs its own MUX on the i % 2 condition
Other Challenges
- Outputs can also limit unrolling

    long b[1004], a[1000];
    for (i=0, j=0; i < 1000; i+=4, j++) {
        a[i]   = b[j] + 10;
        a[i+1] = b[j] * 23;
        a[i+2] = b[j] - 12;
        a[i+3] = b[j] * b[j];
    }

- Example: 4 outputs, 1 input, each output 32 bits
  - Total output bandwidth for 1 iteration = 128 bits; the memory bus is 128 bits
  - Can't unroll, even though the inputs only use 32 bits
Other Challenges
- Systolic arrays require streaming data to work well

    for (i=0; i < 4; i++)
        a[i] = b[i] + b[i+1];

- Here the pipelining is wasted because the data stream is so small
- Point: systolic arrays work best on repeated computation
Other Challenges
- Memory bandwidth: the values so far are "peak" values
  - They can only be achieved if all input data is stored sequentially in memory, which is often not the case
- Example: two-dimensional arrays

    long a[100][100], b[100][100];
    for (i=1; i < 100; i++) {
        for (j=1; j < 100; j++) {
            a[i][j] = avg( b[i-1][j], b[i][j-1], b[i+1][j], b[i][j+1] );
        }
    }
Other Challenges
- Memory bandwidth, cont. Example 2: multiple array inputs

    long a[100], b[100], c[100];
    for (i=0; i < 100; i++) {
        a[i] = b[i] + c[i];
    }

- b[] and c[] are stored in different locations, so memory accesses may jump back and forth
- Possible solutions
  - Use multiple memories, or a multiported memory (high cost)
  - Interleave the data from b[] and c[] in memory (programming effort; without compiler support this requires a manual rewrite)
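The interleaving option can be sketched as follows (the layout and function names are our illustration): store b[i] and c[i] in adjacent words so that each iteration's two inputs come from one sequential stream.

```c
#include <assert.h>

/* Pack b[] and c[] into one interleaved array so the datapath can
 * stream its two inputs per iteration from consecutive addresses. */
void interleave(const long b[], const long c[], long bc[], int n) {
    for (int i = 0; i < n; i++) {
        bc[2 * i]     = b[i];
        bc[2 * i + 1] = c[i];
    }
}

/* The original loop, reading from the interleaved layout. */
void add_interleaved(const long bc[], long a[], int n) {
    for (int i = 0; i < n; i++)
        a[i] = bc[2 * i] + bc[2 * i + 1];
}
```

The trade-off named on the slide applies: the packing either happens at data-creation time or costs an extra pass over memory.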
Other Challenges
- Dynamic memory access patterns

    int f( int val ) {
        long a[100], b[100], c[100];
        for (i=0; i < 100; i++) {
            a[i] = b[rand() % 100] + c[i * val];
        }
    }

- The sequence of addresses is not known until run time, and is clearly not sequential
- Possible solution: something creative enough for a Ph.D. thesis
Other Challenges
- Pointer-based data structures
  - Even when scanning through a list, the data could be scattered all over memory: very unlikely to be sequential
  - Can cause aliasing problems, which greatly limit optimization potential; solutions are another Ph.D.
- Pointers are fine if used as an array:

    int f( int val ) {
        long a[100], b[100];
        long *p = b;
        for (i=0; i < 100; i++, p++) {
            a[i] = *p + 1;
        }
    }

  is equivalent to

    int f( int val ) {
        long a[100], b[100];
        for (i=0; i < 100; i++) {
            a[i] = b[i] + 1;
        }
    }
Other Challenges
- Not all code is just one loop: yet another Ph.D.
- Main point to remember: systolic arrays are extremely fast, but only certain types of code work
- What can we do instead of systolic arrays?
Other Options
- Try something completely different, or try a slight variation
- Example: 3 inputs, but the memory can only read 2 per cycle

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

- The original array is not possible: it needs 3 input reads per cycle
Variations
- Example, cont.: break the previous rules and use extra delay registers, reading the 3 inputs over 2 cycles
Variations
- Example, cont.: cycle-by-cycle behavior with the extra delay registers
  - Cycle 1: read b[0] and b[1]; the unused input slot carries junk
  - Cycle 2: read b[2]; junk values advance through the pipeline
  - Cycle 3: b[0]+b[1] is latched, with b[2] delayed alongside it
  - Cycle 4: b[0]+b[1]+b[2] is latched; the alternate slots still carry junk
  - Cycle 5: the first output a[0] appears, after 5 cycles
  - Cycle 6: junk on the output
  - Cycle 7: the second output a[1] appears, 2 cycles later
- A valid output every 2 cycles: approximately 1/2 the performance
Entire Circuit
- RAM -> input address generator -> buffer -> datapath (driven by a controller) -> buffer -> output address generator -> RAM
- Buffers handle differences in speed between the RAM and the datapath