Systolic Architectures: Why is RC fast? Greg Stitt, ECE Department, University of Florida
Why are microprocessors slow?
- Von Neumann architecture
  - "Stored-program" machine: one memory holds the instructions (and data)
Von Neumann architecture
- Summary
  1) Fetch instruction
  2) Decode instruction, fetch data
  3) Execute
  4) Store results
  5) Repeat from 1 until end of program
- Problem: inherently sequential
  - Only executes one instruction at a time
  - Does not take into consideration the parallelism of the application
Von Neumann architecture
- Problem 2: the Von Neumann bottleneck
  - Constantly reading/writing data for every instruction requires high memory bandwidth
  - (Diagram: RAM feeding control and datapath; the bandwidth is not sufficient)
  - Performance is limited by the bandwidth of memory
Improvements
- Increase resources in the datapath to execute multiple instructions in parallel
  - VLIW (very long instruction word): the compiler encodes parallelism into "very long" instructions
  - Superscalar: the architecture determines parallelism at run time (out-of-order instruction execution)
- The Von Neumann bottleneck is still a problem
Why is RC fast?
- RC implements custom circuits for an application
  - Circuits can exploit a massive amount of parallelism
- VLIW/superscalar parallelism: ~5 instructions/cycle in the best case (which rarely occurs)
- RC: potentially thousands of operations, as many as will fit in the device
  - Also supports different types of parallelism
Types of Parallelism: Bit-level
- C code for bit reversal:

    x = (x >> 16)               | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

- Compiled binary (MIPS, excerpt):

    sll $v1, $v0, 0x10
    srl $v0, $v0, 0x10
    or  $v0, $v1, $v0
    srl $v1, $v0, 0x8
    and $v1, $v1, $t5
    sll $v0, $v0, 0x8
    and $v0, $v0, $t4
    or  $v0, $v1, $v0
    srl $v1, $v0, 0x4
    and $v1, $v1, $t3
    sll $v0, $v0, 0x4
    and $v0, $v0, $t2
    ...

- Circuit for bit reversal: just wires, mapping the original x value to the bit-reversed x value
- Processor: requires between 32 and 128 cycles
- FPGA: requires only 1 cycle, a speedup of 32x to 128x for the same clock
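The shift-and-mask sequence above can be checked in software; a minimal sketch, assuming a 32-bit unsigned x (the function name bitrev32 is ours, not from the slides):

```c
#include <stdint.h>
#include <assert.h>

/* 32-bit bit reversal via the shift-and-mask steps listed above.
 * Software needs this whole instruction sequence; the FPGA version
 * is just rewired bits, hence the single-cycle claim. */
uint32_t bitrev32(uint32_t x) {
    x = (x >> 16)                | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}
```

Bit reversal is an involution, so applying it twice returns the original value.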
Types of Parallelism: Arithmetic-level
- C code:

    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i];

- Circuit: 128 multipliers feeding an adder tree
- Processor: 1000's of instructions, several thousand cycles
- FPGA: ~7 cycles, speedup > 100x for the same clock
Types of Parallelism: Pipeline-level
- C code:

    for (j=0; j < n; j++) {
        y = a[j]; x = b[j];
        for (i=0; i < 128; i++)
            y += c[i] * x[i];
        // output y
        y = 0;
    }

- Start a new inner loop every cycle
- After filling up the pipeline, performs 128 multiplies + 127 adds every cycle
Types of Parallelism: Task-level
- e.g., MPEG-2: execute each task in parallel
How to Exploit Parallelism?
- General idea
  - Identify tasks
  - Create a circuit for each task
  - Communication between tasks goes through buffers
- How to create the circuit for each task?
  - Want to exploit bit-level, arithmetic-level, and pipeline-level parallelism
  - Solution: systolic architectures (also called systolic arrays or systolic computing)
Systolic Architectures
- Systolic definition: the rhythmic contraction of the heart, especially of the ventricles, by which blood is driven through the aorta and pulmonary artery after each dilation or diastole
- Analogy with the heart pumping blood
  - We want an architecture that pumps data through efficiently
  - "Data flows from memory in a rhythmic fashion, passing through many processing elements before it returns to memory." [Hung]
Systolic Architecture
- General idea: a fully pipelined circuit, with I/O at the top and bottom levels
- Local connections: each element communicates only with elements at the same level or the level below
- Inputs arrive each cycle; outputs depart each cycle once the pipeline is full
Systolic Architecture: Simple Example
- Create a DFG (data flow graph) for the body of the loop
  - Represents the data dependencies of the code

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

- DFG: b[i] and b[i+1] feed the first adder; its sum and b[i+2] feed the second adder, which produces a[i]
Simple Example
- Add pipeline stages to each level of the DFG (a register after the first adder and a delay register for b[i+2])
Simple Example
- Allocate one resource (adder, ALU, etc.) for each operation in the DFG
- Resulting systolic architecture, cycle 1: inputs b[0], b[1], b[2] enter the array
Simple Example
- Cycle 2: inputs b[1], b[2], b[3] enter; the first stage holds b[0]+b[1] and the delayed b[2]
Simple Example
- Cycle 3: inputs b[2], b[3], b[4] enter; the stages hold b[1]+b[2], the delayed b[3], and b[0]+b[1]+b[2]
Simple Example
- Cycle 4: inputs b[3], b[4], b[5] enter; the first output a[0] appears: it takes 4 cycles to fill the pipeline
Simple Example
- Cycle 5: inputs b[4], b[5], b[6] enter; output a[1] appears
- One output per cycle from this point on, 99 more until completion
- Total cycles: 4 to fill the pipeline + 99 = 103
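The cycle walkthrough above can be reproduced with a small cycle-accurate model; this is our own illustrative sketch (the register names s1, d1, s2, out are ours, not from the slides):

```c
#include <assert.h>

#define N 100  /* iterations of a[i] = b[i] + b[i+1] + b[i+2] */

/* Cycle-accurate sketch of the 2-adder systolic array:
 * s1 latches b[i]+b[i+1], d1 delays b[i+2], s2 latches the full sum,
 * out is the output register. First output in cycle 4, then one per cycle. */
int run_systolic(const int b[N + 2], int a[N]) {
    int s1 = 0, d1 = 0, s2 = 0, out = 0;
    int cycle = 0, produced = 0;
    while (produced < N) {
        cycle++;
        if (cycle >= 4)              /* pipeline is full after 3 stages */
            a[produced++] = out;
        int i = cycle - 1;           /* iteration whose inputs arrive now */
        /* simultaneous register update at the clock edge:
         * each line reads only pre-update values */
        out = s2;
        s2  = s1 + d1;
        s1  = (i < N) ? b[i] + b[i + 1] : 0;
        d1  = (i < N) ? b[i + 2] : 0;
    }
    return cycle;                    /* 4 to fill + 99 more = 103 */
}
```

Running it confirms the slide's count: 103 cycles for 100 outputs.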
uP Performance Comparison
- Assumptions: 10 instructions for the loop body, CPI = 1.5, uP clock 10x faster than the FPGA
- Total SW cycles: 100 * 10 * 1.5 = 1,500 cycles
- RC speedup: (1500/103) * (1/10) = 1.46x
uP Performance Comparison
- What if the uP clock is 15x faster? (e.g., 3 GHz vs. 200 MHz)
  - RC speedup: (1500/103) * (1/15) = 0.97x, so RC is slightly slower
- But!
  - RC requires much less power: several watts vs. ~100 watts
  - SW may be practical for embedded uPs (low power), where the clock may be just 2x faster: (1500/103) * (1/2) = 7.3x faster
  - RC may be cheaper, depending on the area needed; this example would certainly be cheaper
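The speedup arithmetic on this and the previous slide follows a single formula; a throwaway helper (ours, not from the slides) to check the numbers:

```c
#include <assert.h>
#include <math.h>

/* Speedup of RC over software: the cycle-count ratio scaled by the
 * clock ratio (uP clock = clk_ratio times the FPGA clock). */
double rc_speedup(double sw_cycles, double rc_cycles, double clk_ratio) {
    return (sw_cycles / rc_cycles) / clk_ratio;
}
```

With 1,500 SW cycles and 103 RC cycles, the three clock ratios (10x, 15x, 2x) reproduce the slide's 1.46x, 0.97x, and ~7.3x figures.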
Simple Example, Cont.
- Improvement to the systolic array: why not execute multiple iterations at the same time?
  - No data dependencies between iterations, so the loop can be unrolled
- Unrolled DFG: a second pair of adders computes a[i+1] from b[i+1], b[i+2], and b[i+3]
Simple Example, Cont.
- How much to unroll? Limited by memory bandwidth and area
  - Must read all inputs once per cycle
  - Must write all outputs once per cycle
  - Must have sufficient area for all operations in the DFG
Unrolling Example
- Original circuit: the 1st iteration requires 3 inputs (b[0], b[1], b[2])
Unrolling Example
- Each unrolled iteration requires only one additional input, since consecutive iterations overlap: iterations 0 and 1 together need b[0] through b[3]
Unrolling Example
- Each cycle brings in 4 inputs (instead of 6) for the two iterations
Performance After Unrolling
- How much unrolling?
  - Assume b[] elements are 8 bits
  - The first iteration requires 3 elements = 24 bits; each unrolled iteration requires 1 more element = 8 bits, due to the overlapping inputs
  - Assume memory bandwidth = 64 bits/cycle: can perform 6 iterations in parallel (24 + 8 + 8 + 8 + 8 + 8 = 64 bits)
- New performance
  - The unrolled systolic architecture requires 4 cycles to fill the pipeline + 100/6 iterations, or ~21 cycles
  - With unrolling, RC is (1500/21) * (1/15) = 4.8x faster than the 3 GHz microprocessor!
Importance of Memory Bandwidth
- Performance with wider memories: a 128-bit bus allows 14 iterations in parallel
  - 64 extra bits / 8 bits per iteration = 8 more parallel iterations + the 6 original = 14 total
  - Total cycles = 4 to fill the pipeline + 100/14 = ~11
  - Speedup: (1500/11) * (1/15) = 9.1x
  - Doubling the memory width increased the speedup from 4.8x to 9.1x!
- Important point
  - The performance of hardware is often limited by memory bandwidth
  - More bandwidth => more unrolling => more parallelism => BIG SPEEDUP
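The bandwidth-to-unrolling relationship above reduces to one line of arithmetic; a sketch (our helper, assuming the slide's 8-bit elements and overlapping inputs):

```c
#include <assert.h>

/* Iterations that can run in parallel for a given bus width (bits):
 * the first iteration reads 3 overlapping 8-bit inputs (24 bits),
 * and each further iteration adds just one new 8-bit input. */
int parallel_iterations(int bus_bits) {
    return 1 + (bus_bits - 24) / 8;
}
```

A 64-bit bus gives 6 parallel iterations and a 128-bit bus gives 14, matching the two slides.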
Delay Registers
- Common mistake: forgetting to add registers for values not used during a cycle
  - Such values must be "delayed" (passed along through registers) until needed
  - (Figure: the second adder's extra input needs a delay register; without one the circuit is incorrect)
Delay Registers
- Illustration of incorrect delays (no delay register for b[i+2]):
  - Cycle 1: inputs b[0], b[1], b[2] enter
  - Cycle 2: the second adder pairs the registered b[0]+b[1] with the live input b[3] instead of the b[2] it needed, producing ???
  - Cycle 3: the wrong sum b[0]+b[1]+b[3] is latched, and the misalignment repeats every cycle
Another Example: Your Turn

    short b[1004], a[1000];
    for (i=0; i < 1000; i++)
        a[i] = avg( b[i], b[i+1], b[i+2], b[i+3], b[i+4] );

- Steps
  - Build the DFG for the body of the loop
  - Add pipeline stages
  - Map operations to hardware resources (assume divide takes one cycle)
  - Determine the maximum amount of unrolling (memory bandwidth = 128 bits/cycle)
  - Determine performance compared to the uP (assume 15 instructions per iteration, CPI = 1.5, uP clock 15x faster than the RC)
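For the unrolling step, the same bandwidth arithmetic as before applies; a hedged starting point (our calculation, not a full solution): b[] is short (16 bits), the first iteration reads 5 overlapping inputs (80 bits), and each further iteration adds one new 16-bit input.

```c
#include <assert.h>

/* Parallel iterations for the avg() example: 5 overlapping 16-bit
 * inputs for the first iteration, one new 16-bit input per extra
 * iteration (helper name is ours, for illustration only). */
int avg_parallel_iterations(int bus_bits) {
    return 1 + (bus_bits - 5 * 16) / 16;
}
```

With the stated 128-bit bus, this allows 4 iterations in parallel; the pipeline depth and uP comparison are left as the exercise asks.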
Another Example, Cont.
- What if the divider takes 20 cycles, but is fully pipelined? Calculate the effect on performance.
- In systolic architectures, performance is usually dominated by the throughput of the pipeline, not its latency.
Dealing with Dependencies
- op2 is dependent on op1 when an input of op2 is an output of op1
  - Problem: dependencies limit arithmetic parallelism and increase latency, i.e., op2 can't execute before op1
- Serious problem: FPGAs rely on parallelism for performance
  - Little parallelism = bad performance
Dealing with Dependencies
- Partial solution: parallelizing transformations, e.g., tree height reduction
  - a + b + c + d as a serial chain: depth = # of adders
  - As a balanced tree: depth = log2(# of adders)
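The depth comparison can be made concrete: summing n operands through a serial chain takes n - 1 adder levels, while a balanced tree takes ceil(log2 n) levels. A small sketch (our formulas for the picture above):

```c
#include <assert.h>

/* Adder depth for summing n operands: serial chain vs balanced tree. */
int chain_depth(int n) { return n - 1; }

int tree_depth(int n) {            /* ceil(log2(n)) */
    int d = 0;
    while ((1 << d) < n) d++;
    return d;
}
```

For the 128-product sum in the earlier arithmetic-level example, the tree depth of 7 is where the "~7 cycles" figure comes from.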
Dealing with Dependencies
- Simple example with an inter-iteration dependency, a potential problem for systolic arrays: the pipeline can't be kept full

    a[0] = 0;
    for (i=1; i < 8; i++)
        a[i] = b[i] + b[i+1] + a[i-1];

- The 2nd iteration can't execute until the 1st completes: limited arithmetic parallelism, increased latency
Dealing with Dependencies
- But systolic arrays also have pipeline-level parallelism, so latency is less of an issue
- Chain the iterations: the adder producing each a[i] feeds the next iteration's adder, with b[i+1]+b[i+2] waiting in delay registers until a[i] arrives
- Add pipeline stages => systolic array (outputs not shown)
- Only works if the loop is fully unrolled, and requires sufficient memory bandwidth
Dealing with Dependencies: Your Turn

    char b[1006];
    for (i=0; i < 1000; i++) {
        acc = 0;
        for (j=0; j < 6; j++)
            acc += b[i+j];
    }

- Steps
  - Build the DFG for the inner loop (note the dependencies on acc)
  - Fully unroll the inner loop (check whether memory bandwidth allows it; assume bandwidth = 64 bits/cycle)
  - Add pipeline stages
  - Map operations to hardware resources
  - Determine performance compared to the uP (assume 15 cycles per iteration, CPI = 1.5, uP clock 15x faster than the RC)
Dealing with Control
- If statements:

    char b[1006], a[1000];
    for (i=0; i < 1000; i++) {
        if (i % 2 == 0)
            a[i] = b[i] * b[i+1];
        else
            a[i] = b[i+2] + b[i+3];
    }

- Can't wait for the result of the condition: that stalls the pipeline
- Solution: convert control into computation ("if conversion"): compute both b[i] * b[i+1] and b[i+2] + b[i+3] every cycle, and let a MUX controlled by i % 2 select which one becomes a[i]
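In software terms, if conversion replaces the branch with both computations plus a select; a minimal sketch (ours) of the transformed loop:

```c
#include <assert.h>

/* If-converted loop: both sides execute every iteration and a
 * "mux" (the ?: select) picks the result, so nothing stalls. */
void kernel(const char b[], char a[], int n) {
    for (int i = 0; i < n; i++) {
        char t = (char)(b[i] * b[i + 1]);      /* then-branch value */
        char f = (char)(b[i + 2] + b[i + 3]);  /* else-branch value */
        a[i] = (i % 2 == 0) ? t : f;           /* 2:1 mux on i % 2 */
    }
}
```

The cost is that both operators occupy hardware every cycle; the benefit is that the pipeline never waits on the condition.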
Dealing with Control
- If conversion is not always so easy: here the branches write different arrays

    char b[1006], a[1000], a2[1000];
    for (i=0; i < 1000; i++) {
        if (i % 2 == 0)
            a[i] = b[i] * b[i+1];
        else
            a2[i] = b[i+2] + b[i+3];
    }

- Now each output (a[i] and a2[i]) needs its own MUX on the i % 2 condition
Other Challenges
- Outputs can also limit unrolling

    long b[1004], a[1000];
    for (i=0, j=0; i < 1000; i+=4, j++) {
        a[i]   = b[j] + 10;
        a[i+1] = b[j] * 23;
        a[i+2] = b[j] - 12;
        a[i+3] = b[j] * b[j];
    }

- Example: 4 outputs, 1 input, each output 32 bits
  - Total output bandwidth for 1 iteration = 128 bits; the memory bus is 128 bits
  - Can't unroll, even though the inputs only use 32 bits
Other Challenges
- Systolic arrays require streaming data to work well

    for (i=0; i < 4; i++)
        a[i] = b[i] + b[i+1];

- Here the pipelining is wasted because the data stream is so small
- Point: systolic arrays work best on repeated computation
Other Challenges
- Memory bandwidth: the values so far are "peak" values
  - They can only be achieved if all input data is stored sequentially in memory, which is often not the case
- Example: two-dimensional arrays

    long a[100][100], b[100][100];
    for (i=1; i < 100; i++) {
        for (j=1; j < 100; j++) {
            a[i][j] = avg( b[i-1][j], b[i][j-1], b[i+1][j], b[i][j+1] );
        }
    }
Other Challenges
- Memory bandwidth, cont. Example 2: multiple array inputs

    long a[100], b[100], c[100];
    for (i=0; i < 100; i++) {
        a[i] = b[i] + c[i];
    }

- b[] and c[] are stored in different locations, so memory accesses may jump back and forth
- Possible solutions
  - Use multiple memories, or a multiported memory (high cost)
  - Interleave the data from b[] and c[] in memory (programming effort; without compiler support this requires a manual rewrite)
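The interleaving option can be sketched as follows (the layout and function names are our illustration): store b[i] and c[i] in adjacent words so that each iteration's two inputs come from one sequential stream.

```c
#include <assert.h>

/* Pack b[] and c[] into one interleaved array so the datapath can
 * stream its two inputs per iteration from consecutive addresses. */
void interleave(const long b[], const long c[], long bc[], int n) {
    for (int i = 0; i < n; i++) {
        bc[2 * i]     = b[i];
        bc[2 * i + 1] = c[i];
    }
}

/* The original loop, reading from the interleaved layout. */
void add_interleaved(const long bc[], long a[], int n) {
    for (int i = 0; i < n; i++)
        a[i] = bc[2 * i] + bc[2 * i + 1];
}
```

The trade-off named on the slide applies: the packing either happens at data-creation time or costs an extra pass over memory.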
Other Challenges
- Dynamic memory access patterns

    int f( int val ) {
        long a[100], b[100], c[100];
        for (i=0; i < 100; i++) {
            a[i] = b[rand() % 100] + c[i * val];
        }
    }

- The sequence of addresses is not known until run time, and is clearly not sequential
- Possible solution: something creative enough for a Ph.D. thesis
Other Challenges
- Pointer-based data structures
  - Even when scanning through a list, the data could be scattered all over memory: very unlikely to be sequential
  - Can cause aliasing problems, which greatly limit optimization potential; solutions are another Ph.D.
- Pointers are fine if used as an array:

    int f( int val ) {
        long a[100], b[100];
        long *p = b;
        for (i=0; i < 100; i++, p++) {
            a[i] = *p + 1;
        }
    }

  is equivalent to

    int f( int val ) {
        long a[100], b[100];
        for (i=0; i < 100; i++) {
            a[i] = b[i] + 1;
        }
    }
Other Challenges
- Not all code is just one loop: yet another Ph.D.
- Main point to remember: systolic arrays are extremely fast, but only certain types of code work
- What can we do instead of systolic arrays?
Other Options
- Try something completely different, or try a slight variation
- Example: 3 inputs, but the memory can only read 2 per cycle

    for (i=0; i < 100; i++)
        a[i] = b[i] + b[i+1] + b[i+2];

- The original array is not possible: it needs 3 input reads per cycle
Variations
- Example, cont.: break the previous rules and use extra delay registers, reading the 3 inputs over 2 cycles
Variations
- Example, cont.: cycle-by-cycle behavior with the extra delay registers
  - Cycle 1: read b[0] and b[1]; the unused input slot carries junk
  - Cycle 2: read b[2]; junk values advance through the pipeline
  - Cycle 3: b[0]+b[1] is latched, with b[2] delayed alongside it
  - Cycle 4: b[0]+b[1]+b[2] is latched; the alternate slots still carry junk
  - Cycle 5: the first output a[0] appears, after 5 cycles
  - Cycle 6: junk on the output
  - Cycle 7: the second output a[1] appears, 2 cycles later
- A valid output every 2 cycles: approximately 1/2 the performance
Entire Circuit
- RAM -> input address generator -> buffer -> datapath (driven by a controller) -> buffer -> output address generator -> RAM
- Buffers handle differences in speed between the RAM and the datapath