VLSI Signal Processing, Lecture 2: Unfolding Transformation (ADSP)

ADSP Lecture 2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)

Multiple-Data Processing
• Create a program that executes more than one iteration at a time, e.g. by unrolling J loop iterations
• Example: loop unrolling + software pipelining
[Tables: clock cycle vs. operation. In the first table, operations 1, 2, 3 repeat over clock cycles 1 to 8]

Basic Ideas
• Parallel processing
• Pipelined processing
[Figure: processors P1 to P4 handling tasks a1 to a4, b1 to b4, c1 to c4, d1 to d4 over time, in parallel and in pipelined form]

Data Dependence
• Parallel processing requires NO data dependence between processors
• Pipelined processing involves inter-processor communication
[Figure: processors P1 to P4 against time for both cases]

Parallel Processing
• In a J-unfolded system, each delay is J-slow. That is, if the input to a delay element is x(kJ+m), then the output is x((k-1)J+m) = x(kJ+m-J)

Parallel Processing
• Block processing
  – The number of inputs processed in a clock cycle is referred to as the block size
  – At the k-th clock cycle, three inputs x(3k), x(3k+1), and x(3k+2) are processed simultaneously to generate y(3k), y(3k+1), and y(3k+2)
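As a rough illustration of block processing with block size 3, the sketch below groups a serial stream into blocks [x(3k), x(3k+1), x(3k+2)] and flattens the blocks back into a serial stream. The function names `serial_to_parallel` and `parallel_to_serial` are hypothetical, and the sketch assumes the stream length is a multiple of the block size.

```python
def serial_to_parallel(samples, block_size):
    """Group a serial stream so that block k holds x(k*L), ..., x(k*L + L - 1)."""
    if len(samples) % block_size != 0:
        raise ValueError("stream length must be a multiple of the block size")
    return [samples[k:k + block_size]
            for k in range(0, len(samples), block_size)]


def parallel_to_serial(blocks):
    """Flatten the blocks back into the original serial order."""
    return [sample for block in blocks for sample in block]


x = list(range(9))                  # stands in for x(0), ..., x(8)
blocks = serial_to_parallel(x, 3)   # one block per clock cycle
restored = parallel_to_serial(blocks)
```

Each entry of `blocks` corresponds to one clock cycle of the block-processing system, so the clock can run three times slower than the sample rate.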

I/O Conversion
• Serial-to-parallel converter
• Parallel-to-serial converter

General Approach for Block Processing

Mathematical Formulation
• e.g. y(n) = a y(n-9) + x(n)
• 2-parallel:
    y(2k) = a y(2k-9) + x(2k)
    y(2k+1) = a y(2k-8) + x(2k+1)
• In a 2-parallel SDFG, each active clock edge processes two samples:
    y(2k) = a y(2(k-5)+1) + x(2k)
    y(2k+1) = a y(2(k-4)+0) + x(2k+1)
• A dependency spanning fewer sample delays than the level of parallelism can be implemented with internal routing
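The 2-parallel equations can be checked numerically. The sketch below is a minimal check, assuming zero initial conditions (y(n) = 0 for n < 0) and an even number of samples; the function names are hypothetical.

```python
def serial_filter(x, a):
    """y(n) = a*y(n-9) + x(n), one sample per iteration."""
    y = []
    for n, xn in enumerate(x):
        prev = y[n - 9] if n >= 9 else 0
        y.append(a * prev + xn)
    return y


def two_parallel_filter(x, a):
    """Compute y(2k) and y(2k+1) together in iteration k."""
    if len(x) % 2 != 0:
        raise ValueError("expected an even number of samples")
    even, odd = [], []  # even[k] = y(2k), odd[k] = y(2k+1)
    for k in range(len(x) // 2):
        # y(2k) = a*y(2(k-5)+1) + x(2k): odd output from 5 iterations ago
        y_even = a * (odd[k - 5] if k >= 5 else 0) + x[2 * k]
        # y(2k+1) = a*y(2(k-4)+0) + x(2k+1): even output from 4 iterations ago
        y_odd = a * (even[k - 4] if k >= 4 else 0) + x[2 * k + 1]
        even.append(y_even)
        odd.append(y_odd)
    # Interleave the two output streams back into serial order.
    return [v for pair in zip(even, odd) for v in pair]


x = list(range(1, 41))
y_serial = serial_filter(x, 3)
y_block = two_parallel_filter(x, 3)
```

The 9-sample dependency shows up as one 5-iteration and one 4-iteration dependency, which together still account for 9 delays.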

Unfolding the DFG
• Original DFG: T = Ts; J-unfolded DFG: T = J Ts
• Not trivial, even for a simple graph

Block Processing for FIR Filter
• One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense)
• Block vector: [x(3k) x(3k+1) x(3k+2)]
• Clock cycle: can be 3 times longer
• Original (FIR filter):
• Rewrite 3 equations at a time:
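The filter equations for this slide are not available in the text, so the following sketch assumes a 3-tap FIR filter y(n) = a x(n) + b x(n-1) + c x(n-2) purely for illustration. The coefficients a, b, c and the function names are assumptions, and x(n) = 0 is assumed for n < 0. It writes out three output equations per iteration and compares them with the sample-by-sample form.

```python
def fir_serial(x, a, b, c):
    """Assumed 3-tap filter: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    def at(n):
        return x[n] if n >= 0 else 0
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]


def fir_block3(x, a, b, c):
    """Block size 3: produce y(3k), y(3k+1), y(3k+2) in iteration k."""
    if len(x) % 3 != 0:
        raise ValueError("expected a multiple of 3 samples")
    y = []
    prev1 = prev2 = 0  # x(3k-1) and x(3k-2), held over from the previous block
    for k in range(len(x) // 3):
        x0, x1, x2 = x[3 * k:3 * k + 3]
        y.append(a * x0 + b * prev1 + c * prev2)  # y(3k)
        y.append(a * x1 + b * x0 + c * prev1)     # y(3k+1)
        y.append(a * x2 + b * x1 + c * x0)        # y(3k+2)
        prev1, prev2 = x2, x1
    return y


x = [1, 4, 2, 7, 3, 5, 9, 8, 6]
y_serial = fir_serial(x, 2, 3, 5)
y_block = fir_block3(x, 2, 3, 5)
```

Only two input samples need to be stored between iterations, and each stored value is delayed by one (3 times longer) clock cycle.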

Block Processing

Block Processing for IIR Digital Filter
• Original formulation:
• Rewrite (n: sample period):
• Vector formulation (k: processor period, Tsample ≠ Tclk):

Block IIR Filter
[Figure: x(n) enters a serial-to-parallel (S/P) converter that produces x(2k) and x(2k+1); two adders with delay elements D produce y(2k) and y(2k+1) using y(2(k-1)) and y(2(k-1)+1); a parallel-to-serial (P/S) converter outputs y(n)]
• The clock period is not equal to the sampling period

Timing Comparison
• Single MAC
• Pipelining (separate Mul and Add units)
• Block processing
[Timing diagrams: inputs x(1), x(2), ... and outputs y(1), y(2), ... against clock cycles for each scheme]

Definitions
• Unfolding is the process of unrolling a loop so that several iterations of the original loop are executed in one iteration of the new one.
• Also known as (a.k.a.)
  – Loop unrolling (in compilers for parallel programs)
  – Block processing
• Applications
  – Reducing the sampling period to achieve the iteration bound T∞ (desired throughput rate)
  – Parallel (block) processing to execute several iterations concurrently
  – Digit-serial or bit-serial processing

Unfolding the DFG
• y(n) = a y(n-9) + x(n)
• Rewrite the algorithm formulation:
    y(2k) = a y(2k-9) + x(2k)
    y(2k+1) = a y(2k-8) + x(2k+1)
  that is,
    y(2k) = a y(2(k-5)+1) + x(2k)
    y(2k+1) = a y(2(k-4)) + x(2k+1)
• After J-fold unfolding, the clock period is T = J Ts, where Ts is the data sampling period.

Timing Diagram
[Figure: output sequence y(0), ..., y(13) for T = Ts, and the even and odd output sequences for T = 2Ts, where the 9-sample dependency appears as 4T and 5T]
• The above timing diagram is obtained assuming that the sampling period Ts remains unchanged. Thus, the clock period T is increased J-fold.
• Since 9/2 is not an integer, the outputs (y(0), y(1)) will be needed by two different future iterations, 4T and 5T later.

Another DFG Unfolding Example (J = 2)
  i   w   (i+w)%J   floor((i+w)/J)
  0   0      0            0
  0   2      0            1
  0   3      1            1
  1   0      1            0
  1   2      1            1
  1   3      0            2
[Figure: original DFG with nodes S, Q, R, T and edge delays 2D and 3D, T∞ = 3; unfolded copies S0, Q0, R0, T0, S1, Q1, R1, T1]
• Step 1. Create J copies of each node.

Another DFG Unfolding Example (J = 2)
• Step 2. Add all edges that carry no delay.

Another DFG Unfolding Example (J = 2)
• Step 3. Use the table of (i+w)%J and floor((i+w)/J) to work out the edges with delays.
[Figure: unfolded DFG in which the delayed edges carry D, D, D and 2D, T∞ = 6]

Unfolding Transformation
• For each node U in the original DFG, draw J nodes U_0, U_1, ..., U_{J-1}
• For each edge U → V with w delays in the original DFG, draw the J edges U_i → V_{(i+w)%J} with floor[(i+w)/J] delays, for i = 0, 1, ..., J-1
• Example [figure]
• For w < J, unfolding an edge with w delays in the original DFG produces J-w edges with no delays and w edges with 1 delay in the J-unfolded DFG
• Unfolding preserves the precedence constraints of a DSP algorithm
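The two rules above translate almost directly into code. The sketch below is one possible rendering, assuming a DFG given as a list of (source, destination, delays) tuples. The `unfold` function and the two-node graph (an adder A and a multiplier M, modeling y(n) = a y(n-9) + x(n)) are hypothetical illustrations.

```python
def unfold(edges, J):
    """J-unfold a DFG given as (u, v, w) edges.

    Node copy U_i is represented as the tuple (u, i). Each original edge
    yields J edges U_i -> V_{(i+w)%J} carrying floor((i+w)/J) delays.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded


# Assumed model of y(n) = a*y(n-9) + x(n): the adder output feeds the
# multiplier through 9 delays, and the multiplier feeds the adder directly.
edges = [("A", "M", 9), ("M", "A", 0)]
result = unfold(edges, 2)

# An edge with w = 2 delays unfolded by J = 5 (w < J).
short_edge = unfold([("U", "V", 2)], 5)
```

For the 9-delay edge, the two copies carry 4 and 5 delays, matching the 2-parallel equations y(2k+1) = a y(2(k-4)) + x(2k+1) and y(2k) = a y(2(k-5)+1) + x(2k).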

Precedence Preservation

Delay Preservation
• Unfolding preserves the number of delays in a DFG
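One way to see why the delay count is preserved is to sum the delays on the J edges that replace a single edge with w delays. The derivation below is a sketch of that argument; the decomposition w = qJ + r is notation introduced here, not taken from the slide.

```latex
% Write w = qJ + r with integers q \ge 0 and 0 \le r < J. Then
\[
  \left\lfloor \frac{i+w}{J} \right\rfloor
  = q + \left\lfloor \frac{i+r}{J} \right\rfloor
  = \begin{cases}
      q,     & 0 \le i < J-r, \\
      q + 1, & J-r \le i \le J-1,
    \end{cases}
\]
% so the J unfolded edges together carry
\[
  \sum_{i=0}^{J-1} \left\lfloor \frac{i+w}{J} \right\rfloor
  = (J-r)\,q + r\,(q+1) = qJ + r = w .
\]
```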

Example
• Unfold the following DFG using unfolding factors 2 and 5 [figure]

Properties of Unfolding
• Unfolding preserves the number of registers (delays) in a DFG
• J-unfolding a loop with w delays in a DFG leads to
  – gcd(w, J) loops in the unfolded DFG, each containing w/gcd(w, J) delays and J/gcd(w, J) copies of each node that appears in the original loop
• Unfolding a DFG with iteration bound T∞ results in a J-unfolded DFG with iteration bound J T∞
• A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG
• Any clock period that can be achieved by retiming a J-unfolded DFG can also be achieved by retiming the original DFG and then J-unfolding it
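The loop property can be explored with a small experiment. The sketch below assumes the simplest possible loop, a single node A with a self-loop carrying w delays, and follows the unfolded edges A_i → A_{(i+w)%J} until each loop closes. The function name `unfolded_loops` is hypothetical.

```python
from math import gcd


def unfolded_loops(w, J):
    """Loops obtained by J-unfolding a self-loop A -> A that carries w delays.

    Returns a list of (copies, delays): the node-copy indices visited by
    each loop and the total number of delays on that loop.
    """
    seen, loops = set(), []
    for start in range(J):
        if start in seen:
            continue
        copies, delays, i = [], 0, start
        while i not in seen:
            seen.add(i)
            copies.append(i)
            delays += (i + w) // J  # delays on edge A_i -> A_{(i+w)%J}
            i = (i + w) % J
        loops.append((copies, delays))
    return loops


loops = unfolded_loops(6, 4)    # gcd(6, 4) = 2
single = unfolded_loops(9, 2)   # gcd(9, 2) = 1
```

With w = 6 and J = 4 the result is two loops, each visiting 4/2 = 2 copies of A and carrying 6/2 = 3 delays.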

When a Loop is Unfolded
• Consider a loop ℓ with w delays in a DFG
• Traversing the loop A ~> A p times is also a loop, with pw delays
• In the J-unfolded DFG, consider the path A_i ~> A_{(i+pw)%J}. It is a loop if i = (i+pw)%J, which implies that J | pw
• The smallest such p is J/gcd(J, w). That is, in the J-unfolded DFG, one loop traverses the original loop A ~> A J/gcd(J, w) times
• Recall that there are J copies of node A in total. Hence, there are J/(J/gcd(J, w)) = gcd(J, w) loops, and each loop contains w/gcd(J, w) delays
• The iteration bound of the J-unfolded DFG is then J T∞
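The iteration bound of the unfolded graph follows from the loop counts above. The short derivation below is a sketch; the symbols t_ℓ (total computation time of loop ℓ), w_ℓ (its delay count) and T'_∞ (the unfolded bound) are notation introduced here.

```latex
% Each unfolded loop contains J/\gcd(J, w_\ell) copies of every node of
% loop \ell and w_\ell/\gcd(J, w_\ell) delays, so its loop bound is
\[
  \frac{\dfrac{J}{\gcd(J, w_\ell)}\, t_\ell}{\dfrac{w_\ell}{\gcd(J, w_\ell)}}
  = J\,\frac{t_\ell}{w_\ell} .
\]
% Taking the maximum over all loops gives
\[
  T'_\infty = \max_{\ell} \; J\,\frac{t_\ell}{w_\ell} = J\, T_\infty .
\]
```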

When a Path is Unfolded
• If w < J, then a path containing w delays in a DFG leads to J-w paths with no delays and w paths with 1 delay in the J-unfolded DFG.
• If w ≥ J, then the path leads to J paths with one or more delays in the J-unfolded DFG. This implies that these paths are not critical.
• Assume that the critical path of the J-unfolded DFG is c. If D(U, V) ≥ c, then W_r(U, V) = W(U, V) + r(V) - r(U) ≥ J
• Any feasible clock period that can be obtained by retiming the J-unfolded DFG can also be achieved by retiming the original DFG directly and then J-unfolding it.

When a Path is Unfolded
• Suppose r' is a legal retiming for the J-unfolded DFG, G_J, which leads to critical path c.
• Let r(U) = Σ_i r'(U_i), 0 ≤ i ≤ J-1.
  – r is a feasible retiming for the original DFG, G.
  – The retiming leads to a critical path c.

Sample Period Reduction
• Case 1: A node in the DFG has a computation time greater than T∞
• Case 2: The iteration bound is not an integer
• Case 3: The longest node computation time is larger than the iteration bound T∞, and T∞ is not an integer
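A common way to handle all three cases, which the slide text does not spell out, is to pick the smallest unfolding factor J for which J T∞ is an integer and is at least the longest node computation time. The sketch below works under that assumption, with a rational iteration bound; the function name and the numeric inputs are illustrative only.

```python
from fractions import Fraction


def min_unfolding_factor(t_inf, longest_node_time):
    """Smallest J with J*T_inf an integer and J*T_inf >= longest node time.

    Assumes the rule stated in the lead-in; t_inf must be a positive
    rational number (int, Fraction, or a string such as "4/3").
    """
    t_inf = Fraction(t_inf)
    if t_inf <= 0:
        raise ValueError("iteration bound must be positive")
    J = 1
    while (J * t_inf).denominator != 1 or J * t_inf < longest_node_time:
        J += 1
    return J


case1 = min_unfolding_factor(3, 4)                # node slower than T_inf
case2 = min_unfolding_factor(Fraction(4, 3), 1)   # fractional T_inf
case3 = min_unfolding_factor(Fraction(4, 3), 6)   # both conditions at once
```

`Fraction` keeps the arithmetic exact, so the integrality test does not suffer from floating-point rounding.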

Case 1
• The critical path dominates, since a node computation time is larger than the iteration bound
• Retiming cannot be used to reduce the sample period

Sample Period Reduction
• Rule of Thumb: T∞ = 6, Tcritical = 6

Case 2
• The iteration period cannot achieve the iteration bound

Sample Period Reduction

Case 3

Parallel Processing
• Parallel processing can be performed by unfolding

Bit-Level Parallel Processing

Bit-Serial Adder

Unfolding of Switches

Example

Switches with Delays

If Wordlength is not a Multiple of J