Reducing Hardware Complexity of Linear DSP Systems by Iteratively Eliminating Two-Term Common Subexpressions
IEEE/ACM Asia South Pacific Design Automation Conference (ASP-DAC), Shanghai, 2005
Anup Hosangadi, Ryan Kastner (ECE Department, UCSB)
Farzan Fallah (Advanced CAD Research, Fujitsu Labs of America)

Outline
- Introduction
- Related Work
- Polynomial transformation
- Common subexpression elimination
- Results
- Conclusions

Introduction
- Multiplications by constants are encountered in many application areas
  - DSP transforms in audio, video, and image processing (DFT, DCT, IDCT, etc.)
  - Filtering operations in communication (FIR, IIR filters)
  - Multiple Input Multiple Output (MIMO) systems
  - Polynomials in computer graphics

Introduction
- Multiplication is expensive in hardware
- Decompose constant multiplications into shifts and additions
  - 13*X = (1101)2*X = X + X<<2 + X<<3
- Signed digits can reduce the number of additions/subtractions
  - Canonical Signed Digits (CSD) (Knuth '74)
  - (57)10 = (0111001)2 = (100-1001)CSD
- Further reduction possible by common subexpression elimination
  - Up to 50% reduction (R. Hartley, TCS '96)
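The decomposition on this slide can be reproduced with a short sketch. The helper names (`binary_digits`, `to_csd`, `shift_add`, `nonzero`) are illustrative and not from the talk; `to_csd` uses the standard non-adjacent-form recoding, which is assumed here to match the CSD representation the slide refers to.

```python
def binary_digits(n):
    """Digits of a non-negative integer, least significant first."""
    return [(n >> i) & 1 for i in range(n.bit_length())]


def to_csd(n):
    """Signed digits (-1, 0, +1) of n, least significant first, no two adjacent nonzeros."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def shift_add(digits, x):
    """Evaluate constant * x from shifted copies of x; d = +1/-1 selects add or subtract."""
    return sum(d * (x << i) for i, d in enumerate(digits) if d)


def nonzero(digits):
    """Number of nonzero digits, i.e. number of shifted copies of x that are combined."""
    return sum(1 for d in digits if d)


x = 3
assert shift_add(binary_digits(13), x) == 13 * x  # X + X<<2 + X<<3
assert shift_add(to_csd(57), x) == 57 * x         # X<<6 - X<<3 + X
print(nonzero(binary_digits(57)), nonzero(to_csd(57)))  # prints "4 3"
```

For 57 the binary form has four nonzero digits (three additions), while the signed-digit form has three (two additions/subtractions).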

Introduction
- Common subexpressions = common digit patterns
  - Without sharing (4+, 4<<):
      F1 = 7*X  = (0111)*X = X + X<<1 + X<<2
      F2 = 13*X = (1101)*X = X + X<<2 + X<<3
  - Common pattern "0101" => X + X<<2
  - With sharing (3+, 3<<):
      D1 = X + X<<2
      F1 = D1 + X<<1
      F2 = D1 + X<<3
- Good for a single variable: FIR filters (transposed form)
- Multiple variables? (DFT, DCT, etc.)
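A quick check of the 7*X / 13*X example, written as a sketch with hypothetical function names (`direct`, `shared`). It confirms that extracting the shared pattern X + X<<2 leaves both results unchanged; the operation counts in the comments are the ones quoted on the slide.

```python
def direct(x):
    f1 = x + (x << 1) + (x << 2)  # 7*X: 2 additions, 2 shifts
    f2 = x + (x << 2) + (x << 3)  # 13*X: 2 additions, 2 shifts
    return f1, f2


def shared(x):
    d1 = x + (x << 2)             # common pattern "0101": 1 addition, 1 shift
    f1 = d1 + (x << 1)            # 1 addition, 1 shift
    f2 = d1 + (x << 3)            # 1 addition, 1 shift
    return f1, f2


for x in range(-16, 17):
    assert direct(x) == shared(x) == (7 * x, 13 * x)
```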

Related Work
- Simple bipartite matching (Potkonjak et al., TCAD '95)
  - (10101) and (01101) => common pattern = "101"
  - (10010) and (010010) => cannot detect pattern "1001"
- Recursive Shift and Add (RESANDS) (H. Nguyen et al., TVLSI 2000)
  - (10010) and (010010) => common pattern "1001"
- Exhaustive enumeration of all digit patterns (Pasko et al., TCAD '99)
  - (1011) => "0011", "1001", "1010", "0101", "1011"

Related Work
- Extending techniques for multiple variables (Potkonjak et al., TCAD '95)

    [Y1]   [a11 a12 a13]   [X1]
    [Y2] = [a21 a22 a23] x [X2]
    [Y3]   [a31 a32 a33]   [X3]

- All distinct Sij.Xj and Cik.Dk
[Figure: example bit matrix for Y1, Y2, Y3; entries not recoverable]

Related Work
- Multiple-variable common subexpression elimination (A. Hosangadi et al., ASAP '04)
  - Polynomial transformation of linear systems
  - Uses rectangle covering methods
- Cannot find subexpressions with reversed signs
  - e.g. (X1 - X2<<1) ≠ (X2<<1 - X1)
  - Common occurrence when signed digits are used
- Rectangle covering has exponential complexity
- Method to overcome these limitations?

Related Work
- Algebraic methods in multilevel logic synthesis (MLLS)
  - Reducing literal count in a set of Boolean expressions
  - Factoring, decomposition: established algebraic techniques
  - Typically used for thousands of variables and literals
- Apply these methods to optimize linear systems?
    D1 = X1 + X2<<2
    Y1 = D1 + D1<<3 + X1<<3
    Y2 = D1 + X2<<2

Linear systems and polynomial transformation
- View linear systems as a set of arithmetic expressions
  - Expressions consisting of +, -, << operators
  - Develop a methodology for extracting common subexpressions
- Polynomial formulation: a left shift by i becomes multiplication by L^i
  - C × X = Σi (±X × L^i)
  - (14)10 × X = (1110)2 × X = X<<3 + X<<2 + X<<1 = XL^3 + XL^2 + XL
  - (100-10)CSD × X = XL^4 - XL
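The polynomial formulation can be sketched as a list of (sign, variable, exponent of L) terms. This representation and the names `poly_terms`, `to_text`, and `evaluate` are assumptions made for illustration; the slides do not prescribe a data structure. Setting L = 2 recovers the original constant multiplication.

```python
def poly_terms(digits_msb_first, var="X"):
    """Map a signed-digit string (most significant digit first) to polynomial terms."""
    top = len(digits_msb_first) - 1
    return [(d, var, top - i) for i, d in enumerate(digits_msb_first) if d]


def to_text(terms):
    """Render terms such as (1, 'X', 3) as '+XL^3'."""
    return " ".join(f"{'+' if s > 0 else '-'}{v}L^{e}" for s, v, e in terms)


def evaluate(terms, values, L=2):
    """Substitute L = 2 (a one-bit left shift) to get the numeric result."""
    return sum(s * values[v] * L ** e for s, v, e in terms)


binary_14 = poly_terms([1, 1, 1, 0])     # (1110)2
csd_14 = poly_terms([1, 0, 0, -1, 0])    # (100-10)CSD
print(to_text(binary_14))  # +XL^3 +XL^2 +XL^1
print(to_text(csd_14))     # +XL^4 -XL^1
assert evaluate(binary_14, {"X": 5}) == evaluate(csd_14, {"X": 5}) == 14 * 5
```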

Linear systems and polynomial transformation
- H.264 integer transform

    [Y0]   [1  1  1  1]   [X0]
    [Y1] = [2  1 -1 -2] x [X1]
    [Y2]   [1 -1 -1  1]   [X2]
    [Y3]   [1 -2  2 -1]   [X3]

- Decomposing constant multiplications (12+, 4<<)

    Y0 = X0 + X1 + X2 + X3
    Y1 = X0<<1 + X1 - X2 - X3<<1
    Y2 = X0 - X1 - X2 + X3
    Y3 = X0 - X1<<1 + X2<<1 - X3

Linear systems and polynomial transformation
- H.264 integer transform

    [Y0]   [1  1  1  1]   [X0]
    [Y1] = [2  1 -1 -2] x [X1]
    [Y2]   [1 -1 -1  1]   [X2]
    [Y3]   [1 -2  2 -1]   [X3]

- Polynomial transformation (12+, 4<<)

    Y0 = X0 + X1 + X2 + X3
    Y1 = X0 L + X1 - X2 - X3 L
    Y2 = X0 - X1 - X2 + X3
    Y3 = X0 - X1 L + X2 L - X3
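A minimal sketch of the transformation applied to the whole matrix. The matrix entries are read off the four expressions on the slide; the term tuples (sign, variable, exponent of L) and the names `row_to_terms` and `cost` are illustrative assumptions. Plain binary digits are used, which for the constants 1 and 2 coincide with CSD.

```python
H264 = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def row_to_terms(row):
    """One term per nonzero binary digit of each constant in the row."""
    terms = []
    for j, c in enumerate(row):
        sign = 1 if c > 0 else -1
        mag = abs(c)
        for e in range(mag.bit_length()):
            if (mag >> e) & 1:
                terms.append((sign, f"X{j}", e))
    return terms


def cost(exprs):
    """(additions/subtractions, shifts) needed to evaluate the expressions directly."""
    adds = sum(len(terms) - 1 for terms in exprs)
    shifts = sum(1 for terms in exprs for _, _, e in terms if e > 0)
    return adds, shifts


exprs = [row_to_terms(row) for row in H264]
print(cost(exprs))  # (12, 4)
```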

Fx algorithm
- Concurrent Decomposition and Factorization of Boolean Expressions (J. Rajski et al., TCAD '92)
  - Popular as the Fast-Extract (Fx) algorithm
- Expression f = gh + r
  - g = (ab + c) => double-cube divisor
  - g = ab => single-cube divisor
- Fx algorithm for linear systems?

Two-term divisors
- Obtained from every pair of terms in each expression
  - Divide by the minimum exponent of L
  - e.g. F = X1 + X2 L + X3 L^3
    - {+X2 L, +X3 L^3}: divide by L => (X2 + X3 L^2)
    - Divisors = (X1 + X2 L), (X1 + X3 L^3), (X2 + X3 L^2)
- Two divisors intersect if
  - The terms involved are distinct
  - (X1 - X2 L) ∩ (X1 - X2 L) ≠ φ
  - (X1 - X2 L) ∩ (-X1 + X2 L) ≠ φ (reversed signs allowed!)
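Divisor generation for a single expression might look like the sketch below. Terms are modelled as (sign, variable, exponent of L) tuples, which is an assumption of this example, and `same_up_to_sign` is a hypothetical helper that only checks the reversed-sign match; it does not check that the two divisors come from distinct terms.

```python
from itertools import combinations


def two_term_divisors(expr):
    """All two-term divisors of one expression, each divided by its minimum power of L."""
    result = []
    for a, b in combinations(expr, 2):
        m = min(a[2], b[2])  # minimum exponent of L in the pair
        result.append(((a[0], a[1], a[2] - m), (b[0], b[1], b[2] - m)))
    return result


def same_up_to_sign(d1, d2):
    """True if d1 == d2 or d1 == -d2 (reversed signs allowed)."""
    negated = tuple((-s, v, e) for s, v, e in d2)
    return set(d1) == set(d2) or set(d1) == set(negated)


# F = X1 + X2 L + X3 L^3
F = [(1, "X1", 0), (1, "X2", 1), (1, "X3", 3)]
for divisor in two_term_divisors(F):
    print(divisor)
```

The three printed divisors correspond to (X1 + X2 L), (X1 + X3 L^3), and (X2 + X3 L^2).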

Two-term divisors
- Theorem: a multiple-term common subexpression exists in a set of expressions iff there is a non-overlapping intersection among the two-term divisors
- Many divisors have intersections; which one to choose?
  - Greedily select the divisor with the largest number of intersections
- Selecting a divisor changes the expressions
  - Perform concurrent decomposition of the expressions

Algorithm (Step 1)
- Creating the set of divisors

    {Divisors} = φ;
    for each expression Pi {
        {Dnew} = Divisors for Pi;
        {Divisors} = {Divisors} ∪ {Dnew};
        Update frequency statistics of {Divisors};
    }
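Step 1 can be sketched with a frequency table keyed by a sign-normalized divisor. The key layout (first variable, exponent, relative sign, second variable, exponent) and the name `canonical` are assumptions of this example, chosen so that a divisor and its negation map to the same key. The input is the H.264 system in polynomial form.

```python
from collections import Counter
from itertools import combinations


def canonical(a, b):
    """Sign- and shift-normalized key for the divisor formed by terms a and b."""
    a, b = sorted((a, b), key=lambda t: (t[1], t[2]))  # fixed term order
    m = min(a[2], b[2])                                # divide by minimum power of L
    return (a[1], a[2] - m, a[0] * b[0], b[1], b[2] - m)


exprs = {
    "Y0": [(1, "X0", 0), (1, "X1", 0), (1, "X2", 0), (1, "X3", 0)],
    "Y1": [(1, "X0", 1), (1, "X1", 0), (-1, "X2", 0), (-1, "X3", 1)],
    "Y2": [(1, "X0", 0), (-1, "X1", 0), (-1, "X2", 0), (1, "X3", 0)],
    "Y3": [(1, "X0", 0), (-1, "X1", 1), (1, "X2", 1), (-1, "X3", 0)],
}

divisors = Counter()
for terms in exprs.values():
    # {Divisors} = {Divisors} U {Dnew}, with frequency statistics updated
    divisors.update(canonical(a, b) for a, b in combinations(terms, 2))

frequent = {d: n for d, n in divisors.items() if n > 1}
for d, n in frequent.items():
    print(d, n)
```

For this input four divisors occur twice: (X0 + X3), (X1 + X2), (X0 - X3), and (X1 - X2), the last two matching across Y1 and Y3 only after shift and sign normalization.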

Algorithm (Step 2)
- Common subexpression elimination

    {Divisors} = Set of all 2-term divisors;
    while (intersections present) {
        Find Best_Divisor in {Divisors};
        {T} = Set of terms involved in intersection;
        {D} = Set of divisors involving any term in {T};
        {Divisors} = {Divisors} - {D};
        Rewrite Expressions;
        {Dnew} = New Divisors involving new terms;
        {Divisors} = {Divisors} ∪ {Dnew};
    }
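The two steps can be combined into a compact sketch. It is a simplification rather than the authors' implementation: the divisor set is rebuilt on every pass instead of being updated incrementally, ties are broken by encounter order, and new variables are assumed to be named D0, D1, ... without clashing with input names. Terms are (sign, variable, exponent of L) tuples.

```python
from collections import defaultdict
from itertools import combinations


def canonical(a, b):
    """Return (divisor key, instance sign, instance shift) for a pair of terms."""
    a, b = sorted((a, b), key=lambda t: (t[1], t[2]))
    m = min(a[2], b[2])
    return (a[1], a[2] - m, a[0] * b[0], b[1], b[2] - m), a[0], m


def eliminate(exprs):
    exprs = {name: list(terms) for name, terms in exprs.items()}
    count = 0
    while True:
        # Collect every two-term divisor instance (rebuilt each pass for simplicity).
        instances = defaultdict(list)
        for name, terms in exprs.items():
            for a, b in combinations(terms, 2):
                key, sign, shift = canonical(a, b)
                instances[key].append((name, a, b, sign, shift))
        # Greedy choice: the divisor with the most non-overlapping instances.
        best_key, best = None, []
        for key, found in instances.items():
            chosen, used = [], set()
            for name, a, b, sign, shift in found:
                if (name, a) not in used and (name, b) not in used:
                    chosen.append((name, a, b, sign, shift))
                    used.update({(name, a), (name, b)})
            if len(chosen) > len(best):
                best_key, best = key, chosen
        if len(best) < 2:  # no intersections left
            return exprs
        # Rewrite expressions: replace each instance by one (shifted, signed) new term.
        new = f"D{count}"
        count += 1
        for name, a, b, sign, shift in best:
            exprs[name].remove(a)
            exprs[name].remove(b)
            exprs[name].append((sign, new, shift))
        va, ea, rel, vb, eb = best_key
        exprs[new] = [(1, va, ea), (rel, vb, eb)]


def cost(exprs):
    """(additions/subtractions, shifts) over all expressions, divisors included."""
    adds = sum(len(t) - 1 for t in exprs.values())
    shifts = sum(1 for t in exprs.values() for _, _, e in t if e > 0)
    return adds, shifts


def evaluate(exprs, name, inputs):
    """Numeric value of an expression or input variable."""
    if name in inputs:
        return inputs[name]
    return sum(s * (evaluate(exprs, v, inputs) << e) for s, v, e in exprs[name])


h264 = {
    "Y0": [(1, "X0", 0), (1, "X1", 0), (1, "X2", 0), (1, "X3", 0)],
    "Y1": [(1, "X0", 1), (1, "X1", 0), (-1, "X2", 0), (-1, "X3", 1)],
    "Y2": [(1, "X0", 0), (-1, "X1", 0), (-1, "X2", 0), (1, "X3", 0)],
    "Y3": [(1, "X0", 0), (-1, "X1", 1), (1, "X2", 1), (-1, "X3", 0)],
}
result = eliminate(h264)
print(cost(h264), "->", cost(result))  # (12, 4) -> (8, 2)
```

On the H.264 system this sketch extracts the same four divisors as the talk's worked example, though possibly in a different order and with different D numbering because of tie-breaking.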

Algorithm complexity
- M×M constant matrix; N digits of precision
- Y0 = X0 + X0 L + ... + XM-1 L^3
- O(MN) terms => O(M^2 N^2) divisors
[Figure: M×M matrix of N-digit constants (1111, 1001, 1110, 0011, 1010, ...) producing Y0 ... YM-1]

Algorithm (Step 1)
- Creating the set of divisors: O(M^3 N^2) overall

    {Divisors} = φ;
    for each expression Pi {
        {Dnew} = Divisors for Pi;                      O(M^2 N^2) distinct divisors
        {Divisors} = {Divisors} ∪ {Dnew};
        Update frequency statistics of {Divisors};     O(M^2 N^2)
    }
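The bounds can be made concrete with a small counting sketch. It assumes the worst case in which every one of the N digits of every constant is nonzero, so each of the M expressions has M*N terms; the function name `worst_case_counts` is illustrative.

```python
from math import comb


def worst_case_counts(M, N):
    """Upper bounds on terms and two-term divisors for an M x M matrix, N digits."""
    terms_per_expr = M * N                       # O(MN)
    divisors_per_expr = comb(terms_per_expr, 2)  # every pair of terms: O(M^2 N^2)
    total_divisors = M * divisors_per_expr       # over M expressions: O(M^3 N^2)
    return terms_per_expr, divisors_per_expr, total_divisors


print(worst_case_counts(8, 16))  # (128, 8128, 65024)
```

For an 8x8 matrix with 16 digits this bound gives at most 128 terms and 8128 divisors per expression. With CSD the real counts are lower, since no two adjacent digits can both be nonzero.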

Algorithm (Step 2)
- Common subexpression elimination

    {Divisors} = Set of all 2-term divisors;           O(M^2 N^2)
    while (intersections present) {
        Find Best_Divisor in {Divisors};               O(M^2 N^2)
        {T} = Set of terms involved in intersection;
        {D} = Set of divisors involving any term in {T};
        {Divisors} = {Divisors} - {D};
        Rewrite Expressions;
        {Dnew} = New Divisors involving new terms;
        {Divisors} = {Divisors} ∪ {Dnew};
    }

Algorithm
- H.264 example

    Y0 = X0 + X1 + X2 + X3
    Y1 = X0 L + X1 - X2 - X3 L
    Y2 = X0 - X1 - X2 + X3
    Y3 = X0 - X1 L + X2 L - X3

- >> Select D0 = (X0 + X3)

Algorithm
- H.264 example

    Y0 = D0 + X1 + X2
    Y1 = X0 L + X1 - X2 - X3 L
    Y2 = D0 - X1 - X2
    Y3 = X0 - X1 L + X2 L - X3

- >> Select D1 = (X1 - X2)

Algorithm
- H.264 example

    Y0 = D0 + X1 + X2
    Y1 = X0 L + D1 - X3 L
    Y2 = D0 - X1 - X2
    Y3 = X0 - D1 L - X3

- >> Select D2 = (X1 + X2)

Algorithm
- H.264 example

    Y0 = D0 + D2
    Y1 = X0 L + D1 - X3 L
    Y2 = D0 - D2
    Y3 = X0 - D1 L - X3

- >> Select D3 = (X0 - X3)

Final Implementation
- Extracting 4 divisors (8+, 2<<)

    D0 = X0 + X3        Y0 = D0 + D2
    D1 = X1 - X2        Y1 = D1 + D3 L
    D2 = X1 + X2        Y2 = D0 - D2
    D3 = X0 - X3        Y3 = D3 - D1 L

- Original: 12+, 4<<
- Rectangle covering: 10+, 3<<
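The final implementation can be checked against the direct shift-and-add form with a few lines. The function names `h264_direct` and `h264_shared` are illustrative; the bodies follow the expressions on the slides, with L written as a one-bit left shift.

```python
from itertools import product


def h264_direct(x0, x1, x2, x3):
    """Direct shift-and-add form: 12 additions/subtractions, 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3


def h264_shared(x0, x1, x2, x3):
    """Form after extracting D0..D3: 8 additions/subtractions, 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)


# Exhaustive check over a small input range.
for x in product(range(-3, 4), repeat=4):
    assert h264_direct(*x) == h264_shared(*x)
```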

Experimental Setup
- Goal
  - Reduction in the number of additions/subtractions
  - Effect on area/latency after synthesis
  - Simulate designs to estimate power consumption
- Transforms: DCT, IDCT, DFT, DST, DHT
  - 8x8 constant matrices
  - 16 digits of precision (CSD representation)
- Compare with
  - Potkonjak (TCAD '95)
  - RESANDS (Nguyen et al., TVLSI 2000)
  - Rectangle covering (A. Hosangadi et al., ASAP '04)

Experimental Results
- Number of additions/subtractions

    Example     Original  Potkonjak  RESANDS  Rectangle Covering  Two-term CSE
                (I)       (II)       (III)    (IV)                (V)
    DCT         274       202        227      174                 153
    IDCT        242       183        222      162                 143
    Real DFT    253       193        208      165                 144
    Imag DFT    207       178        198      134                 124
    DST         320       238        252      200                 187
    DHT         284       209        211      175                 158
    Average     263.3     200.5      219.7    168.3               151.5
    Run time                                  0.81 s              0.08 s
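The relative savings implied by the table can be derived with a short sketch. The numbers are copied from the table above; the function name `savings` and the choice to compare averages across the six examples are assumptions of this example, not figures reported in the talk.

```python
additions = {
    "Original (I)":            [274, 242, 253, 207, 320, 284],
    "Potkonjak (II)":          [202, 183, 193, 178, 238, 209],
    "RESANDS (III)":           [227, 222, 208, 198, 252, 211],
    "Rectangle Covering (IV)": [174, 162, 165, 134, 200, 175],
    "Two-term CSE (V)":        [153, 143, 144, 124, 187, 158],
}


def average(values):
    return sum(values) / len(values)


def savings(table, method="Two-term CSE (V)"):
    """Fractional reduction in average additions of `method` versus each other column."""
    ours = average(table[method])
    return {name: 1 - ours / average(vals)
            for name, vals in table.items() if name != method}


for name, fraction in savings(additions).items():
    print(f"{name}: {fraction:.1%}")
```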

Experimental results
- Synthesis results (minimum latency constraints)
- (III) RESANDS, (IV) Rect. Covering, (V) 2-term CSE

    Example     Area (library units)         Latency (clock cycles)
                (III)     (IV)     (V)       (III)   (IV)    (V)
    DCT         90667     73311    66759     10      11      10
    IDCT        81868     66864    62883     10      11      10
    R-DFT       90496     69827    64026     10      11      10
    I-DFT       75140     55940    54606     10      10      10
    DST         108101    84715    81214     11      11      11
    DHT         93939     71272    67775     11      11      10
    Average     90110     70322    66211     10.3    10.8    10.2

Experimental results
- Power consumption
- (III) RESANDS, (IV) Rect. Covering, (V) 2-term CSE

    Example     Power consumption (µW)
                (III)     (IV)     (V)
    DCT         729       504      531
    IDCT        662       547      569
    R-DFT       707       544      554
    I-DFT       644       575      490
    DST         607       718      595
    DHT         598       545      527
    Average     657.8     572.2    544.3

Conclusions
- A new technique for eliminating common subexpressions in linear systems
- Fewer operations than known methods
- Much faster than rectangle covering
- Combine with scheduling on given resources

Thank you
Questions?