Algebraic Techniques To Enhance Common Sub-expression Extraction for Polynomial System Synthesis
Sivaram Gopalakrishnan, Synopsys Inc., Hillsboro, OR 97124
Priyank Kalla, Department of Electrical and Computer Engineering, University of Utah, Salt Lake City, UT 84112
Outline
Ø Problem context: Polynomial datapath synthesis
  • Our focus: Integrating CSE and algebraic methods
  • Applications: DSP for audio, video, multimedia
Ø Motivation
Ø Previous work and limitations
Ø Integrated approach
  • Square-free factorization
  • Common coefficient extraction
  • Common cube extraction
  • Algebraic division
Ø Results: Area optimization
Ø Conclusions & future work
The Synthesis Flow
Polynomial representation?
Ø Quadratic filter design for polynomial signal processing
Ø $y = a_0 x_1^2 + a_1 x_1 + b_0 x_0^2 + b_1 x_0 + c\, x_0 x_1$
Motivation
Ø Direct implementation: 17 mults & 4 adds
  • $P_1 = x^2 + 6xy + 9y^2$
  • $P_2 = 4xy^2 + 12y^3$
  • $P_3 = 2zx^2 + 6xyz$
Ø Horner form: 15 mults & 4 adds
  • $P_1 = x(x + 6y) + 9y^2$
  • $P_2 = 4xy^2 + 12y^3$
  • $P_3 = x(2zx + 6yz)$
Ø Factorization + CSE: 12 mults & 4 adds
  • $P_1 = x(x + 6y) + 9y^2$
  • $P_2 = y^2(4x + 12y)$
  • $P_3 = xz(2x + 6y)$
Motivation
Ø Our approach: 8 mults & 1 add
  • $d_1 = x + 3y$
  • $P_1 = d_1^2$
  • $P_2 = 4 d_1 y^2$
  • $P_3 = 2xz\, d_1$
Ø $d_1$ is a good building block
Ø How to identify such building blocks across multiple polynomial datapaths?
Ø Need a methodology that exposes many common expressions
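The decomposition above can be checked numerically. The sketch below is only an illustration: the function names `direct` and `shared` are made up here, and the grid test is a sanity check, not a proof of equivalence.

```python
# Direct forms of P1, P2, P3 from the Motivation slide.
def direct(x, y, z):
    p1 = x**2 + 6*x*y + 9*y**2
    p2 = 4*x*y**2 + 12*y**3
    p3 = 2*z*x**2 + 6*x*y*z
    return p1, p2, p3

# The same polynomials built around the shared block d1 = x + 3y.
def shared(x, y, z):
    d1 = x + 3*y          # 1 multiply, 1 add
    p1 = d1 * d1          # 1 multiply
    p2 = 4 * d1 * y * y   # 3 multiplies
    p3 = 2 * x * z * d1   # 3 multiplies
    return p1, p2, p3

# Compare both forms on a small integer grid (illustrative only).
points = [(x, y, z)
          for x in range(-3, 4)
          for y in range(-3, 4)
          for z in range(-3, 4)]
all_equal = all(direct(*p) == shared(*p) for p in points)
```

The per-line comments in `shared` add up to the 8 multiplies and 1 add quoted on the slide.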
Conventional Methods
Ø Extracting control-dataflow graphs (CDFGs) from RTL
  • Scheduling
  • Resource sharing
  • Retiming
  • Control synthesis
Ø Algebraic transforms for arithmetic designs
  • Factorization [Hosangadi et al., ICCAD 04]
  • Common sub-expression elimination [Hosangadi et al., VLSI 05]
  • Term rewriting [Arvind et al., IEEE Micro 98]
  • Tree-height reduction [De Micheli 94]
Ø Lack of symbolic computer algebra manipulation
Conventional Methods…
Ø Kernel/co-kernel extraction (factorization + CSE)
  • Integrates CSE with cube/coefficient extraction
  • Uses coefficients and variables to identify cubes (co-kernels) to obtain kernels
  • Subsequently uses CSE for further optimization
Ø $P = 5x^2 + 10y^3 + 15pq$
  • Uses {5, 10, 15, x, y, p, q} for kernel/co-kernel extraction
  • Does not perform algebraic division
  • Cannot determine the decomposition $5(x^2 + 2y^3 + 3pq)$
Ø $P = x^2 + 2xy + y^2 = (x + y)^2$
  • Cannot determine the above decomposition
Symbolic algebra techniques
Ø Polynomial models for complex computational blocks
Ø Guiding synthesis engines using Gröbner bases [Peymandoust and De Micheli, TCAD 02]
  • Given polynomial $F$ and library elements $\langle I_1, \ldots, I_n \rangle$
  • $F = h_1 I_1 + \cdots + h_n I_n$
  • Restricted to library elements
Ø Datapath optimization using word-length information [Gopalakrishnan et al., ICCAD 07]
  • Restricted to fixed-size datapaths
  • Cannot address systems of polynomials
Optimization techniques
Ø Canonical form representation $\sum_k c_k Y_k$
  • $c_k$: coefficient in the range $0 \le c_k \le b_k$
  • $Y_k$: falling factorial
  • $F = 3x^2y^2 - 3x^2y - 3xy^2 + 3xy = 3x(x-1)y(y-1)$
Ø Example system
  • $f_1 = 5x^3y^2 - 5x^3y - 15x^2y^2 + 15x^2y + 10xy^2 - 10xy + 3z^2$
  • $f_2 = 3x^2y^2 - 3x^2y - 3xy^2 + 3xy + z + 1$
Ø After canonization
  • $d_1 = x(x-1)y(y-1)$
  • $f_1 = 5 d_1 (x-2) + 3z^2$
  • $f_2 = 3 d_1 + z + 1$
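As a quick sanity check of the falling-factorial rewrite, the sketch below evaluates the expanded and the shared forms of f1 and f2 at integer points. The helper `falling` and the other function names are introduced here purely for illustration.

```python
def falling(x, k):
    """Falling factorial x(x-1)...(x-k+1); equals 1 when k is 0."""
    out = 1
    for i in range(k):
        out *= x - i
    return out

def f1_expanded(x, y, z):
    return (5*x**3*y**2 - 5*x**3*y - 15*x**2*y**2 + 15*x**2*y
            + 10*x*y**2 - 10*x*y + 3*z**2)

def f2_expanded(x, y, z):
    return 3*x**2*y**2 - 3*x**2*y - 3*x*y**2 + 3*x*y + z + 1

def shared_forms(x, y, z):
    # d1 = x(x-1) y(y-1) is the block common to both polynomials.
    d1 = falling(x, 2) * falling(y, 2)
    return 5*d1*(x - 2) + 3*z**2, 3*d1 + z + 1

# Compare both forms on a small integer grid (illustrative only).
grid = [(x, y, z) for x in range(-3, 4) for y in range(-3, 4) for z in range(-2, 3)]
forms_agree = all(
    shared_forms(*p) == (f1_expanded(*p), f2_expanded(*p)) for p in grid
)
```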
Optimization techniques
Ø Square-free factorization
Ø Let $F$ be the integral domain $\mathbb{Z}$
Ø A polynomial $u$ in $F[x]$ is square-free if there is no polynomial $v$ in $F[x]$ with $\deg(v, x) > 0$ such that $v^2 \mid u$
Ø $u_1 = x^2 + 3x + 2 = (x+1)(x+2)$ is square-free
Ø $u_2 = x^4 + 7x^3 + 18x^2 + 20x + 8 = (x+1)(x+2)^3$ is not square-free
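One standard way to test the definition is to check whether gcd(u, u') is a constant; over the rationals a non-constant gcd means a repeated factor. The sketch below assumes that criterion rather than whatever routine the authors used, stores coefficients highest degree first, and uses illustrative helper names.

```python
from fractions import Fraction

def trim(p):
    """Drop leading zero coefficients, keeping at least one entry."""
    i = 0
    while i < len(p) - 1 and p[i] == 0:
        i += 1
    return p[i:]

def derivative(p):
    """Formal derivative of p (coefficients highest degree first)."""
    n = len(p) - 1
    return trim([c * (n - i) for i, c in enumerate(p[:-1])] or [Fraction(0)])

def poly_rem(a, b):
    """Remainder of a divided by b over the rationals; b must be nonzero."""
    a = trim(list(a))
    while len(a) >= len(b) and any(a):
        factor = a[0] / b[0]
        for i in range(len(b)):
            a[i] -= factor * b[i]
        a = a[1:] or [Fraction(0)]   # the leading term has been cancelled
    return trim(a)

def poly_gcd(a, b):
    """Euclidean algorithm; the result is defined up to a constant factor."""
    a, b = trim(a), trim(b)
    while any(b):
        a, b = b, poly_rem(a, b)
    return a

def is_square_free(coeffs):
    u = [Fraction(c) for c in coeffs]
    return len(poly_gcd(u, derivative(u))) == 1   # constant gcd

u1_ok = is_square_free([1, 3, 2])            # x^2 + 3x + 2
u2_ok = is_square_free([1, 7, 18, 20, 8])    # x^4 + 7x^3 + 18x^2 + 20x + 8
```

Rational arithmetic via `Fraction` keeps the Euclidean steps exact, which floating point would not.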
Optimization techniques
Ø Common coefficient extraction
Ø $P = 8x + 16y + 24z$
  • $P_1 = 2(4x + 8y + 12z)$
  • $P_2 = 4(2x + 4y + 6z)$
  • $P_3 = 8(x + 2y + 3z)$: best transformation
Ø Use GCD computation
  • Collect the coefficients $a_i$
  • Compute the GCD of every pair $(a_i, a_j)$
  • Retain only GCDs equal to the smaller of $(a_i, a_j)$
  • Arrange the GCDs in decreasing order and perform extraction
  • Update the GCD list and continue
Optimization techniques
Ø Common coefficient extraction (example)
Ø $P = 8x + 16y + 24z + 15a + 30b$
Ø Coefficients: {8, 16, 24, 15, 30}
Ø Pairwise GCD list: {8, 8, 1, 2, 8, 1, 2, 3, 6, 15}
Ø Reduced GCD list: {8, 15}, in decreasing order {15, 8}
Ø Extracting 15 results in
  • $P = 8x + 16y + 24z + 15(a + 2b)$
Ø Similarly, extracting 8 results in
  • $P = 8(x + 2y + 3z) + 15(a + 2b)$
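The extraction loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes positive integer coefficients, assumes the retention rule is "keep a pairwise GCD only when it equals the smaller coefficient of the pair" (which reproduces the reduced list {8, 15} in the example), and uses made-up function names.

```python
from itertools import combinations
from math import gcd

def candidate_gcds(coeffs):
    """Pairwise GCDs equal to the smaller coefficient of their pair, largest first."""
    kept = {gcd(a, b) for a, b in combinations(coeffs, 2)
            if gcd(a, b) == min(a, b)}
    return sorted(kept, reverse=True)

def extract_coefficients(terms):
    """terms: dict variable -> positive integer coefficient.

    Returns a list of (factor, {variable: reduced coefficient}) groups.
    """
    remaining = dict(terms)
    groups = []
    while True:
        # Recompute the GCD list on whatever terms are still ungrouped.
        cands = [g for g in candidate_gcds(list(remaining.values())) if g > 1]
        if not cands:
            break
        g = cands[0]
        group = {v: c // g for v, c in remaining.items() if c % g == 0}
        groups.append((g, group))
        for v in group:
            del remaining[v]
    if remaining:
        groups.append((1, remaining))   # terms with nothing worth extracting
    return groups

P = {"x": 8, "y": 16, "z": 24, "a": 15, "b": 30}
groups = extract_coefficients(P)
```

For the slide's polynomial, `groups` holds the factor 15 with a + 2b first and the factor 8 with x + 2y + 3z second, mirroring the order of extraction in the example.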
Optimization techniques
Ø Common cube extraction
Ø Similar to kernel/co-kernel extraction (for variables)
  • $P_1 = x^2y + xyz$
  • $P_2 = ab^2c^3 + b^2c^2x$
  • $P_3 = axz + x^2z^2b$
Ø Kernel/co-kernel extraction results in
  • $P_1 = xy(x + z)$
  • $P_2 = b^2c^2(ac + x)$
  • $P_3 = xz(a + xzb)$
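For a single polynomial, the largest common cube is the per-variable minimum exponent over all monomials. The sketch below assumes each monomial is stored as a dict from variable name to exponent (coefficients omitted); the representation and function names are illustrative choices.

```python
def common_cube(monomials):
    """Largest cube (variable -> exponent) dividing every monomial."""
    shared = set(monomials[0])
    for m in monomials[1:]:
        shared &= set(m)
    return {v: min(m[v] for m in monomials) for v in sorted(shared)}

def divide_out(monomials, cube):
    """What is left of each monomial after dividing by the cube."""
    return [
        {v: e - cube.get(v, 0) for v, e in m.items() if e - cube.get(v, 0) > 0}
        for m in monomials
    ]

# P2 = a b^2 c^3 + b^2 c^2 x from the slide.
P2 = [{"a": 1, "b": 2, "c": 3}, {"b": 2, "c": 2, "x": 1}]
cube = common_cube(P2)           # b^2 c^2
kernel = divide_out(P2, cube)    # ac + x
```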
Optimization techniques
Ø Polynomial long division
Ø Given two polynomials $a(x)$ and $b(x)$, algebraic division determines $q(x)$ and $r(x)$ such that $a(x) = b(x)\,q(x) + r(x)$
  • $a(x) = x^4 - 2x^3 + 5$
  • $b(x) = x^2 + 3x - 2$
  • $a(x) = b(x)(x^2 - 5x + 17) - 61x + 39$
  • $q(x) = x^2 - 5x + 17$, $r(x) = -61x + 39$
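A univariate long division fits in a few lines. The sketch below stores coefficients highest degree first and works over the rationals; the function name `poly_divmod` is an illustrative choice, and the multivariate division by linear blocks used in the integrated approach is not covered here.

```python
from fractions import Fraction

def poly_divmod(a, b):
    """Divide a by b (coefficients highest degree first); return (q, r)."""
    r = [Fraction(c) for c in a]
    b = [Fraction(c) for c in b]
    q = []
    while len(r) >= len(b):
        factor = r[0] / b[0]          # next quotient coefficient
        q.append(factor)
        for i in range(len(b)):
            r[i] -= factor * b[i]     # subtract factor * b, aligned at the top
        r.pop(0)                      # the leading term is now zero
    while len(r) > 1 and r[0] == 0:   # strip leading zeros of the remainder
        r.pop(0)
    return q or [Fraction(0)], r or [Fraction(0)]

# a(x) = x^4 - 2x^3 + 5 and b(x) = x^2 + 3x - 2 from the slide.
q, r = poly_divmod([1, -2, 0, 0, 5], [1, 3, -2])
```

Here `q` comes out as the coefficients of x^2 - 5x + 17 and `r` as those of -61x + 39, matching the slide.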
Optimization techniques
Ø Common sub-expression elimination
Ø Identify isomorphic patterns in an arithmetic expression tree and merge them
Ø Before
  • $k = x + y$
  • $m = x + y + z$
  • $n = xy + x + y$
Ø After
  • $k = x + y$
  • $m = k + z$
  • $n = xy + k$
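The example can be reproduced with a deliberately restricted form of CSE: each expression is treated as a set of addends, and a named sum is substituted wherever it appears whole inside a larger sum. This is a toy sketch under those assumptions (real CSE works on expression trees); the function name `share_sums` is made up for illustration.

```python
def share_sums(sums):
    """sums: dict name -> set of addend strings.

    Wherever one named sum is a proper subset of another, replace
    those addends by the name of the smaller sum.
    """
    result = {name: set(terms) for name, terms in sums.items()}
    # Visit larger candidate sums first so the biggest shared block wins.
    for name, terms in sorted(sums.items(), key=lambda kv: -len(kv[1])):
        for other in result:
            if other != name and len(terms) > 1 and terms < result[other]:
                result[other] = (result[other] - terms) | {name}
    return result

system = {
    "k": {"x", "y"},
    "m": {"x", "y", "z"},
    "n": {"xy", "x", "y"},
}
merged = share_sums(system)
```

After the call, `merged["m"]` contains `k` and `z`, and `merged["n"]` contains `xy` and `k`, as on the slide.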
Integrated approach
Ø Input: the polynomial system $P_{orig}$ (list of arrays)
Ø Perform canonization and square-free factorization
Ø Get the best initial cost: $C_{initial}$
Ø Perform coefficient extraction: $P_{cce}$
Ø Perform cube extraction: $P_{cce\_cube}$; collect the linear blocks
Ø Get the lists representing the system
Ø For every linear block, perform algebraic division on each list
Ø Pick the best-cost implementation
Illustration
Integrated approach (Example)
Ø Input system $P_{orig}$
  • $P_1 = 13x^2 + 26xy + 13y^2 + 7x - 7y + 11$
  • $P_2 = 15x^2 - 30xy + 15y^2 + 11x + 11y + 9$
Ø Square-free factorization does not work here
Ø Initial cost: 16 M and 10 A
Ø After common coefficient extraction ($P_{cce}$)
  • $P_1 = 13(x^2 + 2xy + y^2) + 7(x - y) + 11$
  • $P_2 = 15(x^2 - 2xy + y^2) + 11(x + y) + 9$
Ø Linear blocks: $(x - y)$, $(x + y)$
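The initial cost can be recovered with a simple operation count. The cost model below is an assumption made for illustration: no sharing between terms, one multiplication per extra factor in a monomial (a coefficient other than ±1 counts as a factor), and one addition or subtraction per extra term. Under that model the count agrees with both the 16 M / 10 A quoted here and the 17 mults / 4 adds of the direct implementation on the Motivation slide.

```python
def term_mults(coef, powers):
    """Multiplications for one monomial coef * prod(var**exp), with no sharing."""
    factors = sum(powers.values()) + (1 if abs(coef) != 1 else 0)
    return max(factors - 1, 0)

def direct_cost(system):
    """system: list of polynomials, each a list of (coef, {var: exp}) terms.

    Returns (multiplications, additions) for a direct implementation.
    """
    mults = sum(term_mults(c, p) for poly in system for c, p in poly)
    adds = sum(len(poly) - 1 for poly in system)
    return mults, adds

P1 = [(13, {"x": 2}), (26, {"x": 1, "y": 1}), (13, {"y": 2}),
      (7, {"x": 1}), (-7, {"y": 1}), (11, {})]
P2 = [(15, {"x": 2}), (-30, {"x": 1, "y": 1}), (15, {"y": 2}),
      (11, {"x": 1}), (11, {"y": 1}), (9, {})]
initial_cost = direct_cost([P1, P2])
```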
Integrated approach (Example…)
Ø After common cube extraction ($P_{cce\_cube}$)
  • $P_1 = 13(x(x + 2y) + y^2) + 7(x - y) + 11$
  • $P_2 = 15(x(x - 2y) + y^2) + 11(x + y) + 9$
Ø Linear blocks: $(x - y)$, $(x + 2y)$, $(x - 2y)$
Ø Perform algebraic division using the linear blocks
Ø $P_{cce}$ gives the best-cost implementation, using the blocks $(x + y)$ and $(x - y)$
  • $d_1 = x + y$; $d_2 = x - y$
  • $P_1 = 13 d_1^2 + 7 d_2 + 11$
  • $P_2 = 15 d_2^2 + 11 d_1 + 9$
Ø Cost: 6 M and 6 A
Results

Benchmark   Var/Deg/m   Area (Factor/CSE)   Area (Proposed)   Area improvement (%)   Delay improvement (%)
SG 3X2      2/2/16      204805              102386            50                     21.3
SG 4X2      2/2/16      449063              197599            55.9                   -24.1
SG 4X3      2/3/16      690208              557252            19.2                   -16.3
SG 5X2      2/2/16      570384              271729            52.3                   -13.9
SG 5X3      2/3/16      1365774             614955            54.9                   -20.7
Quad        2/2/16      36405               30556             16                     -9.5
Mibench     3/2/8       20359               8433              58.6                   -3.7
MVCS        2/3/16      31040               22214             28.4                   -32

Average area improvement: 42%
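The area column can be recomputed from the two area figures, assuming the improvement is measured relative to the Factor/CSE result. The dictionary below simply copies the numbers from the table; the variable and function names are illustrative.

```python
# (Factor/CSE area, proposed area) per benchmark, copied from the table.
areas = {
    "SG 3X2": (204805, 102386),
    "SG 4X2": (449063, 197599),
    "SG 4X3": (690208, 557252),
    "SG 5X2": (570384, 271729),
    "SG 5X3": (1365774, 614955),
    "Quad": (36405, 30556),
    "Mibench": (20359, 8433),
    "MVCS": (31040, 22214),
}

def area_improvement(baseline, proposed):
    """Percentage reduction in area relative to the baseline."""
    return 100.0 * (baseline - proposed) / baseline

improvements = {name: area_improvement(b, p) for name, (b, p) in areas.items()}
average = sum(improvements.values()) / len(improvements)
```

The recomputed values agree with the table to roughly a tenth of a percentage point, and the average rounds to the reported 42%.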
Conclusions & Future Work
Ø Polynomial decomposition approach for arithmetic datapaths
Ø Arithmetic datapaths modeled as polynomial systems
Ø Integrating CSE with algebraic manipulation
Ø Performing algebraic decomposition to enhance the power of CSE
Ø Average area improvement of 42% over Factor/CSE
Ø But at a delay penalty
Ø Future work:
  • Address the delay penalty
  • Retarget the approach towards power savings?
Questions?