High Speed FIR Filter Implementation Using Add and Shift Method

Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
University of California, Santa Barbara
ICCD 2006, San Jose, California, October 2006

UC Santa Barbara, ICCD 2006

Outline
  • Introduction
  • FIR filter implementation
      • Traditional methods
          • MAC (Multiply Accumulate) implementation
          • DA (Distributed Arithmetic) implementation
      • New method
          • Add and Shift method and CSE (Common Subexpression Elimination)
  • Experiments and results
      • Resource utilization
      • Power consumption
  • Conclusion

Introduction
  • Extensive use of FPGAs in computationally intensive applications such as DSP
      • More logic resources available in current FPGAs
      • Broad applications of FIR filters in multimedia and communications
      • Need for efficient design methods to save area and power
  • Research motivation
      • Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
      • Develop a unified tool for performing redundancy elimination, scheduling, and module assignment.
      • Perform physically aware optimizations.
      • Explore architecture designs for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).

FIR Filter MAC Implementation
  • L-tap FIR filter
      • Convolution of the latest L input samples. L is the number of filter coefficients h[k], and x[n] represents the input time series.

$$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$$

  • Disadvantages
      • Large area on the FPGA due to multipliers, even though the full flexibility of general-purpose multipliers is not required
      • Limited number of embedded resources such as MAC engines and multipliers in FPGAs
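The convolution sum above can be sketched in Python as one multiply-accumulate per tap. This is only an illustration: the name `fir_mac` is hypothetical, and treating samples before the start of the input as zero is an assumption the slide does not state.

```python
def fir_mac(h, x):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n-k].

    Samples before the start of x are treated as zero (an assumption).
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


# Feeding an impulse returns the coefficients themselves.
print(fir_mac([1, 2, 3], [1, 0, 0, 1]))
```

Each output sample costs L multiplications, which is the area cost the slide lists under disadvantages.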

FIR Filter DA (Distributed Arithmetic) Implementation
  • An alternative to the MAC implementation, and the most common FPGA FIR implementation owing to the LUT-rich architecture of FPGAs.

$$y = \sum_{n=0}^{N-1} c[n]\, x[n]$$

Variable x[n] can be represented by:

$$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\}$$

where $x_b[n]$ is the b-th bit of x[n] and B is the input width. The inner product can be rewritten as follows:

FIR Filter DA (Distributed Arithmetic) Implementation (cont'd)

$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\,\bigl(x_{B-1}[0]\, 2^{B-1} + x_{B-2}[0]\, 2^{B-2} + \dots + x_0[0]\, 2^0\bigr) \\
  &\quad + c[1]\,\bigl(x_{B-1}[1]\, 2^{B-1} + x_{B-2}[1]\, 2^{B-2} + \dots + x_0[1]\, 2^0\bigr) \\
  &\quad + \dots \\
  &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\, 2^{B-1} + x_{B-2}[N-1]\, 2^{B-2} + \dots + x_0[N-1]\, 2^0\bigr) \\
  &= \bigl(c[0]\, x_{B-1}[0] + c[1]\, x_{B-1}[1] + \dots + c[N-1]\, x_{B-1}[N-1]\bigr)\, 2^{B-1} \\
  &\quad + \bigl(c[0]\, x_{B-2}[0] + c[1]\, x_{B-2}[1] + \dots + c[N-1]\, x_{B-2}[N-1]\bigr)\, 2^{B-2} \\
  &\quad + \dots \\
  &\quad + \bigl(c[0]\, x_0[0] + c[1]\, x_0[1] + \dots + c[N-1]\, x_0[N-1]\bigr)\, 2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
$$

where n = 0, 1, …, N-1 and b = 0, 1, …, B-1.

DA (Distributed Arithmetic) Implementation: Serial

[Figure: A serial DA filter block diagram]

  • n+1 clock cycles are needed for an n-bit-input symmetrical filter to generate the output.
  • Performance is limited because the next input sample can be processed only after every bit of the current input sample has been processed.
  • The tradeoff here is performance for area.

Address   Data
0000      0
0001      C0
0010      C1
…         …
1111      C0+C1+C2+C3
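A minimal Python sketch of the serial scheme, under stated assumptions: inputs are unsigned B-bit integers (two's-complement inputs would need the sign bit handled separately), bits are consumed LSB first, and the names `build_da_lut` and `da_serial` are hypothetical. The LUT follows the address/data pattern in the table above: each entry is the sum of the coefficients selected by the address bits.

```python
def build_da_lut(c):
    """LUT[addr] = sum of c[n] for every bit n that is set in addr."""
    n_taps = len(c)
    return [sum(c[n] for n in range(n_taps) if (addr >> n) & 1)
            for addr in range(1 << n_taps)]


def da_serial(c, x, width):
    """Bit-serial DA inner product of c and x for unsigned `width`-bit x[n]."""
    lut = build_da_lut(c)
    acc = 0
    for b in range(width):  # one bit position per clock cycle
        # Address bit n is the b-th bit of x[n].
        addr = sum(((x[n] >> b) & 1) << n for n in range(len(x)))
        acc += lut[addr] << b  # scaling accumulator: weight by 2**b
    return acc


coeffs = [3, -1, 4, 2]
inputs = [5, 9, 0, 15]
print(da_serial(coeffs, inputs, 4), sum(c * v for c, v in zip(coeffs, inputs)))
```

The loop runs once per input bit, which mirrors why throughput is bounded by the input width in the serial architecture.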

DA (Distributed Arithmetic) Implementation: Parallel
  • The performance of the circuit can be improved by modifying the architecture to a parallel one that processes the data bits in groups.
  • Increasing the number of bits sampled has a significant effect on FPGA resource utilization:
      • More LUTs
      • Larger scaling accumulator

[Figure: A 2-bit parallel DA filter block diagram]
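As a rough Python model of processing bits in groups, the sketch below looks up several bit positions per cycle and combines them before accumulating. The block diagram is not reproduced here, so the structure (one LUT lookup per bit in the group, combined with a small shift) is an assumption, as are the unsigned inputs and the hypothetical name `da_parallel`.

```python
def da_parallel(c, x, width, bits_per_cycle=2):
    """DA inner product that consumes `bits_per_cycle` input bits per cycle.

    Returns (result, cycles). Inputs x[n] are unsigned `width`-bit integers.
    """
    n_taps = len(c)
    lut = [sum(c[n] for n in range(n_taps) if (addr >> n) & 1)
           for addr in range(1 << n_taps)]
    acc = 0
    cycles = 0
    for base in range(0, width, bits_per_cycle):
        partial = 0
        for j in range(bits_per_cycle):  # each j models one LUT copy
            b = base + j
            if b >= width:
                break
            addr = sum(((x[n] >> b) & 1) << n for n in range(n_taps))
            partial += lut[addr] << j
        acc += partial << base  # wider scaling accumulator
        cycles += 1
    return acc, cycles


print(da_parallel([3, -1, 4, 2], [5, 9, 0, 15], 4, bits_per_cycle=2))
```

In this model, doubling `bits_per_cycle` halves the cycle count but needs a second LUT copy and a wider accumulator input, which matches the resource tradeoff the slide describes.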

CSE (Common Subexpression Elimination)
  • Linear systems can be modeled using polynomials. Expressions consist of +, -, and << operators.
  • Polynomial formulation:

$$C \times X = \sum_i \pm X L^i$$

$$(14)_{10} \times X = (1110)_2 \times X = X \ll 3 + X \ll 2 + X \ll 1 = X L^3 + X L^2 + X L^1$$
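The decomposition of 14 × X above can be reproduced with a short Python sketch. It uses the plain binary expansion of a non-negative constant (only the + terms of the formulation); the function names `shift_terms` and `shift_add_multiply` are hypothetical.

```python
def shift_terms(c):
    """Shift amounts i for which bit i of the non-negative constant c is set."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def shift_add_multiply(c, x):
    """Compute c * x using only left shifts and additions."""
    return sum(x << i for i in shift_terms(c))


print(shift_terms(14))            # shifts used for 14 = (1110)_2
print(shift_add_multiply(14, 7))  # same value as 14 * 7
```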

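CSE Example

$$
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1
\end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
$$

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= 2X_0 + X_1 - X_2 - 2X_3 \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - 2X_1 + 2X_2 - X_3
\end{aligned}
$$

Written with the shift operator L:

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$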

CSE Example

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$

$$D_0 = (X_0 + X_3), \qquad D_1 = (X_1 - X_2)$$

$$
\begin{aligned}
Y_0 &= D_0 + X_1 + X_2 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= D_0 - X_1 - X_2 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$

CSE Example

$$
\begin{aligned}
Y_0 &= D_0 + X_1 + X_2 \\
Y_1 &= X_0 L + D_1 - X_3 L \\
Y_2 &= D_0 - X_1 - X_2 \\
Y_3 &= X_0 - D_1 L - X_3
\end{aligned}
$$

$$D_2 = (X_1 + X_2), \qquad D_3 = (X_0 - X_3)$$

$$
\begin{aligned}
Y_0 &= D_0 + D_2 \\
Y_1 &= X_0 L + D_1 - X_3 L \\
Y_2 &= D_0 - D_2 \\
Y_3 &= X_0 - D_1 L - X_3
\end{aligned}
$$

CSE Example

Original expressions (12 additions, 4 shifts):

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$

After CSE (8 additions, 2 shifts):

$$
\begin{aligned}
D_0 &= X_0 + X_3 & Y_0 &= D_0 + D_2 \\
D_1 &= X_1 - X_2 & Y_1 &= D_1 + D_3 L \\
D_2 &= X_1 + X_2 & Y_2 &= D_0 - D_2 \\
D_3 &= X_0 - X_3 & Y_3 &= D_3 - D_1 L
\end{aligned}
$$
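The equivalence of the two forms can be checked with a small Python sketch. The function names are hypothetical, and the operator L is modeled as a left shift by one bit on integers.

```python
def transform_direct(x0, x1, x2, x3):
    """Original expressions: 12 additions/subtractions and 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3


def transform_cse(x0, x1, x2, x3):
    """After CSE: 8 additions/subtractions and 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)


print(transform_direct(1, 2, 3, 4))
print(transform_cse(1, 2, 3, 4))
```

Both functions return the same tuple for any integer inputs; the second simply reuses the four shared sums and differences.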

FIR Filter Add/Shift Implementation

[Figure: Replacing constant multiplication by a multiplier block]
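One way to picture a multiplier block is a transposed-form filter in which each input sample is multiplied by every coefficient at once using only shifts and adds. The figure is not reproduced here, so the transposed structure, the integer coefficients, and the names `shift_add` and `fir_transposed` are assumptions made for this Python sketch; it also computes each product independently, without the sharing that CSE would add.

```python
def shift_add(c, x):
    """Multiply x by the integer constant c using shifts and adds only."""
    mag = abs(c)
    total = sum(x << i for i in range(mag.bit_length()) if (mag >> i) & 1)
    return -total if c < 0 else total


def fir_transposed(h, samples):
    """Transposed-form FIR with the multipliers replaced by shift_add."""
    taps = len(h)
    regs = [0] * (taps - 1)  # delay registers between the adders
    out = []
    for x in samples:
        # "Multiplier block": every h[k] * x is derived from the same input.
        products = [shift_add(c, x) for c in h]
        out.append(products[0] + (regs[0] if regs else 0))
        for i in range(taps - 1):
            carried = regs[i + 1] if i + 1 < taps - 1 else 0
            regs[i] = products[i + 1] + carried  # regs[i + 1] is still the old value
    return out


print(fir_transposed([1, 2, 3], [1, 0, 0, 1]))
```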

FIR Filter Add/Shift Implementation

[Figure: Registered adder at no additional cost]

Extracting Common Subexpressions

F1 = A + B + C + D
F2 = A + B + C + E

[Figure: Unoptimized expression trees; optimization by extracting common expression (A + B + C); optimization by extracting common expression (A + B)]
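The two extraction choices can be compared with a small Python sketch that counts two-input adders and the longest adder chain. The tree shapes below are assumptions, since the figures are not reproduced: the unoptimized trees are taken to be balanced and unshared, and the function name `adder_cost` is hypothetical.

```python
def adder_cost(outputs, share=True):
    """Return (number of two-input adders, longest adder chain).

    An expression is a signal name (str) or a pair (left, right) meaning
    left + right. With share=True, identical pairs count as one adder.
    """
    adders = []

    def depth(expr):
        if isinstance(expr, str):
            return 0
        if not (share and expr in adders):
            adders.append(expr)
        return 1 + max(depth(expr[0]), depth(expr[1]))

    longest = max(depth(expr) for expr in outputs)
    return len(adders), longest


ab = ("A", "B")
abc = (ab, "C")

unoptimized = [(ab, ("C", "D")), (ab, ("C", "E"))]
extract_abc = [(abc, "D"), (abc, "E")]
extract_ab = [(ab, ("C", "D")), (ab, ("C", "E"))]

print(adder_cost(unoptimized, share=False))
print(adder_cost(extract_abc))
print(adder_cost(extract_ab))
```

Under these assumptions, extracting (A + B + C) uses the fewest adders but lengthens the adder chain, while extracting (A + B) saves one adder and keeps the chain at two levels.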

Synchronization
  • Extra registers are needed to synchronize the intermediate values, so that new values for A, B, C, D, E, F can be read in every clock cycle.

[Figure: Calculating registers required for fastest evaluation]

Experiment Results: Resource Utilization/Performance

Filter      Add and Shift method                Xilinx Coregen (PDA)
(# taps)    Slices   LUTs    FFs     Msps       Slices   LUTs    FFs     Msps
6           264      213     509     251        524      774     1012    245
10          474      406     916     222        781      1103    1480    222
13          386      334     749     252        929      1311    1775    199
20          856      705     1650    -          1191     1631    2288    199
28          1294     1145    2508    227        1774     2544    3381    199
41          2154     1719    4161    223        2475     3642    4748    222
61          3264     2591    6303    192        3528     5335    6812    199
119         6009     4821    11551   203        6484     9754    12539   205
151         7579     6098    14611   180        8274     12525   15988   199
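The resource figures above can be summarized with a few lines of Python. The sketch assumes that the first row of each pair is the add and shift implementation and the second is the Coregen (PDA) implementation, and that "average reduction" means the mean of the per-filter reductions; neither the averaging method nor the names used here come from the slides.

```python
# taps -> (slices, LUTs, FFs), transcribed from the table above.
ADD_SHIFT = {
    6: (264, 213, 509), 10: (474, 406, 916), 13: (386, 334, 749),
    20: (856, 705, 1650), 28: (1294, 1145, 2508), 41: (2154, 1719, 4161),
    61: (3264, 2591, 6303), 119: (6009, 4821, 11551), 151: (7579, 6098, 14611),
}
COREGEN_PDA = {
    6: (524, 774, 1012), 10: (781, 1103, 1480), 13: (929, 1311, 1775),
    20: (1191, 1631, 2288), 28: (1774, 2544, 3381), 41: (2475, 3642, 4748),
    61: (3528, 5335, 6812), 119: (6484, 9754, 12539), 151: (8274, 12525, 15988),
}


def average_reduction(column):
    """Mean over all filters of 1 - add_shift / coregen for one column."""
    reductions = [1 - ADD_SHIFT[t][column] / COREGEN_PDA[t][column]
                  for t in ADD_SHIFT]
    return sum(reductions) / len(reductions)


for name, column in (("slices", 0), ("LUTs", 1), ("FFs", 2)):
    print(f"{name}: {100 * average_reduction(column):.1f}% average reduction")
```

With this averaging, the LUT figure comes out at about 58.7%, and the slice and FF figures at roughly 26% each.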

Experiment Results: Resource Utilization

[Figure: Resource utilization]

Experiment Results: Power Consumption

[Figure: Power consumption]

Creating MAC Filters Using Xilinx Coregen

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks

Filter      Add Shift Method     MAC filter
(# taps)    Slices   Msps        Slices   Msps
6           264      296         219      262
10          475      296         418      253
13          387      296         462      253
20          851      271         790      251
28          1303     305         886      251
41          2178     296         1660     243
61          3284     247         1947     242
119         6025     294         3581     241
151         7623     294         7631     215

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks

[Figure: Resource utilization]

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks

[Figure: Performance]

Conclusion/Observations
  • Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low-area, low-power, and high-speed implementations of FIR filters.
  • Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques:
      • An average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs
      • Better performance in most cases, even though our algorithm does not optimize for performance
      • Up to 50% reduction in dynamic power consumption
  • Higher performance as the filter size increases.
      • The critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders.