The Design of a Reconfigurable Continuous-Flow Mixed-Radix FFT Processor
Anthony T. Jacobson, Dean N. Truong, Bevan M. Baas
VLSI Computation Lab
University of California, Davis

Outline
1. Introduction
2. Architectural Overview
3. Address Generation
4. Twiddle Factor ROM
5. Implementation Results

Design Goals
1. The Fast Fourier Transform (FFT) is a ubiquitous DSP algorithm
2. Applications which use FFTs typically require their FFTs to have:
   1. High computational throughput
   2. Runtime reconfigurability (e.g. cognitive radio)
   3. High Signal to Quantization Noise Ratio (SQNR)

Main Features
1. 32-bit complex FFTs (16-bit real, 16-bit imag.)
2. Reconfigurable from 16- to 4k-point IFFTs/FFTs
3. Mixed-Radix
   1. Radix-4 computation with final Radix-2 stage, if necessary (for 2^n-point FFTs with odd n)
   2. Decimation in Time (DIT) addressing
4. Memory-based architecture
   1. Lower area compared to pipelined designs
   2. Continuous flow for maximum throughput
5. Area-efficient twiddle-factor ROM design
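
The mixed-radix rule above (radix-4 stages, plus one final radix-2 stage when n is odd) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stage_plan` is hypothetical, and the 16- to 4096-point range check simply mirrors the sizes listed on this slide.

```python
def stage_plan(n_points):
    """Return the radix of each FFT stage for a 2^n-point transform.

    Assumption: sizes are powers of two from 16 to 4096, as listed on the slide.
    """
    is_power_of_two = n_points > 0 and (n_points & (n_points - 1)) == 0
    if not is_power_of_two or not 16 <= n_points <= 4096:
        raise ValueError("expected a power of two between 16 and 4096")
    n = n_points.bit_length() - 1          # n_points == 2**n
    stages = [4] * (n // 2)                # each radix-4 stage consumes two bits of n
    if n % 2:                              # odd n: one leftover bit
        stages.append(2)                   # final radix-2 stage
    return stages


if __name__ == "__main__":
    for size in (16, 32, 1024, 2048, 4096):
        print(size, stage_plan(size))
```

For example, a 32-point transform (n = 5) would use two radix-4 stages followed by one radix-2 stage under this rule.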

Continuous Flow Architecture
1. 16-bit data words are passed between I/O and memory (1 word real, 1 word imag.)
2. Four 32-bit complex data are read/written by the processing element (FFT butterfly)
3. The FFT's internal memory consists of two 4k-word banks (1 word = 32 bits)
   1. 4k-word banks allow support for 4096-point FFTs
4. Each bank is partitioned into four "subbanks" for multi-read/writes from/to the processing element
   1. Four 1024-word x 32-bit SRAMs
One bank is used to read/write from I/O while the other is used to read/write from the processing element.
Each bank consists of dual-port SRAMs (one read and one write per cycle).
[Figure: dual-port SRAM with wrt_addr, wrt_data, rd_addr, and rd_data ports]
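
The two-bank arrangement can be pictured as a ping-pong buffer: in every frame period the I/O side unloads old results and loads new input in one bank while the processing element works in place on the other, and then the roles swap. The following is a behavioral sketch only, not a model of the actual hardware; the function name `run_continuous_flow` and the `process` callback (standing in for the FFT) are assumptions made for illustration.

```python
def run_continuous_flow(frames, process):
    """Simulate two banks that swap roles after every frame period.

    frames:  list of equal-length input frames (lists of words)
    process: hypothetical stand-in for the in-place FFT computation
    Returns what the I/O side unloads in each frame period.
    """
    words = len(frames[0])
    banks = [[0] * words, [0] * words]
    io_select = 0                            # which bank the I/O side owns
    outputs = []
    for frame in frames:
        io_bank = banks[io_select]
        pe_bank = banks[1 - io_select]
        outputs.append(list(io_bank))        # I/O unloads earlier results
        io_bank[:] = frame                   # I/O loads the next input frame
        pe_bank[:] = process(pe_bank)        # processing element works on the other bank
        io_select ^= 1                       # swap bank roles
    return outputs


if __name__ == "__main__":
    times_ten = lambda bank: [10 * w for w in bank]
    frames = [[1, 2], [3, 4], [5, 6], [0, 0], [0, 0]]
    print(run_continuous_flow(frames, times_ten))
```

In this sketch a frame's results appear two frame periods after it is loaded, while a new frame is accepted every period, which is the sense in which the flow is continuous.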

Block Diagram
[Figure: block diagram showing the I/O Interface, Memory, and Processing Element]

Radix-4 DIT Butterfly
1. The computational heart of the FFT is its butterfly, consisting of:
   1. Three complex multiplications
   2. Twelve complex additions
2. Execution is broken into two pipeline stages:
   1. MULT: Three 16 x 16-bit multipliers
   2. ADD: Twelve 4-input 32-bit adders (34-bit sum)
      1. ½ LSB rounding and truncation
      • 16-bit MSB final result
Radix-2 DIT butterfly can be achieved by using only A and C as inputs and setting the B and D inputs to zero:
X = A + C·W
Y = A – C·W
[Figure: radix-2 butterfly with inputs A and C, twiddle factor W (Wc), and outputs X and Y]
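
A floating-point sketch of the butterfly arithmetic may help make the radix-2 trick concrete. It assumes the textbook radix-4 DIT equations with the forward-transform sign convention and inputs ordered A, B, C, D; the function names are hypothetical, and the fixed-point rounding, truncation, and pipelining described above are not modeled.

```python
def radix4_dit_butterfly(a, b, c, d, wb=1, wc=1, wd=1):
    """Textbook radix-4 DIT butterfly on Python complex numbers."""
    # MULT stage: three complex multiplications by the twiddle factors.
    tb, tc, td = b * wb, c * wc, d * wd
    # ADD stage: each output is a four-term sum (three additions each, twelve total).
    x0 = a + tb + tc + td
    x1 = a - 1j * tb - tc + 1j * td
    x2 = a - tb + tc - td
    x3 = a + 1j * tb - tc - 1j * td
    return x0, x1, x2, x3


def radix2_via_radix4(a, c, w):
    """Radix-2 butterfly obtained by zeroing the B and D inputs."""
    x, y, _, _ = radix4_dit_butterfly(a, 0, c, 0, wc=w)
    return x, y                              # X = A + C*W, Y = A - C*W


if __name__ == "__main__":
    print(radix4_dit_butterfly(1, 2, 3, 4))  # 4-point DFT of [1, 2, 3, 4]
    print(radix2_via_radix4(1 + 1j, 2, -1j))
```

With unit twiddle factors the radix-4 butterfly reduces to a 4-point DFT, which gives a quick sanity check of the signs.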

Quantization Considerations
1. From stage to stage, intermediate butterfly results are right shifted by 2 to avoid saturation
   1. Twiddle factor constants lie within the unit circle (magnitude ≤ 1), but inputs are not restricted by this
   2. Additional configuration option to shift initial input by 1 is provided
2. Block floating point helps increase efficiency by finding the minimum sign bits (redundant bits) over all butterfly results per FFT stage
   1. For a worst-case sinusoidal input, SQNR is improved by over 200%
3. Twiddle factors are in 1.15 16-bit fixed-point format
   1. Multiply by ±1 and ±i situations are handled through bypassing
[Figure: complex plane with Re axis from -1 to 1 and Im axis from -i to i, marking the possible location of inputs (assuming 1.15 format) and the typical location of W]
[Figure: block floating point datapath from memory through a left shift (<<) and LSD block back to memory, with # sign bits = min(prev. # sign bits, current # sign bits)]
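
The block floating point idea, tracking the minimum number of redundant sign bits over a whole stage and shifting every value by that amount, can be sketched as below. This is a minimal illustration assuming 16-bit two's-complement real and imaginary parts held as Python integers; the function names are hypothetical, and where exactly the hardware applies the shift is not modeled.

```python
def redundant_sign_bits(value, width=16):
    """Count the leading bits that merely repeat the sign bit."""
    if value < 0:
        value = ~value                       # fold negatives onto non-negatives
    return (width - 1) - value.bit_length()


def block_sign_bits(block, width=16):
    """Minimum redundant sign bits over all (real, imag) pairs in a block."""
    count = width - 1
    for re, im in block:
        # Running minimum, as in: min(prev. # sign bits, current # sign bits)
        count = min(count,
                    redundant_sign_bits(re, width),
                    redundant_sign_bits(im, width))
    return count


def normalize_block(block, width=16):
    """Left-shift every value by the shared amount; return (block, shift)."""
    shift = block_sign_bits(block, width)
    return [(re << shift, im << shift) for re, im in block], shift


if __name__ == "__main__":
    data = [(100, -3), (5, 2000)]
    print(normalize_block(data))
```

Because the shift is chosen from the largest-magnitude value in the block, no value overflows the 16-bit range after normalization.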

I/O Address Generation
1. Recall that each radix-4 FFT butterfly computation has four inputs and four outputs, which necessitate multiple reads and writes
   1. Standard SRAMs only have a single read and single write port, so we break up one 4k-word bank into four 1k-word "subbanks"
2. To avoid memory conflicts, we must ensure that each butterfly input/output accesses a different subbank
   1. This requires developing an addressing scheme based on the memory location pattern of a DIT FFT
[Figure: memory address versus data index, with Data Index = {0, 1, …, 2^n − 1}]
Data indices in the above sample are after bit reversal, thus they do not represent the actual input order index of N-point (N = 2^n) data.
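
To illustrate what a conflict-free subbank assignment looks like, the sketch below uses a well-known textbook scheme: the subbank number is the sum of the index's base-4 digits modulo 4. This is not claimed to be the addressing used in this processor, which the slides do not spell out; it also ignores the bit-reversed ordering and the final radix-2 stage. All function names are hypothetical.

```python
def subbank(index, num_digits):
    """Subbank = sum of the base-4 digits of the index, modulo 4."""
    return sum((index >> (2 * k)) & 3 for k in range(num_digits)) % 4


def subbank_address(index):
    """Address inside the subbank: the index with its lowest base-4 digit dropped."""
    return index >> 2


def butterfly_indices(stage, butterfly):
    """Four data indices of one radix-4 butterfly.

    They differ only in base-4 digit number `stage`, which takes the values 0..3.
    """
    low = butterfly & ((1 << (2 * stage)) - 1)
    high = butterfly >> (2 * stage)
    return [(high << (2 * (stage + 1))) | (m << (2 * stage)) | low
            for m in range(4)]


if __name__ == "__main__":
    num_digits = 3                           # 64-point example: three radix-4 stages
    for stage in range(num_digits):
        idx = butterfly_indices(stage, 5)
        print(stage, idx, [subbank(i, num_digits) for i in idx])
```

Since the four indices of a butterfly differ in exactly one base-4 digit, their digit sums differ modulo 4, so each butterfly touches all four subbanks exactly once in every stage.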

Butterfly Address Generation
1. FFT/IFFT is controlled by a primary address counter, which is then broken up into "group" and "butterfly" counters (g_r…g_0 and b_s…b_0, respectively)
   1. For radix-4, s and r is equal to log_4(N) for an N-point data set
2. The final butterfly addresses are determined by the two counters
3. The twiddle factor ROM address is determined solely by the butterfly counter
base = {(0 – g_r g_0 – … – b_1 b_0), b_s…b_2}
offset = function of memory subbank # and butterfly number

Twiddle Factor ROM
1. Radix-4 twiddle factor:
   1. y = index, θ_N = 2π/N
   2. θ_y ≡ yθ_N
2. W-ROM contains 512 32-bit complex values for θ_y in [0, π/4]
   1. All other factors can be obtained from symmetry and special relationships
   2. The upper three bits of the address decode the ROM outputs to their correct octants
Wc Index = 2(Wb Index)
Wd Index = 3(Wb Index) = Wc Index + Wb Index
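
The octant symmetry can be sketched as follows: only first-octant cosine and sine values are stored, and the upper three bits of a 12-bit index select how they are swapped and negated. Several things here are assumptions for illustration: the forward-transform convention W = exp(-jθ_y), floating-point entries instead of 1.15 fixed point, N fixed at 4096, and a table with 513 entries (one more than the 512 described) so that the π/4 boundary needs no special case. The function names are hypothetical.

```python
import cmath
import math

N_MAX = 4096                 # largest supported FFT size
OCTANT = N_MAX // 8          # 512 angle steps per octant

# First-octant table: (cos, sin) for angles 0 .. pi/4 inclusive.
ROM = [(math.cos(2 * math.pi * k / N_MAX), math.sin(2 * math.pi * k / N_MAX))
       for k in range(OCTANT + 1)]

# (cos, sin) of the full angle in terms of the stored (c, s), per octant.
OCTANT_MAP = [
    lambda c, s: (c, s), lambda c, s: (s, c),
    lambda c, s: (-s, c), lambda c, s: (-c, s),
    lambda c, s: (-c, -s), lambda c, s: (-s, -c),
    lambda c, s: (s, -c), lambda c, s: (c, -s),
]


def twiddle(y):
    """Return exp(-j*2*pi*y/N_MAX) using only first-octant table entries."""
    y %= N_MAX
    octant, low = divmod(y, OCTANT)          # upper three bits, lower nine bits
    k = low if octant % 2 == 0 else OCTANT - low   # odd octants are mirrored
    cos_t, sin_t = OCTANT_MAP[octant](*ROM[k])
    return complex(cos_t, -sin_t)


def radix4_twiddles(wb_index):
    """Wb, Wc, Wd with Wc index = 2 * Wb index and Wd index = Wc index + Wb index."""
    wc_index = 2 * wb_index
    wd_index = wc_index + wb_index
    return twiddle(wb_index), twiddle(wc_index), twiddle(wd_index)


if __name__ == "__main__":
    worst = max(abs(twiddle(y) - cmath.exp(-2j * math.pi * y / N_MAX))
                for y in range(N_MAX))
    print("largest deviation from cmath.exp:", worst)
```

The index relationships mean only the Wb index has to be generated; the other two follow from a shift and an add.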

Implementation Results
1. Fabricated within a 167-processor array
2. ST Microelectronics LP 65 nm CMOS
3. Area: 1 mm²
4. Initial results:
   1. Fully functional up to 866 MHz at 1.3 V
   2. Average power at this operating point: 35 mW

Conclusion
1. Runtime configurable
   1. Up to 4096-point FFT/IFFTs
2. 32-bit fixed-point complex data
3. High SQNR across all modes:
   1. ~80 dB for 64-point
   2. ~74 dB for 1024-point
4. High throughput at 866 MHz:
   1. 67 ns to compute a 64-point FFT
      1. Over 950 Msamples per second
   2. 1.5 μs to compute a 1024-point FFT
      1. Over 680 Msamples per second
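
The throughput figures follow directly from the quoted transform times, as the short calculation below shows. The helper names are hypothetical, and the cycle counts are simply the quoted times multiplied by the 866 MHz clock, not measured values.

```python
CLOCK_HZ = 866e6             # operating frequency quoted on the slides


def msamples_per_second(n_points, seconds):
    """Samples processed per second, in millions."""
    return n_points / seconds / 1e6


def approx_cycles(seconds, clock_hz=CLOCK_HZ):
    """Clock cycles that fit in the given time, rounded to the nearest cycle."""
    return round(seconds * clock_hz)


if __name__ == "__main__":
    for n_points, seconds in ((64, 67e-9), (1024, 1.5e-6)):
        rate = msamples_per_second(n_points, seconds)
        print(f"{n_points}-point: {rate:.1f} Msamples/s, "
              f"about {approx_cycles(seconds)} cycles")
```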

Acknowledgements
1. ST Microelectronics
2. NSF Grant 430090 and CAREER award 546907
3. Intel
4. SRC Grant 1598 and CSR Grant 1659
5. Intellasys
6. UC Micro
7. SEM
8. J.-P. Schoellkopf, K. Torki, S. Dumont, Y.-P. Cheng, R. Krishnamurthy and M. Anders