Channel Equalization in MIMO Downlink and ASIP Architectures

Predrag Radosavljevic, Rice University, March 29, 2004

Wireless System
- Downlink transmission in a MIMO wireless system
- Physical layer of the mobile handset
- Linear channel equalization
- Hardware implementation using ASIP architectures

Motivation: MIMO Downlink and Equalization
- MIMO: high data rate and high spectral efficiency
  - Interference from each transmit antenna introduces multiple access interference (MAI)
  - DS-CDMA signals in a multipath environment: user orthogonality is destroyed, which causes inter-symbol interference (ISI)
- Solution: powerful channel equalization that mitigates ISI and MAI and restores user orthogonality
- Chip-level channel equalization based on the iterative conjugate gradient (CG) and adaptive least mean square (LMS) algorithms

Motivation: ASIP Hardware Implementation
- Future generations of mobile handsets: high speed, flexibility, and low power
- Traditional approaches: ASICs and DSP processors
- ASIC:
  - No flexibility: a family of ASICs is needed
  - High probability of design errors, high design cost
- DSP:
  - Not optimized for a given application
  - Often limited instruction- and data-level parallelism
- ASIP:
  - Tradeoff between the efficiency of ASICs and the flexibility of DSPs

Thesis Contributions
- Channel equalization in a broad range of environments
  - 16-bit fixed-point implementation
- Flexible ASIP architecture design
  - Same hardware, different equalization (slow/fast fading, CG/LMS)
  - Extension of the ASIP instruction set with application-specific operations
- Customized architecture:
  - Real-time requirements for the 1xEV-DV standard (1.2288 Mc/s)
  - Reasonable clock frequency (up to 150 MHz) and power dissipation
- Automatic hardware design: from C to gate level
  - Hardware synthesis for FPGA and CMOS libraries
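
The real-time requirement can be read as a per-chip cycle budget. The short calculation below is only a back-of-the-envelope sketch: it assumes the processor runs at the upper clock figure quoted on the slide and must finish all work for one chip before the next chip arrives. The variable names are illustrative.

```python
# Cycle budget per chip: how many clock cycles the processor may spend
# on each received chip and still keep up with the chip rate.
CHIP_RATE_HZ = 1.2288e6   # 1xEV-DV chip rate from the slide
CLOCK_HZ = 150e6          # upper clock frequency quoted on the slide

cycles_per_chip = CLOCK_HZ / CHIP_RATE_HZ
print(f"{cycles_per_chip:.1f} cycles per chip")
```

Under these assumptions the budget is roughly 122 cycles per chip, which is why instruction- and data-level parallelism matter for the architecture.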

Outline
- Data model
- Channel equalization
- ASIP hardware implementations
- Conclusions and future work

Data Model: Transmission Side
- Alternating symbols over transmit antennas
- Spreading: orthogonality between users
- Scrambling: reduction of inter-cell interference
- Transmission over multipath correlated channels
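
To make the spreading and scrambling steps concrete, here is a minimal single-antenna sketch. It assumes Walsh-Hadamard spreading codes, BPSK symbols, a real-valued +/-1 scrambling sequence, and an ideal channel; none of these specifics come from the slides, and the function names are hypothetical.

```python
import random

def walsh_codes(n):
    """Return n Walsh-Hadamard codes of length n (n must be a power of two)."""
    codes = [[1]]
    while len(codes) < n:
        codes = ([row + row for row in codes] +
                 [row + [-c for c in row] for row in codes])
    return codes

def spread(symbols, code):
    """Repeat each symbol over the chips of one spreading code."""
    return [s * c for s in symbols for c in code]

def despread(chips, code):
    """Correlate each chip block with the code and normalize by its length."""
    g = len(code)
    return [sum(chips[i + k] * code[k] for k in range(g)) / g
            for i in range(0, len(chips), g)]

random.seed(0)
G = 8                                      # spreading factor (illustrative)
codes = walsh_codes(G)
user_symbols = [[1, -1, 1], [-1, -1, 1]]   # BPSK symbols for two users

# Sum the users' spread signals, then apply a common +/-1 scrambling sequence.
n_chips = G * len(user_symbols[0])
scramble = [random.choice([1, -1]) for _ in range(n_chips)]
spread_signals = [spread(sym, codes[u]) for u, sym in enumerate(user_symbols)]
tx = [sum(sig[i] for sig in spread_signals) * scramble[i] for i in range(n_chips)]

# Ideal channel: descramble, then despread with each user's code.
rx = [t * s for t, s in zip(tx, scramble)]
recovered = [despread(rx, codes[u]) for u in range(len(user_symbols))]
print(recovered)
```

With an ideal channel the codes stay orthogonal and each user's symbols come back exactly. A multipath channel breaks that orthogonality, which is the problem the equalizer addresses.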

Receiver Implementations
- Candidates: RAKE receiver, multiuser detector, Kalman filter, LMMSE equalization
- RAKE:
  - Deteriorated performance in highly loaded systems
  - Not appropriate for MIMO environments
- Multiuser detectors:
  - High computational complexity
  - Limited knowledge about the activity of other users
- Kalman filter:
  - Optimal solution in the MSE sense
  - Prohibitive complexity in MIMO environments

LMMSE Equalization
- Lower complexity than other receivers
- Independent of the number of users
- Iterative solutions
- Good performance in highly scattered environments
- [Figure: LMMSE receiver]

LMMSE Equalization
- Linear system to be solved: [equation not preserved]
- Covariance: block Toeplitz and positive definite
  - A and B: Toeplitz Hermitian matrices
  - C: Toeplitz matrix

LMMSE Approaches
- LMMSE solution: [equation not preserved]
- Cholesky decomposition
  - More complex hardware primitives
- Conjugate Gradient (CG)
  - Iterative solution, fast convergence
  - Block algorithm: modifications for fast fading channels
- Least Mean Square (LMS)
  - Adaptive algorithm
  - Sensitivity to the learning step
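
The CG option can be illustrated with the textbook conjugate gradient iteration for a Hermitian positive definite system. This is a generic sketch, not the block algorithm from the thesis: the small Hermitian Toeplitz matrix and right-hand side are made-up stand-ins for the covariance matrix and channel vector, and the helper names are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def vdot(x, y):
    """Inner product x^H y."""
    return sum(a.conjugate() * b for a, b in zip(x, y))

def conjugate_gradient(A, b, iterations=None, tol=1e-24):
    """Solve A x = b for Hermitian positive definite A by conjugate gradient."""
    n = len(b)
    x = [0j] * n
    r = [complex(v) for v in b]     # residual b - A x for the start x = 0
    p = list(r)                     # first search direction
    rs_old = vdot(r, r).real
    for _ in range(iterations or n):
        if rs_old < tol:            # squared residual norm already negligible
            break
        Ap = matvec(A, p)
        alpha = rs_old / vdot(p, Ap).real
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = vdot(r, r).real
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Made-up 3x3 Hermitian Toeplitz, diagonally dominant (hence positive definite).
A = [[4, 1 - 1j, 0.5j],
     [1 + 1j, 4, 1 - 1j],
     [-0.5j, 1 + 1j, 4]]
b = [1, 0, 0]
w = conjugate_gradient(A, b)
residual = max(abs(ax - bi) for ax, bi in zip(matvec(A, w), b))
print(residual)
```

Each iteration needs only one matrix-vector product plus a few inner products and vector updates, which maps onto multiply-accumulate hardware more directly than the divisions and square roots of a Cholesky factorization.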

Equalization in Time-Varying Channels
- Spatially correlated, frequency-selective (multipath) fading channels
- Data rate: 1.2288 Mchips/s
- Antenna correlation:
  - Base station: 50.18%
  - Mobile: 43.99%

Channel Equalization: CG Algorithm
- N samples: 4096 in slow fading channels

CG Equalization in Vehicular A, 30 km/h
- Sliding window (SW) approach
- Faster variations: more frequent update of filter coefficients
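
One way to picture a sliding window is as a sequence of overlapping sample ranges, each producing a fresh set of filter coefficients. The window length and step below are illustrative assumptions; the slides do not give the values used in the thesis.

```python
def sliding_windows(num_samples, window, step):
    """Yield (start, stop) index pairs of overlapping estimation windows."""
    for start in range(0, num_samples - window + 1, step):
        yield start, start + window

# Illustrative numbers only: the window slides by a quarter of its length,
# so the filter is refreshed four times per window instead of once.
windows = list(sliding_windows(num_samples=4096, window=2048, step=512))
print(windows)
```

A smaller step gives more frequent coefficient updates at the price of more filter-update computations per second.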

CG Equalization: Velocity of 120 km/h
- Multiple sub-blocks instead of two blocks
- Partial channel estimation for each sub-block
- Apply weights for global channel estimation: [equation not preserved]
  - Weights are adjusted according to the channel variations
  - The faster the channel fades, the faster the weighting coefficients drop to 0
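
The weighting formula itself is not preserved in this transcript, so the following is only a guess at the general idea: combine the per-sub-block estimates with weights that decay for older sub-blocks. The geometric decay, the `forgetting` parameter, and the sample values are all assumptions made for illustration.

```python
def combine_estimates(partial_estimates, forgetting):
    """Weighted average of per-sub-block channel estimates, newest last."""
    k = len(partial_estimates)
    # Newest sub-block gets weight 1; older ones decay geometrically.
    weights = [forgetting ** (k - 1 - i) for i in range(k)]
    total = sum(weights)
    taps = len(partial_estimates[0])
    return [sum(w * est[t] for w, est in zip(weights, partial_estimates)) / total
            for t in range(taps)]

# Two-tap channel estimates from three consecutive sub-blocks (made-up values).
partials = [[1.0 + 0.0j, 0.5j], [0.8 + 0.1j, 0.4j], [0.6 + 0.2j, 0.3j]]
slow = combine_estimates(partials, forgetting=0.9)   # slow fading: long memory
fast = combine_estimates(partials, forgetting=0.1)   # fast fading: weights drop quickly
print(slow, fast)
```

With a small `forgetting` value the global estimate follows the most recent sub-block closely, which matches the slide's remark that the coefficients drop to 0 faster when the channel fades faster.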

Architectural Alternative: LMS Equalization
- Adaptive LMS: [equations not preserved]
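
Since the LMS equations did not survive, here is the standard complex LMS recursion as a reference sketch: filter output y = w^H x, error e = d - y, update w <- w + mu * x * conj(e). The two-path channel, pilot sequence, tap count, and step size are hypothetical choices, not values from the thesis.

```python
import random

def lms_equalize(received, reference, num_taps, step):
    """Adapt FIR equalizer taps so the filter output tracks the reference."""
    w = [0j] * num_taps
    errors = []
    for n in range(num_taps - 1, len(received)):
        # Most recent sample first: x[n], x[n-1], ..., x[n-num_taps+1]
        x = [received[n - k] for k in range(num_taps)]
        y = sum(wk.conjugate() * xk for wk, xk in zip(w, x))   # y = w^H x
        e = reference[n] - y
        # LMS update: w <- w + mu * x * conj(e)
        w = [wk + step * xk * e.conjugate() for wk, xk in zip(w, x)]
        errors.append(abs(e))
    return w, errors

random.seed(1)
symbols = [random.choice([1, -1]) for _ in range(2000)]   # known pilot chips
# Hypothetical two-path channel: r[n] = s[n] + 0.4 s[n-1]
received = [symbols[n] + (0.4 * symbols[n - 1] if n > 0 else 0)
            for n in range(len(symbols))]
taps, errors = lms_equalize(received, symbols, num_taps=5, step=0.02)
print(sum(errors[:50]) / 50, sum(errors[-50:]) / 50)
```

The step size trades convergence speed against steady-state error and stability, which is the sensitivity to the learning step noted for LMS earlier in the deck.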

Performance: Slow Fading Environments
- [Plots: Pedestrian A, 3 km/h; Pedestrian B, 10 km/h]
- From 32-bit floating point to 16-bit fixed point
- Control of quantization error
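
The move from floating point to 16-bit fixed point can be sketched with a simple quantizer. The Q15 format (15 fractional bits) with saturation is an assumption here; the slides say only that the implementation is 16-bit fixed point.

```python
Q = 15                      # fractional bits (Q15 format, an assumption)
SCALE = 1 << Q
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def to_fixed(value):
    """Quantize a float in [-1, 1) to a saturated 16-bit integer."""
    return max(INT16_MIN, min(INT16_MAX, round(value * SCALE)))

def to_float(fixed):
    """Convert a Q15 integer back to a float."""
    return fixed / SCALE

samples = [0.5, -0.123456, 0.99, -1.0, 0.3333333]
errors = [abs(s - to_float(to_fixed(s))) for s in samples]
print(max(errors))
```

As long as no value saturates, rounding keeps the error of each sample within half a least significant bit (2^-16 in this format). Controlling quantization error in a full equalizer then comes down to scaling intermediate results so they neither overflow nor lose too many low-order bits.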

Performance: Vehicular A, 30 km/h
- CG with sliding window (CG-SW): improvement over basic CG

CG-SW Approach: Fixed Point
- [Plot: Vehicular A, 30 km/h]
- 32-bit floating point versus 16-bit fixed point
- About 1% BER difference

Performance: Velocity of 120 km/h
- [Plots: Pedestrian A, 120 km/h; Vehicular A, 120 km/h]
- CG with sliding window and weight averaging (CG-SW-WA)
- CG-SW-WA with different numbers of sub-blocks
- Performance improves when weights are applied

Computational Complexity
- Number of operations per chip in 1 second
- CG filter update is less complex
  - Reason: block-level filter update algorithm

Directions for Architecture Implementation
- Equalization in different environments
  - Block CG, adaptive LMS for slow fading environments
  - Modifications of CG for fast fading channels
- Different computational complexity and amount of parallelism
- Flexible hardware for different equalizations and CG modifications
  - Programmable architecture
  - Application specific

ASIP Architecture for Equalization: Required Features
- Flexible architecture able to operate in different channel environments
  - Slow/fast fading
  - Low/high scattering
- Architecture customization
  - Implementation of application-specific operations
- Instruction- and data-level parallelism
  - Fast execution of complex algorithms
- Automatic hardware-software co-design
  - Fast processor design starting from the C/C++ code of the application

ASIP Architecture Based on TTA (Transport Triggered Architecture)
- Flexible architecture
  - No limitations on adding new FUs, buses, or registers
- Customizable architecture
  - Implementation of Special Function Units (SFUs)
- Instruction- and data-level parallelism
  - VLIW architecture principle
  - Efficient and parallel data flow
- Fast processor design
  - Automatic search for the best processor
  - VHDL processor representation

General Structure of TTA
- Transport of operands triggers the appropriate operation as a side effect
- Only one instruction: the "move" instruction
- 32-bit architecture
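
The transport-triggered idea can be shown with a toy interpreter in which the program consists only of moves and an addition happens as a side effect of writing to a unit's trigger port. This is a teaching sketch, not the MOVE framework: the port names, the `AddUnit` class, and the `run` function are invented for illustration, and the moves execute one at a time, whereas a real TTA schedules several moves per instruction, one per bus.

```python
class AddUnit:
    """Toy function unit with an operand port, a trigger port and a result port."""
    def __init__(self):
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "operand":
            self.operand = value                  # plain latch, nothing executes
        elif port == "trigger":
            self.result = self.operand + value    # the move itself fires the add
        else:
            raise ValueError(f"unknown port {port!r}")

def run(moves, registers, units):
    """Execute a list of (source, destination) moves one after another."""
    for src, dst in moves:
        # Resolve the source: literal, register, or a unit's result port.
        if isinstance(src, int):
            value = src
        elif src in registers:
            value = registers[src]
        else:
            unit, _ = src.split(".")
            value = units[unit].result
        # Resolve the destination: register or a unit's input port.
        if dst in registers:
            registers[dst] = value
        else:
            unit, port = dst.split(".")
            units[unit].write(port, value)
    return registers

registers = {"r1": 7, "r2": 35, "r3": 0}
units = {"add": AddUnit()}
program = [("r1", "add.operand"), ("r2", "add.trigger"), ("add.result", "r3")]
run(program, registers, units)
print(registers["r3"])
```

There is no "add" opcode anywhere in the program; the operation is selected purely by which port the data is moved to.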

TTA Design Flow: MOVE Tool
- Design space exploration for the optimal architecture

Customization of ASIP
- Implementation of application-specific operations
  - User-defined Special Function Units (SFUs)
  - Sacrificing architecture generality for optimization and performance improvement
- Designed SFUs:
  - Real multiplication with shifting ability
  - Complex multiplication with shifting
  - Sub-word arithmetic operations
  - Sign-test and add/subtract

SFU: Complex Multiplication
- Reduction of data transports between FUs
  - Fewer buses and a smaller interconnection network
  - Smaller instruction word
- Instruction and data parallelism is placed inside CXMUL
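
A software model helps show what a fused complex-multiply-with-shift unit saves. The sketch below assumes integer (re, im) operands in Q15 and an arithmetic right shift of the full-precision product; the exact CXMUL semantics are not given on the slide, so the function name and behavior are illustrative.

```python
def cxmul_shift(a, b, shift):
    """Multiply complex integers a=(re, im) and b=(re, im), then shift right.

    Four real multiplications and two additions happen in one call, which is
    the work a fused complex-multiply unit would absorb.
    """
    a_re, a_im = a
    b_re, b_im = b
    re = a_re * b_re - a_im * b_im
    im = a_re * b_im + a_im * b_re
    # Arithmetic right shift rescales the product back to the operand format.
    return re >> shift, im >> shift

# Q15 operands: (0.5 + 0.25j) * (0.25 - 0.5j) = 0.25 - 0.1875j
product = cxmul_shift((16384, 8192), (8192, -16384), 15)
print(product)
```

Built from separate multipliers, adders, and shifters, the same operation would need its intermediate results moved between units over the buses. Folding it into one unit leaves only the operand and result transports, which is consistent with the bus and instruction-word savings listed above.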

Performance Improvement with SFUs
- Bus reduction of 50%
- Instruction word length reduction of about 50%

TTA Processors for MIMO Equalization
1. Two co-processors (CG equalization)
   - Co-processor for updating equalizer coefficients
   - Co-processor for filtering and user detection
2. Single processor for all parts of the equalization algorithm (CG/LMS equalization)
   - Identical architectures for slow and fast fading environments

Single Processor vs. Two Coprocessors
- Single processor
  - Smaller area and power dissipation
  - Higher clock frequency

Processor Flexibility
- Identical customized processor for a broad range of channel environments
- Identical processor for LMS and CG equalization

Example of Designed Processor
- [Figure: coprocessor for CG filter update]

Hardware Synthesis Design Flow
- MOVEGen: generates VHDL representation of the processor core
- Xilinx tools for fast FPGA prototyping
- Mentor Graphics tools for CMOS gate-level design

VHDL Template of TTA Processor
- Automatic VHDL generation of processor core, control, and interconnection
- FUs, SFUs, peripherals: pre-designed or defined by the user

MoveProc Synthesis on Xilinx FPGA
- CG/LMS equalizer including user detection
  - No SFUs, 32 buses
- Xilinx FPGA part: XC2V8000
  - Slices: 38,757 out of 46,592
  - BRAMs: 148 out of 168
  - IOBs: 263 out of 1108
  - MULT18X18s: 24 out of 168

MoveProc Synthesis on Xilinx FPGA
- Customized CG/LMS equalizer including user detection
  - With SFUs, 16 buses
- Xilinx FPGA part: XC2V6000
  - Slices: 21,126 out of 33,792
  - BRAMs: 107 out of 144
  - IOBs: 229 out of 1104
  - MULT18X18s: 11 out of 144
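
The two synthesis reports are easier to compare as utilization percentages. The snippet below just does that arithmetic on the figures from the two slides; the dictionary layout and names are illustrative.

```python
# Resource counts copied from the two synthesis slides: (used, available).
no_sfu = {"slices": (38757, 46592), "BRAMs": (148, 168), "MULT18X18s": (24, 168)}
with_sfu = {"slices": (21126, 33792), "BRAMs": (107, 144), "MULT18X18s": (11, 144)}

def utilization(report):
    """Percentage of each resource type that the design occupies."""
    return {name: 100.0 * used / total for name, (used, total) in report.items()}

util_no_sfu = utilization(no_sfu)
util_with_sfu = utilization(with_sfu)
# Absolute slice saving, independent of which FPGA part is targeted.
slice_saving = 100.0 * (1 - with_sfu["slices"][0] / no_sfu["slices"][0])

for name in no_sfu:
    print(f"{name}: {util_no_sfu[name]:.1f}% -> {util_with_sfu[name]:.1f}%")
print(f"slice count reduced by {slice_saving:.1f}%")
```

Because the two designs target different parts, the percentages are not directly comparable; the absolute slice count is the fairer measure, and it drops by roughly 45% for the customized processor.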

Gate-Level CMOS Synthesis
- Mentor Graphics tools
  - 0.5 µm CMOS library
- Customized CG/LMS equalizer including user detection (with SFUs)
- Synthesis estimate of processor core: 182,887 gates

Conclusions
- Equalization algorithms for a broad range of channel environments
  - Slow fading: CG/LMS
  - Fast fading: modifications of basic CG equalization
- ASIP architecture design based on TTA
  - Same architecture, different equalization algorithms
  - Optimization with application-specific operations
  - Reasonable frequency and power dissipation for the 3GPP data rate
- Fast processor design
  - VHDL representation of the optimal processor
  - FPGA synthesis and CMOS gate-level synthesis

Future Work
- Processor layout synthesis
  - IC Station software tool from Mentor Graphics
  - Precise timing, area, and power analysis
- Implementation of hybrid word length
  - Reduced precision for the filter application part
- Implementation on a C5x DSP for comparison

Acknowledgements
- Thanks to:
  - Professor Cavallaro
  - Dr. De Baynast
  - Professor Aazhang
  - Dr. Dabak
  - Dr. Sabharwal
  - Texas Instruments
  - Nokia