Channel Equalization in MIMO Downlink and ASIP Architectures

  • Slides: 41

Predrag Radosavljevic, Rice University, March 29, 2004

Wireless System
  • Downlink transmission in MIMO wireless system
  • Physical layer of the mobile handset
  • Linear channel equalization
  • Hardware implementation using ASIP architectures

Motivation: MIMO Downlink and Equalization
  • MIMO: high data rate and high spectral efficiency
  • Interference from each transmit antenna introduces MAI
  • DS-CDMA signals in multipath environment: user orthogonality is destroyed, which causes ISI
  • Solution: powerful channel equalization to mitigate ISI and MAI and restore user orthogonality
  • Chip-level channel equalization based on iterative CG and adaptive LMS algorithms

Motivation: ASIP Hardware Implementation
  • Future generations of mobile handsets: high speed, flexibility and low power
  • Traditional approaches: ASIC and DSP processors
  • ASIC:
      • No flexibility: a family of ASICs is needed
      • High probability of design errors, high design cost
  • DSP:
      • Not optimized for a given application
      • Often limited instruction and data level parallelism
  • ASIP:
      • Tradeoff between efficiency of ASICs and flexibility of DSPs

Thesis Contributions
  • Channel equalization in broad range of environments
      • 16-bit fixed point implementation
  • Flexible ASIP architecture design
      • Same hardware - different equalization (slow/fast fading, CG/LMS)
      • Extension of ASIP instruction set with application-specific operations
  • Customized architecture:
      • Real-time requirements for 1xEV-DV standard (1.2288 Mc/s)
      • Reasonable clock frequency (up to 150 MHz) and power dissipation
  • Automatic hardware design: from C to gate level
      • Hardware synthesis for FPGA and CMOS libraries
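The real-time requirement can be read as a cycle budget: the number of processor clock cycles available per received chip. The following is a minimal sketch of that arithmetic using only the two figures quoted on the slide (1.2288 Mc/s and the 150 MHz upper clock bound); the function and variable names are illustrative and do not come from the thesis.

```python
# Cycle budget per chip: clock cycles available in one chip period.
CHIP_RATE_HZ = 1.2288e6   # 1xEV-DV chip rate quoted on the slide
CLOCK_HZ = 150e6          # upper clock frequency quoted on the slide


def cycles_per_chip(clock_hz: float, chip_rate_hz: float) -> float:
    """Return how many clock cycles elapse during one chip period."""
    return clock_hz / chip_rate_hz


budget = cycles_per_chip(CLOCK_HZ, CHIP_RATE_HZ)
print(f"{budget:.1f} cycles per chip")
```

At the upper clock bound this gives roughly 122 cycles per chip, which all processing for one chip (filtering, detection and the amortized share of the filter update) would have to fit into.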

Outline
  • Data model
  • Channel equalization
  • ASIP hardware implementations
  • Conclusions and future work

Data Model: Transmission Side
  • Alternating symbols over transmit antennas
  • Spreading: orthogonality between users
  • Scrambling: reduction of inter-cell interference
  • Transmission over multipath correlated channels

Receiver Implementations
  • RAKE receiver, multiuser detector, Kalman filter, LMMSE equalization
  • RAKE:
      • Deteriorated performance in highly loaded system
      • Not appropriate for MIMO environments
  • Multiuser detectors:
      • High computational complexity
      • Limited knowledge about the activity of other users
  • Kalman filter:
      • Optimal solution in the sense of MSE
      • Prohibitive complexity in MIMO environments

LMMSE Equalization
  • Lower complexity in comparison with other receivers
      • Independent of the number of users
      • Iterative solutions
  • Good performance in highly scattered environments
  Figure: LMMSE receiver

LMMSE Equalization
  • Linear system to be solved:
  • Covariance: block Toeplitz and positive definite
      • A and B: Toeplitz Hermitian matrices
      • C: Toeplitz matrix
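One form of the linear system that is consistent with these bullets is sketched below. It assumes two receive antennas and the usual chip-level LMMSE formulation; the symbols R (received-signal covariance), w (equalizer taps) and h (channel vector) are assumptions and are not taken from the slide. Only the block names A, B and C come from the source.

```latex
\mathbf{R}\,\mathbf{w} = \mathbf{h},
\qquad
\mathbf{R} =
\begin{bmatrix}
\mathbf{A} & \mathbf{C} \\
\mathbf{C}^{H} & \mathbf{B}
\end{bmatrix},
\qquad
\mathbf{A} = \mathbf{A}^{H},\quad
\mathbf{B} = \mathbf{B}^{H}
```

With A and B Hermitian, the block matrix R is itself Hermitian, which is what allows Cholesky or Conjugate Gradient solvers to be applied.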

LMMSE Approaches
  • LMMSE solution:
  • Cholesky decomposition
      • More complex hardware primitives
  • Conjugate Gradient (CG)
      • Iterative solution, fast convergence
      • Block algorithm – modifications for fast fading channels
  • Least Mean Square (LMS)
      • Adaptive algorithm
      • Sensitivity to learning step

Equalization in Time-Varying Channels
  • Spatially correlated, frequency selective (multipath), fading channels
  • Data rate: 1.2288 Mchips/s
  • Antenna correlation:
      • Base station: 50.18%
      • Mobile: 43.99%

Channel Equalization: CG Algorithm
  • N samples: 4096 in slow fading channels
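For orientation, here is a minimal sketch of the textbook Conjugate Gradient iteration for a Hermitian positive definite system R w = h, written in plain Python with complex numbers. It is the generic algorithm, not the fixed-point block implementation from the thesis; the 2×2 matrix, the right-hand side and all names are hypothetical illustration values.

```python
from typing import List

Vector = List[complex]
Matrix = List[List[complex]]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Matrix-vector product a @ x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def inner(x: Vector, y: Vector) -> complex:
    """Hermitian inner product x^H y."""
    return sum(xi.conjugate() * yi for xi, yi in zip(x, y))


def conjugate_gradient(a: Matrix, b: Vector, iterations: int = 10,
                       tol: float = 1e-12) -> Vector:
    """Solve a @ w = b for Hermitian positive definite a."""
    w = [0j] * len(b)
    r = list(b)                      # residual b - a @ w for the zero start vector
    p = list(r)                      # first search direction
    rs_old = inner(r, r).real
    for _ in range(iterations):
        if rs_old < tol:             # residual energy small enough: stop
            break
        ap = matvec(a, p)
        alpha = rs_old / inner(p, ap).real   # p^H a p is real for Hermitian a
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = inner(r, r).real
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return w


# Hypothetical Toeplitz Hermitian, positive definite covariance and channel vector.
R = [[4 + 0j, 1 + 1j],
     [1 - 1j, 4 + 0j]]
h = [1 + 0j, 2j]
w = conjugate_gradient(R, h)
residual = max(abs(x - y) for x, y in zip(matvec(R, w), h))
```

Each iteration needs one matrix-vector product plus a few inner products and vector updates, which is why the method maps onto multiply-accumulate hardware without the divisions and square roots of a Cholesky factorization dominating the work.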

CG Equalization in Vehicular A, 30 km/h
  • Sliding Window (SW) approach
  • Faster variations: more frequent update of filter coefficients

CG Equalization: Velocity of 120 km/h
  • Multiple sub-blocks instead of two blocks
  • Partial channel estimation for each sub-block
  • Apply weights for global channel estimation:
      • Weights are adjusted according to the channel variations
      • The faster the channel fading, the faster the coefficients drop to 0

Architectural Alternative: LMS Equalization
  • Adaptive LMS:
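The adaptive LMS update can be sketched as follows. This is the standard complex LMS recursion (filter output y = Σ w_k x_k, error e = d − y, update w_k ← w_k + μ e x_k*), not necessarily the exact variant used in the thesis. The synthetic test signal, the `reference_taps` used to generate the desired sequence, and the step size are all hypothetical; the example only checks that the recursion converges to known taps in a noise-free setting.

```python
import math
import random
from typing import List


def lms_adapt(inputs: List[complex], desired: List[complex],
              num_taps: int, step: float) -> List[complex]:
    """Run complex LMS over the whole sequence and return the final taps."""
    w = [0j] * num_taps
    for n in range(num_taps - 1, len(inputs)):
        x = [inputs[n - k] for k in range(num_taps)]    # most recent sample first
        y = sum(wk * xk for wk, xk in zip(w, x))        # filter output
        e = desired[n] - y                              # error against the reference
        w = [wk + step * e * xk.conjugate() for wk, xk in zip(w, x)]
    return w


# Synthetic, noise-free sanity check with unit-power QPSK-like samples.
random.seed(0)
levels = (-1.0, 1.0)
chips = [complex(random.choice(levels), random.choice(levels)) / math.sqrt(2)
         for _ in range(2000)]
reference_taps = [0.8 + 0.1j, -0.3j, 0.2 + 0j]          # hypothetical target filter
desired = [0j] * len(chips)
for n in range(len(reference_taps) - 1, len(chips)):
    desired[n] = sum(reference_taps[k] * chips[n - k]
                     for k in range(len(reference_taps)))

taps = lms_adapt(chips, desired, num_taps=3, step=0.05)
max_tap_error = max(abs(t - r) for t, r in zip(taps, reference_taps))
```

The step size μ is the "learning step" whose sensitivity the LMMSE Approaches slide mentions: too small and the taps lag behind the channel, too large and the recursion becomes unstable.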

Performance: Slow Fading Environments
  Plots: Pedestrian A – 3 km/h; Pedestrian B – 10 km/h
  • From 32-bit floating point to 16-bit fixed point
  • Control of quantization error

Performance: Vehicular A 30 km/h
  • CG with sliding window (CG-SW): improvement in comparison with basic CG

CG–SW Approach: Fixed Point
  Plot: Vehicular A – 30 km/h
  • 32-bit floating point and 16-bit fixed point
  • About 1% BER difference
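To make the floating-point versus fixed-point comparison concrete, here is a minimal sketch of quantizing a real value to a 16-bit signed word. The Q1.15 format (one sign bit, 15 fractional bits) and the saturation behaviour are assumptions for illustration; the slides state only that the implementation is 16-bit fixed point.

```python
def to_q15(x: float) -> int:
    """Quantize a real value to a 16-bit signed integer in Q1.15, saturating at the limits."""
    scaled = int(round(x * 32768))           # 2**15 quantization steps per unit
    return max(-32768, min(32767, scaled))   # clamp to the 16-bit signed range


def from_q15(q: int) -> float:
    """Convert a Q1.15 integer back to a real value."""
    return q / 32768


value = 0.3
q = to_q15(value)
quantization_error = abs(from_q15(q) - value)
```

Away from saturation the rounding error is at most half a quantization step (2^-16). Controlling how such errors accumulate through the CG iterations is what the slides refer to as control of quantization error.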

Performance: Velocity of 120 km/h
  Plots: Pedestrian A - 120 km/h; Vehicular A 120 km/h
  • CG with sliding window and weights averaging (CG-SW-WA) with different numbers of sub-blocks
  • Performance improvement if weights are applied

Computational Complexity
  • Number of operations per chip in 1 second
  • CG filter update is less complex
      • Reason: block-level filter update algorithm

Directions for Architecture Implementation
  • Equalization in different environments
      • Block CG, adaptive LMS for slow fading environments
      • Modifications of CG for fast fading channels
  • Different computational complexity and amount of parallelism
  • Flexible hardware for different equalizations and CG modifications
      • Programmable architecture
      • Application specific

ASIP Architecture for Equalization: Required Features
  • Flexible architecture able to operate in different channel environments
      • Slow/fast fading
      • Low/high scattering
  • Architecture customization
      • Implementation of application-specific operations
  • Instruction and data level parallelism
      • Fast execution of complex algorithms
  • Automatic hardware-software co-design
      • Fast processor design starting from C/C++ code of application

ASIP Architecture Based on TTA
  • Flexible architecture
      • No limitations to add new FUs, buses, registers
  • Customizable architecture
      • Implementation of Special Function Units (SFUs)
  • Instruction and data level parallelism
      • VLIW architecture principle
      • Efficient and parallel data flow
  • Fast processor design
      • Automatic search for best processor
      • VHDL processor representation

General Structure of TTA
  • Transport of operands triggers the appropriate operation as a side effect
      • Only one instruction: “move” instruction
  • 32-bit architecture
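The idea that an operation happens as a side effect of a data move can be illustrated with a toy model. The sketch below is a deliberately simplified simulation: the `AddUnit` class, the port names `operand`, `trigger` and `result`, and the register names are all hypothetical and do not reflect the MOVE framework's actual syntax.

```python
class AddUnit:
    """Toy function unit with an operand register, a trigger register and a result."""

    def __init__(self) -> None:
        self.operand = 0
        self.result = 0

    def write(self, port: str, value: int) -> None:
        if port == "operand":
            self.operand = value                 # plain storage, nothing is computed
        elif port == "trigger":
            self.result = self.operand + value   # writing the trigger port fires the add
        else:
            raise ValueError(f"unknown port: {port}")


def read(src, registers, units):
    """Resolve a move source: integer literal, 'unit.result', or a register name."""
    if isinstance(src, int):
        return src
    if "." in src:
        unit, _port = src.split(".")             # only the result port is readable here
        return units[unit].result
    return registers[src]


def run(moves, registers, units):
    """Execute a list of (source, destination) moves in order."""
    for src, dst in moves:
        value = read(src, registers, units)
        if "." in dst:
            unit, port = dst.split(".")
            units[unit].write(port, value)
        else:
            registers[dst] = value
    return registers


registers = {"r1": 5, "r2": 7, "r3": 0}
units = {"add": AddUnit()}
program = [
    ("r1", "add.operand"),    # load the first operand
    ("r2", "add.trigger"),    # second operand; this move triggers the addition
    ("add.result", "r3"),     # collect the result
]
run(program, registers, units)
```

The program contains no "add" instruction at all, only three moves. In a real TTA several such moves are scheduled on parallel buses in the same cycle, which is where the instruction-level parallelism comes from.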

TTA Design Flow: MOVE Tool
  • Design space exploration for optimal architecture

Customization of ASIP
  • Implementation of application-specific operations
      • User-defined Special Function Units (SFUs)
      • Sacrificing architecture generality for optimization and performance improvement
  • Designed SFUs:
      • Real multiplication with shifting ability
      • Complex multiplication with shifting
      • Sub-word arithmetic operations
      • Sign-test and add/subtract

SFU: Complex Multiplication
  • Reduction of data transports between FUs
      • Fewer buses and smaller interconnection network
      • Smaller instruction word
  • Instruction and data parallelism is placed inside CXMUL
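A behavioural sketch of what a complex-multiply-with-shift unit computes is given below. The internal structure of the thesis's CXMUL unit is not described on the slide, so the function name, the operand layout (separate real and imaginary integer parts) and the absence of saturation are assumptions. The example uses Q1.15 operands, which is also an assumption.

```python
from typing import Tuple


def cxmul_shift(a_re: int, a_im: int, b_re: int, b_im: int,
                shift: int) -> Tuple[int, int]:
    """Multiply two fixed-point complex numbers, then arithmetic-shift the result right."""
    re = (a_re * b_re - a_im * b_im) >> shift   # real part: two multiplies, one subtract
    im = (a_re * b_im + a_im * b_re) >> shift   # imaginary part: two multiplies, one add
    return re, im


# (0.5 + 0.5j) * (0.5 - 0.5j) = 0.5 + 0j, with operands scaled by 2**15.
HALF = 16384
product = cxmul_shift(HALF, HALF, HALF, -HALF, shift=15)
```

Done with separate multiplier and adder FUs, the four multiplies and two additions would each require operand and result transports over the buses. Folding them into one unit keeps those intermediate values internal, which is the source of the bus and instruction-word savings listed above.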

Performance Improvement with SFUs
  • Bus reduction of 50%
  • Instruction word length reduction of about 50%

TTA Processors for MIMO Equalization
  1. Two co-processors (CG equalization)
      • Co-processor for updating equalizer coefficients
      • Co-processor for filtering and user detection
  2. Single processor for all parts of equalization algorithm (CG/LMS equalization)
  • Identical architectures for slow and fast fading environments

Single Processor vs. Two Coprocessors
  • Single processor
      • Smaller area and power dissipation
      • Higher clock frequency

Processor Flexibility
  • Identical customized processor for broad range of channel environments
  • Identical processor for LMS and CG equalization

Example of Designed Processor
  Figure: Coprocessor for CG filter update

Hardware Synthesis Design Flow
  • MOVEGen: generates VHDL representation of processor core
  • Xilinx tools for fast FPGA prototyping
  • Mentor Graphics tools for CMOS gate level design

VHDL Template of TTA Processor
  • Automatic VHDL generation of processor core, control and interconnection
  • FUs, SFUs, peripherals: pre-designed or defined by user

MoveProc Synthesis on Xilinx FPGA
  • CG/LMS equalizer including user detection
      • No SFUs
      • 32 buses
  • Xilinx FPGA part: XC2V8000
      • Slices: 38,757 out of 46,592
      • BRAMs: 148 out of 168
      • IOBs: 263 out of 1108
      • MULT18x18s: 24 out of 168

MoveProc Synthesis on Xilinx FPGA
  • Customized CG/LMS equalizer including user detection
      • With SFUs
      • 16 buses
  • Xilinx FPGA part: XC2V6000
      • Slices: 21,126 out of 33,792
      • BRAMs: 107 out of 144
      • IOBs: 229 out of 1104
      • MULT18x18s: 11 out of 144
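The two synthesis reports can be compared directly on absolute resource counts. The following is a small sketch of that arithmetic using only the numbers quoted on the two FPGA slides; the dictionary keys and function name are illustrative. The two designs target different FPGA parts, so only the absolute counts are compared, not device utilization.

```python
# Absolute resource counts quoted on the two FPGA synthesis slides.
without_sfus = {"slices": 38757, "brams": 148, "iobs": 263, "mult18x18": 24, "buses": 32}
with_sfus = {"slices": 21126, "brams": 107, "iobs": 229, "mult18x18": 11, "buses": 16}


def reduction_percent(before: int, after: int) -> float:
    """Relative saving of 'after' with respect to 'before', in percent."""
    return 100.0 * (before - after) / before


savings = {name: reduction_percent(without_sfus[name], with_sfus[name])
           for name in without_sfus}

for name, percent in savings.items():
    print(f"{name:10s} {percent:5.1f}% fewer")
```

On these figures the customized processor uses about 45% fewer slices and half the buses, in line with the 50% bus reduction stated on the Performance Improvement with SFUs slide.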

Gate Level CMOS Synthesis
  • Mentor Graphics tools
  • 0.5 µm CMOS library
  • Customized CG/LMS equalizer including user detection (with SFUs)
  • Synthesis estimate of processor core: 182,887 gates

Conclusions
  • Equalization algorithms for broad range of channel environments
      • Slow fading: CG/LMS
      • Fast fading: modifications of basic CG equalization
  • ASIP architecture design based on TTA
      • Same architecture – different equalization algorithms
      • Optimization with application-specific operations
      • Reasonable frequency and power dissipation for 3GPP data rate
  • Fast processor design
      • VHDL representation of optimal processor
      • FPGA synthesis and CMOS gate level synthesis

Future Work
  • Processor layout synthesis
      • IC Station software tool from Mentor Graphics
      • Precise timing, area, and power analysis
  • Implementation of hybrid word length
      • Reduced precision for filter application part
  • Implementation on C5x DSP for comparison

Acknowledgements
  • Thanks to:
      • Professor Cavallaro
      • Dr. De Baynast
      • Professor Aazhang
      • Dr. Dabak
      • Dr. Sabharwal
      • Texas Instruments
      • Nokia