DSP
Xilinx Core Solutions Group

Traditional DSP: DSP Processors
Single MAC (multiply, add), sequential processing
+ Programmable
+ Off-the-shelf, standard part
+ Hardware multiplier
– One MAC (multiply-accumulate)
– Time-shared
– Performance ceiling

Xilinx DSP: High-Performance Alternative - Parallel Processing
+ Programmable
+ Off-the-shelf, standard part
Many multiplies in one clock cycle!
Extend the performance of DSP processors
[Diagram: several multipliers feeding adders; multiple MACs, parallel processing]

Xilinx DSP Solution
• CORE Generator
• DSP LogiCOREs
• Tools integration
System-level tools

Existing Xilinx DSP Design Methodology
CORE Generator
• Parameterize DSP LogiCOREs
• Connect the cores with HDL or schematic
M1
XC4000X/Spartan/Virtex

Addition of DSP System-Level Tools
CORE Generator, M1
• DSP system-level tools
  – Used by all DSP systems engineers
  – 100,000-copy installed base
• Fit into existing DSP environment
• Connect through the CORE Generator SystemLINX interface

Performance: 16-bit FIR Filter Benchmark
[Bar chart: billions of MACs per second (axis 1 to 5) for the 320C6x, 4005XL, 4013XL, 4036XL, 4062XL and 4085XL]
XC4085XL > 10x faster than 320C6x

120 Million Samples per Second: 512-Tap Decimating FIR
[Block diagram: 10-bit input register (labelled 1 2 8), a bank of 32-tap FIR sections, adder tree, 18-bit output register]
• 3.8 billion MACs
• >10 DSP µPs
• 5,120 flip-flops, just for the data buffer
• XC4085XL
• 150,000 gates
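The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check: the decimation factor of 16 is an assumption inferred from the numbers (it is not stated on the slide), as is the idea that the data buffer holds one flip-flop per bit per tap.

```python
# Cross-check of the slide's figures.
SAMPLE_RATE_HZ = 120e6    # input samples per second (from the slide)
TOTAL_TAPS = 512          # from the slide
TAPS_PER_SECTION = 32     # from the slide
DATA_BITS = 10            # input word width (from the slide)
DECIMATION = 16           # ASSUMED: inferred from the figures, not stated

# Number of 32-tap sections needed to build the 512-tap filter.
sections = TOTAL_TAPS // TAPS_PER_SECTION

# Data buffer size, assuming one flip-flop per bit per tap.
buffer_flip_flops = TOTAL_TAPS * DATA_BITS

# A decimating filter only computes the outputs it keeps.
macs_per_second = (SAMPLE_RATE_HZ / DECIMATION) * TOTAL_TAPS

print(sections)                # 16
print(buffer_flip_flops)       # 5120
print(macs_per_second / 1e9)   # 3.84
```

Under these assumptions the buffer size matches the 5,120 flip-flops quoted, and the MAC rate rounds to the quoted 3.8 billion.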

Price per Million MACs per Second
[Bar chart: price (axis $0.05 to $0.25) for the C6x and the Xilinx XC4000XL]
Lowest Cost

DSP LogiCOREs Exploit FPGA Architecture
[Diagram: 16-word RAM, F/F]
Matrix of 16-by-1 RAM primitives
– Look-up-table logic
– FIFOs, shift registers, …
– Multiple small memories
10,000 RAM primitives on a chip
Regular, monolithic, scalable structure
Efficient: 1 to 3 million MACs per CLB

Distributed RAM & Distributed Arithmetic (DA): Perfect Match
DA algorithms:
• 4-input look-up tables (LUTs), scaled with adders
• For higher performance, use more LUTs = more parallelism
• Efficiency similar to custom solution, achievable with LUT logic
  – More ASIC gate equivalents
  – More cost effective
Basic DA structure matches XC4000 architecture
[Diagram: N-bit inputs into 4-input LUTs feeding an adder or accumulator (ADD or ACC.)]
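To make the LUT-plus-adder idea concrete, here is a minimal behavioral sketch of bit-serial distributed arithmetic for four taps. It illustrates the general technique and is not the LogiCORE implementation; the function names and the two's-complement sample format are assumptions made for the example.

```python
def build_lut(coeffs):
    """Precompute the sum of every subset of the coefficients.

    With 4 coefficients this gives 16 entries, i.e. one 16-word table
    addressed by a 4-bit pattern (one bit taken from each sample).
    """
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_dot(samples, coeffs, bits):
    """Dot product of signed `bits`-wide samples with `coeffs`.

    One bit plane is processed per step: a table look-up followed by a
    shift and add, so no multiplier is used.
    """
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into a LUT address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        partial = lut[addr] << b
        # In two's complement the top bit carries negative weight.
        acc += -partial if b == bits - 1 else partial
    return acc


samples = [3, -2, 7, -8]   # 4-bit signed samples (hypothetical data)
coeffs = [5, -1, 2, 4]     # hypothetical coefficients
print(da_dot(samples, coeffs, bits=4))   # -1
```

The result equals the ordinary sum of products. Processing one bit per step corresponds to the serial form of the technique; using a separate look-up table for each bit position would process all bits at once, trading area for speed.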

Common DSP Functions
• Filters
  – FIR
  – IIR
• Transforms
  – FFT
  – DCT
• Modulation
  – Multipliers
  – SIN tables
• Basics
  – Multiply / add
  – Storage

FIR Filter
[Diagram: sample data N bits wide; samples X0, X1, X2, … multiplied by coefficients C0, C1, C2, …; K taps long; K SUMs; output data]
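The diagram corresponds to the usual FIR sum of products, y[n] = C0·x[n] + C1·x[n-1] + … over K taps. A minimal reference model, assuming samples before the start of the input are zero (the function name is illustrative only):

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * samples[n - k].

    Samples before the start of the input are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


# An impulse at the input reproduces the coefficients at the output.
print(fir_filter([1, 0, 0, 0], [1, 2, 3]))   # [1, 2, 3, 0]
```

Each output costs K multiply-accumulates, which is why the MAC rate of a filter is its output sample rate times its tap count.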

FIR Filter LogiCOREs
Two basic types:
1. Serial Distributed Arithmetic FIR
   – SDA FIR - Single Channel
   – SDA FIR - Dual Channel
2. Parallel Distributed Arithmetic FIR
Combine basic PDA or SDA FIR cores to solve many problems

SDA FIR Filters: Serial Distributed Arithmetic
• Parallel in, parallel out, bit-serial internally
• All taps processed in parallel
• Full precision through entire core
• One clock cycle required for each data bit
• One additional clock cycle for symmetric filters
EXAMPLE: 10-bit data, 80 taps, symmetrical FIR:
• For a bit-level clock = 90 MHz
• Max sample rate = 90 MHz / 11 clks = 8.2 million samples/sec
• Process 80 taps every 122 nsec
• 656 million MACs, 257 CLBs, 2.55 million MACs / CLB
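The example's figures follow from the two clocking rules above (one clock per data bit, plus one for a symmetric filter). A small sketch of that arithmetic; the function name is illustrative and the inputs are the slide's own numbers:

```python
def sda_sample_rate(bit_clock_hz, data_bits, symmetric):
    """Maximum sample rate of a bit-serial filter.

    One clock per data bit, plus one extra clock for a symmetric filter.
    """
    clocks_per_sample = data_bits + (1 if symmetric else 0)
    return bit_clock_hz / clocks_per_sample


TAPS = 80
CLBS = 257

rate = sda_sample_rate(90e6, data_bits=10, symmetric=True)
period_ns = 1e9 / rate
macs_per_second = rate * TAPS
macs_per_clb = macs_per_second / CLBS

print(round(rate / 1e6, 1))          # 8.2  (million samples/sec)
print(round(period_ns))              # 122  (ns per 80-tap result)
print(round(macs_per_clb / 1e6, 2))  # 2.55 (million MACs per CLB)
```

Without intermediate rounding the MAC rate comes to about 654.5 million per second; the slide's 656 million corresponds to multiplying the rounded 8.2 million samples/sec by 80.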

SDA FIR Properties
For a given # of taps:
• Coefficient bit-width determines size
  – # CLBs = function of D.A. LUT width
• Data bit-width determines max sample rate
  – One serial clock per bit
• Output data width does not affect CLB count

What to Ask
• Data sample rate
• Number of taps
• Data word width
• Coefficient symmetry
• Same input & output sample rate?
• Number of CLBs

Serial Distributed Arithmetic FIR Filters: # CLBs
Data word = coefficient size

Taps          5 bit  10 bit  12 bit  14 bit  16 bit  18 bit  20 bit
48 tap Symm   158    173     187     202     217     246     261
64 tap Symm   197    215     233     250     268     305     323
80 tap Symm   236    257     278     299     320     364     385

Rows for 8, 16, 24, 32 and 40 taps (Symm and Non) and 48 tap Non, values in extracted order:
33 36 39 42 45 52 55 Non 54 69 59 71 64 76 69 81 77 96 85 53 46 61 102 80 80 89 95 101 104 108 112 116 123 127 138 146 142 154 93 101 107 114 118 127 126 140 137 153 148 174 175 187 182 116 138 154 165 179 191 226 239

Sample Rate (MHz), XC4000E-1

       5 bit  8 bit  10 bit  12 bit  14 bit  16 bit  18 bit  20 bit
Symm   13.3   8.9    7.3     6.2     5.3     4.7     4.2     3.8
Non    16.0   10.0   8.0     6.7             5.0     4.4     4.0
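The sample-rate rows of this table are consistent with the bit-serial rule of one clock per data bit plus one for a symmetric filter. The sketch below regenerates them under the assumption of an 80 MHz bit clock; that clock figure is inferred from the tabulated values and is not stated on the slide.

```python
BIT_CLOCK_MHZ = 80.0                      # ASSUMED: inferred, not stated
WIDTHS = [5, 8, 10, 12, 14, 16, 18, 20]   # data word widths in the table


def rate_mhz(bits, symmetric):
    """Sample rate in MHz, rounded to one decimal as in the table."""
    clocks_per_sample = bits + (1 if symmetric else 0)
    return round(BIT_CLOCK_MHZ / clocks_per_sample, 1)


symm = [rate_mhz(b, True) for b in WIDTHS]
non = [rate_mhz(b, False) for b in WIDTHS]

print(symm)   # [13.3, 8.9, 7.3, 6.2, 5.3, 4.7, 4.2, 3.8]
print(non)    # [16.0, 10.0, 8.0, 6.7, 5.7, 5.0, 4.4, 4.0]
```

The symmetric row reproduces all eight tabulated values. The non-symmetric row reproduces the seven values that appear on the slide; the 5.7 computed for 14-bit data is a prediction of this model, not a figure taken from the slide.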

For SDA FIR Filters: Distributed RAM is More Efficient
Build the time-skew buffer with distributed RAM, not flip-flops
• 16 x 1 shift register from one 16 x 1 RAM cell primitive: 1 logic cell
• 16 x 1 shift register from flip-flops (FF, FF, …, FF): 16 logic cells

Best Device Utilization: Distributed RAM Well Suited to DSP
[Bar chart: device size in LCs (axis 0 to 1600) for SDA FIR filters, Block RAM versus Xilinx Distributed RAM, at 16 taps/8 bits, 16 taps/16 bits, 64 taps/9 bits and 64 taps/16 bits]
Xilinx Distributed RAM - Uses One Third the Ar

PDA FIR Filter Core: Parallel Distributed Arithmetic FIR Filters
• Fully parallel implementation
• All taps processed in parallel (same as SDA)
• All bits processed in parallel
• Up to 100 million samples per second
• 2 billion MACs per 20-tap core
[Symbol: PDA FIR]
Inputs: Data_IN, Cascade Mid_In (C_M_IN), Clock (CK)
Outputs: DATA_OUT, Cascade Mid_Out (C_M_OUT), Cascade Data_Out (C_D_OUT)

PDA FIR Filters
• Parameterized
  – Input data: 4 to 24 bits
  – Coefficients: 4 to 24 bits
  – Symmetric, non-symmetric, negative symmetry
  – Output data: 2 to 31 bits
  – Taps: 2 to 20 per core
• Automatically trims unused coefficient ROMs
• Supports cascading multiple filter cores
The high data sample rate solution

CORE Generator Software
• SystemLINX: ability to call CORE Generator from third-party tools
• Data sheets
• Web mechanism to download new cores
• AllianceCORE:
• LogiCORE:

On-line Documentation

CORE Generator Methodology
1. Select a CORE
2. Enter parameters
3. Generate core

LogiCORE - SDA Filter Design Package 160 CLB HOW?

DSP CORE Generator Outputs
32-Tap FIR Filter
• Schematic symbol
• VHDL or Verilog HDL instantiation code
• Simulation model
• Design netlist with constraints
[Diagram: FIR filter recipe and parameters into the DSP CORE Generator; 20 rows by 9 columns, 160 CLBs used]
Predictable performance regardless of the number of cores

Predictable Size & Performance
• Built for system performance - not benchmarks
• Generated with RPM (Relationally Placed Macro)
RPM macro-level advantages
• Predictable size
• Close proximity of communicating elements
• Alignment of critical paths
• Accessible I/O signals
• Improves density
RPM system-level advantages
• Rapid progress for automatic and manual design methods (1 macro, NOT 100’s of elements!)
• Consistent performance anywhere on the die
• Packing density very high
• Adequate set-up times
Filling a device with Xilinx cores does not reduce performance

Performance Independent of Core Location
[Diagram: same core installed in different locations, 80 MHz]
• Xilinx LogiCOREs deliver the same performance for any placement
• Non-segmented routing FPGAs can’t do this

Performance Independent of Device Utilization
[Diagram: 80 MHz]
• Xilinx has performance independent of the number of cores added
• Non-segmented routing FPGAs can’t do this

Best FPGA Performance: Xilinx is More Predictable
[Chart: speed in MHz (axis 40 to 80) versus number of instances (1 to 8) of a 12 x 12 area-efficient multiplier, Xilinx segmented versus non-segmented]
Segmented = More Predictable and Repeatab

Performance Independent of Device Size
[Diagram: 80 MHz]
• Same performance for a 4005 or 4085
• Non-segmented routing FPGAs can’t do this

Design Flow
[Block diagram labels: mixer (COS, SIN) producing I and Q; 4:1 decimate, 20 MHz to 5 MHz; 32-tap FIR; 48-tap FIR; low pass; 4K x 16 RAM; complex demod, 4 multipliers; base-band processor]
• Generate each module.
• Use schematic or HDL at a system level.

Implementing the Mixer
This mixer supports sample rates in excess of 85 MHz. It even supports sample rates up to 45.6 MHz using the slowest Xilinx device (E-4).

Joining the Cores
Here VHDL is used to link the cores into a system. Schematic symbols may also be used.
All component declaration and port map code provided by Coregen.

  -- The integrator for skipping through the Sine table with forcing constant
  skip_value: skip_val
    port map (cb => skip_constant);

  skip_integrater: skip_int
    port map (b  => skip_constant,
              s  => skip_integrate,
              l  => GND,
              ce => VCC,
              c  => clk);

  form_sine_address: for i in 0 to 6 generate
    -- extract 7 bits required to address look-up table
    -- MSB is not used as this represents overflow.
    -- Lower bits are internal precision for integrator.
    skip_address(i) <= skip_integrate(i+10);
  end generate form_sine_address;

  sine_table: sine_lut  -- sine wave look-up table
    port map (theta  => skip_address,
              output => sine_wave,
              ctrl   => VCC,  -- select SINE output when high
              c      => clk);
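The VHDL wires a constant into an accumulator and uses seven of the accumulator's bits to address a sine table. A behavioral sketch of that arrangement can help when choosing the skip constant. The 18-bit accumulator width and the table amplitude are assumptions made for the sketch, and any pipeline latency in the real cores is ignored; only the choice of bits 10 to 16 comes from the code above.

```python
import math

ACC_BITS = 18     # ASSUMED accumulator width (bit 17 as the unused MSB)
ADDR_LSB = 10     # address uses bits 10..16, as in the generate loop
ADDR_BITS = 7     # 7 address bits, so a 128-entry table
AMPLITUDE = 127   # ASSUMED table scaling

# One full sine cycle spread over the 128 table entries.
SINE_LUT = [round(AMPLITUDE * math.sin(2 * math.pi * i / (1 << ADDR_BITS)))
            for i in range(1 << ADDR_BITS)]


def nco(skip_constant, n_samples):
    """Model of the accumulator plus sine look-up table."""
    acc = 0
    out = []
    for _ in range(n_samples):
        # Take bits 10..16 of the accumulator as the table address.
        address = (acc >> ADDR_LSB) & ((1 << ADDR_BITS) - 1)
        out.append(SINE_LUT[address])
        # Integrate the skip constant, wrapping at the accumulator width.
        acc = (acc + skip_constant) & ((1 << ACC_BITS) - 1)
    return out


# A skip of 2**15 advances the address by 32 entries (a quarter cycle).
print(nco(1 << 15, 4))   # [0, 127, 0, -127]
```

In this model the table address wraps every 2**17 accumulator counts, so the output frequency is the clock rate times skip_constant / 2**17; the ten lower bits give fractional phase resolution.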

Power Dissipation Advantage: Often the Limiting Factor in DSP
• Xilinx advantage over competitive FPGAs
  – Segmented routing is essential in DSP applications
  – Altera runs 3X HOTTER than Xilinx!
• Xilinx advantage over DSP processors:
  – TI 320C6 runs 2X HOTTER
  – Independent study by Stanford
[Graphic: STOP sign, Too Much Heat]

Segmented Interconnect Yields Lower Power
[Chart: power (W) versus clock frequency (0 to 100 MHz) for non-segmented and Xilinx segmented devices, with ceramic and plastic package thermal limits]
Segmented = Lower Power, Faster Operation

Where to Find Opportunities
• Look for high-performance applications
  – Multiple DSP processors
  – Fixed-function DSP parts
  – Gate array / custom DSP
• Data rates typically above 1 MHz
• Multiple channels required
100 million FIR filter samples / sec. CORE

DSP Applications
Image & Video Processing: Medical Imaging, Copiers, Cameras, Security Systems, Video editors, Inspection Sys, Fingerprint ID
Communications: Wireless Comm, Cellular / PCS, Modems, Satellite, Cable, ADSL, Telephone Test
Industrial, Militar: Motor control, Numerical control, Test equipment, Vibration analysis, Power supplies, Radar, Secure comm.

Where FPGA Solutions Fit
[Chart labels: Audio, kHz sample rates, single channel; RF, video, multiple channels, MHz sample rates; fixed-point arithmetic: processors, FPGAs; floating-point arithmetic: processors]
FPGAs ideal for high sample rates and computational intensity