CprE / ComS 583 – Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture #6 – Modern FPGA Devices

Quick Points
• HW #2 is out
  • Due Thursday, September 20 (12:00 pm)
  • LUT mapping
  • Comparing FPGA devices
  • Synthesizing arithmetic operators
[Figure: effort level between the Assigned and Due dates, Standard vs. Preferred]
September 6, 2007 | CprE 583 – Reconfigurable Computing | Lect-06.2

Recap
• Hard-wired carry logic support
[Figure: carry logic in the Altera FLEX 8000 and the Xilinx XCV4000]

Recap (cont.)
• Square-root carry select adders
[Figure: 32-bit square-root carry select adder with blocks covering bits 3-0, 8-4, 14-9, 21-15, 29-22, and 31-30; each block adds with carry-in 0 and carry-in 1 and a multiplexer selects the result, producing S 3-0 through S 31-30; delays are labeled t4 through t10]

Recap (cont.)
• If one operand is constant:
  • More speed?
  • Less hardware?
[Figure: 4-bit adder with a constant operand. The general version uses one half adder (HA) and three full adders (FA) to produce S0-S3 and C3; the version specialized for the constant uses only half adders]

Recap (cont.)
• Carry save multiplication
[Figure: carry save multiplier array with inputs X3-X0 and Y3-Y0, rows of adders, and product outputs Z0-Z2 shown]

Recap (cont.)
• If one operand is constant:
  • Can greatly reduce the number of adders
  • Removes all AND gates
[Figure: the carry save multiplier array specialized for the constant Y3 Y2 Y1 Y0 = 1010]

LUT-Based Constant Multipliers

      10101011
  x   NNNN
      AAAAAA      (N * 1011 (LSN))
  +   BBBBBB      (N * 1010 (MSN))
      SSSSSSSS    Product

[Figure: N0–N3 drive a bank of 4-LUTs producing A0–A11; N4–N7 drive a bank of 4-LUTs producing B4–B15; an adder combines A and B into S0–S15]
• Constants can be changed in the LUTs to program new multipliers
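The table-lookup idea on this slide can be sketched in software. The following is a minimal illustration, not the slide's actual circuit: it assumes the constant is the 8-bit value 10101011 shown above, that each nibble of the 8-bit input N indexes a 16-entry table of 12-bit products, and that the upper-nibble product is shifted left by 4 before the final add. The names `K`, `NIBBLE_PRODUCTS`, and `constant_multiply` are made up for this example.

```python
K = 0b10101011  # the 8-bit constant from the slide

# One 16-entry table holding K times every possible 4-bit value.
# Each entry fits in 12 bits; in hardware each output bit would be
# one 4-input LUT (an assumption about the mapping, not from the slide).
NIBBLE_PRODUCTS = [K * nibble for nibble in range(16)]


def constant_multiply(n: int) -> int:
    """Multiply the 8-bit value n by K with two lookups and one add."""
    if not 0 <= n <= 0xFF:
        raise ValueError("n must fit in 8 bits")
    a = NIBBLE_PRODUCTS[n & 0xF]         # low nibble N0-N3 -> A0-A11
    b = NIBBLE_PRODUCTS[(n >> 4) & 0xF]  # high nibble N4-N7 -> B, weight 2^4
    return a + (b << 4)                  # S0-S15
```

Rebuilding `NIBBLE_PRODUCTS` for a different `K` corresponds to reprogramming the LUT contents for a new multiplier.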

Outline
• Recap
• More Multiplication
• Handling Fractional Values
  • Fixed Point
  • Floating Point
• Some Modern FPGA Devices
  • Xilinx – XC5200, Virtex (-II / -II Pro / -4 / -5), Spartan (-II / -3)
  • *Altera – FLEX 10K, APEX (20K / II), ACEX 1K, Cyclone (II), Stratix (GX / II GX)

Partial Product Generation
• AND gates in multiplication are wasteful
• Option 1 – use cascade logic
• Option 2 – break into smaller (2 x 2) multipliers

     42 =      101010    Multiplicand
   x 11 =    x   1011    Multiplier
                 0110    (10 x 11)
                 0100    (10 x 10)
    462 =   111001110    Product
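Option 2 can be sketched as follows. This is an illustrative model only: it assumes both operands are unsigned, splits them into 2-bit digits, and treats each digit-by-digit product as one 2 x 2 multiplier whose 4-bit result is shifted into place and summed. The function names are hypothetical.

```python
def split_digits(x: int, digit_bits: int = 2) -> list[int]:
    """Split an unsigned integer into digits of digit_bits bits, LSB first."""
    mask = (1 << digit_bits) - 1
    digits = []
    while x:
        digits.append(x & mask)
        x >>= digit_bits
    return digits or [0]


def multiply_2x2(a: int, b: int) -> int:
    """Multiply two unsigned integers using only 2-bit x 2-bit products."""
    total = 0
    for i, da in enumerate(split_digits(a)):
        for j, db in enumerate(split_digits(b)):
            # da * db is at most 3 * 3 = 9, so it fits in 4 bits:
            # this is the work done by one 2 x 2 multiplier.
            total += (da * db) << (2 * (i + j))
    return total
```

For the slide's example, 101010 splits into the digits 10, 10, 10 and 1011 into 10, 11, so the only distinct small products are 10 x 11 = 0110 and 10 x 10 = 0100.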

Representation Compression • Multiplication can be simplified if the representation is compressed • Standard

Representation Compression • Multiplication can be simplified if the representation is compressed • Standard – binary representation {0, 1}x 2 n • Canonical Signed Digit (CSD) representation {-1, 0, 1}x 2 n • To encode CSD: • Set C = (B + (B<<1)) • Calculate -2 C = 2*(C>>1) • Di = Bi + Ci – 2 Ci+1, where Ci+1 is the carryout of Bi + Ci • Example: B = 61 d = 0111101 b C = 0111101 b + 01111010 b = 010110111 b -2 Ci+1 = 2222101 D = 1000201 = 1000(-1)01 • For any n bit number, there can only be n/2 nonzero digits in a CSD representation (every other bit) September 6, 2007 Cpr. E 583 – Reconfigurable Computing Lect-06. 11
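A small sketch of CSD recoding may help here. It assumes the carry recurrence C_(i+1) = floor((B_i + B_(i+1) + C_i) / 2), which is the standard way to generate the carries of B + (B<<1) and is taken here to be what the slide's steps intend; the digit formula D_i = B_i + C_i – 2C_(i+1) is from the slide. The function name `to_csd` is made up for this example.

```python
def to_csd(b: int) -> list[int]:
    """Return the CSD digits of a non-negative integer, LSB first.

    Each digit is -1, 0, or 1, and no two adjacent digits are nonzero.
    """
    if b < 0:
        raise ValueError("b must be non-negative")
    digits = []
    carry = 0
    while b or carry:
        b0 = b & 1         # B_i
        b1 = (b >> 1) & 1  # B_(i+1)
        next_carry = (b0 + b1 + carry) >> 1           # C_(i+1)
        digits.append(b0 + carry - 2 * next_carry)    # D_i
        carry = next_carry
        b >>= 1
    return digits or [0]
```

For B = 61 this returns `[1, 0, -1, 0, 0, 0, 1]` (LSB first), which read MSB first is 1000(-1)01, i.e. 64 - 4 + 1.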

Booth Encoding
• Variation on CSD encoding: E_j = -2B_i + B_(i-1) + B_(i-2)
• Select a group of 3 digits, add the two least significant digits, and then subtract 2 x the most significant bit
• E_j is {-2, -1, 0, 1, 2} x 2^(2n)
• Example:
  • B = 61d = 0111101b = 0001111010b (with padding)
  • E = 010(-1)1
• Reduces the number of partial products for multiplication by ½
• Can automatically handle negative numbers
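The grouping rule above can be sketched as a radix-4 Booth recoder. This is an illustrative version under stated assumptions: the input is treated as a two's complement value of an even width `bits`, a single 0 is appended below the LSB (the padding in the slide's example), and overlapping 3-bit windows are taken every 2 bits. The function name `booth_radix4` is hypothetical.

```python
def booth_radix4(b: int, bits: int) -> list[int]:
    """Return radix-4 Booth digits (each in -2..2), least significant first.

    b is interpreted as a two's complement value that fits in `bits` bits;
    `bits` must be even so the windows cover the word exactly.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    padded = (b & ((1 << bits) - 1)) << 1  # append the implicit 0 below the LSB
    digits = []
    for j in range(0, bits, 2):
        window = (padded >> j) & 0b111
        hi, mid, lo = (window >> 2) & 1, (window >> 1) & 1, window & 1
        digits.append(-2 * hi + mid + lo)  # E_j = -2B_i + B_(i-1) + B_(i-2)
    return digits
```

With `bits=10`, the value 61 recodes to `[1, -1, 0, 1, 0]` (LSB first), matching E = 010(-1)1 on the slide. A word of `bits` bits yields `bits / 2` digits, which is where the halving of partial products comes from.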

Fractional Arithmetic
• Many important computations require fractional components
• Fractional arithmetic often ignored in FPGA literature
  • Complex standards (ex. IEEE special cases)
  • Resource intensive and slow
• Why not just extend the binary representation past the decimal point?

Fixed-Point Representation
• Separate value into Integer (I) and Fractional remainder (F): [ I | F ]
• F bits represent {0, 1} x 2^(-n)
• How large to make I and F depends on application
• Ex: Q16.16 is 16 bits of integer [-2^15, 2^15) with 16 bits of fraction – increments of 2^(-16) or 0.0000152587890625
• Ex: Q1.127 is a normalized value [-1, 1) with 127 bits of fraction – increments of 2^(-127) or 5.8774717541114375398436826861112e-39

Fixed-Point Arithmetic
• Addition, subtraction the same (Q4.4 example):

      3.6250      0011.1010
    + 2.8125    + 0010.1101
      6.4375      0110.0111

• Multiplication requires realignment:

      3.6250        0011.1010
    x 2.8125      x 0010.1101
     10.1953125     1010.00110010
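The Q4.4 example can be reproduced with plain integers. This is a minimal sketch, assuming values are stored as integers scaled by 2^4 and that realignment after multiplication is done by truncating (shifting right by 4); the helper names are made up for illustration.

```python
FRAC_BITS = 4  # Q4.4: 4 integer bits, 4 fraction bits


def to_q44(x: float) -> int:
    """Scale a real value by 2^4 and round to the nearest integer."""
    return int(round(x * (1 << FRAC_BITS)))


def from_fixed(raw: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert a scaled integer back to a real value."""
    return raw / (1 << frac_bits)


a = to_q44(3.625)    # 0011.1010
b = to_q44(2.8125)   # 0010.1101

# Addition needs no adjustment: both operands share the same scale.
total = a + b        # 0110.0111

# Multiplication doubles the number of fraction bits (Q4.4 x Q4.4 -> Q8.8),
# so the product must be shifted right to return to Q4.4.
product_q88 = a * b
product_q44 = product_q88 >> FRAC_BITS  # truncates the low 4 fraction bits
```

Here `from_fixed(product_q88, 8)` gives the exact 10.1953125, while `from_fixed(product_q44)` gives 10.1875 after the low bits are dropped.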

Fixed-Point Issues
• Overflow/underflow
• Quantization Errors
  • After rounding down previous example 3.625 x 2.8125 = 10.1875 (0.08% error)
  • In Q4.4, 2 divided by 3 = 0.625 (6.25% error)
• Scaling
• Dynamic range needed for some applications
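The two error figures above can be checked numerically. The sketch below assumes Q4.4 values stored as integers scaled by 16 and a division that pre-shifts the dividend and truncates the quotient; `q44_divide` and `relative_error` are hypothetical helper names.

```python
FRAC_BITS = 4  # Q4.4


def q44_divide(a_raw: int, b_raw: int) -> int:
    """Divide two Q4.4 values, truncating the result to Q4.4."""
    # Pre-shift the dividend so the quotient keeps 4 fraction bits.
    return (a_raw << FRAC_BITS) // b_raw


def relative_error(exact: float, approx: float) -> float:
    return abs(exact - approx) / abs(exact)


two = 2 << FRAC_BITS    # 0010.0000
three = 3 << FRAC_BITS  # 0011.0000
quotient = q44_divide(two, three) / (1 << FRAC_BITS)

division_error = relative_error(2 / 3, quotient)
product_error = relative_error(10.1953125, 10.1875)
```

The division yields 0.625 with a relative error of 6.25%, and the truncated product from the previous slide is off by roughly 0.08%.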

IEEE 754 Floating Point
• Single precision: V = (-1)^S x 2^(E-127) x (1.F)
  Fields: S (1 bit) | E (8 bits) | F (23 bits)
• Double precision: V = (-1)^S x 2^(E-1023) x (1.F)
  Fields: S (1 bit) | E (11 bits) | F (52 bits)
• Special conditions – not a number (NaN), +-0, +-infinity
• Gradual underflow
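The single-precision layout can be explored by unpacking the fields of a real value. This is an illustrative sketch using Python's standard `struct` module; the function names are made up, and `normal_value` applies the slide's formula only to normalized numbers (0 < E < 255), not to zeros, denormals, infinities, or NaN.

```python
import struct


def decode_single(x: float) -> tuple[int, int, int]:
    """Return the (S, E, F) fields of x encoded as IEEE 754 single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def normal_value(sign: int, exponent: int, fraction: int) -> float:
    """Evaluate V = (-1)^S x 2^(E-127) x (1.F) for a normalized number."""
    if not 0 < exponent < 255:
        raise ValueError("only normalized numbers are handled here")
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)
```

For example, 3.625 = 1.1101b x 2^1 decodes to S = 0, E = 128, and a fraction field whose top four bits are 1101.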

Floating Point FPGA Hardware
• Xilinx XCV4085
• Addition
  • Single-precision – 587 4-LUTs
  • Double-precision – 1334 4-LUTs
• Multiplication
  • Single-precision – 1661 4-LUTs
  • Double-precision – 4381 4-LUTs
• Division
  • Single-precision – 1583 4-LUTs
  • Double-precision – 4910 4-LUTs
• For double-precision, can only fit any two of three units on a single device!
• See [Und04] for details

Capacity Trends
[Figure: Xilinx device complexity versus year; axis ticks at 1985, 1987, 1991, 1995, 1998, 1999, 2000, 2002, 2003, 2004, 2006]

Device          Clock      Gates
XC2000          50 MHz     1K
XC3000          85 MHz     7.5K
XC4000          100 MHz    250K
XC5200          50 MHz     23K
Spartan         80 MHz     40K
Spartan-II      200 MHz    200K
Spartan-3       326 MHz    5M
Virtex          200 MHz    1M
Virtex-E        240 MHz    4M
Virtex-II       450 MHz    8M
Virtex-II Pro   450 MHz    8M*
Virtex-4        500 MHz    16M*
Virtex-5        550 MHz    24M*

Xilinx XC5200 FPGA
• Successor to the XC4000
• Relatively small amount of CLBs with faster interconnect

Device   Logic Cells   Max Logic Gates   VersaBlock Array   CLBs   Flip-Flops   I/Os
XC5202   256           3,000             8 x 8              64     256          84
XC5204   480           6,000             10 x 12            120    480          124
XC5206   784           10,000            14 x 14            196    784          148
XC5210   1,296         16,000            18 x 18            324    1,296        (missing)
XC5215   1,936         23,000            22 x 22            484    1,936        244

Xilinx XC5200 (cont.)
• Each CLB consists of four Logic Cells (LCs)
• Logic Cell = LUT + DFF
• 20 inputs
• 12 outputs

Xilinx Spartan FPGAs
• Meant to be low-power / low-cost version of XC4000 series (on newer process technology)

Device   Logic Cells   Max Logic Gates   CLB Matrix   Total CLBs   Flip-Flops   I/Os   Dist. RAM Bits
XCS05    238           5,000             10 x 10      100          360          77     3,200
XCS10    466           10,000            14 x 14      196          616          112    6,272
XCS20    950           20,000            20 x 20      400          1,120        160    12,800
XCS30    1,368         30,000            24 x 24      576          1,536        192    18,432
XCS40    1,862         40,000            28 x 28      784          2,016        224    25,088

Xilinx Spartan (cont.)
• Identical CLB to XC4000 series

Xilinx Spartan (cont.)
• Individual LUTs can be programmed as 16 x 1 RAMs and combined to form larger memory structures

Xilinx Virtex FPGAs

Device    Logic Cells   Max Logic Gates   CLB Array   Block RAM Bits   Select RAM+ Bits
XCV50     1,728         57,906            16 x 24     32,768           24,576
XCV100    2,700         108,904           20 x 30     40,960           38,400
XCV150    3,888         164,674           24 x 36     49,152           55,296
XCV200    5,292         238,666           28 x 42     57,344           75,264
XCV300    6,912         322,970           32 x 48     65,536           98,304
XCV400    10,800        468,252           40 x 60     81,920           153,600
XCV600    15,552        661,111           48 x 72     98,304           221,184
XCV800    21,168        888,439           56 x 84     114,688          301,056
XCV1000   27,648        1,124,022         64 x 96     131,072          393,216

I/O (as listed, 7 values for 9 devices): 180, 260, 284, 316, 404, 512, 512

Xilinx Virtex (cont.)
• 4 4-LUTs / FFs per CLB
• Organized into 2 “slices”

Xilinx Virtex (cont.)
• Block Select+RAM – dedicated blocks of on-chip, true dual-port read/write synchronous RAM
• 4 Kbit of RAM with different aspect ratios
• Faster, less flexible than distributed RAM using LUTs
• Virtex-E – updated, larger version of Virtex devices

Xilinx Spartan-II
• CLB structure similar to Virtex

Device    Logic Cells   System Gates   CLB Array   Distributed RAM Bits   Select RAM+ Bits
XC2S15    432           15,000         8 x 12      6,144                  16,384
XC2S30    972           30,000         12 x 18     13,824                 24,576
XC2S50    1,728         50,000         16 x 24     24,576                 32,768
XC2S100   2,700         100,000        20 x 30     38,400                 40,960
XC2S150   3,888         150,000        24 x 36     55,296                 49,152
XC2S200   5,292         200,000        28 x 42     75,264                 57,344

I/O (as listed, 5 values for 6 devices): 86, 92, 176, 260, 284

Xilinx Virtex-II Platform FPGAs
• “Platform” FPGA == Multiplier??

Device     Max Logic Gates   CLB Array   Multiplier Blocks   Max I/O Pads   Select RAM+ Bits   Block RAM Bits
XC2V40     40K               8 x 8       4                   88             8K                 72K
XC2V80     80K               16 x 8      8                   120            16K                144K
XC2V250    250K              24 x 16     24                  200            48K                432K
XC2V500    500K              32 x 24     32                  264            96K                576K
XC2V1000   1M                40 x 32     40                  432            160K               720K
XC2V1500   1.5M              48 x 40     48                  528            240K               864K
XC2V2000   2M                56 x 48     56                  624            336K               1,008K
XC2V3000   3M                64 x 56     96                  720            448K               1,728K
XC2V4000   4M                80 x 72     120                 912            720K               2,160K
XC2V6000   6M                96 x 88     140                 1,104          1,056K             2,592K
XC2V8000   8M                112 x 104   168                 1,108          1,456K             3,024K

Xilinx Virtex-II (cont.)
• 4 slices per CLB, 2 4-LUTs per slice
• 8 LUTs per CLB
• Block Select+RAMs now 18 Kbit each

Xilinx Virtex-II (cont.)
• Block multipliers (18b x 18b) arranged in columns near RAM

Block Multipliers
• Synthesis tools can take larger multipliers and break them down into 18 x 18 multipliers
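One way such a decomposition can work is sketched below. It is not a description of what any particular synthesis tool does: it assumes unsigned operands, assumes the 18 x 18 blocks are signed so that only 17 bits per operand are usable for unsigned limbs, and sums the shifted partial products in ordinary logic. The names `LIMB_BITS`, `split_limbs`, and `wide_multiply` are hypothetical.

```python
LIMB_BITS = 17  # unsigned bits per operand of a signed 18 x 18 block (assumption)


def split_limbs(x: int, count: int) -> list[int]:
    """Split an unsigned integer into `count` limbs of LIMB_BITS bits, LSB first."""
    mask = (1 << LIMB_BITS) - 1
    return [(x >> (LIMB_BITS * i)) & mask for i in range(count)]


def wide_multiply(a: int, b: int, width: int = 32) -> int:
    """Multiply two unsigned `width`-bit values using only small multiplies."""
    if a < 0 or b < 0 or a >> width or b >> width:
        raise ValueError("operands must be unsigned and fit in `width` bits")
    count = -(-width // LIMB_BITS)  # ceiling division: limbs per operand
    total = 0
    for i, limb_a in enumerate(split_limbs(a, count)):
        for j, limb_b in enumerate(split_limbs(b, count)):
            # Each limb product fits one block multiplier; the shift and the
            # running sum would be built from adders in the fabric.
            total += (limb_a * limb_b) << (LIMB_BITS * (i + j))
    return total
```

Under these assumptions a 32 x 32 unsigned multiply uses two limbs per operand and therefore four block multipliers.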

Xilinx Virtex-II Pro FPGAs

Device     Logic Cells   Multiplier Blocks   Max I/O Pads   Select RAM+ Bits   Block RAM Bits
XC2VP2     3,168         12                  204            44K                216K
XC2VP4     6,768         28                  348            94K                504K
XC2VP7     11,088        44                  396            154K               792K
XC2VP20    20,880        88                  564            290K               1,584K
XC2VP30    30,816        136                 644            428K               2,448K
XC2VP40    43,632        192                 804            606K               3,456K
XC2VP50    53,136        232                 852            738K               4,176K
XC2VP70    74,448        328                 996            1,034K             5,904K
XC2VP100   99,216        444                 1,164          1,378K             7,992K

PowerPC CPU Blocks (as listed, 6 values for 9 devices): 0, 1, 1, 2, 2, 2

Xilinx Virtex-II Pro (cont.)
• PowerPC processor block features
  • 300+ MHz Harvard architecture (RISC)
  • Five-stage pipeline
  • Hardware multiply/divide
  • Thirty-two 32-bit GPRs
  • 16 KB two-way instruction cache
  • 16 KB two-way data cache
  • On-Chip Memory (OCM) interface
  • IBM CoreConnect (OPB, PLB) interfaces

Xilinx Virtex-II Pro (cont.)
• PPC405 details

Xilinx Spartan-3 FPGAs
• CLB structure similar to Virtex-II

Device     System Gates   CLB Array   Multiplier Blocks   Max I/O Pads   Distr. RAM Bits   Select RAM+ Bits
XC3S50     50K            16 x 12     4                   124            12K               72K
XC3S200    200K           24 x 20     12                  173            30K               216K
XC3S400    400K           32 x 28     16                  264            56K               288K
XC3S1000   1M             48 x 40     24                  391            120K              432K
XC3S1500   1.5M           64 x 52     32                  487            208K              576K
XC3S2000   2M             80 x 64     40                  565            320K              720K
XC3S4000   4M             96 x 72     96                  712            432K              1,728K
XC3S5000   5M             104 x 80    104                 784            520K              1,872K

Xilinx Virtex-4 FPGAs
• Comes in three varieties:
  • Virtex-4 LX: has the most LUTs
  • Virtex-4 FX: has PowerPCs like V2P
  • Virtex-4 SX: has the most XtremeDSP slices
• CLB structure similar to Virtex-II
  • Largest LX device – 89,088 slices = 178,176 4-LUTs!
  • FX devices limited to 2 PPC405s like Virtex-II Pro
• XtremeDSP slices:
  • Same 18 x 18 block multiplier, now with optional pipelining
  • Includes built-in 48-bit accumulator for MAC operations

Xilinx Virtex-5
• CLB slices use 6-input LUTs
• Block RAMs now 36 Kbits per block
• DSP slices now support 25 x 18 MAC
• Diagonal routing
• LX, LXT, SXT, FXT varieties

Summary
• Handling fractional math in hardware is important, and expensive
  • Data point – 3 double-precision dividers in a Xilinx XC2VP30
  • Data point – cannot fit a double-precision multiplier in a Xilinx XC3S50
  • Fixed point an alternative, but not practical for all applications
• Xilinx FPGAs
  • 4-LUTs arranged in slices, CLBs (except for V5)
  • Physical SRAM blocks for fast memory
  • Physical multipliers for fast DSP operations
  • Some physical CPUs to manage embedded systems