Cpr E / Com S 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering, Iowa State University
Lecture #5 – FPGA Arithmetic
Quick Points
• HW #1 due Thursday at 12:00 pm
• Any comments?
September 4, 2007, Cpr E 583 – Reconfigurable Computing, Lect-05
Recap
• Cluster size of N = [6-8] is good, with LUT size K = [4-5]
LUT Mapping Techniques
[Figures: three slides of LUT mapping examples]
Outline
• Recap
• Motivation
• Carry / Cascade Logic
• Addition
  • Ripple Carry
  • Carry Bypass
  • Carry Select
  • Carry Lookahead
• Basic Multipliers
Motivation
• Traditional microprocessors, DSPs, etc. don’t use LUTs
• Instead they use a w-bit Arithmetic and Logic Unit (ALU)
• Carry connections are hard-wired
• No switches, no stubs, short wires
[Figure: a w-bit ALU (inputs A, B, Op; operations AND, OR, XOR, ADD, SUB, CMP) next to LUT-based equivalents built from 2-LUTs and 3-LUTs with inputs A, B, Cin and outputs Out, Sum, Cout]
Adder Delays
• Assuming a ripple-carry adder:
  • 32-bit ALU delay – 6 ns
  • 32 cascaded 4-LUTs – 32 x 2.5 ns = 80 ns
• Compare:
  • 4-LUT delay: 2.0 ns logic delay + 0.5 ns single channel delay = 2.5 ns per bit
  • 32-bit ALU (0.6λ): 16 ns area optimized, 6 ns delay optimized
• Motivates extra hardware to accelerate carry operations
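The ripple-carry structure behind these numbers can be sketched in a few lines of Python. This is a minimal bit-level model, not a hardware description; the function names are made up for illustration, and the 2.5 ns per-bit figure is the one quoted on the slide.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a, b, width=32, cin=0):
    """Add two width-bit integers one bit at a time.

    The carry out of bit i feeds bit i+1, so the worst-case
    delay grows linearly with width.
    """
    carry = cin
    total = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


def lut_ripple_delay_ns(width, per_bit_ns=2.5):
    """Delay of a LUT-mapped ripple adder, using the slide's per-bit figure."""
    return width * per_bit_ns


print(ripple_carry_add(5, 7, width=8))
print(lut_ripple_delay_ns(32))
```

Each loop iteration stands for one bit position, which is why a 32-bit add pays the per-bit delay 32 times.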
Altera FLEX 8000 Carry Chain
[Figure: FLEX 8000 carry chain]
Xilinx XC4000 Carry Chain
[Figure: XC4000 carry chain]
Cascades
• Large fanin operations (reductions):
  • Decoding
  • Matching
  • Completion detection
  • Many-to-one reductions
• Combining logic is simple
• Makes use of dedicated paths
Altera Cascade Logic
• LE delay – 2.4 ns
• Cascade delay – 0.6 ns
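A wide AND reduction (for example, a pattern match) maps onto a cascade chain roughly as sketched below. The delay model is an assumption made for illustration: one LE delay for the first stage, then one cascade delay per additional stage, using the two figures from the slide. The function names and the choice of k = 4 inputs per LE are also hypothetical.

```python
LE_DELAY_NS = 2.4        # from the slide
CASCADE_DELAY_NS = 0.6   # from the slide


def cascade_and(bits, k=4):
    """Wide AND built from k-input logic elements joined by a cascade chain.

    Each LE ANDs up to k inputs; its output is ANDed into the running
    chain value over the dedicated cascade path.
    Returns (result, number_of_stages).
    """
    chain = 1
    stages = 0
    for i in range(0, len(bits), k):
        le_out = int(all(bits[i:i + k]))  # what one LUT computes
        chain &= le_out                   # simple combining logic
        stages += 1
    return chain, stages


def cascade_delay_ns(stages):
    """Assumed model: first LE delay, then one cascade hop per extra stage."""
    return LE_DELAY_NS + (stages - 1) * CASCADE_DELAY_NS


result, stages = cascade_and([1] * 16)
print(result, stages, round(cascade_delay_ns(stages), 1))
```

Under this assumed model a 16-input match costs far less than four LE delays in series, which is the point of the dedicated path.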
Why Look at Arithmetic?
• Parallelization
• Specialization
• Architecture
• Size
• Inputs
• Adder problem – delay grows linearly with bit width
• Solutions for larger adders:
  • Pipelining
  • Carry bypass
  • Carry select
Adder Pipelining
• Not as practical in ASIC world (registers are expensive)
• Registers essentially “free” in FPGA logic blocks
[Figure: three pipelined adder stages computing S1, S2, S3 from (A1, B1), (A2, B2), (A3, B3), with Cout passed from stage to stage]
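A functional sketch of the pipelined adder, assuming the operands are split into equal slices and that one loop iteration stands for one pipeline stage. The carry variable plays the role of the register between stages; names and the 24-bit, 3-stage defaults are illustrative only.

```python
def pipelined_add(a, b, width=24, stages=3):
    """Add two width-bit integers in `stages` slices.

    Stage i adds slice i of each operand plus the carry registered
    by stage i-1. Assumes width is divisible by stages.
    """
    chunk = width // stages
    mask = (1 << chunk) - 1
    carry = 0      # carry register between stages
    result = 0
    for stage in range(stages):
        a_i = (a >> (stage * chunk)) & mask
        b_i = (b >> (stage * chunk)) & mask
        s = a_i + b_i + carry
        result |= (s & mask) << (stage * chunk)
        carry = s >> chunk   # latched for the next stage
    return result, carry


print(pipelined_add(0xABCDEF, 0x123456))
```

The critical path per clock is one slice wide instead of the full width; the price is a latency of one cycle per stage, plus registers that the FPGA logic block already provides.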
Carry Bypass Adders
• If all the propagates are 1 (P0 P1 P2 P3 = 1) then Cout3 = Cin
• Pi = Ai xor Bi
• Skip all the carry logic
• Inexpensive
[Figure: 4-bit ripple adder (A0..A3, B0..B3, Cin) with a mux that selects Cin instead of the rippled Cout3 when P0 P1 P2 P3 = 1]
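The bypass condition can be checked with a small model. This is a sketch assuming 4-bit blocks as in the figure; `bypass_block` and `carry_bypass_add` are hypothetical names.

```python
def bypass_block(a, b, cin, width=4):
    """One ripple-carry block with a bypass mux on its carry out.

    Returns (sum, carry_out, bypassed).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    propagate = a ^ b                    # Pi = Ai xor Bi, all bits at once
    bypassed = propagate == mask         # P0 P1 P2 P3 = 1
    total = a + b + cin
    rippled_cout = total >> width
    cout = cin if bypassed else rippled_cout   # the bypass mux
    return total & mask, cout, bypassed


def carry_bypass_add(a, b, width=16, block=4):
    """Chain bypass blocks from the least significant block upward."""
    carry = 0
    result = 0
    for shift in range(0, width, block):
        s, carry, _ = bypass_block(a >> shift, b >> shift, carry, block)
        result |= s << shift
    return result, carry


print(bypass_block(0b1010, 0b0101, 1))
print(carry_bypass_add(0x0FFF, 0x0001))
```

When every bit propagates, the rippled carry out equals Cin anyway, so the mux never changes the result; it only lets the carry skip the block's ripple path.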
Carry Bypass Performance
• Small hardware cost:
  • 16-bit add – 4 CLB overhead
  • 32-bit add – 9 CLB overhead
• Delay growth still linear, smaller slope
[Figure: delay v. adder width (N-bit adder) for ripple carry and carry bypass]
Carry Select Adders
• Precompute addition value for (Cin = 0) case and (Cin = 1) case
• Use mux to select between two with actual Cin value
• Cost of this approach?
[Figure: 8-bit carry-select adder; bits 0-3 use a ripple adder with Cin, bits 4-7 are computed twice (carry-in 0 and carry-in 1), and a mux selects S4-7]
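A minimal functional sketch of the 8-bit case in the figure, assuming a 4-bit lower half and a duplicated 4-bit upper half. The function name and parameters are illustrative.

```python
def carry_select_add(a, b, width=8, low_bits=4, cin=0):
    """Carry-select add: the upper half is computed for both carry-in values."""
    low_mask = (1 << low_bits) - 1
    high_bits = width - low_bits
    high_mask = (1 << high_bits) - 1

    # Lower half: ordinary adder with the real carry in.
    low = (a & low_mask) + (b & low_mask) + cin
    low_sum, low_carry = low & low_mask, low >> low_bits

    # Upper half: two speculative adders working in parallel.
    a_hi = (a >> low_bits) & high_mask
    b_hi = (b >> low_bits) & high_mask
    hi_if_0 = a_hi + b_hi        # assumes carry-in 0
    hi_if_1 = a_hi + b_hi + 1    # assumes carry-in 1

    # Mux: the actual carry from the lower half picks the result.
    hi = hi_if_1 if low_carry else hi_if_0
    return ((hi & high_mask) << low_bits) | low_sum, hi >> high_bits


print(carry_select_add(0x3C, 0x47))
```

The sketch also answers the slide's cost question in miniature: the upper half needs two adders plus a mux, so area roughly doubles for the speculative part.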
Linear Carry-Select
• Adder delay = w, mux delay = 1
[Figure: 32-bit carry-select adder in four 8-bit blocks (bits 31-24, 23-16, 15-8, 7-0); each block's speculative sums are ready at t8, the carry selects settle at t9, t10, t11, and the last output at t12]
Square-Root Carry Select
• Each carry arrives when the corresponding sum is ready
[Figure: 32-bit carry-select adder with blocks of increasing width (bits 3-0, 8-4, 14-9, 21-15, 29-22, then 31-30); block sums are ready at t4, t5, t6, t7, t8, the selects settle one mux delay later in each block, and the last output at t10]
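The timing labels for the linear and square-root layouts can be reproduced with a short delay model. It uses the unit delays stated for carry-select (a w-bit adder takes w, a mux takes 1) and assumes, as the figures' labels suggest, that every block including the lowest one ends in a mux. The function name is hypothetical.

```python
def carry_select_delay(block_widths):
    """Time at which the final carry-select output settles.

    block_widths lists the block sizes from least significant to most
    significant. A block's speculative sums are ready at time w; its mux
    can switch once both the sums and the incoming select are ready.
    """
    select_ready = 0   # the external carry in is available at time 0
    for w in block_widths:
        select_ready = max(w, select_ready) + 1
    return select_ready


linear = [8, 8, 8, 8]
square_root = [4, 5, 6, 7, 8, 2]
print(carry_select_delay(linear), carry_select_delay(square_root))
```

In the linear layout the upper blocks finish at t8 and then wait for the select; growing each block by one bit uses that waiting time to add more bits, which is why both layouts cover 32 bits but the second finishes sooner.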
Constant Addition
• If one operand is constant:
  • More speed?
  • Less hardware?
[Figure: a 4-bit adder (A0..A3 plus constant bits, built from one HA and three FAs, producing S0..S3) next to a version specialized for the constant, built from half adders only and producing S0..S3 and C3]
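One way to see the hardware saving: with a known constant bit, each full adder collapses to two-input logic. The sketch below is an illustration of that specialization, not the exact circuit from the figure; `add_constant` is a hypothetical name.

```python
def add_constant(a, k, width=4):
    """Add constant k to a using per-bit logic specialized on k's bits.

    Constant bit 0: sum = a xor c,  carry = a and c   (a half adder)
    Constant bit 1: sum = a xnor c, carry = a or c
    Either way each bit needs only two variable inputs.
    """
    carry = 0
    total = 0
    for i in range(width):
        a_i = (a >> i) & 1
        if (k >> i) & 1:
            s = 1 ^ a_i ^ carry
            carry = a_i | carry
        else:
            s = a_i ^ carry
            carry = a_i & carry
        total |= s << i
    return total, carry


print(add_constant(0b0110, 0b1010))
```

Bits below the constant's lowest set bit never see a carry, so their logic disappears entirely and the input passes straight through.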
Multiplication
• Shift and add operations
• Need N bit adder, M cycles
Example: 42 x 10 = 420
      101010    Multiplicand (N bits)
  x     1010    Multiplier (M bits)
      000000    Partial products
     101010
    000000
 + 101010
   110100100    Product (N+M bits) = 420
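The shift-and-add procedure can be written out directly. This is a minimal sketch with illustrative names; one loop iteration stands for one cycle of the M-cycle multiplier.

```python
def shift_add_multiply(multiplicand, multiplier, m_bits):
    """Multiply by examining one multiplier bit per cycle.

    Returns (product, partial_products), where partial_products[i] is the
    multiplicand shifted left by i if multiplier bit i is set, else 0.
    """
    product = 0
    partials = []
    for i in range(m_bits):
        bit = (multiplier >> i) & 1
        partial = (multiplicand << i) if bit else 0
        partials.append(partial)
        product += partial      # the N-bit adder, reused every cycle
    return product, partials


product, partials = shift_add_multiply(0b101010, 0b1010, m_bits=4)
print(product, format(product, "b"), partials)
```

For 42 x 10 the two nonzero partial products are 84 and 336, which sum to 420.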
Array Multiplier
• Area – N x M cells
• Delay – O(N+M)
[Figure: 4-bit array multiplier; rows of adder cells combine the partial products of X3..X0 with Y0..Y3 to produce Z0, Z1, Z2, ...]
Carry-Save
[Figure: carry-save version of the 4-bit array multiplier; each row of adder cells passes its carries down to the next row, with inputs X3..X0, Y0..Y3 and outputs Z0, Z1, Z2, ...]
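The carry-save idea can be sketched as follows: each row reduces three numbers to a sum word and a carry word without propagating carries, and only the last step uses a carry-propagate adder. This is a functional illustration with hypothetical names, not a cell-level model of the figure.

```python
def carry_save_add(x, y, z):
    """3:2 compressor on whole words: x + y + z == s + c, with no carry ripple."""
    s = x ^ y ^ z                                # per-bit sum
    c = ((x & y) | (x & z) | (y & z)) << 1       # per-bit carry, moved up one position
    return s, c


def carry_save_multiply(x, y, m_bits):
    """Accumulate partial products in carry-save form, then resolve once."""
    s, c = 0, 0
    for i in range(m_bits):
        partial = (x << i) if (y >> i) & 1 else 0
        s, c = carry_save_add(s, c, partial)
    return s + c    # the single carry-propagate addition at the bottom


print(carry_save_multiply(13, 11, m_bits=4))
```

Each row's delay is one full-adder cell regardless of word width, so the long carry chain appears only once, in the bottom-level adder.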
Multiplier Pipelining
• Register cost:
  • Multiplicand – (N bits/stage x M stages)
  • Multiplier – (M² + M) / 2 bits
  • Early output values – (M² + M) / 2 bits
  • Total – M x (N + M + 1) bits
• Critical path = max:
  • DFF + FA + setup
  • Bottom-level adder
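The register total can be checked numerically. The per-stage breakdown below is an assumption chosen to match the slide's sums (a triangular count of 1 + 2 + ... + M bits for both the multiplier bits and the early outputs); the function name is hypothetical.

```python
def pipeline_register_bits(n, m):
    """Register bits for an M-stage pipelined N x M multiplier.

    Returns (multiplicand, multiplier, early_outputs, total).
    """
    multiplicand = n * m                      # N bits in each of M stages
    multiplier = sum(range(1, m + 1))         # (M^2 + M) / 2
    early_outputs = sum(range(1, m + 1))      # (M^2 + M) / 2
    total = multiplicand + multiplier + early_outputs
    return multiplicand, multiplier, early_outputs, total


print(pipeline_register_bits(8, 8))
```

Adding the three terms gives N x M + M² + M, which factors to M x (N + M + 1).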
Constant Multiplier
• Can greatly reduce the number of adders
• Removes all AND gates
[Figure: array multiplier specialized for the constant Y3 Y2 Y1 Y0 = 1010 (Y0 = 0, Y1 = 1, Y2 = 0, Y3 = 1), with inputs X3..X0 and outputs Z0, Z1, Z2, ...]
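A sketch of the specialization, assuming the constant is known when the circuit is built: zero bits contribute nothing, one bits contribute a fixed shift of X, so no AND gates remain and only the set bits need adders. Names are illustrative.

```python
def constant_multiplier(constant):
    """Build a multiplier for a fixed constant.

    Returns (multiply, adders), where adders is the number of additions
    needed to combine the shifted copies of the input.
    """
    shifts = [i for i in range(constant.bit_length()) if (constant >> i) & 1]

    def multiply(x):
        # Shifts are free wiring in hardware; only the sum costs logic.
        return sum(x << s for s in shifts)

    adders = max(len(shifts) - 1, 0)
    return multiply, adders


times_ten, adders = constant_multiplier(0b1010)
print(times_ten(13), adders)
```

For Y = 1010 the product is (X << 1) + (X << 3), so a single adder replaces the full array.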
LUT-based Constant Multipliers
• k-LUT can perform constant multiply of k-bit number
• Break operand into k-bit quantities
• Example: 8-bit x 8-bit constant, k=4
    10101011
  x     NNNN
      AAAAAA      (N * 1011 (LSN))
+ BBBBBBBB        (N * 1010 (MSN))
  SSSSSSSS        Product
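The nibble-wise scheme can be sketched with a lookup table. Here a Python list stands in for the bank of LUTs (in hardware each k-input LUT produces one output bit, so one table entry spans several LUTs). The constant 0x5B is an arbitrary value picked for illustration, and the function names are hypothetical.

```python
def build_lut(constant, k=4):
    """Table of constant * v for every k-bit value v."""
    return [constant * v for v in range(1 << k)]


def lut_constant_multiply(x, lut, width=8, k=4):
    """Multiply x by the table's constant, k bits of x at a time.

    Each k-bit slice of x indexes the table; the looked-up partial
    product is shifted to the slice's position and added in.
    """
    mask = (1 << k) - 1
    product = 0
    for shift in range(0, width, k):
        product += lut[(x >> shift) & mask] << shift
    return product


N = 0x5B                      # hypothetical constant
lut = build_lut(N)
x = 0b10101011                # operand from the slide: MSN 1010, LSN 1011
print(lut_constant_multiply(x, lut), x * N)
```

For the slide's operand this adds N * 1011 to N * 1010 shifted left by four bits, so an 8-bit multiply costs two lookups and one addition.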
Summary
• Latency overhead of programmable logic
• Several approaches to reducing design latency:
  • Fast carry
  • Cascade
  • Hardwired connections
• Multiplier optimization goals different from adder
• Other techniques:
  • Logarithmic v. linear (Wallace Tree multiplier)
  • Data encoding (Booth’s multiplier)