A Parameterized Floating Point Library Applied to Multispectral Image Clustering

Xiaojun Wang, Dr. Miriam Leeser
Rapid Prototyping Laboratory, Northeastern University
Wang, P166 / MAPLD 2004

Outline
• Project overview
• Library hardware modules
• Floating point divider and square root
• K-means clustering application for multispectral satellite images using the floating point library
• Conclusions and future work

Variable Precision Floating Point Library
• A library of fully pipelined and parameterized floating point modules
• Implementations well suited for state-of-the-art FPGAs
  – Xilinx Virtex II FPGAs and Altera Stratix devices
  – Embedded multipliers and Block RAM
• Signal/image processing algorithms accelerated using this library

Questions to Answer
• Why floating point (FP)?
• Why parameterized FP?
• Why FPGAs?

Why Floating Point (FP)?

Fixed Point
• Limited range
• Number of bits grows for more accurate results
• Easy to implement in hardware

Floating Point
• Dynamic range
• Accurate results
• More complex and higher cost to implement in hardware

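Floating Point Representation

Fields: Sign (+/-) | Biased exponent (e + bias) | Significand 1.f (the 1 is hidden)

• 32 bits: 8-bit exponent, bias = 127; 23+1-bit significand (IEEE single-precision format)
• 64 bits: 11-bit exponent, bias = 1023; 52+1-bit significand (IEEE double-precision format)

Value: $(-1)^{s} \times 1.f \times 2^{e - \mathrm{BIAS}}$, where s is the sign bit, f the stored fraction, and e the stored (biased) exponent field.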
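
The value formula above can be tried out in software. The following is a minimal sketch, not part of the library: the function name `decode` and its parameters are hypothetical, it handles only normalized numbers, and it assumes an IEEE-style bias of 2^(exp_bits-1) - 1 for custom formats (the slides state the bias only for the two IEEE formats).

```python
def decode(bits, exp_bits, frac_bits, bias=None):
    """Decode a normalized value from sign | biased exponent | fraction fields."""
    if bias is None:
        # Assumption: IEEE-style bias (127 for 8 exponent bits, 1023 for 11).
        bias = (1 << (exp_bits - 1)) - 1
    frac = bits & ((1 << frac_bits) - 1)
    exp = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    sign = (bits >> (exp_bits + frac_bits)) & 1
    significand = 1 + frac / (1 << frac_bits)   # restore the hidden 1
    return (-1) ** sign * significand * 2.0 ** (exp - bias)


single = decode(0xC0200000, exp_bits=8, frac_bits=23)   # IEEE single precision
custom = decode(0x3C00, exp_bits=4, frac_bits=11)       # a 16-bit custom format
```

The same function covers both the IEEE single-precision layout and a narrower 16-bit layout, which is the point of parameterizing the exponent and fraction widths.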

Why Parameterized FP?
• Minimize the bitwidth of each signal in the datapath
  – Make more parallel implementations possible
  – Reduce the power dissipation
• Further acceleration
  – Custom datapaths built in reconfigurable hardware using either fixed-point or floating point arithmetic
  – Hybrid representations supported through fixed-to-float and float-to-fixed conversions

Why FPGA?
• Flexible FPGA architecture
  – LUTs, BlockRAM or SRL16
  – Embedded multiplier, embedded PowerPC
• Customize design architecture to suit algorithm
  – Parallel, serial, bitwidth
• Fine grained parallelism
  – High computational workload
• Allow cost / performance tradeoffs
  – Small area, low latency, high throughput, low power dissipation
• Faster than general purpose processor, more flexible and lower cost than ASIC

High Performance Signal Processing Easy and Affordable in FPGAs

Parameterized FP Modules
• Arithmetic operation
  – fp_add : floating point addition
  – fp_sub : floating point subtraction
  – fp_mul : floating point multiplication
  – fp_div : floating point division
  – fp_sqrt : floating point square root
• Format control
  – denorm : introducing implied integer digit
  – rnd_norm : rounding and normalizing
• Format conversion
  – fix2float : converting from fixed point to floating point
  – float2fix : converting from floating point to fixed point

What Makes Our Library Unique?
• A superset of all floating point formats
  – Including IEEE standard format
• Parameterized for variable precision arithmetic
  – Support custom floating point datapaths
  – Support hybrid fixed and floating point implementations
• Support full pipelining
  – Synchronization signals
• Complete
  – Separate normalization
  – Rounding ("round to zero" and "round to nearest")
  – Some error handling

Generic Library Component
• Synchronization signals for pipelining
  – READY and DONE
• Some error handling features
  – EXCEPTION_IN and EXCEPTION_OUT

One Example - Assembly of Modules
2 denorm + 1 fp_add + 1 rnd_norm = 1 IEEE single precision adder

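Another Example - Floating Point Multiplier

$\left[(-1)^{s_1} \times 1.f_1 \times 2^{e_1 - \mathrm{BIAS}}\right] \times \left[(-1)^{s_2} \times 1.f_2 \times 2^{e_2 - \mathrm{BIAS}}\right] = (-1)^{s_1 \oplus s_2} \times (1.f_1 \times 1.f_2) \times 2^{(e_1 + e_2 - \mathrm{BIAS}) - \mathrm{BIAS}}$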
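
The multiplier identity can be followed step by step on the three fields. The sketch below is illustrative only and is not the fp_mul implementation: the name `fp_mul_fields` is hypothetical, an IEEE-style bias is assumed, the result is truncated ("round to zero"), and overflow, underflow, and exceptions are ignored.

```python
def fp_mul_fields(a, b, exp_bits, frac_bits):
    """Multiply two normalized values given as (sign, biased_exp, frac) tuples."""
    bias = (1 << (exp_bits - 1)) - 1           # assumption: IEEE-style bias
    s1, e1, f1 = a
    s2, e2, f2 = b
    sign = s1 ^ s2                             # sign: s1 xor s2
    exp = e1 + e2 - bias                       # biased exponent of the product
    m1 = (1 << frac_bits) | f1                 # significand 1.f1 as an integer
    m2 = (1 << frac_bits) | f2                 # significand 1.f2 as an integer
    prod = m1 * m2                             # value in [1, 4), scaled by 2**(2*frac_bits)
    if prod >> (2 * frac_bits + 1):            # product >= 2: shift right, bump exponent
        prod >>= 1
        exp += 1
    frac = (prod >> frac_bits) & ((1 << frac_bits) - 1)   # drop hidden 1, truncate
    return sign, exp, frac


# 1.5 * 2.5 = 3.75 in the IEEE single-precision layout
product = fp_mul_fields((0, 127, 0x400000), (0, 128, 0x200000), 8, 23)
```

In the library the normalization and rounding step is a separate module (rnd_norm); folding it into one function here is only for brevity.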

Latency

Module                          Latency (clock cycles)
denorm                          0
rnd_norm                        2
fp_add / fp_sub                 4
fp_mul                          3
fp_div                          14
fp_sqrt                         14
fix2float (unsigned/signed)     4/5
float2fix (unsigned/signed)     4/5

Clock rate of each module is similar.
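
As a small worked example, the per-module latencies can be added up for an assembled datapath such as the IEEE single precision adder built from denorm, fp_add, and rnd_norm. This sketch assumes the two denorm modules work in parallel on the two operands and that the stages are connected in series; the helper name `chain_latency` is hypothetical.

```python
# Latencies in clock cycles, copied from the table above.
LATENCY = {
    "denorm": 0,
    "rnd_norm": 2,
    "fp_add": 4,
    "fp_sub": 4,
    "fp_mul": 3,
    "fp_div": 14,
    "fp_sqrt": 14,
}


def chain_latency(stages):
    """Total latency of modules connected in series (assumed composition rule)."""
    return sum(LATENCY[name] for name in stages)


adder_cycles = chain_latency(["denorm", "fp_add", "rnd_norm"])
```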

Algorithms for Division and Square Root
• Division
  – P. Hung, H. Fahmy, O. Mencer, and M. J. Flynn, "Fast division algorithm with a small lookup table," Asilomar Conference, 1999
• Square Root
  – M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand, "Reciprocation, square root, inverse square root, and some elementary functions using small multipliers," IEEE Transactions on Computers, vol. 49, no. 7, pp. 628-637, 2000

Why Choose These Algorithms?
• Both algorithms are simple and elegant
  – Based on Taylor series
  – Use small table-lookup method with small multipliers
• Very well suited to FPGA implementations
  – BlockRAM, distributed memory, embedded multiplier
  – Lead to a good tradeoff of area and latency
• Can be fully pipelined
  – Clock speed similar to all other components

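Division Algorithm
Dividend X and divisor Y are 2m-bit fixed-point numbers in [1, 2). Y is decomposed into a higher-order bit part $Y_h$ and a lower-order bit part $Y_l$, which (following the cited Hung et al. paper) are defined as

$Y_h = 1.y_1 y_2 \ldots y_m$ and $Y_l = 0.0 \ldots 0\, y_{m+1} \ldots y_{2m-1}$, where $Y = Y_h + Y_l$ and $Y_l < 2^{-m}$

Division Algorithm (continued)
Using a Taylor series expansion:

$\frac{X}{Y} = \frac{X}{Y_h + Y_l} \approx \frac{X\,(Y_h - Y_l)}{Y_h^{2}}$

• Error less than ½ ulp
• Two multipliers and one table lookup are required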

Division – Data Flow
[Figure: the 2m-bit dividend X and the 2m-bit divisor Y feed a first multiplier; m bits of Y address a lookup table with a (2m+2)-bit output; a second multiplier combines the two and produces the 2m-bit result.]
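
The division scheme (one table lookup, two multiplications) can be modeled in software. The sketch below is a reading of the cited Hung et al. algorithm, not the fp_div VHDL: it assumes the table stores 1/Yh² rounded to 2m+2 fraction bits, where Yh is the leading 1 plus the top m fraction bits of Y and Yl is the remainder, and it uses exact rationals so that only the algorithmic error is visible. All function names are hypothetical.

```python
from fractions import Fraction


def build_table(m):
    """Table of 1/Yh**2, one entry per m-bit index, rounded to 2m+2 fraction bits."""
    scale = 1 << (2 * m + 2)
    table = []
    for idx in range(1 << m):
        yh = 1 + Fraction(idx, 1 << m)
        table.append(Fraction(round(scale / (yh * yh)), scale))
    return table


def divide(x, y, m, table):
    """Approximate x / y for x, y in [1, 2) as x * (Yh - Yl) * (1 / Yh**2)."""
    idx = int((y - 1) * (1 << m))        # top m fraction bits of y select the entry
    yh = 1 + Fraction(idx, 1 << m)       # higher-order part of y
    yl = y - yh                          # lower-order part, smaller than 2**-m
    return x * (yh - yl) * table[idx]    # first multiply, lookup, second multiply


m = 8
table = build_table(m)
x = 1 + Fraction(12345, 1 << (2 * m - 1))
y = 1 + Fraction(23456, 1 << (2 * m - 1))
error = abs(divide(x, y, m, table) - x / y)
```

The approximation works because X/Y = X(Yh - Yl)/(Yh² - Yl²), and Yl² is negligible next to Yh² when Yl < 2^-m.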

Square Root – Data Flow
• Reduction: reduce the input Y to a very small number A = 0.00...0 A2 A3 A4
• Evaluation: compute the first terms of the Taylor series, B = f(A)
• Postprocessing: a multiplier combines M and B to produce sqrt(Y)

Square Root – Reduction
[Figure: the top k bits Y(k) of the 4k-bit input Y address two tables, the R table (output R̂, k+1 bits) and the M table (output M, 4k bits). A multiplier forms A = Y·R̂ - 1, a 4k-bit value of the form 0.00...0 A2 A3 A4.]

Square Root - Evaluation
[Figure: A = 0.00...0 A2 A3 A4 (4k bits) is split into the k-bit fields A2, A3, and A4. A squaring unit and multipliers form the products A2*A2 (2k bits) and A2*A3, plus a further product built from A2*A2; a multiple-operand signed adder sums the terms to produce B (4k bits).]
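
The three square root stages (reduction, evaluation, postprocessing) can be sketched in floating point as follows. This is a simplified model and not the fp_sqrt implementation: it assumes R̂ is the reciprocal of the midpoint of the interval selected by the top k bits of Y, rounded to k+1 fraction bits, takes M = 1/sqrt(R̂), and evaluates the Taylor series of sqrt(1 + A) directly on A instead of on the partial products of A2, A3, and A4. The function name is hypothetical.

```python
import math


def sqrt_sketch(y, k=4):
    """Approximate sqrt(y) for y in [1, 2): reduction, evaluation, postprocessing."""
    # Reduction: the top k fraction bits of y select R_hat and M.
    idx = int((y - 1.0) * (1 << k))
    midpoint = 1.0 + (idx + 0.5) / (1 << k)          # assumed table design point
    r_hat = round((1 << (k + 1)) / midpoint) / (1 << (k + 1))
    m = 1.0 / math.sqrt(r_hat)                       # M table entry, kept as a float
    a = y * r_hat - 1.0                              # small: |a| is about 2**-k or less
    # Evaluation: first terms of the Taylor series of sqrt(1 + a).
    b = 1.0 + a / 2 - a * a / 8 + a ** 3 / 16
    # Postprocessing: sqrt(y) = sqrt((1 + a) / r_hat) = M * sqrt(1 + a).
    return m * b


worst = max(abs(sqrt_sketch(1 + i / 1000) - math.sqrt(1 + i / 1000))
            for i in range(1000))
```

Because A is small after the reduction, a few Taylor terms are enough, which is why the hardware needs only small multipliers.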

Our Experiments
• Designs specified in VHDL
• Mapped to Xilinx Virtex II FPGA (XC2V3000)
  – System clock rates up to 300 MHz
  – Density up to 8M system gates
  – 14,336 slices
  – 96 18x18 embedded multipliers
  – 96 18 Kb BlockRAM (1,728 Kb)
  – 448 Kb distributed memory
• Currently targeting Annapolis Wildcard-II

Results - FP Divider on XC2V3000

Floating point format, written as total bits (exponent bits, fraction bits):

                                             8(2,5)     16(4,11)   24(6,17)   32(8,23)
# of slices                                  69 (1%)    110 (1%)   254 (1%)   335 (2%)
Clock period (ns)                            8          10         9          9
Maximum frequency (MHz)                      124        96         108        110
# of clock cycles to obtain final results    10         10         14         14
Latency (ns) = clock period x # of cycles    80         105        129        127
Throughput (million results/second)          124        96         108        110

# of BlockRAM: 1 (1%), 7 (7%)
# of 18x18 embedded multipliers: 2 (2%), 8 (8%)

The last column is the IEEE single precision floating point format.

Results - FP Square Root on XC2V3000

Floating point format, written as total bits (exponent bits, fraction bits):

                                             8(2,5)     16(4,11)   24(6,17)   32(8,23)
# of slices                                  113 (1%)   253 (1%)   338 (2%)   401 (2%)
Clock period (ns)                            10         9          11         12
Maximum frequency (MHz)                      103        112        94         86
# of clock cycles to obtain final results    9          12         13         13
Latency (ns) = clock period x # of cycles    88         107        138        152
Throughput (million results/second)          103        112        94         86

# of BlockRAM: 3 (3%)
# of 18x18 embedded multipliers: 4 (4%), 5 (5%), 9 (9%)
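
The latency and throughput rows follow from the maximum frequency and the cycle count. The sketch below shows the arithmetic under the assumption that a fully pipelined module accepts one operand set per clock; the function names are hypothetical. Values recomputed this way land within about 1 ns of the tabulated latencies, since the table shows a rounded clock period.

```python
def latency_ns(max_freq_mhz, cycles):
    """Latency = clock period x number of cycles, with period = 1000 / f(MHz) ns."""
    return cycles * 1000.0 / max_freq_mhz


def throughput_mresults(max_freq_mhz):
    """A fully pipelined module delivers one result per clock cycle."""
    return max_freq_mhz


div_single = latency_ns(110, 14)    # 32(8,23) divider: 110 MHz, 14 cycles
sqrt_single = latency_ns(86, 13)    # 32(8,23) square root: 86 MHz, 13 cycles
```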

Application: K-means Clustering for Multispectral Satellite Images
[Figure: image spectral data (pixels Xij with i = 0 to I and j = 0 to J, spectral components k = 0 to K) mapped to a clustered image with classes 0 to 4.]
Every pixel Xij is assigned a class cj.

K-means – Iterative Algorithm
• Each cluster has a center (mean value)
  – Initialized on host
  – Initialization done once for complete image processing
• Cluster assignment
  – Distance (Manhattan norm) of each pixel and cluster center
• Accumulation of pixel value of each cluster
• Mean update via dividing the accumulator value by number of pixels (done once per iteration)
  – Previously done on host
  – Moved to FPGA with fp_div
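
One iteration of the steps listed above can be written in a few lines of software. This is a plain reference sketch, not the FPGA datapath: the function name and the data layout (pixels and centers as tuples of spectral components) are assumptions, and the final division stands in for the step that the hardware performs with fp_div.

```python
def kmeans_iteration(pixels, centers):
    """Assign each pixel to the nearest center (Manhattan norm), then update means."""
    k = len(centers)
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in range(k)]     # per-cluster accumulators
    counts = [0] * k                            # pixels per cluster
    labels = []
    for pixel in pixels:
        # Cluster assignment: subtraction, absolute value, addition, comparison.
        dists = [sum(abs(p - c) for p, c in zip(pixel, center)) for center in centers]
        best = dists.index(min(dists))
        labels.append(best)
        counts[best] += 1
        for d in range(dims):
            sums[best][d] += pixel[d]
    # Mean update: one division per cluster and component, once per iteration.
    new_centers = [
        [s / counts[i] for s in sums[i]] if counts[i] else list(centers[i])
        for i in range(k)
    ]
    return labels, new_centers


labels, centers = kmeans_iteration(
    [(0, 0), (2, 0), (10, 10), (12, 10)], [(1, 1), (9, 9)]
)
```

An empty cluster keeps its previous center here; how the hardware handles that case is not stated in the slides.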

K-means Clustering – Functional Units
[Figure: functional units of the design. Memory supplies pixel data and cluster centers, with Validity, Acknowledge, and Data Valid signals. The datapath contains a pixel shift stage followed by subtraction, absolute value, addition, and comparison for cluster assignment, then the accumulators, and division for the mean update.]

Conclusion
• A library of fully pipelined and parameterized hardware modules for floating point arithmetic
• Flexibility in forming custom floating point formats
• New modules fp_div and fp_sqrt have small area and low latency, and are easily pipelined
• K-means clustering algorithm applied to multispectral satellite images makes use of fp_div

Future Work
• More applications using
  – fp_div and fp_sqrt
• New library modules
  – ACC, MAC, INV_SQRT
• Use the floating point library to implement a floating point coprocessor on an FPGA with an embedded processor