A Novel Time-Area-Power Efficient Single Precision Floating Point Multiplier

A Novel Time-Area-Power Efficient Single Precision Floating Point Multiplier
Himanshu Thapliyal and M. B. Srinivas (thapliyalhimanshu@yahoo.com, srinivas@iiit.net)
Center for VLSI and Embedded System Technologies, International Institute of Information Technology, Hyderabad-500019, India
Thapliyal, MAPLD 2005/1013-M

Abstract
• This paper presents a single-precision IEEE 754 floating-point multiplier with high speed and low power. The bottleneck of any single-precision floating-point design is the 24 x 24 bit integer multiplier. The Urdhva Tiryakbhyam algorithm of ancient Indian Vedic Mathematics is used to improve its efficiency.
• In the proposed architecture, the 24 x 24 bit multiplication is split into four parallel 12 x 12 bit multiplication modules. The 12 x 12 modules are built from small 4 x 4 bit multipliers. In the unsigned 24 x 24 bit multiplier architecture, four redundant 4 x 4 multipliers provide self-repairability (recovery from faults in each 12 x 12 multiply module). Reconfigurability at run time is provided to save power.
• The multiplier has been designed, optimized and implemented on an FPGA-based system. The result is a highly regular, self-repairable, parallel floating-point multiplier architecture that can be scaled directly to larger multiplications.

INTRODUCTION
IEEE 754 standard for floating point
Sign bit | 8-bit exponent | 23-bit mantissa
Normalized form: $1.b_1 b_2 b_3 \ldots b_{23} \times 2^{exp}$, where the leading 1 is the hidden bit
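As an illustration of the field layout above, the following minimal Python sketch splits a single-precision value into its sign, biased exponent and 23 stored mantissa bits, and then restores the hidden bit to form the 24-bit significand that feeds the integer multiplier. The function names `unpack_float32` and `significand24` are hypothetical and do not come from the slides; only normal (non-denormal) numbers get the hidden 1.

```python
import struct

def unpack_float32(x):
    """Round x to IEEE 754 single precision and return (sign, exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8-bit exponent, biased by 127
    fraction = bits & 0x7FFFFF        # 23 stored mantissa bits b1..b23
    return sign, exponent, fraction

def significand24(exponent, fraction):
    """Prepend the hidden 1 for normal numbers (illustrative helper)."""
    if exponent != 0:
        return (1 << 23) | fraction
    return fraction                   # denormals and zero have no hidden 1

sign, exponent, fraction = unpack_float32(-6.5)   # -6.5 = -1.625 * 2**2
print(sign, exponent - 127, hex(significand24(exponent, fraction)))
```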

Addition
• Subtract exponents
  – Compare
• Right-shift the significand with the smaller exponent
  – By the difference
• Add/subtract significands (sign bits)
• Normalize
  – Shift significand
  – Add or subtract shift amount to exponent
• Round
  – To number of bits for significand
  – Need to keep extra bits during computation
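The addition steps can be sketched in Python for a restricted case. This is a simplified illustration, not the datapath from the slides: it assumes two positive, normal single-precision operands, no overflow, and round-to-nearest-even using three extra bits (guard, round, sticky). All function names are hypothetical.

```python
import struct

def f32_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_f32(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]

def add_positive_normals(a_bits, b_bits, extra=3):
    """Add two positive normal float32 bit patterns (assumed, simplified case)."""
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    # Restore the hidden bit and append the extra low-order bits.
    sa = ((a_bits & 0x7FFFFF) | 0x800000) << extra
    sb = ((b_bits & 0x7FFFFF) | 0x800000) << extra
    # Step 1: compare exponents so that (ea, sa) is the larger operand.
    if ea < eb:
        ea, eb, sa, sb = eb, ea, sb, sa
    diff = ea - eb
    # Step 2: right-shift the smaller significand; OR lost bits into the sticky bit.
    if diff:
        lost = sb & ((1 << diff) - 1)
        sb = (sb >> diff) | (1 if lost else 0)
    # Step 3: add significands (same sign assumed).
    total, exp = sa + sb, ea
    # Step 4: normalize on carry-out, keeping the sticky information.
    if total >> (24 + extra):
        total = (total >> 1) | (total & 1)
        exp += 1
    # Step 5: round to nearest, ties to even.
    sig, tail, half = total >> extra, total & ((1 << extra) - 1), 1 << (extra - 1)
    if tail > half or (tail == half and sig & 1):
        sig += 1
        if sig >> 24:          # rounding overflowed the 24-bit significand
            sig >>= 1
            exp += 1
    return (exp << 23) | (sig & 0x7FFFFF)

print(bits_f32(add_positive_normals(f32_bits(1.5), f32_bits(2.25))))
```

The extra bits are what the last bullet refers to: without them, the tie and sticky cases in step 5 could not be distinguished.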

Addition
[Figure: floating-point adder block diagram. Blocks shown: 8-bit subtractor (exponent difference), control, MUXes, shifter, 24-bit ALU, increment/decrement, shifter (normalize), round.]

Multiplication
[Figure: floating-point multiplier block diagram. Blocks shown: two 8-bit adders (exponents), MUXes, 24-bit multiplier, control, shifter (normalize), round.]
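The multiplier datapath can be approximated in software as follows. This is a sketch under stated assumptions rather than the circuit in the figure: inputs and result are assumed to be normal numbers (no zero, denormal, infinity or NaN handling), rounding is round-to-nearest-even, and the function names are hypothetical. The 24 x 24 integer product in the middle is the operation the rest of the presentation optimizes.

```python
import struct

def f32_bits(x):
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_f32(b):
    return struct.unpack(">f", struct.pack(">I", b))[0]

def multiply_normals(a_bits, b_bits):
    """Multiply two normal float32 bit patterns (assumed, simplified case)."""
    sign = (a_bits ^ b_bits) >> 31
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    sa = (a_bits & 0x7FFFFF) | 0x800000      # 24-bit significands with hidden bit
    sb = (b_bits & 0x7FFFFF) | 0x800000
    exp = ea + eb - 127                      # add exponents, remove one bias
    product = sa * sb                        # 24 x 24 -> 48-bit integer product
    # Normalize: the product lies in [2**46, 2**48).
    if product >> 47:
        shift = 24
        exp += 1
    else:
        shift = 23
    sig = product >> shift
    rem = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest, ties to even.
    if rem > half or (rem == half and sig & 1):
        sig += 1
        if sig >> 24:
            sig >>= 1
            exp += 1
    return (sign << 31) | (exp << 23) | (sig & 0x7FFFFF)

print(bits_f32(multiply_normals(f32_bits(1.5), f32_bits(-1.5))))
```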

Conventional 24 x 24 Multiply Architectures Implemented in Floating Point Multipliers
• Array multiplier
• Redundant binary architectures (pipeline stages)
• Modified Booth encoding and a binary tree of 4:2 compressors (Wallace tree)
• Modified carry-save array in conjunction with Booth's algorithm

Drawbacks of Conventional 24 x 24 Multiply Architectures
Tree multipliers
– Shortest logic delay, but irregular layouts with complicated interconnects.
– Irregular layouts not only demand more physical design effort, but also introduce significant interconnect delay and make noise a problem because of several types of wiring capacitance.
– The interconnect delay dominates, which makes these structures unsuitable for VLSI implementation.
– High power consumption, because the multiplier is not reconfigured at run time according to the input data width.
Array multipliers
– Larger delay, but a regular layout with simpler interconnects.
– Interconnects become important in deep-submicron design, so structures with regular layout and simple interconnects are preferable.
– High power consumption, because the multiplier is not reconfigured at run time according to the input bit width.

Novel Contribution For Designing Floating Point Multiplier
• The Urdhva Tiryakbhyam algorithm of ancient Indian Vedic Mathematics is utilized.
• The 24 x 24 bit mantissa multiplication is split into four parallel 12 x 12 bit multiplication modules.
• The 12 x 12 multiplication modules are implemented using small 4 x 4 bit multipliers, so the whole 24 x 24 bit multiplication is divided into 36 4 x 4 multiply modules working in parallel.
• Four redundant 4 x 4 multipliers provide self-repairability. Each redundant 4 x 4 multiplier takes care of a fault in one of the 12 x 12 multipliers.
• Power is saved because a 4 x 4 module that gives an erroneous result is cut off from the power supply. When there is no fault, the corresponding redundant 4 x 4 multiplier is switched off instead.
• The proposed architecture introduces reconfigurability at run time. When the mantissa is 12 bits wide, only one 12 x 12 multiplication block needs to be enabled through the control circuitry. The other three 12 x 12 multiplication blocks can be switched off during the computation, which saves a large amount of power.
• Run-time reconfigurability is also extended to 8-bit and 4-bit mantissas, reducing power consumption further.

TABLE 1 - Example of 16 x 16 bit multiplication Using Urdhva Tiryakbhyam
CP = Cross Product (Vertically and Crosswise)

A = A15 A14 A13 A12 | A11 A10 A9 A8 | A7 A6 A5 A4 | A3 A2 A1 A0
           X3       |      X2       |     X1      |     X0
B = B15 B14 B13 B12 | B11 B10 B9 B8 | B7 B6 B5 B4 | B3 B2 B1 B0
           Y3       |      Y2       |     Y1      |     Y0

            X3 X2 X1 X0      Multiplicand [16 bits]
            Y3 Y2 Y1 Y0      Multiplier [16 bits]
---------------------------------
J  I  H  G  F  E  D  C
P7 P6 P5 P4 P3 P2 P1 P0      Product [32 bits]

Where X3, X2, X1, X0, Y3, Y2, Y1 and Y0 are each of 4 bits.

PARALLEL COMPUTATION & METHODOLOGY
1. CP (X0 ; Y0)                    = X0*Y0                         = A
2. CP (X1 X0 ; Y1 Y0)              = X1*Y0 + X0*Y1                 = B
3. CP (X2 X1 X0 ; Y2 Y1 Y0)        = X2*Y0 + X0*Y2 + X1*Y1         = C
4. CP (X3 X2 X1 X0 ; Y3 Y2 Y1 Y0)  = X3*Y0 + X0*Y3 + X2*Y1 + X1*Y2 = D
5. CP (X3 X2 X1 ; Y3 Y2 Y1)        = X3*Y1 + X1*Y3 + X2*Y2         = E
6. CP (X3 X2 ; Y3 Y2)              = X3*Y2 + X2*Y3                 = F
7. CP (X3 ; Y3)                    = X3*Y3                         = G

Note: Each multiplication operation is an embedded parallel 4 x 4 multiply module.
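The seven cross products of Table 1 can be checked with a short Python sketch. It assumes that cross product number k+1 carries a weight of 2^(4k), which is what the 4-bit grouping implies, and it models each 4 x 4 multiply as an ordinary integer product. The function names are hypothetical.

```python
def split_nibbles(value, groups=4):
    """Return the 4-bit groups of value, least significant first (X0, X1, ...)."""
    return [(value >> (4 * i)) & 0xF for i in range(groups)]

def urdhva_16x16(a, b):
    """Multiply two 16-bit values through the seven cross products A..G."""
    x, y = split_nibbles(a), split_nibbles(b)
    cross_products = []
    for k in range(7):
        # Step k+1 sums every Xi*Yj with i + j == k; each term is one 4 x 4 multiply.
        cp = sum(x[i] * y[k - i] for i in range(4) if 0 <= k - i < 4)
        cross_products.append(cp)
    # Each cross product is weighted by 16**k before the final addition.
    product = sum(cp << (4 * k) for k, cp in enumerate(cross_products))
    return product, cross_products

product, cps = urdhva_16x16(0x1234, 0xBEEF)
print(hex(product), cps)
```

Because all sixteen 4 x 4 products are independent, hardware can compute them in parallel and leave only the weighted addition on the critical path.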

Proposed 24 x 24 bit Architecture
• Reconfigurability at run time is provided, with the output of the Checker acting as a control signal.
• If the mantissa of A or B is only 12 bits wide, the Checker detects this and uses the control signal to switch off the multiply blocks that are not needed.
• Thus significant power saving can be obtained at run time.
• The reconfigurability has also been extended to the individual 12 x 12 multiply modules, as shown next.
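A behavioural sketch of this block-level gating is given below. It is an assumption-laden model, not the circuit from the slides: the Checker is modelled as a zero test on each 12-bit half, and a 12 x 12 block counts as "switched off" when either of its inputs is zero. The block names and the function name are hypothetical.

```python
def multiply_24x24(a, b):
    """Build a 24 x 24 product from four 12 x 12 blocks, skipping unused blocks."""
    a_hi, a_lo = a >> 12, a & 0xFFF
    b_hi, b_lo = b >> 12, b & 0xFFF
    # (operand half, operand half, weight of the partial product in bits)
    blocks = {
        "hi_hi": (a_hi, b_hi, 24),
        "hi_lo": (a_hi, b_lo, 12),
        "lo_hi": (a_lo, b_hi, 12),
        "lo_lo": (a_lo, b_lo, 0),
    }
    product, enabled = 0, []
    for name, (p, q, shift) in blocks.items():
        if p and q:                      # modelled Checker: enable only if both halves are used
            enabled.append(name)
            product += (p * q) << shift
    return product, enabled

print(multiply_24x24(0xABC, 0x123))       # both operands fit in 12 bits
print(multiply_24x24(0xFEDCBA, 0x123))    # only B fits in 12 bits
```

In this model two 12-bit operands leave one block enabled, while a single 12-bit operand leaves two enabled.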

Internal structure of Individual 12 x 12 multiply module
• The 12-bit operands A and B are divided into 4-bit groups A3, A2, A1 and B3, B2, B1 respectively.
• Checkers at A3, A2 and B3, B2 determine whether the mantissas to be multiplied are 12 bits, 8 bits or 4 bits wide, and switch the required 4 x 4 multiply modules on or off accordingly.
• Thus power consumption is significantly reduced when the mantissas to be multiplied are narrower than 12 bits.
• Self-repairability at run time is also provided by adding a redundant 4 x 4 multiply module to each 12 x 12 multiply module, as shown in the next slide.
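The width check inside a 12 x 12 module can be modelled as follows. The sketch assumes the checkers simply test whether the upper 4-bit groups are zero, and it reports how many of the nine 4 x 4 modules stay enabled. Groups are indexed from 0 in the code, so index 0 corresponds to A1/B1 on the slide and index 2 to A3/B3. The function names are hypothetical.

```python
def active_groups(value):
    """Modelled checkers on the two upper 4-bit groups of a 12-bit operand."""
    if value >> 8:      # top group (A3 or B3) is in use: 12-bit operand
        return 3
    if value >> 4:      # middle group (A2 or B2) is in use: 8-bit operand
        return 2
    return 1            # only the lowest group is needed: 4-bit operand

def multiply_12x12(a, b):
    """Return (product, number of 4 x 4 modules left enabled)."""
    groups_a, groups_b = active_groups(a), active_groups(b)
    product = 0
    for i in range(groups_a):
        for j in range(groups_b):
            a_i = (a >> (4 * i)) & 0xF
            b_j = (b >> (4 * j)) & 0xF
            product += (a_i * b_j) << (4 * (i + j))   # one 4 x 4 multiply module
    return product, groups_a * groups_b

for a, b in [(0xFFF, 0xFFF), (0x0F3, 0x0A7), (0x7, 0x9)]:
    print(hex(a), hex(b), multiply_12x12(a, b))
```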

Feature of Self Repairability
• P = A x B, where A and B are 12 bits each: A = A3 A2 A1 and B = B3 B2 B1, with A3, A2, A1, B3, B2, B1 each of 4 bits.
• A redundant 4 x 4 multiplier is added to each 12 x 12 multiply module to provide self-repairability.
• The product of the redundant multiplier is distributed to all 4 x 4 multipliers.
• The 4 x 4 multiplier to be repaired is specified by the given Aij, Bij and E bits.
• The multiplier under repair abandons its own output and replaces it with the one from the redundant multiplier.
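The repair mechanism can be illustrated with a small fault-injection model. Everything here is an assumption made for illustration: the fault is modelled as a 4 x 4 module whose output is stuck at zero, and the `repair` argument stands in for the Aij, Bij and E select bits, whose exact encoding the slides do not give. The function name and parameters are hypothetical.

```python
def multiply_12x12_repairable(a, b, faulty=None, repair=None):
    """12 x 12 multiply from nine 4 x 4 modules plus one redundant module.

    faulty: (i, j) index of a module whose output is stuck at 0 (modelled fault).
    repair: (i, j) index of the module whose output is replaced by the spare.
    """
    a_groups = [(a >> (4 * i)) & 0xF for i in range(3)]
    b_groups = [(b >> (4 * j)) & 0xF for j in range(3)]
    # The redundant module computes the product for the selected position only.
    spare = None
    if repair is not None:
        spare = a_groups[repair[0]] * b_groups[repair[1]]
    product = 0
    for i in range(3):
        for j in range(3):
            out = a_groups[i] * b_groups[j]
            if (i, j) == faulty:
                out = 0                  # stuck-at-0 output of the faulty module
            if (i, j) == repair:
                out = spare              # abandon own output, take the spare's
            product += out << (4 * (i + j))
    return product

a, b = 0xABC, 0xDEF
print(multiply_12x12_repairable(a, b, faulty=(1, 2)) == a * b)                  # fault visible
print(multiply_12x12_repairable(a, b, faulty=(1, 2), repair=(1, 2)) == a * b)   # fault masked
```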

Verification and Implementation
• The algorithms and architecture are implemented in Verilog HDL and simulated in the ModelSim simulator.
• The code is synthesized in Xilinx ISE Foundation 6.3. The designs are optimized for speed on Xilinx, Device Family: VirtexE, Device: XCV300e, Package: bg432, Speed grade: -8.
• The designs are completely technology independent and can be easily converted from one technology to another.

Results and Discussion
Table: Synthesis Results of the Proposed Floating Point Multiplier Architecture

Name of Multiplier | Vendor | Device Family & Device | Package | Speed Grade | Cell Use | Estimated Delay (ns)
Proposed Multiplier Without Reconfigurability | Xilinx | VirtexE xcv300e | bg432 | -8 | 2967 | 37.553
Proposed Multiplier With Reconfigurability | Xilinx | VirtexE xcv300e | bg432 | -8 | 3149 | 41.203
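To put the two synthesis rows side by side, the relative cost of adding reconfigurability can be computed directly from the table values. The snippet below is plain arithmetic on the reported numbers; the helper name `overhead_percent` is hypothetical.

```python
def overhead_percent(base, new):
    """Relative increase of new over base, in percent."""
    return 100.0 * (new - base) / base

# Values taken from the synthesis table (without vs. with reconfigurability).
area_overhead = overhead_percent(2967, 3149)        # cell use
delay_overhead = overhead_percent(37.553, 41.203)   # estimated delay in ns

print(f"cell use: +{area_overhead:.1f}%, delay: +{delay_overhead:.1f}%")
```

By this calculation, reconfigurability costs roughly 6% more cells and just under 10% more estimated delay.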

Conclusions
• The results obtained are quite encouraging.
• The proposed logic adds little to the area and delay of the floating-point multiplier.
• Significant power saving is now possible in the multiplier through run-time reconfigurability.
• Self-repairability allows the multiplier to recover from logic faults (stuck-at faults) in any of the 36 4 x 4 multipliers.
• The proposed architecture can be extended to higher precision.
• Work on a novel exhaustive DFT technique for the proposed multiplier is in progress.
