Speeding Up Floating Point Division With Inlined Iterative
- Slides: 18
Speeding Up Floating. Point Division With Inlined Iterative Algorithms Robert Enenkel, Allan Martin IBM® Toronto Lab
Outline §Hardware floating-point division §The case for software division §Software division algorithms §Special cases/tradeoffs §Performance results §Automatic generation © Copyright IBM Corp. 2005
Hardware Division §PPC fdiv, fdivs §Advantages ƒ accurate (correctly rounded) ƒ handles exceptional cases (Inf, Na. N) ƒ lower latency than SW §Disadvantages ƒ occupies FPU completely ƒ inhibits parallelism © Copyright IBM Corp. 2005
Alternatives to HW division §Vector libraries ƒ MASS ƒ higher overhead, greater speedup §In-lined software division ƒ low overhead, medium speedup © Copyright IBM Corp. 2005
Rationale for Software Division §Write SW division algorithm in terms of HW arithmetic instructions ƒ Newton's method or Taylor series §Latency will be higher than HW division §But. . . SW instructions can be interleaved, so throughput may be better §Requires enough independent instructions to interleave ƒ loop of divisions ƒ other work © Copyright IBM Corp. 2005
Newton's Method §To find x such that f(x) = 0, §Initial guess x 0 §xn+1 = xn - f(xn)/f'(xn), n=0, 1, 2, . . . §Provided x 0 is close enough ƒ xn converges to x ƒ It converges quadratically |xn+1 -x| < c|xn-x|^2 ƒ Number of bits of accuracy doubles with each iteration © Copyright IBM Corp. 2005
Newton's Method © Copyright IBM Corp. 2005
Newton Iteration for Division §For 1/b, let f(x) = 1/x - b §For a/b, use a*(1/b) or f(x) = a/x - b §Algorithm for 1/b ƒ x 0 ~ 1/b initial guess ƒ e 0 = 1 - b*y 0 ƒ x 1 = x 0 + e 0*x 0 ƒ e 1 = e 0*e 0 ƒ x 2 = x 1 + e 1*x 1 ƒ etc. . . © Copyright IBM Corp. 2005
How Many Iterations Needed? §Power 5 reciprocal estimate instructions ƒ FRES (single precision), FRE (double prec. ) ƒ |relative error| <= 2^(-8) §Floating-point precision ƒ single: 24 bits ƒ double: 53 bits §Newton iterations ƒ error: 2^(-16), 2^(-32), 2^(-64), 2^(-128) ƒ single: 2 iterations for 1 ulp ƒ double: 3 iterations for 1 ulp © Copyright IBM Corp. 2005 ƒ +1 iteration for correct rounding (0. 5 ulps)
Taylor Series for Reciprocal §x 0 ~ 1/b initial guess §e = 1 - b x 0 § 1/b = x 0/(b x 0) = x 0 (1/(1 -e)) = x 0 (1 + e^2 + e^3 + e^4 +. . . ) §Algorithm (6 terms) ƒ e = 1 - d*x 0 ƒ t 1 = 0. 5 + e * e ƒ q 1 = x 0 + x 0 * e ƒ t 2 = 0. 75 + t 1*t 1 ƒ t 3 = q 1*e ƒ q 2 = x 0 + t 2*t 3 © Copyright IBM Corp. 2005
Speed/Accuracy tradeoff §IBM compilers have -qstrict/-qnostrict §-qstrict: SW result should match HW division exactly §-qnostrict: SW result may be slightly less accurate for speed © Copyright IBM Corp. 2005
Exceptions §Even when a/b is representable. . . § 1/b may underflow ƒ a ~ b ~ huge, a/b ~ 1, 1/b denormalized ƒ Causes loss of accuracy § 1/b may overflow ƒ a, b denormalized, a/b ~ 1, 1/b = Inf ƒ Causes SW algorithm to produce Na. N §Handle with tests in algorithm ƒ Use HW divide for exceptional cases © Copyright IBM Corp. 2005
Algorithm variations §User callable built-in functions ƒ swdiv(a, b): double precision, checking ƒ swdivs(a, b): single precision, checking ƒ swdiv_nochk(a, b): double, non-checking ƒ swdivs_nochk(a, b): single, non-checking §Accuracy of swdiv, swdiv_nochk depends on -qstrict/-qnostrict §_nochk versions faster but have argument restrictions © Copyright IBM Corp. 2005
Accuracy and Performance swdivs Power 5 Power 4 speedup ratio ulps max error 1. 07 1. 05 0. 5 swdivs_nochk 1. 46 1. 28 0. 5 swdiv strict 1. 05 0. 5 swdiv nostrict 1. 50 1. 5 swdiv_nochk strict 1. 51 0. 5 swdiv_nochk nostrict 1. 77 1. 5 © Copyright IBM Corp. 2005 0. 5
Automatic Generation of Software Division §The swdivs and swdiv algorithms can also be automatically generated by the compiler §Compiler can detect situations where throughput is more important than latency © Copyright IBM Corp. 2005
Automatic Generation of Software Division §In straight-line code, we use a heuristic that calculates how much FP can be executed in parallel ƒ independent instructions are good, especially other divides ƒ dependent instructions are bad (they increase latency) © Copyright IBM Corp. 2005
Automatic Generation of Software Division §In modulo scheduled loops software-divide code can be pipelined, interleaving multiple iterations §Divides are expanded if divide does not appear in a recurrence (cyclic datadependence) © Copyright IBM Corp. 2005
Summary §Software divide algorithms ƒ user callable ƒ compiler generated §Loops of divides ƒ up to 1. 77 x speedup §UMT 2 K benchmark ƒ 1. 19 x speedup © Copyright IBM Corp. 2005
- Floating point division algorithm in computer architecture
- Floating point division
- Fixed point vs floating point
- Convert to floating point
- Floating point number representation
- Mips read float
- Bit4bytes
- Point format example
- Floating-point number
- Denormalized float
- Floating point arithmetic examples
- Floating point representation
- Floating-point number
- Fp adder hardware
- Parts of a floating point number
- Parts of a floating point number
- Floating point pipeline stages of pentium processor
- Xkcd floating point
- Dfa for floating point numbers