Speeding Up Floating Point Division With Inlined Iterative

  • Slides: 18
Download presentation
Speeding Up Floating. Point Division With Inlined Iterative Algorithms Robert Enenkel, Allan Martin IBM®

Speeding Up Floating. Point Division With Inlined Iterative Algorithms Robert Enenkel, Allan Martin IBM® Toronto Lab

Outline §Hardware floating-point division §The case for software division §Software division algorithms §Special cases/tradeoffs

Outline §Hardware floating-point division §The case for software division §Software division algorithms §Special cases/tradeoffs §Performance results §Automatic generation © Copyright IBM Corp. 2005

Hardware Division §PPC fdiv, fdivs §Advantages ƒ accurate (correctly rounded) ƒ handles exceptional cases

Hardware Division §PPC fdiv, fdivs §Advantages ƒ accurate (correctly rounded) ƒ handles exceptional cases (Inf, Na. N) ƒ lower latency than SW §Disadvantages ƒ occupies FPU completely ƒ inhibits parallelism © Copyright IBM Corp. 2005

Alternatives to HW division §Vector libraries ƒ MASS ƒ higher overhead, greater speedup §In-lined

Alternatives to HW division §Vector libraries ƒ MASS ƒ higher overhead, greater speedup §In-lined software division ƒ low overhead, medium speedup © Copyright IBM Corp. 2005

Rationale for Software Division §Write SW division algorithm in terms of HW arithmetic instructions

Rationale for Software Division §Write SW division algorithm in terms of HW arithmetic instructions ƒ Newton's method or Taylor series §Latency will be higher than HW division §But. . . SW instructions can be interleaved, so throughput may be better §Requires enough independent instructions to interleave ƒ loop of divisions ƒ other work © Copyright IBM Corp. 2005

Newton's Method §To find x such that f(x) = 0, §Initial guess x 0

Newton's Method §To find x such that f(x) = 0, §Initial guess x 0 §xn+1 = xn - f(xn)/f'(xn), n=0, 1, 2, . . . §Provided x 0 is close enough ƒ xn converges to x ƒ It converges quadratically |xn+1 -x| < c|xn-x|^2 ƒ Number of bits of accuracy doubles with each iteration © Copyright IBM Corp. 2005

Newton's Method © Copyright IBM Corp. 2005

Newton's Method © Copyright IBM Corp. 2005

Newton Iteration for Division §For 1/b, let f(x) = 1/x - b §For a/b,

Newton Iteration for Division §For 1/b, let f(x) = 1/x - b §For a/b, use a*(1/b) or f(x) = a/x - b §Algorithm for 1/b ƒ x 0 ~ 1/b initial guess ƒ e 0 = 1 - b*y 0 ƒ x 1 = x 0 + e 0*x 0 ƒ e 1 = e 0*e 0 ƒ x 2 = x 1 + e 1*x 1 ƒ etc. . . © Copyright IBM Corp. 2005

How Many Iterations Needed? §Power 5 reciprocal estimate instructions ƒ FRES (single precision), FRE

How Many Iterations Needed? §Power 5 reciprocal estimate instructions ƒ FRES (single precision), FRE (double prec. ) ƒ |relative error| <= 2^(-8) §Floating-point precision ƒ single: 24 bits ƒ double: 53 bits §Newton iterations ƒ error: 2^(-16), 2^(-32), 2^(-64), 2^(-128) ƒ single: 2 iterations for 1 ulp ƒ double: 3 iterations for 1 ulp © Copyright IBM Corp. 2005 ƒ +1 iteration for correct rounding (0. 5 ulps)

Taylor Series for Reciprocal §x 0 ~ 1/b initial guess §e = 1 -

Taylor Series for Reciprocal §x 0 ~ 1/b initial guess §e = 1 - b x 0 § 1/b = x 0/(b x 0) = x 0 (1/(1 -e)) = x 0 (1 + e^2 + e^3 + e^4 +. . . ) §Algorithm (6 terms) ƒ e = 1 - d*x 0 ƒ t 1 = 0. 5 + e * e ƒ q 1 = x 0 + x 0 * e ƒ t 2 = 0. 75 + t 1*t 1 ƒ t 3 = q 1*e ƒ q 2 = x 0 + t 2*t 3 © Copyright IBM Corp. 2005

Speed/Accuracy tradeoff §IBM compilers have -qstrict/-qnostrict §-qstrict: SW result should match HW division exactly

Speed/Accuracy tradeoff §IBM compilers have -qstrict/-qnostrict §-qstrict: SW result should match HW division exactly §-qnostrict: SW result may be slightly less accurate for speed © Copyright IBM Corp. 2005

Exceptions §Even when a/b is representable. . . § 1/b may underflow ƒ a

Exceptions §Even when a/b is representable. . . § 1/b may underflow ƒ a ~ b ~ huge, a/b ~ 1, 1/b denormalized ƒ Causes loss of accuracy § 1/b may overflow ƒ a, b denormalized, a/b ~ 1, 1/b = Inf ƒ Causes SW algorithm to produce Na. N §Handle with tests in algorithm ƒ Use HW divide for exceptional cases © Copyright IBM Corp. 2005

Algorithm variations §User callable built-in functions ƒ swdiv(a, b): double precision, checking ƒ swdivs(a,

Algorithm variations §User callable built-in functions ƒ swdiv(a, b): double precision, checking ƒ swdivs(a, b): single precision, checking ƒ swdiv_nochk(a, b): double, non-checking ƒ swdivs_nochk(a, b): single, non-checking §Accuracy of swdiv, swdiv_nochk depends on -qstrict/-qnostrict §_nochk versions faster but have argument restrictions © Copyright IBM Corp. 2005

Accuracy and Performance swdivs Power 5 Power 4 speedup ratio ulps max error 1.

Accuracy and Performance swdivs Power 5 Power 4 speedup ratio ulps max error 1. 07 1. 05 0. 5 swdivs_nochk 1. 46 1. 28 0. 5 swdiv strict 1. 05 0. 5 swdiv nostrict 1. 50 1. 5 swdiv_nochk strict 1. 51 0. 5 swdiv_nochk nostrict 1. 77 1. 5 © Copyright IBM Corp. 2005 0. 5

Automatic Generation of Software Division §The swdivs and swdiv algorithms can also be automatically

Automatic Generation of Software Division §The swdivs and swdiv algorithms can also be automatically generated by the compiler §Compiler can detect situations where throughput is more important than latency © Copyright IBM Corp. 2005

Automatic Generation of Software Division §In straight-line code, we use a heuristic that calculates

Automatic Generation of Software Division §In straight-line code, we use a heuristic that calculates how much FP can be executed in parallel ƒ independent instructions are good, especially other divides ƒ dependent instructions are bad (they increase latency) © Copyright IBM Corp. 2005

Automatic Generation of Software Division §In modulo scheduled loops software-divide code can be pipelined,

Automatic Generation of Software Division §In modulo scheduled loops software-divide code can be pipelined, interleaving multiple iterations §Divides are expanded if divide does not appear in a recurrence (cyclic datadependence) © Copyright IBM Corp. 2005

Summary §Software divide algorithms ƒ user callable ƒ compiler generated §Loops of divides ƒ

Summary §Software divide algorithms ƒ user callable ƒ compiler generated §Loops of divides ƒ up to 1. 77 x speedup §UMT 2 K benchmark ƒ 1. 19 x speedup © Copyright IBM Corp. 2005