Introduction and Floating-point Numbers Douglas Wilhelm Harder, M.Math. LEL Department of Electrical and Computer Engineering University of Waterloo, Ontario, Canada ece.uwaterloo.ca dwharder@alumni.uwaterloo.ca © 2012 by Douglas Wilhelm Harder. Some rights reserved.
Introduction and Floating-point Numbers Outline This topic discusses numerical methods: – We will give a quick overview of the tools used • Iteration • Linear algebra • Interpolation • Taylor series • Bracketing – Landau symbols and floating-point numbers – Topics to be covered in NE 216 and NE 217 • IVPs in NE 216 • PDEs in NE 217 2
Introduction and Floating-point Numbers Five Tools for Numerical Methods For many numerical algorithms, we use one or more of five tools: – Iteration – Linear algebra – Interpolation – Taylor series – Bracketing All numerical algorithms you have seen use these techniques in some combination 3
Introduction and Floating-point Numbers Iteration The first tool used by almost all numerical algorithms is iteration – We have a problem for which we are looking for a solution x* – We start with an initial approximation or guess, call it x0 – We develop an algorithm or formula that takes a given approximation xk and hopefully produces a better approximation xk+1 of x* – We need mechanisms to indicate: • When our guess is “good enough” • When it is pointless to continue using a particular algorithm 4
Introduction and Floating-point Numbers Iteration The easiest example is an application of the fixed-point theorem Given a problem of the form x = f(x) there are certain conditions (local contraction mappings) under which this may be solved by starting with an initial guess x0 and iterating: xk+1 = f(xk) 5
Introduction and Floating-point Numbers Iteration Let’s look at x = cos(x) Plotting this, we see that x0 = 0.7388 is likely a good initial approximation 6
Introduction and Floating-point Numbers Iteration >> x = 0.7388000000 >> for i = 1:20 x = cos( x ) end x* = 0.739085133215160641655312 0.739277172332048 0.738955759728378 0.739172274566653 0.739026430946466 0.739124674496043 0.739058497154931 0.739103075323556 0.739073047076154 0.739093274529801 0.739079649103192 0.739088827367838 0.739082644784437 0.739086809449742 0.739084004082345 0.739085893812137 0.739084620867674 0.739085478338494 0.739084900735888 0.739085289815975 0.739085027726959 7
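For readers following along outside Matlab, the same fixed-point iteration can be sketched in Python (an illustrative translation, not part of the course's Matlab code):

```python
import math

# Fixed-point iteration for x = cos(x), starting from x0 = 0.7388.
x = 0.7388
for _ in range(20):
    x = math.cos(x)

# After 20 iterations, x has oscillated in toward x* = 0.7390851332...
print(x)
```

As in the Matlab session, successive iterates alternate above and below the fixed point and close in on it slowly.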
Introduction and Floating-point Numbers Iteration >> x = 0.7388000000 x* = 0.739085133215160641655312 >> for i = 1:3 x = x + (cos(x) - x)/(sin(x) + 1) end 0.739085151172220 0.739085133215161 Note: some algorithms are better than others… 8
Introduction and Floating-point Numbers Selecting Initial Points Note: there is no such thing as a good numerical algorithm that can find a solution with any random initial point – Perhaps a mathematician cares about such properties, but if you as an engineer do not have a reasonable idea of what a good starting point is, you have a problem… – In this course, where necessary, you will be given initial points unless you are looking at a problem where you are explicitly asked to estimate an initial point 9
Introduction and Floating-point Numbers Linear Algebra The next tool used is linear algebra – Most interesting systems contain more than one variable – Usually this results in a system of either non-linear or linear equations – The only systems that we can easily solve are linear systems – Systems of non-linear equations can be solved by approximating them by systems of linear equations and iterating 10
Introduction and Floating-point Numbers Linear Algebra In one dimension, this is what Newton’s method does: – Given a non-linear function f and a point xk, – Find a tangent to the function at (xk, f(xk)) 11
Introduction and Floating-point Numbers Linear Algebra The equation of a line with a slope m at a point (xk, yk) is y = m(x – xk) + yk Finding the root of this line requires only simple algebra: x = xk – yk/m In this case, m = f′(xk) and yk = f(xk), so we have: xk+1 = xk – f(xk)/f′(xk) 12
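The update xk+1 = xk – f(xk)/f′(xk) applied to the earlier example x = cos(x), rewritten as the root of f(x) = cos(x) – x, can be sketched as follows (the helper name `newton` is illustrative):

```python
import math

def newton(f, df, x, n=3):
    """Newton's method: repeatedly replace x by the root of the
    tangent line, x <- x - f(x)/f'(x)."""
    for _ in range(n):
        x = x - f(x) / df(x)
    return x

# f(x) = cos(x) - x, so f'(x) = -sin(x) - 1, starting from x0 = 0.7388
root = newton(lambda x: math.cos(x) - x,
              lambda x: -math.sin(x) - 1.0,
              0.7388)
```

This is exactly the iteration `x = x + (cos(x) - x)/(sin(x) + 1)` from the earlier slide, and it reaches the fixed point to machine precision in two or three steps.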
Introduction and Floating-point Numbers Linear Algebra Once we start getting into larger systems of linear equations, there are fast iterative methods – Much better than Gaussian elimination… – One of the worst performing is the Jacobi method, which is itself a fixed-point iteration of the form x = f(x) 13
Introduction and Floating-point Numbers Linear Algebra Consider this system of linear equations: >> M = [5.2 0.3 0.7 0.4; 0.3 4.8 -1.3 0.5; 0.7 -1.3 7.3 -0.8; 0.4 0.5 -0.8 6.4]; >> b = [2 4 5 2]'; >> D = diag( diag( M ) ); Moff = M - D; % M = D + Moff >> x = D^-1 * b % initial guess: solve Dx = b x = 0.384615385 0.83333333 0.684931506849315 0.3125000000 Solution: 14
Introduction and Floating-point Numbers Linear Algebra >> for i = 1:5 x = D^-1 * (b - Moff*x) end x = 0.220297681770285 0.962245071566561 0.830698981383913 0.308973810151036 x = 0.193509166827092 1.012360930456767 0.869025926564131 0.327393371346209 x = 0.184041581486460 1.022496722669196 0.882538012313945 0.329943220201888 x = 0.181441747403601 1.026482360721093 0.885530302546704 0.331432096237808 x = 0.180694469520357 1.027300171035569 0.886652537362349 0.331657244174278 15
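A minimal Python sketch of the same Jacobi iteration, written without any matrix library so that each step x ← D⁻¹(b – Moff·x) is explicit (the data is the 4 × 4 system from the Matlab session above):

```python
# The 4x4 system M x = b from the slides
M = [[5.2,  0.3,  0.7,  0.4],
     [0.3,  4.8, -1.3,  0.5],
     [0.7, -1.3,  7.3, -0.8],
     [0.4,  0.5, -0.8,  6.4]]
b = [2.0, 4.0, 5.0, 2.0]

n = len(b)
# Initial guess: solve D x = b, i.e. x_i = b_i / M_ii
x = [b[i] / M[i][i] for i in range(n)]

for _ in range(5):
    # Jacobi step: x_i <- (b_i - sum of off-diagonal terms) / M_ii
    x = [(b[i] - sum(M[i][j] * x[j] for j in range(n) if j != i)) / M[i][i]
         for i in range(n)]
```

After five iterations this reproduces the final column of the Matlab output; note that every new component is computed from the previous iterate only, which is why Jacobi converges so slowly.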
Introduction and Floating-point Numbers Interpolation The next tool is finding interpolating polynomials: – Given a set of points, find the best-fitting polynomial that passes through them – For example, given (2, 5) and (7, 8), find a line y = ax + b that passes through these points 2a + b = 5 7a + b = 8 – This is a system of two equations and two unknowns: >> [2 1; 7 1] \ [5 8]' ans = 0.60000000 3.80000000 y = 0.6x + 3.8 16
Introduction and Floating-point Numbers Interpolation Given n points with unique x values, it is always possible to find a unique interpolating polynomial passing through the points – For example, given (–2, 6), (–1, 0), (3, 4), and (5, 7), find a cubic polynomial y = ax³ + bx² + cx + d passing through them (–2)³a + (–2)²b + (–2)c + d = 6 (–1)³a + (–1)²b + (–1)c + d = 0 3³a + 3²b + 3c + d = 4 5³a + 5²b + 5c + d = 7 – This gives the system 17
Introduction and Floating-point Numbers Interpolation In Matlab, we would do the following: >> M = [-8 4 -2 1; -1 1 -1 1; 27 9 3 1; 125 25 5 1] M = -8 4 -2 1 -1 1 -1 1 27 9 3 1 125 25 5 1 >> c = M \ [6 0 4 7]' c = -0.188095238 1.40000000 -0.483333333 -2.071428571 y = –0.188x³ + 1.400x² – 0.483x – 2.071 18
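A Python sketch of the same computation; since there is no `\` operator here, a small Gaussian-elimination solver stands in for it (all helper names are illustrative):

```python
# The four points (-2, 6), (-1, 0), (3, 4), (5, 7) from the slides
pts = [(-2.0, 6.0), (-1.0, 0.0), (3.0, 4.0), (5.0, 7.0)]

# Vandermonde-style matrix: each row is [x^3, x^2, x, 1]
A = [[x**3, x**2, x, 1.0] for x, _ in pts]
rhs = [y for _, y in pts]

def solve(A, b):
    """Gaussian elimination with partial pivoting (minimal sketch)."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))  # pivot row
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= m * A[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        x[i] = (A[i][n] - sum(A[i][j] * x[j]
                              for j in range(i + 1, n))) / A[i][i]
    return x

c = solve(A, rhs)   # coefficients [a, b, c, d] of the cubic
```

Evaluating the resulting cubic at each of the four given points recovers the given y values, matching the Matlab coefficients –0.188, 1.400, –0.483, –2.071.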
Introduction and Floating-point Numbers Interpolation Checking our answer: >> plot( [-2 -1 3 5], [6 0 4 7], 'ro' ) >> xs = -2.5:0.1:5.5; >> ys = polyval( c, xs ); >> hold on >> plot( xs, ys, 'b' ); 19
Introduction and Floating-point Numbers Taylor Series The fourth tool used in numerical methods that we will be seriously using is Taylor series – Normally, these are written as f(x) = f(a) + f′(a)(x – a) + f″(a)/2! (x – a)² + f‴(a)/3! (x – a)³ + ⋯ We can truncate the series as f(x) = f(a) + f′(a)(x – a) + f″(ξ)/2! (x – a)² where ξ lies between a and x 20
Introduction and Floating-point Numbers Taylor Series In numerical analysis, we usually write Taylor series as follows – Given a point x, we’d like to know what happens at x + h where h is very small – Thus, the Taylor series will usually be in the form f(x + h) = f(x) + f′(x)h + f″(x)/2! h² + ⋯ – The truncated forms, of course, contain an approximation of the truncation error: f(x + h) = f(x) + f′(x)h + f″(ξ)/2! h² where ξ ∈ [x, x + h] 21
Introduction and Floating-point Numbers Bracketing The fifth tool is bracketing: – Determine that the solution is on an interval [a, b] – Find an algorithm to reduce the size of the interval to either [a, c] or [c, b] where a < c < b – Iterate until the width of the interval is sufficiently small • Choose the endpoint that best satisfies the conditions – The most inefficient of methods… 22
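The bisection method is the simplest instance of bracketing; a minimal sketch (the helper name `bisect` is illustrative), applied to the earlier example x = cos(x) rewritten as the root of f(x) = cos(x) – x:

```python
import math

def bisect(f, a, b, eps=1e-10):
    """Bracketing by bisection: repeatedly halve [a, b], keeping the
    half on which f changes sign, until the interval is small enough."""
    fa = f(a)
    if fa * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    while b - a > eps:
        c = 0.5 * (a + b)
        fc = f(c)
        if fa * fc <= 0:
            b = c              # the root lies in [a, c]
        else:
            a, fa = c, fc      # the root lies in [c, b]
    return 0.5 * (a + b)

root = bisect(lambda x: math.cos(x) - x, 0.0, 1.0)
```

Each step gains exactly one bit of accuracy, which is why bracketing is reliable but slow compared with Newton's method.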
Introduction and Floating-point Numbers Summary of Numerical Tools We have quickly summarized five tools used in numerical algorithms: – Iteration – Linear algebra – Interpolation – Taylor series – Bracketing The balance of this topic will discuss: – Landau symbols – Floating-point numbers – Topics to be covered in NE 216 and NE 217 • IVPs in NE 216 • PDEs in NE 217 23
Introduction and Floating-point Numbers Landau Symbols We will also use Landau symbols – When iterating, very often we will have a value h that describes a quantity about which we may make observations – For example: • For some algorithms, if we make h smaller by half, the error of our approximation is reduced by approximately half • In others (e.g., Newton’s method), if the error is h at one iteration, the error at the next iteration will be h² 24
Introduction and Floating-point Numbers Landau Symbols To state this mathematically, we will use the big-O Landau symbol: – An algorithm is O(h) if halving h reduces the error by half – We will use O(h²) to indicate that halving h will reduce the error by a factor of four – We will see one case where the error is O(h⁴): halving h will reduce the error by a factor of 16 25
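A quick numerical illustration of these rates: the forward-difference approximation of a derivative is O(h) while the central difference is O(h²), so halving h should roughly halve the first error and quarter the second (a Python sketch, not from the slides):

```python
import math

def fwd(f, x, h):
    # Forward difference: error is O(h)
    return (f(x + h) - f(x)) / h

def ctr(f, x, h):
    # Central difference: error is O(h^2)
    return (f(x + h) - f(x - h)) / (2.0 * h)

x, exact = 1.0, math.cos(1.0)          # d/dx sin(x) = cos(x)
for h in (0.1, 0.05, 0.025):
    print(h, abs(fwd(math.sin, x, h) - exact),
             abs(ctr(math.sin, x, h) - exact))
```

Running this, each halving of h divides the forward-difference error by roughly 2 and the central-difference error by roughly 4, exactly the behaviour the O(h) and O(h²) labels predict.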
Introduction and Floating-point Numbers Everything we do deals with floating-point numbers – Unfortunately, there are problems with floating-point numbers: • We can only store a finite amount of precision • We lose associativity: (x + y) + z ≠ x + (y + z) – Both of these require us to carefully design our algorithms… 26
Introduction and Floating-point Numbers Floating-point Numbers Floating-point operations are specified in IEEE 754 – Led by William Kahan – One of the most successful examples of collaboration • Every individual in the working group could have promoted the floating-point format used by his or her company • Instead, they created a floating-point format that no one used but that was superior to all that were in use 27
Introduction and Floating-point Numbers Floating-point Numbers Some users will try: >> 1 - 2/3 - 1/3 ans = 5.551115123125783e-017 and claim floating-point numbers don’t work… In binary, 1 = 1.00000000000000000000000000, 2/3 = 0.101010101010101010101010101, and 1/3 = 0.0101010101010101010101010101; computing 1.00000000000000000000000000 - 0.101010101010101010101010101 gives 0.01010101010101010101010101010101010101010101010101011, and subtracting 0.0101010101010101010101010101 leaves 0.0000000000000000000000000001 28
Introduction and Floating-point Numbers Thus, >> 1 - 2/3 - 1/3 ans = 5.551115123125783e-017 >> 2^-54 ans = 5.551115123125783e-017 Floating-point arithmetic is not exact – Each operation, including conversion of decimal numbers to binary, may have an error of up to 0.5 in the least-significant bit – For function evaluations, the allowable error is slightly larger 29
Introduction and Floating-point Numbers Absolute and Relative Error To begin, we need to review absolute and relative error: – If a approximates the value x, we say that |x – a| is the absolute error of a, |x – a|/|x| is the relative error of a, and 100 |x – a|/|x| % is the percent relative error 30
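These definitions translate directly into code (a Python sketch; the helper names are illustrative):

```python
def absolute_error(a, x):
    """Absolute error |x - a| of the approximation a to the exact value x."""
    return abs(x - a)

def relative_error(a, x):
    """Relative error |x - a| / |x|; x must be non-zero."""
    return abs(x - a) / abs(x)

def percent_relative_error(a, x):
    """Percent relative error, 100 |x - a| / |x|."""
    return 100.0 * relative_error(a, x)
```

For instance, approximating 0.0015 by 0.002, as in the six-digit representation a few slides on, gives a relative error of one third, that is, about 33 %.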
Introduction and Floating-point Numbers Absolute and Relative Error Absolute error is not very useful: – An error of 1 mm may be: • Insignificant in a trip to Mars • Possibly significant in designing a Mars rover • Catastrophic for anything designed at the nanometer scale We will focus on the relative error: – A relative error of 0.00001, or 0.001 %, is usually acceptable in many engineering applications regardless of scale – This is, however, application specific 31
Introduction and Floating-point Numbers Storing Real Numbers Let us try to store real numbers with a finite number of digits We will impose some constraints: – Use a fixed amount of memory – Represent both very large and very small numbers – Represent numbers with a small relative error – Easily test equality and relative magnitude 32
Introduction and Floating-point Numbers A Simple Example How large a range of numbers can we represent with six decimal digits and a sign? ±NNNNNN Ideas? 33
Introduction and Floating-point Numbers A Reasonable Representation Here’s one very simple idea: – Let the six digits ±NNNNNN represent the value ±NNN.NNN, that is, ±NNNNNN × 10⁻³ – For example, +039432 represents 39.432 – We store –3.14152 as -003142 34
Introduction and Floating-point Numbers A Reasonable Representation Our range is somewhat limited… – We can only represent numbers from 0.001 to 999.999 – Also, consider the relative error: 0.0015 is stored as +000002, a relative error of 33 %; 999.9985 is stored as +999998, a relative error of 0.00005 % Limited range, some numbers with large relative errors… Relative magnitude can, however, be found quickly 35
Introduction and Floating-point Numbers A Reasonable Representation Here is another idea: – Let the six digits ±EEMNNN represent the value ±M.NNN × 10^(EE – 49), where we will require that M is non-zero – For example, +549238 represents 9.238 × 10⁵ – We represent 372.863 as +513729 36
Introduction and Floating-point Numbers A Reasonable Representation How does it fare? ±00MNNN represents numbers as small as 1.000 × 10⁻⁴⁹ • For example, +005723 represents 5.723 × 10⁻⁴⁹ ±99MNNN represents numbers as large as 9.999 × 10⁵⁰ • For example, +995723 represents 5.723 × 10⁵⁰ 37
Introduction and Floating-point Numbers A Reasonable Representation Also, no number on the range [1.000 × 10⁻⁴⁹, 9.999 × 10⁵⁰] has a representation with a relative error larger than 0.05 % – For example, 33 476 688 is represented by +563348 with a relative error of 0.0099 % – For example, 6.626 069 57 × 10⁻³⁴ is represented by +156626 with a relative error of 0.0010 % 38
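To make the ±EEMNNN scheme concrete, here is a Python sketch that encodes a positive, in-range value into this hypothetical six-digit format and decodes it back (the function names are invented for illustration; signs, zero, and out-of-range values are not handled):

```python
import math

def encode(x):
    """Encode positive x as +EEMNNN: exponent EE biased by 49,
    four-digit mantissa M.NNN with M non-zero (normalized)."""
    e = math.floor(math.log10(x))        # decimal exponent
    m = round(x / 10.0**e, 3)            # mantissa M.NNN, 3 decimals
    if m >= 10.0:                        # rounding rolled the mantissa over
        m /= 10.0
        e += 1
    return "+%02d%04d" % (e + 49, round(m * 1000))

def decode(s):
    """Decode ±EEMNNN back to a float."""
    sign = -1.0 if s[0] == '-' else 1.0
    ee = int(s[1:3]) - 49                # remove the bias
    m = int(s[3:]) / 1000.0              # M.NNN
    return sign * m * 10.0**ee
```

This reproduces the slide's examples: 372.863 encodes to +513729, and +563348 decodes to 3.348 × 10⁷, within the stated 0.05 % relative error.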
Introduction and Floating-point Numbers A Reasonable Representation The requirement that the digit M is non-zero ensures unique representations: ±EEMNNN Otherwise, all four of +491000, +500100, +510010, and +520001 would represent the same value, 1: 1.000 × 10⁰ = 0.100 × 10¹ = 0.010 × 10² = 0.001 × 10³ – Imagine if simply checking for equality required arithmetic and iteration 39
Introduction and Floating-point Numbers A Reasonable Representation Also, by choosing the order ±EEMNNN and using a bias (the –49), we have one final advantage: – Relative comparisons are also fast These four represent numbers using our format: +856729 +389657 +573823 +195737 Which is the largest in magnitude? Which is smallest? 40
Introduction and Floating-point Numbers A Reasonable Representation We have a few more issues: – Representing zero: +000000 – Requiring negative zero: -000000 Why? – Recall that a floating-point zero represents all numbers on the range (–s, s) where s is the smallest number that can be represented by a non-zero floating-point number – One reason: branch cuts 41
Introduction and Floating-point Numbers A Reasonable Representation Other issues: – 1.153 × 10⁻⁴⁹ can be represented with full precision, but 4.853 × 10⁻⁵⁰ would have to be represented by +000000 Solution? Denormalized numbers: when EE = 00, we allow M to be zero, so ±00MNNN represents ±M.NNN × 10⁻⁴⁹ For example, 4.853 × 10⁻⁵⁰ = 0.4853 × 10⁻⁴⁹ is represented by +000485 – Representing infinity (for example, 1/0, –1/0): +990000 and -990000 – Representing undefined operations (0/0, not-a-number or NaN): +990100 42
Introduction and Floating-point Numbers IEEE 754 When Dr. Kahan led the committee that eventually produced the IEEE 754 standard, there were numerous conflicts – People from numerous corporations were represented, each wanting to advocate for their own representations – Each corporation had already invested in its own designs • No one wants to modify existing hardware that has already been tested Fortunately, this committee overcame these biases and produced an excellent standard – IEEE 754-2008 contains most of the original standard 43
Introduction and Floating-point Numbers IEEE 754 The original standard defines two formats: – The float, a single-precision floating-point number – The double, a double-precision floating-point number For most applications outside of graphics, float is unacceptable – We will focus on double 44
Introduction and Floating-point Numbers IEEE 754 The double uses 64 bits: one sign bit, eleven exponent bits, and fifty-two mantissa bits S EEEEEEEEEEE NNNN…NNNN to represent (in binary): ±1.NNNN…NNNN₂ × 2^(E – 1023) where the bias is 01111111111₂ = 1023 45
Introduction and Floating-point Numbers IEEE 754 The smallest positive normalized number is 0 00000000001 0000000000000000000000000000000000000000000000000000 which represents 1.0₂ × 2⁻¹⁰²² ≈ 2.2251 × 10⁻³⁰⁸ 46
Introduction and Floating-point Numbers IEEE 754 Denormalized numbers go as small as 0 00000000000 0000000000000000000000000000000000000000000000000001 which represents 2⁻⁵² × 2⁻¹⁰²² = 2⁻¹⁰⁷⁴ ≈ 4.9407 × 10⁻³²⁴ 47
Introduction and Floating-point Numbers IEEE 754 The largest positive number is 0 11111111110 1111111111111111111111111111111111111111111111111111 which represents (2 – 2⁻⁵²) × 2¹⁰²³ ≈ 1.7977 × 10³⁰⁸ >> format long >> 1.99999999778 * 2^1023 ans = 1.797693134862316e+308 >> 2^1024 ans = Inf Recall that in decimal 1.9999… = 2; similarly, in binary 1.1111…₂ = 2 48
Introduction and Floating-point Numbers IEEE 754 Infinity is represented by S 11111111111 0000000000000000000000000000000000000000000000000000 which represents ±∞ If the mantissa is not all zero, it represents a NaN – NaN has special properties: NaN == NaN returns false (0), so you must use isnan( x ) 49
Introduction and Floating-point Numbers IEEE 754 You can view the underlying format: >> format hex >> pi ans = 400921fb54442d18 >> exp(1) ans = 4005bf0a8b14576a >> 1/0 ans = 7ff0000000000000 >> 0 ans = 0000000000000000 >> 1e-300 ans = 01a56e1fc2f8f359 Hexadecimal–binary table: 0 = 0000, 1 = 0001, 2 = 0010, 3 = 0011, 4 = 0100, 5 = 0101, 6 = 0110, 7 = 0111, 8 = 1000, 9 = 1001, a = 1010, b = 1011, c = 1100, d = 1101, e = 1110, f = 1111 50
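Outside Matlab, the same bytes can be inspected; for example, Python's `struct` module exposes the underlying IEEE 754 representation (an illustrative helper, analogous to Matlab's `format hex`):

```python
import struct
import math

def double_to_hex(x):
    """Hex dump of the 8 bytes of an IEEE 754 double (big-endian),
    analogous to Matlab's 'format hex' display."""
    return struct.pack('>d', x).hex()

print(double_to_hex(math.pi))        # 400921fb54442d18
print(double_to_hex(float('inf')))   # 7ff0000000000000
```

The leading hex digits hold the sign and biased exponent; the remaining thirteen hold the 52-bit mantissa.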
Introduction and Floating-point Numbers Issues with Floating-point Numbers There are still issues: – Overflow and underflow – Subtractive cancellation – Adding large and small numbers – Order of operations (not associative) These must be dealt with in our algorithms… 51
Introduction and Floating-point Numbers A More Accurate Sum The Kahan algorithm for adding numbers: function [s] = Kahan_sum( v ) s = 0; c = 0; for x = v y = x - c; t = s + y; c = (t - s) - y; s = t; end 52
Introduction and Floating-point Numbers A More Accurate Sum Consider: >> v = rand( 1, 10000000 ); >> sum( v ) ans = 5.000908006717473e+006 >> Kahan_sum( v ) ans = 5.000908006717303e+006 >> sum( sort( v, 'ascend' ) ) ans = 5.000908006717453e+006 >> sum( sort( v, 'descend' ) ) ans = 5.000908006717471e+006 53
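The same algorithm in Python, with a small experiment one can run directly: summing 0.1 a million times accumulates visible error in a naive left-to-right loop but not in the compensated sum (the test values here are illustrative, not from the slides):

```python
def kahan_sum(v):
    """Compensated (Kahan) summation, mirroring the Matlab function."""
    s = 0.0
    c = 0.0                      # running compensation for lost low-order bits
    for x in v:
        y = x - c                # subtract the error from the previous step
        t = s + y                # low-order bits of y may be lost here...
        c = (t - s) - y          # ...so recover them algebraically
        s = t
    return s

v = [0.1] * 1000000              # the exact sum would be 100000

naive = 0.0
for x in v:                      # plain left-to-right accumulation
    naive += x

compensated = kahan_sum(v)
```

On a typical IEEE 754 machine the naive loop is off by roughly 10⁻⁶ while the compensated result is off by less than 10⁻⁸, even though both perform the same number of additions.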
Introduction and Floating-point Numbers A More Accurate Sum Are you serious? > Kahan_sum := proc( v ) local s, c, y, t, x; s := 0; c := 0; for x in v do y := x - c; t := s + y; c := (t - s) - y; s := t; end do; return s; end proc: > S := [seq( rand()/1e12, i = 1..1000000 )]: 54
Introduction and Floating-point Numbers A More Accurate Sum Are you serious? > add( i, i = S ); 5.002264965 · 10⁵ > add( i, i = sort( S ) ); 5.002264658 · 10⁵ > add( i, i = sort( S, `>` ) ); 5.002264784 · 10⁵ > Kahan_sum( S ); 5.002264636 · 10⁵ > Digits := 30: > add( i, i = S ); 5.00226463567620254 · 10⁵ 55
Introduction and Floating-point Numbers The Laboratories In the laboratories associated with this course, you will see six problems that often arise in nanotechnology engineering The laboratories will be divided into two parts: – A one-hour presentation one week, and – A one-hour help session the next week for assistance 56
Introduction and Floating-point Numbers The Laboratories This is part of an integrated approach in your nanotechnology courses: – NE 113 Engineering Computation was your introduction – NE 216 will focus on numerical solutions to IVPs – NE 217 focuses on PDEs This will lead to NE 336 Micro and Nanosystem Computer-aided Design 57
Introduction and Floating-point Numbers The Laboratories NE 216 will look at numerical approximations to: – Numerical algorithms – Differentiation – 1st-order initial value problems (IVPs) using Euler and Heun's methods – 1st-order IVPs with the better 4th-order Runge-Kutta and the Dormand-Prince methods – Systems of IVPs and converting higher-order IVPs into a system of 1st-order IVPs – Boundary-value problems (BVPs) using the shooting method 58
Introduction and Floating-point Numbers The Laboratories NE 217 will look at numerical approximations to: – Boundary-value problems using finite differences – Heat-conduction/diffusion equation – Heat-conduction/diffusion using the Crank-Nicolson method with insulated boundaries – Wave equation – Laplace's equation in two and three dimensions – The heat-conduction/diffusion and wave equations in two and three dimensions 59
Introduction and Floating-point Numbers The Laboratories By the end of this sequence of laboratories, you will be able to produce the following animations: 60
Introduction and Floating-point Numbers Outline In this topic, we saw: – Five tools used in numerical algorithms: • Iteration • Linear algebra • Interpolation • Taylor series • Bracketing – Landau symbols and floating-point numbers, IEEE 754 – A summary of the laboratories in NE 216 and NE 217 • IVPs and then PDEs 61