MATH/CMPSC 455 Introduction to Numerical Analysis I Floating Point Representation of Real Numbers

FLOATING POINT REPRESENTATION OF REAL NUMBERS
This is about how computers represent and operate on real numbers. It helps us understand rounding errors.
We consider the IEEE 754 Floating Point Standard.
Representing binary numbers in a computer:
1. format
2. machine representation

FLOATING POINT FORMAT
Formats for the decimal system:
• Standard notation
• Scientific notation
• Normalized scientific notation

FLOATING POINT FORMAT
Format for a floating point number (binary representation)
Normalized IEEE floating point standard:
o sign (+ or -)
o mantissa, which contains the significant bits (N b's)
o exponent (p, an M-bit binary number)
… …
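To make the three parts concrete, here is a minimal Python sketch that splits a finite nonzero number into a sign, a significand in [1, 2), and a power-of-two exponent p. The helper name `normalized_binary` is hypothetical, and the sketch assumes the usual normalized form with a leading 1 before the binary point.

```python
import math

def normalized_binary(x):
    """Return (sign, significand, p) with |x| = significand * 2**p and 1 <= significand < 2.

    Hypothetical helper for illustration; only finite nonzero inputs are handled.
    """
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only finite nonzero values have a normalized form")
    sign = "-" if x < 0 else "+"
    # math.frexp returns (m, e) with |x| = m * 2**e and 0.5 <= m < 1,
    # so shifting one binary place gives a significand in [1, 2).
    m, e = math.frexp(abs(x))
    return sign, 2 * m, e - 1

print(normalized_binary(6.0))     # 6 = 1.5 * 2**2
print(normalized_binary(-0.375))  # -0.375 = -1.5 * 2**-2
```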

Precision     Sign  Exponent (M)  Mantissa (N)
single        1     8             23
double        1     11            52
long double   1     15            64

Definition (machine epsilon, $\epsilon_{\text{mach}}$): It is the distance between 1 and the smallest floating point number greater than 1. It gives a bound on the relative error due to rounding.
For the IEEE double precision floating point standard: $\epsilon_{\text{mach}} = 2^{-52}$
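The definition can be checked directly in Python, whose `float` is an IEEE double on essentially all current platforms (an assumption of this sketch). The variable names are illustrative only.

```python
import sys

# Machine epsilon as reported by the Python runtime for its float type.
eps = sys.float_info.epsilon

# The smallest double greater than 1 is 1 + eps.
next_after_one = 1.0 + eps

# Adding only half of eps is a tie that rounds back down to 1.
half_step = 1.0 + eps / 2

print(eps == 2.0 ** -52)
print(next_after_one > 1.0)
print(half_step == 1.0)
```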

ROUNDING
How do we fit a given binary number in a finite number of bits?
IEEE Rounding to Nearest Rule: For double precision, if the 53rd bit to the right of the binary point is 0, then round down (truncate after the 52nd bit). If the 53rd bit is 1, then round up (add 1 to bit 52), unless all known bits to the right of the 1 are 0's, in which case 1 is added to bit 52 if and only if bit 52 is 1.
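The cases of the rule can be exercised with exact rational inputs. This is a sketch that assumes CPython, where converting a `Fraction` to `float` performs a correctly rounded (round-to-nearest, ties-to-even) integer division; the helper name `fl` mirrors the rounding notation but is defined here only for illustration.

```python
from fractions import Fraction

def fl(x):
    """Round an exact rational to the nearest double (assumes CPython's correctly rounded division)."""
    return float(x)

one = Fraction(1)
bit52 = Fraction(1, 2 ** 52)
bit53 = Fraction(1, 2 ** 53)

# Bit 53 is 0: round down (truncate after bit 52).
round_down = fl(one + Fraction(1, 2 ** 54))

# Bit 53 is 1 with nonzero bits after it: round up.
round_up = fl(one + bit53 + Fraction(1, 2 ** 60))

# Bit 53 is 1, nothing after it, bit 52 is 0: tie, leave bit 52 unchanged.
tie_bit52_zero = fl(one + bit53)

# Bit 53 is 1, nothing after it, bit 52 is 1: tie, add 1 to bit 52.
tie_bit52_one = fl(one + bit52 + bit53)

print(round_down, round_up, tie_bit52_zero, tie_bit52_one)
```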

ROUNDING
Notation: Denote the IEEE double precision floating point number associated to x, using the Rounding to Nearest Rule, by fl(x).
Definition (absolute error & relative error): Let $x_c$ be a computed version of the exact quantity $x$. Then the absolute error is $|x_c - x|$ and the relative error is $|x_c - x| / |x|$ (for $x \neq 0$).

ROUNDING Example: Relative rounding error:
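As an illustration (not necessarily the example from the original slide), the relative rounding error of fl(x) can be computed exactly with rational arithmetic. The sketch assumes round-to-nearest and an x in the normal range, where the standard bound is half of machine epsilon; the function name is hypothetical.

```python
from fractions import Fraction
import sys

def relative_rounding_error(x_exact):
    """Exact relative error |fl(x) - x| / |x| for a nonzero rational x."""
    # float(x_exact) rounds to the nearest double; Fraction(float) recovers its exact value.
    x_fl = Fraction(float(x_exact))
    return abs(x_fl - x_exact) / abs(x_exact)

x = Fraction(1, 10)  # 0.1 has no finite binary expansion
rel = relative_rounding_error(x)
bound = Fraction(sys.float_info.epsilon) / 2  # eps_mach / 2 = 2**-53

print(float(rel), float(bound))
print(rel <= bound)
```

A value such as 1/2 is exactly representable, so its relative rounding error is 0.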

MACHINE REPRESENTATION
… …
• Sign: 1 bit, 0 for positive, 1 for negative;
• Mantissa: 52 bits, …
• Exponent: 11 bits, so e ranges from 0 to $2^{11} - 1 = 2047$, and p = e - 1023
• e = 1 ~ 2046 corresponds to p = -1022 ~ 1023
• 2 values of e are reserved: one for infinity / NaN and one for 0
• e = 2047: infinity if the mantissa is all zeros, NaN otherwise;
• e = 0: small numbers, including 0
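The bit fields listed above can be inspected with Python's `struct` module. This is a minimal sketch; the function names `decode_double` and `classify` are hypothetical helpers, and the category labels are wording chosen here.

```python
import struct

def decode_double(x):
    """Return (sign, e, mantissa) as integers from the 64-bit pattern of x."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    e = (bits >> 52) & 0x7FF           # 11 bits, biased exponent
    mantissa = bits & ((1 << 52) - 1)  # 52 bits
    return sign, e, mantissa

def classify(x):
    """Describe which exponent case x falls into."""
    sign, e, mantissa = decode_double(x)
    if e == 2047:
        return "infinity" if mantissa == 0 else "NaN"
    if e == 0:
        return "zero or subnormal"
    return f"normal, p = {e - 1023}"

print(decode_double(1.0))
print(classify(float("inf")), classify(float("nan")), classify(0.0))
```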

ADDITION AND ROUNDING OF FLOATING POINT NUMBERS
Step 1: line up the two numbers (double precision)
Step 2: add them (higher precision)
Step 3: store the result as a floating point number (double precision)
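The three steps can be mimicked in Python by using exact rational arithmetic as a stand-in for the higher-precision intermediate (real hardware uses a few extra guard bits rather than exact arithmetic, so this is only a model). The function name `fl_add` is hypothetical, and the sketch assumes CPython's correctly rounded `Fraction`-to-`float` conversion.

```python
from fractions import Fraction

def fl_add(a, b):
    """Model of double precision addition: exact sum, then round to nearest double."""
    # Steps 1 and 2: the exact values of the two doubles are added without any rounding.
    exact_sum = Fraction(a) + Fraction(b)
    # Step 3: the sum is stored as a double, which rounds it to nearest.
    return float(exact_sum)

# 1 + 2**-53 is an exact tie between 1 and 1 + 2**-52; it rounds back to 1.
print(fl_add(1.0, 2.0 ** -53) == 1.0)

# The model agrees with the built-in float addition.
print(fl_add(0.1, 0.2) == 0.1 + 0.2)
```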

Example :