Floating-Point Arithmetic, Chapter 9
Sepehr Naimi, www.NicerLand.com


Floating-Point Calculation
Using:
  • Rational number approximation
  • Fixed-point
  • Floating point

Rational Number Approximation
  • Approximate the constant with a fraction p/q, then multiply by p and divide by q.
  • Example 1: 0.75 = 3/4
      b = a * 0.75;   becomes   b = a * 3 / 4;
  • Example 2: e = 2.718282 and 193/71 = 2.71830985
      b = a * e;   becomes   b = a * 193 / 71;
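The p/q idea above can be sketched in C as follows. This is a minimal sketch rather than code from the slides: the function names are hypothetical, and the 64-bit intermediate is an added assumption so that a * p cannot overflow before the division.

```c
#include <stdint.h>

/* Multiply by 0.75 using the fraction 3/4.
   Multiply first, then divide, so the fractional part is not lost early.
   The 64-bit intermediate is an assumption added for overflow safety. */
int32_t times_three_quarters(int32_t a) {
    return (int32_t)(((int64_t)a * 3) / 4);
}

/* Multiply by e using the approximation 193/71 (about 2.71831).
   Integer division truncates toward zero, so the result is rounded down
   for positive inputs. */
int32_t times_e_approx(int32_t a) {
    return (int32_t)(((int64_t)a * 193) / 71);
}
```

For instance, times_e_approx(1000) computes 193000 / 71 and yields 2718, which matches 1000 × e to the nearest integer.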

Example
Calculate the area of a circle whose radius is in R0, using pi ≈ 22/7.
Solution in C:
    int area = R0 * R0 * 22 / 7;   // area = pi * R0^2
Solution in Assembly:
    mul  r0, r0      @ r0 = r0 to the power of 2
    mov  r1, #22
    mul  r0, r1      @ r0 = 22 * r0
    mov  r1, #7
    udiv r0, r1      @ r0 = r0 / 7
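The same calculation can be wrapped in a small C function. This is a sketch under stated assumptions: the name circle_area_22_7 is hypothetical, and the 64-bit arithmetic is added here so that r * r * 22 does not overflow for radii below roughly 2^29.

```c
#include <stdint.h>

/* Integer circle area using pi ~= 22/7.
   Assumption: r is small enough (below about 2^29) that r*r*22 fits in
   64 bits. The division truncates, so the result is rounded down. */
uint64_t circle_area_22_7(uint32_t r) {
    uint64_t r_squared = (uint64_t)r * r;   /* r^2, like the first mul */
    return r_squared * 22u / 7u;            /* multiply by 22, then divide by 7 */
}
```

With r = 7 the function returns 154, since 49 × 22 / 7 divides exactly.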

Fixed-Point
  • Scale numbers by a power of 10 (or a power of 2).
  • Example 1: A length of 1.4 cm is 14 mm. If we work in mm, we can use integers.
  • Example 2: At petrol stations, gasoline is sold with a precision of 0.01 liter, so the quantities become integers if we scale by 100.
      25.12 liters → 2512

Example
R0 contains the amount of gasoline used, in hundredths of a liter. Each liter costs $12. Calculate the price with a precision of one cent.
Solution:
    Price in dollars = liters * 12 = (hundredths of a liter) * 12 / 100
    Price in cents = 100 * (price in dollars) = (hundredths of a liter) * 12
    int cent = R0 * 12;
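A short C sketch of this price calculation is given here. The function and macro names are hypothetical, and the two helper functions for splitting the result into dollars and cents are an addition for illustration only.

```c
#include <stdint.h>

#define PRICE_PER_LITER_DOLLARS 12u   /* from the example: $12 per liter */

/* Input: volume in hundredths of a liter (scale factor 100).
   Output: price in cents. The /100 (liters) and *100 (cents) cancel,
   so a single multiplication is enough. */
uint32_t price_in_cents(uint32_t hundredths_of_liter) {
    return hundredths_of_liter * PRICE_PER_LITER_DOLLARS;
}

/* Hypothetical helpers for displaying a cent amount as dollars.cents. */
uint32_t dollars_part(uint32_t cents) { return cents / 100u; }
uint32_t cents_part(uint32_t cents)   { return cents % 100u; }
```

For 25.12 liters, stored as 2512, price_in_cents returns 30144, that is $301.44.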

Fixed-Point Calculation
  • To add or subtract two fixed-point numbers with the same scaling factor, we simply add or subtract them:
      100×m + 100×n = 100×(m+n)
  • For example:
      5.40 liters + 2.31 liters = 7.71 liters
      540 + 231 = 771 hundredths of a liter

Fixed-Point Calculation (Cont.)
  • To multiply two fixed-point numbers with the same scaling factor, the product must be divided by the scaling factor:
      (100×m) × (100×n) / 100 = 100 × 100 × (m×n) / 100 = 100×(m×n)
  • To divide, the dividend must be multiplied by the scaling factor before the division:
      100 × (100×m) / (100×n) = 100×(m / n)
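The four fixed-point rules can be sketched in C with a scaling factor of 100. The type name fx_t, the FX_SCALE macro, and the function names are assumptions made for this illustration; the 64-bit intermediates are also an addition, to keep the scaled product from overflowing.

```c
#include <stdint.h>

#define FX_SCALE 100          /* scaling factor used in the examples above */

typedef int32_t fx_t;         /* hypothetical type: value * FX_SCALE */

/* Same scale on both sides: plain integer add and subtract. */
fx_t fx_add(fx_t a, fx_t b) { return a + b; }
fx_t fx_sub(fx_t a, fx_t b) { return a - b; }

/* The product carries the scale twice, so divide once by FX_SCALE. */
fx_t fx_mul(fx_t a, fx_t b) {
    return (fx_t)(((int64_t)a * b) / FX_SCALE);
}

/* The quotient would lose the scale, so pre-multiply the dividend.
   Assumption: b is not zero. */
fx_t fx_div(fx_t a, fx_t b) {
    return (fx_t)(((int64_t)a * FX_SCALE) / b);
}
```

As a usage example, fx_mul(250, 150) represents 2.50 × 1.50 and returns 375, that is 3.75. Results are truncated, not rounded.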

Floating Point
  • IEEE 754 single precision

Converting Numbers to Single Precision
  • If the number is positive, bit 31 is 0. If the number is negative, bit 31 is 1.
  • The real number is converted to its binary form.
  • The binary number is normalized to 1.xxxx E yyyy.
  • The bias 127 (0x7F) is added to the exponent portion, yyyy, to get the biased exponent, which is placed in bits 30 to 23.
  • The significand, xxxx, is placed in bits 22 to 0.

Example: Convert 9.75 (decimal) to IEEE 754 single-precision floating-point format
Solution:
  • Sign bit 31 is 0 for positive.
  • Decimal 9.75 = binary 1001.11, which is normalized to 1.00111 E 3.
  • Exponent bits 30 to 23 are 1000 0010 after adding the bias (3 + 0x7F = 0x82).
  • Significand bits 22 to 0 are 001 1100 0000 0000 0000 0000.
  • Putting them all together gives the following:
      0100 0001 0001 1100 0000 0000 0000 0000 (0x411C0000)
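The worked conversion can be checked on a host machine with a few lines of C. This is a sketch that assumes the platform's float type is IEEE 754 single precision (true on Arm and most current systems, but not guaranteed by the C standard); the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw 32 bits of a float into an integer.
   Assumption: float is IEEE 754 single precision on this platform. */
uint32_t f32_bits(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Field extraction following the layout above. */
uint32_t f32_sign(uint32_t bits)            { return bits >> 31; }             /* bit 31     */
uint32_t f32_biased_exponent(uint32_t bits) { return (bits >> 23) & 0xFFu; }   /* bits 30-23 */
uint32_t f32_fraction(uint32_t bits)        { return bits & 0x7FFFFFu; }       /* bits 22-0  */
```

For 9.75f, f32_bits returns 0x411C0000, with a biased exponent of 0x82 and a fraction field of 0x1C0000.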

IEEE 754 Double-Precision Floating Point

Example
Convert 152.1875 (decimal) to IEEE 754 double-precision floating-point format.
Solution:
  • Sign bit 63 is 0 for positive.
  • Decimal 152.1875 = binary 10011000.0011, which is normalized to 1.00110000011 E 7.
  • Exponent bits 62 to 52 are 100 0000 0110 after adding the bias (7 + 0x3FF = 0x406).
  • Significand bits 51 to 0 are 0011 0000 0110 0000 ... 0000.
  • Putting them all together gives the following:
      0100 0000 0110 0011 0000 0110 0000 ... 0000 (0x4063060000000000)
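The double-precision result can be verified the same way. The sketch below assumes the platform's double type is IEEE 754 double precision; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw 64 bits of a double into an integer.
   Assumption: double is IEEE 754 double precision on this platform. */
uint64_t f64_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Field extraction: 1 sign bit, 11 exponent bits, 52 fraction bits. */
uint64_t f64_sign(uint64_t bits)            { return bits >> 63; }                     /* bit 63     */
uint64_t f64_biased_exponent(uint64_t bits) { return (bits >> 52) & 0x7FFu; }          /* bits 62-52 */
uint64_t f64_fraction(uint64_t bits)        { return bits & 0xFFFFFFFFFFFFFULL; }      /* bits 51-0  */
```

For 152.1875, f64_bits returns 0x4063060000000000, with a biased exponent of 0x406.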

Half-Precision Floating Point

Arm Arithmetic Coprocessors
  • VFP (Vector Floating-Point):
      • Performs single-precision and double-precision arithmetic operations that are fully compliant with the IEEE 754 standard
  • NEON:
      • SIMD (Single Instruction, Multiple Data)
      • Supports integers, fixed-point numbers, and single-precision floating point
      • Used for media applications and digital signal processing

VFP and NEON in Raspberry Pi
    Raspberry Pi   Pi 1    Pi 2    Pi 3    Pi Zero
    VFP            VFPv2   VFPv3   VFPv4   VFPv2
    NEON           No      Yes     Yes     No

Registers in VFPv2

Floating-Point Status and Control Register (FPSCR)
    Bits        Name         Function
    31-28       N, Z, C, V   Negative, Zero, Carry, Overflow flags
    25          DN           Default NaN mode control
    24          FZ           Flush-to-zero mode control
    23-22       RMode        Rounding mode control
    21-20       Stride       Step size in vector
    18-16       Len          Length of the vector
    15, 12-8    -            Exception trap enable bits
    7, 4-0      -            Cumulative exception bits
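The bit positions in the table can be turned into field extractors. The sketch below works on a plain 32-bit value that is assumed to already hold a copy of the FPSCR; it does not read the register itself, and all function names are hypothetical.

```c
#include <stdint.h>

/* Extract bits high..low (inclusive) from a 32-bit value. */
static uint32_t field(uint32_t value, unsigned high, unsigned low) {
    unsigned width = high - low + 1u;
    uint32_t mask = (width >= 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (value >> low) & mask;
}

/* One accessor per named field of the table above. */
uint32_t fpscr_nzcv(uint32_t v)   { return field(v, 31, 28); } /* N, Z, C, V flags    */
uint32_t fpscr_dn(uint32_t v)     { return field(v, 25, 25); } /* default NaN mode    */
uint32_t fpscr_fz(uint32_t v)     { return field(v, 24, 24); } /* flush-to-zero mode  */
uint32_t fpscr_rmode(uint32_t v)  { return field(v, 23, 22); } /* rounding mode       */
uint32_t fpscr_stride(uint32_t v) { return field(v, 21, 20); } /* vector step size    */
uint32_t fpscr_len(uint32_t v)    { return field(v, 18, 16); } /* vector length field */
```

For example, a value of 0x00C00000 has both RMode bits set, so fpscr_rmode returns 3.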

Floating-Point Data Processing Instructions
    Mnemonic   Function                         Description
    VABS       Absolute                         Obtain the absolute value of the operand
    VNEG       Negate                           Negate the value of the operand
    VSQRT      Square root                      Obtain the square root of the operand
    VADD       Add                              Add the operands
    VSUB       Subtract                         Subtract the second operand from the first operand
    VDIV       Divide                           Divide the first operand by the second operand
    VMUL       Multiply                         Multiply the two operands
    VNMUL      Multiply negate                  Multiply the two operands, then negate the result
    VMLA       Multiply and accumulate          Multiply the two operands, then add the result to the destination register and store the final result in the destination register
    VNMLA      Multiply and accumulate negate   Multiply the two operands, then add the result to the destination register, negate the final result, and store it in the destination register
    VMLS       Multiply and subtract            Multiply the two operands, then subtract the result from the destination register and store the final result in the destination register
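The multiply-based instructions are easy to confuse, so here is a C model of the descriptions in the table. It is only a sketch of the arithmetic as described above: the function names are hypothetical, d stands for the old value of the destination register, and rounding details of the real hardware are not modeled.

```c
/* d = previous destination value, a and b = the two operands. */

/* VNMUL: multiply, then negate the result. */
float model_vnmul(float a, float b)          { return -(a * b); }

/* VMLA: multiply, then add the product to the destination. */
float model_vmla(float d, float a, float b)  { return d + a * b; }

/* VNMLA: multiply, add to the destination, then negate the final result. */
float model_vnmla(float d, float a, float b) { return -(d + a * b); }

/* VMLS: multiply, then subtract the product from the destination. */
float model_vmls(float d, float a, float b)  { return d - a * b; }
```

With d = 1, a = 2, and b = 3, the four functions return -6, 7, -7, and -5 respectively.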

Format Modifiers of Floating-Point Instructions
    Modifier   Type
    .f32       Single precision
    .f64       Double precision
Examples:
    vabs.f32 s1, s0   @ s1 = abs(s0)
    vneg.f64 d1, d0   @ d1 = -d0

VMOV
  • Between a CPU register and a VFP register:
      vmov s1, r1           @ copy content of R1 to S1
      vmov r2, s2           @ copy content of S2 to R2
  • Between VFP registers:
      vmov.f32 s2, s1       @ copy content of S1 to S2
  • Immediate values (in VFPv3 and later):
      vmov.f32 s1, #2       @ load S1 with 2.0
      vmov.f32 s2, #0.125   @ load S2 with 0.125

VLDR and VSTR
Examples:
    vldr.f32 s2, [r2, #4]    @ R2 holds the base address
    vstr.f32 s2, [r3, #-4]   @ R3 holds the base address

Example
Write a program to calculate the area of a circle in single-precision floating-point format. The radius of the circle is in register S0, and the area should be left in S0.
Solution:
    vmul.f32 s0, s0          @ calculate r^2
    ldr      r2, =piNumber
    vldr.f32 s1, [r2]        @ load pi
    vmul.f32 s0, s1          @ multiply by pi
    ...
piNumber: .float 3.141592
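For comparison, the same computation in C is sketched below. The function name is hypothetical; the constant is the one used in the assembly listing, and the parameter r plays the role of S0.

```c
/* Single-precision circle area, mirroring the two vmul.f32 steps. */
float circle_area_f32(float r) {
    const float pi_number = 3.141592f;   /* same constant as piNumber */
    float r_squared = r * r;             /* vmul.f32 s0, s0 */
    return r_squared * pi_number;        /* vmul.f32 s0, s1 */
}
```

When built for a target with hardware floating point, a compiler would typically emit VFP multiply instructions for this function, though the exact output depends on the compiler and its options.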

Example: Write a program to add two floating-point numbers in memory and save the result in memory.
Solution:
    .text
    .global _start
_start:
    ldr      r3, =operand1   @ load address of operand1
    vldr.f32 s0, [r3]        @ load operand1 into S0
    ldr      r3, =operand2   @ load address of operand2
    vldr.f32 s1, [r3]        @ load operand2 into S1
    vadd.f32 s0, s1          @ add operand2 to operand1
    ldr      r3, =sum        @ load address of sum
    vstr.f32 s0, [r3]        @ store the result in sum
    mov      r7, #1          @ system call number 1 (exit on Linux)
    svc      0
operand1: .float 32.5
operand2: .float 23.4
    .data
sum: .space 4

VMSR and VMRS
  • VMRS moves the content of a VFP system register to one of the Arm registers.
      Example: moving the NZCV flags to the CPSR:
          vmrs APSR_nzcv, FPSCR
  • VMSR moves an Arm register to one of the VFP system registers.

VCVT
  • Converts between types:
      VCVT.type Sd, Sm
    In the examples below, the destination type is written first, followed by the source type.
    Modifier   Type
    .f32       Single precision
    .f64       Double precision
    .u32       32-bit unsigned integer
    .s32       32-bit signed integer
Examples:
    vcvt.f32.s32 s1, s0   @ signed int to float
    vcvt.f64.f32 d1, s0   @ float to double
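In C terms, VCVT corresponds to a value-preserving cast, whereas a plain VMOV from a CPU register only copies the bit pattern. The sketch below contrasts the two; the function names are hypothetical, and bits_to_float assumes that float is IEEE 754 single precision.

```c
#include <stdint.h>
#include <string.h>

/* Like vcvt.f32.s32: convert the integer VALUE to a float. */
float int_to_float(int32_t x) { return (float)x; }

/* Like vcvt.f64.f32: widen a float to a double (always exact). */
double float_to_double(float x) { return (double)x; }

/* Like vmov s1, r1: reinterpret the 32 bits as a float, no conversion.
   Assumption: float is IEEE 754 single precision. */
float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For instance, int_to_float(5) gives 5.0, while bits_to_float(0x40000000) gives 2.0 because 0x40000000 is the single-precision encoding of 2.0.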