Computer Organization CS 224 Fall 2012 Lesson 21

FP Example: °F to °C
• C code:
    float f2c (float fahr) {
      return ((5.0/9.0)*(fahr - 32.0));
    }
  - fahr in $f12, result in $f0, literals in global memory space
• Compiled MIPS code:
    f2c: lwc1  $f16, const5($gp)
         lwc1  $f18, const9($gp)
         div.s $f16, $f16, $f18
         lwc1  $f18, const32($gp)
         sub.s $f18, $f12, $f18
         mul.s $f0, $f16, $f18
         jr    $ra
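The compiled code works entirely in single precision (lwc1, div.s, sub.s, mul.s). A minimal C sketch of the same sequence of operations follows; the names f2c_single and f2c_selfcheck are hypothetical and not from the slides.

```c
#include <assert.h>

/* Hypothetical helper: °F to °C using only single-precision
   operations, in the same order as the MIPS code. */
float f2c_single(float fahr)
{
    float ratio = 5.0f / 9.0f;   /* div.s: 5.0 / 9.0   */
    float diff  = fahr - 32.0f;  /* sub.s: fahr - 32.0 */
    return ratio * diff;         /* mul.s: the result  */
}

/* Usage sketch: freezing maps to exactly 0; boiling lands within
   rounding error of 100 because 5/9 is not exactly representable. */
void f2c_selfcheck(void)
{
    float boiling = f2c_single(212.0f);
    assert(f2c_single(32.0f) == 0.0f);
    assert(boiling > 99.999f && boiling < 100.001f);
}
```

Comparing against a small tolerance rather than testing for equality with 100 is deliberate, since the single-precision quotient 5/9 is rounded.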

FP Example: Array Multiplication
• X = X + Y × Z
  - All 32 × 32 matrices, 64-bit double-precision elements
• C code:
    void mm (double x[][32], double y[][32], double z[][32]) {
      int i, j, k;
      for (i = 0; i != 32; i = i + 1)
        for (j = 0; j != 32; j = j + 1)
          for (k = 0; k != 32; k = k + 1)
            x[i][j] = x[i][j] + y[i][k] * z[k][j];
    }
  - Addresses of x, y, z in $a0, $a1, $a2, and i, j, k in $s0, $s1, $s2

FP Example: Array Multiplication
• MIPS code:
        li   $t1, 32         # $t1 = 32 (row size/loop end)
        li   $s0, 0          # i = 0; initialize 1st for loop
    L1: li   $s1, 0          # j = 0; restart 2nd for loop
    L2: li   $s2, 0          # k = 0; restart 3rd for loop
        sll  $t2, $s0, 5     # $t2 = i * 32 (size of row of x)
        addu $t2, $t2, $s1   # $t2 = i * size(row) + j
        sll  $t2, $t2, 3     # $t2 = byte offset of [i][j]
        addu $t2, $a0, $t2   # $t2 = byte address of x[i][j]
        l.d  $f4, 0($t2)     # $f4 = 8 bytes of x[i][j]
    L3: sll  $t0, $s2, 5     # $t0 = k * 32 (size of row of z)
        addu $t0, $t0, $s1   # $t0 = k * size(row) + j
        sll  $t0, $t0, 3     # $t0 = byte offset of [k][j]
        addu $t0, $a2, $t0   # $t0 = byte address of z[k][j]
        l.d  $f16, 0($t0)    # $f16 = 8 bytes of z[k][j]
        …

FP Example: Array Multiplication
        …
        sll   $t0, $s0, 5       # $t0 = i * 32 (size of row of y)
        addu  $t0, $t0, $s2     # $t0 = i * size(row) + k
        sll   $t0, $t0, 3       # $t0 = byte offset of [i][k]
        addu  $t0, $a1, $t0     # $t0 = byte address of y[i][k]
        l.d   $f18, 0($t0)      # $f18 = 8 bytes of y[i][k]
        mul.d $f16, $f18, $f16  # $f16 = y[i][k] * z[k][j]
        add.d $f4, $f4, $f16    # $f4 = x[i][j] + y[i][k] * z[k][j]
        addiu $s2, $s2, 1       # k = k + 1
        bne   $s2, $t1, L3      # if (k != 32) go to L3
        s.d   $f4, 0($t2)       # x[i][j] = $f4
        addiu $s1, $s1, 1       # j = j + 1
        bne   $s1, $t1, L2      # if (j != 32) go to L2
        addiu $s0, $s0, 1       # i = i + 1
        bne   $s0, $t1, L1      # if (i != 32) go to L1
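The sll/addu/sll/addu pattern that precedes each l.d is plain row-major address arithmetic. A rough C sketch of that computation is below; elem_offset and load_elem are hypothetical names, and the row length of 32 and the 8-byte element size are taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 32 };  /* row length used on the slides */

/* Byte offset of element [row][col] in an N x N array of doubles,
   computed the way the MIPS code does it: shift left by 5 (times 32),
   add the column, then shift left by 3 (times 8 bytes per double). */
size_t elem_offset(size_t row, size_t col)
{
    size_t index = (row << 5) + col;  /* sll by 5, then addu */
    return index << 3;                /* sll by 3            */
}

/* Load x[row][col] through an explicit byte address, like l.d. */
double load_elem(double x[][N], size_t row, size_t col)
{
    assert(row < N && col < N);
    const char *base = (const char *)x;  /* base address, as in $a0 */
    return *(const double *)(base + elem_offset(row, col));
}
```

For example, element [3][7] sits at byte offset (3 × 32 + 7) × 8 = 824 from the start of the array.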

Accurate Arithmetic
• IEEE Std 754 specifies additional rounding control
  - Extra bits of precision (guard, round, sticky)
  - Choice of rounding modes
  - Allows programmer to fine-tune numerical behavior of a computation
• Not all FP units implement all options
  - Most programming languages and FP libraries just use defaults
• Trade-off between hardware complexity, performance, and market requirements

Support for Accurate Arithmetic
• IEEE 754 FP rounding modes
  - Always round up (toward +∞)
  - Always round down (toward -∞)
  - Truncate (toward 0)
  - Round to nearest even: on a tie (Guard || Round || Sticky = 100), always creates a 0 in the least significant (kept) bit of F
• Rounding (except for truncation) requires the hardware to include extra F bits during calculations
  - Guard bit (G): used to provide one F bit when shifting left to normalize a result (e.g., when normalizing F after division or subtraction)
  - Round bit (R): used to improve rounding accuracy
  - Sticky bit (S): used to support round to nearest even; is set to 1 whenever a 1 bit shifts (right) through it (e.g., when aligning F during addition/subtraction)
  - F = 1.xxxxxxxxxxxx G R S
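How the three extra bits drive round to nearest even can be sketched in C on an integer significand. This is an illustration under stated assumptions, not a description of any particular FP unit: the low three bits of the input are taken to be G, R and S, the function names are hypothetical, and a carry out of the kept bits (which would need renormalization) is not handled.

```c
#include <assert.h>
#include <stdint.h>

/* Assumes the low three bits of `extended` are the guard, round and
   sticky bits (G R S); the bits above them are the F bits to keep. */
uint32_t round_nearest_even(uint32_t extended)
{
    uint32_t kept = extended >> 3;    /* F bits to keep            */
    uint32_t grs  = extended & 0x7u;  /* G R S as a 3-bit value    */

    if (grs > 4u) {                   /* above halfway: round up   */
        kept += 1u;
    } else if (grs == 4u) {           /* exactly halfway (100)     */
        kept += kept & 1u;            /* round up only if F is odd */
    }                                 /* below halfway: truncate   */
    return kept;
}

/* Shift right by n bits, as when aligning significands, OR-ing every
   bit that falls off the end into the lowest remaining bit, which
   plays the role of the sticky bit. */
uint32_t shift_right_sticky(uint32_t value, unsigned n)
{
    assert(n < 32u);
    uint32_t lost = value & ((1u << n) - 1u);  /* bits shifted out */
    return (value >> n) | (uint32_t)(lost != 0u);
}
```

With F = 1011 and GRS = 100 the tie rounds up to 1100; with F = 1010 and GRS = 100 it stays at 1010. Either way the kept least significant bit ends up 0.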

Associativity (§3.6 Parallelism and Computer Arithmetic: Associativity)
• Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail, since FP operations are not associative!
• Need to validate parallel programs under varying degrees of parallelism
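A small C sketch of the failure: the same three doubles summed in two groupings give different results. The function names and the operand values are illustrative choices, not taken from the slides.

```c
#include <assert.h>

/* Sum left to right: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;  /* volatile: force rounding to double */
    return t + c;
}

/* Sum right to left: a + (b + c). */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;  /* c is lost here if b is huge */
    return a + t;
}

/* Usage sketch: 1.0 is far below the spacing between doubles near
   1.5e38, so adding it to 1.5e38 changes nothing, while adding it
   after the large values cancel preserves it. */
void associativity_selfcheck(void)
{
    assert(sum_left(-1.5e38, 1.5e38, 1.0) == 1.0);
    assert(sum_right(-1.5e38, 1.5e38, 1.0) == 0.0);
}
```

A parallel reduction that changes how partial sums are grouped can hit exactly this kind of difference, which is why results should be checked at several degrees of parallelism.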

x86 FP Architecture (§3.7 Real Stuff: Floating Point in the x86)
• Originally based on 8087 FP coprocessor
  - 8 × 80-bit extended-precision registers
  - Used as a push-down stack
  - Registers indexed from TOS: ST(0), ST(1), …
• FP values are 32-bit or 64-bit in memory
  - Converted on load/store of memory operand
  - Integer operands can also be converted on load/store
• Very difficult to generate and optimize code
  - Result: poor FP performance

x86 FP Instructions
  Data transfer:   FILD mem/ST(i), FISTP mem/ST(i), FLDPI, FLD1, FLDZ
  Arithmetic:      FIADDP, FISUBRP, FIMULP, FIDIVRP, FSQRT, FABS, FRNDINT
  Compare:         FICOMP, FIUCOMP, FSTSW AX/mem
  Transcendental:  FPATAN, F2XM1, FCOS, FPTAN, FPREM, FSIN, FYL2X
• Optional variations
  - I: integer operand
  - P: pop operand from stack
  - R: reverse operand order
  - But not all combinations allowed

Streaming SIMD Extension 2 (SSE2)
• Uses 8 × 128-bit registers (XMM0 to XMM7)
  - Extended to 16 registers in AMD64/EM64T
• Can be used for multiple FP operands
  - 2 × 64-bit double precision
  - 4 × 32-bit single precision
  - Instructions operate on them simultaneously: Single-Instruction Multiple-Data

Concluding Remarks (§3.9)
• ISAs support arithmetic
  - Signed and unsigned integers
  - Floating-point approximation to reals
• Bounded range and precision
  - Operations can overflow and underflow
• MIPS ISA
  - Core instructions: 31 integer + 23 arithmetic, the 54 most frequently used (Figure 3.24)
    - 100% of SPECINT (p. 282)
    - 97% of SPECFP (p. 282)
  - Other instructions: much less frequent (Figure 3.25)
• See Fig 2.45 and Fig 3.26
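Bounded range is easy to observe from C. The sketch below uses hypothetical helper names and assumes the default IEEE 754 environment (round to nearest, no trapping). It sticks to unsigned integers because signed overflow is undefined behavior in C.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* Unsigned integer overflow wraps around modulo 2^N. */
unsigned wrap_increment(unsigned x)
{
    return x + 1u;
}

/* Returns 1 if doubling x overflows past the largest finite double. */
int doubles_to_infinity(double x)
{
    volatile double r = x * 2.0;  /* volatile: compute at run time */
    return r > DBL_MAX;
}

/* Returns 1 if a nonzero x divided by d underflows all the way to 0. */
int underflows_to_zero(double x, double d)
{
    volatile double r = x / d;
    return x != 0.0 && r == 0.0;
}

/* Usage sketch for the three cases above. */
void range_selfcheck(void)
{
    assert(wrap_increment(UINT_MAX) == 0u);
    assert(doubles_to_infinity(DBL_MAX));
    assert(underflows_to_zero(DBL_MIN, 1e300));
}
```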