

INF5063: Programming heterogeneous multi-core processors
x86
September 14, 2021

AMD K8

University of Oslo, INF5063: Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland

Intel Nehalem

Special instructions…
§ MMX
− MMX is officially a meaningless initialism trademarked by Intel; unofficially:
  • MultiMedia eXtension
  • Multiple Math eXtension
  • Matrix Math eXtension
− SIMD (Single Instruction, Multiple Data) computation processes multiple data elements in parallel with a single instruction, which can improve performance significantly; MMX performs 2 computations at once on 32-bit integers (4 on 16-bit, 8 on 8-bit integers).
− MMX defined 8 "new" 64-bit integer registers (mm0 ~ mm7), which were aliases for the existing x87 FPU registers, reusing 64 (out of 80) bits of each floating-point register.

Special instructions…
§ SSE
− Streaming SIMD Extensions (SSE)
− SIMD; 4 computations at once.
− SSE defines 8 new 128-bit registers (xmm0 ~ xmm7) for single-precision floating-point computations. Since each register is 128 bits long, it can hold four 32-bit floating-point numbers (1-bit sign, 8-bit exponent, 23-bit mantissa).
− Scalar (single) or packed operations: __SS vs __PS

Example: Matrix Multiplication

    | 1 1 1 1 |   | 1 |   | 1 +  2 +  3 +  4 |   | 10 |
    | 2 2 2 2 | × | 2 | = | 2 +  4 +  6 +  8 | = | 20 |
    | 3 3 3 3 |   | 3 |   | 3 +  6 +  9 + 12 |   | 30 |
    | 4 4 4 4 |   | 4 |   | 4 +  8 + 12 + 16 |   | 40 |

Example: Matrix Multiplication - C

    #include <stdio.h>

    float elts[4][4] = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4};
    float vin[4] = {1, 2, 3, 4};
    float vout[4];

    int main(void)
    {
        /* each output element is the dot product of one matrix row with vin */
        vout[0] = elts[0][0] * vin[0] + elts[0][1] * vin[1] +
                  elts[0][2] * vin[2] + elts[0][3] * vin[3];
        vout[1] = elts[1][0] * vin[0] + elts[1][1] * vin[1] +
                  elts[1][2] * vin[2] + elts[1][3] * vin[3];
        vout[2] = elts[2][0] * vin[0] + elts[2][1] * vin[1] +
                  elts[2][2] * vin[2] + elts[2][3] * vin[3];
        vout[3] = elts[3][0] * vin[0] + elts[3][1] * vin[1] +
                  elts[3][2] * vin[2] + elts[3][3] * vin[3];

        printf("%f %f %f %f\n", vout[0], vout[1], vout[2], vout[3]);
        return 0;
    }


Example: Matrix Multiplication – SSE

Assuming elts in a COLUMN-MAJOR order, the matrix

    a b c d
    e f g h
    i j k l
    m n o p

is stored in memory as: a e i m  b f j n  c g k o  d h l p


Example: Matrix Multiplication – SSE

    __asm {
        mov     esi, VIN
        mov     edi, VOUT

        // load columns of matrix into xmm4-7
        mov     edx, ELTS
        movups  xmm4, [edx]
        movups  xmm5, [edx + 0x10]
        movups  xmm6, [edx + 0x20]
        movups  xmm7, [edx + 0x30]

        // load v into xmm0
        movups  xmm0, [esi]

        // we'll store the final result in xmm2; initialize it to zero
        xorps   xmm2, xmm2

        // broadcast x into xmm1, multiply it by the first
        // column of the matrix (xmm4), and add it to the total
        movups  xmm1, xmm0
        shufps  xmm1, xmm1, 0x00
        mulps   xmm1, xmm4
        addps   xmm2, xmm1

        // repeat the process for y, z and w
        movups  xmm1, xmm0
        shufps  xmm1, xmm1, 0x55
        mulps   xmm1, xmm5
        addps   xmm2, xmm1

        movups  xmm1, xmm0
        shufps  xmm1, xmm1, 0xAA
        mulps   xmm1, xmm6
        addps   xmm2, xmm1

        movups  xmm1, xmm0
        shufps  xmm1, xmm1, 0xFF
        mulps   xmm1, xmm7
        addps   xmm2, xmm1

        // write the results to vout
        movups  [edi], xmm2
    }


Example: Matrix Multiplication – SSE

[Figure: the SSE listing repeated with per-lane values overlaid: 1, 1, 1; 2, 2, 4; 3, 3, 9; 4, 4, 16. Each triple reads as a × b = product, as in a packed multiply (mulps).]


Example: Matrix Multiplication – SSE

[Figure: the SSE listing repeated with per-lane values overlaid: 1, 1, 2; 2, 2, 4; 3, 3, 6; 4, 4, 8. Each triple reads as a + b = sum, as in a packed add (addps).]
