INF5063: Programming heterogeneous multicore processors (x86)


INF5063: Programming heterogeneous multi-core processors
x86
September 14, 2021

AMD K8
University of Oslo INF5063, Pål Halvorsen, Carsten Griwodz, Håvard Espeland, Håkon Stensland

Intel Nehalem

Special instructions…
§ MMX
  − MMX is officially a meaningless initialism trademarked by Intel; unofficially:
    • MultiMedia eXtension
    • Multiple Math eXtension
    • Matrix Math eXtension
  − SIMD (Single Instruction, Multiple Data) computation processes multiple data elements in parallel with a single instruction, which can improve performance significantly; MMX gives 2 computations at once on 32-bit integers (4 on 16-bit or 8 on 8-bit integers).
  − MMX defined 8 "new" 64-bit integer registers (mm0 ~ mm7), which were aliases for the existing x87 FPU registers, reusing 64 (out of 80) bits of each floating-point register.
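To make the idea of independent lanes concrete, here is a minimal sketch in portable C that emulates a packed 32-bit add on one 64-bit value, in the spirit of the MMX PADDD instruction. The helper name packed_add32 is hypothetical, and the code is a scalar emulation for illustration, not real MMX code.

```c
#include <stdint.h>

/* Hypothetical helper: treats a and b as two independent 32-bit lanes
   packed into 64 bits and adds them lane by lane. Each lane wraps on
   overflow; no carry moves from the low lane into the high lane. */
uint64_t packed_add32(uint64_t a, uint64_t b)
{
    uint32_t lo = (uint32_t)a + (uint32_t)b;                 /* lane 0 */
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32); /* lane 1 */
    return ((uint64_t)hi << 32) | lo;
}
```

With real hardware support, both lane additions happen in a single instruction; the emulation only shows what result that instruction produces.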

Special instructions…
§ SSE
  − Streaming SIMD Extensions (SSE)
  − SIMD; 4 computations at once.
  − SSE defines 8 new 128-bit registers (xmm0 ~ xmm7) for single-precision floating-point computations. Since each register is 128 bits long, it can hold four 32-bit floating-point numbers (1-bit sign, 8-bit exponent, 23-bit mantissa).
  − Scalar or packed operations: __SS vs __PS. The SS suffix (scalar single-precision) operates on the lowest element only; the PS suffix (packed single-precision) operates on all four elements, e.g. ADDSS vs ADDPS.
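The difference between the SS and PS forms can be sketched with compiler intrinsics. The example below assumes an x86 compiler with SSE enabled and the `<xmmintrin.h>` header; the function names add_packed and add_scalar are hypothetical.

```c
#include <xmmintrin.h>

/* Packed add (ADDPS): all four float lanes are added. */
void add_packed(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Scalar add (ADDSS): only lane 0 is added; lanes 1-3 are copied from a. */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}
```

For a = (1, 2, 3, 4) and b = (10, 20, 30, 40), the packed version yields (11, 22, 33, 44), while the scalar version yields (11, 2, 3, 4).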

Example: Matrix Multiplication

    1 1 1 1       1       1 +  2 +  3 +  4       10
    2 2 2 2   x   2   =   2 +  4 +  6 +  8   =   20
    3 3 3 3       3       3 +  6 +  9 + 12       30
    4 4 4 4       4       4 +  8 + 12 + 16       40

Example: Matrix Multiplication - C

    #include <stdio.h>

    /* 4x4 matrix, one row per group of four values */
    float elts[4][4] = {1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4};
    float vin[4] = {1, 2, 3, 4};
    float vout[4];

    int main(void)
    {
        /* each output element is the dot product of one matrix row with vin */
        vout[0] = elts[0][0] * vin[0] + elts[0][1] * vin[1] +
                  elts[0][2] * vin[2] + elts[0][3] * vin[3];
        vout[1] = elts[1][0] * vin[0] + elts[1][1] * vin[1] +
                  elts[1][2] * vin[2] + elts[1][3] * vin[3];
        vout[2] = elts[2][0] * vin[0] + elts[2][1] * vin[1] +
                  elts[2][2] * vin[2] + elts[2][3] * vin[3];
        vout[3] = elts[3][0] * vin[0] + elts[3][1] * vin[1] +
                  elts[3][2] * vin[2] + elts[3][3] * vin[3];

        printf("%f %f %f %f\n", vout[0], vout[1], vout[2], vout[3]);
        return 0;
    }
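The four unrolled expressions in the C example are each a dot product of one matrix row with the input vector. As a minimal sketch, the same computation can be written with loops; the function name mat4_mul_vec4 is hypothetical and the matrix is assumed to be stored row by row.

```c
/* Hypothetical helper: out = m * v for a 4x4 matrix stored row by row. */
void mat4_mul_vec4(float m[4][4], const float v[4], float out[4])
{
    for (int row = 0; row < 4; ++row) {
        float sum = 0.0f;
        /* dot product of one matrix row with the input vector */
        for (int col = 0; col < 4; ++col)
            sum += m[row][col] * v[col];
        out[row] = sum;
    }
}
```

Each output element needs four multiplications and three additions, and the four rows are independent of each other, which is what makes the computation a natural fit for 4-wide SIMD.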

Example: Matrix Multiplication – SSE

The SSE version computes the same vout as the C version, assuming elts is stored in COLUMN-MAJOR order. That is, the matrix

    a b c d
    e f g h
    i j k l
    m n o p

is laid out in memory as

    a e i m   b f j n   c g k o   d h l p
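If the matrix is available as an ordinary C two-dimensional array (stored row by row), it has to be rearranged before the column-major assumption holds. A minimal sketch of that rearrangement follows; the helper name to_column_major is hypothetical.

```c
/* Hypothetical helper: copies a 4x4 matrix stored row by row into a flat
   array of 16 floats in column-major order (column 0 first, then 1, ...). */
void to_column_major(float rows[4][4], float cols[16])
{
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            cols[c * 4 + r] = rows[r][c];
}
```

After the copy, each group of four consecutive floats in cols is one matrix column, so a single 128-bit load fetches a whole column.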

Example: Matrix Multiplication – SSE

    __asm {
        // VIN, VOUT and ELTS are assumed to hold the addresses of
        // vin, vout and elts
        mov esi, VIN
        mov edi, VOUT

        // load columns of matrix into xmm4-7
        mov edx, ELTS
        movups xmm4, [edx]
        movups xmm5, [edx + 0x10]
        movups xmm6, [edx + 0x20]
        movups xmm7, [edx + 0x30]

        // load v into xmm0.
        movups xmm0, [esi]

        // we'll store the final result in xmm2; initialize it
        // to zero
        xorps xmm2, xmm2

        // broadcast x into xmm1, multiply it by the first
        // column of the matrix (xmm4), and add it to the total
        movups xmm1, xmm0
        shufps xmm1, xmm1, 0x00
        mulps xmm1, xmm4
        addps xmm2, xmm1

        // repeat the process for y, z and w
        movups xmm1, xmm0
        shufps xmm1, xmm1, 0x55
        mulps xmm1, xmm5
        addps xmm2, xmm1

        movups xmm1, xmm0
        shufps xmm1, xmm1, 0xAA
        mulps xmm1, xmm6
        addps xmm2, xmm1

        movups xmm1, xmm0
        shufps xmm1, xmm1, 0xFF
        mulps xmm1, xmm7
        addps xmm2, xmm1

        // write the results to vout
        movups [edi], xmm2
    }
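The `__asm { ... }` block syntax is specific to certain compilers, so the same sequence of steps can also be sketched with SSE intrinsics from `<xmmintrin.h>`. This is an illustrative translation, not part of the original slides; the function name mat4_mul_vec4_sse is hypothetical, and the matrix is assumed to be 16 floats in column-major order.

```c
#include <xmmintrin.h>

/* Hypothetical helper: out = M * v, where cols holds the 4x4 matrix M as
   16 floats in column-major order. Mirrors the assembly step by step. */
void mat4_mul_vec4_sse(const float *cols, const float *v, float *out)
{
    /* load the four matrix columns (xmm4-7 in the assembly) */
    __m128 c0 = _mm_loadu_ps(cols + 0);
    __m128 c1 = _mm_loadu_ps(cols + 4);
    __m128 c2 = _mm_loadu_ps(cols + 8);
    __m128 c3 = _mm_loadu_ps(cols + 12);

    /* load the input vector (xmm0) and zero the accumulator (xmm2) */
    __m128 vec = _mm_loadu_ps(v);
    __m128 sum = _mm_setzero_ps();

    /* broadcast x, y, z, w in turn, scale one column each, accumulate */
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_shuffle_ps(vec, vec, 0x00), c0));
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_shuffle_ps(vec, vec, 0x55), c1));
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_shuffle_ps(vec, vec, 0xAA), c2));
    sum = _mm_add_ps(sum, _mm_mul_ps(_mm_shuffle_ps(vec, vec, 0xFF), c3));

    /* write the four results */
    _mm_storeu_ps(out, sum);
}
```

The shuffle constants 0x00, 0x55, 0xAA and 0xFF select element 0, 1, 2 or 3 of the vector for all four lanes, which is how a single component is broadcast across the register. Unaligned loads and stores are used, as in the assembly, so the arrays need no special alignment.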


