Multimedia ISA Extensions
• Intel’s MMX
  – The Basics
  – Instruction Set
  – Examples
  – Integration into Pentium
  – Relationship to vector ISAs
• AMD’s 3DNow!
• Intel’s ISSE (a.k.a. KNI)
ECE 1773 - Spring ‘02. Some material © Hill, Sohi, Smith, Wood (UW-Madison) © A. Moshovos

MMX: Basics
• Multimedia applications are becoming popular
• Are current ISAs a good match for them?
• Methodology:
  – Consider a number of “typical” applications
  – Can we do better?
  – Cost vs. performance vs. utility tradeoffs
• Net result: Intel’s MMX
• Can also be viewed as an attempt to maintain market share
  – If people are going to use these kinds of applications, we had better support them

Multimedia Applications
• Most multimedia apps have lots of parallelism:
  – for I = here to infinity
    • out[I] = in_a[I] * in_b[I]
  – At runtime:
    • out[0] = in_a[0] * in_b[0]
    • out[1] = in_a[1] * in_b[1]
    • out[2] = in_a[2] * in_b[2]
    • out[3] = in_a[3] * in_b[3]
    • …
• Also, they work on short integers:
  – in_a[i] is 0 to 255, for example (8-bit color)
  – or 0 to 64K (16-bit audio)

Observations
• 32-bit registers are wasted
  – only part of each register is used, and we know it
  – ALUs are underutilized, and we know it
• Instruction specification is inefficient
  – even though we know that many identical operations will be performed, we still have to specify each of them individually
  – Instruction bandwidth
  – Discovering parallelism
  – Memory ports?
    • Could read four elements of an array with one 32-bit load
    • Same for stores
    • The hardware will have a hard time discovering this
      – Coalescing and dependences

MMX Contd.
• Can do better than a traditional ISA
  – new data types
  – new instructions
• Pack data in 64-bit words
  – bytes
  – “words” (16 bits)
  – “double words” (32 bits)
• Operate on packed data like short vectors (arrays)

MMX: Example
• Up to 8 operations (64 bits) go in parallel
• Potential improvement: 8x
• In practice less, but still good
• Besides, another reason to think your machine is obsolete
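
To make the "8 operations in parallel" idea concrete, here is a minimal Python sketch that models one 64-bit register holding eight byte lanes and adds two such registers lane by lane with wraparound. The helper names (pack_bytes, unpack_bytes, packed_add_bytes) are hypothetical and only model the behaviour in software; real MMX does this in a single instruction.

```python
def pack_bytes(values):
    """Pack eight 8-bit values into one 64-bit word; values[0] is the low byte."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_bytes(word):
    """Split a 64-bit word back into its eight byte lanes (low byte first)."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add_bytes(a, b):
    """Add eight byte lanes independently; each lane wraps modulo 256."""
    result = 0
    for i in range(8):
        shift = 8 * i
        # Masking to 8 bits keeps a carry from leaking into the next lane.
        lane = ((a >> shift) + (b >> shift)) & 0xFF
        result |= lane << shift
    return result


a = pack_bytes([1, 2, 3, 4, 5, 6, 7, 250])
b = pack_bytes([10] * 8)
lanes = unpack_bytes(packed_add_bytes(a, b))
```

The last lane shows why lanes must be isolated: 250 + 10 wraps within its own byte instead of carrying into a neighbour.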

MMX Data Types

A bit of History
• This is a special case of SIMD
  – Single Instruction, Multiple Data
• One instruction specifies that an operation should be applied:
  – Repeatedly
  – To possibly different data elements each time
  – Each of these operations is independent
• Conventional ISA is SISD
  – Single Instruction, Single Data
• First used in Livermore S-1 (> 25 years ago)

MMX: Instruction Set
• 57 new instructions
• Integer Arithmetic
  – add/sub/mul
  – multiply-add
  – signed/unsigned
  – saturating/wraparound
• Shifts
• Compare (form mask)
• Pack/Unpack
• Move
  – from/to memory
  – from/to registers

Arithmetic
• Conventional: Wrap-around
  – on overflow, wrap to MININT
  – on underflow, wrap to MAXINT
• Think of digital audio
  – What happens when you turn volume to the MAX?
• Brightness in pictures
• Saturating arithmetic:
  – on overflow, stay at MAXINT
  – on underflow, stay at MININT
• Two flavors:
  – unsigned
  – signed
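
The difference between the two modes is easiest to see on single lanes. The following Python sketch models wraparound and saturating addition for an unsigned 8-bit lane and a signed 16-bit lane; the function names are hypothetical and chosen only for this illustration.

```python
def wrap_add_u8(a, b):
    """Unsigned 8-bit add with wraparound (result modulo 256)."""
    return (a + b) & 0xFF


def sat_add_u8(a, b):
    """Unsigned 8-bit add that clamps at 255 instead of wrapping."""
    return min(a + b, 255)


def sat_sub_u8(a, b):
    """Unsigned 8-bit subtract that clamps at 0 instead of wrapping."""
    return max(a - b, 0)


def wrap_add_s16(a, b):
    """Signed 16-bit add with two's-complement wraparound."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s >= 0x8000 else s


def sat_add_s16(a, b):
    """Signed 16-bit add that clamps to [-32768, 32767]."""
    return max(-32768, min(32767, a + b))
```

For a brightness value of 200 raised by 100, wraparound gives 44 (a dark pixel), while saturation gives 255 (full brightness), which is what a viewer would expect.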

Operations
• Mult/Add
• Compares
• Conversion
  – Interpolation/Transpose
  – Unpack (e.g., byte to word)
  – Pack (e.g., word to byte)

Examples
• Image Compositing
  – A and B images fade in and fade out
  – A * fade + B * (1 - fade), OR
  – (A - B) * fade + B
• Image Overlay
  – Sprite: e.g., mouse cursor
  – Sprite: normal colors + transparent
  – for i = 1 to Sprite_Length
    • if A[I] = clear_color then
      – Out_frame[I] = C[I]
      – else Out_frame[I] = A[I]
• Matrix Transpose
  – Convert from row major to column major
  – Used in JPEG
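
The two compositing formulas are algebraically the same, but the second needs only one multiply per pixel. A small Python sketch of both follows, assuming (this is not stated on the slide) that fade is an 8-bit fixed-point fraction in the range 0 to 256, where 256 means 1.0. The function names are hypothetical.

```python
def blend_two_mul(a, b, fade):
    """A * fade + B * (1 - fade), with fade scaled by 256 (two multiplies)."""
    return (a * fade + b * (256 - fade)) >> 8


def blend_one_mul(a, b, fade):
    """(A - B) * fade + B, with fade scaled by 256 (one multiply)."""
    # Python's >> floors for negative values, so (a - b) < 0 is handled.
    return (((a - b) * fade) >> 8) + b


# Exhaustive check over a coarse grid of pixel values and fade levels.
same = all(
    blend_two_mul(a, b, f) == blend_one_mul(a, b, f)
    for a in range(0, 256, 5)
    for b in range(0, 256, 5)
    for f in range(0, 257, 8)
)
```

With fade = 0 the result is pixel B, and with fade = 256 it is pixel A, so sweeping fade from 0 to 256 cross-fades from one image to the other.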

Matrix Transpose 4x4
• Input rows (four 16-bit words per register, lowest word on the right):
  – m03 m02 m01 m00
  – m13 m12 m11 m10
  – m23 m22 m21 m20
  – m33 m32 m31 m30
• Transposed output:
  – m30 m20 m10 m00
  – m31 m21 m11 m01
  – m32 m22 m12 m02
  – m33 m23 m13 m03
• Steps:
  – punpcklwd on rows 0 and 1: m11 m01 m10 m00
  – punpcklwd on rows 2 and 3: m31 m21 m30 m20
  – punpckldq on the two results: m30 m20 m10 m00
  – punpckhdq on the two results: m31 m21 m11 m01
• That’s for the first two rows
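
The unpack steps can be traced in software. The Python sketch below models each register as a list of four words with index 0 as the lowest word. The slide covers only the first two output rows; using punpckhwd for the remaining two rows is an assumption added here to complete the example, and transpose4x4 is a hypothetical helper name.

```python
def punpcklwd(dst, src):
    """Interleave the two low words of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]


def punpckhwd(dst, src):
    """Interleave the two high words of dst and src."""
    return [dst[2], src[2], dst[3], src[3]]


def punpckldq(dst, src):
    """Low doubleword (two words) of dst followed by low doubleword of src."""
    return dst[0:2] + src[0:2]


def punpckhdq(dst, src):
    """High doubleword of dst followed by high doubleword of src."""
    return dst[2:4] + src[2:4]


def transpose4x4(rows):
    """Transpose a 4x4 matrix of words using only unpack operations."""
    r0, r1, r2, r3 = rows
    lo01 = punpcklwd(r0, r1)  # m00 m10 m01 m11 (low word first)
    lo23 = punpcklwd(r2, r3)  # m20 m30 m21 m31
    hi01 = punpckhwd(r0, r1)  # m02 m12 m03 m13
    hi23 = punpckhwd(r2, r3)  # m22 m32 m23 m33
    return [
        punpckldq(lo01, lo23),  # column 0
        punpckhdq(lo01, lo23),  # column 1
        punpckldq(hi01, hi23),  # column 2
        punpckhdq(hi01, hi23),  # column 3
    ]


rows = [[10 * i + j for j in range(4)] for i in range(4)]
result = transpose4x4(rows)
```

Eight unpack operations transpose the whole matrix without touching individual elements one at a time.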

Chroma Keying
• for (i = 0; i < image_size; i++)
  – if (x[i] == Blue) new_image[i] = y[i];
  – else new_image[i] = x[i];

Chroma Keying Code
• Movq mm3, mem1
  – Load eight pixels from the person’s image
• Movq mm4, mem2
  – Load eight pixels from the background image
• Pcmpeqb, Pandn, Por
  – Operate on mm1, mm3, and mm4 to build a byte mask and merge the two images
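
The mask-and-merge idea can be sketched in Python with each register modelled as a list of eight bytes. This is an illustration under assumptions, not the slide's exact instruction sequence: the operand order and the extra pand step that selects the background pixels are filled in here, and chroma_key is a hypothetical helper name.

```python
def pcmpeqb(a, b):
    """Per-byte compare: 0xFF where the lanes are equal, 0x00 elsewhere."""
    return [0xFF if x == y else 0x00 for x, y in zip(a, b)]


def pand(a, b):
    """Per-byte AND."""
    return [x & y for x, y in zip(a, b)]


def pandn(a, b):
    """Per-byte (NOT a) AND b."""
    return [(~x & 0xFF) & y for x, y in zip(a, b)]


def por(a, b):
    """Per-byte OR."""
    return [x | y for x, y in zip(a, b)]


def chroma_key(person, background, blue):
    """Replace every 'blue' pixel of person with the background pixel."""
    mask = pcmpeqb([blue] * len(person), person)  # 0xFF where person is blue
    from_background = pand(background, mask)      # keep background under the mask
    from_person = pandn(mask, person)             # keep person elsewhere
    return por(from_background, from_person)


person = [7, 1, 7, 2, 3, 7, 4, 5]
background = [90, 91, 92, 93, 94, 95, 96, 97]
merged = chroma_key(person, background, blue=7)
```

No branch appears anywhere: both alternatives are computed for all eight pixels and the mask chooses between them.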

Integration into Pentium
• Major issue: OS compatibility
  – Create new registers?
  – Share registers with FP
    • Existing OSes will save/restore
• Use 64-bit datapaths
• Pipe capable of 2 MMX IPC
• Separate MEM and Execute stage

“Recent” Multimedia Extensions
• Intel MMX: integer arithmetic only
• New algorithms -> new needs
• Need for massive amounts of FP ops
• Solution? MMX-like ISA, but for FP, not only integer
• Example: AMD’s 3DNow!
  – New data type:
    • 2 packed single-precision FP
      – 2 x 32 bits
        » sign + exponent + significand
  – New instructions
  – Speedup potential: 2x

AMD’s 3DNow!
• 21 new instructions
• Average: motivated by MPEG
• Add, Sub, Reverse Sub, Mul
• Accumulate
  – (A1, A2) acc (B1, B2) = (B1 + B2, A1 + A2)
• Comparison (create mask)
• Min, Max (pairwise)
• Reciprocal and SQRT
  – Approximation: 1st step and other steps
• Prefetch
• Integer from/to FP conversion
• All operate on packed FP data
  – sign * 2^(exponent - 127) * mantissa
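
The single-precision layout used by each packed element can be checked directly. The Python sketch below splits a 32-bit float into sign, exponent, and fraction fields and rebuilds the value for normalized numbers; the function names are hypothetical, and zeros, denormals, infinities, and NaNs are deliberately left out.

```python
import struct


def decode_single(x):
    """Return the (sign, exponent, fraction) fields of x as a 32-bit float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def value_of(sign, exponent, fraction):
    """Rebuild a normalized value: (-1)^sign * 2^(exponent - 127) * 1.fraction."""
    mantissa = 1 + fraction / 2 ** 23  # implicit leading 1
    return (-1.0) ** sign * 2.0 ** (exponent - 127) * mantissa


fields = decode_single(-6.5)
```

Here -6.5 is -1.625 * 2^2, so the stored exponent is 129 and the fraction field encodes 0.625.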

Recent Extensions Cont.
• Intel’s ISSE
  – very similar to AMD’s 3DNow!
  – But has separate registers
• Lessons?
  – Applications change over time
  – Careful when introducing new instructions
    • How useful are they?
    • Cost?
    • LEGACY: are they going to be useful in the future?
• Everyone has their own multimedia instruction set these days
  – read handout

Intel’s SSE
• Multimedia/Internet?
• 70 new instructions
• Major types:
  – SIMD-FP: 128 bits wide, 4 x 32-bit FP
  – Data movement and re-organization
  – Type conversion
    • Int to FP and vice versa
    • Scalar/FP precision
  – State save/restore
    • New SSE registers, not like MMX
  – Memory streaming
    • Prefetch to specified hierarchy level
  – New media
    • Absolute diff, rounded AVG, MIN/MAX
• SSE2:
  – SIMD-FP: two 64-bit FP as 128 bits

Altivec (PowerPC Mmedia Ext)
• 128-bit registers
• 8-, 16-, or 32-bit data types
• Scalar or single-precision FP
• 162 instructions
• Saturation or modulo arithmetic
• Four-operand instructions
  – 3 sources, 1 target

Altivec Design Process
• Look at Mmedia kernels
• Justify new instructions
• Video
  – 8-bit int LowQ, 16-bit int HighQ
• Audio
  – 16-bit int LowQ, SP FP HighQ
• Image Processing
  – 8-bit int LowQ, 16-bit int HighQ
• 3D Graphics
  – 16-bit int LowQ, SP FP HighQ
• Speech Recog.
  – 16-bit int LowQ, SP FP HighQ
• Communications/Crypto
  – 8-bit or 16-bit unsigned int

Vector Processors
• Vector:
  – One-dimensional array of numbers
• Original motivation:
  – Scientific/numerical programs operate on vectors
• Parallelism abounds
• Example:
  – Do i = 1 to 64
    • C[I] = A[I] + B[I]
• Vector processors
  – Registers are vectors
  – Operations are element-wise across multiple vectors
• Example:
  – addv Rc, Ra, Rb

Vector Example
• Do i = 1 to 64
  – C[I] = A[I] + B[I]
• addv rc, ra, rb
  – c[0] = a[0] + b[0]
  – c[1] = a[1] + b[1]
  – c[2] = a[2] + b[2]
  – …
  – c[63] = a[63] + b[63]

Why Vector Processors?
• Deeper pipelines -> faster clock -> higher performance
• BUT!
  – Interlock logic becomes really complicated as the pipeline deepens
  – Bubbles due to data deps increase
• Want wider machines to exploit parallelism
• BUT!
  – Increasingly harder to increase issue width
• Finally, recall the fetch and issue bottleneck
  – Can’t execute more than you fetch/decode

What’s Good About Vector Procs
• Vectors facilitate deeper pipelines
  – No intra-vector interlocks
  – No intra-vector data deps
  – Inner-loop control deps eliminated
    • They were artificial to start with
  – Single instruction for multiple operations
  – Vector instruction provides information about what the machine is going to be doing for a while
    • Could exploit in memory system
    • Know that we are going to use 64 elements, which are likely one after the other

Vector Architectures
• Vectors in Memory
  – All vectors in memory
  – Long startup latency
  – Memory ports?
  – Good for long vectors
• Vectors in Registers
  – Load/store
  – Vector ops only on regs
  – Register ports less expensive than memory ports
  – Good for small vectors also
  – Vector register length is the limiter
• Fact: in most applications vectors are short
• Hence register vectors are better

Vector ISA Example
• Vector-Vector Insts
  – VRC[i] = VRA[i] op VRB[i]
• Vector-Scalar Inst
  – VRB[i] = VRA[i] op CONST
• Vector Load/Store
  – Mem[i] = VRA[i]
  – W/ Stride
    • M[r1 + i * r2] = VRA[i]
  – Indexed
    • M[r1 + VRB[i]] = VRA[i]
    • Also called scatter/gather
• Support for shorter vectors
  – Vector Length Register
• Vector Masks
  – VRb[i] = op VRa[i] if (VRc[i])
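
The addressing modes and the mask can be modelled with plain lists. The Python sketch below treats memory as a list of words and takes the vector length vl as an explicit argument, standing in for the vector length register. All function names are hypothetical and do not correspond to any particular vector ISA.

```python
def vadd(va, vb, vl):
    """Vector-vector add over the first vl elements."""
    return [va[i] + vb[i] for i in range(vl)]


def store_strided(mem, base, stride, va, vl):
    """M[base + i * stride] = VRA[i]."""
    for i in range(vl):
        mem[base + i * stride] = va[i]


def store_indexed(mem, base, vidx, va, vl):
    """Scatter: M[base + VRB[i]] = VRA[i]."""
    for i in range(vl):
        mem[base + vidx[i]] = va[i]


def load_indexed(mem, base, vidx, vl):
    """Gather: VRA[i] = M[base + VRB[i]]."""
    return [mem[base + vidx[i]] for i in range(vl)]


def masked_op(op, va, vmask, vold, vl):
    """Apply op only where the mask is set; other elements keep their old value."""
    return [op(va[i]) if vmask[i] else vold[i] for i in range(vl)]


mem = [0] * 16
store_strided(mem, 1, 3, [5, 6, 7, 8], vl=3)  # vl = 3 ignores the fourth element
store_indexed(mem, 8, [7, 0, 2], [1, 2, 3], vl=3)
gathered = load_indexed(mem, 8, [7, 0, 2], vl=3)
negated = masked_op(lambda x: -x, [1, 2, 3, 4], [1, 0, 1, 0], [0, 0, 0, 0], vl=4)
```

Setting vl below the register length is how a vector machine handles arrays that are shorter than a full register.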

Vector Chaining
• C[i] = A[i] * B[i]
• D[i] = C[i] + x
• MULTV VRC, VRA, VRB
• ADDVI VRD, VRC, Rx
• The add for VRD[i] can be initiated as soon as the MULTV result for element i is available
• We do not have to wait for the whole MULTV to finish
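
A rough cycle count shows what chaining buys. The Python sketch below assumes a simplified model that is not taken from the slides: each functional unit is fully pipelined, accepts one element per cycle, and has a fixed latency. The latencies in the example (7 cycles for multiply, 6 for add) are illustrative values only.

```python
def cycles_unchained(n, mul_latency, add_latency):
    """The add starts only after the last multiply result has been written."""
    mul_done = mul_latency + (n - 1)
    add_done = add_latency + (n - 1)
    return mul_done + add_done


def cycles_chained(n, mul_latency, add_latency):
    """Each multiply result is forwarded to the adder as soon as it appears."""
    return mul_latency + add_latency + (n - 1)


n = 64
unchained = cycles_unchained(n, mul_latency=7, add_latency=6)
chained = cycles_chained(n, mul_latency=7, add_latency=6)
```

Under these assumptions the pair of instructions takes 139 cycles without chaining and 76 with it, because the two pipelines overlap almost completely.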

Vector Processors – A bit of History
• CRAY-1: started in ’72, completed in ’74
• 12.5 ns cycle time
• 8 Scalar Registers
• 8 Address Registers
• 8 Vector Registers of 64 words each
• 64 Scalar and 64 Address temporaries
• 12 Functional Units
• 1 Mword memory: 4 clock cycles

Are Vectors Always a Win?
[Plot: time per element vs. vector size]
• From Gordon Bell’s talk
• Scalar is way better for short vectors
• Vector is 7x scalar for larger vectors

Cray-1 Architecture

Vectors and SIMD
• Vector Length
  – Not programmable (no VL reg)
  – Must be a multiple of 64 total bits
• Memory Load/Store
  – unit stride only
• Arithmetic
  – Integer only
• Conditionals
  – builds byte mask
  – do both ways and choose
  – no trap problem: no trapping instructions
• Data Movement
  – minimal
  – only pack/unpack
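
Because the packed width is fixed and there is no vector length register, software has to split an arbitrary-length array into full packed groups plus a scalar tail. The Python sketch below shows that loop structure for byte adds with eight lanes per group; the function name and the wraparound add are assumptions made for this illustration.

```python
LANES = 8  # eight byte lanes in one 64-bit packed word


def add_bytes_fixed_width(a, b):
    """Add two byte arrays using full 8-lane groups, then a scalar tail."""
    n = len(a)
    out = [0] * n
    full = n - n % LANES
    for start in range(0, full, LANES):
        # Models one packed add over eight contiguous elements.
        group_a = a[start:start + LANES]
        group_b = b[start:start + LANES]
        out[start:start + LANES] = [(x + y) & 0xFF for x, y in zip(group_a, group_b)]
    for i in range(full, n):
        # Leftover elements that do not fill a whole packed word.
        out[i] = (a[i] + b[i]) & 0xFF
    return out


a = list(range(19))
b = [250] * 19
summed = add_bytes_fixed_width(a, b)
```

A vector machine with a length register would handle the 3-element tail with the same vector instruction instead of a separate scalar loop.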

Specifying Independence
• Vectors and SIMD are examples of “independence” ISAs
• Conventional ISA
  – One instruction after the other
  – No way of explicitly stating:
    • Inst A and B are independent
• Vectors and SIMD
  – A series of many identical conventional instructions becomes one vector or SIMD inst.
• Limited flexibility for specifying independence
• Still, these were optimized for the common case in a specific class of applications