Introduction to MMX, XMM, SSE and SSE2

  • Slides: 30
Introduction to MMX, XMM, SSE and SSE2 Technology
Multimedia Extension, Streaming SIMD Extension
11/23/98, 5/6/99, 2/5/03, 5/10/04, 5/4/05

SISD - Single Instruction, Single Data
  • Traditional computers
  • In general, one instruction processes one data item
  [Diagram: control unit, execution unit, memory]

SIMD - Single Instruction, Multiple Data
  • One instruction can process multiple data items
  • Useful when large amounts of regularly organized data are processed
  • Example: Matrix and vector calculations
  • This is the basis of MMX and XMM
  [Diagram: one control unit, several execution units, memory]
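The difference between the two styles can be sketched in a few lines of Python. This is only a model, not real SIMD code: the function names `scalar_add` and `simd_add` and the `lanes` parameter are made up for illustration, and each "step" stands in for one instruction.

```python
def scalar_add(a, b):
    """SISD style: each step adds exactly one pair of elements."""
    out, steps = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        steps += 1
    return out, steps


def simd_add(a, b, lanes=8):
    """SIMD style (modeled): each step adds `lanes` pairs at once."""
    out, steps = [], 0
    for i in range(0, len(a), lanes):
        # One "instruction" handles a whole group of regularly laid out data.
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        steps += 1
    return out, steps


a = list(range(16))
b = [1] * 16
scalar_result, scalar_steps = scalar_add(a, b)
simd_result, simd_steps = simd_add(a, b, lanes=8)
```

Both calls produce the same sums; only the number of steps differs (16 versus 2 for this input).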

MISD
  • MISD: Multiple instructions process one data item.
  [Diagram: control unit, execution units, memory]

MIMD
  • MIMD: Multiple instructions process multiple data items.
  [Diagram: two control unit and execution unit pairs, memory]

Your Turn
  • How would you classify a traditional computer under this system?
  • How would you classify a Shemp which has multiple processors?
  • How would you classify a computer having an Intel Dual Core processor?

Potential Applications of MMX and SSE
  • Graphics
  • MPEG video/image processing
  • Music synthesis
  • Speech compression/recognition
  • Video conferencing
  • Matrix and vector calculations
  • Advanced 3D graphics (SSE2)
  • Speech recognition (SSE2)
  • Scientific and engineering applications (SSE2)

MMX
  • 4 new data types
  • New instructions
  • Uses 64 bits of each of the 8 existing 80-bit floating point registers

The floating point registers
  • Floating point is processed by eight 80-bit registers ST(0), ST(1), … ST(7) in the floating point unit.
  • When doing floating point arithmetic, these registers are organized as a stack.
  • Programming floating point is quite different from programming integer arithmetic.
  • Floating point calculations are done using 80 bits even when the program specifies storing 32- or 64-bit data values.

Advantages of using the floating point registers in MMX
  • The registers already exist. Only logic had to be added to the chip.
  • The operating system already knows about the floating point registers.
  • When a computer switches from one program to another, the state (registers) of the current program must be saved so that it can be restored when the program becomes the active program once again.
  • The floating point registers are automatically saved as part of the state of a program.
  • MMX worked under existing operating systems!

New data types for MMX
  • 64 bits long. One data item can store:
      – 8 one-byte integers
      – 4 two-byte integers
      – 2 four-byte integers
      – 1 eight-byte integer
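The four views of one 64-bit item can be illustrated with Python's `struct` module. This is a sketch only: the helper names are hypothetical, and the little-endian layout (element 0 in the lowest byte) is an assumption that matches how x86 stores packed data in memory.

```python
import struct


def pack_bytes(values):
    """Pack 8 unsigned one-byte integers into one 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<8B", *values))[0]


def as_words(q):
    """View the same 64 bits as 4 two-byte integers."""
    return list(struct.unpack("<4H", struct.pack("<Q", q)))


def as_doublewords(q):
    """View the same 64 bits as 2 four-byte integers."""
    return list(struct.unpack("<2I", struct.pack("<Q", q)))


q = pack_bytes([1, 2, 3, 4, 5, 6, 7, 8])  # 0x0807060504030201
words = as_words(q)                       # [0x0201, 0x0403, 0x0605, 0x0807]
doublewords = as_doublewords(q)           # [0x04030201, 0x08070605]
```

The bits never change; only the instruction used decides whether they are treated as 8, 4, 2, or 1 values.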

SSE and SSE2
  • SSE – Streaming SIMD Extensions
  • SSE introduced eight 128-bit XMM registers
  • These registers are disjoint from the floating point/MMX registers
  • SSE (Pentium III) can handle 4 single precision floating point numbers at a time
  • SSE2 (Pentium 4) can also handle 2 double precision floating point numbers at a time

New data types for XMM
  • 128 bits. Can be used as:
      – 16 one-byte integers
      – 8 two-byte integers
      – 4 doubleword integers or single precision floating point numbers
      – 2 quadword integers or double precision floating point numbers

Your turn
  • Your program uses 3 arrays of 160,000 byte integers. We need to add the elements of the first two arrays to calculate the third array.
  • Using a standard Pentium, how many “operations” are needed? (One operation includes loading 2 values into the CPU, adding, storing the result, and the associated loop processing.)
  • How many XMM operations would be needed?

New instructions
  • Process the new data types 16, 8, 4, or 2 data items (64 bits or 128 bits) at a time.
  • Types of instructions:
      – Add / Subtract
      – Multiply / Multiply and add
      – Shift
      – Logical (AND, AND NOT, OR, XOR)
      – Pack and unpack
      – Move
      – Shuffle and unpack (SSE)

Saturation
  • Handling overflow when adding 16, 8, 4, or 2 values at a time is a problem. Programmers can specify that when overflow occurs, the “sum” should be replaced by the maximum legal value.
  • Example: Unsigned byte addition 80h + A0h = 120h ===> overflow. Instead the machine stores FFh.
  • Likewise when subtracting: a result below the minimum legal value is replaced by the minimum.
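A minimal Python model of the example above, for unsigned bytes only. The function names borrow the MMX mnemonics PADDB, PADDUSB and PSUBUSB as an assumption about which instructions the slide has in mind; the lists stand in for packed registers.

```python
def paddb_wrap(a, b):
    """Packed byte add with wraparound: the carry out of each byte is lost."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def paddusb(a, b):
    """Packed unsigned byte add with saturation: clamp each sum at FFh."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]


def psubusb(a, b):
    """Packed unsigned byte subtract with saturation: clamp each result at 0."""
    return [max(x - y, 0) for x, y in zip(a, b)]


a = [0x80, 0x10, 0x01, 0xFF]
b = [0xA0, 0x20, 0x01, 0x01]
wrapped = paddb_wrap(a, b)    # [0x20, 0x30, 0x02, 0x00]
saturated = paddusb(a, b)     # [0xFF, 0x30, 0x02, 0xFF]
difference = psubusb(a, b)    # [0x00, 0x00, 0x00, 0xFE]
```

For pixel data, saturation is usually the behavior wanted: a color that overflows stays at full brightness instead of wrapping around to dark.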

Comparison operations
  • Consider <, >, <=, >=, =, and <> operations.
  • Consider comparing two 64-bit quantities each holding 8, 4, or 2 values.
  • Comparing multiple values at a time is a problem: a single condition flag cannot hold several results. So the MMX instructions store 0 for false and -1 (all bits set) for true for each of the individual data items.
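The 0 / -1 results are useful as bit masks. The sketch below models a signed greater-than compare on 16-bit items and then uses the mask to pick the larger value in each position without any branch. The names `pcmpgtw`, `select` and `to_signed16` are illustrative; `pcmpgtw` borrows the MMX mnemonic for the word compare, and the mask-and-merge idiom is a common technique rather than something stated in the slides.

```python
MASK16 = 0xFFFF


def to_signed16(pattern):
    """Interpret a 16-bit pattern as a signed integer."""
    return pattern - 0x10000 if pattern & 0x8000 else pattern


def pcmpgtw(a, b):
    """Per-item signed compare: FFFFh (-1) where a > b, otherwise 0."""
    return [MASK16 if x > y else 0 for x, y in zip(a, b)]


def select(mask, a, b):
    """Take a where the mask is all ones and b where it is zero."""
    return [
        to_signed16(((x & MASK16) & m) | ((y & MASK16) & (~m & MASK16)))
        for m, x, y in zip(mask, a, b)
    ]


a = [5, -3, 7, 0]
b = [2, 4, 7, -1]
mask = pcmpgtw(a, b)           # [0xFFFF, 0, 0, 0xFFFF]
largest = select(mask, a, b)   # [5, 4, 7, 0]
```

Every item is handled by the same AND/OR sequence, which is what lets the hardware process all of them at once.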

Example 1: Calculating Dot Products
  • Consider calculating $S = \sum_{i=0}^{7} A_i B_i$ using MMX
  • Assume $A_i$ and $B_i$ are stored as signed 16-bit integers.
  • Assume that the products and sums should be calculated using 32 bits.
  • Assume that all values have two “binary” places.

Example 1: Calculating Dot Products
  • Storing A and B (64-bit vectors)
  [Diagram: A and B each laid out as eight 16-bit items, subscripts 0 to 7 at byte offsets 0, 2, 4, 6, 8, 10, 12, 14]
  • We store each $A_i$ and $B_i$ item as a 16-bit integer, 4 per 64-bit data item. Assume each value has 2 binary places.

Example 1: Calculating Dot Products
  • Multiply and add instruction
  [Diagram: the words (2, 20, 4, 30) are multiplied by the words (3, 40, 5, 50) and adjacent products are added: 2 * 3 + 20 * 40 = 806 and 4 * 5 + 30 * 50 = 1520]

Example 1: MMX: Calculating Dot Products
  • Packed multiply and add instruction
  • Packed add
  • (Normal) add
  [Diagram: data flow through the three instructions; labeled values are 806, 1520, and their sum 2326]

Example 1: Calculating Dot Products
  • Approximate algorithm
      – Load left half of A into a FP register.
      – Multiply and add by left half of B.
      – Shift products right 2 bits. (Products should have only two binary places.)
      – Repeat with right halves of A and B using a different register.
      – Add the second sum to the first.
      – Store the result.
  [Margin notes: 4 words at a time; two doublewords at a time]

Example 1: Calculating Dot Products
  • Approximate algorithm (Conclusion)
      – Add the two sums together in EAX to get the final sum.
  [Margin note: 1 doubleword at a time]
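The approximate algorithm can be modeled in Python as follows. This is a sketch under stated assumptions, not Intel's code: the function names borrow the MMX mnemonics PMADDWD, PSRAD and PADDD as a guess at the instructions involved, lists stand in for registers, 32-bit overflow is ignored, and `to_fixed` is a hypothetical helper that scales a value by 4 to give it two binary places.

```python
def to_fixed(x):
    """Encode a value with two binary places (scale by 2**2)."""
    return int(x * 4)


def pmaddwd(a, b):
    """Multiply 4 word pairs and add adjacent products into 2 doublewords."""
    return [a[0] * b[0] + a[1] * b[1], a[2] * b[2] + a[3] * b[3]]


def psrad(v, count):
    """Arithmetic right shift of each doubleword."""
    return [x >> count for x in v]


def paddd(a, b):
    """Add two pairs of doublewords."""
    return [x + y for x, y in zip(a, b)]


def dot8(a, b):
    """Dot product of two 8-item vectors, following the steps listed above."""
    left = psrad(pmaddwd(a[:4], b[:4]), 2)    # products carry 4 binary places
    right = psrad(pmaddwd(a[4:], b[4:]), 2)   # shifting restores 2 places
    sums = paddd(left, right)                 # two doublewords at a time
    return sums[0] + sums[1]                  # final scalar add


diagram = pmaddwd([2, 20, 4, 30], [3, 40, 5, 50])   # [806, 1520]

A = [to_fixed(v) for v in [1.0, 2.0, 0.5, 1.5, 3.0, 0.25, 2.0, 1.0]]
B = [to_fixed(v) for v in [2.0, 1.0, 4.0, 2.0, 1.0, 4.0, 0.5, 3.0]]
S = dot8(A, B)   # 68, which is 17.0 with two binary places
```

The shift is needed because multiplying two values with two binary places each gives a product with four. In general the shift discards low-order bits, so the result can differ slightly from an exact calculation; the sample data here was chosen so that nothing is lost.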

Example 1: Calculating Dot Products
  • Intel claims that standard Pentiums would require 40 instructions to carry this out. Using MMX technology, only 13 instructions are needed. Speed improves by an even greater ratio.

Example 2: 24-bit color video blending
  • Suppose we are displaying 640 by 480 pixel video that uses 24-bit colors: 8 bits for red, 8 for green, and 8 for blue.
  • Suppose we are currently showing one picture which we want to fade out and replace by “fading” in a second picture.
  • Suppose that we want to do the fade out/in in 255 steps.

Example 2: 24-bit color video blending
  • For each step, for each of 3 colors, and for each of the 640 by 480 pixels we must calculate:
      Result_pixel = NewPicture_pixel * (i/255) + OldPicture_pixel * (1 - (i/255))
    where “i” is the step counter.
  • This formula must be calculated 640 * 480 * 3 * 255 = 235,008,000 times on 8-bit data!
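The formula and the count can be checked with a short Python sketch. The function name `blend` and the constants are illustrative, and two choices are assumptions rather than part of the slides: the two terms are combined over the common denominator 255, and the result is truncated by integer division.

```python
def blend(new_pixel, old_pixel, i, steps=255):
    """Blend one 8-bit color value at fade step i (0 = all old, steps = all new)."""
    # new * (i/steps) + old * (1 - i/steps), with a single division at the end
    return (new_pixel * i + old_pixel * (steps - i)) // steps


WIDTH, HEIGHT, COLORS, STEPS = 640, 480, 3, 255
total_calculations = WIDTH * HEIGHT * COLORS * STEPS   # 235,008,000

start = blend(200, 100, 0)       # 100: only the old picture is visible
middle = blend(200, 100, 51)     # 120: one fifth of the way through the fade
finish = blend(200, 100, 255)    # 200: only the new picture is visible
```

Every call is independent of the others and works on small integers, which is why this workload maps well onto packed instructions.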

Example 2: 24-bit color video blending
  • Intel calculates that this requires execution of 1.4 billion instructions on a standard PC, even ignoring the calculation of i/255 and (1 - i/255) and loop control.
  • With MMX, we can calculate 4 values in parallel. The number of MMX instructions would be 525 million. (Because the multiply instruction only applies to word data, the byte data must be unpacked into words and repacked after the calculation.)

Also included in MMX
  • Intel increased cache size when MMX was introduced (necessary for SIMD machines)
  • Programs run faster on MMX machines even if the SIMD instructions are not used
  • Excellent marketing:
      – Programs run faster on MMX machines
      – People want/buy MMX
      – Software publishers are encouraged to rewrite programs to take advantage of the new instructions

Information source
  • http://www.intel.com/drg/mmx/manuals/overview/index.htm#intro (no longer available)
  • http://developer.intel.com/drg/mmx/manuals/ (no longer available)
  • http://www.intel.com/design/Pentium4/manuals/24547012.pdf (IA-32 Intel Architecture Software Developer’s Manual, vol. 1)
  • This slide show is MMX.PPT

Your Turn
  • 1. Characterize the kinds of problems where SIMD is helpful.
  • 2. Give examples of problems where SIMD is useful.