Intel® Processor Architecture: SIMD Instructions Intel® Software College

Objectives
After completing this module you will understand:
• SIMD rationale
• Intel SIMD instructions

Intel® Processor Architecture: SIMD Technology® Overview Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda
• SIMD Rationale
• Intel SIMD History
• SIMD Data Types
• Intel SIMD Instruction Sets
• Programming with Intel SIMD Instructions
• Backup: Instructions Reference Table

SIMD (Single Instruction Multiple Data) Technology
• Increases processor throughput by performing multiple computations in a single instruction
• MMX™ technology, SSE2, and SSE3 are architectural extensions

Example (SSE2)
• Performs two double-precision operations in one cycle
• a1 + b1 = c1 in parallel with a0 + b0 = c0
• Useful for matrix operations

128-bit registers:
  [ a1 | a0 ] + [ b1 | b0 ] = [ c1 | c0 ]
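The SSE2 example on this slide can be sketched in C with compiler intrinsics. This is a minimal illustration, assuming an x86 compiler that provides `<emmintrin.h>`; the function name `add_two_doubles` is hypothetical and not part of the original slides.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Adds two pairs of doubles with a single packed addition:
   out[0] = a[0] + b[0] and out[1] = a[1] + b[1]. */
void add_two_doubles(const double a[2], const double b[2], double out[2])
{
    __m128d va = _mm_loadu_pd(a);     /* a0 in the low half, a1 in the high half */
    __m128d vb = _mm_loadu_pd(b);     /* b0, b1 */
    __m128d vc = _mm_add_pd(va, vb);  /* one ADDPD computes both sums */
    _mm_storeu_pd(out, vc);           /* unaligned store of c0, c1 */
}
```

The unaligned load and store forms are used so the sketch works for any array address; aligned data could use `_mm_load_pd` and `_mm_store_pd` instead.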

Streaming SIMD Extensions: A Brief History

MMX™ technology: Intel® Pentium® with MMX™ and Pentium® II processors
• Introduced 64-bit MMX registers for SIMD integer operations
• Supports SIMD operations on packed byte, word, and doubleword integers
• Useful for multimedia and communications software

SSE: Intel® Pentium® III processor
• Introduced 128-bit XMM registers for packed single-precision floating-point (FP-SP) operands, and added further SIMD integer instructions on the MMX registers
• Executes FP and SIMD simultaneously
• Introduced data prefetch instructions
• Useful for 3D geometry, 3D rendering, and video encoding/decoding

SSE2: Intel® Pentium® 4 and Intel® Xeon™ processors
• Added extra 64-bit SIMD integer support
• Uses the same XMM registers for SIMD integer and double-precision floating-point (FP-DP) data
• Has 144 new instructions for data support (no new registers)
• Adds support for cacheability and memory ordering operations
• Useful for 3D graphics, video encoding/decoding, and encryption

SSE3: Intel® Pentium® 4 processor
• Accelerates performance of Streaming SIMD Extensions technology, Streaming SIMD Extensions 2 technology, and x87 FP math capabilities
• Useful in some 3D operations (quaternions), complex arithmetic, and video codec algorithms

SSSE3: Intel® Core™ 2 processor
• Application performance improvement
• Potential for specific application domains

x86 Register Sets
SSE registers were first introduced in the Pentium® 3.

IA-INT registers (32 bits each: eax … edi)
• Fourteen 32-bit registers
• Scalar data and addresses
• Direct access to registers

MMX™ Technology / IA-FP registers (80/64 bits each: st0/mm0 … st7/mm7)
• Eight 80/64-bit registers
• Hold data only
• Stack access to FP0..FP7
• Direct access to MM0..MM7
• No MMX™ Technology / FP interoperability

SSE registers (128 bits each: xmm0 … xmm7)
• Eight 128-bit registers
• Hold data only:
  - 4 x single FP numbers
  - 2 x double FP numbers
  - 128-bit packed integers
• Direct access to the registers
• Can be used simultaneously with FP / MMX Technology

SIMD Data Types (1)
• 64-bit packed integer data types (bits 63..0)
  - 8 packed bytes
  - 4 packed words
  - 2 doublewords

SIMD Data Types (2)
• 128-bit floating-point and integer data types (bits 127..0)
  - 16 packed bytes: integer
  - 8 packed words: integer
  - 4 packed doublewords: single-precision floating point or integer
  - 2 quadwords: double-precision floating point or integer

SIMD Data Types (3)
• Packed BCD data type
  - Packed BCD integers: two BCD digits per byte (bits 7..0)
  - 80-bit packed BCD decimal integers: bits 79..72 hold the sign and unused bits (X); bits 71..0 hold the digits D17 D16 … D0

SSE Instruction Set Extensions
• Introduced by the Pentium® 3 in 1999; now frequently called SSE-1
• Only new data type supported: 4 x 32-bit (single-precision) floating-point data
• Some 70 instructions
  - Arithmetic, compare, and convert operations on SSE SP FP data (packed, unpacked)
  - Data load/store
  - Prefetch
  - Extension of MMX
  - Streaming store (store without using cache in between)
  - …

2001 PTE Engineering Enabling Conference

SSE Sample: Branch Removal
R = (A < B) ? C : D    // remember: everything is packed

• cmplt: comparing A with B produces a mask per element: all ones (11111) where A < B, all zeros (00000) otherwise
• and: mask AND C keeps the C elements where the mask is set (00000 c2 00000 c0)
• nand: (NOT mask) AND D keeps the D elements where the mask is clear (d3 00000 d1 00000)
• or: combining both partial results gives R = d3 c2 d1 c0

[Figure: data-flow diagram of the four steps; the sample values of A and B are only partly recoverable.]
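The compare, and, nand, or sequence above maps directly onto SSE intrinsics. The following is a small sketch under the assumption of an x86 compiler with `<xmmintrin.h>`; the function name `select_lt` is made up for illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Computes r[i] = (a[i] < b[i]) ? c[i] : d[i] for four floats
   without any conditional branch. */
void select_lt(const float a[4], const float b[4],
               const float c[4], const float d[4], float r[4])
{
    /* CMPLTPS: each lane becomes all ones where a < b, else all zeros. */
    __m128 mask   = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    /* ANDPS: keep c where the mask is set. */
    __m128 from_c = _mm_and_ps(mask, _mm_loadu_ps(c));
    /* ANDNPS: (~mask) & d keeps d where the mask is clear. */
    __m128 from_d = _mm_andnot_ps(mask, _mm_loadu_ps(d));
    /* ORPS: merge the two partial results. */
    _mm_storeu_ps(r, _mm_or_ps(from_c, from_d));
}
```

Because every lane is computed unconditionally, the cost does not depend on how predictable the comparison outcome is.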

SSE-2 Instruction Set Extensions
• Introduced by the Intel® Pentium® 4 processor in 2000
• Some 140 new instructions
• Added double-precision floating-point data (2 x 64 bit) and all related instructions, including conversion
• Again some extensions to MMX
• Added all possible combinations of integer data to SSE (1 x 128, 2 x 64, 4 x 32, 8 x 16, 16 x 8) and related operations

SIMD Single vs. SIMD Double

4 x single precision: SSE-1
• SIMD SP FP operand = 4 elements (bits 127..0): X3 X2 X1 X0
• Element = SP FP number: sign S (bit 31), exponent (bits 30..23), significand (bits 22..0)

2 x double precision: SSE-2
• SIMD DP FP operand = 2 elements (bits 127..0): X1 X0
• Element = DP FP number: sign S (bit 63), exponent (bits 62..52), significand (bits 51..0)
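The single-precision element layout can be checked with a few lines of portable C. This is only a sketch: it assumes that `float` is the IEEE 754 32-bit format, and the type `sp_fields` and function `split_single` are hypothetical names introduced here.

```c
#include <stdint.h>
#include <string.h>

/* The three bit fields of one single-precision element. */
typedef struct {
    uint32_t sign;         /* bit 31 */
    uint32_t exponent;     /* bits 30..23, biased by 127 */
    uint32_t significand;  /* bits 22..0, without the implicit leading 1 */
} sp_fields;

/* Splits a float into sign, exponent and significand fields. */
sp_fields split_single(float x)
{
    uint32_t bits;
    sp_fields f;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the 32 bits safely */
    f.sign        = bits >> 31;
    f.exponent    = (bits >> 23) & 0xFFu;
    f.significand = bits & 0x7FFFFFu;
    return f;
}
```

For example, 1.0f has a zero significand field and a biased exponent of 127.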

Sample for SSE-2: SIMD Double / SIMD Int Conversion

• SIMD double to SIMD int: conversion to the two lower ints; the two higher ints are cleared
    [ x1 | x0 ]  ->  [ 00000 | 00000 | (int)x1 | (int)x0 ]

    __m128d x; __m128i ix;
    ix = _mm_cvtpd_epi32(x);

• SIMD int to SIMD double: conversion from the two lower ints
    [ ? | ? | ix1 | ix0 ]  ->  [ (double)ix1 | (double)ix0 ]

    x = _mm_cvtepi32_pd(ix);
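The two conversions can be wrapped into small helper functions to see which lanes are written and read. This is a sketch assuming an x86 compiler with `<emmintrin.h>`; the names `doubles_to_ints` and `ints_to_doubles` are illustrative only.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Converts two doubles to 32-bit ints (CVTPD2DQ).
   out[0], out[1] receive the results; out[2], out[3] are cleared. */
void doubles_to_ints(const double x[2], int32_t out[4])
{
    __m128i ix = _mm_cvtpd_epi32(_mm_loadu_pd(x));
    _mm_storeu_si128((__m128i *)out, ix);
}

/* Converts the two lower 32-bit ints to doubles (CVTDQ2PD).
   in[2] and in[3] are ignored. */
void ints_to_doubles(const int32_t in[4], double out[2])
{
    __m128i ix = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_pd(out, _mm_cvtepi32_pd(ix));
}
```

Values that are not whole numbers are rounded according to the current MXCSR rounding mode, so the inputs used to try this out are best chosen as exact integers.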

SSE3: No New Data Types but New Instructions
• FP to integer conversions: FISTTP
• Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
• Video encoding: LDDQU
• SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
• Thread synchronization: MONITOR, MWAIT

* Also benefits complex arithmetic and vectorization

Streaming SIMD Extensions 3
• 13 new instructions
• Three have limited use for application performance improvement
  - FISTTP: x87 to integer conversion (requires the –longdouble switch)
  - MONITOR/MWAIT: thread synchronization; available today in Ring 0 only, being used by newer Windows* and Linux* thread packages
• The other ten have some potential for specific application domains

SSE-3 Sample for Complex Arithmetic: ADDSUBPS OperandA, OperandB
• OperandA (xmm register; 4 data elements): a3, a2, a1, a0
• OperandB (xmm register or memory address; 4 data elements): b3, b2, b1, b0
• Result (stored in OperandA): a3+b3, a2-b2, a1+b1, a0-b0

Intrinsic: __m128 _mm_addsub_ps(__m128 a, __m128 b)

  a3    a2    a1    a0
  b3    b2    b1    b0
  Add   Sub   Add   Sub
  a3+b3 a2-b2 a1+b1 a0-b0
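To make the lane pattern concrete without requiring SSE3 support at build time, the same result can be produced with plain SSE by flipping the sign of the even lanes of b before adding. This is an emulation written for illustration, not the ADDSUBPS instruction itself; the function name `addsub_ps_sse` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Emulates the ADDSUBPS lane pattern with SSE only:
   r[0] = a[0] - b[0], r[1] = a[1] + b[1],
   r[2] = a[2] - b[2], r[3] = a[3] + b[3]. */
void addsub_ps_sse(const float a[4], const float b[4], float r[4])
{
    /* -0.0f has only the sign bit set, so XOR with it negates a lane.
       _mm_set_ps lists lanes from 3 down to 0: lanes 2 and 0 get flipped. */
    const __m128 flip_even = _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f);
    __m128 vb = _mm_xor_ps(_mm_loadu_ps(b), flip_even);  /* b3, -b2, b1, -b0 */
    _mm_storeu_ps(r, _mm_add_ps(_mm_loadu_ps(a), vb));
}
```

On a build that enables SSE3, the XOR and add pair would be replaced by a single call to `_mm_addsub_ps` from `<pmmintrin.h>`.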

Supplemental SSE-3 (SSSE-3)
Extension introduced by the Intel® Core™ Architecture
• Horizontal addition/subtraction: PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
• Packed absolute values: PABSB, PABSW, PABSD
• Multiply and add packed signed/unsigned bytes: PMADDUBSW
• Packed multiply high with round and scale: PMULHRSW
• Packed shuffle bytes: PSHUFB
• Packed sign: PSIGNB/W/D
• Packed align right: PALIGNR

SSSE-3 New Instructions
• Useful for media and imaging
• 16 new packed integer instructions
• All instructions come in both 128-bit and 64-bit (MMX) flavors

Six new instruction categories and their latencies (where two values appear, they are the 64-bit and 128-bit SIMD latencies, in that order):
• Absolute value and integer "sign": 1 clock
• Byte multiply and add: 3 clocks
• Word multiply high with round and shift: 3 clocks
• Byte permute: 1 clock, 3 clocks
• Byte concatenate with shift: 1 clock, 2 clocks
• Integer horizontal adds and subtracts: 4 clocks, 5 clocks

Sample SSSE-3 Instruction: Byte Permute
PSHUFB mm, mm/m64
PSHUFB xmm, xmm/m128
• A complete byte-granularity permutation
• The source operand is used as the control field (variable control)
• The destination operand gets permuted
• Each byte of the source field selects the origin of the corresponding destination byte
• Also includes a force-byte-to-zero flag (bit 7)

[Figure: example src control bytes with the dest bytes before and after the permute; the byte values are not recoverable.]
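The rules on this slide can be written down as a scalar reference model, which may help when reasoning about control bytes. This is plain C that mimics the 128-bit form as described above; it is not the instruction or its intrinsic, and the name `pshufb_model` is invented for this sketch.

```c
#include <stdint.h>

/* Scalar model of the 128-bit byte permute:
   for each byte i, if bit 7 of ctrl[i] is set the result byte is zero,
   otherwise it is the original dest byte selected by the low 4 bits. */
void pshufb_model(uint8_t dest[16], const uint8_t ctrl[16])
{
    uint8_t old[16];
    int i;

    for (i = 0; i < 16; ++i)
        old[i] = dest[i];             /* keep the unpermuted bytes */

    for (i = 0; i < 16; ++i)
        dest[i] = (ctrl[i] & 0x80u) ? 0u : old[ctrl[i] & 0x0Fu];
}
```

A control byte of 0x80 or 0xFF therefore clears the corresponding destination byte, while 0x00 to 0x0F copy one of the sixteen original bytes.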

Ways to Do SSE/SIMD Programming
• Coding with SSE/SSE2/3/4 assembler instructions
  - Very tedious (manual scheduling); discouraged: don't do it!
  - E.g.: how do you exploit the benefit of now having 16 instead of 8 SSE registers on Intel® 64 without maintaining two versions?
• Intel® compiler's C/C++ SIMD intrinsics
  - No need to take care of register allocation, scheduling, etc.
• Intel® compiler's C++ Vector Class Library
  - Use this if you are heavily into C++ classes
• Vectorizer of the Intel® C++ and Fortran Compilers
  - Recommended for most cases: easy and efficient
• Use ready-to-go vectorized code from a library such as the Intel® Math Kernel Library (MKL)
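As an illustration of the vectorizer route, a loop like the following is the kind of code a vectorizing compiler can typically turn into packed SSE instructions on its own. This is a sketch assuming a C99 compiler; whether the loop is actually vectorized depends on the compiler and its switches, and the function name `scale_add` is hypothetical.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for n elements.
   'restrict' promises the compiler that x and y do not overlap,
   which removes one common obstacle to automatic vectorization. */
void scale_add(size_t n, float a, const float *restrict x, float *restrict y)
{
    size_t i;
    for (i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The source stays free of intrinsics, so the same code can be rebuilt for 8 or 16 XMM registers, or for a newer instruction set, without maintaining two versions.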

Compiler-Based Vectorization: Processor-Specific Switches
Generate code and optimize for the following processors (Linux* switches):
• Pentium® 3 compatible and Athlon XP processors, including code generation for MMX and SSE: -axK
• Pentium® 4 compatible, Athlon 64, and Opteron processors in 32- and 64-bit mode, including code generation for MMX, SSE, and SSE2: -xW, -axW
• Pentium® 4 processors in 32-bit mode, including code generation for MMX, SSE, and SSE2 (deprecated switch: use xW instead): -xN, -axN
• Pentium® M processors, including code generation for MMX, SSE, and SSE-2: -xB, -axB
• Intel® processors with SSE3 capability, including Pentium 4 (both 32- and 64-bit mode), including code generation for MMX, SSE2, and SSE-3: -xP, -axP
• Intel® processors with MNI capability, i.e. Intel® Core™ 2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE2, SSE3, and MNI: -xT, -axT


Instruction Set Extensions: Future SSE-4 (45 nm)
• Beginning in 2008: ~50 new instructions in 13 groups
• All function in 32-bit and 64-bit modes
• Improvements in commercial data integrity (iSCSI), video processing, string and text processing, 2D and 3D imaging, and vectorizing compiler performance

Transfer Instructions (1)
Instruction: Description

MMX Transfer Instructions
• MOVD: Move doubleword
• MOVQ: Move quadword

SSE SIMD Single-Precision Data Transfer Instructions
• MOVAPS: Move four aligned packed single-precision floating-point values between XMM registers or between an XMM register and memory
• MOVUPS: Move four unaligned packed single-precision floating-point values between XMM registers or between an XMM register and memory
• MOVHPS: Move two packed single-precision floating-point values to and from the high quadword of an XMM register and memory
• MOVHLPS: Move two packed single-precision floating-point values from the high quadword of an XMM register to the low quadword of another XMM register
• MOVLPS: Move two packed single-precision floating-point values to and from the low quadword of an XMM register and memory
• MOVLHPS: Move two packed single-precision floating-point values from the low quadword of an XMM register to the high quadword of another XMM register
• MOVMSKPS: Extract sign mask from four packed single-precision floating-point values
• MOVSS: Move scalar single-precision floating-point value between XMM registers or between an XMM register and memory
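Two of the entries above, the unaligned load (MOVUPS) and the sign-mask extraction (MOVMSKPS), combine into a short example. This sketch assumes an x86 compiler with `<xmmintrin.h>`; the function name `sign_mask` is made up here.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns a 4-bit mask: bit i is set when v[i] has its sign bit set. */
int sign_mask(const float v[4])
{
    __m128 x = _mm_loadu_ps(v);   /* MOVUPS: no alignment requirement */
    return _mm_movemask_ps(x);    /* MOVMSKPS: gather the four sign bits */
}
```

If the data is known to be 16-byte aligned, `_mm_load_ps` (MOVAPS) can be used for the load instead.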

Transfer Instructions (2)
Instruction: Description

SSE 64-Bit SIMD Integer Instructions
• PMOVMSKB: Move byte mask

SSE2 128-Bit SIMD Integer Instructions
• MOVDQA: Move aligned double quadword
• MOVDQU: Move unaligned double quadword
• MOVQ2DQ: Move quadword integer from MMX to XMM registers
• MOVDQ2Q: Move quadword integer from XMM to MMX registers

SSE2 Double-Precision Floating-Point Data Movement Instructions
• MOVAPD: Move two aligned packed double-precision floating-point values between XMM registers or between an XMM register and memory
• MOVUPD: Move two unaligned packed double-precision floating-point values between XMM registers or between an XMM register and memory
• MOVHPD: Move high packed double-precision floating-point value to and from the high quadword of an XMM register and memory
• MOVLPD: Move low packed double-precision floating-point value to and from the low quadword of an XMM register and memory
• MOVMSKPD: Extract sign mask from two packed double-precision floating-point values
• MOVSD: Move scalar double-precision floating-point value between XMM registers or between an XMM register and memory

Transfer Instructions (3)
Instruction: Description

SSE3 SIMD Floating-Point LOAD/MOVE/DUPLICATE Instructions
• MOVSLDUP: Loads/moves 128 bits, duplicating the first and third 32-bit data elements
• MOVSHDUP: Loads/moves 128 bits, duplicating the second and fourth 32-bit data elements
• MOVDDUP: Loads/moves 64 bits (bits[63:0] if the source is a register) and returns the same 64 bits in both the lower and upper halves of the 128-bit result register; duplicates the 64 bits from the source

SSE3 Specialized 128-Bit Unaligned Data Load Instruction
• LDDQU: Special 128-bit unaligned load designed to avoid cache line splits

64-Bit Mode Instructions
• LODSQ: Load qword at address (R)SI into RAX
• MOVSQ: Move qword from address (R)SI to (R)DI
• MOVZX (64-bits): Move doubleword to quadword, zero-extension
• STOSQ: Store RAX at address RDI

Packed Arithmetic Instructions (1)
Instruction: Description

MMX Packed Arithmetic Instructions
• PADDB: Add packed byte integers
• PADDW: Add packed word integers
• PADDD: Add packed doubleword integers
• PADDSB: Add packed signed byte integers with signed saturation
• PADDSW: Add packed signed word integers with signed saturation
• PADDUSB: Add packed unsigned byte integers with unsigned saturation
• PADDUSW: Add packed unsigned word integers with unsigned saturation
• PSUBB: Subtract packed byte integers
• PSUBW: Subtract packed word integers
• PSUBD: Subtract packed doubleword integers
• PSUBSB: Subtract packed signed byte integers with signed saturation
• PSUBSW: Subtract packed signed word integers with signed saturation
• PSUBUSB: Subtract packed unsigned byte integers with unsigned saturation
• PSUBUSW: Subtract packed unsigned word integers with unsigned saturation
• PMULHW: Multiply packed signed word integers and store high result
• PMULLW: Multiply packed signed word integers and store low result
• PMADDWD: Multiply and add packed word integers
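The difference between wraparound and saturating addition is easiest to see side by side. The sketch below uses the 128-bit SSE2 forms of PADDB and PADDUSB rather than the 64-bit MMX forms listed above, and assumes an x86 compiler with `<emmintrin.h>`; both function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* PADDB on 16 bytes: results wrap around modulo 256. */
void add_bytes_wrap(const uint8_t a[16], const uint8_t b[16], uint8_t r[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)r, _mm_add_epi8(va, vb));
}

/* PADDUSB on 16 bytes: results are clamped to 255 (unsigned saturation). */
void add_bytes_saturate(const uint8_t a[16], const uint8_t b[16], uint8_t r[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)r, _mm_adds_epu8(va, vb));
}
```

Saturation is usually what image and audio code wants, since a pixel value of 200 plus 100 should become 255 rather than wrap to 44.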

Packed Arithmetic Instructions (2)
Instruction: Description

SSE Packed Arithmetic Instructions
• ADDPS: Add packed single-precision floating-point values
• ADDSS: Add scalar single-precision floating-point values
• SUBPS: Subtract packed single-precision floating-point values
• SUBSS: Subtract scalar single-precision floating-point values
• MULPS: Multiply packed single-precision floating-point values
• MULSS: Multiply scalar single-precision floating-point values
• DIVPS: Divide packed single-precision floating-point values
• DIVSS: Divide scalar single-precision floating-point values
• RCPPS: Compute reciprocals of packed single-precision floating-point values
• RCPSS: Compute reciprocal of scalar single-precision floating-point values
• SQRTPS: Compute square roots of packed single-precision floating-point values
• SQRTSS: Compute square root of scalar single-precision floating-point values
• RSQRTPS: Compute reciprocals of square roots of packed single-precision floating-point values
• RSQRTSS: Compute reciprocal of square root of scalar single-precision floating-point values
• MAXPS: Return maximum packed single-precision floating-point values
• MAXSS: Return maximum scalar single-precision floating-point values
• MINPS: Return minimum packed single-precision floating-point values
• MINSS: Return minimum scalar single-precision floating-point values

Packed Arithmetic Instructions (3)
Instruction: Description

SSE 64-Bit SIMD Integer Instructions
• PMULHUW: Multiply packed unsigned integers and store high result

SSE2 Packed Arithmetic Instructions
• ADDPD: Add packed double-precision floating-point values
• ADDSD: Add scalar double-precision floating-point values
• SUBPD: Subtract packed double-precision floating-point values
• SUBSD: Subtract scalar double-precision floating-point values
• MULPD: Multiply packed double-precision floating-point values
• MULSD: Multiply scalar double-precision floating-point values
• DIVPD: Divide packed double-precision floating-point values
• DIVSD: Divide scalar double-precision floating-point values
• SQRTPD: Compute square roots of packed double-precision floating-point values
• SQRTSD: Compute square root of scalar double-precision floating-point value
• MAXPD: Return maximum packed double-precision floating-point values
• MAXSD: Return maximum scalar double-precision floating-point values
• MINPD: Return minimum packed double-precision floating-point values
• MINSD: Return minimum scalar double-precision floating-point values

Packed Arithmetic Instructions (4)
Instruction: Description

SSE2 128-Bit SIMD Integer Instructions
• PMULUDQ: Multiply packed unsigned doubleword integers
• PADDQ: Add packed quadword integers
• PSUBQ: Subtract packed quadword integers

SSE3 SIMD Floating-Point Packed ADD/SUB Instructions
• ADDSUBPS: Performs single-precision addition on the second and fourth pairs of 32-bit data elements within the operands, and single-precision subtraction on the first and third pairs
• ADDSUBPD: Performs double-precision addition on the second pair of quadwords, and double-precision subtraction on the first pair

Packed Arithmetic Instructions (5)
• SSE3 SIMD Floating-Point Horizontal ADD/SUB Instructions
  - HADDPS: Performs a single-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the third and fourth elements of the first operand; the third by adding the first and second elements of the second operand; and the fourth by adding the third and fourth elements of the second operand.
  - HSUBPS: Performs a single-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the fourth element of the first operand from the third element of the first operand; the third by subtracting the second element of the second operand from the first element of the second operand; and the fourth by subtracting the fourth element of the second operand from the third element of the second operand.
  - HADDPD: Performs a double-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the first and second elements of the second operand.
  - HSUBPD: Performs a double-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the second element of the second operand from the first element of the second operand.
• SSSE3 Packed Arithmetic Instructions
  - PHADDW/PHADDD/PHADDSW: Pairwise integer horizontal addition, then pack
  - PHSUBW/PHSUBD/PHSUBSW: Pairwise integer horizontal subtraction, then pack
  - PMADDUBSW: Multiply unsigned bytes by signed bytes and add adjacent products into signed words with saturation (multiply-accumulate)
  - PMULHRSW: Signed 16-bit multiply with rounding and scaling, returning 16-bit results
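
The element order of HADDPS and HSUBPS is easier to see in code than in prose. The following is a portable scalar reference model, not the instruction itself; the function names `haddps_model` and `hsubps_model` are hypothetical. Element 0 is assumed to be the lowest 32 bits of the register. In real code the SSE3 intrinsics `_mm_hadd_ps` and `_mm_hsub_ps` from `<pmmintrin.h>` perform the same operations.

```c
/* Scalar model of HADDPS: dst = { a0+a1, a2+a3, b0+b1, b2+b3 }.
   Results go through temporaries so dst may alias a, mirroring the
   instruction, where the destination is also the first operand. */
void haddps_model(float dst[4], const float a[4], const float b[4])
{
    float r0 = a[0] + a[1];
    float r1 = a[2] + a[3];
    float r2 = b[0] + b[1];
    float r3 = b[2] + b[3];
    dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
}

/* Scalar model of HSUBPS: dst = { a0-a1, a2-a3, b0-b1, b2-b3 }. */
void hsubps_model(float dst[4], const float a[4], const float b[4])
{
    float r0 = a[0] - a[1];
    float r1 = a[2] - a[3];
    float r2 = b[0] - b[1];
    float r3 = b[2] - b[3];
    dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
}
```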

Conversion Instructions (1)
• MMX Conversion Instructions
  - PACKSSWB: Pack words into bytes with signed saturation
  - PACKSSDW: Pack doublewords into words with signed saturation
  - PACKUSWB: Pack words into bytes with unsigned saturation
  - PUNPCKHBW: Unpack high-order bytes
  - PUNPCKHWD: Unpack high-order words
  - PUNPCKHDQ: Unpack high-order doublewords
  - PUNPCKLBW: Unpack low-order bytes
  - PUNPCKLWD: Unpack low-order words
  - PUNPCKLDQ: Unpack low-order doublewords
• SSE Conversion Instructions
  - CVTPI2PS: Convert packed doubleword integers to packed single-precision floating-point values
  - CVTSI2SS: Convert a doubleword integer to a scalar single-precision floating-point value
  - CVTPS2PI: Convert packed single-precision floating-point values to packed doubleword integers
  - CVTTPS2PI: Convert with truncation packed single-precision floating-point values to packed doubleword integers
  - CVTSS2SI: Convert a scalar single-precision floating-point value to a doubleword integer
  - CVTTSS2SI: Convert with truncation a scalar single-precision floating-point value to a doubleword integer
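
The difference between signed and unsigned saturation in the pack instructions can be sketched as follows. This example assumes the 128-bit SSE2 forms of PACKSSWB and PACKUSWB (intrinsics `_mm_packs_epi16` and `_mm_packus_epi16`) rather than the 64-bit MMX forms listed above; the saturation behavior is the same. The function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* PACKSSWB: 16 signed words -> 16 signed bytes, clamped to [-128, 127].
   The first operand supplies the low 8 result bytes. */
void pack_words_signed(int8_t out[16], const int16_t lo[8], const int16_t hi[8])
{
    __m128i vlo = _mm_loadu_si128((const __m128i *)lo);
    __m128i vhi = _mm_loadu_si128((const __m128i *)hi);
    _mm_storeu_si128((__m128i *)out, _mm_packs_epi16(vlo, vhi));
}

/* PACKUSWB: 16 signed words -> 16 unsigned bytes, clamped to [0, 255]. */
void pack_words_unsigned(uint8_t out[16], const int16_t lo[8], const int16_t hi[8])
{
    __m128i vlo = _mm_loadu_si128((const __m128i *)lo);
    __m128i vhi = _mm_loadu_si128((const __m128i *)hi);
    _mm_storeu_si128((__m128i *)out, _mm_packus_epi16(vlo, vhi));
}
```

For example, the word 300 packs to 127 with signed saturation and to 255 with unsigned saturation, while -300 packs to -128 and 0 respectively.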

Conversion Instructions (2)
• SSE2 Conversion Instructions
  - CVTPD2PI: Convert packed double-precision floating-point values to packed doubleword integers
  - CVTTPD2PI: Convert with truncation packed double-precision floating-point values to packed doubleword integers
  - CVTPI2PD: Convert packed doubleword integers to packed double-precision floating-point values
  - CVTPD2DQ: Convert packed double-precision floating-point values to packed doubleword integers
  - CVTTPD2DQ: Convert with truncation packed double-precision floating-point values to packed doubleword integers
  - CVTDQ2PD: Convert packed doubleword integers to packed double-precision floating-point values
  - CVTPS2PD: Convert packed single-precision floating-point values to packed double-precision floating-point values
  - CVTPD2PS: Convert packed double-precision floating-point values to packed single-precision floating-point values
  - CVTSS2SD: Convert a scalar single-precision floating-point value to a scalar double-precision floating-point value
  - CVTSD2SS: Convert a scalar double-precision floating-point value to a scalar single-precision floating-point value
  - CVTSD2SI: Convert a scalar double-precision floating-point value to a doubleword integer
  - CVTTSD2SI: Convert with truncation a scalar double-precision floating-point value to a doubleword integer
  - CVTSI2SD: Convert a doubleword integer to a scalar double-precision floating-point value

Conversion Instructions (3)
• SSE2 Conversion Instructions
  - CVTDQ2PS: Convert packed doubleword integers to packed single-precision floating-point values
  - CVTPS2DQ: Convert packed single-precision floating-point values to packed doubleword integers
  - CVTTPS2DQ: Convert with truncation packed single-precision floating-point values to packed doubleword integers
• SSE3 x87-FP Integer Conversion Instruction
  - FISTTP: Behaves like the FISTP instruction but uses truncation, irrespective of the rounding mode specified in the floating-point control word (FCW)
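
The "convert" and "convert with truncation" variants differ only in rounding. A minimal sketch with the SSE2 intrinsics `_mm_cvtps_epi32` (CVTPS2DQ) and `_mm_cvttps_epi32` (CVTTPS2DQ) is shown below. It assumes MXCSR still holds its default rounding mode (round to nearest, ties to even); the function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* CVTPS2DQ: rounds according to the current MXCSR rounding mode. */
void convert_round(int out[4], const float in[4])
{
    _mm_storeu_si128((__m128i *)out, _mm_cvtps_epi32(_mm_loadu_ps(in)));
}

/* CVTTPS2DQ: always truncates toward zero, whatever MXCSR says. */
void convert_trunc(int out[4], const float in[4])
{
    _mm_storeu_si128((__m128i *)out, _mm_cvttps_epi32(_mm_loadu_ps(in)));
}
```

Under the default rounding mode, the input `{2.5, 3.5, -2.7, 2.7}` converts to `{2, 4, -3, 3}` with rounding and to `{2, 3, -2, 2}` with truncation. The truncating form matches the semantics of a C cast from `float` to `int`.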

Comparison Instructions
• MMX Comparison Instructions
  - PCMPEQB: Compare packed bytes for equality
  - PCMPEQW: Compare packed words for equality
  - PCMPEQD: Compare packed doublewords for equality
  - PCMPGTB: Compare packed signed byte integers for greater than
  - PCMPGTW: Compare packed signed word integers for greater than
  - PCMPGTD: Compare packed signed doubleword integers for greater than
• SSE Comparison Instructions
  - CMPPS: Compare packed single-precision floating-point values
  - CMPSS: Compare scalar single-precision floating-point values
  - COMISS: Perform ordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register
  - UCOMISS: Perform unordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register
• SSE2 Compare Instructions
  - CMPPD: Compare packed double-precision floating-point values
  - CMPSD: Compare scalar double-precision floating-point values
  - COMISD: Perform ordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register
  - UCOMISD: Perform unordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register
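
The packed compares do not set flags; each result element becomes all ones (true) or all zeros (false). That mask can then drive a branch-free select such as R = (A < B) ? C : D. The sketch below uses the SSE intrinsic `_mm_cmplt_ps` (CMPPS with the less-than predicate); the function name `select_lt` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Per element: r = (a < b) ? c : d, computed without branches. */
void select_lt(float r[4], const float a[4], const float b[4],
               const float c[4], const float d[4])
{
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); /* CMPPS */
    __m128 from_c = _mm_and_ps(mask, _mm_loadu_ps(c));     /* keep c where mask is set */
    __m128 from_d = _mm_andnot_ps(mask, _mm_loadu_ps(d));  /* keep d where mask is clear */
    _mm_storeu_ps(r, _mm_or_ps(from_c, from_d));
}
```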

Logical Instructions
• MMX Logical Instructions
  - PAND: Bitwise logical AND
  - PANDN: Bitwise logical AND NOT
  - POR: Bitwise logical OR
  - PXOR: Bitwise logical exclusive OR
• SSE Logical Instructions
  - ANDPS: Perform bitwise logical AND of packed single-precision floating-point values
  - ANDNPS: Perform bitwise logical AND NOT of packed single-precision floating-point values
  - ORPS: Perform bitwise logical OR of packed single-precision floating-point values
  - XORPS: Perform bitwise logical XOR of packed single-precision floating-point values
• SSE2 Logical Instructions
  - ANDPD: Perform bitwise logical AND of packed double-precision floating-point values
  - ANDNPD: Perform bitwise logical AND NOT of packed double-precision floating-point values
  - ORPD: Perform bitwise logical OR of packed double-precision floating-point values
  - XORPD: Perform bitwise logical XOR of packed double-precision floating-point values
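
A common use of the floating-point logical instructions is direct manipulation of the IEEE sign bit. The sketch below, with hypothetical function names, clears the sign bit with ANDNPS to get an absolute value and flips it with XORPS to negate. Note that ANDNPS inverts its first operand, so `_mm_andnot_ps(m, x)` computes `(~m) & x`.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Absolute value: clear bit 31 of every element. */
void abs_ps(float out[4], const float in[4])
{
    const __m128 sign = _mm_set1_ps(-0.0f);  /* only the sign bit is set */
    _mm_storeu_ps(out, _mm_andnot_ps(sign, _mm_loadu_ps(in)));  /* ANDNPS */
}

/* Negation: toggle bit 31 of every element. */
void negate_ps(float out[4], const float in[4])
{
    const __m128 sign = _mm_set1_ps(-0.0f);
    _mm_storeu_ps(out, _mm_xor_ps(sign, _mm_loadu_ps(in)));     /* XORPS */
}
```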

Shift and Rotate Instructions
• MMX Shift and Rotate Instructions
  - PSLLW: Shift packed words left logical
  - PSLLD: Shift packed doublewords left logical
  - PSLLQ: Shift packed quadword left logical
  - PSRLW: Shift packed words right logical
  - PSRLD: Shift packed doublewords right logical
  - PSRLQ: Shift packed quadword right logical
  - PSRAW: Shift packed words right arithmetic
  - PSRAD: Shift packed doublewords right arithmetic
• SSE2 128-Bit SIMD Integer Instructions
  - PSLLDQ: Shift double quadword left logical
  - PSRLDQ: Shift double quadword right logical
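
The difference between the logical and arithmetic right shifts shows up only for negative inputs: PSRAW replicates the sign bit, PSRLW shifts in zeros. A minimal sketch using the 128-bit SSE2 forms (`_mm_srai_epi16`, `_mm_srli_epi16`) with a fixed shift count of 2 follows; the function name is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Shifts eight 16-bit words right by 2, once arithmetically and once logically. */
void shift_right_by_2(int16_t arith[8], uint16_t logical[8], const int16_t in[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)arith,   _mm_srai_epi16(v, 2));  /* PSRAW: sign-fill */
    _mm_storeu_si128((__m128i *)logical, _mm_srli_epi16(v, 2));  /* PSRLW: zero-fill */
}
```

For the input word -16 (bit pattern 0xFFF0), the arithmetic shift yields -4 and the logical shift yields 0x3FFC (16380).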

Shuffle and Unpack Instructions
• SSE 64-Bit SIMD Integer Instructions
  - PSHUFW: Shuffle packed integer word in MMX register
• SSE2 128-Bit SIMD Integer Instructions
  - PSHUFLW: Shuffle packed low words
  - PSHUFHW: Shuffle packed high words
  - PSHUFD: Shuffle packed doublewords
  - PUNPCKHQDQ: Unpack high quadwords
  - PUNPCKLQDQ: Unpack low quadwords
• SSE Shuffle and Unpack Instructions
  - SHUFPS: Shuffles values in packed single-precision floating-point operands
  - UNPCKHPS: Unpacks and interleaves the two high-order values from two single-precision floating-point operands
  - UNPCKLPS: Unpacks and interleaves the two low-order values from two single-precision floating-point operands
• SSE2 Shuffle and Unpack Instructions
  - SHUFPD: Shuffles values in packed double-precision floating-point operands
  - UNPCKHPD: Unpacks and interleaves the high values from two packed double-precision floating-point operands
  - UNPCKLPD: Unpacks and interleaves the low values from two packed double-precision floating-point operands
• SSSE3 Packed Shuffle Bytes
  - PSHUFB: A complete byte-granularity permutation, including force-to-zero flag
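
As an illustration of how the shuffle immediate works, the sketch below uses PSHUFD via the SSE2 intrinsic `_mm_shuffle_epi32`. The `_MM_SHUFFLE(z, y, x, w)` macro lists source indices for result elements 3, 2, 1, 0 in that order, so `_MM_SHUFFLE(3, 2, 1, 0)` is the identity. The function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* Reverses the order of the four doublewords. */
void reverse_dwords(int out[4], const int in[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* result[0]=in[3], result[1]=in[2], result[2]=in[1], result[3]=in[0] */
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));  /* PSHUFD */
    _mm_storeu_si128((__m128i *)out, r);
}

/* Copies element 0 into all four positions. */
void broadcast_dword0(int out[4], const int in[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i r = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 0, 0, 0));  /* PSHUFD */
    _mm_storeu_si128((__m128i *)out, r);
}
```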

Functional Instructions
• SSE 64-Bit SIMD Integer Instructions
  - PAVGB: Compute average of packed unsigned byte integers
  - PAVGW: Compute average of packed unsigned word integers
  - PEXTRW: Extract word
  - PINSRW: Insert word
  - PMAXUB: Maximum of packed unsigned byte integers
  - PMAXSW: Maximum of packed signed word integers
  - PMINUB: Minimum of packed unsigned byte integers
  - PMINSW: Minimum of packed signed word integers
  - PSADBW: Compute sum of absolute differences
• SSSE3 Instructions
  - PSIGNB/PSIGNW/PSIGND: Per element, if the source operand is negative, multiply the destination operand by -1; if the source element is zero, the destination element is set to zero
  - PABSB/PABSW/PABSD: Per element, overwrite destination with absolute value of source
  - PALIGNR: Extract any contiguous 16 bytes (8 in the 64-bit case) from the pair [dst, src] and store them to the dst register
  - PMULHRSW: Signed 16-bit multiply with rounding and scaling, returning 16-bit results
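
PSADBW is the workhorse of block-matching code in video encoders. The sketch below assumes the 128-bit SSE2 form (`_mm_sad_epu8`) rather than the 64-bit form listed above. In the 128-bit form the instruction produces two partial sums, one in the low word of each 64-bit half, which still have to be added together. The function name `sad16` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum of absolute differences over 16 unsigned bytes. */
unsigned sad16(const uint8_t a[16], const uint8_t b[16])
{
    __m128i va  = _mm_loadu_si128((const __m128i *)a);
    __m128i vb  = _mm_loadu_si128((const __m128i *)b);
    __m128i sad = _mm_sad_epu8(va, vb);               /* PSADBW */
    unsigned low  = (unsigned)_mm_cvtsi128_si32(sad); /* sum for bytes 0..7  */
    unsigned high = (unsigned)_mm_extract_epi16(sad, 4); /* sum for bytes 8..15 (PEXTRW) */
    return low + high;
}
```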

System Related Instructions
• MMX State Management Instructions
  - EMMS: Empty MMX state
• SSE MXCSR State Management Instructions
  - LDMXCSR: Load MXCSR register
  - STMXCSR: Save MXCSR register state
• SSE3 Agent Synchronization Instructions
  - MONITOR: Sets up an address range used to monitor write-back stores
  - MWAIT: Enables a logical processor to enter into an optimized state while waiting for a write-back store to the address range set up by the MONITOR instruction
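
STMXCSR and LDMXCSR are reachable from C through the SSE intrinsics `_mm_getcsr` and `_mm_setcsr`. The following sketch saves MXCSR, changes only the rounding-control field, reads it back, and restores the original value. The function name `try_rounding_mode` is hypothetical; the `_MM_ROUND_*` constants come from `<xmmintrin.h>`.

```c
#include <xmmintrin.h>  /* SSE intrinsics and _MM_ROUND_* constants */

/* Temporarily selects rounding mode `mode` (one of the _MM_ROUND_* values),
   returns the rounding-control field that MXCSR then reports, and restores
   the caller's MXCSR before returning. */
unsigned try_rounding_mode(unsigned mode)
{
    unsigned saved = _mm_getcsr();                           /* STMXCSR */
    _mm_setcsr((saved & ~(unsigned)_MM_ROUND_MASK) | mode);  /* LDMXCSR */
    unsigned seen = _mm_getcsr() & (unsigned)_MM_ROUND_MASK;
    _mm_setcsr(saved);                                       /* restore */
    return seen;
}
```

Restoring the saved value matters because MXCSR is per-thread state: code that leaves a non-default rounding mode behind changes the results of every later SSE floating-point operation on that thread.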

Cacheability Control, Prefetch, and Instruction Ordering Instructions
• SSE Cacheability Control, Prefetch, and Instruction Ordering Instructions
  - MASKMOVQ: Non-temporal store of selected bytes from an MMX register into memory
  - MOVNTQ: Non-temporal store of quadword from an MMX register into memory
  - MOVNTPS: Non-temporal store of four packed single-precision floating-point values from an XMM register into memory
  - PREFETCHh: Load 32 or more bytes from memory to a selected level of the processor's cache hierarchy
  - SFENCE: Serializes store operations
• SSE2 Cacheability Control and Ordering Instructions
  - CLFLUSH: Flushes and invalidates a memory operand and its associated cache line from all levels of the processor's cache hierarchy
  - LFENCE: Serializes load operations
  - MFENCE: Serializes load and store operations
  - PAUSE: Improves the performance of "spin-wait loops"
  - MASKMOVDQU: Non-temporal store of selected bytes from an XMM register into memory
  - MOVNTPD: Non-temporal store of two packed double-precision floating-point values from an XMM register into memory
  - MOVNTDQ: Non-temporal store of double quadword from an XMM register into memory
  - MOVNTI: Non-temporal store of a doubleword from a general-purpose register into memory
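
A typical pattern for the non-temporal stores is filling a large buffer that will not be read again soon, so that the data does not displace useful cache lines. The sketch below, with the hypothetical name `stream_fill`, combines MOVNTI (`_mm_stream_si32`), MOVNTDQ (`_mm_stream_si128`), and SFENCE (`_mm_sfence`). MOVNTDQ requires a 16-byte-aligned address, so the unaligned head and tail are written with MOVNTI. Whether this is actually faster than ordinary stores depends on buffer size and processor, and should be measured.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Fills dst[0..n-1] with `value` using non-temporal stores. */
void stream_fill(int *dst, size_t n, int value)
{
    size_t i = 0;

    /* Head: single doublewords until dst + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(dst + i) & 15u) != 0) {
        _mm_stream_si32(dst + i, value);                 /* MOVNTI */
        i++;
    }

    /* Body: four doublewords per aligned 128-bit store. */
    __m128i v = _mm_set1_epi32(value);
    for (; i + 4 <= n; i += 4) {
        _mm_stream_si128((__m128i *)(dst + i), v);       /* MOVNTDQ */
    }

    /* Tail: remaining doublewords. */
    for (; i < n; i++) {
        _mm_stream_si32(dst + i, value);                 /* MOVNTI */
    }

    /* Non-temporal stores are weakly ordered; SFENCE makes them globally
       visible before any store that follows. */
    _mm_sfence();
}
```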