Single Instruction Multiple Data
Another approach to ILP and performance
ECE 1773. Portions from Hill, Wood, Sohi and Smith (Wisconsin), Culler (Berkeley), Kozyrakis (Stanford).

Outline
• Array Processors / “True” SIMD
• Vector Processors
• Multimedia Extensions in modern instruction sets

SIMD: Motivation
• Let’s start with an example:
  – ILLIAC IV, U of Illinois, 1972 (prototype)
• Reasoning: How to Improve Performance
  – Rely on Faster Circuits
    • Cost/circuit increases with circuit speed
    • At some point, cost/performance unfavorable
  – Concurrency:
    • Replicate Resources
    • Do more per cycle

SIMD: Motivation contd.
• Replication to the extreme: Multi-processor
[Figure: a uniprocessor (CU, ALU, MEM) replicated to form a multiprocessor]
• Very flexible, but costly
• Do we need all this flexibility?
• There are middle-ground designs where only parts are replicated

SIMD: Motivation contd.
• Recall:
  – Part of architecture is understanding application needs
• Many Apps:
  – for i = 0 to infinity
    • a(i) = b(i) + c
• Same operation over many tuples of data
• Mostly independent across iterations

SIMD Architecture
[Figure: CU (μCU, regs) connected to PEs (ALU) and MEM modules]
• Replicate Datapath, not the control
• All PEs work in tandem
• CU orchestrates operations

ILLIAC IV
• Goal:
  – 1 Gops/sec
  – 256 PEs as four partitions of 64 PEs
• What was built
  – 0.2 Minsts/sec (we’ll talk about peak performance as ops)
  – 64 PEs
  – Prototype due date 1972

ILLIAC IV
[Figure: I/O Proc, CU, PEs, PMEM]

ILLIAC IV Processing Element (PE)
• 64-bit numbers, float or fixed point
• Multiples of smaller numbers that add up to 64 bits
  – Today’s multimedia extensions
• PMEM: One local memory module per PE
  – 2K x 64 bits
  – 188 ns access / 350 ns cycle (includes conflict resolution)
• 100K components per PE

PE Contd.
• PE mode: Active or Inactive, CU sets mode
• All PEs operate in lock-step
• Routing insts to move data from PE to PE
• The CU can execute instructions while PEs are busy
  – Another degree of concurrency
• Datatypes
  – 64b float
  – 64b logical
  – 48b fixed
  – 32b float
  – 24b fixed
  – 8b fixed

Peak Compute Bandwidth
• 64 PEs
• Each can perform:
  – 1 64b, 2 32b, or 4 8b operations
• Or, in total:
  – 64 elems, 128 elems, or 512 elems
• Peak:
  – 150 M 64b ops/sec up to 10 G 32b ops/sec
  – The last figure is for integer ops
  – Each int op takes 66 ns (4 per PE in parallel)

Control Unit (CU)
• A simple CPU
• Can execute instructions w/o PE intervention
• Coordinates all PEs
• 64 64b registers, D0-D63
• 4 64b accumulators, A0-A3
• Ops:
  – Integer ops
  – Shifts
  – Boolean
  – Loop control
  – Index PMEM
[Figure: CU with registers D0 to D63, accumulators A0 to A3, and an ALU]

Processing Element (PE)
• 64-bit regs
• A: Accumulator
• B: 2nd operand for binary ops
• R: Routing, inter-PE communication
• S: Temporary
• X: Index for PMEM, 16 bits
• D: mode, 8 bits
• Communication:
  – PMEM only from local PE
  – Amongst PEs with R
[Figure: PEi with registers A, B, R, S, X, D, an ALU and local PMEMi, linked to PEi-8, PEi-1, PEi+1 and PEi+8]

Datapaths
[Figure: CU and PEs with their PMEMs, connected by the Control Unit Bus, Common Data Bus, Routing network and Mode lines]
• CU Bus: Insts and Data from PMEM to CU in 8 words
• CDB: Broadcast to all PEs
  – E.g., constants for adds
• Routing Network: amongst R registers
• Mode: To activate/de-activate PEs

Routing Network
[Figure: the 64 PEs drawn as an 8 x 8 grid numbered 0 to 63, with wraparound links at the edges; PE i connects to PEs i-8, i-1, i+1 and i+8]

Using ILLIAC IV: Example #1
• DO 10 I = 1 TO 64
  10 C(I) = A(I) + B(I)
• LDA a + 2     load A(i) into A (same a per PMEM)
• ADDRN a + 1   add B(i) into A
• STA a         store A into C(i)
[Figure: PMEM 1 through PMEM 64; PMEM i holds C(i), A(i) and B(i), with C(i) at address a]
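The three-instruction sequence above can be modelled in a few lines. This is only a sketch: the function name `simd_add` and the list-of-lists memory model are assumptions for illustration, and the layout follows the slide's comments (A(i) at a + 2, B(i) at a + 1, C(i) at a).

```python
NUM_PES = 64  # ILLIAC IV as built


def simd_add(pmem, a):
    """Model LDA a+2 / ADDRN a+1 / STA a executed by all PEs in lock-step.

    pmem[pe] is the local memory of one PE (hypothetical model).
    Every PE uses the same local address `a`, as on the slide.
    """
    acc = [0] * NUM_PES
    # LDA a + 2: each PE loads A(i) from its own PMEM into its accumulator
    for pe in range(NUM_PES):
        acc[pe] = pmem[pe][a + 2]
    # ADDRN a + 1: each PE adds B(i) from its own PMEM
    for pe in range(NUM_PES):
        acc[pe] += pmem[pe][a + 1]
    # STA a: each PE stores its accumulator into C(i)
    for pe in range(NUM_PES):
        pmem[pe][a] = acc[pe]
    return pmem
```

Each inner loop stands for one instruction that the hardware performs on all 64 PEs at once, so the 64-iteration Fortran loop becomes three instructions.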

Using ILLIAC IV: Example #2
• DO 10 I = 2 TO 64
  10 A(I) = B(I) + A(I-1)
• Expand into:
  – A(N) = A(1) + Sum B(i) [i = 2 to N]
• We get (with S starting at A(1)):
  – DO 10 N = 2 TO 64
  – S = S + B(N)
  – 10 A(N) = S

Using ILLIAC IV: Example #2 contd.
1. Enable all PEs
2. All load A from a
3. i = 0
4. All R = A (including those inactive)
5. All route R to PE(2^i) to the right
6. j = 2^i – 1
7. Disable all PEs 1 through j
8. A = A + R (R contains a partial sum of many A(i))
9. i = i + 1
10. if i < lg(64) goto 4
11. Enable all PEs
12. All store A at (a + 1)
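The steps above amount to a log-step prefix sum. A minimal sketch follows, under stated assumptions: PEs are numbered from 0 here, so the PEs masked off in round i are 0 through 2^i - 1; routing wraps around; and the name `prefix_sum_simd` is made up for this illustration.

```python
def prefix_sum_simd(values):
    """Inclusive prefix sum in lg(n) lock-step rounds, one value per PE."""
    n = len(values)
    acc = list(values)            # step 2: every PE loads A from address a
    shift = 1                     # 2^i, starting with i = 0 (step 3)
    while shift < n:              # step 10: repeat while i < lg(n)
        r = list(acc)             # step 4: R = A in every PE
        # step 5: every PE receives the R sent from 2^i positions to its left
        routed = [r[(pe - shift) % n] for pe in range(n)]
        for pe in range(n):
            if pe >= shift:       # steps 6-7: the first 2^i PEs are disabled
                acc[pe] += routed[pe]   # step 8: A = A + R
        shift *= 2                # step 9: i = i + 1
    return acc                    # steps 11-12: all PEs store A
```

With PE 0 holding A(1) and PE k holding B(k+1), PE k ends up holding A(k+1) after six rounds for 64 PEs, instead of 63 dependent additions.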

Using ILLIAC IV: Example #2 contd.
• Initial State:
  – PMEM(1)[a] = A(1)
  – PMEM(1+i)[a] = B(i+1)
• For example, at PE 1
• STEP 1: A = A(1)
  – from PE 2 we get B(2)
• STEP 2: A = A(1) + B(2)
  – from PE 4 we get B(4) + B(5)
• STEP 3: A = A(1) + B(2) + B(4) + B(5)
  – from PE 8 we get B(8) + B(7) + B(12) + B(13)

Vector Processors
SIMD over time

Vector Processors
• Vector Datatype
• Apply same operation on all elements of the vector
• No dependences amongst elements
• Same motivation as SIMD

Properties of Vector Processors
• One Vector instruction implies lots of work
  – Fewer instructions
• Each result independent of previous result
  – Multiple operations in parallel
  – Simpler design; no need for dependence checks
  – Higher clock rate
  – Compiler must help
• Fewer Branches
• Memory access pattern per vector inst known
  – Prefetching effect
  – Amortize mem latency
  – Can exploit high-bandwidth mem system
  – Less/no need for data caches

Classes of Vector Processors
• Memory to memory
  – Vectors are in memory
• Load/store
  – Vectors are in registers
  – Load/store to communicate with memory
  – This prevailed

Historical Perspective
• Mid-60s: performance concerns
• SIMD processor arrays
• Also fast Scalar machines
  – CDC 6600
• Texas Instruments ASC, 1972
  – Memory to memory vector
• Cray develops CRAY-1, 1976

CRAY-1
• Fast and simple scalar processor
  – 80 MHz
• Vector register concept
  – Much simpler ISA
  – Reduced memory pressure
• Tight integration of scalar and vector units
• Cylindrical design to minimize wire lengths
• Freon Cooling

Physical Organization of CRAY-1

Components of Vector Processor
• Scalar CPU: registers, datapaths, instruction fetch
• Vector Registers:
  – Fixed length memory bank holding a single vector reg
  – Typically 8-32 Vregs, up to 8 Kbits per Vreg
  – At least 2 read, 1 write ports
  – Can be viewed as an array of N elements
• Vector Functional Units:
  – Fully pipelined. New op per cycle
  – Typically 2 to 8 FUs: integer and FP
  – Multiple datapaths to process multiple elements per cycle if needed
• Vector Load/Store Units (LSUs):
  – Fully pipelined
  – Multiple elems fetched/stored per cycle
  – May have multiple LSUs
• Cross-bar:
  – Connects FUs, LSUs and registers

CRAY-1 Organization
• Simple 16-bit Reg-to-Reg ISA
• Use two 16-bit parcels to get an immediate
• Natural combinations of scalar and vector
• Scalar bit-vectors match vector length
• Gather/Scatter M-R
• Cond. Merge

CRAY-1 CPU
• Scalar and vector modes
• 12.5 ns clock
• 64-bit words
• Int & FP units
• 12 FUs
• 8 24-bit A regs
• 64 B regs (temp storage for A)
• 8 64-bit S regs
• 64 T regs (temp storage for S)
• 8 V regs, each 64 elems of 64 bits

CRAY-1 CPU
• Vector Length Register
  – Can use only a prefix of a vreg
• Vector Mask Register
  – Can use only a subset of a vreg
• Real Time Register (counts clock cycles)
• Four instruction buffers
  – 64 16-bit parcels
• 128 Basic Instructions
• Interrupt Control
• NO virtual memory system

CRAY-1 Memory System
• 1M 64b words + 8 check bits (single error correction, double error detection)
• 16 banks of 64K words
• 4-clock period
• 1 word per clock for B, T and Vreg
• 1 word per 2 clocks for A & S
• 4 words per clock for inst buffers

Instruction Format
Field      g        h        i      j       k       m
Bits       0-3      4-6      7-9    10-12   13-15   16-31
Bit count  4        3        3      3       3       16
Contents   opcode   opcode   Rd     Rs1     Rs2
• Register operands: A/S, B/T

Basic Vector Instructions
Inst      Operands      Operation                        Comment
VADD.VV   V1, V2, V3    V1 = V2 + V3                     vector + vector
VADD.SV   V1, R0, V2    V1 = R0 + V2                     scalar + vector
VMUL.VV   V1, V2, V3    V1 = V2 * V3                     vector * vector
VMUL.SV   V1, R0, V2    V1 = R0 * V2                     scalar * vector
VLD       V1, R0        V1 = M[R0 … R0+63]               stride = 1
VLDS      V1, R1, R2    V1 = M[R1 … R1+63*R2]            stride = R2
VLDX      V1, R1, V2    V1 = M[R1+V2[i], i = 0 to 63]    gather
VST                     store equiv of VLD
VSTS                    store equiv of VLDS
VSTX      V1, R1        M[R1+V2[i], i = 0 to …
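The three load forms in the table differ only in how element addresses are generated. A small sketch, assuming memory is a Python list indexed in words (not bytes) and using made-up helper names `vld`, `vlds` and `vldx`:

```python
MVL = 64  # maximum vector length, as on the CRAY-1


def vld(mem, base, vl=MVL):
    """Unit-stride load: V = M[base], M[base+1], ..."""
    return [mem[base + i] for i in range(vl)]


def vlds(mem, base, stride, vl=MVL):
    """Strided load: V = M[base], M[base+stride], ..."""
    return [mem[base + i * stride] for i in range(vl)]


def vldx(mem, base, index, vl=MVL):
    """Indexed load (gather): V[i] = M[base + index[i]]."""
    return [mem[base + index[i]] for i in range(vl)]
```

For example, `vlds(mem, 0, 3, 4)` reads words 0, 3, 6 and 9, which is how a column of a row-major matrix with three columns would be fetched.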

Vector Memory Operations
• Load/Store move groups of data between memory and registers
• Addressing Modes
  – Unit-stride: Fastest
  – Non-unit, constant stride (interleaved memory helps)
  – Indexed (gather-scatter)
    • Vector equiv of register indirect
    • Sparse arrays
    • Can vectorize more loops

Vector Code Example
• Y[0:63] = Y[0:63] + a * X[0:63]
• LD R0, a
• VLD V1, Rx              Load X[] in V1
• VLD V2, Ry              Load Y[] in V2
• VMUL.SV V3, R0, V1      V3 = X[]*a
• VADD.VV V4, V2, V3      V4 = Y[]+X[]*a
• VST Ry, V4              store in Y[]

Scalar Equivalent
• Scalar code:
        LD R0, a
        LI R5, 512          (offset at the end of X[])
  Loop: LD R2, 0(Rx)
        MULTD R2, R0, R2
        LD R3, 0(Ry)
        ADD R4, R2, R3
        ST R4, 0(Ry)
        ADD Rx, 8
        ADD Ry, 8
        SUB R5, 8
        BNE Loop
• Vector code, for comparison:
        LD R0, a
        VLD V1, Rx
        VLD V2, Ry
        VMUL.SV V3, R0, V1
        VADD.VV V4, V2, V3
        VST Ry, V4
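Counting dynamic instructions makes the difference concrete. The sketch below assumes the scalar loop runs 64 times with 9 instructions per iteration plus 2 setup instructions, and that each vector instruction counts once; the function names are hypothetical.

```python
def daxpy_scalar(a, x, y):
    """Scalar loop: returns (result, dynamic instruction count)."""
    insts = 2                     # LD R0,a and LI R5,512
    out = list(y)
    for i in range(len(x)):
        out[i] = a * x[i] + out[i]
        insts += 9                # LD, MULTD, LD, ADD, ST, ADD, ADD, SUB, BNE
    return out, insts


def daxpy_vector(a, x, y):
    """Vector sequence: returns (result, dynamic instruction count)."""
    v1 = list(x)                              # VLD V1, Rx
    v2 = list(y)                              # VLD V2, Ry
    v3 = [a * e for e in v1]                  # VMUL.SV V3, R0, V1
    v4 = [p + q for p, q in zip(v2, v3)]      # VADD.VV V4, V2, V3
    return v4, 6                              # plus LD R0,a and VST Ry,V4
```

For 64 elements the scalar version executes 2 + 9 * 64 = 578 instructions against 6 for the vector version, which is the "fewer instructions, fewer branches" property in numbers.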

Vector Length Register
• Allows us to vectorize code where the elements do not exactly fit within the vector register
• What if we need a vector of just 32 elems?
• Vector length register:
  – Operate up to this element
  – Can be anything from 0 to Maximum (64 in CRAY-1)
• Can also be used to support runtime vector length variability

Strip Mining
• Suppose (application vector length) AVL > MVL (max vector length)
• Each loop iteration handles MVL elems
• One iteration handles the AVL mod MVL leftover elems; the code below does that short piece first
  – VL = (AVL mod MVL)
  – for (i = 0; i < VL; i++) Y[i] = A*X[i] + Y[i]
  – low = (AVL mod MVL)
  – VL = MVL
  – for (i = low; i < AVL; i++) Y[i] = A*X[i] + Y[i]
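A runnable sketch of strip mining, assuming the odd-sized piece goes first and every later strip uses the full MVL. The function name `strip_mined_daxpy` and the returned list of vector lengths are illustrative only.

```python
MVL = 64  # maximum vector length of the machine


def strip_mined_daxpy(a, x, y, mvl=MVL):
    """Compute Y = a*X + Y in strips of at most mvl elements.

    Returns the result and the VL value used for each strip.
    """
    avl = len(x)                  # application vector length
    out = list(y)
    vls = []
    low = 0
    vl = avl % mvl                # first strip: the leftover piece
    while low < avl:
        if vl > 0:
            # one "vector instruction sequence" with VL = vl
            for i in range(low, low + vl):
                out[i] = a * x[i] + out[i]
            vls.append(vl)
        low += vl
        vl = mvl                  # all remaining strips are full length
    return out, vls
```

For AVL = 150 this issues strips of 22, 64 and 64 elements; when AVL is a multiple of MVL the leftover strip is skipped.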

Optimization #1: Chaining
• Subsequent vector op can be initiated as soon as a preceding vector op it depends upon produces its first result
• Example
  – VADD.VV V1, V2, V3
  – VADD.SV V4, V1, R0
[Figure: timelines. Unchained: the second add starts only after V1(63) is produced. Chained: V4(1) starts as soon as V1(1) is available, so the two adds overlap.]

Optimization #2: Conditional Execution
• Vector Mask Register
  – Bit vector: used as predicate
  – If 0, the operation is not performed for the corresponding pair
• Loop:
  – for (i = 0; i < 64; i++) if (A[i] != B[i]) A[i] = A[i] - B[i]
• Vector code:
  – VLD V1, Ra
  – VLD V2, Rb
  – VCMP.NEQ.VV VMR, V1, V2
  – VSUB.VV V3, V1, V2 (VMR)
  – VST V3, Ra
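The effect of the mask register can be sketched as follows. The function name `masked_sub` is an assumption, and masked-off elements are taken to keep their old A[i] value, which matches the scalar loop.

```python
def masked_sub(a, b):
    """Model of VCMP.NEQ.VV followed by a masked VSUB.VV.

    Returns the updated A vector and the mask that was formed.
    """
    # VCMP.NEQ.VV: one predicate bit per element pair
    vmr = [x != y for x, y in zip(a, b)]
    result = list(a)
    for i, active in enumerate(vmr):
        if active:                # mask bit 1: perform the subtraction
            result[i] = a[i] - b[i]
        # mask bit 0: element is left untouched
    return result, vmr
```

The branch inside the original loop body disappears: the comparison produces data (the mask) instead of changing control flow.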

Optimization #3: Multi-lane Implementation
• Vectors are interleaved so that multiple elems can be accessed per cycle
• Replicate resources
• Equivalent of Superscalar
• Because of no intra-vector dependences and because inter-vector dependences are aligned (elem(i) to elem(i)), no need for interbank communications

Two Ways to View Vectorization
• Classic Approach: Inner-loop
  – Think of the machine as having 32 vector registers with 16 elems
  – 1 instruction updates all elements of a vector
  – Vectorize single dimension array operations
• A new approach: Outer-loop
  – Think of the machine as 16 “virtual processors”, each with 32 scalar registers
  – 1 instruction updates a register in 16 VPs
  – Good for irregular kernels
• Hardware is the same for both
• These describe the compiler’s perspective

Startup Cost

Execution Cost

Multimedia extensions
SIMD in modern CPUs

Multimedia ISA Extensions
• Intel’s MMX
  – The Basics
  – Instruction Set
  – Examples
  – Integration into Pentium
  – Relationship to vector ISAs
• AMD’s 3DNow!
• Intel’s ISSE (a.k.a. KNI)

MMX: Basics
• Multimedia applications are becoming popular
• Are current ISAs a good match for them?
• Methodology:
  – Consider a number of “typical” applications
  – Can we do better?
  – Cost vs. performance vs. utility tradeoffs
• Net Result: Intel’s MMX
• Can also be viewed as an attempt to maintain market share
  – If people are going to use these kinds of applications, we had better support them

Multimedia Applications
• Most multimedia apps have lots of parallelism:
  – for I = here to infinity
    • out[I] = in_a[I] * in_b[I]
  – At runtime:
    • out[0] = in_a[0] * in_b[0]
    • out[1] = in_a[1] * in_b[1]
    • out[2] = in_a[2] * in_b[2]
    • out[3] = in_a[3] * in_b[3]
    • …
• Also, work on short integers:
  – in_a[i] is 0 to 255, for example (color)
  – or, 0 to 64K (16-bit audio)

Observations
• 32-bit registers are wasted
  – only using part of them, and we know it
  – ALUs underutilized, and we know it
• Instruction specification is inefficient
  – even though we know that a lot of the same operations will be performed, we still have to specify each of them individually
  – Instruction bandwidth
  – Discovering Parallelism
  – Memory Ports?
    • Could read four elements of an array with one 32-bit load
    • Same for stores
• The hardware will have a hard time discovering this
  – Coalescing and dependences

MMX Contd.
• Can do better than traditional ISA
  – new data types
  – new instructions
• Pack data in 64-bit words
  – bytes
  – “words” (16 bits)
  – “double words” (32 bits)
• Operate on packed data like short vectors
  – SIMD
  – First used in Livermore S-1 (> 20 years)

MMX: Example
• Up to 8 operations (64 bit) go in parallel
• Potential improvement: 8x
• In practice less, but still good
• Besides, another reason to think your machine is obsolete

Data Types

MMX: Instruction Set
• 57 new instructions
• Integer Arithmetic
  – add/sub/mul
  – multiply-add
  – signed/unsigned
  – saturating/wraparound
• Shifts
• Compare (form mask)
• Pack/Unpack
• Move
  – from/to memory
  – from/to registers

Arithmetic
• Conventional: Wrap-around
  – on overflow, wrap to MININT
  – on underflow, wrap to MAXINT
• Think of digital audio
  – What happens when you turn volume to the MAX?
• Similar for pictures
• Saturating arithmetic:
  – on overflow, stay at MAXINT
  – on underflow, stay at MININT
• Two flavors:
  – unsigned
  – signed
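The difference between the two modes is easy to see on 8-bit lanes. A minimal sketch with hypothetical helper names; `padd` simply applies a per-lane operation to two packed values modelled as lists.

```python
def add_wrap_u8(a, b):
    """Unsigned 8-bit add with wraparound (modulo 256)."""
    return (a + b) & 0xFF


def add_sat_u8(a, b):
    """Unsigned 8-bit add that clamps at 255."""
    return min(a + b, 255)


def add_sat_s8(a, b):
    """Signed 8-bit add that clamps to the range -128 .. 127."""
    return max(-128, min(127, a + b))


def padd(lanes_a, lanes_b, op):
    """Apply op lane by lane, like one packed add instruction."""
    return [op(x, y) for x, y in zip(lanes_a, lanes_b)]
```

Brightening a pixel of value 200 by 100 gives 44 with wraparound (a bright pixel turns dark) but 255 with saturation, which is why media code prefers the saturating form.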

Operations
• Mult/Add
• Compares
• Conversion
  – Interpolation/Transpose
  – Unpack (e.g., byte to word)
  – Pack (e.g., word to byte)

Matrix Transpose 4 x 4
[Figure: the rows (m03 m02 m01 m00), (m13 m12 m11 m10), (m23 m22 m21 m20), (m33 m32 m31 m30) are interleaved with punpcklwd into (m11 m01 m10 m00) and (m31 m21 m30 m20); punpckldq and punpckhdq then combine these into (m30 m20 m10 m00) and (m31 m21 m11 m01)]
• That’s for the first two rows
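The unpack-based transpose can be checked with a small model. Each register is a list of four 16-bit words, lowest word first. The low-half steps follow the figure; using punpckhwd for the other two output rows is an assumption made here to complete the example.

```python
def punpcklwd(dst, src):
    """Interleave the two low words of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]


def punpckhwd(dst, src):
    """Interleave the two high words of dst and src."""
    return [dst[2], src[2], dst[3], src[3]]


def punpckldq(dst, src):
    """Interleave the low doublewords (word pairs) of dst and src."""
    return [dst[0], dst[1], src[0], src[1]]


def punpckhdq(dst, src):
    """Interleave the high doublewords (word pairs) of dst and src."""
    return [dst[2], dst[3], src[2], src[3]]


def transpose4x4(rows):
    """Transpose a 4 x 4 matrix of words using only unpack steps."""
    r0, r1, r2, r3 = rows
    lo01 = punpcklwd(r0, r1)      # m00 m10 m01 m11
    lo23 = punpcklwd(r2, r3)      # m20 m30 m21 m31
    hi01 = punpckhwd(r0, r1)      # m02 m12 m03 m13
    hi23 = punpckhwd(r2, r3)      # m22 m32 m23 m33
    return [
        punpckldq(lo01, lo23),    # column 0
        punpckhdq(lo01, lo23),    # column 1
        punpckldq(hi01, hi23),    # column 2
        punpckhdq(hi01, hi23),    # column 3
    ]
```

Eight unpack operations replace sixteen individual element moves.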

Examples
• Image Compositing
  – A and B images fade-in and fade-out
  – A * fade + B * (1 - fade), OR
  – (A - B) * fade + B
• Image Overlay
  – Sprite: e.g., mouse cursor
  – Sprite: normal colors + transparent
  – for i = 1 to Sprite_Length
    • if A[I] = clear_color then
      – Out_frame[I] = C[I]
      – else Out_frame[I] = A[I]
• Matrix Transpose
  – Convert from row major to column major
  – Used in JPEG

Chroma Keying
• for (i = 0; i < image_size; i++)
  – if (x[i] == Blue) new_image[i] = y[i];
  – else new_image[i] = x[i];

Chroma Keying Code
• Movq mm3, mem1
  – Load eight pixels from the person’s image
• Movq mm4, mem2
  – Load eight pixels from the background image
• Pcmpeqb / Pandn / Por on mm1, mm3, mm4: compare to form a mask, then merge the two images

Integration into Pentium
• Major issue: OS compatibility
  – Create new registers?
  – Share registers with FP
    • Existing OSes will save/restore
• Use 64-bit datapaths
• Pipe capable of 2 MMX IPC
• Separate MEM and Execute stage

“Recent” Multimedia Extensions
• Intel MMX: integer arithmetic only
• New algorithms -> new needs
• Need for massive amounts of FP ops
• Solution? MMX-like ISA but for FP, not only integer
• Example: AMD’s 3DNow!
  – New data type:
    • 2 packed single-precision FP
      – 2 x 32 bits
        » sign + exponent + significand
  – New instructions
  – Speedup potential: 2x

AMD’s 3DNow!
• 21 new instructions
• Average: motivated by MPEG
• Add, Sub, Reverse Sub, Mul
• Accumulate
  – (A1, A2) acc (B1, B2) = (B1 + B2, A1 + A2)
• Comparison (create mask)
• Min, Max (pairwise)
• Reciprocal and SQRT
  – Approximation: 1st step and other steps
• Prefetch
• Integer from/to FP conversion
• All operate on packed FP data
  – sign * 2^(exponent - 127) * mantissa

Recent Extensions Cont.
• Intel’s ISSE
  – very similar to AMD’s 3DNow!
  – But has separate registers
• Lessons?
  – Applications change over time
  – Careful when introducing new instructions
    • How useful are they?
    • Cost?
    • LEGACY: are they going to be useful in the future?
• Everyone has their own Multimedia Instruction set these days
  – read handout

Intel’s SSE
• Multimedia/Internet?
• 70 new instructions
• Major Types:
  – SIMD-FP: 128-bit wide, 4 x 32-bit FP
  – Data movement and re-organization
  – Type conversion
    • Int to FP and vice versa
    • Scalar/FP precision
  – State Save/Restore
    • New SSE registers, not like MMX
  – Memory Streaming
    • Prefetch to specified hierarchy level
  – New Media
    • Absolute Diff, Rounded AVG, MIN/MAX

Altivec (PowerPC Multimedia Ext)
• 128-bit registers
• 8, 16, or 32 bit data types
• Scalar or single-precision FP
• 162 Instructions
• Saturation or Modulo arithmetic
• Four operand Instructions
  – 3 sources, 1 target

Altivec Design Process
• Look at Mmedia Kernel
• Justify new instructions
• Video
  – 8 bit int LowQ, 16 bit int HighQ
• Audio
  – 16 bit int LowQ, SP FP HighQ
• Image Processing
  – 8 bit int LowQ, 16 bit int HighQ
• 3D Graphics
  – 16 bit int LowQ, SP FP HighQ
• Speech Recog.
  – 16 bit int LowQ, SP FP HighQ
• Communications/Crypto
  – 8 bit or 16 bit unsigned int