Data-Level Parallelism in Vector and GPU Architectures
Muhamed Mudawar
Computer Engineering Department
King Fahd University of Petroleum and Minerals
CSE 661 - Parallel and Vector Architectures

Introduction
• SIMD architectures can exploit significant data-level parallelism for:
  - matrix-oriented scientific computing
  - media-oriented image and sound processors
• SIMD is more energy-efficient than MIMD
  - Only needs to fetch one instruction per data operation
  - Makes SIMD attractive for personal mobile devices
• SIMD allows the programmer to continue to think sequentially

SIMD Parallelism
• Vector architectures
• SIMD extensions
• Graphics Processor Units (GPUs)
• For x86 processors:
  - Expect two additional cores per chip per year
  - SIMD width to double every four years
  - Potential speedup from SIMD to be twice that from MIMD!

Vector Architectures
• Basic idea:
  - Read sets of data elements into "vector registers"
  - Operate on those registers
  - Disperse the results back into memory
• Registers are controlled by the compiler
  - Used to hide memory latency
  - Leverage memory bandwidth

Vector Processing
• Vector processors have high-level operations that work on linear arrays of numbers ("vectors"):

    SCALAR (1 operation):        VECTOR (N operations, N = vector length):
    add  r3, r1, r2              addv v3, v1, v2
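
To make the contrast concrete, here is a minimal C sketch of the two execution models above (the function names and the vl parameter are illustrative, not part of any ISA):

    #include <stddef.h>

    /* Scalar add: one instruction, one result. */
    double scalar_add(double r1, double r2) { return r1 + r2; }

    /* Vector add: one instruction, N independent element operations. */
    void vector_add(const double *v1, const double *v2, double *v3, size_t vl) {
        for (size_t i = 0; i < vl; i++)   /* no loop-carried dependences */
            v3[i] = v1[i] + v2[i];
    }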

Vector Supercomputers
Epitomized by the Cray-1, 1976: Scalar Unit + Vector Extensions
• Load/Store Architecture
• Vector Registers
• Vector Instructions
• Hardwired Control
• Highly Pipelined Functional Units
• Interleaved Memory System
• No Data Caches
• No Virtual Memory

Cray-1 (1976)
• Memory bank cycle: 50 ns
• Processor cycle: 12.5 ns (80 MHz)

Cray-1 (1976)
(datapath diagram; recoverable details below)
• Eight 64-element vector registers (V0-V7), with Vector Mask and Vector Length registers
• Eight scalar registers (S0-S7) backed by 64 T registers; eight address registers (A0-A7) backed by 64 B registers
• Functional units: FP Add, FP Multiply, FP Reciprocal, Integer Add, Integer Logic, Integer Shift, Population Count, Address Add/Multiply
• Single-port memory: 16 banks of 64-bit words + 8-bit SECDED
  - 80 MW/sec data load/store
  - 320 MW/sec instruction-buffer refill
• 4 instruction buffers (16 x 64-bit words each)

Vector Programming Model
• Scalar registers r0-r15 and vector registers v0-v15
• Vector Length Register VLR (up to VLRMAX elements per register)
• Vector arithmetic instructions operate elementwise over the first VLR elements:

    ADDV v3, v1, v2     # v3[i] = v1[i] + v2[i], for i = 0 .. VLR-1

• Vector load and store instructions move VLR elements between a vector register and memory:

    LV v1, r1, r2       # base address in r1, stride in r2

Vector Instructions

  Instr.  Operands       Operation                      Comment
  ADDV    V1, V2, V3     V1 = V2 + V3                   vector + vector
  ADDSV   V1, F0, V2     V1 = F0 + V2                   scalar + vector
  MULTV   V1, V2, V3     V1 = V2 x V3                   vector x vector
  MULSV   V1, F0, V2     V1 = F0 x V2                   scalar x vector
  LV      V1, R1         V1 = M[R1 .. R1+63]            load, stride = 1
  LVWS    V1, (R1, R2)   V1 = M[R1 .. R1+63*R2]         load, stride = R2
  LVI     V1, R1, V2     V1 = M[R1+V2(i), i = 0..63]    load, indexed
  CeqV    VM, V1, V2     VMASK(i) = (V1(i) == V2(i))?   compare, set mask
  MOV     VLR, R1        Vec.Len.Reg. = R1              set vector length
  MOV     VM, R1         Vec.Mask = R1                  set vector mask
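
The load variants in the table differ only in how the address of element i is formed. A C sketch of their semantics, assuming VL = 64 and a word-addressed memory M:

    #include <stddef.h>

    /* LV: unit stride. */
    void lv(long *V1, const long *M, size_t R1) {
        for (size_t i = 0; i < 64; i++)
            V1[i] = M[R1 + i];
    }

    /* LVWS: constant stride R2. */
    void lvws(long *V1, const long *M, size_t R1, size_t R2) {
        for (size_t i = 0; i < 64; i++)
            V1[i] = M[R1 + i * R2];
    }

    /* LVI: indexed (gather) through index vector V2. */
    void lvi(long *V1, const long *M, size_t R1, const long *V2) {
        for (size_t i = 0; i < 64; i++)
            V1[i] = M[R1 + (size_t)V2[i]];
    }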

Properties of Vector Processors
• Each result is independent of previous results
  - Long pipeline; compiler ensures no dependences
  - High clock rate
• Vector instructions access memory with a known pattern
  - Highly interleaved memory
  - Memory latency is amortized over 64 elements
  - No (data) caches required! (Do use instruction cache)
• Reduces branches and branch problems in pipelines
• A single vector instruction implies lots of work (≈ a loop)
  - Fewer instruction fetches

Vector Code Example

  # C code
  for (i=0; i<64; i++)
      C[i] = A[i] + B[i];

  # Scalar Code
        LI    R4, 64
  loop: L.D   F0, 0(R1)
        L.D   F2, 0(R2)
        ADD.D F4, F2, F0
        S.D   F4, 0(R3)
        ADDIU R1, 8
        ADDIU R2, 8
        ADDIU R3, 8
        SUBIU R4, 1
        BNEZ  R4, loop

  # Vector Code
        LI    VLR, 64
        LV    V1, R1
        LV    V2, R2
        ADDV  V3, V1, V2
        SV    V3, R3

Vector Instruction Set Advantages
• Compact
  - One short instruction encodes N operations
• Expressive: tells hardware that these N operations
  - Are independent
  - Use the same functional unit
  - Access disjoint registers
  - Access registers in the same pattern as previous instructions
  - Access a contiguous block of memory (unit-stride load/store)
  - Access memory in a known pattern (strided load/store)
• Scalable
  - Can run the same object code on more parallel pipelines, or lanes

Components of a Vector Processor
• Vector register file
  - Has at least 2 read and 1 write ports
  - Typically 8-32 vector registers
  - Each holding 64 (or more) 64-bit elements
• Vector functional units (FUs)
  - Fully pipelined; start a new operation every clock
  - Typically 4 to 8 FUs: FP add, FP multiply, FP reciprocal,
    integer add, logical, shift (possibly multiples of the same unit)
• Vector load-store units (LSUs)
  - Fully pipelined unit to load or store a vector
  - May have multiple LSUs
• Scalar registers
  - Single element for an FP scalar or an address
• Crossbar to connect FUs, LSUs, and registers

Examples of Vector Machines

  Machine      Year   Clock     Regs    Elements   FUs   LSUs
  Cray-1       1976    80 MHz   8       64         6     1
  Cray X-MP    1983   120 MHz   8       64         8     2 L, 1 S
  Cray Y-MP    1988   166 MHz   8       64         8     2 L, 1 S
  Cray C90     1991   240 MHz   8       128        8     4
  Cray T90     1996   455 MHz   8       128        8     4
  Conv. C1     1984    10 MHz   8       128        4     1
  Conv. C4     1994   133 MHz   16      128        3     1
  Fuj. VP200   1982   133 MHz   8-256   32-1024    3     2
  Fuj. VP300   1996   100 MHz   8-256   32-1024    3     2
  NEC SX/2     1984   160 MHz   8+8K    256+var    16    8
  NEC SX/3     1995   400 MHz   8+8K    256+var    16    8

Vector Arithmetic Execution
• Use a deep pipeline (=> fast clock) to execute element operations
• Control of the deep pipeline is simple because the elements in a vector are independent
  - No hazards!
• Example: a six-stage multiply pipeline computing V3 <- V1 * V2

Vector Memory System
• Cray-1: 16 banks
  - 4-cycle bank busy time
    - Bank busy time: cycles between accesses to the same bank
  - 12-cycle latency
• An address generator takes a base and a stride and spreads the element accesses across the memory banks (0 .. F)

Interleaved Memory Layout
• Consecutive addresses (Addr+0, Addr+1, ..., Addr+7) are spread across 8 unpipelined DRAM banks
• Great for unit stride:
  - Contiguous elements live in different DRAMs
  - Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  - The above is good for strides that are relatively prime to 8
  - Bad for strides = 2, 4 and worse for strides = multiple of 8
  - Better: a prime number of banks...!
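
A small C experiment (assuming 8 word-interleaved banks) that shows how strided streams spread, or fail to spread, across banks:

    #include <stdio.h>

    #define NBANKS 8   /* assumption: 8 banks, word-interleaved */

    /* Count how many accesses of a strided stream land on each bank. */
    void bank_histogram(int stride, int n) {
        int hits[NBANKS] = {0};
        for (int i = 0; i < n; i++)
            hits[(i * stride) % NBANKS]++;   /* bank = word address mod #banks */
        printf("stride %d:", stride);
        for (int b = 0; b < NBANKS; b++) printf(" %3d", hits[b]);
        printf("\n");
    }

    int main(void) {
        bank_histogram(1, 64);  /* unit stride: 8 accesses per bank */
        bank_histogram(3, 64);  /* relatively prime to 8: still even */
        bank_histogram(8, 64);  /* multiple of 8: all 64 hit one bank */
        return 0;
    }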

Vector Instruction Execution (ADDV C, A, B)
• Execution using one pipelined functional unit: one element operation enters the pipeline per cycle (e.g., A[3..6]+B[3..6] in flight while C[0..2] are complete)
• Execution using four pipelined functional units: four element operations enter per cycle, with element i handled by unit i mod 4 (e.g., A[16..27]+B[16..27] in flight while C[0..11] are complete)

Vector Unit Structure
• Vector register elements are striped across lanes: lane 0 holds elements 0, 4, 8, ...; lane 1 holds elements 1, 5, 9, ...; lane 2 holds elements 2, 6, 10, ...; lane 3 holds elements 3, 7, 11, ...
• Each lane contains its slice of the vector register file and one pipeline of each functional unit, with its own port to the memory subsystem

Vector Unit Implementation
• Vector register file
  - Each register is an array of elements
  - The size of each register determines the maximum vector length
  - The vector length register determines the vector length for a particular operation
• Multiple parallel execution units = "lanes"
  - Sometimes called "pipelines" or "pipes"

T0 Vector Microprocessor (1995)
• See http://www.icsi.berkeley.edu/real/spert/t0-intro.html
• Vector register elements are striped over 8 lanes: lane 0 holds elements [0], [8], [16], [24], ...; lane 7 holds elements [7], [15], [23], [31], ...

Automatic Code Vectorization

    for (i=0; i < N; i++)
        C[i] = A[i] + B[i];

• Scalar sequential code: each iteration performs load, load, add, store in order, and iteration 2 follows iteration 1 in time
• Vectorized code: the loads, the add, and the store of all iterations are each folded into one vector instruction
• Vectorization is a massive compile-time reordering of operation sequencing; it requires extensive loop-dependence analysis

Vector Stripmining
• Problem: vector registers have fixed length
• What to do if Vector Length > Maximum Vector Length (MVL)?
• Stripmining: generate code such that each vector operation is done for a size <= MVL
  - First loop iteration: do the short piece (n mod MVL)
  - Remaining iterations: VL = MVL

    index = 0;            /* start at index 0 */
    VL = n mod MVL;       /* find the odd-size piece */
    while (n > 0) {
        /* do vector instructions on VL elements */
        n = n - VL;
        index = index + VL;
        VL = MVL;         /* reset the length to max */
    }
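
A runnable C version of the same scheme, with the "vector instruction" modeled as a chunk loop (MVL and the helper name are illustrative); note the guard for the case where n is an exact multiple of MVL:

    #include <stddef.h>

    #define MVL 64   /* assumed maximum vector length */

    /* One "vector instruction" on vl <= MVL elements. */
    static void vadd(const double *a, const double *b, double *c, size_t vl) {
        for (size_t i = 0; i < vl; i++)
            c[i] = a[i] + b[i];
    }

    /* Stripmined C[i] = A[i] + B[i] for arbitrary n. */
    void stripmine_add(const double *A, const double *B, double *C, size_t n) {
        size_t index = 0;
        size_t vl = n % MVL;      /* odd-size first piece */
        if (vl == 0) vl = MVL;    /* n a multiple of MVL: start with a full strip */
        while (n > 0) {
            vadd(A + index, B + index, C + index, vl);
            n -= vl;
            index += vl;
            vl = MVL;             /* full strips afterwards */
        }
    }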

Vector Stripmining Example

    for (i=0; i<N; i++)
        C[i] = A[i] + B[i];

          ANDI R1, RN, 63    # N mod 64
          MOV  VLR, R1       # Do remainder
    loop: LV   V1, RA
          SLL  R2, R1, 3     # Multiply by 8
          ADDU RA, RA, R2    # Advance pointer
          LV   V2, RB
          ADDU RB, RB, R2
          ADDV V3, V1, V2
          SV   V3, RC
          ADDU RC, RC, R2
          SUBU RN, RN, R1    # Subtract elements done
          LI   R1, 64        # 64 elements per full strip
          MOV  VLR, R1       # Reset full length
          BGTZ RN, loop      # Any more to do?

Vector Chaining
• Vector version of register bypassing
  - Introduced with the Cray-1

    LV   v1, r1
    MULV v3, v1, v2
    ADDV v5, v3, v4

• Results flow element by element from the load unit into the multiplier, and from the multiplier into the adder (the "chains")

Vector Chaining Advantage
• Without chaining, must wait for the last element of the result to be written before starting a dependent instruction (Load, then Mul, then Add execute back to back)
• With chaining, a dependent instruction can start as soon as the first result element appears (Load, Mul, and Add overlap in time)

Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
  - Example: 32 elements per vector register and 8 lanes
  - The load unit, multiply unit, and add unit each work on a different vector instruction at the same time
• Complete 24 operations/cycle while issuing 1 short instruction/cycle

Vector Execution Time
• Vector execution time depends on:
  - Vector length, data dependences, and structural hazards
• Initiation rate
  - Rate at which a vector unit consumes vector elements
  - Typically, initiation rate = number of lanes
  - Execution time of a vector instruction = VL / initiation rate
• Convoy
  - Set of vector instructions that can execute in the same clock
  - No structural or data hazards (similar to the VLIW concept)
• Chime
  - Execution time of one convoy
  - m convoys take m chimes = approximately m x n cycles
    - If each chime takes n cycles and convoys do not overlap

Example on Convoys and Chimes

    LV    V1, Rx         ; Load vector X
    MULVS V2, V1, F0     ; Vector-scalar multiply
    LV    V3, Ry         ; Load vector Y
    ADDV  V4, V2, V3     ; Add vectors
    SV    Ry, V4         ; Store result in vector Y

• 4 convoys => 4 chimes
  1. LV
  2. MULVS, LV
  3. ADDV
  4. SV
• Suppose VL = 64
  - For 1 lane:  chime = 64 cycles
  - For 2 lanes: chime = 32 cycles
  - For 4 lanes: chime = 16 cycles

Vector Startup
• Vector startup comes from pipeline latency
• Important source of overhead, ignored so far
• Startup time = depth of the pipeline
• Increases the effective time to execute a convoy
• Time to complete a convoy depends on:
  - Vector startup, vector length, and number of lanes

  Operation            Start-up penalty (from Cray-1)
  Vector load/store    12 cycles
  Vector multiply       7 cycles
  Vector add            6 cycles

• The startup penalty for load/store can be very high (100 cycles)
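
The m-convoy timing model can be written down directly. A small C sketch using the Cray-1 penalties above; it assumes each convoy's startup is that of its slowest unit, which matches the example on the next slide:

    #include <stdio.h>

    /* Total cycles for a sequence of non-overlapping convoys on 1 lane:
       each convoy takes (startup of its slowest unit) + n cycles.      */
    int convoy_cycles(const int *startup, int m, int n) {
        int total = 0;
        for (int i = 0; i < m; i++)
            total += startup[i] + n;
        return total;
    }

    int main(void) {
        /* LV | MULVS,LV | ADDV | SV -> startups 12, 12, 6, 12 */
        int startup[4] = {12, 12, 6, 12};
        int n = 64;  /* vector length */
        printf("%d cycles (= 42 + 4n)\n", convoy_cycles(startup, 4, n));
        return 0;
    }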

Example on Vector Startup
• Consider the same example with 4 convoys
• Vector length = n
• Assume convoys do not overlap
• Time of each convoy, assuming 1 lane:

  Convoy         Start time    First result    Last result
  1. LV          0             12              11 + n
  2. MULVS, LV   12 + n        12 + n + 12     23 + 2n
  3. ADDV        24 + 2n       24 + 2n + 6     29 + 3n
  4. SV          30 + 3n       30 + 3n + 12    41 + 4n

• Total cycles = 42 + 4n (with 42 extra startup cycles)

Vector Chaining
• Suppose:

    MULV V1, V2, V3
    ADDV V4, V1, V5    ; RAW dependence

• Chaining: allow a vector operation to start as soon as the individual elements of its vector source operands become available; forward individual elements of a vector register
• Dependent instructions can be placed in the same convoy (if there is no structural hazard)
• Unchained = 2 convoys: (7 + 64) + (6 + 64) = 141 cycles total
  - 128/141 = 0.91 FLOPs/cycle
• Chained = 1 convoy: 7 + 64 + 6 = 77 cycles total
  - 128/77 = 1.66 FLOPs/cycle

Vector Stride
• Adjacent vector elements are not always sequential in memory:

    do 10 i = 1, 100
      do 10 j = 1, 100
        A(i,j) = 0.0
        do 10 k = 1, 100
  10      A(i,j) = A(i,j) + B(i,k) * C(k,j)

• Either the B or the C accesses are not adjacent
  - 800 bytes between adjacent vector elements
• Stride: the distance separating elements that are to be merged into a single vector
  - Caches do unit stride
  - LVWS (load vector with stride) instruction
• Think of the address sequence per vector element
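
The same effect appears in row-major C when walking a column. A minimal sketch (the 100x100 size mirrors the Fortran loop above):

    #include <stddef.h>
    #include <stdio.h>

    #define N 100

    int main(void) {
        static double B[N][N];
        /* Walking a column of a row-major array: consecutive vector
           elements B[k][j] and B[k+1][j] are one full row apart. */
        ptrdiff_t stride = (char *)&B[1][0] - (char *)&B[0][0];
        printf("stride = %td bytes\n", stride);   /* 100 * 8 = 800 bytes */
        return 0;
    }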

Memory Addressing Modes
• Load/store operations move groups of data between registers and memory
• Three types of vector addressing:
  - Unit stride
    - Contiguous block of information in memory
    - Fastest: always possible to optimize this
  - Non-unit (constant) stride
    - Harder to optimize the memory system for all possible strides
    - A prime number of data banks makes it easier to support different strides at full bandwidth
  - Indexed (gather-scatter)
    - Vector equivalent of register indirect
    - Good for sparse arrays of data
    - Increases the number of programs that vectorize

Vector Scatter/Gather
• Want to vectorize loops with indirect accesses:

    for (i=0; i<N; i++)
        A[i] = B[i] + C[D[i]];

• Indexed load instruction (gather):

    LV   vD, rD        # Load D vector (indices)
    LVI  vC, rC, vD    # Load C vector indexed
    LV   vB, rB        # Load B vector
    ADDV vA, vB, vC    # Add vectors
    SV   vA, rA        # Store A vector

Vector Scatter/Gather
• Scatter example:

    for (i=0; i<N; i++)
        A[B[i]]++;

• Vector translation:

    LV   vB, rB        # Load B vector (indices)
    LVI  vA, rA, vB    # Load A vector indexed (gather)
    ADDV vA, vA, 1     # Increment
    SVI  vA, rA, vB    # Store A vector indexed (scatter)

Memory Banks
• Most vector processors support a large number of independent memory banks
• Memory banks are needed for the following reasons:
  - Multiple loads/stores per cycle
  - Memory bank cycle time > CPU cycle time
  - Ability to load/store non-sequential elements
  - Multiple processors sharing the same memory
    - Each processor generates its own stream of load/store instructions

Example on Memory Banks
• The Cray T90 has a CPU cycle of 2.167 ns
• The cycle of the SRAM in the memory system is 15 ns
• The Cray T90 can support 32 processors
• Each processor is capable of generating 4 loads and 2 stores per CPU clock cycle
• What number of memory banks is required to allow all CPUs to run at full memory bandwidth?
• Solution:
  - Maximum number of memory references per cycle:
    32 CPUs x 6 references per cycle = 192
  - Each SRAM bank is busy for 15 / 2.167 = 6.92 ≈ 7 cycles
  - Handling 192 requests per cycle requires 192 x 7 = 1344 memory banks
  - The Cray T932 actually has 1024 memory banks
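
The same calculation as a small C helper (the function name is ours):

    #include <math.h>
    #include <stdio.h>

    /* Banks needed so every CPU can issue at full rate:
       (references per CPU cycle) x (CPU cycles a bank stays busy). */
    int banks_needed(double sram_cycle_ns, double cpu_cycle_ns, int refs_per_cycle) {
        int busy = (int)ceil(sram_cycle_ns / cpu_cycle_ns);  /* 15/2.167 -> 7 */
        return refs_per_cycle * busy;
    }

    int main(void) {
        printf("%d banks\n", banks_needed(15.0, 2.167, 32 * 6));  /* 1344 */
        return 0;
    }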

Vector Conditional Execution
• Problem: want to vectorize loops with conditional code:

    for (i=0; i<N; i++)
        if (A[i] > 0) A[i] = B[i];

• Solution: add vector mask registers
  - Vector version of predicate registers, 1 bit per element
  - A vector operation becomes a NOP at elements where the mask bit is 0
• Code example:

    CVM                # Turn on all bits in Vector Mask
    LV   vA, rA        # Load entire A vector
    SGTV vA, 0         # Set bits in mask register where A > 0
    LV   vA, rB        # Load B vector into A under mask
    SV   vA, rA        # Store A back to memory under mask

Vector Masks
• Vector masks have two important uses:
  - Conditional execution and arithmetic exceptions
• The alternative is conditional move/merge
• Masks are more efficient than conditional moves:
  - No need to perform extra instructions
  - Avoid exceptions
• The downsides are:
  - Extra bits in the instruction to specify the mask register
    - For multiple mask registers
  - Extra interlock early in the pipeline for RAW hazards

Masked Vector Instructions
• Simple implementation
  - Execute all N operations; turn off result writeback according to the mask
  - e.g., with mask M = 1,0,1,1,0,0,1,0 (elements 7..0), all eight adds execute but the write enable allows only C[7], C[5], C[4], C[1] to be written
• Density-time implementation
  - Scan the mask vector and execute only the elements with non-zero mask bits
  - Only the enabled element operations are sent to the write data port

Compress/Expand Operations
• Compress:
  - Packs non-masked elements from one vector register contiguously at the start of the destination vector register
  - The population count of the mask vector gives the packed vector length
  - Used for density-time conditionals and for general selection
• Expand: performs the inverse operation
• Example with mask M = 1,0,1,1,0,0,1,0 (elements 7..0):
  - Compress packs A[7], A[5], A[4], A[1] into the start of the destination
  - Expand scatters the packed elements back to positions 7, 5, 4, 1, leaving the unmasked destination elements unchanged
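
A C sketch of the two operations' semantics (the mask is represented as one byte per element for clarity):

    #include <stddef.h>

    /* Compress: pack elements of src whose mask bit is 1 into dst[0..].
       Returns the packed length (= population count of the mask). */
    size_t compress(const double *src, const unsigned char *mask,
                    double *dst, size_t vl) {
        size_t k = 0;
        for (size_t i = 0; i < vl; i++)
            if (mask[i]) dst[k++] = src[i];
        return k;
    }

    /* Expand: inverse operation; unpack packed[0..] back into the masked
       positions of dst, leaving unmasked positions untouched. */
    void expand(const double *packed, const unsigned char *mask,
                double *dst, size_t vl) {
        size_t k = 0;
        for (size_t i = 0; i < vl; i++)
            if (mask[i]) dst[i] = packed[k++];
    }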

Vector Reductions
• Problem: loop-carried dependence on reduction variables:

    sum = 0;
    for (i=0; i<N; i++)
        sum += A[i];                 /* loop-carried dependence on sum */

• Solution: use a binary tree to perform the reduction:

    /* Rearrange as: */
    sum[0:VL-1] = 0;                 /* vector of VL partial sums */
    for (i=0; i<N; i+=VL)            /* stripmine VL-sized chunks */
        sum[0:VL-1] += A[i:i+VL-1];  /* vector sum */
    /* Now have VL partial sums in one vector register */
    do {
        VL = VL / 2;                 /* halve vector length */
        sum[0:VL-1] += sum[VL:2*VL-1];
    } while (VL > 1);
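
The full reduction as runnable C, with the vector operations modeled as loops (VLMAX is an assumed register length; n is taken to be a multiple of VLMAX to keep the sketch short):

    #include <stddef.h>

    #define VLMAX 64   /* assumed vector register length */

    /* Tree reduction: VLMAX partial sums, then log2(VLMAX) halving steps. */
    double vector_sum(const double *A, size_t n) {
        double sum[VLMAX] = {0.0};
        for (size_t i = 0; i < n; i += VLMAX)          /* stripmined vector adds */
            for (size_t j = 0; j < VLMAX; j++)
                sum[j] += A[i + j];
        for (size_t vl = VLMAX / 2; vl >= 1; vl /= 2)  /* binary-tree combine */
            for (size_t j = 0; j < vl; j++)
                sum[j] += sum[j + vl];
        return sum[0];
    }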

New Architecture Direction?
• "...media processing will become the dominant force in computer architecture & microprocessor design."
• "...new media-rich applications... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, and 32-bit integer and FP"
• Needs include high memory BW, high network BW, continuous-media data types, real-time response, and fine-grain parallelism
  - "How Multimedia Workloads Will Change Processor Design", Diefendorff & Dubey, IEEE Computer (9/97)

SIMD Extensions
• Media applications operate on data types narrower than the native word size
  - Example: disconnect carry chains to "partition" the adder
• Limitations, compared to vector instructions:
  - Number of data operands is encoded into the opcode
  - No sophisticated addressing modes
    - No strided, no scatter-gather memory access
  - No mask registers
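
A hedged C sketch of the "partitioned adder" idea: four independent 16-bit adds inside one 64-bit operation, with cross-lane carries suppressed in software, where real SIMD hardware simply breaks the carry chain:

    #include <stdint.h>
    #include <stdio.h>

    /* Four 16-bit lane-wise adds in one 64-bit word: add the low 15 bits
       of each lane with the lane MSBs masked off (so no carry can cross a
       lane boundary), then XOR the original MSBs back in. */
    uint64_t add4x16(uint64_t a, uint64_t b) {
        const uint64_t H = 0x8000800080008000ULL;   /* MSB of each 16-bit lane */
        return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
    }

    int main(void) {
        /* lanes: 0x0001+0x00ff, 0x7fff+0x0001, 0xffff+0x0001, 0x1234+0x1111 */
        uint64_t a = 0x00017fffffff1234ULL;
        uint64_t b = 0x00ff000100011111ULL;
        printf("%016llx\n", (unsigned long long)add4x16(a, b));
        /* prints 0100800000002345: no carry leaks between lanes */
        return 0;
    }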

SIMD Implementations
• Intel MMX (1996)
  - Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  - Eight 16-bit integer ops
  - Four 32-bit integer/FP ops or two 64-bit integer/FP ops
• Advanced Vector Extensions (2010)
  - Four 64-bit integer/FP ops
• Operands must be in consecutive and aligned memory locations

Example SIMD Code
• Example DAXPY:

          L.D    F0, a         ; load scalar a
          MOV    F1, F0        ; copy a into F1 for SIMD MUL
          MOV    F2, F0        ; copy a into F2 for SIMD MUL
          MOV    F3, F0        ; copy a into F3 for SIMD MUL
          DADDIU R4, Rx, 512   ; last address to load
    Loop: L.4D   F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D F4, F4, F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
          L.4D   F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D F8, F8, F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
          S.4D   0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU Rx, Rx, 32    ; increment index to X
          DADDIU Ry, Ry, 32    ; increment index to Y
          DSUBU  R20, R4, Rx   ; compute bound
          BNEZ   R20, Loop     ; check if done

Roofline Performance Model
• Basic idea:
  - Plot peak floating-point throughput as a function of arithmetic intensity
  - Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  - Floating-point operations per byte read

Examples
• Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Perf.)
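
As a C one-liner plus a sweep over arithmetic intensities (the 16 GB/s and 64 GFLOP/s roofs are made-up numbers for illustration):

    #include <stdio.h>

    /* Roofline: performance is capped by either the memory roof
       (BW x intensity) or the compute roof (peak FLOP rate). */
    double attainable_gflops(double peak_bw_gb_s, double intensity_flops_per_byte,
                             double peak_gflops) {
        double memory_roof = peak_bw_gb_s * intensity_flops_per_byte;
        return memory_roof < peak_gflops ? memory_roof : peak_gflops;
    }

    int main(void) {
        for (double ai = 0.5; ai <= 8.0; ai *= 2)
            printf("AI %.1f -> %.1f GFLOPs/sec\n",
                   ai, attainable_gflops(16.0, ai, 64.0));
        return 0;
    }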

GPU Architectures
• Processing is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
• Trend toward general-purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
• Programming languages/APIs
  - OpenGL
  - Compute Unified Device Architecture (CUDA)

NVIDIA GPU Architecture
• Similarities to vector machines:
  - Works well with data-level parallel problems
  - Scatter-gather transfers
  - Mask registers
  - Large register files
• Differences:
  - No scalar processor
  - Uses multithreading to hide memory latency
  - Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Threads and Blocks
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
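
A plain-C sketch of how the grid/block/thread hierarchy maps elements to threads for DAXPY; blockDim = 512 is an assumption (it matches the PTX example later), and on a real GPU the two loops run concurrently in hardware rather than sequentially:

    #include <stddef.h>

    void grid_daxpy(size_t n, double a, const double *x, double *y) {
        const size_t blockDim = 512;                          /* threads per block (assumed) */
        const size_t gridDim = (n + blockDim - 1) / blockDim; /* blocks in the grid */
        for (size_t blockIdx = 0; blockIdx < gridDim; blockIdx++)
            for (size_t threadIdx = 0; threadIdx < blockDim; threadIdx++) {
                size_t i = blockIdx * blockDim + threadIdx;   /* one thread per element */
                if (i < n)                                    /* guard the last block */
                    y[i] = a * x[i] + y[i];
            }
    }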

Example: NVIDIA Fermi
• The NVIDIA GPU has 32,768 registers
  - Divided into lanes
  - Each thread is limited to 64 registers
  - Each thread has up to:
    - 64 registers of 32 32-bit elements
    - 32 registers of 32 64-bit elements
  - Fermi has 16 physical lanes, each containing 2,048 registers

Fermi Streaming Multiprocessor
(figure: block diagram of the Fermi streaming multiprocessor)

Fermi Architecture Innovations
• Each streaming multiprocessor has:
  - Two SIMD thread schedulers and two instruction dispatch units
  - 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, and 4 special function units
  - Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error-correcting codes
• Faster context switching
• Faster atomic instructions

NVIDIA Instruction Set Architecture
• The ISA is an abstraction of the hardware instruction set
  - "Parallel Thread Execution (PTX)"
  - Uses virtual registers
  - Translation to machine code is performed in software
  - Example (DAXPY body):

    shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512)
    add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  - Branch synchronization stack
    - Entries consist of masks for each SIMD lane
    - i.e., which threads commit their results (all threads execute)
  - Instruction markers to manage when a branch diverges into multiple execution paths
    - Push on divergent branch
  - ...and when paths converge
    - Act as barriers
    - Pop the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

           ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
           setp.neq.s32  P1, RD0, #0     ; P1 is predicate register 1
           @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                         ; if P1 false, go to ELSE1
           ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
           sub.f64       RD0, RD0, RD2   ; Difference in RD0
           st.global.f64 [X+R8], RD0     ; X[i] = RD0
           @P1, bra ENDIF1, *Comp        ; complement mask bits
                                         ; if P1 true, go to ENDIF1
    ELSE1: ld.global.f64 RD0, [Z+R8]     ; RD0 = Z[i]
           st.global.f64 [X+R8], RD0     ; X[i] = RD0
    ENDIF1: <next instruction>, *Pop     ; pop to restore old mask

NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  - "Private memory"
  - Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  - Shared by the SIMD lanes/threads within a block
• Memory shared by all SIMD processors is GPU memory
  - The host can read and write GPU memory

Summary
• Vector is a model for exploiting data parallelism
• If code is vectorizable, the hardware is simpler, more energy-efficient, and offers a better real-time model than out-of-order machines
• Design issues include the number of lanes, number of functional units, number of vector registers, length of vector registers, exception handling, and conditional operations
• The fundamental design issue is memory bandwidth
  - Especially with virtual address translation and caching