GPU Programming with CUDA


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498 AL, University of Illinois, Urbana-Champaign

What is GPGPU?
• General Purpose computation using the GPU and graphics API in applications other than 3D graphics
  – GPU accelerates the critical path of the application
• Data parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation
• Applications – see GPGPU.org
  – Game effects (FX) physics, image processing
  – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

CUDA
• "Compute Unified Device Architecture"
• General purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
  – Compute oriented drivers, language, and tools
• Driver for loading computation programs into the GPU
  – Standalone driver - optimized for computation
  – Interface designed for compute – graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management

Parallel Computing on a GPU
• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
  – Available in laptops, desktops, and clusters
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism

(Pictured: GeForce 8800, Tesla D870, Tesla S870.)

CUDA – C with no shader limitations!
• Integrated host+device app C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

  Serial Code (host)
  Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);
  Serial Code (host)
  Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);
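As a rough illustration of this alternating structure, here is a minimal sketch; the kernels KernelA and KernelB and the launch parameters nBlk and nTid are illustrative placeholders, not defined by the slides:

  #include <cuda_runtime.h>

  // Illustrative kernels standing in for the "highly parallel parts".
  __global__ void KernelA(float *data) { /* highly parallel work */ }
  __global__ void KernelB(float *data) { /* highly parallel work */ }

  int main(void) {
      const int nBlk = 100, nTid = 256;      // illustrative launch configuration
      float *d_data = 0;
      cudaMalloc((void**)&d_data, nBlk * nTid * sizeof(float));

      // ... serial host code ...
      KernelA<<<nBlk, nTid>>>(d_data);       // parallel kernel on the device
      // ... serial host code ...
      KernelB<<<nBlk, nTid>>>(d_data);       // parallel kernel on the device

      cudaFree(d_data);
      return 0;
  }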

CUDA Devices and Threads
• A compute device
  – Is a coprocessor to the CPU or host
  – Has its own DRAM (device memory)
  – Runs many threads in parallel
  – Is typically a GPU but can also be another type of parallel processing device
• Data-parallel portions of an application are expressed as device kernels, which run on many threads
• Differences between GPU and CPU threads
  – GPU threads are extremely lightweight
    • Very little creation overhead
  – GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few

G80 CUDA mode – A Device Example
• Processors execute computing threads
• New operating mode/HW interface for computing

(Block diagram: the Host feeds an Input Assembler and Thread Execution Manager; arrays of processors with Parallel Data Caches and Texture units connect through Load/Store paths to Global Memory.)

Extended C
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

  __device__ float filter[N];

  __global__ void convolve (float *image) {
      __shared__ float region[M];
      ...
      region[threadIdx] = image[i];
      __syncthreads();
      ...
      image[j] = result;
  }

  // Allocate GPU memory
  void *myimage = cudaMalloc(bytes);

  // 100 blocks, 10 threads per block
  convolve<<<100, 10>>> (myimage);

Extended C
• Integrated source (foo.cu) is processed by cudacc (EDG C/C++ frontend, Open64 Global Optimizer) into:
  – GPU assembly (foo.s), compiled by OCG into G80 SASS (foo.sass)
  – CPU host code (foo.cpp), compiled by gcc / cl

Mark Murphy, "NVIDIA's Experience with Open64," www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

CUDA API Highlights: Easy and Lightweight
• The API is an extension to the ANSI C programming language
  – Low learning curve
• The hardware is designed to enable lightweight runtime and driver
  – High performance

CUDA Thread Block
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares block:
  – Block size 1 to 512 concurrent threads
  – Block shape 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within the block (thread id #: 0, 1, 2, ..., m)
  – Thread program uses thread id to select work and address shared data
• Threads in the same block share data and synchronize while doing their share of the work
• Threads in different blocks cannot cooperate
  – Each block can execute in any order relative to other blocks!

Courtesy: John Nickolls, NVIDIA
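A minimal sketch of a thread program that uses its thread id to select work, assuming a one-dimensional block; the names ScaleBlock, input, output, and scale are illustrative, not from the slides:

  // Each thread uses its id within the block to pick one element.
  __global__ void ScaleBlock(const float *input, float *output, float scale) {
      int tid = threadIdx.x;            // thread id within the block
      output[tid] = scale * input[tid]; // per-thread work selected by thread id
  }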

Thread Blocks: Scalable Cooperation
• Divide a monolithic thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  – Threads in different blocks cannot cooperate

(Diagram: Thread Block 0 through Thread Block N-1, each with threadID 0..7 running the same program:
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y;)

Transparent Scalability
• Hardware is free to assign blocks to any processor at any time
  – A kernel scales across any number of parallel processors

(Diagram: the same kernel grid of Blocks 0-7 executes over time as two blocks at a time on a smaller device, or four blocks at a time on a larger device.)

Each block can execute in any order relative to other blocks.

Block IDs and Thread IDs
• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – ...
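A hedged sketch of the usual 2D indexing pattern for image processing, assuming a row-major image; the kernel name and parameters are illustrative:

  // Combine 2D block and thread IDs into pixel coordinates.
  __global__ void Brighten(float *image, int width, int height, float offset) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
      int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
      if (x < width && y < height)
          image[y * width + x] += offset;              // one pixel per thread
  }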

G80 Example: Executing Thread Blocks
• Threads are assigned to Streaming Multiprocessors (SMs) in block granularity
  – Up to 8 blocks to each SM as resource allows
  – An SM in G80 can take up to 768 threads
    • Could be 256 (threads/block) * 3 blocks
    • Or 128 (threads/block) * 6 blocks, etc.
  – Flexible resource allocation
• Threads run concurrently
  – SM maintains thread/block id #s
  – SM manages/schedules thread execution

(Diagram: SM 0 and SM 1, each with MT IU, SP units, and Shared Memory, holding blocks of threads t0, t1, t2, ..., tm.)

G80 Example: Thread Scheduling
• Each Block is executed as 32-thread Warps
  – An implementation decision, not part of the CUDA programming model
  – Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in the SM?
  – Each Block is divided into 256/32 = 8 Warps
  – There are 8 * 3 = 24 Warps

(Diagram: Block 1 and Block 2 Warps (t0 t1 t2 ... t31) inside a Streaming Multiprocessor with Instruction L1, Instruction Fetch/Dispatch, Shared Memory, SP and SFU units.)
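The warp arithmetic above, written out as a small host-side sketch with purely illustrative constants:

  #include <stdio.h>

  int main(void) {
      const int warpSize        = 32;    // G80 warp size
      const int threadsPerBlock = 256;
      const int blocksPerSM     = 3;

      int warpsPerBlock = threadsPerBlock / warpSize;   // 256/32 = 8
      int warpsPerSM    = warpsPerBlock * blocksPerSM;  // 8*3 = 24
      printf("%d warps per block, %d warps per SM\n", warpsPerBlock, warpsPerSM);
      return 0;
  }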

G80 Example: Thread Scheduling
• SM implements zero-overhead warp scheduling
  – At any time, only one of the warps is executed by an SM
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible Warps are selected for execution on a prioritized scheduling policy
  – All threads in a warp execute the same instruction when selected

Terminology
• Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
  – The unit of parallelism in CUDA
• Warp: a group of threads executed physically in parallel in G80
• Block: a group of threads that are executed together and form the unit of resource assignment
• Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect

Memories

CUDA Memory Model Overview
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
  – Long latency access
• We will focus on global memory for now
  – Constant and texture memory will come later

(Diagram: a Grid with Block (0,0) and Block (1,0), each containing Shared Memory, per-thread Registers, and Threads (0,0) and (1,0); the Host and all blocks connect to Global Memory.)

CUDA Device Memory Allocation
• cudaMalloc()
  – Allocates an object in device Global Memory
  – Requires two parameters
    • Address of a pointer to the allocated object
    • Size of the allocated object
• cudaFree()
  – Frees an object from device Global Memory
    • Pointer to the freed object

CUDA Device Memory Allocation (cont.)
• Code example:
  – Allocate a 64 * 64 single precision float array
  – Attach the allocated storage to Md
  – "d" is often used to indicate a device data structure

  TILE_WIDTH = 64;
  float* Md;
  int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

  cudaMalloc((void**)&Md, size);
  ...
  cudaFree(Md);

CUDA Host-Device Data Transfer
• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous transfer

CUDA Host-Device Data Transfer (cont.)
• Code example:
  – Transfer a 64 * 64 single precision float array
  – M is in host memory and Md is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
  cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
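Putting the allocation and transfer calls together, a hedged end-to-end sketch; the 64 * 64 size follows the example above, and error checking is omitted for brevity:

  #include <cuda_runtime.h>
  #include <stdlib.h>

  #define TILE_WIDTH 64

  int main(void) {
      int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

      // Host array M and device array Md, as in the examples above.
      float *M  = (float*)malloc(size);
      float *Md = NULL;
      cudaMalloc((void**)&Md, size);

      cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host -> device
      // ... launch a kernel that reads/writes Md here ...
      cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // device -> host

      cudaFree(Md);
      free(M);
      return 0;
  }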

CUDA Function Declarations

                                     Executed on the:   Only callable from the:
  __device__ float DeviceFunc()      device             device
  __global__ void  KernelFunc()      device             host
  __host__   float HostFunc()        host               host

• __global__ defines a kernel function
  – Must return void
• __device__ and __host__ can be used together
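A short sketch of the three qualifiers in use; the function names are illustrative, not from the slides:

  __device__ float DeviceSquare(float x) { return x * x; }    // device-only helper

  __host__ __device__ float Clamp01(float x) {                // compiled for both host and device
      return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
  }

  __global__ void SquareKernel(float *data, int n) {          // kernel: launched from the host
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] = Clamp01(DeviceSquare(data[i]));
  }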

CUDA Function Declarations (cont.)
• __device__ functions cannot have their address taken
• For functions executed on the device:
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments

Calling a Kernel Function – Thread Creation
• A kernel function must be called with an execution configuration:

  __global__ void KernelFunc(...);
  dim3 DimGrid(100, 50);              // 5000 thread blocks
  dim3 DimBlock(4, 8, 8);             // 256 threads per block
  size_t SharedMemBytes = 64;         // 64 bytes of shared memory
  KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking
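Because the launch returns immediately, here is a hedged sketch of blocking the host until the kernel finishes; cudaThreadSynchronize() was the synchronization call of this CUDA generation, and the kernel name and data pointer are illustrative:

  __global__ void MyKernel(float *data) { /* ... */ }   // illustrative kernel

  void LaunchAndWait(float *d_data) {
      dim3 DimGrid(100, 50);
      dim3 DimBlock(4, 8, 8);
      MyKernel<<<DimGrid, DimBlock>>>(d_data);   // returns immediately (asynchronous)
      cudaThreadSynchronize();                   // host blocks here until MyKernel completes
  }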

G80 Implementation of CUDA Memories
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read only per-grid constant memory

(Diagram: a Grid with Blocks (0,0) and (1,0), each with Shared Memory, Registers, and Threads; the Host and all blocks connect to Global Memory and Constant Memory.)

CUDA Variable Type Qualifiers

  Variable declaration                          Memory     Scope    Lifetime
  __device__ __local__    int LocalVar;         local      thread   thread
  __device__ __shared__   int SharedVar;        shared     block    block
  __device__              int GlobalVar;        global     grid     application
  __device__ __constant__ int ConstantVar;      constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__
• Automatic variables without any qualifier reside in a register
  – Except arrays, which reside in local memory
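A hedged sketch showing where each qualifier typically appears in source code; all names are illustrative:

  // Illustrative placement of the variable qualifiers; names are not from the slides.
  __constant__ float coeff[16];         // constant memory: per-grid, read-only in kernels
  __device__   int   globalCounter;     // global memory: per-grid, application lifetime

  __global__ void Example(float *out) {
      __shared__ float tile[256];       // shared memory: per-block, block lifetime
      float tmp = coeff[threadIdx.x % 16];   // automatic variable: lives in a register
      tile[threadIdx.x] = tmp;
      __syncthreads();
      out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
  }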

Where to Declare Variables?
• Can the host access it?
  – Yes: global or constant memory; declare outside of any function
  – No: register (automatic), shared, or local; declare in the kernel

Variable Type Restrictions
• Pointers can only point to memory allocated or declared in global memory:
  – Allocated in the host and passed to the kernel:
    __global__ void KernelFunc(float* ptr)
  – Obtained as the address of a global variable:
    float* ptr = &GlobalVar;

A Common Programming Strategy
• Global memory resides in device memory (DRAM) - much slower access than shared memory
• So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory (sketched below):
  – Partition data into subsets that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    • Copying results from shared memory to global memory
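A minimal sketch of that load / sync / compute / store pattern, assuming an illustrative tile size and kernel (this is not the matrix-multiply example the course develops later):

  #define TILE 256   // illustrative tile size; one tile per thread block

  // Each block loads one tile into shared memory, works on it, and writes results back.
  __global__ void TiledScale(const float *in, float *out, float scale, int n) {
      __shared__ float tile[TILE];

      int idx = blockIdx.x * TILE + threadIdx.x;
      if (idx < n)
          tile[threadIdx.x] = in[idx];      // 1. load subset from global to shared memory
      __syncthreads();                      //    wait until the whole tile is loaded

      if (idx < n) {
          float v = tile[threadIdx.x];      // 2. compute on the subset in shared memory
          out[idx] = v * scale;             // 3. copy results back to global memory
      }
  }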

A Common Programming Strategy (cont.)
• Constant memory also resides in device memory (DRAM) - much slower access than shared memory
  – But... cached!
  – Highly efficient access for read-only data
• Carefully divide data according to access patterns
  – R/Only: constant memory (very fast if in cache)
  – R/W shared within Block: shared memory (very fast)
  – R/W within each thread: registers (very fast)
  – R/W inputs/results: global memory (very slow)

For texture memory usage, see the NVIDIA documentation.

GPU Atomic Integer Operations
• Atomic operations on integers in global memory:
  – Associative operations on signed/unsigned ints
  – add, sub, min, max, ...
  – and, or, xor
  – Increment, decrement
  – Exchange, compare and swap
• Requires hardware with compute capability 1.1 and above
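A hedged sketch using atomicAdd() on global memory for a histogram, where many threads may increment the same bin; the names and bin count are illustrative, and compute capability 1.1+ is required as noted above:

  // Many threads may increment the same bin, so the update must be atomic.
  #define NUM_BINS 64

  __global__ void Histogram(const unsigned int *data, int n, unsigned int *bins) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          atomicAdd(&bins[data[i] % NUM_BINS], 1u);   // atomic add in global memory
  }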

SM Register File
• Register File (RF)
  – 32 KB (8K entries) for each SM in G80
• TEX pipe can also read/write the RF
  – 2 SMs share 1 TEX
• Load/Store pipe can also read/write the RF

(Diagram: SM pipeline with I$ L1, Multithreaded Instruction Buffer, RF, C$ L1, Shared Mem, Operand Select, MAD, and SFU units.)

Programmer View of Register File
• There are 8192 registers in each SM in G80
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all blocks assigned to the SM
  – Once assigned to a block, a register is NOT accessible by threads in other blocks
  – Each thread in the same block can only access registers assigned to itself

(Diagram: the register file partitioned across 3 blocks vs. 4 blocks.)

Example
• If each Block has 16 x 16 threads and each thread uses 10 registers, how many threads can run on each SM?
  – Each Block requires 10 * 256 = 2560 registers
  – 8192 = 3 * 2560 + change
  – So, three Blocks can run on an SM as far as registers are concerned
• How about if each thread increases its use of registers by 1?
  – Each Block now requires 11 * 256 = 2816 registers
  – 8192 < 2816 * 3
  – Only two Blocks can run on an SM, a 1/3 reduction of parallelism!

More on Dynamic Partitioning
• Dynamic partitioning gives more flexibility to compilers/programmers
  – One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each
    • This allows for finer grain threading than traditional CPU threading models
  – The compiler can trade off between instruction-level parallelism and thread-level parallelism

ILP vs. TLP Example
• Assume that a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers; global loads take 200 cycles
  – 3 Blocks can run on each SM
• If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
  – Only two Blocks can run on each SM
  – However, one only needs 200/(8*4) = 7 Warps to tolerate the memory latency
  – Two Blocks have 16 Warps, so the performance can actually be higher!

Memory Coalescing
• When accessing global memory, peak performance utilization occurs when all threads in a half-warp access contiguous memory locations (sketched below).

(Diagram: Thread 1 and Thread 2 stepping along rows of Md are not coalesced; stepping down columns of Nd, where adjacent threads touch adjacent addresses, is coalesced.)
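A hedged sketch of the two access patterns being contrasted, assuming row-major matrices of width WIDTH and following the Md/Nd naming of the diagram:

  // Within one iteration k, adjacent threads (tx, tx+1) read adjacent addresses
  // of Nd (coalesced), but addresses WIDTH apart in Md (not coalesced).
  __global__ void AccessPatterns(const float *Md, const float *Nd, float *out, int WIDTH) {
      int tx = blockIdx.x * blockDim.x + threadIdx.x;
      float sum = 0.0f;
      for (int k = 0; k < WIDTH; ++k) {
          float m = Md[tx * WIDTH + k];   // stride-WIDTH across threads: not coalesced
          float n = Nd[k * WIDTH + tx];   // consecutive across threads: coalesced
          sum += m * n;
      }
      out[tx] = sum;
  }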

Parallel Memory Architecture
• In a parallel machine, many threads access memory
  – Therefore, memory is divided into banks
  – Essential to achieve high bandwidth
• Each bank can service one address per cycle
  – A memory can service as many simultaneous accesses as it has banks
• Multiple simultaneous accesses to a bank result in a bank conflict
  – Conflicting accesses are serialized

(Diagram: Banks 0 through 15.)

Bank Addressing Examples
• No Bank Conflicts
  – Linear addressing, stride == 1
• No Bank Conflicts
  – Random 1:1 permutation

(Diagram: Threads 0-15 mapped onto Banks 0-15, either linearly or by a random one-to-one permutation.)

Bank Addressing Examples
• 2-way Bank Conflicts
  – Linear addressing, stride == 2
• 8-way Bank Conflicts
  – Linear addressing, stride == 8

(Diagram: with stride 2, pairs of threads map to the same bank; with stride 8, eight threads map to the same bank.)

How addresses map to banks on G80
• Each bank has a bandwidth of 32 bits per clock cycle
• Successive 32-bit words are assigned to successive banks
• G80 has 16 banks
  – So bank = address % 16
  – Same as the size of a half-warp
    • No bank conflicts between different half-warps, only within a single half-warp

Shared memory bank conflicts
• Shared memory is as fast as registers if there are no bank conflicts
• The fast case:
  – If all threads of a half-warp access different banks, there is no bank conflict
  – If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
• The slow case:
  – Bank Conflict: multiple threads in the same half-warp access the same bank
  – Must serialize the accesses
  – Cost = max # of simultaneous accesses to a single bank
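A hedged, host-side illustration of the bank mapping and the resulting conflict degree for strided accesses by a half-warp, assuming the 16-bank, 32-bit-word layout described above:

  // For a half-warp of 16 threads accessing shared[s * tid] (32-bit words),
  // count how many threads fall into each of the 16 banks. Illustrative only.
  #include <stdio.h>

  int main(void) {
      const int NUM_BANKS = 16, HALF_WARP = 16;
      for (int s = 1; s <= 8; s *= 2) {                 // strides 1, 2, 4, 8
          int hits[16] = {0}, worst = 0;
          for (int tid = 0; tid < HALF_WARP; ++tid) {
              int bank = (s * tid) % NUM_BANKS;         // bank = word address % 16
              if (++hits[bank] > worst) worst = hits[bank];
          }
          printf("stride %d -> %d-way conflict\n", s, worst);  // prints 1, 2, 4, 8
      }
      return 0;
  }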

Linear Addressing
• Given:

  __shared__ float shared[256];
  float foo = shared[baseIndex + s * threadIdx.x];

• This is only bank-conflict-free if s shares no common factors with the number of banks
  – 16 on G80, so s must be odd

(Diagram: thread-to-bank mappings for s = 1 and s = 3; both are conflict-free since s is odd.)