Fast matrix multiplication with CUDA
School of Electrical Engineering and Computer Science, University of Central Florida
Hongliang @ HPCA'08

Overview
• Platform
  – GeForce 8800 GT, 512 MB
  – Core: G92, shader frequency: 1.5 GHz, memory frequency: 900 MHz
• Performance
  – Tuned for 4k x 4k matrices, 192 GFlops
• Revisiting the tiled version
• Using large tiles
  – Base algorithm
  – Optimized algorithm
• Tools and tips
CDA 6938, University of Central Florida

The Tiled version
• Tile size: 16 x 16
• 256 threads / block
• 14 regs, 2076 bytes smem / block
• Occupancy: 2/3
[Figure: threads T0 to T255 of one block]

The Tiled version – Memory access
• Every half warp accesses contiguous memory locations.
• Memory accesses are fully coalesced.
[Figure: threads T0 to T255 of one block]

The Tiled version – Bank conflicts
• No bank conflicts.
[Figure: threads T0 to T255 of one block; labels: "16 banks", "broadcast"]

The Tiled version - Bottlenecks
• If memory bandwidth and ALUs are both fully used:
  – 14.4 G float/s, 168 G MAD/s
  – 11.67 MAD/float
• With 16 x 16 tiles:
  – Total (W^3/8) loads
  – 8 MAD/float
• Too many loads! Solution: large tile.
• “Psub += As[ty][k] * Bs[k][tx]” – Extra instructions.
    mov.b32 $r12, s[$ofs4+0x0000]
    mov.b32 $r7, s[$ofs4+0x0040]
    mad.rn.f32 $r11, s[$ofs1+0x000c], $r11, $r13
    add.b32 $ofs4, $ofs3, 0x0000019c
    mad.rn.f32 $r13, s[$ofs1+0x0010], $r12, $r11
    mov.b32 $r12, s[$ofs4+0x0000]
    mov.b32 $r11, s[$ofs4+0x0040]
    mad.rn.f32 $r7, s[$ofs1+0x0014], $r7, $r13
    add.b32 $ofs4, $ofs3, 0x0000021c
    mad.rn.f32 $r13, s[$ofs1+0x0018], $r12, $r7
• 77 GFlops (4k x 4k)
[Figure: tile of width TWidth and height THeight; labels: "Reused THeight times", "Reused TWidth times"]

Using Large Tiles
• Each thread:
  – 17 loads / iteration
  – W/16 iterations
  – Total (W^3/15) loads, 15 MAD/load
[Figure: 16 x 256 tile computed by 256 threads, 16 Psubs/thread; labels: "Stored in registers", "Stored in shared memory"]

Using Large Tiles - Algorithm
• For each sub tile in A & B
  – Read the sub tile from A to shared memory. 1 number / thread.
  – For each of the 16 numbers in B:
    • Read one number from B into a register.
    • Perform one MAD for each Psub.
• To remove extra instructions for offset calculation, we want the sub tile A to be stored in column-major format in the shared memory.
  – But …
[Figure: matrices A, B and C; threads T0 to T255 along one tile of C]

Using Large Tiles - Algorithm
• Solution 1:
  – Transpose A to column-major format first.
• Solution 2:
  – Read A in row-major format, write to the shared memory in column-major format.
  – Bank conflicts when writing to the shared memory!
[Figure: for each solution, threads T0 to T15 copying A into Shared A across banks B0 to B15; C computed by threads T0 to T255]

Using Large Tiles - Algorithm
• Solution 3:
  – Pad Shared A with one empty row.
  – No bank conflicts. No need to transpose A.
  – 164 GFlops (4k x 4k).
[Figure: threads T0 to T15 copying A into padded Shared A; consecutive columns start at banks B0, B1, B2, …; C computed by threads T0 to T255]
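
The effect of the padding can be checked by counting bank hits per half warp. The sketch below is a host-side model under stated assumptions: 16 banks of 32-bit words, a half warp made of the 16 threads tx = 0..15 that share one ty, and a transposed write `ashare[tx][ty]`. The padding is modeled as a row pitch of 17 words instead of 16 (for example a declaration like `ashare[16][17]`, which is an assumption about how the pad is declared). `worstBankConflict` is an illustrative name.

```cpp
#include <algorithm>
#include <array>

// Worst-case number of threads in one half warp that hit the same shared-memory
// bank when thread (tx, ty) writes ashare[tx][ty], for a given row pitch in words.
// Assumptions: 16 banks of 32-bit words; a half warp is tx = 0..15 at fixed ty.
inline int worstBankConflict(int pitchWords) {
    const int kBanks = 16;
    int worst = 0;
    for (int ty = 0; ty < 16; ++ty) {              // one half warp per ty
        std::array<int, kBanks> hits{};            // accesses per bank, zeroed
        for (int tx = 0; tx < 16; ++tx) {
            int wordIndex = tx * pitchWords + ty;  // offset of ashare[tx][ty]
            ++hits[wordIndex % kBanks];
        }
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
    }
    return worst;
}
```

With a pitch of 16 every thread of a half warp lands in bank `ty`, so the write is serialized 16 ways. With a pitch of 17 the bank becomes `(tx + ty) % 16`, which is distinct for each thread, so the write is conflict free. A 16 x 17 array of floats occupies 1088 bytes, which is in line with the 1116 bytes of shared memory per block reported for the 16 x 256 kernels.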

Using Large Tiles - code
    for (int i = 0; i < MATRIX_WIDTH/16; i++)
    {
        // Each thread stores one element of the A sub tile, transposed.
        ashare[tx][ty] = A[0];
        __syncthreads();

        #pragma unroll  // 150 GFlops (4k x 4k) without unroll
        for (int k = 0; k < 16; k++)
        {
            b = B[k * MATRIX_WIDTH];      // one number from B into a register
            comp16(b, &ashare[k][0], c);  // one MAD for each of the 16 Psubs
        }

        // Advance to the next sub tiles of A and B.
        A += 16;
        B += 16 * MATRIX_WIDTH;
        __syncthreads();
    }
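
To make the data flow of this loop concrete, here is a host-side emulation that can be checked against a naive multiply. It is a sketch under assumptions, not the kernel from the slides: the body of `comp16` is not shown in the deck, so the version here (one MAD per Psub) is a hypothetical stand-in based on the algorithm description, and `largeTileMultiply` is an illustrative name. Matrices are row-major and W is assumed to be a multiple of 256.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the slide's comp16: one MAD per partial sum.
// a points at 16 consecutive values of one column of the A sub tile.
inline void comp16(float b, const float* a, float* c) {
    for (int r = 0; r < 16; ++r) c[r] += b * a[r];
}

// Host-side emulation of the 16 x 256 tile scheme for row-major W x W matrices.
// Each emulated "thread" t owns one column of C inside the tile and
// accumulates 16 partial sums (on the GPU these would live in registers).
inline std::vector<float> largeTileMultiply(const std::vector<float>& A,
                                            const std::vector<float>& B, int W) {
    std::vector<float> C(static_cast<std::size_t>(W) * W, 0.0f);
    for (int row0 = 0; row0 < W; row0 += 16) {        // tile row of C
        for (int col0 = 0; col0 < W; col0 += 256) {   // tile column of C
            std::vector<float> c(256 * 16, 0.0f);     // 16 Psubs per thread
            for (int k0 = 0; k0 < W; k0 += 16) {      // W/16 iterations
                float ashare[16][17];                 // padded, column-major copy
                for (int ty = 0; ty < 16; ++ty)
                    for (int tx = 0; tx < 16; ++tx)
                        ashare[tx][ty] = A[(row0 + ty) * W + k0 + tx];
                for (int t = 0; t < 256; ++t)         // emulated threads
                    for (int k = 0; k < 16; ++k) {
                        float b = B[(k0 + k) * W + col0 + t];
                        comp16(b, &ashare[k][0], &c[t * 16]);
                    }
            }
            for (int t = 0; t < 256; ++t)             // write the tile back
                for (int r = 0; r < 16; ++r)
                    C[(row0 + r) * W + col0 + t] = c[t * 16 + r];
        }
    }
    return C;
}
```

Because the A sub tile is stored column-major, the 16 values that `comp16` needs for a given k are contiguous, so on the GPU they can be addressed with constant offsets from one base address.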

Using Large Tiles - optimized
    do {
        ashare[tx][ty] = a;
        __syncthreads();
        a = A[0];

        bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
        b[0] = B[4 * MATRIX_WIDTH]; b[1] = B[5 * MATRIX_WIDTH];
        b[2] = B[6 * MATRIX_WIDTH]; b[3] = B[7 * MATRIX_WIDTH];
        for (int i = 0; i < 4; i++) comp16(bb[i], &ashare[i][0], c);

        …

        bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
        b[0] = B[12 * MATRIX_WIDTH]; b[1] = B[13 * MATRIX_WIDTH];
        b[2] = B[14 * MATRIX_WIDTH]; b[3] = B[15 * MATRIX_WIDTH];
        for (int i = 0; i < 4; i++) comp16(bb[i], &ashare[i + 8][0], c);

        bb[0] = b[0]; bb[1] = b[1]; bb[2] = b[2]; bb[3] = b[3];
        A += 16; B += 16 * MATRIX_WIDTH;
        b[0] = B[0 * MATRIX_WIDTH]; b[1] = B[1 * MATRIX_WIDTH];
        b[2] = B[2 * MATRIX_WIDTH]; b[3] = B[3 * MATRIX_WIDTH];
        for (int i = 0; i < 4; i++) comp16(bb[i], &ashare[i + 12][0], c);

        __syncthreads();
    } while (A < Alast);
    ... // last iteration
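
The pattern in this loop is a register prefetch: the next four values are loaded into `b[]` before the current four, already copied into `bb[]`, are consumed, so the loads can overlap with the MADs. The sketch below shows only that loop structure on the host, using a dot product in place of the kernel. It is an illustration under assumptions rather than the slide's code: `pipelinedDot` is a hypothetical name, and the input length is assumed to be a positive multiple of 4.

```cpp
#include <cstddef>
#include <vector>

// Dot product over n values (n a positive multiple of 4), written in the
// prefetching style of the slide: the next four values are loaded into b[]
// before the current four, held in bb[], are consumed.
inline float pipelinedDot(const std::vector<float>& x, const std::vector<float>& y) {
    const std::size_t n = x.size();
    float b[4], bb[4], acc = 0.0f;
    for (int i = 0; i < 4; ++i) b[i] = y[i];       // prologue: load first group
    for (std::size_t base = 0; base < n; base += 4) {
        for (int i = 0; i < 4; ++i) bb[i] = b[i];  // hand over to compute registers
        if (base + 4 < n)                          // issue the next loads early
            for (int i = 0; i < 4; ++i) b[i] = y[base + 4 + i];
        for (int i = 0; i < 4; ++i) acc += bb[i] * x[base + i];
    }
    return acc;
}
```

On a CPU this brings no benefit by itself. On the GPU the point is that the global loads into `b[]` are in flight while the MADs on `bb[]` execute, at the cost of extra registers per thread.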

Using Large Tiles - Performance

Kernel                   Matrix   reg  smem  occupancy  Gflops*  Notes
16 x 16 tile             1k x 1k  14   2076  2/3        81
16 x 16 tile             2k x 2k  14   2076  2/3        81
16 x 16 tile             4k x 4k  14   2076  2/3        77
16 x 256 tile Base       1k x 1k  26   1116  1/3        172
16 x 256 tile Base       2k x 2k  27   1116  1/3        161
16 x 256 tile Base       4k x 4k  29   1116  1/3        164
16 x 256 tile Optimized  1k x 1k  29   1116  1/3        176      (-maxrregcount 32)
16 x 256 tile Optimized  2k x 2k  30   1116  1/3        185      (-maxrregcount 32)
16 x 256 tile Optimized  4k x 4k  32   1116  1/3        192      (no -maxrregcount 32! Otherwise, will use lmem)
cublas                   1k x 1k             2/3        111
cublas                   2k x 2k             2/3        114
cublas                   4k x 4k                        112

* Execution time is measured as the computation time on GPU.

Using Large Tiles – Performance 2

                                  8800 GT (G92)                  8800 GTX (G80)
Kernel                   Matrix   Gflops (comp)  Gflops (total)  Gflops (comp)  Gflops (total)
16 x 16 tile             1k x 1k  81             64              85             64
16 x 16 tile             2k x 2k  81             71              84             74
16 x 16 tile             4k x 4k  77             73              65             61
16 x 256 tile Base       1k x 1k  172            112             185            113
16 x 256 tile Base       2k x 2k  161            130             176            139
16 x 256 tile Base       4k x 4k  164            146             178            157
16 x 256 tile Optimized  1k x 1k  176            114             192            116
16 x 256 tile Optimized  2k x 2k  185            145             193            149
16 x 256 tile Optimized  4k x 4k  192            168             192            168
cublas                   1k x 1k  111            83              115            82
cublas                   2k x 2k  114            97              118            100
cublas                   4k x 4k  112            104             117            108

Gflops (comp): excluding CPU-GPU data transfer time.
Gflops (total): including CPU-GPU data transfer time.

Tools - CUDA GPU Occupancy Calculator

Tools - decuda
• Developed by Wladimir J. van der Laan
  – a Ph.D. candidate at the Institute of Mathematics and Computing Science of the University of Groningen.
• http://www.cs.rug.nl/~wladimir/decuda/

Tools – CUDA Visual Profiler
• http://forums.nvidia.com/index.php?showtopic=57443
  – GPU time, CPU time, occupancy
  – Profiler counters:
      gld_incoherent   : Number of non-coalesced global memory loads
      gld_coherent     : Number of coalesced global memory loads
      gst_incoherent   : Number of non-coalesced global memory stores
      gst_coherent     : Number of coalesced global memory stores
      local_load       : Number of local memory loads
      local_store      : Number of local memory stores
      branch           : Number of branch events (instruction and/or sync stack)
      divergent_branch : Number of divergent branches within a warp
      instructions     : Number of dynamic instructions (in fetch)
      warp_serialize   : Number of threads in a warp serialize based on address (GRF or constant)
      cta_launched     : Number of CTAs launched on the PM TPC

Tips
• Get usage of reg, smem, cmem, and lmem:
  – nvcc -m32 -o data/matrix_kernel.cubin -cubin matrix_kernel.cu -compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I../common/inc -DUNIX -O3 --ptxas-options=-v
• Compile with -maxrregcount

References
• NVIDIA CUDA Samples:
  – http://www.nvidia.com/object/cuda_sample_linear_algebra.html
  – Simple CUBLAS, Matrix Multiplication, Matrix Transpose
• NVIDIA Forum:
  – http://forums.nvidia.com/index.php?showtopic=47689&st=0