CSE 690: GPGPU
Lecture 7: Matrix Multiplications
Klaus Mueller, Computer Science, Stony Brook University

Basic Concept • Triple loop: iterate over rows i, columns j, and the inner dimension k, accumulating C[i][j] += A[i][k] * B[k][j] (reference code below)
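
As a concrete reference point, here is the standard CPU triple loop in C++ (row-major flat arrays; names are illustrative):

```cpp
// Standard CPU matrix multiplication: three nested loops over i, j, k.
// Computes C = A * B for n x n matrices stored row-major in flat vectors.
#include <vector>

void matmul(const std::vector<float>& A, const std::vector<float>& B,
            std::vector<float>& C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)              // inner dot-product loop
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```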

GPU Algorithms • First algorithm:
  - render a rectangle of size N x N
  - represent the matrices as N x N textures
  - each (i, j) is then a fragment
  - each fragment program is a loop or an unrolled loop -> may get too long (per-fragment sketch below)
  - must pull in the same data many times -> poor data reuse, needs bandwidth
  - makes no use of 4-way RGBA parallelism -> wastes speedup
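
A minimal sketch of what one fragment computes under this first scheme, written as a CPU function with hypothetical names (the real version is a fragment program reading A and B from textures):

```cpp
// What a single fragment computes in the first GPU scheme (CPU sketch).
// The rasterizer supplies (i, j); the fragment program loops over k, fetching
// one texel of A and one of B per step and accumulating a single scalar.
float fragment_ij(const float* A, const float* B, int n, int i, int j)
{
    float sum = 0.0f;
    for (int k = 0; k < n; ++k)                  // loop, or unrolled loop,
        sum += A[i * n + k] * B[k * n + j];      // inside the fragment program
    return sum;   // one float out: the RGBA channels stay unused
}
```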

GPU Algorithms • Better algorithm:
  - use RGBA channels, pack a 2 x 2 submatrix per texel
  - use swizzling to facilitate data reuse (sketch below)
  - swizzling shortens the fragment code by a factor of 2
  - may need multiple passes for larger matrices
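
A CPU sketch of the 2 x 2 packing, assuming each texel stores a submatrix as RGBA = (m00, m01, m10, m11); the two swizzled MADs below are one plausible way to realize the data reuse described above, not necessarily the exact shader from the papers:

```cpp
// Minimal sketch (assumed packing, not the authors' exact shader): each texel
// of A, B, C holds a 2x2 submatrix as RGBA = (m00, m01, m10, m11).
// One inner-loop step then costs two 4-component MADs, emulated here on the CPU.
struct float4 { float r, g, b, a; };

inline float4 mul(float4 x, float4 y) {   // component-wise multiply
    return { x.r * y.r, x.g * y.g, x.b * y.b, x.a * y.a };
}
inline float4 add(float4 x, float4 y) {
    return { x.r + y.r, x.g + y.g, x.b + y.b, x.a + y.a };
}

// C_tile += A_tile * B_tile for 2x2 tiles, written the way a fragment program
// would: two MADs using the swizzles A.rrbb * B.rgrg and A.ggaa * B.baba.
float4 mad2x2(float4 c, float4 A, float4 B)
{
    float4 A_rrbb = { A.r, A.r, A.b, A.b };
    float4 A_ggaa = { A.g, A.g, A.a, A.a };
    float4 B_rgrg = { B.r, B.g, B.r, B.g };
    float4 B_baba = { B.b, B.a, B.b, B.a };
    c = add(c, mul(A_rrbb, B_rgrg));
    c = add(c, mul(A_ggaa, B_baba));
    return c;
}
```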

GPU Algorithms • Using multi-texturing
  - requires l passes

GPU Algorithms • Can use RGBA parallelism as well:
  - each texel represents a 2 x 2 submatrix
  - use swizzling as usual
  - needs l/2 passes

GPU Algorithms • Instead of a 2 x 2 submatrix, pack 4 x 1 column vectors:
  - makes 4-times reuse of texels read from B, but uses texels from A only once

GPU Algorithms • Instead of a 2 x 2 submatrix, pack 4 x 1 column vectors (cont'd):
  - 6 fetches are needed for 4 mad's (multiply-adds) -> 1.5 times more than before (sketch below)
  - but fewer rows and columns are accessed per pass -> improves cache hit frequency
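
A CPU sketch of one inner step under the 4 x 1 packing, with an assumed layout (an A texel holds a[i..i+3][k+u], a B texel holds b[k..k+3][j]); it makes the 6-fetch / 4-MAD count and the asymmetric reuse of A and B explicit:

```cpp
// One inner step of the 4x1 scheme (illustrative layout, not the exact shader):
// per step there are 6 fetches (4 texels of A, 1 texel of B, 1 previous C)
// feeding 4 four-component mads.
void step_4x1(float c[4],              // fetch 1: C accumulated by the previous pass
              const float a[4][4],     // fetches 2-5: four 4x1 column slices of A
              const float b[4])        // fetch 6: one 4x1 slice of a column of B
{
    for (int u = 0; u < 4; ++u)        // 4 mads: each component of b is reused
        for (int r = 0; r < 4; ++r)    // across a full 4-wide mad, while every
            c[r] += a[u][r] * b[u];    // A texel is used exactly once
}
```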

GPU Algorithms • Originally only compute one product per shader:
  - in practice, can unroll the loop 3-6 times (compute 3-6 products, sketch below)
  - the maximal fragment program length is the limit
  - reduces the number of passes required
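
An illustrative sketch of the unrolling idea, assuming each pass accumulates UNROLL consecutive k-terms into the previous pass's result (hypothetical names):

```cpp
// Unrolling sketch: instead of one product per pass, each pass advances k by
// UNROLL steps, cutting the number of passes from n to n / UNROLL, up to the
// fragment-program length limit.
constexpr int UNROLL = 6;

float pass_body(const float* A, const float* B, float c_prev,
                int n, int i, int j, int k0)
{
    float c = c_prev;                    // result carried over from the previous pass
    for (int u = 0; u < UNROLL; ++u)     // fully unrolled at compile time in a shader
        c += A[i * n + (k0 + u)] * B[(k0 + u) * n + j];
    return c;
}
```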

Reality Check • Would like to compare CPU and GPU efficiencies for GPGPU tasks • The task of matrix multiplication is insightful here:
  - features much data reuse
  - graphics programs are generally more stream-like and have less data reuse
  - this may lead to some limitations

Platforms
• Pentium 4, 3 GHz CPU, 512 KB L2 cache
  - 12 GFLOPS peak compute
  - 44.1 GB/sec cache BW
  - using the sgemm routine from the ATLAS package
• NVIDIA GeForce 5900 Ultra
• NVIDIA GeForce 6800 Ultra
• ATI Radeon 9800 XT
• ATI Radeon X800 XT PE

Performance

Bandwidth vs. Peak FLOPS

Analysis • Currently:
  - GPUs can fetch 16 floats and perform 16 4-component mad's per clock
  - our app fetches 8 floats to perform one 4-component mad -> not enough computation
  - need more math ops per float fetched (> 8, worked numbers below)
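
A worked version of these numbers (counting one mad as a multiply plus an add, i.e. 2 flops per component):

```latex
% Arithmetic intensity implied by the slide's numbers:
%   GPU peak:    16 four-component mads per 16 floats fetched
%   this kernel:  1 four-component mad  per  8 floats fetched
\[
\text{peak: } \frac{16 \cdot 4 \cdot 2\ \text{flops}}{16\ \text{floats}} = 8\ \text{flops/float},
\qquad
\text{kernel: } \frac{1 \cdot 4 \cdot 2\ \text{flops}}{8\ \text{floats}} = 1\ \text{flop/float}
\]
```

Closing that factor-of-8 gap is what the "> 8 math ops per float fetched" target refers to.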

Analysis • Pentium processors have large L1 caches to boost memory bandwidth (bw):
  - their bw / compute ratio is better
  - main reason why only a small performance gain is achieved with GPUs for matrix multiplications

Analysis • Expectations:
  - make sure that there is enough arithmetic per data item fetched
  - lots of data reuse in the algorithm / task will make the CPU look better
  - streaming data is OK -> it doesn't “suffer” from reuse
  - matrix multiplication is an excellent reality-check example

Analysis • What do GPUs need?
  - bigger caches to enable larger blocks (blocking sketch below)
  - currently there are enough registers to store a 6 x 6 submatrix, but shaders can only produce a small number of outputs -> limits the amount of blocking
  - full floating-point accumulator registers
  - a wider path between the texture and register files

References
  - E. Larsen and D. McAllister, “Fast matrix multiplies using graphics hardware,” Supercomputing 2001.
  - J. Hall, N. Carr, and J. Hart, “Cache and bandwidth aware matrix multiplication on the GPU,” Tech Report UIUCDCS-R-2003-2328-1.
  - K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication,” Graphics Hardware Workshop 2004.