CSE 690: GPGPU
Lecture 7: Matrix Multiplications
Klaus Mueller, Computer Science, Stony Brook University
Basic Concept
• Triple loop
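The triple loop named on this slide is the textbook formulation of matrix multiplication. A minimal sketch, assuming dense matrices stored as Python lists of lists (the function name `matmul_triple_loop` is only illustrative):

```python
def matmul_triple_loop(A, B):
    """Multiply an n x l matrix A by an l x m matrix B (lists of lists)."""
    n, l, m = len(A), len(B), len(B[0])
    assert all(len(row) == l for row in A), "inner dimensions must match"
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):          # row of C
        for j in range(m):      # column of C
            for k in range(l):  # inner (dot-product) index
                C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_triple_loop(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

The two outer loops enumerate output elements; the inner loop is the dot product that the GPU variants distribute over fragments and passes.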
GPU Algorithms
• First algorithm:
  - render a rectangle of size N x N
  - represent the matrices as N x N textures
  - each (i, j) is then a fragment
  - each fragment program is a loop or an unrolled loop -> may get too long
  - must pull in the same data many times -> poor data reuse, needs bandwidth
  - makes no use of 4-way RGBA parallelism -> wastes speedup
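To make the data-reuse problem concrete, the first algorithm can be emulated on the CPU. This is only a sketch: `fragment_program` is a hypothetical stand-in for the shader, and indexing a list stands in for a texture fetch.

```python
def first_algorithm(A, B):
    """Emulate one render pass: one fragment per output element (i, j)."""
    n = len(A)
    fetches = 0

    def fragment_program(i, j):
        # Hypothetical stand-in for the shader run on fragment (i, j).
        nonlocal fetches
        acc = 0.0
        for k in range(n):
            a = A[i][k]      # "texture fetch" from A
            b = B[k][j]      # "texture fetch" from B
            fetches += 2
            acc += a * b
        return acc

    C = [[fragment_program(i, j) for j in range(n)] for i in range(n)]
    return C, fetches


N = 4
identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
M = [[float(i * N + j) for j in range(N)] for i in range(N)]
C, fetches = first_algorithm(M, identity)
print(fetches, 2 * N * N)  # 128 fetches issued for only 32 distinct texels
```

Under this model every texel is fetched N times (2N^3 fetches for 2N^2 values), which is the poor reuse the slide points out.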
GPU Algorithms
• Better algorithm:
  - use RGBA channels, pack a 2 x 2 submatrix
  - use swizzling to facilitate data reuse
  - swizzling improves fragment code length by factor 2
  - may need multiple passes for larger matrices
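The 2 x 2 packing with swizzles can be sketched as follows. The layout is an assumption (the slide does not state it): each texel holds its block row-major, so (x, y, z, w) = (m00, m01, m10, m11). The helper names `swizzle` and `block_mad` are illustrative only.

```python
def swizzle(v, pattern):
    """Reorder/replicate components of a 4-vector, e.g. swizzle(v, "xxzz")."""
    index = {"x": 0, "y": 1, "z": 2, "w": 3}
    return tuple(v[index[c]] for c in pattern)


def block_mad(c, a, b):
    """Accumulate the 2 x 2 block product a * b into c using two vector MADs.

    Assumed packing: (x, y, z, w) = (m00, m01, m10, m11).
    """
    t1 = tuple(p * q for p, q in zip(swizzle(a, "xxzz"), swizzle(b, "xyxy")))
    t2 = tuple(p * q for p, q in zip(swizzle(a, "yyww"), swizzle(b, "zwzw")))
    return tuple(ci + u + v for ci, u, v in zip(c, t1, t2))


a = (1.0, 2.0, 3.0, 4.0)   # [[1, 2], [3, 4]]
b = (5.0, 6.0, 7.0, 8.0)   # [[5, 6], [7, 8]]
print(block_mad((0.0, 0.0, 0.0, 0.0), a, b))  # (19.0, 22.0, 43.0, 50.0)
```

Each fetched texel is used in both vector multiply-adds, so two fetches yield two 4-component MADs.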
GPU Algorithms
• Using multi-texturing
  - requires l passes
GPU Algorithms
• Can use RGBA parallelism as well
  - each texel represents a 2 x 2 submatrix
  - use swizzling as usual
  - needs l/2 passes
GPU Algorithms
• Instead of a 2 x 2 submatrix, pack 4 x 1 column vectors
  - makes 4-times reuse of texels read from B, but uses texels from A only once
GPU Algorithms
• Instead of a 2 x 2 submatrix, pack 4 x 1 column vectors
  - 6 fetches are needed for 4 mad’s (mult-add’s) -> 1.5 times more than before
  - but fewer rows and columns are accessed per pass -> improves cache hit frequency
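The reuse pattern of the 4 x 1 packing can be illustrated with a small sketch. The layout is assumed, not taken from the slides: a texel of A holds A[i..i+3][k], a texel of B holds B[k..k+3][j], and the output texel holds C[i..i+3][j]. The sketch shows only which texel is reused; it does not attempt to model the fetch count quoted above.

```python
def column_packed_step(c, a_texels, b):
    """Accumulate four inner-product terms into the 4 x 1 output vector c.

    a_texels[t] is the texel A[i..i+3][k+t] (each used once);
    b is the single texel B[k..k+3][j] (each component used as a scalar).
    """
    for t in range(4):
        # One 4-component MAD: c += a_texels[t] * b[t]
        c = tuple(ci + ai * b[t] for ci, ai in zip(c, a_texels[t]))
    return c


A = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
a_texels = [tuple(A[r][t] for r in range(4)) for t in range(4)]  # columns of A
b = (1.0, 0.0, 2.0, 0.0)                                          # one column of B
print(column_packed_step((0.0, 0.0, 0.0, 0.0), a_texels, b))      # (7.0, 19.0, 31.0, 43.0)
```

The B texel feeds all four MADs, while each A texel is consumed by exactly one of them.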
GPU Algorithms
• Originally only compute one product per shader
  - in practice, can unroll the loop 3-6 times (compute 3-6 products)
  - maximal fragment program length is the limit
  - reduces the number of passes required
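How unrolling trades shader length for passes can be sketched with a CPU emulation. This assumes that each pass accumulates `products_per_pass` terms of the inner sum into every output element; the function and parameter names are hypothetical.

```python
def multipass_matmul(A, B, products_per_pass=1):
    """Accumulate C = A * B over several passes; return (C, number of passes)."""
    n, l, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    passes = 0
    for k0 in range(0, l, products_per_pass):   # one iteration = one render pass
        passes += 1
        k_end = min(k0 + products_per_pass, l)
        for i in range(n):
            for j in range(m):
                for k in range(k0, k_end):      # unrolled loop inside the shader
                    C[i][j] += A[i][k] * B[k][j]
    return C, passes


A = [[1.0] * 6, [2.0] * 6]            # 2 x 6
B = [[1.0, 2.0] for _ in range(6)]    # 6 x 2
for p in (1, 2, 3, 6):
    C, passes = multipass_matmul(A, B, p)
    print(p, passes, C)
```

Under this model, one product per pass needs l passes, two products per pass needs l/2, and unrolling 3 to 6 times reduces the count further, as long as the fragment program still fits the length limit.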
Reality Check
• Would like to compare CPU and GPU efficiencies for GPGPU tasks
• The task of matrix multiplication is insightful here
  - features much data reuse
  - graphics programs are generally more stream-like and have less data reuse
  - this may lead to some limitations
Platforms
• Pentium 4 3 GHz CPU, 512 KB L2 cache
  - 12 GFLOPS peak compute
  - 44.1 GB/sec cache BW
  - using sgemm routine from ATLAS package
• NVIDIA
  - GeForce 5900 Ultra
  - GeForce 6800 Ultra
• ATI
  - Radeon 9800 XT
  - Radeon X800 XT PE
Performance
Bandwidth vs. Peak FLOPS
Analysis
• Currently:
  - GPUs can fetch 16 floats and perform 16 4-component mad’s per clock
  - our app fetches 8 floats to perform one 4-component mad -> not enough computations
  - need more math ops per float fetched (> 8)
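The imbalance can be worked through with the figures on this slide. A minimal sketch, assuming a simple model in which the MAD units are limited only by how many floats can be fetched per clock (the function name is illustrative):

```python
def mad_unit_utilization(floats_fetched_per_clock, mads_per_clock, floats_per_mad):
    """Fraction of the peak MAD rate that the fetch rate can keep busy."""
    mads_fed_per_clock = floats_fetched_per_clock / floats_per_mad
    return min(1.0, mads_fed_per_clock / mads_per_clock)


# Slide figures: 16 floats and 16 four-component MADs per clock,
# while the application needs 8 floats per four-component MAD.
print(mad_unit_utilization(16, 16, 8))  # 0.125
print(mad_unit_utilization(16, 16, 1))  # 1.0
```

With 8 floats per MAD, the fetch path can feed only 2 of the 16 MADs available per clock; the ratio has to improve by a factor of 8 before arithmetic rather than bandwidth becomes the limit.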
Analysis
• Pentium processors have large L1 caches to boost memory bandwidth (bw)
  - bw / compute ratio better
  - main reason for only small performance gain achieved with GPUs for matrix multiplications
Analysis
• Expectations
  - make sure that there is enough arithmetic per data item fetched
  - lots of data reuse in the algorithm / task will make the CPU look better
  - streaming data OK -> they don’t “suffer” from reuse
  - matrix multiplication is an excellent reality-check example
Analysis
• What do GPUs need:
  - bigger caches to enable larger blocks
  - currently there are enough registers to store a 6 x 6 submatrix
  - but currently shaders can only produce a small number of outputs -> limits the amount of blocking
  - provide full floating-point accumulator registers
  - widen path between texture and register files
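Why larger blocks help can be seen from a standard register-blocking model, which is an assumption here and not taken from the slides: a b x b output block kept in registers needs b values of A and b values of B per inner-loop step, and performs b * b scalar multiply-adds with them.

```python
def mads_per_float(block):
    """Scalar multiply-adds per float fetched for a block x block output tile."""
    floats_fetched = 2 * block        # one column slice of A, one row slice of B
    mads = block * block              # every pair (a_i, b_j) updates one C entry
    return mads / floats_fetched


for b in (1, 2, 4, 6):
    print(b, mads_per_float(b))
```

In this model the arithmetic per fetched float grows linearly with the block size (0.5 for a single element, 3.0 for the 6 x 6 block mentioned above), which is why a limit on shader outputs also limits the achievable reuse.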
References
• E. Larsen and D. McAllister, “Fast matrix multiplies using graphics hardware,” Supercomputing 2001.
• J. Hall, N. Carr and J. Hart, “Cache and bandwidth aware matrix multiplication on the GPU,” Tech Report UIUCDCS-R-2003-2328-1.
• K. Fatahalian, J. Sugerman, and P. Hanrahan, “Understanding the efficiency of GPU algorithms for matrix-matrix multiplication,” Graphics Hardware Workshop 2004.