Anatomy of a High-Performance Many-Threaded Matrix Multiplication


Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G. Van Zee

Introduction
• Shared-memory parallelism for GEMM
• Many-threaded architectures require more sophisticated methods of parallelism
• Explore the opportunities for parallelism to explain which we will exploit
• Need finer-grained parallelism

Outline
► GotoBLAS approach
• Opportunities for Parallelism
• Many-threaded Results

GotoBLAS Approach
The GEMM operation: C += A B, where C is m × n, A is m × k, and B is k × n.

[Diagrams: how the GotoBLAS algorithm maps the operands onto the memory hierarchy (registers, L1 cache, L2 cache, L3 cache, main memory):]
• B is partitioned into column panels of width nc, which reside in main memory.
• Each panel is partitioned in the k dimension; a kc × nc panel of B is packed to reside in the L3 cache.
• A is partitioned into mc × kc blocks, each packed to reside in the L2 cache.
• The packed panel of B is partitioned into kc × nr micro-panels that reside in the L1 cache.
• The packed block of A is partitioned into mr × kc micro-panels; the micro-kernel updates an mr × nr block of C held in registers.

Outline
• GotoBLAS approach
► Opportunities for Parallelism
• Many-threaded Results

3 Loops to Parallelize in GotoBLAS

5 Opportunities for Parallelism

Multiple Levels of Parallelism: the ir loop
• All threads share a micro-panel of B
• Each thread has its own micro-panel of A
• Fixed number of iterations

Multiple Levels of Parallelism: the jr loop
• All threads share a block of A
• Each thread has its own micro-panel of B
• Fixed number of iterations
• Good if there is a shared L2 cache

Multiple Levels of Parallelism: the ic loop
• All threads share a panel of B
• Each thread has its own block of A
• Number of iterations is not fixed
• Good if there are multiple L2 caches

Multiple Levels of Parallelism: the pc loop
• Each iteration updates the entire C
• Iterations of the loop are not independent
• Requires a mutex when updating C, or a reduction

Multiple Levels of Parallelism: the jc loop
• All threads share matrix A
• Each thread has its own panel of B
• Number of iterations is not fixed
• Good if there are multiple L3 caches
• Good for NUMA reasons

Outline
• GotoBLAS approach
• Opportunities for Parallelism
► Many-threaded Results

Intel Xeon Phi
• Many threads: 60 cores, 4 threads per core
  ◦ Need to use more than 2 threads per core to utilize the FPU
• We do not block for the L1 cache
  ◦ Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache
  ◦ We consider part of the L2 cache as a virtual L1
• Each core has its own L2 cache

[Slides 22–25: performance result plots on the Intel Xeon Phi]

IBM Blue Gene/Q
• (Not quite as) many threads: 16 cores, 4 threads per core
  ◦ Need to use more than 2 threads per core to utilize the FPU
• We do not block for the L1 cache
  ◦ Difficult to amortize the cost of updating C with 4 threads sharing an L1 cache
  ◦ We consider part of the L2 cache as a virtual L1
• Single large, shared L2 cache

[Slides 27–32: performance result plots on the IBM Blue Gene/Q]

Thank You
• Questions?
• Source code available at: code.google.com/p/blis/