Performance Libraries: Intel Math Kernel Library (MKL) Intel Software College
Agenda
Introduction
• Purpose of Library
• Intel® Math Kernel Library (Intel® MKL) Contents
Performance Features
• Resource Limited Optimization
• Threading
Using the Library
The Library Sections
• BLAS
• LAPACK*
• DFTs
• VML
• VSL
Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Intel® Math Kernel Library Purpose
Performance, Performance!
Intel’s engineering, scientific, and financial math library
Addresses:
• Solvers (BLAS, LAPACK)
• Eigenvector/eigenvalue solvers (BLAS, LAPACK)
• Some quantum chemistry needs (dgemm)
• PDEs, signal processing, seismic, solid-state physics (FFTs)
• General scientific, financial [vector transcendental functions (VML) and vector random number generators (VSL)]
Tuned for Intel® processors, current and future

Intel® Math Kernel Library Purpose: Don’ts
But don’t use Intel® Math Kernel Library (Intel® MKL) on ...
Geometric transformation: [X’ Y’ Z’ W’] = (4 x 4 transformation matrix) [X Y Z W]
• Don’t use Intel® MKL on “small” counts.
• Don’t call vector math functions on small n.
• But you could use Intel® Performance Primitives.

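For a problem as small as the 4 x 4 transform above, a few lines of hand-written C are a reasonable alternative to a library call. The following is a minimal sketch, assuming a row-major matrix; the name transform4 is hypothetical and is not an Intel MKL or Intel Performance Primitives routine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an MKL or IPP routine): out = m * in for a
   4 x 4 row-major matrix and a 4-element vector (X, Y, Z, W). */
void transform4(const double m[4][4], const double in[4], double out[4])
{
    assert(m != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < 4; i++) {
        double sum = 0.0;
        for (size_t k = 0; k < 4; k++)
            sum += m[i][k] * in[k];
        out[i] = sum;
    }
}
```

With a translation matrix whose last column is (2, 3, 4, 1), the point (1, 1, 1, 1) maps to (3, 4, 5, 1).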
Intel® Math Kernel Library Contents
BLAS (Basic Linear Algebra Subprograms)
Level 1 BLAS: vector-vector operations
• 15 function types
• 48 functions
Level 2 BLAS: matrix-vector operations
• 26 function types
• 66 functions
Level 3 BLAS: matrix-matrix operations
• 9 function types
• 30 functions
Extended BLAS: level 1 BLAS for sparse vectors
• 8 function types
• 24 functions

Intel® Math Kernel Library Contents
LAPACK (linear algebra package)
• Solvers and eigensolvers. Many hundreds of routines total!
• There are more than 1000 total user-callable and support routines.
DFTs (Discrete Fourier transforms)
• Mixed radix, multi-dimensional transforms
• Multithreaded
VML (Vector Math Library)
• Set of vectorized transcendental functions
• Most of the libm functions, but faster
VSL (Vector Statistical Library)
• Set of vectorized random number generators

Intel® Math Kernel Library Contents
BLAS and LAPACK* are both Fortran.
• Legacy of high performance computation
VSL and VML have Fortran and C interfaces.
DFTs have Fortran 95 and C interfaces.
cblas interface: a more convenient way for a C/C++ programmer to call BLAS.

Intel® Math Kernel Library (Intel® MKL) Environment
Support for 32-bit and 64-bit Intel® processors
            Windows*                Linux*
Compilers   Intel, CVF, Microsoft   Intel, GNU
Libraries   .dll, .lib              .a, .so
Large set of examples and tests
Extensive documentation

Resource Limited Optimization
The goal of all optimization is maximum speed.
Resource limited optimization: exhaust one or more resources of the system:
• CPU: Register use, FP units.
• Cache: Keep data in cache as long as possible; deal with cache interleaving.
• TLBs: Maximally use data on each page.
• Memory bandwidth: Minimally access memory.
• Computer: Use all the processors available using threading.
• System: Use all the nodes available (cluster software).

Threading
Most of Intel® Math Kernel Library (Intel® MKL) could be threaded, but:
• The limited resource is memory bandwidth.
• Threading level 1 and level 2 BLAS is mostly ineffective ( O(n) ).
There are numerous opportunities for threading:
• Level 3 BLAS ( O(n^3) )
• LAPACK* ( O(n^3) )
• FFTs ( O(n log(n)) )
• VML, VSL: depends on processor and function
All threading is via OpenMP*.
All Intel MKL is designed and compiled for thread safety.

Linking with Intel® Math Kernel Library (Intel® MKL)
Scenario 1: ifort, BLAS, IA-32 processor:
    ifort myprog.f mkl_c.lib
Scenario 2: CVF, LAPACK, IA-32 processor:
    f77 myprog.f mkl_s.lib
Scenario 3: Statically link a C program with DLL linked at runtime:
    link myprog.obj mkl_c_dll.lib
Note: Optimal binary code will execute at run time based on processor.

Matrix Multiplication Roll Your Own/Dot Product
Roll Your Own
    /* c (n x m) += a (n x kk) * b (kk x m) */
    for( i = 0; i < n; i++ ){
      for( j = 0; j < m; j++ ){
        for( k = 0; k < kk; k++ )
          c[i][j] += a[i][k] * b[k][j];
      }
    }
ddot
    /* incx = 1 walks row i of a; incy = row length of b walks column j of b */
    for( i = 0; i < n; i++ ){
      for( j = 0; j < m; j++ )
        c[i][j] = cblas_ddot( kk, &a[i][0], incx, &b[0][j], incy );
    }

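To make the stride arguments concrete, here is a minimal sketch of the dot-product formulation in plain C. The names ref_ddot and matmul_ddot are hypothetical stand-ins, not MKL routines; ref_ddot mirrors the argument order of cblas_ddot but handles positive strides only, and the matrices are assumed to be contiguous and row-major.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cblas_ddot (positive strides only):
   returns the sum over t of x[t*incx] * y[t*incy]. */
double ref_ddot(int len, const double *x, int incx, const double *y, int incy)
{
    assert(len >= 0 && incx > 0 && incy > 0);
    double sum = 0.0;
    for (int t = 0; t < len; t++)
        sum += x[(size_t)t * incx] * y[(size_t)t * incy];
    return sum;
}

/* c (n x m) = a (n x kk) * b (kk x m), all row-major and contiguous. */
void matmul_ddot(int n, int m, int kk,
                 const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            /* Row i of a has stride 1; column j of b has stride m. */
            c[i * m + j] = ref_ddot(kk, &a[i * kk], 1, &b[j], m);
}
```

Each element of c costs one call over kk elements, and the column of b is read with a stride of m, which is part of why this formulation tends to trail the level 2 and level 3 routines.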
Matrix Multiplication DGEMV/DGEMM
dgemv
    /* one column of c per call; ldb and ldc (row lengths of b and c) act as the vector strides */
    for( i = 0; i < m; i++ )
      cblas_dgemv( CblasRowMajor, CblasNoTrans, n, kk, alpha, &a[0][0], lda,
                   &b[0][i], ldb, beta, &c[0][i], ldc );
dgemm
    /* column-major call with a and b swapped computes the row-major product c = a * b */
    cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, kk,
                 alpha, &b[0][0], ldb, &a[0][0], lda, beta, &c[0][0], ldc );

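The roles of alpha, beta, and the leading dimensions are easier to see in a plain reference loop. The sketch below is a hypothetical row-major, no-transpose reference (ref_dgemm_rowmajor is not an MKL name) and makes no attempt at blocking or threading; unlike a real BLAS, it always reads c, even when beta is zero.

```c
#include <assert.h>

/* Hypothetical reference (not the MKL routine): row-major, no transposes.
   C (n x m) = alpha * A (n x kk) * B (kk x m) + beta * C.
   lda, ldb, ldc are the allocated row lengths of A, B, and C. */
void ref_dgemm_rowmajor(int n, int m, int kk, double alpha,
                        const double *a, int lda,
                        const double *b, int ldb,
                        double beta, double *c, int ldc)
{
    assert(lda >= kk && ldb >= m && ldc >= m);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
            double sum = 0.0;
            for (int k = 0; k < kk; k++)
                sum += a[i * lda + k] * b[k * ldb + j];
            c[i * ldc + j] = alpha * sum + beta * c[i * ldc + j];
        }
    }
}
```

A leading dimension larger than the logical row length lets the same routine work on a sub-block of a bigger array.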
Activity 1: DGEMM
Compare the performance of matrix multiply as implemented by C source code, DDOT, DGEMV, and DGEMM.
Exercise control of the threading capabilities in MKL/BLAS.

Intel® Math Kernel Library Optimizations in LAPACK*
Most important LAPACK optimizations:
• Threading: effectively uses multiple CPUs
• Recursive factorization
  • Reduces scalar time (Amdahl’s law: t = t_scalar + t_parallel/p)
  • Extends blocking further into the code
No runtime library support required

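The Amdahl's law formula on this slide can be turned into a small calculator to see why reducing scalar time matters. This is a minimal sketch; the function names are illustrative and the timings in the note below are made-up inputs, not measurements.

```c
#include <assert.h>

/* Run time on p processors per the slide's formula:
   t = t_scalar + t_parallel / p. */
double amdahl_time(double t_scalar, double t_parallel, int p)
{
    assert(p >= 1);
    return t_scalar + t_parallel / p;
}

/* Speedup relative to a single processor. */
double amdahl_speedup(double t_scalar, double t_parallel, int p)
{
    return amdahl_time(t_scalar, t_parallel, 1)
         / amdahl_time(t_scalar, t_parallel, p);
}
```

With t_scalar = 2, t_parallel = 8, and p = 4, the run time is 4 and the speedup is 2.5. Halving the scalar part to 1 raises the speedup on the same four processors to 3.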
Discrete Fourier Transforms
• One-dimensional, two-dimensional, three-dimensional…
• Multithreaded
• Mixed radix
• User-specified scaling, transform sign
• Transforms on embedded matrices
• Multiple one-dimensional transforms on single call
• Strides
• C and F90 interfaces

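The scaling and sign options can be read against the usual definition of the one-dimensional discrete transform. As a sketch, with σ standing for the user-specified scale factor and δ for the transform sign (these symbol names are assumptions for illustration, not taken from the slides):

```latex
X_k = \sigma \sum_{j=0}^{n-1} x_j \, e^{\delta \, 2\pi i \, jk/n},
\qquad k = 0, \dots, n-1, \quad \delta \in \{-1, +1\}
```

Taking δ = -1 and σ = 1 for the forward transform, applying the transform again with δ = +1 and σ = 1/n recovers the original x_j.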
Using the Intel® Math Kernel Library DFTs
Basically a 3-step process:
Create a descriptor.
• Status = DftiCreateDescriptor(MDH, …)
Commit the descriptor (instantiates it).
• Status = DftiCommitDescriptor(MDH)
Perform the transform.
• Status = DftiComputeForward(MDH, X)
Optionally free the descriptor.
MDH: MyDescriptorHandle
Now supports FFTW interface

Vector Math Library (VML) Features/Issues
Vector Math Library: vectorized transcendental functions, like libm but better (faster)
Interface: has both Fortran and C interfaces
Multiple accuracies
• High accuracy ( < 1 ulp )
• Lower accuracy, faster ( < 4 ulps )
Special value handling: √(-a), sin(0), and so on
Error handling: cannot duplicate libm here

VML: Why Does It Matter?
It is important for financial codes (Monte Carlo simulations).
• Exponentials, logarithms
Other scientific codes depend on transcendental functions.
Error functions can be big time sinks in some codes.
And so on

Vector Statistical Library (VSL)
• Set of random number generators (RNGs)
• Numerous non-uniform distributions
• VML used extensively for transformations
• Parallel computation support: some functions
• User can supply own BRNG or transformations
• Five basic RNGs (BRNGs): bits, integer, FP
  • MCG31, R250, MRG32, MCG59, WH

Non-Uniform RNGs
• Gaussian (two methods)
• Exponential
• Laplace
• Weibull
• Cauchy
• Rayleigh
• Lognormal
• Gumbel

Using VSL
Basically a 3-step process:
Create a stream pointer.
    VSLStreamStatePtr stream;
Create a stream.
    vslNewStream( &stream, VSL_BRNG_MCG31, seed );
Generate a set of random numbers.
    vsRngUniform( 0, stream, size, out, start, end );
Delete a stream (optional).
    vslDeleteStream( &stream );

Activity: Calculating Pi using a Monte Carlo method
Compare the performance of C source code (RAND function) and VSL.
Exercise control of the threading capabilities in MKL/VSL.

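As a starting point for the comparison, the C-source baseline might look like the sketch below. It is an assumption about the shape of the exercise, not the course's actual lab code; estimate_pi is a hypothetical name, and the generator is the C library rand().

```c
#include <assert.h>
#include <stdlib.h>

/* Baseline estimate of pi using the C library rand(): the fraction of
   random points in the unit square that land inside the quarter circle
   approaches pi/4. */
double estimate_pi(long samples, unsigned int seed)
{
    assert(samples > 0);
    srand(seed);
    long inside = 0;
    for (long i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return 4.0 * (double)inside / (double)samples;
}
```

A VSL version would replace the two rand() calls with values drawn in bulk from a buffer filled by a single uniform-generator call, which is where the vectorized library has room to pay off.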
Performance Libraries: Intel® MKL What’s Been Covered
• Intel® Math Kernel Library is a broad scientific/engineering math library.
• It is optimized for Intel® processors.
• It is threaded for effective use on SMP machines.