CUDA Libraries
© NVIDIA Corporation 2013
Why Use a Library?
- No need to reprogram
- Save time
- Fewer bugs
- Better performance
= FUN
CUDA Math Libraries
High-performance math routines for your applications:
- cuFFT – Fast Fourier Transforms Library
- cuBLAS – Complete BLAS Library
- cuSPARSE – Sparse Matrix Library
- cuRAND – Random Number Generation (RNG) Library
- NPP – Performance Primitives for Image & Video Processing
- Thrust – Templated C++ Parallel Algorithms & Data Structures
- math.h – C99 floating-point Library
Included in the CUDA Toolkit. Free download @ www.nvidia.com/getcuda
Linear Algebra
A Bird's-Eye View of Linear Algebra
[Diagram: libraries arranged along three axes – functionality (Vector, Matrix, Solver), matrix type (Dense vs. Sparse), and scale (Single Node vs. Multi Node)]

Sometimes it seems as if there are only three: BLAS (vector/matrix), LAPACK (solvers), and ScaLAPACK (multi-node).

...but there is more: on the dense side EISPACK, LINPACK, PLAPACK, and PBLAS; on the sparse side Sparse BLAS, PARDISO, TAUCS, SuperLU, WSMP, MUMPS, UMFPACK, SPOOLES, and PaStiX.

...and even more: toolkits and environments such as Trilinos, PETSc, Matlab, and R, IDL, Python, Ruby, ...
NVIDIA CUDA Library Approach
- Provide basic building blocks
- Make them easy to use
- Make them fast
Provides a quick path to GPU acceleration, enables ISVs to focus on their "secret sauce", and is ideal for applications that already use CPU libraries.
NVIDIA's Foundation for Linear Algebra on GPUs
[Diagram: NVIDIA cuBLAS covers the dense side and NVIDIA cuSPARSE the sparse side – single-node vector/matrix/solver building blocks]
cuBLAS: >1 TFLOPS double precision
Up to 1 TFLOPS sustained performance and >8x speedup over Intel MKL.
- cuBLAS 5.0 on K20
- MKL 10.3.6 on Intel Sandy Bridge E5-2687W @ 3.10 GHz
Performance may vary based on OS version and motherboard configuration.
cuBLAS: Legacy and Version 2 Interface
Legacy Interface
- Convenient for quick ports of legacy code
Version 2 Interface
- Reduces data transfer for complex algorithms
- Return values on CPU or GPU
- Scalar arguments passed by reference
- Support for streams and multithreaded environments
- Batching of key routines
The Version 2 Interface Helps Reduce Memory Transfers

Legacy Interface – the index is transferred to the CPU, and the CPU needs a vector element to form the scale factor:

idx = cublasIsamax(n, d_column, 1);
err = cublasSscal(n, 1.f/d_column[idx], d_row, 1);

Version 2 Interface – all data remains on the GPU:

err = cublasIsamax(handle, n, d_column, 1, d_maxIdx);
kernel<<<...>>>(d_column, d_maxIdx, d_val);   // compute scale factor on the GPU
err = cublasSscal(handle, n, d_val, d_row, 1);
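A minimal sketch of the same idea, assuming the handle was created with cublasCreate and that d_column, d_row, d_maxIdx, and d_val are existing device allocations; the inv_at helper kernel is hypothetical and error checks are omitted:

#include <cublas_v2.h>

// Hypothetical helper: compute the scale factor 1/x[idx-1] on the GPU
// (cublasIsamax returns a 1-based index).
__global__ void inv_at(const float *x, const int *idx, float *val)
{
    *val = 1.0f / x[*idx - 1];
}

// Scale the n-element device vector d_row by 1/max(|d_column|), entirely on the GPU.
void scale_row(cublasHandle_t handle, const float *d_column, float *d_row,
               int n, int *d_maxIdx, float *d_val)
{
    // Tell cuBLAS that scalar arguments and results live in device memory.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasIsamax(handle, n, d_column, 1, d_maxIdx);  // index of max |element|, written to GPU memory
    inv_at<<<1, 1>>>(d_column, d_maxIdx, d_val);     // scale factor, GPU-side
    cublasSscal(handle, n, d_val, d_row, 1);         // d_row *= *d_val
}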
The cuSPARSE – CUSP Relationship
[Diagram: NVIDIA cuSPARSE among the sparse, single-node vector/matrix/solver building blocks]
Third Parties Extend the Building Blocks
[Diagram: the FLAME Library and the IMSL Library build on the dense/sparse, single-node building blocks]
Different Approaches to Linear Algebra
CULA tools (dense, sparse)
- LAPACK-based API
- Solvers, factorizations, least squares, SVD, eigensolvers
- Sparse: Krylov solvers, preconditioners, support for various formats
- culaSgetrf(M, N, A, LDA, IPIV, INFO)
ArrayFire Matrix Computations
- "Matlab-esque" interface for C and Fortran
- Array container object
- Solvers, factorizations, SVD, eigensolvers
- array out = lu(A)
Different Approaches to Linear Algebra (cont.)
MAGMA
- LAPACK-conforming API
- MAGMA BLAS and LAPACK
- High performance by utilizing both GPU and CPU
- magma_sgetrf(M, N, A, LDA, IPIV, INFO)
libflame (FLAME Library)
- LAPACK compatibility interface
- Infrastructure for rapid linear algebra algorithm development
- FLASH_LU_piv(A, p)
Toolkits are increasingly supporting GPUs
PETSc
- GPU support via extensions to the Vec and Mat classes
- Partially dependent on CUSP
- MPI-parallel, GPU-accelerated solvers
Trilinos
- GPU support in the Kokkos package
- Used through the vector class Tpetra
- MPI-parallel, GPU-accelerated solvers
Signal Processing
Common Tasks in Signal Processing: Filtering, Correlation, Segmentation
Libraries for GPU Accelerated Signal Processing
[Diagram of related products: Vector Signal Image Processing, Parallel Computing Toolbox, ArrayFire Matrix Computations, GPU Accelerated Data Analysis, NVIDIA NPP]
Basic concepts of cuFFT
- Interface modeled after FFTW: simple migration from CPU to GPU (fftw_plan_dft_2d => cufftPlan2d)
- A "plan" describes data layout and transformation strategy: it depends on the dimensionality, layout, and type of transform; is independent of the actual data and the direction of the transform; and is reusable for multiple transforms
- Execution of a plan depends on the transform direction and the data: cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD)
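A minimal sketch of plan reuse, assuming d_data is a device array of nx*ny cufftComplex values; error checks are omitted, and note that cuFFT's inverse transform is unnormalized (results are scaled by nx*ny):

#include <cufft.h>

// Forward + inverse 2D C2C transform, reusing a single plan.
void fft_roundtrip(cufftComplex *d_data, int nx, int ny)
{
    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_C2C);              // layout + strategy
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // direction chosen at exec time
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);  // same plan, other direction
    cufftDestroy(plan);
}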
Efficient use of cuFFT
- Perform multiple transforms with the same plan: use it e.g. for the forward/inverse transforms of a convolution, for a transform at each simulation timestep, etc.
- Transform in streams: cuFFT functions do not take a stream argument; associate a plan with a stream via cufftSetStream(plan, stream)
- Batch transforms: concurrent execution of multiple identical transforms, with support for 1D, 2D, and 3D transforms
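A sketch combining batching with streams, assuming d_a and d_b each hold `batch` contiguous signals of length n on the device and that the two streams were created by the caller; error checks omitted:

#include <cuda_runtime.h>
#include <cufft.h>

// Two independent batched 1D transforms in separate streams,
// so the batches can execute concurrently.
void batched_streamed_fft(cufftComplex *d_a, cufftComplex *d_b,
                          int n, int batch,
                          cudaStream_t s1, cudaStream_t s2)
{
    cufftHandle planA, planB;
    cufftPlan1d(&planA, n, CUFFT_C2C, batch);
    cufftPlan1d(&planB, n, CUFFT_C2C, batch);
    cufftSetStream(planA, s1);   // plans carry the stream association
    cufftSetStream(planB, s2);
    cufftExecC2C(planA, d_a, d_a, CUFFT_FORWARD);
    cufftExecC2C(planB, d_b, d_b, CUFFT_FORWARD);
    cufftDestroy(planA);
    cufftDestroy(planB);
}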
High 1D transform performance is key to efficient 2D and 3D transforms.
- Measured on sizes that are exactly powers of 2
- cuFFT 5.0 on K20
Performance may vary based on OS version and motherboard configuration.
Basic concepts of NPP
- Collection of high-performance GPU processing routines
- Initial focus on image, video, and signal processing; growth into other domains expected
- Support for multi-channel integer and float data
- C API => the name disambiguates between data types and flavor: nppiAdd_32f_C1R(...) "adds" two single-channel ("C1") 32-bit float ("32f") images, possibly masked by a region of interest ("R")
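A minimal sketch of this convention in practice, adding two single-channel float images over a full-image ROI; pitched device allocations come from nppiMalloc_32f_C1, and filling the images and error checks are omitted:

#include <npp.h>

void add_images(int width, int height)
{
    int stepA, stepB, stepDst;   // row pitches in bytes, set by nppiMalloc
    Npp32f *a   = nppiMalloc_32f_C1(width, height, &stepA);
    Npp32f *b   = nppiMalloc_32f_C1(width, height, &stepB);
    Npp32f *dst = nppiMalloc_32f_C1(width, height, &stepDst);

    NppiSize roi = { width, height };   // the "R": region of interest
    nppiAdd_32f_C1R(a, stepA, b, stepB, dst, stepDst, roi);

    nppiFree(a); nppiFree(b); nppiFree(dst);
}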
NPP features a large set of functions
- Arithmetic and logical operations: add, mul, clamp, ...
- Threshold and compare
- Geometric transformations: rotate, warp, perspective transformations, various interpolations
- Compression: JPEG (de)compression
- Image processing: filter, histogram, statistics
cuRAND
Random Number Generation on the GPU
Generating high-quality random numbers in parallel is hard. Don't do it yourself, use a library!
- Large suite of generators and distributions: XORWOW, MRG32k3a, MTGP32, (scrambled) Sobol; uniform, normal, log-normal; single and double precision
- Two APIs for cuRAND:
  - Host: ideal when generating large batches of random numbers on the GPU
  - Device: ideal when random numbers need to be generated inside a kernel
cuRAND: Host vs. Device API

Host API – generate a set of random numbers at once:

#include <curand.h>
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
curandGenerateUniform(gen, d_data, n);

Device API – generate random numbers per thread:

#include <curand_kernel.h>
__global__ void generate_kernel(curandState *state)  // state initialized beforehand via curand_init
{
    int id = threadIdx.x + blockIdx.x * 64;
    unsigned int x = curand(&state[id]);  // 32 random bits for this thread
}
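A fleshed-out host-API sketch; names other than the cuRAND and CUDA runtime calls are illustrative, and error checks are omitted:

#include <cuda_runtime.h>
#include <curand.h>

// Allocate device memory and fill it with n uniform floats in (0, 1].
void generate_uniform(float **d_data, size_t n)
{
    curandGenerator_t gen;
    cudaMalloc((void **)d_data, n * sizeof(float));
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);  // XORWOW by default
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);        // reproducible runs
    curandGenerateUniform(gen, *d_data, n);                  // fills device memory
    curandDestroyGenerator(gen);
}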
cuRAND Performance compared to Intel MKL
- cuRAND 5.0 on K20X, input and output data on device
- MKL 10.3.6 on Intel Sandy Bridge E5-2687W @ 3.10 GHz
Performance may vary based on OS version and motherboard configuration.
Next steps...
Thrust: STL-like CUDA Template Library
C++ STL features for CUDA.

Device and host vector classes:

thrust::host_vector<float> H(10, 1.f);
thrust::device_vector<float> D = H;

Iterators:

thrust::fill(D.begin(), D.begin()+5, 42.f);
float* raw_ptr = thrust::raw_pointer_cast(D.data());

Algorithms – sort, reduce, transform, scan, ...:

thrust::transform(D1.begin(), D1.end(), D2.begin(), D2.begin(), thrust::plus<float>());  // D2 = D1 + D2
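A self-contained sketch tying these pieces together (Thrust ships with the CUDA Toolkit, so only the headers below are needed):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

int main()
{
    thrust::host_vector<float> H(10, 1.f);   // 10 ones on the host
    thrust::device_vector<float> D1 = H;     // host -> device copies
    thrust::device_vector<float> D2 = H;

    // Binary transform: the 4th argument is the output iterator.
    thrust::transform(D1.begin(), D1.end(), D2.begin(),
                      D2.begin(), thrust::plus<float>());   // D2 = D1 + D2

    float sum = thrust::reduce(D2.begin(), D2.end());       // 20.0f
    return sum == 20.f ? 0 : 1;
}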
OpenACC: New Open Standard for GPU Computing
Faster, easier, portability.
http://www.openacc-standard.org
Vector Addition using OpenACC

Serial version:

void vec_add(float *x, float *y, int n)
{
    for (int i=0; i<n; ++i)
        y[i] = x[i] + y[i];
}

float *x = (float*)malloc(n*sizeof(float));
float *y = (float*)malloc(n*sizeof(float));
vec_add(x, y, n);
free(x); free(y);

OpenACC version – #pragma acc kernels runs the loop in parallel on the GPU, and the calling code is unchanged:

void vec_add(float *x, float *y, int n)
{
    #pragma acc kernels
    for (int i=0; i<n; ++i)
        y[i] = x[i] + y[i];
}
OpenACC Basics

Compute construct for offloading a calculation to the GPU:

#pragma acc parallel
for (i=0; i<n; i++)
    a[i] = a[i] + b[i];

Data construct for controlling data movement between CPU and GPU – copy(list) / copyin(list) / copyout(list) / present(list):

#pragma acc data copy(a[0:n-1]) copyin(b[0:n-1])
{
    #pragma acc parallel
    for (i=0; i<n; i++)
        a[i] = a[i] + b[i];
    #pragma acc parallel
    for (i=0; i<n; i++)
        a[i] *= 2;
}
math.h: C99 floating-point library + extras
Explore the CUDA (Libraries) Ecosystem
- CUDA tools and ecosystem described in detail on the NVIDIA Developer Zone: developer.nvidia.com/cuda-tools-ecosystem
- Attend GTC library talks
Examples
Questions