
Libraries for CUDA

CUDA SDK
- Compiler (front/back end for an existing C/C++ compiler)
- Debugger
- Profiler
- Libraries, e.g.: cuBLAS, cuFFT, cuRAND
- A large collection of sample programs
- Main supported languages: C/C++, Fortran

cuBLAS
- GPU implementation of the standard BLAS library
- Same interface as the original BLAS; only minimal code changes are needed when porting from CPU to GPU
- Since CUDA SDK 6.0 also cuBLASXt, the cublasXt API for multi-GPU:
  - The cublasXt API takes care of allocating memory across the designated GPUs, dispatches the workload between them, and finally retrieves the results back to the host.
  - The cublasXt API supports only the compute-intensive BLAS3 routines (e.g. matrix-matrix operations), where the PCI transfers back and forth from the GPU can be amortized.
  - Tiling design approach: to be able to share the workload between multiple GPUs, ...
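The tiling idea can be illustrated with a small CPU-only sketch in C. It is not the cublasXt implementation: the function names `tile_count` and `tile_owner` and the round-robin assignment of tiles to devices are assumptions made purely for illustration. The sketch only shows how an output matrix cut into square tiles could be spread over several GPUs.

```c
#include <stddef.h>

/* Number of square tiles of edge block_dim needed to cover dim rows
   (or columns); the last tile may be only partially filled. */
size_t tile_count(size_t dim, size_t block_dim)
{
    return (dim + block_dim - 1) / block_dim;
}

/* Hypothetical round-robin assignment: tile (ti, tj) of a grid with
   tiles_per_row tiles in each row is handled by device
   (linear tile index) mod n_devices. */
int tile_owner(size_t ti, size_t tj, size_t tiles_per_row, int n_devices)
{
    size_t linear = ti * tiles_per_row + tj;
    return (int)(linear % (size_t)n_devices);
}
```

For a 10000 x 10000 result and an assumed tile edge of 4096, `tile_count` gives a 3 x 3 grid, so two devices would receive five and four tiles respectively.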

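cuBLAS example I
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cublas.h>   // legacy cuBLAS API

    #define M 6
    #define N 5           // matrix dimensions
    #define IDX2C(i,j,ld) (((j)*(ld))+(i))   // mapping function: (row i, column j) -> column-major index

    // scales row p (from column q on) by alpha, then column q (from row p down) by beta
    void modify (float *m, int ldm, int n, int p, int q, float alpha, float beta)
    {
        cublasSscal (n-q, alpha, &m[IDX2C(p,q,ldm)], ldm);   // n-q elements remain in row p
        cublasSscal (ldm-p, beta, &m[IDX2C(p,q,ldm)], 1);    // ldm-p elements remain in column q
    }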

cuBLAS example II
    int main(int argc, char *argv[])
    {
        int i, j;
        cublasStatus stat;
        float* devPtrA;
        float* a = 0;
        a = (float *)malloc (M * N * sizeof (*a));
        if (!a) {
            printf ("host memory allocation failed");
            return 1;
        }
        // fill the matrix
        for (j = 0; j < N; j++) {
            for (i = 0; i < M; i++) {
                a[IDX2C(i,j,M)] = (float)(i * M + j + 1);
            }
        }

cuBLAS example III
        // initialize the library
        cublasInit();
        // allocate device memory for the matrix
        stat = cublasAlloc (M*N, sizeof(*a), (void**)&devPtrA);
        if (stat != CUBLAS_STATUS_SUCCESS) {
            printf ("device memory allocation failed");
            return 1;
        }
        // upload the matrix from the CPU to the GPU
        cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
        // the actual linear operation on the GPU
        modify (devPtrA, M, N, 1, 2, 16.0f, 12.0f);

cuBLAS example IV
        // download the matrix from the GPU to the CPU
        cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
        // free the device matrix and shut the library down
        cublasFree (devPtrA);
        cublasShutdown();
        // print the result
        for (j = 0; j < N; j++) {
            for (i = 0; i < M; i++) {
                printf ("%7.0f", a[IDX2C(i,j,M)]);
            }
            printf ("\n");
        }
        free (a);
        return 0;
    }
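To check what the GPU example should produce, the same operation can be written for the CPU. The following is a minimal sketch, not part of the original slides: `cpu_sscal` and `cpu_modify` are hypothetical helper names, and `cpu_sscal` simply mirrors the BLAS `sscal` semantics (scale `n` elements spaced `incx` apart). Row `p` has `n - q` elements from column `q` onward, which is why that count is used for the row scaling.

```c
#include <stddef.h>

#define IDX2C(i, j, ld) (((j) * (ld)) + (i))  /* column-major index */

/* Scale n elements of x, spaced incx apart, by alpha
   (same semantics as BLAS sscal). */
void cpu_sscal(int n, float alpha, float *x, int incx)
{
    for (int k = 0; k < n; k++)
        x[(size_t)k * (size_t)incx] *= alpha;
}

/* CPU counterpart of modify(): scale row p from column q onward by alpha,
   then column q from row p downward by beta. */
void cpu_modify(float *m, int ldm, int n, int p, int q, float alpha, float beta)
{
    cpu_sscal(n - q, alpha, &m[IDX2C(p, q, ldm)], ldm);  /* stride ldm walks along a row */
    cpu_sscal(ldm - p, beta, &m[IDX2C(p, q, ldm)], 1);   /* stride 1 walks down a column */
}
```

With the 6 x 5 matrix from the example and the call `cpu_modify(a, 6, 5, 1, 2, 16.0f, 12.0f)`, element (1,2) is scaled by both factors, the rest of row 1 from column 2 on by 16, and the rest of column 2 from row 1 down by 12.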

cuBLASXt + NVBLAS
- The NVBLAS library can accelerate most BLAS Level-3 routines by dynamically routing BLAS calls to one or more NVIDIA GPUs present in the system, when the characteristics of the call make it likely to speed up on a GPU.
- The NVBLAS library is built on top of the cuBLAS library.
- Depending on the characteristics of those BLAS calls, NVBLAS redirects the calls to the GPUs present in the system or to the CPU.
- That decision is based on a simple heuristic that estimates whether the BLAS call will execute for long enough to amortize the PCI transfers of the input and output data to the GPU.
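The kind of trade-off such a heuristic weighs can be sketched in a few lines of C. This is not the heuristic NVBLAS actually uses, which the slides do not specify: the function name `worth_offloading`, the cost model, and the throughput and bandwidth parameters are all assumptions chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical offload test for C = A*B in single precision,
   with A of size m x k and B of size k x n.
   gpu_flops, cpu_flops: assumed sustained throughput in FLOP/s.
   pci_bytes_per_s:      assumed host<->device bandwidth in bytes/s. */
bool worth_offloading(double m, double n, double k,
                      double gpu_flops, double cpu_flops,
                      double pci_bytes_per_s)
{
    double flops = 2.0 * m * n * k;                              /* one multiply and one add per term */
    double bytes = (double)sizeof(float) * (m * k + k * n + m * n); /* A and B in, C out */
    double t_gpu = flops / gpu_flops + bytes / pci_bytes_per_s;  /* compute plus transfers */
    double t_cpu = flops / cpu_flops;
    return t_gpu < t_cpu;
}
```

Under these assumptions a large product such as 4096 x 4096 x 4096 is worth sending to the GPU, while an 8 x 8 x 8 product is dominated by the transfer cost and stays on the CPU.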

cuFFT
- Library for computing the FFT (an efficient algorithm for computing the discrete Fourier transform, DFT, and its inverse) on the GPU
- Advantages:
  - Algorithms highly optimized for input sizes that can be written in the form n = 2^a * 3^b * 5^c * 7^d
  - Complex and real-valued input and output
  - 1D, 2D and 3D transforms
  - Execution of multiple 1D, 2D and 3D transforms simultaneously; these batched transforms have higher performance than single transforms
  - In-place and out-of-place transforms
  - Arbitrary intra- and inter-dimension element strides (strided layout)
  - Execution of transforms across multiple GPUs
  - Streamed execution, enabling asynchronous computation and data movement
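Whether a given transform length has the favoured form n = 2^a * 3^b * 5^c * 7^d can be checked on the host before a plan is created. The sketch below is plain C and does not call cuFFT; the helper names `is_7_smooth` and `next_7_smooth` are made up for this example, and padding to the next such size is only one possible way to use the result.

```c
#include <stdbool.h>

/* True if n can be written as 2^a * 3^b * 5^c * 7^d. */
bool is_7_smooth(unsigned long n)
{
    static const unsigned long primes[] = {2, 3, 5, 7};
    if (n == 0)
        return false;
    for (int i = 0; i < 4; i++)
        while (n % primes[i] == 0)
            n /= primes[i];      /* strip every factor of 2, 3, 5 and 7 */
    return n == 1;               /* anything left over is a larger prime factor */
}

/* Smallest size >= n of that form, e.g. as a zero-padding target. */
unsigned long next_7_smooth(unsigned long n)
{
    if (n == 0)
        n = 1;
    while (!is_7_smooth(n))
        n++;
    return n;
}
```

For example, 1000 = 2^3 * 5^3 qualifies, whereas 1001 = 7 * 11 * 13 does not; the next qualifying size after it is 1008 = 2^4 * 3^2 * 7.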

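cuFFT example I
    // requires <cuda_runtime.h> and <cufft.h>;
    // NX (transform length) and BATCH (number of transforms) are assumed to be #defined elsewhere
    cufftHandle plan;
    cufftComplex *devPtr;
    cufftComplex data[NX*BATCH];
    int i;
    // create the source data
    for (i = 0; i < NX*BATCH; i++) {
        data[i].x = 1.0f;
        data[i].y = 1.0f;
    }
    // allocate memory on the GPU
    cudaMalloc((void**)&devPtr, sizeof(cufftComplex)*NX*BATCH);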

cuFFT example II
    // copy the data to the GPU
    cudaMemcpy(devPtr, data, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);
    // create a 1D FFT plan
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);
    // the forward FFT itself (in place)
    cufftExecC2C(plan, devPtr, devPtr, CUFFT_FORWARD);
    // inverse FFT (cuFFT does not normalize, so the result is the input scaled by NX)
    cufftExecC2C(plan, devPtr, devPtr, CUFFT_INVERSE);
    // copy the data back from the GPU
    cudaMemcpy(data, devPtr, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);
    // release everything
    cufftDestroy(plan);
    cudaFree(devPtr);

cuRAND
- Library for random number generation
- Various types of generators are available

nvGRAPH
- For graph analytics and data analytics, using a sparse linear algebra approach
- Operations: graph construction, manipulation primitives, and a set of useful graph algorithms optimized for the GPU
- The core functionality is SpMV (sparse matrix-vector product)

cuSPARSE
- A subset of BLAS for sparse matrices
- Version 4.0 includes:
  - Sparse vector times dense vector operations
  - Sparse matrix times dense vector operations
  - Sparse matrix times a set of dense vectors operations
  - Conversion operations
  - Support for the dense, COO, CSR and CSC formats
  - In addition, a solver for sparse triangular matrices
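The sparse matrix times dense vector operation and a simple format conversion can be sketched on the CPU as follows. This is plain C that shows the data layout only, not the cuSPARSE API; the function names are made up for the example, and the conversion assumes that the COO entries are already sorted by row, so the column and value arrays can be reused unchanged.

```c
/* Build CSR row pointers (n_rows + 1 entries) from COO row indices
   that are already sorted by row. */
void coo_rows_to_csr(int n_rows, int nnz, const int *coo_row, int *row_ptr)
{
    for (int i = 0; i <= n_rows; i++)
        row_ptr[i] = 0;
    for (int k = 0; k < nnz; k++)
        row_ptr[coo_row[k] + 1]++;          /* count nonzeros per row */
    for (int i = 0; i < n_rows; i++)
        row_ptr[i + 1] += row_ptr[i];       /* running sum gives row offsets */
}

/* y = A*x for a matrix stored in CSR format. */
void csr_spmv(int n_rows, const int *row_ptr, const int *col_idx,
              const float *val, const float *x, float *y)
{
    for (int i = 0; i < n_rows; i++) {
        float sum = 0.0f;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];  /* only stored nonzeros are visited */
        y[i] = sum;
    }
}
```

On a GPU the outer loop over rows is the natural source of parallelism, since each row's dot product is independent of the others.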

NVIDIA Performance Primitives (NPP)
NVIDIA NPP is a library of functions focused mainly on imaging and video processing:
- arithmetic and logical operation functions
- color conversion and sampling functions
- JPEG compression and decompression functions
- data exchange and initialization functions
- filtering and computer vision functions
- geometry transformation functions
- morphological operations
- statistics and linear transforms
- memory support functions
- threshold and compare operation functions

Thrust
- Library of parallel algorithms for the GPU
- Object-oriented approach
- Interface similar to the C++ Standard Template Library (STL)
- Provides:
  - vector containers
  - Transformations
  - Reductions
  - Prefix sums
  - Reordering: partitioning, stream compaction
  - Sorting
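How these primitives combine can be shown with a sequential sketch in plain C. It does not use Thrust; the function names and the "keep positive elements" predicate are assumptions for illustration. Stream compaction is built from a transformation (compute keep/drop flags), an exclusive prefix sum (compute output positions), and a scatter, which is the same decomposition a parallel library would execute step by step on the GPU.

```c
#include <stddef.h>

/* Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], out[0] = 0.
   Returns the sum of all elements. */
int exclusive_scan(const int *in, int *out, size_t n)
{
    int running = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = running;
        running += in[i];
    }
    return running;
}

/* Stream compaction keeping the positive elements of in.
   flags and pos are caller-provided scratch arrays of length n.
   Returns the number of elements written to out. */
size_t compact_positive(const int *in, int *out, int *flags, int *pos, size_t n)
{
    for (size_t i = 0; i < n; i++)
        flags[i] = in[i] > 0;                      /* transformation */
    size_t kept = (size_t)exclusive_scan(flags, pos, n);  /* prefix sum */
    for (size_t i = 0; i < n; i++)
        if (flags[i])
            out[pos[i]] = in[i];                   /* scatter to final position */
    return kept;
}
```

Because every element's output position is known after the scan, the final scatter has no dependencies between elements, which is what makes the pattern suitable for parallel execution.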

cuSOLVER
- The cuSOLVER library is a high-level package based on the cuBLAS and cuSPARSE libraries.
- The intent of cuSOLVER is to provide useful LAPACK-like features, such as common matrix factorization and triangular solve routines for dense matrices, a sparse least-squares solver and an eigenvalue solver. In addition, cuSOLVER provides a new refactorization library useful for solving sequences of matrices with a shared sparsity pattern.
- The first part of cuSOLVER is called cuSolverDN, and deals with dense matrix factorization and solve routines such as LU, QR, SVD and LDLT, as well as useful utilities such as matrix and vector permutations.
- Next, cuSolverSP provides a new set of sparse routines based on a sparse QR factorization. Not all matrices have a good sparsity pattern for parallelism in factorization, so the cuSolverSP library also provides a CPU path to handle those sequential-like matrices. For those matrices with abundant parallelism, the GPU path will deliver higher performance. The library is designed to be called from C and C++.
- The final part is cuSolverRF, a sparse re-factorization package that can provide very good performance when solving a sequence of matrices where only the coefficients are changed but the sparsity pattern remains the same.

CUDPP
- CUDPP = CUDA Data Parallel Primitives Library
- Library of primitive functions suitable as building blocks for more complex parallel algorithms
- The project is not very active
- Contains:
  - parallel prefix-sum operation
  - parallel sorting
  - parallel reduction
  - hash table operations
  - etc.

Other libraries/projects
- GPU AI Board Games
- GPU AI Path Finding
- OpenCV: Open Source Computer Vision Library
- OpenFOAM: open source CFD toolbox
- cuDNN: a GPU-accelerated library of primitives for deep neural networks
- GPULib: for accelerating general-purpose scientific computations from within the Interactive Data Language (IDL). GPULib provides basic arithmetic, array indexing, special functions, Fast Fourier Transforms (FFT), interpolation, BLAS matrix operations as well as LAPACK routines provided by MAGMA, and some image processing operations.
- Allinea DDT debugger
- TotalView
- IMSL Fortran Numerical Library

CUDA Math library
- Complete support for all C99 standard float and double math functions
- IEEE-754 accurate for float, double, and all rounding modes
- Extended trigonometry and exponential functions: cospi, sincos, sinpi, exp10
- Additional inverse error functions: erfinv, erfcinv
- Optimized reciprocal functions: rsqrt, rcbrt
- Floating-point data attributes: signbit, isfinite, isinf, isnan
- Bessel functions: j0, j1, jn, y0, y1, yn
- Statistics: normcdf, normcdfinv

CUSP
- CUSP = library for computations on sparse matrices and for graph algorithms
- The project is inactive
- http://cusplibrary.github.io/
- Graphs: bfs, connected_components, hilbert_curve, maximum_flow, max_flow_to_min_cut, pseudo_peripheral_vertex, symmetric_rcm
- Iterative solvers:
  - Conjugate Gradient method
  - BiCGstab
  - GMRES
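As an illustration of what such an iterative solver does, here is a minimal sequential sketch of the unpreconditioned Conjugate Gradient method on a CSR matrix, in plain C. It is not CUSP code: the function names, the fixed `CG_MAX_N` limit, and the absolute residual stopping test are assumptions of this sketch. The matrix is assumed to be symmetric positive definite.

```c
#define CG_MAX_N 64  /* sketch-only limit so the work vectors fit on the stack */

/* y = A*x for A stored in CSR format. */
static void csr_matvec(int n, const int *row_ptr, const int *col_idx,
                       const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

static double dot(int n, const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Solves A*x = b. x holds the initial guess on entry and the approximate
   solution on exit. Stops when ||r|| <= tol or after max_iter iterations.
   Returns the number of iterations performed, or -1 if n > CG_MAX_N. */
int cg_solve(int n, const int *row_ptr, const int *col_idx, const double *val,
             const double *b, double *x, double tol, int max_iter)
{
    double r[CG_MAX_N], p[CG_MAX_N], Ap[CG_MAX_N];
    if (n > CG_MAX_N)
        return -1;

    csr_matvec(n, row_ptr, col_idx, val, x, Ap);
    for (int i = 0; i < n; i++) {
        r[i] = b[i] - Ap[i];   /* initial residual */
        p[i] = r[i];           /* first search direction */
    }
    double rr = dot(n, r, r);

    int it = 0;
    while (it < max_iter && rr > tol * tol) {
        csr_matvec(n, row_ptr, col_idx, val, p, Ap);
        double alpha = rr / dot(n, p, Ap);       /* step length along p */
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;               /* keeps directions A-conjugate */
        for (int i = 0; i < n; i++)
            p[i] = r[i] + beta * p[i];
        rr = rr_new;
        it++;
    }
    return it;
}
```

Each iteration consists of one sparse matrix-vector product, two dot products and a few vector updates, which are exactly the operations GPU libraries provide as building blocks.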

Libraries for linear algebra I
AmgX
- Flexible configuration allows for nested solvers, smoothers, and preconditioners
- Ruge-Stüben algebraic multigrid
- Un-smoothed aggregation algebraic multigrid
- Krylov methods: PCG, GMRES, BiCGStab, and flexible variants
- Smoothers: Block-Jacobi, Gauss-Seidel, incomplete LU, polynomial, dense LU
- Scalar or coupled block systems
- MPI support
- OpenMP support
- Flexible and simple high-level C API

Libraries for linear algebra II
MAGMA (Matrix Algebra on GPU and Multicore Architectures)
- http://icl.cs.utk.edu/magma/software
- Analogous to ScaLAPACK, but for hybrid computations (a combination of CPU, GPU, MIC)
- Supports CUDA, OpenCL, Intel Xeon Phi

Libraries for linear algebra III
The SpeedIT Tools library
- Iterative solvers of systems of linear equations for various types of matrices
- Many preconditioners
- The project has been discontinued
- http://speedit.vratis.com/

Libraries for linear algebra IV
ViennaCL
- A free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs
- The library is written in C++ and supports CUDA, OpenCL, and OpenMP
- Last update: January 2016
- http://viennacl.sourceforge.net/

Libraries for linear algebra V
ArrayFire (an OpenCL version also exists)
- https://arrayfire.com/
- For massively parallel architectures including CPUs, GPUs, and other hardware acceleration devices
- For C, C++, Fortran, Python
- A library for:
  - Computer Vision
  - Functions to create and modify Arrays = array constructors, random number generation, transpose, indexing, etc.
  - Image Processing = image filtering, morphing and transformations
  - Linear Algebra = matrix multiply, solve, decompositions, sparse matrix
  - Mathematical functions = functions from the standard math library
  - Signal Processing = convolutions, FFTs, filters
  - Statistics
  - Vector Algorithms = sum, min, max, sort, set operations, etc.

Libraries for linear algebra VI
CULAtools
- http://www.culatools.com/
- A set of GPU-accelerated linear algebra libraries utilizing the NVIDIA CUDA parallel computing architecture
- The project was discontinued in 2013
- Library of LAPACK functions for:
  - sparse matrices
  - dense matrices

Regular expressions
- Libraries for regular expressions are also available, e.g. CUDAgrep (inactive since 2015)
- https://github.com/bkase/cuda-grep
- A nondeterministic finite automaton (NFA) is used to represent the regular expression, as follows:
    Load dataset and regular expressions
    Compute NFA from regular expression
    for all regular expressions
        for all lines in dataset
            use NFA to pattern match each line
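The matching step of this scheme can be sketched sequentially in plain C. This is not CUDAgrep code: the `Nfa` and `NfaEdge` types and the function names are made up for the example, the NFA has no epsilon transitions, and the step "compute NFA from regular expression" is skipped by hand-coding the automaton for the pattern `ab*c`. The set of active states is kept in a bitmask, so the sketch supports at most 32 states.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int from; char symbol; int to; } NfaEdge;

typedef struct {
    const NfaEdge *edges;
    size_t n_edges;
    int start;
    int accept;
} Nfa;

/* Hand-built NFA for the pattern ab*c: 0 --a--> 1, 1 --b--> 1, 1 --c--> 2. */
static const NfaEdge AB_STAR_C_EDGES[] = { {0, 'a', 1}, {1, 'b', 1}, {1, 'c', 2} };
static const Nfa AB_STAR_C = { AB_STAR_C_EDGES, 3, 0, 2 };

/* True if some substring of line is accepted by the NFA (grep-like search). */
bool nfa_search(const Nfa *nfa, const char *line)
{
    unsigned active = 1u << nfa->start;
    for (const char *c = line; *c != '\0'; c++) {
        unsigned next = 0;
        for (size_t e = 0; e < nfa->n_edges; e++) {
            const NfaEdge *edge = &nfa->edges[e];
            if (((active >> edge->from) & 1u) && edge->symbol == *c)
                next |= 1u << edge->to;        /* follow every enabled edge */
        }
        if ((next >> nfa->accept) & 1u)
            return true;
        active = next | (1u << nfa->start);    /* a match may start at any position */
    }
    return false;
}

/* The loop over lines; every line is independent of the others. */
size_t count_matching_lines(const Nfa *nfa, const char *const *lines, size_t n_lines)
{
    size_t count = 0;
    for (size_t i = 0; i < n_lines; i++)
        if (nfa_search(nfa, lines[i]))
            count++;
    return count;
}
```

Since each line is matched independently, the loop in `count_matching_lines` is the part that a GPU implementation would distribute across threads.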

Emulation etc.
- Emulators
  - Older SDK versions could generate code for emulating CUDA on the CPU (no longer supported)
  - Ocelot (project discontinued in 2014)
  - MCUDA: the MCUDA translation framework is a Linux-based tool designed to effectively compile the CUDA programming model to a CPU architecture. Project status unknown.
- rCUDA: Remote CUDA
- WebGPU: an online GPU programming environment used in various online courses offered by the University of Illinois and the IMPACT group. It is a scalable online GPU programming environment accessible through the web.