GPU Offloading and Heterogeneous Applications
Martin Kruliš (v 1.2), 15. 11. 2021
Heterogeneous Programming
• GPU
  ◦ "Independent" device
  ◦ Controlled by the host
  ◦ Used for "offloading"
• Host Code
  ◦ Needs to be designed in a way that
    ▪ utilizes GPU(s) efficiently,
    ▪ utilizes the CPU while the GPU is working,
    ▪ ensures the CPU and GPU do not wait for each other.
Heterogeneous Programming
• Bad Example (CPU vs. GPU timeline)

    cudaMemcpy(..., cudaMemcpyHostToDevice);
    Kernel1<<<...>>>(...);
    cudaDeviceSynchronize();  // actually not necessary (the following cudaMemcpy synchronizes)
    cudaMemcpy(..., cudaMemcpyDeviceToHost);
    ...
    cudaMemcpy(..., cudaMemcpyHostToDevice);
    Kernel2<<<...>>>(...);
    cudaDeviceSynchronize();  // actually not necessary
    cudaMemcpy(..., cudaMemcpyDeviceToHost);

• In this layout the CPU only waits while the device is working, and the device idles while the CPU runs.
Overlapping Work
• Overlapping CPU and GPU work
  ◦ Kernels
    ▪ Started asynchronously (with respect to the host code)
    ▪ Can be waited for (cudaDeviceSynchronize())
    ▪ A little more can be done with streams
  ◦ Memory transfers
    ▪ cudaMemcpy() is synchronous and blocking
    ▪ Alternatively, cudaMemcpyAsync() starts the transfer and returns immediately
    ▪ Can be synchronized the same way as a kernel
Overlapping Work
• Using Asynchronous Transfers

    cudaMemcpyAsync(..., cudaMemcpyHostToDevice);
    Kernel1<<<...>>>(...);
    cudaMemcpyAsync(..., cudaMemcpyDeviceToHost);
    ...
    do_something_on_cpu();
    ...
    cudaDeviceSynchronize();

• Workload balance between the CPU and the GPU becomes an issue
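A minimal compilable sketch of this pattern; the kernel, buffer size, and the CPU-side work are illustrative placeholders:

    // overlap.cu -- CPU work proceeds while the GPU copies and computes
    #include <cuda_runtime.h>

    __global__ void addOne(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    void do_something_on_cpu() { /* independent CPU work goes here */ }

    int main() {
        const int n = 1 << 20;
        float *host, *dev;
        cudaHostAlloc((void **)&host, n * sizeof(float), cudaHostAllocDefault);  // pinned => truly async copies
        cudaMalloc(&dev, n * sizeof(float));

        cudaMemcpyAsync(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        addOne<<<(n + 255) / 256, 256>>>(dev, n);
        cudaMemcpyAsync(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

        do_something_on_cpu();    // overlaps with the GPU commands above

        cudaDeviceSynchronize();  // wait for both transfers and the kernel
        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }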
Overlapping Work
• Multiprocessing
  ◦ Multiple CPU threads/processes may use the GPU
  ◦ Actually a bit tricky in the case of threads…
• GPU Overlapping Capabilities
  ◦ Multiple kernels may run simultaneously
    ▪ Since the Fermi architecture
    ▪ cudaDeviceProp.concurrentKernels
  ◦ Kernel execution may overlap with data transfers
    ▪ Or even with multiple data transfers
    ▪ cudaDeviceProp.asyncEngineCount
Streams
• Stream
  ◦ In-order GPU command queue
    ▪ Asynchronous GPU operations (kernel executions, memory transfers) are registered in the queue
    ▪ Commands in different streams may overlap
    ▪ Provides means for explicit and implicit synchronization
  ◦ Default stream (stream 0)
    ▪ Always present, does not have to be created
    ▪ Global synchronization capabilities
    ▪ Shared among threads by default
Streams
• Stream Creation

    cudaStream_t stream;
    cudaStreamCreate(&stream);

• Stream Usage

    cudaMemcpyAsync(dst, src, size, kind, stream);
    kernel<<<grid, block, sharedMem, stream>>>(...);

• Stream Destruction

    cudaStreamDestroy(stream);
Streams
• Default Stream
  ◦ Global synchronization when an operation is enqueued
    ▪ All previous operations in all streams on the same device must finish first
    ▪ The operation must finish before any other operation (in any other stream on the device) is started
  ◦ Can be replicated per thread (not shared)
    ▪ Individual replicas do not synchronize among themselves
    ▪ Use the --default-stream per-thread compiler option, or
    ▪ #define CUDA_API_PER_THREAD_DEFAULT_STREAM (see the sketch below)
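For example, assuming nvcc, the per-thread default stream can be enabled either way (note that the macro must be defined before any CUDA header is included, so for .cu files compiled by nvcc it is passed on the command line):

    // 1) compiler option:
    //      nvcc --default-stream per-thread app.cu
    // 2) or the preprocessor macro:
    //      nvcc -DCUDA_API_PER_THREAD_DEFAULT_STREAM app.cu
    //    in plain host code, the equivalent is:
    #define CUDA_API_PER_THREAD_DEFAULT_STREAM
    #include <cuda_runtime.h>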
Streams
• Synchronization
  ◦ Explicit
    ▪ cudaStreamSynchronize(stream) – waits until all commands issued to the stream have completed
    ▪ cudaStreamQuery(stream) – a non-blocking test of whether the stream has finished (see the sketch below)
  ◦ Implicit
    ▪ Operations in different streams cannot overlap if a special operation is issued between them:
    ▪ a memory allocation,
    ▪ a switch of the L1/shared memory configuration,
    ▪ a CUDA command issued to the default stream.
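A small sketch of non-blocking polling with cudaStreamQuery(); the CPU-side work function is a hypothetical placeholder:

    // poll the stream instead of blocking on it
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        do_some_cpu_work();  // hypothetical CPU-side work
    }
    // here the stream has finished (or an error was reported)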
Streams
• Overlapping Behavior
  ◦ Commands in different streams overlap if the hardware is capable of running them concurrently
  ◦ Unless implicit/explicit synchronization prohibits it

    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);
        MyKernel<<<g, b, 0, stream[i]>>>(...);
        cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);
    }

  ◦ May incur many implicit synchronizations, depending on the compute capability and the hardware overlapping capabilities
Streams
• Overlapping Behavior
  ◦ Commands in different streams overlap if the hardware is capable of running them concurrently
  ◦ Unless implicit/explicit synchronization prohibits it

    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < 2; ++i)
        MyKernel<<<g, b, 0, stream[i]>>>(...);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);

  ◦ Far fewer opportunities for implicit synchronization
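A compilable sketch of this breadth-first issue order; MyKernel, the buffer layout, and sizes are illustrative assumptions (the host buffers should be pinned so the copies are truly asynchronous):

    #include <cuda_runtime.h>

    __global__ void MyKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void processTwoStreams(float *hostIn[2], float *hostOut[2],
                           float *devBuf[2], int n) {
        cudaStream_t stream[2];
        for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);

        for (int i = 0; i < 2; ++i)   // issue all uploads first
            cudaMemcpyAsync(devBuf[i], hostIn[i], n * sizeof(float),
                            cudaMemcpyHostToDevice, stream[i]);
        for (int i = 0; i < 2; ++i)   // then all kernels
            MyKernel<<<(n + 255) / 256, 256, 0, stream[i]>>>(devBuf[i], n);
        for (int i = 0; i < 2; ++i)   // then all downloads
            cudaMemcpyAsync(hostOut[i], devBuf[i], n * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[i]);

        for (int i = 0; i < 2; ++i) {
            cudaStreamSynchronize(stream[i]);
            cudaStreamDestroy(stream[i]);
        }
    }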
Streams
• Callbacks
  ◦ Callbacks are registered in streams by cudaStreamAddCallback(stream, fnc, data, 0);
  ◦ The callback function is invoked asynchronously after all preceding commands terminate
  ◦ A callback registered in the default stream is invoked after the previous commands in all streams terminate
  ◦ Operations issued after the callback registration start after the callback returns
  ◦ The callback looks like
    void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t errorStatus, void *userData) { ... }
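A hedged example of a callback and its registration; the payload (a chunk index) and the surrounding enqueue logic are our own assumptions. Note that the callback runs on a runtime-managed thread and must not invoke CUDA API functions:

    #include <cuda_runtime.h>
    #include <cstdio>

    // invoked after all preceding commands in the stream have finished
    void CUDART_CB chunkDone(cudaStream_t stream, cudaError_t status, void *userData) {
        int chunk = *static_cast<int *>(userData);
        printf("chunk %d finished: %s\n", chunk, cudaGetErrorString(status));
    }

    void enqueueChunk(cudaStream_t stream, int *chunkId) {
        // ... enqueue cudaMemcpyAsync() calls / kernels into the stream here ...
        cudaStreamAddCallback(stream, chunkDone, chunkId, 0);  // flags must be 0
    }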
Streams
• Events
  ◦ Special markers that can be used for synchronization and performance monitoring
  ◦ The typical usage is
    ▪ waiting until all commands issued before the marker finish,
    ▪ explicit synchronization between selected streams,
    ▪ measuring the time between two events.
  ◦ Example

    cudaEvent_t event;
    cudaEventCreate(&event);
    cudaEventRecord(event, stream);
    cudaEventSynchronize(event);
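For instance, timing a kernel with a pair of events; the kernel, its launch configuration, and otherStream are placeholders:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    MyKernel<<<grid, block, 0, stream>>>(...);
    cudaEventRecord(stop, stream);

    cudaEventSynchronize(stop);              // wait until the stop marker is reached
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

    // cross-stream synchronization: make another stream wait for the event
    cudaStreamWaitEvent(otherStream, stop, 0);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);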
Multi-GPU Programming
• cudaSetDevice() affects
  ◦ memory allocations,
  ◦ kernel executions,
  ◦ stream and event creation.
• Using streams
  ◦ Set the device before the stream is created
  ◦ Set the same device before issuing any operation to that stream (see the sketch below)
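A hedged multi-GPU sketch; the kernel, host buffers, size, and launch configuration are assumptions (requires #include <vector> and <cuda_runtime.h>):

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<cudaStream_t> stream(deviceCount);
    std::vector<float *> devBuf(deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);              // subsequent calls target device d
        cudaStreamCreate(&stream[d]);  // the stream is bound to device d
        cudaMalloc(&devBuf[d], size);  // the allocation lives on device d
        cudaMemcpyAsync(devBuf[d], hostBuf[d], size,
                        cudaMemcpyHostToDevice, stream[d]);
        MyKernel<<<grid, block, 0, stream[d]>>>(devBuf[d]);
    }
    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);              // must match the stream's device
        cudaStreamSynchronize(stream[d]);
    }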
Asynchronous Error Handling
• Asynchronous Errors
  ◦ An error may occur outside of a CUDA call
    ▪ In the case of asynchronous memory transfers or kernel execution
  ◦ The error is reported by the next CUDA call
  ◦ To make sure all errors have been reported, the device must synchronize (cudaDeviceSynchronize())
  ◦ Error handling functions
    ▪ cudaGetLastError()
    ▪ cudaPeekAtLastError()
    ▪ cudaGetErrorString(error)
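A common checking idiom built from these functions; the CUDA_CHECK macro name and the kernel are our own illustrative choices:

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    // abort with a readable message if a CUDA call failed
    #define CUDA_CHECK(call) do { \
            cudaError_t err_ = (call); \
            if (err_ != cudaSuccess) { \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                        cudaGetErrorString(err_)); \
                exit(1); \
            } \
        } while (0)

    // kernel launches return no error code; check them in two steps
    MyKernel<<<grid, block>>>(...);
    CUDA_CHECK(cudaPeekAtLastError());    // launch-time (synchronous) errors
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised during execution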
Pipelining
• Making Good Use of Overlapping
  ◦ Split the work into smaller fragments
  ◦ Create a pipeline effect (load, process, store), as sketched below
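A hedged sketch of such a pipeline over two streams; the process kernel, the chunk layout, and the double-buffering scheme are illustrative assumptions (the host buffers are expected to be pinned):

    // each chunk flows through load -> process -> store;
    // chunk i+1's upload overlaps chunk i's kernel thanks to the two streams
    const int CHUNKS = 8;
    for (int i = 0; i < CHUNKS; ++i) {
        int s = i % 2;  // round-robin between two streams / buffer pairs
        cudaMemcpyAsync(devIn[s], hostIn + i * chunkLen, chunkBytes,
                        cudaMemcpyHostToDevice, stream[s]);           // load
        process<<<grid, block, 0, stream[s]>>>(devIn[s], devOut[s]);  // process
        cudaMemcpyAsync(hostOut + i * chunkLen, devOut[s], chunkBytes,
                        cudaMemcpyDeviceToHost, stream[s]);           // store
    }
    cudaDeviceSynchronize();  // drain the pipeline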
Feeding Threads
• Data Gather and Scatter Problem
  ◦ [diagram: input data is gathered from host memory, transferred to GPU memory for kernel execution, and the results are scattered back to host memory]
  ◦ Multiple cudaMemcpy() calls may be quite inefficient
Feeding Threads
• Gather and Scatter
  ◦ Reducing overhead
  ◦ Performed by the CPU before/after cudaMemcpy()
  ◦ [diagram: the main thread alternates gathering and scattering while streams 0 and 1 each run an HtD copy, kernel, DtH copy sequence]
  ◦ The number of threads per GPU and the number of streams per thread depend on the workload structure
Page-locked Memory
• Page-locked (Pinned) Host Memory
  ◦ Host memory that is prevented from being swapped out
  ◦ Created/dismissed by
    ▪ cudaHostAlloc(), cudaFreeHost()
    ▪ cudaHostRegister(), cudaHostUnregister()
  ◦ Optionally with flags
    ▪ cudaHostAllocWriteCombined – optimized for writing, not cached on the CPU
    ▪ cudaHostAllocMapped
    ▪ cudaHostAllocPortable
  ◦ Copies between pinned host memory and the device are automatically performed asynchronously
  ◦ Pinned memory is a scarce resource
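Both allocation styles in a short sketch; the buffer names and sizes are illustrative:

    float *pinned = nullptr;
    // allocate fresh pinned memory; copies from/to it can overlap with GPU work
    cudaHostAlloc((void **)&pinned, size, cudaHostAllocDefault);
    cudaMemcpyAsync(devBuf, pinned, size, cudaMemcpyHostToDevice, stream);
    // ...
    cudaFreeHost(pinned);

    // alternatively, pin an existing (regular) buffer in place
    cudaHostRegister(existingBuf, size, cudaHostRegisterDefault);
    // ...
    cudaHostUnregister(existingBuf);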
Memory Mapping
• Device Memory Mapping
  ◦ Allows the GPU to access portions of host memory directly (i.e., without explicit copy operations)
    ▪ For both reading and writing
  ◦ The memory must be allocated/registered with the cudaHostAllocMapped flag
  ◦ The context must have the cudaDeviceMapHost flag (set by cudaSetDeviceFlags())
  ◦ The function cudaHostGetDevicePointer() takes a host pointer and returns the corresponding device pointer
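Putting these steps together in a hedged sketch; the kernel, its launch configuration, and size are placeholders:

    cudaSetDeviceFlags(cudaDeviceMapHost);  // before the context is created
    float *hostPtr = nullptr, *devPtr = nullptr;
    cudaHostAlloc((void **)&hostPtr, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);  // flags must be 0

    MyKernel<<<grid, block>>>(devPtr);  // reads/writes host memory over the bus
    cudaDeviceSynchronize();            // make the GPU writes visible to the host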
More Memory Functions
• Scatter/gather operations
  ◦ Copying fragments of 2D (or 3D) data layouts
  ◦ cudaMemcpy2D(), cudaMemcpy3D()
• Peer-to-peer data transfers
  ◦ Direct transfers between two GPUs in one system
  ◦ cudaMemcpyPeer(), cudaMemcpyPeerAsync()
• Initialization
  ◦ cudaMemset()
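For instance, a hedged cudaMemcpy2D() sketch; the pointers, pitches, and element type are assumptions:

    // copy a width x height block of floats between pitched 2D layouts;
    // dpitch/spitch are the byte lengths of whole rows in dst/src
    cudaMemcpy2D(dst, dpitch,
                 src, spitch,
                 width * sizeof(float), height,
                 cudaMemcpyHostToDevice);

    // byte-wise initialization of a device buffer (like memset)
    cudaMemset(devBuf, 0, size);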
Unified Memory
• Memory space accessible from the host and GPUs
  ◦ Requires special allocation by cudaMallocManaged()
  ◦ No explicit transfers; data are moved on demand
  ◦ Better handling on Pascal and later architectures
    ▪ Support for virtual pages and page faults
    ▪ Memory does not have to be allocated immediately
  ◦ cudaMemAdvise() – sets the preferred location (and possibly other indicators)
  ◦ cudaMemPrefetchAsync() – similar to cudaMemcpy, but works with unified memory
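A short sketch of the managed-memory workflow; the kernel and sizes are illustrative (the advise/prefetch calls assume a Pascal-class or newer GPU):

    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // visible to host and device

    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // touched on the host first

    int device = 0;
    cudaMemAdvise(data, n * sizeof(float),
                  cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, n * sizeof(float), device);  // move pages ahead of the kernel

    MyKernel<<<grid, block>>>(data, n);
    cudaDeviceSynchronize();  // required before the host touches data again
    cudaFree(data);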
NVLink
• A new, fast interconnect intended to replace PCIe
Related Technologies
• GPUDirect
  ◦ RDMA transfers for GPU memory
• Unified Communication X (UCX)
• NVIDIA Collective Communications Library
• NVIDIA OpenSHMEM
  ◦ Works over multiple transport devices (e.g., InfiniBand)
  ◦ Similar to MPI – various transfer patterns (gather, scatter, all-to-all, …)
  ◦ Also similar to MPI in being based on PGAS
Discussion