PTask: Operating System Abstractions to Manage GPUs as Compute Devices
Chris Rossbach, Jon Currey (Microsoft Research)
Mark Silberstein (Technion)
Baishakhi Ray, Emmett Witchel (UT Austin)
SOSP, October 25, 2011

Motivation
There are lots of GPUs
◦ 3 of top 5 supercomputers use GPUs
◦ In all new PCs, smart phones, tablets
Great for gaming and HPC/batch
Unusable in other application domains
GPU programming challenges
◦ GPU+main memory disjoint
◦ Treated as I/O device by OS
PTask SOSP 2011 2

Motivation
There are lots of GPUs
◦ 3 of top 5 supercomputers use GPUs
◦ In all new PCs, smart phones, tablets
Great for gaming and HPC/batch
Unusable in other application domains
GPU programming challenges
◦ GPU+main memory disjoint
◦ Treated as I/O device by OS
These two things are related: we need OS abstractions

Outline
The case for OS support
PTask: Dataflow for GPUs
Evaluation
Related Work
Conclusion

Traditional OS-level abstractions
[Figure: layered stack with the programmer-visible interface on top, OS-level abstractions in the middle, and the hardware interface at the bottom]
1:1 correspondence between OS-level and user-level abstractions

GPU Abstractions
[Figure: technology stack with the programmer-visible interface (GPGPU APIs, shaders/kernels, language integration) on top of the DirectX/CUDA/OpenCL runtime]
1 OS-level abstraction!
1. No kernel-facing API
2. No OS resource-management
3. Poor composability

CPU-bound processes hurt GPUs
[Chart: GPU benchmark throughput in invocations per second (y-axis 0 to 1200, higher is better), no CPU load vs. high CPU load]
CPU scheduler and GPU scheduler not integrated!
• Image-convolution in CUDA
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT 230

GPU-bound processes hurt CPUs
[Chart: flatter lines are better]
OS cannot prioritize cursor updates
• WDDM + DWM + CUDA == dysfunction
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT 230

Composition: Gestural Interface
[Figure: pipeline from raw images to “Hand” events: capture, xform, filter, detect]
◦ capture: capture camera images (raw images)
◦ xform: geometric transformation (noisy point cloud)
◦ filter: noise filtering
◦ detect: detect gestures (“Hand” events)
High data rates
Data-parallel algorithms … good fit for GPU
NOT Kinect: this is a harder problem!

What We’d Like To Do
#> capture | xform | filter | detect &
[Figure: each pipeline stage is labeled CPU or GPU]
Modular design: flexibility, reuse
Utilize heterogeneous hardware
◦ Data-parallel components: GPU
◦ Sequential components: CPU
Using OS provided tools: processes, pipes

GPU Execution model
GPUs cannot run OS: different ISA
Disjoint memory space, no coherence
Host CPU must manage GPU execution
◦ Program inputs explicitly transferred/bound at runtime
◦ Device buffers pre-allocated
User-mode apps must implement
[Figure: CPU with main memory and GPU with GPU memory; the host copies inputs, sends commands, and copies outputs]

Data migration
#> capture | xform | filter | detect &
[Figure: data path of the pipeline across the user, kernel (OS executive), driver, and HW layers. In user mode, the stages capture, xform, filter, and detect exchange data with read() and write(); GPU stages also copy to the GPU and copy from the GPU. Below them sit camdrv, the GPU driver, and HIDdrv, with IRPs in the kernel and a PCI-xfer for each copy around the GPU "Run!" step.]
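The cost of this data path can be tallied with a small model. The sketch below is illustrative only: the `Stage` type, both function names, and the placement of xform and filter on the GPU are assumptions made for the example, and the counts are per data item in a simplified model, not measurements from the talk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    on_gpu: bool  # assumption: which stages use the GPU


# Hypothetical placement for illustration.
PIPELINE = [
    Stage("capture", False),
    Stage("xform", True),
    Stage("filter", True),
    Stage("detect", False),
]


def pipe_composed_cost(stages: List[Stage]) -> Tuple[int, int]:
    """Count (pipe read/write calls, PCI transfers) when every stage is a
    separate process connected by pipes."""
    pipe_calls = 0
    pci_transfers = 0
    for i, stage in enumerate(stages):
        if i > 0:
            pipe_calls += 1      # read() from the upstream pipe
        if i < len(stages) - 1:
            pipe_calls += 1      # write() to the downstream pipe
        if stage.on_gpu:
            pci_transfers += 2   # copy to GPU, then copy from GPU
    return pipe_calls, pci_transfers


def device_resident_pci_transfers(stages: List[Stage]) -> int:
    """Count PCI transfers if data stays in GPU memory between adjacent
    GPU stages."""
    transfers = 0
    prev_on_gpu = False
    for stage in stages:
        if stage.on_gpu != prev_on_gpu:
            transfers += 1       # data crosses the CPU/GPU boundary
        prev_on_gpu = stage.on_gpu
    if prev_on_gpu:
        transfers += 1           # final result copied back to main memory
    return transfers
```

Under these assumptions the pipe-composed pipeline issues 6 pipe calls and 4 PCI transfers per item, while keeping data resident on the device between xform and filter needs only 2 PCI transfers.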

GPUs need better OS abstractions
GPU analogues for:
◦ Process API
◦ IPC API
◦ Scheduler hints
Abstractions that enable:
◦ Fairness/isolation
◦ OS use of GPU
◦ Composition/data movement optimization

PTask OS abstractions: dataflow!
ptask (parallel task)
◦ Has priority for fairness
◦ Analogous to a process for GPU execution
◦ List of input/output resources (e.g. stdin, stdout…)
ports
◦ Can be mapped to ptask input/outputs
◦ A data source or sink
channels
◦ Similar to pipes, connect arbitrary ports
◦ Specialize to eliminate double-buffering
graph
◦ DAG: connected ptasks, ports, channels
datablocks
◦ Memory-space transparent buffers
• OS objects: OS RM possible
• data: specify where, not how
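To make the vocabulary concrete, here is a toy in-process model of ptasks, ports, and channels. It is a sketch and not the PTask API: the class names, the dictionary-based kernel signature, and `run_demo` are all hypothetical, and the "kernels" are plain Python functions rather than GPU code. Port names are borrowed from the gestural-interface example.

```python
from collections import deque


class Channel:
    """Pipe-like FIFO that connects an output port to an input port."""

    def __init__(self):
        self.queue = deque()

    def push(self, block):
        self.queue.append(block)

    def pull(self):
        return self.queue.popleft()

    def ready(self):
        return len(self.queue) > 0


class PTask:
    """A unit of computation with a priority and named input/output ports."""

    def __init__(self, name, kernel, priority=0):
        self.name = name
        self.kernel = kernel      # callable: dict of inputs -> dict of outputs
        self.priority = priority
        self.inputs = {}          # port name -> Channel
        self.outputs = {}         # port name -> Channel

    def ready(self):
        return bool(self.inputs) and all(c.ready() for c in self.inputs.values())

    def fire(self):
        args = {port: ch.pull() for port, ch in self.inputs.items()}
        results = self.kernel(args)
        for port, ch in self.outputs.items():
            ch.push(results[port])


class Graph:
    """Connected ptasks; data arrival is what makes a ptask runnable."""

    def __init__(self):
        self.tasks = []

    def add(self, task):
        self.tasks.append(task)
        return task

    def connect(self, src, out_port, dst, in_port):
        channel = Channel()
        src.outputs[out_port] = channel
        dst.inputs[in_port] = channel
        return channel

    def run(self):
        """Fire ready ptasks, highest priority first, until none is ready."""
        fired = []
        while True:
            ready = [t for t in self.tasks if t.ready()]
            if not ready:
                return fired
            task = max(ready, key=lambda t: t.priority)
            task.fire()
            fired.append(task.name)


def run_demo():
    graph = Graph()
    xform = graph.add(PTask("xform", lambda a: {"cloud": [p * 2 for p in a["rawimg"]]}))
    filt = graph.add(PTask("filter", lambda a: {"f-out": [p for p in a["f-in"] if p > 2]}))
    source, sink = Channel(), Channel()
    xform.inputs["rawimg"] = source
    graph.connect(xform, "cloud", filt, "f-in")
    filt.outputs["f-out"] = sink
    source.push([1, 2, 3])        # data arrival triggers computation
    return graph.run(), sink.pull()
```

In this model nothing runs until a block is pushed onto the source channel; `run_demo()` then fires xform followed by filter, which mirrors the idea that the programmer specifies where data goes and the system decides how it gets there.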

PTask Graph: Gestural Interface
#> capture | xform | filter | detect &
[Figure: ptask graph connecting capture, xform, filter, and detect, with labels rawimg, cloud, f-in, and f-out and regions for mapped memory and GPU memory. Legend: process (CPU), ptask (GPU), port, channel, ptask graph, datablock.]
Optimized data movement
Data arrival triggers computation

PTask Scheduling
Graphs scheduled dynamically
◦ ptasks queue for dispatch when inputs ready
Queue: dynamic priority order
◦ ptask priority user-settable
◦ ptask prio normalized to OS prio
Transparently support multiple GPUs
◦ Schedule ptasks for input locality
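One way to picture a dynamic-priority ready queue with input locality is sketched below. Everything here is an assumption for illustration: the slides do not give the priority formula, so the aging rule, `AGING_WEIGHT`, and the names `ReadyTask`, `effective_priority`, `pick_gpu`, and `dispatch` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

AGING_WEIGHT = 1  # hypothetical: priority gained per tick spent waiting


@dataclass
class ReadyTask:
    name: str
    priority: int                 # user-settable ptask priority
    enqueued_at: int              # logical tick when all inputs became ready
    input_locations: List[str] = field(default_factory=list)  # memory space per input


def effective_priority(task: ReadyTask, now: int) -> int:
    """Static priority plus an aging term, so waiting tasks are not starved."""
    return task.priority + AGING_WEIGHT * (now - task.enqueued_at)


def pick_gpu(task: ReadyTask, idle_gpus: Set[str]) -> str:
    """Prefer the idle GPU that already holds the most inputs.
    Ties go to the alphabetically first GPU name."""
    return max(sorted(idle_gpus), key=lambda g: task.input_locations.count(g))


def dispatch(queue: List[ReadyTask], idle_gpus: Set[str], now: int) -> Optional[Tuple[str, str]]:
    """Remove the highest effective-priority task and assign it a GPU."""
    if not queue or not idle_gpus:
        return None
    task = max(queue, key=lambda t: effective_priority(t, now))
    queue.remove(task)
    return task.name, pick_gpu(task, idle_gpus)
```

With this rule a low-priority task that has waited long enough can overtake a newer high-priority one, and a task whose inputs already sit in one GPU's memory is sent to that GPU when it is idle.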

Location Transparency: Datablocks
Datablock
space   V   M   RW   data
main    1   1   11   Main Memory
gpu0    0   1   10   GPU 0 Memory
gpu1    1   1   10   GPU 1 Memory
…
Logical buffer
◦ backed by multiple physical buffers
◦ buffers created/updated lazily
◦ mem-mapping used to share across process boundaries
Track buffer validity per memory space
◦ writes invalidate other views
Flags for access control/data placement
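The validity tracking described here can be sketched as a small class. This is a toy model under stated assumptions, not the PTask datablock: the `Datablock` class below, its `read`/`write` methods, and the `copies` counter are hypothetical, and Python lists stand in for physical buffers in each memory space.

```python
class Datablock:
    """Logical buffer backed by one physical buffer per memory space."""

    def __init__(self, data, spaces=("main", "gpu0", "gpu1"), home="main"):
        self.buffers = {s: None for s in spaces}   # physical buffers, created lazily
        self.valid = {s: False for s in spaces}    # per-space validity flag
        self.buffers[home] = list(data)
        self.valid[home] = True
        self.copies = 0                            # transfers performed so far

    def read(self, space):
        """Return the contents as seen from `space`, copying only if that
        space's view is stale."""
        if not self.valid[space]:
            source = next(s for s, ok in self.valid.items() if ok)
            self.buffers[space] = list(self.buffers[source])
            self.valid[space] = True
            self.copies += 1
        return tuple(self.buffers[space])

    def write(self, space, data):
        """Update the view in `space` and invalidate every other view."""
        self.buffers[space] = list(data)
        for s in self.valid:
            self.valid[s] = (s == space)
```

A second read from the same space costs no copy because the view is already valid; a write in one space marks the other views stale, so the next read elsewhere triggers exactly one transfer.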

Datablock Action Zone
#> capture | xform | filter …
[Figure: animation of a datablock moving through the capture, xform, filter graph (labels rawimg, cloud, f-in), showing its V, M, and RW flags for the main and gpu memory spaces as data moves between main memory and GPU memory. Legend: process, ptask, port, channel, datablock.]

Revised technology stack
[Figure: technology stack with port and datablock abstractions at the OS level]
• 1-1 correspondence between programmer and OS abstractions
• GPU APIs can be built on top of new OS abstractions

Implementation
Windows 7
◦ Full PTask API implementation
◦ Stacked UMDF/KMDF driver
  • Kernel component: mem-mapping, signaling
  • User component: wraps DirectX, CUDA, OpenCL
◦ syscalls: DeviceIoControl() calls
Linux 2.6.33.2
◦ Changed OS scheduling to manage GPU
  • GPU accounting added to task_struct

Gestural Interface evaluation
Windows 7, Core 2 Quad, GTX 580 (EVGA)
Implementations
◦ pipes: capture | xform | filter | detect
◦ modular: capture+xform+filter+detect, 1 process
◦ handcode: data movement optimized, 1 process
◦ ptask: ptask graph
Configurations
◦ real-time: driven by cameras
◦ unconstrained: driven by in-memory playback

Gestural Interface Performance
[Chart: performance relative to handcode (y-axis 0 to 3.5, lower is better) for handcode, modular, pipes, and ptask; metrics include runtime, user, and sys]
compared to hand-code
• 11.6% higher throughput
• lower CPU util: no driver program
compared to pipes
• ~2.7x less CPU usage
• 16x higher throughput
• ~45% less memory usage
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• GTX 580 (EVGA)

Performance Isolation
[Chart: PTask invocations/second (y-axis 0 to 1600, higher is better) against PTask priority (2, 4, 6, 8); legend: fifo, priority, ptask]
• FIFO: queue invocations in arrival order
• ptask: aged priority queue w/ OS priority
• graphs: 6x6 matrix multiply
• priority same for every PTask node
PTask provides throughput proportional to priority
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• GTX 580 (EVGA)

Multi-GPU Scheduling
[Chart: speedup over 1 GPU (y-axis 0 to 1.8, higher is better) at depth-1, depth-2, depth-4, and depth-6 for priority and data-aware scheduling]
• Synthetic graphs: varying depths
• Data-aware == priority + locality
• Graph depth > 1 req. for any benefit
Data-aware provides best throughput, preserves priority
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• 2 x GTX 580 (EVGA)

Linux+EncFS Throughput
[Figure: software stack. user-prgs: R/W bnc, cuda-1, cuda-2; user-libs: EncFS, FUSE, libc; PTask; OS: Linux 2.6.33; HW: SSD1, SSD2, GPU, …]
Simple GPU usage accounting
• Restores performance
Columns: GPU/CPU, cuda-1 Linux, cuda-2 Linux, cuda-1 PTask, cuda-2 PTask
Read: 1.17x, -10.3x, -30.8x, 1.16x
Write: 1.28x, -4.6x, -10.3x, 1.21x, 1.20x
• EncFS: nice -20
• cuda-*: nice +19
• AES: XTS chaining
• SATA SSD, RAID
• seq. R/W 200 MB

Related Work
OS support for heterogeneous platforms:
◦ Helios [Nightingale 09], Barrelfish [Baumann 09], Offcodes [Weinsberg 08]
GPU Scheduling
◦ TimeGraph [Kato 11], Pegasus [Gupta 11]
Graph-based programming models
◦ Synthesis [Massalin 89]
◦ Monsoon/Id [Arvind]
◦ Dryad [Isard 07]
◦ StreamIt [Thies 02]
◦ DirectShow
◦ TCP Offload [Currid 04]
Tasking
◦ Tessellation, Apple GCD, …

Conclusions
OS abstractions for GPUs are critical
◦ Enable fairness & priority
◦ OS can use the GPU
Dataflow: a good fit abstraction
◦ system manages data movement
◦ performance benefits significant
Thank you. Questions?