Piko: A Framework for Authoring Programmable Graphics Pipelines
Anjul Patney and Stanley Tzeng (UC Davis and NVIDIA), Kerry A. Seitz, Jr. and John D. Owens (UC Davis)
What does an efficient graphics pipeline look like?

Renderer                Platform       Algorithm
Unreal Engine 4         GPU            Rasterization with deferred shading
Unity 5                 GPU            Rasterization with forward / deferred shading
Disney Hyperion         Multicore CPU  Path tracing with deferred shading
Pixar RenderMan         Multicore CPU  Reyes with path tracing
Solid Angle Arnold      Multicore CPU  Path tracing
Media Molecule Dreams   GPU            Point-based rendering with deferred shading
Problem: Efficient graphics pipeline implementations are hard to write, and the design space is hard to explore.
Vision
[Figure: a pipeline graph (Stages A to F) passes through an unknown layer, marked "?", to run on a CPU or a GPU. Goals: flexibility, high-level programmability, high performance.]
Existing Work
Software Pipelines on GPUs
• CudaRaster
• RenderAnts
• FreePipe
• VoxelPipe
OptiX and Embree
• Programmable engines for accelerating ray tracing on specific platforms.
GRAMPS [Sugerman et al. 2009]
• Introduces flexible graphics pipelines
• Abstracts stages in classes
• Abstracts communication by queues
Halide [Ragan-Kelley et al. 2012]
• Introduces programmable image pipelines
• Applies well to shorter and more regular image-processing pipelines
What are the fundamentals of high performance?
• Parallelism
• Execution locality
• Data locality
• Producer-consumer locality
Spatial tiling
Efficient graphics pipelines utilize spatial tiling
• Packet ray tracing
• SIMD fragment shading on GPUs
• Tiled rendering on mobile GPUs
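As a concrete illustration of spatial tiling, here is a minimal sketch that bins screen-space bounding boxes into fixed-size square tiles. It is not taken from Piko: the `BBox` type and the `binToTiles` function are hypothetical names chosen for this example, and the serial loop only shows the data-to-tile mapping, not how a real renderer would parallelize it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration; not part of Piko.
struct BBox { int xmin, ymin, xmax, ymax; };  // inclusive pixel bounds

// For each screen tile (row-major order), collect the indices of the
// bounding boxes that overlap it.
std::vector<std::vector<int>> binToTiles(const std::vector<BBox>& boxes,
                                         int screenW, int screenH, int tileSize) {
    assert(screenW > 0 && screenH > 0 && tileSize > 0);
    const int tilesX = (screenW + tileSize - 1) / tileSize;  // round up
    const int tilesY = (screenH + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> bins(static_cast<std::size_t>(tilesX) * tilesY);

    for (std::size_t i = 0; i < boxes.size(); ++i) {
        // Clamp the box to the screen; skip it if nothing remains.
        const int x0 = std::max(boxes[i].xmin, 0);
        const int y0 = std::max(boxes[i].ymin, 0);
        const int x1 = std::min(boxes[i].xmax, screenW - 1);
        const int y1 = std::min(boxes[i].ymax, screenH - 1);
        if (x0 > x1 || y0 > y1) continue;

        // Append the box index to every tile its clamped extent touches.
        for (int ty = y0 / tileSize; ty <= y1 / tileSize; ++ty)
            for (int tx = x0 / tileSize; tx <= x1 / tileSize; ++tx)
                bins[static_cast<std::size_t>(ty) * tilesX + tx]
                    .push_back(static_cast<int>(i));
    }
    return bins;
}
```

Once work is grouped this way, each tile's list can be processed independently, which is where the parallelism and locality benefits come from.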
Vision
[Figure: a pipeline graph (Stages A to F) passes through Piko to run on a CPU or a GPU. Goals: flexibility, high-level programmability, high performance.]
System Walkthrough
[Figure: compilation flow. The inputs are the Host Code (C++, device independent) and the Pipe Description (Piko, a pipeline description expressed as a graph of stages). pikoc, a Clang- and LLVM-based infrastructure, produces a Host Interface (C++) and a Pipe Implementation (C++ / PTX). A CPU compiler and a device compiler turn these into the executable.]
Problem: Efficient graphics pipeline implementations are hard to write, and the design space is hard to explore.
Approach: Use programmable spatial tiling to help author efficient and flexible graphics pipelines.
Programmable Spatial Tiling
We need three answers from the pipeline author
• How does data map to spatial tiles? AssignTile( )
• How do we schedule tiles at runtime? Schedule( )
• What do we compute for each tile? Process( )
Each stage consists of these three "phases".
Each stage in a pipeline has three phases
[Figure: Stages A, B, and C, each made up of AssignTile, Schedule, and Process.]
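The three phases can be sketched as three member functions plus a driver that calls them in order. This is a toy, serial illustration under stated assumptions, not Piko's implementation: `Fragment`, `CountStage`, `runStage`, and the round-robin scheduling rule are all hypothetical names and choices made for this example.

```cpp
#include <cassert>
#include <map>
#include <vector>

struct Fragment { int x, y; };  // hypothetical input item

// Toy stage exposing the three phases named on the slide.
struct CountStage {
    int tileSize;  // tile edge length in pixels
    int tilesX;    // number of tiles per screen row

    // AssignTile: map an input item to a tile id (row-major).
    int assignTile(const Fragment& f) const {
        return (f.y / tileSize) * tilesX + (f.x / tileSize);
    }
    // Schedule: pick a worker for a tile (assumed round-robin policy).
    int schedule(int tileId, int numWorkers) const { return tileId % numWorkers; }
    // Process: per-tile computation; here it just counts the tile's items.
    int process(const std::vector<Fragment>& items) const {
        return static_cast<int>(items.size());
    }
};

// Serial driver: bins the input, then accumulates each tile's result on the
// worker chosen by the schedule phase. Returns one total per worker.
std::vector<int> runStage(const CountStage& stage,
                          const std::vector<Fragment>& input, int numWorkers) {
    assert(numWorkers > 0);
    std::map<int, std::vector<Fragment>> tiles;
    for (const Fragment& f : input) tiles[stage.assignTile(f)].push_back(f);

    std::vector<int> perWorker(numWorkers, 0);
    for (const auto& tile : tiles)
        perWorker[stage.schedule(tile.first, numWorkers)] += stage.process(tile.second);
    return perWorker;
}
```

Keeping the three questions in separate functions is what lets a compiler reason about each one independently, for example by comparing the tiling or scheduling of adjacent stages.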
AssignTile
Phases help identify optimization opportunities.
[Figure: Stages A, B, C, and D.]
• Identical tile size
• Identical data-to-tile mapping (identical AssignTile result)
• Identical tile-to-core mapping (identical Schedule result)
Stages can be fused into one.
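The fusion conditions above can be read as an equality test on per-stage descriptors. The sketch below assumes a linear pipeline and a simplified descriptor; `StageDesc`, `AssignRule`, `SchedulePolicy`, `canFuse`, and `countKernels` are hypothetical names for this example, and the actual analysis in pikoc may use different or additional criteria.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified descriptors; not Piko's internal representation.
enum class AssignRule { BoundingBox, Position };
enum class SchedulePolicy { LoadBalance, TileToCore };

struct StageDesc {
    int tileW, tileH;         // tile size
    AssignRule assign;        // how data maps to tiles
    SchedulePolicy schedule;  // how tiles map to cores
};

// Two adjacent stages are fusion candidates when tile size, data-to-tile
// mapping, and tile-to-core mapping all match.
bool canFuse(const StageDesc& a, const StageDesc& b) {
    assert(a.tileW > 0 && a.tileH > 0 && b.tileW > 0 && b.tileH > 0);
    return a.tileW == b.tileW && a.tileH == b.tileH &&
           a.assign == b.assign && a.schedule == b.schedule;
}

// Greedy pass over a linear pipeline: start a new kernel whenever the next
// stage cannot be fused with the one before it.
std::size_t countKernels(const std::vector<StageDesc>& pipeline) {
    if (pipeline.empty()) return 0;
    std::size_t kernels = 1;
    for (std::size_t i = 1; i < pipeline.size(); ++i)
        if (!canFuse(pipeline[i - 1], pipeline[i])) ++kernels;
    return kernels;
}
```

With four stages where only the last uses a different tile size, this pass would report two kernels: the first three stages fused, and the last on its own.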
Phases help explore pipeline implementations
[Figure: a raster pipeline (Vertex Shade, Geometry Shade, Raster, Fragment Shade, Composite) shown next to alternative implementations built from the abbreviated stages VS, GS, Rst, FS, and Cmp.]
Evaluation
Piko pipelines are easy to express and customize
[Figure: stage graphs for four example pipelines (Triangle Raster, Stereo Raster, Reyes, and Raster-Raytrace), built from stages such as VS, Setup, Rast, FS, Comp, Split, Dice, Shade, Sample, and Trace.]
Piko lets us explore implementation alternatives
[Figure: three charts of relative frame time (0 to 7) against shader complexity (1 to 1000 lights) for the Fairy Forest scene, on an NVIDIA GPU and a multicore CPU, each compared with a baseline. Each chart corresponds to one configuration of the VS, Setup, Rast, FS, Comp pipeline:]
• No tiling, complete stage fusion
• Tiling with fusion
• Tiling with no fusion
Piko enables high-performance code generation
[Figure: bar chart of rendering time (ms), cudaraster versus Piko Raster, for the Fairy Forest, Buddha, Mecha, and Dragon scenes.]
Performance is within 3.3-5.5x of hand-optimized code. [Laine and Karras 2011]
Piko enables high-performance code generation
[Figure: bar chart of split performance (Mpatches / second), Micropolis versus Piko Reyes.]
Split performance is within 30% of hand-optimized GPU Reyes. [Weber et al. 2015]
Summary
Piko enables programmability and performance
[Figure: a pipeline graph (Stages A to F) passes through Piko to run on a CPU or a GPU.]
• High-level programmability
• High performance
• Flexibility
Our work is not done
Piko can be improved
• Utilization of shared local memory
• Support for dynamic scheduling of pipeline work
The search for a graphics abstraction is not over
• Do tiles have to be 2D, uniform, and one-config-per-stage?
• Are there other abstractions that enable high-level programmability and achieve high performance?
Acknowledgments
• Discussions and advice: Tim Foley, Jonathan Ragan-Kelley, Aaron Lefohn, Matt Pharr, Mark Lacey, Kayvon Fatahalian, Bill Mark, Marco Salvi, Chuck Lingle, Jason Mak, Edmund Yan, Calina Copos, Mike Steffen, Alex Elkman
• NVVM help: Vinod Grover, Sean Lee
• Financial support: Intel Science and Technology Center (VC), NVIDIA Research Fellowship, Intel Ph.D. Fellowship, National Science Foundation Fellowship, NVIDIA, AMD, NSF, UC Lab Fees
• Assets: AMD, Intel (Project Offset), Ingo Wald, Bay Raitt, Stanford
Thank you!
github.com/piko-dev/piko-public
Extra Slides
Host code is device independent (unmodified C++).

    RasterPipe pipe;
    pipe.allocate(...);
    pipe.prepare();
    pipe.run_single();

    // Read back the rendered image and display it.
    unsigned* pixels = pipe.pikoScreen.getData();
    glDrawPixels(screenW, screenH, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
A pipeline is a C++ class declaration.

    class RasterPipe : public PikoPipe {
        // Stages are instantiated as objects.
        VertexShaderStage vertexShader_;
        RasterStage raster_;
        PikoScreen pikoScreen_;
        ...
        // Connections indicate pipeline structure.
        RasterPipe() {
            pikoConnect(vertexShader_, raster_, 0, 0);
        }
        ...
    };
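To illustrate how stage connections can define pipeline structure, here is a toy graph that records connections and derives an order in which producers run before consumers. It is a sketch only: `ToyPipe`, `addStage`, `connect`, and `executionOrder` are hypothetical and do not reflect Piko's API, and reading the two trailing integers of a connection as an output port and an input port is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Toy pipeline graph; every name here is hypothetical, not Piko's API.
struct Edge { int from, outPort, to, inPort; };

class ToyPipe {
public:
    // Register a new stage and return its id.
    int addStage() { return numStages_++; }

    // Record that output port `outPort` of stage `from` feeds input port
    // `inPort` of stage `to` (assumed meaning of the two integers).
    void connect(int from, int to, int outPort, int inPort) {
        assert(from >= 0 && from < numStages_ && to >= 0 && to < numStages_);
        edges_.push_back(Edge{from, outPort, to, inPort});
    }

    // Kahn's algorithm: order stages so producers come before consumers.
    // Returns an empty vector if the graph has a cycle (or no stages).
    std::vector<int> executionOrder() const {
        std::vector<int> indegree(numStages_, 0);
        for (const Edge& e : edges_) ++indegree[e.to];

        std::queue<int> ready;
        for (int s = 0; s < numStages_; ++s)
            if (indegree[s] == 0) ready.push(s);

        std::vector<int> order;
        while (!ready.empty()) {
            const int s = ready.front();
            ready.pop();
            order.push_back(s);
            for (const Edge& e : edges_)
                if (e.from == s && --indegree[e.to] == 0) ready.push(e.to);
        }
        if (order.size() != static_cast<std::size_t>(numStages_)) order.clear();
        return order;
    }

private:
    int numStages_ = 0;
    std::vector<Edge> edges_;
};
```

This toy rejects cyclic graphs for simplicity; a pipeline with feedback between stages would need a different scheduling strategy.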
A stage is a C++ class definition.

    // Templates specify tiling configuration.
    class RasterStage : public Stage<8, 8, 32, raster_stri, Pixel> {
        // Each phase is a member function.
        inline void AssignTile(raster_stri p) {
            ...
            this->assignToBin(p, binID);
            ...
        }
        // Built-in routines identify common scenarios.
        inline void schedule(int binID) {
            this->specifySchedule(LOAD_BALANCE);
        }
        inline void process(raster_stri p) {
            ...
            this->emit(Pixel(pos, color), 0);
            ...
        }
    };
pikoc implements the pipeline description.
[Figure: the pipeline and its stages enter the pikoc frontend (clang), which produces a kernel plan. The pikoc backend (clang, libNVVM) turns the plan into the Pipe Implementation and the Host Interface.]
• Frontend walks the AST and performs high-level optimizations.
• Backend uses LLVM to generate optimized device code.
WIP Slides