GPU Data Formatting and Addressing
Aaron Lefohn
University of California, Davis

Overview
• GPU Memory Model
• GPU-Based Data Structures
• Performance Considerations

GPU memory model
• GPU Data Storage
  – Vertex data
  – Texture data
  – Frame buffer
[Figure: GPU pipeline diagram. Labels: Vertex Processor, Rasterizer, Fragment Processor, Frame Buffer(s), Texture Data, PS 3.0 GPUs]

GPU memory model
• Read-Only
  – Traditional use of GPU memory
  – CPU writes, GPU reads
• Read/Write
  – Save frame buffer(s) for later use as texture or vertex array
  – Save up to sixteen 32-bit floating-point values per pixel
    • Multiple Render Targets (MRTs)

How to Save Render Result
1. Copy framebuffer result to “other GPU memory”
  – Copy-to-texture
  – Copy-to-vertex-array
2. Write directly to “other GPU memory”
  – Render-to-texture
  – Render-to-vertex-array

OpenGL GPU Memory Writes
• Texture
  1. Copy frame buffer to texture
  2. Render-to-texture
    • WGL_ARB_render_texture
    • GL_EXT_render_target
    • Superbuffers
• Vertex Array
  1. Copy frame buffer to vertex array
    • GL_EXT_pixel_buffer_object
    • Superbuffers
  2. Render-to-vertex-array
    • Superbuffers

Render-To-Texture: 1
• Copy-To-Texture
  – Good
    • Cross-platform texture writes
    • Flexible output: 2D output can be copied to a 1D, 2D, or 3D texture
  – Bad
    • Slow
    • Consumes internal GPU memory bandwidth

Render-To-Texture: 2
• WGL_ARB_render_texture
  – Render-to-texture (RTT) using pbuffers
    http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt
  – Good
    • Fast RTT
    • Current state of the art for RTT
  – Bad
    • Only works on Windows
    • Slow OpenGL context switches
    • Many hacks to avoid this bottleneck

Render-To-Texture: 3
• GL_EXT_render_target
  – Proposed extension for cross-platform RTT
    http://www.opengl.org/resources/features/GL_EXT_render_target.txt
  – Good
    • Cross-platform, efficient RTT solution
    • Lightweight, simple extension
  – Bad
    • Specification not approved
    • No implementations exist (as of April 24, 2004)

Render-To-Texture: 4
• Superbuffers
  – Proposed new memory model for GPUs
    http://www.ati.com/developer/gdc/SuperBuffers.pdf
  – Good
    • Unified GPU memory model
    • Render to any GPU memory
    • Cross-platform (OpenGL owns memory, not OS)
    • Mix-and-match depth/stencil/color buffers
  – Bad
    • Large, complex extension
    • Specification not approved (as of April 24, 2004)
    • Only driver support is an alpha version (ATI)

Render-To-Texture Summary
• OpenGL RTT Currently Only Under Windows
  – Pbuffers
    • Complex and awkward RTT mechanism
    • Current state of the art
• Cross-Platform RTT Coming Soon…

Render-To-Vertex-Array: 1
• GL_EXT_pixel_buffer_object
  – Copy framebuffer to vertex buffer object
    http://developer.nvidia.com/object/nvidia_opengl_specs.html
  – Good
    • Only GPU/AGP memory bandwidth
    • Works with current drivers (NVIDIA)
  – Bad
    • No direct render-to-vertex-array (slower than true RTVA)
    • No ATI implementation

Render-To-Vertex-Array: 2
• Superbuffers
  – Write to “memory object” as render target
  – Read from “memory object” as vertex array
  – Good
    • Direct render-to-vertex-array (fast)
  – Bad
    • Can render results always be interpreted as vertex data?
    • Large, complex, unapproved extension, …

Render-To-Vertex-Array Summary
• Current OpenGL Support
  – NVIDIA: GL_EXT_pixel_buffer_object
  – ATI: Superbuffers
• Semantics Still Under Development…

Fbuffer: Capturing Fragments
• Idea
  – “Rasterization-Order FIFO Buffer”
  – Render results are fragment values instead of pixel values
  – Mark and Proudfoot, Graphics Hardware 2001
    http://graphics.stanford.edu/projects/shading/pubs/hwws2001-fbuffer/
• Uses
  – Designed for multi-pass rendering with transparent geometry
  – New possibilities for GPGPU?
    • Varying number of results per pixel
    • RTT and RTVA with an fbuffer?

Fbuffer: Capturing Fragments
• Implementations
  – ATI Radeon 9800 and newer ATI GPUs
  – Not yet exposed to user (ask for it!)
• Problems
  – Size of fbuffer is not known before rendering
  – GPUs cannot perform dynamic memory allocation
  – How to handle buffer overflow?

Overview
• GPU Memory Model
• GPU-Based Data Structures
• Performance Considerations

GPU-Based Data Structures
• Building Blocks
  – GPU memory addresses
    • Address Generation
    • Address Use
    • Pointers
  – Multi-dimensional arrays
  – Sparse representations

GPU Memory Addresses
• Where Are Addresses Generated?
  – CPU: vertex stream or textures
  – Vertex processor: input stream, ALU ops, or textures
  – Rasterizer: interpolation
  – Fragment processor: input stream, ALU ops, or textures
[Figure: pipeline diagram. Labels: CPU, Vertex Processor, Rasterizer, Fragment Processor]

GPU Memory Addresses
• Where Are Addresses Used?
  – Vertex textures (PS 3.0 GPUs)
  – Fragment textures
[Figure: pipeline diagram. Labels: Texture Data, CPU, Vertex Processor, Rasterizer, Fragment Processor]

GPU Memory Addresses
• Pointers
  – Store addresses in texture
  – Dependent texture read
  – Example: See Tim Purcell’s ray tracing talk
    float2 addr = tex2D( addrTex, texCoord );
    float2 data = tex2D( dataTex, addr );
[Figure: an Address Texture whose texels hold addresses into a Data Texture]
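
The two-line shader above can be mimicked on the CPU to make the indirection explicit. This is only a sketch under stated assumptions: textures are modelled as small row-major arrays, addresses are integer texel coordinates rather than normalized floats, and every name (`TexelAddr`, `fetch_addr`, `fetch_data`, `dependent_read`, `TEX_WIDTH`) is hypothetical and not part of the talk.

```c
/* CPU stand-in for a dependent texture read; all names are hypothetical. */
typedef struct { int x, y; } TexelAddr;   /* integer texel address */

enum { TEX_WIDTH = 4, TEX_HEIGHT = 2 };   /* assumed size of both textures */

/* Nearest-texel fetch from the address texture (row-major storage). */
TexelAddr fetch_addr(const TexelAddr *addr_tex, TexelAddr coord)
{
    return addr_tex[coord.y * TEX_WIDTH + coord.x];
}

/* Nearest-texel fetch from the data texture (row-major storage). */
float fetch_data(const float *data_tex, TexelAddr coord)
{
    return data_tex[coord.y * TEX_WIDTH + coord.x];
}

/* The first fetch returns an address; the second fetch depends on it. */
float dependent_read(const TexelAddr *addr_tex, const float *data_tex,
                     TexelAddr tex_coord)
{
    TexelAddr addr = fetch_addr(addr_tex, tex_coord);
    return fetch_data(data_tex, addr);
}
```

The second fetch cannot be issued until the first has returned, which is why the cost of this pattern depends so heavily on cache behaviour.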

GPU-Based Data Structures
• Building Blocks
  – GPU memory addresses
    • Address Generation
    • Address Use
    • Pointers
  – Multi-dimensional arrays
  – Sparse representations

Multi-Dimensional Arrays
• Build Data Structures in 2D Memory
  – Read/write GPU memory optimized for 2D
  – Images
• But Isn’t Physical Memory 1D?
  – GPU memory hierarchy optimized to capture 2D locality
    • Rasterization
    • Texture filtering
    • Igehy, Eldridge, Proudfoot, “Prefetching in a Texture Cache Architecture,” Graphics Hardware, 1998
• Conclusion: Use illusion of 2D physical memory

GPU Arrays
• Large 1D Arrays
  – Current GPUs limit 1D array sizes to 2048 or 4096 elements
  – Pack into 2D memory
  – 1D-to-2D address translation
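
One common way to do the 1D-to-2D translation is a plain row-major mapping. The sketch below assumes integer texel coordinates and a texture width chosen by the application; the names `Texel2D`, `addr_1d_to_2d`, `addr_2d_to_1d`, and `rows_needed` are hypothetical, and the talk does not prescribe this particular layout.

```c
/* Row-major packing of a long 1D array into a 2D texture (hypothetical names). */
typedef struct { int x, y; } Texel2D;

/* Map linear index i to a texel in a texture that is tex_width texels wide. */
Texel2D addr_1d_to_2d(int i, int tex_width)
{
    Texel2D t;
    t.x = i % tex_width;   /* column within the row */
    t.y = i / tex_width;   /* row index */
    return t;
}

/* Inverse mapping: recover the linear index from a texel address. */
int addr_2d_to_1d(Texel2D t, int tex_width)
{
    return t.y * tex_width + t.x;
}

/* Number of texture rows needed to hold n elements (rounded up). */
int rows_needed(int n, int tex_width)
{
    return (n + tex_width - 1) / tex_width;
}
```

For example, with a width of 2048, element 5000 lands at column 904 of row 2, and 5000 elements need 3 rows.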

GPU Arrays
• 3D Arrays
  – Problem
    • GPUs do not have 3D frame buffers
    • No RTT to slice of 3D texture (except Superbuffers)
  – Solutions
    1. Stack of 2D slices
    2. Multiple slices per 2D buffer

GPU Arrays
• Problems With 3D Arrays for GPGPU
  – Cannot read stack of 2D slices as 3D texture
  – Must know which slices are needed in advance
  – Visualization of 3D data difficult
• Solutions
  – Need render-to-slice-of-3D-texture (Superbuffers)
  – Volume rendering of slice-based 3D data
    • Course 28, “Real-Time Volume Graphics”, Siggraph 2004

GPU Arrays
• Higher-Dimensional Arrays
  – Pack into 2D buffers
  – N-D to 2D address translation
  – Same problems as 3D arrays if data does not fit in a single 2D texture
• Conclusions
  – Fundamental GPU memory primitive is a fixed-size 2D array
  – GPGPU needs more general memory model

GPU-Based Data Structures
• Building Blocks
  – GPU memory addresses
    • Address Generation
    • Address Use
    • Pointers
  – Multi-dimensional arrays
  – Sparse representations

Sparse Data Structures
• Why Sparse Data Structures?
  – Reduce computational workload
  – Reduce memory pressure
• Examples
  – Sparse matrices
    • Krueger et al., Siggraph 2003
    • Bolz et al., Siggraph 2003
    • Premoze et al., Eurographics 2003
  – Implicit surface computations (sparse volumes)
    • Sherbondy et al., IEEE Visualization 2003
    • Lefohn et al., IEEE Visualization 2003

Sparse Computation
• Option 1: Store Complete Data Set on GPU
  – Cull unused data
  – Conditional execution tricks (discussed earlier)
• Option 2: Store Only Sparse Data on GPU
  – Saves memory
  – Potentially much faster than culling
  – Much more complicated (especially if time-varying)

Sparse Data Structures
• Basic Idea
  – Pack “active” data elements into GPU memory
  – For more information
    • Linear algebra section in this course: static structures
    • Level-set case study in this course: dynamic structures

Sparse Data Structures
• Addressing Sparse Data
  – Neighborhoods no longer implicitly defined on grid
  – Use pointer-based data structures to locate neighbors
• Pre-compute neighbor addresses if possible
  – Use CPU or vertex processor
  – Removes pointer dereference from fragment program
  – Separate common addressing case from boundary conditions
    • Common case must be cache coherent
    • See Harris and Lefohn case studies for “substream” technique
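
The idea of packing active elements and pre-computing their neighbor addresses can be sketched on the CPU for a 1D grid. This is an illustration under assumptions, not the course's data structure: inactive or out-of-range neighbors are redirected to one extra "boundary" slot at the end of the packed array, and all names (`pack_active`, `precompute_neighbors`, `neighbor_sum`, `NeighborAddr`) are hypothetical.

```c
/* Packed sparse 1D grid with pre-computed neighbor addresses (hypothetical names). */
typedef struct { int left, right; } NeighborAddr;

/* Copy active cells into `packed` and record where each grid cell went.
   `packed` must have room for (active count + 1) floats; the final slot
   is an assumed boundary element. Returns the number of active cells. */
int pack_active(const float *grid, const int *active, int n,
                float *packed, int *grid_to_packed)
{
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (active[i]) {
            packed[count] = grid[i];
            grid_to_packed[i] = count++;
        } else {
            grid_to_packed[i] = -1;      /* cell is not stored */
        }
    }
    packed[count] = 0.0f;                /* boundary value (assumption) */
    return count;
}

/* Done once, outside the inner loop (the "CPU or vertex processor" step). */
void precompute_neighbors(const int *grid_to_packed, int n,
                          int boundary_slot, NeighborAddr *nbr)
{
    for (int i = 0; i < n; ++i) {
        int p = grid_to_packed[i];
        if (p < 0)
            continue;                    /* inactive cells have no entry */
        int l = (i > 0)     ? grid_to_packed[i - 1] : -1;
        int r = (i < n - 1) ? grid_to_packed[i + 1] : -1;
        nbr[p].left  = (l >= 0) ? l : boundary_slot;
        nbr[p].right = (r >= 0) ? r : boundary_slot;
    }
}

/* Stand-in for the fragment program: two reads, no search, no branching. */
float neighbor_sum(const float *packed, const NeighborAddr *nbr, int p)
{
    return packed[nbr[p].left] + packed[nbr[p].right];
}
```

Routing boundary cases to a dedicated slot keeps the inner kernel identical for every element, which is one simple way to separate the common case from boundary handling.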

Overview
• GPU Memory Model
• GPU-Based Data Structures
• Performance Considerations

Memory Performance Issues
• Pbuffer Survival Guide
• Dependent Texture Costs
• Computational Frequency

Pbuffer Survival Guide
• Pbuffers Give Us Render-To-Texture
  – Designed to create an environment map or two
  – Never intended to be used for GPGPU (100s of pbuffers)
  – Problem
    • Each pbuffer has its own OpenGL render context
    • Each pbuffer may have depth and/or stencil buffer
    • Changing OpenGL contexts is slow
  – Solution
    • Many optimizations to avoid this bottleneck…

Pbuffer Survival Guide
1. Pack Scalar Data Into RGBA
  – >4x memory savings
  – 4x reduction in context switches
  – Be careful of read-modify-write hazard
[Figure: scalar data in 4 RGBA pbuffers packed into 1 RGBA pbuffer]
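
A possible packing layout, sketched on the CPU: four consecutive scalars share one RGBA texel, so element i lives in texel i / 4, channel i % 4. The talk does not say which layout it uses, so this mapping is an assumption, as are the names `Rgba`, `texels_needed`, `pack_scalars`, and `read_scalar`.

```c
/* Pack a scalar array four-per-texel into RGBA storage (hypothetical names). */
enum { CHANNELS = 4 };

typedef struct { float c[CHANNELS]; } Rgba;   /* c[0..3] = R, G, B, A */

/* Number of RGBA texels needed for n scalars (rounded up). */
int texels_needed(int n)
{
    return (n + CHANNELS - 1) / CHANNELS;
}

/* dst must hold texels_needed(n) texels; unused channels are zero-filled. */
void pack_scalars(const float *src, int n, Rgba *dst)
{
    int texels = texels_needed(n);
    for (int t = 0; t < texels; ++t) {
        for (int ch = 0; ch < CHANNELS; ++ch) {
            int i = t * CHANNELS + ch;
            dst[t].c[ch] = (i < n) ? src[i] : 0.0f;
        }
    }
}

/* Address translation for reads: scalar index to (texel, channel). */
float read_scalar(const Rgba *tex, int i)
{
    return tex[i / CHANNELS].c[i % CHANNELS];
}
```

With this layout a pass that updates one scalar writes the whole texel, so the other three channels must be carried through unchanged; that is one place where the read-modify-write hazard noted above can appear.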

Pbuffer Survival Guide
2. Use Multi-Surface Pbuffers
  – Each RGBA surface is its own render-texture
    • Front, Back, AuxN (N = 0, 1, 2, …)
  – Greatly reduces context switches
  – Technically illegal, but “blessed” by ATI. Works on NVIDIA.
[Figure: 5 pbuffers with 1 RGBA surface each, replaced by 1 pbuffer with 5 RGBA surfaces]

Pbuffer Survival Guide
2. Using Multi-Surface Pbuffers
  a) Allocate double-buffered pbuffer (and/or with AUX buffers)
  b) Set render target to back buffer
       glDrawBuffer(GL_BACK)
  c) Bind front buffer as texture
       wglBindTexImageARB(hpbuffer, WGL_FRONT_ARB)
  d) Render
  e) Switch buffers
       wglReleaseTexImageARB(hpbuffer, WGL_FRONT_ARB)
       glDrawBuffer(GL_FRONT)
       wglBindTexImageARB(hpbuffer, WGL_BACK_ARB)
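
The read-one-surface, write-the-other pattern behind these steps can be illustrated without any graphics API. The sketch below replaces the two pbuffer surfaces with two plain arrays and the draw call with a 3-point average; it makes no OpenGL or WGL calls, and every name (`PingPong`, `pingpong_pass`, `pingpong_result`, `N`) is hypothetical.

```c
/* CPU illustration of ping-pong rendering between two surfaces
   (hypothetical names; stands in for front/back pbuffer surfaces). */
enum { N = 8 };

typedef struct {
    float surface[2][N];   /* the two surfaces */
    int   read_idx;        /* which surface is currently bound for reading */
} PingPong;

/* One pass: read from the bound surface, write to the other, then swap.
   The kernel here is a 3-point average with clamped edges. */
void pingpong_pass(PingPong *pp)
{
    const float *src = pp->surface[pp->read_idx];
    float *dst = pp->surface[1 - pp->read_idx];
    for (int i = 0; i < N; ++i) {
        int l = (i > 0)     ? i - 1 : 0;
        int r = (i < N - 1) ? i + 1 : N - 1;
        dst[i] = (src[l] + src[i] + src[r]) / 3.0f;
    }
    pp->read_idx = 1 - pp->read_idx;   /* the "switch buffers" step */
}

/* The most recently written surface. */
const float *pingpong_result(const PingPong *pp)
{
    return pp->surface[pp->read_idx];
}
```

Because each pass never writes the surface it reads, neighboring reads always see values from the previous pass, which is the property the buffer switch is there to preserve.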

Pbuffer Survival Guide
3. Pack 2D domains into large buffer
  – “Flat 3D textures”
  – Be careful of read-modify-write hazard
[Figure: a 3D volume and the same volume flattened into a single 2D buffer]
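
A flat 3D texture needs a 3D-to-2D address translation. One plausible layout, sketched here as an assumption rather than the talk's exact scheme, tiles the z-slices left to right and top to bottom in the 2D buffer. The names `FlatVolume`, `Texel2D`, `addr_3d_to_2d`, and `flat_size` are hypothetical.

```c
/* Address translation for a volume stored as tiled 2D slices
   (hypothetical names and layout). */
typedef struct {
    int width, height, depth;   /* volume dimensions in voxels */
    int tiles_per_row;          /* how many slices sit side by side */
} FlatVolume;

typedef struct { int x, y; } Texel2D;

/* Map voxel (x, y, z) to its texel in the flattened 2D buffer. */
Texel2D addr_3d_to_2d(const FlatVolume *v, int x, int y, int z)
{
    int tile_x = z % v->tiles_per_row;   /* column of the slice's tile */
    int tile_y = z / v->tiles_per_row;   /* row of the slice's tile */
    Texel2D t;
    t.x = tile_x * v->width + x;
    t.y = tile_y * v->height + y;
    return t;
}

/* Size of the 2D buffer needed to hold every slice. */
Texel2D flat_size(const FlatVolume *v)
{
    int tile_rows = (v->depth + v->tiles_per_row - 1) / v->tiles_per_row;
    Texel2D s;
    s.x = v->tiles_per_row * v->width;
    s.y = tile_rows * v->height;
    return s;
}
```

For a 64 x 64 x 64 volume with 8 tiles per row, voxel (3, 5, 10) maps to texel (131, 69) in a 512 x 512 buffer.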

Dependent Texture Costs
• Cache Coherency
  – Dependent reads fast if they hit cache
    • Even chained dependencies can be same speed as non-dependent reads
  – Very slow if out of cache
    • Example: 3 levels of dependent cache misses can be >10x slower
  – More detail in “GPU Computation Strategies and Tricks”

Computational Frequency
• Compute Memory Addresses at Low Frequency
  – Compute memory addresses in vertex program
    • Let rasterizer interpolation create per-fragment addresses
    • Compute neighbor addresses this way
  – Avoid fragment-level address computation whenever possible
    • Consumes fragment instructions
    • Computation often redundant with neighboring fragments
    • May defeat texture pre-fetch
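
Why per-vertex address computation is enough can be shown with a small CPU model: offsets added at the vertices survive linear interpolation, so each fragment receives its neighbor addresses without computing them. The sketch assumes a 1D scanline, unnormalized texel coordinates, and sampling at fragment centres; the names `AddrSet`, `vertex_addresses`, `interpolate`, and `rasterize` are hypothetical.

```c
/* Neighbor addresses computed per vertex and interpolated per fragment
   (hypothetical names; 1D scanline for brevity). */
typedef struct { float center, left, right; } AddrSet;

/* "Vertex program": runs once per vertex, not once per fragment. */
AddrSet vertex_addresses(float tex_coord)
{
    AddrSet a;
    a.center = tex_coord;
    a.left   = tex_coord - 1.0f;   /* one texel to the left */
    a.right  = tex_coord + 1.0f;   /* one texel to the right */
    return a;
}

/* Linear interpolation between two per-vertex values. */
float interpolate(float a, float b, float t)
{
    return a + t * (b - a);
}

/* "Rasterizer": interpolate every address at the centre of fragment i of n. */
AddrSet rasterize(AddrSet v0, AddrSet v1, int i, int n)
{
    float t = ((float)i + 0.5f) / (float)n;
    AddrSet f;
    f.center = interpolate(v0.center, v1.center, t);
    f.left   = interpolate(v0.left,   v1.left,   t);
    f.right  = interpolate(v0.right,  v1.right,  t);
    return f;
}
```

In this model a scanline spanning texel coordinates 0 to 8 with 8 fragments gives fragment 3 the addresses 2.5, 3.5, and 4.5, the same values a per-fragment add or subtract would have produced.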

Conclusions
• GPU Memory Model Evolving
  – Writable GPU memory forms loop-back in an otherwise feed-forward streaming pipeline
  – Memory model will continue to evolve as GPUs become more general stream processors
• GPGPU Data Structures
  – Basic memory primitive is limited-size, 2D texture
  – Use address translation to fit all array dimensions into 2D
  – Maintain 2D cache locality
• Render-To-Texture
  – Use pbuffers with care and eagerly adopt their successor

Selected References
• J. Bolz, I. Farmer, E. Grinspun, P. Schröder, “Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid,” SIGGRAPH 2003
• N. Goodnight, C. Woolley, G. Lewin, D. Luebke, G. Humphreys, “A Multigrid Solver for Boundary Value Problems Using Programmable Graphics Hardware,” Graphics Hardware 2003
• M. Harris, W. Baxter, T. Scheuermann, A. Lastra, “Simulation of Cloud Dynamics on Graphics Hardware,” Graphics Hardware 2003
• H. Igehy, M. Eldridge, K. Proudfoot, “Prefetching in a Texture Cache Architecture,” Graphics Hardware 1998
• J. Krueger, R. Westermann, “Linear Algebra Operators for GPU Implementation of Numerical Algorithms,” SIGGRAPH 2003
• A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, “A Streaming Narrow-Band Algorithm: Interactive Deformation and Visualization of Level Sets,” IEEE Transactions on Visualization and Computer Graphics 2004

Selected References
• A. Lefohn, J. Kniss, C. Hansen, R. Whitaker, “Interactive Deformation and Visualization of Level Set Surfaces Using Graphics Hardware,” IEEE Visualization 2003
• W. Mark, K. Proudfoot, “The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering,” Graphics Hardware 2001
• T. Purcell, C. Donner, M. Cammarano, H. W. Jensen, P. Hanrahan, “Photon Mapping on Programmable Graphics Hardware,” Graphics Hardware 2003
• A. Sherbondy, M. Houston, S. Napel, “Fast Volume Segmentation With Simultaneous Visualization Using Programmable Graphics Hardware,” IEEE Visualization 2003

OpenGL References
• GL_EXT_pixel_buffer_object
  http://www.nvidia.com/dev_content/nvopenglspecs/GL_EXT_pixel_buffer_object.txt
• GL_EXT_render_target
  http://www.opengl.org/resources/features/GL_EXT_render_target.txt
• OpenGL Extension Registry
  http://oss.sgi.com/projects/ogl-sample/registry/
• Superbuffers
  http://www.ati.com/developer/gdc/SuperBuffers.pdf
• WGL_ARB_render_texture
  http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_render_texture.txt
  http://oss.sgi.com/projects/ogl-sample/registry/ARB/wgl_pbuffer.txt

Questions?
• Acknowledgements
  – Cass Everitt, Craig Kolb, Chris Seitz, and Jeff Juliano at NVIDIA
  – Mark Segal, Rob Mace, and Evan Hart at ATI
  – GPGPU Siggraph 2004 course presenters
  – Joe Kniss and Ross Whitaker
  – Brian Budge
  – John Owens
  – National Science Foundation Graduate Fellowship
  – Pixar Animation Studios