
GPU
§ Precision, Power, Programmability
  – CPU: x60/decade, 6 GFLOPS, 6 GB/sec
  – GPU: x1000/decade, 20 GFLOPS, 25 GB/sec
  – Arithmetic heavy (read OR write): faster hardware
  – Parallelization
  – Multi-billion $ entertainment market drives innovation
  – 32-bit floating point
  – Programmable (graphics, physics, general purpose data-flow)
  – Can’t simply “port” CPU code to GPU
David Luebke et al. GPGPU, SIGGRAPH 2004
Jarek Rossignac, http://www.gvu.gatech.edu/~jarek, MAGIC Lab, SIC / CoC / Georgia Tech
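To make the per-decade figures easier to compare, the short sketch below converts them into per-year growth factors and doubling times. It assumes constant compound growth over the decade, which the slide does not state; the function names are illustrative only.

```python
import math

def annual_growth(factor_per_decade: float) -> float:
    """Per-year factor equivalent to a given per-decade growth factor."""
    return factor_per_decade ** (1.0 / 10.0)

def doubling_time_years(factor_per_decade: float) -> float:
    """Years needed to double, assuming constant compound growth."""
    return 10.0 * math.log(2.0) / math.log(factor_per_decade)

if __name__ == "__main__":
    # Figures taken from the slide: CPU x60/decade, GPU x1000/decade.
    for name, factor in (("CPU", 60.0), ("GPU", 1000.0)):
        print(f"{name}: x{annual_growth(factor):.2f}/year, "
              f"doubles every {doubling_time_years(factor):.2f} years")
```

Under that assumption, x60/decade is roughly x1.5 per year and x1000/decade is roughly x2 per year.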

History of the 3D graphics industry
§ 60s:
  – Line drawings, hidden lines, parametric surfaces (B-splines…)
  – Automated drafting & machining for car, airplane, and ship manufacturers
§ 70s:
  – Mainframes, Vector tubes (HP…)
  – Software: Solids (CSG), Ray Tracing, Z-buffer for hidden lines
§ 80s:
  – Graphics workstations ($50K-$1M): Frame buffers, rasterizers, GL, PHIGS
  – VR: CAVEs and head-mounted displays
  – CAD/CAM & GIS: CATIA, SDRC, PTC
  – Sun, HP, IBM, SGI, E&S, DEC
§ 90s:
  – PCs ($2K): Graphics boards, OpenGL, Java 3D
  – CAD+Videogames+Animations: AutoCAD, SolidWorks…, Alias-Wavefront
  – Intel, many board vendors
§ 00s:
  – Laptops, PDAs, Cell Phones: Parallel graphic chips
  – Everything will be graphics, 3D, animated, interactive
  – Nvidia, Sony, Nokia

History of GPU
§ Pre-GPU Graphics Acceleration
  – SGI, Evans & Sutherland. Introduced concepts like vertex transformation and texture mapping. Very expensive!
§ First-Generation GPU (up to 1998)
  – Nvidia TNT2, ATI Rage, Voodoo3. Vertex transformation on CPU, limited set of math operations.
§ Second-Generation GPU (1999-2000)
  – GeForce 256, GeForce 2, Radeon 7500, Savage 3D. Transformation & Lighting. More configurable, still not programmable.
§ Third-Generation GPU (2001)
  – GeForce 3, GeForce 4 Ti, Xbox, Radeon 8500. Vertex programmability, pixel-level configurability.
§ Fourth-Generation GPU (2002-)
  – GeForce FX series, Radeon 9700 and on. Vertex-level and pixel-level programmability.

Architecture
Pipeline: Application > Vertex Shader > Geometry Shader > Rasterizer > Fragment Shader > Compositor > Display
§ Vertex Shader
  – Output: transformed vertices, normals, colors
§ Rasterizer
  – Output: fragments (surfels per pixel)
§ Fragment Shader
  – Input: fragments, texture
  – Output: pixel color, depth, stencil
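The stage order above can be mimicked in software. The sketch below is a toy model under strong simplifications: it draws points rather than triangles, skips the geometry shader, and the compositor simply overwrites the pixel. All function names and the 4x4 frame buffer are hypothetical.

```python
import math

WIDTH, HEIGHT = 4, 4

def vertex_shader(vertex, offset):
    """Transform one vertex: translate its position, pass its color through."""
    (x, y, z), color = vertex
    return (x + offset[0], y + offset[1], z), color

def rasterizer(transformed):
    """Emit at most one fragment (pixel address, depth, color) per point."""
    for (x, y, z), color in transformed:
        px, py = math.floor(x), math.floor(y)
        if 0 <= px < WIDTH and 0 <= py < HEIGHT:
            yield (px, py), z, color

def fragment_shader(z, color):
    """Compute the pixel color of one fragment; here, darken with depth."""
    shade = 1.0 - z
    return tuple(c * shade for c in color)

def compositor(framebuffer, address, color):
    """Write the shaded fragment (tests and blending are omitted here)."""
    px, py = address
    framebuffer[py][px] = color

def draw(vertices, offset):
    """Application stage: push all vertices through the pipeline."""
    framebuffer = [[(0.0, 0.0, 0.0)] * WIDTH for _ in range(HEIGHT)]
    transformed = [vertex_shader(v, offset) for v in vertices]
    for address, z, color in rasterizer(transformed):
        compositor(framebuffer, address, fragment_shader(z, color))
    return framebuffer
```

Each stage only sees the output of the previous one, which is what lets the hardware run many vertices and fragments through the same stage at once.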

Buffers
§ Color: 8-bit index to color table, float/16-bit true color…
§ Depth: 24-bit or float (0 at back plane)
§ Back and front: display front, update back, swap
§ Stereo: Shutter glasses, HMD. Alternate frames
§ Auxiliary: off-screen working space. Helps reduce passes.
§ Stencil: 8 bits (left-over of depth buffer). <, >… mask, ++
§ Accumulation: sum, scale (supersampling, blur)
§ P-buffer, superbuffers: Render to texture
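As an illustration of the "sum, scale" use of the accumulation buffer, the sketch below supersamples an edge by rendering the scene several times with sub-pixel jitter and adding each scaled image into an accumulator. The `render` function is a hypothetical stand-in for a real draw call (a vertical edge at x = 1.6), and the buffers are plain Python lists.

```python
def render(offset_x, offset_y, width=4, height=4):
    """Hypothetical renderer: 1.0 where the jittered pixel center has x < 1.6."""
    return [[1.0 if (x + 0.5 + offset_x) < 1.6 else 0.0 for x in range(width)]
            for _ in range(height)]

def accumulate(accum, image, scale):
    """Add a scaled image into the accumulation buffer, in place."""
    for row_acc, row_img in zip(accum, image):
        for x, value in enumerate(row_img):
            row_acc[x] += scale * value

def supersample(jitters, width=4, height=4):
    """Render once per jitter offset and average the results."""
    accum = [[0.0] * width for _ in range(height)]
    for dx, dy in jitters:
        accumulate(accum, render(dx, dy, width, height), 1.0 / len(jitters))
    return accum

JITTERS = [(-0.25, -0.25), (0.25, -0.25), (-0.25, 0.25), (0.25, 0.25)]
```

With these four offsets, the pixel column that straddles the edge ends up half covered instead of fully on or off.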

Fragment operations
§ Depth tests: <, <=, >, >=, ==, Z depth-interval
§ Stencil test: mask?, counter, parity
§ Alpha tests: compare to reference alpha
§ Alpha blending: +, max, min, replace, blend
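The tests listed above run per fragment before anything is written. The sketch below chains an alpha test, a stencil test, a depth test, and a standard "blend" step. The dictionary layout, the parameter names, and the stencil convention (pass where the masked value is non-zero) are assumptions made for this example, not a particular API.

```python
import operator

# Comparison functions selectable for the depth test.
DEPTH_FUNCS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
               ">=": operator.ge, "==": operator.eq}

def process_fragment(pixel, fragment, depth_func="<", alpha_ref=0.0,
                     stencil_mask=0xFF):
    """Return the new pixel after tests and blending (old pixel if rejected).

    pixel:    {'color': (r, g, b), 'depth': float, 'stencil': int}
    fragment: {'color': (r, g, b), 'alpha': float, 'depth': float}
    """
    # Alpha test: keep only fragments more opaque than the reference alpha.
    if not fragment["alpha"] > alpha_ref:
        return pixel
    # Stencil test: pass only where the masked stencil value is non-zero.
    if (pixel["stencil"] & stencil_mask) == 0:
        return pixel
    # Depth test: compare the fragment depth with the stored depth.
    if not DEPTH_FUNCS[depth_func](fragment["depth"], pixel["depth"]):
        return pixel
    # Alpha blending: source over destination.
    a = fragment["alpha"]
    blended = tuple(a * s + (1.0 - a) * d
                    for s, d in zip(fragment["color"], pixel["color"]))
    return {"color": blended, "depth": fragment["depth"],
            "stencil": pixel["stencil"]}
```

Swapping `depth_func` or the blend expression gives the other variants in the list (for example `max` or `min` blending).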

Data Parallelism in GPUs
§ Data flow: vertices > fragments > pixels
§ Parallelism at each stage
§ No shared or static data (except textures)
§ ALU-heavy (multiple ALUs per stage in pipe)
§ Fight memory latency with more computation

GPGPU
§ Stream: collection of records (pixels, vertices…)
  – Stored in Textures (a computational grid)
§ Kernel: Function applied to each element in stream
  – Transform, evolve (no dependency between records)
    • Matrix algebra
    • Image/volume processing
    • Physical simulation
    • Global illumination
      – Ray tracing
      – Photon mapping
      – Radiosity
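The stream/kernel model can be sketched on the CPU as a function mapped independently over every record of a 2D grid. The names `run_kernel` and `saxpy` below are illustrative; on a GPU the grids would be textures and the kernel a fragment program.

```python
def run_kernel(kernel, *streams):
    """Apply `kernel` to each record of one or more 2D grids of equal shape.

    No record depends on another, so the iterations could run in any order
    (or all at once on parallel hardware).
    """
    rows, cols = len(streams[0]), len(streams[0][0])
    return [[kernel(*(s[i][j] for s in streams)) for j in range(cols)]
            for i in range(rows)]

def saxpy(a):
    """Build a kernel that computes a*x + y for one record."""
    return lambda x, y: a * x + y

if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    y = [[10.0, 20.0], [30.0, 40.0]]
    print(run_kernel(saxpy(2.0), x, y))
```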

Computational Resources
§ Programmable parallel processors
  – Vertex & Fragment pipelines
§ Rasterizer
  – Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
§ Texture unit
  – Read-only memory interface
§ Render to texture (or Copy to texture)
  – Write-only memory interface
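Because a texture can only be read and a render target can only be written within one pass, iterative computations commonly alternate between two buffers ("ping-pong"). The sketch below models that pattern with plain lists; `run_pass`, `iterate`, and the `decay` kernel are hypothetical names for this example.

```python
def run_pass(src, dst, kernel):
    """One pass: read only from `src` (texture), write only to `dst` (target)."""
    for i in range(len(src)):
        for j in range(len(src[0])):
            dst[i][j] = kernel(src, i, j)

def iterate(grid, kernel, passes):
    """Run `passes` passes, swapping the roles of the two buffers each time."""
    src = [row[:] for row in grid]
    dst = [row[:] for row in grid]
    for _ in range(passes):
        run_pass(src, dst, kernel)
        src, dst = dst, src  # the render target becomes the next texture
    return src

def decay(src, i, j):
    """Example kernel: halve the value stored at the same address."""
    return 0.5 * src[i][j]
```

The swap costs nothing: only the roles of the two buffers change, no data is copied.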

Vertex Processor
§ Fully programmable (SIMD / MIMD)
§ Processes 4-vectors (RGBA / XYZW)
§ Capable of scatter but not gather (A[i, j]=x;)
  – Can change the location of current vertex
  – Cannot read info from other vertices
  – Can only read a small constant memory
§ Vertex Texture Fetch
  – Random access memory for vertices
  – Arguably still not gather
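Scatter in this sense means the program chooses where its result lands. In the sketch below, each "vertex" carries a key and a value, the per-vertex program turns the key into an output address using only a small constant table, and the value is written there. The record layout and function names are assumptions for illustration.

```python
def vertex_program(record, constants):
    """Per-vertex program: sees only its own record and a small constant table.

    Returns the output address (i, j) and the value to write there.
    """
    key, value = record
    width = constants["width"]
    return (key // width, key % width), value

def scatter(records, width, height):
    """Draw one point per record; each lands at the address it computed."""
    constants = {"width": width}
    target = [[0.0] * width for _ in range(height)]
    for record in records:
        (i, j), value = vertex_program(record, constants)
        if 0 <= i < height and 0 <= j < width:  # off-target points are clipped
            target[i][j] = value                # scatter: A[i, j] = x
    return target
```

Note that `vertex_program` never reads another record or the target, which matches the "scatter but not gather" restriction.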

Fragment Processor
§ May be invoked at each pixel by drawing a full screen quad
§ Fully programmable (SIMD)
§ Processes 4-vectors (RGBA / XYZW)
§ Random access memory read (textures)
§ Capable of gather (x=A[i+1, j];) and some scatter
  – RAM read (texture), but no RAM write
  – Output address fixed to a specific pixel
  – But can change that address
§ Typically more useful than vertex processor
  – More fragment pipelines than vertex pipelines
  – Gather
  – Direct output (fragment processor is at end of pipeline)
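Gather is the opposite access pattern: the output address is fixed, but the program may read from any input address. The sketch below runs a fragment program once per output pixel, as a full-screen quad would, and averages the four neighbors read from a texture. Clamp-to-edge addressing and all names are assumptions for this example.

```python
def fetch(texture, i, j):
    """Texture read with clamp-to-edge addressing."""
    i = min(max(i, 0), len(texture) - 1)
    j = min(max(j, 0), len(texture[0]) - 1)
    return texture[i][j]

def fragment_program(texture, i, j):
    """Runs for output pixel (i, j): gathers the 4 neighbors and averages them."""
    return 0.25 * (fetch(texture, i + 1, j) + fetch(texture, i - 1, j)
                   + fetch(texture, i, j + 1) + fetch(texture, i, j - 1))

def full_screen_pass(texture):
    """Invoke the fragment program once per pixel of the output."""
    return [[fragment_program(texture, i, j) for j in range(len(texture[0]))]
            for i in range(len(texture))]
```

The program returns a value for its own pixel only; it never writes to any other address.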

Branching
§ Not supported or expensive
§ Avoid, replace by math
§ Depth test
§ Stencil test
§ Occlusion query (conditional execution)
§ Pre-computation (region of interest, use to set stencil mask)
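"Replace by math" usually means computing both outcomes and selecting one arithmetically. The sketch below compares a branching function with an equivalent branch-free form built from `step` and `mix`, written to behave like the shader built-ins of the same names; the example function itself is made up.

```python
def step(edge, x):
    """0.0 if x < edge, else 1.0."""
    return float(x >= edge)

def mix(a, b, t):
    """Linear blend: a when t == 0, b when t == 1."""
    return a * (1.0 - t) + b * t

def with_branch(x):
    """Reference version using a conditional."""
    if x >= 0.5:
        return 2.0 * x
    return x * x

def without_branch(x):
    """Same result: evaluate both sides, then select with a 0/1 weight."""
    return mix(x * x, 2.0 * x, step(0.5, x))
```

The trade-off is that both sides are always evaluated, so this pays off when the sides are cheap compared with the cost of a divergent branch.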