Graphics Hardware Joshua Barczak CMSC 435 UMBC


Object-Order Rendering: Object-Space Primitives -> Modeling XForm -> World-Space Primitives -> Viewing XForm -> Camera-Space Primitives -> Projection XForm -> Clip-Space Primitives -> Viewport XForm -> Raster-Space Primitives -> Rasterize -> Pixels -> Window XForm -> Displayed Pixels


The Code
DrawTriangles( Vertex* vb, int* ib, int n_primitives )
{
    for( each primitive i )
    {
        // fetch the indices
        int indices[3] = { ib[3*i], ib[3*i+1], ib[3*i+2] };
        // fetch the vertices
        Vertex v[3] = { vb[indices[0]], vb[indices[1]], vb[indices[2]] };
        // Transform the vertices
        TransformedVertex tv[3] = { Process(v[0]), Process(v[1]), Process(v[2]) };
        // Clip/cull (may create more triangles)
        for( each triangle )
        {
            // Backface cull
            if( !Backfacing() )
            {
                // rasterization setup
                for( each rasterized pixel )
                {
                    // interpolate vertex attributes
                    // z test
                    // calculate color
                    // blend result into frame buffer
                }
            }
        }
    }
}
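The Backfacing() test in the loop above is left abstract. One common way to implement it is the sign of the triangle's screen-space area. The sketch below assumes counter-clockwise front faces with +y pointing up; the Vec2 type and the SignedArea2 helper are illustrative names, not part of the slides.

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Twice the signed area of triangle (a, b, c) in screen space.
// Positive when the vertices wind counter-clockwise (with +y up).
float SignedArea2(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Assumption: counter-clockwise triangles face the camera.
// Zero-area (degenerate) triangles are treated as back-facing and culled.
bool Backfacing(Vec2 a, Vec2 b, Vec2 c) {
    return SignedArea2(a, b, c) <= 0.0f;
}
```

Swapping any two vertices flips the sign, so the same triangle submitted with the opposite winding is culled.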

The Machine Host CPU Application Memory Device Driver PCIE bus The DrawTriangles function, in ASIC form Memory GPU

The Cheap Machine Shared Die CPU Application Device Driver Memory

Our View of The Machine • API calls to – Manage memory – Configure the pipeline • Shader code • Resource bindings • Fixed-function states – Z/Stencil – Alpha – Draw sets of triangles Code Resource State VS Rasterizer Code Resource State PS Output Merger

The Rules • Rule #1: Don’t be silly – If you have 1000 triangles, do not make 1000 API calls • glBegin/glEnd are evil

The Rules • Rule #1: Don’t be silly – Compute at the correct rate • Per-vertex work is cheaper than per-pixel work – In general – Simplify uniform expressions (const = a value precomputed once, outside the shader) • x = CONST * CONST becomes x = const • x < sqrt(CONST) becomes x*x < const (valid for x >= 0) • x = y/CONST becomes x = y*(1/CONST) = y*const • x = pow(CONST, y) becomes x = exp(log(CONST)*y) = exp(const*y)
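The uniform-expression rewrites on this slide can be checked numerically. The sketch below pairs each "slow" form with its folded form; the function names are made up for illustration, and the square-root rewrite is only equivalent for non-negative x.

```cpp
#include <cassert>
#include <cmath>

// x < sqrt(c)  versus  x*x < c   (equivalent only when x >= 0 and c >= 0)
bool InsideSlow(float x, float c) { return x < std::sqrt(c); }
bool InsideFast(float x, float c) { return x * x < c; }

// y / c  versus  y * inv_c, where inv_c = 1/c is computed once per draw
float DivSlow(float y, float c)     { return y / c; }
float DivFast(float y, float inv_c) { return y * inv_c; }

// pow(c, y)  versus  exp(log_c * y), where log_c = log(c) is computed once per draw
float PowSlow(float c, float y)     { return std::pow(c, y); }
float PowFast(float log_c, float y) { return std::exp(log_c * y); }
```

The folded constants (1/c, log(c)) would be computed on the CPU and uploaded as uniforms, so the per-vertex or per-pixel code only pays for a multiply or an exp.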

GPU Operation Hardware Queue Command Buffers Frame N+1 Frame N+2 Driver Thread (on other core) Draw Calls Our Thread Our Scene The unstoppable march of time

The Worst Thing Imaginable: CPU/GPU dependencies • Code: Draw() ReadPixels() ……. • CPU: Process draw calls; do nothing while we wait for the pixels; do whatever it is we’re doing with those pixels; more draw calls • GPU: Do nothing while we wait for more draws; wait again

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic – Don’t read the frame buffer – If you use occlusion queries, wait a few frames before reading them

OpenGL Drawing
void DrawIndexedMesh( float* pPositions, float* pNormals, GLuint* pIndices, GLuint nTriangles )
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glVertexPointer(3, GL_FLOAT, 3*sizeof(GLfloat), pPositions);
    glNormalPointer(GL_FLOAT, 3*sizeof(GLfloat), pNormals);
    glDrawElements( GL_TRIANGLES, 3*nTriangles, GL_UNSIGNED_INT, pIndices );
}
Convenient, but wrong: Driver must copy the data every draw

OpenGL Drawing the Right Way
void DrawIndexedMesh( GLuint vbo, GLuint ibo, GLuint nTriangles, GLuint index_offset )
{
    // Bind first: the gl*Pointer calls refer to the currently bound GL_ARRAY_BUFFER
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo );
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    // These are no longer pointers: they are byte offsets into the bound buffer.
    // Stride and offsets assume Vertex is { float position[3]; float normal[3]; }
    glVertexPointer(3, GL_FLOAT, sizeof(Vertex), (GLvoid*)0);
    glNormalPointer(GL_FLOAT, sizeof(Vertex), (GLvoid*)(3*sizeof(float)) );
    // index_offset is a byte offset into the bound index buffer
    glDrawElements( GL_TRIANGLES, 3*nTriangles, GL_UNSIGNED_INT, (GLvoid*)(uintptr_t)index_offset );
}
At startup….
// vbo and ibo are assumed to be GLuint handles declared elsewhere
void CreateBufferObjects( Vertex* pVB, int nVertices, int* pIB, int nTris )
{
    glGenBuffers( 1, &vbo );
    glGenBuffers( 1, &ibo );
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo );
    glBufferData( GL_ARRAY_BUFFER, nVertices*sizeof(Vertex), pVB, GL_STATIC_DRAW );
    glBufferData( GL_ELEMENT_ARRAY_BUFFER, 3*nTris*sizeof(int), pIB, GL_STATIC_DRAW );
}

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic • Rule #3: Do not move data

Dynamic Buffers • But… I NEED to move data – Do you really? • Sometimes you do: – Particles – CPU animation • You need to double-buffer to avoid stalls

Dynamic Buffers Buffer 1 Buffer 0 GPU CPU Frame 0 Frame 1 Frame 2 Frame 3 If you give them the right flags, drivers will manage this for you (read the docs)
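The double-buffering diagram can be reduced to a slot-selection rule. The sketch below assumes exactly two buffers and a GPU that runs exactly one frame behind the CPU; both are simplifications, and the function names are hypothetical.

```cpp
#include <cassert>

// Assumption for this sketch: two buffers, GPU exactly one frame behind the CPU.
constexpr int kBufferCount = 2;

// Slot the CPU fills while building frame `frame`.
int CpuWriteSlot(int frame) {
    assert(frame >= 0);
    return frame % kBufferCount;
}

// Slot the GPU reads while the CPU builds frame `frame` (the GPU is drawing frame - 1).
int GpuReadSlot(int frame) {
    assert(frame >= 1);
    return (frame - 1) % kBufferCount;
}
```

Because the two slots never coincide within a frame, the CPU never has to wait for the GPU to finish reading before it starts writing. If the driver queues more than one frame ahead, more slots are needed.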

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic • Rule #3: Do not move data • Rule #4: When breaking rule #3, do so correctly – Use dynamic, write-only buffers with discard

Pipelining • Instructions take several cycles: Fetch (F), Decode (D), Execute (E), Write Regs (W) • Nonpipelined: N instructions in 4N clocks, 4 CPI • Pipelined: 9 instructions in 12 clocks, 1.3 CPI (1 CPI in the limit)
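The clock counts on this slide follow from a simple model: an ideal pipeline needs (stages - 1) clocks to fill, then retires one instruction per clock. The sketch below assumes no stalls or hazards; the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Every instruction occupies the whole machine for `stages` clocks.
int NonPipelinedClocks(int instructions, int stages) {
    return instructions * stages;
}

// Ideal pipeline: (stages - 1) clocks to fill, then one instruction completes per clock.
int PipelinedClocks(int instructions, int stages) {
    assert(instructions > 0 && stages > 0);
    return stages + instructions - 1;
}

// Average clocks per instruction.
double Cpi(int clocks, int instructions) {
    return static_cast<double>(clocks) / instructions;
}
```

With 9 instructions and 4 stages this gives 12 clocks and a CPI of about 1.33, matching the slide; as the instruction count grows the fill cost is amortized and CPI approaches 1.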


Graphics Pipeline
DrawTriangles( Vertex* vb, int* ib, int n_primitives )
{
    for( each primitive i )
    {
        // fetch the indices
        int indices[3] = { ib[3*i], ib[3*i+1], ib[3*i+2] };
        // fetch the vertices
        Vertex v[3] = { vb[indices[0]], vb[indices[1]], vb[indices[2]] };
        // Transform the vertices
        TransformedVertex tv[3] = { Process(v[0]), Process(v[1]), Process(v[2]) };
        // Clip/cull (may create more triangles)
        for( each triangle )
        {
            // Backface cull
            if( !Backfacing() )
            {
                // rasterization setup
                for( each rasterized pixel )
                {
                    // interpolate vertex attributes
                    // z test
                    // calculate color
                    // blend result into frame buffer
                }
            }
        }
    }
}
Hardware Stages: Index Fetch -> Vertex Shade -> Clip -> Cull -> Raster -> Z -> Shade -> Blend

Graphics Pipelining Index Fetch Vertex Shade Clip Cull Raster Z Shade Blend Clock cycles

The Physical Machine Control Registers Stuff like: - VB/IB address - Primitive type - Cull mode - Viewport size - ZBuffer - pointer - format -Color buffer - pointer - format - Textures - pointer - format - Blend mode - Z/Stencil modes Vertex Triangle Rasterizer Pixel Blend Functional Units

State Change Index Fetch Wait for VS Vertex Fetch Vertex Shade Clip Cull Raster Z Shade Blend White space indicates wasted electricity Clock Cycles (tick tock)

State Change (Software) Your Code Set. State() Draw triangles Driver Workload Figure out what regs to change - Turn texture/VBO handles to addresses/formats - Convert GL state to register bits - Change shader code addresses Put register writes into command buffer Put draw commands into command buffer Repeat… This part is much more severe than the hardware bubbles….

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic • Rule #3: Do not move data • Rule #4: Move data correctly • Rule #5: Avoid changing state – Sort objects by state – Pack meshes together
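"Sort objects by state" can be sketched as sorting a draw list by a state key and counting how many state switches remain. The DrawItem struct and its fields are hypothetical, and the sketch assumes draw order does not affect the image (opaque geometry with a Z-buffer).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical draw record: state handles plus the mesh to draw.
struct DrawItem {
    std::uint32_t shader;
    std::uint32_t texture;
    std::uint32_t mesh;
};

// Number of shader/texture switches needed to submit the list in order.
// The first item always sets both states.
int CountStateChanges(const std::vector<DrawItem>& items) {
    int changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].shader != items[i - 1].shader) ++changes;
        if (i == 0 || items[i].texture != items[i - 1].texture) ++changes;
    }
    return changes;
}

// Group draws that share a shader, then a texture.
// Assumption: shader switches cost more than texture switches, so shader is the major key.
void SortByState(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         if (a.shader != b.shader) return a.shader < b.shader;
                         return a.texture < b.texture;
                     });
}
```

For a list that alternates between two shaders, sorting collapses the alternation into two runs, so the driver has far fewer register writes to generate.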

Computation & Bandwidth Based on: • 100 Mtri/sec (1.6 M/frame @ 60 Hz) • 256 B vertex data • 128 B interpolated • 68 B fragment output • 5x depth complexity • 16 4-byte textures • 223 ops/vert • 1664 ops/frag • No caching • No compression Results: 75 GB/s Vertex, 13 GB/s Triangle, 335 GB/s Texture, 45 GB/s Fragment, 67 GFLOPS, 1.1 TFLOPS It is physically impossible to run a serial datapath at these rates Slide: Olano
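The vertex and triangle figures on this slide can be reproduced with a few multiplications. The sketch below assumes three unshared vertices per triangle (consistent with "no caching") and decimal gigabytes; the fragment and texture figures depend on a screen resolution the slide does not state, so they are left out. The struct and function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

struct VertexSideRates {
    double vertex_gb_per_s;    // vertex data fetched
    double triangle_gb_per_s;  // interpolated data produced per triangle
    double vertex_gflops;      // vertex shading arithmetic
};

// Assumption: every triangle fetches and shades 3 vertices (no reuse).
VertexSideRates Estimate(double tris_per_s, double vertex_bytes,
                         double interpolated_bytes, double ops_per_vertex) {
    assert(tris_per_s > 0.0);
    const double verts_per_s = 3.0 * tris_per_s;
    return { verts_per_s * vertex_bytes / 1e9,
             tris_per_s * interpolated_bytes / 1e9,
             verts_per_s * ops_per_vertex / 1e9 };
}
```

Plugging in the slide's inputs (100 Mtri/sec, 256 B, 128 B, 223 ops) gives about 76.8 GB/s, 12.8 GB/s and 66.9 GFLOPS, close to the quoted 75 GB/s, 13 GB/s and 67 GFLOPS.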

Data Parallel Distribute Task Merge Slide: Olano Task

Parallel Graphics Vertex Geometry Pixel

Barycentric Rasterization • SIMD Parallelism (NxN Stamp)

Ordering • Independent pixels • Strict primitive order within a pixel – Or transparency doesn’t work

Parallel Raster Architectures • “Sort” – Resolving pixel ordering • Where the “sort” happens – First – Middle – Last • Good read: – “A Sorting Classification of Parallel Rendering” Molnar et al.

Slide: Olano Sort First Objects Distribute objects by screen tile Vertex Triangle Fragment Screen Some pixels Some objects

Slide: Olano Sort Middle Objects Distribute objects or vertices Vertex Some objects Merge & Redistribute by screen location Triangle Fragment Screen Some pixels Some objects

Slide: Olano Screen Subdivision Tiled Interleaved Scan-Line Interleaved Column Interleaved

Slide: Olano Sort Last Objects Distribute by object Vertex Triangle Fragment Frag-merge Screen Full Screen Some objects Screen Partitioned

Architecture • An execution core – Not to scale Control Logic ALU Regs

Architecture • Multiple parallel cores – Multiple-instruction multiple-data (MIMD) 16 Instruction Streams 16 Data Streams

Architecture • SIMD Machine – Single Instruction Multiple Data – Shared control logic • Pro – More throughput • Con – Coherent execution 1 Instruction Stream 60 Data Streams

SIMD Branching if( x ) // mask threads { // issue instructions } else // invert mask { // issue instructions } // unmask Useful Useless Threads agree, take if Threads agree, take else Threads disagree, take if AND else
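The masking scheme on this slide can be emulated on the CPU. The sketch below uses a 4-lane "warp" and an arbitrary branch body (double positive values, negate the rest); it counts how many passes over the lanes are issued. The lane count and names are assumptions for illustration.

```cpp
#include <array>
#include <cassert>

constexpr int kLanes = 4;
using Lanes = std::array<int, kLanes>;

struct BranchResult {
    Lanes out;   // per-lane result
    int passes;  // sides of the branch actually issued (1 if lanes agree, 2 if they diverge)
};

// Emulates: if (x > 0) out = 2 * x; else out = -x;
BranchResult MaskedBranch(const Lanes& x) {
    std::array<bool, kLanes> mask{};
    bool any_if = false, any_else = false;
    for (int i = 0; i < kLanes; ++i) {
        mask[i] = x[i] > 0;
        any_if = any_if || mask[i];
        any_else = any_else || !mask[i];
    }
    BranchResult r{};
    if (any_if) {            // issue the 'if' side; lanes with mask off are idle
        ++r.passes;
        for (int i = 0; i < kLanes; ++i)
            if (mask[i]) r.out[i] = 2 * x[i];
    }
    if (any_else) {          // invert the mask and issue the 'else' side
        ++r.passes;
        for (int i = 0; i < kLanes; ++i)
            if (!mask[i]) r.out[i] = -x[i];
    }
    return r;
}
```

When all lanes agree only one side is issued; a single disagreeing lane forces both sides to run for the whole group.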

SIMD Looping while(x) // update mask { // do stuff } They all run ‘till the last one’s done….
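The cost of a divergent loop can be counted the same way. In the sketch below each lane wants a different number of iterations; the group keeps issuing until the last lane finishes, and lanes that are already done contribute nothing. The 4-lane width and the names are assumptions.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 4;

struct LoopCost {
    int iterations;  // loop bodies issued for the whole group
    int lane_slots;  // iterations * kWarpSize: lane-cycles paid for
    int useful;      // lane-cycles that did real work
};

// `remaining[i]` is the trip count lane i wants.
LoopCost SimdLoop(std::array<int, kWarpSize> remaining) {
    LoopCost cost{};
    for (;;) {
        bool any_active = false;               // update mask
        for (int r : remaining) any_active = any_active || (r > 0);
        if (!any_active) break;
        ++cost.iterations;
        for (int& r : remaining) {
            if (r > 0) { --r; ++cost.useful; } // masked-off lanes idle
        }
    }
    cost.lane_slots = cost.iterations * kWarpSize;
    return cost;
}
```

With trip counts {1, 1, 1, 8} the group issues 8 iterations (32 lane-cycles) to do 11 lane-cycles of work.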

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic • Rule #3: Do not move data • Rule #4: Move data correctly • Rule #5: Avoid changing state • Rule #6: Think about coherence – Keep control flow simple – Flatten branches – Avoid ‘else’ branches

DX9-Era GPU Clipspace Primitives Primitive Assembly Early-Z Rasterizer Index stream Post-TnL Cache VS VS Vertex Cache Z Pixel blocks Vertices PS PS PS PS Texture Cache Memory Pixels Blend Cache

Memory Bandwidth • Lots of things need data all at once Index Stream Vertex Demand Texture Demand Vertex $ Tex $ Memory Alpha/Z operation Color $ Depth $

Texture Tiling • Images tiled in memory In Memory:

Texture Tiling • Texture cache is for reuse across pixel blocks – Bandwidth savings – Not latency reduction • Like in a CPU In Memory:

Block Compression • DXT1 (BC1): Endpoint colors – 4x4 pixel block packed into 8 bytes – 8:1 over standard 32-bit color – Four possible colors: 2 endpoints and 2 interior points – Color index (2 bits/pixel)
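A BC1 block can be decoded in a few lines. The sketch below handles only the four-color mode described on the slide (first endpoint greater than the second); the alternate three-color-plus-transparent mode is not implemented. The rounding of the interior colors uses plain integer division, which is an assumption: real decoders may round slightly differently. Type and function names are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Rgb { std::uint8_t r, g, b; };

// Expand a 5:6:5 color to 8 bits per channel by replicating the high bits.
Rgb Expand565(std::uint16_t c) {
    const unsigned r5 = (c >> 11) & 0x1Fu, g6 = (c >> 5) & 0x3Fu, b5 = c & 0x1Fu;
    return { static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2)),
             static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4)),
             static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2)) };
}

// Interior point one third of the way from a to b: (2a + b) / 3.
Rgb Lerp3(Rgb a, Rgb b) {
    return { static_cast<std::uint8_t>((2 * a.r + b.r) / 3),
             static_cast<std::uint8_t>((2 * a.g + b.g) / 3),
             static_cast<std::uint8_t>((2 * a.b + b.b) / 3) };
}

// Decode one 8-byte block into 16 pixels, row by row.
// Layout: two little-endian RGB565 endpoints, then 4 bytes of 2-bit indices (one byte per row).
std::array<Rgb, 16> DecodeBc1(const std::uint8_t block[8]) {
    const std::uint16_t c0 = static_cast<std::uint16_t>(block[0] | (block[1] << 8));
    const std::uint16_t c1 = static_cast<std::uint16_t>(block[2] | (block[3] << 8));
    assert(c0 > c1);  // four-color mode only; c0 <= c1 selects a mode this sketch omits
    const Rgb e0 = Expand565(c0), e1 = Expand565(c1);
    const Rgb palette[4] = { e0, e1, Lerp3(e0, e1), Lerp3(e1, e0) };
    std::array<Rgb, 16> out{};
    for (int i = 0; i < 16; ++i) {
        const int index = (block[4 + i / 4] >> (2 * (i % 4))) & 3;
        out[i] = palette[index];
    }
    return out;
}
```

The 8:1 ratio on the slide falls out of the layout: 16 pixels at 4 bytes each is 64 bytes, stored here in 8.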

Block Compression • DXT5 (BC3): – DXT1 plus alpha • 4:1 – Endpoint alphas (2 bytes) – Alpha index (3 bits/pixel) – 8 possible alphas • BC4 (ATI1N) – Just the alpha • 2:1 for greyscale • BC5 (ATI2N) – 2 alphas slapped together • 2:1 for 2 channels • 4:1 for TS normal map

Block Compression • BC6/BC7 – 16 byte block – 7 different formats

Z-Ordered Rasterization • Take 2D integer coordinates – Interleave bits – Get a 1D index – Consecutive 1D indices are spatially coherent • Deinterleave a counter to walk through space Wikipedia
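Bit interleaving and deinterleaving can be written with a handful of shifts and masks. The sketch below handles 16-bit coordinates packed into a 32-bit index, with x in the even bits and y in the odd bits; that bit assignment and the function names are choices made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 16 bits of v so that bit i moves to bit 2*i.
std::uint32_t Part1By1(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Inverse of Part1By1: gather the even bits of v into the low 16 bits.
std::uint32_t Compact1By1(std::uint32_t v) {
    v &= 0x55555555u;
    v = (v | (v >> 1)) & 0x33333333u;
    v = (v | (v >> 2)) & 0x0F0F0F0Fu;
    v = (v | (v >> 4)) & 0x00FF00FFu;
    v = (v | (v >> 8)) & 0x0000FFFFu;
    return v;
}

// 2D -> 1D: x occupies the even bits, y the odd bits.
std::uint32_t MortonEncode(std::uint32_t x, std::uint32_t y) {
    return Part1By1(x) | (Part1By1(y) << 1);
}

// 1D -> 2D: deinterleave a counter to walk through space.
void MortonDecode(std::uint32_t index, std::uint32_t& x, std::uint32_t& y) {
    x = Compact1By1(index);
    y = Compact1By1(index >> 1);
}
```

Counting the index from 0 to 3 visits (0,0), (1,0), (0,1), (1,1), which is the "Z" shape that gives the curve its name; the next four indices trace the same shape in the neighbouring 2x2 cell.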

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic • Rule #3: Do not move data • Rule #4: Move data correctly • Rule #5: Avoid changing state • Rule #6: Think about coherence • Rule #7: Save bandwidth – Small vertex formats • 16-bit float, 8-bit fixed – Small texture formats • Compress • Don’t use four 8-bit channels for a greyscale image! • See Rules 1 and 6


Unified Shaders Nvidia “Technical Brief” (Read: Marketing)

DX10-Era GPU Clipspace Primitives Early-Z Primitive Assembly Rasterizer Pixels GS GS Threads Post-TL$ VS Threads Index Data Z US US US US US US US US US US US US US US US US L1$ L2$ Memory Pixels Blend L1$ Cache

The Geometry Shader • One Primitive In – Point/Line/Triangle – Up to N primitives out • Point/Line/Triangle – Unpredictable data amplification – Order MUST be preserved

The Geometry Shader One geometry shader GS Parallel geometry shaders Spits out primitives one by one GS GS GS Buffering Rasterizer Results must be consumed in order…. Lots of buffering Rasterizer

The Geometry Shader • Nvidia – Buffers in on chip memory – Parallelism limited by buffer space – Faster for small amplification • AMD – Buffers in DRAM – Lots of latency – Faster for large amplification Performance

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic • Rule #3: Do not move data • Rule #4: Move data correctly • Rule #5: Avoid changing state • Rule #6: Think about coherence • Rule #7: Save bandwidth • Rule #8: Geometry shaders are slooooow

Memory Latency Do a little math… Miss the cache… Wait a few hundred cycles for memory Keep going

Memory Latency • CPU Strategy – Bend over backwards to avoid stalls Sequencer Regs Gigantic Cache ALU More Regs Branch Predictor Out-of-Order Exec. Memory Prefetch

Memory Latency • GPU Strategy – Run lots of threads – “Hide” the stalls with useful work Sequencer Scheduler ALU ALU ALU ALU Regs (THOUSANDS of them) Tiny Cache

Latency Do a little math Hardware swaps in other threads Miss the cache… Memory access overlapped by useful work Keep going
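How many threads it takes to hide a miss can be estimated with a back-of-the-envelope model. The sketch below assumes every warp alternates between a fixed burst of math and a fixed-latency memory access, and that switching warps is free; real schedulers and access patterns are messier. The function name and parameters are hypothetical.

```cpp
#include <cassert>

// Rough model: each warp does `alu_cycles` of math, then waits `miss_latency` cycles.
// While one warp waits, the other warps must supply `miss_latency` cycles of math.
int WarpsToHideLatency(int miss_latency, int alu_cycles) {
    assert(miss_latency >= 0 && alu_cycles > 0);
    const int others = (miss_latency + alu_cycles - 1) / alu_cycles;  // ceiling division
    return 1 + others;
}
```

Under this model, a 400-cycle miss with 20 cycles of math between misses needs 21 warps in flight, which is why the register file is so large and the cache can be so small.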

Terminology • Thread – One instance of a shader program – One pixel/vertex • Warp/Wavefront – SIMD-sized collection of threads, in lockstep • What H/W people call a thread – Many warps in flight for latency hiding

Occupancy • Register file: – “Registers” are SIMD-sized – Evenly divided among warps SIMD Lane Register Number

Occupancy • 4 registers per thread – 8 warps SIMD Lane Register Number

Occupancy • 16 Registers per thread – 2 warps SIMD Lane Register Number
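The two occupancy examples are the same division with different inputs. The sketch below infers a register file of 32 SIMD-wide registers from the slides' numbers (4 registers give 8 warps, 16 give 2); that size and the function name are assumptions, and real hardware also caps the warp count in other ways.

```cpp
#include <cassert>

// Register file size in SIMD-wide registers, inferred from the slides' two examples.
constexpr int kRegisterFileSize = 32;

// Warps that fit when the register file is divided evenly among them.
int WarpsInFlight(int register_file_size, int registers_per_thread) {
    assert(registers_per_thread > 0);
    return register_file_size / registers_per_thread;
}
```

Going from 4 to 16 registers per thread cuts the warps in flight from 8 to 2, leaving far fewer threads available to cover memory latency.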

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic • Rule #3: Do not move data • Rule #4: Move data correctly • Rule #5: Avoid changing state • Rule #6: Think about coherence • Rule #7: Save bandwidth • Rule #8: Geometry shaders are slooooow • Rule #9: Keep the machine full

Modern GPU Clipspace Primitives Rast Setup Rast GS Tess PA Post-TL$ Index Data Z GS/HS/DS Threads Vertex Threads US US US US US US US US US US US US US US US US L1$ L2$ Memory L1$ Pixels Blend

Tessellation Unigine.com

DX11 Tessellation Pipeline Patches Control Points Hull Shader (Selects Tess Factors) Detail Levels Tessellation Hardware Domain Shader (Evaluation) U, V Coordinates Vertices Moreton 2001 Geometry Shader

Tessellation • Tessellation Pitfalls: – Backface cull happens post-tess • LOTS of wasted DS work – 2x2 Quad Utilization problem

Derivatives for MipMapping • 2x2 Quads + Differencing Missing pixels are extrapolated… Each 2x2 quad is self-contained
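Differencing within a quad can be sketched as follows. The code takes the texture coordinates of the four pixels in one 2x2 quad (in texel units), forms finite differences against the top-left pixel, and picks a mip level from the larger footprint. The lane order, the texel-unit convention and the simple isotropic formula are assumptions; actual hardware may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Texture coordinates (in texels) for the four pixels of one 2x2 quad.
// Lane order assumed here: 0 = top-left, 1 = top-right, 2 = bottom-left, 3 = bottom-right.
struct Quad {
    float u[4];
    float v[4];
};

// Mip level from finite differences across the quad.
float MipLevel(const Quad& q) {
    const float dudx = q.u[1] - q.u[0], dvdx = q.v[1] - q.v[0];  // horizontal neighbour
    const float dudy = q.u[2] - q.u[0], dvdy = q.v[2] - q.v[0];  // vertical neighbour
    const float rho = std::max(std::hypot(dudx, dvdx), std::hypot(dudy, dvdy));
    return rho > 1.0f ? std::log2(rho) : 0.0f;  // clamp magnification to level 0
}
```

This is why every quad must be shaded in full: a pixel outside the triangle still has to produce coordinates so its neighbours can difference against it.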

Big Triangle

Rasterized Quads

Wasted Pixels 27 of 76 (35%) - Drops off very fast for big triangles - At this scale, this triangle is “small”

In the limit… In this scenario, we shade 4 times as many pixels as we need This is essentially what happens when we over-tessellate
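The quad overhead can be counted directly: every 2x2 quad touched by at least one covered pixel costs four shaded lanes. The sketch below takes a list of covered pixel coordinates (assumed unique and non-negative) and reports covered versus shaded counts; the names are illustrative.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

struct QuadStats {
    int covered;  // pixels actually inside the triangle
    int shaded;   // lanes shaded: 4 per touched 2x2 quad
};

// Assumption: pixel coordinates are unique and non-negative.
QuadStats CountQuadShading(const std::vector<std::pair<int, int>>& pixels) {
    std::set<std::pair<int, int>> quads;
    for (const auto& p : pixels) {
        quads.insert({p.first / 2, p.second / 2});  // quad containing this pixel
    }
    return { static_cast<int>(pixels.size()), 4 * static_cast<int>(quads.size()) };
}
```

A triangle covering a single pixel still shades four lanes, which is the 4x worst case; a triangle that fills whole quads shades nothing extra.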

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic • Rule #3: Do not move data • Rule #4: Move data correctly • Rule #5: Avoid changing state • Rule #6: Think about coherence • Rule #7: Save bandwidth • Rule #8: Geometry shaders are slooooow • Rule #9: Keep the machine full • Rule #10: Small triangles are slow

The Rules • Rule #1: Don’t be silly • Rule #2: One way traffic • Rule #3: Do not move data • Rule #4: Move data correctly • Rule #5: Avoid changing state • Rule #6: Think about coherence • Rule #7: Save bandwidth • Rule #8: Geometry shaders are slooooow • Rule #9: Keep the machine full • Rule #10: Small triangles are slow • Rule #11: The rules are subject to change at any time and without notice…

NVIDIA GeForce 6 [Kilgariff and Fernando, GPU Gems 2]

Vertex Processing • 4-wide FP vector + special functions • Vertex texture fetch – Unfiltered – Very slow • MIMD

Fragment Processing • Pixel pipe – Two 4-wide vector pipes • Dual issue • Vector co-issue – 3x1 or 2x2 • FP16 arithmetic – Poor flow control granularity • All in-flight threads take same path

AMD/ATI R600 [Tom’s Hardware]

SIMD Units • VLIW – 5 ALUs • 1 with transcendentals – 16-wide SIMD • In groups of 4 – 64-thread “wavefront” • 2 waves issue over 8 clocks • 4 texture engines – Texture ops ¼ ALU rate

Dispatch

Demo

NVIDIA G80 [NVIDIA 8800 Architectural Overview, NVIDIA TB-02787-001_v01, November 2006]

Streaming Processors • Scalar architecture – 32-wide “warp” issued over 4 clocks – Special functions take 16 clocks (2 SFUs) • Instruction issue – Interleaved warp instructions

NVIDIA Fermi [Beyond3D, NVIDIA Fermi GPU and Architecture Analysis, 2010]

Fermi Rasterization • Round robin vertex processing • 4 rasterizers (1 per GPC) – Screen partitioned Purcell 2010

NVIDIA Fermi SM • 2 concurrent warps – 32 ALUs – 16 Load/Store units – 4 SFUs [NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, 2009]

AMD GCN • Vector pipes – 4 SIMDs per CU – Issued round-robin • 64-wide waves • Scalar processor – Integer ops and branching – Separate register set • Different instruction types can co-issue – From different waves