Advanced D3D10 Rendering

Emil Persson
May 24, 2007

Overview

  • Introduction to D3D10
  • Rendering techniques in D3D10
  • Optimizations

Introduction

  • Best D3D revision yet!
  • Clean and powerful API
  • Lots of new features
      - SM 4.0
      - New geometry shader
      - Stream Out
      - Texture arrays
      - Render to volume texture
      - MSAA individual sample access
      - Constant buffers
      - Sampler state decoupled from texture unit
      - Dual-source blending
      - Etc…

Clean API

  • Vista only
  • Everything is mandatory (almost)
      - No legacy hardware support
  • Clean starting point for future evolution of the API
      - Limited market short-term
  • Some old features deprecated
      - Fixed function
      - Assembly shaders
      - Alpha test
      - Triangle fans
      - Point sprites
      - Clip planes

Dealing with deprecated features

  • Fixed function
      - Write a few über-shaders
  • Assembly shaders
      - Convert to HLSL
  • Alpha test
      - Use discard or clip() in pixel shader
      - Use alpha-to-coverage
  • Triangle fans
      - Seldom used anyway, usually just for a quad
      - Convert to triangle list or strip
  • Point sprites
      - Expand point to 2 triangles in GS
  • Clip planes
      - Use clip distance and/or cull distance
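The point-sprite replacement can be pictured on the CPU. The following is a minimal C++ sketch, not the deck's shader: `Float2` and `expand_point_sprite` are hypothetical names, and the corner order is an assumption chosen so the four vertices form a two-triangle strip, which is what a geometry shader would emit per input point.

```cpp
#include <array>

struct Float2 { float x, y; };

// Expands one point into the four corners of a screen-aligned quad.
// The corners are ordered for a triangle strip: (0,1,2) and (1,2,3)
// are the two triangles that replace the old point sprite.
std::array<Float2, 4> expand_point_sprite(Float2 center, float half_size)
{
    const float l = center.x - half_size;
    const float r = center.x + half_size;
    const float b = center.y - half_size;
    const float t = center.y + half_size;
    return {{ Float2{l, b}, Float2{l, t}, Float2{r, b}, Float2{r, t} }};
}
```

In a real geometry shader the same four corners would be appended to a triangle stream, with texture coordinates assigned per corner.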

SM 4.0

  • Geometry shader
      - Processes a full primitive (point, line, triangle)
      - Has access to adjacency information (optional)
          - Useful for silhouette detection, shadow volume extrusion etc.
      - May output multiple primitives
          - Output limitation is 1024 floats
      - May output nothing (to kill primitive)

SM 4.0

  • Infinite instruction count
      - Very long shaders may have lower throughput though
  • Integer and bitwise instructions
  • Indexable temporaries
      - Allows for local arrays
      - May be used to emulate a stack
  • Useful system generated values
      - SV_VertexID
      - SV_PrimitiveID
      - SV_InstanceID
      - SV_Position (like VPOS, but now .zw are defined too)
      - SV_IsFrontFace (like VFACE)
      - SV_RenderTargetArrayIndex
      - SV_ViewportArrayIndex
      - SV_ClipDistance
      - SV_CullDistance

SM 4.0

  • Integer & bitwise instructions
      - Signed and unsigned
          - No idiv though, just udiv
      - Same registers as floats
          - Can alias without conversion with asint(), asuint(), asfloat() etc.
      - Integer texture sample values
          - Syntax: Texture2D <uint4> myTex;
  • Access to individual samples in MSAA surface
      - Allows for custom AA resolve
          - Syntax: Texture2DMS <float4, 4> myTex;
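"Alias without conversion" means the 32 bits are reinterpreted, not numerically converted. A minimal CPU-side sketch in C++ follows; `as_uint` and `as_float` are hypothetical stand-ins for the HLSL intrinsics asuint() and asfloat(), and the bit patterns in the usage note assume IEEE-754 single precision.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as an unsigned integer (no numeric conversion).
inline std::uint32_t as_uint(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Reinterpret the bits of an unsigned integer as a float.
inline float as_float(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

For example, `as_uint(1.0f)` yields the bit pattern 0x3F800000 rather than the integer 1, which is what makes it possible to carry integer payloads through float registers.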

Pixel center

  • Half pixel offset is gone!
      - Affects SV_Position as well
      - Now matches OpenGL

[Figure: pixel center placement, DX10 vs. DX9]

Pixel center

  • Pixels and texels align
      - TexCoord = SV_Position.xy / float2(width, height)

[Figure: texel centers in screenspace]
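The alignment can be checked with a few lines of arithmetic. This is a minimal C++ sketch with hypothetical helper names; it assumes the D3D10 convention that SV_Position reports pixel centers at half-integer coordinates, and a render target with the same dimensions as the texture.

```cpp
struct Float2 { float x, y; };

// D3D10 convention (assumed here): the center of pixel (px, py) is
// reported in SV_Position as (px + 0.5, py + 0.5).
inline Float2 sv_position(int px, int py)
{
    return Float2{ px + 0.5f, py + 0.5f };
}

// The slide's formula: TexCoord = SV_Position.xy / float2(width, height).
inline Float2 tex_coord(Float2 pos, int width, int height)
{
    return Float2{ pos.x / width, pos.y / height };
}

// Center of texel (tx, ty) in normalized texture coordinates.
inline Float2 texel_center(int tx, int ty, int width, int height)
{
    return Float2{ (tx + 0.5f) / width, (ty + 0.5f) / height };
}
```

With these definitions the coordinate computed for pixel (1, 2) of a 4 x 4 target is (0.375, 0.625), exactly the center of texel (1, 2), so no half-pixel correction term is needed.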

The small batch problem

  • D3D10 designed to minimize batch overhead
      - Pulls work from draw time to creation time
          - Validation
          - Shader input/output configuration
      - Immutable State Objects
          - Input layout
          - Rasterizer state
          - Sampler state
          - Depth stencil state
          - Blend state

The small batch problem

  • D3D10 also provides tools to reduce draw calls
      - Improved instancing interface
      - Geometry shader
      - More shader resources
      - Constant indexing in PS
      - Render target arrays
      - Texture arrays

Rendering techniques in D3D10

Global Illumination

Global Illumination

  • Probes on a volume grid across the scene
      - Each probe captures light environment into a tiny “cubemap”
      - Probes are converted to Spherical Harmonics coefficients
  • Indirect lighting is computed using interpolated SH coefficients
      - Do the same in probe passes to get multiple light bounces

Global Illumination

  • Awful lot of work
      - Each probe is 6 slices. We need loads of probes.
      - Sample scene has over 300 probes
  • Solution
      - Use geometry shader to reduce work
      - Distribute work across multiple frames
          - Sample updates 40 cubes per frame
          - Scatter updates to hide artifacts
      - Skip over “empty” space probes
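One way to spread probe updates over frames while scattering them spatially is a strided visiting order. The sketch below is an assumption about how such a schedule could look, not the sample's actual scheme: `probes_for_frame` is a hypothetical helper, and `stride` must be chosen coprime to the probe count so that every probe is visited exactly once per cycle. Skipping empty-space probes is not modeled.

```cpp
#include <vector>

// Returns the probe indices to refresh in the given frame (frame >= 0).
// The k-th update overall goes to probe (k * stride) mod probe_count, so
// consecutive updates land far apart in the grid instead of sweeping it
// linearly. Assumption: stride and probe_count share no common factor.
std::vector<int> probes_for_frame(int frame, int probe_count, int per_frame, int stride)
{
    std::vector<int> out;
    out.reserve(per_frame);
    for (int i = 0; i < per_frame; ++i) {
        const long long k = static_cast<long long>(frame) * per_frame + i;
        out.push_back(static_cast<int>((k % probe_count) * stride % probe_count));
    }
    return out;
}
```

With a hypothetical 320 probes, 40 updates per frame and a stride of 37, eight consecutive frames refresh every probe exactly once, and the first frame touches probes 0, 37, 74 and so on.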

Global Illumination

  • The Geometry Shader advantage
      - 40 cubes x 6 faces x n draw calls = Pain
          - DX9 style unrealistic even for simple scenes
      - Update multiple slices per pass with GS
          - Output limit is 1024 floats
          - Keep number of interpolators down to maximize primitive count
          - Managed to update 5 probes (30 slices) per pass
          - 8 passes is more manageable than 240...
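The numbers on this slide follow from simple arithmetic, sketched here in C++ with hypothetical helper names. The second helper rests on two assumptions: that each input triangle is emitted once per slice, and that the 1024-float budget covers every scalar a vertex outputs, including position and the render target array index.

```cpp
// Passes needed to refresh `cubes` cubemaps when one pass covers
// `cubes_per_pass` of them (ceiling division).
constexpr int passes_needed(int cubes, int cubes_per_pass)
{
    return (cubes + cubes_per_pass - 1) / cubes_per_pass;
}

// Largest per-vertex output size (in floats) that still lets one geometry
// shader invocation emit a full triangle to each of `slices` slices.
constexpr int max_floats_per_vertex(int slices, int budget_floats = 1024)
{
    return budget_floats / (slices * 3);
}
```

Updating 40 cubes at 5 per pass gives 8 passes, against 40 x 6 = 240 slice renders done one at a time. Under the stated assumptions, 30 slices leave at most 11 floats per output vertex, which is why the interpolator count has to stay low.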

Post tone-mapping resolve

  • D3D10 allows for custom AA resolves
      - Can drastically improve HDR AA quality
          - Standard resolve occurs before tone-mapping
          - Ideally resolve should be done after tone-mapping

[Figure: standard resolve vs. custom resolve]

Post-tonemapping resolve

    // SAMPLES is assumed to be a compile-time define (the MSAA sample count)
    Texture2DMS<float4, SAMPLES> tHDR;
    float exposure; // assumed shader constant; its declaration is not shown on the slide

    float4 main(float4 pos : SV_Position) : SV_Target
    {
        int3 coord;
        coord.xy = (int2) pos.xy;
        coord.z = 0;

        // Tone-map individual samples and sum it up
        float4 sum = 0;
        [unroll]
        for (int i = 0; i < SAMPLES; i++)
        {
            float4 c = tHDR.Load(coord, i);
            sum.rgb += 1.0 - exp2(-exposure * c.rgb);
        }

        // Average
        sum *= (1.0 / SAMPLES);

        // sRGB
        sum.rgb = pow(sum.rgb, 1.0 / 2.2);

        return sum;
    }
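Why the order matters shows up clearly on a single color channel. The following C++ sketch is a CPU reference for illustration only: it reuses the slide's tone-mapping operator, leaves out the gamma step, uses hypothetical function names, and assumes a non-empty sample list.

```cpp
#include <cmath>
#include <vector>

// Tone-mapping operator from the slide: 1 - 2^(-exposure * c).
inline float tone_map(float c, float exposure)
{
    return 1.0f - std::exp2(-exposure * c);
}

// Standard order: average the HDR samples first, then tone-map the average.
float resolve_then_tone_map(const std::vector<float>& samples, float exposure)
{
    float sum = 0.0f;
    for (float s : samples) sum += s;
    return tone_map(sum / samples.size(), exposure);
}

// Custom order: tone-map every sample, then average the results.
float tone_map_then_resolve(const std::vector<float>& samples, float exposure)
{
    float sum = 0.0f;
    for (float s : samples) sum += tone_map(s, exposure);
    return sum / samples.size();
}
```

Take an edge pixel with two very bright samples (64.0) and two black ones at exposure 1. Averaging first gives 32.0, which tone-maps to essentially 1.0, so the edge pixel is as bright as a fully covered one and the anti-aliasing gradient is lost. Tone-mapping first gives about 0.5, the intermediate value the edge should have.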

Optimizations

Geometry shader

  • GS optimizations
      - Input/output usually the bottleneck
      - Reduce outputs with frustum and/or backface culling
      - Keep input small by packing data
          - TexCoord could be 2 x 16 bits in a uint
          - Or use for instance asuint(normal.w)
      - Merge to full float4 vectors
          - Don’t do 2 x float2
      - Keep output small
          - Could be faster to trade for some work in PS
          - Pass just position, don’t interpolate both lightVec and viewVec
          - Or even back-project SV_Position.xyz to world space in PS
          - Small output means more work fits within 1024 floats limit
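Packing a texture coordinate into 2 x 16 bits can be sketched on the CPU as below. This is a hedged C++ illustration with hypothetical function names; it assumes coordinates in [0, 1] stored as 16-bit unsigned normalized values, which is only one possible encoding (tiling coordinates would need a different one, such as half floats).

```cpp
#include <cmath>
#include <cstdint>

// Quantize a value in [0, 1] to a 16-bit unsigned normalized integer.
inline std::uint32_t to_unorm16(float v)
{
    const float c = v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
    return static_cast<std::uint32_t>(std::lround(c * 65535.0f));
}

// Pack (u, v) into one 32-bit word: u in the low half, v in the high half.
inline std::uint32_t pack_texcoord(float u, float v)
{
    return to_unorm16(u) | (to_unorm16(v) << 16);
}

// Inverse of pack_texcoord; the shader side would do the same with
// a mask, a shift and two multiplies.
inline void unpack_texcoord(std::uint32_t packed, float& u, float& v)
{
    u = (packed & 0xFFFFu) / 65535.0f;
    v = (packed >> 16) / 65535.0f;
}
```

The round trip loses at most half a quantization step, about 1/131070 per component, while halving the size of the attribute the geometry shader has to read.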

GS frustum and backface culling

    // Transform to clip space
    float4 pos[3];
    pos[0] = mul(mvp, In[0].pos);
    pos[1] = mul(mvp, In[1].pos);
    pos[2] = mul(mvp, In[2].pos);

    // Use frustum culling to improve performance:
    // each component is non-zero if the vertex is outside one side plane
    float4 t0 = saturate(pos[0].xyxy * float4(-1, -1, 1, 1) - pos[0].w);
    float4 t1 = saturate(pos[1].xyxy * float4(-1, -1, 1, 1) - pos[1].w);
    float4 t2 = saturate(pos[2].xyxy * float4(-1, -1, 1, 1) - pos[2].w);
    float4 t = t0 * t1 * t2;

    [branch]
    if (!any(t))
    {
        // Use backface culling to improve performance
        float2 d0 = pos[1].xy * pos[0].w - pos[0].xy * pos[1].w;
        float2 d1 = pos[2].xy * pos[0].w - pos[0].xy * pos[2].w;

        [branch]
        if (d1.x * d0.y > d0.x * d1.y || min(min(pos[0].w, pos[1].w), pos[2].w) < 0.0)
        {
            // Output primitive here...
        }
    }
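The two tests can be replayed on the CPU to see which triangles survive. This C++ sketch mirrors the slide's logic with explicit comparisons; `Float4` and the function names are hypothetical, and, like the slide's code, it only tests the four side planes, not near and far.

```cpp
#include <algorithm>

struct Float4 { float x, y, z, w; };

// True if all three clip-space vertices lie outside the same side plane
// (left, right, bottom or top), so the triangle cannot be visible.
bool outside_frustum_sides(const Float4 p[3])
{
    bool left = true, right = true, bottom = true, top = true;
    for (int i = 0; i < 3; ++i) {
        left   = left   && (p[i].x < -p[i].w);
        right  = right  && (p[i].x >  p[i].w);
        bottom = bottom && (p[i].y < -p[i].w);
        top    = top    && (p[i].y >  p[i].w);
    }
    return left || right || bottom || top;
}

// Winding test in homogeneous coordinates, as on the slide. Triangles with
// any vertex behind the eye (w < 0) are kept, since the test is not
// reliable for them.
bool passes_backface_test(const Float4 p[3])
{
    const float d0x = p[1].x * p[0].w - p[0].x * p[1].w;
    const float d0y = p[1].y * p[0].w - p[0].y * p[1].w;
    const float d1x = p[2].x * p[0].w - p[0].x * p[2].w;
    const float d1y = p[2].y * p[0].w - p[0].y * p[2].w;
    const bool front = d1x * d0y > d0x * d1y;
    const bool behind_eye = std::min({p[0].w, p[1].w, p[2].w}) < 0.0f;
    return front || behind_eye;
}

// A triangle is emitted only if it survives both tests.
bool should_emit(const Float4 p[3])
{
    return !outside_frustum_sides(p) && passes_backface_test(p);
}
```

Swapping two vertices of a triangle that passes flips the sign of the winding test, so the swapped triangle is rejected unless one of its vertices is behind the eye.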

Miscellaneous optimizations

  • Pre-baked constant buffers
      - Don’t update per-material constants in DX9 style
  • PS doesn’t need to return float4 anymore
      - Use float3 if you only care about RGB
      - May reduce instruction count
  • Use GS to reduce draw calls
      - Single pass render-to-cubemap
      - Update multiple render targets per pass

The new shader compiler

  • SM4 shader compiler preserves semantics better
      - This means more responsibility for you guys
      - Be careful about your assumptions
      - Periodically check the resulting assembly
          - D3D10DisassembleShader()
          - Use GPUShaderAnalyzer for performance critical shaders

The new shader compiler

Example:

HLSL code:

    float4 main(float4 t : TEXCOORD0) : SV_Target
    {
        if (t.x > t.y)
            return t.xyzw;
        else
            return t.wzyx;
    }

DX9 assembly:

    add r0.x, -v0.x, v0.y
    cmp oC0, r0.x, v0.wzyx, v0

DX10 assembly:

    lt r0.x, v0.y, v0.x
    if_nz r0.x          // <--- Did you really want a branch here?
      mov o0.xyzw, v0.xyzw
      ret
    else
      mov o0.xyzw, v0.wzyx
      ret
    endif

The new shader compiler

  • Use [branch], [flatten], [unroll] & [loop] to control output code
      - This is not for everyone
      - Poor use could reduce performance
          - Make sure you know what you’re doing
          - Only use if you’re familiar with assembly code
          - Verify that you get the code you expect
      - Always benchmark both options

New DX10 assembly (using [flatten]):

    lt r0.x, v0.y, v0.x
    movc o0.xyzw, r0.xxxx, v0.xyzw, v0.wzyx
    ret

Questions?

emil.persson@amd.com