Cg and Hardware Accelerated Shading
Cem Cebenoyan
Overview
Cg overview
Where we are in hardware today
Physical simulation on the GPU
GeForce FX / Cg demos:
Advanced hair and skin rendering in "Dawn"
Adaptive subdivision surfaces and ambient occlusion shading in "Ogre"
Procedural shading in "Time Machine"
Depth of field and post-processing effects in "Toys"
Order-independent transparency (OIT)
What is Cg?
A high-level language for controlling parts of the graphics pipeline of modern GPUs
Today, this includes the vertex transformation and fragment processing units of the pipeline
Very C-like, only simpler
Native support for vectors, matrices, dot products, reflection vectors, etc.
Similar in scope to RenderMan, but notably different to handle the way hardware accelerators work
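To give a flavor of the language, here is a minimal Cg vertex program (an illustrative sketch, not code from this presentation; the struct and parameter names are made up) showing the native vector and matrix types and intrinsics such as mul and dot:

struct appdata {
    float4 position : POSITION;
    float3 normal   : NORMAL;
};

struct vout {
    float4 hpos  : POSITION;
    float4 color : COLOR0;
};

vout main(appdata IN,
          uniform float4x4 modelViewProj,
          uniform float3   lightDir,       // normalized, in model space
          uniform float4   diffuseColor)
{
    vout OUT;
    OUT.hpos = mul(modelViewProj, IN.position);      // native matrix * vector
    float d  = max(dot(IN.normal, lightDir), 0.0);   // native dot product
    OUT.color = d * diffuseColor;
    return OUT;
}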
Cg Pipeline Overview
A graphics program written in Cg ("C" for Graphics) is compiled and optimized into low-level graphics "assembly code"
Graphics Data Flow
Application -> Vertex Program -> Fragment Program -> Framebuffer
Cg program example:
// Diffuse lighting
float d = dot(normalize(frag.N), normalize(frag.L));
if (d < 0) d = 0;
c = d * f4tex2D(t, frag.uv) * diffuse;
Graphics Hardware Today
Fully programmable vertex processing
Full IEEE 32-bit floating point processing
Native support for mul, dp3, dp4, rsq, pow, sin, cos...
Full support for branching, looping, subroutines
Fully programmable pixel processing
IEEE 32-bit and 16-bit (s10e5) math supported
Same native math ops as vertex, plus texture fetch and derivative instructions
No branching, but >1000 instruction limit
Floating point textures / frame buffers
No blending / filtering yet
~500 MHz core clock
Physical Simulation
Simple cellular automata-like simulations are possible on NV20-class hardware (e.g. Game of Life, Greg James' water simulation, Mark Harris' CML work)
Use textures to represent physical quantities (e.g. displacement, velocity, force) on a regular grid
Multiple texture lookups allow access to neighbouring values
Pixel shader calculates new values, renders results back to texture
Each rendering pass draws a single quad, calculating the next time step in the simulation
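As a sketch of this pattern (illustrative only; the connector structs and the texobjRECT/f4texRECT naming follow the cloth shader later in these slides, and the simple diffusion rule is an assumed stand-in for a real simulation), each fragment reads its own value and its four neighbours from the state texture and writes a relaxed value, one full-screen quad per time step:

struct v2f {
    float2 uv : TEXCOORD0;
};

struct fragout {
    float4 col : COLOR;
};

fragout main(v2f In,
             uniform texobjRECT state_tex,   // current state, one value per texel
             uniform float diffusion)        // 0..1 blend toward the neighbour average
{
    fragout Out;
    float2 s = In.uv;

    // centre value and its four neighbours
    float c  = f4texRECT(state_tex, s).x;
    float n0 = f4texRECT(state_tex, s + float2( 1.0,  0.0)).x;
    float n1 = f4texRECT(state_tex, s + float2(-1.0,  0.0)).x;
    float n2 = f4texRECT(state_tex, s + float2( 0.0,  1.0)).x;
    float n3 = f4texRECT(state_tex, s + float2( 0.0, -1.0)).x;

    // toy "simulation": relax the centre toward the neighbour average
    float v = lerp(c, (n0 + n1 + n2 + n3) * 0.25, diffusion);

    Out.col = float4(v, v, v, 1.0);
    return Out;
}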
Physical Simulation
Problem: 8-bit precision on NV20 is not enough; causes drifting and stability problems
Float precision on NV30 allows GPU physics to match CPU accuracy
New fragment programming model (longer programs, flexible dependent texture reads) allows much more interesting simulations
Example: Cloth Simulation Shader
Uses Verlet integration (see Jakobsen, GDC 2001)
Avoids storing explicit velocity:
newx = x + (x - oldx)*damping + a*dt*dt
Not always accurate, but stable!
Store current and previous position of each particle in 2 RGB float textures
Fragment program calculates new position, writes result to float buffer
Copy float buffer back to texture for next iteration (could use render-to-texture instead)
Swap current and previous textures
Cloth Shader Demo
Cloth Simulation Shader
2 passes:
1. Perform integration
2. Apply constraints: floor constraint, sphere constraint, distance constraints between particles
Read back float frame buffer using glReadPixels
Draw particles and constraints
Cloth Simulation Cg Code (1st pass)

void Integrate(inout float3 x, float3 oldx, float3 a, float timestep2, float damping)
{
    x = x + damping*(x - oldx) + a*timestep2;
}

myFragout main(v2fconnector In,
               uniform texobjRECT x_tex,
               uniform texobjRECT ox_tex,
               uniform float timestep,
               uniform float damping,
               uniform float3 gravity)
{
    myFragout Out;
    float2 s = In.TEX0.xy;

    // get current and previous position
    float3 x = f3texRECT(x_tex, s);
    float3 oldx = f3texRECT(ox_tex, s);

    // move the particle
    Integrate(x, oldx, gravity, timestep*timestep, damping);

    Out.COL.xyz = x;
    return Out;
}
Cloth Simulation Cg Code (2nd pass)

// constrain particle to be a fixed distance from another particle
void DistanceConstraint(float3 x, inout float3 newx, float3 x2, float restlength, float stiffness)
{
    float3 delta = x2 - x;
    float deltalength = length(delta);
    float diff = (deltalength - restlength) / deltalength;
    newx = newx + delta*stiffness*diff;
}

// constrain particle to be outside sphere
void SphereConstraint(inout float3 x, float3 center, float r)
{
    float3 delta = x - center;
    float dist = length(delta);
    if (dist < r) {
        x = center + delta*(r / dist);
    }
}

// constrain particle to be above floor
void FloorConstraint(inout float3 x, float level)
{
    if (x.y < level) {
        x.y = level;
    }
}
Cloth Simulation Cg Code (cont.)

myFragout main(v2fconnector In,
               uniform texobjRECT x_tex,
               uniform texobjRECT ox_tex,
               uniform float dist,
               uniform float stiffness)
{
    myFragout Out;
    float2 s = In.TEX0.xy;

    // get current position
    float3 x = f3texRECT(x_tex, s);

    // satisfy constraints
    FloorConstraint(x, 0.0f);
    SphereConstraint(x, float3(0.0, 2.0, 0.0), 1.0f);

    // get positions of neighbouring particles
    float3 x1 = f3texRECT(x_tex, s + float2( 1.0,  0.0));
    float3 x2 = f3texRECT(x_tex, s + float2(-1.0,  0.0));
    float3 x3 = f3texRECT(x_tex, s + float2( 0.0,  1.0));
    float3 x4 = f3texRECT(x_tex, s + float2( 0.0, -1.0));

    // apply distance constraints
    float3 newx = x;
    if (s.x < 31) DistanceConstraint(x, newx, x1, dist, stiffness);
    if (s.x > 0)  DistanceConstraint(x, newx, x2, dist, stiffness);
    if (s.y < 31) DistanceConstraint(x, newx, x3, dist, stiffness);
    if (s.y > 0)  DistanceConstraint(x, newx, x4, dist, stiffness);

    Out.COL.xyz = newx;
    return Out;
}
Physical Simulation – Future Work
Limitation: only one destination buffer, so we can only modify the position of one particle at a time
Could use pack instructions to store 2 half4 vectors (8 half floats) in a 128-bit float buffer
Could also use additional textures to encode particle masses, stiffness, constraints between arbitrary particles (rigid bodies)
"Float buffer to vertex array" extension offers the possibility of directly interpreting results as geometry without any CPU intervention!
Collision detection with meshes is hard
Demos Introduction
Developed 4 demos for the launch of GeForce FX:
"Dawn"
"Toys"
"Time Machine"
"Ogre" (Spellcraft Studio)
Characters Look Better With Hair
Rendering Hair
Two options: 1) volumetric (texture), 2) geometric (lines)
We have used volumetric approximations (shells and fins) in the past (e.g. Wolfman demo); this doesn't work well for long hair
We considered using textured ribbons (popular in Japanese video games), but alpha sorting is a pain
Performance of GeForce FX finally lets us render hair as geometry
Rendering Hair as Lines
Each hair strand is rendered as a line strip (2-20 vertices, depending on curvature)
Problem: lines are a minimum of 1 pixel thick, regardless of distance from the camera
Not possible to change line width per vertex
Can use camera-facing triangle strips, but these require twice the number of vertices and have aliasing problems
Anti-Aliasing
Two methods of anti-aliasing lines in OpenGL:
GL_LINE_SMOOTH: high quality, but requires blending and sorting geometry
GL_MULTISAMPLE: usually lower quality, but order independent
We used multisample anti-aliasing with "alpha to coverage" mode
By fading alpha to zero at the ends of hairs, coverage and apparent thickness decrease
SAMPLE_ALPHA_TO_COVERAGE_ARB is part of the ARB_multisample extension
Hair Without Antialiasing
Hair With Multisample Antialiasing
Hair Shading
Hair is lit with a simple anisotropic shader (Heidrich and Seidel model)
Low specular exponent, dim highlight looks best
Black hair = no shadows! Self-shadowing hair is hard (approaches: deep shadow maps, opacity shadow maps)
Top of head is painted black to avoid skin showing through
We also had a very short hair style, which helps
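For reference, a minimal sketch of Heidrich-Seidel style line lighting in Cg (an assumed reconstruction, not the demo's shader; all names are illustrative). The hair tangent T stands in for the surface normal, and the diffuse and specular terms are built from sines and cosines of the angles between T and the light and view directions:

// Heidrich-Seidel style lighting for line primitives (sketch).
// T = hair tangent, L = direction to light, V = direction to viewer (all normalized).
float3 hairLighting(float3 T, float3 L, float3 V,
                    float3 hairColor, float3 specColor, float shininess)
{
    float TdotL = dot(T, L);
    float TdotV = dot(T, V);
    float sinTL = sqrt(saturate(1.0 - TdotL*TdotL));
    float sinTV = sqrt(saturate(1.0 - TdotV*TdotV));

    // diffuse: sine of the angle between tangent and light
    float diffuse = sinTL;

    // specular: peaks when the view lies on the light's reflection cone;
    // a low exponent gives the dim, broad highlight mentioned above
    float specular = pow(saturate(sinTL*sinTV - TdotL*TdotV), shininess);

    return hairColor*diffuse + specColor*specular;
}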
Hair Styling is Important
Hair Styling
Difficult to position 50,000 individual curves by hand
Typical solution is to define a small number of control hairs, which are then interpolated across the surface to produce render hairs
We developed a custom tool for hair styling
Commercial hair applications have poor styling tools and are not designed for real-time output
Hair Styling
Scalp is defined as a polygon mesh
Hairs are represented as cubic Bezier curves
Control hairs are defined for each vertex
Render hairs are interpolated across triangles using barycentric coordinates
Number of generated hairs is based on triangle area to maintain constant density
Can add noise to interpolated hairs to add variation
Hair Styling Tool
Provides a simple UI for styling hair: combing tools, lengthen / shorten, straighten / mess up
Uses a simple physics simulation based on Verlet integration (Jakobsen, GDC 2001)
Physics is run on control hairs only
Collision detection done with ellipsoids
Dawn Demo
Show demo
The Ogre Demo
A real-time preview of Spellcraft Studio's in-production short movie "Yeah!"
Created in 3D Studio MAX
Used Character Studio for animation, plus the Stitch plug-in for cloth simulation
Original movie was rendered in Brazil with global illumination
Available at: www.yeahthemovie.de
Our aim was to recreate the original as closely as possible, in real time
What are Subdivision Surfaces?
A curved surface defined as the limit of repeated subdivision steps on a polygonal model
Subdivision rules create new vertices, edges, and faces based on neighboring features
We used the Catmull-Clark subdivision scheme (as used by Pixar)
MAX, Maya, Softimage, and Lightwave all support forms of subdivision surfaces
Realtime Adaptive Tessellation
Brute-force subdivision is expensive: it generates lots of polygons where they aren't needed, and the number of polygons increases exponentially with each subdivision
Adaptive tessellation subdivides patches based on a screen-space patch size test
Guaranteed crack-free
Generates normals and tangents on the fly
Culls off-screen and back-facing patches
CPU-based (uses SSE where possible)
Control Mesh vs. Subdivided Mesh
(Images: control mesh, 4000 faces; subdivided mesh, 17,000 triangles)
Control Mesh Detail
Subdivided Mesh Detail
Why Use Subdivision Surfaces?
Content: characters were modeled with subdivision in mind (using the 3DS Max "MeshSmooth/NURMS" modifier)
Scalability: wanted the demo to be scalable to lower-end hardware
"Infinite" detail: can zoom in forever without seeing hard edges
Animation compression: just store the low-res control mesh for each frame
May be accelerated on future GPUs
Disadvantages of Realtime Subdivision
CPU intensive, but we might as well use the CPU for something!
View dependent: requires re-tessellation for shadow map passes
Mesh topology changes from frame to frame, which makes motion blur difficult
Ambient Occlusion Shading
Helps simulate the global illumination "look" of the original movie
Self-occlusion is the degree to which an object shadows itself: "How much of the sky can I see from this point?"
Simulates a large spherical light surrounding the scene
Popular in production rendering – Pearl Harbor (ILM), Stuart Little 2 (Sony)
Occlusion (diagram)
How To Calculate Occlusion
Shoot rays from the surface in random directions over the hemisphere (centered around the normal)
The percentage of rays that hit something is the occlusion amount
Can also keep track of the average of un-occluded directions – the "bent normal"
Some RenderMan-compliant renderers (e.g. Entropy) have a built-in occlusion() function that will do this
We can't trace rays using graphics hardware (yet), so we pre-calculate it!
Occlusion Baking Tool
Uses a ray-tracing engine to calculate occlusion values for each vertex in the control mesh
We used 128 rays / vertex
Stored as a floating point scalar for each vertex and each frame of the animation
Calculation took around 5 hours for 1000 frames
Subdivision code interpolates occlusion values using cubic interpolation
Used as the ambient term in the shader
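As an illustration of that last point, a minimal Cg sketch (names and the exact combination are assumptions, not the demo's shader) of how the baked, interpolated occlusion value can drive the ambient term:

// Baked ambient occlusion driving the ambient term (sketch).
// 'accessibility' is the interpolated per-vertex value: the fraction of
// unblocked rays (1 = fully open to the "sky", 0 = fully occluded).
float4 shadeWithOcclusion(float3 N, float3 L,
                          float accessibility,
                          float4 ambientColor,
                          float4 diffuseColor)
{
    float d = max(dot(normalize(N), normalize(L)), 0.0);
    return accessibility * ambientColor + d * diffuseColor;
}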
Ogre Demo
Show demo
Procedural Shading in Time Machine
Goals for the Time Machine demo
Overview of effects: metallic paint, wood, chrome
Techniques used: faux-BRDF reflection, reveal and dXdT maps, normal and DuDv scaling, dynamic bump mapping
Performance issues
Summary
Why do Time Machine?
GPUs are much more programmable: thanks to generalized dependent texturing, more active textures (16 on GeForce FX), and (for our purposes) unlimited blend operations, high-quality animation is possible per-pixel
GeForce FX has >2x the performance of GeForce 4 Ti: executing lots of per-pixel operations isn't just possible, it can be done in real time
Previous per-pixel animation was limited: animated textures, PDE / CA effects (see Mark Harris' talk at GDC)
Goal: full-scene per-pixel animation
Why do Time Machine? (continued)
Neglected pick-up trucks demonstrate a wide variety of surface effects, with intricate transitions and boundaries:
Paint oxidizing, bleaching and rusting
Vinyl cracking
Wood splintering and fading
And more…
Not possible with just per-vertex animation!
Time Machine Effects: Paint
Effects: specular color shift, bubbling, oxidation, rusting
60 pixel shader instructions, 11 textures:
Paint Color, Rust LUT, Shadow map, Spotlight mask, Light Rust Color*, Deep Rust Color*, Ambient Light*, Bubble Height*, Reveal Time*, New Environment*, Old Environment* (* = artist created)
Effects (cont'd): Wood, Chrome, Glass
Wood fades and cracks: 31 instructions, 6 textures
Chrome welts and corrodes: 23 instructions, 8 textures
Headlights fog: 24 instructions, 4 textures
Procedural or Not?
Procedural shading normally replaces textures with functions of several variables; Time Machine uses textures liberally, and the only parameter to our shaders is time
However, turning everything into math is expensive
Time Machine's solution: give the artist direct control (textures) over the final image, and use functions to control transitions
Techniques: Faux-BRDF Reflection
Many automotive paints exhibit a color shift as a function of the light and viewer directions
This effect has been approximated with analytic BRDFs (Lafortune's cosine lobes) and measured by Cornell University's graphics lab
BRDF factorization [McCool, Rusinkiewicz] is one method to use this data on graphics hardware: an efficient representation with multiple 2D textures that closely approximates the original BRDFs
But it is not necessarily the most efficient method for automotive paint, and it is not artist-controllable: reflection intensity is uninteresting (largely Blinn), and rotated/projected axes are hard to visualize
Techniques: Faux-BRDF Reflection 2
Our solution: project BRDF values onto a single 2D texture, and factor out the intensity
Compute intensity in real time, using (N.H)^s
Texture varies slowly, so it can be low-res (64x64); anti-aliasing the texture fixes laser noise at grazing angles
For automotive paints, N.L and N.H work well for axes
Not physically accurate, but fast and high-quality; easy for artists to tweak
(Images: Dupont Cayman lacquer, Mystique lacquer)
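A minimal Cg sketch of the idea (texture contents, names, and exactly how the color lookup and intensity are combined are assumptions, not the shipping shader): the color shift comes from a 2D lookup indexed by N.L and N.H, and the intensity is factored out and computed analytically with a Blinn-style (N.H)^s term:

// Faux-BRDF paint reflection (sketch).
// brdfColorTex: low-res (e.g. 64x64) 2D texture of paint colors indexed by (N.L, N.H).
float4 fauxBRDF(float3 N, float3 L, float3 V,
                sampler2D brdfColorTex,
                float specExponent)
{
    float3 H = normalize(L + V);
    float NdotL = saturate(dot(N, L));
    float NdotH = saturate(dot(N, H));

    // slowly varying color shift from the 2D lookup
    float3 shiftColor = tex2D(brdfColorTex, float2(NdotL, NdotH)).rgb;

    // intensity factored out and computed analytically (largely Blinn)
    float intensity = pow(NdotH, specExponent);

    return float4(shiftColor * intensity, 1.0);
}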
Techniques: Reveal and dXdT maps
Artists do not want to paint hundreds of frames of animation for a surface transition (e.g. paint -> rust)
Ultimately, the effect is just a conditional:
if (time > n) color = rust; else color = paint;
Or an interpolation between a start and end point:
paint = interpolate(paint, bleach, s*(time - n));
So all intermediate values can be generated
For continuous effects, use dXdT (velocity) maps
Can be stored in the alpha of a DXT5 texture
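Putting the bullets above into a small Cg function (a sketch under assumed texture and parameter names; the real Time Machine shaders are far larger): each texel's reveal value n says when its transition starts, and a rate s controls how quickly it blends:

// Reveal-map transition (sketch): a per-texel start time plus the global clock
// drive the blend between the "before" and "after" appearance.
float4 revealTransition(float2 uv,
                        sampler2D paintTex,    // "before" appearance
                        sampler2D rustTex,     // "after" appearance
                        sampler2D revealTex,   // per-texel reveal time n (in alpha)
                        float time,            // current demo time
                        float rate)            // s: how fast the transition runs
{
    float4 paint = tex2D(paintTex, uv);
    float4 rust  = tex2D(rustTex, uv);
    float  n     = tex2D(revealTex, uv).a;

    // 0 before the reveal time, ramping to 1 afterwards
    float blend = saturate(rate * (time - n));
    return lerp(paint, rust, blend);
}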
Performance Concerns
Executing large shaders is expensive
First rule of optimization: keep inner loops tight; shaders are the inner loop, run >1M times per frame
But graphics cards have many parallel units: vertex, fragment, and texture units
Modern GPUs do a great job of hiding texture latency
Bandwidth is unimportant in long shaders: Time Machine runs at virtually the same framerate on a 500/500 GeForce FX as it does on a 500/400 or 500/550
So not using textures is wasting performance!
Performance Concerns…
What makes a good texture?
Saves math operations
8-bit (RGBA) or 16-bit (HILO) precision is sufficient
Depends on a limited number of variables
Textures we used: interpolating between light and dark rust layers
Required computing the difference between the light and dark layers' reveal maps and expanding to [0..1]
The function depended on the current time and reveal time, and was used to blend two texture maps
Performance Concerns…
Textures used, continued: surround maps
Recomputing the normal requires knowing the heights of 4 texels: (s-1, t), (s+1, t), (s, t+1) and (s, t-1)
Each height is only one 8-bit component
Instead of 4 dependent fetches, we can pack them all into 1:
S(s, t) = [ H(s-1, t), H(s+1, t), H(s, t-1), H(s, t+1) ]
Saved 4 math ops and 3 texture fetches, plus shuffle logic
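A minimal Cg sketch of using such a surround map to rebuild a bump-mapped normal (the packing order matches the S(s, t) layout above; the scale factor and function name are assumptions, and the demo's exact code differs):

// Surround-map normal reconstruction (sketch).
// surroundTex packs the four neighbour heights as
// RGBA = [ H(s-1,t), H(s+1,t), H(s,t-1), H(s,t+1) ], so one fetch replaces four.
float3 bumpNormalFromSurround(float2 uv,
                              sampler2D surroundTex,
                              float bumpScale)     // apparent bump strength
{
    float4 h = tex2D(surroundTex, uv);

    // central differences in s and t from the packed neighbours
    float dHds = h.y - h.x;   // H(s+1,t) - H(s-1,t)
    float dHdt = h.w - h.z;   // H(s,t+1) - H(s,t-1)

    // tangent-space normal from the height gradient
    return normalize(float3(-bumpScale * dHds, -bumpScale * dHdt, 1.0));
}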
Time Machine Demo
Show demo
Toys Demo - Simple Depth of Field
Render scene to color and depth textures
Generate mipmaps for the color texture
Render a full-screen quad with the "simpledof" shader:
Depth = tex(depthtex, texcoord)
CoC (circle of confusion) = abs(depth*scale + bias)
Color = txd(colortex, texcoord, (coc, 0), (0, coc))
Scale and bias are derived from the camera:
Scale = (aperture * focaldistance * planeinfocus * (zfar - znear)) / ((planeinfocus - focaldistance) * znear * zfar)
Bias = (aperture * focaldistance * (znear - planeinfocus)) / ((planeinfocus * focaldistance) * znear)
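A minimal Cg sketch of that "simpledof" pass, reconstructed from the bullets above (not the demo's shader; names are assumptions). The txd instruction corresponds to a texture fetch with explicit derivatives, so a larger circle of confusion pushes the lookup to a blurrier mip level:

// "simpledof" fragment pass (sketch): blur grows with distance from the plane in focus.
float4 simpleDOF(float2 texcoord,
                 sampler2D colortex,   // scene color, with mipmaps
                 sampler2D depthtex,   // scene depth
                 float scale,
                 float bias)           // derived from the camera as above
{
    float depth = tex2D(depthtex, texcoord).x;

    // circle of confusion in texture space
    float coc = abs(depth * scale + bias);

    // derivative form of the fetch: a larger coc selects a blurrier mip level
    return tex2D(colortex, texcoord, float2(coc, 0.0), float2(0.0, coc));
}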
Artifacts: Bilinear Interpolation / Magnification
Bilinear artifacts in the extreme back- and near-ground
Solution: multiple jittered samples
Even without jittering, a 4- or 5-sample rotated-grid pattern brings smaller artifacts under control
Larger artifacts need jittered samples, and more of them
Then it's just a tradeoff between noise from the jittering and bilinear interpolation artifacts (and of course the quality/performance tradeoff with the number of samples)
Noise vs. Interpolation Artifacts
(Images: with noise, without noise)
Artifacts: Depth Discontinuities
Near-ground (blurry) pixels don't properly blend out over the top of mid-ground (sharp) pixels
Easy solution: cheat! Either don't let objects get too far in front of the plane in focus, or blur everything a little more when they do – soft edges help hide this fairly well
Depth Discontinuities
Fun With Color Matrices
Since we're already rendering to a full-screen texture, it's easy to muck with the final image
Operations are just rotations / scales in RGB space: color (hue) shift, saturation, brightness, contrast
These are all matrices, so compose them together and apply them as 3 dot products in the shader
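A minimal Cg sketch of applying a pre-composed color matrix in the post-processing shader (the matrix is composed on the CPU; the names here are assumptions):

// Post-process color matrix (sketch): hue shift, saturation, brightness and
// contrast composed into one 3x3 matrix on the CPU, applied as 3 dot products.
float4 colorMatrixPass(float2 texcoord,
                       sampler2D scenetex,
                       float3x3 colorMatrix)
{
    float3 c = tex2D(scenetex, texcoord).rgb;

    float3 o;
    o.r = dot(colorMatrix[0], c);
    o.g = dot(colorMatrix[1], c);
    o.b = dot(colorMatrix[2], c);
    // equivalently: o = mul(colorMatrix, c);

    return float4(o, 1.0);
}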
Original Image
Colorshifted Image
Black and White Image
Toys Demo
Show demo
Order Independent Transparency
Why is correct transparency hard?
Depth peeling
Two depth buffers
Enter the shadow map
Precision / invariance issues
Depth replace texture shader
Blending the layers
Other applications
Can't just glEnable(GL_BLEND)…
(Images: good transparency with OIT, bad transparency without OIT)
Why is correct transparency hard?
Most hardware does object-order rendering, but correct transparency requires sorted traversal
Have to render polygons in sorted order: not very convenient, polygons can't intersect, and it's a lot of extra application work
Especially difficult for dynamic scene databases
Depth Peeling
The algorithm uses an "implicit sort" to extract multiple depth layers
The first rendering pass finds the front-most fragment color/depth
Each successive pass finds (extracts) the fragment color/depth for the next-nearest fragment on a per-pixel basis
Use dual depth buffers to compare the previous nearest fragment with the current one
The second "depth buffer" is used for a read-only comparison from a texture [more on this later]
(Images: layer 0, layer 1, layer 2, layer 3)
Cross-section view of depth peeling
Depth peeling strips away depth layers with each successive pass. The frames above show the frontmost (leftmost) surfaces as bold black lines, hidden surfaces as thin black lines, and "peeled away" surfaces as light grey lines.
Dual Depth Buffer Pseudo-code

for (i = 0; i < num_passes; i++)
{
    clear color buffer

    depth unit 0:
        if (i == 0) { disable depth test } else { enable depth test }
        bind depth buffer (i % 2)
        disable depth writes          /* read-only depth test */
        set depth func to GREATER

    depth unit 1:
        bind depth buffer ((i+1) % 2)
        clear depth buffer
        enable depth writes
        enable depth test
        set depth func to LESS

    render scene
    save color buffer RGBA as layer i
}
Implementation
There is no "dual depth buffer" extension to OpenGL, so what can we do?
We just need one depth test with a writeable depth buffer – the other can be read-only
Shadow mapping is a read-only depth test!
The depth test can have an arbitrary camera location (other interesting uses for clip volumes)
Fast copies make this proposition reasonable; copies will be unnecessary in the future…
Precision / Invariance Issues
Using shadow mapping hardware introduces precision and invariance issues
Depth rasterization usually just needs to match the output depth buffer precision, and requires no perspective correction
Texture hardware requires perspective correction and projection at high precision
Making things match would be difficult without the DEPTH_REPLACE texture shader: it computes with texture hardware at texture precision, and solves the invariance problems at some extra expense
Will be cheaper in the future…
(Images: 1 layer, 2 layers, 3 layers, 4 layers)
Compositing
Each time we peel, we capture the RGBA; then, as a final step, we blend all the layers together from back to front
Opaque fragments completely overwrite previous transparent ones
Conclusions
Results are nice!
Get correct transparency without invasive changes to internal data structures; can be "bolted on" to existing CAD/CAM apps
Requires n scene traversals for n correctly sorted depth layers; n = 4 is often quite satisfactory (see previous slide)
Shadow maps are for more than shadows!
Questions?
cem@nvidia.com
http://developer.nvidia.com/cg/
http://www.cgshaders.org/