GH 05 KD-Tree Acceleration Structures for a GPU Raytracer


























GH 05
KD-Tree Acceleration Structures for a GPU Raytracer
Tim Foley, Jeremy Sugerman
Stanford University

Motivation
• Accelerated raytracing
  – On commodity HW
  – Production rendering
  – Real-time applications?
• Performance trend
  – 9800 XT: 170 M ray-triangle intersects/s
  – X800 XT PE: 350 M ray-triangle intersects/s

GPU Raytracing
• Promising early results
  – Simple scenes
• Uniform grid
  – Problems with complex scenes
• Hierarchical accelerator (kd-tree)
  – Improve scalability

Outline
• Background
  – GPU Raytracing
  – KD-Tree Algorithm
• KD-Restart, KD-Backtrack
• Results
• Future Work
Background
• Ray Engine [Carr et al. 2002]
  – Parallel ray-triangle intersection
  – Host controls culling
• [Purcell et al. 2002]
  – Entire raytracing pipeline
  – Many rays required for efficiency
  – Uniform Grid

Why not KD-Tree?
• Uniform grid acceleration structure
  – Regular structure = efficient traversal
  – Regular structure = poor partitioning
• KD-Trees
  – Adapt to scene complexity
  – Compact storage, efficient traversal
  – “Best” for CPU raytracing [Havran 2000]

KD-Tree
[Figure: example kd-tree with split planes X, Y, Z and leaves A, B, C, D; the ray interval is labeled from tmin to tmax]

KD-Tree Traversal
[Figure: traversal of the example kd-tree with split planes X, Y, Z and leaves A, B, C, D]

Per-Fragment Stacks
• Parallel (per-ray) push
  – No indexed write in fragment program
• Per-ray stack storage
• [Ernst et al. 2004]
  – Emulate push with extra passes
  – Impractical, slow
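For reference, the per-ray stack discussed above is the one used by the conventional CPU traversal. The following is a minimal sketch of that traversal, not code from the talk: the tuple node layout, the 2D example tree, and the function names are assumptions made for illustration, and the leaf step only records the visit order instead of intersecting triangles.

```python
# Assumed node layout (illustrative only):
#   ("split", axis, pos, left, right)   or   ("leaf", name)

def near_far(node, o, d):
    """Order the two children of a split node front-to-back along the ray."""
    _, axis, pos, left, right = node
    left_first = o[axis] < pos or (o[axis] == pos and d[axis] <= 0)
    return (left, right) if left_first else (right, left)

def traverse_with_stack(root, o, d, t_min, t_max):
    """Return leaf names in the order the ray o + t*d visits them."""
    order = []
    stack = [(root, t_min, t_max)]
    while stack:
        node, t0, t1 = stack.pop()
        while node[0] == "split":
            axis, pos = node[1], node[2]
            near, far = near_far(node, o, d)
            if d[axis] == 0:            # ray parallel to the split plane
                node = near
                continue
            t_split = (pos - o[axis]) / d[axis]
            if t_split >= t1 or t_split <= 0:
                node = near             # only the near child is crossed
            elif t_split <= t0:
                node = far              # only the far child is crossed
            else:
                # Both children are crossed: this push is the per-ray
                # indexed write that a fragment program cannot perform.
                stack.append((far, t_split, t1))
                node, t1 = near, t_split
        order.append(node[1])           # a real tracer intersects triangles here
    return order

# Hypothetical 2D scene on [0,4] x [0,4]: root splits x at 2,
# left child splits y at 2 (A below, B above), right splits y at 1 (C, D).
tree = ("split", 0, 2.0,
        ("split", 1, 2.0, ("leaf", "A"), ("leaf", "B")),
        ("split", 1, 1.0, ("leaf", "C"), ("leaf", "D")))

print(traverse_with_stack(tree, (0.0, 0.5), (1.0, 1.0), 0.0, 3.5))
```

The `stack.append` line is the only place that needs writable per-ray storage, which is why removing it is the focus of the stackless variants.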

Our Contribution
• Stackless kd-tree traversal algorithms
  – KD-Restart
  – KD-Backtrack

Observation
[Figure: example kd-tree with split planes X, Y, Z and leaves A, B, C, D]
Current leaf’s tmax = Next leaf’s tmin

KD-Restart
[Figure: example kd-tree with split planes X, Y, Z and leaves A, B, C, D]
• Standard traversal
  – Omit stack operations
  – Proceed to 1st leaf
• If no intersection
  – Advance (tmin, tmax)
  – Restart from root
• Proceed to next leaf
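The steps above can be sketched in a few lines. This is an illustrative reading of the slide, not the authors' Brook kernels: the tuple node layout, the `hit_in_leaf` callback, and the `steps` counter are assumptions added for the example. Note the `<=` in the far-child test, which keeps a restarted ray from re-entering the leaf it just left.

```python
# Assumed node layout (illustrative only):
#   ("split", axis, pos, left, right)   or   ("leaf", name)

def kd_restart(root, o, d, scene_tmin, scene_tmax, hit_in_leaf):
    """Stackless traversal: after each missed leaf, restart from the root.

    hit_in_leaf(name, t_min, t_max) is a hypothetical callback that returns
    a hit distance or None. Returns (hit, visited leaf names, split nodes visited).
    """
    visited, steps = [], 0
    t_min = t_max = scene_tmin
    while t_max < scene_tmax:
        node = root
        t_min, t_max = t_max, scene_tmax    # advance (tmin, tmax)
        while node[0] == "split":
            steps += 1
            _, axis, pos, left, right = node
            left_first = o[axis] < pos or (o[axis] == pos and d[axis] <= 0)
            near, far = (left, right) if left_first else (right, left)
            if d[axis] == 0:                # ray parallel to the split plane
                node = near
                continue
            t_split = (pos - o[axis]) / d[axis]
            if t_split >= t_max or t_split <= 0:
                node = near
            elif t_split <= t_min:
                node = far
            else:
                node, t_max = near, t_split  # no push: the far child is found
                                             # again after the restart
        visited.append(node[1])
        hit = hit_in_leaf(node[1], t_min, t_max)
        if hit is not None:
            return hit, visited, steps
    return None, visited, steps

# Hypothetical 2D scene on [0,4] x [0,4].
tree = ("split", 0, 2.0,
        ("split", 1, 2.0, ("leaf", "A"), ("leaf", "B")),
        ("split", 1, 1.0, ("leaf", "C"), ("leaf", "D")))

miss = lambda name, t0, t1: None
print(kd_restart(tree, (0.0, 0.5), (1.0, 1.0), 0.0, 3.5, miss))
```

In this small example the ray visits three leaves and touches six split nodes, two per restart, where a stack-based traversal would touch three.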

KD-Restart
• Restart traversal after each leaf
  – m leaves
  – Average depth d
  – Cost O(m*d)
• Balanced tree of n nodes
  – Upper bound: O(n log(n))
• Standard algorithm: O(n)
  – Expected: O(log(n))
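One way to see where the O(n log(n)) bound comes from, assuming a perfectly balanced binary tree (an assumption made here for the arithmetic; the slide only says "balanced"): such a tree with n nodes has (n+1)/2 leaves and depth below log2(n+1), and in the worst case the ray visits every leaf.

$$
m \le \frac{n+1}{2}, \qquad d \le \log_2(n+1)
\quad\Longrightarrow\quad
m \cdot d \;\le\; \frac{n+1}{2}\,\log_2(n+1) \;=\; O(n \log n)
$$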

Observation
[Figure: example kd-tree with split planes X, Y, Z and leaves A, B, C, D]
Ancestor of A is parent of Z

KD-Backtrack
[Figure: example kd-tree with split planes X, Y, Z and leaves A, B, C, D]
• If no intersection
  – Advance (tmin, tmax)
  – Start backtracking
• If node intersects (tmin, tmax)
  – Resume traversal
• Proceed to next leaf

KD-Backtrack
• Backtrack after leaf
  – Revisits previous nodes
  – At most twice: from left, right
• Within constant factor of standard traversal
  – Upper bound: O(n)
  – Expected: O(log(n))
• Requires additional storage
  – Parent pointers
  – Bounding boxes for internal nodes
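A minimal sketch of the backtracking idea follows, using the two pieces of extra storage listed above (parent pointers and per-node bounding boxes). The `Node` class, the `clip` helper, and the `hit_in_leaf` callback are hypothetical names introduced for this example; the authors' GPU implementation is not shown in the slides and may differ.

```python
class Node:
    """kd-tree node with a parent pointer and bounding box (lo, hi).
    Field names are illustrative assumptions."""
    def __init__(self, lo, hi, axis=None, pos=None, left=None, right=None, name=None):
        self.lo, self.hi = lo, hi
        self.axis, self.pos = axis, pos
        self.left, self.right = left, right
        self.name = name                  # set only on leaves
        self.parent = None
        for child in (left, right):
            if child is not None:
                child.parent = self

def clip(node, o, d, t_min, t_max):
    """Clip (t_min, t_max) to the node's box; None if nothing remains."""
    for a in range(len(o)):
        if d[a] == 0:
            if not (node.lo[a] <= o[a] <= node.hi[a]):
                return None
            continue
        t_a = (node.lo[a] - o[a]) / d[a]
        t_b = (node.hi[a] - o[a]) / d[a]
        t_min = max(t_min, min(t_a, t_b))
        t_max = min(t_max, max(t_a, t_b))
    return (t_min, t_max) if t_min < t_max else None

def kd_backtrack(root, o, d, scene_tmin, scene_tmax, hit_in_leaf):
    """Stackless traversal that climbs parent pointers instead of restarting."""
    visited = []
    node, t_min, t_max = root, scene_tmin, scene_tmax
    while True:
        while node.name is None:          # descend without a stack
            a = node.axis
            left_first = o[a] < node.pos or (o[a] == node.pos and d[a] <= 0)
            near, far = (node.left, node.right) if left_first else (node.right, node.left)
            if d[a] == 0:
                node = near
                continue
            t_split = (node.pos - o[a]) / d[a]
            if t_split >= t_max or t_split <= 0:
                node = near
            elif t_split <= t_min:
                node = far
            else:
                node, t_max = near, t_split
        visited.append(node.name)
        hit = hit_in_leaf(node.name, t_min, t_max)
        if hit is not None:
            return hit, visited
        t_min, t_max = t_max, scene_tmax  # advance (tmin, tmax)
        span = None
        while span is None:               # backtrack until a node's box
            node = node.parent            # overlaps the remaining interval
            if node is None:
                return None, visited
            span = clip(node, o, d, t_min, t_max)
        t_min, t_max = span               # resume traversal from this node

# Hypothetical 2D scene on [0,4] x [0,4].
A = Node((0, 0), (2, 2), name="A")
B = Node((0, 2), (2, 4), name="B")
C = Node((2, 0), (4, 1), name="C")
D = Node((2, 1), (4, 4), name="D")
left = Node((0, 0), (2, 4), axis=1, pos=2.0, left=A, right=B)
right = Node((2, 0), (4, 4), axis=1, pos=1.0, left=C, right=D)
root = Node((0, 0), (4, 4), axis=0, pos=2.0, left=left, right=right)

miss = lambda name, t0, t1: None
print(kd_backtrack(root, (0.0, 0.5), (1.0, 1.0), 0.0, 3.5, miss))
```

In the example, leaving leaf A only requires climbing one level before the traversal resumes into B, whereas a restart would begin again at the root.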
Implementation
• Built GPU raytracer in Brook [Buck et al.]
• 4 intersection schemes:
  – Brute Force
  – Uniform Grid
  – KD-Restart
  – KD-Backtrack

Scenes
• Cornell Box: 32 triangles
• Stanford Bunny: 69451 triangles
• BART Robots: 71708 triangles
• BART Kitchen: 110561 triangles

Results
[Chart: relative speedup over brute-force intersection for the Box, Bunny, Robots, and Kitchen scenes; the only surviving data label is 12.9]

Results

| | Traverse | Backtrack | Intersect |
|---|---|---|---|
| Ideal | 10.86 M | 0 | 5.91 M |
| Restart | 21.80 M | 0 | 5.91 M |
| Backtrack | 10.86 M | 7.78 M | 5.91 M |

Rays in each state throughout traversal.
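To put the counts above in proportion, the short calculation below compares the non-intersection work of each scheme against the "Ideal" row. Treating traverse and backtrack counts as comparable units of work is an assumption made here, not something the slide states.

```python
# Counts copied from the results above, in millions of rays.
ideal     = {"traverse": 10.86, "backtrack": 0.0,  "intersect": 5.91}
restart   = {"traverse": 21.80, "backtrack": 0.0,  "intersect": 5.91}
backtrack = {"traverse": 10.86, "backtrack": 7.78, "intersect": 5.91}

def overhead(scheme):
    """Traverse + backtrack work relative to the Ideal row's traverse work."""
    return (scheme["traverse"] + scheme["backtrack"]) / ideal["traverse"]

print(f"KD-Restart:   {overhead(restart):.2f}x")    # prints 2.01x
print(f"KD-Backtrack: {overhead(backtrack):.2f}x")  # prints 1.72x
```

Under that assumption, KD-Restart does roughly twice the ideal traversal work and KD-Backtrack roughly 1.7 times, while the intersection counts are identical across all three rows.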

Discussion
• Absolute performance
  – Trails best CPU implementations 5-6x
• Sources of inefficiency
  – Load balancing
  – Data reuse

Load Balancing
• Subset of rays intersecting, traversing
  – Occlusion queries to select kernel
  – Early-Z to cull inactive rays
• Approximately 5x overhead
  – Query, kernel switch overhead
  – Worse with fewer rays

Data Reuse
• Every kernel
  – Loads ray origin/direction
  – Load/Store traversal state
• Consumes streaming bandwidth
  – We are bandwidth-limited
  – CPU implementation stores these in registers

Branching
• Merge multiple passes into larger kernel
  – Fragment branches for load balancing
  – Avoid load/store of reused data
• Current branching has high overhead
• Shifts efficiency burden to HW

Conclusion
• Stackless Traversal
  – Allows efficient GPU kd-tree
  – Scales to larger, more complex scenes
• Future Work
  – Changes in HW
  – Alternative acceleration structures
  – “Out-of-core” scenes
  – Dynamic scenes

Acknowledgements
• Tim Purcell (NVIDIA)
  – Streaming raytracer
• Mark Segal (ATI)
  – Demo machine
• NVIDIA, ATI: HW
• DARPA, Rambus: Funding

Questions