KD-Tree Acceleration Structures for a GPU Raytracer
Tim Foley, Jeremy Sugerman
Stanford University
GH 05

Motivation
• Accelerated raytracing
  – On commodity HW
  – Production rendering
  – Real-time applications?
• Performance trend
  – 9800 XT: 170M ray-triangle intersects/s
  – X800 XT PE: 350M ray-triangle intersects/s

GPU Raytracing
• Promising early results
  – Simple scenes
• Uniform grid
  – Problems with complex scenes
• Hierarchical accelerator (kd-tree)
  – Improve scalability

Outline
• Background
  – GPU Raytracing
  – KD-Tree Algorithm
• KD-Restart, KD-Backtrack
• Results
• Future Work

Background
• Ray Engine [Carr et al. 2002]
  – Parallel ray-triangle intersection
  – Host controls culling
• [Purcell et al. 2002]
  – Entire raytracing pipeline
  – Many rays required for efficiency
  – Uniform Grid

Why not KD-Tree?
• Uniform grid acceleration structure
  – Regular structure = efficient traversal
  – Regular structure = poor partitioning
• KD-Trees
  – Adapt to scene complexity
  – Compact storage, efficient traversal
  – “Best” for CPU raytracing [Havran 2000]

KD-Tree
[Figure: scene divided by split planes X, Y, Z into leaves A, B, C, D, with the corresponding tree; a ray with interval (tmin, tmax).]

KD-Tree Traversal
[Figure: the same kd-tree (split nodes X, Y, Z; leaves A, B, C, D) with the leaves A, B, C, D labeled along the traversal.]

Per-Fragment Stacks
• Parallel (per-ray) push
  – No indexed write in fragment program
• Per-ray stack storage
• [Ernst et al. 2004]
  – Emulate push with extra passes
  – Impractical, slow

Our Contribution
• Stackless kd-tree traversal algorithms
  – KD-Restart
  – KD-Backtrack

Observation
Current leaf’s tmax = Next leaf’s tmin
[Figure: kd-tree with split nodes X, Y, Z and leaves A, B, C, D.]

KD-Restart
• Standard traversal
  – Omit stack operations
  – Proceed to 1st leaf
• If no intersection
  – Advance (tmin, tmax)
  – Restart from root
    • Proceed to next leaf
[Figure: kd-tree with split nodes X, Y, Z and leaves A, B, C, D.]
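The restart loop above can be sketched in a few lines of Python. This is a minimal CPU-side illustration, not the Brook kernels from the talk: the `Node` class, the `kd_restart` name, and the example tree are hypothetical, and the caller is assumed to have already clipped the ray to the scene bounds to obtain `t_enter` and `t_exit` (with `t_enter >= 0`).

```python
import math
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple

Vec = Tuple[float, float, float]


@dataclass(eq=False)
class Node:
    axis: int = -1                    # split axis 0/1/2; -1 marks a leaf
    split: float = 0.0                # split plane position along `axis`
    left: Optional["Node"] = None     # child on the low side of the plane
    right: Optional["Node"] = None    # child on the high side of the plane
    name: str = ""                    # leaf label, standing in for a triangle list


def kd_restart(root: Node, origin: Vec, direction: Vec,
               t_enter: float, t_exit: float) -> Iterator[Tuple[Node, float, float]]:
    """Yield (leaf, t_min, t_max) front to back, restarting at the root each time."""
    t_max = t_enter
    while t_max < t_exit:
        # Advance the interval past the previous leaf and restart at the root.
        node, t_min, t_max = root, t_max, t_exit
        while node.axis >= 0:
            o, d = origin[node.axis], direction[node.axis]
            t_split = (node.split - o) / d if d != 0.0 else math.inf
            # `first` is the child on the ray origin's side of the plane.
            below = o < node.split or (o == node.split and d <= 0.0)
            first, second = (node.left, node.right) if below else (node.right, node.left)
            if t_split >= t_max or t_split <= 0.0:
                node = first                  # interval lies entirely on the near side
            elif t_split <= t_min:
                node = second                 # interval lies entirely on the far side
            else:
                node, t_max = first, t_split  # near child first; `second` is not pushed
        yield node, t_min, t_max


# Hypothetical four-leaf tree over the box x in [0, 4], y in [0, 2].
def leaf(name: str) -> Node:
    return Node(name=name)

root = Node(0, 2.0,
            Node(1, 1.0, leaf("A"), leaf("B")),
            Node(0, 3.0, leaf("C"), leaf("D")))
order = [n.name for n, _, _ in
         kd_restart(root, (0.0, 0.5, 0.0), (1.0, 0.3, 0.0), 0.0, 4.0)]
```

In a raytracer, the caller would intersect the ray with each yielded leaf's triangles and stop iterating at the first hit that lies inside that leaf's (t_min, t_max) interval.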

KD-Restart
• Restart traversal after each leaf
  – m leaves
  – Average depth d
  – Cost O(m*d)
• Balanced tree of n nodes
  – Upper bound: O(n log(n))
    • Standard algorithm: O(n)
  – Expected: O(log(n))
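The upper bound on this slide follows from a short calculation. As a sketch, assume each of the m leaf visits costs one root-to-leaf descent of depth d_i, and that a balanced tree of n nodes has depth O(log n); the symbols C_restart, d_i, and \bar{d} are introduced here for illustration only.

```latex
C_{\text{restart}} \;=\; \sum_{i=1}^{m} d_i \;=\; m\,\bar{d},
\qquad m \le n,\quad \bar{d} = O(\log n)
\;\Longrightarrow\;
C_{\text{restart}} = O(n \log n)
```

Under the same assumptions, the O(log(n)) expected cost would correspond to a ray that visits a number of leaves that does not grow with n.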

Observation
Ancestor of A is parent of Z
[Figure: kd-tree with split nodes X, Y, Z and leaves A, B, C, D.]

KD-Backtrack
• If no intersection
  – Advance (tmin, tmax)
  – Start backtracking
• If node intersects (tmin, tmax)
  – Resume traversal
    • Proceed to next leaf
[Figure: kd-tree with split nodes X, Y, Z and leaves A, B, C, D.]

KD-Backtrack
• Backtrack after leaf
  – Revisits previous nodes
  – At most twice: from left, right
• Within constant factor of standard traversal
  – Upper bound: O(n)
  – Expected: O(log(n))
• Requires additional storage
  – Parent pointers
  – Bounding boxes for internal nodes
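A minimal Python sketch of the backtracking step, using the parent pointers and bounding boxes listed above. It is a CPU-side illustration under stated assumptions, not the authors' Brook code: `Node`, `link`, `box_exit`, and `kd_backtrack` are hypothetical names, boxes are stored on every node for simplicity, and `t_enter`/`t_exit` are assumed to come from clipping the ray against the root box.

```python
import math
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple

Vec = Tuple[float, float, float]


@dataclass(eq=False)
class Node:
    axis: int = -1                    # split axis 0/1/2; -1 marks a leaf
    split: float = 0.0                # split plane position along `axis`
    left: Optional["Node"] = None     # child on the low side of the plane
    right: Optional["Node"] = None    # child on the high side of the plane
    name: str = ""                    # leaf label, standing in for a triangle list
    parent: Optional["Node"] = None   # extra storage: parent pointer
    lo: Vec = (0.0, 0.0, 0.0)         # extra storage: bounding box minimum
    hi: Vec = (0.0, 0.0, 0.0)         # extra storage: bounding box maximum


def link(node: Node, lo: Vec, hi: Vec, parent: Optional[Node] = None) -> Node:
    """Fill in parent pointers and bounding boxes top-down."""
    node.parent, node.lo, node.hi = parent, lo, hi
    if node.axis >= 0:
        a = node.axis
        link(node.left, lo, hi[:a] + (node.split,) + hi[a + 1:], node)
        link(node.right, lo[:a] + (node.split,) + lo[a + 1:], hi, node)
    return node


def box_exit(node: Node, origin: Vec, direction: Vec) -> float:
    """Ray parameter where the ray leaves node's box (the ray is assumed to hit it)."""
    t = math.inf
    for o, d, lo, hi in zip(origin, direction, node.lo, node.hi):
        if d != 0.0:
            t = min(t, ((hi if d > 0.0 else lo) - o) / d)
    return t


def kd_backtrack(root: Node, origin: Vec, direction: Vec,
                 t_enter: float, t_exit: float) -> Iterator[Tuple[Node, float, float]]:
    """Yield (leaf, t_min, t_max) front to back, backtracking via parent pointers."""
    node, t_min, t_max = root, t_enter, t_exit
    while True:
        while node.axis >= 0:         # same descent as the standard traversal, no push
            o, d = origin[node.axis], direction[node.axis]
            t_split = (node.split - o) / d if d != 0.0 else math.inf
            below = o < node.split or (o == node.split and d <= 0.0)
            first, second = (node.left, node.right) if below else (node.right, node.left)
            if t_split >= t_max or t_split <= 0.0:
                node = first
            elif t_split <= t_min:
                node = second
            else:
                node, t_max = first, t_split
        yield node, t_min, t_max
        t_min = t_max                 # advance past the leaf just visited
        if t_min >= t_exit:
            return
        node = node.parent            # climb until a box extends beyond t_min
        while node is not None and box_exit(node, origin, direction) <= t_min:
            node = node.parent
        if node is None:
            return
        t_max = min(box_exit(node, origin, direction), t_exit)


# Hypothetical four-leaf tree over the box x in [0, 4], y in [0, 2], z in [0, 1].
def leaf(name: str) -> Node:
    return Node(name=name)

root = link(Node(0, 2.0,
                 Node(1, 1.0, leaf("A"), leaf("B")),
                 Node(0, 3.0, leaf("C"), leaf("D"))),
            (0.0, 0.0, 0.0), (4.0, 2.0, 1.0))
order = [n.name for n, _, _ in
         kd_backtrack(root, (0.0, 0.5, 0.0), (1.0, 0.3, 0.0), 0.0, 4.0)]
```

The sketch relies on `box_exit` and the split-plane test dividing by the same direction component, so that a leaf's exit time compares exactly equal to the exit time of any ancestor box that ends on the same plane.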

Implementation
• Built GPU raytracer in Brook [Buck et al.]
• 4 intersection schemes:
  – Brute Force
  – Uniform Grid
  – KD-Restart
  – KD-Backtrack

Scenes
• Cornell Box: 32 triangles
• Stanford Bunny: 69451 triangles
• BART Robots: 71708 triangles
• BART Kitchen: 110561 triangles

Results
[Chart: relative speedup over brute-force intersection for the Box, Bunny, Robots, and Kitchen scenes; the only data label that survives is 12.9.]

Results

            Traverse   Backtrack   Intersect
Ideal       10.86M     0           5.91M
Restart     21.80M     0           5.91M
Backtrack   10.86M     7.78M       5.91M

Rays in each state throughout traversal.
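To put the table in relative terms, the short Python sketch below totals the ray-state counts for each scheme and divides by the ideal total. The numbers are copied from the table; the `counts` dictionary and `relative_work` helper are illustrative names, and treating every state as equal-cost work is an assumption made only for this comparison.

```python
# Ray-state counts in millions, copied from the table above.
counts = {
    "Ideal":     {"traverse": 10.86, "backtrack": 0.00, "intersect": 5.91},
    "Restart":   {"traverse": 21.80, "backtrack": 0.00, "intersect": 5.91},
    "Backtrack": {"traverse": 10.86, "backtrack": 7.78, "intersect": 5.91},
}


def relative_work(scheme: str) -> float:
    """Total ray-states for `scheme` divided by the total for the ideal traversal."""
    ideal_total = sum(counts["Ideal"].values())
    return sum(counts[scheme].values()) / ideal_total


restart_ratio = relative_work("Restart")      # about 1.65
backtrack_ratio = relative_work("Backtrack")  # about 1.46
# Traversal steps alone: KD-Restart performs roughly twice the ideal number.
restart_traverse_ratio = counts["Restart"]["traverse"] / counts["Ideal"]["traverse"]
```

By this simple count, both stackless schemes stay within a factor of two of the ideal stack-based traversal.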

Discussion
• Absolute performance
  – Trails best CPU implementations by 5-6x
• Sources of inefficiency
  – Load balancing
  – Data reuse

Load Balancing
• Subset of rays intersecting, traversing
  – Occlusion queries to select kernel
  – Early-Z to cull inactive rays
• Approximately 5x overhead
  – Query, kernel switch overhead
  – Worse with fewer rays

Data Reuse
• Every kernel
  – Loads ray origin/direction
  – Load/Store traversal state
• Consumes streaming bandwidth
  – We are bandwidth-limited
  – CPU implementation stores these in registers

Branching
• Merge multiple passes into larger kernel
  – Fragment branches for load balancing
  – Avoid load/store of reused data
• Current branching has high overhead
• Shifts efficiency burden to HW

Conclusion
• Stackless Traversal
  – Allows efficient GPU kd-tree
  – Scales to larger, more complex scenes
• Future Work
  – Changes in HW
  – Alternative acceleration structures
  – “Out-of-core” scenes
  – Dynamic scenes

Acknowledgements
• Tim Purcell (NVIDIA)
  – Streaming raytracer
• Mark Segal (ATI)
  – Demo machine
• NVIDIA, ATI: HW
• DARPA, Rambus: Funding

Questions