Parallel Graphics Rendering
Matthew Campbell
Senior, Computer Science
mcampbel@vt.edu

Overview
- Motivation
- Three categories of parallel rendering
- Our approach
- Results
- Questions

Motivation
- PC graphics cards are getting faster at an exponential rate.
- PC graphics boards are much cheaper than proprietary SGI hardware.
  - GeForce 4 FX = $150.00 (130 Mtris/sec)
  - SGI Onyx 300 = $145,000 (80 Mtris/sec)
- Maintenance costs are lower.
  - Replacement parts are easy to get.
  - PCs are not as complicated as proprietary hardware.

Parallel Rendering
- String together numerous PCs with good graphics boards and render the models in parallel.
  - Increased performance
  - Better technology tracking
- Three groups of algorithms:
  - Sort-First
  - Sort-Middle
  - Sort-Last

Rendering Pipeline
- Transformation stage:
  - Per-vertex operations
  - Primitive assembly
  - 3D world space!
- Rasterization stage:
  - Per-fragment operations
  - Texture mapping
  - 2D image space!

Parallel Rendering – Sort Last
- Distribute polygons
  - Round robin distribution resulting in an equal load on each processor.
- Pass through entire rendering pipeline.
  - Transformation / Rasterization (see last slide)
- Each CPU now has a full-screen image
  - But individual images are incomplete
  - Polygons that should be hidden may be visible
  - Solution: Image composition

Sort Last – Image Composition
- The scene at each CPU has a frame buffer with color values for each pixel and a depth buffer with Z values for each pixel.
- Composition: Given 2 scenes, compute the color of the pixel at each screen coordinate
  - Compare the depth buffer values at each pixel location.
  - The resultant color value is the color of the pixel with the lower Z value.
- Alpha blending is more complex.
  - Why?
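The depth comparison described on this slide can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: images are assumed to be flat Python lists, pixels nobody drew are assumed to carry an infinite depth, and the name `composite` is made up for the example.

```python
FAR = float("inf")  # assumed depth for pixels that no polygon covered


def composite(color_a, depth_a, color_b, depth_b):
    """Merge two partial renderings pixel by pixel, keeping the nearer fragment."""
    out_color, out_depth = [], []
    for ca, za, cb, zb in zip(color_a, depth_a, color_b, depth_b):
        if za <= zb:  # lower Z value wins
            out_color.append(ca)
            out_depth.append(za)
        else:
            out_color.append(cb)
            out_depth.append(zb)
    return out_color, out_depth


# Two 4-pixel partial images from two hypothetical nodes.
color_a, depth_a = ["red", "red", None, None], [0.3, 0.7, FAR, FAR]
color_b, depth_b = ["blue", "blue", "blue", None], [0.5, 0.2, 0.9, FAR]

merged_color, merged_depth = composite(color_a, depth_a, color_b, depth_b)
print(merged_color)  # ['red', 'blue', 'blue', None]
```

The depth test gives the same answer whichever image is passed first (ties aside). Alpha blending does not have that property, because the blended color depends on the order in which fragments are combined.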

Sort Last – Image Composition
- Time complexity of the previous composition scheme is O(n) in the number of images to merge, which is pretty bad.
  - Can we improve it?
- Alternate algorithms:
  - Tree composition.
  - Rotating rings.
  - Binary composition.
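The slide names tree composition without giving details. A minimal sketch, assuming it means merging images pairwise in rounds: merging n images one after another takes n - 1 dependent steps, while a tree needs only about log2(n) rounds if the merges within a round run on different nodes. The function names and the `(color, depth)` pixel layout are assumptions made for illustration.

```python
def composite(image_a, image_b):
    """Depth-composite two images; each pixel is a (color, depth) pair."""
    return [pa if pa[1] <= pb[1] else pb for pa, pb in zip(image_a, image_b)]


def tree_composite(images):
    """Merge images pairwise, round by round; return the result and the round count."""
    rounds = 0
    while len(images) > 1:
        merged = [composite(images[i], images[i + 1])
                  for i in range(0, len(images) - 1, 2)]
        if len(images) % 2:  # an odd image out advances to the next round
            merged.append(images[-1])
        images = merged
        rounds += 1
    return images[0], rounds


# Eight hypothetical 2-pixel images: image 0 is nearest at pixel 0, image 7 at pixel 1.
images = [[(f"c{k}", k + 1.0), (f"c{k}", 8.0 - k)] for k in range(8)]
final_image, rounds = tree_composite(images)
print(final_image, rounds)  # [('c0', 1.0), ('c7', 1.0)] 3
```

The loop here runs sequentially; the round count only shows how many dependent steps a parallel version would need.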

Sort-Last Performance
- Sort-Last has a very high communication bandwidth requirement.
  - Each processor needs to send and receive an entire frame
  - 1280 x 1024 resolution, 24 bits for color, 16 bits for depth, 30 fps
    - = (3.9 MB + 2.6 MB) * 30 = 196 MB/sec bidirectional!
  - Need a very fast network interconnecting the CPUs in the cluster.
- In actuality, we need more bandwidth, because we haven't taken into account the time it takes to render the scene!
- But... no overhead for rendering the actual scene!
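The bandwidth figure on this slide can be checked directly. A small sketch, assuming decimal megabytes (10^6 bytes); the helper name `frame_bytes` is made up for the example.

```python
def frame_bytes(width, height, bits_per_pixel):
    """Size in bytes of one full-screen buffer."""
    return width * height * bits_per_pixel // 8


color_bytes = frame_bytes(1280, 1024, 24)  # color buffer
depth_bytes = frame_bytes(1280, 1024, 16)  # depth buffer
bytes_per_second = (color_bytes + depth_bytes) * 30  # 30 frames per second

print(color_bytes / 1e6, depth_bytes / 1e6)  # 3.93216 2.62144
print(bytes_per_second / 1e6)                # 196.608
```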

Parallel Rendering – Sort Middle
- Distribute polygons in a round robin fashion
- Trap polygons between geometry and rasterization phases
- Each CPU in the cluster is responsible for a specific region in screen coordinates
- Calculate the bounding boxes (screen space) for the trapped polygons and redistribute them to the appropriate CPU responsible for the region.
- Collate images

Parallel Rendering – Sort Middle
- How do you divide the screen into regions?
  - Strips (either horizontal or vertical)
  - Squares
- What is the mapping ratio between CPUs and regions?
  - One-to-One: Each CPU manages 1 region
  - One-to-Many: Each CPU manages many regions
- What about polygons that cross region boundaries?
  - Multiple CPUs render the same polygon.
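The redistribution step can be illustrated with vertical strips. This is a sketch under stated assumptions: equal-width strips, a one-to-one CPU-to-region mapping, and polygons given as lists of (x, y) screen coordinates. The function names are hypothetical.

```python
def screen_bbox(vertices):
    """Axis-aligned bounding box of a polygon in screen space."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def strips_for_polygon(vertices, screen_width, num_strips):
    """Indices of the vertical strips (one per CPU) that the bounding box touches."""
    strip_width = screen_width / num_strips
    x_min, _, x_max, _ = screen_bbox(vertices)
    first = max(0, int(x_min // strip_width))
    last = min(num_strips - 1, int(x_max // strip_width))
    return list(range(first, last + 1))


inside_one = [(10, 10), (100, 50), (60, 200)]
crosses_edge = [(200, 10), (300, 50), (250, 90)]

print(strips_for_polygon(inside_one, 1024, 4))    # [0]
print(strips_for_polygon(crosses_edge, 1024, 4))  # [0, 1]
```

The second triangle straddles the boundary at x = 256, so it is sent to two CPUs and rendered twice, which is the duplication the slide refers to.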

Sort-Middle Performance
- Load balancing can be poor. The slowest CPU will block the system from rendering the next scene.
  - Load balancing is highly scene and view dependent.
  - Need adaptive load-balancing schemes.
- In high polygon count scenes, the size of each polygon can be very small (~1–2 pixels).
  - In this case, sort-middle requires more bandwidth than sort-last.
- Communication bandwidth required is dependent on the scene complexity. (Bad)

Parallel Rendering – Sort First
- Distribute polygons round-robin to all CPUs.
- Calculate bounding volumes for each polygon
  - Remember, we are still in the world coordinate system.
  - Each CPU is responsible for 1 volume.
- Redistribute polygons based on bounding volumes.
- Pass through complete rendering pipeline
- In the end we have sub-images at each processor.
  - Designate a coordinator node, which receives sub-images from all other processors.
  - Coordinator collates sub-images into the final image.

Sort First - Performance
- Communication bandwidth required is based only on screen space resolution.
- Example:
  - 4 CPUs, 1024*1024 scene, 32 bits/color
  - The coordinator node receives 1024*1024*24 bits/frame, ~3 MB.
  - Bandwidth: 90 MB/sec for 30 fps.
- Problem: Similar to sort-middle, load balancing is scene dependent.
- Bigger issue: Can't use a one-to-many CPU-to-region mapping.
  - Or can you?
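The slide's totals (about 3 MB per frame, 90 MB/sec) correspond to a full 1024 x 1024 frame at 24 bits per pixel, counted in binary megabytes. The sketch below reproduces that and, as an assumption not stated on the slide, also shows what the totals would be if all 32 bits per pixel were transferred. The helper name `frame_bytes` is made up.

```python
MIB = 1024 * 1024  # binary megabyte, the unit that matches the slide's totals


def frame_bytes(width, height, bits_per_pixel):
    """Size in bytes of one full-screen color buffer."""
    return width * height * bits_per_pixel // 8


frame_24 = frame_bytes(1024, 1024, 24)
rate_24 = frame_24 * 30  # bytes per second at 30 fps

frame_32 = frame_bytes(1024, 1024, 32)  # assumption: full 32-bit pixels are sent
rate_32 = frame_32 * 30

print(frame_24 / MIB, rate_24 / MIB)  # 3.0 90.0
print(frame_32 / MIB, rate_32 / MIB)  # 4.0 120.0
```

Either way the cost depends only on resolution and frame rate, not on how many polygons the scene contains.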

Parallel Rendering Issues
- Cannot break the rendering pipeline
  - Pipeline is implemented in hardware
  - Graphics driver source code is unavailable
- Additional cost/overhead due to framebuffer accesses.
  - Modern graphics cards have large amounts of memory on the board and much faster access times.
    - 8 GB/sec vs. 1 GB/sec for AGP 4x
  - Therefore, very expensive. Could lead to excessive stalls, cache misses, etc.

Our Approach
- High performance real-time rendering.
  - High scene complexity and/or multiple displays as in a VE.
  - Target: 200-300 million triangles/sec.
  - In comparison, the best SGI platform, Reality Monster, is capable of 80 million polygons/sec
- Approach:
  - Distributed Sort-First.
  - Two level sorting.
    - Organize your model in a spatial tree data structure.
    - At run-time compare bounding volumes for interior nodes of the tree. The bounding volume for an interior node encloses those of its children, so a subtree that fails the test can be skipped whole. This minimizes comparisons.
    - Fine pruning based on viewing frustum.
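The coarse level of the two-level sort can be illustrated as a recursive walk over a bounding-volume tree. This is a sketch only: the slides do not say which tree or volume type is used, so axis-aligned boxes and the names `Node`, `overlaps` and `collect` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    box: tuple  # (min_x, min_y, min_z, max_x, max_y, max_z)
    children: list = field(default_factory=list)
    triangles: list = field(default_factory=list)  # filled in on leaves only


def overlaps(a, b):
    """True if two axis-aligned boxes intersect on every axis."""
    return all(a[i] <= b[i + 3] and b[i] <= a[i + 3] for i in range(3))


def collect(node, region, out, stats):
    """Gather triangles whose leaf box overlaps `region`, skipping whole subtrees."""
    stats["tests"] += 1
    if not overlaps(node.box, region):
        return  # nothing below this node can overlap either
    if not node.children:
        out.extend(node.triangles)
        return
    for child in node.children:
        collect(child, region, out, stats)


leaf_a = Node((0, 0, 0, 1, 4, 4), triangles=["t0", "t1"])
leaf_b = Node((1.5, 0, 0, 2, 4, 4), triangles=["t2"])
leaf_c = Node((2.5, 0, 0, 3, 4, 4), triangles=["t3"])
leaf_d = Node((3, 0, 0, 4, 4, 4), triangles=["t4"])
left = Node((0, 0, 0, 2, 4, 4), children=[leaf_a, leaf_b])
right = Node((2.5, 0, 0, 4, 4, 4), children=[leaf_c, leaf_d])
root = Node((0, 0, 0, 4, 4, 4), children=[left, right])

found, stats = [], {"tests": 0}
collect(root, (0, 0, 0, 1.2, 4, 4), found, stats)
print(found, stats["tests"])  # ['t0', 't1'] 5
```

The right subtree is rejected with a single test, so its two leaves are never visited. On a tree this small the saving is negligible; it grows with the depth of the tree.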

Hardware
- 32 Intel Xeon processor cluster (1.5 GHz processor)
- 256 MB RDRAM/node (3.2 GB/sec memory bandwidth)
- Myrinet (4 Gbps) and Fast Ethernet (200 Mbps full-duplex) communication fabrics.
- 64 bit/66 MHz PCI bus (4 Gbps throughput)
- 4x AGP (1 GB/sec throughput)

Software
- Extensible Parallel 3D Rendering Engine
  - Supports large geometric databases, including standard formats such as 3D Studio
  - Provides an extensible API.
  - Underlying system is based on OpenGL.
  - Based on dynamic shared object model.
- Dynamic Load Balancing
  - Adaptively resizes volumes assigned to a processor for single display systems.
  - Adaptively changes the number of processors and rendering volumes for multi-display systems.
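The slides do not describe how volumes are resized. As one possible heuristic, offered purely as an illustration and not as the engine's actual scheme, the sketch below treats the volumes as intervals along one axis, assumes rendering cost is spread evenly inside each interval, and moves the boundaries so that every node gets an equal share of the measured frame time. The function name `rebalance` and all inputs are hypothetical.

```python
def rebalance(boundaries, times):
    """Move interval boundaries so each node gets an equal share of the cost.

    boundaries: n + 1 increasing positions along one axis.
    times: n measured per-node frame times, all assumed positive.
    """
    n = len(times)
    total = sum(times)
    cumulative = [0.0]
    for t in times:
        cumulative.append(cumulative[-1] + t)

    new_boundaries = [boundaries[0]]
    for k in range(1, n):
        goal = total * k / n  # cost that should lie left of boundary k
        i = 0
        while cumulative[i + 1] < goal:  # find the old interval holding the goal
            i += 1
        frac = (goal - cumulative[i]) / times[i]
        width = boundaries[i + 1] - boundaries[i]
        new_boundaries.append(boundaries[i] + frac * width)
    new_boundaries.append(boundaries[-1])
    return new_boundaries


# Node 0 took 30 ms and node 1 took 10 ms, so node 0's interval shrinks.
resized = rebalance([0.0, 0.5, 1.0], [30.0, 10.0])
print(resized)
```

In the example the shared boundary moves from 0.5 to about 0.33, handing part of the slow node's volume to the fast one.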

Software Architecture
- Master-Slave arrangement
- Multi-threaded
- Two stage parallel rendering pipeline.

Results – Rendering Rate
Figure 1: Scalability of our implementation. "Actual" depicts the performance taking into account triangle overlap among nodes; "effective" depicts what the system is capable of delivering. The left image uses a real-world dataset (LIDAR data). The right image uses a generated dataset to fully exploit the overlap issue.

Results – Load Balancing
Figure 2: The effects of load balancing on 4 nodes (left) and 16 nodes (right). The graph depicts the individual frame times for the first 100 frames.