Brook for GPUs
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Pat Hanrahan
February 10th, 2003
February 11th, 2004

Brook: general purpose streaming language
• developed for PCA Program/Merrimac
  – compiler: RStream
    • Reservoir Labs
  – DARPA PCA Program
    • Stanford: SmartMemories
    • UT Austin: TRIPS
    • MIT: RAW
  – Brook version 0.2 spec: http://merrimac.stanford.edu
  – Brook for GPUs: http://brook.sourceforge.net
[Diagram labels: Scalar Execution Unit, Stream Execution Unit, Stream Register File, Memory System, Network Interface, DRDRAM]

Brook: general purpose streaming language
• stream programming model
  – enforce data parallel computing
    • streams
  – encourage arithmetic intensity
    • kernels
• C with streams

Brook for GPUs
• demonstrate GPU streaming coprocessor
  – make programming GPUs easier
    • hide texture/pbuffer data management
    • hide graphics based constructs in Cg/HLSL
    • hide rendering passes
    • virtualize resources
  – performance!
    • … on applications that matter
  – highlight GPU areas for improvement
    • features required for general purpose stream computing

system outline
• .br: Brook source files
• brcc: source to source compiler
• brt: Brook run-time library

Brook language: streams
• streams
  – collection of records requiring similar computation
    • particle positions, voxels, FEM cell, …

      float3 positions<200>;
      float3 velocityfield<100, 100>;

  – encourage data parallelism

Brook language: kernels
• kernels
  – functions applied to streams
    • similar to for_all construct

      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }

      float a<100>;
      float b<100>;
      float c<100>;

      foo(a, b, c);

      /* the kernel call behaves like this C loop */
      for (i=0; i<100; i++)
        c[i] = a[i]+b[i];

  – no dependencies between stream elements
    • encourage high arithmetic intensity

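The kernel semantics on this slide can be emulated on the CPU as an element-wise map. The following is a minimal Python sketch, assuming streams are plain lists of equal length; `run_kernel` is a hypothetical helper for illustration and is not part of Brook.

```python
def foo(a, b):
    """Per-element body of the Brook kernel foo: result = a + b."""
    return a + b


def run_kernel(kernel, *streams):
    """Apply `kernel` independently to each group of matching stream elements.

    Hypothetical helper for illustration only; not a Brook API.
    """
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all input streams must have the same length")
    # No element depends on any other, so the evaluation order is irrelevant
    # and the work could be spread across parallel units.
    return [kernel(*elems) for elems in zip(*streams)]


a = [float(i) for i in range(100)]
b = [2.0 * i for i in range(100)]
c = run_kernel(foo, a, b)  # plays the role of foo(a, b, c) on the slide
```

Because the kernel body only sees one element of each stream, the list comprehension here could be replaced by any parallel map without changing the result.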
Brook language: kernels
• Ray Triangle Intersection

  kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                   RayState oldraystate<>,
                                   GridTrilist trilist[],
                                   out Hit candidatehit<>) {
    float idx, det, inv_det;
    float3 edge1, edge2, pvec, tvec, qvec;
    if (oldraystate.y > 0) {
      idx = trilist[oldraystate.w].trinum;
      edge1 = tris[idx].v1 - tris[idx].v0;
      edge2 = tris[idx].v2 - tris[idx].v0;
      pvec = cross(ray.d, edge2);
      det = dot(edge1, pvec);
      inv_det = 1.0f/det;
      tvec = ray.o - tris[idx].v0;
      candidatehit.data.y = dot( tvec, pvec ) * inv_det;
      qvec = cross( tvec, edge1 );
      candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
      candidatehit.data.x = dot( edge2, qvec ) * inv_det;
      candidatehit.data.w = idx;
    } else {
      candidatehit.data = float4(0, 0, 0, -1);
    }
  }

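The arithmetic in this kernel matches the Möller–Trumbore ray-triangle formulation: `data.x` holds the ray distance and `data.y`, `data.z` hold the two barycentric coordinates. Below is a CPU sketch in Python of the per-element computation, assuming vectors are 3-tuples of floats. The function and helper names are hypothetical, the `oldraystate` activity check is omitted, and the zero-determinant guard is an addition that the slide's kernel does not contain.

```python
def sub(u, v):
    return tuple(x - y for x, y in zip(u, v))


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def intersect_triangle(origin, direction, v0, v1, v2, idx):
    """Return (x, y, z, w) laid out like candidatehit.data in the kernel."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if det == 0.0:
        # Added guard (not in the slide's kernel): ray parallel to the triangle.
        return (0.0, 0.0, 0.0, -1.0)
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    y = dot(tvec, pvec) * inv_det        # first barycentric coordinate
    qvec = cross(tvec, edge1)
    z = dot(direction, qvec) * inv_det   # second barycentric coordinate
    x = dot(edge2, qvec) * inv_det       # distance along the ray
    return (x, y, z, float(idx))


hit = intersect_triangle(
    origin=(0.25, 0.25, -1.0), direction=(0.0, 0.0, 1.0),
    v0=(0.0, 0.0, 0.0), v1=(1.0, 0.0, 0.0), v2=(0.0, 1.0, 0.0), idx=0)
```

Like the kernel, this sketch only produces a candidate hit; testing whether the barycentric coordinates fall inside the triangle is left to a later step.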
Brook language: additional features
• reductions
  – scalar
  – stream
• stride & repeat
• GatherOp & ScatterOp
  – a[i] += p
  – p = a[i]++

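The reduction, ScatterOp, and GatherOp bullets can be illustrated with a small CPU emulation. This is a Python sketch under stated assumptions: streams are lists, updates are applied sequentially in stream order (the slide does not specify ordering), and the function names are hypothetical rather than Brook API names.

```python
from functools import reduce
import operator


def reduce_stream(op, stream):
    """Scalar reduction: fold an associative `op` over the whole stream."""
    return reduce(op, stream)


def scatter_add(a, indices, values):
    """ScatterOp sketch: perform a[i] += p for each (i, p) pair."""
    out = list(a)
    for i, p in zip(indices, values):
        out[i] += p
    return out


def gather_increment(a, indices):
    """GatherOp sketch: perform p = a[i]++ for each index i.

    Returns the fetched values and the updated array.
    """
    out = list(a)
    fetched = []
    for i in indices:
        fetched.append(out[i])  # read the old value
        out[i] += 1             # then post-increment in place
    return fetched, out


total = reduce_stream(operator.add, [1, 2, 3, 4])
scattered = scatter_add([0, 0, 0], indices=[0, 2, 0], values=[1, 5, 2])
fetched, updated = gather_increment([10, 20], indices=[0, 0, 1])
```

Note that repeated indices are what make these operations more than a plain indirect read or write: each access to the same location has to observe the effect of the previous one.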
brcc compiler infrastructure
• based on ctool
  – http://ctool.sourceforge.net
• parser
  – build code tree
  – extend C grammar to accept Brook
• convert
  – tree transformations
• codegen
  – generate Cg & HLSL code
  – call cgc, fxc
  – generate stub function

Applications
• Ray-tracer
• FFT
• Segmentation
• Linear Algebra:
  – BLAS, LINPACK, LAPACK

Brook Performance

GPU Gotchas
[Plot: NVIDIA NV3x, register usage vs. time; axes: Time, Registers Used]

GPU Gotchas
NVIDIA:
• Register Penalty
• Render to Texture Limitation
  – Requires explicit copy or heavy pbuffer solution
  – Superbuffer extension needed
    http://mirror.ati.com/developer/SIGGRAPH03/Percy_OpenGL_Extensions_SIG03.pdf

GPU Gotchas
ATI Radeon 9800 Pro
• Limited dependent texture lookup
• 96 instructions
• 24-bit floating point
  – s16e7: integers up to 131,072 (s23e8: 16,777,216)
[Diagram labels: 1, 2, 3, 4; Memory Refs; Math Ops]

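The integer limits quoted for the two floating-point formats follow from the mantissa width: with m stored mantissa bits plus an implicit leading 1, every integer up to 2^(m+1) is exactly representable. The Python sketch below checks both figures, assuming an implicit leading bit and enough exponent range in each format; the helper names are hypothetical.

```python
import struct


def max_contiguous_int(mantissa_bits):
    """Largest N such that every integer in [0, N] is exactly representable,
    assuming an implicit leading 1 bit and sufficient exponent range."""
    return 2 ** (mantissa_bits + 1)


def round_to_float32(x):
    """Round a Python float to IEEE 754 single precision (s23e8) and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


limit_s16e7 = max_contiguous_int(16)   # 24-bit format on the slide
limit_s23e8 = max_contiguous_int(23)   # 32-bit single precision

# One past the s23e8 limit no longer survives a round trip through float32.
just_over = round_to_float32(float(limit_s23e8 + 1))
```

This matters for stream programs that carry array indices in floating-point registers: beyond the limit, neighbouring indices collapse onto the same value.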
GPU Catch-Up!
• Integer & Bit Ops & Double Precision
• Memory Addressing
• CGC/FXC Performance
  – Hand code performance critical code
• No native reduction support
• No native scatter support
  – p[i] = a (indirect write)
• No programmable blend
  – GatherOp / ScatterOp
• Limited 4x4 output
  – Brook virtualized kernel outputs
• Readback still slow
  – NV35 OpenGL: 600 MB/sec Download, 170 MB/sec Readback
  – ATI DirectX: 550 MB/sec Download, 50 MB/sec Readback

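To put the transfer rates on this slide in perspective, the Python sketch below estimates transfer times for one stream. It assumes the readback figures pair with the download figures in the order listed (NV35 OpenGL: 170 MB/sec, ATI DirectX: 50 MB/sec), that "MB" means 10^6 bytes, and a hypothetical workload of 1024 x 1024 float4 elements at 4 bytes per component. None of these assumptions come from the slides.

```python
MB = 1_000_000  # assumption: "MB/sec" means 10**6 bytes per second


def transfer_seconds(num_bytes, rate_mb_per_sec):
    """Time to move num_bytes at the given sustained rate."""
    return num_bytes / (rate_mb_per_sec * MB)


# Hypothetical workload: 1024 x 1024 elements, 4 components, 4 bytes each.
stream_bytes = 1024 * 1024 * 4 * 4

# Rates from the slide; the download/readback pairing is an assumption.
rates = {
    "NV35 OpenGL": {"download": 600, "readback": 170},
    "ATI DirectX": {"download": 550, "readback": 50},
}

for name, r in rates.items():
    down_ms = 1000 * transfer_seconds(stream_bytes, r["download"])
    back_ms = 1000 * transfer_seconds(stream_bytes, r["readback"])
    print(f"{name}: download {down_ms:.1f} ms, readback {back_ms:.1f} ms")
```

Under these assumptions readback costs several times more than download on both boards, which is why kernels that keep intermediate results on the GPU are preferable to ones that return data to the host after every pass.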
GPUs of the future (we hope)
• Complete Instruction Sets
  – Integers, Bit Ops, Doubles, Mem Access
• Integration
  – Streaming coprocessor not just a rendering device
• Streaming architectures
[Diagram labels: SDRAM, Stream Register File, ALU Cluster]

Brook for GPUs
• Release v0.3 available on Sourceforge
• Project Page
  – http://graphics.stanford.edu/projects/brook
• Source
  – http://www.sourceforge.net/projects/brook
• Over 4K downloads!
• Questions?
Fly-fishing fly images from The English Fly Fishing Shop