Brook for GPUs
  Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Pat Hanrahan
  GCafe, December 10th, 2003
Brook: general purpose streaming language
  • developed for the Stanford streaming supercomputing project
    – architecture: Merrimac
    – compiler: RStream
      • Reservoir Labs
    – Center for Turbulence Research
    – NASA
    – DARPA PCA Program
      • Stanford: Smart Memories
      • UT Austin: TRIPS
      • MIT: RAW
    – Brook version 0.2 spec: http://merrimac.stanford.edu
  [Figure: Merrimac block diagram: Scalar Execution Unit, Stream Execution Unit, Stream Register File, Memory System, Network Interface, DRDRAM]
why graphics hardware?
  Pentium 4 SSE, theoretical*:
    3 GHz * 4 wide * 0.5 inst/cycle = 6 GFLOPS
  GeForce FX 5900 (NV35) fragment shader, obtained:
    MULR R0, R0: 20 GFLOPS
    equivalent to a 10 GHz P4
  and getting faster: 3x improvement over NV30 (6 months)
  *from Intel P4 Optimization Manual
  [Figure: GeForce FX]
gpu: data parallel & arithmetic intensity
  – data parallelism
    • each fragment shaded independently
    • better alu usage
    • hide memory latency
  – arithmetic intensity
    • compute-to-bandwidth ratio
    • lots of ops per word transferred
    • app limited by alu performance, not off-chip bandwidth
    • more chip real estate for alus, not caches
  [Figure: 64-bit fpu]
Brook: general purpose streaming language
  • stream programming model
    – enforce data parallel computing
      • streams
    – encourage arithmetic intensity
      • kernels
  • C with streams
Brook for gpus
  • demonstrate gpu streaming coprocessor
    – explicit programming abstraction
    – make programming gpus easier
      • hide texture/pbuffer data management
      • hide graphics based constructs in Cg/HLSL
      • hide rendering passes
      • virtualize resources
    – performance!
      • … on applications that matter
    – highlight gpu areas for improvement
      • features required for general purpose stream computing
system outline
  .br: Brook source files
  brcc: source-to-source compiler
  brt: Brook run-time library
Brook language: streams
  • streams
    – collection of records requiring similar computation
      • particle positions, voxels, FEM cell, …
          float3 positions<200>;
          float3 velocityfield<100,100>;
    – similar to arrays, but…
      • index operations disallowed: position[i]
      • read/write stream operators
          streamRead (positions, p_ptr);
          streamWrite (velocityfield, v_ptr);
    – encourage data parallelism
Brook language: kernels
  • kernels
    – functions applied to streams
      • similar to for_all construct
          kernel void foo (float a<>, float b<>,
                           out float result<>) {
            result = a + b;
          }

          float a<100>;
          float b<100>;
          float c<100>;

          foo(a, b, c);

        equivalent C loop:
          for (i=0; i<100; i++)
            c[i] = a[i]+b[i];
Brook language: kernels
  • kernels
    – functions applied to streams
      • similar to for_all construct
          kernel void foo (float a<>, float b<>,
                           out float result<>) {
            result = a + b;
          }
    – no dependencies between stream elements
      • encourage high arithmetic intensity
Brook language: kernels
  • kernel arguments
    – input/output streams
          kernel void foo (float a<>, float b<>,
                           out float result<>) {
            result = a + b;
          }
        a, b: read-only input streams
        result: write-only output stream
Brook language: kernels
  • kernel arguments
    – input/output streams
    – constant parameters
          kernel void foo (float a<>, float b<>,
                           float t,
                           out float result<>) {
            result = a + t*b;
          }

          float a<100>;
          float b<100>;
          float c<100>;

          foo(a, b, 3.2f, c);
Brook language: kernels
  • kernel arguments
    – input/output streams
    – constant parameters
    – gather streams
          kernel void foo (float a<>, float b<>,
                           float t, float array[],
                           out float result<>) {
            result = array[a] + t*b;
          }

          float a<100>;
          float b<100>;
          float c<100>;
          float array<25>;

          foo(a, b, 3.2f, array, c);
  [callout: gpu bonus]
Brook language: kernels
  • kernel arguments
    – input/output streams
    – constant parameters
    – gather streams
    – iterator streams
          kernel void foo (float a<>, float b<>,
                           float t, float array[],
                           iter float n<>,
                           out float result<>) {
            result = array[a] + t*b + n;
          }

          float a<100>;
          float b<100>;
          float c<100>;
          float array<25>;
          iter float n<100> = iter(0, 10);

          foo(a, b, 3.2f, array, n, c);
  [callout: gpu bonus]
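The four argument kinds can be read as one plain C loop. This is only a sketch of the semantics, not what brcc emits. The name foo_c and the helper clamp_index are hypothetical, and three behaviours are assumptions the slides do not state: the float gather index is truncated to an integer, out-of-range indices are clamped, and iter(start, end) over n elements yields start + i*(end-start)/n.

```c
#include <stddef.h>

/* Hypothetical helper: turn a float gather index into a valid array
 * index. Truncation and clamping are assumptions of this sketch. */
static size_t clamp_index(float v, size_t len)
{
    if (!(v > 0.0f)) return 0;            /* negative, zero or NaN */
    if (v >= (float)len) return len - 1;  /* past the end */
    return (size_t)v;                     /* truncate toward zero */
}

/* C reading of:
 *   result = array[a] + t*b + n;
 * a, b   : input streams (one element per iteration)
 * t      : constant parameter (same value every iteration)
 * array  : gather stream (indexed freely), array_len must be > 0
 * iterator stream n is generated on the fly from iter_start/iter_end */
void foo_c(const float *a, const float *b, float t,
           const float *array, size_t array_len,
           float iter_start, float iter_end,
           float *result, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float it = iter_start + (iter_end - iter_start) * (float)i / (float)n;
        result[i] = array[clamp_index(a[i], array_len)] + t * b[i] + it;
    }
}
```

Each iteration reads only element i of the input streams and writes only element i of the output, which is what lets the kernel run once per fragment.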
Brook language: kernels
  • example
    – position update in velocity field
          kernel void updatepos (float2 pos<>,
                                 float2 vel[100],
                                 float timestep,
                                 out float2 newpos<>) {
            newpos = pos + vel[pos]*timestep;
          }

          updatepos (positions, velocityfield, 10.0f, positions);
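A C sketch of what updatepos computes may help, assuming the velocity field is a row-major 2D grid looked up with (pos.x, pos.y). The struct float2 here, the names updatepos_c and clamp_index, the index truncation and the clamping at the grid edge are all assumptions of this sketch, not behaviour taken from the slides.

```c
#include <stddef.h>

typedef struct { float x, y; } float2;  /* stand-in for Brook's float2 */

/* Hypothetical helper: truncate and clamp a float coordinate. */
static size_t clamp_index(float v, size_t len)
{
    if (!(v > 0.0f)) return 0;
    if (v >= (float)len) return len - 1;
    return (size_t)v;
}

/* newpos = pos + vel[pos]*timestep, one particle per iteration.
 * vel is a width*height grid stored row by row (assumption).
 * newpos may alias pos, as in the slide's call, because element i
 * is read before it is written. */
void updatepos_c(const float2 *pos, const float2 *vel,
                 size_t width, size_t height, float timestep,
                 float2 *newpos, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float2 p = pos[i];
        size_t x = clamp_index(p.x, width);
        size_t y = clamp_index(p.y, height);
        float2 v = vel[y * width + x];   /* the gather */
        newpos[i].x = p.x + v.x * timestep;
        newpos[i].y = p.y + v.y * timestep;
    }
}
```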
Brook language: reductions
  • reductions
    – compute single value from a stream
          reduce void sum (float a<>,
                           reduce float r<>) {
            r += a;
          }

          float a<100>;
          float r;

          sum(a, r);

        equivalent C loop:
          r = a[0];
          for (int i=1; i<100; i++)
            r += a[i];
Brook language: reductions
  • reductions
    – associative operations only
          (a+b)+c = a+(b+c)
      • sum, multiply, max, min, OR, AND, XOR
      • matrix multiply
Brook language: reductions
  • multi-dimension reductions
    – stream “shape” differences resolved by reduce function
          reduce void sum (float a<>,
                           reduce float r<>) {
            r += a;
          }

          float a<20>;
          float r<5>;

          sum(a, r);

        equivalent C loop:
          for (int i=0; i<5; i++) {
            r[i] = a[i*4];
            for (int j=1; j<4; j++)
              r[i] += a[i*4 + j];
          }
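The 20-to-5 case generalises to any input length n that is a multiple of the output length m. The sketch below is in plain C with a hypothetical name, sum_to_shape. Rejecting lengths that do not divide evenly is a choice made for this sketch, not a rule stated on the slides.

```c
#include <stddef.h>

/* Reduce a[0..n) into r[0..m): each output element sums one
 * contiguous group of n/m inputs. Returns 0 on success, -1 if the
 * shapes do not divide evenly (a restriction of this sketch). */
int sum_to_shape(const float *a, size_t n, float *r, size_t m)
{
    if (n == 0 || m == 0 || n % m != 0)
        return -1;

    size_t group = n / m;
    for (size_t i = 0; i < m; i++) {
        r[i] = a[i * group];              /* seed with first element */
        for (size_t j = 1; j < group; j++)
            r[i] += a[i * group + j];     /* fold in the rest */
    }
    return 0;
}
```

With n = 20 and m = 5 the group size is 4, which reproduces the loop on the slide.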
Brook language: stream repeat & stride
  • kernel arguments of different shape
    – resolved by repeat and stride
          kernel void foo (float a<>, float b<>,
                           out float result<>);

          float a<20>;
          float b<5>;
          float c<10>;

          foo(a, b, c);

        expands to:
          foo(a[0],  b[0], c[0])
          foo(a[2],  b[0], c[1])
          foo(a[4],  b[1], c[2])
          foo(a[6],  b[1], c[3])
          foo(a[8],  b[2], c[4])
          foo(a[10], b[2], c[5])
          foo(a[12], b[3], c[6])
          foo(a[14], b[3], c[7])
          foo(a[16], b[4], c[8])
          foo(a[18], b[4], c[9])
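Both stride and repeat can be expressed with one integer index mapping from the output position to the input position. This is a sketch only: resize_index and foo_resized are hypothetical names, and the formula i * n_in / n_out is an assumption chosen because it reproduces the slide's example (a strided by 2, b repeated twice). How Brook treats lengths that are not integer multiples is not covered here.

```c
#include <stddef.h>

/* Map output element i to an input element.
 * n_in > n_out gives a stride; n_in < n_out gives a repeat. */
size_t resize_index(size_t i, size_t n_in, size_t n_out)
{
    return i * n_in / n_out;
}

/* foo(a, b, c) with result = a + b, where a and b are resized to
 * the shape of the output stream c. */
void foo_resized(const float *a, size_t na,
                 const float *b, size_t nb,
                 float *c, size_t nc)
{
    for (size_t i = 0; i < nc; i++)
        c[i] = a[resize_index(i, na, nc)] + b[resize_index(i, nb, nc)];
}
```

For na = 20, nb = 5, nc = 10, output element 3 reads a[6] and b[1].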
Brook language: matrix vector multiply
          kernel void mul (float a<>, float b<>,
                           out float result<>) {
            result = a*b;
          }

          reduce void sum (float a<>,
                           reduce float result<>) {
            result += a;
          }

          float matrix<20,10>;
          float vector<1,10>;
          float tempmv<20,10>;
          float result<20,1>;

          mul(matrix, vector, tempmv);
          sum(tempmv, result);
  [Figure: mul: M times V (V repeated) = T; sum: T reduced to R]
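The two steps map onto two C loops. This sketch assumes a stream declared <rows, cols> is stored row by row, that the <1, cols> vector is repeated across every row, and that reducing to <rows, 1> sums along each row. mul_repeat and sum_rows are hypothetical names.

```c
#include <stddef.h>

/* mul(matrix, vector, tempmv): elementwise product, with the
 * single-row vector repeated for each matrix row. */
void mul_repeat(const float *matrix, const float *vector,
                float *temp, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            temp[r * cols + c] = matrix[r * cols + c] * vector[c];
}

/* sum(tempmv, result): reduce each row of temp to one value.
 * cols must be at least 1. */
void sum_rows(const float *temp, float *result,
              size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        result[r] = temp[r * cols];
        for (size_t c = 1; c < cols; c++)
            result[r] += temp[r * cols + c];
    }
}
```

Calling mul_repeat and then sum_rows with rows = 20 and cols = 10 corresponds to the two Brook calls above.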
brcc compiler: infrastructure
  • based on ctool
    – http://ctool.sourceforge.net
  • parser
    – build code tree
    – extend C grammar to accept Brook
  • convert
    – tree transformations
  • codegen
    – generate cg & hlsl code
    – call cgc, fxc
    – generate stub function
brcc compiler: kernel compilation
  Brook kernel:
          kernel void updatepos (float2 pos<>,
                                 float2 vel[100],
                                 float timestep,
                                 out float2 newpos<>) {
            newpos = pos + vel[pos]*timestep;
          }
  generated fragment program:
          float4 main (uniform float4 _workspace    : register (c0),
                       uniform sampler _tex_pos     : register (s0),
                       float2 _tex_pos              : TEXCOORD0,
                       uniform sampler vel          : register (s1),
                       uniform float4 vel_scalebias : register (c1),
                       uniform float timestep       : register (c2)) : COLOR0 {
            float4 _OUT;
            float2 pos;
            float2 newpos;
            pos = tex2D(_tex_pos, _tex_pos).xy;
            newpos = pos + tex2D(vel, (pos).xy*vel_scalebias.xy+vel_scalebias.zw).xy * timestep;
            _OUT.x = newpos.x;
            _OUT.y = newpos.y;
            _OUT.z = newpos.y;
            _OUT.w = newpos.y;
            return _OUT;
          }
brcc compiler: kernel compilation
  generated stub:
          static const char __updatepos_ps20[] = "ps_2_0...
          static const char __updatepos_fp30[] = "!!fp30...

          void updatepos (const __BRTStream& pos,
                          const __BRTStream& vel,
                          const float timestep,
                          const __BRTStream& newpos) {
            static const void *__updatepos_fp[] = {
              "fp30", __updatepos_fp30,
              "ps20", __updatepos_ps20,
              "cpu", (void *) __updatepos_cpu,
              "combine", 0,
              NULL };

            static __BRTKernel k(__updatepos_fp);
            k->PushStream(pos);
            k->PushGatherStream(vel);
            k->PushConstant(timestep);
            k->PushOutput(newpos);
            k->Map();
          }
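The stub hands the runtime a NULL-terminated table of (backend name, implementation) pairs. How brt actually walks that table is not shown on the slides. The sketch below is one plausible lookup, written in C with a hypothetical function name, find_backend, and it assumes entries always come in name/value pairs.

```c
#include <stddef.h>
#include <string.h>

/* Search a table laid out as { name0, impl0, name1, impl1, ..., NULL }
 * for the entry whose name matches. Returns the implementation
 * pointer, or NULL if the name is absent. An entry whose value is 0
 * (like "combine" in the stub) cannot be told apart from a missing
 * entry by this sketch. */
const void *find_backend(const void *const *table, const char *name)
{
    for (size_t i = 0; table[i] != NULL; i += 2) {
        if (strcmp((const char *)table[i], name) == 0)
            return table[i + 1];
    }
    return NULL;
}
```

A runtime built this way could ask for "fp30" or "ps20" depending on the graphics API in use and fall back to "cpu" when neither is present.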
brt runtime: streams
  • streams
    – separate texture per stream
  [Figure: vel in texture 1, pos in texture 2]
brt runtime: kernels
  • kernel execution
    – set stream texture as render target
    – bind inputs to texture units
    – issue screen size quad
      • texture coords provide stream positions
          kernel void foo (float a<>, float b<>,
                           out float result<>) {
            result = a + b;
          }
  [Figure: streams a and b feed kernel foo, which writes result]
brt runtime: reductions
  • reduction execution
    – multipass execution
    – associativity required
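One way to picture multipass reduction is a pairwise tree: each pass combines neighbouring elements and halves the stream, and associativity is what makes the regrouping legal. The C sketch below illustrates that idea only. The name multipass_sum is hypothetical, the slides do not give brt's real pass structure, and a GPU version would write each pass to a separate render target rather than updating in place.

```c
#include <stddef.h>

/* Sum buf[0..n) by repeated pairwise passes. The buffer is
 * overwritten with intermediate results. */
float multipass_sum(float *buf, size_t n)
{
    if (n == 0)
        return 0.0f;

    while (n > 1) {
        size_t pairs = n / 2;
        /* One pass: element i becomes the sum of elements 2i, 2i+1.
         * Writes land at or below the indices still to be read. */
        for (size_t i = 0; i < pairs; i++)
            buf[i] = buf[2 * i] + buf[2 * i + 1];
        /* Odd length: carry the last element into the next pass. */
        if (n % 2 == 1)
            buf[pairs] = buf[n - 1];
        n = (n + 1) / 2;
    }
    return buf[0];
}
```

For floating point the tree order can round differently from a left-to-right loop, since float addition is only approximately associative.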
research directions
  • demonstrate gpu streaming coprocessor
    – compiling Brook to gpus
    – evaluation
    – applications
research directions
  • applications
    – linear algebra
    – image processing
    – molecular dynamics (gromacs)
    – FEM
    – multigrid
    – raytracer
    – volume renderer
    – SIGGRAPH / GH papers
research directions
  • virtualize gpu resources
    – texture size and formats
      • packing streams to fit in 2D segmented memory space
          float matrix<8096,10,30,5>;
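A simple way to pack a multi-dimensional stream into a 2D texture is to linearize the index and then split it by the texture width. The sketch below assumes row-major order and a fixed texture width. linearize, to_texel and texel_addr are hypothetical names, and the slides do not say which packing Brook uses.

```c
#include <stddef.h>

typedef struct { size_t x, y; } texel_addr;

/* Row-major linear offset of idx[] within a stream of shape dims[]. */
size_t linearize(const size_t *dims, const size_t *idx, size_t rank)
{
    size_t lin = 0;
    for (size_t d = 0; d < rank; d++)
        lin = lin * dims[d] + idx[d];
    return lin;
}

/* Wrap a linear offset into rows of a texture tex_width texels wide. */
texel_addr to_texel(size_t lin, size_t tex_width)
{
    texel_addr t = { lin % tex_width, lin / tex_width };
    return t;
}
```

The same arithmetic run in reverse recovers the stream index from a fragment's texture coordinate.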
research directions
  • virtualize gpu resources
    – texture size and formats
      • support complex formats
          typedef struct {
            float3 pos;
            float3 vel;
            float mass;
          } particle;

          kernel void foo (particle a<>,
                           float timestep,
                           out particle b<>);

          float a<100>[8][2];
research directions
  • virtualize gpu resources
    – multiple outputs
      • simple: let cgc or fxc do dead code elimination
          kernel void foo (float3 a<>, float3 b<>, …,
                           out float3 x<>, out float3 y<>)

          kernel void foo1 (float3 a<>, float3 b<>, …,
                            out float3 x<>)
          kernel void foo2 (float3 a<>, float3 b<>, …,
                            out float3 y<>)
      • better: compute intermediates separately
research directions
  • virtualize gpu resources
    – limited instructions per kernel
      • generalize RDS algorithm for kernels
        – compute ideal # of passes for intermediate results
        – hard???
research directions
  • auto vectorization
          kernel void foo (float a<>, float b<>,
                           out float result<>) {
            result = a + b;
          }

          kernel void foo_faster (float4 a<>, float4 b<>,
                                  out float4 result<>) {
            result = a + b;
          }
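The transformation can be mimicked in C by regrouping the same data into 4-wide records, so that one kernel invocation handles four floats. This is a sketch of the idea, not of the compiler pass. The float4 struct and the names add_scalar and add_float4 are stand-ins, and it assumes the stream length is a multiple of 4.

```c
#include <stddef.h>

/* Scalar form: one float per invocation (kernel foo). */
void add_scalar(const float *a, const float *b, float *result, size_t n)
{
    for (size_t i = 0; i < n; i++)
        result[i] = a[i] + b[i];
}

typedef struct { float x, y, z, w; } float4;  /* stand-in for Brook's float4 */

/* Vectorized form: four floats per invocation (kernel foo_faster).
 * n4 is the number of float4 elements, a quarter of the float count. */
void add_float4(const float4 *a, const float4 *b, float4 *result, size_t n4)
{
    for (size_t i = 0; i < n4; i++) {
        result[i].x = a[i].x + b[i].x;
        result[i].y = a[i].y + b[i].y;
        result[i].z = a[i].z + b[i].z;
        result[i].w = a[i].w + b[i].w;
    }
}
```

On the gpu the four component adds are a single instruction, so the float4 kernel does the same work in a quarter of the fragments.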
research directions
  • Brook v0.2 support
    – stream operators
      • stencil, group, domain, repeat, stride, merge, …
    – building and manipulating data structures
      • scatterOp: a[i] += p
      • gatherOp: p = a[i]++
      • gpu primitives
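Read sequentially, the two operators look like the C loops below. This is a sketch of the semantics written on the slide, using a separate index array. scatter_add and gather_inc are hypothetical names, and the ordering of conflicting updates in a real parallel implementation is not addressed.

```c
#include <stddef.h>

/* scatterOp, a[i] += p: each stream element p[k] is accumulated
 * into the array slot chosen by index[k]. */
void scatter_add(float *a, const size_t *index, const float *p, size_t n)
{
    for (size_t k = 0; k < n; k++)
        a[index[k]] += p[k];
}

/* gatherOp, p = a[i]++: each stream element fetches the slot's
 * current value, then the slot is incremented. */
void gather_inc(float *a, const size_t *index, float *p, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        p[k] = a[index[k]];
        a[index[k]] += 1.0f;
    }
}
```

Both loops write through a data-dependent index, which is what fragment programs of this generation cannot do directly.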
research directions
  • gpu areas of improvement
    – reduction registers
    – texture constraints
    – scatter capabilities
    – programmable blending
      • gatherOp, scatterOp
Brook status
  • team
    – Jeremy Sugerman
    – Daniel Horn
    – Tim Foley
    – Ian Buck
  • beta release
    – December 15th
    – sourceforge
Questions?
  Fly-fishing fly images from The English Fly Fishing Shop