Low-level Shader Optimization for Next-Gen and DX11


Emil Persson, Head of Research, Avalanche Studios

Introduction
- GDC 2013: “Low-level Thinking in High-level Shading Languages”
  - Covered the basic shader feature set
  - Float ALU ops
- New since last year
  - Next-gen consoles
  - GCN-based GPUs
  - DX11 feature set mainstream
  - 70% on Steam have DX11 GPUs [1]

Main lessons from last year
- You get what you write!
  - Don't rely on the compiler “optimizing” for you
  - Compiler can't change operation semantics
- Write code in MAD-form
- Separate scalar and vector work
- Also look inside functions
  - Even built-in functions!
- Add parentheses to parallelize work for VLIW

More lessons
- Put abs() and negation on input, saturate() on output
- rcp(), rsqrt(), exp2(), log2(), sin(), cos() map to HW
- Watch out for inverse trigonometry!
- Low-level and high-level optimizations are not mutually exclusive!
  - Do both!

A look at modern hardware
- 7-8 years from last-gen to next-gen
  - Lots of things have changed
  - Old assumptions don't necessarily hold anymore
- Guess the instruction count!

    TextureCube Cube;
    SamplerState Samp;

    float4 main(float3 tex_coord : TEXCOORD) : SV_Target
    {
        return Cube.Sample(Samp, tex_coord);
    }

  D3D bytecode:

    sample o0.xyzw, v0.xyzx, t0.xyzw, s0

Sampling a cubemap

    shader main
      s_mov_b64            s[2:3], exec
      s_wqm_b64            exec, exec
      s_mov_b32            m0, s16
      v_interp_p1_f32      v2, v0, attr0.x
      v_interp_p2_f32      v2, v1, attr0.x
      v_interp_p1_f32      v3, v0, attr0.y
      v_interp_p2_f32      v3, v1, attr0.y
      v_interp_p1_f32      v0, v0, attr0.z
      v_interp_p2_f32      v0, v1, attr0.z
      v_cubetc_f32         v1, v2, v3, v0
      v_cubesc_f32         v4, v2, v3, v0
      v_cubema_f32         v5, v2, v3, v0
      v_cubeid_f32         v8, v2, v3, v0
      v_rcp_f32            v2, abs(v5)
      s_mov_b32            s0, 0x3fc00000
      v_mad_legacy_f32     v7, v1, v2, s0
      v_mad_legacy_f32     v6, v4, v2, s0
      image_sample         v[0:3], v[6:9], s[4:11], s[12:15] dmask:0xf
      s_mov_b64            exec, s[2:3]
      s_waitcnt            vmcnt(0)
      v_cvt_pkrtz_f16_f32  v0, v0, v1
      v_cvt_pkrtz_f16_f32  v1, v2, v3
      exp                  mrt0, v1 done compr vm
      s_endpgm
    end

- 15 VALU
  - 1 transcendental
- 6 SALU
- 1 IMG
- 1 EXP

Hardware evolution
- Fixed function moving to ALU
  - Interpolators
  - Vertex fetch
  - Export conversion
  - Projection/Cubemap math
  - Gradients
    - Was ALU, became TEX, back to ALU (as swizzle + sub)

Hardware evolution
- Most of everything is backed by memory
  - No constant registers
  - Textures, sampler-states, buffers
  - Unlimited resources
  - “Stateless compute”

NULL shader
- AMD DX10 hardware

    float4 main(float4 tex_coord : TEXCOORD0) : SV_Target
    {
        return tex_coord;
    }

    00 EXP_DONE: PIX0, R0
    END_OF_PROGRAM

Not so NULL shader
- AMD DX11 hardware

    00 ALU: ADDR(32) CNT(8)
          0  x: INTERP_XY  R1.x,  R0.y,  Param0.x  VEC_210
             y: INTERP_XY  R1.y,  R0.x,  Param0.x  VEC_210
             z: INTERP_XY  ____,  R0.y,  Param0.x  VEC_210
             w: INTERP_XY  ____,  R0.x,  Param0.x  VEC_210
          1  x: INTERP_ZW  ____,  R0.y,  Param0.x  VEC_210
             y: INTERP_ZW  ____,  R0.x,  Param0.x  VEC_210
             z: INTERP_ZW  R1.z,  R0.y,  Param0.x  VEC_210
             w: INTERP_ZW  R1.w,  R0.x,  Param0.x  VEC_210
    01 EXP_DONE: PIX0, R1
    END_OF_PROGRAM

    shader main
      s_mov_b32            m0, s2
      v_interp_p1_f32      v2, v0, attr0.x
      v_interp_p2_f32      v2, v1, attr0.x
      v_interp_p1_f32      v3, v0, attr0.y
      v_interp_p2_f32      v3, v1, attr0.y
      v_interp_p1_f32      v4, v0, attr0.z
      v_interp_p2_f32      v4, v1, attr0.z
      v_interp_p1_f32      v0, v0, attr0.w
      v_interp_p2_f32      v0, v1, attr0.w
      v_cvt_pkrtz_f16_f32  v1, v2, v3
      v_cvt_pkrtz_f16_f32  v0, v4, v0
      exp                  mrt0, v1, v0 done compr vm
      s_endpgm
    end

Not so NULL shader anymore

    shader main
      // Set up parameter address and primitive mask
      s_mov_b32            m0, s2
      // Interpolate, 2 ALUs per float
      v_interp_p1_f32      v2, v0, attr0.x
      v_interp_p2_f32      v2, v1, attr0.x
      v_interp_p1_f32      v3, v0, attr0.y
      v_interp_p2_f32      v3, v1, attr0.y
      v_interp_p1_f32      v4, v0, attr0.z
      v_interp_p2_f32      v4, v1, attr0.z
      v_interp_p1_f32      v0, v0, attr0.w
      v_interp_p2_f32      v0, v1, attr0.w
      // FP32→FP16 conversion, 1 ALU per 2 floats
      v_cvt_pkrtz_f16_f32  v1, v2, v3
      v_cvt_pkrtz_f16_f32  v0, v4, v0
      // Export compressed color
      exp                  mrt0, v1, v0 done compr vm
      s_endpgm
    end

NULL shader
- AMD DX11 hardware

    float4 main(float4 scr_pos : SV_Position) : SV_Target
    {
        return scr_pos;
    }

    00 EXP_DONE: PIX0, R0
    END_OF_PROGRAM

    exp       mrt0, v2, v3, v4, v5 vm done
    s_endpgm

Shader inputs
- Shader gets a few freebies from the scheduler
  - VS: Vertex Index
  - PS: Barycentric coordinates, SV_Position
  - CS: Thread and group IDs
- Not the same as earlier hardware
- Not the same as APIs pretend
- Anything else must be fetched or computed

Shader inputs
- There is no such thing as a VertexDeclaration
  - Vertex data manually fetched by VS
  - Driver patches shader when VDecl changes

    float4 main(uint id : SV_VertexID) : SV_Position
    {
        return asfloat(id);
    }

    v_mov_b32     v1, 1.0
    exp           pos0, v0 done
    exp           param0, v1, v1

    float4 main(float4 tc : TC) : SV_Position
    {
        return tc;
    }

    s_swappc_b64  s[0:1], s[0:1]    // Sub-routine call
    v_mov_b32     v0, 1.0
    exp           pos0, v4, v5, v6, v7 done
    exp           param0, v0

Shader inputs
- Up to 16 user SGPRs
  - The primary communication path from driver to shader
- Shader Resource Descriptors take 4-8 SGPRs
  - Not a lot of resources fit by default
  - Typically shader needs to load from a table

Shader inputs
- Texture Descriptor is 8 SGPRs

  Raw resource desc:

    return T0.Load(0);

    v_mov_b32       v0, 0
    v_mov_b32       v1, 0
    v_mov_b32       v2, 0
    image_load_mip  v[0:3], v[0:3], s[4:11]

  Resource desc list:

    return T0.Load(0) * T1.Load(0);

    // Explicitly fetch resource descs
    s_load_dwordx8  s[4:11], s[2:3], 0x00
    s_load_dwordx8  s[12:19], s[2:3], 0x08
    v_mov_b32       v0, 0
    v_mov_b32       v1, 0
    v_mov_b32       v2, 0
    s_waitcnt       lgkmcnt(0)
    image_load_mip  v[3:6], v[0:3], s[4:11]
    image_load_mip  v[7:10], v[0:3], s[12:19]

Shader inputs
- Interpolation costs two ALU per float
  - Packing does nothing on GCN
  - Use nointerpolation on constant values
    - A single ALU per float
- SV_Position
  - Comes preloaded, no interpolation required
- noperspective
  - Still two ALU, but can save a component


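Interpolation
- Using nointerpolation

    float4 main(float4 tc : TC) : SV_Target
    {
        return tc;
    }

    v_interp_p1_f32   v2, v0, attr0.x
    v_interp_p2_f32   v2, v1, attr0.x
    v_interp_p1_f32   v3, v0, attr0.y
    v_interp_p2_f32   v3, v1, attr0.y
    v_interp_p1_f32   v4, v0, attr0.z
    v_interp_p2_f32   v4, v1, attr0.z
    v_interp_p1_f32   v0, v0, attr0.w
    v_interp_p2_f32   v0, v1, attr0.w

    float4 main(nointerpolation float4 tc : TC) : SV_Target
    {
        return tc;
    }

    v_interp_mov_f32  v0, p0, attr0.x
    v_interp_mov_f32  v1, p0, attr0.y
    v_interp_mov_f32  v2, p0, attr0.z
    v_interp_mov_f32  v3, p0, attr0.w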

Shader inputs
- SV_IsFrontFace comes as 0x0 or 0xFFFFFFFF
  - return (face ? 0xFFFFFFFF : 0x0) is a NOP
  - Or declare as uint (despite what documentation says)
  - Typically used to flip normals for backside lighting

    float flip = face ? 1.0f : -1.0f;
    return normal * flip;

    v_cmp_ne_i32   vcc, 0, v2
    v_cndmask_b32  v0, -1.0, 1.0, vcc
    v_mul_f32      v1, v0, v1
    v_mul_f32      v2, v0, v2
    v_mul_f32      v0, v0, v3

    return face ? normal : -normal;

    v_cmp_ne_i32   vcc, 0, v2
    v_cndmask_b32  v0, -v0, v0, vcc
    v_cndmask_b32  v1, -v1, v1, vcc
    v_cndmask_b32  v2, -v3, v3, vcc

    return asfloat(BitFieldInsert(face, asuint(normal), asuint(-normal)));

    v_bfi_b32      v0, v2, v0, -v0
    v_bfi_b32      v1, v2, v1, -v1
    v_bfi_b32      v2, v2, v3, -v3

GCN instructions
- Instructions limited to 32 or 64 bits
  - Can only read one scalar reg or one literal constant
- Special inline constants
  - 0.5f, 1.0f, 2.0f, 4.0f, -0.5f, -1.0f, -2.0f, -4.0f
  - -64..64
- Special output multiplier values 0.5f, 2.0f, 4.0f
  - Underused by compilers (fxc also needlessly interferes)

GCN instructions
- GCN is “scalar” (i.e. not VLIW or vector)
  - Operates on individual floats/ints
  - Don't confuse with GCN's scalar/vector instructions!
- Wavefront of 64 “threads”
  - Those 64 “scalars” make a SIMD vector
  - ... which is what vector instructions work on
- Additional scalar unit on the side
  - Independent execution
  - Loads constants, does control flow etc.

GCN instructions
- Full rate
  - Float add/sub/mul/mad/fma
  - Integer add/sub/mul24/mad24/logic
  - Type conversion, floor()/ceil()/round()
- ½ rate
  - Double add

GCN instructions
- ¼ rate
  - Transcendentals (rcp(), rsq(), sqrt(), etc.)
  - Double mul/fma
  - Integer 32-bit multiply
- For “free” (in some sense)
  - Scalar operations

GCN instructions
- Super expensive
  - Integer divides
    - Unsigned integer somewhat less horrible
  - Inverse trigonometry
- Caution:
  - Instruction count not indicative of performance anymore

GCN instructions
- Sometimes MAD becomes two vector instructions

    return (x + 3.0f) * 1.5f;

    v_add_f32  v0, 0x40400000, v0
    v_mul_f32  v0, 0x3fc00000, v0

    return x * 1.5f + 4.5f;

    v_mov_b32  v1, 0x40900000
    s_mov_b32  s0, 0x3fc00000
    v_mac_f32  v1, s0, v0

    return x * c.x + c.y;

    v_mov_b32  v1, s1
    v_mac_f32  v1, s0, v0

- So writing in MAD-form is obsolete now?
  - Nope

GCN instructions
- MAD-form still usually beneficial
  - When none of the instruction limitations apply
  - When using inline constants (1.0f, 2.0f, 0.5f etc.)
  - When input is a vector
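
As a minimal sketch of what “MAD-form” means at the source level, the two hypothetical helpers below remap a value from [start, end] to [0, 1]. The function and parameter names are illustrative and do not come from the slides, and `scale` and `bias` are assumed to be precomputed outside the shader. Whether the MAD version is actually cheaper depends on the instruction limitations listed above.

```hlsl
// ADD-MUL form: subtract, then multiply. Typically two vector instructions.
float remap_add_mul(float x, float start, float inv_range)
{
    return (x - start) * inv_range;
}

// MAD form: x * scale + bias, where (assumed precomputed, e.g. on the CPU)
//   scale = inv_range
//   bias  = -start * inv_range
// Results can differ from the ADD-MUL form in the last bits.
float remap_mad(float x, float scale, float bias)
{
    return x * scale + bias;
}
```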

GCN instructions
- ADD-MUL vs. MAD: single immediate constant

    ADD-MUL:  return (x + y) * 3.0f;

    v_add_f32    v0, v2, v0
    v_mul_f32    v0, 0x40400000, v0

    MAD:      return x * 3.0f + y;

    v_madmk_f32  v0, v2, 0x40400000, v0

- ADD-MUL vs. MAD: inline constant

    ADD-MUL:  return (x + 3.0f) * 0.5f;

    v_add_f32    v0, 0x40400000, v0
    v_mul_f32    v0, 0.5, v0

    MAD:      return x * 0.5f + 1.5f;

    v_madak_f32  v0, 0.5, v0, 0x3fc00000

    s_mov_b32    s0, 0x3fc00000
    v_add_f32    v0, v0, s0 div:2

GCN instructions
- ADD-MUL vs. MAD: vector operation

    ADD-MUL:  return (v4 + c.x) * c.y;

    v_add_f32  v1, s0, v2
    v_add_f32  v2, s0, v3
    v_add_f32  v3, s0, v4
    v_add_f32  v0, s0, v0
    v_mul_f32  v1, s1, v1
    v_mul_f32  v2, s1, v2
    v_mul_f32  v3, s1, v3
    v_mul_f32  v0, s1, v0

    MAD:      return v4 * c.x + c.y;

    v_mov_b32  v1, s1
    v_mad_f32  v2, v2, s0, v1
    v_mad_f32  v3, v3, s0, v1
    v_mad_f32  v4, v4, s0, v1
    v_mac_f32  v1, s0, v0

Vectorization
- Scalar code

    return 1.0f - v.x * v.x - v.y * v.y;

    v_mad_f32  v2, -v2, v2, 1.0
    v_mad_f32  v2, -v0, v0, v2

- Vectorized code

    return 1.0f - dot(v.xy, v.xy);

    v_mul_f32  v2, v2, v2
    v_mac_f32  v2, v0, v0
    v_sub_f32  v0, 1.0, v2

ROPs
- HD7970
  - 264 GB/s BW, 32 ROPs
  - RGBA8: 925 MHz * 32 * 4 bytes = 118 GB/s (ROP bound)
  - RGBA16F: 925 MHz * 32 * 8 bytes = 236 GB/s (ROP bound)
  - RGBA32F: 925 MHz * 32 * 16 bytes = 473 GB/s (BW bound)
- PS4
  - 176 GB/s BW, 32 ROPs
  - RGBA8: 800 MHz * 32 * 4 bytes = 102 GB/s (ROP bound)
  - RGBA16F: 800 MHz * 32 * 8 bytes = 204 GB/s (BW bound)

ROPs
- XB1
  - 16 ROPs
  - ESRAM: 109 GB/s (write) BW
  - DDR3: 68 GB/s BW
  - RGBA8: 853 MHz * 16 * 4 bytes = 54 GB/s (ROP bound)
  - RGBA16F: 853 MHz * 16 * 8 bytes = 109 GB/s (ROP/BW)
  - RGBA32F: 853 MHz * 16 * 16 bytes = 218 GB/s (BW bound)

ROPs
- Not enough ROPs to utilize all BW!
  - Always for RGBA8
  - Often for RGBA16F
- Bypass ROPs with compute shader
  - Write straight to a UAV texture or buffer
  - Done right, you'll be BW bound
  - We have seen 60-70% BW utilization improvements
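
A minimal sketch of the ROP bypass, assuming a full-screen pass that would otherwise run as a pixel shader. The resource names, register bindings and the 8x8 thread group (64 threads, one wavefront) are illustrative choices, not taken from the slides.

```hlsl
// Hypothetical resources; adjust names and bindings to your own setup.
Texture2D<float4>   Source : register(t0);
RWTexture2D<float4> Dest   : register(u0);

[numthreads(8, 8, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    uint width, height;
    Dest.GetDimensions(width, height);

    // Skip threads outside the target when its size is not a multiple of 8.
    if (id.x >= width || id.y >= height)
        return;

    // Placeholder for the work the pixel shader would have done.
    float4 color = Source.Load(int3(id.xy, 0));

    // Write straight to the UAV instead of exporting through the ROPs.
    Dest[id.xy] = color;
}
```

The dispatch size would be the target size divided by 8 in each dimension, rounded up.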

Branching
- Branching managed by scalar unit
  - Execution is controlled by a 64-bit mask in scalar regs
  - Does not count towards your vector instruction count
  - Branchy code tends to increase GPRs
- x ? a : b
  - Semantically a branch, typically optimized to CndMask
  - Can use explicit CndMask()

Integer
- mul24()
  - Inputs in 24 bit, result full 32 bit
  - Get the upper 16 bits of the 48-bit result with mul24_hi()
  - 4x speed over 32-bit mul
  - Also has a 24-bit mad
    - No 32-bit counterpart
  - The addition part is full 32 bit

24-bit multiply
- mul32 (4 cycles)

    return i * j;

    v_mul_lo_u32   v0, v0, v1

- mul24 (1 cycle)

    return mul24(i, j);

    v_mul_u32_u24  v0, v0, v1

- mad32 (5 cycles)

    return i * j + k;

    v_mul_lo_u32   v0, v0, v1
    v_add_i32      v0, vcc, v0, v2

- mad24 (1 cycle)

    return mul24(i, j) + k;

    v_mad_u32_u24  v0, v0, v1, v2

Integer division
- Not natively supported by HW
- Compiler does some obvious optimizations
  - i / 4 => i >> 2
- Also some less obvious optimizations [2]
  - i / 3 => mul_hi(i, 0xAAAAAAAB) >> 1
- General case emulated with loads of instructions
  - ~40 cycles for unsigned
  - ~48 cycles for signed

Integer division
- Stick to unsigned if possible
  - Helps with divide by non-POT constant too
- Implement your own mul24-variant
  - i / 3 ⇒ mul24(i, 0xAAAB) >> 17
  - Works with i in [0, 32767*3+2]
- Consider converting to float
  - Can do with 8 cycles including conversions
  - Special case, doesn't always work
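
The divide-by-3 trick above can be sketched as follows. Standard HLSL has no mul24() intrinsic, so this sketch uses a plain multiply and assumes that a platform exposing a 24-bit multiply would substitute it; the function name is hypothetical.

```hlsl
// Unsigned divide by 3, valid for i in [0, 98303] (= 32767 * 3 + 2).
// 0xAAAB = 43691 = (2^17 + 1) / 3, so (i * 0xAAAB) >> 17 equals i / 3
// for every i in that range, and the product stays below 2^32.
// Both inputs fit in 24 bits, which is what makes a 24-bit multiply usable.
uint div3_small(uint i)
{
    // Assumption: replace '*' with the platform's mul24() where available.
    return (i * 0xAAABu) >> 17;
}
```

The upper bound comes from the 32-bit result: 98303 * 43691 is the largest product that still fits in 32 bits.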

Doubles
- Do you actually need doubles?
- My professional career's entire list of uses of doubles:
  - Mandelbrot
  - Quick hacks
  - Debug code to check if precision is the issue

Doubles
- Use FMA if possible
  - Same idea as with MAD/FMA on floats
  - No double equivalent to float MAD
- No direct support for division
  - Also true for floats, but x * rcp(y) done by compiler
  - 0.5 ULP division possible, but far more expensive
  - Double a / b very expensive
  - Explicit x * rcp(y) is cheaper (but still not cheap)

Packing
- Built-in functions for packing
  - f32tof16()
  - f16tof32()
- Hardware has bit-field manipulation instructions
  - Fast unpack of arbitrarily packed bits

    int r = s & 0x1F;          // 1 cycle
    int g = (s >> 5) & 0x3F;   // 1 cycle
    int b = (s >> 11) & 0x1F;  // 1 cycle
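
For illustration, the built-in conversion functions can pack two floats into one 32-bit value. This is a sketch with hypothetical helper names; f32tof16() and f16tof32() are the standard HLSL intrinsics named on the slide.

```hlsl
// Pack two floats as half-precision values into one 32-bit uint.
// f32tof16() returns the half in the low 16 bits of its result.
uint pack_half2(float2 v)
{
    return f32tof16(v.x) | (f32tof16(v.y) << 16);
}

// Unpack the two halves again. The mask on the low half is kept explicit
// for clarity; f16tof32() converts the half stored in the low 16 bits.
float2 unpack_half2(uint p)
{
    return float2(f16tof32(p & 0xFFFF), f16tof32(p >> 16));
}
```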

Float
- Prefer conditional assignment
  - sign() - Horribly poorly implemented
  - step() - Confusing code and suboptimal for typical case
- Special hardware features
  - min3(), max3(), med3()
  - Useful for faster reductions
  - General clamp: med3(x, min_val, max_val)
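
A small sketch of what “prefer conditional assignment” can look like in practice. The helper names are hypothetical, and the commented-out lines show the sign()/step() formulations they are meant to replace.

```hlsl
// Instead of: return value * sign(x);
// Note that sign(0) is 0, so the two versions differ when x == 0.
float apply_sign(float value, float x)
{
    return (x >= 0.0f) ? value : -value;
}

// Instead of: return lerp(b, a, step(edge, x));
// step(edge, x) is 1 when x >= edge, so this selects 'a' in that case.
float select_above(float x, float edge, float a, float b)
{
    return (x >= edge) ? a : b;
}
```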

Texturing
- SamplerStates are data
  - Must be fetched by shader
  - Prefer Load() over Sample()
  - Reuse sampler states
  - Old-school texture ↔ sampler-state link suboptimal

Texturing
- Cubemapping
  - Adds a bunch of ALU operations
  - Skybox with cubemap vs. six 2D textures
- Sample offsets
  - Load(tc, offset) bad
    - Consider using Gather()
  - Sample(tc, offset) fine
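
As a rough sketch of the Gather() suggestion, the hypothetical function below averages a 2x2 footprint with a single Gather() rather than four Load() calls with offsets. The resource names are assumptions, and `uv` is assumed to point at the corner shared by the four texels.

```hlsl
// Hypothetical resources; names are illustrative only.
Texture2D<float> Depth;
SamplerState     PointClamp;

// Average of a 2x2 footprint using one Gather().
float average_2x2(float2 uv)
{
    // Gather() returns one channel from each of the four texels
    // that bilinear filtering would use at 'uv'.
    float4 texels = Depth.Gather(PointClamp, uv);
    return dot(texels, float4(0.25f, 0.25f, 0.25f, 0.25f));
}
```

Because the result is a plain average, it does not depend on the order in which Gather() returns the four texels.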

Registers
- GPUs hide latency by keeping multiple wavefronts in flight
- GCN can support up to 10 simultaneous wavefronts
- Fixed pool of 64 KB for VGPRs, 2 KB for SGPRs
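
A back-of-the-envelope sketch of how the VGPR pool limits wavefronts in flight. It assumes 32-bit registers and 64 lanes per wavefront, and it ignores register allocation granularity as well as SGPR and LDS limits, so real numbers can be somewhat lower.

```latex
\text{VGPRs available per lane}
  = \frac{64 \cdot 1024\ \text{bytes}}{64\ \text{lanes} \cdot 4\ \text{bytes}}
  = 256
\qquad
\text{wavefronts in flight}
  = \min\!\left(10,\ \left\lfloor \frac{256}{\text{VGPRs used by the shader}} \right\rfloor\right)
```

Under these assumptions a shader using 25 VGPRs or fewer can reach all 10 wavefronts, 64 VGPRs allows 4, and 128 VGPRs allows 2.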

Registers
- Keep register life-time low

    for (each){ WorkA(); }
    for (each){ WorkB(); }

  is better than:

    for (each){ WorkA(); WorkB(); }

- Don't just sample and output an alpha just because you have one available

Registers
- Consider using specialized shaders
  - #ifdef instead of branching
  - Über-shaders pay for the worst case
- Reduce branch nesting
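
A minimal sketch of specializing with the preprocessor instead of a runtime branch. The resource names, the USE_DETAIL macro and the detail-layer math are all hypothetical; the macro is assumed to be defined at compile time, for example with fxc's /D option.

```hlsl
// Hypothetical resources; names are illustrative only.
Texture2D    Albedo;
Texture2D    Detail;
SamplerState Samp;

float4 main(float2 uv : TEXCOORD0) : SV_Target
{
    float4 color = Albedo.Sample(Samp, uv);
#ifdef USE_DETAIL
    // Only the variant compiled with USE_DETAIL pays the registers
    // and the extra fetch for the detail layer.
    color.rgb *= Detail.Sample(Samp, uv * 8.0f).rgb * 2.0f;
#endif
    return color;
}
```

Each variant then gets a register allocation that matches what it actually does, rather than the worst case across all features.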

Register indexing
- Expensive
  - Implemented as a loop
  - Variable cost, depending on coherency

    float arr[16];
    // ...
    return arr[index];

      s_cbranch_execz      label_004D
      s_mov_b64            s[22:23], exec
    label_0040:
      v_readfirstlane_b32  s20, v12
      s_mov_b32            m0, s20
      s_mov_b64            s[24:25], exec
      v_cmpx_eq_i32        s[26:27], s20, v12
      s_andn2_b64          s[24:25], s[24:25], s[26:27]
      v_mov_b32            v2, 0
      v_cmpx_lt_u32        s[26:27], s20, 16
      v_movrels_b32        v2, v14
      s_mov_b64            exec, s[24:25]
      s_cbranch_execnz     label_0040
      s_mov_b64            exec, s[22:23]
    label_004D:

Register indexing
- Manually select
  - 2 * (N-1) ALUs
  - N=16 ⇒ 30 ALU

    return (index ==  0)? s[0]  :
           (index ==  1)? s[1]  :
           (index ==  2)? s[2]  :
           (index ==  3)? s[3]  :
           (index ==  4)? s[4]  :
           (index ==  5)? s[5]  :
           (index ==  6)? s[6]  :
           (index ==  7)? s[7]  :
           (index ==  8)? s[8]  :
           (index ==  9)? s[9]  :
           (index == 10)? s[10] :
           (index == 11)? s[11] :
           (index == 12)? s[12] :
           (index == 13)? s[13] :
           (index == 14)? s[14] : s[15];

    v_cmp_eq_i32   s[2:3], v20, 1
    v_cmp_eq_i32   s[12:13], v20, 2
    v_cmp_eq_i32   s[14:15], v20, 3
    v_cmp_eq_i32   s[16:17], v20, 4
    v_cmp_eq_i32   s[18:19], v20, 5
    v_cmp_eq_i32   s[20:21], v20, 6
    v_cmp_eq_i32   s[22:23], v20, 7
    v_cmp_eq_i32   s[24:25], v20, 8
    v_cmp_eq_i32   s[26:27], v20, 9
    v_cmp_eq_i32   s[28:29], v20, 10
    v_cmp_eq_i32   s[30:31], v20, 11
    v_cmp_eq_i32   s[32:33], v20, 12
    v_cmp_eq_i32   s[34:35], v20, 13
    v_cmp_eq_i32   vcc, 14, v20
    v_cndmask_b32  v21, v19, v16, vcc
    v_cndmask_b32  v21, v21, v18, s[34:35]
    v_cndmask_b32  v21, v21, v17, s[32:33]
    v_cndmask_b32  v21, v21, v15, s[30:31]
    v_cndmask_b32  v21, v21, v14, s[28:29]
    v_cndmask_b32  v21, v21, v11, s[26:27]
    v_cndmask_b32  v21, v21, v10, s[24:25]
    v_cndmask_b32  v21, v21, v9, s[22:23]
    v_cndmask_b32  v21, v21, v8, s[20:21]
    v_cndmask_b32  v21, v21, v7, s[18:19]
    v_cndmask_b32  v21, v21, v13, s[16:17]
    v_cndmask_b32  v21, v21, v6, s[14:15]
    v_cndmask_b32  v21, v21, v12, s[12:13]
    v_cndmask_b32  v21, v21, v5, s[2:3]
    v_cmp_ne_i32   vcc, 0, v20
    v_cndmask_b32  v20, v4, v21, vcc

Register indexing
- Optimized manual selection
  - N+2*log(N)-2
  - N=16 ⇒ 22 ALU

    bool lo = (index < 8);
    float b0 = lo? s[0] : s[8];
    float b1 = lo? s[1] : s[9];
    float b2 = lo? s[2] : s[10];
    float b3 = lo? s[3] : s[11];
    float b4 = lo? s[4] : s[12];
    float b5 = lo? s[5] : s[13];
    float b6 = lo? s[6] : s[14];
    float b7 = lo? s[7] : s[15];

    lo = ((index & 0x7) < 4);
    float c0 = lo? b0 : b4;
    float c1 = lo? b1 : b5;
    float c2 = lo? b2 : b6;
    float c3 = lo? b3 : b7;

    lo = ((index & 0x3) < 2);
    float d0 = lo? c0 : c2;
    float d1 = lo? c1 : c3;

    lo = ((index & 0x1) < 1);
    return lo? d0 : d1;

    v_cmp_gt_i32   vcc, 8, v20
    v_cndmask_b32  v21, v10, v4, vcc
    v_cndmask_b32  v22, v11, v5, vcc
    v_cndmask_b32  v23, v14, v12, vcc
    v_cndmask_b32  v24, v15, v6, vcc
    v_cndmask_b32  v25, v17, v13, vcc
    v_cndmask_b32  v26, v18, v7, vcc
    v_cndmask_b32  v27, v16, v8, vcc
    v_cndmask_b32  v28, v19, v9, vcc
    v_and_b32      v29, 7, v20
    v_and_b32      v30, 3, v20
    v_and_b32      v20, 1, v20
    v_cmp_gt_u32   vcc, 4, v29
    v_cmp_lt_u32   s[2:3], v30, 2
    v_cmp_lt_u32   s[12:13], v20, 1
    v_cndmask_b32  v20, v25, v21, vcc
    v_cndmask_b32  v21, v26, v22, vcc
    v_cndmask_b32  v22, v27, v23, vcc
    v_cndmask_b32  v23, v28, v24, vcc
    v_cndmask_b32  v20, v22, v20, s[2:3]
    v_cndmask_b32  v21, v23, v21, s[2:3]
    v_cndmask_b32  v20, v21, v20, s[12:13]

How to get to the HW asm
- GPUShaderAnalyzer doesn't support GCN yet
- CodeXL to the rescue!
  - Cmdline only, but gets the job done
  - Detailed AMD blog-post [4]
  - Provides ASM, GPR stats etc.

    CodeXLAnalyzer -c Hawaii -f main -s HLSL -p ps_5_0 -a stats.csv --isa ISA.txt Shader.hlsl

    ******** Build Began for 1 Devices********
    Compile for device: Hawaii - Succeeded.
    Extracting ISA for device: Hawaii - Succeeded.
    Writing Analysis data succeeded!

Things shader authors should stop doing
- pow(color, 2.2f)
  - You almost certainly did something wrong
  - This is NOT sRGB!
- normal = Normal.Sample(...) * 2.0f - 1.0f;
  - Use signed texture format instead
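
To illustrate why pow(color, 2.2f) is not sRGB: the sRGB transfer function is piecewise, with a linear segment near black and a 2.4 exponent elsewhere. The sketch below is a manual per-channel conversion with a hypothetical function name; in most cases an sRGB texture or render target format, which lets the hardware do the conversion, is the more likely fix.

```hlsl
// sRGB-encoded value to linear, one channel.
// Assumes c is in [0, 1].
float srgb_to_linear(float c)
{
    // Linear segment near black, power curve with exponent 2.4 elsewhere.
    return (c <= 0.04045f) ? c / 12.92f
                           : pow((c + 0.055f) / 1.055f, 2.4f);
}
```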

Things compilers should stop doing
- x * 2 => x + x
  - Makes absolutely no sense, confuses optimizer
- saturate(a * a) => min(a * a, 1.0f)
  - This is a pessimization
- x * 4 + x => x * 5
  - This is a pessimization
- (x << 2) + x => x * 5
  - Dafuq is wrong with you?

Things compilers should stop doing
- asfloat(0x7FFFFF) => 0
  - This is a bug. It's a cast. Even if it was a MOV it should still preserve all bits and not flush denorms.
- Spend awful lots of time trying to unroll loops with [loop] tag
  - I don't even understand this one
- Treat vectors as anything else than a collection of floats

Things compilers should be doing
- x * 5 => (x << 2) + x
- Use mul24() when possible
  - Compiler for HD6xxx detects some cases, not for GCN
- Expose more hardware features as intrinsics
- More and better semantics in the D3D bytecode
- Require type conversions to be explicit

Potential extensions
- Hardware has many unexplored features
  - Cross-thread communication
  - “Programmable” branching
  - Virtual functions
  - Goto

References
[1] Steam HW stats
[2] Division of integers by constants
[3] Open GPU Documentation
[4] CodeXL for game developers: How to analyze your HLSL for GCN

Questions?
- Twitter: _Humus_
- Email: emil.persson@avalanchestudios.se