Generating Truly Optimal Code Using a Metaprogramming Library

Generating Truly Optimal Code Using a Metaprogramming Library Don Clugston First D Programming Conference, 24 August 2007

String mixins in D – undercooked, but very tasty char [] greet(char [] greeting) { return `writefln(“` ~ greeting ~`, world!”); `; } void main() { mixin( greet( “Hello” ) ); } • Compiles to: void main() { writefln( “Hello, world!” ); } • Vindicates built-in string operations

The Challenge • Fortran: BLAS (a standard set of highly optimised routines). The crucial functions are coded in asm. y[] += a * x[] void DAXPY(double [] y, double [] x, double a) { for (i = 0; i < y. length; ++i) y[i] += x[i] * a; } • But BLAS is limited – nothing for simple things: – x[] = y[] - z[] – a[] = r[]*0. 3 + g[]*0. 5 + b[]*0. 2;

Operating overloading • Gives ideal syntax, always works • Can’t operate on built-in types • Inefficient because: – Creates unnecessary temporaries. – Multiple loops, eg a[]=b[]+c[]+d[] double [] temp 1= new double[], temp 2 = new double[]; for(int i=0; i<b. length; ++i) temp 1[i] = b[i] + c[i]; for(int i=0, i<temp 1. length; ++i) temp 2[i] = temp 1[i] + d[i]; a = temp 2; • Somehow, we need to get the expression inside the ‘for’ loop!

The Wizard Solution: Expression Templates (eg, Blitz++) • Overloaded operators don’t do the calculation: instead, they record the operation as a proxy type, creating a syntax tree. • Example: (a+b)/(c-d): DVExpr<DVBin. Expr. Op<DVExpr< DVBin. Expr. Op<DVec: : iter. T, DAp. Add>>, DVExpr<DVBin. Expr. Op< DVec: : iter. T, DAp. Subtract>>, DAp. Divide>> • Need a good optimiser. • Works in D as well as C++. BUT… we are fighting the compiler!

Representing the Syntax Tree in D • In D, any expression can be represented in a single template. void vector. Operation(char [] expression, T…)(T values) { } • Represent types and values in a tuple. Represent expression in a char []. A. . Z correspond to T[0]. . T[25]. eg: vector. Operation!(“A+=(B*C)/(A+D)”)(x, y, z, u, v); Note that ‘A’ appears twice in the expression (operator overloading can’t represent that).

Finding the vectors in a tuple • It’s a vector if you can index it. • Imperfection – can’t index tuple in CTFE. • Workaround – create array of results. template is. Vector(T. . . ) { static if (T. length == 0) const bool [] is. Vector = []; else static if( is( typeof(T[0][0]) ) ) const bool [] is. Vector = true ~ is. Vector!(T[1. . $]); else const bool [] is. Vector = false ~ is. Vector!(T[1. . $]); } • Usage: if ( is. Vector!(Tuple)[i]) { … }

Metaprogramming For Muggles char [] muggle (char [] expr, Values. . . )() { char [] code = "for (int i=0; i<values[0]. length; ++i) {"; foreach(c; expr) if (c >= 'A' && c <= 'Z’) { // A-Z become tuple members. code ~= "values[" ~ itoa(c-'A') ~ "]"; // add [i] if it was a vector if (is. Vector!(Values)[c-'A']) code ~= "[i]"; } else code ~= c; // Everything else is unchanged return code ~ "; }“; } template VEC(char [] expr) { void VEC(Values. . . )(Values values) { mixin( muggle!(expr, Values) ); }} USAGE: double [] firstvec, secondvec, thirdvec; VEC!("A+=B*(C+A*D)")(firstvec, secondvec, thirdvec, 25. 7);

Trivial enhancements • Ensure all vectors are the same length. foreach(int i, bool b; is. Vector!(Values)[1. . $]) { if (b) code ~= “assert(values[“ ~ atoi(i) ~ “]. length == values[0]. length); ”; } • Assert no aliasing (vectors don’t overlap). • Equalize with hand-coded asm BLAS routines. static if ( expr == “A+=B*C” && is( Values[0] == double[] ) && is( Values[1] == double[] ) && is ( Values[2] : double ) ) { return “DAXPY(values[0]. length, values[0]. ptr, values[1]. ptr, values[2]); ”; }

Asm code via perturbation • It’s hard to determine the optimal asm for an algorithm, much easier to modify existing code. • Begin with Agner Fogg’s optimal asm code for DAXPY. Use same loop design and register allocation strategy. • Ignore difficult cases – fallback to D code.

X 87 (stack-based) • Convert the infix expression into postfix. Split += into + and =. • Swap operands to avoid FMUL latency. A += B - C * D A = (A+B) - (C*D) CD*AB+-A= • Avoid gaps in the instruction set – Eg, fewer instructions for 80 -bit reals, so load them first whenever possible.

X 87 code generation • Directly convert postfix to inline asm. VEC!("C+=B*(A+D)")( 2213. 3, vec 1, floatvec, vec 2); // Postfix : BAD+*C+C= L 1: fld double ptr [EAX + 8*ESI]; //B fld double ptr [EAX + 8*ESI]; //A fadd double ptr [EDX + 8*ESI]; //D+ fmulp ST(1), ST; //* fadd float ptr [ECX + 4*ESI]; //C+ fxch ST(1), ST; fstp float ptr [ECX + 4*ESI - 4]; // C= L 2: inc ESI; jnz L 1;

SSE/SSE 2 (register-based) • Can’t do mixed-precision operations. • Unroll loop by 2 or 4, to take advantage of SIMD. • Instruction scheduling is less critical, but register allocation is more complicated than for x 87.

GPGPU • Use the GPU in modern video cards to perform massively parallel calculations. • Uses Open. GL or Direct. X calls, instead of inline asm. • Full of hacks (pretend your data is a texture!) – but a rational API should emerge soon. • This should NOT be built into a compiler!

Adding a front end • Operator overloading – Same limitations as before • Mixins eg, mixin(blade(“firstvec+=secondvec*2. 38”)); – clumsy syntax BUT: – Can detect aliases – Allows better error messages – Can unroll small loops inline – Closer to proposed macro syntax

Front end using mixins 1. Lex: first += second * 2. 38 A+=B*C. 2. Determine types, resolve aliases, convert constants to literals. 3. Determine precedence and associativity 4. Perform constant folding • • • We can do most of this using mixins Compiler help is most required for 4 __traits could help

Determining types char[] get. Symbol. Table(char [][] symbols) { char [] result = "["; for(int i=0; i<symbols. length; ++i) { if (i>0) result ~=", "; result ~= "[typeof(" ~ symbols[i] ~ `). stringof, ` ~ symbols[i] ~ `. stringof]`; } result ~= "]"; return result; } • When mixed in, this creates an array[2] of string literals. • [0] is the type, [1] is the value

Determining precedence class AST(char [] expr) { alias expr text; AST!("(" ~ text ~ “+” ~ T. text ~ ")") op. Add(T)(T x) { return null; } AST!("(" ~ text ~ “*” ~ T. text ~ ")") op. Mul(T)(T x) { return null; } AST!( text ~ "([" ~ T. text ~ "])“ ) op. Index(T)(T x) { return null; } } char [] get. Precedence(char [] expr) { char [] code = "typeof("; for(int i=0; i<expr. length; ++i) { if (expr[i]>='A' && expr[i]<='Z') code ~= "(cast(AST!(`" ~ expr[i] ~"`))(null))"; else code ~= expr[i]; } return code ~ "). text"; } mixin(get. Precedence(“A+B*C*D”) ) “A+((B*C)*D)”

Conclusion • Implementation and syntactic issues remain – Syntax for runtime and compile-time reflection – Macros, and an extended __traits syntax should help. – How to clean up mixin(), yet retain its power? • Yet perfectly optimal code is already possible. Libraries can perform optimisations previously required a compiler back-end.