Towards a More Principled Compiler: Progressive Backend Compiler Optimization
David Koes
8/28/2006
School of Computer Science

Performance Gains Due to Compiler (gcc)
2.8 GHz Pentium 4, 1 GB RAM, -O3 …

The Future of Compiler Optimization
Is this possible? Yes!
– 10-30% improvement just from reordering compiler phases (http://www.cs.rice.edu/~keith/Adapt/)
How do we exploit the existing optimization potential?
– Need a more principled compiler

Compiler Code Size Improvement

A Principled Compiler
A compiler that
– knows right from wrong (less optimal from more optimal)
– follows a rigorous procedure to get the desired output

Today’s Compiler Problems
Target-independent phases: copy prop, GVN, DCE, strength reduct, SCCP, inlining, code motion, PRE, const prop, loop unroll, …
Target-dependent phases: insn sched, reg alloc, insn select, branch opt, peephole, …
– some phases not internally optimal
  • purely heuristic solution
– machine description mostly ignored
– lack of integration between phases
[Figure: phase pipeline labeled "machine description" and "optimized program"]

Ideal Compiler
Phases: copy prop, const prop, strength reduct, reg alloc, loop unroll, inline, peephole, insn select, DCE, code motion, CSE, branch opt, PRE, GVN, SCCP, …
– each phase locally optimal
– makes full use of machine description
– tight integration between phases
Absolutely no idea how to do this or if it’s even possible
[Figure: integrated phases labeled "machine description" and "optimized program"]

Towards a More Principled Compiler
Phases: copy prop, const prop, strength reduct, reg alloc, loop unroll, inline, peephole, insn select, DCE, code motion, CSE, branch opt, PRE, GVN, SCCP, …
– each phase locally optimal
– makes full use of machine description
– tight integration between phases
[Figure: integrated phases labeled "machine description" and "optimized program"]

Outline
I. Motivation
II. Related Work
III. Completed Work
IV. Proposed Work
V. Contributions & Timeline

Related Work: Register Allocation Problem
unbounded number of program variables → register allocator → limited number of processor registers + slow memory
Example program:
…
v = 1
w = v + 3
x = w + v
u = v
t = u + x
print(x); print(w); print(t); print(u);
…
x86 registers: eax, ebx, ecx, edx, esi, edi, esp, ebp
Concerns for the register allocator:
– spill code optimization
– rematerialization
– memory operands
– register preferences
– live range splitting

Related Work: Register Allocation Previous Work
Comparison table (columns: Method, Expressive, Fast, Optimal); methods listed:
– Linear Scan
– Graph Coloring
– Integer Linear Programming
– Partitioned Boolean Quadratic Programming

Related Work: Instruction Selection Problem
IR → instruction selector → Assem
IR representation: minimum cost tiling?
[Figure: example tiling emitting movl/leal instructions over the operands (p), t1; (x, t1), t2; 1(y), t3; (t2, t3), r]

Related Work: Instruction Selection Previous Work
Comparison table (columns: Method, DAG Tiling, Register Allocation Aware, Fast, Optimal); methods listed:
– Dynamic Programming
– Binate Covering
– Peephole Based Instruction Selection
– AVIV Code Generator
– Exhaustive Search

Completed Work: A More Principled Register Allocator
– fully utilize machine description
  • explicit and expressive model of costs of allocation for given architecture
– optimal solutions
[Figure: reg alloc phase linked to machine description]

Completed Work: Multi-commodity Network Flow: An Expressive Model
Given network (directed graph) with
– cost and capacity on each edge
– sources & sinks for multiple commodities
Find lowest cost flow of commodities
NP-complete for integer flows
Example: edges have unit capacity
[Figure: example network for commodities a and b with edge costs 1 and 0]
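The problem statement above can be illustrated with a minimal brute-force sketch. The network, the node names (sa, sb, m, n, ta, tb), and the costs are hypothetical and are not taken from the slide's figure. Exhaustive enumeration is used only because the instance is tiny; it is not the solution technique of the talk.

```python
from itertools import product

# Hypothetical network: (tail, head) -> (cost, capacity).
# Every edge has unit capacity, as in the slide's example.
EDGES = {
    ("sa", "m"): (0, 1),
    ("sb", "m"): (0, 1),
    ("m", "n"): (0, 1),    # cheap edge that both commodities would like to use
    ("n", "ta"): (0, 1),
    ("n", "tb"): (0, 1),
    ("sa", "ta"): (1, 1),  # private bypass for commodity a
    ("sb", "tb"): (2, 1),  # private bypass for commodity b
}


def simple_paths(src, dst, prefix=()):
    """Yield every cycle-free path from src to dst as a tuple of nodes."""
    path = prefix + (src,)
    if src == dst:
        yield path
        return
    for (u, v) in EDGES:
        if u == src and v not in path:
            yield from simple_paths(v, dst, path)


def solve(commodities):
    """Return (cost, {name: path}) for the cheapest integer flow that
    sends one unit per commodity and respects every edge capacity."""
    names = list(commodities)
    options = [list(simple_paths(*commodities[n])) for n in names]
    best = None
    for combo in product(*options):
        load = {}
        for path in combo:
            for edge in zip(path, path[1:]):
                load[edge] = load.get(edge, 0) + 1
        if any(units > EDGES[e][1] for e, units in load.items()):
            continue  # some edge carries more flow than its capacity
        cost = sum(EDGES[e][0] * units for e, units in load.items())
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, combo)))
    return best


cost, flows = solve({"a": ("sa", "ta"), "b": ("sb", "tb")})
print(cost, flows)
```

In this instance both commodities prefer the zero-cost route through m and n, but the unit capacity admits only one of them, so the cheapest joint flow sends a over its bypass for a total cost of 1.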

Completed Work: Register Allocation as a MCNF
– Variables → Commodities
– Variable Definition → Source
– Variable Last Use → Sink
– Allocation Classes (Reg/Mem/Const) → Nodes
– Register Limits → Node Capacities
– Spill Costs → Edge Costs
– Allocation → Flow
[Figure: network for variable a over allocation classes r0, r1, mem]

Completed Work: Example
Source Code:
int example(int a, int b)
{
  int d = 1;
  int c = a - b;
  return c+d;
}
Pre-alloc Assembly:
MOVE 1 -> d
SUB a, b -> c
ADD c, d -> c
MOVE c -> r0
Network edge annotations: load cost, insn pref cost, mem access cost

Completed Work: Control Flow
MCNF can only represent straight-line code
– need to link together networks from basic blocks
Extend MCNF model with merge and split nodes to implement boundary constraints.
– details in proposal document, along with modeling persistence of values in memory
[Figure: block boundaries labeled a: %eax, a: mem, a: mem]

Completed Work: A Better Register Allocator
– fully utilize machine description
  • explicit and expressive model of costs of allocation for given architecture: Global MCNF
– locally optimal
  • NP-hard, so use progressive solution technique

Completed Work: Progressive Solution Technique
Quickly find a good allocation
Then progressively find better allocations
– until optimal allocation found
– or time limit is reached
Technique: Lagrangian relaxation directed allocators
[Figure: Allocation Quality vs. Compile Time]

Completed Work: Lagrangian Relaxation: Intuition
Relaxes the hard constraints
– only have to solve single commodity flow
Combines easy subproblems using a Lagrangian multiplier (price)
– an additional price on each edge
– a price on each split/merge node
Example: edges have unit capacity; with price, solution to single commodity flow can be solution to multicommodity flow
[Figure: example network for commodities a and b; edge costs 1 and 0 before pricing, 1 and 0+1 after]
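The intuition above can be written out using the standard textbook form of Lagrangian relaxation for multicommodity flow. The notation is an assumption introduced here, not taken from the slides: x^k_ij is the flow of commodity k on edge (i,j), c^k_ij its cost, u_ij the edge capacity, b^k_i the supply or demand of commodity k at node i, and w_ij the price on the edge. Prices on split/merge nodes are not modeled in this sketch.

```latex
\begin{align*}
\text{(MCNF)}\quad \min_{x}\ & \sum_{k}\sum_{(i,j)\in E} c^{k}_{ij}\,x^{k}_{ij} \\
\text{s.t.}\ & \sum_{j:(i,j)\in E} x^{k}_{ij} - \sum_{j:(j,i)\in E} x^{k}_{ji} = b^{k}_{i}
  && \forall i,\ \forall k \\
& \sum_{k} x^{k}_{ij} \le u_{ij} && \forall (i,j)\in E \\
& x^{k}_{ij} \ge 0 \\[4pt]
L(w) = \min_{x}\ & \sum_{k}\sum_{(i,j)\in E} \bigl(c^{k}_{ij} + w_{ij}\bigr)\,x^{k}_{ij}
  \;-\; \sum_{(i,j)\in E} w_{ij}\,u_{ij} \\
\text{s.t.}\ & \text{flow conservation and } x^{k}_{ij} \ge 0 \text{ only}, \qquad w_{ij} \ge 0
\end{align*}
```

With the capacity constraints moved into the objective, the minimization separates into one single-commodity problem per k, each using edge costs c^k_ij + w_ij. For every w ≥ 0, L(w) is a lower bound on the optimal MCNF cost.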

Completed Work: Solution Procedure
Compute prices with iterative subgradient optimization
– guaranteed to converge to optimal prices
– optimal for linear relaxation
At each iteration, construct a feasible integer solution using current prices
– iterative allocator (in document)
– simultaneous allocator
– trace-based simultaneous allocator
[Figure: example network for commodities a and b with edge costs 1 and 0+1]
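The price computation can be sketched on a toy instance with a single contested edge. Everything below is an illustrative assumption: the two commodities, their bypass costs, the step size rule 0.8/k, and the function names are hypothetical. The sketch only updates prices and reports the relaxed choices; it does not include the allocators that build a feasible integer solution at each iteration.

```python
# Hypothetical instance: each commodity chooses between a path through one
# shared unit-capacity edge (cost 0) and a private bypass.
BYPASS_COST = {"a": 1.0, "b": 2.0}
SHARED_COST = 0.0
CAPACITY = 1


def solve_subproblems(price):
    """With the capacity constraint priced, each commodity picks its
    cheapest path independently (a single-commodity problem)."""
    choice = {}
    for name, bypass in BYPASS_COST.items():
        choice[name] = "shared" if SHARED_COST + price < bypass else "bypass"
    return choice


def lower_bound(price, choice):
    """Value of the Lagrangian dual function at this price."""
    priced = sum(
        SHARED_COST + price if c == "shared" else BYPASS_COST[n]
        for n, c in choice.items()
    )
    return priced - price * CAPACITY


def subgradient(iterations=10, step0=0.8):
    """Run subgradient ascent on the single edge price."""
    price = 0.0
    history = []
    for k in range(1, iterations + 1):
        choice = solve_subproblems(price)
        load = sum(1 for c in choice.values() if c == "shared")
        history.append((price, choice, lower_bound(price, choice)))
        # (load - capacity) is a subgradient of the dual at this price:
        # raise the price while the edge is oversubscribed, never below zero.
        price = max(0.0, price + (step0 / k) * (load - CAPACITY))
    return price, history


price, history = subgradient()
final_choice = history[-1][1]
print(price, final_choice)
```

On this instance the price rises until commodity a prefers its bypass, at which point the independently chosen paths no longer exceed the capacity and the lower bound matches the cost of that feasible solution.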

Completed Work: Simultaneous Allocator
Edges to/from memory cost 3
[Figure: allocation example with "Current cost" labels -1, -3, -2]

Completed Work: Trace-Based Allocation
Decompose function into traces of basic blocks
– run simultaneous allocator on each trace
– control flow internal to trace presents difficulty addressed in proposal document

Completed Work: Evaluation
Implemented in gcc 3.4.4 targeting x86
Optimize for code size
– perfect static evaluation
– important metric in its own right
MediaBench, MiBench, Spec95, Spec2000
– over 10,000 functions

Completed Work: Progressiveness
[Figure: results for squareEncrypt]

Completed Work: Progressiveness
[Figure: results for quicksort]

Completed Work: Code Size
Progressive!

Completed Work: Optimality
Proven optimality

Completed Work: Compile Time Slowdown :-(
9.2x slower

Completed Work: A Better Register Allocator
– fully utilize machine description
  • explicit and expressive model of costs of allocation for given architecture: Global MCNF
– locally optimal
  • approach optimality using progressive solution technique: Lagrangian directed allocators

Proposed Work: A Better Register Allocator
Solver Improvements
– Improve initial solution
– Improve quality as prices converge
– Hope to prove approximation bounds
Model Improvements
– Improve accuracy of model
– Model simplification
– Represent uniform register sets efficiently

Proposed Work: Model Simplification
Summarize overly expressive sections of the model
Conservative simplification
– does not change optimal value
Aggressive simplification
– explore tradeoff between model complexity and optimality

Proposed Work: Instruction Selection Interaction
– which instruction is best depends on the register allocator
– the alternative instructions perform the same operation, so let the register allocator decide

Proposed Work: Register Allocation Aware Instruction SElection (RA²ISE)
Instruction selection not finalized until register allocation
IR tiled with Register Allocation Aware Tiles (RAATs)
A RAAT represents several instruction sequences
– different costs
– a sequence for every possible register allocation
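A RAAT can be pictured as a function from a register allocation to an instruction sequence and its cost. The sketch below is a hypothetical tile for sign-extending a 16-bit value to 32 bits on x86; the function names and the table layout are assumptions for illustration and are not the format of the proposed RAAT tables. The sizes are the x86-32 encodings (cwtl is 1 byte, register-to-register movswl is 3 bytes), and only register operands are modeled.

```python
from itertools import product


def low16(reg):
    """Name of the 16-bit half of a 32-bit register, e.g. eax -> ax."""
    return reg[1:]


def sign_extend_raat(src, dst):
    """Return (assembly, size_in_bytes) for one register allocation.

    The tile covers the same IR operation in every case; only the chosen
    instruction and its size depend on where src and dst were allocated.
    """
    if src == "eax" and dst == "eax":
        return ("cwtl", 1)  # short form, only available when both are %eax
    return (f"movswl %{low16(src)}, %{dst}", 3)


def cheapest_allocation(registers):
    """Pick the (src, dst) assignment with the smallest instruction."""
    candidates = []
    for src, dst in product(registers, repeat=2):
        asm, size = sign_extend_raat(src, dst)
        candidates.append((size, (src, dst), asm))
    size, alloc, asm = min(candidates)
    return alloc, asm, size


print(sign_extend_raat("edx", "eax"))
print(cheapest_allocation(["edx", "eax"]))
```

Because the choice of instruction is deferred, the register allocator sees the two-byte difference as part of its own cost model rather than inheriting a decision made before allocation.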

Proposed Work: RA²ISE
[Figure: RA²ISE pipeline with stages IR, RAAT tiling, model creation, and register allocation; example output cwtl %eax]

Proposed Work: Implementing RA²ISE
Add side-constraints to Global MCNF model
– implement inter-variable preferences and constraints
  • “if x allocated to r1 and y allocated to r2, then save three bytes”
  • “x and y must be allocated to the same register”
Implement x86 RAATs
– RAAT tables created manually
– GMCNF RAAT representation automatically generated from RAAT table with minimum use of side constraints
Algorithms for tiling RAATs
– leverage existing algorithms
– exploit feedback between passes
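The two kinds of side-constraint quoted above couple the allocations of different variables, which a per-variable cost cannot express. A minimal sketch follows, assuming hypothetical per-variable costs and a third variable z for the "same register" constraint so that the two examples do not contradict each other. It enumerates allocations by brute force and ignores register conflicts between variables; it is not the Global MCNF encoding.

```python
from itertools import product

REGS = ["r1", "r2"]

# Hypothetical per-variable cost (in bytes) of each register choice.
BASE_COST = {
    "x": {"r1": 0, "r2": 1},
    "y": {"r1": 0, "r2": 1},
    "z": {"r1": 2, "r2": 0},
}


def total_cost(alloc):
    """Independent costs plus one inter-variable preference."""
    cost = sum(BASE_COST[var][reg] for var, reg in alloc.items())
    # Preference: if x is in r1 and y is in r2, then save three bytes.
    if alloc["x"] == "r1" and alloc["y"] == "r2":
        cost -= 3
    return cost


def satisfies_constraints(alloc):
    """Hard constraint: y and z must be allocated to the same register."""
    return alloc["y"] == alloc["z"]


def best_allocation():
    """Cheapest allocation among those that satisfy the hard constraint."""
    names = list(BASE_COST)
    best = None
    for regs in product(REGS, repeat=len(names)):
        alloc = dict(zip(names, regs))
        if not satisfies_constraints(alloc):
            continue
        cost = total_cost(alloc)
        if best is None or cost < best[0]:
            best = (cost, alloc)
    return best


print(best_allocation())
```

Taken alone, y would prefer r1, but the three-byte saving and the tie to z make r2 the better choice once the variables are considered together.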

Proposed Work: Tiling RAATs
[Figure: tiling example with numbered tiles; the "tiling" and "allocate registers" steps are linked by feedback, with allocations shown to eax, edx, and mem]

Proposed Work: Evaluation
Implement in production quality compiler (gcc)
Evaluate code size and simple code speed metric
Evaluate on three different architectures
– x86 (8 registers)
– 68k/ColdFire (16 registers)
– PPC (32 registers)

Contributions
RA²ISE
– register allocation aware tiles (RAATs) explicitly encode effect of register allocation on instruction sequence
– algorithms for tiling RAATs
– expressive model of register allocation that operates on RAATs and explicitly represents all important components of register allocation
– progressive solver for this model that can quickly find a decent solution and approaches optimality as more time is allowed for compilation
Comprehensive evaluation of RA²ISE

Thesis Statement
RA²ISE is a principled and effective system for performing instruction selection and register allocation.

One Step Towards a More Principled Compiler
Phases: copy prop, DCE, loop unroll, const prop, code motion, inline, strength reduct, CSE, peephole, branch opt, reg alloc + insn select, PRE, GVN, SCCP, …
[Figure: phases labeled "machine description" and "optimized program"]

Timeline
Fall 2006
– add simple speed metric option to model
– begin model simplification work
– improve model accuracy and solver performance
Winter 2006
– finish model simplification work
– add side-constraints to model
– implement existing gcc tiles as RAATs
– improve model accuracy and solver performance
Spring 2007
– finish implementation of side-constraints and gcc RAATs
– begin work on RA²ISE infrastructure
– create gcc-independent set of RAATs for x86
– improve model accuracy and solver performance
Summer 2007
– finish work on RA²ISE
– investigate and develop tiling algorithms
– improve model accuracy and solver performance
Fall 2007
– add 68k/ColdFire and PowerPC targets
– investigate uniform register set simplifications
– improve model accuracy and solver performance
Winter 2007
– begin writing thesis
– work on improving compile time performance
Spring 2008
– finish writing thesis

Andrew Richard Koes

Questions?

Processor Performance

Instruction Selection & Register Allocation
– fully utilize machine description
– locally optimal
– tight integration between phases
[Figure: reg alloc and insn select phases linked to machine description]

Costs of Register Allocation
– Spilling to/from memory: movl 8(%ebp), %edx
– Direct memory access: addl 8(%ebp), %eax
– Moving between registers: movl %edx, %ecx
– Rematerialization of constant value: movl $3, %eax
– Register usage preferences: imul %edx, %eax vs. imul %edx, %ecx

Iterative Heuristic Allocator
Allocate each variable in a heuristic priority order
Find shortest path in each block
– avoid edges that make remaining problem infeasible
Process blocks in topological order
– allocation at block entry fixed by previous blocks
Intuition:
– shortest path is minimum cost allocation for a variable
– allocate most significant variables first
Limitation:
– greedy: can’t undo poor decisions
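The shortest-path step and the greedy limitation can be sketched for a single basic block. The machine (one register r0), the variables, their live ranges, and the in-memory penalties below are hypothetical; the memory edge cost of 3 follows the slides' examples. The sketch omits the feasibility check and the processing of multiple blocks, since memory is always available as a fallback here.

```python
MEM = "mem"
REGS = ["r0"]        # hypothetical machine with a single register
MEM_EDGE_COST = 3    # edges to/from memory cost 3
REG_MOVE_COST = 1    # assumed cost of a register-to-register move


def edge_cost(src, dst):
    """Cost of moving a value between two locations at adjacent points."""
    if src == dst:
        return 0
    if MEM in (src, dst):
        return MEM_EDGE_COST
    return REG_MOVE_COST


def shortest_path(points, mem_penalty, taken):
    """Cheapest sequence of locations for one variable over its live points.

    mem_penalty[p] is the extra cost of the variable being in memory at
    point p; taken holds (point, register) pairs claimed by earlier variables.
    """
    locations = REGS + [MEM]

    def node_cost(p, loc):
        return mem_penalty.get(p, 0) if loc == MEM else 0

    best = {loc: (node_cost(points[0], loc), [loc])
            for loc in locations if (points[0], loc) not in taken}
    for p in points[1:]:
        nxt = {}
        for loc in locations:
            if (p, loc) in taken:
                continue
            cost, path = min(
                ((c + edge_cost(prev, loc), pth) for prev, (c, pth) in best.items()),
                key=lambda entry: entry[0],
            )
            nxt[loc] = (cost + node_cost(p, loc), path + [loc])
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


def iterative_allocate(variables, order):
    """Allocate variables one at a time; earlier choices are never revisited."""
    taken, result, total = set(), {}, 0
    for name in order:
        points, mem_penalty = variables[name]
        cost, path = shortest_path(points, mem_penalty, taken)
        taken.update((p, loc) for p, loc in zip(points, path) if loc != MEM)
        result[name] = path
        total += cost
    return total, result


VARIABLES = {
    "a": ([0, 1, 2], {0: 2, 1: 2, 2: 2}),  # heavily used: costly in memory
    "b": ([1, 2], {1: 1, 2: 1}),
}
good = iterative_allocate(VARIABLES, ["a", "b"])
bad = iterative_allocate(VARIABLES, ["b", "a"])
print(good)
print(bad)
```

Allocating the heavily used variable a first gives a total cost of 2, while allocating b first leaves a in memory for a total of 6, which is the "can't undo poor decisions" limitation in miniature.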

Iterative Heuristic Allocator
Allocation order: a, b, c, d
Edges to/from memory cost 3
Cost: 0, 4, 0, -2 (Total: 2)

Simultaneous Allocator
Scan each block
– maintain an allocation of all live variables
– at variable definition find cheapest allocation
  • allocation with shortest path to variable’s sink or block exit
  • allowed to evict (reallocate) already allocated variable
– eviction cost: cost of the shortest path from current allocation to new allocation in this block
– cost of eviction added to shortest path cost
Intuition:
– minimizing cost for all variables at once
Limitation:
– path computations limited to single block
– future blocks do not change previous block allocations

Trace-Based Allocation
Decompose function into traces of basic blocks
– run simultaneous allocator on each trace
– control flow internal to trace
  • update only blocks that are necessary (easy-update)
  • update all affected blocks (full-update)
[Figure: easy-update vs. full-update]

Accuracy of the Model
Global MCNF model correctly predicts costs of register allocation within 2% for 72.5% of functions compiled

Compile Time Slowdown :-(
10x slower

Code Size Improvement

Code Size Improvement

Code Size Improvement

Code Performance

Integrating Register Allocation and Instruction Selection
int foo(int a, short b)
{
  return a*4+b;
}
First sequence (instruction size in bytes, then instruction):
4  movl 4(%esp), %eax
3  sall $2, %eax
4  addl 8(%esp), %eax
1  cwtl
1  ret
Alternative sequence (instruction size in bytes, then instruction):
5  movswl 8(%esp), %edx
4  movl 4(%esp), %eax
3  leal (%edx, %eax, 4), %eax
1  ret

Another RAAT