Compiler Construction Back-End (Synthesis)
Virendra Singh
Associate Professor
Computer Architecture and Dependable Systems Lab
Department of Electrical Engineering
Indian Institute of Technology Bombay
http://www.ee.iitb.ac.in/~viren/
E-mail: viren@ee.iitb.ac.in
EE-717/453 Advanced Computing for Electrical Engineers
Lecture 27

Compiler Architecture
[Figure: compiler pipeline] Source language → Scanner (lexical analysis) → tokens → Parser (syntax analysis) → syntactic structure → Semantic Analysis (IC generator) → Intermediate Language → Code Optimizer → Intermediate Language → Code Generator → Target language
Symbol Table
08 Oct 2012 EE-717/EE-453@IITB

IR for Code Generation
• Assume a low-level RISC-like IR
– 3-address, register-register instructions + load/store: r1 <- r2 op r3
– Could be tree structure or linear
– Expose as much detail as possible
• Assume “enough” registers
– Invent new temporaries for intermediate results
– Map to actual registers later

Overview: Instruction Selection
• Map IR into assembly code
• Assume known storage layout and code shape
– i.e., the optimization phases have already done their work
• Combine low-level IR operations into machine instructions (addressing modes, etc.)

Overview: Instruction Scheduling
• Reorder operations to hide latencies
– Processor function units; memory/cache
– Originally invented for supercomputers (1960s)
– Now important for consumer machines

Overview: Register Allocation
• Map values to actual registers
– Previous phases change the need for registers
• Add code to spill values to temporaries as needed, etc.

Difficulty Level
• Instruction selection
– Can make locally optimal choices
– Global optimum is undoubtedly NP-complete
• Instruction scheduling
– Single basic block: quick heuristics
– General problem: NP-complete
• Register allocation
– Single basic block, no spilling, interchangeable registers: linear time
– General problem: NP-complete

Conventional Wisdom
• We probably lose little by solving these independently
• Instruction selection
– Use some form of pattern matching
– Assume “enough” registers
• Instruction scheduling
– Within a block, list scheduling is close to optimal
– Across blocks: build a framework to apply list scheduling
• Register allocation
– Start with virtual registers and map “enough” to K
– Targeting, use a good priority heuristic

Low-Level IR Example (1)
• For a local variable at a known offset k from the frame pointer fp
– Linear: MEM(BINOP(PLUS, TEMP fp, CONST k))
– Tree: MEM at the root, with a single child +, whose children are TEMP fp and CONST k

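Low-Level IR Example (2)
• For an array element e(k), where each element takes up w storage locations
– Tree: MEM at the root, with a single child +; the children of + are MEM(e) and *; the children of * are k and CONST w
– Written linearly: MEM(+(MEM(e), *(k, CONST w)))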

Generating Low-Level IR
• Assuming the initial IR is an AST, a simple treewalk can be used to generate the low-level IR
– Can be done before, during, or after optimizations in the middle part of the compiler
• Create registers (temporaries) for values and intermediate results
– A value can be safely allocated in a register when only one name can reference it
• Trouble: pointers, arrays, reference parameters
– Assign a virtual register to anything that can go into one
– Generate loads/stores for other values
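A minimal Python sketch of such a treewalk. It assumes the AST is a nested tuple (op, left, right) with variable names as strings and constants as integers; the names lower, lower_expr and the temporaries t0, t1, ... are illustrative and do not come from the lecture.

```python
from itertools import count

def lower(ast, code, fresh):
    """Postorder walk: emit 3-address code for ast and return the name holding its value."""
    if isinstance(ast, (str, int)):
        return str(ast)              # leaf: variable or constant, used directly as an operand
    op, left, right = ast
    lhs = lower(left, code, fresh)   # children are visited before their parent
    rhs = lower(right, code, fresh)
    temp = f"t{next(fresh)}"         # invent a new temporary for the intermediate result
    code.append(f"{temp} <- {lhs} {op} {rhs}")
    return temp

def lower_expr(ast):
    """Return the list of 3-address instructions for one expression tree."""
    code = []
    lower(ast, code, count())
    return code

example = lower_expr(("+", "a", ("*", "b", "c")))
for line in example:
    print(line)
```

Every interior node gets its own temporary, which matches the “assume enough registers” stance: mapping temporaries to real registers is left to the register allocator.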

Instruction Selection Issues
• Given the low-level IR, there are many possible code sequences that implement it correctly
– e.g., to set eax to 0 on x86:
    mov eax, 0
    xor eax, eax
    sub eax, eax
    imul eax, 0
– Many machine instructions do several things at once, e.g., register arithmetic and effective address calculation

Instruction Selection Criteria
• Several possibilities
– Fastest
– Smallest
– Minimize power consumption
• Sometimes not obvious
– e.g., if one of the function units in the processor is idle and we can select an instruction that uses that unit, it effectively executes for free, even if that instruction wouldn’t be chosen normally

Implementation
• Problem: We need some representation of the target machine instruction set that facilitates code generation
• Idea: Describe machine instructions in the same low-level IR used for the program
• Use pattern matching techniques to pick machine instructions that match fragments of the program IR tree
– Want this to run quickly
– Would like to automate as much as possible

Matching: How?
• Tree IR: pattern match on trees
– Tree patterns as input
– Each pattern maps to a target machine instruction (or sequence)
– Use dynamic programming or a bottom-up rewrite system (BURS)
• Linear IR: some sort of string matching
– Strings as input
– Each string maps to a target machine instruction sequence
– Use text matching or peephole matching
• Both work well in practice; the actual algorithms are quite different

An Example Target Machine (1)
• Arithmetic Instructions
– (unnamed) ri; tree pattern: TEMP
– ADD ri <- rj + rk; tree pattern: +
– MUL ri <- rj * rk; tree pattern: *
– SUB and DIV are similar

An Example Target Machine (2)
• Immediate Instructions
– ADDI ri <- rj + c
– SUBI ri <- rj - c
[Figure: tree patterns for the immediate instructions, built from + and CONST nodes]

An Example Target Machine (3)
• Load
– LOAD ri <- M[rj + c]
[Figure: tree patterns for LOAD, built from MEM, + and CONST nodes]

An Example Target Machine (4)
• Store
– STORE M[rj + c] <- ri
[Figure: tree patterns for STORE, built from MOVE, MEM, + and CONST nodes]

Tree Pattern Matching (1)
• Goal: Tile the low-level tree with operation trees
• A tiling is a collection of <node, op> pairs
– node is a node in the tree
– op is an operation tree
– <node, op> means that op could implement the subtree at node

Tree Pattern Matching (2)
• A tiling “implements” a tree if it covers every node in the tree and the overlap between any two tiles (trees) is limited to a single node
– If <node, op> is in the tiling, then node is also covered by a leaf in another operation tree in the tiling, unless it is the root
– Where two operation trees meet, they must be compatible (i.e., expect the same value in the same location)

Generating Code
• Given a tiled tree, to generate code
– Postorder treewalk; node-dependent order for children
– Emit code sequences corresponding to tiles in order
– Connect tiles by using the same register name to tie boundaries together

Tiling Algorithm
• There may be many tiles that could match at a particular node
• Idea: Walk the tree and accumulate the set of all possible tiles that could match at that point
– Tiles(n)
– Later: can keep the lowest cost match at each point
– Generates local optimality: lowest cost match at each point
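A small Python sketch of tiling by dynamic programming for a subset of the example target machine (ADD, MUL, ADDI, LOAD). The tuple encoding of trees, the unit cost per instruction, the function names, and the zero register r0 used to load a bare constant are all assumptions made for illustration; the lecture does not specify them.

```python
from functools import lru_cache
from itertools import count

# Trees are nested tuples: ("TEMP", name), ("CONST", value),
# ("+", left, right), ("*", left, right), ("MEM", address).

def tiles(node):
    """Yield (instruction, operand subtrees, constant) for each tile matching at node."""
    kind = node[0]
    if kind == "CONST":
        yield "ADDI", (), node[1]                  # assumed: ri <- r0 + c, with r0 holding 0
    elif kind in ("+", "*"):
        _, left, right = node
        yield ("ADD" if kind == "+" else "MUL"), (left, right), None
        if kind == "+" and right[0] == "CONST":
            yield "ADDI", (left,), right[1]
        if kind == "+" and left[0] == "CONST":
            yield "ADDI", (right,), left[1]
    elif kind == "MEM":
        addr = node[1]
        yield "LOAD", (addr,), 0
        if addr[0] == "+" and addr[2][0] == "CONST":
            yield "LOAD", (addr[1],), addr[2][1]
        if addr[0] == "+" and addr[1][0] == "CONST":
            yield "LOAD", (addr[2],), addr[1][1]

@lru_cache(maxsize=None)
def best(node):
    """Return (cost, tile) for the cheapest tiling of the subtree rooted at node."""
    if node[0] == "TEMP":
        return 0, None                             # value is already in a register
    options = [(1 + sum(best(kid)[0] for kid in kids), (instr, kids, const))
               for instr, kids, const in tiles(node)]
    return min(options, key=lambda option: option[0])

def emit(node, code, fresh):
    """Postorder walk over the chosen tiles; returns the register holding node's value."""
    if node[0] == "TEMP":
        return node[1]
    instr, kids, const = best(node)[1]
    regs = [emit(kid, code, fresh) for kid in kids]
    dst = f"t{next(fresh)}"
    base = regs[0] if regs else "r0"
    if instr == "ADD":
        code.append(f"ADD {dst} <- {regs[0]} + {regs[1]}")
    elif instr == "MUL":
        code.append(f"MUL {dst} <- {regs[0]} * {regs[1]}")
    elif instr == "ADDI":
        code.append(f"ADDI {dst} <- {base} + {const}")
    else:
        code.append(f"LOAD {dst} <- M[{base} + {const}]")
    return dst

def select(tree):
    code = []
    emit(tree, code, count())
    return code

local_var = select(("MEM", ("+", ("TEMP", "fp"), ("CONST", 8))))
print(local_var)
```

For the frame-pointer example, the single LOAD tile covering MEM, + and CONST is cheaper than ADDI followed by LOAD, so one instruction is emitted. Tiles are joined by passing the child's result register to the parent, as described on the Generating Code slide.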

Peephole Matching
• A code generation/improvement strategy for linear representations
• Basic idea
– Look at small sequences of adjacent operations
– Compiler moves a sliding window (“peephole”) over the code and looks for improvements

Peephole Optimizations (1)
• Classic example: store followed by a load, or push followed by a pop

    original              improved
    mov [ebp-8], eax      mov [ebp-8], eax
    mov eax, [ebp-8]

    push eax              ---
    pop eax

Peephole Optimizations (2)
• Simple algebraic identities

    original        improved
    add eax, 0      ---
    add eax, 1      inc eax
    mul eax, 2      add eax, eax

Peephole Optimizations (3)
• Jump to a jump

    original              improved
    jmp here              jmp there
    ...                   ...
    here: jmp there       here: jmp there
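The window rules from the peephole slides can be sketched in Python as follows. Instructions are modelled as tuples of strings, which is an assumption made for illustration; the sketch covers the store/load, push/pop and algebraic rules only, and it ignores condition-flag differences between the original and improved instructions.

```python
def peephole(code):
    """Slide a two-instruction window over code, repeating until no rule fires."""
    changed = True
    while changed:
        changed = False
        out = []
        i = 0
        while i < len(code):
            cur = code[i]
            nxt = code[i + 1] if i + 1 < len(code) else None
            if (nxt and cur[0] == "mov" and nxt[0] == "mov"
                    and cur[1].startswith("[")
                    and cur[1] == nxt[2] and cur[2] == nxt[1]):
                out.append(cur)          # store then load of the same location: keep the store
                i += 2
                changed = True
            elif nxt and cur[0] == "push" and nxt[0] == "pop" and cur[1] == nxt[1]:
                i += 2                   # push r; pop r cancels out
                changed = True
            elif cur[0] == "add" and cur[2] == "0":
                i += 1                   # adding zero does nothing
                changed = True
            elif cur[0] == "add" and cur[2] == "1":
                out.append(("inc", cur[1]))
                i += 1
                changed = True
            elif cur[0] == "mul" and cur[2] == "2":
                out.append(("add", cur[1], cur[1]))
                i += 1
                changed = True
            else:
                out.append(cur)
                i += 1
        code = out
    return code

before = [
    ("mov", "[ebp-8]", "eax"), ("mov", "eax", "[ebp-8]"),
    ("push", "eax"), ("pop", "eax"),
    ("add", "eax", "0"), ("add", "eax", "1"), ("mul", "eax", "2"),
]
after = peephole(before)
print(after)
```

Rerunning the pass until nothing changes lets one rewrite expose another, for example when deleting a push/pop pair makes a store and its matching load adjacent.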

The Register Allocation Problem
• Recall that intermediate code uses as many temporaries as necessary
– This complicates final translation to assembly
– But simplifies code generation and optimization
– Typical intermediate code uses too many temporaries
• The register allocation problem:
– Rewrite the intermediate code to use no more temporaries than there are machine registers
– Method: assign several temporaries to the same register
• But without changing the program behavior

History
• Register allocation is as old as intermediate code
• Register allocation was used in the original FORTRAN compiler in the 1950s
– Very crude algorithms
• A breakthrough was not achieved until 1980, when Chaitin invented a register allocation scheme based on graph coloring
– Relatively simple, global, and works well in practice

An Example
• Consider the program
    a := c + d
    e := a + b
    f := e - 1
– with the assumption that a and e die after use
• Temporary a can be “reused” after e := a + b
• Same with temporary e
• Can allocate a, e, and f all to one register (r1):
    r1 := r2 + r3
    r1 := r1 + r4
    r1 := r1 - 1

Basic Register Allocation Idea
• The value in a dead temporary is not needed for the rest of the computation
– A dead temporary can be reused
• Basic rule:
– Temporaries t1 and t2 can share the same register if at any point in the program at most one of t1 or t2 is live!
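To check the basic rule on the three-statement example, liveness of straight-line code can be computed with one backward pass. This Python sketch assumes each statement is given as (destination, list of sources) and that only f is live at the end; both the encoding and that exit assumption are illustrative.

```python
def liveness(program, live_out):
    """Return (live-in set, list of live-after sets) for straight-line code."""
    live = set(live_out)
    after = []
    for dst, srcs in reversed(program):
        after.append(set(live))              # what is live just after this statement
        live = (live - {dst}) | set(srcs)    # kill the definition, then add the uses
    after.reverse()
    return live, after

program = [
    ("a", ["c", "d"]),   # a := c + d
    ("e", ["a", "b"]),   # e := a + b
    ("f", ["e"]),        # f := e - 1
]
live_in, live_after = liveness(program, {"f"})
for (dst, _), live in zip(program, live_after):
    print(dst, sorted(live))
```

At no program point are two of a, e and f live together, which is why all three can share r1 in the example.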

Algorithm: Part I
• Compute live variables for each point
[Figure: control-flow graph of the example program, with each program point annotated by its live set]
– Statements: a := b + c; d := -a; e := d + f; b := d + e; f := 2 * e; e := e - 1; b := f + c
– Live sets shown: {a, c, f}, {c, d, f}, {b, c, f}, {c, d, e, f}, {c, e}, {c, f}, {b, c, e, f}, {b}

The Register Interference Graph
• Two temporaries that are live simultaneously cannot be allocated in the same register
• We construct an undirected graph
– A node for each temporary
– An edge between t1 and t2 if they are live simultaneously at some point in the program
• This is the register interference graph (RIG)
– Two temporaries can be allocated to the same register if there is no edge connecting them

Register Interference Graph. Example.
• For our example:
[Figure: register interference graph over the nodes a, b, c, d, e, f]
• E.g., b and c cannot be in the same register
• E.g., b and d can be in the same register
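A minimal Python sketch of RIG construction under the definition above: every pair of temporaries that appears together in some live set gets an edge. The live sets are the eight sets from the liveness slide; the adjacency-set representation and the function name are assumptions.

```python
from itertools import combinations

LIVE_SETS = [
    {"a", "c", "f"}, {"c", "d", "f"}, {"b", "c", "f"}, {"c", "d", "e", "f"},
    {"c", "e"}, {"c", "f"}, {"b", "c", "e", "f"}, {"b"},
]

def interference_graph(live_sets):
    """Map each temporary to the set of temporaries it is simultaneously live with."""
    graph = {}
    for live in live_sets:
        for t in live:
            graph.setdefault(t, set())       # every temporary becomes a node
        for u, v in combinations(sorted(live), 2):
            graph[u].add(v)                  # undirected edge, stored in both directions
            graph[v].add(u)
    return graph

RIG = interference_graph(LIVE_SETS)
for node in sorted(RIG):
    print(node, sorted(RIG[node]))
```

The result has an edge between b and c but none between b and d, in line with the two observations on the slide.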

Register Interference Graph. Properties.
• It extracts exactly the information needed to characterize legal register assignments
• It gives a global (i.e., over the entire flow graph) picture of the register requirements
• After RIG construction the register allocation algorithm is architecture independent

Graph Coloring. Definitions.
• A coloring of a graph is an assignment of colors to nodes, such that nodes connected by an edge have different colors
• A graph is k-colorable if it has a coloring with k colors

Register Allocation Through Graph Coloring
• In the RA problem, colors = registers
– We need to assign colors (registers) to graph nodes (temporaries)
• Let k = number of machine registers
• If the RIG is k-colorable then there is a register assignment that uses no more than k registers

Graph Coloring. Example.
• Consider the example RIG
[Figure: the RIG colored with a = r2, b = r3, c = r4, d = r3, e = r2, f = r1]
• There is no coloring with fewer than 4 colors
• There are 4-colorings of this graph
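Both claims can be checked by brute force on a graph this small. In this Python sketch the edge list is an assumption: it is derived from the live sets on the liveness slide, since the RIG figure itself is not reproduced here. The function names are illustrative.

```python
from itertools import product

EDGES = [("a", "c"), ("a", "f"), ("b", "c"), ("b", "e"), ("b", "f"), ("c", "d"),
         ("c", "e"), ("c", "f"), ("d", "e"), ("d", "f"), ("e", "f")]
NODES = sorted({n for edge in EDGES for n in edge})

def is_coloring(edges, colors):
    """A coloring is valid when the two ends of every edge differ."""
    return all(colors[u] != colors[v] for u, v in edges)

def has_coloring(edges, nodes, k):
    """Try every assignment of k colors; feasible only for tiny graphs."""
    return any(is_coloring(edges, dict(zip(nodes, choice)))
               for choice in product(range(k), repeat=len(nodes)))

slide_coloring = {"a": "r2", "b": "r3", "c": "r4", "d": "r3", "e": "r2", "f": "r1"}
print(is_coloring(EDGES, slide_coloring))
print(has_coloring(EDGES, NODES, 3), has_coloring(EDGES, NODES, 4))
```

The nodes c, d, e and f are pairwise connected, so any valid coloring needs at least four colors.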

Graph Coloring. Example.
• Under this coloring the code becomes:
    r2 := r3 + r4
    r3 := -r2
    r2 := r3 + r1
    r3 := r3 + r2
    r1 := 2 * r2
    r2 := r2 - 1
    r3 := r1 + r4

Computing Graph Colorings
• The remaining problem is to compute a coloring for the interference graph
• But:
1. This problem is very hard (NP-hard). No efficient algorithms are known.
2. A coloring might not exist for a given number of registers
• The solution to (1) is to use heuristics

Graph Coloring Heuristic
• Observation:
– Pick a node t with fewer than k neighbors in the RIG
– Eliminate t and its edges from the RIG
– If the resulting graph has a k-coloring then so does the original graph
• Why:
– Let c1, …, cn be the colors assigned to the neighbors of t in the reduced graph
– Since n < k we can pick some color for t that is different from those of its neighbors

Graph Coloring Heuristic
• The following works well in practice:
– Pick a node t with fewer than k neighbors
– Put t on a stack and remove it from the RIG
– Repeat until the graph has one node
• Then start assigning colors to nodes on the stack (starting with the last node added)
– At each step pick a color different from those assigned to already colored neighbors
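The two phases can be sketched in Python as below. The graph is the example RIG written as adjacency sets (derived from the live sets on the liveness slide, so an assumption about the figure). Choosing the alphabetically first eligible node and the lowest free color are arbitrary tie-breaks added here, and the sketch simplifies until the graph is empty rather than stopping at one node.

```python
RIG = {
    "a": {"c", "f"},
    "b": {"c", "e", "f"},
    "c": {"a", "b", "d", "e", "f"},
    "d": {"c", "e", "f"},
    "e": {"b", "c", "d", "f"},
    "f": {"a", "b", "c", "d", "e"},
}

def color(graph, k):
    """Return {node: color in 1..k}, or None if simplification gets stuck."""
    work = {n: set(adj) for n, adj in graph.items()}
    stack = []
    while work:
        candidates = sorted(n for n in work if len(work[n]) < k)
        if not candidates:
            return None                      # every remaining node has k or more neighbors
        t = candidates[0]
        stack.append(t)
        for n in work.pop(t):                # remove t and its edges
            work[n].discard(t)
    colors = {}
    while stack:
        t = stack.pop()                      # last node removed is colored first
        used = {colors[n] for n in graph[t] if n in colors}
        colors[t] = min(c for c in range(1, k + 1) if c not in used)
    return colors

print(color(RIG, 4))
print(color(RIG, 3))
```

A free color always exists in the second phase because each node had fewer than k neighbors left when it was removed, and exactly those neighbors are colored when it is popped. With k = 3 the sketch gives up after removing a, which is the situation the spilling slides address.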

Graph Coloring Example (1)
• Start with the RIG and with k = 4
[Figure: the RIG over the nodes a, b, c, d, e, f]
– Stack: {}
• Remove a and then d

Graph Coloring Example (2)
• Now all nodes have fewer than 4 neighbors and can be removed: c, b, e, f
[Figure: the remaining RIG over the nodes b, c, e, f]
– Stack: {d, a}

Graph Coloring Example (3)
• Start assigning colors to: f, e, b, c, d, a
[Figure: the RIG colored with a = r2, b = r3, c = r4, d = r3, e = r2, f = r1]

What if the Heuristic Fails?
• What if during simplification we get to a state where all nodes have k or more neighbors?
• Example: try to find a 3-coloring of the RIG
[Figure: the RIG over the nodes a, b, c, d, e, f]

What if the Heuristic Fails?
• Remove a and get stuck
[Figure: the remaining RIG over the nodes b, c, d, e, f]
• Pick a node as a candidate for spilling
– A spilled temporary “lives” in memory
• Assume that f is picked as a candidate

What if the Heuristic Fails?
• Remove f and continue the simplification
– Simplification now succeeds: b, d, e, c
[Figure: the remaining RIG over the nodes b, c, d, e]

What if the Heuristic Fails?
• In the assignment phase we get to the point where we have to assign a color to f
• We hope that among the 4 neighbors of f we use fewer than 3 colors ⇒ optimistic coloring
[Figure: the RIG with b = r3, c = r1, d = r3, e = r2, and f still uncolored (“?”)]

Spilling
• Since optimistic coloring failed we must spill temporary f
• We must allocate a memory location as the home of f
– Typically this is in the current stack frame
– Call this address fa
• Before each operation that uses f, insert f := load fa
• After each operation that defines f, insert store f, fa

Spilling. Example.
• This is the new code after spilling f
[Figure: the flow graph of the example after spilling; its statements are]
    a := b + c
    d := -a
    f := load fa
    e := d + f
    b := d + e
    f := 2 * e
    store f, fa
    e := e - 1
    f := load fa
    b := f + c
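The rewriting rule for a spilled temporary can be sketched in Python as follows. Each statement is given as (destination, right-hand-side text, list of source temporaries), listed in the order the statements appear on the slide; this flat encoding ignores the branches of the flow graph and is an assumption made for illustration, as is the function name.

```python
def spill(program, var, addr):
    """Insert a load before every use of var and a store after every definition of var."""
    out = []
    for dst, rhs, srcs in program:
        if var in srcs:
            out.append(f"{var} := load {addr}")    # reload just before the use
        out.append(f"{dst} := {rhs}")
        if dst == var:
            out.append(f"store {var}, {addr}")     # save right after the definition
    return out

program = [
    ("a", "b + c", ["b", "c"]),
    ("d", "-a", ["a"]),
    ("e", "d + f", ["d", "f"]),
    ("b", "d + e", ["d", "e"]),
    ("f", "2 * e", ["e"]),
    ("e", "e - 1", ["e"]),
    ("b", "f + c", ["f", "c"]),
]
spilled = spill(program, "f", "fa")
print("\n".join(spilled))
```

Because the rewrite is purely local to each statement, it applies unchanged to every basic block of a real flow graph.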

Recomputing Liveness Information
• The new liveness information after spilling
[Figure: the flow graph after spilling f (with the inserted f := load fa and store f, fa), annotated with the recomputed live sets]

Recomputing Liveness Information
• The new liveness information is almost as before
• f is live only
– Between an f := load fa and the next instruction
– Between a store f, fa and the preceding instruction
• Spilling reduces the live range of f
• And thus reduces its interferences
• Which results in fewer neighbors in the RIG for f

Recompute RIG After Spilling
• The only changes are in removing some of the edges of the spilled node
• In our case f still interferes only with c and d
• And the resulting RIG is 3-colorable
[Figure: the updated RIG over the nodes a, b, c, d, e, f]

Spilling (Cont.)
• Additional spills might be required before a coloring is found
• The tricky part is deciding what to spill
• Possible heuristics:
– Spill temporaries with most conflicts
– Spill temporaries with few definitions and uses
– Avoid spilling in inner loops
• Any heuristic is correct; the choice only affects the quality of the generated code

Caches
• Compilers are very good at managing registers
– Much better than a programmer could be
• Compilers are not good at managing caches
– This problem is still left to programmers
– It is still an open question whether a compiler can do anything general to improve performance
• Compilers can, and a few do, perform some simple cache optimization

Summary: Register Allocation
• Register allocation is a “must have” optimization in most compilers:
– Because intermediate code uses too many temporaries
– Because it makes a big difference in performance
• Graph coloring is a powerful register allocation scheme
• Register allocation is more complicated for CISC machines

Compiler Assisted DVS (Dynamic Voltage Scaling)
• Identify program regions where DVS can be performed.
• For each program region, identify the voltage (frequency) mode to operate in, such that energy is minimized
• Ensure that performance is not degraded.

Motivating Example

    Freq.  Metric        P1     P2     P3     P4     P5    Total
    200    Exec. Time   151   6827    335   6827    335    14475
    200    Energy        82    125     39    125     39      410
    300    Exec. Time   100   4552    223   4552    223     9650
    300    Energy       149    163     72    163     72      619
    400    Exec. Time    76   3414    168   3414    168     7240
    400    Energy       198    274    176    274    176     1098
    DVS    Freq.        200    400    300    400    300
    DVS    Exec. Time   151   3414    223   3414    223     7425   (2% increase)
    DVS    Energy        82    274     72    274     72      778   (30% decrease)

The increase and decrease for the DVS rows are relative to running every region at 400.

DVS Problem Formulation
• Program divided into a number of regions.
• Assign an operating frequency to each program region.
– Constraint: marginal increase in execution time of the program.
– Objective: minimize program energy consumption.
• Multiple Choice Knapsack Problem
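For a handful of regions the formulation can be solved by exhaustive search, as in this Python sketch. The per-region (execution time, energy) values are taken from the motivating example; the 3% slack on execution time and all names are hypothetical choices made for illustration, since the lecture only asks for a marginal increase.

```python
from itertools import product

# frequency -> (execution time, energy), per region, from the motivating example
P1 = {200: (151, 82), 300: (100, 149), 400: (76, 198)}
P2 = {200: (6827, 125), 300: (4552, 163), 400: (3414, 274)}
P3 = {200: (335, 39), 300: (223, 72), 400: (168, 176)}
REGIONS = [P1, P2, P3, P2, P3]   # P4 and P5 repeat the values of P2 and P3

def choose_frequencies(regions, slack):
    """Pick one frequency per region, minimizing energy within the time budget."""
    fastest = sum(min(t for t, _ in region.values()) for region in regions)
    budget = fastest * (1 + slack)            # allowed execution time
    best = None
    for choice in product(*(sorted(region) for region in regions)):
        time = sum(region[f][0] for region, f in zip(regions, choice))
        energy = sum(region[f][1] for region, f in zip(regions, choice))
        if time <= budget and (best is None or energy < best[0]):
            best = (energy, time, choice)
    return best

energy, time, freqs = choose_frequencies(REGIONS, slack=0.03)
print(freqs, time)
```

Under these assumptions the search selects 200, 400, 300, 400, 300 with a total execution time of 7425, the same assignment as the DVS rows of the motivating example. Exhaustive search grows exponentially with the number of regions, so a realistic implementation would use a knapsack-style dynamic program or a heuristic instead.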

Energy Aware Instruction Scheduling
• Reorder instructions
– To reduce pipeline stalls
– To exploit ILP
• Use functional unit (FU) power gating
• Reduce the number of transitions between active and sleep states, and lengthen the active/idle periods
• Generate a more balanced schedule, which helps reduce the peak-to-average power ratio

Conclusions
• Compiler research is fun!
• It is cool to do compiler research!
• But, remember Proebsting’s Law: Compiler Technology Doubles CPU Power Every 18 YEARS!!
• Plenty of opportunities in compiler research!

Thank You