Register Allocation: Graph Coloring
Compiler
Baojian Hua
bjhua@ustc.edu.cn

Middle and Back End
AST -> (translation) -> IR1 -> (translation) -> IR2 -> (other IR and translation) -> asm

Back-end Structure
IR -> instruction selector -> Assem -> register allocator -> TempMap -> instruction scheduler -> Assem

Instruction Selection

Source:
  int f (int x, int y)
  {
    int a;
    int b;
    int c;
    int d;

    a = x + y;
    b = a + 4;
    c = b * 2;
    d = c / 8;
    return d;
  }

Arguments: x: 8(%ebp), y: 12(%ebp).
Positions for a, b, c, d can not be determined during this phase.

After instruction selection:
  int f (int x, int y){
    int a, b, c, d;
    int t1, t2;

    pushl %ebp              // prolog
    movl  %esp, %ebp
    movl  8(%ebp), t1
    movl  12(%ebp), t2
    movl  t1, a
    addl  t2, a
    movl  a, b
    addl  $4, b
    movl  b, %eax
    imull $2
    movl  %eax, c
    movl  c, %eax
    cltd
    idivl $8
    movl  %eax, d
    movl  d, %eax
    leave                   // epilog
    ret
  }

Register allocation
- After instruction selection, there may be some variables left
- Basic idea:
  - put as many of these variables as possible into registers (speed!)
  - put a variable into memory only if the registers are out of supply
- This process is called register allocation
  - the most popular and important optimization in modern compilers

Register Allocation
Suppose that the register allocation determines the following (we will discuss how to do this a little later):
  a  => %eax
  b  => %eax
  c  => %eax
  d  => %eax
  t1 => %eax
  t2 => %edx
(this data structure is called a temp map)

  int f (int x, int y){
    int a, b, c, d;
    int t1, t2;

    pushl %ebp
    movl  %esp, %ebp
    movl  8(%ebp), t1
    movl  12(%ebp), t2
    movl  t1, a
    addl  t2, a
    movl  a, b
    addl  $4, b
    movl  b, %eax
    imull $2
    movl  %eax, c
    movl  c, %eax
    cltd
    idivl $8
    movl  %eax, d
    movl  d, %eax
    leave
    ret
  }

Rewriting
With the given temp map:
  a  => %eax
  b  => %eax
  c  => %eax
  d  => %eax
  t1 => %eax
  t2 => %edx
we can rewrite the code accordingly, to generate the final assembly code.

  .text
  .globl f
  f:
    pushl %ebp
    movl  %esp, %ebp
    movl  8(%ebp), t1       // t1 => %eax
    movl  12(%ebp), t2      // t2 => %edx
    movl  t1, a             // t1 => %eax, a => %eax
    addl  t2, a             // t2 => %edx, a => %eax
    movl  a, b              // the rest are left to you!
    addl  $4, b
    movl  b, %eax
    imull $2
    movl  %eax, c
    movl  c, %eax
    cltd
    idivl $8
    movl  %eax, d
    movl  d, %eax
    leave
    ret

Peep-hole Optimization
Peep-hole optimizations try to improve the code by examining it through a small code window. They work in a local manner. For example, we can use a code window of width 1 to eliminate the obvious redundancy of the form:
  movl r, r

  .globl f
  f:
    pushl %ebp
    movl  %esp, %ebp
    movl  8(%ebp), %eax
    movl  12(%ebp), %edx
    movl  %eax, %eax
    addl  %edx, %eax
    movl  %eax, %eax
    addl  $4, %eax
    movl  %eax, %eax
    imull $2
    movl  %eax, %eax
    cltd
    idivl $8
    movl  %eax, %eax
    leave
    ret
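The rewriting and peep-hole steps can be sketched in a few lines of Python. This is only an illustration: the `(opcode, operands)` tuple representation and the function names `rewrite` and `peephole` are assumptions made for this sketch, not part of the slides or of any lab code.

```python
# Hypothetical instruction format: (opcode, [operands]).
TEMP_MAP = {"a": "%eax", "b": "%eax", "c": "%eax", "d": "%eax",
            "t1": "%eax", "t2": "%edx"}

def rewrite(instrs, temp_map):
    """Replace every temporary by the register the temp map assigns to it."""
    return [(op, [temp_map.get(x, x) for x in args]) for op, args in instrs]

def peephole(instrs):
    """Width-1 window: drop self-moves of the form `movl r, r`."""
    return [(op, args) for op, args in instrs
            if not (op == "movl" and len(args) == 2 and args[0] == args[1])]

# First few body instructions of f, before register allocation.
body = [
    ("movl", ["8(%ebp)", "t1"]),
    ("movl", ["12(%ebp)", "t2"]),
    ("movl", ["t1", "a"]),
    ("addl", ["t2", "a"]),
    ("movl", ["a", "b"]),
    ("addl", ["$4", "b"]),
]

final = peephole(rewrite(body, TEMP_MAP))
for op, args in final:
    print(op, ", ".join(args))
```

Both passes are single linear scans, which is why they are cheap to run after every allocation attempt.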

Final Assembly

Source:
  int f (int x, int y)
  {
    int a;
    int b;
    int c;
    int d;

    a = x + y;
    b = a + 4;
    c = b * 2;
    d = c / 8;
    return d;
  }

Final assembly:
  // This function does
  // NOT need a (stack)
  // frame!
  .text
  .globl f
  f:
    pushl %ebp
    movl  %esp, %ebp
    movl  8(%ebp), %eax
    movl  12(%ebp), %edx
    addl  %edx, %eax
    addl  $4, %eax
    imull $2
    cltd
    idivl $8
    leave
    ret

Register Allocation
Register allocation determines a temp map:
  a  => %eax
  b  => %eax
  c  => %eax
  d  => %eax
  t1 => %eax
  t2 => %edx
How to generate such a temp map?
Key observation: two variables can reside in one register if and only if they are NOT live simultaneously.

  int f (int x, int y){
    int a, b, c, d;
    int t1, t2;

    pushl %ebp
    movl  %esp, %ebp
    movl  8(%ebp), t1
    movl  12(%ebp), t2
    movl  t1, a
    addl  t2, a
    movl  a, b
    addl  $4, b
    movl  b, %eax
    imull $2
    movl  %eax, c
    movl  c, %eax
    cltd
    idivl $8
    movl  %eax, d
    movl  d, %eax
    leave
    ret
  }

Liveness Analysis
So, we can perform liveness analysis to calculate the live variable information. To the right of each statement we mark its liveOut set.

  int f (int x, int y){
    int a, b, c, d;
    int t1, t2;

    pushl %ebp
    movl  %esp, %ebp
    movl  8(%ebp), t1       {…}
    movl  12(%ebp), t2
    movl  t1, a
    addl  t2, a
    movl  a, b
    addl  $4, b
    movl  b, %eax
    imull $2
    movl  %eax, c
    movl  c, %eax
    cltd
    idivl $8                {eax}
    movl  %eax, d           {d}
    movl  d, %eax           {eax}
    leave
    ret
  }
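For straight-line code such as the body of f, the liveOut sets can be computed in a single backward pass. The sketch below assumes each instruction has already been summarized as a `(defs, uses)` pair; that encoding and the name `live_out_sets` are illustrative choices for this sketch, and code with branches would need the usual iterative dataflow analysis instead.

```python
def live_out_sets(instrs, live_at_exit=frozenset()):
    """Backward pass over straight-line code (no branches).

    instrs: list of (defs, uses) pairs, each a set of variable names.
    Returns the live-out set of every instruction, in program order.
    """
    live = set(live_at_exit)
    out = []
    for defs, uses in reversed(instrs):
        out.append(frozenset(live))             # live right after this instruction
        live = (live - set(defs)) | set(uses)   # live right before it
    out.reverse()
    return out

# The body of f up to `movl b, %eax`, summarized by hand as (defs, uses).
instrs = [
    ({"t1"}, set()),          # movl 8(%ebp), t1
    ({"t2"}, set()),          # movl 12(%ebp), t2
    ({"a"}, {"t1"}),          # movl t1, a
    ({"a"}, {"a", "t2"}),     # addl t2, a
    ({"b"}, {"a"}),           # movl a, b
    ({"b"}, {"b"}),           # addl $4, b
    ({"%eax"}, {"b"}),        # movl b, %eax
]

out = live_out_sets(instrs, live_at_exit={"%eax"})
for (defs, uses), live in zip(instrs, out):
    print(sorted(defs), sorted(live))
```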

Interference Graph (IG)
Register allocation determines the temp map:
  a  => %eax
  b  => %eax
  c  => %eax
  d  => %eax
  t1 => %eax
  t2 => %edx
IG nodes (with their assigned registers): a (%eax), b (%eax), c (%eax), d (%eax), t1 (%eax), t2 (%edx)
Interferences are marked next to the code (x ∞ y: x interferes with y):

  int f (int x, int y){
    int a, b, c, d;
    int t1, t2;

    pushl %ebp
    movl  %esp, %ebp
    movl  8(%ebp), t1
    movl  12(%ebp), t2      // t2 ∞ t1
    movl  t1, a             // a ∞ t2
    addl  t2, a
    movl  a, b
    addl  $4, b
    movl  b, %eax
    imull $2
    movl  %eax, c
    movl  c, %eax
    cltd
    idivl $8
    movl  %eax, d
    movl  d, %eax
    leave
    ret
  }

Steps in Register Allocator
- Do liveness analysis
- Build the interference graph (IG)
  - draw an edge between any two variables which are live simultaneously
- Color the IG with K colors (registers)
  - K is the number of available registers on a machine
  - a classical problem in graph theory
  - NP-complete (for K>=3), thus one must use heuristics
- Allocate physical registers to variables
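As a minimal sketch of the build step, one common formulation adds an edge between the variable an instruction defines and every other variable that is live right after it. The input encoding (a list of `(defs, live_out)` pairs, written out by hand here) and the name `build_ig` are assumptions for illustration; the usual special case for move instructions is left out.

```python
def build_ig(steps):
    """steps: list of (defs, live_out) pairs, one per instruction.

    Returns an undirected graph as a dict: node -> set of neighbours.
    """
    graph = {}
    for defs, live_out in steps:
        for v in set(defs) | set(live_out):
            graph.setdefault(v, set())
        for d in defs:
            for v in live_out:
                if v != d:            # a variable never interferes with itself
                    graph[d].add(v)
                    graph[v].add(d)
    return graph

# First six body instructions of f with their live-out sets.
steps = [
    ({"t1"}, {"t1"}),          # movl 8(%ebp), t1
    ({"t2"}, {"t1", "t2"}),    # movl 12(%ebp), t2
    ({"a"}, {"a", "t2"}),      # movl t1, a
    ({"a"}, {"a"}),            # addl t2, a
    ({"b"}, {"b"}),            # movl a, b
    ({"b"}, {"b"}),            # addl $4, b
]

ig = build_ig(steps)
print({n: sorted(adj) for n, adj in ig.items()})
```

The only edges are t1 with t2 and a with t2, which is consistent with a temp map that puts t1, a and b in one register and t2 in another.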

History
- Early work by Cocke suggests that register allocation can be viewed as a graph coloring problem (1971)
- The first working allocator is Chaitin's, for the IBM PL/1 compiler (1981)
  - later, the IBM PL.8 compiler
  - had some impact on RISC

History, cont
- The more recent graph coloring allocator is due to Briggs (1992)
- For now, graph coloring is the most popular allocator, used in many production compilers
  - e.g., GCC
- But more advanced allocators have been invented in recent years
  - so, is graph coloring a lesson abandoned?
  - more in the next few lectures …

Graph coloring
- Once we have the interference graph, we can try to color the graph with K colors
  - K: number of machine registers
  - adjacent nodes must get different colors
- But this problem is NP-complete (for K>=3)
- So we must use some heuristics

Kempe’s Allocator

Kempe’s Theorem
- [Kempe] Given a graph G with a node n such that degree(n)<K, G is K-colorable iff (G-{n}) is K-colorable (remove n and all edges connected to n)
- Proof?
(Figure: a node n with degree(n)<K, connected to the rest of the graph …)

Kempe’s Algorithm
  kempe(graph G, int K)
    while (there is any node n, degree(n)<K)
      remove this node n
      assign a color to the removed node n   // greedy
    if (G is empty)   // i.e., G is K-colorable
      return success;
    return failure;

Example (K=4, colors 1, 2, 3, 4)
Graph nodes: a, b, c, d, e
- degree(a) = 3<4: remove node “a”, assign the first available color
- degree(b) = 2<4: remove node “b”, assign the first available color
- degree(c) = 2<4: remove node “c”, assign the first available color
- degree(d) = 1<4: remove node “d”, assign the first available color
- degree(e) = 0<4: remove node “e”, assign the first available color
Here, we want to choose the node with the lowest degree. What kind of data structure should we use?
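A runnable sketch of the simplify-then-color idea follows. Two things are assumptions here. First, the edges of the example graph were lost with the slide figure, so the graph `G` below is a hypothetical one chosen to reproduce the degrees quoted above (3, 2, 2, 1, 0 as nodes are removed). Second, this sketch assigns colors in reverse removal order, which is the order Kempe's theorem directly justifies, rather than at removal time.

```python
def kempe(graph, k):
    """graph: dict node -> set of neighbours (undirected).

    Returns a dict node -> color in 1..k, or None if simplification gets stuck.
    """
    work = {n: set(adj) for n, adj in graph.items()}   # working copy
    removed = []
    while True:
        n = next((n for n in sorted(work) if len(work[n]) < k), None)
        if n is None:
            break
        for m in work.pop(n):          # detach n from its remaining neighbours
            work[m].discard(n)
        removed.append(n)
    if work:                           # some node with degree >= k is left
        return None
    colors = {}
    for n in reversed(removed):        # fewer than k neighbours are colored here
        used = {colors[m] for m in graph[n] if m in colors}
        colors[n] = min(c for c in range(1, k + 1) if c not in used)
    return colors

# Hypothetical edges, consistent with the degrees in the example.
G = {
    "a": {"b", "c", "d"},
    "b": {"a", "d", "e"},
    "c": {"a", "d", "e"},
    "d": {"a", "b", "c", "e"},
    "e": {"b", "c", "d"},
}

print(kempe(G, 4))   # a coloring
print(kempe(G, 3))   # None: every node has degree >= 3
```

Picking the first low-degree node with a linear scan is enough for a sketch; a worklist or priority queue keyed by degree avoids rescanning the graph on every step.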

Example (K=3, colors 1, 2, 3)
So this graph is 3-colorable. But if we have three colors, we can NOT apply the Kempe algorithm. (Why?)
We can refine it to the following one:
  kempe(graph G, int K)
    stack = [];
    while (G is not empty)
      remove and push a node with degree<K onto the stack;
      if every node has degree>=K, remove and push one anyway
    pop the stack and assign colors
Essentially, this is a lazy algorithm!

Example (K=3, colors 1, 2, 3)
Graph nodes: a, b, c, d, e (node “a” is significant, i.e., degree(a)>=K)
Simplify:
- remove node “a”, push onto the stack
- remove node “b”, push onto the stack
- remove node “c”, push onto the stack
- remove node “d”, push onto the stack
- remove node “e”, push onto the stack
Stack (top first): e, d, c, b, a
Select: pop the stack, assign suitable colors
- pop “e”
- pop “d”
- pop “c”
- pop “b”
- pop “a”
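The stack-based refinement can be sketched as below. As before, the edge set of `G` is a hypothetical reconstruction (the slide figure is lost), and the function name `color_optimistic` is made up for this sketch. When no node has degree<K, the sketch simply pushes the alphabetically first remaining node, which is an arbitrary choice that happens to follow the order of the walkthrough.

```python
def color_optimistic(graph, k):
    """Simplify with optimistic pushes, then select colors while popping.

    Returns a dict node -> color in 1..k, or None if some node gets no color.
    """
    work = {n: set(adj) for n, adj in graph.items()}
    stack = []
    while work:
        low = [n for n in sorted(work) if len(work[n]) < k]
        n = low[0] if low else sorted(work)[0]   # optimistic push if none is low
        for m in work.pop(n):
            work[m].discard(n)
        stack.append(n)
    colors = {}
    while stack:
        n = stack.pop()
        used = {colors[m] for m in graph[n] if m in colors}
        free = [c for c in range(1, k + 1) if c not in used]
        if not free:                             # select failed: not enough colors
            return None
        colors[n] = free[0]
    return colors

# Hypothetical edges for the five-node example graph.
G = {
    "a": {"b", "c", "d"},
    "b": {"a", "d", "e"},
    "c": {"a", "d", "e"},
    "d": {"a", "b", "c", "e"},
    "e": {"b", "c", "d"},
}

print(color_optimistic(G, 3))   # succeeds although no node has degree < 3
print(color_optimistic(G, 2))   # None: c, d, e form a triangle
```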

Moral
- Kempe’s algorithm:
  - step #1: simplify
    - remove graph nodes, be optimistic
  - step #2: select
    - assign a color for each node, be lazy
- You should use this algorithm for your lab 6 first
- But what if the select phase fails?
  - not enough colors (registers)!

Example (K=2, colors 1, 2)
Graph nodes: a, b, c, d, e
- remove node “a”, push onto the stack

Failure
- It’s often the case that Kempe’s algorithm fails
  - the IG is not K-colorable
- The basic idea is to generate spilling code
  - some variables should be put into memory, instead of into registers
    - usually, spilled variables reside in the call stack
  - code using such variables should be modified:
    - for a variable use: read from the memory
    - for a variable def: store into the memory

Spill code generation
- The effect of spill code is to turn long live ranges into shorter ones
  - this may introduce more temporaries
- The register allocator should start over after generating spill code
- We’ll talk about this shortly

Chaitin’s Allocator

Chaitin’s Algorithm
- Build: build the interference graph (IG)
- Simplify: simplify the graph
- Spill: for a significant node, mark it as a potential spill (ps), remove it and continue
- Select: pop nodes and try to assign colors
  - if this fails for a potential spill node, mark the potential spill as an actual spill and continue
- Start over: generate spill code for actual spills and start over from step #1 (build)

Chaitin’s Algorithm
build -> simplify -> potential spill -> select -> actual spill -> (back to build)

Step 1: build the IG (K=2, colors 1, 2)
  a = 1
  b = 2
  c = a+b
  d = a+c
  e = a+b
  f = d+e
IG nodes: a, b, c, d, e, f

Step 2: simplification
- remove f, push onto the stack
- remove e, push onto the stack
- c is significant: mark it as a potential spill (ps), remove it and push onto the stack
- d is significant: mark it as a potential spill (ps), remove it and push onto the stack
- remove a, push onto the stack
- remove b, push onto the stack
Stack (top first): b, a, d (ps), c (ps), e, f

Step 3: selection
- pop b, assign a color
- pop a, assign a color
- pop d: no color is available, so d is an actual spill (give it a fake color)
- pop c: no color is available, so c is an actual spill (give it a fake color)
- pop e, assign a color
- pop f, assign a color
There are two spills: c and d. One must rewrite the code.
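The simplify, potential-spill and select steps can be sketched together. Several things are assumptions here: the edge set of `IG` is derived by hand from the liveness of the six statements (with nothing live at exit), since the slide figure is lost; `spill_order` is a made-up stand-in for a spill-cost ranking; and the name `chaitin_color` is illustrative only.

```python
def chaitin_color(graph, k, spill_order):
    """One build-free round: simplify (with potential spills), then select.

    Returns (colors, actual_spills).
    """
    work = {n: set(adj) for n, adj in graph.items()}
    stack = []                                    # entries: (node, is_potential_spill)
    while work:
        low = [n for n in work if len(work[n]) < k]
        if low:
            n, ps = min(low, key=lambda v: (len(work[v]), v)), False
        else:                                     # all significant: potential spill
            n, ps = next(v for v in spill_order if v in work), True
        for m in work.pop(n):
            work[m].discard(n)
        stack.append((n, ps))
    colors, actual = {}, []
    while stack:
        n, ps = stack.pop()
        used = {colors[m] for m in graph[n] if m in colors}
        free = [c for c in range(1, k + 1) if c not in used]
        if free:
            colors[n] = free[0]
        else:
            actual.append(n)                      # a potential spill became actual
    return colors, actual

# Interference graph of the example, derived by hand (an assumption).
IG = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
    "f": set(),
}

colors, actual = chaitin_color(IG, 2, spill_order=["c", "d", "a", "b", "e", "f"])
print(colors, actual)
```

With this spill order the nodes are pushed as f, e, c (ps), d (ps), a, b, and the actual spills are c and d, as in the walkthrough. A different ranking can give a different, possibly smaller, set of spills.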

Step 4: code rewriting (actual spill)
There are two spills: c and d. Suppose the memory addresses for c and d are l_c and l_d (two integers indicating stack offsets). Then for each use, generate a read; for each def, generate a write.

  Before:           After:
  a = 1             a = 1
  b = 2             b = 2
  c = a+b           x1 = a+b
                    M[l_c] = x1
  d = a+c           x2 = M[l_c]
                    x3 = a+x2
                    M[l_d] = x3
  e = a+b           e = a+b
  f = d+e           x4 = M[l_d]
                    f = x4+e

What’s special about xi? They can NOT spill any more. (Why?)

Step 4: … start over (K=2, colors 1, 2)
  a = 1
  b = 2
  x1 = a+b
  M[l_c] = x1
  x2 = M[l_c]
  x3 = a+x2
  M[l_d] = x3
  e = a+b
  x4 = M[l_d]
  f = x4+e
IG nodes: a, b, e, f, x1, x2, x3, x4
Leave other steps to you. This graph can NOT be colored with 2 colors. (There is a K3 sub-graph.) So, we have to do another iteration to generate spill code (keep in mind that you can NOT spill x1, x2, x3 and x4) … Veryyyy EXPENSIVE!

Code spill (2nd time): spilled a (K=2, colors 1, 2)

  Before:           After:
  a = 1             x5 = 1
                    M[l_a] = x5
  b = 2             b = 2
  x1 = a+b          x6 = M[l_a]
                    x1 = x6+b
  M[l_c] = x1       M[l_c] = x1
  x2 = M[l_c]       x2 = M[l_c]
  x3 = a+x2         x7 = M[l_a]
                    x3 = x7+x2
  M[l_d] = x3       M[l_d] = x3
  e = a+b           x8 = M[l_a]
                    e = x8+b
  x4 = M[l_d]       x4 = M[l_d]
  f = x4+e          f = x4+e

IG (K=2, colors 1, 2)
  x5 = 1
  M[l_a] = x5
  b = 2
  x6 = M[l_a]
  x1 = x6+b
  M[l_c] = x1
  x2 = M[l_c]
  x7 = M[l_a]
  x3 = x7+x2
  M[l_d] = x3
  x8 = M[l_a]
  e = x8+b
  x4 = M[l_d]
  f = x4+e
IG nodes: b, e, f, x1, x2, x3, x4, x5, x6, x7, x8
This graph is still not 2-colorable. Why?
There are 3 spillable variables remaining: b, e, f. Which one should be spilled? Suppose we spill b.
So we should continue to generate spill code. And start over…

Third Round (K=2, colors 1, 2)
  x5 = 1
  M[l_a] = x5
  x9 = 2
  M[l_b] = x9
  x6 = M[l_a]
  x10 = M[l_b]
  x1 = x6+x10
  M[l_c] = x1
  x2 = M[l_c]
  x7 = M[l_a]
  x3 = x7+x2
  M[l_d] = x3
  x8 = M[l_a]
  x11 = M[l_b]
  e = x8+x11
  x4 = M[l_d]
  f = x4+e
IG nodes: e, f, x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11
We have spilled all of a, b, c, and d. This has the effect of chopping up all long live ranges into small live ranges!

Spilling a use
- For a statement like this:
  - t = u+v
- if we mark u as an actual spill, rewrite to:
  - u’ = M[l_u]
  - t = u’+v
- where u’ can NOT be a candidate for future spill (unspillable)

Spilling a def
- For a statement like this:
  - t = u+v
- if we mark t as an actual spill, rewrite to:
  - t’ = u+v
  - M[l_t] = t’
- where t’ can NOT be a candidate for future spill (unspillable)
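The two rewrite rules (load before each use, store after each def) can be sketched as one pass over the code. The tuple encoding of statements (`"op"`, `"load"`, `"store"`), the `spill` function and the fresh-name generator are all assumptions made for this sketch; a real allocator would also record the fresh temporaries as unspillable.

```python
import itertools

def spill(code, var, fresh):
    """Rewrite `code` so that `var` lives in memory slot l_<var>.

    Statements are ("op", dst, srcs), ("load", dst, slot) or ("store", slot, src).
    `fresh` is an iterator that yields new temporary names.
    """
    slot = "l_" + var
    out = []
    for ins in code:
        if ins[0] != "op":
            out.append(ins)
            continue
        _, dst, srcs = ins
        if var in srcs:                       # use: read from memory first
            t = next(fresh)
            out.append(("load", t, slot))
            srcs = [t if s == var else s for s in srcs]
        if dst == var:                        # def: compute, then write to memory
            t = next(fresh)
            out.append(("op", t, srcs))
            out.append(("store", slot, t))
        else:
            out.append(("op", dst, srcs))
    return out

code = [
    ("op", "a", [1]),
    ("op", "b", [2]),
    ("op", "c", ["a", "b"]),
    ("op", "d", ["a", "c"]),
    ("op", "e", ["a", "b"]),
    ("op", "f", ["d", "e"]),
]

fresh = ("x%d" % i for i in itertools.count(1))
rewritten = spill(spill(code, "c", fresh), "d", fresh)
for ins in rewritten:
    print(ins)
```

Spilling c and then d with a shared name generator yields the temporaries x1 to x4 in the same positions as in the rewritten example code.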

Spilled temps
- Where should these variables be spilled to?
  - function frames!
- The compiler maintains an internal counter. Each time the compiler finds an actual spill, it increases the counter and assigns a location for that spilled variable.

Frame layout (higher addresses first):
  …
  arg1
  arg0
  ret addr
  old ebp      <- %ebp
  Spill_0
  Spill_1
  …            <- %esp

Frame
- Suppose we put the frame on the stack:

  .text
  .globl f
  f:
    pushl %ebp
    movl  %esp, %ebp
    pushl %ebx
    pushl %edi
    pushl %esi
    subl  $(n*4), %esp

n is the number of all spills, which can only be determined after register allocation.
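A small sketch of the spill counter and the resulting frame offsets is given below. The `Frame` class and its method names are hypothetical. The offsets assume the prologue shown above: three callee-saved registers are pushed right below the old %ebp, so the first spill slot lands at -16(%ebp); a layout that puts spills directly below the old %ebp would start at -4(%ebp) instead.

```python
class Frame:
    """Hypothetical frame bookkeeping for 32-bit x86 (4-byte words)."""
    WORD = 4

    def __init__(self, callee_saved=("%ebx", "%edi", "%esi")):
        self.callee_saved = callee_saved
        self.slots = {}                      # spilled variable -> slot index

    def alloc_spill(self, var):
        """Assign the next free slot to `var` (this is the internal counter)."""
        if var not in self.slots:
            self.slots[var] = len(self.slots)
        return self.slots[var]

    def address(self, var):
        """Operand string for the slot of `var`, relative to %ebp."""
        index = self.slots[var]
        offset = -self.WORD * (len(self.callee_saved) + 1 + index)
        return "%d(%%ebp)" % offset

    def prologue(self):
        """Prologue instructions; the frame size is known only now."""
        lines = ["pushl %ebp", "movl %esp, %ebp"]
        lines += ["pushl " + r for r in self.callee_saved]
        lines.append("subl $%d, %%esp" % (self.WORD * len(self.slots)))
        return lines

frame = Frame()
frame.alloc_spill("c")
frame.alloc_spill("d")
print(frame.address("c"), frame.address("d"))
print("\n".join(frame.prologue()))
```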

Some improvements
- We can speed up the graph coloring based register allocator in several ways
- But:
  - To finish first, first finish
  - KISS: keep it simple and stupid
  - Don’t be too smart by half
- Your Tiger compiler must produce correct target code first

#1: Good data structures
- For live sets
  - bit-vector? or other data structures?
- For IG
  - adjacency list? adjacency matrix? both?
- Similar for other data structures
- Using good interfaces will let you write dead simple code and enhance it later
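One possible answer to "both?" is sketched here: a bit matrix for constant-time interference queries plus adjacency lists for walking neighbours, hidden behind a small interface. The class and method names are assumptions for illustration; Python integers serve as the bit-vector rows.

```python
class InterferenceGraph:
    """IG kept as a bit matrix (fast edge test) and adjacency lists (fast iteration)."""

    def __init__(self, names):
        self.names = list(names)
        self.index = {name: i for i, name in enumerate(self.names)}
        self.rows = [0] * len(self.names)          # row i: bit j set iff i and j interfere
        self.adj = [[] for _ in self.names]

    def add_edge(self, u, v):
        i, j = self.index[u], self.index[v]
        if i == j or (self.rows[i] >> j) & 1:      # no self-loops, no duplicates
            return
        self.rows[i] |= 1 << j
        self.rows[j] |= 1 << i
        self.adj[i].append(j)
        self.adj[j].append(i)

    def interferes(self, u, v):
        return bool((self.rows[self.index[u]] >> self.index[v]) & 1)

    def neighbours(self, u):
        return [self.names[j] for j in self.adj[self.index[u]]]

    def degree(self, u):
        return len(self.adj[self.index[u]])

g = InterferenceGraph(["a", "b", "c"])
g.add_edge("a", "b")
g.add_edge("b", "a")      # duplicate, ignored thanks to the bit test
g.add_edge("a", "c")
print(g.degree("a"), g.neighbours("a"), g.interferes("b", "c"))
```

Because callers only use `add_edge`, `interferes`, `neighbours` and `degree`, the representation can be swapped later without touching the allocator.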

#2: frame slot allocation
Allocating every spilled temp to its own frame slot can use a lot of memory! A better idea is to share a frame slot between spilled temps if and only if they are not live simultaneously: frame slot allocation!

  x5 = 1
  M[l_a] = x5
  x9 = 2
  M[l_b] = x9
  x6 = M[l_a]
  x10 = M[l_b]
  x1 = x6+x10
  M[l_c] = x1
  x2 = M[l_c]
  x7 = M[l_a]
  x3 = x7+x2
  M[l_d] = x3
  x8 = M[l_a]
  x11 = M[l_b]
  e = x8+x11
  x4 = M[l_d]
  f = x4+e

#2: frame slot allocation
  x5 = 1
  M[l_a] = x5
  x9 = 2
  M[l_b] = x9
  x6 = M[l_a]
  x10 = M[l_b]
  x1 = x6+x10
  M[l_c] = x1
  x2 = M[l_c]
  x7 = M[l_a]
  x3 = x7+x2
  M[l_d] = x3
  x8 = M[l_a]
  x11 = M[l_b]
  e = x8+x11
  x4 = M[l_d]
  f = x4+e
Graph nodes (frame slots): l_a, l_b, l_c, l_d
How many different colors are required to color this graph?
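For straight-line code, slot sharing can be sketched by treating each slot as an interval from its first store to its last load and coloring the intervals greedily. The function name `share_slots`, the `(position, slot)` encoding and the statement positions (counted from 0 in the listing above) are assumptions for this sketch; code with branches would need real liveness information instead of intervals.

```python
def share_slots(accesses):
    """accesses: list of (position, slot) in program order, straight-line code.

    Returns a dict slot -> shared frame slot index.
    """
    first, last = {}, {}
    for pos, slot in accesses:
        first.setdefault(slot, pos)       # first store
        last[slot] = pos                  # last load seen so far
    assigned = {}
    for slot in sorted(first, key=first.get):
        # Indices already taken by slots whose live range overlaps this one.
        busy = {assigned[o] for o in assigned
                if first[o] < last[slot] and first[slot] < last[o]}
        assigned[slot] = next(i for i in range(len(first)) if i not in busy)
    return assigned

# Memory accesses of the third-round code, by statement position.
accesses = [
    (1, "l_a"), (3, "l_b"), (4, "l_a"), (5, "l_b"),
    (7, "l_c"), (8, "l_c"), (9, "l_a"),
    (11, "l_d"), (12, "l_a"), (13, "l_b"), (15, "l_d"),
]

slots = share_slots(accesses)
print(slots)
```

Under these assumptions l_c is dead before l_d is first written, so the two can share one slot and three frame slots suffice.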

#3: coalescing
- Suppose we have a move statement:
  - t = u
- What’s the potential benefit of allocating both t and u to the same register r?
  - r = r
- This is called coalescing
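On the interference graph, coalescing a move `t = u` amounts to merging the two nodes, which is only legal when they do not interfere. The sketch below shows just that merge; the function name `coalesce` and the tiny example graph are made up for illustration, and no safety criterion for deciding when merging is worthwhile is modelled here.

```python
def coalesce(graph, t, u):
    """Merge node u into node t; return the new graph, or None if they interfere.

    graph: dict node -> set of neighbours (undirected, no self-loops).
    """
    if u in graph[t]:
        return None                        # t and u are live together: cannot share
    merged = {n: set(adj) for n, adj in graph.items() if n != u}
    for m in graph[u]:                     # redirect u's edges to t
        merged[m].discard(u)
        merged[m].add(t)
        merged[t].add(m)
    return merged

# t = u is a move; t interferes with x, u interferes with y (hypothetical graph).
g = {"t": {"x"}, "u": {"y"}, "x": {"t"}, "y": {"u"}}
print(coalesce(g, "t", "u"))
print(coalesce({"t": {"u"}, "u": {"t"}}, "t", "u"))
```

After the merge both variables get the same register r, so the move becomes `r = r` and the width-1 peep-hole pass can delete it. Note that the merged node has the union of both neighbour sets, so its degree can grow and make the graph harder to color.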

Briggs’ Allocator