Lecture 15 Code Generation Instruction Selection Source code

  • Slides: 12
Download presentation
Lecture 15: Code Generation - Instruction Selection Source code Front. End Well-understood IR Middle-

Lecture 15: Code Generation - Instruction Selection Source code Front. End Well-understood IR Middle- IR End Back-End Object code instruction register instruction selection allocation scheduling Engineering • Last Lecture: – The Procedure Abstraction – Run-Time Organisation: Activation records for procedures. • Where are we? – Code generation: we assume the following model: • instruction selection: mapping IR into assembly code • register allocation: decide which values will reside in registers. • instruction scheduling: reorder operations to hide latencies… – Conventional wisdom says that we lose little by solving these (NP-complete) problems independently. 3/12/2021 COMP 36512 Lecture 15 1

Code Generation as a phase-decoupled problem (thanks for inspiration to: C. Kessler, A. Bednarski;

Code Generation as a phase-decoupled problem (thanks for inspiration to: C. Kessler, A. Bednarski; Optimal Integrated Code Generation for VLIW architectures; 10 th Workshop on Compilers for Parallel Computers, 2003) n io t a c llo IR-level Instruction Scheduling IR el llo v e l er A R t I gis e R Instruction Selection el v e A l IR- ister. IR-level g Re Instruction Scheduling n io cat target-level Instruction Scheduling n n o o i i l l t t ve loca e e l l et r Al g g r r ta istetarget-level g g Re Re Instruction Scheduling Target code

Instruction Selection • Characteristics: – use some form of pattern matching – can make

Instruction Selection • Characteristics: – use some form of pattern matching – can make locally optimal choices, but global optimality is NPcomplete – assume enough registers. • Assume a RISC-like target language • Recall: – modern processors can issue multiple instructions at the same time. – instructions may have different latencies: e. g. , add: 1 cycle; load 13 cycles; imult: 3 -16 cycles, etc • Instruction Latency: length time elapsed from the time the instruction was issued until the time the results can be used. 3/12/2021 COMP 36512 Lecture 15 3

Our (very basic) instruction set • load r 1, @a ; load register r

Our (very basic) instruction set • load r 1, @a ; load register r 1 with the memory value for a • load r 1, [r 2] ; load register r 1 with the memory value pointed by r 2 • mov r 1, 3 ; r 1=3 • add r 1, r 2, r 3 ; r 1=r 2+r 3 (same for mult, sub, div) • shr r 1, 1 ; r 1=r 1>>1 (shift right; r 1/2) • store r 1 ; store r 1 in memory • nop ; no operation 3/12/2021 COMP 36512 Lecture 15 4

Code generation for arithmetic expressions • Adopt a simple treewalk scheme; emit code in

Code generation for arithmetic expressions • Adopt a simple treewalk scheme; emit code in postorder: expr(node) { int result, t 1, t 2; switch(type(node)) { case *, /, +, -: t 1=expr(left child(node)); t 2=expr(right child(node)); result=Next. Register(); emit(op(node), result, t 1, t 2); break; case IDENTIFIER: t 1=base(node); t 2=offset(node); result=Next. Register(); emit(…) /* load IDENTIFIER */ break; case NUM: result=Next. Register(); emit(load result, val(node)); break; } return result; } 3/12/2021 COMP 36512 Lecture 15 Example: x+y: load r 1, @x load r 2, @y add r 3, r 1, r 2 (load r 1, @x would involve a load from address base+offset) 5

Issues with arithmetic expressions • What about values already in registers? – Modify the

Issues with arithmetic expressions • What about values already in registers? – Modify the IDENTIFIER case. • Why the left subtree first and not the right? – (cf. 2*y+x; x-2*y; x+(5+y)*7): the most demanding (in registers) subtree should be evaluated first (this leads to the Sethi-Ullman labelling scheme – first proposed by Ershov – see Aho 2 § 8. 10 or Aho 1 § 9. 10). • 2 nd pass to minimise register usage/improve performance. • The compiler can take advantage of commutativity and associativity to improve code (but not for floating-point operations) 3/12/2021 COMP 36512 Lecture 15 6

Issues with arithmetic expressions - 2 • Observation: on most processors, the cost of

Issues with arithmetic expressions - 2 • Observation: on most processors, the cost of a mult instruction might be several cycles; the cost of shift and add instructions is, typically, 1 cycle. • Problem: generate code that multiplies an integer with an unknown using only shifts and adds! • E. g. : 325*x = 256*x+64*x+x or (4*x+x)*(64+1) (Sparc) • For division, say x/3, we could compute 1/3 and perform a multiplication (using shifts and adds)… but this is getting complex! [For the curious only: see the “Hacker’s delight” and http: //www. hackersdelight. org/divc. More. pdf ] 3/12/2021 COMP 36512 Lecture 15 7

Trading register usage with performance • Example: w=w*2*x*y*z 1. 2. 6. 7. 12. 13.

Trading register usage with performance • Example: w=w*2*x*y*z 1. 2. 6. 7. 12. 13. 18. 19. 24. 26. load r 1, @w mov r 2, 2 mult r 1, r 2 load r 2, @x mult r 1, r 2 load r 2, @y mult r 1, r 2 load r 2, @z mult r 1, r 2 store r 1 2 registers, 31 cycles Hardware will stall until result is available! 1. 2. 3. 4. 5. 6. 8. 10. 12. 14. load r 1, @w load r 2, @x load r 3, @y load r 4, @z mov r 5, 2 mult r 1, r 5 mult r 1, r 2 mult r 1, r 3 mult r 1, r 4 store r 1 ; load: 5 cycles ; mov: 1 cycle ; mult: 2 cycles ; store: 5 cycles 5 registers, 19 cycles • Instruction scheduling (the problem): given a code fragment and the latencies for each operation, reorder the operations to minimise execution time (produce correct code; avoid spilling registers) 3/12/2021 COMP 36512 Lecture 15 8

Array references • Agree to a storage scheme: – Row-major order: layout as a

Array references • Agree to a storage scheme: – Row-major order: layout as a sequence of consecutive rows: A[1, 1], A[1, 2], A[1, 3], A[2, 1], A[2, 2], A[2, 3] – Column-major order: layout as a sequence of consecutive columns: A[1, 1], A[2, 1], A[1, 2], A[2, 2], … – Indirection vectors: vector of pointers to … values; best storage is application dependent; not amenable to analysis. • Referencing an array element, where w=sizeof(element): – row-major, 2 d: base+((i 1–low 1)*(high 2–low 2+1)+i 2–low 2)*w – column-major, 2 d: base+((i 2–low 2)*(high 1–low 1+1)+i 1–low 1)*w – general case for row-major order: • ((…(i 1 n 2+i 2)n 3+i 3)…)nk+ik)*w+base w*((…((low 1*n 2)+low 2)n 3+low 3)…)nk+lowk), where ni=highi–lowi+1 3/12/2021 COMP 36512 Lecture 15 9

Boolean and relational values • • expr not or-term | or-term or and-term |

Boolean and relational values • • expr not or-term | or-term or and-term | and-term and boolean | boolean true | false | rel-term rel-op expr | expr rel-op < | > | == | != | >= | <= Evaluate using treewalk-style generation. Two approaches for translation: numerical representation (0/1) or positional encoding. B or C and not D: r 1=not D; r 2=r 1 and C; r 3=r 2 or B. if (a<b) then … else… : comp rx, ra, rb; br rx L 1, L 2; L 1: …code for then. . . ; br L 3; L 2: …code for else. . . ; L 3: . . . Short-circuit evaluation: the C expression (x!=0 && y/x>1) relies on short-circuit evaluation for safety. 3/12/2021 COMP 36512 Lecture 15 10

Control-Flow statements • If expr then stmt 1 else stmt 2 – 1. Evaluate

Control-Flow statements • If expr then stmt 1 else stmt 2 – 1. Evaluate the expr to true or false – 2. If true, fall through to then; branch around else – 3. If false, branch to else; fall through to next statement. (if not(expr) br L 1; stmt 1; br L 2; L 1: stmt 2; L 2: …) • While loop; for loop; or do loop: – – 1. Evaluate expr 2. If false, branch beyond end of loop; if true, fall through 3. At end, re-evaluate expr 4. If true, branch to top of loop body; if false, fall through (if not(expr) br L 2; L 1: loop-body; if (expr) br L 1; L 2: …) • Case statement: evaluate; branch to case; execute; branch • Procedure calls: recall last lecture. . . 3/12/2021 COMP 36512 Lecture 15 11

Conclusion • (Initial) code generation is a pattern matching problem. • Instruction selection (and

Conclusion • (Initial) code generation is a pattern matching problem. • Instruction selection (and register allocation) was the problem of the 70 s. With the advent of RISC architectures (followed by superscalar and superpipelined architectures and with the ratio memory access vs CPU speed starting to rise) the problems shifted to register allocation and instruction scheduling. • Reading: Aho 2, Chapter 8 (more information than covered in this lecture but most of it useful for later parts of the module); Aho 1 pp. 478 -500; Hunter, pp. 186 -198; Cooper, Chapter 11 (NB: Aho treats intermediate and machine code generation as two separate problems but follows a low-level IR; things may vary in reality). • Next time: Register allocation 3/12/2021 COMP 36512 Lecture 15 12