Intermediate Code Generation (Chapter 6)

Back-end and Front-end of a Compiler

Back-end and Front-end of a Compiler
• Intermediate representations range from close to the source language (e.g. a syntax tree) to close to machine language.
• m × n compilers can be built by writing just m front ends and n back ends.

Back-end and Front-end of a Compiler
Static checking in the front end includes:
• Type checking
• Any syntactic checks that remain after parsing (e.g. ensuring that a break statement is enclosed within a while, for, or switch statement).

Component-Based Approach to Building Compilers
Source program in Language-1 → Front End → Non-optimized Intermediate Code
Source program in Language-2 → Front End → Non-optimized Intermediate Code
Non-optimized Intermediate Code → Intermediate-code Optimizer → Optimized Intermediate Code
Optimized Intermediate Code → Target-1 Code Generator → Target-1 machine code
Optimized Intermediate Code → Target-2 Code Generator → Target-2 machine code

Intermediate Representation (IR)
• A kind of abstract machine language that can express the target machine operations without committing to too many machine details.
• Why IR?

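Without IR
[Figure: each source language (C, Pascal, FORTRAN, C++) is connected directly to each target (SPARC, HP PA, x86, IBM PPC).]

With IR
[Figure: each source language (C, Pascal, FORTRAN, C++) is translated to a single IR, which is translated to each target (SPARC, HP PA, x86, IBM PPC).]

With IR
[Figure: C, Pascal, FORTRAN, and C++ are translated to the IR. Common back end?]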

Advantages of Using an Intermediate Language
1. Retargeting: build a compiler for a new machine by attaching a new code generator to an existing front end.
2. Optimization: reuse intermediate-code optimizers in compilers for different languages and different machines.
Note: the terms “intermediate code”, “intermediate language”, and “intermediate representation” are all used interchangeably.

Issues in Designing an IR
• Whether to use an existing IR
  – if the target machine architecture is similar
  – if the new language is similar
• Whether the IR is appropriate for the kind of optimizations to be performed
  – e.g. speculation and predication
  – some transformations may take much longer than they would on a different IR

Issues in Designing an IR
• Designing a new IR needs to consider:
  – Level (how machine dependent it is)
  – Structure
  – Expressiveness
  – Appropriateness for general and special optimizations
  – Appropriateness for code generation
  – Whether multiple IRs should be used

Multiple-Level IR
Source Program → Semantic Check → High-level IR → … → Low-level IR → … → Target code
• High-level optimization works on the high-level IR.
• Low-level optimization works on the low-level IR.

Using Multiple-level IR
Translating from one level to another in the compilation process:
• Preserves an existing technology investment.
• Some representations may be more appropriate for a particular task.

Commonly Used IR
• Possible IR forms:
  – Graphical representations, such as syntax trees, ASTs (abstract syntax trees), and DAGs
  – Postfix notation
  – Three-address code
  – SSA (static single assignment) form
• An IR should have individual components that describe simple things.

Syntax Tree

Syntax Tree
[Figure: syntax tree with operator nodes +, *, and - over the leaves a, b, c, and d.]

Syntax Tree
[Figure: the same expression drawn with shared subexpressions merged.]
A node can have more than one parent: Directed Acyclic Graph (DAG)
• More compact representation
• Gives clues regarding the generation of efficient code

Example
Construct the DAG for:

How to Generate a DAG from a Syntax-Directed Definition?
• All that is needed is that functions such as Node and Leaf check whether an identical node already exists before creating a new one.
• If such a node exists, a pointer to that node is returned.

Data Structure: Array

Data Structure: Array
• Nodes are stored as records in an array: records for leaves, and records for intermediate nodes that hold an operation code and the left and right children.
• Scanning the array each time a new node is needed is inefficient.

Data Structure: Hash Table
• Hash function = h(op, L, R), computed from the operation code and the left and right children.
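
The node-reuse check and the hash table can be sketched together as follows. This is a minimal illustration, not the slides' own code: the class name DagBuilder, its method names, and the sample expression are assumptions, and a Python dict stands in for the hash function h(op, L, R).

```python
class DagBuilder:
    """Builds a DAG by returning an existing node instead of creating a duplicate."""

    def __init__(self):
        self.nodes = []   # array of records: ("leaf", name) or (op, left, right)
        self.index = {}   # hash table: record -> position in self.nodes

    def _lookup_or_add(self, record):
        if record in self.index:            # node already exists: return it
            return self.index[record]
        self.nodes.append(record)
        self.index[record] = len(self.nodes) - 1
        return self.index[record]

    def leaf(self, name):
        return self._lookup_or_add(("leaf", name))

    def node(self, op, left, right):
        # left and right are positions of the children in self.nodes
        return self._lookup_or_add((op, left, right))


dag = DagBuilder()
# Assumed sample expression: a + a * (b - c) + (b - c) * d
a, b, c, d = (dag.leaf(x) for x in "abcd")
first = dag.node("+", a, dag.node("*", a, dag.node("-", b, c)))
root = dag.node("+", first, dag.node("*", dag.node("-", b, c), d))
```

The second request for b - c finds the existing record through the dict, so the subexpression is stored once and no array scan is needed.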

Three-Address Code
• Another option for intermediate representation
• Built from two concepts: addresses and instructions
• At most one operator on the right side of an instruction

Address
Can be one of the following:
• A name: a source program name
• A constant
• A compiler-generated temporary

Instructions
A procedure call such as p(x1, x2, …, xn) is implemented as:

Example OR 29

Choice of Operator Set
• Rich enough to implement the operations of the source language
• Close enough to machine instructions to simplify code generation

Data Structure
How to represent these instructions in a data structure?
– Quadruples
– Triples
– Indirect triples

Data Structure: Quadruples
• Has four fields: op, arg1, arg2, result
• Exceptions:
  – Unary operators: no arg2
  – Operators like param: no arg2, no result
  – (Un)conditional jumps: the target label is the result

Three-address code for a = b * - c + b * - c:
    t1 = minus c
    t2 = b * t1
    t3 = minus c
    t4 = b * t3
    t5 = t2 + t4
    a = t5
Quadruples:
         op      arg1   arg2   result
    0    minus   c             t1
    1    *       b      t1     t2
    2    minus   c             t3
    3    *       b      t3     t4
    4    +       t2     t4     t5
    5    =       t5            a

Data Structure: Triples
• Only three fields: op, arg1, arg2 (no result field)
• A result is referred to by the position of the instruction that computes it
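
One way to see the difference from quadruples is to convert one form into the other. The sketch below is illustrative only: the function name quads_to_triples, the ("ref", n) encoding of a position, and the sample quadruples for a = b * - c + b * - c are assumptions.

```python
def quads_to_triples(quads):
    """Convert quadruples (op, arg1, arg2, result) into triples (op, arg1, arg2)."""
    position = {}   # temporary name -> index of the triple that computes it
    triples = []

    def ref(arg):
        # A temporary is replaced by a reference to the triple that computes it.
        return ("ref", position[arg]) if arg in position else arg

    for op, arg1, arg2, result in quads:
        if op == "=":
            # A copy has no result field in triple form: the target becomes arg1.
            triples.append((op, result, ref(arg1)))
        else:
            triples.append((op, ref(arg1), ref(arg2)))
            position[result] = len(triples) - 1
    return triples


quads = [
    ("minus", "c", None, "t1"),
    ("*", "b", "t1", "t2"),
    ("minus", "c", None, "t3"),
    ("*", "b", "t3", "t4"),
    ("+", "t2", "t4", "t5"),
    ("=", "t5", None, "a"),
]
triples = quads_to_triples(quads)
```

The temporaries t1 to t5 disappear: each use becomes a reference to the position of the triple that produced the value.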

Data Structure: Indirect Triples
• When instructions are moved around during optimization, quadruples are better than triples: moving a triple changes its position, so every reference to it must be updated.
• Indirect triples solve this problem:
  – Keep a list of pointers to triples.
  – An optimizing compiler can reorder the instruction list instead of affecting the triples themselves.
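
A small sketch of the idea, under the assumption that triples are stored in a Python list and the instruction list holds indices into it. The helper names move and listing, and the particular reordering, are made up for illustration.

```python
# Triples for a = b * - c + b * - c; ("ref", n) points at triple n.
triples = [
    ("minus", "c", None),            # 0
    ("*", "b", ("ref", 0)),          # 1
    ("minus", "c", None),            # 2
    ("*", "b", ("ref", 2)),          # 3
    ("+", ("ref", 1), ("ref", 3)),   # 4
    ("=", "a", ("ref", 4)),          # 5
]

# Indirect triples: a separate list of pointers gives the execution order.
instruction_list = [0, 1, 2, 3, 4, 5]


def move(order, src, dst):
    """Return a new execution order with the entry at index src moved to index dst."""
    order = list(order)
    order.insert(dst, order.pop(src))
    return order


def listing(order):
    """Instructions in execution order; the triples themselves are untouched."""
    return [triples[i] for i in order]


# Assumed reordering: compute both 'minus c' values first.
reordered = move(instruction_list, 2, 1)
```

Only the pointer list changes. Triple 3 still refers to ("ref", 2), and that reference stays valid because triple 2 has not moved within the triples array.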

Static Single Assignment (SSA)
• Is an intermediate representation
• Facilitates certain code optimizations
• All assignments are to variables with distinct names

Static Single Assignment (SSA)
Example:
• If we use different names for x in the true part and the false part, which name shall we use in the assignment y = x * a?
• The answer is the φ-function.
• It returns the value of its argument that corresponds to the control-flow path that was taken to get to the assignment statement containing the φ-function.
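
The renaming and the placement of a φ-function can be sketched for a single if/else. Everything specific here is an assumption made for illustration: the (target, op, args) instruction format, the function names, and the sample branches x = -1 and x = 1.

```python
from collections import defaultdict


def rename(block, current, counter):
    """Rename one straight-line block; `current` maps a variable to its SSA name."""
    out = []
    for target, op, args in block:
        new_args = [current.get(arg, arg) for arg in args]
        counter[target] += 1                       # every assignment gets a new name
        current[target] = f"{target}{counter[target]}"
        out.append((current[target], op, new_args))
    return out


def ssa_if_else(then_block, else_block, join_block):
    counter = defaultdict(int)
    then_cur, else_cur = {}, {}
    then_ssa = rename(then_block, then_cur, counter)
    else_ssa = rename(else_block, else_cur, counter)

    join_cur, phis = {}, []
    for var in sorted(set(then_cur) | set(else_cur)):
        # The two paths carry different names for var: merge them with phi.
        counter[var] += 1
        join_cur[var] = f"{var}{counter[var]}"
        phis.append((join_cur[var], "phi",
                     [then_cur.get(var, var), else_cur.get(var, var)]))
    return then_ssa, else_ssa, phis + rename(join_block, join_cur, counter)


# Assumed source: if (flag) x = -1; else x = 1;  y = x * a;
then_ssa, else_ssa, join_ssa = ssa_if_else(
    [("x", "copy", ["-1"])],
    [("x", "copy", ["1"])],
    [("y", "*", ["x", "a"])],
)
```

Here the join block starts with x3 = phi(x1, x2), and the multiplication reads x3, so it does not matter which branch was taken.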

Example 38

Types and Declarations
• Type checking: ensure that the types of the operands match the type expected by the operator
• Determine the storage needed
• Calculate the address of an array reference
• Insert explicit type conversions
• Choose the right version of an operator
• …

Storage Layout
• From the type of a name, we can determine the amount of storage it will need at run time.
• At compile time, we use this amount to assign the name a relative address.
• The type and relative address are saved in the symbol table entry for the name.
• For data whose length is determined only at run time, a fixed amount of storage is reserved for a pointer to the data.

Storage Layout
• Multibyte objects are stored in consecutive bytes and given the address of the first byte.
• Storage for aggregates (e.g. arrays and classes) is allocated in one contiguous block of bytes.

P → { offset = 0; } D
D → T id ; { top.put(id.lexeme, T.type, offset); offset = offset + T.width; } D1
D → ε

• offset keeps track of the next available relative address.
• top.put(…) creates a symbol table entry for the declared name.
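
The effect of these actions on a sequence of declarations can be sketched as below. The type widths (4 bytes for int, 8 for float), the function name lay_out, and the sample declarations are assumptions; a dict plays the role of the symbol table top.

```python
WIDTHS = {"int": 4, "float": 8}   # assumed widths in bytes


def lay_out(declarations):
    """Assign relative addresses to a sequence of (type, name) declarations."""
    table = {}    # stands in for the symbol table `top`
    offset = 0    # next available relative address
    for type_name, name in declarations:
        table[name] = (type_name, offset)   # top.put(id.lexeme, T.type, offset)
        offset += WIDTHS[type_name]         # offset = offset + T.width
    return table, offset


# Assumed declarations: int i; float x; int j;
table, total = lay_out([("int", "i"), ("float", "x"), ("int", "j")])
```

With these widths, i is placed at relative address 0, x at 4, and j at 12, and the final offset of 16 is the total storage for the declarations.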

Translations of Statements and Expressions
• Syntax-Directed Definition (SDD)
• Syntax-Directed Translation (SDT)

PRODUCTION         SEMANTIC RULES
S → id = E ;       S.code = E.code || gen(top.get(id.lexeme) '=' E.addr)
E → E1 + E2        E.addr = new Temp()
                   E.code = E1.code || E2.code || gen(E.addr '=' E1.addr '+' E2.addr)
  | - E1           E.addr = new Temp()
                   E.code = E1.code || gen(E.addr '=' 'minus' E1.addr)
  | ( E1 )         E.addr = E1.addr
                   E.code = E1.code
  | id             E.addr = top.get(id.lexeme)
                   E.code = ''

• E.addr: the address holding the value of E (e.g. a temporary variable, a name, or a constant)
• E.code: the three-address code of E
• gen(…): builds an instruction
• new Temp(): gets a temporary variable
• top: the current symbol table

Example: applying the semantic rules above to a = b + - c gives
    t1 = minus c
    t2 = b + t1
    a = t2

Generating three-address code incrementally avoids long string manipulations.
gen() does two things:
• generates a three-address instruction
• appends it to the sequence of instructions generated so far
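
A minimal sketch of incremental generation, assuming expressions arrive as nested tuples and identifiers are used directly as addresses (the symbol table lookup is skipped). The class name Translator and its methods are assumptions.

```python
class Translator:
    def __init__(self):
        self.code = []    # sequence of instructions generated so far
        self.temps = 0

    def new_temp(self):
        self.temps += 1
        return f"t{self.temps}"

    def gen(self, instruction):
        # Build the instruction and append it to the sequence.
        self.code.append(instruction)

    def expr(self, e):
        """Emit code for e and return the address that holds its value."""
        if isinstance(e, str):                 # E -> id
            return e
        if e[0] == "minus":                    # E -> - E1
            operand = self.expr(e[1])
            addr = self.new_temp()
            self.gen(f"{addr} = minus {operand}")
            return addr
        op, left, right = e                    # E -> E1 + E2
        left_addr = self.expr(left)
        right_addr = self.expr(right)
        addr = self.new_temp()
        self.gen(f"{addr} = {left_addr} {op} {right_addr}")
        return addr

    def assign(self, name, e):                 # S -> id = E ;
        self.gen(f"{name} = {self.expr(e)}")


t = Translator()
t.assign("a", ("+", "b", ("minus", "c")))      # a = b + - c
```

No code strings are concatenated: each rule emits its own instruction once its operands have been translated, so t.code ends up as the three instructions for a = b + - c in order.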

Arrays
• Elements of the same type
• Stored consecutively in memory
• In languages like C or Java, elements are numbered 0, 1, …, n-1
• In some other languages: low, low+1, …, high

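Arrays
• If elements start with 0 and the element width is w, then the address of a[i] is:
    base + i × w
  where base is the address of a[0].
• Generalizing to two dimensions, a[i1][i2]:
    base + i1 × w1 + i2 × w2
  where w1 is the width of a row and w2 is the width of an element, or
    base + (i1 × n2 + i2) × w
  where w is the width of an element and n2 is the number of elements in a row.
• Generalizing to k dimensions:
    base + i1 × w1 + i2 × w2 + … + ik × wk
  or
    base + ((…((i1 × n2 + i2) × n3 + i3)…) × nk + ik) × w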
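
Both forms of the address calculation can be checked against each other with a short sketch. The function names, a base address of 0, and an integer width of 4 bytes are assumptions; row-major layout is assumed throughout.

```python
def address_by_widths(base, indices, widths):
    """base + i1*w1 + i2*w2 + ... + ik*wk"""
    return base + sum(i * w for i, w in zip(indices, widths))


def address_by_counts(base, indices, counts, w):
    """base + ((...(i1*n2 + i2)*n3 + i3...)*nk + ik) * w

    counts is [n1, n2, ..., nk]; n1 does not affect the result.
    """
    position = 0
    for i, n in zip(indices, counts):
        position = position * n + i
    return base + position * w


# Assumed example: a 2 x 3 array of 4-byte integers starting at address 0.
# A row holds 3 elements, so w1 = 12 and w2 = 4.
by_widths = address_by_widths(0, (1, 2), (12, 4))
by_counts = address_by_counts(0, (1, 2), (2, 3), 4)
```

For element [1][2] the two formulas agree on a relative address of 20: one full row (12 bytes) plus two elements (8 bytes).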

• A temporary is used while computing the offset.
• A pointer refers to the symbol table entry.

A is a 2 × 3 array of integers

Type Checking
Type checking includes several aspects:
• The language comes with a type system, i.e., a set of rules saying what types can appear where.
• The compiler assigns a type expression to parts of the source program.
• The compiler checks that the type usage in the program conforms to the type system for the language.

Type Checking
• All type checking could be done at run time:
  – The compiler generates code to do the checks.
• Some languages have very weak typing; for example, variables can change their type during execution.
  – Often these languages need run-time checks.
  – Examples include Lisp, SNOBOL, and APL.
• A sound type system guarantees that all checks can be performed prior to execution. This does not mean that a given compiler will make all the necessary checks.
• An implementation is strongly typed if compiled programs are guaranteed to run without type errors.

Rules for Type Checking
• There are two forms of type checking:
  1. Type synthesis, which we will study: the types of the parts are used to infer the type of the whole. For example, integer + real = real.
  2. Type inference: the type of a construct is determined from its usage. This permits languages like ML to check types even though names need not be declared.
• We consider type checking for expressions.
• Checking statements is very similar:
  – View the statement as a function having its components as arguments and returning void.

Type Conversions
• A very strict type system would do no automatic conversion.
• Instead it would offer functions for the programmer to explicitly convert between selected types. Then either the program has compatible types or it is in error.
• We will consider an approach in which the language permits certain implicit conversions that the compiler is to supply.
• This is called type coercion. Explicit conversions supplied by the programmer are called casts.

Type Conversions
• We will see integer and real types, and postulate a unary function denoted (real) that converts an integer into the real having the same value.
• We consider the more general case where there are multiple types, some of which have coercions (often called widening).
• For example, in C/Java, int can be widened to long, which in turn can be widened to float.
• Mathematically, the hierarchy is a partially ordered set (poset) in which each pair of elements has a least upper bound (LUB).
• For many binary operators, the two operands are converted to the LUB. So adding a short to a char requires both to be converted to an int. Adding a byte to a float requires the byte to be converted to a float (the float remains a float and is not converted).

Checking and Coercing Types for Basic Arithmetic
• The steps for addition, subtraction, multiplication, and division are all essentially the same:
  – Convert each operand, if necessary, to the LUB type and then perform the arithmetic on the (converted or original) values. Note that conversion often requires the generation of code.
• Two functions are convenient:
  1. LUB(t1, t2) returns the type that is the LUB of the two given types. It signals an error if there is no LUB, for example if one of the types is an array.
  2. widen(a, t, w, newcode, newaddr): given an address a of type t and a (hopefully) wider type w, produce the instructions newcode needed so that the address newaddr holds the conversion of a to type w.
• LUB is simple: just look at the type lattice. If one of the type arguments is not in the lattice, signal an error; otherwise find the lowest common ancestor.

Checking and Coercing Types for Basic Arithmetic
• widen is more interesting. It involves n² cases for n types. Many of these are error cases (e.g., if t is wider than w). Below is the code for our situation with two possible types, integer and real.
• The four cases consist of 2 nops (when t = w), one error (t = real; w = integer), and one conversion (t = integer; w = real).
widen(a: addr, t: type, w: type, newcode: string, newaddr: addr)
    if t = w
        newcode = ""
        newaddr = a
    else if t = integer and w = real
        newaddr = new Temp()
        newcode = gen(newaddr = (real) a)
    else signal error
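
The two helper functions can be sketched in runnable form for the two-type case. This is an illustration under assumptions: the hierarchy is modeled as a simple chain, results are returned as a pair instead of through output parameters, and the names HIERARCHY, CoercionError, and new_temp are made up.

```python
import itertools

HIERARCHY = ["integer", "real"]    # assumed chain, narrowest type first
_temps = itertools.count(1)


class CoercionError(Exception):
    """Raised when there is no LUB or no widening conversion."""


def new_temp():
    return f"t{next(_temps)}"


def lub(t1, t2):
    """Least upper bound of two types in the chain."""
    if t1 not in HIERARCHY or t2 not in HIERARCHY:
        raise CoercionError(f"no LUB for {t1} and {t2}")
    return max(t1, t2, key=HIERARCHY.index)


def widen(a, t, w):
    """Return (newcode, newaddr) so that newaddr holds a converted to type w."""
    if t == w:
        return [], a                              # nop: no code, same address
    if t == "integer" and w == "real":
        newaddr = new_temp()
        return [f"{newaddr} = (real) {a}"], newaddr
    raise CoercionError(f"cannot widen {t} to {w}")


# Assumed use: x is an integer operand in an expression whose other operand is real.
code, addr = widen("x", "integer", lub("integer", "real"))
```

With more than two types, a chain is no longer enough when the hierarchy branches; lub would then need to search for the lowest common ancestor.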

Checking and Coercing Types for Basic Arithmetic
• With these two functions it is not hard to modify the rules to catch type errors and perform coercions for arithmetic expressions:
  – Maintain the type of each operand by defining type attributes for e, t, and f.
  – Coerce each operand to the LUB.
• This requires that we have type information for the base entities, identifiers and numbers. The lexer can supply the type of the numbers. We retrieve it via get(NUM.type).
• It is more interesting for the identifiers. We insert that information when we process declarations. So we now have another semantic check: is the identifier declared before it is used?

Checking and Coercing Types for Basic Arithmetic
• Let's examine a particularly interesting entry. Consider the assignment statement A[3/X+4] := X*5+Y;
• Consider the ra node, i.e., the node corresponding to the production ra → [ e ] := e1 ;
• When the tree traversal gets to this node, its parent has passed in the value of the inherited attribute ra.id = id.entry.
• Thus the ra node has access to the identifier table entry for ID, which in our example is the variable A.

Checking and Coercing Types for Basic Arithmetic
• Prior to doing its calculations, the ra node invokes its children and gets back all the synthesized attributes. To summarize, when the ra node performs its calculations, it has available:
  – ra.id: the identifier entry for A.
  – e.addr/e.code: executing e.code results in e.addr containing the value of the expression 3/X+4.
  – e1.addr/e1.code: executing e1.code results in e1.addr containing the value of the expression X*5+Y.
  – e.type/e1.type: the types of the expressions.
• What must the ra node do?
  – Ensure execution of e.code and e1.code.
  – Check that e.type is int (I don't do this).
  – Multiply e by the base width of the array A. (We need a temporary, ra.t1, to hold the computed value.)
  – Widen e1 to the base type of A. (We need, and widen generates, a temporary ra.addr to hold the widened value.)
  – Do the actual assignment of X*5+Y to A[3/X+4].

Control Flow
• Control flow includes the study of Boolean expressions, which have two roles:
  1. They can be computed and treated similarly to integers or reals. There are also relational operators that produce Boolean values from arithmetic operands.
  2. They are used in certain statements that alter the normal flow of control.
• One question that comes up with Boolean expressions is whether both operands need be evaluated. If we are evaluating A or B and find that A is true, must we evaluate B? For example, consider evaluating A=0 OR 3/A < 1.2 when A is zero.
• Consider A*F(x). If the compiler knows that for this run A is zero, must it generate code to evaluate F(x)? Don't forget that functions can have side effects.

Short-Circuit Code
• This is also called jumping code. Here the Boolean operators AND, OR, and NOT do not appear in the generated instruction stream. Instead we just generate jumps to either the true branch or the false branch.
• This statement can be translated as follows:

Flow-of-Control Statements
• We now consider the translation of boolean expressions into three-address code in the context of statements such as those generated by the following grammar:
    S → if ( B ) S1
    S → if ( B ) S1 else S2
    S → while ( B ) S1
• Both B and S have a synthesized attribute code, which gives the translation into three-address instructions.
• For simplicity, we build up the translations B.code and S.code as strings, using syntax-directed definitions.

• The translation of if (B) S1 consists of B.code followed by S1.code.

Control-Flow Translation of Boolean Expressions

Control-Flow Translation of Boolean Expressions
• Consider again the following statement:
• Translating it into three-address code yields the following result:
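
How jumping code falls out of the true and false labels can be sketched as follows. This is an assumed illustration, not the slides' own listing: the tuple encoding of B, the function names, and the sample statement if (x < 100 || x > 200 && x != y) x = 0; are choices made here.

```python
import itertools

_labels = itertools.count(1)


def new_label():
    return f"L{next(_labels)}"


def bool_code(b, true, false):
    """Jumping code for b: control ends up at label `true` or label `false`."""
    kind = b[0]
    if kind == "or":                      # B -> B1 || B2
        b1_false = new_label()            # B1 false: fall into B2
        return (bool_code(b[1], true, b1_false) + [f"{b1_false}:"]
                + bool_code(b[2], true, false))
    if kind == "and":                     # B -> B1 && B2
        b1_true = new_label()             # B1 true: fall into B2
        return (bool_code(b[1], b1_true, false) + [f"{b1_true}:"]
                + bool_code(b[2], true, false))
    if kind == "not":                     # B -> ! B1: swap the targets
        return bool_code(b[1], false, true)
    left, op, right = b[1:]               # ("rel", left, op, right)
    return [f"if {left} {op} {right} goto {true}", f"goto {false}"]


def if_code(b, body, next_label):
    """if (B) S1: B.code, then the label for B.true, then S1.code."""
    true = new_label()
    return bool_code(b, true, next_label) + [f"{true}:"] + body


next_label = new_label()
cond = ("or", ("rel", "x", "<", "100"),
        ("and", ("rel", "x", ">", "200"), ("rel", "x", "!=", "y")))
program = if_code(cond, ["x = 0"], next_label) + [f"{next_label}:"]
```

No AND, OR, or NOT instruction appears in program; the operators only decide which labels are passed down. The output still contains jumps to the immediately following label, which is the redundancy the next slides set out to remove.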

Avoiding Redundant Gotos
• For x > 200 we will have the following code fragment:

Boolean Values and Jumping Code
• A boolean expression may also be evaluated for its value, as in assignment statements such as x = true; or x = a < b;
• A clean way of handling both roles of boolean expressions is to first build a syntax tree for expressions, using either of the following approaches:
  1. Use two passes. Construct a complete syntax tree for the input, and then walk the tree in depth-first order, computing the translations specified by the semantic rules.
  2. Use one pass for statements, but two passes for expressions. With this approach, we would translate E in while (E) S1 before S1 is examined. The translation of E, however, would be done by building its syntax tree and then walking the tree.