Elaboration or: Semantic Analysis
Compiler
Baojian Hua
bjhua@ustc.edu.cn

Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR

Elaboration
- Also known as type-checking, semantic analysis, or context-sensitive analysis
- Checks the context-sensitive properties of programs (ASTs):
  - every variable is declared before use
  - every expression has a proper type
  - function calls conform to definitions
  - all other context-sensitive information (highly language-dependent)
  - …

Elaboration Example

    // Sample C code:
    void f (int *p)
    {
      x += 4;
      p (23);
      "hello" + "world";
    }

    int main ()
    {
      f () + 5;
      break;
    }

What errors can be detected here?

Conceptually
AST -> Elaborator -> Intermediate Code
Language Semantics -> Elaborator

Semantics
- Traditionally, semantics takes the form of a natural-language specification
  - e.g., for the "+" operator, both the left and right operands should be of "integer" type
  - refer to the various language specifications
- But recent research has revealed that semantics can also be addressed via math
  - rigorous and clean

Semantics
- Now let's turn to MacQueen's note…
- How to implement these rules?

Symbol Tables
- To keep track of types and other information, we maintain a finite map from program symbols to information
  - symbols: variables, function names, etc.
- Such a mapping is called a symbol table, or sometimes an environment
  - Notation: {x1: b1, x2: b2, …, xn: bn}
  - where each bi (1 ≤ i ≤ n) is called a binding

Type System
- Next, we write the symbol table as Σ
  - Σ = T1 x1; T2 x2; T3 x3; …
  - a list of (T id) tuples; may be empty
- Each rule takes the form of

$$\frac{\Sigma \vdash P_1 : T_1 \qquad \cdots \qquad \Sigma \vdash P_n : T_n}{\Sigma \vdash C : T}$$

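Type System: exp

$$\Sigma \vdash \mathit{num} : \mathtt{int} \qquad \frac{T\ \mathit{id} \in \Sigma}{\Sigma \vdash \mathit{id} : T}$$

$$\Sigma \vdash \mathtt{true} : \mathtt{bool} \qquad \Sigma \vdash \mathtt{false} : \mathtt{bool}$$

$$\frac{\Sigma \vdash E_1 : \mathtt{int} \qquad \Sigma \vdash E_2 : \mathtt{int}}{\Sigma \vdash E_1 + E_2 : \mathtt{int}}$$

$$\frac{\Sigma \vdash E_1 : \mathtt{bool} \qquad \Sigma \vdash E_2 : \mathtt{bool}}{\Sigma \vdash E_1 \mathbin{\&\&} E_2 : \mathtt{bool}}$$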

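Type System: stm

$$\frac{\Sigma \vdash \mathit{id} : T \qquad \Sigma \vdash E : T}{\Sigma \vdash \mathit{id} := E : \mathtt{OK}}$$

$$\frac{\Sigma \vdash E : \mathtt{int}}{\Sigma \vdash \mathtt{print}(E) : \mathtt{OK}} \qquad \frac{\Sigma \vdash E : \mathtt{bool}}{\Sigma \vdash \mathtt{printBool}(E) : \mathtt{OK}}$$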

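Type System: dec, prog

Declarations (ε denotes the empty declaration list; the side condition rejects duplicate declarations):

$$\frac{\mathit{id} \notin \mathrm{dom}(\Sigma) \qquad \Sigma;\ T\ \mathit{id} \vdash \mathit{DS} : \Sigma'}{\Sigma \vdash T\ \mathit{id};\ \mathit{DS} : \Sigma'} \qquad \Sigma \vdash \epsilon : \Sigma$$

Programs:

$$\frac{\vdash \mathit{DS} : \Sigma \qquad \Sigma \vdash S : \mathtt{OK}}{\vdash \mathit{DS}\ S : \mathtt{OK}}$$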

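Example

    // Whether or not the following program is
    // well-typed?
    int x;
    int y;
    print (x+y);

Derivation, where Σ = int x; int y and ε is the empty declaration list.

Declarations:

$$\frac{\dfrac{\mathtt{int}\ x;\ \mathtt{int}\ y \vdash \epsilon : \Sigma}{\mathtt{int}\ x \vdash \mathtt{int}\ y : \Sigma}}{\vdash \mathtt{int}\ x;\ \mathtt{int}\ y : \Sigma}$$

Statement:

$$\frac{\dfrac{\dfrac{\mathtt{int}\ x \in \Sigma}{\Sigma \vdash x : \mathtt{int}} \qquad \dfrac{\mathtt{int}\ y \in \Sigma}{\Sigma \vdash y : \mathtt{int}}}{\Sigma \vdash x + y : \mathtt{int}}}{\Sigma \vdash \mathtt{print}(x+y) : \mathtt{OK}}$$

Program:

$$\frac{\vdash \mathtt{int}\ x;\ \mathtt{int}\ y : \Sigma \qquad \Sigma \vdash \mathtt{print}(x+y) : \mathtt{OK}}{\vdash \mathtt{int}\ x;\ \mathtt{int}\ y;\ \mathtt{print}(x+y) : \mathtt{OK}}$$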

Elaboration of Expressions

$$\Sigma \vdash \mathit{num} : \mathtt{int}$$

    T elab_exp (sigma, num) =
      return int

Elaboration of Expressions

$$\Sigma \vdash \mathtt{true} : \mathtt{bool}$$

    T elab_exp (sigma, true) =
      return bool

Elaboration of Expressions

$$\Sigma \vdash \mathtt{false} : \mathtt{bool}$$

    T elab_exp (sigma, false) =
      return bool

Elaboration of Expressions

$$\frac{T\ \mathit{id} \in \Sigma}{\Sigma \vdash \mathit{id} : T}$$

    T elab_exp (sigma, id) =
      T ty = Table_lookup (sigma, id);
      if (ty == NULL)
        error ("variable not declared");
      return ty;

Elaboration of Expressions

$$\frac{\Sigma \vdash e_1 : \mathtt{int} \qquad \Sigma \vdash e_2 : \mathtt{int}}{\Sigma \vdash e_1 + e_2 : \mathtt{int}}$$

    T elab_exp (sigma, e1+e2) =
      type t1 = elab_exp (sigma, e1)
      type t2 = elab_exp (sigma, e2)
      switch (t1, t2) {
        case (Int, Int): return Int;
        case (Int, _):   error ("e2 should be int")
        case (_, Int):   error ("e1 should be int")
        default:         error ("should both be int")
      }

Elaboration of Expressions

$$\frac{\Sigma \vdash e_1 : \mathtt{bool} \qquad \Sigma \vdash e_2 : \mathtt{bool}}{\Sigma \vdash e_1 \mathbin{\&\&} e_2 : \mathtt{bool}}$$

    type elab_exp (sigma, e1&&e2) =
      type t1 = elab_exp (sigma, e1)
      type t2 = elab_exp (sigma, e2)
      switch (t1, t2) {
        case (Bool, Bool): return Bool;
        case (Bool, _):    error ("e2 should be bool")
        case (_, Bool):    error ("e1 should be bool")
        default:           error ("should both be bool")
      }
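
The expression rules can be collected into one recursive function. The following is a minimal Python sketch, not the course's implementation: the AST encoding (Python literals, strings for identifiers, tuples for binary operators), the `ElabError` class, and the error wording are all assumptions made for illustration.

```python
class ElabError(Exception):
    """Raised when an expression is ill-typed (hypothetical error type)."""

INT, BOOL = "int", "bool"

def elab_exp(sigma, exp):
    """Return the type of exp under sigma, a dict from names to types.

    Assumed encoding: int/bool literals, str for identifiers,
    and (op, e1, e2) tuples with op in {"+", "&&"}.
    """
    if isinstance(exp, bool):      # test bool first: bool is a subclass of int
        return BOOL
    if isinstance(exp, int):
        return INT
    if isinstance(exp, str):       # identifier: look it up in the symbol table
        if exp not in sigma:
            raise ElabError(f"variable not declared: {exp}")
        return sigma[exp]
    op, e1, e2 = exp
    expected = {"+": INT, "&&": BOOL}[op]
    t1, t2 = elab_exp(sigma, e1), elab_exp(sigma, e2)
    if t1 != expected and t2 != expected:
        raise ElabError(f"both operands of {op} should be {expected}")
    if t1 != expected:
        raise ElabError(f"left operand of {op} should be {expected}")
    if t2 != expected:
        raise ElabError(f"right operand of {op} should be {expected}")
    return expected
```

For instance, `elab_exp({"x": "int"}, ("+", "x", 1))` yields `"int"`, while `("+", 1, True)` is rejected because the right operand is a bool.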

Elaboration of Statements

$$\frac{\Sigma \vdash x : \mathit{ty} \qquad \Sigma \vdash e : \mathit{ty}}{\Sigma \vdash x := e : \mathtt{OK}}$$

    void elab_stm (sigma, x=e) =
      type t1 = elab_exp (sigma, x);
      type t2 = elab_exp (sigma, e);
      if (t1 != t2)
        error ("different types in assignment");

Elaboration of Statements

$$\frac{\Sigma \vdash e : \mathtt{int}}{\Sigma \vdash \mathtt{print}(e) : \mathtt{OK}}$$

    void elab_stm (sigma, print(e)) =
      type ty = elab_exp (sigma, e)
      if (ty != INT)
        error ("type should be INT");

Elaboration of Statements

$$\frac{\Sigma \vdash e : \mathtt{bool}}{\Sigma \vdash \mathtt{printBool}(e) : \mathtt{OK}}$$

    void elab_stm (sigma, printBool(e)) =
      type ty = elab_exp (sigma, e)
      if (ty != BOOL)
        error ("type should be BOOL");

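Elaboration of Declarations

$$\frac{\mathit{ID} \notin \mathrm{dom}(\Sigma) \qquad \Sigma;\ \mathit{type}\ \mathit{ID} \vdash \mathit{decs} : \Sigma'}{\Sigma \vdash \mathit{type}\ \mathit{ID};\ \mathit{decs} : \Sigma'} \qquad \Sigma \vdash \epsilon : \Sigma$$

(ε denotes the empty declaration list.)

    Sigma elab_decs (sigma, decs) =
      if (decs == [])
        return sigma;
      // decs = type ID; decs'
      if (ID in sigma)
        error ("duplicated decl");
      new_sigma = enter_table (sigma, type ID);
      return elab_decs (new_sigma, decs');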
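
The declaration rule threads the table through the list of declarations. A small Python sketch of that recursion follows. It assumes declarations are `(type, name)` pairs and the table is a plain dict, and it reports duplicates with `ValueError`; none of these choices comes from the slides.

```python
def elab_decs(sigma, decs):
    """Extend sigma with each (type, name) declaration, rejecting duplicates."""
    if not decs:                       # empty list: the table is the result
        return sigma
    (ty, name), rest = decs[0], decs[1:]
    if name in sigma:
        raise ValueError(f"duplicated declaration: {name}")
    # Build a new table instead of mutating sigma, mirroring "Σ; type ID".
    return elab_decs({**sigma, name: ty}, rest)
```

Because each step builds a fresh dict, the caller's table is left untouched, which matches the functional reading of the rule.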

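Elaboration of Programs

$$\frac{\vdash \mathit{decs} : \Sigma \qquad \Sigma \vdash \mathit{stm} : \mathtt{OK}}{\vdash \mathit{decs}\ \mathit{stm} : \mathtt{OK}}$$

    void elab_prog (decs stm) =
      sigma = elab_decs ({}, decs);   // start from the empty table
      elab_stm (sigma, stm)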
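
To see the program rule end to end, here is a compact Python sketch that elaborates declarations and then one statement. The tuple encoding of statements, the helper `type_of` (a deliberately tiny expression checker covering only literals, identifiers, and `+`), and the use of `TypeError` are assumptions for illustration.

```python
INT, BOOL = "int", "bool"

def type_of(sigma, e):
    """Tiny expression checker: literals, identifiers, and ("+", e1, e2)."""
    if isinstance(e, bool):
        return BOOL
    if isinstance(e, int):
        return INT
    if isinstance(e, str):
        if e not in sigma:
            raise TypeError(f"variable not declared: {e}")
        return sigma[e]
    _, e1, e2 = e
    if type_of(sigma, e1) != INT or type_of(sigma, e2) != INT:
        raise TypeError("operands of + should both be int")
    return INT

def elab_stm(sigma, stm):
    """Statements: ("assign", x, e), ("print", e), ("printBool", e)."""
    if stm[0] == "assign":
        if type_of(sigma, stm[1]) != type_of(sigma, stm[2]):
            raise TypeError("different types in assignment")
    else:
        expected = INT if stm[0] == "print" else BOOL
        if type_of(sigma, stm[1]) != expected:
            raise TypeError(f"type should be {expected}")

def elab_prog(decs, stm):
    """decs is a list of (type, name); elaboration starts from the empty table."""
    sigma = {}
    for ty, name in decs:
        if name in sigma:
            raise TypeError(f"duplicated declaration: {name}")
        sigma[name] = ty
    elab_stm(sigma, stm)
    return "OK"
```

Calling `elab_prog([("int", "x"), ("int", "y")], ("print", ("+", "x", "y")))` corresponds to the program `int x; int y; print(x+y);` and returns `"OK"`.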

Moral
- There may be other information associated with identifiers, not just types, say:
  - scope
  - storage class
  - access control information
  - …
- All these details are handled by symbol tables (Σ)!

Implementation
- Must be efficient!
  - lots of variables, functions, etc.
- Two basic approaches:
  - Functional
    - the symbol table is implemented as a functional data structure (e.g., a red-black tree), with no tables ever destroyed or modified
  - Imperative
    - a single table, modified for every binding added or removed
- This choice is largely independent of the implementation language

Functional Symbol Table
- Basic idea:
  - to implement σ2 = σ1 + {x: t}, create a new table σ2 instead of modifying σ1
  - when deleting, restore the old table
- A good data structure for this is a BST or a red-black tree

BST Symbol Table
[Figure: a binary search tree representing a symbol table, with nodes c: int, e: int, a: char, b: double]

Possible Functional Interface

    signature SYMBOL_TABLE =
    sig
      type 'a t
      type key
      val empty: 'a t
      val insert: 'a t * key * 'a -> 'a t
      val lookup: 'a t * key -> 'a option
    end
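
One way to realize this interface is a persistent binary search tree, where `insert` copies only the nodes on the search path and shares everything else with the old table. The Python sketch below is an illustration under that assumption. It uses an unbalanced tree for brevity (a red-black tree would add rebalancing), and the names `Node`, `empty`, `insert`, and `lookup` simply mirror the signature.

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    key: str
    value: object
    left: Optional["Node"]
    right: Optional["Node"]

empty = None  # the empty table

def insert(table, key, value):
    """Return a new table; the old one stays valid and unchanged."""
    if table is None:
        return Node(key, value, None, None)
    if key < table.key:
        return table._replace(left=insert(table.left, key, value))
    if key > table.key:
        return table._replace(right=insert(table.right, key, value))
    return table._replace(value=value)   # same key: new binding shadows the old

def lookup(table, key):
    """Return the bound value, or None if key is unbound."""
    while table is not None:
        if key == table.key:
            return table.value
        table = table.left if key < table.key else table.right
    return None
```

"Deleting" a binding then amounts to going back to the older table value, which is still intact.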

Imperative Symbol Tables
- The imperative approach almost always involves the use of hash tables
- Need to delete entries to revert to the previous environment
  - made simpler because deletes follow a stack discipline
  - can maintain a stack of entered symbols, so that they can later be popped and removed from the hash table

Possible Imperative Interface

    signature SYMBOL_TABLE =
    sig
      type 'a t
      type key
      val insert: 'a t * key * 'a -> unit
      val lookup: 'a t * key -> 'a option
      val delete: 'a t * key -> unit
      val beginScope: unit -> unit
      val endScope: unit -> unit
    end
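
A hash table plus a stack of entered symbols is enough to support scoped insertion and removal. The following Python sketch assumes a dict as the hash table and uses `None` as a scope marker on the stack. The class and method names are hypothetical, and the explicit `delete` operation from the signature is left out because `end_scope` performs the deletions.

```python
class SymbolTable:
    def __init__(self):
        self.table = {}      # key -> list of bindings, innermost last
        self.entered = []    # stack of entered keys; None marks a scope start

    def insert(self, key, value):
        self.table.setdefault(key, []).append(value)
        self.entered.append(key)

    def lookup(self, key):
        bindings = self.table.get(key)
        return bindings[-1] if bindings else None

    def begin_scope(self):
        self.entered.append(None)

    def end_scope(self):
        # Pop and remove every symbol entered since the matching begin_scope.
        while self.entered:
            key = self.entered.pop()
            if key is None:
                break
            self.table[key].pop()
```

Keeping a list of bindings per key lets an inner declaration shadow an outer one and restores the outer binding automatically when the scope ends.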

Implementation of Symbols
- For several reasons, it will be useful at some point to represent symbols as elements of a small, densely packed set of identities
  - fast comparisons (equality)
  - for dataflow analysis, we will want sets of variables and fast set operations
    - it will be critically important to use bit strings to represent the sets
    - for example, in your liveness analysis algorithm
- More on this later
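
As a rough illustration of the idea, the Python sketch below interns each name as a small integer and then represents a set of symbols as the bits of one integer. The `Interner` class and the helpers `bitset` and `members` are hypothetical names, not part of the course code.

```python
class Interner:
    """Map each distinct name to a small integer 0, 1, 2, ..."""
    def __init__(self):
        self.ids = {}
        self.names = []

    def intern(self, name):
        if name not in self.ids:
            self.ids[name] = len(self.names)
            self.names.append(name)
        return self.ids[name]

def bitset(ids):
    """Pack a collection of symbol ids into one integer used as a bit string."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits

def members(bits, interner):
    """List the names whose bit is set, in id order."""
    return [n for i, n in enumerate(interner.names) if (bits >> i) & 1]
```

With this representation, symbol equality is an integer comparison, and set union and intersection become the single operations `|` and `&`.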

Scope
- How to handle lexical scope?
- Many choices:
  - One table
    - insert and remove bindings during elaboration, as we enter and leave a local scope
  - Stack of tables
    - insertion and removal always operate on the stack top
    - the dragon compiler makes use of this
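
The stack-of-tables choice can be sketched in a few lines of Python. This is an illustration under the assumption that each scope is a dict and that lookup searches from the innermost scope outward; the class and method names are made up for the example.

```python
class ScopedTable:
    def __init__(self):
        self.scopes = [{}]            # bottom of the stack is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def leave_scope(self):
        self.scopes.pop()             # drops all bindings of the local scope

    def insert(self, name, ty):
        self.scopes[-1][name] = ty    # always operate on the stack top

    def lookup(self, name):
        for scope in reversed(self.scopes):   # innermost scope first
            if name in scope:
                return scope[name]
        return None
```

Leaving a scope is a single pop, at the cost of a lookup that may walk several tables.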

One-table approach

    int x;                 σ  = {x: int}
    int f ()               σ1 = σ + {f: …} = {x: int, f: …}
    {
      if (4) {
        int x;             σ2 = σ1 + {x: int} = {x: …, f: …, x: …}
        x = 6;
      }                    σ1
      else {
        int x;             σ4 = σ1 + {x: int} = {x: …, f: …, x: …}
        x = 5;
      }                    σ1
      x = 8;               σ1
    }

Shadowing: "+" is not commutative!
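
The remark about "+" can be checked with a tiny Python sketch in which tables are dicts and `extend` plays the role of "+". The function name and the choice of different types for the two `x` bindings are assumptions, picked only to make the shadowing visible.

```python
def extend(sigma, delta):
    """sigma + delta: bindings in delta shadow those in sigma.

    Neither input table is modified.
    """
    return {**sigma, **delta}

outer = {"x": "int"}
inner = {"x": "bool"}

# The right-hand table wins, so the order of the operands matters.
inner_wins = extend(outer, inner)
outer_wins = extend(inner, outer)
```

Here `inner_wins["x"]` is `"bool"` while `outer_wins["x"]` is `"int"`, so swapping the operands changes the result.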

Name Space

    struct list {
      int x;
      struct list *list;
    } *list;

    void walk (struct list *list)
    {
    list:
      printf ("%d\n", list->x);
      if (list = list->list)
        goto list;
    }

Name Space
- It's trivial to handle name spaces
  - one symbol table for each name space
- Take C as an example:
  - several different name spaces
    - labels
    - tags
    - variables
- So …
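
A sketch of "one symbol table for each name space" in Python follows. The three table names follow the list above; the class, its methods, and the stored info strings are hypothetical.

```python
class NameSpaces:
    """One independent table per name space."""
    def __init__(self):
        self.tables = {"labels": {}, "tags": {}, "variables": {}}

    def declare(self, space, name, info):
        table = self.tables[space]
        if name in table:
            raise ValueError(f"duplicate {name} in {space}")
        table[name] = info

    def lookup(self, space, name):
        return self.tables[space].get(name)

# The identifier `list` can be declared once in every name space.
ns = NameSpaces()
ns.declare("tags", "list", "struct")
ns.declare("variables", "list", "struct list *")
ns.declare("labels", "list", "label in walk")
```

The elaborator chooses the table from the syntactic position of the identifier, for example after `struct` or after `goto`, so the three uses of `list` never clash.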

Types
- The representation of types is highly language-dependent
- Some key considerations:
  - name vs. structural equivalence
  - mutually recursive type definitions
  - error handling

Name vs. Structural Equivalence

    struct A { int i; } x;
    struct B { int i; } y;
    x = y;

- In a language with structural equivalence, this program is legal
- But not in a language with name equivalence (e.g., C)
- For name equivalence, can generate a unique symbol for each defined type
- For structural equivalence, need to recursively compare the types
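
The two notions of equivalence can be contrasted with a short Python sketch. The `Struct` representation, the decision to compare field names as well as field types, and the restriction to non-recursive types are all assumptions of this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Struct:
    name: str       # the declared tag, e.g. "A"
    fields: tuple   # tuple of (field_name, field_type) pairs

def name_equal(t1, t2):
    """Name equivalence: equal only if both come from the same definition."""
    if isinstance(t1, Struct) and isinstance(t2, Struct):
        return t1.name == t2.name
    return t1 == t2

def struct_equal(t1, t2):
    """Structural equivalence: compare shapes recursively, ignoring tags."""
    if isinstance(t1, Struct) and isinstance(t2, Struct):
        return (len(t1.fields) == len(t2.fields) and
                all(f1 == f2 and struct_equal(ty1, ty2)
                    for (f1, ty1), (f2, ty2) in zip(t1.fields, t2.fields)))
    return t1 == t2

A = Struct("A", (("i", "int"),))
B = Struct("B", (("i", "int"),))
```

Under `struct_equal` the assignment `x = y` type-checks because A and B have the same shape; under `name_equal` it is rejected because the tags differ.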

Mutually recursive type definitions

    struct A {
      int data;
      struct A *next;
      struct B *b;
    };
    struct B {…};

- To process recursive and mutually recursive type definitions, need a placeholder
  - in ML, an option ref
  - in C, a pointer
  - in Java, bind method (read Appel)
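
One common way to use a placeholder is a two-pass scheme: first enter an empty record for every struct name, then fill in the bodies. The Python sketch below assumes field types are written as the strings `"int"` or `"struct N *"`. The body given for `B` in the usage example is invented, since the slide elides it.

```python
class StructType:
    def __init__(self, name):
        self.name = name
        self.fields = None          # placeholder: filled in by the second pass

def elab_structs(defs):
    """defs maps a struct name to a list of (field, type_name) pairs."""
    env = {name: StructType(name) for name in defs}   # pass 1: headers only
    for name, fields in defs.items():                 # pass 2: resolve bodies
        resolved = {}
        for field, ty in fields:
            if ty.startswith("struct "):
                target = ty[len("struct "):].rstrip(" *")
                if target not in env:
                    raise KeyError(f"undefined struct {target}")
                resolved[field] = ("pointer", env[target])
            else:
                resolved[field] = ty
        env[name].fields = resolved
    return env

env = elab_structs({
    "A": [("data", "int"), ("next", "struct A *"), ("b", "struct B *")],
    "B": [("a", "struct A *")],     # hypothetical body for B
})
```

After the second pass, `A.next` points back at the record for `A` itself and `A.b` points at the record for `B`, even though `B` is defined later.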

Error Diagnostic
- To recover from errors, it is useful to have an "any" type
  - makes it possible to continue type checking
  - in practice, use "int" or guess one
- Similarly, a "void" type can be used for expressions that return no value
- Source locations are annotated in the AST!
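
The effect of an "any" type can be shown with a small Python sketch that records errors in a list instead of stopping at the first one. The function name `elab_exp_recovering`, the tuple encoding of `+`, and the error wording are assumptions for illustration.

```python
ANY, INT, BOOL = "any", "int", "bool"

def elab_exp_recovering(sigma, exp, errors):
    """Type-check literals, identifiers, and ("+", e1, e2), collecting errors."""
    if isinstance(exp, bool):
        return BOOL
    if isinstance(exp, int):
        return INT
    if isinstance(exp, str):
        if exp not in sigma:
            errors.append(f"variable not declared: {exp}")
            return ANY                    # unknown type, compatible with everything
        return sigma[exp]
    _, e1, e2 = exp
    for side, e in (("left", e1), ("right", e2)):
        t = elab_exp_recovering(sigma, e, errors)
        if t not in (INT, ANY):           # ANY suppresses follow-on errors
            errors.append(f"{side} operand of + should be int")
    return INT                            # "+" yields int even after an error
```

For `(y + x) + true` with only `x` declared, this reports the undeclared `y` once and the bool operand once, without a cascade of secondary messages.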

Summary
- Elaboration checks the context-sensitive properties of programs
  - must take care of the semantics of source programs
  - and may translate them into lower-level forms
- Usually the biggest (most complex) part of a compiler!