Elaboration or: Semantic Analysis
Compiler
Baojian Hua
bjhua@ustc.edu.cn

Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR

Elaboration
- Also known as type-checking, semantic analysis, or context-sensitive analysis
- Checks the well-formedness of programs:
  - every variable is declared before use
  - every expression has a proper type
  - function calls conform to definitions
  - all other possible context-sensitive information (highly language-dependent)
  - …
- Translates the AST into intermediate or machine code

Elaboration Example
    void f (int *p)
    {
      x += 4;
      p (23);
      "hello" + "world";
    }
    int main ()
    {
      f () + 5;
    }
What errors can be detected here?

Terminology
- Scope
- Lifetime
- Storage class
- Name space

Terminologies: Scope
    int x;
    int f ()
    {
      if (4) {
        int x;
        x = 6;
      }
      else {
        int x;
        x = 5;
      }
      x = 8;
    }

Terminologies: Lifetime
    static int x;
    int f ()
    {
      int x, *p;
      x = 6;
      p = malloc (sizeof (*p));
      if (3) {
        static int x;
        x = 5;
      }
    }

Terminologies: Storage class
    extern int x;
    int f ()
    {
      extern int x;
      x = 6;
      if (3) {
        extern int x;
        x = 5;
      }
    }

Terminologies: Name space
    struct list {
      int x;
      struct list *list;
    } *list;
    void walk (struct list *list)
    {
    list:
      printf ("%d\n", list->x);
      if (list = list->list)
        goto list;
    }

Moral
- For the purpose of elaboration, we must take care of all of these TOGETHER:
  - Scope
  - Lifetime
  - Storage class
  - Name space
  - …
- All these details are handled by symbol tables!

Symbol Tables
- To keep track of types and other information, we maintain a finite map from program symbols to that information
  - symbols: variables, function names, etc.
- Such a mapping is called a symbol table, or sometimes an environment
  - Notation: {x1: t1, x2: t2, …, xn: tn}, where each xi: ti (1 ≤ i ≤ n) is called a binding

Scope
- How to handle lexical scope?
- It's easy: we just insert and remove bindings during elaboration, as we enter and leave a local scope

Scope
    int x;            σ = {x: int}
    int f ()          σ1 = σ + {f: …} = {x: int, f: …}
    {
      if (4) {
        int x;        σ2 = σ1 + {x: int} = {x: …, f: …, x: …}
        x = 6;
      }               σ1
      else {
        int x;        σ4 = σ1 + {x: int} = {x: …, f: …, x: …}
        x = 5;
      }               σ1
      x = 8;          σ1
    }
Shadowing: “+” is not commutative!
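
To see why “+” is not commutative, here is a minimal Python sketch, assuming an environment is simply a list of bindings searched from the most recent one backwards. The helper names `extend` and `lookup` and the type `char` are illustrative assumptions, not part of the slides.

```python
def extend(env, bindings):
    """Return env + bindings as a new list; env itself is not modified."""
    return env + list(bindings)

def lookup(env, name):
    """Search from the newest binding backwards, so later bindings shadow earlier ones."""
    for key, ty in reversed(env):
        if key == name:
            return ty
    return None

# Two hypothetical environments that both bind x, at different types.
a = [("x", "int")]
b = [("x", "char")]

left = lookup(extend(a, b), "x")   # the binding from b wins
right = lookup(extend(b, a), "x")  # the binding from a wins
```

Since `left` and `right` differ, a + b and b + a are different environments: the right operand shadows the left one.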

Implementation
- Must be efficient!
  - lots of variables, functions, etc.
- Two basic approaches:
  - Functional: the symbol table is implemented as a functional data structure (e.g., red-black tree), with no tables ever destroyed or modified
  - Imperative: a single table, modified for every binding added or removed
- This choice is largely independent of the implementation language

Functional Symbol Table
- Basic idea:
  - when implementing σ2 = σ1 + {x: t}, create a new table σ2 instead of modifying σ1
  - when deleting, restore the old table
- A good data structure for this is a BST or a red-black tree

BST Symbol Table
[Figure: a binary search tree whose nodes hold the bindings c: int, e: int, a: char, b: double]

Possible Functional Interface
    signature SYMBOL_TABLE =
    sig
      type 'a t
      type key
      val empty: 'a t
      val insert: 'a t * key * 'a -> 'a t
      val lookup: 'a t * key -> 'a option
    end
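
The functional interface can be sketched in Python as a persistent binary search tree. This is only an illustration under simplifying assumptions: keys are strings, the tree is not rebalanced (a red-black tree would add that), and the names `Node`, `EMPTY` and the binding `d: float` are hypothetical.

```python
from typing import NamedTuple, Optional

class Node(NamedTuple):
    key: str
    value: str
    left: Optional["Node"]
    right: Optional["Node"]

EMPTY = None  # the empty table

def insert(table, key, value):
    """Return a new table with key bound to value; the old table is untouched."""
    if table is None:
        return Node(key, value, None, None)
    if key < table.key:
        return table._replace(left=insert(table.left, key, value))
    if key > table.key:
        return table._replace(right=insert(table.right, key, value))
    return table._replace(value=value)  # same key: the new version rebinds it

def lookup(table, key):
    """Return the value bound to key, or None if there is no binding."""
    while table is not None:
        if key == table.key:
            return table.value
        table = table.left if key < table.key else table.right
    return None

# The bindings from the BST slide, inserted in the order c, e, a, b.
sigma1 = EMPTY
for k, v in [("c", "int"), ("e", "int"), ("a", "char"), ("b", "double")]:
    sigma1 = insert(sigma1, k, v)

# Entering a scope: build sigma2 from sigma1. Leaving it: keep using sigma1.
sigma2 = insert(sigma1, "d", "float")
```

Only the nodes on the path from the root to the new binding are copied; the rest of the tree is shared between `sigma1` and `sigma2`, which is what makes "restore the old table" cheap.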

Imperative Symbol Tables
- The imperative approach almost always involves the use of hash tables
- Need to delete entries to revert to the previous environment
  - made simpler because deletes follow a stack discipline
  - can maintain a stack of entered symbols, so that they can later be popped and removed from the hash table

Possible Imperative Interface
    signature SYMBOL_TABLE =
    sig
      type 'a t
      type key
      val insert: 'a t * key * 'a -> unit
      val lookup: 'a t * key -> 'a option
      val delete: 'a t * key -> unit
      val beginScope: unit -> unit
      val endScope: unit -> unit
    end
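
A rough Python sketch of the imperative approach, assuming a hash table (a `dict`) from names to a stack of bindings, plus a stack of entered symbols with `None` as a scope marker. The class and method names are illustrative; unlike the interface above, deletion is folded into `end_scope`, and calls to `begin_scope` and `end_scope` are assumed to be balanced.

```python
class SymbolTable:
    def __init__(self):
        self.table = {}    # name -> list of infos, innermost binding last
        self.entered = []  # stack of entered names; None marks a scope start

    def begin_scope(self):
        self.entered.append(None)

    def insert(self, name, info):
        self.table.setdefault(name, []).append(info)
        self.entered.append(name)

    def lookup(self, name):
        bindings = self.table.get(name)
        return bindings[-1] if bindings else None

    def end_scope(self):
        # Pop entered symbols back to the marker, removing each binding.
        while True:
            name = self.entered.pop()
            if name is None:
                break
            self.table[name].pop()

st = SymbolTable()
st.insert("x", "int")          # global: int x;
st.begin_scope()
st.insert("x", "char")         # hypothetical local redeclaration of x
inner = st.lookup("x")
st.end_scope()
outer = st.lookup("x")
```

Because deletions follow a stack discipline, `end_scope` never has to search the table: it removes exactly the bindings entered since the matching `begin_scope`.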

Name Space
- It’s trivial to handle name spaces
  - one symbol table for each name space
- Take C as an example:
  - Several different name spaces
    - labels
    - tags
    - variables
  - So …
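
As a small illustration, the `list` example from the name space slide can be handled with one table per name space. This Python sketch uses plain dictionaries; the helper names and the strings stored as info are assumptions made for the example.

```python
# One symbol table per name space, as listed on the slide.
namespaces = {"labels": {}, "tags": {}, "variables": {}}

def declare(space, name, info):
    namespaces[space][name] = info

def lookup(space, name):
    return namespaces[space].get(name)

# The same identifier "list" is declared three times without conflict.
declare("tags", "list", "struct with fields x and list")
declare("variables", "list", "pointer to struct list")
declare("labels", "list", "label inside walk")
```

The elaborator picks the table from the syntactic context: after `struct` it consults the tags, after `goto` the labels, and otherwise the variables.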

Implementation of Symbols
- For several reasons, it will be useful at some point to represent symbols as elements of a small, densely packed set of identities
  - fast comparisons (equality)
  - for dataflow analysis, we will want sets of variables and fast set operations
    - It will be critically important to use bit strings to represent the sets
    - For example, your liveness analysis algorithm
- More on this later
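
One possible reading of "densely packed identities" is to intern each symbol as a small integer and to represent a set of symbols as a bit string. The Python sketch below uses an ordinary integer as the bit string; `SymbolPool`, `bitset` and `members` are hypothetical names.

```python
class SymbolPool:
    """Maps each distinct name to a small integer 0, 1, 2, ..."""
    def __init__(self):
        self.ids = {}
        self.names = []

    def intern(self, name):
        if name not in self.ids:
            self.ids[name] = len(self.names)
            self.names.append(name)
        return self.ids[name]

def bitset(ids):
    """Build a bit string with bit i set for every id i."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits

def members(bits, pool):
    """List the names whose bit is set, in order of their ids."""
    return [name for i, name in enumerate(pool.names) if (bits >> i) & 1]

pool = SymbolPool()
a, b, c = pool.intern("a"), pool.intern("b"), pool.intern("c")

live1 = bitset([a, b])
live2 = bitset([b, c])
union = live1 | live2         # set union is one bitwise OR
intersection = live1 & live2  # set intersection is one bitwise AND
```

Equality of two symbols becomes an integer comparison, and the set operations a liveness analysis needs become single bitwise operations.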

Types
- The representation of types is highly language-dependent
- Some key considerations:
  - name vs. structural equivalence
  - mutually recursive type definitions
  - dealing with errors

Name vs. Structural Equivalence
    struct A { int i; } x;
    struct B { int i; } y;
    x = y;
- In a language with structural equivalence, this program is legal
- But not in a language with name equivalence (e.g., C)
- For name equivalence, can generate a unique symbol for each defined type
- For structural equivalence, need to recursively compare the types
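
The two notions can be compared side by side in a short Python sketch. It assumes that structural equivalence compares field names and field types in order (languages differ on this), and it does not handle recursive types, where the recursive comparison would not terminate. All names are illustrative.

```python
from dataclasses import dataclass
from itertools import count

_fresh = count()  # source of unique symbols for name equivalence

@dataclass(frozen=True)
class Int:
    pass

@dataclass(frozen=True)
class Struct:
    name: str
    fields: tuple  # tuple of (field name, field type)
    uid: int       # unique symbol generated at the definition

def make_struct(name, fields):
    return Struct(name, tuple(fields), next(_fresh))

def name_equal(t1, t2):
    """Two struct types are equal only if they come from the same definition."""
    if isinstance(t1, Struct) and isinstance(t2, Struct):
        return t1.uid == t2.uid
    return type(t1) is type(t2)

def struct_equal(t1, t2):
    """Recursively compare the shapes of the two types."""
    if isinstance(t1, Struct) and isinstance(t2, Struct):
        return (len(t1.fields) == len(t2.fields) and
                all(f1 == f2 and struct_equal(ty1, ty2)
                    for (f1, ty1), (f2, ty2) in zip(t1.fields, t2.fields)))
    return type(t1) is type(t2)

A = make_struct("A", [("i", Int())])
B = make_struct("B", [("i", Int())])
```

With these definitions, `x = y` from the slide is accepted under `struct_equal(A, B)` and rejected under `name_equal(A, B)`.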

Mutually recursive type definitions
    struct A { int data; struct A *next; struct B *b; };
    struct B {…};
- To process recursive and mutually recursive type definitions, need a placeholder
  - in ML, an option ref
  - in C, a pointer
  - in Java, bind method (read Appel)

Error Diagnostic
- To recover from errors, it is useful to have an “any” type
  - makes it possible to continue more typechecking
  - In practice, use “int” or guess one
- Similarly, a “void” type can be used for expressions that return no value
- Source locations are annotated in AST!
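
A hedged Python sketch of how an “any” type lets checking continue after an error without repeating the same complaint. The function names, the message format and the source position "3:5" are assumptions made for illustration.

```python
INT, ANY = "int", "any"

def compatible(actual, expected):
    """ANY matches everything, so an earlier error is not reported again."""
    return actual == ANY or expected == ANY or actual == expected

def check_plus(t1, t2, pos, errors):
    """Type of e1 + e2. On error, record a diagnostic and return ANY."""
    ok = True
    for label, ty in (("e1", t1), ("e2", t2)):
        if not compatible(ty, INT):
            errors.append(f"{pos}: {label} should be int, found {ty}")
            ok = False
    return INT if ok else ANY

errors = []
# "hello" + "world": both operands are strings, so two diagnostics are recorded.
inner = check_plus("string", "string", "3:5", errors)
# Using the erroneous result in a larger sum adds no further diagnostics.
outer = check_plus(inner, INT, "3:5", errors)
```

The elaborator keeps going after the first mistake, and each diagnostic carries the source location taken from the AST.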

Organization of the Elaborator
- Module structure:
    elabProg: Ast.Program.t -> unit
    elabStm: Ast.Stm.t * tenv * venv -> unit
    elabDec: Ast.Dec.t * venv * tenv -> tenv * venv
    elabTy: Ast.Type.t * tenv -> ty
    elabExp: Ast.Exp.t * venv -> ty
    elabLVal: Ast.Lval.t * venv -> ty
- It will be extended to also do translation.
- For now let’s concentrate on typechecking

Elaborate Expressions
- Checks that expressions are correctly typed.
- Valid expressions are defined in the C specification.
- e: t means that e is a valid expression of type t.
- venv is a symbol table (environment).

Elaborate Expressions
- Typing rule for +:
    venv ⊢ e1: int    venv ⊢ e2: int
    --------------------------------
           venv ⊢ e1+e2: int
- Implementation:
    fun elabExp (e, venv) =
      case e of
        BinaryExp (PLUS, e1, e2) =>
          let val t1 = elabExp (e1, venv)
              val t2 = elabExp (e2, venv)
          in  case (t1, t2) of
                (Int, Int) => Int
              | (Int, _) => error ("e2 should be int")
              | (_, Int) => error ("e1 should be int")
              | _ => error ("should both be int")
          end
      (* the other expression forms are not shown on the slide *)
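
For readers who want to run the rule, here is a rough Python counterpart, assuming a tiny AST with only numbers, variables and `+`, and a `venv` represented as a dictionary from names to type names. The classes `Num`, `Var`, `Plus` and `ElabError` are hypothetical and stand in for the course's actual AST.

```python
from dataclasses import dataclass

class ElabError(Exception):
    pass

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class Plus:
    left: object
    right: object

def elab_exp(e, venv):
    """Return the type of e under venv, or raise ElabError."""
    if isinstance(e, Num):
        return "int"
    if isinstance(e, Var):
        if e.name not in venv:
            raise ElabError(f"undeclared variable {e.name}")
        return venv[e.name]
    if isinstance(e, Plus):
        t1 = elab_exp(e.left, venv)
        t2 = elab_exp(e.right, venv)
        if t1 == "int" and t2 == "int":
            return "int"
        if t1 == "int":
            raise ElabError("e2 should be int")
        if t2 == "int":
            raise ElabError("e1 should be int")
        raise ElabError("should both be int")
    raise ElabError(f"unknown expression {e!r}")

ok_type = elab_exp(Plus(Var("x"), Num(4)), {"x": "int"})
```

Each premise of the rule becomes a recursive call, and the conclusion becomes the returned type. With `venv = {}` the same expression fails with "undeclared variable x", which is the first error in the earlier example program.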

Elaborate Types
- Elaborating types is straightforward, except for recursive types
- Need to do “knot-tying”:
  - extend tenv with bindings for all of the new type names
    - bind new names to “dummy” bodies
  - process each definition, replacing the dummy bodies with real definitions
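
The two passes of knot-tying can be sketched in Python as follows. The sketch assumes struct declarations arrive as (name, field list) pairs, that a field type is either a type name or a `("ptr", struct name)` pair, and that a mutable object plays the role of the placeholder. The field `back` of `B` is a made-up example, since the slides leave the body of `B` elided.

```python
class StructType:
    def __init__(self, name):
        self.name = name
        self.fields = None  # "dummy" body, filled in by the second pass

def resolve(ty, tenv):
    """Turn a type expression into a type, looking names up in tenv."""
    if isinstance(ty, tuple):          # ("ptr", struct name)
        return ("ptr", tenv[ty[1]])
    return tenv[ty]

def elab_type_decs(decs, tenv):
    tenv = dict(tenv)                  # extend a copy of the environment
    # Pass 1: bind every new name to a placeholder with a dummy body.
    for name, _ in decs:
        tenv[name] = StructType(name)
    # Pass 2: replace the dummy bodies with the real definitions.
    for name, fields in decs:
        tenv[name].fields = [(f, resolve(ty, tenv)) for f, ty in fields]
    return tenv

decs = [
    ("A", [("data", "int"), ("next", ("ptr", "A")), ("b", ("ptr", "B"))]),
    ("B", [("back", ("ptr", "A"))]),   # hypothetical body for B
]
tenv = elab_type_decs(decs, {"int": "int"})
A, B = tenv["A"], tenv["B"]
```

Because every name is bound before any body is processed, the reference from `A` to `B` resolves even though `B` is declared later.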

Elaborate Declarations
- elabDec will extend the symbol tables with a new binding:
    int a;
  will add {a: int} to the environment.
- Remember that environments have to take into account scope of variables!

Elaborate Statement, Lvals, Programs
- All follow the same structure as exp or types
- elabProg calls the other functions in order to type-check each component of the program (declarations, statements, expressions, …)

Labs
- For lab #4, your job is to implement an elaborator for C-
  - you may go in two steps
    - first type-checking
    - and then generating target code
- At every step, check the output carefully to make sure your compiler works correctly

Summary
- Elaboration checks the well-formedness of programs
  - must take care of the semantics of source programs
  - and may translate into lower-level forms
- Usually the biggest (most complex) part of a compiler!