Compiler Technology


Compiler Technology
• Compiler – software to translate a high level language program into a lower level language
  • usually machine language, sometimes assembly language as an intermediate
  • for Java and .NET, compilers convert code into an intermediate format
  • a just-in-time (JIT) compiler or interpreter then converts that into machine code
• Assembler – software to translate an assembly language program into machine language
• Interpreter – software to translate a single instruction from the given language into machine code and execute the instruction
  • interpreters work within an environment of previously executed instructions
  • interpreted languages include LISP and its dialects, Unix/Linux shells, Ruby and Python
  • most web browsers contain a Java Virtual Machine – an interpreter to convert Java byte code into machine code and execute the code


More Terminology
• Decompiler – program that translates from a lower level language (usually assembly) into a higher level language
  • this process is ambiguous because there are many ways a series of low level instructions can be represented in a high level way
• Linker – program that connects the external references of source code to their definitions in a library
  • linking is often part of the compilation process, although compilers let you separate this step from the translation process
• Loader – program that loads executable code into memory
  • a linker/loader combines the two roles, locating the external definitions and loading them along with the program
  • a dynamic linker/loader loads at run-time only the external definitions needed – a static linker/loader will load all definitions found in the program (a dynamic-loading sketch follows below)
• Language rewriter – program that translates a program in a language into a new program in the same language
  • aspect-oriented programming, for instance, relies on language rewriters
  • it could also be used to replace deprecated instructions
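To make the dynamic linker/loader idea concrete, here is a minimal sketch of run-time loading on a POSIX system using dlopen/dlsym; the library name "libm.so.6" and the symbol "cos" are just illustrative choices, and on Linux the program is linked with -ldl:

    #include <stdio.h>
    #include <dlfcn.h>

    int main(void) {
        /* ask the dynamic loader to bring the math library in at run-time */
        void *handle = dlopen("libm.so.6", RTLD_LAZY);
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        /* locate one external definition (cos) inside the loaded library */
        double (*cosine)(double) = (double (*)(double)) dlsym(handle, "cos");
        if (cosine) printf("cos(0.0) = %f\n", cosine(0.0));

        dlclose(handle);   /* unload once the definitions are no longer needed */
        return 0;
    }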


Yet More Terminology
• Preprocessing – a step taken by some compilers prior to compilation to handle various simple chores
  • in C, for instance, # statements are preprocessor operations, like the textual substitution performed for #define (a small example follows below)
• Symbol table – collection of user-defined entities (variables, subroutines, classes, constants)
  • for each entity, the symbol table records the name, type, and size/internal structure
  • this is used by the compiler to bind references in the source code to memory locations and to determine the amount of memory that needs to be allocated to run the given subroutine (note that constants may be stored separately in a “literal table”)
• Tokens – the reserved words and syntactic constructs that are defined in the language
  • for instance, ; is the semicolon operator and == is the equality operator
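As a concrete illustration of the textual substitution the C preprocessor performs (the macro names here are our own, not from the slides):

    #include <stdio.h>

    #define PI 3.14159                /* simple textual substitution        */
    #define SQUARE(x) ((x) * (x))     /* parenthesized so it expands safely */

    int main(void) {
        /* after preprocessing, the next line reads:
           printf("%f\n", 3.14159 * ((2.0) * (2.0)));                       */
        printf("%f\n", PI * SQUARE(2.0));
        return 0;
    }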


One-Pass Compilers
• A compiler operates on the source program instruction-by-instruction
• In a one-pass compiler, the compiler makes only one pass through the program, translating each instruction into machine code
  • this drastically limits the complexity of the source language in that there can be no forward references
  • subroutine calls cannot be to code listed beneath the call within the file (see the sketch after this slide)
  • variable declarations must be placed in one location so that a symbol table can be generated – this would be the case in languages like Pascal and C
  • Java’s for loop would not be permissible because the compiler needs to first scan the for loop to determine whether it is a counting loop or an iterator loop before it can start translating the code
• Pascal uses a one-pass compiler
• C may be implemented with a one-pass compiler, but gcc and Visual Studio are multi-pass compilers
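For instance, C's prototype rule exists precisely so that a one-pass translation never has to look ahead; a hypothetical example:

    #include <stdio.h>

    int cube(int n);                /* forward declaration: lets a one-pass */
                                    /* compiler translate the call below    */
                                    /* before it has seen the definition    */
    int main(void) {
        printf("%d\n", cube(3));    /* call site above the definition       */
        return 0;
    }

    int cube(int n) {               /* definition listed beneath the call   */
        return n * n * n;
    }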


Multi-pass Compilers
• The multi-pass compiler makes multiple passes through the source code
  • usually, each pass corresponds to a different process of compilation
  • lexical analysis – breaking code into lexical units
  • syntactic parsing – identifying the type of instruction and its component parts, generating a parse tree
  • semantic parsing – applies semantic rules to the parse tree to ensure that the code meets the semantic requirements of the language
  • code generation – generating either the machine code or an intermediate form (e.g., assembly code, byte code)
  • code optimization – rearranging instructions to better fit the parallel processing/pipelining aspects of the hardware
• a variation is the two-pass compiler
  • one pass to perform lexical and syntactic analysis and form the symbol table
  • a second pass to perform semantic analysis and translation


Why One-Pass?
• Such compilers are smaller and execute much faster
  • they are also far easier to write than multi-pass compilers
• Pascal and C were both designed with one-pass compilers in mind
  • thus we have restrictions in these languages that we won’t have in C++, Java, etc.
• Most languages today use multi-pass compilers because the language is too complex for a single pass
  • this allows for better language features
    • declarations anywhere
    • characters whose usage depends on context (for instance, the period in COBOL used as the end of a sentence or as a decimal point)
    • forward referencing instructions (such as subroutines defined lower in the file)
  • in addition, multiple passes allow errors to be reported for each pass rather than either all found errors or only the next error found (as in the case of Pascal)
  • also the ability to optimize code


Compiler Passes: Preprocessing
• LISP offers macros to replace code before substitution
  • this is more an interpreter operation than a compiler preprocessing step
• PL/I was the first language to offer preprocessing; its facilities include
  • %INCLUDE – include text into the source file (%XINCLUDE includes text that is not already in the file)
  • %DECLARE – declare a preprocessing variable (character or fixed only)
  • %assignment – assign a variable a value
  • %ACTIVATE/%DEACTIVATE – activate/deactivate a variable
  • %DO/%END – start and end of a block of several preprocessor statements
  • %PROCEDURE – define a preprocessor subroutine
  • %SELECT – conditional statement for preprocessing, also %IF, %THEN, %ELSE
  • %REPLACE – textual substitution
• C was the next language to offer this and simplified what PL/I had
• OCaml (predecessor of F#) has preprocessing known as Camlp4


Compiler Passes: Lexical Analysis
• Treat the source file as a stream of characters
• Break the stream into meaningful units – tokens (or language primitives)
  • example: for(i=0; i<n; i++)
  • analysis shown below
• Simple left-to-right parsing
  • compare each grouping with the available tokens
  • apply rules for identifier names, literal numbers, special syntactic structures like i++
• As we are recognizing regular expressions, we can use a finite state automaton
• Errors caught in this pass are lexical errors
  • other errors may still exist in the code but won’t be caught during this pass

    for   for-operator
    (     open-paren-operator
    i     identifier
    =     assignment-operator
    0     literal-number
    ;     semicolon-operator
    i     identifier
    <     relational-operator
    n     identifier
    ;     semicolon-operator
    i     identifier
    ++    increment-operator
    )     close-paren-operator

Example Lexical Analysis



Example FSA
• Let’s see how to recognize a legal identifier in C
  • rule: an identifier consists of letters, digits, and underscores but starts with a letter or underscore
• The FSA starts in state 1: the start state
  • there are two links from state 1
  • one link is traversed if the input character is a letter or underscore and leads to state 2
  • the other link is traversed for any other character and leads to an error state
• State 2 is reached if the first character is legal and has three links
  • one link loops back to state 2 and is traversed if the next input character is a letter, digit or underscore
  • one link leads to the end state (meaning a legal identifier) and is taken if we have reached the end of this token
  • one link leads to an error state and is taken in any other situation
• code for this FSA is given on the next slide


Implementing the Identifier FSA

    c = getChar();
    if (!isalpha(c) && c != '_') return false;       // error
    while ((c = getChar()) != EOF)
        if (!isalnum(c) && c != '_') return false;   // error
    return true;   // reached the end of the token, return true

This code works as follows: get the first character; if it is not legal, return false; if legal, continue – while there are more characters, if a character is not legal, return false; when done, return true. We make the assumption that there is a way to determine the end of this token (denoted as EOF), but in actuality this is more challenging and requires recursion.
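A self-contained version of the same FSA, sketched here over a NUL-terminated string to sidestep the end-of-token problem (the function name and the test strings are our own):

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdio.h>

    bool isIdentifier(const char *s) {
        if (!isalpha((unsigned char)*s) && *s != '_')
            return false;                           /* state 1 -> error     */
        while (*++s != '\0')                        /* loop in state 2      */
            if (!isalnum((unsigned char)*s) && *s != '_')
                return false;                       /* state 2 -> error     */
        return true;                                /* state 2 -> end state */
    }

    int main(void) {
        printf("%d %d\n", isIdentifier("_count1"), isIdentifier("1count"));
        return 0;   /* prints: 1 0 */
    }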


Constructing a Lexical Analyzer
• Define all of the tokens for the language
  • if a token can vary (e.g., identifier, literal value), further define it as a regular expression
  • build an FSA for each token
  • connect the FSAs together into a single, large FSA
• The program LEX, built in 1975, is a tool for generating the lexical analyzer given the tokens (and their rules) of the language
  • an example input for LEX is given on the next slide for a language that has
    • numbers (integers only)
    • simple identifiers (consisting only of lower case letters)
    • reserved words WHILE, DO, IF, THEN, ELSE, FI, OD and PRINT
    • punctuation marks :=, ;, =, !=, <=, >=, <, >, +, -, *, /, (, )


%{
#include "y.tab.h"
extern int yylval;
%}
%%
"="     { return EQ; }
"!="    { return NE; }
"<"     { return LT; }
"<="    { return LE; }
">"     { return GT; }
">="    { return GE; }
"+"     { return PLUS; }
"-"     { return MINUS; }
"*"     { return MULT; }
"/"     { return DIVIDE; }
")"     { return RPAREN; }
"("     { return LPAREN; }
":="    { return ASSIGN; }
";"     { return SEMICOLON; }
"IF"    { return IF; }
"THEN"  { return THEN; }
"ELSE"  { return ELSE; }
"FI"    { return FI; }
"WHILE" { return WHILE; }
"DO"    { return DO; }
"OD"    { return OD; }
"PRINT" { return PRINT; }
[0-9]+  { yylval = atoi(yytext); return NUMBER; }
[a-z]+  { yylval = 0; for (char *p = yytext; *p; p++) yylval += *p; /* sum of all ASCII chars */ return NAME; }
.       { yyerror("illegal token"); }
%%
#ifndef yywrap
int yywrap() { return 1; }
#endif
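Typically this specification is run through lex to produce a C scanner (lex.yy.c) that is compiled together with the parser YACC generates; the y.tab.h header included at the top is produced by YACC and supplies the numeric codes for the token names (EQ, NE, NUMBER, ...) that the two tools share.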


Syntactic Analysis
• Given a sequence of lexical items, do they form a syntactically valid construct?
  • e.g., a C for loop will be:
    • for-operator
    • open-paren-operator
    • optionally: identifier assignment-operator expression (or literal number or identifier)
    • semicolon-operator
    • optionally: boolean expression (or literal number or identifier)
    • semicolon-operator
    • optionally: identifier assignment-operator expression (or identifier with prefix/postfix increment operator)
    • close-paren-operator
    • optionally: open-curly-brace-operator
    • statement(s)
    • optionally: close-curly-brace-operator
  • we need to write a parser (recognizer) piece of code for every construct


Parsers
• The parser code is divided into functions, one per syntactic unit
  • these functions can be called recursively
  • functions return true (valid parse) or false (error)
  • on error, the parser must identify which rule failed and generate a syntax error message based on that rule
  • these messages are hardcoded into the parser, one per rule
  • functions have a side effect of building a parse tree for the unit
• A parser function will contain function calls to retrieve the next token and then test that token against the expected token (a sketch follows below)
  • e.g., in the function to recognize a for loop, the first token must be the for-operator, followed by the open-paren-operator, etc.
  • the parser’s logic is based on the grammar and can be automatically generated from the grammar, but then we need to add other rules (such as balancing { })
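A minimal sketch of one such function, recognizing a simplified for-loop header; the token names, the canned token stream, and the single EXPR token standing in for whole expressions are all hypothetical simplifications:

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { FOR, LPAREN, RPAREN, SEMI, EXPR, END } Token;

    /* canned token stream for: for ( expr ; expr ; expr ) */
    static Token stream[] = { FOR, LPAREN, EXPR, SEMI, EXPR, SEMI, EXPR, RPAREN, END };
    static int pos = 0;
    static Token nextToken(void) { return stream[pos++]; }

    /* retrieve one token and test it against the expected token */
    static bool expect(Token t, const char *msg) {
        if (nextToken() == t) return true;
        printf("syntax error: %s\n", msg);   /* hardcoded message, one per rule */
        return false;
    }

    bool parseFor(void) {                    /* true = valid parse, false = error */
        return expect(FOR,    "expected 'for'")
            && expect(LPAREN, "expected '('")
            && expect(EXPR,   "bad initialization")
            && expect(SEMI,   "expected ';'")
            && expect(EXPR,   "bad condition")
            && expect(SEMI,   "expected ';'")
            && expect(EXPR,   "bad increment")
            && expect(RPAREN, "expected ')'");
    }

    int main(void) { printf("%d\n", parseFor()); return 0; }   /* prints 1 */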


Recursive Descent Parsing
• The approach described is called recursive descent parsing
• because grammar rules can have multiple right hand sides, parsing is O(2^n)
  • for instance, imagine the following two rules
      A → bB | C
      C → bD
  • if we are currently in a state of A and our next token is ‘b’, do we select bB, or C and then bD?
  • if we pick the wrong one, we have to backtrack, and this is where the complexity becomes exponential in nature
  • in reality, our programming languages do not have the type of ambiguity that we find in human languages (e.g., English), so there are few situations where we see rules like the above and thus the parsers will usually operate at a much better complexity
• To use a RDP, we need the grammar to be non-left-recursive (see the sketch after this slide)
• If the grammar has left-recursive rules, we instead opt for a bottom-up parser, or chart parser
  • as we already examined these in chapter 4, we skip the details here
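To see why left recursion matters, consider the left-recursive rule that appears in the YACC grammar later (expr2 → expr2 PLUS expr3 | expr3): a recursive descent parser cannot call itself on its own first symbol without looping forever, so it implements the rule as a loop instead. A minimal sketch, with a canned token stream and a parseExpr3 that simply accepts one number:

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { NUM, PLUS, MINUS, END } Token;

    /* canned token stream for: n + n - n */
    static Token stream[] = { NUM, PLUS, NUM, MINUS, NUM, END };
    static int pos = 0;
    static Token peek(void)    { return stream[pos]; }
    static void  advance(void) { pos++; }

    static bool parseExpr3(void) {           /* simplified: just one number */
        if (peek() != NUM) return false;
        advance();
        return true;
    }

    /* expr2 → expr3 { (PLUS | MINUS) expr3 } – the left recursion becomes
       a while loop rather than a call to parseExpr2 on itself */
    bool parseExpr2(void) {
        if (!parseExpr3()) return false;
        while (peek() == PLUS || peek() == MINUS) {
            advance();                       /* consume the operator */
            if (!parseExpr3()) return false;
        }
        return true;
    }

    int main(void) { printf("%d\n", parseExpr2() && peek() == END); return 0; }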


YACC
• Yet Another Compiler Compiler
  • this is a tool, often used with LEX, to generate a parser given the grammar of a language
• YACC generates a shift-reduce parser that performs “1 look ahead”
  • this means that the parser looks ahead by 1 token before deciding which rule to employ
  • although this makes the compiler more complicated, it reduces the amount of backtracking
• Input into YACC is
  • the lexemes from LEX
  • the root node of the grammar (the start state)
  • all of the grammar rules
• example input shown on the next several slides


Input into YACC: Lexemes

%start ROOT
%token EQ NE LT LE GT GE
%token PLUS MINUS MULT DIVIDE
%token RPAREN LPAREN
%token ASSIGN SEMICOLON
%token IF THEN ELSE FI WHILE DO OD PRINT
%token NUMBER NAME
%%


Input into YACC: Grammar

ROOT: stmtseq { execute($1); }
    ;
statement: designator ASSIGN expression { $$ = assignment($1, $3); }
    | PRINT expression { $$ = print($2); }
    | IF expression THEN stmtseq ELSE stmtseq FI { $$ = ifstmt($2, $4, $6); }
    | IF expression THEN stmtseq FI { $$ = ifstmt($2, $4, empty()); }
    | WHILE expression DO stmtseq OD { $$ = whilestmt($2, $4); }
    ;
stmtseq: stmtseq SEMICOLON statement { $$ = seq($1, $3); }
    | statement { $$ = $1; }
    ;
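In the action code, $$ denotes the semantic value of the rule's left-hand side and $1, $2, $3, ... denote the values of the right-hand-side symbols in order; for example, assignment($1, $3) builds a tree node from the designator and the expression while skipping over the ASSIGN token between them.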


expression: expr2 { $$ = $1; }
    | expr2 EQ expr2 { $$ = eq($1, $3); }
    | expr2 NE expr2 { $$ = ne($1, $3); }
    | expr2 LT expr2 { $$ = lt($1, $3); }
    | expr2 LE expr2 { $$ = le($1, $3); }
    | expr2 GT expr2 { $$ = gt($1, $3); }
    | expr2 GE expr2 { $$ = ge($1, $3); }
    ;

expr2: expr3 { $$ = $1; }
    | expr2 PLUS expr3 { $$ = plus($1, $3); }
    | expr2 MINUS expr3 { $$ = minus($1, $3); }
    ;


expr3: expr4 { $$ = $1; }
    | expr3 MULT expr4 { $$ = mult($1, $3); }
    | expr3 DIVIDE expr4 { $$ = divide($1, $3); }
    ;

expr4: PLUS expr4 { $$ = $2; }
    | MINUS expr4 { $$ = neg($2); }
    | LPAREN expression RPAREN { $$ = $2; }
    | NUMBER { $$ = number($1); }
    | designator { $$ = $1; }
    ;

designator: NAME { $$ = name($1); }
    ;

Parse Tree • For the expression b*b – 4*a*c



Parsing

    void add_to_array(int *x, int y, int n)
    {
        int i = 0;
        for (i = 0; i < n; i++) {
            x[i] = x[i] + y;
        }
    }


Semantic Analysis
• With a parse available, attributes are added to the stored content in the parse tree
  • for instance, types are added to identifier names
• rules are then applied to ensure that the semantics of the language are being upheld, for example (a small sketch follows below)
  • proper types are used on the left and right hand sides of an assignment statement
  • proper number and types of parameters between the actual and formal parameter lists
  • proper operator(s) applied in the given context (+ used for numbers or Strings, ++ used for ordinal values, < and > used on comparable types of values)
• type coercions, object binding (if possible), and initializing local variables (if possible) are handled in this step
• the compiler must implement the attribute grammar rules for semantic analysis
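A minimal sketch of one such semantic rule, checking assignment type compatibility on an attributed node; the node layout, type set, and single coercion rule are hypothetical simplifications:

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { T_INT, T_FLOAT, T_STRING } Type;
    typedef struct { Type lhs, rhs; } AssignNode;   /* attributed parse node */

    bool checkAssign(const AssignNode *n) {
        if (n->lhs == n->rhs) return true;          /* types match exactly   */
        if (n->lhs == T_FLOAT && n->rhs == T_INT)
            return true;                            /* legal coercion: int   */
                                                    /* widens to float       */
        return false;                               /* semantic error        */
    }

    int main(void) {
        AssignNode ok  = { T_FLOAT, T_INT };        /* float = int  -> legal */
        AssignNode bad = { T_INT, T_STRING };       /* int = string -> error */
        printf("%d %d\n", checkAssign(&ok), checkAssign(&bad));   /* 1 0 */
        return 0;
    }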


Code Generation
• This is a planning step
• given a parse, there are many ways to convert it into assembly code, for instance
  • which data are preloaded into registers
  • which addressing mode is used when referencing a datum in memory
  • which instruction format will be used (if the platform has multiple instruction formats for a type of instruction)
• often, compilers implement a solution to the graph coloring problem to determine which variables are loaded into which registers (this is an NP-complete problem! – a greedy sketch follows below)
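The idea behind graph-coloring register allocation: variables are nodes, an edge joins two variables whose live ranges overlap, and each color (0..K-1) is a register. Real allocators use more careful heuristics (e.g., Chaitin-style simplification); the greedy version and the tiny four-variable interference graph below are illustrative only:

    #include <stdbool.h>
    #include <stdio.h>

    #define N 4   /* variables 0..3 (think b, c, d, e) */
    #define K 3   /* available registers               */

    int main(void) {
        /* interference graph: edge[u][v] = 1 if u and v are live at once */
        bool edge[N][N] = {
            {0,1,1,0},
            {1,0,1,1},
            {1,1,0,0},
            {0,1,0,0},
        };
        int color[N];

        for (int v = 0; v < N; v++) {
            bool used[K] = { false };
            for (int u = 0; u < v; u++)               /* registers already   */
                if (edge[v][u]) used[color[u]] = true;/* taken by neighbors  */
            color[v] = -1;
            for (int r = 0; r < K; r++)
                if (!used[r]) { color[v] = r; break; }/* first free register */
            if (color[v] < 0)
                printf("variable %d must spill to memory\n", v);
        }
        for (int v = 0; v < N; v++)
            printf("variable %d -> R%d\n", v, color[v]);
        return 0;
    }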


Example
• How should a = b * c + d * e; be compiled?
• Assume the hardware has 8 registers, R0-R7, and ALU operations can reference up to 1 memory location
• below are two solutions – the one on the left is more efficient if b, c, d and e are not referenced again; otherwise, preloading all data into registers, as on the right, is more advantageous

    load R1, b              load R1, b
    mult R2, R1, c          load R2, c
    load R3, d              mult R3, R1, R2
    mult R4, R3, e          load R4, d
    add R5, R2, R4          load R5, e
    store R5, a             mult R6, R4, R5
                            add R7, R3, R6
                            store R7, a


Table of Costs
• To determine what addressing modes to use, the compiler may utilize a table of costs like the one below

    Mode                Address      Example        Cost
    Direct              M            Add M          1
    Register            R            Add R1         0
    Indexed             c + (R)      Add 100(R1)    1
    Indirect            (M)          Add (M)        2
    Register Indirect   (R)          Add (R1)       1
    Indirect Indexed    (c + (R))    Add *100(R1)   2
    Literal             #V           Add #5         0
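Applied to the two solutions on the previous slide (reading each cost as extra memory references per operand): both sequences touch memory five times – on the right, four loads of b, c, d, e plus the store of a; on the left, two loads, two direct ALU operands (c and e) and the store – so the left version's advantage is simply that it issues 6 instructions instead of 8 when the operand values are not needed again.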


Code Optimization
• There are many strategies that can reduce the run-time cost of executing code
  • code rearranging – fill what would be stalls in the pipeline with neutral instructions
  • algebraic simplification – move a result that is used again into a register
      x = 7 * y + 5 – a;   // move 7 * y into a register
      z = 7 * y * c;
  • dead code elimination – remove code that cannot be reached
  • function inlining – replace the function call with the actual function body to save the run-time overhead of using the run-time stack
  • loop optimization – rearrange nested loops that process data in non-contiguous order into ones that process data in contiguous order (this might improve cache performance)
  • loop unrolling – used to fill branch penalty spots with additional instructions (a small example follows below)
• We explore many of these concepts in CSC 462/562
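A sketch of loop unrolling shown in C source form (compilers actually apply it to the generated code); the function names are our own, and the factor-of-4 unroll assumes n is a multiple of 4:

    void scale(int *x, int n, int y) {
        for (int i = 0; i < n; i++)       /* original: one compare/branch     */
            x[i] += y;                    /* per element                      */
    }

    void scale_unrolled(int *x, int n, int y) {
        for (int i = 0; i < n; i += 4) {  /* unrolled: one compare/branch per */
            x[i]     += y;                /* four elements, giving the        */
            x[i + 1] += y;                /* pipeline useful work to do in    */
            x[i + 2] += y;                /* what would be branch-penalty     */
            x[i + 3] += y;                /* slots                            */
        }
    }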