Topic 2 Infix to Postfix EE 456 Compiling

Why this topic? • Program to convert infix to postfix – A simple, one-pass

Syntax vs. Semantics • Syntax – Describes what is allowable in a language –

Structure of Simple Compiler • A lexical analyzer will tokenize the input • A

Context-free Grammars • Often used to describe the hierarchical structure of a programming language

Productions • Every production consists of left side, an arrow (“can have the form

Example CFG • List of digits separated by plus or minus signs: list →

CFG for Expressions expr → expr + term | expr – term | term

Strings • A string of tokens is a sequence of zero or more tokens

Parse Trees • Shows how the start symbol of a grammar can derive a

More on Parse Trees • A language can be defined as all strings that

Ambiguous Grammars • If any string has more than one parse tree, grammar is

Precedence and Associativity • Precedence – Determines the order in which different operators are

Postfix Notation • Formal rules, infix → postfix – If E is variable or

Syntax-Directed Definitions • CFG to specify syntactic structure of input • Each grammar symbol

Two Types of Attributes • Synthesized Attributes – Value at a parse-tree node can

Example Syntax-Directed Definition Production Semantic Rule expr 1 + term expr. t : =

Depth-First Search • Recursively visit all children before evaluating semantic rules of given node

Translation Schemes • Adds to a CFG • Includes “semantic actions” embedded within productions

Example Translation Scheme expr term expr + term { print(‘+’) } expr – term

Equivalent Translation Scheme expr rest term term rest + term { print(‘+’) } rest

Parsing • Parsing is the process of determining if a string of tokens can

Top-down Parsing • Recursively apply the following steps: – At node n with nonterminal

Predictive Parsing • Recursive-descent parsing is a recursive, top-down approach to parsing • A

Procedures for Nonterminals • Production with right side α used if lookahead is in

Left Recursion • Left-recursive productions can cause recursive-descent parsers to loop forever • Example:

Eliminating Left Recursion expr term expr rest term expr + term { print(‘+’) }

Infix to Prefix Code: Part 1 #include <stdio. h> #include <ctype. h> int lookahead;

Infix to Prefix Code: Part 2 … void expr(void) { term(); rest(); } void

Infix to Prefix Code: Part 3 … void rest(void) { if (lookahead == '+')

Infix to Prefix Code: Part 4 … void match(int t) { if (lookahead ==

Code Optimization 1 void rest(void) { REST: if (lookahead == '+') { match('+'); term();

Code Optimization 2 void expr(void) { term(); while (1) { if (lookahead == '+')

Improvements Remaining • • Want to ignore whitespace Allow numbers Allow identifiers Allow additional

Lexical Analyzer • Eliminates whitespace (and comments) • Reads numbers (not just single digits)

Allowable Tokens • expected tokens: +, -, *, /, DIV, MOD, (, ), ID,

Tokens and Attributes LEXEME white space TOKEN ATTRIBUTE VALUE --- sequence of digits NUM

A Simple Symbol Table • Each record of symbol table contains a token type

Updated Translation Scheme start list eof list expr; list | ε expr + term

After Eliminating Left Recursion start list eof list expr; list | ε expr term

Final Code • About 250 lines of C • Pretty sloppy, otherwise would be

Slides: 45

Download presentation

Topic #2: Infix to Postfix EE 456 – Compiling Techniques Prof. Carl Sable Fall 2003

Why this topic? • Program to convert infix to postfix – A simple, one-pass compiler! – Infix notation is source language – Postfix is intermediate code – No code generation (but could be actions) • A good example that introduces many aspects of writing a compiler

Syntax vs. Semantics • Syntax – Describes what is allowable in a language – Relatively easy to describe (we will use context free grammars) • Semantics – Describes the meaning of a program (or expression) – Very difficult to describe in a formal sense • We are discussing programming languages, but these definitions also apply to natural languages

Structure of Simple Compiler • A lexical analyzer will tokenize the input • A “syntax-directed translator” combines syntax analysis and intermediate code generation

Context-free Grammars • Often used to describe the hierarchical structure of a programming language • Every CFG has four components: – A set of tokens (terminal symbols) – A set of nonterminals – A set of productions (rules) – A start symbol (left side of first rule) Example rule: stmt → if (expr) stmt else stmt

Productions • Every production consists of left side, an arrow (“can have the form of”), right side • Left side is a always nonterminal; the production is “for” this nonterminal • Right side consists of terminals and/or non -terminals • Convention, nonterminals will be italicized Example rule: stmt → if (expr) stmt else stmt

Example CFG • List of digits separated by plus or minus signs: list → list + digit list → list – digit list → digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • For convenience, nonterminals can be grouped using ‘|’ (“or”) list → list + digit | list – digit | digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • The tokens of this grammar: + • The start symbol: list - 0 1 2 3 4 5 6 7 8 9

CFG for Expressions expr → expr + term | expr – term | term → term * factor | term / factor | factor → digit | (expr) digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Strings • A string of tokens is a sequence of zero or more tokens • The string containing zero tokens is the empty string: ‘ε’ • A grammar “derives” strings – starting with start symbol – repeatedly replacing nonterminals with right side of productions for the nonterminals – The set of strings that can be derived from the start symbol is the “language” for the grammar

Parse Trees • Shows how the start symbol of a grammar can derive a string in the language • A tree with the following properties: – – The root is labeled with a start symbol Each leaf labeled with a token or ε Each interior node is labeled by a nonterminal If A is the label for an interior node, and X 1, X 2, …, Xn (nonterminals or tokens) are the labels of its children, then the following production must exist: A → X 1 X 2 … Xn

Sample Parse Tree: 9 – 5 + 2

More on Parse Trees • A language can be defined as all strings that can be generated by some parse tree • The process of finding a parse tree for a given string is called “parsing” the string

Ambiguous Grammars • If any string has more than one parse tree, grammar is said to be ambiguous • Need to avoid for compilation, since string can have more than one meaning • List of digits separated by plus or minus signs: string → string + string | string – string |0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Example merges notion of digit and list into single nonterminal string • Same strings are derivable, but some strings have multiple parse trees (possible meanings)

Two Parse Trees: 9 – 5 + 2

Precedence and Associativity • Precedence – Determines the order in which different operators are evaluated when they occur in the same expression – Operators of higher precedence are applied before operators of lower precedence • Associativity – Determines the order in which operators of equal precedence are evaluated when they occur in the same expression – Most operators have a left-to-right associativity, but some have right-to-left associativity

Postfix Notation • Formal rules, infix → postfix – If E is variable or constant, E → E – If E is expression of form E 1 op E 2, where op is binary operator, E 1 → E 1’, and E 2 → E 2’, then E → E 1’ E 2’ op – If E is expression of form (E 1) and E 1 → E 1’, then E → E 1’ • Parentheses are not needed!

Syntax-Directed Definitions • CFG to specify syntactic structure of input • Each grammar symbol has associated attributes • Each production has associated semantic rules for computing values of attributes • Grammar and semantic rules together constitute the syntax-directed definition • A parse tree showing the attribute values at each node is called an annotated parse tree

Two Types of Attributes • Synthesized Attributes – Value at a parse-tree node can be determined based on values of attributes of children – Can be evaluated during a single bottom-up traversal of parse tree • Inherited Attributes – More complicated – Will be discussed later in course

Example Syntax-Directed Definition Production Semantic Rule expr 1 + term expr. t : = expr 1. t || term. t || ‘+’ expr 1 - term expr. t : = expr 1. t || term. t || ‘-’ expr term expr. t : = term. t term 0 term. t : = ‘ 0’ term 1 term. t : = ‘ 1’ … term 9 … term. t : = ‘ 9’

Depth-First Search • Recursively visit all children before evaluating semantic rules of given node • Suitable if all attributes are synthesized procedure visit (n: node) begin for each child m of n, from left to right do visit(m) evaluate semantic rules at node n end

Annotated Parse Tree: 9 – 5 + 2

Translation Schemes • Adds to a CFG • Includes “semantic actions” embedded within productions • Similar to syntax-directed definition, but order of evaluation is explicit

Example Translation Scheme expr term expr + term { print(‘+’) } expr – term { print(‘-’) } term 0 { print(‘ 0’) } 1 { print(‘ 1’) } … term 9 { print(‘ 9’) }

Equivalent Translation Scheme expr rest term term rest + term { print(‘+’) } rest - term { print(‘-’) } rest ε 0 { print(‘ 0’) } 1 { print(‘ 1’) } … term 9 { print(‘ 9’) }

Parsing • Parsing is the process of determining if a string of tokens can be generated by a grammar • For any CFG, there is a parser that takes at most O(n^3) time • For most programming languages that arise in practice, linear time parsers exist

Top-down Parsing • Recursively apply the following steps: – At node n with nonterminal A, select a production for A – Construct children at n for symbols on right side of selected production – Find next node for which subtree needs to be constructed • Top-down parsing uses a “lookahead” symbol • Selecting production may involve trial-and-error and backtracking

Predictive Parsing • Recursive-descent parsing is a recursive, top-down approach to parsing • A procedure is associated with each nonterminal of the grammar • Predictive parsing – Special case of recursive-descent parsing – The lookahead symbol unambiguously determines the procedure for each nonterminal

Procedures for Nonterminals • Production with right side α used if lookahead is in FIRST(α) – FIRST(α) is set of all symbols that can be first symbol of α – If lookahead symbol is not in FIRST set for any production, can use production with right side of ε – If two or more possibilities, can not use this method – If no possibilities, an error is declared • Nonterminals on right side of selected production are recursively expanded

Left Recursion • Left-recursive productions can cause recursive-descent parsers to loop forever • Example: example + term • Can eliminate left recursion A Aα|β A βR R αR|ε

Eliminating Left Recursion expr term expr rest term expr + term { print(‘+’) } expr – term { print(‘-’) } term 0 { print(‘ 0’) } 1 { print(‘ 1’) } … term 9 { print(‘ 9’) } term rest + term { print(‘+’) } rest - term { print(‘-’) } rest ε 0 { print(‘ 0’) } 1 { print(‘ 1’) } … term 9 { print(‘ 9’) }

Infix to Prefix Code: Part 1 #include <stdio. h> #include <ctype. h> int lookahead; void void expr(void); rest(void); term(void); match(int); error(void); int main(void) { lookahead = getchar(); expr(); putchar('n'); /* adds trailing newline character */ } …

Infix to Prefix Code: Part 2 … void expr(void) { term(); rest(); } void term(void) { if (isdigit(lookahead)) { putchar(lookahead); match(lookahead); } else error(); } …

Infix to Prefix Code: Part 3 … void rest(void) { if (lookahead == '+') { match('+'); term(); putchar('+'); rest(); } else if (lookahead == '-') { match('-'); term(); putchar('-'); rest(); } } …

Infix to Prefix Code: Part 4 … void match(int t) { if (lookahead == t) lookahead = getchar(); else error(); } void error(void) { printf("syntax errorn"); /* print error message */ exit(1); /* then halt */ }

Code Optimization 1 void rest(void) { REST: if (lookahead == '+') { match('+'); term(); putchar('+'); goto REST; } else if (lookahead == '-') { match('-'); term(); putchar('-'); goto REST; } }

Code Optimization 2 void expr(void) { term(); while (1) { if (lookahead == '+') { match('+'); term(); putchar('+'); } else if (lookahead == '-') { match('-'); term(); putchar('-'); } else break; } }

Improvements Remaining • • Want to ignore whitespace Allow numbers Allow identifiers Allow additional operators (multiplications and division) • Allow multiple expressions (separated by semicolons)

Lexical Analyzer • Eliminates whitespace (and comments) • Reads numbers (not just single digits) • Reads identifiers and keywords

Implementing the Lexical Analyzer

Allowable Tokens • expected tokens: +, -, *, /, DIV, MOD, (, ), ID, NUM, DONE • ID represents an identifier, NUM represents a number, DONE represents EOF

Tokens and Attributes LEXEME white space TOKEN ATTRIBUTE VALUE --- sequence of digits NUM numeric value of sequence div DIV --- mod MOD --- letter followed by letters and digits ID EOF DONE any other character that character index into symbol table --NONE

A Simple Symbol Table • Each record of symbol table contains a token type and a string (lexeme or keyword) • Symbol table has fixed size • All lexemes in array of fixed size • Will be able to insert and search for tokens: – insert(s, t): creates entry with string s and token t, returns index into symbol table – lookup(s): searches for entry with string s, returns index if found, 0 otherwise • Keywords (div and mod) will be inserted into symbol table, they can not be used as identifiers

Updated Translation Scheme start list eof list expr; list | ε expr + term { print(‘+’) } | expr – term { print(‘-’) } | term * factor { print(‘*’) } | term / factor { print(‘/’) } | term div factor { print(‘DIV’) } | term mod factor { print(‘MOD’) } | factor (expr) | id { print(id. lexeme) } | num { print(num. value) }

After Eliminating Left Recursion start list eof list expr; list | ε expr term moreterms + term { print(‘+’) } moreterms | - term { print(‘-’) } moreterms | ε term factor morefactors * factor { print(‘*’) } morefactors | / factor { print(‘/’) } morefactors | div factor { print(‘DIV’) } morefactors | mod factor { print(‘MOD’) } morefactors | ε factor (expr) | id { print(id. lexeme) } | num { print(num. value) }

Final Code • About 250 lines of C • Pretty sloppy, otherwise would be longer