Introduction to Parsing Lecture 8 Adapted from slides by G. Necula and R. Bodik 2/8/2008 Prof. Hilfinger CS 164 Lecture 8 1

Outline
• Limitations of regular languages
• Parser overview
• Context-free grammars (CFGs)
• Derivations
• Syntax-Directed Translation

Languages and Automata
• Formal languages are very important in CS
  – Especially in programming languages
• Regular languages
  – The weakest formal languages widely used
  – Many applications
• We will also study context-free languages

Limitations of Regular Languages
• Intuition: A finite automaton that runs long enough must repeat states
• A finite automaton can't remember the number of times it has visited a particular state
• A finite automaton has finite memory
  – Only enough to store which state it is in
  – Cannot count, except up to a finite limit
• E.g., the language of balanced parentheses is not regular: { (^i )^i | i ≥ 0 }
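To make the counting argument concrete, here is a minimal Python sketch of a recognizer for { (^i )^i | i ≥ 0 }. The function name is made up for illustration. The point is that it needs an integer counter with no fixed upper bound, which is exactly the memory a finite automaton does not have.

```python
def is_nested_parens(s: str) -> bool:
    """Return True if s is i open parens followed by i close parens, i >= 0."""
    count = 0            # unbounded counter: what a finite automaton lacks
    seen_close = False
    for ch in s:
        if ch == "(":
            if seen_close:       # an '(' after a ')' breaks the (^i )^i shape
                return False
            count += 1
        elif ch == ")":
            seen_close = True
            count -= 1
            if count < 0:        # more ')' than '(' so far
                return False
        else:
            return False         # only parentheses are in the alphabet
    return count == 0
```

Capping `count` at any fixed value would give something a finite automaton can do, but it would then wrongly reject (or accept) strings with deeper nesting.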

The Structure of a Compiler
[Figure: Source → Lexical analysis → Tokens → Parsing → Interm. Language → Optimization → Code Gen. → Machine Code. The Parsing stage is marked "Today we start".]

The Functionality of the Parser
• Input: sequence of tokens from lexer
• Output: abstract syntax tree of the program

Example
• Pyth: if x == y: z = 1 else: z = 2
• Parser input: IF ID == ID : ID = INT ELSE : ID = INT
• Parser output (abstract syntax tree):

    IF-THEN-ELSE
      ==
        ID
        ID
      =
        ID
        INT
      =
        ID
        INT

Why A Tree?
• Each stage of the compiler has two purposes:
  – Detect and filter out some class of errors
  – Compute some new information or translate the representation of the program to make things easier for later stages
• Recursive structure of tree suits recursive structure of language definition
• With a tree, later stages can easily find, e.g., "the else clause", rather than having to scan through tokens to find it.
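As a rough illustration of the last point, the sketch below finds the else clause of the earlier Pyth example in two ways. The nested-tuple AST encoding and both function names are assumptions made for this example, not part of the course code.

```python
# Token stream for: if x == y: z = 1 else: z = 2
tokens = ["IF", "ID", "==", "ID", ":", "ID", "=", "INT",
          "ELSE", ":", "ID", "=", "INT"]

# Hypothetical AST encoding: (operator, child, child, ...)
ast = ("IF-THEN-ELSE",
       ("==", "ID", "ID"),      # condition
       ("=", "ID", "INT"),      # then clause
       ("=", "ID", "INT"))      # else clause


def else_clause_from_tokens(toks):
    """Scan for ELSE and take everything after the ':' that follows it.
    This works only for this flat example; a nested 'if' would defeat it."""
    i = toks.index("ELSE")
    return toks[i + 2:]


def else_clause_from_tree(node):
    """With a tree, the else clause is simply the last child."""
    return node[3]
```

The token-scanning version already needs special knowledge of the layout, and it breaks as soon as statements nest; the tree version does not change.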

Comparison with Lexical Analysis

    Phase    Input                    Output
    Lexer    Sequence of characters   Sequence of tokens
    Parser   Sequence of tokens       Syntax tree

The Role of the Parser
• Not all sequences of tokens are programs . . .
• . . . so the parser must distinguish between valid and invalid sequences of tokens
• We need
  – A language for describing valid sequences of tokens
  – A method for distinguishing valid from invalid sequences of tokens

Programming Language Structure
• Programming languages have recursive structure
• Consider the language of arithmetic expressions with integers, +, *, and ( )
• An expression is either:
  – an integer
  – an expression followed by "+" followed by an expression
  – an expression followed by "*" followed by an expression
  – a "(" followed by an expression followed by ")"
• int, int + int, (int + int) * int are expressions

Notation for Programming Languages
• An alternative notation:
    E → int
    E → E + E
    E → E * E
    E → ( E )
• We can view these rules as rewrite rules
  – We start with E and replace occurrences of E with some right-hand side
• E ⇒ E * E ⇒ (E) * E ⇒ (E + E) * E ⇒ … ⇒ (int + int) * int

Observation
• All arithmetic expressions can be obtained by a sequence of replacements
• Any sequence of replacements forms a valid arithmetic expression
• This means that we cannot obtain ( int ) ) by any sequence of replacements. Why?
• This set of rules is a context-free grammar

Context-Free Grammars
• A CFG consists of
  – A set of non-terminals N
    • By convention, written with a capital letter in these notes
  – A set of terminals T
    • By convention, either lower case names or punctuation
  – A start symbol S (a non-terminal)
  – A set of productions
    • Assuming E ∈ N, each has the form E → ε, or E → Y1 Y2 ... Yn where Yi ∈ N ∪ T

Examples of CFGs
Simple arithmetic expressions:
    E → int
    E → E + E
    E → E * E
    E → ( E )
  – One non-terminal: E
  – Several terminals: int, +, *, (, )
    • Called terminals because they are never replaced
  – By convention the non-terminal for the first production is the start one
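One possible way to write the four components down as data is sketched below. The `Grammar` class, its field names, and `is_well_formed` are all hypothetical choices for illustration; an empty right-hand-side tuple would stand for ε.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset
    terminals: frozenset
    start: str
    productions: tuple   # pairs (left-hand side, right-hand side tuple)


def is_well_formed(g: Grammar) -> bool:
    """Check the side conditions from the definition of a CFG."""
    if g.start not in g.nonterminals:
        return False
    if g.nonterminals & g.terminals:      # the two sets must be disjoint
        return False
    symbols = g.nonterminals | g.terminals
    return all(lhs in g.nonterminals and all(y in symbols for y in rhs)
               for lhs, rhs in g.productions)


# The simple arithmetic expression grammar
EXPR = Grammar(
    nonterminals=frozenset({"E"}),
    terminals=frozenset({"int", "+", "*", "(", ")"}),
    start="E",
    productions=(
        ("E", ("int",)),
        ("E", ("E", "+", "E")),
        ("E", ("E", "*", "E")),
        ("E", ("(", "E", ")")),
    ),
)
```

Following the convention on the slide, the left-hand side of the first production, `EXPR.productions[0][0]`, is the start symbol.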

The Language of a CFG
Read productions as replacement rules:
    X → Y1 ... Yn   means X can be replaced by Y1 ... Yn
    X → ε           means X can be erased (replaced with the empty string)

Key Idea
1. Begin with a string consisting of the start symbol "S"
2. Replace any non-terminal X in the string by a right-hand side of some production X → Y1 … Yn
3. Repeat (2) until there are only terminals in the string
The successive strings created in this way are called sentential forms.
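The three steps can be sketched in a few lines of Python. This is only an illustration under assumed names (`PRODUCTIONS`, `rewrite`, `derive`); the caller chooses which non-terminal to replace at each step and with which right-hand side, and the code merely checks that each choice is a real production.

```python
PRODUCTIONS = {
    "E": [("int",), ("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")")],
}


def rewrite(form, index, rhs):
    """Replace the symbol at position index of a sentential form by rhs."""
    return form[:index] + rhs + form[index + 1:]


def derive(start, steps):
    """Apply (index, right-hand side) steps in order; return all sentential forms."""
    forms = [(start,)]
    for index, rhs in steps:
        form = forms[-1]
        lhs = form[index]
        if rhs not in PRODUCTIONS.get(lhs, []):
            raise ValueError(f"no production {lhs} -> {' '.join(rhs)}")
        forms.append(rewrite(form, index, rhs))
    return forms


# Derive (int + int) * int from the start symbol E
forms = derive("E", [
    (0, ("E", "*", "E")),     # E * E
    (0, ("(", "E", ")")),     # ( E ) * E
    (1, ("E", "+", "E")),     # ( E + E ) * E
    (1, ("int",)),            # ( int + E ) * E
    (3, ("int",)),            # ( int + int ) * E
    (6, ("int",)),            # ( int + int ) * int
])
for form in forms:
    print(" ".join(form))
```

The last form contains only terminals, so the process stops there.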

The Language of a CFG (Cont.)
More formally, may write
    X1 … Xi-1 Xi Xi+1 … Xn ⇒ X1 … Xi-1 Y1 … Ym Xi+1 … Xn
if there is a production
    Xi → Y1 … Ym

The Language of a CFG (Cont.)
Write
    X1 … Xn ⇒* Y1 … Ym
if
    X1 … Xn ⇒ … ⇒ … ⇒ Y1 … Ym
in 0 or more steps

The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
    L(G) = { a1 … an | S ⇒* a1 … an and every ai is a terminal }

Examples:
• S → 0
  S → 1
  (also written as S → 0 | 1)
  Generates the language { "0", "1" }
• What about
  S → 1 A
  A → 0 | 1 A
• What about
  S → ε | ( S )
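One way to explore the two "what about" questions is to enumerate short sentences mechanically. The sketch below is an assumption-laden illustration: the function name and grammar encoding are made up, each terminal is assumed to be a single character, and termination relies on every non-ε production adding at least one terminal, which holds for the three grammars on this slide but not for grammars in general.

```python
from collections import deque


def sentences(productions, start, max_len):
    """Return all terminal strings with at most max_len symbols derivable
    from start. productions maps a non-terminal to a list of right-hand
    sides (tuples); the empty tuple stands for the empty string."""
    results = set()
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Always expand the leftmost non-terminal of the sentential form
        idx = next((i for i, sym in enumerate(form) if sym in productions), None)
        if idx is None:                       # only terminals left: a sentence
            results.add("".join(form))
            continue
        for rhs in productions[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            n_terminals = sum(1 for sym in new if sym not in productions)
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return results


bits = {"S": [("0",), ("1",)]}
ones_then_zero = {"S": [("1", "A")], "A": [("0",), ("1", "A")]}
nested = {"S": [(), ("(", "S", ")")]}

print(sorted(sentences(bits, "S", 4)))
print(sorted(sentences(ones_then_zero, "S", 4)))
print(sorted(sentences(nested, "S", 4)))
```

Looking at the printed strings for increasing `max_len` suggests a pattern for each grammar, which one can then try to prove from the productions.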

Pyth Example
A fragment of Pyth:
    Compound → while Expr: Block
             | if Expr: Block Elses
    Elses    → ε
             | else: Block
             | elif Expr: Block Elses
    Block    → Stmt_List
             | Suite
(Formal language papers use one-character nonterminals, but we don't have to!)

Derivations and Parse Trees
• A derivation is a sequence of sentential forms resulting from the application of a sequence of productions
    S ⇒ … ⇒ …
• A derivation can be represented as a parse tree
  – Start symbol is the tree's root
  – For a production X → Y1 … Yn add children Y1, …, Yn to node X

Derivation Example
• Grammar
    E → E + E | E * E | (E) | int
• String
    int * int + int

Derivation Example (Cont.)
    E ⇒ E + E
      ⇒ E * E + E
      ⇒ int * E + E
      ⇒ int * int + E
      ⇒ int * int + int

Derivation in Detail (1)–(6)
Each step extends the derivation by one sentential form and adds the corresponding children to the parse tree:

    Step  Sentential form       Change to the parse tree
    (1)   E                     root E
    (2)   ⇒ E + E               root E gets children E, +, E
    (3)   ⇒ E * E + E           left E gets children E, *, E
    (4)   ⇒ int * E + E         leftmost E gets child int
    (5)   ⇒ int * int + E       middle E gets child int
    (6)   ⇒ int * int + int     rightmost E gets child int

Parse tree after step (6):

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  int
      |     |
     int   int

Notes on Derivations
• A parse tree has
  – Terminals at the leaves
  – Non-terminals at the interior nodes
• A left-to-right traversal of the leaves is the original input
• The parse tree shows the association of operations; the input string does not!
  – There may be multiple ways to match the input
  – Derivations (and parse trees) choose one
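A small sketch of these notes, using a made-up encoding in which a non-terminal node is a tuple `(symbol, child, ...)` and a terminal leaf is a plain string. It builds two different parse trees for int * int + int under the grammar E → E + E | E * E | (E) | int and checks that both have the same leaves.

```python
# Tree in which * is grouped first: (int * int) + int
tree_mul_first = ("E",
                  ("E", ("E", "int"), "*", ("E", "int")),
                  "+",
                  ("E", "int"))

# A different tree for the same string: int * (int + int)
tree_add_first = ("E",
                  ("E", "int"),
                  "*",
                  ("E", ("E", "int"), "+", ("E", "int")))


def leaves(node):
    """Left-to-right traversal of the leaves of a parse tree."""
    if isinstance(node, str):          # terminal leaf
        return [node]
    out = []
    for child in node[1:]:             # node[0] is the non-terminal's name
        out.extend(leaves(child))
    return out


print(" ".join(leaves(tree_mul_first)))
print(" ".join(leaves(tree_add_first)))
```

Both calls print the same token string, yet the trees differ, which is why the tree carries information about association that the input string does not.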

The Payoff: parser as a translator
[Figure: syntax-directed translation. A stream of tokens enters the parser, which produces ASTs or assembly code. The parser is driven by syntax + translation rules (typically hardcoded in the parser).]

Mechanism of syntax-directed translation
• Syntax-directed translation is done by extending the CFG
  – a translation rule is defined for each production
• Given X → d A B c, the translation of X is defined recursively using
  – translation of nonterminals A, B
  – values of attributes of terminals d, c
  – constants

To translate an input string:
1. Build the parse tree.
2. Working bottom-up, use the translation rules to compute the translation of each nonterminal in the tree.
Result: the translation of the string is the translation of the parse tree's root nonterminal.
Why bottom-up?
• a nonterminal's value may depend on the value of the symbols on the right-hand side,
• so translate a non-terminal node only after children translations are available.

Example 1: Arithmetic expression to value
Syntax-directed translation rules:

    E → E + T     E1.trans = E2.trans + T.trans
    E → T         E.trans = T.trans
    T → T * F     T1.trans = T2.trans * F.trans
    T → F         T.trans = F.trans
    F → int       F.trans = int.value
    F → ( E )     F.trans = E.trans

Example 1: Bison/Yacc Notation

    E : E '+' T    { $$ = $1 + $3; }
    T : T '*' F    { $$ = $1 * $3; }
    F : int        { $$ = $1; }
    F : '(' E ')'  { $$ = $2; }

• KEY:
    $$ : Semantic value of left-hand side
    $n : Semantic value of nth symbol on right-hand side

Example 1 (cont): Annotated Parse Tree
Input: 2 * (4 + 5)

    E (18)
      T (18)
        T (2)
          F (2)
            int (2)
        *
        F (9)
          (
          E (9)
            E (4)
              T (4)
                F (4)
                  int (4)
            +
            T (5)
              F (5)
                int (5)
          )
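The annotated tree can be reproduced with a short bottom-up traversal. This is a sketch under assumptions: the parse tree is built by hand rather than by a parser, non-terminal nodes are tuples `(symbol, child, ...)`, punctuation and operator leaves are plain strings, and an `int` token is the pair `("int", value)`. The function name `trans` simply mirrors the attribute name used in the rules.

```python
def trans(n):
    """Compute the .trans attribute of a parse-tree node, children first."""
    if isinstance(n, str):             # '+', '*', '(' or ')': no translation
        return None
    symbol, *kids = n
    if symbol == "int":                # terminal with an attribute: int.value
        return kids[0]
    shape = tuple(k if isinstance(k, str) else k[0] for k in kids)
    t = [trans(k) for k in kids]       # bottom-up: translate the children first
    if shape == ("E", "+", "T"):       # E -> E + T
        return t[0] + t[2]
    if shape == ("T", "*", "F"):       # T -> T * F
        return t[0] * t[2]
    if shape == ("(", "E", ")"):       # F -> ( E )
        return t[1]
    if len(kids) == 1:                 # E -> T, T -> F, F -> int
        return t[0]
    raise ValueError(f"no rule for {symbol} -> {' '.join(shape)}")


# Hand-built parse tree for 2 * (4 + 5)
four = ("E", ("T", ("F", ("int", 4))))
five = ("T", ("F", ("int", 5)))
paren = ("F", "(", ("E", four, "+", five), ")")
tree = ("E", ("T", ("T", ("F", ("int", 2))), "*", paren))

print(trans(tree))
```

Calling `trans` on the subtrees `paren` and `four` gives the intermediate values shown at the F (9) and E (4) nodes.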

Example 2: Compute the type of an expression

    E -> E + E      if $1 == INT and $3 == INT: $$ = INT
                    else: $$ = ERROR
    E -> E and E    if $1 == BOOL and $3 == BOOL: $$ = BOOL
                    else: $$ = ERROR
    E -> E == E     if $1 == $3 and $1 != ERROR: $$ = BOOL
                    else: $$ = ERROR
    E -> true       $$ = BOOL
    E -> false      $$ = BOOL
    E -> int        $$ = INT
    E -> ( E )      $$ = $2

Example 2 (cont)
• Input: (2 + 2) == 4

    E (BOOL)
      E (INT)
        (
        E (INT)
          E (INT)
            int (INT)
          +
          E (INT)
            int (INT)
        )
      ==
      E (INT)
        int (INT)
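The type rules can be tried out with a similar sketch. The tree encoding here is an assumption chosen for brevity: each node is a tuple that mirrors the right-hand side of the production used, so `("int",)` is E -> int and `(left, "+", right)` is E -> E + E. For E -> E == E the sketch requires both operand types to be equal and not ERROR.

```python
INT, BOOL, ERROR = "INT", "BOOL", "ERROR"


def type_of(tree):
    """Compute the type of an expression tree bottom-up."""
    if tree == ("int",):
        return INT
    if tree in (("true",), ("false",)):
        return BOOL
    if len(tree) == 3 and tree[0] == "(":        # E -> ( E )
        return type_of(tree[1])
    left, op, right = tree                        # binary productions
    t1, t3 = type_of(left), type_of(right)
    if op == "+":
        return INT if t1 == INT and t3 == INT else ERROR
    if op == "and":
        return BOOL if t1 == BOOL and t3 == BOOL else ERROR
    if op == "==":
        return BOOL if t1 == t3 and t1 != ERROR else ERROR
    raise ValueError(f"unknown operator {op!r}")


two = ("int",)
four = ("int",)
# (2 + 2) == 4
expr = (("(", (two, "+", two), ")"), "==", four)
print(type_of(expr))
```

Without the `!= ERROR` test, comparing two ill-typed operands such as `2 + true == 2 + true` would come out as BOOL instead of ERROR.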

Building Abstract Syntax Trees
• In the examples so far, streams of tokens were translated into
  – integer values, or
  – types
• Translating into ASTs is not very different

AST vs. Parse Tree
• An AST is a condensed form of a parse tree
  – operators appear at internal nodes, not at leaves.
  – "Chains" of single productions are collapsed.
  – Lists are "flattened".
  – Syntactic details are omitted
    • e.g., parentheses, commas, semi-colons
• An AST is a better structure for later compiler stages
  – omits details having to do with the source language,
  – only contains information about the essential structure of the program.

Example: 2 * (4 + 5)
Parse tree vs. AST

Parse tree:

    E
      T
        T
          F
            int (2)
        *
        F
          (
          E
            E
              T
                F
                  int (4)
            +
            T
              F
                int (5)
          )

AST:

    *
      2
      +
        4
        5

AST-building translation rules

    E → E + T     $$ = new PlusNode($1, $3)
    E → T         $$ = $1
    T → T * F     $$ = new TimesNode($1, $3)
    T → F         $$ = $1
    F → int       $$ = new IntLitNode($1)
    F → ( E )     $$ = $2

Example: 2 * (4 + 5): Steps in Creating AST
[Figure: the parse tree for 2 * (4 + 5) with partial ASTs drawn beside some of its nodes: the subtree + (4, 5) beside the nodes deriving 4 + 5, and the complete tree * (2, + (4, 5)) at the root.]
(Only some of the semantic values are shown)
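A Python sketch of the same rules, again over a hand-built parse tree. The class names follow the slide (`PlusNode`, `TimesNode`, `IntLitNode`), but their fields, the tuple encoding of the parse tree, and the function `build_ast` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntLitNode:
    value: int


@dataclass(frozen=True)
class PlusNode:
    left: object
    right: object


@dataclass(frozen=True)
class TimesNode:
    left: object
    right: object


def build_ast(n):
    """Apply the AST-building rule for each production, children first."""
    if isinstance(n, str):              # punctuation or operator token
        return None
    symbol, *kids = n
    if symbol == "int":                 # semantic value of the int token
        return kids[0]
    shape = tuple(k if isinstance(k, str) else k[0] for k in kids)
    v = [build_ast(k) for k in kids]    # v[0] plays the role of $1, v[2] of $3
    if shape == ("E", "+", "T"):
        return PlusNode(v[0], v[2])
    if shape == ("T", "*", "F"):
        return TimesNode(v[0], v[2])
    if shape == ("int",):
        return IntLitNode(v[0])
    if shape == ("(", "E", ")"):
        return v[1]
    if len(kids) == 1:                  # E -> T, T -> F
        return v[0]
    raise ValueError(f"no rule for {symbol} -> {' '.join(shape)}")


# Hand-built parse tree for 2 * (4 + 5)
four = ("E", ("T", ("F", ("int", 4))))
five = ("T", ("F", ("int", 5)))
paren = ("F", "(", ("E", four, "+", five), ")")
tree = ("E", ("T", ("T", ("F", ("int", 2))), "*", paren))

print(build_ast(tree))
```

The parentheses and the E → T, T → F chains leave no trace in the result, which is the condensation described on the "AST vs. Parse Tree" slide.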

Leftmost and Rightmost Derivations

Leftmost derivation: always act on leftmost non-terminal
    E ⇒ E + E
      ⇒ E * E + E
      ⇒ int * E + E
      ⇒ int * int + E
      ⇒ int * int + int

Rightmost derivation: always act on rightmost non-terminal
    E ⇒ E + E
      ⇒ E + int
      ⇒ E * E + int
      ⇒ E * int + int
      ⇒ int * int + int

Rightmost Derivation in Detail (1)–(6)
Each step extends the derivation by one sentential form and adds the corresponding children to the parse tree:

    Step  Sentential form       Change to the parse tree
    (1)   E                     root E
    (2)   ⇒ E + E               root E gets children E, +, E
    (3)   ⇒ E + int             right E gets child int
    (4)   ⇒ E * E + int         left E gets children E, *, E
    (5)   ⇒ E * int + int       middle E gets child int
    (6)   ⇒ int * int + int     leftmost E gets child int

Parse tree after step (6):

            E
          / | \
         E  +  E
       / | \   |
      E  *  E  int
      |     |
     int   int

Aside: Canonical Derivations
• Take a look at that last derivation in reverse.
• The active part (shown in red on the slides) tends to move left to right.
• We call this a reverse rightmost or canonical derivation.
• Comes up in bottom-up parsing. We'll return to it in a couple of lectures.

Derivations and Parse Trees
• For each parse tree there is exactly one leftmost and one rightmost derivation
• The difference is the order in which branches are added, not the structure of the tree.
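Since the tree fixes which production is used at every node, the only freedom left is the order of expansion. The sketch below reads both canonical orders off one tree; the tuple encoding (non-terminal nodes as `(symbol, child, ...)`, terminals as strings) and the function names are assumptions for illustration.

```python
def symbols(items):
    """The sentential form spelled by a list of subtrees."""
    return " ".join(it if isinstance(it, str) else it[0] for it in items)


def derivation(tree, leftmost=True):
    """List the sentential forms of the leftmost (or rightmost) derivation
    that corresponds to the given parse tree."""
    items = [tree]
    forms = [symbols(items)]
    while True:
        # Positions of unexpanded non-terminals in the current form
        positions = [i for i, it in enumerate(items) if not isinstance(it, str)]
        if not positions:
            return forms
        i = positions[0] if leftmost else positions[-1]
        items[i:i + 1] = list(items[i][1:])    # replace the node by its children
        forms.append(symbols(items))


# Parse tree for int * int + int from the derivation example
tree = ("E",
        ("E", ("E", "int"), "*", ("E", "int")),
        "+",
        ("E", "int"))

print(" => ".join(derivation(tree, leftmost=True)))
print(" => ".join(derivation(tree, leftmost=False)))
```

Both runs start at E and end at int * int + int; they differ only in the intermediate sentential forms.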

Summary of Derivations
• We are not just interested in whether s ∈ L(G)
• Also need derivation (or parse tree) and AST.
• Parse trees slavishly reflect the grammar.
• Abstract syntax trees abstract from the grammar, cutting out detail that interferes with later stages.
• A derivation defines a parse tree
  – But one parse tree may have many derivations
• Derivations drive translation (to ASTs, etc.)
• Leftmost and rightmost derivations most important in parser implementation