COS 320 Compilers (David Walker): The Front End

The Front End

stream of characters → Lexer → stream of tokens → Parser → abstract syntax → Type Checker

• Lexical Analysis: create a sequence of tokens from the stream of characters (Chap 2)
• Syntax Analysis: create an abstract syntax tree from the sequence of tokens (Chap 3)
• Type Checking: check the program for well-formedness constraints

Parsing with CFGs

• Context-free grammars are (often) given by BNF expressions (Backus-Naur Form)
  – Appel Chap 3.1
• More powerful than regular expressions
  – matching parens
  – nested comments
    • wait, we could do nested comments with ML-LEX!
• CFGs are good for describing the overall syntactic structure of programs.

Context-Free Grammars

• Context-free grammars consist of:
  – a set of symbols:
    • terminals, which denote token types
    • non-terminals, which each denote a set of strings
  – a start symbol
  – rules:  symbol ::= symbol ... symbol
    • left-hand side: a non-terminal
    • right-hand side: terminals and/or non-terminals
    • rules explain how to rewrite non-terminals (beginning with the start symbol) into terminals

Context-Free Grammars

A string is in the language of a CFG if and only if it is possible to derive that string using the following non-deterministic procedure:
1. begin with the start symbol
2. while any non-terminals remain, pick a non-terminal and rewrite it using a rule
3. stop when all you have left are terminals (and check that you arrived at the string you were hoping for)

Parsing is the process of checking that a string is in the CFG for your programming language. It is usually coupled with creating an abstract syntax tree.

non-terminals: S, E, Elist
terminals: ID, NUM, PRINT, +, :=, (, ), ;
rules:
  S ::= S ; S              E ::= ID               Elist ::= E
  S ::= ID := E            E ::= NUM              Elist ::= Elist , E
  S ::= PRINT ( Elist )    E ::= E + E
                           E ::= ( S , Elist )

non-terminals: S, E, Elist
terminals: ID, NUM, PRINT, +, :=, (, ), ;
rules:
  1. S ::= S ; S     2. S ::= ID := E     3. S ::= PRINT ( Elist )
  4. E ::= ID        5. E ::= NUM         6. E ::= E + E     7. E ::= ( S , Elist )
  8. Elist ::= E     9. Elist ::= Elist , E

Derive me!  ID := NUM ; PRINT ( NUM )

(non-terminals, terminals and rules as above)

Derive me!  ID := NUM ; PRINT ( NUM )

derivation so far:
  S

(non-terminals, terminals and rules as above)

Derive me!  ID := NUM ; PRINT ( NUM )

derivation so far:
  S
  ID := E

(non-terminals, terminals and rules as above)

Derive me!  ID := NUM ; PRINT ( NUM )

derivation so far:
  S
  ID := E        (oops, can’t make progress)

(non-terminals, terminals and rules as above)

Derive me!  ID := NUM ; PRINT ( NUM )

derivation so far (starting over):
  S

(non-terminals, terminals and rules as above)

Derive me!  ID := NUM ; PRINT ( NUM )

derivation so far:
  S
  S ; S

(non-terminals, terminals and rules as above)

Derive me!  ID := NUM ; PRINT ( NUM )

derivation so far:
  S
  S ; S
  ID := E ; S

(non-terminals, terminals and rules as above)

Derive me!  ID := NUM ; PRINT ( NUM )

derivation so far:
  S
  S ; S
  ID := E ; S
  ID := NUM ; S
  ID := NUM ; PRINT ( Elist )
  ID := NUM ; PRINT ( E )
  ID := NUM ; PRINT ( NUM )

(non-terminals, terminals and rules as above)

Another way to derive the same string:

left-most derivation:
  S
  S ; S
  ID := E ; S
  ID := NUM ; S
  ID := NUM ; PRINT ( Elist )
  ID := NUM ; PRINT ( E )
  ID := NUM ; PRINT ( NUM )

right-most derivation:
  S
  S ; S
  S ; PRINT ( Elist )
  S ; PRINT ( E )
  S ; PRINT ( NUM )
  ID := E ; PRINT ( NUM )
  ID := NUM ; PRINT ( NUM )

Parse Trees

• Representing derivations as trees is useful in compilers: parse trees correspond quite closely (but not exactly) with the abstract syntax trees we’re trying to generate
  • difference: abstract syntax vs concrete (parse) syntax
• each internal node is labeled with a non-terminal
• each leaf node is labeled with a terminal
• each use of a rule in a derivation explains how to generate children in the parse tree from the parent
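
For concreteness, here is a minimal sketch (not from the slides) of abstract syntax declarations that parse trees for the straight-line grammar above might be mapped to; the constructor names are illustrative assumptions, written in the ML used elsewhere in the deck:

  (* hypothetical abstract syntax for the S / E / Elist grammar above *)
  datatype exp = Id of string                (* E ::= ID *)
               | Num of int                  (* E ::= NUM *)
               | Plus of exp * exp           (* E ::= E + E *)
               | Eseq of stm * exp list      (* E ::= ( S , Elist ) *)
  and stm      = Seq of stm * stm            (* S ::= S ; S *)
               | Assign of string * exp      (* S ::= ID := E *)
               | Print of exp list           (* S ::= PRINT ( Elist ) *)

Note how the abstract syntax drops the punctuation (parentheses, semicolons, commas) that the concrete parse tree must keep.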

Parse Trees

• Example:

  S
  S ; S
  ID := E ; S
  ID := NUM ; S
  ID := NUM ; PRINT ( Elist )
  ID := NUM ; PRINT ( E )
  ID := NUM ; PRINT ( NUM )

and the corresponding parse tree:

  S
  +- S
  |  +- ID
  |  +- :=
  |  +- E -- NUM
  +- ;
  +- S
     +- PRINT
     +- (
     +- Elist -- E -- NUM
     +- )

Parse Trees

• Example: 2 derivations, but 1 tree

left-most derivation:
  S
  S ; S
  ID := E ; S
  ID := NUM ; S
  ID := NUM ; PRINT ( Elist )
  ID := NUM ; PRINT ( E )
  ID := NUM ; PRINT ( NUM )

right-most derivation:
  S
  S ; S
  S ; PRINT ( Elist )
  S ; PRINT ( E )
  S ; PRINT ( NUM )
  ID := E ; PRINT ( NUM )
  ID := NUM ; PRINT ( NUM )

Both derivations correspond to the single parse tree shown on the previous slide.

Parse Trees

• parse trees have meaning
  – the order of children and the nesting of subtrees is significant

[figure: two parse trees built from the same kinds of nodes; because the children appear in a different order and nest differently, the two trees denote different programs]

Ambiguous Grammars

• a grammar is ambiguous if the same sequence of tokens can give rise to two or more parse trees

Ambiguous Grammars

characters:  4 + 5 * 6
tokens:      NUM(4) PLUS NUM(5) MULT NUM(6)

non-terminals: E
terminals: ID NUM PLUS MULT

E ::= ID
    | NUM
    | E + E
    | E * E
(I like using this notation, where I avoid repeating “E ::=”)

one parse tree for the token sequence:

  E
  +- E -- NUM(4)
  +- +
  +- E
     +- E -- NUM(5)
     +- *
     +- E -- NUM(6)

Ambiguous Grammars

characters:  4 + 5 * 6
tokens:      NUM(4) PLUS NUM(5) MULT NUM(6)

non-terminals: E
terminals: ID NUM PLUS MULT

E ::= ID
    | NUM
    | E + E
    | E * E

two parse trees for the same token sequence:

tree 1, grouping as (4 + 5) * 6:
  E
  +- E
  |  +- E -- NUM(4)
  |  +- +
  |  +- E -- NUM(5)
  +- *
  +- E -- NUM(6)

tree 2, grouping as 4 + (5 * 6):
  E
  +- E -- NUM(4)
  +- +
  +- E
     +- E -- NUM(5)
     +- *
     +- E -- NUM(6)

Ambiguous Grammars

• problem: compilers use parse trees to interpret the meaning of parsed expressions
  – different parse trees have different meanings
  – e.g., (4 + 5) * 6 is not 4 + (5 * 6)
  – languages with ambiguous grammars are DISASTROUS; the meaning of programs isn’t well-defined! You can’t tell what your program might do!
• solution: rewrite the grammar to eliminate the ambiguity
  – fold precedence rules into the grammar to disambiguate
  – fold associativity rules into the grammar to disambiguate
  – other tricks as well (one standard rewrite is sketched below)
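
For instance, a standard way (not spelled out on the slides) to fold the precedence of * over + and left associativity into the expression grammar above is to stratify it into one non-terminal per precedence level; the names T and F are illustrative:

  E ::= E + T | T
  T ::= T * F | F
  F ::= NUM | ID

With this grammar, 4 + 5 * 6 has exactly one parse tree, the one grouped as 4 + (5 * 6), and a chain like 1 + 2 + 3 can only associate to the left.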

Building Parsers

• In theory classes, you may have learned about general mechanisms for parsing all CFGs
  – algorithms for parsing arbitrary CFGs are expensive
  – to compile 1/10/100 million-line applications, compilers must be fast
    • even for 10 thousand-line apps, speed is nice
    • sometimes 1/3 of compilation time is spent in parsing
• compiler writers have developed specialized algorithms for parsing the kinds of CFGs that you need to build effective programming languages
  – LL(k) and LR(k) grammars can be parsed efficiently

Recursive Descent Parsing

• Recursive Descent Parsing (Appel Chap 3.2):
  – aka: predictive parsing; top-down parsing
  – simple and efficient
  – can be coded by hand in ML quickly
  – parses many, but not all, CFGs
    • parses LL(1) grammars: Left-to-right parse, Leftmost derivation, 1 symbol of lookahead
  – key ideas:
    • one recursive function for each non-terminal
    • each production becomes one clause in that function

non-terminals: S, E, L
terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, ;, =
rules:
  1. S ::= IF E THEN S ELSE S      4. L ::= END
  2.    |  BEGIN S L               5.    |  ; S L
  3.    |  PRINT E                 6. E ::= NUM = NUM

non-terminals: S, E, L
terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, ;, =
rules:
  1. S ::= IF E THEN S ELSE S      4. L ::= END
  2.    |  BEGIN S L               5.    |  ; S L
  3.    |  PRINT E                 6. E ::= NUM = NUM

Step 1: represent the tokens

  datatype token = NUM | IF | THEN | ELSE | BEGIN | END | PRINT | SEMI | EQ

Step 2: build infrastructure for reading tokens from the lexing stream

  val tok = ref (getToken ())
  fun advance () = tok := getToken ()
  fun eat t = if (!tok = t) then advance () else error ()
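
The slide leaves the lexer interface and error reporting abstract; a minimal sketch of the assumed pieces (purely illustrative, getToken would come from an ML-Lex generated lexer):

  exception SyntaxError
  fun error () = raise SyntaxError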

non-terminals: S, E, L
terminals: NUM, IF, THEN, ELSE, BEGIN, END, PRINT, ;, =
rules:
  1. S ::= IF E THEN S ELSE S      4. L ::= END
  2.    |  BEGIN S L               5.    |  ; S L
  3.    |  PRINT E                 6. E ::= NUM = NUM

  datatype token = NUM | IF | THEN | ELSE | BEGIN | END | PRINT | SEMI | EQ

  val tok = ref (getToken ())
  fun advance () = tok := getToken ()
  fun eat t = if (!tok = t) then advance () else error ()

Step 3: write the parser: one function per non-terminal, one clause per rule

  fun S () =
        (case !tok of
             IF    => (eat IF; E (); eat THEN; S (); eat ELSE; S ())
           | BEGIN => (eat BEGIN; S (); L ())
           | PRINT => (eat PRINT; E ())
           | _     => error ())
  and L () =
        (case !tok of
             END  => eat END
           | SEMI => (eat SEMI; S (); L ())
           | _    => error ())
  and E () = (eat NUM; eat EQ; eat NUM)

(In ML, the parentheses around each sequence of actions keep the whole sequence inside its case clause; the catch-all clauses signal a syntax error on any unexpected token.)

non-terminals: A, S, E, L
rules:
  1. A ::= S EOF           4. E ::= ID
  2. S ::= ID := E         5.    |  NUM
  3.    |  PRINT ( L )     6. L ::= E
                           7.    |  L , E

  fun A () = (S (); eat EOF)
  and S () =
        (case !tok of
             ID    => (eat ID; eat ASSIGN; E ())
           | PRINT => (eat PRINT; eat LPAREN; L (); eat RPAREN)
           | _     => error ())
  and E () =
        (case !tok of
             ID  => eat ID
           | NUM => eat NUM
           | _   => error ())
  and L () =
        (case !tok of
             ID  => ???
           | NUM => ???)

problem

• predictive parsing only works for grammars where the first terminal symbol of each sub-expression provides enough information to choose which production to use
  – this is the LL(1) condition
• if !tok = ID, the parser cannot determine which production to use:
    6. L ::= E        (E could be ID)
    7.    |  L , E    (L could be E, which could be ID)

solution

• eliminate the left recursion
• rewrite the grammar so it parses the same language, but with different rules (a sketch of the corresponding parsing functions follows below):

  original grammar:
  A ::= S EOF
  S ::= ID := E  |  PRINT ( L )
  E ::= ID  |  NUM
  L ::= E  |  L , E

  rewritten grammar:
  A ::= S EOF
  S ::= ID := E  |  PRINT ( L )
  E ::= ID  |  NUM
  L ::= E M
  M ::= , E M  |  ε
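
With the rewritten rules, the L and M functions fall out of the same recipe as before (a sketch that reuses the tok / eat / advance machinery above; the COMMA token name is an assumption, since the slides do not list this token datatype):

  and L () = (E (); M ())
  and M () =
        (case !tok of
             COMMA => (eat COMMA; E (); M ())
           | _     => ())   (* M ::= ε : predict the empty production on any other token *)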

eliminating left recursion in general

• original grammar form:
    X ::= base
    X ::= X repeat
  strings generated:  base repeat repeat ...

• transformed grammar:
    X ::= base Xnew
    Xnew ::= repeat Xnew
    Xnew ::= ε
  strings generated:  base repeat repeat ...
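
For example, applying this transformation to the left-recursive rule E ::= E + T | T from the stratified expression grammar sketched earlier gives E ::= T Enew; Enew ::= + T Enew | ε (Enew being a fresh non-terminal), which a recursive descent parser can handle.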

Recursive Descent Parsing

• Unfortunately, such grammar rewrites (eliminating left recursion, left factoring) don’t always work
• Questions:
  – how do we know when we can parse grammars using recursive descent?
  – is there an algorithm for generating such parsers automatically?

Constructing RD Parsers

• To construct an RD parser, we need to know what rule to apply when
  – we are expanding a non-terminal X, and
  – we see the next terminal a in the input
• We apply the rule X ::= s when
  – a is the first symbol that can be generated by the string s, OR
  – s reduces to the empty string (is nullable) and a is the first symbol in any string that can follow X

Constructing Predictive Parsers

  1. Y ::= ε        3. X ::= c        5. Z ::= d
  2.    |  bb       4.    |  Y Z

  non-terminal   next terminal seen   rule
  X              c
  X              b
  X              d

Constructing Predictive Parsers

  1. Y ::= ε        3. X ::= c        5. Z ::= d
  2.    |  bb       4.    |  Y Z

  non-terminal   next terminal seen   rule
  X              c                    3
  X              b
  X              d

Constructing Predictive Parsers

  1. Y ::= ε        3. X ::= c        5. Z ::= d
  2.    |  bb       4.    |  Y Z

  non-terminal   next terminal seen   rule
  X              c                    3
  X              b                    4
  X              d

Constructing Predictive Parsers

  1. Y ::= ε        3. X ::= c        5. Z ::= d
  2.    |  bb       4.    |  Y Z

  non-terminal   next terminal seen   rule
  X              c                    3
  X              b                    4
  X              d                    4

Constructing Predictive Parsers

• in general, we must compute:
  – for each production X ::= s, whether s can derive the empty string
    • if yes, then X is Nullable
  – for each production X ::= s, the set of all terminals Q that can appear first in a string derived from s
    • Q ⊆ First(X)
  – for each non-terminal X, the set of all terminals Q that can immediately follow X
    • Q ⊆ Follow(X)

Iterative Analysis

• Many compiler algorithms are iterative techniques.
• Iterative analysis applies when:
  – we must compute a set of objects with some property P
  – P is defined inductively, i.e., there are:
    • base cases: objects o1, o2 “obviously” have property P
    • inductive cases: if certain objects (o3, o4) have property P, this implies other objects (f o3, f o4) have property P
  – the number of objects in the set is finite
    • or we can represent infinite collections using some finite notation, and we can find effective termination conditions

Iterative Analysis

• general form:
  – initialize the set S with the base cases
  – apply the inductive rules over and over until you reach a fixed point
    • a fixed point is a set that does not change when you apply an inductive rule
  – the Nullable, First and Follow sets can all be determined through iteration
  – many program optimizations use iteration as well
  – worst-case complexity is bad
  – average-case complexity is good: iteration “usually” terminates in a couple of rounds
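
As a sketch in ML (not from the slides, names illustrative), the fixed-point recipe itself is tiny; the interesting work lives in the step function supplied by each analysis:

  (* apply one inductive step until the result stops changing *)
  fun fixpoint (step : 'a -> 'a) (eq : 'a * 'a -> bool) (init : 'a) : 'a =
    let val next = step init
    in if eq (next, init) then init else fixpoint step eq next
    end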

Computing Nullable Sets

• Non-terminal X is Nullable only if the following constraints are satisfied (computed using iterative analysis):
  – base case:
    • if (X ::= ε) then X is Nullable
  – inductive case:
    • if (X ::= A B C ...) and A, B, C, ... are all Nullable, then X is Nullable
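
A sketch of one inductive pass in ML (the grammar representation, a list of (non-terminal, right-hand side) pairs, is an assumption, not from the slides):

  type production = string * string list      (* non-terminal and its right-hand-side symbols *)

  (* one pass: X joins the nullable set if some rule X ::= A B C ... has every
     right-hand-side symbol already in the set; a rule X ::= ε has an empty
     right-hand side, so it fires immediately (the base case) *)
  fun nullableStep (prods : production list) (nullable : string list) : string list =
    let fun inSet s set = List.exists (fn y => y = s) set
    in foldl (fn ((x, rhs), acc) =>
                if List.all (fn s => inSet s acc) rhs andalso not (inSet x acc)
                then x :: acc
                else acc)
             nullable prods
    end

  (* driven to a fixed point with the fixpoint sketch above; comparing lengths is
     enough because the set only ever grows:
       fixpoint (nullableStep prods) (fn (a, b) => length a = length b) []         *)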

Computing First Sets

• First(X) is computed iteratively
  – base case:
    • if T is a terminal symbol, then First(T) = { T }
  – inductive case:
    • if X is a non-terminal with a production (X ::= A B C ...), then
      – First(X) = First(X) ∪ First(A B C ...)
        where First(A B C ...) = F1 ∪ F2 ∪ F3 ∪ ... and
          F1 = First(A)
          F2 = First(B), if A is Nullable
          F3 = First(C), if A and B are Nullable
          ...
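
The "first symbols of a string" calculation is the subtle part; a sketch (first and isNullable are assumed lookup functions over the sets computed so far):

  (* First(A B C ...) = First(A), plus First(B) if A is nullable,
     plus First(C) if A and B are nullable, and so on *)
  fun firstOfString (first : string -> string list)
                    (isNullable : string -> bool)
                    (syms : string list) : string list =
    case syms of
        []      => []
      | s :: ss => first s @ (if isNullable s then firstOfString first isNullable ss else [])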

Computing Follow Sets

• Follow(X) is computed iteratively
  – base case:
    • initially, we assume nothing in particular follows X
      – (Follow(X) is initially { })
  – inductive cases:
    • if (Y ::= s1 X s2) for any strings s1, s2, then
      – Follow(X) = First(s2) ∪ Follow(X)
    • if (Y ::= s1 X s2) for any strings s1, s2, and s2 is Nullable, then
      – Follow(X) = Follow(Y) ∪ Follow(X)
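
As a sketch, the contribution of one occurrence of X in a rule Y ::= s1 X s2 is (using firstOfString from the previous sketch; sets are lists here, and "s2 is Nullable" means every symbol of s2 is nullable, which is vacuously true when s2 is empty):

  fun followContribution first isNullable followOfY s2 =
    firstOfString first isNullable s2 @
    (if List.all isNullable s2 then followOfY else [])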

building a predictive parser

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

        nullable   first      follow
  Z
  Y
  X

building a predictive parser

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

        nullable   first      follow
  Z     no
  Y     yes
  X     no

  (base case)

building a predictive parser

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

        nullable   first      follow
  Z     no
  Y     yes
  X     no

  (after one round of induction, we realize we have reached a fixed point)

building a predictive parser

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

        nullable   first        follow
  Z     no         d
  Y     yes        c
  X     no         a, b

  (base case)

building a predictive parser

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

        nullable   first        follow
  Z     no         d, a, b
  Y     yes        c
  X     no         a, b

  (after one round of induction, not yet a fixed point)

building a predictive parser

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

        nullable   first        follow
  Z     no         d, a, b
  Y     yes        c
  X     no         a, b

  (after two rounds of induction, no more changes ==> fixed point)

building a predictive parser

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

        nullable   first        follow
  Z     no         d, a, b      { }
  Y     yes        c            { }
  X     no         a, b         { }

  (base case)

building a predictive parser

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

        nullable   first        follow
  Z     no         d, a, b      { }
  Y     yes        c            e, d, a, b
  X     no         a, b         c, e, d, a, b

  (after one round of induction, not yet a fixed point)

building a predictive parser

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

        nullable   first        follow
  Z     no         d, a, b      { }
  Y     yes        c            e, d, a, b
  X     no         a, b         c, e, d, a, b

  (after two rounds of induction, a fixed point; but notice that computing Follow(X) before Follow(Y) would have required a 3rd round)

Grammar:
  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e

Computed Sets:
        nullable   first        follow
  Z     no         d, a, b      { }
  Y     yes        c            e, d, a, b
  X     no         a, b         c, e, d, a, b

• if T ∈ First(s), then enter (X ::= s) in row X, col T
• if s is Nullable and T ∈ Follow(X), then enter (X ::= s) in row X, col T

Build a parsing table where row X, col T tells the parser which clause to execute in function X when the next token is T:

        a            b            c          d          e
  Z
  Y
  X

(Grammar, computed sets, and table-filling rules as above.)

        a            b            c          d          e
  Z     Z ::= XYZ    Z ::= XYZ
  Y
  X

(Grammar, computed sets, and table-filling rules as above.)

        a            b            c          d          e
  Z     Z ::= XYZ    Z ::= XYZ                Z ::= d
  Y
  X

(Grammar, computed sets, and table-filling rules as above.)

        a            b            c          d          e
  Z     Z ::= XYZ    Z ::= XYZ                Z ::= d
  Y                               Y ::= c
  X

(Grammar, computed sets, and table-filling rules as above.)

        a            b            c          d          e
  Z     Z ::= XYZ    Z ::= XYZ                Z ::= d
  Y     Y ::= ε      Y ::= ε      Y ::= c     Y ::= ε    Y ::= ε
  X

(Grammar, computed sets, and table-filling rules as above.)

        a            b            c          d          e
  Z     Z ::= XYZ    Z ::= XYZ                Z ::= d
  Y     Y ::= ε      Y ::= ε      Y ::= c     Y ::= ε    Y ::= ε
  X     X ::= a      X ::= bYe

(Grammar, computed sets, and table-filling rules as above.)

What are the blanks?

        a            b            c          d          e
  Z     Z ::= XYZ    Z ::= XYZ                Z ::= d
  Y     Y ::= ε      Y ::= ε      Y ::= c     Y ::= ε    Y ::= ε
  X     X ::= a      X ::= bYe

(Grammar, computed sets, and table-filling rules as above.)

What are the blanks? --> syntax errors

        a            b            c          d          e
  Z     Z ::= XYZ    Z ::= XYZ                Z ::= d
  Y     Y ::= ε      Y ::= ε      Y ::= c     Y ::= ε    Y ::= ε
  X     X ::= a      X ::= bYe

(Grammar, computed sets, and table-filling rules as above.)

Is it possible to put 2 grammar rules in the same box?

        a            b            c          d          e
  Z     Z ::= XYZ    Z ::= XYZ                Z ::= d
  Y     Y ::= ε      Y ::= ε      Y ::= c     Y ::= ε    Y ::= ε
  X     X ::= a      X ::= bYe

Grammar (with one extra rule for Z):
  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e
  Z ::= d e

(The computed sets are unchanged.)

Is it possible to put 2 grammar rules in the same box? Yes: with the extra rule, row Z, col d holds both Z ::= d and Z ::= d e, so the table no longer tells the parser which clause to run.

        a            b            c          d          e
  Z     Z ::= XYZ    Z ::= XYZ                Z ::= d
                                              Z ::= d e
  Y     Y ::= ε      Y ::= ε      Y ::= c     Y ::= ε    Y ::= ε
  X     X ::= a      X ::= bYe

predictive parsing tables

• if a predictive parsing table constructed this way contains no duplicate entries, the grammar is called LL(1)
  – Left-to-right parse, Left-most derivation, 1 symbol of lookahead
• if it does, the grammar is not LL(1)
• in an LL(k) parsing table, the columns include every k-length sequence of terminals:
    aa  ab  ba  bb  ac  ca  ...
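
A sketch of the table-filling rule in ML (representation assumed, not from the slides): build the list of ((non-terminal, terminal), production) entries; the grammar is LL(1) exactly when no (non-terminal, terminal) key occurs twice:

  (* firstOfStr and nullableStr extend First and Nullable to strings of symbols,
     follow gives Follow of a non-terminal; all are assumed to be computed already *)
  fun tableEntries (prods : (string * string list) list)
                   (firstOfStr : string list -> string list)
                   (nullableStr : string list -> bool)
                   (follow : string -> string list) =
    List.concat
      (map (fn (x, rhs) =>
              let val cols = firstOfStr rhs @ (if nullableStr rhs then follow x else [])
              in map (fn t => ((x, t), (x, rhs))) cols end)
           prods)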

another trick

• Previously, we saw that grammars with left recursion were problematic, but could sometimes be transformed into LL(1) grammars
• the example non-LL(1) grammar we just saw:

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e
  Z ::= d e

• how do we fix it?

another trick

• Previously, we saw that grammars with left recursion were problematic, but could sometimes be transformed into LL(1) grammars
• the example non-LL(1) grammar we just saw:

  Z ::= X Y Z     Y ::= c     X ::= a
  Z ::= d         Y ::= ε     X ::= b Y e
  Z ::= d e

• the solution here is left factoring: the two offending rules share the prefix d, so factor the prefix out and push the differing suffixes into a fresh non-terminal W:

  Z ::= X Y Z     W ::= e     Y ::= c     X ::= a
  Z ::= d W       W ::= ε     Y ::= ε     X ::= b Y e

summary

• CFGs are good at specifying programming language structure
• parsing general CFGs is expensive, so we define parsers for simpler classes of CFGs
  – LL(k), LR(k)
• we can build a recursive descent parser for an LL(k) grammar by:
  – computing the nullable, first and follow sets
  – constructing a parse table from those sets
  – checking for duplicate entries, which indicate failure
  – creating an ML program from the parse table
• if parser construction fails, we can
  – rewrite the grammar (left factoring, eliminating left recursion) and try again
  – try to build a parser using some other method