Programming Languages (CS 550)
Lecture 4 Summary
Scanner and Parser Generators
Jeremy R. Johnson

Theme
- We have now seen how to describe syntax using regular expressions and grammars, and how to create scanners and parsers both by hand and using automated tools. In this lecture we provide more details on parsing and scanning and indicate how these tools work.
- Reading: Chapter 2 of the text by Scott.

Parser and Scanner Generators
- Tools exist (e.g. yacc/bison [1] for C/C++, PLY for Python, CUP for Java) to automatically construct a parser from a restricted class of context-free grammars (LALR(1) grammars for yacc/bison and the derivatives CUP and PLY)
- These tools use table-driven bottom-up parsing techniques (commonly shift/reduce parsing)
- Similar tools (e.g. lex/flex for C/C++, JFlex for Java) exist, based on the theory of finite automata, to automatically construct scanners from regular expressions
[1] bison is the GNU version of yacc

Outline
- Scanners and DFA
- Regular Expressions and NDFA
- Equivalence of DFA and NDFA
- Regular Languages and the limitations of regular expressions
- Recursive Descent Parsing
- LL(1) Grammars and Top-down (Predictive) Parsing
- LR(1) Grammars and Bottom-up Parsing

Regular Expressions
- Alphabet = Σ
- A language over Σ is a subset of the strings in Σ*
- Regular expressions describe certain types of languages
  - ∅ is a regular expression
  - ε = {ε} is a regular expression
  - For each a in Σ, a denoting {a} is a regular expression
  - If r and s are regular expressions denoting languages R and S respectively, then (r | s), (rs), and (r*) are regular expressions
- E.g. 00, (0|1)*00(0|1)*, 00*11*22*, (1|10)*

List Tokens
- LPAREN = '('
- RPAREN = ')'
- COMMA = ','
- NUMBER = DIGIT DIGIT*
- DIGIT = 0|1|2|3|4|5|6|7|8|9
- Unix shorthand: [0-9], DIGIT+
- Whitespace: (' ' | '\n' | '\t')*

List Scanner
TOKEN GetToken() {
    int val = 0;
    if ((c = getchar()) == EOF) then return None end if;
    /* skip whitespace */
    while c ∈ {' ', '\n', '\t'} do c = getchar() end do;
    /* single-character tokens */
    if c ∈ {'(', ',', ')'} then return c end if;
    if c ∈ {'0', …, '9'} then
        /* accumulate the decimal value of the number */
        while c ∈ {'0', …, '9'} do
            val = val*10 + (c - '0');
            c = getchar();
        end do;
        ungetc(c, stdin);   /* push the lookahead character back onto the input */
        return (NUMBER, val);
    else
        return None;
    end if;
}
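The hand-written scanner above can be sketched in runnable form. This is a minimal Python version, not the course's code: the names `get_token` and `tokens`, the use of `io.StringIO` as the input stream, and the tuple `("NUMBER", value)` for number tokens are all assumptions made for illustration.

```python
import io

NUMBER = "NUMBER"                  # token kind, mirroring NUMBER on the slide
DIGITS = set("0123456789")
WHITESPACE = {" ", "\n", "\t"}

def get_token(stream):
    """Return '(', ',', ')', (NUMBER, value), or None at end of input
    or on an unexpected character."""
    c = stream.read(1)
    while c in WHITESPACE:         # skip blanks, newlines, and tabs
        c = stream.read(1)
    if c in ("(", ",", ")"):       # single-character tokens
        return c
    if c in DIGITS:
        val = 0
        while c in DIGITS:         # accumulate the decimal value
            val = val * 10 + int(c)
            c = stream.read(1)
        if c:                      # push the lookahead character back
            stream.seek(stream.tell() - 1)
        return (NUMBER, val)
    return None

def tokens(text):
    """Collect all tokens of text (hypothetical helper for this sketch)."""
    stream = io.StringIO(text)
    out = []
    tok = get_token(stream)
    while tok is not None:
        out.append(tok)
        tok = get_token(stream)
    return out
```

For example, `tokens("(1, (23))")` yields the parenthesis and comma characters interleaved with `("NUMBER", 1)` and `("NUMBER", 23)`.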

Flex List Tokens
%{
#include <stdlib.h>   /* atoi */
#include "list.tab.h"
extern int yylval;
%}
%%
[ \t\n]    ;
"("        return yytext[0];
")"        return yytext[0];
","        return yytext[0];
[0-9]+     { yylval = atoi(yytext); return NUMBER; }
%%

Deterministic Finite Automata
- Input comes from an alphabet A
- Finite set of states S, start state s0, accepting states F
- Transition T from state to state depending on the next input
  - M = (A, S, s0, F, T)
- The language accepted by a finite automaton is the set of input strings that end up in accepting states

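Example 1
- Create a finite state automaton that accepts strings of a's and b's with an even number of a's.
- DFA (from the diagram): start state S0 and state S1. On b, each state loops to itself; on a, S0 goes to S1 and S1 goes to S0.
- Trace:
    input:     a b b b a b a a b b b
    states:  0 1 1 1 1 0 0 1 0 0 0 0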

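DFA Implementation
- Program to implement the DFA of Example 1 (S0: even number of a's read so far, S1: odd)
bool EA() {
    int x;   /* int, so that an end marker distinct from any character fits */
S0: x = getchar();
    if (x == 'b') goto S0;
    if (x == 'a') goto S1;
    if (x == ENDM) return true;    /* ENDM: end-of-input marker, defined elsewhere */
    return false;                  /* any other character: reject */
S1: x = getchar();
    if (x == 'b') goto S1;
    if (x == 'a') goto S0;
    if (x == ENDM) return false;
    return false;                  /* any other character: reject */
}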
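The same automaton can also be run from a transition table rather than from gotos. The following is a small Python sketch under that assumption; the names `EVEN_A` and `run_dfa` are made up for illustration, and a missing table entry is treated as rejection.

```python
# Transition table for the even-number-of-a's DFA: (state, symbol) -> state.
EVEN_A = {
    ("S0", "a"): "S1", ("S0", "b"): "S0",
    ("S1", "a"): "S0", ("S1", "b"): "S1",
}

def run_dfa(table, start, accepting, text):
    """Run the DFA on text; return (accepted, list of states visited)."""
    state = start
    trace = [state]
    for ch in text:
        if (state, ch) not in table:   # no transition: reject
            return False, trace
        state = table[(state, ch)]
        trace.append(state)
    return state in accepting, trace
```

Calling `run_dfa(EVEN_A, "S0", {"S0"}, "abbbabaabbb")` visits the states 0 1 1 1 1 0 0 1 0 0 0 0, the same sequence as the trace in Example 1, and accepts.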

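List DFA
[State diagram: start state S0 and states S1 to S4, with edges labeled d (digit), whitespace (' ', \t, \n), '(' and ','; the edge layout is not recoverable from the extracted text]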

Calculator Tokens
- ASSIGN = ':='
- PLUS = '+', MINUS = '-', TIMES = '*', DIV = '/'
- LPAREN = '(', RPAREN = ')'
- NUMBER = DIGIT DIGIT* | DIGIT* (. DIGIT | DIGIT .) DIGIT*
- ID = LETTER (LETTER | DIGIT)*
- DIGIT = 0|1|…|9, LETTER = a|…|z|A|…|Z
- COMMENT = /* (non-* | * non-/)* */ | // (non-newline)* newline
- WHITESPACE = (' ' | '\n' | '\t')*

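Table Driven Scanner
[Transition table; the column alignment of most rows is not recoverable from the extracted text]
- Columns (current input character): ' ' or \t, \n, /, *, (, ), +, -, :, =, ., digit, letter, other
- Row for start state 1: ' ' or \t → 17, \n → 17, / → 2, * → 10, ( → 6, ) → 7, + → 8, - → 9, : → 11, . → 13, digit → 14, letter → 16
- Tokens recognized by state: 2 div, 6 lparen, 7 rparen, 8 plus, 9 minus, 10 times, 12 assign, 14 number, 16 identifier, 17 white-space, 18 comment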

Non-Deterministic Finite Automata
- Same as DFA M = (A, S, s0, F, T) except
  - Can have multiple transitions from the same state with the same input
  - Can have epsilon transitions
  - Accept the input string if there is a path to an accepting state
- The languages accepted by NDFA are the same as those accepted by DFA

Example 2a
- DFA accepting (a|b)*abb
[State diagram with start state S0; the extracted edge labels are incomplete]

Example 2b
- NDFA accepting (a|b)*abb
  - Without ε-transitions: S0 (start) loops on a, b; S0 -a-> S1 -b-> S2 -b-> S3
  - With an ε-transition: S0 (start) loops on a, b; S0 -ε-> S1 -a-> S2 -b-> S3 -b-> S4

Simulating an NDFA
- Compute S = set of states the NDFA could be in after reading each symbol in the input.
- Si = set of possible states after reading i input symbols
  1. Initialize S0 = EpsilonClosure({0})
  2. for i = 1, …, len(str)
     1. Ti = ∪ {T[s, str[i]] : s ∈ Si-1}
     2. Si = EpsilonClosure(Ti)
- Trace (state set after each input symbol):
    input:  b        a           b           b
    Si:     {0, 1}   {0, 1, 2}   {0, 1, 3}   {0, 1, 4}
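The simulation loop above can be sketched in Python as follows. The NDFA encoded here is the ε-transition machine for (a|b)*abb from Example 2b with states numbered 0 to 4; the dictionary encoding, the use of the empty string as the ε label, and the function names are assumptions of this sketch.

```python
EPS = ""   # label for epsilon edges (an encoding choice of this sketch)

# (state, symbol) -> set of next states, for the NDFA accepting (a|b)*abb.
NDFA = {
    (0, "a"): {0}, (0, "b"): {0}, (0, EPS): {1},
    (1, "a"): {2},
    (2, "b"): {3},
    (3, "b"): {4},
}

def epsilon_closure(states, delta):
    """All states reachable from `states` using only epsilon edges."""
    closure = set(states)
    todo = list(states)
    while todo:
        s = todo.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return closure

def simulate(delta, start, accepting, text):
    """Return (accepted, [S0, S1, ..., Sn]) for the input text."""
    current = epsilon_closure({start}, delta)
    history = [current]
    for ch in text:
        moved = set()
        for s in current:                      # Ti = union of T[s, ch]
            moved |= delta.get((s, ch), set())
        current = epsilon_closure(moved, delta)  # Si = closure of Ti
        history.append(current)
    return bool(current & accepting), history
```

On the input `babb` the state sets after each symbol are {0, 1}, {0, 1, 2}, {0, 1, 3}, {0, 1, 4}, and the string is accepted because state 4 is in the final set.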

NDFA from Regular Expressions
- Base case: c
- Union: R|S
- Concatenation: RS
- Closure: R*
[Diagrams: the base case is a single edge labeled c; the other cases connect the machines for R and S with ε-transitions]
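One common way to realize these four cases is a Thompson-style construction, sketched below in Python. It is not necessarily the exact construction drawn on the slide and may introduce more ε edges than the slide's diagrams. The tuple-based syntax tree (`"sym"`, `"or"`, `"cat"`, `"star"`), the empty string as ε label, and all function names are assumptions made for illustration.

```python
import itertools

EPS = ""   # label for epsilon edges (an encoding choice of this sketch)

def thompson(node, delta, fresh):
    """Add an NDFA fragment for `node` to delta; return (start, accept)."""
    start, accept = next(fresh), next(fresh)

    def edge(src, label, dst):
        delta.setdefault((src, label), set()).add(dst)

    kind = node[0]
    if kind == "sym":                       # base case: single character c
        edge(start, node[1], accept)
    elif kind == "or":                      # union R|S
        for sub in node[1:]:
            s, a = thompson(sub, delta, fresh)
            edge(start, EPS, s)
            edge(a, EPS, accept)
    elif kind == "cat":                     # concatenation RS
        s1, a1 = thompson(node[1], delta, fresh)
        s2, a2 = thompson(node[2], delta, fresh)
        edge(start, EPS, s1)
        edge(a1, EPS, s2)
        edge(a2, EPS, accept)
    elif kind == "star":                    # closure R*
        s, a = thompson(node[1], delta, fresh)
        edge(start, EPS, s)
        edge(start, EPS, accept)            # zero repetitions
        edge(a, EPS, s)                     # another repetition
        edge(a, EPS, accept)
    else:
        raise ValueError(f"unknown node kind: {kind}")
    return start, accept

def closure(states, delta):
    seen, todo = set(states), list(states)
    while todo:
        for t in delta.get((todo.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

def matches(node, text):
    """Build the NDFA for node and simulate it on text."""
    delta = {}
    start, accept = thompson(node, delta, itertools.count())
    current = closure({start}, delta)
    for ch in text:
        moved = set()
        for s in current:
            moved |= delta.get((s, ch), set())
        current = closure(moved, delta)
    return accept in current

A_OR_BC = ("or", ("sym", "a"), ("cat", ("sym", "b"), ("sym", "c")))
```

Here `A_OR_BC` encodes the expression (a|bc), so `matches(A_OR_BC, "bc")` holds while `matches(A_OR_BC, "b")` does not.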

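Example 3
- Construct an NDFA that accepts the language generated by the regular expression (a|bc)
- NDFA (from the diagram, unlabeled edges read as ε): start state S0; S0 -ε-> S1 -a-> S2 -ε-> S6 and S0 -ε-> S3 -b-> S4 -c-> S5 -ε-> S6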

Regular Expression Compiler
%{
#include <stdio.h>
#include <stdlib.h>
#include "machine.h"
char input[100];
%}
%union {
    MachinePtr ndfa;
    char symbol;
}
%token <symbol> LETTER
%type <ndfa> regexp
%type <ndfa> cat
%type <ndfa> kleene
%%
statement: regexp {
        /* repeatedly read a string and run the NDFA on it */
        do {
            printf("Enter string\n");
            if (scanf("%99s", input) != EOF)   /* width limit matches input[100] */
                Simulate($1, input);
            else
                exit(1);
        } while (1);
    }
    ;
regexp: regexp '|' cat   { $$ = MachineOr($1, $3); }
    | cat                { $$ = $1; }
    ;
cat: cat kleene          { $$ = MachineConcat($1, $2); }
    | kleene             { $$ = $1; }
    ;
kleene: '(' regexp ')'   { $$ = $2; }
    | kleene '*'         { $$ = MachineStar($1); }
    | LETTER             { $$ = BaseMachine($1); }
    ;
%%

NDFA to DFA
- Find an equivalent DFA for Example 3
  - States in the DFA are sets of states from the NDFA [keep track of all possible transitions]
- DFA: start state {0, 1, 3}; on a go to {2, 6}; on b go to {4}; from {4}, on c go to {5, 6}
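The set-of-states idea can be sketched as a subset construction in Python. The NDFA below is an encoding of Example 3 for (a|bc), reading the unlabeled edges of the diagram as ε-transitions (0 to 1, 0 to 3, 2 to 6, 5 to 6); that reading, the dictionary encoding, and the function names are assumptions of this sketch.

```python
EPS = ""   # label for epsilon edges (an encoding choice of this sketch)

# (state, symbol) -> set of next states for the Example 3 NDFA.
EXAMPLE3 = {
    (0, EPS): {1, 3},
    (1, "a"): {2}, (2, EPS): {6},
    (3, "b"): {4}, (4, "c"): {5}, (5, EPS): {6},
}

def closure(states, delta):
    seen, todo = set(states), list(states)
    while todo:
        for t in delta.get((todo.pop(), EPS), ()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

def subset_construction(delta, start, alphabet):
    """Return (start_set, dfa) where dfa maps (state_set, symbol) -> state_set."""
    start_set = frozenset(closure({start}, delta))
    dfa = {}
    seen = {start_set}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        for ch in alphabet:
            moved = set()
            for s in current:
                moved |= delta.get((s, ch), set())
            if not moved:
                continue                     # no edge: implicit dead state
            target = frozenset(closure(moved, delta))
            dfa[(current, ch)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    return start_set, dfa
```

Running `subset_construction(EXAMPLE3, 0, "abc")` produces the start state {0, 1, 3} with an a-edge to {2, 6}, a b-edge to {4}, and a c-edge from {4} to {5, 6}. A DFA state is accepting when it contains NDFA state 6.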

Exercise 1
1. Construct an NDFA that recognizes d*(.d | d.)d* (see pages 55-57 of text)
2. Convert the NDFA from (1) to a DFA (see pages 56-58 of text)

Solution 1.1
[State diagram: NDFA with states 1 to 14 and edges labeled d and .; the edge layout is not recoverable from the extracted text]

Solution 1.2
DFA states and the NDFA states each one contains (edges in the diagram are labeled d and .):
- A[1, 2, 4, 5, 8]
- B[2, 3, 4, 5, 8, 9]
- C[6]
- D[6, 10, 11, 12, 14]
- E[7, 11, 12, 14]
- F[7, 11, 12, 13, 14]
- G[11, 12, 13, 14]

Minimizing States in DFA
- There exists a unique minimal-state DFA for any language described by a regular expression
- Combine equivalent states
  - p ≡ q if for each input string x, T(p, x) is an accepting state iff T(q, x) is an accepting state
  - Initialize two sets of states: accepting and non-accepting
  - Partition any set of states whose members transition into different sets on the same input
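The partitioning step can be sketched in Python as repeated refinement. The example DFA is the one for d*(.d | d.)d* with states A to G; its transitions are an assumption here, reconstructed from the NDFA state sets listed in Solution 1.2 rather than read from the diagram. Unlike a hand trace, which may split one set at a time, this sketch splits every set by all input symbols in each pass.

```python
# Assumed transitions of the Exercise 1 DFA: (state, symbol) -> state.
DELTA = {
    ("A", "d"): "B", ("A", "."): "C",
    ("B", "d"): "B", ("B", "."): "D",
    ("C", "d"): "E",
    ("D", "d"): "F",
    ("E", "d"): "G",
    ("F", "d"): "G",
    ("G", "d"): "G",
}
STATES = set("ABCDEFG")
ACCEPTING = set("DEFG")       # the DFA states that contain NDFA state 14

def minimize(states, alphabet, delta, accepting):
    """Return the list of blocks (sets of equivalent states)."""
    partition = [b for b in (set(accepting), set(states) - set(accepting)) if b]
    changed = True
    while changed:
        changed = False
        # Map each state to the index of the block that contains it.
        index = {s: i for i, block in enumerate(partition) for s in block}
        refined = []
        for block in partition:
            groups = {}
            for s in block:
                # Signature: the block reached on each symbol (None if no edge).
                key = tuple(index.get(delta.get((s, ch))) for ch in alphabet)
                groups.setdefault(key, set()).add(s)
            refined.extend(groups.values())
            if len(groups) > 1:
                changed = True
        partition = refined
    return partition
```

With these assumed transitions, `minimize(STATES, "d.", DELTA, ACCEPTING)` returns the four blocks {A}, {B}, {C}, and {D, E, F, G}.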

Exercise 2
1. Find an equivalent DFA to the one in Exercise 1 that minimizes the number of states (see page 59 of text)
[Diagram: initial partition into the two states ABC and DEFG, with edges labeled d and .]
- Ambiguity: T(A, d) = T(B, d) ≠ T(C, d), so split ABC into AB and C

Solution 2
[State diagram: minimal DFA with states A, B, C, and DEFG, with edges labeled d and .]

Equivalence of Regular Expressions and DFA
- The languages accepted by finite automata are equivalent to those generated by regular expressions
  - Given any regular expression R, there exists a finite state automaton M such that L(M) = L(R)
    - Proof is given by the previous construction
  - Given any finite state automaton M, there exists a regular expression R such that L(R) = L(M)
    - The basic idea is to combine the transitions in each node along all paths that lead to an accepting state. The combinations of characters along the paths are described using regular expressions.

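Example 4
- Create a regular expression for the language that consists of strings of a's and b's with an even number of a's.
- DFA: the two-state automaton of Example 1 (S0 start, S1; b loops on each state, a switches between them)
- Regular expression: b*(ab*ab*)*, that is, leading b's followed by zero or more blocks that each contain exactly two a's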
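A candidate expression for this language can be checked by brute force against the definition "even number of a's". The Python sketch below uses the standard `re` module; the helper name `counterexamples` and the length bound are choices made for illustration. It checks b*(ab*ab*)* and, for contrast, the similar-looking b*|(b*ab*a)*, which cannot match a string that ends in b after its last a.

```python
import itertools
import re

def counterexamples(pattern, max_len):
    """Strings over {a, b} up to max_len on which `pattern` disagrees
    with the rule 'the number of a's is even'."""
    bad = []
    for n in range(max_len + 1):
        for chars in itertools.product("ab", repeat=n):
            s = "".join(chars)
            expected = s.count("a") % 2 == 0
            if (pattern.fullmatch(s) is not None) != expected:
                bad.append(s)
    return bad

GOOD = re.compile(r"b*(ab*ab*)*")
NEAR_MISS = re.compile(r"b*|(b*ab*a)*")
```

`counterexamples(GOOD, 8)` is empty, while `counterexamples(NEAR_MISS, 3)` reports `aab`. An exhaustive check up to a bounded length is evidence, not a proof.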

Grammars and Regular Expressions
- Given a regular expression R, there exists a grammar with syntactic category <S> such that L(R) = L(<S>).
- There are grammars for which there does NOT exist a regular expression R with L(<S>) = L(R)
  - <S> → a<S>b | ε
  - L(<S>) = {a^n b^n, n = 0, 1, 2, …}

Example 5
- Create a grammar that generates the language that consists of strings of a's and b's with an even number of a's.
- DFA: the two-state automaton of Example 1 (S0 start, S1)
- Grammar, with one syntactic category per state:
    <S0> → b<S0> | a<S1>
    <S0> → ε
    <S1> → b<S1> | a<S0>

Proof that a^n b^n is not Recognized by a Finite State Automaton
- To show that there is no finite state automaton that recognizes the language L = {a^n b^n, n = 0, 1, 2, …}, we assume that there is a finite state automaton M that recognizes L and show that this leads to a contradiction.
- Since M is a finite state automaton, it has a finite number of states. Let the number of states be m.
- Since M recognizes the language L, all strings of the form a^k b^k must end up in accepting states. Choose such a string with k = n, where n is greater than m.

Proof that a^n b^n is not Recognized by a Finite State Automaton (continued)
- Since n > m, there must be a state s that is visited twice while the string a^n is read [we can only visit m distinct states, and since n > m, after reading (m+1) a's we must have gone to a state that was already visited].
- Suppose that state s is reached after reading the strings a^j and a^k (j ≠ k). Since the same state is reached for both strings, the finite state machine cannot distinguish strings that begin with a^j from strings that begin with a^k.
- Therefore, the finite state automaton must either accept or reject both of the strings a^j b^j and a^k b^j. However, a^j b^j should be accepted, while a^k b^j should not be accepted. This is the contradiction.

List Grammar
- <list> → ( <sequence> ) | ( )
- <sequence> → <listelement> , <sequence> | <listelement>
- <listelement> → <list> | NUMBER

Recursive Descent Parser
list() {
    match('(');
    if token ≠ ')' then
        seq();
    endif;
    match(')');
}

seq() {
    elt();
    if token = ',' then
        match(',');
        seq();
    endif;
}

elt() {
    if token = '(' then
        list();
    else
        match(NUMBER);
    endif;
}
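The three procedures can be turned into a small runnable parser. The Python sketch below follows the list grammar with one method per nonterminal; the class name, the regular-expression tokenizer, the `<end>` marker, and the choice to return nested Python lists are assumptions added for illustration.

```python
import re

END = "<end>"            # end-of-input marker (cannot be produced by the tokenizer)
DIGITS = "0123456789"

class ListParser:
    def __init__(self, text):
        # Tokens are digit strings or single non-blank characters.
        self.tokens = re.findall(r"[0-9]+|\S", text) + [END]
        self.pos = 0

    @property
    def token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        """Consume the current token if it is the expected one."""
        if expected == "NUMBER":
            ok = self.token[0] in DIGITS
        else:
            ok = self.token == expected
        if not ok:
            raise SyntaxError(f"expected {expected}, found {self.token}")
        value = self.token
        self.pos += 1
        return value

    def parse_list(self):            # list -> ( sequence ) | ( )
        self.match("(")
        items = []
        if self.token != ")":
            items = self.parse_seq()
        self.match(")")
        return items

    def parse_seq(self):             # sequence -> listelement [ , sequence ]
        items = [self.parse_elt()]
        if self.token == ",":
            self.match(",")
            items += self.parse_seq()
        return items

    def parse_elt(self):             # listelement -> list | NUMBER
        if self.token == "(":
            return self.parse_list()
        return int(self.match("NUMBER"))

def parse(text):
    parser = ListParser(text)
    result = parser.parse_list()
    parser.match(END)                # reject trailing input
    return result
```

For instance, `parse("(1, (2, 3), ())")` returns `[1, [2, 3], []]`, and `parse("(1,)")` raises `SyntaxError` because a list element must follow the comma.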

Yacc (bison) Example
%{
#include <stdio.h>
%}
%token NUMBER   /* needed to communicate with scanner */
%%
list:
      '(' sequence ')'           { printf("L -> ( seq )\n"); }
    | '(' ')'                    { printf("L -> ()\n"); }
    ;
sequence:
      listelement ',' sequence   { printf("seq -> LE,seq\n"); }
    | listelement                { printf("seq -> LE\n"); }
    ;
listelement:
      NUMBER                     { printf("LE -> %d\n", $1); }
    | list                       { printf("LE -> L\n"); }
    ;
%%
/* since no code here, default main constructed that simply calls parser. */

LR Calculator Grammar (Figure 2.24)
program → stmt_list $$
stmt_list → stmt_list stmt | stmt
stmt → id := expr | read id | write expr
expr → term | expr add_op term
term → factor | term mult_op factor
factor → ( expr ) | id | number
add_op → + | -
mult_op → * | /

LL Calculator Grammar
Here is an LL(1) grammar (Fig 2.15):
1. program → stmt_list $$
2. stmt_list → stmt stmt_list
3.           | ε
4. stmt → id := expr
5.      | read id
6.      | write expr
7. expr → term term_tail
8. term_tail → add_op term term_tail
9.           | ε

Predictive Parser
- Predict which rules to match
  - Predict A → α
    - when the next token can start α, or
    - when α →* ε and the next token can follow A
  - PREDICT(A → α) = FIRST(α) ∪ (FOLLOW(A) if EPS(α))
    - PREDICT(program → stmt_list $$)
    - FIRST(stmt_list) = {id, read, write}
    - FOLLOW(stmt_list) = {$$}
  - The intersection of the PREDICT sets for productions with the same lhs must be empty

Recursive Descent Parser
procedure program
    case input_token of
        id, read, write, $$ : stmt_list; match($$)
        otherwise error
procedure stmt_list
    case input_token of
        id, read, write : stmt; stmt_list
        $$ : skip
        otherwise error
procedure stmt
    case input_token of
        id : match(id); match(:=); expr
        read : match(read); match(id)
        write : match(write); expr
        otherwise error
procedure expr
    case input_token of
        id, number, ( : term; term_tail
        otherwise error
procedure term_tail
    case input_token of
        +, - : add_op; term; term_tail
        ), id, read, write, $$ : skip
        otherwise error
procedure term
    case input_token of
        id, number, ( : factor; factor_tail
        otherwise error

Exercise 3
Trace through the recursive descent parser and build the parse tree for the following program:
    read A
    read B
    sum := A + B
    write sum / 2

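Table-Driven LL Parser
Parse Stack             | Input Stream      | Comment
program                 | read A read B …   | initial stack contents
stmt_list $$            | read A read B …   | predict program → stmt_list $$
stmt stmt_list $$       | read A read B …   | predict stmt_list → stmt stmt_list
read id stmt_list $$    | read A read B …   | predict stmt → read id
id stmt_list $$         | A read B …        | match(read)
stmt_list $$            | read B …          | match(id)
…                       | …                 | …
stmt_list $$            | $$                | predict term_tail → ε
$$                      | $$                | predict stmt_list → ε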

Computing First, Follow, Predict
- Algorithm First/Follow/Predict:
  - FIRST(α) = {c : α →* c β}
  - FOLLOW(A) = {c : S →+ α A c β}
  - EPS(α) = if α →* ε then true else false
  - PREDICT(A → α) = FIRST(α) ∪ (if EPS(α) then FOLLOW(A) else ∅)
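These definitions can be computed by iterating to a fixed point. The Python sketch below does so for the LL(1) calculator grammar. Productions 1 to 9 are the ones listed in these notes; productions 10 to 19 (term, factor_tail, factor, add_op, mult_op) are not shown here and are filled in as an assumption from the usual form of this grammar in Scott's text. The function names and the list-of-pairs grammar encoding are also illustrative choices.

```python
GRAMMAR = [
    ("program", ["stmt_list", "$$"]),
    ("stmt_list", ["stmt", "stmt_list"]),
    ("stmt_list", []),                                # epsilon production
    ("stmt", ["id", ":=", "expr"]),
    ("stmt", ["read", "id"]),
    ("stmt", ["write", "expr"]),
    ("expr", ["term", "term_tail"]),
    ("term_tail", ["add_op", "term", "term_tail"]),
    ("term_tail", []),
    # Assumed remainder of the grammar (not listed in these notes):
    ("term", ["factor", "factor_tail"]),
    ("factor_tail", ["mult_op", "factor", "factor_tail"]),
    ("factor_tail", []),
    ("factor", ["(", "expr", ")"]),
    ("factor", ["id"]),
    ("factor", ["number"]),
    ("add_op", ["+"]), ("add_op", ["-"]),
    ("mult_op", ["*"]), ("mult_op", ["/"]),
]

def analyze(grammar):
    """Return (first, follow, eps, predict) for a list of (lhs, rhs) pairs."""
    nonterms = {lhs for lhs, _ in grammar}
    eps = {n: False for n in nonterms}
    first = {n: set() for n in nonterms}
    follow = {n: set() for n in nonterms}

    def first_of(symbols):
        """FIRST and EPS of a string of grammar symbols."""
        result = set()
        for sym in symbols:
            if sym not in nonterms:          # terminal: it starts the string
                result.add(sym)
                return result, False
            result |= first[sym]
            if not eps[sym]:
                return result, False
        return result, True                  # every symbol can derive epsilon

    changed = True
    while changed:                           # iterate until nothing grows
        changed = False
        for lhs, rhs in grammar:
            f, e = first_of(rhs)
            if not f <= first[lhs] or (e and not eps[lhs]):
                first[lhs] |= f
                eps[lhs] = eps[lhs] or e
                changed = True
            for i, sym in enumerate(rhs):
                if sym in nonterms:
                    f, e = first_of(rhs[i + 1:])
                    new = f | (follow[lhs] if e else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True

    predict = {}
    for lhs, rhs in grammar:
        f, e = first_of(rhs)
        predict[(lhs, tuple(rhs))] = f | (follow[lhs] if e else set())
    return first, follow, eps, predict

def is_ll1(predict):
    """True if PREDICT sets of productions with the same lhs are disjoint."""
    seen = {}
    for (lhs, _), toks in predict.items():
        if seen.setdefault(lhs, set()) & toks:
            return False
        seen[lhs] |= toks
    return True
```

Under these assumptions the sketch gives FIRST(stmt_list) = {id, read, write}, FOLLOW(stmt_list) = {$$}, and PREDICT(term_tail → ε) = {), id, read, write, $$}, which agree with the case labels of the recursive descent parser.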

LR Parsing
- Bottom up (rightmost derivation)
  - Maintain a forest of partially completed subtrees of the parse tree
  - Join trees together when recognizing the symbols in the rhs of a production
  - Keep the roots of the partially completed trees on a stack
    - Shift when there is a new token
    - Reduce when the top symbols match a rhs
  - Table driven

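LR Parsing Example
Stack                                   | Remaining Input
(empty)                                 | A, B, C;
id(A)                                   | , B, C;
id(A) ,                                 | B, C;
id(A) , id(B)                           | , C;
id(A) , id(B) ,                         | C;
id(A) , id(B) , id(C)                   | ;
id(A) , id(B) , id(C) ;                 |
id(A) , id(B) , id(C) id_list_tail      |
id(A) , id(B) id_list_tail              |
id(A) id_list_tail                      |
id_list                                 |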
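The shift and reduce steps of this example can be imitated with a very small Python sketch. It assumes the grammar id_list → id id_list_tail, id_list_tail → , id id_list_tail | ; (inferred from the symbols in the trace, not stated in these notes), and it decides when to reduce by simply trying the longest right-hand side first. A real LR parser consults a table built from the grammar instead, so this is only an illustration of the stack discipline.

```python
# Productions of the assumed id_list grammar, longest right-hand side first.
PRODUCTIONS = [
    ("id_list_tail", [",", "id", "id_list_tail"]),
    ("id_list", ["id", "id_list_tail"]),
    ("id_list_tail", [";"]),
]

def shift_reduce(tokens):
    """Return (accepted, trace) where trace lists (stack, remaining input)."""
    stack = []
    remaining = list(tokens)
    trace = []
    while True:
        trace.append((list(stack), list(remaining)))
        for lhs, rhs in PRODUCTIONS:
            if stack[-len(rhs):] == rhs:      # top of stack matches a rhs
                del stack[-len(rhs):]         # reduce: pop the rhs ...
                stack.append(lhs)             # ... and push the lhs
                break
        else:
            if not remaining:                 # nothing to reduce or shift
                break
            stack.append(remaining.pop(0))    # shift the next token
    return stack == ["id_list"], trace
```

For the token kinds of `A, B, C;`, that is `["id", ",", "id", ",", "id", ";"]`, the sketch performs six shifts followed by four reductions and ends with `id_list` alone on the stack.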

LR Calculator Grammar (Figure 2.24, Page 73):
1. program → stmt_list $$
2. stmt_list → stmt_list stmt
3.           | stmt
4. stmt → id := expr
5.      | read id
6.      | write expr
7. expr → term
8.      | expr add_op term

LR Calculator Grammar
LR grammar (continued):
9. term → factor
10.     | term mult_op factor
11. factor → ( expr )
12.        | id
13.        | number
14. add_op → +
15.        | -
16. mult_op → *
17.         | /

LR Parser State
- Keep track of the set of productions we might be in, along with where in those productions we might be
- Initial state for the calculator grammar (• marks the current position):
    program → • stmt_list $$         // basis
    stmt_list → • stmt_list stmt     // yield
    stmt_list → • stmt
    stmt → • id := expr
    stmt → • read id
    stmt → • write expr