Programming Languages (CS 550)
Lecture 4 Summary: Scanner and Parser Generators
Jeremy R. Johnson

Theme
• We have now seen how to describe syntax using regular expressions and grammars, and how to create scanners and parsers both by hand and with automated tools. In this lecture we provide more details on parsing and scanning and indicate how these tools work.
• Reading: Chapter 2 of the text by Scott.

Parser and Scanner Generators
• Tools exist (e.g. yacc/bison [1] for C/C++, PLY for Python, CUP for Java) to automatically construct a parser from a restricted class of context-free grammars (LALR(1) grammars for yacc/bison and the derivatives CUP and PLY)
• These tools use table-driven bottom-up parsing techniques (commonly shift/reduce parsing)
• Similar tools (e.g. lex/flex for C/C++, JFlex for Java), based on the theory of finite automata, automatically construct scanners from regular expressions
[1] bison is the GNU version of yacc

Outline
• Scanners and DFA
• Regular Expressions and NDFA
• Equivalence of DFA and NDFA
• Regular Languages and the limitations of regular expressions
• Recursive Descent Parsing
• LL(1) Grammars and Top-down (Predictive) Parsing
• LR(1) Grammars and Bottom-up Parsing

Regular Expressions
• Alphabet = Σ
• A language over Σ is a subset of the strings in Σ*
• Regular expressions describe certain types of languages
  - ∅ is a regular expression
  - ε = {ε} is a regular expression
  - For each a in Σ, a denoting {a} is a regular expression
  - If r and s are regular expressions denoting languages R and S respectively, then (r | s), (rs), and (r*) are regular expressions
• E.g. 00, (0|1)*00(0|1)*, 00*11*22*, (1|10)*

List Tokens
• LPAREN = '('
• RPAREN = ')'
• COMMA = ','
• NUMBER = DIGIT DIGIT*
• DIGIT = 0|1|2|3|4|5|6|7|8|9
• Unix shorthand: DIGIT = [0-9], NUMBER = DIGIT+
• Whitespace: (' ' | '\n' | '\t')*

List Scanner
TOKEN GetToken() {
  int val = 0;
  c = getchar();
  while c ∈ {' ', '\n', '\t'} do c = getchar() end do;   /* skip whitespace */
  if c == eof then return None end if;
  if c ∈ {'(', ',', ')'} then return c end if;
  if c ∈ {'0', …, '9'} then
    while c ∈ {'0', …, '9'} do
      val = val*10 + (c - '0');
      c = getchar();
    end do;
    ungetc(c);   /* push back the character that ended the number */
    return (NUMBER, val);
  else
    return None;
  end if;
}
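The hand-written scanner above can be sketched as runnable Python. This is only an illustration under assumptions: the function name `tokens`, the `(kind, value)` token shape, and raising `ValueError` on an unexpected character are choices of this sketch, not part of the lecture. Because the input is indexed as a string, no push-back of the lookahead character is needed.

```python
DIGITS = "0123456789"

def tokens(text):
    """Yield (kind, value) pairs for the list language: '(', ')', ',' and NUMBER."""
    i = 0
    while i < len(text):
        c = text[i]
        if c in " \n\t":              # skip whitespace
            i += 1
        elif c in "(),":              # single-character tokens
            yield (c, c)
            i += 1
        elif c in DIGITS:             # NUMBER = DIGIT+
            val = 0
            while i < len(text) and text[i] in DIGITS:
                val = val * 10 + (ord(text[i]) - ord("0"))
                i += 1
            yield ("NUMBER", val)
        else:
            raise ValueError(f"unexpected character {c!r} at position {i}")
```

For example, `list(tokens("(1, 23)"))` produces the five tokens `(`, NUMBER 1, `,`, NUMBER 23, `)`.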

Flex List Tokens
%{
#include <stdlib.h>   /* for atoi */
#include "list.tab.h"
extern int yylval;
%}
%%
[ \t\n]   ;
"("       return yytext[0];
")"       return yytext[0];
","       return yytext[0];
[0-9]+    { yylval = atoi(yytext); return NUMBER; }
%%

Deterministic Finite Automata
• Input comes from an alphabet A
• Finite set of states S, start state s0, accepting states F
• Transition function T from state to state depending on the next input
  - M = (A, S, s0, F, T)
• The language accepted by a finite automaton is the set of input strings that end in an accepting state

Example 1
• Create a finite state automaton that accepts strings of a's and b's with an even number of a's.
• [Diagram: S0 is the start state and the accepting state. Transitions: S0 --b--> S0, S0 --a--> S1, S1 --b--> S1, S1 --a--> S0.]
• Trace:
    input:     a b b b a b a a b b b
    states:  0 1 1 1 1 0 0 1 0 0 0 0

DFA Implementation
• Program to implement the DFA of Example 1
bool EA() {
  S0: x = getchar();
      if (x == 'b') goto S0;
      if (x == 'a') goto S1;
      if (x == ENDM) return true;    /* even number of a's */
      return false;                  /* reject any other character */
  S1: x = getchar();
      if (x == 'b') goto S1;
      if (x == 'a') goto S0;
      if (x == ENDM) return false;   /* odd number of a's */
      return false;                  /* reject any other character */
}
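The same automaton can also be driven from a transition table instead of gotos. The following is a minimal sketch; the names `T`, `START`, `ACCEPTING`, and `accepts` are assumptions of this example, and end of input is simply the end of the Python string rather than an ENDM marker.

```python
# Transition table for the even-number-of-a's DFA: state -> {symbol: next state}
T = {
    "S0": {"a": "S1", "b": "S0"},
    "S1": {"a": "S0", "b": "S1"},
}
START = "S0"
ACCEPTING = {"S0"}

def accepts(s):
    """Run the DFA over s and report whether it ends in an accepting state."""
    state = START
    for ch in s:
        if ch not in T[state]:     # symbol outside the alphabet {a, b}: reject
            return False
        state = T[state][ch]
    return state in ACCEPTING
```

Changing the language only requires changing the table, which is the idea behind table-driven scanners.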

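List DFA
[Diagram: DFA for the list tokens with states S0-S4. The start state S0 loops on ' ', \t and \n; a digit d leads from S0 to S1, which loops on d; the punctuation characters '(', ')' and ',' lead from S0 to the remaining states.]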

Calculator Tokens
• ASSIGN = ':='
• PLUS = '+', MINUS = '-', TIMES = '*', DIV = '/'
• LPAREN = '(', RPAREN = ')'
• NUMBER = DIGIT DIGIT* | DIGIT* (. DIGIT | DIGIT .) DIGIT*
• ID = LETTER (LETTER | DIGIT)*
• DIGIT = 0|1|…|9, LETTER = a|…|z|A|…|Z
• COMMENT = /* (non-* | * non-/)* */ | // (non-newline)* newline
• WHITESPACE = (' ' | '\n' | '\t')*

Calculator DFA
[Figure not reproduced. Copyright © 2009 Elsevier]

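Table Driven Scanner
(Entries are next states; "-" means no transition. The last column is the token recognized in that state.)

State  ' ',\t  \n   /    *    (    )    +    -    :    =    .    digit  letter  other  token
  1     17     17   2    10   6    7    8    9    11   -    13   14     16      -
  2     -      -    3    4    -    -    -    -    -    -    -    -      -       -      div
  3     3      18   3    3    3    3    3    3    3    3    3    3      3       3
  4     4      4    4    5    4    4    4    4    4    4    4    4      4       4
  5     4      4    18   5    4    4    4    4    4    4    4    4      4       4
  6     -      -    -    -    -    -    -    -    -    -    -    -      -       -      lparen
  7     -      -    -    -    -    -    -    -    -    -    -    -      -       -      rparen
  8     -      -    -    -    -    -    -    -    -    -    -    -      -       -      plus
  9     -      -    -    -    -    -    -    -    -    -    -    -      -       -      minus
 10     -      -    -    -    -    -    -    -    -    -    -    -      -       -      times
 11     -      -    -    -    -    -    -    -    -    12   -    -      -       -
 12     -      -    -    -    -    -    -    -    -    -    -    -      -       -      assign
 13     -      -    -    -    -    -    -    -    -    -    -    15     -       -
 14     -      -    -    -    -    -    -    -    -    -    15   14     -       -      number
 15     -      -    -    -    -    -    -    -    -    -    -    15     -       -      number
 16     -      -    -    -    -    -    -    -    -    -    -    16     16      -      identifier
 17     17     17   -    -    -    -    -    -    -    -    -    -      -       -      white-space
 18     -      -    -    -    -    -    -    -    -    -    -    -      -       -      comment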

Non-Deterministic Finite Automata
• Same as a DFA M = (A, S, s0, F, T) except
  - Can have multiple transitions from the same state on the same input
  - Can have epsilon transitions
  - Accept the input string if there is a path to an accepting state
• The languages accepted by NDFA are the same as those accepted by DFA

Example 2a
• DFA accepting (a|b)*abb
• [Diagram: start state S0, accepting state S3. Transitions: S0 --b--> S0, S0 --a--> S1, S1 --a--> S1, S1 --b--> S2, S2 --a--> S1, S2 --b--> S3, S3 --a--> S1, S3 --b--> S0.]

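Example 2b
• NDFA accepting (a|b)*abb
• [Diagram 1: start state S0 loops on a, b; S0 --a--> S1 --b--> S2 --b--> S3, with S3 accepting.]
• [Diagram 2, with an epsilon transition: start state S0 loops on a, b; S0 --ε--> S1 --a--> S2 --b--> S3 --b--> S4, with S4 accepting.]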

Simulating an NDFA
• Compute the set of states the NDFA could be in after reading each symbol of the input.
• S_i = set of possible states after reading i input symbols
  1. Initialize S_0 = EpsilonClosure({0})
  2. for i = 1, …, len(str)
     1. T_i = ∪_{s ∈ S_{i-1}} T[s, str[i]]
     2. S_i = EpsilonClosure(T_i)
• Trace for the second NDFA of Example 2b, starting from S_0 = {0, 1}:
    input read:   b        a           b           b
    state set:    {0, 1}   {0, 1, 2}   {0, 1, 3}   {0, 1, 4}
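The simulation loop can be sketched in Python as follows. The transition dictionary encodes the epsilon-transition NDFA for (a|b)*abb with states 0 to 4 (0 loops on a and b, 0 --ε--> 1 --a--> 2 --b--> 3 --b--> 4); the names `DELTA`, `EPS`, `epsilon_closure`, and `simulate` are assumptions of this sketch.

```python
EPS = ""  # label used for epsilon transitions (a convention of this sketch)

# (state, symbol) -> set of next states
DELTA = {
    (0, "a"): {0}, (0, "b"): {0}, (0, EPS): {1},
    (1, "a"): {2}, (2, "b"): {3}, (3, "b"): {4},
}
ACCEPT = {4}

def epsilon_closure(states):
    """All states reachable from `states` using only epsilon transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in DELTA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def simulate(text):
    """Return the list of state sets S_0, S_1, ..., S_n for the input text."""
    current = epsilon_closure({0})
    trace = [current]
    for ch in text:
        moved = set()
        for s in current:                  # T_i: union of moves on ch
            moved |= DELTA.get((s, ch), set())
        current = epsilon_closure(moved)   # S_i
        trace.append(current)
    return trace

def accepts(text):
    return bool(simulate(text)[-1] & ACCEPT)
```

Calling `simulate("babb")` yields the sets {0, 1}, {0, 1}, {0, 1, 2}, {0, 1, 3}, {0, 1, 4}: the initial set followed by the set after each symbol.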

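NDFA from Regular Expressions
• Base case (c): a single transition labeled c from the start state to the accepting state
• Union (R|S): a new start state with ε transitions into the machines for R and S, and ε transitions from each of them into a new accepting state
• Concatenation (RS): the machine for R followed by the machine for S
• Closure (R*): ε transitions that allow the machine for R to be skipped or repeated
[Diagrams not reproduced.]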
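One way to realize these four constructions is sketched below in Python. It follows the usual Thompson-style construction; the `Frag` class, the function names, and the use of an ε edge to join the two machines in a concatenation are assumptions of this sketch (a concatenation can equally be built by merging the two states, which gives fewer states).

```python
import itertools

_ids = itertools.count()   # fresh state numbers

class Frag:
    """NFA fragment: one start state, one accepting state, and edges stored
    as (source, label, target) triples; label None means epsilon."""
    def __init__(self, start, accept, edges):
        self.start, self.accept, self.edges = start, accept, edges

def base(c):
    s, f = next(_ids), next(_ids)
    return Frag(s, f, [(s, c, f)])

def union(r, s):
    start, accept = next(_ids), next(_ids)
    edges = r.edges + s.edges + [
        (start, None, r.start), (start, None, s.start),
        (r.accept, None, accept), (s.accept, None, accept)]
    return Frag(start, accept, edges)

def concat(r, s):
    return Frag(r.start, s.accept,
                r.edges + s.edges + [(r.accept, None, s.start)])

def star(r):
    start, accept = next(_ids), next(_ids)
    edges = r.edges + [
        (start, None, r.start), (r.accept, None, accept),
        (start, None, accept),          # zero repetitions
        (r.accept, None, r.start)]      # repeat
    return Frag(start, accept, edges)

def closure(frag, states):
    """Epsilon closure of a set of states within one fragment."""
    result, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in frag.edges:
            if src == q and label is None and dst not in result:
                result.add(dst)
                stack.append(dst)
    return result

def matches(frag, text):
    """Simulate the fragment on text by tracking the set of possible states."""
    current = closure(frag, {frag.start})
    for ch in text:
        moved = {dst for src, label, dst in frag.edges
                 if src in current and label == ch}
        current = closure(frag, moved)
    return frag.accept in current
```

For instance, `union(base("a"), concat(base("b"), base("c")))` builds a machine for (a|bc), and `matches` then accepts exactly "a" and "bc".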

Example 3
• Construct an NDFA that accepts the language generated by the regular expression (a|bc)
• [Diagram: start state S0, accepting state S6. Upper branch: S0 --ε--> S1 --a--> S2 --ε--> S6. Lower branch: S0 --ε--> S3 --b--> S4 --c--> S5 --ε--> S6.]

Regular Expression Compiler
%{
#include "machine.h"
char input[100];
%}
%union {
  MachinePtr ndfa;
  char symbol;
}
%token <symbol> LETTER
%type <ndfa> regexp
%type <ndfa> cat
%type <ndfa> kleene
%%
statement: regexp { do {
                      printf("Enter string\n");
                      if (scanf("%s", input) != EOF) Simulate($1, input);
                      else exit(1);
                    } while (1); }
         ;
regexp: regexp '|' cat { $$ = MachineOr($1, $3); }
      | cat            { $$ = $1; }
      ;
cat: cat kleene { $$ = MachineConcat($1, $2); }
   | kleene     { $$ = $1; }
   ;
kleene: '(' regexp ')' { $$ = $2; }
      | kleene '*'     { $$ = MachineStar($1); }
      | LETTER         { $$ = BaseMachine($1); }
      ;
%%

NDFA
• Find an equivalent DFA for Example 3
  - States in the DFA are sets of states from the NDFA [keep track of all possible transitions]
• [Diagram: start state {0, 1, 3}; {0, 1, 3} --a--> {2, 6}; {0, 1, 3} --b--> {4}; {4} --c--> {5, 6}. The states containing 6 are accepting.]
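The conversion can be sketched as a worklist algorithm (the subset construction). The NDFA below is the one for (a|bc) with states 0 to 6, where 0 has ε transitions to 1 and 3, and 2 and 5 have ε transitions to 6; the names `NFA`, `eps_closure`, and `subset_construction` are assumptions of this sketch.

```python
EPS = None   # label for epsilon transitions in this sketch

# (state, symbol) -> set of next states, for the NDFA of (a|bc)
NFA = {
    (0, EPS): {1, 3},
    (1, "a"): {2}, (2, EPS): {6},
    (3, "b"): {4}, (4, "c"): {5}, (5, EPS): {6},
}
ALPHABET = "abc"

def eps_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for t in NFA.get((q, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def subset_construction(start):
    """Return (start_set, dfa) where dfa maps each frozenset of NDFA states
    to a dict {symbol: frozenset of NDFA states}."""
    start_set = frozenset(eps_closure({start}))
    dfa, worklist = {}, [start_set]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue                      # already expanded
        dfa[current] = {}
        for ch in ALPHABET:
            moved = set()
            for q in current:
                moved |= NFA.get((q, ch), set())
            if moved:                     # omit transitions to the empty set
                target = frozenset(eps_closure(moved))
                dfa[current][ch] = target
                worklist.append(target)
    return start_set, dfa
```

Running `subset_construction(0)` produces the four DFA states {0, 1, 3}, {2, 6}, {4} and {5, 6}.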

Exercise 1
1. Construct an NDFA that recognizes d*(.d | d.)d* (see pages 55-57 of text)
2. Convert the NDFA from (1) to a DFA (see pages 56-58 of text)

Solution 1.1
[Diagram: NDFA with states 1-14. Labeled transitions: 2 --d--> 3, 5 --.--> 6, 6 --d--> 7, 8 --d--> 9, 9 --.--> 10, 12 --d--> 13. All other transitions are ε transitions.]

Solution 1.2
• DFA states (sets of NDFA states):
  A[1, 2, 4, 5, 8]
  B[2, 3, 4, 5, 8, 9]
  C[6]
  D[6, 10, 11, 12, 14]
  E[7, 11, 12, 14]
  F[7, 11, 12, 13, 14]
  G[11, 12, 13, 14]
• Transitions: A --d--> B, A --.--> C, B --d--> B, B --.--> D, C --d--> E, D --d--> F, E --d--> G, F --d--> G, G --d--> G

Minimizing States in DFA
• There exists a unique (up to renaming of states) minimal-state DFA for any language described by a regular expression
• Combine equivalent states
  - p ≡ q if, for each input string x, T(p, x) is an accepting state iff T(q, x) is an accepting state
  - Initialize two sets of states: accepting and nonaccepting
  - Repeatedly split any set whose states, on the same input symbol, transition into different sets, until no set needs to be split
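The partitioning step can be sketched as follows, using the seven-state DFA for d*(.d | d.)d* with states A to G, where D, E, F and G are accepting. The function name `minimize` and the treatment of a missing transition as a move to an implicit dead state are assumptions of this sketch; it also splits a set on all symbols at once rather than one symbol at a time.

```python
DFA = {
    "A": {"d": "B", ".": "C"},
    "B": {"d": "B", ".": "D"},
    "C": {"d": "E"},
    "D": {"d": "F"},
    "E": {"d": "G"},
    "F": {"d": "G"},
    "G": {"d": "G"},
}
ACCEPTING = {"D", "E", "F", "G"}
ALPHABET = ["d", "."]

def minimize(dfa, accepting, alphabet):
    """Return the coarsest partition of states into equivalence classes."""
    partition = [b for b in (set(dfa) - accepting, set(accepting)) if b]
    while True:
        # Map every state to the index of the block that currently holds it.
        index = {s: i for i, block in enumerate(partition) for s in block}
        refined = []
        for block in partition:
            groups = {}
            for s in block:
                # Which block each symbol leads to; None = no transition.
                signature = tuple(index.get(dfa[s].get(ch)) for ch in alphabet)
                groups.setdefault(signature, set()).add(s)
            refined.extend(groups.values())
        if len(refined) == len(partition):   # nothing was split
            return refined
        partition = refined
```

On this DFA the result is the four classes {A}, {B}, {C} and {D, E, F, G}.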

Exercise 2
1. Find an equivalent DFA to the one in Exercise 1 that minimizes the number of states (see page 59 of text)
• [Diagram: initial partition into the nonaccepting set ABC, with transitions on d and ., and the accepting set DEFG, with a transition on d.]
• Ambiguity: T(A, d) = T(B, d) ≠ T(C, d), so split ABC into AB and C

Solution 2
[Diagram: minimal DFA with states A, B, C and DEFG. Transitions: A --d--> B, A --.--> C, B --d--> B, B --.--> DEFG, C --d--> DEFG, DEFG --d--> DEFG.]

Equivalence of Regular Expressions and DFA
• The languages accepted by finite automata are the same as those generated by regular expressions
  - Given any regular expression R, there exists a finite state automaton M such that L(M) = L(R)
    - Proof is given by the previous construction
  - Given any finite state automaton M, there exists a regular expression R such that L(R) = L(M)
    - The basic idea is to combine the transitions along all paths that lead to an accepting state. The combinations of characters along those paths are described using regular expressions.

Example 4
• Create a regular expression for the language that consists of strings of a's and b's with an even number of a's.
• [Diagram: the DFA of Example 1. S0 --b--> S0, S0 --a--> S1, S1 --b--> S1, S1 --a--> S0, with S0 the start and accepting state.]
• Regular expression: (b*ab*a)*b*
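A candidate regular expression for this language can be checked by brute force against the definition. The sketch below uses Python's `re` module and tests (b*ab*a)*b* on every string over {a, b} up to a chosen length; the helper names are assumptions of this example. The trailing b* matters: without it, strings such as aab, which have an even number of a's but end in b, would be rejected.

```python
import re
from itertools import product

EVEN_AS = re.compile(r"(b*ab*a)*b*")

def even_number_of_as(s):
    """The language definition: the number of a's in s is even."""
    return s.count("a") % 2 == 0

def disagreements(pattern, max_len):
    """Strings over {a, b} of length <= max_len on which the pattern
    and the definition disagree."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if (pattern.fullmatch(s) is not None) != even_number_of_as(s):
                bad.append(s)
    return bad
```

An empty result from `disagreements(EVEN_AS, 8)` is evidence, not a proof, that the expression is right; a non-empty result is a concrete counterexample.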

Grammars and Regular Expressions
• Given a regular expression R, there exists a grammar with syntactic category <S> such that L(R) = L(<S>).
• There are grammars for which there does NOT exist a regular expression R with L(<S>) = L(R)
  - <S> → a<S>b | ε
  - L(<S>) = {a^n b^n : n = 0, 1, 2, …}

Example 5
• Create a grammar that generates the language that consists of strings of a's and b's with an even number of a's.
• [Diagram: the DFA of Example 1. S0 --b--> S0, S0 --a--> S1, S1 --b--> S1, S1 --a--> S0, with S0 the start and accepting state.]
• Grammar (one syntactic category per state):
  <S0> → b<S0> | a<S1> | ε
  <S1> → b<S1> | a<S0>

Proof that a^n b^n is not Recognized by a Finite State Automaton
• To show that there is no finite state automaton that recognizes the language L = {a^n b^n : n = 0, 1, 2, …}, we assume that there is a finite state automaton M that recognizes L and show that this leads to a contradiction.
• Since M is a finite state automaton it has a finite number of states. Let the number of states be m.
• Since M recognizes the language L, all strings of the form a^k b^k must end up in accepting states. Choose such a string with k = n, where n is greater than m.

Proof that a^n b^n is not Recognized by a Finite State Automaton (continued)
• Since n > m there must be a state s that is visited twice while the string a^n is read [we can only visit m distinct states, and since n > m, after reading (m+1) a's we must have gone to a state that was already visited].
• Suppose that state s is reached after reading the strings a^j and a^k (j ≠ k). Since the same state is reached for both strings, the finite state machine cannot distinguish strings that begin with a^j from strings that begin with a^k.
• Therefore, the finite state automaton must either accept or reject both of the strings a^j b^j and a^k b^j. However, a^j b^j should be accepted, while a^k b^j should not be accepted. This is the contradiction.

List Grammar
• <list> → ( <sequence> ) | ( )
• <sequence> → <listelement> , <sequence> | <listelement>
• <listelement> → <list> | NUMBER

Recursive Descent Parser
list() {
  match('(');
  if token ≠ ')' then
    seq();
  endif;
  match(')');
}

Recursive Descent Parser
seq() {
  elt();
  if token = ',' then
    match(',');
    seq();
  endif;
}

Recursive Descent Parser
elt() {
  if token = '(' then
    list();
  else
    match(NUMBER);
  endif;
}
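The three procedures above can be sketched as a small runnable Python parser that also builds a nested Python list as its result. The tokenizer, the `Parser` class, the end marker `"$"`, and the use of `SyntaxError` for errors are assumptions of this sketch rather than part of the lecture.

```python
DIGITS = "0123456789"

def tokenize(text):
    """Split text into '(', ')', ',' and ('NUMBER', value) tokens, then '$'."""
    result, i = [], 0
    while i < len(text):
        c = text[i]
        if c in " \t\n":
            i += 1
        elif c in "(),":
            result.append(c)
            i += 1
        elif c in DIGITS:
            j = i
            while j < len(text) and text[j] in DIGITS:
                j += 1
            result.append(("NUMBER", int(text[i:j])))
            i = j
        else:
            raise SyntaxError(f"unexpected character {c!r}")
    result.append("$")                    # end-of-input marker
    return result

class Parser:
    def __init__(self, text):
        self.tokens, self.pos = tokenize(text), 0

    @property
    def token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.token != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.token!r}")
        self.pos += 1

    def list_(self):                      # list -> ( sequence ) | ( )
        self.match("(")
        items = self.seq() if self.token != ")" else []
        self.match(")")
        return items

    def seq(self):                        # sequence -> elt , sequence | elt
        items = [self.elt()]
        if self.token == ",":
            self.match(",")
            items += self.seq()
        return items

    def elt(self):                        # listelement -> list | NUMBER
        if self.token == "(":
            return self.list_()
        if isinstance(self.token, tuple):  # a ('NUMBER', value) token
            value = self.token[1]
            self.pos += 1
            return value
        raise SyntaxError(f"expected a list or NUMBER, found {self.token!r}")

def parse(text):
    parser = Parser(text)
    result = parser.list_()
    parser.match("$")                     # no trailing input allowed
    return result
```

For example, `parse("(1, (2, 3), ())")` returns the nested list `[1, [2, 3], []]`.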

Yacc (bison) Example
%{
#include <stdio.h>   /* for printf */
%}
%token NUMBER /* needed to communicate with scanner */
%%
list: '(' sequence ')' { printf("L -> ( seq )\n"); }
    | '(' ')'          { printf("L -> ()\n"); }
    ;
sequence: listelement ',' sequence { printf("seq -> LE, seq\n"); }
    | listelement                  { printf("seq -> LE\n"); }
    ;
listelement: NUMBER { printf("LE -> %d\n", $1); }
    | list          { printf("LE -> L\n"); }
    ;
%%
/* since no code here, default main constructed that simply calls parser. */

Top-down vs. Bottom-up Parsing
[Figure not reproduced. Copyright © 2009 Elsevier]

Bottom-up Parsing LR Grammar
[Figure not reproduced. Copyright © 2009 Elsevier]

LR Calculator Grammar (Figure 2.24)
program   → stmt_list $$
stmt_list → stmt_list stmt | stmt
stmt      → id := expr | read id | write expr
expr      → term | expr add_op term
term      → factor | term mult_op factor
factor    → ( expr ) | id | number
add_op    → + | -
mult_op   → * | /

LL Calculator Grammar
Here is an LL(1) grammar (Fig 2.15):
 1. program   → stmt_list $$
 2. stmt_list → stmt stmt_list
 3.           | ε
 4. stmt      → id := expr
 5.           | read id
 6.           | write expr
 7. expr      → term term_tail
 8. term_tail → add_op term term_tail
 9.           | ε

LL Calculator Grammar
LL(1) grammar (continued)
10. term      → factor fact_tail
11. fact_tail → mult_op factor fact_tail
12.           | ε
13. factor    → ( expr )
14.           | id
15.           | number
16. add_op    → +
17.           | -
18. mult_op   → *
19.           | /

Predictive Parser
• Predict which rules to match
  - Predict A → α
    - when the next token can start α, or
    - when α →* ε and the next token can follow A
  - PREDICT(A → α) = FIRST(α) ∪ (FOLLOW(A) if EPS(α))
    - PREDICT(program → stmt_list $$) = {id, read, write, $$}
    - FIRST(stmt_list) = {id, read, write}
    - FOLLOW(stmt_list) = {$$}
  - The intersection of the PREDICT sets for productions with the same lhs must be empty

Recursive Descent Parser
procedure program
  case input_token of
    id, read, write, $$ : stmt_list; match($$)
    otherwise error
procedure stmt_list
  case input_token of
    id, read, write : stmt; stmt_list
    $$ : skip
    otherwise error
procedure stmt
  case input_token of
    id : match(id); match(:=); expr
    read : match(read); match(id)
    write : match(write); expr
    otherwise error
procedure expr
  case input_token of
    id, number, ( : term; term_tail
    otherwise error
procedure term_tail
  case input_token of
    +, - : add_op; term; term_tail
    ), id, read, write, $$ : skip
    otherwise error
procedure term
  case input_token of
    id, number, ( : factor; fact_tail
    otherwise error

Exercise 3
Trace through the recursive descent parser and build the parse tree for the following program:
  read A
  read B
  sum := A + B
  write sum / 2

Solution 3
[Figure not reproduced. Copyright © 2009 Elsevier]

Table-Driven LL Parser
[Figure not reproduced. Copyright © 2009 Elsevier]

Table-Driven LL Parser
Parse Stack               Input Stream         Comment
program                   read A read B …      Initial stack contents
stmt_list $$              read A read B …      predict program → stmt_list $$
stmt stmt_list $$         read A read B …      predict stmt_list → stmt stmt_list
read id stmt_list $$      read A read B …      predict stmt → read id
id stmt_list $$           A read B …           match(read)
stmt_list $$              read B …             match(id)
…                         …                    …
term_tail stmt_list $$    $$                   …
stmt_list $$              $$                   predict term_tail → ε
$$                        $$                   predict stmt_list → ε

Computing First, Follow, Predict
• Algorithm First/Follow/Predict:
  - FIRST(α) = {c : α →* c β}
  - FOLLOW(A) = {c : S →+ α A c β}
  - EPS(α) = if α →* ε then true else false
  - PREDICT(A → α) = FIRST(α) ∪ (if EPS(α) then FOLLOW(A) else ∅)
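These definitions can be computed by iterating to a fixed point. The sketch below does so for the LL(1) calculator grammar as transcribed here (an empty tuple stands for ε); the data layout and the names `GRAMMAR`, `analyze`, and `first_of` are assumptions of this example, not the algorithm as presented in the text.

```python
EPSILON = ()   # an empty right-hand side stands for ε

GRAMMAR = [
    ("program",   ("stmt_list", "$$")),
    ("stmt_list", ("stmt", "stmt_list")),
    ("stmt_list", EPSILON),
    ("stmt",      ("id", ":=", "expr")),
    ("stmt",      ("read", "id")),
    ("stmt",      ("write", "expr")),
    ("expr",      ("term", "term_tail")),
    ("term_tail", ("add_op", "term", "term_tail")),
    ("term_tail", EPSILON),
    ("term",      ("factor", "fact_tail")),
    ("fact_tail", ("mult_op", "factor", "fact_tail")),
    ("fact_tail", EPSILON),
    ("factor",    ("(", "expr", ")")),
    ("factor",    ("id",)),
    ("factor",    ("number",)),
    ("add_op",    ("+",)),
    ("add_op",    ("-",)),
    ("mult_op",   ("*",)),
    ("mult_op",   ("/",)),
]

def analyze(grammar):
    """Return (first, follow, predict) for a list of (lhs, rhs) productions."""
    nonterminals = {lhs for lhs, _ in grammar}
    eps = {a: False for a in nonterminals}
    first = {a: set() for a in nonterminals}
    follow = {a: set() for a in nonterminals}

    def first_of(symbols):
        """(FIRST, EPS) of a symbol string, using the current estimates."""
        result = set()
        for x in symbols:
            if x not in nonterminals:      # terminal: it starts the string
                result.add(x)
                return result, False
            result |= first[x]
            if not eps[x]:
                return result, False
        return result, True                # every symbol can derive ε

    changed = True
    while changed:                         # iterate until nothing grows
        changed = False
        for lhs, rhs in grammar:
            f, e = first_of(rhs)
            if not f <= first[lhs] or (e and not eps[lhs]):
                first[lhs] |= f
                eps[lhs] = eps[lhs] or e
                changed = True
            for i, x in enumerate(rhs):
                if x in nonterminals:
                    f, e = first_of(rhs[i + 1:])
                    new = f | (follow[lhs] if e else set())
                    if not new <= follow[x]:
                        follow[x] |= new
                        changed = True

    predict = {}
    for lhs, rhs in grammar:
        f, e = first_of(rhs)
        predict[(lhs, rhs)] = f | (follow[lhs] if e else set())
    return first, follow, predict
```

With `first, follow, predict = analyze(GRAMMAR)`, the entry `predict[("program", ("stmt_list", "$$"))]` contains `$$` in addition to id, read and write, because stmt_list can derive ε.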

Predict Set for LL Parser
[Figure not reproduced. Copyright © 2009 Elsevier]

LR Parsing
• Bottom up (traces a rightmost derivation in reverse)
  - Maintain a forest of partially completed subtrees of the parse tree
  - Join trees together when recognizing the symbols in the rhs of a production
  - Keep the roots of the partially completed trees on a stack
    - Shift when a new token is read
    - Reduce when the top symbols match a rhs
  - Table driven

Top-down vs. Bottom-up Parsing
[Figure not reproduced. Copyright © 2009 Elsevier]

LR Parsing Example
Stack                                   Remaining Input
                                        A, B, C;
id(A)                                   , B, C;
id(A) ,                                 B, C;
id(A) , id(B)                           , C;
id(A) , id(B) ,                         C;
id(A) , id(B) , id(C)                   ;
id(A) , id(B) , id(C) ;
id(A) , id(B) , id(C) id_list_tail
id(A) , id(B) id_list_tail
id(A) id_list_tail
id_list

LR Calculator Grammar
(Figure 2.24, Page 73):
 1. program   → stmt_list $$
 2. stmt_list → stmt_list stmt
 3.           | stmt
 4. stmt      → id := expr
 5.           | read id
 6.           | write expr
 7. expr      → term
 8.           | expr add_op term

LR Calculator Grammar
LR grammar (continued):
 9. term      → factor
10.           | term mult_op factor
11. factor    → ( expr )
12.           | id
13.           | number
14. add_op    → +
15.           | -
16. mult_op   → *
17.           | /

LR Parser State
• Keep track of the set of productions we might be in, along with where in those productions we might be (the position is marked with •)
• Initial state for the calculator grammar
  program   → • stmt_list $$       // basis
  stmt_list → • stmt_list stmt     // yield
  stmt_list → • stmt
  stmt      → • id := expr
  stmt      → • read id
  stmt      → • write expr

LR Parser States
[Figure not reproduced. Copyright © 2009 Elsevier]

Characteristic Finite State Machine
[Figure not reproduced. Copyright © 2009 Elsevier]

LR Parser Table
[Figure not reproduced. Copyright © 2009 Elsevier]

LR Parsing Example
[Figure not reproduced. Copyright © 2009 Elsevier]