1 Lexical Analysis and Lexical Analyzer Generators Chapter 3 COP 5621 Compiler Construction Copyright Robert van Engelen, Florida State University, 2007-2011

2 Lexical Analyzer
• A lexical analyzer reads characters from the input and groups them into “token objects”.
• Typical tasks of the lexical analyzer:
 – Remove white space and comments
 – Encode constants as tokens
 – Recognize keywords
 – Recognize identifiers and store identifier names in a global symbol table

3 Interaction of the Lexical Analyzer with the Parser
• The lexical analyzer reads the source program and, on each “get next token” request from the parser, returns the next token together with its attribute tokenval; both phases report errors and share the symbol table.

4 Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute value.
 – For example: id, num, and for
• Lexemes are the specific character strings that make up a token.
 – For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token.
 – For example: “letter followed by letters and digits” and “non-empty sequence of digits”
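
A token is therefore just a tagged value handed from the scanner to the parser. Below is a minimal C sketch of that idea (written for this summary, not taken from the course code; the struct layout and the token codes are illustrative only):

/* A token as a <name, attribute> pair (illustrative sketch) */
#include <stdio.h>

enum token_name { ID = 256, NUM, FOR };   /* hypothetical token codes */

struct token {
    enum token_name name;      /* token name, e.g. ID or NUM          */
    union {
        int  symtab_index;     /* for ID: index into the symbol table */
        long value;            /* for NUM: the numeric value          */
    } attr;                    /* optional attribute value            */
};

int main(void)
{
    struct token t = { NUM, { .value = 123 } };   /* lexeme "123" becomes <num, 123> */
    printf("<num, %ld>\n", t.attr.value);
    return 0;
}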

5 Examples of tokens

6 Attributes for Tokens
• Additional information about the token.
 – Example: the token number should have an attribute representing its lexeme, e.g. 167 is represented by <number, 167>
 – Example: the token id has an attribute value that points to the symbol-table entry for that identifier, which includes its lexeme, its type, etc.

7 Attributes of Tokens
• The input y := 31 + 28*x is turned by the lexical analyzer into the token stream passed to the parser:
 <id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
 where tokenval holds the token attribute.

8 Reading Ahead
• A lexical analyzer may need to read ahead some characters before it can decide on the token to be returned to the parser.
 – Example: after seeing the character >, it should read another one to see if it is =, and the token is >=
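
A minimal C sketch of this one-character lookahead (assuming a stdio-based scanner; the token codes GT and GE are illustrative) pushes the extra character back with ungetc() when it is not part of the token:

#include <stdio.h>

enum { GT = 256, GE };                /* hypothetical token codes */

int scan_greater(FILE *in)
{
    int c = fgetc(in);                /* read one character ahead             */
    if (c == '=')
        return GE;                    /* ">=" forms a single token            */
    if (c != EOF)
        ungetc(c, in);                /* not part of this token: push it back */
    return GT;                        /* ">" alone                            */
}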

9 Specification of Tokens: Regular Expressions
• Regular expressions are an important notation for specifying lexeme patterns.
• Please review the definitions from “Theory of Languages and Automata”:
 – strings, languages, …

10 Specification of Patterns for Tokens: Definitions
• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
 – |s| denotes the length of string s
 – ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ

11 Specification of Patterns for Tokens: String Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
 s⁰ = ε
 sⁱ = sⁱ⁻¹s for i > 0
 note that s¹ = s

12 Specification of Patterns for Tokens: Language Operations
• Union: L ∪ M = { s | s ∈ L or s ∈ M }
• Concatenation: LM = { xy | x ∈ L and y ∈ M }
• Exponentiation: L⁰ = { ε }; Lⁱ = Lⁱ⁻¹L
• Kleene closure: L* = ∪i=0,…,∞ Lⁱ
• Positive closure: L⁺ = ∪i=1,…,∞ Lⁱ
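
A small worked example (the sets here are chosen for illustration, not taken from the slides): if L = { a, bb } and M = { c }, then L ∪ M = { a, bb, c }, LM = { ac, bbc }, L² = { aa, abb, bba, bbbb }, L* = { ε, a, bb, aa, abb, bba, bbbb, … }, and L⁺ = L* − { ε } (since ε ∉ L).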

13 Specification of Patterns for Tokens: Regular Expressions
• Basis symbols:
 – ε is a regular expression denoting language { ε }
 – a is a regular expression denoting { a }
• If r and s are regular expressions denoting languages L(r) and L(s) respectively, then
 – r | s is a regular expression denoting L(r) ∪ L(s)
 – rs is a regular expression denoting L(r)L(s)
 – r* is a regular expression denoting L(r)*
 – (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set

14 Specification of Patterns for Tokens: Regular Definitions
• Regular definitions introduce a naming convention:
 d1 → r1
 d2 → r2
 …
 dn → rn
 where each ri is a regular expression over Σ ∪ { d1, d2, …, di-1 }
• Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions

15 Specification of Patterns for Tokens: Regular Definitions
• Example:
 letter → A | B | … | Z | a | b | … | z
 digit → 0 | 1 | … | 9
 id → letter ( letter | digit )*
• Regular definitions are not recursive:
 digits → digits digit      wrong!

16 Regular Definitions: Example

17 Specification of Patterns for Tokens: Notational Shorthand
• The following shorthands are often used:
 r+ = rr*
 r? = r | ε
 [a-z] = a | b | c | … | z
• Examples:
 digit → [0-9]
 num → digit+ (. digit+)? ( E (+|-)? digit+ )?
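
As a quick check of the num definition: it matches lexemes such as 42, 3.14, and 6.02E+23 (integer part, optional fraction, optional exponent), but it does not match .5, because at least one digit is required before the optional fraction part.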

18 Regular Definitions and Grammars
• Grammar:
 stmt → if expr then stmt
      | if expr then stmt else stmt
      | ε
 expr → term relop term
      | term
 term → id
      | num
• Regular definitions:
 if → if
 then → then
 else → else
 relop → < | <= | <> | > | >= | =
 id → letter ( letter | digit )*
 num → digit+ (. digit+)? ( E (+|-)? digit+ )?

19 Final Lexical Analyzer

20 Coding Regular Definitions in Transition Diagrams
• relop → < | <= | <> | > | >= | =
 Transition diagram: start state 0
 on <: go to state 1; then on = accept in state 2 return(relop, LE), on > accept in state 3 return(relop, NE), on any other character accept in state 4* return(relop, LT)
 on =: accept in state 5 return(relop, EQ)
 on >: go to state 6; then on = accept in state 7 return(relop, GE), on any other character accept in state 8* return(relop, GT)
• id → letter ( letter | digit )*
 Transition diagram: start state 9; on a letter go to state 10; loop in state 10 on letter or digit; on any other character accept in state 11* return(gettoken(), install_id())
• States marked * retract the last (lookahead) character

21 Transition Diagrams

22 Coding Regular Definitions in Transition Diagrams: Code

token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      /* in state 0: skip white space, otherwise pick a start transition */
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1: …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    …

/* fail() decides the next start state (i.e. the next transition diagram) to check */
int fail()
{ forward = token_beginning;
  switch (start) {
  case 0: start = 9; break;
  case 9: start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover(); break;
  default: /* error */
  }
  return start;
}

23 The Lex and Flex Scanner Generators
• Lex and its newer cousin flex are scanner generators
• Systematically translate regular definitions into C source code for efficient scanning
• Generated code is easy to integrate in C applications

24 Creating a Lexical Analyzer with Lex and Flex
• The lex source program lex.l is translated by the lex or flex compiler into lex.yy.c
• lex.yy.c is compiled by a C compiler into a.out
• The input stream is read by a.out, which produces the sequence of tokens

25 Lex Specification
• A lex specification consists of three parts:
 regular definitions, C declarations in %{ %}
 %%
 translation rules
 %%
 user-defined auxiliary procedures
• The translation rules are of the form:
 p1 { action1 }
 p2 { action2 }
 …
 pn { actionn }

26 Regular Expressions in Lex
 x        match the character x
 \.       match the character .
 "string" match contents of string of characters
 .        match any character except newline
 ^        match the beginning of a line
 $        match the end of a line
 [xyz]    match one character x, y, or z (use \ to escape -)
 [^xyz]   match any character except x, y, and z
 [a-z]    match one of a to z
 r*       closure (match zero or more occurrences)
 r+       positive closure (match one or more occurrences)
 r?       optional (match zero or one occurrence)
 r1r2     match r1 then r2 (concatenation)
 r1|r2    match r1 or r2 (union)
 (r)      grouping
 r1/r2    match r1 when followed by r2
 {d}      match the regular expression defined by d

27 Example Lex Specification

%{
/* definitions of manifest constants */
#define LT (256)
…
%}
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = install_id(); return ID; }
{number}  { yylval = install_num(); return NUMBER; }
"<"       { yylval = LT; return RELOP; }
"<="      { yylval = LE; return RELOP; }
"="       { yylval = EQ; return RELOP; }
"<>"      { yylval = NE; return RELOP; }
">"       { yylval = GT; return RELOP; }
">="      { yylval = GE; return RELOP; }
%%
int install_id() …

• The actions return the token to the parser, yylval carries the token attribute, and install_id() installs yytext as an identifier in the symbol table.

28 Design of a Lexical Analyzer Generator
• Translate regular expressions to an NFA
• Translate the NFA to an efficient DFA (this second step is optional)
 regular expressions → NFA → DFA
 either simulate the NFA to recognize tokens, or simulate the DFA to recognize tokens

29 Nondeterministic Finite Automata
• An NFA is a 5-tuple (S, Σ, δ, s0, F) where
 S is a finite set of states
 Σ is a finite set of symbols, the alphabet
 δ is a mapping from S × (Σ ∪ { ε }) to a set of states
 s0 ∈ S is the start state
 F ⊆ S is the set of accepting (or final) states

30 Transition Graph
• An NFA can be diagrammatically represented by a labeled directed graph called a transition graph
• Example (an NFA for (a|b)*abb): start state 0 loops on a and b and has an edge on a to state 1; state 1 has an edge on b to state 2; state 2 has an edge on b to state 3
 S = {0, 1, 2, 3}
 Σ = {a, b}
 s0 = 0
 F = {3}

31 Transition Table
• The mapping δ of an NFA can be represented in a transition table
 δ(0, a) = {0, 1}   δ(0, b) = {0}
 δ(1, b) = {2}
 δ(2, b) = {3}

 State | Input a | Input b
   0   | {0, 1}  | {0}
   1   |         | {2}
   2   |         | {3}
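
One way to code such a table is to store each entry as a set of next states. The C sketch below (a bitmask encoding chosen for illustration, not the book's representation) holds exactly the table above:

#include <stdio.h>

#define NSTATES 4

/* delta[state][symbol]: bit i set means state i is a possible next state; */
/* column 0 is input a, column 1 is input b.                               */
static const unsigned delta[NSTATES][2] = {
    { (1u<<0) | (1u<<1), (1u<<0) },   /* (0,a) = {0,1}, (0,b) = {0} */
    { 0,                 (1u<<2) },   /* (1,b) = {2}                */
    { 0,                 (1u<<3) },   /* (2,b) = {3}                */
    { 0,                 0       },   /* state 3: no moves          */
};

int main(void)
{
    unsigned next = delta[0][0];                  /* successors of state 0 on a */
    for (int s = 0; s < NSTATES; s++)
        if (next & (1u << s))
            printf("state %d\n", s);              /* prints state 0 and state 1 */
    return 0;
}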

32 The Language Defined by an NFA
• An NFA accepts an input string x if and only if there is some path with edges labeled with the symbols of x in sequence from the start state to some accepting state in the transition graph
• A state transition from one state to another on the path is called a move
• The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example NFA
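
For example, using the transition table of the previous slide, the NFA accepts aabb via the move sequence 0 → 0 (on a), 0 → 1 (on a), 1 → 2 (on b), 2 → 3 (on b), ending in the accepting state 3; it rejects aab because no path over those symbols reaches state 3.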

33 Design of a Lexical Analyzer Generator: RE to NFA to DFA
• Lex specification with regular expressions:
 p1 { action1 }
 p2 { action2 }
 …
 pn { actionn }
• NFA: a new start state s0 with ε-transitions to the sub-NFAs N(p1), N(p2), …, N(pn), whose accepting states are tagged with action1, action2, …, actionn
• The subset construction then turns this NFA into a DFA

34 From Regular Expression to NFA (Thompson’s Construction)
• For a symbol a: a new start state i with an edge labeled a to a new accepting state f
• For r1 | r2: a new start state i with ε-edges to N(r1) and N(r2), whose accepting states have ε-edges to a new accepting state f
• For r1 r2: N(r1) followed by N(r2), the accepting state of N(r1) becoming the start state of N(r2)
• For r*: a new start state i with ε-edges to N(r) and to a new accepting state f; the accepting state of N(r) has ε-edges back to the start of N(r) and to f

35 Combining the NFAs of a Set of Regular Expressions
• Patterns:
 a    { action1 }
 abb  { action2 }
 a*b+ { action3 }
• Their NFAs: for a, state 1 goes to accepting state 2 on a; for abb, states 3 → 4 → 5 → 6 on a, b, b; for a*b+, state 7 loops on a and goes to state 8 on b, and state 8 loops on b
• The NFAs are combined by adding a new start state 0 with ε-transitions to states 1, 3, and 7

36 Simulating the Combined NFA Example 1
• Input a a b a:
 start: {0, 1, 3, 7}
 after a: {2, 4, 7}  (state 2 is accepting: action1)
 after a: {7}        (no accepting state)
 after b: {8}        (state 8 is accepting: action3)
 after a: none       (no further moves possible)
• Must find the longest match: continue until no further moves are possible
• When the last state reached is accepting: execute its action (here action3)

37 Simulating the Combined NFA Example 2
• Input a b b:
 start: {0, 1, 3, 7}
 after a: {2, 4, 7}  (state 2: action1)
 after b: {5, 8}     (state 8: action3)
 after b: {6, 8}     (state 6: action2, state 8: action3)
• When two or more accepting states are reached, the first action given in the Lex specification is executed (here action2)

38 Deterministic Finite Automata
• A deterministic finite automaton is a special case of an NFA
 – No state has an ε-transition
 – For each state s and input symbol a there is at most one edge labeled a leaving s
• Each entry in the transition table is a single state
 – At most one path exists to accept a string
 – The simulation algorithm is simple

39 Example DFA
• A DFA that accepts (a|b)*abb:
 state 0: a → 1, b → 0
 state 1: a → 1, b → 2
 state 2: a → 1, b → 3
 state 3: a → 1, b → 0   (state 3 is accepting)
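
A table-driven simulation of this DFA is a few lines of C. The sketch below (written for this summary, with a as column 0 and b as column 1, and assuming the input contains only a and b) reads the transitions directly off the diagram:

#include <stdio.h>

static const int dtran[4][2] = {      /* [state][0 = a, 1 = b] */
    { 1, 0 },                         /* state 0               */
    { 1, 2 },                         /* state 1               */
    { 1, 3 },                         /* state 2               */
    { 1, 0 },                         /* state 3 (accepting)   */
};

static int accepts(const char *s)
{
    int state = 0;                            /* start state            */
    for (; *s; s++)
        state = dtran[state][*s == 'b'];      /* pick the a or b column */
    return state == 3;                        /* accept in state 3      */
}

int main(void)
{
    printf("%d %d\n", accepts("abb"), accepts("aab"));   /* prints 1 0 */
    return 0;
}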

40 Conversion of an NFA into a DFA
• The subset construction algorithm converts an NFA into a DFA using:
 ε-closure(s) = { s } ∪ { t | s →ε … →ε t }
 ε-closure(T) = ∪ s∈T ε-closure(s)
 move(T, a) = { t | s →a t and s ∈ T }
• The algorithm produces:
 Dstates, the set of states of the new DFA, consisting of sets of states of the NFA
 Dtran, the transition table of the new DFA
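
For a small NFA these operations are easy to code by encoding each set of states as a bitmask. The C sketch below (an illustrative encoding; the eps[] and delta[] tables describe the combined NFA of the earlier slides, whose only ε-moves go from state 0 to states 1, 3, and 7) implements ε-closure and move:

#include <stdio.h>

#define NSTATES 9                     /* states 0..8 of the combined NFA */
#define NSYMS   2                     /* 0 = a, 1 = b                    */

static const unsigned eps[NSTATES] = {
    (1u<<1) | (1u<<3) | (1u<<7),      /* state 0: ε-moves to 1, 3, 7 */
    0, 0, 0, 0, 0, 0, 0, 0            /* no other ε-moves            */
};

static const unsigned delta[NSTATES][NSYMS] = {
    /* 0 */ { 0,      0      },
    /* 1 */ { 1u<<2,  0      },       /* pattern a               */
    /* 2 */ { 0,      0      },
    /* 3 */ { 1u<<4,  0      },       /* pattern abb             */
    /* 4 */ { 0,      1u<<5  },
    /* 5 */ { 0,      1u<<6  },
    /* 6 */ { 0,      0      },
    /* 7 */ { 1u<<7,  1u<<8  },       /* pattern a*b+: loop on a */
    /* 8 */ { 0,      1u<<8  },       /* then loop on b          */
};

unsigned eps_closure(unsigned T)              /* ε-closure of a state set */
{
    unsigned result = T, frontier = T;
    while (frontier) {
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (frontier & (1u << s))
                next |= eps[s];
        frontier = next & ~result;            /* states not seen before */
        result  |= next;
    }
    return result;
}

unsigned move(unsigned T, int a)              /* all a-successors of T */
{
    unsigned next = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s))
            next |= delta[s][a];
    return next;
}

int main(void)
{
    unsigned A = eps_closure(1u << 0);        /* {0,1,3,7} = 0x8b        */
    printf("0x%x 0x%x\n", A, move(A, 0));     /* 0x8b 0x94, i.e. {2,4,7} */
    return 0;
}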

41 ε-closure and move Examples
• For the combined NFA of the three patterns a, abb, a*b+:
 ε-closure({0}) = {0, 1, 3, 7}      move({0, 1, 3, 7}, a) = {2, 4, 7}
 ε-closure({2, 4, 7}) = {2, 4, 7}   move({2, 4, 7}, a) = {7}
 ε-closure({7}) = {7}               move({7}, b) = {8}
 ε-closure({8}) = {8}               move({8}, a) = ∅ (none)
• The same operations are also used to simulate NFAs

42 The Subset Construction Algorithm
 initially, ε-closure(s0) is the only state in Dstates and it is unmarked
 while there is an unmarked state T in Dstates do
   mark T
   for each input symbol a do
     U := ε-closure(move(T, a))
     if U is not in Dstates then
       add U as an unmarked state to Dstates
     end if
     Dtran[T, a] := U
   end do
 end do
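
The loop above maps directly onto code. The C sketch below (illustrative, reusing the bitmask encoding of the previous sketch, with the ε-free NFA of the (a|b)*abb transition graph as input) builds Dstates and Dtran and ends up with the four-state DFA of the Example DFA slide:

#include <stdio.h>

#define NSTATES 4                     /* NFA states 0..3 for (a|b)*abb */
#define NSYMS   2                     /* 0 = a, 1 = b                  */
#define MAXD    (1 << NSTATES)        /* at most 2^NSTATES DFA states  */

static const unsigned eps[NSTATES] = { 0, 0, 0, 0 };          /* no ε-moves */
static const unsigned delta[NSTATES][NSYMS] = {
    { (1u<<0) | (1u<<1), (1u<<0) },   /* 0: a -> {0,1}, b -> {0} */
    { 0,                 (1u<<2) },   /* 1: b -> {2}             */
    { 0,                 (1u<<3) },   /* 2: b -> {3}             */
    { 0,                 0       },   /* 3: accepting, no moves  */
};

static unsigned eps_closure(unsigned T)
{
    unsigned result = T, frontier = T;
    while (frontier) {
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (frontier & (1u << s)) next |= eps[s];
        frontier = next & ~result;
        result  |= next;
    }
    return result;
}

static unsigned move(unsigned T, int a)
{
    unsigned next = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) next |= delta[s][a];
    return next;
}

int main(void)
{
    unsigned Dstates[MAXD];                   /* each DFA state is a set of NFA states */
    int Dtran[MAXD][NSYMS];
    int ndfa = 0, marked = 0;

    Dstates[ndfa++] = eps_closure(1u << 0);   /* ε-closure(s0), initially unmarked */
    while (marked < ndfa) {                   /* an unmarked state T remains       */
        unsigned T = Dstates[marked++];       /* mark T                            */
        for (int a = 0; a < NSYMS; a++) {
            unsigned U = eps_closure(move(T, a));
            int j = 0;
            while (j < ndfa && Dstates[j] != U) j++;
            if (j == ndfa) Dstates[ndfa++] = U;   /* add U as unmarked to Dstates  */
            Dtran[marked - 1][a] = j;             /* Dtran[T, a] := U              */
        }
    }
    for (int i = 0; i < ndfa; i++)
        printf("D%d = 0x%x: a -> D%d, b -> D%d\n",
               i, Dstates[i], Dtran[i][0], Dtran[i][1]);
    return 0;
}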

43 Subset Construction Example 1
• Applying the subset construction to the NFA for (a|b)*abb built by Thompson’s construction (states 0-10) gives the DFA states
 A = {0, 1, 2, 4, 7}
 B = {1, 2, 3, 4, 6, 7, 8}
 C = {1, 2, 4, 5, 6, 7}
 D = {1, 2, 4, 5, 6, 7, 9}
 E = {1, 2, 4, 5, 6, 7, 10}
• DFA transitions: A goes to B on a and to C on b; B to B on a and to D on b; C to B on a and to C on b; D to B on a and to E on b; E to B on a and to C on b; E is the accepting state

44 Subset Construction Example 2
• Applying the subset construction to the combined NFA of the three patterns gives the DFA states
 A = {0, 1, 3, 7}
 B = {2, 4, 7}
 C = {8}
 D = {7}
 E = {5, 8}
 F = {6, 8}
• DFA transitions: A goes to B on a and to C on b; B to D on a and to E on b; C to C on b; D to D on a and to C on b; E to F on b; F to C on b
• Accepting DFA states and their actions: B (action1), C (action3), E (action3), F (action2 and action3; action2 is executed)