1 Lexical Analysis and Lexical Analyzer Generators Chapter 3 COP 5621 Compiler Construction Copyright Robert van Engelen, Florida State University, 2007-2011

2 Lexical Analyzer
• A lexical analyzer reads characters from the input and groups them into “token objects”.
• Typical tasks of the lexical analyzer:
 – Remove white space and comments
 – Encode constants as tokens
 – Recognize keywords
 – Recognize identifiers and store identifier names in a global symbol table

3 Interaction of the Lexical Analyzer with the Parser
• The lexical analyzer reads the source program and, on each “get next token” request from the parser, returns the next token together with its attribute tokenval; both phases report errors and share the symbol table.

4 Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute value.
 – For example: id, num, and for
• Lexemes are the specific character strings that make up a token.
 – For example: abc and 123
• Patterns are rules describing the set of lexemes belonging to a token.
 – For example: “letter followed by letters and digits” and “non-empty sequence of digits”
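
A token is therefore just a tagged value handed from the scanner to the parser. Below is a minimal C sketch of that idea (written for this summary, not taken from the course code; the struct layout and the token codes are illustrative only):

/* A token as a <name, attribute> pair (illustrative sketch) */
#include <stdio.h>

enum token_name { ID = 256, NUM, FOR };   /* hypothetical token codes */

struct token {
    enum token_name name;      /* token name, e.g. ID or NUM          */
    union {
        int  symtab_index;     /* for ID: index into the symbol table */
        long value;            /* for NUM: the numeric value          */
    } attr;                    /* optional attribute value            */
};

int main(void)
{
    struct token t = { NUM, { .value = 123 } };   /* lexeme "123" becomes <num, 123> */
    printf("<num, %ld>\n", t.attr.value);
    return 0;
}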

5 Examples of tokens

6 Attributes for Tokens
• Additional information about the token.
 – Example: the token number should have an attribute representing its lexeme, e.g. 167 is represented by <number, 167>
 – Example: the token id has an attribute value that points to the symbol-table entry for that identifier, which includes its lexeme, its type, etc.

7 Attributes of Tokens
• The input y := 31 + 28*x is turned by the lexical analyzer into the token stream passed to the parser:
 <id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
 where tokenval holds the token attribute.

8 Reading Ahead
• A lexical analyzer may need to read ahead some characters before it can decide on the token to be returned to the parser.
 – Example: after seeing the character >, it should read another one to see if it is =, and the token is >=
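
A minimal C sketch of this one-character lookahead (assuming a stdio-based scanner; the token codes GT and GE are illustrative) pushes the extra character back with ungetc() when it is not part of the token:

#include <stdio.h>

enum { GT = 256, GE };                /* hypothetical token codes */

int scan_greater(FILE *in)
{
    int c = fgetc(in);                /* read one character ahead             */
    if (c == '=')
        return GE;                    /* ">=" forms a single token            */
    if (c != EOF)
        ungetc(c, in);                /* not part of this token: push it back */
    return GT;                        /* ">" alone                            */
}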

9 Specification of Tokens: Regular Expressions
• Regular expressions are an important notation for specifying lexeme patterns.
• Please review the definitions from “Theory of Languages and Automata”:
 – strings, languages, …

10 Specification of Patterns for Tokens: Definitions
• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
 – |s| denotes the length of string s
 – ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ

11 Specification of Patterns for Tokens: String Operations
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by
 s⁰ = ε
 sⁱ = sⁱ⁻¹s for i > 0
 note that s¹ = s

12 Specification of Patterns for Tokens: Language Operations
• Union: L ∪ M = { s | s ∈ L or s ∈ M }
• Concatenation: LM = { xy | x ∈ L and y ∈ M }
• Exponentiation: L⁰ = { ε }; Lⁱ = Lⁱ⁻¹L
• Kleene closure: L* = ∪i=0,…,∞ Lⁱ
• Positive closure: L⁺ = ∪i=1,…,∞ Lⁱ
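
A small worked example (the sets here are chosen for illustration, not taken from the slides): if L = { a, bb } and M = { c }, then L ∪ M = { a, bb, c }, LM = { ac, bbc }, L² = { aa, abb, bba, bbbb }, L* = { ε, a, bb, aa, abb, bba, bbbb, … }, and L⁺ = L* − { ε } (since ε ∉ L).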

13 Specification of Patterns for Tokens: Regular Expressions
• Basis symbols:
 – ε is a regular expression denoting language { ε }
 – a is a regular expression denoting { a }
• If r and s are regular expressions denoting languages L(r) and L(s) respectively, then
 – r | s is a regular expression denoting L(r) ∪ L(s)
 – rs is a regular expression denoting L(r)L(s)
 – r* is a regular expression denoting L(r)*
 – (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set

14 Specification of Patterns for Tokens: Regular Definitions
• Regular definitions introduce a naming convention:
 d1 → r1
 d2 → r2
 …
 dn → rn
 where each ri is a regular expression over Σ ∪ { d1, d2, …, di-1 }
• Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions

15 Specification of Patterns for Tokens: Regular Definitions
• Example:
 letter → A | B | … | Z | a | b | … | z
 digit → 0 | 1 | … | 9
 id → letter ( letter | digit )*
• Regular definitions are not recursive:
 digits → digits digit      wrong!

16 Regular Definitions: Example

17 Specification of Patterns for Tokens: Notational Shorthand
• The following shorthands are often used:
 r+ = rr*
 r? = r | ε
 [a-z] = a | b | c | … | z
• Examples:
 digit → [0-9]
 num → digit+ (. digit+)? ( E (+|-)? digit+ )?
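
As a quick check of the num definition: it matches lexemes such as 42, 3.14, and 6.02E+23 (integer part, optional fraction, optional exponent), but it does not match .5, because at least one digit is required before the optional fraction part.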

18 Regular Definitions and Grammars
• Grammar:
 stmt → if expr then stmt
      | if expr then stmt else stmt
      | ε
 expr → term relop term
      | term
 term → id
      | num
• Regular definitions:
 if → if
 then → then
 else → else
 relop → < | <= | <> | > | >= | =
 id → letter ( letter | digit )*
 num → digit+ (. digit+)? ( E (+|-)? digit+ )?

19 Final Lexical Analyzer

20 Coding Regular Definitions in Transition Diagrams
• relop → < | <= | <> | > | >= | =
 Transition diagram: start state 0
 on <: go to state 1; then on = accept in state 2 return(relop, LE), on > accept in state 3 return(relop, NE), on any other character accept in state 4* return(relop, LT)
 on =: accept in state 5 return(relop, EQ)
 on >: go to state 6; then on = accept in state 7 return(relop, GE), on any other character accept in state 8* return(relop, GT)
• id → letter ( letter | digit )*
 Transition diagram: start state 9; on a letter go to state 10; loop in state 10 on letter or digit; on any other character accept in state 11* return(gettoken(), install_id())
• States marked * retract the last (lookahead) character

21 Transition Diagrams

22 Coding Regular Definitions in Transition Diagrams: Code

token nexttoken()
{ while (1) {
    switch (state) {
    case 0: c = nextchar();
      /* in state 0: skip white space, otherwise pick a start transition */
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='<') state = 1;
      else if (c=='=') state = 5;
      else if (c=='>') state = 6;
      else state = fail();
      break;
    case 1: …
    case 9: c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10: c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    …

/* fail() decides the next start state (i.e. the next transition diagram) to check */
int fail()
{ forward = token_beginning;
  switch (start) {
  case 0: start = 9; break;
  case 9: start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover(); break;
  default: /* error */
  }
  return start;
}

23 The Lex and Flex Scanner Generators
• Lex and its newer cousin flex are scanner generators
• Systematically translate regular definitions into C source code for efficient scanning
• Generated code is easy to integrate in C applications

24 Creating a Lexical Analyzer with Lex and Flex
• The lex source program lex.l is translated by the lex or flex compiler into lex.yy.c
• lex.yy.c is compiled by a C compiler into a.out
• The input stream is read by a.out, which produces the sequence of tokens

25 Lex Specification
• A lex specification consists of three parts:
 regular definitions, C declarations in %{ %}
 %%
 translation rules
 %%
 user-defined auxiliary procedures
• The translation rules are of the form:
 p1 { action1 }
 p2 { action2 }
 …
 pn { actionn }

26 Regular Expressions in Lex
 x        match the character x
 \.       match the character .
 "string" match contents of string of characters
 .        match any character except newline
 ^        match the beginning of a line
 $        match the end of a line
 [xyz]    match one character x, y, or z (use \ to escape -)
 [^xyz]   match any character except x, y, and z
 [a-z]    match one of a to z
 r*       closure (match zero or more occurrences)
 r+       positive closure (match one or more occurrences)
 r?       optional (match zero or one occurrence)
 r1r2     match r1 then r2 (concatenation)
 r1|r2    match r1 or r2 (union)
 (r)      grouping
 r1/r2    match r1 when followed by r2
 {d}      match the regular expression defined by d

27 Example Lex Specification

%{
/* definitions of manifest constants */
#define LT (256)
…
%}
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}      { }
if        { return IF; }
then      { return THEN; }
else      { return ELSE; }
{id}      { yylval = install_id(); return ID; }
{number}  { yylval = install_num(); return NUMBER; }
"<"       { yylval = LT; return RELOP; }
"<="      { yylval = LE; return RELOP; }
"="       { yylval = EQ; return RELOP; }
"<>"      { yylval = NE; return RELOP; }
">"       { yylval = GT; return RELOP; }
">="      { yylval = GE; return RELOP; }
%%
int install_id() …

• The actions return the token to the parser, yylval carries the token attribute, and install_id() installs yytext as an identifier in the symbol table.

28 Design of a Lexical Analyzer Generator
• Translate regular expressions to an NFA
• Translate the NFA to an efficient DFA (this second step is optional)
 regular expressions → NFA → DFA
 either simulate the NFA to recognize tokens, or simulate the DFA to recognize tokens

29 Nondeterministic Finite Automata
• An NFA is a 5-tuple (S, Σ, δ, s0, F) where
 S is a finite set of states
 Σ is a finite set of symbols, the alphabet
 δ is a mapping from S × (Σ ∪ { ε }) to a set of states
 s0 ∈ S is the start state
 F ⊆ S is the set of accepting (or final) states

30 Transition Graph
• An NFA can be diagrammatically represented by a labeled directed graph called a transition graph
• Example (an NFA for (a|b)*abb): start state 0 loops on a and b and has an edge on a to state 1; state 1 has an edge on b to state 2; state 2 has an edge on b to state 3
 S = {0, 1, 2, 3}
 Σ = {a, b}
 s0 = 0
 F = {3}

31 Transition Table
• The mapping δ of an NFA can be represented in a transition table
 δ(0, a) = {0, 1}   δ(0, b) = {0}
 δ(1, b) = {2}
 δ(2, b) = {3}

 State | Input a | Input b
   0   | {0, 1}  | {0}
   1   |         | {2}
   2   |         | {3}
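
One way to code such a table is to store each entry as a set of next states. The C sketch below (a bitmask encoding chosen for illustration, not the book's representation) holds exactly the table above:

#include <stdio.h>

#define NSTATES 4

/* delta[state][symbol]: bit i set means state i is a possible next state; */
/* column 0 is input a, column 1 is input b.                               */
static const unsigned delta[NSTATES][2] = {
    { (1u<<0) | (1u<<1), (1u<<0) },   /* (0,a) = {0,1}, (0,b) = {0} */
    { 0,                 (1u<<2) },   /* (1,b) = {2}                */
    { 0,                 (1u<<3) },   /* (2,b) = {3}                */
    { 0,                 0       },   /* state 3: no moves          */
};

int main(void)
{
    unsigned next = delta[0][0];                  /* successors of state 0 on a */
    for (int s = 0; s < NSTATES; s++)
        if (next & (1u << s))
            printf("state %d\n", s);              /* prints state 0 and state 1 */
    return 0;
}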

32 The Language Defined by an NFA
• An NFA accepts an input string x if and only if there is some path with edges labeled with the symbols of x in sequence from the start state to some accepting state in the transition graph
• A state transition from one state to another on the path is called a move
• The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example NFA
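
For example, using the transition table of the previous slide, the NFA accepts aabb via the move sequence 0 → 0 (on a), 0 → 1 (on a), 1 → 2 (on b), 2 → 3 (on b), ending in the accepting state 3; it rejects aab because no path over those symbols reaches state 3.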

33 Design of a Lexical Analyzer Generator: RE to NFA to DFA
• Lex specification with regular expressions:
 p1 { action1 }
 p2 { action2 }
 …
 pn { actionn }
• NFA: a new start state s0 with ε-transitions to the sub-NFAs N(p1), N(p2), …, N(pn), whose accepting states are tagged with action1, action2, …, actionn
• The subset construction then turns this NFA into a DFA

34 From Regular Expression to NFA (Thompson’s Construction)
• For a symbol a: a new start state i with an edge labeled a to a new accepting state f
• For r1 | r2: a new start state i with ε-edges to N(r1) and N(r2), whose accepting states have ε-edges to a new accepting state f
• For r1 r2: N(r1) followed by N(r2), the accepting state of N(r1) becoming the start state of N(r2)
• For r*: a new start state i with ε-edges to N(r) and to a new accepting state f; the accepting state of N(r) has ε-edges back to the start of N(r) and to f

35 Combining the NFAs of a Set of Regular Expressions
• Patterns:
 a    { action1 }
 abb  { action2 }
 a*b+ { action3 }
• Their NFAs: for a, state 1 goes to accepting state 2 on a; for abb, states 3 → 4 → 5 → 6 on a, b, b; for a*b+, state 7 loops on a and goes to state 8 on b, and state 8 loops on b
• The NFAs are combined by adding a new start state 0 with ε-transitions to states 1, 3, and 7

36 Simulating the Combined NFA Example 1
• Input a a b a:
 start: {0, 1, 3, 7}
 after a: {2, 4, 7}  (state 2 is accepting: action1)
 after a: {7}        (no accepting state)
 after b: {8}        (state 8 is accepting: action3)
 after a: none       (no further moves possible)
• Must find the longest match: continue until no further moves are possible
• When the last state reached is accepting: execute its action (here action3)

37 Simulating the Combined NFA Example 2
• Input a b b:
 start: {0, 1, 3, 7}
 after a: {2, 4, 7}  (state 2: action1)
 after b: {5, 8}     (state 8: action3)
 after b: {6, 8}     (state 6: action2, state 8: action3)
• When two or more accepting states are reached, the first action given in the Lex specification is executed (here action2)

38 Deterministic Finite Automata
• A deterministic finite automaton is a special case of an NFA
 – No state has an ε-transition
 – For each state s and input symbol a there is at most one edge labeled a leaving s
• Each entry in the transition table is a single state
 – At most one path exists to accept a string
 – The simulation algorithm is simple

39 Example DFA
• A DFA that accepts (a|b)*abb:
 state 0: a → 1, b → 0
 state 1: a → 1, b → 2
 state 2: a → 1, b → 3
 state 3: a → 1, b → 0   (state 3 is accepting)
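
A table-driven simulation of this DFA is a few lines of C. The sketch below (written for this summary, with a as column 0 and b as column 1, and assuming the input contains only a and b) reads the transitions directly off the diagram:

#include <stdio.h>

static const int dtran[4][2] = {      /* [state][0 = a, 1 = b] */
    { 1, 0 },                         /* state 0               */
    { 1, 2 },                         /* state 1               */
    { 1, 3 },                         /* state 2               */
    { 1, 0 },                         /* state 3 (accepting)   */
};

static int accepts(const char *s)
{
    int state = 0;                            /* start state            */
    for (; *s; s++)
        state = dtran[state][*s == 'b'];      /* pick the a or b column */
    return state == 3;                        /* accept in state 3      */
}

int main(void)
{
    printf("%d %d\n", accepts("abb"), accepts("aab"));   /* prints 1 0 */
    return 0;
}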

40 Conversion of an NFA into a DFA
• The subset construction algorithm converts an NFA into a DFA using:
 ε-closure(s) = { s } ∪ { t | s →ε … →ε t }
 ε-closure(T) = ∪ s∈T ε-closure(s)
 move(T, a) = { t | s →a t and s ∈ T }
• The algorithm produces:
 Dstates, the set of states of the new DFA, consisting of sets of states of the NFA
 Dtran, the transition table of the new DFA
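
For a small NFA these operations are easy to code by encoding each set of states as a bitmask. The C sketch below (an illustrative encoding; the eps[] and delta[] tables describe the combined NFA of the earlier slides, whose only ε-moves go from state 0 to states 1, 3, and 7) implements ε-closure and move:

#include <stdio.h>

#define NSTATES 9                     /* states 0..8 of the combined NFA */
#define NSYMS   2                     /* 0 = a, 1 = b                    */

static const unsigned eps[NSTATES] = {
    (1u<<1) | (1u<<3) | (1u<<7),      /* state 0: ε-moves to 1, 3, 7 */
    0, 0, 0, 0, 0, 0, 0, 0            /* no other ε-moves            */
};

static const unsigned delta[NSTATES][NSYMS] = {
    /* 0 */ { 0,      0      },
    /* 1 */ { 1u<<2,  0      },       /* pattern a               */
    /* 2 */ { 0,      0      },
    /* 3 */ { 1u<<4,  0      },       /* pattern abb             */
    /* 4 */ { 0,      1u<<5  },
    /* 5 */ { 0,      1u<<6  },
    /* 6 */ { 0,      0      },
    /* 7 */ { 1u<<7,  1u<<8  },       /* pattern a*b+: loop on a */
    /* 8 */ { 0,      1u<<8  },       /* then loop on b          */
};

unsigned eps_closure(unsigned T)              /* ε-closure of a state set */
{
    unsigned result = T, frontier = T;
    while (frontier) {
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (frontier & (1u << s))
                next |= eps[s];
        frontier = next & ~result;            /* states not seen before */
        result  |= next;
    }
    return result;
}

unsigned move(unsigned T, int a)              /* all a-successors of T */
{
    unsigned next = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s))
            next |= delta[s][a];
    return next;
}

int main(void)
{
    unsigned A = eps_closure(1u << 0);        /* {0,1,3,7} = 0x8b        */
    printf("0x%x 0x%x\n", A, move(A, 0));     /* 0x8b 0x94, i.e. {2,4,7} */
    return 0;
}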

41 ε-closure and move Examples
• For the combined NFA of the three patterns a, abb, a*b+:
 ε-closure({0}) = {0, 1, 3, 7}      move({0, 1, 3, 7}, a) = {2, 4, 7}
 ε-closure({2, 4, 7}) = {2, 4, 7}   move({2, 4, 7}, a) = {7}
 ε-closure({7}) = {7}               move({7}, b) = {8}
 ε-closure({8}) = {8}               move({8}, a) = ∅ (none)
• The same operations are also used to simulate NFAs

42 The Subset Construction Algorithm
 initially, ε-closure(s0) is the only state in Dstates and it is unmarked
 while there is an unmarked state T in Dstates do
   mark T
   for each input symbol a do
     U := ε-closure(move(T, a))
     if U is not in Dstates then
       add U as an unmarked state to Dstates
     end if
     Dtran[T, a] := U
   end do
 end do
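
The loop above maps directly onto code. The C sketch below (illustrative, reusing the bitmask encoding of the previous sketch, with the ε-free NFA of the (a|b)*abb transition graph as input) builds Dstates and Dtran and ends up with the four-state DFA of the Example DFA slide:

#include <stdio.h>

#define NSTATES 4                     /* NFA states 0..3 for (a|b)*abb */
#define NSYMS   2                     /* 0 = a, 1 = b                  */
#define MAXD    (1 << NSTATES)        /* at most 2^NSTATES DFA states  */

static const unsigned eps[NSTATES] = { 0, 0, 0, 0 };          /* no ε-moves */
static const unsigned delta[NSTATES][NSYMS] = {
    { (1u<<0) | (1u<<1), (1u<<0) },   /* 0: a -> {0,1}, b -> {0} */
    { 0,                 (1u<<2) },   /* 1: b -> {2}             */
    { 0,                 (1u<<3) },   /* 2: b -> {3}             */
    { 0,                 0       },   /* 3: accepting, no moves  */
};

static unsigned eps_closure(unsigned T)
{
    unsigned result = T, frontier = T;
    while (frontier) {
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (frontier & (1u << s)) next |= eps[s];
        frontier = next & ~result;
        result  |= next;
    }
    return result;
}

static unsigned move(unsigned T, int a)
{
    unsigned next = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s)) next |= delta[s][a];
    return next;
}

int main(void)
{
    unsigned Dstates[MAXD];                   /* each DFA state is a set of NFA states */
    int Dtran[MAXD][NSYMS];
    int ndfa = 0, marked = 0;

    Dstates[ndfa++] = eps_closure(1u << 0);   /* ε-closure(s0), initially unmarked */
    while (marked < ndfa) {                   /* an unmarked state T remains       */
        unsigned T = Dstates[marked++];       /* mark T                            */
        for (int a = 0; a < NSYMS; a++) {
            unsigned U = eps_closure(move(T, a));
            int j = 0;
            while (j < ndfa && Dstates[j] != U) j++;
            if (j == ndfa) Dstates[ndfa++] = U;   /* add U as unmarked to Dstates  */
            Dtran[marked - 1][a] = j;             /* Dtran[T, a] := U              */
        }
    }
    for (int i = 0; i < ndfa; i++)
        printf("D%d = 0x%x: a -> D%d, b -> D%d\n",
               i, Dstates[i], Dtran[i][0], Dtran[i][1]);
    return 0;
}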

43 Subset Construction Example 1
• Applying the subset construction to the NFA for (a|b)*abb built by Thompson’s construction (states 0-10) gives the DFA states
 A = {0, 1, 2, 4, 7}
 B = {1, 2, 3, 4, 6, 7, 8}
 C = {1, 2, 4, 5, 6, 7}
 D = {1, 2, 4, 5, 6, 7, 9}
 E = {1, 2, 4, 5, 6, 7, 10}
• DFA transitions: A goes to B on a and to C on b; B to B on a and to D on b; C to B on a and to C on b; D to B on a and to E on b; E to B on a and to C on b; E is the accepting state

44 Subset Construction Example 2
• Applying the subset construction to the combined NFA of the three patterns gives the DFA states
 A = {0, 1, 3, 7}
 B = {2, 4, 7}
 C = {8}
 D = {7}
 E = {5, 8}
 F = {6, 8}
• DFA transitions: A goes to B on a and to C on b; B to D on a and to E on b; C to C on b; D to D on a and to C on b; E to F on b; F to C on b
• Accepting DFA states and their actions: B (action1), C (action3), E (action3), F (action2 and action3; action2 is executed)