Lexical Analysis

Lexical Analysis
- What is lexical analysis?
- How to specify the acceptable tokens in a language?
  - E.g., identifiers, integers, odd number of 1’s, etc.
- How to identify tokens based on the language?
  - Intuitively
  - RE to NFA to DFA to scanner
- Combine multiple REs
- Is it necessary to look ahead?
  - Rules to resolve the potential ambiguities
- Tool: lex, input RE, output scanner (tutorial)

What is Lexical Analysis?
- Scanner = lexical analyzer
  - Input: the program viewed as a string
  - Output: a sequence of tokens
  - Process: match the language definitions and identify the tokens
- Token
  - An indivisible unit with a certain logical meaning in the language
    - Keywords: if, else, while, etc.
    - Identifiers: abc, xyz, etc.
    - Integers: 123, 456, etc.
    - Operators: +, *, etc.

How to Specify Acceptable Tokens
- Use regular expressions (RE)
  - Formal description, no ambiguity
- RE notation
  - Σ : the set of all characters that are acceptable in the language
  - X | Y : alternation
    - X, Y are strings that are already defined in the language
  - X • Y : concatenation
  - X* : repetition
  - ε : empty string
  - ( ) : enclose an expression
  - Precedence (highest first): ( ), *, •, |

How to Specify Acceptable Tokens
- Some additional conventions, not part of formal RE
  - X+ : X • X* (one or more X)
  - X? : optional, zero or one occurrence of X
  - [a-f] : one of the characters in the range
  - [a-f, A-F] : multiple ranges (or characters)
- Example
  - digit = [0-9]
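
The shorthand conventions above map closely onto Python's `re` module, so they can be tried out directly. The sketch below is only an illustration: the names `DIGIT`, `INTEGER`, `SIGNED`, `HEX_DIGIT`, and `matches` are made up for this example, and note that Python writes several ranges in one class as `[a-fA-F]`, without the comma used on the slide.

```python
import re

# Token patterns written with the shorthand conventions (names are hypothetical).
DIGIT = r"[0-9]"                 # one character from a range
INTEGER = DIGIT + r"+"           # X+ is shorthand for X X*
SIGNED = r"-?" + INTEGER         # X? means zero or one occurrence of X
HEX_DIGIT = r"[a-fA-F0-9]"       # several ranges inside one character class


def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None
```

For instance, `matches(INTEGER, "123")` holds, while `matches(INTEGER, "")` does not, because X+ requires at least one X.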

How to Identify Tokens
- Intuitively

    read(ch);
    case (ch)
      an alphabet: …
      a numeral: …
      …

  - Each case needs to read further and has many more possibilities
  - The various cases have to be merged together
  - The program can be very complicated and hard to understand
- Finite state automata (FA)
  - Make the scanner process much easier
  - RE can be mapped to NFA automatically
  - NFA can be converted to DFA automatically
  - DFA can be converted to scanner automatically

Finite State Automata
- Finite state automaton
  - A = (S, Σ, s0, F, T)
  - S: all states in the FA
  - Σ: all symbols accepted by the language
  - s0: the start state
  - F: accepting states
  - T: all transitions
    - S × Σ → S (or S × (Σ ∪ {ε}) → S)
- Why finite
  - Has a bounded number of states
  - Uses bounded memory space

Finite State Automata
- Processing an input string
  - Starting from s0, make a transition in the automaton for each input character
    - If no transition is possible for the input character → error
  - When the input string is fully consumed
    - If at a final state → accept
    - Otherwise → error
  - If the FA representing RE accepts the input string s, then s ∈ L(RE)
    - L(RE): the language defined by RE
[Figure: input string s → FA representing RE → Accept/No]
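
The processing rule above can be sketched as a short loop over the input. This is a minimal illustration, not the slides' own code: the function `run_dfa`, the dictionary encoding of the transitions, and the two-state automaton for the RE 1*0 are all assumptions made for the example.

```python
def run_dfa(transitions, start, accepting, text):
    """Follow one transition per input character; accept only in a final state."""
    state = start
    for ch in text:
        key = (state, ch)
        if key not in transitions:
            return False          # no transition possible for this character: error
        state = transitions[key]
    return state in accepting     # input fully consumed: accept only at a final state


# Hypothetical DFA for the RE 1*0: loop on '1' in s0, move to s1 on '0'.
ONES_THEN_ZERO = {("s0", "1"): "s0", ("s0", "0"): "s1"}
```

With this table, `run_dfa(ONES_THEN_ZERO, "s0", {"s1"}, "1110")` accepts, while "100" is rejected because s1 has no outgoing transition.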

NFA and DFA
- NFA
  - Allows one state-symbol pair to be mapped to multiple states
  - Allows ε-transitions ( (s1, ε) → s2 )
  - The RE → NFA mapping is very straightforward
- DFA
  - Deterministic mapping
    - One state-symbol pair can be mapped to only one state
  - No ε-transitions
[Figure: NFA for (a|b)*abb with states 0, 1, 2, 3]

NFA and RE
- Example NFAs, Σ = {0, 1}
  - Accepting any number of 1’s followed by a single 0: RE = 1*0
  - Accepting any string with 00 at the end: RE = (0|1)*00
[Figure: NFA diagrams for the two REs]

Systematically Constructing NFA from RE
- For a single character: a
- For a simple substring: ab
- Repetition: A+ and A*
  - Examples: (ab)+, a*b
[Figure: NFA fragments for a, ab, A+, (ab)+, A*, and a*b]

Systematically Constructing NFA from RE
- Alternation
  - Given NFAs for A and B
  - Construct A|B
- Concatenation
  - Concatenate substrings A and B
[Figure: NFA fragments for A|B and AB]

Lexical Analysis based on NFA
- Example NFA
  - Σ = {a, b}
  - RE = b*(a|b)
  - For input b
    - From state 1, should it go to state 2 or back to state 1?
    - Consider going to state 2
    - For bbbb, need to backtrack
- Processing an input string based on an NFA
  - May need to backtrack
  - May end up exploring all the paths in the NFA
  - In a more complicated NFA, the cost can be very high
- Solution: convert the NFA to a DFA
[Figure: NFA for b*(a|b) with states 1 and 2]

Converting NFA to DFA
- Conversion
  - Given an NFA, find the DFA with the minimum number of states that has the same behavior as the NFA for all inputs
- ε-closure
  - Given a set of NFA states T, ε-closure(T) is the set of states that are reachable through ε-transitions from any state in T (including the states in T themselves)
- Move(T, a)
  - All states that are reachable from T after reading a character a
    - Don’t forget to include the ε-closure of all those states

Converting NFA to DFA (cont.)
- Simulating the NFA: subset construction
  - Construct the initial state of the DFA
    - By finding the ε-closure of the initial state of the NFA
  - From a state T in the DFA, for each input character a
    - Find the set of states in Move(T, a)
    - Make Move(T, a) a state in the DFA if it is not there yet
      - If Move(T, a) contains at least one final state of the NFA, mark it as a final state in the DFA
    - For convenience, consider the characters with the same Move(T, a) all together
  - Repeat the step above for all states in the DFA that have not been processed yet (use a stack to keep track of them)
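
The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the slides' implementation: the NFA is encoded as a dictionary from (state, symbol) to a set of states, the empty string `EPS` stands for ε, and the names `eps_closure`, `move`, `subset_construction`, and the four-state NFA for (a|b)*abb are chosen for this example.

```python
EPS = ""  # label used for epsilon-transitions in this sketch (an assumption)


def eps_closure(nfa, states):
    """All NFA states reachable from `states` using only epsilon-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in nfa.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(nfa, states, ch):
    """Epsilon-closure of the states reachable from `states` on character `ch`."""
    targets = set()
    for s in states:
        targets.update(nfa.get((s, ch), ()))
    return eps_closure(nfa, targets)


def subset_construction(nfa, start, finals, alphabet):
    """Return (DFA start state, DFA transitions, DFA final states)."""
    d_start = eps_closure(nfa, {start})
    d_trans, d_finals = {}, set()
    seen = {d_start}
    work = [d_start]              # stack of DFA states not processed yet
    while work:
        T = work.pop()
        if T & finals:            # contains at least one NFA final state
            d_finals.add(T)
        for ch in alphabet:
            U = move(nfa, T, ch)
            if not U:
                continue          # no transition on ch from T
            d_trans[(T, ch)] = U
            if U not in seen:
                seen.add(U)
                work.append(U)
    return d_start, d_trans, d_finals


# Hypothetical NFA for (a|b)*abb: state 0 loops on a and b, then a, b, b leads to 3.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
dfa_start, dfa_trans, dfa_finals = subset_construction(NFA, 0, {3}, "ab")
```

Each DFA state is a frozen set of NFA states, so set equality decides whether Move(T, a) is already a state of the DFA.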

Converting NFA to DFA (cont.)
[Figure: intuitive construction of the DFA for (a|b)*abb, with states S0, S1, S2, S3]

(a|b)*abb (Intuitive Construction)

RE-NFA-DFA (different construction) (a|b)*abb Formal Construction

(a|b)*abb

DFA Minimization
- Example: (a|b)*abb
  - A, C, and E have the same transitions, so they look mergeable
  - But E is a final state and A and C are not
  - Merge A and C
[Figure: DFA for (a|b)*abb before and after merging A and C]

DFA Minimization
- The previous method does not always give the minimal DFA
- The DFA can actually be minimized further, until no more states can be merged
[Figure: DFA with states S0 to S4 over {a, b}, its transition table, and the further-minimized DFA]

DFA Minimization
- How to identify states that can be merged?
  - Starting from states s and t, for all strings x
    - If the acceptance decision is always the same, then s and t are indistinguishable (equivalent)
  - Final states and non-final states can never be merged
- Method
  - Initialization
    - Divide the states into two groups, final states and non-final states
  - Division within a group G
    - If, for each input symbol a, two states s and t in G have transitions on a to the same group, then s and t stay in the same group
    - Otherwise, divide G and put s and t into different groups
  - Repeat the division until the grouping no longer changes
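
The group-division method can be sketched as a small partition-refinement loop. This is a sketch under assumptions: the function name `minimize_groups`, the dictionary encoding, and the five-state DFA for (a|b)*abb (states A to E, with E final) are chosen for illustration and are not taken from the slides' figures.

```python
def minimize_groups(states, alphabet, trans, finals):
    """Partition DFA states into groups of indistinguishable states."""
    # Initialization: final states and non-final states.
    groups = [g for g in (set(finals), set(states) - set(finals)) if g]
    while True:
        index = {s: i for i, g in enumerate(groups) for s in g}
        new_groups = []
        for g in groups:
            # States stay together only if every symbol leads to the same group.
            buckets = {}
            for s in g:
                sig = tuple(index.get(trans.get((s, ch))) for ch in alphabet)
                buckets.setdefault(sig, set()).add(s)
            new_groups.extend(buckets.values())
        if len(new_groups) == len(groups):   # division only splits, so no change
            return new_groups
        groups = new_groups


# Hypothetical five-state DFA for (a|b)*abb; E is the only final state.
DFA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
merged = minimize_groups("ABCDE", "ab", DFA, {"E"})
```

On this DFA the loop ends with A and C in one group and B, D, and E each alone, so only A and C are merged.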

DFA Minimization
- Why does DFA minimization work?
  - If s and t are in the same group after the n-th round of group division, then
    - For all strings x of length n or less, s and t are indistinguishable
  - Reasoning
    - At the k-th iteration, assume that s and t are indistinguishable for all strings of length k or less
    - After the division of group G at iteration k+1, s and t are still in the same group
    - This means s and t are indistinguishable for all strings of length k+1 or less
      - Since one more input symbol is tested in the division process

Implementing a Language Processor from a DFA
- Follow the execution of the DFA

    loop
      case State is
        when state1 =>
          case Next_Character is
            when ‘a’ => State := state3;
            when ‘b’ => State := state1;
            ……
            when others => End_token_processing;
          end case;
        when state2 …
        ……
      end case;
    end loop;

DFA and Scanner -- Issues
- RE → NFA → DFA → language processor
  - Only determines whether the entire input string is accepted
  - Not good enough for lexical analysis
- Scanner
  - A scanner is supposed to recognize many different tokens
  - Each token can be defined as an RE
  - A scanner does not process the entire input string at once
  - It is not clear how to cut the input string into tokens
[Figure: input string s → DFA representing RE → Accept/No]

DFA and Scanner -- Issues
- How to build a DFA for a scanner
  - For each token, construct an RE
    - Construct RE1, RE2, …
    - RE = RE1 | RE2 | …
  - Multiple final states, each for a specific token
  - When converting the NFA to a DFA, the final states for REi should be carried along
  - Allow acceptance of substrings when matching REi

Scanner using FA -- Multiple REs
- Build an NFA for multiple REs
- Detailed NFA-DFA conversion steps
  - Modified from the notes of Prof. Amaral, Univ. Alberta
[Figure: combined NFA with states 1 to 15 and branches for IF, ID, NUM, and error; edge labels i, f, a-z, 0-9, and any character]

Scanner using FA -- Ambiguity
- Consider if11: potential ambiguity
  - Is it keyword if and number 11?
  - Or is it just id if11?
  - Consider longest match
- Consider if: still has ambiguity
  - Is it a keyword or an id? It satisfies both
  - Consider first match
  - The DFA state reached is labeled IF, ID (NFA state 3 is for IF and 8 is for ID)
[Figure: DFA obtained from the combined NFA, with states labeled by NFA state sets (1-4-9-14, 2-5-6-8-15, 3-6-7-8, 5-6-8-15, 6-7-8, 10-11-13-15, 11-12-13, 15) and token labels IF, ID, NUM, error]
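
The two rules can be tried out with a small regex-based tokenizer. This is only a sketch: it uses Python's `re` module instead of a hand-built DFA, and the names `TOKEN_SPECS` and `tokenize` as well as the exact patterns for ID and NUM are assumptions for the example. The longest match wins, and on a tie the RE listed first wins.

```python
import re

# Order matters for ties: the keyword RE is listed before the id RE.
TOKEN_SPECS = [
    ("IF", r"if"),
    ("ID", r"[a-z][a-z0-9]*"),
    ("NUM", r"[0-9]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPECS]


def tokenize(text):
    """Split `text` into (token, lexeme) pairs using longest match, then first match."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rx in COMPILED:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so among equally long matches the earlier RE is kept.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

Here `tokenize("if11")` yields a single ID token (longest match), while `tokenize("if")` yields IF rather than ID (first match).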

Scanner using FA -- Ambiguity Resolution
- Longest match
  - Implies the need to look ahead
  - When to terminate?
    - When no further transition is feasible
  - What if the termination condition is met at a non-final state?
    - Need to backtrack
    - Not the case in most modern languages
- First match
  - The REs should be arranged properly
    - E.g., keyword REs should appear before the id RE
  - In practice, for keywords
    - There is no need to have a DFA with all keywords in it
      - Reducing the number of states saves space
    - Simply recognize all of them as identifiers and have a different DFA or a hash table for keyword identification
      - Note: there are cases other than keywords where first match is needed (e.g., >=)
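
The keyword shortcut described above can be sketched as follows: recognize every word as an identifier first, then look it up in a hash table. The keyword set, the identifier pattern, and the function name `classify` are hypothetical choices for this example.

```python
import re

KEYWORDS = {"if", "else", "while"}          # example keyword table (hypothetical)
IDENT = re.compile(r"[a-z][a-z0-9]*")       # assumed identifier RE


def classify(lexeme):
    """Recognize the lexeme as an identifier, then check the keyword table."""
    if IDENT.fullmatch(lexeme) is None:
        raise ValueError(f"not an identifier: {lexeme!r}")
    return "KEYWORD" if lexeme in KEYWORDS else "ID"
```

The DFA then needs only the identifier states; adding a keyword means adding one entry to the table.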

Scanner using FA -- Modified Example
- Example
  - T1 = abb*
  - T2 = ba
- Input: abbbbbba
  - Process abbbbbb; when the next a comes, what will happen?
    - Go back to state 1, move to state 2, and find that the input cannot be accepted
  - But this is an acceptable string with two tokens
- What can be done
  - Backtracking
  - Lookahead
[Figure: DFA for T1 and T2 with states 1 to 5]

Scanner using FA -- Another Example
- Example
  - T1 = abb*
  - T2 = ca
- Input: abbbbcacaabb
  - Process abbbb; when c comes, what to do?
  - It is a bit different from a regular FA
  - When there is no transition for the next input symbol
    - If at a final state, accept the partial string and return the corresponding token id
    - Go back to the initial state
[Figure: DFA for T1 and T2 with states 1 to 5]
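
The rule on this slide (accept at a final state when no transition exists, then restart) can be sketched as a loop. The state numbering, the table `DFA`, the map `ACCEPTING`, and the function `scan_simple` are assumptions for this illustration; the sketch also assumes the start state is not a final state.

```python
# Hypothetical DFA for T1 = abb* and T2 = ca (state numbers are an assumption).
DFA = {(1, "a"): 2, (2, "b"): 3, (3, "b"): 3, (1, "c"): 4, (4, "a"): 5}
ACCEPTING = {3: "T1", 5: "T2"}   # final state -> token id


def scan_simple(dfa, start, accepting, text):
    """Emit a token whenever the DFA is stuck in a final state, then restart."""
    tokens = []
    state, begin, pos = start, 0, 0
    while pos < len(text):
        key = (state, text[pos])
        if key in dfa:
            state = dfa[key]
            pos += 1
        elif state in accepting:
            # No transition, but at a final state: accept the partial string.
            tokens.append((accepting[state], text[begin:pos]))
            state, begin = start, pos       # go back to the initial state
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    if state in accepting:
        tokens.append((accepting[state], text[begin:pos]))
    elif state != start:
        raise ValueError("input ends in the middle of a token")
    return tokens
```

On the slide's input abbbbcacaabb this produces the tokens abbbb, ca, ca, and abb. It never backs up, so it is only sufficient when getting stuck at a non-final state really is an error.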

Scanner using FA -- Modified Example
- Example
  - T1 = abb*
  - T2 = bca
- Input: abbbbbbca
  - Process abbbbbb; when the next c comes, what will happen?
    - Go back to state 1 and find that the input cannot be accepted
  - But this is an acceptable string with two tokens
- What can be done
  - Backtracking
  - Lookahead
[Figure: DFA for T1 and T2 with states 1 to 6]

Scanner using FA -- Backtracking
- How will the scanner know where to stop and accept the token?
  - Backtracking
    - Mark the states that may require backtracking (shown as *)
    - After accepting, mark the location in the input
    - In case the next input is not acceptable, go back to the marked place
  - Input: abbbbbbca
    - ab*b*b*c -- cannot go further
    - Accept ab*b*b*b and process ‘c’ from the starting state, but this fails
    - Backtrack to the nearest *, accept ab*b*b, and try ‘bca’ from the starting state, which succeeds
  - Problem: could be costly; sometimes the scanner may need to backtrack a long way
[Figure: DFA for T1 = abb* and T2 = bca with states 1 to 6; state 3 is marked *]
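
The marking scheme can be sketched as a recursive scanner that records every accepting position and backs up to the nearest earlier mark when the rest of the input cannot be tokenized. The table `DFA`, the map `ACCEPTING`, the state numbers, and the function `scan` are assumptions for this example rather than the slides' implementation.

```python
# Hypothetical DFA for T1 = abb* and T2 = bca (state numbers are an assumption).
DFA = {(1, "a"): 2, (2, "b"): 3, (3, "b"): 3,
       (1, "b"): 4, (4, "c"): 5, (5, "a"): 6}
ACCEPTING = {3: "T1", 6: "T2"}   # final state -> token id


def scan(dfa, start, accepting, text, pos=0):
    """Return (token, lexeme) pairs covering text[pos:], or None if impossible."""
    if pos == len(text):
        return []
    # Run the DFA as far as it goes, marking each position where it could accept.
    marks = []
    state, i = start, pos
    while i < len(text) and (state, text[i]) in dfa:
        state = dfa[(state, text[i])]
        i += 1
        if state in accepting:
            marks.append((i, accepting[state]))
    # Try the longest match first; on failure go back to the nearest earlier mark.
    for end, token in reversed(marks):
        rest = scan(dfa, start, accepting, text, end)
        if rest is not None:
            return [(token, text[pos:end])] + rest
    return None
```

On abbbbbbca the longest candidate abbbbbb leaves ca, which fails, so the scanner backs up one mark and returns abbbbb followed by bca. In the worst case this search revisits the same input many times, which is the cost the slide warns about.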

Scanner using FA -- Lookahead
- How will the scanner know where to stop and accept the token?
  - Lookahead
    - Allow the user to specify a lookahead string
    - Accept only after the lookahead string matches
      - T1 = abb* /bc
      - T2 = bca
    - At state 3, when seeing b, always look ahead to determine whether to return the token before the current “b” or continue the b* loop
  - Input: abbbbbbca
    - abb: lookahead, no bc, continue
    - abbb: lookahead, no bc, continue
    - …
    - abbbbb: lookahead, there is bc, accept abbbbb
[Figure: DFA with states 1 to 6; state 3 carries the lookahead /bc]

Scanner using FA -- Lookahead
- How will the scanner know where to stop and accept the token?
  - Lookahead
  - Input: abbbcaabbca
    - abb: lookahead, there is bc, accept abb
    - Continue to accept bca
      - No need to look ahead
    - Continue to ab, lookahead, there is bc, accept ab
    - Continue to accept bca again
    - Tokens: abb, bca, ab, bca
[Figure: DFA with states 1 to 6; state 3 carries the lookahead /bc]

Scanner using FA -- Lookahead
- Another lookahead example
  - Input: abbba
    - abb: lookahead, there is ba, accept abb
    - Continue with ba and accept ba
  - Input: abbbab
    - abb: lookahead, there is ba, accept abb
    - Continue with ba and accept ba
    - How to handle the remaining b? Error!
    - Change the lookahead string to ba$ or baa
      - Only accept when seeing ba$ or baa
      - When seeing bab, do not accept
  - Input: abbbaba
    - The old lookahead string would work and the new set would not
  - abbbababa……ba or abbbababa……bab
    - It is not possible to know till the end of the input
    - No lookahead string of any fixed length can do it
[Figure: DFA with states 1 to 5; state 3 carries the lookahead /ba]

Scanner using FA -- White Space
- White space includes blank, tab, and newline
  - They are not tokens, but how to process them needs to be defined
- Use a branch in the NFA for white space processing
  - Simply skip the white space
  - (“ ” | “\n” | “\t”)+   {/* do nothing */}
[Figure: NFA branch from the start state that loops on blank, \t, and \n and does nothing]

Scanner using FA -- What’s different
- How the DFA for a scanner differs from a regular DFA
  - Needs to have a token marker for each final state
    - Final states for different tokens are distinguishable
  - Needs to have ambiguity resolution rules
  - DFA execution is different
    - Does not exhaust the entire input string
    - Accepts when the DFA cannot go further, breaking the input string there
    - Goes back to the starting state after accepting
    - Needs to backtrack or look ahead

Miscellaneous -- Language Design Issues
- In PL/1, ids can be keywords
  - if then = else; else = then;
  - Cannot resolve the ambiguity till parsing time
    - Unless lookahead rules are imposed
- In FORTRAN, blanks are ignored (not just skipped in FA)
  - do 10 i = 1, 25
  - do 10 i = 1.25 (is it keyword “do” or identifier “do10i”?) (= do10i = 1.25)
    - Similar problem, lookahead is necessary
- Make life easier: be strict?
  - All keywords start with a different character
  - All ids start with the character z
  - ……
  - Too many rules for the programmer to remember
  - If FA is used, then these specialized rules do not help
    - Sometimes the FA can be more complicated
    - E.g., the length of an id has to be <= 6 characters

Miscellaneous -- Regular Language
- A language that can be specified by a regular expression is called a regular language
- Can RE specify all languages?
  - No
  - But RE is quite powerful already
- Keeping counts
  - Famous example that RE cannot specify:
    - Matching ( and ): there can be the same number or more ( than ) in any prefix, but the same number of ( and ) in the entire string
    - L = { p^k q^k, for any k }
  - How about Σ = {0, 1}, L = { s | s has an even number of 0’s and an even number of 1’s }?

Miscellaneous -- Regular Language
- RE
  - Σ = {0, 1}
  - L = { s | s has an even number of 0’s and an even number of 1’s }
  - (00|11)*((01|10)(00|11)*(01|10)(00|11)*)*
- Specify using a grammar
  - E = 1 E | 0 E |
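
A regular expression for this language is easy to get subtly wrong, so a brute-force comparison against the definition is a useful sanity check. The sketch below assumes the expression (00|11)*((01|10)(00|11)*(01|10)(00|11)*)*, in which every 01 or 10 block is paired with a second one; the names `EVEN_EVEN`, `even_zeros_and_ones`, and `agrees_up_to` are made up for this example.

```python
import re
from itertools import product

# Assumed RE for "even number of 0's and even number of 1's".
EVEN_EVEN = re.compile(r"(00|11)*((01|10)(00|11)*(01|10)(00|11)*)*")


def even_zeros_and_ones(s):
    """Direct definition of the language L."""
    return s.count("0") % 2 == 0 and s.count("1") % 2 == 0


def agrees_up_to(n):
    """Check that the RE and the definition agree on every string of length <= n."""
    for length in range(n + 1):
        for chars in product("01", repeat=length):
            s = "".join(chars)
            if (EVEN_EVEN.fullmatch(s) is not None) != even_zeros_and_ones(s):
                return False
    return True
```

Such a check cannot prove the expression correct for all lengths, but it quickly exposes an expression that accepts a short string such as 01.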

Miscellaneous -- NFA or DFA for Scanner
- Processing input based on an NFA
  - May need to backtrack and end up exploring all paths in the NFA
- Converting the NFA to a DFA
  - Assume that the NFA has K states
  - The number of states of the corresponding DFA is bounded by 2^K
  - This can be very inefficient
  - But in practice, the number of states in the DFA is not much higher than the number of states in the original NFA
- Can be a tradeoff
  - Between space and time
  - Use an NFA to save space, use a DFA to save time
- Actually, a DFA is a special case of an NFA
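
One way to use an NFA directly, without backtracking and without building the DFA, is to track the set of states the NFA could be in after each character. This is a swapped-in technique for illustration, not something the slides describe; the function `nfa_accepts`, the dictionary encoding, and the ε-free NFA for (a|b)*abb are assumptions of the sketch.

```python
def nfa_accepts(nfa, start, finals, text):
    """Simulate an epsilon-free NFA by tracking the set of current states."""
    current = {start}
    for ch in text:
        current = {t for s in current for t in nfa.get((s, ch), ())}
        if not current:
            return False          # every path is stuck
    return bool(current & finals)


# Hypothetical epsilon-free NFA for (a|b)*abb with states 0 to 3.
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
```

Memory stays proportional to the K NFA states, while each input character costs work proportional to the size of the current set, which is the space-for-time tradeoff mentioned above.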

Lexical Analysis -- Summary
- Read Chapter 3 of the textbook
  - Except for 3.9
- REs → NFA → DFA → language processor
- DFA for scanner
  - Final state handling
  - Ambiguity resolution
  - Backtracking and lookahead
- Lexical analysis tool: lex