Lex a Lexical Analyzer Generator by M E

• Lex -- a Lexical Analyzer Generator (by M. E. Lesk and Eric. Schmidt) – Given tokens specified as regular expressions, Lex automatically generates a routine (yylex) that recognizes the tokens (and performs corresponding actions).

• Lex source program {definition} %% {rules} %% {user subroutines} Rules: <regular expression> <action> Each regular expression specifies a token. Default action for anything that is not matched: copy to the output Action: C/C++ code fragment specifying what to do when a token is recognized.

• lex program examples: ex 1. l and ex 2. l – ‘lex ex 1. l’ produces the lex. yy. c file that contains a routine yylex(). • The int yylex() routine is the scanner that finds all the regular expressions specified. – yylex() returns a non-zero value (usually token id) normally. – yylex() returns 0 when end of file is reached. – Need a drive to test the routine. Main. cpp is an example. • You need to have a yywrap() function in the lex file (return 1), see the function in ex 1. l. – Something to do with compiling multiple files.

• Lex regular expression: contains text characters and operators. – Letters of alphabet and digits are always text characters. • Regular expression integer matches the string “integer” – Operators: “[]^-? . *+|()$/{}%<> • When these characters happen in a regular expression, they have special meanings – Lex regular expressions cannot have space in them!!!

$– operators (characters that have special meanings): “[]^-? . *+|()$/{}%<> • ‘*’, ‘+’, ‘|’,$

– operators (characters that have special meanings): “[]^-? . *+|()$/{}%<> • ‘*’, ‘+’, ‘|’, ‘(‘, ’)’ -- used in regular expression • ‘ “ ‘ -- any character in between quote is a text character. – E. g. : “xyz++” == xyz”++” • ‘’ -- escape character, – To get the operators back: “xyz++” == ? ? – To specify special characters: 40 == “ “ • ‘[‘ and ‘]’ -- used to specify a set of characters – e. g: [a-z], [a-z. A-Z], – Every character in it except ^, - and is a text character – [-+0 -9], [40 -176] • ‘^’ -- not, used as the first character after the left bracket – E. g [^abc] – any character except ‘a’, ‘b’ or ‘c’. – [^a-z. A-Z] -- ? ?

$– operators (characters that have special meanings): “[]^-? . *+|()$/{}%<> • ‘. ’ --$

– operators (characters that have special meanings): “[]^-? . *+|()$/{}%<> • ‘. ’ -- every character • ‘? ’ -- optional ab? c matches ‘ac’ or ‘abc’ • ‘/’ -- used in character lookahead: – e. g. ab/cd -- matches ab only if it is followed by cd • ‘{‘’}’ -- enclose a regular definition • ‘%’ -- has special meaning in lex • ‘$’ -- match the end of a line, ‘^’ -- match the beginning of a line – ab$ == ab/n • ‘<‘ ‘>’: start condition (more context sensitivity support, see the paper for details).

– Order of pattern matching: • Always matches the longest pattern. • When multiple patterns matches, use the first pattern. – To override, add “REJECT” in the action. . %% Ab Abc {letter}{letter|digit}* {printf(“rule 1n”); } {printf(“rule 2n”); } {printf(“rule 3n”); } %% Input: Abc What happened when at ‘. *’ as a pattern? Should regular expressions for reserved words happen before or after the regular expression for identifier?

– Manipulate the lexeme and/or the input stream: • yytext -- a char pointer pointing to the matched C string • yyleng -- the length of the matched string • I/O routines to manipulate the input stream: – yyinput() -- get a character from the input character, return <=0 when reaching the end of the input stream, the character otherwise » yyinput() for c++, input() for c. – unput( c ) -- put c back onto the input stream – Deal with comments: (/* …. . */ » Why is pattern “/*”. *”*/” a problem? %% … “/*” {char c 1; c 2 = yyinput(); if (c 2 <=0) {lex_error(“unfinished comment” …} else { c 1 = c 2; c 2 = yyinput(); while (((c 1!=‘*’) || (c 2 != ‘/’)) && (c 2 > 0)) {c 1 = c 2; c 2 = yyinput(); } if (c 2 <= 0) {lex_error( …. ) }

– Reporting errors: • What kind of errors? Not too many. – Characters that cannot lead to a token – unended comments (can we do it in later phases? ) – unended string constants.

– Reporting errors: • How to keep track of current position (which line, which column)? – Use to global variable for this: yyline, yycolumn %{ int yyline = 1, yycolumn = 1; %}. . . %% “n” {yyline++; yycolumn = 0; } [ t]+ {/* do nothing*/ yycolumn += yyleng; } If {yycolumn += yyleng; return (IFNumber); } “+” {yycolumn += 1; return (PLUSNumber); } {letter}{letter|digit}* {yylval = idtable_insert(yytext); yycolumn += yyleng; return(IDNumber); }. . . %%

• Dealing with identifiers, string constants. – Data structures: • Put the lexeme in a string table: e. g. vector of C strings. • See token. l – Recognizing constant strings with special characters • Assuming string cannot pass line boundary. • Use yymore() “[^”n]* {char c; c = yyinput(); if (c != ‘”’) error else if (yytext[yyleng-1] == ‘\’) { unput( c ); yymore(); } else {/* find the whole string, normal process*/}

Put it all together • Checkout token. l and main. cpp in lex 2. tar