Lexical Analysis (I)
Compiler
Baojian Hua
bjhua@ustc.edu.cn
Compiler
    source program -> compiler -> target program
Front and Back Ends
    source program -> front end -> IR -> back end -> target program
Front End
    source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR
Lexical Analyzer
- The lexical analyzer translates the source program into a stream of lexical tokens
- Source program:
    - a stream of characters
    - varies from language to language (ASCII, Unicode, or …)
- Lexical token:
    - a compiler-internal data structure that represents the occurrence of a terminal symbol
    - varies from compiler to compiler
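To make "compiler-internal data structure" concrete, here is a minimal sketch of one possible token representation. The names `Kind` and `Token`, and the fields `lexeme` and `line`, are assumptions chosen for illustration; as the slide says, the actual layout varies from compiler to compiler.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Kind(Enum):
    """Hypothetical token kinds; a real language has dozens."""
    IF = auto()
    IDENT = auto()
    INT = auto()
    EOF = auto()


@dataclass(frozen=True)
class Token:
    """One occurrence of a terminal symbol in the source program."""
    kind: Kind
    lexeme: Optional[str] = None  # matched text, kept only where it matters
    line: int = 0                 # source position, useful for error messages

    def __str__(self) -> str:
        if self.lexeme is None:
            return self.kind.name
        return f"{self.kind.name}({self.lexeme})"


tokens = [
    Token(Kind.IF, line=1),
    Token(Kind.IDENT, "x", 1),
    Token(Kind.INT, "5", 1),
    Token(Kind.EOF, line=1),
]
print(" ".join(str(t) for t in tokens))
```

Keeping the kind separate from the lexeme lets the parser dispatch on the kind alone, while later phases can still read the identifier name or literal value.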
Conceptually
    character sequence -> lexical analyzer -> token sequence
Example
    if (x > 5)
        y = "hello";
    else
        z = 1;

    (lexical analysis)

    IF LPAREN IDENT(x) GT INT(5) RPAREN IDENT(y) ASSIGN STRING("hello") SEMICOLON ELSE IDENT(z) ASSIGN INT(1) SEMICOLON EOF
Lexer Implementation
- Options:
    - Write a lexer by hand from scratch
        - boring, error-prone, and too much work
        - see Dragon Book section 3.4 for the algorithm
        - nevertheless, many compilers use this approach
            - gcc, llvm, and the Tiger you’ll build, …
    - Automatic lexer generator
        - quick, easy, and pleasant; good for fast prototyping
- We start with the first approach, and come to the latter one later
Token, Pattern, and Lexeme
- Token: think “kind”
    - e.g., in C, we have identifiers, integers, floats, …
- Pattern: think “form”
    - e.g., in C, an identifier starts with a letter, followed by zero or more letters or digits
- Lexeme: think “one instance”
    - e.g., “tiger” is a valid C identifier
So
- To hack a lexer, one must:
    - Specify all possible tokens
        - dozens in a practical language
    - Specify the pattern for each kind of token
        - a little hard
    - Write code to recognize them
        - standard algorithms
- For the second purpose, one needs a little math: regular expressions
Basic Definitions
- Alphabet: the character set
    - e.g., ASCII or Unicode, or …
- String: a finite sequence of characters from the alphabet
    - e.g., hello
- Language: a set of strings
    - finite or infinite
    - say, the C language
Regular Expression (RE)
- Construction by induction
    - each c in the alphabet
        - {c}
    - empty eps
        - {eps}, the set containing only the empty string
    - for M and N, then M|N
        - (a|b) = {a, b}
    - for M and N, then MN
        - (a|b)(c|d) = {ac, ad, bc, bd}
    - for M, then M* (Kleene closure)
        - (a|b)* = {eps, a, aa, b, abb, baa, …}
Or more formally
    e -> eps
       | c
       | e | e
       | e e
       | e*
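The inductive definition can be read directly as a recursive membership test. The sketch below is an illustration only: the class names `Eps`, `Chr`, `Alt`, `Cat`, `Star` and the function `matches` are hypothetical, and the brute-force splitting is far too slow for a real lexer. It is meant to mirror the five cases, not to show how lexers are built.

```python
from dataclasses import dataclass


class RE:
    """Base class for the five core regular-expression forms."""


@dataclass(frozen=True)
class Eps(RE):      # eps: matches only the empty string
    pass


@dataclass(frozen=True)
class Chr(RE):      # c: matches the one-character string c
    c: str


@dataclass(frozen=True)
class Alt(RE):      # M|N
    left: RE
    right: RE


@dataclass(frozen=True)
class Cat(RE):      # MN
    left: RE
    right: RE


@dataclass(frozen=True)
class Star(RE):     # M*
    body: RE


def matches(e: RE, s: str) -> bool:
    """Return True if the string s belongs to the language denoted by e."""
    if isinstance(e, Eps):
        return s == ""
    if isinstance(e, Chr):
        return s == e.c
    if isinstance(e, Alt):
        return matches(e.left, s) or matches(e.right, s)
    if isinstance(e, Cat):
        # Try every way of splitting s into a prefix and a suffix.
        return any(matches(e.left, s[:i]) and matches(e.right, s[i:])
                   for i in range(len(s) + 1))
    if isinstance(e, Star):
        if s == "":
            return True
        # Peel off a non-empty prefix so that the recursion terminates.
        return any(matches(e.body, s[:i]) and matches(e, s[i:])
                   for i in range(1, len(s) + 1))
    raise TypeError(f"not a regular expression: {e!r}")


a_or_b = Alt(Chr("a"), Chr("b"))
c_or_d = Alt(Chr("c"), Chr("d"))
print(matches(Cat(a_or_b, c_or_d), "ad"))   # (a|b)(c|d) on "ad"
print(matches(Star(a_or_b), "abba"))        # (a|b)* on "abba"
```

Each branch of `matches` corresponds to one alternative of the grammar, which is why the derived forms introduced later need no new cases.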
Example
- C’s identifier:
    - starts with a letter (“_” counts as a letter)
    - followed by zero or more letters or digits
    - e.g., written in a stepwise way:
        (…)
        (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)*
- It’s tedious and error-prone …
Syntax Sugar
- We introduce some abbreviations:
    - [a-z] == a|b|…|z
    - e+ == one or more of e
    - e? == zero or one of e
    - "a*" == a* itself, not a’s Kleene closure
    - e{i, j} == at least i and at most j of e
    - . == any character except '\n'
- All these can be represented by the core RE
    - i.e., they are derived forms
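One way to convince yourself that these abbreviations are derived forms is to compare each one with a hand-written core expansion on every short string. The sketch below uses Python's `re` module as a stand-in for the slide's notation, which is an assumption: its dialect is close to but not identical with lex-style syntax, for example it escapes with a backslash rather than with double quotes. The names `PAIRS` and `same_language` are hypothetical.

```python
import re
from itertools import product

# (abbreviation, expansion using only |, concatenation, * and the empty string)
PAIRS = [
    ("[a-c]", "a|b|c"),
    ("a+", "aa*"),
    ("a?", "(|a)"),
    ("a{1,2}", "a|aa"),
]


def same_language(sugar: str, core: str, alphabet: str = "abc", max_len: int = 3) -> bool:
    """Check that both patterns accept exactly the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            in_sugar = re.fullmatch(sugar, s) is not None
            in_core = re.fullmatch(core, s) is not None
            if in_sugar != in_core:
                return False
    return True


for sugar, core in PAIRS:
    print(sugar, "==", core, same_language(sugar, core))

# A quoted operator stands for itself: re.escape plays the role of the quotes.
print(re.fullmatch(re.escape("a*"), "a*") is not None)

# "." matches any single character except the newline.
print(re.fullmatch(".", "x") is not None, re.fullmatch(".", "\n") is None)
```

The check is bounded, so it is evidence rather than a proof; the equivalences themselves follow from the definitions of the core operators.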
Example Revisited
- C’s identifier:
    - starts with a letter (“_” counts as a letter)
    - followed by zero or more letters or digits
        (…)
        (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)*
        [_a-zA-Z][_a-zA-Z0-9]*
- What about the keyword “if”?
Ambiguous Rule
- A single RE is not ambiguous
- But in a language, there may be many REs:
    - [_a-zA-Z][_a-zA-Z0-9]*
    - (i)(f)
- So, for a given string “if”, which RE to match?
Ambiguous Rule
- Two conventions:
    - Longest match: the regular expression that matches the longest string takes precedence.
    - Rule priority: associate a priority with each RE. If two regular expressions match the same (longest) string, the RE with the higher priority takes precedence.
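The two conventions can be sketched as a small rule table that is scanned in priority order. The table `RULES` and the function `next_token` are hypothetical names, and the sketch assumes that earlier rules have higher priority. It also leans on Python's `re`, whose greedy matching happens to return the longest match for these particular patterns; that is not guaranteed for arbitrary patterns with alternation.

```python
import re

# Rules listed from highest to lowest priority.
RULES = [
    ("IF", re.compile(r"if")),
    ("IDENT", re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")),
    ("INT", re.compile(r"[0-9]+")),
]


def next_token(text: str, pos: int = 0):
    """Return (rule name, lexeme) for the token starting at pos, or None."""
    best = None  # (rule name, match length)
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m is None:
            continue
        length = m.end() - pos
        # Longest match wins; the strict ">" keeps the earlier,
        # higher-priority rule when two matches have the same length.
        if best is None or length > best[1]:
            best = (name, length)
    if best is None:
        return None
    name, length = best
    return name, text[pos:pos + length]


print(next_token("if"))     # both rules match 2 characters, priority decides
print(next_token("iffy"))   # IDENT matches 4 characters, longest match decides
```

With this ordering, "if" is reported as the keyword while "iffy" and "if8" stay identifiers, which is the behavior the two conventions are meant to produce.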
Transition Diagram algorithm
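The diagram itself did not survive in this text, so the following is only a sketch of how a hand-written, transition-diagram style lexer might look, not the diagram from the slide. Each branch of the `if` chain plays the role of a start-state edge, and each inner `while` loop is a self-loop on an accepting state. The names `lex`, `KEYWORDS` and `PUNCT` are hypothetical, tokens are shown as plain strings for brevity, and string literals are assumed to contain no escape sequences.

```python
KEYWORDS = {"if": "IF", "else": "ELSE"}
PUNCT = {"(": "LPAREN", ")": "RPAREN", ">": "GT", "=": "ASSIGN", ";": "SEMICOLON"}


def is_letter(c: str) -> bool:
    return c == "_" or "a" <= c <= "z" or "A" <= c <= "Z"


def is_digit(c: str) -> bool:
    return "0" <= c <= "9"


def lex(src: str) -> list:
    """Turn a character sequence into a list of token strings ending in EOF."""
    tokens = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c in " \t\r\n":                 # skip whitespace
            i += 1
        elif is_letter(c):                 # identifier or keyword
            start = i
            while i < n and (is_letter(src[i]) or is_digit(src[i])):
                i += 1                     # self-loop: consume letters and digits
            word = src[start:i]
            # Keywords take priority over identifiers of the same length.
            tokens.append(KEYWORDS.get(word, f"IDENT({word})"))
        elif is_digit(c):                  # integer literal
            start = i
            while i < n and is_digit(src[i]):
                i += 1
            tokens.append(f"INT({src[start:i]})")
        elif c == '"':                     # string literal, no escapes handled
            end = src.find('"', i + 1)
            if end == -1:
                raise SyntaxError(f"unterminated string at offset {i}")
            tokens.append(f"STRING({src[i:end + 1]})")
            i = end + 1
        elif c in PUNCT:                   # single-character operators
            tokens.append(PUNCT[c])
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r} at offset {i}")
    tokens.append("EOF")
    return tokens


print(" ".join(lex('if (x > 5) y = "hello"; else z = 1;')))
```

Consuming as many letters and digits as possible before checking the keyword table gives longest match and rule priority without any explicit bookkeeping.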
Lab 1 of Tiger
- In Lab 1 of the Tiger compiler, you’re required to write a lexer by hand
    - use the transition-diagram algorithm as described above
    - start early