Lexical Analysis (I)
Compiler
Baojian Hua
bjhua@ustc.edu.cn

Compiler
source program -> compiler -> target program

Front and Back Ends
source program -> front end -> IR -> back end -> target program

Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR

Lexical Analyzer
- The lexical analyzer translates the source program into a stream of lexical tokens
- Source program:
  - a stream of characters
  - the character set varies from language to language (ASCII, Unicode, or …)
- Lexical token:
  - a compiler-internal data structure that represents the occurrence of a terminal symbol
  - the representation varies from compiler to compiler
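
To make "compiler-internal data structure" concrete, here is a minimal sketch of one possible token representation. The names `Kind` and `Token` and the `line` field are assumptions made for illustration; they are not taken from the Tiger code base.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Kind(Enum):
    # Only a few token kinds are listed here; a real language has dozens.
    IF = auto()
    IDENT = auto()
    INT = auto()
    EOF = auto()


@dataclass(frozen=True)
class Token:
    kind: Kind                    # which terminal symbol occurred
    lexeme: Optional[str] = None  # source text, for kinds that carry one
    line: int = 0                 # position info (assumed useful for error messages)

    def __str__(self) -> str:
        if self.lexeme is None:
            return self.kind.name
        return f"{self.kind.name}({self.lexeme})"


tokens = [Token(Kind.IF), Token(Kind.IDENT, "x"), Token(Kind.INT, "5"), Token(Kind.EOF)]
print(" ".join(str(t) for t in tokens))  # IF IDENT(x) INT(5) EOF
```

Other compilers may store more (column, file name, parsed value) or less; this is only one plausible layout.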

Conceptually
character sequence -> lexical analyzer -> token sequence

Example
Source:
  if (x > 5) y = "hello"; else z = 1;
After lexical analysis:
  IF LPAREN IDENT(x) GT INT(5) RPAREN IDENT(y) ASSIGN STRING("hello") SEMICOLON ELSE IDENT(z) ASSIGN INT(1) SEMICOLON EOF

Lexer Implementation
- Options:
  - Write a lexer by hand from scratch
    - boring, error-prone, and too much work
    - see dragon-book section 3.4 for the algorithm
    - nevertheless, many compilers use this approach
      - gcc, llvm, and the Tiger you’ll build, …
  - Automatic lexer generator
    - quick, easy and happy, fast prototype
- We start with the first approach, and come to the latter one later

Token, Pattern, and Lexeme
- Token: think “kind”
  - e.g., in C, we have identifiers, integers, floats, …
- Pattern: think “form”
  - e.g., in C, an identifier starts with a letter, followed by zero or more letters or digits
- Lexeme: think “one instance”
  - e.g., “tiger” is a valid C identifier
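
The three notions can be illustrated with a small sketch using Python's `re` module. The table `PATTERNS` and the helper `kinds_of` are hypothetical names introduced here: each key is a token kind, each value is its pattern, and the argument passed to `kinds_of` is a candidate lexeme.

```python
import re
from typing import List

# token kind -> pattern (the "form" of that kind)
PATTERNS = {
    "IDENT": r"[_a-zA-Z][_a-zA-Z0-9]*",
    "INT": r"[0-9]+",
}


def kinds_of(lexeme: str) -> List[str]:
    """Return every token kind whose pattern matches the whole lexeme."""
    return [kind for kind, pat in PATTERNS.items() if re.fullmatch(pat, lexeme)]


print(kinds_of("tiger"))   # ['IDENT']
print(kinds_of("42"))      # ['INT']
print(kinds_of("4tiger"))  # []
```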

So
- To hack a lexer, one must:
  - Specify all possible tokens
    - dozens in a practical language
  - Specify the pattern for each kind of token
    - a little hard
  - Write code to recognize them
    - standard algorithms
- For the second purpose, one needs a little math: regular expressions

Basic Definitions
- Alphabet: the character set
  - e.g., ASCII or Unicode, or …
- String: a finite sequence of characters from the alphabet
  - e.g., hello
- Language: a set of strings
  - finite or infinite
  - say, the C language
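
A small sketch of these definitions, assuming a lowercase-only alphabet for brevity. The names `ALPHABET`, `is_string_over`, `FINITE_LANGUAGE`, and `in_even_length_language` are made up for this example.

```python
ALPHABET = set("abcdefghijklmnopqrstuvwxyz")


def is_string_over(s: str, alphabet: set) -> bool:
    """A string is a finite sequence of characters drawn from the alphabet."""
    return all(ch in alphabet for ch in s)


# A finite language can be listed outright ...
FINITE_LANGUAGE = {"if", "else", "while"}


# ... while an infinite one is described by a membership test instead.
def in_even_length_language(s: str) -> bool:
    return is_string_over(s, ALPHABET) and len(s) % 2 == 0


print(is_string_over("hello", ALPHABET))   # True
print("if" in FINITE_LANGUAGE)             # True
print(in_even_length_language("hello"))    # False
```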

Regular Expression (RE)
- Construction by induction
  - each c in the alphabet
    - {c}
  - the empty string eps
    - {eps} (the set containing only the empty string)
  - for M and N, then M|N
    - (a|b) = {a, b}
  - for M and N, then MN
    - (a|b)(c|d) = {ac, ad, bc, bd}
  - for M, then M* (Kleene closure)
    - (a|b)* = {eps, a, aa, b, abb, baa, …}
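
The inductive construction can be checked by computing the languages as Python sets. This is only a sketch: the function names are invented here, and `star` is cut off after `max_reps` repetitions because the true Kleene closure is infinite.

```python
def union(m, n):
    """M|N: strings in either language."""
    return m | n


def concat(m, n):
    """MN: every string of M followed by every string of N."""
    return {x + y for x in m for y in n}


def star(m, max_reps):
    """M*, truncated to at most max_reps repetitions."""
    result = {""}   # zero repetitions: the empty string eps
    layer = {""}
    for _ in range(max_reps):
        layer = concat(layer, m)
        result |= layer
    return result


a, b, c, d = {"a"}, {"b"}, {"c"}, {"d"}
print(sorted(union(a, b)))                       # ['a', 'b']
print(sorted(concat(union(a, b), union(c, d))))  # ['ac', 'ad', 'bc', 'bd']
print(sorted(star(union(a, b), 2)))              # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```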

Or more formally
  e -> eps
     | c
     | e | e
     | e e
     | e*
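
The grammar maps directly onto a small tree data type. The sketch below adds a matcher based on Brzozowski derivatives, a technique that is not from these slides and was chosen only because it fits in a few lines. All class and function names are assumptions, and `Empty` is an extra helper node that is not part of the grammar.

```python
from dataclasses import dataclass


class Re:
    pass


@dataclass(frozen=True)
class Empty(Re):   # matches nothing (helper, not part of the grammar)
    pass


@dataclass(frozen=True)
class Eps(Re):     # eps
    pass


@dataclass(frozen=True)
class Chr(Re):     # c
    c: str


@dataclass(frozen=True)
class Alt(Re):     # e | e
    left: Re
    right: Re


@dataclass(frozen=True)
class Seq(Re):     # e e
    first: Re
    second: Re


@dataclass(frozen=True)
class Star(Re):    # e*
    body: Re


def nullable(e: Re) -> bool:
    """True if e matches the empty string."""
    if isinstance(e, (Eps, Star)):
        return True
    if isinstance(e, Alt):
        return nullable(e.left) or nullable(e.right)
    if isinstance(e, Seq):
        return nullable(e.first) and nullable(e.second)
    return False  # Empty, Chr


def deriv(e: Re, ch: str) -> Re:
    """The RE matching whatever may follow ch in a string matched by e."""
    if isinstance(e, Chr):
        return Eps() if e.c == ch else Empty()
    if isinstance(e, Alt):
        return Alt(deriv(e.left, ch), deriv(e.right, ch))
    if isinstance(e, Seq):
        head = Seq(deriv(e.first, ch), e.second)
        return Alt(head, deriv(e.second, ch)) if nullable(e.first) else head
    if isinstance(e, Star):
        return Seq(deriv(e.body, ch), e)
    return Empty()  # Empty, Eps


def matches(e: Re, s: str) -> bool:
    for ch in s:
        e = deriv(e, ch)
    return nullable(e)


ab_star = Star(Alt(Chr("a"), Chr("b")))         # (a|b)*
print(matches(ab_star, "abba"))                  # True
print(matches(ab_star, "abc"))                   # False
print(matches(Seq(Chr("i"), Chr("f")), "if"))    # True
```

The derivative trees grow with the input because nothing is simplified; that is acceptable for a demonstration but not for a production lexer.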

Example
- C’s identifier:
  - starts with a letter (“_” counts as a letter)
  - followed by zero or more letters or digits
  - e.g., write it in a stepwise way:
    (…)
    (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)*
- It’s tedious and error-prone …

Syntax Sugar
- We introduce some abbreviations:
  - [a-z] == a|b|…|z
  - e+ == one or more of e
  - e? == zero or one of e
  - "a*" == a* itself, not a’s Kleene closure
  - e{i, j} == at least i and at most j occurrences of e
  - . == any character except for '\n'
- All these can be represented by the core RE
  - i.e., they are derived forms
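
That these abbreviations are derived forms can be spot-checked with Python's `re` module. In this sketch each sugared pattern is paired with a hand-written expansion that uses only alternation, concatenation, `*`, and an empty alternative standing in for eps; the two are then compared on all short strings over {a, b, c}. The names `PAIRS`, `SAMPLES`, and `same_language` are invented for the example, and agreeing on samples is evidence rather than proof.

```python
import re
from itertools import product

# (sugared form, expansion into the core operators)
PAIRS = [
    (r"[a-c]", r"(?:a|b|c)"),
    (r"a+", r"aa*"),
    (r"a?", r"(?:a|)"),   # "a or eps": the empty alternative plays the role of eps
]

# All strings over {a, b, c} of length 0..4 serve as test inputs.
SAMPLES = ["".join(p) for n in range(5) for p in product("abc", repeat=n)]


def same_language(sugar: str, core: str) -> bool:
    """Compare the two patterns on every sample string."""
    return all(
        bool(re.fullmatch(sugar, s)) == bool(re.fullmatch(core, s))
        for s in SAMPLES
    )


for sugar, core in PAIRS:
    print(sugar, "==", core, same_language(sugar, core))
```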

Example Revisited
- C’s identifier:
  - starts with a letter (“_” counts as a letter)
  - followed by zero or more letters or digits
    (…)
    (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)*
    [_a-zA-Z][_a-zA-Z0-9]*
- What about the keyword “if”?

Ambiguous Rule
- A single RE is not ambiguous
- But in a language, there may be many REs:
  - [_a-zA-Z][_a-zA-Z0-9]*
  - (i)(f)
- So, for a given string “if”, which RE should match?

Ambiguous Rule
- Two conventions:
  - Longest match: the regular expression that matches the longest string takes precedence.
  - Rule priority: associate a priority with each RE. If two regular expressions match the same (longest) string, the RE with the higher priority takes precedence.
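
The two conventions can be sketched as follows, assuming priority is given by a rule's position in a list (earlier means higher). `RULES` and `next_token` are hypothetical names, and Python's `re` stands in for a real matching engine.

```python
import re

# Rules in priority order: earlier rules win ties.
RULES = [
    ("IF", re.compile(r"if")),
    ("IDENT", re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")),
    ("INT", re.compile(r"[0-9]+")),
]


def next_token(text: str, pos: int):
    """Return (kind, lexeme) for the best rule at pos, or None if no rule matches."""
    best = None
    for kind, pattern in RULES:
        m = pattern.match(text, pos)
        # A strictly longer match wins; on equal length the earlier rule is kept.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best


print(next_token("if", 0))     # ('IF', 'if')       tie broken by rule priority
print(next_token("iffy", 0))   # ('IDENT', 'iffy')  longest match wins
print(next_token("if8", 0))    # ('IDENT', 'if8')
```

Listing the keyword rule before the identifier rule is what makes “if” come out as a keyword; swapping the two would turn every keyword into an identifier.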

Transition Diagram algorithm
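
The diagram for this slide is not available here, so what follows is only a sketch of how a transition diagram is commonly hand-coded, not necessarily the exact algorithm presented in the lecture. Each branch of the main loop plays the role of one state, and each inner `while` is that state's self-loop. The token names follow the earlier example slide; `KEYWORDS`, `PUNCT`, and `lex` are assumed names, and string literals are handled without escape sequences.

```python
KEYWORDS = {"if": "IF", "else": "ELSE"}
PUNCT = {"(": "LPAREN", ")": "RPAREN", ">": "GT", "=": "ASSIGN", ";": "SEMICOLON"}


def is_letter(ch: str) -> bool:
    return ch == "_" or ("a" <= ch <= "z") or ("A" <= ch <= "Z")


def is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def lex(src: str):
    """Scan src and return a list of token strings, ending with EOF."""
    tokens = []
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        if ch in " \t\r\n":      # start state: skip whitespace
            i += 1
        elif is_letter(ch):      # state: inside an identifier or keyword
            start = i
            while i < n and (is_letter(src[i]) or is_digit(src[i])):
                i += 1
            word = src[start:i]
            tokens.append(KEYWORDS.get(word, f"IDENT({word})"))
        elif is_digit(ch):       # state: inside an integer literal
            start = i
            while i < n and is_digit(src[i]):
                i += 1
            tokens.append(f"INT({src[start:i]})")
        elif ch == '"':          # state: inside a string literal (no escapes)
            end = src.find('"', i + 1)
            if end < 0:
                raise SyntaxError(f"unterminated string at offset {i}")
            tokens.append(f"STRING({src[i:end + 1]})")
            i = end + 1
        elif ch in PUNCT:        # single-character tokens
            tokens.append(PUNCT[ch])
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at offset {i}")
    tokens.append("EOF")
    return tokens


print(" ".join(lex('if (x > 5) y = "hello"; else z = 1;')))
```

Scanning a whole word first and then looking it up in `KEYWORDS` is one simple way to get both longest match and keyword priority without a separate state per keyword.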

Lab 1 of Tiger
- In lab 1 of the Tiger compiler, you’re required to write a lexer by hand
  - use the transition-diagram algorithm as described above
  - start early