Programming Language Lexical Structure
• Tokens generally fall into one of a small number of categories:
  – reserved words (keywords: if while do)
  – literals (constants: 3.14159 "hello world")
  – special symbols (; == :=)
  – identifiers (x principal interest_rate)
• Although a language's grammar may guarantee that no ambiguities can arise, most programming languages nowadays use white space to separate tokens.
• FORTRAN made no requirements on the use of white space within statements. Note the well-known DO 10 I problem: because blanks are ignored, DO 10 I = 1.5 assigns 1.5 to a variable named DO10I, while DO 10 I = 1,5 begins a loop.
Lecture 5 PLP Spring 2004, UF CISE

Alphabets, Strings, Languages
• An alphabet is a finite set of symbols.
• A string is a finite sequence of symbols drawn from some alphabet.
• A language is a (possibly infinite) set of strings.
• A regular expression is a formula describing a language, subject to the following constraints:

Regular Expressions over alphabet A
• ∅ is a regular expression denoting the empty language ∅.
• ε is a regular expression denoting the language containing the empty string: {ε}.
• If a ∈ A, then a is a regular expression denoting the language containing the string a: {a}.
• If R, R1, and R2 are regular expressions denoting L, L1, and L2 respectively, then
  – (R1+R2) is a regular expression denoting L1 ∪ L2
  – (R1R2) is a regular expression denoting L1L2, where L1L2 denotes the set of strings xy such that x ∈ L1 and y ∈ L2
  – (R*) is a regular expression denoting $\bigcup_{i=0}^{\infty} L^i$, where $L^0$ denotes {ε} and, if i > 0, $L^i$ denotes $L L^{i-1}$

Sample Regular Expressions
• cat+dog
• a*
• aa*
• (a+b)*b*
• (1+2+3+4+5+6+7+8+9)(0+1+2+3+4+5+6+7+8+9)*

Finite Automata
• A finite automaton is a machine described by a 5-tuple (Q, Σ, δ, q0, F), where
  – Q is a finite set of states
  – Σ is a finite set of symbols (the alphabet)
  – δ : Q × Σ → Q is the transition function
  – q0 ∈ Q is the initial state
  – F ⊆ Q is the set of final states.
• OK, so what's this all about?

Deterministic Finite Automata
[Diagram: states q1 and q2, with a transition labeled a from q1 to q2 and a transition labeled b from q2 to q1; q1 is the start state and q2 is a final state]
({q1, q2}, {a, b}, {((q1, a), q2), ((q2, b), q1)}, q1, {q2})
• States: denoted by circles
• Alphabet: labeling transitions (arrows)
• Transition function: specifying the termini and labels of the transitions
• Start state: denoted with an unterminated entering arrow
• Final states: denoted with double circles

Language Accepted by a Finite Automaton
• The language accepted by a finite automaton is the set of strings x such that, starting in the start state, one can follow transitions labeled by the symbols in x (in order) and end in a final state.
• If we allow transitions to be labeled by ε, and we can follow an ε-transition at any time (without matching a character in x), then we say such a machine is nondeterministic (we can't determine which state should be chosen at any given time).
• In fact, these nondeterministic finite automata are equivalent to deterministic ones: both kinds of machine accept exactly the same class of languages.

Any Regular Expression Can be Converted into an ε-Nondeterministic Finite Automaton
[Diagrams: ε-NFA constructions for ∅, ε, a, R1+R2, R1R2, and R1*, in which the automata for R1 (states q1 ... qm) and R2 (states q2 ... qn) are joined by ε-transitions, with new start and final states qs and qt where needed]

Flex
• Flex (http://www.cise.ufl.edu/~jnw/COP5555/References/flex.htm) is a tool (based on a famous tool named lex) for constructing lexical analyzers.
• Flex accepts specifications that use a form of regular expression to describe the tokens of a language, and then converts these specifications into a program (automaton) for recognizing the tokens.

Flex Specifications
• General form of a flex program:
    definitions
    %%
    rules
    %%
    user code (often a main program)
• Each rule consists of a pattern (regular expression), followed by spaces and/or tabs, followed by a C++ statement (usually a block surrounded by braces {}).
• During execution, when the function yylex is called, flex looks at whatever input remains to find the longest initial sequence of characters that matches a pattern. It then executes the action associated with that pattern. If there is no action associated with the pattern, or if the action does not contain a return statement, flex goes on to look for the next match in the remaining input. If the designer wants the lexical analyzer to return a token value, then the designer must include a return statement in the action.
• Flex's patterns, though somewhat more general than the regular expressions I've discussed, are quite similar. They provide some useful shortcuts.

Flex Patterns (Consult the manual for details)
• x matches the character x.
• Each of [xyz], [0-9], and [A-Za-z] matches any character in the specified class of characters. Ranges written with - are completed using the ASCII collating sequence.
• [^A-Za-z] represents the complement of a character class.
• Unescaped quotation marks cause the flex operator symbols between them to be treated literally: "[xyz]\"foo" represents the string [xyz]"foo.
• r* represents the closure operation of regular expressions (zero or more occurrences of r).
• r+ represents one or more occurrences of r.
• r? represents zero or one occurrence of r.
• r|s represents alternation (either r or s).
• . (a period) matches any single character except a newline.

Programming Assignment
• Your first programming assignment is to consult the document http://www.cise.ufl.edu/~jnw/COP5555/References/garnet.html, describing the Garnet language, mine it for tokens, and construct a lexical analyzer for the language using flex and C++.

Things you Need to Know About Flex Definitions
• The definitions portion of a flex specification (the first part) can contain character class definitions and C or C++ code enclosed in the delimiters %{ and %}, each of which must appear at the beginning of a line. I recommend you avoid character class definitions at first, then add them later to make your rules simpler.

/* scanner for a toy Pascal-like language */
%{
/* needed for the calls to atoi() and atof() below */
#include <cstdlib>
/* needed for cout and endl in the actions */
#include <iostream>
using namespace std;
%}
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%

Things you Need to Know About Flex Rules
• In the rules section, white space matters! Rules start with a pattern at the beginning of the line. They continue with an action statement. If the action runs multiple lines, then the following lines are indented.

{DIGIT}+    {
            cout << "An integer: " << yytext
                 << " (" << atoi(yytext) << ")" << endl;
            }
{DIGIT}+"."{DIGIT}*    {
            cout << "A float: " << yytext
                 << " (" << atof(yytext) << ")" << endl;
            }
if|then|begin|end|procedure|function    {
            cout << "A keyword: " << yytext << endl;
            }
{ID}        cout << "An identifier: " << yytext << endl;
"+"|"-"|"*"|"/"    cout << "An operator: " << yytext << endl;
"{"[^}\n]*"}"      /* eat up one-line comments */
[ \t\n]+           /* eat up whitespace */
.           cout << "Unrecognized character: " << yytext << endl;
%%

Things you Need to Know About Flex User Code
• Any code can be placed in the user code section. In the case of a stand-alone application, the code is usually a main program:

%%
int main( int argc, char **argv )
{
    ++argv, --argc;  /* skip over program name */
    if ( argc > 0 ) {
        yyin = fopen( argv[0], "r" );
    } else {
        yyin = stdin;
    }
    yylex();
    return 0;
}

How Does a Flex Lexical Analyzer Work Together With a Parser?
• The parser calls yylex() to get the next token.
• The lexical analyzer finds the token and sets the variable yylval (that it shares with the parser) to an appropriate value representing the information it needs to convey to the parser about the token.
• The lexical analyzer returns an integer value identifying the type of token that it found.
• Example: In parsing numbers, the value of yylval might be a structure containing a double and an int, and a return value of 1 might signify that an integer was seen, whereas 2 would mean that a floating point number was seen.
• If the lexical analyzer returns a 0, the parser assumes that the input is exhausted.
• The type of yylval can vary from project to project.

Your Assignment (due 27 January)
• You are to write a lexical analyzer for the Garnet language using flex and C++.
• Your lexical analyzer must declare yylval's type to be a class containing the following attributes:
  – The line number on which the token appears
  – The character number on that line at which the token begins
  – The string that comprises the token
• The main program of your lexical analyzer shall call the lexical analyzer (yylex) repeatedly until end of file, and upon each return from the lexical analyzer shall print a line consisting of
  – the token number, followed by a tab,
  – the line number, followed by a tab,
  – the character number, followed by a tab,
  – the token string.