The Front End


The Front End

Source code → Front End → IR → Back End → Machine code   (errors reported along the way)

The purpose of the front end is to deal with the input language
• Perform a membership test: is the code in the source language?
• Is the program well formed (semantically)?
• Build an IR version of the code for the rest of the compiler

The front end deals with form (syntax) & meaning (semantics)


The Front End

Source code → Scanner → tokens → Parser → IR   (errors)

Implementation Strategy

                       Scanning                         Parsing
  Specify syntax       regular expressions              context-free grammars
  Implement recognizer deterministic finite automaton   push-down automaton
  Perform work         actions on transitions in the automaton


The Front End

stream of characters → Scanner (microsyntax) → stream of tokens → Parser (syntax) → IR + annotations   (errors)

Why separate the scanner and the parser?
• Scanner classifies words
• Parser constructs grammatical derivations
• Parsing is harder and slower

Separation simplifies the implementation
• Scanners are simple
• Scanner leads to a faster, smaller parser

The scanner is the only pass that touches every character of the input.

A token is a pair <part of speech, lexeme>


The Big Picture

The front end deals with syntax
• Language syntax is specified with parts of speech, not words
• Syntax checking matches parts of speech against a grammar

Simple expression grammar
  1. goal → expr
  2. expr → expr op term
  3.      | term
  4. term → number
  5.      | id
  6. op   → +
  7.      | –

  S = goal
  N = { goal, expr, term, op }    (syntactic variables)
  T = { number, id, +, – }        (parts of speech)
  P = { 1, 2, 3, 4, 5, 6, 7 }

The scanner turns a stream of characters into a stream of words, and classifies them with their part of speech.


The Big Picture

Why study automatic scanner construction?
• Avoid writing scanners by hand
• Harness the theory

Design time:  specifications → Scanner Generator → tables or code
Compile time: source code → Scanner → parts of speech & words

Goals:
• To simplify specification & implementation of scanners
• To understand the underlying techniques and technologies

Specifications are written as "regular expressions".
Represent words as indices into a global table.


Regular Expressions

We constrain programming languages so that the spelling of a word always implies its part of speech

The rules that impose this mapping form a regular language

Regular expressions (REs) describe regular languages

Regular Expression (over alphabet Σ)
• ε is an RE denoting the set { ε }
• If a is in Σ, then a is an RE denoting { a }
• If x and y are REs denoting L(x) and L(y), then
  — x | y is an RE denoting L(x) ∪ L(y)
  — xy is an RE denoting L(x)L(y)
  — x* is an RE denoting L(x)*

Precedence is closure, then concatenation, then alternation


Regular Expressions

How do these operators help?

Regular Expression (over alphabet Σ)
• ε is an RE denoting the set { ε }
• If a is in Σ, then a is an RE denoting { a }
    — the spelling of any specific word is an RE
• If x and y are REs denoting L(x) and L(y), then
  — x | y is an RE denoting L(x) ∪ L(y)
    — any finite list of words can be written as an RE: ( w0 | w1 | … | wn )
  — xy is an RE denoting L(x)L(y)
  — x* is an RE denoting L(x)*

We can use concatenation & closure to write more concise patterns and to specify infinite sets that have finite descriptions


Examples of Regular Expressions

Identifiers:
  Letter     → (a|b|c| … |z|A|B|C| … |Z)
  Digit      → (0|1|2| … |9)
  Identifier → Letter ( Letter | Digit )*

Numbers:
  Integer → (+|–|ε) (0 | (1|2|3| … |9)(Digit*) )
  Decimal → Integer . Digit*
  Real    → ( Integer | Decimal ) E (+|–|ε) Digit*
  Complex → ( Real , Real )

Numbers can get much more complicated!

(Underlining in the original slides indicates a letter in the input stream.)
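The slide's patterns can be transliterated into a mainstream RE notation. Below is a hedged sketch in Python's `re` syntax; the pattern names, the use of character classes like `[0-9]`, and the use of `?` to play the role of (x|ε) are choices of this sketch, not the slide's notation.

```python
import re

# The slide's REs, rewritten as Python regexes (a sketch, not the slide's syntax).
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")          # Letter (Letter|Digit)*
INTEGER    = re.compile(r"[+-]?(0|[1-9][0-9]*)")          # (+|-|e) (0 | (1..9) Digit*)
DECIMAL    = re.compile(r"[+-]?(0|[1-9][0-9]*)\.[0-9]*")  # Integer . Digit*

def matches(pattern, word):
    """True iff the whole word is in the language of the pattern."""
    return pattern.fullmatch(word) is not None
```

Note that the Integer pattern rejects leading zeros ("007"), exactly as the alternation in the slide's RE does.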


Regular Expressions

We use regular expressions to specify the mapping of words to parts of speech for the lexical analyzer

Using results from automata theory and the theory of algorithms, we can automate construction of recognizers from REs

⇒ We study REs and the associated theory to automate scanner construction!
⇒ Fortunately, the automatic techniques lead to fast scanners
   (used in text editors, URL filtering software, …)


Example

Consider the problem of recognizing ILOC register names

  Register → r (0|1|2| … |9) (0|1|2| … |9)*

• Allows registers of arbitrary number
• Requires at least one digit

The RE corresponds to a recognizer (or DFA):

  S0 —r→ S1 —(0|1|2|…|9)→ S2     (S2 loops on (0|1|2|…|9))

Recognizer for Register. Transitions on other inputs go to an error state, se
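As a concrete illustration, the three-state recognizer can be written down as a transition map. This is a minimal sketch; the dictionary encoding and the state names are assumptions of the sketch, not part of the slide.

```python
# Transitions of the register-name DFA: s0 --r--> s1 --digit--> s2,
# with s2 looping on digits. Any missing entry means "go to error state se".
DELTA = {("s0", "r"): "s1"}
for d in "0123456789":
    DELTA[("s1", d)] = "s2"
    DELTA[("s2", d)] = "s2"

ACCEPTING = {"s2"}

def recognize(word):
    """Accept iff the word leaves the DFA in a final state."""
    state = "s0"
    for char in word:
        state = DELTA.get((state, char), "se")  # missing entry = error state
    return state in ACCEPTING
```

So `recognize("r17")` follows s0, s1, s2 and accepts, while `recognize("r")` stops in s1 and fails, matching the traces on the next slide.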


Example (continued)

DFA operation
• Start in state S0 & make transitions on each input character
• The DFA accepts a word x iff x leaves it in a final state (S2)

  S0 —r→ S1 —(0|1|2|…|9)→ S2     (S2 loops on (0|1|2|…|9))

Recognizer for Register

So,
• r17 takes it through S0, S1, S2 and accepts
• r takes it through S0, S1 and fails
• a takes it straight to se


Example (continued)

To be useful, the recognizer must be converted into code

Skeleton recognizer:

  Char  ← next character
  State ← s0
  while (Char ≠ EOF)
    State ← δ(State, Char)
    Char  ← next character
  if (State is a final state)
    then report success
    else report failure

Table encoding the RE:

  δ   | r  | 0,1,2,3,4,5,6,7,8,9 | all others
  s0  | s1 | se                  | se
  s1  | se | s2                  | se
  s2  | se | s2                  | se
  se  | se | se                  | se

O(1) cost per character (or per transition)


Example (continued)

We can add "actions" to each transition

Skeleton recognizer:

  Char  ← next character
  State ← s0
  while (Char ≠ EOF)
    Next ← δ(State, Char)
    Act  ← α(State, Char)
    perform action Act
    State ← Next
    Char  ← next character
  if (State is a final state)
    then report success
    else report failure

Table encoding the RE (transition, action):

  δ, α | r         | 0,1,2,3,4,5,6,7,8,9 | all others
  s0   | s1, start | se, error           | se, error
  s1   | se, error | s2, add             | se, error
  s2   | se, error | s2, add             | se, error
  se   | se, error | se, error           | se, error

A typical action is to capture the lexeme
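A small sketch of the same idea: each transition carries an action, and the "add" actions capture the lexeme as the word is recognized. The function below folds the transition and action tables into inline tests, which is an implementation choice of this sketch.

```python
def scan_register(word):
    """Recognize a register name, performing the 'start'/'add' actions
    that capture the lexeme; return the lexeme, or None on error."""
    state, lexeme = "s0", ""
    for char in word:
        if state == "s0" and char == "r":
            state, lexeme = "s1", lexeme + char   # action: start lexeme
        elif state in ("s1", "s2") and char.isdigit():
            state, lexeme = "s2", lexeme + char   # action: add to lexeme
        else:
            return None                           # action: error
    return lexeme if state == "s2" else None      # s2 is the final state
```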


What if we need a tighter specification?

r Digit Digit* allows arbitrary numbers
• Accepts r00000
• Accepts r99999
• What if we want to limit it to r0 through r31?

Write a tighter regular expression
— Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )
— Register → r0 | r1 | r2 | … | r31 | r00 | r01 | r02 | … | r09

Produces a more complex DFA
• DFA has more states
• DFA has the same cost per transition (or per character)
• DFA has the same basic implementation

More states implies a larger table. The larger table might have mattered when computers had 128 KB or 640 KB of RAM. Today, when a cell phone has megabytes and a laptop has gigabytes, the concern seems outdated.
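The tighter RE can be checked directly. The sketch below transliterates the slide's first RE into Python's `re` syntax (an assumption of this sketch; `?` encodes the (Digit | ε) option).

```python
import re

# r ( (0|1|2)(Digit|eps) | (4|5|6|7|8|9) | (3|30|31) ) in Python regex form
TIGHT = re.compile(r"r((0|1|2)[0-9]?|(4|5|6|7|8|9)|3(0|1)?)")

def is_register(word):
    """True iff the word matches the tighter register specification."""
    return TIGHT.fullmatch(word) is not None
```

The tighter pattern accepts r0 through r31 (and the leading-zero forms the alternation permits) but rejects r32 and arbitrary-length names like r99999.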


Tighter register specification (continued)

The DFA for
  Register → r ( (0|1|2) (Digit | ε) | (4|5|6|7|8|9) | (3|30|31) )

  S0 —r→ S1
  S1 —0,1,2→ S2          S2 —(0|1|2|…|9)→ S3
  S1 —3→ S5              S5 —0,1→ S6
  S1 —4,5,6,7,8,9→ S4

• Accepts a more constrained set of register names
• Same set of actions, more states


Tighter register specification (continued)

  δ   | r  | 0,1 | 2  | 3  | 4–9 | all others
  s0  | s1 | se  | se | se | se  | se
  s1  | se | s2  | s2 | s5 | s4  | se
  s2  | se | s3  | s3 | s3 | s3  | se
  s3  | se | se  | se | se | se  | se
  s4  | se | se  | se | se | se  | se
  s5  | se | s6  | se | se | se  | se
  s6  | se | se  | se | se | se  | se
  se  | se | se  | se | se | se  | se

Table encoding the RE for the tighter register specification


Tighter register specification (continued)

  State | r        | 0,1    | 2      | 3      | 4,5,6,7,8,9 | other
  0     | 1, start | e      | e      | e      | e           | e
  1     | e        | 2, add | 2, add | 5, add | 4, add      | e
  2     | e        | 3, add | 3, add | 3, add | 3, add      | exit
  3, 4  | e        | e      | e      | e      | e           | exit
  5     | e        | 6, add | e      | e      | e           | exit
  6     | e        | e      | e      | e      | e           | exit
  e     | e        | e      | e      | e      | e           | e

  S0 —r→ S1
  S1 —0,1,2→ S2          S2 —(0|1|2|…|9)→ S3
  S1 —3→ S5              S5 —0,1→ S6
  S1 —4,5,6,7,8,9→ S4


Table-Driven Scanners

Common strategy is to simulate DFA execution
• Table + skeleton scanner
  — So far, we have used a simplified skeleton:

      state ← s0
      while (state ≠ exit) do
        char  ← NextChar()        // read next character
        state ← δ(state, char)    // take the transition

• In practice, the skeleton is more complex
  — Character classification for table compression
  — Building the lexeme
  — Recognizing subexpressions
    Practice is to combine all the REs into one DFA
    Must recognize individual words without hitting EOF


Table-Driven Scanners

Character classification
• Group together characters by their actions in the DFA
  — Combine identical columns in the transition table δ
  — Indexing δ by class shrinks the table

      state ← s0
      while (state ≠ exit) do
        char  ← NextChar()        // read next character
        cat   ← CharCat(char)     // classify character
        state ← δ(state, cat)     // take the transition

• The idea works well in ASCII (or EBCDIC)
  — compact, byte-oriented character sets
  — limited range of values
• Not clear how it extends to larger character sets (Unicode)
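The slide's skeleton can be sketched concretely for the register DFA. Here the classifier collapses all ten digits into one category, so the table has one column per category rather than per character; the integer state/category encodings are assumptions of this sketch.

```python
# Character categories: 0 = 'r', 1 = digit, 2 = everything else
CATS = {"r": 0}
for d in "0123456789":
    CATS[d] = 1
OTHER = 2

# Transition table, one row per state (s0, s1, s2, se = 0..3),
# one column per category. State 3 is the error state se.
TABLE = [
    [1, 3, 3],  # s0: 'r' -> s1
    [3, 2, 3],  # s1: digit -> s2
    [3, 2, 3],  # s2: digit -> s2
    [3, 3, 3],  # se
]

def recognize(word):
    state = 0
    for char in word:
        cat = CATS.get(char, OTHER)  # classify character
        state = TABLE[state][cat]    # take the transition
    return state == 2                # s2 is the only accepting state
```

Collapsing the digit columns shrinks the row from an entry per character code to three entries, which is the table-compression payoff the slide describes.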


Table-Driven Scanners

Building the lexeme
• Scanner produces the syntactic category (part of speech)
  — Most applications want the lexeme (word), too

      state  ← s0
      lexeme ← empty string
      while (state ≠ exit) do
        char   ← NextChar()        // read next character
        lexeme ← lexeme + char     // concatenate onto lexeme
        cat    ← CharCat(char)     // classify character
        state  ← δ(state, cat)     // take the transition

• This problem is trivial
  — Save the characters


Table-Driven Scanners

Choosing a category from an ambiguous RE
• We want one DFA, so we combine all the REs into one
  — Some strings may fit the RE for more than one syntactic category
    (keywords versus general identifiers)
  — We would like to encode them into the RE & recognize them
  — Scanner must choose a category for ambiguous final states
    Classic answer: specify priority by order of REs (return the 1st)

Alternate implementation strategy (quite popular)
• Build a hash table of keywords & fold keywords into identifiers
• Preload the keywords into the hash table
• Makes sense if
  — Scanner will enter all identifiers in the table
  — Scanner is hand coded
Otherwise, let the DFA handle them (O(1) cost per character); a separate keyword table can make matters worse.
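The hash-table strategy is simple to sketch: scan every word with the identifier RE, then look the lexeme up in a preloaded table to decide its category. The keyword set below is hypothetical, chosen only for illustration.

```python
# Preloaded keyword table (a hypothetical keyword set for illustration).
# Python's built-in set is a hash table, so lookup is expected O(1).
KEYWORDS = {"if", "then", "else", "while"}

def categorize(lexeme):
    """Given a word matched by the identifier RE, return its token:
    a <part of speech, lexeme> pair."""
    if lexeme in KEYWORDS:
        return ("keyword", lexeme)
    return ("identifier", lexeme)
```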


Table-Driven Scanners

Scanning a stream of words
• Real scanners do not look for one word per input stream
  — Want the scanner to find all the words in the input stream, in order
  — Want the scanner to return one word at a time
  — Syntactic solution: can insist on delimiters
    (blank, tab, punctuation, …  Do you want to force blanks everywhere? In expressions?)
  — Implementation solution:
    Run the DFA to error or EOF, then back up to an accepting state
• Need the scanner to return a token, not a boolean
  — A token is a <part of speech, lexeme> pair
  — Use a map from the DFA's state to the part of speech (PoS)


Table-Driven Scanners

Handling a stream of words

  // recognize words
  state  ← s0
  lexeme ← empty string
  clear stack
  push(bad)
  while (state ≠ se) do
    char   ← NextChar()
    lexeme ← lexeme + char
    if state ∈ SA
      then clear stack
    push(state)
    cat   ← CharCat(char)
    state ← δ(state, cat)
  end

  // clean up final state
  while (state ∉ SA and state ≠ bad) do
    state ← pop()
    truncate lexeme
    roll back the input one character
  end

  // report the results
  if (state ∈ SA)
    then return <PoS(state), lexeme>
    else return invalid

Need a clever buffering scheme, such as double buffering, to support the roll back
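A runnable sketch of this skeleton for the register DFA. An input index stands in for the slide's roll-back buffering, and the stack holds ⟨state, position⟩ pairs so popping restores both; the function names and the "register" part of speech are assumptions of this sketch.

```python
DELTA = {("s0", "r"): "s1"}
for d in "0123456789":
    DELTA[("s1", d)] = "s2"
    DELTA[("s2", d)] = "s2"

POS = {"s2": "register"}  # map from accepting DFA state to part of speech

def next_word(text, start):
    """Return (part_of_speech, lexeme, next_position) for one word."""
    state, pos = "s0", start
    stack = [("bad", start)]
    while state != "se" and pos < len(text):
        char = text[pos]
        pos += 1
        if state in POS:          # accepting state: clear the stack
            stack.clear()
        stack.append((state, pos - 1))
        state = DELTA.get((state, char), "se")
    # clean up: roll back to the most recent accepting state
    while state not in POS and state != "bad":
        state, pos = stack.pop()  # truncate lexeme / roll back input
    if state in POS:
        return POS[state], text[start:pos], pos
    return "invalid", "", start + 1
```

On "r17x" the DFA runs past the word, hits the error state on 'x', and pops back to the accepting state, returning the token ⟨register, "r17"⟩.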


Avoiding Excess Rollback

• Some REs can produce quadratic rollback
  — Consider ab | (ab)* c and its DFA:

      s0 —a→ s1     s1 —b→ s2   (accepts "ab")
      s2 —a→ s3     s3 —b→ s4
      s4 —a→ s3     s0, s2, s4 —c→ s5   (accepts)

  — Input "ababc"
      s0, s1, s2, s3, s4, s5 — accepted
  — Input "abababab"  (not too pretty)
      s0, s1, s2, s3, s4, s3, s4, s3, s4 — rollback 6 characters
      s0, s1, s2, s3, s4, s3, s4 — rollback 4 characters
      s0, s1, s2, s3, s4 — rollback 2 characters
      s0, s1, s2 — accepted

• This behavior is preventable
  — Have the scanner remember paths that fail on particular inputs
  — A simple modification creates the "maximal munch scanner"


Maximal Munch Scanner

  // recognize words
  state  ← s0
  lexeme ← empty string
  clear stack
  push(⟨bad, bad⟩)
  while (state ≠ se) do
    char     ← NextChar()
    InputPos ← InputPos + 1
    lexeme   ← lexeme + char
    if Failed[state, InputPos]
      then break
    if state ∈ SA
      then clear stack
    push(⟨state, InputPos⟩)
    cat   ← CharCat(char)
    state ← δ(state, cat)
  end

  // clean up final state
  while (state ∉ SA and state ≠ bad) do
    Failed[state, InputPos] ← true
    ⟨state, InputPos⟩ ← pop()
    truncate lexeme
    roll back the input one character
  end

  // report the results
  if (state ∈ SA)
    then return <PoS(state), lexeme>
    else return invalid

  InitializeScanner()
    InputPos ← 0
    for each state s in the DFA do
      for i ← 0 to |input| do
        Failed[s, i] ← false
      end
    end
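A runnable sketch of the maximal munch idea on the ab | (ab)*c DFA from the previous slide. A set of ⟨state, position⟩ pairs stands in for the slide's Failed bit array (a space/initialization trade-off of this sketch), so work discarded during one rollback is never repeated on later calls; the names are assumptions.

```python
DELTA = {("s0", "a"): "s1", ("s0", "c"): "s5",
         ("s1", "b"): "s2",
         ("s2", "a"): "s3", ("s2", "c"): "s5",
         ("s3", "b"): "s4",
         ("s4", "a"): "s3", ("s4", "c"): "s5"}
ACCEPTING = {"s2", "s5"}   # "ab" and the (ab)*c forms

def scan(text, start, failed):
    """Return (lexeme, next_position); 'failed' memoizes dead ends."""
    state, pos = "s0", start
    stack = [("bad", start)]
    while state != "se" and pos < len(text) and (state, pos) not in failed:
        if state in ACCEPTING:
            stack.clear()
        stack.append((state, pos))
        state = DELTA.get((state, text[pos]), "se")
        pos += 1
    while state not in ACCEPTING and state != "bad":
        failed.add((state, pos))   # remember this dead-end path
        state, pos = stack.pop()   # roll back one step
    if state in ACCEPTING:
        return text[start:pos], pos
    return None, start
```

On "abab" the first call rolls back to "ab" while recording the dead ends; the second call then stops early instead of re-exploring them, which is exactly what keeps the total work linear.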


Maximal Munch Scanner

• Uses a bit array Failed to track dead-end paths
  — Initialize both InputPos & Failed in InitializeScanner()
  — Failed requires space ∝ |input stream|
    (a clever implementation can reduce the space requirement)
• Avoids quadratic rollback
  — Produces an efficient scanner
• Can your favorite language cause quadratic rollback?
  — If so, the solution is inexpensive
  — If not, you might encounter the problem in other applications of these technologies

Thomas Reps, "'Maximal munch' tokenization in linear time", ACM TOPLAS, 20(2), March 1998, pp. 259–273.


Table-Driven Versus Direct-Coded Scanners

Table-driven scanners make heavy use of indexing
• Read the next character
• Classify it
• Find the next state
• Branch back to the top

      state ← s0
      while (state ≠ exit) do
        char  ← NextChar()
        cat   ← CharCat(char)
        state ← δ(state, cat)

Alternative strategy: direct coding
• Encode the state in the program counter
  — Each state is a separate piece of code
• Do transition tests locally and directly branch
• Generate ugly, spaghetti-like code
• More efficient than the table-driven strategy
  — Fewer memory operations, might have more branches
  — Code locality as opposed to random access in δ


Table-Driven Versus Direct-Coded Scanners

Overhead of table lookup
• Each lookup in CharCat or δ involves an address calculation and a memory operation
  — CharCat(char) becomes
      @CharCat0 + char × w              w is sizeof(element of CharCat)
  — δ(state, cat) becomes
      @δ0 + (state × cols + cat) × w    cols is # of columns in δ
                                        w is sizeof(element of δ)
• The references to CharCat and δ expand into multiple ops
• Fair amount of overhead work per character
• Avoid the table lookups and the scanner will run faster
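The address arithmetic can be made concrete by storing δ row-major in one flat array, so each lookup is exactly one multiply, one add, and one indexed reference. The table contents below reuse the register DFA from earlier slides (state/category numbering is this sketch's assumption).

```python
COLS = 3            # number of character categories ('r', digit, other)

# delta stored row-major in a flat array; rows are s0, s1, s2, se (= 0..3)
FLAT = [1, 3, 3,    # s0: 'r' -> s1, digit -> se, other -> se
        3, 2, 3,    # s1: digit -> s2
        3, 2, 3,    # s2: digit -> s2
        3, 3, 3]    # se

def delta(state, cat):
    # the @delta0 + (state * cols + cat) * w calculation from the slide;
    # Python's indexing hides the * w scaling by the element size
    return FLAT[state * COLS + cat]
```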


Building Faster Scanners from the DFA

A direct-coded recognizer for r Digit Digit*:

  start: accept ← se
         lexeme ← ""
         count  ← 0
         goto s0

  s0:    char   ← NextChar
         lexeme ← lexeme + char
         count++
         if (char = 'r')
           then goto s1
           else goto sout

  s1:    char   ← NextChar
         lexeme ← lexeme + char
         count++
         if ('0' ≤ char ≤ '9')
           then goto s2
           else goto sout

  s2:    char   ← NextChar
         lexeme ← lexeme + char
         count  ← 1
         accept ← s2
         if ('0' ≤ char ≤ '9')
           then goto s2
           else goto sout

  sout:  if (accept ≠ se) then
           for i ← 1 to count do RollBack()
           report success
         else report failure

• Fewer (complex) memory operations
• No character classifier
• Use multiple strategies for test & branch
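A sketch of the same direct-coded style in Python. Each state is a separate chunk of straight-line code with its transition test done inline, and no category table is consulted; since Python has no goto, an index into the input plays the role of the rollback counter (an adaptation of this sketch, not the slide's code).

```python
def recognize(text):
    """Direct-coded recognizer for r Digit Digit*; returns the accepted
    lexeme (longest prefix in the language) or None."""
    pos = 0
    def next_char():
        nonlocal pos
        char = text[pos] if pos < len(text) else ""   # "" plays EOF
        pos += 1
        return char
    # state s0: the word must begin with 'r'
    if next_char() != "r":
        return None
    # state s1: require at least one digit
    char = next_char()
    if not ("0" <= char <= "9"):
        return None
    # state s2: consume digits, remembering where the accepted lexeme ends
    end = pos                      # last position known to be accepting
    char = next_char()
    while "0" <= char <= "9":
        end = pos
        char = next_char()
    return text[:end]              # roll back the speculative character
```

The inline range tests (`'0' <= char <= '9'`) are the direct-coded replacement for the CharCat lookup of the table-driven version.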


Building Faster Scanners from the DFA

A direct-coded recognizer for r Digit Digit* (the same code as the previous slide):

  start: accept ← se
         lexeme ← ""
         count  ← 0
         goto s0

  s0:    char   ← NextChar
         lexeme ← lexeme + char
         count++
         if (char = 'r')
           then goto s1
           else goto sout

  s1:    char   ← NextChar
         lexeme ← lexeme + char
         count++
         if ('0' ≤ char ≤ '9')
           then goto s2
           else goto sout

  s2:    char   ← NextChar
         lexeme ← lexeme + char
         count  ← 1
         accept ← s2
         if ('0' ≤ char ≤ '9')
           then goto s2
           else goto sout

  sout:  if (accept ≠ se) then
           for i ← 1 to count do RollBack()
           report success
         else report failure

If the end-of-state test is complex (e.g., many cases), the scanner generator should consider other schemes (with classification?)
• Table lookup
• Binary search

Direct coding the maximal munch scanner is easy, too.


What About Hand-Coded Scanners?

Many (most?) modern compilers use hand-coded scanners
• Starting from a DFA simplifies design & understanding
• Avoiding the straitjacket of a tool allows flexibility
  — Computing the value of an integer
    In LEX or FLEX, many folks use sscanf() & touch the characters many times
    Can use the old assembly trick and compute the value as it appears
  — Combine similar states (serial or parallel)
• Scanners are fun to write
  — Compact, comprehensible, easy to debug, …
  — Don't get too cute (e.g., perfect hashing for keywords)


Building Scanners

The point
• All this technology lets us automate scanner construction
• The implementer writes down the regular expressions
• The scanner generator builds an NFA, a DFA, a minimal DFA, and then writes out the (table-driven or direct-coded) code
• This reliably produces fast, robust scanners

For most modern language features, this works
• You should think twice before introducing a feature that defeats a DFA-based scanner
• The ones we've seen (e.g., insignificant blanks, non-reserved keywords) have not proven particularly useful or long lasting