Regular Expressions Programming Language Translators Prepared by Manuel E. Bermúdez, Ph. D. Associate Professor University of Florida

Regular Expressions
• A compact, easy-to-read language description.
• Use operators to denote the language constructors described earlier, to build “complex” languages from simple “atomic” ones.

Definition: A regular expression over an alphabet Σ is recursively defined as follows:
1. ø denotes the language ø.
2. ε denotes the language {ε}.
3. a denotes the language {a}, for all a ∈ Σ.
4. (P + Q) denotes L(P) ∪ L(Q), where P, Q are r.e.’s.
5. (PQ) denotes L(P)·L(Q), where P, Q are r.e.’s.
6. P* denotes L(P)*, where P is an r.e.
To prevent excessive parentheses, we assume left associativity, with the following operator precedence hierarchy, from most to least binding: *, ·, +
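The recursive definition can be mirrored directly in code. The following is a minimal sketch, not part of the original slides: the nested-tuple encoding and the names concat, language, BIT, and ENDS_WITH_ONE are assumptions made for illustration. Because P* usually denotes an infinite language, the sketch only lists strings up to a chosen length.

```python
# Hypothetical encoding of the six cases as nested tuples:
#   ("empty",)  ("eps",)  ("sym", "a")  ("alt", P, Q)  ("cat", P, Q)  ("star", P)

def concat(left, right, max_len):
    """Catenation of two languages, keeping only strings of length <= max_len."""
    return {x + y for x in left for y in right if len(x) + len(y) <= max_len}

def language(regex, max_len):
    """Strings of length <= max_len in the language denoted by regex."""
    kind = regex[0]
    if kind == "empty":                      # case 1: the empty language
        return set()
    if kind == "eps":                        # case 2: {ε}
        return {""}
    if kind == "sym":                        # case 3: {a}
        return {regex[1]} if len(regex[1]) <= max_len else set()
    if kind == "alt":                        # case 4: L(P) ∪ L(Q)
        return language(regex[1], max_len) | language(regex[2], max_len)
    if kind == "cat":                        # case 5: L(P)·L(Q)
        return concat(language(regex[1], max_len),
                      language(regex[2], max_len), max_len)
    if kind == "star":                       # case 6: L(P)*
        base = language(regex[1], max_len) - {""}
        result = frontier = {""}
        while frontier:                      # stops once no new short string appears
            frontier = concat(frontier, base, max_len) - result
            result = result | frontier
        return result
    raise ValueError(f"unknown node: {kind}")

BIT = ("alt", ("sym", "0"), ("sym", "1"))
ENDS_WITH_ONE = ("cat", ("star", BIT), ("sym", "1"))   # (0 + 1)*1
print(sorted(language(ENDS_WITH_ONE, 2)))              # ['01', '1', '11']
```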

Examples:
• (0 + 1)*: any string of 0’s and 1’s.
• (0 + 1)*1: any string of 0’s and 1’s, ending with a 1.
• 1*01*: any string of 1’s with a single 0 inserted.
• Letter (Letter + Digit)*: an identifier.
• Digit*: an integer (strictly, Digit Digit*, since Digit* also admits the empty string).
• Quote Char* Quote: a string. †
• # Char* Eoln: a comment. †
• {Char*}: another comment. †
† Assuming that Char does not contain quotes, eoln’s, or }.
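Some of these examples can be tried out with Python’s re module. This is only a sketch under stated assumptions: the union operator + of the slides is written | in Python, and Letter and Digit are assumed here to mean ASCII letters and digits. The names EXAMPLES and matches are made up for this illustration.

```python
import re

# Slide notation on the left (as dictionary keys), Python syntax on the right.
EXAMPLES = {
    "(0 + 1)*": r"(0|1)*",
    "(0 + 1)*1": r"(0|1)*1",
    "1*01*": r"1*01*",
    "Letter (Letter + Digit)*": r"[A-Za-z]([A-Za-z]|[0-9])*",
}

def matches(slide_regex, text):
    """True if the whole of text belongs to the language of the given example."""
    return re.fullmatch(EXAMPLES[slide_regex], text) is not None

print(matches("(0 + 1)*1", "0101"))   # True: ends with a 1
print(matches("1*01*", "1001"))       # False: contains two 0's
```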

Conversion from right-linear grammars to regular expressions
Example:
S → aS
S → bR
S → ε
R → aS
What does S → aS mean? L(S) ⊇ {a}·L(S)
S → bR means L(S) ⊇ {b}·L(R)
S → ε means L(S) ⊇ {ε}

Together, they mean that
L(S) = {a}·L(S) + {b}·L(R) + {ε}
or
S = aS + bR + ε
Similarly, R → aS means R = aS. Thus,
S = aS + bR + ε
R = aS
This is a system of simultaneous equations, in which the variables are nonterminals.

Solving systems of simultaneous equations.
S = aS + bR + ε
R = aS
Back substitute R = aS:
S = aS + baS + ε
  = (a + ba)S + ε
Question: What to do with equations of the form X = αX + β?

Answer: β ∈ L(X), so αβ ∈ L(X), ααβ ∈ L(X), αααβ ∈ L(X), …
Thus L(X) = α*β.
In our case,
S = (a + ba)S + ε
  = (a + ba)*ε
  = (a + ba)*
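The argument above can be checked on finite approximations. The sketch below is an illustration, not the author’s code: it treats α and β as finite sets of strings, repeatedly applies X := αX + β starting from the empty set (which yields the least solution), and compares the result with α*β up to a length bound. The function names and the length bound are assumptions.

```python
def concat(left, right, max_len):
    """Catenation of two languages, keeping only strings of length <= max_len."""
    return {x + y for x in left for y in right if len(x) + len(y) <= max_len}

def solve_by_iteration(alpha, beta, max_len):
    """Least X with X = alpha·X + beta, restricted to strings of length <= max_len."""
    current = set()
    while True:
        nxt = concat(alpha, current, max_len) | {w for w in beta if len(w) <= max_len}
        if nxt == current:        # nothing new: fixed point reached
            return current
        current = nxt

def star_then(alpha, beta, max_len):
    """alpha*·beta under the same length bound."""
    star = frontier = {""}
    while frontier:
        frontier = concat(frontier, alpha, max_len) - star
        star = star | frontier
    return concat(star, beta, max_len)

# The equation from the slides: S = (a + ba)S + ε
ALPHA, BETA = {"a", "ba"}, {""}
print(solve_by_iteration(ALPHA, BETA, 4) == star_then(ALPHA, BETA, 4))   # True
```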

Right-linear regular grammar → regular expression
1. Write A = α₁ + α₂ + … + αₙ if the grammar contains
   A → α₁
     → α₂
     …
     → αₙ
2. If an equation is of the form X = α, where X does not appear in α, then replace every occurrence of X with α in all other equations, and delete the equation X = α.
3. If an equation is of the form X = αX + β, where X does not occur in either α or β, then replace the equation with X = α*β.
Note: Some algebraic manipulations may be needed to obtain the form X = αX + β.
Important: Catenation is not commutative!

Example:
S → a
S → bU
S → bR
R → abaU
R → U
U → aS
U → b
Equations:
S = a + bU + bR
R = abaU + U = (aba + ε)U
U = aS + b
Back substitute R:
S = a + bU + b(aba + ε)U
U = aS + b

Back substitute U:
S = a + b(aS + b) + b(aba + ε)(aS + b)
  = a + baS + bb + babaaS + babab + baS + bb    (the last two terms, baS + bb, repeat)
  = (ba + babaa)S + (a + bb + babab)
therefore
S = (ba + babaa)*(a + bb + babab)
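A result like this can be sanity-checked by brute force. The following sketch assumes the grammar of this example (S → a, bU, bR; R → abaU, U; U → aS, b) and compares the strings it derives with the strings matched by (ba + babaa)*(a + bb + babab), up to a length bound. The names GRAMMAR, SOLUTION, generated, and matched are illustrative, and the union + is written | in Python’s re syntax.

```python
import re
from itertools import product

GRAMMAR = {                     # uppercase letters are nonterminals
    "S": ["a", "bU", "bR"],
    "R": ["abaU", "U"],
    "U": ["aS", "b"],
}
SOLUTION = r"(ba|babaa)*(a|bb|babab)"

def generated(grammar, start, max_len):
    """Terminal strings of length <= max_len derivable from start."""
    done, seen, todo = set(), {start}, [start]
    while todo:
        form = todo.pop()
        pos = next((i for i, c in enumerate(form) if c in grammar), None)
        if pos is None:                       # no nonterminal left
            done.add(form)
            continue
        for rhs in grammar[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # Terminals never disappear, so forms with too many can be pruned.
            if sum(c not in grammar for c in new) <= max_len and new not in seen:
                seen.add(new)
                todo.append(new)
    return done

def matched(pattern, alphabet, max_len):
    """All strings over alphabet, of length <= max_len, that match pattern."""
    words = ("".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}

print(generated(GRAMMAR, "S", 8) == matched(SOLUTION, "ab", 8))   # True
```

Agreement up to a length bound is evidence, not a proof; the algebraic derivation is what establishes the equality.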

Summarizing:
[Diagram: the representations RGR, RGL, RE, NFA, DFA, and minimum DFA, with the conversions among them marked “Done” or “Soon”.]

Regular Expression → NFA
Recursively build the FSA, mimicking the structure of the regular expression. Each FSA built has one start state, and one final state.
Conversions (each diagram has start state 1 and final state 2):
• ø: states 1 and 2, with no transition between them.
• ε: an ε-transition from 1 to 2.
• a: a transition labeled a from 1 to 2.
• P + Q: the machines for P and Q side by side, connected to 1 and 2 by ε-transitions.
• P·Q: the machine for P followed by the machine for Q, connected by ε-transitions.
• P*: the machine for P, with ε-transitions that allow it to be skipped or repeated.

Example: (b (aba + ε) a)*
[Figures: step-by-step construction of the NFA for (b (aba + ε) a)*, starting from the single-symbol machines for b and a and ending with an NFA whose states are numbered 1 to 15.]
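The bottom-up construction can be sketched in a few lines. This is an illustrative version under stated assumptions, not the code behind the slides: regular expressions are nested tuples, ε-transitions carry the label None, and the state numbering and exact placement of ε-transitions need not match the figures. The names thompson, closure, accepts, and cat are hypothetical.

```python
from itertools import count

EPS = None  # label used for ε-transitions

def thompson(regex):
    """Build an NFA with one start and one final state: (start, final, edges)."""
    fresh = count(1)
    edges = []                                 # (source, label, target)

    def build(node):
        kind = node[0]
        if kind == "cat":                      # P·Q: final of P feeds start of Q
            s1, f1 = build(node[1])
            s2, f2 = build(node[2])
            edges.append((f1, EPS, s2))
            return s1, f2
        start, final = next(fresh), next(fresh)
        if kind == "eps":
            edges.append((start, EPS, final))
        elif kind == "sym":
            edges.append((start, node[1], final))
        elif kind == "alt":                    # P + Q: two parallel branches
            for sub in node[1:]:
                s, f = build(sub)
                edges.extend([(start, EPS, s), (f, EPS, final)])
        elif kind == "star":                   # P*: enter, leave, skip, repeat
            s, f = build(node[1])
            edges.extend([(start, EPS, s), (f, EPS, final),
                          (start, EPS, final), (f, EPS, s)])
        elif kind != "empty":                  # ø gets two states and no edge
            raise ValueError(f"unknown node: {kind}")
        return start, final

    start, final = build(regex)
    return start, final, edges

def closure(states, edges):
    """All states reachable from states through ε-transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, text):
    start, final, edges = nfa
    current = closure({start}, edges)
    for ch in text:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved, edges)
    return final in current

def cat(*parts):
    """Left-associative catenation of several subexpressions."""
    node = parts[0]
    for part in parts[1:]:
        node = ("cat", node, part)
    return node

A, B = ("sym", "a"), ("sym", "b")
EXAMPLE = ("star", cat(B, ("alt", cat(A, B, A), ("eps",)), A))   # (b (aba + ε) a)*
NFA = thompson(EXAMPLE)
print(accepts(NFA, "babaa"), accepts(NFA, "bab"))                # True False
```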

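Regular Expression → NFA (Algorithm 2)
Start with:
[Diagram: a start state and a final state joined by a single transition labeled E, the whole expression.]

Apply rules:
• A transition labeled a*: replace it with an ε-transition into a new state, a loop labeled a on that state, and an ε-transition out.
• A transition labeled ab: replace it with a transition labeled a into a new state, followed by a transition labeled b.
• A transition labeled a + b: replace it with two parallel transitions, one labeled a and one labeled b.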

Algorithm 1:
• Builds FSA bottom up
• Good for machines
• Bad for humans
Algorithm 2:
• Builds FSA top down
• Bad for machines
• Good for humans
(Arguable.)

Example (Algorithm 2): (a + b)*(aa + bb)(a + b)*
[Figures: successive applications of the rules, expanding aa + bb and the two (a + b)* parts into transitions labeled a, b, and ε.]

Example (Algorithm 2): ba(a + b)*ab
[Figure: the resulting NFA, with transitions labeled a, b, and ε.]
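Algorithm 2 can be read as a rewriting process on labeled transitions. The sketch below is one possible rendering, with assumptions stated up front: expressions are nested tuples, states are integers (0 is the start, 1 is the final state), ε is represented by None, and the name expand is made up. It starts with a single transition labeled by the whole expression and rewrites compound labels until only symbols and ε remain.

```python
from itertools import count

def expand(regex):
    """Top-down construction; returns (start, final, edges)."""
    fresh = count(2)                  # states 0 and 1 are already taken
    todo = [(0, regex, 1)]            # transitions whose label is still an expression
    done = []                         # finished transitions: (source, label, target)
    while todo:
        src, node, dst = todo.pop()
        kind = node[0]
        if kind == "sym":
            done.append((src, node[1], dst))
        elif kind == "eps":
            done.append((src, None, dst))
        elif kind == "alt":           # a + b: two parallel transitions
            todo += [(src, node[1], dst), (src, node[2], dst)]
        elif kind == "cat":           # ab: go through a new middle state
            mid = next(fresh)
            todo += [(src, node[1], mid), (mid, node[2], dst)]
        elif kind == "star":          # a*: ε in, loop on a new state, ε out
            mid = next(fresh)
            todo += [(src, ("eps",), mid), (mid, node[1], mid), (mid, ("eps",), dst)]
        elif kind != "empty":         # ø: the transition is simply dropped
            raise ValueError(f"unknown node: {kind}")
    return 0, 1, done

def cat(*parts):
    """Left-associative catenation of several subexpressions."""
    node = parts[0]
    for part in parts[1:]:
        node = ("cat", node, part)
    return node

A, B = ("sym", "a"), ("sym", "b")
EXAMPLE = cat(B, A, ("star", ("alt", A, B)), A, B)      # ba(a + b)*ab
START, FINAL, EDGES = expand(EXAMPLE)
print(len(EDGES))                                       # 8
```

For ba(a + b)*ab the sketch produces eight transitions over seven states: b, a, ε, two loops labeled a and b, ε, a, b.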

Deterministic Finite-State Automata (DFA’s)
Definition: A deterministic FSA is defined just like an NFA, except that
δ: Q × Σ → Q,
rather than
δ: Q × (Σ ∪ {ε}) → 2^Q.
Thus, both an ε-transition and two transitions on the same symbol a out of one state are impossible.

Every transition of a DFA consumes a symbol. Fortunately, DFA’s are just as powerful as NFA’s.
Theorem: For every NFA there exists an equivalent (accepting the same language) DFA.
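Because δ maps a state and a symbol to exactly one state, running a DFA is a single loop with no backtracking. A minimal sketch follows; the two-state DFA for (0 + 1)*1 and the names DELTA and run_dfa are assumptions made for this illustration, not taken from the slides.

```python
# δ as a dictionary: (state, symbol) -> next state.
# "q1" means the last symbol read was a 1; "q0" covers every other case.
DELTA = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q0", ("q1", "1"): "q1",
}

def run_dfa(delta, start, finals, text):
    """Consume one symbol per transition; reject on an undefined move."""
    state = start
    for symbol in text:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in finals

print(run_dfa(DELTA, "q0", {"q1"}, "0101"))   # True
print(run_dfa(DELTA, "q0", {"q1"}, "10"))     # False
```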

Conversion from NFA’s to DFA’s:
• “Simulate” all moves of the NFA with the DFA.
• The start state of the DFA is the start state of the NFA (say, S), together with all states that are ε-reachable from S.
• Each state in the DFA is a subset of the set of states of the NFA; it captures the notion of being in “any one of” a number of states.
• New states in the DFA are constructed by calculating the sets of states that are reachable through symbols, after the start state.
• The final states in the DFA are those that contain any final state of the NFA.

Example: a*b + ba*
[NFA diagram: states 1 to 6, with ε-transitions and transitions labeled a and b.]

DFA
State   a     b
123     23    456
23      23    6
456     56    --
6       --    --
56      56    --
[DFA diagram corresponding to the table above.]
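The NFA-to-DFA conversion for a*b + ba* can be sketched as follows. The list of NFA transitions is an assumption: the original diagram did not survive, so the edges below were chosen to be consistent with the DFA table for this example (start state 1, final state 6). The names closure and subset_construction are illustrative.

```python
EPS = None  # label used for ε-transitions

# Assumed NFA for a*b + ba*: (source, label, target)
NFA_EDGES = [
    (1, EPS, 2), (2, "a", 2), (2, EPS, 3), (3, "b", 6),
    (1, "b", 4), (4, EPS, 5), (5, "a", 5), (5, EPS, 6),
]

def closure(states, edges):
    """All states reachable from states through ε-transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def subset_construction(start, finals, edges, alphabet):
    """Return (DFA start state, transition table, DFA final states)."""
    start_set = frozenset(closure({start}, edges))
    table, todo = {}, [start_set]
    while todo:
        subset = todo.pop()
        if subset in table:
            continue
        table[subset] = {}
        for symbol in alphabet:
            moved = {d for s, label, d in edges if s in subset and label == symbol}
            target = frozenset(closure(moved, edges))
            if target:                    # undefined moves are left out of the table
                table[subset][symbol] = target
                todo.append(target)
    dfa_finals = {subset for subset in table if subset & finals}
    return start_set, table, dfa_finals

START, TABLE, FINALS = subset_construction(1, {6}, NFA_EDGES, "ab")
for subset in sorted(TABLE, key=sorted):
    print(sorted(subset), {a: sorted(t) for a, t in TABLE[subset].items()})
```

Each DFA state is a frozenset of NFA states, so the state written 123 in the table is frozenset({1, 2, 3}) here.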

In general, if the NFA has N states, the DFA can have as many as 2^N states.
Example: ba(a + b)*ab
[NFA diagram: states 0 to 11, with ε-transitions and transitions labeled a and b.]

DFA
State       a           b
0           --          1
1           234689      --
234689      34568910    346789
34568910    34568910    34678911
34678911    34568910    346789
346789      34568910    346789

[DFA diagram: states 0, 1, 234689, 34568910, 346789, and 34678911, with transitions labeled a and b.]

State Minimization
Theorem: Given a DFA M, there exists an equivalent DFA M’ that is minimal, i.e., no other equivalent DFA exists with fewer states than M’.
Definition: A partition of a set S is a set of subsets of S such that every element of S appears in exactly one of the subsets.

Example: S = {1, 2, 3, 4, 5}
Π₁ = { {1, 2, 3, 4}, {5} }
Π₂ = { {1, 2, 3}, {4}, {5} }
Π₃ = { {1, 3}, {2}, {4}, {5} }
Note: Π₂ is a refinement of Π₁, and Π₃ is a refinement of Π₂.
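Both notions are easy to check mechanically. In the sketch below, the function names is_partition and refines are made up, and a refinement is taken to mean that every block of the finer partition lies inside some block of the coarser one, which is the usual meaning and matches the example.

```python
def is_partition(blocks, universe):
    """Every element of universe appears in exactly one block."""
    blocks = [set(b) for b in blocks]
    covered = set().union(*blocks)
    no_overlap = sum(len(b) for b in blocks) == len(covered)
    return covered == set(universe) and no_overlap

def refines(fine, coarse):
    """Every block of fine is contained in some block of coarse."""
    return all(any(set(f) <= set(c) for c in coarse) for f in fine)

S = {1, 2, 3, 4, 5}
P1 = [{1, 2, 3, 4}, {5}]
P2 = [{1, 2, 3}, {4}, {5}]
P3 = [{1, 3}, {2}, {4}, {5}]

print(all(is_partition(p, S) for p in (P1, P2, P3)))   # True
print(refines(P2, P1), refines(P3, P2))                # True True
print(refines(P1, P2))                                 # False
```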

Minimization Algorithm:
1. Remove all undefined transitions by introducing a TRAP state, i.e., a state from which no final state is reachable.
2. Partition all states into two groups (final states and non-final states).
3. Complete the “Next State” table for each group, by specifying transitions from group to group. Form the next partition: split groups in which Next State table entries differ. Repeat step 3 until no further splitting is possible.
4. Determine start and final states.

Example:
Π₀ = { {1, 2, 3, 4}, {5} }
State   a       b
1       1234    1234
2       1234    1234
3       1234    1234
4       1234    5
5       1234    1234
Split {4} from the group {1, 2, 3, 4}.
[DFA diagram: states 1 to 5, with transitions labeled a and b.]

Π₁ = { {1, 2, 3}, {4}, {5} }
State   a      b
1       123    123
2       123    4
3       123    123
4       123    5
5       123    123
Split {2} from the group {1, 2, 3}.
[DFA diagram: states 1 to 5, with transitions labeled a and b.]

Π₂ = { {1, 3}, {2}, {4}, {5} }
State   a    b
1       2    13
3       2    13
2       2    4
4       2    5
5       2    13
No more splitting.
[Diagram: the minimal DFA, with states 13, 2, 4, and 5.]
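The refinement steps of this example can be replayed in code. The sketch below rests on two assumptions. First, the concrete DFA: the tables only say which group each transition enters, so the b-transitions of states 1, 3, and 5 are assumed here to go to state 3. Second, the splitting strategy: every group is split in each round, rather than one group at a time as on the slides, which reaches the same final partition. The names DELTA and minimize are illustrative, and δ is assumed to be total (any TRAP state already added).

```python
# Assumed DFA: (state, symbol) -> next state; state 5 is the only final state.
DELTA = {
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 2, (2, "b"): 4,
    (3, "a"): 2, (3, "b"): 3,
    (4, "a"): 2, (4, "b"): 5,
    (5, "a"): 2, (5, "b"): 3,
}

def minimize(states, alphabet, delta, finals):
    """Refine {finals, non-finals} until no group can be split any further."""
    partition = [g for g in (set(finals), set(states) - set(finals)) if g]
    while True:
        def group_of(state):
            return next(i for i, g in enumerate(partition) if state in g)

        refined = []
        for group in partition:
            rows = {}                         # "Next State" row -> states sharing it
            for state in sorted(group):
                row = tuple(group_of(delta[(state, a)]) for a in alphabet)
                rows.setdefault(row, set()).add(state)
            refined.extend(rows.values())
        if len(refined) == len(partition):    # no group was split
            return refined
        partition = refined

GROUPS = minimize({1, 2, 3, 4, 5}, "ab", DELTA, {5})
print(sorted(sorted(g) for g in GROUPS))      # [[1, 3], [2], [4], [5]]
```

Each resulting group becomes one state of the minimal DFA; the group containing the original start state is the new start state, and groups containing final states are final.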

Summary of Regular Languages
• Smallest class in the Chomsky hierarchy.
• Appropriate for lexical analysis.
• Four representations: RGR, RGL, RE, and FSA.
• All four are equivalent; there are algorithms to perform transformations among them.
• Various advantages and disadvantages among these four, for language designer, implementor, and user.
• FSA’s can be made deterministic, and minimal.