i 206 Lecture 18 Regular Expressions Marti Hearst

i206: Lecture 18: Regular Expressions
Marti Hearst, Spring 2012


[Course map diagram. Topics shown: Distributed Systems; Formal Languages; Security; Cryptography; Network; Standards & Protocols; Inter-process Communication; I/O; Operating System; Methodologies/Tools; Process; Application; Memory; Register, Cache, Main Memory, Secondary Storage; ALUs, Registers, Program Counter, Instruction Register; Numbers, text, audio, video, image, …; Assembly Instructions; Machine Instructions; Op-code, operands; Instruction set architecture; Circuits; Lossless v. lossy; Info entropy & Huffman code; Data compression; Compiler/Interpreter; CPU; Data storage; Decimal, Hexadecimal, Binary; Data Representation; Gates; Adders, decoders, memory latches, ALUs, etc.; Boolean Logic; AND, OR, NOT, XOR, NAND, NOR, etc.; Number Systems; Binary Numbers; Bits & Bytes; Formal models; Finite automata; regex; Program; Memory hierarchy; Design Principles; Algorithms; Analysis; Big-O; Data Structures; Searching, sorting, etc.; Stacks, queues, maps, trees, graphs, …; Truth table; Venn Diagram; De Morgan's Law]


Theories of Computation
• What are the fundamental capabilities and limitations of computers?
  – Complexity theory: what makes some problems computationally hard and others easy?
  – Automata theory: definitions and properties of mathematical models of computation
  – Computability theory: what makes some problems solvable and others unsolvable?

Finite Automata
• Also known as finite state automata (FSA) or finite state machines (FSM)
• A simple mathematical model of a computer
• Applications: hardware design, compiler design, text processing

A First Example
• Touch-less hand-dryer
[State diagram: two states, DRYER OFF and DRYER ON]

A First Example
• Touch-less hand-dryer
[State diagram: states DRYER OFF and DRYER ON, with two transitions, each labeled "hand"]

Finite Automata State Graphs
• A state
• The start state
• An accepting state
• A transition, labeled with an input symbol x

Finite Automata
• Transition: s1 --x--> s2
  – In state s1, on input "x", go to state s2
• If end of input
  – If in accepting state => accept
  – Else => reject
• If no transition possible => reject
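The acceptance rules above can be sketched as a short simulator. This is a minimal illustration, not part of the lecture: the function name `accepts`, the dictionary encoding of transitions, and the example automaton (strings over a and b that end in b) are all assumptions made for the sketch.

```python
def accepts(transitions, start, accepting, symbols):
    """Run a finite automaton over a sequence of input symbols."""
    state = start
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            return False  # no transition possible => reject
        state = transitions[key]  # in state s1 on input x, go to state s2
    return state in accepting  # end of input: accept only in an accepting state


# Hypothetical automaton: accepts strings over {"a", "b"} that end in "b".
TRANSITIONS = {
    ("s1", "a"): "s1",
    ("s1", "b"): "s2",
    ("s2", "a"): "s1",
    ("s2", "b"): "s2",
}

print(accepts(TRANSITIONS, "s1", {"s2"}, "aab"))  # True
print(accepts(TRANSITIONS, "s1", {"s2"}, "aba"))  # False (ends in a non-accepting state)
print(accepts(TRANSITIONS, "s1", {"s2"}, "abc"))  # False (no transition on "c")
```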

Language of a FA
• Language of finite automaton M: the set of all strings accepted by M
• Example: [State diagram: start state S goes to accepting state A on "letter"; A loops to itself on "letter | digit"]
• Which of the following are in the language?
  – x, tmp2, 123, a?, 2apples
• A language is called a regular language if it is recognized by some finite automaton
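One way to check the five candidate strings is to write the example automaton's language as a regular expression. The sketch below assumes the automaton reads one letter and then any mix of letters and digits, and that "letter" means an ASCII letter; the names `IDENTIFIER` and `in_language` are made up for illustration.

```python
import re

# Assumed reading of the example automaton: a letter, then letters or digits.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def in_language(s):
    # fullmatch requires the whole string to be accepted, like the automaton
    return IDENTIFIER.fullmatch(s) is not None


for candidate in ["x", "tmp2", "123", "a?", "2apples"]:
    print(candidate, in_language(candidate))
```

Under these assumptions only x and tmp2 are in the language: 123 and 2apples start with a digit, and a? contains a character with no transition.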

Example
• What is the language of this FA?
[State diagram: states S, B, and A; the surviving transition labels are "+" and "digit"]

Regular Expressions
• Regular expressions are used to describe regular languages
• Arithmetic expression example: (8+2)*3
• Regular expression example: (+|-)?[0-9]+

Example
• What is the language of this FA?
• Regular expression: (+|-)?[0-9]+
[State diagram: states S, B, and A; the surviving transition labels are "+" and "digit"]
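To see the automaton and the regular expression agree, the sketch below hand-codes one plausible transition table and compares it with Python's `re` module. The transitions are an assumption inferred from the regular expression (a sign moves S to B, a digit moves S, B, or A to A). Note that Python rejects the slide's notation `(+|-)?` as written, because a bare `+` is a quantifier, so the pattern escapes it as `\+`.

```python
import re

# Assumed transitions: S is the start state, B follows a sign, A is accepting.
TRANSITIONS = {
    ("S", "sign"): "B",
    ("S", "digit"): "A",
    ("B", "digit"): "A",
    ("A", "digit"): "A",
}


def symbol_class(ch):
    if ch in "+-":
        return "sign"
    if ch in "0123456789":
        return "digit"
    return "other"


def dfa_accepts(text):
    state = "S"
    for ch in text:
        state = TRANSITIONS.get((state, symbol_class(ch)))
        if state is None:
            return False  # no transition possible => reject
    return state == "A"


SIGNED_INT = re.compile(r"(\+|-)?[0-9]+")  # "+" escaped for Python

for text in ["42", "+7", "-130", "+", "", "4-2"]:
    by_regex = SIGNED_INT.fullmatch(text) is not None
    print(repr(text), dfa_accepts(text), by_regex)
```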

Three Equivalent Representations
• Regular expressions, finite automata, and regular languages: each can describe the others
• Theorem: For every regular expression, there is a deterministic finite-state automaton that defines the same language, and vice versa.
Adapted from Jurafsky & Martin 2000

Regex Rules
• ? Zero or one occurrences of the preceding character/regex
  – woodchucks?   “how much wood does a woodchuck?”
  – behaviou?r    “behaviour is the British spelling of behavior”
• * Zero or more occurrences of the preceding character/regex
  – baa*          ba, baa, baaa, baaaa, …
  – ba*           b, ba, baa, baaa, …
  – [ab]*         the empty string, a, b, ab, baaa, aaabbb, …
  – [0-9]*        any string of digits, so any non-negative integer (and also the empty string)
  – cat.*cat      a string where “cat” appears twice anywhere
• + One or more occurrences of the preceding character/regex
  – ba+           ba, baa, baaa, baaaa, …
• {n} Exactly n occurrences of the preceding character/regex
  – ba{3}         baaa
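The quantifiers can be tried directly in Python. In this small sketch the helper `full` is a made-up convenience wrapper around `re.fullmatch`, which requires the whole string to match.

```python
import re


def full(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(full(r"woodchucks?", "woodchuck"))   # True: the final s is optional
print(full(r"behaviou?r", "behavior"))     # True: the u is optional
print(full(r"ba*", "b"))                   # True: zero a's allowed
print(full(r"ba+", "b"))                   # False: at least one a required
print(full(r"ba{3}", "baaa"))              # True: exactly three a's
print(full(r"[0-9]*", ""))                 # True: * also accepts the empty string
print(re.search(r"cat.*cat", "a cat chased a cat") is not None)  # True
```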

Regex Rules
• * is greedy: <.*> matches as much as possible
  – In <a href="index.html">Home</a> it matches the whole string, from the first < to the last >
• Lazy (non-greedy) quantifier: <.*?> matches as little as possible
  – In <a href="index.html">Home</a> it matches only <a href="index.html">
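A short sketch of the greedy and lazy behaviour on the example string, using Python's `re` module (the variable names are illustrative):

```python
import re

html = '<a href="index.html">Home</a>'

greedy = re.search(r"<.*>", html).group()   # runs to the LAST ">"
lazy = re.search(r"<.*?>", html).group()    # stops at the FIRST ">"
all_tags = re.findall(r"<.*?>", html)       # every shortest <...> span

print(greedy)    # <a href="index.html">Home</a>
print(lazy)      # <a href="index.html">
print(all_tags)  # ['<a href="index.html">', '</a>']
```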

Regex Rules
• [] Disjunction (Union)
  – [wW]ood       “how much wood does a Woodchuck?”
  – [abcd]*       “you are a good programmer”
  – [A-Za-z0-9]   (any letter or digit)
  – [A-Za-z]*     (any letter sequence)
• | Disjunction (Union)
  – (cats?|dogs?)+   “It’s raining cats and a dog.”
• ( ) Grouping
  – (gupp(y|ies))*   “His guppy is the king of guppies.”
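The bracket, alternation, and grouping rules can be checked against the example sentences. This is a minimal sketch; the variable names are illustrative. One Python-specific detail is worth knowing: when a pattern contains a group, `re.findall` returns the group's text rather than the whole match.

```python
import re

wood = re.findall(r"[wW]ood", "how much wood does a Woodchuck?")
pets = re.findall(r"cats?|dogs?", "It's raining cats and a dog.")

sentence = "His guppy is the king of guppies."
# With a capturing group, findall returns only what the group matched.
endings = re.findall(r"gupp(y|ies)", sentence)
# finditer gives match objects, so group(0) recovers the whole match.
guppies = [m.group(0) for m in re.finditer(r"gupp(y|ies)", sentence)]

print(wood)     # ['wood', 'Wood']
print(pets)     # ['cats', 'dog']
print(endings)  # ['y', 'ies']
print(guppies)  # ['guppy', 'guppies']
```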

Regex Rules
• ^ $ \b Anchors (start/end of input string; word boundary)
  – ^The             “The cat in the hat.”
  – ^The end.$       “The end.”
  – ^The.* end.$     “The bitter end.”
  – (the)*           “I saw him the other day.” (also matches the “the” inside “other”)
  – (\bthe\b)*       “I saw him the other day.” (matches only the standalone word “the”)
• Special rule: when ^ is FIRST WITHIN BRACKETS it means NOT
  – [^A-Z]*          (anything not an upper case letter)
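The anchors and the negated character class can be illustrated as follows. This is a sketch with made-up variable names; it escapes the final period as `\.` so that it matches only a literal period, and it uses `+` with the negated class so that `findall` does not return empty matches.

```python
import re

starts_with_the = re.search(r"^The", "The cat in the hat.") is not None
not_at_start = re.search(r"^The", "In the hat.") is not None
bitter = re.search(r"^The.* end\.$", "The bitter end.") is not None

text = "I saw him the other day."
anywhere = re.findall(r"the", text)        # also finds "the" inside "other"
whole_word = re.findall(r"\bthe\b", text)  # word boundaries on both sides

not_upper = re.findall(r"[^A-Z]+", "Hello World")  # runs of non-uppercase characters

print(starts_with_the, not_at_start, bitter)  # True False True
print(anywhere)    # ['the', 'the']
print(whole_word)  # ['the']
print(not_upper)   # ['ello ', 'orld']
```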

Regex Rules
• \ Escape characters
  – \. \+ \\       “The + and \ characters are missing.”
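A small sketch of escaping in Python, assuming the example sentence contains a literal plus sign and a literal backslash. Raw strings (`r"..."`) keep Python itself from interpreting the backslashes, and `re.escape` builds a pattern that matches a given string literally.

```python
import re

sentence = "The + and \\ characters are missing."  # "\\" is one backslash

# \+ matches a literal plus sign, \\ matches a literal backslash
specials = re.findall(r"\+|\\", sentence)
print(specials)  # ['+', '\\']

# An unescaped "." matches any character; "\." matches only a period.
print(re.fullmatch(r"3.14", "3x14") is not None)   # True
print(re.fullmatch(r"3\.14", "3x14") is not None)  # False

# re.escape escapes special characters so the text is matched literally.
literal = "1+1=2."
print(re.fullmatch(re.escape(literal), literal) is not None)  # True
```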

Operator Precedence
Operator                Precedence
()                      highest
* + ? {}
sequences, anchors
|                       lowest
• What is the difference?
  – [a-z]|[0-9]*
  – [a-z]([a-z]|[0-9])*
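The difference between the two patterns shows up when both are run on the same strings. Because `*` binds tighter than `|`, the first pattern means "one lowercase letter, or any string of digits", while the second means "a lowercase letter followed by letters or digits". The helper `full` and the names `LOOSE` and `IDENT` are illustrative.

```python
import re


def full(pattern, text):
    return re.fullmatch(pattern, text) is not None


LOOSE = r"[a-z]|[0-9]*"          # parsed as [a-z] | ([0-9]*)
IDENT = r"[a-z]([a-z]|[0-9])*"   # parentheses make * apply to the whole choice

for text in ["a", "123", "", "a1b2", "1a"]:
    print(repr(text), full(LOOSE, text), full(IDENT, text))
```

Only "a" is accepted by both: the first pattern also accepts "123" and the empty string, and only the second accepts "a1b2".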

Regex for Dollars
• No commas
  – \$[0-9]+(\.[0-9][0-9])?
• With commas
  – \$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?
• With or without commas
  – \$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?
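The dollar patterns can be tested in Python. The sketch below rests on stated assumptions rather than the slide's exact text: the dollar sign and decimal point are escaped (unescaped, `$` is an anchor and `.` matches any character), cents are exactly two digits, and comma groups are exactly three digits. The constant names are made up.

```python
import re

# Assumed conventions: two-digit cents, three-digit comma groups.
NO_COMMAS = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")
WITH_COMMAS = re.compile(r"\$[0-9][0-9]?[0-9]?(,[0-9][0-9][0-9])*(\.[0-9][0-9])?")
EITHER = re.compile(
    r"\$[0-9][0-9]?[0-9]?((,[0-9][0-9][0-9])*|[0-9]*)(\.[0-9][0-9])?"
)

for amount in ["$1234.56", "$1,234.56", "$1,234,567", "$12,34", "$5"]:
    print(
        amount,
        NO_COMMAS.fullmatch(amount) is not None,
        WITH_COMMAS.fullmatch(amount) is not None,
        EITHER.fullmatch(amount) is not None,
    )
```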

Regex in Python
• import re
• result = re.search(pattern, string)
• result = re.findall(pattern, string)
• result = re.match(pattern, string)
• Python documentation on regular expressions
  – http://docs.python.org/release/3.1.3/library/re.html
  – Some useful flags like IGNORECASE, MULTILINE, DOTALL, VERBOSE
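A brief sketch of how the three functions differ, with one flag in use. The sample string and variable names are made up for illustration: `re.match` only tries at the start of the string, `re.search` finds the first match anywhere, and `re.findall` returns every non-overlapping match.

```python
import re

text = "Spring 2012: lecture 18 of i206"

at_start = re.match(r"[0-9]+", text)            # None: the string starts with "S"
first = re.search(r"[0-9]+", text).group()      # first run of digits anywhere
every = re.findall(r"[0-9]+", text)             # all runs of digits
no_case = re.findall(r"LECTURE", text, re.IGNORECASE)  # flag ignores letter case

print(at_start)  # None
print(first)     # 2012
print(every)     # ['2012', '18', '206']
print(no_case)   # ['lecture']
```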