
Regular expressions
• http://en.wikipedia.org/wiki/Regular_expression
• Mastering Regular Expressions by Jeffrey E. F. Friedl
• Linux editors and commands (e.g. grep) use regular expressions
ISBN 0-321-33025-0


Language Theory
• Chomsky identified four classes of language
• Programming languages are described by a context-free grammar
• Regular languages are somewhat simpler


Regular Grammars
• Regular grammars are grammars whose BNF rules are restricted to the form <lhs> -> terminal <non-terminal>
• Regular grammars can be represented by finite state automata and by regular expressions
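
The equivalence between a regular grammar, a finite state automaton and a regular expression can be sketched in Python. The grammar below is a hypothetical example, not one from the slides, and it assumes the usual convention that a rule may also derive the empty string so that a derivation can end. Each rule of the form X -> t Y becomes a transition from state X to state Y on symbol t.

```python
import re

# Hypothetical right-linear grammar for "a followed by zero or more b":
#   S -> a A
#   A -> b A
#   A -> (empty)
# Each rule "X -> t Y" becomes a transition (X, t) -> Y.
TRANSITIONS = {("S", "a"): "A", ("A", "b"): "A"}
START = "S"
ACCEPTING = {"A"}  # A can derive the empty string, so it is an accepting state


def accepts(text):
    """Run the finite state automaton over the whole input string."""
    state = START
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no rule applies, so the string is rejected
            return False
    return state in ACCEPTING


def regex_accepts(text):
    """The same language written as the regular expression ab*."""
    return re.fullmatch(r"ab*", text) is not None


for sample in ["a", "abbb", "", "ba", "abab"]:
    print(repr(sample), accepts(sample), regex_accepts(sample))
```

Both functions should agree on every input, since they describe the same regular language in two notations.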


Regular Expressions
• First described by Stephen Kleene
• Used for pattern matching
  – Unix utilities like grep and awk
  – built into many scripting languages (e.g. Perl)
  – libraries exist for other languages (Pattern and Matcher classes in Java)
• No standard notation
  – Many languages use Perl Compatible Regular Expressions
• Useful for describing things like identifiers and numbers for a programming language


Regular Expression Components
• Atoms - the characters that can be combined to make the pattern being described
• Concatenation - a sequence of atoms
• Alternation - a choice between several patterns
• Kleene closure (*) - 0 or more occurrences
• Positive closure (+) - 1 or more occurrences
• Nothing ( )


Patterns and Matching
• A pattern is generally enclosed between a matched pair of characters, most commonly //
  – /pattern/
• Languages that support pattern matching may have a match operator (e.g. =~ in Perl)

Regular Expression Metacharacters
• Characters that have a special meaning within a pattern



Simple Examples
• A single character: /a/
  – Matches any string that contains the letter a
• A sequence of characters
  – /ab/ matches any string that contains the letter a followed immediately by the letter b
  – /bird/ matches any string that contains the word bird
  – /Regular/ matches any string that contains the word Regular (matches are case-sensitive by default)
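
These examples can be tried out with a short Python sketch. Python's re module is used here as an assumption, since the slides use Perl-style /pattern/ notation; Python has no // delimiters, so the pattern is passed as a string. The helper name contains is hypothetical.

```python
import re


def contains(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


print(contains(r"a", "banana"))                    # True
print(contains(r"ab", "cab"))                      # True
print(contains(r"ab", "a b"))                      # False: b does not follow a immediately
print(contains(r"bird", "blackbird"))              # True
print(contains(r"Regular", "regular expressions"))  # False: matching is case-sensitive
```

re.search is used rather than re.match because the slides describe patterns that match anywhere in the string, not only at its start.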


More Examples
• Any character: a.
  – a followed by any character
• A choice of two characters: a|b
  – matches a, b, ac, ab, bc but not cd, ef
• Optional repeated character: ab*
  – matches a, ab, abbbb, abracadabra
• Optional repeated sequence: a(bc)*
  – matches a, abcbc
• At least one occurrence of a character: ab+
  – matches ab, abbbb, abracadabra
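
The matches listed above can be checked with a small Python sketch. It assumes Python's re module, and the non-matching strings other than cd and ef are added here for illustration. The helper names failures and first_match are hypothetical.

```python
import re

# pattern -> (strings expected to match somewhere, strings expected not to match)
EXAMPLES = {
    r"a.": (["ab", "a!", "cat"], ["a", "b"]),
    r"a|b": (["a", "b", "ac", "ab", "bc"], ["cd", "ef"]),
    r"ab*": (["a", "ab", "abbbb", "abracadabra"], ["b", "xyz"]),
    r"a(bc)*": (["a", "abcbc"], ["bc"]),
    r"ab+": (["ab", "abbbb", "abracadabra"], ["a", "ba"]),
}


def failures():
    """Return every (pattern, text) pair that behaves differently than expected."""
    wrong = []
    for pattern, (should_match, should_not_match) in EXAMPLES.items():
        for text in should_match:
            if re.search(pattern, text) is None:
                wrong.append((pattern, text))
        for text in should_not_match:
            if re.search(pattern, text) is not None:
                wrong.append((pattern, text))
    return wrong


def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    found = re.search(pattern, text)
    return found.group() if found else None


print(failures())
print(first_match(r"ab*", "abbbb"))
print(first_match(r"a(bc)*", "abcbc"))
```

Looking at the matched text, not only at whether a match exists, shows which part of the string each closure actually consumed.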


Anchors
• Sometimes you want to check for something at the beginning or end of a string
  – /^The/ matches only if the first three characters in the string are The
  – /tar$/ matches only if the last three characters of the string are tar
  – If you need to match the beginning and/or end of a word, you can add a space at the appropriate end
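
A minimal Python sketch of the anchors described above, assuming Python's re module. The function names and sample strings are hypothetical. The last function uses the add-a-space approach from the slide, which has a visible limitation: it misses a word at the very start or end of the string.

```python
import re


def starts_with_the(text):
    """^ anchors the pattern to the beginning of the string."""
    return re.search(r"^The", text) is not None


def ends_with_tar(text):
    """$ anchors the pattern to the end of the string."""
    return re.search(r"tar$", text) is not None


def has_word_cat(text):
    """Spaces on both sides approximate word edges inside a sentence."""
    return re.search(r" cat ", text) is not None


print(starts_with_the("The end"), starts_with_the("At The end"))
print(ends_with_tar("guitar"), ends_with_tar("tarot"))
print(has_word_cat("the cat sat"), has_word_cat("concatenate"), has_word_cat("cat"))
```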


Character Classes
• You can put a set of characters inside square brackets to create a character class
  – [abc] means any one of a, b or c
• A ^ as the first character means any character that isn't in the set
  – [^abc] means any character except a, b or c
• You can also specify ranges of characters (based on ASCII codes)
  – [0-9] is any digit
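
The character classes above can be explored with a short Python sketch, assuming Python's re module. The helper name chars_matching and the sample strings are hypothetical. re.findall returns every non-overlapping match, which makes it easy to see which characters a class accepts.

```python
import re


def chars_matching(char_class, text):
    """Return every substring of text matched by the given pattern."""
    return re.findall(char_class, text)


print(chars_matching(r"[abc]", "cabbage"))    # characters in the set
print(chars_matching(r"[^abc]", "cabbage"))   # characters not in the set
print(chars_matching(r"[0-9]", "room 101"))   # single digits
print(chars_matching(r"[0-9]+", "10 apples, 25 pears"))  # runs of digits
```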


Perl Compatible Regular Expressions
• Use \b to specify a word boundary
• Named character classes
  – \d for any digit
  – \w for letters, digits and underscores
  – \s for whitespace
  – \D, \W, \S match any character not in the corresponding lower case class
• {} after a regular expression can be used to specify a number of repeats
• /i at end of pattern means case-insensitive
• /s at end of pattern means . also matches newlines
  – . normally only matches characters other than newlines
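
A hedged Python sketch of these features. Python's re module is not PCRE itself, but it supports the same \b, \d, \w, \s and {n} notation; the trailing /i and /s modifiers are assumed here to correspond to the re.IGNORECASE and re.DOTALL flags. The function names and sample strings are hypothetical.

```python
import re


def whole_words(word, text):
    """Find the word only where it stands alone, using \\b on both sides."""
    return re.findall(r"\b" + re.escape(word) + r"\b", text)


def three_digit_groups(text):
    """{3} asks for exactly three repeats of the preceding \\d."""
    return re.findall(r"\d{3}", text)


def matches_ignoring_case(pattern, text):
    """Counterpart of a trailing /i."""
    return re.search(pattern, text, re.IGNORECASE) is not None


def dot_matches(text, across_newlines):
    """Counterpart of a trailing /s: let . match a newline as well."""
    flags = re.DOTALL if across_newlines else 0
    return re.search(r"a.b", text, flags) is not None


print(whole_words("cat", "cat concatenate cat."))
print(three_digit_groups("call 555 1234"))
print(re.findall(r"\w+", "snake_case x2!"))
print(re.findall(r"\S+", "a  b\tc"))
print(matches_ignoring_case(r"regular", "Regular Expressions"))
print(dot_matches("a\nb", False), dot_matches("a\nb", True))
```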


Regular Expressions for String Manipulation
• split(regexp, string) tokenizes a string
• s/regexp/replacement/ substitutes for regexp
  – g at end means do all occurrences
• Expression memory allows you to remember what matches parts of pattern in parentheses
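
The slide uses Perl notation; the same three operations can be sketched in Python, assuming re.split and re.sub as the counterparts of split and s///. Note that re.sub replaces all occurrences by default, so the behaviour without the trailing g is obtained with count=1. The function names and sample strings are hypothetical.

```python
import re


def tokenize(text):
    """Split on commas, discarding any whitespace around them."""
    return re.split(r"\s*,\s*", text)


def replace_first(text):
    """Like s/cat/dog/ : only the first occurrence is replaced."""
    return re.sub(r"cat", "dog", text, count=1)


def replace_all(text):
    """Like s/cat/dog/g : every occurrence is replaced."""
    return re.sub(r"cat", "dog", text)


def swap_names(text):
    """Expression memory: \\1 and \\2 refer to the parenthesized parts."""
    return re.sub(r"(\w+), (\w+)", r"\2 \1", text)


print(tokenize("a, b ,c"))
print(replace_first("cat and cat"))
print(replace_all("cat and cat"))
print(swap_names("Friedl, Jeffrey"))
```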


Regular Expressions in Java
• Java has classes for using regular expressions
  – The String class has a matches method
    • parameter is a regular expression; the result is true only if the entire string matches it
  – The java.util.regex package has classes that can be used for pattern matching operations
    • Pattern represents a compiled regular expression
    • Matcher performs various pattern matching operations on an input string


Try these
• Give a regular expression to recognize
  – Java identifiers
  – integer literals
  – a phone number with optional country code
  – number on a license plate
• Can you think of any others?