Regular Expressions

2 Concepts
• Regular Expressions
  – allow searching for a pattern within a text string
  – the patterns can be rather complex
    • same idea as "wildcard" characters (compare SQL), but much more expressive
  – often abbreviated, e.g., as RegExp
  – RegExp quantifiers match as much as possible by default
    • they are greedy
• Theoretical underpinnings
  – nondeterministic finite automata (NFA)
  – regular grammars
  – but some constructs extend the functionality further
    • even beyond CFG (context-free grammars)
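To make the "greedy" bullet concrete, here is a minimal JavaScript sketch. The sample string and variable names are invented for illustration, and the lazy form `.*?` is not covered on the slides; it is shown only as a contrast.

```javascript
// Hypothetical sample text with two bold elements.
const sample = "<b>one</b> and <b>two</b>";

// Greedy: .* grabs as much as possible, so the match runs to the LAST "</b>".
const greedy = sample.match(/<b>.*<\/b>/)[0];

// Lazy (assumption: not on the slides): .*? stops at the FIRST "</b>".
const lazy = sample.match(/<b>.*?<\/b>/)[0];

console.log(greedy); // <b>one</b> and <b>two</b>
console.log(lazy);   // <b>one</b>
```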

3 Support
• Popular, widely supported
• Directly in scripting languages
  – JavaScript
    • special syntax
  – PHP
    • functions
  – Ruby
  – Perl
• As libraries
  – Java's java.util.regex package

4 JavaScript RegExp
• Directly as argument of methods of the String object
  – string.match(regexp)
    • returns an array of the substrings that matched the regexp pattern, null if there is no match
  – string.replace(regexp, by)
    • returns a new string where the first (or, with the g modifier, all) matched patterns are replaced with the by string
  – string.search(regexp)
    • returns the index of the first substring that matched the regexp pattern, -1 if there is no match
  – string.split(regexp)
    • returns an array of the substrings of string separated by regexp
• regexp argument
  – enclosed in /
    • e.g., /ex/ matches the first occurrence of "ex"
  – optional modifiers placed as suffix
    • g (global); used in replace()
      – e.g., /ex/g matches all occurrences of "ex"
    • i (ignore case)
      – e.g., /ex/i matches "ex", "EX", "Ex" and "eX"
    • m (multiline)
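The four String methods and the g and i modifiers can be tried out with a short sketch like the following. The sample text and variable names are made up for illustration.

```javascript
// Hypothetical sample text.
const text = "Text example: next EXIT";

// match(): without g only the first match; with g every match.
const firstMatch = text.match(/ex/)[0];
const allMatches = text.match(/ex/g);
const allIgnoreCase = text.match(/ex/gi);

// search(): index of the first match, or -1.
const foundAt = text.search(/ex/);
const notFound = text.search(/zz/);

// replace(): first match only, or all matches with g.
const replacedFirst = text.replace(/ex/, "_");
const replacedAll = text.replace(/ex/g, "_");

// split(): the regexp describes the separator.
const parts = "a, b;c".split(/[,;]\s*/);

console.log(firstMatch, allMatches, allIgnoreCase);
console.log(foundAt, notFound);       // 1 -1
console.log(replacedFirst);           // T_t example: next EXIT
console.log(replacedAll);             // T_t _ample: n_t EXIT
console.log(parts);                   // [ 'a', 'b', 'c' ]
```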

5 PHP RegExp
• functions with $regexp and $string arguments
  – ereg($regexp, $string [, &$matches])
    • returns the length of the matched string, false if there is no match
    • the array reference &$matches, if given, will be filled with the matched string in $matches[0] and the matched substrings in subsequent elements
  – ereg_replace($regexp, $by, $string)
    • returns a string where the first (or all) matched patterns were replaced with the $by string
  – split($regexp, $string [, $limit])
    • returns an array of the substrings of $string that were separated by patterns matching $regexp
    • the optional $limit determines how many substrings to return (the last one contains the remainder)
  – eregi(), eregi_replace(), spliti()
    • same as ereg(), ereg_replace() and split(), but ignore case
  – preg_match($regexp, $string)
    • similar to ereg(), see PHP documentation
    • if a global search for all matches is to be performed, ereg() or ereg_replace() must be called in a loop
  – note: the ereg family (including split()) was deprecated in PHP 5.3 and removed in PHP 7; the preg_ functions replace it

6 Syntax in JavaScript
• by "element" we mean a character or a group
• .       any character (except a line break)
• ?       one occurrence of the preceding element or nothing
• *       any number of occurrences of the preceding element, incl. none
  – e.g., a.*z matches the largest substring that starts with a and ends with z, incl. "az"
• +       any number of occurrences of the preceding element, but at least one
  – e.g., a.+z matches the largest substring that starts with a and ends with z, not including "az"
  – note that "azz" and "aaz" are matched
• {n}     exactly n occurrences of the preceding element
• {m,n}   between m and n occurrences of the preceding element
• ^       beginning of the string
• $       end of the string
• a sequence of elements means that such a sequence must be matched
  – e.g., a.z matches "axz", "a5z", "aQz", etc.
• []      alternative elements
  – e.g., [ab] means a or b
• [^ ]    none of the alternative elements
  – e.g., [^ab] means not a and not b
• -       range
  – e.g., [a-zA-Z] means a through z or A through Z, i.e. all lower-case and upper-case letters
• |       or
  – e.g., ab|yz matches "ab" and "yz"
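The syntax elements above can be checked one by one with a small sketch. The test strings and variable names are invented for illustration; the patterns are the ones from the slide or minimal variations of them.

```javascript
// * allows zero occurrences, + requires at least one.
const starMatchesAz = /a.*z/.test("az");    // true: .* may match nothing
const plusMatchesAz = /a.+z/.test("az");    // false: .+ needs one character
const plusMatchesAzz = /a.+z/.test("azz");  // true
const plusMatchesAaz = /a.+z/.test("aaz");  // true

// Greedy: the largest substring from the first a to the last z.
const largest = "xazbzx".match(/a.*z/)[0];

// ? makes the preceding element optional.
const optionalU = /^colou?r$/.test("color");

// {n} and {m,n} count occurrences of the preceding element.
const exactlyTwo = /^ab{2}c$/.test("abbc");
const twoToThree = /^ab{2,3}c$/.test("abbbbc");

// [^ab] matches any character that is neither a nor b.
const onlyAb = "a1b2".replace(/[^ab]/g, "");

// | matches either alternative.
const alternatives = "cab yz".match(/ab|yz/g);

console.log(largest);       // azbz
console.log(onlyAb);        // ab
console.log(alternatives);  // [ 'ab', 'yz' ]
```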

7 Special Characters
• Denoted by \
  – \/: /
  – \b: word boundary (the position between a word character and a non-word character)
  – \t: tab character
  – \n: line feed
  – \r: carriage return
  – \f: form feed
  – \s: whitespace character, such as [ \t\r\n\f]
  – \d: digit, i.e. [0-9]
  – \w: word character, i.e. [a-zA-Z0-9_]
  – \S: not a whitespace character, i.e. [^\s]
  – \D: not a digit, i.e. [^\d]
  – \W: not a word character, i.e. [^\w]
• any other character preceded by \ means the character itself
• the "meta-characters" need to be escaped:
  – \, /, [, ], ., ?, |, +, *, (, ), ^, $, -, {, }
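A short sketch of the character classes and of escaping meta-characters. The sample strings and variable names are made up for illustration.

```javascript
// \d, \w and \s as shorthand character classes.
const numbers = "Room 42, floor 7".match(/\d+/g);
const words = "snake_case and CamelCase!".match(/\w+/g);
const collapsed = "a \t b\n c".replace(/\s+/g, " ");

// \b marks a word boundary, so "cat" inside "concat" is skipped.
const wholeWords = "cat concat cat".match(/\bcat\b/g);

// An unescaped . matches any character; \. matches only a literal dot.
const unescapedDot = /1.5/.test("125");   // true
const escapedDot = /1\.5/.test("125");    // false

// $ and . must be escaped to be taken literally.
const hasPrice = /\$\d+\.\d{2}/.test("Total: $19.99");

console.log(numbers);     // [ '42', '7' ]
console.log(words);       // [ 'snake_case', 'and', 'CamelCase' ]
console.log(collapsed);   // a b c
console.log(wholeWords);  // [ 'cat', 'cat' ]
```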

8 RegExp Capturing
• If you enclose subpattern(s) within ( and ) in a RegExp, those are the pattern(s) that will be captured, i.e. returned or used
  – e.g., \b(.*)@ will capture the first part of an email address (the part before the @)
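Captured groups can be read from the array returned by match() or reused in replace() as $1, $2, and so on. The sketch below uses an invented address and invented variable names; the second pattern is an illustrative variation, not taken from the slides.

```javascript
// The slide's pattern applied to a single address.
const localPart = "alice@example.com".match(/\b(.*)@/)[1];

// Two groups: element 0 is the whole match, 1 and 2 are the captures.
const m = "Contact: alice@example.com".match(/\b(\w+)@(\w+\.\w+)/);
const whole = m[0];
const user = m[1];
const domain = m[2];

// Captures can be referenced in the replacement string.
const swapped = "alice@example.com".replace(/(\w+)@(\w+\.\w+)/, "$2 <- $1");

console.log(localPart);            // alice
console.log(whole, user, domain);  // alice@example.com alice example.com
console.log(swapped);              // example.com <- alice
```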

9 Sample RegExp
• hex digit:
  – [0-9a-fA-F]
• identifier:
  – [a-zA-Z_][a-zA-Z_0-9]*
• email address:
  – \b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b
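The sample patterns can be exercised as follows. This is a sketch under assumptions: the ^ and $ anchors on the first two patterns are added here so that the whole input is validated, and all test strings and variable names are invented.

```javascript
// Assumption: ^ and $ are added to validate the entire string.
const hexDigit = /^[0-9a-fA-F]$/;
const identifier = /^[a-zA-Z_][a-zA-Z_0-9]*$/;

// Email pattern as on the slide; the g variant finds every address in a text.
const email = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/;
const emailAll = /\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}\b/g;

const hexOk = hexDigit.test("F");
const hexBad = hexDigit.test("g");
const idOk = identifier.test("_tmp1");
const idBad = identifier.test("1abc");
const mailOk = email.test("john.doe@example.com");
const mailBad = email.test("john.doe@example");
const found = "Mail bob@site.org or eve@corp.net.".match(emailAll);

console.log(hexOk, hexBad, idOk, idBad, mailOk, mailBad);
console.log(found); // [ 'bob@site.org', 'eve@corp.net' ]
```

Note that {2,4} limits the last domain label to two to four letters, so an address with a longer top-level domain is not matched by this pattern.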