Text Processing with Regular Expressions


What Is a Regular Expression?
• A regular expression is a language designed to manipulate text. Users write patterns in its extensive pattern-matching notation to:
  – Search text
  – Extract, edit, replace, or delete text substrings
  – Validate input data (values, formats)
• Familiar examples of pattern matching:
  – File-name wildcard: *.doc
  – SQL wildcard: Select * From Student Where Sname Like 'C%';

System.Text.RegularExpressions Namespace
• Import the System.Text.RegularExpressions namespace and use the Regex class to create regular expressions:
  – Imports System.Text.RegularExpressions
  – Dim re As New Regex("[aeiou]\d")

Regular Expression Language Elements
1. Character Escapes
A backslash escape signals to the regular expression parser that the character is not an operator and should be interpreted as a character to match.
• Ordinary characters: characters other than . $ ^ { [ ( | ) * + ? \ match themselves.
• \a  Matches a bell (alarm).
• \b  Matches a backspace, but only inside a [ ] character class; elsewhere \b is a word boundary.
• \t  Matches a tab.
• \r  Matches a carriage return.
• \f  Matches a form feed.
• \n  Matches a new line.
• \e  Matches an escape.
• When the backslash is followed by a character that does not form an escape sequence, it matches that character: \* matches *, \( matches (.
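When a literal string has to be used as a pattern, escaping each metacharacter by hand is error-prone. The following is a minimal sketch, assuming a hypothetical price string as input, that uses the Regex.Escape method to add the backslashes.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module EscapeDemo
    Sub Main()
        ' Hypothetical literal text that contains metacharacters ($ . ( )).
        Dim literal As String = "$9.99 (sale)"

        ' Regex.Escape prefixes the metacharacters with a backslash
        ' so that the text is matched character for character.
        Dim pattern As String = Regex.Escape(literal)
        Console.WriteLine(pattern)

        ' The escaped pattern finds the literal text inside a longer string.
        Console.WriteLine(Regex.IsMatch("Now $9.99 (sale) only", pattern))
    End Sub
End Module
```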

2. Character Classes
Character classes define the set of characters that can match at one position of the input.
• .  Matches any character except \n.
• [aeiou]  Matches any single character included in the specified set of characters.
• [^aeiou]  Matches any single character not in the specified set of characters.
• [0-9a-fA-F]  A hyphen (-) specifies a contiguous character range.
• \w  Matches any word character. For ASCII text, \w is the same as [a-zA-Z_0-9].
• \W  Matches any nonword character. For ASCII text, \W is the same as [^a-zA-Z_0-9].
• \s  Matches any white-space character. \s is the same as [ \f\n\r\t\v].
• \S  Matches any non-white-space character. \S is the same as [^ \f\n\r\t\v].
• \d  Matches any decimal digit.
• \D  Matches any nondigit.
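The character classes above can be tried in a small console program. This is a sketch only; the sample string is an assumption chosen so that the digit runs are easy to locate.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CharClassDemo
    Sub Main()
        ' Hypothetical sample text.
        Dim source As String = "Room 42, Floor 7"

        ' \d+ finds each run of decimal digits.
        For Each m As Match In Regex.Matches(source, "\d+")
            Console.WriteLine("{0} at index {1}", m.Value, m.Index)
        Next
        ' Prints "42 at index 5" and "7 at index 15".

        ' [aeiou] matches any single lowercase vowel; each one is replaced by *.
        Console.WriteLine(Regex.Replace(source, "[aeiou]", "*"))
        ' Prints "R**m 42, Fl**r 7".
    End Sub
End Module
```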

Atomic Zero-Width Assertions
Zero-width assertions cause a match to succeed or fail depending on the regular expression parser's current position in the input string. They do not consume any characters.
• ^  Specifies that the match must occur at the beginning of the string or, with the Multiline option, at the beginning of a line.
• $  Specifies that the match must occur at the end of the string or, with the Multiline option, at the end of a line. For example, abc$ matches any abc immediately before the end of a line.
• \A  Specifies that the match must occur at the beginning of the string (ignores the Multiline option).
• \Z  Specifies that the match must occur at the end of the string or before \n at the end of the string (ignores the Multiline option).
• \z  Specifies that the match must occur at the end of the string (ignores the Multiline option).
• \G  Specifies that the match must occur at the point at which the current search started (often, this is one character beyond where the last search ended).
• \b  Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (nonalphanumeric) characters, that is, at the first or last character of a word.
• \B  Specifies that the match must not occur on a \b boundary.
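The effect of the Multiline option on $ and the behavior of \b are easiest to see by counting matches. The sketch below assumes two hypothetical sample strings; RegexOptions.Multiline is the standard .NET option referred to in the list above.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module AnchorDemo
    Sub Main()
        ' Two lines of hypothetical text, both ending in "abc".
        Dim source As String = "abc" & vbLf & "xyz abc"

        ' Without Multiline, $ only matches at the end of the whole string.
        Console.WriteLine(Regex.Matches(source, "abc$").Count)
        ' Prints 1.

        ' With Multiline, $ also matches at the end of each line.
        Console.WriteLine(Regex.Matches(source, "abc$", RegexOptions.Multiline).Count)
        ' Prints 2.

        ' \b restricts the match to whole words, so "concat" is skipped.
        Console.WriteLine(Regex.Matches("cat concat cat", "\bcat\b").Count)
        ' Prints 2.
    End Sub
End Module
```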

Quantifiers
Quantifiers add optional quantity data to a regular expression. A quantifier applies to the character, character class, or group that immediately precedes it.
• *  Specifies zero or more matches; same as {0,}.
• +  Specifies one or more matches; same as {1,}. For example, \w+ matches one or more word characters.
• ?  Specifies zero or one match; same as {0,1}.
• {n}  Specifies exactly n matches; for example, \d{3} matches exactly three digits.
• {n,}  Specifies at least n matches; for example, \w{3,} matches words with at least three characters.
• {n,m}  Specifies at least n, but no more than m, matches; for example, \d{3,5} matches groups of three, four, or five digits.
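A short console sketch of the {n,m} and {n,} forms follows. The input strings are assumptions; note that quantifiers are greedy, so \d{3,5} takes five digits from a longer run and leaves the rest.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module QuantifierDemo
    Sub Main()
        ' \d{3,5}: "12" is too short, "123" matches, and the seven-digit run
        ' yields "12345" (the remaining "67" is too short to match again).
        For Each m As Match In Regex.Matches("12 123 1234567", "\d{3,5}")
            Console.WriteLine(m.Value)
        Next
        ' Prints "123" and "12345".

        ' \w{3,}: words with at least three characters.
        For Each m As Match In Regex.Matches("a an the word", "\w{3,}")
            Console.WriteLine(m.Value)
        Next
        ' Prints "the" and "word".
    End Sub
End Module
```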

Alternation Constructs
Alternation constructs modify a regular expression to allow either/or matching.
• |  Matches any one of the terms separated by the | (vertical bar) character; for example, cat|dog|tiger. The leftmost successful match wins.
• (?(expression)yes|no)  Matches the "yes" part if the expression matches at this point; otherwise, matches the "no" part. The "no" part can be omitted.
• (?(name)yes|no)  Matches the "yes" part if the named capture string has a match; otherwise, matches the "no" part. The "no" part can be omitted.
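The "leftmost successful match wins" rule means that the order of the alternatives matters when one term is a prefix of another. A minimal sketch, with the terms tig and tiger chosen only for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module AlternationDemo
    Sub Main()
        ' The alternatives are tried left to right at each position,
        ' so the shorter term wins when it is listed first.
        Console.WriteLine(Regex.Match("tiger", "tig|tiger").Value)
        ' Prints "tig".

        ' Listing the longer term first lets it match in full.
        Console.WriteLine(Regex.Match("tiger", "tiger|tig").Value)
        ' Prints "tiger".
    End Sub
End Module
```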

Grouping Constructs
Grouping constructs cause a regular expression to capture groups of subexpressions.
• ( )  Captures the matched substring (or acts as a noncapturing group; for more information, see the ExplicitCapture option in Regular Expression Options). Captures using ( ) are numbered automatically based on the order of the opening parenthesis, starting from one. Capture element number zero is the text matched by the whole regular expression pattern.
• (?<name> )  Captures the matched substring into a named group. The string used for name must not contain any punctuation and cannot begin with a number. You can use single quotes instead of angle brackets; for example, (?'name' ).
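Captured groups are read through the Groups property of a Match object. The sketch below assumes a hypothetical input string and the group names prefix and digits, which are illustrative only.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GroupDemo
    Sub Main()
        ' Hypothetical pattern: one uppercase letter followed by three digits,
        ' each part captured under its own name.
        Dim m As Match = Regex.Match("Order E123 shipped", "(?<prefix>[A-Z])(?<digits>\d{3})")

        If m.Success Then
            ' Group 0 is the text matched by the whole pattern.
            Console.WriteLine(m.Groups(0).Value)
            ' Prints "E123".

            ' Named groups are looked up by name.
            Console.WriteLine(m.Groups("prefix").Value)
            ' Prints "E".
            Console.WriteLine(m.Groups("digits").Value)
            ' Prints "123".
        End If
    End Sub
End Module
```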

Matches Method
• Matches: takes a string as input and returns a MatchCollection object that contains zero or more Match objects.
• Match object properties and method:
  – Value
  – Index
  – Length
  – NextMatch
• MatchCollection properties:
  – Count
  – Item(index)

    Imports System.Text.RegularExpressions

    Public Class Form1
        Inherits System.Windows.Forms.Form

        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            ' Build the regular expression from the pattern text box.
            Dim re As Regex
            re = New Regex(txtRE.Text)
            Dim source As String
            source = txtSource.Text
            ' Collect every match and list one match per line.
            Dim mc As MatchCollection = re.Matches(source)
            Dim m As Match
            Dim result As String = ""
            For Each m In mc
                result = result & m.Value & vbCrLf
            Next
            txtMatches.Text = result
        End Sub
    End Class

Match Method
• Returns the first Match object; NextMatch advances to the following match:
    Dim m As Match = re.Match(source)
    Do While m.Success
        …
        m = m.NextMatch()
    Loop

IsMatch Method
• Checks whether the pattern is contained in the source string:
    If re.IsMatch(source) Then
        …
    End If

Replace Method
• Replaces the portions of the source string that match the regular expression:
  – re.Replace(source, newString)
• Example:
    Dim re As Regex
    re = New Regex(txtRE.Text)
    Dim source As String
    source = txtSource.Text
    txtSource.Text = re.Replace(source, txtReplace.Text)

Validating Input Format with Regular Expressions
• Date format:
  – \d{2}-\d{2}$
  – \d{2}-(\d{2}$|\d{4}$)
• Phone number:
  – (\d{3})-\d{3}-\d{4}$
• EmpID begins with E followed by 3 digits:
  – E\d{3}
• 5 lowercase or uppercase letters:
  – [a-zA-Z]{5}
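Regex.IsMatch succeeds when the pattern is found anywhere in the input, so a validation pattern generally needs to be anchored at both ends. The following sketch compares the EmpID pattern from the list above with an anchored variant; the variant ^E\d{3}$ and the sample inputs are assumptions added for illustration, not part of the original slides.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ValidationDemo
    Sub Main()
        ' Unanchored: the pattern is found inside a longer, invalid value.
        Console.WriteLine(Regex.IsMatch("XE1234", "E\d{3}"))
        ' Prints True.

        ' Anchored with ^ and $: the whole input must be E plus three digits.
        Console.WriteLine(Regex.IsMatch("XE1234", "^E\d{3}$"))
        ' Prints False.
        Console.WriteLine(Regex.IsMatch("E123", "^E\d{3}$"))
        ' Prints True.
    End Sub
End Module
```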

Validating Input Value with Regular Expressions
• Allowable values:
  – San Francisco|Los Angeles|Taipei

Searching with Regular Expressions
• Pattern search:
  – \D\d
  – \w+
• Value search:
  – http

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Try
            Dim re As Regex
            re = New Regex(txtRE.Text)
            Dim source As String
            source = txtSource.Text
            ' List every match, one per line.
            Dim mc As MatchCollection = re.Matches(source)
            Dim m As Match
            Dim result As String = ""
            For Each m In mc
                result = result & m.Value & vbCrLf
            Next
            txtMatches.Text = result
        Catch ex As System.Exception
            ' An invalid pattern raises an exception; show its message.
            MessageBox.Show(ex.Message)
        End Try
    End Sub

    Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
        Try
            Dim re As Regex
            re = New Regex(txtRE.Text)
            Dim source As String
            source = txtSource.Text
            ' Replace every match with the replacement text.
            txtSource.Text = re.Replace(source, txtReplace.Text)
        Catch ex As System.Exception
            MessageBox.Show(ex.Message)
        End Try
    End Sub

IsMatch Method
    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Try
            Dim re As Regex
            re = New Regex(txtFormat.Text)
            Dim source As String
            source = txtSource.Text
            ' Report whether the source text matches the format pattern.
            If re.IsMatch(source) Then
                MessageBox.Show("valid")
            Else
                MessageBox.Show("not valid")
            End If
        Catch ex As System.Exception
            MessageBox.Show(ex.Message)
        End Try
    End Sub