COSC 2030: String Parsing in C++ and Regular Expressions

String parsing
• One of the main tasks a program may need to do is take a string and parse it to determine the next step in the program.
  • Command line applications
  • Search applications (like Bing and Google)
  • Most network applications send and receive data as strings.
  • Too many more to even begin to name.

How to parse
• As with all things in C/C++, you can do it any number of ways.
• Develop your own functions and algorithms to parse a string.
• Use the member functions of the string class
  • String parsing.
• Use the sscanf functions
  • More like regular expressions.
• Use the regex library (<regex>) from the standard library
  • Which is regular expressions.
  • Requires Visual Studio 2010 or later, or GCC 4.9+ (earlier GCC releases ship the <regex> header, but the implementation is incomplete).

Reading a line of input
• Reading with cin >> stops at the first whitespace.
• This is not always the functionality we want.
• getline function (2 methods)
• cin.getline(c_str, 256)
  • Reads to the end-of-line mark or the buffer limit (here 255 characters plus the terminating null), whichever comes first.
  • Requires a C-string (char array) instead of a string.
  • Example:
      char stuff[256];
      cin.getline(stuff, 256);
• But this is still not the method we want, since it requires C-strings.

Reading a line of input (2)
• The second getline method is the one we want to use, since it stores the line in a string.
• It is a free function declared in the <string> header: getline(cin, str)
• Example:
    string stuff;
    getline(cin, stuff);

Regular Expressions

Regular Expressions
• Regex for short.
• Likely the most powerful way to do string processing.
• Use:
  • Create a pattern that you want to match against.
  • Run the "match".
  • If it returns true, the string matched the pattern.
  • You can also get all the matches into an array-like result object to use.
• Problem:
  • Regex patterns can be very complex, and we don't have the time (about 6 lectures) to cover the entire regex set. This will only cover the very basics.

Code for regex
• Include the regex header
    #include <regex>
• Define the pattern
  • Note: pattern is a variable!
    std::regex pattern( …string pattern… );
• Object that will contain the sequence of sub-matches (optional)
    std::match_results<std::string::const_iterator> result;
  • std::smatch is a typedef for this type.

Code for regex (2)
• regex_match matches the full string
    if (std::regex_match(string, result, pattern))
  • If true, there was a match.
  • If capturing matches, result should hold them: check result.size() > 0 or !result.empty()
• regex_search matches any part of a string
    if (std::regex_search(string, result, pattern))
  • Uses result the same way as regex_match.

Simple matching
    regex pattern("ello");
    string line = "Hello";
    if (std::regex_search(line, result, pattern))
        cout << "Match " << line << endl;
• Output: Match Hello

Simple matching (2)
    regex pattern("ello");
    string line = "Hi";
    if (std::regex_search(line, result, pattern))
        cout << "Match " << line << endl;
• This would not match.
• Neither would line = "eLLo";

Case insensitive matching
• Use the constant icase
  • Character matching is performed without regard to case.
    regex pattern("ello", std::regex_constants::icase);
    string line = "heLLo";
    if (std::regex_search(line, result, pattern))
        cout << "Match " << line << endl;
• Output: Match heLLo

Alternation matching
• Assume we are using regex_search unless otherwise noted.
• | allows matching with an "or"
• regex pattern("Fred|Wilma|Pebbles");
  • True if the string contains Fred, Wilma, or Pebbles
• regex pattern("Fred|Wilma|Pebbles Flintstone");
  • Matches Fred, Wilma, or Pebbles Flintstone

Grouping and alternation
• ( ) lets you group part of a pattern, which also limits where the alternation ends.
• regex pattern("(Fred|Wilma|Pebbles) Flintstone")
  • Matches Fred Flintstone, Wilma Flintstone, or Pebbles Flintstone
• regex pattern("(Blue|Song)bird")
  • Matches Bluebird or Songbird
• Remember we are using regex_search, so the line could be "There is a Songbird singing in the tree".

Grouping and alternation (2)
• regex ex3("(p|g|m|s|b)et");
  • True if the string contains pet, get, met, set, or bet
  • Note: ( ) is also used to capture the match
  • So the result variable will tell you which letter it matched.
• regex ex4("th(is|at)");
  • True if the string contains this or that

Single character matching
• Use [] for single character matching.
• regex pattern("[abc]")
  • True if the string contains a and/or b and/or c
• regex pattern("[pgmsb]et")
  • True if the string contains pet, get, met, set, or bet
• regex pattern("[Fred]")
  • True if the string contains F and/or r and/or e and/or d
• Characters NOT listed: the ^ character
• regex pattern("[^abc]")
  • True if the string contains at least one character that is not a, b, or c
• regex pattern("[a-z]")
  • True if the string contains any lowercase letter a through z

Single character matching (2)
• regex pattern("[0-9]")
  • True if the string contains any digit 0 through 9
• regex pattern("[0-9-]")
  • Matches 0 through 9 or the minus sign
• regex pattern("[a-z0-9^]")
  • Matches any single lowercase letter, digit, or ^
• regex pattern("[a-zA-Z0-9_]")
  • Matches any single letter, digit, or underscore
• regex pattern("[^aeiouAEIOU]")
  • Matches any single character in the string that is not a vowel (this includes digits, spaces, and punctuation)

Matching quantifiers
• Repetition uses {min,max}
• regex pattern("a{3}")
  • True if the string contains aaa
• regex pattern("a{3,}")
  • Matches aaa, aaaaa, aaaaaa, etc. (three or more)
• regex pattern("a{3,5}b")
  • Matches aaab, aaaaab
• Common mistake
  • regex pattern("Fred{3}")
  • Matches Freddd, not FredFredFred
• How to actually do it
  • regex pattern("(Fred){3}")
  • Matches FredFredFred

Matching quantifiers (2)
• regex pattern("a{0,5}")
  • Matches a, aaa, aaaaa, and also succeeds when there are no a's
• regex pattern("a*")
  • * matches 0 or more times (max match)
• regex pattern("a*?")
  • *? matches 0 or more times (min match)
• Difference between min and max matching
  • "aaaa" is matched by all three patterns above
  • Difference: * matches "aaaa", while *? matches as few a's as possible, which here is zero (the empty string)
  • Max matches as many characters as it can, while min matches as few characters as it can.
  • This becomes important later on.

Matching quantifiers (3)
• + matches 1 or more times (max match)
• +? matches 1 or more times (min match)
• regex pattern("a+")
  • True if there are 1 or more a's
• ? matches 0 or 1 time (max match)
• ?? matches 0 or 1 time (min match)
• regex pattern("a?")
  • True if there is 1 a or no a's
• Also {3,5}? is a min match
  • Tries to match only 3 where possible
• and {3,5} is a max match
  • Tries to match 5 where possible

Matching quantifiers (4)
• regex pattern("fo+ba?r")
  • Matches f, 1 or more o's, b, 0 or 1 a, then an r
  • Match: fobar, foobr
  • Non-match: fbar (missing o), foobaar (too many a's)
• regex pattern("fo*ba?r")
  • Matches f, 0 or more o's, b, 0 or 1 a, then an r
  • Match: fobar, fbr, fooobr, etc.
• Inside [], quantifier characters are "normal" characters.
• regex pattern("[.?!+]*")
  • Matches zero or more ., ?, !, or +

Trying out what we have learned
• What will the following match?
  1. regex pattern("a+[bc]")
  2. regex pattern("(a|be)", std::regex_constants::icase)
  3. regex pattern("Hi{1,3} There!?")
  4. regex pattern("(Foo)?Bar", std::regex_constants::icase)
  5. regex pattern("[1-9][a-z]*")
  6. regex pattern("[a-zA-Z]+, [A-Z]{2} [0-9]{5}")
• Write a regular expression for these
  1. Match a social security number (with or without dashes)
  2. A street address: a number, a name, then either St, Ln, Rd, or nothing. Also case insensitive.

Metasymbols
• . matches one character (except newline)
• regex pattern(".")
  • Always true, except when the string is empty.
• regex pattern("d.g")
  • True for d, any character, then g
  • So dog, dbg, dag, dcg, d g, etc.
• regex pattern("d.*g")
  • True for d, 0 or more characters, then g
  • So dg, dog, dasdfg, d g, etc.
• regex pattern("d.+g")
  • True for d, 1 or more characters, then g
  • So NOT dg, but the rest: dog, dasdfg, d g, etc.

Metasymbols (2)
• regex pattern("d.?g")
  • True for d, any single character, then g, AND for dg
• regex pattern("d.{0,1}g")
  • True for d, any single character, then g, AND for dg
  • Same as above
• regex pattern("d.{2}g")
  • True for d, 2 characters, then g
  • So doog, dafg, dghg, etc.
• regex pattern("d.{2,5}g")
  • True for d, 2 to 5 characters, then g
  • So dooog, dXXXXXg, dXobgg, etc.

Metasymbols (3)
• Anchoring
  • ^ beginning of the string (inside [] it means "not")
  • $ end of the string
• regex pattern("^dog$")
  • True only for "dog", not "ddogg"
  • Note: we could just use regex_match(string, result, pattern) instead of anchoring for this one, since we only want to find exactly "dog".
• regex pattern("^dog")
  • True only when the string starts with "dog"
  • So "dog", "doga", etc.

Metasymbols (4)
• regex pattern("dog$")
  • True when the string ends with "dog"
  • "dog", "asdfadfdog", "ddddooodog"
• regex pattern("^.$")
  • True when the string is one character long and that character is not the newline symbol
• regex pattern("^[abc]+")
  • True when the string starts with
    • "a", "aaa", etc. with any characters following
    • "b", "bbb", etc. with any characters following
    • "c", "ccc", etc. with any characters following
    • As well as any combination of a's, b's, and c's: "abcabc", etc.

Metasymbols (5)
• \d matches a digit [0-9]
• \D matches a nondigit [^0-9]
• \s matches whitespace [ \t\n\r\f]
• \S matches a nonwhitespace character [^ \t\n\r\f]
• \w matches a word character [a-zA-Z0-9_]
• \W matches a non-word character [^a-zA-Z0-9_]
• Note: we have an issue in C++ strings. The \ means something else in a string literal. So while the regex is \d, we have to write \\d in the string literal for the regex to see it correctly.

Metasymbols (6)
• regex pattern("\\d")
  • True when the string contains a digit
• regex pattern("\\d+")
  • True when the string contains 1 or more digits
• regex pattern("\\w\\d")
  • True when the string contains a word character followed by a digit
• regex pattern("\\w+\\d")
  • True when the string contains 1 or more word characters followed by a digit
  • So these match: "abc1" "a1" "11" "_9" "Z8" and "a1a1"

Metasymbols (7)
• regex pattern("^\\s\\w\\d")
  • True when the string starts with a whitespace character, then a word character, and then a digit
  • " 11" "\ta1" "\n11" etc.
• regex pattern("^\\s*\\w\\d")
  • True when the string starts with 0 or more whitespace characters, then a word character, and then a digit
  • " 11" " \t a1" etc.

Trying them out again
• What will the following match?
  1. regex pattern("a+\\w*?")
  2. regex pattern("\\w\\s*\\w+")
  3. regex pattern("^\\d+[a-z]*")
  4. regex pattern("\\w+,\\s\\w{2}\\s{2}\\d{5}")
• Write a regular expression for these
  • Rewrite #4 so the city can be two or more words.
  • Must start with a letter, then have any number of letters and/or numbers (or none at all), but end with a number.

Capturing
• Capturing the matches
  • Use ( ) around the part you want to capture
• regex ex13("(\\w+)");
  • Find 1 or more word characters and capture the resulting match
• regex ex14("(\\w+)\\s+(\\w+)");
  • Find 1 or more word characters, then whitespace, then 1 or more word characters. Capture the word character matches.
  • Example: "hi there"
  • result[1]="hi", result[2]="there"
• regex ex15("(\\d+) (.*)");
  • What does this capture?

Capturing (2)
• line = "Hi There"
• regex pattern2("(((\\w+) )(\\w+))");
• regex_search(line, result, pattern2);
• result[1]="Hi There" result[2]="Hi " result[3]="Hi" result[4]="There"
• Groups are numbered by the position of their opening parenthesis; result[0] is the whole match.

Capturing (3)
• line = "a xxx c xxxxxc xxx d";
• regex pattern3("(.+)x(.+)c");
• regex_search(line, result, pattern3);
• result[1] = ?   result[2] = ?
• Hint: these are max matches.

Regex and Python
• You will find regex exists in most languages (at least most useful languages).
• In Python it's part of the re module
    import re
    line = "a xxx c xxxxxc xxx d"
    x = re.search("(.+)x(.+)c", line)
• x is a match object for the first match (or None if nothing matched); x.group(0) is the matched text.
• re.findall returns a list of all the matches.

Regex reference
• http://www.codeguru.com/cpp/cpp_mfc/stl/article.php/c15339
• http://www.codeproject.com/KB/string/TR1Regex.aspx
• Patterns: http://msdn.microsoft.com/en-us/library/bb982727.aspx

Q&A
