Chapter 13 String Manipulation and Regular Expressions Outline

  13.1 Introduction
  13.2 Fundamentals of Characters and Strings
  13.3 String Presentation
  13.4 Searching Strings
  13.5 Joining and Splitting Strings
  13.6 Regular Expressions
  13.7 Compiling Regular Expressions and Manipulating Regular Expression Objects
  13.8 Regular Expression Repetition and Placement Characters
  13.9 Classes and Special Sequences
  13.10 Regular Expression String-Manipulation Functions
  13.11 Grouping
  13.12 Internet and World Wide Web Resources

2002 Prentice Hall. All rights reserved.

13.1 Introduction

  • Presents Python's string- and character-processing capabilities
  • Demonstrates the powerful text-processing capabilities of regular expressions with module re

13.2 Fundamentals of Characters and Strings

  • Characters: fundamental building blocks of Python programs
  • Function ord returns a character's integer ordinal value
  • Python supports strings as a built-in type
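The ordinal mapping described above can be sketched as follows. This is a minimal illustration written in Python 3 syntax (an assumption; the slides use Python 2.2), and it adds the built-in chr, which the slides do not mention, to show the reverse mapping.

```python
# ord maps a one-character string to its integer ordinal value.
letter_code = ord("z")
newline_code = ord("\n")

# chr (not covered in the slides) maps an ordinal value back to a character.
round_trip = chr(ord("z"))

print(letter_code, newline_code, round_trip)
```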

    Python 2.2b2 (#26, Nov 16 2001, 11:44:11) [MSC 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> ord( "z" )
    122
    >>> ord( "\n" )
    10

Fig. 13.1 Integer ordinal value of a character.


13.3 String Presentation

  • Formatting enables users to read and understand string data (e.g., program instructions)

    # Fig. 13.3: fig13_03.py
    # Simple output formatting example.

    string1 = "Now I am here."

    # center: centers calling string in a new string of 50 characters
    print string1.center( 50 )
    # rjust: right-aligns calling string in a new string of 50 characters
    print string1.rjust( 50 )
    # ljust: left-aligns calling string in a new string of 50 characters
    print string1.ljust( 50 )

Program output:

                      Now I am here.
                                        Now I am here.
    Now I am here.

    # Fig. 13.4: fig13_04.py
    # Stripping whitespace from a string.

    string1 = "\t \n This is a test string. \t\t \n"

    print 'Original string: "%s"\n' % string1
    # strip: removes all leading and trailing whitespace from string
    print 'Using strip: "%s"\n' % string1.strip()
    # lstrip: removes all leading whitespace from string
    print 'Using left strip: "%s"\n' % string1.lstrip()
    # rstrip: removes all trailing whitespace from string
    print 'Using right strip: "%s"\n' % string1.rstrip()

Program output (tabs and newlines inside the quotes are not shown):

    Original string: " This is a test string. "
    Using strip: "This is a test string."
    Using left strip: "This is a test string. "
    Using right strip: " This is a test string."

13.4 Searching Strings

  • Methods find, index, rfind and rindex search for substrings in a calling string
  • Methods startswith and endswith return 1 if a calling string begins with or ends with a given string, respectively
  • Method count returns the number of occurrences of a substring in a calling string
  • Method replace substitutes its second argument for its first argument in a calling string
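The search methods listed above can be compared side by side in a short sketch. It uses Python 3 syntax (an assumption; the slides target Python 2.2), and the sample strings are illustrative. In current Python versions startswith and endswith return True or False, which compare equal to 1 and 0.

```python
text = "Odd or even"

# find returns -1 when the substring is absent; index raises ValueError instead.
missing_find = text.find("maybe")
try:
    text.index("maybe")
    index_raised = False
except ValueError:
    index_raised = True

# rfind reports the highest index at which the substring starts.
last_e = text.rfind("e")

# startswith returns a truth value (True in current Python versions).
starts = text.startswith("Odd")

# count ignores "One" because matching is case sensitive.
occurrences = "One, one, one".count("one")

# The optional third argument of replace caps the number of replacements.
replaced = "One, one, one".replace("one", "two", 1)

print(missing_find, index_raised, last_e, starts, occurrences, replaced)
```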

    # Fig. 13.5: fig13_05.py
    # Searching strings for a substring.

    # counting the occurrences of a substring
    string1 = "Test1, test2, test3, test4, Test5, test6"

    # count returns number of times given substring appears in calling string
    print '"test" occurs %d times in \n\t%s' % \
       ( string1.count( "test" ), string1 )
    # with start and end arguments, count examines only a slice of the string
    print '"test" occurs %d times after 18th character in \n\t%s' % \
       ( string1.count( "test", 18, len( string1 ) ), string1 )
    print

    # finding a substring in a string
    string2 = "Odd or even"

    # find returns lowest index at which substring occurs in calling string
    print '"%s" contains "or" starting at index %d' % \
       ( string2, string2.find( "or" ) )

    # find index of "even"
    # unlike find, index raises ValueError if substring not found
    try:
       print '"even" index is', string2.index( "even" )
    except ValueError:
       print '"even" does not occur in "%s"' % string2

    # startswith returns 1 if calling string begins with substring
    if string2.startswith( "Odd" ):
       print '"%s" starts with "Odd"' % string2

    # endswith returns 1 if calling string ends with substring
    if string2.endswith( "even" ):
       print '"%s" ends with "even"\n' % string2

    # searching from end of string
    # rfind returns highest index at which substring occurs
    print 'Index from end of "test" in "%s" is %d' \
       % ( string1, string1.rfind( "test" ) )
    print

fig13_05.py (continued)

    # find rindex of "Test"
    # rindex returns highest index at which substring is found;
    # unlike rfind, rindex raises ValueError if substring not found
    try:
       print 'First occurrence of "Test" from end at index', \
          string1.rindex( "Test" )
    except ValueError:
       print '"Test" does not occur in "%s"' % string1

    print

    # replacing a substring
    string3 = "One, one, one"

    print "Original:", string3
    # replace all occurrences of first argument with second argument
    print 'Replaced "one" with "two":', \
       string3.replace( "one", "two" )
    # replace at most 3 occurrences of first argument with second argument
    print "Replaced 3 maximum:", string3.replace( "one", "two", 3 )

Program output:

    "test" occurs 4 times in
            Test1, test2, test3, test4, Test5, test6
    "test" occurs 2 times after 18th character in
            Test1, test2, test3, test4, Test5, test6

    "Odd or even" contains "or" starting at index 4
    "even" index is 7
    "Odd or even" starts with "Odd"
    "Odd or even" ends with "even"

    Index from end of "test" in "Test1, test2, test3, test4, Test5, test6" is 35

    First occurrence of "Test" from end at index 28

    Original: One, one, one
    Replaced "one" with "two": One, two, two
    Replaced 3 maximum: One, two, two

13.5 Splitting and Joining Strings

  • Tokenization breaks statements into individual components (or tokens)
  • Delimiters, typically whitespace characters, separate tokens
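Tokenizing and re-joining can be sketched as below. The sample strings and variable names are illustrative assumptions, and the code uses Python 3 syntax rather than the Python 2.2 of the slides.

```python
# With no argument, split treats any run of whitespace as one delimiter.
record = "alpha  beta\tgamma\n"
tokens = record.split()

# With an explicit delimiter, surrounding spaces stay in the tokens,
# so they are stripped here in a separate step.
csv_line = "A, B, C"
fields = [field.strip() for field in csv_line.split(",")]

# The optional second argument limits the number of splits performed.
first_split_only = csv_line.split(",", 1)

# join is called on the delimiter and combines the tokens into one string.
rejoined = "-".join(fields)

print(tokens, fields, first_split_only, rejoined)
```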

    # Fig. 13.6: fig13_06.py
    # Token splitting and delimiter joining.

    # splitting strings
    string1 = "A, B, C, D, E, F"

    print "String is:", string1
    # split with no argument splits calling string by whitespace characters
    print "Split string by spaces:", string1.split()
    # split with an argument splits calling string by specified character
    print "Split string by commas:", string1.split( "," )
    # second argument: return list of tokens split by at most 2 comma delimiters
    print "Split string by commas, max 2:", string1.split( ",", 2 )
    print

    # joining strings
    list1 = [ "A", "B", "C", "D", "E", "F" ]
    string2 = "___"

    print "List is:", list1
    # join combines list with calling string as a delimiter to create new string
    print 'Joining with "%s": %s' \
       % ( string2, string2.join ( list1 ) )
    # the calling string may also be a quoted string literal
    print 'Joining with "-.-":', "-.-".join( list1 )

Program output:

    String is: A, B, C, D, E, F
    Split string by spaces: ['A,', 'B,', 'C,', 'D,', 'E,', 'F']
    Split string by commas: ['A', ' B', ' C', ' D', ' E', ' F']
    Split string by commas, max 2: ['A', ' B', ' C, D, E, F']

    List is: ['A', 'B', 'C', 'D', 'E', 'F']
    Joining with "___": A___B___C___D___E___F
    Joining with "-.-": A-.-B-.-C-.-D-.-E-.-F

13.6 Regular Expressions

  • Provide a more efficient and powerful alternative to string search methods
  • A regular expression is a text pattern that a program uses to find matching substrings
  • Processing capabilities provided by module re

    # Fig. 13.7: fig13_07.py
    # Simple regular-expression example.

    # module re provides regular expression processing capabilities
    import re

    # list of strings to search and expressions used to search
    testStrings = [ "Hello World", "Hello world!", "hello world" ]
    expressions = [ "hello", "Hello", "world!" ]

    # search every expression in every string
    for string in testStrings:

       for expression in expressions:

          # search returns an object containing the substring matching
          # the regular expression, or None if substring not found
          if re.search( expression, string ):
             print expression, "found in string", string
          else:
             print expression, "not found in string", string

       print

Program output:

    hello not found in string Hello World
    Hello found in string Hello World
    world! not found in string Hello World

    hello not found in string Hello world!
    Hello found in string Hello world!
    world! found in string Hello world!

    hello found in string hello world
    Hello not found in string hello world
    world! not found in string hello world

13.7 Compiling Regular Expressions and Manipulating Regular Expression Objects

  • Compiled regular expressions are represented by an SRE_Pattern object, which provides all functionality available in module re
  • If a program uses a regular expression several times, the compiled version may be more efficient
  • Functions re.search and re.match return an SRE_Match object, or None if there is no match
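Reusing one compiled pattern across several strings might look like the sketch below. It is written in Python 3 syntax (an assumption), where the pattern and match types carry different names from the SRE_Pattern and SRE_Match shown in the Python 2.2 slides; the sample lines are illustrative.

```python
import re

# Compile once, then reuse the pattern object for every search.
pattern = re.compile("Hello")
lines = ["Hello world", "hello world", "Say Hello"]

# search finds the pattern anywhere in the string.
search_hits = [line for line in lines if pattern.search(line)]

# match succeeds only if the pattern matches at the beginning of the string.
match_hits = [line for line in lines if pattern.match(line)]

# A match object exposes the matched text and its position.
found = pattern.search("Say Hello")
matched_text = found.group()
start_index = found.start()

print(search_hits, match_hits, matched_text, start_index)
```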

    # Fig. 13.8: fig13_08.py
    # Compiled regular-expression and match objects.

    import re

    testString = "Hello world"
    formatString = "%-35s: %s"   # string for formatting the output

    # create regular expression and compiled expression
    # method compile takes a regular expression as an argument
    # and returns an SRE_Pattern object
    expression = "Hello"
    compiledExpression = re.compile( expression )

    # print expression and compiled expression
    print formatString % ( "The expression", expression )
    print formatString % ( "The compiled expression",
       compiledExpression )

    # search using re.search and compiled expression's search method
    print formatString % ( "Non-compiled search",
       re.search( expression, testString ) )
    print formatString % ( "Compiled search",
       compiledExpression.search( testString ) )

    # print results of searching
    # SRE_Match object's method group returns matching substring
    print formatString % ( "search SRE_Match contains",
       re.search( expression, testString ).group() )
    print formatString % ( "compiled search SRE_Match contains",
       compiledExpression.search( testString ).group() )

Program output:

    The expression                     : Hello
    The compiled expression            : <SRE_Pattern object at 0x00B60A20>
    Non-compiled search                : <SRE_Match object at 0x00D0F9B8>
    Compiled search                    : <SRE_Match object at 0x00D0F9B8>
    search SRE_Match contains          : Hello
    compiled search SRE_Match contains : Hello

13.8 Regular Expression Repetition and Placement Characters

  • Patterns are built using a combination of metacharacters and escape sequences
  • Metacharacter: regular-expression syntax element that repeats, groups, places or classifies one or more characters
      – ?: matches zero or one occurrences of the expression it follows
      – +: matches one or more occurrences of the expression it follows
      – *: matches zero or more occurrences of the expression it follows

      – ^: indicates placement at the beginning of the string
      – $: indicates placement at the end of the string
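The repetition and placement metacharacters can be tried out with a small sketch. The word list and patterns below are illustrative assumptions, written in Python 3 syntax.

```python
import re

words = ["color", "colour", "colouur"]

# ? allows zero or one "u"; $ anchors the pattern at the end of the string.
optional_u = [w for w in words if re.match(r"colou?r$", w)]

# + requires one or more "u".
one_or_more_u = [w for w in words if re.match(r"colou+r$", w)]

# * allows zero or more "u".
zero_or_more_u = [w for w in words if re.match(r"colou*r$", w)]

# ^ anchors at the beginning, so "col" in the middle of a string is rejected.
starts_with_col = bool(re.search(r"^col", "recolor"))
ends_with_or = bool(re.search(r"or$", "recolor"))

print(optional_u, one_or_more_u, zero_or_more_u, starts_with_col, ends_with_or)
```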

    # Fig. 13.9: fig13_09.py
    # Repetition patterns, matching vs searching.

    import re

    testStrings = [ "Heo", "Helo", "Hellllo" ]
    # ? matches 0 or 1 occurrences of l
    # + matches 1 or more occurrences of l
    # * matches 0 or more occurrences of l
    expressions = [ "Hel?o", "Hel+o", "Hel*o" ]

    # match every expression with every string
    for expression in expressions:

       for string in testStrings:

          # match returns SRE_Match object only if beginning of string
          # matches regular expression
          if re.match( expression, string ):
             print expression, "matches", string
          else:
             print expression, "does not match", string

       print

    # demonstrate the difference between matching and searching
    expression1 = "elo"      # plain string
    expression2 = "^elo"     # "elo" at beginning of string
    expression3 = "elo$"     # "elo" at end of string

    # match expression1 with testStrings[ 1 ]
    if re.match( expression1, testStrings[ 1 ] ):
       print expression1, "matches", testStrings[ 1 ]

    # search for expression1 in testStrings[ 1 ]
    if re.search( expression1, testStrings[ 1 ] ):
       print expression1, "found in", testStrings[ 1 ]

fig13_09.py (continued)

    # search for expression2 in testStrings[ 1 ]
    if re.search( expression2, testStrings[ 1 ] ):
       print expression2, "found in", testStrings[ 1 ]

    # search for expression3 in testStrings[ 1 ]
    if re.search( expression3, testStrings[ 1 ] ):
       print expression3, "found in", testStrings[ 1 ]

Program output:

    Hel?o matches Heo
    Hel?o matches Helo
    Hel?o does not match Hellllo

    Hel+o does not match Heo
    Hel+o matches Helo
    Hel+o matches Hellllo

    Hel*o matches Heo
    Hel*o matches Helo
    Hel*o matches Hellllo

    elo found in Helo
    elo$ found in Helo

13.9 Classes and Special Sequences

  • Regular-expression building blocks
  • Character class: specifies a group of characters to match in a string
      – Denoted by []
      – Metacharacter ^ at the beginning negates the character class
  • Special sequence: shortcut for a common character class
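A character class, its negation and the equivalent special sequence can be compared in a brief sketch. It uses re.findall, a real function of module re that the slides do not cover, and Python 3 syntax; both choices are assumptions made for illustration.

```python
import re

# A character class matches any single character listed inside [].
vowels = re.findall(r"[aeiou]", "regular")

# A leading ^ inside the brackets negates the class.
non_digits = re.findall(r"[^0-9]", "a1b22c")

# The special sequence \d is a shortcut for the digit class [0-9]
# (for plain ASCII input such as this one).
digits_by_class = re.findall(r"[0-9]+", "a1b22c")
digits_by_sequence = re.findall(r"\d+", "a1b22c")

print(vowels, non_digits, digits_by_class, digits_by_sequence)
```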


    # Fig. 13.11: fig13_11.py
    # Program that demonstrates classes and special sequences.

    import re

    # specifying character classes with [ ]
    testStrings = [ "2x+5y", "7y-3z" ]
    # raw strings are preceded by the letter r
    # [0-9] is a character class of digits
    # [a-zA-Z0-9_] is an alphanumeric character class
    # \d represents the character class of digits
    # \w represents the alphanumeric character class
    expressions = [ r"2x\+5y|7y-3z",
                    r"[0-9][a-zA-Z0-9_].[0-9][yz]",
                    r"\d\w-\d\w" ]

    # match every expression with every string
    for expression in expressions:

       for testString in testStrings:

          if re.match( expression, testString ):
             print expression, "matches", testString

    # specifying character classes with special sequences
    testString1 = "800-123-4567"
    testString2 = "617-123-4567"
    testString3 = "email: \t joe_doe@deitel.com"

    # bracket metacharacters { } specify a number or range of repetitions
    expression1 = r"^\d{3}-\d{3}-\d{4}$"
    # \w+ matches 1 or more alphanumeric characters
    expression2 = r"\w+:\s+\w+@\w+\.(com|org|net)"

    # matching with character classes
    if re.match( expression1, testString1 ):
       print expression1, "matches", testString1

    if re.match( expression1, testString2 ):
       print expression1, "matches", testString2

fig13_11.py (continued)

    if re.match( expression2, testString3 ):
       print expression2, "matches", testString3

Program output:

    2x\+5y|7y-3z matches 2x+5y
    2x\+5y|7y-3z matches 7y-3z
    [0-9][a-zA-Z0-9_].[0-9][yz] matches 2x+5y
    [0-9][a-zA-Z0-9_].[0-9][yz] matches 7y-3z
    \d\w-\d\w matches 7y-3z
    ^\d{3}-\d{3}-\d{4}$ matches 800-123-4567
    ^\d{3}-\d{3}-\d{4}$ matches 617-123-4567
    \w+:\s+\w+@\w+\.(com|org|net) matches email:    joe_doe@deitel.com

    Python 2.2b2 (#26, Nov 16 2001, 11:44:11) [MSC 32 bit (Intel)] on win32
    Type "copyright", "credits" or "license" for more information.
    >>> import re
    >>> print re.match( "2x+5y", "2x+5y" )
    None
    >>> print re.match( "2x+5y", "2x5y" )
    <SRE_Match object at 0x00932268>
    >>> print re.match( "2x+5y", "2xx5y" )
    <SRE_Match object at 0x00949A88>

Fig. 13.12 metacharacter in regular expressions.

13.10 Regular Expression String-Manipulation Functions

  • Module re provides pattern-based, string-manipulation capabilities, such as substituting a substring in a string and splitting a string with a delimiter
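Pattern-based substitution and splitting can be sketched as follows. The input strings are illustrative assumptions and the code uses Python 3 syntax; count is the keyword name of the optional replacement limit accepted by re.sub.

```python
import re

# sub replaces matches of the pattern; count caps the number of replacements.
masked = re.sub(r"\d", "#", "tel 555-1234", count=3)

# split accepts a pattern, so several delimiters (with optional trailing
# whitespace) can be handled in one call.
parts = re.split(r"[;,]\s*", "a, b;c,  d")

print(masked, parts)
```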

    # Fig. 13.13: fig13_13.py
    # Regular-expression string manipulation.

    import re

    testString1 = "This sentence ends in 5 stars *****"
    testString2 = "1,2,3,4,5,6,7"
    testString3 = "1+2x*3-y"
    formatString = "%-34s: %s"   # string to format output

    print formatString % ( "Original string", testString1 )

    # regular expression substitution
    # sub replaces * with ^ in testString1;
    # special character * is escaped with a backslash
    testString1 = re.sub( r"\*", r"^", testString1 )
    print formatString % ( "^ substituted for *", testString1 )

    testString1 = re.sub( r"stars", "carets", testString1 )
    print formatString % ( '"carets" substituted for "stars"',
       testString1 )

    print formatString % ( 'Every word replaced by "word"',
       re.sub( r"\w+", "word", testString1 ) )

    # sub's optional fourth argument specifies a maximum number (3)
    # of replacements
    print formatString % ( 'Replace first 3 digits by "digit"',
       re.sub( r"\d", "digit", testString2, 3 ) )

    # regular expression splitting
    # split tokenizes string by specified delimiter (,)
    print formatString % ( "Splitting " + testString2,
       re.split( r",", testString2 ) )

    # passes split a character class of delimiters;
    # only - and ^ need to be escaped in a character class
    print formatString % ( "Splitting " + testString3,
       re.split( r"[+\-*/%]", testString3 ) )

Program output (fig13_13.py):

    Original string                   : This sentence ends in 5 stars *****
    ^ substituted for *               : This sentence ends in 5 stars ^^^^^
    "carets" substituted for "stars"  : This sentence ends in 5 carets ^^^^^
    Every word replaced by "word"     : word word word word word word ^^^^^
    Replace first 3 digits by "digit" : digit,digit,digit,4,5,6,7
    Splitting 1,2,3,4,5,6,7           : ['1', '2', '3', '4', '5', '6', '7']
    Splitting 1+2x*3-y                : ['1', '2x', '3', 'y']

13.11 Grouping

  • A regular expression may specify groups of substrings to match in a string
  • A program extracts information from the matching groups
  • Metacharacters ( and ) denote a group
  • Greedy operators (+ and *) attempt to match as many characters as possible, even if this is not the desired behavior
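The effect of greediness on what each group captures can be seen in a short sketch. The input string and patterns are illustrative assumptions, written in Python 3 syntax; a ? placed after + makes the operator non-greedy.

```python
import re

entry = "key=value=more"

# Greedy +: the first group grabs as much as it can, up to the last "=".
greedy_groups = re.match(r"(.+)=(.*)", entry).groups()

# Non-greedy +?: the first group stops at the first "=".
lazy_groups = re.match(r"(.+?)=(.*)", entry).groups()

# group( n ) returns the substring captured by the n-th group.
lazy_key = re.match(r"(.+?)=(.*)", entry).group(1)

print(greedy_groups, lazy_groups, lazy_key)
```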

    # Fig. 13.14: fig13_14.py
    # Program that demonstrates grouping and greedy operations.

    import re

    formatString1 = "%-22s: %s"   # string to format output

    # string that contains fields and expression to extract fields
    testString1 = \
       "Albert Antstein, phone: 123-4567, e-mail: albert@bug2bug.com"
    # regular expression1 describes 3 groups
    expression1 = \
       r"(\w+ \w+), phone: (\d{3}-\d{4}), e-mail: (\w+@\w+\.\w{3})"

    # groups returns the substrings which match the groups in expression1
    print formatString1 % ( "Extract all user data",
       re.match( expression1, testString1 ).groups() )
    # group returns the substring matching the specified group
    print formatString1 % ( "Extract user e-mail",
       re.match( expression1, testString1 ).group( 3 ) )
    print

    # greedy operations and grouping
    formatString2 = "%-38s: %s"   # string to format output

    # strings and patterns to find base directory in a path
    pathString = "/books/2001/python"   # file path string

    # greedy operation matches too many characters
    expression2 = "(/.+)/"   # greedy operator expression
    print formatString1 % ( "Greedy error",
       re.match( expression2, pathString ).group( 1 ) )

    # ? alters greedy behavior of +
    expression3 = "(/.+?)/"   # non-greedy operator expression
    print formatString1 % ( "No error, base only",
       re.match( expression3, pathString ).group( 1 ) )

Program output (fig13_14.py):

    Extract all user data : ('Albert Antstein', '123-4567', 'albert@bug2bug.com')
    Extract user e-mail   : albert@bug2bug.com

    Greedy error          : /books/2001
    No error, base only   : /books