13.2 Fundamentals of Characters and Strings

• Characters: fundamental building blocks of Python programs
• Function ord returns a character's character code
• Function chr returns the character with the given character code

    >>> ord('ff')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: ord() expected a character, but string of length 2 found
    >>> ord('f')
    102
    >>> ord('.')
    46
    >>> chr(46)
    '.'
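Because ord and chr are inverses of each other, they can be combined to do arithmetic on characters. The following is a minimal sketch, written in Python 3 syntax (the slides use Python 2); the helper name shift_letter is hypothetical and not part of the original material.

```python
def shift_letter(ch, offset):
    """Shift a lowercase letter by `offset` positions, wrapping around after 'z'.

    Hypothetical helper for illustration only; assumes `ch` is in 'a'..'z'.
    """
    base = ord("a")                        # character code of 'a'
    position = (ord(ch) - base + offset) % 26
    return chr(base + position)            # back from code to character


print(ord("f"))               # character code of 'f'
print(chr(ord("f")))          # chr undoes ord
print(shift_letter("f", 1))   # next letter after 'f'
print(shift_letter("z", 1))   # wraps around to the start of the alphabet
```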

Characters and Strings

Since characters and strings are fundamental in Python, there are a lot of useful methods for dealing with them (Fig. 13.2).


    # Fig. 13.3: fig13_03.py
    # Simple output formatting example.

    string1 = "Now I am here."

    print string1.center( 50 )   # centers calling string in a new string of 50 characters
    print string1.rjust( 50 )    # right-aligns calling string in a new string of 50 characters
    print string1.ljust( 50 )    # left-aligns calling string in a new string of 50 characters

Output:

                      Now I am here.
                                        Now I am here.
    Now I am here.

Remember: strings are immutable; a string-manipulating function returns a new string.

    >>> aString = 'gacataggt'
    >>> aString.upper()
    'GACATAGGT'
    >>> aString
    'gacataggt'

    # Fig. 13.4: fig13_04.py
    # Stripping whitespace from a string.

    string1 = "\t \n This is a test string. \t\t \n"

    print 'Original string: "%s"\n' % string1
    print 'Using strip: "%s"\n' % string1.strip()          # removes leading and trailing whitespace
    print 'Using left strip: "%s"\n' % string1.lstrip()    # removes leading whitespace
    print 'Using right strip: "%s"\n' % string1.rstrip()   # removes trailing whitespace

Output (tabs and newlines shown as single spaces):

    Original string: " This is a test string. "
    Using strip: "This is a test string."
    Using left strip: "This is a test string. "
    Using right strip: " This is a test string."

13.4 Searching Strings

• Methods find, index, rfind and rindex search for substrings in a calling string
• Methods startswith and endswith return 1 (true) if a calling string begins with or ends with a given string, respectively
• Method count returns the number of occurrences of a substring in a calling string
• Method replace substitutes its second argument for its first argument in a calling string

Searching Strings

    s = "actgccgacgatcgcgcatcagcg"   # length 24
    index_string = "01234567890123"

    print s
    print index_string, "\n"

    print "gc occurs %d times" % s.count( "gc" )
    print "(%d times from index 13)\n" % s.count( "gc", 13, len(s) )
    # same result as s[13:len(s)].count("gc")

    print "first occurrence of gc: index %d" % s.find( "gc" )
    print "first occurrence of x: index %d\n" % s.find( "x" )
    # -1 is a number; will the program break down later if the string is not found?
    # index(): as find() but raises an exception if the string is not found

    if s.startswith( "AC" ):
        print "sequence starts with AC"
    else:
        print "sequence doesn't start with AC"   # case sensitive!

    print "last occurrence of gc: index %d\n" % s.rfind( "gc" )
    print "replacing gc with GC:\n%s\n" % s.replace( "gc", "GC" )
    print "replace 2 occurrences max:\n%s" % s.replace( "gc", "GC", 2 )

Output:

    actgccgacgatcgcgcatcagcg
    01234567890123

    gc occurs 4 times
    (3 times from index 13)

    first occurrence of gc: index 3
    first occurrence of x: index -1

    sequence doesn't start with AC
    last occurrence of gc: index 21

    replacing gc with GC:
    actGCcgacgatcGCGCatcaGCg

    replace 2 occurrences max:
    actGCcgacgatcGCgcatcagcg
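The comment about -1 deserves a closer look: since -1 is a valid index, using the result of find() unchecked silently picks the last character instead of failing. The sketch below, in Python 3 syntax (an assumption; the slides use Python 2), contrasts this with index(), which raises ValueError. The helper name first_index_or_none is hypothetical.

```python
def first_index_or_none(text, sub):
    """Return the index of the first occurrence of `sub`, or None if absent.

    Hypothetical helper: wraps str.index(), which raises ValueError
    when the substring is not found.
    """
    try:
        return text.index(sub)
    except ValueError:
        return None


s = "actgccgacgatcgcgcatcagcg"

pos = s.find("x")            # find() signals "not found" with -1
print(pos)
print(s[pos])                # pitfall: s[-1] is the LAST character, no error raised

print(first_index_or_none(s, "gc"))   # index of the first "gc"
print(first_index_or_none(s, "x"))    # None makes the failure explicit
```

Returning None (or letting the exception propagate) forces the caller to handle the missing substring instead of continuing with a misleading index.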

13.5 Splitting and Joining Strings

• Tokenization breaks statements into individual components (or tokens)
• Delimiters, typically whitespace characters, separate tokens

    # Fig. 13.6: fig13_06.py
    # Token splitting and delimiter joining.

    # splitting strings
    string1 = "A, B, C, D, E, F"

    print "String is:", string1
    # split() with no argument splits the calling string by whitespace characters
    print "Split string by spaces:", string1.split()
    # split( "," ) splits the calling string by the specified character
    print "Split string by commas:", string1.split( "," )
    # returns list of tokens split by the first two comma delimiters
    print "Split string by commas, max 2:", string1.split( ",", 2 )

    print

    # joining strings
    list1 = [ "A", "B", "C", "D", "E", "F" ]
    string2 = "___"

    print "List is:", list1
    # joins list elements with the calling string as a delimiter to create a new string
    print 'Joining with ___ : %s' % ( string2.join( list1 ) )
    # joins list elements with the calling quoted string as delimiter to create a new string
    print 'Joining with -.- :', "-.-".join( list1 )

Output:

    String is: A, B, C, D, E, F
    Split string by spaces: ['A,', 'B,', 'C,', 'D,', 'E,', 'F']
    Split string by commas: ['A', ' B', ' C', ' D', ' E', ' F']
    Split string by commas, max 2: ['A', ' B', ' C, D, E, F']

    List is: ['A', 'B', 'C', 'D', 'E', 'F']
    Joining with ___ : A___B___C___D___E___F
    Joining with -.- : A-.-B-.-C-.-D-.-E-.-F

Intermezzo 1

www.daimi.au.dk/~chili/CSS/Intermezzi/2.10.1.html

1. Copy and run this program:
   /users/chili/CSS.E03/ExamplePrograms/random_text.py
   What does it do?
2. Extend the program: search the text string it produces and print out the index of the first occurrence of 11 (you might look at Figure 13.2 at page 438 ff to find a suitable string method). Tell the user if there is no '11'.
3. Split the text into a list of substrings using '11' as a delimiter, and print out the list.

Solution

    from random import randrange

    text = ""
    for i in range(150):
        next_char = chr( randrange(48, 58) )
        text = "".join( [text, next_char] )
    print text

    i = text.find( "11" )
    if i >= 0:
        print "'11' found at index", i

    splittext = text.split( "11" )
    print "text split in %d pieces" % len(splittext)
    for piece in splittext:
        print piece

Output:

    126011248464036361812051952665405020454120395337118715150931373046328653809239514732592664241323032411475077087579523798182173083754226565772851806864
    '11' found at index 4
    text split in 4 pieces
    1260
    248464036361812051952665405020454120395337
    87151509313730463286538092395147325926642413230324
    475077087579523798182173083754226565772851806864

Regular Expressions – Motivation

Problem: search a text for any Danish email address: <something>@<something>.dk

    import re

    text1 = "No Danish email address here bush@whitehouse.org *@$@.hls.29! fj3a"
    text2 = "But here: chili@daimi.au.dk what a *(.@#$ nice @#*.( email address"

    regularExpression = "\w+@[\w.]+\.dk"
    compiledRE = re.compile( regularExpression )

    SRE_Match1 = compiledRE.search( text1 )
    SRE_Match2 = compiledRE.search( text2 )

    if SRE_Match1:
        print "Text 1 contains this Danish email address:", SRE_Match1.group()
    else:
        print "Text 1 contains no Danish email address"

    if SRE_Match2:
        print "Text 2 contains this Danish email address:", SRE_Match2.group()
    else:
        print "Text 2 contains no Danish email address"

Output:

    Text 1 contains no Danish email address
    Text 2 contains this Danish email address: chili@daimi.au.dk

13.6 Regular Expressions

• Provide a more efficient and powerful alternative to string search methods
• Instead of searching for a specific string we can search for a text pattern
  – We don't have to search explicitly for 'Monday', 'Tuesday', 'Wednesday', ...: there is a pattern in these search strings.
  – A regular expression is a text pattern
• In Python, regular expression processing capabilities are provided by module re
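The weekday example can be made concrete: all seven names share the suffix "day", so one pattern replaces seven separate string searches. This is a small sketch in Python 3 syntax (an assumption; the slides use Python 2). It uses the | and () metacharacters, and the function name find_weekdays is hypothetical.

```python
import re

# One pattern for all seven weekday names: a varying prefix followed by "day".
weekday = re.compile(r"(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day")


def find_weekdays(text):
    """Return every weekday name found in `text`, in order of appearance."""
    return [match.group() for match in weekday.finditer(text)]


print(find_weekdays("Lectures are on Monday and Thursday, not on Sunday."))
print(find_weekdays("No lectures yesterday."))   # "yesterday" is not a weekday name
```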

Example

Simple regular expression:

    regExp = "football"

- matches only the string "football"

To search a text for regExp, we can use

    re.search( regExp, text )

Compiling Regular Expressions

    re.search( regExp, text )

1. Compile regExp to a special format (an SRE_Pattern object)
2. Search for this SRE_Pattern in text
3. Result is an SRE_Match object

If we need to search for regExp several times, it is more efficient to compile it once and for all:

    compiledRE = re.compile( regExp )
    compiledRE.search( text )

1. Now compiledRE is an SRE_Pattern object
2. Use the search method of this SRE_Pattern to search text
3. Result is the same SRE_Match object

Searching for 'football'

    import re

    text1 = "Here are the football results: Bosnia - Denmark 0-7"
    text2 = "We will now give a complete list of python keywords."

    regularExpression = "football"

    # compile the regular expression and get the SRE_Pattern object
    compiledRE = re.compile( regularExpression )

    # use the same SRE_Pattern object to search both texts and get two
    # SRE_Match objects (or None if the search was unsuccessful)
    SRE_Match1 = compiledRE.search( text1 )
    SRE_Match2 = compiledRE.search( text2 )

    if SRE_Match1:
        print "Text 1 contains the substring 'football'"
    if SRE_Match2:
        print "Text 2 contains the substring 'football'"

Output:

    Text 1 contains the substring 'football'

Building more sophisticated patterns

Metacharacters: regular-expression syntax elements

?: matches zero or one occurrences of the expression it follows
+: matches one or more occurrences of the expression it follows
*: matches zero or more occurrences of the expression it follows

    # search for zero or one t, followed by two a's:
    regExp1 = "t?aa"

    # search for g followed by one or more c's followed by a:
    regExp1 = "gc+a"

    # search for ct followed by zero or more g's followed by a:
    regExp1 = "ctg*a"

Metacharacter example

    import re

    text = "gaaagccactggggggga"

    # compile all three regular expressions into SRE_Pattern objects
    regExp1 = "t?aa"
    compiledRE1 = re.compile( regExp1 )
    regExp2 = "gc+a"
    compiledRE2 = re.compile( regExp2 )
    regExp3 = "ctg*a"
    compiledRE3 = re.compile( regExp3 )

    # use the three SRE_Pattern objects to search the text and get three SRE_Match objects
    SRE_Match1 = compiledRE1.search( text )
    SRE_Match2 = compiledRE2.search( text )
    SRE_Match3 = compiledRE3.search( text )

    if SRE_Match1:
        print "Text contains the regular expression", regExp1
    if SRE_Match2:
        print "Text contains the regular expression", regExp2
    if SRE_Match3:
        print "Text contains the regular expression", regExp3

Output:

    Text contains the regular expression t?aa
    Text contains the regular expression gc+a
    Text contains the regular expression ctg*a

A few more metacharacters

^: indicates placement at the beginning of the string
$: indicates placement at the end of the string

    # search for zero or one t, followed by two a's
    # at the beginning of the string:
    regExp1 = "^t?aa"

    # search for g followed by one or more c's followed by a
    # at the end of the string:
    regExp1 = "gc+a$"

    # whole string should match ct followed by zero or more
    # g's followed by a:
    regExp1 = "^ctg*a$"

Metacharacter example

    import re

    text1 = "aactggagcccca"
    text2 = "ctgga"

    regExp1 = "^t?aa"
    regExp2 = "gc+a$"
    regExp3 = "^ctg*a$"

    # this time we use re.search() to search the text for the regular
    # expressions directly without compiling them in advance
    if re.search( regExp1, text1 ):
        print "Text 1 contains the regular expression", regExp1
    if re.search( regExp2, text1 ):
        print "Text 1 contains the regular expression", regExp2
    if re.search( regExp3, text1 ):
        print "Text 1 contains the regular expression", regExp3
    if re.search( regExp3, text2 ):
        print "Text 2 contains the regular expression", regExp3

Output:

    Text 1 contains the regular expression ^t?aa
    Text 1 contains the regular expression gc+a$
    Text 2 contains the regular expression ^ctg*a$

Yet more metacharacters...

{}: indicate repetition
|: match either the regular expression to the left or the one to the right
(): indicate a group (a part of a regular expression)

    # search for four t's followed by three c's:
    regExp1 = "t{4}c{3}"

    # search for g followed by 1 to 3 c's at the end of the string:
    regExp1 = "gc{1,3}$"

    # search for either gg or cc:
    regExp1 = "gg|cc"

    # search for either gg or cc followed by tt:
    regExp1 = "(gg|cc)tt"
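To see what each of these four patterns actually picks out, they can be run against one test string. The sketch below uses Python 3 syntax (an assumption; the slides use Python 2); the test string and the helper matched_text are made up for illustration.

```python
import re


def matched_text(pattern, text):
    """Return the substring matched by `pattern`, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None


# Hypothetical test sequence chosen so that every pattern matches somewhere.
text = "aggttttcccgccc"

for pattern in ["t{4}c{3}", "gc{1,3}$", "gg|cc", "(gg|cc)tt"]:
    print(pattern, "->", matched_text(pattern, text))
```

Note that re.search returns the leftmost match, so "gg|cc" reports the gg near the start even though cc also occurs later in the string.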

Escaping metacharacters

\: used to escape (to 'keep') a metacharacter

    # search for x followed by + followed by y:
    regExp1 = "x\+y"

    # search for ( followed by x followed by y:
    regExp1 = "\(xy"

    # search for x followed by ? followed by y:
    regExp1 = "x\?y"

    # search for x followed by at least one ^ followed by 3:
    regExp1 = "x\^+3"
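The difference between an escaped and an unescaped metacharacter is easiest to see side by side. This is a minimal sketch in Python 3 syntax (an assumption; the slides use Python 2). It uses raw strings (r"...") so the backslash reaches the re module unchanged, and it also shows re.escape, a standard-library function not covered in the slides. The helper name found is hypothetical.

```python
import re


def found(pattern, text):
    """Return True if `pattern` matches anywhere in `text`."""
    return re.search(pattern, text) is not None


# Unescaped: + means "one or more x", so the literal text "x+y" does not match.
print(found(r"x+y", "x+y"))
print(found(r"x+y", "xxy"))

# Escaped: \+ means a literal plus sign.
print(found(r"x\+y", "x+y"))
print(found(r"x\+y", "xxy"))

# re.escape() builds the escaped pattern from a literal string.
print(found(re.escape("x+y"), "x+y"))
```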

Intermezzo 2

http://www.daimi.au.dk/~chili/CSS/Intermezzi/2.10.2.html

Copy and run this program:
/users/chili/CSS.E03/ExamplePrograms/sequence_searching.py
What does it do?

Put more regular expressions in the list to search for these patterns:

1. 6 c's followed by 3 g's
2. cc, followed by at least one g, followed by cc
3. double triplets (e.g. aaa followed by ccc)
4. any number of a's, followed by either cc or gg, followed by c at the end of the string

Solution

    import re

    # this is a dna sequence in fasta format:
    seq = """>U03518 Aspergillus awamori\naacctgcggaaggatcattaccgagtgcgggtcctttgggcccaacctcccatccgtg
    tctattgtaccctgttgcttcggcgggcccgccgcttgtcggccgccgggggggcgcctctgccccc
    cgggcccgtgcccgccggagaccccaacacgaacactgtctgaaagcgtgcagtctgagttga
    atgcaatcagttaaaactttcaacaatggatctcttggttccggc"""

    regular_expressions = [ "a{4}", "c+(t|g)tt", "g*c$", "(gt){2}",
                            "c{6}g{3}",               # 1.
                            "ccg+cc",                 # 2.
                            "(aaa|ccc|ggg|ttt){2}",   # 3.
                            "a*(cc|gg)c$" ]           # 4.

    for regExp in regular_expressions:
        if re.search( regExp, seq ):
            print "found", regExp

The last four expressions answer the four tasks:

1. 6 c's followed by 3 g's
2. cc, followed by at least one g, followed by cc
3. double triplets (e.g. aaa followed by ccc)
4. any number of a's, followed by either cc or gg, followed by c at the end of the string

Character Classes

A character class matches one of the characters in the class:

    [abc] matches either a or b or c.
    d[abc]d matches dad and dbd and dcd
    [ab]+c matches e.g. ac, abc, bac, bbabaabc, ...

• Metacharacter ^ at the beginning negates a character class:
  [^abc] matches any character other than a, b and c
• A class can use - to indicate a range of characters:
  [a-e] is the same as [abcde]
• Characters except ^ and - (and the escape character \) are taken literally in a class:
  [a+b*] matches a or + or b or *
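The class rules above can be checked directly. The following sketch uses Python 3 syntax (an assumption; the slides use Python 2) and re.fullmatch, which requires the whole string to match the pattern; the helper name whole_match is hypothetical.

```python
import re


def whole_match(pattern, text):
    """Return True if the entire `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


print(whole_match("d[abc]d", "dad"))       # a is in the class
print(whole_match("d[abc]d", "ded"))       # e is not in the class
print(whole_match("[ab]+c", "bbabaabc"))   # any mix of a and b, then c
print(whole_match("[^abc]", "d"))          # negated class: anything but a, b, c
print(whole_match("[^abc]", "a"))
print(whole_match("[a-e]", "c"))           # range a..e
print(whole_match("[a-e]", "f"))
print(whole_match("[a+b*]", "+"))          # + is literal inside a class
```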

Special Sequences

Special sequence: shortcut for a common character class

    regExp1 = "\d\d:\d\d [AP]M"    # (possibly illegal) time stamps like 04:23:19 PM
    regExp2 = "\w+@[\w.]+\.dk"     # any Danish email address
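As a rough guide for plain ASCII text, \d stands for the class [0-9], \w for [a-zA-Z0-9_] and \s for whitespace. The sketch below, in Python 3 syntax (an assumption; the slides use Python 2), extracts time stamps with a variant of the first pattern that has three two-digit fields, so that it covers the seconds in a stamp such as 04:23:19 PM. The function name extract_time_stamps and the log line are made up for illustration.

```python
import re

# Assumed three-field variant: hours, minutes, seconds, then AM or PM.
time_stamp = re.compile(r"\d\d:\d\d:\d\d [AP]M")


def extract_time_stamps(text):
    """Return all substrings of `text` that look like hh:mm:ss AM/PM."""
    return time_stamp.findall(text)


log_line = "Logged in at 04:23:19 PM, logged out at 99:99:99 AM"
print(extract_time_stamps(log_line))

# \d and [0-9] agree on ASCII digits:
print(re.findall(r"\d", "a1b22") == re.findall(r"[0-9]", "a1b22"))
```

The second stamp in the log line shows why the slide calls the matches "possibly illegal": the pattern checks only the shape of the text, not whether the numbers form a valid time.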

Other regular expression functions

    import re

    text = "1a2b3c4d5e6f"

    print re.sub( "\d", "*", text )      # substitute * for any digit (i.e. replace digit with *)
    print re.sub( "\d", "*", text, 3 )   # substitute * for any digit, max 3 times

    print re.split( "\d", text )         # delimiter: any digit
    print re.split( "[a-z]", text )      # delimiter: any lower-case letter
    print

    # the RE of search() can appear anywhere in text
    if re.search( "\db", text ):
        print "method search found \db"

    # the RE of match() must appear in the beginning of text
    if re.match( "\db", text ):
        print "method match found \db"
    if re.match( "\da", text ):
        print "method match found \da"

Output:

    *a*b*c*d*e*f
    *a*b*c4d5e6f
    ['', 'a', 'b', 'c', 'd', 'e', 'f']
    ['1', '2', '3', '4', '5', '6', '']

    method search found \db
    method match found \da

Groups

We can extract the actual substring that matched the regular expression by calling method group() in the SRE_Match object:

    import re

    text = "But here: chili@daimi.au.dk what a *(.@#$ nice @#*.( email address"

    regExp = "\w+@[\w.]+\.dk"   # match Danish email address
    compiledRE = re.compile( regExp )
    SRE_Match = compiledRE.search( text )

    if SRE_Match:
        print "Text contains this Danish email address:", SRE_Match.group()

13.11 Grouping

• The substring that matches the whole RE is called a group
• An RE can be subdivided into smaller groups (parts)
• Each group of the matching substring can be extracted
• Metacharacters ( and ) denote a group

    import re

    text = "But here: chili@daimi.au.dk what a *(.@#$ nice @#*.( email address"

    # Match any Danish email address; define two groups: username and domain:
    regExp = "(\w+)@([\w.]+\.dk)"
    compiledRE = re.compile( regExp )
    SRE_Match = compiledRE.search( text )

    if SRE_Match:
        print "Text contains this Danish email address:", SRE_Match.group()
        print "Username:", SRE_Match.group(1), "\nDomain:", SRE_Match.group(2)

Output:

    Text contains this Danish email address: chili@daimi.au.dk
    Username: chili
    Domain: daimi.au.dk

Greedy vs. non-greedy operators

• + and * are greedy operators
  – They attempt to match as many characters as possible, even if this is not the desired behavior
• +? and *? are non-greedy operators
  – They attempt to match as few characters as possible

Greedy vs. non-greedy operators

    # Task: Find a space-separated list of digits, extract the first number.
    import re

    text = "1 2 3 4 5 blah"

    # use greedy operator +
    regExp = "(\d )+"
    print "Greedy operator:", re.match( regExp, text ).group()

    # use non-greedy version instead (by putting a ? after the +)
    regExp = "(\d )+?"
    print "Non-greedy operator:", re.match( regExp, text ).group()

Output:

    Greedy operator: 1 2 3 4 5
    Non-greedy operator: 1