Strings
- Our civilization has been built on them
  - but we are moving towards digital media
  - anyway, handling digital media is similar to ...
- A string is a list of characters
  - 성균관대학교정보통신학부 (Sungkyunkwan University, School of Information and Communication Engineering)
  - Am I a string?
  - 一片花飛減却春 ("A single falling petal diminishes spring")
Character Code - ASCII
- Defines 2**7 = 128 characters
  - why 7 bits?
- char in C is similar to an integer
  - 'D' + 'a' - 'A'
More about ASCII
- What about other languages?
  - there are more than 100 scripts
  - some scripts have thousands of characters
- Unicode
  - several standards
    - UTF-16
    - UTF-8: variable length; a leading 0 bit marks a character from the ASCII set
    - 1 to 4 bytes are used (the original design allowed up to 6)
- Programming Languages
  - C, C++ treat char as a byte
  - Java treats char as two bytes
UTF-8

Bytes  Bits for code point  First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
1      7                    U+0000            U+007F           0xxxxxxx
2      11                   U+0080            U+07FF           110xxxxx  10xxxxxx
3      16                   U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
4      21                   U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
String Representations
- null-terminated: C, C++
- array with length: Java
- linked list of characters
  - used in some special areas
- Your choice should consider
  - space occupied
  - constraints on the contents
  - O(1) access to i-th char
  - (search, insert, delete)
  - what if the length can be limited (file names)
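null-terminated
    char str[6] = "Hello";
    char arr1[] = "abc";                  /* 4 bytes: 'a', 'b', 'c', '\0' */
    char arr2[] = {'a', 'b', 'c'};        /* 3 bytes, no terminator: not a string */
    char arr3[] = {'a', 'b', 'c', '\0'};  /* same contents as arr1 */
- str: H e l l o \0
- need for the termination
    printf("format string", str);  /* str can be any length */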
String Match
- Exact matching
  - a BIG problem for Google, Twitter, Naver, bioinformatics, ...
  - naive method
  - preprocessing methods
    - Boyer-Moore algorithm
  - UTF-8 imposes a limitation
- Inexact matching
- this problem alone could fill a whole one-semester course
String Match (2)
- Problem
  - find P in T
  - T: x a b x y a b x z
  - P: a b x y a b x z
- Your solution
  - how many comparisons?
  - better way?
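The naïve string search
[Figure: P = a b x y a b x z is aligned under T = x a b x y a b x z (positions numbered 0 to 12) and shifted one position at a time; ^ and * mark the compared characters until the alignment is reported as "Matched!"]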
The naïve string search
- Worst case
  - Find "aaaaab" in "aaaaaaaab"
  - O(nm)
- Bring a better one next week!!
Problem Example
- There are millions of documents that contain company names: Anderson, Enron, Lehman, ...
- M&A and bankruptcies force the names to be changed
- Your mission
  - change all the company names that have been changed
Input/Output Example

4
"Anderson Consulting" to "Accenture"
"Enron" to "Dynegy"
"DEC" to "Compaq"
"TWA" to "American"
5
Anderson Accounting begat Anderson Consulting, which offered advice to Enron before it DECLARED bankruptcy, which made Anderson Consulting quite happy it changed its name in the first place!
Your Plan
- read the M&A list into a DB
  - define DB structure
  - how to handle the double quote (") sign?
- main loop
  - compare each line of a doc with each DB entry
  - if there is a match, replace it with a new name
- Now, define
  - functions
  - global variables
Your naive solution
- read data for changed company names to build a DB

  old name              new name
  Anderson Consulting   Accenture
  Enron                 Dynegy
  DEC                   Compaq
  ...

- for each entry
  - search the whole document
  - if there is a match, change it