Preprocessing Application: String Matching By Rong Ge COSC 3100
Space-for-time tradeoffs
Two varieties of space-for-time algorithms:
• Input enhancement: preprocess the input (or part of it) to store some information to be used later in solving the problem
  - counting sorts (last lecture)
  - string searching algorithms (today)
• Prestructuring: preprocess the input to make accessing its elements easier
  - hashing
A. Levitin, "Introduction to the Design & Analysis of Algorithms," 3rd ed., Ch. 7. © 2012 Pearson Education, Inc. Upper Saddle River, NJ.
Review: String searching by brute force
String matching problem: given
• a pattern P: a string of m characters to search for, and
• a text T: a (long) string of n >= m characters to search in,
return whether P occurs in T and, if it does, the position of P in T.
Brute-force algorithm (revisited):
Step 1. Align pattern P at the beginning of text T.
Step 2. Scanning from left to right, compare each character of P to the corresponding character in T until either all characters of P are found to match (successful search) or a mismatch is detected.
Step 3. While a mismatch is detected and T is not yet exhausted, realign P one position to the right and repeat Step 2.
Number of comparisons: O(mn) in the worst case.
Are there more efficient algorithms, with O(n) time?
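The three brute-force steps above can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function name `brute_force_search` and the comparison counter are assumptions added so the cost can be compared with the later algorithms.

```python
def brute_force_search(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):            # Step 3: try each alignment, left to right
        j = 0
        while j < m:                      # Step 2: compare P to T from left to right
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                     # mismatch: realign one position to the right
            j += 1
        if j == m:                        # all m characters matched
            return i, comparisons
    return -1, comparisons


if __name__ == "__main__":
    print(brute_force_search("ABCABD", "ABD"))   # (3, 8)
```

Every alignment advances by exactly one position, which is where the O(mn) worst case comes from.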
String searching by preprocessing
Several string searching algorithms are based on the input-enhancement idea of preprocessing the pattern P.
• Horspool's algorithm
  1. Preprocess P to build one shift table tb.
  2. Align P against the beginning of T.
  3. Repeat until a matching substring is found or T ends:
     1. Compare the corresponding characters of P and T from right to left.
     2. If a mismatch occurs, shift P to the right by tb(c), where c is the character of T aligned with the last character of P in the current alignment.
• The Boyer-Moore algorithm improves on Horspool's algorithm by using two tables: a bad-symbol table and a good-suffix table.
Horspool's Algorithm
Two key points:
• It preprocesses P to generate a shift table that determines how far to shift P when a mismatch occurs.
• It always shifts according to the shift table's entry for c, the character of T aligned with the last character of P.
How far to shift for each case?
Focus on the character of T aligned with the last character of P:
• Case I: character C != B (mismatch), and C does not occur in P.
    Text:    . . . . . C . . .     (C not in pattern)
    Pattern: B A O B A B
• Case II: character O != B (mismatch), but O occurs in P once.
    Text:    . . . . . O . . .     (O occurs once in pattern)
    Pattern: B A O B A B
  Or: character A != B (mismatch), and A occurs in P more than once.
    Text:    . . . . . A . . .     (A occurs twice in pattern)
    Pattern: B A O B A B
• Case III: character R = R (match), but R does not appear among the first m-1 characters of P.
    Text:    . . . M E R . . .
    Pattern: L E A D E R
• Case IV: character B = B (match), and B also occurs among the first m-1 characters of P.
    Text:    . . . . . B . . .
    Pattern: B A O B A B
Your answers?
Build a shift table for a given pattern P
• Shift table:
  - A 1-D array giving the shift size for each character c when c is the character of T aligned with the last character of P
  - Array size: the number of characters in the alphabet
      - E.g., 26 if all characters in T are uppercase English letters
• The array entry for a character c is computed as follows:
  - Cases I & III: P's length m
  - Cases II & IV: the distance from c's rightmost occurrence among P's first m-1 characters to P's right end
• Horspool's algorithm populates the shift table in two steps:
  1. Initialize each array entry to P's length m.
  2. Scan the first m-1 characters of P from left to right and update the array entries.
Build shift table
• Example: pattern BAOBAB; alphabet: uppercase English letters
• Preprocessing
  - Initialize all array entries to P's length, which is 6 for BAOBAB
      - This takes care of cases I and III
      - The shift table is indexed by the characters of the text and pattern alphabet
        A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
        6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
  - Scan the first m-1 characters of P from left to right and update the table (the last character of P is skipped)
      0. P[0] = B: distance = 5 (B's entry)
      1. P[1] = A: distance = 4 (A's entry)
      2. P[2] = O: distance = 3 (O's entry)
      3. P[3] = B: distance = 2 (B's entry)
      4. P[4] = A: distance = 1 (A's entry)
      - This takes care of cases II & IV
• Resulting shift table
        A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
        1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6
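The two-step construction above (initialize to m, then scan the first m-1 characters) can be sketched in Python. The function name `build_shift_table` and the default alphabet of uppercase English letters are assumptions made for this illustration; a dictionary stands in for the 1-D array.

```python
import string


def build_shift_table(pattern, alphabet=string.ascii_uppercase):
    """Horspool shift table for `pattern` over `alphabet` (hypothetical helper)."""
    m = len(pattern)
    # Step 1: every entry starts at m (cases I and III).
    table = {c: m for c in alphabet}
    # Step 2: scan the first m-1 characters left to right (cases II and IV).
    # A later occurrence overwrites an earlier one, so each entry ends up as
    # the distance from the rightmost occurrence to the pattern's right end.
    for i, c in enumerate(pattern[:-1]):
        table[c] = m - 1 - i
    return table


if __name__ == "__main__":
    table = build_shift_table("BAOBAB")
    print(table["A"], table["B"], table["O"], table["C"])   # 1 2 3 6
```

Scanning left to right is what makes the rightmost occurrence win without any extra bookkeeping.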
Example of Horspool's algorithm
Shift table (_ denotes the space character):
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
  1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 6
Algorithm: keep aligning P against T and comparing from right to left, shifting on a mismatch.
Example:
  BARD LOVED BANANAS      (text)
  BAOBAB                  (shift t(L) = 6)
        BAOBAB            (shift t(B) = 2)
          BAOBAB          (shift t(N) = 6)
                BAOBAB    (T out of bounds, unsuccessful search)
Total comparisons: 4 (1st alignment: 1; 2nd alignment: 2; 3rd alignment: 1; 4th alignment: 0)
How many comparisons does the brute-force algorithm make?
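The trace above can be reproduced with a short Python sketch of Horspool's algorithm. The function name `horspool_search` and the comparison counter are assumptions for illustration. Characters missing from the dictionary default to a shift of m, which plays the role of the initialized table entries.

```python
def horspool_search(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    # Shift table: distance from the rightmost occurrence among the first
    # m-1 characters to the pattern's right end; all other characters get m.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    comparisons = 0
    i = m - 1                          # text index aligned with P's last character
    while i < n:
        k = 0                          # characters matched so far, right to left
        while k < m:
            comparisons += 1
            if pattern[m - 1 - k] != text[i - k]:
                break
            k += 1
        if k == m:
            return i - m + 1, comparisons
        i += shift.get(text[i], m)     # always shift by the entry for text[i]
    return -1, comparisons


if __name__ == "__main__":
    print(horspool_search("BARD LOVED BANANAS", "BAOBAB"))   # (-1, 4)
```

Note that the shift always uses the text character under the pattern's last position, regardless of where the mismatch occurred.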
Boyer-Moore algorithm
Based on the same two ideas:
• comparing P to T from right to left for each alignment
• precomputing shift sizes, here in two tables
  - bad-symbol table t1: indicates how far to shift based on the text character that causes a mismatch; t1 is built the same way as the shift table in Horspool's algorithm
  - good-suffix table d2: indicates how far to shift based on the matched part (suffix) of the pattern
Scenarios in string matching
• SI: the rightmost character of P doesn't match. The BM algorithm acts as Horspool's: shift P to the right by t1(c).
    text:    . . . c . . .      (c is aligned with x, the rightmost character of P, and c != x)
    pattern:   . . x
• SII: k characters are matched, 0 < k < m, and then a mismatch occurs.
    text:    . . . c [k matched characters] . . .
    pattern:   . . x [k matched characters]
  How far do we shift the pattern to the right?
Prefix and suffix of strings
• Prefix of a string P: a substring of P that occurs at the beginning of P.
  E.g., P: BANANA
    Prefix   B   BA   BAN   BANA   BANAN   BANANA
    Length   1   2    3     4      5       6
• Suffix of a string P: a substring of P that occurs at the end of P.
  E.g., P: BANANA
    Suffix   A   NA   ANA   NANA   ANANA   BANANA
    Length   1   2    3     4      5       6
Good-suffix table d2 for pattern
• Applied when a suffix of P has matched T in the current alignment and then a mismatch occurs
  - k is the length of the matched suffix of P, 0 < k < m
• Good-suffix table d2:
  - a 1-D array of size m; the entry at index 0 is not used
  - each entry d2(k) gives the shift size for the good (matched) suffix of length k
Build good-suffix table d2 for the pattern
• For each k in [1, m-1], corresponding to the suffix of P of length k, set d2(k) by the first of the following three cases that applies:
  1. The distance between the matched suffix of length k and its rightmost other occurrence in the pattern that is not preceded by the same character as the suffix.
     E.g., P: CABABA, d2(1) = 4   // suffix of length 1: A; the rightmost A that is not preceded by B is the second character of P
  2. If case 1 finds no such occurrence: the distance between the longest part of the k-character suffix that matches a prefix of P and that prefix, i.e., m - l, where l < k is the length of that longest part.
     E.g., P: ANAN, d2(3) = 2   // suffix of length 3: NAN; the longest part of NAN that matches a prefix of P is AN
  3. If case 2 finds no such prefix: m, the length of P.
     E.g., P: ANAN, d2(1) = 4   // suffix of length 1: N
Exercise
• Build the good-suffix table for pattern WOWWOW
Exercise
• Build the good-suffix table for pattern WOWWOW
    k   suffix   d2   case
    1   W        2    Case 1
    2   OW       5    Case 2: the longest part of OW that matches a prefix is W
    3   WOW      3    Case 1
    4   WWOW     3    Case 2: the longest part of WWOW that matches a prefix is WOW
    5   OWWOW    3    Case 2: the longest part of OWWOW that matches a prefix is WOW
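The three cases for d2(k) can be checked mechanically with a direct Python sketch. This version follows the case definitions literally and is not the linear-time construction used in production implementations. The function name `build_good_suffix_table` is an assumption for illustration.

```python
def build_good_suffix_table(pattern):
    """Return a list d2 where d2[k] is the good-suffix shift; d2[0] is unused."""
    m = len(pattern)
    d2 = [0] * m
    for k in range(1, m):
        suffix = pattern[m - k:]
        before = pattern[m - k - 1]        # character preceding the suffix in P
        shift = None
        # Case 1: rightmost other occurrence of the suffix that is not
        # preceded by the same character as the suffix itself.
        for start in range(m - k - 1, -1, -1):
            if pattern[start:start + k] == suffix and (
                start == 0 or pattern[start - 1] != before
            ):
                shift = (m - k) - start
                break
        # Case 2: longest part (length l < k) of the suffix that is a prefix of P.
        if shift is None:
            for length in range(k - 1, 0, -1):
                if pattern[:length] == pattern[m - length:]:
                    shift = m - length
                    break
        # Case 3: nothing reusable, shift by the whole pattern length.
        d2[k] = m if shift is None else shift
    return d2


if __name__ == "__main__":
    print(build_good_suffix_table("WOWWOW")[1:])   # [2, 5, 3, 3, 3]
```

Running it on CABABA and ANAN reproduces the case examples: d2(1) = 4 for CABABA, and d2(3) = 2, d2(1) = 4 for ANAN.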
BM algorithm for scenario II in string matching
After successfully matching k characters, 0 < k < m, the algorithm shifts the pattern to the right by
    d = max{d1, d2}
where
• d1 = max{t1(c) - k, 1} is the bad-symbol shift
• d2 = d2(k) is the good-suffix shift
Boyer-Moore Algorithm (cont.)
Step 1. Fill in the bad-symbol shift table.
Step 2. Fill in the good-suffix shift table.
Step 3. Align the pattern against the beginning of the text.
Step 4. Repeat until a matching substring is found or the text ends: compare the corresponding characters from right to left.
  • If no characters match, retrieve entry t1(c) from the bad-symbol table for the text character c causing the mismatch, and shift the pattern to the right by t1(c).
  • If 0 < k < m characters are matched, retrieve entry t1(c) from the bad-symbol table for the text character c causing the mismatch and entry d2(k) from the good-suffix table, and shift the pattern to the right by d = max{d1, d2}, where d1 = max{t1(c) - k, 1}.
Example of Boyer-Moore algorithm
Table t1:
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
  1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 6
Table d2:
  k   pattern   d2
  1   BAOBAB    2
  2   BAOBAB    5
  3   BAOBAB    5
  4   BAOBAB    5
  5   BAOBAB    5
Trace:
  B E S S _ N N E W _ A B O U T _ B A O B A B S
  B A O B A B
      0 matches: shift d1 = t1(N) = 6
              B A O B A B
      2 matches (k = 2): d1 = t1(_) - 2 = 4, d2(2) = 5, shift 5
                        B A O B A B
      1 match (k = 1): d1 = t1(_) - 1 = 5, d2(1) = 2, shift 5
                                  B A O B A B   (success)
Comparisons: 12 with BM, 13 with Horspool
Alignments: 4 with BM, 5 with Horspool
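The Boyer-Moore steps can be sketched end to end in Python to replay the trace above. All function names here are assumptions for illustration. The good-suffix table is built by the direct case-by-case definition rather than an optimized method, and characters absent from the bad-symbol dictionary default to a shift of m.

```python
def bad_symbol_table(pattern):
    """t1: same construction as Horspool's shift table (missing characters mean m)."""
    m = len(pattern)
    return {c: m - 1 - i for i, c in enumerate(pattern[:-1])}


def good_suffix_table(pattern):
    """d2[k] for 1 <= k < m, built directly from the three cases; d2[0] unused."""
    m = len(pattern)
    d2 = [0] * m
    for k in range(1, m):
        suffix, before = pattern[m - k:], pattern[m - k - 1]
        d2[k] = m                                        # case 3 default
        for length in range(1, k):                       # case 2: longest prefix match
            if pattern[:length] == pattern[m - length:]:
                d2[k] = m - length
        for start in range(m - k):                       # case 1 overrides case 2
            if pattern[start:start + k] == suffix and (
                start == 0 or pattern[start - 1] != before
            ):
                d2[k] = (m - k) - start                  # last hit is the rightmost
    return d2


def boyer_moore_search(text, pattern):
    """Return (match index or -1, comparisons, alignments)."""
    n, m = len(text), len(pattern)
    t1, d2 = bad_symbol_table(pattern), good_suffix_table(pattern)
    comparisons = alignments = 0
    i = m - 1                                            # text index under P's last character
    while i < n:
        alignments += 1
        k = 0
        while k < m:
            comparisons += 1
            if pattern[m - 1 - k] != text[i - k]:
                break
            k += 1
        if k == m:
            return i - m + 1, comparisons, alignments
        c = text[i - k]                                  # text character causing the mismatch
        d1 = max(t1.get(c, m) - k, 1)
        i += d1 if k == 0 else max(d1, d2[k])
    return -1, comparisons, alignments


if __name__ == "__main__":
    print(boyer_moore_search("BESS_NNEW_ABOUT_BAOBABS", "BAOBAB"))   # (16, 12, 4)
```

Unlike Horspool's algorithm, the bad-symbol lookup uses the text character at the mismatch position, which is why k is subtracted in d1.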
Summary
• Preprocessing: preprocess the input (or part of it) to store some information to be used later in solving the problem
• With preprocessing of the pattern, Horspool's and the BM string matching algorithms are fast: a shift can move P by more than one position
  - In each alignment, comparison proceeds from right to left on P
Finishing Counting Sort
Algorithm CountingSort(A[0..n-1], lb, ub)
  // Input: an array A of integers between lb and ub (lb <= ub)
  // Output: array B[0..n-1] of A's elements sorted in nondecreasing order
  for j ← 0 to ub-lb do              // initialize frequencies to 0
      C[j] ← 0
  for i ← 0 to n-1 do                // count frequencies
      C[A[i]-lb] ← C[A[i]-lb] + 1
  for j ← 1 to ub-lb do              // cumulative counts: C[j] = number of elements <= lb+j
      C[j] ← C[j-1] + C[j]
  for i ← n-1 downto 0 do            // copy from A to B
      j ← A[i] - lb
      B[C[j]-1] ← A[i]
      C[j] ← C[j] - 1
  return B
E.g., A = {3, 2, 4, 1}
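The counting sort pseudocode translates almost line for line into Python. This is a sketch under the same assumptions as the slide (integer keys between lb and ub). The name `counting_sort` is illustrative.

```python
def counting_sort(a, lb, ub):
    """Return a new list with the elements of `a` (integers in [lb, ub]) sorted."""
    count = [0] * (ub - lb + 1)            # frequencies, initialized to 0
    for x in a:                            # count frequencies
        count[x - lb] += 1
    for j in range(1, ub - lb + 1):        # cumulative counts
        count[j] += count[j - 1]
    b = [None] * len(a)
    for x in reversed(a):                  # right-to-left pass keeps the sort stable
        j = x - lb
        b[count[j] - 1] = x                # count[j] - 1 is the last free slot for x
        count[j] -= 1
    return b


if __name__ == "__main__":
    print(counting_sort([3, 2, 4, 1], 1, 4))   # [1, 2, 3, 4]
```

The extra array of size ub - lb + 1 is the space spent to avoid comparisons, which is the input-enhancement tradeoff this lecture is about.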