15-211 Fundamental Data Structures and Algorithms
String Matching Algorithms
March 30, 2006
Ananda Gunawardena
Aho-Corasick
Idea
• Suppose we need to search for a set of k patterns P1, P2, …, Pk in a text T
• Possible solution: apply KMP to each of the k patterns
  – Cost is O(k(n+m)), where |T| = n and m = max |Pi|
• Is there a better solution?
• Consider all patterns at once
  – Some patterns may be prefixes of others
• For example, "max" and "maximum" can be found in one scan
All Prefixes
• Consider the patterns ab, ac, abc, bca, bcc, ba, bc
• Prefixes of the patterns
  – a, abc, b, bca, bcc
• A trie representing the patterns can be built in O(M) time, where M =
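The trie over the example patterns can be sketched as follows. This is a minimal illustration, not the lecture's own code: the class name `PatternTrie`, its fields, and the use of a `HashMap` per node are assumptions made for the example. Each inserted character creates at most one node, so the build time is proportional to the total number of pattern characters (assuming constant expected time per map operation).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name; a plain trie over a set of patterns.
class PatternTrie {
    // One trie node: outgoing edges by character, plus an end-of-pattern flag.
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isPattern = false;
    }

    final Node root = new Node();
    int nodeCount = 1; // the root counts as one node

    // Insert one pattern; the work done is proportional to its length.
    void insert(String p) {
        Node cur = root;
        for (char c : p.toCharArray()) {
            Node child = cur.next.get(c);
            if (child == null) {          // shared prefixes reuse existing nodes
                child = new Node();
                cur.next.put(c, child);
                nodeCount++;
            }
            cur = child;
        }
        cur.isPattern = true;
    }

    // True only if s was inserted as a complete pattern.
    boolean contains(String s) {
        Node cur = root;
        for (char c : s.toCharArray()) {
            cur = cur.next.get(c);
            if (cur == null) return false;
        }
        return cur.isPattern;
    }

    // Convenience builder for a whole pattern set.
    static PatternTrie build(String... patterns) {
        PatternTrie t = new PatternTrie();
        for (String p : patterns) t.insert(p);
        return t;
    }

    public static void main(String[] args) {
        PatternTrie t = build("ab", "ac", "abc", "bca", "bcc", "ba", "bc");
        System.out.println("nodes: " + t.nodeCount);
        System.out.println("contains bc: " + t.contains("bc"));
        System.out.println("contains b: " + t.contains("b"));
    }
}
```

Because ab and abc (and bc, bca, bcc) share prefixes, the seven patterns need far fewer nodes than their combined length; this sharing is what lets all patterns be considered at once.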
Boyer Moore
Brute Force vs. KMP
• Text: abcdeabcdeabcedfghijkl
• Pattern: bcedfg
• Brute force: 21 comparisons
• KMP: 19 comparisons
Brute Force vs. B-M
• Text: abcdeabcdeabcedfghijkl
• Pattern: bcedfg
• Brute force: 15 + 6 = 21 comparisons
• B-M: 2 + 6 = 8 comparisons
Boyer Moore
• Ideas
  – Scan pattern from right to left (and target from left to right)
    • Allows for bigger jumps on early failures
    • Could use a table similar to KMP.
    • But follow a better idea:
  – Use information about T as well as P in deciding what to do next.
Brute Force vs. B-M
• Text: This string is textual
• Pattern: textual
• Brute force: 16 + 7 = 23 comparisons
• B-M: 3 + 7 = 10 comparisons
Brute Force vs. B-M
• Text: This is a sample sentence
• Pattern: foobar
• Brute force: 25 comparisons
Boyer Moore
• Ideas
  – Scan pattern from right to left (and target from left to right)
    • Allows for bigger jumps on early failures
    • Could use a table similar to KMP.
    • But follow a better idea:
  – Use information about T as well as P in deciding what to do next.
    • If T[i] does not appear in the pattern, skip forward beyond the end of the pattern.
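Boyer Moore matcher

    // Returns the index in T where P first matches, or -1 if there is no match.
    // Assumes P is non-empty and all characters are 7-bit ASCII (see buildLast).
    static int match(char[] T, char[] P) {
        int[] last = buildLast(P);
        int n = T.length;
        int m = P.length;
        int i = m - 1;                 // current position in the target
        int j = m - 1;                 // current position in the pattern
        if (i > n - 1) return -1;      // pattern is longer than the target
        do {
            if (P[j] == T[i]) {
                if (j == 0) return i;  // matched the whole pattern
                else { i--; j--; }     // keep scanning right to left
            } else {
                // Use last to determine next value for i.
                i = i + m - Math.min(j, 1 + last[T[i]]);
                j = m - 1;
            }
        } while (i <= n - 1);
        return -1;
    }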
Boyer Moore matcher

    // last[c] is the index of the last occurrence of c in P, or -1 if c is absent.
    static int[] buildLast(char[] P) {
        int[] last = new int[128];     // one entry per 7-bit ASCII character
        // Mismatch char is nowhere in the pattern (default).
        // last says "jump the distance".
        for (int i = 0; i < 128; i++)
            last[i] = -1;
        // Mismatch is a pattern char.
        // last says "jump to align pattern with last instance of this char".
        for (int j = 0; j < P.length; j++)
            last[P[j]] = j;
        return last;
    }
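To check the comparison counts quoted on the example slides, the matcher can be instrumented with a counter. The sketch below is the slides' `match` and `buildLast` with one addition: the class name `BoyerMooreCount` and the static field `comparisons` are hypothetical, introduced only for this illustration. It assumes a non-empty pattern and 7-bit ASCII input, as the original table of size 128 does.

```java
import java.util.Arrays;

// Hypothetical demo class: the slide's matcher plus a comparison counter.
class BoyerMooreCount {
    static int comparisons; // number of P[j] == T[i] tests made by the last call

    // last[c] = index of the last occurrence of c in P, or -1 if absent.
    static int[] buildLast(char[] P) {
        int[] last = new int[128];      // assumes 7-bit ASCII characters
        Arrays.fill(last, -1);
        for (int j = 0; j < P.length; j++) last[P[j]] = j;
        return last;
    }

    // Returns the index of the first match of P in T, or -1.
    static int match(char[] T, char[] P) {
        comparisons = 0;
        int[] last = buildLast(P);
        int n = T.length, m = P.length;
        int i = m - 1, j = m - 1;
        if (i > n - 1) return -1;
        do {
            comparisons++;              // one character comparison per iteration
            if (P[j] == T[i]) {
                if (j == 0) return i;
                i--; j--;
            } else {
                i = i + m - Math.min(j, 1 + last[T[i]]);
                j = m - 1;
            }
        } while (i <= n - 1);
        return -1;
    }

    public static void main(String[] args) {
        int pos = match("This string is textual".toCharArray(), "textual".toCharArray());
        System.out.println("match at " + pos + " after " + comparisons + " comparisons");
    }
}
```

On the text "This string is textual" with pattern "textual", the three failed alignments cost one comparison each and the final alignment costs seven, which agrees with the 3 + 7 = 10 figure on the slide.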
KMP vs. B-M
• Text: 1234561234356
• Pattern: 7777777
• KMP: 13 comparisons
• B-M: 1 comparison
KMP vs. B-M
• Text: This is a string
• Pattern: ring
• KMP: 16 comparisons
• B-M: 7 comparisons
KMP vs. B-M
• Text: This is a string
• Pattern: tring
• KMP: 16 comparisons
• B-M: 8 comparisons
Rabin-Karp
Knuth-Morris-Pratt Summary
• Intuition:
  – Analyze the pattern
  – Analogy with a matching FSM
• Stream-based: never decrement i.
• Works well:
  – For self-repetitive patterns in self-repetitive text
• But:
  – For text, performance is similar to brute force
    • Possibly slower, due to precomputation
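The points in this summary can be made concrete with a short KMP sketch. It is one common formulation, not necessarily the version used in the lecture; the class name `KmpSketch` and the method names are assumptions. The failure table is the "analyze the pattern" step, and the matching loop shows why the algorithm suits streams: the text index i only ever moves forward.

```java
// Hypothetical class name; a standard failure-function formulation of KMP.
class KmpSketch {
    // fail[j] = length of the longest proper prefix of P[0..j]
    // that is also a suffix of P[0..j].
    static int[] buildFailure(char[] P) {
        int[] fail = new int[P.length];
        int k = 0;                                  // length of current matched prefix
        for (int j = 1; j < P.length; j++) {
            while (k > 0 && P[j] != P[k]) k = fail[k - 1];
            if (P[j] == P[k]) k++;
            fail[j] = k;
        }
        return fail;
    }

    // Returns the index of the first occurrence of P in T, or -1.
    static int match(char[] T, char[] P) {
        if (P.length == 0) return 0;                // empty pattern matches at 0
        int[] fail = buildFailure(P);
        int j = 0;                                  // characters of P matched so far
        for (int i = 0; i < T.length; i++) {        // i is never decremented
            while (j > 0 && T[i] != P[j]) j = fail[j - 1];
            if (T[i] == P[j]) j++;
            if (j == P.length) return i - P.length + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(match("This is a string".toCharArray(), "ring".toCharArray()));
        System.out.println(java.util.Arrays.toString(buildFailure("abab".toCharArray())));
    }
}
```

For a pattern with no repeated prefix, such as "ring", every failure entry is 0 and the scan behaves much like brute force, which is the "similar to brute force for text" caveat above.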
Boyer-Moore Summary
• Intuition:
  – Analyze the target and the pattern
  – Work backwards from end of pattern
    • Jump forward in target when failing
• Works well:
  – For large alphabets
    • The last table for {0, 1}?
  – For text, in practice
• But:
  – Streams? Must be able to decrement i.
String Matching
• A possibly true story:
  – A programmer, reluctant to learn the tricky preprocessing in Morris’s algorithm, in fact “implemented” it using the brute force algorithm instead.
• In 1980, Karp and Rabin discovered a simpler algorithm.
  – Uses hashing ideas: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.
Karp and Rabin
  – Uses hashing ideas: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.
  – Compute the hashes in a cumulative way, so each T[i] needs to be seen only once.
  – Average case time is O(M+N).
  – Worst case is unlikely (all collisions) at O(MN).
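The cumulative hashing idea can be sketched with a polynomial rolling hash. This is a minimal illustration under stated assumptions, not the lecture's code: the class name `RabinKarpSketch`, the base 256, and the modulus 1,000,000,007 are choices made for the example. Here M is the pattern length and N the text length, as on the slide.

```java
// Hypothetical class name; BASE and MOD are illustrative choices.
class RabinKarpSketch {
    static final long BASE = 256;
    static final long MOD = 1_000_000_007L;

    // Returns the index of the first occurrence of P in T, or -1.
    static int match(String T, String P) {
        int n = T.length(), m = P.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        long high = 1;                        // BASE^(m-1) mod MOD
        for (int k = 1; k < m; k++) high = (high * BASE) % MOD;

        long hp = 0, ht = 0;                  // hash of P, hash of current window of T
        for (int k = 0; k < m; k++) {
            hp = (hp * BASE + P.charAt(k)) % MOD;
            ht = (ht * BASE + T.charAt(k)) % MOD;
        }

        for (int s = 0; ; s++) {
            // Equal hashes may be a collision, so confirm character by character.
            if (hp == ht && T.regionMatches(s, P, 0, m)) return s;
            if (s + m >= n) return -1;        // no further window fits
            // Roll the window: drop T[s], append T[s+m], in constant time.
            ht = (ht - T.charAt(s) * high % MOD + MOD) % MOD;
            ht = (ht * BASE + T.charAt(s + m)) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(match("This is a string", "ring"));
    }
}
```

Each window hash is derived from the previous one, so each T[i] enters and leaves the hash once. The explicit verification on equal hashes is where the O(MN) worst case comes from: if every window collided, every window would be compared in full.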