15-211 Fundamental Data Structures and Algorithms
String Matching Algorithms
March 30, 2006
Ananda Gunawardena

Aho-Corasick

Idea
• Suppose we need to search for a set of k patterns, P1, P2, …, Pk, in a text T
• Possible solution: apply KMP to each of the k patterns
  – cost is O(k(n+m)), where |T| = n and m = max|Pi|
• Is there a better solution?
• Consider all patterns at once
  – some patterns may be prefixes of others
• max and minimum can be found in one scan

All Prefixes
• Consider the patterns ab, ac, abc, bca, bcc, ba, bc
• Prefixes of the patterns
  – a, abc, b, bca, bcc
• A trie representing the patterns can be built in O(M) time, where M =
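The trie construction mentioned above could be sketched roughly as follows. This is a minimal illustration, not the lecture's code: the names PatternTrie, Node, insert, contains and build are hypothetical, and it assumes M stands for the combined length of all patterns, which is the usual reading. Each character of each pattern is visited once, so shared prefixes such as "ab" in "ab" and "abc" are stored only once.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; all names here are hypothetical, not from the slides.
class PatternTrie {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isPattern; // true if some pattern ends exactly at this node
    }

    final Node root = new Node();
    int nodeCount = 1; // the root counts as one node

    // Insert one pattern; the work is proportional to the pattern's length.
    void insert(String pattern) {
        Node cur = root;
        for (char c : pattern.toCharArray()) {
            Node next = cur.children.get(c);
            if (next == null) { // this prefix has not been seen before
                next = new Node();
                cur.children.put(c, next);
                nodeCount++;
            }
            cur = next;
        }
        cur.isPattern = true;
    }

    // True only if s was inserted as a whole pattern (not merely a prefix).
    boolean contains(String s) {
        Node cur = root;
        for (char c : s.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false;
        }
        return cur.isPattern;
    }

    // Build a trie from all patterns in one pass over their characters.
    static PatternTrie build(String... patterns) {
        PatternTrie trie = new PatternTrie();
        for (String p : patterns) trie.insert(p);
        return trie;
    }

    public static void main(String[] args) {
        PatternTrie t = build("ab", "ac", "abc", "bca", "bcc", "ba", "bc");
        System.out.println(t.nodeCount); // 10: the root plus 9 distinct prefixes
    }
}
```

The seven patterns contain 17 characters in total but need only 9 non-root nodes, because common prefixes are shared.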

Boyer-Moore

Brute force vs. KMP, searching for bcedfg in abcdeabcdeabcedfghijkl:
• Brute force: 21 comparisons
• KMP: 19 comparisons

Brute force vs. Boyer-Moore (B-M), searching for bcedfg in abcdeabcdeabcedfghijkl:
• Brute force: 15 + 6 = 21 comparisons
• B-M: 2 + 6 = 8 comparisons

Boyer-Moore
• Ideas
  – Scan the pattern from right to left (and the target from left to right)
• Allows for bigger jumps on early failures
• Could use a table similar to KMP.
• But follow a better idea:
  – Use information about T as well as P in deciding what to do next.

Brute force vs. B-M, searching for textual in "This string is textual":
• Brute force: 16 + 7 = 23 comparisons
• B-M: 3 + 7 = 10 comparisons

Brute force vs. B-M, searching for foobar in "This is a sample sentence":
• Brute force: 25 comparisons

Boyer-Moore: Ideas (continued)
• If T[i] does not appear in the pattern, skip forward beyond the end of the pattern.

Boyer-Moore matcher

    // Returns the index of the first occurrence of P in T, or -1 if none.
    // Assumes P is non-empty and all characters have codes below 128.
    static int match(char[] T, char[] P) {
        int[] last = buildLast(P);
        int n = T.length;
        int m = P.length;
        int i = m - 1;   // current position in the target
        int j = m - 1;   // current position in the pattern
        if (i > n - 1) return -1;   // pattern longer than target
        do {
            if (P[j] == T[i]) {
                if (j == 0) return i;   // whole pattern matched
                else { i--; j--; }
            } else {
                // Use last to determine the next value for i.
                i = i + m - Math.min(j, 1 + last[T[i]]);
                j = m - 1;
            }
        } while (i <= n - 1);
        return -1;
    }

Boyer-Moore matcher

    static int[] buildLast(char[] P) {
        int[] last = new int[128];
        // Default: the mismatch char is nowhere in the pattern.
        // last says "jump the distance".
        for (int i = 0; i < 128; i++)
            last[i] = -1;
        // The mismatch char is a pattern char.
        // last says "jump to align pattern with last instance of this char".
        for (int j = 0; j < P.length; j++)
            last[P[j]] = j;
        return last;
    }
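To check the comparison counts quoted on the example slides, the matcher and its last table can be combined into one runnable class. The sketch below follows the same logic as the slide code; the class name BoyerMooreDemo, the comparisons counter, the empty-pattern guard and the bounds check on T[i] are illustrative additions, not part of the lecture.

```java
import java.util.Arrays;

// Sketch only: BoyerMooreDemo and the comparisons counter are illustrative
// additions; the matching logic follows the slide code.
class BoyerMooreDemo {
    static int comparisons; // character comparisons made by the last match() call

    // last[c] = index of the last occurrence of c in P, or -1 if c is absent.
    // Assumes pattern characters have codes below 128.
    static int[] buildLast(char[] P) {
        int[] last = new int[128];
        Arrays.fill(last, -1);
        for (int j = 0; j < P.length; j++) last[P[j]] = j;
        return last;
    }

    // Returns the index of the first occurrence of P in T, or -1 if none.
    static int match(char[] T, char[] P) {
        comparisons = 0;
        int n = T.length, m = P.length;
        if (m == 0) return 0;  // added guard: empty pattern matches at 0
        if (m > n) return -1;
        int[] last = buildLast(P);
        int i = m - 1, j = m - 1;
        while (i <= n - 1) {
            comparisons++;
            if (P[j] == T[i]) {
                if (j == 0) return i; // whole pattern matched
                i--;
                j--;
            } else {
                // Added check: target chars outside the table count as absent.
                int l = T[i] < 128 ? last[T[i]] : -1;
                i = i + m - Math.min(j, 1 + l);
                j = m - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int idx = match("This string is textual".toCharArray(),
                        "textual".toCharArray());
        // Prints "15 after 10 comparisons", matching the 3 + 7 = 10 on the slide.
        System.out.println(idx + " after " + comparisons + " comparisons");
    }
}
```

The three failed comparisons land on 't', 'n' and 'e'. For 'n', which does not occur in the pattern, the window jumps a full seven positions.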

KMP vs. B-M, searching for 7777777 in 1234561234356:
• KMP: 13 comparisons
• B-M: 1 comparison

KMP vs. B-M, searching for ring in "This is a string":
• KMP: 16 comparisons
• B-M: 7 comparisons

KMP vs. B-M, searching for tring in "This is a string":
• KMP: 16 comparisons
• B-M: 8 comparisons

Rabin-Karp

Knuth-Morris-Pratt Summary
• Intuition:
  – Analyze the pattern
  – Analogy with a matching FSM.
• Stream-based: never decrement i.
• Works well:
  – For self-repetitive patterns in self-repetitive text
• But:
  – For text, performance similar to brute force
  – Possibly slower, due to precomputation
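The "never decrement i" property can be seen in a compact KMP sketch. This is one common formulation, assumed here for illustration rather than taken from the lecture; the names KmpDemo, buildFailure and match are hypothetical. The pattern is analyzed once into a failure table, and the scan over T then only moves forward.

```java
// Illustrative KMP sketch; names are hypothetical, not from the slides.
class KmpDemo {
    // fail[j] = length of the longest proper prefix of P[0..j]
    // that is also a suffix of P[0..j].
    static int[] buildFailure(char[] P) {
        int[] fail = new int[P.length];
        int k = 0; // length of the currently matched prefix
        for (int j = 1; j < P.length; j++) {
            while (k > 0 && P[j] != P[k]) k = fail[k - 1];
            if (P[j] == P[k]) k++;
            fail[j] = k;
        }
        return fail;
    }

    // Returns the index of the first occurrence of P in T, or -1 if none.
    static int match(char[] T, char[] P) {
        if (P.length == 0) return 0;
        int[] fail = buildFailure(P);
        int j = 0; // number of pattern characters matched so far
        for (int i = 0; i < T.length; i++) { // i is never decremented
            while (j > 0 && T[i] != P[j]) j = fail[j - 1]; // fall back in P only
            if (T[i] == P[j]) j++;
            if (j == P.length) return i - P.length + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(match("abcdeabcdeabcedfghijkl".toCharArray(),
                                 "bcedfg".toCharArray())); // prints 11
    }
}
```

Because only j falls back, the target can be consumed one character at a time, which is what makes the algorithm suitable for streams.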

Boyer-Moore Summary
• Intuition:
  – Analyze the target and the pattern
  – Work backwards from end of pattern
  – Jump forward in target when failing
• Works well:
  – For large alphabets
    • The last table for {0, 1}?
  – For text, in practice
• But:
  – Streams? Must be able to decrement i.

String Matching
• A possibly-true story:
  – A programmer, reluctant to learn the tricky preprocessing in Morris's algorithm, in fact "implemented" it using the brute force algorithm instead.
• In 1980, Karp and Rabin discovered a simpler algorithm.
  – Uses hashing ideas: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.

Karp and Rabin
• Uses hashing ideas: quickly compute hashes for all M-length substrings in T, and compare with the hash for P.
• Compute the hashes in a cumulative way, so each T[i] needs to be seen only once.
• Average case time is O(M+N).
• Worst case is unlikely (all collisions) at O(MN).
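The cumulative hashing step could look like the following sketch. The slides do not specify a hash function, so the polynomial rolling hash, the radix BASE and the modulus MOD below are assumptions chosen for illustration, as are the names RabinKarpDemo and match. Each window's hash is derived from the previous one in constant time, and a direct comparison confirms a hit so that hash collisions cannot cause a false match.

```java
// Illustrative Rabin-Karp sketch; hash parameters and names are assumptions.
class RabinKarpDemo {
    static final long BASE = 256;            // assumed radix
    static final long MOD = 1_000_000_007L;  // assumed prime modulus

    // Returns the index of the first occurrence of P in T, or -1 if none.
    static int match(String T, String P) {
        int n = T.length(), m = P.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        long high = 1; // BASE^(m-1) mod MOD, weight of the window's first char
        for (int k = 1; k < m; k++) high = (high * BASE) % MOD;

        long hp = 0, ht = 0; // hashes of P and of the current window of T
        for (int k = 0; k < m; k++) {
            hp = (hp * BASE + P.charAt(k)) % MOD;
            ht = (ht * BASE + T.charAt(k)) % MOD;
        }

        for (int s = 0; ; s++) {
            // Equal hashes are only a hint; verify to rule out a collision.
            if (hp == ht && T.regionMatches(s, P, 0, m)) return s;
            if (s + m >= n) return -1;
            // Roll the window: drop T[s], then append T[s + m].
            ht = (ht + MOD - (T.charAt(s) * high) % MOD) % MOD;
            ht = (ht * BASE + T.charAt(s + m)) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(match("This string is textual", "textual")); // prints 15
    }
}
```

The verification step is where the O(MN) worst case comes from: if every window collided with the pattern's hash, each of them would be compared character by character.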