15-211 Fundamental Data Structures and Algorithms
String Matching II
March 30, 2006
Ananda Gunawardena

In this lecture
• FSM revisited
• Aho-Corasick Algorithm
  – Multiple pattern matching
• Boyer-Moore Algorithm
  – Right to left matching
• Rabin-Karp Algorithm
  – Based on hash codes
• Summary

FSM Revisited
• Suppose we consider the alphabet ∑ = {a, b} and a pattern P = “ababa”
• The states of the FSM are all the prefixes of P, i.e. {ε, a, ab, aba, abab, ababa}
• [Figure: states Q0, Q1, Q2, Q3, Q4, Q5 linked by forward transitions]
• Exercise: Mark the failure or backward transitions

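FSM Revisited
• P = “ababa”
• [Figure: state diagram with states Q0 to Q5]
• Expressed as a table: [transition table with columns state, a, b and rows for states 0 to 4]
• Failure function f, indexed by prefix length j:

  prefix   j   f(j)
  a        1   0
  ab       2   0
  aba      3   1
  abab     4   2
  ababa    5   3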
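
The f column above can be reproduced with the usual KMP prefix-function loop. This is a minimal sketch rather than code from the lecture; the class name `FailureTable` and method name `failure` are made up for illustration, and the returned array is 0-indexed, so entry j-1 holds f(j).

```java
class FailureTable {
    // Returns f where f[j-1] is the length of the longest proper prefix of
    // p[0..j-1] that is also a suffix of it (the f(j) column of the table).
    static int[] failure(String p) {
        int m = p.length();
        int[] f = new int[m];
        int k = 0; // length of the border of the prefix processed so far
        for (int j = 1; j < m; j++) {
            // Fall back along shorter borders until one can be extended.
            while (k > 0 && p.charAt(j) != p.charAt(k)) {
                k = f[k - 1];
            }
            if (p.charAt(j) == p.charAt(k)) {
                k++;
            }
            f[j] = k;
        }
        return f;
    }
}
```

For example, `FailureTable.failure("ababa")` returns `[0, 0, 1, 2, 3]`, matching the table.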

AHO-CORASICK

Multiple Pattern Search
• Suppose we need to search for a set of k patterns P1, P2, …, Pk in a text T
• Possible solution: apply KMP to each of the k patterns
  – cost is O(k(n+m)), where |T| = n and m = max|Pi|
• Is there a better solution?
  – consider all patterns at once
  – some patterns may be prefixes of others
  – analogy: the minimum and maximum of a list can both be found in one scan

All Prefixes
• Consider a set of patterns P = {ab, ac, abc, bca, bcc, ba, bc}
• Prefixes of the patterns
  – {ε, a, ab, ac, abc, b, ba, bc, bca, bcc}
• A trie representing the patterns can be built in O(M) time, where M = |P1| + |P2| + … + |Pk| is the total length of the patterns
• Nodes of the trie are states, and forward transitions are easy
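
A trie for a pattern set can be sketched as follows. The class `PatternTrie` and its members are hypothetical names chosen for this example; each node is one state (one prefix), and each pattern character is visited once, which gives the O(M) build time.

```java
import java.util.HashMap;
import java.util.Map;

class PatternTrie {
    // Forward transitions out of this state, keyed by the next character.
    final Map<Character, PatternTrie> children = new HashMap<>();
    // True if the prefix represented by this node is a complete pattern.
    boolean endsPattern;

    // Inserts every pattern character by character; shared prefixes share nodes.
    static PatternTrie build(String... patterns) {
        PatternTrie root = new PatternTrie(); // the empty prefix
        for (String p : patterns) {
            PatternTrie node = root;
            for (char c : p.toCharArray()) {
                node = node.children.computeIfAbsent(c, key -> new PatternTrie());
            }
            node.endsPattern = true;
        }
        return root;
    }

    // Number of states in the subtree rooted here (one per distinct prefix).
    int countStates() {
        int total = 1;
        for (PatternTrie child : children.values()) {
            total += child.countStates();
        }
        return total;
    }
}
```

For the seven patterns above, `PatternTrie.build("ab", "ac", "abc", "bca", "bcc", "ba", "bc").countStates()` is 10: the root plus the nine non-empty prefixes.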

Failure Transitions
• How do we deal with the backward (failure) transitions?
• Suppose U is the current match, followed by a “wrong” letter
• Find the longest proper suffix V of U that is a prefix of some pattern in the set of patterns P
• Example: Let P = {aba, baba, cabab}
• The failure function π is given by

  U      a   ab   aba   b   ba   bab   baba   c   ca   cab   caba   cabab
  π(U)   ε   b    ba    ε   a    ab    aba    ε   a    ab    aba    bab

Failure functions
• [Figure: trie for P = {aba, baba, cabab} with root Q0 and nodes a, ab, aba, b, ba, bab, baba, c, ca, cab, caba, cabab, shown next to the table of π(U) from the previous slide]

More Formally…
• Let P = {W1, W2, …, WM} be the set of all prefixes of all patterns in the set of patterns {P1, P2, …, Pk}
• The transition function δ is given by δ: P × ∑ → P
• The failure function π is given by π: P⁺ → P, where P⁺ = P \ {ε}
  – π(p) = longest proper suffix of p that is in P, i.e. that is a prefix of some Pi
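
The definition of π can be checked directly, without any cleverness, by trying the proper suffixes of p from longest to shortest. This is a brute-force sketch for verifying small examples only; the class and method names are invented for illustration and it is not the efficient construction.

```java
import java.util.Set;
import java.util.TreeSet;

class FailureByDefinition {
    // Collects every prefix of every pattern, including the empty prefix.
    static Set<String> prefixes(String... patterns) {
        Set<String> set = new TreeSet<>();
        for (String p : patterns) {
            for (int len = 0; len <= p.length(); len++) {
                set.add(p.substring(0, len));
            }
        }
        return set;
    }

    // pi(u) = longest proper suffix of u that is in the prefix set.
    // Assumes u is non-empty and prefixSet contains the empty string.
    static String pi(String u, Set<String> prefixSet) {
        for (int start = 1; start <= u.length(); start++) {
            String suffix = u.substring(start);
            if (prefixSet.contains(suffix)) {
                return suffix;
            }
        }
        return "";
    }
}
```

With the prefix set of {aba, baba, cabab}, `pi("ab", ...)` is `"b"`, `pi("aba", ...)` is `"ba"`, and `pi("cabab", ...)` is `"bab"`.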

Transition Function
• Given the failure function π, we can compute the transition function as follows

  δ(u, a) = ua            if ua ∈ P
  δ(u, a) = δ(π(u), a)    if ua ∉ P and u ≠ ε
  δ(u, a) = ε             otherwise

Computing π
• How do we compute the failure function π?
  – KMP traverses a single string from left to right
  – Instead, we need to traverse a trie in breadth-first order, computing failure functions
• Complexity: As in KMP, we can show that the complexity of Aho-Corasick is O(M+n), where M = total length of the patterns and n = |T|
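
Putting the pieces together, the breadth-first computation of π and the resulting search might look like the sketch below. It is one possible arrangement, not the lecture's code: states are integers (0 is the root), `AhoCorasickSketch` and its fields are illustrative names, and each hit is reported as a `pattern@startIndex` string purely for readability.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AhoCorasickSketch {
    private final List<Map<Character, Integer>> go = new ArrayList<>(); // trie edges
    private final List<Integer> fail = new ArrayList<>();               // pi per state
    private final List<List<String>> out = new ArrayList<>();           // patterns ending here

    AhoCorasickSketch(String... patterns) {
        newState(); // state 0 is the empty prefix
        for (String p : patterns) {
            int s = 0;
            for (char c : p.toCharArray()) {
                Integer t = go.get(s).get(c);
                if (t == null) {
                    t = newState();
                    go.get(s).put(c, t);
                }
                s = t;
            }
            out.get(s).add(p);
        }
        // Breadth-first order guarantees pi is known for all shorter prefixes.
        Deque<Integer> queue = new ArrayDeque<>();
        for (int child : go.get(0).values()) {
            queue.add(child); // depth-1 states keep fail = 0
        }
        while (!queue.isEmpty()) {
            int u = queue.remove();
            for (Map.Entry<Character, Integer> e : go.get(u).entrySet()) {
                char c = e.getKey();
                int v = e.getValue();
                int f = fail.get(u);
                while (f != 0 && !go.get(f).containsKey(c)) {
                    f = fail.get(f);
                }
                Integer t = go.get(f).get(c);
                fail.set(v, t == null ? 0 : t);
                out.get(v).addAll(out.get(fail.get(v))); // inherit shorter matches
                queue.add(v);
            }
        }
    }

    private int newState() {
        go.add(new HashMap<>());
        fail.add(0);
        out.add(new ArrayList<>());
        return go.size() - 1;
    }

    // Scans the text once, never moving backwards in it.
    List<String> search(String text) {
        List<String> hits = new ArrayList<>();
        int s = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (s != 0 && !go.get(s).containsKey(c)) {
                s = fail.get(s);
            }
            s = go.get(s).getOrDefault(c, 0);
            for (String p : out.get(s)) {
                hits.add(p + "@" + (i - p.length() + 1));
            }
        }
        return hits;
    }
}
```

For example, `new AhoCorasickSketch("aba", "baba", "cabab").search("cababa")` yields `[aba@1, cabab@0, baba@2, aba@3]`.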

Boyer Moore

Boyer Moore
• Boyer-Moore idea
  – Scan the pattern from right to left and the text from left to right
• Allows for bigger jumps on early failures
• Use a table similar to KMP
• Follow a “better” idea:
  – Use information about T as well as P in deciding what to do next

Brute Force vs. B-M
• Text: “abcdeabcedfghijkl”, pattern: “bcedfg”
• Brute force: 15 + 6 = 21 comparisons
• B-M: 2 + 6 = 8 comparisons

Brute Force vs. B-M
• Text: “This string is textual”, pattern: “textual”
• Brute force: 16 + 7 = 23 comparisons
• B-M: 3 + 7 = 10 comparisons

Brute Force vs. B-M
• Text: “This is a sample sentence”, pattern: “foobar”
• Brute force: 25 comparisons

Boyer Moore
• Ideas
  – Scan the pattern from right to left (and the target from left to right)
• Allows for bigger jumps on early failures
• Could use a table similar to KMP
• But follow a better idea:
  – Use information about T as well as P in deciding what to do next
• If T[i] does not appear in the pattern, skip forward beyond the end of the pattern

Boyer Moore matcher

    static int[] buildLast(char[] P) {
        int[] last = new int[128];     // one entry per 7-bit ASCII character
        // Default: the mismatched char is nowhere in the pattern;
        // last says "jump the distance".
        for (int i = 0; i < 128; i++)
            last[i] = -1;
        // The mismatched char is a pattern char; last says "jump to align
        // the pattern with the last instance of this char".
        for (int j = 0; j < P.length; j++)
            last[P[j]] = j;
        return last;
    }

Boyer Moore matcher

    static int match(char[] T, char[] P) {
        int[] last = buildLast(P);
        int n = T.length;
        int m = P.length;
        int i = m - 1;                 // current position in the text
        int j = m - 1;                 // current position in the pattern
        if (i > n - 1)
            return -1;                 // pattern is longer than the text
        do {
            if (P[j] == T[i]) {
                if (j == 0)
                    return i;          // match found, starting at T[i]
                i--;
                j--;
            } else {
                // Use last to determine the next value for i.
                i = i + m - Math.min(j, 1 + last[T[i]]);
                j = m - 1;
            }
        } while (i <= n - 1);
        return -1;                     // no match
    }

KMP vs. B-M
• Text: “1234561234356”, pattern: “7777777”
• KMP: 13 comparisons
• B-M: 1 comparison

KMP vs. B-M
• Text: “This is a string”, pattern: “ring”
• KMP: 16 comparisons
• B-M: 7 comparisons

KMP vs. B-M
• Text: “This is a string”, pattern: “tring”
• KMP: 16 comparisons
• B-M: 8 comparisons
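
The B-M counts in the three KMP vs. B-M examples can be checked by instrumenting the matcher with a counter. The sketch below follows the same last-occurrence rule as the matcher slides; the class name `BoyerMooreCount` and the static `comparisons` field are additions made for this illustration, and, like the original, it assumes every character code is below 128.

```java
import java.util.Arrays;

class BoyerMooreCount {
    // Number of character comparisons made by the most recent call to match.
    static int comparisons;

    static int[] buildLast(char[] p) {
        int[] last = new int[128]; // assumes 7-bit ASCII input
        Arrays.fill(last, -1);
        for (int j = 0; j < p.length; j++) {
            last[p[j]] = j;
        }
        return last;
    }

    // Returns the start index of the first match, or -1 if there is none.
    static int match(char[] t, char[] p) {
        comparisons = 0;
        int n = t.length;
        int m = p.length;
        if (m == 0 || m > n) {
            return -1; // treat the empty pattern as "no match" in this sketch
        }
        int[] last = buildLast(p);
        int i = m - 1;
        int j = m - 1;
        while (i <= n - 1) {
            comparisons++;
            if (p[j] == t[i]) {
                if (j == 0) {
                    return i;
                }
                i--;
                j--;
            } else {
                i += m - Math.min(j, 1 + last[t[i]]);
                j = m - 1;
            }
        }
        return -1;
    }
}
```

Searching for “ring” in “This is a string” returns 12 after 7 comparisons, “tring” returns 11 after 8, and “7777777” in “1234561234356” returns -1 after a single comparison, in line with the B-M figures on those slides.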

Rabin-Karp

Rabin-Karp Algorithm
• Suppose P is a pattern and T is the search text.
• Compute a hash code of P and ALL the hash codes of substrings of T of length |P| = m
• If hash(P) = hash(T[i…i+m-1]) for some 0 ≤ i ≤ n-m, then we have possibly found the pattern
• But computing all hash codes from scratch takes Ω(nm) time, where |T| = n, |P| = m
• How to compute a good hash code?
  – H(P) = P[0]·B^(m-1) + P[1]·B^(m-2) + … + P[m-1]·B^0, where B is a large enough base, e.g. B = 256
• How to compute the hash code efficiently?

Rabin-Karp
• How to compute the hash code efficiently
• Need hash codes for the substrings of length m of the form T[i…i+m-1]
• How to get T[i+1…i+m] from T[i…i+m-1]
  – drop T[i] and add T[i+m]
• Find a relation between the hash codes for T[i…i+m-1] and T[i+1…i+m]

Rabin-Karp Algorithm
• H(T[i…i+m-1]) = T[i]·B^(m-1) + T[i+1]·B^(m-2) + … + T[i+m-1]·B^0
• What about H(T[i+1…i+m])?
  – H(T[i+1…i+m]) = (H(T[i…i+m-1]) - T[i]·B^(m-1))·B + T[i+m]
• Keep arithmetic overflow under control by working mod q for some prime q
• However, we still need to do a character-by-character comparison after the hash codes match
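
A rolling-hash search along these lines might be sketched as follows. The details are assumptions rather than the lecture's code: the hash is taken to be the polynomial in base B = 256 with the leftmost character carrying the highest power, the modulus Q = 1,000,000,007 is simply a convenient prime, and the class name `RabinKarpSketch` is made up.

```java
import java.util.ArrayList;
import java.util.List;

class RabinKarpSketch {
    static final long B = 256;            // base (assumed)
    static final long Q = 1_000_000_007L; // prime modulus (assumed)

    // Returns every index i at which pattern occurs in text.
    static List<Integer> search(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || m > n) {
            return hits;
        }
        long high = 1; // B^(m-1) mod Q, the weight of the character dropped
        for (int k = 1; k < m; k++) {
            high = high * B % Q;
        }
        long hp = 0; // hash of the pattern
        long ht = 0; // hash of the current window of the text
        for (int k = 0; k < m; k++) {
            hp = (hp * B + pattern.charAt(k)) % Q;
            ht = (ht * B + text.charAt(k)) % Q;
        }
        for (int i = 0; ; i++) {
            // Equal hashes only suggest a match, so verify the characters.
            if (hp == ht && text.regionMatches(i, pattern, 0, m)) {
                hits.add(i);
            }
            if (i + m >= n) {
                break;
            }
            // Roll the window: drop text[i], shift by B, add text[i + m].
            ht = (ht - text.charAt(i) * high % Q + Q) % Q;
            ht = (ht * B + text.charAt(i + m)) % Q;
        }
        return hits;
    }
}
```

For example, `RabinKarpSketch.search("abracadabra", "abra")` returns `[0, 7]`. Each window update costs O(1), so the hashing work is O(n + m) overall, plus the verification on hash matches.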

Summary

Knuth-Morris-Pratt Summary
• Intuition:
  – Analyze the pattern
  – Analogy with a matching FSM
• Never decrement i.
• Works well:
  – For self-repetitive patterns in self-repetitive text
• But:
  – For text, performance is similar to brute force
  – Possibly slower, due to precomputation

Aho-Corasick Summary
• Intuition:
  – Use prefixes of multiple patterns to define failure transitions
  – Natural extension of the KMP idea
• Works well:
  – For multiple pattern search
  – Used in the famous fgrep utility

Boyer-Moore Summary
• Intuition:
  – Analyze the target and the pattern
  – Work backwards from the end of the pattern
  – Jump forward in the target when failing
• Works well:
  – For large alphabets (what would the last table look like for {0, 1}?)
  – For text, in practice
• But:
  – Streams? Must be able to decrement i.

Rabin-Karp Summary
• Intuition:
  – If the hash codes of two patterns are the same, then the patterns “might” be the same
  – If the pattern has length m, compute hash codes of all substrings of length m
  – Leverage the previous hash code to compute the next one
• Works well:
  – Multiple pattern search
• But:
  – Computing hash codes may be expensive