
Approximate String Matching using Compressed Suffix Arrays

T. N. D. Huynh, W. K. Sung (National University of Singapore)
W. K. Hon, T. W. Lam (The University of Hong Kong)

String Matching Problem

  • Given a text T of length n over an alphabet Σ and a pattern P of length m, find all occurrences of P in T.
  • E.g., T = barbara, P = bar: 2 occurrences, at positions 1 and 4 in T.
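The problem statement can be checked with a naive scan. This is only a minimal sketch, not an indexed method; the function name `exact_occurrences` is made up for illustration, and positions are 1-based as on the slide.

```python
def exact_occurrences(text, pattern):
    """Return all 1-based start positions where pattern occurs in text."""
    m = len(pattern)
    # Compare the pattern against every length-m window of the text.
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


print(exact_occurrences("barbara", "bar"))  # [1, 4]
```

The scan takes O(nm) time per pattern, which is what an index on T is meant to avoid.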

Index for String Matching

  • Often, T is given in advance and will later be matched against various patterns P.
  • Also, n ≫ m. E.g., T = human genome, n ≈ 3 × 10^9; P = gene, m ≈ 10^3.
  • It pays to spend some space on an index for T that speeds up later matching.

Index for String Matching [Examples]

  Index                                       Space (bits)   Matching time
  Suffix Tree [Weiner 73, McCreight 76]       O(n log n)     O(m + occ)
  Suffix Array [Manber & Myers 93]            O(n log n)     O(m + log n + occ)
  CSA [Grossi & Vitter 00]                    O(n)           O(m log n + occ log n)
  FM-index [Ferragina & Manzini 00]           O(n)           O(m + occ log n)

k-Approximate String Matching

  • Find all occurrences of P in T that have at most k "errors" (mismatches or edits) relative to P.
  • E.g., with k = 1, T = barbara, P = rba: 5 occurrences, at positions 1 (delete r from P), 2 (insert a into P), 3 (match), 4 (delete r from P), 6 (delete b from P).
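The example can be reproduced by brute force. The sketch below assumes that "errors" means unit-cost edit distance (insert, delete, substitute) and that an occurrence is reported by its 1-based start position; the names `edit_distance` and `approx_starts` are illustrative and not from the slides.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between a and b (standard row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def approx_starts(text, pattern, k):
    """1-based starts s such that some substring beginning at s is within k edits."""
    n, m = len(text), len(pattern)
    starts = []
    for s in range(n):
        # Only substrings whose length differs from m by at most k can qualify.
        for length in range(max(1, m - k), m + k + 1):
            if s + length <= n and edit_distance(pattern, text[s:s + length]) <= k:
                starts.append(s + 1)
                break
    return starts


print(approx_starts("barbara", "rba", 1))  # [1, 2, 3, 4, 6]
```

This check is far slower than the indexed algorithm described later and serves only to make the definition concrete.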

Previous Work & Our Result (k = 1)

  Index                               Space (bits)     Matching time
  [Cobbs 95]                          O(n log n)       O(m^2 + occ)
  [Amir et al. 99]                    O(n log^2 n)     O(m log n + occ)
  [Buchsbaum et al. 00]               O(n log n)       O(m log n + occ)
  [This paper]                        O(n log n)       O(m log n + occ)
  [This paper], substituting by CSA   O(n)             O(m log^2 n + occ log n)

Our Index

  • Our index is Suffix Array + Inverse.
  • Definition 1: The suffix array of T is an array SA such that SA[i] stores the starting position of the i-th smallest suffix of T.

An Example of Suffix Array

  • E.g., T = barbara

  Suffixes    Ordered suffixes    Position
  a           a                   7 = SA[1]
  ra          ara                 5 = SA[2]
  ara         arbara              2 = SA[3]
  bara        bara                4 = SA[4]
  rbara       barbara             1 = SA[5]
  arbara      ra                  6 = SA[6]
  barbara     rbara               3 = SA[7]
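The table can be reproduced with a few lines of code. This is a minimal sketch that sorts the suffixes directly, which is simple but not an efficient construction; the function names are illustrative, and positions and ranks are 1-based to match the slides.

```python
def suffix_array(text):
    """SA[i] (1-based i) is the start position of the i-th smallest suffix."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def inverse(sa):
    """inv[p] is the rank of the suffix starting at position p (inv[0] unused)."""
    inv = [0] * (len(sa) + 1)
    for rank, pos in enumerate(sa, start=1):
        inv[pos] = rank
    return inv


sa = suffix_array("barbara")
inv = inverse(sa)
print(sa)       # [7, 5, 2, 4, 1, 6, 3]
print(inv[1:])  # [5, 3, 7, 4, 2, 6, 1]
```

The inverse array answers "what is the rank of the suffix starting at position p?" in constant time, which is the operation the later lemmas rely on.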

Our Index: Suffix Array + Inverse

  • Lemma 1: Let P be a pattern that occurs in T. Then all (exact) occurrences of P correspond to a range [st, ed] in SA such that SA[st], SA[st+1], …, SA[ed] are the positions of all such occurrences.
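Lemma 1 can be illustrated by locating the range with two binary searches over SA. The sketch below is the plain O(m log n) search, not the O(m + log n) variant of Manber & Myers; `pattern_range` is an illustrative name, and ranks are 1-based.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def pattern_range(text, sa, pattern):
    """Return (st, ed), the 1-based SA range of suffixes prefixed by pattern, or None."""
    m = len(pattern)

    def prefix(i):
        # First m characters of the suffix with rank i.
        p = sa[i - 1]
        return text[p - 1:p - 1 + m]

    # st: first rank whose length-m prefix is >= pattern.
    lo, hi = 1, len(sa) + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) >= pattern:
            hi = mid
        else:
            lo = mid + 1
    st = lo
    # ed: last rank whose length-m prefix is <= pattern.
    hi = len(sa) + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) > pattern:
            hi = mid
        else:
            lo = mid + 1
    ed = lo - 1
    return (st, ed) if st <= ed else None


text = "barbara"
sa = suffix_array(text)
st, ed = pattern_range(text, sa, "bar")
print((st, ed), sorted(sa[st - 1:ed]))  # (4, 5) [1, 4]
```

Truncating each sorted suffix to its first m characters preserves the order, which is why both searches are valid.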

Our Index: Suffix Array + Inverse

  • Lemma 2: Given the range [st1, ed1] for P1 and the range [st2, ed2] for P2, the range [st, ed] for the concatenation P1P2 can be found in O(log n) time, based on SA and its inverse.
  • Idea of proof: Similar to Manber & Myers' algorithm, using binary search.
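One way to read Lemma 2 is sketched below. It assumes the following argument, which the slides only hint at: inside [st1, ed1] every suffix starts with P1, so the suffixes are ordered by what follows P1, and the rank of that remainder is inv[SA[i] + |P1|]. A binary search over those ranks then finds the sub-range that falls in [st2, ed2]. The names `build_index` and `combine` are illustrative.

```python
def build_index(text):
    """Suffix array and its inverse, both using 1-based text positions."""
    sa = sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])
    inv = [0] * (len(text) + 1)
    for rank, pos in enumerate(sa, start=1):
        inv[pos] = rank
    return sa, inv


def combine(sa, inv, r1, len1, r2):
    """SA range of P1P2 from the ranges r1 of P1 (length len1 >= 1) and r2 of P2."""
    if r1 is None or r2 is None:
        return None
    n = len(sa)
    (st1, ed1), (st2, ed2) = r1, r2

    def rank_after(i):
        # Rank of the suffix left after skipping P1; 0 if nothing is left.
        pos = sa[i - 1] + len1
        return inv[pos] if pos <= n else 0

    def first(lo, hi, pred):
        # Smallest i in [lo, hi) with pred(rank_after(i)) true, else hi.
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(rank_after(mid)):
                hi = mid
            else:
                lo = mid + 1
        return lo

    st = first(st1, ed1 + 1, lambda r: r >= st2)
    ed = first(st, ed1 + 1, lambda r: r > ed2) - 1
    return (st, ed) if st <= ed else None


sa, inv = build_index("barbara")
# Ranges read off the suffix array of "barbara": "b" -> (4, 5), "ar" -> (2, 3).
print(combine(sa, inv, (4, 5), 1, (2, 3)))  # (4, 5), the range of "bar"
```

Each call performs two binary searches with constant-time rank lookups, which matches the O(log n) bound stated in the lemma.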

Our Index: Suffix Array + Inverse

  • Corollary 3: Given the range [st, ed] for P, and an array C such that C[c] stores the total number of characters in T that are smaller than or equal to the character c, the range of cP can be found in O(log n) time.
  • Proof: Follows directly from Lemma 2, since [C[c-1]+1, C[c]] is the range of SA that corresponds to c (here c-1 denotes the character preceding c in Σ).
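The array C and the single-character ranges can be sketched as follows. The code assumes C[c] is a cumulative count (characters no larger than c), which is the reading under which the range [C[c-1]+1, C[c]] in the proof holds; the names `cumulative_counts` and `char_range` are illustrative.

```python
from collections import Counter


def cumulative_counts(text):
    """Map each character c to the number of characters in text that are <= c."""
    counts = Counter(text)
    table, total = {}, 0
    for ch in sorted(counts):
        total += counts[ch]
        table[ch] = total
    return table


def char_range(table, ch):
    """1-based SA range of the suffixes that start with ch, or None if absent."""
    if ch not in table:
        return None
    alphabet = sorted(table)
    idx = alphabet.index(ch)
    # Count for the preceding character; 0 when ch is the smallest one.
    below = table[alphabet[idx - 1]] if idx > 0 else 0
    return (below + 1, table[ch])


C = cumulative_counts("barbara")
print(C)                   # {'a': 3, 'b': 5, 'r': 7}
print(char_range(C, "b"))  # (4, 5)
```

For T = barbara this gives (1, 3) for a, (4, 5) for b and (6, 7) for r, matching the suffix array example; feeding such a range into Lemma 2 as the range of P1 = c yields the range of cP.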

1-Approximate Matching Algorithm [The delete case]

  1. Find the range [st_i, ed_i] for P[1…i], for every i ∈ [1, m].
  2. Find the range [st_i', ed_i'] for P[i…m], for every i ∈ [1, m].
  3. For every i ∈ [1, m], find the range of P[1…i-1] P[i+1…m]. Report the occurrences.

  Time complexity: O(m log n + occ)
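The three steps can be sketched end to end. This is an illustration under several assumptions, not the authors' implementation: the index is built by direct sorting, single-character ranges come from a linear scan, Steps 1 and 2 extend ranges one character at a time using the Lemma 2 combination, and the pattern has length at least 2. All function names are hypothetical.

```python
def build_index(text):
    """Suffix array and its inverse, both using 1-based text positions."""
    sa = sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])
    inv = [0] * (len(text) + 1)
    for rank, pos in enumerate(sa, start=1):
        inv[pos] = rank
    return sa, inv


def char_range(text, sa, c):
    """SA range of suffixes starting with c (linear scan, for brevity only)."""
    ranks = [i for i, p in enumerate(sa, start=1) if text[p - 1] == c]
    return (ranks[0], ranks[-1]) if ranks else None


def combine(sa, inv, r1, len1, r2):
    """Lemma 2: range of P1P2 from the ranges of P1 (length len1) and P2."""
    if r1 is None or r2 is None:
        return None
    n = len(sa)
    (st1, ed1), (st2, ed2) = r1, r2

    def rank_after(i):
        pos = sa[i - 1] + len1
        return inv[pos] if pos <= n else 0

    def first(lo, hi, pred):
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(rank_after(mid)):
                hi = mid
            else:
                lo = mid + 1
        return lo

    st = first(st1, ed1 + 1, lambda r: r >= st2)
    ed = first(st, ed1 + 1, lambda r: r > ed2) - 1
    return (st, ed) if st <= ed else None


def one_deletion_occurrences(text, pattern):
    """Start positions of P with one character deleted (assumes len(pattern) >= 2)."""
    sa, inv = build_index(text)
    m = len(pattern)
    # Step 1: pre[i] is the range of P[1..i].
    pre = [None] * (m + 1)
    pre[1] = char_range(text, sa, pattern[0])
    for i in range(2, m + 1):
        pre[i] = combine(sa, inv, pre[i - 1], i - 1,
                         char_range(text, sa, pattern[i - 1]))
    # Step 2: suf[i] is the range of P[i..m].
    suf = [None] * (m + 2)
    suf[m] = char_range(text, sa, pattern[m - 1])
    for i in range(m - 1, 0, -1):
        suf[i] = combine(sa, inv, char_range(text, sa, pattern[i - 1]), 1, suf[i + 1])
    # Step 3: delete P[i] and join the two remaining pieces.
    found = set()
    for i in range(1, m + 1):
        if i == 1:
            r = suf[2]
        elif i == m:
            r = pre[m - 1]
        else:
            r = combine(sa, inv, pre[i - 1], i - 1, suf[i + 1])
        if r is not None:
            found.update(sa[r[0] - 1:r[1]])
    return sorted(found)


print(one_deletion_occurrences("barbara", "rba"))  # [1, 3, 4, 6]
```

For P = rba the three deletions give ba (positions 1 and 4), ra (position 6) and rb (position 3). Each of the O(m) combinations costs O(log n), in line with the stated bound once the linear-scan helper is replaced by a lookup in the array C.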

1-Approximate Matching Algorithm

  • For the mismatch case or other edit cases, the algorithm is similar, except that in Step 3 we have to find the range for |Σ|m strings (instead of m strings in the delete case).
  • Time complexity: O(|Σ| m log n + occ)
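The count of |Σ|m strings for the mismatch case can be made concrete by enumerating them. The sketch below is only an illustration: it looks the candidates up with a plain set instead of the suffix-array range search, and the pattern "bab" and alphabet "abr" are example inputs chosen here, not taken from the slides.

```python
def substitution_candidates(pattern, alphabet):
    """All strings obtained by writing each alphabet character at each position."""
    return [pattern[:i] + c + pattern[i + 1:]
            for i in range(len(pattern))
            for c in alphabet]


def mismatch_occurrences(text, pattern, alphabet):
    """1-based starts of length-m windows that differ from pattern in at most one place."""
    m = len(pattern)
    wanted = set(substitution_candidates(pattern, alphabet))
    return [s + 1 for s in range(len(text) - m + 1) if text[s:s + m] in wanted]


candidates = substitution_candidates("bab", "abr")
print(len(candidates))                              # 9, i.e. |Σ| * m
print(mismatch_occurrences("barbara", "bab", "abr"))  # [1, 4]
```

In the indexed algorithm each candidate P[1…i-1] c P[i+1…m] would instead be resolved with two range combinations, which is where the extra |Σ| factor in the time bound comes from.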

The General Case

  • Our algorithm can be extended to solve the general k-approximate matching problem. The time complexity will be O(|Σ|^k m^k log n + occ).
  • Further, if we replace SA + Inverse by the CSA of Grossi & Vitter, the space becomes O(n) bits, and the time increases by an O(log n) factor.

Future Work

  • Can we improve the time to O(m + occ) for the 1-approximate matching problem?