PARAMETERIZED TEXT INDEXING Rahul Shah

  • Slides: 18

Pattern Matching
• Input: a text T and a pattern P
• Output: all positions in T where P appears as a substring
• Indexing Problem:
  • T is fixed
  • P is provided as a query
• Data Structures for Indexing Problem:
  • Linear space (in words): Suffix Tree, Suffix Array
  • Succinct/Compact Index: FM-Index, Compressed Suffix Array

Parameterized Pattern Matching
• Alphabet Σ consists of two disjoint sets:
  • Static characters Σs
  • Parameterized characters Σp
• A parameterized string (p-string) is a string in (Σs ∪ Σp)*
• Two p-strings S = s_1 s_2 … s_m and S’ = s’_1 s’_2 … s’_m match iff
  • s_i = s’_i for every s_i in Σs
  • there exists a bijection ƒ_S that renames s_i to s’_i for every s_i in Σp
• Example: Σs = {A, B} and Σp = {w, x, y, z}
  • AxBy and AwBz p-match
  • AxBy and AwBw do not p-match
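The p-match definition can be checked directly by building the renaming and its inverse character by character. The following is a minimal sketch, not taken from the slides; the function name `p_match` and the convention that every character outside `static` is parameterized are assumptions made for illustration.

```python
def p_match(s, t, static):
    """Return True if p-strings s and t p-match.

    `static` is the set of static characters; every other character is
    treated as parameterized (an assumption of this sketch).
    """
    if len(s) != len(t):
        return False
    fwd, bwd = {}, {}  # renaming s -> t and its inverse t -> s
    for a, b in zip(s, t):
        if a in static or b in static:
            if a != b:  # static characters must match exactly
                return False
            continue
        # setdefault records the first pairing; later pairings must agree
        # in both directions, which makes the renaming a bijection
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False
    return True


print(p_match("AxBy", "AwBz", {"A", "B"}))  # True
print(p_match("AxBy", "AwBw", {"A", "B"}))  # False
```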

Baker’s Encoding
• Convert a p-string S into a string prev(S) where
  • |prev(S)| = |S|
  • prev(S)[i] = S[i] if S[i] is static
  • prev(S)[i] = 0 if i is the first occurrence of the p-character S[i]
  • prev(S)[i] = (i – j) if j < i is the rightmost earlier occurrence of the p-character S[i]
• Example:
  • S = AxByBzAxBz
  • prev(S) = A0B0B0A6B4
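The encoding rules translate into a single left-to-right pass that remembers the last position of each p-character. This is a small sketch under the assumption that characters outside `static` are parameterized; the name `prev_encode` is illustrative, and the result is returned as a list so that multi-digit distances stay unambiguous.

```python
def prev_encode(s, static):
    """Baker's prev encoding of the p-string s, as a list of symbols."""
    last = {}  # p-character -> most recent position (1-based)
    out = []
    for i, ch in enumerate(s, start=1):
        if ch in static:
            out.append(ch)  # static characters are copied unchanged
        else:
            # 0 for a first occurrence, otherwise the distance back to
            # the rightmost earlier occurrence
            out.append(i - last[ch] if ch in last else 0)
            last[ch] = i
    return out


print(prev_encode("AxByBzAxBz", {"A", "B"}))
# ['A', 0, 'B', 0, 'B', 0, 'A', 6, 'B', 4]
```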

Baker’s p-Suffix Tree
• Two p-strings S and S’ p-match iff prev(S) = prev(S’)
• p-Suffix Tree:
  • Encode every suffix according to prev(.)
  • Construct a single suffix tree over all encoded suffixes
• Construction time:
  • Baker: O(n|Σp| + n log |Σ|)
  • Kosaraju: O(n log |Σ|)
• Searching in p-Suffix Tree:
  • Encode P using prev(.)
  • Search in p-Suffix Tree for prev(P)
  • Time: O(|P| log |Σ| + occ)
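The search relies on the fact that the first |P| symbols of an encoded suffix equal prev of the corresponding substring of T. The sketch below uses that fact in a brute-force scan rather than a p-suffix tree, so it runs in O(n|P|) time and is meant only as a reference for what the tree search returns. The names `prev_encode` and `p_search_naive` are illustrative assumptions.

```python
def prev_encode(s, static):
    """Baker's prev encoding of s (characters outside `static` are parameterized)."""
    last, out = {}, []
    for i, ch in enumerate(s, start=1):
        if ch in static:
            out.append(ch)
        else:
            out.append(i - last[ch] if ch in last else 0)
            last[ch] = i
    return out


def p_search_naive(text, pattern, static):
    """Return all 1-based positions where `pattern` p-matches a substring of `text`.

    Brute force: re-encode every length-|P| window and compare with prev(P).
    """
    m = len(pattern)
    target = prev_encode(pattern, static)
    return [i + 1 for i in range(len(text) - m + 1)
            if prev_encode(text[i:i + m], static) == target]


print(p_search_naive("AxByBzAxBz", "AwBv", {"A", "B"}))  # [1, 7]
print(p_search_naive("AxByBzAxBz", "wBv", {"A", "B"}))   # [2, 4, 8]
```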

Parameterized BWT
• Take every circular suffix T_i and sort according to prev(.)
• Obtain the last characters L[i] in sorted order
• f_i = position of the first occurrence of L[i] in the ith circular suffix in sorted order
• Define pBWT[i] as follows:
  • pBWT[i] = L[i], if L[i] is static
  • pBWT[i] = number of 0’s in the first f_i symbols of the prev encoding of that circular suffix, otherwise

p-BWT and LF mapping
• Parameterized Suffix Array:
  • pSA[i] = j and pSA^-1[j] = i iff prev(T_j) is the ith lexicographically smallest among all prev(T_k), 1 ≤ k ≤ n
• LF Mapping: LF(i) = pSA^-1[pSA[i] - 1] (with pSA[i] - 1 taken as n when pSA[i] = 1)
• Example: T = AxByxx$, sorted with numbers < A < B < $
  • Columns T_i, prev(T_i) and pSA^-1[i] are indexed by text position i
  • The remaining columns are indexed by sorted rank i; f_i is listed only when L[i] is parameterized

  i  T_i      prev(T_i)  Sorted T_i  pSA[i]  pSA^-1[i]  L[i]  f_i  pBWT[i]  LF(i)
  1  AxByxx$  A0B031$    yxx$AxB     4       5          B     -    B        6
  2  xByxx$A  0B031$A    xx$AxBy     5       3          y     7    2        1
  3  Byxx$Ax  B001$A3    xByxx$A     2       6          A     -    A        5
  4  yxx$AxB  001$A3B    x$AxByx     6       1          x     1    1        2
  5  xx$AxBy  01$A3B0    AxByxx$     1       2          $     -    $        7
  6  x$AxByx  0$A3B03    Byxx$Ax     3       4          x     3    2        3
  7  $AxByxx  $A0B031    $AxByxx     7       7          x     3    1        4
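The definitions of pSA, pBWT and LF can be evaluated naively by sorting all circular suffixes, which is useful for checking small examples by hand. The sketch below is a quadratic-time reference, not the succinct construction. It assumes the sort order numbers < static characters < `$` (the order that produces pSA = 4, 5, 2, 6, 1, 3, 7 for the example text), and the names `prev_encode`, `sort_key` and `build_pbwt` are illustrative.

```python
def prev_encode(s, static):
    """Baker's prev encoding of s (characters outside `static` are parameterized)."""
    last, out = {}, []
    for i, ch in enumerate(s, start=1):
        if ch in static:
            out.append(ch)
        else:
            out.append(i - last[ch] if ch in last else 0)
            last[ch] = i
    return out


def sort_key(encoded):
    # Assumed order: numbers < static characters < '$'
    return [(0, x) if isinstance(x, int) else (2, 0) if x == "$" else (1, ord(x))
            for x in encoded]


def build_pbwt(t, static):
    """Return (pSA, pBWT, LF) as 1-based value lists for the text t."""
    n = len(t)
    rotations = [t[k:] + t[:k] for k in range(n)]  # rotations[k] is T_(k+1)
    order = sorted(range(n),
                   key=lambda k: sort_key(prev_encode(rotations[k], static)))
    psa = [k + 1 for k in order]
    inv = [0] * (n + 1)  # inv[j] = rank of T_j (index 0 unused)
    for rank, pos in enumerate(psa, start=1):
        inv[pos] = rank
    pbwt, lf = [], []
    for pos in psa:
        rot = rotations[pos - 1]
        last = rot[-1]  # L[i]
        if last in static:
            pbwt.append(last)
        else:
            f = rot.index(last) + 1  # first occurrence of L[i]
            # number of 0's in prev(rot)[1, f] = distinct p-characters seen so far
            pbwt.append(len({c for c in rot[:f] if c not in static}))
        lf.append(inv[pos - 1] if pos > 1 else inv[n])  # wrap around at position 1
    return psa, pbwt, lf


psa, pbwt, lf = build_pbwt("AxByxx$", set("AB$"))
print(psa)   # [4, 5, 2, 6, 1, 3, 7]
print(pbwt)  # ['B', 2, 'A', 1, '$', 2, 1]
print(lf)    # [6, 1, 5, 2, 7, 3, 4]
```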

Data Structures
• Wavelet tree (WT) over pBWT
  • Operations:
    • pBWT[i]
    • rangeCount(i, j, x, y) = number of k in [i, j] satisfying x ≤ pBWT[k] ≤ y
  • Space: n log |Σ| bits
  • Time: O(log |Σ|)
• Succinct representation of p-Suffix Tree
  • Operations on a node:
    • leftMostLeaf, rightMostLeaf, qth child
    • parent, lca
  • Space: 4n + o(n) bits
  • Time: O(1)
• Additional O(n) + o(n log |Σ|) bits structure
• Total Space: n log |Σ| + O(n) + o(n log |Σ|) bits
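To make the rangeCount operation concrete, here is a pointer-based wavelet tree sketch over an integer sequence. It is an illustration only: it stores plain Python prefix-count lists instead of succinct bitvectors with rank support, so it does not achieve the stated space bound, and it assumes the pBWT symbols have already been mapped to integers in alphabet order. The class and method names are hypothetical.

```python
class WaveletTree:
    """Wavelet tree over a non-empty list of integers (illustrative, not succinct)."""

    def __init__(self, seq, lo=None, hi=None):
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        self.left = self.right = None
        if self.lo == self.hi or not seq:
            return  # leaf, or empty internal node that is never descended into
        mid = (self.lo + self.hi) // 2
        # zeros[k] = how many of the first k symbols go to the left child
        self.zeros = [0]
        for v in seq:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletTree([v for v in seq if v <= mid], self.lo, mid)
        self.right = WaveletTree([v for v in seq if v > mid], mid + 1, self.hi)

    def range_count(self, i, j, x, y):
        """Number of positions k in [i, j] (1-based) with x <= seq[k] <= y."""
        if i > j or y < self.lo or x > self.hi:
            return 0
        if x <= self.lo and self.hi <= y:
            return j - i + 1  # every symbol in this node qualifies
        # translate [i, j] into the coordinates of each child
        li, lj = self.zeros[i - 1] + 1, self.zeros[j]
        ri, rj = (i - 1) - self.zeros[i - 1] + 1, j - self.zeros[j]
        return (self.left.range_count(li, lj, x, y)
                + self.right.range_count(ri, rj, x, y))


# pBWT = B 2 A 1 $ 2 1 with the assumed coding 1->1, 2->2, A->3, B->4, $->5
wt = WaveletTree([4, 2, 3, 1, 5, 2, 1])
print(wt.range_count(1, 7, 1, 2))  # 4
print(wt.range_count(2, 6, 2, 4))  # 3
```

Each query visits O(log |Σ|) nodes because at most two root-to-leaf paths are partially covered by [x, y].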

Compute LF(i) when L[i] is static
• Both L[i] and L[j] are static:
  • LF(i) < LF(j) iff pBWT[i] < pBWT[j], or pBWT[i] = pBWT[j] and i < j
• L[i] is static and L[j] is parameterized:
  • LF(j) < LF(i)
• If L[i] is static then determining LF(i) is easy using standard wavelet tree operations:
  • LF(i) = rangeCount(1, n, 1, c - 1) + rangeCount(1, i, c, c), where c = pBWT[i]
  • Time: O(log |Σ|)
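The static-case formula can be checked on the example text T = AxByxx$, whose pBWT (recomputed from the definition) is B 2 A 1 $ 2 1. In this sketch rangeCount is a linear scan standing in for the wavelet tree, and the integer coding of symbols (p-values first, then A < B < $) is an assumption chosen so that every parameterized entry counts as smaller than every static one.

```python
PBWT = ["B", 2, "A", 1, "$", 2, 1]            # pBWT of T = AxByxx$
CODE = {1: 1, 2: 2, "A": 3, "B": 4, "$": 5}   # assumed symbol order


def range_count(codes, i, j, x, y):
    """Linear-scan stand-in for the wavelet tree rangeCount (1-based, inclusive)."""
    return sum(1 for k in range(i, j + 1) if x <= codes[k - 1] <= y)


def lf_static(i):
    """LF(i) for a row i whose last character L[i] is static."""
    codes = [CODE[s] for s in PBWT]
    c = codes[i - 1]
    smaller = range_count(codes, 1, len(codes), 1, c - 1)  # rows with a smaller symbol
    same_before = range_count(codes, 1, i, c, c)           # equal symbols up to row i
    return smaller + same_before


print([lf_static(i) for i in (1, 3, 5)])  # [6, 5, 7]
```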

Compute LF(i) when L[i] is parameterized
• z = locus of the string prev(T_i)[1, f_i]
  • Computed in O(log |Σ|) time using WT and an additional o(n log |Σ|)-bit structure
• v = parent(z), w = parent(v)
• u = qth child of v
  • Computed in O(1) time using tree topology and additional O(n) bits

Compute LF(i) when L[i] is parameterized
• LF(i) = N_1 + N_2 + N_3 + N_4, where N_k is the number of suffixes j in S_k such that LF(j) ≤ LF(i)
• N_1 = number of j’s such that
  • L[j] is a p-character, and
  • f_j > 1 + |path(lca(z, leaf_j))|
  • Given by fSum(z)
• fSum(.) for all nodes can be maintained in O(n) bits with O(1) access time

Compute LF(i) when L[i] is parameterized
• N_2 = number of j’s such that
  • L[j] is a p-character, and
  • f_j > f_i, or f_j = f_i and j ≤ i
• N_2 = rangeCount(L_z, R_z, c + 1, |Σp|) + rangeCount(L_z, i, c, c), where c = pBWT[i]
• Computed using WT in O(log |Σ|) time

Compute LF(i) when L[i] is parameterized
• N_3 = 0
• N_4 = number of j’s such that
  • L[j] is a p-character, f_j > f_i, and
  • the leading character on the path from v to leaf_j is parameterized
• N_4 = rangeCount(R_z + 1, R_u, c + 1, |Σp|), where c = pBWT[i]
• Computed using WT in O(log |Σ|) time

Summarizing LF(i)
• Computed in O(log |Σ|) time
• Space is n log |Σ| + O(n) + o(n log |Σ|) bits
• pSA[.] and pSA^-1[.] can be computed using a sampled suffix array and inverse suffix array
  • Time: O(log^{1+ε} n)
  • Space: additional O(n) bits
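The sampling idea can be sketched with the LF values of the example text T = AxByxx$. Since LF(i) is the row of text position pSA[i] - 1, walking LF until a sampled row is reached and adding the number of steps recovers pSA[i]. The sampling scheme below (text positions 1, 1 + step, 1 + 2·step, ...) and the function names are assumptions for illustration; the sample rate that yields the stated bounds is not shown here.

```python
LF = [6, 1, 5, 2, 7, 3, 4]    # LF(i) for T = AxByxx$, i = 1..7
PSA = [4, 5, 2, 6, 1, 3, 7]   # full pSA, used here only to build the samples


def build_samples(psa, step):
    """Keep pSA[i] only for text positions 1, 1 + step, 1 + 2*step, ..."""
    return {i: pos for i, pos in enumerate(psa, start=1) if (pos - 1) % step == 0}


def locate(i, lf, samples):
    """Recover pSA[i] by walking LF until a sampled row is reached."""
    steps = 0
    while i not in samples:
        i = lf[i - 1]  # row of the text position one to the left
        steps += 1
    # Text position 1 is always sampled, so the walk never wraps around.
    return samples[i] + steps


samples = build_samples(PSA, step=3)
print([locate(i, LF, samples) for i in range(1, 8)])  # [4, 5, 2, 6, 1, 3, 7]
```

A larger `step` stores fewer samples but lengthens the walk; each step costs one LF computation.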

Backward Search
• Compute the suffix range of P as follows:
  • Given the suffix range [sp, ep] of Q, a proper suffix of P
  • c = the character preceding Q in P
  • Compute the suffix range [sp’, ep’] of cQ
• Preprocess P in O(|P| log |Σ|) time such that for any p-character P[i], we can find
  • the number of distinct p-characters in P[i+1, |P|]
  • the number of distinct p-characters in P[i+1, c_i], where c_i is the first occurrence of P[i] in P[i+1, |P|]
• Case: c is static
  • sp’ = 1 + rangeCount(1, n, 1, c - 1) + rangeCount(1, sp - 1, c, c)
  • ep’ = rangeCount(1, n, 1, c - 1) + rangeCount(1, ep, c, c)
  • Time: O(log |Σ|)
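One backward-search step for a static character can be traced on the example text T = AxByxx$ (pBWT = B 2 A 1 $ 2 1, recomputed from the definition). The sketch uses a linear-scan rangeCount in place of the wavelet tree and assumes the symbol coding p-values < A < B < $; the function name `backward_step_static` is illustrative.

```python
PBWT = ["B", 2, "A", 1, "$", 2, 1]            # pBWT of T = AxByxx$
CODE = {1: 1, 2: 2, "A": 3, "B": 4, "$": 5}   # assumed symbol order


def range_count(codes, i, j, x, y):
    """Linear-scan stand-in for the wavelet tree rangeCount (1-based, inclusive)."""
    return sum(1 for k in range(i, j + 1) if x <= codes[k - 1] <= y)


def backward_step_static(sp, ep, c):
    """Map the suffix range [sp, ep] of Q to the range of cQ, for static c."""
    codes = [CODE[s] for s in PBWT]
    n, cc = len(codes), CODE[c]
    smaller = range_count(codes, 1, n, 1, cc - 1)
    sp_new = 1 + smaller + range_count(codes, 1, sp - 1, cc, cc)
    ep_new = smaller + range_count(codes, 1, ep, cc, cc)
    return sp_new, ep_new  # the range is empty when sp_new > ep_new


# Q = xB has encoded form 0B and suffix range [3, 3]; prepending A gives AxB.
print(backward_step_static(3, 3, "A"))  # (5, 5)
# Q = y has encoded form 0 and suffix range [1, 4]; prepending B gives By.
print(backward_step_static(1, 4, "B"))  # (6, 6)
```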

Backward Search
• Case: c is parameterized and does not appear in Q
  • d = number of distinct p-characters in Q
  • (ep’ - sp’ + 1) = rangeCount(sp, ep, d + 1, |Σp|)
    • Computed in O(log |Σ|) time
  • sp’ = 1 + fSum(1 + fSum(lca(leaf_sp, leaf_ep)))
    • Computed in O(1) time
• Case: c is parameterized and appears in Q
  • d = number of distinct p-characters in Q until the first occurrence of c
  • (ep’ - sp’ + 1) = rangeCount(sp, ep, d, d)
    • Computed in O(log |Σ|) time
  • sp’ = LF(i_min), where i_min = min{i | sp ≤ i ≤ ep such that pBWT[i] = d}
    • Computed in O(log |Σ|) time
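The size of the new range in both parameterized cases can be checked on the example text T = AxByxx$ (pBWT = B 2 A 1 $ 2 1, |Σp| = 2). The sketch covers only the count (ep’ - sp’ + 1), not sp’, which needs the tree structures. It assumes a linear-scan rangeCount, the symbol coding p-values < A < B < $, and that d in the second case counts c itself; all function names are illustrative.

```python
PBWT = ["B", 2, "A", 1, "$", 2, 1]            # pBWT of T = AxByxx$
CODE = {1: 1, 2: 2, "A": 3, "B": 4, "$": 5}   # assumed symbol order
STATIC = set("AB$")
NUM_P = 2                                     # |Σp| for this text (x and y)


def range_count(codes, i, j, x, y):
    """Linear-scan stand-in for the wavelet tree rangeCount (1-based, inclusive)."""
    return sum(1 for k in range(i, j + 1) if x <= codes[k - 1] <= y)


def new_range_size(sp, ep, q, c):
    """Size of the suffix range of cQ, given the range [sp, ep] of Q (c parameterized)."""
    codes = [CODE[s] for s in PBWT]
    seen = []  # distinct p-characters of Q in order of first appearance
    for ch in q:
        if ch not in STATIC and ch not in seen:
            seen.append(ch)
    if c in seen:
        d = seen.index(c) + 1  # distinct p-characters up to the first c, inclusive
        return range_count(codes, sp, ep, d, d)
    d = len(seen)              # c is new: rows whose f reaches past all of Q's p-characters
    return range_count(codes, sp, ep, d + 1, NUM_P)


# Q = x has encoded form 0 and suffix range [1, 4].
print(new_range_size(1, 4, "x", "y"))  # 1  (yx encodes to 00)
print(new_range_size(1, 4, "x", "x"))  # 1  (xx encodes to 01)
```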

Summarizing
• Suffix range of P is found in O(|P| log |Σ|) time
• Each text position is located in O(log^{1+ε} n) time
• Final Result
  • Space: n log |Σ| + O(n) + o(n log |Σ|) bits
  • Time: O(|P| log |Σ| + occ log^{1+ε} n)

Questions? Thank you!