An Efficient Index Structure for String Databases Tamer

An Efficient Index Structure for String Databases Tamer Kahveci Ambuj K. Singh Department of Computer Science University of California Santa Barbara http: //www. cs. ucsb. edu/~tamer 1

Whole/Substring Matching Problem Find similar substrings in a database, that are similar to a given query string quickly, using a small index structure (1 -2 % of database size). database string query string 2

String Similarity Motivation: n Base Pairs (millions) Applications w Genetic sequence databases, NCBI w Text databases, spell checkers, web search. w Video databases (e. g. VIRAGE, MEDIA 360) n n Database size is too large. Most of the techniques available are in-memory. Space requirement of current indexes is too large. Year 3

Outline Motivation & background Our contribution n Frequency vector, frequency distance & wavelet transform Multi-resolution index structure k-NN & range queries Experimental results Conclusion 4

Notation q : query string. m, n : length of strings. r : range query radius. = r/|q|: error rate. 5

String Similarity: an example A C T - - T A G C R I I D A A T G A T A G - 6

Background Edit operations: n n n Insert Delete Replace Edit distance (ED) between s 1 and s 2 = minimum number of edit operations to transform s 1 to s 2. Finding the edit distance is costly. n O(mn) time and space if m and n are lengths of s 1 and s 2 if dynamic programming is used [NW 70, SW 81]. 7

Related Work Lossless search n Online w [Mye 86] (Myers) reduce space requirement to O(rn), where r is query radius. w [WM 92] (Wu, Manber) binary masks, O(rn). w [BYN 99] (Beaze-Yates, Navarro) NFA n Offline (index based) w [Mye 94] (Myers) condensed r-neighborhood. w [BYN 97] (Beaze-Yates, Navarro) dictionary. Lossy search n n [AG 90] (Altschul, Gish) BLAST. w FASTA, SENSEI, Mega. BLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, Mum. MER. [GWWV 00] (Giladi, Walker, Wang, Volkmuth) SST-Tree 8

Outline Motivation & background Our contribution n Frequency vector, frequency distance & wavelet transform Multi-resolution index structure k-NN & range queries Experimental results Conclusion 9

Frequency Vector Let s be a string from the alphabet ={ 1, . . . , }. Let ni be the number of occurrences of the character i in s for 1 i , then frequency vector: f(s) =[n 1, . . . , n ]. Example: n n s = AATGATAG f(s) = [n. A, n. C, n. G, n. T] = [4, 0, 2, 2] 10

Effect of Edit Operations on Frequency Vector Delete : decreases an entry by 1. Insert : increases an entry by 1. Replace : Insert + Delete Example: n n s = AATGATAG => f(s) = [4, 0, 2, 2] (del. G), s = AAT. ATAG => f(s) = [4, 0, 1, 2] (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2] (A C), s = ACCTATAG => f(s) = [3, 2, 1, 2] 11

An Approximation to ED: Frequency Distance (FD 1) s = AATGATAG => f(s)=[4, 0, 2, 2] q = ACTTAGC => f(q)=[2, 2, 1, 2] n pos = (4 -2) + (2 -1) = 3 n neg = (2 -0) = 2 n FD 1(f(s), f(q)) = 3 FD 1(f(q), f(s)) n ED(q, s) = 4 FD 1(f(s 1), f(s 2))=max{pos, neg}. FD 1(f(s 1), f(s 2)) ED(s 1, s 2). f(s) f(q) 12

An Illustration of Frequency Distance & Edit Distance Set of strings 1 v 1 Frequency Distance v 2 Set of strings 2 Edit Distance 13

Using Local Information: Wavelet Decomposition of Strings s = AATGATAC => f(s)=[4, 1, 1, 2] s = AATG + ATAC = s 1 + s 2 f(s 1) = [2, 0, 1, 1] f(s 2) = [2, 1, 0, 1] 1(s)= f(s 1)+f(s 2) = [4, 1, 1, 2] 2(s)= f(s 1)-f(s 2) = [0, -1, 1, 0] 14

Wavelet Decomposition of a String: General Idea Ai, j = f(s(j 2 i : (j+1)2 i-1)) Bi, j = Ai-1, 2 j - Ai-1, 2 j+1 First wavelet coefficient Second wavelet coefficient (s)= 15

Wavelet Decomposition & ED Define FD(s 1, s 2)=max{FD 1, FD 2}. 16

Outline Motivation & background Our contribution n Frequency vector, frequency distance & wavelet transform Multi-resolution index structure k-NN and range queries Experimental results Conclusion 17

MRS-Index Structure Creation w=2 a s 1 transform 18

MRS-Index Structure Creation s 1 19

MRS-Index Structure Creation s 1 20

MRS-Index Structure Creation s 1 . . . slide c times c=box capacity 21

MRS-Index Structure Creation s 1 . . . 22

MRS-Index Structure Creation s 1 Ta, 1. . . W=2 a 23

Using Different Resolutions s 1 Ta, 1. . . W=2 a+1 Ta+1, 1 24

MRS-Index Structure 25

MRS-index properties Relative MBR volume (Precision) decreases when n n c increases. w decreases. MBRs are highly clustered. Box volume Box Capacity 26

Outline Motivation & background Our contribution n Frequency vector, frequency distance & wavelet transform Multi-resolution index structure k-NN & range queries Experimental results Conclusion 27

Range Queries [KS 01] s 1 s 2 sd 1= w=24 . . . w=25 . . . w=26 . . . 2 1 w=27 . . . 3 2 208 16 64 128 28

k-Nearest Neighbor Query [KSF+96, SK 98] k=3 29

k-Nearest Neighbor Query k=3 r = Edit distance to 3 rd closest substring 30

k-Nearest Neighbor Query r k=3 31

k-Nearest Neighbor Query k=3 32

Outline Motivation & background Our contribution Experimental results Conclusion 33

Experimental Settings w={128, 256, 512, 1024}. Human chromosomes from (www. ncbi. nlm. nih. gov) n n chr 02, chr 18, chr 21, chr 22 Plotted results are from chr 18 dataset. Queries are selected from data set randomly for 512 |q| 10000. An NFA based technique [BYN 99] is implemented for comparison. 34

Experimental Results 1: Effect of Box Capacity (10 -NN) 35

Experimental Results 2: Effect of Window Size (10 -NN) 36

Experimental Results 3: k-NN queries 37

Experimental Results 4: Range Queries 38

Outline Motivation & background Our Contribution Experimental results Discussion & conclusion 39

Discussion In-memory (index size is 1 -2% of the database size). Lossless search. 3 to 45 times faster than NFA technique for k. NN queries. 2 to 12 times faster than NFA technique for range queries. Can be used to speedup any previously defined technique. 40

Future Work Extend to weighted edit distance and affine gaps. Extend to local similarity (substring/substring) search. Compare the quality of answers and speed to BLAST (lossy search). Use as a preprocessing step to BLAST. Apply the MRS index structure for larger alphabet size (e. g. protein sequences. ). 41

Related Work Lossless search n Online w [Mye 86] (Myers) reduce space requirement to O(rn), where r is query radius. w [WM 92] (Wu, Manber) binary masks, O(rn). w [BYN 99] (Beaze-Yates, Navarro) NFA n Offline (index based) w [Mye 94] (Myers) condensed r-neighborhood. w [BYN 97] (Beaze-Yates, Navarro) dictionary. Lossy search n n [AG 90] (Altschul, Gish) BLAST. w FASTA, SENSEI, Mega. BLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, Mum. MER. [GWWV 00] (Giladi, Walker, Wang, Volkmuth) SST-Tree 42

Related Work (Similar problems) [BYP 92] (Beaze-Yates, Perleberg) only replace is allowed. [Gus 97] (Gusfield) exact matching, suffix trees. [JKS 00] (Jagadish, Koudas, Srivastava) exact matching with wild-cards for multidimensional strings, elided trees and R-tree. 43

THANK YOU 44

Frequency Distance to an MBR B f(s) FD(f(q), f(s)) FD(f(q), B) f(q) 45