On the String Matching with k Differences in

Outline Ø Ø Ø Motivation - Statement of Problem - Related work Basic Techniques

Statement of Problem Ø String matching with k differences: to find all the occurrences

Statement of Problem Ø Ø String matching with k differences: to find all the

Related Work Ø Exact string matching - On-line algorithms: Knuth-Morris-Pratt , Boyer-Moore , Aho-Corasick

Basic Techniques Ø Ø Dynamic Programming - to calculate distance between pattern and targets

Dynamic Programming - Xi = x 1 x 2 … xi Time O(mn) -

Dynamic Programming Ø Example: X = gcaca , Y = acatatg , k =

BWT Transformation Ø BWT array L of y, denoted as BWT (Y), can be

BWT Transformation Suffix gtataca$ Sorted suffix $ a$ aca$ SAy r. F F L

BWT Transformation Ø Ø Burrows-Wheeler Transform ( BWT ) y = g 1 t

Backward Search of BWT-Index Ø Ø y = g 1 t 1 a 1

Backward Search of BWT-Index S(t, <a, [1, 3]> ) Search sequence : F $

rank. All Ø Ø Arrange | | arrays each for a character X such

Reduce rank. All-Index Size q q i 1 2 3 4 5 6 7

String Matching with k Differences ß v 0 v 2 … vl corresponds to

Search Trees ß v 0 <-, [1, 8]> T: F $ a 3 a

String Matching with k Differences D-vectors: D 0 D 1 D 2 D 3

Computational Complexities Ø Time complexity Worst case: O( k |T|) Average time complexity: O(

Improvement-1 ß v 0 T: F $ a 3 a 2 a 1 c

Improvement-1 ß gcaca $ a$ aca $ 1 $ c a 2 3 ca

Improvement-2 Ø Recognizing similar paths w <e 0, [ 0′, 0′]> <e 0, [

Pattern Partition Ø Ø As k increases, the performance of our algorithm degrades. Partition

Pattern Partition Ø Two-step method: Filtering and Exact matching Filtering In the first step,

Pattern Partition Ø Two-step method • Exact checking In the second step, for each

Pattern Partition Ø Illustration for pattern partition target string y m + 2 k

Experiments Ø 1. 2. 3. 4. 5. 6. 7. In our experiments, we have

Experiments Ø 1. 2. Test bed All codes are written in C++ and compiled

Experiments Ø 1. 2. Test bed For time measurements, we used the Unix time

Experiments Ø Data Reference sequences* Gorilla Danio Rerio (Zebra. Fish) Gorilla Chr 1 Protein-2

Experiments Ø Experiments on the string matching with small number of differences - Pattern

Experiments Number of nodes in T (Gorilla) k 1 2 3 4 5 6

Experiments Ø Experiments on the string matching with large number of differences (for which

Experiments Two-step execution details on Gorilla genome k 20 25 30 35 40 Step-1

Experiments Ø Experiments on number of subpatterns - Pattern length = 300 characters Gorilla

Experiments Number of segments checked in Step-2 by the pattern partition K 20 25

Experiments Ø Experiments on different lengths of patterns Gorilla genome Zebra. Fish genome 40

Experiments Ø Experiments on different lengths of patterns Protein 1 Protein 2 41

Conclusion Ø Main contribution - An algorithm for the string matching with differences q

Slides: 43

Download presentation

On the String Matching with k Differences in DNA Databases Yangjun Chen and Hoang Hai Nguyen Dept. of Applied Computer Science University of Winnipeg, Canada 1

Outline Ø Ø Ø Motivation - Statement of Problem - Related work Basic Techniques - Dynamic programming - BWT Arrays – A space-economic Index for String Matching Main algorithm for string Matching with k Differences - Search trees - Suffix trees over patterns, Similar paths, Pattern partition Experiments Conclusion and Future Work 2

Statement of Problem Ø String matching with k differences: to find all the occurrences of a pattern string x = x 1 x 2 … xm in a target string y = y 1 y 2. . . yn with at most k differences, where xi, yj , a given alphabet. In general, we distinguish among three kinds of differences: 1. 2. 3. A character of the pattern corresponds to a different character of the target. In this case we say that there is a mismatch between the two characters; A character of the target corresponds to ''no character'' in the pattern (an insertion into the pattern); and A character of the pattern corresponds to ''no character'' in the target (a deletion from the pattern). 3

Statement of Problem Ø Ø String matching with k differences: to find all the occurrences of a pattern string x = x 1 x 2 … xm in a target string y = y 1 y 2. . . yn with at most k differences. Example: k = 3 bpdqe gh cbcd efghi 4 pattern target

Related Work Ø Exact string matching - On-line algorithms: Knuth-Morris-Pratt , Boyer-Moore , Aho-Corasick - Index based: suffix trees (Weiner ; Mc. Creight ; Ukkonen), suffix arrays (Manber , Myers), BWTtransformation ( Burrow -Wheeler ), Hash (Karp, Rabin ) Ø Inexact string matching - String matching with k mismatches - Hamming distance (Landau, Vishkin; Amir at al. ; Cole; Chen, Wu) - String matching with k differences - Levelshtein distance ( Chang, Lampe) - String matching with wild-cards ( Manber, Baeza-Yates ) 5

Basic Techniques Ø Ø Dynamic Programming - to calculate distance between pattern and targets BWT transformation - to ‘fold’ the target strings 6

Dynamic Programming - Xi = x 1 x 2 … xi Time O(mn) - Yj = y 1 y 2 … yj complexity: D(0, j) = j, 0 j n; D(i, 0) = i, 0 i m; D(i – 1, j) + w(xi, ) D(i, j) = min D(i – 1, j - 1) + (xi, yj) D(i, j - 1) + w( , yj) D(i-1, j-1) D(i-1, j) D(i, j) where w(xi, yj) is the cost to change xi to yj, and (xi, yj) is 1 if xi = yj. Otherwise (xi, yj) = w(xi, yj). 7

Dynamic Programming Ø Example: X = gcaca , Y = acatatg , k = 2. For each yj, the distance between y 1…yj and x 1…xi for all xi will be calculated. j 0 1 2 3 4 5 6 7 a c a t a T g 0 0 0 0 i 0 1 g 1 1 1 1 0 2 c 2 2 1 3 a 3 2 2 1 2 2 3 2 4 c 4 2 2 3 3 3 5 a 5 4 3 2 4 4 8

BWT Transformation Ø BWT array L of y, denoted as BWT (Y), can be established by using the suffix array SA of y: if SA[i] = 0; L[i] = $, Ø L[i] = y[SA[i] – otherwise. 1], was proposed by M. Burrows and D. J. Wheeler in BWT array 1994. (M. Burrows, D. J. Wheeler, (1994), A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation. ) 9

BWT Transformation Suffix gtataca$ Sorted suffix $ a$ aca$ SAy r. F F L r. L $ a a Sorted rotations $g 1 t 1 a 1 t 2 a 2 c 1 a 3 a 3$g 1 t 1 a 1 t 2 a 2 c 1 a 3$g 1 t 1 a 1 t 2 7 6 4 1 2 a c t 1 1 1 taca$ ca$ a$ $ ataca$ gtataca$ 2 5 0 3 1 1 1 2 a c g t t a 1 t 2 a 2 c 1 a 3$g 1 t 1 a 1 t 2 a 2 g 1 t 1 a 1 t 2 a 2 c 1 a 3$g 1 t 1 a 1 t 2 a 2 c 1 a 3$g 1 a a $ a g 2 2 3 1 L = BWT (Y) 10

BWT Transformation Ø Ø Burrows-Wheeler Transform ( BWT ) y = g 1 t 1 a 2 t 2 a 3 c 1 a 3$ Rank correspondence: g 1 t 1 a 1 t 2 a 2 c 1 a 3 $ g 1 t 1 a 2 c 1 a 3 $ g 1 t 1 a 1 t 2 a 2 c 1 g 1 t 1 a 1 t 2 a 2 t 2 a 3 rank: 3 Sort these sequences lexicographically. rk. F 1 2 3 1 1 1 2 F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 Suffix Array L rk. L g 1 t 1 a 1 t 2 a 2 t 2 a 3 1 $ g 1 t 1 a 1 t 2 a 2 c 1 1 c 1 a 3 $ g 1 t 1 a 1 t 2 a 2 c 1 a 3 $ g 1 t 1 2 a 3 $ g 1 t 1 a 1 t 2 a 2 2 t 1 a 1 t 2 a 2 c 1 a 3 $ g 1 t 1 a 1 3 a 1 t 2 a 2 c 1 a 3 $ g 1 1 rk. F(e) = rk. L(e) 11 rank: 3 7 6 4 2 5 0 1 2 BWT construction: L[i] = $, if SA[i] = 1; L[i] = y[SA[i] – 1], otherwise. SA[…] – suffix array A suffix array can be established in O( n).

Backward Search of BWT-Index Ø Ø y = g 1 t 1 a 1 t 2 a 2 c 1 a 3$ x = tata search (z, ) = Backward Search F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 <z, [ , β]>, if z appears in L ; , otherwise. Z: a character : a range in F L : a range in L, corresponding to F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 12 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1

Backward Search of BWT-Index S(t, <a, [1, 3]> ) Search sequence : F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 <a, [1, 3]> F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 S(a, <t, [1, 2]> ) <t, [1, 2]> S(t, <a, [3, 3]> ) <t, [2, 2]> <a, [3, 3]> F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 13 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1

rank. All Ø Ø Arrange | | arrays each for a character X such that AX[i] (the ith entry in the array for X) is the number of appearances of X within L[1. . i]. Instead of scanning a certain segment L[ . . ] ( ) to find a subrange for a certain X , we can simply look up AX to see whether AX[ - 1] = A [ ]. If it is the case, then does not occur in . . ]. Otherwise, [AX[ - 1] + 1, AX[ ] ] should be the found range. Example To find the first and the last appearance of t in L[1. . 3], we only need to find At[1 – 1] = At [0] = 0 and At [3] = 2. So the corresponding range is [At[1 - 1] + 1, At[3]] = [1, 2]. F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 A$ 0 0 0 1 1 1 Aa 1 1 2 2 3 3 Ac 0 1 1 1 1 Ag 0 0 0 0 1 At 0 0 1 2 2 2

Reduce rank. All-Index Size q q i 1 2 3 4 5 6 7 8 F-ranks: F = <a; xa, ya> BWT array: L Reduced appearance array: A with bucket size . Reduced suffix array: SA* with bucket size . F $ a 3 a 2 a 1 c 1 g 1 t 2 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 rk. L 1 1 1 2 2 3 1 SA 7 6 4 2 5 0 1 2 F = < ; x , y > F$ = <$; 1, 1> Fa = <a; 2, 4> Fc = <c; 5, 5> Fg = <g; 6, 6> Ft = <t; 7, 8> + L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 13 Find a range: top F(x ) + A [ (top -1)/ ] + r +1 bot F(x ) + A [ bot / ] + r r is the number of 's appearances within L[ (top - 1)/ . . top - 1] r’ is the number of 's appearances within L[ bot / . . bot ] + A$ 0 0 0 1 1 1 Aa 1 1 2 2 3 3 Ac 0 1 1 1 1 Ag 0 0 0 0 1 At 0 0 1 2 2 2 + SA* 7 6 4 2 5 0 1 2

String Matching with k Differences ß v 0 v 2 … vl corresponds to a search sequence. Each ]>. The D-vector of v 0 is <0, 1, …, m>T. vj is labeled with <e j, [ j, j For j > o, we have D j[0] = D j-1[0] + 1 D j[i] = min{D j[i -1] + w(zi, ), D j-1[i] + w( , ej), D j-1[i - 1] + (zi, ej), }, for i > 0. 14

Search Trees ß

Search Trees ß v 0 <-, [1, 8]> T: F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 v 1 <a, [1, v 9 <c, [1, 1]> 3]> v 10 v 2 <c, [1, 1]> <a, [2, v 4 <t, [1, 2]> v 8 2]> v 10 v 5 v 3 <a, [2, 2]> <a, [3, 3]> <g, [1, <t, [1, 1]> v 11 v 6 P 1 P 3 <t, [2, <a, [3, 3]> P v 2]> 7 4 <g, [2, 2]> P 2 18 v 13 <t, [1, 2]> v 14 v 16 <a, [3, <g, [1, v 3]> 1]>P 15 6 <t, [2, 2]> P 5 v 17 <g, [1, 1]> P 7

String Matching with k Differences D-vectors: D 0 D 1 D 2 D 3 D 4 D 5 D 6 D 7 D 8 D 9 D 10 D 11 D 12 D 13 D 14 D 15 D 16 D 17 - a c a t g g c a t a t g g 0 1 2 3 4 5 3 1 0 3 4 1 2 3 2 1 1 1 2 3 4 4 2 1 1 3 4 1 2 3 1 0 2 2 1 2 2 3 4 5 3 1 2 3 4 2 2 3 2 1 3 2 3 4 3 2 3 3 3 4 3 2 2 3 3 3 4 2 3 4 3 3 4 4 5 4 3 2 4 3 4 4 4 5 4 3 2 5 4 4 5 5

Computational Complexities Ø Time complexity Worst case: O( k |T|) Average time complexity: O( k | | 2 k) Ø Space complexity Worst case: O( km + n) Ø Existing methods: time complexity – O (k n); space complexity – O( m + n) 20

Improvement-1 ß v 0 T: F $ a 3 a 2 a 1 c 1 g 1 t 2 t 1 L a 3 c 1 t 2 t 1 a 2 $ a 1 g 1 v 1 <a, [1, 3]> v 2 <c, [1, 1]> v 4 <t, [1, 2]> v 5 v 3 <a, [2, 2]> <a, [3, 3]> <g, [1, 1]> v 6 P 1 P 3 <t, [2, v 2]> 7 <g, [2, 2]> P 2 21 <-, [1, 8]> D 4 2 2 2 3 3 4 ……

Improvement-1 ß gcaca $ a$ aca $ 1 $ c a 2 3 ca $ 4 $ 5 6

Improvement-2 Ø Recognizing similar paths w <e 0, [ 0′, 0′]> <e 0, [ 0, 0]> u w 1 <e 1, [ 1′, 1′]> …… <e 1, [ 1, 1)]> u 1 …… v [ 0, 0] [ 0′, 0′] [ 1, 1] [ 1′, 1′] …… 23

Pattern Partition Ø Ø As k increases, the performance of our algorithm degrades. Partition a pattern to get subpatterns with smaller k values. Quickly find all those substrings in a target, which match a subpattern with smaller k differences. For each surviving substring, recheck it to see whether it is an occurrence of the pattern, but with k differences. 24

Pattern Partition Ø Two-step method: Filtering and Exact matching Filtering In the first step, we partition the pattern x = x 1 … xm evenly into l segments, denoted as x = P 1 P 2. . . P l with each P i = x(i-1)r+1. . . Xir for 1 I l - 1, and P l = x(l-1)r+1 … xm, where r = m/l. Then, we check each P i against BWT (y) with k' = k/l differences in turn to find all the occurrences of P i (1 i l) in y. Each occurrence is represented by (i, j), indicating that P i matches a segment ending at position with k' differences. 25 j in y

Pattern Partition Ø Two-step method • Exact checking In the second step, for each occurrence ( i, j) found in the first step, the substring of the target: si, j = yj-ir+1 -k. . . yj-ir+1+m+k will be again closely checked against x with k differences by using a classical algorithm. The length of si, j is m + 2 k. 26

Pattern Partition Ø Illustration for pattern partition target string y m + 2 k j–I r+ 1 +k pattern string x m/l j–I r+m+k … r r … Pi 27

Experiments Ø 1. 2. 3. 4. 5. 6. 7. In our experiments, we have tested altogether 7 strategies: Ukkonen's onlline method (u-o for short, [57]), Chang-Lawer’s first method (ch-1 for short, [14]), Chang-Lawer's second method (ch-2 for short, [14]), Ukkonen's index-based method (u-i for short, [58]), Myers's index-based method (m-i for short, [44]), Peri-Culpepper's index-based method (pc-i for short, [49]), and ours, discussed in this paper. 28

Experiments Ø 1. 2. Test bed All codes are written in C++ and compiled by GNU g++ compiler version 5. 4. 0 with compiler option `-O 2’. All tests run on a 64 -bit Ubuntu OS with a single core of Intel Xeon E 5 -2637 @3. 50 Ghz. The system memory is of 64 GB. 29

Experiments Ø 1. 2. Test bed For time measurements, we used the Unix time commands. In addition, the suffix trees for patterns (in the Chang-Lawler's method and ours), as well as for reference sequences (in the Ukkonen's index-based method) are constructed by using the algorithm described in [59]. To construct the suffix arrays and the BWT-arrays, we used a code found in the libdivsufsort library (https: //github. com/ Y 256/libdivsufsort ) 30

Experiments Ø Data Reference sequences* Gorilla Danio Rerio (Zebra. Fish) Gorilla Chr 1 Protein-2 Num. characters 3, 063, 403, 506 1, 373, 472, 378 228, 908, 641 144, 000 30, 000 Time (s) for building BWT(y) 406. 817 173. 142 25. 03 15. 92 3. 04 *The first three are genome sequences while the last two are protein sequences. For genomes, | | = 4. For protein sequences, | | = 20. 31

Experiments Ø Experiments on the string matching with small number of differences - Pattern length = 100 characters Gorilla genome Zebra. Fish genome 32 Gorilla Chr 1

Experiments Number of nodes in T (Gorilla) k 1 2 3 4 5 6 7 |T| 1. 4 k 25 k 278. 5 k 2 M 10 M 39. 72 M 92 M 33

Experiments Protein 1 Protein 2

Experiments Ø Experiments on the string matching with large number of differences (for which the two-step method is used. ) - Pattern length = 300 characters Gorilla genome Zebra. Fish genome 35

Experiments Two-step execution details on Gorilla genome k 20 25 30 35 40 Step-1 23. 1 s 173. 2 s 172. 9 s 1187. 0 s 1187. 5 s Step-2 97. 41 s 122. 1 s 263. 4 s 311. 7 s 492. 32 s 565. 58 s num. of surviving segments 30. 5 k 52. 9 k 52. 7 k 69. 5 k 69. 3 k Size of a segment 353 364 388 399 428 439 45 Two-step execution details on Zebra. Fish genome k 20 25 30 35 40 45 Step-1 58. 73 s 58. 57 s 187. 5 s 187. 6 s 953. 0 s 952. 4 s Step-2 18. 41 s 21. 48 s 32. 24 s 38. 85 s 60. 33 s 60. 70 s num. of surviving segments 3638 3633 4709 4702 6376 6365 Size of a segment 423 444 455 463 474 36

Experiments Ø Experiments on the string matching with large number of differences (for which the two-step method is used. ) - Pattern length = 300 characters Gorilla Chr 1 Protein 1 37 Protein 2

Experiments Ø Experiments on number of subpatterns - Pattern length = 300 characters Gorilla genome Zebra. Fish genome l: the number of subpatterns 38 Gorilla Chr 1

Experiments Number of segments checked in Step-2 by the pattern partition K 20 25 30 35 40 45 l =10 30. 5 k 52. 9 k 52. 7 k 69. 5 k 69 k l = 12 28. 7 k 56. 2 k 56. 0 k 55. 8 k 60. 5 k 61. 4 k l = 15 30. 5 k 52. 9 k 52. 7 k 69. 5 k 69. 3 k 39

Experiments Ø Experiments on different lengths of patterns Gorilla genome Zebra. Fish genome 40 Gorilla Chr 1

Experiments Ø Experiments on different lengths of patterns Protein 1 Protein 2 41

Conclusion Ø Main contribution - An algorithm for the string matching with differences q q q k Combination of dynamic programming and BWT indexes for the problem of string matching with k difference Concept of search trees and two branch cutting methods Pattern partition - Extensive tests Ø Future work - String matching with don’t care symbols (using BWT transformation as indexes) 42

Thank you! 43