String Matching with k Mismatches and k differences

k-mismatch slides are from Moshe Lewenstein

String Matching with k Mismatches
  • Landau – Vishkin 1986
  • Galil – Giancarlo 1986
  • Abrahamson 1987
  • Amir – Lewenstein – Porat 2000

Exact String Matching
Input: T = t1 … tn, P = p1 … pm
Output: All locations i of T where P appears
Example: P= T= ABCAAB A B C A A B A A…
Answer: {3, 7, 11, …}
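The problem statement above can be sketched in a few lines of Python. This is only an illustration of the definition, not an efficient matcher. The function name `exact_occurrences` is made up for this example, and positions are reported 1-based to match the slides.

```python
def exact_occurrences(pattern, text):
    """Return all 1-based positions i where pattern appears in text."""
    m, n = len(pattern), len(text)
    # Compare the pattern against every length-m window of the text.
    return [i + 1 for i in range(n - m + 1) if text[i:i + m] == pattern]


if __name__ == "__main__":
    # Hypothetical inputs chosen for this sketch (not the slide's example).
    print(exact_occurrences("ab", "abab"))
```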

Exact String Matching
Problem: Matching is not exact:
  • Sequencing errors
  • Natural genetic variations
  • Etc.
Need other definitions of string matching!

Approximate String Matching
Idea: Find all text locations where the distance from the pattern is sufficiently small.
Distance metric: Let S = s1 s2 … sm, R = r1 r2 … rm
HAMMING DISTANCE: Ham(S, R) = the number of locations j where sj ≠ rj
Example: S = ABCABC, R = ABBAAC, Ham(S, R) = 2
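The Hamming distance definition translates directly into code. A minimal sketch follows; the function name `hamming` is chosen here for illustration. It reuses the slide's example strings S and R.

```python
def hamming(s, r):
    """Number of positions j where s[j] != r[j]; both strings must have equal length."""
    if len(s) != len(r):
        raise ValueError("Hamming distance is defined for equal-length strings")
    return sum(1 for a, b in zip(s, r) if a != b)


if __name__ == "__main__":
    # Slide example: positions 3 and 5 differ.
    print(hamming("ABCABC", "ABBAAC"))  # 2
```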

String Matching with Mismatches
Input: T = t1 … tn, P = p1 … pm
Output: For each i in T, Ham(P, ti ti+1 … ti+m-1)
Example: P= T= ABBAAC A B C A C…
Ham(P, T1) = 2, Ham(P, T2) = 4, Ham(P, T3) = 6, Ham(P, T4) = 2
Answer: 2, 4, 6, 2, …

String Matching with k Mismatches
Input: T = t1 … tn, P = p1 … pm
Output: Every i in T s.t. Ham(P, ti ti+1 … ti+m-1) ≤ k
Example: k = 2, P= T= ABBAAC A B C A C…
Hamming distances: 2, 4, 6, 2, …
Answer: Y, N, N, Y, …

Naïve Algorithm (for counting mismatches or k-mismatches problem)
- Go to each location i of the text and compute the Hamming distance of P and Ti
Running Time: O(nm), where n = |T|, m = |P|
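A minimal sketch of the naïve algorithm, covering both the counting variant and the k-mismatches variant. The function names and the text used in the demo are assumptions made for this example. Only the pattern ABBAAC is borrowed from the slides.

```python
def mismatch_counts(pattern, text):
    """Hamming distance between the pattern and every length-m window of the text: O(nm)."""
    m, n = len(pattern), len(text)
    counts = []
    for i in range(n - m + 1):
        d = 0
        for j in range(m):
            if pattern[j] != text[i + j]:
                d += 1
        counts.append(d)
    return counts


def k_mismatch_positions(pattern, text, k):
    """1-based text locations whose window has at most k mismatches with the pattern."""
    return [i + 1 for i, d in enumerate(mismatch_counts(pattern, text)) if d <= k]


if __name__ == "__main__":
    p, t = "ABBAAC", "ABCAABBAAC"  # t is a hypothetical text for this sketch
    print(mismatch_counts(p, t))
    print(k_mismatch_positions(p, t, 2))
```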

The Kangaroo Method (for k-mismatches)
  • Landau – Vishkin 1986
  • Galil – Giancarlo 1986

Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:
{ $, b$, ab$, bab$, abab$ }
[Figure: suffix tree of abab$]

Suffix Tree properties
- Succinct in space - O(n).
- Can be built in O(n) time (McCreight, Weiner, Ukkonen, Farach-Colton).
[Figure: suffix tree of abab$ with leaves labeled 1 to 5]

Exact string matching
s = abab$
Given a pattern P = ab we traverse the tree according to the pattern.
Leaves correspond to locations of appearance! (Here: 1 and 3.)
Prepare tree: O(n) time
Find matches: O(m + occ) time, where occ = # of matches
[Figure: suffix tree of abab$ with the path for P = ab highlighted]
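The traversal described above can be illustrated with a simplified structure. The sketch below builds an uncompressed suffix trie rather than a true suffix tree, so construction takes O(n²) time and space instead of the O(n) stated on the slide. The search step is the same idea: walk down along the pattern, then collect the leaf labels underneath. All names are assumptions for this example.

```python
def build_suffix_trie(s):
    """Uncompressed suffix trie of s (s should end with a unique terminator such as '$').

    Each node is a dict mapping a character to a child node. The key None
    marks a leaf and stores the 1-based start position of that suffix.
    """
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root


def find_occurrences(root, pattern):
    """Walk down along the pattern, then gather every leaf label below that node."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key is None:
                found.append(child)  # leaf label = location of appearance
            else:
                stack.append(child)
    return sorted(found)


if __name__ == "__main__":
    trie = build_suffix_trie("abab$")
    print(find_occurrences(trie, "ab"))  # [1, 3], as on the slide
```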

Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.

Why?
The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes.
[Figure: suffix tree of s = abbaab$ with leaves labeled 1 to 7]

LCA/LCP properties
Preprocessing time: O(n)
Query time: O(1)
Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993
[Figure: suffix tree of abbaab$ with leaves labeled 1 to 7]
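To make the LCP queries concrete, here is a stand-in that answers them in O(1) time after O(n²) preprocessing with a simple table. This is not the suffix tree plus LCA construction cited above, which achieves O(n) preprocessing. It only shows what a query returns. The names `lcp_table` and `lcp` are assumptions for this sketch, and suffixes are numbered 1-based as on the slides.

```python
def lcp_table(s):
    """table[i][j] = length of the longest common prefix of s[i:] and s[j:] (0-based)."""
    n = len(s)
    table = [[0] * (n + 1) for _ in range(n + 1)]
    # Fill from the end: a match at (i, j) extends the LCP of (i + 1, j + 1).
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if s[i] == s[j]:
                table[i][j] = table[i + 1][j + 1] + 1
    return table


def lcp(table, i, j):
    """O(1) query for the LCP of the suffixes starting at 1-based positions i and j."""
    return table[i - 1][j - 1]


if __name__ == "__main__":
    t = lcp_table("abbaab$")
    print(lcp(t, 1, 5))  # suffixes abbaab$ and ab$ share "ab"
```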

The Kangaroo Method (for k-mismatches)
- Create suffix tree for: s = P#T
- Check P at each location i of T by kangarooing
Example: P = ABABAABACAB, T = ABBACABABABCABCA…

The Kangaroo Method (for k-mismatches)
Preprocess:
- Build suffix tree of both P and T - O(n+m) time
- LCA preprocessing - O(n+m) time
Check P at given text location:
- Kangaroo jump till next mismatch, at most k+1 jumps per location - O(k) time
Overall time: O(nk)
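The jumping step can be sketched as follows. To keep the example short, the O(1) LCP oracle is a precomputed table between suffixes of P and suffixes of T, which costs O(nm) to build. That table is a stand-in assumed for this sketch. The method on the slides obtains the same queries from a suffix tree of P#T with LCA preprocessing in O(n+m) time. Function names and the demo text are hypothetical.

```python
def pattern_text_lcp(p, t):
    """table[j][i] = longest common prefix of p[j:] and t[i:] (stand-in LCP oracle)."""
    m, n = len(p), len(t)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m - 1, -1, -1):
        for i in range(n - 1, -1, -1):
            if p[j] == t[i]:
                table[j][i] = table[j + 1][i + 1] + 1
    return table


def kangaroo_k_mismatch(p, t, k):
    """1-based locations i of t where p matches with at most k mismatches."""
    m, n = len(p), len(t)
    lcp = pattern_text_lcp(p, t)
    hits = []
    for i in range(n - m + 1):
        j = 0            # current offset inside the pattern
        mismatches = 0
        while True:
            j += lcp[j][i + j]   # kangaroo jump over the matching stretch
            if j >= m:           # reached the end of the pattern
                break
            mismatches += 1      # position j is a mismatch
            if mismatches > k:   # give up after k + 1 mismatches
                break
            j += 1               # step past the mismatch and jump again
        if mismatches <= k:
            hits.append(i + 1)
    return hits


if __name__ == "__main__":
    print(kangaroo_k_mismatch("ABBAAC", "ABCAABBAAC", 2))
```

Each location performs at most k+1 jumps, which is where the O(k) per-location bound comes from once the LCP queries themselves cost O(1).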

K-difference problem
Definition: Given strings S1 and S2 and a fixed number k, the k-difference global alignment problem is to find the best global alignment of S1 and S2 containing at most k mismatches and spaces (if one exists).
vs. k-mismatch problem: allows insertions and deletions
A special type of edit distance problem

O(km) Bounded dynamic programming.
1. Define the main diagonal: cells (i, i) for i ≤ n ≤ m.
2. A k-difference global alignment must not contain cells (i, i+L) or (i, i-L) where L > k.
Suppose the lengths of the two inputs are m and n ≤ m. Then m - n ≤ k is necessary for finding a solution.
Picture from Dan Gusfield's book "Algorithms on Strings, Trees, and Sequences".
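The band restriction above can be sketched as an edit-distance computation that only fills cells (i, j) with |i - j| ≤ k. It assumes unit cost for each mismatch and each space, and it returns only the number of differences rather than the alignment itself. The function name `k_difference` and the demo strings are hypothetical.

```python
def k_difference(s1, s2, k):
    """Smallest number of mismatches and spaces in a global alignment, or None if it exceeds k.

    Only cells within k of the main diagonal are filled, so the work is
    O(k * len(s1)) instead of O(len(s1) * len(s2)).
    """
    n, m = len(s1), len(s2)
    if abs(n - m) > k:           # necessary condition from the slide
        return None
    inf = k + 1                  # any value above k means "no solution through here"
    prev = {j: j for j in range(min(m, k) + 1)}   # row 0: j leading spaces
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i       # column 0: i leading spaces
                continue
            best = prev.get(j - 1, inf) + (s1[i - 1] != s2[j - 1])  # match or mismatch
            best = min(best,
                       prev.get(j, inf) + 1,      # space in s2
                       cur.get(j - 1, inf) + 1)   # space in s1
            cur[j] = min(best, inf)
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None


if __name__ == "__main__":
    print(k_difference("kitten", "sitting", 3))
    print(k_difference("kitten", "sitting", 2))
```

Cells outside the band are treated as unreachable because any alignment passing through (i, j) needs at least |i - j| spaces.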