Bioinformatics Algorithms and Data Structures Chapter 2 KMP

Bioinformatics Algorithms and Data Structures
Chapter 2: KMP Algorithm
Lecturer: Dr. Rose
Slides by: Dr. Rose
January 28, 2003
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology

KMP Algorithm
• Preliminaries:
  – KMP can be easily explained in terms of finite state machines.
  – KMP has an easily proved linear time bound.
  – KMP is usually not the method of choice.

KMP Algorithm
• Recall that the naïve approach to string matching is Θ(mn) in the worst case, where n = |P| and m = |T|.
• How can we reduce this complexity?
  – Avoid redundant comparisons
  – Use larger shifts
    • Boyer-Moore good suffix rule
    • Boyer-Moore extended bad character rule
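
For reference, the naïve method recalled above can be sketched in Python. This is a minimal illustration and not part of the original slides: the name naive_search and the comparison counter are assumptions added here to make the redundant work visible.

```python
def naive_search(pattern, text):
    """Try every alignment of pattern against text.

    Returns (occurrences, comparisons), where occurrences holds 1-based
    start positions and comparisons counts character comparisons made.
    """
    n, m = len(pattern), len(text)
    occurrences = []
    comparisons = 0
    for start in range(m - n + 1):          # one pass per alignment, shift by 1
        k = 0
        while k < n:
            comparisons += 1
            if pattern[k] != text[start + k]:
                break
            k += 1
        if k == n:                          # every character matched
            occurrences.append(start + 1)
    return occurrences, comparisons


print(naive_search("abc", "xabcabc")[0])    # [2, 5]
print(naive_search("aab", "aaaaaa"))        # ([], 12)
```

With P = aab and T = aaaaaa, each of the four alignments spends three comparisons before failing. KMP avoids re-reading those matched characters.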

KMP Algorithm
• KMP finds larger shifts by recognizing patterns in P.
  – Let sp_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
  – By definition sp_1 = 0 for any string.
  – Q: Why does this make sense?
  – A: The only proper suffix of a single character is the empty string.

KMP Algorithm
• Example: P = abcaeabcabd
  – P[1..2] = ab, hence sp_2 = 0
  – P[1..3] = abc, hence sp_3 = 0
  – P[1..4] = abca, hence sp_4 = 1
  – P[1..5] = abcae, hence sp_5 = 0
  – P[1..6] = abcaea, hence sp_6 = 1

KMP Algorithm
• Example Continued
  – P[1..7] = abcaeab, hence sp_7 = 2
  – P[1..8] = abcaeabc, hence sp_8 = 3
  – P[1..9] = abcaeabca, hence sp_9 = 4
  – P[1..10] = abcaeabcab, hence sp_10 = 2
  – P[1..11] = abcaeabcabd, hence sp_11 = 0
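
The values in this example can be checked by applying the definition of sp_i directly. The Python sketch below is a brute-force check only, not the linear-time method. The function name sp_values is an assumption introduced here.

```python
def sp_values(pattern):
    """Return [sp_1, ..., sp_n] straight from the definition (brute force)."""
    n = len(pattern)
    values = []
    for i in range(1, n + 1):
        best = 0
        # A proper suffix of P[1..i] has length strictly less than i.
        for length in range(1, i):
            if pattern[i - length:i] == pattern[:length]:
                best = length           # lengths increase, so the last hit is the longest
        values.append(best)
    return values


print(sp_values("abcaeabcabd"))  # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
```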

KMP Algorithm
• Like the a/a concept for Boyer-Moore, there is an analogous sp_i/sp'_i concept.
• Let sp'_i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp'_i + 1) are unequal.
• Example: P = abcdabce, sp'_7 = 3 (the suffix abc of P[1..7] matches the prefix abc, and P(8) = e differs from P(4) = d).
• Obviously sp'_i(P) ≤ sp_i(P), since the latter is less restrictive.
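
The stronger sp' definition can also be checked by brute force. In the Python sketch below, the name sp_prime_values is an assumption. So is the treatment of i = n: since P(n + 1) does not exist, the extra condition is taken to be satisfied there.

```python
def sp_prime_values(pattern):
    """Return [sp'_1, ..., sp'_n] straight from the definition (brute force)."""
    n = len(pattern)
    values = []
    for i in range(1, n + 1):
        best = 0
        for length in range(1, i):
            border = pattern[i - length:i] == pattern[:length]
            # P(i + 1) is pattern[i] and P(length + 1) is pattern[length] (0-based).
            # Assumption: when i == n there is no P(i + 1), so the condition holds.
            differs = i == n or pattern[i] != pattern[length]
            if border and differs:
                best = length
        values.append(best)
    return values


print(sp_prime_values("abcdabce"))  # [0, 0, 0, 0, 0, 0, 3, 0]
```

Positions 5 and 6 show the difference from sp: the borders a and ab exist there, but the character that follows is the same in both places, so sp' is 0.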

KMP Algorithm
• KMP Shift Rule:
  1. Mismatch case:
     • Let position i + 1 in P and position k in T be the first mismatch in a left-to-right scan.
     • Shift P to the right, aligning P[1..sp'_i] with T[k - sp'_i..k - 1].
  2. Match case:
     • If no mismatch is found, an occurrence of P has been found.
     • Shift P by n - sp'_n places to continue searching for other occurrences.

KMP Algorithm
• Observations:
  – The prefix P[1..sp'_i] of the shifted P is aligned with a substring of T that it is already known to match.
  – Subsequent character matching proceeds from position sp'_i + 1 of P.
  – Unlike Boyer-Moore, the matched substring is not compared again.
  – The shift rule based on sp'_i guarantees that the exact same mismatch won't occur at sp'_i + 1, but doesn't guarantee that P(sp'_i + 1) = T(k).

KMP Algorithm
• Example: P = abcxabcde
  – If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
  – Q: Where did the 4-position shift come from?
  – A: The number of positions is given by i - sp'_i; in this example i = 7 and sp'_7 = 3, so 7 - 3 = 4.
  – Notice that we know the amount of shift without knowing anything about T other than that there was a mismatch at position 8 of P.

KMP Algorithm
• Example Continued: P = abcxabcde
  – After the shift, P[1..3] lines up with T[k-3..k-1].
  – Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
  – The scan continues from P(4) and T(k).
• Advantages of KMP Shift Rule
  1. P is often shifted by more than 1 character (i - sp'_i).
  2. The left-most sp'_i characters in the shifted P are known to match the corresponding characters in T.
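
The shift amount i - sp'_i in this example can be reproduced with a short Python sketch. The helper names sp_prime and shift_after_mismatch are assumptions, and sp'_i is computed by brute force purely for illustration. A mismatch at position 1 of P is treated as a shift of one place, which is also an assumption of this sketch.

```python
def sp_prime(pattern, i):
    """Brute-force sp'_i for a 1-based position i of the pattern."""
    n = len(pattern)
    best = 0
    for length in range(1, i):
        border = pattern[i - length:i] == pattern[:length]
        differs = i == n or pattern[i] != pattern[length]
        if border and differs:
            best = length
    return best


def shift_after_mismatch(pattern, mismatch_pos):
    """Shift applied when the first mismatch is at 1-based position i + 1 of P."""
    i = mismatch_pos - 1
    if i == 0:
        return 1                      # nothing matched: move P one place
    return i - sp_prime(pattern, i)


print(sp_prime("abcxabcde", 7))              # 3
print(shift_after_mismatch("abcxabcde", 8))  # 4
```

Note that the text T is not an argument: the shift depends only on P and on where the mismatch occurred.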

KMP Algorithm
Full Example: T = xyabcxabcxabcdefeg, P = abcxabcde
Assume that we have already shifted past the first two positions in T.

xyabcxabcxabcdefeg
  abcxabcde
  ^^^^^^^^
  12345678

d != x at position 8: shift 4 places and start again from position 4.

Preprocessing for KMP
Approach: show how to derive sp' values from Z values.
Definition: Position j > 1 maps to i if i = j + Z_j(P) - 1
  – Recall that Z_j(P) denotes the length of the Z-box starting at position j.
  – This says that j maps to i if i is the right end of a Z-box starting at j.
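
To make the "maps to" definition concrete, the Python sketch below computes Z values with the standard linear-time Z algorithm and lists which positions map where. The names z_values and maps_to, and the choice to store values 1-based with unused slots 0 and 1, are assumptions of this sketch. Only non-empty Z-boxes are listed.

```python
def z_values(pattern):
    """Return a list z with z[j] = Z_j(P) for 1-based j = 2..n.

    z[0] and z[1] are unused placeholders so indices match the slides.
    """
    n = len(pattern)
    z = [0] * (n + 1)
    left = right = 0                    # right-most Z-box found so far (1-based)
    for j in range(2, n + 1):
        if j <= right:                  # j lies inside a known Z-box: reuse work
            z[j] = min(right - j + 1, z[j - left + 1])
        # Extend the match explicitly: compare P(z[j] + 1) with P(j + z[j]).
        while j + z[j] <= n and pattern[z[j]] == pattern[j + z[j] - 1]:
            z[j] += 1
        if j + z[j] - 1 > right:
            left, right = j, j + z[j] - 1
    return z


def maps_to(pattern):
    """Return {j: i} where i = j + Z_j - 1, for every j with a non-empty Z-box."""
    z = z_values(pattern)
    return {j: j + z[j] - 1 for j in range(2, len(pattern) + 1) if z[j] > 0}


print(maps_to("abcxabcde"))  # {5: 7}
```

For P = abcxabcde the only non-empty Z-box starts at position 5 and has length 3, so position 5 maps to position 7.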

Preprocessing for KMP
Theorem. For any i > 1, sp'_i(P) = Z_j = i - j + 1, where j > 1 is the smallest position that maps to i. If there is no such j, then sp'_i(P) = 0.
Similarly for sp: for any i > 1, sp_i(P) = i - j + 1, where j, i ≥ j > 1, is the smallest position that maps to i or beyond. If there is no such j, then sp_i(P) = 0.

Preprocessing for KMP
Given the theorem from the preceding slide, the sp'_i and sp_i values can be computed in linear time using the Z_j values:

For i = 1 to n { sp'_i = 0; }
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    sp'_i = Z_j;
}
sp_n(P) = sp'_n(P);
For i = n - 1 downto 2 {
    sp_i(P) = max[sp_{i+1}(P) - 1, sp'_i(P)];
}
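
The two loops above translate almost line for line into Python. The sketch below is one possible rendering under stated assumptions: the names z_values and sp_tables are introduced here, and both tables are stored 1-based with index 0 holding the conventional value 0.

```python
def z_values(pattern):
    """z[j] = Z_j(P) for 1-based j = 2..n; z[0] and z[1] are unused."""
    n = len(pattern)
    z = [0] * (n + 1)
    left = right = 0
    for j in range(2, n + 1):
        if j <= right:
            z[j] = min(right - j + 1, z[j - left + 1])
        while j + z[j] <= n and pattern[z[j]] == pattern[j + z[j] - 1]:
            z[j] += 1
        if j + z[j] - 1 > right:
            left, right = j, j + z[j] - 1
    return z


def sp_tables(pattern):
    """Return (sp_prime, sp), both 1-based lists of length n + 1."""
    n = len(pattern)
    z = z_values(pattern)
    sp_prime = [0] * (n + 1)
    # Going from right to left, the smallest j that maps to i writes last.
    for j in range(n, 1, -1):
        i = j + z[j] - 1
        sp_prime[i] = z[j]
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime, sp


sp_prime, sp = sp_tables("abcaeabcabd")
print(sp[1:])        # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 0]
print(sp_prime[1:])  # [0, 0, 0, 1, 0, 0, 0, 0, 4, 2, 0]
```

The sp values agree with the worked example for P = abcaeabcabd earlier in the chapter.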

Preprocessing for KMP
Defn. Failure function F'(i) = sp'_{i-1} + 1, 1 ≤ i ≤ n + 1, with sp'_0 = 0
(similarly F(i) = sp_{i-1} + 1, 1 ≤ i ≤ n + 1, with sp_0 = 0)
• Idea:
  – We maintain a pointer p in P and c in T.
  – After a mismatch at P(i + 1) with T(c), shift P to align P(sp'_i + 1) with T(c), i.e., set p = F'(i + 1) = sp'_i + 1.
  – Special case 1: the mismatch is at p = 1; set p = F'(1) = 1 and c = c + 1.
  – Special case 2: we find P in T; shift n - sp'_n places, i.e., set p = F'(n + 1) = sp'_n + 1.
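
As a small illustration of the definition, the Python sketch below builds F' from a given list of sp' values. The function name failure_function and the list layout are assumptions. The sp' values used are those of P = abcxabcde, where only sp'_7 = 3 is non-zero.

```python
def failure_function(sp_prime):
    """Given [sp'_1, ..., sp'_n], return [F'(1), ..., F'(n + 1)]."""
    padded = [0] + list(sp_prime)       # padded[k] = sp'_k, with sp'_0 = 0
    return [padded[k - 1] + 1 for k in range(1, len(padded) + 1)]


f_prime = failure_function([0, 0, 0, 0, 0, 0, 3, 0, 0])
print(f_prime)         # [1, 1, 1, 1, 1, 1, 1, 4, 1, 1]
print(f_prime[8 - 1])  # F'(8) = 4
```

F'(8) = 4 says that after a mismatch at position 8 of P, comparison resumes at position 4 of P against the same text character.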

Full KMP Algorithm
Preprocess P to find F'(k) = sp'_{k-1} + 1 for k from 1 to n + 1

c = 1; p = 1;
While c + (n - p) ≤ m {
    While p ≤ n and P(p) = T(c) {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c - n of T.
    If (p = 1) then c = c + 1;
    p = F'(p);
}
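
The pseudocode above can be turned into a runnable Python sketch. This is one possible rendering, not the lecturer's code: the function names are assumptions, F' is derived from Z values as in the preprocessing slides, and the text string in the usage lines is an illustrative assumption chosen so that a mismatch occurs at position 8 of P.

```python
def z_values(pattern):
    """z[j] = Z_j(P) for 1-based j = 2..n; z[0] and z[1] are unused."""
    n = len(pattern)
    z = [0] * (n + 1)
    left = right = 0
    for j in range(2, n + 1):
        if j <= right:
            z[j] = min(right - j + 1, z[j - left + 1])
        while j + z[j] <= n and pattern[z[j]] == pattern[j + z[j] - 1]:
            z[j] += 1
        if j + z[j] - 1 > right:
            left, right = j, j + z[j] - 1
    return z


def failure_function(pattern):
    """Return f with f[k] = F'(k) = sp'_{k-1} + 1 for k = 1..n + 1 (f[0] unused)."""
    n = len(pattern)
    z = z_values(pattern)
    sp_prime = [0] * (n + 1)            # sp_prime[0] = 0 by convention
    for j in range(n, 1, -1):
        sp_prime[j + z[j] - 1] = z[j]
    return [0] + [sp_prime[k - 1] + 1 for k in range(1, n + 2)]


def kmp_search(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    f = failure_function(pattern)
    occurrences = []
    c = p = 1                           # c points into T, p points into P
    while c + (n - p) <= m:
        # Test p <= n first so P(p) is never read past the end of P.
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            occurrences.append(c - n)
        if p == 1:
            c += 1
        p = f[p]
    return occurrences


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))  # [7]
print(kmp_search("aa", "aaaa"))                       # [1, 2, 3]
```

The pointer c never moves backwards, which is where the linear bound comes from. The second call shows that overlapping occurrences are reported as well.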

Full KMP Algorithm
Trace with T = xyabcxabcxabcdefeg, P = abcxabcde: c = 1, p = 1

c = 1; p = 1;
While c + (n - p) ≤ m {
    While p ≤ n and P(p) = T(c) {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then                      ← p != n + 1
        report an occurrence of P at position c - n of T.
    If (p = 1) then c = c + 1;               ← p = 1, so c = 2
    p = F'(p);                               ← p = F'(1) = 1
}

xyabcxabcxabcdefeg
abcxabcde
^
1   a != x

Full KMP Algorithm
Trace with T = xyabcxabcxabcdefeg, P = abcxabcde: c = 2, p = 1

c = 1; p = 1;
While c + (n - p) ≤ m {
    While p ≤ n and P(p) = T(c) {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then                      ← p != n + 1
        report an occurrence of P at position c - n of T.
    If (p = 1) then c = c + 1;               ← p = 1, so c = 3
    p = F'(p);                               ← p = F'(1) = 1
}

xyabcxabcxabcdefeg
 abcxabcde
 ^
 1   a != y

Full KMP Algorithm
Trace with T = xyabcxabcxabcdefeg, P = abcxabcde: c = 3, p = 1

c = 1; p = 1;
While c + (n - p) ≤ m {
    While p ≤ n and P(p) = T(c) {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then                      ← p != n + 1
        report an occurrence of P at position c - n of T.
    If (p = 1) then c = c + 1;               ← p = 8, so don't change c
    p = F'(p);                               ← p = F'(8) = 4
}

xyabcxabcxabcdefeg
  abcxabcde
  ^^^^^^^^
  12345678   d != x

Full KMP Algorithm
Trace with T = xyabcxabcxabcdefeg, P = abcxabcde: c = 10, p = 4

c = 1; p = 1;
While c + (n - p) ≤ m {                      ← p = 4, c = 10
    While p ≤ n and P(p) = T(c) {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then                      ← p = n + 1
        report an occurrence of P at position c - n of T.
    If (p = 1) then c = c + 1;
    p = F'(p);
}

xyabcxabcxabcdefeg
      abcxabcde
         ^^^^^^
         456789

Real-Time KMP
• Q: What is meant by real-time algorithms?
• A: Typically these are algorithms that are meant to interact synchronously with the real world.
  – This implies a known, fixed turn-around time for processing a task.
  – Many embedded scheduling systems are examples involving real-time algorithms.
  – For KMP this means that we require a constant bound on the time spent processing each character of T.

Real-Time KMP
• Q: Why is KMP not real-time?
• A: For any mismatched character in T, we may try matching it several times.
  – Recall that sp'_i only guarantees that P(i + 1) and P(sp'_i + 1) differ.
  – There is NO guarantee that P(sp'_i + 1) and T(k) match.
• We need to ensure that a mismatch at T(k) does NOT entail additional match attempts at T(k).
• This means that we have to compute sp'_i values with respect to all characters in Σ, since any could appear in T.

Real-Time KMP
• Define: sp'_(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp'_(i,x) + 1) is x.
• This will tell us exactly what shift to use for each possible mismatch.
• A mismatched character T(k) will never be involved in subsequent comparisons.

Real-Time KMP
• Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?
• A: Because the shift will move P so that either the matching character aligns with T(k) or P is shifted past T(k).
• This results in a real-time version of KMP.
• Let's consider how we can find the sp'_(i,x)(P) values in linear time.

Real-Time KMP
Thm. For P(i + 1) ≠ x, sp'_(i,x)(P) = i - j + 1
  – Here j is the smallest position such that j maps to i and P(Z_j + 1) = x.
  – If there is no such j then sp'_(i,x)(P) = 0.

For i = 1 to n { sp'_(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    x = P(Z_j + 1);
    sp'_(i,x) = Z_j;
}

Real-Time KMP
For i = 1 to n { sp'_(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Z_j(P) - 1;
    x = P(Z_j + 1);
    sp'_(i,x) = Z_j;
}
• Notice how this works:
  – Starting from the right:
    • Find i, the right end of the Z-box associated with j.
    • Find x, the character immediately following the prefix corresponding to this Z-box.
    • Set sp'_(i,x) = Z_j, the length of this Z-box.
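
The loop above can be sketched in Python with a dictionary keyed by (i, x). The names z_values and sp_prime_by_char are assumptions of this sketch. Two further assumptions are labeled in the code: missing keys stand for the default value 0, and empty Z-boxes are skipped because they would only write that default.

```python
def z_values(pattern):
    """z[j] = Z_j(P) for 1-based j = 2..n; z[0] and z[1] are unused."""
    n = len(pattern)
    z = [0] * (n + 1)
    left = right = 0
    for j in range(2, n + 1):
        if j <= right:
            z[j] = min(right - j + 1, z[j - left + 1])
        while j + z[j] <= n and pattern[z[j]] == pattern[j + z[j] - 1]:
            z[j] += 1
        if j + z[j] - 1 > right:
            left, right = j, j + z[j] - 1
    return z


def sp_prime_by_char(pattern):
    """Return {(i, x): sp'_(i,x)} for the non-zero entries only.

    Assumption: a missing key means 0, so look values up with
    table.get((i, x), 0).
    """
    n = len(pattern)
    z = z_values(pattern)
    table = {}
    for j in range(n, 1, -1):           # right to left: the smallest j writes last
        if z[j] == 0:
            continue                    # assumption: skip empty Z-boxes (value 0)
        i = j + z[j] - 1                # right end of the Z-box starting at j
        x = pattern[z[j]]               # P(Z_j + 1): character after the prefix
        table[(i, x)] = z[j]
    return table


table = sp_prime_by_char("abcxabcde")
print(table)                    # {(7, 'x'): 3}
print(table.get((7, "y"), 0))   # 0
```

For P = abcxabcde, a mismatch at position 8 against the text character x gives sp'_(7,x) = 3, so after the shift P(4) = x is already known to match that character. Any other mismatching character gives 0.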