Applications of suffix trees
Maximal Repeat
• S = xabcyiixabczdcqabczrxar
Observation: Since the characters to the left and to the right of the two occurrences are different, α cannot be extended in either direction and thus is maximal for that specific pair of occurrences.
Examples
Example 1: S = xabcyiiizabcqabcyrxar
Max. repeats: "abc", "abcy"
Example 2: S = xabcyiiizabcyrxar
Max. repeats: "abcy"
Example 3: S = yabababx
Max. repeats: "abab", "ab". Indexes 2, 6 certify "ab".
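The notion of a certificate in Example 3 can be checked by brute force. The sketch below is only an illustration of the definition, not the suffix-tree method: the helper name `certificates` is hypothetical, positions are 1-indexed as on the slides, and a missing neighbor at either end of S is assumed to count as different from every character.

```python
def certificates(s, sub):
    """Return all 1-indexed position pairs (i, j) at which the two
    occurrences of `sub` differ in both their left and right characters."""
    n, k = len(s), len(sub)
    starts = [i for i in range(n - k + 1) if s.startswith(sub, i)]

    def left(i):
        # Character just before the occurrence, or None at the start of s.
        return s[i - 1] if i > 0 else None

    def right(i):
        # Character just after the occurrence, or None at the end of s.
        return s[i + k] if i + k < n else None

    return [
        (i + 1, j + 1)
        for a, i in enumerate(starts)
        for j in starts[a + 1:]
        if left(i) != left(j) and right(i) != right(j)
    ]


print(certificates("yabababx", "ab"))    # pairs certifying "ab"
print(certificates("yabababx", "abab"))  # pairs certifying "abab"
```

A substring is a maximal repeat exactly when this list is non-empty; the scan is quadratic and is meant only for small examples.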
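Characterizing Maximal Repeats
• A node v is left diverse if at least two leaves in v's subtree have different left characters.
• The path-label α of a node v is a maximal repeat iff v is left diverse.
S = #cxabxa$ (positions 1 to 6 are the characters c, x, a, b, x, a; # precedes position 1)
[Figure: suffix tree of S. Leaves from left to right, each with its left character: 4 (a), 1 (#), 6 (x), 3 (x), 5 (b), 2 (c). One internal node is marked "left diverse".]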
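Proof of ⇒ (easy)
Suppose α is a maximal repeat.
• By definition, α has at least two occurrences, xαw and yαz, such that x ≠ y and w ≠ z.
• Since w ≠ z, α is the path-label of a branching node v.
• Let i and j be the starting positions of those two occurrences of α. Then leaves i and j are in v's subtree and have different left characters x and y, so v is left diverse.
[Figure: root, path α down to node v; below v, edges starting with w and z lead to leaves i and j, whose left characters are x and y.]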
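Proof of ⇐ (the two cases)
[Figure: two sketches of a left-diverse node v with path-label α. Case 1 (p ≠ q): the edges below v start with p and q. Case 2 (p = q): the edges below v start with p and r. Leaves are labeled with left characters x, y, z.]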
Proof of ⇐
Suppose v is left diverse and let α be its path-label. Then there are substrings xαp and yαq in S such that x ≠ y. What about p and q?
1) If p ≠ q: α's occurrences in xαp and yαq form a maximal pair, so α is a maximal repeat, certified by those occurrences.
2) If p = q: since v is a branching node, there must be another edge leaving v, and thus another substring zαr in S with r ≠ p.
   If z ≠ x, then zαr and xαp form a certificate pair for a maximal repeat.
   If z ≠ y, then zαr and yαp form a certificate pair for a maximal repeat.
   Since x ≠ y, at least one of these holds, so in either case α is a maximal repeat.
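The characterization proved above can be tried out without building a suffix tree. In the sketch below (function name and sentinel parameters are hypothetical), a substring plays the role of a branching node when its occurrences are followed by at least two different characters, and it is left diverse when they are preceded by at least two different characters. It assumes the sentinels # and $ do not occur in S, and it enumerates all substrings, so it is far from linear time.

```python
from collections import defaultdict


def maximal_repeats_by_diversity(s, left_end="#", right_end="$"):
    """Substrings of s whose occurrences show at least two distinct
    left characters and at least two distinct right characters."""
    padded = left_end + s + right_end
    n = len(s)
    lefts = defaultdict(set)   # substring -> set of left characters
    rights = defaultdict(set)  # substring -> set of right characters
    for start in range(1, n + 1):            # start index inside padded
        for end in range(start + 1, n + 2):  # exclusive end inside padded
            sub = padded[start:end]
            lefts[sub].add(padded[start - 1])
            rights[sub].add(padded[end])
    return {
        sub for sub in lefts
        if len(lefts[sub]) >= 2 and len(rights[sub]) >= 2
    }


print(maximal_repeats_by_diversity("cxabxa"))
print(maximal_repeats_by_diversity("yabababx"))
```

For S = #cxabxa$, the substring "a" fails the test because both of its occurrences have left character x, while "xa" passes with left characters c and b.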
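Question
Example
[Figure: suffix tree with leaves labeled m, m-1, m-2, m-3, ..., 1; edge labels consist of the characters a and $, and a leaf is annotated with the left character #.]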
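Question
S = xabcs...yabcs...zabct...xabct...yabct...zabcs
How many certificates are there in S for the maximal repeat abc?
Answer: All possible pairings of two leaves with distinct left characters, such that one of the leaves is below "abcs" and the other is below "abct". In the figure, the leaves below one branch have left characters x, y, z and the leaves below the other have left characters x, x, y, z, so the count is 2 + 3 + 3 = 8.
[Figure: node "abc" with two child edges, s and t, and leaves labeled with their left characters.]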
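The counting rule in the answer can be sketched as follows. The function name is hypothetical, and the left-character lists (x, y, z under one branch; x, x, y, z under the other) are read off the figure labels, so they are an assumption about the intended example rather than something derived from the elided string.

```python
from itertools import combinations


def count_certificates(branches):
    """Count leaf pairs that lie under different child branches of a node
    and have different left characters.

    `branches` maps each child branch to the list of left characters of
    the leaves below it.
    """
    total = 0
    for (_, lefts_a), (_, lefts_b) in combinations(branches.items(), 2):
        total += sum(1 for x in lefts_a for y in lefts_b if x != y)
    return total


# Left characters as labeled in the figure (assumed).
print(count_certificates({"s": ["x", "y", "z"], "t": ["x", "x", "y", "z"]}))
```

With only the six occurrences that are written out explicitly in S (three under each branch), the same rule gives 6.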
Super-maximal Repeats
• Super-maximal repeat: a maximal repeat that isn't a substring of any other maximal repeat.
• Example: S = xabcyiiizabcqabcyrxar. abcy is super-maximal; abc is not (but it is a maximal repeat).
• How can we find all super-maximal repeats in linear time?
Theorem: A left-diverse internal node v in the suffix tree for S represents a super-maximal repeat iff:
– all of v's children are leaves, and
– all those leaves have distinct left characters.
A left-diverse internal node v in the suffix tree for S represents a super-maximal repeat iff
1. all of v's children are leaves
2. each leaf has a distinct left character
Explanation: Denote the path-label of v by α.
1. Assume v has a child w that is an internal node, and let β be the label of the edge (v, w). Then there are at least two leaves in the subtree rooted at w, so there are at least two occurrences of αβ, which extends α to the right. So αβ is a repeat, and thus it is contained in some maximal repeat γ; hence α is not super-maximal.
[Figure: root, path α to v; v has an internal child w (edge label β) and leaves with left characters x, y, z.]
2. Assume that two leaves below v have the same left character x. Then α can be extended to the left in two occurrences, so xα is a repeat and is contained in some maximal repeat. Hence α is not super-maximal.
[Figure: root, path α to v; the leaves below v have left characters z, x, x.]
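The two conditions of the theorem translate into a simple test on occurrences, sketched below without a suffix tree. "All children are leaves" is taken to mean that every occurrence is followed by a different character, and "distinct left characters" that every occurrence is preceded by a different character. The function name and the sentinels # and $ (assumed absent from S) are illustrative choices, and the enumeration is not linear time.

```python
from collections import defaultdict


def supermaximal_repeats(s, left_end="#", right_end="$"):
    """Substrings occurring k >= 2 times with k distinct right characters
    (all children are leaves) and k distinct left characters."""
    padded = left_end + s + right_end
    n = len(s)
    occurrences = defaultdict(list)  # substring -> [(left char, right char)]
    for start in range(1, n + 1):
        for end in range(start + 1, n + 2):
            occurrences[padded[start:end]].append(
                (padded[start - 1], padded[end])
            )
    result = set()
    for sub, pairs in occurrences.items():
        k = len(pairs)
        lefts = {left for left, _ in pairs}
        rights = {right for _, right in pairs}
        if k >= 2 and len(lefts) == k and len(rights) == k:
            result.add(sub)
    return result


print(supermaximal_repeats("yabababx"))
```

On S = xabcyiiizabcqabcyrxar, "abcy" passes the test, while "abc" fails because two of its three occurrences are followed by y.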
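Lowest common ancestors
Idea: preprocess the suffix tree to gain information on its structure.
LCA  1  2  3  4  5  6  7  8
1    -  x  r  r  r  r  r  r
2    x  -  r  r  r  r  r  r
3    r  r  -  y  z  z  z  r
4    r  r  y  -  z  z  z  r
5    r  r  z  z  -  v  w  r
6    r  r  z  z  v  -  w  r
7    r  r  z  z  w  w  -  r
8    r  r  r  r  r  r  r  -
[Figure: tree with root r. The children of r are x, z and leaf 8; x has leaves 1 and 2; z has children y and w; y has leaves 3 and 4; w has child v and leaf 7; v has leaves 5 and 6.]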
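The entries of the LCA table can be reproduced with a naive walk up parent pointers. This is a swapped-in illustration only: it takes time proportional to the tree depth per query, with no preprocessing. The parent map below is the example tree as it can be read off the table, and the names `PARENT`, `ancestors`, and `lca` are hypothetical.

```python
# Parent pointers of the example tree: leaves are the integers 1 to 8,
# internal nodes are r (root), x, y, z, w, v.
PARENT = {
    "x": "r", "z": "r", 8: "r",
    1: "x", 2: "x",
    "y": "z", "w": "z",
    3: "y", 4: "y",
    "v": "w", 7: "w",
    5: "v", 6: "v",
}


def ancestors(node):
    """Path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(a, b):
    """Lowest common ancestor by intersecting the two root paths."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


print(lca(3, 5), lca(5, 6), lca(1, 8))
```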
Example: Longest common extension
S = "abcabd"
The longest common extension of (1, 4) is "ab".
[Figure: suffix tree of S$ with leaves 1 to 6; leaves 1 and 4 share the path "ab" from the root and then diverge on c and d.]
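For reference, the longest common extension can be computed by direct comparison, as in the sketch below. This is the naive scan, not the suffix-tree approach in which the answer is read from the lowest common ancestor of leaves i and j; the function name `lce` is hypothetical and positions are 1-indexed as on the slide.

```python
def lce(s, i, j):
    """Longest common prefix of the suffixes of s starting at the
    1-indexed positions i and j."""
    a, b = i - 1, j - 1  # convert to 0-indexed
    k = 0
    while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
        k += 1
    return s[a:a + k]


print(lce("abcabd", 1, 4))
```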
Finding maximal palindromes
• A palindrome: caabaac (odd), cbaabc (even)
• Goal: find all maximal palindromes in a string S
• In this text we handle only even palindromes
Observation: Let m = |S| and let Sr be the reverse of S. The right half of the maximal even palindrome centered between positions i-1 and i is the longest common prefix of the suffix at position i of S and the suffix at position m-i+2 of Sr.
Example: S = "xxabccbayyyyy", i = 6, Sr = "yyyyyabccbaxx", m-i+2 = 13-6+2 = 9
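The observation can be sketched with a naive prefix comparison in place of the constant-time longest common extension query. The function names are hypothetical and positions are 1-indexed as on the slide.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def maximal_even_palindrome(s, i):
    """Maximal even palindrome centered between positions i-1 and i
    (1-indexed), found by matching the suffix of s at position i against
    the suffix of the reversed string at position m - i + 2."""
    m = len(s)
    rev = s[::-1]
    j = m - i + 2
    # rev[j - 1:] has length i - 1, so k never exceeds i - 1.
    k = common_prefix_len(s[i - 1:], rev[j - 1:])
    return s[i - 1 - k:i - 1 + k]


print(maximal_even_palindrome("xxabccbayyyyy", 6))
```

Calling the function for every i from 2 to m yields all maximal even palindromes; with the naive comparison this is quadratic in the worst case.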
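[Figure: S contains y αr α x, with α starting at position i; Sr contains x αr α y, with α starting at position m-i+2. The paths to the leaves (S, i) and (Sr, m-i+2) share the label α and then diverge on x and y.]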
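Maximal almost-palindrome (MAP)
[Figure: S contains y' βr y αr α x β x', with α starting at position i and β at position i+1+|α|; Sr contains x' βr x αr α y β y', with α starting at position m-i+2 and β at position m-i+3+|α|. The paths to the leaves (S, i) and (Sr, m-i+2) share α and diverge on x and y; the paths to the leaves (S, i+|α|+1) and (Sr, m-i+|α|+3) share β and diverge on x' and y'.]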