Computing longest common substring and all palindromes from compressed strings
Wataru Matsubara (1), Shunsuke Inenaga (2), Akira Ishino (1), Ayumi Shinohara (1), Tomoyuki Nakamura (1), Kazuo Hashimoto (1)
(1) Graduate School of Information Sciences, Tohoku University, Japan
(2) Department of Computer Science and Communication Engineering, Kyushu University, Japan
Background and motivations
What is a compressed string algorithm?
Input text: A palindrome is a symmetric string. It is interesting on their own as word puzzles. For example, "I prefer pi", "Borrow or rob?", and "Was it a bar or a bat I saw?" and so on. ...
Task: find palindromes.
Output: mm, isi, zz, ipreferpi, borroworrob, wasitabarorabatisaw, oo, ...
What is a compressed string algorithm? (compressed input)
Compressed text: e)%e. ARY)(Re. JD)OIHOIFEnkkdi we 02 kfo)J”LPEPJ 9 w. EOW*# e. O …
One solution would be to decompress the compressed text and then find palindromes in the decompressed text, which yields the same output (mm, isi, zz, ...).
However, the decompressed size can be exponentially large with respect to the compressed size.
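For reference, the uncompressed version of this task can be sketched in a few lines. This is a minimal illustration and not the algorithm of the talk: the names `normalize` and `maximal_palindromes` are hypothetical, only maximal palindromes of length at least 2 are reported, and the center-expansion approach takes O(N^2) time in the worst case on the decompressed text of length N.

```python
def normalize(text):
    """Keep letters only, lowercased, as in the slide's output."""
    return "".join(ch.lower() for ch in text if ch.isalpha())


def maximal_palindromes(s, min_len=2):
    """Return maximal palindromes of s as (start, substring) pairs.

    Expands around each of the 2*len(s) - 1 centers (characters and gaps).
    """
    found = []
    for center in range(2 * len(s) - 1):
        left = center // 2
        right = left + center % 2
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        # s[left + 1:right] is now the widest palindrome at this center.
        if right - left - 1 >= min_len:
            found.append((left + 1, s[left + 1:right]))
    return found


print(maximal_palindromes(normalize("I prefer pi")))  # [(0, 'ipreferpi')]
```

Running this on a decompressed text is exactly the approach that becomes infeasible when the text is exponentially longer than its compressed form.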
Goal of algorithms for compressed strings
• Process the compressed text without decompression.
• Processing time should be polynomial in n, where n is the size of the compressed text.
– The decompressed size can be exponentially large with respect to n.
Compression schemes
• run-length encoding
• Lempel-Ziv
• grammar-based compression
The resulting archive of most practical compression methods can be transformed into a Straight Line Program (SLP) generating the same original text [Rytter 2003].
Definition of Straight Line Program (SLP)
An SLP T is a sequence of assignments X_1 = expr_1; X_2 = expr_2; ...; X_n = expr_n, where each X_k is a variable and each expr_k is either a single character a or a concatenation X_i X_j with i, j < k.
An SLP T for a string w is a CFG in Chomsky normal form such that L(T) = {w}.
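Straight Line Program (SLP): example
An SLP with n = 8 assignments:
X_1 = a
X_2 = b
X_3 = X_1 X_2
X_4 = X_3 X_1
X_5 = X_3 X_4
X_6 = X_5
X_7 = X_4 X_6
X_8 = X_7 X_5
(Figure: derivation tree of the text T, rooted at X_8 with children X_7 and X_5.)
The length N of the text T satisfies N = O(2^n).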
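The relation between n and N can be made concrete with a small sketch. The representation below (a dict mapping a variable index to either a character or a pair of earlier indices) and the names `slp_lengths` and `expand` are assumptions made for illustration. Only the first five rules of the slide's example are used, and the doubling SLP is a hypothetical extra example.

```python
def slp_lengths(rules):
    """Length of the string derived by each variable, without expanding it.

    rules maps a variable index k to either a single character or a pair
    (i, j) with i, j < k, meaning X_k = X_i X_j.
    """
    length = {}
    for k in sorted(rules):
        rhs = rules[k]
        if isinstance(rhs, str):
            length[k] = 1
        else:
            i, j = rhs
            length[k] = length[i] + length[j]
    return length


def expand(rules, k):
    """Fully decompress X_k (only sensible for small examples)."""
    rhs = rules[k]
    if isinstance(rhs, str):
        return rhs
    i, j = rhs
    return expand(rules, i) + expand(rules, j)


# First five rules of the example SLP on the slide.
rules = {1: "a", 2: "b", 3: (1, 2), 4: (3, 1), 5: (3, 4)}
print(expand(rules, 5), slp_lengths(rules)[5])  # ababa 5

# Hypothetical doubling SLP: X_1 = a, X_k = X_{k-1} X_{k-1}.
doubling = {1: "a"}
for k in range(2, 41):
    doubling[k] = (k - 1, k - 1)
print(slp_lengths(doubling)[40])  # 2**39 characters from only 40 rules
```

Computing lengths bottom-up takes O(n) arithmetic operations, whereas `expand` on the doubling SLP would have to produce 2^39 characters.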
Efficient algorithms for compressed strings
• substring matching
– Karpinski et al (1996): O(n^4 log n) time
– Miyazaki et al (1997): O(n^4) time
– Lifshits (2006): O(n^3) time
• minimum period
– Karpinski et al (1996): O(n^4 log n) time
– Lifshits (2006): O(n^3 log N) time
• all squares
– Gasieniec et al (1994): O(n^6 log^5 N) time
Hardness results
• Subsequence pattern matching
– Lifshits and Lohrey (2006): NP-hard
• Longest common subsequence
– Lifshits and Lohrey (2006): NP-hard
• Hamming distance
– Lifshits (2007): #P-complete
Is there any reasonable comparison measure for compressed strings?
String comparison measures (example pair: aabbaa, abaaba)
Measure | Uncompressed text | Compressed text
Hamming distance | O(N) | #P-complete [Lifshits 07]
Longest common subsequence | O(N^2 / log N) | NP-hard [Lifshits and Lohrey 06]
Longest common substring | O(N) | ? (we solve this problem)
Our results
Our Result 1: Longest Common Substring
Problem: Given two SLPs T and S that describe texts T and S, respectively, compute LCStr(T, S), the length of the longest common substring of T and S.
Theorem: LCStr(T, S) can be computed in O(n^4 log n) time using O(n^3) space, where n is the total size of the input SLPs.
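To pin down what LCStr computes, here is a baseline on uncompressed strings. It is a simple dynamic-programming sketch running in O(|s| |t|) time, chosen for brevity; it is neither the linear-time uncompressed method nor the compressed algorithm of the talk, and the function name `lcstr_length` is hypothetical.

```python
def lcstr_length(s, t):
    """Length of the longest common substring of s and t."""
    best = 0
    # prev[j] = length of the longest common suffix of s[:i-1] and t[:j]
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


print(lcstr_length("aabbaa", "abaaba"))  # 3, e.g. "aab" or "baa"
```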
Our Result 2: Palindromes
Problem: Given an SLP T, compute (a compressed representation of) the set of all palindromes of T.
Theorem: The problem can be solved in O(n^4) time using O(n^2) space, where n is the size of SLP T.
Previous best result: O(n^5 log^4 N) time [Gasieniec et al 1996], where N is the length of the original text T (note that N = O(2^n)).
Details of our algorithm
• Computing longest common substring
• Computing palindromes (omitted in this talk)
Property of common substrings (1/3)
• For each common substring Z of strings S and T, there always exist variables X_i = X_l X_r and Y_j = Y_l Y_r such that:
– Z is a common substring of X_i and Y_j
– Z contains an overlap between X_l and Y_r
(Figure: X_i = X_l X_r aligned with Y_j = Y_l Y_r; the common substring Z contains the overlap w.)
Property of common substrings (2/3)
• For each common substring Z of strings S and T, there always exists a string w such that:
– w is a substring of Z
– w is an overlap of variables of S and T
(Figure: X_i = X_l X_r aligned with Y_j = Y_l Y_r; w is the overlap.)
Property of common substrings (3/3)
• For each common substring Z of strings S and T, there always exists a string w such that:
– Z can be computed by extending w
(Figure: the overlap w is extended in both directions to obtain the common substring Z.)
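Overlaps (OL)
For any strings X and Y, OL(X, Y) is the set of the lengths of overlaps of X and Y, i.e., the lengths k for which the suffix of X of length k equals the prefix of Y of length k.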
Overlaps: example
OL("aabaaba", "abaababb") = {1, 3, 6}
(Figure: X_l = aabaaba aligned with Y_r = abaababb at each overlap.)
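The example can be checked directly on uncompressed strings. The sketch below assumes that an overlap of length k means the length-k suffix of the first string equals the length-k prefix of the second, which is consistent with the example on the slide; the function name `overlaps` is hypothetical, and the brute-force check is for illustration only.

```python
def overlaps(x, y):
    """OL(x, y): lengths k >= 1 such that the length-k suffix of x
    equals the length-k prefix of y (brute force)."""
    return {k for k in range(1, min(len(x), len(y)) + 1) if x[-k:] == y[:k]}


print(sorted(overlaps("aabaaba", "abaababb")))  # [1, 3, 6]
```

The three overlaps are "a", "aba", and "abaaba".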
Computing overlaps [Karpinski et al 1996]
Lemma: For any variables X_i and X_j of SLP T, OL(X_i, X_j) can be represented by O(n) arithmetic progressions.
Theorem: For any SLP T, OL(X_i, X_j) for all pairs of variables X_i and X_j can be computed in a total of O(n^4 log n) time and O(n^3) space.
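To see why arithmetic progressions are a natural representation, consider a periodic string, whose overlaps with itself are evenly spaced. The sketch below is only an illustration on uncompressed strings: `overlaps` is the brute-force definition, `to_progressions` is a hypothetical greedy packing into (first, step, count) triples, and neither reflects how the cited algorithm obtains the O(n) bound.

```python
def overlaps(x, y):
    """Lengths k >= 1 where the length-k suffix of x equals the length-k prefix of y."""
    return {k for k in range(1, min(len(x), len(y)) + 1) if x[-k:] == y[:k]}


def to_progressions(values):
    """Greedily pack integers into arithmetic progressions (first, step, count)."""
    xs = sorted(values)
    progressions = []
    i = 0
    while i < len(xs):
        if i + 1 == len(xs):
            # A single leftover value forms a progression of one element.
            progressions.append((xs[i], 0, 1))
            break
        step = xs[i + 1] - xs[i]
        j = i + 1
        while j + 1 < len(xs) and xs[j + 1] - xs[j] == step:
            j += 1
        progressions.append((xs[i], step, j - i + 1))
        i = j + 1
    return progressions


x = "ab" * 4
print(to_progressions(overlaps(x, x)))  # [(2, 2, 4)], i.e. {2, 4, 6, 8}
```

A set with many elements can thus be stored in constant space per progression, which is what makes the O(n^3) total space bound plausible for n^2 pairs of variables.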
How to extend overlaps
Take an overlap aba ∈ OL(X_l, Y_r), where X_i = X_l X_r and Y_j = Y_l Y_r. Starting from the overlap, compare X_i and Y_j outward in both directions for as long as the characters match, and stop at the first mismatch on each side.
(Figure: X_i = X_l X_r and Y_j = Y_l Y_r aligned at the overlap aba; example strings aaabababb and aabaabaabababaaba, with the matching region growing step by step until a mismatch is found on each side.)
We are not allowed to process character by character, because the decompressed strings can be exponentially long.
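First-mismatch function [Karpinski et al 1996]
Input: SLP variables X_i and Y_j, integer k
Output: FM(X_i, Y_j, k) = min{p | X_i[k + p - 1] ≠ Y_j[p]} - 1, where p is the position of the first mismatch when Y_j is compared with X_i starting at position k
(Figure: Y_j aligned at position k of X_i with the first mismatch marked; example strings ababaab and abababaaba.)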
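On uncompressed strings the first-mismatch function is a plain scan. The sketch below rests on several assumptions: positions are 1-based, the second string is aligned at position k of the first, the scan also stops when either string ends, and the name `first_mismatch` and the example strings are illustrative. The point of the compressed version is to obtain the same value without this character-by-character loop.

```python
def first_mismatch(x, y, k):
    """Number of characters of y that match x when y is aligned at
    1-based position k of x, counted up to the first mismatch."""
    matched = 0
    while (matched < len(y)
           and k - 1 + matched < len(x)
           and x[k - 1 + matched] == y[matched]):
        matched += 1
    return matched


x, y = "abababaaba", "ababaab"
print(first_mismatch(x, y, 1))  # 5: "ababa" matches, then b != a
print(first_mismatch(x, y, 3))  # 7: all of y matches x[3..9]
```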
First-mismatch function [Karpinski et al 1996]
Lemma: Provided that the sets of overlaps are already computed, FM(X_i, Y_j, k) can be computed in O(n log n) time.
Extending overlaps using the FM function
Lemma: Extending overlaps can be done with O(n) calls to the FM function.
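The extension step itself can be illustrated on uncompressed strings. In this sketch the names `common_prefix_len` and `extend_overlap` and the example strings are hypothetical, and the two character-by-character scans stand in for what the compressed algorithm does through FM calls.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def extend_overlap(x_l, x_r, y_l, y_r, k):
    """Length of the common substring of x_l + x_r and y_l + y_r obtained by
    extending an overlap of length k (suffix of x_l = prefix of y_r)."""
    assert k >= 1 and x_l[-k:] == y_r[:k], "k must be in OL(x_l, y_r)"
    # To the right of the overlap: x_r against the remainder of y_r.
    right = common_prefix_len(x_r, y_r[k:])
    # To the left of the overlap: common suffix of the rest of x_l and y_l,
    # computed as a common prefix of the reversed strings.
    left = common_prefix_len(x_l[:-k][::-1], y_l[::-1])
    return left + k + right


# X = "cab" + "baz", Y = "qa" + "bbay"; the overlap "b" extends to "abba".
print(extend_overlap("cab", "baz", "qa", "bbay", 1))  # 4
```

Taking the maximum of this value over all variable pairs and all overlaps gives the longest common substring length in the uncompressed setting.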
Computing longest common substring (pseudo-code)
• O(n^2) items
• O(n) calls of the FM function per item
• O(n log n) time per call
In total, LCStr(S, T) can be computed in O(n^2 × n × n log n) = O(n^4 log n) time.
Conclusions
• Computing longest common substring from compressed strings
– O(n^4 log n) time and O(n^3) space
• Computing all palindromes from compressed strings
– O(n^4) time and O(n^2) space
Thank you for your attention.