Longest Common Rigid Subsequence
Bin Ma and Kaizhong Zhang
Department of Computer Science, University of Western Ontario, Canada
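(Rigid) Subsequence
• Subsequence: CPM is a subsequence of COMBINATORIALPATTERNMATCHING.
• Rigid Subsequence: CPM with gaps (13, 7), that is, the letters together with the fixed distances between consecutive letters.
    0123456789012345678901234567
    COMBINATORIALPATTERNMATCHING
  C is at position 0, P at position 13 (gap 13), and M at position 20 (gap 7).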
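The definition above can be checked mechanically. The following is a minimal sketch, assuming a rigid subsequence is given as its letters plus the gaps between consecutive letters, as in the CPM, (13, 7) example; the function name `rigid_occurrences` is hypothetical and not taken from the slides.

```python
def rigid_occurrences(text, letters, gaps):
    """Return every start position where the rigid pattern occurs in text.

    letters: the pattern's characters, e.g. "CPM" (must be non-empty)
    gaps:    distances between consecutive characters, e.g. (13, 7)
    """
    if len(gaps) != len(letters) - 1:
        raise ValueError("need exactly one gap between each pair of letters")
    # Turn gaps into offsets relative to the first letter: (13, 7) -> [0, 13, 20].
    offsets = [0]
    for g in gaps:
        offsets.append(offsets[-1] + g)
    span = offsets[-1]
    # Try each start position at which the whole pattern still fits.
    return [
        p
        for p in range(len(text) - span)
        if all(text[p + o] == c for o, c in zip(offsets, letters))
    ]
```

For instance, `rigid_occurrences("COMBINATORIALPATTERNMATCHING", "CPM", (13, 7))` finds the single occurrence starting at position 0, while the same letters with gaps (1, 1) do not occur at all.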
Common (Rigid) Subsequence
• Longest Common Subsequence (LCS)
  – combinatorial pattern matching
  – longest common rigid subsequence
  – LCS: comnienc
• Longest Common Rigid Subsequence (LCRS)
  – combinatorial pattern matching
  – longest common rigid subsequence
  – LCRS: comni, with gaps (1, 1, 3, 5)
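For two strings, a common rigid subsequence corresponds to an ungapped alignment: fix a shift between the strings and keep the positions where the aligned characters agree. The sketch below tries every shift and keeps the one with the most matches. It is an illustration under that assumption, not code from the talk; the names `matches_at_shift`, `pattern_at`, and `lcrs_two` are hypothetical, and spaces are treated as ordinary characters.

```python
def matches_at_shift(s, t, d):
    """Positions i in s with s[i] == t[i + d] for a fixed shift d."""
    return [i for i in range(len(s)) if 0 <= i + d < len(t) and s[i] == t[i + d]]


def pattern_at(s, positions):
    """Convert sorted positions in s into (letters, gaps)."""
    letters = "".join(s[i] for i in positions)
    gaps = tuple(b - a for a, b in zip(positions, positions[1:]))
    return letters, gaps


def lcrs_two(s, t):
    """Longest common rigid subsequence of two strings, by trying all shifts.

    Runs in O(len(s) * len(t)) character comparisons; ties are broken in
    favour of the smallest shift.
    """
    best = []
    for d in range(-(len(s) - 1), len(t)):
        pos = matches_at_shift(s, t, d)
        if len(pos) > len(best):
            best = pos
    return pattern_at(s, best)
```

On the pair from the slide, aligning position 0 of "combinatorial pattern matching" with position 8 of "longest common rigid subsequence" (shift 8) leaves matches at positions 0, 1, 2, 5, and 10, which is the pattern comni with gaps (1, 1, 3, 5).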
Previous Results
• LCS and LCRS of two strings:
  – Solvable in polynomial time.
• LCS of many strings:
  – Cannot be approximated within ratio n^δ, for some constant δ > 0, in polynomial time unless P = NP (Jiang and Li 1995, SIAM J. Comput.).
  – For random instances, a simple greedy algorithm gives an almost optimal solution with only a small error.
• LCRS of many strings:
  – Only exponential time algorithms were known.
  – Our CPM paper addresses its time complexity.
Motivation in Bioinformatics
• In biochemistry, a motif is a recurring pattern in DNA or protein sequences.
• Example: a protein motif (the SH3 domain binding motif) in J. Biological Chemistry 269: 24034-9.
• Many motifs can be found in the PROSITE database of ExPASy.
Motivation
• Rigoutsos and Floratos proposed the following problem (Bioinformatics 14: 55-67, 1998).
  – Given n strings and a positive number K, find a longest "rigid pattern" (rigid subsequence) that occurs in at least K of the n strings.
• When K = n, this is LCRS.
• Exponential time algorithms were studied.
• NP-hardness was unknown.
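The "occurs in at least K of the n strings" condition can be made concrete with a small support count. This is a sketch under the assumption that a rigid pattern is given as letters plus gaps; `occurs` and `support` are hypothetical helper names, not part of Rigoutsos and Floratos' method.

```python
def occurs(s, letters, gaps):
    """True if the rigid pattern (letters, gaps) occurs somewhere in s."""
    offsets = [0]
    for g in gaps:
        offsets.append(offsets[-1] + g)
    span = offsets[-1]
    return any(
        all(s[p + o] == c for o, c in zip(offsets, letters))
        for p in range(len(s) - span)
    )


def support(strings, letters, gaps):
    """Number of input strings that contain the rigid pattern."""
    return sum(occurs(s, letters, gaps) for s in strings)


def is_feasible(strings, letters, gaps, k):
    """Check the 'at least K of the n strings' condition for one pattern."""
    return support(strings, letters, gaps) >= k
```

With K equal to the number of strings, `is_feasible` is exactly the condition that the pattern is a common rigid subsequence of all inputs.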
Our Results
• LCRS is MAX-SNP hard.
  – Therefore, Rigoutsos and Floratos' problem is also MAX-SNP hard.
• For random instances, there is an algorithm that solves LCRS in quasi-polynomial average running time.
  – With simple modifications, the algorithm also works for Rigoutsos and Floratos' problem.
MAX-SNP hard
• L-reduction from Max-Cut
[Figure: layout of the constructed strings, with parts labelled edge, vertex, and delimiter]
The construction of each edge
[Figure: three possible configurations in an ungapped alignment of the blocks aaa, aba, bab; a configuration contributes either 0 or 1]
The Algorithm
• Let S_i be the set of length-i common rigid subsequences.
• We only need to prove that [formula not recovered]
Sketch of Proof
• For each rigid subsequence in S_i, bound the probability that it occurs in one random string of length n.
• From this, bound the probability that it occurs in every input string.
• Count the total number of length-i rigid subsequences.
• The bound is then established in two cases: i ≤ 2 log n and i > 2 log n.
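One way to picture the sets S_i is a level-wise enumeration: every prefix of a common rigid subsequence is itself common, so S_(i+1) can be built by extending the members of S_i by one gap and one letter. The sketch below is an assumption about how such sets could be computed, not the authors' algorithm, and it takes exponential time in the worst case; the names `occurs`, `common_rigid_levels`, and `lcrs` are hypothetical.

```python
def occurs(s, letters, gaps):
    """True if the rigid pattern (letters, gaps) occurs somewhere in s."""
    offsets = [0]
    for g in gaps:
        offsets.append(offsets[-1] + g)
    span = offsets[-1]
    return any(
        all(s[p + o] == c for o, c in zip(offsets, letters))
        for p in range(len(s) - span)
    )


def common_rigid_levels(strings):
    """Yield S_1, S_2, ... as sets of (letters, gaps) pairs."""
    # Only letters present in every string can appear in a common pattern.
    alphabet = set.intersection(*(set(s) for s in strings))
    # A common pattern must fit inside the shortest input string.
    max_span = min(len(s) for s in strings) - 1
    level = {(c, ()) for c in alphabet}
    while level:
        yield level
        nxt = set()
        for letters, gaps in level:
            room = max_span - sum(gaps)
            for g in range(1, room + 1):
                for c in alphabet:
                    cand = (letters + c, gaps + (g,))
                    if all(occurs(s, *cand) for s in strings):
                        nxt.add(cand)
        level = nxt


def lcrs(strings):
    """Return one longest common rigid subsequence, or None if there is none."""
    best = None
    for level in common_rigid_levels(strings):
        best = min(level)  # deterministic choice among equally long patterns
    return best
```

The running time of such an enumeration is governed by the sizes of the sets S_i, which is why a bound on those sizes for random inputs matters for the average-case analysis.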
Acknowledgement
• Supported by NSERC, PREA and CRC.