Efficient Algorithms for Locating the Length-Constrained Heaviest Segments, with Applications to Biomolecular Sequence Analysis

Yaw-Ling Lin* (Dept. of CS & Info. Management, Providence Univ., Taiwan)
Tao Jiang (Dept. of CS & Engineering, UC Riverside, USA)
Kun-Mao Chao (Dept. of CS & Info. Engineering, Nat. Taiwan Univ., Taiwan)

* Yaw-Ling Lin, Providence, Taiwan

Outline
• Introduction
• Applications to Biomolecular Sequence Analysis
• Maximum Sum Consecutive Subsequence
• Maximum Average Consecutive Subsequence
• Implementation and Preliminary Experiments
• Concluding Remarks

Introduction
• Two fundamental algorithms in searching for interesting regions in sequences:
  – Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum: an O(n)-time algorithm.
  – Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average: an O(n log L)-time algorithm.

Applications to Biomolecular Sequence Analysis (I)
• Locating GC-Rich Regions
  – Finding GC-rich regions: an important problem in gene recognition and comparative genomics.
  – CpG islands (200 ~ 1400 bp)
  – [Huang ’94]: O(nL)-time algorithm.
• Post-Processing Sequence Alignments
  – Comparative analysis of human and mouse DNA: useful in gene prediction in human genome.
  – Mosaic effect: bad inner sequence.
  – Normalized local alignment.
  – Post-processing local aligned subsequences
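A minimal sketch of how an O(nL) bound for the GC-rich region problem can be reached, not necessarily the procedure of the cited work. It relies on the observation that a segment of length 2L or more can be split into two pieces of length at least L, one of which has an average no smaller than the whole, so only lengths L to 2L-1 need to be tried. The function names and the 0/1 encoding of bases are assumptions made for illustration.

```python
from fractions import Fraction


def gc_indicator(dna):
    """Hypothetical encoding: 1 for G or C, 0 for any other base."""
    return [1 if base in "GC" else 0 for base in dna.upper()]


def max_avg_segment_min_len(values, L):
    """Return (average, start, end) of a best segment values[start:end]
    with length >= L, trying only lengths L..2L-1 (O(n*L) time).
    Assumes integer values so that Fraction gives exact averages."""
    n = len(values)
    if not 1 <= L <= n:
        raise ValueError("need 1 <= L <= len(values)")
    # prefix[i] holds the sum of values[:i]
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)
    best = None
    for start in range(n - L + 1):
        max_end = min(n, start + 2 * L - 1)
        for end in range(start + L, max_end + 1):
            avg = Fraction(prefix[end] - prefix[start], end - start)
            if best is None or avg > best[0]:
                best = (avg, start, end)
    return best


# Usage: densest GC window of at least 4 bases in a toy sequence.
result = max_avg_segment_min_len(gc_indicator("ATGCGCAT"), 4)
```

Here `result` holds the average together with the half-open index range of the segment.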

Applications to Biomolecular Sequence Analysis (II)
• Annotating Multiple Sequence Alignments
  – [Stojanovic ’99]: conserved regions in biomolecular sequences.
  – Numerical scores for columns of a multiple alignment; each column score shall be adjusted by subtracting an anchor value.
• Ungapped Local Alignments with Length Constraints
  – Computing the length-constrained segment of each diagonal in the matrix with the largest sum (or average) of scores.
  – Applications in motif identification.
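To illustrate the anchor adjustment, the sketch below subtracts an anchor from every column score and then looks for the heaviest run of adjusted scores with a plain unconstrained maximum-sum scan (Kadane's method). The function name, the sample scores and the anchor value are all made up for illustration, and no length constraint is applied here.

```python
def best_conserved_region(column_scores, anchor):
    """Return (sum, start, end) of the heaviest segment of the
    anchor-adjusted scores, as a half-open range [start, end)."""
    if not column_scores:
        raise ValueError("column_scores must be non-empty")
    adjusted = [s - anchor for s in column_scores]
    best_sum, best_start, best_end = adjusted[0], 0, 1
    run_sum, run_start = 0, 0
    for i, x in enumerate(adjusted):
        if run_sum <= 0:
            # a non-positive running sum never helps: restart the run here
            run_sum, run_start = x, i
        else:
            run_sum += x
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_sum, best_start, best_end


# Usage with hypothetical column scores and anchor value 5.
region = best_conserved_region([2, 9, 8, 1, 7, 3], 5)
```

Columns scoring below the anchor become negative and so act as a penalty for stretching a region across them.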

Maximum Sum Consecutive Subsequence
• <-4, 1, -2, 3> is left-negative; <5, -3, 4, -1, 2, -6> is not.
• <5> <-3, 4> <-1, 2> <-6> is minimal left-negative partitioned.
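The examples above can be checked with a short sketch. It assumes the following definitions, which are consistent with the examples but are not spelled out on the slide: a sequence is left-negative when every proper prefix has a sum of at most zero, and a minimal left-negative partition cuts the sequence into left-negative pieces whose sums are positive, except possibly the last piece. The function names are illustrative.

```python
def is_left_negative(seq):
    """True if every proper prefix of seq sums to zero or less."""
    total = 0
    for x in seq[:-1]:
        total += x
        if total > 0:
            return False
    return True


def minimal_left_negative_partition(seq):
    """Greedy left-to-right scan: close the current piece the first
    time its running sum becomes positive."""
    parts, current, total = [], [], 0
    for x in seq:
        current.append(x)
        total += x
        if total > 0:
            parts.append(current)
            current, total = [], 0
    if current:
        # trailing piece whose sum never became positive
        parts.append(current)
    return parts


parts = minimal_left_negative_partition([5, -3, 4, -1, 2, -6])
```

Each piece is closed at the first position where its sum turns positive, so all of its proper prefixes are non-positive by construction.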

Minimal left-negative partition

MLN-partition: linear time

Max-Sum with LC
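The slide body is not available in this transcript, so the sketch below does not reproduce the MLN-partition-based algorithm. It shows a different, well-known way to solve the same problem (maximum-sum segment of length at most U) in O(n) time: prefix sums plus a monotone deque that keeps the smallest prefix sum among the last U start positions. All names are illustrative.

```python
from collections import deque


def max_sum_segment_max_len(values, U):
    """Return (sum, start, end) of a maximum-sum segment values[start:end]
    with 1 <= end - start <= U, in O(n) time."""
    n = len(values)
    if n == 0 or U < 1:
        raise ValueError("need a non-empty sequence and U >= 1")
    # prefix[i] holds the sum of values[:i]
    prefix = [0]
    for v in values:
        prefix.append(prefix[-1] + v)
    best = None
    candidates = deque()  # start indices, prefix sums strictly increasing
    for end in range(1, n + 1):
        start = end - 1
        # drop starts that can never beat the new one
        while candidates and prefix[candidates[-1]] >= prefix[start]:
            candidates.pop()
        candidates.append(start)
        # drop starts that would make the segment longer than U
        while candidates[0] < end - U:
            candidates.popleft()
        total = prefix[end] - prefix[candidates[0]]
        if best is None or total > best[0]:
            best = (total, candidates[0], end)
    return best


best = max_sum_segment_max_len([5, -3, 4, -1, 2, -6], 3)
```

Every index enters and leaves the deque at most once, which gives the linear running time.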

Analysis of MSLC

Max Average Subsequence
• <4, 2, 3, 8> is right-skew; <5, 3, 4, 1, 2, 6> is not.
• <5> <3, 4> <1, 2, 6> is decreasing right-skew partitioned.
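A sketch that reproduces the examples above, under definitions assumed from them: a sequence is right-skew when the average of every proper prefix is at most the average of the remaining suffix, and a decreasing right-skew partition cuts the sequence into right-skew pieces with strictly decreasing averages. The left-to-right stack merge below is one simple way to build such a partition and may differ from the procedure on the following slides. It relies on the fact that joining two right-skew pieces X and Y with avg(X) <= avg(Y) gives a right-skew piece. Averages are compared by cross-multiplication to avoid floating-point error; the names are illustrative.

```python
def is_right_skew(seq):
    """True if avg(prefix) <= avg(rest) for every proper prefix."""
    total, n, prefix = sum(seq), len(seq), 0
    for i in range(1, n):
        prefix += seq[i - 1]
        # avg(prefix) <= avg(rest)  <=>  prefix*(n-i) <= (total-prefix)*i
        if prefix * (n - i) > (total - prefix) * i:
            return False
    return True


def decreasing_right_skew_partition(seq):
    """Merge pieces on a stack until averages strictly decrease."""
    stack = []  # entries are (sum, start, end) for seq[start:end]
    for i, x in enumerate(seq):
        seg_sum, seg_start = x, i
        while stack:
            prev_sum, prev_start, prev_end = stack[-1]
            prev_len = prev_end - prev_start
            cur_len = i + 1 - seg_start
            # merge while avg(previous piece) <= avg(current piece)
            if prev_sum * cur_len <= seg_sum * prev_len:
                stack.pop()
                seg_sum += prev_sum
                seg_start = prev_start
            else:
                break
        stack.append((seg_sum, seg_start, i + 1))
    return [list(seq[s:e]) for _, s, e in stack]


pieces = decreasing_right_skew_partition([5, 3, 4, 1, 2, 6])
```

Each element is pushed once and popped at most once, so the partition is built in linear time.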

Decreasing right-skew partition

DRS-partition: linear time

Max-Avg-Seq with LC

Locate good-partner

Analysis of Max-Avg-Seq

Implementation and Preliminary Experiments

Conclusion
• Finding a max-sum subsequence of length at most U can be done in O(n) time.
• Finding a max-avg subsequence of length at least L can be done in O(n log L) time.

Recent Progress
• Lu (CMCT’2002): finding the max-avg subsequence of length at least L on binary (0, 1) sequences. O(n)-time.
• Goldwasser, Kao, Lu (2002, manuscripts): finding the max-avg subsequence of length at least L and at most U on real sequences. O(n)-time.
• Tools: finding CpG islands using MAVG (joint work with Huang, X., Jiang, T. and Chao, K.-M.)
  http://deepc2.zool.iastate.edu/aat/mavg/cgdoc.html
  http://deepc2.zool.iastate.edu/aat/mavg/cg.html

Future Research
• Best k (nonintersecting) subsequences?
• Normalized local alignment?
• Measurement of goodness?