Reconstruction of DNA sequencing by hybridization Ji-Hong Zhang
Reconstruction of DNA sequencing by hybridization
Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang
ZHANGroup@aporc.org
Institute of Applied Mathematics, AMSS, CAS
Bioinformatics
- Human Genome Project
  - Large molecule data in biology, such as DNA and protein
  - Knowledge of mathematics, computer science, information science, physics, system science, management science as well as biology
- Genomics
  - DNA sequencing
  - Gene prediction
  - Sequence alignment
DNA Sequencing
…ACGTGACTGAGGACCGTG CGACTGACTGGGT CTAGACTACGTTTTA TATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…
DNA Sequencing (shotgun)
[Figure labels: target DNA; cut many times at random; known dist; forward-reverse linked reads, ~500 bp each]
DNA Sequencing (SBH)
- DNA array (DNA chip) with 4^3 = 64 probes
  - Target DNA: AAATGCG
Sequencing by Hybridization
- Hybridize the target to an array containing a spot for each possible k-tuple (k-mer)
- The spectrum of a sequence
  - The multi-set of all its k-long substrings (k-tuples)
- Goal
  - Reconstruct the sequence from its spectrum
- Pevzner (1989): reconstruction is polynomial
- But …
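
The definition of the spectrum can be made concrete with a few lines of Python. This is a minimal sketch; the function name `spectrum` and the use of `collections.Counter` to represent the multi-set are illustrative choices, not part of the original slides.

```python
from collections import Counter


def spectrum(sequence, k):
    """Return the multi-set of all k-long substrings (k-tuples) of `sequence`."""
    if k <= 0 or k > len(sequence):
        raise ValueError("k must satisfy 1 <= k <= len(sequence)")
    # A sequence of length n has n - k + 1 overlapping k-tuples.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# The 8-letter target used later in the slides yields six 3-tuples,
# with TTT occurring twice.
example = spectrum("TTTTACGC", 3)
```

A `Counter` keeps the multiplicity of repeated k-tuples, which a plain `set` would lose.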
Uniqueness of Reconstruction
- Different sequences can have the same spectrum:
  - ACT, CTA, TAC
    - ACTAC
    - TACTA
- Non-uniqueness probability
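
The ambiguity can be checked by brute force on this small example. The sketch below enumerates every sequence of a given length and keeps those whose spectrum equals the target. The helper names are hypothetical, and the exhaustive search is only practical for very short sequences.

```python
from collections import Counter
from itertools import product


def spectrum(sequence, k):
    """Multi-set of all k-long substrings of `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def sequences_with_spectrum(target, length, k, alphabet="ACGT"):
    """Exhaustively list all sequences of `length` whose spectrum equals `target`."""
    matches = []
    for letters in product(alphabet, repeat=length):
        candidate = "".join(letters)
        if spectrum(candidate, k) == target:
            matches.append(candidate)
    return sorted(matches)


matches = sequences_with_spectrum(Counter(["ACT", "CTA", "TAC"]), length=5, k=3)
```

Besides the two sequences listed on the slide, the search also returns CTACT, because the three k-tuples form a cycle that can be entered at any point.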
Experiment Errors
- Hybridization experiments are error-prone
- False negative error
  - A k-tuple appears in the target DNA but does not appear in its measured spectrum
  - Repetition of a k-tuple
- False positive error
  - A k-tuple does not appear in the target DNA but does appear in its measured spectrum
Sequencing by Hybridization
[Figure: target DNA ……TTTTACGC…… ← spectrum TTT TTT TTA TAC ACG CGC TGA; panels "Ideal case" and "With errors"]
- Errors: positive (misread) / negative (missing, repetition)
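
The two error types can be separated by comparing the ideal spectrum with a measured one. In the sketch below the measured spectrum is invented for illustration (it is not data from the slides): it reports TTT only once, misses TAC, and contains the spurious TGA. The function names are assumptions as well.

```python
from collections import Counter


def spectrum(sequence, k):
    """Multi-set of all k-long substrings of `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def classify_errors(ideal, measured):
    """Return (false_negatives, false_positives) as multi-sets.

    Counter subtraction keeps only positive counts, so each result holds
    exactly the k-tuples that one side has in excess of the other.
    """
    false_negatives = ideal - measured   # missing or under-counted k-tuples
    false_positives = measured - ideal   # k-tuples absent from the target
    return false_negatives, false_positives


ideal = spectrum("TTTTACGC", 3)
# Hypothetical measurement: repetition of TTT lost, TAC missing, TGA misread.
measured = Counter(["TTT", "TTA", "ACG", "CGC", "TGA"])
negatives, positives = classify_errors(ideal, measured)
```

Here the lost repetition of TTT shows up as a negative error alongside the missing TAC, which matches the slide's grouping of repetitions with negative errors.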
SBH Reconstruction Problem
- In the case of error-free SBH experiments
  - A desired solution of SBH is simply a feasible solution that includes every k-tuple in the spectrum
- For the general case
  - No information is available beyond the spectrum and the length of the target DNA
  - A feasible solution composed of a maximum-cardinality subset of the spectrum is a reasonable desired solution
SBH Reconstruction Problem
- Ideal case (without repetitions and errors)
  - Equivalent to finding an Eulerian path in a corresponding graph (Pevzner, 1989)
  - A linear-time algorithm exists (Fleischner, 1990)
- The general case is NP-hard
  - Branch and bound
  - Heuristics
- Extensions
  - PSBH (Positional SBH)
  - SBH with length error
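
For the error-free case, the Eulerian-path formulation can be sketched as follows. Each k-tuple becomes an edge from its (k-1)-long prefix to its (k-1)-long suffix, and a path that uses every edge once spells a sequence with the given spectrum. The sketch uses Hierholzer's algorithm, a standard linear-time method, which is not necessarily the algorithm cited on the slide. The function name `reconstruct` is hypothetical.

```python
from collections import Counter, defaultdict


def reconstruct(kmers):
    """Return one sequence whose spectrum is exactly `kmers`, or None.

    `kmers` is a list of equal-length strings (an error-free spectrum).
    """
    if not kmers:
        return None
    graph = defaultdict(list)   # (k-1)-mer -> list of successor (k-1)-mers
    balance = Counter()         # out-degree minus in-degree per node
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # An Eulerian path needs at most one node with balance +1 (the start)
    # and at most one with balance -1 (the end); all others must be 0.
    starts = [node for node, b in balance.items() if b == 1]
    if any(b not in (-1, 0, 1) for b in balance.values()) or len(starts) > 1:
        return None
    start = starts[0] if starts else kmers[0][:-1]

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:
        return None  # some edges were unreachable: the graph is disconnected
    return path[0] + "".join(node[-1] for node in path[1:])


unique = reconstruct(["TTT", "TTT", "TTA", "TAC", "ACG", "CGC"])
ambiguous = reconstruct(["ACT", "CTA", "TAC"])
```

When several Eulerian paths exist, as for ACT, CTA, TAC, the function returns only one of them. Which one depends on the order in which edges are stored.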
Motivations
- Give criteria that determine the most likely k-tuples at both ends and in the middle of all possible reconstructions of the target DNA
  - These criteria greatly reduce ambiguities in the reconstruction of DNA
- Transform the negative errors into positive errors
  - This transformation lets us handle both types of errors easily
- Separate the repetitions from both types of errors
Methods
- Estimate the number of k-tuples that do not occur in a solution
  - Adjacency matrix (connection matrix)
  - Give a lower bound, over all solutions from k-tuple i to k-tuple j, on the number of k-tuples that do not occur
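
The slide does not give the bound itself, so the following is only a sketch of one simple bound of this kind, not the authors' formula. It assumes that consecutive k-tuples in a solution overlap in k-1 letters (no negative errors). Under that assumption, any k-tuple that cannot be reached from i, or from which j cannot be reached, cannot occur in a solution from i to j. All function names are hypothetical.

```python
def adjacency_matrix(kmers):
    """adj[i][j] = 1 if k-tuple j can directly follow k-tuple i (overlap k-1)."""
    n = len(kmers)
    return [
        [int(i != j and kmers[i][1:] == kmers[j][:-1]) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(adj):
    """Warshall's algorithm: reach[i][j] is True if j is reachable from i."""
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    for m in range(n):
        for i in range(n):
            if reach[i][m]:
                for j in range(n):
                    if reach[m][j]:
                        reach[i][j] = True
    return reach


def unused_lower_bound(kmers, start, end):
    """Lower bound on the k-tuples absent from any path from `start` to `end`.

    Returns None if `end` is not reachable from `start` at all.
    """
    reach = transitive_closure(adjacency_matrix(kmers))
    if not reach[start][end]:
        return None
    return sum(
        1 for m in range(len(kmers))
        if not (reach[start][m] and reach[m][end])
    )


kmers = ["TTT", "TTA", "TAC", "ACG", "CGC", "TGA"]
bound_from_ttt = unused_lower_bound(kmers, start=0, end=4)
bound_from_tta = unused_lower_bound(kmers, start=1, end=4)
```

In this example TGA is isolated in the graph, so it is excluded from every path. Starting from TTA additionally excludes TTT.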
Methods
- Determine the most likely k-tuples at both ends
  - Reconstruct from the most likely end pairs to get an upper bound for the SBH problem
  - Purge the end pairs that cannot yield a better solution than the current upper bound
Methods
- Transform the negative errors into positive errors
  - Artificial k-tuples
    - Fill in all the possible gaps due to false negative errors
  - Negative error level
    - The maximal number of consecutively missing k-tuples allowed
    - Reduces the number of artificial k-tuples
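
One possible reading of the artificial k-tuple idea, for a negative error level of 1, is sketched below. This is an interpretation of the slide, not the authors' published procedure. For every ordered pair of measured k-tuples (u, v) that overlap in k-2 letters, exactly one k-tuple could sit between them. If that bridging k-tuple is absent from the measured spectrum, it is added as an artificial candidate. The function name is hypothetical and k >= 2 is assumed.

```python
def artificial_kmers(measured):
    """Candidate k-tuples that would bridge a gap of one missing k-tuple.

    `measured` is a set of equal-length strings with k >= 2.
    """
    if not measured:
        return set()
    kmers = sorted(measured)
    k = len(kmers[0])
    candidates = set()
    for u in kmers:
        for v in kmers:
            # u, bridge, v are consecutive iff u[2:] equals the first
            # k-2 letters of v; the bridge is then fully determined.
            if u[2:] == v[:k - 2]:
                bridge = u[1:] + v[k - 2]
                if bridge not in measured:
                    candidates.add(bridge)
    return candidates


# Spectrum of TTTTACGC as a set, with TAC lost to a false negative error.
candidates = artificial_kmers({"TTT", "TTA", "ACG", "CGC"})
```

The result contains the truly missing TAC but also GCG, which could bridge CGC to itself. Gap filling therefore introduces positive errors, and the number of candidates grows with the negative error level.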
Computational Experiments
- 109 DNA sequences from GenBank
- Simulate the SBH experiments
- Error models
  - Random (probabilistic model)
  - Systematic (one-base-mismatch model)
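
The slides do not specify the simulation details, so the following is only a sketch of how a random error model might look. It assumes the measured spectrum is a set (repetitions are not observed), that each true k-tuple is dropped independently with probability `p_negative`, and that each absent k-tuple is added independently with probability `p_positive`. The function and parameter names are hypothetical.

```python
import random
from itertools import product


def simulate_sbh(sequence, k, p_negative, p_positive, rng=None):
    """Return a measured spectrum (a set of k-tuples) with random errors."""
    if rng is None:
        rng = random.Random()
    ideal = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    measured = set()
    # Visit all 4^k possible probes on the array.
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        if kmer in ideal:
            # False negative: a true k-tuple fails to hybridize.
            if rng.random() >= p_negative:
                measured.add(kmer)
        elif rng.random() < p_positive:
            # False positive: a spurious hybridization signal.
            measured.add(kmer)
    return measured


# Passing a seeded generator makes a simulated experiment reproducible.
run_a = simulate_sbh("TTTTACGC", 3, 0.1, 0.02, rng=random.Random(7))
run_b = simulate_sbh("TTTTACGC", 3, 0.1, 0.02, rng=random.Random(7))
```

Enumerating all 4^k probes is fine for small k. For larger k it would be cheaper to sample the false positives directly.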
Conclusions
- The ideal case (without repetitions and errors) can be solved in polynomial time (Pevzner, 1989)
- The general case is NP-hard
- Design efficient algorithms
  - Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang. A new approach to the reconstruction of DNA sequencing by hybridization. Bioinformatics, vol. 19(1), pages 14-21, 2003.
  - Xiang-Sun Zhang, Ji-Hong Zhang and Ling-Yun Wu. Combinatorial optimization problems in the positional DNA sequencing by hybridization and its algorithms. System Sciences and Mathematics, vol. 3, 2002. (in Chinese)
  - Ling-Yun Wu, Ji-Hong Zhang and Xiang-Sun Zhang. Application of neural networks in the reconstruction of DNA sequencing by hybridization. In Proceedings of the 4th ISORA, 2002.