HMMs in gene prediction and similarity searches What

  • Slides: 32
Download presentation
(H)MMs in gene prediction and similarity searches

(H)MMs in gene prediction and similarity searches

What is an HMM? • States • Transition Probabilities • Emission Probabilities (Eddy 2004)

What is an HMM? • States • Transition Probabilities • Emission Probabilities (Eddy 2004)

What is hidden? (Eddy 2004) • State Path Log (product of transition and emission

What is hidden? (Eddy 2004) • State Path Log (product of transition and emission probabilities) Log (1 x 0. 25 x 0. 9 x 0. 25 … 0. 9 x 0. 4) = -41. 22

What is hidden? • State Path (Eddy 2004)

What is hidden? • State Path (Eddy 2004)

Using HMMs • Given the parameters of the model, compute the probability of a

Using HMMs • Given the parameters of the model, compute the probability of a particular output sequence. This problem is solved by the forward algorithm. • Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the Viterbi algorithm. • Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, train the parameters of the HMM given a dataset of sequences. This problem is solved by the Baum. Welch algorithm.

Profile Hidden Markov Models • Statistical model of multiple sequence alignments • Position-specific description

Profile Hidden Markov Models • Statistical model of multiple sequence alignments • Position-specific description of the level of conservation and the probabilities of observing each type of amino acid (nucleotide) at that position • Protein domain alignments (PFAM, TIGRFams, …) • Regulator binding site alignments

Simple Profile HMM – no gaps Emission Probabilities determined from distribution of amino acids

Simple Profile HMM – no gaps Emission Probabilities determined from distribution of amino acids at each site of the alignment

Allowing gaps in a position-specific way Need to allow a sequence to contain one

Allowing gaps in a position-specific way Need to allow a sequence to contain one or more residues not found in the model (Insert) and also be missing regions that are present in the model (Delete)

Pfam • Database of protein domains and families available as multiple alignments and HMMs

Pfam • Database of protein domains and families available as multiple alignments and HMMs • Pfam-A is curated. Pfam-B is automated.

A sample Pfam: MCPsignal

A sample Pfam: MCPsignal

Pfam- Seed Alignment

Pfam- Seed Alignment

Pfam – scoring members • Trusted cut-off – Bit score for lowest scoring match

Pfam – scoring members • Trusted cut-off – Bit score for lowest scoring match included in the full alignment • Noise cut-off – Bit score for highest scoring match not included in the full alignment • Gathering cut-off

ATTTATCCGCCGAAGCCATTACATAGTATCGCGCTTGGCAGTCGGATTCCGGCGCTGCGTGAAGACTATA AACTTGGGCGTTTATGCGGTCGTTATTTCCTCGCCACGGTTGGCAAGCTATTAACTGAAAAAGCGCCGCT TACCCGCCATCTGGTGCCAGTGGTGACGCCGGAATCGATTGTCATTCCGCCTGCGCCAGTCGCCAACGAT ACGCTGGTTGCCGAAGTGAGCGACGCTCCGCAGGCGAACGACCCGACATTTAACAATGAGGATCTGGCTT GATTTGCCGTTTTATCGACACCCACTGCCATTTTGATTTCCCGCCGTTTAGTGGCGATGAAGAGGCCAGC CTGCAACGCGCGGCACAAGCGGGCGTAGGCAAGATCATTGTTCCGGCAACAGAGGCGGAAAATTTTGCCC GTGTGTTGGCATTAGCGGAAAATTATCAACCGCTGTATGCCGCATTGGGCTTGCATCCTGGTATGTTGGA AAAACATAGCGATGTGTCTCTTGAGCAGCTACAGCAGGCGCTGGAAAGGCGTCCGGCGAAGGTGGTGGCG GTGGGGGAGATCGGTCTGGATCTCTTTGGCGACGATCCGCAATTTGAGAGGCAGCAGTGGTTACTCGACG AACAACTGAAACTGGCGAAACGCTACGATCTGCCGGTGATCCTGCATTCACGGCGCACGACAAACT GGCGATGCATCTTAAACGCCACGATTTACCGCGCACTGGCGTGGTTCACGGTTTTTCCGGCAGCCTGCAA CAGGCCGAACGGTTTGTACAGCTGGGCTACAAAATTGGCGTAGGCGGTACTATCACCTATCCACGCGCCA GTAAAACCCGCGATGTCATCGCAAAATTACCGCTGGCATCGTTATTGCTGGAAACCGACGCGCCGGATAT GCCGCTCAACGGTTTTCAGGGGCAGCCTAACCGCCCGGAGCAGGCTGCCCGTGTGTTCGCCGTGCTTTGC

ATTTATCCGCCGAAGCCATTACATAGTATCGCGCTTGGCAGTCGGATTCCGGCGCTGCGTGAAGACTATA AACTTGGGCGTTTATGCGGTCGTTATTTCCTCGCCACGGTTGGCAAGCTATTAACTGAAAAAGCGCCGCT TACCCGCCATCTGGTGCCAGTGGTGACGCCGGAATCGATTGTCATTCCGCCTGCGCCAGTCGCCAACGAT ACGCTGGTTGCCGAAGTGAGCGACGCTCCGCAGGCGAACGACCCGACATTTAACAATGAGGATCTGGCTT GATTTGCCGTTTTATCGACACCCACTGCCATTTTGATTTCCCGCCGTTTAGTGGCGATGAAGAGGCCAGC CTGCAACGCGCGGCACAAGCGGGCGTAGGCAAGATCATTGTTCCGGCAACAGAGGCGGAAAATTTTGCCC GTGTGTTGGCATTAGCGGAAAATTATCAACCGCTGTATGCCGCATTGGGCTTGCATCCTGGTATGTTGGA AAAACATAGCGATGTGTCTCTTGAGCAGCTACAGCAGGCGCTGGAAAGGCGTCCGGCGAAGGTGGTGGCG GTGGGGGAGATCGGTCTGGATCTCTTTGGCGACGATCCGCAATTTGAGAGGCAGCAGTGGTTACTCGACG AACAACTGAAACTGGCGAAACGCTACGATCTGCCGGTGATCCTGCATTCACGGCGCACGACAAACT GGCGATGCATCTTAAACGCCACGATTTACCGCGCACTGGCGTGGTTCACGGTTTTTCCGGCAGCCTGCAA CAGGCCGAACGGTTTGTACAGCTGGGCTACAAAATTGGCGTAGGCGGTACTATCACCTATCCACGCGCCA GTAAAACCCGCGATGTCATCGCAAAATTACCGCTGGCATCGTTATTGCTGGAAACCGACGCGCCGGATAT GCCGCTCAACGGTTTTCAGGGGCAGCCTAACCGCCCGGAGCAGGCTGCCCGTGTGTTCGCCGTGCTTTGC GAGTTGCGCCGGGAACCGGCGGATGAGATTGCGCAAGCGTTGCTTAATAACACGTATACGTTGTTTAACG TGCCGTAGGCCGGATAAGGCGTTCACGCCGCATCCGGCAGTTGGCGCACAATGCCTGATGCGACGCTTAA CGCGTCTTATCATGCCTACAGGTTTGTGCCGAACCGTAGGCCGGATAAGGCGTTCACGCCGCATCCGGCA GTTGGCGCACAATGCCTGATGCGACGCTTGTCGCGTCTTATCATGCCTACAAGTCTGTGCCGAACCGTAG GCCGGATAAGGCGTTCACGCCGCATCCGGCAGTCGGCGCATAATGCCTGATGCGACGCTTGTCGCGTCTT ATCATGCCTACAGGTTTGTGCCGAACCGTAGGCCGGATAAGGCGTTCGCGCCGCATCCGGCAGTTGGCGC ACAATGCCTGATGCGACGCTTGACGCGTCTTATCAGGCCTACAAGTCTGTGCCGAACCGTAGGCCGTATC CGGCATGTCACAAATAGAGCGCCGGAAATATCAACCGGCTCACCCCGCGCACCTTTAACGCATCAGCCAA CGGCTCAACGTCTTCCGGCGTGGCGCTCGCCCAGCTTTGCGCCTCGCCATACACGCCGTGGGCATGAAAC GCGTTCAGGCGTACCGGAACATCGCCGAGTCCCTTGATAAACGCCGCCAGTTCTTCGATGTGTTGCAAAT AATCCACCTGGCCAGGGATCACCAGCAAACGCAGTTCCGCCAGCTTGCCGCGCTCTGCCAGCAAATAGAT GCTGCGCTTAATCTGCTGATTATCGCGTCCGGTGAGTTGTTGATGACATTCGCTCCCCCACGCTTTGAGA TCGAGCATTGCGCCGTCGCACACCGGGAGCAATTTTTCCCAGCCGGTTTCGCTCAACATGCCGTTACTGT CCACCAGACAGGTGAGATGGCGCAGTTGCGGATCGTTTTTGATAGCAGTAAACAGCGCCACCACAAACGG CAGCTGGGTCGTGGCTTCACCGCCACTCACCGTTATCCCTTCGATAAACAGCACTGCTTTGCGGACATGG CTAAGCACTTCGTCCACGCTCATGGATTGCGCCATGGGCGTGGCATGTTGCGGACACCTCTTCAGG TATCACACTGCTCGCAAACCACAGCGTTCCACACCACTTTGCCGTCAACAATCTGCAACGCCTGATGCGG ACACTGTGGCACTCCCCACAGTCATTGCAACGTCCCATCGTCCACGGATTGTGACAGTTTTTGCAG CGCAGATTGCAGCCCTGCAAAAACAGAGCCAGACGACTGCCTGGCCCGTCAACGCAGGAGAAGGGGATAA TCTTACTGACTAAAGCGCATCTGCTGTTCATGGCTTATCACGCGCGGCTGGCGTTCCAGAATACGAGTGT TGCGGCTTCTTCGCCCAGGTGGTGTTGGTGCGTGAACCTTCGGCGCGATATTTTTCTAAATC CGACAAACGCACCATATAACCGGTAACGCGAACCAGATCGTTACCGCTGACATTGGCGGTAAATTCACGC ATTCCGGCTTTAAAGGCACCGAGGCAAAGCTGTACCAGTGCCTGCGGGTTACGTTTGATGGTTTCGTCGA Gene Discovery

Prokaryotes: 10 kb DNA 3 m. RNAs 9 proteins Eukaryotes: 10 kb DNA Unprocessed

Prokaryotes: 10 kb DNA 3 m. RNAs 9 proteins Eukaryotes: 10 kb DNA Unprocessed m. RNA Processed m. RNA 1 protein

Two Approaches • Ab initio – Based exclusively on computational models – Error prone,

Two Approaches • Ab initio – Based exclusively on computational models – Error prone, esp. for eukaryotes – Generally requires manual clean up • Comparative – Find genes corresponding to sequenced c. DNAs – Find the genes already predicted for a closely related organism • If you can. . . use both strategies

Attributes that prove useful for gene prediction Begin with a start codon End with

Attributes that prove useful for gene prediction Begin with a start codon End with a stop codon Have a length divisible by 3 ORF Open Reading Frame Splice sites Tend to have a species specific codon usage Exhibit even higher order biases in composition Tend to be more conserved between organisms than non -coding regions

Detecting Signal Amid the Noise Each sequence can be translated in each of 6

Detecting Signal Amid the Noise Each sequence can be translated in each of 6 reading frames, 3 for the sequenced strand 3 for the reverse complement. There are far more open reading frames than there are genes. How do we know which reading frame contains real genes?

Organism-specific Composition Biases

Organism-specific Composition Biases

Codon usage in the E. coli K-12 and H. influenzae genomes 51. 8%GC coding

Codon usage in the E. coli K-12 and H. influenzae genomes 51. 8%GC coding 38. 1%GC coding Preference for GGC glycine codons Preference for GGU glycine codons

Gene Discovery using Markov Models and HMMs Example of a 1 st order Markov

Gene Discovery using Markov Models and HMMs Example of a 1 st order Markov model for gene prediction: The probability that base X is part of a coding region depends only on the base immediately preceding X. AX, TX, CX, GX How frequently does AX occur in a coding region vs. a non-coding region? A 5 th order model: AAAAAX, AAAATX, AAAACX, … GGGGGX

Model Order – which is best? • In general, higher order models better describe

Model Order – which is best? • In general, higher order models better describe the properties of real genes, but training higher order models requires more data and the training sets are limiting. • The probabilities of rare sequences in higher order models can be low enough that the model performs worse.

Gene Prediction Models based on Markov Chains Basic Method: • Build at least 6

Gene Prediction Models based on Markov Chains Basic Method: • Build at least 6 submodels (one for each reading frame) for coding regions and 1 for noncoding • Find ORFs -Start, Stop, mod(3) • Score each ORF by calculating the probability that it was generated by each model. Choose the model with the highest probability – if it exceeds a user-specified threshold, you have a gene. Two popular applications: GLIMMER, Gene. Mark Hidden Markov Models add modeling the gene boundaries as transitions between “hidden” states.

GLIMMER Reference: A. L. Delcher, D. Harmon, S. Kasif, O. White and S. L.

GLIMMER Reference: A. L. Delcher, D. Harmon, S. Kasif, O. White and S. L. Salzberg. Improved microbial gene identificaton with GLIMMER NAR, 1999, Vol. 27, No. 23, pp. 4636 -4641. • GLIMMER can be “trained” using the genome itself Finds the longest ORFs in the genome and assumes they are real genes to estimate emission probabilities • Interpolated Markov model • Not necessary to “fix” the order of the model • Analysis of 10 microbial genomes: GLIMMER 2 finds 97. 4 -99. 7% of annotated genes PLUS another 7 -25% !!! • GLIMMER 3 has a much lower False Positive Rate Specificity vs. Sensitivity

W. H. Majoros, M. Pertea, and S. L. Salzberg. Tigr. Scan and Glimmer. HMM:

W. H. Majoros, M. Pertea, and S. L. Salzberg. Tigr. Scan and Glimmer. HMM: two open-source ab initio eukaryotic gene-finders http: //www. tigr. org/software/Glimmer. HMM/index. shtml Sensitivity: TP/(TP+FN) How much of what you hoped to detect did you get? Specificity: TP/(TP+FP) How much of what you detected is real?