HMMs in gene prediction and similarity searches
What is an HMM? (Eddy 2004)
• States
• Transition probabilities
• Emission probabilities
What is hidden? (Eddy 2004)
• The state path. The probability of a path is the product of its transition and emission probabilities, usually kept in log space:
log(1 × 0.25 × 0.9 × 0.25 × … × 0.9 × 0.4) = -41.22
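The arithmetic above can be sketched directly: the log probability of one state path is the sum of the log transition and log emission terms (the raw product underflows quickly for long sequences). The two-state model below is a hypothetical toy, not the slide's example, so the numeric result differs from -41.22:

```python
import math

def path_log_prob(path, seq, trans, emit, start):
    """Log probability of one state path: the sum of log start,
    transition, and emission probabilities along the path."""
    logp = math.log(start[path[0]]) + math.log(emit[path[0]][seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(trans[path[i - 1]][path[i]])
        logp += math.log(emit[path[i]][seq[i]])
    return logp

# Hypothetical toy model: exon state "E", 5' splice-site state "5"
trans = {"E": {"E": 0.9, "5": 0.1}, "5": {"E": 0.0, "5": 1.0}}
emit = {"E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0}}
start = {"E": 1.0, "5": 0.0}
print(path_log_prob("EE5", "CAG", trans, emit, start))
```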
Using HMMs
• Given the parameters of the model, compute the probability of a particular output sequence. This problem is solved by the forward algorithm.
• Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the Viterbi algorithm.
• Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities; in other words, train the parameters of the HMM from a dataset of sequences. This problem is solved by the Baum-Welch algorithm.
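The second problem can be sketched as a short dynamic program. This is a minimal Viterbi implementation in log space; the fair/biased-coin model at the bottom is hypothetical, included only to exercise the code:

```python
import math

def viterbi(seq, states, start, trans, emit):
    """Most likely hidden-state path for seq, by dynamic programming
    in log space with a traceback through back-pointers."""
    NEG = float("-inf")
    def lg(p):
        return math.log(p) if p > 0 else NEG
    V = [{s: lg(start[s]) + lg(emit[s][seq[0]]) for s in states}]
    back = []
    for x in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: V[-1][r] + lg(trans[r][s]))
            col[s] = V[-1][best_prev] + lg(trans[best_prev][s]) + lg(emit[s][x])
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])  # best final state
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

# Hypothetical fair ("F") vs. biased ("B") coin model
states = ("F", "B")
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}
print(viterbi("TTHHHHHHHHTT", states, start, trans, emit))
```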
Profile Hidden Markov Models
• Statistical model of multiple sequence alignments
• Position-specific description of the level of conservation and the probabilities of observing each type of amino acid (or nucleotide) at that position
• Protein domain alignments (Pfam, TIGRFAMs, …)
• Regulator binding site alignments
Simple profile HMM – no gaps
Emission probabilities are determined from the distribution of amino acids at each site of the alignment.
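The per-site emission estimates can be sketched as column counts over an ungapped alignment. The toy DNA alignment below is hypothetical; the pseudocount is a common smoothing choice so unseen residues keep a small nonzero probability:

```python
from collections import Counter

def column_emissions(alignment, alphabet="ACGT", pseudocount=1.0):
    """Emission probabilities for each column of an ungapped alignment:
    residue counts at each site, smoothed with a pseudocount."""
    ncols = len(alignment[0])
    profile = []
    for i in range(ncols):
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profile

# Hypothetical toy alignment of a short binding site
aln = ["ACGT", "ACGA", "ACGT", "TCGT"]
profile = column_emissions(aln)
print(profile[0]["A"])  # → 0.5 (3 A's of 4 sequences, plus smoothing)
```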
Allowing gaps in a position-specific way
A sequence must be allowed to contain one or more residues not found in the model (Insert states) and also to be missing regions that are present in the model (Delete states).
Pfam
• Database of protein domains and families, available as multiple alignments and HMMs
• Pfam-A is curated; Pfam-B is automatically generated.
A sample Pfam entry: MCPsignal
Pfam – seed alignment
Pfam – scoring members
• Trusted cut-off: bit score of the lowest-scoring match included in the full alignment
• Noise cut-off: bit score of the highest-scoring match not included in the full alignment
• Gathering cut-off
[Two slides of raw genomic DNA sequence (roughly 2.5 kb): which of these bases are genes?]
Gene Discovery
Prokaryotes: 10 kb DNA → 3 mRNAs → 9 proteins
Eukaryotes: 10 kb DNA → unprocessed mRNA → processed mRNA → 1 protein
Two Approaches
• Ab initio
– Based exclusively on computational models
– Error prone, especially for eukaryotes
– Generally requires manual clean-up
• Comparative
– Find genes corresponding to sequenced cDNAs
– Find the genes already predicted for a closely related organism
• If you can, use both strategies.
Attributes that prove useful for gene prediction
Genes:
• Begin with a start codon
• End with a stop codon
• Have a length divisible by 3 (ORF: Open Reading Frame)
• Contain splice sites
• Tend to have a species-specific codon usage
• Exhibit even higher-order biases in composition
• Tend to be more conserved between organisms than non-coding regions
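The first three attributes define an ORF and can be checked mechanically. This is a minimal single-strand ORF scanner (the other three frames come from running it on the reverse complement); the 300 bp default threshold is an illustrative choice, not a standard:

```python
def find_orfs(seq, min_len=300):
    """Scan the three forward reading frames for ATG..stop stretches
    whose length is divisible by 3 (by stepping in whole codons)."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # first start codon in frame
            elif codon in stops and start is not None:
                if i + 3 - start >= min_len:   # include the stop codon
                    orfs.append((start, i + 3))
                start = None                   # reset for the next ORF
    return orfs

print(find_orfs("ATGAAATAA", min_len=9))  # → [(0, 9)]
```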
Detecting Signal Amid the Noise
Each sequence can be translated in each of 6 reading frames: 3 on the sequenced strand and 3 on the reverse complement. There are far more open reading frames than there are genes. How do we know which reading frames contain real genes?
Organism-specific Composition Biases
Codon usage in the E. coli K-12 and H. influenzae genomes
• E. coli K-12: 51.8% GC in coding regions; preference for GGC glycine codons
• H. influenzae: 38.1% GC in coding regions; preference for GGU glycine codons
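Tables like the one above come from simple codon counting over known coding sequences. A minimal sketch, with hypothetical toy "genes" standing in for a real coding-sequence set:

```python
from collections import Counter

def codon_usage(coding_seqs):
    """Relative codon frequencies across a set of coding sequences,
    each assumed to be in frame with length divisible by 3."""
    counts = Counter()
    for seq in coding_seqs:
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Hypothetical toy coding sequences
usage = codon_usage(["ATGGGCGGCTAA", "ATGGGTTAA"])
print(usage["GGC"])  # GGC appears 2 times out of 7 codons
```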
Gene Discovery using Markov Models and HMMs
Example of a 1st-order Markov model for gene prediction: the probability that base X is part of a coding region depends only on the base immediately preceding X (AX, TX, CX, GX). How frequently does AX occur in a coding region vs. a non-coding region?
A 5th-order model conditions on the five preceding bases: AAAAAX, AAAATX, AAAACX, …, GGGGGX.
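Training such a model amounts to counting how often each base follows each context in known coding (or non-coding) sequence. A minimal sketch with pseudocount smoothing; the training strings in the test are toy data:

```python
from collections import defaultdict

def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training
    sequences, smoothing with a pseudocount for unseen bases."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values()) + 4 * pseudocount
        model[ctx] = {b: (nxt.get(b, 0) + pseudocount) / total
                      for b in "ACGT"}
    return model
```

Separate models would be trained on coding and non-coding training sets, one per reading frame in practice.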
Model Order – which is best?
• In general, higher-order models better describe the properties of real genes, but training them requires more data, and training sets are often limited.
• The estimated probabilities of rare sequences in higher-order models can be unreliable enough that the model performs worse.
Gene Prediction Models based on Markov Chains
Basic method:
• Build at least 6 submodels (one for each reading frame) for coding regions and 1 for noncoding regions
• Find ORFs: start codon, stop codon, length mod 3 = 0
• Score each ORF by calculating the probability that it was generated by each model. Choose the model with the highest probability; if it exceeds a user-specified threshold, you have a gene.
Two popular applications: GLIMMER, GeneMark.
Hidden Markov models additionally model the gene boundaries as transitions between "hidden" states.
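The scoring step can be sketched as a log-likelihood comparison. This simplified version uses one coding model and one non-coding model (real tools compare all six frame-specific submodels); the floor probability for unseen contexts is an assumed smoothing choice:

```python
import math

def score(seq, model, order=5, floor=1e-6):
    """Log probability of seq under a k-th order Markov model;
    unseen contexts fall back to a small floor probability."""
    logp = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        logp += math.log(model.get(ctx, {}).get(base, floor))
    return logp

def is_gene(orf, coding_model, noncoding_model, threshold=0.0, order=5):
    """Call the ORF a gene if the coding model explains it better than
    the non-coding model by more than `threshold` log-units."""
    return (score(orf, coding_model, order)
            - score(orf, noncoding_model, order)) > threshold
```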
GLIMMER
Reference: A. L. Delcher, D. Harmon, S. Kasif, O. White and S. L. Salzberg. Improved microbial gene identification with GLIMMER. NAR, 1999, Vol. 27, No. 23, pp. 4636-4641.
• GLIMMER can be "trained" using the genome itself: it finds the longest ORFs in the genome and assumes they are real genes in order to estimate emission probabilities
• Interpolated Markov model: not necessary to "fix" the order of the model
• Analysis of 10 microbial genomes: GLIMMER 2 finds 97.4-99.7% of annotated genes, plus another 7-25%!
• GLIMMER 3 has a much lower false-positive rate (specificity vs. sensitivity)
W. H. Majoros, M. Pertea, and S. L. Salzberg. TigrScan and GlimmerHMM: two open-source ab initio eukaryotic gene-finders.
http://www.tigr.org/software/GlimmerHMM/index.shtml
Sensitivity: TP/(TP+FN) – how much of what you hoped to detect did you get?
Specificity: TP/(TP+FP) – how much of what you detected is real?
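The two ratios above are one-liners; the example counts are made-up numbers, only to show the trade-off:

```python
def sensitivity(tp, fn):
    """TP/(TP+FN): how much of what you hoped to detect did you get?"""
    return tp / (tp + fn)

def specificity(tp, fp):
    """TP/(TP+FP): how much of what you detected is real? (This is the
    gene-finding literature's usage; elsewhere this ratio is called
    precision.)"""
    return tp / (tp + fp)

# Hypothetical counts: 97 true positives, 3 missed genes, 25 false calls
print(sensitivity(97, 3), specificity(97, 25))
```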