Genome Annotation Continued • This week’s lab. • Genome annotation - web-based databases for assigning gene function.
Last week’s lab • E-value • Score • BLASTX • Taxonomy
Lab • Sequence assembly and analysis • Assemble individual sequence reads • Phred = 30 - good or bad?
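The Phred question can be checked with the standard definition Q = -10 * log10(P), where P is the probability that a base call is wrong. The sketch below is only an illustration; the function names are made up for this example and are not part of any assembly tool.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert a base-call error probability back to a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

By this definition Phred = 30 corresponds to about 1 expected error per 1000 bases (99.9% accuracy), which is generally treated as good quality.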
Linking Protein Sequence, Structure, and Function • Protein sequences • Protein domains - CDD: conserved functional domains in proteins, each represented by a PSSM • PSI-BLAST, RPS-BLAST, CDART • 3D domains • Source: NCBI Field Guide
Position Specific Substitution Rates • Weakly conserved serine • Active site serine
Position Specific Score Matrix (PSSM) • Columns are alignment positions 206-224; rows are the 20 amino acids (A R N D C Q E G H I L K M F P S T W Y V). Excerpt (rows A, P, S, T, W, Y):
Pos 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
Res   D   G   V   I   S   S   C   N   G   D   S   G   G   P   L   N   C   Q   A
A     0  -2  -1  -3  -2   4  -4  -2  -2  -5  -2  -3  -3  -2  -4  -1   0   0  -1
P     1  -2  -4  -5  -5  -1  -7  -5  -6  -5  -4  -6  -6   9  -6  -6  -4   0  -3
S     0  -2   0  -3   1   4  -4  -1  -3  -4   7  -4  -2  -4  -6  -2  -1  -1   0
T    -1  -1  -2   0  -3   3  -4  -3  -5  -4  -2  -5  -4  -4  -5  -1   0  -1  -2
W    -6   0  -6  -1  -7  -6  -5  -3  -6  -8  -6  -6  -6  -7  -5  -6  -5  -3  -2
Y    -4  -6  -4  -4  -5  -5   0  -4  -6  -7  -5  -7  -7  -7  -4  -1   0  -3  -2
• Serine is scored differently in these two positions. • Active site nucleophile.
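To make "position specific" concrete, here is a minimal sketch of a PSSM lookup. It uses only the A, P, S, and T rows that can be read from the slide's matrix for positions 206-224. The names `PSSM_ROWS`, `FIRST_POSITION`, and `pssm_score` are invented for this illustration, and a real PSSM would carry a row for every amino acid.

```python
FIRST_POSITION = 206  # first alignment position covered by the slide's matrix

# Scores for positions 206..224, copied from the slide (four rows only).
PSSM_ROWS = {
    "A": [0, -2, -1, -3, -2, 4, -4, -2, -2, -5, -2, -3, -3, -2, -4, -1, 0, 0, -1],
    "P": [1, -2, -4, -5, -5, -1, -7, -5, -6, -5, -4, -6, -6, 9, -6, -6, -4, 0, -3],
    "S": [0, -2, 0, -3, 1, 4, -4, -1, -3, -4, 7, -4, -2, -4, -6, -2, -1, -1, 0],
    "T": [-1, -1, -2, 0, -3, 3, -4, -3, -5, -4, -2, -5, -4, -4, -5, -1, 0, -1, -2],
}


def pssm_score(residue, position):
    """Look up the score for placing `residue` at alignment `position`."""
    return PSSM_ROWS[residue][position - FIRST_POSITION]


if __name__ == "__main__":
    # The same residue earns different scores depending on where it sits.
    for position in (211, 216):
        scores = {res: pssm_score(res, position) for res in "APST"}
        print(position, scores)
```

At position 211 serine, alanine, and threonine all score well (4, 4, 3), so substitutions are tolerated there, whereas at position 216 only serine scores positively (7). A single substitution matrix cannot express this difference because it gives a serine match the same score at every position.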
Hidden Markov Models • A statistical model that can be applied to any system that can be represented as a series of discrete states. – Applies to protein and nucleotide (nt) sequences. • Can be thought of much like the PSSMs that PSI-BLAST builds after several iterations. • Are used in gene finding and protein profile analysis.
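As a rough illustration of the discrete-state idea, the sketch below decodes a nucleotide string with a two-state HMM using the Viterbi algorithm. The states ("gc_rich", "at_rich") and all probabilities are made-up teaching values, not parameters from any real gene finder or from TIGRFAMs or Pfam; profile HMMs for proteins use many more states but rely on the same kind of calculation.

```python
import math

# Hypothetical two-state model; every number here is an assumption for illustration.
STATES = ("gc_rich", "at_rich")
START = {"gc_rich": 0.5, "at_rich": 0.5}
TRANS = {
    "gc_rich": {"gc_rich": 0.9, "at_rich": 0.1},
    "at_rich": {"gc_rich": 0.1, "at_rich": 0.9},
}
EMIT = {
    "gc_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "at_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(seq):
    """Return the most probable hidden-state path for a nucleotide string."""
    if not seq:
        return []
    # Log probabilities avoid numerical underflow on long sequences.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for symbol in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][symbol])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("GCGCGCATATAT"))
```

The observed letters are visible, but the state that produced each one is hidden. Viterbi picks the single most probable state path given the model.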
Uses of HMMs in protein function analysis. • TIGRFAMs – Strive to annotate the function of an entire protein. • Pfams – Strive to annotate domains of proteins.
Homologs, orthologs, and paralogs. • Homologous genes are genes that share a common evolutionary ancestor. – Orthologs are genes found in different organisms that arose from a common ancestral gene by speciation. – Paralogs are genes found in the same organism that arose from a common ancestral gene by duplication. The duplication could have occurred in that species or earlier, and paralogs have often diverged in function.
Orthologs may differ in function!
TIGRFAM • Curated such that proteins in a TIGRFAM should have the same function if they are equivalogs. • Member proteins are similar over their entire length. • Equivalog family = all proteins that are conserved with respect to function since their last common ancestor. • Superfamily - all proteins with homology, but they may have different biological functions. • Subfamily - incomplete set of proteins with homology; they may have diverse biological functions.
Pfam • More likely to describe a protein domain than a whole protein family. – Pfams will not overlap. • Cross-listed on the TIGRFAM page. • ~70% of proteins in Swiss-Prot have a Pfam match.
COGs • Clusters of Orthologous Groups • Built by pairwise comparison of proteins from many bacterial genomes to identify orthologs. • Suggests function only (book example).
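One common way to turn pairwise comparisons into candidate ortholog pairs is the reciprocal best hit test, sketched below for two genomes. This is a simplification and an assumption for illustration: COG construction links best hits across at least three lineages, and the gene names, bit scores, and function names here are invented.

```python
def best_hit(hits):
    """Return the subject with the highest bit score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pair genes whose best hits point at each other in both search directions."""
    best_ab = {query: best_hit(hits) for query, hits in a_vs_b.items()}
    best_ba = {query: best_hit(hits) for query, hits in b_vs_a.items()}
    return sorted(
        (a, b)
        for a, b in best_ab.items()
        if b is not None and best_ba.get(b) == a
    )


# Toy search results: query -> list of (subject, bit score). All values are made up.
A_VS_B = {
    "geneA1": [("geneB1", 250.0), ("geneB2", 90.0)],
    "geneA2": [("geneB2", 180.0)],
    "geneA3": [("geneB2", 60.0)],
}
B_VS_A = {
    "geneB1": [("geneA1", 245.0)],
    "geneB2": [("geneA2", 175.0), ("geneA3", 58.0)],
}

if __name__ == "__main__":
    print(reciprocal_best_hits(A_VS_B, B_VS_A))
```

Here geneA3 is left unpaired because geneB2 prefers geneA2. A reciprocal best hit is evidence of orthology, not proof of shared function, which is why a cluster of this kind suggests function only.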
Gene Ontology (GO) • “The goal of the Gene Ontology project is to produce a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing.” • Biological process, Molecular function, Cellular component
Literature Curation • Saccharomyces Genome Database (SGD), for example. • Manual curation of the literature for experimental evidence linking function to annotation.
Additional databases • SMART - Simple Modular Architecture Research Tool. • PROSITE - Protein motifs. • ProDom - A database based on PSI-BLAST PSSMs. • InterPro - A database that brings together many of the above databases so that you can search them all at once. • Others.
CDD • Conserved Domain Database - linking all of this information together. • Consists of SMART, Pfam, and COGs (KOGs). Searchable directly, and searched automatically by BLAST. • Linked to CDART - allows the identification of proteins with a similar domain architecture.
Bottom line about databases • They are useful tools for assigning possible functions. • Be careful about annotations. – Example: proteins in the same COG can be orthologs that have evolved different functions. – Many annotations are not backed up by experimental data. – Some databases are automated and have not been checked for accuracy.
Annotation cannot be guaranteed without experimental evidence. • Functional genomics