Proteomics and Protein Bioinformatics: Functional Analysis of Protein Sequences
Anastasia Nikolskaya
Assistant Professor (Research)
Protein Information Resource
Department of Biochemistry and Molecular Biology
Georgetown University Medical Center

Overview
• Role of bioinformatics/computational biology in proteomics research
• Functional annotation of proteins = assigning the correct name, describing the function, or predicting the function for a sequence
• Classification of proteins = grouping them into families of related sequences
• Annotating a family helps the annotation of its members
Sequence -> function

Bioinformatics: an emerging field where biological/biomedical and mathematical/computational disciplines converge
• Computational biology – study of biological systems using computational methods
• Bioinformatics – development of computational tools and approaches

Bioinformatics as related to proteins
1. Sequence analysis
• Genome projects -> Gene prediction
• Protein sequence analysis
• Comparative genomics
• Protein sequence and family databases (annotation and classification)
2. Structural genomics
3. Data analysis and integration for:
• Large scale gene expression analysis
• Protein-protein interaction
• Intracellular protein localization
4. Integration of all data on proteins to reconstruct pathways and cellular systems, make predictions and discover new knowledge

Functional Genomics and Proteomics
• Proteomics studies biological systems based on global knowledge of protein sets (proteomes).
• Functional genomics studies biological functions of proteins, complexes, and pathways based on the analysis of genome sequences. Includes functional assignments for protein sequences.
Genome -> Transcriptome -> Proteome -> Metabolome

Proteomics
• Data: Gene expression profiling. Genome-wide analyses of gene expression (DNA microarrays/chips)
• Data: Protein-protein interaction (yeast two-hybrid systems)
• Data: Structural genomics. Determine 3D structures of all protein families
• Data: Genome projects (sequencing)
Bioinformatics: analysis and integration of these data

Bioinformatics and Genomics/Proteomics
[Figure labels: Sequence, Other Data; Unknown Genes; Pathways and Regulatory Circuits; Putative Functional Groups; Hypothetical Cell]

What’s in it for me?
• When an experiment yields a sequence (or a set of sequences), we need to find out as much as we can about this protein and its possible function from available data
• Especially important for poorly characterized or uncharacterized (“hypothetical”) proteins
• More challenging for large sets of sequences generated by large-scale proteomics experiments
• The quality of this assessment is often critical for interpreting experimental results and making hypotheses for future experiments
Sequence -> function

Work with protein sequence, not DNA sequence
[Figure: genomic DNA sequence (promoter, 5' UTR, exons 1-3, introns, 3' UTR) -> gene recognition -> protein sequence; from the protein sequence: structure determination, family classification, function analysis; further labels: protein structure, protein family, molecular evolution, gene network, metabolic pathway]

Most new proteins come from genome sequencing projects
• Mycoplasma genitalium - 484 proteins
• Escherichia coli - 4,288 proteins
• S. cerevisiae (yeast) - 5,932 proteins
• C. elegans (worm) ~ 19,000 proteins
• Homo sapiens ~ 30,000 proteins
... and have unknown functions

Advantages of knowing the complete genome sequence
• All encoded proteins can be predicted and identified
• The missing functions can be identified and analyzed
• Peculiarities and novelties in each organism can be studied
• Predictions can be made and verified

The changing face of protein science
20th century
• Few well-studied proteins
• Mostly globular with enzymatic activity
• Biased protein set
21st century
• Many “hypothetical” proteins
• Various, often with no enzymatic activity
• Natural protein set
Credit: Dr. M. Galperin, NCBI

Properties of the natural protein set
• Unexpected diversity of even common enzymes (analogous, paralogous enzymes)
• Conservation of the reaction chemistry, but not the substrate specificity
• Functional diversity in closely related proteins
• Abundance of new structures
Credit: Dr. M. Galperin, NCBI

Organism         Characterized     Characterized    Unknown,     Unknown,
                 experimentally    by similarity    conserved    no similarity
E. coli          2046              1083             285          874
M. jannaschii    97                1025             211          411
S. cerevisiae    3307              1055             1007         966
H. sapiens       10189             10901            2723         7965
From Koonin and Galperin, 2003, with modifications
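The counts above are easier to compare as fractions. The following is a minimal sketch that only re-tabulates the numbers given on this slide; the function name and the choice to pool the two "unknown" columns are illustrative assumptions, not part of the original analysis.

```python
# Counts copied from the slide (Koonin and Galperin, 2003, with modifications).
# Column order: characterized experimentally, characterized by similarity,
# unknown but conserved, unknown with no similarity.
COUNTS = {
    "E. coli": (2046, 1083, 285, 874),
    "M. jannaschii": (97, 1025, 211, 411),
    "S. cerevisiae": (3307, 1055, 1007, 966),
    "H. sapiens": (10189, 10901, 2723, 7965),
}


def unknown_fraction(counts):
    """Fraction of proteins that fall in the two 'unknown' columns."""
    total = sum(counts)
    return (counts[2] + counts[3]) / total


for organism, counts in COUNTS.items():
    print(f"{organism}: {sum(counts)} proteins, "
          f"{unknown_fraction(counts):.1%} of unknown function")
```

For E. coli the four columns add up to the 4,288 proteins quoted earlier in the deck, which is a useful sanity check on the table.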

Functional annotation of proteins (protein sequence databases)
• Protein sequences from new genomes -> automatic assignment based on sequence similarity: gene name, protein name, function
• To avoid mistakes, human intervention is needed (manual annotation)
• Best annotated protein databases: SwissProt, PIR-1
• Now part of UniProt – Universal Protein Resource

Objectives of functional analysis for different groups of proteins
• Experimentally characterized
– Up-to-date information, manually annotated (curated database!)
– Problems: misinterpreted experimental results (e.g. suppressors, cofactors)
• “Knowns” = Characterized by similarity (closely related to experimentally characterized)
– Make sure the assignment is plausible
• Function can be predicted
– Extract maximum possible information
– Avoid errors and overpredictions
– Fill the gaps in metabolic pathways
• “Unknowns” (conserved or unique)

Problems in functional assignments for “knowns”
• Previous low quality annotations lead to propagation of mistakes
• Biologically senseless annotations:
– Arabidopsis: separation anxiety protein-like
– Helicobacter: brute force protein
– Methanococcus: centromere-binding protein
– Plasmodium: frameshift
• Propagated mistakes of sequence comparison

Problems in functional assignments for “knowns”
• Multi-domain organization of proteins
[Figure: a new sequence containing an ACT domain is searched with BLAST; the top hits are proteins that combine a chorismate mutase domain with an ACT domain]
• In BLAST output, top hits are to chorismate mutases -> the name “chorismate mutase” is automatically assigned to the new sequence. ERROR!
• The error can be propagated: the protein gets an erroneous EC number, is assigned to an erroneous pathway, etc.
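One common guard against this kind of error is to require that an alignment span most of both the query and the database hit before a name is transferred. The sketch below illustrates that idea only; the `Hit` fields, the 0.8 threshold, and all coordinates are invented for illustration and do not come from the slides or from any particular BLAST output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise database hit. All values used below are invented."""
    subject_name: str
    query_start: int      # 1-based, inclusive alignment coordinates
    query_end: int
    subject_start: int
    subject_end: int
    subject_length: int


def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a sequence covered by the aligned region."""
    return (end - start + 1) / length


def safe_to_transfer_name(hit: Hit, query_length: int,
                          min_coverage: float = 0.8) -> bool:
    """Accept the hit's name only if the alignment spans most of BOTH sequences."""
    q_cov = coverage(hit.query_start, hit.query_end, query_length)
    s_cov = coverage(hit.subject_start, hit.subject_end, hit.subject_length)
    return q_cov >= min_coverage and s_cov >= min_coverage


QUERY_LENGTH = 90  # hypothetical new sequence consisting of a single ACT domain

# The bifunctional enzyme matches only through its C-terminal ACT domain.
multi_domain_hit = Hit("chorismate mutase", 3, 88, 290, 375, 380)
# A stand-alone ACT-domain protein matches end to end.
single_domain_hit = Hit("ACT domain protein", 5, 90, 4, 89, 95)

for hit in (multi_domain_hit, single_domain_hit):
    ok = safe_to_transfer_name(hit, QUERY_LENGTH)
    print(f"{hit.subject_name}: {'transfer' if ok else 'do not transfer'} name")
```

The first hit covers nearly all of the query but less than a quarter of the subject, so the check rejects it even though it may score well.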

Problems in functional assignments for “knowns”
• Low sequence complexity (coiled-coil, non-globular regions)
• Enzyme evolution:
– Divergence in sequence and function
– Non-orthologous gene displacement: convergent evolution

Objectives of functional analysis for different groups of proteins
• Experimentally characterized
• “Knowns” = Characterized by similarity (closely related to experimentally characterized)
– Make sure the assignment is plausible
• Function can be predicted
– Extract maximum possible information
– Avoid errors and overpredictions
– Fill the gaps in metabolic pathways
• “Unknowns” (conserved or unique)
– Rank by importance

Functional prediction: Dealing with “hypothetical” proteins
• Computational analysis
– Sequence analysis of the new ORFs
• Structural analysis
– Determination of the 3D structure
• Mutational analysis
• Functional analysis
– Expression profiling
– Tracking of cellular localization

Functional prediction: computational analysis
• Cluster analysis of protein families (family databases)
• Use of sophisticated database searches (PSI-BLAST, HMM)
• Detailed manual analysis of sequence similarities

Using comparative genomics for protein analysis
• Proteins (domains) with different 3D folds are not homologous (unrelated by origin)
• Amino acids that are conserved in divergent proteins within a (super)family are likely to be important for catalytic activity
• Reaction chemistry often remains conserved even when sequence diverges almost beyond recognition
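The second point, that residues conserved across divergent family members are candidates for functional importance, can be illustrated with a per-column conservation count. This is a minimal sketch on an invented toy alignment; real analyses use curated alignments and substitution-aware scores, and the function name and threshold here are assumptions.

```python
from collections import Counter

# Toy alignment, invented for illustration; '-' marks a gap.
ALIGNMENT = [
    "MKHGDLS",
    "MRHGELT",
    "LKHGDVS",
    "MQHG-LS",
]


def conserved_columns(alignment, min_fraction=1.0):
    """Return 0-based column indices whose most common residue is present
    in at least min_fraction of all sequences (gaps count against a column)."""
    n_seqs = len(alignment)
    result = []
    for index, column in enumerate(zip(*alignment)):
        residues = [r for r in column if r != "-"]
        if not residues:
            continue  # all-gap column carries no information
        _, top_count = Counter(residues).most_common(1)[0]
        if top_count / n_seqs >= min_fraction:
            result.append(index)
    return result


print("Invariant columns:", conserved_columns(ALIGNMENT))
print("Columns conserved in at least 75% of sequences:",
      conserved_columns(ALIGNMENT, min_fraction=0.75))
```

Conservation alone does not prove a catalytic role; it only flags positions worth examining, for example against a known structure.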

Using comparative genomics for protein analysis
• Prediction of the 3D fold (if distant homologs have known structures!) and general biochemical function is much easier than prediction of exact biological (or biochemical) function
• Sequence analysis complements structural comparisons and can greatly benefit from them
• Comparative analysis allows us to find subtle sequence similarities in proteins that would not have been noticed otherwise
Credit: Dr. M. Galperin, NCBI

Poorly characterized protein families: only general function can be predicted
• Enzyme activity can be predicted, the substrate remains unknown (ATPases, GTPases, oxidoreductases, methyltransferases, acetyltransferases)
• Helix-turn-helix motif proteins (predicted transcriptional regulators)
• Membrane transporters

Functional prediction: computational analysis
• Phylogenetic distribution
– Wide - most likely essential
– Narrow - probably clade-specific
– Patchy - most intriguing, niche-specific
• Domain association
– Rosetta Stone (for multidomain proteins)
• Gene neighborhood (operon organization)
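The Rosetta Stone idea is that two domains found as separate proteins in one genome, but fused into a single protein in another, are likely to be functionally linked. A minimal sketch follows; the genome names and their protein inventories are hypothetical, and the domain labels (CM, PDT, PDH) are borrowed from the chorismate mutase example used elsewhere in this deck purely as placeholders.

```python
from itertools import combinations


def rosetta_stone_links(proteins_by_genome):
    """Find domain pairs that are fused in some protein and also occur
    as separate single-domain proteins within one genome.

    proteins_by_genome maps a genome name to a list of proteins, each
    protein given as a tuple of domain labels in N- to C-terminal order.
    """
    # Step 1: collect every pair of distinct domains seen in one protein.
    fused_pairs = set()
    for proteins in proteins_by_genome.values():
        for domains in proteins:
            for d1, d2 in combinations(sorted(set(domains)), 2):
                fused_pairs.add(frozenset((d1, d2)))

    # Step 2: keep pairs whose members also exist as stand-alone proteins
    # in the same genome.
    links = set()
    for proteins in proteins_by_genome.values():
        standalone = {domains[0] for domains in proteins if len(domains) == 1}
        for pair in fused_pairs:
            if pair <= standalone:
                links.add(pair)
    return links


# Hypothetical inventories: genome_1 encodes the domains separately,
# genome_2 encodes a CM-PDT fusion protein.
GENOMES = {
    "genome_1": [("CM",), ("PDT",), ("PDH",)],
    "genome_2": [("CM", "PDT"), ("PDH",)],
}

for pair in rosetta_stone_links(GENOMES):
    print("Predicted functional link:", " + ".join(sorted(pair)))
```

Promiscuous domains that fuse with many partners would produce spurious links in real data, so a practical version would need to filter them.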

Using genome context for functional prediction
[Figure: leucine biosynthesis]

Functional Prediction: Role of Structural Genomics
Protein Structure Initiative: Determine 3D Structures of All Proteins
– Family Classification: Organize Protein Sequences into Families, collect families without known structures
– Target Selection: Select Family Representatives as Targets
– Structure Determination: X-Ray Crystallography or NMR Spectroscopy
– Homology Modeling: Build Models for Other Proteins by Homology
– Functional prediction based on structure

Structural Genomics: Structure-Based Functional Assignments
• Methanococcus jannaschii MJ0577 (hypothetical protein)
• Contains bound ATP => ATPase or ATP-mediated molecular switch
• Confirmed by biochemical experiments

Crystal structure is not a function! Credit: Dr. M. Galperin, NCBI

Functional prediction: problem areas
• Identification of protein-coding regions
• Delineation of potential function(s) for distant paralogs
• Identification of domains in the absence of close homologs
• Analysis of proteins with low sequence complexity

Objectives of functional analysis for different groups of proteins
• Experimentally characterized
– Up-to-date information, manually annotated
• “Knowns” = Characterized by similarity (closely related to experimentally characterized)
– Make sure the assignment is plausible
• Function can be predicted
– Extract maximum possible information
– Avoid errors and overpredictions
– Fill the gaps in metabolic pathways
• “Unknowns” (conserved or unique)
– Rank by importance

“Unknown unknowns”
• Phylogenetic distribution
– Wide - most likely essential
– Narrow - probably clade-specific
– Patchy - most intriguing, niche-specific
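The three categories can be made operational from a presence/absence profile across genomes. The sketch below is one possible way to do it; the clade and genome names are hypothetical, and the rule set (at least 80% of genomes for "wide", a single clade for "narrow", everything else "patchy") is an assumption chosen for illustration rather than a definition taken from the slides.

```python
def classify_distribution(presence, clades, wide_fraction=0.8):
    """Label a protein family's phylogenetic distribution.

    presence: set of genome names in which the family was found.
    clades:   dict mapping clade name -> list of genome names.
    """
    all_genomes = [g for members in clades.values() for g in members]
    n_present = sum(1 for g in all_genomes if g in presence)
    clades_hit = [name for name, members in clades.items()
                  if any(g in presence for g in members)]

    if not clades_hit:
        return "absent"
    if n_present / len(all_genomes) >= wide_fraction:
        return "wide"
    if len(clades_hit) == 1:
        return "narrow"
    return "patchy"


# Hypothetical set of ten genomes grouped into three clades.
CLADES = {
    "clade_A": ["a1", "a2", "a3"],
    "clade_B": ["b1", "b2", "b3"],
    "clade_C": ["c1", "c2", "c3", "c4"],
}

examples = {
    "family_1": {"a1", "a2", "a3", "b1", "b2", "b3", "c1", "c2", "c3"},
    "family_2": {"a1", "a2", "a3"},
    "family_3": {"a1", "b2", "c3"},
}
for name, presence in examples.items():
    print(name, "->", classify_distribution(presence, CLADES))
```

A ranking of unknowns by importance could then start from these labels, for instance by putting widely distributed families first.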

Can protein classification help?
• Protein families are real and reflect evolutionary relationships
• Function often follows along the family lines
• Therefore, matching a new protein sequence to a well-annotated and curated family provides information about this new protein and helps predict its function. This is more accurate than comparing the new sequence to individual proteins in a database (search classification database vs search protein database)
To make annotation and functional prediction for new sequences accurate and efficient, we need a “natural” protein classification

Protein Evolution
• Tree of Life & Evolution of Protein Families (Dayhoff, 1978)
• Can build a tree representing evolution of a protein family, based on sequences
• Orthologous Gene Family: Organismal and Sequence Trees Match Well

Protein Evolution
• Homolog
– Common ancestors
– Common 3D structure
– Usually at least some sequence similarity (sequence motifs or closer similarity)
• Ortholog – derived from speciation
• Paralog – derived from duplication
[Figure: ancestral gene A and its descendants in Species 1 and Species 2]
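These definitions translate directly into a rule on a gene tree: two homologs are orthologs if their last common ancestor node is a speciation event, and paralogs if it is a duplication. The sketch below assumes a tree whose internal nodes are already labeled with the event type; the nested-tuple representation and the gene names are hypothetical.

```python
def leaves(node):
    """Return the set of gene names under a node.
    A leaf is a string; an internal node is (event, [children])."""
    if isinstance(node, str):
        return {node}
    _, children = node
    found = set()
    for child in children:
        found |= leaves(child)
    return found


def relationship(tree, gene_a, gene_b):
    """Classify two distinct genes by the event at their last common ancestor."""
    if gene_a == gene_b or not {gene_a, gene_b} <= leaves(tree):
        raise ValueError("need two distinct genes that are both in the tree")
    event, children = tree
    for child in children:
        # Descend while both genes are still inside the same subtree.
        if not isinstance(child, str) and {gene_a, gene_b} <= leaves(child):
            return relationship(child, gene_a, gene_b)
    # The genes diverge at this node.
    return "ortholog" if event == "speciation" else "paralog"


# Hypothetical family: an ancestral gene duplicated into copies A and B,
# and the two species then diverged, each inheriting both copies.
TREE = ("duplication", [
    ("speciation", ["A_species1", "A_species2"]),
    ("speciation", ["B_species1", "B_species2"]),
])

print(relationship(TREE, "A_species1", "A_species2"))
print(relationship(TREE, "A_species1", "B_species1"))
print(relationship(TREE, "A_species1", "B_species2"))
```

Note that the last pair comes from different species yet is still paralogous, because the copies separated at the duplication, before the species split.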

Levels of Protein Classification
(level; example; similarity; evolution)
• Class; composition of structural elements; no relationships
• Fold; TIM-barrel; topology of folded backbone; possible monophyly above and below
• Superfamily; aldolase; recognizable sequence similarity (motifs), basic biochemistry; monophyletic origin
• Family; class I aldolase; high sequence similarity (alignments), biochemical properties; evolution by ancient duplications
• Orthologous group; 2-keto-3-deoxy-6-phosphogluconate aldolase; orthology for a given set of species, biochemical activity, biological function; origin traceable to a single gene in LCA
• Lineage-specific expansion (LSE); PA3131 and PA3181; paralogy within a lineage; evolution by recent duplication and loss

Protein Family vs Domain: Evolutionary/Functional/Structural Unit
• A protein can consist of a single domain or multiple domains
• Proteins have modular structure
• Recent domain shuffling:
[Figure: domain architectures combining CM (AroQ type), PDH, PDT and ACT domains in PIRSF families SF001501, SF006786, SF001499, SF005547, SF001424 and SF001500]

Protein Evolution: Sequence Change vs. Domain Shuffling
• If enough similarity remains, one can trace the path to the common origin
• What about these?

Practical classification of proteins: setting realistic goals
We strive to reconstruct the natural classification of proteins to the fullest possible extent
BUT
Domain shuffling rapidly degrades the continuity in the protein structure (faster than sequence divergence degrades similarity)
THUS
The further we extend the classification, the finer is the domain structure we need to consider
SO
We need to compromise between the depth of analysis and protein integrity
OR …
Credit: Dr. Y. Wolf, NCBI

Complementary approaches
Classify domains
• Allows building a hierarchy and tracing evolution all the way to the deepest possible level, the last point of traceable homology and common origin
• Can usually annotate only generic biochemical function
Classify whole proteins
• Does not allow building a hierarchy deep along the evolutionary tree because of domain shuffling
• Can usually annotate specific biological function (value for the user and for the automatic individual protein annotation)
– Can map domains onto proteins
– Can classify proteins when some of the domains are not defined

Levels of protein classification
(level; example; similarity; evolution)
• Class; composition of structural elements; no relationships
• Fold; TIM-barrel; topology of folded backbone; possible monophyly
• Domain superfamily; aldolase; recognizable sequence similarity (motifs), basic biochemistry; monophyletic origin
• Family; class I aldolase; high sequence similarity (alignments), biochemical properties; evolution by ancient duplications
• Orthologous group; 2-keto-3-deoxy-6-phosphogluconate aldolase; orthology for a given set of species, biochemical activity, biological function; origin traceable to a single gene in LCA
• LSE; PA3131 and PA3181; paralogy within a lineage; evolution by recent duplication and loss

Protein classification databases
Domain classification
• Pfam
• SMART
• CDD
Whole protein classification
• PIRSF
Mixed
• TIGRFAMs
• COGs
Based on structural fold
• SCOP

Protein family – domain – site (motif)
• InterPro is an integrated resource for protein families, domains and sites
• Combines a number of databases: PROSITE, PRINTS, Pfam, SMART, ProDom, TIGRFAMs, PIRSF
[Figure: domain architecture CM – PDT – ACT of SF001500, bifunctional chorismate mutase/prephenate dehydratase]

InterPro Entry
• Type defines the entry as a Family, Domain, Repeat, or Site
• Family = protein family
• “Contains” field: lists domains within this protein
• “Found in” field: for domain entries, lists families which contain this domain

Whole protein functional annotation is best done using annotated whole protein families
• PIRSF006256: [NiFe]-hydrogenase maturation factor, carbamoyl phosphate-converting enzyme
• Domain composition: Acylphosphatase – Znf x 2 – YrdC – related to Peptidase M22
• On the basis of domain composition alone, one cannot predict biological function

PIRSF protein classification system
• Basic concept:
– A network classification system based on evolutionary relationship of whole proteins
• Basic unit = PIRSF family
– Homeomorphic (end-to-end similarity with common domain architecture)
– Monophyletic (common ancestry)
• Domains and motifs are mapped onto PIRSF

PIRSF curation and annotation
• Preliminary clusters, uncurated
– Computationally generated
• Preliminary curation
– Membership: regular members (seed, representative) and associate members
– Signature domains & HMM thresholds
• Full curation
– Membership
– Family name
– Description, bibliography (optional)
– Integrated into InterPro

Protein classification systems can be used to:
• Provide accurate automatic annotation for new sequences
• Detect and correct genome annotation errors systematically
• Drive other annotations (active site, etc.)
• Improve sensitivity of protein identification, simplify detection of non-obvious relationships
• Provide basis for evolutionary and comparative research
Discovery of new knowledge by using information embedded within families of homologous sequences and their structures

Systematic correction of annotation errors: Chorismate mutase
• Chorismate mutase (CM), AroQ class
– PIRSF001501 – CM (prokaryotic type) [PF01817]
– PIRSF001499 – TyrA bifunctional enzyme (prokaryotic) [PF01817-PF02153]
– PIRSF001500 – PheA bifunctional enzyme (prokaryotic) [PF01817-PF00800]
– PIRSF017318 – CM (eukaryotic type) [regulatory domain-PF01817]
• Chorismate mutase, AroH class
– PIRSF005965 – CM [PF01817]
[Figure: AroQ prokaryotic, AroQ eukaryotic and AroH groups]
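A family table like the one above can drive a systematic check: given a protein's Pfam domain architecture, look up the whole-protein family and flag names that disagree. The sketch below uses only the three prokaryotic AroQ-class entries exactly as listed on the slide; the function name, the exact-match lookup, and the example annotations are illustrative assumptions, and a real pipeline would match against family HMMs rather than a hand-typed dictionary.

```python
# Pfam architecture -> (PIRSF identifier, family name), copied from the
# prokaryotic AroQ-class entries on the slide.
AROQ_FAMILIES = {
    ("PF01817",): ("PIRSF001501", "chorismate mutase (prokaryotic type)"),
    ("PF01817", "PF02153"): ("PIRSF001499", "TyrA bifunctional enzyme (prokaryotic)"),
    ("PF01817", "PF00800"): ("PIRSF001500", "PheA bifunctional enzyme (prokaryotic)"),
}


def review_annotation(current_name, architecture):
    """Return a correction message, or None if nothing can be suggested.

    None is returned both when the name already matches the family and
    when the architecture is not in the table (leave for manual review).
    """
    family = AROQ_FAMILIES.get(tuple(architecture))
    if family is None:
        return None
    pirsf_id, family_name = family
    if current_name.lower() == family_name.lower():
        return None
    return f"{pirsf_id}: rename '{current_name}' to '{family_name}'"


# Hypothetical database entries: (current name, Pfam architecture).
ENTRIES = [
    ("chorismate mutase", ["PF01817", "PF00800"]),
    ("chorismate mutase (prokaryotic type)", ["PF01817"]),
]
for name, architecture in ENTRIES:
    print(review_annotation(name, architecture))
```

The first entry is the multi-domain case described earlier: a bifunctional enzyme that inherited the name of only one of its domains.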

Systematic correction of annotation errors: IMPDH
[Figure: IMPDH misnomer in Methanococcus jannaschii; IMPDH misnomers in Archaeoglobus fulgidus]

Impact of protein bioinformatics and genomics
• Single protein level
– Discovery of new enzymes and superfamilies
– Prediction of active sites and 3D structures
• Pathway level
– Identification of “missing” enzymes
– Prediction of alternative enzyme forms
– Identification of potential drug targets
• Cellular metabolism level
– Multisubunit protein systems
– Membrane energy transducers
– Cellular signaling systems