Consensus Algorithm for Structure Prediction: ROAWBetter
Ross Bayer, Olga Russakovsky, Alex Chan, William Lu
Lazy Student Analogy
• Suppose you are taking a comprehensive exam
• Each question is from one of several areas (e.g. Literature, History, Math, and Science)
• Collaboration is allowed
• You are lazy, so you don’t want to study
• But you know how well everyone did on a previous comprehensive exam
• How to collaborate to get the highest score?
Lazy Student Analogy
• Overall Score: C (Literature: A, History: D, Math: D, Science: D)
• Overall Score: C (Literature: D, History: D, Math: A, Science: D)
• Overall Score: C (Literature: D, History: A, Math: D, Science: D)
• Overall Score: C (Literature: C, History: C, Math: C, Science: C)
Solution 1 score: C
Solution 2 score: A- / B+
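
The two solution scores can be checked with a little arithmetic. The sketch below rests on assumptions the slides do not state: a standard 4.0 grade-point scale (A = 4, C = 2, D = 1), Solution 1 meaning "copy the single student with the best overall average", and Solution 2 meaning "for each area, use the student with the best grade in that area".

```python
# Assumed 4.0 grade-point scale; the slides give only letter grades.
POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}
AREAS = ["Literature", "History", "Math", "Science"]

# Per-area grades of the four students on the previous exam (from the slide).
students = [
    {"Literature": "A", "History": "D", "Math": "D", "Science": "D"},
    {"Literature": "D", "History": "D", "Math": "A", "Science": "D"},
    {"Literature": "D", "History": "A", "Math": "D", "Science": "D"},
    {"Literature": "C", "History": "C", "Math": "C", "Science": "C"},
]


def mean_points(grades):
    """Average grade points of a list of letter grades."""
    return sum(POINTS[g] for g in grades) / len(grades)


# Solution 1 (assumed reading): rely on the one student with the best average.
solution_1 = max(mean_points([s[a] for a in AREAS]) for s in students)

# Solution 2 (assumed reading): per area, rely on whoever scored best there.
best_per_area = [max((s[a] for s in students), key=POINTS.get) for a in AREAS]
solution_2 = mean_points(best_per_area)
```

Under these assumptions Solution 1 averages 2.0 grade points (a C) and Solution 2 averages 3.5, which falls between a B+ and an A- on the usual scale.
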
Only a few folds are found in nature
"the large majority of proteins come from no more than one thousand families."
Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992 Jun 18; 357(6379): 543-4.
Performance Prediction
[Figure: Algorithm Performance of Algorithm 1, Algorithm 2, and Algorithm 3 over the Space of All Proteins]
Algorithms
• Algorithms
  § Robetta
  § FFAS03
  § SAM-T02
  § SAM-T99
• Servers were chosen by
  § Performance
  § Participation in LiveBench and CASP
  § Availability of web interface
Simplified View of SCOP
[Figure: tree in which each SCOP Family contains several Proteins]
Training (Creating the “Databank”)
[Figure: the Databank is built from the SCOP tree of Families and their Proteins]
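
One way to picture the training step is as a table that records, for each SCOP family, which server scored best on that family's proteins. The sketch below is only an illustration: the record layout, the family labels, the scores, and the name `build_databank` are all hypothetical, and using the mean score as the criterion is an assumption (the slides do not define how "best" is measured).

```python
from collections import defaultdict


def build_databank(results):
    """Map each SCOP family to the server with the highest mean score.

    `results` is an iterable of (family, server, score) tuples, for example
    one tuple per benchmark target per server. This layout is hypothetical.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for family, server, score in results:
        scores[family][server].append(score)

    databank = {}
    for family, by_server in scores.items():
        # Assumed criterion: highest mean score within the family.
        databank[family] = max(
            by_server, key=lambda s: sum(by_server[s]) / len(by_server[s])
        )
    return databank


# Made-up scores, purely for illustration.
toy_results = [
    ("family_A", "Robetta", 40.0),
    ("family_A", "FFAS03", 55.0),
    ("family_A", "FFAS03", 45.0),
    ("family_B", "SAM-T02", 30.0),
    ("family_B", "Robetta", 60.0),
]
databank = build_databank(toy_results)
```

With the toy data, `family_A` maps to FFAS03 (mean 50 versus 40) and `family_B` maps to Robetta.
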
The Classifier
Unknown protein: APITAYSQQTRGLLGCIITSLTGRDKNQVEGEVQVVSTATQSFLATCVNGVCWTVYHGA
[Figure: the classifier ("I know!") matches the unknown protein against the Databank of Families and their Proteins]
Classifier
• Goal: Classify unknown proteins into SCOP families
  § Classifying directly into SCOP family is structure prediction!
• BLAST classification against PDB
  § Weighted average of results → SCOP family
• Seems to work well
  § Correctly classified 11 of 12 SCOP proteins in CASP5
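
The "weighted average of results" step can be sketched as a weighted vote over BLAST hits. The slides do not say how hits were weighted, so everything specific here is an assumption: hits are taken to be already mapped from PDB entries to SCOP families, each hit is weighted by -log10 of its E-value, and the names `classify_family` and `max_evalue` are hypothetical.

```python
import math
from collections import defaultdict


def classify_family(hits, max_evalue=1e-3):
    """Return (family, share_of_total_weight), or (None, 0.0) if no usable hit.

    `hits` is a list of (scop_family, e_value) pairs for BLAST hits whose
    PDB entries have already been mapped to SCOP families (assumed input).
    """
    weights = defaultdict(float)
    for family, e_value in hits:
        if e_value > max_evalue:
            continue  # skip weak hits
        # Assumed weighting: -log10(E), with a floor so that E = 0 is safe.
        weights[family] += -math.log10(max(e_value, 1e-180))

    if not weights:
        return None, 0.0
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


# Made-up hits, purely for illustration.
family, share = classify_family(
    [("fam_1", 1e-30), ("fam_2", 1e-10), ("fam_1", 1e-20)]
)
```

In the toy call, `fam_1` collects a weight of about 50 against 10 for `fam_2`, so it wins with roughly five sixths of the total weight. Returning `None` when no hit passes the threshold gives the caller a clear signal that the family is unknown.
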
ROAWBetter Gameplan
• Procedure
  § Incoming protein is classified into a SCOP family
  § Algorithm that ran best for that family is selected
  § Communicate with that algorithm’s server to obtain the protein’s structure
• We also have a default (for when family is unknown or family has no data): Robetta
  § With our limited training, this meant that we used Robetta quite a bit
  § But we did better than Robetta, hence ROAWBetter
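
The procedure above amounts to a lookup with a fallback. The following is a minimal sketch, not the authors' implementation: `choose_server`, `predict_structure`, and the `classify` and `submit` callables are hypothetical names, and the databank is assumed to be a plain mapping from SCOP family to the best-performing server.

```python
DEFAULT_SERVER = "Robetta"  # the default named on the slide


def choose_server(family, databank):
    """Pick the server recorded as best for `family`, else the default."""
    if family is None:  # classifier could not assign a family
        return DEFAULT_SERVER
    return databank.get(family, DEFAULT_SERVER)  # no data also means default


def predict_structure(sequence, classify, databank, submit):
    """Classify the sequence, choose a server, and forward the request.

    `classify(sequence)` returns a family label or None, and
    `submit(server, sequence)` returns that server's model. Both are
    caller-supplied stand-ins (hypothetical interfaces).
    """
    family = classify(sequence)
    server = choose_server(family, databank)
    return server, submit(server, sequence)


# Toy usage with stand-in callables and a made-up databank.
toy_databank = {"fam_1": "FFAS03"}
server, model = predict_structure(
    "APITAYSQQTRG",
    classify=lambda seq: "fam_1",
    databank=toy_databank,
    submit=lambda srv, seq: f"model of {len(seq)} residues from {srv}",
)
```

Keeping the classifier and the server call as injected callables mirrors the modular idea: either piece can be swapped without touching the selection logic.
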
Confidence Estimation
• Confidence estimation is certainly possible
• Several sources of confidence estimation
  § Amount of data collected for a particular family
  § Classifier confidence, based on BLAST E-values and the number of BLAST hits
  § Some algorithms provide confidence estimates for the predicted structure
• Confidence is low for an unknown family
Results
• Based on comparison of GDT scores on best models, we would have come:
  § 7th in CASP6 (including human-expert systems)
  § 1st in CAFASP3 (fully automated systems)
  § We rock (well, not too much)
• This is a good sign
  § Very limited training, BUT
  § Still picked algorithms over Robetta at good times
Hypothetical protein of E. coli
[Figure: structure models labeled FFAS03 and Robetta]
Anthranilate Phosphoribosyltransferase 2
[Figure: structure models labeled SAM-T02 and Robetta]
Conclusion: Modular Framework
[Figure: framework diagram with the components Database (SCOP), Training (LiveBench results), Databank, Family Classifier (BLAST), Algorithm Determination (Best from family), and the servers Robetta, FFAS03, SAM-T02, SAM-T99]
Lazy Student Analogy - Revisited
Focus: Literature, Focus: Math, Focus: History, Focus: Science
• What would you do if you knew there was a collaborative, comprehensive exam?
  § Divide up the areas
  § Each student studies a specific area
  § Consult with the area expert during the exam
• Therefore we have revolutionized the world!!