Evolution teaches to predict protein structure and function

Evolution teaches to predict protein structure and function Burkhard Rost CUBIC Columbia University rost@columbia.edu http://www.columbia.edu/~rost http://cubic.bioc.columbia.edu/ Burkhard Rost (Columbia New York)

Evolution teaches prediction • Is Bioinformatics up to the data deluge? • Sequence comparison: do we know what we do? – conservation of structure and function • Structure prediction: where are we today? • How to learn from the evolutionary odyssey? – secondary structure – transmembrane proteins – solvent accessibility • Are 1D predictions useful? – sub-cellular localisation – whole genomes – 3D structure: threading – floppy regions

http://cubic.bioc.columbia.edu/ • Volker Eyrich • Claus Andersen Copenhagen • Rajesh Nair • Bastian Bruning • Jinfeng Liu Nijmegen • Dariusz Przybylski • Hepan Tan Columbia • Yanay Ofran • Trevor Siggers • Henry Bigelow Columbia • Kazimierz Wrzeszczynski • Sven Mika • Chien Peter Chen • Burkhard Rost • Miguel Andrade EMBL • Sean O’Donoghue LION • Andrej Sali Marc Marti-Renom Rockefeller • Alfonso Valencia Florencio Pazos Madrid • Michal Linial Jerusalem • http://cubic.bioc.columbia.edu/

CUBIC http://cubic.bioc.columbia.edu Dariusz Przybylski Trevor Siggers Volker Eyrich Jinfeng Liu Hepan Tan Murat Cokol

The Data Deluge Conclusion: Bioinformatics will have a hell of a problem

Data Deluge: what do we want?

Data Deluge: numbers 50, 1.200.000, 500.000, 2000, 17.000, 800, 35.000

Data Deluge: what CAN we do?

Data Deluge: what CAN we do? Not much … yet

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Sequence comparison: do we know what we do? – conservation of structure and function • Structure prediction: where are we today? • How to learn from the evolutionary odyssey? – secondary structure – transmembrane proteins – solvent accessibility • Are 1D predictions useful? – sub-cellular localisation – whole genomes – 3D structure: threading – floppy regions

Dynamic programming: optimal alignment
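The slide names the technique without spelling it out, so here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch scheme). The scoring values (match, mismatch, linear gap penalty) are assumptions chosen for illustration; the talk does not state which scoring scheme was used, and real protein alignments would normally use a substitution matrix and affine gaps.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    The scoring parameters are illustrative assumptions, not values from the talk.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

The table is filled in O(n·m) time, which is what makes the alignment optimal but slow for database-scale searches.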

BLAST: fast matching of single ‘words’
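To make the idea of word matching concrete, the sketch below indexes all words of length k in one sequence and looks up the words of a query in that index. This is a deliberately simplified illustration: it finds exact word matches only, whereas BLAST itself also considers similar ("neighbourhood") words and then extends the seeds. The function names and the word length k=3 are assumptions for the example.

```python
from collections import defaultdict


def index_words(seq, k=3):
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an exact k-word matches."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds
```

Looking up words in an index takes time roughly proportional to the query length rather than to the product of both lengths, which is the source of the speed gain over full dynamic programming.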

Profile-based comparison

Zones

Sequence -> Structure • Sequence folds into unique structure S -> T

Sequence -> Structure • Sequence folds into unique structure S -> T • Similar sequences fold into similar structures S + S’ -> T

Sequence -> Structure • Sequence folds into unique structure S -> T • Similar sequences fold into similar structures S + S’ -> T • Most sequences don’t fold at all S -> no T

[Figure: number of protein pairs (log scale) versus percentage sequence identity and distance from HSSP threshold. Twilight zone = false positives explode.] B Rost 1999 Prot. Engin.: 12, 85-94
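The "distance from HSSP threshold" on the figure axis can be sketched as follows. The length-dependent curve used here is the one I recall from the cited paper (B Rost 1999 Prot. Engin.: 12, 85-94); treat the constants as an assumption and check them against the paper before relying on them. The function names are hypothetical.

```python
import math


def hssp_threshold(length, n=0.0):
    """Percentage identity needed over `length` aligned residues.

    Assumed form (recalled from Rost 1999, to be verified):
      100                                     for length <= 11
      n + 480 * L ** (-0.32 * (1 + e^(-L/1000)))  for 11 < length <= 450
      n + 19.5                                for length > 450
    `n` shifts the curve up or down (distance from the threshold).
    """
    if length <= 11:
        return 100.0
    if length > 450:
        return n + 19.5
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return n + 480.0 * length ** exponent


def distance_from_threshold(percent_identity, length):
    """Positive values lie above the curve, negative values in the twilight zone."""
    return percent_identity - hssp_threshold(length)
```

Under this assumed curve, an alignment of 100 residues needs roughly 29% identity, which is in line with the "30% over 100 residues" rule of thumb quoted on the agenda slides.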

Significant sequence identity B Rost 1999 Prot. Engin.: 12, 85-94

Evolution did it! B Rost 1999 Prot. Engin.: 12, 85-94

Similar sequence -> similar structure? B Rost 1999 Prot. Engin.: 12, 85-94

Detecting true hits in Twilight zone B Rost 1999 Prot. Engin.: 12, 85-94

Finding similar structures in Twilight zone B Rost 1999 Prot. Engin.: 12, 85-94

‘Secure’ thresholds for BLAST B Rost 1999 Prot. Engin.: 12, 85-94

Accuracy vs. coverage
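A small sketch of how an accuracy-versus-coverage curve can be computed for a list of scored hits. The definitions are assumptions, since the slide itself gives only the title: accuracy is taken as the fraction of hits above the threshold that are true, and coverage as the fraction of all true pairs that are found above the threshold.

```python
def accuracy_coverage(hits, threshold):
    """hits: list of (score, is_true) pairs. Returns (accuracy, coverage).

    accuracy = true hits above threshold / all hits above threshold
    coverage = true hits above threshold / all true pairs
    accuracy is None when nothing scores above the threshold.
    """
    total_true = sum(1 for _, is_true in hits if is_true)
    selected = [is_true for score, is_true in hits if score >= threshold]
    true_selected = sum(1 for is_true in selected if is_true)
    accuracy = true_selected / len(selected) if selected else None
    coverage = true_selected / total_true if total_true else 0.0
    return accuracy, coverage
```

Sweeping the threshold from strict to permissive traces the usual trade-off: accuracy falls as coverage rises.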

BLAST is not enough... B Rost 1999 Prot. Engin.: 12, 85-94

Sequence Space Hopping B Rost 1999 Prot. Engin.: 12, 85-94

Success through sequence space hopping B Rost 1999 Prot. Engin.: 12, 85-94

Zones

Profile-based database search B Rost 2001 Structural Bioinformatics: in press

Profile-based database search

Zones

Hypothetical distribution of similar structures

Midnight zone: real - random B Rost 1997 Folding & Design: 2, S19-S24; AS Yang and B Honig 2000 J. Mol. Biol.: 301, 679-689

Evolution into the Midnight zone [Figure: number of structure pairs versus percentage pairwise sequence identity.] B Rost and S O'Donoghue 1998 EMBL preprint

Protein structures evolved at random - almost • average < 10% – -> most pairs have ‘random’ identity levels • 3-4% anchor residues • 4 billion years of evolution reached equilibrium – rate of creating new structures slower than drift towards mean • averages for convergent and divergent evolution similar • convergent evolution may have been a major event

Structure space B Rost 1998 Structure: 6, 259-263

Gold-mine out of reach! [Figure axis: percentage of pairs.]

Conservation of function Devos & Valencia 2000, Proteins, 41, pp. 98

Conservation of EC number

Conservation of EC number 2

Conservation of EC number: BLAST

Conservation in detail

Accuracy vs. coverage: EC number

Conservation of EC numbers

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Structure prediction: where are we today? • How to learn from the evolutionary odyssey? – secondary structure – transmembrane proteins – solvent accessibility • Are 1D predictions useful? – sub-cellular localisation – whole genomes – 3D structure: threading – floppy regions

Notation: protein structure 1D, 2D, 3D

Goal of structure prediction • Epstein & Anfinsen, 1961: sequence uniquely determines structure • INPUT: sequence • OUTPUT: 3D structure and function

Protein structure prediction in reality: 3D, 1D, HoMo, FoRc

Homology modelling/comparative modelling PDB U (sequence) significant sequence identity H • assumption: H and U have homologous 3D structures • strategy: modelling of U based on H

Protein structure prediction in reality: 3D, 1D, HoMo, FoRc

Protein structure prediction in reality: SWISS-PROT view, Genome view, HoMo, 1D, FoRc … the art of being humble

Structure prediction for protein universe

Improving prediction by waiting it out … 1999 1995 1991

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Where are we today? NO 3D prediction from sequence! • How to learn from the evolutionary odyssey? – secondary structure – transmembrane proteins – solvent accessibility • Are 1D predictions useful? – sub-cellular localisation – whole genomes – 3D structure: threading – floppy regions

Evolution did it! B Rost 1999 Prot. Engin.: 12, 85-94

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Where are we today? NO 3D prediction from sequence! • Evolutionary odyssey applied: – secondary structure: +15% -> 76% ± 10% – transmembrane proteins – solvent accessibility • Are 1D predictions useful? – sub-cellular localisation – whole genomes – 3D structure: threading – floppy regions

Membrane prediction

HTM prediction waiting for database growth... 1999 1996 1993

Topology for membrane helical proteins

PHDsec success on Poly-Valine

Refine by dynamic programming on NN ‘energy’

PHDhtm refine topology prediction

PHDhtm on Poly-Valine

Example IS representative

To be or not to be (HTM)

False positives: globular proteins

Details PHDsec: Wrong alignment • single sequences => accuracy clearly lower • sufficient information in multiple alignment – many sequences – diversity • wrong alignment -> wrong prediction
ID %IDE %WSIM IFIR ILAS JFIR JLAS LALI NGAP LSEQ
ftsh_ecoli 1.00 1 644 644 0 0 644
ftsh_haein 0.76 0.84 256 635 1 380 0 0 381
ftsh_bacsu 0.50 0.62 3 630 6 637 623 6 14 637
ftsh_porpu 0.48 0.59 5 604 9 623 598 5 19 628
ftsh_lacla 0.46 0.57 1 638 12 695 635 7 52 695
ftsh_odosi 0.45 0.56 2 611 5 644 609 5 32 644

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Where are we today? NO 3D prediction from sequence! • Evolutionary odyssey applied: – secondary structure: +15% -> 76% ± 10% – transmembrane proteins: +10% -> 65% topo ok – solvent accessibility • Are 1D predictions useful? – sub-cellular localisation – whole genomes – 3D structure: threading – floppy regions

Defining residue solvent accessibility

Evolution for accessibility prediction • Detailed prediction problematic • Significant gain by evolutionary information: in/out with > 75% accuracy!

PHDacc: the un-g(l)ory details • accuracy > 75% (two states: buried, exposed) • distribution with ≈ 10% • stronger predictions more accurate • WARNING: reliability index almost a factor of 2 too large for single sequences • accuracy below average for intermediate state • VERY dependent on alignment accuracy

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Where are we today? NO 3D prediction from sequence! • Evolutionary odyssey applied: – secondary structure: +15% -> 76% ± 10% – transmembrane proteins: +10% -> 65% topo ok – solvent accessibility: +5% -> 75% • Are 1D predictions useful? – sub-cellular localisation – whole genomes – 3D structure: threading – floppy regions

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Where are we today? NO 3D prediction from sequence! • Evolutionary odyssey applied: – secondary structure: +15% -> 76% ± 10% – transmembrane proteins: +10% -> 65% topo ok – solvent accessibility: +5% -> 75% • Are 1D predictions useful? Of course to experts – sub-cellular localisation – whole genomes – 3D structure: threading – floppy regions

Shuttle into the nucleus: Cytoplasm, Nucleus

How many NLS motifs in databases? • ONE in PROSITE: bi-partite motif [Figure axis: coverage.]

Experimental NLS: positive charges

Experimental NLS: more complicated

In silico mutagenesis

Increasing accuracy and coverage [Figure axis: coverage.]

Nuclear proteins in proteomes

Un-annotated nuclear proteins with NLS • ATAXIN-1: GERGHGGG • Breast Cancer type 2 (Brc2): RIKKKQR • Fibroblast Growth factor (fgf): KKRRRRR • Brg1: ERKRRQ
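As a rough illustration of motif-based annotation, the sketch below scans a sequence for the four motif strings listed on the slide. It treats them as literal strings, which is an assumption: the motifs used in the actual work may well have been patterns with variable positions. The dictionary keys and the function name are hypothetical labels for this example.

```python
import re

# Motif strings copied from the slide; matching them literally is an assumption.
NLS_MOTIFS = {
    "ATAXIN-1": "GERGHGGG",
    "Brc2": "RIKKKQR",
    "fgf": "KKRRRRR",
    "Brg1": "ERKRRQ",
}


def scan_for_motifs(sequence, motifs=NLS_MOTIFS):
    """Return (motif_name, start_position) pairs, sorted by position.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    hits = []
    for name, motif in motifs.items():
        for m in re.finditer("(?=" + re.escape(motif) + ")", sequence):
            hits.append((name, m.start()))
    return sorted(hits, key=lambda hit: hit[1])
```

For example, `scan_for_motifs("MAAKKRRRRRLLERKRRQ")` reports the fgf motif at position 3 and the Brg1 motif at position 12 (0-based).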

Using NLS to bind DNA

DNA-binding predictions in proteomes

Rotation @ CUBIC.bioc.columbia.edu • want all cell-cycle proteins • search in SWISS-PROT, PROSITE • search literature • build ‘expert’ set of known

Significant motifs

Rotation @ CUBIC.bioc.columbia.edu • want all cell-cycle proteins • search in SWISS-PROT, PROSITE • search literature • build ‘expert’ set of known • choose unique subset

Finding unique subsets of proteins

Similar sequence -> similar structure? B Rost 1999 Prot. Engin.: 12, 85-94

Rotation @ CUBIC.bioc.columbia.edu • want all cell-cycle proteins • search in SWISS-PROT, PROSITE • search literature • build ‘expert’ set of known • choose unique subset • find motifs … sorry, time ran out here!

Retention signals in ER and Golgi

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Where are we today? NO 3D prediction from sequence! • Evolutionary odyssey applied: – secondary structure: +15% -> 76% ± 10% – transmembrane proteins: +10% -> 65% topo ok – solvent accessibility: +5% -> 75% • High-throughput success of predictions: – localisation: accessibility useful, but not enough! – whole genomes – 3D structure: threading – floppy regions

[Figure: cumulative percentage of proteins versus number of proteins in family (family size); labels: Prokaryotes, Archaeans, Aeropyrum pernix K1, Eukaryotes.]

Structure prediction for protein universe

Do we aim at getting one structure per fold? • Structural proteomics = hunt for new folds? Tough task for theory! -> Practice: Shrink complexes: 14747 technicians! • Can we avoid non-globular proteins? • Can we prioritise aspects of function?

Similar amino acid composition

Inventory of life: membrane proteins (Eukaryotes, Prokaryotes, Archaea)

Number of membrane helices -> complexity?

Membrane proteins: kingdoms invented different tricks

The membrane LEGO

Length of globular regions in membrane proteins

Inventory of life: coiled-coil proteins (Eukaryotes, Prokaryotes, Archaeans)

Coiled-coil proteins: details

Inventory of life: compartments

Protein structure universe

Distribution of protein length

Bottleneck 5: money... • Goal 500 in 5 years • money: total of $25 M in 5 years 50,000,000 Lire

What will we get? • many new structures • the machinery for structural genomics • some weird structures...

Recipe to determine targets • Is it a known structure? • Is it similar to a known structure? • Is it a membrane protein? • Does it look like a known fold? • Does it look like a globular protein? • Is it a big family? • Is it short (NMR)? Does it contain Met (MAD)?
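The recipe reads like a chain of filters, which can be sketched as below. Everything beyond the questions themselves is an assumption made for illustration: the slide does not say which answers exclude a protein, so this sketch assumes that known, similar, membrane, known-fold and non-globular proteins are dropped, and that family size, length and methionine content only add notes. The class, the function and both cut-offs (`big_family`, `nmr_max_length`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # One field per question in the recipe; values would come from
    # database searches and predictions run elsewhere.
    name: str
    known_structure: bool
    similar_to_known_structure: bool
    membrane_protein: bool
    looks_like_known_fold: bool
    looks_globular: bool
    family_size: int
    length: int
    contains_met: bool


def assess_target(c, big_family=10, nmr_max_length=150):
    """Walk the recipe in order; return (keep, notes).

    The cut-offs are placeholders, not values from the talk.
    """
    exclusions = [
        (c.known_structure, "known structure"),
        (c.similar_to_known_structure, "similar to a known structure"),
        (c.membrane_protein, "membrane protein"),
        (c.looks_like_known_fold, "looks like a known fold"),
        (not c.looks_globular, "does not look globular"),
    ]
    for excluded, reason in exclusions:
        if excluded:
            return False, [reason]
    notes = []
    if c.family_size >= big_family:
        notes.append("big family")
    if c.length <= nmr_max_length:
        notes.append("short enough for NMR")
    if c.contains_met:
        notes.append("contains Met (MAD phasing possible)")
    return True, notes
```

Ordering the cheap database look-ups first means most candidates are discarded before any prediction method has to be run.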

Alternative recipe to determine targets • Do we have a crystal? • Is it a known structure? • Is it similar to a known structure?

Reality check: the invaluable contribution of bioinformatics to target selection

Target selection

Priority classes • Experimental feasibility • Biophysical properties – length – presence of Methionine • Bioinformatics criteria – similarity to known structure – family size – functional annotation • Functional genomics

Target selection machinery

Conclusions: Structural Genomics • we get: • most major functional elements • most structural scaffolds • evolutionary links • structure-based comparison • high-throughput techniques • we won’t get: • complexes • interaction between them • particular structures • when? • 70% of the human genome by 2010 2015 • remainder = HTMs?

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Where are we today? NO 3D prediction from sequence! • Evolutionary odyssey applied: – secondary structure: +15% -> 76% ± 10% – transmembrane proteins: +10% -> 65% topo ok – solvent accessibility: +5% -> 75% • High-throughput success of predictions: – localisation: accessibility useful, but not enough! – whole genomes: kingdoms differ in some respects! – 3D structure: threading – floppy regions

Midnight zone STRONGLY populated [Figure: number of structure pairs versus percentage pairwise sequence identity.]

What we are threading for

Goals of fold recognition, threading, remote homology modelling • Recognising similar fold(s) (entire proteins) • Detecting remote homologies for fragments (part of protein) • Align target and fold • Remote homology modelling (prediction in 3D)

Two paths to fold recognition

TOPITS

Prediction-based threading

Example of remote sequence identity

30% correct first, better if stronger

Other threading methods • TOPITS is not the best! • CASP: PredictionCenter.llnl.gov/content.html • CAFASP: www.cs.bgu.ac.il/~dfischer/CAFASP2/ • EVA: cubic.bioc.columbia.edu/eva/ • CUBIC links: cubic.bioc.columbia.edu/doc/links_index.html

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Where are we today? NO 3D prediction from sequence! • Evolutionary odyssey applied: – secondary structure: +15% -> 76% ± 10% – transmembrane proteins: +10% -> 65% topo ok – solvent accessibility: +5% -> 75% • High-throughput success of predictions: – localisation: accessibility useful, but not enough! – whole genomes: kingdoms differ in some respects! – threading: better than sequence alignment! – floppy regions (NORS: no regular secondary structure)

Long floppy regions • less than 5% helix or strand over > 70 residues
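The definition on this slide translates fairly directly into a sliding-window scan over a secondary structure string. The sketch below is one possible reading, with assumptions stated: the input uses H for helix and E for strand with any other character meaning neither, a window of exactly 70 residues is the minimum length, and overlapping qualifying windows are merged into one region. The function name is hypothetical.

```python
def find_floppy_regions(ss, window=70, max_fraction=0.05):
    """Return (start, end) half-open intervals with little helix or strand.

    ss: secondary structure string, 'H' = helix, 'E' = strand (assumed coding).
    A window qualifies when its fraction of H/E residues is below max_fraction.
    Overlapping or adjacent qualifying windows are merged.
    """
    n = len(ss)
    if n < window:
        return []
    structured = [1 if state in "HE" else 0 for state in ss]
    count = sum(structured[:window])  # H/E residues in the first window
    regions = []
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window by one residue: add the new one, drop the old one.
            count += structured[start + window - 1] - structured[start - 1]
        if count / window < max_fraction:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

Because a window may contain a few structured residues and still qualify, the reported region boundaries can extend slightly into flanking helices or strands.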

Floppy loops between domains • Formate Dehydrogenase H (1aa6.pdb) • Isoamylase (1bf2.pdb) • phiX174 virion (1al0F.pdb) • DNA-containing capsid of CPV (4dpv.pdb)

Floppy ends • Pyruvate:ferredoxin oxidoreductase (1b0p.A.pdb) • Aspartate aminotransferase (2aat.pdb) • Capsid protein of CPV (1b35C.pdb) • Prothrombin fragment 2 (2hpp.P.pdb) • Hexon from adenovirus type 2 (1dhx.pdb) • SH3 domain of PLC-gamma (1hsq.pdb) • Myeloperoxidase (1mhl.A.pdb) • Hydroxylase component of MMOH (1mty.B.pdb)

Floppy-wrap • SH3 and adjacent ligand site (1awj.pdb) • Cellulase (1tf4A.pdb) • Erythrocyte catalase (7cat.pdb) • Phosphoglycerate mutase (3pgm.pdb) • GmDNV capsid protein (1dnx.pdb) • Carboxypeptidase T (1obr.pdb)

Weirdoes • Extracellular domain of T beta RI (1tbi.pdb) • Gene 5 DNA binding protein (2gn5.pdb) • HIVZ2 Tat protein (1tac.pdb) • Recombinant Kringle 5 domain (5hpg.pdb) • Plasminogen Kringle 4 (1krn.pdb) • Aspartate Transcarbamoylase (9atc.pdb)

Weirdoes are not alone!

10% of biomass weird!

Length distribution of floppy regions

Weirdoes functional!

Yeast-2-hybrid interactions

Evolution teaches prediction • Bioinformatics up to the data deluge? NO, but work in progress! • Know what we do? Some do, 30% over 100 residues! • Where are we today? NO 3D prediction from sequence! • Evolutionary odyssey applied: – secondary structure: +15% -> 76% ± 10% – transmembrane proteins: +10% -> 65% topo ok – solvent accessibility: +5% -> 75% • High-throughput success of predictions: – localisation: accessibility useful, but not enough! – whole genomes: kingdoms differ in some respects! – threading: better than sequence alignment! – NORS: weirdoes not alone AND functional!

Conclusions • no prediction of 3D structure • no prediction of function • but: quantum leap through using ‘frozen knowledge’ from evolution and protein structures • the data deluge floods bioinformatics • the unsolved urgent problems are legion • but: it is still time to get it done: running BLAST is NOT all there is … the key is intelligent use of biological knowledge...

Thanksgiving • Volker Eyrich Schrödinger, New York Chris Sander Whitehead, Boston Reinhard Schneider LION, Boston Alfonso Valencia CNB, Madrid .. in general Miguel Andrade EMBL, Heidelberg • Jinfeng Liu LION, Heidelberg genomes, floppy, domains localisation Séan O’Donoghue • Rajesh Nair SIB, Genève NLS, localisation Amos Bairoch Michael Braxenthaler La Roche, New York • Yanay Ofran protein interactions • Søren Brunak CBS, København • Dariusz PSI-Blast, EVA, threading Rita Casadio Przybylski Univ. Bologna • Antoine De Daruvar LION, Bordeaux • Henry Bigelow predict porins David Eisenberg UCLA, Los Angeles • Piero Fariselli Barry Honig Tim Hubbard Michael Levitt Marc Marti-Renom Andrej Sali Michael Scharf Gerrit Vriend Manfred Sippl Univ. Bologna Columbia, New York Sanger, Hinxton Univ. Stanford Rockefeller, New York Take 5, Heidelberg Univ. Nijmegen Univ. Salzburg • Claus Andersen continuous DSSP • Bastiaan Bruning transcription factors • Sven Mika nuclear matrix proteins • Chien Peter Chen membrane proteins • Kazimierz Wrzeszczynski cell-cycle/ER-Golgi • Hepan Tan floppy regions

Availability of methods • email: PredictProtein@columbia.edu – subject: HELP – file: Email address options # protein name SEQWENCE • WWW: http://cubic.bioc.columbia.edu/predictprotein/ • META: http://cubic.bioc.columbia.edu/predictprotein/submit_meta.html • EVA: http://cubic.bioc.columbia.edu/eva • CUBIC: http://cubic.bioc.columbia.edu/