Protein Structure Prediction David Wild Keck Graduate Institute of Applied Life Sciences David_Wild@kgi.edu

Summary • Motivation • Secondary Structure Prediction • Tertiary Structure Prediction • Sequence/Structure Approaches – 3D profile – Threading • Ab-initio Approaches

Growth of PDB

Functional assignment by homology: the function-homology gap; yeast data analyzed by GeneQuiz

Russell et al. J. Mol. Biol. (1997) 269, 423-439

enterotoxin
homolog: cholera toxin; 80% ID, 98/103 residues with rmsd 0.6 Å
remote homolog: toxic shock syndrome toxin; no sequence similarity but 35/95 residues with rmsd 2.4 Å
analog: tRNA synthetase; no sequence similarity but 41/103 residues with rmsd 2.2 Å; no known functional similarity

From Hegyi and Gerstein (1999)

• Active site formed by loops between the carboxy end of the β-strands and the amino end of the α-helices at one end of the barrel. From Branden and Tooze (1999)

From Hegyi and Gerstein (1999)

Baker and Sali (2000)

Central Dogma “The 3D structure of a protein is determined by its sequence and its environment without the obligatory role of extrinsic factors” • Anfinsen (1973) - renaturation of ribonuclease • Ignores role of chaperones, disulfide interchange enzymes etc.

Dominant Effects in Protein Folding • Net protein stability - diverse chemical properties of main and side chain atoms give rise to interplay of non-covalent and entropic effects • Hydrophobic effect - non-polar core • Atomic packing - van der Waals interactions favor close packing • Conformational entropy - freezing of rotamers • Electrostatic effects - ion pairs and H-bonds • Disulfide bridges

Primary Secondary Tertiary

Secondary Structure Prediction • History and Context • Chou & Fasman • Lim • Garnier-Osguthorpe-Robson • Comparison of Methods • Newer Approaches

Secondary Structure Prediction by Eye • Position of insertions and deletions → probable loop • Conserved Gly/Pro → probable loop • Short runs of conserved hydrophobics → buried β-strand • i, i+2, i+4 pattern of conserved residues → surface β-strand • i, i+3, i+4, i+7 conserved pattern → surface helix • Figure panels: helix, edge strand, buried strand

From Branden and Tooze (1999)

Single Sequence Methods Chou & Fasman 1974 • Propensities of formation based upon frequency of occurrence • Generate tables for α, β, turn & random coil • Strong/weak/indifferent formers & breakers • Rules for nucleation, propagation & termination • 15 protein database - 50% accuracy!
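The frequency-based propensity idea can be sketched in a few lines. This is a minimal illustration, not the published Chou & Fasman tables: the function name `propensities` and the toy sequence and state strings are hypothetical, and the propensity is taken as the fraction of a residue type found in a state divided by the overall fraction of residues in that state.

```python
from collections import Counter

def propensities(sequence, states, state="H"):
    """Propensity of each residue type for `state`.

    propensity = (fraction of that residue's occurrences in `state`)
                 / (fraction of all residues in `state`)
    Values above 1 suggest a former, values below 1 a breaker.
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    total = Counter(sequence)
    in_state = Counter(aa for aa, s in zip(sequence, states) if s == state)
    overall = sum(in_state.values()) / len(sequence)
    if overall == 0:
        raise ValueError("state never observed")
    return {aa: (in_state[aa] / total[aa]) / overall for aa in total}

# Toy data (hypothetical): six helical residues followed by four coil residues.
seq = "AAEELLGGPP"
ss = "HHHHHHCCCC"
helix = propensities(seq, ss)
print(helix["A"], helix["G"])
```

On real data the counts would come from a database of known structures, and the resulting values would be binned into former/breaker classes.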

The Lim Method (1974) • Theory based on packing of polypeptide chains, e.g. α-helices that make contact with the main protein body need a hydrophobic side • Hydrophobic residues must face internally and pack closely together • Method defines hydrophobics/hydrophilics and passageway residues • Advantage: rules have a clear basis in protein chemistry theory • Disadvantage: rules complex & difficult to understand

Class              α-helix                        β-strand
strong former      glu, ala, leu                  met, val, ile
former             his, met, gln, trp, val, phe   cys, tyr, phe, gln, leu, thr, trp
weak former        lys, ile                       ala
indifferent former asp, thr, ser, arg, cys        arg, gly, asp
breaker            asn, tyr                       lys, ser, his, asn, pro
strong breaker     pro, gly                       glu

Single Sequence Methods Garnier, Osguthorpe, Robson (GOR), 1978 • Window of 17 residues (i-8 to i+8) • 4 states - predicted structure is highest value summed over window • “Information theoretic” approach • Single sequence GOR I - 55% accuracy • GOR III - pair information - correlate the type of residues in a window with the residue to be predicted • Sensitive to database size - getting better all the time

GOR I Example
• For alanine: 240 in helix, 150 not in helix, total 390 residues
• For all residues: 780 in helix (H), 1050 not in helix (~H), total 1830
P(S=H|A) = 240/390 = 0.615
P(S=~H|A) = 150/390 = 0.385
P(S=H) = 780/1830 = 0.426
P(S=~H) = 1050/1830 = 0.573
I(S=H:~H; A) = ln(0.615/0.385) - ln(0.426/0.573) (log-odds ratio)
= 0.4683 - (-0.2964) = 0.7647
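The log-odds calculation in the GOR I example can be reproduced directly from the counts. The sketch below uses a hypothetical helper name, `gor_information`, and works from unrounded probabilities, so it gives roughly 0.767 rather than the 0.7647 obtained from the three-decimal values on the slide.

```python
import math

def gor_information(n_res_in_state, n_res_total, n_all_in_state, n_all_total):
    """I(S=state : not state; residue) as a difference of log-odds."""
    p_state_given_res = n_res_in_state / n_res_total
    p_state = n_all_in_state / n_all_total
    odds_given_res = p_state_given_res / (1.0 - p_state_given_res)
    odds_background = p_state / (1.0 - p_state)
    return math.log(odds_given_res) - math.log(odds_background)

# Counts from the alanine example: 240 of 390 Ala in helix,
# 780 of 1830 residues in helix overall.
info_ala_helix = gor_information(240, 390, 780, 1830)
print(round(info_ala_helix, 3))
```

A positive value means alanine is seen in helices more often than the background rate would suggest.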

Neural networks applied to SS prediction • Use known structures as target function • Single sequence methods not that successful, but better than GOR (Qian & Sejnowski, 1988 ~ 63%) • Adding information from an alignment substantially improves accuracy • Disadvantage: one loses sight of original problem due to ‘black box’ nature of prediction method • Large number of parameters

Qian and Sejnowski (1988) • 13 residue window • Input: -6, -5, -4, -3, -2, -1, X, +1, +2, +3, +4, +5, +6 • Input layer: 13 groups, each of 21 units (20 residues plus space) • Hidden layer • Output layer: 3 groups (H, E, C) • Prediction of center residue X

• Binary coding of amino acid residues – 20 residues require 5 bits – for instance ala = 00001 cys = 00010 asp = 00011 … trp = 10100 • Could alternatively encode 5 properties, e.g. hydrophobicity, side chain size etc.
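The two input encodings can be compared with a short sketch. Everything named here is an assumption made for illustration: the residue ordering (one-letter codes in alphabetical order, which reproduces ala, cys and asp as 00001, 00010, 00011), the `-` spacer symbol, and the helper names `windows`, `one_hot` and `five_bit`.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
SPACER = "-"                          # assumed symbol for positions past the chain ends
ALPHABET = AMINO_ACIDS + SPACER       # 21 symbols per window position

def windows(sequence, half_width=6):
    """Yield one window of 2*half_width+1 residues centred on each position."""
    padded = SPACER * half_width + sequence + SPACER * half_width
    for i in range(len(sequence)):
        yield padded[i:i + 2 * half_width + 1]

def one_hot(window):
    """21 units per position: 13 positions give 273 inputs."""
    vec = []
    for ch in window:
        unit = [0] * len(ALPHABET)
        unit[ALPHABET.index(ch)] = 1
        vec.extend(unit)
    return vec

def five_bit(window):
    """5 bits per position; the spacer is coded as 00000."""
    bits = []
    for ch in window:
        code = AMINO_ACIDS.find(ch) + 1  # find() returns -1 for the spacer
        bits.extend(int(b) for b in format(code, "05b"))
    return bits

first = next(windows("MKTAYIAKQR"))
print(first, len(one_hot(first)), len(five_bit(first)))
```

The one-hot form is larger but treats all residues as equally dissimilar, whereas the compact binary code imposes arbitrary similarities between residues that share bits.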

PHD Neural Network Rost & Sander, 1993 • Uses multiple independent neural networks as prediction engine • Balanced training - present network with one structural class at a time • Addition of evolutionary information improves prediction quality • How: 1. Sequence to structure - input coded as a profile, trained against known structure 2. Structure to structure - predicted SS trained against known structure 3. Jury decision - numerical average over a number of different level 2 networks
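The jury step (a numerical average over several networks) can be sketched as follows. This is only an illustration of averaging and picking the top state; the function name `jury_decision` and the example output values are hypothetical, and the real PHD networks and their training are not reproduced.

```python
def jury_decision(network_outputs, states="HEC"):
    """Average per-state outputs over networks and return the winning state."""
    n = len(network_outputs)
    averaged = [sum(out[k] for out in network_outputs) / n
                for k in range(len(states))]
    best = max(range(len(states)), key=lambda k: averaged[k])
    return states[best], averaged

# Hypothetical outputs (H, E, C) of three level-2 networks for one residue.
outputs = [(0.6, 0.3, 0.1), (0.4, 0.5, 0.1), (0.5, 0.2, 0.3)]
state, avg = jury_decision(outputs)
print(state, avg)
```

Averaging lets networks that disagree on a residue cancel out, which is one reason an ensemble can be steadier than any single member.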

Profile/PSSM • Position Specific Scoring Matrix, or weight matrix, is calculated based on observed frequencies in a column
GCGGTGATAATGGTTGCATG
TTGGGTATATTTGACTATGG
ATGCATACACTATAGGTGTG
TGCAGTAAGATACAAATGGC
ATGGTTATAGTATGCCCATG
Acknowledgement: Mike Gribskov
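Column frequencies for the five sequences on this slide can be computed as below. The function name `column_frequencies` is hypothetical, and no pseudocounts or log-odds conversion are applied; a practical PSSM would normally add both.

```python
from collections import Counter

SEQS = [
    "GCGGTGATAATGGTTGCATG",
    "TTGGGTATATTTGACTATGG",
    "ATGCATACACTATAGGTGTG",
    "TGCAGTAAGATACAAATGGC",
    "ATGGTTATAGTATGCCCATG",
]

def column_frequencies(seqs, alphabet="ACGT"):
    """Return one {character: frequency} dict per alignment column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    matrix = []
    for column in zip(*seqs):
        counts = Counter(column)
        matrix.append({ch: counts[ch] / len(seqs) for ch in alphabet})
    return matrix

freqs = column_frequencies(SEQS)
print(freqs[0])  # first column
print(freqs[6])  # seventh column, fully conserved in this alignment
```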

Weight Matrix Methods • Position specific scoring matrix (PSSM) • Feature is represented as a matrix with a score for every possible character • A simple weight matrix for the bacterial promoter -10 region; values here are simply % frequencies

Position    A    C    G    T   Consensus
1           2    9   10   79   T
2          95    2    1    3   A
3          26   14   16   44   T
4          59   13   15   13   A
5          51   20   13   17   A
6           1    3    0   96   T

Acknowledgement: Mike Gribskov
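A weight matrix like the -10 region table is typically used by sliding it along a sequence and scoring each window. The sketch below is one way to do that under stated assumptions: a uniform 25% background, a pseudocount of 1 to avoid log(0), and hypothetical names `log_odds` and `best_site`. The test sequence is invented.

```python
import math

# % frequencies for the six positions of the -10 region, as tabulated above.
MINUS10 = [
    {"A": 2, "C": 9, "G": 10, "T": 79},
    {"A": 95, "C": 2, "G": 1, "T": 3},
    {"A": 26, "C": 14, "G": 16, "T": 44},
    {"A": 59, "C": 13, "G": 15, "T": 13},
    {"A": 51, "C": 20, "G": 13, "T": 17},
    {"A": 1, "C": 3, "G": 0, "T": 96},
]

def log_odds(window, matrix=MINUS10, background=25.0, pseudo=1.0):
    """Sum of log2((frequency + pseudocount) / background) over the window."""
    return sum(math.log2((col[ch] + pseudo) / background)
               for col, ch in zip(matrix, window))

def best_site(sequence, matrix=MINUS10):
    """Return (start, site, score) of the highest-scoring window."""
    width = len(matrix)
    scored = [(log_odds(sequence[i:i + width], matrix), i)
              for i in range(len(sequence) - width + 1)]
    score, start = max(scored)
    return start, sequence[start:start + width], score

start, site, score = best_site("GGCTATAATGC")
print(start, site, round(score, 2))
```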

From Baldi and Brunak (2001)

Nearest Neighbor Methods Salamov & Solovyev, NSSP 1995 • Use database of proteins of known structure • Match each segment of query sequence against all sequences in database • Choose secondary structure state of the majority of its neighbors as the prediction • Neighbors are decided upon by using amino acid substitution tables and scoring tables
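The majority-vote idea can be sketched with a toy database. This is not the NSSP scoring scheme: simple residue identity stands in for the amino acid substitution and scoring tables, and the segments, states and the names `similarity` and `predict_state` are all hypothetical.

```python
from collections import Counter

def similarity(a, b):
    """Stand-in score: number of identical positions (assumption)."""
    return sum(x == y for x, y in zip(a, b))

def predict_state(segment, database, k=3):
    """Predict the state of `segment` by majority vote of its k nearest neighbors."""
    ranked = sorted(database,
                    key=lambda entry: similarity(segment, entry[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical database of (segment, state of its central residue).
DB = [
    ("AEELL", "H"),
    ("AEALL", "H"),
    ("VIVTV", "E"),
    ("GPGSG", "C"),
    ("AELLK", "H"),
]
prediction = predict_state("AEELK", DB)
print(prediction)
```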

Identifying factors that affect secondary structure King & Sternberg, DSC 1996 • Relative aa position in chain • Treatment of insertions/deletions • Hydrophobic moment • %aa content • not a ‘black box’ technique

CASP2 - Blind Prediction of Protein Secondary Structure Server Predictions M=Multiple S=Single Zemla et al. Proteins (1997) Suppl. 1, 140-150

Issues • Definition of secondary structure from 3D coordinates is not exact • Different algorithms to define secondary structure (DSSP, STRIDE, DEFINE, Author, P-Curve) give different definitions; agreement figures for the DSSP/Stride/Define comparisons: 95%, 74%, 73% • Definition itself is open to interpretation - there are more than 3 states defined: H, E, G, I, T, C, B, S → H, E, C

β-sheets are formed by long range interactions
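Reducing the eight listed states to H, E, C and scoring per-residue agreement can be sketched as follows. The mapping shown (H, G, I to H; E, B to E; the rest to C) is one commonly used convention and is an assumption here, since other reductions exist and change the measured accuracy. The names `reduce_states` and `q3` and the example strings are hypothetical.

```python
# Assumed 8-to-3 state reduction; other conventions exist.
REDUCTION = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "C": "C",
}

def reduce_states(states):
    """Map an 8-state string onto the 3-state alphabet H, E, C."""
    return "".join(REDUCTION[s] for s in states)

def q3(predicted, observed):
    """Fraction of residues whose predicted 3-state label matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

observed = reduce_states("HHGTEEBSCI")
accuracy = q3("HHHHEEECCC", observed)
print(observed, accuracy)
```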

Generative probabilistic models (Schmidler et al. (2000); Chu et al. (2004))

Tertiary Structure Prediction • Comparative modeling – Homology modeling • Fragment-based – COMPOSER – SWISS-MODEL • 3D distance constraints – MODELLER • Fold Recognition/Threading/Inverse Folding • Proteins may have undetectable sequence similarity but striking structural similarity. • Glimmers in the twilight zone (Doolittle, 1987)

Sequence Alignment Accuracy: %correctly aligned residues vs. %sequence identity Saqi et al. Prot. Eng (1998)

Russell et al. J. Mol. Biol. (1997) 269, 423-439

Fold Recognition Methods • Sequence profile – PSI-BLAST – HMM – Environmental PSSM • Structural profile – 3D-1D profile • Threading – Pair potential based fold recognition

Ab initio/De Novo Folding • Combinatorial approaches – Secondary structure prediction + Docking • Energy minimization • Monte Carlo simulation – Fragments of highly resolved protein structures are joined together and the feasibility of the fold is evaluated with a potential function. • Lattice simulations – Still mainly developer based usage.

From Higgins and Taylor (2000)

Bowie, Luthy and Eisenberg (1991)

Threader Jones et al 1999 • Structural role of residue described in terms of interactions • ‘Network’ of pairwise interatomic energy terms (potentials) from a statistical analysis of proteins of known structure and inverse Boltzmann equation (Sippl 1990) used as sequence-structure compatibility function

For specified atoms in a pair of residues {a, b}, with a sequence separation of k and distance interval s, the potential is defined in terms of:
• m_ab, the number of pairs ab observed at sequence separation k
• σ, the weight of each observation
• f_k^ab(s), the equivalent frequency of occurrence of residue pair ab
• f_k(s), the frequency of occurrence of all residue pairs at sequence separation k and separation distance s
Potentials corresponding to short (sequence separation k < 11), medium (11 ≤ k ≤ 22), and long (k > 30) range interactions have been utilized.
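The quantities defined here match the terms of the knowledge-based pair potential in the style of Sippl (1990). Assuming that form, with RT as the usual thermal energy factor (a symbol not given in the slide text), the potential is commonly written as:

```latex
\Delta E^{ab}_{k}(s) \;=\; RT \,\ln\!\left[\,1 + m_{ab}\,\sigma\,\right]
\;-\; RT \,\ln\!\left[\,1 + m_{ab}\,\sigma\,\frac{f^{ab}_{k}(s)}{f_{k}(s)}\,\right]
```

When the pair ab is seen at distance s more often than residue pairs in general, the ratio exceeds 1 and the energy is negative (favorable); for rarely observed pairs (small m_ab) the expression shrinks toward zero, which damps poorly sampled statistics.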

Ab-initio Approaches: LINUS (Srinivasan & Rose, 1995)
Folding by “Hierarchical Condensation”
Flowchart: starting from C_start, cycles step along the chain, and each step generates a trial conformation C*:
1. Randomly choose backbone conformation
2. Bump check C*
3. Calculate energy of C*, U(C*)
4. If U(C*) < U(C), or x < e^(-βΔE) where x is random and 0 < x < 1, then C = C*
Fragments of 50 residues, interaction interval 6 < Δ < 48
Simple potential: • Contact energy • H-bonding • Main chain ‘torsional potential’ (φ > 0 except for glycine)

ROSETTA Simons et al, 1997 • Metropolis Monte Carlo simulated annealing procedure • 3 and 9 residue fragments of known structures with local sequences similar to the target sequence • Potential function - sequence dependent terms – hydrophobic burial – electrostatics and disulfide bonding • sequence independent terms – hard sphere packing – alpha-helix and beta-strand packing – collection of beta-strands in beta-sheets

FRAGFOLD Jones (1997, 2001) Library of super-secondary structure fragments; β-hairpin motif. From Branden and Tooze (1999)
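The acceptance test in step 4 is the standard Metropolis criterion, sketched below. This is a generic illustration rather than LINUS code: the function name `metropolis_accept`, the default `beta` (taken here as an inverse-temperature factor) and the energies in the usage lines are assumptions.

```python
import math
import random

def metropolis_accept(u_current, u_trial, beta=1.0, rng=random):
    """Accept a trial conformation if it lowers the energy, or otherwise
    with probability exp(-beta * (u_trial - u_current))."""
    if u_trial < u_current:
        return True
    x = rng.random()  # uniform random number, 0 <= x < 1
    return x < math.exp(-beta * (u_trial - u_current))

# Downhill moves are always accepted.
downhill = metropolis_accept(5.0, 3.0)

# Uphill moves by one energy unit are accepted about exp(-1) of the time.
rng = random.Random(0)
trials = 10000
uphill_rate = sum(metropolis_accept(0.0, 1.0, rng=rng) for _ in range(trials)) / trials
print(downhill, round(uphill_rate, 2))
```

Occasionally accepting uphill moves is what lets the search escape local minima; lowering the temperature (raising `beta`) over time turns the same rule into simulated annealing.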

Folding Proteins with Boltzmann Learning Rule • NOT traditional ab initio folding • Learn the potentials that maximize the probability of known native folds • Then, use learned potential for future folding (Ole Winther & Anders Krogh 2003)

Boltzmann Learning Rule • The model defines the probability of the native fold i given sequence i and the model parameters • The parameters are updated iteratively, with a learning rate, so as to increase that probability
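A generic Boltzmann-distribution form of these two statements is given below as a sketch. The symbols are assumptions introduced for illustration and are not taken from the slide: θ for the model parameters, η for the learning rate, E for the energy assigned by the potential, and c ranging over candidate conformations.

```latex
P(\mathrm{native}_i \mid \mathrm{sequence}_i, \theta)
  \;=\; \frac{\exp\!\big(-E(\mathrm{native}_i;\theta)\big)}
             {\sum_{c} \exp\!\big(-E(c;\theta)\big)},
\qquad
\theta \;\leftarrow\; \theta \;+\; \eta\,
  \frac{\partial}{\partial\theta} \sum_i \log P(\mathrm{native}_i \mid \mathrm{sequence}_i, \theta)
```

Under this form, a gradient step lowers the energy of the native fold relative to the average energy of competing conformations.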

Potentials • Lennard-Jones between atoms X and Y • Hydrogen bonds • Others Total of more than 1000 model parameters to learn
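The Lennard-Jones term can be illustrated with the textbook 12-6 form. The parameter names and the default values of `epsilon` (well depth) and `sigma` (zero-crossing distance) are assumptions in reduced units; the learned parameters of the model are not given in the slides.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones energy between two atoms at distance r."""
    if r <= 0:
        raise ValueError("distance must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)

r_min = 2 ** (1 / 6)  # distance of the energy minimum when sigma = 1
print(lennard_jones(1.0), lennard_jones(r_min), lennard_jones(3.0))
```

The energy is zero at r = sigma, reaches its minimum of -epsilon at r = 2^(1/6) sigma, and rises steeply at shorter distances, which penalizes atomic overlap.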

Assessment • LiveBench • CAFASP3 Servers – Evaluation Results • CASP5 – Evaluation Results