Bioinformatics Ahmet Sacan 06 Protein secondary structure prediction

Images are from "Understanding Bioinformatics" and Wikipedia.

Defining secondary structures
• Real secondary structures are not perfect
  – β-strands are usually curved
  – β-bulges are common
• Hydrogen bonding
  – DSSP, STRIDE
• C-alpha distances, dihedral angles
  – PALSSE, DEFINE
Example (sequence, then assigned states):
TTABPSIVARSNFNVCRLPGTPEAICATYTGB
GEECSSHHHHHHTTTCCHHHHHH

DSSP
• DSSP: the popular method for defining secondary structures
  – H: α-helix
  – E: β-strand
  – G: 3-10 helix (H-bonds between residues i and i+3)
  – I: π-helix (H-bonds between residues i and i+5)
  – B: beta bridge
  – S: bend
  – T: turn
  – C: coil (none of the above)
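Prediction methods are usually evaluated on three states rather than the eight DSSP states. A minimal sketch of one commonly used reduction follows. The mapping itself (H, G, I to helix; E, B to strand; everything else to coil) is an assumption, since other reductions are also in use, and the names `DSSP_TO_3STATE` and `reduce_to_three_states` are hypothetical.

```python
# Assumed 8-state to 3-state mapping; other conventions exist
# (for example, treating G or B as coil).
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strands and bridges
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}

def reduce_to_three_states(dssp_string):
    """Map a DSSP state string to H/E/C; unknown symbols become coil."""
    return "".join(DSSP_TO_3STATE.get(state, "C") for state in dssp_string)

print(reduce_to_three_states("GEECSSHHHHHHTTTCCHHHHHH"))
# HEECCCHHHHHHCCCCCHHHHHH
```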

PALSSE uses C-alpha distances

Disagreement among methods
• A) DSSP, B) PALSSE, C) Beta-spider
• Methods (usually) agree on alpha-helices, but not on beta-strands.

Secondary structure prediction
• Given:
Sequence:
MKLFYKPGACSLASHITLRESGKDFTLVSVDL
MKKRLENGDDYFAVNPKGQVPALLLDDGT
LLTEGVAIMQYLADSVPDRQLLAPVNSISRY
KTIEWLNYIATELHKGFTPLFRPDTPEEYKPT
VRAQLEKKLQYVNEALKDEHWICGQRFTIA
DAYLFTVLRWAYAVKLNLEGLEHIAAFMQR
MAERPEVQDALSAEGLK
Secondary structure:
CEEEECTTSTTHHHHHTTCCCEEEEEE
TTTTEETTSCBSTTTCTTCCSCEEECTTSCEEE
SHHHHHHCGGGCSSCCTTCHHHHHHTHHHHGGGGCSSSCTT
THHHHHHHHHTTTSSBTTBS
SCCHHHHHHHHTCCCTTCHH
HHHHHTCHHHHHTTCC
• Find:
Sequence:
TTABPSIVARSNFNVCRLPGTPEAICATYTGB
Secondary structure:
? ? ? ? ? ? ? ? ?

Secondary structure prediction

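Determining success
• Q3: percentage of residues whose predicted state (helix, strand, or coil) matches the observed state
• SOV: segment overlap, which scores how well predicted segments overlap the observed segments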

SOV is more useful for fold-recognition applications

              States          SOV   Q3
X-Ray         CHHHHHC         100
Prediction 1  CHCHCHCC         12   58
Prediction 2  CCCHHHHHCCCC     63   58
Prediction 3  CHHHCHHC         41   83
Prediction 4  CHHCCHHHHHCC     52   75
Prediction 5  CCCHHHHHHCCC     80   67

• Predictions 1 and 2 have the same composition, but Prediction 1 is unrealistic.
• SOV gives higher scores when only one helix is predicted (Predictions 2 and 5).
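Q3 is simple enough to compute by hand. The sketch below assumes the observed (X-ray) string is the 12-residue `CHHHHHHHHHHC`; that string is an assumption chosen to match the length of Predictions 2, 4 and 5, and the function name `q3` is hypothetical. SOV is not implemented here because it needs segment-level bookkeeping.

```python
def q3(observed, predicted):
    """Percentage of positions where the predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)

observed = "CHHHHHHHHHHC"   # assumed observed string (12 residues)

print(round(q3(observed, "CCCHHHHHCCCC")))  # Prediction 2 -> 58
print(round(q3(observed, "CHHCCHHHHHCC")))  # Prediction 4 -> 75
print(round(q3(observed, "CCCHHHHHHCCC")))  # Prediction 5 -> 67
```

Under that assumption the three values agree with the Q3 column above, which also shows why Q3 alone cannot tell a fragmented prediction from a contiguous one: it only counts per-residue matches.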

Specialized scores for transmembrane helix prediction
• Example: the method misses the first 2-5 residues of the helix.

Training/Test datasets
• Include a range of different and unrelated sequences and structures
  – Avoid compositional bias (e.g., more alpha-helices and random coils than β-strands)
• No protein in the test set should be homologous to any training set protein.
  – Use "non-redundant", "representative" datasets (PDB_SELECT, ASTRAL, etc.)

Secondary structure prediction
• Classical methods were knowledge-based and statistical
• Current methods use machine learning

Chou-Fasman Propensities
• 1974, 15 structures
• Later repeated for 64 and 144 proteins.
• Assign SSE to residues, then look for windows:
  – alpha helix: 6 residues
  – beta sheet: 5 residues
  – transmembrane helix: 15-20 residues
Table legend: H: helix, B: sheet, I: indifferent. Letter case reflects the magnitude.
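The two steps above (derive per-residue propensities, then scan windows) can be sketched as follows. This is a simplified illustration rather than the full Chou-Fasman procedure: the training data are toy strings, the rule "at least 4 helix formers in a 6-residue window" and the threshold of 1.0 are assumptions, and extension and conflict resolution are left out. The function names are hypothetical.

```python
from collections import Counter

def propensities(sequences, structures, state):
    """Propensity = (fraction of a residue type found in `state`)
    divided by (fraction of all residues found in `state`)."""
    total, in_state = Counter(), Counter()
    for seq, ss in zip(sequences, structures):
        for aa, s in zip(seq, ss):
            total[aa] += 1
            if s == state:
                in_state[aa] += 1
    background = sum(in_state.values()) / sum(total.values())
    return {aa: (in_state[aa] / total[aa]) / background for aa in total}

def nucleation_windows(seq, prop, window=6, min_formers=4, threshold=1.0):
    """Start indices of windows containing enough residues with propensity > threshold."""
    hits = []
    for start in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[start:start + window]
                      if prop.get(aa, 0.0) > threshold)
        if formers >= min_formers:
            hits.append(start)
    return hits

# Toy training set (not real data): A and E only in helix, G and P only in coil.
train_seqs = ["AAAAGGGG", "AEAEGPGP"]
train_ss   = ["HHHHCCCC", "HHHHCCCC"]
helix_prop = propensities(train_seqs, train_ss, "H")

print(helix_prop["A"], helix_prop["G"])                     # 2.0 0.0
print(nucleation_windows("GGGGAEAEAAGGGG", helix_prop))     # [2, 3, 4, 5, 6]
```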

Propensities are position-dependent
Figure: A), B) alpha-helix propensity as a function of distance from the N-cap; propensities averaged over all hydrophobic residues.
• Hydrophobic residues prefer to be on the same side as the first residue.

GOR
• The GOR method uses information theory
  – Self: information contained by the residue type itself
  – Directional: information contained by the types of other residues (without using the self type)
  – Pair: same as directional, but including the self residue type.
Figure: The effect of Pro vs. Met at position j+5 on alpha-helix probability
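For reference, the self and directional terms can be written down compactly. The notation below is an assumption (the symbols $S$, $R$, $j$, $m$ and the 17-residue window, 8 positions on each side, are not taken from the slides), and GOR variants differ in the exact form of the information measure.

```latex
% Self information: what residue type R says about state S at the same position
I(S; R) = \log \frac{P(S \mid R)}{P(S)}

% Directional approximation: sum the contributions of each window position
% independently (m = 0 is the self term, m \neq 0 are the directional terms)
I(S_j; R_{j-8}, \ldots, R_{j+8}) \approx \sum_{m=-8}^{8} I(S_j; R_{j+m})
```

The predicted state at position $j$ is then the state $S$ with the largest summed information. A residue such as Pro at $j+5$ contributes a negative helix term, which is the kind of effect the figure illustrates.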

Example GOR IV prediction

Homologous sequences
• Information from homologous sequences enhances prediction results

Nearest Neighbor Methods
• NNSP, SSPAL

Neural Network methods
• PHD
• PROF
• PSIPRED
• Jnet

A single neural unit
• Edges multiply the signal (xi) by some weight (θi).
• Nodes sum inputs
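The two bullet points amount to a weighted sum. A minimal sketch, where the function name `weighted_sum` and the optional `bias` term are assumptions added for illustration:

```python
def weighted_sum(inputs, weights, bias=0.0):
    """Each edge multiplies its signal x_i by its weight theta_i; the node adds them up.
    The bias term is an assumed extra input that is always 1."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    return bias + sum(theta * x for theta, x in zip(weights, inputs))

print(weighted_sum([1.0, 2.0, 3.0], [0.5, -1.0, 2.0]))  # 4.5
```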

Activation function
• Maps the summed input of a unit to the unit's output
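Which activation function the lecture uses is not stated here, so the choices below are assumptions: a hard threshold and the logistic sigmoid, two textbook options. The sigmoid is convenient for training because it is differentiable everywhere.

```python
import math

def step(z):
    """Hard threshold: fires (1.0) when the summed input is non-negative."""
    return 1.0 if z >= 0.0 else 0.0

def sigmoid(z):
    """Logistic function: a smooth, differentiable version of the step."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(step(-0.3), sigmoid(0.0), sigmoid_derivative(0.0))  # 0.0 0.5 0.25
```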

Multi-layer Neural Network

Double neural network

Neural network inputs for PSSM
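A common way to feed a PSSM to a network is to slide a window along the sequence and concatenate the profile rows inside it. The sketch below assumes a 15-residue window, 20 PSSM columns per position, and one extra flag per position that marks padding beyond the sequence ends; these choices and the function name `window_features` are assumptions, not the encoding of any specific predictor.

```python
def window_features(pssm, center, window=15):
    """Flatten the PSSM rows around `center` into one input vector.
    Positions outside the sequence are zero-filled and flagged with 1.0."""
    n_cols = len(pssm[0])
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
            features.append(0.0)          # real residue
        else:
            features.extend([0.0] * n_cols)
            features.append(1.0)          # padding beyond a terminus
    return features

# Toy 3-residue "PSSM" (made-up scores), window of 3 for readability.
toy_pssm = [[float(i)] * 20 for i in range(1, 4)]
vector = window_features(toy_pssm, center=0, window=3)
print(len(vector))  # 63 = 3 positions x (20 scores + 1 flag)
```

The network then predicts the state of the central residue from this vector, one window per residue.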

Training Neural Network
• Given:
  – sets of input x and correct output y
  – adjust weights, so the network output is similar to y.
• Gradient Descent:
  – Repeatedly move in the best direction that minimizes error.
• Back-propagation:
  – Back-propagate the error into previous layers, so their weights can also be adjusted.
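The sketch below puts these pieces together for a network with one hidden layer. It is a minimal illustration under stated assumptions: sigmoid units, a squared-error loss, per-example updates, and a toy OR data set. None of these choices, nor the function names, come from the slides.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_params(n_in, n_hidden, seed=0):
    """Small random weights, zero biases: (W1, b1, W2, b2)."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return W1, [0.0] * n_hidden, W2, 0.0

def forward(params, x):
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    out = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, out

def loss(params, data):
    """Mean of 0.5 * (output - y)^2 over the data set."""
    return sum(0.5 * (forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)

def backprop(params, x, y):
    """Gradients of 0.5 * (output - y)^2 for one example."""
    W1, b1, W2, b2 = params
    hidden, out = forward(params, x)
    delta_out = (out - y) * out * (1.0 - out)        # error at the output unit
    grad_W2 = [delta_out * h for h in hidden]
    # Push the output error back through W2 into the hidden layer.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out

def train(params, data, lr=0.5, epochs=1000):
    W1, b1, W2, b2 = params
    for _ in range(epochs):
        for x, y in data:
            gW1, gb1, gW2, gb2 = backprop((W1, b1, W2, b2), x, y)
            W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
            b1 = [b - lr * g for b, g in zip(b1, gb1)]
            W2 = [w - lr * g for w, g in zip(W2, gW2)]
            b2 = b2 - lr * gb2
    return W1, b1, W2, b2

or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
start = init_params(n_in=2, n_hidden=2)
trained = train(start, or_data)
print(loss(start, or_data), loss(trained, or_data))
```

A useful sanity check for any hand-written back-propagation is to compare its gradients with finite differences of the loss; the two should agree to several decimal places.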

Gradient Descent
• The gradient is defined (though we can’t solve for the minimum directly)
• It points in the direction of fastest increase

Gradient Descent
• The gradient points in the direction of fastest increase
• To minimize R, move in the opposite direction

Gradient Descent
• Initialize randomly
• Update with small steps
• With a suitably small step size, (nearly) guaranteed to converge to a local minimum
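These three steps fit in a few lines. The example below minimizes a toy function R(θ) = (θ₀ − 3)² + (θ₁ + 1)², chosen only because its minimum at (3, −1) is known; the function names, learning rate, and step count are assumptions.

```python
import random

def gradient_descent(grad, theta, lr=0.1, steps=200):
    """Repeatedly step against the gradient: theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = [t - lr * g for t, g in zip(theta, grad(theta))]
    return theta

def risk(theta):
    """Toy objective R with its minimum at (3, -1)."""
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def risk_gradient(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

rng = random.Random(42)
start = [rng.uniform(-5, 5), rng.uniform(-5, 5)]   # initialize randomly
theta = gradient_descent(risk_gradient, start)     # update with small steps
print([round(t, 6) for t in theta])                # approximately [3.0, -1.0]
```

On this convex bowl every starting point reaches the same minimum. A neural network's error surface has many local minima, so different random initializations can end in different places.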

Gradient Descent

Overtraining

Coiled-coil structures
• Coiled-coils are strong structures, with important functions
  – keratins (intermediate filaments)
  – myosin II (motor protein)
  – fibrin (blood clotting)
• The sequence of each alpha-helix shows a periodicity of 7 amino acids, aka the "heptad pattern"
  – If residues are abcdefg, then a and d are hydrophobic; e and g are often charged.
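The heptad pattern suggests a very simple detector: try all seven registers and see which one puts the most hydrophobic residues at positions a and d. The sketch below is a toy illustration, not a published coiled-coil predictor; the hydrophobic residue set, the scoring rule, and the function names are assumptions, and the e/g charge preference is ignored.

```python
HYDROPHOBIC = set("AILMFVWY")   # assumed definition of "hydrophobic"

def heptad_score(seq, offset):
    """Fraction of a/d positions that are hydrophobic when the first
    residue sits at heptad position `offset` (0 = a, ..., 6 = g)."""
    core = [aa for i, aa in enumerate(seq) if (i + offset) % 7 in (0, 3)]
    if not core:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in core) / len(core)

def best_heptad_register(seq):
    """Heptad position of the first residue that maximizes the a/d score."""
    scores = [heptad_score(seq, k) for k in range(7)]
    best = max(range(7), key=lambda k: scores[k])
    return "abcdefg"[best], scores[best]

# Synthetic sequences (not real proteins): Leu at a and d, polar elsewhere.
print(best_heptad_register("LEELKKK" * 4))          # ('a', 1.0)
print(best_heptad_register("KK" + "LEELKKK" * 4))   # ('f', 1.0)
```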

Coiled-coils
• Leucine zippers
  – The repeating residues a and d are Leu.
  – Occur in transcription factors.

Transmembrane helix prediction
• Find just the membrane helices; or additionally find inside/outside helices, loops, and topology.
• The simplest methods use hydrophobicity
  – hydropathic profile: average hydrophobicity over a sliding window (e.g., 15 residues)
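A hydropathic profile is just a moving average. The sketch below uses the Kyte-Doolittle hydropathy scale, which is an assumption (the slides do not name a scale), with the 15-residue window mentioned above as the default. The function name is hypothetical, and residues outside the 20 standard amino acids raise a `KeyError`.

```python
# Kyte-Doolittle hydropathy values (assumed scale; others are also used).
KYTE_DOOLITTLE = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}

def hydropathy_profile(seq, window=15):
    """Average hydropathy of every full window along the sequence.
    Entry i covers residues i .. i + window - 1."""
    if window > len(seq):
        return []
    scores = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]

# Synthetic example: a hydrophobic stretch flanked by charged residues.
profile = hydropathy_profile("KKKKK" + "LIVLLAVLIVLLAVL" + "RRRRR")
peak = max(range(len(profile)), key=lambda i: profile[i])
print(peak)  # 5: the window that exactly covers the hydrophobic stretch
```

Peaks in the profile that exceed a chosen cutoff are candidate membrane-spanning segments; the cutoff depends on the scale and window and is not fixed here.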

Hydrophilic residues in multi-helix TM proteins
• Amphipathic profile
  – uses the hydrophobic moment.
  – can distinguish between surface helices in globular proteins and single- or multi-pass TM helices.
Figure: A, B) a transmembrane helix; C) a non-transmembrane helix
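The hydrophobic moment treats each residue's hydrophobicity as a vector pointing out from the helix axis and adds the vectors up. A minimal sketch, assuming 100° of rotation per residue (an ideal alpha-helix), a per-residue normalization, and a two-letter toy scale with Kyte-Doolittle values for Leu and Lys; the function name and these choices are assumptions.

```python
import math

def hydrophobic_moment(seq, scale, angle_deg=100.0):
    """Length of the summed hydrophobicity vectors, divided by sequence length.
    Residue n points at angle n * angle_deg around the helix axis."""
    delta = math.radians(angle_deg)
    sin_sum = sum(scale[aa] * math.sin(n * delta) for n, aa in enumerate(seq))
    cos_sum = sum(scale[aa] * math.cos(n * delta) for n, aa in enumerate(seq))
    return math.hypot(sin_sum, cos_sum) / len(seq)

TOY_SCALE = {"L": 3.8, "K": -3.9}   # assumed values, for illustration only

# 18 residues make exactly 5 turns, so a uniform helix has no net moment.
uniform = "L" * 18
# Amphipathic helix: Leu on one face, Lys on the opposite face.
amphipathic = "".join(
    "L" if math.cos(n * math.radians(100.0)) > 0 else "K" for n in range(18)
)

print(hydrophobic_moment(uniform, TOY_SCALE))       # close to 0
print(hydrophobic_moment(amphipathic, TOY_SCALE))   # clearly larger
```

A helix that is hydrophobic all the way around (typical for a single-pass TM helix) has high average hydrophobicity but a low moment, while a surface helix has a high moment; combining the two numbers is what lets the profile separate them.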

Orienting the helix
• Positive-inside rule: intracellular loops have a higher concentration of Arg (R) and Lys (K).
• Or, build the inside/outside into the model.
• SOSUI assumes at least one hydrophobic helix is present (the primary helix), and it uses the following features to identify others:
  – hydropathy index
  – amphiphilicity index
  – index of amino acid charges
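The positive-inside rule can be applied once the helices are located: the loops alternate between the two sides of the membrane, so it is enough to compare the Arg + Lys counts of the even-numbered and odd-numbered loops. The sketch below is a bare-bones illustration with hypothetical function names and made-up loop sequences; real predictors weight loop length and other features.

```python
def positive_charge(segment):
    """Number of Lys (K) and Arg (R) residues in a loop."""
    return sum(segment.count(aa) for aa in "KR")

def predict_inside_loops(loops):
    """Given the non-membrane segments in sequence order (they alternate sides),
    return 'even' or 'odd' for the set of loops predicted to be intracellular,
    or None when the K+R counts tie."""
    even = sum(positive_charge(loop) for loop in loops[0::2])
    odd = sum(positive_charge(loop) for loop in loops[1::2])
    if even == odd:
        return None
    return "even" if even > odd else "odd"

# Made-up loops: N-terminal tail, loop 1, loop 2, C-terminal tail.
print(predict_inside_loops(["MKRKR", "DSG", "RKKA", "TE"]))  # even
```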

HMM for transmembrane-helix prediction

Using consensus

RNA secondary structure prediction
• Most methods try to minimize energy ~ maximize the number of hydrogen bonds.
Figure: Prediction of an initiator tRNA by A) MFOLD, B) GeneBee, C) RNAfold.
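The simplest version of this idea is the Nussinov dynamic program, which maximizes the number of base pairs instead of minimizing a free energy. It is shown here as an illustration only: it is not the algorithm used by MFOLD or RNAfold, which score stacking and loop energies. The allowed pairs (Watson-Crick plus G-U wobble) and the minimum hairpin loop of 3 unpaired bases are assumptions.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested base pairs, with at least `min_loop`
    unpaired bases inside every hairpin."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs in the subsequence seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case 1: j is unpaired
            for k in range(i, j - min_loop):       # case 2: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(nussinov("GGGAAACCC"))  # 3: a hairpin with three G-C pairs
```

Recovering the actual structure requires a traceback through the table, which is omitted to keep the sketch short.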