Truncation of Protein Sequences for Fast Profile Alignment

Man-Wai MAK and Wei WANG, The Hong Kong Polytechnic University
Sun-Yuan KUNG, Princeton University

Contents
1. Introduction
   – Cell Organelles and Proteins Subcellular Localization
   – Signal-Based vs. Homology-Based Methods
2. Speeding Up the Prediction Process
   – Predicting Cleavage Site Location
   – Truncating Profiles vs. Truncating Sequences
   – Perturbational Discriminant Analysis
3. Experiments and Results
4. Conclusions

Organelles
• Cells have a set of organelles that are specialized for carrying out one or more vital functions.
• Proteins must be transported to the correct organelles of a cell to properly perform their functions.
• Therefore, knowing the subcellular localization is one step towards understanding the functions of proteins.

Proteins and Their Subcellular Location

Subcellular Localization Prediction
Two key methods:
1. Signal-based
2. Homology-based

Signal-Based Method
[Figure with the cleavage site marked. Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.]
• The amino acid sequence of a protein contains information about its organelle destination.
• Typically, this information is found within a short segment of 20 to 100 amino acids preceding the cleavage site.
• Signal-based methods (e.g., TargetP) can determine the location of the cleavage site.

Homology-Based Method
[Diagram: a full-length query sequence is aligned with each of the N full-length training sequences S(1)=KNKA···, S(2)=KAKN···, ..., S(N)=KGLL···. The resulting N-dim alignment vector is fed to an SVM classifier, which outputs the subcellular location.]
Advantage:
• Can predict sequences that do not have cleavage sites.
Drawback:
• A query sequence must be aligned with every sequence in the training set, which makes the computation time long.
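The step that builds the N-dim alignment vector can be sketched as follows. This is only an illustration: `toy_alignment_score` is a hypothetical stand-in that counts identical residues at matching positions, not the profile alignment used in the work, and the four-residue strings are toy data built from the sequence prefixes shown on the slide.

```python
def toy_alignment_score(a: str, b: str) -> int:
    """Hypothetical stand-in for a real alignment score:
    counts positions where the two sequences share a residue."""
    return sum(x == y for x, y in zip(a, b))


def alignment_vector(query: str, training: list) -> list:
    """Align the query with every training sequence (one score each),
    giving an N-dim vector for N training sequences."""
    return [toy_alignment_score(query, t) for t in training]


# Toy data: prefixes of S(1), S(2), S(N) from the slide.
training = ["KNKA", "KAKN", "KGLL"]
vector = alignment_vector("KNKN", training)
print(vector)  # one score per training sequence
```

The cost is one alignment per training sequence for every query, which is why shortening the sequences being aligned reduces the total time.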

Sequence Length Distribution
[Figure: length distribution of sequences (occurrences vs. sequence length) for Ext, Mit, and Chl, with the SP, mTP, and cTP cleavage sites marked.]
• Many sequences are fairly long, so aligning whole sequences takes a long time.
• cTP, mTP, and SP are each under 100 amino acids and contain the most relevant segment.
• Computation can be saved by aligning the signal segments only.

Proposed Method: Aligning the Segments that Contain the Most Relevant Info.
[Diagram: the amino acid sequence (N-terminus to C-terminus) is passed to a signal-based cleavage site predictor (e.g., TargetP). The sequence is truncated at the predicted cleavage site, and the truncated sequence is passed to the homology-based method, which outputs the subcellular location.]

Aligning Profiles vs. Aligning Sequences
Scheme I: Truncate the profiles of the query sequence
Scheme II: Truncate the query sequence itself

Perturbational Discriminant Analysis
Input and Hilbert Spaces: [diagram of the input space and the Hilbert space]
Empirical Space: [diagram of the empirical space]

Perturbational Discriminant Analysis
• The objective of PDA is to find an optimal discriminant function in the Hilbert space or the empirical space.
• The optimal solution in the empirical space is derived in the paper.
• ρ represents the noise (uncertainty) level in the measurement. It also ensures numerical stability of the matrix inverse.
• ρ = 1 in this work.
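The formula itself is not reproduced on this slide. As a rough sketch only, suppose the empirical-space solution takes a regularized (kernel-ridge-like) form, f(x) = aᵀk(x) + b with a = (K + ρI)⁻¹(y − b·e), where e is the all-ones vector and b is chosen so that the entries of a sum to zero. That form, the function names, and the toy data below are assumptions made for illustration. The sketch shows where ρ enters: it is added to the diagonal of K before solving, which is what keeps the inverse well conditioned.

```python
import numpy as np


def pda_train(K, y, rho=1.0):
    """Solve the ASSUMED form a = (K + rho*I)^-1 (y - b*e).
    rho is added to the diagonal of K, stabilizing the solve."""
    n = K.shape[0]
    e = np.ones(n)
    A = K + rho * np.eye(n)
    Ainv_y = np.linalg.solve(A, y)
    Ainv_e = np.linalg.solve(A, e)
    b = (e @ Ainv_y) / (e @ Ainv_e)   # bias that makes sum(a) == 0
    a = Ainv_y - b * Ainv_e
    return a, b


def pda_score(a, b, k_x):
    """Score of a sample given its vector of kernel values k_x
    against the training set."""
    return float(a @ k_x + b)


# Toy two-class data with a linear kernel (illustrative only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
K = X @ X.T
a, b = pda_train(K, y, rho=1.0)   # rho = 1, as stated on the slide
scores = K @ a + b                # scores of the training samples
```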

Perturbational Discriminant Analysis
Example on 2-D Data
[Figure panels:]
• 3 classes of 2-dim data in the input space
• RBF kernel matrix K
• Projection onto the 2-dim PDA space
• Decision boundaries in the input space

Perturbational Discriminant Analysis
Application to Sequence Classification
[Diagram, training path: training sequences → PSI-BLAST → training profiles → pairwise alignment → kernel matrix K → compute PDA parameters.]
[Diagram, test path: test sequence → PSI-BLAST → test profile → align with training profiles → compute PDA score.]

Perturbational Discriminant Analysis
Application to Multi-Class Problems
1-vs-rest PDA Classifier: [diagram: the per-class PDA scores are fed to a MAXNET]
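A MAXNET stage in a 1-vs-rest setup amounts to picking the class whose classifier gives the highest score. A minimal sketch, assuming one score per class is already available (the score values below are made up for illustration):

```python
def maxnet(scores: dict) -> str:
    """Return the class label with the largest 1-vs-rest score."""
    return max(scores, key=scores.get)


# Made-up scores for the four classes used in this work.
example_scores = {"Ext": 0.2, "Mit": 0.9, "Chl": -0.1, "Cyt/Nuc": 0.4}
predicted = maxnet(example_scores)
print(predicted)
```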

Perturbational Discriminant Analysis
Application to Multi-Class Problems
Cascaded PDA-SVM Classifier: test sequence → project onto (C-1)-dim PDA space → 1-vs-rest SVM classifier → class label

Experiments
Materials:
• Eukaryotic sequences extracted from Swiss-Prot 57.5
• Ext, Mit, and Chl contain experimentally determined cleavage sites
• 25% sequence identity (based on BLASTClust)
Performance Evaluation:
• 5-fold cross-validation
• Prediction accuracy and Matthews correlation coefficient (MCC)
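For reference, the MCC of a single class (treated as that class versus the rest) is computed from the confusion counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch follows; returning 0 when the denominator is zero is a common convention assumed here, not something stated on the slide.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts.
    Returns 0.0 when the denominator is zero (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), so it is less easily inflated by class imbalance than plain accuracy.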

Comparing Kernel Matrices
[Figure: kernel matrix obtained from the query sequences under Scheme I and under Scheme II.]

Sensitivity Analysis
[Figure: subcellular localization accuracy (%) of PairProSVM for Cyt/Nuc, Ext, Mit, Chl, and overall, when each sequence is cut at p±x, where p is the ground-truth cleavage site. Cut-off positions on the horizontal axis: p-16, p-8, p-2, p, p+2, p+16, p+32, p+64.]
• Localization performance degrades as the cut-off position drifts away from the ground-truth cleavage site.
• mTP and cTP are more sensitive to cleavage site prediction errors than Ext.
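The cut-off sweep can be sketched as below. The offsets are the ones on the figure's axis. The function names and the clamping of cut positions to the sequence bounds are assumptions, since the slide does not say how cuts beyond the end of a sequence were handled.

```python
OFFSETS = [-16, -8, -2, 0, 2, 16, 32, 64]  # from the figure's axis


def truncate_at(seq: str, cut: int) -> str:
    """Keep residues from the N-terminus up to position `cut`,
    clamped to [1, len(seq)] (assumed handling of out-of-range cuts)."""
    cut = max(1, min(cut, len(seq)))
    return seq[:cut]


def cutoff_sweep(seq: str, p: int, offsets=OFFSETS) -> dict:
    """Truncate `seq` at p + x for each offset x, where p is the
    ground-truth cleavage site."""
    return {x: truncate_at(seq, p + x) for x in offsets}


# Toy sequence of 40 residues with a cleavage site at p = 20.
variants = cutoff_sweep("A" * 40, p=20)
lengths = [len(variants[x]) for x in OFFSETS]
```

Each truncated variant would then go through the same alignment and classification steps, so that accuracy can be compared across cut-off positions.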

Performance of Cleavage Site Prediction
[Figure: cleavage site prediction accuracy (%) per category for TargetP(Plant), TargetP(Nonplant), and CRF.]
• The conditional random field (CRF) predicts the cleavage sites of signal peptides (Ext) better than TargetP(Plant) but worse than TargetP(Nonplant).
• CRF is slightly inferior to TargetP in predicting the cleavage sites of mitochondrial proteins, but it is significantly better than TargetP in predicting the cleavage sites of chloroplast proteins.

Comparing Profile Creation Time
Scheme I: query sequence → PSI-BLAST → long profile → cut → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location
Scheme II: query sequence → cut → short sequence → PSI-BLAST → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location
Findings: Profile creation time can be substantially reduced by truncating the protein sequences at the cleavage sites.

Training and Classification Time
[Figure: training and classification times of the 1-vs-rest SVM classifier, the 1-vs-rest PDA classifier, and the cascaded PDA-SVM classifier (projection onto the (C-1)-dim PDA space followed by a 1-vs-rest SVM).]
Findings: The training times of 1-vs-rest PDA and cascaded PDA-SVM are substantially shorter than that of SVM.

Comparison with State-of-the-Art Localization Predictors
[Figure: localization accuracy (%) and MCC of the compared predictors, with cleavage sites predicted by conditional random fields.]
Findings: In terms of localization accuracy, the proposed “Signal+Homology” method performs slightly better than the signal-based TargetP and substantially better than the homology-based SubLoc.

Conclusion
• Fast subcellular localization prediction can be achieved by a cascaded fusion of signal-based and homology-based methods.
• As far as localization accuracy is concerned, it does not matter whether we truncate the sequences or the profiles. However, truncating the sequences cuts the profile creation time by a factor of six.

Comparison with State-of-the-Art Localization Predictors (continued)

Performance of Cascaded Fusion
[Figure: computation time (hr.) and subcellular localization accuracy (%) for full-length sequences and for sequences truncated at the cleavage site predicted by TargetP(P), TargetP(N), and CRF.]
• Full-length profile alignment takes 116 hours of computation.
• The proposed method not only reduces the computation time nearly 20-fold but also improves the prediction performance.

Fusion of Signal- and Homology-Based Methods
1) Cleavage site detection. The cleavage site (if any) of a query sequence is determined by a signal-based method.
2) Pre-sequence selection. The pre-sequence of the query is obtained by selecting from the N-terminus up to the cleavage site.
3) Pairwise alignment. The pre-sequence is aligned with each of the training pre-sequences to form an N-dim vector, which is fed to a one-vs-rest SVM classifier for prediction.
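The three steps can be strung together as in the sketch below. Everything named here is hypothetical: `predict_site` stands in for a signal-based predictor such as TargetP, `align` stands in for the pairwise alignment score, and keeping the whole sequence when no cleavage site is found is an assumption, since the slide does not say what happens in that case. The final SVM stage is left out.

```python
from typing import Callable, List, Optional


def pre_sequence(seq: str, cleavage_site: Optional[int]) -> str:
    """Step 2: keep the N-terminus up to the cleavage site.
    If no site was found, keep the full sequence (assumption)."""
    return seq if cleavage_site is None else seq[:cleavage_site]


def score_vector(
    query: str,
    training_pre: List[str],
    predict_site: Callable[[str], Optional[int]],
    align: Callable[[str, str], float],
) -> List[float]:
    """Steps 1-3: detect the site, truncate, then align the query
    pre-sequence with every training pre-sequence."""
    q = pre_sequence(query, predict_site(query))
    return [align(q, t) for t in training_pre]


# Toy stand-ins, for illustration only.
def toy_site(seq: str) -> Optional[int]:
    return 4  # pretend every cleavage site is at position 4


def toy_align(a: str, b: str) -> float:
    return float(sum(x == y for x, y in zip(a, b)))


vec = score_vector("MKLAVVVV", ["MKLA", "MKTT", "AAAA"], toy_site, toy_align)
```

The resulting vector has one entry per training pre-sequence and would be the input to the one-vs-rest SVM classifier.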

Perturbational Discriminant Analysis
Spectral Space: The kernel matrix K can be factorized via spectral decomposition.
[Diagram: mapping from the empirical space to the spectral space]
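The factorization is not reproduced on this slide. As a hedged sketch, a symmetric kernel matrix can be written as K = Uᵀ Λ U, with Λ the diagonal matrix of eigenvalues and the rows of U the eigenvectors. The mapping Λ^(-1/2) U k(x) from empirical-space vectors to spectral-space vectors used below is one common choice and is an assumption here; it requires all eigenvalues to be positive.

```python
import numpy as np


def spectral_factorize(K):
    """Return (U, lam) such that K = U.T @ diag(lam) @ U
    for a symmetric matrix K."""
    lam, V = np.linalg.eigh(K)  # columns of V are eigenvectors
    return V.T, lam


def to_spectral(U, lam, K_cols):
    """Map empirical-space vectors (columns of K_cols) to the spectral
    space via Lambda^(-1/2) U k (ASSUMED mapping; needs lam > 0)."""
    return (U @ K_cols) / np.sqrt(lam)[:, None]


# Toy positive-definite kernel matrix.
K = np.array([[2.0, 1.0], [1.0, 2.0]])
U, lam = spectral_factorize(K)
S = to_spectral(U, lam, K)  # spectral-space vectors of the training data
```

With this mapping, inner products between the spectral-space vectors reproduce the kernel matrix (S.T @ S equals K), so a linear method in the spectral space behaves like the kernel method in the empirical space.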