Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)

Peptide Mapping - Mass Accuracy

Peptide Mapping - Database Size (panels: Human, C. elegans, S. cerevisiae)

Peptide Mapping - Cys-Containing Peptides (panels: Human, C. elegans, S. cerevisiae)

Identification – Peptide Mass Fingerprinting
Workflow: pick a protein from the sequence DB, digest it, and compute all peptide masses; compare these with the masses measured by MS, score, and test significance. Repeat for each protein to obtain the identified proteins.
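The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cleavage rule (after K or R, not before P), the 0.05 Da tolerance, the plain match count used as a score, and the function names are all choices made for this example. It is not the scoring used by ProFound or any other search engine, and it omits the significance test.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mh(peptide):
    """Singly protonated monoisotopic mass [M+H]+ of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(measured, protein, tol=0.05):
    """Number of measured masses within tol (Da) of any theoretical peptide mass."""
    theoretical = [peptide_mh(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in measured)


def rank_proteins(measured, database, tol=0.05):
    """Repeat the comparison for each protein and sort by match count (toy score)."""
    scores = [(name, count_matches(measured, seq, tol)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical two-protein database and two measured masses.
database = {"P1": "SGFLEEDELKAAR", "P2": "GGGGGK"}
print(rank_proteins([1166.56, 317.19], database))
```

A real search would replace the match count with a statistical score and then test its significance against the scores of randomly matching proteins.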

ProFound – Search Parameters http://prowl.rockefeller.edu/

ProFound – Protein Identification by Peptide Mapping. W. Zhang & B. T. Chait, Analytical Chemistry 72 (2000) 2482-2489

ProFound Results

Peptide Mapping – Mass Accuracy

Peptide Mapping - Database Size
Expectation values for the peptide mapping example, by sequence collection searched:
  S. cerevisiae   4.8e-7
  Fungi           8.4e-6
  All Taxa        2.9e-4

Database size

Missed Cleavage Sites
Expectation values for the peptide mapping example, by number of missed cleavage sites allowed (u):
  u=1   4.8e-7
  u=2   1.1e-5
  u=4   6.8e-4
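One plausible reading of the trend above is that allowing more missed cleavages enlarges the set of candidate peptides per protein, which gives random matches more opportunities. The sketch below only counts candidates; it assumes the same simple trypsin rule (cleave after K or R, not before P), and the function names and toy sequence are hypothetical.

```python
def cleave(protein):
    """Fully cleaved tryptic pieces: cut after K or R unless followed by P."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    return pieces


def digest(protein, max_missed):
    """All peptides spanning at most max_missed internal (uncut) cleavage sites."""
    pieces = cleave(protein)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


# Toy sequence with three cleavage sites; the candidate count grows with u.
for u in (0, 1, 2):
    print(u, len(digest("AAKCCKDDK", u)))
```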

Peptide Mapping - Partial Modifications
  Protein    Searched without modifications   Searched with possible phosphorylation of S/T/Y
  DARPP-32   0.00006                          0.01
  CFTR       0.00002                          0.005
Even if the protein is modified, it is usually better to search a protein sequence database without specifying possible modifications when using peptide mapping data.
(panels: No Modifications; Phosphorylation (S, T, or Y))

Peptide Mapping - Ranking by Direct Calculation of the Significance

General Criteria for a Good Protein Identification Algorithm
 - The response to random input data should be random.
 - Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
 - Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.
 - The statistical significance of the results should be calculated.
 - The searches should be fast.

Normalized Frequency Response to Random Data

Peptide Fragmentation
Instrument stages: Ion Source, Mass Analyzer 1, Fragmentation, Mass Analyzer 2, Detector. Fragment ion types: b, y.

Identification – Tandem MS

Tandem MS – Sequence Confirmation
Peptide: SGFLEEDELK
  b ions   residue   y ions
  88       S         1166
  145      G         1080
  292      F         1022
  405      L         875
  534      E         762
  663      E         633
  778      D         504
  907      E         389
  1020     L         260
  1166     K         147
Spectrum (% Relative Abundance vs. m/z, axis 0 to 1000): labeled peaks at 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080, and [M+2H]2+. Later build slides annotate mass differences of 113 and 129.
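The b and y ion values on these slides can be checked with a short calculation. The sketch below assumes standard monoisotopic residue masses and singly charged fragments; the function name is hypothetical. Truncating the computed masses to integers reproduces the nominal values on the slide, with one exception: y9 comes out near 1079.5, which the slides show as 1080 here and as 1079 in the de novo example.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return singly charged b ions (b1..b(n-1)) and y ions (y1..y(n-1))."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:
        prefix += m
        b_ions.append(prefix + PROTON)           # N-terminal fragment + proton
    suffix = 0.0
    for m in reversed(masses[1:]):
        suffix += m
        y_ions.append(suffix + WATER + PROTON)   # C-terminal fragment + water + proton
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("SGFLEEDELK")
print([int(m) for m in b_ions])
print([int(m) for m in y_ions])
```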

Tandem MS – de novo Sequencing
Inputs: the spectrum (% Relative Abundance vs. m/z; labeled peaks at 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080, and [M+2H]2+) and the amino acid masses. The mass differences between peaks give the sequences consistent with the spectrum.

Tandem MS – de novo Sequencing
Candidate readings of the mass ladder: SGF(I/L)EEDE(I/L)… and …(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
1166 – 1020 – 18 = 128 => K or Q
1166 – 1079 = 87 => S
Result: SGF(I/L)EEDE(I/L)(K/Q)
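The arithmetic above can be sketched as a lookup of mass differences against nominal residue masses. This is a toy illustration, not a de novo sequencing algorithm: it assumes a complete, clean b-ion ladder with integer masses, and the table and function names are hypothetical. Isobaric pairs (I/L, K/Q at nominal mass) stay ambiguous, as on the slide. The slide derives the N-terminal S from the y ion (1166 – 1079 = 87); the sketch gets the same residue from b1 minus one proton.

```python
# Nominal residue masses; isobaric residues are reported as ambiguous.
NOMINAL = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "(I/L)", 114: "N", 115: "D", 128: "(K/Q)", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def residue(delta):
    """Residue(s) matching a nominal mass difference, or X if none matches."""
    return NOMINAL.get(delta, "X")


def sequence_from_b_ladder(b_peaks, mh):
    """Read a sequence from consecutive b ions and the peptide [M+H]+ (nominal masses)."""
    first = residue(b_peaks[0] - 1)               # b1 = first residue + proton
    middle = [residue(hi - lo) for lo, hi in zip(b_peaks, b_peaks[1:])]
    last = residue(mh - b_peaks[-1] - 18)         # remaining residue after removing water
    return first + "".join(middle) + last


b_ladder = [88, 145, 292, 405, 534, 663, 778, 907, 1020]
print(sequence_from_b_ladder(b_ladder, 1166))
```

With real spectra the ladder is usually incomplete and mixed with y ions, neutral losses, and background peaks, so several sequences can be consistent with the same peaks.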

Tandem MS – de novo Sequencing
Challenges in de novo sequencing:
 - Neutral loss (-H2O, -NH3)
 - Modifications
 - Background peaks
 - Incomplete information

Tandem MS – Database Search
Sample preparation: Lysis, Fractionation, Digestion, LC-MS, MS/MS.
Search: pick a protein from the sequence DB, pick one of its peptides, and compute all fragment masses; compare them with the MS/MS spectrum, score, and test significance. Repeat for all peptides and for all proteins.

Algorithms

Comparing and Optimizing Algorithms

MS/MS - Parent Mass Error and Enzyme Specificity
Expectation values for the MS/MS example:
  Δm=2, trypsin          2.5e-5
  Δm=100, trypsin        2.5e-5
  Δm=2, non-specific     7.9e-5
  Δm=100, non-specific   1.6e-4

Sequest Cross-correlation

X! Tandem - Search Parameters http://www.thegpm.org/

Conventional, single-stage searching
A generic search engine takes the spectra and the sequences and tests all cleavages, modifications, & mutations for all sequences.

Some hard problems in MS/MS analysis in proteomics
Allowing for unanticipated peptide cleavages
 - e.g., chymotryptic contamination in trypsin
 - calculation order ~200 × tryptic cleavage
 - “unfortunate” coefficient
Determining potential modifications
 - e.g., oxidation, phosphorylation, deamidation
 - calculation order 2^n
 - NP complete
Detecting point mutations
 - e.g., sequence homology
 - calculation order 18^N
 - NP complete
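To see where an exponential order comes from for potential modifications, the sketch below enumerates every modified/unmodified combination for a peptide. It assumes that n counts the potential modification sites in the peptide and uses phosphorylation of S, T, and Y as the example; the slide does not define n, and the function name is hypothetical.

```python
from itertools import product


def modification_states(peptide, sites="STY"):
    """All combinations of modified positions: 2**n states for n candidate sites."""
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    states = []
    for flags in product([False, True], repeat=len(positions)):
        # Keep the positions whose flag is set in this combination.
        states.append(tuple(p for p, on in zip(positions, flags) if on))
    return states


# Three candidate sites (S, T, Y) give 2**3 = 8 states to test.
states = modification_states("ASTYK")
print(len(states), states)
```

Each additional candidate site doubles the number of peptide variants that must be scored.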

Multi-stage searching (X! Tandem)
Stages: spectra + sequences, Tryptic cleavage, sequences, Modifications #1, Modifications #2, Point mutation.

Search Results

Sequence Annotations

Identification – Spectrum Library Search
Sample preparation: Lysis, Fractionation, Digestion, LC-MS/MS.
Search: pick a spectrum from the spectrum library; compare it with the MS/MS spectrum, score, and test significance. Repeat for all spectra to obtain the identified proteins.

Steps in making an Annotated Spectrum Library (ASL):
1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the intensity values.
3. Assign a “quality” value: the median expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the averaged spectrum, its parent ion z, m/z, sequence, protein accessions & quality.
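The four steps above might look roughly like this in code. Several details are assumptions made for the example: spectra are dictionaries mapping an integer m/z bin to intensity, "best" means lowest expectation value, and normalization scales the most intense peak to 1. The function and field names are hypothetical.

```python
from statistics import median


def build_library_entry(spectra, expectation_values, sequence, charge, parent_mz,
                        accessions, n_spectra=10, n_peaks=20):
    """Combine the best spectra of one peptide (same PTMs and charge) into one entry."""
    # Step 1: keep the n_spectra spectra with the lowest expectation values (assumed ranking).
    ranked = sorted(zip(expectation_values, range(len(spectra))))[:n_spectra]
    used = [spectra[i] for _, i in ranked]

    # Step 2: add the spectra together, then normalize so the top peak is 1.0 (assumed scaling).
    summed = {}
    for spectrum in used:
        for mz, intensity in spectrum.items():
            summed[mz] = summed.get(mz, 0.0) + intensity
    top = max(summed.values())
    normalized = {mz: intensity / top for mz, intensity in summed.items()}

    # Step 3: quality = median expectation value of the spectra used.
    quality = median(e for e, _ in ranked)

    # Step 4: record the n_peaks most intense peaks plus the annotations.
    peaks = sorted(normalized.items(), key=lambda kv: (-kv[1], kv[0]))[:n_peaks]
    return {"sequence": sequence, "charge": charge, "parent_mz": parent_mz,
            "accessions": accessions, "quality": quality, "peaks": peaks}


# Toy input: two spectra for the same peptide, keeping only the top 2 peaks.
entry = build_library_entry(
    spectra=[{100: 1.0, 200: 3.0}, {100: 1.0, 300: 2.0}],
    expectation_values=[1e-3, 1e-5],
    sequence="SGFLEEDELK", charge=2, parent_mz=583.8,
    accessions=["hypothetical_accession"], n_peaks=2,
)
print(entry["quality"], entry["peaks"])
```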

Spectrum Library Characteristics – Peptide Length

Spectrum Library Characteristics – Protein Coverage

Identification – Spectrum Library Search
Library spectrum (5:25) compared with test spectrum (5:25).
Results: 4 peaks selected, 1 peak missed

Identification – Spectrum Library Search
How likely is this? Apply a hypergeometric probability model:
 - 25 possible m/z values;
 - 5 peaks in the library spectrum; and
 - 4 selected by the test spectrum.
  Matches   Probability
  1         0.45
  2         0.15
  3         0.016
  4         0.00039
  5         0.0000037
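The first four probabilities in the table can be reproduced with the standard hypergeometric formula, taking 25 possible m/z values, 5 library peaks, and 4 peaks drawn by the test spectrum. The function name is hypothetical. With only four draws, a fifth match cannot occur under this parameterisation, so the sketch does not attempt to reproduce the slide's value for 5 matches.

```python
from math import comb


def match_probability(k, possible, library_peaks, test_peaks):
    """P(exactly k of the test peaks fall on library peaks) under a hypergeometric model."""
    if k < 0 or k > min(library_peaks, test_peaks):
        return 0.0
    return (comb(library_peaks, k)
            * comb(possible - library_peaks, test_peaks - k)
            / comb(possible, test_peaks))


for k in range(1, 5):
    print(k, match_probability(k, possible=25, library_peaks=5, test_peaks=4))
```

The probability of matching by chance drops quickly with each additional matched peak.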

Identification – Spectrum Library Search
What if there are 1000 possible m/z values and 20 peaks in both the test and library spectra?
  1 matched: p = 0.6
  5 matched: p = 0.0002
  10 matched: p = 0.0000001

Identification – Spectrum Library Search
The experimental mass spectrum (m/z) is searched against a library of assigned mass spectra to find the best search result.

X! Hunter

X! Hunter algorithm:
1. Use dot product to find a library spectrum that best matches a test spectrum.
2. Calculate p-value with hypergeometric distribution.
3. Use p-value to calculate expectation value, given the identification parameters.
4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
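A minimal sketch of these four steps follows. It is not the X! Hunter implementation. It assumes spectra are dictionaries mapping an integer m/z bin to intensity, that the p-value is the upper tail of the hypergeometric distribution, and that the expectation value is the p-value multiplied by the number of library spectra searched. The slide does not specify these details, and all names are hypothetical.

```python
from math import comb, sqrt


def dot_product(a, b):
    """Normalized dot product of two binned spectra."""
    shared = sum(a[mz] * b[mz] for mz in a if mz in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return shared / norm if norm else 0.0


def hypergeom_tail(k, possible, lib_peaks, test_peaks):
    """P(at least k matching peaks) under a hypergeometric model."""
    top = min(lib_peaks, test_peaks)
    favourable = sum(comb(lib_peaks, j) * comb(possible - lib_peaks, test_peaks - j)
                     for j in range(k, top + 1))
    return favourable / comb(possible, test_peaks)


def hunter_search(test, library, possible_mz):
    # Step 1: library spectrum with the highest dot product.
    best = max(library, key=lambda entry: dot_product(test, entry["peaks"]))
    # Step 2: p-value for the observed number of shared peaks.
    matched = sum(1 for mz in test if mz in best["peaks"])
    p_value = hypergeom_tail(matched, possible_mz, len(best["peaks"]), len(test))
    # Step 3: expectation value (assumed: p-value times number of library spectra).
    expectation = p_value * len(library)
    # Step 4: never report a value better than the library spectrum's own quality.
    return {"sequence": best["sequence"], "matched": matched, "p_value": p_value,
            "expectation": max(expectation, best["quality"])}


# Toy library with two entries; "quality" is the median expectation value of each entry.
library = [
    {"sequence": "AAA", "peaks": {1: 1.0, 2: 1.0, 3: 1.0}, "quality": 1e-3},
    {"sequence": "BBB", "peaks": {10: 1.0, 11: 1.0, 12: 1.0}, "quality": 1e-6},
]
print(hunter_search({1: 1.0, 2: 1.0, 3: 0.5}, library, possible_mz=1000))
```

Step 4 means a match is never reported as more reliable than the spectra that were used to build the library entry.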

X! Hunter Result (panels: Query Spectrum, Library Spectrum)

Significance Testing
 - False protein identification is caused by random matching.
 - An objective criterion for testing the significance of protein identification results is necessary.
 - The significance of protein identifications can be tested once the distribution of scores for false results is known.

Significance Testing - Expectation Values
The majority of sequences in a collection will give a score due to random matching.

Significance Testing - Expectation Values
Workflow: a database search of the spectrum (m/z) gives a list of candidates; build the distribution of scores for random and false identifications; extrapolate and calculate expectation values; output the list of candidates with expectation values.
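One common way to carry out the extrapolation step is to fit a straight line to the logarithm of the survival function of the scores (the number of candidates scoring at least s) and to read off the expected number of random matches at the score of interest. The sketch below assumes that this log-linear form is adequate and uses a plain least-squares fit; the function names and the synthetic scores are hypothetical.

```python
import math
from collections import Counter


def fit_survival(scores):
    """Least-squares line through (score, log10(number of candidates scoring >= score))."""
    counts = Counter(scores)
    points, at_or_above = [], 0
    for s in sorted(counts, reverse=True):
        at_or_above += counts[s]
        points.append((s, math.log10(at_or_above)))
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope  # intercept, slope


def expectation_value(score, intercept, slope):
    """Expected number of random candidates scoring at least `score`."""
    return 10 ** (intercept + slope * score)


# Synthetic random-match scores: 1000, 100, and 10 candidates score >= 1, 2, and 3.
random_scores = [1] * 900 + [2] * 90 + [3] * 10
intercept, slope = fit_survival(random_scores)
print(expectation_value(6, intercept, slope))
```

A candidate whose score lies far beyond the bulk of the random-match distribution gets an expectation value well below 1, which is the basis for calling it significant.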
