Proteomics Informatics Protein identification I searching protein sequence
- Slides: 64
Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)
Peptide Mapping - Mass Accuracy
Peptide Mapping - Database Size: Human, C. elegans, S. cerevisiae
Peptide Mapping - Cys-Containing Peptides: Human, C. elegans, S. cerevisiae
Identification – Peptide Mass Fingerprinting
Experiment: Digestion → MS
Search: Sequence DB → Pick Protein → Digestion → All Peptide Masses → Compare with MS, Score, Test Significance
Repeat for each protein.
Result: Identified Proteins
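The search loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: cleavage follows the common trypsin rule (after K or R, not before P), the score is a plain count of matched masses rather than ProFound's scoring, and the function names and the two-protein database in the usage note are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after K/R unless followed by P; allow up to `missed_cleavages` missed sites."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def mh_mass(peptide):
    """Singly protonated monoisotopic mass [M+H]+ of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def count_matches(measured, sequence, tol_da=0.5, missed_cleavages=0):
    """Number of measured masses within `tol_da` of any theoretical peptide mass."""
    theoretical = [mh_mass(p) for p in tryptic_peptides(sequence, missed_cleavages)]
    return sum(any(abs(m - t) <= tol_da for t in theoretical) for m in measured)


def rank_proteins(measured, database, **kwargs):
    """Score every protein in the database and sort by descending match count."""
    scores = {name: count_matches(measured, seq, **kwargs) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, `rank_proteins([1166.56, 317.19], {"protA": "SGFLEEDELKAAR", "protB": "MKTAYIAK"})` ranks the hypothetical `protA` first. A match count alone says nothing about significance, which is why the workflow ends with a significance test.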
ProFound – Search Parameters http://prowl.rockefeller.edu/
ProFound – Protein Identification by Peptide Mapping. W. Zhang & B. T. Chait, Analytical Chemistry 72 (2000) 2482-2489
ProFound Results
Peptide Mapping – Mass Accuracy
Peptide Mapping - Database Size (S. cerevisiae)
Expectation values, peptide mapping example:
S. cerevisiae: 4.8e-7
Fungi: 8.4e-6
All Taxa: 2.9e-4
Database size
Missed Cleavage Sites (u)
Expectation values, peptide mapping example:
u=1: 4.8e-7
u=2: 1.1e-5
u=4: 6.8e-4
Peptide Mapping - Partial Modifications
DARPP-32: 0.00006 searched without modifications; 0.01 searched with possible phosphorylation of S/T/Y
CFTR: 0.00002 searched without modifications; 0.005 searched with possible phosphorylation of S/T/Y
Even if the protein is modified, it is usually better to search a protein sequence database without specifying possible modifications when using peptide mapping data.
Peptide Mapping - Ranking by Direct Calculation of the Significance
General Criteria for a Good Protein Identification Algorithm
- The response to random input data should be random.
- Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
- Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.
- The statistical significance of the results should be calculated.
- The searches should be fast.
Normalized Frequency Response to Random Data
Peptide Fragmentation
Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector
Fragment ion types: b, y
Identification – Tandem MS
Tandem MS – Sequence Confirmation
Peptide: S G F L E E D E L K
Residue   b ion   y ion
S         88      1166
G         145     1080
F         292     1022
L         405     875
E         534     762
E         663     633
D         778     504
E         907     389
L         1020    260
K         1166    147
[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs m/z (250 to 1000). Labeled peaks: 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080, and [M+2H]2+. Successive slides highlight the matching b and y ions and annotate mass differences of 113 and 129 between peaks.]
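The b and y ion values listed for SGFLEEDELK can be reproduced from residue masses. The sketch below assumes singly charged, unmodified fragments and standard monoisotopic masses; `fragment_ions` is a hypothetical helper name. The slide reports nominal (integer) values, and also lists the intact peptide mass 1166 at both ends of the ladder, so only fragments of length 1 to n-1 are computed here.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z values for fragment lengths 1..n-1."""
    b_ions, y_ions = [], []
    running = 0.0
    for aa in peptide[:-1]:            # N-terminal prefixes
        running += RESIDUE_MASS[aa]
        b_ions.append(running + PROTON)
    running = 0.0
    for aa in reversed(peptide[1:]):   # C-terminal suffixes
        running += RESIDUE_MASS[aa]
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:9.4f}    y{i} = {y:9.4f}")
```

Truncating the computed values to integers gives the slide's b ladder (88 to 1020) and y ladder (147 to 1022); the largest y fragment comes out near 1079.5, which the slides show as 1080 in one place and 1079 in another.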
Tandem MS – de novo Sequencing
Amino acid masses → Mass differences → Sequences consistent with spectrum
[Figure: MS/MS spectrum, % relative abundance (0 to 100) vs m/z (250 to 1000), with peaks at 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080, and [M+2H]2+.]
Tandem MS – de novo Sequencing
Tandem MS – de novo Sequencing
Tandem MS – de novo Sequencing
Candidate partial sequences read from the mass differences: SGF(I/L)EEDE(I/L)… and …(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
1166 - 1020 - 18 = 128 => K or Q
1166 - 1079 = 87 => S
Result: SGF(I/L)EEDE(I/L)(K/Q)
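The arithmetic above generalizes to walking a whole ion ladder. The following sketch assumes a complete b-ion ladder given as nominal (integer) masses, with no modifications or neutral losses; `NOMINAL_MASS`, `residues_for`, and `read_b_ladder` are hypothetical names. Isobaric residues are reported together, which is why I/L and K/Q remain ambiguous.

```python
# Nominal (integer) residue masses in Da.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def residues_for(delta, tol=0.5):
    """All residues whose nominal mass matches `delta`, joined as 'I/L'; '?' if none match."""
    hits = sorted(aa for aa, mass in NOMINAL_MASS.items() if abs(mass - delta) <= tol)
    return "/".join(hits) if hits else "?"


def read_b_ladder(b_ions, mh):
    """Translate consecutive b-ion mass differences into residues.

    The first b ion is one residue plus a proton (nominal mass 1), and the
    last residue follows from M+H minus the largest b ion minus water (18).
    """
    ladder = [1] + sorted(b_ions)
    residues = [residues_for(hi - lo) for lo, hi in zip(ladder, ladder[1:])]
    residues.append(residues_for(mh - ladder[-1] - 18))
    return residues


print(read_b_ladder([88, 145, 292, 405, 534, 663, 778, 907, 1020], 1166))
```

Real spectra rarely contain a complete ladder, so a practical de novo tool also has to consider gaps spanning two or more residues.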
Tandem MS – de novo Sequencing
Challenges in de novo sequencing:
- Neutral loss (-H2O, -NH3)
- Modifications
- Background peaks
- Incomplete information
Tandem MS – Database Search
Experiment: Lysis → Fractionation → Digestion → LC-MS → MS/MS
Search: Sequence DB → Pick Protein → Pick Peptide → All Fragment Masses → Compare with MS/MS, Score, Test Significance
Repeat for all peptides; repeat for all proteins.
Algorithms
Comparing and Optimizing Algorithms
MS/MS - Parent Mass Error and Enzyme Specificity
Expectation values, MS/MS example:
Δm=2, trypsin: 2.5e-5
Δm=100, trypsin: 2.5e-5
Δm=2, non-specific: 7.9e-5
Δm=100, non-specific: 1.6e-4
Sequest Cross-correlation
X! Tandem - Search Parameters http://www.thegpm.org/
X! Tandem - Search Parameters
X! Tandem - Search Parameters
Generic search engine (inputs: spectra, sequences)
Test all cleavages, modifications, & mutations for all sequences
Conventional, single-stage searching
Some hard problems in MS/MS analysis in proteomics
Allowing for unanticipated peptide cleavages
 - e.g., chymotryptic contamination in trypsin
 - calculation order ~ 200 × tryptic cleavage
 - “unfortunate” coefficient
Determining potential modifications
 - e.g., oxidation, phosphorylation, deamidation
 - calculation order 2^n
 - NP complete
Detecting point mutations
 - e.g., sequence homology
 - calculation order 18^N
 - NP complete
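The 2^n growth for potential modifications is easy to see by enumeration: each of n modifiable sites is either modified or not. The sketch below is only an illustration; the choice of S/T/Y as modifiable residues (phosphorylation) and the function name `modification_states` are assumptions for the example.

```python
from itertools import product


def modification_states(peptide, sites="STY"):
    """Yield every combination of modified positions among the modifiable residues."""
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    for mask in product((False, True), repeat=len(positions)):
        yield tuple(pos for pos, modified in zip(positions, mask) if modified)


for peptide in ("SGFLEEDELK", "STYAK", "SSTTYYSSTK"):
    n_sites = sum(aa in "STY" for aa in peptide)
    n_states = sum(1 for _ in modification_states(peptide))
    print(f"{peptide}: {n_sites} sites, {n_states} states to test")
```

Every state is a separate candidate to score against the spectrum, so allowing several modification types at once multiplies the search space further.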
Multi-stage searching (X! Tandem)
spectra, sequences → Tryptic cleavage → sequences → Modifications #1 → Modifications #2 → Point mutation
Search Results
Search Results
Sequence Annotations
Search Results
Search Results
Identification – Spectrum Library Search
Experiment: Lysis → Fractionation → Digestion → LC-MS/MS
Search: Spectrum Library → Pick Spectrum → Compare with MS/MS, Score, Test Significance
Repeat for all spectra.
Result: Identified Proteins
Steps in making an Annotated Spectrum Library (ASL):
1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the intensity values.
3. Assign a “quality” value: the median expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the averaged spectrum, along with its parent ion charge (z), m/z, sequence, protein accessions, and quality.
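The steps above can be sketched as follows. Several details are assumptions not stated on the slide: "best" is taken to mean lowest expectation value, peaks are merged by rounding m/z into fixed-width bins, and intensities are normalized so the largest is 1. The function name `build_library_entry` and the dictionary layout are hypothetical.

```python
from collections import defaultdict
from statistics import median


def build_library_entry(spectra, expectation_values, n_best=10, n_peaks=20, bin_width=1.0):
    """Build one consensus library spectrum.

    spectra: list of spectra, each a list of (mz, intensity) pairs, all for the
             same sequence, PTMs, and charge.
    expectation_values: one expectation value per spectrum (lower = better; assumption).
    """
    # Step 1: keep the n_best spectra with the lowest expectation values.
    ranked = sorted(zip(expectation_values, spectra), key=lambda pair: pair[0])[:n_best]

    # Step 2: add the spectra together (per m/z bin) and normalize to a maximum of 1.
    summed = defaultdict(float)
    for _, spectrum in ranked:
        for mz, intensity in spectrum:
            summed[round(mz / bin_width)] += intensity
    top = max(summed.values())
    normalized = {mz_bin: value / top for mz_bin, value in summed.items()}

    # Step 3: quality is the median expectation value of the spectra used.
    quality = median(e for e, _ in ranked)

    # Step 4: record only the n_peaks most intense peaks.
    strongest = sorted(normalized.items(), key=lambda item: item[1], reverse=True)[:n_peaks]
    peaks = sorted((mz_bin * bin_width, value) for mz_bin, value in strongest)
    return {"peaks": peaks, "quality": quality}
```

A full library entry would also carry the parent ion charge, m/z, sequence, and protein accessions listed in step 4; they are omitted here to keep the sketch short.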
Spectrum Library Characteristics – Peptide Length
Spectrum Library Characteristics – Protein Coverage
Identification – Spectrum Library Search
Library spectrum (5:25)
Test spectrum (5:25)
Results: 4 peaks selected, 1 peak missed
Identification – Spectrum Library Search
How likely is this? Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.
Matches   Probability
1         0.45
2         0.15
3         0.016
4         0.00039
5         0.0000037
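The hypergeometric model can be evaluated directly. The sketch below assumes the parameterization stated above (25 possible m/z values, 5 library peaks, 4 peaks drawn by the test spectrum) and computes the probability of exactly k matches; `match_probability` is a hypothetical name. Under this reading the values for 1 to 4 matches agree with the table, while 5 matches are impossible with only 4 draws, so the tabulated value for 5 is not reproduced here.

```python
from math import comb


def match_probability(k, n_bins, n_library, n_test):
    """Hypergeometric probability that exactly k of the n_test peaks hit library peaks.

    n_bins:    number of possible m/z values
    n_library: number of peaks in the library spectrum
    n_test:    number of peaks selected by the test spectrum
    """
    if k < 0 or k > n_test or k > n_library:
        return 0.0
    ways_hit = comb(n_library, k)                      # matched peaks
    ways_miss = comb(n_bins - n_library, n_test - k)   # unmatched peaks
    return ways_hit * ways_miss / comb(n_bins, n_test)


for k in range(1, 5):
    print(k, f"{match_probability(k, n_bins=25, n_library=5, n_test=4):.2g}")
```

The same function covers larger problems, for example `match_probability(10, 1000, 20, 20)`, where chance agreement on many peaks becomes very unlikely.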
Identification – Spectrum Library Search
If you have 1000 possible m/z values and 20 peaks in both the test and library spectra:
1 matched: p = 0.6
5 matched: p = 0.0002
10 matched: p = 0.0000001
Identification – Spectrum Library Search
Experimental mass spectrum (m/z) → Library of assigned mass spectra → Best search result
X! Hunter
X! Hunter algorithm:
1. Use dot product to find a library spectrum that best matches a test spectrum.
2. Calculate p-value with hypergeometric distribution.
3. Use p-value to calculate expectation value, given the identification parameters.
4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
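A minimal sketch of these four steps follows. It is not the X! Hunter implementation: the normalized dot product, the use of a tail probability (k or more matches) as the p-value, and the conversion E = p × number of library spectra are all assumptions made for illustration, as are the function names and the dictionary layout of a library entry.

```python
from math import comb, sqrt


def dot_product(a, b):
    """Normalized dot product of two spectra given as {m/z bin: intensity}."""
    shared = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return shared / norm if norm else 0.0


def p_value(k, n_bins, n_library, n_test):
    """Hypergeometric probability of k or more matching peaks by chance."""
    upper = min(n_library, n_test)
    total = comb(n_bins, n_test)
    return sum(comb(n_library, i) * comb(n_bins - n_library, n_test - i)
               for i in range(k, upper + 1)) / total


def search(test, library, n_bins):
    """Return (best library entry, reported expectation value)."""
    # Step 1: best library spectrum by dot product.
    best = max(library, key=lambda entry: dot_product(test, entry["peaks"]))
    # Step 2: p-value from the number of shared peaks.
    k = len(test.keys() & best["peaks"].keys())
    p = p_value(k, n_bins, len(best["peaks"]), len(test))
    # Step 3: expectation value (assumption: p scaled by the number of candidates).
    expectation = p * len(library)
    # Step 4: never report a value better than the library spectrum's own quality.
    return best, max(expectation, best["quality"])
```

Step 4 reflects the idea that an identification cannot be more reliable than the library spectrum it is based on.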
X! Hunter Result: Query Spectrum vs. Library Spectrum
Significance Testing
- False protein identification is caused by random matching.
- An objective criterion for testing the significance of protein identification results is necessary.
- The significance of protein identifications can be tested once the distribution of scores for false results is known.
Significance Testing - Expectation Values The majority of sequences in a collection will give a score due to random matching.
Significance Testing - Expectation Values
Database Search → List of Candidates → Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values → List of Candidates With Expectation Values
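One way to carry out the extrapolation step is sketched below. It assumes that the number of random candidates scoring at or above a given score falls off exponentially, so that log10(count) is linear in the score; the fitted line is then extended to the top score to estimate how many random matches would score that high. This functional form, the use of all candidate scores in the fit, and the name `expectation_value` are assumptions for illustration, not a description of any particular search engine.

```python
from math import log10


def expectation_value(random_scores, top_score):
    """Extrapolate the survival curve of random scores to `top_score`.

    random_scores: scores of the candidates assumed to be random matches
    top_score:     score of the candidate whose significance is tested
    """
    ordered = sorted(random_scores, reverse=True)
    xs = ordered
    # The rank of a score is the number of candidates scoring at least that high.
    ys = [log10(rank) for rank in range(1, len(ordered) + 1)]

    # Ordinary least-squares fit of log10(count) = intercept + slope * score.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("scores must not all be identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x

    # Expected number of random matches with a score >= top_score.
    return 10 ** (intercept + slope * top_score)
```

An expectation value well below 1 means that a score this high is unlikely to arise from random matching alone in a search of this size.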