“shotgun sequencing”
[Figure: spectrum (axis: relative intensity) with peaks labeled 1st through 10th, each selected for MS2]

spectral matching
[Figure: MS/MS spectrum matched against a protein database]

“shotgun sequencing”
[Figure: MS1 and MS2 scans plotted against time]

distributed spectral matching
6000 spectra x 10 s/spectrum ≈ 16 CPU hours
LTQ Orbitrap base peak chromatogram: 37 min LC-MS/MS run time, 6186 MS/MS spectra, 2308 peptide IDs (false-positive rate 1%), 287 protein IDs
Search time, single CPU: 16 hours
Search time, parallel CPUs (server + 20 nodes): 0.8 hours
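
The search-time figures on this slide follow from simple arithmetic. The sketch below reproduces them under two assumptions that are not stated on the slide: spectra are split evenly across nodes, and scheduling and network overhead are ignored. The function name `search_hours` is made up for illustration.

```python
def search_hours(n_spectra, seconds_per_spectrum, n_nodes=1):
    """Wall-clock hours, assuming an even split and zero overhead."""
    cpu_seconds = n_spectra * seconds_per_spectrum
    return cpu_seconds / 3600 / n_nodes

single = search_hours(6000, 10)               # one CPU
parallel = search_hours(6000, 10, n_nodes=20) # 20 nodes
print(f"{single:.1f} h on one CPU, {parallel:.2f} h on 20 nodes")
```

The slide rounds these values to 16 hours and 0.8 hours.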

sequest
XCorr: goodness of fit between theoretical b and y ions from peptides in the database
dCn: fractional XCorr difference between the highest XCorr and next highest XCorr
Yates J. R. 3rd et al., J Am Soc Mass Spectrom 5:976-89 (1994)
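
The dCn definition above can be written out as a short function. This is a minimal sketch, assuming the XCorr values for all candidate peptides of one spectrum are already available; the example scores are invented and the function name `delta_cn` is hypothetical.

```python
def delta_cn(xcorrs):
    """Fractional difference between the best and second-best XCorr."""
    if len(xcorrs) < 2:
        raise ValueError("need at least two candidate scores")
    ranked = sorted(xcorrs, reverse=True)
    if ranked[0] <= 0:
        raise ValueError("top XCorr must be positive")
    return (ranked[0] - ranked[1]) / ranked[0]

# Invented scores for three candidate peptides of one spectrum.
print(round(delta_cn([2.4, 3.2, 1.1]), 3))  # (3.2 - 2.4) / 3.2
```

A dCn near 1 means the top candidate stands well clear of the runner-up; a dCn near 0 means the two are almost indistinguishable.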

sequest
[Figure: MS1 and MS2 scans plotted against time; all MS2 in an LC run = 5000 to 25000 MS2 spectra]

all ms2 in LC run
raw: all ms2 = 1 file
dta: 1 ms2 = 1 file (all ms2 = ~10000 files)
[Figure: dta files passed to sequest; each holds the precursor m/z (examples shown: 501.000 and 1001.500), the charge state (+2, +3), and the ms2 array]

sequest
all ms2 in LC run: 1 dta, 2 dta, 3 dta, ... 10000 dta
human ipi database: 61236 proteins
For each dta:
digest to next peptide: MSQVQVQVQNPSAALSGSQILNK
calculate peptide mass: 2426.258812
compare with precursor peptide mass (1000.000, 3000.000; +/- 1 Da): not a candidate
if candidate, calculate theoretical spectrum
correlate, score & return
10000 x 3,250,000 times (about 32 billion)
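
The digest, mass calculation, and candidate test in this loop can be sketched as follows. This is an illustration, not SEQUEST code: it assumes unmodified peptides, standard monoisotopic residue masses, and a simple rule of cleaving after every K or R. All function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_peptides(protein):
    """Yield peptides by cleaving after every K or R (simplified rule)."""
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            yield protein[start:i + 1]
            start = i + 1
    if start < len(protein):
        yield protein[start:]

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER

def is_candidate(mass, precursor_mass, tol_da=1.0):
    """True if the peptide mass falls inside the precursor window."""
    return abs(mass - precursor_mass) <= tol_da

protein_start = "MSQVQVQVQNPSAALSGSQILNKNQSLLSQ"
first = next(tryptic_peptides(protein_start))
mass = peptide_mass(first)
print(first, round(mass, 4), is_candidate(mass, 1000.0))
```

The first peptide and its mass agree with the slide to within rounding of the residue masses, and it is rejected for a 1000.000 Da precursor, as on the slide.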

similarity scoring: XCorr score
[Figure: theoretical “candidate” spectrum, experimental peptide spectrum, and the resulting correlation spectrum]
Yates J. R. 3rd et al., J Am Soc Mass Spectrom 5:976-89 (1994)

similarity scoring: cross-correlation vs dot product
[Figure: two panels, “Dot product” and “XCorr (cross-correlation)”]
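
The two scores can be compared on toy data. The sketch below follows the usual description of SEQUEST's XCorr: the correlation at zero offset minus the mean correlation over offsets from -75 to +75. It is a simplification. Spectra are assumed to be already binned to equal-length vectors with unit peak intensities, and none of SEQUEST's preprocessing is reproduced. The helper `spikes` and the peak positions are invented.

```python
def dot_product(a, b):
    """Plain dot product of two binned spectra."""
    return sum(x * y for x, y in zip(a, b))

def correlation_at_lag(experimental, theoretical, lag):
    """Dot product after shifting the theoretical spectrum by `lag` bins."""
    n = len(experimental)
    total = 0.0
    for i in range(n):
        j = i + lag
        if 0 <= j < n:
            total += experimental[i] * theoretical[j]
    return total

def xcorr(experimental, theoretical, max_lag=75):
    """Zero-lag correlation minus the mean over all non-zero lags."""
    zero = correlation_at_lag(experimental, theoretical, 0)
    lags = [k for k in range(-max_lag, max_lag + 1) if k != 0]
    background = sum(
        correlation_at_lag(experimental, theoretical, k) for k in lags
    ) / len(lags)
    return zero - background

def spikes(positions, n_bins=200):
    """Toy spectrum: unit-intensity peaks at the given bin positions."""
    vector = [0.0] * n_bins
    for p in positions:
        vector[p] = 1.0
    return vector

experimental = spikes([20, 50, 90])
matching = spikes([20, 50, 90])
shifted = spikes([21, 51, 91])  # every peak off by one bin

print(dot_product(experimental, matching), xcorr(experimental, matching))
print(dot_product(experimental, shifted), xcorr(experimental, shifted))
```

The dot product rewards only aligned peaks. XCorr also subtracts a background term, so a candidate that correlates mainly at non-zero offsets ends up with a negative score.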

non-indexed searching
human ipi database, 61236 proteins; precursor window 1200 +/- 1 Da
1st: >ipi00000001.2
MSQVQVQVQNPSAALSGSQILNKNQSLLSQ
PLMSIPSTTSSLPSENAGRPIQNSALPSASITST
SAAAESITPTVELNAL….
61236th: >ipi00853644.1
….AKPNINLITGHLEEPMPNPIDEMTEEQKEY
EAMKLVNMLDKLSREELLKPMGLKPDGTIT

indexed searching
human ipi database, 61236 proteins, indexed; precursor window 1200 +/- 1 Da
>ipi00001234.11 (75 Da)
G
>ipi00344567.1 (1200 +/- 1 Da)
WEFGGHTVLR
>ipi00853644.1 (20245 Da)
AKPNINLITGHLEEPMPNPIDEMTEEQEYEA
MLVNMLDLSEELLKPMGLKPDGTITAKPNINL
ITGHLEEPMPNPIDEMTEEQEYEAMLVNML
DLSEELLKPMGLKPDGTIT
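
An index ordered by peptide mass lets the search jump straight to the precursor window instead of digesting every protein for every spectrum. The sketch below uses a sorted list and Python's `bisect` module. The three-entry index and its approximate monoisotopic masses are illustrative assumptions, not the layout of any real index file.

```python
import bisect

# (mass in Da, peptide) pairs kept sorted by mass; values are approximate.
index = sorted([
    (2426.26, "MSQVQVQVQNPSAALSGSQILNK"),
    (75.03, "G"),
    (1200.60, "WEFGGHTVLR"),
])
masses = [mass for mass, _ in index]

def candidates(precursor_mass, tol_da=1.0):
    """Return peptides whose mass lies within precursor_mass +/- tol_da."""
    lo = bisect.bisect_left(masses, precursor_mass - tol_da)
    hi = bisect.bisect_right(masses, precursor_mass + tol_da)
    return [peptide for _, peptide in index[lo:hi]]

print(candidates(1200.0))  # ['WEFGGHTVLR']
```

Each lookup costs two binary searches rather than a pass over the whole database, at the price of building and storing the index up front.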

scoring & analysis

             Score/Metric 1   Score/Metric 2   Score/Metric 3
Peptide A    7.65             0.99             97
Peptide B    6.99             0.87             97
Peptide C    6.21             0.65             97
Peptide D    5.57             0.71             96
Peptide E    3.31             0.44             50
Peptide F    1.85             0.41             41

[Figure: frequency vs score/criterion, with a cutoff/threshold separating TP, TN, FN, and FP]

sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
specificity = TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
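
The four definitions translate directly into code. The counts in the example are invented purely to exercise the formulas, and the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts at one cutoff."""
    return {
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Invented counts for one score threshold.
metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Moving the cutoff trades sensitivity against precision, which is why the slide pairs the formulas with a threshold on the score axis.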

The Results: Distinguishing Right from Wrong
In large proteomics data sets (for which manual data inspection is impossible), how can we distinguish between correct and incorrect peptide assignments?
Use “decoy” sequences to distract non-peptidic, non-uniquely matchable, or otherwise unmatchable spectra into a search space that is known a priori to be incorrect.
Use the frequency of “decoy” sequences among total sequences to estimate the overall frequency of wrong answers (False Positive Rate).
Adjust filtering criteria to achieve a ~1% False Positive Rate.

Decoy Sequences? A “Reversed” Database!
We generate decoy sequences by reversing each protein sequence in a given database, such that the resultant in silico digest contains nonsense peptides, then append the reversed database to the end of the forward database.
SEARCHING
Decoy references are labeled with #.
Database searching with SEQUEST occurs from top to bottom. A wrong match is equally likely to land on a decoy sequence or on a non-decoy sequence, so each decoy hit stands for roughly one undetected wrong hit in the forward database.
So our FPR is (# of decoys) x 2 / total matches.
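
Building the composite database and applying the FPR formula can be sketched as below. The `#` prefix for decoy references comes from the slide; the dictionary-based database, the entry name `protein1`, and the function names are assumptions made for illustration.

```python
def build_composite(forward):
    """Append a reversed, '#'-labeled copy of every forward entry."""
    composite = dict(forward)
    for name, sequence in forward.items():
        composite["#" + name] = sequence[::-1]
    return composite

def estimated_fpr(matched_references):
    """FPR estimate: (# of decoys) x 2 / total matches."""
    decoys = sum(1 for ref in matched_references if ref.startswith("#"))
    return 2 * decoys / len(matched_references)

database = build_composite({"protein1": "MAGFA"})
print(database["#protein1"])  # AFGAM

# Invented result list: 98 forward hits and 2 decoy hits.
hits = ["protein1"] * 98 + ["#protein1"] * 2
print(estimated_fpr(hits))
```

With 2 decoy hits among 100 matches the estimate is 4%.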

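Target/Decoy Database Searching
Forward database: 1. MAGFA→ → →SHTRP
Reversed database: 1. PRTHS→ → →AFGAM
[Figure: the forward and reversed databases are combined into a composite database and searched with Sequest. Right matches: 100% forward (F). Wrong (random) matches: 50% forward (F), 50% reversed (R). Wrong forward matches are unknown FP; reversed matches are known FP.]
Filter (scoring, mass accuracy, etc.)
Generate final list
Estimate FP rate from 2 x Rev (i.e., 4%)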

sequest scores: finding true positives
[Figure: dCn vs XCorr for forward sequences and for forward + reverse; PSM number vs XCorr showing TP and FP distributions]

High Mass Accuracy
Mass “Accuracy” in Proteomics: precision of mass errors between observed and actual m/z
LTQ Orbitrap & LTQ FT: -0.2 ± 1.0 ppm
LTQ FT (SIM), AGC target 50,000 to avoid space-charge effects: 0.1 ± 0.4 ppm
Performance is related to the width of the distribution, not the average error
Haas et al. (2006) Mol. Cell. Proteomics 5, 1326
Olsen et al. (2004) Mol. Cell. Proteomics 3, 608
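
The "mean ± width" figures on this slide summarize a distribution of per-peptide mass errors in ppm. The sketch below computes such a summary. The m/z pairs are invented, and it is an assumption that the ± values on the slide are standard deviations.

```python
import statistics

def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

# Invented (observed, theoretical) m/z pairs.
pairs = [
    (500.0005, 500.0000),
    (650.3240, 650.3245),
    (801.4102, 801.4100),
    (923.5011, 923.5020),
]
errors = [ppm_error(obs, theo) for obs, theo in pairs]
print(f"{statistics.mean(errors):.2f} ± {statistics.stdev(errors):.2f} ppm")
```

A constant offset in the mean can be removed by recalibration, whereas the spread sets how tight a mass filter can be, which is the point the slide makes about width versus average error.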

MMA (mass measurement accuracy): True Positives and False Positives
False positives are distributed evenly across MMA space
[Figure: PSM number vs MMA, centered on 0, for true positives (TP) and false positives (FP)]

MS/MS vs MMA: Precision vs Sensitivity
MS/MS criteria are strong precision filters: they require TP / FP separation for sensitivity
MMA criteria are weak precision filters: they assist MS/MS criteria in improving sensitivity
[Figure: two panels plotted against MMA, centered on 0]

Distracting Wrong from Right: MMA
[Figure: two panels plotted against MMA, centered on 0. “Search Space”: true positives and false positives. “Extended Search Space”: true positives, false positives, and filtered false positives.]

Mass Accuracy: Another dimension of selectivity
[Figure: dCn vs XCorr for forward sequences and for forward + reverse. Panels: tryptic search +/- 2 Da; tryptic search +/- 2 Da with 5 ppm filter.]
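
The idea of searching with a wide window (+/- 2 Da) and then applying a narrow 5 ppm filter can be sketched as a post-search step. The PSM records below are invented, and the dictionary layout is an assumption rather than any search engine's output format.

```python
def within_ppm(observed_mass, theoretical_mass, tol_ppm=5.0):
    """True if the absolute mass error is at most tol_ppm."""
    error = abs(observed_mass - theoretical_mass) / theoretical_mass * 1e6
    return error <= tol_ppm

# Invented PSMs that all passed a +/- 2 Da search window.
psms = [
    {"ref": "pepA", "observed": 1200.6040, "theoretical": 1200.6039},
    {"ref": "pepB", "observed": 1500.7510, "theoretical": 1500.7500},
    {"ref": "#pepC", "observed": 1801.9000, "theoretical": 1800.9000},
]

kept = [p for p in psms if within_ppm(p["observed"], p["theoretical"])]
print([p["ref"] for p in kept])  # ['pepA', 'pepB']
```

Random matches spread across the whole 4 Da window, so most of them fall outside the 5 ppm band and are removed, while correct matches cluster near zero error and survive.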

Distracting Wrong from Right: Trypticity
Tryptic Search: K/R-Peptide.K/R- (true positives, false positives)
Partial Enzyme Search: K/R-Peptide.K/R- plus termini at A- G- C- S- T- I- L- F- P- M- V- H- D- E- Y- W- Q- N- (true positives, false positives, filtered false positives)
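
Trypticity can serve as a filter after a partial enzyme search: count how many peptide termini are consistent with trypsin cleavage and keep only the fully tryptic matches. The sketch below makes several assumptions of its own. A terminus counts as tryptic if it follows K or R or sits at a protein end, a protein end is written as `-`, and the proline exception is ignored. The function name and the test peptides are illustrative.

```python
def tryptic_termini(prev_residue, peptide, next_residue):
    """Number of tryptic termini (0, 1, or 2) for a matched peptide."""
    n_term_ok = prev_residue in ("K", "R", "-")
    c_term_ok = peptide[-1] in ("K", "R") or next_residue == "-"
    return int(n_term_ok) + int(c_term_ok)

# (preceding residue, peptide, following residue); invented matches.
matches = [
    ("K", "WEFGGHTVLR", "A"),  # fully tryptic
    ("A", "EFGGHTVLR", "A"),   # partially tryptic
    ("A", "EFGGHTVL", "R"),    # non-tryptic
]
fully_tryptic = [m for m in matches if tryptic_termini(*m) == 2]
print([m[1] for m in fully_tryptic])  # ['WEFGGHTVLR']
```

In a partial enzyme search most random matches land on non-tryptic termini, so keeping only fully tryptic matches discards them in the same way that decoy sequences do.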

What do we have here, hm?
[Figure: dCn (0 to 1) vs XCorr (0 to 8), n = 286; series: unphosphorylated, phosphorylated, reversed hits]

Phosphopeptides: Chemically disadvantaged…
Dataset of phosphorylated and unphosphorylated peptide MS/MS pairs (example: MSFEILR with and without phosphate)
Singly phosphorylated (n=207), doubly phosphorylated (n=79)
[Figure: XCorr (phosphorylated) vs XCorr (unphosphorylated), axes 0 to 8, n = 286; dCn (phosphorylated) vs dCn (unphosphorylated), axes 0 to 1.0, n = 286]

Phosphopeptides: Less power in XCorr & dCn
[Figure: ratios XCorr (Ph/UnPh) and dCn (Ph/UnPh), scale 0 to 2, for singly and doubly phosphorylated peptides; annotations: “86% Unphosphorylated” and “93% Unphosphorylated”]

Mass Accuracy: Can it help for phosphorylation?
Yeast whole-cell lysate → reduction, alkylation → SDS-PAGE (60-80 kDa) → trypsin → IMAC purification

Mass Accuracy: Rescuing phosphopeptides
SEQUEST partial enzyme search, fully tryptic peptide spectral matches
[Figure: XCorr vs MMA (ppm), -50 to 50; panels: Orbitrap TOP10 and LTQ TOP10; annotations: n=1390 (+3: 2.3, +2: 1.3) and n=1311 (+3: 3.5, +2: 2.7)]

Mission: Phosphopeptide rescue – accomplished!
[Figure: bar chart of # of phosphopeptides; bars labeled LTQ, No MMA, and Orbitrap; values shown: 715, 600, and 1046; annotations: 1.0% FP, 0.4% FP, 74% increase (1046 vs 600)]

search algorithms & phosphorylation
[Figure: comparison of sequest and omssa; values shown: 98, 936, 928]
Bakalarski et al., Anal. Bioanal. Chem., 2007

phosphorylation site localization: GFDSNQpTWR or GFDpSNQTWR? Beausoleil et al., Nat. Biotechnol., 2006

phosphorylation site localization. Beausoleil et al., Nat. Biotechnol., 2006

phosphorylation site localization. Taus et al., JPR, 2011

phosphorylation false localization rate (FLR)
Use non-native phosphoacceptors as “decoys”:
Ser + Thr (human proteome): 14.1%
Pro + Glu (human proteome): 14.5%
Allow search engine / localization assessment tools to consider pP and pE as true negative “decoys”.
Calculate dataset FLR based on frequency of pP + pE “decoys”.
Baker et al., MCP, 2011
Chalkley & Clauser, MCP, 2012
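
One way to turn decoy-site counts into a dataset FLR is sketched below. This is a deliberately crude illustration and not the procedure from the cited papers. It assumes that a random mislocalization lands on Ser/Thr or on Pro/Glu in proportion to their proteome frequencies (14.1% and 14.5%, from the slide), so the number of pP + pE assignments, scaled by that ratio, approximates the number of wrong pS + pT assignments. The function name and the counts are invented.

```python
ST_FREQUENCY = 0.141  # Ser + Thr, human proteome (from the slide)
PE_FREQUENCY = 0.145  # Pro + Glu, human proteome (from the slide)

def estimated_flr(n_st_sites, n_decoy_sites):
    """Rough FLR: expected wrong S/T sites divided by reported S/T sites."""
    expected_wrong_st = n_decoy_sites * ST_FREQUENCY / PE_FREQUENCY
    return expected_wrong_st / n_st_sites

# Invented counts: 1000 reported pS/pT sites, 10 pP/pE "decoy" sites.
print(f"{estimated_flr(1000, 10):.2%}")
```

Because the two residue pairs are almost equally abundant, the decoy count is close to a one-to-one estimate of wrong Ser/Thr localizations, which is what makes Pro and Glu convenient decoys.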