Protein Identification from Tandem Mass Spectra with Probabilistic Language Modeling
Yiming Yang 1,2, Abhay Harpale 1 and Subramanian Ganaphathy 1
1: Language Technologies Institute
2: Machine Learning Department
School of Computer Science, Carnegie Mellon University
@Yiming Yang, ECML 2009, Sept 8

Outline
§ Motivation & Background
§ Two probabilistic approaches
§ Experiments

Motivation
§ Proteins are important bio-markers for diseases, drug toxicity, therapeutic outcomes, etc.
§ Statistical approaches have been developed for protein identification in computational proteomics.
§ Interdisciplinary research that compares current solutions with successful IR (information retrieval) methods for similar problems has been rare.
§ We address this research gap by
  § Analyzing a major limitation of popular approaches in protein ID
  § Proposing a new solution (Language Modeling for IR)

The Protein ID Problem
§ Tandem mass (MS-MS) spectra are produced by applying a chemical process to an input sample (e.g., blood).
§ A sample typically consists of multiple proteins.
§ The process segments each protein into many (hundreds of) pieces, called peptides.
§ Peptides are further decomposed into ionized segments.
§ The MS-MS spectrum of a peptide is a series of spikes.
§ Each spike is the mass/charge (m/z) ratio of an ionized segment in the peptide.

The Protein ID Problem (cont’d)
§ Protein identification requires a mapping from empirical (MS-MS) spectra to protein sequences in a DB.
§ There are many protein sequence databases.
  § SwissProt, for example, contains 280,000+ sequences.
  § Each protein is defined as a sequence of amino-acid letters.
  § Peptides in each protein are specified using cleaving rules.
  § Each peptide has an amino-acid sequence and a corresponding theoretical (“expected”) spectrum.
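The cleaving rules mentioned above can be illustrated with a minimal sketch. It assumes a trypsin-style rule (cut after K or R unless the next residue is P), which the slides do not specify; the function name `digest` and the example sequence are hypothetical.

```python
def digest(protein, min_len=1):
    """Split a protein sequence into peptides with a trypsin-style rule.

    Assumption (not from the slides): cleave after K or R,
    except when the next residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # trailing piece with no cleavage site
    return [p for p in peptides if len(p) >= min_len]


# Hypothetical sequence, for illustration only.
print(digest("MKWVTFRPLLKAA"))
```

In a real pipeline each resulting peptide would then be turned into a theoretical spectrum; that step is omitted here.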

[Diagram labels: Theoretical Spectra of peptides in a DB; Empirical Spectra of peptides in a sample; Mapping; Matching]
§ Fourier Transformation
§ Probabilistic Models
§ Heuristic Rules

[Diagram labels: Theoretical Spectra; Empirical Spectra; Mapping; Matching; Words in L2; Words in L1; Matched Words (in L2); Doc Retrieval; Matched Documents (in L2)]

A Popular Approach in Protein ID (ProteinProphet by Nesvizhskii et al., 2003)
§ Given the peptides predicted from the MS-MS spectra, the probability of each candidate protein is estimated by combining the probabilities of its predicted peptides. The estimate
-- is the probability of a Boolean OR over those peptides
-- typically produces many false positives
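The Boolean OR criterion can be sketched as follows, assuming the common noisy-OR form P(protein) = 1 − ∏ (1 − p_i) with independent peptide probabilities p_i. This is an illustration of the OR logic only, not ProteinProphet's full procedure; the function name `prob_or` and the input values are hypothetical.

```python
import math


def prob_or(peptide_probs):
    """Noisy-OR: probability that at least one peptide identification
    is correct, assuming the peptide probabilities are independent."""
    return 1.0 - math.prod(1.0 - p for p in peptide_probs)


# Hypothetical peptide probabilities: three weak matches already
# give a fairly high protein probability.
print(round(prob_or([0.3, 0.3, 0.3]), 3))  # 1 - 0.7**3 = 0.657
```

Because any single confident peptide drives the score toward 1, a protein that shares one peptide with a true protein can score highly, which is consistent with the false-positive behavior noted above.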

A Popular Approach in IR
§ Language Models (Ponte 1998; Lafferty & Zhai, 2001; …)
§ Query (q) is represented using a bag of words
§ Document (d) is represented using a bag of words
§ KL-divergence of the two word distributions (θq and θd) is
$$KL(\theta_q \| \theta_d) = \sum_w P(w|\theta_q) \log P(w|\theta_q) \; - \; \sum_w P(w|\theta_q) \log P(w|\theta_d)$$
-- the first term depends only on the query and does not affect doc ranking
-- the second term is the cross entropy H(θq||θd), a “soft” measure for the Boolean AND logic
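A minimal sketch of ranking by cross entropy, with peptides playing the role of words. The additive smoothing, its parameter `alpha`, the maximum-likelihood query model, and all names and data below are assumptions made for illustration; the slides do not state how the models are estimated.

```python
import math
from collections import Counter


def language_model(tokens, vocab, alpha=0.01):
    """Additively smoothed unigram distribution over `vocab`.
    The smoothing scheme and `alpha` are assumptions of this sketch."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def cross_entropy(theta_q, theta_d):
    """H(theta_q || theta_d) = -sum_w P(w|theta_q) * log P(w|theta_d)."""
    return -sum(p * math.log(theta_d[w]) for w, p in theta_q.items() if p > 0)


# Hypothetical data.
query = ["pepA", "pepB", "pepC"]  # predicted peptides (the "query")
proteins = {
    "prot1": ["pepA", "pepB", "pepC", "pepD"],  # contains every query peptide
    "prot2": ["pepA", "pepE", "pepF", "pepG"],  # contains one query peptide
}
vocab = sorted(set(query).union(*proteins.values()))

# Maximum-likelihood query model.
theta_q = {w: query.count(w) / len(query) for w in set(query)}

ranking = sorted(
    proteins,
    key=lambda name: cross_entropy(theta_q, language_model(proteins[name], vocab)),
)
print(ranking)  # lower cross entropy ranks first
```

A protein missing a query peptide pays a large log penalty for that peptide, which is why the score behaves like a soft AND over the query.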

LM for Protein ID
§ Query language model for predicted peptides
§ Document language model for each protein sequence

Data Sets
§ PPK (Purvine et al., 2003)
  § 2995 empirical spectra from a mixture of 35 proteins
  § 4535 protein sequences (325,812 unique peptides)
§ Mark12
  § 9380 empirical spectra from a mixture of 12 proteins
  § 50,012 protein sequences (5,149,302 unique peptides) randomly sampled from the SwissProt database
§ Sigma49
  § 12,498 empirical spectra from a mixture of 49 proteins
  § 50,049 protein sequences (2,571,642 unique peptides) randomly sampled from the SwissProt database

Systems
§ Prob-AND
  § Our proposed method
§ Prob-OR
  § Nesvizhskii’s method, our own implementation
§ Conventional Vector Space Model (TFIDF-cosine)
  § Supported by the Lemur search engine (Callan, 2002)
§ X!Tandem
  § A popular software package (available online) for protein/peptide ID
§ All the systems, except X!Tandem, used SEQUEST to predict a set of peptides (as the “query”).
§ Each system produces a ranked list of proteins per query.

Metrics
§ Mean Average Precision (MAP)
  § Standard metric in IR for evaluating ranked lists
  § Evaluate each ranked list from the top to each position where a true positive document is retrieved
    § Recall = TP/(TP + FN)
    § Precision = TP/(TP + FP)
    § TP = # of true positives, TN = # of true negatives, FP = # of false positives, FN = # of false negatives
  § Average the precision scores at the recall levels 0%, 10%, 20%, …, 100% (“11-pt AVGP”)
  § Compute the mean of AVGP over all queries
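The metric above can be sketched as follows. The interpolation rule (precision at recall level r is the best precision at any recall ≥ r) is the usual IR convention and is an assumption here, since the slide does not spell it out; the function names and example data are hypothetical.

```python
def eleven_point_avgp(ranked, relevant):
    """11-point interpolated average precision for one ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    points = []  # (recall, precision) at each rank where a true positive appears
    tp = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            tp += 1
            points.append((tp / len(relevant), tp / rank))
    levels = [i / 10 for i in range(11)]  # 0%, 10%, ..., 100%
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)


def mean_average_precision(runs):
    """Mean of per-query AVGP; `runs` is a list of (ranked, relevant) pairs."""
    return sum(eleven_point_avgp(r, rel) for r, rel in runs) / len(runs)


# Hypothetical ranked protein list with two true proteins.
print(eleven_point_avgp(["P1", "P9", "P2", "P7"], {"P1", "P2"}))
```

True proteins that are never retrieved contribute zero precision at the recall levels they would have covered, so the score penalizes low recall as well as poor ordering.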

Main Results

Statistical Significance Tests on Proportions

Summary
§ The first interdisciplinary investigation/evaluation of state-of-the-art IR methods (LM and VSM) in protein identification
§ Prob-AND (LM) is a better criterion than prob-OR for combining peptide-level evidence, improving precision significantly in the high-recall regions.
§ Understanding the nature of proteomic data/problems is hard for researchers with different backgrounds (IR or ML), but the outcome is and will be rewarding.

Future Research
§ Finding the “best” protein mixture (Arnold et al., PSB 2007)
  § Instead of predicting each protein independently
  § Reduces to solving the minimum set cover problem (NP-hard)
§ Revised to find the most likely protein mixture (Li et al., 2008)
§ Greedy approximation strategies
  § Using Gibbs sampling (local maxima, efficiency issues)
  § Better results than ProteinProphet (prob-OR) on Sigma49
§ Comparative evaluation (with LM, VSM, etc.) would be informative
§ Scalability for high-recall predictions from very large protein databases?
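The set-cover view of protein mixtures can be illustrated with the generic greedy heuristic for minimum set cover. This is a textbook sketch under assumed inputs, not the specific algorithm of Arnold et al. or Li et al.; the function name and the example data are hypothetical.

```python
def greedy_protein_cover(observed_peptides, protein_peptides):
    """Greedy approximation to minimum set cover: repeatedly pick the
    protein that explains the most still-unexplained peptides."""
    uncovered = set(observed_peptides)
    chosen = []
    while uncovered and protein_peptides:
        best = max(
            sorted(protein_peptides),  # sorted for deterministic tie-breaking
            key=lambda name: len(uncovered & set(protein_peptides[name])),
        )
        gain = uncovered & set(protein_peptides[best])
        if not gain:
            break  # remaining peptides match no protein in the database
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical peptide-to-protein data.
db = {"X": ["a", "b", "c"], "Y": ["a", "b"], "Z": ["d"]}
print(greedy_protein_cover(["a", "b", "c", "d"], db))
```

Here protein Y is never selected because X already explains its peptides, which is the behavior that distinguishes mixture-level prediction from scoring each protein independently.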

Thanks!