PepNovo: De Novo Peptide Sequencing via Probabilistic Network Modeling

Peptide Fragmentation
[Figure: a peptide undergoing collision-induced dissociation (CID), splitting into an N-terminal prefix fragment and a C-terminal suffix fragment of mass PM-m.]

Peptide Fragmentation
• A peptide with mass PM that fragments into a prefix of mass m and a suffix of mass PM-m can produce different fragment ions:

  Prefix ion   Position     Suffix ion   Position
  b            m+1          y            PM-m+19
  b-H2O        m-17         y-NH3        PM-m+2
  b2           (m+2)/2      y-H2O        PM-m+1
  ...          ...          ...          ...

• The intensities at the expected offsets from mass m are used to create an intensity vector I = (I_1, ..., I_k), with one entry per fragment ion type.
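
The offsets on this slide can be turned into a small lookup that collects an intensity vector for a candidate prefix mass. The following is a minimal sketch, not PepNovo's implementation: the function names, the peak-list format (pairs of m/z and intensity), and the 0.5 Da matching tolerance are all assumptions made for illustration, and only five of the ion types are included.

```python
def fragment_positions(m, pm):
    """Expected m/z positions for a cleavage with prefix mass m and peptide mass pm.

    Offsets follow the table on the slide; only a subset of ion types is listed.
    """
    return {
        "b": m + 1,
        "b-H2O": m - 17,
        "b2": (m + 2) / 2,      # doubly charged b ion
        "y": pm - m + 19,
        "y-NH3": pm - m + 2,
    }


def intensity_vector(peaks, m, pm, tol=0.5):
    """Collect the strongest peak near each expected position.

    peaks: list of (mz, intensity) pairs (hypothetical input format).
    tol:   matching tolerance in Da (an assumed value).
    A fragment with no matching peak gets intensity 0.0.
    """
    vector = {}
    for ion, position in fragment_positions(m, pm).items():
        hits = [inten for mz, inten in peaks if abs(mz - position) <= tol]
        vector[ion] = max(hits, default=0.0)
    return vector


if __name__ == "__main__":
    spectrum = [(101.2, 30.0), (419.1, 50.0), (300.0, 5.0)]
    print(intensity_vector(spectrum, m=100.0, pm=500.0))
```

Taking the strongest peak inside the tolerance window is one simple choice; a real scorer might prefer the closest peak instead.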

The Spectrum Graph

Scoring for De Novo Sequencing
• All masses in the spectrum range can be considered putative cleavage sites.
• Given the observed intensities I, how do we evaluate whether mass m is a cleavage site?
• A common statistical tool used by many scoring functions is the likelihood ratio test (Dancik et al. '99, Havilio et al. '03, ...).

Dancik et al. '99 – Hypotheses
• The main concept: give a premium for present peaks and a penalty for missing peaks.
• Fragmentation hypothesis: uses a probability table:

  Fragment   Probability
  y          0.71 (P1)
  b          0.66 (P2)
  a          0.26 (P3)
  y-H2O      0.09 (Pk)

• Random hypothesis: PR, the probability of observing a random peak (~0.1).

Scoring a Cleavage Site (Dancik '99)
• Out of k possible ions for a cleavage at m, t are detected (w.l.o.g. fragments 1, ..., t) and k-t are missing (fragments t+1, ..., k).
• Score using a log ratio test:

$$\mathrm{Score}(m) = \log \frac{\text{Probability of cleavage site } m \text{ according to the Fragmentation hypothesis}}{\text{Probability of cleavage site } m \text{ according to the Random hypothesis}}$$
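
The explicit formula did not survive on this slide, so the sketch below uses the form usually attributed to Dancik et al.: each detected ion i contributes log(P_i / P_R) and each missing ion contributes log((1 - P_i) / (1 - P_R)), which assumes the ions are observed independently. The function name and the dictionary layout are illustrative assumptions; the probabilities are the ones from the table on the Hypotheses slide.

```python
import math

# Fragment probabilities from the Hypotheses slide; P_R ~ 0.1 is the random-peak probability.
ION_PROBS = {"y": 0.71, "b": 0.66, "a": 0.26, "y-H2O": 0.09}
P_RANDOM = 0.1


def dancik_score(ion_probs, detected, p_random=P_RANDOM):
    """Log likelihood ratio of the fragmentation vs. random hypothesis.

    ion_probs: mapping ion name -> probability of observing that ion.
    detected:  set of ion names for which a peak was found.
    Assumes independent ions (a simplification stated in the lead-in).
    """
    score = 0.0
    for ion, p in ion_probs.items():
        if ion in detected:
            score += math.log(p / p_random)              # premium for a present peak
        else:
            score += math.log((1 - p) / (1 - p_random))  # penalty for a missing peak
    return score


if __name__ == "__main__":
    print(dancik_score(ION_PROBS, {"y", "b"}))  # strong evidence: positive score
    print(dancik_score(ION_PROBS, set()))       # nothing detected: negative score
```

Note that a missing low-probability ion such as y-H2O barely changes the score, because 1 - 0.09 is almost the same as 1 - P_R.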

PepNovo Scoring
• PepNovo implements a similar likelihood ratio test mechanism.
• It can be viewed as extending the scoring model of Dancik et al. '99.
• It includes several factors that are not sufficiently addressed in current scoring functions.

Enhancements to Dancik et al. ('99)
1. Several intensity values.
2. Combinations of fragment ions.
3. Incorporation of additional chemical knowledge (e.g., preferred cleavage sites).
4. Positional influence of the cleavage site.
5. Improved random model.

H_CID - Fragmentation Network
[Figure: probabilistic network for the fragmentation hypothesis. Nodes: N-aa (N-terminal amino acid), C-aa (C-terminal amino acid), pos(m) (region in peptide), and fragment ions including y, b, y2, b2, a, y-NH3, b-H2O, y-H2O, a-NH3, a-H2O, y-H2O-NH3, and b-H2O-NH3. Edges represent three kinds of dependency: amino acid influence, ion combinations, and positional influence. An example conditional probability table, P(y2 | y, pos), is shown for one node.]

Discrete Intensity Values
• Peak intensities are normalized by the grass level (the average intensity of the weakest 33% of peaks in the spectrum).
• Normalized intensities I are discretized into 4 intensity levels:
  - zero: I < 0.05
  - low: 0.05 ≤ I < 2 (62% of peaks)
  - medium: 2 ≤ I < 10 (26% of peaks)
  - high: I ≥ 10 (12% of peaks)
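
The normalization and thresholds on this slide translate directly into code. This is a minimal sketch under stated assumptions: the function names are hypothetical, "weakest 33%" is taken as the weakest third of the peaks rounded down (at least one peak), and the input is a non-empty list of positive intensities.

```python
LEVEL_NAMES = ["zero", "low", "medium", "high"]


def grass_level(intensities):
    """Average intensity of the weakest third of the peaks (assumed rounding: floor)."""
    ranked = sorted(intensities)
    n_weak = max(1, len(ranked) // 3)
    return sum(ranked[:n_weak]) / n_weak


def discretize(normalized):
    """Map a grass-normalized intensity to a level index 0..3 using the slide's thresholds."""
    if normalized < 0.05:
        return 0  # zero
    if normalized < 2:
        return 1  # low
    if normalized < 10:
        return 2  # medium
    return 3      # high


if __name__ == "__main__":
    raw = [1.0, 1.0, 1.0, 2.0, 4.0, 10.0, 30.0, 50.0, 100.0]
    grass = grass_level(raw)
    print([LEVEL_NAMES[discretize(x / grass)] for x in raw])
```

A missing peak is represented by intensity 0 and therefore always falls in the zero level.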

Combinations of Fragments
[Figure: the fragment ion nodes of the network (a, y, b, y2, b2, y-NH3, b-H2O, y-H2O, a-NH3, a-H2O, b-H2O-NH3, b-H2O-H2O, y-H2O-NH3) and the edges between them.]
• Different combinations have significantly different probabilities:
  - P(b = high | y = high) = 0.36, vs. P(b = high | y = low) = 0.03.
  - P(b-H2O > zero | b = high) = 0.5, vs. P(b-H2O > zero | b = zero) = 0.24.

Additional Chemical Knowledge
[Figure: the N-aa (N-terminal amino acid) and C-aa (C-terminal amino acid) nodes and their edges to the b and y nodes.]
• The identity of the flanking amino acids influences the peak intensities:
  - Increased intensities N-terminal to proline and glycine.
  - Increased intensities C-terminal to aspartic acid.
• The 400 amino acid combinations are reduced to 15 equivalence sets (X-P, X-G, etc.).

Positional Influence
[Figure: the pos(m) (region in peptide) node and its edges to the a, b, b2, y, and y2 nodes.]
• Creates separate models for different locations in the peptide.
• Models phenomena such as:
  - weak b/y ions near the ends;
  - prevalence of a-ions in the first half of the peptide;
  - prevalence of b2 towards the peptide's C-terminus and y2 near the N-terminus.

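Probability under H_CID
• From the decomposition properties of probabilistic networks, each node is conditionally independent of its non-descendants given the values of its parents, so:

$$P(I \mid H_{CID}) = \prod_{f} P(I_f \mid \pi(f))$$

where I_f is the intensity level of fragment f and π(f) are the parents of node f.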
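
The product over nodes can be illustrated on a toy network. The sketch below is not the PepNovo network: it uses a three-node chain (y, then b, then b-H2O), only two intensity levels, and conditional probability values that are entirely made up for the example. It shows only how a joint probability is assembled from per-node tables given the parents' values.

```python
# Toy network: each node lists its parents (hypothetical structure for illustration).
PARENTS = {"y": (), "b": ("y",), "b-H2O": ("b",)}

# Conditional probability tables, keyed by the tuple of parent values.
# All numbers here are invented for the example.
CPT = {
    "y": {(): {"zero": 0.4, "high": 0.6}},
    "b": {
        ("zero",): {"zero": 0.8, "high": 0.2},
        ("high",): {"zero": 0.5, "high": 0.5},
    },
    "b-H2O": {
        ("zero",): {"zero": 0.9, "high": 0.1},
        ("high",): {"zero": 0.6, "high": 0.4},
    },
}


def joint_probability(assignment):
    """Multiply P(node | parents) over all nodes of the toy network."""
    prob = 1.0
    for node, parents in PARENTS.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= CPT[node][parent_values][assignment[node]]
    return prob


if __name__ == "__main__":
    observed = {"y": "high", "b": "high", "b-H2O": "zero"}
    print(joint_probability(observed))  # 0.6 * 0.5 * 0.6
```

In the full model the amino acid and position nodes would simply appear as additional parents in the lookup key.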

H_Random – Regional Density
[Figure: peaks with intensity levels 0 to 3 along the m/z axis; a bin of width 2ε lies inside a window of width w.]

Computing the Random Probability
• α = 1 - (2ε)/w is the probability of a single peak missing the bin.
• Let n_i, 1 ≤ i ≤ d, be the counts of peaks with intensity level i in the window w; the random probability is computed from α and these counts.
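
The closed-form expression is missing from this slide, so the following is one reading that is consistent with the definitions, not a confirmed reproduction of the original formula. Write alpha for the miss probability 1 - (2ε)/w. Assume the peaks in the window fall independently and uniformly, and that the level observed in the bin is the highest level of any peak landing in it. Then level 0 (no peak) has probability alpha^(n_1 + ... + n_d), and level t has probability (1 - alpha^(n_t)) times the product of alpha^(n_i) over all i > t. The function name and argument layout are hypothetical.

```python
def random_level_probs(counts, eps, w):
    """Probability of each intensity level 0..d appearing in a bin at random.

    counts: [n_1, ..., n_d], number of peaks of each level in the window.
    eps:    half-width of the bin (the bin spans 2 * eps).
    w:      width of the window.
    Assumes independent, uniformly placed peaks (see lead-in).
    """
    alpha = 1.0 - (2.0 * eps) / w          # a single peak misses the bin
    d = len(counts)
    probs = [0.0] * (d + 1)
    probs[0] = alpha ** sum(counts)        # every peak misses the bin
    all_higher_miss = 1.0                  # running product over levels above t
    for t in range(d, 0, -1):
        n_t = counts[t - 1]
        probs[t] = (1.0 - alpha ** n_t) * all_higher_miss
        all_higher_miss *= alpha ** n_t
    return probs


if __name__ == "__main__":
    probs = random_level_probs([6, 3, 1], eps=0.5, w=100.0)
    print(probs, sum(probs))
```

Under these assumptions the probabilities telescope and sum to 1, which is a quick sanity check on the reconstruction.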

Random Model for H_Random
• Peak occurrences are treated as random independent events.
• The probability of observing a peak at random is estimated from the local density of peaks in the spectrum.

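The Likelihood Ratio Score
• A putative cleavage site is scored according to the log ratio test:

$$\mathrm{Score}(m) = \log \frac{P(I \mid H_{CID})}{P(I \mid H_{Random})}$$

• Can be used to score a peptide by summing the scores of its prefix masses m_1, ..., m_n:

$$\mathrm{Score}(\text{peptide}) = \sum_{i=1}^{n} \mathrm{Score}(m_i)$$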
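
As a small worked example of the log ratio and the sum over prefix masses, the sketch below takes the two hypothesis probabilities for each cleavage site as given. The function names and the input format (a list of probability pairs) are assumptions, and the numbers in the usage example are made up.

```python
import math


def cleavage_score(p_cid, p_random):
    """Log ratio of the probability under H_CID to the probability under H_Random."""
    return math.log(p_cid / p_random)


def peptide_score(site_probs):
    """Sum the cleavage scores over a peptide's prefix masses.

    site_probs: list of (p_cid, p_random) pairs, one per prefix mass.
    """
    return sum(cleavage_score(p_cid, p_rand) for p_cid, p_rand in site_probs)


if __name__ == "__main__":
    # One site twice as likely under H_CID, one site half as likely.
    print(peptide_score([(0.2, 0.1), (0.05, 0.1)]))
```

A site whose observed intensities are better explained by the random model contributes a negative term, so weakly supported cleavages lower the peptide's total.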

PepNovo's De Novo Sequencing
• A spectrum graph is created from the experimental MS/MS spectrum.
• The nodes are scored using our method.
• The highest-scoring anti-symmetric path is found using a dynamic programming algorithm.

Spectrum Graph
• Acyclic graph.
• Nodes are cleavage sites; each has a mass m and a score s.
• Edges connect nodes with mass differences corresponding to an amino acid.
[Figure: example spectrum graph with nodes (m: 0, s: 5.0), (m: 71.2, s: 4.3), (m: 99.1, s: 8.1), (m: 113, s: -1.2), (m: 163.2, s: 2.8), and (m: 199.4, s: 5.6), and edges labeled with the amino acids Q, V, A, S, L, and W.]
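
A spectrum graph and a path search over it can be sketched in a few lines. This is a simplified illustration, not PepNovo's algorithm: it finds the plain highest-scoring path from the lightest to the heaviest node and does not enforce the anti-symmetry constraint. The node masses are invented so that the edges are self-consistent, the residue table covers only five amino acids, and the 0.02 Da tolerance is an assumed value. The node scores reuse the values from the figure.

```python
# Monoisotopic residue masses for a small subset of amino acids.
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841, "L": 113.08406}

# (mass, score) pairs; masses are hypothetical, chosen to give a consistent toy graph.
NODES = [(0.0, 5.0), (71.04, 4.3), (99.07, 8.1), (170.11, -1.2), (186.10, 2.8), (257.14, 5.6)]


def best_path(nodes, tol=0.02):
    """Highest-scoring path from the lightest to the heaviest node.

    Returns (sequence, score). Sorting by mass gives a topological order,
    so a single left-to-right pass is a valid dynamic program.
    """
    nodes = sorted(nodes)
    n = len(nodes)
    best = [float("-inf")] * n   # best path score ending at each node
    back = [None] * n            # (previous node index, amino acid) on that path
    best[0] = nodes[0][1]
    for j in range(1, n):
        for i in range(j):
            if best[i] == float("-inf"):
                continue         # node i is not reachable from the start
            diff = nodes[j][0] - nodes[i][0]
            for aa, mass in RESIDUES.items():
                candidate = best[i] + nodes[j][1]
                if abs(diff - mass) <= tol and candidate > best[j]:
                    best[j] = candidate
                    back[j] = (i, aa)
    # Trace the path back from the heaviest node.
    sequence = []
    j = n - 1
    while back[j] is not None:
        j, aa = back[j]
        sequence.append(aa)
    return "".join(reversed(sequence)), best[n - 1]


if __name__ == "__main__":
    print(best_path(NODES))
```

On this toy input the path through the nodes at 99.07 and 186.10 wins because it avoids the negatively scored node at 170.11.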

Results

  Algorithm   Average Accuracy   Sequence Length   Tag 3   Tag 4   Tag 5   Tag 6
  PepNovo     0.727              10.30             0.946   0.871   0.800   0.654
  Sherenga    0.690              8.65              0.821   0.711   0.564   0.364
  Peaks       0.673              10.32             0.889   0.814   0.689   0.575
  Lutefisk    0.566              8.79              0.661   0.521   0.425   0.339

Benchmarking reported for 280 spectra.

Q&A