Assignment 2 Papers read for this assignment Paper

  • Slides: 31
Download presentation
Assignment 2: Papers read for this assignment • Paper 1: PALMA: m. RNA to

Assignment 2: Papers read for this assignment • Paper 1: PALMA: m. RNA to Genome Alignments using Large Margin Algorithms • Paper 2: Optimal spliced alignments of short sequence reads • Badil Elhady, Michael Chan

Paper 1: PALMA: m. RNA to Genome Alignments using Large Margin Algorithms

Paper 1: PALMA: m. RNA to Genome Alignments using Large Margin Algorithms

Motivation • Question for the study? – The correct alignment of m. RNA sequences

Motivation • Question for the study? – The correct alignment of m. RNA sequences to genomic DNA is still a challenging task. ( Due to the presence of sequencing errors, micro-exons, alternative splicing)

Method • Splice Site Prediction – SVM with large margin, decided under convex optimization

Method • Splice Site Prediction – SVM with large margin, decided under convex optimization • Intron Length Model • Dynamic Programming is used to maximize the scoring function, leading to Optimal Alignment. (Smith-Waterman Alignments with Intron Model) • This leads to: – Tuning the parameters of scoring function leads to. • A larger score • Other alignment would score lower

Slice site prediction – Accurately differentiates the exon-intron boundaries – Compartmentalize the local alignment

Slice site prediction – Accurately differentiates the exon-intron boundaries – Compartmentalize the local alignment of EST. – Claim: • Robust to mutations, insertions and deletions, as well as noise levels in accurately identifying intron boundaries as well as boundaries of the optimal local alignment.

Splice Site Predictions • From a set of ETS, sequences were extracted of confirmed

Splice Site Predictions • From a set of ETS, sequences were extracted of confirmed donor and acceptor slice sites. • To recognize acceptor and donor slice sites, 2 SVM classifiers were trained. Using “ weighted degree ” kernel. • kernel computes the similarity between sequences s and s’.

Intron Length Model • The main idea of the algorithm is to compute a

Intron Length Model • The main idea of the algorithm is to compute a local alignment by determining the maximum over all alignments of all prefixes – SE (1 : i) : =(SE(1), . . . , SE(i)) – SD(1 : j) : = (SD(1), . . . , SD(j)) » SE EST Sequences » SD DNA Sequences – Running time is O(m*n*L) » m length of SE » n length of SD » Smith-Waterman does not distinguish between exons and introns.

Scoring Function

Scoring Function

 • In generalizing the Smith. Waterman algorithm by including an intron model taking

• In generalizing the Smith. Waterman algorithm by including an intron model taking splice site predictions as well as intron length into account. • The information is then used to optimize the parameters used for alignment. Smith-Waterman Alignments with Intron Model Splice site prediction assisting params

Experimental setup • Evaluating PALMA vs. exalin, sim 4, and blat. • Alignment of

Experimental setup • Evaluating PALMA vs. exalin, sim 4, and blat. • Alignment of m. RNA seq. artificially shorting the middle exon (3 -50)nt as shown.

Experimental setup cont. • Artificially generating the data : as a control to know

Experimental setup cont. • Artificially generating the data : as a control to know exactly what the correct alignment has to be. • Add varying amounts of noise (p ¼ 0 , 1 , 5 and 10% of random mutations, deletions or insertions) to the query sequence. • Replace a part of the DNA or m. RNA sequence at its terminal ends with random sequence leading to a shortened correct alignment.

PALMA Add noise vs. exalin, sim 4, and blat.

PALMA Add noise vs. exalin, sim 4, and blat.

PALMA Varying lengths vs. exalin, sim 4, and blat.

PALMA Varying lengths vs. exalin, sim 4, and blat.

Conclusion – Motivation: high sensitivity detection of short exons in the midst of noise.

Conclusion – Motivation: high sensitivity detection of short exons in the midst of noise. – Principles • Splice Site Prediction • Intron Length Model • maximize scoring function, for Optimal Alignment. – Results: PALMA detects short exons while exalin, blat, etc, are unsuccessful

Further Topics • vmatch • svm • convex optimization

Further Topics • vmatch • svm • convex optimization

Paper 2: Optimal spliced alignments of short sequence reads

Paper 2: Optimal spliced alignments of short sequence reads

Situation • NGS has short length and inherent high error rate even compared to

Situation • NGS has short length and inherent high error rate even compared to Sanger. It is fast but the accuracy? • Many methods are efficient and accurate if the sequence blocks (exons) are sufficiently long and are highly similar to the genomic sequence. • Reads from NG sequencing techniques do not have either of 2 properties.

Motivation • Objective to be able to accurately align the sequence reads over intron

Motivation • Objective to be able to accurately align the sequence reads over intron boundaries. • QPALMA takes the read’s quality information as well as computational splice site predictions to compute accurate spliced alignments.

Principles • Learn, in a supervised manner, how to score quality information, splice site

Principles • Learn, in a supervised manner, how to score quality information, splice site predictions and sequence identity based on a representative set of sequence reads with known alignments. • Extended Smith-Waterman algorithm: • Extension 1: Quality Scores • Extension 2: Splice Sites • Extension 3: Non-affine Intron Length Model

1) Splice site prediction • Need to know acceptor and donor splice sites as

1) Splice site prediction • Need to know acceptor and donor splice sites as well as suitable decoy sequences. • Extension 1: Quality Scores • Extension 2: Splice Sites • Extension 3: Non-affine Intron Length Model

Extension 1: Quality Scores • same computational complexity as the original Smith –Waterman algorithm

Extension 1: Quality Scores • same computational complexity as the original Smith –Waterman algorithm (O(mn)). However, it uses a more complex scoring that may depend on the sequencing technology used. Constant

Extension 2: Splice Sites • (O(mn. L)) operations, where L is the maximal length

Extension 2: Splice Sites • (O(mn. L)) operations, where L is the maximal length of the intron. • The idea is to maintain an additional recurrence matrix W used to keep track of the intron boundaries.

Extension 3: Non-affine Intron Length Model • Recurrence can be implemented as follows •

Extension 3: Non-affine Intron Length Model • Recurrence can be implemented as follows • For long introns this approach seem computationally infeasible.

An alignment pipeline against whole genomes • !!! optimal alignments is time consuming •

An alignment pipeline against whole genomes • !!! optimal alignments is time consuming • => use vmatch(multi-step approach on enhanced suffix arrays) + high quality splice site detection

[Optional] An alignment pipeline against whole genomes • vmatch (1 st round) finds global

[Optional] An alignment pipeline against whole genomes • vmatch (1 st round) finds global alignments of all short reads (max 2 mismatches) against the genome to identify large fraction of unspliced reads. – If there are reads that cannot be aligned (leftover reads) – spliced or low quality reads • Yet, there is possibility that the boundary of the reads are the spliced sites – Check with QPALMA scoring function as a filter to quickly decide whether the read is spliced or not. – all combinations of putative donor splice sites within the read and acceptor splice sites ≤ 2000 nt downstream of the read, and – all combinations of putative acceptor splice sites within the read and donor splice sites ≤ 2000 nt upstream of the read.

[Optional] An alignment pipeline against whole genomes • leftover reads + spliced (predicted to

[Optional] An alignment pipeline against whole genomes • leftover reads + spliced (predicted to be by QPALMA) used as seeds for vmatch (2 nd round) and localize the splice sites with a ‘window’

Results

Results

Conclusion – Motivation: NGS is inaccurate. – Principles • 3 extentions to PALMA •

Conclusion – Motivation: NGS is inaccurate. – Principles • 3 extentions to PALMA • Vmatch pipelining, for boundary precision. – Results: lower error • QPALMA + vmatch pipelining = PALMA + 3 extentions – {SVM, large marigin}

References • A Tutorial on Support Vector Machines for Pattern Recognition • NCBI National

References • A Tutorial on Support Vector Machines for Pattern Recognition • NCBI National Center for Biotechnology Information http: //www. ncbi. nlm. nih. gov/About/primer/genetics_ genome. html • Bio. Info. Bank Library http: //lib. bioinfo. pl/ • High Throughput Short Read Alignment via Bidirectional BWT

Badil_el ha dy Mich_a___el__chan

Badil_el ha dy Mich_a___el__chan