An Integer Programming Approach to Novel Transcript Reconstruction

An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads
Serghei Mangul*, Adrian Caciula*, Nicholas Mancuso*, Ion Mandoiu** and Alexander Zelikovsky*
*Georgia State University, **University of Connecticut

ABSTRACT
In this work, we propose a novel statistical "genome-guided" method called "Transcriptome Reconstruction using Integer Programming" (TRIP) that incorporates the fragment length distribution into novel transcript reconstruction from paired-end RNA-Seq reads. To reconstruct novel transcripts, we create a splice graph based on an exact annotation of exon boundaries and RNA-Seq reads. The exact annotation of exons can be obtained from annotation databases (e.g., Ensembl) or can be inferred from aligned RNA-Seq reads. A splice graph is a directed acyclic graph (DAG) whose vertices represent exons and whose edges represent splicing events. We enumerate all maximal paths in the splice graph using a depth-first-search (DFS) algorithm. These paths correspond to putative transcripts and are the input for the TRIP algorithm.

INTRODUCTION

Advances in Next Generation Sequencing
(http://www.economist.com/node/16349358)
Ø Roche/454 FLX Titanium
– 400-600 million reads/run
– 400 bp avg. length
Ø Illumina HiSeq 2000
– Up to 6 billion PE reads/run
– 35-100 bp read length
Ø SOLiD 4/5500
– 1.4-2.4 billion PE reads/run
– 35-50 bp read length
Ø Ion Proton Sequencer

RNA-Seq
Ø Make cDNA & shatter into fragments
Ø Sequence fragment ends
Ø Map reads
Ø Downstream analyses: Gene Expression, Transcriptome Reconstruction, Transcript Expression
[Griffith and Marra 07]

Transcriptome Reconstruction
Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.

Alternative Splicing
[Figure: alternative splicing]

Classification and Existing Approaches
(Garber, M. et al. Nat. Biotechnol. June 2011)
Ø GIR: Genome-independent reconstruction (de novo)
– k-mer graph
– Trinity (2011), Velvet (2008), TransABySS (2008)
• Euler/de Bruijn k-mer graph
Ø GGR: Genome-guided reconstruction (ab initio)
– Exon identification
– Genome-guided assembly
– Scripture (2010)
• Reports "all" transcripts
– Cufflinks (2010), IsoLasso (2011), SLIDE (2012)
• Reports a minimal set of transcripts
Ø AGR: Annotation-guided reconstruction
– Explicitly uses existing annotation during assembly
– RABT (2011)
• Simulates reads from annotated transcripts

GGR vs. GIR
[Figure: GGR vs. GIR]

Challenges and Solutions
Ø Read length is currently much shorter than transcript length
Ø Statistical reconstruction method
– fragment length distribution

Splice Graph
[Figure: genome, pseudo-exons, single reads, and the resulting splice graph over exons 1-7, with candidate transcripts t1-t4]
Exons 2 and 6 are "distant" exons: how to phase them?

Transcriptome Reconstruction using Integer Programming (TRIP)
Ø Map the RNA-Seq reads to the genome
Ø Construct the splice graph G(V, E)
– V: exons
– E: splicing events
Ø Candidate transcripts
– depth-first-search (DFS)
Ø Filter candidate transcripts
– fragment length distribution (FLD)
– integer programming

How to filter?
Select the smallest set of putative transcripts that yields a good statistical fit between the fragment length distribution
– empirically determined during library preparation
– implied by "mapping" read pairs
[Figure: a read pair mapped onto candidate transcripts built from exons of length 200, 200 and 300; fragment length distribution with mean 500, std. dev. 50]

Naïve IP formulation
Variables:
– T(p): set of candidate transcripts on which pe read p can be mapped within 1 std. dev.
– y(t): 1 if candidate transcript t is selected, 0 otherwise
– x(p): 1 if pe read p is selected to be mapped within 1 std. dev.
– n(s1): expected portion of reads mapped within 1 std. dev.
Objective and Constraints: [equations shown as images on the poster]

IP formulation
Variables:
– Ti(p): set of candidate transcripts on which pe read p can be mapped with fragment length between i-1 and i std. dev., i = 1, 2, 3, 4
– y(t): 1 if candidate transcript t is selected, 0 otherwise
– xi(p): 1 if pe read p is selected to be mapped with fragment length between i-1 and i std. dev., i = 1, 2, 3, 4
– n(si): expected portion of reads mapped with fragment length between i-1 and i std. dev., i = 1, 2, 3, 4
Objective and Constraints: [equations shown as images on the poster]

Simulation Setup
Ø Human genome UCSC annotations
[Figures: number of transcripts vs. transcript length; number of genes vs. # of transcripts per gene]
Ø GNFAtlas2 gene expression levels
– Uniform/geometric expression of gene transcripts
Ø Normally distributed fragment lengths
– Mean 500, std. dev. 50

Accuracy measures
Ø Sensitivity: portion of the annotated transcripts captured by the assembly method
Ø PPV: portion of annotated transcripts among the assembled transcripts

Preliminary Results (in silico)
100x coverage, 2 x 100 bp pe reads, annotations for genes and exon boundaries
[Figure: Flowchart for TRIP: (a) Positive Predictive Value (PPV) and (b) Sensitivity, TRIP vs. Cufflinks, plotted against # of transcripts per gene]

CONCLUSIONS AND FUTURE WORK
Ø TRIP algorithm for novel transcript reconstruction
– fragment length distribution
Ø Ongoing work
– Comparison to other transcriptome reconstruction methods
– Comparison on real datasets
• SOLiD pe reads
• Illumina 100 x 2 pe reads

ACKNOWLEDGEMENTS
• NSF award IIS-0916401
• NSF award IIS-0916948
• Agriculture and Food Research Initiative Competitive Grant no. 2011-67016-30331 from the USDA National Institute of Food and Agriculture
• Second Century Initiative Bioinformatics University Doctoral Fellowship, Georgia State University
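
Illustrative sketch (not part of the original poster): the candidate-generation and fragment-length steps described for TRIP can be illustrated with a small Python example. It is a minimal sketch under stated assumptions, not the authors' implementation. The toy splice graph, the exon lengths, the read-pair coordinates, and the function names `maximal_paths`, `implied_fragment_length` and `std_dev_bin` are all hypothetical. Only the DFS enumeration of maximal paths, the fragment length distribution (mean 500, std. dev. 50) and the "between i-1 and i std. dev." binning come from the poster.

```python
from typing import Dict, List, Optional

# Hypothetical toy splice graph: vertices are exon ids, edges are splicing events.
SPLICE_GRAPH: Dict[int, List[int]] = {1: [2, 3], 2: [3], 3: [4], 4: []}
# Hypothetical exon lengths in bp.
EXON_LENGTHS: Dict[int, int] = {1: 200, 2: 200, 3: 300, 4: 100}


def maximal_paths(graph: Dict[int, List[int]]) -> List[List[int]]:
    """Enumerate all source-to-sink paths of a DAG by depth-first search."""
    has_incoming = {v for targets in graph.values() for v in targets}
    sources = sorted(v for v in graph if v not in has_incoming)
    paths: List[List[int]] = []

    def dfs(vertex: int, path: List[int]) -> None:
        path.append(vertex)
        if not graph[vertex]:          # sink reached: the path is maximal
            paths.append(list(path))
        for successor in graph[vertex]:
            dfs(successor, path)
        path.pop()                     # backtrack

    for source in sources:
        dfs(source, [])
    return paths


def implied_fragment_length(transcript: List[int], lengths: Dict[int, int],
                            start_exon: int, start_offset: int,
                            end_exon: int, end_offset: int) -> Optional[int]:
    """Fragment length implied by a read pair on one candidate transcript.

    The fragment starts at 0-based `start_offset` inside `start_exon` and ends
    (exclusive) at `end_offset` inside `end_exon`. Returns None when the pair
    cannot be placed on this transcript.
    """
    if start_exon not in transcript or end_exon not in transcript:
        return None
    i, j = transcript.index(start_exon), transcript.index(end_exon)
    if i > j:
        return None
    if i == j:
        return end_offset - start_offset
    between = sum(lengths[e] for e in transcript[i + 1:j])
    return (lengths[start_exon] - start_offset) + between + end_offset


def std_dev_bin(fragment_length: float, mean: float = 500.0,
                std_dev: float = 50.0, max_bin: int = 4) -> Optional[int]:
    """Return i if the length lies between i-1 and i std. dev. from the mean."""
    deviation = abs(fragment_length - mean) / std_dev
    for i in range(1, max_bin + 1):
        if deviation <= i:
            return i
    return None                        # farther than max_bin std. dev.


if __name__ == "__main__":
    for t in maximal_paths(SPLICE_GRAPH):
        length = implied_fragment_length(t, EXON_LENGTHS, 1, 100, 3, 200)
        print(t, length, std_dev_bin(length) if length is not None else None)
```

In this toy case the same read pair implies a different fragment length on each candidate transcript, which is the signal the filtering step relies on: a candidate on which many read pairs fall far from the mean is a poorer statistical fit to the fragment length distribution.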