Brave New World
Biases inherent in NGS technology
• Sequencing errors at 3'-end of reads
• PCR biases
• Coverage non-uniformity across transcripts
• G -> T and A -> C errors
• Higher coverage of the 3'-end
• GC-rich regions have higher coverage
• Mappability bias
• Sequences preceding errors are G-rich
• Nucleotide content bias across the read
• Contamination by under-spliced RNAs
• Influence of RNA secondary structure
Given a broad variety of modern sequencing protocols, platforms and versions thereof, with protocol- and platform-specific biases, to what extent are the obtained sequence data consistent across platforms and labs?
Data
            mRNA      Genome
Illumina    46 exp.   61 exp.
SOLiD       3 exp.    7 exp.
117 publicly available experiments in total, taken from SRA (October 10, 2010), from 26 labs all over the world. For each experiment, a subset of ~1.5 Gbases was selected.
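The slide does not say how the ~1.5 Gbase subset was drawn. As a minimal sketch, assuming reads are simply taken in file order until the base budget is reached (the function name `take_bases` and its parameters are hypothetical):

```python
from typing import Iterable, Tuple

def take_bases(read_lengths: Iterable[int],
               target_bases: int = 1_500_000_000) -> Tuple[int, int]:
    """Count how many leading reads are needed to reach a base budget.

    Assumption: reads are consumed in their original order; the slides do
    not state whether the real subset was a prefix or a random sample.
    Returns (number_of_reads_taken, total_bases_taken).
    """
    total = 0
    n_reads = 0
    for length in read_lengths:
        if total >= target_bases:
            break
        total += length
        n_reads += 1
    return n_reads, total

# Toy usage with a small budget so the loop is easy to follow.
n, total = take_bases([36] * 10, target_bases=100)
```

A random sample would avoid position-in-run effects, but it needs a pass over the whole file; the prefix version is shown only because it is the simplest to reason about.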
Methods
• Mapping to the reference human genome (hg19) with bowtie:
  – reads < 26 bp: 1 mismatch
  – reads 26-50 bp: 2 mismatches
  – reads > 50 bp: 3 mismatches
  – Illumina: base space
  – SOLiD: color space
• Per-nucleotide gene coverage profiles
• Single-exon genes
• Average Pearson correlation coefficients
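The length-dependent mismatch rule and the profile comparison can be sketched as follows. This is an illustration, not the authors' pipeline: the function names are hypothetical, and averaging the Pearson coefficient over genes shared by two experiments is an assumption about what "average" refers to.

```python
from math import isnan, sqrt
from typing import Dict, Sequence

def allowed_mismatches(read_length: int) -> int:
    """Mismatch budget from the Methods slide: <26 bp, 26-50 bp, >50 bp."""
    if read_length < 26:
        return 1
    if read_length <= 50:
        return 2
    return 3

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length coverage profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # flat profile: correlation is undefined
    return sxy / sqrt(sxx * syy)

def mean_profile_correlation(a: Dict[str, Sequence[float]],
                             b: Dict[str, Sequence[float]]) -> float:
    """Average Pearson r over genes present in both experiments.

    Assumption: each dict maps a gene name to its per-nucleotide coverage.
    Genes with an undefined correlation are skipped.
    """
    rs = [pearson(a[g], b[g]) for g in sorted(set(a) & set(b))]
    rs = [r for r in rs if not isnan(r)]
    if not rs:
        raise ValueError("no shared genes with a defined correlation")
    return sum(rs) / len(rs)
```

Restricting the comparison to single-exon genes, as the slide does, keeps splicing differences from leaking into the coverage profile.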
Clustering of experiments
[Figure: clustering of experiments; labels on both axes: Illumina, SOLiD, RNA, DNA]
Correlation of gene coverage profiles
• Within the same lab: R = 0.46 ± 0.14
• Between different labs: R = 0.27 ± 0.10
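A within-lab versus between-lab summary of this kind could be computed as in the sketch below. The data layout (a dict of pairwise correlations keyed by `frozenset` and a dict mapping experiment to lab) and the function name are assumptions made for illustration; the numbers in the usage lines are toy values, not the study's data.

```python
from itertools import combinations
from statistics import mean, stdev
from typing import Dict, FrozenSet, List, Tuple

def summarize_by_lab(corr: Dict[FrozenSet[str], float],
                     lab_of: Dict[str, str]) -> Dict[str, Tuple[float, float, int]]:
    """Split pairwise correlations into same-lab and different-lab groups.

    Returns {"within": (mean, sd, n), "between": (mean, sd, n)}.
    The sd is NaN when a group has fewer than two pairs.
    """
    groups: Dict[str, List[float]] = {"within": [], "between": []}
    for a, b in combinations(sorted(lab_of), 2):
        key = frozenset((a, b))
        if key not in corr:
            continue  # pair was not compared
        group = "within" if lab_of[a] == lab_of[b] else "between"
        groups[group].append(corr[key])

    def summary(values: List[float]) -> Tuple[float, float, int]:
        if not values:
            return (float("nan"), float("nan"), 0)
        sd = stdev(values) if len(values) > 1 else float("nan")
        return (mean(values), sd, len(values))

    return {name: summary(values) for name, values in groups.items()}

# Toy usage: four experiments from two hypothetical labs.
labs = {"e1": "A", "e2": "A", "e3": "B", "e4": "B"}
toy = {frozenset(("e1", "e2")): 0.5, frozenset(("e3", "e4")): 0.4,
       frozenset(("e1", "e3")): 0.2, frozenset(("e1", "e4")): 0.3,
       frozenset(("e2", "e3")): 0.3, frozenset(("e2", "e4")): 0.2}
result = summarize_by_lab(toy, labs)
```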
Clustering after normalization for 3' bias
Alternative splicing
DNA -> (transcription) -> pre-mRNA -> (splicing and processing) -> mRNA with poly(A) tail (AAA) -> (translation) -> protein
Elementary alternatives
• Cassette exon
• Alternative donor site
• Alternative acceptor site
• Retained intron
Data
• 181 555 729 read pairs for Dataset 1
• 274 927 771 reads for Dataset 2
• average sample coverage of 18 million reads
• 64% mapped reads (up to 3 mismatches)
• 93% mapped reads within gene boundaries
  – 85% within exons
  – 25% on the splice junctions
• genes with sufficient coverage
  – 9 929 in Dataset 1
  – 8 617 in Dataset 2
• splice junctions
  – 200 464 annotated
  – 21 644 novel
Genes of interest with significant changes at both the transcriptome and proteome levels
• MART (Wang and Liu 2008)
• DBN1 (Shim and Lubec 2002)
• paralemmin (Kutzleb et al. 1998)
• RTN2 (Roebroek et al. 1998)
• SRCIN1 (Di Stefano et al. 2004)
• BIN1 (Wechsler-Reya et al. 1997)
Splicing patterns from unsupervised clustering of IR profiles
Overrepresented GO terms
• Neuronal differentiation
• Axon guidance
• Neurogenesis
• Response to unfolded proteins
• Cellular macromolecular complex assembly
• Cell morphogenesis involved in neuron differentiation
• Muscle contraction
• Neuron projection development
• Coated pit
• Structural molecule activity
Differences among species | splicing | experiment
Gene assembly
1. LiftOver of splice sites from Human to Chimp and Rhesus
2. Expressed segment: a region between two neighboring sites with sufficient coverage and less than 1 bad gap (> 5 nt)
3. Gene assembly from alternative segments, intron retention, constitutive introns and constitutive exons
[Figure: splice sites marked "ok!" or "bad" after LiftOver; segments marked "good" or "bad"; assembled Gene 1 and Gene 2]
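The expressed-segment rule can be read as a gap test on per-nucleotide coverage. The sketch below assumes that a "bad gap" is a run of more than 5 consecutive positions below a coverage threshold and that an expressed segment contains no such run; the threshold `min_cov` and both function names are hypothetical.

```python
from typing import Sequence

def longest_gap(coverage: Sequence[int], min_cov: int = 1) -> int:
    """Length of the longest run of positions with coverage below min_cov."""
    longest = run = 0
    for depth in coverage:
        if depth < min_cov:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest

def is_expressed_segment(coverage: Sequence[int],
                         max_gap: int = 5,
                         min_cov: int = 1) -> bool:
    """True if the segment has no uncovered run longer than max_gap nt.

    Assumption: 'less than 1 bad gap > 5 nt' means zero gaps longer than 5 nt.
    """
    return len(coverage) > 0 and longest_gap(coverage, min_cov) <= max_gap

# Toy usage: a 5 nt gap is tolerated, a 6 nt gap is not.
ok = is_expressed_segment([3, 0, 0, 0, 0, 0, 2])
bad = is_expressed_segment([3, 0, 0, 0, 0, 0, 0, 2])
```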
Differences among species | splicing | verification human-chimpanzee human-macaque
Sanity check: PCA, Splicing vs Gene expression
Differences among species | splicing | result
[Figure legend: significant exons; not significant exons]
20% of genes change splicing with age
Correction of age by lifespan
Example 1. Spermidine/spermine N1-acetyltransferase family member 2 [Figure panels: Human, Chimp, Rhesus]
Example 2. Small nucleolar RNA host gene 11 [Figure panels: Human, Chimp, Rhesus]
Example 3. Adaptor-related protein complex 1, gamma 2 subunit [Figure panels: Human, Chimp, Rhesus]
- Slides: 74