RNA seq (I) Edouard Severing

A typical heat stress experiment (climate change)
[Figure: an economically important frog under heat stress (convection) and control conditions; labels: 85 minutes, 5 days]
How does the frog adapt and survive?

Coping with heat stress
• The frog likely has to change several processes in order to cope with the heat stress.
  • Adaptation of metabolic pathways.
  • Preventing water loss through the skin.
• This means changing the concentrations of several enzymes, other proteins and molecules.
• We want to determine these changes in molecule concentration.
  • Starting with proteins.

Changes at the molecular level
• We could measure protein concentrations directly.
  • Not often done on a large scale.
• We could measure changes in the expression of the genes that encode these proteins.
• Gene expression can be approximated by measuring the amount of mRNA molecules that are produced by the gene.

Gene count and complexity
[Figure labels: 20,000 genes; 25,000 genes]

From genes to proteins (I)
Initial assumption: N protein-coding genes → N mRNA molecules → N proteins.
The assumption is based on studies that were performed on bacterial systems.

From genes to proteins (II)
Current view: N protein-coding genes → X N mRNA molecules → ? N proteins.
What happens here?

Splicing
[Figure: gene and pre-mRNA (5’ to 3’) with alternating exons and introns; splicing yields an mRNA (5’ to 3’) that consists of exons only]

Alternative splicing
[Figure: one pre-mRNA (5’ to 3’) is spliced in two different ways, giving two different mRNAs (5’ to 3’)]

Gene count and complexity
[Figure labels: 90% of genes have alternative splicing (AS); 60% of genes have AS]
The average number of transcripts produced by human genes is also higher than the average number of transcripts produced by plant genes.

An extreme case
The Dscam gene produces over 38,000 different transcripts.

Major alternative splicing event types
• In humans, exon skipping is the most frequent AS event type.
• In plants, intron retention is the most common AS event type.

RNA editing
Primary transcript (predicted sequence): 5’-ACUACGAU-3’
After RNA editing (observed sequence): 5’-ACUAUGAU-3’
Difficulty: distinguishing genuine RNA editing from sequencing errors.

Not everything is translated
• A large fraction (>30%) of the transcripts of protein-coding genes is degraded by the nonsense-mediated decay (NMD) pathway.
• The position of the stop codon is used to predict whether a transcript is likely to be degraded by the NMD pathway.

Detecting putative NMD candidates
[Figure: pre-mRNA and spliced mRNA (5’ to 3’) with the exon/exon junctions marked; the open reading frame runs from the start codon (M) to the stop codon]
A transcript is a putative NMD candidate when the distance d between the stop codon and a downstream exon/exon junction is larger than 50-55 nt.
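
The rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming that d is measured on the spliced mRNA from the end of the stop codon to the last exon/exon junction. The function name, the 50 nt default and the example coordinates are hypothetical and not taken from the slides.

```python
def is_nmd_candidate(stop_codon_end, exon_lengths, min_distance=50):
    """Return True if the stop codon lies more than min_distance nt
    upstream of the last exon/exon junction (positions on the mRNA)."""
    if len(exon_lengths) < 2:
        return False  # single-exon transcript: there is no junction
    # The last junction sits at the summed length of all exons but the last.
    last_junction = sum(exon_lengths[:-1])
    d = last_junction - stop_codon_end
    return d > min_distance

# Invented transcripts with three exons (last junction at position 450).
print(is_nmd_candidate(stop_codon_end=300, exon_lengths=[200, 250, 150]))  # d = 150
print(is_nmd_candidate(stop_codon_end=520, exon_lengths=[200, 250, 150]))  # stop in last exon
```

The first transcript is flagged, the second is not, because its stop codon lies in the last exon.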

Remember
• The number of unique mRNA molecules is much larger than the number of genes.
• A large fraction of the mRNA molecules is degraded by the NMD pathway.
  • NMD provides a means to regulate gene expression at the post-transcriptional level.

Process the frogs into reads for analysis
Grind (N2) → Prepare for sequencing → Sequencing → Bioinformatics
>s1
ATCGTAGGGTA
>s2
ATGGCCTAGGT

Basic transcriptome analysis steps
• Many research questions require the following steps:
  • Reconstruction of the transcriptome
    • We usually only have fragments
  • Quantification of the transcriptome
  • Differential expression analysis
  • Other fun stuff.

de novo transcriptome reconstruction (I)

de novo transcriptome reconstruction (II)

Genome-guided transcriptome reconstruction
[Figure: mRNA (5’ to 3’) aligned to the genome]

Genome-guided transcriptome reconstruction

Genome-guided transcriptome reconstruction

Remember
• de novo transcriptome assembly
  • When no reference genome is available
  • Finding features that are not in the reference genome (T-DNA insertions)
  • Programs: Trinity, Trans-ABySS, Velvet/Oases
• Genome-guided transcriptome reconstruction
  • Reference genome is available, with or without annotation
  • Mapping programs: TopHat, GSNAP
  • Transcriptome reconstruction: Scripture, Cufflinks

RNA seq (II) Quantification Edouard Severing

A typical heat stress experiment Heat stress Control (convection)

Raw counts
• Counting the number of reads/fragments falling within the exonic regions of a gene.
  • Example: htseq-count (HTSeq)
[Figure: gene with Exon 1, Exon 2, Exon 3, Exon 4]
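
A toy version of raw counting, to make the idea concrete. This is a simplified sketch, not how htseq-count works internally: it only checks whether the start position of a read falls inside an exon, and the exon coordinates and read positions are invented.

```python
def count_exonic_reads(exons, read_positions):
    """Count reads whose start position falls inside any exon.

    exons: list of (start, end) tuples, 1-based and inclusive.
    read_positions: read start coordinates on the same sequence.
    """
    count = 0
    for pos in read_positions:
        if any(start <= pos <= end for start, end in exons):
            count += 1
    return count

# Invented gene with four exons and invented read start positions.
exons = [(100, 200), (300, 380), (500, 620), (800, 900)]
reads = [50, 120, 150, 250, 310, 610, 700, 850]
print(count_exonic_reads(exons, reads))  # 120, 150, 310, 610 and 850 are exonic -> 5
```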

The same fragment count yet different expression levels
[Figure: Exon 1 with the same fragment count in two libraries of different size]
Library size matters.

The same fragment count yet different expression levels.
[Figure: a gene with Exons 1 to 4 and a gene with only Exon 1, both with the same fragment count]
Transcript/gene length matters.

Normalizing/correcting for feature length and library size
RPKM: reads per kilobase of feature length per million mapped reads.
[Figure labels: reads mapped to region; feature length: 300 nt; all mapped reads: 10,000; RPKM ≈ 1.7]
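
The RPKM definition written out as a small function. The example numbers are assumptions chosen for illustration and are not the values from the slide.

```python
def rpkm(reads_in_feature, feature_length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    kilobases = feature_length_nt / 1_000
    millions = total_mapped_reads / 1_000_000
    return reads_in_feature / (kilobases * millions)

# Assumed example: 5 reads on a 300 nt feature in a library of 10 million mapped reads.
print(round(rpkm(5, 300, 10_000_000), 2))  # 5 / (0.3 * 10) -> 1.67
# Same read count in a library twice as large: the RPKM halves.
print(round(rpkm(5, 300, 20_000_000), 2))  # -> 0.83
```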

Normalizing/correcting for feature length
FPKM is analogous to RPKM, but counts fragments instead of reads.
[Figure labels: RPKM = 1; RPKM = 2; FPKM = 1]
A different picture emerges from raw counts than from RPKM/FPKM values.

Counting method issues
• What to do with reads that map to multiple isoforms (alternative splicing) or genes?
  • Pure random assignment? No, expression levels can differ.
  • Count multiple times? No, the read was derived from a single transcript.
[Figure: Gene 1 and Gene 2; Isoform 1 and Isoform 2]

Count issues: Back to the gene level (I)

Count issues: Back to the gene level (II)

Statistical methods: Expression levels of transcripts

Fishing in the dark lake experiment
Question: What fraction (t) of the fish in the lake is green?
Method: We catch a number of fish and determine what fraction is green.
Caution: The fish have to be thrown back into the water immediately.

Fishing in the dark lake results (I)
Sane people would do: take the fraction of fish in the sample (X) that is green, t = 1/3.

Fishing in the dark lake results
Maximum likelihood estimate of t
[Figure: probability of the sample (X) plotted against t]
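
The maximum likelihood idea can be tried out numerically. This is a small sketch, assuming a sample of 3 fish of which 1 is green (the slides only give t = 1/3, the sample size is an assumption) and a binomial model, which fits because every fish is thrown back.

```python
from math import comb

def likelihood(t, n_caught, n_green):
    """Binomial probability of the sample given a green fraction t."""
    return comb(n_caught, n_green) * t**n_green * (1 - t)**(n_caught - n_green)

# Assumed sample: 3 fish caught, 1 of them green.
n_caught, n_green = 3, 1
grid = [i / 1000 for i in range(1001)]
t_hat = max(grid, key=lambda t: likelihood(t, n_caught, n_green))
print(t_hat)               # grid point closest to 1/3
print(n_green / n_caught)  # what sane people would do
```

The value of t that maximizes the likelihood coincides with the simple fraction, which is why both approaches give the same answer in this easy case.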

Fishing in a complex dark lake.
Transcript quantification using RNA-seq is like fishing in a dark lake with fragmented fish.
We are also forced to determine the possible origin(s) of the fish fragments.
[Figure caption: Only lost an eye and a fin, but not its tail]

Estimating relative transcript abundances
[Figure: Target (Transcript 1 with abundance α1, Transcript 2 with abundance α2) → Fragmentation → Sequencing → Observation (reads >s1 ATCGTAGGGTA, >s2 ATGGCCTAGGT) → Read mapping]
Which values of α1 and α2 give the highest probability of observing these reads? (α1 + α2 = 1)

Maximum likelihood alignments
• The likelihood of our observation (ʎ) corresponds to the product of the probabilities of observing each of the individual mapped reads (rj) in our set (R):
ʎ = Π P(rj), taken over all rj in R

Probability of observing a read
• The probability of observing a read rj is the sum of the individual probabilities that the read originates from each transcript (t) in our transcript set (T):
P(rj) = Σ P(rj originated from transcript t), taken over all t in T

Component 1: Compatibility
Does read j map to transcript t?
Example: Kj1 = 1 (t = 1), Kj2 = 1 (t = 2), Kj3 = 0 (t = 3)

Component 2: Sequencing a read from a specific transcript
Probability of “sequencing” a read from transcript t: the product of the relative expression level and the length of transcript t.

Component 2: Sequencing a read from a specific transcript
Why αt × lt and not just αt?
• Longer transcripts produce more fragments than shorter transcripts at equal expression levels.
[Figure: two transcripts of different length with α1 = α2; after fragmentation the longer transcript yields more fragments]

Component 2: example
l1 = 200; α1 = 0.3
l2 = 150; α2 = 0.2
l3 = 50; α3 = 0.5
Adjust for length, then normalize.
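
The two steps on this slide, worked out with the values given. The variable names are my own. The lengths and abundances are the ones listed above.

```python
lengths = {1: 200, 2: 150, 3: 50}   # l_t from the slide
alphas = {1: 0.3, 2: 0.2, 3: 0.5}   # alpha_t from the slide

# Adjust for length: weight each abundance by the transcript length.
weighted = {t: alphas[t] * lengths[t] for t in lengths}
# Normalize so that the values sum to 1.
total = sum(weighted.values())
read_fraction = {t: w / total for t, w in weighted.items()}

for t in sorted(read_fraction):
    print(t, round(read_fraction[t], 3))
# 1 0.522
# 2 0.261
# 3 0.217
```

Transcript 3 is the most abundant, yet it is expected to contribute the smallest share of the reads because it is so short.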

Component 3: Probability of originating from position q on transcript t
In the case of no bias: every possible position on the transcript is equally likely.
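
The formulas on these slides did not survive, so the following is only one possible way to write the three components together. It assumes that K_{jt} is the compatibility indicator, that the second component is the length-adjusted and normalized abundance, and that the no-bias position probability is uniform over the transcript length (1/l_t, ignoring edge effects of the fragment length). The likelihood is written as L here.

```latex
L = \prod_{r_j \in R} P(r_j),
\qquad
P(r_j) = \sum_{t \in T} K_{jt}
         \cdot \frac{\alpha_t \, l_t}{\sum_{u \in T} \alpha_u \, l_u}
         \cdot P(q \mid t),
\qquad
P(q \mid t) = \frac{1}{l_t} \quad \text{(no bias)}
```

Under these assumptions the length in the second component cancels against the 1/l_t of the third, so each compatible transcript contributes in proportion to its abundance α_t.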

Components: Fragment comes from a certain position of the transcript (I)
[Figure: occurrence of fragment lengths; labels: 300 nt, 200 nt (more likely)]

Components: Fragment comes from a certain position of the transcript (II)
Not all regions are equally covered.
[Figure axis: frequency]

Search for abundances that best explain the observed fragments
The method used to find the optimum differs per program. (Trapnell et al. 2010)
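
One common way to search for the optimum is expectation maximization. The sketch below is a generic textbook-style EM under simplifying assumptions (no positional bias, no fragment length model) and is not the implementation of any particular program. The function name and the toy reads are invented.

```python
def em_abundances(compat, lengths, n_iter=200):
    """Estimate relative abundances alpha_t by expectation maximization.

    compat[j][t] is 1 if read j maps to transcript t, else 0.
    lengths[t] is the length of transcript t.
    Assumes every read maps to at least one transcript.
    """
    n_t = len(lengths)
    alpha = [1.0 / n_t] * n_t
    for _ in range(n_iter):
        # Expected share of reads per transcript: alpha_t * l_t, normalized.
        weight = [a * l for a, l in zip(alpha, lengths)]
        total = sum(weight)
        theta = [w / total for w in weight]
        # E-step: split each read over its compatible transcripts.
        expected = [0.0] * n_t
        for row in compat:
            denom = sum(k * th / l for k, th, l in zip(row, theta, lengths))
            for t in range(n_t):
                expected[t] += row[t] * theta[t] / lengths[t] / denom
        # M-step: turn expected read counts back into abundances.
        per_length = [e / l for e, l in zip(expected, lengths)]
        norm = sum(per_length)
        alpha = [p / norm for p in per_length]
    return alpha

# Invented data: 6 reads unique to transcript 1, 2 unique to transcript 2,
# and 2 reads that map to both. Both transcripts are 200 nt long.
compat = [[1, 0]] * 6 + [[0, 1]] * 2 + [[1, 1]] * 2
alpha = em_abundances(compat, lengths=[200, 200])
print([round(a, 3) for a in alpha])  # -> [0.75, 0.25]
```

The two ambiguous reads end up being divided 3:1, following the evidence from the uniquely mapped reads, instead of being assigned at random or counted twice.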

• The statistical methods can also provide an indication of the uncertainty in the expression estimates.
• One of the sources of that uncertainty is reads that do not map uniquely.
[Figure: uncertainty in the expression estimate; axes: FPKM, occurrence]

Remember
• The statistical methods calculate the expression level of each transcript.
• The gene expression level can then be obtained by simply summing the expression levels of its isoforms.
Example: Isoform 1 RPKM = 6, Isoform 2 RPKM = 5, so Gene RPKM = 11.

Programs employing statistical models
• Cufflinks
  • Genome annotation based
    • FPKM values
  • Numerical method for finding the maximum likelihood optimum
• RSEM
  • de novo transcriptome
    • Counts and RPKM values
  • Expectation maximization for finding the optimum
• BitSeq
  • de novo transcriptome
    • Counts and RPKM values
  • Markov chain Monte Carlo for sampling from the posterior distribution.

RNA seq (III) Differential expression Edouard Severing

A typical heat stress experiment
[Figure: expression level of HSP38 under heat stress (convection) and control; single measurement versus many measurements]
Is this gene really important?

Sources of variation
Biological and technical.
[Figure labels: Sequencing, Treatment, Grind (N2), Convection, Freezer, RNA extraction, Procedure, Bioinformatics]

Determining expression variation
• Accurately determining the variation requires many biological samples (replicates).
• Unfortunately, in most cases we only have two or three replicates.
• Other methods are needed to approximate/model the variation.

Determining within-condition fragment count variation: modeling (DESeq/Cuffdiff)
Main assumption: the variance depends on the mean.
Objective: find a function that best describes the relationship between the mean and the variance. (Trapnell et al. 2012)
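
To illustrate what "finding a function" can look like, the sketch below fits a single parameter a in variance = mean + a × mean² by least squares. This parametric form is an assumption made for the example. The programs named on the slide use their own fitting procedures, and the replicate counts are invented.

```python
from statistics import mean, variance

def fit_dispersion(counts_per_gene):
    """Least-squares fit of a in: variance = mean + a * mean**2.

    counts_per_gene: one list of replicate counts per gene.
    """
    num = den = 0.0
    for counts in counts_per_gene:
        m, v = mean(counts), variance(counts)  # sample variance (n - 1)
        num += (v - m) * m**2
        den += m**4
    return num / den

def predicted_variance(m, a):
    """Variance predicted for a gene with mean count m."""
    return m + a * m**2

# Invented counts for four genes, three replicates each.
counts = [[8, 12, 10], [95, 110, 80], [480, 530, 610], [1900, 2300, 2100]]
a = fit_dispersion(counts)
print(a)
print(predicted_variance(100, a))
```

With the fitted function, a variance can be predicted for any gene from its mean alone, which is what makes two or three replicates workable.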

Building the within-condition fragment count distribution (DESeq)
At this stage DESeq and Cuffdiff have determined for each transcript/gene:
1. The within-condition fragment count mean
2. The within-condition fragment count variance
With these parameters a fragment count probability distribution can be determined.
DESeq uses a negative binomial (NB) distribution.
NB: the variance is larger than the mean.
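
A numerical check that a negative binomial distribution built from a mean and a variance indeed has a variance larger than its mean. The parametrization (variance = mean + dispersion × mean²) and the example values are assumptions for illustration.

```python
from math import exp, lgamma, log

def nb_pmf(k, mu, dispersion):
    """Negative binomial probability of count k, with mean mu and
    variance mu + dispersion * mu**2."""
    r = 1.0 / dispersion
    p = r / (r + mu)
    log_pmf = (lgamma(k + r) - lgamma(r) - lgamma(k + 1)
               + r * log(p) + k * log(1 - p))
    return exp(log_pmf)

mu, dispersion = 20.0, 0.1  # assumed values
ks = range(2000)
probs = [nb_pmf(k, mu, dispersion) for k in ks]
nb_mean = sum(k * p for k, p in zip(ks, probs))
nb_var = sum((k - nb_mean) ** 2 * p for k, p in zip(ks, probs))
print(round(nb_mean, 3), round(nb_var, 3))  # -> 20.0 60.0
```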

Building the within-condition fragment count distribution (Cuffdiff)
In addition to the NB distribution used by DESeq, Cuffdiff also takes the uncertainty in the assignment of fragments to transcripts into account.
The resulting count distribution is called a beta negative binomial (BNB) distribution.
This count distribution is in the end transformed to a distribution of FPKM values.

Differential gene expression
• Now that the count or FPKM distributions are known, we can start testing for significant differences in expression levels.
• There are several ways in which one can test whether the gene/transcript levels in two conditions are significantly different.
• DESeq: uses an exact test that is analogous to the Fisher exact test.
  • (The hypergeometric distribution is replaced by the NB distribution.)
• Cuffdiff: uses a t-test.
  • (We will look at that in the next slide.)

Testing for differential expression (Cuffdiff)
[Figure: expression under heat and control; log FC; log FC distribution]
Many measurements lead to many fold changes.

Cuffdiff p-value
The authors of Cufflinks state that the quantity T is approximately normally distributed.
[Figure: log FC distribution with −|T| marked; the tails are shaded red]
The unadjusted p-value of Cufflinks indicates the probability of observing a T that is more extreme than the current T (red areas).
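
If T is taken as given and assumed to follow a standard normal distribution, the red areas can be computed directly. How T itself is calculated is not shown on this slide, so the sketch starts from a value of T. The function name is my own.

```python
from math import erfc, sqrt

def two_sided_p_value(t_stat):
    """Probability of a standard normal value more extreme than |T|,
    counting both tails."""
    return erfc(abs(t_stat) / sqrt(2))

print(round(two_sided_p_value(1.96), 3))  # -> 0.05
print(two_sided_p_value(0.0))             # -> 1.0
```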

• samtools view
• SRR019209.14131084 16 Chr1 3695 50 36M * 0 0 AGCGAGATCGACGGCGAAGCTCTTTACCCGCT%%"&'"(&&#'$&$)+$#, %83%0&1250'III+$' AS:i:-4 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:34G0A0 YT:Z:UU XS:A:+ NH:i:1 Sw accepted_hits.bam
• /mnt/geninf15/work/course_2013/day2/mapped_reads