RNA-Seq Xiaole Shirley Liu STAT 115/215, BIO/BST 282

Transcription and Splicing

RNA-seq Protocol
Martin and Wang, Nat. Rev. Genet. (2011)

Priming Strategies for cDNA Synthesis

Why RNA-seq, not Microarray?
• No need to know the genome sequence or predict genes
• No need to design microarray probes
• Digital representation
• Higher detection range
• New genes
• Alternative splicing
• Mutations and gene fusion

Experimental Design
• Poly-A selection (mRNA after splicing, enriches for exons)
• Ribo-minus (removes overly abundant ribosomal RNA)
• Strand-specific (directionality of lncRNA)
• Sequencing: what does $500 / sample mean?
  – SE or PE: PE for splicing and novel transcripts
  – Depth: 30-50 M reads for differential expression, deeper for transcript assembly
  – Read length: longer for transcript assembly and mutation calls

Experimental Design
• Assessing biological variation requires biological replicates (no need for technical replicates)
• 3 preferred, 2 OK, 1 only for exploratory assays (not good for publications)
• Batch effects still exist; try to be consistent or process all samples at the same time
• Better technology never eliminates the need for good experimental design

Batch Effect
• Striking finding in 2014: “Human heart is more similar to human brain than mouse brain is”?
• [Figure: clustering of human heart, mouse brain, and human brain samples]
Lin et al., PNAS 2014

Batch Effect
• 1st batch: human tissues
• 2nd batch: human tissues
• 3rd batch: mouse tissues
• 4th batch: mouse tissues
• 5th batch: human/mouse tissues
• After batch effect removal, samples cluster by tissue

RNA-seq Analysis

Alignment
• Prefer splice-aware aligners
• TopHat, STAR… (BWA is a general-purpose aligner and is not splice-aware)
• Sometimes need to trim the beginning bases

Quality Control: RSeQC
Read qualities

Quality Control: RSeQC
Nucleotide compositions

Quality Control: RSeQC
Read count distributions across genes

Quality Control: RSeQC
Insert size distribution and splicing junctions
[Figure: paired-end read and insert size]

Expression Index
• RPKM (Reads Per Kilobase of transcript per Million reads of library)
  – Divide the gene's read count by (total reads / 1 M), then by gene length in kb
  – Corrects for coverage, gene length
  – TopHat / Cufflinks
• FPKM (Fragments), for PE libraries, ~RPKM/2
• TPM (Transcripts Per Million)
  – Divide read count by gene length in kb (RPK), then divide by a scaling factor (sum of RPK across all genes / 1 M)
  – Proportion of reads mapped to a gene in each sample is comparable
  – RSEM
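The two expression indices can be sketched in a few lines of plain Python. This is a minimal illustration that follows the definitions on the slide, not the implementation used by Cufflinks or RSEM. The function names and the toy counts and lengths are made up for the example.

```python
def rpkm(counts, lengths_bp):
    """RPKM per gene: read count / (total reads in millions) / (gene length in kb)."""
    total_millions = sum(counts) / 1e6
    return [c / total_millions / (length / 1e3)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """TPM per gene: reads per kilobase (RPK), rescaled so all genes sum to 1e6."""
    rpk = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]


# Toy data (hypothetical): three genes in one sample.
counts = [100, 400, 500]          # reads mapped to each gene
lengths_bp = [1000, 4000, 2000]   # gene lengths in base pairs

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print("RPKM:", rpkm_values)
print("TPM: ", tpm_values)
print("TPM sum:", sum(tpm_values))
```

Genes 1 and 2 have the same read density (100 reads per kb), so they receive the same RPKM and the same TPM. TPM values always sum to one million within a sample, which is what makes the per-gene proportions comparable across samples.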

mRNAs to RNA-seq Fragments
Kij = count of fragments aligned to gene i in sample j, which is proportional to:
• expression of the RNA
• length of the gene
• sequencing depth
• library preparation factors (PCR)
• in silico factors (alignment)
• ...

Differential Expression

Statistical Power of Detecting Differential Expression
• Range of count
  – Expression, sequencing depth, gene length
  – [Figure: Gene A (1 kb) vs. Gene B (8 kb)]
• Sample size
• Dispersion
• True fold change
• Cufflinks: RPKM/FPKM
• LIMMA-VOOM and DESeq: read count

Raw Counts vs. Normalized Counts
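One common way to go from raw to normalized counts is the median-of-ratios size factor introduced with DESeq. The sketch below is a simplified pure-Python version for illustration only, and the real package handles many more details. The function name and the toy count matrix are assumptions made for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts[i][j] is the raw count of gene i in sample j.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    usable_rows = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples
                     for row in usable_rows]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable_rows, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],     # skipped when estimating size factors
]

factors = size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
print("size factors:", factors)
print("normalized counts:", normalized)
```

After dividing by the size factors, the first three genes have equal normalized counts in both samples, as expected when the only difference between the samples is sequencing depth.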

Sequencing Read Distribution
• The number of patients arriving in an emergency room between 10 and 11 pm
• # reads mapped to a gene 1 kb long
• Poisson distribution
  – λ: average events per interval
  – K: # events in an interval
  – Var = mean = λ
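The property Var = mean = λ can be checked numerically from the Poisson probability mass function, P(K = k) = e^(-λ) λ^k / k!. The snippet below is a small illustration. The value λ = 4 and the truncation at k = 59 are arbitrary choices for the example.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the average is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 4.0
ks = range(60)  # the tail beyond k = 59 is negligible for lam = 4

poisson_mean = sum(k * poisson_pmf(k, lam) for k in ks)
poisson_var = sum((k - poisson_mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print("mean:", poisson_mean)
print("variance:", poisson_var)
```

Both printed values agree with λ up to floating-point error.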

Sequencing Read Distribution
• In reality, sequencing data is over-dispersed
  – (Mean < Variance)
• Negative binomial
  – Def: # of successes before r failures occur, if the probability of each success is p
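Under the definition above, the negative binomial has mean rp/(1-p) and variance rp/(1-p)^2, so the variance equals mean + mean^2/r and always exceeds the mean. The sketch below checks this numerically. The values r = 5 and p = 0.5 are arbitrary illustrative choices, and `math.comb` requires Python 3.8 or later.

```python
import math


def nb_pmf(k, r, p):
    """P(k successes before the r-th failure), with success probability p."""
    return math.comb(k + r - 1, k) * p ** k * (1 - p) ** r


r, p = 5, 0.5
ks = range(200)  # the tail beyond k = 199 is negligible for these parameters

nb_mean = sum(k * nb_pmf(k, r, p) for k in ks)
nb_var = sum((k - nb_mean) ** 2 * nb_pmf(k, r, p) for k in ks)

print("mean:", nb_mean)                        # close to r*p/(1-p)
print("variance:", nb_var)                     # close to r*p/(1-p)**2
print("mean + mean^2/r:", nb_mean + nb_mean ** 2 / r)
```

A Poisson distribution with the same mean would have variance equal to that mean. The extra mean^2/r term is the over-dispersion, and 1/r plays the role of the dispersion parameter.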

Modeling Read Over-Dispersion
• Variance estimated by borrowing information from all the genes
  – Hierarchical models
• Test whether μi is the same for gene i between samples j
• FDR?
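Because one test is run per gene, the resulting p-values need multiple-testing correction. A widely used choice for controlling the FDR is the Benjamini-Hochberg procedure. The slide does not name a specific method, so treat this as one reasonable option. The function name and the toy p-values are made up for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values (hypothetical), one per gene.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
qvals = benjamini_hochberg(pvals)
significant = [q < 0.05 for q in qvals]
print("adjusted p-values:", qvals)
print("significant at FDR 0.05:", significant)
```

In this toy example the third and fourth genes have raw p-values below 0.05 but adjusted values slightly above it, so only the first two genes pass at an FDR of 0.05.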

Shrinkage of Fold Changes for RNA-seq
• Noisy estimates due to low counts
• Large FDR from the statistical model, but we shouldn't trust the estimate itself
• Shrinkage is not equal
  – Strong moderation for low-information genes: low counts
  – Almost no shrinkage for high-information genes
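The unequal effect of shrinkage can be illustrated with a deliberately simple stand-in: adding a pseudocount before taking the log ratio. This is not the empirical Bayes shrinkage used by tools such as DESeq2. It only mimics the qualitative behavior of strong moderation at low counts and almost none at high counts. The pseudocount of 5 and the toy counts are arbitrary assumptions.

```python
import math


def log2_fold_change(count_a, count_b, pseudocount=0.0):
    """log2(B / A), optionally moderated by adding a pseudocount to both counts."""
    return math.log2((count_b + pseudocount) / (count_a + pseudocount))


# Two hypothetical genes with the same raw 4-fold change.
low_gene = (2, 8)          # low counts: little information
high_gene = (2000, 8000)   # high counts: much information

raw_low = log2_fold_change(*low_gene)
raw_high = log2_fold_change(*high_gene)
shrunk_low = log2_fold_change(*low_gene, pseudocount=5.0)
shrunk_high = log2_fold_change(*high_gene, pseudocount=5.0)

print("low-count gene:  raw", raw_low, "moderated", shrunk_low)
print("high-count gene: raw", raw_high, "moderated", shrunk_high)
```

Both genes start at a raw log2 fold change of 2. After moderation, the low-count gene is pulled to below 1, while the high-count gene barely moves.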

Splicing Transcripts
• Assign reads to splice isoforms (TopHat)

Transcript Assembly
• Reference-based assembly: Cufflinks
• De novo assembly: Trinity

Isoform Inference
• If given a known set of isoforms
• Estimate x (isoform abundances) to maximize the likelihood of observing n (read counts)
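With a known isoform set, the maximum-likelihood estimate is often found with an EM algorithm, because reads from shared exons are compatible with more than one isoform. The sketch below is a toy version under strong simplifying assumptions: it ignores isoform length and positional bias, and it estimates only the fraction of reads coming from each isoform. The function name and the read counts are hypothetical.

```python
def em_isoform_fractions(read_classes, n_isoforms, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    read_classes is a list of (compatible_isoforms, read_count) pairs,
    where compatible_isoforms is a tuple of isoform indices.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    total_reads = sum(count for _, count in read_classes)
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: split each read class among its compatible isoforms
        # in proportion to the current abundance estimates.
        for compatible, count in read_classes:
            denom = sum(theta[i] for i in compatible)
            for i in compatible:
                expected[i] += count * theta[i] / denom
        # M-step: re-estimate fractions from the expected assignments.
        theta = [e / total_reads for e in expected]
    return theta


# Toy gene with two isoforms (hypothetical counts).
read_classes = [
    ((0, 1), 60),  # reads on exons shared by both isoforms
    ((0,), 30),    # reads on an exon unique to isoform 0
    ((1,), 10),    # reads on an exon unique to isoform 1
]

theta = em_isoform_fractions(read_classes, n_isoforms=2)
print("estimated read fractions:", theta)
```

In this example the uniquely assignable reads split 30 to 10, so the estimate converges to about 0.75 for isoform 0 and 0.25 for isoform 1. Converting read fractions into transcript abundances would additionally require dividing by isoform length.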

Known Isoform Abundance Inference

Identification of Differential Splicing Between RNA-seq Samples
• Most differential splicing detection algorithms call differentially expressed exons, not whole transcripts, especially for novel splicing

Splicing Isoform Inference
• With a known isoform set, gene-level expression inference is sometimes great, although isoform abundances might be uncertain (e.g. if the known set is incomplete)
• De novo methods are usually better at detecting differential exon splicing, but not whole transcripts
• De novo isoform inference is a non-identifiable problem if RNA-seq reads are short and the gene is long with too many exons
• Experimental validation of quantitative differential splicing is still quite hard

Active Field
• HISAT2 for fast alignment
  – Hierarchical index
  – https://ccb.jhu.edu/software/hisat2/index.shtml
• Kallisto and Sleuth
  – Kallisto: TPM; Sleuth: differential expression
  – Known genes and transcripts
  – https://scilifelab.github.io/courses/rnaseq/labs/kallisto

Summary
• RNA-seq design considerations
• Read mapping: BWA, STAR
• Quality control: RSeQC
• Expression index: R/FPKM and TPM
• Differential expression: LIMMA-VOOM and DESeq
• Transcriptome assembly: Cufflinks, Trinity
• Alternative splicing: rMATS
• New developments: HISAT2, Kallisto and Sleuth

Acknowledgement
• Wei Li
• Michael Love
• Alisha Holloway
• Simon Andrews
• Radhika Khetani