High-Throughput Sequencing Advanced Microarray Analysis BIOS 691-803, 2008 Dr. Mark Reimers, VCU

Quantitative HTS - Outline • Technology • Preprocessing • Quantitative analysis • Applications – ChIP-Seq – RNA-Seq – Methyl-Seq

The Technology • Most sequencing proceeds by addition of fluor-labeled bases • Do this in parallel on a flat surface • Capture each stage with good camera • Align images

Roche - 454 • Parallel Pyrosequencing on beads

Mardis, Trends in Ge

454 Sequencing Operation

Illumina - Solexa

ABI SOLiD • Resequencing each fragment with different primers • Reconstruct each fragment separately

Paired-End Reads

Issues • Pre-processing – Base calling – Mapping reads – QA • Quantitative analysis – Variation and noise – Biases – Models – Accuracy and validation

Pre-processing – Base Calling • Not all steps completed properly • Sequence can lag behind or skip ahead • Hence most light spots a mixture of different colors • Simple rule: use brightest signal
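
The "use brightest signal" rule can be sketched in a few lines. This is only an illustration: the per-cycle intensity dictionaries and the function names `call_base` and `call_read` are assumptions, not the format or API of any vendor's pipeline. The purity score (brightest channel over total) is an added convenience for spotting the mixed-color spots mentioned above.

```python
def call_base(intensities):
    """Pick the base with the brightest signal for one cycle.

    `intensities` is a hypothetical dict such as {"A": 900.0, "C": 50.0, ...}.
    Returns (base, purity), where purity is the brightest channel's share
    of the total signal (1.0 = a clean spot, near 0.25 = a four-way mixture).
    """
    base = max(intensities, key=intensities.get)
    total = sum(intensities.values())
    purity = intensities[base] / total if total > 0 else 0.0
    return base, purity


def call_read(cycles):
    """Call one base per cycle and join the calls into a read."""
    return "".join(call_base(c)[0] for c in cycles)


# Toy example: three cycles; the third is a noisy mixture.
cycles = [
    {"A": 900.0, "C": 50.0, "G": 30.0, "T": 20.0},
    {"A": 100.0, "C": 120.0, "G": 700.0, "T": 80.0},
    {"A": 300.0, "C": 280.0, "G": 60.0, "T": 360.0},
]
read = call_read(cycles)
```

Low purity late in a read is one simple sign of the lag and skip-ahead effects listed on this slide.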

Types of mismatches in uniquely mapped tags with a single mismatch are profoundly asymmetric and biased Courtesy Thierry-Mieg

Typical Errors in Base-Calling

Position of single mismatch in uniquely mapped tags Courtesy Thierry-Mieg

Improving Base-Calling with SVM

Pre-processing – Mapping Reads • Huge numbers (10 M – 70 M) • BLAT (2002 high-speed method) • Eland (proprietary Illumina) • Other new methods: MAQ, SOAP

Quality Assessment • Fraction of reads mapping to targets • Typically 5-10 M reads per lane and 60-80% map to targets • Some repetitive sequence

Comparing Samples - A Simple Normalization • Different numbers of counts per lane • Divide counts in a region of interest (a genomic region or a gene or an exon) by all counts (total per million reads, TPM) • For comparing genomic regions of different lengths, divide also by length of region: TPKM (total per kilobase per million)
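
As a worked example of the two scalings on this slide, here is a minimal sketch. The function names `per_million` and `per_kilobase_per_million` are made up for illustration.

```python
def per_million(region_count, total_reads):
    """Counts in a region scaled to a lane of one million reads."""
    return region_count * 1e6 / total_reads


def per_kilobase_per_million(region_count, region_length_bp, total_reads):
    """Per-million value further divided by region length in kilobases."""
    return per_million(region_count, total_reads) * 1e3 / region_length_bp


# Toy example: 500 reads in a 2,000 bp region, from a lane of 10 M reads.
tpm = per_million(500, 10_000_000)
tpkm = per_kilobase_per_million(500, 2000, 10_000_000)
```

With these toy numbers the region has 50 counts per million reads and 25 per kilobase per million.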

Quant. Analysis - Variation • Poisson model often used for random variation • Most HTS data ‘overdispersed’ relative to Poisson • Negative Binomial often used – Parameter fitted
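
One simple way to see overdispersion is a method-of-moments fit, assuming the common negative binomial parameterization var = mu + phi * mu^2 (phi = 0 recovers the Poisson). The slide does not say how the parameter is fitted, so this is an illustrative assumption, and `nb_dispersion` is a hypothetical name.

```python
from statistics import mean, variance


def nb_dispersion(counts):
    """Method-of-moments estimate of phi in var = mu + phi * mu**2.

    Needs at least two replicate counts. Returns 0.0 when the sample
    variance does not exceed the mean (no evidence of overdispersion).
    """
    mu = mean(counts)
    var = variance(counts)  # unbiased sample variance
    if mu == 0:
        return 0.0
    return max(0.0, (var - mu) / mu ** 2)


# Toy replicate counts for one gene across four lanes.
phi = nb_dispersion([10, 20, 30, 40])
```

With only a handful of replicates such an estimate is very noisy, so in practice the dispersion is usually estimated jointly across many genes.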

Quantitative Analysis - Biases • Not all regions represented equally • GC rich regions represented more • Independent of GC some chromosome regions represented more – Euchromatin bias • Sequence initiation site biases • ‘Mapability’ biases – some regions won’t have any uniquely mapped tags

GC Bias • Density of reads depends strongly on GC content of regions [Figure axis: GC content (%)]
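
To look for this bias, the GC content of each genomic window can be tabulated next to its read count. A minimal sketch of the GC side is below. The window size and the function names are illustrative assumptions.

```python
def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window):
    """GC fraction of consecutive non-overlapping windows along seq."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]


# Toy example with 4-base windows.
gc_profile = windowed_gc("GGCCAATTGCAT", 4)
```

Plotting read counts per window against these fractions gives the kind of GC dependence this slide describes.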

Genomic Position Biases • Count tags from randomly sheared DNA in red with GC content in blue

Start Position Bias

Consistent Start Position Bias Counts per start site in lane 1 vs lane 2

RNA-Seq

RNA-Seq Data • Figure panels: Gene Model, Kidney Reads, Liver Reads • From Marioni et al 2008

Accuracy of Illumina RNA-Seq

Comparing RNA-Seq & Affy: Issues • How replicable is RNA-Seq? • How consistent are the two technologies? • Which is better? • Marioni et al, Genome Research, 2008

Comparing Fold-Changes • DE by ILM • Red: >250 • Green: <250 • Black: not DE by ILM

Model for Variation • Poisson counts hypergeometric comparison • Make uniform p-values by adding random term – Use lower tails only
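
One reading of this slide, sketched under stated assumptions: given the total reads in two lanes and a gene's combined count, the gene's count in lane 1 follows a hypergeometric distribution if nothing differs between lanes. Because counts are discrete, the lower-tail p-value is made uniform by adding a random fraction of the probability of the observed count. This is an illustration, not necessarily the exact procedure of Marioni et al, and the function names are hypothetical.

```python
import random
from math import comb


def hypergeom_pmf(x, N, K, n):
    """P(X = x): x gene reads in lane 1, given N reads in both lanes,
    K gene reads in both lanes, and n reads in lane 1."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)


def randomized_lower_p(x, N, K, n, rng=random):
    """Lower-tail p-value P(X < x) + U * P(X = x), with U ~ Uniform(0, 1).

    Under the null hypothesis this is exactly uniform on (0, 1),
    which is what a QQ-plot of p-values assumes.
    """
    lowest = max(0, n - (N - K))  # smallest possible value of X
    below = sum(hypergeom_pmf(i, N, K, n) for i in range(lowest, x))
    return below + rng.random() * hypergeom_pmf(x, N, K, n)


# Toy numbers only; real lanes have millions of reads, where an
# approximation would be far cheaper than exact binomial coefficients.
p = randomized_lower_p(5, N=20, K=10, n=10, rng=random.Random(0))
```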

False Positive Rates • QQ-plots of p-values between tech. reps

Different Concentrations are NOT Comparable! • QQ-plots of p-values between 3 pM and 1.5 pM

Normalization of RNA-Seq • Robinson et al noticed that most genes appeared less expressed in liver • Fig 1 from Robinson & Oshlack, Genome Biology 2010

A Better Normalization for RNA-Seq - TMM • Drop extremes of ratios • Drop very high count genes • Compute trimmed means of samples • Center log-ratios between samples
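
The steps above can be sketched as a simplified, unweighted trimmed mean of log-ratios. This is an assumption-laden illustration: the published TMM method also weights genes by precision, and the trim fractions (30% of log-ratios, 5% of average abundances) and the name `tmm_factor` are choices made here, not taken from the slide.

```python
from math import log2


def tmm_factor(sample, ref, trim_m=0.3, trim_a=0.05):
    """Simplified TMM-style scaling factor of `sample` relative to `ref`.

    sample, ref: per-gene read counts in the same gene order.
    Trims genes with extreme log-ratios (M) and extreme average
    log-abundance (A) from both ends, then averages the remaining M values.
    """
    n_s, n_r = sum(sample), sum(ref)
    genes = []  # (M, A) for genes observed in both samples
    for ys, yr in zip(sample, ref):
        if ys > 0 and yr > 0:
            ps, pr = ys / n_s, yr / n_r
            genes.append((log2(ps / pr), 0.5 * log2(ps * pr)))

    def keep(values, trim):
        # Indices left after cutting `trim` of the genes from each end.
        order = sorted(range(len(values)), key=values.__getitem__)
        cut = int(len(values) * trim)
        return set(order[cut:len(values) - cut])

    kept = keep([m for m, _ in genes], trim_m) & keep([a for _, a in genes], trim_a)
    if not kept:
        return 1.0
    mean_m = sum(genes[i][0] for i in kept) / len(kept)
    return 2 ** mean_m


# Toy example: one hugely expressed gene inflates the sample's library size.
ref = [100] * 10
sample = [100] * 9 + [10000]
factor = tmm_factor(sample, ref)
```

In the toy example the factor is about 0.092, so the effective library size (10,900 times the factor) comes back to about 1,000, matching the reference. That is the correction the Robinson & Oshlack liver example calls for.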

New Things to do with RNA-Seq • Allele-specific expression • Splice variation – Between tissues – In disease • Alternate initiation sites – Select 5’ capped RNA fragments • Alternate termination

Allelic Comparison • It is possible to compare allele-specific expression counts • Sample from VCU • Replicate samples • P-values for binomial tests of equality • About half show differential expression!
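
A binomial test of equality for one heterozygous site can be sketched as follows, assuming reads from the two alleles are independent with probability 0.5 each under the null. The function name is hypothetical. The two-sided p-value sums the probabilities of all outcomes no more likely than the observed one.

```python
from math import comb


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k reference-allele reads of n."""
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Toy examples: a balanced site and a completely one-sided site.
p_balanced = binom_two_sided_p(5, 10)
p_skewed = binom_two_sided_p(0, 10)
```

If read counts are overdispersed relative to the binomial, or if mapping favors the reference allele, such p-values will be too small. That is worth keeping in mind when about half the sites appear differentially expressed.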

Detecting Splice Variation • Deep sequencing shows up clear variation in exon usage • Wang et al Nature 2008

Tissue Map of Splice Variation From Wang et al • Brain is most distinctive • Individuals seem to differ • Cell lines seem to have distinct splice patterns

Splicing is Complex • Many different splice operations exist • Only some of these characterized by counting exon reads

Issues in Detecting Splice Variants • Counts in exons reflect biases (as yet uncharacterized) as well as actual abundance • Reads that bridge splice junctions would be definitive but mapping is very dubious with short (<40 base) reads • All possible splice junctions are not known – Hard to even search through the known ones

Methodology for Splice Variants • Count reads mapped to exons and compare ratios across samples – Wang et al, and most others • Count reads that cross splice junctions

Methodology for Finding Junctions

ChIP-Seq

Chromatin Immuno-precipitation

ChIP-Seq Workflow • Cross-link proteins to DNA • Fragment DNA • Extract with antibody • Reverse cross-links • Sequence fragments • DO CONTROLS!

ChIP-Seq Data • From Rozowsky et al, Nature Biotech 2009

ChIP-Seq vs ChIP-chip

Peak-Finding - Simple • Extend tags and count overlap • How much to extend?
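
A minimal sketch of "extend tags and count overlap", assuming each tag is given as a 5' start position plus strand and that every tag is extended by the same hypothetical `extend` length in its direction of sequencing. The function name and the tag format are illustrative.

```python
def coverage(tags, extend, length):
    """Per-base count of extended tags over a region of `length` bases.

    tags: list of (start, strand). "+" tags extend rightward from start,
    "-" tags extend leftward from start (their 5' end).
    """
    diff = [0] * (length + 1)  # difference array: +1 at start, -1 past the end
    for start, strand in tags:
        if strand == "+":
            lo, hi = start, start + extend
        else:
            lo, hi = start - extend + 1, start + 1
        lo, hi = max(lo, 0), min(hi, length)
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    cov, running = [], 0
    for d in diff[:length]:
        running += d
        cov.append(running)
    return cov


# Toy example: two "+" tags and two "-" tags flanking one binding site.
tags = [(10, "+"), (12, "+"), (29, "-"), (31, "-")]
cov = coverage(tags, extend=20, length=50)
peak_height = max(cov)
peak_start = cov.index(peak_height)
```

The answer to "how much to extend?" matters: too short and the two strands never overlap, too long and neighboring sites merge.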

Peak Finding – Better • Tags starting on opposite strands are likely to start at opposite ends • Identifying the cross-over point leads to improved accuracy

The Value of Controls: ChIP vs. Control Reads • Red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation

Cause of Variation in Read Density • In study of FoxA1 binding, even control reads enriched near FoxA1 binding site! • Probably due to open chromatin near FoxA1 binding site • Figure: Density of control channel reads around FoxA1 site, courtesy Shirley Liu

ChIP-Seq – MACS Key Ideas • Smart peak imputation estimate – Uses read directions – Empirical estimate of fragment length • Local frequency estimate – Using control, if available – Using wide estimate, otherwise – Not using sequence
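
The local frequency idea can be sketched as a Poisson test against the largest of several background rates. This is written in the spirit of MACS, not as its actual code. The window sizes, the assumption that control and ChIP lanes have equal depth, and the names `poisson_sf` and `local_lambda` are all illustrative.

```python
from math import exp


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    term = exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def local_lambda(genome_rate, control_counts, peak_width):
    """Expected background tags in a candidate peak of `peak_width` bases.

    genome_rate: genome-wide background tags per base.
    control_counts: {window_size_bp: control tags in that window around the peak}.
    Taking the maximum makes the test conservative where the control
    is locally enriched, for example near open chromatin.
    """
    candidates = [genome_rate * peak_width]
    for window, count in control_counts.items():
        candidates.append(count * peak_width / window)
    return max(candidates)


# Toy example: a 200 bp candidate peak holding 15 ChIP tags.
lam = local_lambda(0.01, {1000: 20, 10000: 50}, 200)
p_value = poisson_sf(15, lam)
```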

Read Lengths and Directions • Some clear clusters – even before stats • Reads on opposite sides of peak map to opposite strands – Hence fragments have opposite directions • Can estimate apparent fragment length
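
The apparent fragment length can be estimated as the shift that best lines up "+" strand tag starts with "-" strand tag starts. Below is a brute-force sketch that assumes tag 5' positions are already separated by strand. The function name and the scoring by exact position matches are simplifications chosen for illustration.

```python
from collections import Counter


def estimate_shift(plus_starts, minus_starts, max_shift):
    """Shift d (0..max_shift) maximizing matches between plus+d and minus."""
    minus = Counter(minus_starts)
    best_shift, best_score = 0, -1
    for d in range(max_shift + 1):
        # Counter returns 0 for absent positions, so no key checks are needed.
        score = sum(minus[p + d] for p in plus_starts)
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift


# Toy example: three tag pairs 99 bases apart plus one unrelated pair.
plus = [100, 105, 110, 300]
minus = [199, 204, 209, 450]
shift = estimate_shift(plus, minus, max_shift=200)
```

With real data one would count matches within a small tolerance or cross-correlate smoothed tag densities. Shifting each strand's tags by half the estimated distance toward the other strand then locates the cross-over point.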

Fragment Lengths • Puzzle: Fragments from sonication expected to be between 200 and 500 bp • Estimated fragment size ~100 bp • Shirley Liu’s explanation: preferential cutting near the TF??

Comparison to ChIP-chip • Broad correlation • Not dramatic improvement in precision!

Methyl-Seq

Methylation Assays • Affinity purification: e.g. MeDIP-Seq (methylated DNA immunoprecipitation) • Methylation-specific cleavage by endonucleases – e.g. Methyl-Seq: Cleaves with HpaII to identify unmethylated CCGG sites • Bisulphite conversion – WGBS (Whole-Genome Bisulphite Sequencing) – RRBS (Reduced Representation Bisulphite Sequencing) • Cleaves with MspI to reduce complexity

Affinity: MeDIP-Seq & MBD-Seq

Issues with Affinity Methods • Analysis essentially like ChIP-Seq • BUT: Sequence count reflects both density of CpG’s and proportions of methylation • No individual CpG-level information • Advantages: no conversion so sequence tags are easily mappable

Methyl-Seq • Use HpaII to cleave only at unmethylated CCGG sites • Size-select fragments (50-300) • Sequence fragment ends – Always starting at a CCGG • Easy to map – few possible loci (<1 M) • Paired ends give actual fragment
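
The "few possible loci" point can be illustrated with an in-silico digest: list every CCGG site and keep the fragments between adjacent sites that pass size selection. The sketch assumes complete digestion of fully unmethylated DNA and that the 50-300 range is in base pairs. The function names are hypothetical.

```python
def ccgg_sites(seq):
    """Start positions of every CCGG occurrence in seq (0-based)."""
    sites = []
    i = seq.find("CCGG")
    while i != -1:
        sites.append(i)
        i = seq.find("CCGG", i + 1)
    return sites


def size_selected_fragments(seq, min_len=50, max_len=300):
    """(start, end) pairs of adjacent CCGG sites whose spacing passes selection.

    The enzyme cuts at the same offset inside every site, so the distance
    between adjacent sites equals the fragment length.
    """
    sites = ccgg_sites(seq.upper())
    return [(a, b) for a, b in zip(sites, sites[1:]) if min_len <= b - a <= max_len]


# Toy sequence: sites at 0, 100 and 114; only the first gap is long enough.
toy = "CCGG" + "A" * 96 + "CCGG" + "A" * 10 + "CCGG"
fragments = size_selected_fragments(toy)
```

Reads should then start only at the ends of fragments in this list, which is what keeps the mapping problem small.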

Schematic Here

Issues for Methyl-Seq • Computational problem to re-assemble actual proportions of methylation at each locus from counts • Prone to false positives because of incomplete digestion (for reasons other than methylation of CCGG site) – e.g. insufficient time … – rates vary by 50-fold depending on sequence context

WGBS • Bisulphite conversion, fragmentation and shotgun sequencing • Requires very many reads! • Use of capture arrays reduces work… BUT different sequences have different capture efficiencies!

WGBS Data (from capture array) • top, CHP-SKN-1; bottom, MDA-MB-231 • NB. Inconsistent tag numbers

Issues with WGBS • Lose many C’s • Hard to map to genome • Strategy depends on less penalty for mapping T to C • Too many loci!
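
One way around the T-to-C mapping problem, sketched here as an alternative to the reduced-penalty strategy on the slide, is to convert every C to T in both read and reference before matching, then recover methylation calls from the original read. The sketch handles forward-strand reads with exact matches only, and all function names are illustrative.

```python
def c_to_t(seq):
    """In-silico bisulphite conversion: every C becomes T."""
    return seq.upper().replace("C", "T")


def map_bisulfite_read(read, reference):
    """All positions where the converted read matches the converted reference."""
    conv_ref, conv_read = c_to_t(reference), c_to_t(read)
    hits = []
    i = conv_ref.find(conv_read)
    while i != -1:
        hits.append(i)
        i = conv_ref.find(conv_read, i + 1)
    return hits


def methylation_calls(read, reference, pos):
    """{reference position: True if methylated} for each reference C under the read.

    A C that survives in the read was protected (methylated);
    a T at a reference C was converted (unmethylated).
    """
    calls = {}
    window = reference.upper()[pos:pos + len(read)]
    for offset, (r, g) in enumerate(zip(read.upper(), window)):
        if g == "C" and r in "CT":
            calls[pos + offset] = (r == "C")
    return calls


# Toy example: the C at reference position 8 was converted, 5 and 9 were not.
reference = "GATTACGACCGTAGCA"
read = "ACGATCGTA"
hits = map_bisulfite_read(read, reference)
calls = methylation_calls(read, reference, hits[0])
```

Conversion shrinks the alphabet to three letters, so more reads map to several places. This is one form of the "hard to map" issue above.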

RRBS • Too many methylation sites in genome • Cleave with MspI and size select in order to reduce number of fragments • Convert C to T with bisulphite (not mC) • Then sequence fragments • 1.4 M fragments

Issues with RRBS • Fairly broad but not complete coverage of ‘interesting’ regions of genome • Bisulphite conversion of limited regions means mapping is fairly easy • Bisulphite conversion not always complete

Meta-Genomics

What is Meta-genomics? • Sequencing random fragments of DNA from all microbial denizens of a community (and traces of a few others) • Sometimes broadly used for surveys of microbial diversity based on sequencing all 16S rRNA genes present

Kinds of Questions • What is out there? – Most microbial species not known • What metabolic fluxes in any environment? • What microbes associated with specific conditions? – Including disease or health • Human Microbiome Project

Environmental Meta-Genomics

Human Microbiome Project

Data Analysis Issues – 16S rRNA • Identification of microbes – most are unknown and un-culturable • Distinguishing errors in sequencing from novel microbes • Biases in sequencing

Data Analysis Issues - Metagenomics • Mapping and characterizing unknown protein sequences • Usually assume conservation • Full-coverage allows assembly of genomes • Counting • Biases probably smaller (Bork)