DNA Sequencing CS 273a 2016

DNA sequencing
How we obtain the sequence of nucleotides of a species
…ACGTGACTGAGGACCGTG CGACTGACTGGGT CTAGACTACGTTTTA TATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

Human Genome Sequencing – A Crazy Idea
1953: Franklin, Watson, Crick
1977: Sanger & colleagues
1985:
• Robert Sinsheimer, Chancellor, UC Santa Cruz
• Renato Dulbecco, Salk Institute
• Charles DeLisi, DOE
“DOE’s program for unemployed bombmakers”
Source: Stanford Encyclopedia of Philosophy

The Human Genome Project
1986: Discussions
1990: Launch
1996: Map first, sequence later
1997: Weber & Myers
1998: Celera
2000: Bill Clinton: “most important scientific discovery in the 20th century”
2001: Draft
2003: “Finished”
3 billion basepairs, $3 billion

Whole Genome Shotgun Sequencing
Genome Research, 1997
Gene Myers: “Let’s sequence the human genome with the shotgun strategy”
Phil Green: “That is impossible, and a bad idea anyway”

Human Genome Race 1999-2000
• 1998: Celera, Inc
• 1999: Celera Drosophila genome
• 2000: Venter, Myers and colleagues; Lander, Collins and colleagues
§ 5 years ahead of time, a computational achievement
§ Huge boost to sequencing technology development

Sequencing genomes in order to align them
• Human, 2001
• Mouse, 2002
• Rat, 2003
• Chicken, 2004
• Dog, Chimpanzee, 2005
• Many mammals & other vertebrates, 2006-today
• Genscan (1997)
§ HMM for gene prediction
• MUMmer, Glass (1999-2000)
§ Pairwise genome alignment
• Rosetta/Glass, Twinscan (2000-1)
§ Human/Mouse gene prediction
• MLAGAN, MAVID, BLASTZ (2003-4)
§ Multiple genome alignment
NHGRI FY 2005 Budget: $492 M. “One of the most powerful approaches for unlocking the secrets of the human genome is comparative genomics. […] The current NHGRI-supported, large-scale sequencing centers [… are …] expected to yield the equivalent of about 20 additional draft vertebrate genomes in just the next three years.”

Sequencing Growth
Cost of one human genome
• 2004: $30,000,000
• 2008: $100,000
• 2010: $10,000
• 2015: “$1,000”
• ???: $300

A Data-Acquisition Technology Explosion Enabled by Inexpensive Sequencing
• ChIP-seq for
§ Transcription factor binding
§ Nucleosome positioning
§ Histone Modifications
• RNAseq
• DNase-seq
• ATAC-seq
• Hi-C
• Bisulfite treatment for methylation
• MeDIP for methylation

Major Data Acquisition Efforts
Human Microbiome, Roadmap Epigenomics, 1000 Genomes

Sequencing Growth
Cost of one human genome
• 2004: $30,000,000
• 2008: $100,000
• 2010: $10,000
• 2015: “$1,000”
• ???: $300
How much would you pay for a smartphone?
Stephens ZD et al. PLoS Biology 2015

How soon will we all be sequenced?
Figure labels: Applications, Cost, Time; 2016? 2022?
• Learning from the data
• Roadblocks

Which representative of the species?
Which human?
Answer one:
Answer two: it doesn’t matter
Polymorphism rate: number of letter changes between two different members of a species
Humans: ~1/1,000
Other organisms have much higher polymorphism rates
§ Population size!

Why humans are so similar
A small population that interbred reduced the genetic variation
Out of Africa ~40,000 years ago
Heterozygosity: $H = 4Nu / (1 + 4Nu)$
$u \sim 10^{-8}$, $N \sim 10^{4}$, so $H \sim 4 \times 10^{-4}$
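
The heterozygosity estimate above can be checked numerically. This is a minimal sketch that plugs the slide's values of N and u into the slide's formula; the function and argument names are hypothetical, chosen only for illustration.

```python
def heterozygosity(n_eff: float, mu: float) -> float:
    """H = 4Nu / (1 + 4Nu), using the slide's symbols N (population size) and u (mutation rate)."""
    theta = 4.0 * n_eff * mu
    return theta / (1.0 + theta)


# Values from the slide: N ~ 10^4, u ~ 10^-8.
h = heterozygosity(n_eff=1e4, mu=1e-8)
print(f"H = {h:.2e}")
```

Because 4Nu is much smaller than 1 here, the denominator is close to 1 and H is approximately 4Nu.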

Ancient sequencing technology – Sanger Vectors
DNA, shake, DNA fragments
Vector: circular genome (bacterium, plasmid)
DNA fragment + vector, joined at a known location (restriction site)

Ancient sequencing technology – Sanger Gel Electrophoresis
1. Start at primer (restriction site)
2. Grow DNA chain
3. Include dideoxynucleoside (modified a, c, g, t)
4. Stops reaction at all possible points
5. Separate products by length, using gel electrophoresis

Illumina cluster concept
Slide credit: Arend Sidow

Cluster generation (‘bridge amplification’)
Slide credit: Arend Sidow

Clonally Amplified Molecules on Flow Cell
1µM
Slide credit: Arend Sidow

Sequencing by Synthesis, One Base at a Time
Figure: template strand (3’-…-5’) extended one base per cycle
Cycle 1: Add sequencing reagents; first base incorporated; remove unincorporated bases; detect signal
Cycle 2-n: Add sequencing reagents and repeat
Slide credit: Arend Sidow

Method to sequence longer regions
genomic segment
cut many times at random (Shotgun)
Get one or two reads (~900 bp each) from each segment

Two main assembly problems
• De Novo Assembly
• Resequencing

Reconstructing the Sequence (De Novo Assembly)
Cover region with high redundancy of reads
Overlap & extend reads to reconstruct the original genomic region

Definition of Coverage
Length of genomic segment: G
Number of reads: N
Length of each read: L
Definition: Coverage $C = NL/G$
How much coverage is enough?
Lander-Waterman model: Prob[not covered bp] = $e^{-C}$
Assuming uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides
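
The coverage definition and the Lander-Waterman estimate can be worked through in a few lines. This is a sketch under stated assumptions: the read length of 500 bp and the genome length are illustrative values, the function names are hypothetical, and the expected gap count N·e^(-C) is the standard idealized Lander-Waterman result (uniform reads, no minimum overlap requirement) rather than something stated on the slide.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """C = N * L / G."""
    return num_reads * read_len / genome_len


def prob_base_uncovered(c: float) -> float:
    """Lander-Waterman: probability that a given base is covered by no read."""
    return math.exp(-c)


def expected_gaps(num_reads: int, c: float) -> float:
    """Idealized expected number of gaps, N * e^(-C) (assumes uniformly placed reads)."""
    return num_reads * math.exp(-c)


# Illustrative numbers (assumed): 1 Mbp segment, 500 bp reads, 20,000 reads.
G, L, N = 1_000_000, 500, 20_000
C = coverage(N, L, G)
print(f"C = {C:.1f}")
print(f"P[base not covered] = {prob_base_uncovered(C):.2e}")
print(f"expected gaps in {G:,} bp = {expected_gaps(N, C):.2f}")
```

With these assumed values C is 10 and the expected number of gaps in the 1 Mbp segment comes out close to one.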

Repeats
Bacterial genomes: 5%
Mammals: 50%
Repeat types:
• Low-Complexity DNA (e.g. ATATACATA…)
• Microsatellite repeats: $(a_1 \ldots a_k)^N$ where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG)
• Transposons
§ SINE (Short Interspersed Nuclear Elements), e.g., ALU: ~300-long, $10^6$ copies
§ LINE (Long Interspersed Nuclear Elements): ~4000-long, 200,000 copies
§ LTR retroposons (Long Terminal Repeats (~700 bp) at each end): cousins of HIV
• Gene Families: genes duplicate & then diverge (paralogs)
• Recent duplications: ~100,000-long, very similar copies

Sequencing and Fragment Assembly
AGTAGCACAGA CTACGACGAGA CGATCGTGCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
$3 \times 10^9$ nucleotides
50% of human DNA is composed of repeats
Error! Glued together two distant regions

What can we do about repeats?
Two main approaches:
• Cluster the reads
• Link the reads

Sequencing and Fragment Assembly
AGTAGCACAGA CTACGACGAGA CGATCGTGCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
$3 \times 10^9$ nucleotides
Regions A R B and C R D share the repeat R: ARB, CRD or ARD, CRB?

Sequencing and Fragment Assembly
AGTAGCACAGA CTACGACGAGA CGATCGTGCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
$3 \times 10^9$ nucleotides

Long Reads
The Holy Grail

Long Reads - PacBio
Chemistry                   RS II: P4-C2      RS II: P5-C3
Optimized for               higher quality    longer reads
Run time                    180 min
Total output                ~275 Mb           ~375 Mb
Output/day                  ~2.2 Gb           ~3 Gb
Mean read length            ~5.5 kb           ~8.5 kb
Top 5% of reads             >11 kb            >18 kb
Single pass accuracy        ~86%              ~83%
Consensus (50X) accuracy    >99.999%          >99.98%
# of reads                  ~50 k
Instrument price            ~$700 k
Run price                   ~$400

Long Reads – Oxford Nanopore
Read length: 50,000+?
Cost: ?

10x System
Massively Parallel Partitioning
10X Instrument & Reagents
Read Clouds (“linked reads”)
Figure labels: Hap 1, Hap 2, X 3, 000, Phased 60 Kb deletion

Read Clouds
Partition 1 … Partition n
Coverage: $C = C_F \times C_R$
Synthetic Long Reads (SLR): $C_R \geq 50\times$
Read Clouds: $C_R < 1\times$

Fragment Assembly (in whole-genome shotgun sequencing)

Fragment Assembly
Given N reads…
Where N ~ 30 million…
We need to use a linear-time algorithm

Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence, e.g. ..ACGATTACAATAGGTT..
Some Terminology
read: a 500-900 long word that comes out of the sequencer
mate pair: a pair of reads from two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
consensus sequence: sequence derived from the multiple alignment of reads in a contig

1. Find Overlapping Reads
Reads:
aaactgcagtacggatct
gggcccaaactgcagtac
gtacggatctactacaca
List the words of each read as (read, pos., word, orient.) tuples:
aaactgcagt … gtacggatct
gggcccaaac … actgcagtac
gtacggatct … ctactacaca
Sort the tuples by word, as (word, read, orient., pos.), so that reads sharing a word become adjacent

1. Find Overlapping Reads
• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment – throw away if not >98% similar
Example: TACA TAGATTACACAGATTAC T GA aligned against TAGT TAGATTACACAGATTAC TAGA
• Caveat: repeats
§ A k-mer that occurs N times causes $O(N^2)$ read/read comparisons
§ ALU k-mers could cause up to $1{,}000{,}000^2$ comparisons
• Solution:
§ Discard all k-mers that occur “too often”
• Set cutoff to balance sensitivity/speed tradeoff, according to the genome at hand and the computing resources available
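
The seeding step described above (index k-mers, skip over-represented ones, report read pairs that share a k-mer) might be sketched as follows. This is an illustrative simplification, not the assembler's actual implementation: it indexes the forward strand only, counts how many reads contain a k-mer rather than total occurrences, and leaves out the extension to a full alignment. The names `kmer_index`, `candidate_pairs` and `max_occurrences` are hypothetical, and k = 10 is used only because the slide's example reads are short.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k):
    """Map each k-mer to the set of read ids that contain it (forward strand only)."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(read_id)
    return index


def candidate_pairs(reads, k, max_occurrences):
    """Read pairs sharing at least one k-mer, skipping k-mers found in too many reads."""
    pairs = set()
    for word, read_ids in kmer_index(reads, k).items():
        if len(read_ids) > max_occurrences:
            continue  # likely repeat: would cause a quadratic number of comparisons
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


# The three example reads from the slide.
reads = ["aaactgcagtacggatct", "gggcccaaactgcagtac", "gtacggatctactacaca"]
print(sorted(candidate_pairs(reads, k=10, max_occurrences=2)))
```

Each reported pair would then be extended to a full alignment and kept only if it passes the similarity threshold; the cutoff `max_occurrences` is the knob that trades sensitivity for speed.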

1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

1. Find Overlapping Reads
• Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA
§ insert A
§ replace T with C
• Correlated errors are probably caused by repeats: disentangle overlaps
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAG-TTACACAGATTATTGA
In practice, error correction removes up to 98% of the errors

2. Merge Reads into Contigs
• Overlap graph:
§ Nodes: reads $r_1 \ldots r_n$
§ Edges: overlaps $(r_i, r_j, \text{shift}, \text{orientation}, \text{score})$
Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat
Note: of course, we don’t know the “color” of these nodes

2. Merge Reads into Contigs
Figure labels: repeat region, Unique Contig, Overcollapsed Contig
We want to merge reads up to potential repeat boundaries

2. Merge Reads into Contigs
• Remove transitively inferable overlaps
§ If read r overlaps to the right reads r1, r2, and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2)
Figure: reads r, r1, r2, r3
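
Removing transitively inferable overlaps can be illustrated on a toy overlap graph. This sketch assumes a simplified representation, a dictionary from each read to the set of reads it overlaps to the right, and ignores the shift, orientation and score attributes of real overlap edges; the function name is hypothetical.

```python
def remove_transitive_overlaps(overlaps):
    """Drop edge (r, r2) whenever edges (r, r1) and (r1, r2) both exist.

    `overlaps` maps a read to the set of reads it overlaps to the right.
    Checks are made against the original graph, so the result does not
    depend on iteration order.
    """
    reduced = {r: set(targets) for r, targets in overlaps.items()}
    for r, targets in overlaps.items():
        for r1 in targets:
            for r2 in overlaps.get(r1, ()):
                if r2 != r1 and r2 in targets:
                    reduced[r].discard(r2)  # (r, r2) is implied by (r, r1), (r1, r2)
    return reduced


# Toy example mirroring the slide's reads r, r1, r2, r3 laid out left to right.
overlaps = {
    "r": {"r1", "r2", "r3"},
    "r1": {"r2", "r3"},
    "r2": {"r3"},
    "r3": set(),
}
print(remove_transitive_overlaps(overlaps))
```

After the reduction only the chain r, r1, r2, r3 remains, which is the shape a unique contig takes in the overlap graph.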

2. Merge Reads into Contigs

Repeats, errors, and contig lengths
• Repeats shorter than read length are easily resolved
§ A read that spans a repeat disambiguates the order of the flanking regions
• Repeats with more base pair differences than the sequencing error rate are OK
§ We throw away overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
§ Increase read length
§ Decrease sequencing error rate
Role of error correction: discards up to 98% of single-letter sequencing errors
§ decreases error rate
§ decreases effective repeat content
§ increases contig length

3. Link Contigs into Supercontigs
Figure labels: Normal density; Too dense: Overcollapsed; Inconsistent links: Overcollapsed?

3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if 2 forward-reverse links
Result: supercontig (aka scaffold)

3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs
Complex algorithmic step
• Exponential number of paths
• Forward-reverse links

4. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive multiple alignment from pairwise read alignments
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
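
Column-wise weighted voting over an already computed multiple alignment might look like the sketch below. The input conventions are assumptions made for illustration: each row is one aligned read, `-` marks a gap, a space marks a position the read does not cover, and `weights` (optional) holds one quality weight per base. With no weights given, every base votes with weight 1. The example rows are the four error-correction reads from the earlier slide, not the alignment shown on this one.

```python
from collections import defaultdict


def consensus(rows, weights=None):
    """Pick the highest-weight symbol in each alignment column; skip winning gaps."""
    length = max(len(row) for row in rows)
    out = []
    for col in range(length):
        votes = defaultdict(float)
        for i, row in enumerate(rows):
            if col >= len(row) or row[col] == " ":
                continue  # this read does not cover the column
            votes[row[col]] += 1.0 if weights is None else weights[i][col]
        if not votes:
            continue  # no read covers this column
        best = max(votes, key=votes.get)  # ties resolve to the symbol seen first
        if best != "-":
            out.append(best)
    return "".join(out)


rows = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTACTGA",
]
print(consensus(rows))
```

The alternative mentioned on the slide, taking the maximum-quality letter, would replace the summed votes with a per-column maximum over the weights.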

De Bruijn Graph formulation
• Given sequence $x_1 \ldots x_N$ and k-mer length k: graph of $4^k$ vertices, edges between words with (k-1)-long overlap
Reads: AAGA ACTT ACTC ACTG AGAG CCGA CGAC CTCC CTGG CTTT …
de Bruijn Graph nodes: CCG TCC CGA AAG AGA CTC GAC ACT GGA CTT CTG GGG TGG
Potential Genomes: AAGACTCCGACTGGGACTTT, AAGACTGGGACTCCGACTTT
Slide by Michael Schatz
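
The example above can be reproduced with a small script. This is a sketch that follows the slide's picture, where each 4-letter read is an edge between its 3-letter prefix and suffix; since the slide's read list is truncated, the reads are regenerated here from the first potential genome, which is an assumption made for illustration. The helper names are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def de_bruijn(words):
    """Each word contributes an edge from its prefix to its suffix (both one letter shorter)."""
    graph = defaultdict(set)
    for word in words:
        graph[word[:-1]].add(word[1:])
    return graph


genome_a = "AAGACTCCGACTGGGACTTT"
genome_b = "AAGACTGGGACTCCGACTTT"

graph = de_bruijn(kmers(genome_a, 4))
print(sorted(graph["ACT"]))                       # ['CTC', 'CTG', 'CTT']: a branching node
print(kmers(genome_a, 4) == kmers(genome_b, 4))   # True: same reads, same graph
```

Both candidate genomes yield the same set of 4-letter words, so the graph alone cannot tell them apart; this is the repeat ambiguity discussed on the earlier slides.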

Node Types
• Isolated nodes (10%)
• Tips (46%)
• Bubbles/Non-branch (9%)
• Dead Ends (.2%)
• Half Branch (25%)
• Full Branch (10%)
(Chaisson, 2009)
Slide by Michael Schatz

Error Correction
§ Errors at end of read
• Trim off ‘dead-end’ tips
§ Errors in middle of read
• Pop Bubbles
§ Chimeric Edges
• Clip short, low coverage nodes
Slide by Michael Schatz

De Bruijn Graph formulation

Quality of assemblies: mouse
Terminology: N50 contig length
If we sort contigs from largest to smallest, and start covering the genome in that order, N50 is the length of the contig that just covers the 50th percentile.
7.7X sequence coverage
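
The N50 definition translates directly into code. This minimal sketch assumes the total assembled length stands in for the genome length, which is a common convention but not stated on the slide; the function name is hypothetical.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total first reaches half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Contigs of 10, 8, 5, 3 and 2 kb: total 28, half is 14, reached by the second contig.
print(n50([10, 8, 5, 3, 2]))  # 8
```

Sorting largest first means a few long contigs dominate the statistic, which is why N50 is used as a contiguity measure rather than the mean contig length.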

Panda Genome

Hominid lineage

Orangutan genome

Assemblathon

History of WGA
• 1982: λ-virus, 48,502 bp
• 1995: H. influenzae, 1 Mbp
• 2000: fly, 100 Mbp
• 2001 – present
§ human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes
1997:
Gene Myers: “Let’s sequence the human genome with the shotgun strategy”
Phil Green: “That is impossible, and a bad idea anyway”