DNA Sequencing CS273a 2015

What can we do about repeats?
Two main approaches:
• Cluster the reads
• Link the reads

Sequencing and Fragment Assembly
AGTAGCACAGA CTACGACGAGA CGATCGTGCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
3 x 10^9 nucleotides
[Figure: a repeat R flanked by A and B in one copy and by C and D in the other: ARB, CRD or ARD, CRB?]

Long Reads
The Holy Grail

Short Read Sequencing Specs
• http://systems.illumina.com/systems/sequencing.ilmn

Long Reads - PacBio
Chemistry                   RS II: P4-C2    RS II: P5-C3    RS II: P6-C4
Total output                ~275 Mb         ~375 Mb         ~500 Mb - 1 Gb
Output/day                  ~2.2 Gb         ~3 Gb           ~2 Gb
Mean read length            ~5.5 kb         ~8.5 kb         ~15 kb
Single pass accuracy        ~86%            ~83%            ~86%
Consensus (50X) accuracy    >99.999%        >99.98%         >99.999%
Optimized for: higher quality, longer reads
Run time: 180 min, 240 min
# of reads: ~50k
Instrument price: ~$700k
Run price: ~$400

Long Reads – Oxford Nanopore
Read length: 50,000+?
Cost: ?

Moleculo Overview
1. Sample DNA is sheared into fragments of about 10 kbp
2. Fragments are diluted and placed into 384 wells
3. Fragments are amplified through long-range PCR, cut into short fragments and barcoded
4. Short fragments are pooled together and sequenced

10x System
Massively Parallel Partitioning: 10X Instrument & Reagents, x 700,000+
Read Clouds ("linked reads")
[Figure: Hap 1 and Hap 2, phased 60 Kb deletion]

Read Clouds
[Figure: barcodes B1, B2, …, Bn, with labels C_R and C_F]
Coverage = C_R · C_F

Fragment Assembly
(in whole-genome shotgun sequencing)

Fragment Assembly
Given N reads…
Where N ~ 30 million…
We need to use a linear-time algorithm

Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some "good" pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence
Some Terminology
read: a 500-900 long word that comes out of the sequencer
mate pair: a pair of reads from two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
consensus sequence: sequence derived from the multiple alignment of reads in a contig, e.g. ..ACGATTACAATAGGTT..

1. Find Overlapping Reads
Reads and their words:
aaactgcagtacggatct: aaactgcagt … gtacggatct
gggcccaaactgcagtac: gggcccaaac … actgcagtac
gtacggatctactacaca: gtacggatct … ctactacaca
List the words as (read, pos., word, orient.), then sort them as (word, read, orient., pos.) so that identical words from different reads end up next to each other.
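The word-index idea on this slide can be sketched roughly as follows. This is a minimal illustration, not the course's implementation: it assumes forward orientation only (the slide's table also tracks orientation), uses k = 10 to match the 10-letter words shown, groups words with a dictionary instead of sorting, and the function names `kmer_index` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def kmer_index(reads, k):
    """Map each k-mer (word) to the list of (read_id, position) where it occurs."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((read_id, pos))
    return index

def candidate_pairs(reads, k):
    """Return the set of read-id pairs that share at least one k-mer."""
    pairs = set()
    for hits in kmer_index(reads, k).values():
        read_ids = sorted({read_id for read_id, _ in hits})
        pairs.update(combinations(read_ids, 2))
    return pairs

# The three reads shown on the slide.
reads = [
    "aaactgcagtacggatct",
    "gggcccaaactgcagtac",
    "gtacggatctactacaca",
]
print(sorted(candidate_pairs(reads, 10)))
```

Grouping by word plays the same role as the sort on the slide: reads that share a word land in the same bucket, so only those pairs need a full alignment.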

1. Find Overlapping Reads
• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment – throw away if not >98% similar
[Figure: pairwise alignment of two reads around the shared word TAGATTACACAGATTAC]
• Caveat: repeats
  § A k-mer that occurs N times causes O(N^2) read/read comparisons
  § ALU k-mers could cause up to 1,000^2 comparisons
• Solution:
  § Discard all k-mers that occur "too often"
    • Set cutoff to balance sensitivity/speed tradeoff, according to genome at hand and computing resources available
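The "discard k-mers that occur too often" step might look like the sketch below. It is an assumption-laden toy: the cutoff is a plain count threshold, orientation is ignored, and `frequent_kmers`, `seed_pairs`, and `identity` are hypothetical helper names. `identity` only scores two already-aligned, equal-length strings, so it stands in for the >98% similarity check rather than performing the alignment itself.

```python
from collections import Counter, defaultdict
from itertools import combinations

def frequent_kmers(reads, k, cutoff):
    """Return the k-mers that occur more than `cutoff` times across all reads."""
    counts = Counter(
        seq[i:i + k] for seq in reads for i in range(len(seq) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n > cutoff}

def seed_pairs(reads, k, cutoff):
    """Read pairs sharing a k-mer, ignoring k-mers above the frequency cutoff."""
    skip = frequent_kmers(reads, k, cutoff)
    by_kmer = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer not in skip:
                by_kmer[kmer].add(read_id)
    pairs = set()
    for ids in by_kmer.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def identity(a, b):
    """Fraction of matching columns in two equal-length aligned strings."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy reads (made up): "TTTT" acts as a repeat shared by three of them.
toy_reads = ["ACGTTTTT", "GGCATTTT", "CCAGTTTT", "ACGTAAAA"]
print(sorted(seed_pairs(toy_reads, 4, cutoff=2)))
print(sorted(seed_pairs(toy_reads, 4, cutoff=100)))
```

With the low cutoff the repeat word is ignored and fewer pairs are proposed. Raising the cutoff brings back the repeat-induced pairs, which is the sensitivity/speed tradeoff the slide mentions.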

1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA

1. Find Overlapping Reads
• Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA
insert A; replace T with C
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
correlated errors, probably caused by repeats: disentangle overlaps
In practice, error correction removes up to 98% of the errors
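One way to picture the correction rule on this slide is a per-column vote in which a base seen in only one read is treated as an error, while a variant seen in several reads is left alone as a possible repeat difference. The sketch below assumes the reads are already aligned to equal length with "-" for gaps. The name `correct_reads` and the `min_support` threshold of 2 are illustrative assumptions, not values from the slides.

```python
from collections import Counter

def correct_reads(aligned_reads, min_support=2):
    """Replace bases seen in fewer than `min_support` reads of a column with
    that column's most common base; well-supported variants are left alone."""
    corrected = [list(read) for read in aligned_reads]
    for col in range(len(aligned_reads[0])):
        counts = Counter(read[col] for read in aligned_reads)
        majority, majority_count = counts.most_common(1)[0]
        if majority_count < min_support:
            continue  # no well-supported base to correct toward
        for row in corrected:
            if counts[row[col]] < min_support:
                row[col] = majority
    return ["".join(row) for row in corrected]

# First alignment on the slide: isolated errors get corrected.
isolated = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTACTGA",
]
# Second alignment on the slide: correlated differences are kept.
correlated = [
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
]
print(correct_reads(isolated))
print(correct_reads(correlated))
```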

2. Merge Reads into Contigs
• Overlap graph:
  § Nodes: reads r1…rn
  § Edges: overlaps (ri, rj, shift, orientation, score)
[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]
Note: of course, we don't know the "color" of these nodes

2. Merge Reads into Contigs
[Figure: repeat region; Unique Contig; Overcollapsed Contig]
We want to merge reads up to potential repeat boundaries

2. Merge Reads into Contigs
• Remove transitively inferable overlaps
  § If read r overlaps to the right reads r1, r2, and r1 overlaps r2, then (r, r2) can be inferred by (r, r1) and (r1, r2)
[Figure: reads r, r1, r2, r3]
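The transitive-overlap rule can be sketched on a toy overlap graph. This simplified version stores only "overlaps to the right" edges as a dictionary of sets and ignores the shift, orientation, and score fields of real overlap edges. The function name `remove_transitive_overlaps` and the overlaps assumed among r, r1, r2, r3 are illustrative.

```python
def remove_transitive_overlaps(overlaps):
    """overlaps: dict mapping a read to the set of reads it overlaps on its right.
    Returns a copy without edges (r, r2) implied by (r, r1) and (r1, r2)."""
    reduced = {r: set(targets) for r, targets in overlaps.items()}
    for r, targets in overlaps.items():
        for r1 in targets:
            # Any r2 reachable through r1 makes the direct edge (r, r2) redundant.
            for r2 in overlaps.get(r1, ()):
                if r2 in targets and r2 != r1:
                    reduced[r].discard(r2)
    return reduced

# Assumed overlaps among the four reads named in the slide's figure.
overlaps = {
    "r": {"r1", "r2"},
    "r1": {"r2", "r3"},
    "r2": {"r3"},
    "r3": set(),
}
print(remove_transitive_overlaps(overlaps))
```

Checks are made against the original edge set, so the result does not depend on the order in which reads are visited.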

Repeats, errors, and contig lengths
• Repeats shorter than read length are easily resolved
  § A read that spans across a repeat disambiguates the order of the flanking regions
• Repeats with more base pair diffs than sequencing error rate are OK
  § We throw away overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
  § Increase read length
  § Decrease sequencing error rate
Role of error correction:
Discards up to 98% of single-letter sequencing errors
→ decreases error rate
→ decreases effective repeat content
→ increases contig length

3. Link Contigs into Supercontigs
[Figure labels: Normal density; Too dense: Overcollapsed; Inconsistent links: Overcollapsed?]

3. Link Contigs into Supercontigs
Find all links between unique contigs
Connect contigs incrementally, if 2 forward-reverse links
[Figure: supercontig (aka scaffold)]
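The incremental joining step can be illustrated with a small union-find sketch. It only groups contigs that are connected by enough links and does not order or orient them, which a real supercontig requires. Reading the slide's threshold as "at least 2 links" is an assumption, and `build_supercontigs` and `min_links` are hypothetical names.

```python
from collections import Counter

def build_supercontigs(contigs, links, min_links=2):
    """Group contigs into supercontigs: two contigs are joined when at least
    `min_links` mate-pair links connect them (order/orientation not modeled)."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    # Count links per unordered contig pair.
    support = Counter(tuple(sorted(link)) for link in links)
    for (a, b), n in support.items():
        if a != b and n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())

# Made-up example: c2-c3 has a single link, so it is not trusted.
contigs = ["c1", "c2", "c3", "c4"]
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c3", "c4"), ("c3", "c4")]
print(build_supercontigs(contigs, links))
```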

3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs
Complex algorithmic step
• Exponential number of paths
• Forward-reverse links

De Bruijn Graph formulation
• Given sequence x1…xN, k-mer length k:
  § Graph of 4^k vertices
  § Edges between words with (k-1)-long overlap
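A minimal sketch of this formulation is shown below. It keeps only the k-mers that actually occur in the sequence as vertices, not all 4^k possible words, and adds an edge between consecutive k-mers, which by construction share a (k-1)-long overlap. The function name `de_bruijn_graph` and the example sequence are made up for illustration.

```python
def de_bruijn_graph(sequence, k):
    """Vertices: k-mers occurring in `sequence`. Edge u -> v whenever v follows
    u in the sequence, i.e. the last k-1 letters of u are the first k-1 of v."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    graph = {kmer: set() for kmer in kmers}
    for u, v in zip(kmers, kmers[1:]):
        graph[u].add(v)
    return graph

graph = de_bruijn_graph("ACGTACGA", 3)
for word in sorted(graph):
    print(word, "->", sorted(graph[word]))
```

In the example the word ACG occurs twice and so gets two outgoing edges, a small picture of how repeated words create branches in the graph.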

4. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive multiple alignment from pairwise read alignments
Derive each consensus base by weighted voting
(Alternative: take maximum-quality letter)
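Weighted voting over the columns of a multiple alignment might be sketched as follows. For simplicity this assumes one weight per read rather than per-base quality scores, treats a space as "read does not cover this column" and "-" as a gap, and drops columns where the gap wins. The function name `consensus` and the weights in the example are made up.

```python
from collections import defaultdict

def consensus(aligned_reads, weights=None, gap="-"):
    """Per-column weighted vote over aligned reads. Spaces mean the read does
    not cover the column; columns whose winner is a gap are dropped."""
    if weights is None:
        weights = [1.0] * len(aligned_reads)
    width = max(len(read) for read in aligned_reads)
    result = []
    for col in range(width):
        votes = defaultdict(float)
        for read, weight in zip(aligned_reads, weights):
            if col < len(read) and read[col] != " ":
                votes[read[col]] += weight
        if not votes:
            continue  # no read covers this column
        winner = max(votes, key=votes.get)
        if winner != gap:
            result.append(winner)
    return "".join(result)

rows = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTATTGA",
    "TAG-TTACACAGATTACTGA",
]
print(consensus(rows))                     # equal weights
print(consensus(rows, weights=[1, 5, 1]))  # second read trusted more
```

With equal weights the vote reduces to a plain majority. Giving one read a much larger weight lets its base win a column, which is closer in spirit to the "maximum-quality letter" alternative on the slide.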

Panda Genome

History of WGA
• 1982: λ-virus, 48,502 bp
• 1995: h-influenzae, 1 Mbp
• 2000: fly, 100 Mbp
• 2001 – present
  § human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes
1997: "Let's sequence the human genome with the shotgun strategy" (Gene Myers)
"That is impossible, and a bad idea anyway" (Phil Green)