Fragment Assembly (in whole-genome shotgun sequencing)
CS 273a Lecture 4, Autumn 08, Batzoglou (CS 273a 2014)
Sequencing and Fragment Assembly
AGTAGCACAGA CTACGACGAGA CGATCGTGCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
3 x 10^9 nucleotides
Sequence Assembly
• A genomic segment is cut many times at random (shotgun)
• Get one or two reads (~100 bp each) from each segment
Two main assembly problems
• De novo assembly
• Resequencing
Sequencing and Fragment Assembly
AGTAGCACAGA CTACGACGAGA CGATCGTGCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
3 x 10^9 nucleotides
• 50% of human DNA is composed of repeats
• Error! Glued together two distant regions
Sequencing and Fragment Assembly
AGTAGCACAGA CTACGACGAGA CGATCGTGCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
3 x 10^9 nucleotides
[Figure: two copies of a repeat R, flanked by A and B in one copy and by C and D in the other]
ARB, CRD or ARD, CRB?
Fragment Assembly
Given N reads, where N ~ 30 million, we need to use a linear-time algorithm.
Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some "good" pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence

Some Terminology
• read: a 500-900 long word that comes out of the sequencer
• mate pair: a pair of reads from the two ends of the same insert fragment
• contig: a contiguous sequence formed by several overlapping reads with no gaps
• supercontig (scaffold): an ordered and oriented set of contigs, usually linked by mate pairs
• consensus sequence: the sequence derived from the multiple alignment of reads in a contig, e.g. ..ACGATTACAATAGGTT..
1. Find Overlapping Reads
Each read is broken into its words (first word … last word shown):
aaactgcagtacggatct: aaactgcagt … gtacggatct
gggcccaaactgcagtac: gggcccaaac … actgcagtac
gtacggatctactacaca: gtacggatct … ctactacaca
List the words as (read, pos., word, orient.) tuples, then sort them as (word, read, orient., pos.).
[Figure: the word list before and after sorting]
1. Find Overlapping Reads
• Find pairs of reads sharing a k-mer, k ~ 24
• Extend to full alignment; throw away if not >98% similar
[Figure: a shared k-mer match (TAGATTACACAGATTAC) extended to a full alignment of the two reads]
• Caveat: repeats
§ A k-mer that occurs N times causes O(N^2) read/read comparisons
§ ALU k-mers could cause up to 1,000^2 comparisons
• Solution:
§ Discard all k-mers that occur "too often"
• Set the cutoff to balance the sensitivity/speed tradeoff, according to the genome at hand and the computing resources available
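The k-mer seeding step with a frequency cutoff can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function name `candidate_pairs` and the parameter `max_occurrences` are hypothetical, and reverse-complement orientation and the full-alignment extension step are left out for brevity.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k=24, max_occurrences=50):
    """Return pairs of read indices that share at least one k-mer.

    k-mers found in more than `max_occurrences` reads are skipped: they
    most likely come from repeats and would trigger a quadratic number
    of read/read comparisons. (Cutoff value is an assumption.)
    """
    index = defaultdict(set)  # k-mer -> ids of the reads that contain it
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)

    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_occurrences:
            continue  # "too often": discard this k-mer
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


# The three example reads from the slide, with a short k for illustration.
reads = [
    "aaactgcagtacggatct",
    "gggcccaaactgcagtac",
    "gtacggatctactacaca",
]
print(sorted(candidate_pairs(reads, k=10)))
```

Each surviving pair would then be extended to a full alignment and kept only if it passes the similarity threshold.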
1. Find Overlapping Reads
Create local multiple alignments from the overlapping reads
TAGATTACACAGATTACTGA
TAG TTACACAGATTATTGA
TAGATTACACAGATTACTGA
1. Find Overlapping Reads
• Correct errors using multiple alignment
TAGATTACACAGATTACTGA
TAGATTACACAGATTATTGA (replace T with C)
TAGATTACACAGATTACTGA
TAG-TTACACAGATTACTGA (insert A)
• Correlated errors are probably caused by repeats: disentangle the overlaps
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
In practice, error correction removes up to 98% of the errors
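A rough sketch of column-wise error correction on an already aligned stack of reads. The function name `correct_reads` and the `min_support` threshold are assumptions made for illustration; the rows are assumed to have equal length, with "-" marking a gap. Columns where no letter reaches the threshold are left alone, which is how correlated differences (likely repeat copies) escape correction here.

```python
from collections import Counter


def correct_reads(aligned, min_support=0.7):
    """Overwrite each column with its majority letter when support is high.

    `aligned` is a list of equal-length strings (one row per read).
    `min_support` is a hypothetical threshold: the fraction of rows that
    must agree before the minority letters are treated as errors.
    """
    corrected = [list(row) for row in aligned]
    for col in range(len(aligned[0])):
        column = [row[col] for row in aligned]
        letter, count = Counter(column).most_common(1)[0]
        if count / len(column) >= min_support:
            for row in corrected:
                row[col] = letter
    return ["".join(row) for row in corrected]


# Isolated errors: one substitution and one deletion get corrected.
isolated = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTACTGA",
]
# Correlated differences: two reads share both variants, so nothing changes.
correlated = [
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
]
print(correct_reads(isolated))
print(correct_reads(correlated))
```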
2. Merge Reads into Contigs
• Overlap graph:
§ Nodes: reads r_1, ..., r_n
§ Edges: overlaps (r_i, r_j, shift, orientation, score)
[Figure: reads that come from two regions of the genome (blue and red) that contain the same repeat]
Note: of course, we don't know the "color" of these nodes
2. Merge Reads into Contigs
[Figure: a repeat region, a unique contig, and an overcollapsed contig]
We want to merge reads up to potential repeat boundaries
2. Merge Reads into Contigs
• Remove transitively inferable overlaps
§ If read r overlaps to the right reads r_1 and r_2, and r_1 overlaps r_2, then (r, r_2) can be inferred from (r, r_1) and (r_1, r_2)
[Figure: reads r, r_1, r_2, r_3]
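A small sketch of removing transitively inferable overlaps. It assumes the overlap graph is given as a dictionary mapping each read to the set of reads it overlaps to the right; the name `remove_transitive_edges` is hypothetical, and a real assembler would also check that the shifts of the three overlaps are consistent before dropping an edge.

```python
def remove_transitive_edges(overlaps):
    """Drop every edge (r, r2) that is implied by (r, r1) and (r1, r2).

    `overlaps` maps a read name to the set of reads it overlaps to the
    right. The input is not modified; a reduced copy is returned.
    """
    reduced = {r: set(targets) for r, targets in overlaps.items()}
    for r, targets in overlaps.items():
        for r1 in targets:
            for r2 in overlaps.get(r1, ()):
                if r2 in targets:
                    reduced[r].discard(r2)  # (r, r2) is inferable via r1
    return reduced


# Four reads tiling one region, as in the slide's r, r1, r2, r3 picture.
overlaps = {
    "r": {"r1", "r2", "r3"},
    "r1": {"r2", "r3"},
    "r2": {"r3"},
    "r3": set(),
}
print(remove_transitive_edges(overlaps))
```

After reduction only the chain r, r1, r2, r3 remains, so unique regions become simple paths in the graph.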
2. Merge Reads into Contigs
Repeats, errors, and contig lengths
• Repeats shorter than the read length are easily resolved
§ A read that spans a repeat disambiguates the order of the flanking regions
• Repeats with more base-pair differences than the sequencing error rate are OK
§ We throw out overlaps between two reads in different copies of the repeat
• To make the genome appear less repetitive, try to:
§ Increase read length
§ Decrease sequencing error rate
Role of error correction: discards up to 98% of single-letter sequencing errors → decreases error rate → decreases effective repeat content → increases contig length
3. Link Contigs into Supercontigs
[Figure: read density and mate-pair links along contigs]
• Normal density
• Too dense → overcollapsed
• Inconsistent links → overcollapsed?
3. Link Contigs into Supercontigs
• Find all links between unique contigs
• Connect contigs incrementally, if ≥ 2 forward-reverse links → supercontig (aka scaffold)
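A simplified sketch of the linking rule: two contigs are joined only when enough mate-pair links connect them. The name `build_scaffolds` and the parameter `min_links` are hypothetical, with `min_links=2` taken as an assumed reading of the slide's "2 forward-reverse links". The sketch only groups contigs with a union-find structure; ordering and orienting the contigs inside each supercontig is not attempted.

```python
from collections import Counter


def build_scaffolds(contigs, links, min_links=2):
    """Group contigs that are connected by at least `min_links` links.

    `links` is a list of (contig_a, contig_b) pairs, one per mate pair
    whose two reads landed in different contigs.
    """
    support = Counter(tuple(sorted(link)) for link in links)
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find root lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), count in support.items():
        if count >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())


contigs = ["A", "B", "C", "D"]
# A-B and C-D are each supported twice; B-C has a single link only.
links = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "D"), ("D", "C")]
print(build_scaffolds(contigs, links))
```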
3. Link Contigs into Supercontigs
Fill gaps in supercontigs with paths of repeat contigs
Complex algorithmic step
• Exponential number of paths
• Forward-reverse links
De Bruijn Graph formulation
• Given a sequence x_1…x_N and k-mer length k:
§ Graph of 4^k vertices
§ Edges between words with (k-1)-long overlap
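A minimal sketch of this formulation, assuming vertices are the k-mers and an edge joins two consecutive k-mers of the sequence (which overlap by k-1 letters). Only the k-mers actually present in the sequence are stored, rather than all 4^k possible vertices; the function name `de_bruijn_graph` is hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(sequence, k):
    """Map each k-mer to the set of k-mers that follow it in `sequence`.

    Consecutive k-mers share a (k-1)-long overlap, so every stored edge
    satisfies left[1:] == right[:-1].
    """
    graph = defaultdict(set)
    for i in range(len(sequence) - k):
        left = sequence[i:i + k]
        right = sequence[i + 1:i + k + 1]
        graph[left].add(right)
    return dict(graph)


graph = de_bruijn_graph("ACGTACGA", k=3)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

In the example, the k-mer ACG occurs twice in the sequence and therefore has two outgoing edges, which is how a repeat shows up as a branching vertex.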
4. Derive Consensus Sequence
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA
Consensus: TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
• Derive multiple alignment from pairwise read alignments
• Derive each consensus base by weighted voting (alternative: take maximum-quality letter)
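Weighted voting per column might look like the sketch below. It assumes the reads are already placed in a multiple alignment (a space means the read does not cover that column, "-" is a gap) and that each base has a numeric quality score used as its vote weight; the function name `consensus` and the quality values are illustrative assumptions.

```python
from collections import defaultdict


def consensus(aligned_reads, qualities):
    """Pick, for each column, the letter with the largest total quality.

    `qualities[i][col]` is the weight of `aligned_reads[i][col]`.
    Columns with no coverage yield "N"; columns won by a gap are dropped.
    """
    length = max(len(read) for read in aligned_reads)
    result = []
    for col in range(length):
        votes = defaultdict(float)
        for read, quals in zip(aligned_reads, qualities):
            if col < len(read) and read[col] != " ":
                votes[read[col]] += quals[col]
        result.append(max(votes, key=votes.get) if votes else "N")
    return "".join(result).replace("-", "")


# One high-quality read outvotes two low-quality reads at the second column.
aligned_reads = ["ACGT", "AGGT", "AGGT"]
qualities = [[40] * 4, [10] * 4, [10] * 4]
print(consensus(aligned_reads, qualities))
```

The alternative mentioned on the slide, taking the maximum-quality letter, would replace the sum of weights with a maximum over the weights in each column.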
Panda Genome
History of WGA
• 1982: λ-virus, 48,502 bp
• 1995: H. influenzae, 1 Mbp
• 2000: fly, 100 Mbp
• 2001 to present:
§ human (3 Gbp), mouse (2.5 Gbp), rat*, chicken, dog, chimpanzee, several fungal genomes
1997, Gene Myers: "Let's sequence the human genome with the shotgun strategy"
Phil Green: "That is impossible, and a bad idea anyway"
Metagenomics
Metagenomics
Intestinal microbiota metabolism of L-carnitine […] promotes atherosclerosis. Nature Medicine 2013.
Metagenomics
Artificial sweeteners induce glucose intolerance transferable to germ-free mice. J Suez et al. Nature 000, 1-6 (2014) doi:10.1038/nature13793