Sequencing and Sequence Assembly: overview of the genome sequencing process
Presented by NIE, Lan
CSE 497, Feb. 24, 2004

Introduction
• Q: What is sequencing?
  A: To sequence a DNA molecule is to obtain the string of bases that it contains. The resulting string is also known as a read.
• Q: How do we sequence?
  A: Recall the Sanger sequencing technology mentioned in Chapter 1.

Introduction: Sanger Sequencing
• Cut DNA at each base: A, C, G, T
• Smaller fragments migrate farther: migration distance decreases as fragment size increases
• Run the gel and read off the sequence
TCGCGATAGCTGTGCTA

Introduction
• Limitation: the size of DNA fragments that can be read in this way is about 700 bp.
• Problem: most genomes are enormous (on the order of 10^9 base pairs in the case of human), so they cannot be sequenced directly. Sequencing them is called large-scale sequencing.

Introduction
• Solution
  – Break the DNA into small fragments randomly
  – Sequence the readable fragments directly
  – Assemble the fragments together to reconstruct the original DNA
  – Use a scaffolder to handle the gaps
Solving a one-dimensional jigsaw puzzle with millions of pieces (without the box)!

1. Break
2. Sequence
3. Assemble
4. Scaffolder
5. Conclusion

Break
• DNA can be cut into pieces by mechanical means.

Issues in Break
How?
• Coverage: taken together, the fragments provide an 8X oversampling of the genome
• Random: libraries with piece sizes of 2, 4, 6, 10, 12 and 40 kbp were produced
• Clone: obtain several copies of the original genome and of the fragments
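The 8X coverage figure can be related to read counts with simple arithmetic. The sketch below is illustrative only: the function names, the 1 Mbp genome and the Poisson approximation for uncovered bases (which assumes reads land uniformly at random) are assumptions, not part of the slides.

```python
import math


def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each base (the 'X' in 8X)."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads that reaches the target coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


def expected_uncovered_fraction(c: float) -> float:
    """Idealized Poisson estimate of the fraction of bases hit by no read."""
    return math.exp(-c)


# Hypothetical 1 Mbp genome sequenced with 700 bp reads at 8X.
n = reads_needed(8, 700, 1_000_000)
print(n, coverage(n, 700, 1_000_000), expected_uncovered_fraction(8))
```

Even at 8X the estimate leaves a small uncovered fraction, which is one reason gaps appear later in the pipeline.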

Sequence
• Clone
• Directed sequencing (gel)
Q: Can we read the fragment from both ends?
GTCCAGCCT

3. Assemble
• A simple example
  Fragments: ACCGT, CGTGC, TTAC
  --ACCGT
  ----CGTGC
  TTAC
  TTACCGTGC
• Overlap: the suffix of one fragment is the same as the prefix of another.
• Assemble: align multiple fragments into a single continuous sequence based on fragment overlap.

3. Assemble
(figure: the fragments are assembled into contig 1 and contig 2, separated by a gap, and compared with the original target)

A simple model
• The simplest, naive approximation of DNA assembly corresponds to the Shortest Common Superstring problem (SCS): given a set of strings s_1, ..., s_n, find the shortest string s such that each s_i appears as a substring of s.
  --ACCGT
  ----CGTGC
  TTAC
  TTACCGTGC
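For a handful of fragments the shortest common superstring can be found by brute force, which makes the definition concrete. This is only a sketch: it tries every ordering and merges neighbours at their longest exact overlap, which assumes that no fragment is contained in another, and it is far too slow for real data. The function names are made up for illustration.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(fragments) -> str:
    """Merge fragments left to right, gluing neighbours at their overlap."""
    result = fragments[0]
    for prev, nxt in zip(fragments, fragments[1:]):
        result += nxt[overlap(prev, nxt):]
    return result


def shortest_superstring(fragments) -> str:
    """Try every ordering and keep the shortest merged string."""
    return min((merge_in_order(p) for p in permutations(fragments)), key=len)


print(shortest_superstring(["ACCGT", "CGTGC", "TTAC"]))  # TTACCGTGC
```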

(1) Overlap step: create an overlap graph in which every node is a fragment and every edge indicates an overlap.
(2) Layout step: determine which overlaps will be used in the final assembly by finding an optimal spanning forest on the overlap graph.

Overlap step: Finding overlaps
• Compare each fragment with the other fragments to find whether its end part overlaps another fragment's beginning part. We say 'a overlaps b' when a suffix of a equals a prefix of b.

Overlap step: Overlap graph
• Directed, weighted graph G(V, E, w)
• V: set of fragments
• E: set of directed edges indicating the overlap between two fragments. An edge <a, b, w> means an overlap between a and b with weight w, that is, suffix(a, w) = prefix(b, w).

Example
W = AGTATTGGCAATC
Z = AATCGATG
U = ATGCAAACCT
X = CCTTTTGG
Y = TTGGCAATCA
S = AATCAGG
(figure: overlap graph on the nodes w, x, y, z, u, s with edge weights 9, 5, 4, 4, 3, 3)
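The overlap graph for these six fragments can be computed directly from the edge definition suffix(a, w) = prefix(b, w). The sketch below assumes exact matches and a minimum overlap of 3 bases (a threshold chosen here for illustration); because it reports every qualifying suffix-prefix match, it may list an edge that the figure leaves out.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(fragments: dict, min_overlap: int = 3) -> list:
    """Return directed edges (a, b, w), heaviest first."""
    edges = []
    for a_name, a in fragments.items():
        for b_name, b in fragments.items():
            if a_name == b_name:
                continue
            w = overlap(a, b)
            if w >= min_overlap:
                edges.append((a_name, b_name, w))
    return sorted(edges, key=lambda e: -e[2])


fragments = {
    "W": "AGTATTGGCAATC", "Z": "AATCGATG", "U": "ATGCAAACCT",
    "X": "CCTTTTGG", "Y": "TTGGCAATCA", "S": "AATCAGG",
}
for a, b, w in overlap_graph(fragments):
    print(f"{a} -> {b} (weight {w})")
```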

Layout step
• Looking for the shortest common superstring is the same as looking for a path of maximum weight.
• Use a greedy algorithm that selects the edge with the best weight at every step. The selected edge is checked against the rule below: if it passes, the edge is accepted; otherwise it is omitted.
• Rule: for either node on the edge, indegree and outdegree must stay <= 1, and the graph must stay acyclic.

• In the end the fragments are merged together. In graph terms, the result is a forest of Hamiltonian paths: a set of disjoint paths in which each node is visited at most once. Each path corresponds to a contig.

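Example
W = AGTATTGGCAATC
Z = AATCGATG
U = ATGCAAACCT
X = CCTTTTGG
Y = TTGGCAATCA
S = AATCAGG
(figure: overlap graph on the nodes w, x, y, z, u, s with edge weights 9, 5, 4, 4, 3, 3)

W->Y->S
AGTATTGGCAATC
----TTGGCAATCA
---------AATCAGG
AGTATTGGCAATCAGG

Z->U->X
AATCGATG
-----ATGCAAACCT
------------CCTTTTGG
AATCGATGCAAACCTTTTGG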

• The greedy algorithm is neither optimal nor complete, and will introduce gaps.
  (figure: overlap graph on the fragments GCC, ATGC and TGCAT with edge weights 2, 2 and 3)
• It can't correctly model the assembly problem because of complications in real problem instances.
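The greedy layout step and its acceptance rule can be sketched as follows. This is one possible reading of the slides, not a reference implementation: it uses exact overlaps only, and the names greedy_layout and path_end are invented for the example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_layout(fragments: dict, min_overlap: int = 1) -> list:
    """Return contigs as (fragment names in order, merged sequence)."""
    edges = [
        (overlap(a, b), a_name, b_name)
        for a_name, a in fragments.items()
        for b_name, b in fragments.items()
        if a_name != b_name
    ]
    edges = [e for e in edges if e[0] >= min_overlap]
    edges.sort(key=lambda e: -e[0])  # heaviest edge first

    succ, pred = {}, {}

    def path_end(node):
        while node in succ:
            node = succ[node]
        return node

    for _, a, b in edges:
        if a in succ or b in pred:   # outdegree / indegree would exceed 1
            continue
        if path_end(b) == a:         # edge would close a cycle
            continue
        succ[a], pred[b] = b, a

    contigs = []
    for start in fragments:
        if start in pred:            # not the first fragment of a path
            continue
        names, seq = [start], fragments[start]
        while names[-1] in succ:
            nxt = succ[names[-1]]
            seq += fragments[nxt][overlap(fragments[names[-1]], fragments[nxt]):]
            names.append(nxt)
        contigs.append((names, seq))
    return contigs


six = {"W": "AGTATTGGCAATC", "Z": "AATCGATG", "U": "ATGCAAACCT",
       "X": "CCTTTTGG", "Y": "TTGGCAATCA", "S": "AATCAGG"}
print(greedy_layout(six, min_overlap=3))

three = {"GCC": "GCC", "ATGC": "ATGC", "TGCAT": "TGCAT"}
print(greedy_layout(three, min_overlap=2))
```

On the three-fragment case the heaviest edge ATGC -> TGCAT is taken first, which blocks both weight-2 edges and leaves GCC as a separate contig, whereas the ordering TGCAT, ATGC, GCC would give the single string TGCATGCC.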

Complications with Assemble
1. Sequencing errors. Most sequencers have around 1% error in the best case.
2. Unknown orientation. Either strand could have been sequenced.
3. Bias in the reads. Not all regions of the sequence will be covered equally.
4. Repeats. There is much repetitive sequence, especially in human and higher plants.

Sequencing Errors
• Fragments contain 3 kinds of errors: insertion, deletion, substitution
• Probability: substitutions (0.5-2%); insertions and deletions occur roughly 10 times less frequently
http://compbio.uchsc.edu/Hunter_lab/Hunter/bioi7711/lecture6.ppt

Problems with the simple model: Errors
x: ACCGT
y: CGTGC
z: TTAC
u: TACCGT
(figure: overlap graph on x, y, z and u with edge weights 5, 3, 3 and 2; an A and a G mark sequencing errors in the reads)
--ACCGT
----CGTGC
TTAC
-TACCGT
TTACCGTGC

Problems with the simple model: Errors
• Solution: allow for a bounded number of mismatches between overlapping fragments (approximate overlaps)
• Criterion: minimum overlap length (40 bp), error rate (less than 6% mismatches)
• How? Use semi-global alignment to find the best match between the suffix of one sequence and the prefix of another.

Semi-global alignment
• Score system: 1 for matches, -1 for mismatches, -2 for gaps
• Initialize the first row and first column to zero, so that gaps at both extremities are ignored
• The algorithm is otherwise the same as global comparison
• Search the last column for the highest score and obtain the alignment by tracing back to the start point (overlap of x over y). The overlap of y over x corresponds to the maximum in the last row.
(figure: score matrix with x along the top and y down the side; the first row and first column are all zeros)

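Example: X = ACCGT (columns), Y = CGATGC (rows)

         A   C   C   G   T
     0   0   0   0   0   0
C    0  -1   1   1  -1  -1
G    0  -1  -1   0   2   0
A    0   1  -1  -2   0   1
T    0  -1   0  -2  -2   1
G    0  -1  -2  -1  -1  -1
C    0  -1   0  -1  -2  -2

Overlap x->y (maximum of the last column, score 1):
ACCG-T--
--CGATGC

Overlap y->x (maximum of the last row, score 0, so no real overlap):
CGATGC-----
------ACCGT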
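The matrix fill for semi-global alignment can be sketched in a few lines under the scoring stated earlier (1 for a match, -1 for a mismatch, -2 for a gap). The function names are illustrative, and traceback is omitted to keep the example short.

```python
def semiglobal_matrix(x: str, y: str, match=1, mismatch=-1, gap=-2):
    """Score matrix with x along the columns and y along the rows.

    The first row and first column stay at zero, so leading gaps are free.
    """
    rows, cols = len(y) + 1, len(x) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if y[i - 1] == x[j - 1] else mismatch
            m[i][j] = max(
                m[i - 1][j - 1] + s,  # align y[i-1] with x[j-1]
                m[i - 1][j] + gap,    # gap in x
                m[i][j - 1] + gap,    # gap in y
            )
    return m


def best_overlap_scores(x: str, y: str):
    """Return (score of x over y, score of y over x)."""
    m = semiglobal_matrix(x, y)
    x_over_y = max(row[-1] for row in m)  # best value in the last column
    y_over_x = max(m[-1])                 # best value in the last row
    return x_over_y, y_over_x


for row in semiglobal_matrix("ACCGT", "CGATGC"):
    print(row)
print(best_overlap_scores("ACCGT", "CGATGC"))
```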

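Problems with the simple model: Errors
x: ACCGT
y: CGTGC
z: TTAC
u: TACCGT
(figure: the overlap graph on x, y, z and u, shown once with the exact overlap weights 5, 3, 3, 2 and once with alignment scores such as 0, 0 and -2 for the reads that contain errors)
Criterion
1. Score > -3
2. Mismatches < 2
Without errors:
--ACCGT
----CGTGC
TTAC
-TACCGT
TTACCGTGC
With errors:
--ACCG-T
----CGATGC
TT-C
-TAGCGT
TTACCGTGC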

Problems with the simple model: Unknown orientation
• Unknown orientation: fragments can be read from either of the two DNA strands.
• Solution: try all possible combinations.
(figure: each fragment x, y, z together with its reverse complement x', y', z')
As read    Oriented   Layout
CACGT                 CACGT
ACGT                  -ACGT
ACTACG     CGTAGT     --CGTAGT
GTACT      AGTAC      -----AGTAC
CACGTAGTACTGA
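Trying both orientations amounts to testing each fragment and its reverse complement. The sketch below checks the two reoriented fragments from the example; best_oriented_overlap is a hypothetical helper that considers exact overlaps only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def best_oriented_overlap(a: str, b: str):
    """Try b as read and reverse-complemented; return (overlap, oriented b)."""
    candidates = [b, reverse_complement(b)]
    return max(((overlap(a, c), c) for c in candidates), key=lambda t: t[0])


print(reverse_complement("ACTACG"))               # CGTAGT
print(best_oriented_overlap("CACGT", "ACTACG"))   # (3, 'CGTAGT')
print(best_oriented_overlap("CGTAGT", "GTACT"))   # (3, 'AGTAC')
```

A full assembler would also have to keep the chosen orientations consistent across all fragments, which this pairwise check does not attempt.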

Problems with the simple model: Repeats
Repeats can be characterized by length, copy number and fidelity between copies
– Human T-cell receptor: 5 copies of a 4 kb gene with ~3% variation
– ALUs: ~300 bp with 5-15% variation, clustering to be 50-60% of many human sequence regions
– Microsatellites: 3-6 bp with thousands of repeats in centromeric and telomeric regions, 1-2% variation
gepard.bioinformatik.uni-saarland.de/html/BioinformatikIIIWS0304-Dateien/V3-Assembly.ppt

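Problems with the simple model: Repeats (2), Rearrangement
(figure)
Original: A X1 B X2 C X3 D
Fragments: pieces of the original that each end or begin inside one copy of the repeat X
Assembler consensus: the copies X1, X2 and X3 are indistinguishable, so B and C can be placed in the wrong order around the repeats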

Problems with the simple model: Repeats (3), Overcollapsing
(figure)
Original: A X1 B X2 C
Assembly: the copies X1 and X2 are collapsed into a single X, giving contig 1 = A X C and contig 2 = B, with a gap
Shortest string is not always the best!

Problems with the simple model: Lack of coverage
• Not all regions of the sequence will be covered equally
  (figure: target DNA with an uncovered area)
• Solution
  – Do more sampling to increase the coverage level
  – Use scaffolder technology

4. Scaffolder
(figure: contigs A, X, B and C, with B' and C' as differently oriented versions of B and C)
• Scaffold: given a set of non-overlapping contigs, order and orient them to reconstruct the original DNA
• How? Is there any relationship that can be built between different contigs?

4. Scaffolder: Mate Pairs
• Mate pairs:
  – The sequenced ends face towards each other
  – The distance between the two fragments is known (insert size minus fragment size)
  – Mate pairs are extremely valuable during the scaffold step
(figure: a mate pair)

4. Scaffolder: Method
• The scaffolder retrieves the original mate pairs that span different contigs
• It uses the link information of the pairs (distance, orientation) to orient the contigs and estimate the gap size; this is called a "walk"
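The gap estimate from mate-pair links is simple arithmetic once the contigs are oriented. The sketch below makes several assumptions that are not in the slides: both contigs are already on the same strand, contig 1 lies to the left of contig 2, the insert size is measured from the outer end of one read to the outer end of its mate, and all names and numbers are hypothetical.

```python
from statistics import mean


def gap_from_link(insert_size: int, contig1_length: int,
                  start_in_contig1: int, end_in_contig2: int) -> int:
    """Gap implied by one mate pair spanning contig 1 -> contig 2.

    start_in_contig1: position in contig 1 where the left read begins.
    end_in_contig2:   position in contig 2 where the right read ends.
    """
    inside_contig1 = contig1_length - start_in_contig1
    return insert_size - inside_contig1 - end_in_contig2


def estimate_gap(links, contig1_length: int) -> float:
    """Average the per-link estimates; links are (insert, start, end) tuples."""
    return mean(gap_from_link(i, contig1_length, s, e) for i, s, e in links)


# Hypothetical links between a 5000 bp contig 1 and contig 2.
links = [(2000, 4200, 700), (4000, 2500, 1050)]
print(estimate_gap(links, contig1_length=5000))
```

A negative estimate would suggest that the two contigs actually overlap rather than being separated by a gap.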

4. Scaffolder: Example
(figure: Contig 1 and Contig 2 separated by a gap)

4. Scaffolder: Graph Representation
• Nodes: contigs
• Directed edges: constraints on the relative placement of contigs (relative order and relative orientation)
http://jbpc.mbl.edu/jbpc/GenomesMedia/10_14POP.PPT

5. Conclusion
• The whole genome sequencing process: Break -> Sequence -> Assemble -> Scaffolder
• A simple model: use the overlap graph to construct the shortest common superstring. However, it can't correctly model the assembly problem.

Conclusion: Repeats
• Repeat detection
  – pre-assembly: find fragments that belong to repeats
    statistically (most existing assemblers)
    repeat database (RepeatMasker)
  – during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  – post-assembly: find repetitive regions and potential misassemblies (REPuter, RepeatMasker)
• Repeat resolution
  – find DNA fragments belonging to the repeat
  – determine correct tiling across the repeat