An Eulerian path approach to DNA fragment assembly

An Eulerian path approach to DNA fragment assembly Presented by: Zohar Barak – zoharbarak@mail. tau. ac. il Dor Alon – doralon 1@mail. tau. ac. il PAVEL A PEVZNER, HAIXU TANG, AND MICHAEL S WATERMAN. AN EULERIAN PATH APPROACH TO DNA F R A G M E N T A S S E M B L Y. P R O C E E D I N G S O F T H E N A T I O N A L A C A D E M Y O F S C I E N C E S, 98(17): 9748{9753, 2001 1

What are we going to see • Background and motivation • De Bruijn Graph • Eulerian Path Approach • Error and data correction • Eulerian Superpath Problem • Using DB data • Results and conclusions 2

Some Terminology • L-tuples = k-mers = strings of constant size (l for l-tuple, k for k-mer) • EP or EPP – Eulerian Path or Eulerian Path Problem • ES or ESP – Eulerian Superpath or Eulerian Superpath Problem • NM – Neisseria meningitidis (used it’s sequence data for examples throughout the article) ◦ Was considered to be one of the most “difficult-to-assemble” and “repeat-rich” bacterial genomes completed • EULER – The assembler that uses the methods discussed in this article 3

Background and motivation – overlap-layout-consensus (OLC) • In the period of 20 years before the article was out (2001), fragment assembly in DNA sequencing followed the “overlap–layout–consensus” paradigm that was used in all available assembly tools. • The OLC algorithm has 3 steps (1) Overlap - build overlap graph (2) Layout - Bundle stretches of the overlap graph into contigs (3) Consensus - Pick most likely nucleotide sequence for each contig 4

Problems with OLC - Huge graph – Memory and build time - Long time to assemble * Ham. Path problem - Problem with repeats 5

6

Many problems encountered in OLC Suggesting a new approach: An Eulerian path approach. 7

Reminder of de Bruijn Graph 8

An easy example 9

Building the de Bruijn Graph • 10

Comparison 11

Eulerian Path Finding • We’ll solve the assembly problem by finding an Eulerian path to the graph • EP problem can be solved in linear time 12

Error Correction • Unlike OLC assemblers, EULER makes the consensus step – error correction, the first step. • If genome G is known, then error correction in a read s can be done by aligning the read s against G. • Catch 22! To assemble a genome, we’d like to correct errors in reads first, but to correct errors in reads, one has to assemble the genome first. 13

Error Correction – Bypassing Catch-22 • 14

Spectral Alignment Problem • Let T be a collection of l-tuples called a spectrum. A string s is called a T-string if all its l-tuples belong to T • Spectral Alignment Problem - Given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string. • Aligning a read with the set solid tuples changes the sets of weak and solid and l-tuples. • Iteratively aligning the set of all reads with the set of solid l-tuples gradually reduces the number of weak l-tuples and increase the number of solid l-tuples. • Leads to elimination of many read errors: • Usually, reads with poor alignments represents different errors. It is common to ignore these reads. • But we can do better! 15

Error Correction Problem • 16

Error Correction Problem • 17

Data correction or Data corruption? • However, this procedure isn’t perfect while deciding which nucleotide (for example A or T) is correct in a given l-tuple within a read. • If the correct nucleotide is A, but T is also present in some reads from the same region, we might assign T instead of A to all reads- introducing an error rather than to correct it. • These errors are easy to fix later, and it is much more important for the reads of the same region to be consistent thus reducing the complexity of the De-Bruijn graph. • For the NM sequencing project, EULER eliminated 234, 410 errors and introduced 1452 errors. 18

Back to graphs 19

Repeats • A path v 1 … vn in the de Bruijn graph is called a repeat if indegree(v 1) > 1, outdegree(vn) > 1, and indegree (vn) = outdegree(vi) = 1 for 1 ≤ i ≤ n − 1 • Edges entering the vertex v 1 are called entrances into a repeat, whereas edges leaving the vertex vn are called exits from a repeat 20

A repeat v 1 … vn and a system of paths overlapping with this repeat. Branching vertex Entrances Branching vertex Exists 21

Tangles • A repeat is called a tangle if there is no read-path containing this repeat. • Tangles create problems in fragment assembly, because pairings of entrances and exits in a tangle cannot be resolved via the analysis of read-paths. • To address this issue, we formulate the following generalization of the Eulerian Path Problem: Eulerian Superpath Problem 22

Eulerian Superpath Problem (ESP) • 23

Eulerian Superpath Problem (ESP) • 24

• Case: no multiple edges Let x = (vin, vmid) and y = (vmid, vout) be two consecutive edges in graph G, and let �� x, y be a collection of all paths from �� that include both these edges as a subpath. • Define �� that end with x and �� →x as a collection of paths from �� y→ as a collection of paths from �� that start with y. 25

Case: no multiple edges • The x, y-detachment is a transformation that adds a new edge z = (vin, vout) and deletes the edges x and y from G • This detachment alters the system of paths �� as follows: I. III. • substitute z for x, y in all paths from �� x, y substitute z for x in all paths from �� →x substitute z for y in all paths from �� y→ Because every detachment reduces the number of edges in G, the detachments will eventually shorten all paths from �� to single edges and will reduce the Eulerian Superpath Problem to the Eulerian Path Problem. 26

Case: got multiple edges • in the case of graphs with multiple edges, the detachment procedure may lead to errors, because “directing” all paths from the set �� →x through a new edge z may not be an equivalent transformation. • In this case, the edge x may be visited many times in the Eulerian path, and it may or may not be followed by the edge y on some of these visits. 27

• For illustration purposes, let us consider a simple case when the vertex vmid has the only incoming edge x = (vin, vmid) with multiplicity 2 and two outgoing edges y 1 = (vmid, vout 1) and y 2 = (vmid, vout 2), each with multiplicity 1. • In this case, the Eulerian path visits the edge x twice; in one case, it is followed by y 1, and in another case, it is followed by y 2. • Consider an x, y 1 -detachment that adds a new edge z = (vin, vout 1) after deleting the edge y 1 and one of two copies of the edge x. • This detachment: (i) (ii) shortens all paths in �� x, y 1 by substitution of x, y 1 by a single edge z substitutes z for y 1 in every path from �� y 1→. 28

• If �� →x is not empty, it is not clear whether the last edge of a path P ∈ �� →x should be assigned to the edge z or to the (remaining copy of) edge x. • To resolve this dilemma, one has to analyze every path P ∈ �� →x and decide whether it “relates” to �� x, y 1 (in this case, it should be directed through z) or to �� x, y 2 (in this case, it should be directed through x). By “relates” to �� x, y 1 (�� x, y 2), we mean that every Eulerian superpath visits y 1 (y 2) immediately after visiting P. 29

More definitions… • Two paths are called consistent if their union is a path again. • A path P is consistent with a set of paths �� if it is consistent with all paths in �� and inconsistent otherwise (i. e. , if it is inconsistent with at least one path in �� ). 30

• There are three possibilities: ◦ (a) P is consistent with exactly one of the sets �� x, y 1 and �� x, y 2 ◦ (b) P is inconsistent with both �� x, y 1 and �� x, y 2 ◦ (c) P is consistent with both �� x, y 1 and �� x, y 2. 31

Case (a): Consistent with one • In the first case, the path P is called resolvable, because it can be unambiguously related to either �� x, y 1 or �� x, y 2. • An edge x is called resolvable if all paths in �� →x are resolvable. • If the edge x is resolvable, then the described x, ydetachment is an equivalent transformation after the correct assignments of last edges in every path from �� →x. • In the analysis of the NM project, the researchers found that 18, 026 among 18, 962 edges in the de Bruijn graph are resolvable. 32

Case (b): Inconsistent with both • The secondition implies that the Eulerian Superpath Problem has no solution, because P, �� x, y 1, and �� x, y 2 impose three different scenarios for just two visits of the edge x. • After discarding the poor-quality and chimeric reads, the researchers did not encounter this condition in the NM project. 33

Case (c): Consistent with both • The last condition (P is consistent with both �� x, y 1 and �� x, y 2) corresponds to the most difficult situation. • If this condition holds for at least one path in �� →x , the edge x is called unresolvable, and we postpone analysis of this edge until all resolvable edges are analyzed. • The researchers observed that equivalent transformation of other resolvable edges often resolves previously unresolvable edges. • However, some edges cannot be resolved even after the detachments of all resolvable edges are completed. • Such situations usually correspond to tangles, and they have to be addressed by another equivalent transformation called a cut. 34

Cuts • Consider a fragment of graph G with 5 edges and 4 paths y 3 − x, y 4 − x, x − y 1, and x − y 2. • In this symmetric situation, x is a tangle, and there is no information available to relate any of paths y 3 − x and y 4 − x to any of paths x − y 1 and x − y 2. • An edge x = (v, w) is removable if: ◦ (i) it is the only outgoing edge for v and the only incoming edge for w ◦ (ii) x is either the initial or the terminal edge for every path P ∈ �� containing x. • An x-cut transforms �� into a new system of paths by simply removing x from all paths in �� →x and �� x→ without affecting the graph G itself. • An x-cut is an equivalent transformation if x is a removable edge. 35

• Detachments and cuts proved to be powerful techniques to untangle the de Bruijn graph and to reduce the fragment assembly to the Eulerian Path Problem for all studied bacterial genomes. • However, there is still a gap in theoretical analysis of the Eulerian Superpath Problem in the case when the system of paths is not amenable to either detachments or cuts. 36

Using DB data • 37

Using DB data • 38

Using DB data • 39

Using DB data • 40

Last case example • Red mate-pair corresponds to 2 different paths • Blue corresponds to a unique path. • Paths generated lead to resolution of both tangles 41

• Insert length is the length of the sequence in between a pair of reads. • In the NM project (with insert length up to 1, 800), EULER left only 5 unresolved tangles of length 3, 610, 3, 215, 2, 741, 2, 503, and 1, 831. • After completing EULER-DB, we build scaffolds by using DB data as “bridges” between different contigs (EULER-SF). EULER-SF combines the 91 contigs into 60 scaffolds, thus closing most gaps that are shorter than the insert length 42

43

More results 44

45

Conclusions • EULER bypasses the “repeat problem, ” because the Eulerian Superpath approach transforms most repeats into different paths in the de Bruijn graph. As a result, EULER does not even notice repeats unless they are long perfect repeats. • Tangles caused by long repeats may be resolved by DB information (EULER-DB) • EULER has excellent scaling potential for eukaryotic genomes, because there exist linear-time algorithms for the Eulerian Path Problem. 46

Thank you for listening! 47