CSCI 2950-C: Genomes, Networks, and Cancer. Computability of Models for Sequence Assembly
Outline • Some Terminology • Other Algorithms • Assembly of Double-Stranded DNA with Bidirected Flow – Chinese Postman Problem ~ Eulerization Problem – Bidirected De Bruijn Graph • Discussion
String Terminology Let v and w be two strings over the alphabet Σ • v·w : concatenation of v and w • |v| : length of v • v[i] : ith character of v • v[i, j] : substring of v beginning at the ith character and ending at the jth character • v^k : v concatenated with itself k times • ∃ i, j s.t. v = w[i, j] : v is a substring of w
String Terminology • A string of length k is called a k-mer • The set of all k-mers that are substrings of v is called the k-spectrum of v • A pair of reverse-complement k-mers is called a k-molecule
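The three definitions above can be made concrete with a minimal sketch. The function names are illustrative and not from the slides, and the sketch assumes both strands are read 5'→3' over the alphabet A, C, G, T, so a k-molecule is stored as the unordered pair of a k-mer and its reverse complement.

```python
def reverse_complement(s: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(s))


def k_spectrum(v: str, k: int) -> set[str]:
    """All k-mers that occur as substrings of v."""
    return {v[i:i + k] for i in range(len(v) - k + 1)}


def k_molecules(v: str, k: int) -> set[frozenset[str]]:
    """Each k-mer of v paired with its reverse complement (a k-molecule)."""
    return {frozenset((z, reverse_complement(z))) for z in k_spectrum(v, k)}


spectrum = k_spectrum("ATTGCCAAC", 3)
molecules = k_molecules("ATTGCCAAC", 3)
```

Using a `frozenset` makes the two orientations of the same molecule compare equal, which is the property the k-molecule definition is after.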
Graph Terminology
The String Graph Framework & The De Bruijn Graph Framework: NP-hard
Assembly of Double-Stranded DNA with Bidirected Flow
Recall that... Given a weighted bidirected graph G: • Chinese Walk ~ a cyclical walk that traverses each edge at least once • Chinese Postman Problem (CPP) ~ finding a minimum-weight Chinese Walk of G or reporting the non-existence of such a walk • Eulerization Problem (EP) ~ finding a minimum-weight Eulerian Extension of G or reporting the non-existence of such an extension
Theorem - 1 Given a bidirected graph G, G contains an Eulerian tour if and only if it is connected and balanced
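Theorem 1 turns the existence question into two cheap checks. The sketch below is one way to code them under stated assumptions: an edge is encoded as a hypothetical tuple `(u, sign_u, v, sign_v)` with `+1` for a positive-incident endpoint and `-1` for a negative-incident one, a vertex counts as balanced when its signs sum to zero, and connectivity is tested on the underlying undirected graph (the slides do not spell out which notion of connectivity is meant).

```python
from collections import defaultdict


def is_balanced(edges):
    """True if, at every vertex, positive and negative incidences cancel."""
    net = defaultdict(int)
    for u, sign_u, v, sign_v in edges:
        net[u] += sign_u
        net[v] += sign_v
    return all(total == 0 for total in net.values())


def is_connected(vertices, edges):
    """Depth-first search on the underlying undirected graph."""
    vertices = set(vertices)
    if not vertices:
        return True
    adj = defaultdict(set)
    for u, _, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen >= vertices


def has_eulerian_tour(vertices, edges):
    """Theorem 1: an Eulerian tour exists iff connected and balanced."""
    return is_connected(vertices, edges) and is_balanced(edges)


# A 3-cycle in which every vertex has one positive and one negative incidence.
cycle = [("A", +1, "B", -1), ("B", +1, "C", -1), ("C", +1, "A", -1)]
# A single edge leaves both endpoints unbalanced.
path = [("A", +1, "B", -1)]
```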
Theorem - 2 Given a weighted bidirected graph G, (1) there exists a Chinese walk of weight i if and only if (2) there exists an Eulerian extension of weight i
Proof (1 ⇒ 2) W : a Chinese walk in G • Construct a new graph W2, induced by W, in which the multiplicity of each edge is the number of times it is traversed by W
Proof (1 ⇒ 2) • W visits every edge of G at least once ⇒ W2 is an extension of G • W visits every edge of W2 exactly once ⇒ W is an Eulerian circuit of W2 • Together: W2 is an Eulerian extension of G
Proof (2 ⇒ 1) G2 : an Eulerian extension of G
Proof (2 ⇒ 1) W2 : an Eulerian circuit in G2 with weight i • W2 : A a B b C c B b C f D e C g A d D e C g A (upper-case letters are vertices, lower-case letters are edges)
Proof (2 ⇒ 1) • W2 : A a B b C c B b C f D e C g A d D e C g A • Construct W from W2 by replacing every edge e' ∈ G2 by the edge e ∈ G of which e' is a duplicate • W : A a B b C c B b C f D e C g A d D e C g A
Proof (2 ⇒ 1) W : A a B b C c B b C f D e C g A d D e C g A • W is a cyclical walk in G that traverses every edge at least once, and its weight is the same as the weight of W2, namely i ⇒ W is a Chinese Walk with weight i
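Both directions of the proof are bookkeeping on edge multiplicities. The sketch below counts how often the example walk uses each edge and derives how many duplicate copies an Eulerian extension would need. It assumes the walk alternates vertices (upper case) and edges (lower case), which is a reading of the slide notation rather than something the slides state.

```python
from collections import Counter

# The example walk, alternating vertex, edge, vertex, ...
walk = "A a B b C c B b C f D e C g A d D e C g A".split()

# Edges sit at the odd positions of the alternating sequence.
edge_labels = walk[1::2]

# Multiplicity of each edge in the graph induced by the walk.
multiplicity = Counter(edge_labels)

# An edge traversed m times needs m - 1 duplicate copies in the extension.
extra_copies = {e: m - 1 for e, m in multiplicity.items() if m > 1}

# The walk is cyclical: it starts and ends at the same vertex.
is_cyclical = walk[0] == walk[-1]
```

Here `extra_copies` comes out as one additional copy each for edges b, e, and g, and every other edge is used exactly once.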
Given a weighted bidirected graph G: • Theorem - 1: G contains an Eulerian tour if and only if it is connected and balanced • Theorem - 2: There exists a Chinese walk of weight i if and only if there exists an Eulerian extension of weight i
A Polynomial Time Algorithm for CPP ~ based on Theorems 1 & 2 Given a weighted bidirected graph G: • If G is not connected, every extension of G is also not connected ⇒ no Chinese Walk exists
A Polynomial Time Algorithm for CPP • If G is connected, formulate EP as a min-cost bidirected flow problem as follows (G' is the desired extension of G): • Constants w_e : weight of edge e • Variables f_e : number of additional copies of edge e required to extend G to G'
A Polynomial Time Algorithm for CPP Constraints ~ using Theorem 1 • Constraint 1, for each vertex x : x is balanced in G' • Constraint 2, for each edge e : f_e is a non-negative integer
A Polynomial Time Algorithm for CPP Integer Programming Model:
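One way to write the model down, using only the constants w_e, the variables f_e, and the balance requirement of Theorem 1. The symbol I(x, e) is introduced here as an assumption: it denotes the net signed incidence of edge e at vertex x (+1 for each positive-incident endpoint, -1 for each negative-incident endpoint, 0 if e does not touch x). The exact form on the original slide may differ.

```latex
\begin{align*}
\text{minimize}\quad
  & \sum_{e \in E} w_e \, f_e \\
\text{subject to}\quad
  & \sum_{e \in E} I(x, e)\,(1 + f_e) = 0
  && \text{for each vertex } x \quad \text{(Constraint 1: } x \text{ is balanced in } G'\text{)} \\
  & f_e \ge 0, \quad f_e \in \mathbb{Z}
  && \text{for each edge } e \quad \text{(Constraint 2)}
\end{align*}
```

The factor (1 + f_e) counts the original copy of e plus its f_e duplicates, so Constraint 1 says exactly that G' is balanced, and the objective charges only for the inserted copies.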
A Polynomial Time Algorithm for CPP Soundness of the Algorithm • G is connected ⇒ G' is connected • Constraint 1 ⇒ G' is balanced • Together, by Theorem 1 ⇒ G' is Eulerian
A Polynomial Time Algorithm for CPP Is G' a min-weight Eulerian extension? Yes! The objective function minimizes the total weight of the inserted edges
A Polynomial Time Algorithm for CPP Pseudo-code
  IF G is not connected
    RETURN "no Chinese walk exists"
  ELSE
    Solve EP as a Min-Cost Flow Problem
    IF there is no feasible solution
      RETURN "no Chinese walk exists"
    ELSE
      Construct G'
      Find an Eulerian circuit of G'
      Find the corresponding Chinese walk
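The Eulerization step of the pseudo-code can be illustrated on tiny inputs without a flow solver. The sketch below is a brute-force stand-in, not the polynomial-time min-cost bidirected flow algorithm: it enumerates candidate values of f_e up to a hypothetical cap `max_copies` and keeps the cheapest balanced assignment. The edge encoding `(u, sign_u, v, sign_v)` is an assumption, and the caller is expected to have checked connectivity first.

```python
from itertools import product


def min_weight_eulerization(edges, weights, max_copies=2):
    """Exhaustively search for the cheapest balanced extension.

    edges   : list of (u, sign_u, v, sign_v), signs are +1 or -1
    weights : list of w_e in the same order as edges
    Returns (cost, f), where f[e] is the number of extra copies of edge e,
    or None if no balanced extension exists within max_copies per edge.
    Runtime is exponential in the number of edges.
    """
    best = None
    for f in product(range(max_copies + 1), repeat=len(edges)):
        net = {}
        for (u, sign_u, v, sign_v), extra in zip(edges, f):
            # Each edge contributes its original copy plus `extra` duplicates.
            net[u] = net.get(u, 0) + sign_u * (1 + extra)
            net[v] = net.get(v, 0) + sign_v * (1 + extra)
        if all(total == 0 for total in net.values()):
            cost = sum(w * extra for w, extra in zip(weights, f))
            if best is None or cost < best[0]:
                best = (cost, f)
    return best


# Two edges leave A for B and one returns, so the return edge must be doubled.
edges = [("A", +1, "B", -1), ("A", +1, "B", -1), ("B", +1, "A", -1)]
weights = [1, 1, 5]
result = min_weight_eulerization(edges, weights)

# A single edge can never be balanced, whatever its multiplicity.
infeasible = min_weight_eulerization([("A", +1, "B", -1)], [1])
```

When the search returns `None` for the chosen cap, that matches the "no feasible solution" branch only up to that cap; the flow formulation has no such limit.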
A Polynomial Time Algorithm for CPP Running Time? O(|E|^2 log^2(|V|))
A Polynomial Time Algorithm for CPP Integer Programming Model:
A Polynomial Time Algorithm for CPP Optimal Solution: f_b = 1, f_e = 1, f_g = 1; all other variables are zero
A Polynomial Time Algorithm for CPP ⇒ A Polynomial Time Algorithm for Sequence Assembly?
A Polynomial Time Algorithm for Sequence Assembly Input: k-molecule spectrum of the genome (each k-molecule shown as top strand/bottom strand): ATT/TAA, TTG/AAC, TGC/ACG, GCC/CGG, CCA/GGT, CAA/GTT, AAC/TTG
A Polynomial Time Algorithm for Sequence Assembly In each k-molecule, arbitrarily label one k-mer as positive and the other as negative: -ATT/TAA+, -TTG/AAC+, +TGC/ACG-, +GCC/CGG-, +CCA/GGT-, +CAA/GTT-, +AAC/TTG-
A Polynomial Time Algorithm for Sequence Assembly Construct nodes from all possible (k-1)-molecules • For every k-molecule in the spectrum, let z be one of its two k-mers • Let x and y be the (k-1)-mers z[1..k-1] and z[2..k], respectively • Nodes: -AT/TA+, +AC/TG-, -TT/AA+, +CA/GT-, +TG/AC-, +CC/GG-, +GC/CG-
A Polynomial Time Algorithm for Sequence Assembly Insert one edge from x to y for each k-molecule, according to the following criteria: • The edge is positive-incident to x if x is from the positive strand, and negative-incident otherwise • The edge is negative-incident to y if y is from the positive strand, and positive-incident otherwise • Edges are inserted for the k-molecules in turn: -ATT/TAA+, -TTG/AAC+, +TGC/ACG-, +GCC/CGG-, +CCA/GGT-, +CAA/GTT-, +AAC/TTG-
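The node and edge rules can be sketched in a few lines. The slides label strands arbitrarily; the sketch instead assumes the lexicographically smaller of a (k-1)-mer and its reverse complement is the positive strand, so that the labeling is reproducible. The helper names and the edge tuple `(node_x, sign_x, node_y, sign_y)` are illustrative, and the resulting labels need not match the ones drawn on the slides.

```python
def reverse_complement(s: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[c] for c in reversed(s))


def canonical(mer: str) -> str:
    """Representative of a molecule; assumed here to be its positive strand."""
    return min(mer, reverse_complement(mer))


def bidirected_de_bruijn_edges(kmers):
    """Build one bidirected edge per k-mer z, from z[1..k-1] to z[2..k]."""
    edges = []
    for z in kmers:
        x, y = z[:-1], z[1:]
        # Positive-incident to x if x is on the positive strand, else negative.
        sign_x = +1 if x == canonical(x) else -1
        # Negative-incident to y if y is on the positive strand, else positive.
        sign_y = -1 if y == canonical(y) else +1
        edges.append((canonical(x), sign_x, canonical(y), sign_y))
    return edges


# The first three 3-mers of ATTGCCAAC, one k-mer per k-molecule.
edges = bidirected_de_bruijn_edges(["ATT", "TTG", "TGC"])
```

A (k-1)-mer that equals its own reverse complement, such as AT or GC, is always treated as positive under this convention. In the example, the walk enters node AA through a positive-incident edge and leaves through a negative-incident one, which is the alternation a valid bidirected walk needs.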
A Polynomial Time Algorithm for Sequence Assembly [Figure: the resulting bidirected De Bruijn graph on the nodes -AT/TA+, +AC/TG-, -TT/AA+, +CA/GT-, +TG/AC-, +CC/GG-, +GC/CG-]
A Polynomial Time Algorithm for Sequence Assembly How to read the sequence? • If a positive-incident edge is used to enter the node, read the negative k-mer • If a negative-incident edge is used to enter the node, read the positive k-mer
A Polynomial Time Algorithm for Sequence Assembly Sequence read from the graph: ATTGCCAAC
Future Work... • NP-hardness? • Optimal solution? ~ parsimony assumption
Thanks... Any questions / comments?