
A 2½-Approximation Algorithm for Shortest Superstring. Sweedyk, Z. SIAM Journal on Computing, Vol. 29, No. 3, 1999, pp. 954-986. Speaker: Chuang-Chieh Lin. Advisor: R. C. T. Lee. National Chi-Nan University.

Outline • Introduction • Basic definitions • String functions • The approximation algorithm • The upper bound • The lower bound • Conclusion


Introduction • Let S = {s1, s2, …, sn} be a set of strings. A superstring of S is a string containing each si as a contiguous substring. • The shortest superstring problem is to find a minimum-length superstring of the input set S. • This problem has important applications in computational biology and in data compression.

For example, if S = {ab, bcd, de, abc}, then abcde is a superstring of S of length 5 and abcabcde is a superstring of S of length 8.
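The definition of a superstring can be checked mechanically. The following is a minimal sketch; the function name `is_superstring` is chosen here for illustration and does not come from the paper.

```python
def is_superstring(candidate, strings):
    """Return True if every string occurs in candidate as a contiguous substring."""
    return all(s in candidate for s in strings)


S = ["ab", "bcd", "de", "abc"]
for candidate in ("abcde", "abcabcde", "abcd"):
    # Print each candidate with its length and whether it covers all of S.
    print(candidate, len(candidate), is_superstring(candidate, S))
```

The third candidate, abcd, is included only to show a string that fails the check because it does not contain de.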


Basic definitions • Let’s introduce some basic definitions.

Overlap • Let s and t be two strings. If a suffix f of s and a prefix p of t are the same string, then we call f (or p) an overlap of s with respect to t. • For example, for s = cabab and t = babcba, bab is an overlap of s with respect to t.

OV(s, t) is the set of overlaps of s with respect to t. The empty string ε is always an overlap; a whole string is not counted as an overlap of itself. For example, for s = cabab and t = bababa: OV(s, t) = {ε, b, bab}, OV(s, s) = {ε}, OV(t, t) = {ε, ba, baba}, OV(t, s) = {ε}.
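The set OV(s, t) can be enumerated directly from the definition. This is a sketch under one assumption that the example values suggest: an overlap must be strictly shorter than both strings. The name `overlaps` is hypothetical.

```python
def overlaps(s, t):
    """Return OV(s, t) as a list, shortest first.

    Assumption: an overlap is a proper suffix of s and a proper prefix of t,
    so the empty string always qualifies and a whole string never does.
    """
    result = []
    for k in range(min(len(s), len(t))):
        # s[len(s) - k:] is the suffix of s of length k (empty when k == 0).
        if s[len(s) - k:] == t[:k]:
            result.append(t[:k])
    return result


s, t = "cabab", "bababa"
print(overlaps(s, t))  # overlaps of s with respect to t
print(overlaps(t, t))  # note that ba, as well as baba, is a suffix and a prefix of bababa
```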

ov(s, t), pref(s, t) and suff(s, t) • We use ov(s, t) to denote the longest string in OV(s, t); pref(s, t) and suff(s, t) denote the prefix of s and the suffix of t that remain once ov(s, t) is removed, so that s = pref(s, t) ov(s, t) and t = ov(s, t) suff(s, t). • Furthermore, we use δs to denote pref(s, s). • For example, for u1 = cabab and u2 = bababa, ov(u1, u2) = bab, so pref(u1, u2) = ca and suff(u1, u2) = aba.
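The three functions can be sketched in a few lines. The Python names mirror the slide notation; restricting to proper overlaps is an assumption carried over from the OV examples.

```python
def ov(s, t):
    """Longest string that is a proper suffix of s and a proper prefix of t."""
    for k in range(min(len(s), len(t)) - 1, -1, -1):
        if s[len(s) - k:] == t[:k]:
            return t[:k]
    return ""


def pref(s, t):
    """Prefix of s left after removing ov(s, t) from its end."""
    return s[:len(s) - len(ov(s, t))]


def suff(s, t):
    """Suffix of t left after removing ov(s, t) from its start."""
    return t[len(ov(s, t)):]


u1, u2 = "cabab", "bababa"
print(ov(u1, u2), pref(u1, u2), suff(u1, u2))
```

By construction, `pref(s, t) + ov(s, t)` gives back s and `ov(s, t) + suff(s, t)` gives back t.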

Distance/overlap graph • Let S be a set of strings. The distance/overlap graph GS is a complete digraph (with self-loops) with vertex set S; each edge of the graph is assigned a positive length as follows. • The edge e from s to t has length |e| = |pref(s, t)|.

For example, S = {u0, u1, u2}, where u0 = ababc, u1 = cabab, u2 = bababa. The graph GS has the following edge lengths: u0→u0: 5, u0→u1: 4, u0→u2: 5, u1→u0: 1, u1→u1: 5, u1→u2: 2, u2→u0: 3, u2→u1: 6, u2→u2: 2.
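The edge lengths of GS for this example can be recomputed from |e| = |pref(s, t)|. The sketch below stores the graph as a dictionary keyed by ordered pairs; the helper names `ov_len` and `distance_graph` are illustrative only.

```python
from itertools import product


def ov_len(s, t):
    """Length of the longest proper overlap of s with respect to t."""
    for k in range(min(len(s), len(t)) - 1, -1, -1):
        if s[len(s) - k:] == t[:k]:
            return k
    return 0


def distance_graph(strings):
    """Map every ordered pair (s, t), self-loops included, to |pref(s, t)|."""
    return {(s, t): len(s) - ov_len(s, t) for s, t in product(strings, repeat=2)}


u0, u1, u2 = "ababc", "cabab", "bababa"
G = distance_graph([u0, u1, u2])
for (s, t), length in G.items():
    print(f"{s} -> {t}: {length}")
```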

The distance/overlap multigraph gS • We define the overlap of an edge e from s to t of GS as ov(e) = ov(s, t). • The distance/overlap multigraph gS for S is constructed out of the distance/overlap graph: for every pair s, t in S and every v in OV(s, t), gS has an edge from s to t with length |s| − |v| and overlap |v|.

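For example, S = {u0, u1, u2}, with u0 = ababc, u1 = cabab, u2 = bababa. We use “m, n” to denote the length and the overlap of an edge. [Figure: part of the multigraph gS, showing the edges u1→u0 (1, 4), u0→u1 (4, 1), u2→u0 (3, 3), u1→u2 (2, 3) and u2→u2 (2, 4), together with zero-overlap edges labelled 5, 0 and 6, 0.]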
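The multigraph has one edge per overlap rather than one edge per pair. A minimal sketch follows, assuming that only proper overlaps count and representing each edge as a hypothetical tuple (s, t, length, overlap).

```python
from itertools import product


def overlap_lengths(s, t):
    """Lengths of all proper overlaps of s with respect to t, including 0."""
    return [k for k in range(min(len(s), len(t))) if s[len(s) - k:] == t[:k]]


def multigraph_edges(strings):
    """One edge (s, t, length, overlap) for every overlap in OV(s, t)."""
    return [
        (s, t, len(s) - k, k)
        for s, t in product(strings, repeat=2)
        for k in overlap_lengths(s, t)
    ]


u0, u1, u2 = "ababc", "cabab", "bababa"
for s, t, length, overlap in multigraph_edges([u0, u1, u2]):
    print(f"{s} -> {t}: {length}, {overlap}")
```

Each printed label has the same "length, overlap" form as the edge labels in the figure.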

• Why is the above graph useful? • Consider the Hamiltonian path u0 - u1 - u2. Its total overlap is 1 + 3 = 4. The corresponding superstring is ababcabababa (length 12). • Consider the Hamiltonian path u1 - u2 - u0. Its total overlap is 3 + 3 = 6. Its corresponding superstring is cababababc (length 10), an optimal solution.
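The correspondence between a Hamiltonian path and a superstring can be made concrete: walk along the path and append, for each step, the part of the next string that is not already covered by the longest overlap. The function name `merge_path` is an assumption of this sketch.

```python
def ov_len(s, t):
    """Length of the longest proper overlap of s with respect to t."""
    for k in range(min(len(s), len(t)) - 1, -1, -1):
        if s[len(s) - k:] == t[:k]:
            return k
    return 0


def merge_path(path):
    """Superstring obtained by merging consecutive strings with maximal overlap."""
    result = path[0]
    for prev, nxt in zip(path, path[1:]):
        result += nxt[ov_len(prev, nxt):]
    return result


u0, u1, u2 = "ababc", "cabab", "bababa"
for path in ([u0, u1, u2], [u1, u2, u0]):
    merged = merge_path(path)
    print(merged, len(merged))
```

The length of each result equals the total length of the three strings (16) minus the total overlap along the path.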

• Roughly speaking, we are interested in a cycle which covers all vertices with the largest sum of overlaps, or the smallest sum of lengths.

• We have oversimplified the problem, because there may well be more than one cycle in the cycle cover. • In this case, we have to combine cycles.

Cycle cover • A cycle cover of GS is a set of simple cycles that cover all the vertices of the graph.

For S = {u0, u1, u2} with u0 = ababc, u1 = cabab, u2 = bababa, the single cycle c = (u0, u1, u2), with edges u0→u1 (4, 1), u1→u2 (2, 3) and u2→u0 (3, 3), is a cycle cover of GS.

• The two cycles (u0, u1), with edges u0→u1 (4, 1) and u1→u0 (1, 4), and (u2), with the self-loop u2→u2 (2, 4), also form a cycle cover of GS.

• A cycle cover may consist of several cycles. [Figure: a multigraph on five vertices v0, …, v4 in which a red cycle and a blue cycle together form a cycle cover.]

• A minimum-length cycle cover CS* is a cycle cover of GS with the minimum sum of edge lengths. The greedy algorithm can be used to construct CS*.

• Since each cycle cover corresponds to several superstrings, a minimum-length cycle cover corresponds to a rather short superstring.

• For example, let S = {v0, v1, v2, v3, v4} with v0 = aggtt, v1 = gttaag, v2 = taagc, v3 = gcata, v4 = tacc. [Figure: the multigraph gS on v0, …, v4, with each edge labelled “length, overlap”.]

We run the greedy algorithm to construct CS* for v0 = aggtt, v1 = gttaag, v2 = taagc, v3 = gcata, v4 = tacc. [Figure sequence: the greedy algorithm selects edges of gS one at a time.]

Now CS* is as follows, where v0 = aggtt, v1 = gttaag, v2 = taagc, v3 = gcata, v4 = tacc: the cycle (v0, v1) with edges v0→v1 (2, 3) and v1→v0 (4, 2), the cycle (v2, v3) with edges v2→v3 (3, 2) and v3→v2 (3, 2), and the self-loop v4→v4 (4, 0).

• The superstrings corresponding to the cycles of this cycle cover are as follows. v0 - v1: aggttaag. v2 - v3: taagcata. v4: tacc. The superstring aggttaagcatacc can be obtained by merging these three strings, overlapping taag and then ta.

• Why do we use “cycles”?

Open • Let c = (s_0, s_1, …, s_{j−1}, s_0) be a cycle of GS. For any l with 0 ≤ l ≤ j − 1, the string pref(s_l, s_{l+1}) pref(s_{l+1}, s_{l+2}) … pref(s_{l+j−2}, s_{l+j−1}) s_{l+j−1}, where the indices are taken modulo j, is called an open of c.

• A cycle c may have many opens. We can regard opens as local superstrings.

For example, let u0 = ababc, u1 = cabab, u2 = bababa, c1 = (u2, u2) and c2 = (u0, u1, u0), where c2 has edges u0→u1 (4, 1) and u1→u0 (1, 4). Let x1 = bababa, x21 = ababcabab, x22 = cababc. Then x1 is an open of c1, and x21 and x22 are opens of c2.

• For any cycle c, an open is a Hamiltonian path of this cycle.

• For c in CS*, we denote by OP(c) the set of opens of c, and US* is the union of OP(c) over all cycles c in CS*.

For example, with u0 = ababc, u1 = cabab, u2 = bababa, c1 = (u2, u2) and c2 = (u0, u1, u0): OP(c1) = {bababa} and OP(c2) = {ababcabab, cababc}.
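The opens of a cycle can be listed by starting at each vertex in turn and merging the strings once around the cycle without using the edge back to the start. This is a sketch assuming every edge of the cycle uses the longest overlap; a cycle is given as a list of its vertices without repeating the first one, and the name `opens` is illustrative.

```python
def ov_len(s, t):
    """Length of the longest proper overlap of s with respect to t."""
    for k in range(min(len(s), len(t)) - 1, -1, -1):
        if s[len(s) - k:] == t[:k]:
            return k
    return 0


def opens(cycle):
    """Return the opens of a cycle, one per choice of starting vertex."""
    result = []
    for l in range(len(cycle)):
        rotated = cycle[l:] + cycle[:l]  # start the walk at vertex number l
        x = rotated[0]
        for prev, nxt in zip(rotated, rotated[1:]):
            x += nxt[ov_len(prev, nxt):]
        result.append(x)
    return result


c1 = ["bababa"]            # the self-loop cycle (u2, u2)
c2 = ["ababc", "cabab"]    # the cycle (u0, u1, u0)
print(opens(c1))
print(opens(c2))
print(min(opens(c2), key=len))  # the shortest open of c2
```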

• The first and last vertices of an open x are called, respectively, xfirst and xlast, and the edge <xlast, xfirst> is called the opening edge of x. An opening edge of x is an edge whose removal creates the open x. For example, <u2, u2> is the opening edge of x1 and <u1, u0> is the opening edge of x21.

Lemma 2.12 • Let c be a cycle. We denote by sop(c) the shortest open of c. If the minimum-length cycle cover CS* consists of a single cycle c, then sop(c) is a shortest superstring of S.

For example, for S = {u0, u1} with u0 = ababc and u1 = cabab, the cycle cover consisting of c2 = (u0, u1, u0), with edges (4, 1) and (1, 4), is a minimum-length cycle cover and consists of just one cycle. OP(c2) = {ababcabab, cababc}, so sop(c2) = cababc is a shortest superstring of u0 and u1.


String functions and lemmas • First, we should know the meaning of the expansion of a cycle or an edge.

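Expansion • Two edges e = <s, t, k> and e′ = <s, t, k′> of gS are versions of each other, and if k ≤ k′, we say that e is an expansion of e′. • For example, let s = bbcabba and t = abbabab, whose longest overlap abba gives the merged string bbcabbabab. • Let e = <s, t, 1> and e′ = <s, t, 4>. Since 1 ≤ 4, e is an expansion of e′.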

1-expansion • A cycle c′ is an expansion of c if every edge of c′ is an expansion of an edge in c. • An edge <s, t, k> is tight if k = |ov(s, t)| and loose otherwise. • We call a cycle c′ of gS a 1-expansion of c if c′ is an expansion of c and it has only one loose edge.

• When we refer to a 1-expansion of cx for x in US*, we mean that the only possible loose edge is <xlast, xfirst>. • For example, with u0 = ababc and u1 = cabab, the cycle with edges u0→u1 (4, 1) and u1→u0 (3, 2) is a 1-expansion of the cycle with edges u0→u1 (4, 1) and u1→u0 (1, 4).

• Here is an example with three strings in which merging two of them with a smaller overlap (an expansion) makes the final superstring covering all three strings shorter.

y1 = abcd, y2 = cdba, y3 = cdcdbaba. Case 1, without expansion: y1 and y2 are merged with their longest overlap cd into y12 = abcdba; merging y3 = cdcdbaba with y12 then gives y123 = cdcdbababcdba (length 13). Case 2, with expansion: y1 and y2 are merged with overlap 0 into y12 = abcdcdba; merging y12 with y3 = cdcdbaba then gives y123 = abcdcdbaba (length 10).
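The two cases can be replayed step by step. In this sketch, `merge` takes an explicit overlap length so that a loose (smaller) overlap can be chosen on purpose; both helper names are assumptions made for illustration.

```python
def ov_len(s, t):
    """Length of the longest proper overlap of s with respect to t."""
    for k in range(min(len(s), len(t)) - 1, -1, -1):
        if s[len(s) - k:] == t[:k]:
            return k
    return 0


def merge(s, t, k=None):
    """Merge s and t on an overlap of length k (the longest one if k is None)."""
    if k is None:
        k = ov_len(s, t)
    if s[len(s) - k:] != t[:k]:
        raise ValueError("s and t have no overlap of that length")
    return s + t[k:]


y1, y2, y3 = "abcd", "cdba", "cdcdbaba"

# Case 1: merge y1 and y2 tightly, then attach y3 in front.
tight = merge(y1, y2)
case1 = merge(y3, tight)

# Case 2: merge y1 and y2 with overlap 0, then attach y3 behind.
loose = merge(y1, y2, 0)
case2 = merge(loose, y3)

print(tight, case1, len(case1))
print(loose, case2, len(case2))
```

Giving up the overlap cd in the first merge pays off because the longer string abcdcdba overlaps y3 in six characters.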

• The above example shows that we have to consider some string functions to improve our solutions.

Pseudolength • Let x be a string in US* and let e′ be an expansion of ex. We denote the 1-expansion of cx corresponding to e′ as […], where […]. The quantity d|cx| is called the pseudolength of the edge e′ and d is called the normalized pseudolength of the edge.

• Intuitively, the pseudolength d|cx| measures the length lost when x is connected to another string y.

• For example, let u0 = ababc, u1 = cabab and c2 = (u0, u1, u0). Let x0 = ababcabab, an open of c2, so |x0| = 9; for the expansion <u1, u0, 2> of its opening edge, the overlap is 2.

Fact 3.5 • Let x be a string in US*. The 1-expansion […] exists for some d if and only if there is an expansion of ex with pseudolength d|cx|. • If e′ is an expansion of ex with pseudolength d|cx|, then d ≥ 1, with equality if and only if […].

• There exist certain 1-expansions of a cycle cx based on the string functions, lemmas and corollaries. • These string functions allow us to identify the expansions of cx. • The string functions show how any two strings overlap.

• We omit the details of the string functions and just give an example to illustrate what they do.

• For example, let’s take a look at the string function trade-off. • Let x and y be strings in US* with cx ≠ cy. The trade-off of x with respect to y, denoted tr(x, y), is defined as […].

• For example, with u0 = ababc, u1 = cabab, u2 = bababa, take x21 = ababcabab and x1 = bababa. Then ovmax(x1, x21) = 3, […] = 2 and |x1| = 6.

• From a lemma, a 1-expansion of cx corresponding to […] with pseudolength […] exists. For example, x1 = bababa.


The approximation algorithm • Before proceeding to the algorithm, we should understand an important idea: edge exchange.

Edge exchange and winning edge • Let C be a cycle cover and let e = <s, t> be an edge of GS. Assume e1 = <s, u> and e2 = <v, t> are, respectively, the out-edge of s and the in-edge of t in C. The edge exchange of e, denoted χ(C, e), is the cycle cover (C − {e1, e2}) ∪ {e, e3}, where e3 = <v, u>. And e is a winning edge if |χ(C, e)| < |C|.

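For example, take the cycle cover C consisting of the cycle (u0, u1, u2) with edges u0→u1 (4, 1), u1→u2 (2, 3) and u2→u0 (3, 3); the cycle length is 9. Exchanging the edge u1→u0 (1, 4) gives the cycle cover with edges u0→u1 (4, 1), u1→u0 (1, 4) and u2→u2 (2, 4), where u2 = bababa; the cycle length is 7, so u1→u0 is a winning edge.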
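An edge exchange is easy to express when a cycle cover is stored as a successor map, with one out-edge per vertex. The representation and the names `edge_exchange` and `cover_length` are assumptions of this sketch, which uses the longest overlap on every edge, as in GS.

```python
def ov_len(s, t):
    """Length of the longest proper overlap of s with respect to t."""
    for k in range(min(len(s), len(t)) - 1, -1, -1):
        if s[len(s) - k:] == t[:k]:
            return k
    return 0


def cover_length(succ):
    """Total length of a cycle cover given as a map vertex -> successor."""
    return sum(len(s) - ov_len(s, t) for s, t in succ.items())


def edge_exchange(succ, s, t):
    """Replace e1 = <s, u> and e2 = <v, t> by e = <s, t> and e3 = <v, u>."""
    u = succ[s]                                       # out-edge of s in C
    v = next(x for x, y in succ.items() if y == t)    # in-edge of t in C
    new = dict(succ)
    new[s] = t
    new[v] = u
    return new


u0, u1, u2 = "ababc", "cabab", "bababa"
C = {u0: u1, u1: u2, u2: u0}       # the single cycle (u0, u1, u2)
D = edge_exchange(C, u1, u0)       # exchange the edge <u1, u0>
print(cover_length(C), cover_length(D))
print(cover_length(D) < cover_length(C))  # True means <u1, u0> is a winning edge
```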

Another example, with v0 = aggtt, v1 = gttaag, v2 = taagc, v3 = gcata, v4 = tacc. We start from a cycle cover whose edges are labelled 2, 3; 6, 0; 5, 0; 4, 0 and 3, 2, so the cycle length is 20. An edge exchange with a winning edge labelled 3, 2 removes the edges labelled 5, 0 and 4, 0 and adds edges labelled 3, 2 and 4, 0.

The cycle length before edge exchange: 20. The cycle length after edge exchange: 18. Therefore, we reduced the cycle length.

Parsimonious edge exchange and losing edge • Let C be a cycle cover and let e = <s, t, k> be an edge of gS. Assume e1 = <s, u, j> and e2 = <v, t, l> are, respectively, the out-edge of s and the in-edge of t in C. The parsimonious edge exchange of e in C is the cycle cover (C − {e1, e2}) ∪ {e, e3}, where e3 = <v, u, max(0, j + l − k)>. And e3 is called a losing edge.

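For example, S = {u0, u1, u2}, u0 = ababc, u1 = cabab, u2 = bababa. Take the cycle cover C with edges u0→u1 (4, 1), u1→u2 (2, 3) and u2→u0 (3, 3); the cycle length is 9. The parsimonious edge exchange of the winning edge u1→u0 (1, 4) gives the cycle cover with edges u0→u1 (4, 1), u1→u0 (1, 4) and the losing edge u2→u2 (4, 2); the cycle length is still 9.

[Figure: the five-vertex example on v0, …, v4 with the winning edge and the losing edge marked.]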

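Lemma 2.2 • Let s, t, u and v be strings, and let ovk(s, t) denote an overlap of s with respect to t of length k. If ovk(s, t), ovl(s, u) and ovj(v, t) exist for k ≥ max(j, l), then ovm(v, u) exists for m = max(0, j + l − k). [Figure: the strings v, t, s and u aligned so that the overlaps of lengths j, k and l leave an overlap of length j + l − k between v and u.]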
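The statement of Lemma 2.2 can be sanity-checked by brute force on short strings. The sketch below is a test harness, not a proof; it assumes overlaps are proper (strictly shorter than both strings), and the function names are hypothetical.

```python
from itertools import product


def has_overlap(s, t, k):
    """True if s has a proper overlap of length k with respect to t."""
    return 0 <= k < min(len(s), len(t)) and s[len(s) - k:] == t[:k]


def lemma_holds(s, t, u, v):
    """Check Lemma 2.2 for every admissible triple (k, l, j) of overlap lengths."""
    for k in range(len(s)):
        if not has_overlap(s, t, k):
            continue
        for l in range(k + 1):            # l <= k
            if not has_overlap(s, u, l):
                continue
            for j in range(k + 1):        # j <= k
                if has_overlap(v, t, j) and not has_overlap(v, u, max(0, j + l - k)):
                    return False
    return True


def check_all(alphabet="ab", max_len=3):
    """Check the lemma on all quadruples of nonempty strings up to max_len."""
    words = ["".join(w) for n in range(1, max_len + 1)
             for w in product(alphabet, repeat=n)]
    return all(lemma_holds(s, t, u, v) for s, t, u, v in product(words, repeat=4))


# One instance from the running example: k = 3, l = 2, j = 2 gives m = 1.
print(lemma_holds("cabab", "bababa", "ababc", "bababa"))
print(check_all())
```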

The approximation algorithm • 1. Construct GS and find CS*. Compute US* and the string functions. • 2. Build the set of merging edges W. • 3. Let C = CS*. While W is nonempty do: let e = <s, t> be a minimum-overlap edge in W; if s and t are in different cycles of C, then C = χ(C, e); W = W − {e}. • 4. Set AOPTS to the concatenation of sop(c) for c in C.

For example, S = {u0, u1, u2}, where u0 = ababc, u1 = cabab, u2 = bababa. [Figure: part of the multigraph gS, showing the edges u1→u0 (1, 4), u0→u1 (4, 1), u2→u0 (3, 3), u1→u2 (2, 3) and u2→u2 (2, 4), together with zero-overlap edges labelled 5, 0 and 6, 0.]

With u0 = ababc, u1 = cabab, u2 = bababa, CS* is as follows: c1 = (u2, u2) with the self-loop (2, 4) and c2 = (u0, u1, u0) with edges u0→u1 (4, 1) and u1→u0 (1, 4). OP(c1) = {bababa}, OP(c2) = {ababcabab, cababc}, US* = {bababa, ababcabab, cababc}. Let x1 = bababa, x21 = ababcabab, x22 = cababc. x1 is an open of c1; x21 and x22 are opens of c2.

We begin the coloring action from the minimum-length cycle, c1 = (u2, u2); the other cycle is c2 = (u0, u1, u0), where u0 = ababc, u1 = cabab, u2 = bababa.

Now, we choose merging edges to merge the cycles. According to the construction algorithm of W, we choose the edge u1→u2 (2, 3) to merge c1 and c2. The edge exchange removes u1→u0 (1, 4) and u2→u2 (2, 4) and adds u1→u2 (2, 3) and u2→u0 (3, 3).

The result is a single cycle with edges u0→u1 (4, 1), u1→u2 (2, 3) and u2→u0 (3, 3). Let this cycle be cfinal.

With u0 = ababc, u1 = cabab, u2 = bababa and cfinal having edges u0→u1 (4, 1), u1→u2 (2, 3) and u2→u0 (3, 3): • At last, we find sop(cfinal). • OP(cfinal) = {ababcabababa (12), cababababc (10), babababcabab (12)}. • Therefore, sop(cfinal) = cababababc.

• Indeed, cababababc, with length 10, is an optimal solution. • The approximation algorithm finds an optimal solution in this case.


• Since the formal analyses of the upper bound and the lower bound are too complicated to present here, we describe the general strategy on simpler examples.

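The upper bound • Let S = {u0, u1, u2}, where u0 = ababc, u1 = cabab, u2 = baba. [Figure: part of the multigraph gS, showing the edges u1→u0 (1, 4), u0→u1 (4, 1), u2→u0 (1, 3), u1→u2 (2, 3) and u2→u2 (2, 2), together with zero-overlap edges labelled 5, 0 and 4, 0.]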

Note: u0 = ababc, u1 = cabab, u2 = baba. • CS* = {c1, c2}, where c1 = (u2, u2) with the self-loop (2, 2) and c2 = (u0, u1, u0) with edges (4, 1) and (1, 4). Let x0 = ababcabab, x1 = cababc, x2 = baba. x2 is an open of c1; x0 and x1 are opens of c2. |CS*| = 1 + 4 + 2 = 7.

Note: u0 = ababc, u1 = cabab, u2 = baba. • From the algorithm, we obtain AOPTS by merging ababcabab and baba, which overlap in bab: AOPTS = ababcababa, so |AOPTS| = 10. However, the optimal solution is OPTS = cabababc, with |OPTS| = 8.
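For inputs this small, the optimum quoted above can be confirmed by exhaustive search. The sketch below tries every ordering and merges neighbours with their longest overlap; it assumes that no input string is a substring of another (true for both examples in these slides), and it is a checking aid, not part of the approximation algorithm.

```python
from itertools import permutations


def ov_len(s, t):
    """Length of the longest proper overlap of s with respect to t."""
    for k in range(min(len(s), len(t)) - 1, -1, -1):
        if s[len(s) - k:] == t[:k]:
            return k
    return 0


def shortest_superstring(strings):
    """Brute-force shortest superstring of a small substring-free set."""
    best = None
    for order in permutations(strings):
        candidate = order[0]
        for prev, nxt in zip(order, order[1:]):
            candidate += nxt[ov_len(prev, nxt):]
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


opt = shortest_superstring(["ababc", "cabab", "baba"])
print(opt, len(opt))
print(len("ababcababa"))  # length of the solution found by the algorithm
```

The search runs through n! orderings, so it is only practical for a handful of strings.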

Note: u0 = ababc, u1 = cabab, u2 = baba. • Now, we make an expansion CU of CS*: CU has the edges u0→u1 (5, 0), u1→u0 (3, 2) and u2→u2 (4, 0).

And we make a parsimonious edge exchange for CU: exchanging the edge u1→u2 (2, 3) removes u1→u0 (3, 2) and u2→u2 (4, 0) and adds u2→u0 (4, 0).

Note: u0 = ababc, u1 = cabab, u2 = baba. The resulting cycle has edges u0→u1 (5, 0), u1→u2 (2, 3) and u2→u0 (4, 0). Its opens are {ababccababa (11), cababaababc (11), babaababccabab (14)}, so its shortest opens are ababccababa and cababaababc.

• So we obtain that: |CS*| = 7 ≤ |AOPTS| = 10 ≤ 11 ≤ 17.5 ≤ 20, where 11 is the length of the shortest open found above, 17.5 = 2.5 × 7 and 20 = 2.5 × 8.


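The lower bound • Let S = {u0, u1, u2}, where u0 = abc, u1 = cab, u2 = bababa; then gS is constructed as follows. [Figure: part of gS, showing the edges u1→u0 (1, 2), u0→u1 (2, 1), u1→u2 (2, 1), u2→u0 (5, 1) and u2→u2 (2, 4), together with zero-overlap edges labelled 3, 0 and 6, 0.]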

• Then we find a Hamiltonian cycle c = u0 - u1 - u2 of gS, with edges u0→u1 (2, 1), u1→u2 (2, 1) and u2→u0 (5, 1). • Clearly, c doesn’t contain <u2, u2, 2>.

• We find that <u2, u2, 2> is a winning edge for c. Let e = <u2, u2, 2>. We can make a cycle cover by a parsimonious edge exchange: the new cycle cover consists of c1 = (u0, u1) with edges u0→u1 (2, 1) and u1→u0 (3, 0), and c2 = (u2) with the self-loop (4, 2).

• The length of the local superstring of u1 to u0 is 2 + 3 + ov(u1, u0). Thus the cycle length 2 + 3 = 5 is a lower bound on the length of the local superstring of u1 to u0. • The global superstring has to consider the connection between u0 and u2. We may ignore this when we calculate the lower bound.

• Therefore, |CL| = 2 + 3 + 4 = 9.

• However, an optimal solution is cababababc, which has length 10, so |CL| = |OPTS| − 1.


Conclusion • Probably the most interesting open question in superstring study is whether the greedy method yields a 2-approximation. • Of course, the other important question in this area is whether OPTS can be approximated within a factor of 2 by any algorithm.

• We conjecture that our algorithm can be modified slightly and the analysis improved to prove a 2⅓ bound. • Unfortunately, the analysis becomes even more complicated and, perhaps worse, the algorithm becomes extremely complex.

• Actually, when I looked up related research, I found that the ratio had not been improved since this paper was published.

Thank you.

Happy Teacher’s Day

Greedy-cover algorithm • Let CS* = ∅. Order the edges of GS as e_1, e_2, …, e_{n²}, so that |e_1| ≤ |e_2| ≤ … ≤ |e_{n²}|. • For i = 1, …, n²: add e_i = <s, t> to CS* if s doesn’t have an out-edge and t doesn’t have an in-edge in CS*.
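The greedy-cover steps above can be sketched as follows. The cover is returned as a successor map (each vertex mapped to the head of its out-edge); this representation, the helper names and the tie-breaking by input order are assumptions of the sketch, and the input strings are assumed to be distinct.

```python
from itertools import product


def ov_len(s, t):
    """Length of the longest proper overlap of s with respect to t."""
    for k in range(min(len(s), len(t)) - 1, -1, -1):
        if s[len(s) - k:] == t[:k]:
            return k
    return 0


def greedy_cycle_cover(strings):
    """Scan the edges of GS by nondecreasing length and keep compatible ones."""
    edges = sorted(product(strings, repeat=2),
                   key=lambda e: len(e[0]) - ov_len(e[0], e[1]))
    succ = {}          # chosen out-edge of each vertex
    has_in = set()     # vertices that already have an in-edge
    for s, t in edges:
        if s not in succ and t not in has_in:
            succ[s] = t
            has_in.add(t)
    return succ


u0, u1, u2 = "ababc", "cabab", "bababa"
cover = greedy_cycle_cover([u0, u1, u2])
print(cover)
print(sum(len(s) - ov_len(s, t) for s, t in cover.items()))  # total length of the cover
```

On the running example this yields the two cycles (u0, u1, u0) and (u2, u2) used earlier as CS*.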