CAP 5510 Bioinformatics Sequence Comparison Tamer Kahveci CISE
CAP 5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida 1
Goals • Understand major sequence comparison algorithms. • Gain hands on experience 2
Why Compare Sequences ? • • • Prediction of function Construction of phylogeny Shotgun sequence assembly Finding motifs Understanding of biological processes 3
Question • Q = AATTCGA • X = ACATCGG • Y = CATTCGCC • Z = ATTCCGC • Form groups of 2 -3. Sort X, Y, and Z in decreasing similarity to Q. (5 min) 4
Dot Plot A A C A T T C G A How can we compute similarity? A T C G O(m+n) time Is it a good scheme ? Use longer subsequences (k-gram) G 5
Dot Plot A A T T C G A A C A T C G Use longer subsequences (k-gram) G 6
Alignment types 1/2 Global alignment: align entire sequences Local alignment: align subsequences 7
Alignment types 2/2 Dovetail alignment: align opposite ends Pattern search: align entire sequence to a subsequence 8
Global Alignment • Q = AATTCGA • |rr|||r • X = ACATCGG • Q = A-ATTCGA • |i|d|||r • X = ACA-TCGG • 4 match • 3 mismatch • • 5 1 1 1 match insert delete mismatch How can we evaluate the quality of an alignment? 9
Edit Distance Definition: Minimum number of insert / delete / replace operators to transform one sequence into the other. 1. AATTCGA 2. ACATTCGA 3. ACATCGA 4. ACATCGG • Q = A-ATTCGA • |i|d|||r • X = ACA-TCGG How do we find the minimum edit distance ? 10
Each Alignment Maps to a Path A A T T C G A – A T T C G A | i | d | | | r A C A – T C G G A C A T C G 11
Global sequence alignment (Needleman-Wunsch) • Compute distance recursively : dynamic programming. Case 0 : one string is empty (n) Case 1 : match (0) or mismatch (1) Case 2 : delete (1) Case 3 : insert (1) 12
Global sequence alignment (Needleman-Wunsch) • D(i, j) = edit distance between A(1: i) and B(1: j) • d(a, b) = 0 if a = b, 1 otherwise. • Recurrence relation – D(i, 0) = Σ d(A(k), -), 0 <= k <= i – D(0, j) = Σ d(-, B(k)), 0 <= k <= j – D(i, j) = Min { • D(i-1, j) + d(A(i), -), • D(i, j-1) + d(-, B(j)), • D(i-1, j-1) + d(A(i), B(j))} d A C G T - A 0 1 1 C 1 0 1 1 1 G 1 1 0 1 1 T 1 1 1 0 1 - 1 1 0 13
DP Example A A T T C G A C D(i, 0) = Σ d(A(k), -), 0 <= k <= i D(0, j) = Σ d(-, B(k)), 0 <= k <= j A T C D(i, j) = Min { • D(i-1, j) + d(A(i), -), • D(i, j-1) + d(-, B(j)), • D(i-1, j-1) + d(A(i), B(j))} G 14
DP Example: Backtracking A A T T C G 0 1 2 3 4 5 6 A 1 0 1 2 3 4 5 C 2 1 1 2 3 3 4 A 3 2 1 2 3 4 4 T 4 3 2 1 2 3 4 C 5 4 3 2 2 2 3 G 6 5 4 3 3 3 2 • O(mn) time and space • Reconstruct alignment • O(min{m, n}) space if alignment not needed. How ? 15
DP in Linear Space ? 16
Linear Space DP A A C A T T C G • Keep two vectors at a time: – Two columns or two rows • O(min{m, n}) space • O(mn) time • No backtracking C G 17
Linear Space DP with Backtracking • Find midpoint of the alignment – Align the first half – Align the second half – Choose the point with best sum of score/distance • Search the upper left and lower right of mid point 18
Linear Space DP with Backtracking: Time Complexity • • • 2(n/2 x m) = nm 2(n/4 x k) + 2(n/4 x (m-k)) = nm/2 … nm/2 i Adds up to 2 nm 19
Number of Alignments • N(n, m) = number of alignments of sequences of n and m letters (not necessarily optimal alignment). • N(0, i) = N(i, 0) = 1 • N(n, m) = N(n-1, m) + N(n, m-1) + N(n-1, m-1) • N(n, n) ~ (1 + 21/2)2 n+1 n-1/2. • N(1000, 1000) > 10767 • 1080 atoms in the universe ! 20
Edit Distance: a Good Measure? • Compare these two alignments. Which one is better ? Scoring scheme: • +1 for each match • -1 for each mismatch/indel Can be computed the same as edit distance by including +1 for each match • Q = AATTCGA • | ||| • X = ACATCGG • Q = A-ATTCGA • | | ||| • X = ACA-TCGG 21
DP Example – try again A A T T C G Scoring scheme: • +1 for each match • -1 for each mismatch/indel A C A T C G D(i, 0) = -i, 0 <= k <= i D(0, j) = -j, 0 <= k <= j D(i, j) = Max { • D(i-1, j) - 1, • D(i, j-1) - 1, • D(i-1, j-1) + d(A(i), B(j))} 22
More Trouble: Scoring Matrices • Different mutations may occur at different rates in nature. Why ? • E. g. , each amino acid = three nucleotides. Transformation of one amino acid to other due to single nucleotide modification may be biased – – E = GAA, GAG D = GAU, GAC F = UUU, UUC E similar to D, not similar to F • Mutation probability of different pairs of nucleotides may differ. • PAM, BLOSUM matrices 23
The BLOSUM 45 Matrix A R N D C Q E G H I L K M F P S T W Y V A 5 -2 -1 -1 -1 0 -2 -1 -1 -2 -1 1 0 -2 -2 0 R -2 7 0 -1 -3 1 0 -2 0 -3 -2 3 -1 -2 -2 -1 -1 -2 N -1 0 6 2 -2 0 0 0 1 -2 -3 0 -2 -2 -2 1 0 -4 -2 -3 D -2 -1 2 7 -3 0 2 -1 0 -4 -3 0 -3 -4 -1 0 -1 -4 -2 -3 C -1 -3 -2 -3 12 -3 -3 -3 -2 -2 -4 -1 -1 -5 -3 -1 Q -1 1 0 0 -3 6 2 -2 1 -2 -2 1 0 -4 -1 0 -1 -2 -1 -3 E -1 0 0 2 -3 2 6 -2 0 -3 -2 1 -2 -3 0 0 -1 -3 -2 -3 G 0 -2 0 -1 -3 -2 -2 7 -2 -4 -3 -2 -2 -3 -2 0 -2 -2 -3 -3 H -2 0 1 0 -3 1 0 -2 10 -3 -2 -1 0 -2 -2 -1 -2 -3 I -1 -3 -2 -4 -3 -2 -3 -4 -3 5 2 -3 2 0 -2 -2 -1 -2 0 3 L -1 -2 -3 -3 -2 -2 -2 -3 -2 2 5 -3 2 1 -3 -3 -1 -2 0 1 K -1 3 0 0 -3 1 1 -2 -1 -3 -3 5 -1 -3 -1 -1 -1 -2 M -1 -1 -2 -3 -2 0 -2 -2 0 2 2 -1 6 0 -2 -2 -1 -2 0 1 F -2 -2 -2 -4 -3 -3 -2 0 1 -3 0 8 -3 -2 -1 1 3 0 P -1 -2 -2 -1 -4 -1 0 -2 -2 -2 -3 -1 -2 -3 9 -1 -1 -3 -3 -3 S 1 -1 1 0 -1 0 0 0 -1 -2 -3 -1 -2 -2 -1 4 2 -4 -2 -1 T 0 -1 -1 -2 -2 -1 -1 -1 2 5 -3 -1 0 W -2 -2 -4 -4 -5 -2 -3 -2 -2 1 -3 -4 -3 15 3 -3 Y -2 -1 -2 -2 -3 -1 -2 -3 2 0 0 -1 0 3 -3 -2 -1 3 8 -1 V 0 -2 -3 -3 -1 -3 -3 3 1 -2 1 0 -3 -1 5 24
score(H, P) = -2, gap Penalty = – 8 H E A G A W G H E E 0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80 P -8 -2 A -16 W -24 H -32 E -40 A -48 E -56 25
Score(E, P) = 0, score(E, A) = -1, score(H, A) = -2 H E A G A W G H E E 0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80 P -8 -2 -8 A -16 -10 -3 W -24 H -32 E -40 A -48 E -56 26
HEAGAWGHE - P - - AW- HEAE Optimal alignment: H E A G A W G H E E 0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80 P -8 -2 -8 -16 -24 -33 -42 -49 -57 -65 -73 A -16 -10 -3 -4 -12 -19 -28 -36 -44 -52 -60 W -24 -18 -11 -6 -7 -15 -4 -12 -21 -29 -37 H -32 -14 -18 -13 -8 -9 -12 -6 -2 -11 -19 E -40 -22 -8 -16 -9 -12 -14 -6 4 -5 A -48 -30 -16 -3 -11 -12 -14 -4 2 E -56 -38 -24 -11 -6 -12 -14 -15 -12 -8 2 27
Distance v. s. Similarity • Similarity model: s(a, b), g’(k) • Distance model: d(a, b), g(k) If there is a constant c, such that – S(a, b) = c – d(a, b) – G’(k) = g(k) – kc/2 Then Similarity optimal alignment = distance optimal alignment 28
Banded Global Alignment • Two sequences differ by at most w edit operations (w<<n). • How can we align ? 29
Banded Alignment Example A C C A C 0 A C C A T A -2 A -4 -6 -2 1 -1 -3 -5 -4 -1 2 0 -2 -4 -6 -3 0 1 1 -1 -3 -5 -2 1 0 2 0 -2 -4 -1 0 1 1 1 -1 -3 0 -1 2 0 2 -2 -1 0 -1 2 • O(wn) time and space. • Example: – – w=3. Match = +1 Mismatch = -1 Indel = -2 30
Global Alignment ? • Q = A-ATTCGA • | | ||| • X = ACA-TCGG • Q = AATTCGA • ||||| • Y = CATTCGCC Which one is more similar to Q ? 31
Local alignment G C T G G A A G - G C A T T A | r | | d | | | T A C A A G C A C G Local alignment: highest scoring subsequence alignment. How can we find it ? Brute force: O(n 3 m 3) Gotoh (Smith-Waterman): O(nm) 32
Local Suffix Alignment X[1: i] Y[1: j] • V(i, 0) = v(0, j) = 0 • V(i, j) = max{0, v(i-1, j-1) + s(x(i), y(j)), v(i-1, j) + s(x(i), -) v(i, j-1) + s(-, y(j))} 33
Local Alignment • The prefixes with highest local suffix alignment 34
Local Alignment Example Match = +5 Mismatch = -4 P’s subsequence: G C A Q’s subsequence: G A A G – G C A Q --- P G C T G G A A G G C A T 0 0 0 G 0 5 1 0 5 5 1 0 0 C 0 1 10 6 2 1 1 0 1 1 10 6 2 A 0 0 6 6 2 0 6 15 11 G 0 5 2 2 11 7 3 11 11 A 0 1 1 0 7 7 11 8 7 7 3 8 7 G 0 5 11 7 7 13 12 8 4 4 C 0 0 10 6 2 7 7 3 9 8 17 13 9 A 0 0 6 6 2 3 11 12 8 5 13 22 18 C 0 0 5 2 2 0 7 8 8 4 18 18 18 G 0 5 1 1 7 7 5 4 13 13 14 14 14 35
Goals • Other important sequence comparison problems – – – – – end free search pattern search non-overlapping alignments gaps linear-space algorithms bitwise operations neighborhood searching NFAs Approximate alignment 36
Dovetail alignment C C A – T G A C T T C C A G T G AKA End space free alignment How can we find it ? 37
End space free alignment CCA-TGAC TTCCAGTG OR 38
Pattern search AAGCAGCCA-TGACGGAAAT CCAGTG How can we find it ? 39
Pattern search • AAGCAGCCATGACGGAAAT • CCAGTG 40
Non-overlapping Local Alignments • GCTCTGCGAATA • CGTTGAGATACT • Find all nonoverlapping local alignments with score > threshold. • Two alignments overlap if they share same letter pair. • How do we find ? 41
Non-overlapping Local Alignments 1. Compute DP matrix 2. Find the largest scoring alignment > threshold 3. Report the alignment 4. Remove the effects of the alignment from the matrix 5. Go to step 2 42
Next: Closer look into gaps 43
Gaps • Q = AATTCGAG • ||||| • Y = -ATTCGC- • Q = AATTCGAG • ||||| • Z = AATTCC-- Which one is more similar to Q ? Starting an indel is less likely than continuing an indel. Affine gap model: Large gap open and smaller gap extend penalty. How can we compute it ? 44
Computing affine gaps • 3 cases i j i j E F G 45
Recursions i j E • E(i, 0) = gap_open + i x gap_extend • E(i, j) = max{E(i, j-1) + gap_extend, V(i, j-1) + gap_open + gap_extend} 46
Recursions i j F • F(0, j) = gap_open + j x gap_extend • F(i, j) = max{F(i-1, j) + gap_extend, V(i-1, j) + gap_open + gap_extend} 47
Recursions i j G • G(i, j) = G(i-1, j-1) + s(x(i), y(j)) 48
Recursions • V(i, 0) = gap_open + i x gap_extend • V(0, j) = gap_open + j x gap_extend • V(i, j) = max{E(i, j), F(i, j), G(i, j)} 49
Other Gap Models • Constant: fixed gap penalty per gap regardless of length • Non-linear: Gap cost increase is non-linear. – E. g. , g(n) = -(1 + ½ + 1/3 + … + 1/n) • Arbitrary 50
Next: inversions 51
Alignment with Inversions • A’ = T and G’ = C • ACTCTCTCGCTGTACTG • AATCT-ACTACTGCTTG • Each letter is inverted only once. • An inversion cost (inv) for each inverted block. • How to find the alignment ? 52
Alignment with Inversions 1. For i=1: m 1. For j=1: n 1. For g=1: I 1. For h=1: j 1. Compute Z(g, h; I, j) 2. V(I, j) = max{ » Max{v(i-1, j-1) + z(g, h; I, j)} + inv » V(i-1, j-1) + s(xi, yj) » V(i-1, j) + ins » V(I, j-1) + del} • O(n 6) time 53
Alignment with Inversions: Faster Method 1. Find all local alignments of x and y’ (Z) 2. V(I, j) = max{ 1. 2. 3. 4. • max{V(g-1, h-1) + Z(g, h; I, j)} + inv, V(i-1, j-1) + s(xi, yj), V(i-1, j) + ins V(I, j-1) + del } O(nm. L) time, where L is the average number of inverse alignments ending at (i, j) 54
Recap & Goals • Other important sequence comparison problems – – – – end free search pattern search non-overlapping alignments gaps linear-space algorithms Inversions Approximate alignment Homology 55
Approximate Global Alignment of Sequences T. Kahveci, V. Ramaswamy, H. Tao, T. Li - 2005 56
The problem • Given sequences X and Y – Bounded: Find global alignment of X and Y with at most k edit ops. – Unbounded: Find global alignment of X and Y with p% approximation • p = 100 % = optimal alignment. 57
Frequency Vectors [KS’ 01] • Frequency vector is the count of each letter. n. A n. C n. G n. T – f(s = AATGATAG) = [4, 0, 2, 2]. • Edit operations & frequency vectors: – (del. G), s = AAT. ATAG => f(s) = [4, 0, 1, 2] – (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2] – (A C), s = ACCTATAG => f(s) = [3, 2, 1, 2] • Use frequency vectors to measure distance! 58
An Approximation to ED: Frequency Distance (FD) • s = AATGATAG => f(s)=[4, 0, 2, 2] • q = ACTTAGC => f(q)=[2, 2, 1, 2] – dec = (4 -2) + (2 -1) = 3 – inc = (2 -0) = 2 – FD(f(s), f(q)) = 3 – ED(q, s) = 4 • FD(f(s 1), f(s 2))=max{inc, dec}. • FD(f(s 1), f(s 2)) ED(s 1, s 2). 59
Distance Prediction using Frequency Vectors ED GED A C T - - T A G R I I A A T G A T A G V = [6, 5, 9, 10] U = [11, 4, 4] A C T T A G C * * A A T G A T A Given frequency vectors of two strings x and y, GED(x, y) is normally distributed. Q = [12, 10, 3, 5] 60
Mean : Variance : 61
Bounded Alignment: lower bounding the alignment • Mi, j = Edit distance between prefixes of X and Y X j Y i • d = lower bound to ED between suffixes of X and Y with at least p% probability. Mi, j d • If (Mi, j + d > cutoff) then p% d – No solution exists from (i, j) with p% probability. – Remove entry (i, j) 62
Cost of computing lower bound? • Frequency vectors can be computed in O(1) time incrementally. A A T T C [A C G T] [2 1 0 2] [1 1 0 2] 63
Unbounded Alignment: upper bounding the alignment X Y i • Dij = upper bound to the distance between suffixes. j Mi, j Dij • Use mini, j {Mi, j + Dij} as cutoff. • Prune if (Mi, j + d > cutoff) • Dij: desirable if it is • Computed quickly • Tight 64
How to Compute the Upper Bound? X Y i X j Di, j Y Mi, j Dij = distance for a sample alignment (suffix) - A A C C T C e. g. G C A T C T A Di, j = 4 65
Cost of Computing Upper Bound? A A T C T G • Upper bound can be computed in O(1) time incrementally. D=3 A A T C T G - C T C A G A A T C T G D=3 - - T C A G A T C T G D=2 C T C A G 66
Optimization 3: Path Prune X • No solution exists from entry (i, j) if its path to entry (0, 0) is blocked. • Remove entry (i, j) X 67
Optimization 3: Path Prune X X • No solution exists from entry (i, j) if its path to entry (0, 0) is blocked. • Remove entry (i, j) X X X 68
Unbounded Alignment: Time 69
Unbounded Alignment: Space 70
Bounded Alignment: Time 71
Bounded Alignment: Space 72
Recap & Goals • Other important sequence comparison problems – – – banded alignment end free search pattern search non-overlapping alignments gaps linear-space algorithms inversions bitwise operations neighborhood searching NFAs Homology 73
What is Similarity Anyway ? • Similar: have similar letters • Homolog: have common ancestor • Not exactly the same ! • Three types of homology Parent Organism – Paralog – Ortholog – Xenolog Organism A Organism B 74
Paralog & Ortholog (1) • "Two genes are said to be paralogous if they are derived from a duplication event, but orthologous if they are derived from a speciation event. “ W-H Li 1. 2. A gene called A in species w is duplicated producing initially two copies of A. With time the two copies diverge by evolution forming related genes A 1 and A 2. These two genes are said to be paralogous to one another. Paralogy typically involves comparisons within a species. 3. 75
Paralog & Ortholog (2) • Two species, x and y evolve from species w, their common ancestor. The descendants of the A 1 and A 2 genes are now called A 1 x, A 1 y, and A 2 x, A 2 y to reflect which species they now occupy. A 1 x is orthologous to A 1 y and A 2 x is orthologous to A 2 y. The comparison is between two species. 76
Xenolog • Xenology is defined as that condition (horizontal transfer) where the history of the gene involves an interspecies transfer of genetic material. It does not include transfer between organelles and the nucleus. It is the only form of homology in which the history has an episode where the descent is not from parent to offspring but, rather, from one organism to another. 77
Paralog, Ortholog, Xenolog Ortholog Paralog 78
Recommended Reading • Fitch, WM, “Homology a personal view on some of the problems”, Trends. Genet. , 2000, 16: 227 -231 79
Overview • Dot plots • Dynamic programming solutions – Local, global alignments and their extensions • Distance and similarity models • Gap models • Improvements and different on computation of sequence similarity • Similarity versus homology 80
Next: Substitution Patterns • Predict substitutions • What are scoring matrices and how are they derived? 81
- Slides: 81