Sequence Alignment

Outline
• DNA Sequence Comparison: First Success Stories
• Dynamic Programming vs. Recursion
• Relation of Sequences
• Comparing Sequences
• Hamming Distance vs. Edit Distance
• Sequence Alignment
• Longest Common Subsequence Problem
• Scoring Matrices: PAM and BLOSUM
• Local vs. Global Alignment

Problem Solving Approach
• Dynamic Programming
  – Bottom-up approach
  – Calculates each subproblem once
  – FIFO approach
  – Memory: result table
• Divide and Conquer
  – Top-down approach
  – Repetitive calculations
  – LIFO approach
  – Memory: return point + environment variables on the runtime stack
• Example: Fibonacci numbers
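The Fibonacci example named on this slide can be sketched as follows. This is a minimal illustration, not the lecturer's code; the function names and the call counter are assumptions introduced here to make the repeated work of the top-down version visible.

```python
def fib_recursive(n, calls):
    """Top-down recursion: the same subproblems are recomputed many times.

    `calls` is a one-element list used as a mutable call counter
    (a hypothetical helper, only here to expose the repeated work).
    """
    calls[0] += 1
    if n < 2:
        return n
    return fib_recursive(n - 1, calls) + fib_recursive(n - 2, calls)


def fib_table(n):
    """Bottom-up dynamic programming: each table entry is computed once."""
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]


calls = [0]
print(fib_recursive(10, calls), calls[0])  # value and number of recursive calls
print(fib_table(10))                       # same value from a table of 11 entries
```

The recursive version keeps its pending work on the runtime stack, while the table version only ever reads entries it has already filled in.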

DNA Sequence Comparison: First Success Story
• Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene's function.
• Computing a similarity score between two genes tells how likely it is that they have similar functions.
• Dynamic programming is a technique for revealing similarities between genes.

Relation of Sequences
• Homolog – has a common ancestor
  1. Ortholog: gene or protein family members in 2 different organisms
  2. Paralog: gene or protein family members in 1 organism
• Xenolog – shares some similarity, but not through direct descent from a common ancestor (for example, a gene acquired by horizontal transfer)


Comparing Sequences
• Sequence Alignment
  – Pairwise sequence alignment: between 2 sequences
  – Multiple sequence alignment: between more than 2 sequences

Aligning Sequences without Insertions and Deletions: Hamming Distance
Given two DNA sequences v and w:
v: ATATATAT
w: TATATATA
• The Hamming distance d_H(v, w) = 8 is large, but the sequences are very similar.
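A minimal sketch of the Hamming distance, assuming the two sequences are the eight-letter strings ATATATAT and TATATATA implied by the stated distance of 8. The function name is an illustration, not something taken from the slides.

```python
def hamming_distance(v, w):
    """Count the positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(v, w) if a != b)


# Every position differs, even though w is just v shifted by one letter.
print(hamming_distance("ATATATAT", "TATATATA"))
```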

Aligning Sequences with Insertions and Deletions
By shifting one sequence over one position:
v: ATATATAT-
w: -TATATATA
• The edit distance: d(v, w) = 2.
• Hamming distance neglects insertions and deletions in DNA.

Edit Distance
Levenshtein (1966) introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other:
d(v, w) = minimum number of elementary operations to transform v into w
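One common way to compute this distance is a dynamic programming table over prefixes. The sketch below is an illustration under that assumption (unit cost for every operation, hypothetical function name), not code from the lecture.

```python
def edit_distance(v, w):
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    n, m = len(v), len(w)
    # d[i][j] = edit distance between the first i letters of v and the first j of w
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i          # delete all i letters
    for j in range(1, m + 1):
        d[0][j] = j          # insert all j letters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # deletion
                d[i][j - 1] + 1,                 # insertion
                d[i - 1][j - 1] + substitution,  # match or substitution
            )
    return d[n][m]


print(edit_distance("ATATATAT", "TATATATA"))
print(edit_distance("TGCATAT", "ATCCGAT"))
```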

Edit Distance vs. Hamming Distance
• Hamming distance always compares the i-th letter of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8
• Edit distance may compare the i-th letter of v with the j-th letter of w. Just one shift makes it all line up:
V = -ATATATAT
W = TATATATA-
Edit distance: d(v, w) = 2 (one insertion and one deletion)
• How do we find which j goes with which i?

Edit Distance: Example
• TGCATAT → ATCCGAT in 5 steps
• TGCATAT → ATCCGAT in 4 steps
• Can it be done in 3 steps?

Aligning DNA Sequences
V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
[Figure: an alignment of V and W with 4 matches, 1 mismatch, 2 insertions and 2 deletions. Insertions and deletions together are called indels.]

Longest Common Subsequence (LCS) – Alignment without Mismatches
• Given two sequences v = v_1 v_2 … v_n and w = w_1 w_2 … w_m
• The LCS of v and w is a sequence of positions in v: 1 ≤ i_1 < i_2 < … < i_t ≤ n and a sequence of positions in w: 1 ≤ j_1 < j_2 < … < j_t ≤ m such that the i_k-th letter of v equals the j_k-th letter of w for every k, and t is maximal.

Edit Graph for LCS Problem
[Figure: edit graph with rows i = 0 … 7 labelled T, G, C, A, T, A, C and columns j = 0 … 8 labelled A, T, C, T, G, A, T, C.]
• Every path is a common subsequence.
• Every diagonal edge adds an extra element to the common subsequence.
• LCS Problem: find a path with the maximum number of diagonal edges.

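Computing LCS
Let v_i = prefix of v of length i: v_1 … v_i
and w_j = prefix of w of length j: w_1 … w_j
The length of LCS(v_i, w_j) is computed by the recurrence below, where the condition v_i = w_j compares the i-th letter of v with the j-th letter of w:
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} \\
s_{i,j-1} \\
s_{i-1,j-1} + 1 & \text{if } v_i = w_j
\end{cases}
$$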

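Computing LCS (cont'd)
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + 0 \\
s_{i,j-1} + 0 \\
s_{i-1,j-1} + 1 & \text{if } v_i = w_j
\end{cases}
$$
[Figure: the three edges entering vertex (i, j): from (i-1, j) and (i, j-1) with weight 0, and from (i-1, j-1) with weight 1.]
Every alignment path runs from source to sink.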

Every Path in the Grid Corresponds to an Alignment
[Figure: edit graph for V = ATGT and W = ATCG, with vertices numbered 0 … 4 along each axis.]
     0 1 2 2 3 4
V =   A T - G T
      | |   |
W =   A T C G -
     0 1 2 3 4 4

Alignment as a Path in the Edit Graph
     0 1 2 2 3 4 5 6 7 7
v:     A T _ G T T A T _
w:     A T C G T _ A _ C
     0 1 2 3 4 5 5 6 6 7
Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
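The correspondence between alignment columns and path vertices can be sketched in a few lines. The helper below is a hypothetical illustration: it assumes the alignment is given as two equal-length rows with "_" as the gap symbol, and it advances i for every letter of v and j for every letter of w.

```python
def alignment_to_path(top, bottom, gap="_"):
    """Turn an alignment (two equal-length rows) into edit-graph vertices."""
    if len(top) != len(bottom):
        raise ValueError("both alignment rows must have the same length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(top, bottom):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a != gap:
            i += 1   # consumed a letter of v: move down one row
        if b != gap:
            j += 1   # consumed a letter of w: move right one column
        path.append((i, j))
    return path


print(alignment_to_path("AT_GTTAT_", "ATCGT_A_C"))
```

A column with two letters produces a diagonal step, and a column with a gap produces a horizontal or vertical step, so a nine-column alignment yields ten vertices.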

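Alignment: Dynamic Programming
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
$$
Arrows show where the score originated:
• ↑ if from the top, s_{i-1,j}
• ← if from the left, s_{i,j-1}
• ↖ if v_i = w_j, from the diagonal, s_{i-1,j-1} + 1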

Backtracking Example
Find a match in row and column 2:
• i = 2: j = 2, 5 is a match (T).
• j = 2: i = 4, 5, 7 is a match (T).
Since v_i = w_j, s_{i,j} = s_{i-1,j-1} + 1:
s_{2,2} = [s_{1,1} = 1] + 1
s_{2,5} = [s_{1,4} = 1] + 1
s_{4,2} = [s_{3,1} = 1] + 1
s_{5,2} = [s_{4,1} = 1] + 1
s_{7,2} = [s_{6,1} = 1] + 1

LCS Algorithm
LCS(v, w)
  for i ← 0 to n
    s_{i,0} ← 0
  for j ← 0 to m
    s_{0,j} ← 0
  for i ← 1 to n
    for j ← 1 to m
      s_{i,j} ← max { s_{i-1,j}, s_{i,j-1}, s_{i-1,j-1} + 1 if v_i = w_j }
      b_{i,j} ← "↑" if s_{i,j} = s_{i-1,j}
                "←" if s_{i,j} = s_{i,j-1}
                "↖" if s_{i,j} = s_{i-1,j-1} + 1 and v_i = w_j
  return (s_{n,m}, b)

Printing LCS: Backtracking
PrintLCS(b, v, i, j)
  if i = 0 or j = 0
    return
  if b_{i,j} = "↖"
    PrintLCS(b, v, i-1, j-1)
    print v_i
  else
    if b_{i,j} = "↑"
      PrintLCS(b, v, i-1, j)
    else
      PrintLCS(b, v, i, j-1)
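The table-filling step and the backtracking step can be combined into one runnable sketch. The version below is an illustration rather than a transcription of the slide pseudocode: it stores the backtracking pointers as the strings "diag", "up" and "left" (names chosen here), prefers the diagonal whenever the letters match, and backtracks with a loop instead of recursion.

```python
def lcs(v, w):
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]     # s[i][j]: LCS length of prefixes
    b = [[None] * (m + 1) for _ in range(n + 1)]  # b[i][j]: where s[i][j] came from
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1
                b[i][j] = "diag"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j] = s[i - 1][j]
                b[i][j] = "up"
            else:
                s[i][j] = s[i][j - 1]
                b[i][j] = "left"

    # Backtrack from (n, m); every diagonal step contributes one letter.
    letters = []
    i, j = n, m
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(lcs("ATGTTAT", "ATCGTAC"))
```

When several longest common subsequences exist, the tie-breaking order in the loop decides which one is returned.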

LCS Runtime
• It takes O(nm) time to fill in the n×m dynamic programming matrix.
• Why O(nm)? The pseudocode consists of a "for" loop nested inside another "for" loop, and the inner loop body does a constant amount of work for each of the n×m entries.

From LCS to Alignment: Change up the Scoring
• The Longest Common Subsequence (LCS) problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).
• In the LCS Problem, we scored 1 for matches and 0 for indels.
• Consider penalizing indels and mismatches with negative scores.
• Simplest scoring schema:
  +1 : match premium
  -μ : mismatch penalty
  -σ : indel penalty

Simple Scoring
• When mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:
  #matches - μ(#mismatches) - σ(#indels)
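As a small worked example, the score of a given alignment can be computed column by column. The function name and the use of "-" as the gap symbol are assumptions made for this sketch.

```python
def score_alignment(top, bottom, mu, sigma, gap="-"):
    """Score = #matches - mu * #mismatches - sigma * #indels."""
    if len(top) != len(bottom):
        raise ValueError("both alignment rows must have the same length")
    matches = mismatches = indels = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# 5 matches, 0 mismatches and 4 indels with mu = 1, sigma = 2
print(score_alignment("AT-GTTAT-", "ATCGT-A-C", mu=1, sigma=2))
```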

The Global Alignment Problem
Find the best alignment between two strings under a given scoring schema.
Input: Strings v and w and a scoring schema
Output: Alignment of maximum score
Edge weights: ↑ and → edges (indels) = -σ; diagonal edges = +1 if match, -μ if mismatch
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma
\end{cases}
$$
μ : mismatch penalty
σ : indel penalty
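A minimal sketch of this recurrence, assuming the first row and column are initialised with accumulated indel penalties (the usual convention for global alignment, not spelled out on the slide). The function name and the default penalties are illustrative.

```python
def global_alignment_score(v, w, mu=1, sigma=1):
    """Best global alignment score with +1 match, -mu mismatch, -sigma indel."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i   # a prefix of v aligned against gaps only
    for j in range(1, m + 1):
        s[0][j] = -sigma * j   # a prefix of w aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(
                s[i - 1][j - 1] + diagonal,  # match or mismatch
                s[i - 1][j] - sigma,         # v_i aligned with a gap
                s[i][j - 1] - sigma,         # w_j aligned with a gap
            )
    return s[n][m]


print(global_alignment_score("ATATATAT", "TATATATA", mu=1, sigma=1))
```

With these penalties the shifted alignment (7 matches, 2 indels) beats the unshifted one (8 mismatches).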

Scoring Matrices
To generalize scoring, consider a (4+1) × (4+1) scoring matrix δ. For an amino acid sequence alignment, the scoring matrix would be of size (20+1) × (20+1). The addition of 1 is to include the score for comparison with a gap character "-". This simplifies the recurrence to:
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$

Making a Scoring Matrix
• Scoring matrices are created based on biological evidence.
• Alignments can be thought of as two sequences that differ due to mutations.
• Some of these mutations have little effect on the protein's function; therefore some penalties, δ(v_i, w_j), will be less harsh than others.

Scoring Matrix: Example
      A    R    N    K
A     5   -2   -1   -1
R     -    7   -1    3
N     -    -    7    0
K     -    -    -    6
• Notice that although R and K are different amino acids, they have a positive score.
• Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

Conservation
• Amino acid changes that tend to preserve the physico-chemical properties of the original residue
  – Polar to polar: aspartate to glutamate
  – Nonpolar to nonpolar: alanine to valine
  – Similarly behaving residues: leucine to isoleucine

Scoring Matrices
• Amino acid substitution matrices
  – PAM
  – BLOSUM
• DNA substitution matrices
  – DNA is less conserved than protein sequences
  – Less effective to compare coding regions at the nucleotide level

PAM
• Point Accepted Mutation (Dayhoff et al.)
• 1 PAM = PAM1 = 1% average change of all amino acid positions
  – After 100 PAMs of evolution, not every residue will have changed:
    • some residues may have mutated several times
    • some residues may have returned to their original state
    • some residues may not have changed at all

PAMx
• PAMx = (PAM1)^x
  – PAM250 = (PAM1)^250
• PAM250 is a widely used scoring matrix:
[Table: excerpt of the PAM250 matrix for the amino acids Ala (A), Arg (R), Asn (N), Asp (D), Cys (C), Gln (Q), Glu (E), Gly (G), His (H), Ile (I), Leu (L), Lys (K), …, Trp (W), Tyr (Y), Val (V).]
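The relation PAMx = (PAM1)^x is a matrix power of the mutation probability matrix. The sketch below illustrates it on a toy two-letter alphabet with made-up probabilities (1% chance of change per PAM). These values are hypothetical, are not Dayhoff's data, and the result is a probability matrix rather than a log-odds scoring matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, x):
    """Raise a square matrix to the non-negative integer power x."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM1-style matrix: each letter stays the same with probability 0.99.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam100 = mat_pow(toy_pam1, 100)
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam100[0][0])  # probability that a residue looks unchanged after 100 PAMs
print(toy_pam250[0][0])
```

Even after 100 PAMs the diagonal entry stays well above zero in this toy model, which matches the remark that not every residue will have changed.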

BLOSUM
• Blocks Substitution Matrix
• Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins
• Matrix name indicates evolutionary distance
  – BLOSUM62 was created using sequences sharing no more than 62% identity

[Figure: The BLOSUM50 scoring matrix]

[Figure: Comparing PAM and BLOSUM]

Local vs. Global Alignment
• The Global Alignment Problem tries to find the longest path between vertices (0, 0) and (n, m) in the edit graph.
• The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i, j) and (i', j') in the edit graph.

Local vs. Global Alignment (cont'd)
• Global Alignment
--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |      | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C
• Local Alignment: a better alignment for finding the conserved segment
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Local Alignment: Example
[Figure: edit graph contrasting the path of a global alignment with the path of a local alignment.]
• Compute a "mini" global alignment to get a local alignment.

Similarity Based on Dot Plots
• Simple dot plot
• Dot plot with 75% identity filtering
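A dot plot can be sketched as a set of (i, j) positions. The version below assumes that "identity filtering" means a sliding window: a dot is kept only when the fraction of identical letters in the window reaches a threshold. The function name, parameters, and this reading of the filter are assumptions for illustration.

```python
def dot_plot(s, t, window=1, min_identity=1.0):
    """Return window start positions (i, j) where s and t agree well enough."""
    dots = set()
    for i in range(len(s) - window + 1):
        for j in range(len(t) - window + 1):
            matches = sum(1 for k in range(window) if s[i + k] == t[j + k])
            if matches / window >= min_identity:
                dots.add((i, j))
    return dots


# Simple dot plot: one dot per pair of identical letters.
print(sorted(dot_plot("ACGT", "ACGT")))

# Filtered dot plot: windows of 4 letters with at least 75% identity.
print(sorted(dot_plot("ACGT", "ACCT", window=4, min_identity=0.75)))
```

With `window=1` every pair of equal letters produces a dot; larger windows with a threshold suppress isolated chance matches.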

Dotplot for a small protein against itself
[Figure: self dot plot. The main diagonal (i = j) is the identity; off-diagonal lines show similarity of the sequence with other parts of itself.]

Local Alignments: Why?
• Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions.
• Example:
  – Homeobox genes have a short region called the homeodomain that is highly conserved between species.
  – A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.

The Local Alignment Problem
• Goal: Find the best local alignment between two strings
• Input: Strings v, w and scoring matrix δ
• Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings

The Problem with this Problem
• Long run time, O(n^4):
  – In the grid of size n × n there are ~n^2 vertices (i, j) that may serve as a source.
  – For each such vertex, computing alignments from (i, j) takes O(n^2) time.

The Local Alignment Recurrence
• The largest value of s_{i,j} over the whole edit graph is the score of the best local alignment.
• The recurrence:
$$
s_{i,j} = \max \begin{cases}
0 \\
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$
• Power of ZERO: the 0 is the only change from the original recurrence of a Global Alignment, since there is only one "free ride" edge entering into every vertex.
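A minimal sketch of this recurrence. The scoring function `simple_delta` (+1 match, -μ mismatch, -σ against a gap) is a stand-in chosen for illustration. Any function δ(a, b) that accepts "-" as the gap character could be passed instead.

```python
def simple_delta(a, b, mu=1, sigma=1):
    """Hypothetical scoring function: +1 match, -mu mismatch, -sigma with a gap."""
    if a == "-" or b == "-":
        return -sigma
    return 1 if a == b else -mu


def local_alignment_score(v, w, delta=simple_delta):
    """Score of the best local alignment between substrings of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                0,                                          # free ride from the source
                s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                s[i - 1][j] + delta(v[i - 1], "-"),
                s[i][j - 1] + delta("-", w[j - 1]),
            )
            best = max(best, s[i][j])                       # best score anywhere in the grid
    return best


print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```

The table is filled once, so the running time is proportional to the number of grid cells rather than O(n^4).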

[Figure: Local Align vs. Global Align]

Further Reading
• Better gap penalty strategy: -(ρ + σx)
• Multiple Sequence Alignment
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Heuristic Alignment Algorithms