COMP 571 Bioinformatics: Sequence Analysis Pairwise Sequence Alignment (II)
Copyright notice: Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc. The book has a homepage at http://www.bioinfbook.org including hyperlinks to the book chapters. This presentation is a modification of one of the presentations on the book homepage.
Pairwise alignment: protein sequences can be more informative than DNA • protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties • codons are degenerate: changes in the third position often do not alter the amino acid that is specified • protein sequences offer a longer “look-back” time • DNA sequences can be translated into protein, and then used in pairwise alignments
Pairwise alignment: protein sequences can be more informative than DNA
• DNA can be translated into six potential proteins
Forward reading frames (top strand):
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
Reverse reading frames (bottom strand):
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
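The six reading frames on this slide can be reproduced with a short script. This is a minimal sketch, assuming the standard genetic code and an input made only of A, C, G and T; the function names (`reverse_complement`, `translate`, `six_frames`) are illustrative and do not come from the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order; '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; trailing bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Map frame number (+1, +2, +3, -1, -2, -3) to its translation."""
    frames = {}
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frames[strand * (offset + 1)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Top strand from the slide.
    dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
    for frame, protein in six_frames(dna).items():
        print(f"{frame:+d}: {protein}")
```

Frames +1 to +3 begin with the codons CAT CAA, ATC AAC and TCA ACT; frames -1 to -3 begin with GTG GGT, TGG GTA and GGG TAG, matching the slide.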
Pairwise alignment: protein sequences can be more informative than DNA
• Many times, DNA alignments are appropriate
--to confirm the identity of a cDNA
--to study noncoding regions of DNA
--to study DNA polymorphisms
--example: Neanderthal vs modern human DNA
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
retinol-binding protein 4 (NP_006735) and β-lactoglobulin (P02754) Page 42
Pairwise alignment of retinol-binding protein 4 and β-lactoglobulin
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin
 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin
 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...QYSC 136 RBP
 94 IPAVFKIDALNENKVL....VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI.... 178 lactoglobulin
Page 46
Definitions
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: The extent to which two sequences are invariant.
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physicochemical properties of the original residue.
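The three definitions can be turned into a column-by-column count. The sketch below is only an illustration: the physicochemical groups in `GROUPS` are a hypothetical grouping chosen for this example (real alignment programs decide conservation from a scoring matrix), and the choice to divide by all columns, gaps included, is also an assumption. The two rows are a fragment of the RBP4 versus β-lactoglobulin alignment that appears later in the deck.

```python
# Hypothetical physicochemical groups, chosen only for illustration.
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ", "C", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}
GAP_CHARS = "-."


def classify_columns(row_a, row_b):
    """Count identical, conserved, different and gapped columns of an alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = {"identical": 0, "conserved": 0, "different": 0, "gap": 0}
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            counts["gap"] += 1
        elif a == b:
            counts["identical"] += 1
        elif a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
            counts["conserved"] += 1
        else:
            counts["different"] += 1
    return counts


def percent_similarity(counts):
    """Similarity = identity plus conservation, as a percentage of all columns."""
    columns = sum(counts.values())
    return 100.0 * (counts["identical"] + counts["conserved"]) / columns


if __name__ == "__main__":
    rbp4 = "--CADMVGTFTDTEDPAKFKM"
    btlact = "GECAQKKIIAEKTKIPAVFKI"
    counts = classify_columns(rbp4, btlact)
    print(counts)
    print(f"similarity: {percent_similarity(counts):.1f}%")
```

With this grouping the fragment has 7 identical columns and 1 conserved column (Met aligned with Ile), so similarity exceeds identity by one column.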
Pairwise alignment of retinol-binding protein and β-lactoglobulin: reading the alignment
• Identity: bar (|)
• Very similar: two dots (:)
• Somewhat similar: one dot (.)
• Internal gap: a gap inside the aligned region
• Terminal gap: a gap at either end of the alignment
Page 46
Pairwise sequence alignment allows us to look back billions of years ago (BYA)
[Figure: timeline from 4 BYA to 0 BYA marking the origin of life, earliest fossils, eukaryote/archaea, origin of eukaryotes, plant/animal, fungi/animal, and insects]
Page 48
Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases
[Figure: aligned sequence blocks for fly, human, plant, bacterium, yeast, and archaeon]
Page 49
Amino Acid Substitution Matrices
Emile Zuckerkandl and Linus Pauling (1965) considered substitution frequencies in 18 globins (myoglobins and hemoglobins from human to lamprey).
Example: lys found at 58% of arg sites
Black: identity
Gray: very conservative substitutions (>40% occurrence)
White: fairly conservative substitutions (>21% occurrence)
Red: no substitutions observed
Page 80
Dayhoff’s 34 protein superfamilies
Protein                          PAMs per 100 million years
Ig kappa chain                   37
Kappa casein                     33
luteinizing hormone β            30
lactalbumin                      27
complement component 3           27
epidermal growth factor          26
proopiomelanocortin              21
pancreatic ribonuclease          21
haptoglobin alpha                20
serum albumin                    19
phospholipase A2, group IB       19
prolactin                        17
carbonic anhydrase C             16
Hemoglobin α                     12
Hemoglobin β                     12
Alignment shown: human (NP_005203) versus mouse (NP_031812)
Page 50
Dayhoff’s 34 protein superfamilies
Protein: apolipoprotein A-II; lysozyme; gastrin; myoglobin; nerve growth factor; myelin basic protein; thyroid stimulating hormone β; parathyroid hormone; parvalbumin; trypsin; insulin; calcitonin; arginine vasopressin; adenylate kinase 1
PAMs per 100 million years: 10, 9.8, 8.9, 8.5, 7.4, 7.3, 7.0, 5.9, 4.4, 4.3, 3.6, 3.2
Page 50
Dayhoff’s 34 protein superfamilies
Protein: triosephosphate isomerase 1; vasoactive intestinal peptide; glyceraldehyde phosph. dehydrogenase; cytochrome c; collagen; troponin C, skeletal muscle; alpha crystallin B chain; glucagon; glutamate dehydrogenase; histone H2B, member Q; ubiquitin
PAMs per 100 million years: 2.8, 2.6, 2.2, 1.7, 1.5, 1.2, 0.9, 0
Page 50
Pairwise alignment of human (NP_005203) versus mouse (NP_031812) ubiquitin
Accepted point mutations (PAMs): inferring amino acid substitutions between a protein and its ancestor
Dayhoff et al. compared protein sequences with inferred ancestors, rather than with each other directly. Consider four globins (myoglobin, alpha, beta, delta globin). In a phylogenetic tree there are four existing sequences plus two inferred ancestral sequences (5, 6).
[Figure: phylogenetic tree with existing sequences 1 to 4 and inferred ancestral nodes 5 and 6]
Accepted point mutations (PAMs): inferring amino acid substitutions between a protein and its ancestor
The tree is made from a multiple sequence alignment of the four globins. Consider a comparison of alpha globin to myoglobin, and to their common ancestor (node 6).
• A direct comparison suggests alanine changed to glycine. But an ancestral glutamate changed to ala or gly!
• Three additional examples are boxed.
beta       MVHLTPEEKSAVTALWGKV
delta      MVHLTPEEKTAVNALWGKV
alpha      MV.LSPADKTNVKAAWGKV
myoglobin  .MGLSDGEWQLVLNVWGKV
5          MVHLSPEEKTAVNALWGKV
6          MVHLTPEEKTAVNALWGKV
Dayhoff’s numbers of “accepted point mutations”: what amino acid substitutions occur in proteins? Dayhoff (1978) p. 346. Page 52
The relative mutability of amino acids
Dayhoff et al. described the “relative mutability” of each amino acid as the probability that the amino acid will change over a small evolutionary time period. The total number of changes is counted (on all branches of all protein trees considered), and the total number of occurrences of each amino acid is also counted. A ratio is determined.
Relative mutability = [changes] / [occurrences]
Example:
sequence 1   ala  his  val  ala
sequence 2   ala  arg  ser  val
For ala, relative mutability = [1] / [3] = 0.33
For val, relative mutability = [2] / [2] = 1.0
Page 53
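The ratio can be checked on the two-sequence example from this slide. This is a simplified sketch: Dayhoff et al. counted changes along the branches of phylogenetic trees, whereas the hypothetical helper below only compares two aligned sequences position by position.

```python
from collections import Counter


def relative_mutability(seq1, seq2):
    """Return changes / occurrences for every residue seen in two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    changes = Counter()
    occurrences = Counter()
    for a, b in zip(seq1, seq2):
        occurrences[a] += 1
        occurrences[b] += 1
        if a != b:
            # A substitution counts as one change for each residue involved.
            changes[a] += 1
            changes[b] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in occurrences}


if __name__ == "__main__":
    sequence_1 = ["ala", "his", "val", "ala"]
    sequence_2 = ["ala", "arg", "ser", "val"]
    for residue, value in relative_mutability(sequence_1, sequence_2).items():
        print(f"{residue}: {value:.2f}")
```

Alanine occurs three times and takes part in one change (1/3 = 0.33); valine occurs twice and changes both times (2/2 = 1.0), as on the slide.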
The relative mutability of amino acids
Asn 134    His 66
Ser 120    Arg 65
Asp 106    Lys 56
Glu 102    Pro 56
Ala 100    Gly 49
Thr  97    Tyr 41
Ile  96    Phe 41
Met  94    Leu 40
Gln  93    Cys 20
Val  74    Trp 18
Note that alanine is normalized to a value of 100. Trp and cys are least mutable. Asn and ser are most mutable.
Page 53
Normalized frequencies of amino acids
Gly 8.9%    Arg 4.1%
Ala 8.7%    Asn 4.0%
Leu 8.5%    Phe 4.0%
Lys 8.1%    Gln 3.8%
Ser 7.0%    Ile 3.7%
Val 6.5%    His 3.4%
Thr 5.8%    Cys 3.3%
Pro 5.1%    Tyr 3.0%
Glu 5.0%    Met 1.5%
Asp 4.7%    Trp 1.0%
• blue = 6 codons (Leu, Ser, Arg); red = 1 codon (Met, Trp)
• These frequencies f_i sum to 1
Page 53
Dayhoff’s mutation probability matrix for the evolutionary distance of 1 PAM
We have considered three kinds of information:
• a table of number of accepted point mutations (PAMs)
• relative mutabilities of the amino acids
• normalized frequencies of the amino acids in PAM data
This information can be combined into a “mutation probability matrix” in which each element M_ij gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval (e.g. 1 PAM).
Page 50
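One way to combine the three tables is Dayhoff's construction: an off-diagonal element is M_ij = λ m_j A_ij / Σ_i A_ij and a diagonal element is M_jj = 1 - λ m_j, where A is the table of accepted point mutations, m the relative mutabilities, and λ a scale factor chosen so that 1% of residues change. The sketch below applies that construction to a hypothetical three-letter alphabet; the counts, mutabilities and frequencies are invented for illustration and are not Dayhoff's data.

```python
def mutation_probability_matrix(counts, mutability, freq):
    """Build a 1 PAM matrix; M[i][j] = P(residue in column j is replaced by row i).

    counts[i][j]  accepted point mutations between residues i and j (diagonal ignored)
    mutability[j] relative mutability of residue j
    freq[j]       normalized frequency of residue j (sums to 1)
    """
    n = len(freq)
    # Scale so that the expected fraction of changed residues is 1 in 100.
    lam = 0.01 / sum(freq[j] * mutability[j] for j in range(n))
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column_total = sum(counts[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i == j:
                matrix[i][j] = 1.0 - lam * mutability[j]
            else:
                matrix[i][j] = lam * mutability[j] * counts[i][j] / column_total
    return matrix


# Hypothetical toy data for a three-letter alphabet.
COUNTS = [[0, 6, 2],
          [6, 0, 2],
          [2, 2, 0]]
MUTABILITY = [100, 160, 120]
FREQ = [0.5, 0.3, 0.2]

if __name__ == "__main__":
    for row in mutation_probability_matrix(COUNTS, MUTABILITY, FREQ):
        print("  ".join(f"{value:.5f}" for value in row))
```

Each column sums to 1, and the frequency-weighted average of the diagonal is 0.99, i.e. one accepted change per 100 residues.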
Dayhoff’s PAM 1 mutation probability matrix Original amino acid Page 55
Dayhoff’s PAM 1 mutation probability matrix Each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side)
Substitution Matrix A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution. The two major types of substitution matrices are PAM and BLOSUM.
PAM matrices: Point-accepted mutations PAM matrices are based on global alignments of closely related proteins. The PAM 1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM 1, one change has occurred over a length of 100 amino acids. Other PAM matrices are extrapolated from PAM 1. For PAM 250, 250 changes have occurred for two proteins over a length of 100 amino acids. All the PAM data come from closely related proteins (>85% amino acid identity).
Dayhoff’s PAM 0 mutation probability matrix: the rules for extremely slowly evolving proteins Top: original amino acid Side: replacement amino acid Page 56
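Dayhoff’s PAM 2000 mutation probability matrix: the rules for very distantly related proteins
PAM    A      R      N      D      C      Q      E      G
Ala    8.7%   8.7%   8.7%   8.7%   8.7%   8.7%   8.7%   8.7%
Arg    4.1%   4.1%   4.1%   4.1%   4.1%   4.1%   4.1%   4.1%
Asn    4.0%   4.0%   4.0%   4.0%   4.0%   4.0%   4.0%   4.0%
Asp    4.7%   4.7%   4.7%   4.7%   4.7%   4.7%   4.7%   4.7%
Cys    3.3%   3.3%   3.3%   3.3%   3.3%   3.3%   3.3%   3.3%
Gln    3.8%   3.8%   3.8%   3.8%   3.8%   3.8%   3.8%   3.8%
Glu    5.0%   5.0%   5.0%   5.0%   5.0%   5.0%   5.0%   5.0%
Gly    8.9%   8.9%   8.9%   8.9%   8.9%   8.9%   8.9%   8.9%
Top: original amino acid
Side: replacement amino acid
Page 56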
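Extrapolating from PAM 1 amounts to multiplying the matrix by itself: PAM n is the n-th power of PAM 1. The sketch below uses a hypothetical three-letter PAM 1 matrix (not Dayhoff's data) whose background frequencies are 0.5, 0.3 and 0.2, and shows the two limits from the slides: PAM 0 is the identity matrix, and by PAM 2000 every column has converged to the background frequencies.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Return pam1 raised to the n-th power (exponentiation by squaring)."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in pam1]
    while n > 0:
        if n & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        n >>= 1
    return result


# Hypothetical toy PAM 1 matrix: columns are original residues, rows replacements.
FREQ = [0.5, 0.3, 0.2]
PAM1 = [
    [0.992, 0.010, 0.005],
    [0.006, 1 - 0.010 - 1 / 300, 0.005],
    [0.002, 1 / 300, 0.990],
]

if __name__ == "__main__":
    for n in (0, 1, 250, 2000):
        print(f"PAM {n}")
        for row in pam_n(PAM1, n):
            print("  " + "  ".join(f"{value:.4f}" for value in row))
```

At PAM 250 the diagonal is still clearly larger than the background frequencies, which is why that matrix remains useful for scoring distant relatives.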
PAM 250 mutation probability matrix Top: original amino acid Side: replacement amino acid Page 57
PAM 250 log odds scoring matrix Page 58
Why do we go from a mutation probability matrix to a log odds matrix? • We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues. • Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them). Page 57
How do we go from a mutation probability matrix to a log odds matrix?
• The cells in a log odds matrix consist of an “odds ratio”:
(the probability that an alignment is authentic) / (the probability that the alignment was random)
The score S for an alignment of residues a, b is given by:
S(a, b) = 10 log10 (M_ab / p_b)
where M_ab is the mutation probability and p_b is the normalized frequency of residue b.
As an example, for tryptophan,
S(a, tryptophan) = 10 log10 (0.55 / 0.010) = 17.4
Page 57
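The tryptophan example can be verified directly. The function name below is illustrative; the inputs 0.55 and 0.010 are the values quoted on the slide.

```python
import math


def log_odds_score(m_ab, p_b):
    """Score S(a, b) = 10 * log10(M_ab / p_b)."""
    return 10.0 * math.log10(m_ab / p_b)


if __name__ == "__main__":
    # Mutation probability 0.55 and background frequency 0.010 for tryptophan.
    print(round(log_odds_score(0.55, 0.010), 1))  # 17.4
```

Scoring matrices store such values rounded to the nearest integer, which is why the tryptophan entry is read as +17.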
What do the numbers mean in a log odds matrix?
S(a, tryptophan) = 10 log10 (0.55 / 0.010) = 17.4
A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two Trp residues.
S(a, b) = 17
Probability of replacement (M_ab / p_b) = x
Then
17 = 10 log10 x
1.7 = log10 x
10^1.7 = x ≈ 50
Page 58
What do the numbers mean in a log odds matrix?
A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance.
A score of 0 is neutral.
A score of −10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids.
Page 58
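Going the other way, a score converts back to an odds ratio as 10^(S/10). The sketch below (function name illustrative) reproduces the values quoted on these slides and shows why logarithms are convenient: adding the scores of aligned positions is equivalent to multiplying their odds ratios.

```python
import math


def odds_ratio(score):
    """Invert S = 10 * log10(odds): return how many times more likely than chance."""
    return 10.0 ** (score / 10.0)


if __name__ == "__main__":
    for score in (17, 2, 0, -10):
        print(f"score {score:+d} -> odds ratio {odds_ratio(score):.2f}")

    # Summing scores over aligned positions multiplies the underlying odds.
    column_scores = [17, 2, -10]
    total = sum(column_scores)
    product = math.prod(odds_ratio(s) for s in column_scores)
    print(math.isclose(odds_ratio(total), product))
```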
PAM 10 log odds scoring matrix Page 59
More conserved: rat versus mouse RBP. Less conserved: rat versus bacterial lipocalin.
Comparing two proteins with a PAM 1 matrix gives completely different results than PAM 250!
Consider two distantly related proteins. A PAM 40 matrix is not forgiving of mismatches, and penalizes them severely. Using this matrix you can find almost no match.
hsrbp, 136 CRLLNLDGTC
btlact,  3 CLLLALALTC
           * ** *  **
A PAM 250 matrix is very tolerant of mismatches.
24.7% identity in 81 residues overlap; Score: 77.0; Gap frequency: 3.7%
rbp4   26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV
btlact 21 QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN
                *     ****  *        * *   *   *               **  *
rbp4   86 --CADMVGTFTDTEDPAKFKM
btlact 80 GECAQKKIIAEKTKIPAVFKI
            **        *  ** **
Page 60
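The summary line of the PAM 250 alignment (24.7% identity in 81 residues, gap frequency 3.7%) can be recomputed from the two aligned rows. This is a minimal sketch; `alignment_stats` is an illustrative helper, and it assumes both percentages are taken over all alignment columns.

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Count identical and gapped columns of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = len(row_a)
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    gaps = sum(1 for a, b in zip(row_a, row_b) if a == gap or b == gap)
    return {
        "columns": columns,
        "identical": identical,
        "gaps": gaps,
        "percent_identity": 100.0 * identical / columns,
        "gap_frequency": 100.0 * gaps / columns,
    }


# The two aligned rows from the slide, joined across the line break.
RBP4 = ("RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDV"
        "--CADMVGTFTDTEDPAKFKM")
BTLACT = ("QTMKGLDIQKVAGTWYSLAMAASD-ISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWEN"
          "GECAQKKIIAEKTKIPAVFKI")

if __name__ == "__main__":
    stats = alignment_stats(RBP4, BTLACT)
    print(f"{stats['percent_identity']:.1f}% identity in {stats['columns']} residues overlap")
    print(f"Gap frequency: {stats['gap_frequency']:.1f}%")
```

The alignment has 20 identical columns and 3 gapped columns out of 81, which gives the 24.7% and 3.7% reported on the slide.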
BLOSUM Matrices
BLOSUM matrices are based on local alignments. BLOSUM stands for blocks substitution matrix. BLOSUM 62 is a matrix calculated from comparisons of sequences sharing no more than 62% identity; sequences that are more similar than this are clustered and counted as a single sequence.
Page 60
BLOSUM Matrices [Figure: percent amino acid identity axis running from 100 to 30; for BLOSUM 62, sequences are collapsed at the 62% level]
BLOSUM Matrices [Figure: three percent amino acid identity axes running from 100 to 30, each with a “collapse” level, labeled BLOSUM 80, BLOSUM 62, and BLOSUM 30]
BLOSUM Matrices
All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. The BLOCKS database contains thousands of groups of multiple sequence alignments. BLOSUM 62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.
Page 60
Blosum 62 scoring matrix Page 61
Rat versus mouse RBP Rat versus bacterial lipocalin Page 61
Two randomly diverging protein sequences change in a negatively exponential fashion. [Plot: percent identity versus evolutionary distance in PAMs, with the “twilight zone” marked] Page 62
• At PAM 1, two proteins are 99% identical
• At PAM 10.7, there are 10 differences per 100 residues
• At PAM 80, there are 50 differences per 100 residues
• At PAM 250, there are 80 differences per 100 residues
[Plot: percent identity versus differences per 100 residues, with the “twilight zone” marked] Page 62
PAM matrices reflect different degrees of divergence PAM 250
PAM: “Accepted point mutation”
• Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.)
• Proteins with 20% to 25% identity are in the “twilight zone” and may be statistically significantly related.
• PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968)
Page 62
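The gap between the number of changes and the number of observed differences can be illustrated with a deliberately simplified model. The sketch below assumes 20 equally frequent amino acids that are all equally likely to replace one another, which is not Dayhoff's model, so its numbers only approximate the values quoted in these slides. It shows the same qualitative behavior: repeated and back substitutions make observed differences grow more slowly than the PAM distance.

```python
def expected_differences(pam, alphabet_size=20):
    """Expected observed differences per 100 residues after `pam` PAM steps.

    Assumes a uniform substitution model: at each step a residue changes with
    probability 0.01, to any of the other residues with equal probability.
    """
    k = alphabet_size
    decay = (1.0 - 0.01 * k / (k - 1)) ** pam
    identity = 1.0 / k + (1.0 - 1.0 / k) * decay
    return 100.0 * (1.0 - identity)


if __name__ == "__main__":
    for pam in (1, 10, 80, 250):
        print(f"PAM {pam}: {expected_differences(pam):.1f} differences per 100 residues")
```

Under this toy model the curve levels off near 95 differences per 100 residues, the point at which two sequences are no more alike than random ones.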
Ancestral sequence ACCCTAC
Site   Sequence 1     Sequence 2     Type of substitution
1      A              A              no change
2      C              C --> A        single substitution
3      C              C --> A --> T  multiple substitutions
4      C --> G        C --> A        coincidental substitutions
5      T --> A        T --> A        parallel substitutions
6      A --> C --> T  A --> T        convergent substitutions
7      C              C --> T --> C  back substitution
Sequence 1 ACCGATC
Sequence 2 AATAATC
Li (1997) p. 70
Percent identity between two proteins: What percent is significant?
[Scale: 100%, 80%, 65%, 30%, 23%, 19%]
We will see in the BLAST lecture that it is appropriate to describe significance in terms of probability (or expect) values. As a rule of thumb, two proteins sharing > 30% identity over a substantial region are usually homologous.