Multiple Sequence Alignment “An Inexact Science”

Why Do Multiple Sequence Alignments?
Since biological sequences often occur in families, we may want to know which family our sequence of interest belongs to. For the most part we are interested in families of proteins, since homologous sequences retain similar structures and functions. Thus, if we know a feature of one of the proteins, we can possibly identify similar features in homologous proteins and predict that they have similar functions. Since most proteins have been identified by the sequencing of genomic DNA, the functions of most proteins have been assigned on the basis of homology to other known proteins rather than on the basis of results from biochemical or functional (cell biological) assays. Aligned residues tend to occupy corresponding positions in the 3D structure of each aligned protein. While protein structures evolve over time, their sequences evolve more rapidly than the structures, so a good multiple sequence alignment can tell us something about the evolution of the protein.

For these reasons multiple sequence alignments are used to:
• provide insight into the likely function and structure of a protein
• provide a more sensitive method for finding distantly related members of a protein family
• find conserved residues or motifs in a subset of the results from a database search
• gain understanding of the evolution of a particular protein
• help understand the gene products of a newly sequenced genome

Any multiple sequence alignment starts with a primary sequence. In the following we start with hMSH3 (Swiss-Prot accession number P20585).

sp|P20585|MSH3_HUMAN  GRGTSTHDGIAIAYATLEYFIRDVKSLTLFVTHYPPVCELEKNYSHQVGNYHMGFLVS 1035
sp|P43246|MSH2_HUMAN  GRGTSTYDGFGLAWAISEYIATKIGA--FCMFATHFHELTALAN-QIPTVNNLHVTALTT 807
sp|O15457|MSH4_HUMAN  GRGTNTEEGIGICYAVCEYLLSLKAF---TLFATHFLELCHIDALYPNVENMHFEVQHVK 818
sp|O43196|MSH5_HUMAN  GKGTNTVDGLALLAAVLRHWLARGPTCPHIFVATNFLSLVQLQLLPQGPLVQYLTMETCE 733
sp|P52701|MSH6_HUMAN  GRGTATFDGTAIANAVVKELAETIKC--RTLFSTHYHSLVEDYSQNVAVRLGHMACMVEN 1273
sp|P13705|MSH3_MOUSE  GRGTSTHDGIAIAYATLEYFIRDVKS--LTLFVTHYPPVCELEKCYPEQVGNYHMGFLVN 989
sp|P25336|MSH3_YEAST  GRGTGTHDGIAISYALIKYFSELSDCP-LILFTTHFPMLGEIKS---PLIRNYHMDYVEE 957
tr|Q73MQ8             KKGKWLVAVDNLKLTIAEDDIEVCENQEKLKLSKPIVSIISDTSAPSSRPS----- 740
                      : *. : : . . . :

Interpreting the notation at the bottom of the alignment:
* = entirely conserved column
: = all residues have approximately the same size and hydropathy
. = either size or hydropathy has been preserved

In general, our lab manual says that a good block in an alignment is at least 10-30 aa long and contains at least one to three *’s, five to seven :’s, and a few .’s. If this is the case, we suspect that we have identified a conserved region within the alignment.

The previous example clearly illustrates the following observations:
• We cannot expect two protein structures with different sequences to be completely superposable (i.e., we cannot completely superimpose one upon the other).
• Research by Chothia & Lesk in 1986 found that, for two clearly homologous protein sequences (30% identical), usually only about 50% of the individual residues were superposable.

Evolutionarily correct alignment is more difficult to infer than structural alignment.
• The evolutionary history of the residues of a sequence family is not usually known from any source.
• It must be inferred from sequence alignment.
• Sequence alignment has an independent source of reference; this may not be the “common” ancestor.
This does not say that multiple sequence alignment is straightforward: there is no way to define an unambiguously correct alignment.

In general, humans can do better than machines at aligning sequences. Many biologists, including Dr. James, can do high-quality alignments by hand. Some of the factors to be considered are:
• Highly conserved sequences
• Buried hydrophobic residues
• Expected patterns of insertions and deletions that tend to alternate with conserved sequences
• Phylogenetic relationships
• Influence of secondary and tertiary structure, e.g. alternation of hydrophobic and hydrophilic columns in an exposed β-sheet
But it is HARD and TEDIOUS work!! Most times it is best to start with a machine-generated sequence alignment.

The basic algorithm is called progressive alignment, developed in the late 1980s by Da-Fei Feng and Russell Doolittle. The algorithm follows these steps:
• First, a group of proteins is chosen to be aligned.
• Every protein in the group is globally aligned, using Needleman-Wunsch, with every other protein in the group.
• A distance matrix is used to score the pairwise alignments. These scores are used to generate a guide tree to construct the alignment.
• The guide tree, which reflects the relatedness of the sequences to be aligned, shows the order in which sequences should be added to the multiple alignment, starting with the two most closely related sequences.
• One by one, the next most closely related sequences are added to the alignment. For example, one naïve way is to create a consensus alignment for the first two, which may include gaps, then align the next sequence with this consensus alignment, and continue the process until all of the sequences are aligned.
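The first steps above (all-against-all global alignment, then choosing an order in which to add sequences) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring values (match +1, mismatch -1, linear gap -2) are placeholders rather than a real substitution matrix, the function names are hypothetical, and the greedy ordering in `joining_order` stands in for a proper guide tree, which real programs build with a clustering method.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.
    The scoring values are illustrative assumptions, not BLOSUM/PAM."""
    prev = [j * gap for j in range(len(b) + 1)]      # row 0: all gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                             # column 0: all gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def pairwise_scores(seqs):
    """Score every pair of sequences; keys are (name_a, name_b)."""
    return {(x, y): nw_score(seqs[x], seqs[y]) for x, y in combinations(seqs, 2)}

def joining_order(seqs):
    """Greedy stand-in for a guide tree: start with the best-scoring pair,
    then repeatedly add the sequence closest to any sequence already chosen."""
    scores = pairwise_scores(seqs)

    def score(x, y):
        return scores[(x, y)] if (x, y) in scores else scores[(y, x)]

    order = list(max(scores, key=scores.get))
    remaining = [name for name in seqs if name not in order]
    while remaining:
        nxt = max(remaining, key=lambda n: max(score(n, m) for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```

For example, `joining_order({"A": "GATTACA", "B": "GATTACA", "C": "TTTTTTT"})` pairs the two identical sequences first and adds the dissimilar one last.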

We will illustrate this process for the following three sequences:

>gi|62901522|sp|P01865|GCAM_MOUSE Ig gamma-2A chain C region, membrane-bound form
KTTAPSVYPLAPVCGDTTGSSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVT
SSTWPSQSITCNVAHPASSTKVDKKIEPRGPTIKPCPPCKCPAPNLLGGPSVFIFPPKIKDVLMISLSPI
VTCVVVDVSEDDPDVQISWFVNNVEVHTAQTQTHREDYNSTLRVVSALPIQHQDWMSGKEFKCKVNNKDL
PAPIERTISKPKGSVRAPQVYVLPPPEEEMTKKQVTLTCMVTDFMPEDIYVEWTNNGKTELNYKNTEPVL
DSDGSYFMYSKLRVEKKNWVERNSYSCSVVHEGLHNHHTTKSFSRTPGLDLDDVCAEAQDGELDGLWTTI
TIFISLFLLSVCYSASVTLFKVKWIFSSVVELKQTISPDYRNMIGQGA

>gi|121048|sp|P01863|GCAA_MOUSE Ig gamma-2A chain C region, A allele
AKTTAPSVYPLAPVCGDTTGSSVTLGCLVKGYFPEPVTLTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTV
TSSTWPSQSITCNVAHPASSTKVDKKIEPRGPTIKPCPPCKCPAPNLLGGPSVFIFPPKIKDVLMISLSP
IVTCVVVDVSEDDPDVQISWFVNNVEVHTAQTQTHREDYNSTLRVVSALPIQHQDWMSGKEFKCKVNNKD
LPAPIERTISKPKGSVRAPQVYVLPPPEEEMTKKQVTLTCMVTDFMPEDIYVEWTNNGKTELNYKNTEPV
LDSDGSYFMYSKLRVEKKNWVERNSYSCSVVHEGLHNHHTTKSFSRTPGK

>gi|113588|sp|P01878|IGHA_MOUSE Ig alpha chain C region
ESARNPTIYPLTLPPALSSDPVIIGCLIHDYFPSGTMNVTWGKSGKDITTVNFPPALASGGRYTMSNQLT
LPAVECPEGESVKCSVQHDSNPVQELDVNCSGPTPPPPITIPSCQPSLSLQRPALEDLLLGSDASITCTL
NGLRNPEGAVFTWEPSTGKDAVQKKAVQNSCGCYSVSSVLPGCAERWNSGASFKCTVTHPESGTLTGTIA
KVTVNTFPPQVHLLPPPSEELALNELLSLTCLVRAFNPKEVLVRWLHGNEELSPESYLVFEPLKEPGEGA
TTYLVTSVLRVSAETWKQGDQYSCMVGHEALPMNFTQKTIDRLSGKPTNVSVSVIMSEGDGICY

Pairwise alignments:

Seq. A  Name                            Len(aa)  Seq. B  Name                            Len(aa)  Score
=======================================================================================================
1       gi|62901522|sp|P01865|GCAM_MOU  398      2       gi|121048|sp|P01863|GCAA_MOUSE  330      99
1       gi|62901522|sp|P01865|GCAM_MOU  398      3       gi|113588|sp|P01878|IGHA_MOUSE  344      24
2       gi|121048|sp|P01863|GCAA_MOUSE  330      3       gi|113588|sp|P01878|IGHA_MOUSE  344      25
=======================================================================================================

Guide Tree:

(
gi|62901522|sp|P01865|GCAM_MOU:0.00966,
gi|121048|sp|P01863|GCAA_MOUSE:-0.00360,
gi|113588|sp|P01878|IGHA_MOUSE:0.74906);

The Alignment:

The basic problem is assigning a score to the alignment. Producing a meaningful score is a very inexact science at this point and may continue to be. A good scoring system should take into account:
1. The fact that some positions are more conserved than others; thus, we infer that the scoring needs to be position specific.
2. The fact that some sequences are not independent, but related by a phylogenetic tree.
Sometimes there is a small set of sequences that can be aligned unambiguously and used as a starting point for the complete alignment.

We will tend to ignore the phylogenetic tree and concentrate on the first criterion.
Simplifying assumption: individual columns of the alignment are statistically independent.
Note: gaps will be aligned with gaps. The only restriction is that every column must have at least one non-gap character.
Typical scoring function: let $m$ be some alignment. Then
$S(m) = G + \sum_i S(m_i)$
where $m_i$ is column $i$ of the alignment, $S(m_i)$ is the score of that column, and $G$ is a function for scoring the gaps.

The score for a column, $S(m_i)$, is usually computed as a Sum of Pairs (SP), i.e. each residue in the column is paired with every residue above it in the same column, and each pair is evaluated using a standard scoring scheme (BLOSUMn or PAMn, n = your favorite number). A gap paired with a residue is usually scored using the gap penalty, and a gap paired with a gap is usually scored as 0. The formula for the score of a column:
$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$
where $m_i^k$ is the symbol in row $k$ of column $i$ and $s(a, b)$ is the score for pairing symbols $a$ and $b$.
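A minimal Python sketch of sum-of-pairs scoring, as described above, may make the bookkeeping concrete. The function names, the toy substitution table, and the gap penalty of -8 are assumptions for illustration only. The sketch also folds gap scoring into each column rather than using a separate gap function G.

```python
from itertools import combinations

def sp_column_score(column, sub, gap_penalty=-8):
    """Sum-of-pairs score for one alignment column.
    `sub` maps residue pairs to scores; `gap_penalty` is an assumed value."""
    total = 0
    for a, b in combinations(column, 2):   # every unordered pair of rows
        if a == "-" and b == "-":
            continue                       # gap paired with gap scores 0
        if a == "-" or b == "-":
            total += gap_penalty           # gap paired with a residue
        else:
            total += sub[(a, b)] if (a, b) in sub else sub[(b, a)]
    return total

def sp_alignment_score(rows, sub, gap_penalty=-8):
    """Total score: sum of column scores over all columns of the alignment."""
    return sum(sp_column_score(col, sub, gap_penalty) for col in zip(*rows))

# Toy table holding only the two BLOSUM50 entries quoted in these slides.
toy_sub = {("L", "L"): 5, ("L", "G"): -4}
print(sp_column_score("LLLL", toy_sub))                 # 6 pairs of L-L
print(sp_alignment_score(["LL", "LL", "LG"], toy_sub))  # two columns
```

A full substitution matrix would replace `toy_sub`; the pair lookup tries both orders so only one triangle of the matrix needs to be stored.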

Problem with sum of pairs scoring:
Suppose you are comparing N sequences and using BLOSUM50 as your comparison matrix. Suppose all N rows in a column have L (leucine). The BLOSUM50 score for comparing L to L is 5, for a total column score of $5N(N-1)/2$ (5 times the number of symbol pairs in the column). Now suppose one of the L’s is replaced by a G (glycine). The BLOSUM50 score for comparing L to G is -4 instead of 5, so $(N-1)$ pairs are each reduced by 9. The fractional reduction in the SP score is
$\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$
The problem is the N in the denominator. As more sequences are added (and say they all have L’s), the effect of this G is reduced. It should be amplified to show the disparity.
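The arithmetic in this example can be checked numerically. The sketch below assumes only the two BLOSUM50 values quoted above (5 for L-L, -4 for L-G). The function name is hypothetical.

```python
from fractions import Fraction

def sp_fraction_reduction(n, same=5, diff=-4):
    """Fractional drop in a column's SP score when one of n identical
    residues is replaced, given the pair scores `same` and `diff`."""
    full = Fraction(same * n * (n - 1), 2)   # all n(n-1)/2 pairs score `same`
    loss = (same - diff) * (n - 1)           # n-1 pairs each lose same - diff
    return loss / full

for n in (4, 10, 100):
    print(n, sp_fraction_reduction(n))
```

With the default values the result equals 18/(5N), so the relative penalty for the single mismatch shrinks as N grows.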

Nonetheless, SP is commonly used. But the problem does not stop here. To find the best-scoring alignment, multidimensional dynamic programming, similar to what we did earlier, is used. This becomes intractable very fast. Suppose we are aligning only 3 sequences.
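To give a rough sense of scale, the sketch below counts the work in a full dynamic programming lattice: one cell per combination of prefix lengths, and up to 2^N - 1 predecessor cells per cell (every non-empty choice of which sequences advance). The function name is hypothetical and the sequence lengths in the example are arbitrary.

```python
def msa_dp_cost(lengths):
    """Return (number of lattice cells, predecessors per interior cell)
    for exact dynamic programming over sequences of the given lengths."""
    cells = 1
    for n in lengths:
        cells *= n + 1                       # prefix lengths 0..n in each dimension
    predecessors = 2 ** len(lengths) - 1     # each sequence advances or takes a gap
    return cells, predecessors

print(msa_dp_cost([100, 100]))        # pairwise case
print(msa_dp_cost([100, 100, 100]))   # three sequences
```

Going from two to three sequences of length 100 multiplies the number of cells by 101 and more than doubles the work per cell.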