
Centre for Integrative Bioinformatics VU: 1-month Practical Course Genome Analysis. Lecture 5: Multiple Sequence Alignment. Centre for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit Amsterdam, The Netherlands. ibi.vu.nl, heringa@cs.vu.nl

Multiple sequence alignment

Multiple sequence alignment
• Sequences can be conserved across species and perform similar or identical functions.
– They hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved.
– They allow identification of regions or domains that are critical to functionality.
• Sequences can be mutated or rearranged to perform an altered function.
– Which changes in the sequences have caused a change in the functionality?
Multiple sequence alignment: the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment.

What to ask yourself
• How do we get a multiple alignment (three or more sequences)?
• What is our aim?
– Do we go for maximum accuracy, least computational time or the best compromise?
• What do we want to achieve each time

Sequence-sequence alignment [Figure: a sequence aligned against a sequence]

Multiple alignment methods
• Multi-dimensional dynamic programming: an extension of pairwise sequence alignment.
• Progressive alignment: incorporates phylogenetic information to guide the alignment process.
• Iterative alignment: corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

Simultaneous multiple alignment: multi-dimensional dynamic programming. The combinatorial explosion
• 2 sequences of length n: n^2 comparisons.
• The number of comparisons increases exponentially with the number of sequences, i.e. n^N, where n is the length of the sequences and N is the number of sequences.
• Impractical for even a small number of short sequences.
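
To make the growth concrete, the following minimal sketch evaluates n^N for a few values of N. The function names are illustrative and not part of any alignment package; the count of 2^N - 1 predecessor cells per cell is a standard property of N-dimensional dynamic programming that the slide does not state explicitly.

```python
def dp_cells(n: int, num_seqs: int) -> int:
    """Cells in the DP search space for num_seqs sequences of length n (n^N)."""
    return n ** num_seqs


def predecessors_per_cell(num_seqs: int) -> int:
    """Each cell can be reached from 2^N - 1 neighbouring cells."""
    return 2 ** num_seqs - 1


# Sequence length 250 is an arbitrary illustrative choice.
for num_seqs in (2, 3, 5, 8):
    print(num_seqs, dp_cells(250, num_seqs), predecessors_per_cell(num_seqs))
```

Even at N = 3 the search space for sequences of length 250 already holds more than 15 million cells, each with 7 predecessors to inspect.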

Multi-dimensional dynamic programming (Murata et al., 1985) [Figure: three-dimensional search matrix with axes Sequence 1, Sequence 2 and Sequence 3]

The MSA approach
• MSA (Lipman et al., 1989, PNAS 86, 4412)
• MSA restricts the amount of memory by computing bounds that approximate the centre of a multi-dimensional hypercube.
– Calculate all pair-wise alignment scores.
– Use the scores to predict a tree.
– Calculate pair weights based on the tree (lower bound).
– Produce a heuristic alignment based on the tree.
– Calculate the maximum weight for each sequence pair (upper bound).
– Determine the spatial positions that must be calculated to obtain the optimal alignment.
– Perform the optimal alignment.
– Report the weight found compared to the maximum weight previously found (measure of divergence).
• Extremely slow and memory intensive. Max 8-9 sequences of ~250 residues.

The DCA approach
• DCA (Stoye et al., 1997, Appl. Math. Lett. 10(2), 67-73)
– Each sequence is cut in two behind a suitable cut position somewhere close to its midpoint.
– This way, the problem of aligning one family of (long) sequences is divided into the two problems of aligning two families of (shorter) sequences.
– This procedure is re-iterated until the sequences are sufficiently short.
– The short sequences are aligned optimally by MSA.
– Finally, the resulting short alignments are concatenated.
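
The recursion behind divide-and-conquer alignment can be sketched as follows. This is only a structural illustration under strong assumptions: `midpoint_cuts` cuts every sequence exactly at its midpoint, whereas DCA searches for a suitable cut position near the midpoint, and `pad_align` merely right-pads with gaps as a stand-in for the optimal MSA step. Both helper names are hypothetical.

```python
def pad_align(seqs):
    """Stand-in base-case aligner: right-pad with gaps (NOT an optimal alignment)."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def midpoint_cuts(seqs):
    """Stand-in cut rule: cut every sequence exactly at its midpoint."""
    return [len(s) // 2 for s in seqs]


def dca(seqs, max_len=4, cut=midpoint_cuts, align_short=pad_align):
    """Divide until all sequences are short, align the pieces, concatenate."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if max(len(s) for s in seqs) <= max_len:
        return align_short(seqs)
    cuts = cut(seqs)
    left = dca([s[:c] for s, c in zip(seqs, cuts)], max_len, cut, align_short)
    right = dca([s[c:] for s, c in zip(seqs, cuts)], max_len, cut, align_short)
    # Concatenate the two sub-alignments row by row.
    return [l + r for l, r in zip(left, right)]


if __name__ == "__main__":
    for row in dca(["MKIVYWSGTGNTE", "MKIGLFYGTQTGKTESVAE", "MVEIVYW"]):
        print(row)
```

Swapping in a real cut-position search and a real optimal aligner for the two stand-ins would leave the recursive skeleton unchanged.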

So in effect … [Figure: three-dimensional search matrix with axes Sequence 1, Sequence 2 and Sequence 3]

The progressive alignment method
• Underlying idea: usually we are interested in aligning families of sequences that are evolutionarily related.
• Principle: construct an approximate phylogenetic tree for the sequences to be aligned and then build up the alignment by progressively adding sequences in the order specified by the tree.
• But before going into details, some facts about multiple alignment profiles …

How to represent a block of sequences?
• Historically: consensus sequence, a single sequence that best represents the amino acids observed at each alignment position. When consensus sequences are used, the pair-wise DP algorithm can be used without alterations.
• Modern methods: alignment profile, a representation that retains the information about frequencies of amino acids observed at each alignment position.
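
The difference between the two representations can be illustrated with a small sketch. It assumes the block is given as equal-length strings; the function names are illustrative, and ties in the consensus are broken by first occurrence, which is an arbitrary choice made here.

```python
from collections import Counter


def consensus(block):
    """Most frequent symbol per column (ties broken by first occurrence)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*block))


def profile(block):
    """Per-column frequencies of every symbol observed in that column."""
    return [
        {aa: count / len(col) for aa, count in Counter(col).items()}
        for col in zip(*block)
    ]


if __name__ == "__main__":
    block = ["A", "A", "V", "V", "L"]  # one alignment column, five sequences
    print(consensus(block))
    print(profile(block))
```

For the single column A, A, V, V, L the consensus keeps one letter only, while the profile keeps the frequencies 0.4 A, 0.4 V and 0.2 L.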

Multiple alignment profiles (Gribskov et al. 1987)
• Gribskov created a probe: a group of typical sequences of functionally related proteins that have been aligned by similarity in sequence or three-dimensional structure (in his case: globins & immunoglobulins).
• Then he constructed a profile, which consists of a sequence position-specific scoring matrix M(p, a) composed of 21 columns and N rows (N = length of probe).
• The first 20 columns of each row specify the score for finding, at that position in the target, each of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position (gap-opening and gap-extension).

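Multiple alignment profiles [Figure: a profile spanning a core region, a gapped region and a second core region. At each position i the rows A, C, D, …, W, Y hold the frequencies fA, fC, fD, …, fW, fY, and a final row holds the position-dependent gap penalties (gapo, gapx).]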

Profile building
• Example: each aa is represented as a frequency; gap penalties as weights.
[Figure: example profile with rows A, C, D, …, W, Y and a row of position-dependent gap penalties; extracted cell values, in extraction order: 0.5, 0, 0.5, 1.0, 0.3, 0.1, 0, 0.3, 0.5, 0.2, 0.1, 0.2, 1.0]

Profile-sequence alignment [Figure: a sequence aligned against a profile with rows A, C, D, …, V, W, Y]

Sequence to profile alignment. A profile column containing A, A, V, V, L has the frequencies 0.4 A, 0.2 L and 0.4 V. Score of amino acid L in a sequence that is aligned against this profile position: Score = 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)
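
The sequence-to-profile score above is a frequency-weighted sum, which can be sketched as below. The substitution function `toy_s` is a made-up stand-in (2 for identity, -1 otherwise) and does not contain real PAM or BLOSUM values.

```python
def toy_s(x, y):
    """Hypothetical exchange score: NOT taken from PAM or BLOSUM."""
    return 2 if x == y else -1


def seq_profile_score(residue, column_freqs, s=toy_s):
    """Score of one residue against one profile column (frequency-weighted sum)."""
    return sum(freq * s(residue, aa) for aa, freq in column_freqs.items())


if __name__ == "__main__":
    column = {"A": 0.4, "L": 0.2, "V": 0.4}
    # 0.4 * s(L, A) + 0.2 * s(L, L) + 0.4 * s(L, V)
    print(seq_profile_score("L", column))
```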

Profile-profile alignment [Figure: two profiles, each with rows A, C, D, …, Y, aligned against each other]

Profile to profile alignment. Column 1: 0.4 A, 0.2 L, 0.4 V. Column 2: 0.75 G, 0.25 S. Match score of these two alignment columns using the amino acid frequencies at the corresponding profile positions: Score = 0.4*0.75*s(A, G) + 0.2*0.75*s(L, G) + 0.4*0.75*s(V, G) + 0.4*0.25*s(A, S) + 0.2*0.25*s(L, S) + 0.4*0.25*s(V, S), where s(x, y) is the value in the amino acid exchange matrix (e.g. PAM250, BLOSUM62) for amino acid pair (x, y).
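
The same worked example can be written as a double loop over the two columns. As before, `toy_s` is a hypothetical exchange function used only to make the sketch runnable; a real exchange matrix would be substituted in practice.

```python
def toy_s(x, y):
    """Hypothetical exchange score: NOT taken from PAM or BLOSUM."""
    return 2 if x == y else -1


def profile_profile_score(col1, col2, s=toy_s):
    """Sum of f1 * f2 * s(a, b) over all residue pairs of two profile columns."""
    return sum(
        f1 * f2 * s(a, b)
        for a, f1 in col1.items()
        for b, f2 in col2.items()
    )


if __name__ == "__main__":
    col1 = {"A": 0.4, "L": 0.2, "V": 0.4}
    col2 = {"G": 0.75, "S": 0.25}
    print(profile_profile_score(col1, col2))
```

A sequence residue is the special case of a column holding one amino acid with frequency 1.0, so the sequence-to-profile score falls out of the same function.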

So, for scoring profiles …
• Think of sequence-sequence alignment.
• Same principles but more information for each position.
Reminder:
• The sequence pair alignment score S comes from the sum of the positional scores M(aa_i, aa_j) (i.e. the substitution matrix values at each alignment position minus penalties if applicable).
• Profile alignment scores are exactly the same, but the positional scores are more complex.

Scoring a profile position [Figure: Profile 1 and Profile 2, each with rows A, C, D, …, Y]
• At each position (column) we have different residue frequencies for each amino acid (rows).
So:
• Instead of saying S = M(aa_1, aa_2) (one residue pair),
• for frequency f > 0 (amino acid is actually there) we take:
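
The formula itself did not survive extraction of this slide. A reconstruction consistent with the worked profile-to-profile example would be the following, where f_1 and f_2 (symbols chosen here, not taken from the slide) denote the residue frequencies at the two profile positions being compared and M is the exchange matrix:

```latex
S = \sum_{i=1}^{20} \sum_{j=1}^{20} f_1(aa_i)\, f_2(aa_j)\, M(aa_i, aa_j)
```

Terms with f = 0 vanish, so only amino acids that actually occur at the two positions contribute.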

Progressive alignment
1. Perform pair-wise alignments of all of the sequences;
2. Use the alignment scores to produce a dendrogram using neighbour-joining methods (guide tree);
3. Align the sequences sequentially, guided by the relationships indicated by the tree.
• Biopat (first integrated method ever)
• MULTAL (Taylor 1987)
• DIALIGN (1&2, Morgenstern 1996)
• PRRP (Gotoh 1996)
• ClustalW (Thompson et al 1994)
• PRALINE (Heringa 1999)
• T-Coffee (Notredame 2000)
• POA (Lee 2002)
• MAFFT (Katoh 2002)
• MUSCLE (Edgar 2004)
• PROBCONS (Do 2005)
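
Step 2 can be illustrated with a small clustering sketch that turns pair-wise scores into a merge order. Note the substitution: this uses greedy average-linkage clustering on similarity scores, which is simpler than the neighbour-joining method named above, and the scores below are invented for illustration.

```python
from itertools import combinations


def guide_tree(names, score):
    """Greedy average-linkage clustering on pair-wise similarity scores.

    Returns (merges, tree): the merge order and the tree as nested tuples.
    """
    def avg(c1, c2):
        total = sum(score[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]   # each cluster = tuple of members
    subtree = {c: c[0] for c in clusters}    # cluster -> nested-tuple subtree
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the highest average similarity.
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: avg(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        merged = c1 + c2
        subtree[merged] = (subtree[c1], subtree[c2])
        clusters.append(merged)
        merges.append((c1, c2))
    return merges, subtree[clusters[0]]


if __name__ == "__main__":
    names = [1, 2, 3, 4, 5]
    score = {frozenset(p): 2 for p in combinations(names, 2)}  # toy background
    score.update({frozenset((1, 2)): 9, frozenset((1, 3)): 7,
                  frozenset((2, 3)): 6, frozenset((4, 5)): 8})
    merges, tree = guide_tree(names, score)
    print(merges)
    print(tree)
```

Step 3 would then walk the returned merge list in order, aligning the two sequences or profiles named in each merge.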

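Progressive multiple alignment [Figure: pair-wise alignment scores (Score 1-2, Score 1-3, …, Score 4-5) fill a 5×5 similarity matrix; a clustering (tree-building) method derives the guide tree; the sequences are aligned along the guide tree into the multiple alignment, with iteration possibilities.]

General progressive multiple alignment technique (follow generated tree) [Figure: rooted guide tree over sequences 1, 3, 2 and 5, with axis label d]

PRALINE progressive strategy [Figure: tree over sequences 1, 3, 2, 5 and 4, with axis label d]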

There are problems … Accuracy is very important!
• Alignment errors made during the construction of the MSA cannot be repaired: errors made at any alignment step are propagated through subsequent progressive steps.
• The comparisons of sequences at early steps during progressive alignment cannot make use of information from other sequences.
• It is only later during the alignment progression that more information from other sequences (e.g. through profile representation) is employed in further alignment steps.
“Once a gap, always a gap” (Feng & Doolittle, 1987)

Clustal, ClustalW, ClustalX
• CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct a guide tree.
• Sequence blocks are represented by a profile, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree.
• Further carefully crafted heuristics include:
– (i) local gap penalties
– (ii) automatic selection of the amino acid substitution matrix
– (iii) automatic gap penalty adjustment
– (iv) a mechanism to delay alignment of sequences that appear to be distant at the time they are considered.
• CLUSTAL (W/X) does not allow iteration (Hogeweg and Hesper, 1984; Corpet, 1988; Gotoh, 1996; Heringa, 1999, 2002)

Sequence weighting dilemma. Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)

Additional strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Homology-extended alignment • Matrix extension Objective: try to avoid (early) errors

Profile pre-processing [Figure: the pair-wise scores (Score 1-2, Score 1-3, …, Score 4-5) are used to build, for key sequence 1, a pre-alignment of the other sequences against it (master-slave, N-to-1 alignment), which is converted into a pre-profile with rows A, C, D, …, Y.]

Pre-profile generation [Figure: pair-wise alignment scores → one pre-alignment per key sequence (1 to 5), filtered by a score cut-off → one pre-profile per sequence.]

Pre-profile alignment [Figure: the pre-profiles of sequences 1 to 5 are aligned to give the final alignment of sequences 1 to 5.]

Pre-profile alignment [Figure: each pre-alignment (a key sequence with the other four sequences) feeds into the final alignment of sequences 1 to 5.]

Pre-profile alignment: alignment consistency [Figure: the residue matched to a given position, e.g. Ala 131, is compared across the pre-alignments and the final alignment; the example labels read A131, L133, C126, A131.]

PRALINE pre-profile generation
• Idea: use the information from all query sequences to make a pre-profile for each query sequence that contains information from other sequences.
• You can use all sequences in each pre-profile, or use only those sequences that will probably align ‘correctly’. Incorrectly aligned sequences in the pre-profiles will increase the noise level.
• Select using alignment score: only allow sequences in a pre-profile if their alignment with the key sequence scores higher than a given threshold value. In PRALINE, this threshold is given as, for example, prepro=1500 (alignment score threshold value is 1500; see next two slides).
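
The selection rule can be sketched as a simple filter over pair-wise scores. This is not PRALINE's code: the function name, the sequence names and the scores are all invented for illustration, and only the "score higher than threshold" rule is taken from the slide.

```python
def preprofile_members(names, pair_scores, threshold):
    """For each key sequence, list the sequences admitted to its pre-profile."""
    members = {}
    for key in names:
        members[key] = [
            other for other in names
            if other != key
            and pair_scores[frozenset((key, other))] > threshold
        ]
    return members


if __name__ == "__main__":
    names = ["seqA", "seqB", "seqC"]          # hypothetical sequences
    pair_scores = {                           # hypothetical alignment scores
        frozenset(("seqA", "seqB")): 2100,
        frozenset(("seqA", "seqC")): 400,
        frozenset(("seqB", "seqC")): 1600,
    }
    print(preprofile_members(names, pair_scores, threshold=0))
    print(preprofile_members(names, pair_scores, threshold=1500))
```

With a threshold of 0 every sequence enters every pre-profile in this toy example, because all toy scores are positive; with 1500 the distant pair seqA and seqC no longer contribute to each other's pre-profile.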

Flavodoxin-cheY consistency scores (PRALINE prepro=0) [Figure: per-residue consistency scores for the alignment of 1fx1, FLAV_DESVH, FLAV_DESDE, FLAV_DESGI, FLAV_DESSA, 4fxn, FLAV_MEGEL, 2fcr, FLAV_ANASP, FLAV_ECOLI, FLAV_AZOVI, FLAV_ENTAG, FLAV_CLOAB and 3chy, in two blocks, each followed by average consistency and conservation rows.] Iteration 0: SP = 135136.00, AvSP = 10.473, SId = 3838, AvSId = 0.297. Consistency values are scored from 0 to 10; the value 10 is represented by the corresponding amino acid (red).

Flavodoxin-cheY consistency scores (PRALINE prepro=1500) [Figure: per-residue consistency scores for the alignment of 1fx1, FLAV_DESVH, FLAV_DESSA, FLAV_DESGI, FLAV_DESDE, 4fxn, FLAV_MEGEL, 2fcr, FLAV_ANASP, FLAV_AZOVI, FLAV_ENTAG, FLAV_ECOLI, FLAV_CLOAB and 3chy, in two blocks, each followed by average consistency and conservation rows.] Iteration 0: SP = 136702.00, AvSP = 10.654, SId = 3955, AvSId = 0.308. Consistency values are scored from 0 to 10; the value 10 is represented by the corresponding amino acid (red).

Multiple alignment methods
• Multi-dimensional dynamic programming: an extension of pairwise sequence alignment.
• Progressive alignment: incorporates phylogenetic information to (create an order to) guide the alignment process.
• Iterative alignment: corrects problems with progressive alignment by repeatedly realigning subgroups of sequences.

Iteration Alignment iteration: • do an alignment • learn from it • do it better next time Bootstrapping

Consistency-based iteration Pre-profiles Multiple alignment positional consistency scores

Pre-profile update iteration Pre-profiles Multiple alignment

Iteration Convergence Limit cycle Divergence
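
The three outcomes named on this slide can be told apart by remembering every alignment seen so far. The sketch below assumes a hypothetical `step` function that maps one alignment to the next and that alignments are hashable (for example tuples of strings); integers stand in for alignments in the demonstration. Within a finite iteration budget, divergence can only be reported as "no repeat seen so far".

```python
def classify_iteration(step, start, max_iter=50):
    """Run `step` repeatedly and report how the iteration behaves.

    Returns ("convergence", n) when a state maps to itself after n steps,
    ("limit cycle", length) when an earlier state recurs, or
    ("divergence", max_iter) when nothing has repeated within max_iter steps.
    """
    seen = {start: 0}          # state -> iteration at which it first appeared
    current = start
    for i in range(1, max_iter + 1):
        nxt = step(current)
        if nxt == current:
            return "convergence", i
        if nxt in seen:
            return "limit cycle", i - seen[nxt]
        seen[nxt] = i
        current = nxt
    return "divergence", max_iter


if __name__ == "__main__":
    print(classify_iteration(lambda x: min(x + 1, 3), 0))   # settles at 3
    print(classify_iteration(lambda x: (x + 1) % 3, 0))     # cycles 0, 1, 2
    print(classify_iteration(lambda x: x + 1, 0, max_iter=10))
```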

CLUSTAL X (1.64b) multiple sequence alignment, Flavodoxin-cheY [Figure: alignment, in two blocks, of 1fx1, FLAV_DESVH, FLAV_DESGI, FLAV_DESSA, FLAV_DESDE, FLAV_CLOAB, FLAV_MEGEL, 4fxn, FLAV_ANASP, FLAV_AZOVI, 2fcr, FLAV_ENTAG, FLAV_ECOLI and 3chy]

Flavodoxin-cheY: Pre-processing (prepro 1500) [Figure: alignment, in two blocks, of 1fx1, FLAV_DESDE, FLAV_DESVH, FLAV_DESSA, FLAV_DESGI, 2fcr, FLAV_AZOVI, FLAV_ENTAG, FLAV_ANASP, FLAV_ECOLI, 4fxn, FLAV_MEGEL, FLAV_CLOAB and 3chy] Iteration 0: SP = 136944.00, AvSP = 10.675, SId = 4009, AvSId = 0.313.

Flavodoxin-cheY: Local Pre-processing (locprepro 300) [Figure: alignment, in two blocks, of 1fx1, FLAV_DESVH, FLAV_DESSA, FLAV_DESGI, FLAV_DESDE, 4fxn, FLAV_MEGEL, 2fcr, FLAV_ANASP, FLAV_AZOVI, FLAV_ENTAG, FLAV_ECOLI, FLAV_CLOAB and 3chy]

Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Homology-extended alignment • Matrix extension Objective: integrate secondary structure information to anchor alignments and avoid errors (“Structure more conserved than sequence”)

Protein structure hierarchical levels
• PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVD EVGGEALGRLLVVYPWTQRFF ESFGDLSTPDAVMGNPKVKAH GKKVLGAFSDGLAHLDNLKGTF ATLSELHCDKLHVDPENFRLLG NVLVCVLAHHFGKEFTPPVQAA YQKVVAGVANALAHKYH
• SECONDARY STRUCTURE (helices, strands)
• TERTIARY STRUCTURE (fold)
• QUATERNARY STRUCTURE (oligomers)

Why use (predicted) structural information
• “Structure more conserved than sequence”
– Many structural protein families (e.g. globins) have family members with very low sequence similarities. For example, globin sequence identities can be as low as 10% while the proteins still have an identical fold.
• This means that you can still observe equivalent secondary structures in homologous proteins even if sequence similarities are extremely low.
• But you are dependent on the quality of prediction methods. For example, secondary structure prediction is currently at 76% correctness. So, about 1 out of 4 predicted amino acids is still incorrect.

Two superposed protein structures with two well-superposed helices • Red: well superposed • Blue: low match quality • C5 anaphylatoxin: the human (PDB code 1kjs) and pig (1c5a) proteins are superposed

How to combine ss and aa info [Figure: dynamic programming search matrix for sequence MDAASTILCGS (secondary structure HHHCCCEEECC) against sequence MDAGSTVILCFV (secondary structure HHHCCCEEEEEE); the cells draw on amino acid substitution matrices labelled H, E, C and Default]
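The idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule "use the state-specific matrix when both residues have the same secondary structure state, otherwise the default matrix" is an assumed reading of the figure, and the match/mismatch values are toy numbers, not a real substitution matrix. The names `make_matrix`, `pair_score` and `search_matrix` are hypothetical.

```python
def make_matrix(match, mismatch):
    """Return a toy scoring function (hypothetical values, not a real matrix)."""
    return lambda a, b: match if a == b else mismatch


# One toy "matrix" per secondary structure state, plus a default.
MATRICES = {
    "H": make_matrix(5, -2),        # helix
    "E": make_matrix(6, -3),        # strand
    "C": make_matrix(3, -1),        # coil
    "default": make_matrix(4, -2),  # used when the two states differ (assumption)
}


def pair_score(aa1, ss1, aa2, ss2):
    """Score one residue pair, picking the matrix from the two structure states."""
    key = ss1 if ss1 == ss2 else "default"
    return MATRICES[key](aa1, aa2)


def search_matrix(seq1, ss1, seq2, ss2):
    """Fill the cell scores of the dynamic programming search matrix."""
    return [
        [pair_score(a, s, b, t) for b, t in zip(seq2, ss2)]
        for a, s in zip(seq1, ss1)
    ]


# The two sequences and structure strings shown on the slide.
seq1, ss1 = "MDAASTILCGS", "HHHCCCEEECC"
seq2, ss2 = "MDAGSTVILCFV", "HHHCCCEEEEEE"
scores = search_matrix(seq1, ss1, seq2, ss2)
```

The resulting `scores` table would then be passed to an ordinary dynamic programming alignment routine in place of the scores from a single substitution matrix.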

In terms of scoring… • So how would you score a profile using this extra information? – Same formula as in lecture 6, but you can use secondary-structure-specific substitution scores in various combinations. • Where does it fit in? – Very important: structure is more conserved than sequence, so if the sequences have forgotten how to match (i.e. they are too divergent), the secondary structure elements might still help the alignment.

Sequences to be aligned → Predict secondary structure → Secondary structure (HHHHCCEEECCHH, CCCCCCEEEECCHH, HHHCCCCEEHHH, HHHHHCCEEEECCC, HHHHHHHCCCEEEE) → Align sequences using secondary structure → Multiple alignment

Using predicted secondary structure [Alignment figure, two blocks: the flavodoxin-cheY set (1fx1, FLAV_DESVH, FLAV_DESGI, FLAV_DESSA, FLAV_DESDE, 2fcr, FLAV_ANASP, FLAV_ECOLI, FLAV_AZOVI, FLAV_ENTAG, 4fxn, FLAV_MEGEL, FLAV_CLOAB, 3chy), with each aligned sequence followed by its secondary structure string]

Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Homology-extended alignment • Matrix extension

PSI-PRALINE Multiple alignment of distant sequences using PSI-BLAST

Distant sequences • Methyltransferase (16.7% sequence identity) • Same α/β fold • 1GZ0A: SEIYGIHAVQALLERAPERFQEVFILKGREDKRLLPLIHALESQGVVIQLANRQYLDEKSDGAVHQGIIARVKPGRQ • 1IPAA: MRITSTANPRIKELARLLERKHRDSQRRFLIEGAREIERALQAGIELEQALVWEGGLNPEEQQVYAALLALLEVSEAVLKKLSVRDNPAGLIALARMPER

How well do alignment methods perform? • Normal pair-wise alignment: 10% correctly aligned positions • T-COFFEE (Notredame et al., 2000): 15% correctly aligned positions • MUSCLE (Edgar, 2004): 15% correctly aligned positions
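A percentage of correctly aligned positions can be computed by comparing a test alignment with a reference alignment of the same two sequences. The sketch below is one plausible way to do this, assuming "correct" means that a residue pair aligned in the reference is also aligned in the test alignment; the function names and the toy sequences are hypothetical.

```python
def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue index pairs matched in two gapped rows."""
    pairs, i, j = set(), 0, 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def fraction_correct(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref = aligned_pairs(*reference)
    if not ref:
        return 0.0
    return len(aligned_pairs(*test) & ref) / len(ref)


# Toy example: the reference alignment has gaps, the test alignment has none.
reference = ("ACDE-FG", "AC-EHFG")
test = ("ACDEFG", "ACEHFG")
score = fraction_correct(test, reference)
```

Here the test alignment recovers four of the five residue pairs in the reference, so `score` is 0.8 (80% correctly aligned positions).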

Profile-profile alignment • Homology detection has provided a solution – BLAST (sequence-sequence) – PSI-BLAST (profile-sequence) – New generation methods (profile-profile)

State-of-the-art homology detection • Sequence used to scan database and collect homologous information (PSI-BLAST) – Local pair-wise alignment • PSI-BLAST profile used to scan profiles of other sequences – Local pair-wise alignment • The difference between methods is the profile scoring scheme used for the alignment

The PRALINE way • Sequence used to scan database and collect homologous information (PSI-BLAST) – Local alignment • PSI-BLAST profiles are used for the alignment instead of the original sequences – Global, pair-wise OR multiple alignment • PRALINE scores 92% correctly aligned positions on the previous example (methyltransferase)
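Aligning profiles instead of sequences requires a score for a pair of profile columns. The slides do not give PRALINE's formula here, so the sketch below uses a common assumption: the score is the frequency-weighted sum of substitution scores over all amino acid pairs from the two columns. The profiles are simple frequency columns rather than real PSI-BLAST output, `toy_score` is a stand-in for a real substitution matrix, and all names are hypothetical.

```python
from collections import Counter


def column_frequencies(column):
    """Relative amino acid frequencies in one alignment column (gaps ignored)."""
    residues = [c for c in column if c != "-"]
    total = len(residues)
    if total == 0:
        return {}
    return {aa: n / total for aa, n in Counter(residues).items()}


def toy_score(a, b):
    """Toy substitution score (assumption): +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def profile_column_score(freq_x, freq_y, score=toy_score):
    """Frequency-weighted sum of substitution scores over all residue pairs."""
    return sum(
        fx * fy * score(a, b)
        for a, fx in freq_x.items()
        for b, fy in freq_y.items()
    )


col_x = column_frequencies("LLIV")  # L: 0.5, I: 0.25, V: 0.25
col_y = column_frequencies("LLLM")  # L: 0.75, M: 0.25
s = profile_column_score(col_x, col_y)
```

With these toy values `s` is 0.125, while two columns that both contain only L would score 2.0. Such column scores fill the dynamic programming matrix exactly as residue pair scores do in sequence-sequence alignment.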

Pair-wise alignment PSI

Multiple alignment PSI PREPRO

Example: methyltransferases

The effects of using E-value thresholds of increasing stringency in PRALINEPSI on the 624 HOMSTRAD pairwise alignments. (A) The difference between the average Q scores of PRALINEPSI and the basic PRALINE method. (B) The distributions of improved, equal and worsened cases compared with the basic PRALINE method for each E-value threshold. The ‘inc’ column is the PRALINEPSI incremental strategy starting from a threshold of 10^-6, and the ‘max’ column is PRALINEPSI’s theoretical upper limit for the tested threshold range.

Strategies for multiple sequence alignment • Profile pre-processing • Secondary structure-induced alignment • Homology-extended alignment • Matrix extension Objective: try to avoid (early) errors

Integrating alignment methods and alignment information with T-Coffee • Integrating different pair-wise alignment techniques (NW, SW, ...) • Combining different multiple alignment methods (consensus multiple alignment) • Combining sequence alignment methods with structural alignment techniques • Plug in user knowledge

Matrix extension: T-Coffee (Tree-based Consistency Objective Function For alignmEnt Evaluation). Cedric Notredame, Des Higgins, Jaap Heringa. J. Mol. Biol., 302, 205-217; 2000

Using different sources of alignment information • Clustal • Dialign • Lalign • Structure alignments • Manual • All of these feed into T-Coffee

T-Coffee library system
Seq1  AA1  Seq2  AA2  Weight
3     V31  5     L33  10
3     V31  6     L34  14
5     L33  6     R35  21
5     L33  6     I36  35
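A library of this kind can be represented as a mapping from residue pairs to weights. The sketch below assumes, following the T-Coffee paper, that each aligned residue pair receives the percent identity of the pairwise alignment it comes from, and that a pair seen in more than one source alignment has its weights summed. The function name, the sequence identifiers and the toy alignments are hypothetical.

```python
from collections import defaultdict


def add_alignment(library, id_a, row_a, id_b, row_b):
    """Add all aligned residue pairs of one pairwise alignment to the library.

    Keys are ((seq_id, position), (seq_id, position)) with 1-based positions.
    The weight is the percent identity over aligned pairs (assumption), and
    pairs already present accumulate weight.
    """
    pairs, i, j, identical = [], 0, 0, 0
    for a, b in zip(row_a, row_b):
        if a != "-":
            i += 1
        if b != "-":
            j += 1
        if a != "-" and b != "-":
            pairs.append((i, j))
            identical += a == b
    weight = round(100 * identical / len(pairs)) if pairs else 0
    for i, j in pairs:
        library[((id_a, i), (id_b, j))] += weight


library = defaultdict(int)
# The same two toy sequences aligned by two different (hypothetical) methods.
add_alignment(library, 1, "ACDEF", 2, "AC-QF")  # 3 of 4 pairs identical: 75
add_alignment(library, 1, "ACDEF", 2, "ACQ-F")  # 3 of 4 pairs identical: 75
```

Pairs supported by both alignments, such as residue 1 of sequence 1 with residue 1 of sequence 2, end up with weight 150, whereas pairs found in only one alignment keep weight 75.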

Matrix extension [Figure: diagram with sequences numbered 1 to 4]

Search matrix extension – alignment transitivity
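Alignment transitivity can be made concrete with the triplet extension described in the T-Coffee paper: the weight of a residue pair x, y is its direct weight plus, for every intermediate residue z from a third sequence aligned to both, the minimum of the weights x-z and z-y. The sketch below is a minimal illustration of that rule, not the T-Coffee implementation; the function name, the data layout and the example weights are assumptions.

```python
from collections import defaultdict


def extend_library(library):
    """Extend a pair library by transitivity through third sequences.

    library maps (res_x, res_y) -> weight, where a residue is a tuple
    (seq_id, position) and each entry links two different sequences.
    Returns a dict keyed by frozenset({res_x, res_y}).
    """
    neighbours = defaultdict(dict)
    for (x, y), w in library.items():
        neighbours[x][y] = w
        neighbours[y][x] = w

    extended = defaultdict(int)
    for (x, y), w in library.items():
        extended[frozenset((x, y))] += w  # direct alignment
    for z, partners in neighbours.items():  # z is the intermediate residue
        items = list(partners.items())
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                (x, wx), (y, wy) = items[a], items[b]
                if x[0] != y[0]:  # x and y must come from different sequences
                    extended[frozenset((x, y))] += min(wx, wy)
    return dict(extended)


primary = {
    (("A", 1), ("B", 1)): 88,
    (("A", 1), ("C", 1)): 77,
    (("C", 1), ("B", 1)): 100,
    (("A", 2), ("C", 2)): 77,   # A2 and B3 are only linked through C2
    (("C", 2), ("B", 3)): 100,
}
ext = extend_library(primary)
```

In this toy library the pair A1-B1 rises from 88 to 165 because sequence C supports it, and the pair A2-B3, which has no direct alignment, gains a weight of 77 through C2.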

T-Coffee Other sequences Direct alignment

Search matrix extension

but... T-COFFEE (V1.23) multiple sequence alignment of the Flavodoxin-cheY set [Alignment figure, two blocks: 1fx1, FLAV_DESVH, FLAV_DESGI, FLAV_DESSA, FLAV_DESDE, 4fxn, FLAV_MEGEL, FLAV_CLOAB, 2fcr, FLAV_ENTAG, FLAV_ANASP, FLAV_AZOVI, FLAV_ECOLI, 3chy]