Burkhard Morgenstern, Institut für Mikrobiologie und Genetik. Foundations of Bioinformatics (Grundlagen der Bioinformatik): Multiple Sequence Alignment, June 2007

"Progressive" Alignment: The most popular approach to (global) multiple sequence alignment is progressive alignment, in use since the mid-1980s (Feng/Doolittle, Higgins/Sharp, Taylor, …).

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WWRLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVPKAKIIRD
YAVESEAHPGSFQPVAALERIN
WLNYNETTGERGDFPGTYVEYIGRKKISP
Guide tree

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASFQPVAALERIN
WLNYNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN-----
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN
WW--RLNDKEGYVPRNLLGLYP
AVVIQDNSDIKVVP--KAKIIRD
YAVESEASVQ--PVAALERIN-----
WLN-YNEERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN-------
WW--RLNDKEGYVPRNLLGLYP-------
AVVIQDNSDIKVVP--KAKIIRD------
YAVESEA---SVQ--PVAALERIN-----
WLN-YNE---ERGDFPGTYVEYIGRKKISP
Profile alignment, “once a gap - always a gap”

"Progressive" Alignment
WCEAQTKNGQGWVPSNYITPVN-------
WW--RLNDKEGYVPRNLLGLYP-------
AVVIQDNSDIKVVP--KAKIIRD------
YAVESEA---SVQ--PVAALERIN-----
WLN-YNE---ERGDFPGTYVEYIGRKKISP
Most important implementation: CLUSTAL W

"Progressive" Alignment: CLUSTAL W (Thompson et al., 1994; ~17,000 citations). Pairwise distances are computed as 1 minus the percentage of identity. An unrooted tree is calculated with Neighbor Joining. The root is defined as the central position in the tree. Sequence weights are defined based on the tree. Gap penalties are calculated based on various parameters.
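The first CLUSTAL W step listed above, a distance of 1 minus identity, can be sketched as follows. This is a minimal illustration rather than CLUSTAL W's own code: it assumes the two sequences have already been aligned pairwise (equal-length rows with `-` for gaps) and that columns containing a gap are skipped, a convention the slides do not specify. The function names `identity_distance` and `distance_matrix` are hypothetical.

```python
from typing import List


def identity_distance(row_a: str, row_b: str) -> float:
    """Return 1 - (fraction of identical residues) for two aligned rows.

    Assumption: columns where either row has a gap ('-') are ignored.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # skip gap columns (assumed convention)
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 1.0  # nothing comparable: treat as maximally distant
    return 1.0 - identical / compared


def distance_matrix(rows: List[str]) -> List[List[float]]:
    """Symmetric matrix of pairwise identity distances.

    Assumption: all rows come from one common alignment; in practice
    each pair would be aligned separately before measuring identity.
    """
    n = len(rows)
    return [
        [0.0 if i == j else identity_distance(rows[i], rows[j]) for j in range(n)]
        for i in range(n)
    ]
```

A matrix of this kind is what a tree-building method such as Neighbor Joining would take as input.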

Tools for multiple sequence alignment: Problems with the traditional approach: Results depend on the gap penalty. A heuristic guide tree determines the alignment, yet the alignment is then used for phylogeny reconstruction. The algorithm produces global alignments.

Tools for multiple sequence alignment: Problems with the traditional approach: But many sequence families share only local similarity, e.g. the sequences share one conserved motif.

Local sequence alignment
EYENS
ERYAS
Find common motif in sequences; ignore the rest

Local sequence alignment
E-YENS
ERYA-S
Find common motif in sequences; ignore the rest

Local sequence alignment
E-YENS
ERYA-S
Find common motif in sequences; ignore the rest: local alignment
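One standard way to "find the common motif and ignore the rest" is Smith-Waterman dynamic programming. The slides do not name an algorithm or a scoring scheme, so the sketch below is an assumption on both counts: the function name `smith_waterman` is hypothetical and the match, mismatch and gap scores are arbitrary illustrative values.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -1) -> Tuple[int, str, str]:
    """Return (best score, aligned part of a, aligned part of b).

    Scores are illustrative assumptions, not values from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment start anywhere: this makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `smith_waterman("EYENS", "ERYAS")` returns the best-scoring local region of the slide's two sequences under these assumed scores; with other scores a different region may win.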

Local sequence alignment: Traditional alignment approaches are either global or local methods!

New question: sequence families with multiple local similarities. Neither local nor global methods are applicable.

New question: sequence families with multiple local similarities. Alignment is possible if the order is conserved.
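The "order conserved" condition can be illustrated for the simplest case of two sequences. The sketch below is a simplified assumption, not the procedure from the slides: each local similarity is represented as a gap-free fragment `(start_a, start_b, length)`, fragments are required not to overlap, and the hypothetical helper `is_order_conserved` checks whether all of them fit into one alignment.

```python
from typing import List, Tuple

# (start in sequence A, start in sequence B, length); hypothetical representation
Fragment = Tuple[int, int, int]


def is_order_conserved(fragments: List[Fragment]) -> bool:
    """True if all fragments can be placed in a single pairwise alignment.

    Assumption: fragments must not overlap, so each one has to start
    after the previous one ends in BOTH sequences.
    """
    ordered = sorted(fragments)  # sort by position in sequence A
    for (a1, b1, len1), (a2, b2, _len2) in zip(ordered, ordered[1:]):
        if a2 < a1 + len1 or b2 < b1 + len1:
            return False  # fragments cross or overlap
    return True


# Two motifs in the same order in both sequences: alignable.
print(is_order_conserved([(0, 0, 3), (5, 6, 4)]))
# The same motifs in swapped order in sequence B: not alignable together.
print(is_order_conserved([(0, 6, 3), (5, 0, 4)]))
```

Checking only neighbouring fragments is sufficient here because the condition is transitive once the fragments are sorted.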

The DIALIGN approach

The DIALIGN approach Consistency!

T-COFFEE: C. Notredame, D. Higgins, J. Heringa (2000), T-Coffee: A novel method for fast and accurate multiple sequence alignment, J. Mol. Biol. Problem: progressive alignment can go wrong if mistakes are made at an early stage. Example …

T-COFFEE
Seq. A: GARFIELD THE LAST FAT CAT
Seq. B: GARFIELD THE FAST CAT
Seq. C: GARFIELD THE VERY FAST CAT
Seq. D: THE FAT CAT

T-COFFEE: Idea: consider different pairwise alignments (local and global) and check how these alignments support each other.
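How pairwise alignments can "support each other" may be sketched with a triplet-style weight extension: the weight of a residue pair from sequences A and B is increased whenever both residues are also aligned to the same residue of a third sequence. The code is a simplified illustration under assumptions, not T-Coffee's implementation: residues are hypothetical `(sequence_name, position)` tuples, the primary weights are made-up numbers, and the function name `extend_library` is invented for this example.

```python
from collections import defaultdict
from typing import Dict, Tuple

Residue = Tuple[str, int]            # (sequence name, position); hypothetical
Library = Dict[Tuple[Residue, Residue], float]


def extend_library(primary: Library) -> Library:
    """Add support from third sequences to every residue-pair weight.

    extended(x, y) = primary(x, y) + sum over residues z of other
    sequences of min(primary(x, z), primary(z, y)).
    """
    neighbors: Dict[Residue, Dict[Residue, float]] = defaultdict(dict)
    for (x, y), w in primary.items():
        neighbors[x][y] = w
        neighbors[y][x] = w  # weights are treated as symmetric

    residues = list(neighbors)
    extended: Library = {}
    for x in residues:
        for y in residues:
            if x >= y or x[0] == y[0]:
                continue  # each unordered pair once, different sequences only
            total = neighbors[x].get(y, 0.0)
            for z, w_xz in neighbors[x].items():
                if z[0] in (x[0], y[0]):
                    continue  # the intermediate must be a third sequence
                w_zy = neighbors[z].get(y)
                if w_zy is not None:
                    total += min(w_xz, w_zy)
            if total > 0:
                extended[(x, y)] = total
    return extended
```

Pairs that are consistent with many other pairwise alignments end up with high weights, which is one way to make the later progressive step less sensitive to a single misleading pairwise alignment.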

T-COFFEE: Less sensitive to spurious pairwise similarities. Can handle local homologies better than CLUSTAL.

Evaluation of multi-alignment methods: Alignments are evaluated by comparison to trusted benchmark alignments. The "true" alignment is known from information about structure or evolution.
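A common way to compare a computed alignment with a trusted reference is a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The slides do not say which score is used, so this choice is an assumption, as are the helper names `aligned_pairs` and `sum_of_pairs_score` and the treatment of `-` and `.` as gap characters.

```python
from typing import List, Set, Tuple

Pair = Tuple[Tuple[int, int], Tuple[int, int]]  # ((seq, residue), (seq, residue))


def aligned_pairs(rows: List[str]) -> Set[Pair]:
    """Collect all residue pairs that share a column in the alignment."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have equal length")
    pairs: Set[Pair] = set()
    counters = [0] * len(rows)  # index of the next residue in each sequence
    for column in zip(*rows):
        present = []
        for s, ch in enumerate(column):
            if ch not in "-.":  # assumed gap characters
                present.append((s, counters[s]))
                counters[s] += 1
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs


def sum_of_pairs_score(test_rows: List[str], reference_rows: List[str]) -> float:
    """Fraction of reference residue pairs reproduced by the test alignment."""
    reference = aligned_pairs(reference_rows)
    if not reference:
        return 0.0
    return len(reference & aligned_pairs(test_rows)) / len(reference)
```

A score of 1.0 means every residue pair of the reference is reproduced; rows must list the sequences in the same order in both alignments.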

Evaluation of multi-alignment methods: For protein alignment:
M. McClure et al. (1994): 4 protein families, known functional sites
J. Thompson et al. (1999): Benchmark database, 130 known 3D structures (BAliBASE)
T. Lassmann & E. Sonnhammer (2002): BAliBASE + simulated evolution (ROSE)

Evaluation of multi-alignment methods: [Figure: BAliBASE reference alignment of the structures 1aboA, 1ycsB, 1pht, 1ihvA and 1vie. Key: alpha helix in red, beta strand in green, core blocks underscored.]

Evaluation of multi-alignment methods: 5 categories of benchmark sequences (globally related, internal gaps, end gaps). CLUSTAL W and PRRP perform well on globally related sequences; DIALIGN is superior for local similarities. Conclusion: there is no single best multi-alignment program!

Evaluation of multi-alignment methods: T. Lassmann & E. Sonnhammer (2002): BAliBASE + simulated evolution (ROSE). Result: DIALIGN is best for distantly related sequences, T-COFFEE is best for closely related sequences.