Classifying MSA Packages: Multiple Sequence Alignments in the Genome Era
Cédric Notredame, Information Génétique et Structurale, CNRS-Marseille, France

What’s in a Multiple Alignment?
• Structural criteria
  – Residues are arranged so that those playing a similar role end up in the same column.
• Evolutionary criteria
  – Residues are arranged so that those having the same ancestor end up in the same column.
• Similarity criteria
  – As many similar residues as possible end up in the same column.
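The similarity criterion can be made concrete with a toy column score. The sketch below is only an illustration: it assumes the alignment is a list of equal-length gapped strings and that "similar" simply means "identical". The function name `identical_pair_score` is hypothetical; real packages rely on substitution matrices and gap penalties rather than plain identity.

```python
from itertools import combinations


def identical_pair_score(alignment):
    """Count identical residue pairs that share a column (gaps are ignored)."""
    score = 0
    # zip(*alignment) walks the alignment column by column.
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        score += sum(1 for a, b in combinations(residues, 2) if a == b)
    return score


# Toy alignment (made-up sequences, for illustration only).
toy = ["ACGT",
       "AC-T",
       "AGGT"]
print(identical_pair_score(toy))  # prints 8
```

Under this toy criterion, the alignment with the highest count is the "best" one; the structural and evolutionary criteria cannot be computed from the sequences alone in this way.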

What’s in a Multiple Alignment?
• The MSA contains what you put inside…
• You can view your MSA as:
  – A record of evolution
  – A summary of a protein family
  – A collection of experiments made for you by Nature…

Multiple Alignments: What Are They Good For?

Computing the Correct Alignment is a Complicated Problem

A Taxonomy of Multiple Sequence Alignment Packages
• Objective Function
• Assembly Algorithms

The Objective Function

The Assembly Algorithm

A Tale of Three Algorithms
• Progressive: ClustalW
• Iterative: MUSCLE
• Consistency-based: T-Coffee and ProbCons

ClustalW Algorithm
• Paulien Hogeweg: first description (1981)
• Taylor, Doolittle: reinvention in 1989
• Higgins: most successful implementation

MUSCLE Algorithm: Using Iteration
• AMPS: first iterative algorithm (Barton, 1987)
• Stochastic methods: genetic algorithms and simulated annealing (Notredame, 1995)
• Prrp: ancestor of MUSCLE and MAFFT (1996)
• MUSCLE: the most successful iterative strategy to this day

Consistency-Based Algorithms
• Gotoh (1990)
  – Iterative strategy using consistency
• Martin Vingron (1991)
  – Dot matrix multiplications
  – Accurate but too stringent
• Dialign (1996, Morgenstern)
  – Consistency
  – Agglomerative assembly
• T-Coffee (2000, Notredame)
  – Consistency
  – Progressive algorithm
• ProbCons (2004, Do)
  – T-Coffee with a Bayesian treatment

T-Coffee and Consistency…

ProbCons: A Bayesian T-Coffee
$\mathrm{Score}(x_i \sim y_j \mid x, y, z) \approx \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid z, y)$
$\mathrm{Score} = \sum \mathrm{MIN}(xz, zk) / \mathrm{MAX}(xz, zk)$
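As a rough illustration of the sum over k in the first score, the following sketch combines two pairwise match-probability matrices through a third sequence z. It assumes plain nested lists in which `p_xz[i][k]` stands for P(x_i ~ z_k | x, z) and `p_zy[k][j]` for P(z_k ~ y_j | z, y). The function name and the toy numbers are made up for the example, and this is not the ProbCons implementation.

```python
def consistency_score(p_xz, p_zy):
    """Return s[i][j] = sum over k of p_xz[i][k] * p_zy[k][j]."""
    n_k = len(p_zy)
    if any(len(row) != n_k for row in p_xz):
        raise ValueError("p_xz columns must match p_zy rows (positions of z)")
    n_j = len(p_zy[0]) if n_k else 0
    return [
        [sum(row[k] * p_zy[k][j] for k in range(n_k)) for j in range(n_j)]
        for row in p_xz
    ]


# Toy posterior matrices for sequences x, z, y of length 2 (illustrative values).
p_xz = [[0.9, 0.1],
        [0.2, 0.8]]
p_zy = [[0.5, 0.5],
        [0.0, 1.0]]
scores = consistency_score(p_xz, p_zy)
print(scores)
```

A pair (x_i, y_j) scores highly only when some position z_k is confidently matched to both, which is the intuition behind consistency: the third sequence acts as a witness for the pairwise match.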

Evaluating Methods… Who is the best? Says who…?

Structures Vs Sequences

Evaluating Alignment Quality: Collections and Results

Evaluating Alignment Quality: Collections
• Homstrad: the oldest
• SAB: yet another benchmark
• Prefab: the most extensive and automated
• BAliBASE: the first designed for MSA benchmarks (recently updated)

Homstrad (Mizuguchi, Blundell, Overington, 1998)
• Hand-curated structure superposition
• Not designed for multiple alignments
• Biased with ClustalW
• No CORE annotation
Diagram labels: Hom+0, Hom+3, Hom+8

Homstrad: Known Issues
Thiored.aln

1aaza  ------------ mfkvygydsnihkcvycdnakrlltvkk-----qpf
1ego   ------------ mqtvifgrs----gcpycvrakdlaeklsnerddfqy
1thx   skgviti-tdaefesevlkae-qpvlvyfwaswcgp cqlmsplinlaantys---drlkv
2trxa  sdkiihl-tddsfdtdvlkad-gailvdfwaewcgp ckmiapildeiadeyq---gkltv
3trx   --mvkqiesktafqealdaagdklvvvdfsatwcgp ckmikpffhslsekys----nvif
3grx   ------------ anveiytke----tcpyshrakallsskg-----vsf
       : .

1aaza  efinimpekgvfddekiaelltklgrdtqigltmpqvfapd----gshigg---fdqlre
1ego   qyvdirae-----gitkedlqqkagkp---vetvpqifv-d----qqhigg---ytdfaa
1thx   vkleid-----pnpttvkkykve-----gvpalrlvkgeqildstegviskdklls
2trxa  aklnid-----qnpgtapkygir-----giptlllfkngevaatkvgalskgqlke
3trx   levdvd-----dcqdvasecevk-----ctptfqffkkgqkvgefsgan-keklea
3grx   qelpidgn-----aakreemikrsgr-----ttvpqifi-d----aqhigg---yddlya
       : : . *. :

SAB (Van Walle, 2003)
• Multiple structural alignments of distantly related sequences
• TWs: very low similarity (250 MSAs)
• TWd: low similarity (480 MSAs)
Diagram labels: SABs+0, TWs+3, TWs+8

Prefab (Edgar, 2003)
• Automatic pairwise structural alignments
• Align pairs of structures with two methods to define CORES
• Add 50 intermediate sequences with PSI-BLAST
• Large dataset (1675 MSAs)
Diagram labels: Align with CE and FSSP; Add intermediate sequences with PSI-BLAST; Prefab

Prefab (MUSCLE Reference Dataset)

Who is the Best?

Dataset    N. MSAs   T-Coffee   ProbCons   MUSCLE
Hom+50          40      49.71      51.59    46.90
SABs+50        209      21.85      22.53    19.61
SABf+50        425      45.18      44.85    38.17
Prefab        1675      67.96      67.95    66.05

A Case for Reading Papers: The FFT of MAFFT

G-INS-i, H-INS-i and F-INS-i use pairwise alignment information when constructing a multiple alignment. The two options ([HF]-INS-i) incorporate local alignment information and do NOT USE FFT.

Improving T-Coffee
• Ease the use of heterogeneous information
  – 3DCoffee
• Speed up the algorithm
  – T-Coffee-DPA (Double Progressive Algorithm)
  – Parallel T-Coffee (collaboration with EPFL)

3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments

T-Coffee-DPA
• DPA: Double Progressive ALN
• Target: 1,000 to 10,000 sequences
• Principle: DC Progressive ALN
• Application: decreasing redundancy

Who is the Best?
• Most packages claim to be more accurate than T-Coffee; few really are…
• None of the existing packages is consistently the best: the PERFECT method does not exist

Conclusion
• Consistency-based methods have an edge over conventional ones
  – Better management of the data
  – Better extension possibilities
• Hard to tell methods apart
  – Reference databases are not very precise
  – Algorithms evolve quickly
• Sequence alignment is NOT a solved problem
  – Will be solved when structure prediction is solved

http://igs-server.cnrs-mrs.fr/Tcoffee
• Fabrice Armougom
• Sebastien Moretti
• Olivier Poirot
• Karsten Suhre
• Chantal Abergel
• Des Higgins
• Orla O’Sullivan
• Iain Wallace
cedric.notredame@europe.com

Dissemination: The Right Vector
• Amazon.com: 12/11/05
• Barnes & Noble (US): 12/11/05
• Amazon.co.uk: 12/11/05

Cadrie Notredom et Michael Claverie

T-Coffee-DPA
• T-Coffee-DPA is about 20 times faster than the standard T-Coffee
• Preliminary tests indicate a slightly higher accuracy
• Beta-test versions will be available by September but can be sent on request

3D T-Coffee-DPA vs. the Human Kinome…
• 521 sequences
• 46 structures having 80% or more sequence identity with other kinome structures
• Use of 3D-Coffee-DPA (unpublished), developed especially for the kinome analysis
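For readers unfamiliar with identity thresholds such as the 80% figure, here is a minimal sketch of one common way to compute percent identity between two rows of an alignment. It assumes gaps are written as "-" and that the denominator is the number of columns where both sequences carry a residue; other definitions exist, and the slides do not say which one was used. The function name and sequences are hypothetical.

```python
def percent_identity(row_a, row_b):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


# Toy aligned pair: 5 gap-free columns, 4 of them identical.
print(percent_identity("ACGT-A", "ACCTTA"))  # prints 80.0
```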

Structure-Based Evaluation
• Include sequences with known structures
  – Do not use structural information: Score 1
  – Use structural information: Score 2
• Score 1 vs. Score 2
  – Evaluates the accuracy of the reconstruction strategy
  – Estimates the accuracy of the alignment for sequences without a known structure

How Good is Our Kinome Alignment?

BAliBASE (Thompson, 1999)
• Hand-made structure superposition
• Not all the sequences have structures
• Comparisons are made on CORE blocks
• Different categories for different types of problems
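Comparing a computed alignment with a reference on CORE blocks can be sketched as a sum-of-pairs accuracy: the fraction of residue pairs aligned in the reference core columns that are also aligned in the test alignment. This is a simplified illustration, not the official BAliBASE scoring program. It assumes alignments are dictionaries mapping sequence names to gapped strings, and the names `aligned_pairs`, `sum_of_pairs_accuracy` and `core_columns` are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment, columns=None):
    """Collect residue pairs sharing a column, optionally only in given columns."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    next_index = {name: 0 for name in names}  # ungapped residue counters
    pairs = set()
    for col in range(length):
        residues = []
        for name in names:
            if alignment[name][col] != "-":
                residues.append((name, next_index[name]))
                next_index[name] += 1
        if columns is None or col in columns:
            pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_accuracy(test, reference, core_columns=None):
    """Fraction of reference (core) residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(reference, core_columns)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy data (made up): the test alignment misplaces one residue of sequence "a".
reference = {"a": "AC-GT", "b": "ACTGT"}
test = {"a": "ACG-T", "b": "ACTGT"}
print(sum_of_pairs_accuracy(test, reference))                       # prints 0.75
print(sum_of_pairs_accuracy(test, reference, core_columns={0, 1}))  # prints 1.0
```

Restricting the score to core columns, as in the second call, ignores regions where the reference itself is unreliable, which is why the same test alignment can score differently on the full reference and on its core.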

Most Reference Databases Have Problems: BAliBASE

• BAliBASE 1abo, Reference 1

1aboA  -NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN-------GEW
1ycsB  KGVIYALWDYEPQNDDELPMKEGDCMTIIHREDE------deIEW
1pht   GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFSDGQEARPeeIGW
1ihvA  -NFRVYYRDSRD------PVWKGPAKLLWKG---------EGA
       * :

1aboA  CEAQT--KNGQGWVPSNYITPVN-----
1ycsB  WWARL--NDKEGYVPRNLLGLYP-----
1pht   LNGYNETTGERGDFPGTYVEYIGRKKISP
1ihvA  VVIQD--NSDIKVVPRRKAKIIRD-----

• BAliBASE 1abo, Reference 2

1aboA  -NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN-------GEW
1ycsB  KGVIYALWDYEPQNDDELPMKEGDCMTIIHREDEDE------IEW
1pht   GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFSDGQEARPEEIGW
1ihvA  -NFRVYYRDSRD------PVWKGPAKLLWKG---------EGA
       * :

1aboA  CEAQTK--NGQGWVPSNYITPVN-----
1ycsB  WWARL--NDKEGYVPRNLLGLYP-----
1pht   LNGYNeTTGERGDFPGTYVEYIGRKKISP
1ihvA  VVIQD--NSDIKVVPRRKAKIIRD-----

3D T-Coffee-DPA vs. the Human Kinome…
• Sequences in our kinome MSA dataset were provided by Aventis
• They do not include the alpha kinases
• Assembling an exhaustive kinome dataset remains a target (cf. Projects)