Multiple Sequence Alignment by Iterative Tree-Neighbor Alignments Susan Bibeault June 9, 2000
Outline
• Problem Statement and Importance
• Terminology
• Current Approaches
• Our Alignment Heuristic
• Performance Results
• Conclusions
• Future Work
Multiple Sequence Alignment
• Problem: given a set of sequences
  – Insert gaps into sequences so that evolutionarily conserved regions are aligned
• Important tool
  – Relate Homologous Proteins
  – Discover Conserved Regions
Example sequences (recovered from the slide figure, in extraction order):
-VLSPADN--VKAAWGKVGAHAGEYGAEALERM
V-LSPADN--VKAAWGKVGAHAGEYGAEALERM---FF----V
VLSPADNVKAAWGKVGAHAGEYGAEALERMF
VHLTPEEKSAVTALWGKVNVD--EVGGEALGRLLVVYP
VHLTPEEKSAVTALWGKVNVD--EVGGEALGRL
LVVYP
VHVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVY
G-LSDGEWQLVLNVWGKVEA---DIPGHVLIRL---FK
-GLSDGEWQLVLNVWGKVEA---DIPGHVLIRL
FK---G
GLSDGEWQLVLNVWGKVEADIPGHVLIRLFK
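Scoring Multiple Alignments
• Sum of Pairs: total of cost(i, j) over all pairs of sequences i, j
• Tree based: total of cost(edge) over the edges of the tree
[Figure: example tree with leaves gorilla, chimpanzee, orangutan, gibbon, human]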
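The two scoring schemes can be sketched in a few lines. This is only an illustration: the unit costs in `pair_cost`, the function names, and the representation of the tree as a list of edges between named rows are all assumptions, not details from the slides.

```python
from itertools import combinations

def pair_cost(a, b, mismatch=1, gap=2):
    """Hypothetical column cost: 0 for identical symbols (including gap-gap)."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return gap
    return mismatch

def row_cost(s, t, cost=pair_cost):
    """Cost of two aligned rows, summed column by column."""
    return sum(cost(a, b) for a, b in zip(s, t))

def sum_of_pairs(rows, cost=pair_cost):
    """Sum-of-pairs score: add the cost of every pair of rows."""
    assert len({len(r) for r in rows}) == 1, "rows must have equal length"
    return sum(row_cost(s, t, cost) for s, t in combinations(rows, 2))

def tree_cost(named_rows, edges, cost=pair_cost):
    """Tree-based score: add the cost only along the given tree edges."""
    return sum(row_cost(named_rows[u], named_rows[v], cost) for u, v in edges)

rows = ["AC-", "ACG", "A-G"]
named = {"x": "AC-", "y": "ACG", "z": "A-G"}
sp = sum_of_pairs(rows)
tc = tree_cost(named, [("x", "y"), ("y", "z")])
```

Sum of pairs counts every pair of sequences, so closely related sequences are effectively counted many times; the tree-based score charges each edge once.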
Alignments
[Figure: pairwise alignment grid for the sequences VLSPADNVKA and GLSDGEWQLVL]
• Scoring
  – Cost Matrix: C(aa1, aa2)
  – Gap Penalties
    Simple: C(aa, -)
    Affine: C(-) + Len * C(aa, -)
• Recurrence, with g the gap penalty:
\[
\mathrm{Cost}(s[1..i],\, t[1..j]) = \min
\begin{cases}
\mathrm{Cost}(s[1..i],\, t[1..j-1]) + g \\
\mathrm{Cost}(s[1..i-1],\, t[1..j-1]) + C(s[i], t[j]) \\
\mathrm{Cost}(s[1..i-1],\, t[1..j]) + g
\end{cases}
\]
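A minimal sketch of the pairwise cost recurrence with a simple (non-affine) gap penalty. The function names and the unit substitution cost are assumptions for illustration; a real run would use a PAM-derived cost matrix instead.

```python
def unit_cost(a, b):
    """Hypothetical substitution cost: 0 for a match, 1 for a mismatch."""
    return 0 if a == b else 1

def align_cost(s, t, sub_cost=unit_cost, g=1):
    """Minimum cost of globally aligning s and t with gap penalty g."""
    rows, cols = len(s) + 1, len(t) + 1
    D = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        D[i][0] = i * g
    for j in range(1, cols):
        D[0][j] = j * g
    for i in range(1, rows):
        for j in range(1, cols):
            D[i][j] = min(
                D[i][j - 1] + g,                               # gap in s
                D[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # (mis)match
                D[i - 1][j] + g,                               # gap in t
            )
    return D[-1][-1]
```

With unit costs this reduces to edit distance, which makes the function easy to sanity-check before plugging in a protein cost matrix.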
Current Approaches
• Global methods
  – Optimal Algorithms (MSA, MWT, MUSEQAL)
  – Progressive (MULTALIGN, PILEUP, CLUSTAL, MULTAL, AMULT, DFALIGN, MAP, PRRP, AMPS)
• Local methods
  – PIMA, DIALIGN, PRALIGN, MACAW, BlockMaker, Iteralign
• Combined (GENALIGN, ASSEMBLE, DCA)
• Statistical (HMMT, SAGA, SAM, Match Box)
• Parsimony (MALIGN, TreeAlign)
[Figure: Global Alignment, ABCDEFGHI against ABCD-FGHI; Local Alignment, XXXABCDYYY against ZZZABCDEEEE]
Our Heuristic
• Distance Estimation
• Tree Construction
• Node Initialization
• Tree Partitioning
• Iteration
Estimation of Protein Distance
• Aligned Sequences give Estimated Pair Distances
• Issue: Implied vs. Optimal Pair Alignments
Example sequences (recovered from the slide figure, in extraction order):
PEAAALYGRFT---IKSDVW
PESLALYNKFSIKSDVW
PESLALYNKF---SIKSDVW
PEALNYGRY-SSESDVW
PEALNYGRY----SSESDVW
PESLALYNKF---SIKSDVW
PEALNYGWY----SSESDVW
PEALNYGRY----SSESDVW
PEVIRMQDDNPFSFSQSDVY
PESLALYNKFSIKSDVW
PEALNYGWY----SSESDVW
PEAL-NYGRYSSESDVW
PEVIRMQDDNPFSFSQSDVY
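The difference between an implied and an optimal pair alignment can be made concrete with a small sketch. The implied pair is what remains when two rows are taken out of a multiple alignment and columns where both rows have gaps are dropped; the optimal pair realigns the two ungapped sequences from scratch. Function names and unit costs are assumptions, and the slides do not say how the distance itself is derived from either alignment.

```python
def implied_pair(row_a, row_b):
    """Project two rows out of a multiple alignment, dropping gap-gap columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

def aligned_cost(row_a, row_b, mismatch=1, gap=1):
    """Cost of an already-aligned pair (hypothetical unit costs)."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a != b:
            total += mismatch
    return total

def optimal_cost(s, t, mismatch=1, gap=1):
    """Cost of the best pairwise alignment of the ungapped sequences."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [i * gap]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                curr[j - 1] + gap,
                prev[j - 1] + (0 if a == b else mismatch),
                prev[j] + gap,
            ))
        prev = curr
    return prev[-1]

# Two rows whose gaps were placed to suit other sequences in the alignment.
a, b = implied_pair("A-C", "AC-")
implied = aligned_cost(a, b)
optimal = optimal_cost(a.replace("-", ""), b.replace("-", ""))
```

The implied cost can never be lower than the optimal one under the same costs, so distances estimated from implied pairs tend to overstate divergence.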
Optimal Pair vs. Implied Pair
Interior Node Classification
• Interior Nodes Classified by Percent Identity
  – PID = (# matched residues) / (# total residues)
  – User Specified Tiers
  – User Specified Cost Criterion
• Example:
  – PID > 60%: PAM 40, High Gap Penalties
  – PID > 40%: PAM 120, Medium Gap Penalties
  – PID < 40%: PAM 200, Low Gap Penalty
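A small sketch of the tier lookup, using the example thresholds from the slide. Two details are assumptions: the PID denominator is taken as the number of columns that are not gap-gap, and a PID of exactly 40% (which the slide leaves unassigned) falls into the lowest tier. The matrix and penalty labels are plain strings here, not real matrix objects.

```python
def percent_identity(row_a, row_b):
    """PID of two aligned rows; gap-gap columns are ignored (assumption)."""
    matches = total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue
        total += 1
        if a == b:
            matches += 1
    return matches / total if total else 0.0

# (threshold, cost matrix, gap penalty level), highest tier first.
EXAMPLE_TIERS = [(0.60, "PAM40", "high"), (0.40, "PAM120", "medium")]
LOWEST_TIER = ("PAM200", "low")

def classify(pid, tiers=EXAMPLE_TIERS, default=LOWEST_TIER):
    """Return the (matrix, gap level) of the first tier whose threshold is exceeded."""
    for threshold, matrix, gap_level in tiers:
        if pid > threshold:
            return matrix, gap_level
    return default
```

Keeping the tiers in a list mirrors the "user specified" wording: the thresholds and cost criteria can be swapped without touching the lookup.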
Ordering Alignments
• Isolate Sub Trees
  – Threshold PID
• Order Alignments
  1. Sub Tree
  2. Border Nodes
  3. Integrate All
Interior Alignments
• Sum of Pairs
• Bounded Search
• Implementation
  – Modular
  – Reentrant
  – Flexible Cost Criterion
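Generating Consensus
• Alignment (A1, A2, A3)
• Consensus X: \( \min \sum_{k} D_k(A_k, X) \)
• For Each Position i, over candidate residues x:
\[
X_i = \arg\min_{x} \big( \mathrm{cost}(x, A_{1,i}) + \mathrm{cost}(x, A_{2,i}) + \mathrm{cost}(x, A_{3,i}) \big)
\]
[Figure: consensus X joined to A1, A2, A3 by distances D1, D2, D3]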
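The per-position rule can be sketched as a column-wise minimization. The alphabet, the unit cost function, and the tie-breaking (first candidate in alphabet order wins) are assumptions for illustration; the slides do not specify them.

```python
def unit_cost(x, a):
    """Hypothetical cost: 0 if the symbols agree, 1 otherwise (gap is a symbol)."""
    return 0 if x == a else 1

def consensus(rows, cost=unit_cost, alphabet="ACGT-"):
    """Pick, for each column, the symbol with the lowest total cost to all rows."""
    result = []
    for column in zip(*rows):
        best = min(alphabet, key=lambda x: sum(cost(x, a) for a in column))
        result.append(best)
    return "".join(result)

x = consensus(["ACG", "ACT", "AGT"])
```

Because each column is minimized independently, the consensus is optimal for a position-wise cost but ignores affine gap effects that span columns.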
Testing the Method
• BAliBASE benchmark
  – “Correct” Alignments
  – Core Blocks of Conserved Motifs
  – Typical “Hard Problem” Sets
• Protein Parsimony
  – Measures “Evolutionary Steps” of Alignment
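The result slides that follow report SP and TC scores against the benchmark. The slides do not define them, so the sketch below assumes the usual reading: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly. All function names are hypothetical, and restricting the comparison to core blocks is left out.

```python
from itertools import combinations

def columns_as_indices(alignment):
    """Describe each column by the residue index it holds per row (None = gap)."""
    counters = [0] * len(alignment)
    columns = []
    for column in zip(*alignment):
        entry = []
        for row, symbol in enumerate(column):
            if symbol == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        columns.append(tuple(entry))
    return columns

def aligned_pairs(alignment):
    """Set of (row1, index1, row2, index2) for residues sharing a column."""
    pairs = set()
    for column in columns_as_indices(alignment):
        for (r1, i1), (r2, i2) in combinations(enumerate(column), 2):
            if i1 is not None and i2 is not None:
                pairs.add((r1, i1, r2, i2))
    return pairs

def sp_score(test, reference):
    """Fraction of reference residue pairs also aligned in the test alignment."""
    ref_pairs = aligned_pairs(reference)
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

def tc_score(test, reference):
    """Fraction of reference columns that appear unchanged in the test alignment."""
    test_columns = set(columns_as_indices(test))
    ref_columns = columns_as_indices(reference)
    return sum(c in test_columns for c in ref_columns) / len(ref_columns)
```

Both scores compare residue positions rather than characters, so two alignments of the same sequences are judged only on which residues they place together.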
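Baseline: BAliBASE SP [chart; an arrow marks the "better" direction]
Baseline: BAliBASE TC [chart; an arrow marks the "better" direction]
Baseline: ProtPars [chart; an arrow marks the "better" direction]
Orphans/Families: BAliBASE SP [chart; an arrow marks the "better" direction]
Orphans/Families: ProtPars [chart; an arrow marks the "better" direction]
Larger Families [chart; an arrow marks the "better" direction]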
Conclusions
• Solution Quality
• Captures Evolutionary Information
• Iterations Converge Quickly
• Useful Tool
Future Work
• Improved Alignment Consensus
• Multiple Partitioning Thresholds
• Multiple Solutions
• Integrated Phylogeny Modifications
• Parallel Implementation