CAP 5510 – Bioinformatics
Multiple Sequence Alignment
Tamer Kahveci
CISE Department, University of Florida
Goals
• Understand
  – What multiple alignment is
  – Why we align multiple sequences
• Learn
  – How multiple alignments are scored
  – Major multiple alignment methods
    • Dynamic programming
      – Standard
      – MSA
    • Progressive alignment
      – Star
      – CLUSTALW
What is Multiple Alignment?
• Alignment of more than two sequences
• Global: multiple alignment
  – http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/

scxa_buteu  vrdgyiaddk dcayfcgr.. .naycdeeck ...kgaesgk cwyagqygna
scx1_titse  .kdgypveyd ncayicwnyd .naycdklck ..dkkadsgy cyw...vhil
scx6_titse  .regypadsk gckitcflta .agycntect ..lkkgssgy caw...pa
scx1_cenno  .kdgylvdak gckkncyklg kndycnrecr mkhrggsygy c...ygfg
six2_leiqu  ..dgyirkrd gcklsclfg. .negcnkeck ..syggsygy cwt...wgla

scxa_buteu  cwcyklpdwv pikqkvsgk. cn..
scx1_titse  cycyglpdse ptktn..gk. cksgkk
scx6_titse  cycyglpesv kiwtsetnk. c...
scx1_cenno  cyceglsdst ptwplp.nkt csgk..
six2_leiqu  cwceglpd.e ktwksetn.t cg.
What is Local Multiple Alignment?
• Local: motif (http://blocks.fhcrc.org/blocks-bin/getblock.sh?PR00624)

ID   HISTONEH5; BLOCK
AC   PR00624A; distance from previous block=(9,12)
DE   Histone H5 signature
BL   adapted; width=22; seqs=9; 99.5%=986; strength=1407
H10_HUMAN|P07305  ( 10) AKPKRAKASKKSTDHPKYSDMI   63
H5A_XENLA|P22844  ( 11) AKPKRSKALKKSTDHPKYSDMI   71
H10_RAT|P43278    ( 10) AKPKRAKAAKKSTDHPKYSDMI   70
H10_MOUSE|P10922  ( 10) AKPKRAKASKKSTDHPKYSDMI   63
Q91759            (  9) AKPRRSKASKKSTDHPKYSDMI   71
H5B_XENLA|P22845  (  9) AKPRRSKASKKSTDHPKYSDMI   71
H5_CHICK|P02259   ( 11) AKPKRVKASRRSASHPTYSEMI  100
H5_CAIMO|P06513   ( 12) AKPKRAKAPRKPASHPSYSEMI   91
H5_ANSAN|P02258   ( 12) AKPKRARAPRKPASHPTYSEMI  100
Why Multiple Sequence Alignment?
• Basis for phylogeny
• Helps find conserved regions in sets of proteins
  – Conserved regions
    • Provide insight into substitution patterns
    • Give hints about functional sites
How to Evaluate Multiple Alignments
Sum of Pairs (SP)
• Sum of the induced pairwise alignment scores over all pairs of sequences
• Ignore columns in which two spaces are aligned together

Multiple alignment:
A  cwcyklpdwv pikqkvsgk. cn..
B  cycyglpdse ptktn..gk. cksgkk
C  cycyglpesv kiwtsetnk. c...
D  cyceglsdst ptwplp.nkt csgk..

Induced pairwise alignments:
A  cwcyklpdwv pikqkvsgk cn..
B  cycyglpdse ptktn..gk cksgkk

A  cwcyklpdwv pikqkvsgk cn
C  cycyglpesv kiwtsetnk c.

A  cwcyklpdwv pikqkvsgk. cn..
D  cyceglsdst ptwplp.nkt csgk

B  cycyglpdse ptktn..gk cksgkk
C  cycyglpesv kiwtsetnk c...

B  cycyglpdse ptktn.gk. cksgkk
D  cyceglsdst ptwplpnkt csgk..

C  cycyglpesv kiwtsetnk. c...
D  cyceglsdst ptwplp.nkt csgk

SP score = sum of the scores of these six induced pairwise alignments
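The SP definition above can be sketched in a few lines. This is only an illustration under assumed toy scores (match +1, mismatch -1, gap -1), which are not from the slides; the function names are hypothetical, and both "." and "-" are treated as gap symbols.

```python
from itertools import combinations

GAPS = ".-"  # both gap symbols that appear in the slides

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols from the same column (toy scores, assumed)."""
    if a in GAPS and b in GAPS:
        return 0  # two spaces aligned together are ignored
    if a in GAPS or b in GAPS:
        return gap
    return match if a == b else mismatch

def sum_of_pairs(rows):
    """Sum the induced pairwise alignment scores over all pairs of rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(a, b)
        for x, y in combinations(rows, 2)
        for a, b in zip(x, y)
    )

rows = ["M-PE", "M-KE", "MSKE", "S-KE"]
print(sum_of_pairs(rows))
```

A substitution matrix such as BLOSUM62 could replace the toy `pair_score` without changing `sum_of_pairs`.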
BAliBASE Benchmark
• Compare to a set of hand-aligned sequences
• Check positions of letters
  – If the letters appear at the same position as in the benchmark => good
• Score between 0 (worst) and 1 (best)
• http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/prog_scores.html
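One way to turn "letters appear at the same position as in the benchmark" into a number between 0 and 1 is the fraction of residue pairs aligned in the reference that the test alignment also aligns. The sketch below assumes that definition and is not BAliBASE's own scoring program; `aligned_pairs` and `agreement` are hypothetical names.

```python
from itertools import combinations

def aligned_pairs(rows, gaps=".-"):
    """Set of residue pairs ((seq, idx), (seq, idx)) that share a column."""
    counters = [0] * len(rows)  # next residue index for each sequence
    pairs = set()
    for column in zip(*rows):
        present = []
        for seq, ch in enumerate(column):
            if ch not in gaps:
                present.append((seq, counters[seq]))
                counters[seq] += 1
        pairs.update(combinations(present, 2))
    return pairs

def agreement(test_rows, reference_rows):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    reference = aligned_pairs(reference_rows)
    if not reference:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(test_rows) & reference) / len(reference)

reference = ["M-PE", "MSKE"]
print(agreement(["MP-E", "MSKE"], reference))
```

Residues are identified by their index within the ungapped sequence, so the comparison does not depend on how many gap columns each alignment contains.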
Finding Multiple Sequence Alignments
Dynamic Programming
Dynamic Programming
• Similar to pairwise alignment
  – Compare NV and NS: 2^2 - 1 = 3 cases
\[ S(NV, NS) = \max \{\, S(N, N) + \delta(V, S),\ S(NV, N) + \delta(-, S),\ S(N, NS) + \delta(V, -) \,\} \]
• If k sequences are aligned
  – => a k-dimensional matrix is filled
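The case count and the pairwise recurrence can be sketched as follows. The scores in `delta` (match +1, mismatch -1, gap -1) are assumed for illustration, and the helper names are hypothetical; the same `moves(k)` enumeration is what a k-dimensional version would loop over in each cell.

```python
from itertools import product

def moves(k):
    """All 2^k - 1 ways to consume a character from at least one of k sequences."""
    return [m for m in product((0, 1), repeat=k) if any(m)]

def delta(a, b, match=1, mismatch=-1, gap=-1):
    """Toy column score (assumed values, not from the slides)."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def align_score(x, y):
    """Global pairwise alignment score using the three-case recurrence."""
    n, m = len(x), len(y)
    S = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in moves(2):
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue  # this case would step outside the matrix
                a = x[pi] if di else "-"
                b = y[pj] if dj else "-"
                S[i][j] = max(S[i][j], S[pi][pj] + delta(a, b))
    return S[n][m]

print(len(moves(2)), len(moves(3)))
print(align_score("NV", "NS"))
```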
Dynamic Programming
• k = 3 (example with characters A, S, V): 2^k - 1 = 7 cases
Complexity
• Space complexity: O(n^k) for k sequences, each of length n
• Computing a cell: O(2^k) times the cost of computing δ
• Time complexity: O(2^k n^k) times the cost of computing δ
• Finding the optimal solution is exponential in k
• Proven to be NP-complete for a number of cost functions
MSA (Carrillo, Lipman ’88)
MSA – Idea
MSA algorithm (1/3)
• Find pairwise alignments
• Produce a trial multiple alignment from a tree, cost = d
• This provides a limit on the volume within which optimal alignments are found
• Specifics
  – Sequences x_1, ..., x_r
  – Alignment A, cost = c(A)
  – Optimal alignment A*
  – A_ij = pairwise alignment of x_i, x_j induced by A
  – D(x_i, x_j) = cost of the optimal pairwise alignment of x_i, x_j <= c(A_ij)
MSA algorithm (2/3)
• For any pair (u, v):
\[ d \ge c(A^*) = c(A^*_{uv}) + \sum_{i<j,\ (i,j)\ne(u,v)} c(A^*_{ij}) \ge c(A^*_{uv}) + \sum_{i<j,\ (i,j)\ne(u,v)} D(x_i, x_j) \]
• Therefore
\[ c(A^*_{uv}) \le d - \sum_{i<j,\ (i,j)\ne(u,v)} D(x_i, x_j) = B(u, v) \]
• Compute B(u, v) for each pair u, v
• Consider any cell f with projection (s, t) on the (u, v) plane
• If A* passes through f, then A*_uv passes through (s, t)
  – best^{st}_{uv} = cost of the best pairwise alignment of x_u, x_v that passes through (s, t)
  – best^{st}_{uv} = distance of the prefixes up to (s, t) + cost(x_u[s], x_v[t]) + distance of the suffixes after (s, t)
MSA algorithm (3/3)
• If best^{st}_{uv} > B(u, v), then
  – A* cannot pass through cell f
  – Discard such cells from the DP computation
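The pruning test can be sketched for one pair of sequences. This illustration assumes unit edit costs (mismatch 1, gap 1) and treats (s, t) as a grid point between characters, so the best cost through (s, t) is the prefix distance plus the suffix distance; the slides' variant that adds the cost of the characters at (s, t) differs only in indexing. All function names, and the value of d in the usage lines, are hypothetical.

```python
from itertools import combinations

def prefix_dist(a, b):
    """F[s][t] = edit distance between a[:s] and b[:t] (unit costs, assumed)."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for s in range(len(a) + 1):
        for t in range(len(b) + 1):
            if s == 0 or t == 0:
                F[s][t] = s + t
            else:
                F[s][t] = min(
                    F[s - 1][t - 1] + (a[s - 1] != b[t - 1]),
                    F[s - 1][t] + 1,
                    F[s][t - 1] + 1,
                )
    return F

def suffix_dist(a, b):
    """R[s][t] = edit distance between a[s:] and b[t:]."""
    F = prefix_dist(a[::-1], b[::-1])
    return [[F[len(a) - s][len(b) - t] for t in range(len(b) + 1)]
            for s in range(len(a) + 1)]

def bounds(seqs, d):
    """B(u, v) = d minus the optimal pairwise costs of all other pairs."""
    D = {(i, j): prefix_dist(seqs[i], seqs[j])[-1][-1]
         for i, j in combinations(range(len(seqs)), 2)}
    total = sum(D.values())
    return {pair: d - (total - cost) for pair, cost in D.items()}

def allowed_cells(a, b, bound):
    """Grid points (s, t) whose best alignment through them costs <= bound."""
    F, R = prefix_dist(a, b), suffix_dist(a, b)
    return {(s, t)
            for s in range(len(a) + 1)
            for t in range(len(b) + 1)
            if F[s][t] + R[s][t] <= bound}

seqs = ["MPE", "MKE", "MSKE"]
B = bounds(seqs, d=5)  # d = 5 is a made-up cost of some trial alignment
print(B)
print(sorted(allowed_cells(seqs[0], seqs[1], B[(0, 1)])))
```

A cell of the k-dimensional matrix would be kept only if its projection is allowed for every pair (u, v).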
Question
• Align, using BLOSUM62:
  s1: MPE
  s2: MKE
  s3: MSKE
  s4: SKE
Progressive Alignment
Star Alignment
Star Alignments
• Heuristic method for multiple sequence alignment
• Select a sequence c as the center of the star
• For each sequence x_1, …, x_k such that x_i ≠ c, perform a Needleman-Wunsch global alignment of x_i and c
Star Alignments Example
Sequences:
  s1: MPE
  s2: MKE
  s3: MSKE
  s4: SKE

Pairwise alignments to the center s2:
  s1  MPE      s3  MSKE     s4  SKE
      | |          | ||          ||
  s2  MKE      s2  M-KE     s2  MKE

Resulting multiple alignment:
  s1  M-PE
  s2  M-KE
  s3  MSKE
  s4  S-KE

• All induced pairwise alignments to the center sequence are optimal.
• How should we choose a center? (Exercise: try s4 as the center)
• Try all of them?
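One common answer to "how should we choose a center?" is to try every sequence and keep the one whose pairwise scores to all others sum highest. The sketch below assumes that rule and toy scores (match +1, mismatch -1, gap -1) instead of BLOSUM62; `nw_score` and `center_scores` are hypothetical names.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score, keeping only two rows."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def center_scores(seqs):
    """For each candidate center, the sum of its scores to all other sequences."""
    return [
        sum(nw_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]

seqs = ["MPE", "MKE", "MSKE", "SKE"]
scores = center_scores(seqs)
center = max(range(len(seqs)), key=scores.__getitem__)
print(scores, "center = s%d" % (center + 1))
```

With these toy scores s2 and s3 tie, and `max` keeps the first of them, s2, which is the center used in the example.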
CLUSTAL-W (Thompson, Higgins, Gibson 1994)
CLUSTAL-W (1/4)
• Given sequences A, B, C, D, E
• Compare all pairs and construct a distance matrix
[Figure: distance matrix over A, B, C, D, E]
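The "compare all pairs" step can be sketched as below. The distance used here, edit distance divided by the longer sequence length, is an assumption chosen for brevity and is not CLUSTAL-W's own measure; the function names are hypothetical and the sequences are assumed non-empty.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, keeping only two rows of the DP matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric k x k matrix of normalized pairwise distances."""
    k = len(seqs)
    D = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            d = edit_distance(seqs[i], seqs[j]) / max(len(seqs[i]), len(seqs[j]))
            D[i][j] = D[j][i] = d
    return D

for row in distance_matrix(["MPE", "MKE", "MSKE", "SKE"]):
    print(["%.2f" % v for v in row])
```

The k(k-1)/2 pairwise comparisons, each quadratic in sequence length, account for the O(k^2 n^2) term in the running time.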
CLUSTAL-W (2/4)
• Find a phylogenetic tree for A, B, C, D, E using neighbor joining
[Figure: neighbor-joining steps building a tree with leaves A, B, C, D, E]
CLUSTAL-W (3/4)
• Align sequences starting from the leaf level
  – Edge weights are used to compute the score of the alignment
[Figure: tree with leaves A, B, C, D, E]
• O(k^2 n^2) time
• O(n^2) space
• Result depends on sequence order
CLUSTAL-W (4/4)
• Sample query using ClustalW
• http://www.ebi.ac.uk/clustalw/
Other Progressive Methods
• T-COFFEE
• PILEUP
• MUSCLE
• …
T-Coffee (Notredame, Higgins, Heringa 2000)
• Build a library of alignments between pairs of sequences.
• Create a new scoring matrix for each pair of sequences using the library
  – Directly from the alignment of s1 and s2
  – Indirectly through the alignments of s1, s3 and s3, s2
• Use these scoring matrices during progressive alignment
[Figure: scoring matrix for s1 and s2]
T-Coffee (1/2)
• Given sequences A, B, C, D, E
• Create primary library
T-Coffee (2/2)
• Create extended library
• Create similarity matrix
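The step from primary to extended library can be sketched as follows. The data layout is hypothetical: the library maps a pair of sequence names to a dictionary of residue-position pairs and weights. The sketch assumes that the indirect contribution through a third sequence is the smaller of the two weights along the path, added to the direct weight; treat that rule and all names here as assumptions rather than T-Coffee's actual code.

```python
from collections import defaultdict
from itertools import combinations

def oriented(lib, a, b):
    """Weights between sequences a and b, keyed as (position in a, position in b)."""
    if (a, b) in lib:
        return lib[(a, b)]
    return {(j, i): w for (i, j), w in lib.get((b, a), {}).items()}

def extend_library(lib, names):
    """Add, for each residue pair, the support found through every third sequence."""
    extended = {}
    for a, b in combinations(names, 2):
        weights = defaultdict(int, oriented(lib, a, b))  # direct evidence
        for c in names:
            if c in (a, b):
                continue
            via_c = defaultdict(list)  # position in c -> [(position in b, weight)]
            for (k, j), w in oriented(lib, c, b).items():
                via_c[k].append((j, w))
            for (i, k), w1 in oriented(lib, a, c).items():
                for j, w2 in via_c[k]:
                    weights[(i, j)] += min(w1, w2)  # indirect evidence through c
        extended[(a, b)] = dict(weights)
    return extended

primary = {
    ("s1", "s2"): {(0, 0): 80},
    ("s1", "s3"): {(0, 0): 90},
    ("s2", "s3"): {(0, 0): 70},
}
print(extend_library(primary, ["s1", "s2", "s3"]))
```

The extended weights play the role of the per-pair scoring matrix used during progressive alignment.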
Iterative Alignment
PRRP
1. Find some initial alignment
2. Construct a phylogenetic tree based on the multiple alignment
3. Align sequences
Go back if the result has improved

A  cwcyklpdwv pikqkvsgk. cn..
B  cycyglpdse ptktn..gk. cksgkk
C  cycyglpesv kiwtsetnk. c...
D  cyceglsdst ptwplp.nkt csgk..
E  cyceglpdst piwplp.nkt ctgk..
Other methods
• Genetic algorithm (machine learning)
• Partial order graphs (graph matching)
• HMMER (hidden Markov model)
• For a comparison:
  – http://www.cise.ufl.edu/~tamer/papers/psb2006.pdf
Motif Logos
ID   HISTONEH5; BLOCK
AC   PR00624A; distance from previous block=(9,12)
DE   Histone H5 signature
BL   adapted; width=22; seqs=9; 99.5%=986; strength=1407
H10_HUMAN|P07305  ( 10) AKPKRAKASKKSTDHPKYSDMI   63
H5A_XENLA|P22844  ( 11) AKPKRSKALKKSTDHPKYSDMI   71
H10_RAT|P43278    ( 10) AKPKRAKAAKKSTDHPKYSDMI   70
H10_MOUSE|P10922  ( 10) AKPKRAKASKKSTDHPKYSDMI   63
Q91759            (  9) AKPRRSKASKKSTDHPKYSDMI   71
H5B_XENLA|P22845  (  9) AKPRRSKASKKSTDHPKYSDMI   71
H5_CHICK|P02259   ( 11) AKPKRVKASRRSASHPTYSEMI  100
H5_CAIMO|P06513   ( 12) AKPKRAKAPRKPASHPSYSEMI   91
H5_ANSAN|P02258   ( 12) AKPKRARAPRKPASHPTYSEMI  100