Multiple alignment

Aims of multiple alignment
• To introduce the different approaches to multiple sequence alignment
• To identify criteria for selecting a multiple sequence alignment program

Objectives of multiple alignment
• To select an appropriate multiple sequence alignment program
• To carry out a multiple sequence alignment using ClustalX

The result of searching databases is a list of sequences, either protein or nucleotide, that exhibit significant similarity and are inferred to be homologous.
• These sequences can then be subjected to multiple sequence alignment.
• Multiple sequence alignment attempts to place in the same column the residues that derive from a common ancestral residue by substitution.
• The most successful alignment is the one that most closely represents the evolutionary history of the sequences.

Why create multiple sequence alignments?
• To attempt a phylogenetic analysis of the sequences so as to construct evolutionary trees
• To identify functional sites
• To identify modules in multi-modular proteins
• To identify motifs
• To detect weak similarities in databases using profiles
• To design PCR primers for the identification of related genes

Global versus local alignments
• Things would be much simpler if we only considered sequences that are homologous over their entire length and could be globally aligned.
• Homology is often restricted to certain regions of a sequence.
• Many proteins are multi-modular, and the shuffling of modules is part of the evolutionary process.
• An attempt to align, over their entire length, sequences that share some but not all of their modules is bound to lead to errors.
• In such a case, a series of multiple local alignments, one for each module, is appropriate.

Substitutions and gaps
• In trying to establish the evolutionary trajectories of a group of related sequences, the same problem is encountered as in pairwise alignment: how do you deal with substitutions and gaps?
• The solution is the same: gap penalties, gap extension penalties and substitution matrices such as PAM and BLOSUM.
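The scoring idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `score_pair` is hypothetical, a plain match/mismatch score stands in for a PAM or BLOSUM matrix, and the penalty values (5 to open a gap, 1 to extend it) are example values, not recommendations.

```python
def score_pair(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two already-aligned rows ('-' marks a gap).

    Assumption: the first position of a gap costs gap_open and every
    further position of the same gap costs gap_extend. A real program
    would look up residue pairs in a substitution matrix instead of
    using match/mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    in_gap_a = in_gap_b = False
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column that is a gap in both rows carries no information
        if x == "-":
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if x == y else mismatch
            in_gap_a = in_gap_b = False
    return score


# Four matches (+4), one gap of length two (-5 - 1): total -2
print(score_pair("AC--GT", "ACTTGT"))
```

Separating the opening and extension costs makes one long gap cheaper than several short ones, which is usually the more plausible evolutionary explanation.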

Example of multiple alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG-
ATLVCLISDFYPGA--VTVAWKADS-
AALGCLVKDYFPEP--VTVSWNSG--
VSLTCLVKGFYPSD--IAVEWWSNG--
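To make the idea of aligned columns concrete, the following Python sketch marks the fully conserved columns of the first four rows of the example, in the spirit of the `*` line that ClustalW prints under its alignments. The function name `conservation_line` is an assumption made for this illustration, and only strict identity is tested (no similarity groups).

```python
def conservation_line(rows):
    """Return a string with '*' under every column in which all rows
    carry the same residue, and ' ' elsewhere. Gap-only columns are
    not marked."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*rows):
        residues = set(column)
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


# First four rows of the example alignment (all 27 columns long)
rows = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
]
for row in rows:
    print(row)
print(conservation_line(rows))
```

The conserved cysteine and tryptophan columns stand out immediately, which is the kind of signal used to identify functional sites and motifs.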

Multiple Alignment Method
• The most practical and widely used method in multiple sequence alignment is the hierarchical extension of pairwise alignment methods.
• The principle is that a multiple alignment is achieved by successive application of pairwise methods.

Multiple Alignment Method
The steps are summarized as follows:
1. Compare all sequences pairwise.
2. Perform cluster analysis on the pairwise data to generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering.
3. Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair, and so on. Once an alignment of two sequences has been made, it is fixed. Thus, for a set of sequences A, B, C, D, having aligned A with C and B with D, the alignment of A, B, C, D is obtained by comparing the alignment of A and C with that of B and D, using averaged scores at each aligned position.
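The first two steps (pairwise comparison and clustering into an alignment order) can be sketched as follows. This is a simplified illustration, not the procedure of any particular program: the names `similarity` and `guide_order` are hypothetical, `difflib.SequenceMatcher` is used as a crude stand-in for a real pairwise alignment score, and clusters are joined by average similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    # Stand-in for a pairwise alignment score (assumption for this sketch)
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_order(seqs):
    """seqs maps name -> sequence. Returns the list of merges, most
    similar first; each merge is a pair of clusters (tuples of names)."""
    # Step 1: compare all sequences pairwise
    sim = {frozenset(pair): similarity(seqs[pair[0]], seqs[pair[1]])
           for pair in combinations(seqs, 2)}

    def average(c1, c2):
        total = sum(sim[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    # Step 2: cluster, always joining the two most similar clusters
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: average(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Toy sequences, invented for the illustration
seqs = {
    "A": "ACGTACGTAC",
    "B": "TTGGCCAATT",
    "C": "ACGTACGTAA",
    "D": "TTGGCCAAGA",
}
for left, right in guide_order(seqs):
    print("align", "+".join(left), "with", "+".join(right))
```

With these toy sequences the order matches the slide's example: A is aligned with C, B with D, and the two fixed pairs are then aligned with each other. Step 3, the actual profile-to-profile alignment, is deliberately left out of the sketch.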

Steps in Multiple Alignment

Choosing sequences for alignment
General considerations:
• The more sequences to align, the better.
• Don’t include highly similar (>80%) sequences.
• Sub-groups should be pre-aligned separately, and one member of each sub-group should be included in the final multiple alignment.
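One way to apply the 80% rule before running an alignment is a greedy filter such as the Python sketch below. The names `approx_identity` and `drop_near_duplicates` are hypothetical, and the identity estimate (matched characters from `difflib` divided by the length of the shorter sequence) is only a rough stand-in for the percent identity of a proper pairwise alignment.

```python
from difflib import SequenceMatcher


def approx_identity(a, b):
    """Rough identity estimate in [0, 1]; an assumption of this sketch,
    not a true alignment-based percent identity."""
    if not a or not b:
        return 0.0
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / min(len(a), len(b))


def drop_near_duplicates(seqs, threshold=0.80):
    """Keep a sequence only if it is not more than `threshold` identical
    to any sequence already kept. Input order decides which member of a
    near-identical group survives."""
    kept = {}
    for name, seq in seqs.items():
        if all(approx_identity(seq, other) <= threshold
               for other in kept.values()):
            kept[name] = seq
    return kept


# Toy sequences, invented for the illustration
seqs = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTACGTAA",  # 9 of 10 positions shared with s1
    "s3": "TTGGCCAATT",
}
print(list(drop_near_duplicates(seqs)))
```

Because the filter is greedy, listing the preferred representative of each sub-group first ensures that it is the one retained.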

Multiple alignment in GCG
• The program available in GCG for multiple alignment is PileUp.
• The input file for PileUp is a list of sequence file names or database sequence codes, created with a text editor.
• PileUp creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments.
• It can also plot a tree showing the clustering relationships used to create the alignment.
• Please note that there is no single absolute alignment, even for a limited number of sequences.

Choosing sequences for PileUp
• As far as possible, try to align sequences of similar length.
• PileUp can align sequences of up to 5000 residues, with 2000 gaps (7000 characters in total).
• PileUp is a good program only for similar (closely related) sequences.

Output of PileUp
!!NA_MULTIPLE_ALIGNMENT 1.0
PileUp of: @tnf.list
Symbol comparison table: GenRunData:pileupdna.cmp  CompCheck: 6876
GapWeight: 5
GapLengthWeight: 1
tnf.msf  MSF: 1706  Type: N  August 12, 1997 08:10  Check: 5044 ..
Name: OATNFA1    Len: 1706  Check: 5831  Weight: 1.00
Name: OATNFAR    Len: 1706  Check: 7533  Weight: 1.00
Name: BSPTNFA    Len: 1706  Check: 1732  Weight: 1.00
Name: CEU14683   Len: 1706  Check: 6670  Weight: 1.00
Name: HSTNFR     Len: 1706  Check:  191  Weight: 1.00
Name: SYNTNFTRP  Len: 1706  Check: 3706  Weight: 1.00
Name: CATTNFAA   Len: 1706  Check: 7430  Weight: 1.00
Name: CFTNFA     Len: 1706  Check: 2566  Weight: 1.00
Name: RABTNFM    Len: 1706  Check: 5089  Weight: 1.00
Name: RNTNFAA    Len: 1706  Check: 4296  Weight: 1.00

Output of PileUp
//
OATNFA1 OATNFAR BSPTNFA CEU14683 HSTNFR SYNTNFTRP CATTNFAA CFTNFA RABTNFM RNTNFAA
1 ~~~~~~~~~~ ~GGCCAAGAG ~~~~~GGGAC ACCAGGGGAC CAGCCAAGAG ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~GCAGA AGCAGACGCT CCCTCAGCAA GGACAGCAGA ~~~~~~~~~~ ~~~~~~~~~~AAGCTC CCTCAGTGAG GACACGGGCA ~~~~~~~~~~

Output of PileUp
401 OATNFAR BSPTNFA CEU14683 HSTNFR SYNTNFTRP CATTNFAA CFTNFA RABTNFM RNTNFAA
TTCAG. . TTCAA. . . TTCAG. . CCCAG. . . TCCAG. . . CCCAGATGGT CCCAGACCCT . ACACTCAGG. ACCCTCAGG. GCAGTCAGA. ACACTCAGA. ACAGTCAAA CACCCTCAGA CACACTCAGA TCATCTTCTC TCCTCTTCTC TCATCTTCTC TCAGCTTCTC TCATCTTCTC AAGC GAAC GGGC AAAA

Output of PileUp

PileUp considerations
• PileUp performs global multiple alignment and is therefore good for a group of similar sequences.
• PileUp will fail to find the best local region of similarity (such as a shared motif) among distantly related sequences.
• PileUp always aligns all of the sequences specified in the input file, even if they are not related.
• The alignment can be degraded if some of the sequences are only distantly related.

PileUp special options
• Creating an end-weighted alignment: -ENDWeight
• Realigning part of an existing alignment: -INSitu -Begin=XX -END=YY, where XX and YY specify the exact positions at which to begin (XX) and end (YY) the realignment.

Displaying a multiple alignment in GCG
• There are several programs for displaying a multiple alignment attractively.
• The Pretty program prints sequences with their columns aligned and can display a consensus for the alignment, allowing you to look at relationships among the sequences.
• The PrettyBox program displays the alignment graphically, with the conserved regions of the alignment shown as shaded boxes. The output is in PostScript format.

Example of PrettyBox Output

ShadyBox is a multiple alignment editor that enables you to box and shade residues or segments of multiply aligned sequences.
• ShadyBox works on an MSF or Pretty output file and produces a PostScript output file.
• The original input file is not changed.
• ShadyBox lets you save your work part-way through, exit the program, and resume at a later stage.

ShadyBox Output

ClustalW for multiple alignment
• ClustalW is a general-purpose multiple alignment program for DNA or proteins.
• ClustalW was produced by Julie D. Thompson and Toby Gibson of the European Molecular Biology Laboratory, Germany, and Desmond Higgins of the European Bioinformatics Institute, Cambridge, UK.
• The ClustalW algorithm is described in: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research (1994), 22: 4673-4680.

ClustalW for multiple alignment
• ClustalW can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.
• Alignment can be done by two methods:
  - slow/accurate
  - fast/approximate

Running ClustalW
[~]% clustalw
*******************************
CLUSTAL W (1.7) Multiple Sequence Alignments
***********************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile / Structure Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice:

Running ClustalW
• The input file for ClustalW is a file containing all sequences in one of the following formats:
• NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF, RSF.
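As an illustration of what such a single input file looks like, the Python sketch below reads Pearson (Fasta) text into a dictionary. The function name `read_fasta` is hypothetical, and the two short records are placeholder sequences invented for the example (they only borrow entry names used elsewhere in these slides).

```python
def read_fasta(text):
    """Parse Pearson/Fasta text into {name: sequence}.

    The name is the first word after '>' on the header line; sequence
    lines are concatenated until the next header.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Fallback name for an empty header (assumption of this sketch)
            name = fields[0] if fields else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Placeholder sequences, not real database entries
example = """>HSTNFR example record
GGGAAGAGTTCC
CCAGGGACC
>CFTNFA example record
GGGAAGAGCTCC
"""
for name, seq in read_fasta(example).items():
    print(name, len(seq), seq)
```

All sequences to be aligned go into the same file, one record after another, which is what distinguishes this input from the list file used by PileUp.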

Using ClustalW
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = OFF
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu

Output of ClustalW
CLUSTAL W (1.7) multiple sequence alignment
HSTNFR SYNTNFTRP CFTNFA CATTNFAA RABTNFM RNTNFAA OATNFA1 OATNFAR BSPTNFA CEU14683
GGGAAGAG---TTCCCCAGGGACCTCTCTCTAATCAGCCCTCTGGCCCAG------GCAG
----------------------TGTCCAG------ACAG
GGGAAGAG---CTCCCACATGGCCTGCAACTAATCAACCCTCTGCCCCAG------ACAC
AGGAGGAAGAGTCCCCAAACAACCTCCATCTAGTCAACCCTGTGGCCCAGATGGTCACCC
AGGAGGAGAAGTTCCCAAATGGGCTCCCTCTCATCAGTTCCATGGCCCAGACCCTCACAC
GGGAAGAGCAGTCCCCAGCTGGCCCCTCCTTCAACAGGCCTCTGGTTCAG------ACAC
GGGAAGAGCAGTCCCCAGGTGGCCCCTCCATCAACAGCCCTCTGGTTCAA------ACAC
GGGAAGAGCAATCCCCAACTGGCCTCTCCATCAACAGCCCTCTGGTTCAG------ACCC
** *

ClustalX - Multiple Sequence Alignment Program
• ClustalX provides a new window-based user interface to the ClustalW program.
• It uses the Vibrant multi-platform user interface development library, developed by the National Center for Biotechnology Information (Bldg 38A, NIH, 8600 Rockville Pike, Bethesda, MD 20894) as part of their NCBI Software Development Toolkit.

ClustalX

Thanks