






















- Slides: 23

SEQUENCE ALIGNMENT

OUTLINE • Sequence alignment • Types of sequence alignment • Methods of sequence alignment • Dot matrix method • Dynamic programming method • Word method or k-tuple method

DEFINITION OF SEQUENCE ALIGNMENT Sequence alignment is a way of arranging sequences of DNA, RNA or protein to identify regions of similarity. The similarity may indicate the functional, structural and evolutionary significance of the sequences. A sequence alignment is made between a known sequence and an unknown sequence, or between two unknown sequences. The known sequence is called the reference sequence. The unknown sequence is called the query sequence.

INTERPRETATION OF SEQUENCE ALIGNMENT • Sequence alignment is useful for discovering structural, functional and evolutionary information. • Sequences that are very much alike may have similar secondary and 3D structure, similar function and, likely, a common ancestral sequence. It is extremely unlikely that such sequences became similar by chance. For DNA molecules with n nucleotides the probability of matching by chance is very low, $P = 4^{-n}$. For proteins the probability is even lower, $P = 20^{-n}$, where n is the number of amino acid residues. • Large-scale genome studies have revealed horizontal transfer of genes and other sequences between species, which may cause similarity between some sequences in very distant species.
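To get a feel for how quickly these chance probabilities shrink, the two formulas can be evaluated directly. The sketch below assumes every residue is equally likely and independent of its neighbours, which real sequences only approximate; the function name is illustrative.

```python
def chance_match_probability(alphabet_size: int, n: int) -> float:
    """Probability that two random sequences of length n are identical.

    Assumes all residues are equally frequent and independent,
    which is a simplification of real sequence composition.
    """
    return alphabet_size ** (-n)


if __name__ == "__main__":
    for n in (5, 10, 20):
        dna = chance_match_probability(4, n)       # P = 4^-n
        protein = chance_match_probability(20, n)  # P = 20^-n
        print(f"n={n:2d}  DNA: {dna:.3e}  protein: {protein:.3e}")
```

Under these assumptions the protein probability is always the smaller of the two for the same n, because the alphabet is larger.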

TYPES OF SEQUENCE ALIGNMENT Sequence alignment is of two types, namely: global alignment and local alignment. Global alignment: matches the residues of two sequences across their entire length. It suits sequences that are similar along their whole length. Local alignment: matches only those regions of the two sequences that are most similar to each other.

TYPES OF SEQUENCE ALIGNMENT Global alignment Input: treat the two sequences as potentially equivalent Goal: identify conserved regions and differences Applications: - Comparing two genes with same function (in human vs. mouse). - Comparing two proteins with similar function.

TYPES OF SEQUENCE ALIGNMENT Local alignment Input: the two sequences may or may not be related Goal: see whether a substring in one sequence aligns well with a substring in the other Note: for local matching, overhangs at the ends are not treated as gaps Applications: - Searching for local similarities in large sequences (e.g., newly sequenced genomes). - Looking for conserved domains or motifs in two proteins.
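The note about overhangs can be made concrete with a small sketch of the Smith-Waterman recurrence, the local alignment algorithm named later in these slides. Every cell is floored at zero, so an alignment may start and end anywhere at no cost. The match, mismatch and gap values are illustrative assumptions, not values from the slides, and the function returns only the best score and where it ends.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b, plus its end cell.

    Scoring values are assumed for illustration (simple match/mismatch
    scores and a linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_end = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment here
                H[i - 1][j - 1] + pair,  # extend along the diagonal
                H[i - 1][j] + gap,       # gap in b
                H[i][j - 1] + gap,       # gap in a
            )
            if H[i][j] > best:
                best, best_end = H[i][j], (i, j)
    return best, best_end


if __name__ == "__main__":
    # Two unrelated flanks around a shared GKG motif.
    print(smith_waterman_score("CCCGKGCCC", "TTTGKGTTT"))
```

Because of the zero floor, the unrelated flanking residues in the usage example do not lower the score of the shared motif.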

TYPES OF SEQUENCE ALIGNMENTS
Global alignment:
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A
Local alignment:
- - - - T G K G - - - -
- - - - A G K G - - - -

METHOD OF SEQUENCE ALIGNMENT • Dot matrix method • The dynamic programming (DP) algorithm • Word or k-tuple methods

DOT MATRIX ANALYSIS • A dot matrix is a grid in which matching nucleotides of two DNA sequences are represented as dots. • It is also called a dot plot. • It is a pairwise sequence comparison made on the computer. • The cells initially appear as colourless dots on the computer screen. • In a dot matrix, the nucleotides of one sequence are written from left to right along the top row and those of the other sequence from top to bottom down the left side (column) of the matrix. At every point where the two nucleotides are the same, the dot at the intersection of row and column becomes a dark dot. Together, the darkened dots give a graph called a dot plot, in which runs of matching residues show up as diagonal lines. A dot plot is a kind of recurrence plot. Each dot in the plot represents a matching nucleotide or amino acid.
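The construction described above can be sketched in a few lines. This is a minimal version that marks every single-residue identity; practical dot plot tools usually also apply a sliding window and threshold to reduce noise, which is not shown. The function names are illustrative.

```python
def dot_matrix(top, side):
    """Boolean grid: cell [row][col] is True when side[row] == top[col]."""
    return [[x == y for x in top] for y in side]


def render(top, side, matrix):
    """Text rendering: '*' for a dark dot, '.' for an empty cell."""
    lines = ["  " + " ".join(top)]  # first sequence along the top row
    for residue, row in zip(side, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)  # second sequence down the side
    return "\n".join(lines)


if __name__ == "__main__":
    top, side = "GATTACA", "GATCACA"
    print(render(top, side, dot_matrix(top, side)))
```

For two identical sequences the main diagonal is fully marked; insertions and deletions shift the diagonal, and repeats appear as additional parallel diagonals.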

DOT MATRIX ANALYSIS • The dot matrix method is a qualitative and simple way to analyze sequences. However, it takes much time to analyze large sequences. • The dot matrix method is useful for studying the following: • Sequence similarity between two nucleotide sequences or two amino acid sequences. • Insertion of short stretches in a DNA or amino acid sequence. • Deletion of short stretches from a DNA or amino acid sequence. • Repeats or inverted repeats in a DNA or amino acid sequence.

DOT MATRIX ANALYSIS: TWO IDENTICAL SEQUENCES • Nucleic Acids Dot Plots

DOT MATRIX ANALYSIS: TWO VERY DIFFERENT SEQUENCES • Nucleic Acids Dot Plots of genes

DOT MATRIX ANALYSIS: TWO SIMILAR SEQUENCES • Nucleic Acids Dot Plots of genes

DYNAMIC PROGRAMMING METHOD • Dynamic programming is the process of solving problems in which one needs to find the best decisions one after another. • It was introduced by Richard Bellman in the 1950s. • The word programming here denotes finding an acceptable plan of action, not computer programming. • It is useful for aligning nucleotide sequences of DNA and amino acid sequences of the proteins encoded by that DNA. • Dynamic programming is a three-step process that involves: 1) Breaking the problem into small subproblems. 2) Solving the subproblems using recursive methods. 3) Constructing an optimal solution for the original problem from the optimal solutions of the subproblems.

DYNAMIC PROGRAMMING ALGORITHM FOR SEQUENCE ALIGNMENT • The method compares every pair of characters in the two sequences and generates an alignment that is the best, or optimal, under the chosen scoring system. • This is a highly computationally demanding method. However, the latest algorithmic improvements and ever-increasing computer capacity make it possible to align a query sequence against a large database in a few minutes. • Each alignment has its own score, and it is essential to recognise that several different alignments may have nearly identical scores, which indicates that dynamic programming methods may produce more than one optimal alignment. Careful choice of the scoring parameters is therefore important and may help discriminate between alignments with similar scores. • Global alignment programs are based on the Needleman-Wunsch algorithm and local alignment programs on the Smith-Waterman algorithm. Both algorithms are derived from the basic dynamic programming algorithm.
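As a rough illustration of the global case, here is a minimal sketch of Needleman-Wunsch. It assumes a simple match/mismatch score and a linear gap penalty (the numeric values are illustrative, not from the slides) rather than a substitution matrix. It fills the score table and then traces back one optimal alignment; when several alignments tie, it returns whichever one the traceback order reaches first.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, m + 1):
        S[0][j] = j * gap  # prefix of b aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                          S[i - 1][j] + gap,       # gap in b
                          S[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + pair:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, x, y = needleman_wunsch("ACGT", "AGT")
    print(score)
    print(x)
    print(y)
```

The table has (n+1) x (m+1) cells and each cell is computed in constant time, which is where the quadratic cost of the method comes from.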

DESCRIPTION OF THE DYNAMIC PROGRAMMING ALGORITHM • The alignment procedure depends on a scoring system, which can be based on 1) the probability that a particular amino acid pair is found in alignments of related proteins ($p_{xy}$); 2) the probability that the same amino acid pair is aligned by chance ($p_x p_y$); 3) whether introducing a gap would be a better choice because it increases the score. • The ratio of the first two probabilities is usually provided in an amino acid substitution matrix. There are many such matrices; two of them, PAM and BLOSUM, are considered later. • The scores for introducing a gap and for extending it are also calculated from the matrices and represent prior knowledge and some assumptions. One assumption is quite simple: if the cost of a gap is too high, a reasonable alignment between slightly different sequences will never be achieved, but if it is too low, an optimal alignment is hardly possible. Other assumptions are based on sophisticated statistical procedures.
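Written out, the ratio of the first two probabilities is usually turned into a log-odds score, and the alignment score is the sum of these pair scores minus the gap penalties. The block below is a sketch of that relationship. The symbols $s(x,y)$, $g_{\text{open}}$, $g_{\text{ext}}$ and $w_k$ are notation introduced here, and the affine form of the gap penalty is one common assumption rather than something the slides specify.

```latex
% Log-odds score for aligning residues x and y
s(x, y) = \log \frac{p_{xy}}{p_x \, p_y}

% Score of an alignment: aligned pairs minus gap penalties
S = \sum_{\text{aligned pairs } (x, y)} s(x, y) \;-\; \sum_{\text{gaps}} w_k

% Assumed affine penalty for a gap of length k
w_k = g_{\text{open}} + (k - 1)\, g_{\text{ext}}
```

A positive $s(x,y)$ means the pair occurs more often in related proteins than chance would predict; a negative value means it occurs less often.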

DERIVATION OF THE DYNAMIC PROGRAMMING ALGORITHM (BLOSUM62 scores)
1. Score of new alignment = score of previous alignment + score of new aligned pair.
2. Alignment (A), V D S - C Y over V E S L C Y, scores 15 = 8 + 7: the score of the previous alignment (B) plus the score of the new aligned pair Y/Y.
3. Alignment (B), V D S - C over V E S L C, scores 8 = -1 + 9: the score of the previous alignment plus the score of the new aligned pair C/C.
4. Repeat removing aligned pairs until the end of the alignments is reached.
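The running totals in this example can be reproduced by adding one aligned column at a time. The sketch below uses only the BLOSUM62 entries needed for this alignment. The gap penalty of 11 is an assumption: the slide does not state it, but it is the value that yields the partial score of -1 shown above (4 + 2 + 4 - 11).

```python
# Only the BLOSUM62 entries needed for this example.
BLOSUM62_SUBSET = {
    ("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
    ("C", "C"): 9, ("Y", "Y"): 7,
}
GAP_PENALTY = 11  # assumed; inferred from the partial score of -1


def running_scores(top, bottom):
    """Cumulative alignment score after each aligned column."""
    total, scores = 0, []
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total -= GAP_PENALTY
        elif (x, y) in BLOSUM62_SUBSET:
            total += BLOSUM62_SUBSET[(x, y)]
        else:
            total += BLOSUM62_SUBSET[(y, x)]  # the matrix is symmetric
        scores.append(total)
    return scores


if __name__ == "__main__":
    print(running_scores("VDS-CY", "VESLCY"))
```

Reading the returned list from the right gives the scores 15, 8 and -1 from the slide, each obtained by removing one more aligned pair.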

SCORING MATRICES: PAM (PERCENT ACCEPTED MUTATION) Amino acids are grouped according to the chemistry of the side group: (C) sulfhydryl, (STPAG) small hydrophilic, (NDEQ) acid, acid amide and hydrophilic, (HRK) basic, (MILV) small hydrophobic, and (FYW) aromatic. Log odds values: +10 means that the ancestor probability is greater, 0 means that the probabilities are equal, -4 means that the change is random. Thus the score of the alignment YY/YY is 10 + 10 = 20, whereas the score of YY/TP is -3 - 5 = -8, a rare and unexpected result between homologous sequences.

SCORING MATRICES: BLOSUM 62 (BLOCKS AMINO ACID SUBSTITUTION MATRICES) The idea behind BLOSUM is similar, but it is calculated from a very different and much larger set of proteins, which are much more similar to one another and form blocks of sequences with a similar pattern.

FORMAL DESCRIPTION OF DYNAMIC PROGRAMMING ALGORITHM
[Diagram: the cell $S_{i,j}$ of the scoring matrix can be reached from the diagonal cell with score $S_{i-1,j-1} + s(a_i, b_j)$, from a cell $x$ rows above in column $j$ with score $S_{i-x,j} - w_x$, or from a cell $y$ columns to the left in row $i$ with score $S_{i,j-y} - w_y$.]
• This diagram indicates the moves that are possible to reach a certain position (i, j): starting from the previous row and column at position (i-1, j-1), or from any position in the same row or column.
• A diagonal move carries no gap penalty; a move from any other position in column j or row i carries a gap penalty that depends on the size of the gap.

WORD METHOD OR K-TUPLE METHOD • It is a heuristic method that is not guaranteed to find an optimal alignment, but it is much more efficient than dynamic programming. • This method is useful in large-scale database searches to find whether there is a significant match with the query sequence. • The word method is used in the database search tools FASTA and the BLAST family. • They identify a series of short, non-overlapping subsequences (words) of the query sequence. • These words are then matched against candidate database sequences to obtain the result.

WORD METHOD OR K-TUPLE METHOD • In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. • BLAST provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. • It is a good alternative to FASTA. However, the results are not very accurate. • Like FASTA, BLAST uses a word search of length k, but it evaluates only the most significant word matches rather than every word match.
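The word-matching step shared by these tools can be sketched as follows. This is not the FASTA or BLAST implementation. It is a simplified illustration that indexes every overlapping window of length k in the query (an assumption made here for simplicity), looks up exact word hits in a database sequence, and reports the diagonal with the most hits. All function names are hypothetical.

```python
from collections import defaultdict


def index_words(query, k):
    """Map each length-k word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def word_hits(query, subject, k):
    """All (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    index = index_words(query, k)
    hits = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits.append((i, j))
    return hits


def best_diagonal(hits):
    """Diagonal (subject_pos - query_pos) with the most hits, and its count."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    hits = word_hits("GATTACA", "TTGATTACATT", k=3)
    print(hits)
    print(best_diagonal(hits))
```

A smaller k produces more hits to examine, which is the trade-off between speed and sensitivity described above. Many hits on one diagonal suggest a region worth aligning in full with dynamic programming.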