COT 6930 HPC and Bioinformatics Pairwise Sequence Alignment

COT 6930 HPC and Bioinformatics Pairwise Sequence Alignment (PSA) Xingquan Zhu Dept. of Computer Science and Engineering

Outline l l Dynamic Programming Dot Matrix Plot FASTA BLAST

Difficulty of PSA l l Consider two sequences of length n l Number of possible global alignments l For n = 40, there are over 1023 possible alignments Approaches l l “Optimal” solution l Dynamic programming Heuristic solutions l Dot matrix plot l FASTA l BLAST

Dynamic Programming (DP) l Alignment l l Problem can be subdivided Optimal alignment of n bases l l l Bioinformatics application l l l 1. Incorporates optimal alignment of 1…n-1 bases 2. Combine with best alignment for nth base Global alignment Local alignment [Needleman-Wunsch 1970] [Smith-Waterman 1981] Complexity l l O(n 2) or O(n × m) algorithm (sequences of length n, m) Feasible for moderate sized sequences, not entire genomes

Dynamic Programming (DP) l l Dynamic programming usually consists of three components. l Recursive relation l Tabular computation l Trace back This efficient recursive method is used to search through all possible alignments and finding the one with the optimal score.

How DP is used for alignment? l An alignment is represented as a path through a matrix. To search through the matrix of all possible paths and find the optimal path DP is used.

Four possible outcomes in aligning two sequences [1] identity (stay along a diagonal) [2] mismatch (stay along a diagonal) [3] gap in one sequence (move vertically!) [4] gap in the other sequence (move horizontally!)

Intuition of Dynamic Programming If we already have the optimal solution to: XY AB then we know the next pair of characters will either be: XYZ ABC or XYZ AB- (where “-” indicates a gap). So we can extend the match by determining which of these has the highest score.

Intuition of Dynamic Programming Considering sequences: “HEAGA” and “PA”

DP (Global Alignment) l Recursive formula l Notation l l l xi yj x 1. . i F l F(i, j) d s ith letter of string x jth letter of string y prefix of x from letters 1 through i matrix of optimal scores (DP Matrix) represents optimal score lining up x 1. . i with y 1. . j gap penalty scoring matrix

Constant vs. Non-constant Gap Penalty l l Constant gap penalty Non-constant gap penalty

DP (Global Alignment) l Algorithm l l Initialize: F(0, 0) = 0, F(i, 0) = i × d, F(0, j) = j × d (gap penalties) Fill from top left to bottom right using recursive formula

DP Example l Sequences l l Seq 1: HEAGAWGHEE Seq 2: PAWHEAE Gap Penalty: -8 (Constant) Scoring Matrix (Blosum 50)

DP Example Algorithm Initialize F(0, 0) = 0, F(i, 0) = i × d, F(0, j) = j × d (gap penalties) Fill from top left to bottom right using recursive formula Blosum 50

DP Example (Traceback) l Trace arrows from bottom right to top left l Diagonal – both match l Up – left sequence match a gap § l Or insert a gap to top sequence Left – top sequence match a gap § Or insert a gap to left sequence

Global alignment vs. local alignment l l Global alignment: the entire sequence of each protein or DNA sequence is contained in the alignment. Local alignment: only regions of greatest similarity between two sequences are aligned percent identity: ~26% RBP: 26 glycodelin: 23 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84 + K++ + + +GTW++MA + L + A V T + +L+ W+ QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81

Global vs. local alignment l l l Global alignment are often not effective for highly diverged sequences - do not reflect the biological reality that two sequences may only share limited regions of conserved sequence. Global methods are useful when you want to force two sequences to align over their entire length Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences.

DP (local alignment) l l Make 0 minimal score (i. e. , start new alignment) Alignment can start / end anywhere l Start at highest score(s) l End when 0 reached

DP (local alignment) Blosum 50

DP (Local Alignment) l Traceback l Start at highest score and trace arrows back to first 0

Multiple, Repeat, & Overlap Matches l Multiple alignments l l l Repeat matches l l l “Optimal” global alignment may be best by small margin Can report several high-scoring alignments Allow sections to match repeatedly Find repeated patterns (domain / motif) Overlap matches l l Matching when the two sequences overlap Does not penalize overhanging ends

Overlap Alignment Consider the following problem: Find the most significant overlap between two sequences S, T ? Possible overlap relations: a. b. Difference from local alignment: require alignment between the endpoints of the two sequences.

Overlap Alignment Formally: given S[1. . n] , T[1. . m] find i, j such that: d=max{D(S[1. . i], T[j. . m]) , D(S[i. . n], T[1. . j]) , D(S[1. . n], T[i. . j]) , D(S[i. . j], T[1. . m]) } is maximal. Solution: Same as Global alignment except we don’t not penalise overhanging ends.

Overlap Alignment l Initialization: F(i, 0)=0 , F(0, j)=0 u Recurrence: as in global alignment u Score: maximum value at the bottom line and rightmost line

Overlap Alignment (Example) S = PAWHEAE T = HEAGAWGHEE Scoring scheme : u. Match: +4 u. Mismatch: -1 u. Indel: -5

Overlap Alignment (Example) S = PAWHEAE T = HEAGAWGHEE Scoring scheme: u. Match: +4 u. Mismatch: -1 u. Indel: -5

Overlap Alignment (Example) The best overlap is: PAWHEAE----HEAGAWGHEE Pay attention! A different scoring scheme could yield a different result, such as: Scoring scheme : u. Match: +4 u. Mismatch: -1 u. Indel: -5 -2 ---PAW-HEAE HEAGAWGHEE-

Outline l l Dynamic Programming Dot Matrix Plot FASTA BLAST

Dot Matrix Plot l Method l l Put two sequences on vertical and horizontal axes of graph Put dots wherever there is a match Diagonal line is region of identity (local alignment) Filter non-diagonals (minimal length, identity threshold)

Simple Dot Plot Example

The Dotplot is a Table for overview of the similarities between two sequences The palindromic sequence: MAXISTAYAWAYATSI XAM The stretches of similar residues show up as diagonals The palindromic sequence is very important part of DNA sequence. Many enzyme recognize these sequences

Dot Matrix Plot Example l l Chimpanzee DNA aligned with itself Filters needed to eliminate random matches filtered 6 of 10 filtered 8 of 10

Dot Matrix Plot l Advantage l l l Extremely simple Can use scoring matrix Can use multiple filters to rank matches Example l Red > 50 bases, 95% identical (very close) l Blue > 25 bases, 75% identical (close) l Green > 10 bases, 50% identical (distant) Limitations l l O(n 2) algorithm Relies on visual analysis Difficult to find “best” alignment Despite multiple filters, difficult to compare alignments

Dot Matrix Plot l Demo l http: //www. vivo. colostate. edu/molkit/dnadot/

Outline l l Dynamic Programming Dot Matrix Plot FASTA BLAST

FASTA l l - Idea - Problem of Dynamic Programming DP compute the score in a lot of useless area for optimal sequence G A A T T C A G T T A G 1 1 1 G 1 1 1 1 2 2 A 1 2 2 2 2 2 T 1 2 2 3 3 3 3 C 1 2 2 3 3 4 4 4 G 1 2 2 3 3 4 4 5 5 A 1 2 3 3 3 4 5 5 6 FASTA focuses on diagonal area

FASTA l - Heuristic Good local alignment should have some exact match subsequence. FASTA focus on this area

FASTA High Level Description Let q be a query max 0 For each sequence, s in DB compare q with s and compute a score, y if max < y max y; best. Sequence s ; Return best. Sequence

FASTA - Algorithm l Step 1 Find all hot-spots // Hot spots is pairs of words of length k that exactly match k=4 -6 for DNA, k=1 -2 for Amino acid sequences Sequence 1 Hot Spots Sequence 2

FASTA - Algorithm l Step 1 in detail Use look-up Table Query : GAATTCAGTTA Sequence: G G A T C G A A T T C A G T T A Look-up Table Q A C G T Location 2, 3, 7, 11 6 1, 8 4, 5, 9, 10 Dot—Matrix G * * A * * T * * C * * G * A * * *

FASTA - Algorithm Step 2 Score the Hot-spot and locate the ten best diagonal run. l // There is some scoring system; ex. PAM 250

FASTA - Algorithm l Step 3 Combine sub-alignments into one alignment with GAP One of local alignment

FASTA - Algorithm l Step 4 # Consider weighted direct graph. # Let node be a sub-alignment found in step 1 # Let u and v be nodes # Edge (u, v) exists if alignment u is before in the sequence. # Each edge has gap penalty (negative) Sub-sequence # Find the maximum weight path Edge One Sequence

FASTA - Algorithm l GAP Step 4 in detail Sub-alignment -3 -5 Gap One of Sequence -3 Max Weight Path

FASTA - Algorithm l Step 5 Use the dynamic programming in restricted area around the best-score alignment to find out the higher-score alignment than the best-score alignment Width of this band is a parameter

FASTA - Algorithm l Summary of Algorithm 1: Find all hot-spots // Hot spots is pairs of words of length k that exactly match 2: Score the Hot-spot and locate the ten best diagonal run. 3: Combine sub-alignments into one alignment 4: Score Each alignment with gap penalty and pick up the best-score alignment 5: Use the dynamic programming in restricted area around the bestscore alignment to find out the alignment greater than the best-score alignment.

FASTA - Complexity l Complexity # Step 1 and 2 // select the best 10 diagonal run Let n be a sequence from DB O(n+m) because Step 1 just uses look up the table O(n+m) << O(mn) m, n = 100 to 200

FASTA - Complexity # Step 3 and 4 // compute the MAX Weight Path Let r be the number of sub-alignments. (r = 10) Let s be the number of edges Positive Weight O(rs) < O(m*n) -3 -5 -3 Max Weight Path

FASTA - Complexity # Step 5 // compute partial D. P. Depends on the restricted area < O(mn) Width of this band is a parameter Therefore, FASTA is faster than DP

FASTA l Approach l l Recap Derived from logic of the dot matrix method View sequences as sequences of short words (k-tuple) l DNA - 6 bases, protein - 1 or 2 amino acids Start from nearby regions of exact matching words Motivation l l l Good alignments should contain many exact matches Hashing can find exact matches in O(n) time Diagonals can be formed from exact matches quickly Look only at matches near longest diagonals Apply more precise alignment to small search space at end

FASTA Recap

Outline l l Dynamic Programming Dot Matrix Plot FASTA BLAST

BLAST l Approach (Basic Local Alignment Search Tool) l l l View sequences as sequences of short words (k-tuple) l DNA - 11 bases, protein - 3 amino acids Create hash table of neighborhood (closely-matching) words Use statistics to set threshold for “closeness” Start from exact matches to neighborhood words Motivation l l Good alignments should contain many close matches Statistics can determine which matches are significant l Make much more sense than % identity Hashing can find matches in O(n) time Extending matches in both directions finds alignment l Yields high-score /maximal segment pairs (HSP / MSP)

BLAST - Heuristic Another Heuristic algorithm l Heuristic but evaluating the result statistically. Homologous sequence are likely to contain a short high scoring word pair, a hit. BLAST tries to extend it on the both sides to get optimal sequence. l Sequence A T T A G ……………. Short high score Word

BLAST - Algorithm l Neighborhood Word Step 1: Preprocessing Query Compile the short-hit scoring word list from query. The length of query word, w, is 3 for brosom scoring Threshold T is 13

BLAST - Algorithm l Step 2: Create neighborhood words for each query word Query Word Neighborhood words

BLAST Word Matching MEAAVKEEISVEDEAVDKNI MEA EAA AAV Break query AVK VKE into words: KEE EEI EIS ISV. . . Break database sequences into words:

BLAST - Algorithm l Step 3: Scan DB Hash Table Query: LAALLNKCKTPQGQRLVNQWIKQPLMD Hash Table Word list

BLAST - Algorithm l Step 3: Scanning DB For each words list, identify all exact matches with DB sequences Neighborhood Query Word Sequences in DB Word list Sequence 1 Sequence 2 Step 1 Step 2 The purpose of Step 1 and 2 is as same as FASTA

l Step 4: Extend the alignment l l l Given two strings S 1 and S 2, a segment pair is a pair of equallength substrings of S 1 and S 2 A Maximal Segment Pair (MSP) in S 1 and S 2, is a segment pair with the maximum score over all segment pairs in S 1 and S 2 A High-score Segment Pair (HSP) in S 1 and S 2, is a segment pair which does not increase its score while either extending or shortening its length l Also called a local maximal segment pair (LMSP)

Seq_XYZ: Query: HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA QSVFDYIYYGCYCGWGLG_GK__PRDA E-val=10 -13 • Use two word matches as anchors to build an alignment between the query and a database sequence. • Then score the alignment.

BLAST – Algorithm l Step 5: l Align best segments using dynamic programming l Evaluate the alignment statistically l E-value = the number of HSPs having score S (or higher) expected to occur only by chance. l l E(S) ≈ K × N ×M × e - S § N, M denote the sequence lengths § K and are parameters Smaller E-value, more significant in statistics Bigger E-value , by chance E-value Range § § E < 10 -100 → very low, homologs or identical genes E < 10 -3 → moderate, may be related genes E > 1 → high, probably / may be unrelated 0. 5 < E < 1 → ? ? ? In the “twilight zone” Try detailed search

BLAST (Ungapped)

BLAST - Running Time Alignment algorithms • O(n 2) – dynamic programming, dot matrix • O(n) – FASTA, BLAST l Running Time The length of Query DB size : 153 : 5997 sequences Algorithm DP FASTA BLAST PC : Pentium 4 By Dr. Takeshi Kawabata Nara Sentan Gijyutu University Running Time 16. 989 [s] 0. 618 [s] 0. 118 [s]

Comparison of Algorithm l Dynamic Programming l most sensitive result l l DP uses all information of two sequences Running time is slow l DP compute the useless area for computing the optimal sequence.

Comparison of Algorithm l FASTA l Less sensitive than DP and BLAST FASTA uses partial information to speed up the computation. l FASTA does not evaluate the result statistically. l l Running time is faster DP l the same reason as the above.

Comparison of Algorithms l BLAST l Sensitive than FASTA l l BLAST evaluate the result statistically Faster than FASTA l Because BLAST evaluate the entire DB with the same threshold based on statistics. BLAST eliminate noises and reduces the running time.

FASTA vs BLAST Compare the query and sequences in DB with the same threshold. Query DB DB FASTA compare the query and a sequence one by one And compare the each result.

Outline l l Dynamic Programming Dot Matrix Plot FASTA BLAST