Introduction to Bioinformatics Sequence Alignments Sequence Alignments Cornerstone

Introduction to Bioinformatics Sequence Alignments

Sequence Alignments · Cornerstone of bioinformatics · What is a sequence? • Nucleotide sequence • Amino acid sequence · Pairwise and multiple sequence alignments • We will focus on pairwise alignments · What alignments can help • Determine function of a newly discovered gene sequence • Determine evolutionary relationships among genes, proteins, and species • Predicting structure and function of protein Intro to Bioinformatics – Sequence Alignment 2 Acknowledgement: This notes is adapted from lecture notes of both Wright State University’s Bioinformatics Program and Professor Laurie Heyer of Davids

DNA Replication · Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set · DNA polymerase is the enzyme that copies DNA • Reads the old strand in the 3´ to 5´ direction Intro to Bioinformatics – Sequence Alignment 3

Over time, genes accumulate mutations · Environmental factors • Radiation • Oxidation · Mistakes in replication or repair · Deletions, Duplications · Insertions, Inversions · Translocations · Point mutations Intro to Bioinformatics – Sequence Alignment 4

Deletions · Codon deletion: ACG ATA GCG TAT GTA TAG CCG… • Effect depends on the protein, position, etc. • Almost always deleterious • Sometimes lethal · Frame shift mutation: ACG ATA GCG TAT GTA TAG CCG… ACG ATA GCG ATG TAT AGC CG? … • Almost always lethal Intro to Bioinformatics – Sequence Alignment 5

Indels · Comparing two genes it is generally impossible to tell if an indel is an insertion in one gene, or a deletion in another, unless ancestry is known: ACGTCTGATACGCCGTATCGTCTATCT ACGTCTGAT---CCGTATCGTCTATCT Intro to Bioinformatics – Sequence Alignment 6

The Genetic Code Substitutions are mutations accepted by natural selection. Synonymous: CGC CGA Non-synonymous: GAU GAA Intro to Bioinformatics – Sequence Alignment 7

Comparing Two Sequences · Point mutations, easy: ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGATTCGCCCTATCGTCTATCT · Indels are difficult, must align sequences: ACGTCTGATACGCCGTATAGTCTATCT CTGATTCGCATCGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT ----CTGATTCGC---ATCGTCTATCT Intro to Bioinformatics – Sequence Alignment 8

Why Align Sequences? · The draft human genome is available · Automated gene finding is possible · Gene: AGTACGTATAGCGTAA • What does it do? · One approach: Is there a similar gene in another species? • Align sequences with known genes • Find the gene with the “best” match Intro to Bioinformatics – Sequence Alignment 9

Gaps or No Gaps · Examples Intro to Bioinformatics – Sequence Alignment 10

Scoring a Sequence Alignment Given · Match score: +1 · Mismatch score: +0 · Gap penalty: – 1 · Sequences ACGTCTGATACGCCGTATAGTCTATCT ||||| || |||| ----CTGATTCGC---ATCGTCTATCT · Matches: 18 × (+1) · Mismatches: 2 × 0 Score = +11 · Gaps: 7 × (– 1) Intro to Bioinformatics – Sequence Alignment 11

Origination and Length Penalties · Note that a bioinformatics computational model or algorithm must be “biologically meaningful” or even “biologically significant” · We want to find alignments that are evolutionarily likely. · Which of the following alignments seems more likely to you? ACGTCTGATACGCCGTATAGTCTATCT ACGTCTGAT-------ATAGTCTATCT ACGTCTGATACGCCGTATAGTCTATCT AC-T-TGA--CG-CGT-TA-TCTATCT · We can achieve this by penalizing more for a new gap, than for extending an existing gap Intro to Bioinformatics – Sequence Alignment 12

Scoring a Sequence Alignment (2) Given · Match/mismatch score: +1/+0 · Gap origination/length penalty: – 2/– 1 · Sequences ACGTCTGATACGCCGTATAGTCTATCT ||||| || |||| ----CTGATTCGC---ATCGTCTATCT · Matches: 18 × (+1) · Mismatches: 2 × 0 Score · Origination: 2 × (– 2) · Length: 7 × (– 1) = +7 · Caution: Sometime “gap extension” used instead of “gap length” … Intro to Bioinformatics – Sequence Alignment 13

How can we find an optimal alignment? · Finding the alignment is computationally hard: ACGTCTGATACGCCGTATAGTCTATCT CTGAT---TCG-CATCGTC--T-ATCT · C(27, 7) gap positions = ~888, 000 possibilities · It’s possible, as long as we don’t repeat our work! · Dynamic programming: The Needleman & Wunsch algorithm Intro to Bioinformatics – Sequence Alignment 14

Dynamic Programming · Technique of solving optimization problems • Find and memorize solutions for subproblems • Use those solutions to build solutions for larger subproblems • Continue until the final solution is found · Recursive computation of cost function in a non -recursive fashion Intro to Bioinformatics – Sequence Alignment 15

Dynamic Programming · Example • Solving Fibonacci number f(n) = f(n-1) + f(n-2) recursively takes exponential time Ø Because many numbers are recalculated • Solving it using dynamic programming only takes a linear time Ø Use an array f[] to store the numbers f[0] = 1; f[1] = 1; for (i = 2; i <= n; i++) f[i] = f[i – 1] + f[i – 2]; Intro to Bioinformatics – Sequence Alignment 16

Global Sequence Alignment · Needleman-Wunsch algorithm Di, j = max{ Di-1, j + d(Ai, –), Di, j-1 + d(–, Bj), Di-1, j-1 + d(Ai, Bj) } B i-1 A i j-1 j · Suppose we are aligning: a with a … Seq 1 Seq 2 Intro to Bioinformatics – Sequence Alignment 17

Dynamic Programming (DP) Concept · Suppose we are aligning: CACGA Intro to Bioinformatics – Sequence Alignment 18

DP – Recursion Perspective · Suppose we are aligning: ACTCG ACAGTAG · Last position choices: G G +1 ACTC ACAGTA G - -1 ACTC ACAGTAG G -1 ACTCG ACAGTA Intro to Bioinformatics – Sequence Alignment 19

What is the optimal alignment? · ACTCG ACAGTAG · Match: +1 · Mismatch: 0 · Gap: – 1 Intro to Bioinformatics – Sequence Alignment 20

Needleman-Wunsch: Step 1 · Each sequence along one axis · Gap penalty multiples in first row/column · 0 in [1, 1] (or [0, 0] for the CS-minded) Intro to Bioinformatics – Sequence Alignment 21

Needleman-Wunsch: Step 2 · Vertical/Horiz. move: Score + (simple) gap penalty · Diagonal move: Score + match/mismatch score · Take the MAX of the three possibilities Intro to Bioinformatics – Sequence Alignment 22

Needleman-Wunsch: Step 2 (cont’d) · Fill out the rest of the table likewise… Intro to Bioinformatics – Sequence Alignment 23

Needleman-Wunsch: Step 2 (cont’d) · Fill out the rest of the table likewise… · The optimal alignment score is calculated in the lower-right corner Intro to Bioinformatics – Sequence Alignment 24

But what is the optimal alignment · To reconstruct the optimal alignment, we must determine of where the MAX at each step came from… Intro to Bioinformatics – Sequence Alignment 25

A path corresponds to an alignment · = GAP in top sequence · = GAP in left sequence · = ALIGN both positions · One path from the previous table: · Corresponding alignment (start at the end): AC--TCG ACAGTAG Intro to Bioinformatics – Sequence Alignment Score = +2 26

Algorithm Analysis · Brute force approach • If the length of both sequences is n, number of possibility = C(2 n, n) = (2 n)!/(n!)2 22 n / ( n)1/2, using Sterling’s approximation of n! = (2 n)1/2 e-nnn. • O(4 n) · Dynamic programming • O(mn), where the two sequence sizes are m and n, respectively • O(n 2), if m is in the order of n Intro to Bioinformatics – Sequence Alignment 27

Practice Problem · Find an optimal alignment for these two sequences: GCGGTT GCGT · Match: +1 · Mismatch: 0 · Gap: – 1 Intro to Bioinformatics – Sequence Alignment 28

Practice Problem · Find an optimal alignment for these two sequences: GCGGTT GCG-TIntro to Bioinformatics – Sequence Alignment Score = +2 29

Semi-global alignment · Suppose we are aligning: GCG GGCG · Which do you prefer? G-CG -GCG GGCG · Semi-global alignment allows gaps at the ends for free. • Terminal gaps are usually the result of incomplete data acquisition no biologically significant Intro to Bioinformatics – Sequence Alignment 30

Semi-global alignment · Semi-global alignment allows gaps at the ends for free. · Initialize first row and column to all 0’s · Allow free horizontal/vertical moves in last row and column Intro to Bioinformatics – Sequence Alignment 31

Local alignment · Global alignments – score the entire alignment · Semi-global alignments – allow unscored gaps at the beginning or end of either sequence · Local alignment – find the best matching subsequence · CGATG AAATGGA · This is achieved by allowing a 4 th alternative at each position in the table: zero. Intro to Bioinformatics – Sequence Alignment 32

Local Sequence Alignment · Why local sequence alignment? • Subsequence comparison between a DNA sequence and a genome • Protein function domains • Exons matching · Smith-Waterman algorithm Initialization: D 1, j = , Di, 1 = Di, j = max{ Di-1, j + d(Ai, –), Di, j-1 + d(–, Bj), Di-1, j-1 + d(Ai, Bj), 0} Intro to Bioinformatics – Sequence Alignment 33

Local alignment · Score: Match = 1, Mismatch = -1, Gap = -1 c a a a t g g a 0 0 0 0 g 0 0 0 0 a 0 0 0 1 1 0 t 0 1 1 1 0 0 0 2 g 0 0 2 1 0 0 0 0 1 3 2 1 CGATG AAATGGA Intro to Bioinformatics – Sequence Alignment 34

Local alignment · Another example Intro to Bioinformatics – Sequence Alignment 35

More Example · Align ATGGCCTC ACGGCTC Mismatch = -3 Gap = -4 Global Alignment: ATGGCCTC ACGGC-TC -A T G G C C T C Intro to Bioinformatics – Sequence Alignment -0 -4 -8 -12 -16 -20 -24 -28 -32 A -4 1 -3 -7 -11 -15 -19 -23 -27 C -8 -3 -2 -6 -10 -14 -18 -22 G -12 -7 -6 -1 -5 -9 -13 -17 -21 G -16 -11 -10 -5 0 -4 -8 -12 -16 C -20 -15 -14 -9 -4 1 -3 -7 -11 T -24 -19 -14 -13 -8 -3 -2 -2 -6 C -28 -23 -18 -17 -12 -7 -2 -5 -1 36

More Example Local Alignment: ATGGCCTC ACGG CTC or ATGGCCTC ACGGC TC Intro to Bioinformatics – Sequence Alignment -A T G G C C T C -0 0 0 0 0 A 0 1 0 0 0 0 C 0 0 0 1 1 0 1 G 0 0 0 1 1 0 0 G 0 0 0 1 2 0 0 C 0 0 0 3 1 0 1 T 0 0 1 0 0 2 0 C 0 0 0 1 1 0 3 37

Scoring Matrices for DNA Sequences · Transition: A G C T · Transversion: a purine (A or G) is replaced by a pyrimadine (C or T) or vice versa Intro to Bioinformatics – Sequence Alignment 38

Scoring Matrices for Protein Sequence · PAM 250 Intro to Bioinformatics – Sequence Alignment 39

Scoring Matrices for Protein Sequence · PAM (Percent Accepted Mutations) · 1 PAM unit can be thought of as the amount of time in which an “average” protein mutates 1 out of every 100 amino acids. · Perform multiple sequence alignment on a “family” of proteins that are at least 85% similar. Find the frequency of amino acid i and j are aligned to each other and normalize it. · Entry mij in PAM 1 represents the probability of amino acid i substituted by amino acid j in 1 PAM unit · PAM 2 = PAM 1 × PAM 1 · PAM n = (PAM 1)n, e. g. , PAM 250 = (PAM 1)250 · Questions · Why are values in a PAM matrix integers? Shouldn’t a probability be between 0 and 1? · Why is PAM 250 possible? Shouldn’t a probability be less than or equal to 100%? Intro to Bioinformatics – Sequence Alignment 40

Scoring Matrices for Protein Sequence · BLOSUM (BLOcks SUbstitution Matrix) 62 Intro to Bioinformatics – Sequence Alignment 41

Using Protein Scoring Matrices · Divergence BLOSUM 80 PAM 1 Closely related Less divergent Less sensitive BLOSUM 62 PAM 120 · Looking for BLOSUM 45 PAM 250 Distantly related More divergent More sensitive • Short similar sequences → use less sensitive matrix • Long dissimilar sequences → use more sensitive matrix • Unknown → use range of matrices · Comparison • PAM – designed to track evolutionary origin of proteins • BLOSUM – designed to find conserved regions of proteins Intro to Bioinformatics – Sequence Alignment 42

Multiple Sequence Alignment · Why multiple sequence alignment (MSA) • Two sequences might not look very similar. • But, some “similarity” emerges with more sequences, however. · Is dynamic programming still applicable? · CLUSTALW • One of most popular tools for MSA • Heuristic-based approach • Basic 1) Calculate the distance matrix based on pairwise alignments 2) Construct a guide tree NJ (unrooted tree) Mid-point (rooted tree) 3) Progressive alignment using the guide tree 43