Sequence IO How to find sequence information from

Sequence I/O Writing a file from Bio. Seq import Seq from Bio. Seq. Record

Pairwise Alignment - DNA Seq 1 A C T T G A Seq 1

Protein sequences Seq 1 Seq 2 identity similarity or not GDAFPLHKLEL-GPV GDA P+H++ +

Describing similarity The sequences are identical 100% similarity The sequences are similar (protein) there

Alignment algorithms GLOBAL ALIGNMENT Considers similarity across the entire length of the sequences LOCAL

Original manual method Start with a 2 D matrix on which you put the

Dotplot of similarity G D I H I F G D V H I

Doesn’t scale well GDIHAL GRVHIF G D I H A L G R V

Computational considerations Sequence orientation Character substitutions Physio-chemical properties which strand? variations protein structure =>

Comparing Strings Called “edit distance” measurements. Hamming distance: the number of different characters between

Dynamic programming A method of solving a large problem by dividing it up into

Dynamic programing A recursive algorithm To find an LCS of X and Y ,

Dynamic programing BCBA AB C BDAB BDCAB A

Step 3: Computing the length of a LCS BCBA AB C BDAB BDCAB A

Step 4: Constructing a LCS (Backtracking) BCBA AB C BDAB BDCAB A CSC 317

Putting it all together • Create a matrix - as in dotplot but “virtual”

Needleman and Wunsch Global alignment algorithm for sequence comparison. Biased towards finding similarity, rather

Smith-Waterman algorithm Local alignment algorithm which finds shorter, locally similar regions (substrings). It is

Sequence alignments For pairwise alignments Biopython contains the Bio. pairwise 2 module What you

Pairwise sequence alignments contains essentially the same algorithms as water (local) and needle (global)

Pairwise sequence alignments from Bio import pairwise 2 from Bio import Seq. IO rom

Problem with N-W and S-W They are exhaustive, with stringent statistical thresholds As the

Two heuristic approaches Each is based on different assumptions and calculations of probability thresholds

The BLAST algorithm Basic Local Alignment Search Tool Breaks down your sequence into smaller

BLAST Using online BLAST database type from Bio. Blast import NCBIWWW result_handle = NCBIWWW.

BLAST Using local BLAST http: //blast. ncbi. nlm. nih. gov/Blast. cgi? CMD=Web&PAGE_TYPE=Blast. Docs&DOC_TYPE=Download -

Algorithm evolution Smith Waterman Local alignment algorithm – finds small, locally similar regions (substrings),

Slides: 29

Download presentation

Sequence I/O How to find sequence information from Bio import Seq. IO orchid_dict = Seq. IO. to_dict(Seq. IO. parse("ls_orchid. fasta", ”fasta")) creates Python dictionary with each entry held as a Seq. Record object in memory

Sequence I/O Writing a file from Bio. Seq import Seq from Bio. Seq. Record import Seq. Record from Bio. Alphabet import generic_protein Creating seq. Rec rec 1 = Seq. Record(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" +"SSAC", generic_protein), id="gi|14150838|gb|AAK 54648. 1|AF 376133_1", description="chalcone synthase [Cucumis sativus]") rec 2 = Seq. Record(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ" +"DMVVVEIPKLGKEAAVKAIKEWGQ", generic_protein), id="gi|13919613|gb|AAK 33142. 1|", description="chalcone synthase [Fragaria vesca subsp. bracteata]") my_records = [rec 1, rec 2, rec 3] Seq. IO. write(my_records, "my_example. faa", "fasta") writing to file

Pairwise Alignment - DNA Seq 1 A C T T G A Seq 1 Seq 2 Single Nucleotide Polymorphism (SNP) Seq 2 A G T T A G A A C T T G A A G T T A G A A C T T - G A A G T T A G A insertion/deletion (indel)

Protein sequences Seq 1 Seq 2 identity similarity or not GDAFPLHKLEL-GPV GDA P+H++ + GPV GDASPVHQMTMKGPV = the residues are identical = the residues are physio-chemically similar eg hydrophobic A L I V F M = gap or different

Describing similarity The sequences are identical 100% similarity The sequences are similar (protein) there is some level of similarity this can range from 99% to 45% even 20% use of adjectives: highly, significantly, somewhat The sequences are homologous implies an evolutionary relationship ie they have a common ancestor requires knowledge of protein or species phylogeny

Alignment algorithms GLOBAL ALIGNMENT Considers similarity across the entire length of the sequences LOCAL ALIGNMENT Looks at regions of similarity within the sequences, substrings. These are usually biologically meaningful motifs or domains

Original manual method Start with a 2 D matrix on which you put the query sequence (I) on the vertical, and the test sequence (J) on the horizontal axis. G D I H I F Find the common k-words. G GDIHIF GDVHIF D V H I F

Dotplot of similarity G D I H I F G D V H I F Finding the diagonal. Visualization of similarity

Doesn’t scale well GDIHAL GRVHIF G D I H A L G R V - Visualization of similarity not so good with distantly related sequences - No statistical/numerical measure of similarity + H I F - Slow + +

Computational considerations Sequence orientation Character substitutions Physio-chemical properties which strand? variations protein structure => Biological meaning? Need a system whereby substitutions and gaps are weighted. Different types of scoring systems can be used with different algorithms – The main problem is computational time.

Comparing Strings Called “edit distance” measurements. Hamming distance: the number of different characters between strings of equal lengths; only allows substitutions. Levenshtein distance: the strings do not have to be the same length; allows insertions, deletions, substitutions. Applied to automated spelling corrections…

Dynamic programming A method of solving a large problem by dividing it up into smaller subproblems and solving those, then putting them together to solve the larger problem. options. Can often have more than one way of aligning two sequences, dynamic programming allows backtracking and testing of various The path that most effectively links together the subproblems into an optimal selected as of thescoring final alignment. Needs to solution be basedison statistics system and probability thresholds set May still end up with a few solutions of equal scores – Ultimately need a biologically relevant answer.

Dynamic programing A recursive algorithm To find an LCS of X and Y , we may need to find the LCSs of X and Yn-1 and of Xm-1 and Y. Furthermore, each of these subproblems has the subsubproblem of finding an LCS of Xm-1 and Yn-1. c[i, j] is the length of an LCS of the sequences Xi and Yj. The optimal substructure of the LCS problem gives the recursive formula if i = 0 or j = 0 if i, j > 0 and xi = yj if i, j > 0 and xi ≠ yj

Dynamic programing BCBA AB C BDAB BDCAB A

Backtracking BCBA AB C BDAB BDCAB A

Step 3: Computing the length of a LCS BCBA AB C BDAB BDCAB A CSC 317 16

Step 4: Constructing a LCS (Backtracking) BCBA AB C BDAB BDCAB A CSC 317 17

Putting it all together • Create a matrix - as in dotplot but “virtual” • Calculate the match/mismatch score step by step as you go along – how many are too much? • Then backtrack and do it again – iterative, to find optimal score

Needleman and Wunsch Global alignment algorithm for sequence comparison. Biased towards finding similarity, rather than difference. Based on 2 D matrix, gives a maximum match; ie the largest number of residues (n or aa) of one sequence that can be matched with another, allowing for all possible deletions and substitutions. Begins at beginning of sequence and ends at the end.

Smith-Waterman algorithm Local alignment algorithm which finds shorter, locally similar regions (substrings). It is more sensitive to weaker sequence similarities, and to finding conserved motifs This is also a matrix-based approach, but each cell in the matrix defines the end of a potential alignment. It is the basis for many subsequent alignment algorithms, including BLAST.

Sequence alignments For pairwise alignments Biopython contains the Bio. pairwise 2 module What you would do: 1. ) Prepare an input file of your unaligned sequences, typically this will be a FASTA file which you might create using Bio. Seq. IO. 2. ) Call the command line tool to process this input file, typically via one of Biopython’s command line wrappers (which we’ll discuss here). 3. ) Read the output from the tool, i. e. your aligned sequences, typically using Bio. Align. IO. .

Pairwise sequence alignments contains essentially the same algorithms as water (local) and needle (global) from Bio import pairwise 2 from Bio import Seq. IO seq 1 = Seq. IO. read("alpha. faa", "fasta”) seq 2 = Seq. IO. read("beta. faa", "fasta") alignments = pairwise 2. align. globalxx(seq 1. seq, seq 2. seq) global alignment 2 letter code (scoring matches, gaps) x … match scores 1, x … no cost for gaps m. . . general values, s. . . costs for gaps print(pairwise 2. format_alignment(*alignments[0]))

Pairwise sequence alignments from Bio import pairwise 2 from Bio import Seq. IO rom Bio. Subs. Matrix. Info import blosum 62 costs for opening, extending gaps seq 1 = Seq. IO. read("alpha. faa", "fasta”) seq 2 = Seq. IO. read("beta. faa", "fasta") alignments = pairwise 2. align. localds(seq 1. seq, seq 2. seq, blosum 62, -10, -0. 5) local alignment substitution matrix, Blosum, PAM print(pairwise 2. format_alignment(*alignments[0]))

Problem with N-W and S-W They are exhaustive, with stringent statistical thresholds As the databases got larger, these began to take too long – computational constraints Need a program that runs an algorithm that is based on dynamic programming (substrings) but uses a heuristic approach (takes assumptions, quicker, not as refined). Therefore can deal with comparing a sequence against 100 million others in a relatively short time… the bigger the database, the more optimal the results.

Two heuristic approaches Each is based on different assumptions and calculations of probability thresholds (ie the statistical significance of results) 1. FASTA All proteins in the db are equally likely to be related to the query probability multiplied by the number of sequences in database 2. BLAST Query more likely to be related to a sequence its own length or shorter probability multiplied by N/n N: total number of residues in database n: length of subject sequence

The BLAST algorithm Basic Local Alignment Search Tool Breaks down your sequence into smaller segments (words) Does the same for all sequences in the database Looks for exact matches, word by word, and expands those upand down- stream one base at a time, allowing for a certain number of mismatches Stops when sequence ends or statistical significance becomes too low (too many mismatches) Can find more than one area of similarity between two sequences

BLAST Using online BLAST database type from Bio. Blast import NCBIWWW result_handle = NCBIWWW. qblast("blastn", "nt", "8332116") Blast program sequence or from Bio. Blast import NCBIWWW from Bio import Seq. IO Just sequence record = Seq. IO. read("m_cold. fasta", format="fasta") result_handle = NCBIWWW. qblast("blastn", "nt", record. seq) result_handle = NCBIWWW. qblast("blastn", "nt", record. format(“fasta”) whole seq. information

BLAST Using local BLAST http: //blast. ncbi. nlm. nih. gov/Blast. cgi? CMD=Web&PAGE_TYPE=Blast. Docs&DOC_TYPE=Download - First we create a command line prompt (for example): blastn -query opuntia. fasta -db nr -out opuntia. xml -evalue 0. 001 Blast program query sequence database output file evalue threshold - then we us the BLAST Python wrappers from Bio. Python: from Bio. Blast. Applications import Ncbiblastx. Commandline blastn_exe = r”path to blastn" blastn_cline = Ncbiblastn. Commandline(query="opuntia. fasta", db="nr", evalue=0. 001, out="opuntia. xml”) blastn_cline() use Bio. Blast. NCBIXML. parse() to parse results in the output file

Algorithm evolution Smith Waterman Local alignment algorithm – finds small, locally similar regions (substrings), matrix-based, each cell in the matrix defined the end of a potential alignment. BLAST – start with highest scoring short pairs and extend and down the sequence. Great, but when you’re talking about millions of reads…