Gapped BLAST and PSIBLAST a new generation of

Gapped BLAST and PSIBLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller and David J. Lipman

Introduction to BLAST • BLAST is a heuristic approximation to dynamic programming based local alignment. • Finds locally maximal segment pairs with scores over a cutoff. • Has a formal statistical theory to assess the significance of scores.

Basic Algorithm • Looks for words of length w with score greater than T. • These hits are then extended to check for segment pairs with score greater than S (>T. ) • Tradeoff: Lowering T reduces probability of missing segment pairs (increases sensitivity) but increases number of hits to be extended.

Scanning for hits • Two Approaches: – Positions of length w words in query with score higher than T stored in a 20 w sized array and hits detected by array lookup. – A DFA for the appropriate words is generated and used to scan the sequences. A Mealy machine (acceptance on transitions) is used for efficiency.

Other Issues • Hit extension is simplified by stopping when score falls below a threshold compared to the best score found for shorter extensions. • The various parameters are chosen based on experiments using random sequences. • Combinations of MSP’s can be used to get better scores for matching sequences.

Two-Hit Method Original BLAST: One-Hit Two-Hit Method • Extend each hit to determine if it is in a highscoring alignment • Extension consumes >90% of processing time • Extension invoked only if there are two nonoverlapping word pairs on the same diagonal • Lowering T yields more hits, but only a few are extended • 3 x faster • hit: short word pair whose aligned score ≥ T T: threshold parameter; as T ↑, speed ↑, probability of missing weak similarities ↑

Two-Hit Method Algorithm • Scan db for hits (word pair scoring ≥ T) • Seek pairs of non-overlapping hits found with distance A of one another on same diagonal • Invoke (ungapped) extension to determine if hits lie within a statistically significant alignment with query. Extend until alignment score has dropped ≥ X below max score yet attained.

Gapped Alignments Original BLAST Implicitly treat gapped alignments: • Locate several distinct HSPs within same db sequence • Calculate statistical significance on combined result Gapped BLAST • Trigger gapped extension for any HSP exceeding moderate score Sg • Gapped extension longer to execute, few undergo this extension HSP: high-scoring segment pair; locally optimal

Advantage of New Heuristic for Generating Gapped Alignments • Two or more HSPs may each have low scores independently, but can have a statistically significance together • Only one of the constituent HSPs need to be found to generate a successful combined result – can increase T

Older Gapped Alignments • Confine the dynamic programming to a banded section of the full path graph • Optimal gapped alignment may be outside this band • As width of band ↑, speed ↓

New Heuristic for Generating Gapped Alignments • Starting from a seed HSP, dynamic programming proceeds both bidirectionally through the path graph • Consider only cells for which optimal local alignment score falls ≤ Xg below best score yet found • Region of path graph explored adapts to alignment being constructed • Seed: central residue pair of segment with highest alignment along HSP

New Gapped BLAST • Ungapped extension of second hit invoked for two non-overlapping hits of score ≥ T within distance A of one another • If HSP generated has normalized score ≥ Sg, gapped extension is triggered • Resulting gapped alignment reported if statistically significant (low enough E-value) • Runs on average 3 x faster than original BLAST

PSI-BLAST: Overview • Results of initial BLAST search used to construct position-specific scores. • BLAST is repeated using the new scores till no more sequences are found. • Position-specific scores improve the ability of successive BLAST iterations for detecting remote homologs.

Position-specific score matrix • Dimensions: Lx 20 • “Multiple Alignment” created using all segments with e-value above a threshold. • Alignment based on pairwise alignments. • Columns with gaps in query ignored. • For each column C a reduced alignment MC is created.

• MC includes all sequences with a residue in C and all columns which have the above sequences. • Sequence weighting method used to generate observed residue frequencies. • Score for residue i in column C given by log(Qi/ Pi) • Qi is the weighted sum of observed frequencies and a pseudocount.