Parallel Computation in Biological Sequence Analysis: ParAlign & TurboBLAST
Larissa Smelkov
Biological Sequence Alignment
- Global
  - Goal: to identify conserved regions and differences
  - Algorithm: Needleman-Wunsch
  - Application:
    1. Comparing two genes with the same function (human vs. mouse)
    2. Comparing two proteins with similar function
- Local
  - Goal: to see whether two strings have a common substring
  - Algorithm: Smith-Waterman
  - Application:
    1. Searching for local similarities in large sequences (newly sequenced genomes)
    2. Looking for motifs in two proteins
Protein Responsible for Iron Transport
Human:
MQEYTNHSDTTFALRNISFRVPGRTLLHPLSLTFPAGKVTGLIGHNGSGKSTLLK
MLGRHPPSEGEILLDAQPLESWSSKAFARKVAYLPQQLPPAEGMTVRELVAIGR
YPWHGALGRFGAADREKVEEAISLVGLKPLAHRLVDSLSGGERQRAWIAMLVA
QDSRCLLLDEPTSALDIHQVDVLSLVHRLSQERGLTVIAVLHDINMAARYCDYL
VALRGGEMIAQGTPAEIMRGETLEMIYGIPMGILPHPAGAAPVSFVY
Chicken:
MKLILCTVLSLGIAAVCFAAPPKSVIRWCTISSPEEKKCNNLRDLTQQERISLTCV
QKATYLDCIKAIANNEADAISLDGGQVFEAGLAPYKLKPIAAEIYEHTEGSTTSY
YAVAVVKKGTEFTVNDLQGKTSCHTGLGRSAGWNIPIGTLLHWGAIEWEGIES
GSVEQAVAKFFSASCVPGATIEQKLCRQCKGDPKTKCARNAPYSGYSGAFHCL
KDGKGDVAFVKHTTVNENAPDLNDEYELLCLDGSRQPVDNYKTCNWARVAA
HAVVARDDNKVEDIWSFLSKAQSDFGVDTKSDFHLFGPPGKKDPVLKDLLFKD
SAIMLKRVPSLMSQLYLGFEYYSAIQSMRKDQLSGSPRQNRIQWIAVLKAEKSK
CDRWSVVSNGDVECTVVDETKDCIIKIMKGEADAV
Similar Substrings
DSLSGGERQ-RA-WIAMLVAQDSRC
| |||  || |  ||| | |  | |
DQLSGSPRQNRIQWIAVLKAEKSKC
Talk Outline
- Problem Description
- Smith-Waterman Algorithm
- BLAST
- ParAlign
- TurboBLAST
- Comparison
Problems of Comparison of 2 Sequences
- Evolution Factor
  - Additions
  - Deletions
  - Substitutions
- Human Factor
  - Typos
  - Duplicates
Solution
- Smith-Waterman Algorithm (S-W)
  - Score Matrix
  - Gap Penalty
Score Matrix: BLOSUM 45
Pairwise Alignment Example
ELEPHANT
PANTHER
S-W: Dynamic Programming Matrix
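S-W: Formula
$$
T[i,j] = \max \begin{cases}
T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\
T[i-1,j] - g \\
T[i,j-1] - g \\
0
\end{cases}
$$
g: gap penalty; g = 8 (in our example)
[Diagram: the cell being computed, T[i, j] (marked "?"), next to its neighbors T[i-1, j-1] and T[i-1, j]]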
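The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the exact setup of the example: `toy_score` (+5 for a match, -4 for a mismatch) is an assumed stand-in for the BLOSUM45 matrix, and all function and variable names are hypothetical. Only the gap penalty g = 8 is taken from the example.

```python
def toy_score(a, b):
    """Assumed stand-in for a real score matrix such as BLOSUM45."""
    return 5 if a == b else -4


def smith_waterman(s, t, score=toy_score, g=8):
    """Fill the S-W matrix T and return (T, best score, position of best cell)."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay at zero, so T[i][j] refers to s[i-1] and t[j-1].
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            T[i][j] = max(
                T[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # match / substitution
                T[i - 1][j] - g,                              # gap in t
                T[i][j - 1] - g,                              # gap in s
                0,                                            # start a new local alignment
            )
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)
    return T, best, best_pos


T, best, pos = smith_waterman("ELEPHANT", "PANTHER")
print(best, pos)
```

With the toy scores the best cell closes the shared substring ANT; a real score matrix can change which local alignment wins.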
S-W: Result Alignment
ELEPHANT
   | |||
   P-ANTHER
S-W: Summary
- Uses
  - Score matrix
  - Gap penalties
- Complexity
  - O(mn)
- Sensitivity
  - High
Growth of GenBank
~33 million sequences as of Feb. 14, 2004
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
BLAST: Basic Local Alignment Search Tool
BLAST: Steps
- Divide both sequences into words of length w
  - default w = 3
- Calculate score for each pair
- Extend high-scoring pairs to increase score
BLAST: Divide Sequences
BLAST: Calculate Score
Word pair    Residue scores    Score
ELE / PAN     0  -1   0         -1
LEP / PAN    -3  -1  -2         -6
EPH / PAN     0  -1   1          0
PHA / PAN     9  -2  -1          6
HAN / PAN    -2   5   6          9
ANT / PAN    -1  -1   0         -2
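The word-splitting and scoring steps can be sketched as follows. This is an assumption-laden illustration: `toy_score` (+5 match, -4 mismatch) replaces the real score matrix, so the numbers differ from the ones on the slide, and the function names are hypothetical. The slide scores each word of ELEPHANT against the single word PAN; the sketch scores every word pair and sorts them.

```python
def toy_score(x, y):
    """Assumed stand-in for a real score matrix."""
    return 5 if x == y else -4


def words(seq, w=3):
    """Return all overlapping words of length w together with their offsets."""
    return [(i, seq[i:i + w]) for i in range(len(seq) - w + 1)]


def word_score(a, b, score=toy_score):
    """Sum the residue-by-residue scores of two equal-length words."""
    return sum(score(x, y) for x, y in zip(a, b))


def scored_pairs(s, t, score=toy_score, w=3):
    """Score every word of s against every word of t, best pairs first."""
    pairs = [
        (word_score(ws, wt, score), i, j, ws, wt)
        for i, ws in words(s, w)
        for j, wt in words(t, w)
    ]
    return sorted(pairs, key=lambda p: p[0], reverse=True)


for entry in scored_pairs("ELEPHANT", "PANTHER")[:3]:
    print(entry)
```

Each entry is (score, offset in the first sequence, offset in the second sequence, word, word), so the top entries are the candidates for the extension step.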
BLAST: Sort Pairs on Score
BLAST: Extension
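A hedged sketch of extending a seed without gaps: starting from a matching word pair, the alignment grows left and right and keeps the best running score in each direction. The names and the `toy_score` function (+5 match, -4 mismatch) are assumptions for illustration. Production BLAST stops extending once the score falls a set amount below the best seen; this simplified version scans to the end of the shorter side.

```python
def toy_score(x, y):
    """Assumed stand-in for a real score matrix."""
    return 5 if x == y else -4


def extend_seed(s, t, i, j, w=3, score=toy_score):
    """Extend the seed s[i:i+w] / t[j:j+w] in both directions without gaps.

    Returns (score, start in s, start in t, length) of the best extension.
    """
    seed = sum(score(s[i + k], t[j + k]) for k in range(w))

    # Extend to the right, remembering the best running total.
    best_r, len_r, run, k = 0, 0, 0, 0
    while i + w + k < len(s) and j + w + k < len(t):
        run += score(s[i + w + k], t[j + w + k])
        k += 1
        if run > best_r:
            best_r, len_r = run, k

    # Extend to the left in the same way.
    best_l, len_l, run, k = 0, 0, 0, 1
    while i - k >= 0 and j - k >= 0:
        run += score(s[i - k], t[j - k])
        if run > best_l:
            best_l, len_l = run, k
        k += 1

    return seed + best_l + best_r, i - len_l, j - len_l, len_l + w + len_r


print(extend_seed("AACGTTT", "CACGTGG", 2, 2))
```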
BLAST: Summary
- Uses
  - Score matrix
  - Gap penalties
  - Heuristics to reduce computations
- Complexity
  - O(m) with O(n) processors
- Sensitivity
  - Low
Sensitivity
- Task: align 2 sequences: AXBXCXDXE and ABCDE
- Smith-Waterman:
  AXBXCXDXE
  A-B-C-D-E
- BLAST: Ø (no similar substrings)
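The BLAST result for this task can be checked with a short sketch. It assumes exact word matching with w = 3, which is a simplification of how BLAST selects seed words; the function name is hypothetical.

```python
def shared_words(s, t, w=3):
    """Return the set of length-w words that occur in both sequences."""
    words_s = {s[i:i + w] for i in range(len(s) - w + 1)}
    words_t = {t[i:i + w] for i in range(len(t) - w + 1)}
    return words_s & words_t


# No seed word is shared, so a word-based search has nothing to extend.
print(shared_words("AXBXCXDXE", "ABCDE"))
# For comparison, the earlier example pair does share a word.
print(shared_words("ELEPHANT", "PANTHER"))
```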
S-W vs. BLAST
[Plot: Speed against Sensitivity, with BLAST and S-W as points]
S-W and BLAST
- Using them now
  - Too costly
  - Inefficient
  - Time-consuming
- Solution
  - More heuristics
  - More parallelism
ParAlign
ParAlign: Steps
- Find ungapped alignments
- Calculate approximate alignment scores
- Choose high-scored sequences
- Apply S-W
ParAlign: Microparallelism
- Divide wide registers into smaller units
- Perform the same operation on different data sources
- Modern microprocessors have this technology built in
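The idea of dividing a wide register into smaller units can be emulated in plain Python. The sketch below is only an illustration of the principle, not ParAlign's code: it treats a 32-bit integer as four 8-bit lanes and adds all four lanes with a single addition. The lane layout and all names are assumptions; real processors provide such operations as dedicated SIMD instructions.

```python
LANES = 4            # four 8-bit lanes packed into one 32-bit word
HIGH = 0x80808080    # top bit of every lane
LOW = 0x7F7F7F7F     # lower seven bits of every lane


def pack(values):
    """Pack four integers in 0..255 into one 32-bit word, lane 0 lowest."""
    word = 0
    for k, v in enumerate(values):
        word |= (v & 0xFF) << (8 * k)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * k)) & 0xFF for k in range(LANES)]


def add_lanes(a, b):
    """Add four 8-bit lanes at once, wrapping modulo 256 in each lane."""
    # Adding only the low 7 bits means no carry can cross a lane boundary.
    low = (a & LOW) + (b & LOW)
    # The top bit of each lane is then fixed up separately with XOR.
    return low ^ ((a ^ b) & HIGH)


print(unpack(add_lanes(pack([1, 2, 3, 4]), pack([10, 20, 30, 40]))))
```

One addition on the packed word does the work of four independent additions, which is the same effect the slide attributes to microparallelism.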
ParAlign: Calculate Scores in Parallel
ParAlign: Estimate of Gaps
ParAlign: Apply S-W in Parallel
ParAlign: Summary
- Uses
  - SIMD technology (single instruction, multiple data)
  - S-W algorithm
  - Heuristics to reduce computations
- Requirement for machine
  - Modern microprocessor
- Speed
  - Fast
- Sensitivity
  - Medium
TurboBLAST
TurboBLAST: Steps
- Divide the job
  - Parts of query against partition of database
- Apply BLAST
- Merge results
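The three steps can be sketched as a divide, apply, merge pipeline. This is a minimal illustration under stated assumptions, not TurboBLAST's implementation: `toy_search` is a hypothetical stand-in for a BLAST run (it only reports exact substring hits), local threads stand in for Workers on separate machines, and the chunk sizes are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def toy_search(queries, db_part):
    """Hypothetical stand-in for BLAST: report (query, subject) substring hits."""
    return [(q, subject) for q in queries for subject in db_part if q in subject]


def run_job(queries, database, query_chunk=2, db_chunk=2, workers=4):
    # Divide: every part of the query set against every partition of the database.
    tasks = list(product(chunks(queries, query_chunk), chunks(database, db_chunk)))
    # Apply: run the search on each task in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda task: toy_search(*task), tasks)
        # Merge: combine the partial results into one ordered list.
        return sorted(hit for part in partial for hit in part)


hits = run_job(["ANT", "HER", "XYZ"], ["ELEPHANT", "PANTHER", "TIGER", "ANTELOPE"])
print(hits)
```

Because each task is independent, the number of tasks can exceed the number of workers, which keeps every worker busy until the job is done.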
TurboBLAST: Implementation
- A three-tier system
- Components
  - Client
  - Master
  - Workers
TurboBLAST: Schema
[Diagram: Client, Master (TurboHub and File Provider) and Workers]
- Client (sends the job, receives results)
  - Divides job into tasks
  - Writes results to file
- Master: TurboHub (receives task requests, hands out tasks, collects results)
  - Sets up tasks
  - Manages execution
  - Coordinates Workers
  - Provides VSM
- Master: File Provider (receives DB part requests, delivers DB parts)
- Workers
  - Divide task
  - Schedule subtasks
  - Solve subtasks
  - Merge results
TurboBLAST: Client
- Takes a BLAST job and divides it into a number of initial BLAST tasks
- Submits these tasks to the Master
- Retrieves the results and writes them to file
TurboBLAST: Master
- Accepts tasks from Clients and sets them up for processing by the Workers
- Includes TurboHub (the server portion of a parallel execution system)
- Includes File Provider (a Java application that manages the databases)
TurboBLAST: Worker
- Workers are processors
- Run a Java application and perform the BLAST computations
- Merge the results
- Are responsible for scheduling
TurboHub
- TurboHub is an execution engine for parallel and distributed Java applications
- Scalable high performance
- Wide range of computing environments
- Manages the flow of data through the workflows
- Schedules the components
- Transforms data between components
- Balances load
- Handles errors
TurboBLAST: TurboHub
- Manages task execution
- Coordinates the Workers
- Provides a virtual shared memory
- Supports dynamic changes in the set of Workers
- Supports fault tolerance
TurboBLAST: File Provider
- Maintains a copy of each database
- Delivers all or part of each database to Workers as they require them
TurboBLAST: Advantages
- Size of each task is optimal
  - processing is efficient on the processor that computes the task
- Large set of tasks
  - no waste of time for processors
- No algorithm change
  - Support for all flavors of BLAST
  - Easy to update
- Applicable for different environments (PC, Macintosh …)
TurboBLAST: Experiment
- Input data
  - 500 proteins
  - 200-400 amino acids in each
- Database
  - 1,681,522,266 sequences
- Hardware
  - IBM Linux cluster
  - 8 dual-processor workstations
    - 2 Pentium III processors, 996 MHz each
    - 2 GB memory
    - 100 Mbit Ethernet
TurboBLAST: Results of Experiment
TurboBLAST: Summary
- Divide and Conquer
  - Uses BLAST algorithm
  - Uses many copies of BLAST in parallel
- Requirement for each machine
  - Java VM
  - Local BLAST executable
- Speed
  - Very fast
- Sensitivity
  - Low
Comparison of Algorithms/Products
[Plot: Speed against Sensitivity, with TurboBLAST, ParAlign, BLAST and S-W as points]
References
- R. D. Bjornson, A. H. Sherman, S. B. Weston, N. Willard, J. Wing. "TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub." Intl. Parallel and Distributed Processing Symposium (IPDPS), 2002.
- T. Rognes. "ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches." Oxford University Press, 2001.
Don’t ask any Questions, please…
PS
- A web site where you can donate your computer time to take part in the search for methods to cure cancer: http://www.the-optimists.org.uk