q-gram Based Database Searching Using a Suffix Array (QUASAR)
S. Burkhardt, A. Crauser, H.-P. Lenhof · Max-Planck-Institut für Informatik, Saarbrücken
E. Rivals, P. Ferragina, M. Vingron · Deutsches Krebsforschungszentrum, Heidelberg
Outline · Existing Work · Motivation · Problem · Algorithm · Results
· Examples: • BLAST • FASTA · Linear Scan (No Index) · Good Sensitivity
· Today: New Applications · Examples: • EST-Clustering • Large Scale Shotgun Assembly · Low Sensitivity · Multiple Searches ⇒ Specialized Algorithms Needed
Pattern P: TCGATTACAGTGAAT · Database D: GCATTCGATGGACTAGTGAATCAGT · w = 8 · Local Alignment, Minimum Length w · Low Error Rate (< 10% Edit Distance)
• Filter Step: • Identify Hotspots • Scan Step: • Scan Hotspots with BLAST
• q-gram Filtration • Block Addressing • Suffix Array • Window Shifting
q-gram Filtration · Query: TCGATTACAGTGAAT · Database: GCATTCGATGGACTAGTGAATCAGT · q = 3, w = 8 · q-grams of the first window TCGATTAC: TCG, CGA, GAT, ATT, TTA, TAC · Number of q-grams: |P| - q + 1 · Edit Distance e: at least t = |P| - q + 1 - q·e common q-grams
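The q-gram count and the threshold t can be checked with a few lines of Python. This is a minimal sketch, not the QUASAR implementation; the function names `qgrams` and `qgram_threshold` are made up for illustration, and the window length w = 8 is used in place of |P|.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(length, q, e):
    """Lower bound on shared q-grams: t = length - q + 1 - q*e."""
    return length - q + 1 - q * e


window = "TCGATTAC"                    # first window of the slide's query, w = 8
print(qgrams(window, 3))               # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(qgram_threshold(len(window), 3, 1))  # 3
```

With q = 3 and e = 1 the window has 6 q-grams, and a single edit can destroy at most 3 of them, which leaves t = 3.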
Block Addressing · Window: TCGATTAC · Divide D into Blocks · Count matching q-grams per Block · Scan Blocks with Counter ≥ t · How to find the matching q-grams? · [Figure: database GCATTCGATGGACTAGTGAATCAGT divided into blocks, each with a q-gram counter]
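The block counting step can be sketched as follows. This is an illustration under stated assumptions, not the QUASAR code: the block size of 8 is chosen only to fit the toy database, each q-gram is attributed to the block containing its start position (boundary effects are ignored), and matching q-grams are located by a naive scan. The helper names are hypothetical.

```python
from collections import Counter


def qgram_positions(text, gram):
    """All start positions of gram in text (naive scan, overlaps allowed)."""
    out = []
    pos = text.find(gram)
    while pos != -1:
        out.append(pos)
        pos = text.find(gram, pos + 1)
    return out


def block_counters(database, window, q, block_size):
    """Count, per block, the database hits of the window's q-grams."""
    counts = Counter()
    for i in range(len(window) - q + 1):
        for pos in qgram_positions(database, window[i:i + q]):
            counts[pos // block_size] += 1   # block = start position / block size
    return counts


def candidate_blocks(counts, t):
    """Blocks whose counter reaches the threshold t; only these get scanned."""
    return sorted(b for b, c in counts.items() if c >= t)


D = "GCATTCGATGGACTAGTGAATCAGT"
counts = block_counters(D, "TCGATTAC", q=3, block_size=8)
print(dict(counts))                  # {0: 4}
print(candidate_blocks(counts, 3))   # [0]
```

The naive scan costs time linear in |D| for every q-gram, which is the cost an index is meant to avoid.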
Suffix Array · Window: TCGATTAC · Sorted List of Pointers to Suffixes, O(log |D|) Access Time · Precompute Searches for q-grams, O(1) Access Time
q-gram lookup table (excerpt): AAA: 0 · AAC: 0 · AAG: 0 · AAT: 0 · ACA: 1 · ACC: 1 · ACG: 1 · ACT: 1 · AGA: 3 · AGC: 3 · AGG: 3 · AGT: 3 · ATA: 4 · ATC: 4 · ATG: 4 · ATT: 5 · ... · TGA: 26 · TGC: 27 · TGG: 27 · TGT: 29 · TTA: 29 · TTC: 29 · TTG: 30 · TTT: 30
[Figure: database GCATTCGATGGACTAGTGAATCAGT with a position ruler from 0 to 29; the figure also shows the values 3, 23, 16, 11]
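A suffix array with a precomputed q-gram table can be sketched as below. This is a minimal illustration, not the QUASAR index: the suffix array is built by naively sorting suffixes (fine for a toy string, far too slow for a real database), the table is a Python dict rather than a direct-addressed array, and all function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_qgram_table(text, sa, q):
    """Map each q-gram to its half-open range [lo, hi) in the suffix array."""
    table = {}
    for rank, start in enumerate(sa):
        gram = text[start:start + q]
        if len(gram) < q:
            continue                      # suffix too short to start a q-gram
        lo, _ = table.get(gram, (rank, rank))
        table[gram] = (lo, rank + 1)      # suffixes sharing a prefix are adjacent
    return table


def occurrences(gram, sa, table):
    """All database positions of gram via one table lookup, no binary search."""
    lo, hi = table.get(gram, (0, 0))
    return sorted(sa[lo:hi])


D = "GCATTCGATGGACTAGTGAATCAGT"
sa = build_suffix_array(D)
table = build_qgram_table(D, sa, 3)
print(occurrences("GAT", sa, table))   # [6]
print(occurrences("AGT", sa, table))   # [14, 22]
print(occurrences("TTA", sa, table))   # []
```

Without the table, each q-gram would be located by binary search over the suffix array, which is the O(log |D|) access the slide mentions; the table replaces that search with a single lookup.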
Window Shifting · q = 3, w = 8, e = 1, t = 3 · Query: TCGATTACAGTGAAT · Move Window over Query · Mark full Blocks for each Window · Scan Marked Blocks · [Figure: database GCATTCGATGGACTAGTGAATCAGT with per-block counters]
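Window shifting can be sketched by updating the block counters incrementally: when the window moves one position, the hits of the entering q-gram are added and the hits of the leaving q-gram are removed. This is an illustration under assumptions, not the QUASAR code: a plain dict stands in for the suffix-array lookup, the block size of 8 is chosen for the toy data, q-grams are attributed to the block of their start position, and `mark_blocks` is a hypothetical name.

```python
from collections import Counter, defaultdict


def index_qgrams(database, q):
    """Stand-in for the index: q-gram -> list of start positions."""
    index = defaultdict(list)
    for pos in range(len(database) - q + 1):
        index[database[pos:pos + q]].append(pos)
    return index


def mark_blocks(query, database, w, q, e, block_size):
    """Return the blocks whose counter reaches t for at least one window."""
    t = w - q + 1 - q * e
    index = index_qgrams(database, q)
    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    per_window = w - q + 1                      # q-grams inside one window
    counts, marked = Counter(), set()
    for j, gram in enumerate(grams):
        for pos in index.get(gram, ()):         # q-gram entering the window
            counts[pos // block_size] += 1
        if j >= per_window:                     # q-gram leaving the window
            for pos in index.get(grams[j - per_window], ()):
                counts[pos // block_size] -= 1
        if j >= per_window - 1:                 # window is complete
            marked.update(b for b, c in counts.items() if c >= t)
    return sorted(marked)


D = "GCATTCGATGGACTAGTGAATCAGT"
print(mark_blocks("TCGATTACAGTGAAT", D, w=8, q=3, e=1, block_size=8))  # [0, 2]
```

On the slide's example, block 0 is marked by the first windows (TCG, CGA, GAT, ATT) and block 2 by the last ones (CAG, AGT, TGA, GAA, AAT); only those blocks would be handed to the scan step.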
· Influence of the Block Size · Sensitivity · Running Times · Overhead for Loading the Index · Benchmark System: UltraSPARC Processor, 333 MHz, 4 GB RAM
Influence of Block Size
Sensitivity · 1000 Queries · BLAST Cutoff E = 0.00001 · Number of identical hitlists: • Mouse EST DB: 91.4% • Human EST DB: 97.1% ⇒ QUASAR finds many Hits below selected Error Level
Running Times · Test Parameters: • 6% Error • w = 50 • q = 11 • Block Size 2048 • Scan with BLAST • Time averaged over 1000 Queries · ~30 times faster than BLAST
Overhead for Loading the Index · 1000 Queries · Human EST DB, 280 Mbp · BLAST Test Run: • 5 seconds Load Time • 13,270 seconds Search Time · QUASAR Test Run: • 90 seconds Load Time • 380 seconds Search Time
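The effect of the load overhead on the overall speedup can be worked out directly from these figures. The sketch below assumes the BLAST search time is 13,270 seconds, which is the reading consistent with the roughly 30-fold speedup reported for the running times; the variable names are illustrative.

```python
# Timings for 1000 queries against the Human EST DB, in seconds.
blast_load, blast_search = 5, 13270      # assumption: search time is 13,270 s
quasar_load, quasar_search = 90, 380

search_speedup = blast_search / quasar_search
total_speedup = (blast_load + blast_search) / (quasar_load + quasar_search)

print(f"search only:  {search_speedup:.1f}x")   # search only:  34.9x
print(f"with loading: {total_speedup:.1f}x")    # with loading: 28.2x
```

Under that assumption, loading the index costs QUASAR about a fifth of its total run time, yet the end-to-end speedup stays close to 30.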