SHRi MP Accurate Mapping of Short Reads in
- Slides: 33
SHRi. MP: Accurate Mapping of Short Reads in Letter- and Colour-spaces Stephen Rumble, Phil Lacroute, …, Arend Sidow, Michael Brudno
How SHRi. MP works: l Stage 1: Map reads to target genome l Stage 2: Compute statistics
Read Mapping l Three phases l Very fast k-mer scan (index reads, scan genome) l Fast, vectorized Smith-Waterman to confirm l Slow, complete backtracking S-W for top ‘n’ hits
Read Mapping: Phase 1 l Create a hash table of size 4^(k-mer length) l l 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) This becomes our kmer to read index AACTGTACCAGTGAG …
Read Mapping: Phase 1 l Create a hash table of size 4^(k-mer length) l l 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) This becomes our kmer to read index AACTGTaccagtgag …
Read Mapping: Phase 1 l Create a hash table of size 4^(k-mer length) l l 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) This becomes our kmer to read index AACTGTA … a. ACTGTAccagtgag
Read Mapping: Phase 1 l Create an index of size 4(k-mer length) l l 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) This is our k-mer to read index AACTGTA CTGTAC … aa. CTGTACcagtgag
Read Mapping: Phase 1 l Create a hash table of size 4^(k-mer length) l l 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) This becomes our kmer to read index AACTGTA CTGTACC … acc. TGTACCagtgag
Read Mapping: Phase 1 l Create a hash table of size 4^(k-mer length) l l 4 bases – ignore all else (‘N’, ‘X’, wobble codes…) This becomes our kmer to read index AACTGTA CTGTACC … Read 7 Read 12 Read 32 Read 13 Read 18 Read 7 Read 12 Read 15
Read Mapping: Phase 1 Once we’ve indexed all reads, just scan the genome by k-mer Genome Reads l
Read Mapping: Phase 1 l Remember the k-mer hits within a given interval (window) l When sufficient hits, look more closely l “Look more closely” means calculate a fast Smith. Waterman score
Technicalities l We don’t always use full k-mers (q-grams). l We actually support ‘spaced seeds’, but the algorithm doesn’t change much. l For each spaced seed, ‘compress out’ the k-mer and use it as the hash index
Read Mapping: Phase 2 l Smith-Waterman is very expensive l Nx. M matrix isn’t too big for short reads and windows, but… l We call the vectorized code millions of times l We don’t want a bottleneck – aim for no more than 50% of the total runtime l We only want one score as quickly as possible
Read Mapping: Phase 2 A C T A G A T C C A G T Cell being computed Previously computed cells C T T G
Read Mapping: Phase 2 l Each forward-facing diagonal in S-W matrix depends on: l l l Small constant # of previous diagonals Small constant # of scalars We can compute entire diagonals in parallel l Our speed-up is proportional to the diagonal size
Read Mapping: Phase 2 A C T A G A C T T G T + C - C Current - A Previous - G Penultimate + T A C T A G A C T T G A C C T + - - - + T G
Read Mapping: Phase 2 l Most commodity processors have vector instructions l l Remember the MMX brouhaha? SIMD – Single Instruction, Multiple Data 4 2 6 12 9 21 8 7 + 15 3 = 23 10
Read Mapping: Phase 2 A C T A G A C T T G T + C - C Current - A Previous - G Penultimate + T A C T A G A C T T G A C C T + - - - + T G
Read Mapping: Phase 2 l Match scores typically use a scoring matrix l l Scoring. Matrix[Seq. A[i]][Seq. B[j]] But this doesn’t scale: Individual cell scores become a bottleneck l Can precompute a ‘query profile’ (expensive), or… l If we only care about strict match/mismatch we can use logical bit-wise operations l SIMD instructions work here (fully parallel)
Read Mapping: Phase 2 l Results: l Our vectorized S-W is as fast, or faster than other very complicated SIMD implementations l 500 million+ matrix cells/second on Core 2 machines l Even with small seeds, S-W accounts for at most half of the total run time
Read Mapping: Phase 3 l Recap: l K-mer scan selects areas of reasonable similarity l Vectorized S-W (dis)confirms similarity l Best ‘n’ hits per read are given a full alignment with backtrace
Read Mapping: Phase 3 l Letter-space alignments are simple: l l K-mer scan, Vectorized S-W, Full S-W in letters, give user pretty output What about AB SOLi. D colour-space? l l l Biologists want to see A, C, G, T, not 0, 1, 2, 3… Dealing with strange SOLi. D properties… Our solution: l l K-mer scan, Vectorized S-W in colour-space Full S-W in letter-space, but we can’t just convert
AB Di-base Reads l We think in terms of nucleotides: l l A, C, G, and T’s. AB’s NGS machine outputs 4 colours l One colour per pair of bases: T T G AG C G T T C T 0 1 2 2 3 3 1 0 2 A C G T A 0 1 2 3 C 1 0 3 2 G 2 3 0 1 T 3 2 1 0
AB Di-base Reads 0 A C G T A 0 1 2 3 C 1 0 3 2 1 G 2 3 0 1 C T 3 2 1 0 0 2 A 3 G 3 1 T 2 0 0
SOLi. D Translations l Given the following read, there are 4 translations (we need an initial base): 0 1 2 2 3 3 1 0 2 A A C T C G C A A G C C A G A T A C C T G G T C T A T G G A T T G A G C G T T C
SOLi. D Translations l Reads begin with a known primer (‘T’) 0 1 2 2 3 3 1 0 2 A A C T C G C A A G C C A G A T A C C T G G T C T A T G G A T T G A G C G T T C
SOLi. D Translations l What happens if a read error occurs? l The right translation was: T T G A G C G T T C 0 1 0 2 3 3 1 0 2 A A C C T A T G G A C C A A G C G T T C G C A A G T T G G A T A C C T
Colour-space Smith-Waterman l There are four unique translations for every read l An error will cause us to change frames (different translation) l Why not do a S-W across all four letter-space translations with some error penalty?
r Lette Colour-space Smith-Waterman l l Think of 4 S-W matrices stacked above one another If we have 1 read error, but otherwise perfect match, we’ll use 2 matrices Genome Read Frame 1 Frame 2 Frame 3 Frame 4
Colour-space Smith-Waterman l End result: G: 1123724 T: R: 0 TA-ACCACGGTCACACTTGCATCAC || ||||| |||X||||||| TACACCACGGTCAGACTt. GCATCAC T 0311101130121221211313211 Should be ‘ 0’ 1123701 24
Statistics l After reads are mapped, mull over the results l For each read: l P(hit by pure chance – not a valid hit) l P(hit generated by genome – valid hit) l P(hit is best of all for particular read)
Results l Speed l Simple k-mer scan is very fast l Important when seeds are bigger (less S-W) l Vectorized S-W is fast l Important when seeds are smaller (more S-W) l Generally well-balanced run time l Big seeds make k-mer scan the bottleneck (this is good it’s really fast) l Easily parallelised – just divide the reads over CPUs
Results l C. Savingyi l l 22 M 25 bp reads 173 Mb genome l l l S-W would take at least a few thousand CPU days SHRi. MP runs in about 50 CPU days with fairly small seeds (length 8, weight 7) SNP, indel, error rates correspond well to known averages for this organism
- Long and short
- Shri dadaji institute of technology and science
- Shri bhairavdev vidyalaya
- Sndp movement upsc
- Shri ambika vidyalaya kedgaon
- Shri sad
- Heddy shri ahimsa putra
- Memory parameters
- Forward mapping vs backward mapping
- Transform mapping dan transaction mapping
- Present simple vs present continuous
- When everybody talks during the lesson
- A mechanism that spins reads and writes disks
- Raven rhyme scheme
- Fpros
- Andy sometimes read comics
- Three reads routine
- Modelled reading
- Light-sensing input device that reads printed text
- Irip nebraska
- Andy sometimes comics
- Speed tunnel
- The usc micrometer reads in
- What is ict in computer
- Raw reads meaning
- Contigs assembly
- Monotonic read consistency
- As jeff reads his psychology textbook
- Accurate but not precise example
- Packet
- Precise measurements are always accurate
- Explanation of code
- Accurate cutting facilitates sewing and improve garment
- Front elevation of a square based pyramid