CS 5263 CS 4593 Bioinformatics Nextgeneration sequencing Mapping

CS 5263 & CS 4593 Bioinformatics Next-generation sequencing Mapping short reads

Short read mapping • Input: – A reference genome – A collection of many 25 -100 bp tags (reads) – User-specified parameters • Output: – One or more genomic coordinates for each tag • In practice, only 70 -75% of tags successfully map to the reference genome. Why?

Multiple mapping • A single tag may occur more than once in the reference genome. • The user may choose to ignore tags that appear more than n times. • As n gets large, you get more data, but also more noise in the data.

Inexact matching ? • An observed tag may not exactly match any position in the reference genome. • Sometimes, the tag almost matches one or more positions. • Such mismatches may represent a SNP (single-nucleotide polymorphism, see wikipedia) or a bad read-out. • The user can specify the maximum number of mismatches, or a phred-style quality score threshold. • As the number of allowed mismatches goes up, the number of mapped tags increases, but so does the number of incorrectly mapped tags.

Read Length is Not As Important For Resequencing Jay Shendure

Mapping Reads Back • Hash Table (Lookup table) – FAST, but requires perfect matches. [O(m n + N)] • Array Scanning – Can handle mismatches, but not gaps. [O(m N)] • Dynamic Programming (Smith Waterman) – Indels – Mathematically optimal solution – Slow (most programs use Hash Mapping as a prefilter) [O(mn. N)] • Burrows-Wheeler Transform (BW Transform) – FAST. [O(m + N)] (without mismatch/gap) – Memory efficient. – But for gaps/mismatches, it lacks sensitivity

Spaced seed alignment • Tags and tag-sized pieces of reference are cut into small “seeds. ” • Pairs of spaced seeds are stored in an index. • Look up spaced seeds for each tag. • For each “hit, ” confirm the remaining positions. • Report results to the user.

Burrows-Wheeler • Store entire reference genome. • Align tag base by base from the end. • When tag is traversed, all active locations are reported. • If no match is found, then back up and try a substitution.

Why Burrows-Wheeler? • BWT very compact: – Approximately ½ byte per base – As large as the original text, plus a few “extras” – Can fit onto a standard computer with 2 GB of memory • Linear-time search algorithm – proportional to length of query for exact matches

Burrows-Wheeler Transform (BWT) BWT acaacg$ $acaacg$ac acaacg$aca caacg$acaa g$acaac gc$aaac Burrows-Wheeler Matrix (BWM)

Burrows-Wheeler Matrix $acaacg$ac acaacg$aca caacg$acaa g$acaac

Burrows-Wheeler Matrix 3 1 4 2 5 6 $acaacg$ac acaacg$aca caacg$acaa g$acaac See the suffix array?

Key observation a 1 c 1 a 2 a 3 c 2 g 1$1 “last first (LF) mapping” The i-th occurrence of character X in the last column corresponds to the same text character as the i-th occurrence of X in the first column. 1$acaacg 1 2 aacg$ac 1 1 acaacg$1 3 acg$aca 2 1 caacg$a 1 2 cg$acaa 3 1 g$acaac 2

Recover text 4 6 5 6 3 3 4 4 2 5 6 5 6 3 1 4 2 5 6

Exact match 3 1 4 2 5 6

Exact match (another example) BWT(agcagcagact) = tgcc$ggaaaac Search for pattern: gca gca gca $agcagcagact act$agcagcag agact$agcagc agcagact$agc agcagcagact$ cagact$agcag cagcagact$ag ct$agcagcaga gact$agcagca gcagact$agca gcagcagact$a t$agcagcagac Test with your own seq and pattern at: http: //www. allisons. org/ll/Alg. DS/Strings/BWT/

Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. rank BWT SA $agcagcagact t a c g t 1 9 act$agcagcag g 0 0 1 1 2 7 agact$agcagc c 0 1 1 1 3 4 agcagact$agc c 0 2 1 1 4 1 agcagcagact$ $ 0 2 1 1 5 6 cagact$agcag g 0 2 2 1 6 3 cagcagact$ag g 0 2 3 1 7 10 ct$agcagcaga a 1 2 3 1 8 8 gact$agcagca a 2 2 3 1 9 5 gcagact$agca a 3 2 3 1 10 2 gcagcagact$a a 4 2 3 1 11 11 t$agcagcagac c 4 3 3 1 a c g T 1 5 8 11 FM indices

Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. gca rank BWT SA $agcagcagact t a c g t 1 9 act$agcagcag g 0 0 1 1 2 7 agact$agcagc c 0 1 1 1 3 4 agcagact$agc c 0 2 1 1 4 1 agcagcagact$ $ 0 2 1 1 5 6 cagact$agcag g 0 2 2 1 6 3 cagcagact$ag g 0 2 3 1 7 10 ct$agcagcaga a 1 2 3 1 8 8 gact$agcagca a 2 2 3 1 9 5 gcagact$agca a 3 2 3 1 10 2 gcagcagact$a a 4 2 3 1 11 11 t$agcagcagac c 4 3 3 1 a c g t 1 5 8 11 FM indices Next block: From 1 + 0 = 1 to 1 + (4 -1) = 4

Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. gca rank BWT SA $agcagcagact t a c g t 1 9 act$agcagcag g 0 0 1 1 2 7 agact$agcagc c 0 1 1 1 3 4 agcagact$agc c 0 2 1 1 4 1 agcagcagact$ $ 0 2 1 1 5 6 cagact$agcag g 0 2 2 1 6 3 cagcagact$ag g 0 2 3 1 7 10 ct$agcagcaga a 1 2 3 1 8 8 gact$agcagca a 2 2 3 1 9 5 gcagact$agca a 3 2 3 1 10 2 gcagcagact$a a 4 2 3 1 11 11 t$agcagcagac c 4 3 3 1 a c g T 1 5 8 11 FM indices Next block: From 5 + 0 = 5 to 5 + (2 -1) = 6

Auxiliary data structures Key for efficient pattern matching: how to find the corresponding chars in the first column efficiently, in terms of both time and space. gca rank BWT SA $agcagcagact t a c g t 1 9 act$agcagcag g 0 0 1 1 2 7 agact$agcagc c 0 1 1 1 3 4 agcagact$agc c 0 2 1 1 4 1 agcagcagact$ $ 0 2 1 1 5 6 cagact$agcag g 0 2 2 1 6 3 cagcagact$ag g 0 2 3 1 7 10 ct$agcagcaga a 1 2 3 1 8 8 gact$agcagca a 2 2 3 1 9 5 gcagact$agca a 3 2 3 1 10 2 gcagcagact$a a 4 2 3 1 11 11 t$agcagcagac c 4 3 3 1 a c g T 1 5 8 11 FM indices Next block: From 8 + 1 = 9 to 8 + (3 -1) = 10

Inexact match

Main advantage of BWT against suffix array • BWT needs less memory than suffix array • For human genome m = 3 * 109 : – Suffix array: mlog 2(m) bits = 4 m bytes = 12 GB – BWT: m/4 bytes plus extras = 1 - 2 GB • m/4 bytes to store BWT (2 bits per char) • Suffix array and occurrence counts array take 5 m log 2 m bits = 20 n bytes • In practice, SA and OCC only partially stored, most elements are computed on demand (takes time!) • Tradeoff between time and space

Comparison Spaced seeds • Requires ~50 Gb of memory. • Runs 30 -fold slower. • Is much simpler to program. Burrows-Wheeler • Requires <2 Gb of memory. • Runs 30 -fold faster. • Is much more complicated to program. MAQ Bowtie

Short-read mapping software Software Technique Developer License Eland Hashing reads Illumnia ? SOAP Hashing refs BGI Academic Maq Hashing reads Sanger (Li, Heng) GNUPL Bowtie BWT Salzberg/UMD BWA BWT Sanger (Li, Heng) GNUPL SOAP 2 BWT & hashing BGI Academic GNUPL http: //www. oxfordjournals. org/our_journals/bioinformatics/nextgenerationsequencing. html

References • • (Bowtie) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Langmead et al, Genome Biology 2009, 10: R 25 SOAP: short oligonucleotide alignment, Ruiqiang Li et al. Bioinformatics (2008) 24: 713 -4 (BWA) Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform, Li Heng and Richard Durbin, (2009) 25: 1754– 1760 SOAP 2: an improved ultrafast tool for short read alignment, Ruiqiang Li, (2009) 25: 1966– 1967 (MAQ) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Li H, Ruan J, Durbin R. Genome Res. (2008) 18: 1851 -8. Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S 6 - S 12 (2009) http: //www. allisons. org/ll/Alg. DS/Strings/BWT/