How Do We Locate Disease-Causing Mutations?
Combinatorial Pattern Matching, Part 2
Phillip Compeau and Pavel Pevzner
Bioinformatics Algorithms: An Active Learning Approach. © 2018 by Compeau and Pevzner. All rights reserved.
Outline
• The Burrows-Wheeler Transform
• Inverting the Burrows-Wheeler Transform
• Pattern Matching with BWT
• Where are the Matched Patterns?
• Burrows and Wheeler Set Up Checkpoints
• Epilogue: Mismatch-Tolerant Read Mapping
Idea to Reduce Memory: Compress the Genome
Run-length encoding: compresses every run of n identical symbols X into the substring “nX”.
Text: GGGGGGGGGGCCCCCCCCCCCAAAAAAATTTTTTTTTTTTTTTCCCCCG
Run-length encoding: 10G11C7A15T5C1G
Problem: Genomes don’t have lots of runs…
…but they do have a lot of repeats!
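Run-length encoding as described on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the slides: the function name `run_length_encode` is hypothetical, and the input string is built from the run lengths in the slide's encoded output (10 G, 11 C, 7 A, 15 T, 5 C, 1 G).

```python
from itertools import groupby


def run_length_encode(text: str) -> str:
    """Collapse each run of n identical symbols X into the substring 'nX'."""
    # groupby yields one (symbol, run) pair per maximal run of equal symbols.
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(text))


# Hypothetical input, assembled from the run lengths shown on the slide.
text = "G" * 10 + "C" * 11 + "A" * 7 + "T" * 15 + "C" * 5 + "G"
print(run_length_encode(text))  # 10G11C7A15T5C1G
```

Note that a run of length 1 still costs two characters ("1G"), so a text without long runs can grow under this encoding, which is the problem the slide points out for genomes.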
Converting Repeats to Runs?
Text → [convert repeats to runs?] → f(Text) → [run-length encoding] → RLE(f(Text))
One way we could convert repeats into runs would be to simply sort Text lexicographically.
Checkpoint: What is wrong with sorting the string?
Answer: There is no way to “undo” the compression to recover Text.
Formally, define S_{A,n} as the set of all strings of length n constructed over an alphabet A. We want a function f: S_{A,n} → S_{A,n} that is invertible: if f(x) = f(y), then x = y.
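To see concretely why sorting fails the invertibility requirement, the sketch below exhibits two different strings with the same sorted form. The helper name `sort_text` and the two example strings are illustrative choices, not taken from the slides.

```python
def sort_text(text: str) -> str:
    """Candidate transform f: sort the symbols of text lexicographically."""
    return "".join(sorted(text))


# Two distinct strings over the same alphabet with the same symbol counts.
x, y = "banana", "nabana"
print(sort_text(x), sort_text(y))  # aaabnn aaabnn
print(x != y and sort_text(x) == sort_text(y))  # True: f(x) = f(y) but x != y
```

Sorting does produce long runs, but because f(x) = f(y) for x ≠ y, no decompression procedure can tell which string was the original.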
The Burrows-Wheeler Transform
Form all cyclic rotations of Text = “panamabananas$”:
panamabananas$
$panamabananas
s$panamabanana
as$panamabanan
nas$panamabana
anas$panamaban
nanas$panamaba
ananas$panamab
bananas$panama
abananas$panam
mabananas$pana
amabananas$pan
namabananas$pa
anamabananas$p
The Burrows-Wheeler Transform
Sort the cyclic rotations of Text = “panamabananas$” lexicographically ($ comes first):
$panamabananas
abananas$panam
amabananas$pan
anamabananas$p
ananas$panamab
anas$panamaban
as$panamabanan
bananas$panama
mabananas$pana
namabananas$pa
nanas$panamaba
nas$panamabana
panamabananas$
s$panamabanana
We call this matrix of symbols M(Text).
BWT(Text) = last column of M(Text) = “smnpbnnaaaaa$a”.
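The construction on these slides (form all cyclic rotations, sort them, read off the last column) translates directly into Python. The sketch below is a naive version that materializes the whole matrix M(Text), so it uses memory quadratic in |Text|; the function name `bwt` is an illustrative choice. It relies on "$" sorting before lowercase letters, which holds for Python's default string ordering.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler Transform via an explicit sorted rotation matrix."""
    # Row i of the unsorted matrix is the rotation of text starting at position i.
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    # Sorting gives M(Text); '$' precedes letters in the default ordering.
    matrix = sorted(rotations)
    # BWT(Text) is the last column of M(Text).
    return "".join(row[-1] for row in matrix)


print(bwt("panamabananas$"))  # smnpbnnaaaaa$a
```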
Let’s Examine BWT(Text) for Text = Watson & Crick, 1953
BWT Converts Repeats to Runs
Text → [Burrows-Wheeler Transform: convert repeats to runs] → BWT(Text) → [run-length encoding] → RLE(BWT(Text))
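One rough way to check that the transform turns repeats into runs is to count maximal runs before and after applying it. The sketch below does this for “panamabananas$”; the helper names `bwt` and `count_runs` are illustrative, and the transform is the naive rotation-sorting version.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    matrix = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in matrix)


def count_runs(text: str) -> int:
    """Number of maximal runs of identical symbols in text."""
    return sum(1 for _ in groupby(text))


text = "panamabananas$"
print(count_runs(text), count_runs(bwt(text)))  # 14 9
```

The original text has no two equal adjacent symbols (14 runs for 14 symbols), while its transform has only 9 runs, including the run “aaaaa”. On a text this short the saving is too small for run-length encoding to pay off; the benefit shows up on long, repeat-rich texts.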
How Can We Decompress?
Text → [convert repeats to runs] → BWT(Text) → [run-length encoding] → RLE(BWT(Text))
Undoing the run-length encoding, RLE(BWT(Text)) → BWT(Text): EASY
Undoing the transform, BWT(Text) → Text: IS IT POSSIBLE?
Reconstructing Text = banana
We are given only BWT(Text) = “annb$aa”, the last column of the matrix M(Text):
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba
Sorting all elements of “annb$aa” gives the first column of the matrix: $ a a a b n n.

2-mers (each last-column symbol followed by the first-column symbol of its row): a$ na na ba $b an an
We now know the 2-mer composition of the (circular) string banana$.
Sort: $b a$ an an ba na na
Sorting gives us the first 2 columns of the matrix.

3-mers: a$b na$ nan ban $ba ana ana
We now know the 3-mer composition of the (circular) string banana$.
Sort: $ba a$b ana ana ban na$ nan
Sorting gives us the first 3 columns of the matrix.

4-mers: a$ba na$b nana bana $ban ana$ anan
We now know the 4-mer composition of the (circular) string banana$.
Sort: $ban a$ba ana$ anan bana na$b nana
Sorting gives us the first 4 columns of the matrix.

5-mers: a$ban na$ba nana$ banan $bana ana$b anana
We now know the 5-mer composition of the (circular) string banana$.
Sort: $bana a$ban ana$b anana banan na$ba nana$
Sorting gives us the first 5 columns of the matrix.

6-mers: a$bana na$ban nana$b banana $banan ana$ba anana$
We now know the 6-mer composition of the (circular) string banana$.
Sort: $banan a$bana ana$ba anana$ banana na$ban nana$b
Sorting gives us the first 6 columns of the matrix.

We now know the entire matrix and can reconstruct Text!
But can we reconstruct Text from BWT(Text) without needing |Text|^2 space?
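The column-by-column reconstruction walked through above (prepend the last column to the known k-mers, sort, repeat) can be sketched as follows. This is a minimal illustration of that naive procedure, with a hypothetical function name `inverse_bwt_naive`; it stores the full matrix, so it needs the |Text|^2 space that the slide asks about.

```python
def inverse_bwt_naive(last_column: str) -> str:
    """Rebuild Text from BWT(Text) by reconstructing M(Text) column by column."""
    n = len(last_column)
    rows = [""] * n  # rows[i] holds the known prefix of row i of M(Text)
    for _ in range(n):
        # Prepending the last column to the first k columns yields all
        # (k+1)-mers of the circular text; sorting them gives the first
        # k+1 columns of the matrix.
        rows = sorted(last_column[i] + rows[i] for i in range(n))
    # Every row is now a full rotation; Text is the one that ends in '$'.
    return next(row for row in rows if row.endswith("$"))


print(inverse_bwt_naive("annb$aa"))  # banana$
```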
A Strange Observation
M(Text) for Text = “panamabananas$”:
$panamabananas
abananas$panam
amabananas$pan
anamabananas$p
ananas$panamab
anas$panamaban
as$panamabanan
bananas$panama
mabananas$pana
namabananas$pa
nanas$panamaba
nas$panamabana
panamabananas$
s$panamabanana
The six occurrences of a appear in the same relative order in the first column as in the last column.
Is It True in General?
The six rows of M(Text) that begin with a (these strings are sorted):
1 abananas$panam
2 amabananas$pan
3 anamabananas$p
4 ananas$panamab
5 anas$panamaban
6 as$panamabanan
Chop off a (still sorted):
1 bananas$panam
2 mabananas$pan
3 namabananas$p
4 nanas$panamab
5 nas$panamaban
6 s$panamabanan
Add a to end (still sorted):
1 bananas$panama
2 mabananas$pana
3 namabananas$pa
4 nanas$panamaba
5 nas$panamabana
6 s$panamabanana
Ordering doesn’t change!
Is It True in General?
M(Text) with each symbol of the first and last columns indexed by its order of occurrence in that column:
FirstColumn  middle        LastColumn
$1           panamabanana  s1
a1           bananas$pana  m1
a2           mabananas$pa  n1
a3           namabananas$  p1
a4           nanas$panama  b1
a5           nas$panamaba  n2
a6           s$panamabana  n3
b1           ananas$panam  a1
m1           abananas$pan  a2
n1           amabananas$p  a3
n2           anas$panamab  a4
n3           as$panamaban  a5
p1           anamabananas  $1
s1           $panamabanan  a6
First-Last Property: The k-th occurrence of a symbol in FirstColumn and the k-th occurrence of that symbol in LastColumn correspond to the same position of the symbol in Text.
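The First-Last Property can be checked empirically by recording, for every symbol, the Text positions of its occurrences in FirstColumn and in LastColumn, in top-to-bottom order. The sketch below is an illustrative check, not a proof; the function name `first_last_positions` is hypothetical and positions are 0-based.

```python
from collections import defaultdict


def first_last_positions(text: str):
    """Map each symbol to the Text positions of its occurrences, listed in
    row order, once for FirstColumn and once for LastColumn of M(Text)."""
    n = len(text)
    # Row order of M(Text): rotation start positions sorted by rotation.
    starts = sorted(range(n), key=lambda i: text[i:] + text[:i])
    first, last = defaultdict(list), defaultdict(list)
    for start in starts:
        first[text[start]].append(start)  # first symbol of this row
        # The last symbol of the row is the one that cyclically precedes it.
        last[text[start - 1]].append((start - 1) % n)
    return first, last


first, last = first_last_positions("panamabananas$")
print(first["a"])     # [5, 3, 1, 7, 9, 11]
print(last["a"])      # [5, 3, 1, 7, 9, 11]
print(first == last)  # True
```

The k-th a in FirstColumn and the k-th a in LastColumn come from the same position in Text, and the same holds for every other symbol.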
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression $1 panamabananas 1 a 1 bananas$panam 1 a 2 mabananas$pan 1 a 3 namabananas$p 1 a 4 nanas$panamab 1 a 5 nas$panamaban 2 a 6 s$panamabanan 3 b 1 ananas$panama 1 m 1 abananas$pana 2 n 1 amabananas$pa 3 n 2 anas$panamaba 4 n 3 as$panamabana 5 p 1 anamabananas$1 s 1$panamabanana 6 $ p a n s a a n m a a n a b Bioinformatics Algorithms: An Active Learning Approach. © 2018 Compeau and Pevzner.
More Efficient BWT Decompression
Memory: 2|Text| = O(|Text|), since only FirstColumn and LastColumn of M(Text) are stored.
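The decompression idea above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `inverse_bwt` is hypothetical, the input is assumed to contain exactly one "$" that sorts before every other symbol, and the walk reads Text backward from the row that starts with "$" (the slides may walk in the other direction).

```python
from collections import defaultdict

def inverse_bwt(bwt):
    """Rebuild Text from BWT(Text) using only FirstColumn and LastColumn."""
    first = sorted(bwt)  # FirstColumn is LastColumn in sorted order

    # Rows of FirstColumn where each symbol appears, in increasing order.
    first_rows = defaultdict(list)
    for row, symbol in enumerate(first):
        first_rows[symbol].append(row)

    # First-Last Property: the k-th occurrence of a symbol in LastColumn
    # is the same character of Text as its k-th occurrence in FirstColumn.
    seen = defaultdict(int)
    last_to_first = []
    for symbol in bwt:
        last_to_first.append(first_rows[symbol][seen[symbol]])
        seen[symbol] += 1

    # Start at row 0 (the row beginning with "$") and collect Text backward.
    row, collected = 0, []
    for _ in range(len(bwt)):
        collected.append(bwt[row])
        row = last_to_first[row]

    rotated = "".join(reversed(collected))  # begins with "$"
    return rotated[1:] + rotated[0]         # move "$" back to the end
```

Only two arrays of length |Text| are kept (the sorted column and the row mapping), which is the 2|Text| memory figure quoted on the slide.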
So We Can Decompress BWT... But What About Pattern Matching?
Text → (convert repeats to runs) → BWT(Text) → (run-length encoding) → RLE(BWT(Text))
Going back from RLE(BWT(Text)) to BWT(Text) is easy, and going back from BWT(Text) to Text is possible!
Recalling Our Goal
Pattern Matching with Suffix Array:
• Runtime: O(|Text| + |Patterns|)
• Memory: O(|Text|)
• Problem: a suffix tree takes ~20 × |Text| space, but a suffix array takes ~4 × |Text| space.
Can we use BWT(Text) as our data structure instead?
Pattern Matches “Clump” at Start of M(Text)
(Figure: M(Text) for panamabananas$; the rows that begin with the same pattern sit next to each other.)
Connecting M(Text) to Suffix Array
Row  Row of M(Text)   Sorted suffix    Suffix Array
0    $panamabananas   $                13
1    abananas$panam   abananas$        5
2    amabananas$pan   amabananas$      3
3    anamabananas$p   anamabananas$    1
4    ananas$panamab   ananas$          7
5    anas$panamaban   anas$            9
6    as$panamabanan   as$              11
7    bananas$panama   bananas$         6
8    mabananas$pana   mabananas$       4
9    namabananas$pa   namabananas$     2
10   nanas$panamaba   nanas$           8
11   nas$panamabana   nas$             10
12   panamabananas$   panamabananas$   0
13   s$panamabanana   s$               12
We could find pattern matches easily if we had all the suffixes, but we would need O(|Text|^2) space…
We are going to pattern match using just two columns of M(Text): FirstColumn and LastColumn.
We Match Patterns Backward
Searching for ana in Text = panamabananas, using only the two known columns (rows 0 to 13; everything in between is unknown):
FirstColumn: $1 a1 a2 a3 a4 a5 a6 b1 m1 n1 n2 n3 p1 s1
LastColumn:  s1 m1 n1 p1 b1 n2 n3 a1 a2 a3 a4 a5 $1 a6
• Start with the last symbol of the pattern: the rows that begin with “a” are rows 1 to 6 (a1 through a6).
• Among these rows, the ones whose LastColumn symbol is “n” are rows 2, 5, and 6 (n1, n2, n3).
• Now we can apply the First-Last Property and find where these three “n” are hiding in FirstColumn: rows 9, 10, and 11. We can infer that these three rows start with “na”.
• The LastColumn symbols of rows 9, 10, and 11 are a3, a4, and a5. All three match, and again we apply the First-Last Property: a3, a4, and a5 sit in rows 3, 4, and 5 of FirstColumn, so these rows start with “ana”.
We have found the occurrences of “ana”!
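The backward walk above can be sketched as follows. This is a minimal sketch under stated assumptions: the names `last_to_first_map` and `bw_matching` are hypothetical, and LastToFirst is built with a stable sort purely for brevity.

```python
def last_to_first_map(bwt):
    """last_to_first[i] is the FirstColumn row holding the symbol at LastColumn row i."""
    # A stable sort keeps equal symbols in their LastColumn order,
    # which is exactly the First-Last Property.
    order = sorted(range(len(bwt)), key=lambda i: bwt[i])
    mapping = [0] * len(bwt)
    for first_row, last_row in enumerate(order):
        mapping[last_row] = first_row
    return mapping

def bw_matching(bwt, pattern):
    """Count occurrences of pattern in Text, given only BWT(Text)."""
    ltf = last_to_first_map(bwt)
    top, bottom = 0, len(bwt) - 1
    for symbol in reversed(pattern):  # match the pattern backward
        # Rows in the current range whose LastColumn symbol matches.
        hits = [i for i in range(top, bottom + 1) if bwt[i] == symbol]
        if not hits:
            return 0
        top, bottom = ltf[hits[0]], ltf[hits[-1]]
    return bottom - top + 1
```

Scanning the range `top..bottom` for matching symbols is the slow part of this sketch; it is the step that Count arrays are meant to replace.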
Additional Information Needed
Problem: For a DNA alphabet, it will take (1/4)|Text| runtime just to cycle through all the rows that start with a given symbol!
“Count” arrays allow us to know what we’ve encountered at every point: Count_symbol(i) is the number of occurrences of symbol in the first i positions of LastColumn.
i    LastColumn  $  a  b  m  n  p  s
0    s1          0  0  0  0  0  0  0
1    m1          0  0  0  0  0  0  1
2    n1          0  0  0  1  0  0  1
3    p1          0  0  0  1  1  0  1
4    b1          0  0  0  1  1  1  1
5    n2          0  0  1  1  1  1  1
6    n3          0  0  1  1  2  1  1
7    a1          0  0  1  1  3  1  1
8    a2          0  1  1  1  3  1  1
9    a3          0  2  1  1  3  1  1
10   a4          0  3  1  1  3  1  1
11   a5          0  4  1  1  3  1  1
12   $1          0  5  1  1  3  1  1
13   a6          1  5  1  1  3  1  1
14               1  6  1  1  3  1  1
Here: for rows 1 to 6 (the rows that begin with “a”), we learn that their LastColumn entries contain n1 through n3 by checking just two values, Count_n(1) = 0 and Count_n(7) = 3.
Of course, the full Count arrays are a big memory waste…
We Also Need “Last to First” Info
LastToFirst (rows 0 to 13): 13 8 9 12 7 10 11 1 2 3 4 5 0 6
Another big memory waste…
Beyond the scope of the course: we can compute this very quickly using Count arrays.
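One way the LastToFirst values might be derived from counts is sketched below. The function name is hypothetical, and the formula LastToFirst(i) = FirstOccurrence(symbol) + Count_symbol(i) is the assumption being illustrated; a running counter stands in for the stored Count arrays.

```python
def last_to_first_from_counts(bwt):
    """Compute LastToFirst from first-occurrence rows and running counts."""
    # first_occurrence[c]: first row of FirstColumn that starts with c.
    first_occurrence = {}
    for row, symbol in enumerate(sorted(bwt)):
        first_occurrence.setdefault(symbol, row)

    # count[c] equals Count_c(i) just before row i is processed.
    count = dict.fromkeys(first_occurrence, 0)
    mapping = []
    for symbol in bwt:
        mapping.append(first_occurrence[symbol] + count[symbol])
        count[symbol] += 1
    return mapping
```

For BWT(Text) = smnpbnnaaaaa$a this reproduces the array shown on the slide.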
Where are the Matches?
Multiple Pattern Matching Problem: Find all occurrences of a collection of patterns in a text.
• Input: A string Text and a collection Patterns containing (shorter) strings.
• Output: All starting positions in Text where a string from Patterns appears as a substring.
Where are the Matches?
Example: We know that ana occurs 3 times, but where?
The Suffix Array Holds the Key
Suffix Array: 13 5 3 1 7 9 11 6 4 2 8 10 0 12
The three rows of M(Text) that begin with ana are rows 3, 4, and 5, and the suffix array values there are 1, 7, and 9, which are the starting positions of ana in Text.
“But we’ve already seen that we can use the suffix array for pattern matching! How is this useful?”
How Can BWT Possibly Be Useful?
(Figure: FirstColumn and LastColumn of M(Text) shown next to the full suffix array and the full Count arrays.)
How Can BWT Possibly Be Useful?
Critical insight: we can store just a small fraction of all this data (the suffix array and the Count arrays) and not change the big-O runtime of pattern matching using BWT.
First: A “Partial” Suffix Array
Suffix Array: 13 5 3 1 7 9 11 6 4 2 8 10 0 12
Partial suffix array: stores only the values that are divisible by K for some integer K. With K = 5, only the values 5 (row 1), 10 (row 11), and 0 (row 12) are kept.
We can make at most K - 1 additional backtrack steps to determine where our pattern match is.
Example: row a4 begins with ana, but its suffix array value is not stored. Its LastColumn symbol is b1, so one step back leads to row b1, which begins with bana; that value is not stored either. The LastColumn symbol of row b1 is a1, so a second step back leads to row a1, which begins with abana, and its value 5 is stored.
We have a match! We know “abana” occurs at position 5, and we took two steps back, so this “ana” occurs at position 7.
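The backtracking lookup can be sketched like this. It is a minimal sketch, not the authors' code: `build_index`, `last_to_first_map`, and `locate` are hypothetical names, and the full suffix array is built by naive sorting only to extract the partial one.

```python
def build_index(text, k):
    """Return BWT(text) and a partial suffix array keeping values divisible by k."""
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)  # text[-1] is "$" for i == 0
    partial = {row: pos for row, pos in enumerate(suffix_array) if pos % k == 0}
    return bwt, partial

def last_to_first_map(bwt):
    """Map each LastColumn row to the FirstColumn row holding the same symbol."""
    order = sorted(range(len(bwt)), key=lambda i: bwt[i])  # stable sort
    mapping = [0] * len(bwt)
    for first_row, last_row in enumerate(order):
        mapping[last_row] = first_row
    return mapping

def locate(row, ltf, partial):
    """Starting position in Text of the suffix that begins row `row`."""
    steps = 0
    while row not in partial:  # each step moves one position back in Text
        row = ltf[row]
        steps += 1
    return partial[row] + steps

bwt, partial = build_index("panamabananas$", 5)
ltf = last_to_first_map(bwt)
positions = sorted(locate(row, ltf, partial) for row in (3, 4, 5))
```

Because position 0 is divisible by every K, it is always stored, so the walk stops before it could wrap around the start of Text.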
Second: “Checkpoint” Count Arrays
“Count” arrays help us know what we’ve encountered up to a given row.
Checkpoint arrays: store only every C-th Count array.
Checkpoint: Say we are at position 13 and see “a”. How can we know how many “a”s have occurred so far?
Answer: Go to the nearest checkpoint array, which has counted 3 “a”s up to position 10, and then count three more in LastColumn (total of 6).
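The checkpoint lookup might look like the sketch below. The names `build_checkpoints` and `count` are hypothetical, and C = 5 is an assumption chosen so that a checkpoint falls at position 10 as in the example.

```python
def build_checkpoints(bwt, c):
    """Keep the Count array only at rows 0, c, 2c, ..."""
    running = dict.fromkeys(sorted(set(bwt)), 0)
    checkpoints = {}
    for i, symbol in enumerate(bwt):
        if i % c == 0:
            checkpoints[i] = dict(running)  # counts over bwt[:i]
        running[symbol] += 1
    if len(bwt) % c == 0:
        checkpoints[len(bwt)] = dict(running)
    return checkpoints

def count(symbol, i, bwt, checkpoints, c):
    """Occurrences of symbol in bwt[:i], using the nearest checkpoint at or below i."""
    base = (i // c) * c
    # At most c - 1 symbols of LastColumn are scanned past the checkpoint.
    return checkpoints[base][symbol] + bwt[base:i].count(symbol)

bwt = "smnpbnnaaaaa$a"
checkpoints = build_checkpoints(bwt, 5)
a_through_row_13 = count("a", 14, bwt, checkpoints, 5)
```

Memory drops by roughly a factor of C, while each lookup scans fewer than C extra symbols.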
Third: “First Occurrence”
Checkpoint: How can we store the information in the first column with as little information as possible?
Answer: We could use run-length encoding, but even shorter is to store the “first occurrence” position of each symbol.
Symbol:          $  a  b  m  n  p   s
FirstOccurrence: 0  1  7  8  9  12  13
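A short sketch of how the first-occurrence positions could be computed without materializing FirstColumn. The function name is hypothetical; it assumes symbols sort in the same order as the rows of M(Text), with "$" first.

```python
from collections import Counter

def first_occurrence(bwt):
    """First row of FirstColumn that starts with each symbol."""
    counts = Counter(bwt)  # FirstColumn has the same symbol counts as LastColumn
    result, row = {}, 0
    for symbol in sorted(counts):
        result[symbol] = row
        row += counts[symbol]
    return result
```

The result has one entry per alphabet symbol, so its size does not grow with |Text|.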
We Need Just Four Items...
Total memory (K = C = 100): ~1.5|Text|!
• First Occurrence (for $ a b m n p s): 0 1 7 8 9 12 13
• BWT(Text): smnpbnnaaaaa$a
• Suffix Array (only a partial copy is kept): 13 5 3 1 7 9 11 6 4 2 8 10 0 12
• Count Arrays (columns $ a b m n p s; only checkpoint rows are kept): from 0 0 0 0 0 0 0 before row 0 to 1 6 1 1 3 1 1 after row 13
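To show how these items fit together, here is a hedged sketch of backward matching driven by FirstOccurrence and Count instead of scanning rows. The name `better_bw_matching` is hypothetical, and `count` recounts a prefix of BWT(Text) as a stand-in for checkpointed Count arrays.

```python
def better_bw_matching(bwt, pattern):
    """Return (top, bottom), the rows of M(Text) starting with pattern, or None."""
    # first_occ[c]: first row of FirstColumn that starts with c.
    first_occ = {}
    for row, symbol in enumerate(sorted(bwt)):
        first_occ.setdefault(symbol, row)

    def count(symbol, i):
        # Stand-in for a checkpointed Count array: occurrences in bwt[:i].
        return bwt[:i].count(symbol)

    top, bottom = 0, len(bwt) - 1
    for symbol in reversed(pattern):
        if symbol not in first_occ:
            return None
        # Two Count lookups replace the scan over rows top..bottom.
        top, bottom = (first_occ[symbol] + count(symbol, top),
                       first_occ[symbol] + count(symbol, bottom + 1) - 1)
        if top > bottom:
            return None
    return top, bottom
```

The returned rows can then be turned into positions in Text with the partial suffix array.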
Returning to Our Original Problem
Multiple Pattern Matching Problem: Find all occurrences of a collection of patterns in a text.
Multiple Approximate Pattern Matching Problem: Find all approximate occurrences of a collection of patterns in a text.
• Input: A string Text, a collection Patterns containing (shorter) strings, and an integer d.
• Output: All starting positions in Text where a string from Patterns appears as a substring with at most d mismatches.
Method 1: Seeding
Say that Pattern appears in Text with just one mismatch.
Pattern: acttggct
Text: …ggcacactaggctcc…
If we divide the strings in half, then one half must match exactly! Here actt differs from acta, but ggct matches ggct.
Method 1: Seeding
Let’s take another example of strings that have 3 mismatches.
Pattern: acttaggctcgggataatcc
Text: …ggcactaagtcgggataagcc…
Now we can divide the strings into four equal pieces and find at least one that matches.
Method 1: Seeding
Checkpoint: If Pattern has length 23 and appears in Text with 3 mismatches, can we conclude that Pattern shares a 6-mer with Text? Can we conclude that it shares a 5-mer with Text?
Method 1: Seeding
Theorem: If Pattern occurs in Text with d mismatches, then we can divide Pattern into d + 1 “equal” pieces and find at least one exact match, because d mismatches can fall in at most d of the d + 1 pieces.
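The checkpoint question about a pattern of length 23 can be explored by brute force. The sketch below uses hypothetical function names and simply tries every placement of d mismatches, reporting the longest mismatch-free stretch that is always left over; it is meant as a sanity check on the theorem, not as part of the read-mapping algorithm.

```python
from itertools import combinations

def longest_exact_run(n, mismatches):
    """Longest mismatch-free stretch in a pattern of length n."""
    bad = set(mismatches)
    best = run = 0
    for i in range(n):
        run = 0 if i in bad else run + 1
        best = max(best, run)
    return best

def guaranteed_shared_kmer(n, d):
    """Largest k such that any placement of d mismatches leaves an exact k-mer."""
    # Fewer than d mismatches can only lengthen the runs, so checking
    # placements of exactly d mismatches covers the worst case.
    return min(longest_exact_run(n, placement)
               for placement in combinations(range(n), d))
```

Under these assumptions, `guaranteed_shared_kmer(23, 3)` evaluates to 5, which agrees with 23 // (3 + 1): a shared 5-mer is guaranteed, while a shared 6-mer is not.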
Method 1: Seeding
An algorithm for finding all pattern matches with up to d mismatches:
1. Divide Pattern into d + 1 “equal” segments (called seeds).
2. Find which seeds match Text exactly (seed detection).
3. Attempt to extend all seeds in both directions to verify whether Pattern occurs with at most d mismatches (seed extension).
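The three steps above can be sketched as follows. This is a minimal sketch, assuming the pattern has at least d + 1 symbols; the function names are hypothetical, and `str.find` stands in for exact seed detection with an index such as the BWT.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def approximate_matches(text, pattern, d):
    """Starting positions where pattern occurs in text with at most d mismatches."""
    n = len(pattern)
    # Step 1: cut pattern into d + 1 seeds of nearly equal length.
    bounds = [i * n // (d + 1) for i in range(d + 2)]
    found = set()
    for start, end in zip(bounds, bounds[1:]):
        seed = pattern[start:end]
        # Step 2: seed detection, every exact occurrence of the seed.
        pos = text.find(seed)
        while pos != -1:
            candidate = pos - start  # where the whole pattern would begin
            # Step 3: seed extension, verify the full alignment.
            if 0 <= candidate <= len(text) - n and \
                    hamming(text[candidate:candidate + n], pattern) <= d:
                found.add(candidate)
            pos = text.find(seed, pos + 1)
    return sorted(found)
```

On the one-mismatch example, the seed ggct is found exactly and its extension confirms the match at position 5 of ggcacactaggctcc.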
Method 2: BWT Saves the Day Again
Recall: searching for ana in panamabananas.
If we allow 1 mismatch, then after selecting the six rows that begin with “a” we need to keep all of their LastColumn symbols around, not only the “n”s, and record a mismatch count for each:
LastColumn symbol: m1 n1 p1 b1 n2 n3
# Mismatches:      1  0  1  1  0  0
Now we extend only strings that have at most 1 mismatch. Moving to FirstColumn gives rows b1, m1, n1, n2, n3, and p1, whose LastColumn symbols are compared with “a”:
LastColumn symbol: a1 a2 a3 a4 a5 $1
# Mismatches:      1  1  0  0  0  2
One string produces a second mismatch (the $), so we discard it.
In the end, we have five 3-mers with at most 1 mismatch (aba, ama, and three copies of ana).
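A mismatch-tolerant backward search might be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, the suffix array and BWT are built naively, `count` recounts a prefix instead of using checkpoints, and, unlike the walk-through above (which starts from rows matching the last symbol exactly), this version allows the mismatch at any position of the pattern.

```python
def approximate_bw_matching(text, pattern, d):
    """Positions where pattern occurs in text with at most d mismatches."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    first_occ = {}
    for row, symbol in enumerate(sorted(bwt)):
        first_occ.setdefault(symbol, row)

    def count(symbol, i):
        return bwt[:i].count(symbol)  # stand-in for checkpointed Count arrays

    results = []
    # Each entry: (top, bottom, pattern symbols still to match, mismatches so far).
    stack = [(0, len(bwt) - 1, len(pattern), 0)]
    while stack:
        top, bottom, remaining, mismatches = stack.pop()
        if remaining == 0:
            results.extend(suffix_array[top:bottom + 1])
            continue
        wanted = pattern[remaining - 1]
        for symbol in first_occ:
            if symbol == "$":
                continue  # a match may not run across the end of Text
            cost = mismatches + (symbol != wanted)
            if cost > d:
                continue  # discard strings with too many mismatches
            new_top = first_occ[symbol] + count(symbol, top)
            new_bottom = first_occ[symbol] + count(symbol, bottom + 1) - 1
            if new_top <= new_bottom:
                stack.append((new_top, new_bottom, remaining - 1, cost))
    return sorted(results)
```

For ana in panamabananas with d = 1 this yields five positions, matching the count on the final slide.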