Approximate String Matching
A Guided Tour to Approximate String Matching, Gonzalo Navarro
Justin Wiseman

Outline
  • Definition of approximate string matching (ASM)
  • Applications of ASM
  • Algorithms
  • Conclusion

Approximate string matching
  • Approximate string matching is the process of matching strings while allowing for errors.

The edit distance
  • Strings are compared based on how close they are
  • This closeness is measured by the edit distance
  • The edit distance is the minimum number of operations required to transform one string into the other

Levenshtein / edit distance
  • Named after Vladimir Levenshtein, who introduced the Levenshtein distance in 1965
  • Accounts for three basic operations:
      • Insertions, deletions, and replacements
  • In the simplified version, every operation has a cost of 1
  • Example: “mash” and “march” have an edit distance of 2 (insert “r”, then replace “s” with “c”)

Other distance algorithms
  • Hamming distance:
      • Allows only substitutions, with a cost of one each
  • Episode distance:
      • Allows only insertions, with a cost of one each
  • Longest Common Subsequence distance:
      • Allows only insertions and deletions, costing one each
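
The three restricted distances above can be sketched directly from their allowed operations. This is a minimal illustration, not taken from the slides: the function names are hypothetical, and returning infinity when no sequence of allowed operations exists is an assumption of this sketch.

```python
import math


def hamming_distance(x: str, y: str) -> float:
    """Substitutions only, so the strings must have equal length."""
    if len(x) != len(y):
        return math.inf  # assumption: no substitution-only transformation exists
    return sum(a != b for a, b in zip(x, y))


def episode_distance(x: str, y: str) -> float:
    """Insertions only: |y| - |x| if x is a subsequence of y, else infinity."""
    remaining = iter(y)
    # `ch in remaining` consumes the iterator, which enforces left-to-right order
    if all(ch in remaining for ch in x):
        return len(y) - len(x)
    return math.inf


def lcs_distance(x: str, y: str) -> int:
    """Insertions and deletions only: |x| + |y| - 2 * LCS(x, y)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return len(x) + len(y) - 2 * prev[-1]


print(hamming_distance("mash", "mask"))   # 1
print(episode_distance("mah", "march"))   # 2
print(lcs_distance("mash", "march"))      # 3
```

Note that “mash” and “march” are at LCS distance 3 but edit distance 2, because the LCS distance has to express the replacement of “s” as one deletion plus one insertion.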

Applications
  • Computational biology
  • Signal processing
  • Information retrieval

Computational biology
  • DNA is composed of adenine, cytosine, guanine, and thymine (A, C, G, T)
  • One can think of the set {A, C, G, T} as the alphabet for DNA sequences
  • ASM is used to find specific or similar DNA sequences
  • Knowing how different two sequences are can give insight into the evolutionary process

Signal processing
  • Used heavily in speech recognition software
  • Error correction for received signals
  • Multimedia and song recognition

Information Retrieval
  • Spell checkers
  • Search engines
  • Web searches (Google)
  • Personal files (agrep for Unix)
  • Searching texts with errors, such as digitized books
  • Handwriting recognition

Algorithms
  • Definitions
  • Dynamic Programming algorithms
  • Automatons
  • Bit-parallelism
  • Filters

Definitions
  • Let $\Sigma$ be a finite alphabet of size $|\Sigma| = \sigma$
  • Let $T \in \Sigma^*$ be a text of length $n = |T|$
  • Let $P \in \Sigma^*$ be a pattern of length $m = |P|$
  • Let $k \in \mathbb{R}$ be the maximum error allowed
  • Let $d : \Sigma^* \times \Sigma^* \to \mathbb{R}$ be a distance function
  • The problem: given $T$, $P$, $k$, and $d(\cdot)$, return the set of all text positions $j$ such that there exists $i$ with $d(P, T_{i..j}) \le k$
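
The problem statement can be read as an executable specification. The brute-force sketch below assumes that $d$ is the unit-cost edit distance and that text positions are 1-based. The names `edit_distance` and `naive_search` are illustrative only, and the approach is far too slow for real use; it serves as a reference against which faster algorithms can be checked.

```python
from functools import lru_cache


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance, written as a memoized recursion."""

    @lru_cache(maxsize=None)
    def ed(i: int, j: int) -> int:
        # ed(i, j) is the distance between x[:i] and y[:j]
        if i == 0:
            return j
        if j == 0:
            return i
        if x[i - 1] == y[j - 1]:
            return ed(i - 1, j - 1)
        return 1 + min(ed(i - 1, j), ed(i, j - 1), ed(i - 1, j - 1))

    return ed(len(x), len(y))


def naive_search(text: str, pattern: str, k: int) -> list[int]:
    """Return every 1-based end position j such that some T[i..j] is within k of P."""
    hits = []
    for j in range(1, len(text) + 1):
        # T[i..j] in 1-based notation is text[i - 1:j] in Python
        if any(edit_distance(pattern, text[i - 1:j]) <= k for i in range(1, j + 1)):
            hits.append(j)
    return hits


print(naive_search("surgery", "survey", 2))  # [5, 6, 7]
```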

Dynamic Programming
  • The oldest approach to the problem of approximate string matching
  • Not very efficient: running time is $O(|x||y|)$
  • However, space is only $O(\min(|x|, |y|))$
  • The most flexible approach when adapting to different distance functions
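
The space bound follows from the fact that each row of the dynamic programming table depends only on the row before it. A minimal sketch, assuming unit costs (so the distance is symmetric and the shorter string can be chosen to index the row); the function name is hypothetical:

```python
def edit_distance_low_space(x: str, y: str) -> int:
    """Edit distance in O(|x||y|) time and O(min(|x|, |y|)) extra space."""
    if len(y) > len(x):
        x, y = y, x  # keep the stored row as short as possible
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, b in enumerate(y, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # match or replace
        prev = curr
    return prev[-1]


print(edit_distance_low_space("mash", "march"))  # 2
```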

Computing the edit distance
  • To compute the edit distance $ed(x, y)$:
  • Create a matrix $C_{0..|x|,\,0..|y|}$ where $C_{i,j}$ represents the minimum number of operations needed to match $x_{1..i}$ to $y_{1..j}$
  • $C_{i,0} = i$
  • $C_{0,j} = j$
  • $C_{i,j} = C_{i-1,j-1}$ if $x_i = y_j$, else $C_{i,j} = 1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$
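
The recurrence translates almost line by line into code. The sketch below fills the whole matrix so that intermediate values can be inspected; the function name is an illustrative choice, and Python indices are shifted by one because strings are 0-based.

```python
def edit_distance_matrix(x: str, y: str) -> list[list[int]]:
    """Fill C[0..|x|][0..|y|] with the edit distance recurrence."""
    C = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        C[i][0] = i  # delete the first i characters of x
    for j in range(len(y) + 1):
        C[0][j] = j  # insert the first j characters of y
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                C[i][j] = C[i - 1][j - 1]
            else:
                C[i][j] = 1 + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    return C


matrix = edit_distance_matrix("mash", "march")
for row in matrix:
    print(row)
print("edit distance:", matrix[-1][-1])  # edit distance: 2
```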

Edit distance example
  • $C_{i,0} = i$
  • $C_{0,j} = j$
  • If $x_i = y_j$ then $C_{i,j} = C_{i-1,j-1}$, else $C_{i,j} = 1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$

Text searching
  • The previous algorithm can be converted to search a text for a given pattern with few changes
  • Let $x = P$ (the pattern) and $y = T$ (the text)
  • Set $C_{0,j} = 0$ so that any text position can be the start of a match
  • $C_{i,j} = C_{i-1,j-1}$ if $P_i = T_j$, else $C_{i,j} = 1 + \min(C_{i-1,j},\, C_{i,j-1},\, C_{i-1,j-1})$
  • A match ending at text position $j$ is reported whenever $C_{m,j} \le k$

Text search example
  • In plain English: if the pattern and text letters at the current cell are the same, the cell takes the value of its top-left neighbor
  • If the letters are not the same, the cell takes one plus the minimum of its left, top, and top-left neighbors
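
A minimal sketch of the search variant, processing the text one column at a time. The function name and the choice to report 1-based end positions are assumptions of this sketch; the only change from the plain edit distance computation is that the top cell of every column is 0.

```python
def approximate_search(text: str, pattern: str, k: int) -> list[int]:
    """Return 1-based text positions where a match with at most k errors ends."""
    m = len(pattern)
    prev = list(range(m + 1))  # column for the empty text: C[i][0] = i
    hits = []
    for j, t in enumerate(text, start=1):
        curr = [0]  # C[0][j] = 0: a match may start at any text position
        for i in range(1, m + 1):
            if pattern[i - 1] == t:
                curr.append(prev[i - 1])  # top-left neighbor
            else:
                curr.append(1 + min(curr[i - 1], prev[i], prev[i - 1]))
        if curr[m] <= k:
            hits.append(j)
        prev = curr
    return hits


print(approximate_search("surgery", "survey", 2))  # [5, 6, 7]
```

Only two columns are alive at any time, so the extra space is O(m) regardless of the text length.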

Improvements
  • The example algorithm shown was the first
  • Many DP-based algorithms have since improved on its search time
  • In 1992, Chang and Lampe introduced a new algorithm called “column partitioning”, with an average search time of $O(kn/\sqrt{\sigma})$, where $k$ is the number of errors, $n$ the text length, and $\sigma$ the alphabet size

Automatons for approx. search
  • The search is modeled with a nondeterministic finite automaton (NFA)
  • 1985: Esko Ukkonen proposes a deterministic form (DFA)
  • Fast: the deterministic form has $O(n)$ worst-case search time
  • Large: the space complexity of the DFA grows exponentially with the pattern length

NFA example with k = 2
  • Matching the pattern “survey” on the text “surgery”
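
One way to make the example concrete is to simulate the NFA by tracking its set of active states. The sketch below assumes the usual layout for this automaton: a state (i, e) means that i pattern characters have been consumed with e errors, with horizontal moves for matches, vertical moves for insertions, diagonal moves for substitutions, and epsilon moves for deletions. The function name and state encoding are illustrative.

```python
def nfa_search(text: str, pattern: str, k: int) -> list[int]:
    """Simulate the approximate-matching NFA; return 1-based match end positions."""
    m = len(pattern)

    def closure(states: set) -> set:
        # Epsilon moves: skip one pattern character at the cost of one error
        seen = set(states)
        stack = list(states)
        while stack:
            i, e = stack.pop()
            nxt = (i + 1, e + 1)
            if i < m and e < k and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
        return seen

    active = closure({(0, 0)})
    hits = []
    for j, c in enumerate(text, start=1):
        following = {(0, 0)}  # self-loop: a match may start at any position
        for i, e in active:
            if i < m and pattern[i] == c:
                following.add((i + 1, e))          # match
            if e < k:
                following.add((i, e + 1))          # extra text character
                if i < m:
                    following.add((i + 1, e + 1))  # substitution
        active = closure(following)
        if any(i == m for i, _ in active):
            hits.append(j)
    return hits


print(nfa_search("surgery", "survey", 2))  # [5, 6, 7]
```

After reading “surge”, for instance, the path match, match, match, substitution, match, deletion reaches the final column with two errors, which is why position 5 is reported.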

Improvements
  • In 1996, Kurtz [1996] proposes constructing the DFA lazily
  • Space requirements are reduced to $O(mn)$
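
The idea of lazy construction can be illustrated with a small sketch. It is not Kurtz's data structure: it assumes that a DFA state is a dynamic programming column with values capped at k + 1, and it builds a transition only the first time the scan needs it. All names are hypothetical.

```python
def lazy_dfa_search(text: str, pattern: str, k: int):
    """Scan the text with a DFA whose transitions are built on demand."""
    m = len(pattern)
    transitions = {}  # (state, character) -> state, filled lazily

    def step(state: tuple, c: str) -> tuple:
        col = [0]
        for i in range(1, m + 1):
            if pattern[i - 1] == c:
                value = state[i - 1]
            else:
                value = 1 + min(col[i - 1], state[i], state[i - 1])
            col.append(min(value, k + 1))  # values above k are interchangeable
        return tuple(col)

    state = tuple(min(i, k + 1) for i in range(m + 1))
    hits = []
    for j, c in enumerate(text, start=1):
        key = (state, c)
        if key not in transitions:
            transitions[key] = step(state, c)  # built once, then reused
        state = transitions[key]
        if state[m] <= k:
            hits.append(j)
    return hits, len(transitions)


print(lazy_dfa_search("surgery", "survey", 2)[0])  # [5, 6, 7]
```

At most one transition is created per text character, and each state holds m + 1 values, which is consistent with the O(mn) bound on the slide. On a repetitive text far fewer transitions are built than there are characters.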

Bit-parallelism
  • Takes advantage of the inherent parallelism of bit operations inside a computer word
  • Changes an existing algorithm to operate at the bit level
  • The number of operations can be reduced by a factor of up to $w$, where $w$ is the number of bits in a machine word

Shift-Or
  • Was the first bit-parallel algorithm
  • Parallelizes the operation of an NFA that tries to match the pattern exactly
  • The NFA has $m+1$ states

  • Builds a table $B$ which stores a bit mask for every character $c$
  • In the mask $B[c]$, the bit $b_i$ is set if and only if $P_i = c$
  • The search state is kept in a machine word $D = d_m \ldots d_1$
  • $d_i$ is 1 when $P_{1..i}$ matches the end of the text scanned so far
  • A match is registered when $d_m = 1$

  • To start, $D$ is set to $0^m$ (the complemented Shift-Or form starts from $1^m$ instead)
  • $D$ is updated upon reading a new text character $T_j$ using the following formula
  • $D' \leftarrow ((D \ll 1) \mid 0^{m-1}1) \;\&\; B[T_j]$
  • This representation ends up working similarly to a DFA, in that the final state is only reached if the previous state has been reached, and so on
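
The update formula can be sketched in a few lines. The code below follows the uncomplemented form given on the slides, where a set bit means an active state and $D$ starts with no bits set; the original Shift-Or algorithm uses complemented bits and an OR instead. Python integers are unbounded, so the sketch ignores the requirement that $m$ fit in a machine word. The function name is illustrative.

```python
def shift_and_search(text: str, pattern: str) -> list[int]:
    """Exact matching with the bit-parallel update; returns 1-based end positions."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # B[c] has bit i - 1 set if and only if P_i = c
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    D = 0                     # no prefix of the pattern is matched yet
    match_bit = 1 << (m - 1)  # bit d_m
    hits = []
    for j, c in enumerate(text, start=1):
        # Shift every active state forward, activate state 1, keep only the
        # states whose pattern character equals the text character
        D = ((D << 1) | 1) & B.get(c, 0)
        if D & match_bit:
            hits.append(j)
    return hits


print(shift_and_search("abracadabra", "abra"))  # [4, 11]
```

Each text character costs a constant number of word operations, regardless of how many NFA states are active.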

Filters
  • Originated in the 1990s
  • Filter algorithms attempt to discard large sections of the text, based on the fact that the pattern cannot occur there
  • A different kind of algorithm is needed to check the portions of text that are not filtered out

Conceptually
  • Filter algorithms are really exact-match pattern searchers
  • Exact pattern matching is much quicker
  • The original pattern is broken into parts, and the text is searched for those exact parts
  • Example from Navarro: if neither “sur” nor “vey” appears in a section of text, then “survey” cannot occur there with at most one error
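
A minimal sketch of this idea, under stated assumptions: the pattern is split into k + 1 contiguous pieces (with at most k errors, at least one piece must survive unchanged), each piece is located with Python's built-in `str.find`, and only a window around each hit is verified with the dynamic programming search. The function name, the window arithmetic, and the choice of verifier are illustrative.

```python
def partition_filter_search(text: str, pattern: str, k: int) -> list[int]:
    """Filter with exact piece search, then verify candidate windows with DP."""
    m = len(pattern)
    pieces = k + 1
    if m < pieces:
        raise ValueError("pattern too short to split into k + 1 pieces")
    bounds = [m * p // pieces for p in range(pieces + 1)]

    # Filtering phase: collect text windows that could contain an occurrence
    windows = set()
    for p in range(pieces):
        piece = pattern[bounds[p]:bounds[p + 1]]
        start = text.find(piece)
        while start != -1:
            lo = max(0, start - bounds[p] - k)
            hi = min(len(text), start + (m - bounds[p]) + k)
            windows.add((lo, hi))
            start = text.find(piece, start + 1)

    # Verification phase: run the DP text search inside each window only
    hits = set()
    for lo, hi in windows:
        prev = list(range(m + 1))
        for j in range(lo, hi):
            curr = [0]
            for i in range(1, m + 1):
                if pattern[i - 1] == text[j]:
                    curr.append(prev[i - 1])
                else:
                    curr.append(1 + min(curr[i - 1], prev[i], prev[i - 1]))
            if curr[m] <= k:
                hits.add(j + 1)  # 1-based end position
            prev = curr
    return sorted(hits)


print(partition_filter_search("xx survey xx", "survey", 1))  # [8, 9, 10]
print(partition_filter_search("xxxxxxxx", "survey", 1))      # []
```

In the second call neither “sur” nor “vey” occurs, so no window is created and the verifier never runs, which is where the speed of a filter comes from.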

Filters
  • Must be paired with a non-filter algorithm, such as one of the dynamic programming algorithms
  • Performance depends on the number of errors allowed
  • Are the fastest of the algorithms surveyed
  • Best theoretical average cost: $O(n(k + \log_\sigma m)/m)$

Hierarchical verification method
  • Created by Navarro and Baeza-Yates in 1998
  • The original pattern is recursively split, with each half searched allowing $k/2$ errors
  • In the example: when searching the text “xxxbbxxxxxx”, the leaf “bbb” returns a match with one error
  • Checking the parent subdivision shows that there is no match

Conclusion
  • Generally, a combination of a fast filter and a fast verifying algorithm is the fastest overall
  • Among non-filtering algorithms, an NFA bit-parallelized by diagonals is the fastest
  • Approximate string matching has greatly influenced the field of computer science and will play an important role in future technology

References
  • “A Guided Tour to Approximate String Matching”, Gonzalo Navarro
  • “Implementation of a Bit-parallel Approximate String Matching Algorithm”, Mikael Onsjo and Osamu Watanabe
  • “A Partial Deterministic Automaton for Approximate String Matching”, Gonzalo Navarro
  • http://en.wikipedia.org/wiki/Approximate_string_matching
