Scaled Pattern Matching
Amihood Amir, Ayelet Butman, Moshe Lewenstein
Bar-Ilan University and Johns Hopkins University

Motivation: Searching for Templates in Aerial Photographs
Input: an aerial photo and a template.
Task: Search for all locations where the template appears in the image.

Model
• Low level (pixel level): avoid costly processing.
• Asymptotically efficient solutions.
• Serial, exact algorithms.

Types of Approximations
Local errors: level of detail, occlusion, noise.
Results:
• O(n² log m): mismatches.
• O(n²k²): edit distance, k errors, rectangular patterns (AL-88).
• O(n²k√(m log m)√(k log k)): edit distance, k errors, half-rectangular patterns (AF-95).

Types of Approximation
Orientation. Results:
• O(n²m⁵) (FU-98)
• O(n²m³) (ACL-98)
Scaling.
Natural scales, results:
• O(n), 1-d (EV-88)
• O(n² log |Σ|), 2-d (ALV-92)
• O(n²), dictionary (AC-96)
Real scales, this result:
• O(n), 1-d, truncation

It seems daunting, but…

CPM 2003: Morelia, Mexico

The problem is inherently inexact
What if the occurrence is 1½ times bigger? What is the meaning of “½ a pixel”?
Solution until now: natural scales. Consider only discrete scales: 1, 2, 3, 4, 5, ...

Definition
Text: an n × n array. Pattern: an m × m array.
Find all occurrences of the pattern in the text in all discrete sizes.

Discrete exact Scaled Matching
P:
AAA
ACA
AAA
[Figure: the text T, an array over the symbols A and C in which P is sought.]

Discrete exact Scaled Matching
P:
ZUY
KVS
XET
P³ (every symbol of P becomes a 3 × 3 block, so each row below appears three times):
ZZZUUUYYY
KKKVVVSSS
XXXEEETTT
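Discrete scaling of a 2-d pattern can be sketched in a few lines. This is a minimal illustration, assuming the pattern is given as a list of equal-length strings; the function name `scale_2d` is hypothetical and not taken from the slides.

```python
def scale_2d(pattern, s):
    """Scale a 2-d pattern by a natural number s.

    Every symbol is replaced by an s x s block of that symbol.
    `pattern` is assumed to be a list of equal-length strings.
    """
    if s < 1:
        raise ValueError("scale must be a natural number >= 1")
    scaled = []
    for row in pattern:
        # Repeat each symbol s times horizontally ...
        wide_row = "".join(ch * s for ch in row)
        # ... and repeat the widened row s times vertically.
        scaled.extend([wide_row] * s)
    return scaled


P = ["ZUY", "KVS", "XET"]
for row in scale_2d(P, 3):
    print(row)
```

For the 3 × 3 pattern above, `scale_2d(P, 3)` yields a 9 × 9 array whose first three rows are all `ZZZUUUYYY`.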

Idea: Fix a scale s
Partition the n × n text into s × s squares (s-blocks); there are n/s blocks along each side.
Constant amount of work for each square (s-block).

Algorithm time
Time for scale s: O((n/s)²) = O(n²/s²), since there are (n/s)² s-blocks and each takes constant work.
Total time: the sum over all scales s of n²/s² = n² · (1/1² + 1/2² + 1/3² + ...). The series converges to a constant, making the total time O(n²).
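The convergence claim can be checked numerically. The sketch below assumes one unit of work per s-block, so that scale s costs (n/s)² units; the function name `total_block_work` is made up for this illustration.

```python
import math


def total_block_work(n):
    """Sum, over every scale s = 1..n, the number of s-blocks in an n x n text.

    Assumes one unit of work per s-block, i.e. (n // s)**2 units at scale s.
    """
    return sum((n // s) ** 2 for s in range(1, n + 1))


n = 1000
work = total_block_work(n)
# The series 1/1^2 + 1/2^2 + ... converges to pi^2 / 6, so the total
# stays below (pi^2 / 6) * n^2 no matter how many scales are tried.
print(work / n**2, "<=", math.pi**2 / 6)
```

The ratio printed on the left stays below π²/6 ≈ 1.645 for every n, which is the constant hidden in the O(n²) bound.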

Problem: Real scales
Was open even for strings…
How do we define it? Example: aabcccbb
Scaled to 2: aaaabbccccccbbbb
Scaled to 1½: aaabccccbbb (the fractional ½b and ½c are truncated)
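Real scaling with truncation can be sketched as follows: each run of r equal symbols becomes a run of ⌊α·r⌋ symbols. The function name `scale_string` is hypothetical, and passing the scale as a `Fraction` is a choice made here to avoid floating-point rounding.

```python
from fractions import Fraction
from itertools import groupby
from math import floor


def scale_string(s, alpha):
    """Scale string s by a real factor alpha, truncating fractional symbols.

    A run of r copies of a symbol becomes floor(alpha * r) copies.
    """
    pieces = []
    for symbol, run in groupby(s):
        r = len(list(run))
        pieces.append(symbol * floor(alpha * r))
    return "".join(pieces)


print(scale_string("aabcccbb", 2))               # aaaabbccccccbbbb
print(scale_string("aabcccbb", Fraction(3, 2)))  # aaabccccbbb
```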

Formally
Denote: a^r = aaa...a (r times).
Problem Definition 1:
Input: Pattern P = a_1^{r_1} a_2^{r_2} ... (written as runs of equal symbols) and text T.
Output: All text locations where a_1^{⌊α·r_1⌋} a_2^{⌊α·r_2⌋} ... appears for some α ≥ 1.

Remark
α ≥ 1 means we only scale “up”.
Reasons:
• Avoid the conceptual problem of loss of resolution. From “far enough” away everything looks the same.
• By our definition, for a scale α < 1/m every run is truncated to length 0, so there is a match at every text location.

Simplify definition
Definition 2: Look for the scaled pattern in the text only where every run of the scaled pattern, including the first and the last, is a complete run of the text.
Example: P = aabcccbbbb
Match by definition 2: daaabccccbbbbbbe
Match by definition 1 but not by definition 2: daaaabccccbbbbbbbe (aaabccccbbbbbb occurs, but inside the longer runs aaaa and bbbbbbb)

Why are definitions equivalent?
Split text and pattern into a symbol part (Ts, Ps) and a length part (TL, PL).
Example:
P = aabcccbbbb, Ps = abcb, PL = 2 1 3 4
T = daaabccccbbbbbbe, Ts = dabcbe, TL = 1 3 1 4 6 1
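The split into a symbol part and a length part is a run-length encoding. A minimal sketch, with the hypothetical helper name `split_runs`:

```python
from itertools import groupby


def split_runs(s):
    """Split s into its symbol part and its length part.

    Returns (symbols, lengths): one symbol and one run length per
    maximal run of equal characters, in left-to-right order.
    """
    symbols = []
    lengths = []
    for symbol, run in groupby(s):
        symbols.append(symbol)
        lengths.append(len(list(run)))
    return "".join(symbols), lengths


print(split_runs("aabcccbbbb"))        # ('abcb', [2, 1, 3, 4])
print(split_runs("daaabccccbbbbbbe"))  # ('dabcbe', [1, 3, 1, 4, 6, 1])
```

The split takes a single pass over the input, which matches the O(n+m) cost for splitting text and pattern.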

Time for split: O(n+m).
Finding Ps in Ts: O(n+m) (e.g. KMP).
HARD PART: Finding PL in TL.

Definitions are Equivalent
Claim: Solving definition 2 in time O(f(n)) ⇒ solving definition 1 in time O(f(n)).
Why?
- Find the pattern without its first and last runs in time O(f(n)).
- For each match, verify the 1st and last symbol in constant time in Ts and TL.
Total time: O(f(n)+n) = O(f(n)).

Naïve algorithm for matching PL in TL
For each text location, position the pattern starting at that location and calculate the interval [t/p, (t+1)/p) for each resulting <text, pattern> pair (t, p).
This is the interval of possible scales, since:
• (t/p)·p = t, and for every α < t/p, ⌊αp⌋ < t;
• ((t+1)/p)·p = t+1, and for every α ≥ (t+1)/p, ⌊αp⌋ > t.

Check intersection
If the intersection of all intervals is not empty, then there is a match. Time: O(nm).
Example:
PL: 2 1 2 3 2
TL: 2 4 2 4 7 4 5 3
Location 1: the first two pairs already give [1, 3/2) and [4, 5). The intersection is empty, thus there is no scaled match at location 1. But…
Location 2: the intervals are [2, 5/2), [2, 3), [2, 5/2), [7/3, 8/3), [2, 5/2). The intersection is [7/3, 5/2), thus there is a scaled match at location 2.
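The interval check for a single alignment can be sketched as below. The name `common_scale_interval` is hypothetical. The window 4 2 4 7 4 is the one that yields the intervals listed for location 2, and clamping the lower end to 1 reflects the restriction α ≥ 1.

```python
from fractions import Fraction


def common_scale_interval(text_lengths, pattern_lengths):
    """Intersect the scale intervals [t/p, (t+1)/p) over aligned pairs.

    Returns the half-open interval (low, high) of scales alpha >= 1 for
    which floor(alpha * p) == t holds for every pair, or None if empty.
    """
    low = Fraction(1)  # only scaling "up" is allowed
    high = None
    for t, p in zip(text_lengths, pattern_lengths):
        low = max(low, Fraction(t, p))
        upper = Fraction(t + 1, p)
        high = upper if high is None else min(high, upper)
    if high is None or low >= high:
        return None
    return low, high


pattern = [2, 1, 2, 3, 2]
print(common_scale_interval([4, 2, 4, 7, 4], pattern))  # [7/3, 5/2)
print(common_scale_interval([2, 4], pattern[:2]))       # None
```

Running this check at each of the O(n) text locations over all m pattern entries gives the O(nm) bound of the naïve algorithm.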

Improvement – Parameterized Matching
Introduced: Baker 1994.
Motivation: “copying” code.

Parameterized Matching
Input: two strings s and t, |s| = |t|, over alphabets Σs and Σt.
s parameterize-matches t if there exists a bijection π: Σs → Σt such that π(s) = t.
Example: π(a) = x, π(b) = y maps a b a b b to x y x y y.
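Checking whether two equal-length strings parameterize-match amounts to building the bijection symbol by symbol. The sketch below is a direct check for one pair of strings, not the linear-time text-searching algorithm; the name `p_match` is hypothetical.

```python
def p_match(s, t):
    """Return True if some bijection between the alphabets maps s onto t.

    s and t may be strings or lists of hashable symbols (e.g. integers).
    """
    if len(s) != len(t):
        return False
    forward = {}   # symbol of s -> symbol of t
    backward = {}  # symbol of t -> symbol of s
    for a, b in zip(s, t):
        # setdefault records the pairing the first time a symbol is seen
        # and returns the earlier pairing on every later occurrence.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


print(p_match("ababb", "xyxyy"))  # True
print(p_match("ab", "xx"))        # False: a and b cannot both map to x
```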

Parameterized Matching
Claim (AFM-94): For Σ that can be sorted in linear time (e.g. Σ = {1, ..., n}), parameterized matching can be done in time O(n).

The reduction
Lemma: There exists an α for which PL matches TL at location i scaled to α only if PL p-matches TL at i.
Proof: Assume PL does not p-match TL at location i. The possible situations are:

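Possibility 1
The same number b appears twice in PL, aligned with two different numbers a and c ≠ a in TL. W.l.o.g. c ≥ a+1.
For c = a+1 (smallest possible), the two intervals are [a/b, (a+1)/b) and [(a+1)/b, (a+2)/b), and their intersection is empty.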

Possibility 2
The same number a appears twice in TL, aligned with two different numbers b and c ≠ b in PL. W.l.o.g. c ≥ b+1.
Intersection not empty only if (a+1)/(b+1) > a/b, i.e. ab+b > ab+a, i.e. b > a.
But this can never happen if α ≥ 1.

Algorithm for Real Scaled String Matching
Let {P_{i_1}, P_{i_2}, ..., P_{i_j}} be the different numbers in PL.
1. P-match PL in TL.
2. For each match, check the intersection of the intervals between P_{i_1}, ..., P_{i_j} and the corresponding symbols in TL.
End Algorithm

Example:
PL = 2 3 2, so P_{i_1} = 2 and P_{i_2} = 3.
TL = 5 6 5 6 10 7
[Figure: the p-matches of PL in TL and the scaled match are marked on TL.]
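The two steps of the algorithm can be sketched as follows. This is an illustration under stated assumptions, not the linear-time method: the p-matching step is done naively per window instead of with the O(n) algorithm of AFM-94, every pattern run (including the first and last) must align with a complete text run, and the names `p_matches` and `scaled_matches` as well as the sample length strings are made up for this sketch.

```python
from fractions import Fraction


def p_matches(window, pattern):
    """Naive parameterized-match test for one alignment."""
    forward, backward = {}, {}
    for t, p in zip(window, pattern):
        if forward.setdefault(p, t) != t or backward.setdefault(t, p) != p:
            return False
    return True


def scaled_matches(text_lengths, pattern_lengths):
    """Return the 0-based locations in TL where PL has a scaled match."""
    m = len(pattern_lengths)
    # Position of the first occurrence of each distinct number in PL.
    first = {}
    for idx, p in enumerate(pattern_lengths):
        first.setdefault(p, idx)
    hits = []
    for i in range(len(text_lengths) - m + 1):
        window = text_lengths[i:i + m]
        # Step 1: keep only locations where PL p-matches TL.
        if not p_matches(window, pattern_lengths):
            continue
        # Step 2: intersect the intervals [t/p, (t+1)/p) over the distinct
        # numbers only; equal numbers in PL face equal numbers in TL here.
        low = max([Fraction(1)] + [Fraction(window[k], p) for p, k in first.items()])
        high = min(Fraction(window[k] + 1, p) for p, k in first.items())
        if low < high:
            hits.append(i)
    return hits


print(scaled_matches([2, 4, 2, 4, 7, 4, 5, 3], [2, 1, 2, 3, 2]))  # [1]
```

Because a p-match guarantees that repeated pattern numbers face repeated text numbers, step 2 only needs one interval per distinct number rather than one per pattern position.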

Important Fact: the numbers in PL sum to m, so there are at most O(√m) different P_{i_k}'s.
Time:
• O(n) for parameterized matching (Σ = {1, 2, …, n}).
• O(√m) verification for each location.
Total: O(n√m).
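One way to see the O(√m) bound, assuming (as the split into symbol and length parts implies) that the numbers in PL are positive integers whose sum is m, and writing j for the number of different values among them:

```latex
\[
m \;\ge\; \sum_{k=1}^{j} P_{i_k} \;\ge\; 1 + 2 + \dots + j \;=\; \frac{j(j+1)}{2}
\quad\Longrightarrow\quad
j^2 < j(j+1) \le 2m
\quad\Longrightarrow\quad
j < \sqrt{2m} = O(\sqrt{m}).
\]
```

The middle inequality holds because j distinct positive integers are, in sorted order, at least 1, 2, ..., j.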

Tighter analysis
Upper bound the number of possible p-matches.
Lemma: Let |P| = m, |T| = n, and let {P_{i_1}, P_{i_2}, ..., P_{i_j}} be the different numbers in PL. Then there are at most 2n/j p-matches of PL in TL.
Meaning: Since verification time is O(j) per p-match, the lemma implies that the total verification time is O((2n/j) · j) = O(n).

Proof of Lemma
Consider the 1st appearance of each of P_{i_1}, ..., P_{i_j} in PL. In a p-match they are aligned with elements a_1, a_2, ..., a_j of TL.

Lemma’s proof (cont.)
Let x be the total number of p-matches in the text.
In each p-match, the j text elements that match the 1st occurrences of the P_{i_k}'s are distinct positive integers, so their sum is at least 1 + 2 + ... + j ≥ j²/2.
The sum of all text elements that match 1st occurrences of P_{i_k}'s in the pattern is therefore ≥ (xj²)/2.
But: There are overlaps! How many?

Lemma’s proof (cont.)
For each text location, at most j matches will count it.
Therefore the total count without overlaps is ≥ ((xj²)/2)/j = x·j/2.
Clearly x·j/2 ≤ n, since the numbers in TL sum to n; thus x ≤ (2n)/j.

Open Problem: Give a 1-d algorithm that is linear in the run-length compressed text and pattern.