Scaled Pattern Matching. Amihood Amir, Ayelet Butman, Moshe Lewenstein. Bar-Ilan University and Johns Hopkins University.

Motivation: Searching for Templates in Aerial Photographs. Input: Aerial photo, Template. Task: Search for all locations where the template appears in the image.

Model • Low level (pixel level) avoid costly processing • Asymptotically efficient solutions. • Serial, exact algorithms.

Types of Approximations. Local errors: Level of detail, Occlusion, Noise. Results: O(n² log m) mismatches; O(n²k²) edit distance, k errors, rectangular patterns (AL-88); O(n²k√(m log m)√(k log k)) edit distance, k errors, half-rectangular patterns (AF-95).

Types of Approximation. Orientation: results: O(n²m⁵) FU-98, O(n²m³) ACL-98. Scaling: Natural scales: results: O(n) 1-d EV-88, O(n² log |Σ|) 2-d ALV-92, O(n²) dictionary AC-96. Real scales: this result: O(n) 1-d, truncation.

It seems daunting, but…

CPM 2003: Morelia, Mexico

Problem inherently inexact. What if occurrence is 1½ times bigger? What is the meaning of “½ a pixel”? Solutions until now: Natural Scales. Consider only discrete scales: 1, 2, 3, 4, 5, …

Definition: Text: n × n. Pattern: m × m. Find all occurrences of the pattern in the text in all discrete sizes.

Discrete exact Scaled Matching. [Figure: a 2-d text T over the symbols A and C, and the 3 × 3 pattern P with rows AAA, ACA, AAA.]

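Discrete exact Scaled Matching. P has rows ZUY, KVS, XET. P³ (P scaled to 3) replaces every symbol by a 3 × 3 block, giving rows ZZZUUUYYY, KKKVVVSSS, XXXEEETTT, each appearing 3 times.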
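A minimal sketch of discrete scaling, assuming a 2-d pattern is stored as a list of equal-length strings. The function name `scale_2d` is hypothetical and only illustrates the ZUY / KVS / XET example.

```python
def scale_2d(pattern, s):
    """Scale a 2-d pattern (list of equal-length strings) by a natural number s.

    Every symbol becomes an s x s block: each row is widened by repeating
    each symbol s times, and the widened row is then emitted s times.
    """
    if s < 1:
        raise ValueError("scale must be a natural number >= 1")
    scaled = []
    for row in pattern:
        wide = "".join(ch * s for ch in row)
        scaled.extend([wide] * s)
    return scaled


P = ["ZUY", "KVS", "XET"]
P3 = scale_2d(P, 3)
for row in P3:
    print(row)
```

The result has 9 rows of 9 symbols each, so an m × m pattern scaled to s occupies an sm × sm area of the text.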

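Idea: Fix a scale s. Partition the n × n text into s × s squares (s-blocks), n/s per side. Constant amount of work for each square (s-block).

Algorithm time. Time for scale s: O((n/s)²). Total time: O(n² · Σ 1/s²) over all scales s; the sum Σ 1/s² converges to a constant, making the total time O(n²).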
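The convergence claim can be spelled out as follows. This is a sketch under the assumption that scale s costs a constant c per s-block and that there are (n/s)² such blocks; the constant c and the range of scales 1 … ⌊n/m⌋ are labels introduced here for illustration.

```latex
\[
\sum_{s=1}^{\lfloor n/m \rfloor} c\left(\frac{n}{s}\right)^{2}
  = c\,n^{2}\sum_{s=1}^{\lfloor n/m \rfloor}\frac{1}{s^{2}}
  \le c\,n^{2}\sum_{s=1}^{\infty}\frac{1}{s^{2}}
  = \frac{\pi^{2}}{6}\,c\,n^{2}
  = O(n^{2}).
\]
```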

Problem: Real scales. Was open even for strings… How do we define? Example: aabcccbb. Scaled to 2: aaaabbccccccbbbb. Scaled to 1½: aaab cccc bbb (truncate ½b and ½c).

Formally: Denote a^r = aaa…a (r times). Problem Definition 1: Input: Pattern P = a_1^{r_1} a_2^{r_2} … a_q^{r_q}; Text T. Output: All text locations where a_1^{⌊αr_1⌋} a_2^{⌊αr_2⌋} … a_q^{⌊αr_q⌋} appears for some α ≥ 1.
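A minimal sketch of scaling a string to a real scale with truncation, as in the aabcccbb example: each run of r equal symbols becomes a run of ⌊αr⌋ symbols. The function name `scale_string` is hypothetical, and representing α as an exact `Fraction` is an assumption made here to avoid floating-point rounding.

```python
from fractions import Fraction
from itertools import groupby
from math import floor


def scale_string(s, alpha):
    """Scale every run of s by alpha, truncating fractional symbols."""
    alpha = Fraction(alpha)
    pieces = []
    for symbol, run in groupby(s):
        r = sum(1 for _ in run)                 # run length
        pieces.append(symbol * floor(alpha * r))
    return "".join(pieces)


print(scale_string("aabcccbb", 2))               # aaaabbccccccbbbb
print(scale_string("aabcccbb", Fraction(3, 2)))  # aaabccccbbb
```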

Remark: α ≥ 1 means we only scale “up”. Reasons: Avoid conceptual problem of loss of resolution. From “far enough” away everything looks the same. By our definition, for α < 1/m every run is truncated to length 0, so there is a match at every text location.

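Simplify definition. Definition 2: For P = a_1^{r_1} … a_q^{r_q}, look for x a_1^{⌊αr_1⌋} … a_q^{⌊αr_q⌋} y in the text, where x ≠ a_1 and y ≠ a_q, so that the first and last runs of the occurrence are complete runs of the text. Example: P=aabcccbbbb. Match by definition 2: daaabccccbbbbbbe. Match by definition 1 but not by def 2: daaaabccccbbbbbbbe.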
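The difference between the two definitions can be checked on the two example texts. This is only a sketch for one fixed scale: it assumes the pattern has already been scaled (here P = aabcccbbbb scaled to 1½ gives aaabccccbbbbbb), and it assumes Definition 2 requires the symbols just before and just after the occurrence to differ from its first and last symbols, which is what the examples suggest. The function name `occurrences` and the treatment of the text boundaries are assumptions.

```python
def occurrences(text, scaled):
    """Return (def1, def2): 0-based start positions of a non-empty `scaled`.

    def1: plain substring occurrences.
    def2: occurrences whose first and last runs are complete text runs.
    """
    def1, def2 = [], []
    k = len(scaled)
    for i in range(len(text) - k + 1):
        if text[i:i + k] != scaled:
            continue
        def1.append(i)
        # Assumption: the start or end of the text counts as a run boundary.
        left_ok = i == 0 or text[i - 1] != scaled[0]
        right_ok = i + k == len(text) or text[i + k] != scaled[-1]
        if left_ok and right_ok:
            def2.append(i)
    return def1, def2


scaled = "aaabccccbbbbbb"
print(occurrences("daaabccccbbbbbbe", scaled))
print(occurrences("daaaabccccbbbbbbbe", scaled))
```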

Why are definitions equivalent? Split text and pattern into a symbol part Ts, Ps and a length part TL, PL. Example: P=aabcccbbbb, Ps=abcb, PL=2 1 3 4. T=daaabccccbbbbbbe, Ts=dabcbe, TL=1 3 1 4 6 1.
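A minimal sketch of the split into symbol part and length part, which is a run-length encoding of the string. The function name `split_runs` is hypothetical.

```python
from itertools import groupby


def split_runs(s):
    """Return (symbol part, length part) of the run-length encoding of s."""
    symbols, lengths = [], []
    for symbol, run in groupby(s):
        symbols.append(symbol)
        lengths.append(sum(1 for _ in run))
    return "".join(symbols), lengths


print(split_runs("aabcccbbbb"))
print(split_runs("daaabccccbbbbbbe"))
```

A single left-to-right pass suffices, so the split takes time linear in the length of the string.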

Time for split: O(n+m). Finding Ps in Ts: O(n+m) (e.g. KMP). HARD PART: Finding PL in TL.
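For the easy part, a standard Knuth-Morris-Pratt search is enough. The sketch below is a textbook KMP, not code from the talk; the function name `kmp_search` is hypothetical.

```python
def kmp_search(text, pattern):
    """Return all 0-based positions where pattern occurs in text, in O(n+m)."""
    if not pattern:
        return list(range(len(text) + 1))
    # failure[i] = length of the longest proper border of pattern[:i + 1]
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = failure[k - 1]      # keep going to find overlapping matches
    return hits


print(kmp_search("dabcbe", "abcb"))
```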

Definitions are Equivalent. Claim: Solving def 2 in time O(f(n)) ⇒ solving def 1 in time O(f(n)). Why? - Find the occurrences of the pattern without its first and last runs by def 2, in time O(f(n)). - For each match verify 1st and last symbol in constant time in Ts and TL. Total time: O(f(n)+n)=O(f(n)).

Naïve algorithm for matching PL in TL: For each text location, position the pattern starting at that location and calculate the interval [t/p, (t+1)/p) for each resulting <text, pattern> pair. This is the interval of possible scales α with ⌊αp⌋ = t, since (t/p)·p = t and for every α < t/p, ⌊αp⌋ < t; while ((t+1)/p)·p = t+1 and for every α ≥ (t+1)/p, ⌊αp⌋ ≥ t+1 > t.

Check intersection. If the intersection of all intervals is not empty then there is a match. Time: O(nm). Example: PL: 2 1 2 3 2. TL: 2 4 2 4 7 4 5 3. At location 1 the first two intervals are [1, 3/2) and [4, 5). The intersection is empty, thus there is no scaled match at location 1. But…

Check intersection (cont.) Same example, location 2: the intervals are [2, 5/2), [2, 3), [2, 5/2), [7/3, 8/3), [2, 5/2). The intersection is [7/3, 5/2), thus there is a scaled match at location 2.
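A minimal sketch of the naive O(nm) check, using exact fractions for the interval endpoints. The function names are hypothetical. The text lengths used below are an assumption chosen so that the intervals at locations 1 and 2 reproduce the ones listed on the slides, and the restriction α ≥ 1 is applied as an extra lower bound.

```python
from fractions import Fraction


def scale_interval(text_lengths, pattern_lengths, start):
    """Intersect [t/p, (t+1)/p) over all aligned pairs, with alpha >= 1.

    Returns (low, high), meaning low <= alpha < high, or None if empty.
    """
    low, high = Fraction(1), None       # alpha >= 1, no upper bound yet
    for offset, p in enumerate(pattern_lengths):
        t = text_lengths[start + offset]
        low = max(low, Fraction(t, p))
        upper = Fraction(t + 1, p)
        high = upper if high is None else min(high, upper)
        if low >= high:
            return None
    return low, high


def naive_scaled_matches(text_lengths, pattern_lengths):
    """Return (1-based location, interval) for every scaled match."""
    m = len(pattern_lengths)
    result = []
    for start in range(len(text_lengths) - m + 1):
        interval = scale_interval(text_lengths, pattern_lengths, start)
        if interval is not None:
            result.append((start + 1, interval))
    return result


PL = [2, 1, 2, 3, 2]
TL = [2, 4, 2, 4, 7, 4, 5, 3]   # assumed text lengths, see lead-in
print(naive_scaled_matches(TL, PL))
```

Stopping as soon as the running intersection becomes empty does not change the worst case, but it rejects most locations after a few pairs.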

Improvement – Parameterized Matching Introduced: Baker 1994. Motivation: “copying” code.

Parameterized Matching. Input: two strings s and t, |s|=|t|, over alphabets Σs and Σt. s parameterize-matches t if there exists a bijection π: Σs → Σt such that π(s) = t. Example: π(a)=x, π(b)=y: a b a b b parameterize-matches x y x y y.

Parameterized Matching. Claim (AFM-94): For Σ that can be sorted in linear time (e.g. Σ={1, …, n}), parameterized matching can be done in time O(n).
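A minimal sketch of the parameterized-match test for two equal-length strings. It checks that the symbols actually used in s and t are in one-to-one correspondence; this is a direct check of the definition, not the linear-time pattern matching algorithm of AFM-94. The function name `p_match` is hypothetical.

```python
def p_match(s, t):
    """True if some one-to-one symbol renaming maps s onto t."""
    if len(s) != len(t):
        return False
    forward, backward = {}, {}
    for a, b in zip(s, t):
        # a must always map to b, and b must always come from a
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


print(p_match("ababb", "xyxyy"))   # True
print(p_match("ab", "xx"))         # False
```

The same function works on sequences of integers, which is how it applies to the length parts PL and TL.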

The reduction. Lemma: There is an α ≥ 1 for which PL matches TL at location i scaled to α only if PL p-matches TL at i. Proof: Assume PL does not p-match TL at location i. The possible situations are:

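Possibility 1: TL has a and c ≠ a where PL has b and b. W.l.o.g. c ≥ a+1. For c = a+1 (smallest possible): the intervals [a/b, (a+1)/b) and [(a+1)/b, (a+2)/b) are disjoint, so the intersection is empty.

Possibility 2: TL has a and a where PL has b and c ≠ b. W.l.o.g. c ≥ b+1. For c = b+1 the intersection is not empty only if (a+1)/(b+1) > a/b, i.e. ab+b > ab+a, i.e. b > a. But this can never happen if α ≥ 1, since b > a means the text run is shorter than the pattern run.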

Algorithm for Real Scaled String Matching. Let {P_i1, P_i2, …, P_ij} be the different numbers in PL. 1. P-match PL in TL. 2. For each match, check intersection of intervals between P_i1, …, P_ij and corresponding symbols in TL. End Algorithm

Example: PL = 2 3 2, P_i1 = 2, P_i2 = 3. TL = 5 6 5 6 10 7. [Figure marks the p-matches and the scaled match.]
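The two steps of the algorithm can be sketched as follows. This is an illustration only: the p-match step here is the naive O(nm) check rather than a linear-time parameterized matching algorithm, the function names are hypothetical, and the text lengths in the usage example are an assumption chosen for illustration rather than the slide's data.

```python
from fractions import Fraction


def p_match_mapping(text_lengths, pattern_lengths, start):
    """Return {pattern value: text value} if PL p-matches TL at start, else None."""
    forward, backward = {}, {}
    for offset, p in enumerate(pattern_lengths):
        t = text_lengths[start + offset]
        if forward.setdefault(p, t) != t or backward.setdefault(t, p) != p:
            return None
    return forward


def real_scaled_matches(text_lengths, pattern_lengths):
    """Return 0-based locations where PL matches TL for some alpha >= 1."""
    m = len(pattern_lengths)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    hits = []
    for start in range(len(text_lengths) - m + 1):
        mapping = p_match_mapping(text_lengths, pattern_lengths, start)
        if mapping is None:
            continue
        # Verification only touches the j distinct pattern values.
        low = max([Fraction(1)] + [Fraction(t, p) for p, t in mapping.items()])
        high = min(Fraction(t + 1, p) for p, t in mapping.items())
        if low < high:
            hits.append(start)
    return hits


PL = [2, 3, 2]
TL = [5, 6, 5, 7, 10, 7]          # assumed example data
print(real_scaled_matches(TL, PL))
```

In this assumed text, 5 6 5 is a p-match that fails verification (the intervals [5/2, 3) and [2, 7/3) are disjoint), while 7 10 7 is a p-match that passes with scales in [7/2, 11/3).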

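Important Fact: The P_ik’s are distinct positive integers whose sum is at most m, so 1+2+…+j ≤ m, i.e. j(j+1)/2 ≤ m. So there are at most O(√m) different P_ik’s. Time: O(n) for parameterized matching (Σ={1, 2, …, n}). O(√m) verification for each location. Total: O(n√m).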

Tighter analysis: Upper bound the number of possible p-matches. Lemma: Let |P|=m, |T|=n, and let {P_i1, P_i2, …, P_ij} be the different numbers in PL. Then there are at most 2n/j p-matches of PL in TL. Meaning: Since verification time is O(j) per p-match, the lemma implies that total verification time is: O((2n/j) · j) = O(n).

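Proof of Lemma: Consider the 1st appearance of each of P_i1, …, P_ij in PL. [Figure: in a p-match, these first appearances P_i1, P_i2, …, P_ij in PL are aligned with text elements a_1, a_2, …, a_j in TL.]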

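Lemma’s proof (cont.) Let x be the total number of p-matches in the text. In each p-match, the j text elements aligned with the 1st occurrences of the P_ik’s are distinct positive integers, so their sum is at least 1+2+…+j ≥ j²/2. Hence the sum, over all p-matches, of the text elements that match 1st occurrences of P_ik’s in the pattern is ≥ (xj²)/2. But: There are overlaps! How many?

Lemma’s proof (cont.) For each text location, at most j matches will count it. Therefore the total count without overlaps is ≥ (xj²/2)/j = x·j/2. Clearly x·j/2 ≤ n, since the elements of TL sum to at most n; thus x ≤ (2n)/j.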

Open Problem: Give a 1-d algorithm that is linear in the run-length compressed text and pattern.