MCS 101: Algorithms
Instructor: Neelima Gupta
ngupta@cs.du.ac.in

Table of Contents
• String Matching
  – Naïve Method
  – Finite Automata Approach
  – Rabin Karp
  – KMP

Pattern Matching
• Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences of the pattern within the text.
• Example: for T = ababcabdabcaabc and P = abc, the occurrences (0-based, as in T[0..n-1]) are:
  – first occurrence starts at T[2]
  – second occurrence starts at T[8]
  – third occurrence starts at T[12]
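The problem statement can be checked with a direct search. This is a minimal sketch, not an efficient method; the function name `find_all` is an illustrative choice. Indices are 0-based, matching T[0..n-1], so each result is one less than the corresponding 1-based position.

```python
def find_all(text: str, pattern: str) -> list[int]:
    """Return every 0-based index at which pattern starts in text."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        # Compare the window of length m that begins at 'start'.
        if text[start:start + m] == pattern:
            hits.append(start)
    return hits


print(find_all("ababcabdabcaabc", "abc"))  # [2, 8, 12]
```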

Let Σ denote the alphabet.
• Given: a string T[1..n] of n characters and a pattern P[1..m] of m characters, where m << n.
• To find: whether the pattern P occurs in the text T. If it does, give the first occurrence of P in T.
The characters of both T and P are drawn from the finite set Σ.

NAÏVE APPROACH
T: a b c a b d a a b c d e
P: a b d

Example (Step 1)
T: a b c a b d a a b c d e
P: a b d
Mismatch after 3 comparisons.

Example (Step 2)
T: a b c a b d a a b c d e
P:   a b d
Mismatch after 1 comparison.

Example (Step 3)
T: a b c a b d a a b c d e
P:     a b d
Mismatch after 1 comparison.

Example (Step 4)
T: a b c a b d a a b c d e
P:       a b d
Match found after 3 comparisons.
Thus, after 8 comparisons the substring P is found in T.

Worst Case Running Time
T: a a a … a a f, of size n
P: a a a f, of size 4

Example
• Step 1: P is aligned with the first 4 characters of T. Mismatch found after 4 comparisons.
• Step 2: P is shifted right by 1 character. Mismatch found after 4 comparisons.
• …
• Last step: P is aligned with the last 4 characters of T. Match found after 4 comparisons.

Worst Case Running Time
This continues until the (n-4)th character of T has been compared with the characters of P. There are (n-4) mismatching alignments of 4 comparisons each, plus 4 comparisons for the final match, so the number of comparisons required is (n-4)·4 + 4.

Worst Case Running Time
• At every step, a mismatch is found after m comparisons.
• These m comparisons are done for (n-m) characters in T.
• Thus, the running time obtained is (n-m)m + m, which is O(nm).
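The comparison counts can be reproduced with a small counting version of the naïve method. This is a sketch under stated assumptions: the name `naive_first_match` is illustrative, positions are 0-based, and the step-by-step example is read as T = abcabdaabcde with P = abd, which reproduces the counts 3, 1, 1 and 3.

```python
def naive_first_match(text: str, pattern: str) -> tuple[int, int]:
    """Return (0-based start of the first occurrence or -1, comparisons made)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for start in range(n - m + 1):
        k = 0
        while k < m:
            comparisons += 1                  # one character comparison
            if text[start + k] != pattern[k]:
                break                         # mismatch: shift right by 1
            k += 1
        if k == m:
            return start, comparisons
    return -1, comparisons


# Step-by-step example: 3 + 1 + 1 + 3 = 8 comparisons.
print(naive_first_match("abcabdaabcde", "abd"))   # (3, 8)

# Worst case: T = a...af of size n, P = aaaf of size m = 4.
n, m = 20, 4
pos, count = naive_first_match("a" * (n - 1) + "f", "a" * (m - 1) + "f")
print(pos, count, (n - m) * m + m)                # 16 68 68
```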

Finite Automata
[Figure: transition diagram with states s0, s1, s2 and s3; edges are labelled a, f, #a and ∑]

Worst Case Running Time
• In a finite automaton, each character of the text is scanned at most once. Thus, in the worst case, the searching time is O(n).
• Preprocessing time: an edge has to be formed for every character in ∑ at every state, so the preprocessing time is O(m·|∑|).
• Thus the total running time is O(n) + O(m·|∑|).

Drawback: If the alphabet set ∑ is very large, then the time required to construct the FA will be very large.
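A minimal sketch of the automaton approach, assuming a non-empty pattern whose characters all belong to the given alphabet. The names `build_dfa` and `fa_search` are illustrative, and the table construction shown is one known O(m·|∑|) method, not necessarily the one intended on the slides. The table has one entry per state and per character, which is where the drawback for large alphabets comes from.

```python
def build_dfa(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """Transition table: dfa[state][char] -> next state, for states 0..m."""
    m = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    x = 0  # state reached after reading pattern[1:j]; used as the fallback
    for j in range(1, m + 1):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]          # on a mismatch, behave like state x
        if j < m:
            dfa[j][pattern[j]] = j + 1     # on a match, advance to the next state
            x = dfa[x][pattern[j]]
    return dfa


def fa_search(text: str, pattern: str, alphabet: str) -> list[int]:
    """Return the 0-based start of every occurrence; each text char is read once."""
    m = len(pattern)
    dfa = build_dfa(pattern, alphabet)
    state, hits = 0, []
    for i, c in enumerate(text):
        state = dfa[state].get(c, 0)       # characters outside the alphabet reset
        if state == m:
            hits.append(i - m + 1)
    return hits


print(fa_search("ababcabdabcaabc", "abc", "abcd"))  # [2, 8, 12]
```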

BRUTE FORCE STRATEGY
• In this strategy, whenever a mismatch was found, the pattern was shifted right by 1 character.
• This was not efficient, as it required a large number of comparisons. Hence a better algorithm was required.

An example
T: abcabcabcabcaf
P: abcabcabcaf
Mismatch found at the 11th character of P.

Shift by 3
T: abcabcabcabcaf
P:    abcabcabcaf
Match found.

Another example
T: abcdeghabf
P: abcdeghabf
Mismatch found.

Shift by 7
T: abcdeghabf
P: abcdeghabf

• How to decide how much to shift? • Ans: We use the information contained in the partial pattern that has matched so far.

KMP: Knuth-Morris-Pratt Algorithm
T: … t_j … t_{j+r-1} … t_{j+k-r-1} … t_{j+k-2} t_{j+k-1} …
P:   p_1 … p_r … p_{k-1} p_k …
P (shifted):           p_1 … p_r p_{r+1} …
If t_{j+k-1} ≠ p_k:
Asking how far to shift is equivalent to asking which character of the pattern t_{j+k-1} should be compared with next. Thus we look for a prefix p_1 … p_r of p_1 … p_{k-1} that matches a suffix of t_j … t_{j+k-2}; then t_{j+k-1} should be compared with p_{r+1}.

Several such prefixes may exist
T: abcabcabcabcaf
P: abcabcabcaf
Mismatch found at p_11.
One prefix is: a b c a b c a (r = 7)
Another is: a b c a (r = 4)
Should t_11 be compared with p_8 or p_5?
Choose the longest prefix, i.e. the largest such r.
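The candidate prefixes can be listed mechanically. In this sketch the matched part is taken to be abcabcabca, the first ten characters of P = abcabcabcaf; the name `border_lengths` is an illustrative choice. Besides r = 7 and r = 4, the single character a (r = 1) also qualifies.

```python
def border_lengths(s: str) -> list[int]:
    """Lengths r with 0 < r < len(s) such that the first r chars equal the last r."""
    return [r for r in range(len(s) - 1, 0, -1) if s[:r] == s[-r:]]


matched = "abcabcabca"              # p_1 ... p_10, matched before the mismatch
print(border_lengths(matched))      # [7, 4, 1]
print(max(border_lengths(matched))) # 7, so t_11 is compared with p_8
```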

KMP Contd.
• Let r be the length of the longest prefix of P that matches a suffix of the matched part of P. Then the pattern can be shifted by k-1-r positions instead of 1, and t_{j+k-1} should be compared with p_{r+1}.
• Claim 1: We have not missed any match, i.e. the pattern does not occur at any position from j to j+k-r-2.
• Proof: If it did, a longer prefix would match a suffix of the matched part, contradicting the choice of r.

Why LONGEST?
T: abcabcabcabcaf
P: abcabcabcaf
Mismatch found.

Using the longest prefix (abcabca), the correct alignment for the pattern is obtained by shifting it 3 characters right.
T: abcabcabcabcaf
P:    abcabcabcaf
Pattern found.

Using a smaller prefix (abca):
T: abcabcabcabcaf
P:       abcabcabcaf
Mismatch: pattern not found. By choosing a smaller prefix and aligning the pattern accordingly, as shown, the pattern's occurrence in the text is missed (that is, we shifted by more positions than we should have).
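The effect of the two choices can be checked directly. This sketch assumes the text is abcabcabcabcaf, a reconstruction consistent with a mismatch at the 11th pattern character followed by a successful shift of 3; `occurs_after_shift` is an illustrative helper, and the shift used is (matched length) minus r.

```python
def occurs_after_shift(text: str, pattern: str, matched: int, r: int) -> bool:
    """Shift the pattern by matched - r and report whether it occurs there."""
    shift = matched - r
    return text.startswith(pattern, shift)


T = "abcabcabcabcaf"   # assumed text (see lead-in)
P = "abcabcabcaf"
print(occurs_after_shift(T, P, matched=10, r=7))  # True: shift by 3 finds P
print(occurs_after_shift(T, P, matched=10, r=4))  # False: shift by 6 skips it
```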

How to find the longest such prefix? KMP algorithm

KMP: Knuth-Morris-Pratt Algorithm
T: … t_j … t_{j+r-1} … t_{j+k-r-1} … t_{j+k-2} t_{j+k-1} …
P:   p_1 … p_r … p_{k-1} p_k …
P (shifted):           p_1 … p_r p_{r+1} …
If t_{j+k-1} ≠ p_k:
Since t_j … t_{j+k-2} has already been matched with p_1 … p_{k-1}, we need to look for the longest prefix p_1 … p_r (r < k-1) of p_1 … p_{k-1} that matches its own suffix. Thus the longest prefix can be found from the pattern itself, and the text is not needed for this purpose (Note 1).

P: p_1 … p_k …
Let the length of the longest prefix of p_1 … p_{k-1} that matches its suffix be r.

T: … t_j … t_{j+r-1} … t_{j+k-r-1} … t_{j+k-2} t_{j+k-1} …
P:   p_1 … p_r … p_{k-1} p_k …
P (shifted):           p_1 … p_r p_{r+1} …
If t_{j+k-1} ≠ p_k:
Let Fail[k] be a pointer that says which character of P should come in place of p_k, by shifting P accordingly, when a mismatch occurs at p_k. Fail[k] is simply the length of the longest such prefix plus 1, i.e. r + 1. The question is: how do we compute Fail[k]?

P: p_1 … p_{r-1} p_r p_{r+1} … p_{k-1} p_k …
   p_1 … p_{r'-1} p_{r'+1}
   p_1 … p_{s-1} p_s p_{s+1}
Look at fail[k-1]. Let it be r'.
If p_{r'} = p_{k-1} (which has already been matched with t_{j+k-2}) then
    fail[k] = r' + 1
else {
1:  look at fail[r'] = s, say
    if s > 0 {
        if p_s = p_{k-1} then fail[k] = s + 1
        else goto 1 with r' = s
    }
    else (i.e. s = 0) fail[k] = 1
}

EXAMPLE
P: abcabcabcaf
for k=1: fail[k] = 0 (assumed)
for k=2: s = fail[1] = 0, therefore fail[k] = 0+1 = 1
for k=3: s = fail[2] = 1; check whether p_2 = p_1; since p_2 ≠ p_1, s = fail[1] = 0, therefore fail[k] = 0+1 = 1
for k=4: s = fail[3] = 1; check whether p_1 = p_3; since p_1 ≠ p_3, s = fail[1] = 0, therefore fail[k] = 0+1 = 1
for k=5: s = fail[4] = 1; check whether p_1 = p_4; yes, therefore fail[k] = 1+1 = 2
Similarly for the others.

k:        1  2  3  4  5  6  7  8  9  10  11
fail[k]:  0  1  1  1  2  3  4  5  6   7   8
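The fail values for P = abcabcabcaf can be recomputed with a short loop. This is a sketch that follows the 1-based convention used here (fail[1] = 0, and fail[k] is one more than the length of the longest prefix of p_1 … p_{k-1} that matches its suffix); the name `fail_table` is an illustrative choice.

```python
def fail_table(pattern: str) -> list[int]:
    """Return [fail[1], ..., fail[m]] using the 1-based convention of the slides."""
    m = len(pattern)
    p = " " + pattern            # pad so that p[1] is the first character
    fail = [0] * (m + 1)         # index 0 unused; fail[1] = 0 by convention
    for k in range(2, m + 1):
        s = fail[k - 1]
        # Follow fail pointers until p_s matches p_{k-1} or s drops to 0.
        while s > 0 and p[s] != p[k - 1]:
            s = fail[s]
        fail[k] = s + 1
    return fail[1:]


print(fail_table("abcabcabcaf"))  # [0, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8]
```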

Example:
T: a b c a b c a b c a b c a f
P: a b c a b c a b c a f
k: 1 2 3 4 5 6 7 8 9 10 11
Mismatch found at position k=11. Look at fail[11] = 8, which implies the pattern must be shifted so that p_8 comes in place of p_11.

T: a b c a b c a b c a b c a f
P:       a b c a b c a b c a f
k:       1 2 3 4 5 6 7 8 9 10 11
Pattern found.

Another Example:
T: a b c b a b c a b c a b c a f
P: a b c a b c a b c a f
k: 1 2 3 4 5 6 7 8 9 10 11
Mismatch found at position k=4. Look at fail[4] = 1, which implies the pattern must be shifted so that p_1 comes in place of p_4.

Another Example (contd.):
T: a b c b a b c a b c a b c a f
P:       a b c a b c a b c a f
k:       1 2 3 4 5 6 7 8 9 10 11
Mismatch found at position k=1. Look at fail[1] = 0, which implies: read the next character in the text.


Another Example (contd.):
T: a b c b a b c a b c a b c a f
P:         a b c a b c a b c a f
k:         1 2 3 4 5 6 7 8 9 10 11
Pattern found.
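The walkthrough can be sketched as a complete search routine. The names `compute_fail` and `kmp_search` are illustrative; positions are reported 1-based to match the slides. Computing one extra entry, fail[m+1], so that the search can continue after a full match is an addition that goes beyond what the slides show.

```python
def compute_fail(pattern: str) -> list[int]:
    """fail[k] for k = 1..m+1 (index 0 unused), 1-based as on the slides."""
    m = len(pattern)
    p = " " + pattern                 # pad so that p[1] is the first character
    fail = [0] * (m + 2)              # fail[1] = 0 by convention
    for k in range(2, m + 2):
        s = fail[k - 1]
        while s > 0 and p[s] != p[k - 1]:
            s = fail[s]
        fail[k] = s + 1
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of all occurrences of pattern."""
    m = len(pattern)
    if m == 0:
        return []
    p = " " + pattern
    fail = compute_fail(pattern)
    hits = []
    i, k = 0, 1                       # i: 0-based text index, k: 1-based pattern index
    while i < len(text):
        if k == 0:                    # fail[1] = 0: read the next text character
            i, k = i + 1, 1
        elif text[i] == p[k]:
            i, k = i + 1, k + 1
            if k > m:                 # whole pattern matched, ending at text[i-1]
                hits.append(i - m + 1)
                k = fail[m + 1]       # keep searching (assumption beyond the slides)
        else:
            k = fail[k]               # shift so that p[fail[k]] replaces p[k]
    return hits


print(kmp_search("abcbabcabcabcaf", "abcabcabcaf"))  # [5]
print(kmp_search("abcabcabcabcaf", "abcabcabcaf"))   # [4]
```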

Analysis of KMP
# of mismatches: For every mismatch, the pattern is shifted by at least 1 position. The maximum number of shifts is determined by the largest suffix.
T: …abcdafd…
P: deb (mismatch)
Total no. of shifts ≤ n-m
Total no. of mismatches ≤ n-m+1

Analysis of KMP contd.
# of matches: For every match, the pointer in the text moves up by 1 position.
T: …abcdafd…
[Figure: successive alignments of the pattern under T]
⇒ # of matches ≤ length of text ≤ n
The complexity of KMP is linear: O(m+n).
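The linear bound can be observed by counting character comparisons. This is a sketch assuming a non-empty pattern; `kmp_count` is an illustrative name, and it stops at the first occurrence. Each comparison is either a match, which advances the text pointer, or a mismatch, which shifts the pattern, so the count stays within 2n. On the worst-case input used for the naïve method (n = 20, P = aaaf), it needs 36 comparisons where the naïve method needs 68.

```python
def kmp_count(text: str, pattern: str) -> tuple[int, int]:
    """Return (1-based start of the first occurrence or 0, comparisons made)."""
    m = len(pattern)
    p = " " + pattern
    fail = [0] * (m + 1)              # 1-based fail table, fail[1] = 0
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s > 0 and p[s] != p[k - 1]:
            s = fail[s]
        fail[k] = s + 1

    i, k, comparisons = 0, 1, 0
    while i < len(text):
        if k == 0:                    # no comparison here: just read the next char
            i, k = i + 1, 1
            continue
        comparisons += 1
        if text[i] == p[k]:           # match: the text pointer moves up by 1
            i, k = i + 1, k + 1
            if k > m:
                return i - m + 1, comparisons
        else:                         # mismatch: the pattern is shifted
            k = fail[k]
    return 0, comparisons


n = 20
pos, count = kmp_count("a" * (n - 1) + "f", "aaaf")
print(pos, count, count <= 2 * n)     # 17 36 True
```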

ACKNOWLEDGEMENTS
MSc (CS) 2009: Abhishek Behl (02), Aarti Sethiya (01), Akansha Aggarwal (03), Alok Prakash (04), Vibha Negi (31)