Mining hybrid sequential patterns and sequential rules
Authors: Yen-Liang Chen, Shih-Sheng Chen, Ping-Yu Hsu
Source: Information Systems 27 (2002), pp. 345-362
Speaker: Jung-san Lee
Date: 2002/09/12

Outline
1. Introduction
2. The proposed algorithm GFP1
3. An improved algorithm GFP2
4. Experimental result
5. Conclusion

1. Introduction (1/3)
- Unordered items
- Ordered items
  1. continuous patterns
  2. discontinuous patterns
- Hybrid patterns

Introduction (2/3)
- Ex: A navigation sequence: A, B, Y, K, F
- The navigation pattern of interest is to first visit sites A and B, and then visit sites K and F
- Goal: To find the pattern <AB*KF>
- The symbol * stands for a variable number of intermediate elements

Introduction (3/3)
- Continuous patterns: <AB>, <KF>
- Discontinuous patterns: <A*B*K*F>, <B*K*F>, <B*F>
- Hybrid pattern: <AB*KF>

2. The proposed algorithm GFP1
- I = {i1, i2, i3, …, im} denotes the set of all items in the database
- “*” denotes a subsequence of any length, including zero length
- Constraints:

- Ex: <ABC> and <A*B*C> are patterns, but <*ABC>, <*AB**C*>, and <A*B**C> are not patterns
- Ex: If X = <A*BC> and T = <BACABCC>, then X is matched in T
- But if we set X = <A*BAC>, then X is not matched in T
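
The two examples above can be checked with a short sketch. It is not the paper's num() function: it assumes single-character items, writes patterns without the angle brackets, and infers the pattern constraints from the examples (no leading or trailing "*", no "**"). The names is_valid_pattern and matches are hypothetical.

```python
import re


def is_valid_pattern(pattern: str) -> bool:
    """Constraints inferred from the slide's examples (an assumption):
    a pattern may not start or end with '*' and may not contain '**'."""
    return (
        bool(pattern)
        and not pattern.startswith("*")
        and not pattern.endswith("*")
        and "**" not in pattern
    )


def matches(pattern: str, transaction: str) -> bool:
    """Return True if the transaction contains the pattern.

    Items between stars must be adjacent in the transaction;
    each '*' stands for zero or more intermediate items.
    """
    if not is_valid_pattern(pattern):
        raise ValueError(f"not a pattern: {pattern!r}")
    # Each run of items becomes a literal; each '*' becomes '.*'.
    regex = ".*".join(re.escape(segment) for segment in pattern.split("*"))
    return re.search(regex, transaction) is not None


print(matches("A*BC", "BACABCC"))   # True
print(matches("A*BAC", "BACABCC"))  # False
print(matches("AB*KF", "ABYKF"))    # True
```

A regular expression is used here only because it keeps the sketch short; it answers whether a match exists, not how many matches there are.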

- Lk,r denotes the set of all frequent patterns with fixed length k and variable length r
- Ck,r denotes the set of all candidate patterns with fixed length k and variable length r
- A recursive function named num() counts the number of matches of a pattern in a transaction

GFP1 steps (flowchart):
1. Generate C1,0
2. Determine L1,0
3. Determine Lk,k-1
4. Determine Lk,k-j for all j = 2…k

Example 1

3. An improved algorithm GFP2
- In GFP1, if Ck,r contains n different patterns X, then each transaction needs to run num() n times
- Improvement: use a tree structure to represent all the patterns in Ck,r, so that all the patterns can be examined at once

GFP2
1. Generate C1,0 and determine L1,0
2. Find all Ck,k-1 for k = 2…k
3. Construct the candidate tree
4. Traverse the candidate tree and add the supports to the most specific patterns
5. Build the compensation list
6. Do the compensation

- In GFP1, if the maximum fixed length of the candidates is m, then the database must be scanned m(m+1)/2 times
- Using the tree structure reduces the number of database scans from m(m+1)/2 to m
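
One way to read the m(m+1)/2 figure, assuming GFP1 makes one database scan per (k, r) level with r running from k-1 down to 0 for each fixed length k, is as the sum 1 + 2 + … + m. The sketch below only illustrates that arithmetic; the function names are hypothetical.

```python
def gfp1_scans(m: int) -> int:
    """Scans under the assumption of one scan per (k, r) level:
    fixed length k contributes k levels (r = k-1, ..., 0)."""
    return sum(k for k in range(1, m + 1))


def gfp2_scans(m: int) -> int:
    """With the candidate tree, one scan per fixed length k."""
    return m


for m in (3, 5, 10):
    assert gfp1_scans(m) == m * (m + 1) // 2
    print(m, gfp1_scans(m), gfp2_scans(m))
# prints:
# 3 6 3
# 5 15 5
# 10 55 10
```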

- Take node BGAC for example: it can be viewed as BGAC, B*GAC, BG*AC, B*G*AC, BGA*C, B*GA*C, BG*A*C, B*G*A*C
- Another problem: the computation time is O(2^(k-1)) for each leaf of depth k
- A compensation list is used to solve this problem
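
The 2^(k-1) count comes from the k-1 gaps between adjacent items, each of which either holds a "*" or does not. A minimal sketch that enumerates the variants of a node, assuming single-character items (star_variants is a hypothetical name):

```python
from itertools import product


def star_variants(items: str) -> list[str]:
    """All patterns obtained by optionally placing '*' in each gap
    between adjacent items of a non-empty item string."""
    variants = []
    for bits in product((0, 1), repeat=len(items) - 1):
        pattern = items[0]
        for bit, item in zip(bits, items[1:]):
            pattern += ("*" if bit else "") + item
        variants.append(pattern)
    return variants


variants = star_variants("BGAC")
print(len(variants))  # 8
print(variants)
# ['BGAC', 'BGA*C', 'BG*AC', 'BG*A*C', 'B*GAC', 'B*GA*C', 'B*G*AC', 'B*G*A*C']
```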

- For a pattern X, we say Y is a generalization of X if X and Y are the same pattern except that Y has more “*”s than X
- Ex: BG*A*C is a generalization of BGA*C and BGAC

- BGA*C (indexed as 001) has the generalized patterns B*GA*C (101), BG*A*C (011), and B*G*A*C (111)
- The most specific pattern: for T = (ABKFKGAC), the matched positions are 2, 6, 7, 8
- All possible patterns: B*GAC, B*G*AC, B*GA*C, B*G*A*C, where B*GAC is the most specific pattern
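
The index notation and the example above can be reproduced with a small sketch. It assumes the index is a bit string with one bit per gap, read left to right, where 1 means the gap holds a "*"; the most specific pattern then places a star only where two matched positions are not adjacent. The function names are hypothetical, and this is not the paper's compensation procedure.

```python
def most_specific_index(positions: list[int]) -> str:
    """Bit string with '1' where consecutive matched positions are not adjacent."""
    return "".join(
        "0" if later - earlier == 1 else "1"
        for earlier, later in zip(positions, positions[1:])
    )


def render(items: str, index: str) -> str:
    """Turn items plus a gap index into a pattern, e.g. ('BGAC', '001') -> 'BGA*C'."""
    pattern = items[0]
    for bit, item in zip(index, items[1:]):
        pattern += ("*" if bit == "1" else "") + item
    return pattern


def generalizations(index: str) -> list[str]:
    """All indexes that keep every star of `index` and add at least one more."""
    width = len(index)
    base = int(index, 2)
    return [
        format(mask, f"0{width}b")
        for mask in range(2 ** width)
        if mask & base == base and mask != base
    ]


index = most_specific_index([2, 6, 7, 8])
print(index, render("BGAC", index))  # 100 B*GAC
print([render("BGAC", g) for g in generalizations(index)])
# ['B*GA*C', 'B*G*AC', 'B*G*A*C']
print(generalizations("001"))  # ['011', '101', '111']
```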

- Instead of increasing the supports of all found patterns, we only increase the support of the most specific pattern

4. Experimental result

Conclusions
- A single approach can find both continuous and discontinuous sequential patterns at the same time
- Although GFP2 runs at the same speed as the WAP-tree algorithm, GFP2 can find not only discontinuous patterns but also continuous patterns