PREFIXSPAN ALGORITHM
Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth
al113301m@student.etf.rs
Contents
1. Introduction
2. Problem statement
3. Existing Solutions
4. Proposed Solution
5. Algorithm
6. Conclusion
7. References
Introduction
The input is a set of sequences. Each sequence is a list of elements, and each element is a set of items.
◦ <a(abc)(ac)d(cf)> has 5 elements and 9 items
◦ <a(abc)(ac)d(cf)> is a 9-sequence (its length is the number of item instances)
◦ <a(abc)(ac)d(cf)> ≠ <a(ac)(abc)d(cf)> (the order of the elements matters)
id   Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Subsequence vs. supersequence
Given two sequences α = <a_1 a_2 … a_n> and β = <b_1 b_2 … b_m>:
◦ α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a_1 ⊆ b_j1, a_2 ⊆ b_j2, …, a_n ⊆ b_jn.
◦ β is then a supersequence of α.
Example: β = <a(abc)(ac)d(cf)>
Subsequences of β: α1 = <aa(ac)d(c)>, α2 = <(ac)d(cf)>, α3 = <ac>
Not subsequences of β: α4 = <df(cf)>, α5 = <(cf)d>, α6 = <(abc)dcf>
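The subsequence test can be sketched in Python. This is a minimal sketch, assuming each sequence is written as a list of strings and each string holds the single-character items of one element. The name `is_subsequence` is illustrative and not part of the slides. It matches every element of α against the earliest possible element of β, which is sufficient for the definition above.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha is a subsequence of beta (greedy leftmost matching)."""
    j = 0
    for a in alpha:
        # advance to the first remaining element of beta that contains a
        while j < len(beta) and not set(a) <= set(beta[j]):
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


beta = ["a", "abc", "ac", "d", "cf"]  # <a(abc)(ac)d(cf)>
candidates = {
    "alpha1": ["a", "a", "ac", "d", "c"],
    "alpha2": ["ac", "d", "cf"],
    "alpha3": ["a", "c"],
    "alpha4": ["d", "f", "cf"],
    "alpha5": ["cf", "d"],
    "alpha6": ["abc", "d", "c", "f"],
}
for name, alpha in candidates.items():
    print(name, is_subsequence(alpha, beta))  # True for alpha1..alpha3, False for alpha4..alpha6
```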
Sequential Pattern Mining
Find all frequent subsequences, i.e. the subsequences whose support (the number of sequences in the database that contain them) is no less than the user-specified min_support.
id   Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
min_support = 2
Solution: 53 frequent subsequences, including:
<a> <aa> <ab> <a(bc)a> <abc> <(ab)c> <(ab)d> <(ab)f> <(ab)dc> <aca> <acb> <acc> <adc> <af>
<ba> <bc> <(bc)a> <bdc> <bf>
<ca> <cb> <cc>
<db> <dcb>
<ea> <eab> <eacb> <ebc> <ecb> <efb> <efcb>
<fb> <fbc> <fcb>
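Counting support can be sketched as follows. This is an illustration only, assuming the same list-of-strings encoding of sequences (one string of single-character items per element). The helper names `contains` and `support` are hypothetical.

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence."""
    it = iter(sequence)
    # each pattern element consumes the iterator up to its first match,
    # so later pattern elements can only match later sequence elements
    return all(any(set(p) <= set(e) for e in it) for p in pattern)


def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db)


db = [
    ["a", "abc", "ac", "d", "cf"],     # 10: <a(abc)(ac)d(cf)>
    ["ad", "c", "bc", "ae"],           # 20: <(ad)c(bc)(ae)>
    ["ef", "ab", "df", "c", "b"],      # 30: <(ef)(ab)(df)cb>
    ["e", "g", "af", "c", "b", "c"],   # 40: <eg(af)cbc>
]
print(support(db, ["ab", "c"]))  # 2, so <(ab)c> is frequent for min_support = 2
print(support(db, ["g"]))        # 1, so <g> is not frequent
```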
Existing Solutions
Apriori-like approaches (AprioriSome (1995), AprioriAll (1995), DynamicSome (1995), GSP (1996)):
◦ Potentially huge set of candidate sequences,
◦ Multiple scans of the database,
◦ Difficulties in mining long sequential patterns.
FreeSpan (2000), a pattern-growth method (Frequent pattern-projected Sequential pattern mining):
◦ The general idea is to use frequent items to recursively project sequence databases into a set of smaller projected databases, and to grow subsequence fragments in each projected database.
PrefixSpan (Prefix-projected Sequential pattern mining):
◦ Fewer projections and quickly shrinking sequences.
Prefix
Assume the items within each element are listed alphabetically. Given two sequences α = <a_1 a_2 … a_n> and β = <b_1 b_2 … b_m>, m ≤ n, β is called a prefix of α if and only if:
◦ b_i = a_i for i ≤ m-1;
◦ b_m ⊆ a_m;
◦ all items in (a_m - b_m) are alphabetically after those in b_m.
Example:
◦ α = <a(abc)(ac)d(cf)>
◦ β = <a(abc)a>
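A minimal Python sketch of the prefix test, assuming sequences are lists of strings with one string of single-character items per element. Besides the two conditions on the slide it also checks the ordering condition from the PrefixSpan paper's definition: the items of a_m left out of b_m must come alphabetically after those in b_m. The function name `is_prefix` and the treatment of the empty sequence are assumptions made for the illustration.

```python
def is_prefix(beta, alpha):
    """Return True if beta is a prefix of alpha."""
    m = len(beta)
    if m == 0:
        return True   # assumption: the empty sequence <> is a prefix of everything
    if m > len(alpha):
        return False
    # the first m-1 elements must be identical
    if any(set(b) != set(a) for b, a in zip(beta[:-1], alpha)):
        return False
    last_b, last_a = set(beta[-1]), set(alpha[m - 1])
    if not last_b <= last_a:
        return False
    # leftover items of a_m must come alphabetically after the items of b_m
    return all(x > max(last_b) for x in last_a - last_b)


alpha = ["a", "abc", "ac", "d", "cf"]          # <a(abc)(ac)d(cf)>
print(is_prefix(["a", "abc", "a"], alpha))     # True, the example on the slide
print(is_prefix(["a", "b"], alpha))            # False: 'a' in (abc) is not after 'b'
```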
Projection
Given sequences α and β such that β is a subsequence of α, a subsequence α' of α is called a projection of α w.r.t. prefix β if and only if:
◦ α' has prefix β;
◦ there exists no proper supersequence α'' of α' such that α'' is a subsequence of α and also has prefix β.
Example:
◦ α = <a(abc)(ac)d(cf)>
◦ β = <(bc)a>
◦ α' = <(bc)(ac)d(cf)>
Postfix
Let α' = <a_1 a_2 … a_n> be the projection of α w.r.t. prefix β = <a_1 a_2 … a_{m-1} a'_m> (m ≤ n). The sequence γ = <a''_m a_{m+1} … a_n> is called the postfix of α w.r.t. prefix β, denoted γ = α/β, where a''_m = (a_m - a'_m). We also write α = β ⋅ γ.
If a''_m is not empty, it is written with a leading underscore, e.g. (_c).
Example: α' = <a(abc)(ac)d(cf)>, β = <a(abc)a>, γ = <(_c)d(cf)>.
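The postfix computation can be sketched in Python. This sketch assumes that elements are strings of single-character items in alphabetical order and that the first argument already is the projection α', so it has β as a prefix. The function name `postfix` and the "_" string encoding of a partial first element are illustrative choices.

```python
def postfix(alpha_proj, beta):
    """Postfix of the projection alpha_proj w.r.t. its prefix beta."""
    m = len(beta)
    # a''_m = a_m - a'_m: items of the m-th element not covered by the prefix
    remainder = sorted(set(alpha_proj[m - 1]) - set(beta[-1]))
    tail = list(alpha_proj[m:])
    # a non-empty remainder is marked with a leading underscore, as in (_c)
    return (["_" + "".join(remainder)] if remainder else []) + tail


alpha_proj = ["a", "abc", "ac", "d", "cf"]       # <a(abc)(ac)d(cf)>
print(postfix(alpha_proj, ["a", "abc", "a"]))    # ['_c', 'd', 'cf'], i.e. <(_c)d(cf)>
```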
PrefixSpan – Algorithm
Input: a sequence database S and the minimum support threshold min_support.
Output: the complete set of sequential patterns.
Subroutine: PrefixSpan(α, L, S|α).
Parameters:
◦ α: a sequential pattern,
◦ L: the length of α,
◦ S|α: the α-projected database if α ≠ <>; otherwise, the sequence database S.
Initial call: PrefixSpan(<>, 0, S).
id   Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
PrefixSpan – Algorithm (2)
Method:
1. Scan S|α once and find the set of frequent items b such that:
◦ b can be assembled to the last element of α to form a sequential pattern; or
◦ <b> can be appended to α to form a sequential pattern.
2. For each frequent item b, append it to α to form a sequential pattern α' and output α'.
3. For each α', construct the α'-projected database S|α' and call PrefixSpan(α', L+1, S|α').
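The three steps can be sketched in Python. This is a sketch under stated assumptions, not the authors' implementation: sequences are lists of strings of single-character items, and instead of copying postfixes it keeps, for every sequence, the index of the element where the current pattern matched leftmost (a form of pseudo-projection). The length parameter L is dropped because it can be read off the pattern. The names `prefixspan` and `grow` are illustrative.

```python
from collections import defaultdict


def prefixspan(db, min_support):
    """Return {pattern: support}; a pattern is a tuple of sorted item tuples."""
    results = {}

    def grow(pattern, projected):
        # projected holds (sequence, i) pairs: i is the index of the element
        # where the last element of `pattern` matched leftmost (-1 if empty).
        last = set(pattern[-1]) if pattern else set()
        top = max(last) if last else None
        i_ext = defaultdict(dict)  # item -> {entry index: element index}
        s_ext = defaultdict(dict)
        for k, (seq, i) in enumerate(projected):
            for j in range(max(i, 0), len(seq)):
                element = set(seq[j])
                if last and last <= element:
                    # step 1, first case: b is assembled to the last element
                    for x in element:
                        if x > top:
                            i_ext[x].setdefault(k, j)
                if j > i:
                    # step 1, second case: <b> is appended as a new element
                    for x in element:
                        s_ext[x].setdefault(k, j)
        for table, assemble in ((i_ext, True), (s_ext, False)):
            for x, hits in sorted(table.items()):
                if len(hits) < min_support:
                    continue
                if assemble:
                    new = pattern[:-1] + (pattern[-1] + (x,),)
                else:
                    new = pattern + ((x,),)
                results[new] = len(hits)  # step 2: output the grown pattern
                # step 3: recurse on the (pseudo-)projected database
                grow(new, [(projected[k][0], j) for k, j in hits.items()])

    grow((), [(seq, -1) for seq in db])
    return results


db = [
    ["a", "abc", "ac", "d", "cf"],     # 10
    ["ad", "c", "bc", "ae"],           # 20
    ["ef", "ab", "df", "c", "b"],      # 30
    ["e", "g", "af", "c", "b", "c"],   # 40
]
patterns = prefixspan(db, min_support=2)
print(len(patterns))                          # 53 for this database
print(patterns[(("d",), ("c",), ("b",))])     # 2, the support of <dcb>
```

Storing an index per sequence avoids building a new database at every recursion level, at the cost of rescanning the tail of each sequence.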
PrefixSpan – Example
id   Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
1. Find the length-1 sequential patterns (item: support):
<a>: 4, <b>: 4, <c>: 4, <d>: 3, <e>: 3, <f>: 3, <g>: 1
With min_support = 2 the frequent items are <a>, <b>, <c>, <d>, <e>, <f>.
2. Divide the search space by prefix and build the projected databases:
Prefix   Projected (postfix) database
<a>      <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
<b>      <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
<c>      <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
<d>      <(cf)>, <c(bc)(ae)>, <(_f)cb>
<e>      <(_f)(ab)(df)cb>, <(af)cbc>
<f>      <(ab)(df)cb>, <cbc>
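Both steps of this example can be reproduced with a short Python sketch. It assumes elements are strings of single-character items in alphabetical order, removes infrequent items (here g) before projecting, and leaves out sequences whose postfix is empty. The function names `item_supports` and `project_on_item` are hypothetical.

```python
from collections import Counter


def item_supports(db):
    """Count, for every item, the number of sequences that contain it."""
    counts = Counter()
    for seq in db:
        counts.update({x for element in seq for x in element})
    return counts


def project_on_item(db, item, min_support):
    """Postfixes of all sequences w.r.t. the single-item prefix <item>."""
    counts = item_supports(db)
    frequent = {x for x, n in counts.items() if n >= min_support}
    projected = []
    for seq in db:
        # drop infrequent items and elements that become empty
        kept = ["".join(x for x in el if x in frequent) for el in seq]
        kept = [el for el in kept if el]
        for k, el in enumerate(kept):
            if item in el:  # first occurrence of the prefix item
                rest = "".join(x for x in el if x > item)
                post = (["_" + rest] if rest else []) + kept[k + 1:]
                if post:
                    projected.append(post)
                break
    return projected


db = [
    ["a", "abc", "ac", "d", "cf"],     # 10
    ["ad", "c", "bc", "ae"],           # 20
    ["ef", "ab", "df", "c", "b"],      # 30
    ["e", "g", "af", "c", "b", "c"],   # 40
]
print(sorted(item_supports(db).items()))
print(project_on_item(db, "d", 2))  # [['cf'], ['c', 'bc', 'ae'], ['_f', 'c', 'b']]
```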
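PrefixSpan – Example (2)
Find the subset of sequential patterns with prefix <d>.
<d>-projected database: <(cf)>, <c(bc)(ae)>, <(_f)cb>
Item supports in the <d>-projected database:
<a>: 1, <b>: 2, <c>: 3, <d>: 0, <e>: 1, <f>: 1, <_f>: 1
The frequent items are b and c, giving the patterns <db> and <dc>.
<db>-projected database: <(_c)(ae)>
No item is frequent there, so <db> is not extended.
<dc>-projected database: <(bc)(ae)>, <b>
Item supports in the <dc>-projected database:
<b>: 2, <a>: 1, <e>: 1, <c>: 1
The only frequent item is b, giving the pattern <dcb>, whose projected database is empty (<>).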
Conclusions
PrefixSpan:
◦ Efficient pattern-growth method.
◦ Outperforms both GSP and FreeSpan.
◦ Explores prefix-projection in sequential pattern mining.
◦ Mines the complete set of patterns, but reduces the effort of candidate subsequence generation.
◦ Prefix-projection reduces the size of the projected databases and leads to efficient processing.
◦ Bi-level projection and pseudo-projection may improve mining efficiency.
References
Pei J., Han J., Mortazavi-Asl B., Pinto H., Chen Q., Dayal U., Hsu M.-C., "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proceedings 17th Int. Conf. Data Engineering (ICDE'01), April 2001.
Agrawal R., Srikant R., "Mining Sequential Patterns", Proceedings 1995 Int. Conf. Data Engineering (ICDE'95), pp. 3-14, 1995.
Han J., Pei J., Mortazavi-Asl B., Chen Q., Dayal U., Hsu M.-C., "FreeSpan: Frequent pattern-projected sequential pattern mining", Proceedings 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD'00), pp. 355-359, 2000.
Wojciech Stach, "http://webdocs.ualberta.ca/~zaiane/courses/cmput69504/slides/PrefixSpan-Wojciech.pdf".
Thank you for your attention.
Questions?
Lazar Arsić, 2011/3301
al113301m@student.etf.rs