Multi-dimensional Sequential Pattern Mining
Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal

Outline
  • Why multidimensional sequential pattern mining?
  • Problem definition
  • Algorithms
  • Experimental results
  • Conclusions

Why Sequential Pattern Mining?
  • Sequential pattern mining: finding time-related frequent patterns (frequent subsequences)
  • Many data and applications are time-related
      • Customer shopping patterns, telephone calling patterns
          • E.g., first buy a computer, then CD-ROMs and software, within 3 months
      • Natural disasters (e.g., earthquake, hurricane)
      • Disease and treatment
      • Stock market fluctuation
      • Weblog click stream analysis
      • DNA sequence analysis

Motivating Example
  • Sequential patterns are useful
      • "free internet access → buy package 1 → upgrade to package 2"
      • Marketing, product design & development
  • Problems: lack of focus
      • Various groups of customers may have different patterns
  • MD-sequential pattern mining: integrate multidimensional analysis and sequential pattern mining

Sequences and Patterns
  • Given a set of sequences, find the complete set of frequent subsequences
  • A sequence database:
      SID  sequence
      10   <a(abc)(ac)d(cf)>
      20   <(ad)c(bc)(ae)>
      30   <(ef)(ab)(df)cb>
      40   <eg(af)cbc>
  • A sequence, e.g. <(ef)(ab)(df)cb>, is an ordered list of elements; items within an element are listed alphabetically
  • <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
  • Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
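The containment and support definitions on this slide can be checked with a short Python sketch. The helper names (`parse`, `is_subsequence`, `support`) are illustrative choices, not part of the slides, and the sketch assumes every item is a single character.

```python
def parse(text):
    """Turn '<a(abc)d>' into a list of item sets, one per element."""
    body, elements, i = text.strip("<>"), [], 0
    while i < len(body):
        if body[i] == "(":
            close = body.index(")", i)
            elements.append(frozenset(body[i + 1:close]))
            i = close + 1
        else:
            elements.append(frozenset(body[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """Greedy left-to-right match: each pattern element must be a subset
    of a sequence element that comes after the previous match."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


database = {sid: parse(text) for sid, text in {
    10: "<a(abc)(ac)d(cf)>",
    20: "<(ad)c(bc)(ae)>",
    30: "<(ef)(ab)(df)cb>",
    40: "<eg(af)cbc>",
}.items()}

print(is_subsequence(parse("<a(bc)dc>"), database[10]))  # True
print(support(parse("<(ab)c>"), database))               # 2
```

With min_sup = 2, a support of 2 is what makes <(ab)c> a sequential pattern here; only sequences 10 and 30 contain it.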

Sequential Pattern: Basics
  • A sequence database:
      Seq. ID  Sequence
      10       <(bd)cb(ac)>
      20       <(bf)(ce)b(fg)>
      30       <(ah)(bf)abf>
      40       <(be)(ce)d>
      50       <a(bd)bcb(ade)>
  • A sequence, e.g. <(bd)cb(ac)>, is an ordered list of elements
  • <ad(ae)> is a subsequence of <a(bd)bcb(ade)>
  • Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern

MD Sequence Database
      cid  Cust_grp      City      Age_grp  sequence
      10   Business      Boston    Middle   <(bd)cba>
      20   Professional  Chicago   Young    <(bf)(ce)(fg)>
      30   Business      Chicago   Middle   <(ah)abf>
      40   Education     New York  Retired  <(be)(ce)>
  • P = (*, Chicago, *, <bf>) matches tuples 20 and 30
  • If the support threshold is 2, P is an MD sequential pattern
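As a rough illustration of how an MD sequential pattern is matched against a tuple, the sketch below treats `*` as a wildcard on each dimension and reuses ordinary subsequence containment for the sequence part. The names `TABLE`, `contains` and `matching_cids` are hypothetical, and each element is stored as a string of single-character items.

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (elements are item strings)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


# cid -> ((Cust_grp, City, Age_grp), sequence), copied from the slide
TABLE = {
    10: (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    20: (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    30: (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    40: (("Education", "New York", "Retired"), ["be", "ce"]),
}


def matching_cids(dim_pattern, seq_pattern, table):
    """cids whose dimension values match (with '*' wildcards) and whose
    sequence contains seq_pattern."""
    hits = []
    for cid, (dims, sequence) in table.items():
        dims_match = all(p == "*" or p == v for p, v in zip(dim_pattern, dims))
        if dims_match and contains(sequence, seq_pattern):
            hits.append(cid)
    return hits


hits = matching_cids(("*", "Chicago", "*"), ["b", "f"], TABLE)
print(hits)       # [20, 30]
print(len(hits))  # support of P = 2
```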

Mining of MD Seq. Pat.
  • Embedding MD information into sequences
      • Using a uniform seq. pat. mining method
  • Integration of seq. pat. mining and MD analysis method

UniSeq
  • Embed MD information into sequences
      cid  Cust_grp      City      Age_grp  sequence
      10   Business      Boston    Middle   <(bd)cba>
      20   Professional  Chicago   Young    <(bf)(ce)(fg)>
      30   Business      Chicago   Middle   <(ah)abf>
      40   Education     New York  Retired  <(be)(ce)>
  • Mine the extended sequence database using sequential pattern mining methods
      cid  MD-extension of sequences
      10   <(Business, Boston, Middle)(bd)cba>
      20   <(Professional, Chicago, Young)(bf)(ce)(fg)>
      30   <(Business, Chicago, Middle)(ah)abf>
      40   <(Education, New York, Retired)(be)(ce)>
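A minimal sketch of the embedding step, assuming dimension values never collide with item names or with each other across dimensions (otherwise they would need a tag such as `City=Chicago`). The function names `md_extend` and `contains` are illustrative only.

```python
def md_extend(dims, sequence):
    """Prepend the dimension values as one extra element of the sequence."""
    return [frozenset(dims)] + [frozenset(element) for element in sequence]


def contains(sequence, pattern):
    """Ordinary subsequence test on lists of item sets."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


TABLE = [
    (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    (("Education", "New York", "Retired"), ["be", "ce"]),
]

extended_db = [md_extend(dims, seq) for dims, seq in TABLE]

# (*, Chicago, *, <bf>) becomes the plain sequential pattern <(Chicago) b f>
pattern = [frozenset(["Chicago"]), frozenset("b"), frozenset("f")]
uniseq_support = sum(contains(seq, pattern) for seq in extended_db)
print(uniseq_support)  # 2
```

Once the dimensions are embedded this way, any sequential pattern miner can be run on `extended_db` unchanged, which is the point of the uniform approach.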

Mine Sequential Patterns by Prefix Projections
      SID  sequence
      10   <a(abc)(ac)d(cf)>
      20   <(ad)c(bc)(ae)>
      30   <(ef)(ab)(df)cb>
      40   <eg(af)cbc>
  • Step 1: find length-1 sequential patterns
      • <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
      • The ones having prefix <a>;
      • The ones having prefix <b>;
      • …
      • The ones having prefix <f>
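Step 1 amounts to counting, for each item, the number of sequences it occurs in. The sketch below assumes min_sup = 2 (the threshold used on the earlier slides) and uses the hypothetical helper name `frequent_items`.

```python
from collections import Counter

# Each element is a string of single-character items
DATABASE = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def frequent_items(database, min_sup):
    """Items that occur in at least min_sup sequences (counted once per sequence)."""
    counts = Counter()
    for sequence in database.values():
        counts.update({item for element in sequence for item in element})
    return sorted(item for item, n in counts.items() if n >= min_sup)


length_1 = frequent_items(DATABASE, min_sup=2)
print(length_1)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Item g is dropped because it appears only in sequence 40, which leaves the six prefixes that partition the search space in Step 2.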

Find Seq. Patterns with Prefix <a>
      SID  sequence
      10   <a(abc)(ac)d(cf)>
      20   <(ad)c(bc)(ae)>
      30   <(ef)(ab)(df)cb>
      40   <eg(af)cbc>
  • Only need to consider projections w.r.t. <a>
      • <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      • Further partition into 6 subsets
          • Having prefix <aa>;
          • …
          • Having prefix <af>
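The <a>-projected database can be reproduced with a small sketch. It handles only a single-item prefix and assumes items inside an element are stored alphabetically; `project` and `show` are illustrative names, not functions from the PrefixSpan paper.

```python
DATABASE = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def project(sequence, item):
    """Suffix of sequence after the first occurrence of item.
    Items left over in the matched element are kept as a '_'-prefixed element."""
    for i, element in enumerate(sequence):
        if item in element:
            rest = element[element.index(item) + 1:]
            head = ["_" + rest] if rest else []
            return head + sequence[i + 1:]
    return None  # item does not occur: the sequence is not in the projection


def show(sequence):
    """Render a list of elements in the slide notation."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in sequence) + ">"


projected_a = [show(suffix)
               for suffix in (project(seq, "a") for seq in DATABASE.values())
               if suffix is not None]
print(projected_a)
# ['<(abc)(ac)d(cf)>', '<(_d)c(bc)(ae)>', '<(_b)(df)cb>', '<(_f)cbc>']
```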

Completeness of PrefixSpan
  • SDB:
      SID  sequence
      10   <a(abc)(ac)d(cf)>
      20   <(ad)c(bc)(ae)>
      30   <(ef)(ab)(df)cb>
      40   <eg(af)cbc>
  • Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
      • Having prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
          • Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
              • Having prefix <aa>: <aa>-proj. db …
              • …
              • Having prefix <af>: <af>-proj. db …
      • Having prefix <b>: <b>-projected database …
      • Having prefix <c>, …, <f>: …
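The prefix tree above can be walked with a simple depth-first pattern-growth sketch. To stay short it recounts support against the full database instead of building projected databases, so it is a brute-force stand-in for PrefixSpan that visits the same tree, not PrefixSpan itself. The names `grow`, `contains` and `show` are hypothetical, and min_sup = 2 is assumed.

```python
DATABASE = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (elements are item strings)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def show(pattern):
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in pattern) + ">"


def grow(prefix, database, items, min_sup, found):
    """Try every one-item extension of prefix and recurse on the frequent ones."""
    for item in items:
        candidates = [prefix + [item]]                 # item starts a new element
        if prefix and item > prefix[-1][-1]:           # item joins the last element
            candidates.append(prefix[:-1] + [prefix[-1] + item])
        for candidate in candidates:
            sup = sum(contains(seq, candidate) for seq in database)
            if sup >= min_sup:
                found[tuple(candidate)] = sup
                grow(candidate, database, items, min_sup, found)


items = sorted({i for seq in DATABASE for element in seq for i in element})
found = {}
grow([], DATABASE, items, 2, found)

length_2_a = sorted(show(p) for p in found
                    if sum(map(len, p)) == 2 and p[0][0] == "a")
print(length_2_a)  # ['<(ab)>', '<aa>', '<ab>', '<ac>', '<ad>', '<af>']
```

Every frequent pattern is reached exactly once because each of its prefixes is itself frequent, which is the completeness argument the tree illustrates.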

Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected databases
      • Can be improved by bi-level projections

Mining MD-Patterns
      cid  Cust_grp      City      Age_grp  sequence
      10   Business      Boston    Middle   <(bd)cba>
      20   Professional  Chicago   Young    <(bf)(ce)(fg)>
      30   Business      Chicago   Middle   <(ah)abf>
      40   Education     New York  Retired  <(be)(ce)>
  • MD pattern, e.g. (*, Chicago, *)
  • BUC processing over the lattice of dimension combinations:
      (cust-grp, city, age-grp)
      (cust-grp, city), (cust-grp, *, age-grp)
      (cust-grp, *, *), (*, city, *), (*, *, age-grp)
      All
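The following is a small recursive sketch in the spirit of BUC: start from "All", partition the rows on one dimension at a time, and recurse only into partitions that still meet the support threshold. It is a simplified illustration under the assumption min_sup = 2, not the full BUC algorithm, and the function name `buc` and its signature are chosen here for convenience.

```python
from collections import defaultdict

ROWS = [
    ("Business", "Boston", "Middle"),
    ("Professional", "Chicago", "Young"),
    ("Business", "Chicago", "Middle"),
    ("Education", "New York", "Retired"),
]


def buc(rows, start_dim, pattern, min_sup, out):
    """Record the current pattern, then specialise one more dimension at a time."""
    out[tuple(pattern)] = len(rows)
    for d in range(start_dim, len(pattern)):
        partitions = defaultdict(list)
        for row in rows:
            partitions[row[d]].append(row)
        for value, part in partitions.items():
            if len(part) >= min_sup:      # prune partitions below the threshold
                pattern[d] = value
                buc(part, d + 1, pattern, min_sup, out)
                pattern[d] = "*"          # restore the wildcard before moving on


md_patterns = {}
buc(ROWS, 0, ["*"] * 3, 2, md_patterns)
for pattern, sup in md_patterns.items():
    print(pattern, sup)
```

On the four tuples above this yields (*, Chicago, *) with support 2, alongside (*, *, *), (Business, *, *), (Business, *, Middle) and (*, *, Middle).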

Dim-Seq
      cid  Cust_grp      City      Age_grp  sequence
      10   Business      Boston    Middle   <(bd)cba>
      20   Professional  Chicago   Young    <(bf)(ce)(fg)>
      30   Business      Chicago   Middle   <(ah)abf>
      40   Education     New York  Retired  <(be)(ce)>
  • First find MD-patterns
      • E.g. (*, Chicago, *)
  • Form projected sequence database
      • <(bf)(ce)(fg)> and <(ah)abf> for (*, Chicago, *)
  • Find seq. pat. in projected database
      • E.g. (*, Chicago, *, <bf>)
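The second and third Dim-Seq steps can be sketched as follows, taking the MD-pattern (*, Chicago, *) as already found. A real implementation would run a sequential pattern miner on the projection; here only the support of the candidate <bf> is checked. `project_sequences` and `contains` are hypothetical names.

```python
TABLE = [
    (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    (("Education", "New York", "Retired"), ["be", "ce"]),
]


def project_sequences(dim_pattern, table):
    """Sequences of the tuples whose dimension values match dim_pattern."""
    return [seq for dims, seq in table
            if all(p == "*" or p == v for p, v in zip(dim_pattern, dims))]


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (elements are item strings)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


chicago_db = project_sequences(("*", "Chicago", "*"), TABLE)
print(chicago_db)  # [['bf', 'ce', 'fg'], ['ah', 'a', 'b', 'f']]

bf_support = sum(contains(seq, ["b", "f"]) for seq in chicago_db)
print(bf_support)  # 2, so (*, Chicago, *, <bf>) is an MD sequential pattern at min_sup = 2
```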

Seq-Dim
      cid  Cust_grp      City      Age_grp  sequence
      10   Business      Boston    Middle   <(bd)cba>
      20   Professional  Chicago   Young    <(bf)(ce)(fg)>
      30   Business      Chicago   Middle   <(ah)abf>
      40   Education     New York  Retired  <(be)(ce)>
  • Find sequential patterns
      • E.g. <bf>
  • Form projected MD-database
      • E.g. (Professional, Chicago, Young) and (Business, Chicago, Middle) for <bf>
  • Mine MD-patterns
      • E.g. (*, Chicago, *, <bf>)
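A sketch of the last two Seq-Dim steps, taking <bf> as an already mined sequential pattern. The MD-pattern step below enumerates every generalisation of every tuple by brute force, which is a simple substitute for BUC on this tiny example and not the method from the slides. `project_dims` and `md_patterns` are illustrative names, and min_sup = 2 is assumed.

```python
from collections import Counter
from itertools import product

TABLE = [
    (("Business", "Boston", "Middle"), ["bd", "c", "b", "a"]),
    (("Professional", "Chicago", "Young"), ["bf", "ce", "fg"]),
    (("Business", "Chicago", "Middle"), ["ah", "a", "b", "f"]),
    (("Education", "New York", "Retired"), ["be", "ce"]),
]


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (elements are item strings)."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def project_dims(seq_pattern, table):
    """Dimension tuples of the rows whose sequence contains seq_pattern."""
    return [dims for dims, seq in table if contains(seq, seq_pattern)]


def md_patterns(rows, min_sup):
    """Count every wildcard generalisation of every row; keep the frequent ones."""
    counts = Counter()
    for row in rows:
        for keep in product([True, False], repeat=len(row)):
            counts[tuple(v if k else "*" for v, k in zip(row, keep))] += 1
    return {p: n for p, n in counts.items() if n >= min_sup}


bf_rows = project_dims(["b", "f"], TABLE)
print(bf_rows)
# [('Professional', 'Chicago', 'Young'), ('Business', 'Chicago', 'Middle')]
print(md_patterns(bf_rows, min_sup=2))
```

The only non-trivial MD-pattern that survives is (*, Chicago, *), which combined with <bf> gives (*, Chicago, *, <bf>).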

Scalability Over Dimensionality [figure]

Scalability Over Cardinality [figure]

Scalability Over Support Threshold [figure]

Scalability Over Database Size [figure]

Pros & Cons of Algorithms
  • Seq-Dim is efficient and scalable
      • Fastest in most cases
  • UniSeq is also efficient and scalable
      • Fastest with low dimensionality
  • Dim-Seq has poor scalability

Conclusions
  • MD seq. pat. mining is interesting and useful
  • Mining MD seq. pat. efficiently
      • UniSeq, Dim-Seq, and Seq-Dim
  • Future work
      • Applications of sequential pattern mining

References (1)
  • R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, pages 487-499.
  • R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, pages 3-14.
  • C. Bettini, X. S. Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998.
  • M. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern mining with regular expression constraints. VLDB'99, pages 223-234.
  • J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, pages 106-115.
  • J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.

References (2)
  • J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, pages 1-12.
  • H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional intertransaction association rules. DMKD'98, pages 12:1-12:7.
  • H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
  • B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, pages 412-421.
  • J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, pages 215-224.
  • R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96, pages 3-17.