Classification 9/11/2021 Data Mining: Concepts and Techniques

  • Slides: 35
Classification 9/11/2021 Data Mining: Concepts and Techniques 1 1


Sequence Database
Object   Timestamp   Events
A        10          2, 3, 5
A        20          6, 1
A        23          1
B        11          4, 5, 6
B        17          2
B        21          7, 8, 1, 2
B        28          1, 6
C        14          1, 8, 7

Examples of Sequence Database
Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C
[Figure: a sequence drawn as a series of elements (transactions), each containing events (items) labelled E1 to E4]

Formal Definition of a Sequence
  • A sequence is an ordered list of elements (transactions): s = <e_1 e_2 e_3 …>
      • Each element contains a collection of events (items): e_i = {i_1, i_2, …, i_k}
      • Each element is attributed to a specific time or location
  • The length of a sequence, |s|, is the number of elements in the sequence
  • A k-sequence is a sequence that contains k events (items)
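
The two size measures above are easy to confuse, so here is a minimal sketch of both. The list-of-sets representation, the function names and the sample sequence are illustrative choices, not part of the slides.

```python
# A sequence is modelled as a list of elements; each element is a set of events.
# The sample sequence below is an illustrative assumption.
s = [{"a", "b"}, {"c", "d", "e"}, {"f"}, {"g", "h", "i"}]


def length(sequence):
    """|s|: the number of elements (transactions) in the sequence."""
    return len(sequence)


def num_events(sequence):
    """k: the total number of events, so that the sequence is a k-sequence."""
    return sum(len(element) for element in sequence)


print(length(s), num_events(s))
```

Under this representation the sample has length 4 but is a 9-sequence, because the length counts elements while k counts events.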

Formal Definition of a Subsequence
  • A sequence <a_1 a_2 … a_n> is contained in another sequence <b_1 b_2 … b_m> (m ≥ n) if there exist integers i_1 < i_2 < … < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n}
Data sequence              Subsequence       Contain?
< {2,4} {3,5,6} {8} >      < {2} {3,5} >     Yes
< {1,2} {3,4} >            < {1} {2} >       No
< {2,4} {2,4} {2,5} >      < {2} {4} >       Yes
  • The support of a subsequence w is the fraction of data sequences that contain w
  • A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)
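
The containment test and the support measure can be sketched as follows. This is an illustrative implementation rather than code from the slides: the function names are assumptions, and a greedy left-to-right match is used, which is sufficient because matching each element as early as possible never rules out a later match.

```python
def contains(data_seq, sub_seq):
    """True if sub_seq is contained in data_seq (order-preserving subset match)."""
    i = 0  # index of the next element of sub_seq still to be matched
    for element in data_seq:
        if i < len(sub_seq) and set(sub_seq[i]) <= set(element):
            i += 1
    return i == len(sub_seq)


def support(database, sub_seq):
    """Fraction of data sequences in the database that contain sub_seq."""
    return sum(contains(seq, sub_seq) for seq in database) / len(database)


# A toy database built from two of the data sequences used as examples.
database = [
    [{2, 4}, {3, 5, 6}, {8}],
    [{1, 2}, {3, 4}],
]
print(contains(database[0], [{2}, {3, 5}]))  # matched by {2,4} then {3,5,6}
print(contains(database[1], [{1}, {2}]))     # 1 and 2 occur in the same element
print(support(database, [{2}, {3, 5}]))
```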

Sequential Pattern Mining: Definition
  • Given:
      • a database of sequences
      • a user-specified minimum support threshold, minsup
  • Task:
      • Find all subsequences with support ≥ minsup

Sequential Pattern Mining: Challenge
  • Given a sequence: <{a b} {c d e} {f} {g h i}>
      • Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.
  • How many k-subsequences can be extracted from a given n-sequence?
      • Example with n = 9, k = 4 (Y = event kept, _ = event dropped):
            <{a b} {c d e} {f} {g h i}>
              Y _   _ Y Y   _   _ _ Y
            <{a}   {d e}        {i}>
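
Reading the keep/drop mask as "choose k of the n events", the number of k-subsequences is the binomial coefficient C(n, k). The sketch below enumerates them for the example sequence to make the count concrete. The function name is hypothetical, and the count assumes all events in the sequence are distinct, as they are here.

```python
from itertools import combinations
from math import comb

s = [["a", "b"], ["c", "d", "e"], ["f"], ["g", "h", "i"]]


def k_subsequences(sequence, k):
    """Enumerate every subsequence obtained by keeping exactly k events."""
    positions = [(ei, item) for ei, element in enumerate(sequence) for item in element]
    result = []
    for chosen in combinations(positions, k):
        kept = {}
        for ei, item in chosen:
            kept.setdefault(ei, []).append(item)  # regroup kept events by element
        result.append([kept[ei] for ei in sorted(kept)])
    return result


subs = k_subsequences(s, 4)
print(len(subs), comb(9, 4))
print([["a"], ["d", "e"], ["i"]] in subs)
```

Even this single 9-event sequence yields 126 subsequences of size 4, which is why exhaustive enumeration does not scale.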

Challenges on Sequential Pattern Mining
  • A huge number of possible sequential patterns are hidden in databases
  • A mining algorithm should
      • find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
      • be highly efficient and scalable, involving only a small number of database scans
      • be able to incorporate various kinds of user-specific constraints

Sequential Pattern Mining Algorithms
  • Concept introduction and an initial Apriori-like algorithm
      • Agrawal & Srikant. Mining sequential patterns, ICDE'95
  • Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal @ EDBT'96)
  • Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. @ KDD'00; Pei et al. @ ICDE'01)
  • Vertical format-based mining: SPADE (Zaki @ Machine Learning'00)
  • Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim @ VLDB'99; Pei, Han, Wang @ CIKM'02)
  • Mining closed sequential patterns: CloSpan (Yan, Han & Afshar @ SDM'03)

Extracting Sequential Patterns
  • Given n events: i_1, i_2, i_3, …, i_n
  • Candidate 1-subsequences: <{i_1}>, <{i_2}>, <{i_3}>, …, <{i_n}>
  • Candidate 2-subsequences: <{i_1, i_2}>, <{i_1, i_3}>, …, <{i_1} {i_1}>, <{i_1} {i_2}>, …, <{i_{n-1}} {i_n}>
  • Candidate 3-subsequences: <{i_1, i_2, i_3}>, <{i_1, i_2, i_4}>, …, <{i_1, i_2} {i_1}>, <{i_1, i_2} {i_2}>, …, <{i_1} {i_1, i_2}>, <{i_1} {i_1, i_3}>, …, <{i_1} {i_1} {i_1}>, <{i_1} {i_1} {i_2}>, …

Generalized Sequential Pattern (GSP)
  • Step 1:
      • Make the first pass over the sequence database D to yield all the 1-element frequent sequences
  • Step 2: Repeat until no new frequent sequences are found
      • Candidate Generation: merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
      • Candidate Pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences
      • Support Counting: make a new pass over the sequence database D to find the support for these candidate sequences
      • Candidate Elimination: eliminate candidate k-sequences whose actual support is less than minsup

Candidate Generation
  • Base case (k=2):
      • Merging two frequent 1-sequences <{i_1}> and <{i_2}> will produce two candidate 2-sequences: <{i_1} {i_2}> and <{i_1 i_2}>
  • General case (k>2):
      • A frequent (k-1)-sequence w_1 is merged with another frequent (k-1)-sequence w_2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w_1 is the same as the subsequence obtained by removing the last event in w_2
      • The resulting candidate after merging is given by the sequence w_1 extended with the last event of w_2
          • If the last two events in w_2 belong to the same element, then the last event in w_2 becomes part of the last element in w_1
          • Otherwise, the last event in w_2 becomes a separate element appended to the end of w_1

Candidate Generation Examples
  • Merging the sequences w_1 = <{1} {2 3} {4}> and w_2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}> because the last two events in w_2 (4 and 5) belong to the same element
  • Merging the sequences w_1 = <{1} {2 3} {4}> and w_2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}> because the last two events in w_2 (4 and 5) do not belong to the same element
  • We do not have to merge the sequences w_1 = <{1} {2 6} {4}> and w_2 = <{1} {2} {4 5}> to produce the candidate <{1} {2 6} {4 5}>, because if the latter is a viable candidate, then it can be obtained by merging w_1 with <{2 6} {4 5}>
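
The general-case merge rule (k > 2) can be sketched in a few lines. This is an illustrative reading of the rule, not the published GSP code: sequences are modelled as tuples of tuples with events kept in sorted order inside each element, the helper names are assumptions, and the k = 2 base case is deliberately left out because it needs separate handling.

```python
def drop_first(seq):
    """Remove the first event of the first element (dropping an emptied element)."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last event of the last element (dropping an emptied element)."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def merge(w1, w2):
    """Return the candidate produced by merging w1 with w2, or None if they do not join."""
    if drop_first(w1) != drop_last(w2):
        return None
    last_event = w2[-1][-1]
    if len(w2[-1]) > 1:
        # Last two events of w2 share an element: extend the last element of w1.
        return w1[:-1] + (w1[-1] + (last_event,),)
    # Otherwise the last event of w2 becomes a new element appended to w1.
    return w1 + ((last_event,),)


w1 = ((1,), (2, 3), (4,))
print(merge(w1, ((2, 3), (4, 5))))       # ((1,), (2, 3), (4, 5))
print(merge(w1, ((2, 3), (4,), (5,))))   # ((1,), (2, 3), (4,), (5,))
print(merge(((1,), (2, 6), (4,)), ((1,), (2,), (4, 5))))  # None: the pair does not join
```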

GSP Example

Finding Length-1 Sequential Patterns
  • Examine GSP using an example
  • Initial candidates: all singleton sequences
      • <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan the database once and count support for the candidates
Sequence database (min_sup = 2):
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Candidate supports:
Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
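
The first scan can be reproduced with a short counting pass. In this sketch each element is written as a string of single-letter items, which is an encoding chosen here for brevity; support is counted once per sequence, however many times an item occurs in it.

```python
from collections import Counter

# The five example sequences, one string per element (encoding is an assumption).
db = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}
min_sup = 2

counts = Counter()
for seq in db.values():
    # Each sequence contributes at most 1 to the support of an item.
    counts.update({item for element in seq for item in element})

frequent = sorted(item for item, sup in counts.items() if sup >= min_sup)
print(dict(sorted(counts.items())))
print(frequent)
```

Items g and h fall below min_sup = 2, which leaves six length-1 sequential patterns.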

GSP: Generating Length-2 Candidates
51 length-2 candidates
Two-element candidates <xy> (36):
       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>
One-element candidates <(xy)> (15):
       <b>      <c>      <d>      <e>      <f>
<a>    <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>             <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                      <(cd)>   <(ce)>   <(cf)>
<d>                               <(de)>   <(df)>
<e>                                        <(ef)>
  • Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
  • Apriori prunes 44.57% of the candidates
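
The candidate counts follow from a simple formula: n frequent items give n*n two-element candidates <xy> plus n(n-1)/2 one-element candidates <(xy)>. A quick check of the figures quoted above (the function name is illustrative):

```python
def num_length2_candidates(n):
    """n*n ordered pairs <xy> plus n*(n-1)/2 unordered pairs <(xy)>."""
    return n * n + n * (n - 1) // 2


with_apriori = num_length2_candidates(6)     # only the 6 frequent items
without_apriori = num_length2_candidates(8)  # all 8 items
pruned_pct = 100 * (without_apriori - with_apriori) / without_apriori
print(with_apriori, without_apriori, round(pruned_pct, 2))
```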

The GSP Mining Process (min_sup = 2)
  • 1st scan: 8 candidates, 6 length-1 sequential patterns: <a> <b> <c> <d> <e> <f> <g> <h>
  • 2nd scan: 51 candidates, 19 length-2 sequential patterns, 10 candidates not in the DB at all: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
  • 3rd scan: 47 candidates, 19 length-3 sequential patterns, 20 candidates not in the DB at all: <abb> <aab> <aba> <bab> …
  • 4th scan: 8 candidates, 6 length-4 sequential patterns: <abba> <(bd)bc> …
  • 5th scan: 1 candidate, 1 length-5 sequential pattern: <(bd)cba>
  • Candidates are discarded either because they cannot pass the support threshold or because they do not occur in the DB at all
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Candidate Generate-and-test: Drawbacks
  • A huge set of candidate sequences is generated
      • Especially 2-item candidate sequences
  • Multiple scans of the database are needed
      • The length of each candidate grows by one at each database scan
  • Inefficient for mining long sequential patterns
      • A long pattern grows from short patterns
      • The number of short patterns is exponential in the length of the mined patterns

The SPADE Algorithm
  • SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki, 2001
  • A vertical format sequential pattern mining method
  • A sequence database is mapped to a large set of
      • Item: <SID, EID>
  • Sequential pattern mining is performed by
      • growing the subsequences (patterns) one item at a time by Apriori candidate generation
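
The vertical format can be illustrated with a small sketch. It covers only the id-list idea, not the full SPADE algorithm with its equivalence classes: each item maps to a list of (SID, EID) pairs, and the id-list of a longer pattern is obtained by joining the id-lists of shorter ones. The sample database, the string encoding of elements and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Sample database: SID -> list of elements, one string of items per element.
db = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}

# Vertical format: item -> [(SID, EID), ...], with EID the 1-based element position.
idlists = defaultdict(list)
for sid, seq in db.items():
    for eid, element in enumerate(seq, start=1):
        for item in element:
            idlists[item].append((sid, eid))


def temporal_join(x, y):
    """Id-list of <x y>: occurrences of y strictly after some x in the same sequence."""
    return sorted({(sy, ey) for sx, ex in x for sy, ey in y if sx == sy and ex < ey})


def support(idlist):
    """Number of distinct sequences in which the pattern occurs."""
    return len({sid for sid, _ in idlist})


print(idlists["a"])
print(support(temporal_join(idlists["a"], idlists["b"])))  # support of <ab>
print(support(temporal_join(idlists["b"], idlists["a"])))  # support of <ba>
```

Because support is read off the joined id-lists, no further scan of the original database is needed once the vertical format has been built.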


Bottlenecks of GSP and SPADE
  • A huge set of candidates could be generated
      • 1,000 frequent length-1 sequences generate a huge number of length-2 candidates!
  • Multiple scans of the database in mining
  • Mining long sequential patterns
      • Needs an exponential number of short candidates
      • A length-100 sequential pattern needs about 10^30 candidate sequences!
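
Both magnitudes can be checked with a few lines of arithmetic. The sketch assumes the usual counting arguments: n frequent items give n*n + n(n-1)/2 length-2 candidates, and every non-empty sub-selection of a length-100 pattern's items must itself have been a candidate, giving the sum of C(100, i) for i = 1..100.

```python
from math import comb

n = 1000
length2_candidates = n * n + n * (n - 1) // 2

# Every non-empty sub-selection of the 100 items: sum of C(100, i) = 2**100 - 1.
short_candidates = sum(comb(100, i) for i in range(1, 101))

print(f"{length2_candidates:,}")
print(f"{short_candidates:.3e}")
```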

Prefix and Suffix (Projection)
  • <a>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
  • Given sequence <a(abc)(ac)d(cf)>:
Prefix     Suffix (prefix-based projection)
<a>        <(abc)(ac)d(cf)>
<aa>       <(_bc)(ac)d(cf)>
<a(ab)>    <(_c)(ac)d(cf)>

Mining Sequential Patterns by Prefix Projections
  • Step 1: find length-1 sequential patterns
      • <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
      • The ones having prefix <a>;
      • The ones having prefix <b>;
      • …
      • The ones having prefix <f>
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Finding Seq. Patterns with Prefix <a>
  • Only need to consider projections w.r.t. <a>
      • <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      • Further partition into 6 subsets
          • Having prefix <aa>;
          • …
          • Having prefix <af>
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
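
The <a>-projected database can be reproduced with a small projection routine. This is a minimal sketch for a single-item prefix only, not the full PrefixSpan recursion: it assumes items inside an element are kept in alphabetical order, and the function names are illustrative.

```python
db = {
    10: [["a"], ["a", "b", "c"], ["a", "c"], ["d"], ["c", "f"]],
    20: [["a", "d"], ["c"], ["b", "c"], ["a", "e"]],
    30: [["e", "f"], ["a", "b"], ["d", "f"], ["c"], ["b"]],
    40: [["e"], ["g"], ["a", "f"], ["c"], ["b"], ["c"]],
}


def project(sequence, item):
    """Suffix of `sequence` after the first occurrence of `item`, or None if absent."""
    for pos, element in enumerate(sequence):
        if item in element:
            rest = [x for x in element if x > item]  # items after `item` in that element
            suffix = [list(e) for e in sequence[pos + 1:]]
            if rest:
                suffix.insert(0, ["_"] + rest)  # "_" marks a partially consumed element
            return suffix
    return None


def fmt(sequence):
    """Render a sequence in the <a(bc)d> notation."""
    return "<" + "".join(e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in sequence) + ">"


projected = [fmt(project(seq, "a")) for seq in db.values()]
print(projected)
```

Each projected sequence keeps only what can still extend the prefix, which is why the projected databases shrink as the prefix grows.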

Completeness of PrefixSpan
SDB:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
  • Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
      • Having prefix <a>: <a>-projected database <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
          • Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
              • Having prefix <aa>: <aa>-projected database …
              • …
              • Having prefix <af>: <af>-projected database …
      • Having prefix <b>: <b>-projected database …
      • Having prefix <c>, …, <f>: …

Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected databases
      • Can be improved by pseudo-projections

Speed-up by Pseudo-projection
  • Major cost of PrefixSpan: projection
      • Postfixes of sequences often appear repeatedly in recursive projected databases
  • When the (projected) database can be held in main memory, use pointers to form projections
      • Pointer to the sequence
      • Offset of the postfix
Example with s = <a(abc)(ac)d(cf)>:
Prefix   Pseudo-projection           Postfix
<a>      s|<a>:  (pointer to s, 2)   <(abc)(ac)d(cf)>
<ab>     s|<ab>: (pointer to s, 4)   <(_c)(ac)d(cf)>
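
A pseudo-projection stores only a reference to the sequence and an offset. The sketch below flattens s into (element index, item) pairs so that a 1-based offset identifies where the postfix starts; the postfix is materialised only when rendered for display. The flattening scheme and the function name are assumptions made for illustration.

```python
from itertools import groupby

s = [["a"], ["a", "b", "c"], ["a", "c"], ["d"], ["c", "f"]]

# Flatten once: position 1 is the first 'a', position 2 the 'a' in (abc), and so on.
flat = [(ei, item) for ei, element in enumerate(s) for item in element]


def render(flat_seq, offset):
    """Render the postfix that starts at the 1-based `offset` of the flattened sequence."""
    start = offset - 1
    # The postfix begins mid-element if the previous item belongs to the same element.
    partial = start > 0 and flat_seq[start - 1][0] == flat_seq[start][0]
    parts = []
    for k, (_, group) in enumerate(groupby(flat_seq[start:], key=lambda p: p[0])):
        items = [item for _, item in group]
        if k == 0 and partial:
            items.insert(0, "_")
        parts.append(items[0] if len(items) == 1 else "(" + "".join(items) + ")")
    return "<" + "".join(parts) + ">"


print(render(flat, 2))  # postfix for prefix <a>
print(render(flat, 4))  # postfix for prefix <ab>
```

Deeper projections only change the offset, so recursive projected databases share the same underlying sequence data instead of copying postfixes.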

Pseudo-Projection vs. Physical Projection
  • Pseudo-projection avoids physically copying postfixes
      • Efficient in running time and space when the database can be held in main memory
  • However, it is not efficient when the database cannot fit in main memory
      • Disk-based random accessing is very costly
  • Suggested approach:
      • Integration of physical and pseudo-projection
      • Swapping to pseudo-projection when the data set fits in memory

Performance on Data Set C10T8S8I8

CloSpan: Mining Closed Sequential Patterns
  • A closed sequential pattern s: there exists no superpattern s' such that s' ⊃ s, and s' and s have the same support
  • Motivation: reduces the number of (redundant) patterns but attains the same expressive power
  • Uses Backward Subpattern and Backward Superpattern pruning to prune redundant search space
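
The definition of a closed pattern can be made concrete with a naive post-filter over an already mined pattern set. This is explicitly not CloSpan, which prunes the search space during mining; it only illustrates the definition. The pattern set and its supports are hypothetical, and the function names are illustrative.

```python
def contains(data_seq, sub_seq):
    """True if sub_seq is contained in data_seq (order-preserving subset match)."""
    i = 0
    for element in data_seq:
        if i < len(sub_seq) and sub_seq[i] <= element:
            i += 1
    return i == len(sub_seq)


def closed_patterns(patterns):
    """Keep patterns that have no proper superpattern with the same support."""
    closed = {}
    for p, sup in patterns.items():
        absorbed = any(
            q != p and sup_q == sup and contains(q, p)
            for q, sup_q in patterns.items()
        )
        if not absorbed:
            closed[p] = sup
    return closed


fs = frozenset
# Hypothetical frequent patterns with their supports (not taken from the slides).
patterns = {
    (fs("a"),): 3,
    (fs("a"), fs("b")): 3,
    (fs("b"),): 4,
    (fs("a"), fs("b"), fs("c")): 2,
}
closed = closed_patterns(patterns)
print(len(patterns), len(closed))
```

Here <a> is dropped because its superpattern <ab> has the same support, so nothing is lost by reporting <ab> alone.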

Constraint-Based Seq.-Pattern Mining
  • Constraint-based sequential pattern mining
      • Constraints: user-specified, for focused mining of desired patterns
      • How to explore efficient mining with constraints? Optimization
  • Classification of constraints
      • Anti-monotone: e.g., value_sum(S) < 150, min(S) > 10
      • Monotone: e.g., count(S) > 5, S ⊇ {PC, digital_camera}
      • Succinct: e.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office, MS/Money}
      • Convertible: e.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) - min(S) > 5
      • Inconvertible: e.g., avg(S) - median(S) = 0

From Sequential Patterns to Structured Patterns
  • Sets, sequences, trees, graphs, and other structures
      • Transaction DB: sets of items: {{i_1, i_2, …, i_m}, …}
      • Seq. DB: sequences of sets: {<{i_1, i_2}, …, {i_m, i_n, i_k}>, …}
      • Sets of sequences: {{<i_1, i_2>, …, <i_m, i_n, i_k>}, …}
      • Sets of trees: {t_1, t_2, …, t_n}
      • Sets of graphs (mining for frequent subgraphs): {g_1, g_2, …, g_n}
  • Mining structured patterns in XML documents, biochemical structures, etc.

Episodes and Episode Pattern Mining
  • Other methods for specifying the kinds of patterns
      • Serial episodes: A → B
      • Parallel episodes: A & B
      • Regular expressions: (A | B)C*(D → E)
  • Methods for episode pattern mining
      • Variations of Apriori-like algorithms, e.g., GSP
      • Database projection-based pattern growth
          • Similar to frequent pattern growth without candidate generation

Periodicity Analysis
  • Periodicity is everywhere: tides, seasons, daily power consumption, etc.
  • Full periodicity
      • Every point in time contributes (precisely or approximately) to the periodicity
  • Partial periodicity: a more general notion
      • Only some segments contribute to the periodicity
          • Jim reads the NY Times 7:00-7:30 am every weekday
  • Cyclic association rules
      • Associations which form cycles
  • Methods
      • Full periodicity: FFT, other statistical analysis methods
      • Partial and cyclic periodicity: variations of Apriori-like mining methods
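
For full periodicity, the frequency-domain idea can be sketched with a naive discrete Fourier transform standing in for an FFT. The O(n^2) loop computes the same spectrum an FFT would, only more slowly; the synthetic signal and the function name are assumptions made for illustration.

```python
import cmath
import math


def dominant_period(signal):
    """Return the period (in samples) of the strongest non-constant frequency component."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the constant (zero-frequency) part
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered))
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return None if best_k is None else n / best_k


# Synthetic, fully periodic signal: one cycle every 8 samples, 64 samples in total.
signal = [math.sin(2 * math.pi * t / 8) for t in range(64)]
print(dominant_period(signal))
```

This works when every time point contributes to the cycle; partial periodicity, where only some segments repeat, calls for the pattern-mining style methods named above.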

Ref: Mining Sequential Patterns
  • R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
  • H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.
  • M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
  • J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE'04).
  • J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM'02.
  • X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
  • J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE'04.
  • H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD'04.
  • J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
  • J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD'00.