Sequential Pattern Mining
CS 685: Special Topics in Data Mining
The University of Kentucky

Sequential Pattern Mining
  • Why sequential pattern mining?
  • GSP algorithm
  • PrefixSpan

Sequence Database

  Object  Timestamp  Events
  A       10         2, 3, 5
  A       20         6, 1
  A       23         1
  B       11         4, 5, 6
  B       17         2
  B       21         7, 8, 1, 2
  B       28         1, 6
  C       14         1, 8, 7

Examples of Sequence Database

Customer
  Sequence: purchase history of a given customer
  Element (transaction): a set of items bought by a customer at time t
  Event (item): books, dairy products, CDs, etc.

Web data
  Sequence: browsing activity of a particular Web visitor
  Element (transaction): a collection of files viewed by a Web visitor after a single mouse click
  Event (item): home page, index page, contact info, etc.

Event data
  Sequence: history of events generated by a given sensor
  Element (transaction): events triggered by a sensor at time t
  Event (item): types of alarms generated by sensors

Genome sequences
  Sequence: DNA sequence of a particular species
  Element (transaction): an element of the DNA sequence
  Event (item): bases A, T, G, C

[Figure: a sequence drawn as a series of elements (transactions), each holding events (items) taken from E1, E2, E3, E4.]

Formal Definition of a Sequence
  • A sequence is an ordered list of elements (transactions): s = <e1 e2 e3 …>
  • Each element contains a collection of events (items): ei = {i1, i2, …, ik}
  • Each element is attributed to a specific time or location
  • The length of a sequence, |s|, is given by the number of elements of the sequence
  • A k-sequence is a sequence that contains k events (items)
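The two size measures are easy to mix up, so here is a minimal Python sketch of both. The list-of-sets representation, the helper names `length` and `num_events`, and the sample sequence are illustrative assumptions, not part of the slides.

```python
# A sequence is modeled as a list of elements; each element is a set of items.
# Illustrative sequence: <a(abc)(ac)d(cf)>
s = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]

def length(seq):
    """|s|: the number of elements (transactions) in the sequence."""
    return len(seq)

def num_events(seq):
    """k for a k-sequence: the total number of events (items) over all elements."""
    return sum(len(element) for element in seq)

print(length(s))      # 5
print(num_events(s))  # 9
```

So this sequence has length 5 but is a 9-sequence.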

What Is Sequential Pattern Mining?
  • Given a set of sequences, find the complete set of frequent subsequences

A sequence database:

  SID  Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

  • A sequence: <(ef)(ab)(df)cb>
  • An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
  • <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
  • Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
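The subsequence test and the support count can be sketched in a few lines of Python. This is only an illustration: the encoding (each element written as a string of its items) and the names `is_subsequence` and `support` are assumptions made here.

```python
# Each sequence is a list of elements; an element is a string of its items.
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

def is_subsequence(pattern, seq):
    """True if every pattern element is a subset of a distinct element of seq,
    with the matched elements appearing in the same order."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest element of seq that contains this one.
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(pattern, db):
    """Number of sequences in db that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db.values())

print(is_subsequence(["a", "bc", "d", "c"], DB[10]))  # True
print(support(["ab", "c"], DB))                       # 2
```

With min_sup = 2, the support of 2 for <(ab)c> (sequences 10 and 30) is what makes it a sequential pattern.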

Sequential Pattern Mining: Definition
  • Given:
      a database of sequences
      a user-specified minimum support threshold, minsup
  • Task: find all subsequences with support ≥ minsup

Sequential Pattern Mining: Challenge
  • Given a sequence: <{a b} {c d e} {f} {g h i}>
  • Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.
  • How many k-subsequences can be extracted from a given n-sequence?

    <{a b} {c d e} {f} {g h i}>     n = 9
      Y _   _ Y Y   _   _ _ Y       k = 4
    <{a}     {d e}          {i}>
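One way to answer the question, assuming a k-subsequence is any choice of k of the n events with their order and element grouping preserved, is to count the selections directly: there are C(n, k) of them. The Python sketch below enumerates them for the example; the helper name `build` is hypothetical.

```python
from itertools import combinations
from math import comb

seq = [("a", "b"), ("c", "d", "e"), ("f",), ("g", "h", "i")]

# Flatten to (element index, item) pairs so a selection remembers its element.
events = [(idx, item) for idx, element in enumerate(seq) for item in element]
n, k = len(events), 4

def build(chosen):
    """Regroup the chosen events into elements, dropping emptied elements."""
    grouped = {}
    for idx, item in chosen:
        grouped.setdefault(idx, []).append(item)
    return tuple(tuple(items) for _, items in sorted(grouped.items()))

subsequences = {build(chosen) for chosen in combinations(events, k)}

print(n, comb(n, k), len(subsequences))  # 9 126 126
```

All 126 selections are distinct here because no item repeats in the sequence; with repeated items, some selections would coincide. The mask Y _ _ Y Y _ _ _ Y corresponds to <{a} {d e} {i}>.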

Challenges on Sequential Pattern Mining
  • A huge number of possible sequential patterns are hidden in databases
  • A mining algorithm should:
      find the complete set of patterns satisfying the minimum support (frequency) threshold
      be highly efficient and scalable, involving only a small number of database scans
      be able to incorporate various kinds of user-specific constraints

A Basic Property of Sequential Patterns: Apriori
  • A basic property: Apriori (Agrawal & Srikant '94)
  • If a sequence S is not frequent, then none of the super-sequences of S is frequent
  • E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent as well

Given support threshold min_sup = 2

  Seq. ID  Sequence
  10       <(bd)cb(ac)>
  20       <(bf)(ce)b(fg)>
  30       <(ah)(bf)abf>
  40       <(be)(ce)d>
  50       <a(bd)bcb(ade)>
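The <hb> example can be checked directly against the table. The sketch below is illustrative only; the string-per-element encoding and the helper names are assumptions made for this example.

```python
# Each sequence is a list of elements; an element is a string of its items.
DB = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}

def is_subsequence(pattern, seq):
    """Greedy in-order matching of pattern elements to elements of seq."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(pattern, db):
    return sum(is_subsequence(pattern, seq) for seq in db.values())

min_sup = 2
sub = ["h", "b"]                           # <hb>
supers = [["h", "a", "b"], ["ah", "b"]]    # <hab>, <(ah)b>

print(support(sub, DB))                    # 1
print([support(p, DB) for p in supers])    # [1, 1]
```

Only sequence 30 contains h, so <hb> has support 1, below min_sup. Any sequence containing a super-sequence also contains <hb>, so a super-sequence can never have higher support than <hb>.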

Basic Algorithm: Breadth-First Search (GSP)

  L = 1
  While (Result_L != NULL)
      Candidate Generate
      Prune
      Test
      L = L + 1
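The loop can be sketched in Python as below. This is a simplified illustration, not GSP itself: candidates are generated by extending each frequent pattern with one frequent item, which stands in for GSP's join step, and all names (`extend`, `drop_one`, `levelwise`) are hypothetical. Patterns are tuples of elements, each element a string of sorted items, and the database is the five-sequence example used in the deck.

```python
DB = [
    ["bd", "c", "b", "ac"],
    ["bf", "ce", "b", "fg"],
    ["ah", "bf", "a", "b", "f"],
    ["be", "ce", "d"],
    ["a", "bd", "b", "c", "b", "ade"],
]

def is_subsequence(pattern, seq):
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(pattern, db):
    return sum(is_subsequence(pattern, seq) for seq in db)

def extend(pattern, items):
    """Candidates one item longer than pattern."""
    for x in items:
        yield pattern + (x,)                          # new last element
        if x > pattern[-1][-1]:
            yield pattern[:-1] + (pattern[-1] + x,)   # grow last element

def drop_one(pattern):
    """Every subsequence obtained by deleting exactly one item."""
    for i, element in enumerate(pattern):
        for j in range(len(element)):
            rest = element[:j] + element[j + 1:]
            yield pattern[:i] + ((rest,) if rest else ()) + pattern[i + 1:]

def levelwise(db, min_sup):
    items = sorted({x for seq in db for element in seq for x in element})
    level = [(x,) for x in items if support((x,), db) >= min_sup]
    frequent_items = [p[0] for p in level]
    result = []
    while level:                                      # While (Result_L != NULL)
        result.append(level)
        known = set(level)
        # Candidate generate
        candidates = {c for p in level for c in extend(p, frequent_items)}
        # Prune: Apriori, every one-item-shorter subsequence must be frequent
        candidates = {c for c in candidates
                      if all(s in known for s in drop_one(c))}
        # Test: one pass over the database per level
        level = sorted(c for c in candidates if support(c, db) >= min_sup)
    return result

levels = levelwise(DB, min_sup=2)
print(levels[0])  # [('a',), ('b',), ('c',), ('d',), ('e',), ('f',)]
```

Each entry of `levels` holds the frequent patterns with one more item than the previous entry, mirroring one database scan per level.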

Finding Length-1 Sequential Patterns
  • Initial candidates: all singleton sequences <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
  • Scan database once, count support for candidates

min_sup = 2

  Seq. ID  Sequence
  10       <(bd)cb(ac)>
  20       <(bf)(ce)b(fg)>
  30       <(ah)(bf)abf>
  40       <(be)(ce)d>
  50       <a(bd)bcb(ade)>

  Cand  Sup
  <a>   3
  <b>   5
  <c>   4
  <d>   3
  <e>   3
  <f>   2
  <g>   1
  <h>   1
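The single scan can be reproduced with a counter. In this sketch (an illustration, with an assumed string-per-element encoding) each item is counted at most once per sequence, since support counts sequences rather than occurrences.

```python
from collections import Counter

DB = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}
min_sup = 2

counts = Counter()
for seq in DB.values():
    # set(...) ensures an item is counted once per sequence.
    counts.update(set("".join(seq)))

frequent = sorted(item for item, n in counts.items() if n >= min_sup)

print(sorted(counts.items()))
# [('a', 3), ('b', 5), ('c', 4), ('d', 3), ('e', 3), ('f', 2), ('g', 1), ('h', 1)]
print(frequent)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

Items g and h fall below min_sup, leaving six length-1 sequential patterns.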

Generating Length-2 Candidates

51 length-2 candidates:

         <a>   <b>   <c>   <d>   <e>   <f>
  <a>   <aa>  <ab>  <ac>  <ad>  <ae>  <af>
  <b>   <ba>  <bb>  <bc>  <bd>  <be>  <bf>
  <c>   <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
  <d>   <da>  <db>  <dc>  <dd>  <de>  <df>
  <e>   <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
  <f>   <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

         <b>     <c>     <d>     <e>     <f>
  <a>   <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
  <b>           <(bc)>  <(bd)>  <(be)>  <(bf)>
  <c>                   <(cd)>  <(ce)>  <(cf)>
  <d>                           <(de)>  <(df)>
  <e>                                   <(ef)>

  • Without the Apriori property: 8*8 + 8*7/2 = 92 candidates
  • Apriori prunes 44.57% of the candidates
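The candidate counts follow from enumerating ordered pairs <xy> plus unordered pairs <(xy)>. A short Python sketch, with the function name `length2_candidates` chosen here for illustration:

```python
from itertools import combinations, product

def length2_candidates(items):
    """All length-2 candidates over the given items."""
    two_elements = [f"<{x}{y}>" for x, y in product(items, repeat=2)]   # <xy>
    one_element = [f"<({x}{y})>" for x, y in combinations(items, 2)]    # <(xy)>
    return two_elements + one_element

with_apriori = length2_candidates("abcdef")       # 6 frequent items
without_apriori = length2_candidates("abcdefgh")  # all 8 items
pruned = 1 - len(with_apriori) / len(without_apriori)

print(len(with_apriori), len(without_apriori), f"{pruned:.2%}")  # 51 92 44.57%
```

In general, n frequent items give n*n + n*(n-1)/2 length-2 candidates.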

The Mining Process

  1st scan: 8 cand., 6 length-1 seq. pat.
            <a> <b> <c> <d> <e> <f> <g> <h>
  2nd scan: 51 cand., 19 length-2 seq. pat.; 10 cand. not in DB at all
            <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
  3rd scan: 46 cand., 19 length-3 seq. pat.; 20 cand. not in DB at all
            <abb> <aab> <aba> <bab> …
  4th scan: 8 cand., 6 length-4 seq. pat.
            <abba> <(bd)bc> …
  5th scan: 1 cand., 1 length-5 seq. pat.
            <(bd)cba>

Candidates are dropped either because they cannot pass the support threshold or because they do not occur in the DB at all.

min_sup = 2

  Seq. ID  Sequence
  10       <(bd)cb(ac)>
  20       <(bf)(ce)b(fg)>
  30       <(ah)(bf)abf>
  40       <(be)(ce)d>
  50       <a(bd)bcb(ade)>

Candidate Generate-and-Test: Drawbacks
  • A huge set of candidate sequences is generated, especially 2-item candidate sequences
  • Multiple scans of the database are needed
  • Inefficient for mining long sequential patterns:
      a long pattern grows from short patterns
      the number of short patterns is exponential in the length of the mined patterns

(Slide footer: 07 September 2021, Data Mining: Concepts and Techniques)

Bottlenecks of GSP
  • A huge set of candidates could be generated
      1,000 frequent length-1 sequences generate a huge number of length-2 candidates!
  • Multiple scans of the database in mining
      The length of each candidate grows by one at each database scan
  • Mining long sequential patterns
      Needs an exponential number of short candidates
      A length-100 sequential pattern needs 10^30 candidate sequences!
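Both figures can be made concrete with a little arithmetic. The sketch below rests on two assumptions: length-2 candidates are counted as in the deck's earlier 8*8 + 8*7/2 example, and the 10^30 figure counts every nonempty sub-pattern of a length-100 pattern, i.e. the sum of C(100, i) for i = 1..100.

```python
from math import comb

def length2_candidates(n):
    """n*n candidates of the form <xy> plus n*(n-1)/2 of the form <(xy)>."""
    return n * n + n * (n - 1) // 2

print(length2_candidates(1000))  # 1499500

# Every nonempty sub-pattern of a length-100 pattern (assumed reading).
total = sum(comb(100, i) for i in range(1, 101))
print(total == 2**100 - 1)       # True
print(f"{total:.2e}")            # 1.27e+30
```

Under these assumptions, 1,000 frequent items already produce about 1.5 million length-2 candidates, and 2^100 - 1 is on the order of 10^30.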

Pattern Growth (PrefixSpan)
  • Prefix and suffix (projection)
  • <a>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Given sequence <a(abc)(ac)d(cf)>:

  Prefix    Suffix (prefix-based projection)
  <a>       <(abc)(ac)d(cf)>
  <a(ab)>   <(_c)(ac)d(cf)>
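The prefix/suffix split can be sketched as follows. The sketch assumes the given prefix literally begins the sequence, as in the examples above, and that each element is written as a string of alphabetically sorted items; `suffix` and `show` are illustrative names.

```python
def suffix(seq, prefix):
    """Suffix of seq with respect to prefix; prefix must begin the sequence."""
    m = len(prefix)
    *full, last = prefix
    if seq[:m - 1] != full or not seq[m - 1].startswith(last):
        raise ValueError("not a prefix of the sequence")
    # Items left over in the partially consumed element are marked with "_".
    rest = seq[m - 1][len(last):]
    head = ["_" + rest] if rest else []
    return head + seq[m:]

def show(seq):
    """Render in the slide notation, e.g. <(_c)(ac)d(cf)>."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in seq) + ">"

s = ["a", "abc", "ac", "d", "cf"]      # <a(abc)(ac)d(cf)>

print(show(suffix(s, ["a"])))          # <(abc)(ac)d(cf)>
print(show(suffix(s, ["a", "ab"])))    # <(_c)(ac)d(cf)>
print(show(suffix(s, ["a", "abc"])))   # <(ac)d(cf)>
```

The underscore in (_c) records that c shares its element with the last items of the prefix.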

Mining Sequential Patterns by Prefix Projections
  • Step 1: find length-1 sequential patterns
      <a>, <b>, <c>, <d>, <e>, <f>
  • Step 2: divide the search space. The complete set of seq. pat. can be partitioned into 6 subsets:
      the ones having prefix <a>;
      the ones having prefix <b>;
      …
      the ones having prefix <f>

  SID  Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

Finding Seq. Patterns with Prefix <a>
  • Only need to consider projections w.r.t. <a>
      <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  • Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      Further partition into 6 subsets:
        having prefix <aa>;
        …
        having prefix <af>

  SID  Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
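This step can be sketched for a single-item prefix as below. It is an illustration under stated assumptions, not a full PrefixSpan: only one level of growth is handled, elements are strings of sorted items, and `project` and `length2_patterns` are hypothetical names.

```python
from collections import Counter

DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

def project(seq, item):
    """Suffix of seq after the first occurrence of item, or None if absent."""
    for i, element in enumerate(seq):
        if item in element:
            rest = element[element.index(item) + 1:]
            return (["_" + rest] if rest else []) + seq[i + 1:]
    return None

def length2_patterns(db, item, min_sup):
    """Frequent patterns <item x> and <(item x)> found in the projected database."""
    seq_ext, set_ext = Counter(), Counter()
    for seq in db.values():
        suffix = project(seq, item)
        if suffix is None:
            continue
        later, same = set(), set()
        for element in suffix:
            if element.startswith("_"):
                same.update(element[1:])       # shares an element with item
            else:
                later.update(element)          # occurs in a later element
                if item in element:            # item recurs: (item x) possible
                    same.update(x for x in element if x > item)
        seq_ext.update(later)
        set_ext.update(same)
    found = [f"<{item}{x}>" for x, n in seq_ext.items() if n >= min_sup]
    found += [f"<({item}{x})>" for x, n in set_ext.items() if n >= min_sup]
    return sorted(found)

print(project(DB[20], "a"))           # ['_d', 'c', 'bc', 'ae']
print(length2_patterns(DB, "a", 2))
```

Counting in the projected database alone recovers the six length-2 patterns listed above; no candidate is generated outside what actually follows <a> in the data.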

Completeness of PrefixSpan

SDB:

  SID  Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  Having prefix <a>: <a>-projected database
      <(abc)(ac)d(cf)>
      <(_d)c(bc)(ae)>
      <(_b)(df)cb>
      <(_f)cbc>
    Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
      Having prefix <aa>: <aa>-proj. db …
      …
      Having prefix <af>: <af>-proj. db …
  Having prefix <b>: <b>-projected database …
  Having prefix <c>, …, <f>: …

Efficiency of PrefixSpan
  • No candidate sequence needs to be generated
  • Projected databases keep shrinking
  • Major cost of PrefixSpan: constructing projected databases
      Can be improved by pseudo-projections

Speed-up by Pseudo-projection
  • Major cost of PrefixSpan: projection
      Postfixes of sequences often appear repeatedly in recursive projected databases
  • When the (projected) database can be held in main memory, use pointers to form projections
      Pointer to the sequence
      Offset of the postfix

  s = <a(abc)(ac)d(cf)>
    <a>:   s|<a>:  (pointer, 2)   postfix <(abc)(ac)d(cf)>
    <ab>:  s|<ab>: (pointer, 4)   postfix <(_c)(ac)d(cf)>
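A minimal sketch of the idea: a projection is stored as a (sequence, offset) pair, and the postfix is only materialized when it has to be inspected. The sketch assumes offsets are 1-based positions in the flattened item list, which matches the offsets 2 and 4 in the example; `flatten` and `materialize` are illustrative names.

```python
from itertools import groupby

def flatten(seq):
    """List of (element index, item) pairs, one per item of the sequence."""
    return [(idx, item) for idx, element in enumerate(seq) for item in element]

def materialize(flat, offset):
    """Rebuild the postfix that starts at the 1-based item offset."""
    tail = flat[offset - 1:]
    # The postfix starts mid-element if the previous item shares its element.
    partial = bool(tail) and offset > 1 and flat[offset - 2][0] == tail[0][0]
    elements = ["".join(item for _, item in group)
                for _, group in groupby(tail, key=lambda pair: pair[0])]
    if partial:
        elements[0] = "_" + elements[0]
    return elements

s = flatten(["a", "abc", "ac", "d", "cf"])   # <a(abc)(ac)d(cf)>

# Pseudo-projections: only a reference to s and an integer are stored.
proj_a = (s, 2)     # s|<a>
proj_ab = (s, 4)    # s|<ab>

print(materialize(*proj_a))    # ['abc', 'ac', 'd', 'cf']
print(materialize(*proj_ab))   # ['_c', 'ac', 'd', 'cf']
```

Both projections refer to the same in-memory sequence, so no postfix is copied when the projected database is built.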

Pseudo-Projection vs. Physical Projection
  • Pseudo-projection avoids physically copying postfixes
      Efficient in running time and space when the database can be held in main memory
  • However, it is not efficient when the database cannot fit in main memory
      Disk-based random accessing is very costly
  • Suggested approach:
      Integration of physical and pseudo-projection
      Swapping to pseudo-projection when the data set fits in memory