Data Mining Techniques: Sequential Patterns

Sequential Pattern Mining
• Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data
• A record in such data typically consists of the transaction date and the items bought in the transaction
• Very often, data records also contain a customer-id, particularly when the purchase has been made using a credit card or a frequent-buyer card
• Catalog companies also collect such data using the orders they receive

Sequential Pattern Mining
• An example of such a pattern is that customers typically rent “Star Wars (星際大戰)”, then “Empire Strikes Back (帝國大反擊)”, and then “Return of the Jedi (絕地大反攻)”
• These rentals need not be consecutive
  – Customers who rent some other videos in between also support this sequential pattern
• Elements of a sequential pattern need not be simple items
  – “Computer Science and Programming Language”, followed by “Data Structure”, followed by “System Programs and Operating Systems” is an example of a sequential pattern in which the elements are sets of items

Sequential Pattern Mining
• Given: Transaction Time, Customer Id, Items Bought
[Figure: Original Database and Answer Set]

Definition
• The length of a sequence is the number of itemsets in the sequence
• A sequence of length k is called a k-sequence
• The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction
• The itemset i and the 1-sequence <i> have the same support
• An itemset with minimum support is called a large (frequent) itemset or litemset
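
The support definition above can be sketched in a few lines of Python. The customer sequences in `db` are an assumption made for illustration (the deck's database figure is not reproduced in the text); they were chosen to agree with the answer set stated later in the slides. The name `itemset_support` is hypothetical.

```python
from typing import FrozenSet, Iterable, List

# Assumed example data: one list of transactions (itemsets) per customer,
# already ordered by transaction time.
db: List[List[FrozenSet[int]]] = [
    [frozenset({30}), frozenset({90})],                               # customer 1
    [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})],  # customer 2
    [frozenset({30, 50, 70})],                                        # customer 3
    [frozenset({30}), frozenset({40, 70}), frozenset({90})],          # customer 4
    [frozenset({90})],                                                # customer 5
]


def itemset_support(db, itemset: Iterable[int]) -> float:
    """Fraction of customers with at least one transaction containing every item."""
    wanted = frozenset(itemset)
    # A customer counts once, no matter how many transactions match.
    supporters = sum(any(wanted <= t for t in seq) for seq in db)
    return supporters / len(db)


print(itemset_support(db, {40, 70}))  # customers 2 and 4 -> 0.4
print(itemset_support(db, {10, 20}))  # customer 2 only   -> 0.2
```

Counting customers rather than transactions is the point of the definition: an item bought in several transactions by the same customer still contributes only once.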

AprioriAll Algorithm
• Each itemset in a large sequence must have minimum support
• Any large sequence must be a list of litemsets
• Finding all sequential patterns in five phases
  – Sort Phase
  – Litemset Phase
  – Transformation Phase
  – Sequence Phase
  – Maximal Phase

AprioriAll Algorithm: Sort Phase
[Figure: Customer-Sequence Version of the Database]
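
A minimal sketch of the sort phase, assuming transactions arrive as `(customer_id, transaction_time, items)` rows: sorting by customer id and then by time, and grouping by customer, yields the customer-sequence version of the database. The rows, the ISO date strings, and the name `sort_phase` are illustrative assumptions, not data from the slides.

```python
from itertools import groupby
from operator import itemgetter

# Assumed rows: (customer_id, transaction_time, items).
# ISO-formatted dates sort correctly as plain strings.
transactions = [
    (2, "1993-06-20", {40, 60, 70}),
    (1, "1993-06-30", {90}),
    (2, "1993-06-10", {10, 20}),
    (1, "1993-06-25", {30}),
    (2, "1993-06-15", {30}),
]


def sort_phase(rows):
    """Sort by (customer, time) and collect each customer's itemsets in order."""
    ordered = sorted(rows, key=itemgetter(0, 1))
    return {
        cid: [frozenset(items) for _, _, items in group]
        for cid, group in groupby(ordered, key=itemgetter(0))
    }


customer_sequences = sort_phase(transactions)
for cid, seq in customer_sequences.items():
    print(cid, [sorted(t) for t in seq])
```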

AprioriAll Algorithm: Litemset Phase
[Figure: litemsets found with min_sup_count = 2; labels: Apriori/DHP, FP-Growth]
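
The litemset phase can be illustrated with a brute-force counter. This is deliberately not Apriori, DHP, or FP-Growth: it enumerates every non-empty subset of every transaction, which is only reasonable for toy data. The database `db` is an assumption chosen for illustration, and `find_litemsets` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

# Assumed customer sequences (one list of transactions per customer).
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]


def find_litemsets(db, min_sup_count):
    """Return {itemset (sorted tuple): number of supporting customers}."""
    counts = Counter()
    for seq in db:
        seen = set()  # itemsets this customer supports, counted once
        for transaction in seq:
            items = sorted(transaction)
            for k in range(1, len(items) + 1):
                seen.update(combinations(items, k))
        counts.update(seen)
    return {s: c for s, c in counts.items() if c >= min_sup_count}


litemsets = find_litemsets(db, min_sup_count=2)
for itemset in sorted(litemsets):
    print(itemset, litemsets[itemset])
```

The per-customer `seen` set is what distinguishes this count from ordinary association-rule support, where every transaction would be counted separately.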

AprioriAll Algorithm: Transformation Phase
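
One way to picture the transformation phase: each transaction is replaced by the set of litemsets it contains, and transactions containing no litemset are dropped. In the sketch below, litemsets are mapped to small integer ids so that later phases can compare integers instead of itemsets. The data, the id assignment, and the name `transform` are assumptions for illustration.

```python
# Assumed customer sequences and an assumed litemset-to-id mapping.
db = [
    [frozenset({30}), frozenset({90})],
    [frozenset({10, 20}), frozenset({30}), frozenset({40, 60, 70})],
    [frozenset({30, 50, 70})],
    [frozenset({30}), frozenset({40, 70}), frozenset({90})],
    [frozenset({90})],
]
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(db, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains."""
    transformed = []
    for seq in db:
        new_seq = []
        for transaction in seq:
            ids = frozenset(i for ls, i in litemset_ids.items() if ls <= transaction)
            if ids:  # transactions without any litemset are dropped
                new_seq.append(ids)
        # Keep the customer even if new_seq is empty, so the customer
        # still counts in the denominator when support is computed.
        transformed.append(new_seq)
    return transformed


for seq in transform(db, litemset_ids):
    print([sorted(t) for t in seq])
```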

AprioriAll Algorithm: Sequence Phase
[Figure: Customer Sequences, Large 1-Sequences, Large 2-Sequences, Large 3-Sequences, Large 4-Sequences, Maximal Large Sequences]

Sequence Phase: Candidate Generation
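
A hedged sketch of Apriori-style candidate generation for sequences, assuming sequences are tuples of litemset ids as produced by a transformation step. Two large (k-1)-sequences that agree on their first k-2 elements are joined, and a candidate is pruned if any of its (k-1)-subsequences is not large. The input set and the name `generate_candidates` are illustrative assumptions.

```python
def generate_candidates(large_prev):
    """Build candidate k-sequences from the set of large (k-1)-sequences."""
    large_prev = set(large_prev)

    # Join step: p and q share their first k-2 elements; append q's last
    # element to p. p and q may be the same sequence, so repeated
    # elements such as (1, 1) can be proposed.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune step: dropping any single element must leave a large sequence.
    def all_subsequences_large(candidate):
        return all(
            candidate[:i] + candidate[i + 1:] in large_prev
            for i in range(len(candidate))
        )

    return {c for c in joined if all_subsequences_large(c)}


# Assumed large 3-sequences, for illustration only.
large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(generate_candidates(large_3))
```

Unlike itemset candidates, order matters here: joining (1, 2, 3) with (1, 2, 4) proposes both (1, 2, 3, 4) and (1, 2, 4, 3), and only the prune step decides which survives.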

AprioriAll Algorithm: Maximal Phase
• The sequence <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8)
• The sequence <(3) (5)> is not contained in <(3 5)> (and vice versa)
  – The former represents items 3 and 5 being bought one after the other
  – The latter represents items 3 and 5 being bought together
• In a set of sequences, a sequence s is maximal if s is not contained in any other sequence
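
The containment test and the maximality filter described above can be sketched as follows. A sequence is modelled as a list of frozensets; the greedy left-to-right match is sufficient because matching each element as early as possible never rules out a later match. The names `is_contained` and `maximal_sequences` are hypothetical.

```python
def is_contained(a, b):
    """True if sequence a is contained in sequence b (order-preserving subset match)."""
    matched = 0
    for element in b:
        if matched < len(a) and a[matched] <= element:
            matched += 1
    return matched == len(a)


def maximal_sequences(sequences):
    """Keep only the sequences not contained in any other sequence of the set."""
    return [
        s for s in sequences
        if not any(s != t and is_contained(s, t) for t in sequences)
    ]


def seq(*elements):
    """Small helper to write <(3) (4 5)> as seq({3}, {4, 5})."""
    return [frozenset(e) for e in elements]


print(is_contained(seq({3}, {4, 5}, {8}), seq({7}, {3, 8}, {9}, {4, 5, 6}, {8})))
print(is_contained(seq({3}, {5}), seq({3, 5})))
print(is_contained(seq({3, 5}), seq({3}, {5})))
```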

AprioriAll Algorithm Answer Set
• With minimum support set to 25%, i.e., a minimum support of 2 customers
  – <(30) (90)> and <(30) (40 70)> are maximal
  – <(10 20) (30)>, which is only supported by customer 2, does not have minimum support
  – <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)> and <(40 70)>, though having minimum support, are not in the answer because they are not maximal
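
The support figures quoted above can be checked with a short sketch. It assumes a database of five customers (so 25% rounds up to 2 customers) whose sequences were chosen to agree with the stated answer set; the slide's own database figure is not reproduced in the text, so `db` is an assumption, as are the function names.

```python
import math

# Assumed customer sequences, consistent with the answer set above.
db = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]


def min_support_count(n_customers, min_sup):
    """Smallest whole number of customers that reaches the support fraction."""
    return math.ceil(min_sup * n_customers)


def supports(customer_seq, pattern):
    """True if the customer's sequence contains the pattern, in order."""
    remaining = iter(customer_seq)  # shared iterator enforces the ordering
    return all(any(p <= t for t in remaining) for p in pattern)


def support_count(db, pattern):
    return sum(supports(seq, pattern) for seq in db)


threshold = min_support_count(len(db), 0.25)
for pattern in ([{30}, {90}], [{30}, {40, 70}], [{10, 20}, {30}]):
    count = support_count(db, pattern)
    print(pattern, count, "large" if count >= threshold else "not large")
```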

Summary

Discussions
• The AprioriAll algorithm will generate a huge set of candidate sequences
  – If there are 1000 frequent sequences of length 1, the algorithm will generate 1000 × 1000 + (1000 × 999) / 2 = 1,499,500 candidate sequences
• Many scans of the database are needed during mining
• Mining long sequential patterns is difficult
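
The candidate count quoted above can be reproduced with a small calculation. The usual reading of the two terms, stated here as an assumption, is that n × n counts ordered pairs of the form <(a) (b)> (including a = b) and n × (n − 1) / 2 counts unordered pairs bought together, <(a b)>. The function name is hypothetical.

```python
def length2_candidate_count(n):
    """Number of length-2 candidates built from n frequent length-1 sequences."""
    ordered_pairs = n * n                # <(a) (b)>, order matters, a may equal b
    together_pairs = n * (n - 1) // 2    # <(a b)>, same transaction, a != b
    return ordered_pairs + together_pairs


print(length2_candidate_count(1000))  # 1499500
```

The quadratic growth in the first term is what makes candidate generation expensive as soon as the number of frequent items is large.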

Research Topics
• Time-Interval Sequential Patterns
• Time-Gap Sequential Patterns
• Non-redundant Sequential Patterns
• Constrained Sequential Pattern Mining
• Multi-dimensional Sequential Patterns
• Generalized Sequential Patterns
• Incremental Mining of Sequential Patterns
• Data Stream Sequential Pattern Mining
• Interactive Mining of Sequential Patterns

Exercise 6
A Sequence Database (min-sup = 50%)

SID   Customer sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
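
As a possible starting point for the exercise, the sketch below parses the compact notation used in the table (a bare letter is a single-item element, a parenthesised group is a multi-item element) and counts how many sequences contain each item. It covers only the first step, finding the frequent items; the function names are hypothetical and the parser assumes single lowercase letters as items.

```python
import re
from collections import Counter

exercise_db = {
    10: "<a(abc)(ac)d(cf)>",
    20: "<(ad)c(bc)(ae)>",
    30: "<(ef)(ab)(df)cb>",
    40: "<eg(af)cbc>",
}


def parse_sequence(text):
    """Turn '<a(abc)d>' into [frozenset('a'), frozenset('abc'), frozenset('d')]."""
    body = text.strip()[1:-1]  # drop the surrounding < >
    tokens = re.findall(r"\(([a-z]+)\)|([a-z])", body)
    return [frozenset(group or single) for group, single in tokens]


def item_support_counts(db):
    """Count, for each item, the number of sequences in which it occurs."""
    counts = Counter()
    for text in db.values():
        items = set().union(*parse_sequence(text))
        counts.update(items)  # each sequence contributes at most 1 per item
    return counts


counts = item_support_counts(exercise_db)
min_count = 2  # 50% of 4 sequences
frequent_items = sorted(item for item, c in counts.items() if c >= min_count)
print(frequent_items)
```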