Data Mining Association Rules: Advanced Concepts and Algorithms

  • Slides: 22

Data Mining Association Rules: Advanced Concepts and Algorithms
Lecture Notes for Chapter 7
Introduction to Data Mining
by Tan, Steinbach, Kumar
Modified by Longbiao CHEN
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Outline
• Multi-level Association Rules
• Sequence Data

Multi-level Association Rules

Multi-level Association Rules
• Why should we incorporate a concept hierarchy?
  – Rules at lower levels may not have enough support to appear in any frequent itemsets
  – Rules at lower levels of the hierarchy are overly specific
    ◦ e.g., skim milk → white bread, 2% milk → wheat bread, skim milk → wheat bread, etc. are all indicative of an association between milk and bread: milk → bread

Multi-level Association Rules
• How do support and confidence vary as we traverse the concept hierarchy?
  – If X is the parent item for both X_1 and X_2, then σ(X) ≤ σ(X_1) + σ(X_2)
  – If σ(X_1 ∪ Y_1) ≥ minsup, and X is the parent of X_1 and Y is the parent of Y_1, then σ(X ∪ Y_1) ≥ minsup, σ(X_1 ∪ Y) ≥ minsup, and σ(X ∪ Y) ≥ minsup
  – If conf(X_1 → Y_1) ≥ minconf, then conf(X_1 → Y) ≥ minconf
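The first inequality can be checked on a toy data set. The sketch below is only an illustration: the transactions and the two-child hierarchy (milk as the parent of skim milk and 2% milk) are made up, σ is treated as a support count, and a transaction is assumed to support the parent whenever it contains at least one child.

```python
# Toy transactions (made up for illustration).
transactions = [
    {"skim milk", "wheat bread"},
    {"2% milk", "white bread"},
    {"skim milk", "2% milk"},  # contains both children of "milk"
    {"wheat bread"},
]

# Hypothetical two-level hierarchy: parent item -> direct child items.
children = {"milk": {"skim milk", "2% milk"}}


def support_count(item, transactions, children):
    """Count transactions that contain the item or one of its direct children."""
    members = {item} | children.get(item, set())
    return sum(1 for t in transactions if t & members)


sigma_parent = support_count("milk", transactions, children)
sigma_children = sum(
    support_count(child, transactions, children) for child in children["milk"]
)

# The third transaction counts once for the parent but once for each child,
# so the parent's count cannot exceed the sum of the children's counts.
print(sigma_parent, sigma_children)
```

Equality holds only when no transaction contains more than one child of the same parent.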

Multi-level Association Rules
• Approach 1:
  – Extend the current association rule formulation by augmenting each transaction with higher-level items
    Original transaction: {skim milk, wheat bread}
    Augmented transaction: {skim milk, wheat bread, milk, bread, food}
• Issues:
  – Items that reside at higher levels have much higher support counts
    ◦ If the support threshold is low, there are too many frequent patterns involving items from the higher levels
  – Increased dimensionality of the data
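The augmentation step can be sketched in a few lines. The `parent` mapping below is a hypothetical hierarchy chosen to reproduce the example transaction above; the function names are illustrative and not taken from the lecture notes.

```python
# Hypothetical concept hierarchy: item -> parent (the root "food" has no parent).
parent = {
    "skim milk": "milk",
    "2% milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
    "milk": "food",
    "bread": "food",
}


def ancestors(item, parent):
    """Return every ancestor of an item by walking up the hierarchy."""
    found = []
    while item in parent:
        item = parent[item]
        found.append(item)
    return found


def augment(transaction, parent):
    """Return the transaction extended with all ancestors of its items."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, parent))
    return extended


augmented = augment({"skim milk", "wheat bread"}, parent)
print(sorted(augmented))
```

After augmentation, any standard frequent itemset algorithm can be run unchanged, which is also why the dimensionality issue arises: every transaction now carries its items plus all of their ancestors.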

Multi-level Association Rules
• Approach 2:
  – Generate frequent patterns at the highest level first
  – Then generate frequent patterns at the next highest level, and so on
• Issues:
  – I/O requirements increase dramatically because more passes over the data are needed
  – May miss some potentially interesting cross-level association patterns
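A rough sketch of the level-by-level idea follows. It makes several assumptions not stated in the notes: the hierarchy has exactly two levels below the root, the `paths` mapping and the transactions are made up, and frequent itemsets are counted by brute-force enumeration rather than with Apriori. Each call makes one full pass over the data, which is the source of the I/O cost noted above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical hierarchy: leaf item -> (level-0 ancestor, level-1 item).
paths = {
    "skim milk": ("milk", "skim milk"),
    "2% milk": ("milk", "2% milk"),
    "wheat bread": ("bread", "wheat bread"),
    "white bread": ("bread", "white bread"),
}

# Toy transactions (made up for illustration).
transactions = [
    {"skim milk", "wheat bread"},
    {"2% milk", "wheat bread"},
    {"skim milk", "white bread"},
    {"2% milk", "white bread"},
]


def frequent_itemsets_at_level(level, transactions, paths, minsup_count):
    """Project every transaction onto one hierarchy level and count itemsets."""
    counts = Counter()
    for t in transactions:  # one pass over the data per level
        projected = sorted({paths[item][level] for item in t})
        for size in range(1, len(projected) + 1):
            for itemset in combinations(projected, size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= minsup_count}


top = frequent_itemsets_at_level(0, transactions, paths, minsup_count=3)
bottom = frequent_itemsets_at_level(1, transactions, paths, minsup_count=3)
print(top)
print(bottom)
```

In this toy data the pair (bread, milk) is frequent at the higher level, while no itemset at the leaf level reaches the threshold. Because each level is mined separately, a cross-level pattern such as {milk, wheat bread} is never counted.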

Outline
• Multi-level Association Rules
• Sequence Data

Sequence Database:

Object | Timestamp | Events
A | 10 | 2, 3, 5
A | 20 | 6, 1
A | 23 | 1
B | 11 | 4, 5, 6
B | 17 | 2
B | 21 | 7, 8, 1, 2
B | 28 | 1, 6
C | 14 | 1, 8, 7

Examples of Sequence Database

Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web Data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C

[Figure: a sequence drawn as a timeline of elements (transactions), each containing one or more events (items) E1 to E4]

Formal Definition of a Sequence
• A sequence is an ordered list of elements (transactions): s = < e_1 e_2 e_3 … >
  – Each element contains a collection of events (items): e_i = {i_1, i_2, …, i_k}
  – Each element is attributed to a specific time or location
• The length of a sequence, |s|, is the number of elements in the sequence
• A k-sequence is a sequence that contains k events (items)
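The two measures are easy to confuse, so a small sketch may help. It assumes a sequence is represented as a Python list of sets (one set per element); the function names are illustrative, and the sequence is the one used on the Challenge slide later in these notes.

```python
def length(sequence):
    """|s|: the number of elements (transactions) in the sequence."""
    return len(sequence)


def num_events(sequence):
    """k: the total number of events (items), summed over all elements."""
    return sum(len(element) for element in sequence)


# <{a b} {c d e} {f} {g h i}> as a list of sets.
s = [{"a", "b"}, {"c", "d", "e"}, {"f"}, {"g", "h", "i"}]

# Four elements but nine events: s has length 4 and is a 9-sequence.
print(length(s), num_events(s))
```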

Examples of Sequence
• Web sequence:
  < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >
• Sequence of initiating events causing the nuclear accident at Three Mile Island:
  < {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} >
• Sequence of books checked out at a library:
  < {Fellowship of the Ring} {The Two Towers} {Return of the King} >

Formal Definition of a Subsequence
• A sequence <a_1 a_2 … a_n> is contained in another sequence <b_1 b_2 … b_m> (m ≥ n) if there exist integers i_1 < i_2 < … < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n}

Data sequence | Subsequence | Contain?
< {2, 4} {3, 5, 6} {8} > | < {2} {3, 5} > | Yes
< {1, 2} {3, 4} > | < {1} {2} > | No
< {2, 4} {2, 4} {2, 5} > | < {2} {4} > | Yes

• The support of a subsequence w is defined as the fraction of data sequences that contain w
• A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)
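The containment test can be sketched as a greedy left-to-right scan: each element of the candidate subsequence is matched to the earliest remaining element of the data sequence that is a superset of it. This is one possible implementation, assuming sequences are stored as lists of Python sets; the function name `contains` is illustrative.

```python
def contains(data_seq, sub_seq):
    """Return True if sub_seq is contained in data_seq.

    Each element of sub_seq must be a subset of a distinct element of
    data_seq, and the matched elements must appear in the same order.
    """
    pos = 0
    for a in sub_seq:
        # Skip data elements that are not supersets of the current element.
        while pos < len(data_seq) and not a <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next match must come strictly later (i_1 < i_2 < ...)
    return True


# First two rows of the table above.
print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))
print(contains([{1, 2}, {3, 4}], [{1}, {2}]))
```

Matching each element as early as possible is safe here because an earlier match never removes options for the elements that follow.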

Sequential Pattern Mining: Definition
• Given:
  – a database of sequences
  – a user-specified minimum support threshold, minsup
• Task:
  – Find all subsequences with support ≥ minsup

Sequential Pattern Mining: Challenge
• Given a sequence: <{a b} {c d e} {f} {g h i}>
  – Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.
• How many k-subsequences can be extracted from a given n-sequence?
  – For <{a b} {c d e} {f} {g h i}>, n = 9. With k = 4, selecting events with the mask Y_ _YY _ _ _Y gives <{a} {d e} {i}>
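Because the nine events in this example are all distinct, every choice of k event slots yields a different subsequence, so the count is C(n, k) = C(9, 4) = 126. The sketch below enumerates them to confirm the figure; the tuple-based representation and the function name are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

# <{a b} {c d e} {f} {g h i}> with each element as a tuple of events.
sequence = [("a", "b"), ("c", "d", "e"), ("f",), ("g", "h", "i")]


def k_subsequences(sequence, k):
    """Enumerate every k-subsequence by choosing k of the n event slots."""
    slots = [
        (pos, event)
        for pos, element in enumerate(sequence)
        for event in element
    ]
    result = []
    for chosen in combinations(slots, k):
        # Group the chosen events by the element they came from; elements
        # with no chosen event disappear from the subsequence.
        elements = {}
        for pos, event in chosen:
            elements.setdefault(pos, []).append(event)
        result.append(tuple(tuple(events) for events in elements.values()))
    return result


subs = k_subsequences(sequence, 4)
print(len(subs), comb(9, 4))
```

The count grows combinatorially with n and k, which is why enumerating all subsequences of every data sequence is impractical for real databases.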

Sequential Pattern Mining: Example
Minsup = 50%

Data sequences:
A: < {1, 2, 4} {2, 3} {5} >
B: < {1, 2} {2, 3, 4} >
C: < {1, 2} {2, 3, 4} {2, 4, 5} >
D: < {2} {3, 4} {4, 5} >
E: < {1, 3} {2, 4, 5} >

Examples of frequent subsequences (supports follow from the five data sequences above):
< {1, 2} >  s = 60%
< {2, 3} >  s = 60%
< {2, 4} >  s = 80%
< {3} {5} >  s = 80%
< {1} {2} >  s = 80%
< {1} {2, 3} >  s = 60%
< {2} {2, 3} >  s = 60%
< {1, 2} {2, 3} >  s = 60%

Data sequences containing each subsequence:
{1, 2}: A, B, C
{2, 3}: A, B, C
{2, 4}: A, B, C, E
{3} {5}: A, C, D, E
…
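The supports in this example can be recomputed directly from the definition. The following is a minimal sketch, assuming the five data sequences are stored as lists of sets keyed by their labels; `contains` and `support` are illustrative names.

```python
# The five data sequences from the example.
database = {
    "A": [{1, 2, 4}, {2, 3}, {5}],
    "B": [{1, 2}, {2, 3, 4}],
    "C": [{1, 2}, {2, 3, 4}, {2, 4, 5}],
    "D": [{2}, {3, 4}, {4, 5}],
    "E": [{1, 3}, {2, 4, 5}],
}


def contains(data_seq, pattern):
    """Ordered containment: each pattern element needs a later superset."""
    remaining = iter(data_seq)
    # any() consumes the shared iterator up to and including the match,
    # so the next pattern element can only match a strictly later element.
    return all(any(p <= element for element in remaining) for p in pattern)


def support(pattern, database):
    """Return (fraction of sequences containing pattern, their labels)."""
    holders = [name for name, seq in database.items() if contains(seq, pattern)]
    return len(holders) / len(database), holders


for pattern in ([{1, 2}], [{2, 4}], [{3}, {5}], [{1}, {2}], [{1, 2}, {2, 3}]):
    print(pattern, support(pattern, database))
```

With minsup = 50%, a subsequence must be contained in at least three of the five data sequences to count as a sequential pattern.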

Extracting Sequential Patterns
• Given n events: i_1, i_2, i_3, …, i_n
• Candidate 1-subsequences:
  <{i_1}>, <{i_2}>, <{i_3}>, …, <{i_n}>
• Candidate 2-subsequences:
  <{i_1, i_2}>, <{i_1, i_3}>, …, <{i_1} {i_1}>, <{i_1} {i_2}>, …, <{i_{n-1}} {i_n}>
• Candidate 3-subsequences:
  <{i_1, i_2, i_3}>, <{i_1, i_2, i_4}>, …, <{i_1, i_2} {i_1}>, <{i_1, i_2} {i_2}>, …, <{i_1} {i_1, i_2}>, <{i_1} {i_1, i_3}>, …, <{i_1} {i_1} {i_1}>, <{i_1} {i_1} {i_2}>, …
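To see how quickly the candidate space grows, the 2-subsequences can be enumerated explicitly. This sketch assumes events are plain strings and represents a candidate as a tuple of frozensets; it covers only k = 2 and is not the candidate generation procedure of any particular algorithm.

```python
from itertools import combinations, product


def candidate_2_sequences(events):
    """List every candidate 2-subsequence over the given events."""
    # Both events in the same element: unordered pairs of distinct events.
    same_element = [(frozenset(pair),) for pair in combinations(events, 2)]
    # Events in two consecutive elements: ordered pairs, repeats allowed,
    # so <{i1} {i1}> and both <{i1} {i2}> and <{i2} {i1}> are included.
    separate = [
        (frozenset([a]), frozenset([b])) for a, b in product(events, repeat=2)
    ]
    return same_element + separate


candidates = candidate_2_sequences(["i1", "i2", "i3"])
print(len(candidates))
```

For n events this gives n(n-1)/2 + n^2 candidates, compared with n(n-1)/2 candidate 2-itemsets in ordinary frequent itemset mining.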

Other Formulation
• In some domains, we may have only one very long time series
  – Examples:
    ◦ monitoring network traffic events for attacks
    ◦ monitoring telecommunication alarm signals
• The goal is to find frequent sequences of events in the time series
  – This problem is also known as frequent episode mining

[Figure: a timeline of events E1 to E5 illustrating the pattern <E1> <E3>]

Frequent Subgraph Mining
• Extend association rule mining to finding frequent subgraphs
• Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

Graph Definitions

Representing Transactions as Graphs
• Each transaction is a clique of items
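As a small illustration of the clique representation, the sketch below turns one transaction into a vertex list and an edge list in which every pair of items is connected. The item names and the function name are made up for the example.

```python
from itertools import combinations


def transaction_to_clique(transaction):
    """Return (vertices, edges) of the complete graph over the items."""
    vertices = sorted(transaction)
    # A clique has an edge between every pair of distinct vertices.
    edges = list(combinations(vertices, 2))
    return vertices, edges


vertices, edges = transaction_to_clique({"A", "B", "C"})
print(vertices, edges)
```

A transaction with m items therefore becomes a graph with m vertices and m(m-1)/2 edges.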
