Data Mining Association Rules Advanced Concepts and Algorithms

Data Mining Association Rules: Advanced Concepts and Algorithms
Lecture Notes, Introduction to Data Mining, by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Sequence Database:

Object  Timestamp  Events
A       10         2, 3, 5
A       20         6, 1
A       23         1
B       11         4, 5, 6
B       17         2
B       21         7, 8, 1, 2
B       28         1, 6
C       14         1, 8, 7

Examples of Sequence Database

Sequence database: Customer
  Sequence: Purchase history of a given customer
  Element (transaction): A set of items bought by a customer at time t
  Event (item): Books, dairy products, CDs, etc.

Sequence database: Web data
  Sequence: Browsing activity of a particular Web visitor
  Element (transaction): A collection of files viewed by a Web visitor after a single mouse click
  Event (item): Home page, index page, contact info, etc.

Sequence database: Event data
  Sequence: History of events generated by a given sensor
  Element (transaction): Events triggered by a sensor at time t
  Event (item): Types of alarms generated by sensors

Sequence database: Genome sequences
  Sequence: DNA sequence of a particular species
  Element (transaction): An element of the DNA sequence
  Event (item): Bases A, T, G, C

[Figure: a sequence drawn as an ordered series of elements (transactions), each containing events (items) labeled E1 to E4]

Formal Definition of a Sequence
• A sequence is an ordered list of elements (transactions): s = <e_1 e_2 e_3 …>
  – Each element contains a collection of events (items): e_i = {i_1, i_2, …, i_k}
  – Each element is attributed to a specific time or location
• The length of a sequence, |s|, is given by the number of elements of the sequence
• A k-sequence is a sequence that contains k events (items)
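The distinction between the length |s| (number of elements) and k (number of events) is easy to mix up. A minimal sketch, assuming a sequence is stored as a tuple of frozensets (this representation and the function names are illustrative choices, not part of the lecture notes):

```python
# Each element (transaction) is a frozenset of events; a sequence is a tuple of elements.
s = (
    frozenset({"a", "b"}),
    frozenset({"c", "d", "e"}),
    frozenset({"f"}),
    frozenset({"g", "h", "i"}),
)

def sequence_length(seq):
    """|s|: the number of elements in the sequence."""
    return len(seq)

def event_count(seq):
    """k such that seq is a k-sequence: the total number of events."""
    return sum(len(element) for element in seq)

print(sequence_length(s))  # 4
print(event_count(s))      # 9
```

Under these definitions the example has length 4 but is a 9-sequence.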

Examples of Sequence
• Web sequence: <{Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping}>
• Sequence of initiating events causing the nuclear accident at 3-Mile Island (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm): <{clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases}>
• Sequence of books checked out at a library: <{Fellowship of the Ring} {The Two Towers} {Return of the King}>

Formal Definition of a Subsequence
• A sequence <a_1 a_2 … a_n> is contained in another sequence <b_1 b_2 … b_m> (m ≥ n) if there exist integers i_1 < i_2 < … < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, …, a_n ⊆ b_{i_n}

Data sequence              Subsequence     Contain?
<{2, 4} {3, 5, 6} {8}>     <{2} {3, 5}>    Yes
<{1, 2} {3, 4}>            <{1} {2}>       No
<{2, 4} {2, 4} {2, 5}>     <{2} {4}>       Yes

• The support of a subsequence w is defined as the fraction of data sequences that contain w
• A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup)
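The containment test and the support definition can be sketched in a few lines. This is a minimal illustration, assuming sequences are lists of Python sets; the function names and the small database are hypothetical. Without timing constraints, matching each pattern element to the earliest possible data element is sufficient.

```python
def contains(data, sub):
    """True if `sub` is contained in `data` (greedy left-to-right matching)."""
    pos = 0
    for element in sub:
        # Advance to the first remaining data element that is a superset.
        while pos < len(data) and not set(element) <= set(data[pos]):
            pos += 1
        if pos == len(data):
            return False
        pos += 1  # the next pattern element must match a strictly later element
    return True

def support(database, sub):
    """Fraction of data sequences in `database` that contain `sub`."""
    return sum(contains(d, sub) for d in database) / len(database)

# The first two rows of the containment table above.
print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))  # True
print(contains([{1, 2}, {3, 4}], [{1}, {2}]))             # False

# A small illustrative database (not from the lecture notes).
db = [
    [{2, 4}, {3, 5, 6}, {8}],
    [{1, 2}, {3, 4}],
    [{2, 4}, {2, 4}, {2, 5}],
]
print(support(db, [{2}, {4}]))  # 2 of the 3 sequences contain <{2} {4}>
```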

Sequential Pattern Mining: Definition
• Given:
  – a database of sequences
  – a user-specified minimum support threshold, minsup
• Task:
  – Find all subsequences with support ≥ minsup

Sequential Pattern Mining: Challenge
• Given a sequence: <{a b} {c d e} {f} {g h i}>
  – Examples of subsequences: <{a} {c d} {f} {g}>, <{c d e}>, <{b} {g}>, etc.
• How many k-subsequences can be extracted from a given n-sequence?
  – For <{a b} {c d e} {f} {g h i}>, n = 9. With k = 4, selecting events by the pattern Y _ _ Y Y _ _ _ Y yields <{a} {d e} {i}>.
  – Each choice of k of the n events gives one k-subsequence, so when all events are distinct there are C(n, k) = C(9, 4) = 126 of them.
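The count can be checked by brute force. The sketch below enumerates every way of keeping k of the n events and compares the result with the binomial coefficient; it assumes all events in the sequence are distinct, and the function name is illustrative.

```python
from itertools import combinations
from math import comb

seq = [("a", "b"), ("c", "d", "e"), ("f",), ("g", "h", "i")]

def k_subsequences(seq, k):
    """All k-subsequences obtained by keeping exactly k events of `seq`."""
    # Tag each event with the index of the element it belongs to.
    flat = [(i, event) for i, element in enumerate(seq) for event in element]
    result = set()
    for chosen in combinations(flat, k):
        grouped = {}
        for i, event in chosen:
            grouped.setdefault(i, []).append(event)
        # Elements left with no events disappear from the subsequence.
        result.add(tuple(tuple(grouped[i]) for i in sorted(grouped)))
    return result

subs = k_subsequences(seq, 4)
print(len(subs), comb(9, 4))              # 126 126
print((("a",), ("d", "e"), ("i",)) in subs)  # True
```

Even a single 9-sequence yields 126 candidate 4-subsequences, which is why exhaustive enumeration does not scale.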

Sequential Pattern Mining: Example
Minsup = 50%
Examples of Frequent Subsequences:
  <{1, 2}>
  <{2, 3}>
  <{2, 4}>
  <{3} {5}>
  <{1} {2}>
  <{1} {2, 3}>
  <{2} {2, 3}>
  <{1, 2} {2, 3}>
[Table: sequence database and per-pattern supports; the surviving support values are s=60%, s=80%, s=60%]

Extracting Sequential Patterns
• Given n events: i_1, i_2, i_3, …, i_n
• Candidate 1-subsequences: <{i_1}>, <{i_2}>, <{i_3}>, …, <{i_n}>
• Candidate 2-subsequences: <{i_1, i_2}>, <{i_1, i_3}>, …, <{i_1} {i_1}>, <{i_1} {i_2}>, …, <{i_{n-1}} {i_n}>
• Candidate 3-subsequences: <{i_1, i_2, i_3}>, <{i_1, i_2, i_4}>, …, <{i_1, i_2} {i_1}>, <{i_1, i_2} {i_2}>, …, <{i_1} {i_1, i_2}>, <{i_1} {i_1, i_3}>, …, <{i_1} {i_1} {i_1}>, <{i_1} {i_1} {i_2}>, …
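To make the growth of the candidate space concrete, the sketch below enumerates every candidate 2-subsequence over n events: pairs of distinct events in one element, plus ordered pairs of single-event elements (the same event may repeat across elements). The function name and the event labels are illustrative.

```python
from itertools import combinations, product

def candidate_2_sequences(events):
    """All candidate 2-subsequences over the given events."""
    # Two distinct events in the same element: C(n, 2) candidates.
    same_element = [(frozenset(pair),) for pair in combinations(events, 2)]
    # Two single-event elements in order: n * n candidates.
    two_elements = [(frozenset({a}), frozenset({b}))
                    for a, b in product(events, repeat=2)]
    return same_element + two_elements

cands = candidate_2_sequences(["i1", "i2", "i3"])
print(len(cands))  # 3 + 9 = 12
```

With n events there are C(n, 2) + n^2 candidates at level 2 alone, more than the C(n, 2) candidate 2-itemsets of ordinary association mining.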

Generalized Sequential Pattern (GSP)
• Step 1:
  – Make the first pass over the sequence database D to yield all the frequent 1-element sequences
• Step 2: Repeat until no new frequent sequences are found
  – Candidate Generation: Merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
  – Candidate Pruning: Prune candidate k-sequences that contain infrequent (k-1)-subsequences
  – Support Counting: Make a new pass over the sequence database D to find the support for these candidate sequences
  – Candidate Elimination: Eliminate candidate k-sequences whose actual support is less than minsup
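The candidate pruning step can be sketched as follows. This is a minimal illustration, assuming no timing constraints and sequences stored as tuples of tuples; the function names and the set of frequent 3-sequences are hypothetical examples, not data from the lecture notes.

```python
def drop_one_event(seq):
    """Yield every (k-1)-subsequence obtained by deleting a single event."""
    for i, element in enumerate(seq):
        for j in range(len(element)):
            rest = element[:j] + element[j + 1:]
            # An element that becomes empty is removed from the sequence.
            yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]

def survives_pruning(candidate, frequent_prev):
    """Keep a candidate only if all its (k-1)-subsequences are frequent."""
    return all(sub in frequent_prev for sub in drop_one_event(candidate))

# Hypothetical frequent 3-sequences.
frequent_3 = {
    ((1,), (2,), (3,)),
    ((1,), (2, 5)),
    ((1,), (5,), (3,)),
    ((2,), (3,), (4,)),
    ((2, 5), (3,)),
}

print(survives_pruning(((1,), (2, 5), (3,)), frequent_3))       # True
print(survives_pruning(((1,), (2,), (3,), (4,)), frequent_3))   # False
```

The second candidate is pruned because <{1} {3} {4}> is not among the frequent 3-sequences, so no pass over the database is needed to reject it.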

Candidate Generation
• Base case (k=2):
  – Merging two frequent 1-sequences <{i_1}> and <{i_2}> will produce two candidate 2-sequences: <{i_1} {i_2}> and <{i_1 i_2}>

Candidate Generation Examples
• Merging the sequences w_1 = <{1} {2 3} {4}> and w_2 = <{2 3} {4 5}> will produce the candidate sequence <{1} {2 3} {4 5}> because the last two events in w_2 (4 and 5) belong to the same element
• Merging the sequences w_1 = <{1} {2 3} {4}> and w_2 = <{2 3} {4} {5}> will produce the candidate sequence <{1} {2 3} {4} {5}> because the last two events in w_2 (4 and 5) do not belong to the same element
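The two merges above can be reproduced with a short sketch. It assumes the usual GSP join condition for k > 2, which these notes do not spell out: w_1 and w_2 are merged only if dropping the first event of w_1 gives the same sequence as dropping the last event of w_2. Sequences are tuples of tuples with events sorted inside each element; all function names are illustrative, and the k = 2 base case is not handled here.

```python
def drop_first(seq):
    """Remove the first event of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    """Remove the last event of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())

def merge(w1, w2):
    """Return the candidate produced by merging w1 with w2, or None."""
    if drop_first(w1) != drop_last(w2):
        return None
    last_event = w2[-1][-1]
    if len(w2[-1]) > 1:
        # Last two events of w2 share an element: extend w1's last element.
        return w1[:-1] + (w1[-1] + (last_event,),)
    # Otherwise the last event of w2 becomes a new element of its own.
    return w1 + ((last_event,),)

w1 = ((1,), (2, 3), (4,))
print(merge(w1, ((2, 3), (4, 5))))     # ((1,), (2, 3), (4, 5))
print(merge(w1, ((2, 3), (4,), (5,)))) # ((1,), (2, 3), (4,), (5,))
```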

Timing Constraints (I)
[Figure: pattern {A B} {C} {D E}; the gap between consecutive elements must be ≤ xg and > ng, and the span from the first to the last element must be ≤ ms]
  xg: max-gap
  ng: min-gap
  ms: maximum span

With xg = 2, ng = 0, ms = 4:

Data sequence                                Subsequence       Contain?
<{2, 4} {3, 5, 6} {4, 7} {4, 5} {8}>         <{6} {5}>         Yes
<{1} {2} {3} {4} {5}>                        <{1} {4}>         No
<{1} {2, 3} {3, 4} {4, 5}>                   <{2} {3} {5}>     Yes
<{1, 2} {3} {2, 3} {3, 4} {2, 4} {4, 5}>     <{1, 2} {5}>      No
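The four table rows can be checked with a small backtracking search. This sketch assumes that each element's timestamp is its position in the data sequence (1, 2, 3, …), which is consistent with the answers in the table but is not stated on the slide; the function names are illustrative. Greedy matching is not enough here because an early match can violate max-gap while a later one satisfies it.

```python
def with_positions(elements):
    """Attach timestamps 1, 2, 3, ... to the elements of a data sequence."""
    return [(t, set(e)) for t, e in enumerate(elements, start=1)]

def contains_timed(data, sub, max_gap, min_gap, max_span):
    """Containment under max-gap, min-gap and maximum-span constraints."""
    def search(k, prev_t, first_t):
        if k == len(sub):
            return True
        for t, events in data:
            if not sub[k] <= events:
                continue
            if prev_t is not None:
                gap = t - prev_t
                if gap <= min_gap or gap > max_gap:
                    continue
                if t - first_t > max_span:
                    continue
            if search(k + 1, t, t if first_t is None else first_t):
                return True
        return False
    return search(0, None, None)

rows = [
    ([{2, 4}, {3, 5, 6}, {4, 7}, {4, 5}, {8}], [{6}, {5}]),
    ([{1}, {2}, {3}, {4}, {5}], [{1}, {4}]),
    ([{1}, {2, 3}, {3, 4}, {4, 5}], [{2}, {3}, {5}]),
    ([{1, 2}, {3}, {2, 3}, {3, 4}, {2, 4}, {4, 5}], [{1, 2}, {5}]),
]
results = [contains_timed(with_positions(d), s, max_gap=2, min_gap=0, max_span=4)
           for d, s in rows]
print(results)  # [True, False, True, False]
```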

Mining Sequential Patterns with Timing Constraints
• Approach 1:
  – Mine sequential patterns without timing constraints
  – Postprocess the discovered patterns
• Approach 2:
  – Modify GSP to directly prune candidates that violate timing constraints

Frequent Subgraph Mining
• Extend association rule mining to finding frequent subgraphs
• Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

[Figure: Graph Definitions]

Representing Transactions as Graphs
• Each transaction is a clique of items
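A clique means every pair of items in the transaction is joined by an edge. A minimal sketch of that conversion, with an illustrative function name and item labels that are not from the lecture notes:

```python
from itertools import combinations

def transaction_to_clique(items):
    """Return (vertices, edges) of the clique formed by a transaction's items."""
    vertices = sorted(set(items))
    edges = list(combinations(vertices, 2))  # one edge per unordered pair
    return vertices, edges

vertices, edges = transaction_to_clique(["A", "B", "C", "D"])
print(len(vertices), len(edges))  # 4 6
```

A transaction with m items becomes a graph with m(m-1)/2 edges.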

[Figure: Representing Graphs as Transactions]

Apriori-like Algorithm
• Find frequent 1-subgraphs
• Repeat
  – Candidate generation: Use frequent (k-1)-subgraphs to generate candidate k-subgraphs
  – Candidate pruning: Prune candidate subgraphs that contain infrequent (k-1)-subgraphs
  – Support counting: Count the support of each remaining candidate
  – Eliminate candidate k-subgraphs that are infrequent
In practice, it is not as easy. There are many other issues.

[Figure: Example: Dataset]

[Figure: Apriori on Graphs]