MINING ASSOCIATION RULES FROM LARGE DATABASES USING THE LATTICE-BASED APPROACH AND HYBRID SEARCH METHOD
Arif Djunaidy, Rully Soelaiman, Daning Tyaspamadya
Faculty of Information Technology, ITS - Surabaya
iiWAS 2002, 10-12/09/2002

Background - 1
• In data mining, association rules represent relationships that may exist among the items in a transactional database
• Since association rules may reflect customer behavior, identifying the frequent itemsets and forming the conditional implication rules among items are of paramount importance
• Efficient algorithms that keep the cost of these two tasks low are therefore required
• For large databases, however, extracting a set of meaningful association rules may require substantial memory and repeated database scanning, which in turn increases the overall computing time of the mining process

Background - 2
• The task of discovering all frequent associations in very large databases is quite challenging
  – The search space is exponential in the number of database attributes
  – With millions of database objects, I/O minimization becomes paramount
• Most current approaches are iterative in nature and require multiple database scans
• Most approaches use very complicated internal data structures, which have poor locality and add space and computation overheads

Key Features of Our Approach
• All frequent itemsets are enumerated via simple tid-list intersections
• A lattice-theoretic approach is used to decompose the original search space (lattice) into smaller pieces (sub-lattices) that can be processed independently and more easily
• A hybrid search strategy is used to enumerate the frequent itemsets within each sub-lattice
• Our approach is designed to involve only a few database scans in order to minimize the I/O costs

Problem Statement - 1
• An association rule can be written as A ⇒ B, where
  – A is an itemset called the antecedent or left-hand side (LHS), and
  – B is an itemset called the consequent or right-hand side (RHS)
• The association mining task is to discover a set of association rules among a large number of objects in a given database

Problem Statement - 2
• The fundamental task in mining association rules is to generate all association rules X ⇒ Y (X, Y are itemsets) that can be extracted from the database. These rules must satisfy both the support and confidence constraints
  – Support constraint: Sup(X ∪ Y) ≥ MinSup
  – Confidence constraint: Sup(X ∪ Y) / Sup(X) ≥ MinCof
  – Sup(X), the support of an itemset X, is defined as the number of transactions in which X occurs as a subset
• An itemset is categorized as a frequent itemset if its support is at least the minimum support (MinSup) supplied by the user
• The confidence factor represents the conditional probability that a transaction contains Y, given that it contains X
  – An association rule is said to be confident if its confidence factor is at least the minimum confidence (MinCof) supplied by the user
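
The two constraints can be illustrated with a minimal Python sketch. The six-transaction database, the item names, and the function names `support` and `confidence` are assumptions made for illustration; they are not taken from the slides. Support is counted in transactions, as defined above.

```python
# Hypothetical sample database: each transaction is a set of items.
TRANSACTIONS = [
    {"A", "C", "T", "W"},
    {"C", "D", "W"},
    {"A", "C", "T", "W"},
    {"A", "C", "D", "W"},
    {"A", "C", "D", "T", "W"},
    {"C", "D", "T"},
]


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(x, y, transactions):
    """Sup(X u Y) / Sup(X): estimated probability of Y given X."""
    sup_x = support(x, transactions)
    if sup_x == 0:
        return 0.0  # X never occurs, so the rule has no evidence
    return support(set(x) | set(y), transactions) / sup_x


if __name__ == "__main__":
    print(support({"A", "C"}, TRANSACTIONS))       # transactions containing A and C
    print(confidence({"A"}, {"C"}, TRANSACTIONS))  # confidence of A => C
    print(confidence({"C"}, {"A"}, TRANSACTIONS))  # confidence of C => A
```

In this sample, A ⇒ C holds in every transaction that contains A, while C ⇒ A does not, which shows that confidence is not symmetric.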

Simple Example - 1
• Consider the sales database of a food store, where the objects represent customers and the items represent food products
  – In this example, the discovered patterns are the sets of food products frequently bought together by the customers
  – An example pattern could be that “60 percent of the customers who buy cereal also buy milk”
  – The store can then use this knowledge for shelf placement, stock control, etc.
• There are many potential application areas for association rule technology, including catalog design, customer segmentation, store layout, and so on

Simple Example - 2
MinSup = 50%, MinCof = 100%

The Lattice-Based Approach - 1
• We use the lattice-theoretic approach to:
  – Identify all frequent itemsets
  – Count the support of association rules
• Prerequisite: construct the tid-lists from the transaction database
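
The prerequisite step can be sketched as follows. This is a minimal illustration that assumes the database fits in memory as a mapping from transaction id (tid) to item set; the sample data and the name `build_tidlists` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample database: tid -> items bought in that transaction.
DATABASE = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def build_tidlists(database):
    """Invert the database in one pass: item -> set of tids containing it."""
    tidlists = defaultdict(set)
    for tid, items in database.items():
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)


if __name__ == "__main__":
    for item, tids in sorted(build_tidlists(DATABASE).items()):
        print(item, sorted(tids))
```

Each transaction is read once, so the inversion needs a single scan of the database in this sketch.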

The Lattice-Based Approach - 2
• Construct the powerset lattice P(I)
[Figure: powerset lattice with the maximal frequent itemsets marked, MinSup = 50%]

The Lattice-Based Approach - 3
• Compute the support of itemsets via tid-list intersections
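
A minimal sketch of support counting by tid-list intersection is given below. The tid-lists and the name `support_by_intersection` are assumptions chosen for illustration, not data from the slides.

```python
# Hypothetical tid-lists: item -> set of ids of the transactions containing it.
TIDLISTS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def support_by_intersection(itemset, tidlists):
    """Support of a non-empty itemset = size of the intersection of its items' tid-lists."""
    items = list(itemset)
    tids = set(tidlists[items[0]])  # copy so the stored tid-list is not modified
    for item in items[1:]:
        tids &= tidlists[item]
    return len(tids)


if __name__ == "__main__":
    print(support_by_intersection("AC", TIDLISTS))
    print(support_by_intersection("ACTW", TIDLISTS))
    print(support_by_intersection("AD", TIDLISTS))
```

Once the tid-lists exist, the support of any itemset is obtained from set intersections alone, without going back to the original transactions.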

Hybrid Search for Freq. Itemsets - 1
• Hybrid search is used to quickly enumerate all frequent itemsets
• Hybrid search combines the top-down and bottom-up search strategies. It is based on the intuition that the greater the support of a frequent itemset, the more likely it is to be part of a longer frequent itemset
• The hybrid approach is divided into two main steps:
  – An initial phase that rearranges the atoms, and
  – The hybrid process itself, which generates all frequent itemsets. In this second step, the recursion is repeated until no more frequent itemsets can be generated

Hybrid Search for Freq. Itemsets - 2
• The first step simply rearranges the atoms in descending order of their supports, using a sorting algorithm
• The second step intersects the atoms one pair at a time
  – The intersection process starts from the pair of atoms with the largest supports, in order to produce a longer frequent itemset
  – The process stops when an extension becomes infrequent (i.e., an itemset that does not satisfy the minimum support requirement)
  – The second, bottom-up phase is then entered
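
The two steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the tid-lists and the name `hybrid_search` are hypothetical, the minimum support is given as an absolute transaction count, an itemset is treated as frequent when its support is at least that count, and the way the bottom-up phase combines the remaining atoms is one possible reading of the slides.

```python
from itertools import combinations

# Hypothetical tid-lists for a six-transaction database.
TIDLISTS = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}


def hybrid_search(tidlists, min_count):
    """Return all frequent itemsets (as frozensets) for the given tid-lists."""
    # Step 1: keep the frequent atoms and sort them by descending support.
    atoms = sorted(
        ((i, t) for i, t in tidlists.items() if len(t) >= min_count),
        key=lambda a: (-len(a[1]), a[0]),
    )
    if not atoms:
        return set()

    # Step 2a: extend the best atom one atom at a time until the
    # extension becomes infrequent.
    chain, tids = [atoms[0][0]], atoms[0][1]
    for item, item_tids in atoms[1:]:
        extended = tids & item_tids
        if len(extended) < min_count:
            break
        chain.append(item)
        tids = extended
    split = len(chain)

    # Every non-empty subset of a frequent itemset is frequent, so the
    # subsets of the chain need no further intersections.
    frequent = {
        frozenset(c)
        for k in range(1, split + 1)
        for c in combinations(chain, k)
    }

    # Step 2b: bottom-up search for itemsets that contain at least one
    # atom left out of the chain.
    def bottom_up(prefix, prefix_tids, candidates):
        frequent.add(frozenset(prefix))
        for i, (item, item_tids) in enumerate(candidates):
            new_tids = prefix_tids & item_tids
            if len(new_tids) >= min_count:
                bottom_up(prefix + [item], new_tids, candidates[i + 1:])

    rest = atoms[split:]
    order = rest + atoms[:split]  # remaining atoms first, chain atoms last
    for i, (item, item_tids) in enumerate(rest):
        bottom_up([item], item_tids, order[i + 1:])
    return frequent


if __name__ == "__main__":
    result = hybrid_search(TIDLISTS, min_count=3)  # 50% of 6 transactions
    for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
        print("".join(sorted(itemset)))
```

The first phase is greedy, so the itemset it finds is frequent but not necessarily the longest one. The bottom-up phase then covers every frequent itemset that the first phase did not reach.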

Hybrid Search for Freq. Itemsets - 3
[Figure: hybrid search example with the infrequent itemsets marked, MinSup = 50%]

Design of Application

Test Data
Statistics of Test Data

Experimental Results - 1
Number of k-itemsets

Experimental Results - 2
Number of Association Rules

Experimental Results - 3
Computing Time

Experimental Results - 4
Support Counting Performance

Experimental Results - 5
Comparison Results

Conclusions
• Experimental results show that this approach, together with the hybrid search method, reduces the computing time compared with both the Apriori-based algorithms and the similar lattice-based approach that uses the bottom-up search strategy
• Another advantage of the lattice-based algorithm concerns the time spent scanning the database. The lattice-based algorithm requires only a single database scan, so the I/O overhead is kept to a minimum
• As far as computing speed is concerned, substantial computing time is still required to process large databases. Although the lattice approach is relatively powerful, this indicates that other computing methodologies, such as parallel algorithms in distributed computing environments, need to be considered to solve the computing speed problem