Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining
KDD 2003
Mohammad El-Hajj, Osmar R. Zaïane
Department of Computing Science, University of Alberta, Canada
ACM SIGKDD, Aug. 2003 – Washington, DC
© M. El-Hajj and O. R. Zaïane, 2003. Database Lab., University of Alberta, Canada

Outline
  • Introduction
  • Pre-Processing Phase
      Transactional Layouts
  • Mining Phase
      Building COFI-trees
      Mining COFI-trees
  • Experimental Studies
  • Conclusion and Future work

Association Rule Mining
Association rule mining is crucial in many applications and plays an essential role in many important mining tasks.
  • Rule form: Antecedent (Body) and Consequent (Head)
  • Two steps:
      1. Frequent Itemset Mining (FIM)
      2. Association Rules Generation

Challenges for FIM
  1. High memory dependency
  2. Repetitive tasks, (I/O) readings (superfluous processing)
  3. Non-interactive mining

Expensive candidacy generation step OR huge memory-based data structures.

Example:
  Support > 4: frequent 1-itemsets {A, B, C, D, E, F}; non-frequent items {G, H, I, J, K, L, M, N, O, P, Q, R}
  Support > 9: frequent 1-itemsets {A, B, C}; non-frequent items {D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R}

Changing the support level means expensive steps (the whole process is redone).

[Figure: Databases, Data warehouse, Selection and Transformation, Data Mining, Patterns, Evaluation and Knowledge Presentation]

Motivation
  • New association rule mining algorithm that has the following features:
      1. Low memory dependency
      2. Remove superfluous processing
      3. Interactive mining ready
    Without compromising scalability

Transactional Layouts
  • Horizontal Layout
      Candidacy generation can be removed (FP-Growth)
      Superfluous processing
  • Vertical Layout
      Minimize superfluous processing
      Candidacy generation is required
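The two layouts can be contrasted with a small sketch. The transactions below are hypothetical toy data, and `to_vertical` and `support` are illustrative helper names, not taken from the slides. The horizontal layout stores one item list per transaction, while the vertical layout stores one transaction-id list per item.

```python
from collections import defaultdict

# Hypothetical toy data in horizontal layout: transaction id -> items.
horizontal = {
    1: ["A", "B", "C"],
    2: ["A", "C"],
    3: ["B", "D"],
}

def to_vertical(transactions):
    """Turn {tid: items} into {item: list of tids} (vertical layout)."""
    vertical = defaultdict(list)
    for tid in sorted(transactions):
        for item in transactions[tid]:
            vertical[item].append(tid)
    return dict(vertical)

def support(itemset, vertical):
    """Support of an itemset = size of the intersection of its tid lists."""
    tid_sets = [set(vertical[item]) for item in itemset]
    return len(set.intersection(*tid_sets))

vertical = to_vertical(horizontal)
print(vertical)                        # {'A': [1, 2], 'B': [1, 3], 'C': [1, 2], 'D': [3]}
print(support(["A", "C"], vertical))   # 2
```

In the vertical form only the tid lists of the items in a candidate are read, but the candidate itemsets still have to be generated before their tid lists can be intersected.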

Suggested Layout
  • Inverted Matrix Layout: combines the horizontal and vertical layouts
      2 I/O passes

Transactional Layouts
  • Inverted Matrix Layout
      Pass 1: generates the sorted item list (based on frequency)
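A minimal sketch of what Pass 1 could look like, assuming transactions are given as lists of item names. The toy data, the function name `pass_one`, and the alphabetical tie-break are assumptions made for illustration; the slides only state that the list is sorted by frequency.

```python
from collections import Counter

def pass_one(transactions):
    """Count each item once per transaction and sort by ascending frequency."""
    counts = Counter(item for t in transactions for item in set(t))
    # Tie-break on the item name only to make the output deterministic (assumption).
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))

# Hypothetical toy transactions.
toy = [["A", "B", "C"], ["A", "C"], ["A", "B"], ["A", "D"]]
print(pass_one(toy))  # [('D', 1), ('B', 2), ('C', 2), ('A', 4)]
```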

Transactional Layouts
  • Inverted Matrix Layout
      Pass 2: generate the transactional array of the IM
[Figure: the transactions T1 to T18 with their items, next to the index (Loc 1 to 18, items R to A in ascending frequency order) and the transactional array (columns 1 to 11) being filled step by step. Pointers shown in the first step: (15, 1), (16, 1), (17, 1), (18, 1), (¤, ¤); in the next step: (14, 1), (15, 2), (16, 1), (16, 2), (17, 1), (17, 2), (18, 1), (¤, ¤).]

Transactional Layouts
  • Inverted Matrix Layout
There is no minimum support involved in building the Inverted Matrix.

Index and first column of the transactional array:

  Loc  Item  Freq  Column 1
    1  R       2   (2, 1)
    2  Q       2   (12, 2)
    3  P       3   (4, 1)
    4  O       3   (5, 2)
    5  N       3   (13, 1)
    6  M       3   (14, 2)
    7  L       3   (8, 1)
    8  K       3   (13, 2)
    9  J       3   (13, 4)
   10  I       3   (11, 2)
   11  H       3   (14, 1)
   12  G       4   (15, 1)
   13  F       7   (14, 3)
   14  E       8   (15, 2)
   15  D       9   (16, 1)
   16  C      10   (17, 1)
   17  B      10   (18, 1)
   18  A      11   (¤, ¤)

Remaining transactional array entries, in slide order (the row alignment is not recoverable):
  Column 2: (3, 2) (3, 3) (9, 1) (5, 3) (17, 4) (13, 3) (8, 2) (14, 5) (13, 5) (11, 3) (12, 3) (16, 4) (14, 4) (15, 3) (16, 2) (17, 2) (¤, ¤)
  Columns 3 and 4: (9, 2) (6, 3) (6, 2) (12, 4) (15, 9) (13, 7) (14, 7) (13, 6) (15, 4) (16, 5) (18, 7) (16, 3) (17, 2) (18, 3) (18, 2) (¤, ¤) (15, 6) (16, 6) (17, 5) (17, 6) (18, 5) (18, 4) (¤, ¤)
  Columns 5 to 7: (16, 8) (15, 5) (17, 7) (18, 6) (¤, ¤) (14, 6) (15, 7) (16, 7) (¤, ¤) (18, 8) (¤, ¤) (14, 8) (15, 8) (17, 8) (¤, ¤)
  Column 8: (16, 9) (17, 9) (¤, ¤)
  Columns 9 to 11: (16, 10) (18, 10) (17, 10) (18, 9) (18, 11) (¤, ¤)
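The construction can be sketched on a much smaller example. This is an illustrative reading of the layout, not the authors' implementation: each index row holds an item, its frequency, and one slot per transaction containing the item, and each slot stores a pointer (row, column) to the next item of the same transaction in ascending frequency order. The toy transactions, the name `build_inverted_matrix`, the dictionary-based rows, and the alphabetical tie-break are all assumptions; `None` stands in for the (¤, ¤) null pointer.

```python
from collections import Counter

NULL = None  # stands in for the (¤, ¤) null pointer

def build_inverted_matrix(transactions):
    # Pass 1: count items and assign 1-based locations in ascending frequency order.
    counts = Counter(item for t in transactions for item in set(t))
    ordered = sorted(counts, key=lambda item: (counts[item], item))
    loc = {item: i + 1 for i, item in enumerate(ordered)}
    rows = {loc[i]: {"item": i, "freq": counts[i], "slots": []} for i in ordered}

    # Pass 2: give every item of a transaction a slot, then link each slot
    # to the slot of the next (more frequent) item of the same transaction.
    for t in transactions:
        items = sorted(set(t), key=loc.get)
        positions = []
        for item in items:
            row = rows[loc[item]]
            row["slots"].append(NULL)
            positions.append((loc[item], len(row["slots"])))  # 1-based (row, column)
        for (r, c), nxt in zip(positions, positions[1:]):
            rows[r]["slots"][c - 1] = nxt
    return rows

# Hypothetical toy transactions.
toy = [["A", "B", "C"], ["A", "C"], ["A", "B"], ["A", "D"]]
for location, row in build_inverted_matrix(toy).items():
    print(location, row["item"], row["freq"], row["slots"])
# 1 D 1 [(4, 4)]
# 2 B 2 [(3, 1), (4, 3)]
# 3 C 2 [(4, 1), (4, 2)]
# 4 A 4 [None, None, None, None]
```

No support threshold appears anywhere in the sketch, which mirrors the statement that the Inverted Matrix is built without a minimum support.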

Transactional Layouts
  • Inverted Matrix Layout
      Support > 4
      Border Support
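As a hedged illustration, the border can be read as the first index location whose frequency exceeds the chosen support. That reading is an assumption, since the slide only labels the border. Because the index is sorted by ascending frequency, the location can be found with a binary search. The item order and frequencies below are those of the index shown in the deck; the function names are hypothetical.

```python
from bisect import bisect_right

# Index of the Inverted Matrix from the slides, in ascending frequency order.
ITEMS = list("RQPONMLKJIHGFEDCBA")
FREQS = [2, 2] + [3] * 9 + [4, 7, 8, 9, 10, 10, 11]

def border_location(freqs, support):
    """1-based location of the first item whose frequency is > support."""
    return bisect_right(freqs, support) + 1

def frequent_items(items, freqs, support):
    """Items at or after the border, i.e. with frequency > support."""
    return items[bisect_right(freqs, support):]

print(border_location(FREQS, 4))         # 13
print(frequent_items(ITEMS, FREQS, 4))   # ['F', 'E', 'D', 'C', 'B', 'A']
print(border_location(FREQS, 9))         # 16
print(frequent_items(ITEMS, FREQS, 9))   # ['C', 'B', 'A']
```

Changing the support only moves the border; the index itself is untouched, which is what makes the layout suitable for interactive mining.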

Sub transactions generated from IM
  • Frequent sub-transaction with item F
  • Frequent sub-transaction with item E
  • Frequent sub-transaction with item D
  • Frequent sub-transaction with item C
  • Frequent sub-transaction with item B
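One way to picture how sub-transactions fall out of the Inverted Matrix is to start at an item's row and follow each slot's pointer chain until the null pointer. The sketch below does this on a hypothetical four-item matrix written out by hand; `ROWS`, `sub_transactions`, and the use of `None` for (¤, ¤) are assumptions for illustration only.

```python
# Hypothetical toy Inverted Matrix: location -> (item, slots).
# Each slot is a 1-based (row, column) pointer to the next item of the
# same transaction, or None for the (¤, ¤) null pointer.
ROWS = {
    1: ("D", [(4, 4)]),
    2: ("B", [(3, 1), (4, 3)]),
    3: ("C", [(4, 1), (4, 2)]),
    4: ("A", [None, None, None, None]),
}

def sub_transactions(rows, location):
    """Collect, for the item at `location`, the item plus every item that
    follows it in each transaction (all of them at least as frequent)."""
    item, slots = rows[location]
    result = []
    for pointer in slots:
        chain = [item]
        while pointer is not None:
            row, column = pointer
            next_item, next_slots = rows[row]
            chain.append(next_item)
            pointer = next_slots[column - 1]
        result.append(chain)
    return result

print(sub_transactions(ROWS, 2))  # [['B', 'C', 'A'], ['B', 'A']]
print(sub_transactions(ROWS, 3))  # [['C', 'A'], ['C', 'A']]
```

Starting the walk at a location at or beyond the border means every item reached is frequent, so the rows below the border never need to be read.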

Co-Occurrences Frequent Item tree
  Building F-COFI-tree
  Node fields: Frequency Count, Participation Count

Co-Occurrences Frequent Item tree

Mining COFI-trees
  E-COFI-tree
  Support = Frequency count - Participation count
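The formula can be illustrated with a small sketch. It assumes that each tree node keeps both counters and that, once a branch has been used to generate a pattern, its support is added to the participation count of every node on the path to the root. The class `CofiNode`, the function names, and the counts are hypothetical and are not read off the slides.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CofiNode:
    item: str
    frequency: int                       # Frequency count
    participation: int = 0               # Participation count
    parent: Optional["CofiNode"] = None

def branch_support(node):
    """Support = Frequency count - Participation count."""
    return node.frequency - node.participation

def consume_branch(node):
    """Return (items from node up to the root, support) and record that
    this support has now participated in a pattern along the whole path."""
    support = branch_support(node)
    path = []
    current = node
    while current is not None:
        path.append(current.item)
        current.participation += support
        current = current.parent
    return path, support

# Hypothetical branch: root E (8) -> D (5) -> B (3).
e = CofiNode("E", 8)
d = CofiNode("D", 5, parent=e)
b = CofiNode("B", 3, parent=d)

print(consume_branch(b))  # (['B', 'D', 'E'], 3)
print(consume_branch(d))  # (['D', 'E'], 2)
```

The second call returns 2 rather than 5 because 3 of D's occurrences were already accounted for by the longer branch through B.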

Mining COFI-trees
  D-COFI-tree: DBA: 5, DB: 8
  B-COFI-tree: BA: 6
  C-COFI-tree: CA: 6

Experimental Studies
  Platform: Pentium 700 MHz with 256 MB of RAM
  [Figure: Time needed to mine 1M transactions with different support levels]
  [Figure: Accumulated time needed to mine 1M transactions using 4 different support levels]
  [Figure: Time needed in seconds to mine different transaction sizes]

Conclusion and Future work
New AR algorithm
  1. Low memory dependency
  2. No superfluous processing
  3. Interactive mining ready
  4. Scalable
Future work
  • Updateable Inverted Matrix for native storage of transactions
  • Compressing the size of the Inverted Matrix
  • Parallelizing the mining process as well as the construction of the Inverted Matrix