Frequent Pattern Mining
Toon Calders, Bart Goethals
ADReM research group

Outline
• What is data mining?
  - Definition
  - Local patterns vs global models
  - Supervised vs unsupervised
  - What do we do?
• Frequent set mining
• More complex data types

What is data mining?
“the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.”
[Figure labels: Data, Information, $ $ $]

Supervised vs Unsupervised
• Supervised:
  - data has been annotated
  - well-defined task: learn to annotate new data
  E.g.: examples of good/bad customers
• Unsupervised:
  - only data has been given
  - no annotation
  - « find knowledge »

Local vs Global
• Local pattern:
  - tells something about a small subset of the data
  E.g.: « 90% of the customers that purchase beer also buy chips »
• Global model:
  - fits a single model to all of the data, giving a summary
  E.g.: there is a linear relationship between $ spent and the income of the customers

What do we do?
• Pattern mining
  - Local
  - Unsupervised
• Useful for
  - large datasets
  - exploration: « what is this data like? »
• Less suitable for
  - well-studied and understood problem domains

Outline
• What is data mining?
• Frequent set mining
  - Market basket analysis
  - Association rules
  - Interestingness measures
  - Numerical attributes
• More complex data types

Market Basket Analysis
• Data: a collection of customer transactions
• Goal: find sets of products that frequently occur together
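To make the goal concrete, here is a minimal brute-force sketch that counts how often each set of products occurs together. The transactions, the `min_support` threshold, and the function name are made up for illustration and do not come from the slides; enumerating every subset of a basket is only workable for tiny examples.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets (not from the slides).
transactions = [
    {"beer", "chips", "nuts"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "nuts"},
    {"beer", "chips", "nuts"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset occurring in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        items = sorted(basket)
        # Enumerate every non-empty subset of the basket and count it once.
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                counts[frozenset(subset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


frequent = frequent_itemsets(transactions, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), count)
```

With this toy data, { beer, chips } is frequent at a threshold of 3 baskets, while { beer, nuts } is not.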

Applications
• Supermarket
  - product placement
  - special promotions
• Websearch
  - which keywords often occur together in webpages?
• Health care
  - frequent sets of symptoms for a disease

Applications
• Basically works for all data that can be represented as a set of examples/objects having certain properties
  - patient / symptoms
  - movies / ratings
  - web pages / keywords
  - basket / products
  - …

Algorithms
• Computationally a very hard problem
  - with n products, there are 2^n sets of products
• Hundreds of algorithms have been proposed
  - for sparse/dense data
  - many rows/columns
  - data fits/does not fit in memory
  - …
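The slide does not name a specific algorithm. As one well-known example of how the 2^n search space can be cut down, the sketch below follows the level-wise Apriori idea: a set can only be frequent if all of its subsets are frequent, so candidates of size k are built only from frequent sets of size k-1. The data, names, and threshold are illustrative assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise frequent set search with subset-based pruning (Apriori-style sketch)."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: c for s, c in level.items() if c >= min_support}
    result = dict(level)

    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-sets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        level = {c: support(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_support}
        result.update(level)
        k += 1
    return result


# Hypothetical toy baskets (not from the slides).
baskets = [
    {"beer", "chips", "nuts"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "nuts"},
    {"beer", "chips", "nuts"},
]
found = apriori(baskets, min_support=3)
```

In this toy run the candidate { beer, chips, nuts } is discarded without being counted, because its subset { beer, nuts } is already known to be infrequent.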

Association Rules
• Conditional probabilities
  X → Y (c%): if X is in the transaction, then there is a probability of c% that Y is in it as well.
• Based on the frequent sets, associations can be computed easily:
  { Beer, Chips } → { Snack nuts } (75%)
  { adrem.html, cnts.html } → { islab.html } (80%)
  { rain } → { overcast } (100%)
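The percentage attached to a rule X → Y is its confidence: the number of transactions containing both X and Y divided by the number containing X. A minimal sketch follows; the transactions are invented so that the Beer/Chips rule comes out at 75%, and the function name is hypothetical.

```python
def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as count(X and Y) / count(X)."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if xy <= t)
    if n_x == 0:
        raise ValueError("antecedent never occurs in the data")
    return n_xy / n_x


# Hypothetical data chosen to reproduce the 75% figure from the slide.
transactions = [
    {"beer", "chips", "snack nuts"},
    {"beer", "chips", "snack nuts"},
    {"beer", "chips", "snack nuts"},
    {"beer", "chips"},
    {"snack nuts"},
]

rule_confidence = confidence(transactions, {"beer", "chips"}, {"snack nuts"})
print(f"{{beer, chips}} -> {{snack nuts}}: {rule_confidence:.0%}")
```

Both counts are supports of frequent sets, which is why rules can be derived cheaply once the frequent sets are known.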

Interestingness Measures
• Not all association rules are interesting
  - Domain knowledge: pregnant → female, rain → overcast
  - Redundancy: if A → B (100%), then also AC → B, AD → B, …
  - Independence: if 70% of customers buy product A, then X → A (70%), Y → A (70%)
• Too many rules
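The independence case can be checked numerically. One common measure, which the slides do not name and which is used here as an assumption, is lift: the confidence of X → A divided by the overall frequency of A. A lift of 1 means X tells us nothing about A. The data below are invented to mirror the 70% example.

```python
def lift(transactions, antecedent, consequent):
    """Lift of X -> Y: P(X and Y) / (P(X) * P(Y)); 1.0 indicates independence."""
    x = frozenset(antecedent)
    y = frozenset(consequent)
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)
    n_y = sum(1 for t in transactions if y <= t)
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    if n_x == 0 or n_y == 0:
        raise ValueError("antecedent or consequent never occurs")
    # Equivalent to (n_xy / n) / ((n_x / n) * (n_y / n)), with fewer rounding steps.
    return (n_xy * n) / (n_x * n_y)


# Hypothetical data: 70% of all baskets contain A, and 70% of baskets with X contain A.
transactions = (
    [{"X", "A"}] * 7      # X and A together
    + [{"X"}] * 3         # X without A
    + [{"A"}] * 7         # A without X
    + [{"other"}] * 3     # neither
)

independent_lift = lift(transactions, {"X"}, {"A"})
print(independent_lift)
```

Here X → A has a confidence of 70%, yet the rule is uninformative because 70% of all customers buy A anyway.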

Interestingness Measures
• Incorporating background knowledge
  - e.g., via a Bayesian network
  - only produce rules that deviate from background knowledge
• Redundancies
  - Condensed representations: produce only a nonredundant subset of patterns

Interestingness Measures
• Independence
  - statistical significance tests
    • χ² test
    • Careful with conclusions! 1000 tests with significance level 0.05 … (Bonferroni correction)
• Too many rules
  - Constraints
  - Top-k mining
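The warning about many tests can be illustrated with a little arithmetic. The sketch below computes the χ² statistic for a 2×2 contingency table using the standard closed form, the number of false positives expected from 1000 tests at level 0.05 when no real effect exists, and the Bonferroni-corrected per-test level. Function and variable names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (a * d - b * c) ** 2 / denominator


def bonferroni_level(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests


alpha = 0.05
n_tests = 1000

# If every null hypothesis is true, each test still fires with probability alpha.
expected_false_positives = alpha * n_tests
corrected_level = bonferroni_level(alpha, n_tests)

# Rows: X present / X absent; columns: A present / A absent (hypothetical counts).
statistic = chi_square_2x2(7, 3, 7, 3)

print(expected_false_positives, corrected_level, statistic)
```

With one degree of freedom the critical value at level 0.05 is about 3.84, so the table above (statistic 0) gives no evidence against independence.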

Numerical Attributes
• Association rule mining is also possible for numerical attributes
  - discretization: make continuous attributes ordinal
    • information loss
    • not appropriate if the order between the values is important
  - other methods:
    • a recent method based on rank correlation measures
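As a minimal sketch of discretization, the code below uses equal-width binning, which is only one possible scheme and is an assumption here, to turn a numeric attribute into items that a frequent set miner can handle. The attribute name `age` and the values are invented.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric attribute.
ages = [18, 22, 35, 41, 58, 63, 78]
bins = equal_width_bins(ages, n_bins=3)

# Each customer now carries an item such as "age_bin=0" instead of a raw number.
items = [f"age_bin={b}" for b in bins]
print(list(zip(ages, items)))
```

The information loss mentioned on the slide is visible here: ages 18 and 35 become the same item.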

Complex Patterns
• Sets
• Sequences
• Graphs
• Relational structures
• Generation and counting of such patterns becomes much more complex too!

Sequences
CGATGGGCCAGTCGATACGTCGATGCCGATGTCACGA

Patterns in Sequences
• Substrings
• Regular expressions: (bb|[^b]{2})
• Partial orders
• Directed acyclic graphs
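For regular-expression patterns, the support of a pattern can be taken as the number of sequences in which it matches somewhere. The sketch below applies the expression from the slide to a few invented strings; counting one match per sequence is an assumption about how support is defined.

```python
import re

# Pattern from the slide: either "bb" or two consecutive characters that are not "b".
pattern = re.compile(r"(bb|[^b]{2})")

# Hypothetical sequences (not from the slides).
sequences = ["abba", "b", "ab", "cab", "bab"]


def pattern_support(pattern, sequences):
    """Count the sequences that contain at least one match of the pattern."""
    return sum(1 for s in sequences if pattern.search(s))


support = pattern_support(pattern, sequences)
matching = [s for s in sequences if pattern.search(s)]
print(support, matching)
```

Only "abba" (contains "bb") and "cab" (contains "ca") match, so the support is 2 out of 5 sequences.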

Graphs

Patterns in Graphs

Rules
[Figure: graph patterns annotated with frequencies f: 5, f: 4, f: 7, f: 8, f: 4 and rule confidences 0.8 and 0.57]

Relational Databases

Patterns in RDBs
• Queries
• Query 1:
    Select L.drinker, V.bar
    From Likes L, Visits V
    Where V.drinker = L.drinker
      And L.beer = 'Duvel'

Patterns in RDBs
• Query 2:
    Select L.drinker, V.bar
    From Likes L, Visits V, Serves S
    Where V.drinker = L.drinker
      And L.beer = 'Duvel'
      And S.bar = V.bar
      And S.beer = 'Duvel'

Patterns in RDBs
• Association Rule: Query 1 => Query 2
  If a person who likes Duvel visits a bar, then that bar serves Duvel
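The confidence of such a query-based rule can be estimated as the number of answers to Query 2 divided by the number of answers to Query 1. Below is a minimal sketch using an in-memory SQLite database; the table contents are invented, and `DISTINCT` is added as an assumption so that each (drinker, bar) pair is counted once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows matching the table names used in the queries.
conn.executescript("""
CREATE TABLE Likes  (drinker TEXT, beer TEXT);
CREATE TABLE Visits (drinker TEXT, bar  TEXT);
CREATE TABLE Serves (bar     TEXT, beer TEXT);
INSERT INTO Likes  VALUES ('ann', 'Duvel'), ('bob', 'Duvel'), ('cat', 'Orval');
INSERT INTO Visits VALUES ('ann', 'Bar A'), ('ann', 'Bar B'),
                          ('bob', 'Bar A'), ('cat', 'Bar C');
INSERT INTO Serves VALUES ('Bar A', 'Duvel'), ('Bar B', 'Orval'), ('Bar C', 'Duvel');
""")

QUERY_1 = """
SELECT DISTINCT L.drinker, V.bar
FROM Likes L, Visits V
WHERE V.drinker = L.drinker AND L.beer = 'Duvel'
"""

QUERY_2 = """
SELECT DISTINCT L.drinker, V.bar
FROM Likes L, Visits V, Serves S
WHERE V.drinker = L.drinker AND L.beer = 'Duvel'
  AND S.bar = V.bar AND S.beer = 'Duvel'
"""

answers_1 = conn.execute(QUERY_1).fetchall()
answers_2 = conn.execute(QUERY_2).fetchall()

# Query 2 only adds conditions to Query 1, so its answers are a subset of Query 1's.
rule_confidence = len(answers_2) / len(answers_1)
print(len(answers_1), len(answers_2), round(rule_confidence, 2))
conn.close()
```

In this toy database, two of the three (Duvel lover, visited bar) pairs involve a bar that serves Duvel, so the rule holds with a confidence of about 67%.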
