
From Association Rules To Causality

Presenters: Amol Shukla, University of Waterloo; Claude-Guy Quimper, University of Waterloo


Presentation Outline
• Limitations of Association Rules and the Support-Confidence Framework
• Generalizing Association Rules to Correlations
• Scalable Techniques for Mining Causal Structures
• Applications of Correlation and Causality
• Summary


Review: Association Rule Mining
• Itemset I = {i1, …, ik}
• Find all rules X ⇒ Y with minimum confidence and support:
  • support, s: probability that a transaction contains X ∪ Y
  • confidence, c: conditional probability that a transaction containing X also contains Y, i.e., P(Y|X)

  Transaction-id   Items bought
  10               A, B, C
  20               A, C
  30               A, D
  40               B, E, F

Let min_support = 50% and min_conf = 50%. Two example association rules are:
• A ⇒ C (support 50%, confidence 66.7%)
• C ⇒ A (support 50%, confidence 100%)
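A minimal Python sketch of these definitions (an editorial addition, not part of the original deck, using the toy database above):

```python
transactions = [
    {"A", "B", "C"},  # transaction 10
    {"A", "C"},       # transaction 20
    {"A", "D"},       # transaction 30
    {"B", "E", "F"},  # transaction 40
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y | X), estimated as support(X union Y) / support(X)."""
    return support(x | y) / support(x)

print(support({"A", "C"}))       # 0.5   -> support of A => C
print(confidence({"A"}, {"C"}))  # 0.666 -> confidence of A => C
print(confidence({"C"}, {"A"}))  # 1.0   -> confidence of C => A
```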


Limitations of Association Rules using the Support-Confidence Framework
• Negative implications or dependencies are ignored. Consider the adjoining database:
  • X and Y: positively related
  • X and Z: negatively related
  • yet the support and confidence of X ⇒ Z dominate
• Only the presence of items is taken into account


Limitations of Association Rules using the Support-Confidence Framework
Another market-basket example:

             Coffee   No Coffee   Sum (row)
Tea          20       5           25
No Tea       70       5           75
Sum (col.)   90       10          100

Buys Tea ⇒ Buys Coffee (support = 20%, confidence = 80%)
• Is this rule really valid? No:
  • Pr(Buys Coffee) = 90%
  • Pr(Buys Coffee | Buys Tea) = 80%
• The negative correlation between buying tea and buying coffee is ignored


From Association Rules To Causality
• Limitations of Association Rules and the Support-Confidence Framework
• Generalizing Association Rules to Correlations
• Scalable Techniques for Mining Causal Structures
• Applications of Correlation and Causality
• Summary


What is Correlation?
• P(A): probability that event A occurs
• P(A′): probability that event A does not occur
• P(AB): probability that events A and B occur together
• Events A and B are said to be independent if P(AB) = P(A) × P(B); otherwise A and B are dependent
• Events A and B are said to be correlated if any of AB, A′B, AB′, A′B′ are dependent
• A correlation rule is a set of items that are correlated
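A quick numeric check of this definition (an editorial addition, reusing the tea/coffee counts from the earlier market-basket slide):

```python
# Worked check of the independence definition, using the
# tea/coffee counts out of 100 baskets.
p_tea, p_coffee = 25 / 100, 90 / 100
p_tea_and_coffee = 20 / 100                 # observed P(Tea and Coffee)

expected_if_independent = p_tea * p_coffee  # 0.225 under independence
print(p_tea_and_coffee == expected_if_independent)  # False -> dependent
```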


Computing Correlation Rules: Chi-Squared Test for Independence
• For an itemset I = {i1, …, ik}, construct a k-dimensional contingency table R = {i1, i1′} × … × {ik, ik′}
• We need to test whether the cells r = (r1, …, rk) of this table show dependence
• Let O(r) denote the observed count of cell r in this table, and E(r) its expected count under independence. The chi-squared statistic is computed as:

  χ² = Σ_r (O(r) − E(r))² / E(r)

• If χ² = 0, the cells are independent; if χ² is greater than the cut-off value, reject the independence assumption


Example: Computing the Chi-Squared Statistic

             Coffee   No Coffee   Sum (row)
Tea          20       5           25
No Tea       70       5           75
Sum (col.)   90       10          100

E(Coffee, Tea)       = (90 × 25)/100 = 22.5
E(No Coffee, Tea)    = (10 × 25)/100 = 2.5
E(Coffee, No Tea)    = (90 × 75)/100 = 67.5
E(No Coffee, No Tea) = (10 × 75)/100 = 7.5

χ² = (20 − 22.5)²/22.5 + (5 − 2.5)²/2.5 + (70 − 67.5)²/67.5 + (5 − 7.5)²/7.5
   = 0.28 + 2.5 + 0.09 + 0.83 = 3.7

Since this value is greater than the cut-off value (2.71 at the 90% significance level), we reject the independence assumption
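The same computation can be reproduced with SciPy (an editorial sketch; `correction=False` disables the Yates continuity correction so the statistic matches the hand calculation above):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[20, 5],    # Tea:    Coffee, No Coffee
                     [70, 5]])   # No Tea: Coffee, No Coffee

stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)                        # [[22.5  2.5] [67.5  7.5]]
print(round(stat, 2))                  # 3.7
print(round(chi2.ppf(0.90, df=1), 2))  # 2.71, the 90% cut-off
print(stat > chi2.ppf(0.90, df=1))     # True -> reject independence
```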


Determining the Cause of Correlation
• Define a measure of interest for each cell: I(r) = O(r) / E(r)
• I(r) > 1 indicates positive dependence and I(r) < 1 indicates negative dependence
• The farther I(r) is from 1, the more a cell contributes to the χ² value, and hence to the correlation

Cell counts:
         Coffee   No Coffee
Tea      20       5
No Tea   70       5

Measures of interest (I(r) = O(r)/E(r)):
         Coffee            No Coffee
Tea      20/22.5 = 0.89    5/2.5 = 2
No Tea   70/67.5 = 1.03    5/7.5 = 0.66

• Thus, [No Coffee, Tea] contributes the most to the correlation, indicating that buying tea might inhibit buying coffee
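Continuing the SciPy sketch (an editorial addition): the interest measure is just the element-wise ratio of observed to expected counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20, 5], [70, 5]])
_, _, _, expected = chi2_contingency(observed, correction=False)

interest = observed / expected  # I(r) = O(r) / E(r), element-wise
print(np.round(interest, 2))    # [[0.89 2.  ]
                                #  [1.04 0.67]] (slide truncates to 1.03, 0.66)
```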


Properties of Correlation
• If a set of items is correlated, all of its supersets are also correlated; thus, correlation is upward-closed
• We can focus on minimal correlated itemsets to reduce our search space
• Support is downward-closed: a set has minimum support only if all of its subsets have minimum support
• We can combine correlation with support for an effective pruning strategy


Combining Correlation with Support
• The support-confidence framework looks only at the top-left cell of the contingency table; to incorporate negative dependence, we must consider all the cells in the table
• Combine correlation with support by defining "CT-support":
  • Let s be a user-specified min-support threshold, and let p be a user-specified cut-off percentage
  • An itemset I is CT-supported if at least p% of the cells in its contingency table have support not less than s
• An itemset is significant if it is CT-supported and minimally correlated
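A minimal sketch of the CT-support test (an editorial addition; the helper name `ct_supported` is hypothetical):

```python
import numpy as np

def ct_supported(observed, s, p, n_transactions):
    """An itemset is CT-supported if at least p% of its contingency-table
    cells have support (cell count / total transactions) of at least s."""
    cell_support = np.asarray(observed) / n_transactions
    return np.mean(cell_support >= s) * 100 >= p

observed = np.array([[20, 5], [70, 5]])
print(ct_supported(observed, s=0.05, p=100, n_transactions=100))  # True
print(ct_supported(observed, s=0.10, p=100, n_transactions=100))  # False
```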

A level-wise algorithm for finding correlation rules


Steps performed by the algorithm at level k
1) Construct the contingency table for the next itemset at the level
2) Is the itemset CT-supported?
   • No: discard it and move to the next itemset
   • Yes: go to step 3
3) Is χ² greater than the cut-off value?
   • Yes: mark the itemset as 'significant'
   • No: add it to the set NOTSIG
4) When done processing all itemsets at level k, generate the itemset(s) of size k+1 such that all of their subsets are in NOTSIG, and repeat at level k+1
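A compact Python sketch of this level-wise loop (an editorial addition with simplifying assumptions: candidates start at pairs, the χ² cut-off is held fixed rather than adjusted to each table's degrees of freedom, and candidate generation is naive):

```python
from itertools import combinations, product

def contingency(itemset, transactions):
    """Count transactions falling in each of the 2^k presence/absence cells."""
    counts = {}
    for pattern in product([True, False], repeat=len(itemset)):
        counts[pattern] = sum(
            all((item in t) == present
                for item, present in zip(itemset, pattern))
            for t in transactions)
    return counts

def chi_squared(counts, n):
    """Pearson chi-squared statistic against the full-independence model."""
    k = len(next(iter(counts)))
    # marginal probability that each of the k items is present
    p = [sum(c for pat, c in counts.items() if pat[i]) / n for i in range(k)]
    stat = 0.0
    for pat, observed in counts.items():
        expected = n
        for prob, present in zip(p, pat):
            expected *= prob if present else (1 - prob)
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat

def significant_itemsets(transactions, items, s, p, cutoff):
    """Level-wise search: prune unsupported sets, report minimal correlated
    sets as significant, and grow candidates only from NOTSIG."""
    n = len(transactions)
    sig, notsig = [], []
    k = 2
    cand = list(combinations(sorted(items), k))
    while cand:
        for itemset in cand:
            counts = contingency(itemset, transactions)
            ct_ok = sum(c / n >= s for c in counts.values()) >= p / 100 * len(counts)
            if not ct_ok:
                continue                      # not CT-supported: discard
            if chi_squared(counts, n) > cutoff:
                sig.append(itemset)           # minimal correlated: report
            else:
                notsig.append(itemset)        # candidate for extension
        k += 1
        cand = [c for c in combinations(sorted(items), k)
                if all(sub in notsig for sub in combinations(c, k - 1))]
    return sig
```

The key design point mirrors the closure properties on the earlier slide: significant (minimal correlated) itemsets are never extended, so only NOTSIG sets seed the next level.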


Limitations of Correlation
• Correlation might not be valid for 'sparse' itemsets: the test requires at least 80% of the cells in the contingency table to have an expected value greater than 5
• Finding correlation rules is computationally more expensive than finding association rules
• Correlation only indicates the existence of a relationship; it does not specify the nature of the relationship, i.e., the cause-and-effect phenomenon is ignored
• Identifying causality is important for decision-making


From Association Rules to Causality
• Limitations of Association Rules and the Support-Confidence Framework
• Generalizing Association Rules to Correlations
• Scalable Techniques for Mining Causal Structures
• Applications of Correlation and Causality
• Summary


Causality
[Diagram: Hamburgers linked to both Hot-Dogs and BBQ Sauce, each with strength 33%]
Association Rule: Hot-Dogs ⇒ BBQ Sauce [33%, 50%]
Causality Rule: Hamburgers ⇒ BBQ Sauce


Bayesian Networks
• What is the best topology of a Bayesian network that describes the observed data?
• Problem: very expensive to compute


Simplifying Causal Relationships
• Knowing the existence of a causal relationship is as good as knowing the relationship itself


Causality vs. Correlation
Two correlated variables can have either:
• a causal relationship, or
• a common ancestor (a hidden common cause)


First Rule of Causality
1) Suppose we have three pairwise dependent variables, say A, B, and C
2) And two of them (A and C) become independent when conditioned on the third one (B)

First Rule of Causality (cont.)
Then we have one of the following configurations, with the conditioning variable B between the other two: A → B → C, C → B → A, or A ← B → C


Second Rule of Causality
1) Suppose we have three variables where A and B are each dependent with C, but independent of each other
2) And the two independent variables (A and B) become dependent when conditioned on the third variable (C)

Second Rule of Causality (cont.)
Then the two independent variables cause the third variable: A → C ← B
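A small simulation (an editorial addition; the XOR mechanism is a hypothetical choice) showing why two independent causes become dependent once we condition on their common effect:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 100_000)   # A: fair coin flip
b = rng.integers(0, 2, 100_000)   # B: independent fair coin flip
c = a ^ b                         # C = A XOR B, caused by both A and B

print(abs(np.corrcoef(a, b)[0, 1]))                  # ~0: A, B independent
print(abs(np.corrcoef(a[c == 1], b[c == 1])[0, 1]))  # 1.0: dependent given C
```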


Finding Causality
1) Construct a graph where each variable is a vertex
2) Perform a chi-squared test on each pair of variables to determine correlation
3) Add an edge labeled "C" for each correlated pair
4) Add an edge labeled "U" for each uncorrelated pair
5) For each triplet of vertices, check whether a causality rule can be applied
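A schematic sketch of steps 1 to 5 (an editorial addition; it assumes binary variables given as 0/1 columns of a matrix, uses 3.841 as the 95% χ² cut-off for one degree of freedom, pattern-matches only the second rule, and its function names are hypothetical):

```python
from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency

def label_edges(data, cutoff=3.841):
    """Label each pair of binary variables 'C' (correlated) or 'U'
    (uncorrelated) via a pairwise chi-squared test."""
    labels = {}
    for i, j in combinations(range(data.shape[1]), 2):
        table = np.zeros((2, 2))
        for x, y in zip(data[:, i], data[:, j]):
            table[x, y] += 1
        stat, _, _, _ = chi2_contingency(table, correction=False)
        labels[(i, j)] = "C" if stat > cutoff else "U"
    return labels

def ccu_candidates(labels, n_vars):
    """Pattern-match the second rule: X-Z and Y-Z correlated but X-Y not,
    suggesting the collider X -> Z <- Y. The confirming conditional
    dependence test is omitted here for brevity."""
    edge = lambda a, b: labels[(min(a, b), max(a, b))]
    found = []
    for a, b, c in combinations(range(n_vars), 3):
        for x, y, z in ((a, b, c), (a, c, b), (b, c, a)):
            if edge(x, z) == "C" and edge(y, z) == "C" and edge(x, y) == "U":
                found.append((x, z, y))   # read as: x -> z <- y
    return found
```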


Weaknesses of the Algorithm
• The causality rules do not cover all possible causal relationships
• The χ² test at the 95% confidence level is expected to fail 5 times in every 100 tests
• Some variable pairs might be reported neither as correlated nor as uncorrelated


From Association Rules to Causality
• Limitations of Association Rules and the Support-Confidence Framework
• Generalizing Association Rules to Correlations
• Scalable Techniques for Mining Causal Structures
• Applications of Correlation and Causality
• Summary


Experiments (Census)
Correlation rules:
• Not a native English speaker ↔ Not born in the U.S.
• Served in the military ↔ Male
• Married ↔ More than 40 years old
Causality rules:
• Male ⇒ Moved in the last 5 years, Support-Job
• Native-Amer. ⇒ $20–$40K, House Holder
• Asian, Laborer ⇒ < $20K


Experiments (Text Data)
• 416 distinct frequent words
• 86,320 pairs of words; 10% are correlated

Correlation: Nelson, Mandela; area, province; united, states; prime, minister
Causality rules: upi ⇒ not reuter; Iraqi ⇒ Iraq; area, secretary ⇒ war; area, secretary ⇒ they


Beyond Correlation and Causality
• Correlation and causality appear to be stronger mathematical models than confidence and support
• It is possible to apply these concepts wherever confidence and support were previously applied


Association Rules with Constraints
Example constraint: at least one item is meat
• Correlation can be seen as a monotone constraint
• An algorithm is obtained by modifying algorithms for mining constrained association rules


From Association Rules to Causality
• Limitations of Association Rules and the Support-Confidence Framework
• Generalizing Association Rules to Correlations
• Scalable Techniques for Mining Causal Structures
• Applications of Correlation and Causality
• Summary


Conclusion (Good News)
• Correlation and causality are stronger mathematical models for retrieving interesting association rules
• They allow us to detect negative implications
• Causality explains why a correlation exists


Conclusion (Bad News)
• It is difficult to detect correlation precisely (especially in sparse data cubes)
• Not all causal relationships can be found
• Are the results really better than with support and confidence?


Open Problems
• How to discover hidden variables in causality
• How to resolve bi-directional causality for disambiguation, e.g., prime ⇒ minister vs. minister ⇒ prime
• How to find causal patterns for more than 3 variables


References

Papers:
• "Beyond Market Baskets: Generalizing Association Rules to Correlations". Brin, Motwani, Silverstein. SIGMOD 1997.
• "Scalable Techniques for Mining Causal Structures". Silverstein, Brin, Motwani, Ullman. VLDB 1998.
• "Efficient Mining of Constrained Correlated Sets". Grahne, Lakshmanan, Wang. ICDE 2000.
• "A Simple Constraint-Based Algorithm for Efficiently Mining Observational Databases for Causal Relationships". Cooper. Data Mining and Knowledge Discovery, vol. 1, 1997.

Textbook:
• "Causality: Models, Reasoning, and Inference". Judea Pearl. Cambridge University Press, 2000.

Questions