Privacy Preserving Mining of Association Rules Alexandre Evfimievski
Privacy Preserving Mining of Association Rules Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke IBM Almaden Research Center Cornell University
Data Mining and Privacy
• The primary task in data mining: development of models about aggregated data.
• Can we develop accurate models without access to precise information in individual data records?
• Answer: yes, by randomization.
  – R. Agrawal, R. Srikant, “Privacy Preserving Data Mining,” SIGMOD 2000
  – for numerical attributes, classification
• How about association rules?
Randomization Overview
[Figure: Alice, Bob, and Chris each send a transaction to a Recommendation Service, which mines associations and returns recommendations. With randomization, each transaction is altered before it is sent, and the service performs support recovery before mining associations.]
• Alice: J. S. Bach, painting, nasa.gov, … is sent as Metallica, painting, nasa.gov, …
• Bob: B. Spears, baseball, cnn.com, … is sent as B. Spears, soccer, bbc.co.uk, …
• Chris: B. Marley, camping, linux.org, … is sent as B. Marley, camping, microsoft.com, …
Associations Recap
• A transaction t is a set of items (e.g. books).
• All transactions form a set T of transactions.
• Any itemset A has support s in T if $s = \mathrm{supp}(A) = |\{t \in T : A \subseteq t\}| \,/\, |T|$.
• Itemset A is frequent if $s \ge s_{\min}$.
• If $A \subseteq B$, then $\mathrm{supp}(A) \ge \mathrm{supp}(B)$.
• Example:
  – 20% of transactions contain beer,
  – 5% of transactions contain beer and diapers;
  – then the confidence of “beer ⇒ diapers” is 5/20 = 0.25 = 25%.
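The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy dataset are hypothetical, chosen so that the beer/diapers numbers from the slide come out.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """supp(antecedent and consequent together) / supp(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical toy data: 20 transactions, 4 with beer (20%),
# 1 of them with beer and diapers (5%).
transactions = (
    [frozenset({"beer", "diapers"})]
    + [frozenset({"beer"})] * 3
    + [frozenset({"milk"})] * 16
)

beer_support = support({"beer"}, transactions)
rule_confidence = confidence({"beer"}, {"diapers"}, transactions)
```

Here `beer_support` is 20% and `rule_confidence` is 5/20, matching the example.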
The Problem
• How to randomize transactions so that
  – we can find frequent itemsets
  – while preserving privacy at transaction level?
Talk Outline
• Introduction
• Privacy Breaches
• Our Solution
• Experiments
• Conclusion
Uniform Randomization
• Given a transaction,
  – keep each item with 20% probability,
  – replace it with a new random item with 80% probability.
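A minimal sketch of this operator, assuming that the replacement item is drawn uniformly from the whole item catalogue (the slide does not say how the "new random item" is chosen). The function and parameter names are hypothetical.

```python
import random


def uniform_randomize(transaction, all_items, keep_prob=0.2, rng=random):
    """Keep each item with probability keep_prob; otherwise replace it
    with an item drawn uniformly at random from all_items (assumption)."""
    randomized = set()
    for item in transaction:
        if rng.random() < keep_prob:
            randomized.add(item)
        else:
            randomized.add(rng.choice(all_items))
    return randomized
```

Because the result is a set, two replacements that happen to collide leave the randomized transaction slightly shorter than the original.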
Example: {x, y, z}
10 M transactions of size 10 with 10 K items:
• 1% have {x, y, z}
• 5% have {x, y}, {x, z}, or {y, z} only
• 94% have one or zero items of {x, y, z}
Uniform randomization: how many have {x, y, z} afterwards?
• From the 1%: $0.2^3 = 0.008$ of them, i.e. 0.008% of all transactions = 800 transactions (97.8% of those with {x, y, z} after randomization)
• From the 5%: $0.2^2 \cdot 8/10{,}000$ of them, i.e. 0.00016% = 16 transactions (1.9%)
• From the 94%: at most $0.2 \cdot (9/10{,}000)^2$ of them, i.e. less than 0.00002% = 2 transactions (0.3%)
Example: {x, y, z}
• Given nothing, we have only 1% probability that {x, y, z} occurs in the original transaction.
• Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one.
• This is what we call a privacy breach.
• Uniform randomization preserves privacy “on average,” but not “in the worst case.”
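The arithmetic behind the 98% figure can be replayed as follows. The per-group probabilities are the ones from the example (0.2³, 0.2² · 8/10,000, and the upper bound 0.2 · (9/10,000)²); the variable names are illustrative only.

```python
n_transactions = 10_000_000
n_items = 10_000
keep = 0.2  # probability that an original item survives randomization

# group -> (fraction of original transactions,
#           probability that the randomized transaction contains {x, y, z})
cases = {
    "all three": (0.01, keep ** 3),
    "exactly two": (0.05, keep ** 2 * 8 / n_items),
    "one or zero": (0.94, keep * (9 / n_items) ** 2),  # upper bound
}

counts = {name: n_transactions * frac * p for name, (frac, p) in cases.items()}
total = sum(counts.values())

# Share of randomized transactions containing {x, y, z} whose original
# transaction also contained {x, y, z}.
posterior = counts["all three"] / total
```

The counts come out at about 800, 16, and under 2 transactions, so `posterior` is close to 98% even though the prior was only 1%.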
Privacy Breaches
• Suppose:
  – t is an original transaction;
  – t’ is the corresponding randomized transaction;
  – A is a (frequent) itemset.
• Definition: Itemset A causes a privacy breach of level $\rho$ (e.g. 50%) if, for some item $z \in A$, $\Pr[z \in t \mid A \subseteq t'] \ge \rho$.
  – Assumption: no external information besides t’.
Our Solution
“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.” (G. K. Chesterton)
• Insert many false items into each transaction
• Hide true itemsets among false ones
Can we still find frequent itemsets while having sufficient privacy?
Definition of cut-and-paste
• Given transaction t of size m, construct t’:
  – Choose a number j between 0 and $K_m$ (cutoff);
  – Include j items of t into t’;
  – Each other item is included into t’ with probability $p_m$.
The choice of $K_m$ and $p_m$ is based on the desired level of privacy.
Example with j = 4:
  t = a, b, c, u, v, w, x, y, z
  t’ = b, v, x, z, œ, å, ß, ξ, ψ, €, א, ъ, ђ, …
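The three steps above might be sketched like this. Two details are assumptions not stated on the slide: j is drawn uniformly from 0..K_m and capped at the transaction size, and `all_items` is taken to be the full catalogue, including the items of t. All names are hypothetical.

```python
import random


def cut_and_paste(transaction, all_items, cutoff, p_insert, rng=random):
    """Randomize one transaction with a cut-and-paste style operator."""
    t = list(transaction)
    # Assumption: j is uniform on 0..cutoff and cannot exceed |t|.
    j = min(rng.randint(0, cutoff), len(t))
    # "Cut": copy j randomly chosen items of t into t'.
    randomized = set(rng.sample(t, j))
    # "Paste": every other item (true or false) joins t' with prob. p_insert.
    for item in all_items:
        if item not in randomized and rng.random() < p_insert:
            randomized.add(item)
    return randomized
```

With `p_insert = 0` the output is just a random subset of t of size at most `cutoff`; raising `p_insert` grows the "forest" of false items around it.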
Partial Supports
To recover the original support of an itemset, we need the randomized supports of its subsets.
• Given an itemset A of size k and transaction size m,
• a vector of partial supports of A is $\vec{s} = (s_0, s_1, \dots, s_k)$, where $s_l = |\{t \in T : |t \cap A| = l\}| \,/\, |T|$.
  – Here $s_k$ is the same as the support of A.
  – Randomized partial supports are denoted by $\vec{s}\,'$.
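Transition Matrix
• Let k = |A|, m = |t|.
• Transition matrix P = P(k, m) connects randomized partial supports with original ones: $\mathbf{E}\,\vec{s}\,' = P \cdot \vec{s}$, where $P_{l'l} = \Pr[\,|t' \cap A| = l' \mid |t \cap A| = l\,]$.
• Randomized supports are distributed as a sum of multinomial distributions.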
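To make the transition matrix concrete, here is a sketch for a deliberately simplified operator, which is neither the uniform nor the cut-and-paste randomization: each item of A present in t survives independently with probability `keep`, and each item of A absent from t is inserted independently with probability `insert`. Under that assumption the matrix does not depend on m. Entry `P[l_new][l_old]` is the probability that t’ contains exactly `l_new` items of A given that t contained exactly `l_old`.

```python
from math import comb


def binom_pmf(n, j, p):
    """Pr[Binomial(n, p) = j], zero outside 0..n."""
    if j < 0 or j > n:
        return 0.0
    return comb(n, j) * p ** j * (1 - p) ** (n - j)


def transition_matrix(k, keep, insert):
    """P[l_new][l_old] for the simplified independent keep/insert operator."""
    return [
        [
            # j of the l_old true items survive, l_new - j false ones arrive
            sum(
                binom_pmf(l_old, j, keep)
                * binom_pmf(k - l_old, l_new - j, insert)
                for j in range(l_old + 1)
            )
            for l_old in range(k + 1)
        ]
        for l_new in range(k + 1)
    ]
```

Each column of the result is a probability distribution over `l_new`, so it sums to 1.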
The Unbiased Estimators
• Given randomized partial supports, we can estimate original partial supports: $\vec{s}_{est} = Q \cdot \vec{s}\,'$, where $Q = P^{-1}$.
• Covariance matrix for this estimator: $\mathrm{Cov}\,\vec{s}_{est} = \frac{1}{|T|} \sum_{l=0}^{k} s_l \, Q \, D[l] \, Q^{T}$, where $D[l]_{ij} = P_{il}\,\delta_{ij} - P_{il} P_{jl}$.
• To estimate it, substitute $s_l$ with $(\vec{s}_{est})_l$.
  – Special case: estimators for the support $s_k$ and its variance.
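A small sketch of the support-recovery step: since the expected randomized partial supports equal P times the original ones, the original ones can be estimated by solving that linear system. The 2×2 matrix below is a hypothetical single-item case (k = 1) with made-up keep and insert probabilities, and the solver is plain Gauss-Jordan elimination to keep the example dependency-free.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Hypothetical k = 1 case: an item survives with prob. 0.2 and is
# falsely inserted with prob. 0.05.  Column l holds Pr[l' | l].
keep, insert = 0.2, 0.05
P = [[1 - insert, 1 - keep],
     [insert, keep]]

s_true = [0.9, 0.1]  # made-up original partial supports (s_0, s_1)
# Expected randomized partial supports; real data would be noisy.
s_randomized = [sum(P[i][l] * s_true[l] for l in range(2)) for i in range(2)]

s_est = solve(P, s_randomized)  # recovers s_true up to rounding
```

On real data `s_randomized` would be measured counts rather than expectations, so `s_est` would fluctuate around the true values; that is what the variance estimate quantifies.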
Class of Randomizations
• Our analysis works for any randomization that satisfies two properties:
  – A per-transaction randomization applies the same procedure to each transaction, using no information about other transactions;
  – An item-invariant randomization does not depend on any ordering or naming of items.
• Both uniform and cut-and-paste randomizations satisfy these two properties.
Apriori
Let k = 1, candidate sets = all 1-itemsets.
Repeat:
1. Count support for all candidate sets
2. Output the candidate sets with support ≥ smin
3. New candidate sets = all (k + 1)-itemsets s.t. all their k-subsets are candidate sets with support ≥ smin
4. Let k = k + 1
Stop when there are no more candidate sets.
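The loop above can be written as a short, unoptimized Python sketch. It follows the four steps literally and makes no attempt at the hash-tree or pruning tricks used in real implementations; the function name and data layout (a list of frozensets) are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, s_min):
    """Return {itemset: support} for all itemsets with support >= s_min."""
    n = len(transactions)

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        # Steps 1-2: count support, keep (and output) the frequent ones.
        level = {c: s for c in candidates if (s := supp(c)) >= s_min}
        frequent.update(level)
        # Step 3: (k+1)-itemsets whose k-subsets are all frequent.
        items = {i for c in level for i in c}
        candidates = {
            frozenset(c)
            for c in combinations(items, k + 1)
            if all(frozenset(sub) in level for sub in combinations(c, k))
        }
        k += 1  # Step 4
    return frequent
```

For example, on the four transactions {a, b, c}, {a, b}, {a, c}, {b} with `s_min = 0.5`, the frequent itemsets are a, b, c, ab, and ac; abc is never counted because bc is not frequent.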
The Modified Apriori
Let k = 1, candidate sets = all 1-itemsets.
Repeat:
1. Estimate support and variance (σ²) for all candidate sets
2. Output the candidate sets with support ≥ smin
3. New candidate sets = all (k + 1)-itemsets s.t. all their k-subsets are candidate sets with support ≥ smin − σ
4. Let k = k + 1
Stop when there are no more candidate sets, or the estimator’s precision becomes unsatisfactory.
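A sketch of the modified loop, with the estimator passed in as a callback. The callback signature `estimate(itemset) -> (support, sigma)` and the `max_sigma` stopping threshold are assumptions made for this illustration; the slide only says to stop when precision becomes "unsatisfactory".

```python
from itertools import combinations


def modified_apriori(items, estimate, s_min, max_sigma):
    """Apriori over estimated supports.

    estimate(itemset) must return (estimated support, sigma).
    """
    candidates = {frozenset([i]) for i in items}
    output = {}
    k = 1
    while candidates:
        # Step 1: estimate support and sigma for every candidate.
        estimates = {c: estimate(c) for c in candidates}
        # Assumed stopping rule: give up once any sigma is too large.
        if any(sigma > max_sigma for _, sigma in estimates.values()):
            break
        # Step 2: output candidates whose estimate reaches s_min.
        output.update({c: s for c, (s, _) in estimates.items() if s >= s_min})
        # Step 3: extend candidates that are within one sigma of s_min.
        kept = {c for c, (s, sigma) in estimates.items() if s >= s_min - sigma}
        kept_items = {i for c in kept for i in c}
        candidates = {
            frozenset(c)
            for c in combinations(kept_items, k + 1)
            if all(frozenset(sub) in kept for sub in combinations(c, k))
        }
        k += 1  # Step 4
    return output
```

The one-sigma margin in step 3 means an itemset can be reported even when one of its subsets was estimated slightly below smin, which reduces false drops caused by estimation noise.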
Privacy Breach Analysis
• How many added items are enough to protect privacy?
  – Have to satisfy $\Pr[z \in t \mid A \subseteq t'] < \rho$ (⇒ no privacy breaches).
  – Select parameters so that it holds for all itemsets.
  – Use formula (∗): $\Pr[z \in t \mid A \subseteq t'] = \sum_{l=0}^{k} s_l^{+} P_{kl} \,\Big/\, \sum_{l=0}^{k} s_l P_{kl}$, where $s_l = \Pr[\,|t \cap A| = l\,]$, $s_l^{+} = \Pr[\,|t \cap A| = l,\ z \in t\,]$, and $P_{kl} = \Pr[\,A \subseteq t' \mid |t \cap A| = l\,]$.
• Parameters are to be selected in advance!
  – Construct a privacy-challenging test: an itemset all of whose subsets have the maximum possible support.
  – It is enough to know the maximal support of an itemset for each size.
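One way to evaluate the condition Pr[z ∈ t | A ⊆ t’] < ρ is Bayes' rule over the number l of items of A in the original transaction: the numerator sums, over l, the probability that t has l items of A including z times the probability that randomization turns l items into all k; the denominator does the same without the "including z" restriction. The sketch below assumes this form and uses made-up numbers for a 2-itemset; all names are hypothetical.

```python
def breach_probability(s_plus, s, p_to_full):
    """Pr[z in t | A subset of t'] via Bayes' rule.

    s_plus[l]    -- Pr[t has exactly l items of A, one of them z]
    s[l]         -- Pr[t has exactly l items of A]
    p_to_full[l] -- Pr[A subset of t' | t has exactly l items of A]
    """
    numerator = sum(sp * p for sp, p in zip(s_plus, p_to_full))
    denominator = sum(sl * p for sl, p in zip(s, p_to_full))
    return numerator / denominator


# Hypothetical values for k = 2 (indices l = 0, 1, 2).
s = [0.90, 0.08, 0.02]
s_plus = [0.00, 0.04, 0.02]      # s_plus[0] = 0 and s_plus[k] = s[k]
p_to_full = [0.0025, 0.01, 0.04]

level = breach_probability(s_plus, s, p_to_full)
no_breach_at_50_percent = level < 0.5
```

With these numbers the breach level is about 31%, so the itemset would pass a ρ = 50% check.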
Graceful Tradeoff
• Want more precision or more privacy?
  – Adjust the privacy breach level.
  – A small relaxation of privacy restrictions results in a small increase in precision of estimators.
Talk Outline
• Introduction
• Privacy Breaches
• Our Solution
• Experiments
  – Support recovery vs. parameters
  – Real-life data
• Conclusion
Lowest Discoverable Support
• LDS is the support s.t., when predicted, it is 4σ away from zero.
• Roughly, LDS is proportional to $1/\sqrt{|T|}$.
Plot parameters: |t| = 5, ρ = 50%
LDS vs. Breach Level
Plot parameters: |t| = 5, |T| = 5 M
• Reminder: breach level is the limit on $\Pr[z \in t \mid A \subseteq t']$
Real datasets: soccer, mailorder
• Soccer is the clickstream log of the WorldCup’98 web site, split into sessions of HTML requests.
  – 11 K items (HTMLs), 6.5 M transactions
  – Available at http://www.acm.org/sigcomm/ITA/
• Mailorder is a purchase dataset from a certain on-line store.
  – Products are replaced with their categories
  – 96 items (categories), 2.9 M transactions
A small fraction of transactions are discarded as too long:
  – longer than 10 (for soccer) or 7 (for mailorder)
Modified Apriori on Real Data
Breach level = 50%. Inserted 20-50% items to each transaction.

Soccer: smin = 0.2%, 0.07% for 3-itemsets
Itemset Size   True Itemsets   True Positives   False Drops   False Positives
1              266             254              12            31
2              217             195              22            45
3              48              43               5             26

Mailorder: smin = 0.2%, 0.05% for 3-itemsets
Itemset Size   True Itemsets   True Positives   False Drops   False Positives
1              65              65               0             0
2              228             212              16            28
3              22              18               4             5
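False Drops (pred. supp%, when true supp ≥ 0.2%) and False Positives (true supp%, when pred. supp ≥ 0.2%)

Soccer, false drops: predicted support (%)
Size   < 0.1   0.1-0.15   0.15-0.2   ≥ 0.2
1      0       2          10         254
2      0       5          17         195
3      0       1          4          43

Soccer, false positives: true support (%)
Size   < 0.1   0.1-0.15   0.15-0.2   ≥ 0.2
1      0       7          24         254
2      7       10         28         195
3      5       13         8          43

Mailorder, false drops: predicted support (%)
Size   < 0.1   0.1-0.15   0.15-0.2   ≥ 0.2
1      0       0          0          65
2      0       1          15         212
3      0       1          3          18

Mailorder, false positives: true support (%)
Size   < 0.1   0.1-0.15   0.15-0.2   ≥ 0.2
1      0       0          0          65
2      0       0          28         212
3      1       2          2          18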
Actual Privacy Breaches
• Verified actual privacy breach levels.
• The breach probabilities are counted in the datasets for frequent and near-frequent itemsets.
• If maximum supports were estimated correctly, even worst-case breach levels fluctuated around 50%:
  – At most 53.2% for soccer,
  – At most 55.4% for mailorder.
Summary
• Privacy breaches: identified the problem and provided a solution for controlling breaches
• Derived estimators of support and variance for a class of randomization operators
• Algorithm for discovering associations in randomized data
• Validated on real-life datasets
• Can find associations while preserving privacy at the level of individual transactions
Future Work • Control of more general privacy breaches – What about other properties of transactions, for example item z t breach caused by A t’ = ? – What about external information? • Theoretical limits of discoverability for a given privacy breach level – How to compute theoretical limits? – How to attain them by an algorithm?
Thank You!
BACK-UPS
Our Solution: Example
• Old set-up:
  – Given 10,000 items, 10 M transactions of size 10
  – 100,000 transactions (1%) contain A = {x, y, z}
• In addition to uniform randomization with p = 80%, insert 500 new random items into each transaction.
  – ~800 transactions contain {x, y, z} before and after;
  – Roughly $(10\,\text{M}) \cdot (500 / 10{,}000)^3 = 1250$ transactions contain none before and full {x, y, z} after.
• Presence of {x, y, z} in a randomized transaction now says little about the original transaction.
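A quick check of these numbers, and of what they imply for an observer who sees {x, y, z} in a randomized transaction. This is a rough figure only: it counts just the two groups named above and ignores originals that held one or two of the three items. Variable names are illustrative.

```python
n_transactions = 10_000_000
n_items = 10_000
inserted = 500  # false items added to every transaction

true_hits = 800  # had {x, y, z} before and after (from the example)
# Originals with none of x, y, z that receive all three by insertion.
false_hits = n_transactions * (inserted / n_items) ** 3

# Rough share of randomized {x, y, z} sightings backed by a true {x, y, z}.
posterior = true_hits / (true_hits + false_hits)
```

`false_hits` comes out at 1250, so `posterior` drops to roughly 39%, compared with about 98% under uniform randomization alone.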
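Privacy Breach Analysis
• GIVEN: itemset A, and item $z \in A$
• WANTED: $\Pr[z \in t \mid A \subseteq t']$
• Assume that partial supports are probabilities: $s_l = \Pr[\,|t \cap A| = l\,]$
• Define: $s_l^{+} = \Pr[\,|t \cap A| = l,\ z \in t\,]$ and $s_l^{-} = \Pr[\,|t \cap A| = l,\ z \notin t\,]$, so that $s_l = s_l^{+} + s_l^{-}$
• Then we have: $\Pr[z \in t \mid A \subseteq t'] = \sum_{l=0}^{k} s_l^{+} P_{kl} \,\Big/\, \sum_{l=0}^{k} s_l P_{kl}$, where $P_{kl} = \Pr[\,A \subseteq t' \mid |t \cap A| = l\,]$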
Limiting Privacy Breaches
• We want to make sure that always $\Pr[z \in t \mid A \subseteq t'] < \rho$.
• But we do not know supports in advance.
• Solution: for each itemset size k, give “privacy-challenging” test values to the supports.
  – The test is an itemset whose subsets have maximum supports
  – We need to estimate maximum support values prior to randomization
LDS vs. Transaction Size
Plot parameters: ρ = 50%, |T| = 5 M
• Transactions that are too long cannot be used for prediction
Related Work
R. Agrawal, R. Srikant, “Privacy Preserving Data Mining,” SIGMOD 2000:
• Each client has a numerical attribute $x_i$
• Client i sends $x_i + y_i$, where $y_i$ is a random offset with known distribution
• Server reconstructs the distribution of original attributes (~ EM algorithm)
• The distribution is then used for classification
  – Numerical attributes only
Related Work
• Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining,” Crypto 2000
• J. Vaidya and C. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data”
• …
Privacy Concern
• Popular press:
  – “The End of Privacy”, “The Death of Privacy”
• Government directives:
  – European directive on privacy protection (Oct 98)
  – Canadian Personal Information Protection Act (Jan 2001)
• Surveys of Web users:
  – 17% fundamentalists, 56% pragmatic majority, 27% marginally concerned (April 99)
  – 82% said having privacy would matter (July 99)