Privacy preserving data mining: randomized response and association rule hiding
Li Xiong
CS 573 Data Privacy and Anonymity
Partial slides credit: W. Du, Syracuse University; Y. Gao, Peking University

Privacy Preserving Data Mining Techniques
Protecting sensitive raw data
  Randomization (additive noise)
  Geometric perturbation and projection (multiplicative noise)
  Randomized response technique
    Categorical data perturbation in data collection model
Protecting sensitive knowledge (knowledge hiding)

Data Collection Model
Data cannot be shared directly because of privacy concerns

Background: Randomized Response
Question: Do you smoke? The true answer is “Yes”.
Flip a biased coin: on head, answer “Yes” (the truth); on tail, answer “No” (the opposite).
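The coin-flip scheme above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: it assumes the coin shows head with a known probability theta (the names `theta`, `pi`, `randomize`, `estimate_proportion`, and `simulate` are made up for this sketch), and that a respondent tells the truth on head and gives the opposite answer on tail. The fraction of reported "Yes" answers is then P* = theta·pi + (1 − theta)·(1 − pi), which can be solved for the true proportion pi.

```python
import random


def randomize(true_answer: bool, theta: float, rng: random.Random) -> bool:
    """Report the truth with probability theta (head), otherwise the opposite (tail)."""
    return true_answer if rng.random() < theta else not true_answer


def estimate_proportion(reported_yes_fraction: float, theta: float) -> float:
    """Solve P* = theta*pi + (1 - theta)*(1 - pi) for pi."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 carries no information about the true answers")
    return (reported_yes_fraction + theta - 1.0) / (2.0 * theta - 1.0)


def simulate(pi: float, theta: float, n: int, seed: int = 0) -> float:
    """Simulate n respondents (a fraction pi truly answers Yes) and estimate pi."""
    rng = random.Random(seed)
    reported_yes = 0
    for _ in range(n):
        truth = rng.random() < pi
        if randomize(truth, theta, rng):
            reported_yes += 1
    return estimate_proportion(reported_yes / n, theta)


if __name__ == "__main__":
    # Hypothetical setting: 30% smokers, coin shows head 70% of the time.
    print(simulate(pi=0.3, theta=0.7, n=200_000))
```

No individual answer can be trusted, yet the aggregate estimate converges to the true proportion as the number of respondents grows. The closer theta is to 0.5, the more privacy and the noisier the estimate.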

Decision Tree Mining using Randomized Response
Multiple attributes encoded in bits
Biased coin: on head, send the true answer E: 110; on tail, send the false answer !E: 001
Column distribution can be estimated for learning a decision tree!
Reference: Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003

Accuracy of Decision tree built on randomized response

Generalization for Multi-Valued Categorical Data
The true value Si is reported as Si, Si+1, Si+2, or Si+3 with probabilities q1, q2, q3, q4.
These probabilities form the matrix M.

A Generalization
RR matrices: [Warner 65], [R. Agrawal 05], [S. Agrawal 05]
An RR matrix can be arbitrary
Can we find optimal RR matrices?
Reference: OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008
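To make the role of an RR matrix concrete, the following Python sketch builds one and uses it to recover the original distribution. It is an illustration under stated assumptions, not the method of the cited papers: the matrix convention M[j][i] = P(report category j | true category i), the cyclic construction from q, and the function names are all choices made for this example. With that convention the disguised distribution is P* = M·P, so P can be estimated by solving the linear system.

```python
def circulant_rr_matrix(q):
    """Build M with M[j][i] = q[(j - i) mod n]: value i is reported as i + k with probability q[k]."""
    n = len(q)
    return [[q[(j - i) % n] for i in range(n)] for j in range(n)]


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]


def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    a = [list(m[r]) + [b[r]] for r in range(n)]  # augmented matrix
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot_row][col]) < 1e-12:
            raise ValueError("RR matrix is singular; the distribution cannot be recovered")
        a[col], a[pivot_row] = a[pivot_row], a[col]
        pivot = a[col][col]
        a[col] = [x / pivot for x in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[r][n] for r in range(n)]


if __name__ == "__main__":
    # Hypothetical numbers: keep the true value with probability 0.7.
    m = circulant_rr_matrix([0.7, 0.1, 0.1, 0.1])
    p_true = [0.4, 0.3, 0.2, 0.1]
    p_star = mat_vec(m, p_true)   # distribution of the disguised data
    print(solve(m, p_star))       # estimate of the original distribution
```

In practice P* is estimated from a finite sample, so the recovered distribution is only approximate; how much error M introduces is the utility side of the trade-off discussed next.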

What is an optimal matrix?
Which of the following is better?
Privacy: M2 is better
Utility: M1 is better
So, what is an optimal matrix?

Optimal RR Matrix
An RR matrix M is optimal if no other RR matrix is better than M in both privacy and utility (i.e., no other matrix dominates M).
Privacy quantification and utility quantification: a number of privacy and utility metrics have been proposed.
Privacy: how accurately one can estimate individual information.
Utility: how accurately one can estimate aggregate information.

Metrics
Privacy: accuracy of the estimate of individual values
Utility: difference between the original probability and the estimated probability

Optimization Methods
Approach 1: weighted sum: w1 · Privacy + w2 · Utility
Approach 2: fix Privacy, find M with the optimal Utility; or fix Utility, find M with the optimal Privacy.
  Challenge: it is difficult to generate M with a fixed privacy or utility.
Proposed approach: multi-objective optimization

Optimization algorithm
Evolutionary Multi-Objective Optimization (EMOO)
The algorithm:
Start with a set of initial RR matrices
Repeat the following steps in each iteration:
  Mating: select two RR matrices from the pool
  Crossover: exchange several columns between the two RR matrices
  Mutation: change some values in an RR matrix
  Meet the privacy bound: filter the resulting matrices
  Evaluate the fitness value of the new RR matrices
Note: the fitness value is defined in terms of the privacy and utility metrics
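The crossover and mutation steps listed above can be sketched as follows. This is only an illustration of the two operators, assuming a matrix stored as a list of rows with M[j][i] = P(report j | true i), so each column must remain a probability distribution. The function names, the 50% column-swap probability, and the mutation `scale` are hypothetical choices; the privacy filter and the fitness evaluation are left out because the slides do not fix specific metrics.

```python
import random


def random_rr_matrix(n, rng):
    """Random n x n matrix whose columns are probability distributions."""
    cols = []
    for _ in range(n):
        raw = [rng.random() + 1e-9 for _ in range(n)]
        total = sum(raw)
        cols.append([x / total for x in raw])
    # Stored as rows: m[j][i] = P(report j | true i).
    return [[cols[i][j] for i in range(n)] for j in range(n)]


def crossover(m1, m2, rng):
    """Exchange a random subset of whole columns between two matrices."""
    n = len(m1)
    c1 = [row[:] for row in m1]
    c2 = [row[:] for row in m2]
    for i in range(n):
        if rng.random() < 0.5:
            for j in range(n):
                c1[j][i], c2[j][i] = c2[j][i], c1[j][i]
    return c1, c2


def mutate(m, rng, scale=0.05):
    """Perturb one randomly chosen column and renormalize it to sum to 1."""
    n = len(m)
    child = [row[:] for row in m]
    i = rng.randrange(n)
    col = [max(child[j][i] + rng.uniform(-scale, scale), 0.0) for j in range(n)]
    total = sum(col)
    if total == 0.0:
        return child  # degenerate perturbation: keep the column unchanged
    for j in range(n):
        child[j][i] = col[j] / total
    return child


if __name__ == "__main__":
    rng = random.Random(42)
    parent1, parent2 = random_rr_matrix(4, rng), random_rr_matrix(4, rng)
    child1, child2 = crossover(parent1, parent2, rng)
    print(mutate(child1, rng))
```

Swapping whole columns keeps every child a valid RR matrix without any repair step, which is presumably why the crossover is defined on columns rather than on individual entries.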

Illustration

Output of Optimization
The optimal set is often plotted in the objective space as a Pareto front.
[Figure: candidate matrices M1 to M8 plotted against privacy and utility axes, each running from worse to better]
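The dominance relation behind the Pareto front is easy to state in code. The sketch below assumes each candidate matrix has already been scored as a (privacy, utility) pair where a higher number is better for both; the scores and the function names are invented for illustration and do not come from the slides.

```python
def dominates(a, b):
    """True if score a is at least as good as b on both objectives and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(scores):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


if __name__ == "__main__":
    # Hypothetical (privacy, utility) scores, higher is better for both.
    scores = {
        "M1": (0.2, 0.9),
        "M2": (0.8, 0.3),
        "M3": (0.1, 0.5),
        "M4": (0.5, 0.6),
    }
    print(pareto_front(scores))
```

Here M3 is dominated by M1 (worse on both objectives), while M1, M2, and M4 each win on at least one objective against every other candidate, so they form the front.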

For the first attribute of the Adult data

Privacy Preserving Data Mining Techniques
Protecting sensitive raw data
  Randomization (additive noise)
  Geometric perturbation and projection (multiplicative noise)
  Randomized response technique
Protecting sensitive knowledge (knowledge hiding)
  Frequent itemset and association rule hiding
  Downgrading classifier effectiveness

Frequent Itemset Mining and Association Rule Mining
Frequent itemset mining: frequent sets of items in a transaction data set
Association rules: associations between items

Frequent Itemset Mining and Association Rule Mining
First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
  SIGMOD Test of Time Award 2003: “This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community … even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper.”
Apriori algorithm in VLDB 1994
  #4 in the top 10 data mining algorithms in ICDM 2006
References:
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD ’93.
Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB ’94.

Basic Concepts: Frequent Patterns and Association Rules

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Itemset: X = {x1, …, xk} (k-itemset)
Support count (absolute support): count of transactions containing X
Frequent itemset: X with at least the minimum support count
Association rule: A → B with minimum support and confidence
Support: probability that a transaction contains A ∪ B, s = P(A ∪ B)
Confidence: conditional probability that a transaction having A also contains B, c = P(B | A)
Association rule mining process:
  Find all frequent patterns (more costly)
  Generate strong association rules
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Illustration of Frequent Itemsets and Association Rules

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Frequent itemsets (minimum support count = 3)?
  {A: 3, B: 3, D: 4, E: 3, AD: 3}
Association rules (minimum support = 50%, minimum confidence = 50%)?
  A → D (60%, 100%)
  D → A (60%, 75%)
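The numbers in this illustration can be checked with a short brute-force script. The sketch below is not Apriori (it enumerates candidate itemsets level by level without candidate pruning, which is fine for five transactions); the function names are chosen for this example only.

```python
from itertools import combinations

TRANSACTIONS = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def frequent_itemsets(transactions, min_count):
    """Map each frequent itemset to its support count (brute force, level by level)."""
    items = sorted(set().union(*transactions.values()))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            count = support_count(set(combo), transactions)
            if count >= min_count:
                result[frozenset(combo)] = count
                found = True
        if not found:
            break  # no frequent k-itemset, so no larger itemset can be frequent
    return result


def association_rules(freq, n, min_conf):
    """Map (lhs, rhs) to (support, confidence) for rules built from frequent itemsets."""
    rules = {}
    for itemset, count in freq.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = count / freq[lhs]  # lhs is frequent by downward closure
                if confidence >= min_conf:
                    rules[(lhs, itemset - lhs)] = (count / n, confidence)
    return rules


if __name__ == "__main__":
    freq = frequent_itemsets(TRANSACTIONS, min_count=3)
    for (lhs, rhs), (s, c) in association_rules(freq, len(TRANSACTIONS), 0.5).items():
        print(sorted(lhs), "->", sorted(rhs), f"support={s:.0%} confidence={c:.0%}")
```

With a minimum support count of 3 (at least 50% of five transactions), the only frequent pair is AD, which yields exactly the two rules A → D and D → A.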

Association Rule Hiding: what? why?
Problem: hide sensitive association rules in data without losing non-sensitive rules
Motivation: confidential rules may have serious adverse effects
SIGMOD Ph.D. Workshop IDAR’07

Problem statement
Given:
  a database D to be released
  minimum support and confidence thresholds “MST” and “MCT”
  a set of association rules R mined from D
  a set of sensitive rules Rh ⊆ R to be hidden
Find a new database D’ such that:
  the rules in Rh cannot be mined from D’
  as many of the rules in R − Rh as possible can still be mined

Solutions
Data modification approaches
  Basic idea: data sanitization D → D’
  Approaches: distortion, blocking
  Drawbacks: cannot control hiding effects intuitively; lots of I/O
Data reconstruction approaches
  Basic idea: knowledge sanitization D → K → D’
  Potential advantages: can easily control the availability of rules and control the hiding effects directly and intuitively

Distortion-based Techniques
[Figure: a sample database over items A, B, C, D is transformed by the distortion algorithm into a distorted database]
In the sample database, rule A→C has: Support(A→C) = 80%, Confidence(A→C) = 100%
In the distorted database, rule A→C now has: Support(A→C) = 40%, Confidence(A→C) = 50%

Side Effects

Before hiding process         After hiding process          Side effect
Rule Ri had conf(Ri) > MCT    Rule Ri has conf(Ri) < MCT    Rule eliminated (undesirable side effect)
Rule Ri had conf(Ri) < MCT    Rule Ri has conf(Ri) > MCT    Ghost rule (undesirable side effect)
Itemset I had sup(I) > MST    Itemset I has sup(I) < MST    Itemset eliminated (undesirable side effect)
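Given the rule sets mined before and after hiding, the side effects in the table reduce to set differences. The sketch below assumes rules are represented as (antecedent, consequent) tuples and that both rule sets were mined with the same MST and MCT; the function name and the returned keys are illustrative.

```python
def side_effects(rules_before, rules_after, sensitive):
    """Classify the outcome of a hiding process from the rule sets mined before and after."""
    return {
        # Sensitive rules that can no longer be mined (the intended effect).
        "hidden": sensitive - rules_after,
        # Sensitive rules that are still minable (hiding failed for these).
        "still_exposed": sensitive & rules_after,
        # Non-sensitive rules that were lost (rule eliminated).
        "lost": (rules_before - sensitive) - rules_after,
        # Rules that did not hold before but hold now (ghost rules).
        "ghost": rules_after - rules_before,
    }


if __name__ == "__main__":
    # Hypothetical rule sets for illustration.
    before = {("A", "C"), ("B", "D"), ("C", "D")}
    after = {("B", "D"), ("A", "D")}
    print(side_effects(before, after, sensitive={("A", "C")}))
```

A hiding algorithm aims for an empty `still_exposed` set while keeping `lost` and `ghost` as small as possible.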

Distortion-based Techniques
Challenges/Goals:
  Minimize the undesirable side effects that the hiding process causes to non-sensitive rules.
  Minimize the number of 1’s that must be deleted in the database.
  Algorithms must be linear in time as the database increases in size.

Sensitive itemsets: ABC

Data distortion [Atallah 99]
Hardness result: the distortion problem is NP-hard
Heuristic search: find items to remove and transactions to remove the items from
Reference: Disclosure Limitation of Sensitive Rules, M. Atallah, A. Elmagarmid, M. Ibrahim, E. Bertino, V. Verykios, 1999

Heuristic Approach
A greedy bottom-up search through the ancestors (subsets) of the sensitive itemset for the parent with maximum support (why?)
At the end of the search, a 1-itemset is selected
Search through the common transactions containing the item and the sensitive itemset for the transaction that affects the minimum number of 2-itemsets
Delete the selected item from the identified transaction
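One loose reading of the heuristic above is sketched below. It is a simplification and not the algorithm from [Atallah 99]: the tie-breaking rule, the way "affected 2-itemsets" is counted (frequent pairs that contain the selected item and are supported by the transaction), and the stopping condition (support of the sensitive itemset below a minimum count) are all assumptions made for this example.

```python
def support(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)


def pick_item(sensitive, db):
    """Walk from the sensitive itemset down to a single item, always taking the subset with maximum support."""
    current = frozenset(sensitive)
    while len(current) > 1:
        parents = [current - {x} for x in current]
        # Ties are broken by item names so the choice is deterministic (assumption).
        current = max(parents, key=lambda p: (support(p, db), sorted(p)))
    return next(iter(current))


def hide_itemset(sensitive, db, min_count):
    """Delete items one at a time until the sensitive itemset is no longer frequent."""
    db = [set(t) for t in db]  # work on a copy
    sensitive = frozenset(sensitive)
    while support(sensitive, db) >= min_count:
        item = pick_item(sensitive, db)
        candidates = [t for t in db if sensitive <= t]

        def affected_pairs(t):
            # Frequent 2-itemsets containing `item` that this transaction supports.
            return sum(
                1 for other in t
                if other != item and support({item, other}, db) >= min_count
            )

        victim = min(candidates, key=affected_pairs)
        victim.discard(item)
    return db


if __name__ == "__main__":
    # Hypothetical database; the sensitive itemset is ABC.
    database = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B", "C", "D"}, {"A", "B"}, {"A", "C"}]
    print(hide_itemset("ABC", database, min_count=3))
```

Choosing the item with the highest support answers the "why?" on the slide in one plausible way: removing one occurrence of a very frequent item is the least likely to push other itemsets below the threshold.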

Results comparison

Blocking-based Techniques
[Figure: an initial database over items A, B, C, D is transformed by the blocking algorithm into a new database in which some entries are replaced by “?”]
Support and confidence become marginal (intervals rather than exact values).
In the new database: 60% ≤ conf(A → C) ≤ 100%
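The interval for a blocked database can be computed by brute force on small examples: try every way of filling in the unknown entries and record the smallest and largest confidence. The sketch below uses `None` for a blocked "?" entry and a made-up five-row database; it is exponential in the number of unknowns and is meant only to illustrate where the bounds come from.

```python
from itertools import product


def confidence_bounds(rows, lhs, rhs):
    """Exact min and max confidence of lhs -> rhs over all fillings of the unknown (None) entries."""
    unknown = [(r, item) for r, row in enumerate(rows) for item in row if row[item] is None]
    values = []
    for bits in product((0, 1), repeat=len(unknown)):
        filled = [dict(row) for row in rows]
        for (r, item), bit in zip(unknown, bits):
            filled[r][item] = bit
        n_lhs = sum(1 for row in filled if all(row[i] for i in lhs))
        if n_lhs == 0:
            continue  # confidence is undefined when no transaction contains lhs
        n_both = sum(1 for row in filled if all(row[i] for i in lhs + rhs))
        values.append(n_both / n_lhs)
    if not values:
        return None
    return min(values), max(values)


if __name__ == "__main__":
    # Hypothetical blocked database over items A and C (None marks "?").
    rows = [
        {"A": 1, "C": 1},
        {"A": 1, "C": None},
        {"A": None, "C": 1},
        {"A": 1, "C": 1},
        {"A": 0, "C": 1},
    ]
    print(confidence_bounds(rows, ("A",), ("C",)))
```

For the hypothetical rows above the confidence of A → C lies between 2/3 and 1, so an adversary can no longer tell whether the rule clears a threshold that falls inside that range.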

Data reconstruction approach
1. Frequent set mining: D → FS, R
2. Perform sanitization algorithm: FS → FS’ (remaining rules: R − Rh)
3. FP-tree-based inverse frequent set mining: FS’ → FP-tree → D’

The first two phases
1. Frequent set mining
  Generate all frequent itemsets FS, with their supports and support counts, from the original database D
2. Perform sanitization algorithm
  Input: FS (output of phase 1), R, Rh
  Output: sanitized frequent itemsets FS’
  Process:
    Select hiding strategy
    Identify sensitive frequent sets
    Perform sanitization
In the best case, the sanitization algorithm ensures that exactly the non-sensitive rule set R − Rh can be derived from FS’

Example: the first two phases
Original database D (TIDs T1 to T6); items listed: ABCE, ABCD, ABD, AD, ACD
Thresholds: σ = 4, MST = 66%, MCT = 75%

1. Frequent set mining
Frequent itemsets FS:
  A: 6 (100%)
  B: 4 (66%)
  C: 4 (66%)
  D: 4 (66%)
  AB: 4 (66%)
  AC: 4 (66%)
  AD: 4 (66%)
Association rules R:
  rule      confidence   support
  B ⇒ A     100%         66%
  C ⇒ A     100%         66%
  D ⇒ A     100%         66%

2. Perform sanitization algorithm
Frequent itemsets FS’:
  A: 6 (100%)
  C: 4 (66%)
  D: 4 (66%)
  AC: 4 (66%)
  AD: 4 (66%)
Association rules R − Rh:
  rule      confidence   support
  C ⇒ A     100%         66%
  D ⇒ A     100%         66%
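The link between the frequent itemsets and the rules in this example can be verified directly: a rule X ⇒ Y has confidence count(X ∪ Y) / count(X), so the rule set follows from the itemset table alone. The sketch below takes FS and FS’ as given (it does not implement the sanitization step itself), uses the support count threshold σ = 4 in place of the rounded 66%, and uses function and variable names invented for this illustration.

```python
from itertools import combinations


def rules_from_frequent_sets(fs, min_count, mct):
    """Derive the rules X => Y with confidence >= mct from a table of itemset support counts."""
    rules = set()
    for itemset, count in fs.items():
        if len(itemset) < 2 or count < min_count:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                if lhs in fs and count / fs[lhs] >= mct:
                    rules.add(("".join(sorted(lhs)), "".join(sorted(itemset - lhs))))
    return rules


# Support counts copied from the example (6 transactions, sigma = 4, MCT = 75%).
FS = {frozenset(k): v for k, v in
      {"A": 6, "B": 4, "C": 4, "D": 4, "AB": 4, "AC": 4, "AD": 4}.items()}
FS_SANITIZED = {frozenset(k): v for k, v in
                {"A": 6, "C": 4, "D": 4, "AC": 4, "AD": 4}.items()}

if __name__ == "__main__":
    print(sorted(rules_from_frequent_sets(FS, 4, 0.75)))
    print(sorted(rules_from_frequent_sets(FS_SANITIZED, 4, 0.75)))
```

The reverse rules such as A ⇒ B do not appear because their confidence is 4/6, below the 75% threshold; after sanitization only C ⇒ A and D ⇒ A remain, which matches R − Rh when B ⇒ A is the sensitive rule.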

Open research questions
Optimal solution
Itemsets sanitization
  The support and confidence of the rules in R − Rh should remain unchanged as much as possible
Integrating data protection and knowledge (rule) protection

Coming up Cryptographic protocols for privacy preserving distributed data mining

Classification of current algorithms
Data modification (hide rules, hide large itemsets)
  Data-Distortion: Algo1a, Algo1b, Algo2a, WSDA, PDA, Algo2b, Algo2c, Naïve, MinFIA, MaxFIA, IGA, RRA, RA, SWA, Border-Based, Integer-Programming, Sanitization-Matrix
  Data-Blocking: CR, CR2, GIH
Data reconstruction: CIILM

Weight-based Sorting Distortion Algorithm (WSDA) [Pontikakis 03]
High-level description:
Input:
  Initial database
  Set of sensitive rules
  Safety margin (for example 10%)
Output:
  Sanitized database
  Sensitive rules no longer hold in the database

WSDA Algorithm
High-level description:
Step 1: Retrieve the set of transactions that support the sensitive rule RS. For each sensitive rule RS, find the number N1 of transactions in which one item that supports the rule will be deleted.
Step 2: For each rule Ri in the database that has items in common with RS, compute a weight w that denotes how strong Ri is. For each transaction that supports RS, compute a priority Pi that denotes how many strong rules this transaction supports.
Step 3: Sort the transactions that support RS in ascending order of their priority value Pi.
Step 4: For the first N1 transactions, hide an item that is contained in RS.
Step 5: Update the confidence and support values of the other rules in the database.
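Steps 2 to 4 can be sketched as a single function. This is a simplified illustration rather than the published WSDA: the rule weights and N1 are taken as inputs because the slides do not say how they are computed, the priority is assumed to be the sum of the weights of the rules a transaction supports, and the item to hide is arbitrarily taken from the consequent of RS. All names are hypothetical.

```python
def wsda_hide(db, sensitive_rule, rule_weights, n1):
    """Hide one item of the sensitive rule in the n1 lowest-priority supporting transactions."""
    lhs, rhs = sensitive_rule
    rule_items = lhs | rhs

    # Step 1 (partly): transactions that support the sensitive rule.
    supporting = [i for i, t in enumerate(db) if rule_items <= t]

    # Step 2: priority = total weight of the other rules this transaction supports.
    def priority(i):
        return sum(w for (l, r), w in rule_weights.items() if (l | r) <= db[i])

    # Step 3: ascending priority, so transactions supporting few strong rules go first.
    order = sorted(supporting, key=priority)

    # Step 4: delete one item of the rule from the first n1 transactions.
    item_to_hide = sorted(rhs)[0]  # arbitrary but deterministic choice (assumption)
    new_db = [set(t) for t in db]
    for i in order[:n1]:
        new_db[i].discard(item_to_hide)
    return new_db


if __name__ == "__main__":
    # Hypothetical data: hide A -> C while protecting A -> B and C -> D.
    database = [{"A", "B", "C"}, {"A", "C"}, {"A", "C", "D"}, {"B", "D"}]
    weights = {
        (frozenset("A"), frozenset("B")): 0.9,
        (frozenset("C"), frozenset("D")): 0.4,
    }
    print(wsda_hide(database, (frozenset("A"), frozenset("C")), weights, n1=2))
```

Step 5 would then re-mine or incrementally update the support and confidence of the remaining rules on the returned database.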

Proposed Solution Discussion
Sanitization algorithm
  Compared with earlier popular data sanitization approaches: performs sanitization directly on the knowledge level of the data
Inverse frequent set mining algorithm
  Deals with frequent items and infrequent items separately: more efficient, and produces a large number of outputs
Our solution provides the user with a knowledge-level window to perform sanitization easily and generates a number of secure databases