
Probabilistic Inference Protection on Anonymized Data

Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology)
Ada Wai-Chee Fu (The Chinese University of Hong Kong)
Ke Wang (Simon Fraser University)
Yabo Xu (Sun Yat-sen University)
Jian Pei (Simon Fraser University)
Philip S. Yu (University of Illinois at Chicago)

Prepared and presented by Raymond Chi-Wing Wong

Outline
1. Introduction
   - l-diversity
2. Background Knowledge
3. Proposed Model
4. Conclusion

1. l-diversity

Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2.

Patient table:

Name       Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

Bucketization: release the data set to the public as two tables.

QI Table:

Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

Sensitive Table:

GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV

Knowledge 1: the released data set.
Knowledge 2: I also know Alan with (Male, 41).

Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer with probability = 1/2. In other words, P(Alan is linked to Lung Cancer) is at most 1/2. This data set satisfies 2-diversity.
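The simplified 2-diversity check on the released tables can be sketched in a few lines. This is a minimal illustration, assuming the link probability inside an A-group is simply the fraction of tuples carrying the most frequent sensitive value; the function names are hypothetical and not taken from the talk.

```python
from collections import Counter

def max_link_probability(sensitive_values):
    """Largest share of a single sensitive value inside one A-group."""
    counts = Counter(sensitive_values)
    return max(counts.values()) / len(sensitive_values)

def satisfies_simplified_l_diversity(groups, l):
    """True if every A-group links any value with probability at most 1/l."""
    return all(max_link_probability(vals) <= 1 / l for vals in groups.values())

# Sensitive Table from the slide, keyed by GID.
groups = {
    "L1": ["Lung Cancer", "Hypertension"],
    "L2": ["Flu", "HIV"],
}
print(satisfies_simplified_l_diversity(groups, 2))
```

Note that this check ignores any background knowledge about how likely each individual is to have each disease.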

Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2.

The patient table and the released QI Table and Sensitive Table (Knowledge 1) are those of the bucketization example: Alan (Male, 41) and Betty (Female, 42) share GID L1, whose diseases are Lung Cancer and Hypertension. This data set satisfies 2-diversity.

Knowledge 2: I also know Alan with (Male, 41).

Knowledge 3: QI-based distribution p()

        Lung Cancer  Not Lung Cancer
Male    0.1          0.9
Female  0.003        0.997

This can be obtained from statistical reports from the US Department of Health and Human Services and other statistical data sources discussed in previous studies.

Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2.

The patient table and the released QI Table and Sensitive Table (Knowledge 1) are those of the bucketization example: Alan (Male, 41) and Betty (Female, 42) share GID L1, whose diseases are Lung Cancer and Hypertension. This data set satisfies 2-diversity.

Knowledge 2: I also know Alan with (Male, 41).

Knowledge 3: QI-based distribution p()

        Lung Cancer  Not Lung Cancer
Male    0.1          0.9
Female  0.003        0.997

It is more likely that a male patient is linked to Lung Cancer compared with a female patient.

Combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2). Why?
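One way to see why the probability is far above 1/2 is a small Bayesian calculation. The sketch below is an assumption-laden illustration, not the formula from the talk: it assumes exactly one tuple in the A-group holds the sensitive value and that the global probabilities act as independent priors, so each tuple's posterior is proportional to its odds f / (1 - f). The function name is hypothetical.

```python
def link_probabilities(priors):
    """Posterior that each tuple holds the sensitive value.

    Assumes exactly one tuple in the A-group holds the value and that the
    priors are independent, so the posterior is proportional to f / (1 - f).
    """
    odds = [f / (1 - f) for f in priors]
    total = sum(odds)
    return [o / total for o in odds]

# A-group L1: Alan (Male, prior 0.1) and Betty (Female, prior 0.003).
alan, betty = link_probabilities([0.1, 0.003])
print(f"P(Alan is linked to Lung Cancer)  = {alan:.4f}")
print(f"P(Betty is linked to Lung Cancer) = {betty:.4f}")
```

Under these assumptions Alan's probability comes out well above the 1/2 that simplified 2-diversity promises.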

Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2.

The patient table and the released QI Table and Sensitive Table (Knowledge 1) are those of the bucketization example: Alan (Male, 41) and Betty (Female, 42) share GID L1, whose diseases are Lung Cancer and Hypertension. This data set satisfies 2-diversity.

Knowledge 2: I also know Alan with (Male, 41).

Knowledge 3: QI-based distribution p()

        Lung Cancer  Not Lung Cancer
Male    0.1          0.9
Female  0.003        0.997

Combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2).

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).

We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer)) according to Knowledge 1, 2 and 3.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).

We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer)) according to Knowledge 1, 2 and 3.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).

We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer)) according to Knowledge 1, 2 and 3.

- Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.
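To give a feel for why the calculation is expensive, the sketch below enumerates every way of assigning the sensitive values of an A-group to its tuples and weights each assignment by the product of the priors. This brute-force approach is an assumption used for illustration (the talk does not spell out its formula here); it visits N! assignments for an A-group of size N. The label "Other" stands in for "Not Lung Cancer", and all names are hypothetical.

```python
from itertools import permutations
from math import prod

def brute_force_link_probability(priors, values, who, target):
    """P(tuple `who` holds `target`) by enumerating all N! assignments.

    priors[i][v] is the assumed prior probability that tuple i has value v;
    `values` is the list of sensitive values published for the A-group.
    """
    numerator = denominator = 0.0
    for assignment in permutations(values):
        weight = prod(priors[i][v] for i, v in enumerate(assignment))
        denominator += weight
        if assignment[who] == target:
            numerator += weight
    return numerator / denominator

priors = [
    {"Lung Cancer": 0.1, "Other": 0.9},      # Alan (Male)
    {"Lung Cancer": 0.003, "Other": 0.997},  # Betty (Female)
]
p_alan = brute_force_link_probability(priors, ["Lung Cancer", "Other"], 0, "Lung Cancer")
print(f"{p_alan:.4f}")
```

The factorial growth of the loop is what makes a cheap sufficient condition attractive.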

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).

- Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.
- Challenge 2: The formula for this probability is not monotonic with respect to the A-group size.

Most existing privacy studies involve formulae that are monotonic, and most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonic property.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2), that is, P(Alan is linked to Lung Cancer) ≤ 1/2.

- Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.
- Challenge 2: The formula for this probability is not monotonic with respect to the A-group size.

Most existing privacy studies involve formulae that are monotonic, and most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonic property.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2), that is, P(Alan is linked to Lung Cancer) ≤ 1/2.

- Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.
- Challenge 2: The formula for this probability is not monotonic with respect to the A-group size.

Related Work: There is a closely related work [LLZ 09] for this problem. [LLZ 09] approximates the formula for this probability, so it gives no solid guarantee on the privacy protection.

[LLZ 09] T. Li, N. Li and J. Zhang, "Modeling and Integrating Background Knowledge in Data Anonymization", ICDE 2009

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2), that is, P(Alan is linked to Lung Cancer) ≤ 1/2.

- Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.
- Challenge 2: The formula for this probability is not monotonic with respect to the A-group size.

Contributions: We propose a condition. If this condition is satisfied, we can guarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2). Besides, this condition can overcome Challenge 1 and Challenge 2. Specifically,
(1) computing the condition is computationally cheap, and
(2) the condition involves a monotonic function on the A-group size.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2), that is, P(Alan is linked to Lung Cancer) ≤ 1/2.

The major idea of the condition includes some simple calculations based on the statistics of an A-group:
1. The size of the A-group (N)
2. The privacy requirement (r)
3. The global probabilities of each tuple in the A-group to a sensitive value

Contributions: We propose a condition. If this condition is satisfied, we can guarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2). Besides, this condition can overcome Challenge 1 and Challenge 2. Specifically,
(1) computing the condition is computationally cheap, and
(2) the condition involves a monotonic function on the A-group size.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2), that is, P(Alan is linked to Lung Cancer) ≤ 1/2.

The major idea of the condition includes some simple calculations based on the statistics of an A-group:
1. The size of the A-group (N)
2. The privacy requirement (r)
3. The global probabilities of each tuple in the A-group to a sensitive value

Condition check: inputs N, r and the global probabilities; output Satisfied / Not Satisfied.

If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).

4. Conclusion

1. Background Knowledge
   - QI-based Probability Distribution
2. Two Challenges
   - Challenge 1: The formula for the probability is computationally expensive
   - Challenge 2: The formula is not monotonic
3. Proposed Condition
   - overcomes Challenge 1 and Challenge 2

Q&A

Bucketization: a way to prevent this linkage.

There is another way to prevent this linkage called Generalization. The following principle to be discussed can also be applied to Generalization.

Patient table:

Name       Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

Release the data set to the public:

Gender  Age  Disease
Male    41   Lung Cancer     (GID = L1)
Female  42   Hypertension    (GID = L1)
Female  63   Flu             (GID = L2)
Female  64   HIV             (GID = L2)

The two tuples with GID = L1 form an anonymized group (A-group). The two tuples with GID = L2 form another A-group.

Patient table:

Name       Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

Bucketization: release the data set to the public. The grouped table

Gender  Age  Disease
Male    41   Lung Cancer     (GID = L1)
Female  42   Hypertension    (GID = L1)
Female  63   Flu             (GID = L2)
Female  64   HIV             (GID = L2)

is published as two tables.

QI Table:

Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

Sensitive Table:

GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV


Patient table:

Name       Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

Release the data set to the public (Knowledge 1):

Gender  Age  Disease
Male    41   Lung Cancer
Female  42   Hypertension
Female  63   Flu
Female  64   HIV

Knowledge 2: I also know Alan with (Male, 41).

Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).

Monotonicity
- Consider two A-groups, one with GID = L1 and one with GID = L2:

  Gender  Age  GID  Disease
  Male    41   L1   Lung Cancer
  Female  42   L1   Hypertension
  Female  63   L2   Flu
  Female  64   L2   HIV

  For these A-groups, P(an individual is linked to a sensitive value) = 0.5.
- Merging: for the A-group "merged" from these two A-groups, P(an individual is linked to a sensitive value) = 0.4, so P(an individual is linked to a sensitive value) ≤ 0.5 still holds.

The probability is monotonically decreasing when the size of the A-group increases.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).

Non-Monotonicity
- Consider two A-groups, one with GID = L1 and one with GID = L2:

  Gender  Age  GID  Disease
  Male    41   L1   Lung Cancer
  Female  42   L1   Hypertension
  Female  63   L2   Flu
  Female  64   L2   HIV

  For these A-groups, P(an individual is linked to a sensitive value) = 0.5.
- Merging: for the A-group "merged" from these two A-groups, the value 0.4 is no longer guaranteed; it is possible that P(an individual is linked to a sensitive value) > 0.5.

The probability is not monotonically decreasing when the size of the A-group increases.
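A small numeric illustration of non-monotonicity, under assumptions that are not from the talk: exactly one tuple in each A-group considered holds Lung Cancer, priors are independent, and the posterior of a tuple is proportional to its odds f / (1 - f). The groups and the helper name are hypothetical. A group of two females gives 0.5 each; merging in a male and another female (a larger A-group) raises the male's probability far above 0.5.

```python
def max_link_probability(priors):
    """Largest posterior in an A-group where exactly one tuple holds the value.

    Assumes independent priors, so the posterior is proportional to f / (1 - f).
    """
    odds = [f / (1 - f) for f in priors]
    return max(odds) / sum(odds)

FEMALE, MALE = 0.003, 0.1  # global probabilities of Lung Cancer from Knowledge 3

small_group = [FEMALE, FEMALE]                 # size 2
merged_group = small_group + [MALE, FEMALE]    # size 4 after merging

before = max_link_probability(small_group)
after = max_link_probability(merged_group)
print(f"size 2: {before:.3f}, size 4: {after:.3f}")
```

In this example the larger A-group is less private, which is why algorithms that assume monotonicity cannot be applied directly.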

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2), that is, P(Alan is linked to Lung Cancer) ≤ 1/2.

Knowledge 1: the released data set

Gender  Age  GID  Disease
Male    41   L1   Lung Cancer
Female  42   L1   Hypertension
Female  63   L2   Flu
Female  64   L2   HIV

Knowledge 2: I also know Alan with (Male, 41).

Knowledge 3: QI-based distribution p()

        Lung Cancer  Not Lung Cancer
Male    0.1          0.9
Female  0.003        0.997

For the sake of illustration, we focus on attribute Gender only.

Suppose we are interested in knowing whether P(Alan is linked to Lung Cancer) ≤ 1/2.

Condition check: inputs N = 2, r = 2 and global probabilities 0.1 and 0.003; output Satisfied / Not Satisfied.

If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).

What is the condition check?

In the condition check, there is an expression $\Delta_{ceil}$ in terms of N, r and the global probabilities to compute.

Condition check: inputs N = 2, r = 2 and global probabilities 0.1 and 0.003; output Satisfied / Not Satisfied.

What is the condition check?

In the condition check, there is an expression $\Delta_{ceil}$ in terms of N, r and the global probabilities to compute.

- Theorem 1: If the condition is satisfied, then the privacy requirement is satisfied.

- Theorem 2: Computing $\Delta_{ceil}$ can be done in O(1) time.
  This means that we overcome Challenge 1: Calculating the probability is computationally expensive.
- Theorem 3: $\Delta_{ceil}$ is a monotonically increasing function on N, where N is the A-group size.
  This means that we overcome Challenge 2: The formula for the original probability is not monotonic with respect to the A-group size.

What is the condition check?

Condition check: inputs N = 2, r = 2 and global probabilities $f_1 = 0.1$ and $f_2 = 0.003$; output Satisfied / Not Satisfied.

The greatest global probability:
$f_{max} = \max\{f_1, f_2\} = \max\{0.1, 0.003\} = 0.1$

The difference between the greatest global probability and the "current" global probability:
$\Delta_1 = f_{max} - f_1 = 0.1 - 0.1 = 0$
$\Delta_2 = f_{max} - f_2 = 0.1 - 0.003 = 0.097$

The condition is whether this difference $\Delta_1$ (and $\Delta_2$) is at most an expression in terms of N, r and $f_{max}$:
$\Delta_{ceil} = \frac{N - r}{\frac{f_{max}(r-1)}{1 - f_{max}} + (N - 1)}$

What is the condition check?

The greatest global probability:
$f_{max} = \max\{f_1, f_2\} = \max\{0.1, 0.003\} = 0.1$

The difference between the greatest global probability and the "current" global probability:
$\Delta_1 = f_{max} - f_1 = 0.1 - 0.1 = 0$
$\Delta_2 = f_{max} - f_2 = 0.1 - 0.003 = 0.097$

The condition is whether this difference $\Delta_1$ (and $\Delta_2$) is at most an expression
$\Delta_{ceil} = \frac{N - r}{\frac{f_{max}(r-1)}{1 - f_{max}} + (N - 1)}$

- Theorem 1: If $\Delta_i \le \Delta_{ceil}$ is satisfied, then the privacy requirement is satisfied.
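The condition check can be sketched as follows. This is a minimal illustration that reads the slide's expression as (N - r) divided by (f_max (r - 1) / (1 - f_max) + (N - 1)); that grouping is an assumption, since the fraction is flattened in the source. The function names are hypothetical, and the sketch assumes 0 < f_max < 1 and r ≥ 2 so that the denominator is positive.

```python
def delta_ceil(n, r, f_max):
    """Threshold on f_max - f_i, under the assumed reading of the slide formula.

    Requires 0 < f_max < 1 and r >= 2 (otherwise the denominator can be zero).
    """
    return (n - r) / (f_max * (r - 1) / (1 - f_max) + (n - 1))

def condition_satisfied(global_probs, r):
    """Check f_max - f_i <= delta_ceil for every tuple in the A-group."""
    n = len(global_probs)
    f_max = max(global_probs)
    ceil = delta_ceil(n, r, f_max)
    return all(f_max - f <= ceil for f in global_probs)

# Running example: N = 2, r = 2, global probabilities 0.1 and 0.003.
print(condition_satisfied([0.1, 0.003], r=2))
# Two tuples with equal global probabilities.
print(condition_satisfied([0.1, 0.1], r=2))
```

For the running example the numerator N - r is 0, so the threshold is 0 and the difference 0.097 for the female tuple violates it; this outcome does not depend on how the denominator is grouped.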

Anonymization

- The condition check gives hints for anonymization:
  - Initially, each tuple forms an A-group.
  - Repeat the following until each A-group satisfies the condition:
    - If there is an A-group violating the condition, merge this A-group with some other A-group such that the "merged" A-group satisfies the condition.
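The merging loop described above can be sketched as a greedy procedure. Everything here is an illustrative assumption rather than the authors' algorithm: the condition uses the same assumed reading of the threshold expression, tuples are represented only by their global probabilities (each strictly between 0 and 1), the partner is chosen as the smallest A-group that makes the merged group satisfy the condition, and if no such partner exists the sketch falls back to merging with the smallest other A-group.

```python
def satisfies(group, r):
    """Assumed condition check: f_max - f_i <= delta_ceil for all tuples."""
    n, f_max = len(group), max(group)
    ceil = (n - r) / (f_max * (r - 1) / (1 - f_max) + (n - 1))
    return all(f_max - f <= ceil for f in group)

def anonymize(priors, r):
    """Greedy merging: start with singleton A-groups, merge violators."""
    groups = [[f] for f in priors]
    while len(groups) > 1:
        bad = next((g for g in groups if not satisfies(g, r)), None)
        if bad is None:
            break  # every A-group satisfies the condition
        others = [g for g in groups if g is not bad]
        # Prefer partners whose merged group satisfies the condition
        # (fallback to any partner is an assumption of this sketch).
        ok = [g for g in others if satisfies(bad + g, r)]
        partner = min(ok or others, key=len)
        groups = [g for g in others if g is not partner] + [bad + partner]
    return groups  # a single remaining group may still violate the condition

result = anonymize([0.1, 0.1, 0.003, 0.003], r=2)
print(result)
```

With r = 2 a singleton A-group always violates the condition, so every tuple ends up merged with at least one other tuple.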

B.1.2 K-Anonymity

Problem: to generate a data set such that each possible value appears at least TWO times.

Customer  Gender  District  Birthday  Cancer
Raymond   Male    Shatin    29 Jan    None
Peter     Male    Fanling   16 July   Yes
Kitty     Female  Shatin    21 Oct    None
Mary      Female  Shatin    8 Feb     None

Two kinds of generalisations:
1. Shatin → NT
2. 16 July → *

Release the data set to the public:

Gender  District  Birthday  Cancer
Male    NT        *         None
Male    NT        *         Yes
Female  Shatin    *         None
Female  Shatin    *         None

This data set is 2-anonymous.

Question: how can we measure the distortion?
"Shatin → NT" causes LESS distortion than "16 July → *".
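The 2-anonymity claim for the released table can be checked mechanically. The sketch below assumes that "each possible value" refers to each combination of the quasi-identifier attributes (Gender, District, Birthday); the function name is hypothetical.

```python
from collections import Counter

def is_k_anonymous(qi_rows, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(count >= k for count in Counter(qi_rows).values())

original = [
    ("Male", "Shatin", "29 Jan"),
    ("Male", "Fanling", "16 July"),
    ("Female", "Shatin", "21 Oct"),
    ("Female", "Shatin", "8 Feb"),
]
released = [
    ("Male", "NT", "*"),
    ("Male", "NT", "*"),
    ("Female", "Shatin", "*"),
    ("Female", "Shatin", "*"),
]
print(is_k_anonymous(original, 2), is_k_anonymous(released, 2))
```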

B.1.2 K-Anonymity

Generalization hierarchies (most general value first):
- Gender: * → Male, Female
- District: HKG → NT (Shatin, Fanling), KLN (Mongkok, Jordan)
- Birthday: * → Jan, July, Oct, Feb → 29 Jan, 16 July, 21 Oct, 8 Feb

Measurements shown on the hierarchies: 1/1 = 1.0, 2/2 = 1.0, 1/2 = 0.5

Conclusion: We propose a measurement of distortion of the modified/anonymized data.
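One plausible reading of these ratios is "number of levels a value was generalized, divided by the height of its hierarchy". The sketch below works under that assumption; the hierarchy heights and the pairing of each ratio with an attribute are inferred from the example generalisations, not stated in the source, and the names are hypothetical.

```python
# Assumed hierarchy heights (number of generalization steps to the top).
HIERARCHY_HEIGHT = {"Gender": 1, "District": 2, "Birthday": 2}

def distortion(attribute, levels_up):
    """Levels generalized divided by the height of the attribute's hierarchy."""
    return levels_up / HIERARCHY_HEIGHT[attribute]

print(distortion("Gender", 1))    # Male -> *
print(distortion("Birthday", 2))  # 16 July -> July -> *
print(distortion("District", 1))  # Shatin -> NT
```

Under this reading, "Shatin → NT" scores 0.5 and "16 July → *" scores 1.0, which agrees with the statement that the former causes less distortion.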

B.1.2 K-Anonymity

Generalization hierarchies (most general value first):
- Gender: * → Male, Female
- District: HKG → NT (Shatin, Fanling), KLN (Mongkok, Jordan)
- Birthday: * → Jan, July, Oct, Feb → 29 Jan, 16 July, 21 Oct, 8 Feb

Measurements shown on the hierarchies: 1/1 = 1.0, 2/2 = 1.0, 1/2 = 0.5

Can we modify the measurement? For example, by giving different weightings to each level.
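A weighted variant could look like the sketch below. It is purely hypothetical: the slide only raises the question, and both the weighting scheme (sum of the weights of the levels climbed, divided by the sum of all level weights) and the example weights are assumptions made for illustration.

```python
def weighted_distortion(levels_up, level_weights):
    """Weight of the levels climbed divided by the total weight of all levels.

    level_weights[0] is the (assumed) weight of the lowest generalization step.
    """
    return sum(level_weights[:levels_up]) / sum(level_weights)

# Equal weights reproduce the unweighted ratio for a two-level hierarchy.
print(weighted_distortion(1, [1, 1]))
# Hypothetical weights: day -> month costs 1, month -> * costs 3.
print(weighted_distortion(1, [1, 3]))
print(weighted_distortion(2, [1, 3]))
```

With unequal weights, generalizing a birthday only to its month is penalized less than the unweighted measure would suggest.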