Privacy Preserving Data Publication
Yufei Tao
Department of Computer Science and Engineering, Chinese University of Hong Kong
Centralized publication q Assume that a hospital wants to publish the following table, called the microdata. q The publication must preserve the privacy of patients. Ø Prevent an adversary from knowing who contracted what. Microdata
Centralized publication (cont. ) q A simple solution: Remove column ‘Name’. q It does not work. See next. publish
Linking attacks The published table A voter registration list Quasi-identifier (QI) attributes An adversary
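A linking attack can be sketched as a join on the QI attributes. The snippet below is only an illustration: the rows are the deck's running example (the microdata without Name, and the voter registration list), and the function name `link` is hypothetical.

```python
# Published table: the running microdata example with the Name column removed.
published = [
    (4, "M", 12000, "gastric ulcer"), (5, "M", 14000, "dyspepsia"),
    (6, "M", 18000, "pneumonia"), (9, "M", 19000, "bronchitis"),
    (12, "F", 22000, "flu"), (19, "F", 24000, "pneumonia"),
    (21, "F", 33000, "gastritis"), (25, "F", 34000, "gastritis"),
    (28, "F", 37000, "flu"), (56, "F", 58000, "flu"),
]

# Voter registration list: public, and it carries names.
voters = [
    ("Andy", 4, "M", 12000), ("Bill", 5, "M", 14000), ("Ken", 6, "M", 18000),
    ("Nash", 9, "M", 19000), ("Mike", 7, "M", 17000), ("Alice", 12, "F", 22000),
    ("Betty", 19, "F", 24000), ("Linda", 21, "F", 33000), ("Jane", 25, "F", 34000),
    ("Sarah", 28, "F", 37000), ("Mary", 56, "F", 58000),
]


def link(published, voters):
    """Join the two tables on (Age, Sex, Zipcode); hypothetical helper."""
    names_by_qi = {}
    for name, age, sex, zipcode in voters:
        names_by_qi.setdefault((age, sex, zipcode), []).append(name)
    reidentified = {}
    for age, sex, zipcode, disease in published:
        candidates = names_by_qi.get((age, sex, zipcode), [])
        if len(candidates) == 1:  # a unique match reveals who contracted what
            reidentified[candidates[0]] = disease
    return reidentified


reidentified = link(published, voters)
```

In this toy data every patient has a unique QI combination, so removing Name protects nobody; Mike, who is not a patient, is the only voter left unmatched.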
These are real threats q Fact: 87% of Americans can be uniquely identified by {Zipcode, gender, date-of-birth}. q A famous experiment by Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] Ø finds the medical record of an ex-governor of Massachusetts.
Objectives q Publish a distorted version of the dataset so that Ø [Privacy] the privacy of all individuals is “adequately” protected; Ø [Utility] the dataset is useful for analyzing the characteristics of the microdata. q Paradox: Privacy protection ↑, utility ↓.
Issues q Privacy principle Ø What is adequate privacy protection? q Distortion approach Ø How to achieve the privacy principle? q The literature has discussed other issues as well. Ø Complexities, improving the utility of the published data, etc.
Principle 1: k-anonymity [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] q 2-anonymous generalization: 4 QI groups [Figure labels: a voter registration list; QI attributes; sensitive attribute.]
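As a minimal sketch, k-anonymity can be checked by counting how many tuples share each combination of generalized QI values. The rows below follow the deck's 2-anonymous example with 4 QI groups; the function name and the string encoding of the intervals are assumptions made for illustration.

```python
from collections import Counter

# Generalized table: (Age, Sex, Zipcode, Disease); intervals kept as strings.
table = [
    ("[1, 5]", "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]", "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]", "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]", "M", "[15001, 20000]", "bronchitis"),
    ("[11, 20]", "F", "[20001, 25000]", "flu"),
    ("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
]


def is_k_anonymous(rows, k, qi_columns=(0, 1, 2)):
    """True if every QI group (identical QI values) holds at least k rows."""
    group_sizes = Counter(tuple(row[i] for i in qi_columns) for row in rows)
    return all(size >= k for size in group_sizes.values())


two_anonymous = is_k_anonymous(table, 2)
three_anonymous = is_k_anonymous(table, 3)
```

The table passes for k = 2 but not for k = 3, because three of its four QI groups contain only two tuples.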
Defects of k-anonymity q What is the disease of Joe? A voter registration list No “diversity” in this QI group.
Principle 2: l-diversity [Machanavajjhala et al. , ICDE, 2006] q Each QI group should have at least l “well-represented” sensitive values. q Different ways to interpret “well-represented”.
Naive interpretation
q Each QI group has l different sensitive values.
A 2-diverse table:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
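The naive interpretation amounts to counting distinct sensitive values per QI group. The sketch below assumes the groups of the 2-diverse table above, keyed by their Age interval (a hypothetical encoding); the function name is also an assumption.

```python
# Sensitive values per QI group, keyed by the group's Age interval.
groups = {
    "[1, 5]": ["gastric ulcer", "dyspepsia"],
    "[6, 10]": ["pneumonia", "bronchitis"],
    "[11, 20]": ["flu", "pneumonia"],
    "[21, 60]": ["gastritis", "gastritis", "flu", "flu"],
}


def is_distinct_l_diverse(groups, l):
    """True if every QI group contains at least l different sensitive values."""
    return all(len(set(values)) >= l for values in groups.values())


two_diverse = is_distinct_l_diverse(groups, 2)
three_diverse = is_distinct_l_diverse(groups, 3)
```

Note that the check ignores frequencies, which is exactly the weakness discussed next.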
Defects of the naive interpretation q Assume that Joe is identified in the QI group. What is the probability that he contracted HIV? [Figure: a QI group with 100 tuples, 98 of which have HIV.] q Implication: The most frequent sensitive value in a QI group cannot be too frequent. q But achieving only this still leaves the table vulnerable to attacks with background knowledge.
Background knowledge attack q Let Joe be an individual in the QI group having HIV. q A friend of Joe has the background knowledge: “Joe does not have pneumonia”. q How likely would this friend assume that Joe had HIV? [Figure: a QI group with 100 tuples, 50 with HIV and 49 with pneumonia.]
Controlling also the 2nd most frequent value [Figure: a QI group with 100 tuples, 40 with HIV and 30 with pneumonia.] q Even if an adversary can eliminate pneumonia, s/he can only assume that Joe has HIV with probability 40 / 70.
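The arithmetic behind these slides can be sketched as follows. The counts for HIV and pneumonia come from the figures; lumping the remaining tuples of each 100-tuple group under a single label "other" is an assumption, since the slides do not say what they are.

```python
from fractions import Fraction


def posterior(counts, target, eliminated=()):
    """Probability of `target` once the adversary rules out `eliminated` values."""
    remaining = {v: n for v, n in counts.items() if v not in eliminated}
    return Fraction(remaining[target], sum(remaining.values()))


# 98 of 100 tuples have HIV: many distinct values would not help.
p_naive = posterior({"HIV": 98, "other": 2}, "HIV")

# 50 HIV, 49 pneumonia; the friend knows Joe does not have pneumonia.
p_attack = posterior({"HIV": 50, "pneumonia": 49, "other": 1}, "HIV",
                     eliminated=("pneumonia",))

# 40 HIV, 30 pneumonia: the 2nd most frequent value is controlled as well.
p_controlled = posterior({"HIV": 40, "pneumonia": 30, "other": 30}, "HIV",
                         eliminated=("pneumonia",))
```

Under these assumptions the three probabilities are 49/50, 50/51 and 40/70 = 4/7.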
An example of 4-diversity [Figure: a QI group split into the most frequent value, the 2nd, 3rd and 4th most frequent values, and the other values.]
An example of 4-diversity (cont.) [Figure: in the QI group, the most frequent value and the other values have the same cardinality.]
An example of 4-diversity (cont.) [Figure: a QI group containing HIV, pneumonia, bronchitis, cancer, and the other values.] q Assume that Joe is a person in the QI group. q Property: If an adversary can eliminate only 3 diseases, s/he can correctly guess the disease of Joe with at most 50% probability.
l-diversity
q Consider a QI group.
q m is the number of distinct sensitive values in the group.
q r_1 is the number of tuples having the most frequent sensitive value, r_2 is the number of tuples having the 2nd most frequent sensitive value, …, r_m is the number of tuples having the m-th most frequent sensitive value.
q Then, r_1 ≤ c (r_l + … + r_m), where c is a constant.
q If an adversary can eliminate only l – 1 sensitive values, s/he can infer the disease of a person with probability at most c / (c + 1).
q Called (c, l)-diversity precisely.
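A minimal sketch of the condition for one QI group, reading the relation as r_1 ≤ c (r_l + … + r_m). The non-strict inequality, the function name, and the treatment of groups with fewer than l distinct values (rejected) are assumptions of this sketch.

```python
def satisfies_c_l_diversity(counts, c, l):
    """Check r_1 <= c * (r_l + ... + r_m) for one QI group.

    `counts` holds the number of tuples per distinct sensitive value.
    """
    r = sorted(counts, reverse=True)  # r[0] is r_1, r[l - 1] is r_l
    if len(r) < l:
        return False  # fewer than l distinct sensitive values
    return r[0] <= c * sum(r[l - 1:])


balanced = satisfies_c_l_diversity([40, 30, 30], c=1, l=2)  # 40 <= 60
skewed = satisfies_c_l_diversity([98, 1, 1], c=1, l=2)      # 98 > 2
too_few = satisfies_c_l_diversity([10], c=1, l=2)
```

A larger c tolerates a more dominant value, while a larger l shrinks the tail sum that the most frequent value is compared against.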
Defects of l-diversity
q Andy does not want anyone to know that he had a stomach problem.
q Sarah does not mind at all if others find out that she had flu.
A 2-diverse table:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
A voter registration list:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
Defects of l-diversity (cont.)
q Does not work if an individual can have multiple tuples in the microdata.
Microdata (Andy has two tuples):
Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Andy   4    M    12000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu
Defects of l-diversity (cont.)
A 2-diverse table:
Age       Sex  Zipcode         Disease
4         M    12000           gastric ulcer
4         M    12000           dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
A voter registration list:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
Principle 3: Personalized anonymity [Xiao and Tao, SIGMOD, 2006] q Key ideas: Guarding node + sensitive attribute (SA) generalization q Assume a publicly-known hierarchy on the sensitive attribute.
Guarding node
q Andy does not want anyone to know that he had a stomach problem.
q He can specify “stomach disease” as the guarding node for his tuple.
Name  Age  Sex  Zipcode  Disease        Guarding node
Andy  4    M    12000    gastric ulcer  stomach disease
q Protect Andy from being conjectured to have any disease in the subtree of the guarding node.
Guarding node (cont.)
q Sarah is willing to disclose her exact symptom.
q She can specify Ø as the guarding node for her tuple.
Name   Age  Sex  Zipcode  Disease  Guarding node
Sarah  28   F    37000    flu      Ø
Guarding node (cont.)
q Bill does not have any special preference.
q He sets the guarding node of his tuple to be the same as his sensitive value.
Name  Age  Sex  Zipcode  Disease    Guarding node
Bill  5    M    14000    dyspepsia  dyspepsia
A personalized approach
Name   Age  Sex  Zipcode  Disease        Guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu
Personalized anonymity [Table: the microdata with guarding nodes, as on the previous slide.] q No adversary should be able to breach the privacy requirement of any guarding node with a probability above p_breach. q If p_breach = 0.3, then no adversary can have more than 30% probability of finding out that: Ø Andy had a stomach disease Ø Bill had dyspepsia Ø …
Why SA generalization?
q How many female patients are there with age above 30?
q Estimate from the pure QI generalization: 4 ∙ (60 – 30 + 1) / (60 – 21 + 1) = 3.1 ≈ 3
q Real answer: 1
[Tables: the microdata and its pure QI generalization, as on the earlier slides; the QI group with Age [21, 60] contains 4 tuples.]
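The estimate on this slide assumes that ages are spread uniformly over the generalized interval. A small sketch of that computation follows; the function name is hypothetical, and the query range [30, 60] mirrors the slide's (60 – 30 + 1) term.

```python
def estimate_count(group_size, group_range, query_range):
    """Uniform-distribution estimate of how many tuples fall in query_range."""
    lo, hi = group_range
    qlo, qhi = query_range
    overlap = max(0, min(hi, qhi) - max(lo, qlo) + 1)  # integer ages in both
    return group_size * overlap / (hi - lo + 1)


# 4 female tuples generalized to Age [21, 60]; query asks for ages 30..60.
estimate = estimate_count(4, (21, 60), (30, 60))

# The real ages in that group are 21, 25, 28 and 56.
actual = sum(1 for age in (21, 25, 28, 56) if age >= 30)
```

The estimate is about 3, while only one patient (Mary, aged 56) actually qualifies.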
SA generalization (cont.)
Pure QI generalization: the 2-anonymous table shown earlier, whose last QI group (Age [21, 60], Sex F, Zipcode [30001, 60000]) holds 4 tuples.
With SA generalization:
Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection
Evaluation of disclosure risk
q What is the probability that the adversary can find out that “Andy had a stomach disease”?
A voter registration list:
Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
The published data:
Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection
Combinatorial reconstruction (cont.) q Can each individual appear more than once? Ø No = the primary case Ø Yes = the non-primary case q Some possible reconstructions: [Figure: in both the primary and the non-primary case, the candidates Andy, Bill, Ken, Nash and Mike are matched against the sensitive values gastric ulcer, dyspepsia, pneumonia and bronchitis.]
Breach probability (primary)
[Figure: candidates Andy, Bill, Ken, Nash, Mike; sensitive values gastric ulcer, dyspepsia, pneumonia, bronchitis.]
q There are 120 possible reconstructions in total.
q If Andy is associated with a stomach disease in n_b reconstructions, the probability that the adversary should associate Andy with some stomach problem is n_b / 120.
q Andy is associated with Ø gastric ulcer in 24 reconstructions Ø dyspepsia in 24 reconstructions Ø gastritis in 0 reconstructions
q n_b = 48
q The breach probability for Andy’s tuple is 48 / 120 = 2 / 5.
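The counts for the primary case can be reproduced by brute force. This sketch assumes a reconstruction is an assignment of the four published tuples to four distinct candidates out of the five; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

candidates = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # subtree of "stomach disease"

# Primary case: nobody owns more than one tuple, so owners are distinct.
reconstructions = list(permutations(candidates, len(diseases)))

# Count reconstructions in which Andy owns a tuple with a stomach disease.
n_b = sum(
    1
    for owners in reconstructions
    if any(o == "Andy" and d in stomach for o, d in zip(owners, diseases))
)

breach_probability = Fraction(n_b, len(reconstructions))
```

Enumerating the 5 · 4 · 3 · 2 ordered selections gives 120 reconstructions, 48 of which breach Andy's guarding node.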
Breach probability (non-primary)
[Figure: candidates Andy, Bill, Ken, Nash, Mike; sensitive values gastric ulcer, dyspepsia, pneumonia, bronchitis.]
q There are 625 possible reconstructions in total.
q Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions.
q n_b = 225
q The breach probability for Andy’s tuple is 225 / 625 = 9 / 25.
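For the non-primary case the same brute-force idea applies, except that each of the four tuples may independently belong to any of the five candidates. As before, this is a sketch with illustrative names.

```python
from fractions import Fraction
from itertools import product

candidates = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}  # subtree of "stomach disease"

# Non-primary case: an individual may own several tuples.
reconstructions = list(product(candidates, repeat=len(diseases)))

n_b = sum(
    1
    for owners in reconstructions
    if any(o == "Andy" and d in stomach for o, d in zip(owners, diseases))
)

breach_probability = Fraction(n_b, len(reconstructions))
```

There are 5^4 = 625 assignments; in 4 · 4 · 5 · 5 = 400 of them Andy owns neither stomach-disease tuple, which leaves 225 breaching reconstructions.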
A defect of personalized anonymity q Does not guard against background knowledge. Ø Recall that l-diversity can achieve this purpose. q But it seems possible to adapt the personalized approach to tackle background knowledge. Ø Future work?
Other privacy principles q k-gather Ø Due to [Aggarwal et al., PODS, 2006] Ø Suffers from the problems of k-anonymity. q (α, k)-anonymity Ø Due to [Wong et al., KDD, 2006] q t-closeness Ø Recently proposed by [Li and Li, ICDE, 2007]
Issues q Privacy principle Ø What is adequate privacy protection? q Distortion approach Ø How to achieve the privacy principle?
Three approaches q Suppression Ø We do not discuss it because ü the utility of the resulting table is low; ü it can be regarded as a special case of generalization. q Generalization Ø Due to [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] q Anatomy (also called “bucketization”) Ø Due to [Xiao and Tao, VLDB, 2006] q Each of the above approaches can be integrated with all the privacy principles discussed earlier.
A multidimensional view of generalization
Taxonomy of generalization [LeFevre et al., SIGMOD, 2005] q Local recoding Ø (Generalized) rectangles may overlap. Ø Suppression is a special case of local recoding. q Global recoding Ø All rectangles are disjoint.
Taxonomy of generalization (cont.) q Global recoding can be further divided. q Single-dimension recoding Ø Rectangles form a grid. q Multi-dimension recoding Ø The opposite of single-dimension recoding.
Taxonomy of generalization (cont. ) q Single-dimension recoding can be further divided. Ø Full-domain recoding Ø Full-subtree recoding q Both assume a hierarchy on each QI attribute. q Example: A hierarchy on Age
Taxonomy of generalization (cont.) q Full-domain recoding Ø All age values must be generalized to the same level of the hierarchy.
Taxonomy of generalization (cont. ) q Full-subtree recoding Ø The subtrees of all generalized values must be disjoint. Ø Permissible generalization: ü [1, 30], [31, 40], [41, 50], [51, 60], [61, 90]. Ø Illegal generalization: ü [1, 10], [1, 30], [31, 60], [61, 90].
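The disjointness requirement can be checked mechanically. The sketch below tests only that the generalized intervals do not overlap; it does not verify that each interval is a node of the Age hierarchy, which full-subtree recoding also presupposes. The function name is hypothetical.

```python
def are_disjoint(intervals):
    """True if no two closed integer intervals (lo, hi) share a value."""
    ordered = sorted(intervals)
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))


permissible = [(1, 30), (31, 40), (41, 50), (51, 60), (61, 90)]
illegal = [(1, 10), (1, 30), (31, 60), (61, 90)]

permissible_ok = are_disjoint(permissible)
illegal_ok = are_disjoint(illegal)
```

The second list fails because [1, 10] lies inside [1, 30], so the subtrees of those two generalized values overlap.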
Why all these generalization types? q Reason 1: If a dataset is generalized in a more restricted manner, less preprocessing is required before it can be analyzed by a standard statistical tool (such as SAS).
Why all these generalization types? q Reason 2: More restrictive generalization is usually faster to compute and easier to analyze.
Why all these generalization types? q Reason 3: Less restrictive generalization promises more accurate data analysis, provided that a sophisticated analytical method is used.
Generalization algorithms q Operate on a quality metric. Examples: Ø The generalization level (for full-domain recoding) Ø Total rectangle size (for local recoding) Ø … q Mostly heuristics-based. q Finding the optimal generalization is often NP-hard.
Defect of generalization
q Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]
Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
q Estimated answer: 2p, where p is the probability that each of the two pneumonia tuples satisfies the query conditions on Age and Zipcode.
Defect of generalization (cont.)
q Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]
q R1 is the rectangle of the generalized pneumonia tuples (Age [21, 60], Sex M, Zipcode [10001, 60000]); Q is the query rectangle.
q p = Area(R1 ∩ Q) / Area(R1) = 0.05
q Estimated answer for Query A: 2p = 0.1
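The value p = 0.05 can be reproduced under two assumptions that the slide leaves implicit: Age and Zipcode are integer-valued, and tuples are uniformly distributed inside the generalized rectangle. The names below are illustrative.

```python
from fractions import Fraction


def overlap_fraction(group_range, query_range):
    """Share of the integer values in group_range that also lie in query_range."""
    lo, hi = group_range
    qlo, qhi = query_range
    overlap = max(0, min(hi, qhi) - max(lo, qlo) + 1)
    return Fraction(overlap, hi - lo + 1)


R1 = {"age": (21, 60), "zipcode": (10001, 60000)}  # generalized rectangle
Q = {"age": (0, 30), "zipcode": (10001, 20000)}    # query rectangle

# Area(R1 ∩ Q) / Area(R1), computed dimension by dimension.
p = overlap_fraction(R1["age"], Q["age"]) * overlap_fraction(R1["zipcode"], Q["zipcode"])

estimate = 2 * p  # two pneumonia tuples in the group
```

The Age dimension contributes 10 / 40 and the Zipcode dimension 10000 / 50000, giving p = 1/20 and an estimate of 1/10.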
Defect of generalization (cont.)
q Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]
q Estimated answer = 0.1
q The exact answer = 1
Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis
Defect of generalization (cont.) q Cause of inaccuracy: The QI distribution inside each QI group is lost! [Figure: the generalized tuple with Age [21, 60], Sex M, Zipcode [10001, 60000], Disease pneumonia.]
Anatomy
q Releases a quasi-identifier table (QIT) and a sensitive table (ST).
Quasi-identifier table (QIT):
Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2
Sensitive table (ST):
Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
[The slide also shows the microdata from which the two tables are derived.]
Anatomy (cont.)
1. Decide an l-diverse partition of the tuples.
A 2-diverse partition:
QI group  Age  Sex  Zipcode  Disease
1         23   M    11000    pneumonia
1         27   M    13000    dyspepsia
1         35   M    59000    dyspepsia
1         59   M    12000    pneumonia
2         61   F    54000    flu
2         65   F    25000    gastritis
2         65   F    25000    flu
2         70   F    30000    bronchitis
Anatomy (cont.) 2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition. [Figure: for each QI group, the QI columns (Age, Sex, Zipcode) are separated from the Disease column, and both parts are tagged with the Group-ID.]
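The two steps can be sketched as follows, assuming the partition has already been chosen. The function name `anatomize` and the index-based encoding of the partition are assumptions of this sketch; finding a good l-diverse partition is a separate problem that is not addressed here.

```python
from collections import Counter

# Microdata without names: (Age, Sex, Zipcode, Disease).
microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]

# Step 1 (given): a 2-diverse partition, as lists of row indices.
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]


def anatomize(rows, partition):
    """Step 2: split each group into QIT rows and per-disease ST counts."""
    qit, st = [], []
    for group_id, indices in enumerate(partition, start=1):
        counts = Counter()
        for i in indices:
            age, sex, zipcode, disease = rows[i]
            qit.append((age, sex, zipcode, group_id))  # exact QI values kept
            counts[disease] += 1
        st.extend((group_id, d, n) for d, n in sorted(counts.items()))
    return qit, st


qit, st = anatomize(microdata, partition)
```

Unlike generalization, the QIT keeps the exact QI values; only the link between a tuple and its sensitive value is blurred inside each group.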
Privacy preservation
q Given a pair of QIT and ST generated from an l-diverse partition, an adversary can infer the sensitive value of each individual with confidence at most 1 / l.
q Example: An adversary who knows Bob's QI values (Age 23, Sex M, Zipcode 11000) can place him in group 1 of the QIT, but the ST only reveals that group 1 contains 2 dyspepsia tuples and 2 pneumonia tuples, so each disease has confidence 1 / 2.
[Tables: the QIT and ST shown on the Anatomy slide.]
Accuracy of data analysis
q Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]
[Tables: the QIT and ST shown on the Anatomy slide.]
Accuracy of data analysis
q Query A: SELECT COUNT(*) from Unknown-Microdata WHERE Disease = ‘pneumonia’ AND Age in [0, 30] AND Zipcode in [10001, 20000]
QIT tuples of group 1:
    Age  Sex  Zipcode  Group-ID
t1  23   M    11000    1
t2  27   M    13000    1
t3  35   M    59000    1
t4  59   M    12000    1
q 2 patients in group 1 contracted pneumonia.
q 2 out of the 4 patients in group 1 (t1 and t2) satisfy the query conditions on Age and Zipcode.
q Estimated answer = 2 * 2 / 4 = 1.
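A sketch of this estimation over the anatomized tables: for every ST entry with the queried disease, its count is scaled by the fraction of QIT tuples in the same group that satisfy the QI conditions. The function names are hypothetical, and the tables are the QIT and ST from the Anatomy slide.

```python
from fractions import Fraction

qit = [  # (Age, Sex, Zipcode, Group-ID)
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [  # (Group-ID, Disease, Count)
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def estimate_count(qit, st, disease, qi_predicate):
    """Sum over groups: count(disease) * share of group tuples matching the QI conditions."""
    total = Fraction(0)
    for group_id, d, count in st:
        if d != disease:
            continue
        group = [row for row in qit if row[3] == group_id]
        matching = sum(1 for row in group if qi_predicate(row))
        total += Fraction(count * matching, len(group))
    return total


def query_a(row):
    age, _sex, zipcode, _group_id = row
    return 0 <= age <= 30 and 10001 <= zipcode <= 20000


estimate = estimate_count(qit, st, "pneumonia", query_a)
```

Only group 1 contains pneumonia, and 2 of its 4 QIT tuples match, so the estimate is 2 · 2 / 4 = 1, which equals the exact answer for this query.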
A defect of anatomy q Existence breach: Does an individual exist in the microdata?
Future work q Re-publication q Tackle stronger background knowledge Ø Recent work [Martin et al., ICDE, 2007] q Improving utility Ø Pioneering work [Kifer and Gehrke, SIGMOD, 2006] q Application to specific (non-trivial) applications Ø Location privacy ü Pioneering work [Mokbel et al., VLDB, 2006]