Data Anonymization: Generalization Algorithms
Li Xiong
CS 573 Data Privacy and Anonymity

Generalization and Suppression
• Generalization: replace the value with a less specific but semantically consistent value
  • Z0 = {41075, 41076, 41095, 41099}, Z1 = {4107*, 4109*}, Z2 = {410**}
  • S0 = {Male, Female}, S1 = {Person}
• Suppression: do not release a value at all

#   Zip     Age    Nationality   Condition
1   41076   < 40   *             Heart Disease
2   48202   < 40   *             Heart Disease
3   41076   < 40   *             Cancer
4   48202   < 40   *             Cancer
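
The ZIP hierarchy above can be sketched as a function that masks trailing digits. This is a minimal illustration, assuming a digit-masking hierarchy; the name `generalize_zip` is hypothetical and not taken from the slides.

```python
def generalize_zip(zip_code: str, level: int) -> str:
    """Replace the last `level` digits with '*' (level 0 keeps the value)."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    return zip_code[: len(zip_code) - level] + "*" * level

# Ground domain Z0 from the slide, lifted to levels 1 and 2.
z0 = ["41075", "41076", "41095", "41099"]
z1 = sorted({generalize_zip(z, 1) for z in z0})
z2 = sorted({generalize_zip(z, 2) for z in z0})
```

Under this assumption, `z1` and `z2` correspond to the domains Z1 and Z2 listed on the slide.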

Complexity
Search space:
• Number of generalizations = $\prod_{\text{attrib } i} (\text{max level of generalization for attribute } i + 1)$
If we allow generalization to a different level for each value of an attribute:
• Number of generalizations = $\prod_{\text{attrib } i} (\text{max level of generalization for attribute } i + 1)^{\#\text{tuples}}$
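
To get a feel for how fast these counts grow, here is a small sketch that evaluates both products. The function names and the example hierarchy heights (two levels above the ground domain for ZIP, one for Sex) are assumptions made for illustration.

```python
from math import prod

def full_domain_count(max_levels):
    """One generalization level per attribute: product of (max level + 1)."""
    return prod(h + 1 for h in max_levels)

def per_value_count(max_levels, n_tuples):
    """A separate level for each value: (max level + 1) ** n_tuples per attribute."""
    return prod((h + 1) ** n_tuples for h in max_levels)

levels = [2, 1]  # assumed heights: ZIP (Z0..Z2), Sex (S0..S1)
global_size = full_domain_count(levels)
local_size = per_value_count(levels, n_tuples=4)
```

Even with two attributes and four tuples, the per-value space is already a couple of hundred times larger than the per-attribute space.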

Hardness result
• Given some data set R and a quasi-identifier (QI) Q, does R satisfy k-anonymity over Q?
  • Easy to check in polynomial time, so a proposed anonymization can be verified efficiently (NP)
• Finding an optimal anonymization is not easy
  • NP-hard: reduction from k-dimensional perfect matching
  • A polynomial-time solution implies P = NP

A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS '04.
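
The polynomial-time check is a single counting pass over the table. A minimal sketch, assuming rows are dictionaries and `qi` lists the quasi-identifier columns (both conventions are illustrative):

```python
from collections import Counter

def is_k_anonymous(rows, qi, k):
    """True if every combination of QI values occurs in at least k rows."""
    counts = Counter(tuple(row[a] for a in qi) for row in rows)
    return all(c >= k for c in counts.values())

# Example rows in the shape of a released table (values are illustrative).
table = [
    {"zip": "41076", "age": "<40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "48202", "age": "<40", "nationality": "*", "condition": "Heart Disease"},
    {"zip": "41076", "age": "<40", "nationality": "*", "condition": "Cancer"},
    {"zip": "48202", "age": "<40", "nationality": "*", "condition": "Cancer"},
]
qi = ["zip", "age", "nationality"]
```

The hard part is not this check but the search for the cheapest generalization that passes it.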

Taxonomy of Generalization Algorithms
• Top-down specialization vs. bottom-up generalization
• Global (single-dimensional) vs. local (multidimensional)
• Complete (optimal) vs. greedy (approximate)
• Hierarchy-based (user-defined) vs. partition-based (automatic)

K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. In SIGMOD '05.

Generalization algorithms
• Early systems
  • µ-Argus, Hundpool, 1996: global, bottom-up, greedy
  • Datafly, Sweeney, 1997: global, bottom-up, greedy
• k-anonymity algorithms
  • AllMin, Samarati, 2001: global, bottom-up, complete, impractical
  • MinGen, Sweeney, 2002: global, bottom-up, complete, impractical
  • Bottom-up generalization, Wang, 2004: global, bottom-up, greedy
  • TDS (Top-Down Specialization), Fung, 2005: global, top-down, greedy
  • K-OPTIMIZE, Bayardo, 2005: global, top-down, partition-based, complete
  • Incognito, LeFevre, 2005: global, bottom-up, hierarchy-based, complete
  • Mondrian, LeFevre, 2006: local, top-down, partition-based, greedy

µ-Argus
• Hundpool and Willenborg, 1996
• Greedy approach
• Global generalization with tuple suppression
• Does not guarantee k-anonymity

µ-Argus algorithm

µ-Argus

Problems With µ-Argus
1. Only 2- and 3-combinations of attributes are examined. A 4-combination may still be unique, so the result does not always satisfy k-anonymity.
2. Generalization is enforced at the attribute level (global), which may over-generalize.
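
The first problem can be seen on a small constructed table. The sketch below is an illustration rather than part of µ-Argus: it takes all 16 tuples over four binary attributes and reports the smallest group size for every attribute combination of a given size.

```python
from collections import Counter
from itertools import combinations, product

# Constructed example: every possible tuple over 4 binary attributes.
rows = list(product([0, 1], repeat=4))

def min_group_size(rows, attrs):
    """Smallest number of rows sharing the same values on `attrs`."""
    counts = Counter(tuple(r[i] for i in attrs) for r in rows)
    return min(counts.values())

def smallest_group_over(rows, size):
    """Worst case over all attribute combinations of the given size."""
    return min(min_group_size(rows, c) for c in combinations(range(4), size))

worst = {size: smallest_group_over(rows, size) for size in (2, 3, 4)}
```

For k = 2, every 2- and 3-combination passes (groups of 4 and 2), yet each full 4-attribute tuple is unique, so a check limited to 2- and 3-combinations would accept a table that is not 2-anonymous.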

The Datafly System
• Sweeney, 1997
• Greedy approach
• Global generalization with tuple suppression

Datafly Algorithm Core Datafly Algorithm

Datafly
MGT resulting from Datafly, k = 2, QI = {Race, Birthdate, Gender, ZIP}

Problems With Datafly
1. Generalizes all values associated with an attribute (global).
2. Suppresses all values within a tuple (global).
3. Selects the attribute with the greatest number of distinct values as the one to generalize first. This is computationally efficient but may over-generalize.
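
The greedy loop behind these points can be sketched as follows. This is a simplified reading of Datafly, not Sweeney's original code: the hierarchy function `mask`, the stopping rule "at most k outlier tuples", and the sample rows are all assumptions made for illustration.

```python
from collections import Counter

def mask(value):
    """Hypothetical hierarchy: star out one more trailing character."""
    kept = value.rstrip("*")
    return kept[:-1] + "*" * (len(value) - len(kept) + 1) if kept else value

def datafly(rows, qi, k):
    rows = [dict(r) for r in rows]
    key = lambda r: tuple(r[a] for a in qi)
    while True:
        freq = Counter(map(key, rows))
        outliers = sum(c for c in freq.values() if c < k)
        if outliers <= k:
            break
        # Greedy choice: generalize the attribute with the most distinct values.
        attr = max(qi, key=lambda a: len({r[a] for r in rows}))
        before = [r[attr] for r in rows]
        for r in rows:
            r[attr] = mask(r[attr])  # applied to every row (global recoding)
        if before == [r[attr] for r in rows]:
            break  # hierarchy exhausted, nothing left to generalize
    # Suppress the remaining outlier tuples entirely.
    return [r for r in rows if freq[key(r)] >= k]

sample = [
    {"zip": "41075", "year": "1964"}, {"zip": "41076", "year": "1965"},
    {"zip": "41095", "year": "1964"}, {"zip": "41099", "year": "1967"},
    {"zip": "41075", "year": "1964"}, {"zip": "41076", "year": "1961"},
]
released = datafly(sample, ["zip", "year"], k=2)
```

On this sample the loop generalizes `zip` once and then `year` once, which shows the over-generalization risk: every year loses its last digit even though some rows were already in a group of size 2.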

K-OPTIMIZE
• Practical solution to guarantee optimality
• Main techniques
  • Framing the problem as a set-enumeration search problem
  • Tree-search strategy with cost-based pruning and dynamic search rearrangement
  • Data management strategies

Anonymization Strategies
• Local suppression
  • Delete individual attribute values
  • E.g. <Age=50, Gender=M, State=CA>
• Global attribute generalization
  • Replace specific values with more general ones for an attribute
  • Numeric data: partitioning of the attribute domain into intervals. E.g. Age = {[1-10], ..., [91-100]}
  • Categorical data: generalization hierarchy supplied by users. E.g. Gender = [M or F]

K-Anonymization with Suppression
• K-anonymization with suppression
  • Global attribute generalization with local suppression of outlier tuples
• Terminology
  • Dataset: D
  • Anonymization: {a1, …, am}
  • Equivalence classes: E
(Diagram: D shown as a table with columns a1, …, am and values v1,1 … vn,m; E marks a group of rows.)

Finding Optimal Anonymization
• Optimal anonymization is determined by a cost metric
• Cost metrics
  • Discernibility metric: penalty for non-suppressed tuples and suppressed tuples
  • Classification metric
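
The slide does not spell out the discernibility penalty. One common formulation, given here as an assumption rather than as the deck's own definition, charges each non-suppressed tuple the size of its equivalence class and each suppressed tuple the size of the whole dataset. The symbol g for the anonymization is introduced only for this formula.

```latex
C_{DM}(g, k) = \sum_{E \,:\, |E| \ge k} |E|^2 \;+\; \sum_{E \,:\, |E| < k} |D| \, |E|
```

Here D is the dataset and E ranges over the equivalence classes induced by g; classes smaller than k are the ones that get suppressed.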

Modeling Anonymizations
• Assume a total order over the set of all attribute domains
• Set representation for an anonymization
  • E.g. Age: <[10-29], [30-49]>, Gender: <[M or F]>, Marital Status: <[Married], [Widowed or Divorced], [Never Married]>
  • {1, 2, 4, 6, 7, 9} -> {2, 7, 9} (the least value of each attribute is always present, so it is dropped)
• Power set representation for the entire anonymization space
  • Power set of {2, 3, 5, 7, 8, 9}: size 2^n
  • {}: most general anonymization
  • {2, 3, 5, 7, 8, 9}: most specific anonymization
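
The set representation can be decoded mechanically: each listed value starts a new group, and unlisted values join the group before them. The sketch below assumes a particular numbering of the ground values (1 to 3 for Age, 4 and 5 for Gender, 6 to 9 for Marital Status); the slide gives only the resulting sets, so these labels are illustrative.

```python
# Assumed total order over ground values; labels are illustrative.
DOMAINS = {
    "Age": [(1, "10-29"), (2, "30-39"), (3, "40-49")],
    "Gender": [(4, "M"), (5, "F")],
    "Marital Status": [(6, "Married"), (7, "Widowed"),
                       (8, "Divorced"), (9, "Never Married")],
}

def decode(anonymization):
    """Turn a set of split points into generalized groups per attribute."""
    result = {}
    for attr, values in DOMAINS.items():
        groups = []
        for i, (index, label) in enumerate(values):
            # The least value of each attribute always starts a group.
            if i == 0 or index in anonymization:
                groups.append([label])
            else:
                groups[-1].append(label)
        result[attr] = groups
    return result

example = decode({2, 7, 9})
```

With this numbering, `{2, 7, 9}` decodes to the anonymization in the example, the empty set merges every attribute into one group, and the full set keeps every value separate.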

Optimal Anonymization Problem
• Goal
  • Find the best anonymization in the powerset, i.e. the one with the lowest cost
• Algorithm
  • Set-enumeration search through tree expansion (size 2^n)
  • Top-down depth-first search
• Heuristics
  • Cost-based pruning
  • Dynamic tree rearrangement
(Figure: set-enumeration tree over the powerset of {1, 2, 3, 4})
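
A set-enumeration tree visits every subset exactly once by letting each node extend its set only with values that come later in the order. The following is a minimal sketch of the plain depth-first traversal, without the pruning or rearrangement heuristics; the function name is illustrative.

```python
def enumerate_sets(universe):
    """Yield every subset once, in depth-first set-enumeration order."""
    def expand(head, tail):
        yield head
        for i, v in enumerate(tail):
            # A child appends one tail value; its own tail is what follows v.
            yield from expand(head + [v], tail[i + 1:])
    yield from expand([], list(universe))

nodes = list(enumerate_sets([1, 2, 3, 4]))
```

Pruning a node in this tree discards its whole subtree, which is why a good lower bound on subtree cost matters so much for the search.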

Node Pruning through Cost Bounding
• Intuitive idea
  • Prune a node H if none of its descendants can be optimal
• Cost lower bound of the subtree of H
  • Cost of suppressed tuples is bounded by H
  • Cost of non-suppressed tuples is bounded by A

Useless Value Pruning
• Intuitive idea
  • Prune useless values that have no hope of improving cost
• Useless values
  • Only split equivalence classes into suppressed equivalence classes (size < k)

Tree Rearrangement
• Intuitive idea
  • Dynamically reorder the tree to increase pruning opportunities
• Heuristics
  • Sort the values based on the number of equivalence classes induced

Experiments
• Adult census dataset
  • 30k records and 9 attributes
  • Fine: powerset of size 2^160
• Evaluations of performance and optimal cost
• Comparison with greedy/stochastic method
  • 2-phase greedy generalization/specialization
  • Repeated process

Results: Comparison
• None of the other optimal algorithms can handle the census data
• Greedy approaches, while executing quickly, produce highly sub-optimal anonymizations
• Comparison with 2-phase method (greedy + stochastic)

Comments
• Interesting things to think about
  • Domains without hierarchy or total order restrictions
  • Other cost metrics
  • Global generalization vs. local generalization

Mondrian
• Top-down partitioning
• Greedy
• Local (multidimensional): tuple/cell level

Global Recoding
• Maps the domains of the quasi-identifier attributes to generalized or altered values using a single function
• Notation
  • D_Xi is the domain of attribute Xi in table T
• Single-dimensional
  • φi : D_Xi → D' for each attribute Xi of the quasi-identifier
  • φi is applied to the values of Xi in each tuple of T

Local Recoding
• Multi-dimensional
  • Recodes the domain of value vectors from a set of quasi-identifier attributes
  • φ : D_X1 × … × D_Xn → D'
  • φ is applied to the vector of quasi-identifier attributes in each tuple of T

Partitioning
• Single-dimensional
  • For each Xi, define non-overlapping single-dimensional intervals that cover D_Xi
  • Use φi to map each x ∈ D_Xi to a summary statistic
• Strict multi-dimensional
  • Define non-overlapping multi-dimensional intervals (regions) that cover D_X1 × … × D_Xd
  • Use φ to map each (x_X1, …, x_Xd) ∈ D_X1 × … × D_Xd to a summary statistic for its region

Global Recoding Example
• k = 2
• Quasi-identifiers: Age, Sex, Zipcode
• Single-dimensional partitions
  • Age: {[25-28]}
  • Sex: {Male, Female}
  • Zip: {[53710-53711], 53712}
• Multi-dimensional partitions
  • {Age: [25-26], Sex: Male, Zip: 53711}
  • {Age: [25-27], Sex: Female, Zip: 53712}
  • {Age: [27-28], Sex: Male, Zip: [53710-53711]}
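
The two partitionings can be compared by applying them to the same records. The patient table itself is not recoverable from the slide, so the six records below are hypothetical values chosen to fit the partitions listed above; the function names are illustrative as well.

```python
from collections import Counter

# Hypothetical records (Age, Sex, Zip), chosen to fit the partitions above.
rows = [(25, "Male", 53711), (25, "Female", 53712), (26, "Male", 53711),
        (27, "Male", 53710), (27, "Female", 53712), (28, "Male", 53711)]

def recode_single(row):
    """Single-dimensional: each attribute is recoded independently."""
    age, sex, zipcode = row
    zip_out = "53712" if zipcode == 53712 else "[53710-53711]"
    return ("[25-28]", sex, zip_out)

REGIONS = [  # (age range, sex, zip range): one summary per region
    ((25, 26), "Male", (53711, 53711)),
    ((25, 27), "Female", (53712, 53712)),
    ((27, 28), "Male", (53710, 53711)),
]

def recode_multi(row):
    """Multi-dimensional: the whole QI vector selects one region."""
    age, sex, zipcode = row
    for region in REGIONS:
        (a_lo, a_hi), r_sex, (z_lo, z_hi) = region
        if a_lo <= age <= a_hi and sex == r_sex and z_lo <= zipcode <= z_hi:
            return region
    raise ValueError("row not covered by any region")

single_sizes = sorted(Counter(map(recode_single, rows)).values())
multi_sizes = sorted(Counter(map(recode_multi, rows)).values())
```

On these records both recodings are 2-anonymous, but the multi-dimensional one yields three classes of size 2 instead of classes of size 2 and 4, so less detail is lost.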

Global Recoding Example 2
• k = 2
• Quasi-identifiers: Age, Zipcode
(Figure panels: Patient Data, Single-Dimensional, Multi-Dimensional)

Greedy Partitioning Algorithm
• Problem
  • Need an algorithm to find multi-dimensional partitions
  • Optimal k-anonymous strict multi-dimensional partitioning is NP-hard
• Solution
  • Use a greedy algorithm
  • Based on k-d trees
  • Complexity O(n log n)
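
A compact sketch of the greedy, k-d-tree-style partitioning follows. It assumes numeric quasi-identifiers, picks the dimension with the widest normalized range, and splits at the lower median; these details, the function name, and the sample records are illustrative assumptions, not the published pseudocode. The per-node sorting also makes this sketch less efficient than the O(n log n) bound quoted above.

```python
def mondrian(rows, dims, k):
    """Greedy strict multi-dimensional partitioning (illustrative sketch)."""
    # Global ranges, used to normalize widths (guard against zero range).
    spans = {d: (max(r[d] for r in rows) - min(r[d] for r in rows)) or 1
             for d in dims}

    def width(part, d):
        vals = [r[d] for r in part]
        return (max(vals) - min(vals)) / spans[d]

    def split(part):
        # Try dimensions from widest to narrowest normalized range.
        for d in sorted(dims, key=lambda d: -width(part, d)):
            vals = sorted(r[d] for r in part)
            split_val = vals[(len(vals) - 1) // 2]  # lower median
            lhs = [r for r in part if r[d] <= split_val]
            rhs = [r for r in part if r[d] > split_val]
            if len(lhs) >= k and len(rhs) >= k:     # allowable cut
                return split(lhs) + split(rhs)
        # No allowable cut: summarize the partition by its value ranges.
        return [{d: (min(r[d] for r in part), max(r[d] for r in part))
                 for d in dims}]

    return split(rows)

# Hypothetical patient records (the original table is not in the transcript).
patients = [
    {"zip": 53711, "age": 25}, {"zip": 53712, "age": 25},
    {"zip": 53711, "age": 26}, {"zip": 53710, "age": 27},
    {"zip": 53712, "age": 27}, {"zip": 53711, "age": 28},
]
regions = mondrian(patients, ["zip", "age"], k=2)
```

With k = 2 these records are first cut on `zip` at 53711, then the left half is cut on `age` at 26, and no further cut is allowable, leaving three regions of two records each.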

Greedy Partitioning Algorithm

Algorithm Example
• k = 2
• Dimension determined heuristically
• Quasi-identifiers
  • Zipcode
  • Age
(Figure panels: Patient Data, Anonymized Data)

Algorithm Example
• Iteration #1 (full table)
  • dim = Zipcode
  • splitVal = 53711
  • Partition is split into LHS and RHS

Algorithm Example continued
• Iteration #2 (LHS from iteration #1)
  • dim = Age
  • splitVal = 26
  • Partition is split into LHS and RHS

Algorithm Example continued
• Iteration #3 (LHS from iteration #2)
  • No allowable cut
  • Summary: Age = [25-26], Zip = [53711]
• Iteration #4 (RHS from iteration #2)
  • No allowable cut
  • Summary: Age = [27-28], Zip = [53710-53711]

Algorithm Example continued
• Iteration #5 (RHS from iteration #1)
  • No allowable cut
  • Summary: Age = [25-27], Zip = [53712]

Experiment
• Adult dataset
• Data quality metric (cost metric)
  • Discernibility Metric (C_DM)
    • $C_{DM} = \sum_{\text{equivalence classes } E} |E|^2$
    • Assigns a penalty to each tuple
  • Normalized Average Equivalence Class Size Metric (C_AVG)
    • $C_{AVG} = (\text{total\_records} / \text{total\_equiv\_classes}) / k$
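
Both metrics depend only on the sizes of the equivalence classes, so they are easy to compute once the partitioning is known. A minimal sketch, with illustrative function names and made-up class sizes:

```python
def discernibility(class_sizes):
    """C_DM: each tuple is penalized by the size of its equivalence class."""
    return sum(size ** 2 for size in class_sizes)

def normalized_avg_class_size(class_sizes, k):
    """C_AVG: average class size divided by k (1.0 is the best possible)."""
    return (sum(class_sizes) / len(class_sizes)) / k

coarse = [4, 2]     # illustrative: six records in two classes
fine = [2, 2, 2]    # illustrative: the same six records in three classes
```

For k = 2, the finer partitioning scores better on both metrics: C_DM drops from 20 to 12 and C_AVG from 1.5 to 1.0.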

Comparison results
• Full-domain method: Incognito
• Single-dimensional method: K-OPTIMIZE

Data partitioning comparison

Mondrian
