
Data Anonymization: Generalization Algorithms
Li Xiong
CS 573 Data Privacy and Anonymity

Generalization and Suppression
- Generalization: replace the value with a less specific but semantically consistent value
- Suppression: do not release a value at all
Example hierarchies:
  Z0 = {41075, 41076, 41095, 41099}, Z1 = {4107*, 4109*}, Z2 = {410**}
  S0 = {Male, Female}, S1 = {Person}
Example table:
  #  Zip    Age   Nationality  Condition
  1  41076  < 40  *            Heart Disease
  2  48202  < 40  *            Heart Disease
  3  41076  < 40  *            Cancer
  4  48202  < 40  *            Cancer
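The ZIP hierarchy on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `generalize_zip` is an assumption made for illustration and does not come from the slides.

```python
def generalize_zip(zip_code: str, level: int) -> str:
    """Mask the last `level` digits of a ZIP code with '*'."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range")
    return zip_code[: len(zip_code) - level] + "*" * level

# Level-0 domain from the slide.
z0 = ["41075", "41076", "41095", "41099"]
# Each generalization step collapses values that share a prefix.
z1 = sorted({generalize_zip(z, 1) for z in z0})
z2 = sorted({generalize_zip(z, 2) for z in z0})
print(z1)  # ['4107*', '4109*']
print(z2)  # ['410**']
```

Suppression would correspond to dropping the value entirely (for example, releasing '*' for the whole field).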

Complexity
Search space:
- Number of generalizations = $\prod_{\text{attrib } i} (\text{max level of generalization for attribute } i + 1)$
- If we allow generalization to a different level for each value of an attribute: number of generalizations = $\prod_{\text{attrib } i} (\text{max level of generalization for attribute } i + 1)^{\#\text{tuples}}$
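To get a feel for how fast the search space grows, the two counts can be evaluated for a toy case. The sketch below assumes two attributes with maximum generalization levels 2 and 1 (matching the Z0..Z2 and S0..S1 hierarchies shown earlier) and a table of 4 tuples; these numbers are illustrative only.

```python
import math

# Assumed maximum generalization level per attribute (illustrative).
max_levels = {"Zip": 2, "Sex": 1}
num_tuples = 4

# One level chosen per attribute: product of (max level + 1).
global_count = math.prod(level + 1 for level in max_levels.values())

# One level chosen per value of each attribute: every factor is raised to #tuples.
per_value_count = math.prod((level + 1) ** num_tuples for level in max_levels.values())

print(global_count)     # 6
print(per_value_count)  # 1296
```

Even in this tiny case, allowing a different level per value multiplies the number of candidates by more than two hundred.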

Hardness result
- Given some data set R and a QI Q, does R satisfy k-anonymity over Q?
  - Easy to tell in polynomial time, so a candidate anonymization can be verified efficiently (NP)
- Finding an optimal anonymization is not easy
  - NP-hard: reduction from k-dimensional perfect matching
  - A polynomial solution implies P = NP
A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS '04.
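The polynomial-time check is just a group-by over the quasi-identifier columns. A minimal sketch, using the four-row example table from the Generalization and Suppression slide; the function name `is_k_anonymous` and the choice of columns 0 to 2 as the QI are assumptions made for illustration.

```python
from collections import Counter

def is_k_anonymous(rows, qi_indices, k):
    """True if every combination of QI values occurs in at least k rows."""
    counts = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return all(c >= k for c in counts.values())

# (Zip, Age, Nationality, Condition)
table = [
    ("41076", "< 40", "*", "Heart Disease"),
    ("48202", "< 40", "*", "Heart Disease"),
    ("41076", "< 40", "*", "Cancer"),
    ("48202", "< 40", "*", "Cancer"),
]
qi = (0, 1, 2)  # assumed QI: Zip, Age, Nationality

print(is_k_anonymous(table, qi, 2))  # True
print(is_k_anonymous(table, qi, 3))  # False
```

The check is a single pass over the table; the hard part is searching for the cheapest generalization that makes it succeed.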

Taxonomy of Generalization Algorithms
- Top-down specialization vs. bottom-up generalization
- Global (single-dimensional) vs. local (multidimensional)
- Complete (optimal) vs. greedy (approximate)
- Hierarchy-based (user-defined) vs. partition-based (automatic)
K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. In SIGMOD '05.

Generalization algorithms
- Early systems
  - µ-Argus, Hundpool, 1996: global, bottom-up, greedy
  - Datafly, Sweeney, 1997: global, bottom-up, greedy
- k-anonymity algorithms
  - AllMin, Samarati, 2001: global, bottom-up, complete, impractical
  - MinGen, Sweeney, 2002: global, bottom-up, complete, impractical
  - Bottom-up generalization, Wang, 2004: global, bottom-up, greedy
  - TDS (Top-Down Specialization), Fung, 2005: global, top-down, greedy
  - K-OPTIMIZE, Bayardo, 2005: global, top-down, partition-based, complete
  - Incognito, LeFevre, 2005: global, bottom-up, hierarchy-based, complete
  - Mondrian, LeFevre, 2006: local, top-down, partition-based, greedy

µ-Argus
- Hundpool and Willenborg, 1996
- Greedy approach
- Global generalization with tuple suppression
- Does not guarantee k-anonymity

µ-Argus algorithm
[Figure: µ-Argus algorithm]

Problems With µ-Argus
1. Only 2- and 3-combinations are examined; there may exist 4-combinations that are unique, so the result may not always satisfy k-anonymity
2. Generalization is enforced at the attribute level (global), which may over-generalize
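The first problem is easy to demonstrate on synthetic data. The sketch below builds a hypothetical table of four binary attributes containing every bit pattern once: every 2- and 3-combination of attributes passes a k=2 check, yet every full 4-attribute combination is unique. The table and the helper `min_group_size` are illustrative assumptions, not part of µ-Argus.

```python
from collections import Counter
from itertools import combinations, product

def min_group_size(rows, cols):
    """Smallest number of rows sharing the same values on the given columns."""
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return min(counts.values())

# Hypothetical table: all 16 combinations of four binary attributes.
rows = list(product([0, 1], repeat=4))
k = 2

# Checks restricted to 2- and 3-combinations, as in the problem described above.
small_ok = all(
    min_group_size(rows, cols) >= k
    for size in (2, 3)
    for cols in combinations(range(4), size)
)
# Check over the full quasi-identifier.
full_ok = min_group_size(rows, range(4)) >= k

print(small_ok)  # True
print(full_ok)   # False
```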

The Datafly System
- Sweeney, 1997
- Greedy approach
- Global generalization with tuple suppression

Datafly Algorithm
[Figure: core Datafly algorithm]

Datafly
[Table: MGT resulting from Datafly, k=2, QI = {Race, Birthdate, Gender, ZIP}]

Problems With Datafly
1. Generalizing all values associated with an attribute (global)
2. Suppressing all values within a tuple (global)
3. Selecting the attribute with the greatest number of distinct values as the one to generalize first: computationally efficient but may over-generalize
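The three points above can be seen in a simplified sketch of a Datafly-style loop. This is a reading of the heuristic under stated assumptions, not Sweeney's implementation: the loop generalizes the attribute with the most distinct values while more than k tuples sit in undersized groups, then suppresses the remaining outlier tuples. The function names, the level-based generalizers, and the sample rows are all hypothetical.

```python
from collections import Counter

def zip_gen(value, level):
    """Mask the last `level` digits of a ZIP code."""
    return value[: len(value) - level] + "*" * level

def sex_gen(value, level):
    """Level 0 keeps the value; level 1 generalizes to 'Person'."""
    return value if level == 0 else "Person"

def datafly_sketch(rows, generalizers, max_levels, k):
    levels = [0] * len(generalizers)
    while True:
        # Apply the current generalization level of every attribute (global recoding).
        current = [
            tuple(g(v, lvl) for g, v, lvl in zip(generalizers, row, levels))
            for row in rows
        ]
        counts = Counter(current)
        outliers = sum(c for c in counts.values() if c < k)
        if outliers <= k:
            break  # few enough outliers: suppress them instead of generalizing further
        candidates = [i for i, lvl in enumerate(levels) if lvl < max_levels[i]]
        if not candidates:
            break  # nothing left to generalize
        # Greedy choice: attribute with the most distinct values.
        target = max(candidates, key=lambda i: len({t[i] for t in current}))
        levels[target] += 1
    # Suppress whole tuples that still fall in groups smaller than k.
    released = [t for t in current if counts[t] >= k]
    return released, levels

rows = [("41075", "M"), ("41076", "M"), ("41095", "F"), ("41099", "F"), ("41099", "M")]
released, levels = datafly_sketch(rows, [zip_gen, sex_gen], [5, 1], k=2)
print(levels)    # [1, 0]
print(released)  # [('4107*', 'M'), ('4107*', 'M'), ('4109*', 'F'), ('4109*', 'F')]
```

In this run ZIP is generalized for every row, and the fifth tuple is suppressed entirely, which illustrates both global behaviors listed above.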


K-OPTIMIZE
- Practical solution to guarantee optimality
- Main techniques
  - Framing the problem as a set-enumeration search problem
  - Tree-search strategy with cost-based pruning and dynamic search rearrangement
  - Data management strategies
3/1/2021

Anonymization Strategies
- Local suppression
  - Delete individual attribute values
  - E.g. <Age=50, Gender=M, State=CA>
- Global attribute generalization
  - Replace specific values with more general ones for an attribute
  - Numeric data: partitioning of the attribute domain into intervals. E.g. Age = {[1-10], ..., [91-100]}
  - Categorical data: generalization hierarchy supplied by users. E.g. Gender = [M or F]

K-Anonymization with Suppression
- K-anonymization with suppression
  - Global attribute generalization with local suppression of outlier tuples
- Terminology
  - Dataset: D
  - Anonymization: {a_1, …, a_m}
  - Equivalence classes: E
[Figure: dataset shown as a table with columns a_1 … a_m and values v_{1,1} … v_{n,m}, with E marking a group of rows]

Finding Optimal Anonymization
- Optimal anonymization is determined by a cost metric
- Cost metrics
  - Discernibility metric: penalty for non-suppressed tuples and suppressed tuples
  - Classification metric
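As a concrete reading of the discernibility metric, the sketch below follows the definition used in the K-OPTIMIZE paper (Bayardo and Agrawal): each tuple in an equivalence class of size at least k is charged the size of its class, and each suppressed tuple (class smaller than k) is charged the size of the whole dataset. The function name and the sample class sizes are assumptions made for illustration.

```python
def discernibility(class_sizes, k):
    """Discernibility cost with suppression of classes smaller than k."""
    total = sum(class_sizes)  # |D|
    cost = 0
    for size in class_sizes:
        if size >= k:
            cost += size * size   # each of `size` tuples pays |E|
        else:
            cost += total * size  # each suppressed tuple pays |D|
    return cost

# Hypothetical equivalence classes of sizes 3, 2 and 1 with k = 2:
# 3*3 + 2*2 + 6*1 = 19
print(discernibility([3, 2, 1], k=2))  # 19
```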

Modeling Anonymizations
- Assume a total order over the set of all attribute domains
- Set representation for an anonymization
  - E.g. Age: <[10-29], [30-49]>, Gender: <[M or F]>, Marital Status: <[Married], [Widowed or Divorced], [Never Married]>
  - {1, 2, 4, 6, 7, 9} -> {2, 7, 9}
- Power set representation for the entire anonymization space
  - Power set of {2, 3, 5, 7, 8, 9}: on the order of $2^n$!
  - {}: most general anonymization
  - {2, 3, 5, 7, 8, 9}: most specific anonymization

Optimal Anonymization Problem
- Goal
  - Find the best anonymization in the powerset with lowest cost
- Algorithm
  - Set enumeration search through tree expansion (size $2^n$)
  - Top-down depth-first search
- Heuristics
  - Cost-based pruning
  - Dynamic tree rearrangement
[Figure: set enumeration tree over powerset of {1, 2, 3, 4}]
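A set enumeration tree visits every subset exactly once by only extending a node with items that come later in the total order. The following is a minimal sketch of the depth-first traversal over {1, 2, 3, 4}, without any of the pruning or rearrangement that K-OPTIMIZE adds; the function name is an assumption.

```python
def set_enumeration(items, head=()):
    """Yield every subset of `items` once, in depth-first tree order."""
    yield head
    for i, item in enumerate(items):
        # A child extends the head with one later item; its own children
        # may only use items that follow that one.
        yield from set_enumeration(items[i + 1:], head + (item,))

nodes = list(set_enumeration((1, 2, 3, 4)))
print(len(nodes))  # 16
print(nodes[:6])   # [(), (1,), (1, 2), (1, 2, 3), (1, 2, 3, 4), (1, 2, 4)]
```

Pruning a node in this tree removes its whole subtree from consideration, which is what makes cost bounds on a subtree useful.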

Node Pruning through Cost Bounding
- Intuitive idea
  - Prune a node H if none of its descendants can be optimal
- Cost lower bound of the subtree of H
  - Cost of suppressed tuples bounded by H
  - Cost of non-suppressed tuples bounded by A
[Figure: nodes H and A in the search tree]

Useless Value Pruning
- Intuitive idea
  - Prune useless values that have no hope of improving cost
- Useless values
  - Values that only split equivalence classes into suppressed equivalence classes (size < k)

Tree Rearrangement
- Intuitive idea
  - Dynamically reorder the tree to increase pruning opportunities
- Heuristics
  - Sort the values based on the number of equivalence classes induced

Experiments
- Adult census dataset
  - 30k records and 9 attributes
  - Fine: powerset of size $2^{160}$
- Evaluations of performance and optimal cost
- Comparison with greedy/stochastic method
  - 2-phase greedy generalization/specialization
  - Repeated process

Results: Comparison
- None of the other optimal algorithms can handle the census data
- Greedy approaches, while executing quickly, produce highly suboptimal anonymizations
- Comparison with 2-phase method (greedy + stochastic)

Comments
- Interesting things to think about
  - Domains without hierarchy or total order restrictions
  - Other cost metrics
  - Global generalization vs. local generalization


Mondrian
- Top-down partitioning
- Greedy
- Local (multidimensional): tuple/cell level

Global Recoding
- Mapping domains of quasi-identifiers to generalized or altered values using a single function
- Notation
  - $D_{X_i}$ is the domain of attribute $X_i$ in table T
- Single-dimensional
  - $\phi_i : D_{X_i} \to D'$ for each attribute $X_i$ of the quasi-identifier
  - $\phi_i$ is applied to the values of $X_i$ in each tuple of T

Local Recoding
- Multi-dimensional
  - Recode the domain of value vectors from a set of quasi-identifier attributes
  - $\phi : D_{X_1} \times \dots \times D_{X_n} \to D'$
  - $\phi$ is applied to the vector of quasi-identifier attributes in each tuple of T

Partitioning
- Single-dimensional
  - For each $X_i$, define non-overlapping single-dimensional intervals that cover $D_{X_i}$
  - Use $\phi_i$ to map each $x \in D_{X_i}$ to a summary statistic
- Strict multi-dimensional
  - Define non-overlapping multi-dimensional intervals that cover $D_{X_1} \times \dots \times D_{X_d}$
  - Use $\phi$ to map each $(x_1, \dots, x_d) \in D_{X_1} \times \dots \times D_{X_d}$ to a summary statistic for its region

Global Recoding Example
- k = 2
- Quasi-identifiers: Age, Sex, Zipcode
- Single-dimensional partitions
  - Age: {[25-28]}
  - Sex: {Male, Female}
  - Zip: {[53710-53711], 53712}
- Multi-dimensional partitions
  - {Age: [25-26], Sex: Male, Zip: 53711}
  - {Age: [25-27], Sex: Female, Zip: 53712}
  - {Age: [27-28], Sex: Male, Zip: [53710-53711]}
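The difference between the two recodings is easier to see when both are applied to the same records. The patient data behind this slide is not in the text, so the six records below are an assumption, chosen only to be consistent with the partitions listed above; the function names are likewise illustrative.

```python
from collections import Counter

# Assumed records (Age, Sex, Zip); not taken from the slides.
records = [
    (25, "Male", 53711), (25, "Female", 53712), (26, "Male", 53711),
    (27, "Male", 53710), (27, "Female", 53712), (28, "Male", 53711),
]

def single_dimensional(rec):
    """One recoding function per attribute, applied independently."""
    age, sex, zipcode = rec
    zip_range = "[53710-53711]" if zipcode <= 53711 else "53712"
    return ("[25-28]", sex, zip_range)

def multi_dimensional(rec):
    """One recoding function over the whole quasi-identifier vector."""
    age, sex, zipcode = rec
    if sex == "Male" and zipcode == 53711 and 25 <= age <= 26:
        return ("[25-26]", "Male", "53711")
    if sex == "Female" and zipcode == 53712 and 25 <= age <= 27:
        return ("[25-27]", "Female", "53712")
    if sex == "Male" and 53710 <= zipcode <= 53711 and 27 <= age <= 28:
        return ("[27-28]", "Male", "[53710-53711]")
    raise ValueError(f"record {rec} falls outside the defined regions")

single_classes = Counter(map(single_dimensional, records))
multi_classes = Counter(map(multi_dimensional, records))
print(sorted(single_classes.values()))  # [2, 4]
print(sorted(multi_classes.values()))   # [2, 2, 2]
```

Both results satisfy k=2 on this data, but the multi-dimensional recoding keeps three smaller classes with tighter age ranges instead of collapsing Age to a single interval.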

Global Recoding Example 2
- k = 2
- Quasi-identifiers: Age, Zipcode
[Figure: patient data with single-dimensional and multi-dimensional partitions]

Greedy Partitioning Algorithm
- Problem
  - Need an algorithm to find multi-dimensional partitions
  - Optimal k-anonymous strict multi-dimensional partitioning is NP-hard
- Solution
  - Use a greedy algorithm
  - Based on k-d trees
  - Complexity O(n log n)

Greedy Partitioning Algorithm

Algorithm Example
- k = 2
- Dimension determined heuristically
- Quasi-identifiers
  - Zipcode
  - Age
[Figure: patient data and anonymized data]

Algorithm Example
Iteration 1 (full table)
- partition: full table
- dim = Zipcode
- fs: frequency set of Zipcode
- splitVal = 53711
- Split into LHS and RHS

Algorithm Example continued
Iteration 2 (LHS from iteration 1)
- partition: LHS from iteration 1
- dim = Age
- fs: frequency set of Age
- splitVal = 26
- Split into LHS and RHS

Algorithm Example continued
Iteration 3 (LHS from iteration 2)
- No allowable cut
- Summary: Age = [25-26], Zip = [53711]
Iteration 4 (RHS from iteration 2)
- No allowable cut
- Summary: Age = [27-28], Zip = [53710-53711]

Algorithm Example continued
Iteration 5 (RHS from iteration 1)
- No allowable cut
- Summary: Age = [25-27], Zip = [53712]
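The five iterations above can be replayed with a short recursive sketch of median partitioning. Several things here are assumptions rather than content from the slides: the six (Zip, Age) records are chosen to be consistent with the summaries above, the dimension heuristic is "widest range normalized by the range in the full table, ties broken by attribute order", and the split value is the lower median. The sketch makes no attempt to reach the O(n log n) bound.

```python
def mondrian(records, k, spans=None):
    """Greedy top-down partitioning; returns one (min, max) range per dimension per region."""
    dims = range(len(records[0]))
    if spans is None:
        # Range of each dimension in the full table, used for normalization.
        spans = [(max(r[d] for r in records) - min(r[d] for r in records)) or 1 for d in dims]

    def width(d):
        vals = [r[d] for r in records]
        return (max(vals) - min(vals)) / spans[d]

    # Try dimensions from widest to narrowest until an allowable cut is found.
    for d in sorted(dims, key=width, reverse=True):
        ordered = sorted(r[d] for r in records)
        split_val = ordered[(len(ordered) - 1) // 2]  # lower median
        lhs = [r for r in records if r[d] <= split_val]
        rhs = [r for r in records if r[d] > split_val]
        if len(lhs) >= k and len(rhs) >= k:  # cut is allowable
            return mondrian(lhs, k, spans) + mondrian(rhs, k, spans)
    # No allowable cut: summarize the region by its bounding ranges.
    return [tuple((min(r[d] for r in records), max(r[d] for r in records)) for d in dims)]

# Assumed patient records as (Zipcode, Age); not taken from the slides.
patients = [(53711, 25), (53712, 25), (53711, 26), (53710, 27), (53712, 27), (53711, 28)]
regions = mondrian(patients, k=2)
for zip_range, age_range in regions:
    print("Zip", zip_range, "Age", age_range)
# Zip (53711, 53711) Age (25, 26)
# Zip (53710, 53711) Age (27, 28)
# Zip (53712, 53712) Age (25, 27)
```

With this data the first cut is on Zipcode at 53711 and the second on Age at 26, after which no region can be split without dropping below k=2.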

Experiment
- Adult dataset
- Data quality metric (cost metric)
  - Discernibility Metric ($C_{DM}$)
    - Assigns a penalty to each tuple
    - $C_{DM} = \sum_{\text{equivalence classes } E} |E|^2$
  - Normalized Average Equivalence Class Size Metric ($C_{AVG}$)
    - $C_{AVG} = (\text{total\_records} / \text{total\_equiv\_classes}) / k$
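Both metrics depend only on the sizes of the equivalence classes, so they are simple to compute. A minimal sketch, evaluated on a hypothetical result with three classes of two records each (the shape of the multi-dimensional example with k=2); the function names are assumptions.

```python
def c_dm(class_sizes):
    """Discernibility metric: sum of squared equivalence class sizes."""
    return sum(size * size for size in class_sizes)

def c_avg(class_sizes, k):
    """Normalized average equivalence class size."""
    return (sum(class_sizes) / len(class_sizes)) / k

sizes = [2, 2, 2]  # hypothetical: three equivalence classes of two records
print(c_dm(sizes))        # 12
print(c_avg(sizes, k=2))  # 1.0
```

A $C_{AVG}$ of 1.0 is the best achievable value, since it means every class has exactly k records; for comparison, classes of sizes 4 and 2 over the same six records give $C_{DM} = 20$ and $C_{AVG} = 1.5$.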

Comparison results
- Full-domain method: Incognito
- Single-dimensional method: K-OPTIMIZE

Data partitioning comparison

Mondrian