Nabil Tabbaa

Outline
• Introduction
• Background Definitions
• The Inference Problem
• Limiting Disclosure Risk
• Disclosure Risk vs. Data Utility

Introduction
• Large amounts of data about individuals are collected (medical, educational, and human services records, among others).
• These types of data are invaluable to researchers in a vast array of fields.
• Many agencies rely on publicly released data from the census.
• Numerous research projects depend on publicly available medical or educational data sets.

Introduction…
• Sensitive information about an individual must remain private for:
  • Ethical reasons.
  • Legal reasons.
• Trust between a data-collecting agency and its respondents is very important:
  • Respondents may alter responses or simply not respond at all to some surveys.
• Groups or individuals often have incentives to use data maliciously.

Background Definitions
• Data (values collected from respondents)
  • Categorical values (e.g., marital status).
  • Magnitude values (e.g., income).
  • Summarized using aggregation functions.
• Sensitive Information
  • A cell in a table is considered sensitive (or unsafe) if it contains a value whose publication could disclose specific information about a respondent.
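As a rough illustration of how a cell might be flagged as sensitive, the sketch below applies a minimum-frequency threshold to a table of categorical counts. The threshold rule, the `min_count` value, and the records are assumptions made for this example; the slides do not prescribe a particular sensitivity criterion.

```python
from collections import Counter

# Hypothetical microdata: (marital_status, region) for each respondent.
records = [
    ("married", "north"), ("married", "north"), ("married", "north"),
    ("single", "north"), ("single", "south"), ("single", "south"),
    ("single", "south"), ("widowed", "south"),
]

def sensitive_cells(rows, min_count=3):
    """Return the cells whose respondent count falls below min_count.

    min_count is an assumed threshold, not a value taken from the slides.
    """
    counts = Counter(rows)
    return {cell: n for cell, n in counts.items() if n < min_count}

flagged = sensitive_cells(records)
print(flagged)
```

Cells with very few respondents are flagged because publishing them comes close to publishing an individual's record.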

Background Definitions…
• Attackers
  • An attacker aims to gain access to details in the sensitive cells of a table.
  • The attacker works with the published information, the structure of the table, and any a priori knowledge that may be publicly available.
• Output pattern
  • Several methodologies are used to protect the sensitive information in a table.
  • The output of a methodology is called a pattern.

Background Definitions…
• Loss of information
  • The information loss of a pattern depends heavily on the protection methodology.
  • The optimization problem underlying a methodology is to find a protected pattern with minimum loss of information.

The Inference Problem
• Inference problems are security concerns that arise when users deduce sensitive information about the database from relatively trivial information.
• Inference problems differ from other security problems in that they are not an issue of unauthorized access to data or leakage of information.

The Inference Problem…
• Inference rules
  • Subsume rule: the result of one query and the result of another query together correspond to the same tuple.
  • Overlapping rule: some of the values returned by one query match some of the values returned by another query.
  • Complementary rule: information is obtained by taking the difference between two sets of query results.
  • Functional dependency rule: based on the relationships between the attributes of a database.
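The complementary rule can be illustrated with a small differencing example. This is a minimal sketch: the salary table, the names, and the `total_salary` query function are hypothetical and stand in for any interface that only answers aggregate queries.

```python
# Hypothetical salary table; names and values are made up for illustration.
salaries = {"ana": 52000, "ben": 61000, "cho": 58000, "dev": 75000}

def total_salary(names):
    """Aggregate query: the only kind of query the user may issue."""
    return sum(salaries[n] for n in names)

everyone = set(salaries)
everyone_but_dev = everyone - {"dev"}

# Neither query returns an individual value, but their difference does.
inferred = total_salary(everyone) - total_salary(everyone_but_dev)
print(inferred)
```

Both queries are authorized on their own, which is why this counts as an inference problem rather than an access-control failure.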

The Inference Problem…
• Inference information
  • Information that is stored in the database.
  • The design of the database.
  • The relationships between the different attributes of the database.
  • Statistical data derived from the database.
  • The existence or absence of data.
  • The changing values of the data.
  • Specialized information about the database.
  • Common knowledge and common sense.

Limiting Disclosure Risk
• Basic methods
  • Limitation of detail.
  • Top/bottom coding.
  • Suppression.
  • Rounding.
  • Addition of noise.
• Sampling
  • Makes it difficult to verify population uniqueness.
  • Easy to implement, and the resulting sampled data are relatively easy to analyze.
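Three of the basic methods listed above (top coding, rounding, and noise addition) can be sketched on a single magnitude variable. The income values, the cap of 150000, the rounding base of 1000, and the noise standard deviation are all assumptions chosen for illustration.

```python
import random

def top_code(value, cap):
    """Top coding: report any value above cap as the cap itself."""
    return min(value, cap)

def round_to_base(value, base):
    """Rounding: report the nearest multiple of base."""
    return base * round(value / base)

def add_noise(value, sd, rng):
    """Noise addition: perturb the value with zero-mean Gaussian noise."""
    return value + rng.gauss(0.0, sd)

# Hypothetical incomes; the last one is an outlier that is easy to single out.
incomes = [18400, 42300, 67250, 910000]
rng = random.Random(0)  # seeded so the sketch is repeatable

top_coded = [top_code(x, 150000) for x in incomes]
rounded = [round_to_base(x, 1000) for x in incomes]
noisy = [add_noise(x, 2000.0, rng) for x in incomes]

print(top_coded)
print(rounded)
print(noisy)
```

Top coding hides the outlier, rounding coarsens every value, and noise addition keeps the values roughly in place while making exact matches unreliable.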

Limiting Disclosure Risk…
• Matrix Masking
  • Rather than release the data X, one could release the data Y = AXB + C, where A operates on the records (rows), B operates on the attributes (columns), and C displaces the values.
  • Special cases of matrix masking include noise addition, sampling, suppression of sensitive variables, cell suppression, and addition of simulated data.
  • The analyst must have knowledge of the masking procedure used.
  • The analysis of the data can be complex, and special software may be needed.
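The masking formula Y = AXB + C can be made concrete with a small pure-Python sketch. The matrices below are made up for illustration: A samples two of three records, B drops the second attribute, and C adds a fixed displacement that stands in for noise.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def matadd(P, Q):
    """Add two matrices of the same shape element by element."""
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(P, Q)]

def mask(A, X, B, C):
    """Return the masked data Y = A X B + C."""
    return matadd(matmul(matmul(A, X), B), C)

# X: 3 records (rows) by 2 attributes (columns); values are hypothetical.
X = [[10, 1],
     [20, 2],
     [30, 3]]

A = [[1, 0, 0],   # keep record 1
     [0, 0, 1]]   # keep record 3 (record 2 is not sampled)
B = [[1],         # keep attribute 1
     [0]]         # suppress attribute 2
C = [[0.5],       # displacement added to each released value
     [-0.5]]

Y = mask(A, X, B, C)
print(Y)
```

Choosing A, B, and C differently recovers the other special cases; for example, identity matrices for A and B with a random C give plain noise addition.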

Limiting Disclosure Risk…
• Data Swapping and Data Shuffling
  • Data are swapped in such a way as to maintain the marginal counts of the table.
  • Swapping only needs to be performed on sensitive variables in order to remove the relationship between the record and the respondent.
  • Drawbacks: it may not maintain multivariate relationships, analysis of sub-populations may be affected by the swapping procedure, and the swapping may result in nonsensical combinations.
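A minimal sketch of swapping, assuming the simplest possible scheme: the values of one sensitive column are randomly permuted across records. This is not the specific procedure from the cited literature, and the records are made up for the example. It shows the property stated above (the column's marginal counts are unchanged) and the first drawback (joint counts with other variables are not guaranteed to survive).

```python
import random
from collections import Counter

def swap_column(records, column, rng):
    """Return a copy of records with the values of one column permuted.

    The column keeps exactly the same values overall, but the link between
    each value and its original record is broken.
    """
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]

# Hypothetical binary microdata; Z is treated as the sensitive variable.
raw = [
    {"record": 1, "X": 0, "Y": 1, "Z": 0},
    {"record": 2, "X": 0, "Y": 1, "Z": 1},
    {"record": 3, "X": 1, "Y": 0, "Z": 1},
    {"record": 4, "X": 1, "Y": 0, "Z": 0},
    {"record": 5, "X": 1, "Y": 1, "Z": 1},
    {"record": 6, "X": 0, "Y": 0, "Z": 0},
]

swapped = swap_column(raw, "Z", random.Random(42))

z_before = Counter(r["Z"] for r in raw)
z_after = Counter(r["Z"] for r in swapped)
yz_before = Counter((r["Y"], r["Z"]) for r in raw)
yz_after = Counter((r["Y"], r["Z"]) for r in swapped)

print("Z marginal preserved:", z_before == z_after)
print("Y-Z joint preserved:", yz_before == yz_after)
```

More careful swapping schemes choose which records to exchange so that selected multi-way marginals are also preserved.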

Limiting Disclosure Risk…
Table: Raw Data vs. Swapped Data (columns: Record, X, Y, Z)
1 0 1 1 1 0 2 0 1 0 3 0 0 1 4 1 0 1 5 1 1 1 5 0 1 1 6 1 0 0 7 0 0 0

Limiting Disclosure Risk…
• Synthetic Data
  • The idea is to view sensitive data as missing values and replace them using multiple imputation techniques.
  • Sensitive attributes would be replaced by random draws from an appropriate posterior predictive distribution.
  • Advantage: the ease with which the data can be analyzed.
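The synthetic-data idea can be sketched for a single binary sensitive attribute. The model here is an assumption chosen for brevity: a Beta(1, 1) prior with a Bernoulli likelihood, ignoring all other attributes. A realistic imputation model would condition on the non-sensitive variables. The function name `synthesize_binary` and the observed values are hypothetical.

```python
import random

def synthesize_binary(observed, m, rng, prior_a=1.0, prior_b=1.0):
    """Create m synthetic copies of a binary attribute.

    For each copy, a success probability is drawn from the Beta posterior,
    then every value is drawn from a Bernoulli with that probability. The
    two steps together are a draw from the posterior predictive distribution.
    """
    successes = sum(observed)
    failures = len(observed) - successes
    datasets = []
    for _ in range(m):
        p = rng.betavariate(prior_a + successes, prior_b + failures)
        datasets.append([1 if rng.random() < p else 0 for _ in observed])
    return datasets

# Hypothetical sensitive attribute (1 = condition present).
observed = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

synthetic = synthesize_binary(observed, m=5, rng=random.Random(7))
for copy in synthetic:
    print(copy, "proportion:", sum(copy) / len(copy))
```

Releasing several synthetic copies rather than one lets the analyst account for the extra uncertainty introduced by the imputation.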

Limiting Disclosure Risk…
• Other Methods
  • Slicing, Micro-aggregates, and Recombination.
  • Location Data.
  • Scrub System, Datafly, Argus, and SUDA2.
  • Micro-agglomeration, Substitution, Subsampling, and Calibration (MASSC).

Disclosure Risk vs. Data Utility
• Disclosure risk can be lowered by applying a disclosure limitation (DL) procedure to mask the data.
• This masking will typically also lower the data utility.
• It is crucial that the tradeoff between risk and utility be assessed.
• The R-U confidentiality map is offered as an analytical framework for this assessment.

Disclosure Risk vs. Data Utility…
[Figure: R-U confidentiality map. Axes: Disclosure Risk and Data Utility. Labeled points: raw data, released data, no data. A Maximum Tolerable Risk Threshold is marked.]
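An R-U map can be traced numerically by varying the strength of a DL method and recording a (risk, utility) pair for each setting. The sketch below does this for noise addition. Both measures are assumptions invented for the example: risk is the share of records whose masked value stays within a tolerance of the true value, and utility decreases with the mean absolute perturbation. The threshold of 0.5 and the simulated incomes are also hypothetical.

```python
import random

def risk_and_utility(values, sd, rng, tolerance=500.0, scale=1000.0):
    """Mask values with Gaussian noise and return an assumed (risk, utility)."""
    masked = [v + rng.gauss(0.0, sd) for v in values]
    errors = [abs(m - v) for m, v in zip(masked, values)]
    # Assumed risk: fraction of masked values still close to the truth.
    risk = sum(e <= tolerance for e in errors) / len(errors)
    # Assumed utility: 1 for unmasked data, falling as perturbation grows.
    utility = 1.0 / (1.0 + (sum(errors) / len(errors)) / scale)
    return risk, utility

rng = random.Random(1)
incomes = [rng.uniform(20000, 90000) for _ in range(1000)]
MAX_RISK = 0.5  # assumed maximum tolerable risk threshold

ru_map = []
for sd in [0.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0]:
    risk, utility = risk_and_utility(incomes, sd, rng)
    ru_map.append((sd, risk, utility))
    print(f"sd={sd:7.1f}  risk={risk:.3f}  utility={utility:.3f}")

# Release the setting with the highest utility among those under the threshold.
admissible = [point for point in ru_map if point[1] <= MAX_RISK]
best = max(admissible, key=lambda point: point[2]) if admissible else None
print("chosen:", best)
```

With sd = 0 the sketch reproduces the raw-data corner of the map (risk and utility both at their maximum); larger noise moves the release toward the no-data corner.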

Disclosure Risk vs. Data Utility…
• In all cases, the questions are:
  • Are the disclosure limitation methods used adequate, but not excessive?
  • Could less severe distortion or obscuring of the data still keep the risk from data snoopers low while allowing better data utility?
  • What explicitly is the tradeoff between disclosure risk and data utility?
  • Would a different DL method lower disclosure risk while maintaining data utility?

THANK YOU…