CMUPitt Applied Decision Modeling Seminar Disclosure Limitation in

  • Slides: 73
Download presentation
CMU/Pitt Applied Decision Modeling Seminar Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig

CMU/Pitt Applied Decision Modeling Seminar Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig

Collaborators n n n n George Duncan, Heinz/Statistics, CMU Stephen Fienberg, Statistics, CMU Adrian

Collaborators n n n n George Duncan, Heinz/Statistics, CMU Stephen Fienberg, Statistics, CMU Adrian Dobra, Statistics, Duke Larry Cox, Center for Health Statistics, H&HS Jesus De Loera, Math, UC Davis Bernd Sturmfels, Math, UC Berkeley Jo. Anne O’Roarke, Institute for Social Research, U. Michigan

Funding n n n n NSF NCES NIA NCHS NISS Census BLS

Funding n n n n NSF NCES NIA NCHS NISS Census BLS

How Should Government Distribute What It Knows? n n n Individuals and businesses contribute

How Should Government Distribute What It Knows? n n n Individuals and businesses contribute data about themselves, as mandated by government Government summarizes and returns these data to policy makers, policy researchers, individuals and businesses Everyone is supposed to see the value, but with no privacy downside n Obligations: return as much information as possible n Restrictions: don’t disclose sensitive information

Who Collects Data? n Federal, state and local government: n n Federally funded surveys:

Who Collects Data? n Federal, state and local government: n n Federally funded surveys: n n n Census Bureau of Labor Statistics National Center for Health Statistics, etc. Health and Retirement Study (NIA) Treatment Episode Data Set (NIH) Industry n n n Health care Insurance “Dataminers”

Real-World Example I n n The U. S. Census Bureau wants to allow online

Real-World Example I n n The U. S. Census Bureau wants to allow online queries against its census data. It builds a large system, with many safeguards, then calls in “distinguished university statisticians” to test it. Acting as “data attackers”, they attempt to discover sensitive information. They do, and so the Census Bureau’s plans must be scaled back. (Source: Duncan, Roehrig and Kannan)

Real-World Example II n n n To better understand health care usage patterns, a

Real-World Example II n n n To better understand health care usage patterns, a researcher wants to measure the distance from patients’ residences to the hospitals where their care was given. But because of confidentiality concerns, it’s only possible to get the patients’ ZIP codes, not addresses. The researcher uses ZIP code centroids instead of addresses, causing large errors in the analysis. (Source: Marty Gaynor)

Real-World Example III n n n To understand price structures in the health care

Real-World Example III n n n To understand price structures in the health care industry, a researcher wants to compare negotiated prices to list prices, for health services. One data set has hospital IDs and list prices for various services. Another data set gives negotiated prices for actual services provided, but for confidentiality reasons, no hospital IDs can be provided. Thus matching is impossible. (Source: Marty Gaynor)

Data Utility vs. Disclosure Risk n n Data utility: a measure of the usefulness

Data Utility vs. Disclosure Risk n n Data utility: a measure of the usefulness of a data set to its intended users Disclosure risk: the degree to which a data set reveals sensitive information The tradeoff is a hard one, currently judged mostly heuristically Disclosure limitation, in theory and practice, examines this tradeoff.

The R-U Confidentiality Map Two possible releases of the same data Disclosure Risk Original

The R-U Confidentiality Map Two possible releases of the same data Disclosure Risk Original Data Disclosure Threshold No Data Utility

The R-U Confidentiality Map For many disclosure limitation methods, we can choose one or

The R-U Confidentiality Map For many disclosure limitation methods, we can choose one or more parameters. Original Data Disclosure Risk DL Method 1 Disclosure Threshold No Data DL Method 2 Data Utility

Traditional Methods: Microdata n n Microdata: a set of records containing information on individual

Traditional Methods: Microdata n n Microdata: a set of records containing information on individual respondents Suppose you are supposed to release microdata, for the public good, about individuals. The data include: n n n Name Address City of residence Occupation Criminal record

Microdata n n Typical safeguard: delete “identifiers” So release a dataset containing only city

Microdata n n Typical safeguard: delete “identifiers” So release a dataset containing only city of residence, occupation and criminal record…

Microdata n n Typical safeguard: delete “identifiers” So release a dataset containing only city

Microdata n n Typical safeguard: delete “identifiers” So release a dataset containing only city of residence, occupation and criminal record… n n n Residence = Amsterdam Occupation = Mayor Criminal history = has criminal record

Microdata n n Typical safeguard: delete “identifiers” So release a dataset containing only city

Microdata n n Typical safeguard: delete “identifiers” So release a dataset containing only city of residence, occupation and criminal record… n n n Residence = Amsterdam Occupation = Mayor Criminal history = has criminal record HIPPA says this is OK, so long as a researcher “promises not to attempt re-identification” Is this far-fetched?

Microdata n n Unique (or almost unique) identifiers are common The 1997 voting list

Microdata n n Unique (or almost unique) identifiers are common The 1997 voting list for Cambridge, MA has 54, 805 voters: Uniqueness of demographic fields birth date alone birth date and gender birth date and 5 -digit ZIP birth date and full postal code (Source: Latanya Sweeney) 12% 29% 69% 97%

Traditional Solutions for Microdata n n n Sampling Adding noise Global recoding (coarsening) Local

Traditional Solutions for Microdata n n n Sampling Adding noise Global recoding (coarsening) Local suppression Data swapping Micro-aggregation

Example: Data Swapping Unique on location, age and sex Find a match in another

Example: Data Swapping Unique on location, age and sex Find a match in another location Flip a coin to see if swapping is done Location Age Sex Candidate Pittsburgh Young M Bush Pittsburgh Young M Gore Pittsburgh Young F Gore Pittsburgh Old M Bush Cleveland Young F Gore Cleveland Old M Gore Cleveland Old F Bush Cleveland Old F Gore

Data Swapping Terms n n Uniques Key: Variables that are used to identify records

Data Swapping Terms n n Uniques Key: Variables that are used to identify records that pose a confidentiality risk. Swapping Key: Variables that are used to identify which records will be swapped. Swapping Attribute: Variables over which swapping will occur. Protected Variables: Other variables, which may or may not be sensitive.

Swapping Attribute Uniques Key & Swapping Key Protected Variable Location Age Sex Candidate Pittsburgh

Swapping Attribute Uniques Key & Swapping Key Protected Variable Location Age Sex Candidate Pittsburgh Young M Bush Pittsburgh Young M Gore Pittsburgh Young F Gore Pittsburgh Old M Bush Cleveland Young F Gore Cleveland Old M Gore Cleveland Old F Bush Cleveland Old F Gore

Data Swapping In Use n The Treatment Episode Data Set (TEDS) n n National

Data Swapping In Use n The Treatment Episode Data Set (TEDS) n n National admissions to drug treatment facilities Administered by SAMHSA (part of HHS) Data released through ICPSR 1, 477, 884 records in 1997, of an estimated 2, 207, 375 total admissions nationwide

TEDS (cont. ) n Each record contains n n n n n Age Sex

TEDS (cont. ) n Each record contains n n n n n Age Sex Race Ethnicity Education level Marital status Source of income State & PMSA Primary substance abused and much more…

First Step: Recode n Example recode: Education level Continuous 0 -25 becomes 1: 8

First Step: Recode n Example recode: Education level Continuous 0 -25 becomes 1: 8 years or less 2: 9 -11 3: 12 4: 13 -15 5: 16 or more

TEDS Uniques Key n This was determined empirically, after recoding. We ended up choosing:

TEDS Uniques Key n This was determined empirically, after recoding. We ended up choosing: n n n n State and PMSA Pregnancy Veteran status Methadone planned as part of treatment Race Ethnicity Sex Age

Other Choices n n n Swapping key: the uniques key, plus primary substance of

Other Choices n n n Swapping key: the uniques key, plus primary substance of abuse Swapping attribute: state, PMSA, Census Region and Census Division Protected variables: all other TEDS variables

TEDS Results n n After recoding, only 0. 3% of the records needed to

TEDS Results n n After recoding, only 0. 3% of the records needed to be swapped Swapping was done between nearby locations, to preserve statistics over natural geographic aggregations

Tables: Magnitude Data Language Region A Region B Region C Total C++ 11 47

Tables: Magnitude Data Language Region A Region B Region C Total C++ 11 47 58 116 Java 1 15 33 49 Smalltalk 2 31 20 53 Total 14 93 111 218 Profit of Software Firms $10 million (Source: Java Random. next. Int(75))

Tables: Frequency Data Income Level, Gender = Male Race White Black Chinese Total $10,

Tables: Frequency Data Income Level, Gender = Male Race White Black Chinese Total $10, 000 96 10 1 107 >$10, 000 and $25, 000 72 7 1 80 > $25, 000 161 6 2 169 Total 329 23 4 356 Income Level, Gender = Female Race White Black Chinese Total $10, 000 186 11 0 197 (Source: 1990 Census) >$10, 000 and $25, 000 127 7 1 135 > $25, 000 51 3 0 51 Total 364 21 1 386

Traditional Solutions for Tables n Suppress some cells n n n Perturb some cells

Traditional Solutions for Tables n Suppress some cells n n n Perturb some cells n n n Publish only the marginal totals Suppress the sensitive cells, plus others as necessary Controlled rounding Lots of research here, and good results for 2 way tables For 3 -way and higher, this is surprisingly hard!

Disclosure Risk, Data Utility n Risk n n n the degree to which confidentiality

Disclosure Risk, Data Utility n Risk n n n the degree to which confidentiality might be compromised perhaps examine cell feasibility intervals, or better, distributions of possible cell values Utility n n a measure of the value to a legitimate user higher, the more accurately a user can estimate magnitude of errors in analysis based on the released table

Example: Delinquent Children Education Level of Head of Household County Low Medium Alpha 15

Example: Delinquent Children Education Level of Head of Household County Low Medium Alpha 15 1 3 1 20 Beta 20 10 10 15 55 3 10 10 2 25 Delta 12 14 7 2 35 Total 50 35 30 20 135 Gamma High Very High Total Number of Delinquent Children by County and Education Level (Source: OMB Statistical Policy Working Paper 22)

Controlled Rounding (Base 3) County Low Medium Alpha 15 0 3 0 18 Beta

Controlled Rounding (Base 3) County Low Medium Alpha 15 0 3 0 18 Beta 21 9 12 15 57 3 9 9 3 24 Delta 12 15 6 3 36 Total 51 33 30 21 135 Gamma n n n High Very High Uniform (and known) feasibility interval. Easy for 2 -D tables, sometimes impossible for 3 -D 1, 025, 908, 683 possible original tables. Total

Suppress Sensitive Cells & Others County Low Medium Alpha 15 p s p 20

Suppress Sensitive Cells & Others County Low Medium Alpha 15 p s p 20 Beta 20 10 10 15 55 3 10 s p 25 Delta 12 s 7 p 35 Total 50 35 30 20 135 Gamma n n n High Very High Total Hard to do optimally (NP complete). Feasibility intervals easily found with LP. Users have no way of finding cell value probabilities.

Release Only the Margins County Low Medium High Very High Total Alpha 20 Beta

Release Only the Margins County Low Medium High Very High Total Alpha 20 Beta 55 Gamma 25 Delta 35 Total 50 35 30 20 135

Release Only the Margins n n n 18, 272, 363, 056 tables have our

Release Only the Margins n n n 18, 272, 363, 056 tables have our margins (thanks to De Loera & Sturmfels). Low risk, low utility. Easy! Very commonly done. Statistical users might estimate internal cells with e. g. , iterative proportional fitting. Or with more powerful methods…

Some New Methods for Disclosure Detection n n Use methods from commutative algebra to

Some New Methods for Disclosure Detection n n Use methods from commutative algebra to explore the set of tables having known margins Originated with the Diaconis-Sturmfels paper (Annals of Statistics, 1998) Extensions and complements by Dobra and Fienberg Related to existing ideas in combinatorics

Background: Algebraic Ideals n Let A be a ring (e. g. , ℝ or

Background: Algebraic Ideals n Let A be a ring (e. g. , ℝ or ℤ), and I A. Then I is called an ideal if n n 0 I If f, g I, then f + g I If f I and h A, then f h I f g. I A Ex. 1: The ring ℤ of integers, and the set I = {…-4, -2, 0, 2, 4, …}. h

Generators n n Let f 1, …, fs A. Then is the ideal generated

Generators n n Let f 1, …, fs A. Then is the ideal generated by f 1, …, fs Ex. 1: The ring ℤ of integers, and the set I = {…-4, -2, 0, 2, 4, …}. SAT question: What is a generator of I ? I = 2 , since I = {2} {…-1, 0, 1, …}

Ideal Example 2 n n n The ring k [x] of polynomials of one

Ideal Example 2 n n n The ring k [x] of polynomials of one variable, and the ideal I = x 4 -1, x 6 -1. GRE question: What is a minimal generator of I (i. e. , no subset also a generator)? I = x 2 -1 since x 2 -1 is the greatest common divisor of {x 4 -1, x 6 -1}.

Why Are Minimal Generators Useful? n n n Compact representation of the ideal--the initial

Why Are Minimal Generators Useful? n n n Compact representation of the ideal--the initial description may be “verbose”. Allow easy generation of elements, often in a disciplined order. Guaranteed to explore the full ideal.

Disclosure Analysis: Marginals n Suppose a “data snooper” knows only this: County Low Medium

Disclosure Analysis: Marginals n Suppose a “data snooper” knows only this: County Low Medium High Very High Total Alpha 20 Beta 55 Gamma 25 Delta 35 Total 50 35 30 20 135

Disclosure Analysis n Of course the data collector knows the true counts: County Low

Disclosure Analysis n Of course the data collector knows the true counts: County Low Medium Alpha 15 1 3 1 20 Beta 20 10 10 15 55 3 10 10 2 25 Delta 12 14 7 2 35 Total 50 35 30 20 135 Gamma High Very High Total

Disclosure Analysis n What are the feasible tables, given the margins? Here is one:

Disclosure Analysis n What are the feasible tables, given the margins? Here is one: County Low Medium Alpha 15+1 1 -1 3 1 20 Beta 20 -1 10+1 10 15 55 3 10 10 2 25 Delta 12 14 7 2 35 Total 50 35 30 20 135 Gamma High Very High Total

Disclosure Analysis Problems n Both the data collector and the snooper are interested in

Disclosure Analysis Problems n Both the data collector and the snooper are interested in the largest and smallest feasible values for each cell. n n Narrow bounds might constitute disclosure. Both might also be interested in the distribution of possible cell values. n A tight distribution might constitute disclosure.

The Bounds Problem County Low Medium High Very High Total Alpha 20 Beta 55

The Bounds Problem County Low Medium High Very High Total Alpha 20 Beta 55 Gamma 25 Delta 35 Total 50 This is usually framed as a continuous problem. Is this the right question? 35 30 0 1 2 3 4 5 20 135

The Distribution Problem County Low Medium High Very High Total Alpha 20 Beta 55

The Distribution Problem County Low Medium High Very High Total Alpha 20 Beta 55 Gamma 25 Delta 35 Total 50 Given the margins, and a set of priors over cell values, what distribution results? 35 30 0 1 2 3 4 5 20 135

Transform to an Algebraic Problem n Define some indeterminates: x 11 x 12 x

Transform to an Algebraic Problem n Define some indeterminates: x 11 x 12 x 13 x 14 x 21 x 22 x 23 x 24 n Then write the move 1 -1 -1 1 x 32 x 33 x 34 x 41 x 42 x 43 x 44 as the polynomial x 11 x 22 - x 12 x 21

Ideal Example 3 n n Ideal I is the set of monomial differences that

Ideal Example 3 n n Ideal I is the set of monomial differences that take a non-negative table of dimension J K with fixed margins M to another nonnegative table with the same margins. Putnam Competition question: What is a minimal generator of I ?

Solutions: Bounds, Distributions n Upper and lower bounds on cells: n n Integer linear

Solutions: Bounds, Distributions n Upper and lower bounds on cells: n n Integer linear programming (max/min, subject to constraints implied by the marginals). Find a generator of the ideal: n n n Use Buchberger’s algorithm to find a Gröbner basis. Apply long division repeatedly; the remainder yields an optimum (Conti & Traverso). Distribution of cell values: n Use the Gröbner basis to enumerate or sample (Diaconis & Sturmfels, Dobra & Fienberg).

Trouble in Paradise n n For larger disclosure problems (dimension 3), the linear programming

Trouble in Paradise n n For larger disclosure problems (dimension 3), the linear programming bounds may be fractional. For larger disclosure problems (dimension 3), the Gröbner basis is very hard to compute.

Why Fractional Bounds? n Sometimes the marginal sum values cause the constraints to intersect

Why Fractional Bounds? n Sometimes the marginal sum values cause the constraints to intersect at non -integer points. x 2 LP maximum for x 2 Integer maximum x 1

But Could This Happen? n Standard example from Man. Sci 101: x 2 LP

But Could This Happen? n Standard example from Man. Sci 101: x 2 LP maximum for x 2 Integer maximum x 1

Computing Gröbner Bases n Both the Conti & Traverso and Sturmfels constructions include extra

Computing Gröbner Bases n Both the Conti & Traverso and Sturmfels constructions include extra variables. n n The ring k [x 1, …, x. IJK] is embedded in k [x 1, …, x. IJK, y 1, …, y. N], where N is the number of constraints. Buchberger’s algorithm applied. Then throw away basis elements not in k[x 1, …, x. IJK] Computing a basis for the 3 3 3 problem: about 7 hours with Macaulay.

An Improved Algorithm n Designed to compute implicitly in k [x 1, …, x.

An Improved Algorithm n Designed to compute implicitly in k [x 1, …, x. IJK, y 1, …, y. N] but only process elements in k [x 1, …, x. IJK].

An Improved Algorithm n n Designed to compute implicitly in k [x 1, …,

An Improved Algorithm n n Designed to compute implicitly in k [x 1, …, x. IJK, y 1, …, y. N] but only process elements in k [x 1, …, x. IJK]. 3 3 3 problem runs in 25 m. S.

An Improved Algorithm n n n Designed to compute implicitly in k [x 1,

An Improved Algorithm n n n Designed to compute implicitly in k [x 1, …, x. IJK, y 1, …, y. N] but only process elements in k [x 1, …, x. IJK]. 3 3 3 problem runs in 25 m. S. 4 3 3 problem runs in 20 min.

An Improved Algorithm n n Designed to compute implicitly in k [x 1, …,

An Improved Algorithm n n Designed to compute implicitly in k [x 1, …, x. IJK, y 1, …, y. N] but only process elements in k [x 1, …, x. IJK]. 3 3 3 problem runs in 25 m. S. 4 3 3 problem runs in 20 min. 5 3 3 problem runs in 3 months…

Ideals “If they’re so freaking ideal, how come you can’t compute them? ” n

Ideals “If they’re so freaking ideal, how come you can’t compute them? ” n Overheard at Grostat 5, the Conference on Commutative Algebra and Statistics.

Other Approaches n n Connections with decomposable and graphical log-linear models (Dobra & Fienberg,

Other Approaches n n Connections with decomposable and graphical log-linear models (Dobra & Fienberg, Dobra). Example: 3 -D sensitive table, two 2 -D margins given. Use conditional independence: Known margins A B C A

Decomposable Models n n n Gröbner bases are “simple, ” and can be written

Decomposable Models n n n Gröbner bases are “simple, ” and can be written down explicitly. For some other problems (e. g. , graphical models), the Gröbner basis can be “assembled” from bases for smaller pieces. Nice!

A New Method for Disclosure Limitation: Cyclic Perturbation n Choose cycles that leave the

A New Method for Disclosure Limitation: Cyclic Perturbation n Choose cycles that leave the margins fixed. One example: Original 15 1 3 1 20 10 10 15 3 10 10 2 12 14 7 2 n Cycle + 1 -1 0 0 0 0 0 Perturbed table = 16 1 2 1 19 10 11 15 3 10 10 2 12 14 7 2 The set of cycles determines the published table’s feasibility interval

Cyclic Perturbation: Details n Choose a set of cycles that covers all table cells

Cyclic Perturbation: Details n Choose a set of cycles that covers all table cells “equally”. Example: + - 0 0 0 + - 0 0 + + - 0 + 0 0 + - + 0 0 0 0 + - Each cell has exactly two “chances” to move.

Cyclic Perturbation: Details n Flip a three-sided coin with outcomes n n n n

Cyclic Perturbation: Details n Flip a three-sided coin with outcomes n n n n A (probability = ) B (probability = ) C (probability = ) If A, add the first cycle (unless there is a zero in the cycle) If B, subtract the first cycle (unless there is a zero in the cycle) If C, do nothing Repeat with the remaining cycles

Cyclic Perturbation: Details n n Original Table For the chosen set of cycles, there

Cyclic Perturbation: Details n n Original Table For the chosen set of cycles, there are 34=81 possible perturbed tables. The feasibility interval is original value 2. Perturbed TO TP cycle 1 cycle 2 cycle 3 cycle 4 applied Table

Cyclic Perturbation: Details n Choose , . Perturb with each cycle. Publish the resulting

Cyclic Perturbation: Details n Choose , . Perturb with each cycle. Publish the resulting table. n Publish the cycles and , . n n Original 15 1 3 1 20 10 10 15 3 10 10 2 12 14 7 2 Perturbed table 16 0 2 2 21 11 9 14 2 11 11 13 8 3

Analysis of Cell Probabilities Tk TO TP TO Original Table Perturbed Table T 4

Analysis of Cell Probabilities Tk TO TP TO Original Table Perturbed Table T 4 Possible Tables

Distributions of Cell Values n n n Since the mechanism is public, a user

Distributions of Cell Values n n n Since the mechanism is public, a user can calculate the distribution of true cell values: Compute every table Tk that could have been the original, along with the probability Pr(TP | Tk). Specify a prior distribution over all the possible original tables Tk. Apply Bayes’ theorem to get the posterior probability Pr(Tk | TP) for each Tk. The distribution for each cell is

Results for the Example Original Perturbed table 15 1 3 1 20 10 10

Results for the Example Original Perturbed table 15 1 3 1 20 10 10 15 3 10 10 2 12 14 7 2 q= Pr( t(1, 2) = q | TP ) Pr( t(1, 4) = q | TP ) Pr( t(3, 4) = q | TP ) Pr( t(4, 4) = q | TP ) 16 0 2 2 21 11 9 14 2 11 11 13 8 3 0 1 2 3 4 5 0. 71 0. 06 0. 00 0. 25 0. 71 0. 05 0. 04 0. 38 0. 25 0. 29 0. 00 0. 25 0. 04 0. 44 0. 00 0. 06 0. 00 0. 21 0. 00 0. 01

Properties n n n It’s not difficult to quantify data utility and disclosure risk

Properties n n n It’s not difficult to quantify data utility and disclosure risk (cf. cell suppression and controlled rounding). Priors of data users and data intruders can be different. Theorem: For a uniform prior, the mode of each posterior cell distribution is it’s published value.

Scaling n n Sets of cycles w/ desirable properties are easy to find for

Scaling n n Sets of cycles w/ desirable properties are easy to find for larger 2 -D tables. Extensions to 3 and higher dimensions also straightforward. Computing the perturbation for any size table is easy & fast. The complete Bayesian analysis is feasible to at least 20 20 (with no special TLC)

What Might Priors Be? n n They could reflect historical data. If I’m in

What Might Priors Be? n n They could reflect historical data. If I’m in the survey, I know my cell is at least 1. Public information. Insider information.

Rounding Redux n n n A similar Bayesian analysis can be done, provided the

Rounding Redux n n n A similar Bayesian analysis can be done, provided the exact algorithm is available. It’s generally much harder to do. Using a deterministic version of Cox’s `87 rounding procedure, we must consider “only” 17, 132, 236 tables. For uniform priors, the posterior cell distributions were nearly uniform. Three days of computing time for a 4 4 table…

Future Work (i. e. What’s Hot? ) n n n Much more to be

Future Work (i. e. What’s Hot? ) n n n Much more to be learned about the connections between log-linear models, algebraic structures and networks. The Grostat conference, and the “Tables” conference at Davis have resulted in wonderful collaborations between statisticians, mathematicians and operations researchers. Stay tuned!