GDPR, Data Privacy, Anonymization, Minimization... Oh My!

  • Slides: 49
GDPR, Data Privacy, Anonymization, Minimization... Oh My! Steve Touw, Immuta

About Me/Immuta
CTO of Immuta. Immuta is a self-service platform where data owners, data scientists, and compliance officers eliminate friction and accelerate innovation. Our software enables enterprises to unlock data, control risk, and innovate faster with confidence.

Agenda
  • GDPR & data processing: why do YOU care?
  • Get out of GDPR jail free?
  • The Anonymization zoo
  • The “Data Control Plane”
  • Conclusion

General Data Protection Regulation (GDPR)


GDPR In A Nutshell
“The General Data Protection Regulation is the EU’s primary data governance regulation and realistically applies to any business using data from EU data subjects. It is the most forward-leading privacy regime on the planet, with fines of up to four percent of global revenue. With such staggering fines, breaching the GDPR is a risk that many enterprises quite literally may not be able to afford.” -Andrew Burt, Immuta

It’s All About Personal Data
Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.

Let’s Talk Privacy


I know stuff about Judd and Leslie! Photo credit: Gawker



New York Taxi & Limousine Commission Data was released containing taxi pickups, dropoffs, location, time, amount, and tip amount, among others. Seems pretty harmless?

Well, Judd and Leslie May Not Think It’s Harmless
This photo was geotagged (with time), so by simply querying by medallion and time, we know how much Judd and Leslie tip!

This Is An Example Of a Link Attack
[Diagram: NY Taxi Data (Medallion & Pickup Time) linked to the photo (Medallion & Photo Time)]
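
The link attack above can be sketched in a few lines. This is a minimal illustration, not the actual query used against the NYC data: the trip records, the medallion values, the five-minute window, and the `link` helper are all made up for the example.

```python
from datetime import datetime, timedelta

# Hypothetical released trip records (invented values, not real taxi data).
trips = [
    {"medallion": "7A12", "pickup": datetime(2013, 7, 8, 19, 2), "fare": 10.5, "tip": 0.0},
    {"medallion": "7A12", "pickup": datetime(2013, 7, 8, 21, 40), "fare": 7.0, "tip": 2.0},
    {"medallion": "3F09", "pickup": datetime(2013, 7, 8, 19, 3), "fare": 22.0, "tip": 5.0},
]

# Hypothetical outside knowledge read off a geotagged, timestamped photo.
photo = {"medallion": "7A12", "taken": datetime(2013, 7, 8, 19, 0)}


def link(trips, photo, window=timedelta(minutes=5)):
    """Return trips with the photo's medallion and a pickup close to the photo time."""
    return [
        t for t in trips
        if t["medallion"] == photo["medallion"]
        and abs(t["pickup"] - photo["taken"]) <= window
    ]


matches = link(trips, photo)
for trip in matches:
    print(trip["fare"], trip["tip"])
```

Neither dataset names a passenger on its own; the join on two shared columns is what turns a harmless-looking row into personal data.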

I Swear This Is Relevant... Back To GDPR

Yes... This Means
The New York Taxi Commission has personal data by GDPR definition (we identified individuals indirectly). GDPR would apply to the New York Taxi Commission (but probably only if the data was generated in an EU city)! Are you having an “oh no” moment?

GDPR Purpose Restrictions
No room for interpretation:
  • Consent: personal data may be processed on the basis that the data subject has consented to such processing
  • Contractual necessity: processing is necessary in order to enter into or perform a contract with the data subject
  • Compliance with legal obligations
  • Vital interests: this essentially applies in "life-or-death" scenarios
  • Public interest: necessary for the performance of tasks carried out by a public authority or private organisation acting in the public interest
Room for interpretation by an auditor (riskier):
  • Legitimate interests: must be specified at time of collection and reasonable (accountability on the data controller)

Processing Principles
  • Fair, lawful and transparent processing: ability to tell the data subject what their data is being used for
  • The purpose limitation principle: what we just discussed
  • Data minimisation: only process the personal data actually needed to achieve the stated goals
  • Accuracy: responsibility for taking all reasonable steps to ensure that personal data are accurate
  • Data retention periods: data should not be retained for longer than necessary in relation to the purposes for which they were collected
  • Data security: data are kept secure, both against internal and external threats
  • Accountability: enforcement of the Data Protection Principles

Those Principles and Purposes are Scary... Maybe…
“Once a dataset is truly anonymised and individuals are no longer identifiable, European data protection law no longer applies.” -Article 29 Working Party

Let’s Talk Anonymization


Pseudonymization
“the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.” -GDPR Article 4(5)

Pseudonymization In the Wild
Back to our New York Taxi Data... They actually did go to the trouble of pseudonymizing the data by hashing the medallion ID. But that didn’t matter...

More Link Attacks
[Diagram: NY Taxi Data linked to outside sources. Labels: Medallion & Pickup Time; Medallion & Photo Time; Pickup Time & Pickup Loc & Dropoff Loc & Dropoff Time & Amount; Dropoff Time & Receipt]
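
To see why hashing the medallion does not help, here is a small sketch under assumed data. The records, the location names, and the choice of SHA-256 are all assumptions for illustration; the talk does not say which hash was used here.

```python
import hashlib
from datetime import datetime


def pseudonymize(medallion: str) -> str:
    """Replace a medallion with a hash token (SHA-256 is an assumption)."""
    return hashlib.sha256(medallion.encode()).hexdigest()[:12]


# Hypothetical released rows: the medallion is hashed, everything else is intact.
released = [
    {"cab": pseudonymize("7A12"), "pickup": datetime(2013, 7, 8, 19, 2), "pickup_loc": "W 4th St", "tip": 0.0},
    {"cab": pseudonymize("3F09"), "pickup": datetime(2013, 7, 8, 19, 2), "pickup_loc": "Broadway", "tip": 5.0},
    {"cab": pseudonymize("7A12"), "pickup": datetime(2013, 7, 8, 23, 15), "pickup_loc": "Broadway", "tip": 2.0},
]

# The attacker never needs a medallion, only when and where a ride started.
known = {"pickup": datetime(2013, 7, 8, 19, 2), "pickup_loc": "W 4th St"}

hits = [
    row for row in released
    if row["pickup"] == known["pickup"] and row["pickup_loc"] == known["pickup_loc"]
]
print(len(hits), hits[0]["tip"])
```

The hashed column is simply ignored; the remaining columns still single out one row.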

Cardinality is the Achilles Heel of Anonymization
  • What did all those columns we linked have in common? They have many unique values (high cardinality).
  • The more unique values, the more opportunity to pinpoint and link an external source.
  • These columns contain what are termed quasi-identifiers.
  • Quasi-identifiers aren’t necessarily personal data!
  • You’re hashing for anonymity, not privacy, thus removing utility! (I always wear a helmet and nothing else.)
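
One rough way to measure the cardinality problem is to count how many rows become unique as columns are combined. The sketch below assumes a toy table; the column names and the `unique_fraction` helper are hypothetical.

```python
from collections import Counter


def unique_fraction(rows, columns):
    """Fraction of rows whose values in `columns` appear exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Hypothetical trip rows (invented values).
rows = [
    {"hour": 19, "zone": "SoHo", "amount": 10.5},
    {"hour": 19, "zone": "SoHo", "amount": 7.0},
    {"hour": 19, "zone": "Midtown", "amount": 7.0},
    {"hour": 21, "zone": "Midtown", "amount": 7.0},
]

print(unique_fraction(rows, ["hour"]))                    # 0.25
print(unique_fraction(rows, ["hour", "zone"]))            # 0.5
print(unique_fraction(rows, ["hour", "zone", "amount"]))  # 1.0
```

Each added column raises the share of rows that can be pinpointed, even though no single column is an identifier by itself.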

The Privacy vs Utility Tradeoff
This is what our data looks like now to prevent link attacks: Remove all quasi-identifiers, remove all utility!

Pseudonymization Good, But Not Party Time
“pseudonymisation is not a method of anonymisation. It merely reduces the linkability of a dataset with the original identity of a data subject, and is accordingly a useful security measure.” -Article 29 Working Party
In plain English: GDPR requires that you pseudonymize when you can because that minimizes risk (GDPR’s “privacy by design”). So it does buy you something, but GDPR still applies.

The Anonymization Zoo
Let’s go through some other anonymization techniques. Will we get to party time?
  • K-Anonymization
  • Differential Privacy

K-Anonymization
  • Think of k-anonymization as a better way to hash than what was done for the taxi data in the prior slides, one that provides more utility.
  • This is done by generalizing quasi-identifiers: making them more “coarse” so they become homogeneous with their neighbors.
  • Each record is then indistinguishable from at least k-1 other records, forming an equivalence class.
[Figure: generalization examples. Zip Code: 20852, 20878, 20868 → 208*. Coordinates: 30.56755 → 30.6. Age: 24, 25, 26, 29.]
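
A minimal sketch of generalization and the resulting k, assuming a toy dataset. The zip codes come from the slide figure; the pairing of ages with zips, the 10-year age band, and the helper names are assumptions made for the example.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: 3-digit zip prefix and a 10-year age band."""
    low = record["age"] // 10 * 10
    return (record["zip"][:3] + "*", f"{low}-{low + 9}")


def k_of(records):
    """Size of the smallest equivalence class after generalization."""
    return min(Counter(generalize(r) for r in records).values())


# Hypothetical people; each raw (zip, age) pair is unique before generalization.
people = [
    {"zip": "20852", "age": 24},
    {"zip": "20878", "age": 26},
    {"zip": "20868", "age": 29},
]

print(generalize(people[0]))  # ('208*', '20-29')
print(k_of(people))           # 3
```

After coarsening, all three records share one equivalence class, so the toy table is 3-anonymous on these two columns. Real k-anonymization also has to choose how much to coarsen each column, which is where the computational cost comes from.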

Example: Generalizing By Zip Code (Homogeneity Attack)
[Table not recovered. The surviving callouts ask, for a Black male and a Black female born in 1965, “do we know their problem?”, with the answers “No” and “YES”.]

K-Anonymized Taxi Data
  • K-anonymized pickup & dropoff location and time
  • Certainly more utility
  • But same problems...
  • Link attack on very unique pickup/dropoff
  • Homogeneity attack: everyone tipped the same
  • L-Diversity and T-Closeness have their own problems
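
The homogeneity attack can be checked mechanically: find equivalence classes in which everyone shares the same sensitive value. This is a sketch over invented rows; `homogeneous_classes` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict


def homogeneous_classes(rows, quasi, sensitive):
    """Return equivalence classes whose members all share one sensitive value."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi)].add(row[sensitive])
    return [key for key, values in groups.items() if len(values) == 1]


# Hypothetical k-anonymized rows (k = 3 for both classes).
rows = [
    {"zone": "SoHo", "hour": "19:00-20:00", "tip": 0.0},
    {"zone": "SoHo", "hour": "19:00-20:00", "tip": 0.0},
    {"zone": "SoHo", "hour": "19:00-20:00", "tip": 0.0},
    {"zone": "Midtown", "hour": "19:00-20:00", "tip": 2.0},
    {"zone": "Midtown", "hour": "19:00-20:00", "tip": 5.0},
    {"zone": "Midtown", "hour": "19:00-20:00", "tip": 2.0},
]

leaking = homogeneous_classes(rows, ["zone", "hour"], "tip")
print(leaking)  # [('SoHo', '19:00-20:00')]
```

Anyone known to be in the SoHo class tipped nothing, even though no individual row can be singled out. Requiring several distinct sensitive values per class is the idea behind L-Diversity.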

K-Anonymization In the Wild I’m not the only one that gets the joke now!


K-Anonymization, Better Utility, No Party
  • K-Anonymization provides no guarantees of privacy
  • K-Anonymization is computationally intensive to build - searching for K-perfection, L-Diversity, T-Closeness may be a waste of time
  • There’s still a privacy vs utility tradeoff to contend with
  • One should mask (pseudonymize) personal data and generalize quasi-identifiers to meet “privacy by design” principles whenever possible

The Privacy vs Utility Game
Let’s have some fun...

  • The movie title is our “private” data
  • We can generalize
  • We can mask the rest…

Challenge 1: Basically what NY Taxi Did


Challenge 2: More Anonymization Applied
[Table values: 19**; 3 hours]

Challenge 3: Perfectly Private, But Completely Useless
[Table values: 19**; 8.2; 19**; 1 hour; 86; 723 user, 201 critic; 438]

The Anonymization Zoo
Let’s go through some other anonymization techniques. Will we get to party time?
  • K-Anonymization
  • Differential Privacy

Let’s Play Another Game...
  • Think of a number 1 - 6
  • Now I’m going to ask you a private question you may not want to answer in public
  • Did you vote, or would you have voted, for Brexit?
  • Now, if you thought of a “3” or answered “YES” to Brexit, then raise your hand when this counter gets to zero: 3, 2, 1, 0
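
The game is a form of randomized response: a raised hand is deniable, yet the overall rate can still be estimated. The simulation below is a sketch under two assumptions not stated on the slide: people pick their number uniformly at random, and the number is independent of their answer. Under those assumptions P(hand raised) = 1/6 + (5/6)p, where p is the true "yes" rate, so p = (raised - 1/6) / (5/6).

```python
import random


def play(true_yes_rate, people, rng):
    """Simulate the room and return the fraction of raised hands."""
    raised = 0
    for _ in range(people):
        thought_of_three = rng.randint(1, 6) == 3
        says_yes = rng.random() < true_yes_rate
        raised += thought_of_three or says_yes
    return raised / people


def estimate_yes_rate(raised_fraction):
    """Invert P(raised) = 1/6 + (5/6) * p to recover the true yes rate p."""
    return (raised_fraction - 1 / 6) / (5 / 6)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
observed = play(true_yes_rate=0.4, people=100_000, rng=rng)
estimate = estimate_yes_rate(observed)
print(round(observed, 3), round(estimate, 3))
```

No single raised hand reveals a vote, because it may just mean "I thought of 3", but the aggregate estimate lands close to the assumed 40%.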

Differential Privacy
‘Differential privacy formalizes the idea that a "private" computation should not reveal whether any one person participated in the input or not, much less what their data are.’ - Frank McSherry (https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md)
[Figure: salaries of $320k, $330k, $340k, and $30M. Sensitivity of median = ~$10k. Sensitivity of mean = ~$30M.]
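
The sensitivity contrast in the salary figure can be reproduced with a short sketch, followed by a Laplace mechanism that adds noise scaled to sensitivity / epsilon. The four salaries are taken from the figure; everything else (function names, the seed, the way noise is sampled) is an assumption for illustration. The shifts printed here are for this four-person toy set, so they will not match the slide's rounded sensitivities exactly.

```python
import random
from statistics import mean, median

salaries = [320_000, 330_000, 340_000]
with_outlier = salaries + [30_000_000]

# How far does each statistic move when one person joins the dataset?
median_shift = abs(median(with_outlier) - median(salaries))
mean_shift = abs(mean(with_outlier) - mean(salaries))
print(median_shift, mean_shift)


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_release(value, sensitivity, epsilon, rng):
    """Laplace mechanism: noise grows with sensitivity, shrinks with epsilon."""
    return value + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(1)
print(private_release(median(salaries), sensitivity=10_000, epsilon=1.0, rng=rng))
```

A statistic that one outlier can move by millions needs millions of dollars of noise to hide that person, which is why low-sensitivity aggregates such as the median are easier to release usefully.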

There’s a Catch! (Three of Them)
1. You can only ask “aggregate” questions of the data: for example, the count of hands raised, but not specifically whose hand (SUM, COUNT, AVERAGE, MIN, MAX).
2. If you ask the same or a similar question enough times, you’ll find the right answer! You know, statistics... if you flip a coin 100 times, you’re going to get really close to 50% each side. This is why there is a “privacy budget”.
3. “Epsilon” (the amount of privacy) is not intuitive and is hard to assign.

So What Would Differential Privacy Look Like In Our Movie Game?
Let’s pretend the rating was the sensitive piece of data.
  • SELECT AVG(rating) WHERE title = ‘The Terminator’ → 7.3
  • SELECT AVG(rating) WHERE title = ‘The Terminator’ → 8.6
  • SELECT AVG(rating) WHERE title = ‘The Terminator’ → 8.3
  • SELECT AVG(rating) WHERE title = ‘The Terminator’ → 7.6
Average = 7.95
The more we ask, the more we pound away the noise.
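
Because repeated answers average the noise away, a differentially private system has to cap how much can be asked. Below is a minimal sketch of such a privacy budget, relying on the fact that epsilons add up across queries. The class name, the true value of 8.0, and the sensitivity of 0.5 are all hypothetical.

```python
import random


class BudgetedAverage:
    """Answers a noisy average until the total epsilon has been spent."""

    def __init__(self, true_value, sensitivity, total_epsilon, seed=0):
        self.true_value = true_value
        self.sensitivity = sensitivity
        self.remaining = total_epsilon
        self.rng = random.Random(seed)

    def ask(self, epsilon):
        # Refuse the query once the budget cannot cover it.
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        scale = self.sensitivity / epsilon
        # Laplace noise as the difference of two exponential draws.
        noise = self.rng.expovariate(1 / scale) - self.rng.expovariate(1 / scale)
        return self.true_value + noise


query = BudgetedAverage(true_value=8.0, sensitivity=0.5, total_epsilon=1.0)
answers = [query.ask(0.25) for _ in range(4)]
print(answers, query.remaining)
```

A fifth call to `query.ask(0.25)` raises `RuntimeError`, which is the mechanism that stops an analyst from pounding away the noise indefinitely.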

Differential Privacy does provide guarantees of privacy!
But, there are still utility limitations:
  • You need to understand you can only ask general / aggregate type questions. This should be intuitive: you shouldn’t ask specifics of anonymized data.
  • It is very hard to do exploration with the privacy budget; you somewhat have to know the questions you intend to ask up front.
  • Intuition about privacy settings (epsilon). There are tricks you can do here.

Don’t Rely On Anonymization Alone!
Recital 26**, talking about anonymization: “To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.”
**Note that anonymization is only ever mentioned in Recital 26. Recitals can be thought of as commentary, and some would consider them non-binding.
  • Until there’s GDPR guidance about when data is “reasonably likely” to be reidentified, early adopters will face an uncertain regulatory environment. RISK!!
  • Even with the guarantees of Differential Privacy, one still needs to meet the principles and purpose requirements for original collection!!

What I Recommend
A rock solid governance solution in your organization:
  • Data Ownership
  • Access Policies
  • Appropriate Usage
  • Data Lifecycle
Always some level of anonymization and/or pseudonymization to meet the privacy by design requirements.

Governance
  • Data Ownership: Who owns the data, makes decisions on how and if it can be accessed, and is held accountable for those decisions?
  • Access Policies: Who can access the data, what exactly can they see, and under what circumstances?
  • Appropriate Usage: What constitutes appropriate and inappropriate use of data internally and externally, particularly for automated decisions?
  • Data Lifecycle: How to manage acquiring, storing, selling, and purging your data?
Governance is not memos and glorified wikis - it’s actual enforcement through software!

A Complex Problem
You have data everywhere in many different storage technologies, and now complex data governance requirements to enforce.
  • DON’T IMPLEMENT UNIQUELY PER DATABASE!
  • DON’T DATA LAKE FOR THE PURPOSE OF COMPLIANCE SIMPLIFICATION!
[Diagram labels: Consent, Transparency, Legitimate interests, Retention, Minimization, Anonymization, Accountability]

A Data Control Plane
[Diagram: The Data Control Plane, surrounded by the labels Consent, Transparency, Legitimate interests, Retention, Minimization, Anonymization, Accountability, Data Ownership, Access Policies, Appropriate Usage, Data Lifecycle]

Tenets of a Data Control Plane
1. Simplicity: Easy to create privacy rules and expose authoritative views of data from any storage technology
2. Mutability: Ability to change rules and have that reflected in the data on the fly
3. Accessibility: Plane cannot force users to an API to access the data → Needs to be accessible by any language or tool
4. Context: State of access requests needs to be understood to enforce rules appropriately (link data to analytical context, e.g. purpose)
5. Visibility: All actions in the plane are audited, all policies are

A Critical Component: Purposes
  • Purpose-based restrictions are the future of privacy controls
  • Purpose-based restrictions DO NOT fit in the identity management frameworks we’re used to
  • Identity: Roles, Groups, Authorizations - GRANTED TO ME
  • Purpose: Context, Dynamic, Layered - REACT TO MY CONTEXT
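
The difference between identity and purpose can be sketched as a policy check that looks at both what a user has been granted and what they say they are doing right now. This is purely illustrative and is not Immuta's API; the `Request` and `Policy` types, role names, and purpose strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user_roles: frozenset  # identity: granted to the user ahead of time
    purpose: str           # context: declared at the moment of access


@dataclass(frozen=True)
class Policy:
    required_role: str
    allowed_purposes: frozenset

    def permits(self, request: Request) -> bool:
        """Allow access only if both the role and the declared purpose match."""
        return (
            self.required_role in request.user_roles
            and request.purpose in self.allowed_purposes
        )


policy = Policy(required_role="analyst", allowed_purposes=frozenset({"fraud-detection"}))

same_user = frozenset({"analyst"})
print(policy.permits(Request(same_user, "fraud-detection")))  # True
print(policy.permits(Request(same_user, "marketing")))        # False
```

The same person with the same roles gets a different answer depending on purpose, which is the part that role-only identity management cannot express.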

Conclusion
  • Don’t try to shortcut GDPR.
  • Always pseudonymize/anonymize when possible, but don’t use it to escape GDPR, at least not yet.
  • Necessity is the mother of invention: you’ll see your data science operations soar once governance is applied appropriately. Governance can be an enabler!

Steve Touw steve@immuta.com @steve_touw www.immuta.com