GDPR, Data Privacy, Anonymization, Minimization... Oh My! Steve Touw, Immuta
About Me/Immuta: CTO of Immuta. Immuta is a self-service platform where data owners, data scientists and compliance officers eliminate friction and accelerate innovation. Our software enables enterprises to unlock data, control risk, and innovate faster with confidence.
Agenda
- GDPR & data processing: why do YOU care?
- Get out of GDPR jail free?
- The Anonymization Zoo
- The “Data Control Plane”
- Conclusion
General Data Protection Regulation (GDPR)
GDPR In A Nutshell: “The General Data Protection Regulation is the EU’s primary data governance regulation and realistically applies to any business using data from EU data subjects. It is the most forward-leading privacy regime on the planet, with fines of up to four percent of global revenue. With such staggering fines, breaching the GDPR is a risk that many enterprises quite literally may not be able to afford.” -Andrew Burt, Immuta
It’s All About Personal Data Article 4(1): "Personal data" means any information relating to an identified or identifiable natural person ("data subject"); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that person.
Let’s Talk Privacy
I know stuff about Judd and Leslie! Photo credit: Gawker
New York Taxi & Limousine Commission Data was released containing taxi pickups, dropoffs, location, time, amount, and tip amount, among others. Seems pretty harmless?
Well, Judd and Leslie May Not Think It’s Harmless: This photo was geotagged (with time), so by simply querying by medallion and time, we know how much Judd and Leslie tip!
This Is An Example Of a Link Attack: the NY taxi data (medallion & pickup time) is linked to the photo (medallion & photo time).
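The link attack above can be sketched as a simple join on medallion and time. This is a minimal illustration, not the actual attack code: the trip records, the photo observation, the field names and the five-minute matching window are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical released trip records (values invented for illustration).
trips = [
    {"medallion": "9Y99", "pickup": datetime(2013, 7, 8, 13, 3), "fare": 10.5, "tip": 2.0},
    {"medallion": "9Y99", "pickup": datetime(2013, 7, 8, 18, 40), "fare": 22.0, "tip": 0.0},
    {"medallion": "7B41", "pickup": datetime(2013, 7, 8, 13, 5), "fare": 8.0, "tip": 1.5},
]

# Hypothetical external observation: a timestamped photo in which the
# medallion number is visible.
photo = {"who": "celebrity", "medallion": "9Y99", "taken": datetime(2013, 7, 8, 13, 1)}


def link(photo, trips, window=timedelta(minutes=5)):
    """Return trips with the same medallion whose pickup is close to the photo time."""
    return [
        t for t in trips
        if t["medallion"] == photo["medallion"]
        and abs(t["pickup"] - photo["taken"]) <= window
    ]


matches = link(photo, trips)
for t in matches:
    print(photo["who"], "paid", t["fare"], "and tipped", t["tip"])
```

Neither dataset names a passenger on its own; the identification only appears once the two are joined.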
I Swear This Is Relevant... Back To GDPR
Yes... This means the New York Taxi Commission has personal data by the GDPR definition (we identified individuals indirectly). GDPR would apply to the New York Taxi Commission (but probably only if the data was generated in an EU city)! Are you having an “oh no” moment?
GDPR Purpose Restrictions
No room for interpretation:
- Consent: personal data may be processed on the basis that the data subject has consented to such processing
- Contractual necessity: processing is necessary in order to enter into or perform a contract with the data subject
- Compliance with legal obligations
- Vital interests: this essentially applies in "life-or-death" scenarios
- Public interest: necessary for the performance of tasks carried out by a public authority or private organisation acting in the public interest
Room for interpretation by an auditor (riskier):
- Legitimate interests: must be specified at time of collection and reasonable (accountability on the data controller)
Processing Principles
- Fair, lawful and transparent processing: ability to tell the data subject what their data is being used for
- The purpose limitation principle: what we just discussed
- Data minimisation: only process the personal data that you actually need to process in order to achieve your goals
- Accuracy: responsibility for taking all reasonable steps to ensure that personal data are accurate
- Data retention periods: data should not be retained for longer than necessary in relation to the purposes for which they were collected
- Data security: data are kept secure, both against internal and external threats
- Accountability: enforcement of the Data Protection Principles
Those Principles and Purposes are Scary... Maybe… “Once a dataset is truly anonymised and individuals are no longer identifiable, European data protection law no longer applies.” -Article 29 Working Party
Let’s Talk Anonymization
Pseudonymization: “the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.” -GDPR Article 4(5)
Pseudonymization In the Wild: Back to our New York taxi data... They actually did go to the trouble of pseudonymizing the data by hashing the medallion ID. But that didn’t matter...
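One reason hashing an identifier buys little on its own: when the space of possible identifiers is small, an unsalted hash can be reversed by enumerating every candidate. The sketch below is an assumption-laden illustration. The digit-letter-digit-digit medallion format and the choice of SHA-256 are assumptions for the example, not details taken from the released dataset.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def pseudonymize(medallion: str) -> str:
    """Unsalted hash of the identifier (SHA-256 chosen only for illustration)."""
    return hashlib.sha256(medallion.encode("ascii")).hexdigest()


def build_lookup() -> dict:
    """Hash every possible ID in the assumed digit-letter-digit-digit format."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        plain = f"{d1}{letter}{d2}{d3}"
        table[pseudonymize(plain)] = plain
    return table


lookup = build_lookup()            # only 10 * 26 * 10 * 10 candidates
released = pseudonymize("5X55")    # what a published record would contain
recovered = lookup[released]       # the original ID, recovered by lookup
print(len(lookup), recovered)
```

Even without reversing the hash, the hashed value stays constant per medallion, so it still works as a join key for link attacks.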
More Link Attacks: the NY taxi data can be linked on medallion & pickup time (to a photo with medallion & photo time), and on pickup time & pickup loc & dropoff loc & dropoff time & amount (to a receipt with dropoff time).
Cardinality is the Achilles Heel of Anonymization
- What did all those columns we linked have in common? They have many unique values (high cardinality).
- The more unique values, the more opportunity to pinpoint and link an external source.
- These columns contain what are termed quasi-identifiers.
- Quasi-identifiers aren’t necessarily personal data!
- You’re hashing for anonymity, not privacy, thus removing utility! (I always wear a helmet and nothing else)
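A rough way to spot candidate quasi-identifiers is to measure how unique each column, or combination of columns, is. The sketch below uses invented rows and hypothetical column names; it is a heuristic illustration, not a formal re-identification risk measure.

```python
from collections import Counter

# Hypothetical trip rows (values invented for illustration).
rows = [
    {"pickup_time": "2013-07-08 13:03", "pickup_zone": "Tribeca", "payment_type": "card"},
    {"pickup_time": "2013-07-08 13:05", "pickup_zone": "Tribeca", "payment_type": "cash"},
    {"pickup_time": "2013-07-08 18:40", "pickup_zone": "Midtown", "payment_type": "card"},
    {"pickup_time": "2013-07-08 21:15", "pickup_zone": "Midtown", "payment_type": "card"},
]


def uniqueness(rows, column):
    """Fraction of distinct values in a column (1.0 means every value is unique)."""
    values = [r[column] for r in rows]
    return len(set(values)) / len(values)


def singled_out(rows, columns):
    """Number of rows whose combination of values appears exactly once."""
    combos = Counter(tuple(r[c] for c in columns) for r in rows)
    return sum(1 for count in combos.values() if count == 1)


for col in rows[0]:
    print(col, uniqueness(rows, col))
print(singled_out(rows, ["pickup_zone", "payment_type"]))
```

Columns that look harmless individually can still single out rows once combined, which is why the check runs over column combinations as well.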
The Privacy vs Utility Tradeoff: This is what our data looks like now to prevent link attacks: remove all quasi-identifiers, remove all utility!
Pseudonymization: Good, But Not Party Time: “pseudonymisation is not a method of anonymisation. It merely reduces the linkability of a dataset with the original identity of a data subject, and is accordingly a useful security measure.” -Article 29 Working Party. In plain English: GDPR requires that you pseudonymize when you can because that minimizes risk; this is GDPR’s “privacy by design”. So it does buy you something, but GDPR still applies.
The Anonymization Zoo: Let’s go through some other anonymization techniques. Will we get to party time? Techniques: K-Anonymization, Differential Privacy
K-Anonymization: Think of k-anonymization as a better alternative to the hashing we did for the taxi data in the prior slides, one that preserves more utility. It works by generalizing quasi-identifiers, making them more “coarse” so that they become homogeneous with their neighbors. Each record is then indistinguishable from at least k-1 other records, forming an equivalence class. Examples from the figure: zip codes 20852, 20878 and 20868 generalize to 208*; the coordinate 30.56755 generalizes to 30.6; ages shown are 24, 25, 26 and 29.
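A minimal sketch of the generalization step, assuming zip codes are truncated to three digits and ages are bucketed by decade. The records and the bucketing rules are hypothetical; real tools search for the least lossy generalization that reaches the target k.

```python
from collections import Counter

# Hypothetical records; zip code and age are the quasi-identifiers.
records = [
    {"zip": "20852", "age": 24},
    {"zip": "20878", "age": 25},
    {"zip": "20868", "age": 29},
    {"zip": "20852", "age": 26},
]


def generalize(rec):
    """Coarsen the quasi-identifiers: 3-digit zip prefix and a decade age band."""
    decade = rec["age"] // 10 * 10
    return (rec["zip"][:3] + "*", f"{decade}-{decade + 9}")


def k_of(rows):
    """k is the size of the smallest equivalence class."""
    return min(Counter(rows).values())


raw_k = k_of([(r["zip"], r["age"]) for r in records])
gen_rows = [generalize(r) for r in records]
gen_k = k_of(gen_rows)
print(raw_k, gen_k, gen_rows[0])
```

Before generalization every record is unique (k = 1); afterwards all four share one equivalence class (k = 4), at the cost of coarser data.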
Example: Generalizing By Zip Code. Homogeneity Attack: Black, born in 1965: do we know their problem? No. Add gender (female vs. male, born in 1965): do we know their problem? No for one group, YES for the other, because everyone in that equivalence class has the same problem.
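The homogeneity attack can be checked for mechanically: group records by their generalized quasi-identifiers and look for classes where every record carries the same sensitive value. The rows below are invented for illustration and are not the table from the slide.

```python
from collections import defaultdict

# Hypothetical k-anonymized rows: (generalized quasi-identifiers, sensitive value).
rows = [
    (("208*", "20-29"), "flu"),
    (("208*", "20-29"), "asthma"),
    (("209*", "30-39"), "diabetes"),
    (("209*", "30-39"), "diabetes"),
]


def sensitive_values_by_class(rows):
    """Collect the set of sensitive values seen in each equivalence class."""
    groups = defaultdict(set)
    for quasi_ids, sensitive in rows:
        groups[quasi_ids].add(sensitive)
    return groups


def homogeneous_classes(rows):
    """Classes where knowing the quasi-identifiers reveals the sensitive value."""
    return [qi for qi, vals in sensitive_values_by_class(rows).items() if len(vals) == 1]


leaky = homogeneous_classes(rows)
l_value = min(len(v) for v in sensitive_values_by_class(rows).values())
print(leaky, l_value)
```

Both classes satisfy k = 2, yet one of them leaks its sensitive value outright. Requiring at least l distinct sensitive values per class is the idea behind l-diversity.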
K-Anonymized Taxi Data
- K-anonymized pickup & dropoff location and time
- Certainly more utility
- But same problems...
- Link attack on a very unique pickup/dropoff
- Homogeneity attack: everyone tipped the same
- L-Diversity and T-Closeness each have their own problems
K-Anonymization In the Wild I’m not the only one that gets the joke now!
K-Anonymization, Better Utility, No Party
- K-Anonymization provides no guarantees of privacy
- K-Anonymization is computationally intensive to build; searching for K-perfection, L-Diversity, T-Closeness may be a waste of time
- There’s still a privacy vs utility tradeoff to contend with
- One should mask (pseudonymize) personal data and generalize quasi-identifiers to meet “privacy by design” principles whenever possible
The Privacy vs Utility Game: Let’s have some fun...
The movie title is our “private” data. We can generalize. We can mask the rest...
Challenge 1: Basically what NY Taxi Did
Challenge 2: More Anonymization Applied (fields shown: year 19**, runtime 3 hours)
Challenge 3: Perfectly Private, But Completely Useless (fields shown: 19**, 1 hour, 8.2, 86, 723 user, 201 critic, 438)
The Anonymization Zoo: Let’s go through some other anonymization techniques. Will we get to party time? Techniques: K-Anonymization, Differential Privacy
Let’s Play Another Game... Think of a number from 1 to 6. Now I’m going to ask you a private question you may not want to answer in public: did you vote, or would you have voted, for Brexit? Now, if you thought of a “3” or answered “YES” to Brexit, then raise your hand when this counter gets to zero: 3 2 1 0
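The game is a form of randomized response. A hand goes up with probability 1/6 + (5/6)·p, where p is the true “YES” rate, so the aggregate rate can be recovered from the count of raised hands while a raised hand never proves a “YES”. A minimal simulation, assuming a hypothetical true rate of 40% and a large audience:

```python
import random


def respond(truth: bool, rng: random.Random) -> bool:
    """Raise a hand if the secretly chosen number is 3 or the true answer is YES."""
    return rng.randint(1, 6) == 3 or truth


def estimate_yes_rate(raised_fraction: float) -> float:
    """Invert P(raise) = 1/6 + (5/6) * p to recover p."""
    return (raised_fraction - 1 / 6) / (5 / 6)


rng = random.Random(0)
true_rate = 0.4          # hypothetical value, unknown to the presenter
n = 100_000              # hypothetical audience size
hands = sum(respond(rng.random() < true_rate, rng) for _ in range(n))
estimate = estimate_yes_rate(hands / n)
print(round(estimate, 3))
```

The estimate is only reliable in aggregate; with a small audience the random numbers dominate, which previews the utility limits discussed next.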
Differential Privacy: ‘Differential privacy formalizes the idea that a "private" computation should not reveal whether any one person participated in the input or not, much less what their data are.’ - [Frank McSherry](https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-03.md) Figure values: $320k, $330k, $340k and $30M. Sensitivity of median = ~10k. Sensitivity of mean = ~30M.
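To see why the median is so much less sensitive than the mean, compare two datasets that differ in a single record. This sketch uses a replace-one-record comparison, which is an assumption about how to illustrate sensitivity; the exact figures depend on the definition used, but the gap of several orders of magnitude is the point.

```python
from statistics import mean, median

# Two neighboring datasets: identical except that one record is replaced
# by an extreme value (figures loosely based on the values in the slide).
base = [320_000, 330_000, 340_000]
neighbor = [320_000, 30_000_000, 340_000]

median_shift = abs(median(neighbor) - median(base))
mean_shift = abs(mean(neighbor) - mean(base))

print(median_shift)  # one record moves the median by 10,000
print(mean_shift)    # the same record moves the mean by millions
```

The more a single record can move an answer, the more noise has to be added to hide that record, so high-sensitivity queries lose more utility.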
There’s a Catch! (Three of Them)
1. You can only ask “aggregate” questions of the data: for example, the count of hands raised, but not specifically whose hand. SUM, COUNT, AVERAGE, MIN, MAX.
2. If you ask the same/similar question enough, you’ll find the right answer!! You know, statistics... if you flip a coin 100 times, you’re going to get really close to 50% each side. This is why there is a “Privacy Budget”.
3. “Epsilon” (amount of privacy) is not intuitive and hard to assign.
So What Would Differential Privacy Look Like In Our Movie Game? Let’s pretend the rating was the sensitive piece of data. Asking the same question four times returns four different noisy answers:
- SELECT AVG(rating) WHERE title = 'The Terminator' returns 7.3
- SELECT AVG(rating) WHERE title = 'The Terminator' returns 8.6
- SELECT AVG(rating) WHERE title = 'The Terminator' returns 8.3
- SELECT AVG(rating) WHERE title = 'The Terminator' returns 7.6
Average = 7.95. The more we ask, the more we pound away the noise.
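The averaging effect, and the privacy budget that guards against it, can be sketched with the Laplace mechanism. Everything concrete here is an assumption for illustration: the true average of 8.0, the sensitivity of 1.0, the per-query epsilon, and the `BudgetedQuery` class, which is hypothetical and not a real library API.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


class BudgetedQuery:
    """Answers one aggregate with Laplace noise until the budget is spent."""

    def __init__(self, true_value, sensitivity, total_epsilon, rng):
        self.true_value = true_value
        self.sensitivity = sensitivity
        self.remaining = total_epsilon
        self.rng = rng

    def ask(self, epsilon: float) -> float:
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        return self.true_value + laplace_noise(self.sensitivity / epsilon, self.rng)


rng = random.Random(1)

# Without a meaningful budget, repeated asking averages the noise away.
unlimited = BudgetedQuery(8.0, 1.0, total_epsilon=float("inf"), rng=rng)
answers = [unlimited.ask(1.0) for _ in range(2000)]
averaged = sum(answers) / len(answers)

# With a budget of 3.0, the fourth identical query is refused.
limited = BudgetedQuery(8.0, 1.0, total_epsilon=3.0, rng=rng)
allowed = [limited.ask(1.0) for _ in range(3)]
try:
    limited.ask(1.0)
    refused = False
except RuntimeError:
    refused = True
print(round(averaged, 2), refused)
```

Each answer spends part of the budget, so an analyst has to decide up front which questions are worth asking.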
Differential Privacy does provide guarantees of privacy! But there are still utility limitations:
- You need to understand you can only ask general / aggregate type questions. This should be intuitive: you shouldn’t ask specifics of anonymized data.
- It is very hard to do exploration with the privacy budget; you somewhat have to know the questions you intend to ask up front.
- Intuition about privacy settings (epsilon). There are tricks you can do here.
Don’t Rely On Anonymization Alone! Recital 26**, talking about anonymization: “To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.” **Note that anonymization is only ever mentioned in Recital 26. Recitals can be thought of as commentary, and some would consider them non-binding. Until there’s GDPR guidance about when data is “reasonably likely” to be reidentified, early adopters will face an uncertain regulatory environment: RISK!! Even with the guarantees of Differential Privacy, one still needs to meet the principles and purpose requirements for the original collection!!
What I Recommend
- A rock solid governance solution in your organization: Data Ownership, Access Policies, Appropriate Usage, Data Lifecycle
- Always some level of anonymization and/or pseudonymization to meet the privacy by design requirements
Governance
- Data Ownership: Who owns the data and decides how and if it can be accessed, and is held accountable for those decisions?
- Access Policies: Who can access the data, what exactly can they see, and under what circumstances?
- Appropriate Usage: What constitutes appropriate and inappropriate use of data internally and externally, particularly for automated decisions?
- Data Lifecycle: How do you manage acquiring, storing, selling, and purging your data?
Governance is not memos and glorified wikis - it’s actual enforcement through software!
A Complex Problem: You have data everywhere in many different storage technologies, and now complex data governance requirements to enforce (consent, transparency, legitimate interests, retention, minimization, anonymization, accountability). DON’T IMPLEMENT UNIQUELY PER DATABASE! DON’T DATA LAKE FOR THE PURPOSE OF COMPLIANCE SIMPLIFICATION!
A Data Control Plane (diagram): the Data Control Plane brings together the GDPR requirements (consent, transparency, legitimate interests, retention, minimization, anonymization, accountability) and the governance functions (data ownership, access policies, appropriate usage, data lifecycle).
Tenets of a Data Control Plane
1. Simplicity: Easy to create privacy rules and expose authoritative views of data from any storage technology
2. Mutability: Ability to change rules and have that reflected in the data on the fly
3. Accessibility: Plane cannot force users to an API to access the data; it needs to be accessible by any language or tool
4. Context: State of access requests needs to be understood to enforce rules appropriately (link data to analytical context, e.g. purpose)
5. Visibility: All actions in the plane are audited, all policies are
A Critical Component: Purposes
- Purpose-based restrictions are the future of privacy controls
- Purpose-based restrictions DO NOT fit in the identity management frameworks we’re used to
- Identity: Roles, Groups, Authorizations - GRANTED TO ME
- Purpose: Context, Dynamic, Layered - REACT TO MY CONTEXT
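The difference between identity-based and purpose-based controls can be sketched as a policy check that takes the declared purpose into account on every request. The `Policy` and `Request` types, the group and purpose names, and the masking rule are all hypothetical; this is not how any particular product implements it.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    user_groups: set   # identity: what has been granted to the user
    purpose: str       # context: why the data is being accessed right now


@dataclass
class Policy:
    required_group: str
    allowed_purposes: set
    masked_by_purpose: dict = field(default_factory=dict)


def view_row(row: dict, request: Request, policy: Policy) -> dict:
    """Return the row as this request may see it, or raise PermissionError."""
    if policy.required_group not in request.user_groups:
        raise PermissionError("identity check failed")
    if request.purpose not in policy.allowed_purposes:
        raise PermissionError("purpose not permitted for this data")
    masked = policy.masked_by_purpose.get(request.purpose, set())
    return {k: ("***" if k in masked else v) for k, v in row.items()}


policy = Policy(
    required_group="analysts",
    allowed_purposes={"fraud_investigation", "aggregate_reporting"},
    masked_by_purpose={"aggregate_reporting": {"name", "medallion"}},
)
row = {"name": "Alice", "medallion": "5X55", "tip": 2.0}

fraud_view = view_row(row, Request({"analysts"}, "fraud_investigation"), policy)
report_view = view_row(row, Request({"analysts"}, "aggregate_reporting"), policy)
print(fraud_view)
print(report_view)
```

The same user gets different views of the same row depending on the declared purpose, which a static role or group grant cannot express.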
Conclusion: Don’t try to shortcut GDPR. Always pseudonymize/anonymize when possible, but don’t use it to escape GDPR, at least not yet. Necessity is the mother of invention: you’ll see your data science operations soar once governance is applied appropriately. Governance can be an enabler!
Steve Touw steve@immuta.com @steve_touw www.immuta.com