CMPUT 691: Differential Privacy: Privacy Preserving Data-Analysis

CMPUT 691: Differential Privacy: Privacy Preserving Data-Analysis
http://webdocs.ualberta.ca/~osheffet/CMPUT691F16.html

The Course
• Time: Tue. & Thr. 14:00-15:20
• Place: CSC B-43
• Webpage: eClass (for registered) and webdocs: http://webdocs.ualberta.ca/~osheffet/CMPUT691F16.html
• Book: Dwork & Roth, The Algorithmic Foundations of Differential Privacy (link at webpage)
• Instructor: Or Sheffet, Athabasca 3-04
  • Office hours 15:30-16:00 Tuesdays / coordinate via email
  • osheffet@ualberta.ca; subject must begin with [CMPUT 691]
  • Not a native English speaker, not native to Canada

The Course
• Grade: 10-15% FFTq (Food-for-Thought Question)
  • A question will be given at the end of the class; write an answer by the beginning of the next class.
  • 1% per not-completely-trivial answer, more if you give a really good one / engage in a discussion
• 35-40% based on 3-4 HW assignments
  • Pen-and-paper assignments; maybe, but unlikely, a few plots to draw
  • PLEASE TYPE YOUR SOLUTION
  • Late unapproved submission: -25% of the assignment per 24 hrs past the submission deadline
  • Collaborations are encouraged…
  • … but solutions ought to be individually written. Cite anything & anyone you used.

The Course
• Grade: 50% class project
  • Groups of 1-2
  • Type and scope are for you to decide: can be centered on implementation, practical testing, or theory
  • Grade depends on all components
• Tentative schedule:
  • By week 3: ideas will be given
  • By week 5: [5%] submission of title, group members, abstract + references (1-2 pages), and a meeting with me for initial feedback
  • By week 8: [10%] mid-project presentations (10-20 mins)
    • Papers you are based on
    • What is your research question
    • Plan of attack
    • Feedback opportunity from one another!
  • By week 13: [15%] full presentations (20-25 mins)
    • Your research questions & “line of attack”
    • Results
    • Future directions
    • Last chance for feedback!
  • End of semester: [20%] project due
    • Can this become a full-fledged paper?

Questions?
• Tuesday, Sep. 13th: any objections to delaying the class by 1 hour? (Starting at 15:00, ending at 16:20)

Today’s Class
• Overview of the problem of privacy in data analysis
• Atypical: slides, storytelling
• The rest of the course: whiteboard, math
• So if you’re still deciding on whether or not to take the class, judge it based on next time.

Data Privacy – The Problem
• Given: a dataset with sensitive information
  • Health records, census data, financial data, …
• How to: compute and release functions of the dataset
  • Answer queries, output summaries, learn
• Without compromising individual privacy
  • What the #$@& does it even mean??

Data Privacy – The Problem
[diagram] Individuals → Server/agency A ⇄ Users (queries in, answers out)
• The users: government, researchers, businesses, (or) a malicious adversary.

A Real Problem
Typical examples:
• Census
• Civic archives
• Medical records
• Search information
• Communication logs
• Social networks
• Genetic databases
• …
Benefits:
• New discoveries
• Improved medical care
• National security
[figure: weighing privacy against discoveries, medical care, and security]

The Anonymization Dream: Database → Anonymized Database
• Trusted curator:
  • Removes identifying information (name, address, SSN, …).
  • Replaces identities with random identifiers.
• Idea hard-wired into practices, regulations, …, thought.
• Many uses.
• Reality: a series of failures,
  • pronounced in both the academic and the public literature.

Linkage Attacks [Sweeney 2000]
• Anonymized GIC data: the Group Insurance Commission released “anonymized” patient-specific data (135,000 patients, ~100 attributes per encounter):
  • Ethnicity, visit date, diagnosis, procedure, medication, total charge, plus ZIP, birth date, sex.
• Voter registration list of Cambridge, MA (“public records”, open for inspection by anyone):
  • Name, address, date registered, party affiliation, date last voted, plus ZIP, birth date, sex.
• The two releases overlap on ZIP, birth date, and sex.

Linkage Attacks [Sweeney 2000]
• Quasi-identifiers ⇒ re-identification
• Not a coincidence:
  • DOB + 5-digit ZIP identifies 69%
  • DOB + 9-digit ZIP identifies 97%
• William Weld (governor of Massachusetts at the time):
  • According to the Cambridge voter list, six people had his particular birth date,
  • of which three were men,
  • and he was the only one in his 5-digit ZIP code!
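The attack itself is mechanically just a database join on the quasi-identifiers. Here is a minimal sketch in Python; the file names and column names (gic_anonymized.csv, cambridge_voters.csv, birth_date, zip, sex, name, diagnosis) are hypothetical stand-ins for the two releases above.

import csv
from collections import defaultdict

def index_by(path, key_cols):
    # Group the rows of a CSV file by their quasi-identifier tuple.
    index = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            index[tuple(row[c] for c in key_cols)].append(row)
    return index

QUASI = ["birth_date", "zip", "sex"]
medical = index_by("gic_anonymized.csv", QUASI)    # diagnoses, but no names
voters = index_by("cambridge_voters.csv", QUASI)   # names, but no diagnoses

for key, voter_rows in voters.items():
    patients = medical.get(key, [])
    # A unique match on (birth date, ZIP, sex) re-identifies the patient.
    if len(voter_rows) == 1 and len(patients) == 1:
        print(voter_rows[0]["name"], "->", patients[0]["diagnosis"])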

Azrieli Towers [image; thanks to Amos Fiat]

Azrieli Towers [image; thanks to Google]

AOL Data Release (2006)
• AOL released search data: a sample of ~20M web queries collected from ~650k users over three months.
• Goal: provide real query-log data that is based on real users.
  • “It could be used for personalization, query reformulation or other types of search research.”
• The data set columns: AnonID, Query, QueryTime, ItemRank, ClickURL

AnonID   Query                                QueryTime           ItemRank  ClickURL
4417749  best dog for older owner             3/6/2006 11:48:24   1
4417749  best dog for older owner             3/6/2006 11:48:24   5
4417749  landscapers in lilburn ga.           3/6/2006 18:37:26
4417749  effects of nicotine                  3/7/2006 19:17:19   6
4417749  best retirement in the world         3/9/2006 21:47:26   4
4417749  best retirement place in usa         3/9/2006 21:49:37   10
4417749  best retirement place in usa         3/9/2006 21:49:37   9
4417749  bi polar and heredity                3/13/2006 20:57:11
4417749  adventure for the older american     3/17/2006 21:35:48
4417749  nicotine effects on the body         3/26/2006 10:31:15  3
4417749  nicotine effects on the body         3/26/2006 10:31:15  2
4417749  wrinkling of the skin                3/26/2006 10:38:23
4417749  mini strokes                         3/26/2006 14:56     1
4417749  panic disorders                      3/26/2006 14:58:25
4417749  jarrett t. arnold eugene oregon      3/23/2006 21:48:01  2
4417749  jarrett t. arnold eugene oregon      3/23/2006 21:48:01  3
4417749  plastic surgeons in gwinnett county  3/28/2006 15:04:23            http://www.implantinfo.com
4417749  plastic surgeons in gwinnett county  3/28/2006 15:31:00
4417749  60 single men                        3/29/2006 20:11:52  6
4417749  60 single men                        3/29/2006 20:14
4417749  clothes for 60 plus age              4/19/2006 12:44:03
4417749  clothes for age 60                   4/19/2006 12:44:41  10
4417749  clothes for age 60                   4/19/2006 12:45:41
4417749  lactose intolerant                   4/21/2006 20:53:51  2
4417749  lactose intolerant                   4/21/2006 20:53:51  10
4417749  dog who urinate on everything        4/28/2006 13:24:07  6
4417749  fingers going numb                   5/2/2006 17:35:47

Other ClickURLs in the sample: http://www.canismajor.com, http://dogs.about.com, http://www.nida.nih.gov, http://www.escapeartist.com, http://www.clubmarena.com, http://www.committment.com, http://www.geocities.com, http://health.howstuffworks.com, http://www.ninds.nih.gov, http://www2.eugeneweekly.com, http://www.wedalert.com, http://www.adultlovecompass.com, http://www.news.cornell.edu, http://digestive.niddk.nih.gov, http://www.netdoctor.co.uk, http://www.dogdaysusa.com

Name: Thelma Arnold
Age: 62, widow
Residence: Lilburn, GA

Other Re-Identification Examples [partial and unordered list]
• Netflix award [Narayanan, Shmatikov 08]
• Social networks [Backstrom, Dwork, Kleinberg 07; NS 09]
• Computer networks [Coull, Wright, Monrose, Collins, Reiter 07; Ribeiro, Chen, Miklau, Townsley 08]
• Genetic data (GWAS) [Homer, Szelinger, Redman, Duggan, Tembe, Muehling, Pearson, Stephan, Nelson, Craig 08, …]
• Microtargeted advertising [Korolova 11]
• Recommendation systems [Calandrino, Kilzer, Narayanan, Felten, Shmatikov 11]
• Israeli CBS [Mukatren, N, Salman, Tromer]
• …

k-Anonymity [SS 98, S 02] … l-diversity … t-closeness …
• Prevent re-identification: make every individual indistinguishable from at least k-1 other individuals.

Original table:
ZIP    Age  Sex     Disease
23456  55   Female  Heart
12345  30   Male    Heart
12346  33   Male    Heart
13144  45   Female  Breast Cancer
13155  42   Male    Hepatitis
23456  42   Male    Viral

2-anonymized release:
ZIP    Age  Sex   Disease
23456  **   *     Heart
1234*  3*   Male  Heart
1234*  3*   Male  Heart
131**  4*   *     Breast Cancer
131**  4*   *     Hepatitis
23456  **   *     Viral

• “Both guys from ZIP 1234* that are in their thirties have heart problems!”
• “My (male) neighbor from ZIP 13155 has hepatitis! Bugger!”
• “I cannot tell which disease the patients from ZIP 23456 have.”
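For concreteness, here is a minimal sketch of the k-anonymity check on the released table above (column names are illustrative): every combination of quasi-identifier values must occur in at least k rows.

from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    # Count how often each quasi-identifier combination appears.
    counts = Counter(tuple(row[c] for c in quasi_ids) for row in rows)
    return all(n >= k for n in counts.values())

released = [
    {"zip": "23456", "age": "**", "sex": "*",    "disease": "Heart"},
    {"zip": "1234*", "age": "3*", "sex": "Male", "disease": "Heart"},
    {"zip": "1234*", "age": "3*", "sex": "Male", "disease": "Heart"},
    {"zip": "131**", "age": "4*", "sex": "*",    "disease": "Breast Cancer"},
    {"zip": "131**", "age": "4*", "sex": "*",    "disease": "Hepatitis"},
    {"zip": "23456", "age": "**", "sex": "*",    "disease": "Viral"},
]
print(is_k_anonymous(released, ["zip", "age", "sex"], 2))  # True

Note that the release passes the check for k = 2 and still leaks: both rows in the 1234* group share the value “Heart”, which is exactly the homogeneity problem that motivates l-diversity.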

Auditing
• An auditor sits between the users and the statistical database, keeping the query log q1, …, qi.
• For a new query qi+1 it either returns the answer, or denies the query (as the answer would cause a privacy loss).
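As a concrete toy model of the auditor (a sketch under simplifying assumptions, not a general algorithm): for sum queries over unbounded real entries, “the answer would cause privacy loss” can be taken to mean that some single entry becomes uniquely determined, which happens exactly when a unit vector lies in the row space of the query matrix.

import numpy as np

def pins_some_entry(queries, n, tol=1e-9):
    # For real-valued sum queries, entry i is uniquely determined exactly
    # when the unit vector e_i lies in the row space of the query matrix.
    A = np.array(queries, dtype=float)
    r = np.linalg.matrix_rank(A, tol)
    return any(np.linalg.matrix_rank(np.vstack([A, np.eye(n)[i]]), tol) == r
               for i in range(n))

def audit(log, new_query, n):
    # Deny exactly when answering would let some d_i be determined. For
    # this query class the decision depends only on the queries asked,
    # so here the denial itself leaks nothing (unlike the examples below).
    if pins_some_entry(log + [new_query], n):
        return "denied"
    log.append(new_query)
    return "answered"

log = []
print(audit(log, [1, 1, 0], 3))  # sum(d1, d2): answered
print(audit(log, [0, 1, 1], 3))  # sum(d2, d3): answered
print(audit(log, [1, 0, 1], 3))  # sum(d1, d3): denied, would pin every d_i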

Example 1: Sum/Max Auditing
• di real; sum/max queries; privacy breached if some di is learned.
• q1 = sum(d1, d2, d3) = 15
• q2 = max(d1, d2, d3): Denied (the answer would cause privacy loss).
• “Oh well…”

… After Two Minutes …
• di real; sum/max queries; privacy breached if some di is learned.
• q1 = sum(d1, d2, d3) = 15
• q2 = max(d1, d2, d3): Denied (the answer would cause privacy loss).
• “There must be a reason for the denial… q2 is denied iff d1 = d2 = d3 = 5. I win!”
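The reasoning can be checked mechanically. A small sketch, assuming the auditor denies the max query exactly when its answer together with sum = 15 would pin some entry down:

def max_auditor(d, sum_answer=15):
    # max(d) plus sum(d) = 15 determines every entry exactly when all
    # three entries are equal: each d_i <= max(d), so they can sum to
    # 3 * max(d) only if d1 = d2 = d3 = max(d).
    if 3 * max(d) == sum_answer:
        return "denied"
    return max(d)

print(max_auditor([5, 5, 5]))  # denied -> attacker concludes d1 = d2 = d3 = 5
print(max_auditor([2, 6, 7]))  # 7      -> no single d_i is determined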

Example 2: Interval-Based Auditing
• di ∈ [0, 100]; sum queries; privacy breached if some di is confined to an interval of length ≤ 1 (auditable in PTIME).
• q1 = sum(d1, d2): “Sorry, denied.”
• q2 = sum(d2, d3) = 50
• Denial ⇒ d1, d2 ∈ [0, 1] or d1, d2 ∈ [99, 100].
• Combined with q2 = 50, the second case is impossible, so d1, d2 ∈ [0, 1] and d3 ∈ [49, 50].
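The deduction, spelled out as a short sketch (assuming denial means the answer would confine some d_i to an interval of length at most 1):

# Denial of q1 = sum(d1, d2) means the answer would have confined d1 and
# d2 to a length-1 interval: sum <= 1 (both in [0, 1]) or sum >= 199
# (both in [99, 100]). Intersect each case with q2 = sum(d2, d3) = 50.
for lo, hi in [(0.0, 1.0), (99.0, 100.0)]:
    d3_lo, d3_hi = 50 - hi, 50 - lo          # d3 = 50 - d2
    feasible = d3_hi >= 0 and d3_lo <= 100   # d3 must lie in [0, 100]
    print(f"d2 in [{lo}, {hi}] -> d3 in [{d3_lo}, {d3_hi}]:",
          "feasible" if feasible else "impossible")
# Only the first case survives, so d3 in [49, 50]: a length-1 interval,
# i.e. exactly the breach the denial was meant to prevent.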

Max Auditing
• d1 d2 d3 d4 d5 d6 d7 d8 … dn-1 dn, di real
• q1 = max(d1, d2, d3, d4) → answered: M1234
• q2 = max(d1, d2, d3) → M123 / denied. If denied: d4 = M1234.
• q3 = max(d1, d2) → M12 / denied. If denied: d3 = M123.

Adversary’s Success
• q1 = max(d1, d2, d3, d4) → answered.
• q2 = max(d1, d2, d3): denied with probability 1/4. If denied: d4 = M1234.
• q3 = max(d1, d2): denied with probability 1/3. If denied: d3 = M123.
• Success probability: 1/4 + (1 − 1/4) · 1/3 = 1/2.
• Recover 1/8 of the database!
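A quick Monte Carlo sanity check of the 1/2, assuming the four entries are i.i.d. continuous (so ties have probability 0) and the auditor denies a max query exactly when its answer would pin some entry down:

import random

def attack_once():
    d = [random.random() for _ in range(4)]
    # q2 = max(d1, d2, d3) is denied iff answering it would reveal
    # d4 = M1234, i.e. iff d4 is the unique maximum (probability 1/4).
    if max(d[:3]) < d[3]:
        return True                 # the denial itself leaks d4
    # Otherwise q3 = max(d1, d2) is denied iff it would reveal d3 = M123,
    # i.e. iff d3 is the maximum of the first three (probability 1/3).
    return max(d[:2]) < d[2]        # the denial leaks d3

trials = 100_000
print(sum(attack_once() for _ in range(trials)) / trials)  # ~0.5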

Boolean Auditing?
• d1 d2 d3 d4 d5 d6 d7 d8 … dn-1 dn, di Boolean
• q1 = sum(d1, d2), q2 = sum(d2, d3), … : each answered 1 / denied.
• qi is denied iff di = di+1 ⇒ learn the database or its complement.
• Let di, dj, dk be not all equal, where qi-1, qi, qj-1, qj, qk-1, qk were all denied; then q = sum(di, dj, dk) is answered 1 or 2.
• Recover the entire database!
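A minimal sketch of the reconstruction: the denial pattern of the pairwise sums encodes exactly where adjacent bits are equal, which pins the database down to two candidates.

def reconstruct(denials, first_bit):
    # denials[i] is True iff q_{i+1} = sum(d_{i+1}, d_{i+2}) was denied,
    # i.e. iff those two adjacent bits are equal.
    d = [first_bit]
    for denied in denials:
        d.append(d[-1] if denied else 1 - d[-1])
    return d

db = [0, 1, 1, 0, 1]
denials = [db[i] == db[i + 1] for i in range(len(db) - 1)]
print(reconstruct(denials, 0))  # the database itself
print(reconstruct(denials, 1))  # its complement
# A single answered sum over three not-all-equal entries (1 or 2) then
# decides between the two candidates.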

Randomization

The Scenario
• Users provide modified values of sensitive attributes.
• The data-miner develops models about the aggregated data.
• Can we develop accurate models without access to precise individual information?

Preserving Privacy
• Value distortion: return xi + ri
  • Uniform noise: ri ~ U(−a, a)
  • Gaussian noise: ri ~ N(0, stdev)
• The perturbation of an entry is fixed, so that repeated queries do not reduce the noise.
• Privacy quantification: interval of confidence [AS 2000]
  • With c% confidence, xi is in the interval [a1, a2]; a2 − a1 defines the amount of privacy at the c% confidence level.
• Examples (interval width):

  Confidence  Uniform     Gaussian
  50%         0.5 × 2a    1.34 × stdev
  95%         0.95 × 2a   3.92 × stdev
  99.9%       0.999 × 2a  6.8 × stdev

• Intuition: the larger the interval is, the better privacy is preserved.
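A small sketch of the mechanism and of what the table measures (the variable names and example values are illustrative): each value is released once with fixed noise, and the attacker's c% confidence interval for xi has the widths listed above.

import random

def perturb_uniform(x, a):
    return x + random.uniform(-a, a)

def perturb_gauss(x, stdev):
    return x + random.gauss(0.0, stdev)

# 95%-confidence intervals for the true x given the released value:
a = stdev = 10.0
y_u = perturb_uniform(42.0, a)
y_g = perturb_gauss(42.0, stdev)
print("uniform :", (y_u - 0.95 * a, y_u + 0.95 * a))         # width 0.95 * 2a
print("gaussian:", (y_g - 1.96 * stdev, y_g + 1.96 * stdev))  # width 3.92 * stdev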

Knowledge about the underlying distribution affects privacy
• Let X = age; we know that age > 0.
• Suppose ri ~ U(−50, 50).
• [AS]: privacy 100 at 100% confidence.
• But seeing an outcome of −49.038, x is reduced to the interval [0, 1].
• Taking ‘facts of life’ into account affects privacy.

Prior knowledge affects privacy
• Let X = age, ri ~ U(−50, 50).
• [AS]: privacy 100 at 100% confidence.
• Seeing a measurement of −10:
  • Facts of life: Bob’s age is between 0 and 40.
  • Assume you also know Bob has two children: Bob’s age is between 15 and 40.
• A-priori information may be used in attacking individual data.
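The attack is just interval intersection. A minimal sketch (the prior bounds are illustrative):

def interval_for_x(released, a, prior_lo, prior_hi):
    # x + U(-a, a) = released confines x to [released - a, released + a];
    # intersect with whatever the attacker already knows about x.
    return max(released - a, prior_lo), min(released + a, prior_hi)

print(interval_for_x(-10, 50, 0, 200))   # age >= 0           -> (0, 40)
print(interval_for_x(-10, 50, 15, 200))  # two children, etc. -> (15, 40)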

What went wrong?
• De-identified data isn’t!
• “These definitions of privacy are syntactic, not semantic.”
• These attempts fail because they define privacy as the result of some specific algorithm…
• … and they don’t talk about the meaning of preserving privacy.
• Maybe instead we should try to define privacy.

Food For Thought #1
In light of the examples seen today in class:
• Put forward one (or more) property/ies a “good” definition of privacy should satisfy.
• Try to define these properties formally.
• Bonus: try to define what it means to “preserve privacy.”