Obtaining Useful Information while Protecting Privacy Gio Wiederhold
Emeritus Professor, CSD, EE, Med., Stanford University
4 November 2005, UMd FISOO presentation
www-db.stanford.edu/people/gio.html

Background
• 19 years in industry
– Systems for DoD, commerce, hospitals
– Became a manager; had to deal only with people problems
• Decided to get a Ph.D. and became an academic
– Main funding again DoD, health care
– Actually similar
• Spent three years as a DARPA program manager
• After retirement in 2001, mainly work for Treasury

Objective of Databases
• Capable of providing useful information
– Novel
  • If you know it already, it is just data, not information
  • Combining old data can create novel information
– Relevant
  • SELECTing the right data is much work
  • Focus of all the search engines
– Correct
  • Need metadata that engender trust
– Actionable
  • Receiver must have resources to affect the future

Liabilities of Databases
• Intrusions
– Substantial information is concentrated
– More can be aggregated from on-line sources
• Destruction
– Loss of information can deny rights (FoIA)
– Loss of information increases risks (uncertainty)
• Falsification
– Leads to wrongful actions (phishing, pharming)
– Causes broad loss of trust when discovered
Complexity and sloppiness allow break-ins

Conflict
• Legitimate needs for dissemination of data obtained from individuals for society
– Clinical research
– Security threats
– Fairness in governmental decision-making
• Concerns about exposure of
– Corporate data
– Individual information: privacy
– Governmental secrets

Privacy Concerns
• Legal
– Ill-defined
• Loss of opportunities
– Threat of constraints on improving one's destiny
• Emotional
– Embarrassment
– Often unrealistic
• Politically exploited
Still a valid concern when dealing with citizens

Threats to Privacy
• Release of facts about personal life
– Stay in a psychiatric ward
• Release of facts that infer attitudes
– Subscription to The Onion → cynicism
• Release of facts that imply associations
– Visiting imprisoned friend
All protected by keeping facts anonymous
Threat: ability to break anonymity by combining data

Problems of Inference Control
• Multiple sources
– Combining data creates novel information
• Variability of cell sizes
– Partitioned: cells ≈ equal size (people per census district)
– Natural: wide, Zipfian distribution (word usage, diseases, behavior)
– Both types of columns will exist in databases
– Intersecting cells with few data identifies individuals

3 Alternatives to protect data
1. Limit access to trusted analysts
2. Transform accessible copy
3. Transform obtained results

Alt. 1 Limit access
• Define trusted personnel
– Must keep numbers small to control risk
  » 8000 agents have access to FBI warehouse DB
– Fine-grained access
  • Partition by roles
  • Maintain boundaries within the DB by roles
  » Roles intersect
Complexity reduces benefits greatly

Roles in a Hospital
[Diagram. Roles shown: Medical research, Billing, Patient, Inpatient, Insurance carriers, Physician, Pharmacy, Laboratory staff, Clinics, Laboratory, Accreditation, Accounting, Ward staff, CDC, etc.]

Partitioning for roles requires foresight
Assigning each data element to the roles
Note: new needs arise through time
• Done by the data creator?
– The creator of data cannot predict who can beneficially use the data in the future
• Done by the individual?
– The source of data cannot predict who can beneficially use the data in the future
• Done by the database steward?
– An overwhelming, ongoing task

Alt. 2 Transform Contents
• Typically done on a copy for the public
– ≈ to restricted write-down operation [dePadula] ["Markovitz" @CIA]
– (The original remains accessible to trusted personnel)
1. Anonymizing
– Removal of identifying data
– Grouping: all incomes over $150K in one group
2. Aggregation
– Data now apply to groups, not individuals
3. Perturbing the data by changing detail
– Misleads inferencing attempts
– Prevents legal trace

Anonymize to Protect Privacy
Remove identifying key fields
• Obvious
– Name, SSN, full address, phone number, ...
• Intersections are powerful
– "87% of people are identifiable given DoB, gender, 5-digit ZIP" [LS]
– Must remove much because of intersection problem
– High salary, profession CPA, location, age range → CFO
• Unexpected, external information can be keys
– A sequence of 5 clinic visit dates identifies most medical records

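The intersection problem can be illustrated with a minimal sketch. It counts how many records are singled out by a combination of fields that look harmless one at a time. The records are made up for illustration, and the function name `unique_fraction` is an assumption, not something from the talk.

```python
from collections import Counter

# Hypothetical, made-up records: (date of birth, gender, 5-digit ZIP).
# No names or SSNs are present, yet the combination can still be a key.
records = [
    ("1961-03-02", "F", "94305"),
    ("1961-03-02", "F", "94305"),
    ("1975-11-20", "M", "94305"),
    ("1975-11-20", "M", "20742"),
    ("1988-07-14", "F", "20742"),
]


def unique_fraction(rows):
    """Fraction of rows whose field combination occurs exactly once."""
    counts = Counter(rows)
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


fraction = unique_fraction(records)  # 3 of the 5 rows are unique here
```

Anyone holding an external list with the same three fields plus a name could re-identify the unique rows, which is why so much has to be removed.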
Aggregate To Protect Privacy
• Simplifying rule
– Makes it unlikely that inference is possible
• Aggregate data so that any results pertain to a bunch of individuals, say n > 15
– By census tract, by disease class, etc.
• One aggregation hierarchy per attribute, or attribute cluster
– Sufficient to disable inference among DB fields
  • Those are known: closed world
– Sufficient to make inference with public data hard
  • The extent of those can only be guessed

Transform DB using Aggregation
• Rule: Cells should contain data of at least 15 persons
– Aggregate according to all hierarchies
– Stop when rule is satisfied
[Figure: cell counts of the DB aggregated "By color" and "By stripe"; the fully aggregated DB is labeled "Uninformative DB".]
Can no longer answer any color queries

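The "aggregate, stop when rule is satisfied" procedure can be sketched for a single attribute as follows. This is only an illustration under stated assumptions: the ZIP-code hierarchy, the sample values, and the name `aggregate_until_safe` are hypothetical; only the threshold of 15 comes from the slide.

```python
from collections import Counter

MIN_CELL = 15  # the slide's rule: at least 15 persons per cell

# Hypothetical hierarchy for one attribute, from finest to coarsest level.
HIERARCHY = [
    lambda zip5: zip5,              # level 0: full 5-digit ZIP
    lambda zip5: zip5[:3] + "**",   # level 1: 3-digit prefix
    lambda zip5: "*****",           # level 2: everything in one cell
]


def aggregate_until_safe(values, hierarchy=HIERARCHY, min_cell=MIN_CELL):
    """Return (level, cell counts) for the finest level satisfying the rule.

    Returns None if the input is empty or even the coarsest level is too small.
    """
    if not values:
        return None
    for level, generalize in enumerate(hierarchy):
        cells = Counter(generalize(v) for v in values)
        if min(cells.values()) >= min_cell:
            return level, dict(cells)
    return None


values = ["94305"] * 20 + ["94301"] * 4 + ["20742"] * 16
safe = aggregate_until_safe(values)
```

With these made-up values the cell of 4 forces one step up the hierarchy, so the stored copy can no longer distinguish 94305 from 94301 for any later query.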
Alt. 3 Result checking
• Use full database or minimal partitioning
• Validate requestor and assign role
• Execute query, but hold result
• Process result
– Anonymity
– Aggregation
– Inappropriate contents
• Additional capability because now actual content is available; before we only had metadata

Transform result; aggregate if needed
Query: Blue vs Red
– Result cells: 40, 36, 50, 45
– Check: all cells are ok
– All requested information can be returned
Query: Blue vs Yellow
– Result cells: 40, 2, 50, 5
– Check: requires aggregation (to 42 and 55)
– Only partition by stripe can be returned
• Same rule: Result cells should contain at least 15 persons
– Aggregate result when needed to satisfy rule
– Stop when rule is satisfied
– Note: most queries specify a hierarchy

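The two checks on this slide can be sketched as a small release filter that runs on the held query result. The cell counts (40, 36, 50, 45 and 40, 2, 50, 5) are the ones on the slide; the assignment of counts to the stripe labels "striped" and "plain", and the function name `release`, are assumptions made for illustration.

```python
MIN_CELL = 15  # same rule as before: at least 15 persons per result cell


def release(result, min_cell=MIN_CELL):
    """Check a held result of the form {color: {stripe: count}}.

    Returns (what can be returned, cells). Aggregates over color only
    when some cell violates the rule.
    """
    cells = [n for by_stripe in result.values() for n in by_stripe.values()]
    if all(n >= min_cell for n in cells):
        return "by color and stripe", result

    # Aggregate up: drop the color distinction, keep the stripe partition.
    by_stripe_totals = {}
    for by_stripe in result.values():
        for stripe, n in by_stripe.items():
            by_stripe_totals[stripe] = by_stripe_totals.get(stripe, 0) + n
    if all(n >= min_cell for n in by_stripe_totals.values()):
        return "by stripe only", by_stripe_totals
    return "withheld", {}


blue_vs_red = {
    "blue": {"striped": 40, "plain": 50},
    "red": {"striped": 36, "plain": 45},
}
blue_vs_yellow = {
    "blue": {"striped": 40, "plain": 50},
    "yellow": {"striped": 2, "plain": 5},
}
```

The color detail is lost only for the query that needs it; the blue vs red query still gets its full answer, which a pre-aggregated database copy could not provide.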
Why does Alt. 3 work better?
• The DB aggregation process fails to deal with the objective: ensure that the result is ok!
1. It tries to assure proper results by changing the DB
2. It follows the traditional DB paradigm: if it can be obtained from the DB, it can be given out
3. Results are not filtered
Hence: If you want to control results given out, analyze the results, not the database!

Why are results not processed?
• Access control is needed for disaster protection
• Access control requires only metadata
• Commercial databases have good access control
And since access control exists
• it became the tool for privacy protection
• Looking under the lamppost for the lost keys
Processing of results must be done for every query
• Converting the entire database is done only periodically
Result processing
– Can be added: security mediator
– Becomes more efficient with a new DBMS architecture

Current DBMS architecture
[Diagram. Labels: Security officer; security needs; authentication-based good/bad control; good guy; Database administrator; DB schema-based access control; valid query; Database; result O.K.?]

Mediated Protection (TIHI)
[Diagram. Labels: Security officer; security needs; authentication-based good/bad control; good guy; prior use; history; Security Mediator; good query; processable query; Database administrator; DB schema-based control; performance, function requests; Database; result maybe O.K.; cell counts validated to be O.K.]

Summary for inference control
Shared approach: Assure that each result cell is based on many instances, reducing the likelihood of successful inference
Alt. 2: Access control
– Estimate expected lowest counts for any possible query result
– Aggregate data to an always safe hierarchical level
Alt. 3: Release filter
– Use cell counts in result
– If inadequate, aggregate up the hierarchy until count is adequate
Never absolutely secure, but the best one can do now!

Release Control Benefits
Release control provides more useful information than access control alone can
It also
• Allows filtering of inappropriate content
Inappropriate content can be due to
– Errors in original data classification to a role: common, especially for long-lived databases
– Random errors: ~10% in medical records, ~5% in ancillary financial data
– Break-ins: once the crook is in, any data is fair game today

Example 1: Released Data
Music web site loses 15 000 credit card #s
1. Site is entered by presumed customer
2. Customer pays $1 for a tune (≈ 500 Kbytes)
3. Customer knows a trapdoor, takes out CC#s
4. 15 000 real customers now are at risk
Such a scenario is unfortunately frequent [Business Week 28 Apr 05]
Release control can easily recognize credit card #s
– A single one with p > 90%, and ten with p > 99.9999%
– and when only music is being sold, the test is absolute

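A minimal sketch of how a release filter might recognize credit card numbers in outgoing content. It uses the Luhn check digit, a standard checksum on payment card numbers; the talk does not say which test it has in mind, so treating Luhn as that test is an assumption. A random digit string passes Luhn about one time in ten, which may be what the 90% figure refers to. The names `CANDIDATE`, `luhn_valid`, and `find_card_numbers` are illustrative.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit counted from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(payload: str):
    """Return digit strings in the outgoing payload that pass the Luhn test."""
    hits = []
    for match in CANDIDATE.finditer(payload):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

For a site that sells only music, any hit in an outgoing payload is reason enough to hold the release for the security officer.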
Example 2: Released Data
Intel official moves secret data offsite
1. Official has high access rights
2. Studies some case
3. Has to go home, wants to finish work
4. Copies file to personal computer (possible for anyone with access rights)
5. Major scandal when accidentally discovered
Unfortunately, there have also been cases where result copying was not accidental
All results leaving a secure setting should be filtered
Logs only help after discovery

Back to Inference
• Cell size rules are ad hoc
• Improve using ongoing research
1. k-anonymity [LS: Latanya Sweeney @ CMU]
2. Mistrust the closed-world assumption: not all the potential links are controllable. Consider external databases and widen range of linkages
3. Risk assessment as a base
More systems + theory integrative research is needed. That doesn't mean we should now bury our heads in the sand.
Caveat: I have not integrated security mediation with research on inference. I won't start a new research enterprise; already busier in retirement than I expected to be. I do hope that others will follow up.

Risk assessment principle
• Work backwards from the problem
• Establish cost of loss c
1. Apply feasible solution (will always be partial)
• Determine effectiveness percentage e
• Compute new loss c' = c × e
• Subtract solution implementation cost s
• Go back to step 1 if s < q − c'
Not forwards from a technical menu of known solutions

Commercial Release Support
• Ponemon Institute [Tucson, AZ] & Vontu [San Francisco, CA]: filters outgoing email only
• Tablus [San Mateo, CA]: linguistic pattern matching on all outbound traffic
• Reconnex [Mountain View, CA]: filter appliance on outgoing IP port
• Vericept [Englewood, CO]: Internet traffic filter
• Vertasys [Wyomissing, PA]: consultants
• Vidius [Beverly Hills, CA; Tel Aviv, Israel]: information leak prevention
• Zix [Cambridge, MA]: content filtering, enforces encryption
• Clear Swift Formail [???]: egress filtering ???
Not yet a science

Architecture: Security In and Out
Like a warehouse store
• Security in: access control
– Protect against enemies
– Protect against hackers
– Assure entry of customers
– Assure entry of collaborators
• Collected contents
• Security out: release filter
– Release only what was purchased
– Share only what is appropriate

General Release filters
• Benefits
1. Can reduce statistical results to limit inference
2. Can look at actual contents, not just metadata (credit card numbers do not look like music)
3. Can be tuned to current requirements (new situations do not require data relabeling and reorgs)
4. Independent of software failures
5. Potential for intrusion detection
• Costs
1. Customer's role information must be kept until done
2. Checking textual data requires computational effort (paranoid word matching, noun phrases, linguistics, in images)
3. False negatives are possible, likely for some data
Need a security officer with tools and override authority

Who is in charge?
Enterprises need a distinct Security Officer who understands the sharing vs. protection risk balance [Gilligan]
Security & privacy protection should not be handled by the
• Database administrator: must make data available to valid users
• Network administrator: must assure accessibility by valid users
• Cryptographer: creates important tools, but serves binary settings
Protection of privacy is not an absolute issue.

Conclusion: Don't rely on access control when the objective is to 1. get useful information and 2. protect data!
• DB aggregation makes data useless [Lin Zhen: Stanford Med. Inf. thesis]
– Creating a public-access copy must be very conservative
– Aggregating results allows production of useful information
• All new usages of data cannot be foreseen
• New, unthought-of collaborators: Russians in Kosovo needed access to US map info [Gilligan]
– Supply-line integration
• System failures (trap doors, etc.) abound
– Release checking provides a backstop, intrusion detection

"As far as we know, we have never had an unreported loss of private information."
Courtesy of Mike Morris

References on commercial Release Control
• Tablus, http://www.tablus.com/ (San Mateo, CA). Pattern matching on all outbound traffic for selected types of information. Jim Nisbet, Founder and Chief Technology Officer, 650 572-1515. Publication: Jim Nisbet: The Security Role of Linguistic Content Analysis. LISA 2004.
• Reconnex inSight, http://www.reconnex.net/ (Mountain View, CA). Appliance on outgoing port filters registered data. PR contact kristin@engagepr.com, 510-748-8200 x204.
• Vontu, http://www.vontu.com/ (San Francisco, CA 94111, 415-364-8100) and Ponemon Institute, http://www.ponemon.org/ (Tucson, AZ, 520-290-3400). Outgoing email only. Dr. Larry Ponemon, Ponemon Institute, research@ponemon.org. Larry Ponemon works with US Bancorp. Site provides webcasts on demand.
• Vericept, http://www.vericept.com/ (Englewood, CO). All Internet traffic; worked with Visa list. Tery R. Larrew, CEO, (303) 798-1568; (800) 262-0274.
• Vertasys, http://vertasys.com (consultants, Wyomissing, PA).
• Vidius, http://www.vidius.com/ (Beverly Hills, CA; Tel Aviv, Israel). Information leak prevention.
• Zix Corporation, http://www.zixcorp.com/. Danny Sands, dsands@bidmc.harvard.edu. Content filtering product for e-mail (algorithmically determines if messages contain PHI and then encrypts as needed). Used in some hospitals?

Release filter
Collaborators are allowed in, but what they take out must be controlled.
Symmetric checking of
1. access to information systems using metadata, and also
2. checking of the release of the actual result contents
– Act like a warehouse store
• Check and/or remove restricted topics in outgoing documents
a. Researchers: names, employers, addresses, emails, ...
b. Payors: other incidents, prior diseases, admissions, ...
c. ...: check specific contents for each collaborating & authorized role
• Better: check that all terms in outgoing documents are acceptable
– Use a topic-specific inclusive word/phrase list, and filter others
– Paranoia is safest, and the cost is bearable: most application / usage areas use less than 3000 terms
• Trapped documents can be released by a security officer
• Extract text from images, such as x-rays, and then check those texts
– Many media contain unexpected private or identifying data

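The "inclusive word list" idea can be sketched as follows: every term in an outgoing document must appear on a topic-specific allow-list, and anything else traps the document for the security officer. The word list, the sample documents, and the function names are hypothetical; a real list would hold the up to 3000 terms of the application area.

```python
import re

# Hypothetical allow-list for a study comparing two medications.
ALLOWED_TERMS = {
    "treatment", "length", "for", "medication", "a", "b",
    "was", "shorter", "longer", "than",
}


def unacceptable_terms(document, allowed=ALLOWED_TERMS):
    """Return the sorted set of terms that are not on the allow-list."""
    words = re.findall(r"[a-z0-9']+", document.lower())
    return sorted({w for w in words if w not in allowed})


def release_or_trap(document, allowed=ALLOWED_TERMS):
    """Paranoid rule: release only if every term is explicitly acceptable."""
    bad = unacceptable_terms(document, allowed)
    if bad:
        return "trap", bad   # held for review by the security officer
    return "release", []


doc_ok = "Treatment length for medication A was shorter than for medication B"
doc_bad = "Treatment length for John Smith was longer"
```

Because the list is inclusive, names, numbers, and anything else unforeseen are trapped by default, at the price of some documents needing a manual release.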
Access control flaws
False assumptions
1. Legitimate users are good guys and girls
– We often have more users than we can check
2. All data can be identified for all uses
– Future use is unpredictable
  1. Allowing is risky [Kosovo data to Russian allies]
  2. Not allowing is costly to collaboration
– Role complexity is manageable by contributor
3. All data is correctly identified
– No errors in metadata

Inference example
Patients and clinic visits
Information objective: Does medication a work better than medication b?
Outcome measure: length of treatment time
Privacy concern
• Don't show who went to HIV clinic

Concept Exists
• Like a warehouse
1. Check metadata on entry
2. Collect contents
3. Check contents at the exit
[Diagram labels: Access control; Collected contents; Release filter]