Distributing Data for Secure Data Services Vignesh Ganapathy
Distributing Data for Secure Data Services. Vignesh Ganapathy, Dilys Thomas, Tomas Feder, Hector Garcia-Molina, Rajeev Motwani. April 8th, 2011. Stanford, TRDDC, TRUST
Roadmap
• Motivation for Secure Databases
• Column level distribution
  – Encryption, Distribution
  – Privacy constraints
  – Set cover initialization
• Query Mediation
  – Cost estimation
  – Where and Select clause processing
  – Query decomposition
• Experiments
• Related Work
Motivation 1: Data Privacy in Enterprises
• Health: personal medical details, disease history, clinical research data
• Govt. Agencies: census records, economic surveys, hospital records
• Banking: bank statements, loan details, transaction history
• Finance: portfolio information, credit history, transaction records, investment details
• Manufacturing: process details, blueprints, production data
• Insurance: claims records, accident history, policy details
• Retail Business: inventory records, individual credit card details, audits
• Outsourcing: customer data for testing, remote DB administration, BPO & KPO
Motivation 2: Government Regulations
Country: Privacy Legislation
• Australia: Privacy Amendment Act of 2000
• European Union: Personal Data Protection Directive 1998
• Hong Kong: Personal Data (Privacy) Ordinance of 1995
• United Kingdom: Data Protection Act of 1998
• United States: Security Breach Information Act (S.B. 1386) of 2002; Gramm-Leach-Bliley Act of 1999; Health Insurance Portability and Accountability Act of 1996
Motivation 3: Personal Information
• Emails
• Searches on Google/Yahoo
• Profiles on Social Networking sites
• Passwords / Credit Card / Personal information at multiple Ecommerce sites / Organizations
• Documents on the Computer / Network
Losses due to Lack of Privacy: ID-Theft
• 3% of households in the US affected by ID-Theft
• US $5-50 B losses/year
• UK £1.7 B losses/year
• AUS $1-4 B losses/year
Data Privacy
• Value disclosure: What is the value of attribute salary of person X?
  – Perturbation, Privacy Preserving OLAP
• Identity disclosure: Whether an individual is present in the database table
  – Randomization, K-Anonymity etc.
  – Data for Outsourcing / Research
• Linkage disclosure: Linking columns from multiple sites
Masketeer: A tool for data privacy. Lodha, Patwardhan, Roy, Sundaram et al.
Two Can Keep a Secret: A Distributed Architecture for Secure Database Services
How to distribute data across multiple sites for (1) redundancy and (2) privacy, so that a single site being compromised does not lead to data loss.
Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani, Srivastava, Thomas, Xu. CIDR 2005
Motivation
• Data outsourcing growing in popularity
  – Cheap, reliable data storage and management
    • 1 TB for $399 (< $0.5 per GB)
    • $5000: Oracle 10g / SQL Server
    • $68k/year: DB admin
• Privacy concerns looming ever larger
  – High-profile thefts (often insiders)
    • UCLA lost 900k records
    • Berkeley lost laptop with sensitive information
    • Acxiom, JP Morgan, Choicepoint
www.privacyrights.org
Present solutions
• Application level: Salesforce.com On-Demand Customer Relationship Management
  – $65/User/Month; $995 / 5 Users / 1 Year
• Amazon Elastic Compute Cloud
  – 1 instance = 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, 250 Mb/s network bandwidth
  – Elastic, Completely controlled, Reliable, Secure
  – $0.10 per instance hour
  – $0.20 per GB of data in/out of Amazon
  – $0.15 per GB-Month of Amazon S3 storage used
• Google Apps for your domain
  – Small businesses, Enterprise, School, Family or Group
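Encryption Based Solution
[Diagram: the Client sends Query Q to the Client-side Processor, which sends Q' to the DSP holding the encrypted data; the DSP returns the "Relevant Data" and the Client-side Processor produces the Answer.]
Problem: Q' can degenerate to "SELECT *"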
The Power of Two
[Diagram: Client, DSP1, DSP2]
The Power of Two
[Diagram: the Client-side Processor splits Query Q into Q1, sent to DSP1, and Q2, sent to DSP2.]
Key: ensure that Cost(Q1) + Cost(Q2) is comparable to Cost(Q)
SB 1386 Privacy
• {Name, SSN}, {Name, LicenceNo}, {Name, CaliforniaID}, {Name, AccountNumber}, {Name, CreditCardNo, SecurityCode} are all to be kept private.
• A set is private if at least one of its elements is "hidden".
• An element in encrypted form is OK.
Techniques
• Vertical Fragmentation
  – Partition attributes across R1 and R2
  – E.g., to obey constraint {Name, SSN}: R1: Name, R2: SSN
  – Use tuple IDs for reassembly: R = R1 JOIN R2
• Encoding
  – One-time Pad: for each value v, construct a random bit sequence r; R1: v XOR r, R2: r
  – Deterministic Encryption: R1: E_K(v), R2: K; can detect equality and push selections with equality predicates
  – Random addition: R1: v + r, R2: r; can push aggregate SUM
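The one-time pad and random-addition encodings above can be illustrated with a minimal sketch. The function names, the sample salaries, and the choice of a 2**64 modulus for the additive shares are assumptions made for illustration; the slides do not specify them.

```python
import secrets

MODULUS = 2**64  # assumed size of the numeric domain; not specified in the slides


def otp_split(value):
    """One-time pad: R1 stores v XOR r, R2 stores the random pad r."""
    pad = secrets.token_bytes(len(value))
    masked = bytes(v ^ p for v, p in zip(value, pad))
    return masked, pad


def otp_join(masked, pad):
    """Client-side reassembly: XOR the two shares to recover v."""
    return bytes(m ^ p for m, p in zip(masked, pad))


def additive_split(value):
    """Random addition: R1 stores v + r, R2 stores r (both modulo MODULUS)."""
    r = secrets.randbelow(MODULUS)
    return (value + r) % MODULUS, r


def pushed_sum(r1_shares, r2_shares):
    """Each server sums its own column; the client subtracts the two totals."""
    return (sum(r1_shares) - sum(r2_shares)) % MODULUS


salaries = [50000, 62000, 71000]  # hypothetical sample values
shares = [additive_split(s) for s in salaries]
r1_column = [masked for masked, _ in shares]
r2_column = [r for _, r in shares]
```

In this sketch, `pushed_sum(r1_column, r2_column)` recovers the true total even though neither column alone reveals any individual salary, which is the sense in which SUM can be pushed to the servers.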
Example
An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}
Privacy Constraints
• {Telephone}, {Email}
• {Name, Salary}, {Name, Position}, {Name, DoB}
• {DoB, Gender, ZipCode}
• {Position, Salary}, {Salary, DoB}
Will use just Vertical Fragmentation and Encoding.
Decomposed Schema
• R1: {TID, Name, Email, Telephone, Gender, Salary}
• R2: {TID, Name, Email, Telephone, DoB, Position, ZipCode}
• Encrypted Attributes E: {Telephone, Email, Name}
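A small check can confirm that the decomposed schema above obeys every privacy constraint. This is a sketch under one reading of the rule "a set is private if at least one of its elements is hidden": a constraint is violated when all of its attributes appear unencrypted at the same server. The helper name `violated_constraints` is hypothetical.

```python
CONSTRAINTS = [
    {"Telephone"}, {"Email"},
    {"Name", "Salary"}, {"Name", "Position"}, {"Name", "DoB"},
    {"DoB", "Gender", "ZipCode"},
    {"Position", "Salary"}, {"Salary", "DoB"},
]
R1 = ["TID", "Name", "Email", "Telephone", "Gender", "Salary"]
R2 = ["TID", "Name", "Email", "Telephone", "DoB", "Position", "ZipCode"]
ENCRYPTED = {"Telephone", "Email", "Name"}


def violated_constraints(constraints, r1, r2, encrypted):
    """Return the constraints whose attributes are all visible in the clear
    at a single server (assumed interpretation of the privacy rule)."""
    visible1 = set(r1) - set(encrypted)
    visible2 = set(r2) - set(encrypted)
    return [c for c in constraints if c <= visible1 or c <= visible2]


# The schema from the slide satisfies every constraint.
ok = violated_constraints(CONSTRAINTS, R1, R2, ENCRYPTED)

# Moving DoB into R1 would expose {Salary, DoB} at server 1.
bad = violated_constraints(CONSTRAINTS, R1 + ["DoB"], R2, ENCRYPTED)
```

A check of this kind is what any partitioning search has to pass at every candidate schema before its cost is even considered.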
Partitioning, Execution
• Partitioning Problem
  – Partition to minimize communication cost for given workload
  – Even simplified version hard to approximate
  – Hill Climbing algorithm after starting with weighted set cover
• Query Reformulation and Execution
  – Consider only centralized plans
  – Algorithm to partition select and where clause predicates between the two partitions
Set Cover + Greedy for Partitioning
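The deck does not spell out how the partitioning problem is mapped onto set cover, so the following is only the textbook greedy heuristic for weighted set cover, applied under one assumed mapping: each privacy constraint is an element to cover, each attribute "covers" the constraints it appears in, and an attribute's weight is a hypothetical cost of hiding it. The function name, the mapping, and the unit weights are all assumptions.

```python
def greedy_weighted_set_cover(universe, candidates, weights):
    """Repeatedly pick the candidate with the lowest weight per newly
    covered element until every element of the universe is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        useful = [name for name in candidates if candidates[name] & uncovered]
        if not useful:
            raise ValueError("universe cannot be covered by the candidates")
        best = min(
            useful,
            # Ties are broken by name so the result is deterministic.
            key=lambda n: (weights[n] / len(candidates[n] & uncovered), n),
        )
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


# Privacy constraints of the Employee example.
constraints = [
    {"Telephone"}, {"Email"},
    {"Name", "Salary"}, {"Name", "Position"}, {"Name", "DoB"},
    {"DoB", "Gender", "ZipCode"},
    {"Position", "Salary"}, {"Salary", "DoB"},
]
attributes = sorted(set().union(*constraints))
# Assumed mapping: attribute -> indices of the constraints it can break.
covers = {a: {i for i, c in enumerate(constraints) if a in c} for a in attributes}
unit_cost = {a: 1.0 for a in attributes}  # hypothetical hiding costs

hidden = greedy_weighted_set_cover(range(len(constraints)), covers, unit_cost)
```

Under this mapping the result is a starting set of attributes to hide; per the previous slide, hill climbing would then refine such a starting point against the workload's communication cost.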
Cost Estimation
State Definitions
• 0: condition clause cannot be pushed to either server
• 1: condition clause can be pushed to Server 1
• 2: condition clause can be pushed to Server 2
• 3: condition clause can be pushed to both servers
• 4: condition clause can be pushed to either server
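As an illustration of how a single predicate might be assigned one of these states, here is a minimal sketch. It assumes that a predicate on one attribute can be pushed to a server exactly when that server stores the attribute, and that an encrypted attribute supports only equality tests (as with deterministic encryption). Under these assumptions state 3 never arises for a single predicate. The names `PushState` and `leaf_state` are hypothetical, and the rules for combining states under AND and OR are not covered.

```python
from enum import IntEnum


class PushState(IntEnum):
    NEITHER = 0   # cannot be pushed to either server
    SERVER1 = 1   # can be pushed to Server 1
    SERVER2 = 2   # can be pushed to Server 2
    BOTH = 3      # can be pushed to both servers
    EITHER = 4    # can be pushed to either server


def leaf_state(attribute, operator, r1, r2, encrypted):
    """Classify one predicate `attribute operator constant` (assumed rules)."""
    if attribute in encrypted and operator != "=":
        # Assumption: encrypted columns only support equality checks.
        return PushState.NEITHER
    in1, in2 = attribute in r1, attribute in r2
    if in1 and in2:
        return PushState.EITHER
    if in1:
        return PushState.SERVER1
    if in2:
        return PushState.SERVER2
    return PushState.NEITHER


R1 = {"TID", "Name", "Email", "Telephone", "Gender", "Salary"}
R2 = {"TID", "Name", "Email", "Telephone", "DoB", "Position", "ZipCode"}
E = {"Telephone", "Email", "Name"}
```

With the Employee schema, `leaf_state("Salary", ">", R1, R2, E)` gives state 1 and `leaf_state("Position", "=", R1, R2, E)` gives state 2.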
OR State Evaluation
AND State Evaluation
Query Partitioning
R1: {TID, Name, Email, Telephone, Gender, Salary}
R2: {TID, Name, Email, Telephone, DoB, Position, Zipcode}
Original Query: SELECT Name, DoB, Salary FROM R WHERE (Name = 'Tom' AND Position = 'Staff') AND (Zipcode = '94305' OR Salary > 60000)
• Query 1: SELECT TID, Name, Salary FROM R1 WHERE Name = 'Tom'
• Query 2: SELECT TID, DoB, Zipcode FROM R2 WHERE Position = 'Staff'
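The client-side step that follows Query 1 and Query 2 can be sketched as a join on TID plus the leftover OR predicate, which spans both fragments and so cannot be pushed to either server. The function name and the sample rows are hypothetical, and Name is shown in plaintext for readability, on the assumption that the client has already decrypted it.

```python
def combine_results(rows1, rows2):
    """Join the two partial results on TID, then apply the residual
    predicate (Zipcode = '94305' OR Salary > 60000) at the client."""
    by_tid = {row["TID"]: row for row in rows2}
    result = []
    for left in rows1:
        right = by_tid.get(left["TID"])
        if right is None:
            continue  # tuple failed Position = 'Staff' at Server 2
        if right["Zipcode"] == "94305" or left["Salary"] > 60000:
            result.append(
                {"Name": left["Name"], "DoB": right["DoB"], "Salary": left["Salary"]}
            )
    return result


# Hypothetical answers to Query 1 (from R1) and Query 2 (from R2).
query1_rows = [
    {"TID": 1, "Name": "Tom", "Salary": 70000},
    {"TID": 2, "Name": "Tom", "Salary": 40000},
    {"TID": 3, "Name": "Tom", "Salary": 50000},
]
query2_rows = [
    {"TID": 1, "DoB": "1970-01-01", "Zipcode": "10001"},
    {"TID": 3, "DoB": "1980-05-05", "Zipcode": "94305"},
    {"TID": 4, "DoB": "1990-09-09", "Zipcode": "94305"},
]

answer = combine_results(query1_rows, query2_rows)
```

Zipcode and Salary appear in the two SELECT lists only so that this residual predicate can be evaluated after the join; Zipcode is dropped from the final projection.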
Distributed Query Plan
Performance Gain Experiment
Iterations Vs Privacy Constraints
Acknowledgements: Collaborators Stanford Privacy Group TRDDC Privacy Group PORTIA, TRUST, Google
Back Up slides March 18, 2011