Architectures and Algorithms for Data Privacy Dilys Thomas

Architectures and Algorithms for Data Privacy
Dilys Thomas, Stanford University, April 30th, 2007
Advisor: Rajeev Motwani

Roadmap
• Motivation for Data Privacy Research
• Sanitizing Data for Privacy
    - Privacy Preserving OLAP
    - K-Anonymity / Clustering for Anonymity
    - Probabilistic Anonymity
    - Masketeer
• Auditing for Privacy
• Distributed Architectures for Privacy

Motivation 1: Data Privacy in Enterprises
• Health: personal medical details, disease history, clinical research data
• Banking: bank statements, loan details, transaction history
• Govt. Agencies: census records, economic surveys, hospital records
• Finance: portfolio information, credit history, transaction records, investment details
• Manufacturing: process details, blueprints, production data
• Insurance: claims records, accident history, policy details
• Retail Business: inventory records, individual credit card details, audits
• Outsourcing: customer data for testing, remote DB administration, BPO & KPO

Motivation 2: Government Regulations

Country          Privacy Legislation
Australia        Privacy Amendment Act of 2000
European Union   Personal Data Protection Directive 1998
Hong Kong        Personal Data (Privacy) Ordinance of 1995
United Kingdom   Data Protection Act of 1998
United States    Security Breach Information Act (S.B. 1386) of 2002
                 Gramm-Leach-Bliley Act of 1999
                 Health Insurance Portability and Accountability Act of 1996

Motivation 3: Personal Information
• Emails
• Searches on Google/Yahoo
• Profiles on social networking sites
• Passwords / credit card / personal information at multiple e-commerce sites / organizations
• Documents on the computer / network

Losses due to Lack of Privacy: ID-Theft
• 3% of households in the US affected by ID theft
• US: $5-50 B losses/year
• UK: £1.7 B losses/year
• AUS: $1-4 B losses/year

Privacy Preserving Data Analysis, i.e. Online Analytical Processing (OLAP)
Computing statistics of data collected from multiple data sources while maintaining the privacy of each individual source.
Agrawal, Srikant, Thomas. SIGMOD 2005

Privacy Preserving OLAP
• Motivation
• Problem Definition
• Query Reconstruction
    - Inversion method: single attribute, multiple attributes
    - Iterative method
• Privacy Guarantees
• Experiments

Horizontally Partitioned Personal Information
[Figure: clients C1, C2, ..., Cn each hold an original row r1, r2, ..., rn and send a perturbed row p1, p2, ..., pn; the server holds table T of perturbed rows for analysis]
EXAMPLE: What number of children in this county go to college?

Vertically Partitioned Enterprise Information

Original Relation D1        Perturbed Relation D'1
ID     C1                   ID     C1
John   1                    John   1
Alice  5                    Alice  7
Bob    18                   Bob    18

Original Relation D2        Perturbed Relation D'2
ID     C2  C3               ID     C2  C3
John   27  9                John   35  9
Alice  53  6                Alice  53  7

Perturbed Joined Relation D'
ID     C1  C2  C3
John   1   35  9
Alice  7   53  7

EXAMPLE: What fraction of United customers to New York fly Virgin Atlantic to travel to London?

Privacy Preserving OLAP: Problem Definition
Compute
    select count(*) from T where P1 and P2 and P3 and ... Pk
E.g., find the number of people with age in [30-50] and salary in [80-150],
i.e. $\mathrm{COUNT}_T(P_1 \wedge P_2 \wedge P_3 \wedge \dots \wedge P_k)$
Goals:
• Provide error bounds to the analyst.
• Provide privacy guarantees to the data sources.
• Scale to a larger number of attributes.

Perturbation Example: Uniform Retention Replacement
Throw a biased coin
• Heads: retain
• Tails: replace with a random number from a predefined pdf
[Figure: five values with coin outcomes Tails, Tails, Heads, Tails, Tails; BIAS = 0.2; on tails the value is replaced u.a.r. from [1-5]]

Retention Replacement Perturbation
• Done for each column
• The replacing pdf need not be uniform
    - Best to use the original pdf if available / estimable
• Different columns can have different biases for retention
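
The retention replacement step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `perturb_column`, the sample ages, and the choice of a uniform integer replacing pdf are hypothetical and not taken from the talk.

```python
import random

def perturb_column(values, p, low, high, rng=None):
    """Retention replacement for one column.

    Each value is kept with probability p (heads); otherwise (tails) it is
    replaced by an integer drawn uniformly at random from [low, high].
    The uniform replacing pdf is an assumption; the slides note that other
    replacing pdfs may be used.
    """
    rng = rng or random.Random()
    perturbed = []
    for v in values:
        if rng.random() < p:
            perturbed.append(v)                       # heads: retain
        else:
            perturbed.append(rng.randint(low, high))  # tails: replace
    return perturbed

ages = [23, 35, 41, 58, 64]  # hypothetical column
print(perturb_column(ages, p=0.2, low=0, high=100, rng=random.Random(0)))
```

Each column would call this with its own retention bias `p`, matching the remark that different columns can have different biases.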

Single Attribute Example
What is the fraction of people in this building with age 30-50?
• Assume age is between 0 and 100.
• Whenever a person enters the building, a coin with heads probability p = 0.2 is flipped.
    - Heads: the true age is reported (RETAIN)
    - Tails: a random number uniform in 0-100 is reported (PERTURB)
• In total, 100 randomized numbers are collected.
• Of these, 22 are in 30-50.
• How many among the original are in 30-50?

Analysis
Out of 100 rows: 80 perturbed (0.8 fraction), 20 retained (0.2 fraction).

Analysis Contd.
20% of the 80 randomized rows, i.e. 16 of them, satisfy Age[30-50]. The remaining 64 don't.

Analysis Contd.
Since there were 22 randomized rows in [30-50], 22 - 16 = 6 of them come from the 20 retained rows.

                 Age[30-50]   NOT Age[30-50]
Retained (20)    6            14
Perturbed (80)   16           64

Scaling up

Total Rows   Age[30-50]
20           6
100          30

Thus 30 people had age 30-50 in expectation.
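
The arithmetic of the worked example can be written as a small function. This is a minimal sketch; the function name and parameter names are illustrative, with p the retention probability and b the probability that a replaced value satisfies the predicate.

```python
def estimate_original_count(observed_count, total_rows, p, b):
    """Invert retention replacement for a single predicate.

    In expectation: observed = p * original + (1 - p) * b * total_rows,
    so original = (observed - (1 - p) * b * total_rows) / p.
    """
    expected_from_replaced = (1 - p) * b * total_rows      # 16 in the example
    retained_matching = observed_count - expected_from_replaced  # 6 in the example
    return retained_matching / p                            # scale 20 rows up to 100

estimate = estimate_original_count(observed_count=22, total_rows=100, p=0.2, b=0.2)
print(round(estimate, 6))
```

With the numbers from the slides (22 observed, 100 rows, p = 0.2, b = 0.2) the estimate agrees with the 30 people derived above, up to floating-point rounding.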

Multiple Attributes (k=2)
P1 = Age[30-50], P2 = Salary[80-150]

Query                Estimated on T   Evaluated on T'
count(¬P1 ∧ ¬P2)     x0               y0
count(¬P1 ∧ P2)      x1               y1
count(P1 ∧ ¬P2)      x2               y2
count(P1 ∧ P2)       x3               y3

Architecture
[Figure]

Formally: select count(*) from R where Pred
• p = retention probability (0.2 in the example)
• 1 - p = probability that an element is replaced using the replacing p.d.f.
• b = probability that an element drawn from the replacing p.d.f. satisfies predicate Pred (0.2 in the example)
• a = 1 - b

Transition matrix

$$\begin{pmatrix} \mathrm{Count}_T(\neg P) & \mathrm{Count}_T(P) \end{pmatrix} \begin{pmatrix} (1-p)a + p & (1-p)b \\ (1-p)a & (1-p)b + p \end{pmatrix} = \begin{pmatrix} \mathrm{Count}_{T'}(\neg P) & \mathrm{Count}_{T'}(P) \end{pmatrix}$$

i.e. solve $xA = y$.
$A_{00}$ = probability that the original element satisfies $\neg P$ and after perturbation still satisfies $\neg P$:
• $p$ = probability it was retained
• $(1-p)a$ = probability it was perturbed and satisfies $\neg P$
• $A_{00} = (1-p)a + p$
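
As a sanity check of the transition matrix, the sketch below builds A for one predicate and solves xA = y with an explicit 2x2 inverse. The helper names are hypothetical, and the perturbed counts y = [78, 22] are the ones implied by the single-attribute example (22 of 100 perturbed rows in Age[30-50]).

```python
def transition_matrix(p, b):
    """Rows: original state (not P, P). Columns: perturbed state (not P, P)."""
    a = 1 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]

def solve_xA_equals_y(A, y):
    """Solve x A = y for a 2x2 matrix A, i.e. x = y * inverse(A)."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[ A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det,  A[0][0] / det]]
    return [y[0] * inv[0][0] + y[1] * inv[1][0],
            y[0] * inv[0][1] + y[1] * inv[1][1]]

A = transition_matrix(p=0.2, b=0.2)
x = solve_xA_equals_y(A, y=[78, 22])
print([round(v, 6) for v in x])
```

The second component of x is the reconstructed count for P, which matches the estimate of 30 obtained by the step-by-step analysis.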

Multiple Attributes
For k attributes:
• $x$, $y$ are vectors of size $2^k$ (indexed $0$ to $2^k - 1$)
• $xA = y$, i.e. $x = yA^{-1}$, where $A = A_1 \otimes A_2 \otimes \dots \otimes A_k$ [tensor product]
• $A_i$ is the transition matrix for column $i$
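
A small sketch of the tensor-product construction for k = 2 follows. It relies on the standard identity that the inverse of a Kronecker product is the Kronecker product of the inverses, so only 2x2 matrices are inverted. All function names, the second column's parameters (p = 0.2, b = 0.3), and the counts in `x_true` are assumptions chosen for illustration.

```python
def transition_matrix(p, b):
    a = 1 - b
    return [[(1 - p) * a + p, (1 - p) * b],
            [(1 - p) * a,     (1 - p) * b + p]]

def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[ M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det,  M[0][0] / det]]

def kron(A, B):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def vec_mat(x, M):
    """Row vector times matrix."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

A1 = transition_matrix(p=0.2, b=0.2)   # column 1, e.g. Age
A2 = transition_matrix(p=0.2, b=0.3)   # column 2, hypothetical parameters
A = kron(A1, A2)                       # 4x4 matrix over (not P1/P1, not P2/P2)

x_true = [40, 30, 20, 10]              # hypothetical original counts x0..x3
y = vec_mat(x_true, A)                 # expected perturbed counts y0..y3
x_rec = vec_mat(y, kron(inverse_2x2(A1), inverse_2x2(A2)))
print([round(v, 6) for v in x_rec])
```

The index order of the 4-vectors follows the k = 2 table: index 0 is neither predicate, index 3 is both.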

Error Bounds
• In our example, we want to say that when the estimated answer is 30, the actual answer lies in [28-32] with probability greater than 0.9.
• Given $T \xrightarrow{\alpha} T'$, with $n$ rows, $f(T)$ is $(n, \varepsilon, \delta)$-reconstructible by $g(T')$ if $|f(T) - g(T')| < \max(\varepsilon, \varepsilon f(T))$ with probability greater than $(1 - \delta)$.
• $\varepsilon f(T) = 2$, $\delta = 0.1$ in the above example.

Theoretical Basis and Results
Theorem: The fraction, $f$, of rows in [low, high] in the original table, estimated by matrix inversion on the table obtained after uniform perturbation, is an $(n, \varepsilon, \delta)$ estimator for $f$ if $n > 4 \log(2/\delta)\,(p\varepsilon)^{-2}$, by Chernoff bounds.
Theorem: The vector, $x$, obtained by matrix inversion is the MLE (maximum likelihood estimator), shown by using the Lagrangian multiplier method and showing that the Hessian is negative.

Iterative Algorithm [AS 00]
Initialize: $x^0 = y$
Iterate: $x_p^{T+1} = \sum_{q=0}^{t} y_q \, \dfrac{a_{pq}\, x_p^{T}}{\sum_{r=0}^{t} a_{rq}\, x_r^{T}}$ [by application of Bayes' rule]
Stop condition: two consecutive $x$ iterates do not differ much
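
The iteration above can be sketched as follows, working with fractions rather than counts. The function name, the tolerance, and the iteration cap are illustrative choices, not values from the talk; the matrix and the observed distribution reuse the single-attribute example (p = 0.2, b = 0.2, 22% of perturbed rows in range).

```python
def iterative_reconstruct(A, y, tol=1e-12, max_iter=10000):
    """Bayes-rule iteration: x_p <- sum_q y_q * a_pq * x_p / sum_r a_rq * x_r.

    A[p][q] is the probability that original state p is observed as state q.
    y is the observed (perturbed) distribution over states.
    """
    n = len(y)
    x = list(y)                                    # initialize x^0 = y
    for _ in range(max_iter):
        # Probability of observing state q under the current estimate x.
        denom = [sum(A[r][q] * x[r] for r in range(n)) for q in range(n)]
        new_x = [sum(y[q] * A[p][q] * x[p] / denom[q]
                     for q in range(n) if denom[q] > 0)
                 for p in range(n)]
        # Stop when two consecutive iterates barely differ.
        if max(abs(new_x[i] - x[i]) for i in range(n)) < tol:
            return new_x
        x = new_x
    return x

A = [[0.84, 0.16],
     [0.64, 0.36]]
x = iterative_reconstruct(A, y=[0.78, 0.22])
print([round(v, 4) for v in x])
```

Because every update multiplies the previous estimate by a non-negative factor, the iterates stay non-negative, which is the property that distinguishes this method from plain matrix inversion.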

Iterative Algorithm
We had proved:
• Theorem: The Inversion Algorithm gives the MLE.
• Theorem [AA 01]: The Iterative Algorithm gives the MLE with the additional constraint that $0 \le x_i$ for all $0 \le i \le 2^k - 1$.
    - Models the fact that probabilities are non-negative
    - Results are better, as shown in experiments

Privacy Guarantees
Say that initially we know with probability < 0.3 that Alice's age > 25.
After seeing the perturbed value, we can say that with probability > 0.95.
Then we say there is a (0.3, 0.95) privacy breach.
More subtle differential privacy in the thesis.

Privacy Preserving OLAP
• Motivation
• Problem Definition
• Query Reconstruction
• Privacy Guarantees
• Experiments

Experiments
• Real data: Census data from the UCI Machine Learning Repository, having 32000 rows
• Synthetic data: generated multiple columns of Zipfian data; number of rows varied between 1000 and 1000000
• Error metric: $l_1$ norm of the difference between $x$ and $y$, i.e. the $L_1$ norm between two probability distributions
    - E.g. for 1-dim queries: $|x_1 - y_1| + |x_0 - y_0|$
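
For concreteness, the error metric can be computed as below. This is a minimal sketch; the function name is illustrative and the two example distributions are made up.

```python
def l1_error(x, y):
    """L1 distance between two distributions given as equal-length sequences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Hypothetical 1-dim query: two distributions over (not P, P).
print(l1_error([0.7, 0.3], [0.78, 0.22]))

# Two distributions with disjoint support give the largest possible value.
print(l1_error([1.0, 0.0], [0.0, 1.0]))
```

The second call illustrates why the metric is bounded: for two probability distributions the L1 distance can never exceed 2.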

Inversion vs Iterative Reconstruction
[Figure panels: 2 attributes, Census data; 3 attributes, Census data]
The Iterative algorithm (MLE on constrained space) outperforms Inversion (global MLE).

Error as a function of Number of Columns: Iterative Algorithm, Zipf Data
[Figure]
The error in the iterative algorithm flattens out as its maximum value is bounded by 2.

Error as a function of Number of Columns: Census Data
[Figure panels: Inversion Algorithm; Iterative Algorithm]
Error increases exponentially with increase in the number of columns.

Error as a function of Number of Rows
[Figure]
Error decreases as the number of rows, n, increases.

Conclusion
• It is possible to run OLAP on data across multiple servers so that probabilistically approximate answers are obtained and data privacy is maintained.
• The techniques have been tested experimentally on real and synthetic data. More experiments are in the paper.
• Privacy Preserving OLAP is practical.

Anonymizing Tables: ICDT 05
Creating tables that do not identify individuals, for research or outsourced software development purposes.
Aggarwal, Feder, Kenthapadi, Motwani, Panigrahy, Thomas, Zhu

Achieving Anonymity via Clustering: PODS 06
Aggarwal, Feder, Kenthapadi, Khuller, Panigrahy, Thomas, Zhu

Probabilistic Anonymity (submitted)
Lodha, Thomas

Data Privacy
• Value disclosure: what is the value of attribute salary of person X?
    - Perturbation
        - Privacy Preserving OLAP
• Identity disclosure: whether an individual is present in the database table
    - Randomization, K-Anonymity etc.
        - Data for outsourcing / research

Original Dataset

Identifying       |                            | Sensitive
SSN   Name        | DOB       Gender  Zip code | Disease
614   Sara        | 03/04/76  F       94305    | Flu
615   Joan        | 07/11/80  F       94307    | Cold
629   Karan       | 05/09/55  M       94301    | Diabetes
710   Harris      | 11/23/62  M       94305    | Flu
840   Carl        | 11/23/62  M       94059    | Arthritis
780   Amanda      | 01/07/50  F       94042    | Heart problem
619   Rob         | 04/08/43  M       94042    | Arthritis

Randomized Dataset

Identifying       |                            | Sensitive
SSN   Name        | DOB       Gender  Zip code | Disease
101   Amy         | 03/04/76  F       94305    | Flu
102   Betty       | 07/11/80  F       94307    | Cold
103   Clarke      | 05/09/55  M       94301    | Diabetes
104   David       | 11/23/62  M       94305    | Flu
105   Earl        | 11/23/62  M       94059    | Arthritis
106   Finy        | 01/07/50  F       94042    | Heart problem
107   George      | 04/08/43  M       94042    | Arthritis

Quasi-Identifiers
Quasi-identifiers: approximate foreign keys. They uniquely identify you!

Quasi-identifiers            | Sensitive
DOB       Gender  Zip code   | Disease
03/04/76  F       94305      | Flu
07/11/80  F       94307      | Cold
05/09/55  M       94301      | Diabetes
12/30/72  M       94305      | Flu
11/23/62  M       94059      | Arthritis
01/07/50  F       94042      | Heart problem
04/08/43  M       94042      | Arthritis

k-Anonymity Model [Swe 00]
• Modify some entries of quasi-identifiers
    - Each modified row becomes identical to at least k-1 other rows with respect to quasi-identifiers
• Individual records are hidden in a crowd of size k

Original Table

         Age   Salary
Amy      25    50
Brian    27    60
Carol    29    100
David    35    110
Evelyn   39    120

Suppressing all entries: No Utility

         Age   Salary
Amy      *     *
Brian    *     *
Carol    *     *
David    *     *
Evelyn   *     *

2-Anonymity with Clustering

         Age       Salary
Amy      [25-29]   [50-100]
Brian    [25-29]   [50-100]
Carol    [25-29]   [50-100]
David    [35-39]   [110-120]
Evelyn   [35-39]   [110-120]

Cluster centers published:
• 27 = (25+27+29)/3, 70 = (50+60+100)/3
• 37 = (35+39)/2, 115 = (110+120)/2
Clustering formulation: NP-hard
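
The table above can be reproduced with a short script. This is a sketch that takes the cluster assignment as given (finding a good clustering is the hard part and is not attempted here); the function names are hypothetical.

```python
from collections import Counter

rows = [("Amy", 25, 50), ("Brian", 27, 60), ("Carol", 29, 100),
        ("David", 35, 110), ("Evelyn", 39, 120)]
clusters = [[0, 1, 2], [3, 4]]   # row indices per cluster, as on the slide

def generalize(rows, clusters):
    """Replace each row by its cluster's value ranges and cluster center."""
    published = {}
    for members in clusters:
        ages = [rows[i][1] for i in members]
        salaries = [rows[i][2] for i in members]
        age_range = (min(ages), max(ages))
        salary_range = (min(salaries), max(salaries))
        center = (sum(ages) / len(ages), sum(salaries) / len(salaries))
        for i in members:
            published[rows[i][0]] = (age_range, salary_range, center)
    return published

def is_k_anonymous(published, k):
    """Every published (age range, salary range) pair must occur at least k times."""
    counts = Counter((age, sal) for age, sal, _ in published.values())
    return all(c >= k for c in counts.values())

published = generalize(rows, clusters)
for name, record in published.items():
    print(name, record)
print(is_k_anonymous(published, 2))
```

The smaller cluster has only two members, so the same published table would not satisfy 3-anonymity.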

Clustering Metrics
[Figure: three clusters: 10 points, radius 5; 50 points, radius 15; 20 points, radius 10]

r-center Clustering: Minimize Maximum Cluster Size
[Figure: three clusters, each labeled 2d]

Cellular Clustering: Linear Program
Minimize $\sum_c \left( \sum_i x_{ic}\, d_c + f_c\, y_c \right)$ (sum of cellular cost and facility cost)
Subject to:
• $\sum_c x_{ic} \ge 1$ (each point belongs to a cluster)
• $x_{ic} \le y_c$ (a cluster must be opened for a point to belong to it)
• $0 \le x_{ic} \le 1$ (points belong to clusters positively)
• $0 \le y_c \le 1$ (clusters are opened positively)

Quasi-identifier

Fruit
Apple
Guava
Orange
Apple
Banana

0.6 fraction of the rows is uniquely identified by Fruit. Hence Fruit is a 0.6 quasi-identifier.
0.87 fraction of the U.S. population is uniquely identified by (DOB, Gender, Zipcode), hence it is a 0.87 quasi-identifier.
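
The fraction in the Fruit example can be computed directly. This is a minimal sketch with a hypothetical function name; a row counts as uniquely identified when its value occurs exactly once in the column.

```python
from collections import Counter

def fraction_uniquely_identified(values):
    """Fraction of rows whose value appears exactly once in the column."""
    counts = Counter(values)
    unique_rows = sum(1 for v in values if counts[v] == 1)
    return unique_rows / len(values)

fruit = ["Apple", "Guava", "Orange", "Apple", "Banana"]
print(fraction_uniquely_identified(fruit))
```

For a combination of columns such as (DOB, Gender, Zipcode), the same function applies if each row is passed as a tuple of its values.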

Quasi-Identifier
Find the probability distribution over D distinct values that maximizes the expected fraction of uniquely identified records.
D distinct values, n rows:
• If $D \le n$: $D/(en)$ (skewed distribution)
• Else: $e^{-n/D}$ (uniform distribution)

Distinct values - Identifier
• DOB: $60 \times 365 \approx 2 \times 10^4$
• Gender: 2
• Zipcode: $10^5$
• (DOB, Gender, Zipcode) together has $2 \times 10^4 \times 2 \times 10^5 = 4 \times 10^9$ distinct values
• US population $= 3 \times 10^8$
• Fraction of singletons $= e^{-3 \times 10^8 / (4 \times 10^9)} = e^{-0.075} \approx 0.93$
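
The two-case bound and the (DOB, Gender, Zipcode) estimate can be evaluated with a few lines of Python. This is a sketch; the function name is hypothetical, and the inputs are the rounded figures from the slides.

```python
import math

def max_unique_fraction(D, n):
    """Maximum expected fraction of uniquely identified rows.

    D: number of distinct values, n: number of rows.
    Uses D / (e * n) when D <= n, and exp(-n / D) otherwise.
    """
    if D <= n:
        return D / (math.e * n)
    return math.exp(-n / D)

distinct = (2 * 10**4) * 2 * 10**5   # DOB x Gender x Zipcode
population = 3 * 10**8
print(max_unique_fraction(distinct, population))
```

The two branches agree at D = n, where both give 1/e.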

Distinct values and K-anonymity
• E.g. apply HIPAA to (Age in Years, Zipcode, Gender, Doctor details)
• Want $k = 20{,}000 = 2 \times 10^4$ anonymity with $n = 300$ million $= 3 \times 10^8$ people.
• The number of distinct values is $D = n/k = 1.5 \times 10^4$.
• $D$ = distinct values = $z$ (zipcode) $\times$ 100 (age in years) $\times$ 2 (gender) $= 200z$
• $1.5 \times 10^4 = 200z$, so $z = 75 \approx 10^2$.
• Retain the first two digits of the zipcode (retain states).

Experiments
• Efficient algorithms based on randomized algorithms to find quantiles in small space
    - 10 seconds to anonymize a quarter million rows, or approximately 3 GB per hour, on a machine with a 2.66 GHz processor, 504 MB RAM, Windows XP with Service Pack 2
    - An order of magnitude better in running time for a quasi-identifier of size 10 than the previous implementation
• Optimal algorithms to anonymize the dataset
• Scalable
    - Almost independent of anonymity parameter k
    - Linear in quasi-identifier size (previously exponential)
    - Linear in dataset size

Masketeer: A tool for data privacy
Das, Lodha, Patwardhan, Sundaram, Thomas

Auditing Batches of SQL Queries
Given a set of SQL queries that have been posed over a database, determine whether some subset of these queries has revealed private information about an individual or a group of individuals.
Motwani, Nabar, Thomas. PDM Workshop with ICDE 2007

Example
Query:
    SELECT zipcode FROM Patients p WHERE p.disease = 'diabetes'
Audit expression 1:
    AUDIT zipcode FROM Patients p WHERE p.disease = 'high blood pressure'
    The query is not suspicious wrt this audit expression.
Audit expression 2:
    AUDIT disease FROM Patients p WHERE p.zipcode = 94305
    The query is suspicious if someone in 94305 has diabetes.

Query Suspicious wrt an Audit Expression
A query is suspicious with respect to an audit expression:
• if all columns of the audit expression are covered by the query, and
• if the audit expression and the query have one tuple in common
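
The two conditions can be illustrated on a toy table. This is only a sketch under explicit assumptions: the `is_suspicious` helper, the in-memory `patients` rows, and the convention that columns used in a query's WHERE clause count as covered are all hypothetical simplifications, not the authors' auditing algorithm.

```python
def is_suspicious(table, query_columns, query_pred, audit_columns, audit_pred):
    """Single-query check against one audit expression on a concrete table.

    Condition 1: every audited column is among the columns the query covers.
    Condition 2: some tuple satisfies both the query and the audit predicate.
    """
    covers = set(audit_columns) <= set(query_columns)
    shares_tuple = any(query_pred(row) and audit_pred(row) for row in table)
    return covers and shares_tuple

# Hypothetical Patients table.
patients = [
    {"name": "P1", "zipcode": 94305, "disease": "diabetes"},
    {"name": "P2", "zipcode": 94301, "disease": "high blood pressure"},
]

# SELECT zipcode FROM Patients p WHERE p.disease = 'diabetes'
query_columns = {"zipcode", "disease"}   # assumption: WHERE columns count as covered
query_pred = lambda r: r["disease"] == "diabetes"

audit1 = is_suspicious(patients, query_columns, query_pred,
                       {"zipcode"}, lambda r: r["disease"] == "high blood pressure")
audit2 = is_suspicious(patients, query_columns, query_pred,
                       {"disease"}, lambda r: r["zipcode"] == 94305)
print(audit1, audit2)
```

On this table the first audit expression shares no tuple with the query, while the second one does because a patient in 94305 has diabetes.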

SQL Batch Auditing
[Figure: Query 1, Query 2, Query 3 and Query 4 together covering the columns of an audited tuple of the audit expression]
• A query batch is semantically suspicious wrt an audit expression iff the queries together cover all audited columns of at least one audited tuple on table T.
• A query batch is syntactically suspicious wrt an audit expression iff the queries together cover all audited columns of at least one audited tuple on some table T.

Syntactic and Semantic Auditing
• Checking for semantic suspiciousness has a polynomial-time algorithm.
• Checking for syntactic suspiciousness is NP-complete.

Two Can Keep a Secret: A Distributed Architecture for Secure Database Services
How to distribute data across multiple sites for (1) redundancy and (2) privacy, so that a single site being compromised does not lead to data loss.
Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani, Srivastava, Thomas, Xu. CIDR 2005

Distributing Data and Partitioning and Integrating Queries for Secure Distributed Databases
Feder, Ganapathy, Garcia-Molina, Motwani, Thomas. Work in progress

Motivation
• Data outsourcing growing in popularity
    - Cheap, reliable data storage and management
        - 1 TB for $399: < $0.5 per GB
        - $5000: Oracle 10g / SQL Server
        - $68k/year: DB admin
• Privacy concerns looming ever larger
    - High-profile thefts (often insiders)
        - UCLA lost 900k records
        - Berkeley lost a laptop with sensitive information
        - Acxiom, JP Morgan, Choicepoint
        - www.privacyrights.org

Present solutions
• Application level: Salesforce.com
    - On-Demand Customer Relationship Management: $65/user/month; $995 / 5 users / 1 year
• Amazon Elastic Compute Cloud
    - 1 instance = 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, 250 Mb/s network bandwidth
    - Elastic, completely controlled, reliable, secure
    - $0.10 per instance hour
    - $0.20 per GB of data in/out of Amazon
    - $0.15 per GB-month of Amazon S3 storage used
• Google Apps for your domain
    - Small businesses, enterprise, school, family or group

Encryption Based Solution
[Figure: the client sends query Q to a client-side processor, which encrypts and sends Q' to the DSP; the DSP returns "relevant data", from which the client-side processor computes the answer]
Problem: Q' ≈ "SELECT *"

The Power of Two
[Figure: a client connected to two providers, DSP1 and DSP2]

The Power of Two
[Figure: the client-side processor splits query Q into Q1 for DSP1 and Q2 for DSP2]
Key: ensure Cost(Q1) + Cost(Q2) ≤ Cost(Q)

SB 1386 Privacy
• {Name, SSN}, {Name, LicenceNo}, {Name, CaliforniaID}, {Name, AccountNumber}, {Name, CreditCardNo, SecurityCode} are all to be kept private.
• A set is private if at least one of its elements is "hidden".
    - An element in encrypted form is ok

Techniques
• Vertical Fragmentation
    - Partition attributes across R1 and R2
    - E.g., to obey constraint {Name, SSN}: R1 ← Name, R2 ← SSN
    - Use tuple IDs for reassembly: R = R1 JOIN R2
• Encoding
    - One-time pad
        - For each value v, construct a random bit sequence r
        - R1 ← v XOR r, R2 ← r
    - Deterministic encryption
        - R1 ← E_K(v), R2 ← K
        - Can detect equality and push selections with an equality predicate
    - Random addition
        - R1 ← v + r, R2 ← r
        - Can push aggregate SUM
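
Two of the encodings above, the one-time pad and random addition, can be sketched as follows. The function names, the 32-bit width, and the use of modular arithmetic for random addition are assumptions made for this illustration; the slides only specify v XOR r and v + r.

```python
import secrets

def xor_split(v, bits=32):
    """One-time pad: R1 stores v XOR r, R2 stores the random pad r."""
    r = secrets.randbits(bits)
    return v ^ r, r

def xor_join(share1, share2):
    return share1 ^ share2

def additive_split(v, modulus=2**32):
    """Random addition: R1 stores (v + r) mod m, R2 stores r."""
    r = secrets.randbelow(modulus)
    return (v + r) % modulus, r

# One-time pad round trip.
s1, s2 = xor_split(12345)
print(xor_join(s1, s2))

# Pushing SUM: each site sums its own shares, the client combines the sums.
salaries = [50, 60, 100]
shares = [additive_split(s) for s in salaries]
sum_r1 = sum(a for a, _ in shares) % 2**32
sum_r2 = sum(b for _, b in shares) % 2**32
total = (sum_r1 - sum_r2) % 2**32
print(total)
```

Neither site alone learns a value, yet the client recovers the sum from two aggregated numbers instead of fetching every row, which is what pushing the aggregate means here.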

Example
• An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}
• Privacy constraints
    - {Telephone}, {Email}
    - {Name, Salary}, {Name, Position}, {Name, DoB}
    - {DoB, Gender, ZipCode}
    - {Position, Salary}, {Salary, DoB}
• Will use just Vertical Fragmentation and Encoding.
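
A candidate fragmentation can be checked against these constraints mechanically. The sketch below is illustrative: the helper name and, in particular, the specific assignment of attributes to R1 and R2 are hypothetical and are not claimed to be the partition used in the talk. Attributes that are encoded (here Telephone, Email, Salary) are treated as hidden at both sites.

```python
def violated_constraints(constraints, r1_clear, r2_clear):
    """A constraint is violated if one site holds all of its attributes in the clear."""
    return [c for c in constraints if c <= r1_clear or c <= r2_clear]

constraints = [frozenset(c) for c in [
    {"Telephone"}, {"Email"},
    {"Name", "Salary"}, {"Name", "Position"}, {"Name", "DoB"},
    {"DoB", "Gender", "ZipCode"},
    {"Position", "Salary"}, {"Salary", "DoB"},
]]

# Hypothetical fragmentation: Telephone, Email and Salary are encoded,
# so they appear in the clear at neither site.
r1_clear = {"Name", "Gender", "ZipCode"}
r2_clear = {"Position", "DoB"}
print(violated_constraints(constraints, r1_clear, r2_clear))

# Storing everything in the clear at one site violates every constraint.
everything = {"Name", "DoB", "Position", "Salary", "Gender",
              "Email", "Telephone", "ZipCode"}
print(len(violated_constraints(constraints, everything, set())))
```

Singleton constraints such as {Telephone} can never be satisfied by fragmentation alone, which is why the example needs encoding in addition to vertical fragmentation.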

Example (2)
Constraints: {Telephone}, {Email}, {Name, Salary}, {Name, Position}, {Name, DoB}, {DoB, Gender, ZipCode}, {Position, Salary}, {Salary, DoB}
[Figure: the attributes Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode assigned to fragments R1 and R2, each fragment carrying the tuple ID]

Partitioning, Execution
• Partitioning Problem
    - Partition to minimize communication cost for a given workload
    - Even a simplified version is hard to approximate
    - Hill-climbing algorithm after starting with weighted set cover
• Query Reformulation and Execution
    - Consider only centralized plans
    - Algorithm to partition select and where clause predicates between the two partitions

Thank You!

Acknowledgements: Stanford Faculty
• Advisor: Rajeev Motwani
• Members of Orals Committee: Rajeev Motwani, Hector Garcia-Molina, Dan Boneh, John Mitchell, Ashish Goel
• Many other professors at Stanford, esp. Jennifer Widom

Acknowledgements: Projects
• STREAM: Jennifer Widom, Rajeev Motwani
• PORTIA: Hector Garcia-Molina, Rajeev Motwani, Dan Boneh, John Mitchell
• TRUST: Dan Boneh, John Mitchell, Rajeev Motwani, Hector Garcia-Molina
• RAIN: Rajeev Motwani, Ashish Goel, Amin Saberi

Acknowledgements: Internship Mentors
Rakesh Agrawal, Ramakrishnan Srikant, Surajit Chaudhuri, Nicolas Bruno, Phil Gibbons, Sachin Lodha, Anand Rajaraman

Acknowledgements: Co-authors [A-K]
Gagan Aggarwal, Rakesh Agrawal, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayank Bawa, Nicolas Bruno, Renato Carmo, Surajit Chaudhuri, Mayur Datar, Prasenjit Das, A A Diwan, Tomás Feder, Vignesh Ganapathy, Prasanna Ganesan, Hector Garcia-Molina, Keith Ito, Krishnaram Kenthapadi, Samir Khuller, Yoshiharu Kohayakawa

Acknowledgements: Co-authors [L-Z]
Eduardo Sany Laber, Sachin Lodha, Nina Mishra, Rajeev Motwani, Shubha Nabar, Itaru Nishizawa, Liadan Boyen, Rina Panigrahy, Nikhil Patwardhan, Ramakrishnan Srikant, Utkarsh Srivastava, S. Sudarshan, Sharada Sundaram, Rohit Varma, Jennifer Widom, Ying Xu, An Zhu

Acknowledgements: Others not in previous list
• Aristides, Gurmeet, Aleksandra, Sergei, Damon, Anupam, Arnab, Aaron, Adam, Mukund, Vivek, Anish, Parag, Vijay, Piotr, Moses, Sudipto, Bob, David, Paul, Zoltan etc.
• Members of Rajeev's group and of the Stanford Theory, Database and Security groups
• Many Ph.D. students of the incoming year 2002 (Paul etc.) and many other students at Stanford
• Lynda, Maggie, Wendy, Jam, Kathi, Claire, Meredith for administrative help
• Andy, Miles, Lilian for keeping the machines running!
• Various outing clubs and groups at Stanford, the Catholic community here, SIA, RAINS groups, Ivgrad, DB Movie and Social Committee

Acknowledgements: More!
• Jojy Michael, Joshua Easow and families
• Roommates: Omkar Deshpande, Alex Joseph, Mayur Naik, Rajiv Agrawal, Utkarsh Srivastava, Rajat Raina, Jim Cybluski, Blake Blailey
• Batchmates and professors from IITs
• Friends and relatives, grandparents, sister Dina, and parents

Data Streams
• Traditional DBMS: data stored in finite, persistent data sets
• New applications: data input as continuous, ordered data streams
    - Network and traffic monitoring
    - Telecom call records
    - Network security
    - Financial applications
    - Sensor networks
    - Web logs and clickstreams
    - Massive data sets

Scheduling Algorithms for Data Streams
• Minimizing the overhead over the disk system. Motwani, Thomas. SODA 2004
• Operator Scheduling in Data Stream Systems: minimizing memory consumption and latency. Babu, Babcock, Datar, Motwani, Thomas. VLDB Journal 2004
• Stanford STREAM Data Manager. Stanford Stream Group. IEEE Bulletin 2003