Privacy Issues in Scientific Workflow Provenance Sudeepa Roy

Privacy Issues in Scientific Workflow Provenance
Sudeepa Roy (1)
Joint work with Susan B. Davidson (1), Sanjeev Khanna (1), and Sarah Cohen Boulakia (2)
(1) University of Pennsylvania, USA
(2) Laboratoire de Recherche en Informatique, France

Workflow
(Figure: a workflow whose modules Split Entries, Align Sequences, Curate Annotations, and Construct Trees pass DNA sequences and Functional Data along their edges, through intermediate formats Format-1, Format-2, and Format-3.)
• Graphical representation of a sequence of actions to perform a task (e.g., a biological experiment)
• Vertex ≡ Module (program)
• Edge ≡ Dataflow
• Run: an execution of the workflow
• Actual data appears on the edges

Need for provenance
(Figure: the biologist's workspace; raw DNA sequences flow through Split Entries, Align Sequences, Curate Annotations, and Construct Trees, together with Functional Data, to produce a tree t.)
• How has this tree been generated? Which sequences have been used to produce this tree?
• Typical provenance queries (see the sketch below):
– Whether d1 depends on d2
– How d1 depends on d2
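For readers less familiar with these two query types, here is a minimal sketch, assuming provenance is recorded as a "derived-from" graph over data items (the item names below are illustrative, not from the talk):

```python
# Illustrative provenance record: each data item maps to the items it was derived from.
derived_from = {
    "tree_t": ["alignment_1", "functional_data"],
    "alignment_1": ["entry_1", "entry_2"],
    "entry_1": ["raw_sequences"],
    "entry_2": ["raw_sequences"],
}

def depends_on(d1, d2):
    """'Whether d1 depends on d2': is d2 reachable from d1 in the provenance graph?"""
    stack, seen = [d1], set()
    while stack:
        cur = stack.pop()
        if cur == d2:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(derived_from.get(cur, []))
    return False

def how_depends(d1, d2, path=()):
    """'How d1 depends on d2': yield every derivation path from d1 back to d2."""
    path = path + (d1,)
    if d1 == d2:
        yield path
    for parent in derived_from.get(d1, []):
        yield from how_depends(parent, d2, path)

print(depends_on("tree_t", "raw_sequences"))        # True
for p in how_depends("tree_t", "raw_sequences"):    # the two derivation paths
    print(" -> ".join(p))
```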

Workflow privacy for provenance queries
• Analysts want access to the sequence of module executions and intermediate data
• But workflows may capture medical diagnoses, proprietary biological experiments, etc.
• Many "private" components in such workflows!
– proprietary modules (functionality)
– personal information, medical records (data)
– even the entire process! (provenance)

In this talk…
• Identify important privacy concerns in scientific workflows w.r.t. provenance queries:
– Module Privacy
– Data Privacy
– Provenance Privacy
• Propose a model for Module Privacy
– From Davidson-Khanna-Panigrahi-Roy '10
• Discuss future directions

Privacy Concerns in Scientific Workflows

Example 1: Module Privacy
Patient record: gender, smoking habits, familial environment, blood pressure, blood test report, …
(Figure: Split Entries takes the record P: (X1, X2, X3, X4, X5) and routes (X1, X2, X3) to Check for Cancer and (X1, X2, X4, X5) to Check for Infectious Disease; their answers, "P has cancer?" and "P has an infectious disease?", feed Create Report, which outputs the report. One module's secret logic is, e.g., "If X1 > 60 OR (X2 < 800 AND X5 = 1) AND …"; see the sketch below.)
• Module functionality should be kept secret:
– (From the patient's standpoint) the output should not be guessed given the input data values
– (From the module owner's standpoint) no one should be able to simulate the module and use it elsewhere
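For concreteness, the kind of secret logic at stake might be rendered as below. This is purely hypothetical: the slide's trailing "AND …" is elided there and therefore omitted here as well.

```python
# Illustrative only: the module's secret predicate as hinted on the slide.
# X1, X2, X5 are fields of the patient record P; the trailing "AND ..." on the
# slide is elided, so this sketch is deliberately incomplete.
def check_infectious_disease(x1, x2, x5):
    return x1 > 60 or (x2 < 800 and x5 == 1)

print(check_infectious_disease(70, 900, 0))   # True: the first condition fires
```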

Example 2: Data Privacy
(Figure: robots perform the microarray analysis; the microarray data obtained from the experiment is normalized, yielding the normalized data.)
• Data must be normalized to be interpreted correctly
• Microarray companies provide normalization methods
• Data from other groups is used in normalization
• The normalization data should be kept secret

Example 3: Provenance Privacy
(Figure: a protein is annotated by either module M1 or module M2, producing the protein plus its functional annotation.)
• M1 compares the entire protein against already annotated genomes
• M2 compares domains of proteins (more precise, but more time consuming)
• The provenance (which module produced the annotation) should be kept secret

Privacy concerns at a glance
(Figure: the patient-record workflow again; P: (X1, X2, X3, X4) is split, (X1, X2, X3) goes to Check for Cancer and (X1, X2, X4) to Check for Infectious Disease, which also consults a database DB, and their answers feed Create Report.)
• Module Privacy: the functionality, i.e., the pairs (x, f(x)), is private
• Data Privacy: data items (smoking habits, blood pressure, blood test report, …) are private
• Provenance Privacy: how data items are generated is private

Formal Study of Privacy in Workflows

The questions we want to answer…
Can we preserve the privacy of private components in a workflow and maximize utility w.r.t. provenance queries, with provable guarantees on both the privacy and the utility of the solution?
• What information can we hide? ✓ We identified them!
• How do we measure privacy?
• How do we measure utility?
• How do we find a good solution?

Module Privacy – A formal study… (from our recent work)

Our workflow model
(Figure: a workflow with modules v1, v2, v3; d1 and d2 are the initial input data and d6 is the final output data.)
• A directed acyclic graph
• D = {d1, d2, …, d6}: data items
• Each edge carries a data item
• Data sharing: each data item is produced by a unique module but can appear on multiple edges (see the sketch below)
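A minimal sketch of this model, with the wiring read off the figure; the class and the dummy "source"/"sink" endpoints are illustrative, not part of the talk's formalism:

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    """Directed acyclic workflow: modules as vertices, data items on edges."""
    modules: set = field(default_factory=set)
    producer: dict = field(default_factory=dict)   # data item -> the unique module producing it
    edges: list = field(default_factory=list)      # (src module, dst module, data item)

    def add_edge(self, src, dst, item):
        self.modules.update((src, dst))
        # Data sharing: an item may sit on several edges, but only one module produces it.
        if self.producer.setdefault(item, src) != src:
            raise ValueError(f"{item} is already produced by {self.producer[item]}")
        self.edges.append((src, dst, item))

w = Workflow()
w.add_edge("source", "v1", "d1")   # d1, d2: initial input data (from a dummy source)
w.add_edge("source", "v1", "d2")
w.add_edge("v1", "v2", "d3")
w.add_edge("v1", "v2", "d4")
w.add_edge("v1", "v3", "d4")       # d4 is shared: one producer (v1), two edges
w.add_edge("v2", "v3", "d5")
w.add_edge("v3", "sink", "d6")     # d6: final output data
print(w.producer)                  # each data item has exactly one producing module
```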

Run – An execution of the workflow
(Figure: one run with concrete values on the edges: d1 = 0 and d2 = 1 enter v1; d3 = 0 and d4 = 1 go to v2; d4 = 1 also goes to v3, along with d5 = 1 from v2; v3 outputs d6 = 0.)

Owner vs. User of the workflow
(Figure: the same run, with the data values shown on the edges.)
• The owner owns the workflow
• The user executes the workflow on different inputs, sees the output, and may want to see some intermediate data
• Each module in the workflow is "private": the user has no a priori knowledge of the functions
• The owner cannot show all intermediate data!

Owner vs. User of the workflow (continued)
(Figure: the same run, but with the value of d4 hidden.)
• Privacy to the owner vs. loss of utility to the user
• There is a cost to the user for hiding each data item
• The owner chooses which subset of data to hide:
– These data values are not shown across all runs of the workflow
– Connections are always shown
– Hide a subset of data with minimum cost that ensures privacy

Module Privacy
• A module f = a function
• For every input x to f, the value f(x) should not be revealed
– There must be enough equivalent possible f(x) values w.r.t. the visible information
(Figure: a fork that, according to the required privacy guarantee, is "equivalent" to a knife and a spoon; the visible information cannot tell them apart.)

Module Privacy
• Standalone module privacy: the module is not part of a workflow
• In-network module privacy: the module belongs to a workflow
Let us take a look at "standalone module privacy" first.

Standalone Privacy: An example…
• A module f decides which diseases a person may have, based on the region he lives in and the region he visited recently
(Figure: a map with four regions A, B, C, D.)
• Four regions: A, B, C, D
• Three diseases: D1, D2, D3
• R denotes where the person lives (A or B); V denotes where he visited recently (C or D)
• f(R, V) = (D1, D2, D3)

Standalone Privacy: An example… (continued)
• R = 1 if the person lives in region A, 0 if he lives in B
• V = 1 if the person recently visited region C, 0 if he visited D
• D1 = 1 if the person is susceptible to disease D1, and 0 otherwise; similarly D2, D3
• D1 = R ∨ V (may have D1 if he either lives in A or visited C)
• D2 = ¬(R ∧ V) (may not have D2 only if he lives in A and visited C)
• D3 = ¬(R ∨ V) (may have D3 if and only if he lives in B and visited D)

R V | D1 D2 D3
0 0 |  0  1  1
0 1 |  1  1  0
1 0 |  1  1  0
1 1 |  1  0  0

Γ-Standalone Privacy of functions
• Hide a subset of the input and output data values
• Γ-standalone-privacy: for every input x, there are at least Γ possible values of f(x)
– Similar to the notion of L-diversity (MKGV '07)
• Example (using the table above): hide D2 and D3
– 4 possible outputs for each input
– e.g., f(0, 0) can be mapped to (0, 0, 0), (0, 0, 1), (0, 1, 0), or (0, 1, 1)

Many options give the same privacy…
• Hide V and D3:
– f(0, 0) can be mapped to (0, 1, 0), (0, 1, 1), (1, 1, 0), or (1, 1, 1)
– gives 4-privacy
But not all…
• Hide R and V:
– over all possible executions there are only 3 distinct outputs
– f(0, 0) can be mapped only to (0, 1, 1), (1, 1, 0), or (1, 0, 0)
– does not give 4-privacy
(The sketch below reproduces these counts.)
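A small brute-force sketch that reproduces the counts on these two slides: it enumerates every candidate function whose visible table matches f's and collects the possible outputs per input. The attribute names follow the running example; the encoding itself is my own, not the talk's.

```python
from itertools import product

IN_NAMES, OUT_NAMES = ("R", "V"), ("D1", "D2", "D3")
INPUTS = list(product((0, 1), repeat=2))

# The module of the running example, encoded as a truth table:
# f(R, V) = (D1, D2, D3) with D1 = R OR V, D2 = NOT(R AND V), D3 = NOT(R OR V).
f = {(r, v): (r | v, 1 - (r & v), 1 - (r | v)) for r, v in INPUTS}

def visible_table(table, hidden):
    """Project every (input, output) row of a truth table onto the visible attributes."""
    return {
        (tuple(val for name, val in zip(IN_NAMES, x) if name not in hidden),
         tuple(val for name, val in zip(OUT_NAMES, y) if name not in hidden))
        for x, y in table.items()
    }

def possible_outputs(hidden):
    """For each input x, collect g(x) over every candidate function g that is
    consistent with f, i.e. whose visible table equals f's visible table."""
    target = visible_table(f, hidden)
    outs = {x: set() for x in INPUTS}
    for images in product(list(product((0, 1), repeat=3)), repeat=len(INPUTS)):
        g = dict(zip(INPUTS, images))            # one of the 8^4 candidate functions
        if visible_table(g, hidden) == target:
            for x in INPUTS:
                outs[x].add(g[x])
    return outs

for hidden in ({"D2", "D3"}, {"V", "D3"}, {"R", "V"}):
    counts = {x: len(ys) for x, ys in possible_outputs(hidden).items()}
    print(sorted(hidden), counts)
# hide {D2, D3}: 4 possible outputs per input -> 4-standalone-privacy
# hide {V, D3} : 4 possible outputs per input -> 4-standalone-privacy
# hide {R, V}  : only 3 possible outputs      -> does not give 4-privacy
```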

Consistent Functions
• Two functions are consistent w.r.t. some visible attributes if their "tables" are the same w.r.t. the visible values
– Γ-standalone-privacy: for every input x, the consistent functions map x to at least Γ possible values of f(x)
(Tables: the truth table of f next to the truth table of a consistent function; the two agree on all visible attributes and may differ on the hidden ones. A small consistency check follows below.)
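A hedged sketch of this consistency test on the running example. The candidate function g below is my own choice, picked to differ from f only on D2 and D3:

```python
from itertools import product

IN_NAMES, OUT_NAMES = ("R", "V"), ("D1", "D2", "D3")
INPUTS = list(product((0, 1), repeat=2))

f = {(r, v): (r | v, 1 - (r & v), 1 - (r | v)) for r, v in INPUTS}
# A candidate g that differs from f only on D2 and D3.
g = {x: (y[0], 1 - y[1], 1 - y[2]) for x, y in f.items()}

def consistent(t1, t2, hidden):
    """Two truth tables are consistent w.r.t. the visible attributes if their
    projections onto those attributes are the same set of rows."""
    def project(table):
        return {
            (tuple(v for n, v in zip(IN_NAMES, x) if n not in hidden),
             tuple(v for n, v in zip(OUT_NAMES, y) if n not in hidden))
            for x, y in table.items()
        }
    return project(t1) == project(t2)

print(consistent(f, g, hidden={"D2", "D3"}))   # True: they agree on R, V, D1
print(consistent(f, g, hidden={"D3"}))         # False: they differ on the visible D2
```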

Standalone to In-network Privacy
• Carefully choosing a subset of data to hide gives the desired standalone privacy, via consistent functions
• How is in-network privacy (enough possible f(x) values in a network) different?
• Claim: if consistent functions give Γ possible values of f(x) when f is standalone, then they also give Γ possible values of f(x) when f is in a network
• Why? Pick a consistent function for each module; together they are consistent with the network as a whole

A toy workflow
(Figure: R and V feed module f1, which outputs D1, D2, D3; these feed module f2, which outputs V1 and V2.)
• f1 is the same module as before
• Module f2 decides which vaccine (V1 or V2) a person needs, based on the diseases:
– V1 = 1, V2 = 0 if D1 = 1
– Otherwise, V1 = 0 and V2 = D2 ∧ D3

Consistent Sequence of Functions
• <g1, …, gn> is consistent with <f1, …, fn> if each individual gi is consistent with fi
• f1: D1 = R ∨ V, D2 = ¬(R ∧ V), D3 = ¬(R ∨ V)
• f2: V1 = 1, V2 = 0 if D1 = 1; otherwise V1 = 0 and V2 = D2 ∧ D3
• g1: D1 = ¬(R ∨ V), D2 = ¬(R ∧ V), D3 = ¬(R ∨ V)
• g2: V1 = 1, V2 = 0 if D1 = 0; otherwise V1 = 0 and V2 = D2 ∧ D3
• f1 is consistent with both f1 and g1; f2 with both f2 and g2
• <g1, g2> is consistent with <f1, f2>; <f1, g2> is not (verified in the sketch below)
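To make the claim concrete, here is a hedged sketch. The definitions are reconstructed from the slide, with the dropped Boolean connectives assumed to be the ∨/∧/¬ restored above, so treat them as assumptions. With the intermediate data D1, D2, D3 hidden, the pair <g1, g2> produces exactly the same visible input/output behaviour as <f1, f2>, while <f1, g2> does not:

```python
def f1(r, v):
    return (r | v, 1 - (r & v), 1 - (r | v))        # D1, D2, D3

def f2(d1, d2, d3):
    return (1, 0) if d1 == 1 else (0, d2 & d3)      # V1, V2

def g1(r, v):                                        # negates D1, keeps D2 and D3
    return (1 - (r | v), 1 - (r & v), 1 - (r | v))

def g2(d1, d2, d3):                                  # flips the test on D1
    return (1, 0) if d1 == 0 else (0, d2 & d3)

# With D1, D2, D3 hidden, the user only sees (R, V) -> (V1, V2).
for r in (0, 1):
    for v in (0, 1):
        assert f2(*f1(r, v)) == g2(*g1(r, v))        # <g1, g2> matches <f1, f2>
print(any(f2(*f1(r, v)) != g2(*f1(r, v))
          for r in (0, 1) for v in (0, 1)))          # True: <f1, g2> does not
```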

Γ-In-Network Privacy of Functions
• Informal definition: each input x of each function can be mapped to Γ different outputs by consistent functions drawn from consistent sequences
Our prior results (Davidson-Khanna-Panigrahi-Roy '10):
• We give a (correct!) proof of the claim
• We use the above connection to find a minimum-cost data subset to hide that ensures privacy for every module
– The problem is NP-complete (a brute-force sketch follows below)
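A brute-force sketch of the optimization problem for the single running-example module. The per-attribute hiding costs are made up for illustration, and the exhaustive search over subsets is exponential, which is in line with the hardness remark above; it is not the algorithmic machinery of the paper.

```python
from itertools import chain, combinations, product

IN_NAMES, OUT_NAMES = ("R", "V"), ("D1", "D2", "D3")
ATTRS = IN_NAMES + OUT_NAMES
INPUTS = list(product((0, 1), repeat=2))
f = {(r, v): (r | v, 1 - (r & v), 1 - (r | v)) for r, v in INPUTS}
COSTS = {"R": 1, "V": 1, "D1": 3, "D2": 1, "D3": 1}     # hypothetical hiding costs

def visible_table(table, hidden):
    return {(tuple(v for n, v in zip(IN_NAMES, x) if n not in hidden),
             tuple(v for n, v in zip(OUT_NAMES, y) if n not in hidden))
            for x, y in table.items()}

def is_gamma_private(hidden, gamma):
    """Every input must admit at least `gamma` outputs under the candidate
    functions consistent with the visible part of f's table."""
    target = visible_table(f, hidden)
    outs = {x: set() for x in INPUTS}
    for images in product(list(product((0, 1), repeat=3)), repeat=len(INPUTS)):
        g = dict(zip(INPUTS, images))
        if visible_table(g, hidden) == target:
            for x in INPUTS:
                outs[x].add(g[x])
    return all(len(ys) >= gamma for ys in outs.values())

# Exhaustive search over all subsets of attributes.
all_subsets = chain.from_iterable(combinations(ATTRS, k) for k in range(len(ATTRS) + 1))
best = min((s for s in all_subsets if is_gamma_private(set(s), gamma=4)),
           key=lambda s: sum(COSTS[a] for a in s))
print(best, sum(COSTS[a] for a in best))    # a cheapest 4-private hide-set (cost 2 here)
```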


Related Work
• Access control in scientific workflows:
– Chebotko et al. (2008), Gil et al. (2007, 2010)
– But no formal notion of privacy or of the quality of the solution
• Privacy-preserving data mining techniques:
– Formal analyses of privacy and utility in social networks, statistical databases, …
– k-anonymity, l-diversity, differential privacy
– Not exactly suited to workflow-related applications: different query format, and adding noise may not be useful
• Secure provenance of workflows:
– Braun et al., Hasan et al., …

Future Work and Open Problems

Future Work
• Module Privacy (ongoing work)
– How do we handle a combination of private and public modules?
• Data Privacy
– Hiding a data value may not be enough: how much is revealed by the displayed data values?
• Provenance Privacy (ongoing work)
– Reachability between pairs of modules is private
• Connect theory with practice

Thank you! Questions?