Last update: 11 December 2015
Knowledge and the Web / Privacy and Big Data
Data Mining {against, for, ?} Privacy
Bettina Berendt, KU Leuven, Department of Computer Science
http://people.cs.kuleuven.be/~bettina.berendt/teaching
Berendt: Knowledge and the Web, 2015, http://www.cs.kuleuven.be/~berendt/teaching/
Where are we?
Agenda
- The data
- The problem (and all the problems we won't go into detail about today)
- The analytics used for demonstrating the argument
- The approach: "Privacy-preserving data mining"
- Data minimisation: (not only) for data mining
- Incentives?
The cliché slide about data mining
Data (in some format)
5 ★ Open Data: formats example
http://5stardata.info/en/
5 ★ Open Data (Berners-Lee, 2006): from Web data via open data to linked open data
E.g. commercial data, Facebook; e.g. Twitter; much open government/public data; e.g. DBpedia
http://www.w3.org/DesignIssues/LinkedData.html, http://5stardata.info/en/
Agenda
- The data
- The problem (and all the problems we won't go into detail about today)
- The analytics used for demonstrating the argument
- The approach: "Privacy-preserving data mining"
- Data minimisation: (not only) for data mining
- Incentives?
What is this about?
Knowledge mined from data, descriptive and predictive (e.g. "predictive analytics") ... that some people would prefer some other people not to have
Targeted advertising / nuisance to the individual
Knowledge mined from data, descriptive and predictive (e.g. "predictive analytics") ... that some people would prefer some other people not to have
Targeted advertising / privacy violation for the individual
("The Target story", reconstructed on Amazon)
Profiling of individuals with consequences beyond nuisance
Knowledge mined from data, descriptive and predictive (e.g. "predictive analytics") ... that some people would prefer some other people not to have
Cf. (Kosinski et al., 2013); the Converse/stupidity example was gleaned from interaction with the authors' Preference Tool demo account
Trade secrets
Knowledge mined from data, descriptive and predictive (e.g. "predictive analytics") ... that some people would prefer some other people not to have
Focus
The focus is not, or not only, on privacy as
- individual privacy
- a fundamental human right
but, both more generally and more narrowly, on
- confidentiality of data
(Depending on the jurisdiction, this is a plain misnomer, or just confusing, but it is the terminology of the field ...)
An overview of key questions/problems and where they are discussed (1)
- How do data become available, and how can you (and others) use them?
  - Knowledge and the Web course
- How can the availability of data affect individuals' privacy?
  - Privacy and Big Data course
- How can the availability of data affect other interests in confidentiality?
  - Not covered, left to your/our common-sense understanding
An overview of key questions/problems and where they are discussed (2)
- How can data mining be a threat to individuals' privacy?
  - Not covered, left to your/our common-sense understanding
- How can mining effects on privacy and confidentiality be mitigated?
  - Today: some technical modifications / decisions
  - (The question is much bigger, and technology is only one part of the answer. But we can't possibly cover this within one lecture.)
- How can data mining be a helper for privacy?
  - Martijn van Otterlo in the Privacy and Big Data course
- How can data mining be a threat to other interests in confidentiality?
  - Discussion (based on research we and others did) with those who are interested
More on my view of the last two questions: (Berendt, 2012)
Why should you care? (1)
From the KaW student feedback: "[...] A lot of focus on research question while for me as a computer scientist this does not seem relevant."
Do you care?
http://www.w3.org/DesignIssues/LinkedData.html, http://5stardata.info/en/
Do you care? (From your questions)
- "When every subject has its own URI and data is automatically added and connected, information about people will eventually end up there as well. Of course on the web now this is also the case, but when looking for information about one person there is not a single source which has everything or provides links to where this information comes from. I think one of the goals for the semantic web is to be able to link all this information together so it can be easily (and even automatically) retrieved and updated. To me this seems a challenge looking at the privacy of those persons that are now reduced to data. Everything available is easily accessible and not scattered around."
- "How to protect the privacy of individuals. If someone doesn't want some of his/her data to be linked, is there any method to cut the link?"
BTW: also in not-so-open data environments
Should personal data be open data? Should it be linked data? (1)
(from the discussion 2015-12-02, rephrasing from memory)
- << It can be linked, even by a unique URI. But I, the data subject, should have control over who sees what. For example, the doctor should see my medical records, just like in pre-Internet days. >> (Rephrasing BB: i.e., personal data should not be open!)
- << Isn't this more a security issue? >>
- Remarks BB: It is definitely about security (when you think of access control to be defended against hackers), but also about privacy (when you think of access control as a way of exercising your right of informational self-determination). The latter idea is at the core of European data protection law, so yes, in principle you have these rights, and personal data should not be open.
Should personal data be open data? Should it be linked data? (2)
But this presents some issues; just think of Twitter as an example:
- What if someone else "owns" or "co-owns" these data, because it's their platform (Twitter) or because it's from a discussion they were involved in too (other users)? (Legally tricky)
- What if you voluntarily "made these data public" (just read the Twitter terms of service)?
- What if this wasn't so voluntary, but a choice made due to the necessity to speak via this monopoly player on the communications market?
- Is it practical to ask every user for consent if you analyse Twitter data? (Note: some lawyers argue that this would be the only legal way, at least in the EU. Others say you accepted the terms of service.)
- What if some social good comes out of the analysis (lives/children are saved, diseases are cured, social understanding is enhanced, national security is increased, ...)?
Why should you care? (2)
Whether you have an interest in being an ethically aware computer scientist or not, and if so, whatever this means specifically to you:
- As a CS professional, you will build systems.
- You will deal with data (be a "data controller").
- These will be personal data (for ~80% of data scientists, according to a recent survey).
- There is data protection and privacy legislation in pretty much every country.
- You will have to comply.
- Failure to do so costs you money, consumer trust, and maybe your job.
"But I can't do anything" (1) – as an individual
"But I can't do anything" (2) – as an IT professional
Well, you are the designer of IT systems, aren't you?
The upcoming EU data protection regulation mandates Privacy by Design. General reference, for example: CNIL (2015).
Is this covered in a course? See (Berendt & Coudert, 2015), now adapted in PaBD.
Agenda
- The data
- The problem (and all the problems we won't go into detail about today)
- The analytics used for demonstrating the argument
- The approach: "Privacy-preserving data mining"
- Data minimisation: (not only) for data mining
- Incentives?
One (the classical) technology behind recommendations and other "predictive analytics": frequent itemsets / association rules
Motivation for association rule learning/mining: store layout (Amazon; earlier: Wal-Mart, ...)
Where to put spaghetti and butter?
Data
"Market basket data": attributes with boolean domains. In a table, each row is a basket (aka transaction).
Transaction ID | Basket items
1 | spaghetti, tomato sauce
2 | spaghetti, bread
3 | spaghetti, tomato sauce, bread
4 | bread, butter
5 | bread, tomato sauce
Solution approach: the Apriori principle and the pruning of the search tree
The search space is the lattice of itemsets over {spaghetti, tomato sauce, bread, butter}:
- 1-itemsets: {spaghetti}, {tomato sauce}, {bread}, {butter}
- 2-itemsets: {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}
- 3-itemsets: {spaghetti, tomato sauce, bread}, {spaghetti, tomato sauce, butter}, {spaghetti, bread, butter}, {tomato sauce, bread, butter}
- 4-itemset: {spaghetti, tomato sauce, bread, butter}
Apriori principle: if an itemset is infrequent, all of its supersets are infrequent too, so they can be pruned from the search.
More formally: generating large k-itemsets with Apriori
Transaction ID | Basket items
1 | spaghetti, tomato sauce
2 | spaghetti, bread
3 | spaghetti, tomato sauce, bread
4 | bread, butter
5 | bread, tomato sauce
Min. support = 40%
Step 1: candidate 1-itemsets
- spaghetti: support = 3 (60%)
- tomato sauce: support = 3 (60%)
- bread: support = 4 (80%)
- butter: support = 1 (20%)
Contd.
Step 2: large 1-itemsets
- spaghetti
- tomato sauce
- bread
Candidate 2-itemsets
- {spaghetti, tomato sauce}: support = 2 (40%)
- {spaghetti, bread}: support = 2 (40%)
- {tomato sauce, bread}: support = 2 (40%)
Contd.
Step 3: large 2-itemsets
- {spaghetti, tomato sauce}
- {spaghetti, bread}
- {tomato sauce, bread}
Candidate 3-itemsets
- {spaghetti, tomato sauce, bread}: support = 1 (20%)
Step 4: large 3-itemsets
- {} (none)
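The level-wise search of steps 1 to 4 can be sketched in a few lines of Python. This is an illustrative implementation, not an optimised Apriori; run on the five-basket table from the slides it reproduces the counts computed above.

```python
from itertools import combinations

# The five market baskets from the worked example on the slides.
baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support_count(itemset, baskets):
    """Number of baskets containing every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b)

def apriori(baskets, min_support):
    """Return all frequent itemsets (as frozensets) with their relative
    support, found level by level as in Apriori."""
    n = len(baskets)
    items = sorted({i for b in baskets for i in b})
    frequent = {}
    # Level 1: frequent single items.
    level = [frozenset([i]) for i in items
             if support_count({i}, baskets) / n >= min_support]
    k = 1
    while level:
        for s in level:
            frequent[s] = support_count(s, baskets) / n
        # Candidate (k+1)-itemsets: joins of frequent k-itemsets whose
        # k-subsets are all frequent (the Apriori pruning step).
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, k))
                 and support_count(c, baskets) / n >= min_support]
        k += 1
    return frequent

freq = apriori(baskets, 0.4)  # min. support = 40%
```

With min. support 40%, `freq` contains exactly the three large 1-itemsets and three large 2-itemsets of the slides; {spaghetti, tomato sauce, bread} is pruned at 20%.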
From frequent itemsets to association rules
Schema: if subset, then large k-itemset, with support s and confidence c
- s = (support of large k-itemset) / # tuples
- c = (support of large k-itemset) / (support of subset)
Example: if {spaghetti} then {spaghetti, tomato sauce}
- Support: s = 2/5 (40%)
- Confidence: c = 2/3 (66%)
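Spelled out in code (Python, same five baskets), the two measures for the example rule:

```python
baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def count(itemset):
    """Number of baskets that contain all items of the itemset."""
    return sum(1 for b in baskets if itemset <= b)

# Rule: if {spaghetti} then {spaghetti, tomato sauce}
support = count({"spaghetti", "tomato sauce"}) / len(baskets)      # 2/5
confidence = count({"spaghetti", "tomato sauce"}) / count({"spaghetti"})  # 2/3
```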
Which interestingness measures are interesting for whom?
Agenda
- The data
- The problem (and all the problems we won't go into detail about today)
- The analytics used for demonstrating the argument
- The approach: "Privacy-preserving data mining"
- Data minimisation: (not only) for data mining
- Incentives?
The basic idea of privacy-preserving data mining
- Database inference problem: "The problem that arises when confidential information can be derived from released data by unauthorized users"
- PPDM "develops algorithms for modifying the original data [and/or the processing] in some way, so that the private data and private knowledge remain private even after the mining process"
- The term was coined (in DM) in 2000; the field builds on older research traditions such as statistical disclosure control and secure multi-party computation
- Trade off the utility of the mining results against (this sense of) privacy
- Measures of utility and of privacy (overview: Bertino et al., 2008)
A classification of "privacy-preserving data mining"
What is to be protected? (= What would the attacker want to know?)
- The data: the attacker will, given the data table T, not be able to
  - link any row in T to a specific individual (identity disclosure)
  - obtain an individual's value of a sensitive attribute (attribute disclosure)
  (anonymization techniques, see the PaBD course)
- The inferred data mining result: the attacker will, without T but given the results of DM (e.g. an association rule learned from T), not be able to identify some attributes of a specific individual
How are the data held and processed?
- Centralized
- Distributed: every user knows only some rows (or columns) of T
Approach: modify the data/algorithm/results to avoid undesired patterns
Example: association rule hiding. Approaches:
- Distortion-based (sanitization) techniques
- Blocking-based techniques
High-level view
Pipeline: mine association rules from the database, let the user specify which rules are sensitive, hide those rules, and release the changed database.
How to specify the unwanted patterns? Configured/automatic: describe the sensitive rules by templates (e.g. those that use or predict sensitive variables).
This slide based on http://dimacs.rutgers.edu/Workshops/Privacy/slides/pontikakis.ppt
Recall: basic interestingness measures for association rules
A rule X→Y, with X and Y itemsets, is interesting if the measure exceeds a threshold.
Support
- Proportion or % of instances in the database (e.g. people) who exhibit the pattern (X and Y)
- Ex.: "if britney then spears, supp = 0.35" is interesting for minsupp = 0.05
Confidence
- Proportion or % of instances with X that also have Y = support(X & Y) / support(X)
- Ex.: "if book1 then book2, supp = 0.001, conf = 1" is interesting for any minconf
Example
Sample database (rows reconstructed for illustration; any database in which A occurs in 4 of 5 transactions, each time together with C, yields the figures below):
A B C D
1 1 1 0
1 0 1 1
1 1 1 0
0 0 1 1
1 0 1 0
Rule A→C has:
Support(A→C) = 80%
Confidence(A→C) = 100%
Distortion-based techniques for association rule hiding
A distortion algorithm turns selected 1s into 0s; here (illustrative rows consistent with the figures below), C is switched off in two transactions that support A→C:
Sample database    Distorted database
A B C D            A B C D
1 1 1 0            1 1 1 0
1 0 1 1            1 0 0 1
1 1 1 0            1 1 1 0
0 0 1 1            0 0 1 1
1 0 1 0            1 0 0 0
Rule A→C had: Support(A→C) = 80%, Confidence(A→C) = 100%
Rule A→C has now: Support(A→C) = 40%, Confidence(A→C) = 50%
This and the following 9 slides from/based on http://dimacs.rutgers.edu/Workshops/Privacy/slides/pontikakis.ppt
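A minimal sketch of a distortion-based hiding step, assuming the simplest greedy strategy: turn off the consequent item in transactions supporting the sensitive rule until the target support is reached. Real sanitization algorithms choose the transactions more carefully to limit side effects; the database rows are chosen to match the support/confidence figures on the slide.

```python
def support_conf(db, a, c):
    """Support and confidence of the rule a -> c over a 0/1 database
    (a list of dicts keyed by attribute name)."""
    n_a = sum(1 for row in db if row[a] == 1)
    n_ac = sum(1 for row in db if row[a] == 1 and row[c] == 1)
    return n_ac / len(db), (n_ac / n_a if n_a else 0.0)

def distort(db, a, c, max_support):
    """Greedy sanitization: turn off item c in transactions supporting
    a -> c until the rule's support is at most max_support."""
    db = [dict(row) for row in db]  # work on a copy
    for row in db:
        if support_conf(db, a, c)[0] <= max_support:
            break
        if row[a] == 1 and row[c] == 1:
            row[c] = 0  # delete one item occurrence
    return db

# Rows chosen to reproduce the slide's figures (an assumption, see above)
db = [dict(zip("ABCD", r)) for r in
      [(1, 1, 1, 0), (1, 0, 1, 1), (1, 1, 1, 0), (0, 0, 1, 1), (1, 0, 1, 0)]]
hidden = distort(db, "A", "C", max_support=0.4)
# support(A -> C): 80% -> 40%; confidence: 100% -> 50%
```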
Side effects on non-sensitive rules
(MST/MCT: minimum support/confidence threshold)
Before hiding | After hiding | Side effect
Rule Ri had conf(Ri) > MCT | Rule Ri now has conf(Ri) < MCT | Rule eliminated (undesirable side effect)
Rule Ri had conf(Ri) < MCT | Rule Ri now has conf(Ri) > MCT | Ghost rule (undesirable side effect)
Large itemset I had sup(I) > MST | Itemset I now has sup(I) < MST | Itemset eliminated (undesirable side effect)
Distortion-based techniques: challenges/goals
- Minimize the undesirable side effects that the hiding process causes to non-sensitive rules. Note: many measures of utility (which is traded off against privacy) are based on the number or proportion of ghost rules etc.
- Minimize the number of 1s that must be deleted in the database.
- Algorithms must be linear in time as the database increases in size.
Quality of data
Sometimes it is dangerous to delete items from the database (e.g. in medical databases), because the false data may have undesirable effects. So we have to hide the rules by adding uncertainty to the database, without distorting it.
Blocking-based techniques
A blocking algorithm replaces selected entries by "?" (unknown); here, illustratively, one C-entry and one A-entry are blocked:
Initial database    New database
A B C D             A B C D
1 1 1 0             1 1 1 0
1 0 1 1             1 0 ? 1
1 1 1 0             1 1 1 0
0 0 1 1             ? 0 1 1
1 0 1 0             1 0 1 0
Support and confidence can now only be bounded. In the new database: 60% ≤ conf(A→C) ≤ 100%
Modification of the association rule definition (1)
A rule A→B's confidence and support are now only known to lie within bounds:
sup(A→B) ∈ [minsup(A→B), maxsup(A→B)]
conf(A→B) ∈ [minconf(A→B), maxconf(A→B)]
With N transactions, counting "?" entries as 0 or 1, respectively:
minsup(A→B) = |{t : all items of A∪B are 1 in t}| / N
maxsup(A→B) = |{t : all items of A∪B are 1 or "?" in t}| / N
Modification of the association rule definition (2)
minconf(A→B) = minsup(A∪B) / maxsup(A)
maxconf(A→B) = maxsup(A∪B) / minsup(A)
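The interval computations can be sketched directly from these definitions (a sketch; the blocked database rows are illustrative choices consistent with the 60%–100% interval quoted two slides earlier):

```python
def minmax_support(db, itemset):
    """Min/max support of an itemset over a database whose entries
    are 1, 0 or '?' (unknown)."""
    n = len(db)
    lo = sum(1 for row in db if all(row[i] == 1 for i in itemset))
    hi = sum(1 for row in db if all(row[i] in (1, "?") for i in itemset))
    return lo / n, hi / n

def minmax_confidence(db, a, b):
    """Confidence interval of the rule a -> b (a, b itemsets):
    minconf = minsup(a|b)/maxsup(a), maxconf = maxsup(a|b)/minsup(a)."""
    lo_ab, hi_ab = minmax_support(db, a | b)
    lo_a, hi_a = minmax_support(db, a)
    return lo_ab / hi_a, min(1.0, hi_ab / lo_a)

# Blocked database: one C-entry and one A-entry replaced by '?'
db = [dict(zip("ABCD", r)) for r in
      [(1, 1, 1, 0), (1, 0, "?", 1), (1, 1, 1, 0), ("?", 0, 1, 1), (1, 0, 1, 0)]]
lo, hi = minmax_confidence(db, {"A"}, {"C"})
# 60% <= conf(A -> C) <= 100%
```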
Negative border rules set (NBRS): definition
A rule R belongs to the NBRS when either
- sup(R) > MST and conf(R) < MCT, or
- sup(R) < MST and conf(R) > MCT.
Side effect definitions modified for blocking-based techniques
Before hiding | After hiding | Side effect
Rule Ri had conf(Ri) > MCT | Rule Ri now has minconf(Ri) < MCT | Rule eliminated (undesirable side effect)
Rule Ri had conf(Ri) < MCT | Rule Ri now has maxconf(Ri) > MCT | Ghost rule (desirable side effect)
Large itemset I had sup(I) > MST | Itemset I now has minsup(I) < MST | Itemset eliminated (undesirable side effect)
Itemset I had sup(I) < MST | Itemset I now has maxsup(I) > MST | Ghost itemset (desirable side effect)
Blocking-based techniques
Goals that an algorithm has to achieve:
- Insert a relatively small number of "?"s while significantly reducing the confidence of the sensitive rules.
- Minimize the undesirable side effects (rules and itemsets lost) by selecting the items in the appropriate transactions to change, and maximize the desirable side effects.
- Modify the database in such a way that an adversary cannot recover its original values.
Approach: distribute data and processing
Distributed data mining / secure multi-party computation: the principle explained by secure sum
Given a number of values x1, ..., xn belonging to n entities, compute Σi xi such that each entity knows ONLY its own input and the result of the computation (the aggregate sum of the data).
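A sketch of the secure-sum idea, simulated in a single process: the initiator masks its input with a random value R, the masked partial sums travel around the ring of parties, and R is removed at the end. Each party only ever sees a uniformly masked partial sum, never another party's input. (Simulation only; a real protocol sends these values over a network.)

```python
import random

def secure_sum(private_values, modulus=10**9):
    """Secure-sum protocol sketch.  Correct as long as the true sum
    is smaller than the modulus."""
    r = random.randrange(modulus)               # initiator's secret mask
    running = (r + private_values[0]) % modulus  # initiator sends masked value
    for x in private_values[1:]:                 # each party adds its input
        running = (running + x) % modulus
    return (running - r) % modulus               # initiator removes the mask
```

For example, `secure_sum([12, 7, 30])` returns 49 while no simulated party ever handled an unmasked partial sum.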
Distributed association rule mining
Example: distributed AR mining on horizontally partitioned data (one approach: Kantarcioglu & Clifton, 2004).
In principle, easy: if a rule has support > k% globally, it must have support > k% on at least one site.
1. Request that each site send all rules with support > k%.
2. For each rule returned: request that all sites send the count of their transactions that support the rule and the total count of transactions.
3. From this, compute the global support of each rule.
But: if you are the only site where a rule holds, would you want to share that?
Phase 1: find out which itemsets are frequent across sites
All-to-all messages with commutative encryption:
- Site A (frequent itemsets: X, Y) ends up holding KA(KB(X)), KA(KB(Z)), KA(KC(X)), KA(KC(Y))
- Site B (frequent itemsets: X, Z) ends up holding KB(KA(X)), KB(KA(Y)), KB(KC(X)), KB(KC(Y))
- Site C (frequent itemsets: X, Y) ends up holding KC(KB(X)), KC(KB(Z)), KC(KA(X)), KC(KA(Y))
Result: X and Y are frequent.
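Phase 1 can be simulated with a toy commutative cipher, E_k(x) = x^k mod p for a prime p and keys coprime to p-1, so that E_a(E_b(x)) = E_b(E_a(x)) = x^(ab) mod p. The prime, the key-generation helper and the encoding are illustrative choices, not the protocol as published by Kantarcioglu & Clifton (which adds blinding and permutation steps omitted here):

```python
import hashlib
import random
from math import gcd

P = 2**127 - 1  # a Mersenne prime (illustrative choice)

def random_key():
    """A key must be coprime to P-1 so that x -> x^k mod P is a bijection."""
    while True:
        k = random.randrange(3, P - 1)
        if gcd(k, P - 1) == 1:
            return k

def encode(itemset):
    """Hash an itemset to a nonzero group element."""
    digest = hashlib.sha256(",".join(sorted(itemset)).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 2) + 2

def encrypt(value, key):
    return pow(value, key, P)

# Local frequent itemsets from the slide: A and C found X and Y, B found X and Z.
local = {"A": [{"X"}, {"Y"}], "B": [{"X"}, {"Z"}], "C": [{"X"}, {"Y"}]}
keys = {site: random_key() for site in local}

def fully_encrypt(itemset):
    """Every site adds its encryption layer; the cipher commutes,
    so the order of the sites does not matter."""
    v = encode(itemset)
    for k in keys.values():
        v = encrypt(v, k)
    return v

ciphertexts = {site: {fully_encrypt(s) for s in sets}
               for site, sets in local.items()}
# Equal itemsets yield equal ciphertexts, so the sites can compare the
# union without learning which plaintext itemset came from which site.
candidates = set.union(*ciphertexts.values())        # X, Y and Z
everywhere = set.intersection(*ciphertexts.values()) # only X is at all sites
```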
Phase 2: secure multi-party computation (think of the itemset X = {ABC})
Some further issues in privacy-preserving data mining
Generalisation to data other than relational tables
- Graph data
- Search queries
- Texts
- Spatial data
- ...
Approaches for all of these exist but are beyond the scope of this course!
A new problem: inferences from patterns
- Atzori et al. (2008): publishing association rules, even those with high support, may lead to anonymity leaks concerning single individuals
- Solution approach: release only k-anonymous patterns
- By "sanitization": adding or deleting transactions from the data
Outlook: privacy-preserving data publishing (PPDP)
- In contrast to the general assumptions of PPDM, arbitrary mining methods may be performed after publishing, so adversary models are needed
- Objective: "access to published data should not enable the attacker to learn anything extra about any target victim compared to no access to the database, even with the presence of any attacker's background knowledge obtained from other sources"
- (This needs to be relaxed by assumptions about the background knowledge)
- A comprehensive survey: Fung et al., ACM Computing Surveys 2010
- With more recent literature: Manta, A. (2013). Literature Survey on Privacy Preserving Mechanisms for Data Publishing. Master's thesis, TU Delft
- (Note for the PaBD people: this survey focusses on anonymization approaches to PPDP and so is closely linked to the material by Claudia Diaz that you have seen)
Agenda
- The data
- The problem (and all the problems we won't go into detail about today)
- The analytics used for demonstrating the argument
- The approach: "Privacy-preserving data mining"
- Data minimisation: (not only) for data mining
- Incentives?
Two notes on data minimisation
Intuition:
- "Existing data create desires." (Vorhandene Daten wecken Begehrlichkeiten, a traditional adage in German data-protection discourse)
- "There are no innocent data." *
- If there are no data, you can't misuse them.
Principle of European data protection law and other DP frameworks: data minimisation, "the policy of gathering the least amount of personal information necessary to perform a given function."
* Anke Domscheit-Berg, documented here: http://blogs.taz.de/tazlab/2014/04/12/uberwachung-durch-nsa-es-gibt-keine-einfache-losung/
Data minimisation and data mining?!
"the policy of gathering the least amount of personal information necessary to perform a given function"
Often considered a problem:
- If the point of data mining is to explore the data in order to find something new and interesting, there is no given function or purpose!
- So are data mining and data minimisation mutually exclusive?
We believe not:
1. For developing an app (which has a function), minimise the data.
2. When planning an analysis, minimise the data.
3. Then mine the data, ideally in a privacy-friendly way.
Aggregate, anonymize, reduce ... – when re-using "public" data
Records consist of: user (e.g. name or other ID), resource (e.g. tweet text), tag (e.g. hashtag). Options before analysing/data mining:
1. Get full records, store as received
2. Get full records, anonymize
3. Get full records, filter all but the tag
4. Get the tag only
Option 4 is the most minimal and legally safest: no personal data are transferred (assuming there is no way to reconstruct personal info from tag content ...)!
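Option 3 ("filter all but the tag") is essentially a one-liner; the record format below is hypothetical, standing in for whatever a data-collection API actually returns:

```python
# Hypothetical raw records, as a data-collection API might return them
records = [
    {"user": "alice_w", "text": "Loving Leuven today! #kuleuven",
     "tags": ["kuleuven"]},
    {"user": "bob1990", "text": "Big data, big problems #privacy #bigdata",
     "tags": ["privacy", "bigdata"]},
]

def minimise(records):
    """Discard everything except the tags: no user IDs, no free text,
    only the field the analysis actually needs."""
    return [tag for rec in records for tag in rec["tags"]]

tags = minimise(records)  # only tags survive; user and text are dropped
```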
Agenda
- The data
- The problem (and all the problems we won't go into detail about today)
- The analytics used for demonstrating the argument
- The approach: "Privacy-preserving data mining"
- Data minimisation: (not only) for data mining
- Incentives?
Incentives?
- Legal requirements?
- Economic pressure?
- Political considerations? ("national sovereignty"); see the talks in Security and Privacy in a Post-Snowden World, http://eng.kuleuven.be/evenementen/arenbergsymposium2014
- Ethical / consumer choice? This is related to ethical consumer choices in other areas, such as Fair Trade (argument made by E. Morozov).
"Few of us have had moral pangs about data-sharing schemes, but that could change. Before the environment became a global concern, few of us thought twice about taking public transport if we could drive. Before ethical consumption became a global concern, no one would have paid more for coffee that tasted the same but promised 'fair trade.' Consider a cheap T-shirt you see in a store. It might be perfectly legal to buy it, but after decades of hard work by activist groups, a 'Made in Bangladesh' label makes us think twice about doing so. Perhaps we fear that it was made by children or exploited adults. Or, having thought about it, maybe we actually do want to buy the T-shirt because we hope it might support the work of a child who would otherwise be forced into prostitution."
Morozov, E. (2013).
But what does make people buy fair-trade products? An experiment on the effectiveness of "ethical apps"
Effectiveness of "ethical apps"?
Hudson et al. (2013): What makes people buy a fair-trade product?
- An informational film shown before the buying decision? NO
- Having to make the decision in public? NO
- Some prior familiarity with the goals and activities of fair-trade campaigns, as well as a broader understanding of national and global political issues that are only peripherally related to fair trade? YES
Outlook
- The data
- The problem (and all the problems we won't go into detail about today)
- The analytics used for demonstrating the argument
- The approach: "Privacy-preserving data mining"
- Data minimisation: (not only) for data mining
- Incentives?
- Data mining and discrimination / fairness
80 References
The seminal article on PPDM:
Rakesh Agrawal and Ramakrishnan Srikant. 2000. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00). ACM, New York, NY, USA, 439-450. DOI=10.1145/342009.335438 http://doi.acm.org/10.1145/342009.335438
Berendt, B. (2012). More than modelling and hiding: Towards a comprehensive view of Web mining and privacy. Data Mining and Knowledge Discovery, 24(3), 697-737. http://people.cs.kuleuven.be/~bettina.berendt/Papers/berendt_2012_DAMI.pdf
CNIL (2015). http://www.cnil.fr/english/news-and-events/article/privacy-impact-assessments-the-cnil-publishes-its-pia-manual/
Berendt, B. & Coudert, F. (2015). Privatsphäre und Datenschutz lehren. Ein interdisziplinärer Ansatz. Konzept, Umsetzung, Schlussfolgerungen und Perspektiven. [Teaching privacy and data protection: an interdisciplinary approach. Concept, implementation, conclusions and perspectives.] In Neues Handbuch Hochschullehre [New Handbook of Teaching in Higher Education] (EG 71, 2015, E 1.9) (pp. 7-40). Berlin: Raabe Verlag.
Bertino, E., Lin, D., & Jiang, W. (2008). A survey of quantification of privacy preserving data mining algorithms. In C. C. Aggarwal & P. S. Yu (Eds.), Privacy-Preserving Data Mining: Models and Algorithms (pp. 181-200). New York: Springer.
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802-5805.
Pontikakis, E. & Verykios, V. (undated). An Experimental Study of Association Rule Hiding Techniques. (Slide set.) http://dimacs.rutgers.edu/Workshops/Privacy/slides/pontikakis.ppt Please see their bibliography for sources on the distortion-based and blocking-based techniques.
Murat Kantarcioglu and Chris Clifton. 2004. Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data. IEEE Transactions on Knowledge and Data Engineering, 16(9), 1026-1037. DOI=10.1109/TKDE.2004.45 http://dx.doi.org/10.1109/TKDE.2004.45 (Graphic on p. 75 from the paper.)
Maurizio Atzori, Francesco Bonchi, Fosca Giannotti, & Dino Pedreschi (2008). Anonymity preserving pattern discovery. VLDB Journal, 17(4), 703-727. http://www.researchgate.net/publication/226264051_Anonymity_preserving_pattern_discovery
Benjamin C. M. Fung, Ke Wang, Rui Chen, and Philip S. Yu. 2010. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42(4), Article 14, 53 pages. DOI=10.1145/1749603.1749605 http://doi.acm.org/10.1145/1749603.1749605
Manta, A. (2013). Literature Survey on Privacy Preserving Mechanisms for Data Publishing. Master's thesis, TU Delft.
Morozov, E. (2013). The Real Privacy Problem. MIT Technology Review. http://www.technologyreview.com/featuredstory/520426/the-real-privacy-problem/
Hudson, M., Hudson, I., & Edgerton, J. D. (2013). Political Consumerism in Context: An Experiment on Status and Information in Ethical Consumption Decisions. American Journal of Economics and Sociology, 72(4), 1009-1037. http://dx.doi.org/10.1111/ajes.12033
81 Readings on PPDM
A very readable and recent introduction:
Matwin, S. (2013). Privacy-Preserving Data Mining Techniques: Survey and Challenges. In B. Custers et al. (Eds.), Discrimination & Privacy in the Information Society (SAPERE 3, pp. 209-221). Springer. http://link.springer.com/chapter/10.1007%2F978-3-642-30487-3_11 The classification on p. 31 is taken from that paper.
A readable but somewhat old overview:
Verykios, V. S., Bertino, E., Fovino, I. N., Provenza, L. P., Saygin, Y., & Theodoridis, Y. (2004). State-of-the-art in privacy preserving data mining. SIGMOD Record, 33(1), 50-57. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.3715 The quote on p. 29 (repeated on p. 30) defining the field is from that paper.
A thorough overview:
Aggarwal, C. C., & Yu, P. S. (2008a). A general survey of privacy-preserving data mining models and algorithms. In C. C. Aggarwal & P. S. Yu (Eds.), Privacy-Preserving Data Mining: Models and Algorithms (pp. 11-51). New York: Springer. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.352.3032
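The core randomization idea from the seminal Agrawal & Srikant (2000) paper can be illustrated with a minimal sketch. This is an assumption-laden toy example, not the paper's method: the attribute ("age"), the noise range, and all variable names are illustrative, and the full approach additionally reconstructs the original value distribution from the perturbed data via Bayesian estimation before mining.

```python
import random

random.seed(0)

# Hypothetical sensitive attribute: ages of 10,000 individuals
ages = [random.randint(18, 80) for _ in range(10_000)]

# Each record holder adds independent uniform noise before release,
# so no individual's true value is disclosed to the miner
perturbed = [a + random.uniform(-20, 20) for a in ages]

# Individual values are masked, yet aggregate statistics remain
# approximately recoverable (here: the mean)
true_mean = sum(ages) / len(ages)
pert_mean = sum(perturbed) / len(perturbed)
print(f"true mean: {true_mean:.1f}, perturbed mean: {pert_mean:.1f}")
```

Because the noise has zero mean, the error in the estimated mean shrinks with the number of records; the papers above quantify the resulting trade-off between privacy (noise magnitude) and mining accuracy.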