
  • Slides: 34
DATA PROVENANCE & PRIVACY


What is Provenance?
• Provenance: recording the history of data and its place of origin
Provenance dictionary definitions
1. Merriam-Webster online dictionary: origin, source
2. Oxford English Dictionary: the place of origin or earliest known history of something; origin, derivation
Provenance definitions
1. Provenance refers to the sources of information, such as entities and processes, involved in producing or delivering an artifact. (Yolanda)
2. Provenance is a description of how things came to be, and how they came to be in the state they are in today. Statements about provenance can themselves be considered to have provenance. (Jim M)
Continues...

What is Provenance?
Provenance working definition
3. Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility. Provenance assertions are a form of contextual metadata and can themselves become important records with their own provenance. (W3C)
Provenance web definition
4. On the web, provenance would include information about the creation and publication of web resources as well as information about access of those resources, and activities related to their discussion, linking, and reuse.
Continues...

What is Provenance?
Provenance definitions
5. Provenance is documentation of the set of artifacts, processes, and agents that have caused an artifact to be, and of the contexts of these entities. Provenance provides a critical foundation for assessing authenticity, enabling trust, and allowing reproducibility, and assertions of provenance can themselves become important records with their own provenance. (Jim M)

What kind of History?
• Data creator / data publisher
• Data creation date
• Data modifier and modification date
• Data description
• Etc.
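
The history fields listed on this slide could be held in a small record type. The following is only a sketch: the class name, field names, and sample values are illustrative assumptions, not part of any standard vocabulary.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ProvenanceRecord:
    """History fields for one data item (all names here are illustrative)."""
    creator: str
    publisher: str
    created: date
    description: str = ""
    modifier: Optional[str] = None
    modified: Optional[date] = None

    def last_touched(self) -> date:
        """Return the modification date if one is recorded, else the creation date."""
        return self.modified or self.created


# Made-up example values.
record = ProvenanceRecord(
    creator="alice",
    publisher="example-lab",
    created=date(2020, 1, 15),
    description="Sensor readings, hourly averages",
)
record.modifier = "bob"
record.modified = date(2020, 3, 2)
```

Keeping the modifier and modification date optional reflects that a data item may never have been changed after creation.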

Why is Provenance Important?
The need for provenance in data integration and reuse:
• Data comes from diverse data sources
• Varying quality
• Different scope
• Different assumptions

Provenance Dimensions - 1
Content of Provenance Information
• Attribution - provenance as the sources or entities that were used to create a new result
  – Responsibility - knowing who endorses a particular piece of information or result
  – Origin - recorded vs reconstructed, verified vs non-verified, asserted vs inferred
• Process - provenance as the process that yielded an artifact
  – Reproducibility (e.g. workflows, mashups, text extraction)
  – Data access (e.g. access time, accessed server, party responsible for accessed server)
• Evolution and versioning
  – Republishing (e.g. re-tweeting, re-blogging, re-publishing)
  – Updates (e.g. a document with content from various sources and that changes over time)
• Justification for decisions - includes argumentation, hypotheses, why-not questions
• Entailment - given the results to a particular query, what tuples led to those results

Provenance Dimensions - 2
Management of Provenance Information
• Publication - making provenance information available (expose, distribute)
• Access - finding and querying provenance information
• Dissemination control - track policies specified by creator for when/how an artifact can be used
  – Access control - incorporate access control policies to access provenance information
  – Licensing - stating what rights the object creators and users have based on provenance
  – Law enforcement (e.g. enforcing privacy policies on the use of personal information)
• Scale - how to operate with large amounts of provenance information
Use of Provenance Information
• Understanding - end user consumption of provenance
  – Abstraction, multiple levels of description, summary
  – Presentation, visualization

Provenance Dimensions - 3
• Interoperability - combining provenance produced by multiple different systems
• Comparison - finding what is common in the provenance of two or more entities
• Accountability - the ability to check the provenance of an object with respect to some expectation
  – Verification - of a set of requirements
  – Compliance - with a set of policies
• Trust - making trust judgments based on provenance
  – Information quality - choosing among competing evidence from diverse sources (e.g. linked data use cases)
  – Incorporating reputation and reliability ratings with attribution information
• Imperfections - reasoning about provenance information that is not complete or correct
  – Incomplete provenance
  – Uncertain/probabilistic provenance
  – Erroneous provenance
  – Fraudulent provenance
• Debugging

The Linked Data Paradigm
• How can we exploit all the available data?
  – Data can be reused and remixed
  – Common, flexible, and usable APIs
  – Standard vocabularies to describe interlinked datasets
  – Various tools
  – Understand the Semantic Web vision

Provenance and Linked Data
• Provenance provides the ability to
  – Trace the sources of various kinds of data
  – Enable the exploration of relationships between datasets, their authors, and affiliations
• Provenance analysis provides insight into how data is produced and exploited
• Provenance creates a notion of information quality
  – Is a certain dataset consistent and up to date?
  – Is the connection between two datasets meaningful?
  – Is a given dataset relevant for a particular domain?
• Provenance to establish information trustworthiness
• Provenance to provide data views relating to some criteria

The Provenance Data Model
• Institutional level: metadata associated with origin in terms of its data attributes (e.g. AuthorName, Title, URL, etc.)
• Experimental protocol level: the origin of datasets (e.g. history area, region, organisation or institution)
• Data analysis and significance level: the statistical analysis methodology used to select relevant attributes of datasets (e.g. whether datasets are divided into parts, output values, versions, etc.)
• Dataset description level: who published the datasets, and the vocabularies of interlinked datasets such as Dublin Core, voiD, PRV, etc.

Provenance Related Metadata
Provenance-related metadata is either attached directly to a data item or its host document, or it is available as additional data on the web. Examples of attached metadata are RDF statements about the RDF graph that contains those statements, the author name and creation date of blog entries added to a syndication feed, or information about an image. Detached metadata can be represented in RDF using vocabularies.
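
To make the blog-entry example concrete, attached metadata can be pictured as subject/predicate/object statements, the shape RDF uses. The sketch below uses plain Python tuples rather than an RDF library; the entry URL and author name are made up, and the two predicates are the `creator` and `created` properties of the Dublin Core terms vocabulary.

```python
# Namespace of the Dublin Core metadata terms vocabulary.
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical identifier for a blog entry.
ENTRY = "http://example.org/blog/entry-1"

# Each statement is a (subject, predicate, object) triple, as in RDF.
attached = [
    (ENTRY, DCTERMS + "creator", "Alice Example"),
    (ENTRY, DCTERMS + "created", "2020-01-15"),
]


def describe(triples, subject):
    """Collect predicate -> object pairs stated about one subject."""
    return {p: o for s, p, o in triples if s == subject}


entry_metadata = describe(attached, ENTRY)
```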

Provenance Data Quality Assessment
The Quality of Information
• The main objective is assessing the quality of datasets
• Quality of datasets from a multidimensional perspective

Category          Criteria
Intrinsic         Objectivity, Believability, Accuracy
Contextual        Completeness, Relevance, Timeliness
Representational  Understandable, Concise, Precise
Accessibility     Availability, Security & Licensing, Constraints (Format & Procedures)

• The relevance of each criterion is determined by preferences and by the tasks performed on the available datasets

Provenance Data Quality
• Data trustworthiness
  – Data authenticity
  – Data reliability
• Quality of data provenance has three dimensions:
  – Correctness
  – Completeness
  – Relevancy
• Dimensions of believability
  – Trustworthiness of source
    · Data lineage - the origin of data
    · Related artifacts and actors
  – Reasonableness of data
    · Possibility - the extent to which a data value is possible
    · Consistency - the extent to which a data value is consistent with other values of the same data

Trust Evaluation
Some questions must be considered when evaluating trust in provenance data:
1. Who created the content (author or attribution)?
2. Was the content manipulated? If so, by what process or source?
3. Who is providing the content (repositories)?

Quality of Data Assessment
• Assign numeric values to quality criteria of datasets, or use scoring/rating systems
• Proactive approach
  – Precision vs practicality
• Manual approach
  – Questionnaire-based system
• Semi-automatic approach
  – Rating-based system
  – Reputation-based system
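
As an illustration of assigning numeric values to quality criteria, one simple option is a weighted average of per-criterion ratings. This is a sketch under stated assumptions: ratings are taken to lie in [0, 1], and the criteria, ratings, and weights below are invented for the example.

```python
def quality_score(ratings, weights):
    """Weighted average of per-criterion ratings; criteria without a weight are ignored."""
    weighted = [c for c in ratings if c in weights]
    total_weight = sum(weights[c] for c in weighted)
    if total_weight == 0:
        raise ValueError("no weighted criteria to score")
    return sum(ratings[c] * weights[c] for c in weighted) / total_weight


# Assumed ratings in [0, 1] and assumed weights expressing a user's preferences.
ratings = {"accuracy": 0.9, "completeness": 0.6, "timeliness": 0.3}
weights = {"accuracy": 2.0, "completeness": 1.0, "timeliness": 1.0}
score = quality_score(ratings, weights)
```

The weights are where task-specific preferences enter: a user who cares mostly about freshness would raise the weight on timeliness.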

Reasons of Assessment
Main reasons
• Provenance of assessed data on the web
• Primary objectives
  – Identify methods/approaches to automatically assess the quality of data on the web
  – Or identify methods to automatically assess the quality criteria of web data

A Generalized Assessment Approach
Step 1: Generate a provenance graph for the data item
Step 2: Annotate the provenance graph with impact values
Step 3: Execute the assessment function/program (script)
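
The three steps can be sketched end to end as follows. Everything concrete here is an assumption made for illustration: the `Node` type, the hand-built graph, the impact values, and the choice of "minimum impact anywhere in the graph" as the assessment function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One provenance element (an actor, a process execution, or a data item)."""
    name: str
    kind: str
    sources: List["Node"] = field(default_factory=list)  # elements this one was derived from
    impact: Optional[float] = None  # filled in by step 2


def build_graph() -> Node:
    """Step 1: a hand-built provenance graph for one data item."""
    creator = Node("alice", "actor")
    source = Node("raw.csv", "data")
    creation = Node("cleaning-run", "process", sources=[creator, source])
    return Node("clean.csv", "data", sources=[creation])


def annotate(node: Node, impact_of: Callable[[Node], float]) -> None:
    """Step 2: attach an impact value to every node reachable from `node`."""
    node.impact = impact_of(node)
    for src in node.sources:
        annotate(src, impact_of)


def assess(node: Node) -> float:
    """Step 3: an example assessment function, the minimum impact in the graph."""
    return min([node.impact] + [assess(src) for src in node.sources])


# Assumed impact values per element name; unlisted elements default to 1.0.
IMPACTS: Dict[str, float] = {"alice": 0.9, "raw.csv": 0.5}

root = build_graph()
annotate(root, lambda n: IMPACTS.get(n.name, 1.0))
result = assess(root)
```

Taking the minimum encodes a "weakest link" view of quality; an average or a weighted combination would be an equally valid assessment function under different assumptions.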

Generate a Provenance Graph
1. What types of provenance elements are necessarily required?
2. What level of detail (i.e. granularity) is necessarily required?
3. Where and how do we get provenance information?
   • Two complementary options
     – Recording
     – Analyzing the metadata

Annotation with Impact Values
1. How might each provenance element influence the quality of data?
   • Each type of element has to be analyzed systematically
2. What kinds of impact values are necessary, and how should the influence be represented through impact values?
   • Impact values do not have to be numeric
   • It also depends on the assessment functions
3. How do we determine the impact values?

Determine the Impact Values
1. From provenance information
2. From user input
   • Rating-based systems or reputation-based systems
   • Configuration options
3. Through content analysis
   • Comparison of data contents
   • Adoption of information retrieval methods
   • Adoption of data cleansing techniques
4. Through context analysis
   • Further metadata
   • Domain knowledge

Annotation with Impact Values
• How might each provenance element influence the quality of data?
[Table; row and column layout lost in extraction]
• Column headings: Provenance Element Type, Impact Values
• Entries: Creation date, Creation time, Creation guidelines, Weights, Source data items, Data creator, Expiry time
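
As one illustration of how a creation time and an expiry time could yield an impact value, the sketch below computes the fraction of a data item's lifetime that is still left. The formula, the function name, and the dates are assumptions chosen for the example; the slides do not specify how these values are combined.

```python
from datetime import datetime


def timeliness(created: datetime, expires: datetime, now: datetime) -> float:
    """Fraction of the item's lifetime that is still left, clipped to [0, 1]."""
    lifetime = (expires - created).total_seconds()
    if lifetime <= 0:
        raise ValueError("expiry time must be after creation time")
    remaining = (expires - now).total_seconds()
    return max(0.0, min(1.0, remaining / lifetime))


# Made-up dates: an item valid for ten days, checked halfway through.
created = datetime(2024, 1, 1)
expires = datetime(2024, 1, 11)
halfway = timeliness(created, expires, datetime(2024, 1, 6))
```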

Scientific and Technical Challenges of Provenance (SUMMARY)
Provenance information needs to be:
• Represented
• Captured and recorded
• Stored and secured, queried and reasoned about
• Visualized and browsed

Data Privacy


What Is Privacy?
• Freedom from observation, intrusion, or attention of others
• Society’s needs sometimes trump individual privacy
• Privacy rights are not absolute
• Balance needed
  – Individual rights
  – Society’s need
• Privacy and “due process”

Privacy and the Law
• No constitutional right to privacy
  – The word “privacy” is not in the Constitution
  – Congress has passed numerous laws
    • Not particularly effective
    • Issue is pace of change
• Privacy is a function of culture
• Privacy means different things in different countries and regions
  – Serious problem on global Internet

Figure 10.1 Some U.S. privacy laws.

Year  Title                                  Intent
1970  Fair Credit Reporting Act              Limits the distribution of credit reports to those who need to know.
1974  Privacy Act                            Establishes the right to be informed about personal information on government databases.
1978  Right to Financial Privacy Act         Prohibits the federal government from examining personal financial accounts without due cause.
1986  Electronic Communications Privacy Act  Prohibits the federal government from monitoring personal e-mail without a subpoena.
1988  Video Privacy Protection Act           Prohibits disclosing video rental records without customer consent or a court order.
2001  Patriot Act                            Streamlines federal surveillance guidelines to simplify tracking possible terrorists.

Collecting Personal Information (e.g., your email address)
• Notice/awareness
  – You must be told when and why
• Choice/consent
  – Opt-in or opt-out
• Access/participation
  – You can access and suggest corrections
• Integrity/security
  – Collecting party is responsible
• Enforcement/redress
  – You can seek legal remedies

Collecting Personal Information
• Often voluntary
  – Filling out a form
  – Registering for a prize
  – Supermarket “Rewards” cards
• Legal, involuntary sources
  – Demographics
  – Change of address
  – Various directories
  – Government records

Completing the Picture
• Aggregation
  – Combining data from multiple sources
  – Complete dossier
  – Demographics
• Finding missing pieces
  – Browser supplied data
  – TCP/IP
  – Public forums – monitoring
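
The aggregation step on this slide amounts to joining records from separate sources on a shared identifier. The sketch below shows the idea with invented records keyed by e-mail address; the function name, field names, and the "first source wins" rule are all assumptions made for illustration.

```python
def aggregate(*sources):
    """Merge records that share an e-mail address into one profile per person."""
    profiles = {}
    for records in sources:
        for rec in records:
            profile = profiles.setdefault(rec["email"], {})
            for key, value in rec.items():
                # The first source to supply a field wins; later values are ignored.
                profile.setdefault(key, value)
    return profiles


# Invented records from two hypothetical sources.
store_card = [{"email": "pat@example.org", "zip": "33601"}]
prize_form = [{"email": "pat@example.org", "age": 41, "zip": "00000"}]
dossier = aggregate(store_card, prize_form)
```

Neither source alone holds both the postal code and the age; the combined profile does, which is why aggregation raises privacy concerns that the individual sources do not.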

Capturing Clickstream Data
• Record of individual’s Internet activity
  – Web sites and newsgroups visited
  – Incoming and outgoing e-mail addresses
• Tracking
  – Secretly collecting clickstream data
  – ISP in perfect position to track you
    • All transactions go through ISP
  – Using cookies
  – Using Web bugs

Surveillance and Monitoring
• Surveillance
  – Continual observation
  – Tampa: facial scanning at Super Bowl
  – Packet sniffing
• Monitoring
  – The act of watching someone or something
  – E-mail Web bugs
  – Workplace monitoring is legal

Surveillance and Monitoring Tools
• Spyware
  – Sends collected data over back channel
• Snoopware
  – Records target’s online activities
  – Retrieved later
    • Screen shots, logs, keystrokes
• Other surveillance/monitoring sources
  – OnStar and GPS tracking
  – E-ZPass systems
  – Phone calls and credit card purchases