Provenance: an open approach to experiment validation in e-Science

Professor Luc Moreau
L.Moreau@ecs.soton.ac.uk
University of Southampton
www.ecs.soton.ac.uk/~lavm

Provenance & PASOA Teams
• University of Southampton
  - Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen
• IBM UK (EU Project Coordinator)
  - John Ibbotson, Neil Hardman, Alexis Biller
• University of Wales, Cardiff
  - Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari
• Universitat Politecnica de Catalunya (UPC)
  - Steven Willmott, Javier Vazquez
• SZTAKI
  - Laszlo Varga, Arpad Andics, Tamas Kifor
• German Aerospace
  - Andreas Schreiber, Guy Kloss, Frank Danneman

Contents
• Motivation
• Provenance Concepts
• Provenance Architecture
• Standardisation
• Provenance Queries
• Conclusions

Motivation


Scientific Research Academic Peer Review


Audit & Business Regulations
Accounting, Healthcare, Banking
Audit:
- Sarbanes-Oxley
- Basel II
- European Rec. R(97)5 (protection of medical data)
- …

e-Science datasets
• How to undertake peer-reviewing and validation of e-Scientific results?

Compliance to Regulations
• The “next-compliance” problem
• Can we be certain that by ensuring compliance to a new regulation, we do not break previous compliance?

Current Solutions
• Proprietary, monolithic silos, closed
• Do not inter-operate with other applications
• Not adaptable to new regulations

Provenance
• Oxford English Dictionary:
  - the fact of coming from some particular source or quarter; origin, derivation
  - the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the passage of an item through its various owners
• Concept vs representation

Provenance in Computer Systems
• Our definition of provenance in the context of applications for which process matters to end users:
  The provenance of a piece of data is the process that led to that piece of data
• Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases

Our Approach
• Define core concepts pertaining to provenance
• Specify functionality required to become “provenance-aware”
• Define open data models and protocols that allow systems to inter-operate
• Standardise data models and protocols
• Provide a reference implementation
• Provide reasoning capability

Context (1)
• Aerospace engineering: maintain a historical record of design processes, up to 99 years
• Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients

Context (2)
• Bioinformatics: verification and auditing of “experiments” (e.g. for drug approval)
• High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)

Provenance Concepts


Core Interfaces to Provenance “Lifecycle”
[Figure: an Application, producing Data and Results, connects to a Provenance Store through three interfaces.]
• Record Documentation of Execution
• Administer Store and its contents
• Query and Reason over Provenance of Data

Nature of Documentation
• We represent the provenance of some data by documenting the process that led to the data:
  - documentation can be complete or partial;
  - it can be accurate or inaccurate;
  - it can present conflicting or consensual views of the actors involved;
  - it can provide operational details of execution or it can be abstract.

p-assertion
• A given element of process documentation will be referred to as a p-assertion
• p-assertion: an assertion that is made by an actor and pertains to a process.

Service Oriented Architecture
• Broad definition of a service as a component that takes some inputs and produces some outputs.
• Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition.
• Interactions with services take place with messages that are constructed according to the services' interface specifications.
• The term actor denotes either a client or a service in a SOA.
• A process is defined as the execution of a workflow.

Process Documentation (1)
[Figure: Actor 1 and Actor 2 exchanging messages M1 to M4.]
• Actor 1: I received M1, M4; I sent M2, M3
• Actor 2: I received M3; I sent M4
From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4).
If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages.

Process Documentation (2)
[Figure: Actor 1 and Actor 2 exchanging messages M1 to M4.]
• M2 is in reply to M1
• M3 is caused by M1
• M2 is caused by M4
• M4 is in reply to M3
These assertions help identify order of messages, but not how data was computed.

Process Documentation (3)
[Figure: Actor 1 (functions f1, f2) and Actor 2 (function f) exchanging messages M1 to M4.]
• M3 = f1(M1)
• M2 = f2(M1, M4)
• M4 = f(M3)
These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc).

Process Documentation (4)
[Figure: Actor 1 and Actor 2 exchanging messages M2, M3 and M4, annotated with statements about their internal state.]
• I used 386 cluster
• Request sat in queue for 6 min
• I used sparc processor
• I used algorithm x version x.y.z

Types of p-assertions (1)
• Interaction p-assertion: an assertion of the contents of a message by an actor that has sent or received that message
  - I received M1, M4
  - I sent M2, M3

Types of p-assertions (2)
• Relationship p-assertion: an assertion, made by an actor, that describes how the actor obtained an output message sent in an interaction by applying some function to input messages from other interactions (likewise for data)
  - M2 is in reply to M1
  - M3 is caused by M1
  - M2 is caused by M4
  - M3 = f1(M1)
  - M2 = f2(M1, M4)

Types of p-assertions (3)
• Actor state p-assertion: an assertion made by an actor about its internal state in the context of a specific interaction
  - I used sparc processor
  - I used algorithm x version x.y.z
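The three kinds of p-assertion can be pictured as three small record types. The Python sketch below is only an illustration: the class names, field names and the `endpoints` helper are assumptions made for this example and are not taken from the PASOA schemas. It also assumes that the "sparc processor" statement was made by Actor 2, which the slides do not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str   # actor making the assertion
    view: str       # "sender" or "receiver"
    message: str    # identifier of the message exchanged


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    effect: str     # output message
    relation: str   # e.g. a function name such as "f1"
    causes: tuple   # input messages the output depends on


@dataclass(frozen=True)
class ActorStatePAssertion:
    asserter: str
    message: str    # interaction the state refers to
    state: str      # free-text description of internal state


def endpoints(assertions):
    """Derive (sender, receiver) per message from interaction p-assertions."""
    result = {}
    for a in assertions:
        if isinstance(a, InteractionPAssertion):
            sender, receiver = result.get(a.message, (None, None))
            if a.view == "sender":
                sender = a.asserter
            else:
                receiver = a.asserter
            result[a.message] = (sender, receiver)
    return result


documentation = [
    InteractionPAssertion("Actor 1", "sender", "M3"),
    InteractionPAssertion("Actor 2", "receiver", "M3"),
    InteractionPAssertion("Actor 2", "sender", "M4"),
    InteractionPAssertion("Actor 1", "receiver", "M4"),
    RelationshipPAssertion("Actor 1", "M3", "f1", ("M1",)),
    # Assumption: the slides do not say which actor asserted this state.
    ActorStatePAssertion("Actor 2", "M4", "I used sparc processor"),
]

print(endpoints(documentation)["M3"])
```

Combining the sender's and the receiver's interaction p-assertions is what lets a reader conclude that M3 went from Actor 1 to Actor 2.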

Data flow
• Interaction p-assertions allow us to specify a flow of data between actors
• Relationship p-assertions allow us to characterise the flow of data “inside” an actor
• Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result
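A minimal sketch of the DAG idea, using the relationships from the Process Documentation slides (M3 = f1(M1), M4 = f(M3), M2 = f2(M1, M4)). The dictionary layout and the function name `provenance_of` are assumptions for illustration; because a message keeps the same identifier on the sending and receiving side, the internal and external flows merge into one message-level graph here.

```python
# Relationship p-assertions from the Process Documentation slides:
# each output message maps to the input messages it was computed from.
derived_from = {
    "M3": ["M1"],        # M3 = f1(M1), asserted by Actor 1
    "M4": ["M3"],        # M4 = f(M3), asserted by Actor 2
    "M2": ["M1", "M4"],  # M2 = f2(M1, M4), asserted by Actor 1
}


def provenance_of(message, edges):
    """Return every message that `message` transitively depends on."""
    seen = set()
    stack = [message]
    while stack:
        current = stack.pop()
        for cause in edges.get(current, []):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


print(sorted(provenance_of("M2", derived_from)))
```

Walking the graph backwards from M2 collects M1, M3 and M4, that is, the whole process that led to the result.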

Provenance Architecture


Interfaces to Provenance Store
[Figure: an Application, producing Results, connects to a Provenance Store through three interfaces.]
• Record Documentation of Execution
• Administer Store and its contents
• Query and Reason over Provenance of Data

P-Assertion schemas


The p-structure (1)
• The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors
• Hierarchical
• Indexed by interactions (interaction = 1 message exchange)
[Figure: each interaction record holds a Sender’s view and a Receiver’s view.]

The p-structure (2)
[Figure: within each view, p-assertions are grouped by asserter identity.]
• Asserter identity
• All p-assertions asserted by a given actor participating in an interaction
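The hierarchy described on the two p-structure slides (interaction, then view, then asserter, then p-assertions) can be sketched as nested dictionaries. This is a toy model under stated assumptions: the class `PStructure`, its method names and the use of plain strings for p-assertions are all hypothetical and do not reflect the actual store schema.

```python
from collections import defaultdict


class PStructure:
    """Toy p-structure: interaction key -> view -> asserter -> p-assertions."""

    def __init__(self):
        self._records = defaultdict(
            lambda: {"sender": defaultdict(list), "receiver": defaultdict(list)}
        )

    def record(self, interaction_key, view, asserter, p_assertion):
        if view not in ("sender", "receiver"):
            raise ValueError("view must be 'sender' or 'receiver'")
        # Append-only: existing p-assertions are never modified or removed.
        self._records[interaction_key][view][asserter].append(p_assertion)

    def assertions(self, interaction_key, view, asserter):
        record = self._records.get(interaction_key)
        if record is None:
            return ()
        # Return a tuple so callers cannot mutate the stored list.
        return tuple(record[view].get(asserter, ()))


store = PStructure()
store.record("M3", "sender", "Actor 1", "I sent M3")
store.record("M3", "receiver", "Actor 2", "I received M3")
store.record("M3", "sender", "Actor 1", "M3 = f1(M1)")

print(store.assertions("M3", "sender", "Actor 1"))
```

Indexing by interaction first means that both parties to a message exchange document it under the same key, each in their own view.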

Recording Protocol (Groth 04-06)
• Abstract machines
• DS Properties
  - Termination
  - Liveness
  - Safety
  - Statelessness
• Documentation Properties
  - Immutability
  - Attribution
  - Datatype safety
• Foundation for adding necessary cryptographic techniques

Querying Functionality (Miles 06)
• Process Documentation Query Interface: allows for “navigation” of the documentation of execution
  - Allows us to view the provenance store (i.e. the p-structure) as if containing XML data structures
  - Independent of technology used for running application and internal store representation
  - Seamless navigation of application dependent and application independent process documentation

Querying Functionality (Miles 06)
• Provenance Query Interface: allows us to obtain the provenance of some specific data
• A recognition that there is not “one” provenance for a piece of data; there may be several, depending on the end-user’s interest
• Hence, provenance is seen as the result of a query:
  - Identify a piece of data at a specific execution point
  - Scope of the process of interest: filter in/out p-assertions according to actors, process, types of relationships, etc.

Available Software
• PReServ (Paul Groth & Simon Miles)
• Offers recording and querying interfaces
• Available from www.pasoa.org
• OGSA-DAI based version available from www.gridprovenance.org
• Is being used in a bioinformatics application (cf. HPDC'05, ISWC'05)

Provenance Store Components
[Figure: component diagram. An Actor uses the CSL to call the ProvenanceStoreFactory (Factory) and the ProvenanceService (Record, PQuery, XPath, XQuery, Iterate, Destroy), both hosted in a Globus GT4 Container with External Security Services. The ProvenanceServiceResourceHome manages ProvenanceStoreResource instances, which use the PStoreDatabase Client API and the OGSA-DAI API to reach an eXist XML Database through OGSA-DAI in a second Globus GT4 Container.]
Slide from John Ibbotson

Provenance Store Security
[Figure: an Actor, through the CSL, sends requests to the ProvenanceStoreFactory (Factory) and the ProvenanceService (Record, PQuery, XPath, XQuery, Iterate, Destroy) in the Provenance GT4 Container. A Policy Decision Point in front of each, driven by an ACL File (XML), approves or denies each request before it reaches the Resources.]
Slide from John Ibbotson

Provenance Implementation
• The Client Side Library exposes Provenance Store functionality and separates Actor from alternative server-side implementations
  - EU Provenance project implementation
  - PASOA PReServ
• Security is being extended to allow federation using Globus Community Authorization Service (CAS)
Slide from John Ibbotson

Standardisation


Standardisation Options
• APIs: programmatic inter-op
• Recording and querying interfaces: service inter-op
• Provenance model: data inter-op

Purpose of Standardisation
[Figure: Applications record documentation of execution into Provenance Stores.]
Allow for multiple applications to document their execution.

Purpose of Standardisation
[Figure: an Application records documentation of execution into a Provenance Store.]
Allow for multiple stores from multiple IT providers

Purpose of Standardisation
[Figure: the provenance of data is queried from a Provenance Store.]
Allow for multiple stores from multiple IT providers

Purpose of Standardisation
[Figure: conversion into a standard data format.]
Allow for legacy, monolithic applications to expose their contents (according to standard schema)

Purpose of Standardisation
[Figure: an Application and a separately hosted Provenance Store.]
Allow third parties to host provenance stores, which are trusted by application owners but also auditors

Compliance Oriented Architectures
• Separate execution documentation from compliance verification
• Allows for multiple compliance verifications
• Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting)
• Approach is suitable for e-scientific peer-reviewing and business compliance verification

Standardisation Philosophy
• Thin layer common between systems: extensible data model
• Model can be extended for specific:
  - technologies (WS, Web, …), or
  - application domains (Bio, Healthcare, Desktop, …)
• Service interfaces

Proposed List of Specifications
• Generic Profiles
  - WS-Prov-Intro
  - WS-Prov-Glo
  - WS-Prov-Primer
  - WS-Prov-DM-Sec
  - WS-Prov-DM-Link
  - WS-Prov-DM-Infer
  - WS-Prov-DM-DS
  - WS-Prov-DM-Rel
  - WS-Prov-Rec
  - WS-Prov-Query
• Technology Bindings
  - WS-Prov-SOAP
  - WS-Prov-WWW
• Domain Specific Profiles

Provenance Queries (Miles’ 06)


Example Application
[Figure: GUI, Averager, Divider and Store exchange five messages.]
1. average(7, 5): GUI to Averager
2. divide(12, 2): Averager to Divider
3. answer(6): Divider to Averager
4. answer(6): Averager to GUI
5. store(“6”, file1): GUI to Store
Averager(in1, in2) { return (in1+in2)/2; }
Averager delegates the division operation to the service Divider
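The example application is small enough to run. The sketch below is a hypothetical Python rendering of the slide, with each actor modelled as a function and every message appended to a shared log; the names `send`, `log` and `files` are assumptions, and integer division is assumed so that the answer matches the "6" shown on the slide.

```python
log = []  # (message number, sender, receiver, content)


def send(number, sender, receiver, content):
    """Record one message exchange between two actors."""
    log.append((number, sender, receiver, content))


def divider(dividend, divisor):
    # Assumption: integer division, matching the integer answer on the slide.
    return dividend // divisor


def averager(in1, in2):
    total = in1 + in2
    # Averager delegates the division to the Divider service.
    send(2, "Averager", "Divider", f"divide({total}, 2)")
    quotient = divider(total, 2)
    send(3, "Divider", "Averager", f"answer({quotient})")
    return quotient


def gui(in1, in2, files):
    send(1, "GUI", "Averager", f"average({in1}, {in2})")
    result = averager(in1, in2)
    send(4, "Averager", "GUI", f"answer({result})")
    send(5, "GUI", "Store", f'store("{result}", file1)')
    files["file1"] = str(result)  # the Store actor keeps the value
    return result


files = {}
gui(7, 5, files)
for entry in log:
    print(entry)
```

The log holds the five messages in the order numbered on the slide, which is the raw material that interaction p-assertions would document.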

Example Application
[Figure: GUI, Averager, Divider and Store exchange five messages.]
1. average(7, 5): GUI to Averager
2. divide(12, 2): Averager to Divider
3. answer(6): Divider to Averager
4. answer(6): Averager to GUI
5. store(“6”, file1): GUI to Store
Relationships
• 12 in msg 2 is sum of 7, 5 in msg 1
• 6 in msg 3 is division of 12, 2 in msg 2
• 6 in msg 4 is copy of 6 in msg 3
• 6 in msg 4 is average of 7, 5 in msg 1
• 6 in msg 5 is copy of 6 in msg 4
Tracers
• are used to demarcate activities (aka sets of services)
• added by Averager in call to Divider
• returned by Divider in response

The data we want to find the provenance of
[Figure: Store holding “file1”.]
• Identify the event where the entity is documented
  - In this case, the event is the receipt by Store of a request to store the data in the file named file1
• Identify the data entity within that message
  - In this case, the data of interest is the “6” stored in file1

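Provenance Graph
[Figure: provenance graph of the “6” sent from GUI to Store. Each node is a data item in a message between two actors.]
• “6” (GUI to Store) is a copy of “6” (Averager to GUI)
• “6” (Averager to GUI) is a copy of “6” (Divider to Averager)
• “6” (Averager to GUI) is the average of “7” and “5” (GUI to Averager)
• “6” (Divider to Averager) is the division of “12” (Averager to Divider, dividend) and “2” (Averager to Divider, divisor)
• “12” (Averager to Divider) is the sum of “7” and “5” (GUI to Averager)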

Scoped Provenance Graph
[Figure: the provenance graph with the “Average of” relationships removed.]
• Allows us to ignore the high level structure of the computation and to focus on the actual operations
• e.g. allows us to establish what a given provider actually does
• Filter to exclude “Average of” relationships

Scoped Provenance Graph
[Figure: the provenance graph with the messages between Averager and Divider removed.]
• Allows us to consider a given service (and all its inferior invocations) as a black box: high level account of provenance
• e.g. no detail should be provided about the internals of Averager
• Filter to exclude messages containing tracer
• This is equivalent to hiding the internal operation of Averager

Scoped Provenance Graph
[Figure: the provenance graph with the Divisor parameter removed.]
• Allows us to scope the provenance graph according to types of data or operations
• e.g. looking at the restorations of a painting rather than its various owners
• Filter to exclude Divisor parameters
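The effect of scoping can be tried out on the relationships of the example application. This is a minimal sketch, not the query engine itself: the edge list is transcribed from the Relationships slide (taking the stored "6" to be in message 5), the relation names other than `averageOf` are assumed spellings, and the function `provenance` with its `keep` filter is a hypothetical interface.

```python
# Each data item is (value, message number); each edge is
# (effect, relation, cause), transcribed from the Relationships slide.
EDGES = [
    (("12", 2), "sumOf", ("7", 1)),
    (("12", 2), "sumOf", ("5", 1)),
    (("6", 3), "divisionOf", ("12", 2)),
    (("6", 3), "divisionOf", ("2", 2)),
    (("6", 4), "copyOf", ("6", 3)),
    (("6", 4), "averageOf", ("7", 1)),
    (("6", 4), "averageOf", ("5", 1)),
    (("6", 5), "copyOf", ("6", 4)),
]
TRACED_MESSAGES = {2, 3}  # messages carrying the tracer added by Averager


def provenance(item, keep=lambda effect, relation, cause: True):
    """Return the edges reachable from `item` that pass the scope filter."""
    result, stack, seen = [], [item], {item}
    while stack:
        current = stack.pop()
        for effect, relation, cause in EDGES:
            if effect == current and keep(effect, relation, cause):
                result.append((effect, relation, cause))
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
    return result


def no_average(effect, relation, cause):
    """Scope: exclude 'averageOf' relationships."""
    return relation != "averageOf"


def no_tracer(effect, relation, cause):
    """Scope: exclude data carried by messages that contain the tracer."""
    return effect[1] not in TRACED_MESSAGES and cause[1] not in TRACED_MESSAGES


stored = ("6", 5)  # the "6" that Store received for file1
print(len(provenance(stored)))
print(len(provenance(stored, no_average)))
print(sorted(cause for _, _, cause in provenance(stored, no_tracer)))
```

The same starting item yields three different graphs depending on the scope, which is the sense in which provenance is "the result of a query" rather than a single fixed answer.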

Provenance Query


Practically …
• Event and Data Identification
  - Event identification: the interaction record in which the receiver (messageSink) has address http://www.example.com/store
      //ps:interactionRecord
        [ps:interactionKey/ps:messageSink/
         wsa:EndpointReference/
         wsa:Address="http://www.example.com/store"]
  - Data identification
      //ps:interactionPAssertion
        [ex:envelope/ex:store/ex:location="/home/sm/data/file1"]
        //ex:envelope/ex:store/ex:data

Practically …
• The scope of the provenance query
  - Unscoped query
      /
  - Exclude ‘averageOf’ relation
      /pq:relationshipTarget[ps:relation != "http://www.example.com#averageOf"]
  - Exclude tracer introduced by Averager
      /pq:relationshipTarget/ps:interactionPAssertion
        [not(ex:envelope/ph:pheader/
         ph:interactionMetaData
         [ph:tracer="process://sub/1"])]

Provenance of Donor Diagnosis Request
[Figure: provenance graph from the organ transplant management application. Actors: User Interface, Brain Death Manager, Healthcare Record Manager, EHCRS, Testing Lab, Donor Data Collector, Decision Maker. Data items: Donor Data Collection Request, Patient (in Brain Death Notification), EHCR Request, Data Collection Complete, Donor Data Collection, Patient Test Results, Test Results, Diagnose Request. Relationships: Was Caused By, Is Response To, Includes Data, Is Diagnosis Request For.]

Conclusions


To Sum Up
[Figure: standardising the documentation of business processes. The Provenance Architecture and Methodology are applied across domains (Distribution, Finance, Aerospace, Healthcare, Automobile, Pharmaceutical). Applications Record into a Provenance Store, which is then Queried.]
• Compliance check
• Rerun/Reproduce
• Analyse
Slide from John Ibbotson

Conclusions
• Crucial topic for many applications
• Full architectural specification
• An implementation available for download
• Methodology to make applications provenance-aware
www.pasoa.org
www.gridprovenance.org

Provenance Challenge
twiki.ipaw.info

Publications
1. Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.
2. Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.
3. Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.
4. Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.
5. Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Laszlo Varga, Ulises Cortes, and Steven Willmott. Provenance-based Trust for Grid Computing - Position Paper. In , 2003.
6. Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.
7. Paul Groth, Simon Miles, Victor Tan, and Luc Moreau. Architecture for Provenance Systems. Technical report, University of Southampton, October 2005.

Questions
