An Open Provenance Model for Scientific Workflows

Professor Luc Moreau
L.Moreau@ecs.soton.ac.uk
University of Southampton
www.ecs.soton.ac.uk/~lavm

Provenance & PASOA Teams
• University of Southampton: Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen
• IBM UK (EU Project Coordinator): John Ibbotson, Neil Hardman, Alexis Biller
• University of Wales, Cardiff: Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari
• Universitat Politècnica de Catalunya (UPC): Steven Willmott, Javier Vazquez
• SZTAKI: Laszlo Varga, Arpad Andics, Tamas Kifor
• German Aerospace: Andreas Schreiber, Guy Kloss, Frank Danneman

Contents
• Motivation
• Provenance Concept Map
• Process documentation in a concrete bioinformatics application
• Conclusions

Motivation

Peer Review/Audit
• Academic publishing
• Banking
• Accounting
• Healthcare

e-Science datasets
• How to undertake peer-reviewing and validation of e-Scientific results?

Current Solutions
• Proprietary, monolithic
• Silos, closed
• Do not inter-operate with other applications
• Not adaptable to new regulations

Provenance
• Oxford English Dictionary:
    • the fact of coming from some particular source or quarter; origin, derivation
    • the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the passage of an item through its various owners
• Concept vs representation

Application Drivers
• Aerospace engineering: maintain a historical record of design processes, up to 99 years.
• Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients.
• Bioinformatics: verification and auditing of “experiments” (e.g. for drug approval).
• High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN).

Provenance Concept Map

[Figure: provenance concept map. Nodes: Process Documentation, Process, Provenance (concept), Provenance (representation), Provenance Query, Application, P-structure, Data product, Services, P-assertions. Edge labels: documents, is defined as, has a structure, is an execution of, produces, is represented by, has, is obtained by, contains, assert, consists of, operates over.]

Making Applications Provenance Aware
• Assert p-assertions and record them as Process Documentation
• Obtain the provenance of data by issuing provenance queries
[Diagram labels: Application, Data Product, Provenance Store]
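
The two phases on this slide (record, then query) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PAssertion`, `ProvenanceStore`, their fields, and the service and file names are hypothetical and do not come from the PASOA implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PAssertion:
    """One assertion by an actor about a step of a process (hypothetical shape)."""
    asserter: str   # who makes the assertion
    subject: str    # the data product the assertion is about
    statement: str  # free-text content of the assertion


@dataclass
class ProvenanceStore:
    """In-memory stand-in for a provenance store."""
    assertions: list = field(default_factory=list)

    def record(self, p_assertion):
        # Recording phase: the application submits process documentation.
        self.assertions.append(p_assertion)

    def query(self, subject):
        # Query phase: return everything asserted about one data product.
        return [a for a in self.assertions if a.subject == subject]


store = ProvenanceStore()
store.record(PAssertion("service-A", "result.dat", "computed from input.dat"))
store.record(PAssertion("service-B", "other.dat", "received from service-A"))
matches = store.query("result.dat")
```

A real store would be a separate service with persistent storage; the sketch only shows that recording and querying are decoupled.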

Process Documentation
• Interaction p-assertions: I received M1, M4; I sent M2, M3
• Relationship p-assertions: M3 = f1(M1); M2 = f2(M1, M4); M2 is in reply to M1
• Service state p-assertions: I received M1 at time t; I used algorithm x.y.z
[Diagram: a service exchanging messages M1, M2, M3, M4 and applying functions f1, f2]

Data flow
• Interaction p-assertions allow us to specify a flow of data between services
• Relationship p-assertions allow us to characterise the flow of data “inside” a service
• Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result
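
As a minimal sketch of how such a DAG can be traversed, the Python below encodes the relationships M3 = f1(M1) and M2 = f2(M1, M4) from the Process Documentation slide and walks them backwards. The dictionary layout and the `provenance` function are assumptions made for illustration, not the project's query interface.

```python
# Relationship p-assertions: each output message maps to the function
# applied and the input messages it was derived from.
relationships = {
    "M3": ("f1", ["M1"]),
    "M2": ("f2", ["M1", "M4"]),
}


def provenance(message, relationships):
    """Return all messages that `message` transitively depends on."""
    seen = set()
    stack = [message]
    while stack:
        current = stack.pop()
        _, inputs = relationships.get(current, (None, []))
        for parent in inputs:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


m2_sources = provenance("M2", relationships)
m3_sources = provenance("M3", relationships)

# A two-step chain (hypothetical names) shows the transitive case.
chain = {"C": ("g", ["B"]), "B": ("f", ["A"])}
c_sources = provenance("C", chain)
```

With several services, interaction p-assertions would supply the edges that connect one service's output to another's input; here a single service is enough to show the traversal.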

Process Documentation in a Concrete Bioinformatics Application

Biology
• How do protein sequences fold into a 3D structure?
• Structure of protein sequences may help to answer this question.
• Structure can be quantified by textual compressibility.
• Which amino acid groupings maximise compressibility?
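
To make "textual compressibility" concrete, here is a hedged Python sketch: recode a sequence under one candidate amino acid grouping, compress it, and compare sizes. Everything specific is an assumption for illustration: the three-class grouping, the sample sequence, and the use of `zlib` as the compressor are not taken from the slides.

```python
import zlib

# Hypothetical grouping of the 20 amino acids into three classes
# (hydrophobic, polar, charged); an assumption, not from the slides.
GROUPS = {"h": "AVLIMFWC", "p": "STNQYGP", "c": "DEKRH"}
RECODE = {aa: label for label, members in GROUPS.items() for aa in members}


def recode(sequence, table):
    """Replace each amino acid by the label of its group."""
    return "".join(table[aa] for aa in sequence)


def compressibility(text):
    """Compressed size divided by original size (lower = more compressible)."""
    raw = text.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


# Made-up sample sequence, repeated so the compressor has something to find.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV" * 4
grouped = recode(sequence, RECODE)
ratio_raw = compressibility(sequence)
ratio_grouped = compressibility(grouped)
```

Searching for the best grouping would repeat this measurement over many candidate groupings, which is the kind of multi-step experiment whose process documentation the following slides describe.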

Collaboration Diagram

Actual Call DAG

The P-Structure
The logical structure of a provenance store.

Interaction Record
The set of p-assertions pertaining to a given interaction (i.e., message exchange between a sender and a receiver).

Interaction Key
A unique identifier for an interaction.
• Sender identity
• Receiver identity
• Local id

View
The set of p-assertions created by an asserter involved in an interaction (sender or receiver view).

Asserter
The identity of an asserter.
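
The interaction key, interaction record, and views defined so far can be pictured as a keyed data structure. The Python below is a sketch only: the class name, field names, and the dictionary layout are hypothetical and are not the schema used by the actual provenance store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionKey:
    """Identifies one message exchange; field names are illustrative."""
    sender: str    # identity of the sending actor
    receiver: str  # identity of the receiving actor
    local_id: str  # distinguishes interactions between the same pair


# frozen=True makes the key hashable, so it can index interaction records.
# Each record holds two views: p-assertions made by the sender and by the receiver.
records = {}
key = InteractionKey("client", "blast-wrapper", "42")
records.setdefault(key, {"sender": [], "receiver": []})
records[key]["sender"].append("I sent the request")

# An equal key built elsewhere finds the same record.
same_key = InteractionKey("client", "blast-wrapper", "42")
```

Because sender and receiver can each build the same key independently, both can document the same interaction without coordinating with each other.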

Interaction P-Assertion
An assertion of the contents of a message by an actor that has sent or received that message.

Interaction P-Assertion Content
The content of an interaction p-assertion: here, the invocation of blast (through a wrapper).

Interaction Content
Provenance-related information passed in application messages.

Actor State P-Assertion
An assertion made by an actor about its internal state in the context of a specific interaction.

Relationship P-Assertion
With respect to an interaction, a relationship p-assertion is an assertion, made by an actor, that describes how the actor obtained output data or the whole message sent in that interaction by applying some function to input data or messages from other interactions.

Subject Id
The identity of the subject of a relationship.

Object Id
The identity of the object of a relationship.
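
A relationship p-assertion with its subject and object ids might be modelled as below. This is a sketch under assumptions: `DataRef`, `RelationshipPAssertion`, the field names, and the interaction labels are invented for illustration, and the example reuses M2 = f2(M1, M4) from the earlier Process Documentation slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRef:
    """Points at a data item inside a recorded interaction (illustrative)."""
    interaction: str  # stands in for an interaction key
    item: str         # name of the message or data item


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str     # actor making the assertion
    subject: DataRef  # subject id: the output being explained
    relation: str     # the function or relation applied
    objects: tuple    # object ids: the inputs the subject was obtained from


rel = RelationshipPAssertion(
    asserter="service",
    subject=DataRef("interaction-2", "M2"),
    relation="f2",
    objects=(DataRef("interaction-1", "M1"), DataRef("interaction-4", "M4")),
)
inputs = [ref.item for ref in rel.objects]
```

The point of the subject/object split is that the objects refer to data in other interactions, which is what lets a query follow the chain from a result back to its inputs.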

Process Documentation Characteristics
• Common logical structure of the provenance store shared by all asserting and querying actors
• Can be produced autonomously, asynchronously by the different application components
• Open, extensible model, for which we are producing a public specification
• Tools can operate on it (e.g. visualisation, reasoning)

Performance (HPDC ’05)

Standardisation Philosophy
• Thin layer common between systems: extensible data model
• Model can be extended for specific:
    • technologies (WS, Web, …), or
    • application domains (Bio, Healthcare, Desktop, …)
• Service interfaces

Proposed List of Specifications
• Generic Profiles: WS-Prov-Intro, WS-Prov-DM-Sec, WS-Prov-DM-Link, WS-Prov-Glo, WS-Prov-DM-Infer, WS-Prov-DM-DS, WS-Prov-Primer, WS-Prov-DM-Rel, WS-Prov-Rec, WS-Prov-Query
• Technology Bindings: WS-Prov-SOAP, WS-Prov-WWW
• Domain Specific Profiles

Conclusions

To Sum Up
Standardising the documentation of Business Processes
• Domains: Distribution, Finance, Aerospace, Healthcare, Automobile, Pharmaceutical
• Provenance: Architecture, Methodology
• Provenance Store: Record, Query
• Uses: Compliance check, Rerun/Reproduce, Analyse
Slide from John Ibbotson

Conclusions
• Crucial topic for many applications
• Full architectural specification
• Implementation available for download
• Methodology to make applications provenance-aware
• Draft standardisation proposal to be released
www.pasoa.org
www.gridprovenance.org

Provenance Challenge Workshop at OGF 18, Washington, September 11-14 (twiki.ipaw.info)

Questions
