Provenance: overview
Professor Luc Moreau
L.Moreau@ecs.soton.ac.uk
University of Southampton
www.ecs.soton.ac.uk/~lavm
Architecture Tutorial

Provenance & PASOA Teams
• University of Southampton – Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen
• IBM UK (EU Project Coordinator) – John Ibbotson, Neil Hardman, Alexis Biller
• University of Wales, Cardiff – Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari
• Universitat Politècnica de Catalunya (UPC) – Steven Willmott, Javier Vazquez
• SZTAKI – Laszlo Varga, Arpad Andics, Tamas Kifor
• German Aerospace – Andreas Schreiber, Guy Kloss, Frank Danneman

Contents
• Motivation
• Provenance Concepts
• Provenance Architecture
• Standardisation
• Conclusions

Motivation

Scientific Research
• Academic Peer Review

Business Regulations
• Accounting Audit (Sarbanes-Oxley)
• Banking Audit (Basel II)

Health Care Management
• European Recommendation R(97)5 on the protection of medical data

e-Science datasets
• How to undertake peer-reviewing and validation of e-Scientific results?

Compliance to Regulations
• The “next-compliance” problem
– Can we be certain that, by ensuring compliance with a new regulation, we do not break previous compliance?

Current Solutions
• Proprietary, Monolithic
• Silos, Closed
• Do not inter-operate with other applications
• Not adaptable to new regulations

Provenance
• Oxford English Dictionary:
– the fact of coming from some particular source or quarter; origin, derivation;
– the history or pedigree of a work of art, manuscript, rare book, etc.;
– concretely, a record of the passage of an item through its various owners.
• Concept vs representation

Provenance in Computer Systems
• Our definition of provenance in the context of applications for which process matters to end users:
The provenance of a piece of data is the process that led to that piece of data
• Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases

Our Approach
• Define core concepts pertaining to provenance
• Specify functionality required to become “provenance-aware”
• Define open data models and protocols that allow systems to inter-operate
• Standardise data models and protocols
• Provide a reference implementation
• Provide reasoning capability

Context (1)
• Aerospace engineering: maintain a historical record of design processes, up to 99 years.
• Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients

Context (2)
• Bioinformatics: verification and auditing of “experiments” (e.g. for drug approval)
• High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)

Provenance Concepts

Provenance “Lifecycle”
[Figure labels: Core Interfaces to Provenance Store; Application; Data; Results; Provenance Store; Record Documentation of Execution; Query and Reason over Provenance of Data; Administer Store and its contents]

Nature of Documentation
• We represent the provenance of some data by documenting the process that led to the data:
– documentation can be complete or partial;
– it can be accurate or inaccurate;
– it can present conflicting or consensual views of the actors involved;
– it can provide operational details of execution or it can be abstract.

p-assertion
• A given element of process documentation will be referred to as a p-assertion
– p-assertion: an assertion that is made by an actor and pertains to a process.

Service Oriented Architecture
• Broad definition of a service as a component that takes some inputs and produces some outputs.
• Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition.
• Interactions with services take place with messages that are constructed according to the service's interface specification.
• The term actor denotes either a client or a service in an SOA.
• A process is defined as the execution of a workflow.

Process Documentation (1)
[Diagram: Actor 1 and Actor 2 exchanging messages M1 to M4]
– Actor 1: I received M1, M4; I sent M2, M3
– Actor 2: I received M3; I sent M4
• From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4)
• If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages
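The derivation on this slide can be sketched in a few lines of Python. This is only an illustration, not PASOA code: the tuple layout and the function name `match_interactions` are assumptions, and the attribution of each statement to an actor follows the slide's own conclusion about M3 and M4.

```python
from collections import defaultdict

# (asserter, view, message) triples transcribed from the slide.
ASSERTIONS = [
    ("Actor 1", "received", "M1"),
    ("Actor 1", "received", "M4"),
    ("Actor 1", "sent", "M2"),
    ("Actor 1", "sent", "M3"),
    ("Actor 2", "received", "M3"),
    ("Actor 2", "sent", "M4"),
]


def match_interactions(assertions):
    """Pair the sender's and receiver's views of each message.

    Returns {message: (sender, receiver)} for every message that has
    been documented from both sides; one-sided messages are omitted.
    """
    views = defaultdict(dict)
    for asserter, view, message in assertions:
        views[message][view] = asserter
    return {
        message: (seen["sent"], seen["received"])
        for message, seen in views.items()
        if "sent" in seen and "received" in seen
    }


matched = match_interactions(ASSERTIONS)
```

M1 and M2 do not appear in `matched` because only one side of each exchange is documented here, which is exactly why these assertions alone say nothing about dependencies between messages.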

Process Documentation (2)
[Diagram: Actor 1 and Actor 2 exchanging messages M1 to M4]
– M2 is in reply to M1
– M3 is caused by M1
– M2 is caused by M4
– M4 is in reply to M3
• These assertions help identify order of messages, but not how data was computed

Process Documentation (3)
[Diagram: Actor 1 and Actor 2 exchanging messages M1 to M4]
– M3 = f1(M1)
– M2 = f2(M1, M4)
– M4 = f(M3)
• These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc.)

Process Documentation (4)
[Diagram: Actor 1 and Actor 2 exchanging messages M1 to M4]
– I used 386 cluster
– Request sat in queue for 6 min
– I used sparc processor
– I used algorithm x version x.y.z

Types of p-assertions (1)
– Interaction p-assertion: an assertion of the contents of a message by an actor that has sent or received that message
Examples: I received M1, M4; I sent M2, M3

Types of p-assertions (2)
– Relationship p-assertion: an assertion, made by an actor, that describes how the actor obtained an output message sent in an interaction by applying some function to input messages from other interactions (likewise for data)
Examples: M2 is in reply to M1; M3 is caused by M1; M2 is caused by M4; M3 = f1(M1); M2 = f2(M1, M4)

Types of p-assertions (3)
– Actor state p-assertion: an assertion made by an actor about its internal state in the context of a specific interaction
Examples: I used sparc processor; I used algorithm x version x.y.z
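As a rough illustration, the three kinds of p-assertion could be modelled as immutable records. The class and field names below are hypothetical and are not the PASOA schemas; which actor made the actor state assertion, and for which interaction, is also an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str      # actor making the assertion
    view: str          # "sender" or "receiver"
    message_id: str    # message the assertion is about, e.g. "M3"
    content: str       # asserted contents of the message


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    output_id: str               # message (or data) that was produced
    function: str                # how it was obtained, e.g. "f1"
    input_ids: Tuple[str, ...]   # messages (or data) it was obtained from


@dataclass(frozen=True)
class ActorStatePAssertion:
    asserter: str
    interaction_id: str   # interaction that gives the state its context
    state: str            # free-form description of the internal state


examples = [
    InteractionPAssertion("Actor 1", "sender", "M3", "<contents of M3>"),
    RelationshipPAssertion("Actor 1", "M3", "f1", ("M1",)),
    # Assumed attribution: the slide does not say who asserted this.
    ActorStatePAssertion("Actor 2", "M4", "I used sparc processor"),
]
```

Frozen dataclasses are used so that a recorded assertion cannot be modified afterwards, which seems a natural fit for documentation of past execution.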

Data flow
• Interaction p-assertions allow us to specify a flow of data between actors
• Relationship p-assertions allow us to characterise the flow of data “inside” an actor
• Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result
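A small sketch of the DAG idea, using the relationships from the Process Documentation (3) slide (M3 = f1(M1), M2 = f2(M1, M4), M4 = f(M3)). The dictionary layout and the function name `ancestors` are assumptions made for illustration; `graphlib` is in the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

# Each item maps to the items it was computed from.
DERIVED_FROM = {
    "M3": ["M1"],        # M3 = f1(M1)
    "M2": ["M1", "M4"],  # M2 = f2(M1, M4)
    "M4": ["M3"],        # M4 = f(M3)
}


def ancestors(item, derived_from):
    """Return every item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# static_order() raises graphlib.CycleError if the data flow is not a DAG;
# otherwise it lists inputs before the items computed from them.
order = list(TopologicalSorter(DERIVED_FROM).static_order())
```

Here `ancestors("M2", DERIVED_FROM)` walks the graph backwards from M2, which is one simple reading of "the process that led to a result".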

Provenance Architecture

Interfaces to Provenance Store
[Figure labels: Application; Results; Provenance Store; Record Documentation of Execution; Query and Reason over Provenance of Data; Administer Store and its contents]

P-Assertion schemas

The p-structure
• The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors
• Hierarchical
• Indexed by interactions (interaction = one message exchange)
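One way to picture a hierarchical structure indexed by interactions is a nested mapping. The layout below (interaction key, then view, then kind of p-assertion) and the helper names are assumptions for illustration only; the slide does not spell out the actual p-structure schema.

```python
from collections import defaultdict


def new_pstructure():
    """interaction key -> view -> kind of p-assertion -> list of entries."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


def record(store, interaction, view, kind, asserter, content):
    """File one p-assertion under the interaction it documents."""
    store[interaction][view][kind].append(
        {"asserter": asserter, "content": content}
    )


def lookup(store, interaction):
    """Return a plain-dict copy of everything known about one interaction."""
    views = store.get(interaction, {})
    return {
        view: {kind: list(entries) for kind, entries in kinds.items()}
        for view, kinds in views.items()
    }


store = new_pstructure()
record(store, "M3", "sender", "interaction", "Actor 1", "I sent M3")
record(store, "M3", "receiver", "interaction", "Actor 2", "I received M3")
record(store, "M3", "sender", "relationship", "Actor 1", "M3 is caused by M1")
```

Because both actors file their assertions under the same interaction key, a querying actor can retrieve the sender's and the receiver's views of M3 with a single lookup.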

Recording Protocol (Groth 04-06)
• Abstract machines
• DS Properties
– Termination
– Liveness
– Safety
– Statelessness
• Documentation Properties
– Immutability
– Attribution
– Datatype safety
• Foundation for adding necessary cryptographic techniques

Querying Functionality (Miles 06)
• Process Documentation Query Interface: allows for “navigation” of the documentation of execution
– Allows us to view the provenance store (i.e. the p-structure) as if containing XML data structures
– Independent of the technology used for running the application and of the internal store representation
– Seamless navigation of application-dependent and application-independent process documentation
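To give a feel for navigating process documentation "as if" it were XML, the sketch below uses Python's standard `xml.etree.ElementTree` and its limited XPath support. The element and attribute names are invented for this example and are not the PASOA schemas or the actual query interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML view of a tiny provenance store.
PSTRUCTURE_XML = """
<pstructure>
  <interactionRecord key="M3">
    <view role="sender" asserter="Actor 1">
      <interactionPAssertion>I sent M3</interactionPAssertion>
      <relationshipPAssertion relation="isCausedBy" cause="M1"/>
    </view>
    <view role="receiver" asserter="Actor 2">
      <interactionPAssertion>I received M3</interactionPAssertion>
    </view>
  </interactionRecord>
  <interactionRecord key="M4">
    <view role="sender" asserter="Actor 2">
      <interactionPAssertion>I sent M4</interactionPAssertion>
      <actorStatePAssertion>I used sparc processor</actorStatePAssertion>
    </view>
  </interactionRecord>
</pstructure>
"""

root = ET.fromstring(PSTRUCTURE_XML)

# Navigate to the receiver's view of interaction M3.
receiver_views = root.findall(
    "./interactionRecord[@key='M3']/view[@role='receiver']"
)

# Collect every interaction p-assertion made by Actor 2, in document order.
by_actor_2 = [
    element.text
    for element in root.findall(".//view[@asserter='Actor 2']/interactionPAssertion")
]
```

The same path expressions would work whatever the store uses internally, which is the point of exposing an XML view rather than the storage format itself.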

Querying Functionality (Miles 06)
• Provenance Query Interface: allows us to obtain the provenance of some specific data
• A recognition that there is not “one” provenance for a piece of data; there may be several, depending on the end-user’s interest
• Hence, provenance is seen as the result of a query:
– Identify a piece of data at a specific execution point
– Scope of the process of interest: filter in/out p-assertions according to actors, process, types of relationships, etc.
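The idea that provenance is the result of a query, with a starting item and a scope, can be sketched as a filtered backwards traversal. The relationships come from the Process Documentation (2) slide; the record layout, the relation names, the asserter attribution and the function `provenance_query` are assumptions for illustration, not the actual query interface.

```python
from typing import Callable, List, NamedTuple


class Relationship(NamedTuple):
    effect: str     # item that was produced
    relation: str   # type of relationship
    cause: str      # item it depends on
    asserter: str   # actor that made the assertion (assumed here)


RELATIONSHIPS = [
    Relationship("M2", "isInReplyTo", "M1", "Actor 1"),
    Relationship("M3", "isCausedBy", "M1", "Actor 1"),
    Relationship("M2", "isCausedBy", "M4", "Actor 1"),
    Relationship("M4", "isInReplyTo", "M3", "Actor 2"),
]


def provenance_query(
    item: str,
    relationships: List[Relationship],
    in_scope: Callable[[Relationship], bool] = lambda rel: True,
) -> List[Relationship]:
    """Follow in-scope relationships backwards from `item`."""
    result = []
    seen = {item}
    frontier = [item]
    while frontier:
        effect = frontier.pop()
        for rel in relationships:
            if rel.effect == effect and in_scope(rel):
                result.append(rel)
                if rel.cause not in seen:
                    seen.add(rel.cause)
                    frontier.append(rel.cause)
    return result


full = provenance_query("M2", RELATIONSHIPS)
causal_only = provenance_query(
    "M2", RELATIONSHIPS, lambda rel: rel.relation == "isCausedBy"
)
actor_1_only = provenance_query(
    "M2", RELATIONSHIPS, lambda rel: rel.asserter == "Actor 1"
)
```

The three calls start from the same item but return different answers, which illustrates why there is no single provenance for a piece of data.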

Standardisation

Standardisation Options
• APIs: programmatic inter-op
• Recording and querying interfaces: service inter-op
• Provenance model: data inter-op

Purpose of Standardisation
[Figure labels: Application; Record Documentation of Execution; Provenance Stores]
• Allow for multiple applications to document their execution. Applications may be running in different institutions.

Purpose of Standardisation
[Figure labels: Application; Record Documentation of Execution; Provenance Store]
• Allow for multiple stores from multiple IT providers

Purpose of Standardisation
[Figure labels: Provenance Store; Query Provenance of Data]
• Allow for multiple stores from multiple IT providers

Purpose of Standardisation
[Figure label: Convert to standard data format]
• Allow for legacy, monolithic applications to expose their contents (according to a standard schema)

Purpose of Standardisation
[Figure labels: Application; Provenance Store]
• Allow third parties to host provenance stores, which are trusted by application owners and also by auditors

Compliance Oriented Architectures
• Separate execution documentation from compliance verification
• Allows for multiple compliance verifications
• Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting).
• Approach is suitable for e-Scientific peer-reviewing and business compliance verification
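The separation described above can be illustrated with documentation that is recorded once and then checked by several independent verifications. Everything in this sketch is hypothetical: the record layout, the two rules, the 10 minute limit and the attribution of the queue time to Actor 2 and M3 are made up to show the shape of the idea, and only the 6 minute queue time echoes an earlier slide.

```python
# Process documentation, recorded without knowledge of any regulation.
DOCUMENTATION = [
    {"asserter": "Actor 1", "kind": "interaction", "view": "sent", "message": "M3"},
    {"asserter": "Actor 2", "kind": "interaction", "view": "received", "message": "M3"},
    {"asserter": "Actor 2", "kind": "actor_state", "message": "M3", "queue_minutes": 6},
]


def check_every_sent_message_received(docs):
    """Hypothetical rule: each documented send has a matching receipt."""
    sent = {d["message"] for d in docs
            if d["kind"] == "interaction" and d["view"] == "sent"}
    received = {d["message"] for d in docs
                if d["kind"] == "interaction" and d["view"] == "received"}
    return sent <= received


def check_queue_time(docs, limit_minutes=10):
    """Hypothetical rule: no request waited longer than the limit."""
    return all(d["queue_minutes"] <= limit_minutes
               for d in docs if d["kind"] == "actor_state")


def verify(docs, checks):
    """Run each compliance check against the same documentation."""
    return {check.__name__: check(docs) for check in checks}


report = verify(DOCUMENTATION, [check_every_sent_message_received, check_queue_time])
```

A new regulation would be added as another check function, without touching the application or the documentation it has already recorded.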

Organ Transplant Scenario
[Figure labels: Hospital; Electronic Healthcare Management Service; Testing Lab]

Hospital Actors
[Figure labels: User Interface; Brain Death Manager; Donor Data Collector]

What’s on the CD
• PReServ (Paul Groth & Simon Miles)
• Offers recording and querying interfaces
• Available from www.pasoa.org
• An OGSA-DAI based version will soon be available from www.gridprovenance.org
• Is being used in a bioinformatics application (cf. HPDC’05, ISWC’05)

Conclusions

To Sum Up
Standardising the documentation of Business Processes
[Figure labels: Finance; Distribution; Aerospace; Healthcare; Automobile; Pharmaceutical; Record; Provenance Store; Query]
• Provenance
– Architecture
– Methodology
• Compliance check
• Rerun/Reproduce
• Analyse
Slide from John Ibbotson

Overview of Today’s Talks
• Provenance Data Structures
• Recording and Querying Provenance
– Break (30 minutes)
• Distribution and Scalability
• Security
• Methodology

Questions