Provenance Framework in Kepler
Norbert Podhorszki, Ilkay Altintas
Contributors: S. Bowers, B. Ludäscher, T. McPhillips (UC Davis), O. Barney (U Utah), E. Jäger-Frank (SDSC)
7th Biennial Ptolemy Miniconference, Berkeley, CA, February 13, 2007

Outline
• Provenance? What is it?
• Framework in Kepler to record provenance data
• RWS: A provenance model suitable for Kepler's different computational models
• Possible applications of provenance

What to track and why
Do we need some tracking of what is happening?
• Recreate results and rebuild workflows using the evolution information (see repeatable experiments)
• Associate the workflow with the results it produced
• Create links between generated data in different runs, and compare different runs
• Recover from a system failure
  • Checkpoint a workflow
• Debug and explain results (via lineage tracing, …)
• Smart reruns
  • Avoid re-generating the same data all the time

Model of Provenance
• Core feature: capture the processing history (trace) leading to a data product
• Model of Computation (MoC)
• Well-defined in terms of input/output relations and the (partial) order of actions
  • MoC(Program, Input) → Output
  • DAG, SDF, DDF, PN, etc.
• Different ways of specification
  • see Ptolemy-related papers, the Kahn-MacQueen paper, etc.
  • give abstract/high-level pseudocode
  • In practice it is defined through the implementation of the execution system (including the scheduling). In Kepler/Ptolemy this is the Director.
• There are legal (possible) runs under a given MoC

Model of Provenance
• Model of Provenance (MoP)
• The starting point is a MoC and its particular implementation
  • Observables: e.g. a single fired(x, A, y), or reads, writes and actions separately
  • Trace: recorded assertions (about observable events) during a legal run
• A MoP is a MoC, except that “legal run” is replaced with “legal trace”
• There is a default MoP for a MoC: the total trace of all observable events
  • Turing machine: moves of the head, data read and written
• A MoP may add other information or omit some (“T = R - I + M”)
  • Trace = Run - Ignored things + Modelled additional things
  • M: add real timestamps of actions, execution host information
  • I: omit the input for each action if it can be inferred unambiguously later (DAG)
  • Depends on the application of the trace

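The rule "Trace = Run - Ignored things + Modelled additional things" can be illustrated with a minimal Python sketch. The event dictionaries, their field names, the host name and the `derive_trace` helper are all hypothetical and chosen only for illustration; none of them is part of Kepler.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]


def derive_trace(run: List[Event],
                 ignored: Callable[[Event], bool],
                 modelled: Callable[[Event], Dict[str, object]]) -> List[Event]:
    """Trace = Run - Ignored things + Modelled additional things."""
    trace = []
    for event in run:
        if ignored(event):                # I: events this MoP chooses not to record
            continue
        enriched = dict(event)            # copy, so the run itself is left untouched
        enriched.update(modelled(event))  # M: attach additional modelled information
        trace.append(enriched)
    return trace


# A run of a small DAG workflow: every read and write is an observable event.
run = [
    {"actor": "A", "kind": "read", "token": "t1"},
    {"actor": "A", "kind": "write", "token": "t2"},
    {"actor": "B", "kind": "read", "token": "t2"},
    {"actor": "B", "kind": "write", "token": "t3"},
]

# I: in a DAG the inputs of each action can be inferred later, so reads are omitted.
# M: execution host information is added (a made-up host name).
trace = derive_trace(
    run,
    ignored=lambda e: e["kind"] == "read",
    modelled=lambda e: {"host": "node-01"},
)
```

Here the trace keeps only the two write events, each extended with a host field, while the original run is unchanged.
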
MoP Examples
• DAG workflow
  • Record: output data generated by the actions
  • Inference: execution of actions and the inputs to them can be inferred from the DAG itself
• Smart-rerun
  • Record: the output of an action and the parameters for that action
  • Inference: if an action’s parameters are unchanged and the actions on which it depends (inferred from the workflow graph) are also unchanged, the action’s output will be the same in a future run
• Kitchen definition
  • A MoP is “good” if it can handle the intended questions and use cases

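The smart-rerun inference can be sketched as a traversal of the workflow graph. This is a minimal illustration under the assumption that the workflow is a DAG given as a dictionary from each action to the actions it depends on; the function name and the example workflow are hypothetical, not Kepler code.

```python
from typing import Dict, List, Set


def actions_to_rerun(depends_on: Dict[str, List[str]],
                     changed: Set[str]) -> Set[str]:
    """Return the actions whose output may differ from the previous run.

    An action must be re-run if its own parameters changed, or if any action
    it depends on (directly or transitively) must be re-run.
    """
    memo: Dict[str, bool] = {}

    def dirty(action: str) -> bool:
        if action not in memo:
            memo[action] = action in changed or any(
                dirty(upstream) for upstream in depends_on.get(action, [])
            )
        return memo[action]

    return {action for action in depends_on if dirty(action)}


# Hypothetical workflow: load -> filter -> stats, and load -> plot.
workflow = {"load": [], "filter": ["load"], "stats": ["filter"], "plot": ["load"]}

# Only the parameters of "filter" were changed since the last run.
rerun = actions_to_rerun(workflow, changed={"filter"})
reusable = set(workflow) - rerun   # recorded outputs of these can be reused
```

With this input, "filter" and "stats" need to run again, while the recorded outputs of "load" and "plot" can be reused.
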
MoP Examples
Kepler: Streaming actors
• Stateful actors
  • An output depends on all inputs in the past, e.g. AddSubtract
• Stateless actors
  • An output depends only on inputs read in the current firing, e.g. Expression, RecordAssembler
• Non-conformist actors
  • Filter, running average, daily average (depend on some of the past inputs)
  • How do you determine correctly which inputs a given output depends on?

MoP Examples
Kepler: Data-dependent routing (branches and loops)
• The firing history of the actors cannot be inferred from the static workflow graph
  • Something should be recorded (e.g. firings)

RWS
A Model of Provenance for Kepler Directors

RWS: Read − Write − State-reset
[Figure: actor A reading tokens (r … r) and writing tokens (w … w) over time, with a state-reset event s!]
• Firing trace: r, r … r, w, w, … w, r, … r, w, … w …
• PS: what about actor state? What about “real” dependencies?
• State-reset event s defines when the actor “cuts off” dependencies
  • a semantic notion, known to the actor [developer] (or part of a higher-order scheme)
• Trace with state-reset: r, r … r, w, w, … w, s!, r, … r, w, … w, …
• Reference: IPAW’06, Bowers et al.

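How a state-reset "cuts off" dependencies can be sketched in a few lines of Python. The sketch assumes a simplified reading of the RWS idea: a written token may depend on every token the actor has read since its most recent state-reset. The tuple encoding of events and the function name are hypothetical.

```python
from typing import Dict, List, Tuple


def rws_dependencies(trace: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each written token to the tokens it may depend on.

    `trace` is the event sequence of one actor: ("r", token), ("w", token),
    or ("s", "") for a state-reset. A written token depends on every token
    read since the most recent state-reset.
    """
    reads_since_reset: List[str] = []
    deps: Dict[str, List[str]] = {}
    for event, token in trace:
        if event == "r":
            reads_since_reset.append(token)
        elif event == "w":
            deps[token] = list(reads_since_reset)
        elif event == "s":
            reads_since_reset.clear()     # the actor cuts off older dependencies
        else:
            raise ValueError(f"unknown event {event!r}")
    return deps


# Same reads and writes, without and with state-reset events.
no_reset = [("r", "a"), ("w", "x"), ("r", "b"), ("w", "y")]
with_reset = [("r", "a"), ("w", "x"), ("s", ""), ("r", "b"), ("w", "y"), ("s", "")]

deps_no_reset = rws_dependencies(no_reset)
deps_with_reset = rws_dependencies(with_reset)
```

Without a reset, y depends on both a and b; with the reset after the first firing, y depends only on b.
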
RWS trace of some actors
• Stateless actor: (r+ w+ s)* : r … r w … w s …
• Stateful actor: (r+ w+)*
• Simple filter actor (conditional depends only on current token): (r w? s)* : either it emits a token or not
• Daily average of hourly measurement: ((r w)^24 s)*
• Generally: an RWS firing is defined in terms of r and w events
  • r+ w+ defines one RWS firing (most Kepler actors behave similarly)
• More general: definition of the RWS firing round
  • (r+ w+)* s : dependencies among several firings …

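Since these trace shapes are written as regular expressions over the events r, w and s, they can be checked directly with Python's `re` module. The sketch below encodes a trace as a string of those letters. Reading the "24" as a repetition count and attaching that pattern to the daily-average actor are assumptions; the dictionary keys and the `conforms` helper are hypothetical names.

```python
import re

# Trace shapes from the slide, with whitespace removed.
PATTERNS = {
    "stateless": re.compile(r"(r+w+s)*"),
    "stateful": re.compile(r"(r+w+)*"),
    "simple_filter": re.compile(r"(rw?s)*"),
    "daily_average": re.compile(r"((rw){24}s)*"),  # assumption: 24 hourly rounds per reset
}


def conforms(kind: str, trace: str) -> bool:
    """True if the whole event string matches the trace shape for `kind`."""
    return PATTERNS[kind].fullmatch(trace) is not None


stateless_ok = conforms("stateless", "rrwsrws")        # two firings, each reset
stateless_bad = conforms("stateless", "rrw")           # missing state-reset
filter_ok = conforms("simple_filter", "rsrws")         # first token dropped, second emitted
filter_bad = conforms("simple_filter", "rwws")         # two writes for one read
day_ok = conforms("daily_average", "rw" * 24 + "s")    # one full day
day_bad = conforms("daily_average", "rw" * 23 + "s")   # reset one hour early
```
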
Kepler Provenance Framework

Provenance Framework in Kepler
• Modeled as a separate concern in the system
• Optional drag-and-drop feature
• Listens to the execution and saves information (customizable):
  • Context: the who, what, where, and when associated with the run
  • Input data and its associated metadata
  • Workflow outputs and intermediate data products
  • Workflow definition (entities, parameters, connections): a specification of what exists in the workflow, which can have a context of its own
  • Information about the workflow evolution (workflow trail)

Kepler System Architecture
[Figure: Kepler system architecture. Components shown: Authentication GUI, Kepler GUI Extensions, Vergil, Kepler Object Manager, SMS, Actor&Data SEARCH, Kepler Core Extensions, Type System Ext, Documentation, Smart Re-run / Failure Recovery, Provenance Recorder, Ptolemy. Source: IPAW’06, Altintas et al.]

Kepler Provenance Recorder (IPAW’06, Altintas et al.)
• Parametric and customizable
  • Different report formats
  • Variable levels of verbosity: all, some, medium, on error
  • Multiple cache destinations
• Saves information on
  • User name, Date, Run, etc.

Implementation details
The Provenance Recorder
• Extends the Ptolemy AbstractSettableAttribute
• Listens to the Director for
  • Changes in the workflow graph
  • Initialization, workflow execution and stop
  • Actor firing
• Listens to all IOPorts for
  • Token emissions on output ports to record output data
• That is, we could say it is a Ptolemy Provenance Framework

Implementation details
• Builds an internal representation of the workflow graph
  • Ptolemy’s DirectedGraph
  • Nodes: IOPorts; edges: port connections
• Used for
  • Recording workflow structure (dependencies among ports)
  • Subscribing at all ports (listening for input/output)

Application: smart-rerun

Implementation of RWS in Kepler
Data model, i.e. the observables in all MoC implementations in Kepler
• Port-actor relationship
  • portTable(Port, Actor, type)
  • type is a for atomic and c for composite actors (transparent)
• Token-object relationship
  • tokenTable(Token, Object)
• Object-value relationship
  • objectTable(Object, Value, Type)
  • Type is currently not recorded
• RWS trace
  • traceTable(Port, Event, Token, FiringCounter)
  • Event: r for read, w for write or s for state-reset

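The four relations can be tried out with an in-memory SQLite database. This is only a sketch: the table and column names follow the slide, but the column types, the sample rows and the lineage query are assumptions, and the slides do not say how Kepler actually stores these tables. The query is a simplification that treats the actor as stateless, so it only looks at reads in the same firing as the write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE portTable   (Port TEXT, Actor TEXT, type TEXT);   -- type: 'a' atomic, 'c' composite
CREATE TABLE tokenTable  (Token TEXT, Object TEXT);
CREATE TABLE objectTable (Object TEXT, Value TEXT, Type TEXT); -- Type is left NULL (not recorded)
CREATE TABLE traceTable  (Port TEXT, Event TEXT, Token TEXT, FiringCounter INTEGER);
""")

# Made-up sample: atomic actor A reads token t1 (value 3) and writes t2 (value 9).
conn.executemany("INSERT INTO portTable VALUES (?, ?, ?)",
                 [("A.in", "A", "a"), ("A.out", "A", "a")])
conn.executemany("INSERT INTO tokenTable VALUES (?, ?)",
                 [("t1", "o1"), ("t2", "o2")])
conn.executemany("INSERT INTO objectTable VALUES (?, ?, ?)",
                 [("o1", "3", None), ("o2", "9", None)])
conn.executemany("INSERT INTO traceTable VALUES (?, ?, ?, ?)",
                 [("A.in", "r", "t1", 1), ("A.out", "w", "t2", 1)])

# Values read by the writing actor in the same firing as the given written token.
LINEAGE = """
SELECT o.Value
FROM traceTable w
JOIN portTable pw ON pw.Port = w.Port
JOIN portTable pr ON pr.Actor = pw.Actor
JOIN traceTable r ON r.Port = pr.Port
                 AND r.Event = 'r'
                 AND r.FiringCounter = w.FiringCounter
JOIN tokenTable t ON t.Token = r.Token
JOIN objectTable o ON o.Object = t.Object
WHERE w.Event = 'w' AND w.Token = ?
"""
rows = conn.execute(LINEAGE, ("t2",)).fetchall()
```

For the sample data, the query returns the single input value that the write of t2 is linked to.
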
Extending the framework
1. Initialization (initialize())
  • Framework traverses the workflow graph (ports and connections)
  • RWS: generate specific data structures (port, actor and connection details)
2. Just before start (validate())
  • Framework subscribes event listeners
  • RWS: subscribe the additional listener TokenGetEvent

Extending the framework
3. When the workflow is modified (changeExecuted())
  • Framework traverses the workflow graph (ports and connections)
  • RWS: re-generate data structures
4. During execution, when an event occurs
  • The TokenSendEvent() and TokenGetEvent() listeners are extended to generate RWS trace events

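The four extension points can be pictured with a small Python stand-in. This is not the Kepler or Ptolemy API: both classes, their attributes and the string listener names are hypothetical, and the sketch only mirrors the order of the hooks (initialize, validate, change executed, token events) and where an RWS extension adds its own behaviour.

```python
class ProvenanceRecorder:
    """Hypothetical stand-in for the framework recorder (not the Kepler class)."""

    def __init__(self, connections):
        self.connections = connections      # list of (source port, destination port)
        self.ports = set()
        self.listeners = []
        self.trace = []

    def _traverse(self):
        # Walk the workflow graph: collect every port that appears in a connection.
        self.ports = {port for edge in self.connections for port in edge}

    def initialize(self):                   # step 1: initialization
        self._traverse()

    def validate(self):                     # step 2: just before start
        self.listeners = ["TokenSendEvent"]

    def change_executed(self, connections): # step 3: workflow was modified
        self.connections = connections
        self._traverse()

    def token_send_event(self, port, token):  # step 4: event during execution
        self.trace.append((port, "output", token))


class RWSRecorder(ProvenanceRecorder):
    """Hypothetical RWS extension of the recorder above."""

    def initialize(self):
        super().initialize()
        self.rws_ports = sorted(self.ports)     # RWS-specific data structure

    def validate(self):
        super().validate()
        self.listeners.append("TokenGetEvent")  # additional listener for reads

    def change_executed(self, connections):
        super().change_executed(connections)
        self.rws_ports = sorted(self.ports)     # re-generate data structures

    def token_send_event(self, port, token):
        self.trace.append((port, "w", token))   # write event of the RWS trace

    def token_get_event(self, port, token):
        self.trace.append((port, "r", token))   # read event of the RWS trace


rec = RWSRecorder([("A.out", "B.in")])
rec.initialize()
rec.validate()
rec.token_send_event("A.out", "t1")
rec.token_get_event("B.in", "t1")
rec.change_executed([("A.out", "B.in"), ("B.out", "C.in")])
```
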
Possible applications of Provenance
• Smart-rerun
• Monitoring/debugging of a workflow
  • see the LiDAR poster today by Efrat Jäger-Frank
• Answering questions about processing history and data
  • Participated in the First Provenance Challenge with Kepler-RWS
  • http://twiki.ipaw.info/bin/view/Challenge/RWS
• Reporting/documentation of workflows and data products
  • Generate my publication

Acknowledgement
• RWS model, formalization of the MoPs
  • Shawn Bowers and Timothy McPhillips, UC Davis
  • Bertram Ludäscher, UC Davis
• Kepler Provenance Framework implementation
  • Oscar Barney, Univ. of Utah, Salt Lake City
  • Efrat Jäger-Frank, SDSC, San Diego

References
RWS model
• S. Bowers, T. McPhillips, B. Ludäscher, S. Cohen and S. B. Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. Intl. Provenance and Annotation Workshop (IPAW), Chicago, 2006.
• B. Ludäscher, N. Podhorszki, I. Altintas, S. Bowers, T. McPhillips. From Computation Models to Models of Provenance and the RWS Model. To appear in 2007 in Journal of Concurrency and Computation: Practice and Experience.
Provenance framework
• I. Altintas, O. Barney, E. Jäger-Frank. Provenance Collection Support in the Kepler Scientific Workflow System. Intl. Provenance and Annotation Workshop (IPAW), Chicago, 2006.