Zen and the Art of SWF Maintenance

  • Slides: 51
Zen and the Art of SWF Maintenance
• Kinds of Scientific Workflows
• Why not just Python scripts?
• Business workflows born again?
• Zen and the art of workflow design
  – … and other research issues

What is a Scientific Workflow (SWF)?
• Model the way scientists work with their data and tools
  – Mentally coordinate data export, import, analysis via software systems
• Scientific workflows emphasize data flow (≠ business workflows)
• Metadata (incl. provenance info, semantic types, etc.) is crucial for automated data ingestion, data analysis, …
• Goals:
  – SWF automation
  – SWF and component reuse
  – SWF design & documentation
  ⇒ making scientists’ data analysis and management easier!

What we use SWF for …
• Short answer: Everything
  – includes making coffee (tea ceremonies are harder)
• Kinds of workflows (not disjoint):
  – Plumbing: stage files, submit batch jobs, monitor progress, move files off the XT3 to the analysis and viz cluster, archive, steer computation, …
    • Ex: Fusion simulation, astrophysics (supernova simulation), … your laptop backup???
  – Knowledge discovery workflows: automate repetitive data access, retrieval, custom analysis (e.g., BLAST), generic steps (PCA, cluster analysis, …)
    • Do this in ways that are meaningful to the scientist
    • Ex: PIW, Motif analysis, NDDP, …
  – Conceptual modeling workflows: what the heck is XYZ doing? Reverse engineering of processes and information flows at all levels; in order to optimize, we need to understand first
    • Ex: napkin-drawing workflows to get an overview; refine design from abstract to executable (top-down), or generalize from the concrete/legacy to the abstract (bottom-up); data-driven, task-driven, …

Why not just a Python script?
• Users who might be able to define, reuse, modify, specialize WFs might not be able to do the same for Python scripts
• But wait, there’s more:
  – Modular reuse
  – Debugging and monitoring of WF execution
    • easy to “tee” (“man tee” for you Windows guys ;-)
  – Automated Provenance Mgmt
  – Semantic types
  – From integrated WF modeling (ER + dataflow + co-registrations) to execution, optimization, archival, …
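
The “tee” idea can be sketched in a few lines of Python. This is only an illustration, assuming a workflow is modeled as a chain of generators; `tee_actor` and `square` are hypothetical names, not Kepler components.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def tee_actor(tokens: Iterable[T], sink: Callable[[T], None]) -> Iterator[T]:
    """Pass every token downstream unchanged while also handing it to `sink`."""
    for token in tokens:
        sink(token)   # side channel for debugging or monitoring
        yield token   # the main data flow is not altered


def square(tokens: Iterable[int]) -> Iterator[int]:
    """A stand-in downstream actor."""
    for x in tokens:
        yield x * x


observed: List[int] = []
result = list(square(tee_actor([1, 2, 3], observed.append)))
```

After the run, `observed` holds the tokens seen on the tapped channel and `result` holds the normal output, so monitoring needs no change to the actors on either side.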

Business workflows born-again?
• Yes, there are similarities
  – And we can learn from BWF! E.g., transactions!
• But also big differences:
  – SWF:
    • data-flow oriented
    • streaming/pipelined execution
    • cf. signal processing (see also COM later)
    • popular MoC: PN
  – BWF:
    • task- and control-flow oriented
    • popular MoC: Petri net? CSP?

Sample BWFs
• Focus is on …
  – Tasks
  – Control-flow
  – Work items
• Useful stuff:
  – Transactions!
  – How to handle complex control-flow …

Pop Quiz! BWF? SWF?

And the answer is …

Click here for “Oracle” (or another one)

Dataflow it is!

The Dataflow Difference

Data/Process/Provenance Central

BUY ME!!

A Signal Processing Pipeline

Some Terminology (tentative)
• Workflow definition W (the WF graph we see)
  – partial specification of a workflow (cf. program)
  – parameters P need to be instantiated
  – data-bindings D can be viewed as special parameters
• Model of Computation (MoC)
  – Looking at W, P, D we still do not know how to execute W(P, D) to compute result R
  – A MoC is an algorithm telling us how to apply W on P and D to obtain R
  – Examples:
    • MoC TM (Turing Machine):
      – given program P and input I, we know what to do
    • MoC PN (Process Network):
      – Network of independent processes, communicating through (infinite) unidirectional buffers (queues); given a PN, an input stream, and prefix-monotonic, deterministic actors, the output stream is determined! (lots of flexibility for execution!)
    • MoC SDF (Synchronous Dataflow):
      – Similar to PN, but actors must statically declare their token production/consumption rates; solving for positive integer solutions of the balance equations (“LGS”, i.e., a linear system of equations) yields a static schedule guaranteeing fixed buffer size
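
The SDF balance equations can be made concrete with a small sketch. It assumes a connected graph whose edges are given as (producer, consumer, tokens produced per firing, tokens consumed per firing); the function name and edge format are illustrative choices, not Ptolemy or Kepler API. It needs Python 3.9+ for `math.lcm`.

```python
from fractions import Fraction
from math import gcd, lcm
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int, int]  # (producer, consumer, produced, consumed)


def repetition_vector(edges: List[Edge]) -> Dict[str, int]:
    """Smallest positive firing counts q with q[src] * p == q[dst] * c on every edge.

    Raises ValueError if the rates are inconsistent or the graph is not connected.
    """
    rate: Dict[str, Fraction] = {edges[0][0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, dst, p, c = edge
            if src in rate and dst in rate:
                if rate[src] * p != rate[dst] * c:
                    raise ValueError(f"inconsistent rates on {src}->{dst}")
            elif src in rate:
                rate[dst] = rate[src] * p / c
            elif dst in rate:
                rate[src] = rate[dst] * c / p
            else:
                continue  # neither end is known yet; retry in a later pass
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    common = gcd(*counts.values())
    return {actor: n // common for actor, n in counts.items()}


# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
schedule_counts = repetition_vector([("A", "B", 2, 3), ("B", "C", 1, 2)])
```

Firing each actor the computed number of times returns every buffer to its starting fill level, which is why a static schedule with bounded buffers exists whenever a solution does.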

Some Terminology (tentative)
• Model of Computation (MoC)
• WF Run: completed computation
• WF Execution: ongoing computation
• Computation graph: graph data structure keeping track of which token has been computed from which other one(s)
  – Simple examples: evaluating an arithmetic expression; running a “job DAG”
  – But keeping track of “real dependencies” can be tricky
    • Ex: output tuples of an SQL query have “witness tuples” in multiple relations; clear for positive existential queries; what are witnesses for universal and negated queries? R = A B; witnesses anybody?
    • Similar to the notion of “proof tree” in logic (and LP); negation-as-failure rears its ugly (beautiful?) head!
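
For the simple arithmetic case, a computation graph can be sketched as below. This is a minimal illustration under the assumption that every token gets an integer id and records the ids it was computed from; the class and method names are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class ComputationGraph:
    values: Dict[int, float] = field(default_factory=dict)
    parents: Dict[int, Tuple[int, ...]] = field(default_factory=dict)

    def token(self, value: float, *inputs: int) -> int:
        """Register a token and the ids of the tokens it was computed from."""
        tid = len(self.values)
        self.values[tid] = value
        self.parents[tid] = inputs
        return tid

    def lineage(self, tid: int) -> Set[int]:
        """All tokens that `tid` depends on, directly or transitively."""
        seen: Set[int] = set()
        stack = list(self.parents[tid])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return seen


# Evaluate (2 + 3) * 4 while recording dependencies.
g = ComputationGraph()
a, b, c = g.token(2.0), g.token(3.0), g.token(4.0)
total = g.token(g.values[a] + g.values[b], a, b)
product = g.token(g.values[total] * g.values[c], total, c)
```

Here every input is a genuine dependency. The tricky cases named on the slide (negation, universal queries) are exactly those where "which inputs mattered" has no such obvious answer.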

Research Area: Provenance
• (Abstract) Use Cases
  – “Total Recall”: capture everything the MoC can observe
    • … and more: MoC-inherent plus additional observables
      – Example: time-stamp token-in, token-out events to benchmark actor exec time, data movement time, …
  – The 7 W’s: Who, What, Where, Why, When, Which, (W)how (C. Goble)
  – Smart Re-run: after Pause or Stop, followed by parameter changes: rerun relevant parts
  – Fault tolerance, crash recovery (cf. checkpointing)
  – Result interpretation and post-mortem analysis
• Research Question:
  – Given a use case (as a query U) and a provenance schema PS, can U be answered using PS? (related to query answering using views: a reasoning problem!)
  – Ultimately: design PS with U in mind! Also: optimize/specialize PS if U is known/limited
  – Note: the MoC can make a difference! For example, some MoCs have an explicit notion of “firing” or might exploit actor declarations (“I’m a function! I have no state!”). This is relevant, e.g., for checkpointing (Need to save state or not? When to save state?)
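
The “Smart Re-run” use case can be sketched as a reachability question. The sketch assumes actors are stateless functions and the workflow is given as an adjacency map from each actor to its direct consumers; the actor names are hypothetical.

```python
from typing import Dict, List, Set


def actors_to_rerun(downstream: Dict[str, List[str]], changed: Set[str]) -> Set[str]:
    """Actors whose results may be stale after parameters of `changed` were edited."""
    dirty: Set[str] = set()
    stack = list(changed)
    while stack:
        actor = stack.pop()
        if actor not in dirty:
            dirty.add(actor)
            stack.extend(downstream.get(actor, []))
    return dirty


# Hypothetical workflow: read -> filter -> {pca, plot}, pca -> plot
workflow = {"read": ["filter"], "filter": ["pca", "plot"], "pca": ["plot"], "plot": []}
stale = actors_to_rerun(workflow, {"pca"})
```

Everything outside the returned set can reuse recorded results, which is only safe if the provenance store kept those intermediate tokens. That ties this use case back to the question of what the provenance schema must capture.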

Research Area: WF/Dataflow Design
• Collection-Oriented Modeling (COM)
  – Assembly line metaphor + Signal Processing + XML + …
    • Streams are nested collections (XML)
    • Stream data schema is “registered” to a WF data model (really need this)
    • Actor “picks up” only certain parts of the stream: scope
    • Actor declares how the data within the scope is changed: delta
    • Gives rise to new notions of type and new problems of type inference (using scope, delta, workflow structure, etc.)
  – Advantages:
    • Less “messy” WFs (more linear, less branching)
    • “Add-only” mode (inject new derived information); augmentation instead of transformation
    • Tagging data for downstream processing (instead of “bombing”, pass on “dirty” / faulty / strange data with a relevant tag)
    • Pipelined parallelism (can stream an array)
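
A rough sketch of scope, delta and add-only tagging, assuming stream items are plain dictionaries and an actor's scope is a single key. All names here (`scoped_actor`, `gc_content`, the `tag` field) are illustrative assumptions, not part of COM as implemented in Kepler.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Item = Dict[str, Any]


def scoped_actor(stream: Iterable[Item], scope: str,
                 derive: Callable[[Any], Dict[str, Any]]) -> Iterator[Item]:
    """Add derived fields to items that contain `scope`; pass everything else through."""
    for item in stream:
        if scope not in item:
            yield item  # outside the actor's scope: untouched
            continue
        try:
            delta = derive(item[scope])
        except Exception as exc:
            # Tag the faulty item instead of aborting the whole stream.
            delta = {"tag": f"faulty: {exc}"}
        # Add-only: existing fields win over derived ones, so nothing is overwritten.
        yield {**delta, **item}


def gc_content(seq: str) -> Dict[str, Any]:
    return {"gc": (seq.count("G") + seq.count("C")) / len(seq)}


stream = [{"seq": "ACGT"}, {"note": "header"}, {"seq": ""}]
out = list(scoped_actor(stream, "seq", gc_content))
```

The empty sequence would crash a conventional pipeline; here it travels on with a tag, so a downstream actor can decide what to do with it.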

Research: WF Design
• ER model primitives:
  – Entity (-type), attribute, relationship (-type)
• SWF model primitives??
  – Actors, directors (MoC), …
  – Lots of new “types”:
    • Conventional data type (Java style)
    • Polymorphic types w/ type variables (Haskell style)
    • Semantic type (formal annotations in logic relative to a controlled vocabulary or knowledge base)
    • Hybrids
    • A “theory of adapters”!?

[Altintas-et-al-PIW-SSDBM’03]
• designed to fit hand-crafted control solution; also: forces sequential execution!
• hand-crafted Web-service actor
• No data transformations available
• Complex backward control-flow

A Scientific Workflow Problem: More Solved (Computer Scientist’s view)
• Solution based on declarative, functional dataflow process network (= also a data streaming model!)
• Higher-order constructs: map(f)
  ⇒ no control-flow spaghetti
  ⇒ data-intensive apps
  ⇒ free concurrent execution
  ⇒ free type checking
  ⇒ automatic support to go from piw(GeneId) to PIW := map(piw) over [GeneId]
• Figure callouts: map(f)-style iterators; powerful type checking; generic, declarative “programming” constructs; generic data transformation actors; forward-only, abstractable subworkflow piw(GeneId)

A Scientific Workflow Problem: Even More Solved (domain & CS coming together!)
map(GenbankWS)
Input: {“NM_001924”, “NM_020375”}
Output: {“CAGT…AATATGAC”, “GGGGA…CAAAGA”}
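
The step from a per-item actor to its list-valued version can be sketched with a generic `lift` helper. The GenBank lookup below is a stub dictionary standing in for the web service, with the sequence strings kept elided as on the slide; `lift`, `genbank_ws` and `STUB_GENBANK` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")

# Stub standing in for the GenBank web service (sequences elided as on the slide).
STUB_GENBANK: Dict[str, str] = {
    "NM_001924": "CAGT…AATATGAC",
    "NM_020375": "GGGGA…CAAAGA",
}


def genbank_ws(gene_id: str) -> str:
    """Per-item actor: one accession id in, one sequence out."""
    return STUB_GENBANK[gene_id]


def lift(f: Callable[[A], B], workers: int = 4) -> Callable[[List[A]], List[B]]:
    """Turn f : A -> B into map(f) : [A] -> [B], evaluated concurrently."""
    def mapped(items: List[A]) -> List[B]:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(f, items))  # pool.map preserves input order
    return mapped


fetch_all = lift(genbank_ws)
sequences = fetch_all(["NM_001924", "NM_020375"])
```

Because `genbank_ws` has no shared state, the concurrent evaluation needs no coordination code, which is the “free concurrent execution” point in miniature.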

Research Problem: Optimization by Rewriting
• Example: PIW as a declarative, referentially transparent functional process
  ⇒ optimization via functional rewriting possible, e.g., map(f o g) instead of map(f) o map(g), since map(f o g) = map(f) o map(g)
• Technical report & PIW specification in Haskell
  – Combination of map and zip
  – http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
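
The rewrite rule can be checked on a small example. The sketch uses plain Python lists in place of token streams, and `f` and `g` are arbitrary pure functions chosen for illustration.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def compose(f: Callable[[B], C], g: Callable[[A], B]) -> Callable[[A], C]:
    """(f o g)(x) = f(g(x))"""
    return lambda x: f(g(x))


def fmap(f: Callable[[A], B]) -> Callable[[List[A]], List[B]]:
    """map(f): apply f to every element of a list."""
    return lambda xs: [f(x) for x in xs]


def g(x: int) -> int:
    return x + 1


def f(x: int) -> int:
    return x * 10


two_passes = compose(fmap(f), fmap(g))  # map(f) o map(g): builds an intermediate list
one_pass = fmap(compose(f, g))          # map(f o g): single traversal

data = [1, 2, 3]
```

Both pipelines give the same result on `data`, but the fused form avoids the intermediate collection. The equality relies on `f` and `g` being free of side effects, which is what referential transparency guarantees.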

Job Management (here: NIMROD)
• Job management infrastructure in place
• Results database: under development
• Goal: 1000’s of GAMESS jobs (quantum mechanics)

Kepler Coupling Components & Codes
• Types of Coupling …
  – Loosely coupled (“1st Phase”)
    • Web Services (SPA, GEON, SEEK, …)
    • ssh actors, …
    + reusability (behavioral polymorphism)
    + scalability (# components)
    − efficiency
  – Tight(er) coupling (“2nd Phase”)
    • Via CCA (SciRUN-2, Ccaffeine, …) (Cipres uses CORBA)
    • HPC needs: code-coupling as efficient & flexible as possible (e.g., Scott’s challenges…)
      – memory-to-memory (single node or shared memory)
      – MPI (multiple nodes)
      – optimizations for transfer of data & control (streaming, socket-based connections)

Accord-CCA: Ccaffeine w/ Self-Managed Behavior
• cf. w/ mobile models, reconfiguration in Ptolemy II
• Source: Hua Liu and Manish Parashar
• … begging for a Kepler design and implementation …

Fault Tolerance & Maintenance Challenges

Workflow Templates and Patterns
• New Ingredients
• Proposed Layered Architecture
• work w/ Anne Ngu, Shawn Bowers, Terence Critchlow

Use Ideas from Fault Tolerant Shell
• Good ideas in ftsh; some might be (semi-)low-hanging fruit for Kepler …
• Source: Douglas Thain, Miron Livny: The Ethernet Approach to Grid Computing
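
One idea commonly associated with ftsh is bounded retry with randomized exponential backoff. The sketch below shows that general idea in Python; it is not ftsh syntax, and the function name, parameters and defaults are assumptions made for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def try_with_backoff(action: Callable[[], T],
                     max_attempts: int = 5,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep,
                     jitter: Callable[[], float] = random.random) -> T:
    """Run `action` until it succeeds or `max_attempts` is exhausted.

    The wait doubles after each failure and is stretched by a random factor in
    [1, 2) so that many clients hitting the same resource do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt) * (1.0 + jitter()))
    raise ValueError("max_attempts must be at least 1")
```

`sleep` and `jitter` are injectable so the policy can be exercised without real waiting; a workflow-level error handler could wrap any actor invocation in such a call.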

Use of Semantics in SWF…
• “Smart” Search
  – Concept-based, e.g., “find all datasets containing biomass measurements”
• Improved Linking, Merging, Integration
  – Establishing links between data through semantic annotations & ontologies
  – Combining heterogeneous sources based on annotations
  – Concatenate, Union (merge), Join, etc.
• Transforming
  – Construct mappings from schema S1 to S2 based on annotations
• Semantic Propagation
  – “Pushing” semantic annotations through transformations/queries

Typing Workflow Components
The Semantic Type Editor is used to assign one or more semantic types to the component or to the component’s input and output ports. In the simplest case, a semantic type is a class taken from an OWL-DL ontology. Multiple types define a conjoined concept expression. A simple ontology browser is provided in Kepler to navigate a classified OWL-DL ontology. Classes can be searched for and selected as a semantic type.

More on Semantic Annotation
Initial Version Supports:
• Actor-level and port-level annotations
• Annotations are stored in the actor’s MoML definition (as new “semantic type” properties)
• Creation of composite ports (i.e., “virtual” ports grouping a set of underlying ports)
• Regular and composite ports may have multiple annotations (conjunction)
• Annotations can be drawn from multiple ontologies
Figure: An annotated composite port

More on Semantic Annotation
Currently Adding:
• “Semantic Link” annotations for annotation of ports via ontology properties
  – E.g., hasUnit(biomass, celsius)
  – Supported in MoML, not yet in tool
• Simple condition “filters” in port semantic annotations
  – E.g., if attribute height > 0 then biomass is annotated as AboveGroundBiomass
• Incorporating instances/values in semantic links
  – E.g., hasLat(point1, lat1)
• Suggesting additional annotations based on given ones
  – suggesting/guessing ways to “fill in” given annotations
  – E.g., possible semantic links
• Templates and ontology “views”
  – To help specify common annotation patterns
Figure: Semantic Links

Checking Type Constraints
Kepler can statically perform semantic and structural type checking of connections. A type checker allows the user to see potentially mismatched port connections as well as known type conflicts before workflow execution. The user can navigate the unsafe and potentially unsafe channels using the Kepler Type Checker dialog. When a channel is selected: (a) it is highlighted on the canvas, (b) the structural type and status are shown (here, the channel is structurally well typed), and (c) the semantic type and status are shown (here, the connection produces a semantic type error).
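
The semantic half of such a check can be sketched as a subsumption test. The sketch assumes a semantic type is a set of class names read as a conjunction, and uses a toy single-inheritance hierarchy; the hierarchy and function names are hypothetical, and a real OWL-DL ontology would need a reasoner.

```python
from typing import Dict, Iterator, Optional, Set

# Toy class hierarchy (child -> parent), for illustration only.
SUBCLASS_OF: Dict[str, str] = {
    "AboveGroundBiomass": "Biomass",
    "Biomass": "Measurement",
    "Temperature": "Measurement",
}


def superclasses(cls: str) -> Iterator[str]:
    """Yield `cls` and every class above it in the toy hierarchy."""
    current: Optional[str] = cls
    while current is not None:
        yield current
        current = SUBCLASS_OF.get(current)


def connection_is_safe(output_type: Set[str], input_type: Set[str]) -> bool:
    """True if everything the input port requires is implied by the output port's type."""
    return all(
        any(required in set(superclasses(provided)) for provided in output_type)
        for required in input_type
    )
```

For example, an output port typed `{"AboveGroundBiomass"}` may feed an input that expects `{"Biomass"}`, whereas `{"Temperature"}` into the same input would be reported as a semantic type error.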

Kepler Actor-Library
• Ontology-based actor organization / browsing
• Customizable libraries based on ontologies
• Text search with concept-based expansion
Users can discover ImageJ using various search terms. Here, ImageJ shows up in multiple tree locations based on its given annotations. The library search permits text-based matching against the component’s metadata (its given name and certain properties), expanded with concept matches.
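
Text search with concept-based expansion might look roughly like this. The annotations and the concept hierarchy below are invented for the example and do not reflect Kepler's actual ontologies or search implementation.

```python
from typing import Dict, List, Set

# Hypothetical annotations and concept hierarchy, for illustration only.
ACTOR_ANNOTATIONS: Dict[str, Set[str]] = {
    "ImageJ": {"ImageProcessing", "Visualization"},
    "Blast": {"SequenceAlignment"},
}
NARROWER: Dict[str, Set[str]] = {
    "Analysis": {"ImageProcessing", "SequenceAlignment"},
}


def expand(concept: str) -> Set[str]:
    """Return the concept together with all narrower concepts below it."""
    found = {concept}
    for child in NARROWER.get(concept, set()):
        found |= expand(child)
    return found


def search(term: str) -> List[str]:
    """Match `term` against actor names, then against concepts and their descendants."""
    needle = term.lower()
    known = (
        set(NARROWER)
        | {c for children in NARROWER.values() for c in children}
        | {c for annotations in ACTOR_ANNOTATIONS.values() for c in annotations}
    )
    concepts: Set[str] = set()
    for concept in known:
        if needle in concept.lower():
            concepts |= expand(concept)
    hits = [
        name for name, annotations in ACTOR_ANNOTATIONS.items()
        if needle in name.lower() or annotations & concepts
    ]
    return sorted(hits)
```

With this toy data, the same actor is reachable through its name, through one of its own annotations, or through a broader concept that subsumes an annotation.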

Semantic Searching
Kepler provides a more advanced ontology-based search mechanism. Users can start the Semantic Search dialog, where components can be searched for based on their semantic types. The Semantic Search dialog allows a user to search components by any combination of actor, input, and output semantic types.

Structural Type (XML DTD) Annotations
structType(P2):
  root population = (sample)*
  elem sample = (meas, lsp)
  elem meas = (cnt, acc)
  elem cnt = xsd:integer
  elem acc = xsd:double
  elem lsp = xsd:string
Example instance:
  <population>
    <sample>
      <meas> <cnt>44,000</cnt> <acc>0.95</acc> </meas>
      <lsp>Eggs</lsp>
    </sample>
    …
  </population>
structType(P3):
  root cohortTable = (measurement)*
  elem measurement = (phase, obs)
  elem phase = xsd:string
  elem obs = xsd:integer
Example instance:
  <cohortTable>
    <measurement> <phase>Eggs</phase> <obs>44,000</obs> </measurement>
    …
  </cohortTable>
Figure labels: services S1, S2; ports P1 to P5; (life stage property); (mortality rate for period)
Source: [Bowers-Ludaescher, DILS’04]

Ontology-Guided Data Transformation
Figure labels: Ontologies (OWL); Source Service with port Ps; Target Service with port Pt; Semantic Type Ps; Semantic Type Pt; Structural Type Ps; Structural Type Pt; Structural/Semantic Association; Compatible (⊑); Correspondence; Generate; Transformation; (Ps); Desired Connection
Source: [Bowers-Ludaescher, DILS’04]
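
Once a correspondence between the two structural types is known, the transformation itself is mechanical. The sketch below hand-writes a correspondence for the population and cohortTable structures (phase from lsp, obs from meas/cnt); that mapping is an assumption read off the example values, whereas the approach on the slide derives it from the semantic annotations.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    "<population><sample>"
    "<meas><cnt>44000</cnt><acc>0.95</acc></meas>"
    "<lsp>Eggs</lsp>"
    "</sample></population>"
)

# Assumed correspondence: target element -> path inside each source <sample>.
CORRESPONDENCE = {"phase": "lsp", "obs": "meas/cnt"}


def to_cohort_table(population_xml: str) -> str:
    """Rewrite a <population> document into the <cohortTable> structure."""
    population = ET.fromstring(population_xml)
    table = ET.Element("cohortTable")
    for sample in population.findall("sample"):
        measurement = ET.SubElement(table, "measurement")
        for target, path in CORRESPONDENCE.items():
            ET.SubElement(measurement, target).text = sample.findtext(path)
    return ET.tostring(table, encoding="unicode")


converted = to_cohort_table(SOURCE)
```

The `acc` element has no counterpart in the target structure and is dropped, which is the kind of information loss a generated adapter should make visible to the workflow designer.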

WF-Design: Adapters for Semantic & Structural Incompatibility
Adapters may:
  – be abstract (no impl.)
  – be concrete
  – bridge a semantic gap
  – fix a structural mismatch
  – be generated automatically (e.g., Taverna’s “list mismatch”)
  – be reused components (based on signatures)
Figure labels: C, D, C1, D1, C2, D2; f1, f2; S, T, [S], [T], [[S]], [[T]]; map
Source: [Bowers-Ludaescher, ER’05]
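
A structural “list mismatch” adapter can be generated from the difference in nesting depth alone. The sketch assumes the depth difference is known from the port types; `lift_to_depth` is a hypothetical helper name.

```python
from typing import Any, Callable


def lift_to_depth(f: Callable[[Any], Any], depth: int) -> Callable[[Any], Any]:
    """Wrap f in `depth` nested maps, e.g. depth 2 turns f : S -> T into [[S]] -> [[T]]."""
    if depth == 0:
        return f
    inner = lift_to_depth(f, depth - 1)
    return lambda items: [inner(item) for item in items]


# f2 expects a single string, but the upstream actor emits a list of lists of strings.
adapted = lift_to_depth(str.upper, 2)
nested_result = adapted([["a"], ["b", "c"]])
```

The adapter is derived purely from signatures, so it can be inserted automatically whenever the produced type is the consumed type wrapped in extra list layers.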

Additional Design Primitives for Semantic Types
Extended Transformations (figure columns: Starting Workflow, Resulting Workflow):
• t9: Actor Semantic Type Refinement
• t10: Port Semantic Type Refinement
• t11: Annotation Constraint Refinement
• t12: I/O Constraint Strengthening
• t13: Data Connection Refinement
• t14: Adapter Insertion
• t15: Actor Replacement
• t16: Workflow Combination (Map)
Source: [Bowers-Ludaescher, ER’05]

Scientific Workflow Design
• Support SWF design & reuse, via:
  – Structural data types
  – Semantic types
  – Associations (= constraints) between them
  – Type checking, inference, propagation
• Separation of concerns:
  – structure, semantics, WF orchestration, etc.
Source: [Bowers-Ludaescher, ER’05]

Semantic Annotation Propagation

Forward and Backward Propagation Rules

GEON Dataset Generation & Registration (and co-development in KEPLER)
• Contributors: Matt et al. (SEEK), Efrat (GEON), Ilkay (SDM), Yang (Ptolemy), Xiaowen (SDM), Edward et al. (Ptolemy)
• SQL database access (JDBC)
• % Makefile
  $> ant run

Web Services Actors (WS Harvester)
• “Minute-made” (MM) WS-based application integration
• Similarly: MM workflow design & sharing w/o implemented components

Some KEPLER Actors (out of 160+ … and counting…)

Different “Directors” for Different Concerns
• Example:
  – Ptolemy Directors
  – “factoring out” the concern of workflow “orchestration” (MoC)
  – common aspects of overall execution not left to the actors
• Similarly:
  – “Black Box” (“flight recorder”)
    • a kind of “recording central” to avoid wiring 100’s of components to recording-actor(s)
  – “Red Box” (error handling, fault tolerance)
    • use ftsh ideas; templates
  – “Yellow Box” (type checking)
    • for workflow design
  – “Blue Box” (shipping-and-handling)
    • central handling of data transport (by value, by reference, by scp, SRB, GridFTP, …)
  – “CCA++ Boxes”
    • Change behavior (e.g., algorithm) of a component
    • Change behavior (i.e., wiring) of a workflow in-flight
Figure labels: SDF/PN/DE/…; Provenance Recorder; On Error; Static Analysis; SHA @; Component Mgr; Composition Mgr
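
The “Black Box” idea, recording handled centrally rather than wired into each actor, can be sketched with a director-like object that fires the actors itself. This assumes a purely linear pipeline of stateless functions; `RecordingDirector` is an illustrative name and not a Ptolemy or Kepler class.

```python
import time
from typing import Any, Callable, List, Tuple

Actor = Callable[[Any], Any]
Event = Tuple[str, Any, Any, float]  # (actor name, token in, token out, elapsed)


class RecordingDirector:
    """Fires actors in order and records every firing, so actors need no recording code."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self.clock = clock
        self.events: List[Event] = []

    def fire(self, name: str, actor: Actor, token: Any) -> Any:
        start = self.clock()
        result = actor(token)
        self.events.append((name, token, result, self.clock() - start))
        return result

    def run(self, pipeline: List[Tuple[str, Actor]], token: Any) -> Any:
        for name, actor in pipeline:
            token = self.fire(name, actor, token)
        return token


director = RecordingDirector()
final = director.run([("double", lambda x: x * 2), ("inc", lambda x: x + 1)], 5)
```

Afterwards `director.events` holds one entry per firing with its input, output and elapsed time. The same interception point is where error handling or transport decisions could be factored out.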

Separation of Concerns: Port Types
• Token consumption (& production) “type”
  – a director’s concern
• More generally: resource consumption “type”
  – other scheduling problems
• Token “transport type”
  – by value, reference (which one), protocol (SOAP, scp, GridFTP, SRB, …)
  – a SHA concern
• Structural and semantic types
  – SAT (static analysis & typing) concern
  – built after static unit type system…
    • static unit type system as a special case!?

Other Research Problems
• Making the system more X-aware:
  – MoC-aware: ok (directors)
  – Provenance-aware: …
  – DS (data schema)-aware: …
  – Semantics-aware: upcoming (should be hybrid w/ DS)
  – Host-aware: allow distributed scheduling of actors
  – Data-transport-aware: choose suitable data transport protocol (scp, bbcp, http, (Grid-)ftp, SRB, SRM, …)
  – Think of new “folks” on the movie set:
    • Actors, director
    • Cameraman (provenance recorder?)
    • Editor (FF/REW/Play/Pause/Stop provenance re-run)
    • Caterer/Stager (feeding actors with yummy tokens!)
    • Managers for “Process Central” and “Data Central”
    • Semantic/Hybrid Type Manager

More Research Topics
• What if we know something about bandwidths, processor loads, data sizes? Workflow optimization!
• What if we have more semantics for actors?
  – Black-box: token in/out
  – Grey-box: data types, semantic types
  – White-box: exact functional behavior is known!
  – Example: Actor implements a (stream-?) query! Query Process Network
  – New optimization opportunities!

A User’s Wish List
• Usability
• Closing the “lid” (cf. vnc)
• Dynamic plug-in of actors (cf. actor & data registries/repositories)
• Distributed WF execution
• Collection-based programming
• Grid awareness
• Semantics awareness
• WF Deployment (as a web site, as a web service, …)
• “Power apps” (cf. SCIRun)
• …