Building Scientific Workflows with Taverna and BPEL: a Comparative Study in caGrid
Wei Tan 1, Paolo Missier 2, Ravi Madduri 1, Ian Foster 1
foster@mcs.anl.gov
http://www-fp.mcs.anl.gov/~foster/
1 University of Chicago and Argonne National Laboratory, USA
2 School of Computer Science, University of Manchester, U.K.
Agenda
• Introduction to caGrid
• Why scientific workflows in caGrid?
• BPEL and Taverna comparison
  - Service discovery
  - Service composition & workflow execution
  - Data-driven vs. control-driven modeling
  - Implicit vs. explicit definition of data
  - Implicit vs. explicit iteration on data
  - Workflow result analysis
• Conclusion
W. Tan, et al. Building Scientific Workflow with Taverna and BPEL
Introduction: caBIG and caGrid
[Figure label: Globus]
As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical)
Introduction: caGrid and workflow
[Figure: Scientific workflow lifecycle (Discovery, Composition, Execution, Analysis) around a Community that reuses and generates workflows. The caGrid layer provides Connectivity, Virtualization, and Security over instruments, data, and computation resources.]
Challenges faced by caGrid users
• Discovery
  - Locating needed services
  - Determining function
• Composition
  - Accessing services from a workflow
  - GUI for building workflows easily
• Execution
  - Executing workflows efficiently
• Analysis
  - Persisting and visualizing results
• Community (reuse, generate)
  - Sharing and reusing workflows
Our goals in this paper
• Communicate practical experiences based on our work in the caGrid project
• Cover the entire scientific workflow lifecycle, from service discovery to service composition, workflow execution, and workflow result analysis
  - Based on caGrid requirements for workflow language and tooling
  - Also applicable to other areas in data-intensive and exploratory science?
BPEL and Taverna
• Not the only two options, but representative choices
• BPEL
  - XML-based specification for web-service-based process behavior
  - Industry standard adopted by IBM, SAP, Oracle, etc.
  - Has also attracted attention from the scientific community because of its support for the SOA paradigm
• Taverna
  - Open-source, from the myGrid consortium in the UK
  - Design and execution of scientific workflows
  - Plug-in architecture for extension (access more applications, visualize more data types, etc.)
Querying semantic data in cancer research
• Identify description logic concepts relating to a particular context, e.g., “caCore”
  1) Query all projects related to the context “caCore”
  2) Find UML classes in each project
  3) Use project and UML class information to query the semantic metadata
  4) Retrieve the concept code
• We adopt this query as a use case to guide our comparison
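The four steps of the use case can be sketched as a small pipeline. This is only an illustration under stated assumptions: the in-memory dictionaries stand in for the real services, the project names, class names, and concept codes are made-up toy data, and `find_semantic_metadata` and `concept_codes_for_context` are hypothetical helper names (only the operation names findProjects and findClassesInProject appear in these slides).

```python
# Toy stand-ins for the remote services. All data below is invented
# for illustration and does not come from any real metadata repository.
PROJECTS = {"caCore": ["ProjectA", "ProjectB"]}
CLASSES = {"ProjectA": ["Gene", "Protein"], "ProjectB": ["Specimen"]}
CONCEPTS = {
    ("ProjectA", "Gene"): ["C-0001"],
    ("ProjectA", "Protein"): ["C-0002"],
    ("ProjectB", "Specimen"): ["C-0003", "C-0001"],
}


def find_projects(context):
    """Step 1: all projects related to a context."""
    return PROJECTS.get(context, [])


def find_classes_in_project(project):
    """Step 2: UML classes in a single project."""
    return CLASSES.get(project, [])


def find_semantic_metadata(project, uml_class):
    """Step 3 (hypothetical helper): concept codes for one class."""
    return CONCEPTS.get((project, uml_class), [])


def concept_codes_for_context(context):
    """Step 4: collect concept codes, dropping duplicates, keeping order."""
    codes = []
    for project in find_projects(context):
        for uml_class in find_classes_in_project(project):
            for code in find_semantic_metadata(project, uml_class):
                if code not in codes:
                    codes.append(code)
    return codes


print(concept_codes_for_context("caCore"))
# prints ['C-0001', 'C-0002', 'C-0003']
```

The nested loops make the iteration over projects and classes explicit; how a workflow language expresses that iteration is one of the comparison points later in the talk.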
Support for service discovery
• Before building a workflow
  - Need to find appropriate services to be composed
  - Service endpoints are not naturally known to users
  - Exact semantics of those services are not known
• Taverna offers
  - An extensible scavenger interface for arbitrary service discovery according to users' needs (see next slide)
  - A native semantic discovery facility called Feta: myGrid ontology-based service annotation and search
• BPEL offers
  - UDDI, which is not widely adopted
  - Research efforts such as WSMO and OWL-S, which remain largely at the specification level
  - No open-source tool that works with a service query component in an integrated way
Solution for caGrid: Metadata-based service query
[Figure: caGrid service metadata]
• Types of query
  - String based
  - Property based
  - Semantic based
[Figure: caGrid scavenger: query the caDSR service in the use case]
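The three query types can be illustrated with a minimal in-memory registry. This is a sketch, not the caGrid scavenger's actual API: the `ServiceMetadata` record, the registry contents, and the three query functions are all hypothetical, and the entries are toy data.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceMetadata:
    """Hypothetical metadata record for one registered service."""
    name: str
    description: str
    properties: dict = field(default_factory=dict)
    concepts: set = field(default_factory=set)


# Toy registry; names, properties, and concepts are invented.
REGISTRY = [
    ServiceMetadata("CaDSRService", "Query UML models and semantic metadata",
                    {"center": "Center A"}, {"UML Class", "Project"}),
    ServiceMetadata("GeneDataService", "Gene expression data",
                    {"center": "Center B"}, {"Gene"}),
    ServiceMetadata("ClusterService", "Analytical clustering of gene data",
                    {"center": "Center A"}, {"Gene", "Cluster"}),
]


def query_by_string(registry, text):
    """String based: case-insensitive match on name or description."""
    needle = text.lower()
    return [s.name for s in registry
            if needle in s.name.lower() or needle in s.description.lower()]


def query_by_property(registry, key, value):
    """Property based: exact match on one metadata property."""
    return [s.name for s in registry if s.properties.get(key) == value]


def query_by_concept(registry, concept):
    """Semantic based: match on an annotated concept."""
    return [s.name for s in registry if concept in s.concepts]


print(query_by_string(REGISTRY, "gene"))
print(query_by_property(REGISTRY, "center", "Center A"))
print(query_by_concept(REGISTRY, "Gene"))
```

A real semantic query would typically also follow concept relationships in an ontology; the set-membership test here is a deliberate simplification.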
Service composition & workflow execution
• Data-driven vs. control-driven modeling
• Implicit vs. explicit definition of data
• Implicit vs. explicit iteration on data
Data-driven vs. control-driven modeling
Comparison of BPEL and Taverna (Scufl) w.r.t. control/data-flow

                    | BPEL                                                            | Taverna (Scufl)
--------------------+-----------------------------------------------------------------+---------------------------------------------------------
Activities in model | Basic and structured activities                                 | Processors as data processing units with in/output ports
Semantics of links  | Transfer of control                                             | Transfer of data
Data definition     | Explicitly defined (global variables)                           | Implicitly defined (processor's input/output)
Data initialization | Complex data types must be explicitly initialized               | Automatic
Control logic       | Full-fledged: sequence, conditional, parallel, event-triggered, etc. | Limited: sequential, parallel, and conditional
Parallel execution  | Defined in <flow> or <ForEach>                                  | By default
Implicit vs. explicit definition of data
• Taverna
  - Processors have input/output ports with an associated data type
  - Data travels from the output port of a processor to the input ports of one or more downstream processors
  - Interaction among processors is defined entirely by the arcs in the dataflow graph
• BPEL
  - Requires the explicit definition of variables, and explicit initialization for complex types
  - Data are shared among activities (i.e., are global)
  - More complexity, but more power and flexibility in data handling
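The contrast can be made concrete with a small analogy in Python. This is a sketch of the two modeling styles, not of either engine: `run_dataflow` and `run_control_flow` are hypothetical names, and the two string-processing steps are placeholders for service calls.

```python
def to_upper(text):
    return text.upper()


def exclaim(text):
    return text + "!"


def run_dataflow(processors, value):
    """Dataflow style: no variables are declared. The output of each
    processor is passed straight to the input of the next one."""
    for processor in processors:
        value = processor(value)
    return value


def run_control_flow(text):
    """Control-flow style: variables are declared and initialized up
    front, and every activity reads and writes this shared state."""
    variables = {"input": None, "upper": None, "result": None}
    variables["input"] = text  # explicit initialization

    def activity_upper():
        variables["upper"] = to_upper(variables["input"])

    def activity_exclaim():
        variables["result"] = exclaim(variables["upper"])

    # A sequence: activities run in order; data moves via the variables.
    for activity in (activity_upper, activity_exclaim):
        activity()
    return variables["result"]


print(run_dataflow([to_upper, exclaim], "hello"))
print(run_control_flow("hello"))
```

Both calls print the same string. The second version is longer, but any activity can reach any variable, which is the extra flexibility (and extra complexity) the slide refers to.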
Implicit vs. explicit iteration on data
• Implicit iteration in Taverna
  - Occurs when an input port receives a list where it expects a single element
  - E.g., a processor that outputs a “list of strings” can legally be connected to a processor with an input port of type “string”
  - Taverna interprets this type mismatch as an indication that the destination processor must be invoked repeatedly, once for each element of the input list
  - This behavior is defined by Taverna's functional programming model
• Explicit iteration in BPEL
  - BPEL does not allow type mismatches, so iteration must be defined explicitly
  - Again, BPEL offers more flexibility to define more advanced iteration patterns (at the cost of more complexity in the model)
Implicit vs. explicit iteration in caDSR
• findProjects returns an array Project[]
• findClassesInProject receives type Project and finds all UML classes in this (single) project
• In Taverna, an XML splitter extracts the project array and feeds it directly into findClassesInProject
• In BPEL, a ForEach construct is needed to iterate over the array Project[]
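The difference between the two styles can be sketched in a few lines. This is a simplified illustration, not Taverna's or BPEL's actual semantics: the wrapper below iterates over any list it receives, whereas a real engine compares the list depth against the declared port type. The project and class names are toy data, and both helper function names are hypothetical.

```python
# Toy stand-in for the findClassesInProject operation.
CLASSES = {"ProjectA": ["Gene", "Protein"], "ProjectB": ["Specimen"]}


def find_classes_in_project(project):
    """Expects a single project, not a list of projects."""
    return CLASSES[project]


def invoke_with_implicit_iteration(processor, value):
    """Implicit style: if a list arrives where a single item is
    expected, invoke the processor once per element."""
    if isinstance(value, list):
        return [invoke_with_implicit_iteration(processor, item)
                for item in value]
    return processor(value)


def invoke_with_explicit_iteration(projects):
    """Explicit style: the workflow author writes the loop."""
    results = []
    for project in projects:
        results.append(find_classes_in_project(project))
    return results


projects = ["ProjectA", "ProjectB"]
implicit = invoke_with_implicit_iteration(find_classes_in_project, projects)
explicit = invoke_with_explicit_iteration(projects)
print(implicit)
print(implicit == explicit)
```

In the implicit version the caller only wires the list into the processor; in the explicit version the loop is part of the workflow definition, which leaves room for custom iteration patterns.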
Workflow result analysis
• Workflow provides a natural framework for data tracking and analysis
  - In both Taverna and BPEL
• Taverna offers native provenance support
  - More precise linkage annotation between services' input and output
  - Semantic support
  - Not the focus of our project; see refs. [16], [17] for more details
Conclusion: Taverna offers lifecycle support
• Provides a compact set of primitives that eases the modeling of data flows
• Allows users to specify “what to do” instead of “how to do it”
• Discovery
  - Scavenger: for customized service discovery
  - Feta: service annotation and discovery
• Composition
  - Scufl: compact modeling of data flow
  - Built-in processors: Soaplab, BioMart, etc.
  - Customized processors as plug-ins
• Execution
  - Implicit iteration: handles parallel execution
• Analysis
  - Result persistence and visualization
• Community (reuse, generate)
  - A community for sharing workflows
Conclusion: BPEL offers unique features
• Build-time
  - A comprehensive set of primitives to model processes of all flavors
    - control-flow oriented
    - data-flow oriented (although a little verbose)
    - event driven, etc.
  - Full featured
    - process logic, data manipulation, event and message processing, fault handling, etc.
• Run-time
  - BPEL engines typically run inside application servers with
    - persistent state storage
    - reliability and scalability guarantees
  - Important for long-running and computation-intensive workflows
  - For now, the Taverna engine does not provide these capabilities
Conclusion
• Factors in deciding which language/tool to choose
  - User IT expertise
    - Some prefer a scripting language, others a friendly GUI
  - Problem size
    - Taverna often runs on a desktop and handles problems of moderate size (currently common in bioinformatics)
    - Grid/server-based systems like Swift can deal with huge volumes of data and intensive computation (for example, applications in medical informatics, neuroscience, physics)
  - Applications involved
    - Web services, batch jobs, shell scripts, etc.
• Future work
  - Enrich the caGrid workflow tool set based on Taverna
  - Build more real workflows to help scientific investigation
  - Address issues of scale as they arise
Thank you for your attention
Introduction: caGrid and workflow
[Figure: caGrid provides Connectivity, Virtualization, and Security over instruments, data, and computation resources]