Design and Evaluation of an Autonomic Workflow Engine

Thomas Heinis, Cesare Pautasso, Gustavo Alonso
Dept. of Computer Science, Swiss Federal Institute of Technology (ETHZ)
The 2nd IEEE International Conference on Autonomic Computing (ICAC-05)

March 15th, 2008
Seo, Dongmahn

Contents
  • Introduction
  • System Background
  • System Architecture
  • Autonomic Capabilities
  • System Evaluation
  • Conclusion

Introduction
  • Motivation
  • Related Work
  • Contribution

Motivation
  • Workflow management systems
    • e-commerce
    • virtual laboratories
    • DNA sequencing
    • scientific computing
    • Grid computing
  • idea of process-based Web service composition

Motivation (cont.)
  • Workflow engines
    • open environment
    • unknown workload
    • difficult to choose between
      • a centralized solution
      • a distributed implementation of the engine
  • Problem of configuring the system in an optimal way
  • NOT a feasible solution, considering the number of parameters involved and the variability of the workload
    • having a system administrator in charge of manually monitoring and reconfiguring the system

Related Work
  • Decentralization of workflow process execution
    • important area of research
    • supports business processes
    • leads to higher scalability
    • introduces several problems
      • lack of a global view over the process
      • scalability and reliability problems per se
  • To address the problem
    • GOLIAT, autonomic computing techniques, self-optimizing computer systems
    • autonomic computing principles in the context of distributed workflow engines

Contribution
  • Goal
    • self-tuning
    • self-configuration capabilities
    • self-healing capabilities

Contribution (cont.)
  • System
    • extension to the JOpera engine
      • Java-based service composition tool
      • combines a workflow engine with an open architecture to provide support for Web service composition, Grid computing and specialized workflow engines
      • flexible architecture, components
    • Key system modules can be replicated to handle large workloads.
    • Other modules can be paired with a backup to achieve fault tolerance.
    • The autonomic controller can be configured by selecting different reconfiguration strategies.

Contribution (cont.)
  • The key contributions of the paper
    • the novel system architecture
      • generic
      • can be adopted by many engines operating under different models and languages
    • the resulting scalability and fault tolerance
      • flexible enough to support the very large loads present in computational applications and large-scale Web service composition
    • the independence of the underlying workflow model
      • easily extensible to support many different kinds of services

System Background
  • Requirements
  • Workload Assumptions
  • Deployment Environment

Requirements
  • To support autonomic behavior, the workflow execution engine must feature
    • self-configuration,
    • self-tuning and
    • self-healing capabilities
  • Self-configuration
    • switching the system’s configuration on the fly
    • without manual intervention and without disrupting the system
    • requires the workflow execution engine
      • to support changing the configuration dynamically and efficiently

Requirements (cont.)
  • Self-tuning
    • system reconfiguration towards the optimal configuration given the current workload
    • the workflow engine must give access to its internal state
      • control algorithms can analyze current and past performance information
      • to plan configuration changes in response to the current workload
    • assumption
      • the characteristics of the workload affect the system’s performance
      • the self-tuning algorithm can optimally adapt the system to the workload by monitoring key performance indicators

Requirements (cont.)
  • Self-healing
    • able to detect configuration changes due to external events
      • failures of nodes
    • recovery action
    • requires
      • mechanisms for detecting failures and configuration changes of the cluster
      • the ability to query the workflow execution state

Workload Assumptions
  • The workload is assumed
    • to be a collection of concurrent workflow processes
    • to be a worst-case scenario
  • Workload prediction issues are not dealt with
    • future work

Deployment Environment
  • [Assumption] JOpera
    • runs on a dedicated cluster of computers
    • can use these resources exclusively
  • Main goal of the autonomic features
    • to ensure the optimal configuration of the cluster
      • efficient resource utilization
      • good allocation of the available nodes to the different system components
  • Cluster
    • configuration is NOT static
    • the system could be extended to use shared nodes that are also used for other purposes

System Architecture
  • Workflow Execution
  • Distributed Workflow Execution
  • Scalable Workflow Execution

Workflow Execution
  • Workflow processes model interactions between different tasks
    • by defining the data flow and control flow between them

Distributed Workflow Execution [figure slide]

Scalable Workflow Execution
  • scalability bottleneck
    • use several layers of caching
      • between the tuple space and the threads producing and consuming tuples

Autonomic Capabilities
  • Self-Tuning
    • Information Strategy
    • Optimization Strategy
    • Selection Strategy
  • Self-Configuration
    • Reconfiguration Actions
  • Self-Healing

Self-Tuning
  • Information Strategy
    • detect imbalances in the system’s configuration
    • sample the current space size
  • Optimization Strategy
    • establish a configuration such that the number of navigator and dispatcher threads is balanced
  • Selection Strategy
    • prioritize nodes according to how well suited they are for a configuration change
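
The three strategies can be illustrated with a minimal Python sketch. It is not the controller from the paper: the two sampled sizes (`process_space_size`, `task_space_size`), the `threshold` factor, and the "least in-flight work" selection rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical metrics: number of tuples waiting on each side.
    process_space_size: int  # backlog assumed to be consumed by navigators
    task_space_size: int     # backlog assumed to be consumed by dispatchers


def plan(sample, navigators, dispatchers, threshold=2.0):
    """Optimization strategy (assumed rule): switch the role of at most
    one thread when one side's backlog per thread clearly dominates."""
    nav_load = sample.process_space_size / max(navigators, 1)
    disp_load = sample.task_space_size / max(dispatchers, 1)
    if disp_load > threshold * max(nav_load, 1.0) and navigators > 1:
        return navigators - 1, dispatchers + 1
    if nav_load > threshold * max(disp_load, 1.0) and dispatchers > 1:
        return navigators + 1, dispatchers - 1
    return navigators, dispatchers


def pick_node(nodes):
    """Selection strategy (assumed rule): nodes are (name, in_flight)
    pairs; prefer the node with the least in-flight work."""
    return min(nodes, key=lambda node: node[1])[0]
```

With a large task backlog, `plan(Sample(10, 900), 7, 8)` moves one thread from the navigator side to the dispatcher side; a balanced sample leaves the configuration unchanged.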

Self-Configuration
  • a closed feedback-loop controller
  • Reconfiguration Actions
    • Starting Threads
      • the JOpera API
    • Stopping Navigator Threads
      • migrating the state of the processes the navigator thread is working on and redirecting associated events
      • by flushing the locally cached state into the global tuple space
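
Stopping a navigator by flushing its cache can be sketched as follows. This is a toy model, not JOpera code: the `Navigator` class, the dictionary standing in for the global tuple space, and the `routing` table layout are hypothetical.

```python
class Navigator:
    """Toy navigator that caches process state locally (assumed design)."""

    def __init__(self, name, global_space, routing):
        self.name = name
        self.global_space = global_space  # dict: process id -> state
        self.routing = routing            # dict: process id -> navigator name
        self.cache = {}                   # locally cached process state

    def update(self, pid, state):
        # State changes stay in the local cache while the navigator runs.
        self.cache[pid] = state
        self.routing[pid] = self.name

    def stop(self):
        # Flush the cached state into the global space so that another
        # navigator can pick the processes up, then drop the routes that
        # still point to this navigator.
        self.global_space.update(self.cache)
        for pid in self.cache:
            if self.routing.get(pid) == self.name:
                del self.routing[pid]
        self.cache.clear()
```

After `stop()` returns, the global space holds the latest state of every process the navigator was working on, and no event is routed to it any more.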

Self-Configuration (cont.)
  • Stopping Dispatcher Threads
    • more difficult
      • a task may involve the invocation of a local application or the interaction with a remote service provider on the Web
    • metadata
    • kill method
      • immediately stops all active task executions
      • ensures all task invocations will be repeated on a different dispatcher thread
    • stop method
      • immediately ceases to take tuples from the task space
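
The difference between the two methods can be shown with a small Python sketch. The `Dispatcher` class and the use of a `deque` as the task space are hypothetical, and putting interrupted tasks back into the task space is only one assumed way of ensuring that they are repeated elsewhere.

```python
from collections import deque


class Dispatcher:
    """Toy dispatcher that takes task tuples from a shared task space."""

    def __init__(self, task_space):
        self.task_space = task_space  # shared deque of pending task tuples
        self.active = []              # tasks currently being executed
        self.accepting = True

    def take(self):
        # Take the next task tuple unless the dispatcher was stopped.
        if self.accepting and self.task_space:
            task = self.task_space.popleft()
            self.active.append(task)
            return task
        return None

    def stop(self):
        # Cease taking new tuples; active tasks are left to finish.
        self.accepting = False

    def kill(self):
        # Abort all active tasks and return them to the task space so
        # that another dispatcher repeats them (assumed mechanism).
        self.accepting = False
        self.task_space.extend(self.active)
        self.active.clear()
```

In this model `stop` is the gentle option because no work is lost, while `kill` frees the thread at once at the cost of re-executing the interrupted tasks.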

Self-Healing
  • periodically monitors the nodes of the cluster
  • Handling Dispatcher Thread Failures
    • the tasks that were managed by it are lost and have to be restarted
    • very similar to when the self-configuration component kills a dispatcher
  • Handling Navigator Thread Failures
    • the state of the execution of the process is still available in the global process execution state space
    • recovery simply removes the entries in the tuple routing table which point to the failed navigator
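
A minimal sketch of the monitoring step and of the navigator recovery, assuming a heartbeat-style check: the slides do not say how failures are detected, so `last_seen`, `timeout`, and the dictionary used as routing table are hypothetical.

```python
def find_failed(last_seen, now, timeout):
    """Return the nodes whose last heartbeat is older than `timeout`.

    `last_seen` maps node name -> timestamp of the last heartbeat
    (an assumed detection mechanism)."""
    return sorted(node for node, t in last_seen.items() if now - t > timeout)


def drop_routes(routing, failed_navigator):
    """Remove routing entries that point to the failed navigator.

    The process state itself is not touched, because it is still
    available in the global process execution state space."""
    return {pid: nav for pid, nav in routing.items() if nav != failed_navigator}
```

A process whose route was dropped has no navigator assigned in this model, so a surviving navigator can take it over using the state in the global space.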

System Evaluation
  • Experimental Setup
  • Base Line
  • Autonomic Behavior
    • Self-Configuration
    • Reconfiguration Overhead
  • Self-Healing
  • Discussion

Experimental Setup
  • a cluster of up to 20 nodes
    • 1.0 GHz dual P-III, 1 GB of RAM, Linux (kernel version 2.4.22) and Sun’s Java Development Kit version 1.4.2
  • one additional node
    • the global tuple space server
    • IBM’s T-Spaces v2.1.3

Base Line
  • two different workloads
    • 1000 concurrent processes containing 10 parallel tasks with a duration of 0 seconds (workload 0)
    • 1000 processes containing 10 parallel tasks with a duration of 20 seconds (workload 20)
  • total of 15 nodes
    • from 14 navigators and 1 dispatcher up to 14 dispatchers and 1 navigator
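
The size of the base line experiment follows from the numbers on this slide. The short Python sketch below does the arithmetic; that every intermediate navigator/dispatcher split was measured is an assumption, since the slide only names the two extremes.

```python
PROCESSES = 1000
TASKS_PER_PROCESS = 10
NODES = 15

# Each workload consists of the same number of task invocations.
total_tasks = PROCESSES * TASKS_PER_PROCESS

# Pure task execution time of workload 20, ignoring all engine overhead.
workload_20_task_seconds = total_tasks * 20

# Static configurations as (navigators, dispatchers) pairs, from
# 14 navigators and 1 dispatcher down to 1 navigator and 14 dispatchers.
configurations = [(nav, NODES - nav) for nav in range(14, 0, -1)]
```

Both workloads therefore issue 10,000 task invocations; workload 0 stresses the engine's own overhead because its tasks take no time, while workload 20 adds 200,000 task-seconds of execution to be spread over the dispatchers.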

Base Line (cont.) [figures, 2 slides]

Autonomic Behavior
  • Self-Configuration

Autonomic Behavior (cont.) [figures, 2 slides]

Autonomic Behavior (cont.)
  • Reconfiguration Overhead

Self-Healing
  • initially to use 15 nodes
  • to replace 5 of the nodes assigned
  • workload consists of four peaks of 500 processes occurring every 100 seconds
    • each of the processes consists of 10 parallel tasks of 10 seconds duration
  • changes in the number of nodes
    • grows to 20 nodes at t=90
    • reduced by 5 nodes at t=140
    • reduced again by 5 nodes at t=230
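
The cluster size over the course of this experiment can be written down as a small step function. The Python sketch below assumes that each change takes effect exactly at the stated time and that t is measured in seconds; `EVENTS` and `nodes_at` are illustrative names.

```python
# (time, change in number of nodes), taken from the slide.
EVENTS = [(90, +5), (140, -5), (230, -5)]


def nodes_at(t, initial=15):
    """Number of cluster nodes available at time t."""
    return initial + sum(delta for when, delta in EVENTS if t >= when)


# Size of the workload: four peaks of 500 processes with 10 tasks each.
tasks_per_peak = 500 * 10
total_tasks = 4 * tasks_per_peak
```

The cluster thus goes from 15 to 20, back to 15, and finally down to 10 nodes, while each peak submits 5,000 task invocations.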

Self-Healing (cont.) [figures, 4 slides]

Discussion
  • finding an optimal static configuration for a given workload is very difficult
    • different characteristics lead to different optimal configurations
  • the autonomic controller was able to adapt the configuration of the workflow engine
    • according to the variable characteristics of the workload
  • self-healing experiment
    • a common situation in the lifetime of a cluster-based system

Conclusion
  • presented the design of an autonomic workflow engine
  • demonstrated its self-managing behavior and evaluated its performance
  • showed how to apply the autonomic computing paradigm to greatly simplify the deployment and the maintenance of such systems
  • homogeneous workload
    • more complex characteristics as part of future work
