Airship resiliency framework using Argo Workflows Kiriti Muktevi

Airship resiliency framework using Argo Workflows
Kiriti Muktevi, Imaginea | Gurpreet Singh, Imaginea | Pradeep Kumar, Solution Architect, Ericsson | Hemanth Nakkina, Solution Architect, Ericsson
Ericsson Internal | 2018-02-21

Agenda
• Why Resiliency Testing is Important
• Why Torpedo
• Introduction to Torpedo
• Architecture Overview
• Core Components
• Demonstration

Why Resiliency is Important in Containerized/Virtualized Cloud Deployments
• Deal with unexpected failures in a distributed system
• Microservices need to be resilient to failures so that they can restart, often on another node
• Reliable upgrades/downgrades to maintain a consistent state
• Cloud-based systems need to embrace failures and automatically recover from them, e.g. network pod failures, database pod failures (StatefulSets)

Introduction to Torpedo
• A framework to test the resiliency of an Airship (or any Kubernetes) deployed environment
• No existing mechanism to validate the resiliency of the deployed services
• Data sources provide the components on which resiliency tests are to be performed
• Results of the analysis are exported to Elasticsearch and represented graphically on a Kibana dashboard
• Operations teams need audit logs for future reference

Why Torpedo
• Testing for a particular service, for example Nova: what will happen if I kill the nova-compute pod?
• In case of random service failures, how does the target platform behave?
• Tools like chaoskube and PowerfulSeal can take care of introducing random failures to the platform
• But they all lack an overall framework for specific testing needs

Torpedo Architecture
[Architecture diagram: Destroyer & Traffic Generator instances (T1…Tn) attack target components (C1…Cn); the Attack Orchestrator (Argo Workflows) chooses weapons and creates destruction scenarios; the Torpedo Meta controller consumes a Torpedo CRD test suite; runtime metrics and aggregated system logs/metrics flow to the Log Collector and Analyzer backed by ELK.]

Components of Torpedo
• Torpedo Meta controller
• Orchestrator (Argo Workflows)
• Traffic generator
• Destroyer
• Log Collector and Analyzer

Torpedo Meta controller
• Metacontroller is an add-on for k8s for writing and deploying custom controllers; it makes it easy to define behavior for a new extension API
• Metacontroller lets a test suite be defined and executed as a Kubernetes custom resource (Torpedo CR)
• Users can define test cases as a Torpedo CR
• The custom Torpedo controller builds Argo YAMLs based on the user's input CR and triggers the test case's Argo workflow
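As a rough sketch, a Torpedo CR could look like the following. The API group, kind, and field names are illustrative assumptions, not the project's actual schema; the spec fields loosely mirror the Destroyer example shown later in the deck:

```yaml
# Hypothetical Torpedo custom resource; group/version, kind,
# and field names are illustrative only.
apiVersion: torpedo.example.com/v1alpha1
kind: TorpedoTest
metadata:
  name: nova-os-api-resiliency
spec:
  service: nova
  component: os-api
  kill-interval: 30     # seconds between pod kills
  kill-count: 5         # pods to kill per round
  sanity-checks: nova-api-responds
```

The Torpedo controller registered via Metacontroller would watch resources of this kind, render the corresponding Argo Workflow YAML, and submit it.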

Argo Workflows
• Generates a DAG of all jobs that need to be executed:
§ Traffic jobs
§ Chaos jobs
§ Sanity jobs
§ Log collection and analyzer jobs
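A minimal sketch of how these four job types could be wired into an Argo Workflow DAG; the Workflow schema is Argo's, but the template names and container images are placeholders, not Torpedo's actual manifests:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: torpedo-run-
spec:
  entrypoint: resiliency-suite
  templates:
    - name: resiliency-suite
      dag:
        tasks:
          - name: traffic
            template: traffic-job
          - name: chaos
            template: chaos-job
            dependencies: [traffic]
          - name: sanity
            template: sanity-job
            dependencies: [chaos]
          - name: analyze
            template: log-analyzer-job
            dependencies: [sanity]
    # Each leaf template runs one containerized job; images are placeholders.
    - name: traffic-job
      container: {image: torpedo/traffic:latest}
    - name: chaos-job
      container: {image: torpedo/destroyer:latest}
    - name: sanity-job
      container: {image: torpedo/sanity:latest}
    - name: log-analyzer-job
      container: {image: torpedo/analyzer:latest}
```

The `dependencies` lists are what make this a DAG rather than a linear sequence: independent jobs (e.g. traffic and chaos) could equally be run in parallel by dropping the dependency edge.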

Traffic Generator
• Lightweight HTTP traffic generator (similar to gabbi in OpenStack functional tests)
• API calls/traffic can be specified as YAML
• Pluggable module: it should be possible to plug in any other traffic generation module
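For illustration, a gabbi-style YAML spec could describe the HTTP traffic to drive during a test. The endpoints and expectations below are invented for the example, not taken from Torpedo:

```yaml
# Illustrative gabbi-style traffic spec; endpoints and
# expected statuses are invented for the example.
tests:
  - name: nova api stays up during chaos
    GET: /v2.1/servers
    status: 200

  - name: create server succeeds
    POST: /v2.1/servers
    request_headers:
      content-type: application/json
    data:
      server:
        name: resiliency-probe
        flavorRef: "1"
    status: 202
```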

Destroyer
• Custom plugin that can initiate the random destruction of pods of target microservices
• Test how your system behaves under arbitrary pod failures
• Allows filtering target pods by namespaces, labels, and annotations, as well as excluding them
• Allows specifying a mean time between failures on a per-pod basis, a feature that chaoskube lacks

Destroyer (Example)

  - service: nova
    component: os-api
    kill-interval: 30
    kill-count: 5
    same-node: True
    pod-labels:
      - 'application=nova'
      - 'component=os-api'
    node-labels:
      - 'openstack-nova-control=enabled'
    service-mapping: nova
    name: nova-os-api
    nodes: ''
    max-nodes: 2
    sanity-checks: ''
    extra-args: ""
    job-duration: 100
    count: 60

Log Collector and Analyzer
• All logs and metrics are maintained in an ELK or EFK stack (currently the LMA stack used in Airship)
• Generated Torpedo results are saved in the ELK/EFK stack and can be viewed on Kibana/Grafana dashboards
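As a sketch of what lands in the dashboard, a single Torpedo result document indexed into Elasticsearch might carry fields like the following; every field name and value here is hypothetical, chosen only to show the shape such a record could take:

```yaml
# Hypothetical result document shape; all fields are illustrative.
test-name: nova-os-api-kill
service: nova
component: os-api
pods-killed: 5
recovery-time-seconds: 42
sanity-check: passed
timestamp: "2018-02-21T10:00:00Z"
```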

Demonstration
• Testing a stateless component
• Testing a stateful component
• Testing control node force shutdown

Next Steps
• Apart from the scenarios above, we will be adding more resiliency scenarios such as:
§ Ceph/storage resiliency
§ Rack down/failure scenarios
• Extending Torpedo to run resiliency tests against nodes running on different private/public clouds, e.g. Google Cloud, AWS, Azure
• Exploring compatibility of various other tools that can be integrated into the Torpedo ecosystem
