Resiliency and Geo Redundancy Lessons from SDNC Sharon

Resiliency and Geo. Redundancy – Lessons from SDNC
Sharon Chisholm
June 20, 2018
Overview
• High-Availability
  • Site Resiliency
  • Geo-redundancy
• SDNC
  • Highly-Available Roadmap
  • Site Resiliency
  • Geo-redundancy
• Bonus Lesson
• Solution Exclusions
• Where to learn more
Information Security Level 2 – Sensitive © 2017 – Proprietary & Confidential Information of Amdocs
High-Availability
• As we move towards a virtualized and open-sourced world, we still need to be concerned about the high availability of critical services
• A highly available ONAP solution minimizes downtime over
  • Simple or catastrophic failures
  • System maintenance
  • Software upgrades
• Achieved through a combination of
  • Site Resiliency
  • Geographic Redundancy
Trends in High-availability
• Away from expensive highly-available hardware to layering high-availability on lots of less reliable (but inexpensive) hardware
• Increased emphasis on site resiliency
• Over time, improvements in site resiliency make site fail-overs less frequent, but they remain a critical component of a highly-available solution.
Site Resiliency
• Use of redundant instances and auto-recovery mechanisms within a single geographic site in order to increase the availability of services
• Mechanisms
  • Multiple instances of critical components
    • Including the ability to seamlessly share data
  • Load balancers or "dumb" pick-an-instance mechanisms
  • Health monitoring and component restart
    • Kubernetes
• Reduced fate-sharing increases resiliency
  • Different VMs, hardware, etc.
  • Note that this may have some impact on performance.
[Figure label: Toronto]
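As an illustration of a "dumb" pick-an-instance mechanism combined with health monitoring, here is a minimal Python sketch. The names `make_picker` and `is_healthy` and the instance labels are hypothetical and do not come from SDNC; in the deck's solution Kubernetes provides the health monitoring and restart.

```python
from itertools import cycle
from typing import Callable, Iterable


def make_picker(instances: Iterable[str],
                is_healthy: Callable[[str], bool]) -> Callable[[], str]:
    """Return a function that round-robins over instances, skipping unhealthy ones.

    `is_healthy` is a hypothetical health probe supplied by the caller.
    """
    pool = list(instances)
    ring = cycle(pool)

    def pick() -> str:
        # Try each instance at most once per call before giving up.
        for _ in range(len(pool)):
            candidate = next(ring)
            if is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy instance available")

    return pick
```

A caller would invoke `pick()` once per request; an instance that fails its probe is simply skipped until it reports healthy again.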
Geo-Redundancy
• Site resiliency goes a long way to ensure service availability
• It doesn’t help in the case of a serious incident
  • Natural disasters
  • Human-made disasters
  • Reptile attack
• A separate physical site that is unlikely to be impacted by the same events is the only option
• Redundancy Types
  • Active/Standby
  • Active/Active
• Failover Types
  • Manual or automated failover
  • Full or partial failover
• Site Reversion
[Figure labels: Vancouver, Toronto]
Platform Maturity Requirements - Resiliency
Level 0
• No redundancy
Level 1 (Site Resiliency)
• Support manual failure detection & rerouting or recovery within a single site; tested to complete in 30 minutes
Level 2 (Site Resiliency)
• Support automated failure detection & rerouting
  • Within a single geographic site
  • Stateless components: establish baseline measure of failed requests for a component failure within a site
  • Stateful components: establish baseline of data loss for a component failure within a site
Level 3 (Geo-Redundancy)
• Support automated failover detection & rerouting
  • Across multiple sites
  • Stateless components
    • Improve on # of failed requests for component failure within a site
    • Establish baseline for failed requests for site failure
  • Stateful components
SDNC Resiliency Roadmap
Product | Site Resiliency | Component Auto Restart | Geo. Redundancy | Manual Switch | Auto-switch | Site Reversion
ONAP Amsterdam | Multiple ODL in cluster | No | No | No
ONAP Beijing | Yes | Yes (2 step) | No | Manual (2 Step)
ONAP Casablanca | Yes | Yes (1 step) | Yes | Manual (1 step) | Yes
SDNC Site Resiliency – Beijing
Site Resiliency based on Kubernetes clusters
• Local clusters
• Health monitoring
• Restarting problem components
• Consolidated status reporting
• Multi-VM support
• Reduce single points of failure
https://wiki.onap.org/display/DW/SDN-C+Clustering+on+Kubernetes
Site Resiliency Architecture - SDNC
• 2 DB pods
• 3 sdnc (ODL) pods
• 1 admin (portal) pod
• 1 dgbuilder pod
Less critical functions don’t need local replication if they can restart quickly
SDN-C Startup Order
MySQL versus MariaDB
• Historically SDN-C was using MySQL
• MySQL has a concept of mastership, which needs updating even during a local failure
• SDN-C plans to move to MariaDB
• MariaDB is active/active, so there is no requirement for a change of mastership
• Makes things much easier!
• Because of the planned DB move, we did not implement code to change mastership automatically, so for this short window it needs to be done manually.
• SDNC will move to MariaDB in Casablanca
Active/Active makes for simpler failover
Manual Failover Solution – Beijing
[Figure: CORE DNS; active and standby sites; native data replication; per site Sdnhost-sdnc-0, Sdnhost-sdnc-1, Sdnhost-sdnc-2, nodeport, Health Check API/Script, Role Switch API/Script; Amdocs SDNC Team]
Sdnhost-sdnc: connectivity to pods via ports to enable mdsal geo-replication
Manual Failover Solution – Casablanca
[Figure: CORE DNS; active and standby sites; native data replication; per site Sdnhost-sdnc-0, Sdnhost-sdnc-1, Sdnhost-sdnc-2, nodeport; PROM, MUSIC; Health Check API/Script, Role Switch API/Script; AT&T PROM Team, Amdocs SDNC Team]
Geo-Redundancy Architecture
[Figure: active and standby sites]
Core DNS
• The change of site needs to be transparent to the client
• With CoreDNS, MSO (or other components) can look up the address of the SDNC server and automatically pick up the new site once a switch occurs
• Note that the client likely caches the address, so requests may fail until the address is refreshed.
• Enables sending requests to a single known address, which resolves to the current primary server
• https://wiki.onap.org/display/DW/Setting+up+CoreDns+Provider
• Note that Kubernetes is moving to CoreDNS
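The caching caveat above can be sketched from the client side. This is an illustrative Python sketch under stated assumptions, not how MSO or any ONAP client is implemented: `FailoverAwareClient`, `resolve` and `send` are hypothetical names, and the lookup and transport are injected so the idea can be shown without a real DNS server.

```python
from typing import Callable, Optional


class FailoverAwareClient:
    """Caches the resolved address and looks it up again after a failed request."""

    def __init__(self, hostname: str,
                 resolve: Callable[[str], str],
                 send: Callable[[str, str], str]) -> None:
        self.hostname = hostname
        self._resolve = resolve   # hypothetical DNS lookup hook
        self._send = send         # hypothetical transport hook
        self._cached: Optional[str] = None

    def request(self, payload: str, max_attempts: int = 2) -> str:
        last_error: Optional[Exception] = None
        for _ in range(max_attempts):
            if self._cached is None:
                self._cached = self._resolve(self.hostname)
            try:
                return self._send(self._cached, payload)
            except ConnectionError as err:
                # The cached address may still point at the old site, so drop
                # it and let the next attempt resolve the name again.
                self._cached = None
                last_error = err
        raise ConnectionError(f"all {max_attempts} attempts failed") from last_error
```

The first failure after a site switch costs one retry; subsequent requests go straight to the new site because the refreshed address is cached again.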
Types of Geo-redundant Failures (SDN-C)
Simple Failure
• Other side is reachable, at least enough that the ODL cluster can be configured
• When the original site is restored to good health, it will automatically become the standby site
Catastrophic Failure
• Simple Failure switch has failed
  • Site unreachable
  • ODL unresponsive
  • Etc.
• When the original site is restored to good health, the cluster needs to be manually configured to include both sites again.
The ability of your underlying technology to react to changes in your deployment can have serious impacts on the complexity of your solution
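The distinction between the two failure types can be expressed as a small decision function. The sketch below is only an illustration of the slide's wording; `FailedSiteProbe`, `classify` and `reversion_step` are hypothetical names and not part of SDN-C or PROM.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    SIMPLE = "simple"
    CATASTROPHIC = "catastrophic"


@dataclass
class FailedSiteProbe:
    """What the surviving site can observe about the failed site (hypothetical)."""
    site_reachable: bool
    odl_responsive: bool
    simple_switch_succeeded: bool


def classify(probe: FailedSiteProbe) -> FailureType:
    # A simple failure requires the other side to be reachable enough for the
    # ODL cluster to be reconfigured; anything else is treated as catastrophic.
    if probe.site_reachable and probe.odl_responsive and probe.simple_switch_succeeded:
        return FailureType.SIMPLE
    return FailureType.CATASTROPHIC


def reversion_step(failure: FailureType) -> str:
    # What happens once the original site is restored to good health.
    if failure is FailureType.SIMPLE:
        return "original site automatically becomes the standby site"
    return "cluster must be manually configured to include both sites again"
```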
Simple Failure
[Figure: CORE DNS; active and standby sites; native data replication; per site Sdnhost-sdnc-0, Sdnhost-sdnc-1, Sdnhost-sdnc-2, nodeport; PROM, MUSIC; Health Check API/Script, Role Switch API/Script]
Catastrophic Failure
[Figure: CORE DNS; native data replication; per site Sdnhost-sdnc-0, Sdnhost-sdnc-1, Sdnhost-sdnc-2, nodeport; PROM, MUSIC; Health Check API/Script, Role Switch API/Script]
Split-Brain Failure
[Figure: CORE DNS; native data replication; per site Sdnhost-sdnc-0, Sdnhost-sdnc-1, Sdnhost-sdnc-2, nodeport; PROM, MUSIC; Health Check API/Script, Role Switch API/Script]
Operational Considerations - Troubleshooting
Want to ensure we can determine
• Current state of a given site
  • Healthy/Unhealthy – whether a service request can successfully be processed
• Current role of a given site
  • Active – receiving traffic; ODL voting members of cluster
  • Standby – not receiving traffic; ODL non-voting members of cluster
• Current site’s knowledge of other sites
  • Logged
• History of what has happened, particularly the error path
  • Logging
• End-to-end transaction tracing
  • Logging
Potential Enhancements
• Provide more information in status, so less time can be spent in logs
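One way to picture the "more information in status" enhancement is a single status record per site that carries state, role, and the site's view of its peers. This is a hypothetical Python sketch; `SiteStatus`, its fields and the summary format are assumptions for illustration, not an existing SDNC status API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


@dataclass
class SiteStatus:
    site: str
    healthy: bool                  # can a service request be processed?
    role: Role
    peers: Dict[str, str] = field(default_factory=dict)  # this site's view of other sites

    @property
    def receives_traffic(self) -> bool:
        # Only the active site receives traffic.
        return self.role is Role.ACTIVE

    @property
    def odl_voting(self) -> bool:
        # Active sites hold the voting ODL cluster members; standby sites do not.
        return self.role is Role.ACTIVE

    def summary(self) -> str:
        health = "healthy" if self.healthy else "unhealthy"
        voting = "voting" if self.odl_voting else "non-voting"
        peers = ", ".join(f"{name}={state}"
                          for name, state in sorted(self.peers.items())) or "none"
        return f"{self.site}: {health}, {self.role.value}, ODL {voting}, peers: {peers}"
```

Returning this summary from a status call would answer the first three questions on the slide without a trip to the logs.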
Notifications
In order to ensure that operators are aware of what is happening in their systems, notifications have been implemented via DMaaP to cover the following cases
• Site Health Issue detected
• Site Health Issue cleared
• Site Switch initiated
• Site Switch (catastrophic failure) initiated
[Figure: CORE DNS; native data replication; per site Sdnhost-sdnc-0, Sdnhost-sdnc-1, Sdnhost-sdnc-2, nodeport; PROM, MUSIC; Health Check API/Script, Role Switch API/Script]
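The four notification cases map naturally onto an enumeration. The sketch below only builds a JSON payload; the event strings, field names and `build_notification` function are hypothetical, and the actual publish call to DMaaP is deliberately left out because its details are not given in the slides.

```python
import json
from enum import Enum


class SiteEvent(Enum):
    # One member per case listed on the slide; the string values are made up.
    HEALTH_ISSUE_DETECTED = "site-health-issue-detected"
    HEALTH_ISSUE_CLEARED = "site-health-issue-cleared"
    SWITCH_INITIATED = "site-switch-initiated"
    CATASTROPHIC_SWITCH_INITIATED = "site-switch-catastrophic-initiated"


def build_notification(event: SiteEvent, site: str, timestamp: str) -> str:
    """Serialize a notification body; publishing it is out of scope here."""
    body = {"event": event.value, "site": site, "timestamp": timestamp}
    return json.dumps(body, sort_keys=True)
```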
Failing Component versus Failing ONAP
[Figure: ONAP architecture diagram, shown twice, each with PROM, MUSIC, Role Switch Script and Health Check Script]
Same solution could be applied at the ONAP level
Bonus Lessons
• It is important that you don’t introduce new failure points while trying to improve availability.
• Ensure that all new components involved are themselves resilient
• If a situation can’t be avoided, it should at least be possible to recover from it
• Kubernetes Federation looked promising, but
  • Not widely deployed on OpenStack, so problems are not fully worked out yet
  • Does not buy us much in an active/standby solution
  • We spent time on this, but ended up not using it.
Solution Exclusions
• Active/Active
• Handling in-flight requests
  • The assumption is that the client will retry a request that fails during a failover
  • All SDNC requests should aim to be idempotent
• Automated Site Reversion
  • Want to avoid reverting to the initial primary site until you are confident in it.
• Full ONAP failover, although we could plug into a solution like that
• MySQL change of mastership
Idempotency simplifies client logic in case of failure
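To show why idempotency simplifies client logic, here is a minimal Python sketch of a retrying client paired with a handler that deduplicates by request ID. `IdempotentHandler` and `send_with_retry` are hypothetical names, and deduplicating on a request ID is one common technique rather than something the slides say SDNC does.

```python
from typing import Callable, Dict, Optional


class IdempotentHandler:
    """Applies each request ID at most once and replays the stored result on retries."""

    def __init__(self, apply: Callable[[str], str]) -> None:
        self._apply = apply
        self._results: Dict[str, str] = {}

    def handle(self, request_id: str, payload: str) -> str:
        if request_id not in self._results:
            self._results[request_id] = self._apply(payload)
        return self._results[request_id]


def send_with_retry(send: Callable[[str, str], str],
                    request_id: str, payload: str, attempts: int = 3) -> str:
    """Blindly resend the same request; safe only because the handler is idempotent."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return send(request_id, payload)
        except ConnectionError as err:
            last_error = err
    raise ConnectionError("request failed after retries") from last_error
```

In a geo-redundant deployment the stored results would have to be replicated to the other site for a retry after a switch to be recognized; the sketch keeps them in memory only.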
Where to learn more
• Site Resiliency
  • https://wiki.onap.org/display/DW/SDN-C+Clustering+on+Kubernetes
• Geo-Redundancy
  • https://wiki.onap.org/display/DW/Geo-Redundant+SDN-C+on+Kubernetes
• Beijing
  • https://wiki.onap.org/display/DW/SDNC+%3A+Beijing+%3A+Georedundant+SDNC+%3A+Manual+Failover
• Casablanca
  • https://wiki.onap.org/display/DW/SDNC%3A+Geo-redundancy+Enhancements
Backup
PROM Daemon Tunable Parameters
Parameters allow you to tune behavior to
• Avoid premature site switches
• Allow time for site resiliency to try to repair issues
• Allow time for communication to be restored
• Ensure the health check interval is greater than the time to calculate health
• Ensure problems still get detected in a timely manner
[Figure: PROM daemon flow; labels: Music, SDNC-Health Check, Determined Health, Communicate Status with other PROMs, Site Switch, Modify SDNC, iDNS]
Parameter Tuning
• Parameters
  • "core-monitor-sleep-time": "10000"
    • This is PROM’s polling interval, which triggers the monitor script every 10 seconds as configured on the ACTIVE PROM
  • "no-of-retry-attempts": "3"
    • When a site becomes unhealthy, PROM will try 3 times as configured to verify it remains unhealthy
  • "prom-timeout": "60000"
    • This is PROM’s switch-over time. PROM takes 60 seconds to switch from Standby to Active
  • "restart-backoff-time": "10000"
    • This parameter sets the delay between retries. In this example there will be a 10 second delay between retries
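A rough way to reason about these values is to add them up. The Python sketch below assumes the parameters combine additively (one polling interval to notice the problem, one backoff per retry to confirm it, then the switch-over time); that formula is an assumption made for illustration and is not taken from the PROM documentation. The function names are hypothetical.

```python
import json

# Values copied from the slide; all are milliseconds except the retry count.
PROM_CONFIG = json.loads("""
{
  "core-monitor-sleep-time": "10000",
  "no-of-retry-attempts": "3",
  "prom-timeout": "60000",
  "restart-backoff-time": "10000"
}
""")


def worst_case_switch_ms(config: dict) -> int:
    """Estimate time from failure to standby becoming active (assumed additive model)."""
    poll = int(config["core-monitor-sleep-time"])
    retries = int(config["no-of-retry-attempts"])
    backoff = int(config["restart-backoff-time"])
    switch = int(config["prom-timeout"])
    return poll + retries * backoff + switch


def poll_interval_ok(config: dict, health_calc_ms: int) -> bool:
    """Check that the polling interval exceeds the time needed to calculate health."""
    return int(config["core-monitor-sleep-time"]) > health_calc_ms
```

Under this model the slide's values give 10000 + 3 × 10000 + 60000 = 100000 ms, so raising the retry count or backoff trades slower detection for fewer premature switches.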
Operational Considerations - Logging
• Requested alignment with ONAP logging guidelines
• Want to enable end-to-end transaction traceability
• Existing logging will show audit and request/response entries for the requests to SDNC
• Ideally, rather than timing out with no reason, we should time out with a reason. As we are likely dead or dying at that point, the best we can do is log enough information as we die to aid troubleshooting.
Kubernetes Federation
• What it provides
  • We are no longer sure if it provides value, but we have not yet removed it from our solution
    • We originally thought we needed it for communication
    • Perhaps in an active/active world, it would provide more value
• What it doesn’t provide
  • Replication
  • Active/Backup determination
  • Health monitoring
https://wiki.onap.org/display/DW/Kubernetes+Federation
Generic Kubernetes Federation