Machine Learning in Infrastructure Monitoring: Identifying and Categorizing Flapping Events to reduce resource wastage and service loss

Background

Discussed in the previous insight, "What data centers need to know about flapping events":
  • The impact of flapping events in infrastructure monitoring
  • Top flapping events by occurrence
  • Root causes of event flapping

This insight focuses on how machine learning helps data centers identify and categorize flapping events in order to reduce resource wastage and service loss.

Approaches to handling flapping events
  • Reactive: Addressing flapping events as and when they occur
  • Proactive: Configuring/resetting the thresholds to reduce the volume of flapping events
  • Predictive: Predicting and categorizing events as flapping or non-flapping with machine learning, to minimize resource wastage and service loss

Reactive approach: Addressing flapping events as and when they occur

A typical workflow for handling events in the reactive approach: teams start working as soon as they receive an event notification. Whether the event is flapping or non-flapping, resources are assigned and troubleshooting begins. Depending on the nature of the event, this approach therefore wastes resources and, in some cases, causes service loss.

Proactive approach: Updating/resetting the thresholds to reduce the volume of flapping events

Thresholds that are set too low can cause unnecessary event triggers.

  • Storage-related alerts: Typical storage-related alerts are often set to a very low threshold (e.g., 80%), which causes a lot of events. Reconfiguring the alarm thresholds to a higher permissible number (e.g., 90%), based on field experience, can reduce noise.
  • Timeout-related alerts: Setting timeout thresholds too low causes the system to raise alerts when the response has not really timed out but is merely delayed. A low timeout threshold triggers many events, such as site down, network latency, and port not responding.

Identifying the behavior of these events and proactively increasing their threshold limits accordingly can reduce the number of flapping events significantly. This decreases resource wastage, because monitoring teams receive fewer flapping events. However, service loss may still occur, because the flapping events that remain still reduce the visibility of genuine events.
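The effect of raising a storage threshold can be illustrated with a small sketch. The usage samples and the `count_alert_events` helper below are hypothetical and only assume that an event is raised each time usage crosses the threshold from below; real monitoring tools may count events differently.

```python
# Hypothetical disk-usage samples (percent), one per polling interval.
usage = [78, 81, 79, 82, 80, 83, 79, 91, 92, 93]


def count_alert_events(samples, threshold):
    """Count alert events, assuming one event per upward threshold crossing."""
    events = 0
    above = False
    for value in samples:
        now_above = value >= threshold
        if now_above and not above:
            events += 1  # usage just crossed the threshold from below
        above = now_above
    return events


events_at_80 = count_alert_events(usage, 80)
events_at_90 = count_alert_events(usage, 90)
print(f"threshold 80%: {events_at_80} events, threshold 90%: {events_at_90} events")
```

With these made-up samples, usage hovering around 80% raises repeated events at the lower threshold, while the 90% threshold only fires for the sustained increase at the end.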

Predictive approach: Identifying and categorizing flapping events to minimize resource wastage and service loss

Machine learning-based solution to identify and categorize flapping events

Building intelligence
  • Event logs: Monitoring teams maintain an event log that contains event-related information. The event log is written in natural language.
  • Using NLP techniques, the intelligence system gains the ability to read and process the event logs (text) as humans do.
  • With the combined power of NLP and a machine learning algorithm, the system learns to classify events as flapping or non-flapping by identifying hidden or underlying patterns in the events.
  • Once a new event arrives in the system, the system uses the previously acquired intelligence to read the event in textual format and classify it as a genuine event or a flapping event.
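The learning step described above can be sketched in plain Python. This is a minimal illustration, not the deck's implementation: the labelled history is invented, the tokenizer is a simple whitespace split, and the model is a toy gradient boosting loop over one-word decision stumps rather than a production library.

```python
import math


def tokenize(text):
    """Lowercase an event message and split it into word tokens."""
    return text.lower().replace(":", " ").replace("-", " ").split()


# Hypothetical labelled history: 1 = flapping, 0 = genuine (non-flapping).
history = [
    ("Device Failed Availability Check: UDP - SNMP", 1),
    ("Network latency above limit: UDP - SNMP", 1),
    ("Port not responding: UDP - SNMP", 1),
    ("Device Failed Availability Check: Component device 44676 is not available", 0),
    ("Device Failed Availability Check: Component device 51200 is not available", 0),
    ("Site down: Component device 44001 is not available", 0),
]

vocab = sorted({tok for text, _ in history for tok in tokenize(text)})


def vectorize(text):
    """Binary bag-of-words vector over the training vocabulary."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_boosted_stumps(X, y, rounds=10, lr=0.5):
    """Toy gradient boosting for log loss with one-feature stumps."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(rounds):
        # Negative gradient of the log loss with respect to the current scores.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        best = None
        for j in range(len(X[0])):
            on = [r for row, r in zip(X, residuals) if row[j] == 1]
            off = [r for row, r in zip(X, residuals) if row[j] == 0]
            if not on or not off:
                continue
            v_on, v_off = sum(on) / len(on), sum(off) / len(off)
            err = sum((r - (v_on if row[j] else v_off)) ** 2
                      for row, r in zip(X, residuals))
            if best is None or err < best[0]:
                best = (err, j, v_on, v_off)
        if best is None:
            break
        _, j, v_on, v_off = best
        stumps.append((j, lr * v_on, lr * v_off))
        scores = [s + (lr * v_on if row[j] else lr * v_off)
                  for row, s in zip(X, scores)]
    return stumps


def predict(stumps, text):
    """Classify a new event message with the fitted stumps."""
    row = vectorize(text)
    score = sum(w_on if row[j] else w_off for j, w_on, w_off in stumps)
    return "flapping" if sigmoid(score) >= 0.5 else "non-flapping"


X = [vectorize(text) for text, _ in history]
y = [label for _, label in history]
model = fit_boosted_stumps(X, y)
print(predict(model, "Device Failed Availability Check: UDP - SNMP"))
```

In practice a library implementation of gradient boosting and a proper text vectorizer would replace the hand-written loop; the sketch only shows how word presence in the event text can drive the flapping/non-flapping decision.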

Big data frameworks help in gathering and feeding huge volumes of data into the machine learning system

1. Gathering information with the help of big data frameworks
Big data frameworks help the system gather a variety of data, including:
  • Event logs
  • Customer Relationship Database (CRDB)
  • Network element information
  • Inventory information

2. Enabling the machine to read information and identify patterns based on algorithms
Machine learning and natural language processing: the system reads and analyses all the information passed on by the big data frameworks, such as:
  • Event logs
  • Event nature, patterns, severity, and cascaded events
  • Client information and SLA level
With the help of a pattern-matching algorithm and the analysed data, the system predicts future events and their nature, severity, sub-events, associated client, etc.

3. Output: Categorizing events as flapping or non-flapping and extrapolating the information
  • Event: XXXX
  • Nature: Flapping
  • Severity: Low
  • Cascading: No
  • Client: ABYZ
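The output record above can be modelled as a small data structure that combines the classifier's verdict with gathered inventory data. Everything in this sketch is an assumption for illustration: the `INVENTORY` table, the device identifier, and the severity rule are hypothetical and are not taken from the deck.

```python
from dataclasses import dataclass

# Hypothetical inventory lookup: device -> owning client and downstream devices.
INVENTORY = {
    "dev-44676": {"client": "ABYZ", "downstream": []},
}


@dataclass
class EventReport:
    event: str
    nature: str
    severity: str
    cascading: bool
    client: str


def build_report(event_id, device_id, nature):
    """Combine a predicted nature with inventory data into one output record."""
    device = INVENTORY.get(device_id, {"client": "unknown", "downstream": []})
    cascading = bool(device["downstream"])
    # Assumed rule: a flapping event with no downstream impact is low severity.
    severity = "Low" if nature == "Flapping" and not cascading else "High"
    return EventReport(event_id, nature, severity, cascading, device["client"])


report = build_report("XXXX", "dev-44676", "Flapping")
print(report)
```

Keeping the enrichment separate from the classifier means the same report structure can be filled in regardless of which model produces the nature label.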

Use Case: How Machine Learning analyzes and categorizes the nature of events

Objective: To demonstrate how flapping events can be identified and categorized leveraging machine learning and natural language processing

Machine learning algorithm used: Gradient Boosting Machine

Event type: Device Failed Availability Check: Component Device XXXX is not available

Sample events:
  • Event 1: Device Failed Availability Check: Component device 44676 is not available
  • Event 2: Device Failed Availability Check: UDP – SNMP

Each event consists of a common part (Device Failed Availability Check) and a unique part (Component device 44676 is not available for Event 1, UDP – SNMP for Event 2). The corpus combines a bag of words for flapping events and a bag of words for non-flapping events. The machine learning system (gradient boosting algorithm) checks for the unique part in the combined corpus and categorizes the event as flapping or non-flapping.

Output:
  • Event 1: Device Failed Availability Check: Component device 44676 is not available. Nature: Non-Flapping. Action: Assign Ticket
  • Event 2: Device Failed Availability Check: UDP – SNMP. Nature: Flapping. Action: No Action Required

By leveraging machine learning to identify and categorize events, data centers can reduce resource wastage by approximately 50% and service loss by approximately 10%.
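The split into common and unique parts, and the lookup of the unique part in the two bags of words, can be sketched as follows. Note that this sketch substitutes a simple word-overlap count for the Gradient Boosting Machine named in the use case, and the contents of both word bags are assumptions derived only from the two sample events.

```python
import re

# Hypothetical bags of words, built here from the two sample events only.
FLAPPING_WORDS = {"udp", "snmp"}
NON_FLAPPING_WORDS = {"component", "device", "is", "not", "available"}


def split_event(message):
    """Split an event into its common part and unique part at the first colon."""
    common, _, unique = message.partition(":")
    return common.strip(), unique.strip()


def categorize(message):
    """Return (nature, action) based on which bag the unique part overlaps more."""
    _, unique = split_event(message)
    tokens = set(re.findall(r"[a-z]+", unique.lower()))
    flapping_hits = len(tokens & FLAPPING_WORDS)
    genuine_hits = len(tokens & NON_FLAPPING_WORDS)
    if flapping_hits > genuine_hits:
        return "Flapping", "No Action Required"
    # Ties fall back to ticket assignment so genuine events are not dropped.
    return "Non-Flapping", "Assign Ticket"


event_1 = "Device Failed Availability Check: Component device 44676 is not available"
event_2 = "Device Failed Availability Check: UDP – SNMP"
for event in (event_1, event_2):
    print(event, "->", categorize(event))
```

Defaulting ties to "Assign Ticket" is a deliberate choice in this sketch: misreading a genuine event as flapping risks service loss, whereas the opposite mistake only costs some effort.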

THANK YOU!

USA: Prodapt North America
  • Tualatin: 7565 SW Mohawk St., Ph: +1 503 636 3737
  • Dallas: 222 W. Las Colinas Blvd., Irving, Ph: +1 972 201 9009
  • New York: 1 Bridge Street, Irvington, Ph: +1 646 403 8158

UK: Prodapt (UK) Limited
  • Reading: Davidson House, The Forbury, Reading RG1 3EU, Ph: +44 (0) 11 8900 1068

The Netherlands: Prodapt Solutions Europe, Prodapt Consulting BV
  • Amsterdam: Zekeringstraat 17 A, 1014 BM, Ph: +31 (0) 20 4895711
  • Rijswijk: De Bruyn Kopsstraat 14, Ph: +31 (0) 70 4140722

South Africa: Prodapt SA (Pty) Ltd.
  • Johannesburg: No. 3, 3rd Avenue, Rivonia, Ph: +27 (0) 11 259 4000

India: Prodapt Solutions Pvt. Ltd.
  • Chennai: 1. Prince Infocity II, OMR, Ph: +91 44 4903 3000; 2. “Chennai One” SEZ, Thoraipakkam, Ph: +91 44 4230 2300
  • Bangalore: “Career. Net Campus” No. 53, Devarabisana Halli, Outer Ring Road