Nagios in the Agile / DevOps / Continuous Deployment World
Kishore Jalleda, Director of Operations, IMVU, Inc.
kjalleda@imvu.com

About IMVU (2012)
► Avatar-based social entertainment destination
► $50+ million annual revenue
► 100+ million registered users
► 10+ million items in virtual catalog

IMVU Engineering and Continuous Deployment
► Doing the impossible 50 times a day
► Continuous deployment (CD) is real
► IMVU has been one of the pioneers of CD
► DevOps culture is big
► No approval needed to ship to 1% of customers
Check out our engineering blog: http://engineering.imvu.com/

What does this mean?
► Things change quickly
► New features add up instantly
► Can break frequently
► Failures can cascade rapidly
► Things can fall through the cracks
► Many things change at the same time
► Etc.

Insights into Nagios @IMVU

Overview
► Nagios Core 3.2.0
► 800+ hosts
► 18,000+ service checks
► Single Nagios instance
► 8 cores, 8 GB RAM

Server Lifecycle Management
[Diagram: lifecycle stages and tools: Purchase & Asset Management; DHCP, DNS; Preseed, CFEngine; Nagios, Opspush; Cacti, Istatd; CFEngine; Production; Decommission]

[Operations] Continuous Integration and Deployment

IMVU Asset Database (AssetDB)
► Built internally by IMVU
► Simple but powerful concept
► Source of truth for everything asset related
► Has information on
    ► Class (mysql, standard-http-server, redis)
    ► Role (customer shard, clientdynweb)
    ► Tag (available, no-update)
    ► Attributes (cpu-cores, memory-size, mysql-role)
    ► Much more …

Auto generation of Nagios configuration files
# generate_nagios_conf.pl
(most configurations are auto-generated from AssetDB)
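The generation step can be pictured with a minimal sketch. It is written in Python for illustration and is not the actual generate_nagios_conf.pl. The `Asset` record, its field names, and the `generic-host` template name are assumptions; the slides do not show the real AssetDB schema. The sketch turns asset records into Nagios host definitions and derives hostgroups from each asset's class and role.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    """Hypothetical AssetDB record; the real schema is not shown in the slides."""
    hostname: str
    address: str
    asset_class: str   # e.g. "mysql", "redis"
    role: str          # e.g. "clientdynweb"


def render_hosts(assets):
    """Render one Nagios host definition per asset (contents of hosts.cfg)."""
    blocks = []
    for asset in sorted(assets, key=lambda a: a.hostname):
        blocks.append(
            "define host {\n"
            "    use        generic-host\n"  # assumed template name
            f"    host_name  {asset.hostname}\n"
            f"    address    {asset.address}\n"
            "}"
        )
    return "\n\n".join(blocks) + "\n"


def render_hostgroups(assets):
    """Render one hostgroup per class and per role (contents of hostgroups.cfg)."""
    groups = {}
    for asset in assets:
        for name in (asset.asset_class, asset.role):
            groups.setdefault(name, []).append(asset.hostname)
    blocks = []
    for name in sorted(groups):
        members = ",".join(sorted(groups[name]))
        blocks.append(
            "define hostgroup {\n"
            f"    hostgroup_name  {name}\n"
            f"    members         {members}\n"
            "}"
        )
    return "\n\n".join(blocks) + "\n"
```

Sorting hosts and groups keeps the generated files stable between runs, so a commit of the output only shows real changes.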

Ops Buildbot (builds, builders/buildslaves)
# svn commit hosts.cfg hostgroups.cfg

Opspush (Operations Push System)
# opspush --comment "xxxxxx" --role nagios
[Flowchart: opspush checks the status of the "last build". Green: run "cfagent -v" on the box. Red: --oncalloverride? If no, exit; if yes, proceed. The chart also shows the --use-last-green-rev option.]
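The gating logic in the flowchart can be sketched as a small decision function. This is one plausible reading of the flattened chart, not Opspush's actual code: the names `BuildStatus` and `should_push` are hypothetical, and `--use-last-green-rev` is left out because the slide does not make clear where it applies.

```python
from enum import Enum


class BuildStatus(Enum):
    GREEN = "green"
    RED = "red"


def should_push(last_build: BuildStatus, oncall_override: bool = False) -> bool:
    """Return True if the push may proceed (i.e. cfagent would be run)."""
    if last_build is BuildStatus.GREEN:
        return True
    # Red build: proceed only when the on-call engineer explicitly overrides.
    return oncall_override
```

The point of the gate is that a red build stops a push by default, and bypassing it requires a deliberate, attributable choice.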

Product Development
[Diagram: Ideation, UI Design, Usability Testing, etc.; Tech Design; Monitoring and Alerting Coverage (Nagios); Production; Maintenance]

Tech Designs & New Nagios Alert Requests

Nagios Alert Request Template

Big Data / De-Sharding
► Data freshness is critical to help make the right business decisions
► Nagios used for ETL/DW status and error checking
► Nagios and Ops embeds can help empower your Data Infrastructure team

Things will FAIL

How we try to prevent and catch failures
► Local Acceptance Tests
► Hypo Builds
► Buildbot
► Automated Cluster Immunity (CI)
► Manual QA using roll-out
► Nagios
► 3rd party like webmetrics, customers, etc.

Cluster Immune System
Automated push monitoring and rollback!
[Flowchart: Push to X% of servers → Monitor Critical Metrics → Good: Push to rest → Monitor Critical Metrics → Good: "w00t!, my change is Live". Bad at either monitoring step: Auto Rollback.]
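The push-monitor-rollback loop can be sketched as follows. This is a minimal illustration under stated assumptions, not IMVU's implementation: `staged_push` and its callback parameters (`deploy`, `rollback`, `metrics_ok`) are hypothetical names, and real metric monitoring would watch a time window rather than make a single call.

```python
def staged_push(servers, fraction, deploy, rollback, metrics_ok):
    """Deploy to a fraction of servers first, then to the rest.

    Returns True if the change ends up live everywhere, False if it was
    rolled back because critical metrics went bad.
    """
    first_count = max(1, int(len(servers) * fraction))
    first, rest = servers[:first_count], servers[first_count:]

    for server in first:
        deploy(server)
    if not metrics_ok():
        # Bad metrics after the partial push: undo only what was touched.
        for server in first:
            rollback(server)
        return False

    for server in rest:
        deploy(server)
    if not metrics_ok():
        # Bad metrics after the full push: undo everything.
        for server in servers:
            rollback(server)
        return False

    return True
```

Because the first stage touches only a small share of servers, a bad change is caught while its blast radius is still limited.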

Don’t just rely on Standard Metrics

Demystifying P1s (Priority 1)
P1: Priority 1 issue impacting live operations
Phases
► Identification (Nagios)
► Communication and Declaration
► Resolution
► Postmortem / 5 Whys / Root Cause Analysis
► P1 follow-up

5 Whys / Postmortem (PM) / Root Cause Analysis
► 5 Whys process
► Amazing culture of running blameless postmortems
► New Nagios checks are the most common action items
► A lot of our monitoring and alerting on business and application level metrics originally came out of PMs

Example “5 Whys” Process

Monitor Business & Application Level Metrics

Monitor Response Times
► Load average is a meaningless number

Continuous Monitoring (Istatd)
► Developed by IMVU
► Sub-10-second resolution of data
► API to get average, SD, min, max, and sample count for each data point in a graph
► Ability to stack multiple graphs on the fly
► Long retention times
► Releasing as open source this week!!!
https://github.com/imvu-open/istatd/wiki
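To make "average, SD, min, max, and sample count for each data point" concrete, here is a small sketch of per-bucket aggregation at 10-second resolution. It is not Istatd's API or storage format; `bucket_stats` and its input shape are assumptions for illustration, and the standard deviation here is the population form.

```python
import math
from collections import defaultdict


def bucket_stats(samples, bucket_seconds=10):
    """Aggregate (unix_time, value) samples into fixed-width time buckets.

    Returns {bucket_start: {"count", "avg", "sd", "min", "max"}}.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(value)

    stats = {}
    for start in sorted(buckets):
        values = buckets[start]
        count = len(values)
        avg = sum(values) / count
        variance = sum((v - avg) ** 2 for v in values) / count
        stats[start] = {
            "count": count,
            "avg": avg,
            "sd": math.sqrt(variance),
            "min": min(values),
            "max": max(values),
        }
    return stats
```

Keeping min, max, and SD alongside the average means a short spike inside a bucket stays visible instead of being smoothed away.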

Istatd: 10 Second Resolution of Data

Istatd: Stacking graphs on the fly

Have a “Strategy” for Monitoring and Alerting

Our (Nagios) Strategy
► Human element of Monitoring and Alerting (Nagios)
► Nagios & Test Driven Development (TDD)
► Decouple (Nagios)
► Aggregated Checks

Human Element of Monitoring and Alerting
► Have zero tolerance towards false positives. You do not want your ops staff to walk into the office next AM looking like zombies ;)
► Do not let people develop immunity to pages, as very soon real issues will be ignored
► “All pages are actionable” policy: if there is no action, it should not be paging
► Automatic re-enabling of alerts/notifications that were improperly silenced
► Ownership and accountability of issues/alerts
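Automatic re-enabling of improperly silenced notifications could look roughly like this. The slides do not define "improperly silenced", so the rule here (no owner, no reason, or silenced longer than a maximum age) is an assumption, as are the `Silence` record and function names. The one real Nagios piece is the `ENABLE_SVC_NOTIFICATIONS` external command, which is written to the Nagios command file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Silence:
    """Hypothetical record of a service whose notifications were disabled."""
    host: str
    service: str
    owner: Optional[str]
    reason: Optional[str]
    silenced_at: float  # Unix timestamp


def improperly_silenced(silences, now, max_age_seconds=24 * 3600):
    """Return silences lacking an owner or reason, or older than max_age_seconds."""
    return [
        s for s in silences
        if not s.owner or not s.reason or now - s.silenced_at > max_age_seconds
    ]


def enable_command(silence, now):
    """Format the Nagios external command that re-enables notifications."""
    return f"[{int(now)}] ENABLE_SVC_NOTIFICATIONS;{silence.host};{silence.service}"
```

A periodic job could write each formatted command to the Nagios command file, so a forgotten silence cannot hide a real issue indefinitely.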

Daily Triage of Nagios Alerts and Interrupts

Nagios & Test Driven Development (TDD)
► Write tests for your Nagios infrastructure
► Adopted heavily by Ops (important to keep pace with engineering; DevOps culture is awesome)
► High degree of confidence in pushing changes
► Things will eventually change (OS, libraries, logic, people, Nagios version, etc.). Tests will make the change much smoother.
► Functional testing can still be a challenge
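As an illustration of testing Nagios infrastructure, here is a minimal unit test for the threshold logic of a check plugin. The `check_threshold` function is a hypothetical example, not one of IMVU's checks; the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin return codes.

```python
import unittest

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threshold(value, warn, crit):
    """Map a measured value to a Nagios status; higher values are worse."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


class CheckThresholdTest(unittest.TestCase):
    def test_below_warning_is_ok(self):
        self.assertEqual(check_threshold(10, warn=80, crit=90), OK)

    def test_warning_boundary_is_inclusive(self):
        self.assertEqual(check_threshold(80, warn=80, crit=90), WARNING)

    def test_critical_wins_over_warning(self):
        self.assertEqual(check_threshold(95, warn=80, crit=90), CRITICAL)

    def test_missing_value_is_unknown(self):
        self.assertEqual(check_threshold(None, warn=80, crit=90), UNKNOWN)
```

Saved as a module, the tests run with `python -m unittest`. Pinning down boundary behavior like this is what makes later changes to thresholds or libraries safe to push.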

Sample Nagios Test Output

Decouple Nagios
We do it using the “Fact, Worker, Reporter & Aggregator” model
[Diagram labels: Worker, fact, Reporter, Aggregator, Redis, fact, status]

Why Decouple?
► For scalability and efficiency
► Our model performed better than NRPE
► Lets you make changes (like thresholds) in one place instead of on something like 1,000 machines (if using NRPE)
► Lets you do aggregated checks, a very simple but powerful concept that reduces paging levels by a ton
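A rough sketch of how centralized thresholds and an aggregated check fit together. The slides name the roles but do not define them, so this is one plausible reading: workers publish raw facts, a reporter applies thresholds in one place, and an aggregator pages only when too many hosts are failing. A plain dict stands in for Redis, and all function names and parameters are hypothetical.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def report(facts, warn, crit):
    """Reporter: turn raw per-host facts into statuses using central thresholds."""
    statuses = {}
    for host, value in facts.items():
        if value >= crit:
            statuses[host] = CRITICAL
        elif value >= warn:
            statuses[host] = WARNING
        else:
            statuses[host] = OK
    return statuses


def aggregate(statuses, max_critical_fraction):
    """Aggregator: one result for the whole group instead of one page per host."""
    if not statuses:
        return UNKNOWN, "no statuses reported"
    critical = sorted(h for h, s in statuses.items() if s == CRITICAL)
    fraction = len(critical) / len(statuses)
    message = f"{len(critical)}/{len(statuses)} hosts critical"
    if fraction > max_critical_fraction:
        return CRITICAL, message
    return OK, message
```

Usage: with facts `{"web1": 0.5, "web2": 9.0, "web3": 0.7, "web4": 0.6}` (as workers might have written to the store), `aggregate(report(facts, warn=2.0, crit=5.0), 0.5)` stays OK because only one of four hosts is critical. Changing `warn` or `crit` is a single edit here rather than a change on every monitored machine.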

Closing Remarks
► Monitoring and Alerting (M&A) is mission critical for any business; invest properly and smartly in it
► Don’t limit the usage of Nagios to just Ops. The secret to widespread adoption is to make things frictionless
► Bathroom breaks can take 5-10 minutes, so don’t fret too much about Nagios performance
► Build some form of predictive monitoring and alerting to catch and alert on changes in trends
► Invest in configuration automation, validation and compliance
► Finally, Nagios has been like a Honda: very reliable!!!
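One very simple way to alert on a change in trend is to compare the latest value of a metric with its recent history. This is a sketch of that idea only, not a method described in the talk: the function name, the window, and the `k` multiplier are assumptions, and a production system would likely also account for daily or weekly seasonality.

```python
import statistics


def trend_changed(history, latest, k=3.0):
    """Flag `latest` if it is more than k standard deviations from the
    mean of `history` (a trailing window of recent values)."""
    if len(history) < 2:
        return False  # not enough data to judge
    mean = statistics.mean(history)
    sd = statistics.pstdev(history)
    if sd == 0:
        return latest != mean
    return abs(latest - mean) > k * sd
```

Usage: feed it a business metric such as logins per minute; a value far outside the recent range triggers an alert even though no fixed threshold was ever configured.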

Questions?

Thank You!!!
kjalleda@imvu.com
We are hiring: imvu.com/jobs
Engineering Blog: http://engineering.imvu.com/