Performance Management with Free and Bundled Tools
Adrian Cockcroft, Netflix Inc., acockcroft@netflix.com
(Co-authored with Mario Jauvin, MFJ Associates, mario@mfjassociates.net)
11/5/2020
The Performance People
Agenda
• Overview of Capacity Planning Requirements and Data Sources
• Performance Data Collection
• Free Network Monitoring Tools
• Free System Monitoring Tools
• Free Load Generation and Modelling Tools
• Licences and References
05 November 2020 Adrian Cockcroft and Mario Jauvin
What are we talking about?
• QA: load generation with Grinder or SLAMD, modelling with PDQ and R
• Network monitoring with WireShark, MRTG, BigSister, Cacti, Nagios, OpenNMS, Zenoss, Openxtra, ntop
• Application tier monitoring with Orca, Cacti, BigSister, Ganglia, XEtoolkit
• Database tier monitoring with SEtoolkit, Orca, XEtoolkit
Capacity Planning Requirements and Data Sources
Definitions
• Capacity – Resource utilization and headroom
• Planning – Predicting future needs by analyzing historical data and modeling future scenarios
• Performance Monitoring – Collecting and reporting on performance data
• Free Tools – Bundled with the OS or available for no $$$
Capacity Planning Requirements
• We care about CPU, memory, network and disk resources, and application response times
• We need to know how much of each resource we are using now, and will use in the future
• We need to know how much headroom we have to handle higher loads
• We want to understand how headroom varies, and how it relates to application response times and throughput
CPU Capacity Measurements
• CPU capacity is defined by CPU type and clock rate, or a benchmark rating like SPECrateInt2000
• CPU utilization is defined as busy time divided by elapsed time for each CPU
• Load average measures the average number of jobs running and ready to run
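
As a rough illustration of the utilization definition above, the following Python sketch computes per-CPU utilization as busy time divided by elapsed time between two samples of cumulative counters. The sample text loosely follows the per-CPU lines of Linux /proc/stat (user, nice, system and idle ticks); the function names and the sample values are invented for this example and are not taken from any of the tools discussed here.

```python
def parse_cpu_lines(text):
    """Return {cpu_name: (busy_ticks, total_ticks)} from /proc/stat style text."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-CPU lines look like "cpu0 user nice system idle ..."; skip the
        # aggregate "cpu" line and anything that is not a CPU line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        user, nice, system, idle = (int(v) for v in fields[1:5])
        busy = user + nice + system
        samples[fields[0]] = (busy, busy + idle)
    return samples


def utilization(before, after):
    """Busy time divided by elapsed time for each CPU, as a fraction 0..1."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before[cpu]
        elapsed = total1 - total0
        result[cpu] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


# Two hypothetical samples taken a few seconds apart (values in clock ticks).
SAMPLE_1 = "cpu 400 0 100 1500\ncpu0 200 0 50 750\ncpu1 100 0 50 850\n"
SAMPLE_2 = "cpu 700 0 200 2100\ncpu0 500 0 100 900\ncpu1 200 0 100 1200\n"

if __name__ == "__main__":
    util = utilization(parse_cpu_lines(SAMPLE_1), parse_cpu_lines(SAMPLE_2))
    for cpu, value in sorted(util.items()):
        print(f"{cpu}: {value:.0%} busy")
```

Working from differences of cumulative counters, rather than from the averaged output of a command such as vmstat, keeps the measurement interval under the collector's control.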
Memory Capacity Measurements
• Physical Memory Capacity Utilization and Limits
  – Kernel memory
  – Shared memory segments
  – Executable code, stack and heap
  – File system cache usage
  – Unused free memory
• Virtual Memory Capacity – Swap space
• Memory Throughput – Page in and page out rates
Network Capacity Measurements
• Network Interface Throughput
  – Byte and packet rates input and output
• TCP Protocol Specific Throughput
  – TCP connection count and connection rates
  – TCP byte rates input and output
• NFS/SMB Protocol Specific Throughput
  – Byte rates read and write
  – NFS/SMB service response times
• HTTP Protocol Specific Throughput
  – HTTP operation rates
  – Get and put payload byte rates and size distribution
Disk Capacity Measurements
• Detailed metrics vary by platform
• Easy for the simple disk cases
• Hard for cached RAID subsystems
• Almost impossible for shared disk subsystems and SANs
  – Another system or volume can be sharing a backend spindle; when it gets busy your own volume can saturate, even though you did not change your own workload
Capacity Planning Challenges
• Constantly changing infrastructure
• Limited attention span from staff
• Horizontally scaled commodity systems
• Per node software licencing costs too much
• Too many tools, too many agents per node
• Too much data, not enough analysis
• Non-linear and non-intuitive scalability
• Lack of tools and metrics for virtualized resources
Observability
• Four different viewpoints
  – Management
  – Engineering
  – QA Testing
  – Operations
• Each needs very different information
• Ideal would be different views of the same performance database
• Reality is a mess of disjoint tools
Management Viewpoint
• Daily summary of status and problems
• Business oriented metrics
• Future scenario planning
• Marketing and management input
• Concise report with dashboard style status indicators
• Free tools: R, spreadsheet and web based displays; no good summarization tools
Engineering Viewpoint
• Large volumes of detailed data at several different time scales
• Input to tuning, reconfiguring and future product development
• Low level problem diagnosis
• Detailed reports with drill down and correlation analysis
• Free tools: XE/SE Toolkit, Orca, Ganglia, Cacti, R
QA Test Viewpoint
• Workload specification tools
• Load generation frameworks
• Testing for functionality and performance
• Regression tools to compare releases
• Modelling the difference between test configuration and production configuration
• Free tools: The Grinder, SLAMD, R, PDQ
Operations Viewpoint
• Immediate timeframe
• Real time display, updated in seconds
• Alert based monitoring
• High level problem diagnosis
• Simple high level graphs and views
• Free tools: BigSister, Nagios, OpenNMS, MRTG, Cacti, Ganglia, WireShark, ntop
Measurement Data Interfaces
• Several generic raw access methods
  – Read the kernel directly (not a good idea)
  – Structured system data (Solaris kstat, Linux /proc)
  – Process data
  – Network data
  – Accounting data
  – Application data
• Command based data interfaces
  – Scrape data from vmstat, iostat, netstat, sar, ps
  – Higher overhead, lower resolution, missing metrics
• Data available is platform specific either way
• Much more detail on this topic in the Solaris/Linux Performance Measurement and Tuning Class
Free Network Monitoring Tools
SNMP
• Simple Network Management Protocol
• UDP protocol based on port 161
• Client/server like
  – Client is called a management application entity
  – Server is called an agent entity
• Agent entity is designed to be implemented on network hardware: routers, switches, etc.
SNMP – MIBs
• Management Information Base
• Defines the structure and the semantics of the information that can be reported on
• Most commonly used is MIB-II, which defines a set of standard networking attributes
  – Interface tables
  – System level information
  – Routing tables
• Specified using ASN.1 (Abstract Syntax Notation One)
SNMP – commands
• Called PDUs (protocol data units)
  – GETNEXT
  – GETBULK
  – SET
• Encoded using BER (basic encoding rules)
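
To give a feel for what "encoded using BER" means on the wire, here is a small Python sketch that BER-encodes one of the pieces carried inside an SNMP PDU: an object identifier. It is only an illustration of the tag/length/value idea, not a full SNMP encoder; the function names are made up, and only short-form lengths (values under 128 bytes) are handled. The OID used in the test, 1.3.6.1.2.1.1.1.0, is sysDescr.0 from MIB-II.

```python
def encode_base128(value):
    """Encode a non-negative integer in base 128.

    Every byte except the last has its high bit set, which is how BER marks
    that more bytes of the same sub-identifier follow.
    """
    chunks = [value & 0x7F]
    value >>= 7
    while value:
        chunks.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(chunks))


def encode_oid(dotted):
    """BER-encode an OBJECT IDENTIFIER given in dotted form."""
    arcs = [int(part) for part in dotted.split(".")]
    if len(arcs) < 2:
        raise ValueError("an OID needs at least two arcs")
    # The first two arcs share one sub-identifier: 40 * first + second.
    body = encode_base128(40 * arcs[0] + arcs[1])
    for arc in arcs[2:]:
        body += encode_base128(arc)
    if len(body) > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    # Tag 0x06 = OBJECT IDENTIFIER, then a short-form length, then the value.
    return bytes([0x06, len(body)]) + body


if __name__ == "__main__":
    print(encode_oid("1.3.6.1.2.1.1.1.0").hex())
```

A real request wraps values like this in further tag/length/value layers (the variable bindings, the PDU and the message with its community string), which is the job of libraries such as Net-snmp.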
Versions
• Version 1, original version done in May 1991
• Version 2, around 1993. Failed because the IETF credo of “rough consensus and running code” could not be met on securing SNMP
• Turned into V2c with community string security (like V1)
• Version 3, added security and complexity in 1998
SNMP tools
• Too numerous to name all but…
  – OpenNMS
  – Nagios
  – Cacti
  – MRTG
  – Net-snmp
• See www.snmplink.org
SNMP tools
• snmpwalk – will report all data in a specified MIB
• getIf – will report data about interfaces and includes a built-in MIB browser
• snmptable – will report tabular data from MIB tables
OpenNMS
• Well… it’s not that portable
  – 95% Java is not 100% Java
  – Requires about 20-30 different platform specific packages (PostgreSQL, Perl, RRDtool, Tomcat 4 etc.)
  – Difficult to install
• Easy auto discovery
• Web-based interface
[Screen shot: OpenNMS main screen]
[Screen shot: OpenNMS node screen]
Nagios
• Easy to build/compile (on Solaris 10)
• Easy to install
• Quick response from CGI
• Configuration is manual and a pain
  – 13 configuration files with all kinds of interrelated entries
  – Tedious and error prone
• Requires plugins to do anything
[Screen shot: Nagios main screen]
[Screen shot: Nagios host detail]
ntop
• Similar to the familiar UNIX top tool for processes, but used for network traffic
• Provides a huge selection of real-time data
• Can be found at http://www.openxtra.co.uk/
[Screen shot: ntop – Active Sessions]
[Screen shot: ntop – Hosts]
[Screen shot: ntop – Network Load]
[Screen shot: ntop – Network Throughput]
[Screen shot: ntop – Port Distribution]
[Screen shot: ntop – Protocol Distribution]
[Screen shot: ntop – Protocols]
Zenoss
• Open source monitoring and management of IT infrastructure
• Zenoss Core is free
• Other editions are for a fee
• Get it from http://www.zenoss.com/download/
[Diagram: Zenoss architecture]
[Screen shot: Zenoss dashboard configuration]
[Screen shot: Zenoss Google]
[Screen shot: Zenoss Google alerts]
[Screen shot: Zenoss graphs]
[Screen shot: Zenoss topology]
MRTG
• Really simple to install and configure
• Requires manual config file creation
• Only for MIB-II interface plotting out of the box
• Graphing is not flexible: axis, time etc.
[Screen shot: MRTG interface graphs]
[Screen shot: MRTG other CPU graphs]
RRDtool
• Software to store, retrieve and graph numerical time series data
• Uses a round robin algorithm
• Data files are a fixed size
  – Don’t grow
  – Don’t require maintenance
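
The fixed-size, round robin idea can be sketched in a few lines of Python. This is a toy model, assuming an archive that averages a fixed number of samples into each row and keeps only the newest rows; the class name and its parameters are invented for the illustration, and it says nothing about RRDtool's actual file format or interpolation rules.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size store: keeps the newest `rows` averages of `steps` samples each."""

    def __init__(self, steps, rows):
        self.steps = steps
        self.rows = deque(maxlen=rows)  # the oldest row is dropped when full
        self._pending = []

    def update(self, value):
        """Add one sample; emit a consolidated row once `steps` samples arrive."""
        self._pending.append(value)
        if len(self._pending) == self.steps:
            self.rows.append(sum(self._pending) / self.steps)
            self._pending = []


if __name__ == "__main__":
    archive = RoundRobinArchive(steps=2, rows=3)
    for sample in range(1, 11):
        archive.update(sample)
    # Only the three newest two-sample averages survive.
    print(list(archive.rows))
```

Because old rows are overwritten rather than appended, storage is bounded up front, which is why such data files never grow and need no pruning job.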
RRDtool
• Compiles on most platforms
• Used by many SNMP based tools
  – OpenNMS
  – Cacti
  – BigSister
  – WeatherMap4RRD
  – MailGraph
RRDtool
• 14all: CGI script that plots data similar to MRTG
• Configurable to collect data at different intervals (unlike MRTG)
• Flexible and variable in what data can be collected
[Screen shot: RRDtool sample graphs]
[Screen shot: RRDtool graphs]
RRDtool
• Create an RRD database
    rrdtool create test.rrd --start 920804400 \
        DS:speed:COUNTER:600:U:U \
        RRA:AVERAGE:0.5:1:24 \
        RRA:AVERAGE:0.5:6:10
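
The COUNTER data source type in the command above means that the stored value is a rate derived from an ever-increasing counter, not the raw reading. The Python sketch below shows that conversion in simplified form, assuming a 32-bit counter that wraps to zero; the function name and sample readings are invented, and RRDtool's own handling of wraps, heartbeats and unknown values is more involved than this.

```python
def counter_rates(samples, wrap=2**32):
    """Turn (timestamp, counter) readings into (timestamp, per-second rate) pairs.

    A negative difference is treated as a single wrap of the counter, which is
    an assumption of this sketch.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = c1 - c0
        if delta < 0:
            delta += wrap  # counter rolled over between the two readings
        rates.append((t1, delta / (t1 - t0)))
    return rates


if __name__ == "__main__":
    readings = [(920804700, 12345), (920805000, 12357), (920805300, 12363)]
    for timestamp, rate in counter_rates(readings):
        print(timestamp, round(rate, 4))
```

Storing rates rather than raw counts is what lets the archives average samples over longer intervals without caring when the counter last reset.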
RRDtool
• Create a graph
    rrdtool graph speed.png --start 920804400 --end 920808000 \
        DEF:myspeed=test.rrd:speed:AVERAGE \
        LINE2:myspeed#FF0000
Free Performance Data Collection and Rules Toolkits
SE toolkit Example Tools
• A free performance toolkit for rapidly creating custom data sources
• Makes all the very extensive Solaris metrics easily available
• Very system specific and not enough metrics exist to port to Linux
• Written by Rich Pettit with contributions from Adrian Cockcroft
• Get SE 3.4 from http://sourceforge.net/projects/setoolkit/
• Open source with support for SPARC & x86 Solaris 8, 9, 10

Function           Example SE Programs
Rule Monitors      cpg.se, monlog.se, mon_cm.se, live_test.se, percollator.se, zoom.se, virtual_adrian_lite.se
Disk Monitors      siostat.se, xiostat.se, iomonitor.se, iost.se, xit.se, disks.se
CPU Monitors       cpu_meter.se, vmmonitor.se, mpvmstat.se
Process Monitors   msacct.se, pea.se, ps-ax.se, ps-p.se, pwatch.se, pw.se
Network Monitors   net.se, tcp_monitor.se, netstatx.se, nfsmonitor.se, nx.se
Clones             iostat.se, uname.se, vmstat.se, nfsstat-m.se, perfmeter.se, xload.se
Data browsers      aw.se, infotool.se, multi_meter.se
Contributed Code   anasa, dfstats, kview, systune, watch, orcallator.se
Test Programs      syslog.se, cpus.se, pure_test.se, net_example, collisions.se, nproc.se, uptime.se, kvmname.se, dumpkstats.se
SE language features
• SE is a 64 bit interpreted dialect of C
  – Not a new language to learn from scratch!
  – Standard C /usr/ccs/bin/cpp used at runtime to preprocess SE scripts
  – Main omissions: pointer types and goto
  – Main additions: classes and “string” type
  – Powerful ways to handle dynamically allocated data
  – Built-in fast balanced tree routines for storing key indexed data
• Dynamic linking to all existing C libraries
  – Built-in classes access kernel data
  – Supplied class code hides details, provides the data you want
• Example scripts improve on basic utilities, e.g. siostat.se, nx.se, pea.se
• Example rule based monitors, e.g. virtual_adrian.se, orcallator.se
Creating Rules
• Based on real experiences of all the things that go wrong
• Capture an approximation to intuition
• Test and calibrate rules on as many systems as possible
• Easy??
Configuring Rules
• Thresholds should be configured
• Very application dependent
• Capture the operating envelope
  – Measure the underlying values
  – Measure peaks in normal operation
  – Note values during problems
  – Set thresholds to capture the difference
• This applies to any tool – SE Toolkit, Cacti, Ganglia, Nagios, OpenNMS etc.
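
One way to picture "set thresholds to capture the difference" is the Python sketch below. It assumes a metric where higher values are worse, takes the peak seen in normal operation and the lowest value noted during problems, and places the threshold midway between them. The midpoint choice, the function and class names, and the green/red states are all assumptions made for this illustration; they are not how any of the tools named above implement their rules.

```python
def pick_threshold(normal_samples, problem_samples):
    """Return a threshold between the normal peak and the lowest problem value.

    Returns None when the two ranges overlap, meaning this metric alone cannot
    separate normal operation from trouble.
    """
    normal_peak = max(normal_samples)
    problem_floor = min(problem_samples)
    if problem_floor <= normal_peak:
        return None
    return (normal_peak + problem_floor) / 2


class ThresholdRule:
    """A tiny rule object: feed it a measurement, read back its state."""

    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold
        self.state = "green"

    def update(self, value):
        self.state = "red" if value > self.threshold else "green"
        return self.state


if __name__ == "__main__":
    # Hypothetical disk service times in milliseconds.
    limit = pick_threshold([12, 18, 25], [60, 75])
    rule = ThresholdRule("disk service time", limit)
    print(limit, rule.update(20), rule.update(70))
```

If the normal and problem ranges overlap, the sketch refuses to pick a value, which is a hint that a different underlying metric should be measured.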
Rules as Objects
• Define only the input and output information
• Hide implementation details
• Make high level rule objects trivial to use and reuse
• SE Toolkit does it in three lines of code:
  – #include <rules file>
  – Declare rule object as a typed variable
  – Read and use or print object status
"virtual adrian" rules summary
• Disk Rule for all disks at once
  – Looks for slow disks and unbalanced usage
• Network Rule for all networks at once
  – Looks for slow nets and unbalanced usage
• Swap Rule – Looks for lack of available swap space
• RAM Rule – Looks for short page residence times
• CPU Power Rule
  – Scales on MP systems
  – Looks for long run queue delays
• Mutex Rule – Looks for kernel lock contention and high sys CPU time
• TCP Rule
  – Looks for listen queue problems
  – Reports on connection attempt failures
XE Toolkit - www.xetoolkit.com
• Complete re-write of SE Toolkit by Rich Pettit
  – Extensible Java collector, customize with jar files
  – Release 1.2 available April 2008
  – Multi-platform support: Solaris, Linux/x86, Windows, BSD, OSX, HP-UX, AIX, Linux/s390, Linux/Power
• Licencing
  – Free GPL version for standard use and shared derivations
  – Open source, hosted at http://sourceforge.net/projects/xe-toolkit/
  – Commercial support available if needed
  – Commercial product license for custom in-house derivations
• Addresses all the issues people had with SE toolkit!
[Diagram: Captive Metrics / XE Toolkit architecture]
Free System Monitoring Tools
Collated Performance Data - Orca
• Problems with time sync when collecting data from multiple tools
  – No timestamp at all for vmstat, netstat, df...
  – No timestamp by default for iostat and ps...
  – No way to collect realtime stats from an http logfile
• Use SE Toolkit to generate one timestamped row containing all the data
  – First version of percollator.se written by Adrian Cockcroft in 1996
  – Extended orcallator.se written by Blair Zajac a few years later
  – Graphs generated by orca batch job feeding rrdtool based web pages
  – Active community developing tool at http://www.orcaware.com
  – Extended to collect much more data, including process workloads
  – Basic data collection ported to Linux, HP-UX and Windows
• Orca is basically MRTG for system metrics rather than network
• See http://www.orcaware.com/orca/docs/Orca_Understanding_Performance_Data.ppt
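
The "one timestamped row containing all the data" approach can be sketched in Python as follows. Each metric is read by a small collector function and every value in the row shares a single timestamp, so there is nothing to re-align later. The collector names, the column names and the space-separated output format are assumptions for this example; they are not the actual percollator.se or orcallator.se formats.

```python
import time


def collate(collectors, now=None):
    """Call every collector once and stamp all results with the same time."""
    timestamp = int(time.time()) if now is None else int(now)
    row = {"timestamp": timestamp}
    for name, collect in collectors.items():
        row[name] = collect()
    return row


def format_row(row, columns):
    """Render one row as a space-separated line, in a fixed column order."""
    return " ".join(str(row[column]) for column in columns)


if __name__ == "__main__":
    # Hypothetical collectors; real ones would read kstat, /proc or a log file.
    collectors = {"usr%": lambda: 12.5, "httpops": lambda: 340}
    columns = ["timestamp", "usr%", "httpops"]
    print(" ".join(columns))
    print(format_row(collate(collectors), columns))
```

Adding an application metric then only means registering one more collector and one more column.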
Orca data collections
• Collected using “procallator” reading info from /proc on Linux
  – Uptime
  – Average # Processes in Run Queue (Load Average)
  – CPU Usage
  – New Process Spawn Rate
  – Number of System & Running Processes
  – Context Switches & Interrupts Rate
  – Interface Input Bits Per Second
  – Interface Output Bits Per Second
  – Interface Input Packets Per Second
  – Interface Output Packets Per Second
  – Interface Input Errors Per Second
  – Interface Output Errors Per Second
  – Interface Input Dropped Per Second
  – Interface Output Collisions
  – Interface Output Carrier Losses
  – TCP Current Connections
  – IP Statistics
  – TCP Statistics
  – ICMP Statistics
  – UDP Statistics
  – Disk System Wide Reads/Writes Per Second
  – Disk System Wide Transfer Rate
  – Disk Reads/Writes Per Second
  – Disk Transfer Rate
  – Disk Space Percent Usage
  – Physical Memory Usage
  – Swap Usage
  – Page Ins & Outs Rate
  – Swap Ins & Outs Rate
• Orca on Solaris collects many more metrics than shown above
• Strength of Orca is lots of detailed metrics with low overhead for collection
• Easily customized to add more system metrics or application metrics
• Orca can already track HTTP traffic and parse log files
• All metrics are stored in “round robin database” format, using RRDtool to generate displays over different time spans
• Web page is a simple collection of plots with drill down by metric or by time
• Suitable for monitoring a relatively small number of systems in great detail, e.g. backend database servers
Cacti – www.cacti.net
• Web based user interface based on RRDtool
• More sophisticated GUI than Orca or MRTG
• Less sophisticated system metric collection, but more coverage of networking
• Better management of groups of systems and devices than Orca, useful for tens to hundreds of nodes
• Access control and personalization for users
Ganglia – www.ganglia.info
• Web based RRDtool GUI somewhat similar to Cacti
• Better management of clusters of systems and devices than Cacti, useful for hundreds to thousands of nodes in a hierarchy of clusters
• Provides many summary statistic plots at cluster level and collects detailed configuration data
• XML based data representation
• Uses low overhead network protocol
• In common use at hundreds of large HPC Grid sites, less visibly in use at some large commercial sites
BigBrother and BigSister
• Network and system dashboard alert monitor
• Widely used at internet sites
• BigBrother is at http://www.bb4.com
• BigSister is at http://bigsister.graeff.com
• BigSister seems to have more features, alert logging, better portability and more efficient data collection. Compatible update to BB4.
Free QA Test and Modelling Tools
QA Test Requirements
• Generate test workload
  – SLAMD, Grinder
• Collect performance metrics
  – Any of the tools already mentioned
• Report regression against baseline
• Predict capacity needed for production system
  – Use spreadsheets for simple linear prediction
  – Use modelling tools such as PDQ for queuing models
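
The "simple linear prediction" that a spreadsheet would do can also be written as a short Python sketch: fit a straight line to measured (load, utilization) pairs from a QA run, then ask what load would reach a chosen utilization limit. The data points, the 70% limit and the function names are invented for the example, and a straight line is only a reasonable model while the resource is well short of saturation.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def load_at_utilization(slope, intercept, limit):
    """Load at which the fitted line reaches the given utilization limit."""
    return (limit - intercept) / slope


if __name__ == "__main__":
    # Hypothetical QA measurements: requests per second vs CPU utilization.
    loads = [100, 200, 300, 400]
    utils = [0.15, 0.25, 0.35, 0.45]
    slope, intercept = fit_line(loads, utils)
    ceiling = load_at_utilization(slope, intercept, 0.70)
    print(f"about {ceiling:.0f} requests/s at 70% CPU")
    print(f"headroom above 400 requests/s: {ceiling - 400:.0f}")
```

The gap between the current peak load and the predicted ceiling is one simple way to express headroom as a number.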
Grinder 3 - Powerful New Features
• 100% Pure Java - works on any hardware platform and any operating system that supports J2SE 1.3 and above
• Java and Jython based load testing framework
  – Web Browsers: simulate web browsers using HTTP and HTTPS
  – Web Services: test interfaces using SOAP and XML-RPC
  – Database: test databases using JDBC
  – Middleware: RPC and MOM based systems using IIOP, RMI/JRMP, and JMS
  – Other Internet protocols: POP3, SMTP, FTP, and LDAP
• See http://grinder.sourceforge.net/g3/features.html
• J2EE Performance Testing with BEA WebLogic Server by Peter Zadrozny, Philip Aston and Ted Osborne, originally published by Expert Press and now by APress, uses Grinder 2 throughout
SLAMD
• Load generation framework, written in Java
• Originally built by Sun to test LDAP servers
• Extended to be very generic and published as open source. Actively being developed.
• Sophisticated functions and user interface
• See http://www.slamd.com
• Latest release 2.0 has better usability focus
PDQ Modelling Tool
• Dr Neil Gunther’s toolkit at http://www.perfdynamics.com
• Library used from C or Perl provides MVA queueing models
• Use to calibrate in QA and predict in production
• PDQ modelling tool details:
  – The Practical Performance Analyst, Dr. Neil Gunther, McGraw-Hill, 1998, ISBN 0-07-912946-3
  – Analyzing Computer System Performance with Perl::PDQ, 2004, ISBN 3-54-020865-8
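
For readers unfamiliar with MVA (mean value analysis), the textbook exact MVA recursion for a closed queueing network is short enough to sketch in Python. This is not PDQ's API; the function name, arguments and example service demands are made up for the illustration, and it assumes a single workload class with simple queueing centres plus an optional think time.

```python
def mva(demands, users, think_time=0.0):
    """Exact MVA for a closed network with one workload class.

    demands    -- service demand in seconds at each queueing centre
    users      -- number of circulating users (1 or more)
    think_time -- seconds each user waits between requests

    Returns (throughput, response_time, queue_lengths) at `users` users.
    """
    queues = [0.0] * len(demands)
    throughput = 0.0
    response = 0.0
    for n in range(1, users + 1):
        # An arriving request sees the queue left by a system with n - 1 users.
        residence = [d * (1.0 + q) for d, q in zip(demands, queues)]
        response = sum(residence)
        throughput = n / (think_time + response)
        # Little's law applied per centre gives the new queue lengths.
        queues = [throughput * r for r in residence]
    return throughput, response, queues


if __name__ == "__main__":
    # Hypothetical demands calibrated in QA: CPU 20 ms, disk 30 ms per request.
    for users in (1, 10, 50):
        x, r, _ = mva([0.020, 0.030], users, think_time=1.0)
        print(f"{users:3d} users: {x:6.2f} req/s, {r * 1000:7.1f} ms response")
```

Calibrating the service demands from a QA test and then raising the user count is the "calibrate in QA, predict in production" workflow in miniature.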
References and Conclusion
Licences for Free Tools
• Open Source Initiative
  – “OSI Approved licences”
  – http://opensource.org/licenses/category
• Comparisons of Common Licences
  – http://zooko.com/license_quick_ref.html
Web Pages and Books
• Adrian’s Performance and other topics blog
  – http://perfcap.blogspot.com
• MFJ Associates performance tools link page
  – http://www.mfjassociates.net/perf_links.html
• More free tools compiled by John Sellens
  – http://www.generalconcepts.com/resources/monitoring/
• More tools compiled by Openxtra
  – http://www.openxtra.co.uk/resource-center/open_source_network_monitor_tools.php
• SE toolkit info: Sun Performance and Tuning - Java and the Internet, Adrian Cockcroft and Richard Pettit, Sun Press/Prentice Hall, 2nd Edition, 1998, ISBN 0-13-095249-4
• Solaris 8 and Linux: System Performance Tuning, 2nd Edition, Gian-Paolo Musumeci, O’Reilly, 2002, ISBN 0-596-00284-X
• Solaris Internals, http://www.solarisinternals.com, Richard McDougall and James Mauro: new 2nd edition, and new performance book by Richard McDougall and Brendan Gregg
Concluding Remarks
• Many large installations depend on free tools
• A full suite of functionality is available
• Several tools are needed to cover the bases
• Tradeoff between function and ease of use
• Support may be available, but typically Google is the best support tool
• Functionality is increasing…
Questions?
acockcroft@netflix.com
mario@mfjassociates.net