Comprehensive Container Based Service Monitoring with Kubernetes and Istio
OSCON 2018
Fred Moyer @phredmoyer
https://www.slideshare.net/redhotpenguin/comprehensive-container-based-service-monitoring-with-kubernetes-and-istio-106288073
Monitoring Nerd @phredmoyer
Developer Evangelist @Circonus / @IRONdb
@IstioMesh Geek
Observability and Statistics Dork
Talk Agenda
❏ Istio Overview
❏ Exercises
❏ Istio Mixer Metrics Adapters
❏ Exercises
❏ Math and Statistics
❏ Exercises
❏ SLOs / RED Dashboards
Istio.io
“An open platform to connect, manage, and secure microservices”
Happy Birthday!
K8s + Istio
Data Plane
● Orchestration
● Deployment
● Scaling
Control Plane
● Policy Enforcement
● Traffic Management
● Telemetry
Istio Architecture
Istio GCP Deployment
Istio Sample App
    $ istioctl create -f apps/bookinfo.yaml
Istio Sample App
    kind: Deployment
    metadata:
      name: ratings-v1
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: ratings
            version: v1
        spec:
          containers:
          - name: ratings
            image: istio/examples-bookinfo-ratings-v1
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 9080
Istio Sample App
    $ istioctl create -f apps/bookinfo/route-rule-reviews-v2-v3.yaml

    type: route-rule
    spec:
      name: reviews-default
      destination: reviews.default.svc.cluster.local
      precedence: 1
      route:
      - tags:
          version: v2
        weight: 80
      - tags:
          version: v3
        weight: 20
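What the 80/20 weights mean for traffic can be illustrated with a small sketch. This is not how Istio or Envoy implements routing; it only shows one way to map a number in [0, 100) onto a version by walking cumulative weights. The Route type and the pick function are hypothetical names introduced for this example.

```go
package main

import "fmt"

// Route is a hypothetical stand-in for one weighted entry of the route rule.
type Route struct {
	Version string
	Weight  int
}

// pick maps n in [0, sum of weights) onto a route by walking cumulative weights.
// It returns "" when n falls outside the total weight.
func pick(routes []Route, n int) string {
	if n < 0 {
		return ""
	}
	for _, r := range routes {
		if n < r.Weight {
			return r.Version
		}
		n -= r.Weight
	}
	return ""
}

func main() {
	// Weights taken from the route rule above: 80% to v2, 20% to v3.
	routes := []Route{{"v2", 80}, {"v3", 20}}

	counts := map[string]int{}
	for n := 0; n < 100; n++ {
		counts[pick(routes, n)]++
	}
	fmt.Println(counts["v2"], counts["v3"]) // prints "80 20"
}
```

In a real proxy n would come from a random source per request, so the split is approximate over short intervals.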
Istio K8s Services
    > kubectl get services
    NAME         details, kubernetes, productpage, ratings, reviews
    CLUSTER-IP   10.0.0.31, 10.0.0.120, 10.0.0.15, 10.0.0.170
    EXTERNAL-IP  <none>, <none>
    PORT(S)      9080/TCP, 443/TCP, 9080/TCP
    AGE          6m, 7d, 6m, 6m, 6m
Istio K8s App Pods
    > kubectl get pods
    NAME      details-v1-1520924117, productpage-v1-560495357, ratings-v1-734492171,
              reviews-v1-874083890, reviews-v2-1343845940, reviews-v3-1813607990
    READY     2/2, 2/2, 2/2
    STATUS    Running, Running
    RESTARTS  0, 0, 0
    AGE       6m, 6m, 6m
Istio K8s System Pods
    > kubectl get pods -n istio-system
    NAME                      READY  STATUS   RESTARTS  AGE
    istio-ca-797dfb66c5       1/1    Running  0         2m
    istio-ingress-84f75844c4  1/1    Running  0         2m
    istio-egress-29a16321d3   1/1    Running  0         2m
    istio-mixer-9bf85fc68     3/3    Running  0         2m
    istio-pilot-575679c565    2/2    Running  0         2m
    grafana-182346ba12        2/2    Running  0         2m
    prometheus-837521fe34     2/2    Running  0         2m
Istio Grafana Dashboard
Rate
Errors
Duration
Setup Istio https://github.com/redhotpenguin/oscon_2018
Bookinfo Sample Application https://github.com/redhotpenguin/oscon_2018
Istio Metrics Adapter
● Golang based adapter API
● Legacy - In process (built into the Mixer executable)
● Current - Out of process for new adapter dev
● Set of handler hooks and YAML files
Istio Mixer Metric Adapter
Out of Process gRPC Adapters
Containers?
● Ephemeral
● High Cardinality
● Difficult to Instrument
● Instrument Services, Not Containers
Istio Mixer Provided Telemetry
● Request Count by Response Code
● Request Duration
● Request Size
● Response Size
● Connection Received Bytes
● Connection Sent Bytes
● Connection Duration
● Template Based MetaData (Metric Tags)
Metric Dimensions
    HandleMetric invoked with:
      Adapter config: &Params{FilePath:out.txt,}
      Instances: 'i1metric.instance.istio-system':
      {
        Value = 1652
        Dimensions = map[response_code:200]
      }
Istio Mixer Metrics Adapter
“SHOW ME THE CODE”
https://github.com/istio/istio/blob/master/mixer/adapter/circonus
Istio Mixer Metrics Adapter
    // HandleMetric submits metrics to Circonus via circonus-gometrics
    func (h *handler) HandleMetric(ctx context.Context, insts []*metric.Instance) error {
        for _, inst := range insts {
            metricName := inst.Name
            metricType := h.metrics[metricName]
            switch metricType {
            case config.GAUGE:
                value, _ := inst.Value.(int64)
                h.cm.Gauge(metricName, value)
            case config.COUNTER:
                h.cm.Increment(metricName)
            case config.DISTRIBUTION:
                value, _ := inst.Value.(time.Duration)
                h.cm.Timing(metricName, float64(value))
            }
        }
        return nil
    }
Istio Mixer Metrics Adapter
    type handler struct {
        cm      *cgm.CirconusMetrics                      // circonus-gometrics client
        env     adapter.Env                               // Mixer adapter environment
        metrics map[string]config.Params_MetricInfo_Type  // metric name to metric type
        cancel  context.CancelFunc                        // stops the flush daemon
    }
And some YAML
    metrics:
    - name: requestcount.metric.istio-system
      type: COUNTER
    - name: requestduration.metric.istio-system
      type: DISTRIBUTION
    - name: requestsize.metric.istio-system
      type: GAUGE
    - name: responsesize.metric.istio-system
      type: GAUGE
Buffer metrics, then report
    env.ScheduleDaemon(
        func() {
            ticker := time.NewTicker(b.adpCfg.SubmissionInterval)
            for {
                select {
                case <-ticker.C:
                    cm.Flush()
                case <-adapterContext.Done():
                    ticker.Stop()
                    cm.Flush()
                    return
                }
            }
        })
Create a metrics adapter https://github.com/redhotpenguin/oscon_2018
MATH https://youtu.be/yCX1Ze3OcKo
Histogram Basics
[Figure: histogram with Sample Value on the x-axis and Number of Samples on the y-axis, annotated with the mode, the median q(0.5), the mean, q(0.9), and q(1)]
Histogram https://github.com/circonus-labs/circonusllhist
Log linear histogram https://github.com/circonus-labs/circonusllhist
Bi-modal Histogram
Multi-modal Histogram
Heatmap - Time Series Histograms
Quantile Calculation
1. Given a quantile q(X) where 0 < X < 1
2. Sum up the counts of all the bins, C
3. Multiply X * C to get the target count Q
4. Walk the bins, summing bin counts until the running total exceeds Q
5. Interpolate the quantile value q(X) within that bin
Linear Interpolation
left_value = 1.0, right_value = 1.1, bin_width = 0.1
left_count = 600, right_count = 800, bin_count = 200
X = 0.5, Q = 700
q(X) = left_value + (Q - left_count) / (right_count - left_count) * bin_width
q(X) = 1.0 + (700 - 600) / (800 - 600) * 0.1
q(X) = 1.05
Inverse Quantiles
● What’s the 95th percentile latency?
  ○ q(0.95) = 10 ms
● What percent of requests exceeded 10 ms?
  ○ 5% for this data set; what about others?
Inverse Quantile Calculation
1. Given a sample value X, locate its bin
2. Using the previous linear interpolation equation, solve for Q given X
Inverse Quantile Calculation
X = left_value + (Q - left_count) / (right_count - left_count) * bin_width
X - left_value = (Q - left_count) / (right_count - left_count) * bin_width
(X - left_value) / bin_width = (Q - left_count) / (right_count - left_count)
(X - left_value) / bin_width * (right_count - left_count) = Q - left_count
Q = (X - left_value) / bin_width * (right_count - left_count) + left_count
Linear Interpolation
left_value = 1.0, right_value = 1.1, bin_width = 0.1
left_count = 600, right_count = 800
X = 1.05
Q = (X - left_value) / bin_width * (right_count - left_count) + left_count
Q = (1.05 - 1.0) / 0.1 * (800 - 600) + 600
Q = 700
Inverse Quantile Calculation
1. Given a sample value X, locate its bin
2. Using the previous linear interpolation equation, solve for Q given X
3. Sum the bin counts up to Q as Qleft
4. Inverse quantile qinv(X) = (Qtotal - Qleft) / Qtotal
5. For Qleft = 700, Qtotal = 1,000, qinv(X) = 0.3
6. 30% of sample values exceeded X
Create a metrics adapter https://github.com/redhotpenguin/oscon_2018
Service Level Objectives
● SLI - Service Level Indicator
● SLO - Service Level Objective
● SLA - Service Level Agreement
“SLIs drive SLOs which inform SLAs”
Excerpted from “SLIs, SLOs, SLAs, oh my!” @sethvargo @lizthegrey https://youtu.be/tEylFyxbDLE
SLI - Service Level Indicator, a measure of the service that can be quantified
“95th percentile latency of homepage requests over past 5 minutes < 300 ms”
SLO - Service Level Objective, a target for Service Level Indicators
“95th percentile homepage SLI will succeed 99.9% over trailing year”
SLA - Service Level Agreement, a legal agreement between a customer and a service provider based on SLOs
“Service credits if 95th percentile homepage SLI succeeds less than 99.5% over trailing year”
Log linear histogram
SLI - “90th percentile latency of requests over past 5 minutes < 1,000 ms”
Emerging Standards
● USE
  ○ Utilization, Saturation, Errors
  ○ Introduced by Brendan Gregg @brendangregg
  ○ KPIs for host based health
● The Four Golden Signals
  ○ Latency, Traffic, Errors, Saturation
  ○ Covered in the Google SRE Book
  ○ Extended version of RED
● RED
  ○ Rate, Errors, Duration
  ○ Introduced by Tom Wilkie @tom_wilkie
  ○ KPIs for API based health, SLI focused
RED
● Rate
  ○ Requests per second
  ○ First derivative of request count provided by Istio
● Errors
  ○ Unsuccessful requests per second
  ○ First derivative of failed request count provided by Istio
● Duration
  ○ Request latency provided by Istio
Duration
Problems:
● Percentiles > averages, but have limitations
  ○ Aggregated metric, fixed time window
  ○ Cannot be re-aggregated for cluster health
  ○ Cannot be averaged (common mistake)
● Stored aggregates are outputs, not inputs
● Difficult to measure cluster SLIs
● Leave a lot to be desired for
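A small worked example of why percentiles cannot be averaged. The latencies are made up, and percentile is a hypothetical nearest-rank helper on raw samples, used here only to keep the arithmetic easy to follow.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100) of values.
func percentile(values []float64, p int) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceiling of p/100 * n
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical latencies in ms: node A is busy and fast, node B is idle and slow.
	nodeA := []float64{10, 10, 10, 10, 10, 10, 10, 10}
	nodeB := []float64{400, 500}

	avgOfP90 := (percentile(nodeA, 90) + percentile(nodeB, 90)) / 2
	merged := append(append([]float64(nil), nodeA...), nodeB...)

	fmt.Println("average of per-node p90:", avgOfP90)           // 255
	fmt.Println("p90 of merged samples:", percentile(merged, 90)) // 400
}
```

The two numbers disagree because the per-node percentiles discard how many requests each node served; merging the underlying distributions keeps that information.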
RED
RED - SLI Alerting
Your boss wants to know
● How many users got angry on the Tuesday slowdown after the big marketing promotion?
● Are we over-provisioned or under-provisioned on our purchasing checkout service?
● Other business centric questions
The Slowdown
● Marketing launched a new product
● Users complained the site was slow
● Median human reaction time is 215 ms [1]
● If users get angry (rage clicks) when requests take more than 500 ms, how many users got angry?
[1] - https://www.humanbenchmark.com/tests/reactiontime/
The Slowdown
1. Record all service request latencies as a distribution
2. Plot as a heatmap
3. Calculate the percentage of requests that exceed the 500 ms SLI using inverse percentiles
4. Multiply the result by the total number of requests, integrate over time
The Slowdown
4 million slow requests
Under or Over Provisioned?
● “It depends”
● Time of day, day of week
● Special events
● Behavior under load
● Latency bands shed some light
Latency Bands
Conclusions
● Monitor services, not containers
● Record distributions, not aggregates
● Istio gives you RED metrics for free
● Use math to ask the right questions
Thank you! Questions?
Tweet me @phredmoyer
- Slides: 86