Building a Real-Time Anomaly Detection System with Flink
Scott Kidder, Software Engineer @ Mux
What is Mux?
• Real-time analytics for video
• Customers include PBS, Funny or Die, IGN, Wistia
• Track playback failures, start-up time, rebuffering, and more
• Process millions of …
“All production flows have a basic characteristic: the material becomes more valuable as it moves through the process.” — Andy Grove, High Output Management
What processing adds the most value?
Making Video Delivery a Monitored Service
Susan Fowler’s book “Production-Ready Microservices” lists the components of a well-monitored microservice:
1. Logging
2. Dashboards
3. Alerting
4. On-call rotation
Mux supported the first two in the form of video-event ingestion and a web dashboard, but lacked alerting and an on-call rotation.
Types of Error-Rate Alerts
Property-wide Alerts
• Problems affecting an entire customer property
• Example: CDN publishing for HTTP Live Streaming is broken, resulting in widespread live-streaming failures
Video-Title Alerts
• Problems affecting specific video titles within a customer property
• Example: A poorly encoded or mislabeled video is published to the catalog
Alerting Technical Requirements
Needed a system to detect error-rate anomalies in video views across a customer property and for every video title:
• Very low latency, high availability
• Horizontally scalable on AWS commodity hardware, preferably running in a Docker container
• Easy to use at every stage: prototyping, development, production
• Read from AWS Kinesis streams, but preferably support Kafka too
Application Design
Event Ingestion Architecture
Flink Execution Plan
• Kinesis Source reads video events (parallelism: n)
• Events are hash-partitioned into two counting windows (parallelism: n): a property-wide window and a property/video-title window
• Each window feeds an AnomalyDetection Rolling Fold (parallelism: n)
• Each fold feeds an Error-Type FlatMap (parallelism: 1), which rebalances to the sinks: the Mux Alerting REST API and InfluxDB
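The heart of the plan above is the rolling fold that turns a counting window of video events into an error rate and an anomaly decision. The following is a minimal Python sketch of that logic only; the window size, threshold, and event shape are illustrative assumptions, not Mux's actual values or Flink API code.

```python
# Sketch of the per-key rolling fold an AnomalyDetection operator might
# perform. Window size, threshold, and event shape are hypothetical.
from collections import deque

class RollingErrorRate:
    """Keep a rolling window of recent view outcomes and flag anomalies."""

    def __init__(self, window_size=100, threshold=0.10):
        self.window = deque(maxlen=window_size)  # 1 = errored view, 0 = ok
        self.threshold = threshold               # error rate that triggers an alert

    def fold(self, event):
        """Fold one video-view event into the rolling state; return the rate."""
        self.window.append(1 if event.get("error") else 0)
        return self.error_rate()

    def error_rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def is_anomalous(self):
        # Require a reasonably full window before alerting, so a single
        # early error does not fire an incident.
        return len(self.window) >= 20 and self.error_rate() > self.threshold

detector = RollingErrorRate(window_size=50, threshold=0.10)
for i in range(50):
    detector.fold({"error": i % 4 == 0})  # 25% of views errored
print(detector.is_anomalous())  # → True
```

In the real pipeline this state would be keyed per property (or per property/video-title) by Flink's hash partitioning, so each key maintains its own window.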
Introducing a Control Stream
• Added a simple RabbitMQ stream source that accepts control messages
• Control messages feed into the FlatMap operator
• Control operations include:
  • Dump error rates to S3 for each property/error-type permutation
  • Dump active alert-incident state to S3
Flink Execution Plan with Control Stream
• A RabbitMQ source (parallelism: 1) reads control messages and is hash-partitioned into the Error-Type FlatMap Join operators
• The Kinesis source reads video events (parallelism: n), hash-partitioned into the property-wide and property/video-title counting windows (parallelism: n)
• Each window forwards to an AnomalyDetection Rolling Fold (parallelism: n)
• Each fold feeds an Error-Type FlatMap Join (parallelism: 1), which rebalances to the sinks: the Mux Alerting REST API, InfluxDB, and AWS S3
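The joined FlatMap above has to distinguish control messages from video events: controls trigger state dumps to S3, while events flow onward. A minimal sketch of that dispatch, with hypothetical message shapes and handler names (not Mux's actual code):

```python
# Sketch of a joined FlatMap dispatching between video events and
# RabbitMQ control messages. Message fields and the s3_sink stand-in
# are illustrative assumptions.
import json

def handle_joined_element(element, state, s3_sink):
    """Route one joined stream element: control messages trigger state
    dumps to the S3 sink; everything else passes through as a video event."""
    msg = json.loads(element)
    if msg.get("type") == "control":
        if msg["op"] == "dump_error_rates":
            s3_sink.append(("error_rates", dict(state["error_rates"])))
        elif msg["op"] == "dump_incidents":
            s3_sink.append(("incidents", list(state["incidents"])))
        return []          # control messages emit nothing downstream
    return [msg]           # video events continue through the pipeline

state = {"error_rates": {"prop-1/video_error": 0.12}, "incidents": ["inc-42"]}
dumps = []
handle_joined_element('{"type": "control", "op": "dump_error_rates"}', state, dumps)
out = handle_joined_element('{"property": "prop-1", "error": true}', state, dumps)
print(dumps[0][0], len(out))  # → error_rates 1
```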
Deployment and Operations
Docker
• All services at Mux are deployed in Docker containers
• Created a custom Docker image of Flink built from source
• Use BuildKite to build the Docker image and push it to Docker Hub
• Same image for the Flink Job Manager & Task Manager
• Configure Flink using environment variables
• Used successfully with Alpine & Debian Jessie base images
• Deploy a Flink standalone cluster with Rancher
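Configuring Flink from environment variables typically means rendering `flink-conf.yaml` entries inside the container entrypoint. A minimal sketch of that idea; the specific env-var-to-key mapping is an illustrative assumption, not Mux's actual entrypoint:

```python
# Sketch of rendering flink-conf.yaml entries from a Docker container's
# environment. The ENV_TO_CONF mapping here is a hypothetical example;
# the config keys themselves are standard Flink settings.
import os

ENV_TO_CONF = {
    "JOB_MANAGER_RPC_ADDRESS": "jobmanager.rpc.address",
    "TASK_MANAGER_NUMBER_OF_TASK_SLOTS": "taskmanager.numberOfTaskSlots",
}

def render_flink_conf(environ):
    """Render flink-conf.yaml lines from the container's environment."""
    lines = []
    for env_key, conf_key in ENV_TO_CONF.items():
        if env_key in environ:
            lines.append(f"{conf_key}: {environ[env_key]}")
    return "\n".join(lines)

print(render_flink_conf({"JOB_MANAGER_RPC_ADDRESS": "jobmanager",
                         "TASK_MANAGER_NUMBER_OF_TASK_SLOTS": "4"}))
```

In practice the entrypoint would call `render_flink_conf(os.environ)` and append the result to the image's `flink-conf.yaml` before starting the Job Manager or Task Manager, which is what lets one image serve both roles.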
Builds & Behavioral Testing
• Use BuildKite to build our Flink application JAR
• BuildKite builds run in Docker containers on AWS EC2 instances
• Behavioral tests written in Cucumber (Ruby)
• Cucumber tests run against a set of Docker containers brought up with Docker Compose: Flink, Kinesalite (Kinesis clone), Minio (S3 clone), RabbitMQ, InfluxDB
• Docker-managed networking connects the services
Internal Monitoring
• Use StatsD to emit Flink metrics about the Flink cluster
• A Telegraf Docker container consumes the StatsD metrics and writes them to InfluxDB
• Kapacitor monitors InfluxDB writes
• Kapacitor scripts can trigger alerts (OpsGenie, PagerDuty, etc.)
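StatsD metrics are plain-text datagrams over UDP, which is what makes the Telegraf hand-off above so simple. A minimal sketch of emitting metrics in the StatsD line protocol; the metric names and collector address are illustrative assumptions:

```python
# Minimal sketch of emitting metrics in the StatsD line protocol, the
# wire format Telegraf's statsd input consumes. Metric names and the
# collector address are hypothetical.
import socket

def statsd_packet(name, value, metric_type):
    """Format one StatsD datagram: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}".encode()

def emit(sock, addr, name, value, metric_type="c"):
    # UDP is fire-and-forget: emitting is safe even if no collector
    # is listening, so instrumented code never blocks on monitoring.
    sock.sendto(statsd_packet(name, value, metric_type), addr)

addr = ("127.0.0.1", 8125)  # conventional StatsD/Telegraf port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
emit(sock, addr, "flink.taskmanager.running_tasks", 8, "g")  # gauge
emit(sock, addr, "flink.alerts.triggered", 1, "c")           # counter
print(statsd_packet("flink.alerts.triggered", 1, "c"))  # → b'flink.alerts.triggered:1|c'
```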
Mux Alerting UI
Listing of Alert Incidents
Slack Notifications for Alerts
Alert Incident Details
Thank You!