Kafka @ PayPal: Enabling 400 Billion Messages a Day
Kafka @ PayPal: Enabling 400 Billion Messages a Day
Kevin Lu, Na Yang and Maulin Vasavada
© 2018 PayPal Inc. Confidential and proprietary.
Agenda
• PayPal
• Kafka @ PayPal Today
• The Beginning
• Growth Spurt
• Road to Kafka-as-a-Service
• Future
PayPal Growth
From https://www.paypal.com/us/webapps/mpp/stories/media-resources
Kafka @ PayPal Today: Overview
• 400+ billion messages per day
• 50+ clusters
• 3000+ topics
• ~7 PB disk
• Kafka journey: 0.8, 0.9, 0.10, 1.1
Kafka @ PayPal Today: Tech stack
• Languages
• Application frameworks
• Multi-tenant
• Gimel
• Multiple regions & availability zones
Data Pipelines, Frameworks & Platforms: Use Cases
• User behavioral tracking
• Experimental
• Business events
• Application logs
• Merchant monitoring
• Risk & compliance
• Application metrics
Diagram labels: Kafka, Batch Processing, Real-Time Streaming, Gimel
The Beginning…
The Beginning
Individual teams set up their own clusters
Challenges
• Duplicate work
• Lack of standardization
• Inexperience with Kafka
• No knowledge sharing
The Solution: Dedicated Kafka team
Growth Spurt!!!
Growth Spurt!
[Chart: daily message volume in millions (log scale); annotated values 100 M, 1 B, 100 B; x-axis 2015, 2016, early 2017; phases: Dedicated Kafka Team, Ecosystem Expansion, Massive Scale]
Growth Challenges…
1. EVERYTHING was manual & bare metal: MONTHS to onboard a new use-case
2. No visibility into cluster health: TTD (time to detect) a broker failure was 24 hours
3. Issues: long TTR (time to recover) of DAYS to a WEEK to resolve issues
4. Support vs dev work: engineers/developers spent MORE time on support work than on dev work
Road to Kafka as a Service: Operational Dashboard & Client Visibility
KafkaMon
A monitoring and operational dashboard
• View cluster, topic, partition, consumer group info
• Topic management (create, modify, repartition)
• Broker/ZooKeeper deployment
• MirrorMaker deployment
• View logs
Availability and Customer KPIs: Our challenge
Producer asks:
• What is my topic availability?
• What is my producer latency?
• Will I get notified if something bad happens in my Kafka data pipeline?
• …
Consumer asks:
• What is my topic availability?
• What is my consumer latency?
• Will I get notified if something bad happens in my Kafka data pipeline?
• …
Cluster-side view: Cluster is up, everything looks OK!
Availability and Customer KPIs: Our solutions
• Canary producer: send availability, send latency, send throughput
• Canary consumer: fetch availability, fetch latency, end-to-end latency, fetch throughput
• Metrics go to a time series database, which drives alerts and a dashboard
• Over 4000 clients are monitored!
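The canary-producer idea can be sketched in a few lines. This is a minimal illustration, not PayPal's implementation: `run_canary`, `CanaryReport`, and the `send` callable are hypothetical names, and `send` stands in for whatever blocking produce call a real client library would expose.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CanaryReport:
    """Client-perceived send metrics for one canary run (hypothetical type)."""
    attempts: int
    successes: int
    latencies_ms: List[float]

    @property
    def availability(self) -> float:
        # Fraction of canary sends that succeeded.
        return self.successes / self.attempts if self.attempts else 0.0

    @property
    def max_latency_ms(self) -> float:
        return max(self.latencies_ms) if self.latencies_ms else 0.0


def run_canary(send: Callable[[bytes], None], attempts: int,
               clock: Callable[[], float] = time.monotonic) -> CanaryReport:
    """Send `attempts` canary messages and record availability and latency.

    `send` is an assumed blocking function that raises on failure.
    """
    successes = 0
    latencies_ms: List[float] = []
    for i in range(attempts):
        start = clock()
        try:
            send(f"canary-{i}".encode())
        except Exception:
            continue  # a failed send counts against availability
        successes += 1
        latencies_ms.append((clock() - start) * 1000.0)
    return CanaryReport(attempts, successes, latencies_ms)
```

In a real deployment the report would be written to the time series database on a fixed schedule, so alerts fire on what clients experience rather than on whether brokers are up.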
Is There Data Loss???
Is There Any Data Loss?
Pipeline: Producer, MM, Consumer
• Is there any data loss during an outage?
• How can we tell?
Data Loss Auditing
Pipeline: Producer, MM, Consumer. Producer and consumer send audit messages to the Data Loss Auditor, which raises data loss alerts.
Producer audit message:
  cluster: cluster_1
  topic: Test
  count: 180
  type: producer
  time_bucket_start_ms: 1501796200000
  time_bucket_end_ms: 1501796260000
Consumer audit message:
  cluster: cluster_2
  topic: Test
  consumer_group_id: canary
  count: 180
  type: consumer
  time_bucket_start_ms: 1501796200000
  time_bucket_end_ms: 1501796260000
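One way an auditor could compare such audit messages is to sum counts per topic and time bucket and flag buckets where the consumer saw fewer messages than the producer sent. The sketch below uses the field names from the audit messages on this slide; the function name `find_data_loss`, the alert shape, and the single-consumer-group assumption are illustrative, not taken from the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (topic, time_bucket_start_ms, time_bucket_end_ms)
Bucket = Tuple[str, int, int]


def find_data_loss(audit_messages: Iterable[dict]) -> List[dict]:
    """Return one alert per bucket where consumed count < produced count.

    Assumes a single consumer group per topic; with several groups the
    consumed totals would need to be keyed by consumer_group_id as well.
    """
    produced: Dict[Bucket, int] = defaultdict(int)
    consumed: Dict[Bucket, int] = defaultdict(int)
    for msg in audit_messages:
        key = (msg["topic"], msg["time_bucket_start_ms"], msg["time_bucket_end_ms"])
        if msg["type"] == "producer":
            produced[key] += msg["count"]
        elif msg["type"] == "consumer":
            consumed[key] += msg["count"]

    alerts = []
    for key, sent in sorted(produced.items()):
        received = consumed.get(key, 0)
        if received < sent:
            topic, start_ms, end_ms = key
            alerts.append({
                "topic": topic,
                "time_bucket_start_ms": start_ms,
                "time_bucket_end_ms": end_ms,
                "missing": sent - received,
            })
    return alerts
```

A practical auditor would also wait some grace period before evaluating a bucket, since consumer audit messages for a time window can arrive later than the producer's.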
Data Loss Auditing
Data loss tracking dashboard: data loss detected!
Road to Kafka as a Service: Onboarding and Capacity Management
Customer Onboarding
• Ticket-based process: customer files a ticket; Kafka team reviews the ticket and does capacity approval, requests new hardware, creates the topic, and updates the ticket; customer uses Kafka. Onboarding time: weeks.
• Automated process: customer uses the web UI. Onboarding time: minutes.
Self-service Onboarding
• User and application metadata
• Topic details
Capacity Management
• Memory: ~10 TB
• Disk: ~7 PB
• Network: ~6 Tbps
• CPU: ~3000
Tens of millions of dollars in hardware cost!
Capacity Management
• Manage capacity based on individual cluster (used vs. unused)
• Manage capacity based on data center (used vs. unused)
• Capacity showback per user (user 1 to user 5)
Capacity Showback
• Organization hierarchy: customer used capacity
• Organization hierarchy: customer used capacity by organization
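Showback by organization amounts to rolling each customer's used capacity up every level of the organization hierarchy. The sketch below assumes the hierarchy is encoded as a slash-separated path and that usage is measured in terabytes; the function name, the path encoding, and the sample organizations are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rollup_showback(usage: List[Tuple[str, float]]) -> Dict[str, float]:
    """Aggregate used capacity at every level of the organization path.

    Each entry is (org_path, used_tb), for example ("payments/risk", 2.0).
    The usage is credited to the leaf and to each of its ancestors.
    """
    totals: Dict[str, float] = defaultdict(float)
    for org_path, used_tb in usage:
        parts = org_path.split("/")
        for depth in range(1, len(parts) + 1):
            totals["/".join(parts[:depth])] += used_tb
    return dict(totals)


if __name__ == "__main__":
    sample = [("payments/risk", 2.0), ("payments/checkout", 3.0), ("data/analytics", 1.0)]
    for org, used in sorted(rollup_showback(sample).items()):
        print(f"{org}: {used} TB")
```

Reporting the same numbers at each level lets a dashboard show both per-customer usage and per-organization totals from one pass over the data.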
Road to Kafka as a Service: Configuration Service & Control Plane
Uneven cluster utilization over time
Storage and bandwidth usage across clusters
Diagram labels: client apps; clusters marked { highly utilized }, { fairly utilized }, { lightly utilized }, { need to decommission }
Availability challenges
A) 0% send availability for some topics: Producer 1 and Producer 2 writing to T1, T2, T3; downtime { batch expired }
B) 0%/low cluster availability upon network switch or power outage, lasting days/hours: Client-app-1 and Client-app-2 using T1, T2, T3
Configuration Service
1. Client app (PayPal library) contacts the configuration service
2. Configuration service reads metadata (admin service)
3. Configuration service returns { T1, T2, T3 } and { kafka properties }
4. Client app reads/writes T1, T2, T3 on the Kafka control & data plane
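The point of the flow above is that clients never hard-code a cluster: they ask for a topic and receive Kafka properties. A minimal sketch of that indirection follows, using an in-memory stand-in for the configuration service. All class and method names here are hypothetical; `bootstrap.servers` is the standard Kafka client property, and the host names are made up.

```python
from typing import Dict, List


class InMemoryConfigService:
    """Stand-in for a configuration service: maps topic -> Kafka properties."""

    def __init__(self, placements: Dict[str, Dict[str, str]]):
        self._placements = placements

    def lookup(self, app_name: str, topics: List[str]) -> Dict[str, Dict[str, str]]:
        # Return a copy so callers cannot mutate the service's metadata.
        return {topic: dict(self._placements[topic]) for topic in topics}

    def move_topic(self, topic: str, bootstrap_servers: str) -> None:
        # An operator moves a topic to another cluster; clients are unaware.
        self._placements[topic]["bootstrap.servers"] = bootstrap_servers


class ClientLibrary:
    """Hypothetical client wrapper that resolves properties per topic."""

    def __init__(self, app_name: str, config_service: InMemoryConfigService):
        self._app_name = app_name
        self._config_service = config_service

    def properties_for(self, topic: str) -> Dict[str, str]:
        return self._config_service.lookup(self._app_name, [topic])[topic]


if __name__ == "__main__":
    service = InMemoryConfigService({"T1": {"bootstrap.servers": "cluster-1:9092"}})
    client = ClientLibrary("client-app-1", service)
    print(client.properties_for("T1"))
    service.move_topic("T1", "cluster-2:9092")
    print(client.properties_for("T1"))
```

Because the application only names the topic, moving T1 to another cluster changes what the lookup returns without any change to application code.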
Efficiency Upon Topic Movement
• Manual: error prone, unpredictable, several days/hours
• Automated: streamlined, predictable, few mins
Kafka Control & Data Plane
• Users: admin, clients, other ecosystems
• Control plane: APIs, presentation layer, admin service, MirrorMaker management, configuration service, capacity management, showback management, monitoring, metadata
• Data plane (restricted access): ZooKeepers, brokers, MirrorMakers
Future
• Automate client failover
• Automate remediation
• Contribute back to the community
• Client side KPIs
• Learnings
Conclusion
• Measure client perceived availability
• Abstract clusters from clients
• Build metadata
• Automate
  - Cluster deployment & configuration
  - Client-onboarding
  - Remediation processes
• Make alerts actionable
Thank you!