Experiences in running Apache Flink at large scale
Experiences in running Apache Flink® at large scale. Stephan Ewen (@StephanEwen)
Lessons learned from running Flink at large scale, including various things we never expected to become a problem and evidently still did… Also a preview of various fixes coming in Flink…
What is large scale?
§ Large data volume (events / sec)
§ Large application state (GBs / TBs)
§ Complex dataflow graphs (many operators)
§ High parallelism (1000s of subtasks)
Distributed Coordination
Deploying Tasks
Happens during initial deployment and recovery: the JobManager sends a Deployment RPC Call (via Akka / RPC and the Blob Server) to the TaskManager. The message contains:
§ Job configuration (KBs)
§ Task code and objects (KBs up to MBs)
§ Recover state handle (KBs)
§ Correlation IDs (few bytes)
RPC volume during deployment (back-of-the-napkin calculation):
number of tasks × parallelism × size of task objects = RPC volume
10 × 1000 × 2 MB = 20 GB
~20 seconds on a full 10 Gbit/s network
> 1 min with an average of 3 Gbit/s
> 3 min with an average of 1 Gbit/s
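The napkin numbers above can be reproduced in a few lines; the 2 MB task size and the link rates are the slide's assumptions:

```java
public class DeploymentRpcVolume {

    // total RPC volume = number of tasks * parallelism * size of task objects
    static long rpcVolumeBytes(long numTasks, long parallelism, long taskObjectBytes) {
        return numTasks * parallelism * taskObjectBytes;
    }

    // transfer time in seconds for a given volume over a given link rate
    static double transferSeconds(long bytes, double gigabitsPerSecond) {
        return (bytes * 8) / (gigabitsPerSecond * 1e9);
    }

    public static void main(String[] args) {
        long volume = rpcVolumeBytes(10, 1000, 2_000_000L); // 10 tasks x 1000 parallel x 2 MB
        System.out.printf("RPC volume: %.0f GB%n", volume / 1e9);                  // 20 GB
        System.out.printf("at 10 Gbit/s: %.0f s%n", transferSeconds(volume, 10));  // ~16 s
        System.out.printf("at  1 Gbit/s: %.0f s%n", transferSeconds(volume, 1));   // ~160 s
    }
}
```

The real transfers are slower than the ideal 16 s because deployment RPC rarely saturates the link, which is how the slide arrives at minutes.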
Timeouts and failure detection
~20 seconds on a full 10 Gbit/s network, minutes on slower ones, against a default RPC timeout of 10 secs: the default settings lead to failed deployments with RPC timeouts.
§ Solution: increase the RPC timeout
§ Caveat: increasing the timeout makes failure detection slower
§ Future: reduce the RPC load (next slides)
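For reference, the timeout is set in flink-conf.yaml; a sketch for the Flink versions discussed here (60 s is an arbitrary choice, not a recommendation):

```yaml
# flink-conf.yaml: raise the Akka RPC ask timeout from its 10 s default
akka.ask.timeout: 60 s
```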
Dissecting the RPC messages
Message part          | Size      | Variance across subtasks and redeploys
Job configuration     | KBs       | constant
Task code and objects | up to MBs | constant
Recover state handle  | KBs       | variable
Correlation IDs       | few bytes | variable
Upcoming: Deploying Tasks
Out-of-band transfer and caching of large and constant message parts:
(1) The JobManager sends a small (KBs) Deployment RPC Call to the TaskManager, containing only the recover state handle, the correlation IDs, and BLOB pointers.
(2) The TaskManager's Blob Cache downloads and caches the large BLOBs (job config, task objects; MBs) from the Blob Server.
Checkpoints at scale
Robustly checkpointing… …is the most important part of running a large Flink program
Review: Checkpoints
Triggering a checkpoint injects a checkpoint barrier at the sources; the barrier flows with the stream through the source / transform tasks towards the stateful operations. When the barrier reaches a stateful operation, it triggers a snapshot of that operator's state.
Review: Checkpoint Alignment
When an operator receives checkpoint barrier n on one input, it begins aligning: further records from that input are held back in the input buffer, while the other inputs continue to be processed until their barriers arrive. Once barrier n has arrived on all inputs, the operator checkpoints its state, emits barrier n downstream, and continues, first draining the records buffered during the alignment.
Understanding Checkpoints
Understanding Checkpoints
Two questions matter: how long do snapshots take, and how well does the alignment behave?
§ Snapshots that take too long indicate too much state per node, or a snapshot store that cannot keep up with the load (low bandwidth); this changes with incremental checkpoints.
§ Alignment delay = end_to_end – sync – async (lower is better; the most important metric). A long delay means the job is under backpressure; a constantly long delay means constant backpressure, i.e., the application is under-provisioned.
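The alignment-delay formula is just a subtraction over the three per-checkpoint timings; a small sketch (the parameter names are illustrative, not Flink's metric names):

```java
public class AlignmentDelay {

    // delay = end_to_end - sync - async: the time spent aligning (lower is better)
    static long alignmentDelayMs(long endToEndMs, long syncMs, long asyncMs) {
        return endToEndMs - syncMs - asyncMs;
    }

    public static void main(String[] args) {
        // hypothetical checkpoint: 12 s end-to-end, 1 s sync snapshot, 3 s async upload
        System.out.println(alignmentDelayMs(12_000, 1_000, 3_000) + " ms spent aligning");
    }
}
```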
Alignments: Limit in-flight data
§ In-flight data is data "between" operators
• On the wire or in the network buffers
• The amount depends mainly on the network buffer memory
§ Some is needed to buffer out network fluctuations / transient backpressure
§ The max amount of in-flight data is the max amount buffered during an alignment
Alignments: Limit in-flight data
§ Flink 1.2: Global pool that distributes across all tasks
• Rule of thumb: set to 4 * num_shuffles * parallelism * num_slots
§ Flink 1.3: Limits the max in-flight data automatically
• Heuristic based on the number of channels and connections involved in a transfer step
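The Flink 1.2 rule of thumb turns into a quick sizing check; the job shape below is hypothetical, and 32 KB is the default network buffer size:

```java
public class NetworkBufferSizing {

    // Flink 1.2 rule of thumb: buffers = 4 * num_shuffles * parallelism * num_slots
    static long requiredBuffers(long numShuffles, long parallelism, long numSlots) {
        return 4 * numShuffles * parallelism * numSlots;
    }

    public static void main(String[] args) {
        // hypothetical job: 3 shuffle steps, parallelism 500, 4 slots per TaskManager
        long buffers = requiredBuffers(3, 500, 4);
        // memory to reserve at the 32 KB default buffer size
        long megabytes = buffers * 32 * 1024 / (1024 * 1024);
        System.out.println(buffers + " buffers = " + megabytes + " MB of network buffer memory");
    }
}
```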
Heavy alignments
§ A heavy alignment typically happens at some point
§ Causes: different load on different paths, a big window emission concurrent to a checkpoint, or a stall of one operator on the path (e.g., a GC stall)
Catching up from heavy alignments
§ Operators that did a heavy alignment need to catch up again; the records buffered during the alignment are consumed first, after the checkpoint completes
§ Otherwise, the next checkpoint will have a heavy alignment as well
Catching up from heavy alignments
§ Give the computation time to catch up before starting the next checkpoint
• Useful: set the min-time-between-checkpoints
§ Asynchronous checkpoints help a lot!
• Shorter stalls in the pipeline mean less build-up of in-flight data
• Catching up already happens concurrently to state materialization
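A minimal sketch of the min-time-between-checkpoints setting, assuming Flink's DataStream API (the interval values here are arbitrary examples):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // trigger a checkpoint every 60 s
// leave the pipeline at least 30 s to catch up before the next checkpoint starts
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
```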
Asynchronous Checkpoints
The processing pipeline (source / transform / stateful operation) continues while the snapshots are durably persisted asynchronously.
Asynchrony of different state types
State                 | Flink 1.2             | Flink 1.3+
Keyed state (RocksDB) | ✔                     | ✔
Keyed state on heap   | ✘ (✔ hidden in 1.2.1) | ✔
Timers                | ✘                     | ✔/✘
Operator state        | ✔                     | ✔
When to use which state backend? (a bit simplified)
§ State ≥ memory? yes → RocksDB
§ Otherwise: complex objects (expensive serialization) or a high data rate? yes → async heap backend, no → RocksDB
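Selecting the backend is a sketch in flink-conf.yaml; keys as of the Flink 1.2/1.3 era, and the checkpoint path is a placeholder:

```yaml
# flink-conf.yaml: use RocksDB for state larger than memory
state.backend: rocksdb
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
```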
File Systems, Object Stores, and Checkpointed State
Exceeding FS request capacity
§ Job size: 4 operators
§ Parallelism: 100s to 1000
§ State backend: FsStateBackend
§ State size: few KBs per operator, 100s to 1000s of files
§ Checkpoint interval: few secs
§ Symptom: S3 blocked off connections after exceeding 1000s of HEAD requests / sec
Exceeding FS request capacity
What happened?
§ Operators prepare state writes and ensure that the parent directory exists
§ Via the S3 FS (from Hadoop), each mkdirs causes 2 HEAD requests
§ Flink 1.2: Lazily initialize checkpoint preconditions (dirs.)
§ Flink 1.3: Core state backends reduce the assumption of directories (PUT/GET/DEL); rich file systems support them as fast paths
Reducing FS stress for small state
With the Fs/RocksDB state backend, for most states: each task on a TaskManager writes its own checkpoint data files, while the JobManager's Checkpoint Coordinator writes the root checkpoint file (metadata).
Reducing FS stress for small state
For small states, the tasks send the checkpoint data to the JobManager together with the ack (ack+data), and it is stored directly in the metadata file. Increasing the small-state threshold (default: 1 KB) reduces the number of files.
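The threshold is configurable; a sketch assuming the Flink 1.3-era configuration key (value in bytes):

```yaml
# flink-conf.yaml: state below this size is stored inline in the checkpoint
# metadata file instead of its own file (default: 1024 bytes)
state.backend.fs.memory-threshold: 4096
```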
Lagging state cleanup
§ Symptom: checkpoints get cleaned up too slowly; state accumulates over time
§ Cause of the imbalance: one JobManager is deleting files, while many TaskManagers create files
Lagging state cleanup
§ Problem: file systems and object stores offer only synchronous requests to delete state objects; the time to delete a checkpoint may accumulate to minutes
§ Flink 1.2: Concurrent checkpoint deletes on the JobManager
§ Flink 1.3: For file systems with an actual directory structure, use recursive directory deletes (one request per directory)
Orphaned Checkpoint State
Who owns state objects at what time? (1) The TaskManager's task writes the state, (2) acks the checkpoint and transfers ownership of the state, (3) the JobManager's Checkpoint Coordinator records the state reference.
Orphaned Checkpoint State
Upcoming: searching for orphaned state. Periodically sweep the checkpoint directory (e.g., fs:///checkpoints/job-61776516/chk-113 … chk-273) for leftover dirs, retaining the latest. It gets more complicated with incremental checkpoints…
Conclusion & General Recommendations
The closer your application is to saturating a resource (network, CPU, memory, FS throughput, etc.), the sooner an extraordinary situation causes a regression. Enough headroom in the provisioned capacity means fast catch-up after temporary regressions. Be aware that certain operations are spiky (like aligned windows). Always production-test with checkpoints ;-)
Recommendations (part 1)
Be aware of the inherent scalability of primitives
§ Broadcasting state is useful, for example for updating rules / configs, dynamic code loading, etc.
§ Broadcasting does not scale, i.e., adding more nodes does not help; don't use it for high-volume joins
§ Putting very large objects into a ValueState may mean a big serialization effort on access / checkpoint
§ If the state can be mappified, use MapState; it performs much better
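A sketch of the MapState recommendation, assuming Flink 1.3's keyed-state API inside a rich function on a keyed stream; the state name, key variable, and types are illustrative:

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;

// inside a RichFlatMapFunction (or similar) on a keyed stream:
MapStateDescriptor<String, Long> descriptor =
        new MapStateDescriptor<>("perKeyCounts", String.class, Long.class);
MapState<String, Long> counts = getRuntimeContext().getMapState(descriptor);

// each access touches a single entry, instead of (de)serializing
// one huge ValueState object on every access and checkpoint
Long previous = counts.get(word);
counts.put(word, previous == null ? 1L : previous + 1);
```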
Recommendations (part 2)
If you care about recovery time
§ Having spare TaskManagers helps bridge the time until backup TaskManagers come online
§ Having a spare JobManager can be useful
• Future: JobManager failures are non-disruptive
Recommendations (part 3)
If you care about CPU efficiency, watch your serializers
§ JSON is a flexible, but awfully inefficient data format
§ Kryo does okay; make sure you register the types
§ Flink's directly supported types have good performance: basic types, arrays, tuples, …
§ Nothing ever beats a custom serializer ;-)
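Registering types with Kryo, as the slide suggests, is a one-liner on the execution config; `MyEvent` and `MyEventSerializer` are hypothetical names:

```java
import org.apache.flink.api.common.ExecutionConfig;

ExecutionConfig config = env.getConfig();
// register types used via Kryo so it writes compact IDs instead of full class names
config.registerKryoType(MyEvent.class);
// or go further and plug in a custom Kryo serializer for the type
config.registerTypeWithKryoSerializer(MyEvent.class, MyEventSerializer.class);
```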
Thank you! Questions?