Experiences in running Apache Flink at large scale

Stephan Ewen (@StephanEwen)

Lessons learned from running Flink at large scale, including various things we never expected to become a problem and evidently still did… Also a preview of various fixes coming in Flink.

What is large scale?
§ Large data volume (events / sec)
§ Large application state (GBs / TBs)
§ Complex dataflow graphs (many operators)
§ High parallelism (1000s of subtasks)

Distributed Coordination

Deploying Tasks
Happens during initial deployment and recovery: the JobManager sends a deployment RPC call to each TaskManager (via Akka / RPC, with a BLOB server on both sides). The deployment message contains:
- Job configuration (KBs)
- Task code and objects (KBs up to MBs)
- Recover state handle (KBs)
- Correlation IDs (few bytes)

RPC volume during deployment (back-of-the-napkin calculation)
number of tasks x parallelism x size of task objects = RPC volume
10 x 1000 x 2 MB = 20 GB
~20 seconds on a full 10 Gbit/s network
> 1 min at an average of 3 Gbit/s
> 3 min at an average of 1 Gbit/s

Timeouts and Failure detection
Transferring ~20 GB of deployment RPCs takes ~20 seconds on a full 10 Gbit/s network, over 1 minute at an average of 3 Gbit/s, and over 3 minutes at 1 Gbit/s, while the default RPC timeout is 10 seconds. The default settings therefore lead to failed deployments with RPC timeouts.
§ Solution: increase the RPC timeout (see the sketch below)
§ Caveat: increasing the timeout makes failure detection slower
§ Future: reduce the RPC load (next slides)
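As a hedged illustration (the exact key depends on your Flink version), the RPC timeout can be raised via the akka.ask.timeout setting, either in flink-conf.yaml or, for a local sketch, through a Configuration object:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RpcTimeoutExample {
    public static void main(String[] args) {
        // Raise the RPC timeout from the 10 s default so that large
        // deployment messages do not trip failure detection.
        // "akka.ask.timeout" is the key used by Flink 1.x; verify
        // against your version.
        Configuration conf = new Configuration();
        conf.setString("akka.ask.timeout", "60 s");

        // Takes effect for an embedded local cluster; on a real cluster
        // the value belongs in flink-conf.yaml instead.
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.createLocalEnvironment(4, conf);
    }
}
```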

Dissecting the RPC messages

Message part            Size        Variance across subtasks and redeploys
Job configuration       KBs         constant
Task code and objects   up to MBs   constant
Recover state handle    KBs         variable
Correlation IDs         few bytes   variable

Upcoming: Deploying Tasks
Out-of-band transfer and caching of large and constant message parts:
(1) The JobManager sends a small deployment RPC call (KBs) to the TaskManager, containing only the recover state handle, correlation IDs, and BLOB pointers.
(2) The TaskManager downloads the large, constant parts (job configuration, task objects; MBs) from the BLOB server and caches them in its BLOB cache.

Checkpoints at scale

Robustly checkpointing… is the most important part of running a large Flink program.

Review: Checkpoints
Triggering a checkpoint injects a checkpoint barrier at the sources; the barrier then flows through the dataflow (source / transform / stateful operation).

Review: Checkpoints
As the barrier passes each operator, it triggers a state snapshot: the operator takes a snapshot of its state (source / transform / stateful operation).

Review: Checkpoint Alignment
[Diagram: an operator with multiple input channels begins aligning when checkpoint barrier n arrives on one channel; records arriving on that channel are held in the input buffer while the other channels continue until their barriers arrive.]

Review: Checkpoint Alignment
[Diagram: once barrier n has arrived on all input channels, the operator takes its checkpoint, emits barrier n downstream, and continues processing, draining the records buffered during alignment first.]

Understanding Checkpoints

Understanding Checkpoints
Two questions: How long do snapshots take? How well behaved is the alignment?
delay = end_to_end - sync - async (lower is better)

Understanding Checkpoints
The delay (delay = end_to_end - sync - async, lower is better) is the most important metric.
§ A long delay means the job is under backpressure; a delay that stays too long means constant backpressure, i.e. the application is under-provisioned
§ How long snapshots take depends on the amount of state per node and on whether the snapshot store can keep up with the load (low bandwidth); this changes with incremental checkpoints

Alignments: Limit in-flight data
§ In-flight data is data "between" operators
• On the wire or in the network buffers
• The amount depends mainly on network buffer memory
§ Some is needed to buffer out network fluctuations / transient backpressure
§ The max amount of in-flight data is the max amount buffered during alignment

Alignments: Limit in-flight data
§ Flink 1.2: Global buffer pool that is distributed across all tasks
• Rule of thumb: set it to 4 * num_shuffles * parallelism * num_slots (see the sizing sketch below)
§ Flink 1.3: Limits the max in-flight data automatically
• Heuristic based on the number of channels and connections involved in a transfer step
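A back-of-the-envelope sketch of the Flink 1.2-era rule of thumb; taskmanager.network.numberOfBuffers is the config key from that era, and all the numbers are purely illustrative:

```java
// Sizing the global network buffer pool per the slide's rule of thumb.
public class NetworkBufferSizing {
    public static void main(String[] args) {
        int numShuffles = 6;    // shuffling steps in the job graph (illustrative)
        int parallelism = 1000; // job parallelism (illustrative)
        int numSlots = 4;       // slots per TaskManager (illustrative)

        int numberOfBuffers = 4 * numShuffles * parallelism * numSlots;

        // This value would go into flink-conf.yaml, e.g.:
        //   taskmanager.network.numberOfBuffers: 96000
        System.out.println("taskmanager.network.numberOfBuffers: " + numberOfBuffers);
    }
}
```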

Heavy alignments
§ A heavy alignment typically happens at some point:
• Different load on different paths
• Big window emission concurrent to a checkpoint
• Stall of one operator on the path (e.g., a GC stall)

Catching up from heavy alignments
§ Operators that did a heavy alignment need to catch up again
§ Otherwise, the next checkpoint will have a heavy alignment as well
[Diagram: the records buffered during alignment are consumed first, after the checkpoint completes.]

Catching up from heavy alignments
§ Give the computation time to catch up before starting the next checkpoint
• Useful: set the min-time-between-checkpoints (see the sketch below)
§ Asynchronous checkpoints help a lot!
• Shorter stalls in the pipeline mean less build-up of in-flight data
• Catch-up already happens concurrently to state materialization
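A minimal sketch of setting that pause via the DataStream API; the interval values are illustrative:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointPauseExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s (illustrative values).
        env.enableCheckpointing(60_000);

        // Guarantee at least 30 s between the end of one checkpoint and
        // the start of the next, so operators can catch up after a heavy
        // alignment before the next barrier arrives.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // ... build and execute the actual pipeline here ...
    }
}
```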

Asynchronous Checkpoints
The processing pipeline continues while snapshots are persisted durably in the background (source / transform / stateful operation).

Asynchrony of different state types

State                   Flink 1.2               Flink 1.3+
Keyed state (RocksDB)   ✔                       ✔
Keyed state on heap     ✘ (✔ hidden in 1.2.1)   ✔
Timers                  ✘                       ✔/✘
Operator state          ✘                       ✔

When to use which state backend? (a bit simplified)
§ State ≥ memory? yes: RocksDB
§ no: Complex objects (expensive serialization)? yes: Async Heap
§ no: High data rate? yes: Async Heap; no: RocksDB
(see the sketch below)
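A minimal sketch of picking a backend along those lines; the paths and the async flag are illustrative, the boolean constructor argument for asynchronous heap snapshots is a Flink 1.3-era API worth verifying against your version, and RocksDB requires the flink-statebackend-rocksdb dependency:

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendChoice {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        boolean stateLargerThanMemory = true; // illustrative decision input

        if (stateLargerThanMemory) {
            // State exceeds memory: keep it in RocksDB on local disk,
            // checkpointed to a durable file system.
            env.setStateBackend(
                new RocksDBStateBackend("hdfs:///flink/checkpoints"));
        } else {
            // State fits in memory (complex objects or high data rate):
            // keep it on the heap and snapshot asynchronously
            // (second constructor argument enables async snapshots).
            env.setStateBackend(
                new FsStateBackend("hdfs:///flink/checkpoints", true));
        }
        // ... build and execute the pipeline ...
    }
}
```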

File Systems, Object Stores, and Checkpointed State

Exceeding FS request capacity
§ Job size: 4 operators
§ Parallelism: 100s to 1000
§ State backend: FsStateBackend
§ State size: few KBs per operator, 100s to 1000s of files
§ Checkpoint interval: few secs
§ Symptom: S3 blocked off connections after exceeding 1000s of HEAD requests/sec

Exceeding FS request capacity: what happened?
§ Operators prepare state writes and ensure the parent directory exists
§ Via the S3 FS (from Hadoop), each mkdirs causes 2 HEAD requests
§ Flink 1.2: Lazily initialize checkpoint preconditions (dirs.)
§ Flink 1.3: Core state backends reduce the assumption of directories (PUT/GET/DEL); rich file systems support them as fast paths

Reducing FS stress for small state
Fs/RocksDB state backend, for most states: each task on a TaskManager writes its own checkpoint data files to the file system and acknowledges to the JobManager's checkpoint coordinator, which writes the root checkpoint file (metadata).

Reducing FS stress for small state
Fs/RocksDB state backend, for small states: tasks send the checkpoint data directly with the acknowledgement (ack+data), and it is stored inline in the metadata file instead of in separate files. Increasing the small-state threshold (default: 1 KB) reduces the number of files (see the sketch below).
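As a hedged illustration, the threshold is exposed through the state.backend.fs.memory-threshold setting (the key used by Flink's file-system backends; verify against your version, and note the value is in bytes):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SmallStateThreshold {
    public static void main(String[] args) {
        // Sketch: raise the small-state threshold so that states up to
        // 64 KB travel inline with the ack and land in the metadata file,
        // instead of producing one small file per subtask and checkpoint.
        Configuration conf = new Configuration();
        conf.setString("state.backend.fs.memory-threshold", "65536");

        // For an embedded cluster; on a real cluster this belongs in
        // flink-conf.yaml.
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.createLocalEnvironment(4, conf);
    }
}
```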

Lagging state cleanup
Symptom: checkpoints get cleaned up too slowly, so state accumulates over time: one JobManager is deleting files while many TaskManagers create them.

Lagging state cleanup
§ Problem: File systems and object stores offer only synchronous requests to delete state objects; the time to delete a checkpoint may accumulate to minutes
§ Flink 1.2: Concurrent checkpoint deletes on the JobManager
§ Flink 1.3: For file systems with an actual directory structure, use recursive directory deletes, i.e. one request per directory (see the sketch below)
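A minimal sketch of the difference, using Hadoop's FileSystem API (which Flink's file-system backends build on); the checkpoint path is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckpointDelete {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path checkpointDir = new Path("/flink/checkpoints/job-x/chk-42");

        // Slow variant: one synchronous delete request per state file.
        for (FileStatus file : fs.listStatus(checkpointDir)) {
            fs.delete(file.getPath(), false);
        }

        // Fast variant (Flink 1.3 style, for stores with real directories):
        // a single recursive delete request for the whole directory.
        fs.delete(checkpointDir, true);
    }
}
```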

Orphaned Checkpoint State
Who owns state objects at what time?
(1) The TaskManager writes the state
(2) It acknowledges the checkpoint and transfers ownership of the state
(3) The JobManager's checkpoint coordinator records the state reference

Orphaned Checkpoint State
Upcoming: searching for orphaned state. Periodically sweep the checkpoint directory (e.g. fs:///checkpoints/job-61776516/ containing chk-113, chk-129, chk-221, chk-272, chk-273) for leftover directories, retaining the latest (see the sketch below). It gets more complicated with incremental checkpoints…
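A hedged sketch of such a sweep (not Flink's actual implementation, which is only previewed in the talk); the directory layout and retention rule follow the slide:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class OrphanedStateSweep {
    public static void main(String[] args) throws IOException {
        Path jobCheckpoints = Paths.get("/checkpoints/job-61776516");

        // Collect chk-* directories and sort them by checkpoint id.
        List<Path> chkDirs = new ArrayList<>();
        try (DirectoryStream<Path> ds =
                 Files.newDirectoryStream(jobCheckpoints, "chk-*")) {
            ds.forEach(chkDirs::add);
        }
        chkDirs.sort(Comparator.comparingLong(
            p -> Long.parseLong(p.getFileName().toString().substring(4))));

        // Retain the latest checkpoint; older leftovers are candidates.
        for (int i = 0; i < chkDirs.size() - 1; i++) {
            System.out.println("orphan candidate: " + chkDirs.get(i));
            // A real sweep must also check against the checkpoints the
            // JobManager still references before deleting anything.
        }
    }
}
```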

Conclusion & General Recommendations

The closer your application is to saturating a resource (network, CPU, memory, FS throughput, etc.), the sooner an extraordinary situation causes a regression. Enough headroom in provisioned capacity means fast catch-up after temporary regressions. Be aware that certain operations are spiky (like aligned windows). Always production-test with checkpoints ;-)

Recommendations (part 1)
Be aware of the inherent scalability of primitives:
§ Broadcasting state is useful, for example for updating rules / configs, dynamic code loading, etc.
§ Broadcasting does not scale, i.e., adding more nodes does not help. Don't use it for high-volume joins.
§ Putting very large objects into a ValueState may mean a big serialization effort on access / checkpoint
§ If the state can be mappified, use MapState; it performs much better (see the sketch below)
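A minimal sketch of "mappifying" state, with illustrative names: with a ValueState<Map<...>> the whole map is (de)serialized on every access, whereas MapState touches only the entry involved.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class MapifiedState extends RichFlatMapFunction<String, Long> {

    // Better than ValueState<Map<String, Long>>: with MapState only the
    // touched entry is (de)serialized, not the whole map.
    private transient MapState<String, Long> counts;

    @Override
    public void open(Configuration parameters) {
        counts = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("counts", String.class, Long.class));
    }

    @Override
    public void flatMap(String value, Collector<Long> out) throws Exception {
        Long current = counts.get(value);
        long updated = (current == null) ? 1L : current + 1L;
        counts.put(value, updated);
        out.collect(updated);
    }
}
```

Applied on a keyed stream, e.g. stream.keyBy(v -> v).flatMap(new MapifiedState()).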

Recommendations (part 2)
If you care about recovery time:
§ Having spare TaskManagers helps bridge the time until backup TaskManagers come online
§ Having a spare JobManager can be useful
• Future: JobManager failures are non-disruptive

Recommendations (part 3)
If you care about CPU efficiency, watch your serializers:
§ JSON is a flexible, but awfully inefficient data format
§ Kryo does okay; make sure you register the types (see the sketch below)
§ Flink's directly supported types have good performance: basic types, arrays, tuples, …
§ Nothing ever beats a custom serializer ;-)
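A minimal sketch of type registration; MyEvent and MyEventSerializer are illustrative placeholders, not types from the talk:

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializerRegistration {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Register the concrete types Kryo will see, so it writes small
        // tag ids instead of full class names.
        env.getConfig().registerKryoType(MyEvent.class);

        // Or go further and plug in a custom serializer for the type.
        env.getConfig().registerTypeWithKryoSerializer(
            MyEvent.class, MyEventSerializer.class);
    }

    // Illustrative event type.
    public static class MyEvent {
        public long id;
        public String payload;
    }

    // Illustrative custom Kryo serializer for MyEvent.
    public static class MyEventSerializer extends Serializer<MyEvent> {
        @Override
        public void write(Kryo kryo, Output output, MyEvent object) {
            output.writeLong(object.id);
            output.writeString(object.payload);
        }

        @Override
        public MyEvent read(Kryo kryo, Input input, Class<MyEvent> type) {
            MyEvent e = new MyEvent();
            e.id = input.readLong();
            e.payload = input.readString();
            return e;
        }
    }
}
```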

Thank you! Questions?