Apache Flink
Stephan Ewen
Flink committer, co-founder / CTO @ data Artisans
@StephanEwen

Looking back one year

April 16, 2014

Stratosphere 0.4
[Stack diagram: Pact API (Java) and DataSet API (Scala) on top of the Stratosphere Optimizer and the Stratosphere Runtime (Local / Remote)]
Batch processing on a pipelining engine, with iterations …

Looking at now…

What is Apache Flink?
[Diagram: Flink (master) in the center, between data inputs and workloads]
§ Real-time data streams: event logs (Kafka, RabbitMQ, ...)
§ Historic data: HDFS, JDBC, ...
§ ETL, Graphs, Machine Learning, Relational, …
§ Low latency, windowing, aggregations, ...

What is Apache Flink?
[Stack diagram]
§ Libraries and compatibility layers: Python, Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R
§ APIs: DataSet (Java/Scala), DataStream (Java/Scala)
§ Flink Optimizer, Stream Builder
§ Flink Dataflow Runtime
§ Execution environments: Local, Remote, Yarn, Tez, Embedded
§ Connected systems: HDFS, HBase, HCatalog, JDBC, Kafka, RabbitMQ, Flume

Batch / Streaming APIs

    case class Word(word: String, frequency: Int)

DataSet API (batch):

    val lines: DataSet[String] = env.readTextFile(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .groupBy("word").sum("frequency")
      .print()

DataStream API (streaming):

    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
      .window(Count.of(1000)).every(Count.of(100))
      .groupBy("word").sum("frequency")
      .print()

Technology inside Flink

    case class Path(from: Long, to: Long)

    val tc = edges.iterate(10) { paths: DataSet[Path] =>
      val next = paths
        .join(edges).where("to").equalTo("from") {
          (path, edge) => Path(path.from, edge.to)
        }
        .union(paths)
        .distinct()
      next
    }

[Diagram: path of a program through the system]
§ Pre-flight (Client): Program, type extraction stack, cost-based optimizer, dataflow graph
§ Master: task scheduling, recovery metadata; deploys operators and tracks intermediate results
§ Workers: memory manager, out-of-core algos, batch & streaming, state & checkpoints
§ Example plan shown: DataSource orders.tbl, Filter, Map, DataSource lineitem.tbl, Join (Hybrid Hash, build HT / probe; hash-part [0], forward), GroupRed (sort)

Flink by Feature / Use Case

Data Streaming Analysis

Life of data streams
§ Create: create streams from event sources (machines, databases, logs, sensors, …)
§ Collect: collect and make streams available for consumption (e.g., Apache Kafka)
§ Process: process streams, possibly generating derived streams (e.g., Apache Flink)

Stream Analysis in Flink
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html

Defining windows in Flink
§ Trigger policy
  • When to trigger the computation on current window
§ Eviction policy
  • When data points should leave the window
  • Defines window width/size
§ E.g., count-based policy
  • evict when #elements > n
  • start a new window every n-th element
§ Built-in: Count, Time, Delta policies
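
The interplay of the two policies can be sketched in plain Scala, without any Flink dependency. This is only an illustration of a count-based eviction policy (keep at most `size` elements) combined with a count-based trigger policy (emit every `slide`-th element), which is the shape of `window(Count.of(1000)).every(Count.of(100))` in the API slide. The object `CountWindowSketch` and its parameter names are hypothetical and not part of Flink.

```scala
object CountWindowSketch {
  // Hypothetical helper, not Flink API: a count-based sliding window.
  // Eviction policy: evict the oldest elements when #elements > size.
  // Trigger policy: emit the current window contents every slide-th element.
  def windows[A](input: Seq[A], size: Int, slide: Int): Vector[Vector[A]] = {
    var buffer = Vector.empty[A]
    var sinceTrigger = 0
    var out = Vector.empty[Vector[A]]
    for (x <- input) {
      buffer = buffer :+ x
      if (buffer.size > size) buffer = buffer.drop(buffer.size - size) // evict
      sinceTrigger += 1
      if (sinceTrigger == slide) { // trigger
        out = out :+ buffer
        sinceTrigger = 0
      }
    }
    out
  }
}
```

For example, `CountWindowSketch.windows(1 to 6, size = 3, slide = 2)` yields the windows (1, 2), (2, 3, 4) and (4, 5, 6). Setting `size` equal to `slide` gives non-overlapping windows.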

Checkpointing / Recovery
§ Flink acknowledges batches of records
  • Less overhead in failure-free case
  • Currently tied to fault tolerant data sources (e.g., Kafka)
§ Flink operators can keep state
  • State is checkpointed
  • Checkpointing and record acks go together
§ Exactly-once semantics for state

Checkpointing / Recovery
[Diagram: an operator on a data stream, shown as checkpoint starting, checkpoint in progress, and checkpoint done]
§ Pushes checkpoint barriers through the data flow
§ Before barrier = part of the snapshot
§ After barrier = not in snapshot (backup till next snapshot)
§ Chandy-Lamport algorithm for consistent asynchronous distributed snapshots
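
The barrier rule above can be illustrated with a deliberately simplified, single-input operator in plain Scala. This is a sketch under strong assumptions: one input channel, a running sum as the only state, and synchronous snapshots. It does not show barrier alignment across several inputs or the asynchronous parts of the real mechanism, and the names `BarrierSnapshotSketch`, `Record`, `Barrier`, `run` and `recover` are hypothetical.

```scala
object BarrierSnapshotSketch {
  sealed trait Event
  final case class Record(value: Int) extends Event
  final case class Barrier(checkpointId: Int) extends Event

  // Processes the stream with a running sum as operator state.
  // When a barrier arrives, the state so far is stored as the snapshot for
  // that checkpoint: records before the barrier are part of the snapshot,
  // records after it are not.
  def run(stream: Seq[Event]): (Int, Map[Int, Int]) =
    stream.foldLeft((0, Map.empty[Int, Int])) {
      case ((sum, snaps), Record(v))   => (sum + v, snaps)
      case ((sum, snaps), Barrier(id)) => (sum, snaps + (id -> sum))
    }

  // Recovery: restore the snapshot and replay the records that followed the
  // barrier (in this sketch they are assumed to be re-readable from the source).
  def recover(snapshot: Int, replay: Seq[Event]): Int =
    replay.foldLeft(snapshot) {
      case (sum, Record(v)) => sum + v
      case (sum, _)         => sum
    }
}
```

For the stream `Record(1), Record(2), Barrier(1), Record(3)`, `run` ends with state 6 and snapshot 3 for checkpoint 1. Restoring 3 and replaying `Record(3)` gives 6 again, so each record affects the state exactly once.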

Heavy ETL Pipelines

Heavy Data Pipelines
Complex ETL programs
Apology: Graph had to be blurred for online slides, due to confidentiality

Memory Management
Flink contains its own memory management stack. Memory is allocated, de-allocated, and used strictly using an internal buffer pool implementation. To do that, Flink contains its own type extraction and serialization components.
[Diagram: a pool of memory pages (some empty), with memory split into unmanaged and managed parts]
§ Unmanaged: user code objects
§ Managed: sorting, hashing, caching
§ Shuffling, broadcasts

    public class WC {
        public String word;
        public int count;
    }

More at: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525
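
A rough idea of what working on a pool of pages means, sketched in plain Scala with `java.nio.ByteBuffer`. This is not Flink's memory manager: the `PagePool` class, the record layout, and the Scala case class `WC` (mirroring the Java class on the slide) are assumptions made for illustration. The point is that records live as bytes in pre-allocated pages, and a full page is an explicit signal to take another page or go to disk.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import scala.collection.mutable

object PagePoolSketch {
  final case class WC(word: String, count: Int)

  // Hypothetical pool: a fixed number of equally sized pages, allocated once.
  final class PagePool(numPages: Int, pageSize: Int) {
    private val free = mutable.Queue.fill(numPages)(ByteBuffer.allocate(pageSize))
    def request(): Option[ByteBuffer] =
      if (free.isEmpty) None else Some(free.dequeue())
    def release(page: ByteBuffer): Unit = { page.clear(); free.enqueue(page) }
    def available: Int = free.size
  }

  // Record layout (an assumption of this sketch):
  // [int: byte length of word][UTF-8 bytes of word][int: count]
  def write(page: ByteBuffer, wc: WC): Boolean = {
    val bytes = wc.word.getBytes(UTF_8)
    if (page.remaining() < 8 + bytes.length) false // page full
    else {
      page.putInt(bytes.length)
      page.put(bytes)
      page.putInt(wc.count)
      true
    }
  }

  // Reads back all records written so far, without disturbing the write position.
  def readAll(page: ByteBuffer): List[WC] = {
    val view = page.duplicate()
    view.flip()
    val out = List.newBuilder[WC]
    while (view.hasRemaining) {
      val bytes = new Array[Byte](view.getInt())
      view.get(bytes)
      out += WC(new String(bytes, UTF_8), view.getInt())
    }
    out.result()
  }
}
```

When `write` returns `false`, the caller decides what happens next (request another page, or write the full page out), which is what keeps memory use bounded by the pool size rather than by the number of objects.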

Smooth out-of-core performance
Single-core join of 1 KB Java objects beyond memory (4 GB)
Blue bars are in-memory, orange bars (partially) out-of-core
More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html

Benefits of managed memory
§ More reliable and stable performance (less GC effects, easy to go to disk)

Table API

    val customers = env.readCsvFile(…)
      .as('id, 'mktSegment)
      .filter('mktSegment === "AUTOMOBILE")

    val orders = env.readCsvFile(…)
      .filter(o => dateFormat.parse(o.orderDate).before(date))
      .as('orderId, 'custId, 'orderDate, 'shipPrio)

    val items = orders
      .join(customers).where('custId === 'id)
      .join(lineitems).where('orderId === 'id)
      .select('orderId, 'orderDate, 'shipPrio,
        'extdPrice * (Literal(1.0f) - 'discount) as 'revenue)

    val result = items
      .groupBy('orderId, 'orderDate, 'shipPrio)
      .select('orderId, 'revenue.sum, 'orderDate, 'shipPrio)

Iterations in Data Flows
Machine Learning Algorithms

Iterate by looping
[Diagram: a client submitting one Step after another]
§ for/while loop in client submits one job per iteration step
§ Data reuse by caching in memory and/or disk

Iterate in the Dataflow

Large-Scale Machine Learning
Factorizing a matrix with 28 billion ratings for recommendations (Scale of Netflix or Spotify)
More at: http://data-artisans.com/computing-recommendations-with-flink.html

State in Iterations
Graphs and Machine Learning

Iterate natively with deltas
[Diagram: the iteration starts from an initial workset and an initial solution. Each step (operators A, B, X, Y, possibly reading other datasets) produces a new workset that replaces the previous one, and a delta set whose entries are merged into the partial solution. The final partial solution is the iteration result.]
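
The workset / delta-set idea can be sketched in plain Scala on a small in-memory graph. The example computes connected components by propagating the smallest vertex id. It is a single-machine illustration only, not Flink's delta iteration operator. The object and method names are hypothetical, and the sketch assumes every edge endpoint is contained in `vertices`.

```scala
object DeltaIterationSketch {
  // Connected components by label propagation.
  // solution: vertex -> smallest component id seen so far (partial solution)
  // workset:  vertices whose label changed in the previous step
  // Assumption: every edge endpoint is contained in `vertices`.
  def connectedComponents(
      vertices: Set[Int],
      edges: Set[(Int, Int)]): (Map[Int, Int], List[Int]) = {
    val neighbors: Map[Int, Set[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map {
        case (v, es) => v -> es.map(_._2)
      }
    var solution: Map[Int, Int] = vertices.map(v => v -> v).toMap
    var workset: Map[Int, Int] = solution
    var updatesPerStep = List.empty[Int]
    while (workset.nonEmpty) {
      // Only changed vertices send their label to their neighbors.
      val candidates = for {
        (v, label) <- workset.toSeq
        n <- neighbors.getOrElse(v, Set.empty[Int]).toSeq
      } yield n -> label
      // Delta set: only entries that improve the partial solution.
      val delta = candidates
        .groupBy(_._1)
        .map { case (n, cs) => n -> cs.map(_._2).min }
        .filter { case (n, label) => label < solution(n) }
      solution = solution ++ delta // merge deltas
      workset = delta              // replace workset
      updatesPerStep = updatesPerStep :+ delta.size
    }
    (solution, updatesPerStep)
  }
}
```

For vertices 1 to 5 with edges (1, 2), (2, 3), (4, 5), the result assigns component 1 to vertices 1, 2, 3 and component 4 to vertices 4, 5. The number of updated elements per step is 3, 1, 0: each step touches fewer elements than the one before, so later steps get cheaper.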

Effect of delta iterations…
[Chart: # of elements updated per iteration]

… very fast graph analysis
§ Performance competitive with dedicated graph analysis systems
§ … and mix and match ETL-style and graph analysis in one program
More at: http://data-artisans.com/data-analysis-with-flink.html

Closing

Flink Roadmap for 2015
§ Out-of-core state in Streaming
§ Monitoring and scaling for streaming
§ Streaming Machine Learning with SAMOA
§ More additions to the libraries
  • Batch Machine Learning
  • Graph library additions (more algorithms)
§ SQL on top of expression language
§ Master failover

Flink community
[Chart: # of unique contributor ids by git commits, May 2010 to April 2015, y-axis from 0 to 120]

flink.apache.org
@ApacheFlink

Backup

Cornerpoints of Flink Design
§ Flexible Data Streaming Engine
  → Low Latency Stream Proc.
  → Highly flexible windows
§ Robust Algorithms on Managed Memory
  → No OutOfMemory Errors
  → Scales to very large JVMs
  → Efficient and robust processing
§ High-level APIs, beyond key/value pairs
  → Java/Scala/Python (upcoming)
  → Relational-style optimizer
§ Pipelined Execution of Batch Programs
  → Better shuffle performance
  → Scales to very large groups
§ Active Library Development
  → Graphs / Machine Learning
  → Streaming ML (coming)
§ Native Iterations
  → Very fast Graph Processing
  → Stateful Iterations for ML

Program optimization

A simple program

    val orders = …
    val lineitems = …

    // keep orders shipped after `date` with a ship priority above 2
    val filteredOrders = orders
      .filter(o => dateFormat.parse(o.shipDate).after(date))
      .filter(o => o.shipPrio > 2)

    // join the remaining orders with their line items
    val lineitemsOfOrders = filteredOrders
      .join(lineitems)
      .where("orderId").equalTo("orderId")
      .apply((o, l) => new SelectedItem(o.orderDate, l.extdPrice))

    // sum the prices per order date
    val priceSums = lineitemsOfOrders
      .groupBy("orderDate").sum("extdPrice")

Two execution plans
Best plan depends on relative sizes of input files
[Diagram: two alternative plans for the same program. Operators: DataSource (orders.tbl, lineitem.tbl), Filter, Map, Join (Hybrid Hash, build HT / probe), Combine, GroupRed (sort). Shipping strategies: forward, broadcast, hash-part [0], hash-part [0, 1].]

Examples of optimization
§ Task chaining
  • Coalesce map/filter/etc tasks
§ Join optimizations
  • Broadcast/partition, build/probe side, hash or sort-merge
§ Interesting properties
  • Re-use partitioning and sorting for later operations
§ Automatic caching
  • E.g., for iterations
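
Why the best plan depends on the relative input sizes can be shown with a toy cost model in plain Scala. The model is an assumption made for this sketch and is not Flink's optimizer: it counts only bytes sent over the network, charges a broadcast one full copy per parallel worker, and charges repartitioning one pass over each input. All names are hypothetical.

```scala
object JoinStrategySketch {
  sealed trait Ship
  case object BroadcastLeft extends Ship
  case object BroadcastRight extends Ship
  case object RepartitionBoth extends Ship

  final case class Plan(ship: Ship, buildSide: String, networkBytes: Long)

  // Toy cost model (assumption): only network bytes are counted.
  // Broadcasting an input sends one full copy to each of `parallelism` workers;
  // repartitioning sends each input over the network once.
  // The hash table is built on the smaller input.
  def choose(leftBytes: Long, rightBytes: Long, parallelism: Int): Plan = {
    val buildSide = if (leftBytes <= rightBytes) "left" else "right"
    val candidates = List(
      Plan(BroadcastLeft, buildSide, leftBytes * parallelism),
      Plan(BroadcastRight, buildSide, rightBytes * parallelism),
      Plan(RepartitionBoth, buildSide, leftBytes + rightBytes)
    )
    candidates.minBy(_.networkBytes)
  }
}
```

With a tiny left input, `choose(10L, 10000L, 4)` picks `BroadcastLeft` at a cost of 40 bytes. With inputs of similar size, `choose(5000L, 6000L, 4)` picks `RepartitionBoth` at a cost of 11000 bytes.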

Visualization

Visualization tools
[Three slides of screenshots]