Gearpump: Real-time DAG Processing at Scale (Sean Zhong)

- Slides: 56

Gearpump: Real-time DAG Processing at Scale
Sean Zhong (xiang.zhong@intel.com), Intel Software
Strata Singapore 2015

What is Gearpump?
• An Akka-based, lightweight, real-time data processing platform.
• Apache License, http://gearpump.io, version 0.7.
• Simple and powerful, message-level streaming, long-running daemons.
• What is Akka? Akka provides communication, concurrency, isolation, and fault tolerance.

What is Akka?
• Micro-service (actor) oriented
• Message-driven
• Lock-free
• Location-transparent
It is like human society, which is driven by messages and scales to a population of 7 billion.

Micro-service orientation: a higher abstraction
• Break your application into micro-services instead of objects.
• Throw away locks.
• Use immutable, asynchronous messages to exchange information between micro-services instead of shared objects.

Gearpump in the Big Data Stack
Diagram labels:
• Visualization & management: cluster manager, monitor/alert/notify, visualization, Cloudera Manager
• Analytics: SQL (Catalyst, StreamSQL, Impala), data exploration, machine learning, GraphX
• Engine: batch, stream (Storm, and Gearpump marked "Here!")
• Storage: store

Why another streaming platform?
• The requirements are not fully met by existing platforms.
• A higher abstraction, such as micro-service orientation, can greatly simplify the problem.

What we want
• Meet "The 8 Requirements of Real-Time Stream Processing" (2006).
Diagram labels (circled numbers refer to the eight requirements):
• Flexible: anywhere, any size, any source, any use case, dynamic DAG, ② StreamSQL
• Volume: high throughput, ⑦ scale linearly
• Speed: ① in-stream, zero latency, ⑥ HA, ⑧ responsive
• Accuracy: exactly-once, ③ message loss/delay/out of order, ④ predictable
• Visual: easy to debug, WYSIWYG

Overview

DAG representation and API
Low-level Graph API syntax: Graph(A ~> B ~> C ~> D, B ~> E ~> D)
Diagram labels: a DAG of processors A, B, C, D, and E; edges are labeled with field grouping or shuffle.
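To make the `~>` syntax on this slide concrete, here is a minimal, self-contained sketch of how such a chain operator could build a DAG. It does not use Gearpump's own `Graph` class; `GraphSketch`, `Path`, `NodeOps`, and `downstream` are hypothetical names chosen for illustration, and nodes are plain strings rather than processors.

```scala
object GraphSketch {
  // A chain of node names such as A ~> B ~> C (hypothetical helper type).
  final case class Path(nodes: Vector[String]) {
    def ~>(next: String): Path = Path(nodes :+ next)
    // Each consecutive pair in the chain is one directed edge.
    def edges: Set[(String, String)] = nodes.zip(nodes.drop(1)).toSet
  }

  // Lets a plain string start a chain: "A" ~> "B".
  implicit class NodeOps(name: String) {
    def ~>(next: String): Path = Path(Vector(name, next))
  }

  final class Graph(val vertices: Set[String], val edges: Set[(String, String)]) {
    // Direct successors of a node.
    def downstream(node: String): Set[String] =
      edges.filter(_._1 == node).map(_._2)
  }

  object Graph {
    // Merges several chains into one DAG; shared nodes and edges are deduplicated.
    def apply(paths: Path*): Graph = {
      val vertices: Set[String] = paths.flatMap(_.nodes).toSet
      val edges: Set[(String, String)] = paths.flatMap(_.edges).toSet
      new Graph(vertices, edges)
    }
  }
}
```

With `import GraphSketch._`, the slide's example reads `Graph("A" ~> "B" ~> "C" ~> "D", "B" ~> "E" ~> "D")`, which yields five vertices and five edges, with B fanning out to C and E.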

Architecture: Actor Hierarchy
Each application has one isolated AppMaster and uses the actor supervision tree for error handling.
Diagram labels: Client (hooks in and queries state); Master cluster (HA design); YARN; runs as a general service.

Architecture: Master HA (no SPOF)
• Akka Cluster for a centerless HA system
• Conflict-free replicated data types (CRDT) for consistency
Diagram labels: CRDT data type example; leader Master, standby Masters, and Workers; master state is exchanged by gossip inside the Akka Cluster.
Decentralized: does not rely on a single central meta server.

Feature Highlights
• Akka, Akka Streams, and Storm compatible
• Throughput: 14 million/s [*]
• Latency: 2 ms [*]
• Exactly-once
• Dynamic DAG
• Out-of-order messages
• Flexible DSL
• DAG visualization
• Internet of Things
Diagram labels: function, usability.
[*] Test environment: Intel® Xeon™ Processor E5-2690, 4 nodes, each node with 32 CPU cores and 64 GB memory, 10 GbE network. We use the default configuration of the Gearpump 0.7 release; see the backup page for more details. We use the SOL workload, message size 100 bytes (https://github.com/intel-hadoop/storm-benchmark). Tests carried out by the Intel team.

Using Gearpump

Three steps to use it
1. Download the binary from http://gearpump.io
2. Submit the jar through the UI
3. Monitor status

Application Submission Flow (with or without YARN)
1. Submit a jar
2. Create AppMaster
3. Request resource (with YARN: ask YARN for resource)
4. Report Executor to AppMaster
Diagram labels: Master, Workers, AppMaster, Executor, YARN.

Low-level Graph API: WordCount (available in Scala and Java)

    val context = new ClientContext()
    val split = Processor[Split](splitParallelism)
    val sum = Processor[Sum](sumParallelism)
    val app = StreamApplication("wordCount", Graph(split ~> sum), UserConfig.empty)
    val appId = context.submit(app)
    context.close()

    class Split(taskContext: TaskContext, conf: UserConfig) extends Task(taskContext, conf) {
      override def onNext(msg: Message): Unit = { /* split the line */ }
    }

    class Sum(taskContext: TaskContext, conf: UserConfig) extends Task(taskContext, conf) {
      val count = /** count of words **/
      override def onNext(msg: Message): Unit = { /* do aggregation on word */ }
    }

High-level DSL API: WordCount

    val context = ClientContext()
    val app = new StreamApp("dsl", context)
    val data = "This is a good start, bingo!!"
    app.fromCollection(data.lines)
      // word => (word, count = 1)
      .flatMap(line => line.split("[\\s]+")).map((_, 1))
      // (word, count1), (word, count2) => (word, count1 + count2)
      .groupByKey().sum.log
    val appId = context.submit(app)
    context.close()

Akka-stream API: WordCount

    implicit val system = ActorSystem("akka-test")
    implicit val materializer = new GearpumpMaterializer(system)
    val echo = system.actorOf(Props(new Echo()))
    val sink = Sink.actorRef(echo, "COMPLETE")
    val source = GearSource.from[String](new CollectionDataSource(lines))
    source.mapConcat { line => line.split(" ").toList }
      .groupBy2(x => x)
      .map(word => (word, 1))
      .reduce { (a, b) => (a._1, a._2 + b._2) }
      .log("word-count")
      .runWith(sink)

Available at branch https://github.com/gearpump/tree/akkastream

UI Portal: DAG Visualization
DAG page: tracks the global min-clock of all messages.
DAG:
• Node size reflects throughput
• Edge width represents flow rate
• A red node means something has gone wrong

UI Portal: Processor Detail
Processor page:
• Data skew distribution
• Task throughput and latency

Performance optimization

Throughput and Latency
• Throughput: 11 million messages/second
• Latency: 17 ms at full load
SOL shuffle test, 32 tasks -> 32 tasks.
[*] Test environment: Intel® Xeon™ Processor E5-2680, 4 nodes, each node with 32 CPU cores and 64 GB memory, 10 GbE network. We use the default configuration of the Gearpump 0.2 release. We use the SOL workload, message size 100 bytes (https://github.com/intel-hadoop/storm-benchmark). Tests carried out by the Intel team.

Scalability
• Test run on 100 nodes [*] and 3000 tasks
• Gearpump performance scales (chart label: 100 nodes)
[*] We use 8 machines to simulate 100 worker nodes. Test environment: Intel® Xeon™ Processor E5-2680, each node with 32 CPU cores and 64 GB memory, 10 GbE network. We use the default configuration of the Gearpump 0.3.5 release. We use the SOL workload, message size 100 bytes (https://github.com/intel-hadoop/storm-benchmark). Tests carried out by the Intel team.
![Fault-Tolerance: Recovery time, 91 worker nodes, 1000 tasks](https://slidetodoc.com/presentation_image_h2/e5ab93c3269381e977adaa28924d1088/image-24.jpg)
Fault-Tolerance: Recovery time [*]
91 worker nodes, 1000 tasks.
[*] Recovery time is the interval between (a) the failure happening and (b) all tasks in the topology resuming data processing.
Test environment: Intel® Xeon™ Processor E5-2680, 91 worker nodes, 1000 tasks (we use 7 machines to simulate 91 worker nodes). Each node has 32 CPU cores and 64 GB memory, 10 GbE network. We use the default configuration of the Gearpump 0.3.5 release. We use the SOL workload (https://github.com/intel-hadoop/storm-benchmark). Tested by the Intel team.

High-performance Messaging Layer
• Akka remote messages carry a large overhead (sender + receiver address).
• Reduce the overhead by 95% (400 bytes to ~20 bytes).
Diagram labels: sync with other executors; convert to short address; effective batching; convert from short address.
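The "convert to short address" and "convert from short address" steps can be pictured as a lookup table that both sides agree on. The sketch below is only an illustration of that idea, not Gearpump's transport code: `AddressTable`, `shorten`, and `expand` are hypothetical names, and the synchronization of the table across executors that the slide mentions is left out.

```scala
import scala.collection.mutable

object ShortAddressSketch {
  // Hypothetical table; the slide implies it is kept in sync across executors.
  final class AddressTable {
    private val idByAddress = mutable.Map.empty[String, Int]
    private val addressById = mutable.ArrayBuffer.empty[String]

    // Sender side: replace a full actor address with a small integer id.
    def shorten(address: String): Int =
      idByAddress.getOrElseUpdate(address, {
        addressById += address
        addressById.size - 1
      })

    // Receiver side: turn the id back into the full address.
    def expand(id: Int): String = addressById(id)
  }
}
```

Each message then carries a few bytes of id instead of a full address string, which is the kind of saving the slide describes.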

Effective batching
• Network idle: flush as fast as we can.
• Network busy: smart batching until the network is open again.
Network bandwidth doubled for 100-byte messages.
This feature is ported from STORM-297. Test environment: same as STORM-297.

High-performance Flow Control
Pass back-pressure level by level, with about 1% throughput impact.
1. No central ack nodes.
2. Each level knows the network status and can therefore optimize network usage.
Diagram labels: Task, Task, Task; back-pressure; sliding window.
Another option (not used): big-loop feedback flow control.
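A sliding window is one common way to get level-by-level back-pressure: a sender keeps at most a fixed number of unacknowledged messages in flight and stops when the window is full. The following is a minimal sketch under that assumption; `SlidingWindow`, `trySend`, and `onAck` are hypothetical names, and the window size is an arbitrary parameter rather than a Gearpump default.

```scala
object FlowControlSketch {
  // Hypothetical per-edge window between one upstream and one downstream task.
  final class SlidingWindow(windowSize: Int) {
    private var sent = 0L
    private var acked = 0L

    def inFlight: Long = sent - acked
    def canSend: Boolean = inFlight < windowSize

    // Returns true if the message was admitted, false if the caller must wait.
    def trySend(): Boolean =
      if (canSend) { sent += 1; true } else false

    // Downstream reports the total number of messages it has processed so far.
    def onAck(ackedCount: Long): Unit =
      acked = math.max(acked, math.min(ackedCount, sent))
  }
}
```

When a downstream task slows down, its acks arrive late, the window fills, and the upstream task pauses; that pause in turn delays the upstream task's own acks, so pressure propagates backwards without a central ack node.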

Use cases

What is a good use case for Gearpump?
• When you want exactly-once message processing with millisecond latency.
• When you want to integrate with Akka and Akka-stream transparently.
• When you want dynamic modification of an online DAG.
• When you want to connect with IoT edge devices with location transparency.
• When you want a simple scheduler to distribute customized applications, such as log collection or distributed cron.
• When you want to use Monoid state such as Count-Min Sketch.
Besides, it can integrate with YARN, Kafka, Storm, etc.

IoT Transparent Cloud
Target problem: a large gap between edge devices and the data center.
• Location transparent.
• Unified programming model across the boundary.
Diagram labels: DAG on device side; data center; log.
Case: Intelligent Traffic System, 3000 traffic lights, travel time, overspeed detection…

Exactly-once: Financial use cases
Target problem: both real-time processing and accuracy are important.
Diagram labels: program trading; real-time stock index; other data sources; crawlers; process; alerts; rules; actions; reports; transaction; account.

Transformers: Dynamic DAG
Target problem: there is no existing way to manipulate the DAG on the fly.
It can:
• change parallelism online to scale out without message loss
• add or remove source/sink processors dynamically
Diagram labels: Add, Replace, Delete (processor B).
Each processor can have its own independent jar.

Eve: Online Machine Learning
Target problem: train and predict online with ML for real-time decision support.
• Decide immediately based on the online learning result.
Diagram labels: Input (sensor), Learn, Predict, Decide, Output.
TAP analytics platform integration: http://trustedanalytics.org/

Gearpump Internals

General ideas (normal flow)
• Every message in the system has an application timestamp (the message's birth time).
• The MinClock service tracks the minimum timestamp of all pending messages in the system, now and in the future.
Diagram labels: replayable source; Message(timestamp); DAG; State; MinClock service.

General ideas (recovery flow)
① Detect message loss at Tp
② Clock pauses at Tp
③ Replay from Tp
④ Exactly-once: state can be reconstructed by message replay
Diagram labels: MinClock service; replayable source; Message(timestamp); DAG; State; Checkpoint Store.

1. Detect message loss
2. Pause the clock at Tp when a message is lost
3. Replay from clock Tp
4. Exactly-once

Detect Failure in Time
Easy to troubleshoot: when an error happens, we know
§ when
§ where
§ why
Failure: detected through the supervision chain (Master, AppMaster, Executor, Task).
Message loss: detected with AckRequest and Ack.
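One way to read "AckRequest and Ack to detect message loss" is as a count comparison: the sender states how many messages it has sent, the receiver answers with how many it has seen, and a shortfall signals loss. The sketch below assumes exactly that protocol; the field names, the `Receiver` class, and the `messageLost` helper are hypothetical and are not taken from Gearpump's source.

```scala
object AckSketch {
  // Hypothetical message shapes; sessionId ties a request to its reply.
  final case class AckRequest(sessionId: Int, sentCount: Long)
  final case class Ack(sessionId: Int, receivedCount: Long)

  final class Receiver {
    private var received = 0L
    def onMessage(): Unit = received += 1
    // Replies with the number of messages seen so far.
    def onAckRequest(req: AckRequest): Ack = Ack(req.sessionId, received)
  }

  // Loss is reported when the receiver has seen fewer messages than were sent.
  def messageLost(req: AckRequest, ack: Ack): Boolean =
    ack.sessionId == req.sessionId && ack.receivedCount < req.sentCount
}
```

For example, if the sender has sent three messages and the reply reports two, `messageLost` returns true and recovery can start from the clock value at that point.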

Recovering the runtime when a machine crashes
1. Quarantine
① Error detected
② Fence zombie
Use a dynamic session ID to fence zombies.
Diagram labels: global clock service; AppMaster; Source; Executor and Task; Store; send message.

DAG Recovery: Quarantine and Recover
2. Recover the executor JVM and replay messages
① Error detected
② Isolate zombie
③ Recover
Diagram labels: AppMaster; replay; Source; global clock service; Store; Executor and Task; send message.

1. Detect message loss
2. Pause the clock at Tp when a message is lost
3. Replay from clock Tp
4. Exactly-once

Application's Clock Service
Each task reports its task min-clock to the clock service.
Definition: the task min-clock is the minimum of
• the minimum timestamp of pending messages in the current task, and
• the task min-clock of all upstream tasks.
The clock is ever-incremental.
Diagram labels: level clock; clock of D; A 1000 (later), B 800, C, D, E, 600, 400 (earlier).
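The definition above is recursive, so it can be written down almost literally. This is a minimal sketch, assuming the DAG is given as a map from each task to its upstream tasks and that a task with no pending messages contributes `Long.MaxValue`; `MinClockSketch`, `taskMinClock`, and both map parameters are hypothetical names, not Gearpump's clock service API.

```scala
object MinClockSketch {
  // upstream: task -> its direct upstream tasks (must be acyclic).
  // pending:  task -> timestamps of messages still pending in that task.
  def taskMinClock(task: String,
                   upstream: Map[String, Set[String]],
                   pending: Map[String, Seq[Long]]): Long = {
    // Minimum timestamp of this task's own pending messages.
    val own = pending.getOrElse(task, Seq.empty).foldLeft(Long.MaxValue)(_ min _)
    // Combine with the min-clock of every upstream task.
    upstream.getOrElse(task, Set.empty[String])
      .map(up => taskMinClock(up, upstream, pending))
      .foldLeft(own)(_ min _)
  }
}
```

For a chain A -> B -> C where A holds a message stamped 1000, B holds messages stamped 800 and 900, and C holds nothing, the min-clock of C is 800: no message older than 800 can still reach C.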

1. Detect message loss
2. Pause the clock at Tp when a message is lost
3. Replay from clock Tp
4. Exactly-once

Source-based Message Replay (normal flow)
Replay from the very-beginning source.
Source: like the offset of a Kafka queue -> message timestamp.

Source-based Message Replay (recovery flow)
Replay from the very-beginning source.
① Resolve the offset with timestamp Tp (global clock service)
② Replay from the offset
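The two recovery steps can be sketched against an in-memory log standing in for a Kafka-like queue. This is an illustration only: `Record`, `ReplayableSource`, `resolveOffset`, and `replayFrom` are hypothetical names, and the sketch assumes the log is ordered by offset and that replay should start at the first offset whose message timestamp is at least Tp.

```scala
object ReplaySketch {
  final case class Record(offset: Long, timestamp: Long, payload: String)

  // Hypothetical replayable source backed by a log ordered by offset.
  final class ReplayableSource(log: Vector[Record]) {
    // Step 1: resolve timestamp Tp to the first offset with timestamp >= Tp.
    def resolveOffset(tp: Long): Option[Long] =
      log.collectFirst { case r if r.timestamp >= tp => r.offset }

    // Step 2: replay every record from that offset onwards.
    def replayFrom(tp: Long): Vector[Record] =
      resolveOffset(tp).fold(Vector.empty[Record])(off => log.filter(_.offset >= off))
  }
}
```

Records replayed from that offset may include some with timestamps below Tp if the log is out of order, which is why the state side still has to filter by timestamp.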

1. Detect message loss
2. Pause the clock at Tp when a message is lost
3. Replay from clock Tp
4. Exactly-once

Exactly-once message processing
Key: ensure State(t) only contains messages with timestamp <= t.
How? The DAG runtime writes append-only checkpoints to the Checkpoint Store.

Exactly-once message processing (normal flow)
Two states:
• one state accepts only messages with t < Tc
• one state accepts all messages
Diagram labels: Messages; Streaming System; Checkpoint Store.

Exactly-once message processing (recovery flow)
Two states:
• one state accepts only messages with t < Tc
• one state accepts all messages
On failure, recover the checkpoint from the Checkpoint Store.
Diagram labels: Messages; Streaming System; Checkpoint Store.
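The two-state idea can be shown with a simple counter. In this sketch one counter only accepts messages with timestamp below the checkpoint time Tc and is what gets persisted, while the other accepts everything; on recovery the persisted value is restored and replayed messages older than Tc are ignored. `CountingTask` and `recover` are hypothetical names, the state is a plain count chosen for brevity, and the boundary follows the `t < Tc` wording of the two-states slide.

```scala
object ExactlyOnceSketch {
  // Hypothetical task that counts messages and keeps two states.
  final class CountingTask(checkpointTime: Long) {
    private var checkpointState = 0L // accepts only messages with t < Tc
    private var liveState = 0L       // accepts all messages

    def onMessage(timestamp: Long): Unit = {
      if (timestamp < checkpointTime) checkpointState += 1
      liveState += 1
    }

    def checkpoint: Long = checkpointState
    def live: Long = liveState
  }

  // Recovery: restore the checkpointed count, then apply only replayed
  // messages with t >= Tc so that nothing is counted twice.
  def recover(checkpointed: Long, replayed: Seq[Long], checkpointTime: Long): Long =
    checkpointed + replayed.count(_ >= checkpointTime)
}
```

Because the checkpoint holds exactly the messages older than Tc, replaying from Tc rebuilds the live count even when the replay contains older or out-of-order messages.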

How to do Dynamic DAG? Multiple-version DAG
Target problem: replace processor B with B' at time Tc.
• DAG (version = 0): A -> B -> C, used for messages with time < Tc
• DAG (version = 1): A -> B' -> C, used for messages with time >= Tc
No message is lost during the transition.
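The routing rule on this slide, where messages before Tc follow version 0 and messages at or after Tc follow version 1, can be sketched as a lookup over versioned DAGs. The names `DagVersion`, `effectiveFrom`, and `route` are hypothetical, and each DAG is reduced to a list of processor names for illustration.

```scala
object DynamicDagSketch {
  final case class Message(time: Long, payload: String)
  // effectiveFrom is the switch-over time Tc at which this version starts.
  final case class DagVersion(version: Int, effectiveFrom: Long, path: List[String])

  // Picks the newest version whose start time is not after the message time.
  def route(msg: Message, versions: Seq[DagVersion]): Option[DagVersion] =
    versions.filter(_.effectiveFrom <= msg.time).sortBy(_.effectiveFrom).lastOption
}
```

Since every message carries its own timestamp, both versions can run side by side during the transition and each message is processed by exactly one of them.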

Demo

Live demo

References
• Zhong Xiang (钟翔), "Software Architecture Paradigm for the Big Data Era: Reactive Architecture and Akka in Practice", Programmer magazine (程序员), 2015, issue 2A
• Gearpump whitepaper: http://typesafe.com/blog/gearpump-real-time-streaming-engine-using-akka
• Wu Gansha (吴甘沙), "The Comeback of Low-Latency Stream Processing Systems", Programmer magazine (程序员), 2013, issue 10
• Stonebraker: http://cs.brown.edu/~ugur/8rules.SigRec.pdf
• Gearpump: https://github.com/intel-hadoop/gearpump
• http://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/
• https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
• Sqlstream: http://www.sqlstream.com/customers/
• http://www.statsblogs.com/2014/05/19/a-general-introduction-to-stream-processing/
• http://www.statalgo.com/2014/05/28/stream-processing-with-messaging-systems/
• Gartner report on IoT: http://www.zdnet.com/article/internet-of-things-devices-will-dwarf-number-of-pcs-tablets-and-smartphones/

Legal Disclaimers
1. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
2. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
3. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
For more information go to http://www.intel.com/performance.
Intel, the Intel logo, Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2015 Intel Corporation

Latest performance evaluation on Gearpump 0.7
This is the latest performance test on version 0.7. Please see the embedded document on the right for the configuration details.
