
Spark: Fast, Interactive, Language-Integrated Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
www.spark-project.org
UC Berkeley

Project Goals
Extend the MapReduce model to better support two common classes of analytics apps:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining
Enhance programmability:
» Integrate into Scala programming language
» Allow interactive use from Scala interpreter

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.
[Diagram: Input → Map tasks → Reduce tasks → Output]
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
» Iterative algorithms (machine learning, graphs)
» Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query.

Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for efficient reuse
Retain the attractive properties of MapReduce:
» Fault tolerance, data locality, scalability
Support a wide range of applications

Outline
» Spark programming model
» Implementation
» Demo
» User applications

Programming Model
Resilient distributed datasets (RDDs):
» Immutable, partitioned collections of objects
» Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
» Can be cached for efficient reuse
Actions on RDDs:
» count, reduce, collect, save, …
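Transformations are lazy: they only record how a new RDD derives from its parents, while actions trigger actual computation. A minimal sketch of the distinction (assuming a context variable named spark, as used on the slides that follow; parallelize builds an RDD from a local collection):

  val nums  = spark.parallelize(1 to 1000)      // RDD from a local collection
  val evens = nums.filter(_ % 2 == 0).cache()   // transformation: nothing runs yet
  evens.count()                                  // action: computes and caches evens
  evens.reduce(_ + _)                            // action: reuses the cached partitions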

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

  lines = spark.textFile("hdfs://...")           // base RDD
  errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
  messages = errors.map(_.split('\t')(2))
  cachedMsgs = messages.cache()

  cachedMsgs.filter(_.contains("foo")).count     // action
  cachedMsgs.filter(_.contains("bar")).count
  ...

[Diagram: the driver ships tasks to workers; each worker reads its input block (Block 1-3) and keeps its partition of messages in memory (Cache 1-3)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
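The snippet above assumes an already-constructed context named spark. A self-contained sketch using the classic two-argument SparkContext constructor of this era (an assumption, as is the local input path standing in for the elided HDFS URL):

  import spark.SparkContext   // the package was simply `spark` at this time (assumption)

  object LogMining {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local", "LogMining")
      val lines = sc.textFile("access.log")                // stand-in path
      val cachedMsgs = lines.filter(_.startsWith("ERROR"))
                            .map(_.split('\t')(2))         // 3rd tab-separated field
                            .cache()
      println(cachedMsgs.filter(_.contains("foo")).count)
      println(cachedMsgs.filter(_.contains("bar")).count)
    }
  }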

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions. Ex:

  messages = textFile(...).filter(_.startsWith("ERROR"))
                          .map(_.split('\t')(2))

[Lineage diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
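To make the recovery idea concrete, here is a conceptual sketch in plain Scala (not Spark internals): a lost partition is rebuilt by re-reading its input block and re-applying the transformations recorded in the lineage. readHdfsBlock and its canned data are hypothetical stand-ins.

  object LineageRecovery {
    // Hypothetical stand-in for re-reading one input block from HDFS.
    def readHdfsBlock(blockId: Int): Seq[String] =
      Seq("ERROR\tdisk\tfull", "INFO\tall\tfine", "ERROR\tnet\ttimeout")

    // Re-apply the recorded lineage to the re-read block: filter, then map.
    def recomputePartition(blockId: Int): Seq[String] =
      readHdfsBlock(blockId)
        .filter(_.startsWith("ERROR"))
        .map(_.split('\t')(2))

    def main(args: Array[String]): Unit =
      println(recomputePartition(0))  // prints List(full, timeout)
  }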

Example: Logistic Regression
Goal: find the best line separating two sets of points
[Figure: two classes of points (+ and –); a random initial line is iteratively adjusted toward the target separator]

Example: Logistic Regression

  val data = spark.textFile(...).map(readPoint).cache()
  var w = Vector.random(D)
  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce(_ + _)
    w -= gradient
  }
  println("Final w: " + w)
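The slide leaves readPoint, Point, Vector, D, and ITERATIONS undefined. A minimal plain-Scala sketch of plausible definitions (assumptions for illustration, not the actual Spark example code) under which the loop above typechecks:

  import scala.math.exp
  import scala.util.Random

  val D = 10           // assumed dimensionality
  val ITERATIONS = 10  // assumed iteration count

  case class Vector(elems: Array[Double]) {
    def dot(o: Vector): Double = elems.zip(o.elems).map { case (a, b) => a * b }.sum
    def *(s: Double): Vector   = Vector(elems.map(_ * s))
    def -(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a - b })
    def +(o: Vector): Vector   = Vector(elems.zip(o.elems).map { case (a, b) => a + b })
  }
  object Vector {
    def random(d: Int): Vector = Vector(Array.fill(d)(2 * Random.nextDouble - 1))
  }
  // Lets the slide's `... * p.y * p.x` (scalar times vector) compile.
  implicit class Scalar(s: Double) { def *(v: Vector): Vector = v * s }

  case class Point(x: Vector, y: Double)  // y is the label (+1 / -1)

  // Assumed input format: "label f1 f2 ... fD", space-separated.
  def readPoint(line: String): Point = {
    val parts = line.split(' ').map(_.toDouble)
    Point(Vector(parts.tail), parts.head)
  }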

Logistic Regression Performance
[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark — Hadoop: 127 s per iteration; Spark: 174 s for the first iteration, 6 s per further iteration]

Spark Applications
» In-memory data mining on Hive data (Conviva)
» Predictive analytics (Quantifind)
» City traffic prediction (Mobile Millennium)
» Twitter spam classification (Monarch)
» Collaborative filtering via matrix factorization
» …

Conviva GeoReport
[Chart: Time (hours) — Hive: 20; Spark: 0.5]
Aggregations on many keys w/ same WHERE clause
40× gain comes from:
» Not re-reading unused columns or filtered records
» Avoiding repeated decompression
» In-memory storage of deserialized objects

Frameworks Built on Spark
Pregel on Spark (Bagel):
» Google's message passing model for graph computation
» 200 lines of code
Hive on Spark (Shark):
» 3000 lines of code
» Compatible with Apache Hive
» ML operators in Scala

Implementation
Runs on Apache Mesos to share resources with Hadoop & other apps
Can read from any Hadoop input source (e.g. HDFS)
No changes to Scala compiler
[Diagram: Spark, Hadoop, MPI, … running side by side on Mesos across cluster nodes]

Spark Scheduler
» Dryad-like DAGs
» Pipelines functions within a stage
» Cache-aware work reuse & locality
» Partitioning-aware to avoid shuffles
[Diagram: DAG of RDDs A-G split into Stage 1 (groupBy), Stage 2 (map, union), and Stage 3 (join); shaded boxes mark cached data partitions]
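As an illustration of the pipelining and stage boundaries described above (a sketch assuming the context variable spark and hypothetical data): flatMap and map run together in one stage, while groupByKey introduces a shuffle and begins a new stage.

  val words = spark.textFile("hdfs://...")
    .flatMap(_.split(' '))              // pipelined within Stage 1
    .map(w => (w, 1))                   // still Stage 1: no data movement
  val counts = words.groupByKey()       // shuffle boundary: Stage 2 begins
    .mapValues(_.sum)
  counts.collect()                      // action: runs the whole DAG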

Interactive Spark
Modified Scala interpreter to allow Spark to be used interactively from the command line
Required two changes:
» Modified wrapper code generation so that each line typed has references to objects for its dependencies
» Distribute generated classes over the network
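A hypothetical session showing what this enables — RDDs defined and used line by line, with the closures typed at the prompt shipped to workers as generated classes:

  scala> val lines = spark.textFile("hdfs://...")
  scala> val errors = lines.filter(_.startsWith("ERROR"))
  scala> errors.cache()
  scala> errors.count()   // runs on the cluster, returns the count to the driver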

Demo

Conclusion
Spark provides a simple, efficient, and powerful programming model for a wide range of apps
Download our open source release: www.spark-project.org
matei@berkeley.edu

Related Work
DryadLINQ, FlumeJava
» Similar “distributed collection” API, but cannot reuse datasets efficiently across queries
Relational databases
» Lineage/provenance, logical logging, materialized views
GraphLab, Piccolo, BigTable, RAMCloud
» Fine-grained writes similar to distributed shared memory
Iterative MapReduce (e.g. Twister, HaLoop)
» Implicit data sharing for a fixed computation pattern
Caching systems (e.g. Nectar)
» Store data in files, no explicit control over what is cached

Behavior with Not Enough RAM
[Chart: iteration time (s) vs % of working set in memory — cache disabled: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; fully cached: 11.5]

Fault Recovery Results
[Chart: iteration time (s) for iterations 1-10, comparing a run with no failure (~56-59 s per iteration) against a run with a failure in the 6th iteration (119 s for that iteration, then back to normal as lost partitions are rebuilt)]

Spark Operations
Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to driver program): collect, reduce, count, save, lookupKey
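A small sketch combining several of these operations (the context variable spark and the inline data are assumed for illustration): transformations compose lazily, and the actions at the end return results to the driver.

  val visits = spark.parallelize(Seq(("index.html", "u1"), ("about.html", "u2"),
                                     ("index.html", "u3")))
  val titles = spark.parallelize(Seq(("index.html", "Home"), ("about.html", "About")))
  val joined = visits.join(titles)           // transformation: (url, (user, title))
  val counts = visits.map { case (url, _) => (url, 1) }
                     .reduceByKey(_ + _)     // transformation: visits per page
  println(counts.collect().mkString(", "))   // action: fetch results to the driver
  println(joined.count())                    // action: number of joined records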