Spark: Looking Back, Looking Forward Patrick Wendell Databricks
Welcome to Databricks! Founded by creators of Spark… donated Spark to the Apache Software Foundation in 2013. Databricks Cloud – integrated analytics platform based on Apache Spark (limited beta). http://databricks.com/registration New office, so pardon any kinks!
About Me Work at Databricks managing the Spark team. Spark 1.2 release manager. Committer on Spark since the Berkeley days.
Agenda for Today Reflections and directions for Spark. Deeper dive into new APIs in Spark SQL and MLlib. Committer panel / Q&A.
Show of Hands! How familiar are you with Spark? A. Heard of it, but haven't used it before. B. Kicked the tires with some basics. C. Worked or working on a proof-of-concept deployment. D. Worked or working on a production deployment.
A Bit about Spark… The Spark stack: user apps run on Spark Streaming (DStreams: streams of RDDs, real-time), GraphX (RDD-based graphs), MLlib (RDD-based matrices, machine learning), and Spark SQL (RDD-based tables), all built on the Spark RDD API. Data sources: Kafka, S3, Cassandra, HDFS. Cluster managers: YARN, Mesos, Standalone.
Spark in 2014 – Project Growth Code patches: 1071 (2013) to 3567 (2014). Contributors: 137 (2013) to 417 (2014).
Spark in 2014 – User Growth Downloads in the 30 days from release (sampled), across Spark 0.9 (Feb), Spark 1.0 (May), and Spark 1.1 (Sep).
Spark in 2014 – Broader Ecosystem Now supported by all major Hadoop vendors… But also beyond Hadoop…
Spark in 2014 – Major additions Usability and portability of core engine API stability (Spark 1.0!) Vastly expanded UI and instrumentation Integration with Hadoop security Disk-spilling and shuffle optimizations Feature coverage for libraries Spark SQL library and SchemaRDDs GraphX library Expansion of MLlib
So… what’s coming?
New Technical Directions in 2015 SchemaRDDs as a common interchange format. Data-frame style APIs From developers to data scientists Extensibility and pluggable APIs Data source API (SQL) Pipelines API (MLlib) Receiver API (Streaming) Spark Packages
SchemaRDDs as a Key Concept RDD: “Immutable partitioned collection of elements”. SchemaRDD: “An RDD of Row objects that has an associated schema”.
Why SchemaRDDs are Useful Having structure and types is very powerful. Allows us to optimize performance more easily. Fosters interoperability between libraries (Spark’s and third party). Enables higher-level and safer user-facing APIs.
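To make the definition concrete, here is a minimal plain-Python sketch (not Spark code) of the idea: a partitioned collection of Row objects plus an explicit schema naming and typing each field. All names and sample data are illustrative.

```python
from collections import namedtuple

# A tiny stand-in for a SchemaRDD: rows plus an explicit schema.
# Field names and types here are made up for illustration.
schema = [("customer", str), ("units", int), ("totalPrice", float)]
Row = namedtuple("Row", [name for name, _ in schema])

# "Immutable partitioned collection of elements":
# a list of partitions, each a list of rows.
partitions = [
    [Row("alice", 3, 29.97), Row("bob", 1, 9.99)],
    [Row("alice", 2, 19.98)],
]

def validate(partitions, schema):
    """Knowing the schema lets the engine type-check rows up front."""
    for part in partitions:
        for row in part:
            for (name, typ), value in zip(schema, row):
                assert isinstance(value, typ), f"{name} is not {typ.__name__}"
    return True

print(validate(partitions, schema))  # -> True
```

Because every row carries the same declared structure, libraries can exchange such collections without re-inspecting the data, which is the interoperability point the slide makes.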
SchemaRDDs – Data Frame APIs
# Pandas 'data frame' style
lineitems.groupby('customer').agg(Map(
  'units' -> 'avg',
  'totalPrice' -> 'std'))
# or SQL style
SELECT AVG(units), STD(totalPrice)
FROM lineitems GROUP BY customer
SchemaRDDs – Data Frame APIs Data frame APIs are more familiar to data scientists and easier to use. Many user issues would be solved by writing against such APIs.
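As a concrete illustration of what the slide's aggregation computes, here is a plain-Python sketch (neither Spark's nor pandas' actual API) that groups line items by customer and takes the average of units and the standard deviation of totalPrice, over made-up sample data.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Made-up line items: (customer, units, totalPrice)
lineitems = [
    ("alice", 3, 10.0),
    ("alice", 1, 20.0),
    ("bob",   2, 15.0),
    ("bob",   4, 25.0),
]

# GROUP BY customer
groups = defaultdict(list)
for customer, units, total_price in lineitems:
    groups[customer].append((units, total_price))

# SELECT AVG(units), STD(totalPrice)
result = {
    customer: (mean(u for u, _ in rows), pstdev(p for _, p in rows))
    for customer, rows in groups.items()
}
print(result)  # -> {'alice': (2, 5.0), 'bob': (3, 5.0)}
```

The data-frame and SQL forms on the previous slide both describe exactly this group-then-aggregate computation, just at a higher level where the engine can optimize it.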
SchemaRDDs – Interoperability Any data source made available to Spark SQL is instantly available in Java, Python, and R, with correct types. Major internal APIs (such as the ML pipelines API) can make assumptions about input RDDs.
Spark with SchemaRDDs The same stack, with a shared layer: Spark Streaming (DStreams), GraphX, MLlib, and Spark SQL all sit on the SchemaRDD / Data Frame API, which sits on the base RDD API. Data sources: Kafka, S3, Cassandra, HDFS. Cluster managers: YARN, Mesos, Standalone.
Technical Directions in 2015 SchemaRDDs as a common interchange format. “Data frame” style APIs From developers to data scientists Extensibility and pluggable APIs Data source API (SQL) Pipelines API (MLlib) Receiver API (Streaming) Spark Packages
Spark SQL – Initial Input Support Hive metastore tables JSON (built in) Parquet (built in) Any Spark RDD + user-schema creation
Spark SQL – Data Sources API A data source plugs in by providing a table scan/sink, optimization rules, and a table catalog entry; results surface as a SchemaRDD.
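To show the shape of such a pluggable interface, here is a hypothetical plain-Python sketch (not Spark's actual data sources API): a source declares its schema and a scan method, and the engine can then treat every implementation uniformly. Class and field names are invented for illustration.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Hypothetical pluggable source: declares a schema and a table scan."""

    @abstractmethod
    def schema(self):
        """Return (name, type) pairs describing the rows produced."""

    @abstractmethod
    def scan(self):
        """Yield rows matching the schema."""

class InMemoryJsonLikeSource(DataSource):
    # A toy "built-in" source backed by Python dicts instead of JSON files.
    def __init__(self, records):
        self.records = records

    def schema(self):
        return [("name", str), ("age", int)]

    def scan(self):
        for r in self.records:
            yield (r["name"], r["age"])

source = InMemoryJsonLikeSource([{"name": "alice", "age": 30}])
rows = list(source.scan())
print(rows)  # -> [('alice', 30)]
```

Because every source exposes the same scan-plus-schema contract, anything plugged in this way is immediately queryable by the layers above it, which is the point of the API.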
MLlib – Pipelines API Practical ML pipelines involve feature extraction, model fitting, testing, and validation. The Pipelines API provides reusable components and a language for describing workflows. Relies heavily on SchemaRDDs for interoperability.
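To illustrate the idea of reusable pipeline stages, here is a hypothetical plain-Python sketch (not MLlib's actual Pipelines API): each stage transforms a dataset, and a pipeline is just an ordered sequence of stages applied one after another. Stage names and the toy data are invented.

```python
class Stage:
    """A reusable pipeline component: transforms one dataset into another."""
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def transform(self, data):
        return [self.fn(row) for row in data]

class Pipeline:
    """A workflow described as an ordered list of stages."""
    def __init__(self, stages):
        self.stages = stages

    def run(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

# Toy "feature extraction" stages on made-up text data.
tokenize = Stage("tokenize", lambda text: text.lower().split())
count = Stage("count", lambda tokens: len(tokens))

pipeline = Pipeline([tokenize, count])
print(pipeline.run(["Hello Spark", "ML pipelines are composable workflows"]))  # -> [2, 5]
```

Swapping or reusing a stage means editing the list, not rewriting the workflow; in the real API, standardizing on SchemaRDDs is what lets stages written by different authors compose like this.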
Pluggable APIs Goal is to stabilize both of these APIs in the next few releases so that community implementations can proliferate. Spark is already gathering feedback from third-party library authors.
spark-packages.org
spark-packages.org Standard library for Spark-related projects (think CRAN, PyPI, etc.). Plan is to make installation on Spark a single click.
With Spark Packages Community packages (e.g. a Spark community streaming package) sit alongside GraphX, MLlib, and Spark SQL on the SchemaRDD / Data Frame API and base RDD API. Data sources: Kafka, S3, Cassandra, HDFS. Cluster managers: YARN, Mesos, Standalone.
Spark in 2015 – Production Use Cases Submissions to Spark Summit East show a broad variety of production use cases: Hadoop workload migration Recommendations Data pipeline and ETL Fraud detection User engagement analytics Scientific computing Medical diagnosis Smart grid/utility analytics
Community goals in 2015 Maintain a strong, cohesive community as we grow. Continue to provide transparency and community involvement in the technical roadmap. Maintain users’ trust that they can update to new releases. Encourage ecosystem projects outside of Spark (stable APIs are a huge part of this).
To conclude. In 2015 expect… - Increasing focus on SchemaRDDs as an integration point. - Friendlier, higher-level APIs. - Continued focus on usability and performance.