UNIT 2: Big Data Analytics with Spark and Spark Platforms
Shelly Garion, IBM Research - Haifa
© 2015 IBM Corporation
Outline
§ Map/Reduce
§ Scala
§ Spark Core API
§ Transformations and Actions
§ Spark Platforms:
  – MLLib: Machine Learning
  – GraphX: Graph Processing
  – SQL
  – Streaming
§ What’s new?
How to Analyze Big Data?
Basic Example: Word Count (Spark & Python). Source: Holden Karau, Making interactive Big Data applications fast and easy, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Basic Example: Word Count (Spark & Scala). Source: Holden Karau, Making interactive Big Data applications fast and easy, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
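
The code on the two word-count slides did not survive extraction. As a stand-in, here is a minimal plain-Python sketch of the same map/shuffle/reduce idea. It uses only the standard library, not the Spark API; in Spark the three phases would roughly correspond to the RDD operations `flatMap`, `map`, and `reduceByKey`. The function name `word_count` and the sample input are illustrative assumptions.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def word_count(lines):
    """Count words with explicit map, shuffle, and reduce phases."""
    # Map phase: split every line into words and emit one (word, 1) pair per word.
    pairs = [(word, 1) for line in lines for word in line.split()]

    # Shuffle phase: group the emitted values by key (the word).
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)

    # Reduce phase: sum the values collected for each key.
    return {word: reduce(add, ones) for word, ones in grouped.items()}


sample_lines = ["to be or not to be", "to see or not to see"]  # made-up input
counts = word_count(sample_lines)
print(counts)
```

On a cluster, the map and reduce phases run in parallel on partitions of the data, and the shuffle moves all pairs with the same key to the same worker.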
Scala
§ Spark was originally written in Scala
  – Java and Python APIs were added later
§ Scala: high-level language for the JVM
  – Object oriented
  – Functional programming
  – Immutable
  – Inspired by criticism of the shortcomings of Java
§ Static types
  – Comparable in speed to Java
  – Type inference saves us from having to write explicit types most of the time
§ Interoperates with Java
  – Can use any Java class
  – Can be called from Java code
Scala vs. Java. Source: Holden Karau, Making interactive Big Data applications fast and easy, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Spark. Source: Holden Karau, Making interactive Big Data applications fast and easy, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Spark & Scala: Creating RDD or SoftLayer object store. Source: Holden Karau, Making interactive Big Data applications fast and easy, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Spark & Scala: Basic Transformations. Source: Holden Karau, Making interactive Big Data applications fast and easy, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Spark & Scala: Basic Actions. Source: Holden Karau, Making interactive Big Data applications fast and easy, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
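
The distinction between transformations and actions is easiest to see in terms of laziness: in Spark, transformations only describe a computation, and an action triggers it. The sketch below imitates that behaviour with Python's lazy built-ins `map` and `filter`. It is an analogy in plain Python, not Spark code, and the helper `traced_square` and the `log` list are hypothetical names added only to observe when work happens.

```python
log = []  # records each element at the moment it is actually processed


def traced_square(x):
    log.append(x)
    return x * x


numbers = range(1, 6)

# "Transformations": build a lazy pipeline. Nothing is computed yet.
squares = map(traced_square, numbers)
even_squares = filter(lambda s: s % 2 == 0, squares)
evaluated_before_action = len(log)

# "Action": asking for a concrete result forces the whole pipeline to run.
result = list(even_squares)
evaluated_after_action = len(log)

print(evaluated_before_action, result, evaluated_after_action)
```

Deferring the work until an action is requested lets the engine see the whole chain of transformations before it executes anything.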
Spark & Scala: Key-Value Operations. Source: Holden Karau, Making interactive Big Data applications fast and easy, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Example: Spark Core API. Source: Aaron Davidson, A deeper understanding of Spark internals, Spark Summit, July 2014, https://spark-summit.org/2014/
Example: Spark Core API, better implementation. Source: Aaron Davidson, A deeper understanding of Spark internals, Spark Summit, July 2014, https://spark-summit.org/2014/
Example: PageRank. How to implement the PageRank algorithm using Map/Reduce? Source: Hossein Falaki, Numerical Computing with Spark, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
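
One possible answer to the question above, sketched in plain Python rather than Spark: each iteration is a map phase (every page sends an equal share of its rank to the pages it links to) followed by a reduce phase (every page sums the shares it received). The damping factor 0.85, the handling of pages without out-links, the function name `pagerank`, and the four-page graph are assumptions made for illustration, not details taken from the slides.

```python
from collections import defaultdict


def pagerank(links, iterations=50, damping=0.85):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Map phase: each page emits (target, rank / out_degree) for every out-link.
        contributions = []
        dangling = 0.0
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = ranks[page] / len(targets)
                contributions.extend((t, share) for t in targets)
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                dangling += ranks[page]

        # Reduce phase: sum the contributions received by each page.
        received = defaultdict(float)
        for target, share in contributions:
            received[target] += share

        ranks = {
            page: (1 - damping) / n + damping * (received[page] + dangling / n)
            for page in pages
        }
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}  # made-up graph
ranks = pagerank(graph)
print(sorted(ranks.items(), key=lambda kv: -kv[1]))
```

Because every iteration reuses the same link structure, this workload benefits from keeping the data in memory between iterations instead of rereading it from disk on each pass.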
Spark Platform. Source: Patrick Wendell, Big Data Processing, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Spark Platform: GraphX. Source: Patrick Wendell, Big Data Processing, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Spark Platform: GraphX. Example: PageRank is implemented using Pregel graph processing.
Spark Platform: MLLib. Source: Patrick Wendell, Big Data Processing, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Spark Platform: MLLib. Example: K-Means Clustering. Goal: segment tweets into clusters by geolocation using Spark MLLib K-means clustering. Source: https://chimpler.wordpress.com/2014/07/11/segmenting-audience-with-kmeans-and-voronoi-diagram-using-spark-and-mllib/
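
To make the clustering step concrete, here is a minimal plain-Python sketch of the standard K-means iteration (Lloyd's algorithm) on (latitude, longitude) pairs. It is not the MLLib implementation and not the code from the cited blog post. The coordinates are made up, the initial centers are picked by hand, and treating latitude/longitude as flat Euclidean coordinates is a simplifying assumption.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: (point[0] - centers[i][0]) ** 2 + (point[1] - centers[i][1]) ** 2,
    )


def kmeans(points, initial_centers, iterations=10):
    centers = list(initial_centers)
    for _ in range(iterations):
        # Assignment step: find the nearest center for each point and
        # accumulate per-cluster coordinate sums and counts.
        sums = {}
        for p in points:
            i = closest(p, centers)
            sx, sy, n = sums.get(i, (0.0, 0.0, 0))
            sums[i] = (sx + p[0], sy + p[1], n + 1)
        # Update step: move each center to the mean of its assigned points.
        # A cluster that received no points keeps its previous center.
        centers = [
            (sums[i][0] / sums[i][2], sums[i][1] / sums[i][2]) if i in sums else c
            for i, c in enumerate(centers)
        ]
    return centers


# Made-up tweet locations: three near one city, three near another.
points = [
    (40.7, -74.0), (40.8, -73.9), (40.6, -74.1),
    (34.0, -118.2), (34.1, -118.3), (33.9, -118.1),
]
centers = kmeans(points, initial_centers=[points[0], points[3]])
labels = [closest(p, centers) for p in points]
print(centers, labels)
```

The assignment step is a natural map over the points and the update step is a per-cluster reduction, which is why the algorithm distributes well.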
Spark Platform: Streaming. Source: Patrick Wendell, Big Data Processing, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Spark Platform: Streaming Example
Spark Platform: SQL. Source: Patrick Wendell, Big Data Processing, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
Spark Platform: SQL & MLLib Example (SVM using Stochastic Gradient Descent). Source: Xiangrui Meng, MLLib: scalable machine learning on Spark, Spark Workshop, April 2014, http://stanford.edu/~rezab/sparkworkshop/
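
The slide's code is not recoverable, so the following is only a plain-Python sketch of what "SVM using Stochastic Gradient Descent" means: a linear classifier trained by taking subgradient steps on the L2-regularized hinge loss, one example at a time. It does not use MLLib or Spark SQL. The step size, regularization strength, epoch count, function names, and the six training points are all illustrative assumptions.

```python
import random


def train_linear_svm(data, epochs=50, step=0.1, reg=0.01, seed=0):
    """data is a list of (features, label) pairs with label in {-1, +1}."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        examples = data[:]
        rng.shuffle(examples)  # visit the examples in random order each epoch
        for x, y in examples:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Hinge loss is active: step along the regularizer and the loss.
                w = [wi - step * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += step * y
            else:
                # Hinge loss is zero: only the regularizer shrinks the weights.
                w = [wi - step * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable training set.
training = [
    ((2.0, 1.5), 1), ((1.5, 2.5), 1), ((3.0, 2.0), 1),
    ((-2.0, -1.0), -1), ((-1.5, -2.5), -1), ((-3.0, -2.0), -1),
]
w, b = train_linear_svm(training)
predictions = [predict(w, b, x) for x, _ in training]
print(w, b, predictions)
```

In a distributed setting the gradient is typically computed over a sampled mini-batch spread across workers and then aggregated, rather than one example at a time as in this sketch.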
What’s new in 2015?
§ SparkR (R interface)
§ DataFrame: API via Spark SQL
§ Spark ML: support for pipelines
Source: Matei Zaharia, New directions for Spark in 2015, Spark Summit East, March 2015, https://spark-summit.org/east-2015/