Getting started in Apache Spark and Flink (with Scala)
Alexander Panchenko, Gerold Hintz, Steffen Remus
11.07.2016

Outline
§ Scala
  § basics of the Scala programming language
§ Spark
  § motivation: what you get on top of MapReduce
  § basics of Spark: RDDs, transformations, actions, shuffling
  § “tricks” useful in a Spark context
§ Spark hands-on session
  § run a Spark notebook and solve easy tasks
  § set up a Spark project & submit a job to the cluster
§ Flink
  § theory
  § differences from Spark

Three main benefits of using Spark
1. Spark is easy to use: you can develop applications on your laptop, using a high-level API.
2. Spark is fast, enabling interactive use and complex algorithms.
3. Spark is a general engine, letting you combine multiple types of computations (e.g., SQL queries, text processing, and machine learning) that might previously have required different engines.
This tutorial is based on the book by the creators of Spark: Karau H., Konwinski A., Wendell P., Zaharia M. “Learning Spark: Lightning-Fast Big Data Analysis.” O’Reilly, 2015.

Data Science Tasks
Experimentation: development of the model
§ Python, MATLAB, R
§ IPython notebooks
§ Interactive computing
§ Easy to use
Production: using the model
§ Java, Scala, C/C++
§ Unit tests
§ Fault tolerance
§ No interactive computing
§ Scalability
Scala + Spark can be used for both!

A Brief History of Spark
§ Spark is an open source project
§ Spark started in 2009 as a research project in the UC Berkeley RAD Lab
§ Research papers about Spark were published at academic conferences soon after its creation
§ In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark) and Spark Streaming
§ Currently one of the most active projects in the Scala language

What Is Apache Spark?
§ Spark Core: resilient distributed datasets (RDDs)
§ Spark SQL: Hive tables, Parquet, JSON, Datasets

What Is Apache Spark?
§ Components for distributed execution in Spark (architecture diagram)

Spark Runtime Architecture
§ The components of a distributed Spark application (diagram)

Spark Runtime Architecture
§ A master/slave architecture with one central coordinator and many distributed workers
§ The central coordinator is called the driver
§ The driver communicates with distributed workers called executors
§ The driver is the process where the main() method of your program runs
§ The driver is responsible for:
  § converting a user program into tasks
  § scheduling tasks on executors

Downloading Spark and Getting Started
§ Download a version “Pre-built for Hadoop 2.X and later”: http://spark.apache.org/downloads.html
§ Directories that come with Spark:
  § README.md: short instructions for getting started with Spark
  § bin: executable files for interacting with Spark in various ways (e.g., the Spark shell, which we cover later in this chapter)
  § core, streaming, python, …: the source code of major components of the Spark project
  § examples: helpful standalone Spark jobs that you can look at and run to learn about the Spark API

Introduction to Spark’s Scala Shell
§ Run: bin/spark-shell
§ Type the Scala line-count example into the shell (see the sketch below)
§ We can run parallel operations on the RDD, such as counting the lines of text in the file or printing the first one
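A minimal sketch of the session, following the standard line-count example from the book (the README.md path is an assumption about where the shell was started):

```scala
// `sc` (the SparkContext) is created automatically by bin/spark-shell.
val lines = sc.textFile("README.md") // create an RDD of the file's lines
lines.count()                        // action: number of lines in the file
lines.first()                        // action: the first line of the file
```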

Filtering: lambda functions
§ Filtering example (Scala): see the sketch below
§ Filtering example (Java 7): the same logic written with an anonymous inner class
§ Filtering example (Java 8): a lambda expression, almost as concise as the Scala version
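A sketch of the Scala version, using the book's running log-file example (the file name log.txt is illustrative):

```scala
val inputRDD = sc.textFile("log.txt")   // hypothetical input file
// Lambda function passed to filter(): keep only lines containing "error".
val errorsRDD = inputRDD.filter(line => line.contains("error"))
```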

Standalone Spark Applications
§ Link to Spark from your build (Maven or SBT)
§ Write a sample class, e.g. word count (see the sketch below)
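A sketch of such a class, assuming a Spark 1.x spark-core dependency; the class and argument names are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val counts = sc.textFile(args(0))     // input path from the command line
      .flatMap(line => line.split(" "))   // split each line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .reduceByKey(_ + _)                 // sum the counts per word
    counts.saveAsTextFile(args(1))        // output path from the command line
    sc.stop()
  }
}
```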

Standalone Spark Applications
§ SBT build file
§ Build the JAR and run it (see the sketch below)
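A minimal build.sbt and submit command for the class above; the Scala and Spark version numbers are assumptions matching the Spark 1.x era of this tutorial:

```scala
// build.sbt
name := "wordcount"
version := "0.1"
scalaVersion := "2.10.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"

// Build the JAR and submit it to a cluster:
//   sbt package
//   bin/spark-submit --class WordCount target/scala-2.10/wordcount_2.10-0.1.jar input.txt output
```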

Programming with RDDs
RDD: Resilient Distributed Dataset
§ Immutable distributed collection of objects
§ Each RDD is split into multiple partitions
§ Partitions may be computed on different nodes
Creating an RDD
§ Loading an external dataset
§ Distributing a collection of objects
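Both ways of creating an RDD, as a short sketch (data.txt is a hypothetical file):

```scala
// 1) Load an external dataset as an RDD of lines.
val lines = sc.textFile("data.txt")

// 2) Distribute a local collection of objects across the cluster.
val numbers = sc.parallelize(List(1, 2, 3, 3))
```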

Programming with RDDs
§ Once created, RDDs offer two types of operations:
  § transformations
  § actions
§ Transformations construct a new RDD from a previous one
§ Actions compute a result based on an RDD, and either
  § return it to the driver program, or
  § save it to an external storage system, e.g. HDFS
§ RDDs are recomputed each time you run an action
§ To reuse an RDD, you need to persist it in memory (see the sketch below)
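A sketch of persisting an RDD so that a second action does not recompute it:

```scala
val lines = sc.textFile("data.txt")
val errors = lines.filter(_.contains("error"))
errors.cache()    // shorthand for persist() at the default memory-only level
errors.count()    // first action: computes the RDD and caches it
errors.take(10)   // second action: served from the cache, no recomputation
```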

Spark Execution Steps (Shell & Standalone)
1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Persist any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.
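All four steps in one minimal sketch (log.txt is hypothetical):

```scala
val input = sc.textFile("log.txt")              // 1. create an input RDD
val errors = input.filter(_.contains("error"))  // 2. transform it
errors.persist()                                // 3. persist for reuse
println(errors.count())                         // 4. action: count the matches
println(errors.first())                         //    action: first matching line
```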

RDD Operations: Transformations
§ The filter() operation does not mutate the existing inputRDD
§ It returns a pointer to an entirely new RDD
§ inputRDD can still be reused later in the program (see the sketch below)
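A sketch of reusing inputRDD, following the book's log-analysis example:

```scala
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(_.contains("error"))      // inputRDD is unchanged
val warningsRDD = inputRDD.filter(_.contains("warning"))  // reuse the same inputRDD
val badLinesRDD = errorsRDD.union(warningsRDD)            // combine both results
```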

RDD Operations: Actions
Return some result and launch the actual computation:
§ take() retrieves a small number of elements (see the sketch below)
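A sketch, assuming the badLinesRDD built on the previous slide:

```scala
println(s"Input had ${badLinesRDD.count()} problem lines")
badLinesRDD.take(10).foreach(println)  // fetch 10 elements to the driver and print them
```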

Common Transformations and Actions
Element-wise transformations
§ Mapped and filtered RDDs derived from an input RDD
§ Squaring the values in an RDD (see the sketch below)
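The squaring example as a sketch:

```scala
val input = sc.parallelize(List(1, 2, 3, 4))
val squared = input.map(x => x * x)        // transformation: square each element
println(squared.collect().mkString(", "))  // action: prints 1, 4, 9, 16
```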

Common Transformations and Actions
Element-wise transformations
§ Splitting lines into multiple words
§ Difference between flatMap() and map() on an RDD (see the sketch below)
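A sketch contrasting flatMap() and map() on the same input:

```scala
val lines = sc.parallelize(List("hello world", "hi"))

val words = lines.flatMap(line => line.split(" "))   // flattens: "hello", "world", "hi"
words.first()                                        // "hello"

val wordArrays = lines.map(line => line.split(" "))  // keeps the nesting
wordArrays.first()                                   // Array("hello", "world")
```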

Common Transformations and Actions
Some simple set operations (sketched below):
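A sketch of the set-like operations, using the coffee/panda example from the book; results are shown as comments:

```scala
val rdd1 = sc.parallelize(List("coffee", "coffee", "panda", "monkey", "tea"))
val rdd2 = sc.parallelize(List("coffee", "monkey", "kitty"))

rdd1.distinct()          // coffee, panda, monkey, tea
rdd1.union(rdd2)         // keeps duplicates: coffee x3, panda, monkey x2, tea, kitty
rdd1.intersection(rdd2)  // coffee, monkey (removes duplicates)
rdd1.subtract(rdd2)      // panda, tea
```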

Common Transformations and Actions
Basic RDD transformations on an RDD containing {1, 2, 3, 3} (sketched below):
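The transformations from the original table as a sketch, with expected results as comments:

```scala
val rdd = sc.parallelize(List(1, 2, 3, 3))

rdd.map(x => x + 1)        // {2, 3, 4, 4}
rdd.flatMap(x => x.to(3))  // {1, 2, 3, 2, 3, 3, 3}
rdd.filter(x => x != 1)    // {2, 3, 3}
rdd.distinct()             // {1, 2, 3}
rdd.sample(false, 0.5)     // nondeterministic sample, e.g. {1, 2}
```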

Common Transformations and Actions
Two-RDD transformations on RDDs containing {1, 2, 3} and {3, 4, 5} (sketched below):
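The two-RDD transformations from the original table as a sketch:

```scala
val rdd = sc.parallelize(List(1, 2, 3))
val other = sc.parallelize(List(3, 4, 5))

rdd.union(other)         // {1, 2, 3, 3, 4, 5}
rdd.intersection(other)  // {3}
rdd.subtract(other)      // {1, 2}
rdd.cartesian(other)     // {(1,3), (1,4), (1,5), (2,3), ..., (3,5)}
```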

Common Transformations and Actions
Basic actions on an RDD containing {1, 2, 3, 3} (sketched below):
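The actions from the original table as a sketch, with expected results as comments:

```scala
val rdd = sc.parallelize(List(1, 2, 3, 3))

rdd.collect()                // Array(1, 2, 3, 3): all elements, to the driver
rdd.count()                  // 4
rdd.countByValue()           // Map(1 -> 1, 2 -> 1, 3 -> 2)
rdd.take(2)                  // Array(1, 2)
rdd.top(2)                   // Array(3, 3)
rdd.reduce((x, y) => x + y)  // 9
rdd.fold(0)((x, y) => x + y) // 9
rdd.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2))  // (9, 4): sum and count
```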

Persistence (Caching)
§ Double execution vs. reusing the result (sketched below)
§ Persistence levels: MEMORY_ONLY (the default), MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY
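A sketch contrasting the two, following the book's squaring example:

```scala
import org.apache.spark.storage.StorageLevel

val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)

// Double execution: without persistence, both actions recompute the map().
println(result.count())
println(result.collect().mkString(","))

// Reusing the result: the second action reads the stored partitions.
result.persist(StorageLevel.MEMORY_ONLY)
println(result.count())                  // computes and stores
println(result.collect().mkString(",")) // served from memory
```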

Working with Key/Value Pairs
§ Pair RDDs are a useful building block in many programs
§ They let you act on each key in parallel or regroup data
§ For instance:
  § reduceByKey() aggregates data per key
  § join() merges two RDDs by grouping elements with the same key
§ Creating pair RDDs = creating Scala tuples
§ Creating a pair RDD using the first word as the key (see the sketch below)
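The first-word-as-key example as a sketch (data.txt is hypothetical):

```scala
val lines = sc.textFile("data.txt")
// Key = the first word of the line, value = the whole line.
val pairs = lines.map(line => (line.split(" ")(0), line))
```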

Transformations on Pair RDDs
§ Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)}), sketched below
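The single-pair-RDD transformations from the original table as a sketch, with expected results as comments:

```scala
val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))

rdd.reduceByKey((x, y) => x + y)  // {(1, 2), (3, 10)}: combine values per key
rdd.groupByKey()                  // {(1, [2]), (3, [4, 6])}
rdd.mapValues(x => x + 1)         // {(1, 3), (3, 5), (3, 7)}
rdd.flatMapValues(x => x to 5)    // {(1,2), (1,3), (1,4), (1,5), (3,4), (3,5)}
rdd.keys                          // {1, 3, 3}
rdd.values                        // {2, 4, 6}
rdd.sortByKey()                   // {(1, 2), (3, 4), (3, 6)}
```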

Transformations on Pair RDDs
§ Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)}), sketched below
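The two-pair-RDD transformations from the original table as a sketch:

```scala
val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
val other = sc.parallelize(List((3, 9)))

rdd.subtractByKey(other)   // {(1, 2)}: drop keys present in other
rdd.join(other)            // {(3, (4, 9)), (3, (6, 9))}: inner join
rdd.rightOuterJoin(other)  // {(3, (Some(4), 9)), (3, (Some(6), 9))}
rdd.leftOuterJoin(other)   // {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))}
rdd.cogroup(other)         // {(1, ([2], [])), (3, ([4, 6], [9]))}
```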

Transformations on Pair RDDs
§ Using partial-function syntax for pair RDDs in Scala
§ Simple filter on the second element (see the sketch below)
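A sketch of the filter in both styles, assuming string values (the length threshold of 20 follows the book's example):

```scala
val pairs = sc.parallelize(List(("a", "short"), ("b", "a much longer value here")))

// Partial-function syntax: pattern-match the tuple directly.
val result = pairs.filter { case (key, value) => value.length < 20 }

// Equivalent with positional tuple access:
val result2 = pairs.filter(pair => pair._2.length < 20)
```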

Transformations on Pair RDDs
Word and document counts:
§ Per-key average with reduceByKey() and mapValues() (see the sketch below)
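A sketch of the per-key average, following the book's example data:

```scala
val rdd = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))

// Pair each value with a count of 1, sum both per key, then divide.
val avg = rdd
  .mapValues(x => (x, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, count) => sum.toDouble / count }
// {("panda", 0.5), ("pink", 3.5), ("pirate", 3.0)}
```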

Transformations on Pair RDDs
Word count example revisited (sketched below):
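The word count in pair-RDD form (data.txt is hypothetical):

```scala
val input = sc.textFile("data.txt")
val words = input.flatMap(line => line.split(" "))  // one element per word
val counts = words.map(word => (word, 1))           // pair each word with 1
                  .reduceByKey((x, y) => x + y)     // sum the counts per key
```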

Transformations on Pair RDDs
Example of a join (an inner join is the default), sketched below:
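A sketch in the spirit of the book's store example; the names and data are illustrative:

```scala
val storeAddress = sc.parallelize(List(
  ("Ritual", "1026 Valencia St"), ("Philz", "748 Van Ness Ave"), ("Starbucks", "Seattle")))
val storeRating = sc.parallelize(List(("Ritual", 4.9), ("Philz", 4.8)))

// Inner join: keeps only keys present in both RDDs.
storeAddress.join(storeRating)
// {("Ritual", ("1026 Valencia St", 4.9)), ("Philz", ("748 Van Ness Ave", 4.8))}
```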

Actions Available on Pair RDDs
Actions on pair RDDs (example: {(1, 2), (3, 4), (3, 6)}), sketched below
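The actions from the original table as a sketch, with expected results as comments:

```scala
val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))

rdd.countByKey()    // Map(1 -> 1, 3 -> 2): number of elements per key
rdd.collectAsMap()  // Map(1 -> 2, 3 -> 6): as a map (one value kept per duplicate key)
rdd.lookup(3)       // Seq(4, 6): all values for key 3
```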

Example: PageRank
§ links: (pageID, linkList), the list of neighbors of each page
§ ranks: (pageID, rank), the current rank of each page
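A sketch of the classic Spark PageRank loop over these two RDDs, following the version in the book; the example graph and the 10 iterations are illustrative, and 0.15/0.85 are the usual damping constants:

```scala
// links: (pageID, list of pages it links to); persisted because it is reused every iteration.
val links = sc.parallelize(List(
  ("a", List("b", "c")), ("b", List("c")), ("c", List("a")))).persist()

// Start every page with rank 1.0.
var ranks = links.mapValues(_ => 1.0)

for (i <- 0 until 10) {
  // Each page sends rank / fanout to every page it links to.
  val contributions = links.join(ranks).flatMap {
    case (pageId, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // New rank = 0.15 + 0.85 * (sum of received contributions).
  ranks = contributions.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
}

ranks.collect().foreach(println)
```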

Important topics not covered in this intro
MLlib
§ Machine learning in a distributed way
§ Basic linear algebra in a distributed way: sparse and dense vectors and matrices
Partitioning
§ No free lunch, no automagic scaling of any algorithm
§ Making an algorithm efficient = trying to minimize shuffling of the data
Spark SQL, Spark 2.0, Datasets, DataFrames
§ Something like Python’s pandas or R’s data.frame
§ Great for interactive data mining and for working with CSV files