Spark

Spark ideas
• expressive computing system, not limited to the map-reduce model
• take advantage of system memory
  – avoid saving intermediate results to disk
  – cache data for repetitive queries (e.g. for machine learning); see the sketch below
• compatible with Hadoop
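
A minimal sketch of the caching idea, as typed into spark-shell (the input path and the two queries are hypothetical): the dataset is read from HDFS once, kept in memory, and then reused by later queries.

  val values = sc.textFile("hdfs://.../points.txt").map(_.toDouble).cache()
  // the first action reads HDFS and populates the in-memory cache
  val mean = values.sum() / values.count()
  // later queries reuse the cached data instead of re-reading the file
  val aboveMean = values.filter(_ > mean).count()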

RDD abstraction
• Resilient Distributed Datasets
• partitioned collection of records spread across the cluster
• read-only
• caching of datasets in memory
  – different storage levels available (see the sketch below)
  – fallback to disk possible
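
A short sketch of choosing a storage level explicitly (the input path is hypothetical): cache() is shorthand for persisting in memory only, while MEMORY_AND_DISK spills partitions that do not fit in memory to local disk instead of dropping them.

  import org.apache.spark.storage.StorageLevel

  val log = sc.textFile("hdfs://.../access.log")   // hypothetical input path
  log.persist(StorageLevel.MEMORY_AND_DISK)        // cache() would be MEMORY_ONLY
  log.count()                                      // first action materializes the persisted partitions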

RDD operations
• transformations build RDDs through deterministic operations on other RDDs
  – transformations include map, filter, join
  – lazy operations (see the sketch below)
• actions return a value or export data
  – actions include count, collect, save
  – actions trigger execution
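
A small sketch of the lazy/eager split with made-up data in spark-shell: the transformations only record the lineage, and nothing runs until the action.

  val nums = sc.parallelize(1 to 1000000)
  val evens = nums.filter(_ % 2 == 0)   // transformation: nothing is executed yet
  val doubled = evens.map(_ * 2)        // still lazy, only the lineage is recorded
  doubled.count()                       // action: triggers the actual computation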

Job example (driver code)

val log = sc.textFile("hdfs://...")
val errors = log.filter(_.contains("ERROR"))
errors.cache()
errors.filter(_.contains("I/O")).count()        // Action!
errors.filter(_.contains("timeout")).count()

(diagram: the driver ships tasks to the workers; each worker reads its HDFS blocks, Block 1 to Block 3, and keeps the filtered partitions in its cache, Cache 1 and Cache 2)

RDD partition-level view

Dataset-level view:
  log: HadoopRDD, path = hdfs://...
  errors: FilteredRDD, func = _.contains(...), shouldCache = true

Partition-level view:
  one task per partition (Task 1, Task 2, ...)

source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
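
The same structure can be inspected from the shell; a small sketch, reusing the log and errors RDDs from the job example above:

  println(errors.toDebugString)       // prints the RDD lineage, i.e. the chain of parent RDDs
  println(errors.partitions.length)   // number of partitions = number of tasks per stage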

Job scheduling

rdd1.join(rdd2).groupBy(…).filter(…)

RDD Objects: build the operator DAG
DAGScheduler: split the graph into stages of tasks, submit each stage as ready (TaskSet)
TaskScheduler: launch tasks via the cluster manager, retry failed or straggling tasks
Worker: execute tasks in threads, block manager stores and serves blocks

source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
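
A runnable version of the example chain from the diagram, with made-up pair RDDs (the data and the grouping key are hypothetical): each shuffle (join, groupBy) starts a new stage, and collect() is the action that makes the DAGScheduler build and submit those stages.

  val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
  val rdd2 = sc.parallelize(Seq((1, 10), (2, 20), (4, 40)))
  val result = rdd1.join(rdd2)             // shuffle: stage boundary
    .groupBy { case (k, _) => k % 2 }      // another shuffle: new stage
    .filter { case (g, _) => g == 0 }      // narrow dependency: stays in the same stage
  result.collect()                         // action: the stages are scheduled and run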

Available APIs
• you can write in Java, Scala or Python
• interactive interpreter: Scala & Python only
• standalone applications: any of the three (see the sketch below)
• performance: Java & Scala are faster thanks to static typing
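
A minimal sketch of a Scala standalone application (the object name and the input path are hypothetical); unlike the interpreter, the application has to create its own SparkContext.

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("WordCount")
      val sc = new SparkContext(conf)
      val counts = sc.textFile("hdfs://.../input.txt")
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
      counts.take(10).foreach(println)
      sc.stop()
    }
  }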

Hands on - interpreter
• script: http://cern.ch/kacper/spark.txt
• run the Scala Spark interpreter
  $ spark-shell
• or the Python interpreter
  $ pyspark
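
A few example lines one might type into the Scala shell to check that it works (made-up data, not the contents of the linked script):

  $ spark-shell
  scala> val nums = sc.parallelize(1 to 100)
  scala> nums.filter(_ % 7 == 0).collect()
  scala> nums.map(n => n * n).reduce(_ + _)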

Hands on – build and submission
• download and unpack the source code
  wget http://cern.ch/kacper/GvaWeather.tar.gz; tar -xzf GvaWeather.tar.gz
• build definition in GvaWeather/gvaweather.sbt
• source code in GvaWeather/src/main/scala/GvaWeather.scala
• building
  cd GvaWeather
  sbt package
• job submission
  spark-submit --master local --class GvaWeather \
    target/scala-2.10/gva-weather_2.10-1.0.jar
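
The actual gvaweather.sbt is not reproduced here; a minimal build definition consistent with the artifact name above could look like this (the Spark version is an assumption):

  name := "gva-weather"
  version := "1.0"
  scalaVersion := "2.10.4"
  // marked provided because the Spark jars are supplied by spark-submit at run time
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"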

Summary
• concept not limited to single-pass map-reduce
• avoid storing intermediate results on disk or HDFS
• speed up computations when reusing datasets