Introduction to Spark

Outline
• A brief history of Spark
• Programming with RDDs
  - Transformations
  - Actions

A brief history

Limitations of MapReduce
• MapReduce use cases showed two major limitations:
  - Difficulty of programming directly in MapReduce
    - Batch processing does not fit the use cases
  - Performance bottlenecks
    - Data will be frequently loaded from and saved to hard drives
• Spark is designed to overcome the limitations of MapReduce
  - Handles batch, interactive, and real-time processing within a single framework
  - Native integration with Java, Python, Scala
  - Programming at a higher level of abstraction
  - More general: map/reduce is just one set of supported constructs

Spark components
• Data are partitioned, and the partitions are processed in parallel on multiple worker nodes

Resilient Distributed Dataset (RDD)
• An RDD is simply a distributed collection of elements
  - An RDD is an immutable distributed collection of objects
  - Each RDD is split into multiple partitions
• In Spark all work is expressed as one of three operations
  - Creation: Creating new RDDs
    - Non-RDD data => RDD
  - Transformation: Transforming existing RDDs
    - RDD => RDD
  - Action: Calling operations on RDDs to compute a result
    - RDD => non-RDD value
• Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them

Creation of an RDD
• Users create RDDs in two ways
  - Loading an external dataset
  - Parallelizing a collection in your driver program
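The two creation routes can be sketched as follows. This is a minimal illustration, assuming Spark is on the classpath and run in local mode; the application name, the `local[2]` master, and the file path are assumptions made for the example, not part of the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local[2] runs Spark inside this JVM with two worker threads.
val conf = new SparkConf().setAppName("create-rdd-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Way 1: parallelize a collection that lives in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2)

// Way 2: load an external dataset. The path is a hypothetical placeholder;
// nothing is read until an action is called on `lines`.
val lines = sc.textFile("data/example.txt")

println(numbers.getNumPartitions) // the number of partitions requested above
```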

Transformations on RDDs
• Transformations are operations on RDDs that return a new RDD
  - Such as map(), filter()
  - Creation of an RDD can be considered a special type of transformation
• Transformed RDDs are computed lazily
  - Only when you use them in an action
• Creation of RDDs is also carried out lazily

Actions on RDDs
• Operations that do something on the dataset
  - E.g., operations that return a final value to the driver program or write data to an external storage system
  - Such as count() and first()
• Actions force the evaluation of the transformations required for the RDD they were called on

Lazy evaluation
• Lazy evaluation
  - The operation is not immediately performed when we call a transformation on an RDD
  - Spark internally records metadata to indicate that this operation has been requested
• Spark will not begin to execute until it sees an action
• Spark will re-compute the RDD and all of its dependencies each time we call an action on the RDD
• Result RDD will be computed twice in the above example
  - Input RDD might be computed twice if it is not persisted
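One way to observe lazy evaluation and re-computation is to count how often the map function really runs. The sketch below is an illustration only, assuming Spark on the classpath in local mode; the accumulator name and the sample data are made up for the example, and the call counts hold for a failure-free local run (Spark may re-apply accumulator updates if a task is retried).

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lazy-eval-sketch").setMaster("local[2]"))

// Accumulator used only to count how many times the map function runs.
val calls = sc.longAccumulator("mapCalls")

val input = sc.parallelize(Seq(1, 2, 3, 4), 2)
// Transformation: Spark only records it, nothing is executed here.
val squared = input.map { x => calls.add(1); x * x }

val callsBeforeAction = calls.value    // no action yet, so the function has not run
val total = squared.reduce(_ + _)      // first action: triggers the map
val n = squared.count()                // second action: the map runs again
val callsAfterTwoActions = calls.value // 4 elements x 2 actions in a failure-free run

println(s"before=$callsBeforeAction after=$callsAfterTwoActions")
```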

Persistence (caching)
• Ask Spark to persist the data to avoid computing an RDD multiple times
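A minimal sketch of persistence, under the same assumptions as any local-mode Spark example (Spark on the classpath, made-up accumulator name and data). With the RDD persisted, the second action should read the stored partitions instead of re-running the map; the count holds as long as the small dataset fits in memory and no task is retried.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("persist-sketch").setMaster("local[2]"))

// Counts how many times the map function actually runs.
val calls = sc.longAccumulator("mapCalls")

val squared = sc.parallelize(Seq(1, 2, 3, 4), 2)
  .map { x => calls.add(1); x * x }
  .persist(StorageLevel.MEMORY_ONLY) // cache() is shorthand for this storage level

val total = squared.reduce(_ + _) // first action: computes and stores the partitions
val n = squared.count()           // second action: served from the stored partitions
val mapCalls = calls.value        // one call per element, not two

println(s"total=$total n=$n mapCalls=$mapCalls")
```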

Element-wise transformations
• map()
  - Takes in a function and applies it to each element in the RDD
  - The result of the function is the new value of each element in the resulting RDD
  - map()'s return type does not have to be the same as its input type
• filter()
  - Takes in a function and returns an RDD that only has elements that pass the filter() function

The sample code for map()
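The original sample is not reproduced here, so the following is a stand-in sketch rather than the slide's own code. It assumes Spark on the classpath in local mode; the data and variable names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("map-filter-sketch").setMaster("local[2]"))

val nums = sc.parallelize(Seq(1, 2, 3, 4))

val squares = nums.map(x => x * x)  // Int => Int
val labels = nums.map(x => s"n=$x") // Int => String: the result type may differ
val evens = nums.filter(_ % 2 == 0) // keeps only elements passing the predicate

println(squares.collect().mkString(",")) // 1,4,9,16
println(labels.collect().mkString(","))  // n=1,n=2,n=3,n=4
println(evens.collect().mkString(","))   // 2,4
```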

Element-wise transformations
• flatMap()
  - The function we provide to flatMap() is called individually for each element in our input RDD
  - Instead of returning a single element, we return an iterator with our return values
  - Rather than producing an RDD of iterators, we get back an RDD that consists of the elements from all of the iterators

flatMap() vs map()
• flatMap(): "flattening" the iterators returned to it
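The difference shows up in the element type and element count of the result. A small sketch, assuming Spark on the classpath in local mode and two made-up input lines:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("flatmap-sketch").setMaster("local[2]"))

val lines = sc.parallelize(Seq("hello world", "hi"))

// map(): one output element per input element, here an array of words per line.
val mapped = lines.map(_.split(" "))   // RDD[Array[String]] with 2 elements
// flatMap(): the per-line word collections are flattened into one RDD of words.
val flat = lines.flatMap(_.split(" ")) // RDD[String] with 3 elements

println(mapped.count())                // 2
println(flat.collect().mkString(","))  // hello,world,hi
```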

Pseudo set operations
• The union operation keeps duplicates
• The intersection operation removes duplicates

cartesian() transform
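The behaviour of union(), intersection(), and cartesian() can be checked on two tiny RDDs. This is a sketch assuming Spark on the classpath in local mode; the data are made up. Results are sorted or turned into a set before comparison because these operations do not promise an output order.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("set-ops-sketch").setMaster("local[2]"))

val a = sc.parallelize(Seq(1, 2, 2, 3))
val b = sc.parallelize(Seq(2, 3, 3, 4))

val unionSorted = a.union(b).collect().sorted            // duplicates kept: 8 elements
val intersectSorted = a.intersection(b).collect().sorted // duplicates removed: 2, 3

// cartesian(): every pair (x, y) with x from the first RDD and y from the second.
val pairs = sc.parallelize(Seq(1, 2))
  .cartesian(sc.parallelize(Seq("a", "b")))
  .collect()
  .toSet

println(unionSorted.mkString(","))     // 1,2,2,2,3,3,3,4
println(intersectSorted.mkString(",")) // 2,3
println(pairs.size)                    // 4
```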

Actions
• collect(): returns all the elements of the dataset as an array to the driver program
• countByValue(): returns a map of each unique value to its count

• take(num): returns the first num elements of the RDD
• top(num): returns the top num elements, using the default ordering on the data

• Both reduce() and fold() will reduce the input RDD to a single element of the same type
  - fold() needs an initial value
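The actions listed above can be tried on one small RDD. A sketch assuming Spark on the classpath in local mode, with made-up data; addition is used for reduce() and fold() because it is associative and commutative, so the result does not depend on partitioning.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("actions-sketch").setMaster("local[2]"))

val rdd = sc.parallelize(Seq(3, 1, 2, 3), 2)

val all = rdd.collect()         // every element, brought back to the driver
val counts = rdd.countByValue() // value -> number of occurrences
val firstTwo = rdd.take(2)      // first two elements in partition order
val topTwo = rdd.top(2)         // two largest elements, largest first
val sum = rdd.reduce(_ + _)     // no initial value
val sumFold = rdd.fold(0)(_ + _) // 0 is the initial value

println(all.mkString(","))    // 3,1,2,3
println(topTwo.mkString(",")) // 3,3
println(s"$sum $sumFold")     // 9 9
```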

More details on reduce() and fold()
• RDD.reduce((x, y) => x + y), RDD.fold(initial_value)((x, y) => x + y)
  - x: accumulator
  - y: element in the partition
• In reduce()
  - The accumulator first takes the first element in a partition, then updates its value by adding the next element
  - For example: a partition (1, 2, 3, 4, 5), RDD.reduce((x, y) => x + y)
    - Iteration 1: x=1, y=2 => x=(1+2)=3
    - Iteration 2: x=3, y=3 => x=(3+3)=6
    - Iteration 3: x=6, y=4 => x=(6+4)=10
    - Iteration 4: x=10, y=5 => x=(10+5)=15
• In fold()
  - The accumulator first takes the initial value in a partition, then updates its value by adding the next element
  - For example: a partition (1, 2, 3, 4, 5), RDD.fold(0)((x, y) => x + y)
    - Iteration 1: x=0, y=1 => x=(0+1)=1

More details on reduce() and fold()
• The overall process
  - Partitions are processed in parallel using multiple executors (or executor threads)
    - Each partition is processed sequentially using a single thread
  - Final merge is performed sequentially using a single thread on the driver
• For multiple partitions of an RDD
  - First, the function will be applied on each partition
    - Each partition will produce an accumulator
  - All accumulators will be collected by the driver in a nondeterministic order
  - Then the function will be applied to the list of accumulators
    - For fold(), the initial value will be used again when aggregating the accumulators
• The partitioning behavior, plus certain sources of ordering nondeterminism, may bring uncertainty to reduce()/fold() actions when dealing with non-commutative operations
  - sc.parallelize(Seq(2.0, 3.0), 2).fold(1.0)((a, b) => math.pow(b, a))
  - What is the output?
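The two-step process can be modelled with plain Scala collections, without a cluster. This is a simplified simulation of the steps described above, not Spark's own code; `simulateFold` is a hypothetical helper, and the order of the outer sequence stands in for the order in which partition results reach the driver.

```scala
// Hypothetical helper: models fold() over explicit partitions.
def simulateFold[A](partitions: Seq[Seq[A]], zero: A)(op: (A, A) => A): A = {
  // Step 1: each partition is folded sequentially, starting from the zero value.
  val perPartition = partitions.map(_.foldLeft(zero)(op))
  // Step 2: the driver folds the partition results, starting from zero again.
  perPartition.foldLeft(zero)(op)
}

// The non-commutative function from the example: (a, b) => b^a.
val op = (a: Double, b: Double) => math.pow(b, a)

// Partition results reach the driver as [2.0] first, then [3.0] ...
val inOrder = simulateFold(Seq(Seq(2.0), Seq(3.0)), 1.0)(op)
// ... or as [3.0] first, then [2.0].
val reversed = simulateFold(Seq(Seq(3.0), Seq(2.0)), 1.0)(op)

println(inOrder)  // 9.0
println(reversed) // 8.0
```

Under this model the same call can produce either 9.0 or 8.0, depending only on which partition result is merged first.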

aggregate()
• The output type of aggregate() can be different from the element type of the input RDD
• Prototype: def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
  - aggregate(zeroValue)(seqOp, combOp)
  - It traverses the elements in different partitions
    - Using seqOp to update the accumulator in each partition
    - Then applies combOp (combine operation) to accumulators from different partitions
  - The zeroValue is used in both seqOp and combOp
• Example: how to calculate the average of an input RDD (1, 2, 3, 3)
  - val sum = inputRDD.aggregate(0)((x, y) => x + y, (x, y) => x + y)
  - val count = inputRDD.aggregate(0)((x, y) => x + 1, (x, y) => x + y)
  - val average = sum.toDouble / count
  - How about val count = inputRDD.fold(0)((x, y) => x + 1)?
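The average example, and the fold() variant raised as a question, can be run side by side. A sketch assuming Spark on the classpath in local mode; the RDD is split into two partitions on purpose so that the difference between counting elements and counting partitions is visible.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("aggregate-sketch").setMaster("local[2]"))

val inputRDD = sc.parallelize(Seq(1, 2, 3, 3), 2)

// seqOp adds each element to the per-partition total; combOp adds the totals.
val sum = inputRDD.aggregate(0)((acc, v) => acc + v, (a, b) => a + b)
// seqOp adds 1 per element; combOp adds the per-partition counts.
val count = inputRDD.aggregate(0)((acc, _) => acc + 1, (a, b) => a + b)
val average = sum.toDouble / count

// fold() uses the same function for the merge step, so merging adds 1 per
// partition result instead of adding the per-partition counts.
val foldCount = inputRDD.fold(0)((x, _) => x + 1)

println(s"sum=$sum count=$count average=$average") // sum=9 count=4 average=2.25
println(foldCount)                                 // 2, the number of partitions
```

With fold() there is only one function, so it cannot distinguish "add one per element" from "add the partition counts together"; aggregate() separates the two roles.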

aggregate()
• Using a tuple x as the accumulator
  - x._1: the running total
  - x._2: the running count
  val result = input.aggregate((0, 0))(
    (acc, value) => (acc._1 + value, acc._2 + 1),
    (ACC, acci) => (ACC._1 + acci._1, ACC._2 + acci._2))
  val ave = result._1 / result._2.toDouble