Working with Key/Value Pairs



Creating pair RDDs
• Convert regular RDDs to pair RDDs
§ val lines = sc.textFile("Readme.md")
§ val pairs = lines.map(x => (x.split(" ")(0), x))
• Create pair RDDs directly
§ val x = sc.parallelize(List(("a", 1), ("b", 1), ("a", 2)))


Transformations
• Pair RDDs are still RDDs
§ Transformations on regular RDDs can also be applied to pair RDDs
§ Is the above example appropriate for the lines of code?


Transformations: aggregations
• reduceByKey() and foldByKey()
§ Special cases of the combineByKey() operation
§ They are transformations, not actions
§ Run several parallel reduce/fold operations, one for each key in the dataset, where each operation combines values that have the same key
§ Implementation details in Spark
v Each node carries out local aggregations
• Use the initial value to initialize the accumulator
• Initialization does not take place if the key does not exist on that node
v Shuffle
v Reducer nodes carry out a second round of aggregations
• Use the first accumulator for initialization
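A minimal sketch of both aggregations, assuming a spark-shell session where `sc` is already available:

```scala
// Sample pair RDD: (key, value)
val pairs = sc.parallelize(List(("a", 1), ("b", 4), ("a", 2)))

// reduceByKey: one parallel reduce per key; aggregates locally on each
// node before the shuffle
val sums = pairs.reduceByKey(_ + _)

// foldByKey: like reduceByKey, but with an initial "zero value" used to
// initialize each key's accumulator
val folded = pairs.foldByKey(0)(_ + _)
```

With a zero value of 0 and addition, foldByKey() produces the same per-key sums as reduceByKey().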


Per-key average
• What is the last step to calculate the per-key average?
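A sketch of the first steps, assuming spark-shell with `sc` and illustrative sample data: pair each value with a count of 1, then sum both components per key. The missing last step is dividing the per-key sum by the per-key count.

```scala
val input = sc.parallelize(
  List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))

// Step 1: attach a count of 1 to each value
// Step 2: sum the values and the counts per key
val sumCounts = input.mapValues(v => (v, 1))
                     .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))

// The last step: divide the sum by the count for each key
val averages = sumCounts.mapValues { case (sum, count) => sum.toDouble / count }
```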


Word count in Spark
• Another implementation
§ val input = sc.textFile("s3://...")
§ val words = input.flatMap(x => x.split(" "))
§ val result = words.countByValue()
• Warning:
§ countByValue() is an action
§ Using countByValue() may bring scalability issues, since all counts are returned to the driver
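A more scalable variant keeps the counts distributed by using the reduceByKey() transformation in place of the countByValue() action. A sketch, assuming spark-shell with `sc`; the in-memory list stands in for the elided s3 input path:

```scala
val input = sc.parallelize(List("a b a", "b c"))   // stand-in for sc.textFile(...)
val words = input.flatMap(x => x.split(" "))

// reduceByKey is a transformation: the counts stay in an RDD instead of
// being collected into driver memory as countByValue() would do
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
```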


combineByKey() transformation
• Turns an RDD[(K, V)] into a result RDD[(K, C)]
§ V: value; C: accumulator (combiner)
• Three required component functions
§ createCombiner
v Used to initialize the accumulator of the value for each key in a partition
v Called when a key is encountered for the first time in a partition
§ mergeValue
v Merges a V into a C inside each partition
v Called when a key has been seen previously in that partition
§ mergeCombiners
v Merges a list of C's into a single one across partitions


Per-key average using combineByKey()
A better version:
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
)

Per-key average using combineByKey()



map on pair RDDs
• map{case (key, value) => (key, value._1 / value._2.toFloat)}
• is the same as
• map({case (key, value) => (key, value._1 / value._2.toFloat)})
• { case argument => body } is a partial function
• Can we use the following?
§ map( (key, value) => (key, value._1 / value._2.toFloat) )
• How about the following?
§ map( x => (x._1, x._2._1 / x._2._2.toFloat) )
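The questions above can be checked directly in spark-shell. The two-parameter form does not compile, because map() passes a single tuple argument rather than two parameters; the tuple-accessor form works once the count is addressed as `x._2._2`. A sketch with hypothetical (sum, count) data, also showing mapValues() as the idiomatic alternative:

```scala
val sumCounts = sc.parallelize(List(("a", (3, 2)), ("b", (4, 1))))

// Partial-function form: pattern-matches the single (key, value) tuple
val avg1 = sumCounts.map { case (key, value) => (key, value._1 / value._2.toFloat) }

// Tuple-accessor form: x._2 is the (sum, count) pair, so the count is x._2._2
val avg2 = sumCounts.map(x => (x._1, x._2._1 / x._2._2.toFloat))

// mapValues leaves the key untouched (and preserves any partitioner)
val avg3 = sumCounts.mapValues { case (sum, count) => sum / count.toFloat }
```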


Actions on pair RDDs
• All of the traditional actions available on the base RDD are also available on pair RDDs
• Additional actions, e.g., countByKey(), collectAsMap(), lookup(key)
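A short sketch of a few pair-RDD-specific actions, assuming spark-shell with `sc`:

```scala
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

// countByKey: number of elements per key, returned to the driver
val byKey = pairs.countByKey()

// collectAsMap: collect into a driver-side Map (later values for a
// duplicate key overwrite earlier ones)
val asMap = pairs.collectAsMap()

// lookup: all values associated with one key
val aValues = pairs.lookup("a")
```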


Tuning the level of parallelism
• Use repartition() and coalesce() to partition an RDD into a particular number of partitions
§ repartition() can increase or decrease the number of partitions
v It does a full shuffle, so it is expensive
§ coalesce() only decreases the number of partitions
v It tries to minimize data movement across nodes
§ Both repartition() and coalesce() are transformations
v The original RDDs are not affected
• partitionBy() can be used on pair RDDs
§ Only applies to pair RDDs
§ A partitioner needs to be provided
§ Pairs with the same key will be sent to the same partition
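A sketch of the three operations, assuming spark-shell with `sc`:

```scala
import org.apache.spark.HashPartitioner

val rdd = sc.parallelize(1 to 100, 8)      // start with 8 partitions

val more  = rdd.repartition(16)            // full shuffle; can grow or shrink
val fewer = rdd.coalesce(4)                // no full shuffle; can only shrink

// partitionBy applies only to pair RDDs and takes an explicit partitioner;
// all pairs with the same key land in the same partition
val pairs = rdd.map(x => (x % 3, x)).partitionBy(new HashPartitioner(3))
```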


An example to use partitionBy()
• Two RDDs
§ A large RDD of (UserID, UserInfo)
v UserInfo contains a list of topics the user is subscribed to
v E.g. (Alice, (music, sport, history))
§ A small RDD of (UserID, LinkInfo)
v LinkInfo contains a list of links a user has clicked in the last 5 minutes
• We wish to count how many users visited a link that was not to one of their subscribed topics

An example to use partitionBy() • join() transformation

An example to use partitionBy() • The initial approach

An example to use partitionBy() • The behavior

An example to use partitionBy() • The improved implementation

An example to use partitionBy() • The improved behavior
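The code for this example is not reproduced in the transcript; the following is a hedged sketch of the improved implementation, with hypothetical stand-in data. The key idea is to hash-partition the large userData RDD once and persist it, so that a repeated join() only shuffles the small events RDD:

```scala
import org.apache.spark.HashPartitioner

// Hypothetical stand-ins for the two RDDs in the example
val userData = sc.parallelize(List(
  ("Alice", Set("music", "sport", "history")),
  ("Bob",   Set("news"))
)).partitionBy(new HashPartitioner(4))  // partition once, up front
  .persist()                            // reuse across periodic queries

val events = sc.parallelize(List(("Alice", "cooking"), ("Bob", "news")))

// userData is pre-partitioned, so join() shuffles only the small events RDD
val joined = userData.join(events)
val offTopicVisits =
  joined.filter { case (_, (topics, link)) => !topics.contains(link) }.count()
```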


Determining an RDD's Partitioner
• Tell how an RDD is partitioned using its partitioner property
§ rdd.partitioner
• Some transformations will result in a partitioner set on the output RDD
§ e.g., reduceByKey()
• Some transformations will produce a result with no partitioner
§ e.g., map()
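A quick sketch of inspecting the partitioner property, assuming spark-shell with `sc`:

```scala
val pairs = sc.parallelize(List(("a", 1), ("b", 2)))
val p0 = pairs.partitioner                 // None: no partitioner yet

val reduced = pairs.reduceByKey(_ + _)     // sets a hash partitioner on the output
val p1 = reduced.partitioner               // Some(HashPartitioner)

// map() may change the keys, so Spark drops the partitioner on its output
val mapped = reduced.map(identity)
val p2 = mapped.partitioner                // None
```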


PageRank
• Two RDDs
§ (pageID, linkList)
§ (pageID, rank)
• Algorithm (details differ from the one discussed in MapReduce)
§ Initialize each page's rank to 1.0
§ On each iteration, have page p send a contribution of rank(p)/numNeighbors(p) to its neighbors (the pages it has links to)
§ Set each page's rank to 0.15 + 0.85 * contributionsReceived
§ The last two steps repeat for multiple iterations

PageRank in Spark
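The code for this slide is not reproduced in the transcript; a sketch of the algorithm above, with hypothetical link data and assuming spark-shell with `sc` (a production version would also partition the links RDD and persist it across iterations):

```scala
// Hypothetical link structure: (pageID, list of pages it links to)
val links = sc.parallelize(List(
  ("a", List("b", "c")),
  ("b", List("a")),
  ("c", List("a"))
)).persist()

// Initialize each page's rank to 1.0
var ranks = links.mapValues(_ => 1.0)

for (_ <- 1 to 10) {
  // Each page p sends rank(p) / numNeighbors(p) to its neighbors
  val contribs = links.join(ranks).flatMap {
    case (_, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // New rank: 0.15 + 0.85 * contributions received
  ranks = contribs.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
}
```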


Custom Partitioner
• Two built-in partitioners
§ HashPartitioner
§ RangePartitioner
• Custom Partitioner
§ For example, partition a pair RDD based on part of the keys
v http://www.cnn.com/WORLD
v http://www.cnn.com/US
§ Define the custom partitioner by subclassing org.apache.spark.Partitioner
v numPartitions: Int
v getPartition(key: Any): Int
v equals(): Boolean

Custom Partitioner example

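The example code is not reproduced in the transcript; the following is an illustrative sketch (the class name and section-extraction logic are assumptions) of a partitioner that groups URL keys by host plus first path segment, so that e.g. all `/WORLD` pages land in one partition:

```scala
import org.apache.spark.Partitioner
import java.net.URL

// Hypothetical partitioner keyed on host + first path segment,
// e.g. www.cnn.com/WORLD vs www.cnn.com/US
class DomainPathPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts

  override def getPartition(key: Any): Int = {
    val url = new URL(key.toString)
    // "/WORLD/asia".split("/").take(2) keeps only the first segment
    val section = url.getHost + url.getPath.split("/").take(2).mkString("/")
    math.abs(section.hashCode % numPartitions)
  }

  // equals() lets Spark detect when two RDDs share the same partitioning
  // and skip an unnecessary shuffle
  override def equals(other: Any): Boolean = other match {
    case p: DomainPathPartitioner => p.numPartitions == numPartitions
    case _                        => false
  }

  override def hashCode: Int = numPartitions
}
```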


Accumulator
• Accumulators will NOT be updated if transformations have not been evaluated.
• Accumulators are write-only for worker nodes.
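A sketch of both points, assuming spark-shell with `sc` and the Spark 2.x longAccumulator API:

```scala
val blankLines = sc.longAccumulator("blankLines")
val lines = sc.parallelize(List("hello", "", "world", ""))

// The accumulator update runs inside a transformation, so nothing is
// counted until an action forces evaluation
val nonEmpty = lines.filter { line =>
  if (line.isEmpty) blankLines.add(1)   // worker side: write-only
  line.nonEmpty
}

nonEmpty.count()       // action: now the updates actually happen
val blanks = blankLines.value   // read the value on the driver only
```

Note that re-running the action would re-execute the filter and double-count the blanks unless nonEmpty is cached, which is one more reason to treat accumulator values in transformations with care.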


Examples
• In the pagerank.v1.scala code, the contribs.count() is required to update the accumulator.
• In the pagerank.v2.scala code, the worker nodes try to read the value of the accumulator, which results in an error.