- Slides: 26
Working with Key/Value Pairs
Creating pair RDDs
• Convert regular RDDs to pair RDDs
  § val lines = sc.textFile("Readme.md")
  § val pairs = lines.map(x => (x.split(" ")(0), x))
• Create pair RDDs directly from a collection of tuples
  § val x = sc.parallelize(List(("a", 1), ("b", 1), ("a", 2)))
Transformations
• Pair RDDs are still RDDs
  § Transformations on regular RDDs can also be applied to pair RDDs
  § Is the above example appropriate for the lines of code?
Transformations: aggregations
• reduceByKey() and foldByKey()
  § Special cases of the combineByKey() operation
  § They are transformations, not actions
  § Run several parallel reduce/fold operations, one for each key in the dataset, where each operation combines values that have the same key
  § Implementation details in Spark
    v Each node carries out local aggregations
      • Use the initial value to initialize the accumulator
      • Initialization does not take place if the key does not exist
    v Shuffle
    v Reducer nodes carry out a second round of aggregation
      • Use the first accumulator for initialization
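The two operations can be sketched on the small pair RDD from the earlier slide. This is a minimal illustration, assuming a local SparkContext (the master setting and application name are made up for the example); it is written so it can also be pasted into spark-shell.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("aggregations-sketch"))

val pairs = sc.parallelize(List(("a", 1), ("b", 1), ("a", 2)))

// Both calls are transformations: nothing runs until an action is invoked.
val sums = pairs.reduceByKey(_ + _)
// foldByKey takes an initial (zero) value in addition to the combining function.
val sumsWithZero = pairs.foldByKey(0)(_ + _)

// collect() is the action that triggers the computation.
val reduced = sums.collect().toMap        // Map(a -> 3, b -> 1)
val folded = sumsWithZero.collect().toMap // same content as reduced
```

With 0 as the initial value, foldByKey() gives the same sums as reduceByKey() here.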
Per-key average
• What is the last step to calculate the per-key average?
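One way to answer the question, assuming the earlier steps have produced a (sum, count) pair per key, is to finish with a mapValues() that divides the sum by the count. The input data below is invented for the sketch.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("per-key-average-sketch"))

// Hypothetical sample data.
val input = sc.parallelize(
  List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))

// Step 1: pair every value with a count of 1.
// Step 2: add up sums and counts per key.
val sumCount = input
  .mapValues(v => (v, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))

// Last step: divide each per-key sum by its count.
val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }

val avgMap = averages.collect().toMap
// Map(panda -> 0.5, pink -> 3.5, pirate -> 3.0)
```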
Word count in Spark
• Another implementation
  § val input = sc.textFile("s3://...")
  § val words = input.flatMap(x => x.split(" "))
  § val result = words.countByValue()
• Warning:
  § countByValue() is an action
  § Using countByValue() may cause scalability issues, because the complete map of counts is returned to the driver
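A common alternative that keeps the counts distributed is to map each word to (word, 1) and aggregate with reduceByKey(). The sketch below assumes a local SparkContext and uses two in-memory lines in place of the input file; collect() appears only so the small result can be inspected.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("word-count-sketch"))

// Stand-in for sc.textFile(...): two hypothetical lines of text.
val input = sc.parallelize(List("to be or", "not to be"))

// The result stays an RDD, so it can be saved or processed further
// without pulling every count into the driver.
val counts = input
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Only safe for small results; a large result would normally be written out instead.
val countMap = counts.collect().toMap
// Map(to -> 2, be -> 2, or -> 1, not -> 1)
```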
combineByKey() transformation
• Turns an RDD[(K, V)] into a result RDD[(K, C)]
  § V: value; C: accumulator
• Three required component functions
  § createCombiner
    v Used to initialize the accumulator for each key in a partition
    v Called when a key is encountered for the first time in a partition
  § mergeValue
    v Merges a V into the C of its key, inside each partition
    v Called when a key has been seen previously in the partition
  § mergeCombiners
    v Merges the C's built for the same key in different partitions into a single one
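Per-key average using combineByKey()
A better version:
val result = input.combineByKey(
  // createCombiner: first value seen for a key in a partition becomes (sum, count)
  (v) => (v, 1),
  // mergeValue: fold another value into the partition-local (sum, count)
  (acc: (Int, Int), value) => (acc._1 + value, acc._2 + 1),
  // mergeCombiners: add up the (sum, count) pairs from different partitions
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
)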
map on pair RDDs
• map{case (key, value) => (key, value._1 / value._2.toFloat)}
• is the same as
• map({case (key, value) => (key, value._1 / value._2.toFloat)})
• { case argument => body } is a pattern-matching anonymous function, which can serve as a PartialFunction
• Can we use the following?
  § map( (key, value) => (key, value._1 / value._2.toFloat) )
• How about the following?
  § map( x => (x._1, x._2._1 / x._2._2.toFloat) )
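The variants can be compared on a small RDD of (key, (sum, count)) pairs. This is a sketch under the assumption of a local SparkContext and invented data. The two-parameter lambda form is left as a comment: Scala 2 rejects it because map() expects a function of one argument (the tuple), while Scala 3 accepts it through parameter untupling.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("map-on-pairs-sketch"))

// Hypothetical (key, (sum, count)) pairs.
val sumCount = sc.parallelize(List(("a", (3, 2)), ("b", (1, 1))))

// Pattern-matching form: destructures the tuple into key and value.
val viaCase = sumCount.map { case (key, value) => (key, value._1 / value._2.toFloat) }

// Accessor form: x is the whole tuple, so both fields go through _1 / _2.
val viaAccessors = sumCount.map(x => (x._1, x._2._1 / x._2._2.toFloat))

// Does not compile in Scala 2 (two parameters where one tuple is expected):
// sumCount.map((key, value) => (key, value._1 / value._2.toFloat))

val caseResult = viaCase.collect().toMap          // Map(a -> 1.5, b -> 1.0)
val accessorResult = viaAccessors.collect().toMap // same content
```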
Actions on pair RDDs
• All of the traditional actions available on the base RDD are also available on pair RDDs
• Additional actions
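Three actions that Spark provides specifically for pair RDDs are countByKey(), collectAsMap(), and lookup(). The sketch below shows them on a tiny dataset, assuming a local SparkContext; whether these are exactly the actions the slide listed is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("pair-actions-sketch"))

val pairs = sc.parallelize(List(("a", 1), ("b", 1), ("a", 2)))

// countByKey(): number of elements per key, returned to the driver as a Map.
val perKeyCounts = pairs.countByKey()

// collectAsMap(): the whole RDD as a Map on the driver. Keys are made unique
// first, because a Map can hold only one value per key.
val asMap = pairs.reduceByKey(_ + _).collectAsMap()

// lookup(key): all values stored under the given key.
val valuesForA = pairs.lookup("a")
```

All three return their results to the driver, so they suit small results only.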
Tuning the level of parallelism
• Use repartition() and coalesce() to partition an RDD into a particular number of partitions
  § repartition() can increase or decrease the number of partitions
    v It does a full shuffle, so it is expensive
  § coalesce() only decreases the number of partitions
    v It tries to minimize data movement across nodes
  § Both repartition() and coalesce() are transformations
    v The original RDDs are not affected
• partitionBy() can be used on pair RDDs
  § Applies only to pair RDDs
  § Requires a partitioner
  § Pairs with the same key are sent to the same partition
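The effect of the three calls on the partition count can be checked directly. This is a minimal sketch assuming a local SparkContext; the partition counts 4, 8, 2, and 5 are arbitrary.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("parallelism-sketch"))

val rdd = sc.parallelize(1 to 100, 4) // start with 4 partitions

val more = rdd.repartition(8) // full shuffle; can grow or shrink
val fewer = rdd.coalesce(2)   // shrinks while avoiding a full shuffle

// partitionBy() needs a pair RDD and a partitioner.
val pairs = rdd.map(x => (x % 10, x))
val byKey = pairs.partitionBy(new HashPartitioner(5))

val original = rdd.getNumPartitions // still 4: transformations return new RDDs
val grown = more.getNumPartitions   // 8
val shrunk = fewer.getNumPartitions // 2
val keyed = byKey.getNumPartitions  // 5
```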
An example of using partitionBy()
• Two RDDs
  § A large RDD of (UserID, UserInfo)
    v UserInfo contains a list of topics the user is subscribed to
    v E.g. (Alice, (music, sport, history))
  § A small RDD of (UserID, LinkInfo)
    v LinkInfo contains a list of links the user has clicked in the last 5 minutes
• Goal: count how many users visited a link that was not on one of their subscribed topics
An example of using partitionBy()
• join() transformation
An example of using partitionBy()
• The initial approach
An example of using partitionBy()
• The behavior
An example of using partitionBy()
• The improved implementation
An example of using partitionBy()
• The improved behavior
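The improved approach can be sketched as follows: the large RDD is hash-partitioned once and persisted, so each later join() only has to shuffle the small RDD. The case classes, the sample data, and the partition count are all hypothetical, LinkInfo is simplified to one clicked link's topic per record, and a local SparkContext is assumed.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("partitionBy-join-sketch"))

// Hypothetical record types; the slides do not define them.
case class UserInfo(topics: Set[String])
case class LinkInfo(topic: String)

// Large RDD: partition once and persist, so the layout is reused by every join.
val userData = sc.parallelize(List(
    ("Alice", UserInfo(Set("music", "sport", "history"))),
    ("Bob", UserInfo(Set("tech")))))
  .partitionBy(new HashPartitioner(4))
  .persist()

// Small RDD of recent click events.
val events = sc.parallelize(List(
  ("Alice", LinkInfo("music")),
  ("Alice", LinkInfo("politics")),
  ("Bob", LinkInfo("sport"))))

// Keep only clicks on topics the user is not subscribed to.
val offTopic = userData.join(events).filter {
  case (_, (info, link)) => !info.topics.contains(link.topic)
}

val offTopicVisits = offTopic.count()                // counts clicks
val offTopicUsers = offTopic.keys.distinct().count() // counts distinct users
```

Without the persist() call, the partitioning work would be repeated each time userData is used.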
Determining an RDD's Partitioner
• Find out how an RDD is partitioned through its partitioner property
  § rdd.partitioner
• Some transformations set a partitioner on the output RDD
  § e.g., reduceByKey()
• Some transformations produce a result with no partitioner
  § e.g., map()
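The partitioner property is an Option, which makes the difference between these cases easy to observe. The sketch assumes a local SparkContext; the data and partition count are arbitrary. It also includes mapValues(), which keeps the partitioner because it cannot change keys.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("partitioner-sketch"))

val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
val noPartitioner = pairs.partitioner // None: no partitioning information yet

val partitioned = pairs.partitionBy(new HashPartitioner(2))
val explicit = partitioned.partitioner // Some(HashPartitioner with 2 partitions)

// reduceByKey() sets a partitioner on its output.
val afterReduce = pairs.reduceByKey(_ + _).partitioner

// map() could change the keys, so the partitioner is dropped.
val afterMap = partitioned.map(kv => kv).partitioner

// mapValues() leaves keys untouched, so the partitioner is kept.
val afterMapValues = partitioned.mapValues(_ + 1).partitioner
```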
PageRank
• Two RDDs
  § (pageID, linkList)
  § (pageID, rank)
• Algorithm (details differ from the one discussed for MapReduce)
  § Initialize each page's rank to 1.0
  § On each iteration, have page p send a contribution of rank(p)/numNeighbors(p) to its neighbors (the pages it has links to)
  § Set each page's rank to 0.15 + 0.85 * contributionsReceived
  § The last two steps repeat for multiple iterations
PageRank in Spark
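The algorithm described above can be sketched with the two pair RDDs. This is an illustration, not the code from the original slides: the three-page graph, the partition count, and the ten iterations are assumptions, as is the local SparkContext.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("pagerank-sketch"))

// (pageID, linkList): a hypothetical graph in which every page has
// both outgoing and incoming links.
val links = sc.parallelize(List(
    ("a", Seq("b", "c")),
    ("b", Seq("c")),
    ("c", Seq("a"))))
  .partitionBy(new HashPartitioner(2)) // partition once, reuse in every join
  .persist()

// (pageID, rank): every page starts at 1.0. mapValues keeps the partitioner.
var ranks = links.mapValues(_ => 1.0)

for (_ <- 0 until 10) {
  // Each page sends rank / numNeighbors to every page it links to.
  val contributions = links.join(ranks).flatMap {
    case (_, (neighbors, rank)) => neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // New rank = 0.15 + 0.85 * sum of received contributions.
  ranks = contributions.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)
}

val finalRanks = ranks.collect().toMap
```

Because links is partitioned and persisted up front, the join in each iteration does not need to reshuffle it.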
Custom Partitioner
• Two built-in partitioners
  § HashPartitioner
  § RangePartitioner
• Custom Partitioner
  § For example, partition a pair RDD based on part of the keys
    v http://www.cnn.com/WORLD
    v http://www.cnn.com/US
  § Define the custom partitioner by subclassing org.apache.spark.Partitioner
    v numPartitions: Int
    v getPartition(key: Any): Int
    v equals(other: Any): Boolean
Custom Partitioner example
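A partitioner that groups URLs by host name could look like the sketch below. The class name DomainNamePartitioner and the choice of 4 partitions are illustrative assumptions; only the three overridden members come from the Partitioner contract listed above.

```scala
import java.net.URI
import org.apache.spark.Partitioner

// Hypothetical partitioner: keys are URLs, and only the host decides the partition.
class DomainNamePartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val host = new URI(key.toString).getHost
    val code = host.hashCode % numPartitions
    // hashCode may be negative; the partition index must be in [0, numPartitions).
    if (code < 0) code + numPartitions else code
  }

  // Spark compares partitioners to decide whether two RDDs are co-partitioned.
  override def equals(other: Any): Boolean = other match {
    case p: DomainNamePartitioner => p.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}

val partitioner = new DomainNamePartitioner(4)
// Both URLs share the host www.cnn.com, so they land in the same partition.
val world = partitioner.getPartition("http://www.cnn.com/WORLD")
val us = partitioner.getPartition("http://www.cnn.com/US")
```

An instance would be passed to partitionBy() in the same way as a HashPartitioner.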
Accumulator
• Accumulators will NOT be updated if the transformations that update them have not been evaluated.
• Accumulators are write-only for worker nodes.
Examples
• In the pagerank.v1.scala code, the contribs.count() call is required to update the accumulator.
• In the pagerank.v2.scala code, the worker nodes try to read the value of the accumulator, which results in an error.
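The first point can be reproduced with a small stand-alone sketch that is independent of the pagerank files. It assumes a local SparkContext and Spark 2.0 or later, where longAccumulator() is available; the accumulator name and data are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for illustration only; master and app name are assumptions.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("accumulator-sketch"))

val blankLines = sc.longAccumulator("blankLines")
val lines = sc.parallelize(List("a", "", "b", ""))

// filter() is a lazy transformation: the accumulator is only written
// on the workers once an action forces the evaluation.
val nonBlank = lines.filter { line =>
  if (line.isEmpty) blankLines.add(1)
  line.nonEmpty
}

val before: Long = blankLines.value // still 0, nothing has run yet

val kept = nonBlank.count() // the action evaluates the filter

val after: Long = blankLines.value // read on the driver, after the action
```

The value is read only on the driver; the closure running on the workers only calls add().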