Have fun with Hadoop Experiences with Hadoop and
- Slides: 31
Have fun with Hadoop Experiences with Hadoop and Map. Reduce Jian Wen DB Lab, UC Riverside
Outline Background on Map. Reduce Summer 09 (freeman? ): Processing Join using Map. Reduce Spring 09 (Northeastern): Netflix. Hadoop Fall 09 (UC Irvine): Distributed XML Filtering Using Hadoop
Background on Map. Reduce Started from Winter 2009 ◦ Course work: Scalable Techniques for Massive Data by Prof. Mirek Riedewald. ◦ Course project: Netflix. Hadoop Short explore in Summer 2009 ◦ Research topic: Efficient join processing on Map. Reduce framework. ◦ Compared the homogenization and map-reducemerge strategies. Continued in California ◦ UCI course work: Scalable Data Management by Prof. Michael Carey ◦ Course project: XML filtering using Hadoop
Map. Reduce Join: Research Plan Focused on performance analysis on different implementation of join processors in Map. Reduce. ◦ Homogenization: additional information about the source of the data in the map phase, then do the join in the reduce phase. ◦ Map-Reduce-Merge: a new primitive called merge is added to process the join separately. ◦ Other implementation: the map-reduce
Map. Reduce Join: Research Notes Cost analysis model on process latency. ◦ The whole map-reduce execution plan is divided into several primitives for analysis. Distribute Mapper: partition and distribute data onto several nodes. Copy Mapper: duplicate data onto several nodes. MR Transfer: transfer data between mapper and reducer. Summary Transfer: generate statistics of data and pass the statistics between working nodes. Output Collector: collect the outputs. Some basic attempts on theta-join using Map. Reduce. ◦ Idea: a mapper supporting multi-cast key.
Netflix. Hadoop: Problem Definition From Netflix Competition ◦ Data: 100480507 rating data from 480189 users on 17770 movies. ◦ Goal: Predict unknown ratings for any given user and movie pairs. ◦ Measurement: Use RMSE to measure the precise. Out approach: Singular Value Decomposition (SVD)
Netflix. Hadoop: SVD algorithm A feature means… ◦ User: Preference (I like sci-fi or comedy…) ◦ Movie: Genres, contents, … ◦ Abstract attribute of the object it belongs to. Feature Vector ◦ Each user has a user feature vector; ◦ Each movie has a movie feature vector. Rating for a (user, movie) pair can be estimated by a linear combination of the feature vectors of the user and the movie. Algorithm: Train the feature vectors to minimize the prediction error!
Netflix. Hadoop: SVD Pseudcode Basic idea: ◦ Initialize the feature vectors; ◦ Recursively: calculate the error, adjust the feature vectors.
Netflix. Hadoop: Implementation Data Pre-process ◦ Randomize the data sequence. ◦ Mapper: for each record, randomly assign an integer key. ◦ Reducer: do nothing; simply output (automatically sort the output based on the key) ◦ Customized Rating. Output. Format from File. Output. Format Remove the key in the output.
Netflix. Hadoop: Implementation Feature Vector Training ◦ Mapper: From an input (user, movie, rating), adjust the related feature vectors, output the vectors for the user and the movie. ◦ Reducer: Compute the average of the feature vectors collected from the map phase for a given user/movie. Challenge: vectors! Global sharing feature
Netflix. Hadoop: Implementation Global sharing feature vectors ◦ Global Variables: fail! Different mappers use different JVM and no global variable available between different JVM. ◦ Database (DBInput. Format): fail! Error on configuration; expecting bad performance due to frequent updates (race condition, query start-up overhead) ◦ Configuration files in Hadoop: fine! Data can be shared and modified by different mappers; limited by the main memory of each working node.
Netflix. Hadoop: Experiments usingle-thread, multithread and Map. Reduce Test Environment ◦ Hadoop 0. 19. 1 ◦ Single-machine, virtual environment: Host: 2. 2 GHz Intel Core 2 Duo, 4 GB 667 RAM, Max OS X Virtual machine: 2 virtual processors, 748 MB RAM each, Fedora 10. ◦ Distributed environment: 4 nodes (should be… currently 9 node) 400 GB Hard Driver on each node Hadoop Heap Size: 1 GB (failed to finish)
Netflix. Hadoop: Experiments
Netflix. Hadoop: Experiments
Netflix. Hadoop: Experiments
XML Filtering: Problem Definition Aimed at a pub/sub system utilizing distributed computation environment ◦ Pub/sub: Queries are known, data are fed as a stream into the system (DBMS: data are known, queries are fed).
XML Filtering: Pub/Sub System XML Docs XML Filters XML Queries
XML Filtering: Algorithms Use YFilter Algorithm ◦ YFilter: XML queries are indexed as a NFA, then XML data is fed into the NFA and test the final state output. ◦ Easy for parallel: queries can be partitioned and indexed separately.
XML Filtering: Implementations Three benchmark platforms are implemented in our project: ◦ Single-threaded: Directly apply the YFilter on the profiles and document stream. ◦ Multi-threaded: Parallel YFilter onto different threads. ◦ Map/Reduce: Parallel YFilter onto different machines (currently in pseudo-distributed environment).
XML Filtering: Single-Threaded Implementation l l l The index (NFA) is built once on the whole set of profiles. Documents then are streamed into the YFilter for matching. Matching results then are returned by YFilter.
XML Filtering: Multi-Threaded Implementation l l Profiles are split into parts, and each part of the profiles are used to build a NFA separately. Each YFilter instance listens a port for income documents, then it outputs the results through the socket.
XML Filtering: Map/Reduce Implementation Profile splitting: Profiles are read line by line with line number as the key and profile as the value. ◦ Map: For each profile, assign a new key using (old_key % split_num) ◦ Reduce: For all profiles with the same key, output them into a file. ◦ Output: Separated profiles, each with profiles having the same (old_key % split_num) value.
XML Filtering: Map/Reduce Implementation Document matching: Split profiles are read file by file with file number as the key and profiles as the value. ◦ Map: For each set of profiles, run YFilter on the document (fed as a configuration of the job), and output the old_key of the matching profile as the key and the file number as the values. ◦ Reduce: Just collect results. ◦ Output: All keys (line numbers) of matching profiles.
XML Filtering: Map/Reduce Implementation
XML Filtering: Experiments Hardware: ◦ Macbook 2. 2 GHz Intel Core 2 Duo ◦ 4 G 667 MHz DDR 2 SDRAM Software: ◦ Java 1. 6. 0_17, 1 GB heap size ◦ Cloudera Hadoop Distribution (0. 20. 1) in a virtual machine. Data: ◦ XML docs: SIGMOD Record (9 files). ◦ Profiles: 25 K and 50 K profiles on SIGMOD Record. Data Size 1 2 3 4 5 6 7 8 9 478416 415043 312515 213197 103528 53019 42128 30467 20984
XML Filtering: Experiments Run-out-of-memory: We encountered this problem in all the three benchmarks, however Hadoop is much robust on this: ◦ Smaller profile split ◦ Map phase scheduler uses the memory wisely. Race-condition: since the YFilter code we are using is not thread-safe, in multi-threaded version racecondition messes the results; however Hadoop works this around by its shared-nothing run-time. ◦ Separate JVM are used for different mappers, instead of threads that may share something lower-level.
Thousands XML Filtering: Experiments 45 Time Costs for Splitting 40 35 30 25 Time(ms) 20 15 10 5 0 Single 2 M 2 R: 2 S 2 M 2 R: 4 S 2 M 2 R: 8 S 4 M 2 R: 4 S
XML Filtering: Experiments 0: 03: 37 Map/Reduce: # of Splits on Profiles Time 0: 02: 53 There are memory failures, and jobs are failed too. 0: 02: 10 2 split 4 split 0: 01: 27 6 split 8 split 0: 00: 44 0: 01 0 1 2 3 4 5 Tasks 6 7 8 9
XML Filtering: Experiments 0: 04: 20 Map/Reduce: # of Mappers 0: 03: 37 Time 0: 02: 53 0: 02: 10 2 M 2 R 4 M 2 R 0: 01: 27 0: 00: 44 0: 01 0 1 2 3 4 5 Tasks 6 7 8 9
XML Filtering: Experiments 0: 08: 39 Map/Reduce: # of Profiles 0: 07: 13 There are memory failures but recovered. Time 0: 05: 46 0: 04: 20 25 K 50 K 0: 02: 53 0: 01: 27 0: 01 0 1 2 3 4 5 Tasks 6 7 8 9
Questions?
- Hadoop io hadoop comes with a set of
- Math quiz
- Smile and have fun
- A cube has 6 faces 8 vertices and 12 edges
- W
- Amazon work hard have fun
- Work hard play hard make history
- Maksud pencerapan
- Girls just wanna have fun singer
- Why have more fun
- Why have more fun
- Clausula adverbial
- We will have fun together
- Cs_rs11-iiia-1
- Guiding play and puppetry experiences
- Comparative development experiences of india and china
- Contrived experiences activities
- The totality of a groups thought experiences and patterns
- Don tiburcio de espadaña
- Architecture of rhipe
- Scale up and scale out in hadoop
- Combiner in hadoop
- Words have meaning and names have power
- Dangerous curves the zoo
- Endoskeleton, often has spiny skin
- Types of dining experiences
- Characteristics of a memoir
- Positive subjective experiences tend to be
- Guiding storytelling experiences
- What is social studies and why is it important
- They do not appreciate my singing
- What are the experiences of latin american countries