MapReduce

Based on:
• "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat
• "Mining of Massive Datasets", Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
• "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

Contents
• Why do we need distributed computing for big data?
• What is MapReduce?
• Functional programming review
• MapReduce concept
• First example – word counting
• Fault tolerance
• Optimizations
• More examples
• Complexity
• Real-world example

Why do we need distributed computing for big data?
• A single computer does not have enough:
  ◦ RAM
  ◦ Hard-disk capacity and IOPS
  ◦ Network bandwidth
  ◦ CPU

What is MapReduce?
• MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
• There are many other MapReduce frameworks built for different environments (Hadoop is the leading open-source implementation).
• Why not another framework (like MPI)?

Functional programming review
• Functional operations do not modify data structures: they always create new ones. The original data still exists in unmodified form.
• No side effects (reading input from the user, networking, etc.).
• Data flows are explicit in program design.
• Order of operations does not matter: fun foo (l : int list) = sum(l) + mul(l) + length(l)
• Functions can be passed as arguments.

Map
Creates a new list by applying f to each element of the input list; returns the output in order.
map f [] = []
map f (a::as) = f(a) :: map f as
Example: upper(x) : char -> char
Input: lst = [a, b, c]
Operation: map upper lst
Output: [A, B, C]
Source: Google's video slides, "Cluster Computing and MapReduce"
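As a minimal sketch in Python (rather than the ML-style notation of the slide), the same recursive definition might look as follows. The name `my_map` is hypothetical and used only for illustration; Python's built-in `map` does the same job.

```python
def my_map(f, items):
    """Return a new list with f applied to each element, in order."""
    if not items:                 # map f [] = []
        return []
    head, *tail = items           # map f (a::as) = f(a) :: map f as
    return [f(head)] + my_map(f, tail)

lst = ["a", "b", "c"]
upper_lst = my_map(str.upper, lst)
print(upper_lst)  # ['A', 'B', 'C']
print(lst)        # the input list is not modified: ['a', 'b', 'c']
```

The input list is left untouched, which matches the "no modification of data structures" property of functional operations.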

Fold
Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list.
fun foldl f z [] = z
  | foldl f z (x::xs) = foldl f (f(z, x)) xs;
Example: we wish to write a sum function that receives an int list and returns its sum.
fun sum(lst) = foldl (fn (a, x) => a + x) 0 lst
Source: Google's video slides, "Cluster Computing and MapReduce"
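A minimal Python sketch of the same left fold, assuming an iterative loop in place of the recursive definition. The names `foldl` and `total` are illustrative; `functools.reduce` is the standard-library counterpart and is shown for comparison.

```python
from functools import reduce

def foldl(f, z, items):
    """Left fold: combine the accumulator z with each element, left to right."""
    acc = z
    for x in items:
        acc = f(acc, x)   # f returns the next accumulator value
    return acc

def total(lst):
    """Sum of an int list, written as a fold."""
    return foldl(lambda acc, x: acc + x, 0, lst)

print(total([1, 2, 3, 4]))                                # 10
print(reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0))    # 10
```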

Google's map
map (in_key, in_value) -> (out_key, intermediate_value) list
Example: map("play.txt", "to be or not to be")
Will emit: ("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)
Source: "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

Google's reduce
reduce (out_key, intermediate_value list) -> (key, out_value) list
Example: reduce("to", [1, 1, 1])
Will emit: [("to", 3)]
Source: "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
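The two signatures can be tried out with a small single-process sketch. This is only an illustration of the data flow, not Google's implementation: `run_job` is a hypothetical driver that stands in for the framework by grouping intermediate values per key in memory.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """(document name, document text) -> list of (word, 1) pairs."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """(word, list of counts) -> list holding one (word, total) pair."""
    return [(out_key, sum(intermediate_values))]

def run_job(inputs, map_fn, reduce_fn):
    """Hypothetical driver: map every input, group by key, reduce each group."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            groups[out_key].append(value)
    output = []
    for out_key in sorted(groups):        # reducers see keys in sorted order
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output

result = run_job([("play.txt", "to be or not to be")], map_fn, reduce_fn)
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```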

Partition and combine functions
Partition: a simple hash function, hash(key) mod R. The hashed value may be derived from the key, e.g. hash(hostname(url)) mod R.
Example with hash(key) mod 2:
Input: ("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)
Output regions:
("to", 1), ("be", 1), ("to", 1), ("be", 1)
("or", 1), ("not", 1)
Combine: similar to the reduce function, applied locally on the worker (more details will follow).
Input: ("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)
Output: ("to", 2), ("be", 2), ("or", 1), ("not", 1)
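A minimal sketch of both functions for the word-count case. It assumes `zlib.crc32` as a deterministic stand-in for the hash function (Python's built-in `hash` of strings varies between runs), so which keys share a region will generally differ from the slide's illustration.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """hash(key) mod R, with CRC32 as an assumed stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def combine(pairs):
    """Pre-aggregate (word, count) pairs on the local worker."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

mapper_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
combined = combine(mapper_output)
print(combined)  # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]

R = 2
regions = {r: [] for r in range(R)}
for key, value in combined:
    regions[partition(key, R)].append((key, value))
print(regions)   # every pair with the same key lands in the same region
```

Combining before partitioning means four pairs instead of six cross the network in this example.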

The MapReduce concept
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept
Split the work into pieces.
Start running the code on the workers.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept
Source: "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

The MapReduce concept
Assign mappers.
Assign reducers.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept
Mappers read the input.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept
When a map worker finishes, it:
• writes the map output into R regions, as determined by the partitioning function
• registers the results with the master.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept
Reducers read the input, sort it and start reducing.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

The MapReduce concept
Each reducer stores its output on GFS and informs the master.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

First example – word counting
Source: "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

First example – word counting
(w1, 1)
Source: "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer

Example 2 – reverse a list of links
Mapper input: (url, web page content)
Mapper function:
Reduce function:

Example 2 – reverse a list of links
Mapper input: (url, web page content)
(themarker.com, <HEAD> … href="ynet.com" …)
(calcalist.com, <HEAD> … href="ynet.com" …)
Mapper function: (url, web page content) -> (target, source) list
[(ynet.com, themarker.com)]
[(ynet.com, calcalist.com)]
Reduce function: (target, source) list -> (target, source list)
(ynet.com, [themarker.com, calcalist.com])
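A minimal Python sketch of this example. The regular expression is a simplification assumed for illustration (a real crawler would use an HTML parser), the page contents are made-up samples, and the in-memory grouping stands in for the framework's shuffle step.

```python
import re
from collections import defaultdict

HREF = re.compile(r'href="([^"]+)"')   # simplified link extraction (assumption)

def map_fn(url, content):
    """(source url, page content) -> list of (target, source) pairs."""
    return [(target, url) for target in HREF.findall(content)]

def reduce_fn(target, sources):
    """(target, list of sources) -> (target, source list)."""
    return (target, list(sources))

pages = [
    ("themarker.com", '<HEAD> ... href="ynet.com" ...'),
    ("calcalist.com", '<HEAD> ... href="ynet.com" ...'),
]

groups = defaultdict(list)             # stands in for the shuffle step
for url, content in pages:
    for target, source in map_fn(url, content):
        groups[target].append(source)

inverted = [reduce_fn(target, sources) for target, sources in groups.items()]
print(inverted)  # [('ynet.com', ['themarker.com', 'calcalist.com'])]
```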

Example 3 – distributed grep
Given a word and a list of text files, return the files and the lines in which the word appears.
Mapper input: (docId, docContent)
Mapper function:
Reduce function:

Example 3 – distributed grep
Mapper input: (docId, docContent)
Mapper function: (docId, docContent) -> (docId, line that matches the pattern) list
Reduce function: identity function
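A minimal sketch of distributed grep in Python. It assumes plain substring matching for the search word, and the two documents are made-up samples; the loop at the bottom stands in for the framework.

```python
def map_fn(doc_id, doc_content, word):
    """(docId, docContent) -> list of (docId, line containing the word)."""
    return [(doc_id, line) for line in doc_content.splitlines() if word in line]

def reduce_fn(doc_id, lines):
    """Identity: pass every (docId, line) pair through unchanged."""
    return [(doc_id, line) for line in lines]

docs = [
    ("doc1", "to be or not to be\nthat is the question"),
    ("doc2", "nothing to see here"),
]

matches = []
for doc_id, content in docs:
    mapped = map_fn(doc_id, content, "to")
    matches.extend(reduce_fn(doc_id, [line for _, line in mapped]))
print(matches)  # [('doc1', 'to be or not to be'), ('doc2', 'nothing to see here')]
```

Because the reducer is the identity, all of the useful work happens in the mappers, which makes this job easy to parallelize.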

Example 4 – BFS
Given a source node, return the nodes in the graph, each annotated with its distance from the source.
Mapper input: (nodeId, N)
// N.distance – distance from the source node
// N.AdjacencyList
Mapper function:
Reduce function:

Example 4 – BFS
Source: "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
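One common way to phrase BFS as an iterated MapReduce job is sketched below, assuming unit edge weights. Each mapper re-emits the node structure and proposes distance + 1 to every neighbour; each reducer keeps the smallest distance. The dictionary keys `distance` and `adjacency` mirror N.distance and N.AdjacencyList, and the helper `bfs_iteration` and the sample graph are hypothetical.

```python
import math
from collections import defaultdict

def map_fn(node_id, node):
    """Emit the node itself, plus a candidate distance for each neighbour."""
    out = [(node_id, ("node", node))]
    if node["distance"] != math.inf:
        for neighbour in node["adjacency"]:
            out.append((neighbour, ("dist", node["distance"] + 1)))
    return out

def reduce_fn(node_id, values):
    """Recover the graph structure and keep the smallest distance seen."""
    adjacency, best = [], math.inf
    for kind, payload in values:
        if kind == "node":
            adjacency = payload["adjacency"]
            best = min(best, payload["distance"])
        else:
            best = min(best, payload)
    return node_id, {"distance": best, "adjacency": adjacency}

def bfs_iteration(graph):
    """One MapReduce round, with in-memory grouping standing in for the shuffle."""
    groups = defaultdict(list)
    for node_id, node in graph.items():
        for key, value in map_fn(node_id, node):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

graph = {
    "a": {"distance": 0, "adjacency": ["b", "c"]},
    "b": {"distance": math.inf, "adjacency": ["d"]},
    "c": {"distance": math.inf, "adjacency": ["d"]},
    "d": {"distance": math.inf, "adjacency": []},
}
while True:                      # repeat rounds until no distance changes
    updated = bfs_iteration(graph)
    if updated == graph:
        break
    graph = updated

distances = {node_id: node["distance"] for node_id, node in graph.items()}
print(distances)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```

Each round extends the search frontier by one hop, so the number of rounds grows with the graph's diameter.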

Example 5 – Matrix Multiplication

Example 5 – Matrix-Vector Multiplication
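A minimal sketch of matrix-vector multiplication as map and reduce steps, assuming the vector v is small enough to be available in full to every mapper. The matrix is given as (i, j, m_ij) entries; the sample values are made up.

```python
from collections import defaultdict

def map_fn(entry, v):
    """(i, j, m_ij) -> [(i, m_ij * v_j)]: one partial product per matrix entry."""
    i, j, m_ij = entry
    return [(i, m_ij * v[j])]

def reduce_fn(i, products):
    """(row index, partial products) -> (row index, x_i)."""
    return (i, sum(products))

matrix_entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 1.0]

groups = defaultdict(list)       # stands in for the shuffle step
for entry in matrix_entries:
    for i, product in map_fn(entry, v):
        groups[i].append(product)

x = [reduce_fn(i, groups[i]) for i in sorted(groups)]
print(x)  # [(0, 12.0), (1, 34.0)]
```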

Fault tolerance
During the MapReduce task, the master periodically pings all workers.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

Fault tolerance
1) An in-progress map or reduce task is restarted on another machine.
2) A completed map task is also restarted on another machine, because its output is stored on the local disk of the failed machine.
3) A completed reduce task is not restarted, since its result is stored on GFS.
4) If several mappers fail on the same input, that input is marked as invalid and skipped.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat
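Rules 1 to 3 can be sketched as a small decision function. This is an illustration only: the task records, their field names and the function `tasks_to_reschedule` are hypothetical and not part of any real framework's API.

```python
def tasks_to_reschedule(tasks, failed_worker):
    """Return the ids of tasks the master must run again after a worker fails."""
    rerun = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["state"] == "in_progress":
            rerun.append(task["id"])      # rule 1: any in-progress task
        elif task["state"] == "completed" and task["kind"] == "map":
            rerun.append(task["id"])      # rule 2: completed map tasks
        # rule 3: completed reduce tasks are kept, their output is on GFS
    return rerun

tasks = [
    {"id": "map-0", "kind": "map", "state": "completed", "worker": "w1"},
    {"id": "map-1", "kind": "map", "state": "in_progress", "worker": "w2"},
    {"id": "reduce-0", "kind": "reduce", "state": "completed", "worker": "w1"},
    {"id": "reduce-1", "kind": "reduce", "state": "in_progress", "worker": "w1"},
]
print(tasks_to_reschedule(tasks, "w1"))  # ['map-0', 'reduce-1']
```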

Fault tolerance
The master is a single point of failure.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

Optimizations
• The master tries to assign each map task to a worker on, or as close as possible to, the machine that stores the input file.
• Combine is used to reduce network bandwidth consumption, e.g. it is better to transmit ("pig", 3) than ("pig", 1), ("pig", 1), ("pig", 1).
• Some mappers may lag behind; the master allocates a backup worker near the end of the map phase.

Optimizations
Some mappers may lag behind; the master allocates a backup worker near the end of the map phase.
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

Bandwidth over time
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

Google MapReduce usage
Source: "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat

Complexity theory for MapReduce
We wish to:
• Shrink the wall-clock time
• Execute each reducer in main memory
We will look at two parameters of the algorithm:
• Reducer size (q): the upper bound on the number of values that are allowed to appear in the list associated with a single key.
• Replication rate (r): the number of key-value pairs produced by all the map tasks on all the inputs, divided by the number of inputs.

Complexity - Example
Similarity joins:
• Given a large set of elements X and a similarity measure s(x, y) that tells how similar two elements x and y of X are, find the pairs whose similarity exceeds a given threshold.
• Example input: 1 M images, 1 MB each.

Complexity - Example

Complexity - Example (fixed)
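The trade-off between replication rate r and reducer size q for the 1 M image input can be worked through numerically. The sketch below follows the similarity-join example in "Mining of Massive Datasets": a naive scheme with one reducer key per pair of images, and a grouped scheme that splits the images into g groups with one key per pair of groups. Counting 1 MB as 10**6 bytes and choosing g = 1000 are assumptions made for illustration.

```python
NUM_IMAGES = 1_000_000
IMAGE_BYTES = 1_000_000   # assumption: 1 MB counted as 10**6 bytes

def naive_pairs():
    """One reducer key per pair of images."""
    r = NUM_IMAGES - 1          # each image is sent once per other image
    q = 2                       # each reducer receives exactly two images
    return r, q, NUM_IMAGES * r * IMAGE_BYTES

def grouped(g):
    """Split the images into g groups; one reducer key per pair of groups."""
    r = g - 1                   # each image goes to every pair containing its group
    q = 2 * NUM_IMAGES // g     # two groups of NUM_IMAGES / g images each
    return r, q, NUM_IMAGES * r * IMAGE_BYTES

for name, (r, q, total) in [("naive", naive_pairs()), ("g=1000", grouped(1000))]:
    reducer_gb = q * IMAGE_BYTES / 1e9
    print(f"{name}: r={r}, q={q}, bytes sent={total:.2e}, reducer input={reducer_gb:.3f} GB")
```

Under these assumptions the naive scheme ships roughly 10**18 bytes, while grouping with g = 1000 cuts that to roughly 10**15 bytes at the cost of a 2 GB input per reducer.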

Real-world example
A graphical model is a probabilistic model for which a graph denotes the conditional dependence structure between random variables.

Real-world example
Source: "Distributed Message Passing for Large Scale Graphical Models", Alexander Schwing

Real-world example
Iteration 1: the input entry for a single map task will be as follows:

Real world example Iteration 1:

Real world example Iteration 1: mapper output

Real world example Iteration 2:

Hands on

Conclusions
The good:
1) Simple.
2) Proven.
3) Many implementations for different platforms and languages.
The bad:
1) It precludes performance improvements that common database systems enable.
2) MapReduce algorithms are not always easy to design.
3) Not all algorithms can be converted to work efficiently on MapReduce.

References
• "MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat
• "Data-Intensive Text Processing with MapReduce", Jimmy Lin & Chris Dyer
• "Pro Hadoop", Jason Venner
• Google's video series "Cluster Computing and MapReduce": http://code.google.com/edu/submissions/mapreduceminilecture/listing.html

Partial list of implementations
◦ The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
◦ The Hadoop project is a free open-source Java MapReduce implementation.
◦ Twister is an open-source Java MapReduce implementation that supports iterative MapReduce computations efficiently.
◦ Greenplum is a commercial MapReduce implementation, with support for Python, Perl, SQL and other languages.
◦ Aster Data Systems nCluster In-Database MapReduce supports Java, C, C++, Perl, and Python algorithms integrated into ANSI SQL.
◦ GridGain is a free open-source Java MapReduce implementation.
◦ Phoenix is a shared-memory implementation of MapReduce implemented in C.
◦ FileMap is an open version of the framework that operates on files using existing file-processing tools rather than tuples.
◦ MapReduce has also been implemented for the Cell Broadband Engine, also in C.
◦ Mars: MapReduce has been implemented on NVIDIA GPUs (graphics processors) using CUDA.
◦ Qt Concurrent is a simplified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.
◦ CouchDB uses a MapReduce framework for defining views over distributed documents and is implemented in Erlang.
◦ Skynet is an open-source Ruby implementation of Google's MapReduce framework.
◦ Disco is an open-source MapReduce implementation by Nokia. Its core is written in Erlang and jobs are normally written in Python.
◦ Misco is an open-source MapReduce designed for mobile devices and is implemented in Python.
◦ Qizmt is an open-source MapReduce framework from MySpace written in C#.
◦ The open-source Hive framework from Facebook (which provides an SQL-like language over files, layered on the open-source Hadoop MapReduce engine).
◦ The Holumbus Framework: distributed computing with MapReduce in Haskell (Holumbus-MapReduce).
◦ BashReduce: MapReduce written as a Bash script by Erik Frey of Last.fm.
◦ MapReduce for Go
◦ Meguro - a JavaScript MapReduce framework
◦ MongoDB is a scalable, high-performance, open-source, schema-free, document-oriented database written in C++ that features MapReduce.
◦ Parallel::MapReduce is a CPAN module providing experimental MapReduce functionality for Perl.
◦ MapReduce on volunteer computing
◦ Secure MapReduce
◦ MapReduce with MPI implementation