CS 345 A Data Mining Map Reduce Singlenode
- Slides: 28
CS 345 A Data Mining Map. Reduce
Single-node architecture CPU Machine Learning, Statistics Memory “Classical” Data Mining Disk
Commodity Clusters o Web data sets can be very large n Tens to hundreds of terabytes o Cannot mine on a single server (why? ) o Standard architecture emerging: n Cluster of commodity Linux nodes n Gigabit ethernet interconnect o How to organize computations on this architecture? n Mask issues such as hardware failure
Cluster Architecture 2 -10 Gbps backbone between racks 1 Gbps between any pair of nodes in a rack Switch CPU Mem Disk … Switch CPU Mem Disk Each rack contains 16 -64 nodes CPU … Mem Disk
Stable storage o First order problem: if nodes can fail, how can we store data persistently? o Answer: Distributed File System n Provides global file namespace n Google GFS; Hadoop HDFS; Kosmix KFS o Typical usage pattern n Huge files (100 s of GB to TB) n Data is rarely updated in place n Reads and appends are common
Distributed File System o Chunk Servers n n File is split into contiguous chunks Typically each chunk is 16 -64 MB Each chunk replicated (usually 2 x or 3 x) Try to keep replicas in different racks o Master node n a. k. a. Name Nodes in HDFS n Stores metadata n Might be replicated o Client library for file access n Talks to master to find chunk servers n Connects directly to chunkservers to access data
Warm up: Word Count o We have a large file of words, one word to a line o Count the number of times each distinct word appears in the file o Sample application: analyze web server logs to find popular URLs
Word Count (2) o Case 1: Entire file fits in memory o Case 2: File too large for mem, but all <word, count> pairs fit in mem o Case 3: File on disk, too many distinct words to fit in memory n sort datafile | uniq –c
Word Count (3) o To make it slightly harder, suppose we have a large corpus of documents o Count the number of times each distinct word occurs in the corpus n words(docs/*) | sort | uniq -c n where words takes a file and outputs the words in it, one to a line o The above captures the essence of Map. Reduce n Great thing is it is naturally parallelizable
Map. Reduce: The Map Step Input key-value pairs k v … k Intermediate key-value pairs k v k v map … v k v
Map. Reduce: The Reduce Step Intermediate key-value pairs Key-value groups v k k v v v Output key-value pairs reduce k v group k v v k v … … k v k … v k v
Map. Reduce o Input: a set of key/value pairs o User supplies two functions: n map(k, v) list(k 1, v 1) n reduce(k 1, list(v 1)) v 2 o (k 1, v 1) is an intermediate key/value pair o Output is the set of (k 1, v 2) pairs
Word Count using Map. Reduce map(key, value): // key: document name; value: text of document for each word w in value: emit(w, 1) reduce(key, values): // key: a word; value: an iterator over counts result = 0 for each count v in values: result += v emit(result)
Distributed Execution Overview User Program fork assign map Input Data Split 0 read Split 1 Split 2 fork Master fork assign reduce Worker local write Worker remote read, sort write Output File 0 Output File 1
Data flow o Input, final output are stored on a distributed file system n Scheduler tries to schedule map tasks “close” to physical storage location of input data o Intermediate results are stored on local FS of map and reduce workers o Output is often input to another map reduce task
Coordination o Master data structures n Task status: (idle, in-progress, completed) n Idle tasks get scheduled as workers become available n When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer n Master pushes this info to reducers o Master pings workers periodically to detect failures
Failures o Map worker failure n Map tasks completed or in-progress at worker are reset to idle n Reduce workers are notified when task is rescheduled on another worker o Reduce worker failure n Only in-progress tasks are reset to idle o Master failure n Map. Reduce task is aborted and client is notified
How many Map and Reduce jobs? o M map tasks, R reduce tasks o Rule of thumb: n Make M and R much larger than the number of nodes in cluster n One DFS chunk per map is common n Improves dynamic load balancing and speeds recovery from worker failure o Usually R is smaller than M, because output is spread across R files
Combiners o Often a map task will produce many pairs of the form (k, v 1), (k, v 2), … for the same key k n E. g. , popular words in Word Count o Can save network time by preaggregating at mapper n combine(k 1, list(v 1)) v 2 n Usually same as reduce function o Works only if reduce function is commutative and associative
Partition Function o Inputs to map tasks are created by contiguous splits of input file o For reduce, we need to ensure that records with the same intermediate key end up at the same worker o System uses a default partition function e. g. , hash(key) mod R o Sometimes useful to override n E. g. , hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
Exercise 1: Host size o Suppose we have a large web corpus o Let’s look at the metadata file n Lines of the form (URL, size, date, …) o For each host, find the total number of bytes n i. e. , the sum of the page sizes for all URLs from that host
Exercise 2: Distributed Grep o Find all occurrences of the given pattern in a very large set of files
Exercise 3: Graph reversal o Given a directed graph as an adjacency list: src 1: dest 11, dest 12, … src 2: dest 21, dest 22, … o Construct the graph in which all the links are reversed
Exercise 4: Frequent Pairs o Given a large set of market baskets, find all frequent pairs n Remember definitions from Association Rules lectures
Implementations o Google n Not available outside Google o Hadoop n An open-source implementation in Java n Uses HDFS for stable storage n Download: http: //lucene. apache. org/hadoop/ o Aster Data n Cluster-optimized SQL Database that also implements Map. Reduce n Made available free of charge for this class
Cloud Computing o Ability to rent computing by the hour n Additional services e. g. , persistent storage o We will be using Amazon’s “Elastic Compute Cloud” (EC 2) o Aster Data and Hadoop can both be run on EC 2 o In discussions with Amazon to provide access free of charge for class
Special Section on Map. Reduce o Tutorial on how to access Aster Data, EC 2, etc o Intro to the available datasets o Friday, January 16, at 5: 15 pm n Right after Info. Seminar n Tentatively, in the same classroom (Gates B 12)
Reading o Jeffrey Dean and Sanjay Ghemawat, Map. Reduce: Simplified Data Processing on Large Clusters http: //labs. google. com/papers/mapreduce. html o Sanjay Ghemawat, Howard Gobioff, and Shun. Tak Leung, The Google File System http: //labs. google. com/papers/gfs. html
- Eck
- Multimedia data mining
- Strip mining vs open pit mining
- Strip mining vs open pit mining
- Difference between strip mining and open pit mining
- Difference between text mining and web mining
- Data mining map
- Data reduction in data mining
- What is data mining and data warehousing
- What is missing data in data mining
- Data reduction in data mining
- Data reduction in data mining
- Data reduction in data mining
- Data cube technology in data mining
- Data reduction in data mining
- Data warehouse dan data mining
- Data mining dan data warehouse
- Analitical cubism
- Descriptive mining of complex data objects
- Data warehousing data mining and olap
- Noisy data in data mining
- How many tier data warehouse architecture?
- Markku roiha
- Data compression in data mining
- Introduction to data warehousing and data mining
- Data warehouse dan data mining
- Cs 412 introduction to data mining
- Mapreduce شرح
- Mapreduce types and formats