Hadoop Streaming using Python
Intelligent Systems
May 29, 2014
박영택

Streaming Example #1
  • Read any text input and compute the average length of all words that start with each character.
  – Example
    • Text input
        Now is definitely the time
    • Output
        N    3
        d    10
        i    2
        t    3.5

Streaming Example: Mapper
  • The Mapper receives a line of text for each input value. For each word in the line, it emits the first letter of the word as the key and the length of the word as the value.
  – Example
    • Input value
        Now is definitely the time
    • Mapper should emit
        N    3
        d    10
        i    2
        t    3
        t    4

Example Mapper code
  • The Mapper receives a line of text for each input value. For each word in the line, it emits the first letter of the word as the key and the length of the word as the value.
  • Sample input: Now is definitely the time

    #!/usr/bin/python
    import re
    import sys

    # \W matches any single non-word character
    NONALPHA = re.compile(r"\W")

    for input in sys.stdin.readlines():
        for w in NONALPHA.split(input):
            # split() yields empty strings between adjacent separators; skip them
            if len(w) > 0:
                # key = lowercased first letter, value = word length
                print("{0}\t{1}".format(w[0].lower(), str(len(w))))

Regular expression
  • re.compile(pattern, flags=0)
    – Compiles a regular expression pattern and returns a pattern object.
    – \W : matches any non-word character (anything other than a letter, a digit, or an underscore)
  • Example
    – Input
        Now is definitely the time
    – Non-alphanumeric pattern
        NONALPHA = re.compile(r"\W")
        NONALPHA.split(input)
    – NONALPHA is the non-alphanumeric pattern object.
    – NONALPHA.split(input) returns ["Now", "is", "definitely", "the", "time"]
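The Mapper's `len(w) > 0` check matters because splitting on `\W` produces empty strings wherever two separators are adjacent or a separator ends the line. The following is a minimal sketch of that behaviour, written for Python 3; the helper name `split_words` and the punctuated sample line are illustrative assumptions, not part of the slides.

```python
import re

# Same pattern as on the slide: one character that is not a letter, digit, or underscore.
NONALPHA = re.compile(r"\W")


def split_words(line):
    """Split on non-word characters and drop the empty strings that
    appear between adjacent separators or after a trailing separator."""
    return [w for w in NONALPHA.split(line) if len(w) > 0]


# A line as it arrives on stdin, with its trailing newline.
raw = NONALPHA.split("Now is definitely the time\n")

# Hypothetical variant of the sample line with punctuation added.
words = split_words("Now, is definitely the time!\n")

print(raw)    # ends with an empty string caused by the newline
print(words)  # only the five words remain
```

Without the length check, the empty string at the end of `raw` would make `w[0]` raise an `IndexError` in the Mapper.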

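Regular expression example

    import re

    text = "Now is definitely the time"
    nonAlpha = re.compile(r"\W")

    # findall() lists every character matched by the pattern (here, the spaces)
    print("nonAlpha findall : {0}".format(re.findall(nonAlpha, text)))

    # split() returns the pieces of text between those matches
    result = nonAlpha.split(text)
    print("nonAlpha split : {0}".format(result))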

Streaming Example: Reducer
  • The Reducer receives the keys in sorted order, and all the values for one key appear together. For the Mapper output shown earlier, the Reducer would receive the following:
        N    3
        d    10
        i    2
        t    3
        t    4
  • The final output should be:
        N    3
        d    10
        i    2
        t    3.5

Example Reducer code
  • Sample input
        N    3
        d    10
        i    2
        t    3
        t    4

    #!/usr/bin/python
    import sys

    wordcount = 0.0
    lettercount = 0
    key = None

    for input in sys.stdin.readlines():
        input = input.rstrip()
        parts = input.split("\t")
        if len(parts) < 2:
            continue
        newkey = parts[0]
        wordlen = int(parts[1])

        if not key:
            key = newkey

        # The key changed: emit the average for the previous key and reset
        if key != newkey:
            print("{0}\t{1}".format(key, str(lettercount / wordcount)))
            key = newkey
            wordcount = 0.0
            lettercount = 0

        wordcount = wordcount + 1.0
        lettercount = lettercount + wordlen

    # Emit the average for the last key
    if key is not None:
        print("{0}\t{1}".format(key, str(lettercount / wordcount)))
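The map, sort, and reduce steps can be checked locally before submitting a job. The sketch below is one way to do that in Python 3, assuming an in-memory sort is an acceptable stand-in for Hadoop's shuffle; the function names `map_line`, `reduce_pairs`, and `run_local` are hypothetical, and `itertools.groupby` replaces the manual key-change bookkeeping of the Reducer script.

```python
import re
from itertools import groupby

NONALPHA = re.compile(r"\W")


def map_line(line):
    """Yield (lowercased first letter, word length) for each word in the line."""
    for w in NONALPHA.split(line):
        if w:
            yield w[0].lower(), len(w)


def reduce_pairs(sorted_pairs):
    """Average the lengths for each run of equal keys.
    The input must already be sorted by key, as Hadoop guarantees."""
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        lengths = [length for _, length in group]
        yield key, sum(lengths) / len(lengths)


def run_local(lines):
    """Map every line, sort by key (stand-in for the shuffle), then reduce."""
    pairs = [kv for line in lines for kv in map_line(line)]
    pairs.sort(key=lambda kv: kv[0])
    return list(reduce_pairs(pairs))


result = run_local(["Now is definitely the time"])
print(result)
```

Because the Mapper lowercases the first letter, the key for "Now" comes out as `n` in this sketch rather than `N`.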

Running code
  • Run Streaming Example
    – $hadoop-streaming.jar: /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.1.jar
        $ hadoop jar $hadoop-streaming.jar -input $inputData/input.txt -output <path> -mapper mapper.py -reducer reducer.py
  • Result
    – hadoop fs -cat mrstream/output/part*
        N    3
        d    10
        i    2
        t    3.5

Streaming Example #2
  • Use New York Stock Exchange records of stock prices from 1962 to 2010. We will write a streaming application that prints the unique list of ticker symbols available in the data.
  – Example
    • Input data (one column per line, with the sample values from the slide)
        exchange:               NYSE, NYSE, …
        stock symbol:           AEA, AEA, AEA, …
        date:                   2010-02-08, 2010-02-05, 2010-02-04, 2010-02-03, 2010-02-02, 2010-02-01, …
        stock price open:       4.42, 4.55, 4.65, 4.74, 4.84, …
        stock price high:       4.42, 4.54, 4.69, 5, 4.92, …
        stock price low:        4.21, 4.22, 4.39, 4.62, 4.68, …
        stock price close:      4.24, 4.41, 4.42, 4.55, 4.66, 4.75, …
        stock volume:           205500, 194300, 233800, 182100, 222700, 194800, …
        stock price adj close:  4.24, 4.41, 4.42, 4.55, 4.66, 4.75, …
    • Output
        AEA
        YSI
        YUM
        YZC
        …

Example Mapper code
  • For the reduce function we will use the shell utility uniq, which returns a unique set of values when it is given sorted input.
  – Example (mapper code)
    • parse.py

    #!/usr/bin/python
    import sys

    while 1:
        line = sys.stdin.readline()
        # readline() returns an empty string at end of input
        if line == "":
            break
        fields = line.split(",")
        # the second field is the stock symbol
        print("{0}".format(fields[1]))
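uniq only works as a reducer here because Hadoop sorts the Mapper output by key before the reduce step. The sketch below mimics the default behaviour of uniq in Python 3 to show why sorted input matters; the function name `uniq` and both symbol sequences are illustrative assumptions built from the ticker symbols on the slide.

```python
def uniq(lines):
    """Mimic default uniq: drop a line only when it equals
    the line immediately before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


# Sorted input, as the reducer would see it after the shuffle.
symbols_sorted = ["AEA", "AEA", "AEA", "YSI", "YUM", "YUM", "YZC"]

# Unsorted input: the repeated symbol is not adjacent, so it survives.
symbols_unsorted = ["AEA", "YSI", "AEA", "YUM"]

unique_sorted = list(uniq(symbols_sorted))
unique_unsorted = list(uniq(symbols_unsorted))
print(unique_sorted)
print(unique_unsorted)
```

With sorted input every duplicate is adjacent to its twin, so a single pass that remembers only the previous line is enough.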

Hadoop Streaming
  • Usage
        $ hadoop jar $hadoop-streaming.jar [options]
          -input <path>                     : HDFS input file for the Map step
          -output <path>                    : HDFS output file for the Reduce step
          -mapper <cmd | JavaClassName>     : the streaming command to run
          -reducer <cmd | JavaClassName>    : the streaming command to run
          -file <file>                      : file/dir to be shipped in the job jar file
          -combiner <JavaClassName>         : the combiner has to be a Java class
          …
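When jobs are launched from a script, the options above can be assembled as an argument list. This is a minimal Python 3 sketch that uses only the options listed on the slide; the helper name `build_streaming_command`, the jar location, and the output path are assumptions made for illustration and should be replaced with the values of the actual cluster.

```python
import shlex


def build_streaming_command(jar, input_path, output_path, mapper, reducer, files=()):
    """Return the argument list for a Hadoop Streaming job.
    Only options shown on the usage slide are emitted."""
    args = [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]
    for path in files:
        # each script that must be shipped with the job gets its own -file option
        args += ["-file", path]
    return args


cmd = build_streaming_command(
    jar="hadoop-streaming.jar",    # hypothetical location of the streaming jar
    input_path="input/nyse",       # hypothetical HDFS input path
    output_path="output/symbols",  # hypothetical HDFS output path
    mapper="parse.py",
    reducer="/usr/bin/uniq",
    files=["parse.py"],
)

# Printable form for logging; subprocess.run(cmd) would accept the list directly.
print(" ".join(shlex.quote(a) for a in cmd))
```

Keeping the command as a list avoids shell quoting problems when a path contains spaces.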

Running code
  • Run Streaming Example
    – $hadoop-streaming.jar: /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.2.1.jar
        $ hadoop jar $hadoop-streaming.jar -input $inputData/nyse -output <path> -mapper parse.py -reducer /usr/bin/uniq -file parse.py
  • Result
    – hadoop fs -cat mrstream/output/part*
        AEA
        YSI
        YUM
        YZC
        …

Streaming Example #3
  • Inverted index (with word count)
  – Example
    • Text input: a news article
        doc1    WASHINGTON United States Special Operations troops are forming elite counterterrorism units in four countries in North and West Africa …
    • Output
        operations    doc1:1
        over          doc1:1
        month         doc1:1
        four          doc1:1
        fighters      doc1:1
        ..

Example Mapper code

    #!/usr/bin/env python
    from sys import stdin
    import re

    doc_id = None

    for line in stdin:
        if not line.strip():
            continue
        if not doc_id:
            # The first non-empty line is "<doc_id>\t<text>"; keep only the text
            doc_id, line = line.split('\t', 1)
        words = re.findall(r'\w+', line)
        for word in words:
            # key = lowercased word, value = "<doc_id>:1"
            print("{0}\t{1}:1".format(word.lower(), doc_id))

Example Reducer code

    #!/usr/bin/env python
    from sys import stdin

    # index[word][doc_id] = number of occurrences of word in doc_id
    index = {}

    for line in stdin:
        word, postings = line.split('\t')
        index.setdefault(word, {})
        for posting in postings.split(','):
            doc_id, count = posting.split(':')
            count = int(count)
            index[word].setdefault(doc_id, 0)
            index[word][doc_id] += count

    for word in index:
        postings_list = ["%s:%d" % (doc_id, index[word][doc_id])
                         for doc_id in index[word]]
        postings = ','.join(postings_list)
        print("{0}\t{1}".format(word, postings))
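The Mapper and Reducer logic for the inverted index can be exercised in memory before running it on the cluster. The following Python 3 sketch assumes the "word, doc_id:count" record layout used above; the function names `map_document` and `reduce_postings` and the short sample sentence are hypothetical.

```python
import re
from collections import defaultdict


def map_document(doc_id, text):
    """Yield (lowercased word, "<doc_id>:1") for every word in the text."""
    for word in re.findall(r"\w+", text):
        yield word.lower(), "{0}:1".format(doc_id)


def reduce_postings(pairs):
    """Sum the counts per word and document, returning
    {word: {doc_id: total count}}."""
    index = defaultdict(lambda: defaultdict(int))
    for word, posting in pairs:
        doc_id, count = posting.split(":")
        index[word][doc_id] += int(count)
    return {word: dict(docs) for word, docs in index.items()}


# Hypothetical one-line document; "operations" appears twice after lowercasing.
pairs = list(map_document("doc1", "Special Operations troops; operations in four countries"))
index = reduce_postings(pairs)

print(index["operations"])
print(index["four"])
```

Note that this dictionary-based reducer, like the script above, holds the whole index in memory, which is acceptable for a single article but may not scale to a large corpus.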

Running code
  • Run Streaming Example
        $ hs mapper.py reducer.py input output
        (input path: input/doc1.txt)
  • Result
    – hadoop fs -cat output/part*
        words:     .. officer, haram, night, served, because, some ..
        postings:  doc1:2, doc1:1, doc1:3, doc1:1, doc1:4

Streaming Example #4
  • Find the total sales per store
  – Example
    • Text input (purchases.txt), one column per line with the sample values from the slide
        date:     2012-01-01, 2012-01-01, …
        time:     09:00, 09:00, …
        store:    San Jose, Fort Worth, San Diego, Pittsburgh, Omaha, Stockton, …
        item:     Men's Clothing, Women's Clothing, Music, Pet Supplies, Children's Clothing, Men's Clothing, …
        cost:     214.05, 153.57, 66.08, 493.51, 235.63, 247.18, …
        payment:  Amex, Visa, Cash, Discover, MasterCard, …
    • Output
        store         cost
        Fort Worth    153.57
        San Diego     1636.08
        San Jose      10214.05
        Stockton      2447.18
        Omaha         2235.63
        Pittsburgh    4493.51
        …

Example Mapper code

    #!/usr/bin/python
    #
    # Format of each line is:
    # date\ttime\tstore name\titem description\tcost\tmethod of payment
    #
    # We want elements 2 (store name) and 4 (cost)
    # We need to write them out to standard output, separated by a tab

    import sys

    for line in sys.stdin:
        data = line.strip().split("\t")
        if len(data) == 6:
            date, time, store, item, cost, payment = data
            print("{0}\t{1}".format(store, cost))

Example Reducer code

    #!/usr/bin/python
    import sys

    salesTotal = 0
    oldKey = None

    # Loop around the data
    # It will be in the format key\tval
    # Where key is the store name, val is the sale amount
    #
    # All the sales for a particular store will be presented,
    # then the key will change and we'll be dealing with the next store

    for line in sys.stdin:
        data_mapped = line.strip().split("\t")
        if len(data_mapped) != 2:
            # Something has gone wrong. Skip this line.
            continue

        thisKey, thisSale = data_mapped

        # The store changed: emit the total for the previous store and reset
        if oldKey and oldKey != thisKey:
            print("{0}\t{1}".format(oldKey, salesTotal))
            salesTotal = 0

        oldKey = thisKey
        salesTotal += float(thisSale)

    # Emit the total for the last store
    if oldKey is not None:
        print("{0}\t{1}".format(oldKey, salesTotal))
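The same per-store total can be expressed without manual key tracking. The sketch below is a Python 3 alternative that relies on `itertools.groupby`, assuming the input is already sorted by store name as Hadoop delivers it; the function name `total_per_store` and the sample records (including the second San Jose sale) are invented for illustration.

```python
from itertools import groupby


def total_per_store(lines):
    """Yield (store, total sales) for tab-separated "store<TAB>cost" lines
    that are already grouped by store. Malformed lines are skipped."""
    records = (line.rstrip("\n").split("\t") for line in lines)
    valid = (r for r in records if len(r) == 2)
    for store, group in groupby(valid, key=lambda r: r[0]):
        yield store, sum(float(cost) for _, cost in group)


# Hypothetical Mapper output, sorted by store, with one malformed line.
mapped = [
    "Omaha\t235.63\n",
    "broken line\n",
    "San Jose\t214.05\n",
    "San Jose\t100.00\n",
]

totals = dict(total_per_store(mapped))
for store, total in totals.items():
    print("{0}\t{1}".format(store, total))
```

`groupby` starts a new group whenever the key changes, which is the same assumption the Reducer script makes about its input order.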