IBM Research
INTRODUCTION TO HADOOP & MAP REDUCE
© 2007 IBM Corporation

IBM Research | India Research Lab

What is Hadoop?
An open-source, batch-oriented (offline), data- and I/O-intensive, general-purpose framework for creating distributed applications that process huge amounts of data.

HUGE
- A few thousand machines
- Petabytes of data
- Thousands of jobs processed each week

What is not Hadoop?
- A relational database
- An OLTP system
- A structured data store of any kind

Hadoop vs Relational
- General purpose vs relational data
- User control vs system defined
  - No schema vs schema
- Key-value pairs vs tables
- Offline/batch vs online/real-time

Hadoop Eco-System
- HDFS
  - Hadoop Distributed File System
- Map-Reduce System
  - A distributed framework for executing work in parallel
- Hive/Pig/Jaql
  - SQL-like languages to manipulate relational data on HDFS
- HBase
  - Column store on Hadoop
- Misc
  - Avro, Ganglia, Sqoop, ZooKeeper, Mahout

HDFS
- Hadoop Distributed File System
- Stores files in blocks across many nodes in a cluster
  - Default block size: 64 MB
- Replicates the blocks across nodes for durability
- Master/slave architecture
- HDFS master: NameNode
  - Runs on a single node as a master process
  - Directs client access to files in HDFS
- HDFS slave: DataNode
  - Runs on all nodes in the cluster
  - Handles block creation, replication, and deletion
  - Takes orders from the NameNode

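The block and replica arithmetic behind the bullets above can be sketched in a few lines of plain Java. This is only an illustration: the 64 MB block size comes from the slide, while the class name, the 200 MB file, and the replication factor of 3 are assumptions made for the example.

```java
// Sketch: how a file maps onto HDFS blocks and replicas (no Hadoop classes involved).
class HdfsBlockMath {
    static final long BLOCK_SIZE_MB = 64; // default block size quoted on the slide

    /** Number of blocks needed to hold a file; the last block may be only partly filled. */
    static long blockCount(long fileSizeMb) {
        return (fileSizeMb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB; // ceiling division
    }

    /** Total number of block copies stored across the cluster. */
    static long replicaCount(long fileSizeMb, int replication) {
        return blockCount(fileSizeMb) * replication;
    }

    public static void main(String[] args) {
        long fileSizeMb = 200; // hypothetical file size
        int replication = 3;   // assumed replication factor
        System.out.println("blocks   = " + blockCount(fileSizeMb));
        System.out.println("replicas = " + replicaCount(fileSizeMb, replication));
    }
}
```

Under these assumptions a 200 MB file occupies 4 blocks, stored as 12 block copies in total.
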
HDFS
[Figure: a NameNode and six DataNodes, numbered 1 to 6.]

HDFS: Put File
[Figure: a client puts File1.txt into HDFS; the NameNode records the DataNodes holding each block: 1, 4, 5; 2, 5, 6; 2, 3, 4.]

HDFS: Read File
[Figure: a client reads a file; the NameNode points it to DataNodes holding the blocks: 1, 4; 2, 6; 2, 3.]
Aggregate read rate = Transfer-Rate x Number of Machines

HDFS
- Fault-tolerant
  - Handles node failures
- Self-healing
  - Rebalances files across the cluster
  - When a node fails, its blocks are automatically re-copied from the remaining two nodes that hold them
- Scalable
  - Capacity grows just by adding new nodes

Map-Reduce
- Logical functions: mappers and reducers
- Developers write map and reduce functions, then submit a jar to the Hadoop cluster
- Hadoop handles distributing the map and reduce tasks across the cluster
- Typically batch-oriented

Map-Reduce Job-Flow

Word-Count
Input lines:
- Hadoop Uses Map-Reduce
- There is a Map-Phase
- There is a Reduce phase

Map output:
- (Hadoop, 1) (Uses, 1) (Map, 1) (Reduce, 1)
- (There, 1) (is, 1) (a, 1) (Map, 1) (Phase, 1)
- (There, 1) (is, 1) (a, 1) (Reduce, 1) (Phase, 1)

Sort/Shuffle, then Reduce, by key range:
- A-I: (a, [1, 1]) (hadoop, [1]) (is, [1, 1]) -> (a, 2) (hadoop, 1) (is, 2)
- J-Q: (map, [1, 1]) (phase, [1, 1]) -> (map, 2) (phase, 2)
- R-Z: (reduce, [1, 1]) (there, [1, 1]) (uses, [1]) -> (reduce, 2) (there, 2) (uses, 1)

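The map, shuffle, and reduce steps of the word-count flow can be imitated in memory with plain Java collections. This is a sketch only: it uses no Hadoop classes, the class and method names are invented for illustration, and it assumes that words are lowercased and split on any non-letter character (so "Map-Reduce" yields "map" and "reduce", as in the figure).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of the word-count flow; all names here are hypothetical.
class WordCountFlow {
    /** Map phase: one lowercased word per token; each word stands for a (word, 1) pair. */
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    /** Shuffle/sort: collect the 1s under each word; TreeMap keeps the keys sorted. */
    static Map<String, List<Integer>> shuffle(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : map(line)) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return grouped;
    }

    /** Reduce phase: sum the grouped values of each word. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("Hadoop Uses Map-Reduce");
        lines.add("There is a Map-Phase");
        lines.add("There is a Reduce phase");
        System.out.println(reduce(shuffle(lines)));
    }
}
```

On the three input lines of the slide, the counts match the reducer outputs shown in the figure (for example map = 2 and uses = 1).
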
Map-Reduce Daemons
- Job-Tracker (master)
  - Manages map-reduce jobs
  - Partitions tasks across different nodes
  - Manages task failures and restarts tasks on different nodes
  - Speculative execution
- Task-Tracker (slave)
  - Creates individual map and reduce tasks
  - Reports task status to the Job-Tracker

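Word-Count Map
Type parameters of Mapper, in order: type of input key, type of input value, type of output key, type of output value.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            // Split the line on whitespace and emit (word, 1) for every token.
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                context.write(new Text(tokens.nextToken()), new IntWritable(1));
            }
        }
    }
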
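Word-Count Reduce
Type parameters of Reducer, in order: type of input key, type of input value, type of output key, type of output value.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable value : values) count += value.get(); // sum the 1s for this word
            context.write(key, new IntWritable(count));
        }
    }
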
Word-Count Runner Class

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountRunner {
        public static void main(String[] args) throws Exception {
            // Assumption: input and output paths come from the command line;
            // the slide leaves inputFilesPath and outputPath undefined.
            Path inputFilesPath = new Path(args[0]);
            Path outputPath = new Path(args[1]);
            Job job = new Job();
            job.setMapperClass(WordCountMap.class);
            job.setReducerClass(WordCountReduce.class);
            job.setJarByClass(WordCountRunner.class);
            FileInputFormat.addInputPath(job, inputFilesPath);
            FileOutputFormat.setOutputPath(job, outputPath);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(1);
            job.waitForCompletion(true);
        }
    }

Running a Job

    ./bin/hadoop jar WC.jar WordCountRunner WC

Cluster View of a MR Job Flow
[Figure: cluster view showing the NameNode, the JobTracker, the job JAR, and the TaskTrackers and DataNodes; (k, v) pairs flow from map tasks (M) through the map phase, shuffle/sort, and the reduce phase to reduce tasks (R) until the job is finished.]

Map-Reduce Example: Aggregation
- Compute the average of B for each distinct value of A

         A    B    C
    R1   1   10   12
    R2   2   20   34
    R3   1   10   22
    R4   1   30   56
    R5   3   40   17
    R6   2   10   49
    R7   1   20   44

- MAP 1 (rows R1-R3): (1, 10) (2, 20) (1, 10)
- MAP 2 (rows R4-R7): (1, 30) (3, 40) (2, 10) (1, 20)
- Reducer 1: (1, [10, 10, 30, 20]) -> (1, 17.5)
- Reducer 2: (2, [20, 10]) (3, [40]) -> (2, 15) (3, 40)

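The aggregation can be traced with a small in-memory Java sketch. It is not Hadoop code: the class and method names are hypothetical, rows are plain int arrays in the order {A, B, C}, and the mapper and the shuffle are folded into one method for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of "average of B per distinct A"; names are hypothetical.
class AverageByKey {
    /** Map + shuffle: emit (A, B) for each row and group the B values by A. */
    static Map<Integer, List<Integer>> mapAndShuffle(int[][] rows) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return grouped;
    }

    /** Reduce: average the grouped B values of each key. */
    static Map<Integer, Double> reduce(Map<Integer, List<Integer>> grouped) {
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> entry : grouped.entrySet()) {
            double sum = 0;
            for (int b : entry.getValue()) {
                sum += b;
            }
            averages.put(entry.getKey(), sum / entry.getValue().size());
        }
        return averages;
    }

    public static void main(String[] args) {
        int[][] rows = {
            {1, 10, 12}, {2, 20, 34}, {1, 10, 22}, {1, 30, 56},
            {3, 40, 17}, {2, 10, 49}, {1, 20, 44}
        };
        System.out.println(reduce(mapAndShuffle(rows)));
    }
}
```

For the table above this gives 17.5 for A = 1, 15.0 for A = 2, and 40.0 for A = 3.
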
Map-Reduce Example: Join
- Select R.A, R.B, S.D where R.A == S.A

         A    B    C              A    D    E
    R1   1   10   12         S1   1   20   22
    R2   2   20   34         S2   2   30   36
    R3   1   10   22         S3   2   10   29
    R4   1   30   56         S4   3   50   16
    R5   3   40   17         S5   3   40   37

- MAP 1 (relation R): (1, [R, 10]) (2, [R, 20]) (1, [R, 10]) (1, [R, 30]) (3, [R, 40])
- MAP 2 (relation S): (1, [S, 20]) (2, [S, 30]) (2, [S, 10]) (3, [S, 50]) (3, [S, 40])
- Reducer 1: (1, [(R, 10), (R, 10), (R, 30), (S, 20)]) -> (1, 10, 20) (1, 10, 20) (1, 30, 20)
- Reducer 2: (2, [(R, 20), (S, 30), (S, 10)]) (3, [(R, 40), (S, 50), (S, 40)]) -> (2, 20, 30) (2, 20, 10) (3, 40, 50) (3, 40, 40)

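The equi-join follows the usual reduce-side pattern: mappers key each row by A and tag it with its relation, and each reducer pairs the R values with the S values that share a key. The Java sketch below imitates this in memory. All names are hypothetical, rows are int arrays ({A, B, C} for R and {A, D, E} for S), and no Hadoop API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of a reduce-side equi-join on column A; names are hypothetical.
class ReduceSideJoin {
    /** A value as emitted by a mapper: the source relation plus the carried column. */
    static final class Tagged {
        final char relation; // 'R' or 'S'
        final int value;     // R.B or S.D

        Tagged(char relation, int value) {
            this.relation = relation;
            this.value = value;
        }
    }

    /** Map + shuffle: key every row by A and tag it with its relation. */
    static Map<Integer, List<Tagged>> mapAndShuffle(int[][] r, int[][] s) {
        Map<Integer, List<Tagged>> grouped = new TreeMap<>();
        for (int[] row : r) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(new Tagged('R', row[1]));
        }
        for (int[] row : s) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(new Tagged('S', row[1]));
        }
        return grouped;
    }

    /** Reduce: per key, pair every R value with every S value; output rows are {A, B, D}. */
    static List<int[]> reduce(Map<Integer, List<Tagged>> grouped) {
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, List<Tagged>> entry : grouped.entrySet()) {
            for (Tagged left : entry.getValue()) {
                if (left.relation != 'R') {
                    continue;
                }
                for (Tagged right : entry.getValue()) {
                    if (right.relation == 'S') {
                        out.add(new int[] {entry.getKey(), left.value, right.value});
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] r = {{1, 10, 12}, {2, 20, 34}, {1, 10, 22}, {1, 30, 56}, {3, 40, 17}};
        int[][] s = {{1, 20, 22}, {2, 30, 36}, {2, 10, 29}, {3, 50, 16}, {3, 40, 37}};
        for (int[] row : reduce(mapAndShuffle(r, s))) {
            System.out.println("(" + row[0] + ", " + row[1] + ", " + row[2] + ")");
        }
    }
}
```

With the two tables above the sketch yields seven result rows, including the duplicate (1, 10, 20) that comes from R1 and R3.
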
Map-Reduce Example: Inequality Join
- Select R.A, R.B, S.D where R.A <= S.A
- Consider a 3-node cluster
- Each S tuple goes to the reducer numbered by its A value; each R tuple is copied to that reducer and to every higher-numbered one

         A    B    C              A    D    E
    R1   1   10   12         S1   1   20   22
    R2   2   20   34         S2   2   30   36
    R3   1   10   22         S3   2   10   29
    R4   1   30   56         S4   3   50   16
    R5   3   40   17         S5   3   40   37

- MAP 1 (relation R): (r1, [R, 1, 10]) (r2, [R, 1, 10]) (r3, [R, 1, 10]) (r2, [R, 2, 20]) (r3, [R, 2, 20]) ... (r3, [R, 3, 40])
- MAP 2 (relation S): (r1, [S, 1, 20]) (r2, [S, 2, 30]) (r2, [S, 2, 10]) (r3, [S, 3, 50]) (r3, [S, 3, 40])
- Reducer 1: (r1, ([R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20])) -> (1, 10, 20) (1, 10, 20) (1, 30, 20)
- Reducer 2: ...
- Reducer 3: (r3, ([R, 1, 10], [R, 2, 20], [R, 1, 10], [R, 1, 30], [R, 3, 40], [S, 3, 50], [S, 3, 40])) -> (1, 10, 50) (1, 10, 40) (2, 20, 50) (2, 20, 40) (1, 10, 50) (1, 10, 40) (1, 30, 50) (1, 30, 40) (3, 40, 50) (3, 40, 40)

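The replication scheme of the inequality join can be sketched in memory as follows. The sketch assumes, as in the slide, that the values of A are exactly the reducer numbers 1..n; class and method names are hypothetical and no Hadoop API is involved.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory imitation of the inequality join R.A <= S.A; names are hypothetical.
class InequalityJoin {
    static final class Tagged {
        final char relation; // 'R' or 'S'
        final int a;         // join column A
        final int value;     // R.B or S.D

        Tagged(char relation, int a, int value) {
            this.relation = relation;
            this.a = a;
            this.value = value;
        }
    }

    /** Map + shuffle: returns one inbox per reducer (index 0 is reducer r1). */
    static List<List<Tagged>> mapAndShuffle(int[][] r, int[][] s, int reducers) {
        List<List<Tagged>> inbox = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            inbox.add(new ArrayList<>());
        }
        for (int[] row : r) {
            // An R row can match any S row with S.A >= R.A, so copy it to reducers R.A..n.
            for (int target = row[0]; target <= reducers; target++) {
                inbox.get(target - 1).add(new Tagged('R', row[0], row[1]));
            }
        }
        for (int[] row : s) {
            // An S row is sent only to the reducer numbered by its own A value.
            inbox.get(row[0] - 1).add(new Tagged('S', row[0], row[1]));
        }
        return inbox;
    }

    /** Reduce: pair every received R row with every received S row; output rows are {A, B, D}. */
    static List<int[]> reduce(List<Tagged> received) {
        List<int[]> out = new ArrayList<>();
        for (Tagged left : received) {
            if (left.relation != 'R') {
                continue;
            }
            for (Tagged right : received) {
                if (right.relation == 'S' && left.a <= right.a) {
                    out.add(new int[] {left.a, left.value, right.value});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] r = {{1, 10, 12}, {2, 20, 34}, {1, 10, 22}, {1, 30, 56}, {3, 40, 17}};
        int[][] s = {{1, 20, 22}, {2, 30, 36}, {2, 10, 29}, {3, 50, 16}, {3, 40, 37}};
        List<List<Tagged>> inbox = mapAndShuffle(r, s, 3);
        for (int i = 0; i < inbox.size(); i++) {
            System.out.println("reducer " + (i + 1) + ": " + reduce(inbox.get(i)).size() + " rows");
        }
    }
}
```

The cost of this design is the replication of R: a row with A = 1 is shipped to all three reducers, which bears directly on the communication-cost concern raised in the design slide.
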
Designing a Map-Reduce Algorithm
- Thinking in terms of map and reduce
  - What data should be the key?
  - What data should be the values?
- Minimizing cost
  - Reading cost
  - Communication cost
  - Processing cost at the reducer
- Load balancing
  - All reducers should receive a similar volume of traffic
  - It should not happen that only a few machines are busy while the others sit idle

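One way to reason about load balancing is to count how many keys each reducer would receive under hash partitioning. The sketch below is an assumption-laden illustration: it assigns a key to a reducer by its non-negative hash code modulo the number of reducers, and the class name, method names, and the max-over-mean skew measure are all invented for this example.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: measuring how evenly keys spread over reducers; names are hypothetical.
class ReducerLoad {
    /** Assigns a key to a reducer by its non-negative hash code modulo the reducer count. */
    static int partition(String key, int reducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % reducers;
    }

    /** Counts how many keys (one per emitted pair) land on each reducer. */
    static int[] load(List<String> keys, int reducers) {
        int[] counts = new int[reducers];
        for (String key : keys) {
            counts[partition(key, reducers)]++;
        }
        return counts;
    }

    /** Skew = busiest reducer's load divided by the mean load; 1.0 means perfectly balanced. */
    static double skew(int[] counts) {
        int max = 0;
        long total = 0;
        for (int c : counts) {
            max = Math.max(max, c);
            total += c;
        }
        return total == 0 ? 0.0 : max / ((double) total / counts.length);
    }

    public static void main(String[] args) {
        // A heavily repeated key sends most of the traffic to a single reducer.
        List<String> keys = Arrays.asList("a", "a", "a", "a", "b");
        int[] counts = load(keys, 2);
        System.out.println(Arrays.toString(counts) + " skew=" + skew(counts));
    }
}
```

A skew well above 1.0 suggests that the chosen key concentrates work on a few reducers, so a different key or a custom partitioning scheme may be worth considering.
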
SQL-Like Languages for Map-Reduce
- Hive, Pig, JAQL
- A user need not write native Java Map-Reduce code
- SQL-like statements can be written to process data on Hadoop
- Allows users without a sound understanding of Map-Reduce to work on data stored on HDFS

JAQL
- Simpler language for writing Map-Reduce jobs
  - Reduces the barrier to Hadoop use by eliminating the need for many users to write Java programs
- Exploits massive parallelism using Hadoop
- Provides a simple yet powerful language to manipulate semi-structured data
- Uses JSON as its data model
  - Most data has a natural JSON representation
- Easily extended using Java, Python, JavaScript
- Inspired by UNIX pipes
- Other languages: Hive, Pig
- Resources
  - http://code.google.com/p/jaql
  - http://jaql.org

JavaScript Object Notation (JSON)
JSON has arrays, records, strings, numbers, booleans, and null.
[] == array, {} == record or object, x: == field name

- $emp = [
    {name: "Jon Doe", income: 20000, mgr: false},
    {name: "Vince Wayne", income: 32500, mgr: false}
  ]
- $emp = [
    {name: "Jon Doe", income: 20000, mgr: false, dob: {day: 1, month: 1, year: 1975}},
    {name: "Vince Wayne", income: 32500, mgr: false, dob: {day: 1, month: 2, year: 1978}}
  ]
- $emp = [
    {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"]},
    {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "DB2", "SQL"]}
  ]
- $emp = [
    {name: "Jon Doe", income: 20000, mgr: false,
     exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]},
    {name: "Vince Wayne", income: 32500, mgr: false,
     exp: [{org: 'IBM', from: 2000, to: 2003}, {org: 'oracle', from: 2003, to: '2010'}]}
  ]

Accessing Data

    $emp = [
      {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"],
       exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]},
      {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "C++", "Hadoop"],
       exp: [{org: 'IBM', from: 2000, to: 2003}, {org: 'oracle', from: 2003, to: '2010'}]}
    ]

- $emp[0] = {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"], exp: [{org: 'IBM', from: 2000, to: 2005}, {org: 'yahoo', from: 2005, to: '2010'}]}
- $emp[0].name = "Jon Doe"
- $emp[0].exp[0] = {org: 'IBM', from: 2000, to: 2005}
- $emp[0].exp[0].org = 'IBM'
- $emp[0].skills[0] = 'java'
- $emp[*].name = ['Jon Doe', 'Vince Wayne']
- $emp[0].exp[*].org = ['IBM', 'yahoo']
- $emp[*].exp[*].org = [['IBM', 'yahoo'], ['IBM', 'oracle']]

JAQL Core Functionalities
- Filter
- Transform
- Group
- Join
- Sort
- Expand

Filter
- $input -> filter <boolean expression>;
  - In <boolean expression>, the variable $ is bound to each item of the input
  - The <boolean expression> can be composed of the relations ==, !=, >, >=, <, <=
  - Complex expressions can be created with not, and, or, which are evaluated in this order
  - If the <boolean expression> evaluates to true, the item from the input is included in the output

Filter Example

    $employees = [
      {name: "Jon Doe", income: 20000, mgr: false},
      {name: "Vince Wayne", income: 32500, mgr: false},
      {name: "Jane Dean", income: 72000, mgr: true},
      {name: "Alex Smith", income: 25000, mgr: false}
    ];

    $employees -> filter $.mgr or $.income > 30000;

    [
      { "income": 32500, "mgr": false, "name": "Vince Wayne" },
      { "income": 72000, "mgr": true, "name": "Jane Dean" }
    ]

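For readers coming from Java, the same filter can be written as a plain loop. This is only an analogue of the JAQL statement, not how JAQL executes it; the Employee class is a hypothetical stand-in for a JSON record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java analogue of "$employees -> filter $.mgr or $.income > 30000"; names are hypothetical.
class FilterSketch {
    static final class Employee {
        final String name;
        final int income;
        final boolean mgr;

        Employee(String name, int income, boolean mgr) {
            this.name = name;
            this.income = income;
            this.mgr = mgr;
        }
    }

    /** Keeps each item for which the boolean expression evaluates to true. */
    static List<Employee> filter(List<Employee> input) {
        List<Employee> output = new ArrayList<>();
        for (Employee item : input) {
            if (item.mgr || item.income > 30000) {
                output.add(item);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Employee> employees = Arrays.asList(
            new Employee("Jon Doe", 20000, false),
            new Employee("Vince Wayne", 32500, false),
            new Employee("Jane Dean", 72000, true),
            new Employee("Alex Smith", 25000, false));
        for (Employee e : filter(employees)) {
            System.out.println(e.name);
        }
    }
}
```

The loop keeps Vince Wayne (income above 30000) and Jane Dean (a manager), matching the JAQL output above.
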
Group By
- $input -> group by <variable> = <grouping items> into <expression>
- Similar to SQL group-by
- $ is bound to the grouped items
- To get an array of all values for an item that are aggregated into one group, use $[*]

Group By Example

    $employees = [
      {id: 1, dept: 1, band: 7, income: 12000},
      {id: 2, dept: 1, band: 8, income: 13000},
      {id: 3, dept: 2, band: 7, income: 15000},
      {id: 4, dept: 1, band: 8, income: 10000},
      {id: 5, dept: 3, band: 7, income: 8000},
      {id: 6, dept: 2, band: 8, income: 5000},
      {id: 7, dept: 1, band: 7, income: 24000}
    ]

- $employees -> group by $dept = $.dept into {$dept, total: sum($[*].income)};
  [ {dept: 1, total: 59000}, {dept: 2, total: 20000}, {dept: 3, total: 8000} ]
- $employees -> group by $dept_group = $.dept into {$dept_group, total: sum($[*].income)};
- $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group.*, total: sum($[*].income)};
- $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group, total: sum($[*].income)};

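The first group-by query (total income per department) corresponds to the following plain-Java loop. It is an analogue only: rows are int arrays in the order {id, dept, band, income}, and the class and method names are invented for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java analogue of grouping by dept and summing income; names are hypothetical.
class GroupBySketch {
    /** Rows are {id, dept, band, income}; returns the total income per dept. */
    static Map<Integer, Integer> totalIncomeByDept(int[][] employees) {
        Map<Integer, Integer> totals = new TreeMap<>();
        for (int[] e : employees) {
            // Add this income to the running total of the employee's dept.
            totals.merge(e[1], e[3], Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        int[][] employees = {
            {1, 1, 7, 12000}, {2, 1, 8, 13000}, {3, 2, 7, 15000}, {4, 1, 8, 10000},
            {5, 3, 7, 8000}, {6, 2, 8, 5000}, {7, 1, 7, 24000}
        };
        System.out.println(totalIncomeByDept(employees));
    }
}
```

The totals agree with the JAQL result above: 59000 for dept 1, 20000 for dept 2, and 8000 for dept 3.
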
Join
- join <variable list> where <join condition(s)> into <expression>
- <variable list> contains two or more variables that should share at least one attribute
- <join condition(s)>: only equality predicates are allowed
- <expression> is applied to all items from the input that match the join condition. To copy all fields of an input, use $input.*
- Add the keyword 'preserve' to make it a full join

Join Example

    $users = [
      {name: "Jon Doe", password: "asdf1234", id: 1},
      {name: "Jane Doe", password: "qwertyui", id: 2},
      {name: "Max Mustermann", password: "q1w2e3r4", id: 3}
    ];

    $pages = [
      {userid: 1, url: "code.google.com/p/jaql/"},
      {userid: 2, url: "www.cnn.com"},
      {userid: 1, url: "java.sun.com/javase/6/docs/api/"}
    ];

    join $users, $pages where $users.id == $pages.userid into {$users.name, $pages.*};

    [
      { "name": "Jon Doe", "url": "code.google.com/p/jaql/", "userid": 1 },
      { "name": "Jon Doe", "url": "java.sun.com/javase/6/docs/api/", "userid": 1 },
      { "name": "Jane Doe", "url": "www.cnn.com", "userid": 2 }
    ]

IBM InfoSphere BigInsights
- IBM's offering for managing Big Data
- Powered by Hadoop and other components
- Provides a fully tested environment

Recap
- Introduction to Apache Hadoop
  - HDFS and the Map-Reduce programming framework
  - Name Node, Data Node
  - Job Tracker, Task Tracker
  - Map and reduce method signatures
- Word-Count example
  - Flow in Map-Reduce
  - Java implementation
- More Map-Reduce examples
  - Aggregation, equi-join, and inequality join
- Introduction to JAQL and IBM BigInsights

Advanced Concepts in Hadoop
- Map-Reduce programming framework
  - Combiner, Counter, Partitioner, Distributed Cache
  - Hadoop I/O
  - Input formats and output formats
    - Input and output formats provided by Hadoop
    - Writing custom input and output formats
    - Passing custom objects as key-values
  - Chaining Map-Reduce jobs
- Hadoop tuning and optimization
  - Configuration parameters
- Hadoop Eco-System
  - Hive/Pig/JAQL
  - HBase
  - Avro, ZooKeeper, Mahout, Sqoop, Ganglia, etc.
- An overview of Hadoop research
  - Join processing: multi-way equi and theta joins, set-similarity joins, k-NN joins, interval and spatial joins
  - Graph processing, text processing, etc.
  - Systems: ReStore, PerfXPlain, Stubby, RAMP, HadoopDB, etc.
