Distributed Systems Lecture 3: Big Data and MapReduce
Previous lecture
• Overview of main cloud computing aspects
  – Definition
  – Models
  – Elasticity
  – Cloud stack
  – Virtualization
• AWS
Data-intensive computing
• Clouds are designed for data-intensive applications
• Approach: move the application to the data
• Computation-intensive computing
  – Example areas: MPI-based high-performance computing, Grids
  – Typically run on supercomputers (e.g., NCSA Blue Waters)
  – High CPU utilization
• Data-intensive computing
  – Typically stores data at datacenters
  – Uses compute nodes nearby (same datacenter or rack, chosen by latency)
  – Compute nodes run computation services
  – High I/O utilization
• In data-intensive computing the focus shifts from computation to the data: CPU utilization is no longer the most important resource metric
Big Data (1)
• Data mining over the huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance
  – Sloan Digital Sky Survey (2000): 200 GB/night
  – Large Synoptic Survey Telescope (2016): 140 TB every five days
  – NASA Center for Climate Simulation: 32 PB of information
  – eBay: two warehouses of 7.5 PB and 40 PB, respectively
  – Amazon holds the world's largest databases: 7.5 TB, 18.5 TB, and 24.7 TB
  – Facebook: 50 million photos
• Michael Dell (CEO): "Top thinkers are no longer the people who can tell you what happened in the past, but those who can predict the future. Welcome to the Data Economy, which holds the promise for significantly advancing society and economic growth on a global scale."
Big Data (2): Dimensions
• Volume
  – Analyzing large volumes (TBs) of distributed data
  – E.g., deriving insights from large historical data sets
• Velocity
  – Fast processing of variable data sets (streaming data)
  – E.g., trend analysis, weather forecasting
• Variety (Complexity)
  – Highly complex analysis at large scale
    • E.g., audio/video analysis at web scale, speech to text, etc.
  – Unstructured or structured data
    • E.g., relational databases, graphs
Big Data (3)
[Figure: the three Big Data dimensions as axes. Volume: MB, GB, TB, PB. Velocity: Batch/Historic, Periodic O(days/hours), Realtime O(seconds). Variety: Relational data, Photos, Audio, Video, Graphs.]
Big Data (4): Data Economy
• Data is an important asset to any organization
• Discovery of knowledge; enabling discovery; annotation of data
• Complex computational models
• No single environment is good enough: elastic, on-demand capacities are needed → cloud computing
• New programming models for Big Data on the Cloud
• Supporting algorithms and data structures
Big Data analytics (1): MapReduce
• Cloud-based Big Data programming
  – Large numbers of low-cost "commodity" machines: good performance/$
  – High failure rate, which increases with the number of resources
  – High-volume, distributed data (usually unstructured or tuple-based)
• Embarrassingly parallel computations on big data
• Programming model:
  – Split the data and process each chunk independently; join the intermediate results and output the aggregated result
  – The same computation is performed at different nodes on different pieces of the dataset
Big Data analytics (2): Hadoop
[Figure: Hadoop architecture. The Hadoop master runs the MR JobTracker and the HDFS NameNode; each Hadoop slave (worker) runs an MR TaskTracker executing map/reduce tasks and an HDFS DataNode storing data blocks (B).]
Hadoop Distributed File System (HDFS) (3)
• Each block is replicated (3 times by default)
• One replica of each block stays on the same rack; the rest are spread across the cluster
[Figure: the NameNode maps filename "x" (size 1 GB) to its blocks and their locations, e.g., blockid "1" on datanodes d1, d3, d4 and blockid "2" on datanodes d2, d4, d5; DataNodes across racks 1–3 store the replicated blocks x1 and x2.]
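Block placement can be observed programmatically through the standard HDFS client API (org.apache.hadoop.fs.FileSystem). A minimal sketch; the path /user/demo/x is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path("/user/demo/x"));
            // One BlockLocation per block; each lists the datanodes holding a replica
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset() + ", length " + b.getLength()
                        + ", replicas on " + String.join(",", b.getHosts()));
            }
        }
    }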
Hadoop MR execution (4)
• Terms are borrowed from functional languages (e.g., Lisp)
• Example: sum of squares
    (map square '(1 2 3 4))
  – Output: (1 4 9 16) [processes each record sequentially and independently]
    (reduce + '(1 4 9 16))
  – Evaluates as (+ 16 (+ 9 (+ 4 1)))
  – Output: 30 [processes the set of all records in groups]
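The same pattern exists in plain Java via streams; a sketch mirroring the Lisp example above (java.util.stream is standard JDK, not Hadoop):

    import java.util.stream.IntStream;

    public class SumOfSquares {
        public static void main(String[] args) {
            int result = IntStream.of(1, 2, 3, 4)
                                  .map(x -> x * x)          // "map square": 1 4 9 16
                                  .reduce(0, Integer::sum); // "reduce +"
            System.out.println(result); // prints 30
        }
    }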
Hadoop MR execution (5)
[Figure: executing "wc" on file "x" with 4 mappers and 3 reducers. The JobTracker launches the mappers (preferring collocation with the data blocks b1–b7); each mapper emits intermediate key/value pairs (e.g., k1 => v1, v2). The shuffle stage routes all values of a given key (k1–k5) to the same reducer; the reducers produce the outputs o1–o3.]
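Conceptually, the shuffle stage is a group-by-key over all mapper output; a plain-Java sketch of that grouping (illustrative only, not Hadoop internals):

    import java.util.AbstractMap.SimpleEntry;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class ShuffleSketch {
        public static void main(String[] args) {
            // Intermediate (key, value) pairs as emitted by the mappers
            List<SimpleEntry<String, Integer>> pairs = List.of(
                    new SimpleEntry<>("k1", 1), new SimpleEntry<>("k2", 1),
                    new SimpleEntry<>("k1", 1), new SimpleEntry<>("k3", 1));
            // Shuffle = bring all values with the same key together
            Map<String, List<Integer>> grouped = pairs.stream().collect(
                    Collectors.groupingBy(SimpleEntry::getKey,
                            Collectors.mapping(SimpleEntry::getValue, Collectors.toList())));
            System.out.println(grouped); // e.g. {k1=[1, 1], k2=[1], k3=[1]} (order may vary)
        }
    }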
Hadoop MR programming model (6)
• Map(k, v) -> (k', v')
• Reduce(k', v'[]) -> (k'', v'')
• Pipeline: Input -> Map -> Shuffle (handled by the MR system) -> Reduce -> Output
• User-defined functions:

    void map(String key, String value) {
        // do work
        // emit key, value pairs to reducers
    }

    void reduce(String key, Iterator values) {
        // for each key, iterate through all values
        // aggregate results
        // emit final result
    }
Hadoop - example: word count (7)
Input: a large number of documents
Output: the count of each word occurring across the documents

    void map(String key, String value) {
        // key: document name (ignored)
        // value: document contents
        for each word w in value
            Emit(w, 1)
    }

    void reduce(String key, Iterator values) {
        // key: word
        // values: list of counts for this word
        int count = 0;
        foreach v in values
            count += v
        Emit(key, count)
    }
Hadoop – example: word count (8)

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }
Hadoop – example: word count (9)

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

Source: http://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html
Hadoop – example: word count (10)

    // Driver: configures and submits the job (requires imports from
    // org.apache.hadoop.conf, .fs, .mapreduce, and .mapreduce.lib.input/output)
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
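Once the mapper, reducer, and driver are compiled and packaged, the job is typically launched with the standard hadoop jar command. A usage sketch; the jar name wordcount.jar, the driver class WordCount, and the HDFS paths are placeholders:

    hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output

The two paths arrive in args[0] and args[1] of the driver above.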
Hadoop - example: word count (11)
[Figure: word count over the lines "Peter Piper picked a peck of pickled peppers", "A peck of pickled", and "peppers Peter Piper picked", one per mapper. Each of the 3 mappers emits (word, 1) pairs such as (Peter, 1), (Piper, 1), (picked, 1); after local combining and the shuffle-and-sort stage, the 2 reducers aggregate the pairs into counts: (Peter, 2), (Piper, 2), (picked, 2), (a, 2), (peck, 2), (of, 2), (pickled, 2), (peppers, 2).]
Hadoop - more examples (12)
• Distributed search (grep)
  – Map: emit the line if it matches the pattern
  – Reduce: concatenate
• Analysis of large-scale system logs
• More:
  – Jerry Zhao, Jelena Pjesivac-Grbovic, MapReduce: The Programming Model and Practice. Sigmetrics tutorial, Sigmetrics 2009. research.google.com/pubs/archive/36249.pdf
  – http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/
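A minimal sketch of such a grep mapper (the grep.pattern configuration key is a made-up name for this example; since the reducer only concatenates, the job could even run map-only):

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private Pattern pattern;

        @Override
        protected void setup(Context context) {
            // The search pattern is passed in through the job configuration
            pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the whole line when it matches; no value is needed
            if (pattern.matcher(value.toString()).find()) {
                context.write(value, NullWritable.get());
            }
        }
    }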
Hadoop - other patterns and optimizations (13)
• Iterative Map-Reduce: chain several Map-Reduce stages (M -> R -> M -> R -> ...)
• Local combiners: run reducer-like aggregation on each mapper's output before the shuffle, e.g., (A,1), (A,1), (B,1) becomes (A,2), (B,1)
• Custom data partitioners: decide which reducer receives each key, e.g., hash(key) mod R
• Goal: increase local work (mappers + combiners) and reduce data transfer over the network
(See the sketch below for wiring a combiner and a partitioner into the word-count job.)
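In the word-count job above, the reduce function is associative and commutative, so the reducer can double as a combiner. A sketch of both optimizations; WordPartitioner is illustrative and simply mirrors Hadoop's default hash partitioning:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReducers) {
            // Send all pairs with the same key to the same reducer
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }
    }

    // In the driver, before waitForCompletion:
    //     job.setCombinerClass(Reduce.class);            // local aggregation on each mapper
    //     job.setPartitionerClass(WordPartitioner.class);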
Hadoop - other important aspects (14)
• Data locality
  – Map-heavy jobs: execute the mappers on the data partitions, move the data to the reducers
  – Reduce-heavy jobs: do not move the data to the reducers; instead, move the reducers to the intermediate data
• Fault tolerance
  – Worker failure
    • Re-execute on failure (Hadoop)
    • Regenerate the current state with minimum re-execution (e.g., Resilient Distributed Datasets in Spark)
  – Master failure
    • Primary/secondary master: regular backups to the secondary; on failure of the primary, the secondary takes over
Hadoop in real life
Easy to write and run highly parallel programs in new cloud programming paradigms:
• Google: MapReduce and Pregel (and many others)
• Amazon: Elastic MapReduce service (pay-as-you-go)
• Google (MapReduce)
  – Indexing: a chain of 24 MapReduce jobs
  – ~200K jobs processing 50 PB/month (in 2006)
• Yahoo! (Hadoop + Pig)
  – WebMap: a chain of 100 MapReduce jobs
  – 280 TB of data, 2500 nodes, 73 hours
• Facebook (Hadoop + Hive)
  – ~300 TB total, adding 2 TB/day (in 2008)
  – 3K jobs processing 55 TB/day
• Similar numbers from other companies, e.g., Yieldex, eharmony.com, etc.
• NoSQL: MySQL has been an industry standard for a while, but Cassandra is 2400 times faster!
Useful links
• Installing Hadoop on a single node (Linux)
• WordCount example
  – https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
• Running WordCount on AWS Elastic MapReduce
  – http://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html
Next lecture
• Failure detection