Map/Reduce and Hadoop performance
Ioana Manolescu
Senior researcher, OAK team lead
Inria Saclay and Université Paris-Sud
Big Data Paris, 2013

Plan
• The Map/Reduce model
• Two performance problems and ways out
  – Blocking steps in the Map/Reduce processing pipeline
  – Non-selective data access
• Conclusion: toward Map/Reduce optimization?
03/04/13 Ioana Manolescu (Inria), Big Data Paris

The Map/Reduce model
• Problem:
  – How to apply a given processing to a very large amount of data in a distributed fashion
• Map/Reduce solution:
  – Programmer: express the processing as
    • Splitting the data
    • Extracting (key, value) pairs from each partition (MAP)
    • Computing partial results for each key (REDUCE)
  – Map/Reduce platform (e.g., Hadoop):
    • Distributes partitions, runs one MAP task per partition
    • Runs one or several REDUCE tasks per key
    • Sends data across machines from MAP to REDUCE
• The platform provides: implicit parallelism, communication across machines, fault tolerance
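The split / MAP / REDUCE division of labour above can be sketched as a single-process word count. This is only an illustration under simplifying assumptions: everything runs in one Python process, and the helper names (`split_data`, `map_task`, `shuffle`, `reduce_task`, `run_job`) are hypothetical, not Hadoop APIs.

```python
from collections import defaultdict

def split_data(records, n_partitions):
    """Programmer/platform step: split the input into partitions."""
    return [records[i::n_partitions] for i in range(n_partitions)]

def map_task(partition):
    """MAP: extract (key, value) pairs from one partition."""
    for line in partition:
        for word in line.split():
            yield (word, 1)

def shuffle(map_outputs):
    """Platform step: bring all values of the same key together."""
    groups = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups

def reduce_task(key, values):
    """REDUCE: compute the result for one key."""
    return key, sum(values)

def run_job(records, n_partitions=2):
    partitions = split_data(records, n_partitions)
    map_outputs = [map_task(p) for p in partitions]  # one MAP task per partition
    groups = shuffle(map_outputs)                    # MAP -> REDUCE data transfer
    return dict(reduce_task(k, vs) for k, vs in groups.items())

lines = ["big data paris", "big data", "hadoop map reduce", "map reduce"]
word_counts = run_job(lines)
```

In a real platform the partitions live on different machines and `shuffle` is a network transfer; the programmer still writes only the equivalents of `map_task` and `reduce_task`.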

Map/Reduce in detail

Hadoop Map/Reduce
• Map: Data Load → Map() → Local sort → Map write
• Reduce: Merge → Reduce → Final write
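A minimal sketch of the seven pipeline steps, assuming a word-count job and in-memory lists standing in for files. The function names `map_side` and `reduce_side` are hypothetical; the point is only to show where sorting and merging sit in the pipeline.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def map_side(partition):
    # Data Load + Map(): read the partition and emit (key, value) pairs
    pairs = [(word, 1) for line in partition for word in line.split()]
    # Local sort: order the map output by key
    pairs.sort(key=itemgetter(0))
    # Map write: the sorted run is what gets handed to the reduce side
    return pairs

def reduce_side(sorted_runs):
    # Merge: combine the sorted runs into one key-ordered stream
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    result = {}
    # Reduce: one call per group of equal keys
    for key, group in groupby(merged, key=itemgetter(0)):
        result[key] = sum(value for _, value in group)
    # Final write: here, simply return the result
    return result

partitions = [["b a b"], ["a c"], ["c b"]]
runs = [map_side(p) for p in partitions]
sort_merge_counts = reduce_side(runs)
```

Note that `reduce_side` needs every sorted run before the merged stream is complete, which is why the sort-merge steps act as blocking points.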

Performance problem 1: Idle CPU due to blocking steps

Hadoop resource usage
• Blocking sort-merge

Hadoop benchmark [Li, Mazur, Diao, McGregor, Shenoy, ACM SIGMOD Conference 2011]
• CPU stalls during I/O intensive Merge
• Reduce strictly follows Merge

Hash-based algorithms to improve Hadoop performance [Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]
• Main idea: use non-blocking hash-based algorithms to group items by key during the Map-side Local sort and the Reduce-side Merge
• Principle of hashing:
  – Key values: 3, 1, 2, 8, 2, 13, 1, 2, 3, 9…
  – Hash function h(x), e.g. x % 3
  – Bucket 0: 9, 3, 3
  – Bucket 1: 1, 13, 1
  – Bucket 2: 2, 2, 8, 2
• Partitions can be in memory or flushed to disk
• If the reduce works incrementally, data can be sent early
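The hashing example on this slide can be reproduced in a few lines. This is a sketch only: `hash_partition` and `incremental_count` are hypothetical names, counting stands in for an arbitrary incremental reduce, and nothing here is taken from the cited platform's code.

```python
def h(x, n_buckets=3):
    """Hash function from the slide: h(x) = x % 3."""
    return x % n_buckets

def hash_partition(keys, n_buckets=3):
    """Group keys into buckets in one pass, without sorting."""
    buckets = {b: [] for b in range(n_buckets)}
    for k in keys:
        buckets[h(k, n_buckets)].append(k)
    return buckets

def incremental_count(keys):
    """An incremental reduce: the state is updated as each key arrives."""
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    return counts

keys = [3, 1, 2, 8, 2, 13, 1, 2, 3, 9]
buckets = hash_partition(keys)
bucket_counts = {b: incremental_count(ks) for b, ks in buckets.items()}
```

Each key is placed as soon as it is read, so no step has to wait for the whole input, unlike a sort. Within a bucket the keys appear in arrival order, which may differ from the order drawn on the slide.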

Performance problem 2: non-selective data access

Data access in Hadoop
• Basic model: read all the data
  – If the tasks are selective, we don't really need to!
• Database indexes? But:
  – Map/Reduce works on top of a file system (e.g. Hadoop file system, HDFS)
  – Data is stored only once
  – Hard to foresee all future processing
    • "Exploratory nature" of Hadoop

Accelerating data access in Hadoop
• Idea 1: Hadoop++ [Jindal, Quiané-Ruiz, Dittrich, ACM SOCC, 2011]
  – Add header information to each data split, summarizing split attribute values
  – Modify the RecordReader of HDFS used by Map(), so that it prunes irrelevant splits
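A sketch of split pruning with header summaries. The slide does not say what the header contains, so this example assumes the simplest possible summary, the minimum and maximum of one numeric attribute. The names `Split`, `build_header`, `make_split` and `read_selective` are hypothetical and do not come from Hadoop++.

```python
from dataclasses import dataclass

@dataclass
class Split:
    records: list
    header: tuple = None  # assumed summary: (min, max) of the attribute

def build_header(records):
    """Summarize the attribute values stored in a split."""
    return (min(records), max(records))

def make_split(records):
    return Split(records=records, header=build_header(records))

def read_selective(splits, low, high):
    """Reader that skips splits whose header rules out any match."""
    scanned = 0
    hits = []
    for split in splits:
        split_min, split_max = split.header
        if split_max < low or split_min > high:
            continue  # pruned: no record in this split can qualify
        scanned += 1
        hits.extend(r for r in split.records if low <= r <= high)
    return hits, scanned

splits = [make_split([1, 4, 7]), make_split([10, 12, 19]), make_split([20, 25, 31])]
hits, scanned = read_selective(splits, 11, 21)
```

With the selective range 11 to 21, the first split is never read; the less selective the task, the less pruning helps.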

Accelerating data access in Hadoop
• Idea 2: HAIL [Dittrich, Quiané-Ruiz, Richter, Schuh, Jindal, Schad, PVLDB 2012]
  – Each storage node builds an in-memory, clustered index of the data in its split
  – There are three copies of each split for reliability: build three different indexes!
  – Customize RecordReader
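The "three copies, three indexes" idea can be sketched as follows. This is an illustration under stated assumptions, not HAIL's implementation: each replica is modelled as the same rows sorted on a different attribute (a stand-in for a clustered index), the sample rows are made up, and `build_replica` and `lookup` are hypothetical names.

```python
import bisect

# Made-up sample rows for one split
ROWS = [
    {"id": 3, "age": 41, "city": "Paris"},
    {"id": 1, "age": 25, "city": "Lyon"},
    {"id": 2, "age": 33, "city": "Nice"},
    {"id": 4, "age": 25, "city": "Paris"},
]

def build_replica(rows, attribute):
    """One copy of the split, clustered (sorted) on the given attribute."""
    ordered = sorted(rows, key=lambda r: r[attribute])
    keys = [r[attribute] for r in ordered]
    return {"attribute": attribute, "rows": ordered, "keys": keys}

# Three copies of the same split, each indexed on a different attribute
replicas = [build_replica(ROWS, a) for a in ("id", "age", "city")]

def lookup(replicas, attribute, value):
    """Reader that picks the replica whose index matches the filter."""
    for replica in replicas:
        if replica["attribute"] == attribute:
            start = bisect.bisect_left(replica["keys"], value)
            end = bisect.bisect_right(replica["keys"], value)
            return replica["rows"][start:end]
    # No replica is indexed on this attribute: fall back to a full scan
    return [r for r in replicas[0]["rows"] if r[attribute] == value]

paris_rows = lookup(replicas, "city", "Paris")
age_25_rows = lookup(replicas, "age", 25)
```

The replicas already exist for reliability, so indexing each one differently adds query flexibility without storing extra copies of the data.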

Hadoop, Hadoop++ and HAIL

Conclusion: toward Map/Reduce optimization

Optimization defined as Hadoop tuning
Hadoop parameters: [Herodotou, technical report, Duke Univ, 2011]

Hadoop tuning [Babu, ACM SoCC 2010]

Hadoop performance model [Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]

Hadoop performance model [Li, Mazur, Diao, McGregor, Shenoy, SIGMOD 2011]
Easy

References
• Shivnath Babu. "Towards automatic optimization of MapReduce programs", ACM SoCC, 2010
• Herodotos Herodotou. "Hadoop Performance Models", Duke University, 2011
• Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant Shenoy. "A Platform for Scalable One-Pass Analytics using MapReduce", ACM SIGMOD 2011
• Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Stefan Richter, Stefan Schuh, Alekh Jindal, Jörg Schad. "Only Aggressive Elephants are Fast Elephants", VLDB 2012
• A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich. "Trojan Data Layouts: Right Shoes for a Running Elephant", SOCC, 2011
• Harold Lim, Herodotos Herodotou, Shivnath Babu. "Stubby: A Transformation-based Optimizer for MapReduce Workflows", PVLDB 2012

Thank you
Questions now?
Questions/feedback later on: ioana.manolescu@inria.fr