Pig: Building High-Level Dataflows over Map-Reduce
Utkarsh Srivastava, Research & Cloud Computing
Data Processing Renaissance
▪ Internet companies are swimming in data, e.g., TBs/day at Yahoo!
▪ Data analysis is the "inner loop" of product innovation
▪ Data analysts are skilled programmers
Data Warehousing…?
▪ Scale: often not scalable enough; prohibitively expensive at web scale (up to $200K/TB)
▪ SQL: little control over the execution method, and query optimization is hard
  • Parallel environment
  • Little or no statistics
  • Lots of UDFs
New Systems For Data Analysis ▪ Map-Reduce ▪ Apache Hadoop ▪ Dryad . . .
Map-Reduce
[diagram: input records flow through parallel map tasks, which emit (key, value) pairs; pairs with the same key are shuffled to the same reduce task, which produces the output records]
Just a group-by-aggregate?
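The slide's question ("just a group-by-aggregate?") can be made concrete with a minimal single-machine sketch of the map-reduce model. This is plain Python, not Hadoop, and the sample visit data is purely illustrative:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: emit (key, value) pairs for each input record.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # "shuffle": collect values by key
    # Reduce phase: one reduce call per key, over all of that key's values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Count visits per URL: exactly the group-by-aggregate the slide alludes to.
visits = [("Amy", "cnn.com"), ("Amy", "bbc.com"), ("Fred", "cnn.com")]
counts = map_reduce(visits,
                    map_fn=lambda rec: [(rec[1], 1)],
                    reduce_fn=lambda url, ones: sum(ones))
# counts == {"cnn.com": 2, "bbc.com": 1}
```

The map function chooses the grouping key; the reduce function is the aggregate, which is why a single map-reduce job is essentially one GROUP BY with one aggregate.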
The Map-Reduce Appeal
▪ Scale: scalable due to a simpler design (only parallelizable operations, no transactions)
▪ $: runs on cheap commodity hardware
▪ Procedural control: a processing "pipe"
Disadvantages
1. Extremely rigid data flow (M then R); other flows are constantly hacked in: chains (M-R-M), joins, unions, splits
2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside the map and reduce functions: difficult to maintain, extend, and optimize
Pros And Cons Need a high-level, general data flow language
Enter Pig Latin
Need a high-level, general data flow language: Pig Latin
Outline • Map-Reduce and the need for Pig Latin • Compilation into Map-Reduce • Example Generation • Future Work
Example Data Analysis Task
Find the top 10 most visited pages in each category

Visits:
User | Url        | Time
Amy  | cnn.com    | 8:00
Amy  | bbc.com    | 10:00
Amy  | flickr.com | 10:05
Fred | cnn.com    | 12:00

UrlInfo:
Url        | Category | PageRank
cnn.com    | News     | 0.9
bbc.com    | News     | 0.8
flickr.com | Photos   | 0.7
espn.com   | Sports   | 0.9
Data Flow
Load Visits → Group by url → Foreach url generate count →
  Join on url (with Load UrlInfo) → Group by category → Foreach category generate top 10 urls
In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);
store topUrls into '/data/topUrls';
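For readers without a cluster at hand, here is a rough single-machine sketch of what that script computes, written in Python. The sample rows come from the Visits/UrlInfo tables on the earlier slide; everything else (variable names, the inline top-10 logic) is illustrative, not Pig's implementation:

```python
from collections import defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]

# group visits by url; foreach generate (url, count)
visit_counts = defaultdict(int)
for user, url, time in visits:
    visit_counts[url] += 1

# join visitCounts by url, urlInfo by url
joined = [(url, category, visit_counts[url])
          for url, category, rank in url_info if url in visit_counts]

# group by category; foreach generate top 10 urls by visit count
by_category = defaultdict(list)
for url, category, count in joined:
    by_category[category].append((url, count))
top_urls = {category: sorted(urls, key=lambda uc: -uc[1])[:10]
            for category, urls in by_category.items()}
# top_urls["News"] == [("cnn.com", 2), ("bbc.com", 1)]
```

Each assignment in the Pig script corresponds to one of the labeled steps above, which is exactly the step-by-step style the next slide argues for.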
Step-by-Step Procedural Control
Target users are entrenched procedural programmers.
"The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single-block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." (Jasmine Novak, Engineer, Yahoo!)
"With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it's more powerful." (David Ciemiewicz, Search Excellence, Yahoo!)
• Automatic query optimization is hard
• Pig Latin does not preclude optimization
Quick Start and Interoperability
(the same script as above) Pig operates directly over files: load and store read and write ordinary files, with no separate import/export step.
Quick Start and Interoperability
(the same script as above) Schemas are optional and can be assigned dynamically: the as (user, url, time) clause names the fields, but it can be omitted.
User-Code as a First-Class Citizen
(the same script as above) User-defined functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach.
Nested Data Model
• Pig Latin has a fully nestable data model with atomic values, tuples, bags (lists), and maps
[example: a map from 'yahoo' to nested fields such as finance, email, news]
• More natural to programmers than flat tuples
• Avoids expensive joins
Nested Data Model
Decouples grouping as an independent operation.

Visits:
User | Url     | Time
Amy  | cnn.com | 8:00
Amy  | bbc.com | 10:05
Fred | cnn.com | 12:00

group by url →
cnn.com | {(Amy, cnn.com, 8:00), (Fred, cnn.com, 12:00)}
bbc.com | {(Amy, bbc.com, 10:05)}

• Common case: aggregation on these nested sets
• Power users: sophisticated UDFs, e.g., sequence analysis
• Efficient implementation (see paper)

"I frankly like pig much better than SQL in some respects (group + optional flatten works better for me, I love nested data structures)." (Ted Dunning, Chief Scientist, Veoh)
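The decoupling the slide describes can be sketched in a few lines: grouping alone produces nested bags, and any aggregation is a separate, later step. This is an illustrative Python sketch, not Pig's storage model:

```python
from collections import defaultdict

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:05"),
          ("Fred", "cnn.com", "12:00")]

# GROUP alone: one (url, bag-of-visit-tuples) pair per group. The bags are
# first-class values that a UDF could consume directly.
grouped = defaultdict(list)
for visit in visits:
    grouped[visit[1]].append(visit)   # visit[1] is the url

# Aggregation is an independent second step over each nested bag.
counts = {url: len(bag) for url, bag in grouped.items()}
```

In SQL, GROUP BY forces an aggregate in the same clause; here the nested bag survives on its own, which is what lets Pig power users run arbitrary UDFs over it.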
CoGroup

results:
query  | url      | rank
Lakers | nba.com  | 1
Lakers | espn.com | 2
Kings  | nhl.com  | 1
Kings  | nba.com  | 2

revenue:
query  | adSlot | amount
Lakers | top    | 50
Lakers | side   | 20
Kings  | top    | 30
Kings  | side   | 10

cogroup by query →
Lakers | {(Lakers, nba.com, 1), (Lakers, espn.com, 2)} | {(Lakers, top, 50), (Lakers, side, 20)}
Kings  | {(Kings, nhl.com, 1), (Kings, nba.com, 2)}    | {(Kings, top, 30), (Kings, side, 10)}

Cross-product of the 2 bags would give the natural join
Outline • Map-Reduce and the need for Pig Latin • Compilation into Map-Reduce • Example Generation • Future Work
Implementation
[diagram: SQL or user scripts → Pig (automatic rewrite + optimize) → Hadoop Map-Reduce cluster]
Pig is open-source: http://hadoop.apache.org/pig
• ~50% of Hadoop jobs at Yahoo! are Pig
• 1000s of jobs per day
Compilation into Map-Reduce
Load Visits → Group by url (Map 1 / Reduce 1) → Foreach url generate count → Join on url with Load UrlInfo (Map 2 / Reduce 2) → Group by category (Map 3 / Reduce 3) → Foreach category generate top10(urls)
• Every group or join operation forms a map-reduce boundary
• Other operations are pipelined into the map and reduce phases
Optimizations: Using the Combiner
[same map-reduce diagram as before]
Data can be pre-processed on the map side to reduce the data shipped:
• Algebraic aggregation functions
• Distinct processing
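The combiner idea can be sketched as a map-side pre-aggregation step. This toy sketch (illustrative names, not Hadoop's Combiner interface) shows why algebraic functions like COUNT qualify: per-map partial results can themselves be aggregated later:

```python
from collections import defaultdict

def map_with_combiner(records, map_fn, combine_fn):
    # Pre-aggregate within one map task so fewer (key, value) pairs
    # are shipped across the network to the reducers.
    partial = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            partial[key].append(value)
    return [(key, combine_fn(values)) for key, values in partial.items()]

# COUNT is algebraic: combine per-map values into counts, and the reducer
# later sums those partial counts to get the global count.
visits = [("Amy", "cnn.com"), ("Fred", "cnn.com"), ("Amy", "bbc.com")]
shipped = map_with_combiner(visits,
                            map_fn=lambda rec: [(rec[1], 1)],
                            combine_fn=sum)
# One pair per distinct URL leaves this map task, instead of one per visit.
```

A non-algebraic function (e.g., MEDIAN) cannot be combined this way, because partial medians cannot be merged into an exact global median.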
Optimizations: Skew Join
• The default join method is a symmetric hash join; the cross-product for each key is carried out on one reducer
• Problem if too many values share the same key (e.g., the nested bags for a hot key like "Lakers" in the cogroup example)
• Skew join samples the data to find frequent values
• Frequent keys are further split among several reducers
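The sampling step can be illustrated with a toy planner. This is only a sketch of the idea (sample the key distribution, give hot keys more reducers); the threshold logic here is made up and is not Pig's actual skew-join algorithm:

```python
from collections import Counter

def plan_skew_join(sampled_keys, num_reducers, rows_per_reducer):
    # Keys whose sampled frequency exceeds what one reducer should handle
    # are spread over several reducers; everything else stays on one.
    freq = Counter(sampled_keys)
    return {key: min(num_reducers, count // rows_per_reducer + 1)
            for key, count in freq.items()}

sampled = ["Lakers"] * 1000 + ["Kings"] * 10
plan = plan_skew_join(sampled, num_reducers=4, rows_per_reducer=100)
# "Lakers" is split across all 4 reducers; "Kings" stays on one.
```

The other join input must then replicate each hot key's rows to all reducers that share it, so that every pair still meets somewhere.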
Optimizations: Fragment-Replicate Join
• A symmetric hash join repartitions both inputs
• If size(data set 1) >> size(data set 2), just replicate data set 2 to all partitions of data set 1
• Translates to a map-only job: data set 2 is opened as a "side file"
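A sketch of the fragment-replicate strategy: each map task gets one fragment of the big input plus a full copy of the small input, so no shuffle or reduce phase is needed. Plain Python; partitioning and data are illustrative:

```python
def fragment_replicate_join(big_partitions, small):
    # The small relation is replicated in full to every map task; build an
    # in-memory hash index on it once per task (here: once, for simplicity).
    small_index = {}
    for row in small:
        small_index.setdefault(row[0], []).append(row)
    out = []
    for partition in big_partitions:      # each partition = one map task
        for row in partition:
            for match in small_index.get(row[0], []):
                out.append(row + match[1:])
    return out

big = [[("cnn.com", "Amy")],                       # fragment for map task 1
       [("cnn.com", "Fred"), ("bbc.com", "Amy")]]  # fragment for map task 2
small = [("cnn.com", "News"), ("bbc.com", "News")]
joined = fragment_replicate_join(big, small)
```

The strategy only pays off when the small side fits in each map task's memory; otherwise the symmetric hash join's repartitioning is the safer default.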
Optimizations: Merge Join
• Exploits data sets that are already sorted on the join key
• Again a map-only job: the other data set is opened as a "side file"
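The core of a merge join is a single coordinated scan over two sorted inputs. A minimal sketch (illustrative, ignoring Pig's side-file indexing and block boundaries):

```python
def merge_join(left, right):
    # Both inputs must already be sorted on the join key (element 0).
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross-product with the run of right rows for this key,
            # without advancing j, so later left rows can match it too.
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append(left[i] + right[j2][1:])
                j2 += 1
            i += 1
    return out

left = [("bbc.com", "Amy"), ("cnn.com", "Amy"), ("cnn.com", "Fred")]
right = [("bbc.com", 0.8), ("cnn.com", 0.9)]
joined = merge_join(left, right)
```

Because neither input is repartitioned or re-sorted, this costs one linear pass, which is why pre-sorted data makes it a map-only job.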
Optimizations: Multiple Data Flows
Map 1: Load Users, Filter bots, then Group by state and Group by demographic
Reduce 1: Apply UDFs, Store into 'bystate'; Apply UDFs, Store into 'bydemo'
Optimizations: Multiple Data Flows
Map 1: Load Users, Filter bots, Split, then Group by state and Group by demographic
Reduce 1: Demultiplex, then Apply UDFs, Store into 'bystate'; Apply UDFs, Store into 'bydemo'
A single map-reduce job computes both outputs: the split sends each record into both groupings, and the demultiplex step routes each reduce-side record to the right pipeline.
Other Optimizations
• Carry data as byte arrays as far as possible
• Use a binary comparator for sorting
• "Stream" data through external executables
Performance
Outline • Map-Reduce and the need for Pig Latin • Compilation into Map-Reduce • Example Generation • Future Work
Example Dataflow Program
Find users that tend to visit high-pagerank pages:
LOAD (user, url) → FOREACH user, canonicalize(url) → JOIN on url (with LOAD (url, pagerank)) → GROUP on user → FOREACH user, AVG(pagerank) → FILTER avgPR > 0.5
Iterative Process
Same dataflow, but no output ☹. Possible causes:
• Bug in the UDF canonicalize?
• Joining on the right attribute?
• Everything being filtered out?
How to do test runs?
• Run with real data: too inefficient (TBs of data)
• Create smaller data sets (e.g., by sampling): empty results due to joins [Chaudhuri et al. 99] and selective filters
• Biased sampling for joins: indexes are not always present
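Why independent sampling breaks joins is easy to demonstrate. In this sketch (synthetic data, illustrative names), each input is sampled at 1%, so a sampled visit's url survives in the sampled pagerank table only about 1% of the time, and the joined example set comes out (nearly) empty:

```python
import random

random.seed(0)
users = [("user%d" % i, "url%d" % i) for i in range(10000)]
pageranks = [("url%d" % i, i / 10000) for i in range(10000)]

# Sample each input independently at 1%.
s_users = random.sample(users, 100)
s_ranks = dict(random.sample(pageranks, 100))

# Expected number of surviving join results: ~100 * (100/10000) = ~1.
joined = [(u, url, s_ranks[url]) for u, url in s_users if url in s_ranks]
```

An example generator therefore cannot just downsample the inputs; it has to pick records that survive the whole dataflow, which is what the properties on the next slides formalize.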
Examples to Illustrate the Program
LOAD (user, url):
(Amy, cnn.com)
(Amy, http://www.frogs.com)
(Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url):
(Amy, www.cnn.com)
(Amy, www.frogs.com)
(Fred, www.snails.com)
LOAD (url, pagerank):
(www.cnn.com, 0.9)
(www.frogs.com, 0.3)
(www.snails.com, 0.4)
JOIN on url:
(Amy, www.cnn.com, 0.9)
(Amy, www.frogs.com, 0.3)
(Fred, www.snails.com, 0.4)
GROUP on user:
(Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})
(Fred, {(Fred, www.snails.com, 0.4)})
FOREACH user, AVG(pagerank):
(Amy, 0.6)
(Fred, 0.4)
FILTER avgPR > 0.5:
(Amy, 0.6)
Value Addition From Examples • Examples can be used for – Debugging – Understanding a program written by someone else – Learning a new operator, or language
Good Examples: Consistency
0. Consistency: output example = operator applied on input example
(illustrated on the same dataflow and example records as before)
Good Examples: Realism
1. Realism: example records should look like real data, ideally drawn from the actual input (e.g., real URLs such as www.cnn.com rather than synthetic placeholders)
Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator; e.g., for FILTER, show both a record that passes, (Amy, 0.6), and one that is dropped, (Fred, 0.4)
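For FILTER, completeness reduces to a simple mechanical check, sketched below (an illustrative helper, not the paper's algorithm, which handles all operators):

```python
def filter_example_is_complete(example_rows, predicate):
    # A complete example for FILTER shows both behaviors of the operator:
    # at least one row that passes and at least one row that is dropped.
    passed = [r for r in example_rows if predicate(r)]
    dropped = [r for r in example_rows if not predicate(r)]
    return bool(passed) and bool(dropped)

avg_pr_above_half = lambda r: r[1] > 0.5
rows = [("Amy", 0.6), ("Fred", 0.4)]
ok = filter_example_is_complete(rows, avg_pr_above_half)          # complete
bad = filter_example_is_complete([("Amy", 0.6)], avg_pr_above_half)  # not
```

An example set where every record passes the filter would never reveal a filter that is accidentally discarding everything, which is exactly the debugging scenario from the "no output" slide.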
Good Examples: Conciseness
3. Conciseness: keep the example records few and small; the three-record example above illustrates the entire dataflow
Implementation Status
• Available as the ILLUSTRATE command in the open-source release of Pig
• Available as an Eclipse plugin (PigPen)
• See the SIGMOD '09 paper for the algorithm and experiments
Related Work
• Sawzall: data processing language on top of map-reduce; rigid structure of filtering followed by aggregation
• Hive: SQL-like language on top of Map-Reduce
• DryadLINQ: SQL-like language on top of Dryad
• Nested data models: object-oriented databases
Future / In-Progress Tasks
• Columnar-storage layer
• Metadata repository
• Profiling and performance optimizations
• Tight integration with a scripting language: use loops, conditionals, and functions of the host language
• Memory management
• Project suggestions at: http://wiki.apache.org/pig/ProposedProjects
Credits
Summary
• Big demand for parallel data processing
  - Emerging tools do not look like SQL DBMSs
  - Programmers like dataflow pipes over static files
  - Hence the excitement about Map-Reduce
• But Map-Reduce is too low-level and rigid
• Pig Latin: a sweet spot between map-reduce and SQL