The Pig Latin Programming Language

Example: find the 10 most visited pages in each category.

Visits
User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info
Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9
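The query can be expressed as the following dataflow:

1. Load Visits
2. Group by url
3. Foreach url, generate count
4. Load Url Info
5. Join on url (the visit counts with Url Info)
6. Group by category
7. Foreach category, generate the top 10 urls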
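To make the semantics of each dataflow step concrete, here is a minimal Python sketch that runs the same steps in memory on the sample rows from the Visits and Url Info tables. It is only an illustration of what the operators compute, not how Pig executes them; the function name `top_urls_per_category` and the choice of an inner join are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

# Sample rows from the Visits table: (user, url, time).
visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]

# Sample rows from the Url Info table: (url, category, page_rank).
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]


def top_urls_per_category(visits, url_info, n=10):
    """Hypothetical helper: in-memory version of the dataflow steps."""
    # Group by url, then foreach url generate count.
    visit_counts = Counter(url for _user, url, _time in visits)

    # Join on url (assumed to be an inner join), then group by category.
    by_category = defaultdict(list)
    for url, category, _page_rank in url_info:
        if url in visit_counts:
            by_category[category].append((url, visit_counts[url]))

    # Foreach category, keep the n urls with the highest visit count.
    return {
        category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }


result = top_urls_per_category(visits, url_info)
for category, rows in result.items():
    print(category, rows)
```

On the sample data, espn.com has no visits, so under the inner-join assumption the Sports category does not appear in the result.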
Pig Latin implementation:

    -- Count the visits per url.
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);

    -- Attach the category of each url.
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;

    -- Keep the 10 most visited urls per category.
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts, 10);

    store topUrls into '/data/topUrls';

The slide uses a simplified notation; in Apache Pig, built-in functions such as COUNT and TOP are written in uppercase.
MapReduce jobs

Every group or join operation forms a map-reduce boundary, so the script compiles into three jobs:

Job 1
  Map 1:    Load Visits
  (boundary: Group by url)
  Reduce 1: Foreach url, generate count

Job 2
  Map 2:    Load Url Info
  (boundary: Join on url)
  Reduce 2

Job 3
  Map 3
  (boundary: Group by category)
  Reduce 3: Foreach category, generate top10(urls)
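The rule that each group or join forms a map-reduce boundary can be illustrated with a small simulation. The following Python sketch is an assumption-laden toy, not Pig's actual compiler output: `run_job` is a hypothetical helper that maps records, shuffles them by key, and reduces each key group, and the three mapper/reducer pairs mirror the three jobs. The record tags `"count"` and `"info"` are invented here so the join job can tell its two inputs apart.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one map-reduce job: map, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):  # sorted only to make the output deterministic
        output.extend(reducer(key, shuffled[key]))
    return output


visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]


# Job 1: the shuffle implements "group by url"; the reducer counts.
def map1(record):
    _user, url, _time = record
    yield url, 1


def reduce1(url, ones):
    yield ("count", url, sum(ones))


# Job 2: the shuffle implements "join on url" over tagged inputs.
def map2(record):
    if record[0] == "count":
        _tag, url, n = record
        yield url, ("count", n)
    else:
        _tag, url, category, _page_rank = record
        yield url, ("info", category)


def reduce2(url, values):
    counts = [v for tag, v in values if tag == "count"]
    categories = [v for tag, v in values if tag == "info"]
    for n in counts:  # inner join: emits nothing if either side is empty
        for category in categories:
            yield (url, category, n)


# Job 3: the shuffle implements "group by category"; the reducer takes the top 10.
def map3(record):
    url, category, n = record
    yield category, (url, n)


def reduce3(category, rows):
    yield (category, sorted(rows, key=lambda row: row[1], reverse=True)[:10])


counts = run_job(visits, map1, reduce1)
tagged_info = [("info",) + row for row in url_info]
joined = run_job(counts + tagged_info, map2, reduce2)
top_urls = run_job(joined, map3, reduce3)
print(top_urls)
```

Each call to `run_job` contains exactly one shuffle, which is why the two group operations and the one join in the script correspond to three jobs in this sketch.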