Big Data Storage and Applications
Large-Scale File Systems and Map-Reduce
3. Map-Reduce
Chen Yishuai (陈一帅), chenyishuai@gmail.com


Map-Reduce Principles



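Python Map/Reduce
• Map: applies a function to all the items in an input list
• a = [1, 2, 3]; b = [4, 5, 6]; map(lambda x, y: x + y, a, b)
• Result: [5, 7, 9] (in Python 3, map returns an iterator, so wrap the call in list(...) to see the values)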


Python Map/Reduce
• Reduce: applies a rolling computation to sequential pairs of values in a list
• reduce(lambda x, y: x * y, [1, 2, 3]) (in Python 3, reduce is imported from functools)
• Result: 6
• Evaluated from the left: ((1 * 2) * 3)
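The two slide examples can be run as written with small adjustments for Python 3. The following is a minimal sketch; the variable names `sums` and `product` are introduced here for illustration only.

```python
from functools import reduce  # in Python 3, reduce lives in functools

a = [1, 2, 3]
b = [4, 5, 6]

# map pairs up items from both lists and applies the function to each pair.
# It returns a lazy iterator, so list() is needed to materialize the values.
sums = list(map(lambda x, y: x + y, a, b))

# reduce folds the list from the left: ((1 * 2) * 3)
product = reduce(lambda x, y: x * y, [1, 2, 3])

print(sums)     # [5, 7, 9]
print(product)  # 6
```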

Map-Reduce



Group by key
• Task of the Map-Reduce environment
• Partition
• Hash(word) mod R
• R: the number of Reducers
• Hash(first letter(word)) mod R
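The partition rule above can be sketched in a few lines. This is an illustration under assumptions, not the framework's actual partitioner: the function names are hypothetical, and `zlib.crc32` is swapped in for the hash because Python's built-in `hash()` on strings is salted per process.

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    """Return the reducer index for a key: hash(key) mod R."""
    # crc32 gives the same value in every process, so all workers
    # agree on which reducer owns a given key.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_by_first_letter(word: str, num_reducers: int) -> int:
    """Variant from the slide: hash only the first letter of the word."""
    return partition(word[:1], num_reducers)

R = 3  # number of reducers (hypothetical value)
words = ["apple", "avocado", "banana", "apple"]

# Group words into one bucket per reducer.
buckets = {r: [] for r in range(R)}
for w in words:
    buckets[partition(w, R)].append(w)
```

With the first-letter variant, "apple" and "avocado" always land on the same reducer, which keeps words with a common initial together at the cost of a less even load.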

Google



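Job scheduling
• Task
• States: Idle, In-progress, Completed
• The master assigns idle tasks to Workers
• When a Map Worker completes a Task, it reports to the master that the work is done, together with the locations of the intermediate results (already partitioned: one intermediate file per Reducer)
• The master notifies the Reducers to fetch them
• When a Reducer Worker completes a Task, it reports the result to the master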
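The bookkeeping described above can be sketched as a toy, single-process model. Everything here is a hypothetical illustration: the `Master` class, its method names, and the file paths are assumptions, and a real master also handles worker failure and reduce tasks.

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

class Master:
    """Toy master: tracks task states and intermediate file locations."""

    def __init__(self, map_tasks, num_reducers):
        self.state = {t: State.IDLE for t in map_tasks}
        self.owner = {}  # task -> worker currently assigned to it
        self.num_reducers = num_reducers
        # reducer index -> locations of intermediate files it must fetch
        self.to_fetch = {r: [] for r in range(num_reducers)}

    def assign(self, worker):
        """Hand the next idle task to a worker, or None if nothing is idle."""
        for task, st in self.state.items():
            if st is State.IDLE:
                self.state[task] = State.IN_PROGRESS
                self.owner[task] = worker
                return task
        return None

    def map_done(self, task, locations):
        """A map worker reports completion plus one file location per reducer."""
        assert len(locations) == self.num_reducers
        self.state[task] = State.COMPLETED
        for r, loc in enumerate(locations):
            self.to_fetch[r].append(loc)  # stands in for notifying reducer r

master = Master(["map-0", "map-1"], num_reducers=2)
task = master.assign("worker-A")
master.map_done(task, ["/tmp/map-0-part-0", "/tmp/map-0-part-1"])
```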

Google


Pipeline



Storage + Computation = the Map-Reduce computing model
[Diagram labels: namenode (namenode daemon); job submission node (jobtracker); slave node … (tasktracker, datanode daemon, Linux file system)]

Map-Reduce Algorithms



Map-Reduce: Matrix Multiplication
• PageRank: an n × n matrix M and an n × 1 vector V
• Compute M × V
• Use the key to partition the product terms (m_ij · v_j) to a single Reducer
• Key: i
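The matrix-vector scheme above can be sketched as follows. This is a minimal single-process illustration that assumes the vector V fits in memory on every map task; the names `map_fn`, `reduce_fn`, and the sample values are hypothetical.

```python
from collections import defaultdict

# Sparse matrix M as (i, j, m_ij) triples, and vector V as a list.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
V = [10.0, 20.0]

def map_fn(entry, v):
    """Emit (i, m_ij * v_j) so that all terms of row i share the key i."""
    i, j, m_ij = entry
    yield i, m_ij * v[j]

def reduce_fn(i, terms):
    """Sum the terms of row i to get the i-th element of M x V."""
    return i, sum(terms)

# Group by key: stands in for the framework's shuffle step.
groups = defaultdict(list)
for entry in M:
    for key, value in map_fn(entry, V):
        groups[key].append(value)

result = dict(reduce_fn(i, terms) for i, terms in groups.items())
print(result)  # {0: 50.0, 1: 110.0}
```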

Map-Reduce: Matrix Multiplication (matrix × matrix)
• Key: (i, k), the position of the output element

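One way to read the (i, k) key is as a single-pass matrix-matrix product, where every output cell p_ik gets its own reducer group. The sketch below assumes that reading; the tags "M" and "N", the function names, and the sample matrices are hypothetical.

```python
from collections import defaultdict

# M is I x J and N is J x K, both stored as {(row, col): value}.
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
I, K = 2, 2  # number of rows of M and number of columns of N

def map_m(i, j, m_ij):
    # m_ij contributes to every output cell in row i.
    for k in range(K):
        yield (i, k), ("M", j, m_ij)

def map_n(j, k, n_jk):
    # n_jk contributes to every output cell in column k.
    for i in range(I):
        yield (i, k), ("N", j, n_jk)

def reduce_fn(key, values):
    # Match M and N entries on the shared index j, then sum the products.
    m = {j: x for tag, j, x in values if tag == "M"}
    n = {j: x for tag, j, x in values if tag == "N"}
    return key, sum(m[j] * n[j] for j in m if j in n)

groups = defaultdict(list)  # stands in for shuffle and sort
for (i, j), x in M.items():
    for key, value in map_m(i, j, x):
        groups[key].append(value)
for (j, k), x in N.items():
    for key, value in map_n(j, k, x):
        groups[key].append(value)

P = dict(reduce_fn(key, values) for key, values in groups.items())
print(P)  # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```

Each matrix element is replicated to every cell it contributes to, so this variant trades extra communication for finishing in one Map-Reduce pass.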

Projection
• [Diagram: tuples R1, R2, R3, R4, R5]
• No reducer



Selection
• [Diagram: tuples R1, R2, R3, R4, R5, with R1 and R3 shown a second time]
• No reducer
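Projection and selection need no reducer because each tuple can be handled on its own in the map step. The sketch below illustrates this with a hypothetical relation; the attribute names, the predicate, and the function names are assumptions made for the example.

```python
# Hypothetical relation: each tuple is a dict of attribute -> value.
R = [
    {"id": 1, "city": "Beijing", "age": 31},
    {"id": 2, "city": "Shanghai", "age": 17},
    {"id": 3, "city": "Beijing", "age": 45},
]

def project_map(t, attrs):
    """Projection: keep only the requested attributes of each tuple."""
    yield {a: t[a] for a in attrs}

def select_map(t, predicate):
    """Selection: emit the tuple only if it satisfies the predicate."""
    if predicate(t):
        yield t

# Map-only jobs: the map output is the final output.
projected = [out for t in R for out in project_map(t, ["id", "city"])]
selected = [out for t in R for out in select_map(t, lambda t: t["age"] >= 18)]
```

Note that a map-only projection keeps duplicate tuples; removing them would require grouping identical tuples, which does need a reduce step.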


Relational Joins
• Relation R: R1, R2, R3, R4
• Relation S: S1, S2, S3, S4
• Joined pairs: (R1, S2), (R2, S4), (R3, S1), (R4, S3)

Join
• Key: B

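A join keyed on B can be sketched as a reduce-side join. The slide does not give the schemas, so the sketch assumes relations R(A, B) and S(B, C); the tags, function names, and sample tuples are hypothetical.

```python
from collections import defaultdict

# Hypothetical relations R(A, B) and S(B, C), joined on attribute B.
R = [("a1", "b1"), ("a2", "b2")]
S = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]

def map_r(a, b):
    yield b, ("R", a)  # key is B; the tag records the source relation

def map_s(b, c):
    yield b, ("S", c)

def reduce_fn(b, values):
    """Pair every R tuple with every S tuple that shares this B value."""
    r_side = [x for tag, x in values if tag == "R"]
    s_side = [x for tag, x in values if tag == "S"]
    for a in r_side:
        for c in s_side:
            yield a, b, c

groups = defaultdict(list)  # stands in for shuffle and sort
for a, b in R:
    for k, v in map_r(a, b):
        groups[k].append(v)
for b, c in S:
    for k, v in map_s(b, c):
        groups[k].append(v)

joined = [row for b, values in groups.items() for row in reduce_fn(b, values)]
print(joined)  # [('a1', 'b1', 'c1'), ('a1', 'b1', 'c2')]
```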


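Optimizing overhead: Combiners
• Goal: reduce communication overhead
• [Diagram, without combiners: four map tasks read inputs (k1, v1) … (k6, v6) and emit (a, 1), (b, 2); (c, 3), (c, 6); (a, 5), (c, 2); (b, 7), (c, 8)]
• Shuffle and Sort: aggregate values by keys: a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]
• Three reduce tasks output (r1, s1), (r2, s2), (r3, s3)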


Optimization: Combiners
• Goal: reduce communication overhead
• [Diagram, with combiners: each of the four map tasks is followed by a combine step and a partition step; the map task that emits (c, 3), (c, 6) has them combined into (c, 9) before the shuffle]
• Shuffle and Sort: aggregate values by keys: a → [1, 5], b → [2, 7], c → [2, 9, 8]
• Three reduce tasks output (r1, s1), (r2, s2), (r3, s3)
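The saving from a combiner can be checked on the pairs shown in the diagram. The sketch below assumes the reduce operation is a sum, which matches (c, 3) and (c, 6) becoming (c, 9); the function names are hypothetical.

```python
from collections import defaultdict

# Output of the four map tasks, taken from the diagram.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]

def combine(pairs):
    """Local pre-aggregation on one map task: sum values per key."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def shuffle(outputs):
    """Group values by key across all map tasks."""
    groups = defaultdict(list)
    for pairs in outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

plain = shuffle(map_outputs)
combined = shuffle([combine(p) for p in map_outputs])

# Number of records crossing the network in each case.
sent_plain = sum(len(v) for v in plain.values())        # 8 records
sent_combined = sum(len(v) for v in combined.values())  # 7 records

# The final sums are the same either way.
totals = {k: sum(v) for k, v in combined.items()}
```

The combiner is safe here because addition is associative and commutative, so summing part of the values early does not change the final result. The saving grows with the number of repeated keys per map task.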