
Department of Computer Science

MapReduce for the Cell B.E. Architecture
Marc de Kruijf, University of Wisconsin-Madison
Advised by Professor Sankaralingam

MapReduce
- A model for parallel programming, proposed by Google
- Targets large-scale distributed systems: 1,000-node clusters
- Applications: distributed sort, distributed grep, indexing
- Simple, high-level interface
- Runtime handles parallelization, scheduling, synchronization, and communication

Cell B.E. Architecture
- A heterogeneous computing platform: 1 PPE, 8 SPEs
- Programming is hard:
  - Multi-threading is explicit
  - SPE local memories are software-managed
- The Cell is like a "cluster-on-a-chip"

Motivation
- MapReduce: scalable parallel model, simple interface
- Cell B.E.: complex parallel architecture, hard to program
- Combining the two: MapReduce for the Cell B.E. Architecture

Overview
- Motivation
  - MapReduce
  - Cell B.E. Architecture
- MapReduce Example
- Design
- Evaluation
  - Workload Characterization
  - Application Performance
- Conclusions and Future Work

MapReduce Example
Counting word occurrences in a set of documents.


Design: Flow of Execution
Five stages: Map, Partition, Quick-sort, Merge-sort, Reduce
1. Map streams key/value pairs.
2. Partition hashes keys and distributes the pairs.
3. Quick-sort sorts each partition locally.
4. Merge-sort merges the sorted runs.
5. Reduce "reduces" key/list-of-values pairs to key/value pairs.
Stages 2-4 implement key grouping as a two-phase external sort.


Evaluation Methodology
- MapReduce model characterization
  - Synthetic micro-benchmark with six parameters
  - Run on a 3.2 GHz Cell blade
  - Measured the effect of each parameter on execution time
- Application performance comparison
  - Six full applications
  - MapReduce versions run on a 3.2 GHz Cell blade
  - Single-threaded versions run on a 2.4 GHz Core 2 Duo
  - Measured speedup by comparing execution times
  - Measured overheads on the Cell by monitoring SPE idle time
  - Measured ideal speedup assuming no Cell overheads

MapReduce Model Characterization: Effect on Execution Time

Model characteristics:

Characteristic    | Description
------------------|---------------------------------------------
Map intensity     | Execution cycles per input byte to Map
Reduce intensity  | Execution cycles per input byte to Reduce
Map fan-out       | Ratio of input size to output size in Map
Reduce fan-in     | Number of values per key in Reduce
Partitions        | Number of partitions
Input size        | Input size in bytes

Application Performance
Applications:
- histogram: counts RGB occurrences in a bitmap
- kmeans: clustering algorithm
- linearReg: least-squares linear regression
- wordCount: word count
- NAS_EP: EP benchmark from the NAS suite
- distSort: distributed sort

Speedup Over Core 2 Duo
[chart]

Runtime Overheads
[chart]


Conclusions and Future Work
- Conclusions
  - Programmability benefits
  - High performance on computationally intensive workloads
  - Not applicable to all application types
- Future work
  - Additional performance tuning
  - Extend to clusters of Cell processors: hierarchical MapReduce

Questions?

Backup Slides

MapReduce API

void MapReduce_exec(MapReduceSpecification specification);

The exec function initializes the MapReduce runtime and executes MapReduce according to the user specification.

void MapReduce_emitIntermediate(void **key, void **value);
void MapReduce_emit(void **value);

These two functions are called by the user-defined Map and Reduce functions, respectively. They take references to pointers as arguments and modify the referenced pointers to point to pre-allocated storage. It is then the responsibility of the application to populate this storage.

Optimizations
1. Priority work queue
   - Distributes load
   - Avoids serialization
2. Pipelined execution
   - Maximizes concurrency
   - Double-buffering
3. Application support
   - Map only
   - Map with sorted output
   - Chaining invocations


Optimizations (continued)
4. Balanced merge
   - n / log(n) better bandwidth utilization as n → ∞
5. Map and Reduce output regions pre-allocated
   - Optimal memory alignment
   - Bulk memory transfers
   - No user memory management
   - No dynamic allocation overhead