Lecture 10: Patterns for Parallel Programming III
John Cavazos
Dept of Computer & Information Sciences
University of Delaware
www.cis.udel.edu/~cavazos/cisc879
CISC 879: Software Support for Multicore Architectures

Lecture 10: Overview
• Cell B.E. Clarification
• Design Patterns for Parallel Programs
  • Finding Concurrency
  • Algorithmic Structure
    • Organize by Tasks
    • Organize by Data
  • Supporting Structures

LS-LS DMA transfer (PPU)

    #include <stdio.h>
    #include <pthread.h>
    #include <libspe2.h>

    /* N, struct thread_args, and my_spe_thread are not shown on the slide. */
    int main() {
        pthread_t pts[N];
        spe_context_ptr_t spe[N];
        struct thread_args t_args[N];
        int i;
        int rc;
        spe_program_handle_t *program;

        program = spe_image_open("../spu/hello");
        for (i = 0; i < N; i++) {
            spe[i] = spe_context_create(0, NULL);
            spe_program_load(spe[i], program);
            t_args[i].spe = spe[i];
            t_args[i].spuid = i;
            pthread_create(&pts[i], NULL, &my_spe_thread, &t_args[i]);
        }

        /* Address at which SPE 1's local store is mapped for the PPU. */
        void *ls = spe_ls_area_get(spe[1]);
        unsigned int mbox_data = (unsigned int)ls;
        printf("mbox_data %x\n", mbox_data);

        /* Send that address to SPE 0, then wait for SPE 0 to report
           (through its interrupt mailbox) that its DMA has finished. */
        rc = spe_in_mbox_write(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
        rc = spe_out_intr_mbox_read(spe[0], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);

        /* Release every SPE so that it prints its final value. */
        for (i = 0; i < N; i++) {
            rc = spe_in_mbox_write(spe[i], &mbox_data, 1, SPE_MBOX_ALL_BLOCKING);
        }

        for (i = 0; i < N; i++) {
            pthread_join(pts[i], NULL);
        }
        spe_image_close(program);
        for (i = 0; i < N; i++) {
            spe_context_destroy(spe[i]);
        }
        return 0;
    }


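LS-LS DMA transfer (SPU)

    #include <stdio.h>
    #include <sys/time.h>
    #include <spu_mfcio.h>

    /* The declarations of tv and spuid are not shown on the slide. */
    int main() {
        gettimeofday(&tv, NULL);
        printf("spu %lld; t.tv_usec = %ld\n", spuid, tv.tv_usec);

        if (spuid == 0) {
            unsigned int ea;
            unsigned int tag = 0;
            unsigned int mask = 1;

            /* Address of the target SPE's local store, sent by the PPU. */
            ea = spu_read_in_mbox();
            printf("ea = %p\n", (void*)ea);

            /* Copy tv to the same local-store offset on the target SPE,
               wait for the DMA to complete, then notify the PPU. */
            mfc_put(&tv, ea + (unsigned int)&tv, sizeof(tv), tag, 1, 0);
            mfc_write_tag_mask(mask);
            mfc_read_tag_status_all();
            spu_write_out_intr_mbox(0);
        }

        /* Block until the PPU signals that the transfer is done. */
        spu_read_in_mbox();
        printf("spu %lld; tv.tv_usec = %ld\n", spuid, tv.tv_usec);
        return 0;
    }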

LS-LS Output

    -bash-3.2$ ./a.out
    spu 0; t.tv_usec = 875360
    spu 1; t.tv_usec = 876446
    spu 2; t.tv_usec = 877443
    spu 3; t.tv_usec = 878459
    mbox_data f7764000
    ea = 0xf7764000
    spu 0; tv.tv_usec = 875360
    spu 1; tv.tv_usec = 875360
    spu 2; tv.tv_usec = 877443
    spu 3; tv.tv_usec = 878459

Organize by Data
• Operations on core data structure
• Geometric Decomposition
• Recursive Data

Geometric Decomposition
• Arrays and other linear structures
  • Divide into contiguous substructures
• Example: Matrix multiply
  • Data-centric algorithm and linear data structure (array) imply geometric decomposition
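
A minimal sketch of the matrix-multiply example, assuming square row-major matrices and a split of the result into contiguous row blocks. The names matmul_rows and matmul_geometric are hypothetical and do not come from the lecture. The driver runs the blocks one after another; in a parallel program each block would go to a different processor, which is safe because the blocks write disjoint parts of C.

```c
#include <stddef.h>

/* Compute rows [row_lo, row_hi) of C = A * B.
   All matrices are n x n and stored row-major. Each call writes a
   disjoint, contiguous block of C, so calls on different blocks
   do not depend on each other. */
void matmul_rows(const double *A, const double *B, double *C,
                 size_t n, size_t row_lo, size_t row_hi)
{
    for (size_t i = row_lo; i < row_hi; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Split C into `blocks` contiguous row blocks (blocks must be > 0).
   The loop is serial here; it stands in for handing each block
   to its own processor. */
void matmul_geometric(const double *A, const double *B, double *C,
                      size_t n, size_t blocks)
{
    for (size_t b = 0; b < blocks; b++) {
        size_t lo = b * n / blocks;
        size_t hi = (b + 1) * n / blocks;
        matmul_rows(A, B, C, n, lo, hi);
    }
}
```

Row blocks are only one possible decomposition; column blocks or 2-D tiles follow the same idea with different index ranges.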

Recursive Data
• Lists, trees, and graphs
  • Structures where you would use divide-and-conquer
• It may seem that you can only move sequentially through the data structure
  • But there are ways to expose concurrency

Recursive Data Example
• Find the Root: Given a forest of directed trees, find the root of the tree containing each node
  • Parallel approach: For each node, find its successor's successor
  • Repeat until no changes
• O(log n) vs O(n)
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
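
The successor's-successor step can be sketched as follows, assuming the forest is stored as an array in which succ[i] is the parent of node i and every root points to itself. The function name find_roots and this array layout are assumptions made for illustration. Each round is written as a serial loop, but its iterations read only the old array and write only their own slot, so they could all run at once.

```c
#include <stdlib.h>
#include <string.h>

/* succ[i] is the parent of node i; a root satisfies succ[i] == i.
   On return, succ[i] is the root of the tree that contains node i.
   Returns the number of rounds executed, or -1 if allocation fails. */
int find_roots(int *succ, int n)
{
    if (n <= 0) return 0;

    int *next = malloc((size_t)n * sizeof *next);
    if (next == NULL) return -1;

    int rounds = 0;
    int changed = 1;
    while (changed) {
        changed = 0;
        /* One round: every node jumps to its successor's successor.
           Reads come from succ, writes go to next, so the iterations
           of this loop are independent of each other. */
        for (int i = 0; i < n; i++) {
            next[i] = succ[succ[i]];
            if (next[i] != succ[i]) changed = 1;
        }
        memcpy(succ, next, (size_t)n * sizeof *succ);
        rounds++;
    }

    free(next);
    return rounds;
}
```

Each round doubles the distance a node has jumped, which is where the O(log n) number of rounds comes from; following parent links one at a time takes O(n) steps for a chain of n nodes.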

Organize by Flow of Data
  Regular: Pipeline
  Irregular: Event-Based Coordination

Organize by Flow of Data
• Computation can be viewed as a flow of data going through a sequence of stages
• Pipeline: one-way, predictable communication
• Event-based Coordination: unrestricted, unpredictable communication

Pipeline performance
• Concurrency is limited by pipeline depth
  • Balance computation and communication (architecture dependent)
• Stages should be equally computationally intensive
  • Slowest stage creates a bottleneck
  • Combine lightly loaded stages or decompose heavily loaded stages
• Time to fill and drain the pipe should be small
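
The effect of an unbalanced stage can be estimated with a simplified timing model. The sketch below assumes fixed per-stage times, no communication cost, and unlimited buffering between stages; the function names are hypothetical. Under those assumptions the first item needs the sum of all stage times (the fill), and every later item emerges one slowest-stage interval after the previous one.

```c
#include <stddef.h>

/* Time for a single item to pass through every stage. */
double pipeline_latency(const double *stage_time, size_t stages)
{
    double sum = 0.0;
    for (size_t s = 0; s < stages; s++)
        sum += stage_time[s];
    return sum;
}

/* Time of the slowest stage, which sets the steady-state rate. */
double pipeline_bottleneck(const double *stage_time, size_t stages)
{
    double max = 0.0;
    for (size_t s = 0; s < stages; s++)
        if (stage_time[s] > max)
            max = stage_time[s];
    return max;
}

/* Estimated time to push `items` items through the pipeline:
   latency of the first item plus one bottleneck interval for
   each remaining item. */
double pipeline_time(const double *stage_time, size_t stages, size_t items)
{
    if (items == 0) return 0.0;
    return pipeline_latency(stage_time, stages)
         + (double)(items - 1) * pipeline_bottleneck(stage_time, stages);
}
```

With four stages of 2 time units each, 100 items take 206 units under this model. With stage times 1, 1, 5, 1 (the same total work per item) they take 503 units, because the 5-unit stage paces the whole pipeline.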

Supporting Structures
• Single Program Multiple Data (SPMD)
• Loop Parallelism
• Master/Worker
• Fork/Join

SPMD Pattern
• Create a single program that runs on each processor
  • Initialize
  • Obtain a unique identifier
  • Run the same program on each processor
    • Identifier and input data can differentiate behavior
  • Distribute data (if any)
  • Finalize
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007

SPMD Challenges
• Split data correctly
• Correctly combine results
• Achieve even work distribution
• If programs require dynamic load balancing, another pattern may be more suitable (Job Queue)
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
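
One way to split data evenly in an SPMD program is a block partition computed from the process identifier. This is a sketch under the assumption of a 1-D array and a static split; block_range and spmd_partial_sum are hypothetical names. Every rank would run the same spmd_partial_sum body, and only the rank argument changes which slice it touches.

```c
#include <stddef.h>

/* Compute the half-open range [*lo, *hi) owned by `rank` when n items
   are split across nprocs ranks (nprocs must be > 0). The first
   n % nprocs ranks receive one extra item, so block sizes differ
   by at most one. */
void block_range(size_t rank, size_t nprocs, size_t n,
                 size_t *lo, size_t *hi)
{
    size_t base = n / nprocs;
    size_t extra = n % nprocs;
    *lo = rank * base + (rank < extra ? rank : extra);
    *hi = *lo + base + (rank < extra ? 1 : 0);
}

/* Body that every rank would execute: identical code, with behavior
   differentiated only by the rank. Returns this rank's partial sum. */
long spmd_partial_sum(const int *data, size_t n,
                      size_t rank, size_t nprocs)
{
    size_t lo, hi;
    block_range(rank, nprocs, n, &lo, &hi);

    long sum = 0;
    for (size_t i = lo; i < hi; i++)
        sum += data[i];
    return sum;
}
```

Combining the results then means adding the per-rank partial sums, for example on one designated rank. A static split like this keeps the work even only when every item costs about the same.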

Loop Parallelism Pattern
• Many programs are expressed as iterative constructs
• Programming models like OpenMP provide pragmas to automatically assign loop iterations to processors
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
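
As an illustration of such pragmas, the sketch below marks two loops with OpenMP's parallel for directive. The functions saxpy and dot are examples chosen here, not taken from the lecture. When the compiler is run without OpenMP support the pragmas are ignored and the loops run serially with the same results.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. No iteration reads a value written by
   another, so the iterations can be divided among threads. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        y[i] = a * x[i] + y[i];
}

/* Dot product. The reduction clause gives each thread a private copy
   of sum and adds the copies together when the loop ends. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

With GCC the pragmas take effect when the file is compiled with -fopenmp. The reduction clause is needed in dot because every iteration updates the same variable; without it the loop would contain a data race.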

Master/Worker Pattern
• Relevant where tasks have no dependencies
  • Embarrassingly parallel
• The difficulty is determining when the entire problem is complete
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
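
A minimal sketch of the pattern using POSIX threads, assuming a fixed list of independent tasks and a shared index that serves as the job queue. The struct, the do_task function, and the worker count are all hypothetical. Completion is detected in the simplest possible way: a worker exits when the queue is empty, and the master knows the whole problem is done once it has joined every worker.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4   /* hypothetical worker count */

struct job_queue {
    pthread_mutex_t lock;
    size_t next;         /* index of the next unclaimed task */
    size_t count;        /* total number of tasks */
    const int *input;
    int *output;
};

/* Hypothetical independent task: square one input value. */
static int do_task(int x) { return x * x; }

/* Worker: repeatedly claim a task index under the lock, then run the
   task outside the lock. Exit when no tasks remain. */
static void *worker(void *arg)
{
    struct job_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        size_t i = q->next;
        if (i < q->count) q->next++;
        pthread_mutex_unlock(&q->lock);

        if (i >= q->count) break;
        q->output[i] = do_task(q->input[i]);
    }
    return NULL;
}

/* Master: start the workers and wait for all of them.
   Returns 0 on success, -1 if any worker could not be created. */
int run_master(const int *input, int *output, size_t count)
{
    struct job_queue q;
    pthread_t tid[NUM_WORKERS];
    int started = 0, rc = 0;

    pthread_mutex_init(&q.lock, NULL);
    q.next = 0;
    q.count = count;
    q.input = input;
    q.output = output;

    for (; started < NUM_WORKERS; started++) {
        if (pthread_create(&tid[started], NULL, worker, &q) != 0) {
            rc = -1;
            break;
        }
    }
    for (int w = 0; w < started; w++)
        pthread_join(tid[w], NULL);   /* all tasks done after this loop */

    pthread_mutex_destroy(&q.lock);
    return rc;
}
```

Compile with -pthread. Because workers pull tasks on demand, a worker that draws cheap tasks simply takes more of them, which gives the dynamic load balancing that a static split lacks. If tasks could generate new tasks, an empty queue would no longer imply completion and a count of in-flight tasks would be needed as well.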

Fork/Join Pattern
• Parent creates new tasks (fork), then waits until they complete (join)
• Tasks created dynamically
  • Tasks can create more tasks
• Tasks managed according to relationships
Slide Source: Dr. Rabbah, IBM, MIT Course 6.189 IAP 2007
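
Fork and join can be sketched with POSIX threads as a recursive array sum, assuming one thread per forked task and a hypothetical CUTOFF below which the work is done serially. The names sum_task and fork_join_sum are made up for this example. Each task forks a child for the left half, handles the right half itself, and joins the child before combining the two results.

```c
#include <pthread.h>
#include <stddef.h>

#define CUTOFF 1024   /* hypothetical threshold for serial summing */

struct sum_task {
    const int *data;
    size_t n;
    long result;
};

static void *sum_task_run(void *arg)
{
    struct sum_task *t = arg;

    /* Small task: do the work directly, no further forking. */
    if (t->n <= CUTOFF) {
        long s = 0;
        for (size_t i = 0; i < t->n; i++)
            s += t->data[i];
        t->result = s;
        return NULL;
    }

    size_t half = t->n / 2;
    struct sum_task left = { t->data, half, 0 };
    struct sum_task right = { t->data + half, t->n - half, 0 };

    /* Fork: the child sums the left half. If the thread cannot be
       created, fall back to running the left half in this thread. */
    pthread_t child;
    int forked = (pthread_create(&child, NULL, sum_task_run, &left) == 0);
    if (!forked)
        sum_task_run(&left);

    sum_task_run(&right);   /* the parent works on the right half */

    /* Join: wait for the child before reading its result. */
    if (forked)
        pthread_join(child, NULL);

    t->result = left.result + right.result;
    return NULL;
}

long fork_join_sum(const int *data, size_t n)
{
    struct sum_task root = { data, n, 0 };
    sum_task_run(&root);
    return root.result;
}
```

Compile with -pthread. Creating a raw thread per fork is costly, so practical fork/join runtimes usually map tasks onto a fixed pool of threads; the sketch keeps one thread per fork only to make the parent/child relationship visible.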