CS 584: Designing Parallel Algorithms

Designing Parallel Algorithms
Designing a parallel algorithm is not easy. There is no recipe or magical ingredient:
  • except creativity
We can benefit from a methodical approach:
  • a framework for algorithm design
Most problems have several parallel solutions, which may be totally different from the best sequential algorithm.

PCAM Algorithm Design
Four stages to designing a parallel algorithm:
  • Partitioning
  • Communication
  • Agglomeration
  • Mapping
P & C focus on concurrency and scalability; A & M focus on locality and performance.

PCAM Algorithm Design
  • Partitioning: computation and data are decomposed.
  • Communication: coordinate task execution.
  • Agglomeration: combine tasks for performance.
  • Mapping: assign tasks to processors.

Partitioning
Ignore the number of processors and the target architecture. Expose opportunities for parallelism. Divide up both the computation and the data. Two approaches:
  • domain decomposition
  • functional decomposition

Domain Decomposition
Start algorithm design by analyzing the data. Divide the data into small pieces:
  • approximately equal in size
Then partition the computation by associating it with the data. Communication issues may arise when one task needs data from another task.

Domain Decomposition
Evaluate the definite integral:

    ∫₀¹ 4 / (1 + x²) dx = π

Split up the domain
(Figure: the interval [0, 1] divided into subintervals, one per task.)

Split up the domain
Now each task simply evaluates the integral over its own range. All that is left is to sum up each task's answer for the total. (Subinterval boundaries: 0, 0.25, 0.5, 0.75, 1.)
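
To make the decomposition concrete, here is a minimal sketch (Python; the quadrature rule, step count, and four-way split are illustrative choices, not from the slides) in which each task integrates 4/(1 + x²) over its own subinterval and the partial results are summed:

```python
import math

def task_integral(a, b, steps=1000):
    """One task's work: approximate the integral of 4/(1+x^2) over
    [a, b] with the midpoint rule (an assumed quadrature choice)."""
    h = (b - a) / steps
    return sum(4.0 / (1.0 + (a + (i + 0.5) * h) ** 2)
               for i in range(steps)) * h

# Split [0, 1] into four equal subdomains, one per task.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
partials = [task_integral(lo, hi) for lo, hi in zip(edges, edges[1:])]

total = sum(partials)  # final reduction: sum each task's answer
print(total)           # close to pi
```

The subdomain evaluations are independent, so they could run on separate processors; only the final sum requires communication.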

Domain Decomposition
Consider dividing up a 3-D grid.
  • What issues arise?
Other issues:
  • What if your problem has more than one data structure?
  • Different problem phases?
  • Replication?

Functional Decomposition
Focus on the computation. Divide the computation into disjoint tasks:
  • avoid data dependencies among tasks
After dividing the computation, examine the data requirements of each task.

Functional Decomposition
Not as natural as domain decomposition; consider search problems. Functional decomposition is often very useful at a higher level:
  • climate modeling
    ◦ ocean simulation
    ◦ hydrology
    ◦ atmosphere, etc.

Partitioning Checklist
  • Did you define a LOT of tasks?
  • Did you avoid redundant computation and storage?
  • Are tasks approximately equal in size?
  • Does the number of tasks scale with the problem size?
  • Have you identified several alternative partitioning schemes?

Communication
The information flow between tasks is specified in this stage of the design. Remember:
  • Tasks execute concurrently.
  • Data dependencies may limit concurrency.

Communication
Define channels:
  • Link the producers with the consumers.
  • Consider the costs:
    ◦ intellectual
    ◦ physical
  • Distribute the communication.
Specify the messages that are sent.

Communication Patterns
  • local vs. global
  • structured vs. unstructured
  • static vs. dynamic
  • synchronous vs. asynchronous

Local Communication
Communication within a neighborhood. The choice of algorithm determines the communication.

Global Communication
Not localized. Examples:
  • all-to-all
  • master-worker
(Figure: a central task collecting values from all the others.)

Avoiding Global Communication
Distribute the communication and computation. (Figure: the summation proceeds through a chain of tasks, each adding its value to a running partial sum.)

Divide and Conquer
Partition the problem into two or more subproblems; partition each subproblem, and so on. This results in a structured nearest-neighbor communication pattern. (Figure: a binary tree of partial sums.)
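
The divide-and-conquer summation can be sketched as a recursive tree sum; the recursion runs sequentially here, but each half could be evaluated by a separate task (the function name and input values are illustrative):

```python
def tree_sum(values):
    """Divide-and-conquer summation: split the list in half, sum each
    half (conceptually in parallel), then combine. Depth is O(log n)."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])

print(tree_sum([4, 9, 1, 7, 6, 3, 8, 2]))  # 40
```

Each level of the recursion only combines neighboring partial results, which is where the nearest-neighbor communication pattern comes from.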

Structured Communication
Each task's communication resembles each other task's communication. Is there a pattern?

Unstructured Communication
No regular pattern that can be exploited. Examples:
  • unstructured grids
  • resolution changes
This complicates the next stages of design.

Synchronous Communication
Both producers and consumers are aware when communication is required. Explicit and simple. (Figure: matched sends and receives at t = 1, 2, 3.)

Asynchronous Communication
The timing of sends and receives is unknown:
  • no pattern
Consider: a very large data structure. Options:
  • distribute it among the computational tasks (polling)
  • define a set of read/write tasks
  • shared memory

Problems to Avoid
A centralized algorithm:
  • distribute the computation
  • distribute the communication
A sequential algorithm:
  • seek concurrency
  • divide and conquer
    ◦ small, equal-sized subproblems

Communication Design Checklist
  • Is communication balanced?
    ◦ All tasks should communicate about the same amount.
  • Is communication limited to neighborhoods?
    ◦ Restructure global communication to local if possible.
  • Can communications proceed concurrently?
  • Can the algorithm proceed concurrently?
    ◦ Find the algorithm with the most concurrency (but be careful!).

Agglomeration
The partitioning and communication steps were abstract; agglomeration moves to the concrete. Combine tasks so that they execute efficiently on some parallel computer. Consider replication.

Agglomeration Goals
  • Reduce communication costs by:
    ◦ increasing computation
    ◦ decreasing/increasing granularity
  • Retain flexibility for mapping and scaling.
  • Reduce software engineering costs.

Changing Granularity
A large number of tasks does not necessarily produce an efficient algorithm; we must consider the communication costs. Reduce communication by:
  • having fewer tasks
  • sending fewer messages (batching)
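
A toy cost model makes the batching argument concrete. The latency and per-byte costs below are illustrative placeholder values, not measurements:

```python
def comm_cost(messages, total_bytes, latency=1e-6, per_byte=1e-9):
    """Simple linear model: T = messages * latency + bytes * per_byte.
    The parameter values are assumptions for illustration only."""
    return messages * latency + total_bytes * per_byte

# Sending 10,000 one-word (8-byte) messages vs. one batched message
# carrying the same 80,000 bytes:
unbatched = comm_cost(10_000, 80_000)
batched = comm_cost(1, 80_000)
print(unbatched, batched)
```

With these numbers the per-message latency dominates, so batching the same data into one message is over a hundred times cheaper.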

Surface-to-Volume Effects
Communication is proportional to the surface of the subdomain; computation is proportional to its volume. Increasing the computation per task therefore often decreases the relative cost of communication.
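
For a cubic n×n×n subdomain exchanging data across its six faces, the surface-to-volume argument can be checked directly (the face-exchange assumption is illustrative):

```python
def ratio(n):
    """Communication-to-computation ratio for an n*n*n subdomain with
    face-neighbor exchange: surface 6*n^2 over volume n^3, i.e. 6/n."""
    return 6 * n**2 / n**3

# Doubling the subdomain edge halves the ratio:
for n in (4, 8, 16):
    print(n, ratio(n))
```

The ratio falls as 6/n, which is why agglomerating into larger subdomains reduces communication relative to computation.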

How many messages total? How much data is sent?

Replicating Computation
Trade off replicated computation for reduced communication. Replication will often reduce execution time as well.

Summation of N Integers
s = sum, b = broadcast. How many steps? (A tree summation followed by a tree broadcast takes about 2 log₂ N steps.)

Using Replication (Butterfly)
(Figure: the butterfly pattern; by replicating partial sums, every task obtains the full sum in log₂ N steps.)
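
A small simulation of a butterfly all-reduce (task count assumed to be a power of two; the values are illustrative) shows how replicated partial sums give every task the total in log₂ n steps:

```python
def butterfly_allreduce(values):
    """Simulate a butterfly all-reduce: in step k, each task i exchanges
    its partial sum with the task whose id differs in bit k. The partial
    sums are replicated, so every task ends with the full sum after
    log2(n) steps instead of 2*log2(n) for sum-then-broadcast."""
    n = len(values)
    assert n & (n - 1) == 0, "assumes a power-of-two number of tasks"
    partial = list(values)
    step = 1
    while step < n:
        # Each task adds the partner's current partial sum.
        partial = [partial[i] + partial[i ^ step] for i in range(n)]
        step *= 2
    return partial  # every entry now equals sum(values)

print(butterfly_allreduce([5, 1, 3, 7]))  # [16, 16, 16, 16]
```

The replication cost is that every task does the summation work, but the separate broadcast phase disappears.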

Using Replication
Butterfly to hypercube: the butterfly communication pattern maps onto a hypercube.

Avoid Communication
Look for tasks that cannot execute concurrently because of communication requirements. Replication can help accomplish two tasks at the same time, such as:
  • summation
  • broadcast

Preserve Flexibility
  • Create more tasks than processors.
  • Overlap communication and computation.
  • Don't incorporate unnecessary limits on the number of tasks.

Agglomeration Checklist
  • Did you reduce communication costs by increasing locality?
  • Do the benefits of replication outweigh its costs?
  • Does replication compromise scalability?
  • Does the number of tasks still scale with problem size?
  • Is there still sufficient concurrency?

Mapping
Specify where each task is to operate. The mapping may need to change depending on the target architecture. Mapping is NP-complete.

Mapping
Goal: reduce execution time.
  • Concurrent tasks → different processors
  • High communication → same processor
Mapping is a game of trade-offs.

Mapping
Many domain-decomposition problems make mapping easy:
  • grids
  • arrays
  • etc.

Mapping
Unstructured or complex domain-decomposition-based algorithms are difficult to map.

Other Mapping Problems
  • Variable amounts of work per task
  • Unstructured communication
  • Heterogeneous processors:
    ◦ different speeds
    ◦ different architectures
Solution: LOAD BALANCING

Load Balancing
  • Static
    ◦ determined a priori
    ◦ based on work, processor speed, etc.
  • Probabilistic
    ◦ random
  • Dynamic
    ◦ restructure the load during execution
  • Task scheduling (functional decomposition)

Static Load Balancing
Based on a priori knowledge. Goal: equal WORK on all processors. Algorithms:
  • basic
  • recursive bisection

Basic
Divide up the work based on:
  • work required
  • processor speed

    r_i = R · p_i / Σ_j p_j

where R is the total work and p_i is the speed of processor i.
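
The allocation formula can be sketched directly (function name and example numbers are illustrative):

```python
def static_shares(total_work, speeds):
    """Assign each processor work proportional to its speed:
    r_i = R * p_i / sum(p_j), the slide's basic formula."""
    total_speed = sum(speeds)
    return [total_work * p / total_speed for p in speeds]

# 100 units of work across two equal processors and one twice as fast:
print(static_shares(100, [1, 1, 2]))  # [25.0, 25.0, 50.0]
```

The shares always sum to the total work, and a processor twice as fast receives twice the share.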

Recursive Bisection
Divide the work in half recursively, based on physical coordinates.
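
A minimal recursive coordinate bisection sketch (2-D points, alternating split axes by convention; the data set is illustrative):

```python
def recursive_bisection(points, depth):
    """Recursively split the points at the median of alternating
    coordinates, yielding 2**depth roughly equal partitions."""
    if depth == 0:
        return [points]
    axis = depth % 2  # alternate x / y axes (an assumed convention)
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (recursive_bisection(pts[:mid], depth - 1)
            + recursive_bisection(pts[mid:], depth - 1))

# 16 points split twice -> 4 partitions of 4 points each.
parts = recursive_bisection([(x, x % 3) for x in range(16)], 2)
print([len(p) for p in parts])  # [4, 4, 4, 4]
```

Because the splits follow physical coordinates, points that are close in space tend to land in the same partition, which keeps communication local.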

Dynamic Algorithms
Adjust the load when an imbalance is detected, either locally or globally.

Task Scheduling
Many tasks with weak locality requirements. Manager-worker model.

Task Scheduling
  • Manager-worker
  • Hierarchical manager-worker
    ◦ uses submanagers
  • Decentralized
    ◦ no central manager
    ◦ task pool on each processor
    ◦ less of a bottleneck
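
A manager-worker scheduler can be sketched with a shared task queue; this thread-based version (squaring each task as stand-in work; all names are illustrative) shows the structure:

```python
import queue
import threading

def manager_worker(tasks, n_workers=4):
    """Manager-worker sketch: the manager fills a shared queue of tasks;
    worker threads pull work until the queue is empty."""
    work = queue.Queue()
    results = queue.Queue()
    for t in tasks:
        work.put(t)

    def worker():
        while True:
            try:
                t = work.get_nowait()
            except queue.Empty:
                return          # no work left: this worker exits
            results.put(t * t)  # squaring stands in for real task work

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)

print(manager_worker(range(5)))  # [0, 1, 4, 9, 16]
```

Workers pull tasks at their own pace, so faster workers naturally take more work; the single queue is the central bottleneck that the hierarchical and decentralized variants address.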

Mapping Checklist
  • Is the load balanced?
  • Are there communication bottlenecks?
  • Is it necessary to adjust the load dynamically?
  • Can you adjust the load if necessary?
  • Have you evaluated the costs?

PCAM Algorithm Design
  • Partition: domain or functional decomposition
  • Communication: link producers and consumers
  • Agglomeration: combine tasks for efficiency
  • Mapping: divide up the tasks for balanced execution

Example: Atmosphere Model
Simulate atmospheric processes:
  • wind
  • clouds, etc.
The model solves a set of partial differential equations describing the fluid behavior.

Representation of Atmosphere

Data Dependencies

Partition & Communication

Agglomeration

Mapping
