Chapter 4: First Steps Toward Parallel Programming

- Slides: 36
Chapter 4: First Steps Toward Parallel Programming
Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley

Toward writing parallel programs
• Build intuition toward parallelism
• When to parallelize
• When overhead is too great
• Consider
  – Data allocation
  – Work allocation
  – Data structure design
  – Algorithms

3 ways to formulate parallel computations
• Unlimited Parallelism
• Fixed Parallelism
• Scalable Parallelism

2 classes of parallel algorithms
• Data parallel
• Task parallel

Data parallel
• Perform the same computation on different data items at the same time
• Parallelism grows as the data grows
• Example
  – P chefs preparing N meals
  – Each chef prepares N/P meals
  – As N increases, P can also increase, limited by constraints

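The chef example can be sketched in Python. This is only an illustration: prepare_meal, prepare_all, and the block-splitting rule are hypothetical names and choices, not from the slides, and CPython threads show the structure of the computation rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def prepare_meal(order):
    # Stand-in for the per-item computation (hypothetical).
    return f"meal for {order}"

def prepare_all(orders, num_chefs):
    """Split the orders into contiguous blocks and give one block to each chef."""
    n = len(orders)
    block = max(1, -(-n // num_chefs))  # ceil(n / num_chefs), at least 1
    blocks = [orders[i:i + block] for i in range(0, n, block)]

    def chef(my_orders):
        # Every chef runs the same code on a different block of the data.
        return [prepare_meal(o) for o in my_orders]

    with ThreadPoolExecutor(max_workers=num_chefs) as pool:
        results = pool.map(chef, blocks)  # results come back in block order
    return [meal for part in results for meal in part]
```

Calling prepare_all(list(range(8)), 4) gives each of the 4 chefs 2 meals; doubling both N and P keeps the per-chef load the same.
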
Task parallel
• Perform distinct computations at the same time
• Number of tasks is typically fixed
• Not scalable
• Example
  – Chef for salad, chef for dessert, chef for appetizer
  – There are dependencies among tasks
  – Utilizes pipelining
• A hybrid of data and task parallelism is often used

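A minimal sketch of two distinct tasks linked by a dependency, in the spirit of the pipelining bullet. The stage names prep and plate and the function run_pipeline are invented for illustration.

```python
import queue
import threading

def run_pipeline(orders):
    """Two different tasks run at once; a queue carries work from prep to plate."""
    handoff = queue.Queue()
    served = []
    done = object()  # sentinel marking the end of the stream

    def prep():
        for order in orders:
            handoff.put(f"prepped {order}")
        handoff.put(done)

    def plate():
        while True:
            item = handoff.get()  # waits until prep has produced something
            if item is done:
                break
            served.append(f"plated {item}")

    workers = [threading.Thread(target=prep), threading.Thread(target=plate)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return served
```

The number of threads here is fixed by the number of stages, not by the number of orders, which is why this style does not scale with the data.
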
Pseudo code – Peril-L
• Minimal, easy to learn
• Universal to any language
• Allows reasoning about performance
• Will extend C

Peril-L
• Threads
  – forall (i in (1..12)) printf("Hello %i\n", i);
• Prints 12 Hellos in random order
• Threads compete and execute in parallel

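A rough Python analogue of forall, assuming one thread per index. The helper name forall and the list greetings are made up for this sketch; Peril-L itself is pseudocode and has no Python binding.

```python
import threading

def forall(indices, body):
    """Start one thread per index, run body(i) in each, and wait for all of them."""
    threads = [threading.Thread(target=body, args=(i,)) for i in indices]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

greetings = []

def hello(i):
    # list.append is thread-safe in CPython, so no lock is needed here.
    greetings.append(f"Hello {i}")

forall(range(1, 13), hello)
print(len(greetings))  # 12 greetings, in whatever order the threads ran
```
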
Peril-L
• exclusive
  – One thread executes the body at a time
    forall (i in (1..12)) {
      exclusive { printf("Hello %i\n", i); }
    }
• barrier
  – Forces all threads to stop at the barrier until all threads arrive, at which point they continue

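The effect of exclusive can be approximated with a lock. In this sketch (hello_exclusive is a hypothetical name), each thread writes a two-part message; holding the lock keeps the two parts adjacent, which would not be guaranteed without it.

```python
import threading

def hello_exclusive(num_threads):
    """Each thread appends "Hello" and then its index as one indivisible step."""
    out = []
    lock = threading.Lock()

    def body(i):
        with lock:  # plays the role of exclusive { ... }
            out.append("Hello")
            out.append(i)

    threads = [threading.Thread(target=body, args=(i,))
               for i in range(1, num_threads + 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```
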
Peril-L
• barrier
  – All threads wait for all to arrive, then continue
    forall (i in (1..12)) {
      printf("tweedle dee\n");
      barrier;
      printf("tweedle dum\n");
    }
• All tweedle dees print before any tweedle dum

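The same behavior can be sketched with Python's threading.Barrier. The function name tweedle and the use of a shared log instead of printing are choices made for this example.

```python
import threading

def tweedle(num_threads):
    """Every thread logs "tweedle dee", waits at the barrier, then logs "tweedle dum"."""
    log = []
    log_lock = threading.Lock()
    barrier = threading.Barrier(num_threads)

    def body():
        with log_lock:
            log.append("tweedle dee")
        barrier.wait()  # nobody continues until all num_threads have arrived
        with log_lock:
            log.append("tweedle dum")

    threads = [threading.Thread(target=body) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```
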
Peril-L memory model
• Global
  – Variables visible to all threads
  – Declared outside a forall
  – Variables underlined
• Local
  – Variables visible only to the local thread
  – Declared inside a forall
  – Variables not underlined

Peril-L
• Multiple reads can be concurrent
• Only one write at a time
  – Allows race conditions; the last write wins

Connecting global and local memory
• Global memory is distributed to local memory
• localize takes global memory and makes it local
    int allData[n];                            // global
    forall (thdID in (0..P-1)) {               // spawn threads
      int size = n/P;                          // size of allocations
      int locData[size] = localize(allData[]); // map globals to this thd's locals
    }

Connecting global and local memory (cont)
• Modifying local data is the same as modifying the global data, but without the λ delay of accessing nonlocal memory

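One way to picture localize is as a zero-copy view of a thread's block of the global array. The sketch below uses array and memoryview from the Python standard library; the signature localize(all_data, thd_id, num_threads) is an assumption for illustration and differs from the Peril-L form, and it assumes n is divisible by P, as size = n/P does.

```python
from array import array

def localize(all_data, thd_id, num_threads):
    """Return a view (not a copy) of thread thd_id's block of all_data."""
    size = len(all_data) // num_threads
    start = thd_id * size
    return memoryview(all_data)[start:start + size]

all_data = array("i", range(8))      # the "global" array, n = 8
loc_data = localize(all_data, 1, 4)  # thread 1 of P = 4 owns global elements 2..3
loc_data[0] = 99                     # a write through local index 0 ...
print(all_data[2])                   # ... shows up at global index 2: prints 99
```
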
Issues of localization of global memory
• The localized portion of a global array uses local indices, which start at 0
• Multiple threads on a processor keep data local to the thread
• There is no local copy; both the local and global names reference the same memory location?

Handy functions
• size = mySize(global, i)
  – Returns the size of the ith dimension of the local portion of the global array
• localToGlobal(locData, i, j)
  – Returns the global index corresponding to the ith index of the jth dimension of the local array locData

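For a one-dimensional array split into contiguous blocks, the two helpers reduce to simple arithmetic. The sketch below is an assumption about one possible allocation (block sizes differing by at most one); the function signatures are hypothetical and take n and P explicitly rather than an array.

```python
def my_size(n, num_threads, thd_id):
    """Number of elements in thread thd_id's block of an n-element array."""
    base, extra = divmod(n, num_threads)
    return base + (1 if thd_id < extra else 0)  # the first `extra` threads get one more

def local_to_global(n, num_threads, thd_id, i):
    """Global index of local index i within thread thd_id's block."""
    base, extra = divmod(n, num_threads)
    start = thd_id * base + min(thd_id, extra)
    return start + i
```

With n = 10 and P = 4 the blocks have sizes 3, 3, 2, 2, so local index 0 of thread 2 is global index 6.
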
Full/Empty variables - synchronization
• Like matter, next slide
• Incurs overhead like global memory, λ
• int t' = 0; // declare empty t and fill it

Table 4.1 Semantics of full/empty variables.

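The table body did not survive in this transcript. The sketch below assumes the usual full/empty semantics: a read needs the variable to be full and leaves it empty, a write needs it to be empty and leaves it full, and the other two cases stall. The class FullEmpty and the function producer_consumer are illustrative names; a one-slot queue.Queue happens to behave this way.

```python
import queue
import threading

class FullEmpty:
    """One-slot variable: reads wait for full, writes wait for empty (assumed semantics)."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)

    def write(self, value):
        self._slot.put(value)    # stalls while the variable is full, then fills it

    def read(self):
        return self._slot.get()  # stalls while the variable is empty, then empties it

def producer_consumer(values):
    """Pass values from one thread to another through a single full/empty variable."""
    t = FullEmpty()
    received = []

    def consumer():
        for _ in values:
            received.append(t.read())

    worker = threading.Thread(target=consumer)
    worker.start()
    for v in values:
        t.write(v)
    worker.join()
    return received
```
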
Reduce/Scan
• Reduce – combines a set of values to produce a single value
  – Written with /
  – +/count // add the elements of count
• Scan – parallel prefix computation; embodies logic that performs a sequential operation in parts and carries along the intermediate results
  – Written with \
  – min\items // scan, i.e., find the smallest item of each prefix of items

Additional example
• least = min/dataArray; // scalar stored in the local variable least of each thread
• reduce/scan combine values across multiple threads

More examples - reduce
• count is local to each thread
    total = +/count;
• The counts are combined into a single result, which is stored in each thread

More examples - scan
• count is local to each thread
• beforeMe = +\count;
• The count variables are accumulated so that the ith thread has its beforeMe variable assigned the sum of the first i count values

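Reduce and scan have direct sequential counterparts in the Python standard library, which can help in checking what the operators compute. The sample values in count are made up, one per thread, and the scan shown is inclusive (thread i, counting from 1, gets the sum of the first i values).

```python
from functools import reduce
from itertools import accumulate
import operator

count = [2, 0, 5, 1]                        # one local count per thread (made-up values)

total = reduce(operator.add, count)         # like total = +/count      -> 8
least = reduce(min, count)                  # like least = min/count    -> 0
before_me = list(accumulate(count))         # like beforeMe = +\count   -> [2, 2, 7, 8]
running_min = list(accumulate(count, min))  # like min\count            -> [2, 0, 0, 0]
```
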
Implied Reduce - Scan synchronization
• Consider largest = max/localTotal;
• All threads must arrive at this statement to perform the reduction
• Threads proceed only after the assignment

Programming consideration
    exclusive { total += priv_count; } // done serially
Versus
    total = +/priv_count;              // done with a tree structure
• Converts from O(P) to O(lg P)

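The O(lg P) figure comes from combining values pairwise in rounds. The sketch below runs the rounds sequentially and only counts them; tree_reduce is a hypothetical helper, and in a parallel setting all the pairs within a round could be combined at the same time.

```python
import operator

def tree_reduce(values, combine):
    """Combine a non-empty sequence pairwise in rounds; return (result, rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Adjacent pairs are independent of each other within one round.
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired last value moves up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds

print(tree_reduce(range(1, 9), operator.add))  # (36, 3): 8 values need 3 rounds
```
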
Figure 4.1 The Count 3s computation (Try 3) written in the Peril-L notation.

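The figure itself is not reproduced in this transcript. As a rough stand-in, the sketch below assumes the approach implied by the surrounding slides: each thread counts the 3s in its own block into a private counter, then adds that counter to the shared total inside an exclusive section. The function count3s and its block-splitting rule are assumptions, not the book's code.

```python
import threading

def count3s(array, t):
    """Count the 3s in array using t threads with private counters."""
    total = 0
    lock = threading.Lock()
    n = len(array)

    def body(j):
        nonlocal total
        # Block owned by thread j; block sizes differ by at most one.
        start, stop = j * n // t, (j + 1) * n // t
        priv_count = 0
        for value in array[start:stop]:
            if value == 3:
                priv_count += 1
        with lock:  # one shared update per thread, like exclusive { ... }
            total += priv_count

    threads = [threading.Thread(target=body, args=(j,)) for j in range(t)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return total
```
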
Formulating Parallelism
• Fixed Parallelism
  – Write code designed for a particular machine
  – Improving the machine may not increase parallelism
• Unlimited Parallelism
  – Use forall (i in (0..n-1))
  – Will use available resources
  – Will require substantial thread communication

Figure 4.2 Fixed Parallelism solution to Count 3s (t=4).

Formulating Parallelism (cont)
• Scalable
  – As follows:
    • Determine how components (data structures, workload, etc.) grow as n increases
    • Formulate a set S of substantial subproblems, where natural units of the solution are assigned to each subproblem
    • Solve each subproblem independently
  – Utilizes locality

Figure 4.3 Scalable Parallelism solution to Count 3s. Notice that the array segment has been localized.

Table 4.2 Helper functions.

Figure 4.4 Odd/Even Interchange to alphabetize a list L of records on field x.

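The figure is not reproduced here. The following is a sequential sketch of the odd/even interchange idea, assuming the standard formulation: alternate between comparing pairs that start at even positions and pairs that start at odd positions. It runs a fixed n phases instead of testing for completion, which may differ from the figure; odd_even_sort is a hypothetical name.

```python
def odd_even_sort(records, key):
    """Sort records by key using alternating even and odd compare-exchange phases."""
    items = list(records)
    n = len(items)
    for phase in range(n):  # n phases are enough to sort n items
        first = phase % 2   # even phases start at 0, odd phases at 1
        # The pairs in one phase do not overlap, so they could run in parallel.
        for i in range(first, n - 1, 2):
            if key(items[i]) > key(items[i + 1]):
                items[i], items[i + 1] = items[i + 1], items[i]
    return items

L = [{"x": "delta"}, {"x": "alpha"}, {"x": "charlie"}, {"x": "bravo"}]
print([r["x"] for r in odd_even_sort(L, key=lambda r: r["x"])])
```
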
Figure 4.5 Fixed 26-way parallel solution to alphabetizing. The function letRank(x) returns the 0-origin rank of the Latin letter x.

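A sequential sketch of the fixed 26-way idea, assuming each of 26 workers takes the records whose key starts with one letter and sorts them. Only let_rank mirrors a name from the caption; alphabetize_26way is hypothetical, and keys are assumed to be non-empty strings starting with an ASCII letter.

```python
def let_rank(x):
    """0-origin rank of a Latin letter: 'a' -> 0, ..., 'z' -> 25."""
    return ord(x.lower()) - ord("a")

def alphabetize_26way(records, key):
    """Bucket records by first letter, sort each bucket, and concatenate."""
    buckets = [[] for _ in range(26)]
    for rec in records:
        buckets[let_rank(key(rec)[0])].append(rec)
    # Each of the 26 buckets is independent and could be sorted by its own thread.
    return [rec for bucket in buckets for rec in sorted(bucket, key=key)]
```

The parallelism is capped at 26 no matter how large the list or the machine is, which is the limitation of the fixed formulation.
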
Figure 4.6

Table 4.3 Merge operations.

Figure 4.7 Peril-L program using Batcher's sort to alphabetize records in L.

Figure 4.7 Peril-L program using Batcher's sort to alphabetize records in L. (cont.)
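
The Peril-L program is not reproduced in this transcript. As a reference point, here is a sequential sketch of Batcher's bitonic sort, assuming that is the variant the figure uses. It requires the input length to be a power of two, and the function names are chosen for this example.

```python
def bitonic_merge(seq, ascending):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(seq)
    if n <= 1:
        return list(seq)
    half = n // 2
    seq = list(seq)
    # These compare-exchanges touch disjoint pairs, so they could run in parallel.
    for i in range(half):
        if (seq[i] > seq[i + half]) == ascending:
            seq[i], seq[i + half] = seq[i + half], seq[i]
    return bitonic_merge(seq[:half], ascending) + bitonic_merge(seq[half:], ascending)

def bitonic_sort(items, ascending=True):
    """Batcher's bitonic sort; len(items) must be a power of two."""
    n = len(items)
    if n <= 1:
        return list(items)
    half = n // 2
    first = bitonic_sort(items[:half], True)    # ascending half
    second = bitonic_sort(items[half:], False)  # descending half: together bitonic
    return bitonic_merge(first + second, ascending)
```

The compare-exchange pattern depends only on the length of the input, not on the data, which is what makes the algorithm attractive for a fixed schedule of parallel steps.
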