# Parallel Programming Sorting David Monismith CS 599 Notes

• Slides: 20

Parallel Programming - Sorting David Monismith CS 599 Notes are primarily based upon Introduction to Parallel Programming, Second Edition by Grama, Gupta, Karypis, and Kumar

Sorting • Useful because sorted data is often easier to manipulate than unsorted data. • Arranges elements into ascending or descending order. • Internal or external – Can be performed in main memory or must use secondary memory • Comparison or non-comparison based – Compare and Swap sorting methods such as Merge. Sort have a time complexity of Θ(n lg n) in the best case. Some such algorithms like Insertion Sort have O(n 2) complexity. – Non-comparison methods such as radix sort, which manipulate a known property of the data have a time complexity of Θ(n) in the best case. • Our focus will be on Comparison-based algorithms.

Issues in Parallelizing Sorting • Distributing elements between processes. • Assume input and data are distributed between processes. • It may be necessary to enforce global ordering to ensure ordered results. – Example: We may need to require all elements in process Pj are less than or equal to those in process Pi if j < i. • Note that it is easy to compare and swap elements within the same process, however, communication is required if the elements are not within the same process.

One Element Per Process • What if each process only contains one element from the set to be sorted? • Given a process Pj with element aj and a process Pi with element ai , if j < i, and aj > ai, then Pj should contain ai and Pi should contain aj after a compare and swap. • Note that this method requires that both elements be sent to both processes for comparison to occur. • This means if the communication time takes more than the comparison time, then this method is inefficient. • This is true for most computers today. • So, in general, using one element per process is inefficient.

Multiple Elements Per Process • Want to sort large amount of data with few processes, where few is much less than large. • Assign n/p elements to each process – n is the total number of elements – p is the total number of processes • If we assume the elements stored within each process are already sorted, we can have each process to merge its data with its neighbor’s data. • In an ascending sort, the process with the smaller rank should keep the smaller elements, and the process with the large rank should keep the larger elements. • Given a bidirectional communication link, this process will take Θ(n/p) time and Θ(p) steps as shown previously with the Odd-Even sort.

Merging Elements void merge. Arrays(int * arr, int * neighbor. Arr, int * temp, int size, int low_or_high) { int i, j, k; i = j = k = 0; //Assume arrays are already sorted for(i = 0, j = 0, k = 0; i < size*2; i++) { if(j < size && k < size) { if(arr[j] < neighbor. Arr[k]){ temp[i] = arr[j]; j++; } else { temp[i] = neighbor. Arr[k]; k++; } } else if(j < size) { temp[i] = arr[j]; j++; } else { temp[i] = neighbor. Arr[k]; k++; } } if(low_or_high % 2 != 0) for(i = 0; i < size; i++) arr[i] = temp[i]; else for(i = size, j=0; j < size; i++, j++) arr[j]= temp[i]; }

Sorting Networks • Sorting Networks are designed to sort n elements in less than Θ(n lg n) assuming many comparisons will be performed in parallel. • Such networks require a Comparator – Increasing or decreasing • Depth – number of columns in the network

Bitonic Sort • Sorts n elements in Θ(lg 2 n) time. • Rearranges a bitonic sequence into a sorted sequence. • Bitonic sequence of elements has the property that there is an index i to which the sequence is monotonically increasing, and after which the sequence is monotonically decreasing. • Or a cyclic index shift of the sequence can produce such a sequence. • Examples: – 1, 2, 3, 4, 0, -1 – 9, 7, 2, -1, 3, 4 -> -1, 3, 4, 9, 7, 2

Bitonic Sort • Bitonic Split – splitting a length n bitonic sequence into two sequences defined by: – s 1 = min(a 0, an/2), min(a 1, an/2+1), . . . , min(an/2 -1, an-1) – s 2 = max(a 0, an/2), max(a 1, an/2+1), . . . , max(an/2 -1, an-1) • It is possible to recursively obtain shorter bitonic sequences until reaching a sequence of size 1, at which point the entire sequence is monotonically increasing. • This operation requires log n splits, and is referred to as a Bitonic Merge • An example of a Bitonic Merging Network is provided in one of the following slides.

Bitonic Sort • An unsorted sequence of elements can be viewed as a sequence of concatenated pairs of bitonic sequences in increasing and decreasing order. • By iteratively ordering and merging these sequences, it is possible to create a bitonic sequence. • Then the algorithm from the previous slide can be applied. • Notice that the sorting algorithm requires O(n lg 2 n) operations • A picture of the network is provided on the next slide.

Bitonic Sort A bitonic sorting network. The left half of the network generates a bitonic sequence Arrows indicate where the larger element should be stored. Notice that the second half of the network is the Bitonic Merge. • Source: http: //en. wikipedia. org/wiki/Bitonic_sorter • •

Example • Using the left half of the network sort the following bitonic sequence. • 3, 8, 12, 34, 41, 47, 59, 82, 79, 64, 12, 11, 10, 8, 4, -3 • After sorting this sequence, we will look at Open. MP parallelized code for the bitonic sort. • It is non-trivial to map bitonic sort to a ring, mesh, and hypercube. • Note that bitonic sort, while a better option than a O(n 2) sort is not the best sorting option in general.

Recall Odd-Even Sort • Previously we looked at the odd-even sort, which is a parallelized version of bubble sort • Because odd-even sort is a variant of bubble sort, it is impossible to improve the number of operations required past O(n 2) • Even when using a O(n/p lg n/p) local sort with MPI, odd-even sort scales poorly. • It is cost optimal when p = O(lg n)

Quick Sort • Divide and conquer algorithm. • Recursively divides a sequence into smaller subsequences. • Makes use of a pivot • The following algorithm is based on en. wikipedia. org/wiki/Quicksort

Quick Sort Algorithm void quicksort(int * arr, int lo, int hi) { if(lo < hi) { int p = partition(arr, lo, hi); quicksort(arr, lo, p - 1); quicksort(arr, p + 1, hi); } }

Quick Sort Algorithm int partition(int * arr, int lo, int hi) { int i, int store. Idx = lo; int pivot. Idx = choose. Pivot(arr, lo, hi); int pivot. Val = arr[pivot. Idx]; swap(&arr[pivot. Idx], &arr[hi]); for(i = lo; i < hi; i++) { if(arr[i] < pivot. Val) { swap(&arr[i], &arr[store. Idx]); store. Idx++; } } swap(&arr[store. Idx], &arr[hi]); return store. Idx; }

Quicksort Median of Three int choose. Pivot(int * arr, int lo, int hi) { int mid = (lo+hi)/2; int temp; if(arr[lo] < arr[hi]) { temp = lo; lo = hi; hi = temp; } if(arr[mid] < arr[lo]) { temp = mid; mid = lo; lo = temp; } if(arr[hi] < arr[mid]) { temp = mid; mid = hi; hi = temp; } return mid; }

Parallelization • Select a pivot • Broadcast the pivot to all processes • After receiving the pivot, each process rearranges its data into two local blocks – Less than the pivot - Li – Greater than the pivot - Gi • Next, global rearrangement is performed so that elements smaller than the pivot are at the beginning of the array and those larger than the pivot are at the end. • Then, the algorithm partitions the processes into two groups – those responsible for sorting the larger and smaller elements. • Recursion ends when a particular block of elements is assigned to only one process. • At that point, the elements can be sorted with a serial quicksort.

Partitioning • Partitioning processes for quick sort is done by the relative sizes of the smaller (S) and larger (L) blocks. • The first ceil( |S|p/n + 0. 5 ) processes are used to sort the smaller elements (S) and the remaining processes are assigned to sort the larger elements. • A prefix sum is used to determine the index location where the elements of each local array should be placed in the local array.

Bucket and Sample Sort • These sorting algorithms are based upon prior knowledge of the type of data in the sequence to be sorted. • Data is placed in buckets representative of different ranges in the sequence. • A prefix sum is useful in determining the indices of the elements to be placed in each bucket. • Sampling is useful in determining the range of values to be placed in each bucket.