External-Memory Sorting

  • Slides: 15

External-Memory Sorting
  • External-memory algorithms
      • When data do not fit in main memory
  • External-memory sorting
      • Rough idea: sort pieces that fit in main memory and “merge” them
  • Main-memory merge sort:
      • The main part of the algorithm is Merge
      • Let’s merge:
          • 3, 6, 7, 11, 13
          • 1, 5, 8, 9, 10
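
The merge step on the two example lists can be sketched as follows. This is a minimal illustration; the function name `merge` and the use of Python lists are assumptions made for the example, not something the slides prescribe.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller of the two front elements.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One input is exhausted; append whatever remains of the other.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge([3, 6, 7, 11, 13], [1, 5, 8, 9, 10]))
# [1, 3, 5, 6, 7, 8, 9, 10, 11, 13]
```

Each element is looked at once, so merging two lists with a total of N elements takes O(N) time.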

Main-Memory Merge Sort
  Merge-Sort(A)
    if length(A) > 1 then
      Copy the first half of A into array A1      (Divide)
      Copy the second half of A into array A2     (Divide)
      Merge-Sort(A1)                              (Conquer)
      Merge-Sort(A2)                              (Conquer)
      Merge(A, A1, A2)                            (Combine)
  • Running time?
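
A runnable version of the pseudocode might look like the sketch below. The names `merge_sort` and `merge_into` are chosen for this example; `merge_into` plays the role of Merge(A, A1, A2) and writes the result back into A. The usual answer to the running-time question is O(N log N): there are about log_2 N levels of recursion and each level does O(N) work in merging.

```python
def merge_into(a, a1, a2):
    """Merge sorted lists a1 and a2 back into a (len(a) == len(a1) + len(a2))."""
    i = j = k = 0
    while i < len(a1) and j < len(a2):
        if a1[i] <= a2[j]:
            a[k] = a1[i]
            i += 1
        else:
            a[k] = a2[j]
            j += 1
        k += 1
    # Copy the leftover tail of whichever half is not yet exhausted.
    a[k:] = a1[i:] + a2[j:]


def merge_sort(a):
    """Sort list a in place, following the Merge-Sort(A) pseudocode."""
    if len(a) > 1:
        mid = len(a) // 2
        a1 = a[:mid]           # divide: copy the first half
        a2 = a[mid:]           # divide: copy the second half
        merge_sort(a1)         # conquer
        merge_sort(a2)         # conquer
        merge_into(a, a1, a2)  # combine


data = [10, 2, 5, 1, 13, 19, 9, 7]
merge_sort(data)
print(data)  # [1, 2, 5, 7, 9, 10, 13, 19]
```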

Merge-Sort Recursion Tree
  [Figure: recursion tree of height log_2 N; the runs at each level, root first]
    1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19
    [1 2 5 7 9 10 13 19] [3 4 6 8 11 12 15 17]
    [1 2 5 10] [7 9 13 19] [3 4 8 15] [6 11 12 17]
    [2 10] [1 5] [13 19] [7 9] [4 15] [3 8] [12 17] [6 11]
    10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11
  • In each level: merge runs (sorted sequences) of size x into runs of size 2x, decreasing the number of runs twofold.
  • What would it mean to run this on a file in external memory?

External-Memory Merge-Sort
  • Idea: increase the size of initial runs!
  • Initial runs: the size of available main memory (M data elements)
  [Figure: the same example, bottom to top]
    Input chunks:              [10 2 5 1] [13 19 9 7] [15 4 8 3] [12 17 6 11]
    Main-memory sort:          [1 2 5 10] [7 9 13 19] [3 4 8 15] [6 11 12 17]
    External two-way merges:   [1 2 5 7 9 10 13 19] [3 4 6 8 11 12 15 17]
    External two-way merge:    1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19

External-Memory Merge Sort
  • Input file X, empty file Y
  • Phase 1: Repeat until end of file X:
      • Read the next M elements from X
      • Sort them in main memory
      • Write them at the end of file Y
  • Phase 2: Repeat while there is more than one run in Y:
      • Empty X
      • MergeAllRuns(Y, X)
      • X is now called Y, Y is now called X
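
The two phases can be simulated in memory to see how the runs evolve. In the sketch below the "files" are ordinary Python lists (file Y is modeled as a list of runs), which is a simplifying assumption: no real disk I/O or page buffering happens. The function name `external_merge_sort` and the returned pass count are also choices made for this example, and the standard-library `heapq.merge` stands in for the two-way merge.

```python
from heapq import merge as merge_sorted


def external_merge_sort(x, M):
    """Simulate external merge sort; return (sorted list, number of Phase 2 iterations)."""
    # Phase 1: read M elements at a time, sort them, append the run to Y.
    y = [sorted(x[i:i + M]) for i in range(0, len(x), M)]

    # Phase 2: keep merging pairs of runs until one run is left.
    iterations = 0
    while len(y) > 1:
        new_x = []  # "Empty X"
        for i in range(0, len(y), 2):
            pair = y[i:i + 2]  # the next two runs (or one leftover run)
            new_x.append(list(merge_sorted(*pair)))
        y = new_x  # "X is now called Y, Y is now called X"
        iterations += 1
    return (y[0] if y else []), iterations


data = [10, 2, 5, 1, 13, 19, 9, 7, 15, 4, 8, 3, 12, 17, 6, 11]
result, iterations = external_merge_sort(data, M=4)
print(iterations)  # 2
```

With 16 elements and M = 4, Phase 1 produces four runs and Phase 2 needs two iterations (4 runs, then 2, then 1).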

External-Memory Merging
  • MergeAllRuns(Y, X): repeat until the end of Y:
      • Call TwowayMerge to merge the next two runs from Y into one run, which is written at the end of X
  • TwowayMerge: uses three main-memory arrays of size B
  [Figure: input buffers Bf1 and Bf2 hold the current pages of Run 1 and Run 2 in file Y, with positions p1 and p2. min(Bf1[p1], Bf2[p2]) is moved to position po of the output buffer Bfo. Read the next page when p1 = B (p2 = B); write Bfo to the merged run in file X when Bfo is full.]
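
The buffer mechanics of TwowayMerge can be sketched as below. This is an in-memory simulation under stated assumptions: a run is modeled as a list of non-empty pages, each page a sorted list of at most B elements, and "reading" or "writing" a page just increments a counter. The function name `twoway_merge` and the returned counters are hypothetical.

```python
def twoway_merge(run1, run2, B):
    """Merge two runs given as lists of pages; return (output pages, page reads, page writes)."""
    inputs = [iter(run1), iter(run2)]
    bufs = [next(it, None) for it in inputs]    # Bf1, Bf2: current input pages
    reads = sum(buf is not None for buf in bufs)
    pos = [0, 0]                                # p1, p2
    out_buf, out_pages, writes = [], [], 0      # Bfo and the merged run

    while bufs[0] is not None or bufs[1] is not None:
        # Pick the buffer whose current element is smaller.
        if bufs[1] is None or (bufs[0] is not None
                               and bufs[0][pos[0]] <= bufs[1][pos[1]]):
            k = 0
        else:
            k = 1
        out_buf.append(bufs[k][pos[k]])
        pos[k] += 1
        if pos[k] == len(bufs[k]):              # page consumed: read the next one
            bufs[k] = next(inputs[k], None)
            pos[k] = 0
            if bufs[k] is not None:
                reads += 1
        if len(out_buf) == B:                   # output buffer full: write it
            out_pages.append(out_buf)
            out_buf = []
            writes += 1
    if out_buf:                                 # flush a partially filled last page
        out_pages.append(out_buf)
        writes += 1
    return out_pages, reads, writes


pages, reads, writes = twoway_merge([[1, 2], [5, 10]], [[7, 9], [13, 19]], B=2)
print(pages, reads, writes)
# [[1, 2], [5, 7], [9, 10], [13, 19]] 4 4
```

Every input page is read once and every output page is written once, which is why one merging pass over a file of n pages costs about 2n I/Os.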

Analysis: Assumptions
  • Assumptions and notation:
      • Disk page size: B data elements
      • Data file size: N elements, n = N/B disk pages
      • Available main memory: M elements, m = M/B pages

Analysis
  [Figure: example with N = 8M. Phase 1 produces 8 runs of size M; Phase 2 merges them into 4 runs of size 2M, then 2 runs of size 4M, then 1 run of size 8M = N.]
  • Phase 1:
      • Read file X, write file Y: 2n = O(n) I/Os
  • Phase 2:
      • One iteration: read file Y, write file X: 2n = O(n) I/Os
      • Number of iterations: log_2(N/M) = log_2(n/m)
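
The I/O count can be worked out for concrete numbers. The helper below is a small sketch, with the hypothetical name `twoway_sort_ios`, that counts Phase 2 iterations by repeatedly halving the number of runs and charges 2n I/Os per pass, as in the analysis above. It uses integer arithmetic and rounds up when the number of runs is not a power of two.

```python
def twoway_sort_ios(n, m):
    """Return (Phase 2 iterations, total page I/Os) for two-way external merge sort.

    n: file size in pages, m: main-memory size in pages.
    """
    runs = -(-n // m)            # ceil(n / m) initial runs from Phase 1
    iterations = 0
    while runs > 1:
        runs = (runs + 1) // 2   # each iteration halves the number of runs
        iterations += 1
    total_ios = 2 * n * (1 + iterations)  # 2n for Phase 1 plus 2n per iteration
    return iterations, total_ios


# The slide's example, N = 8M, with an assumed m = 100 pages (so n = 800 pages).
print(twoway_sort_ios(800, 100))  # (3, 6400)
```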

Analysis: Conclusions
  • Total running time of external-memory merge sort: O(n log_2(n/m)) I/Os
  • We can do better!
  • Observation:
      • Phase 1 uses all available memory
      • Phase 2 uses just 3 pages out of the m available!

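Two-Phase, Multiway Merge Sort
  • Idea: merge all runs at once!
      • Phase 1: the same (do internal sorts)
      • Phase 2: perform MultiwayMerge(Y, X)
  [Figure: example with N = 8M. Phase 1 produces runs of size M; Phase 2 merges all of them in a single pass into one run of size 8M = N.]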

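Multiway Merging
  [Figure: input buffers Bf1, Bf2, …, Bfk hold the current pages of Run 1, Run 2, …, Run k = n/m in file Y, with positions p1, p2, …, pk. min(Bf1[p1], Bf2[p2], …, Bfk[pk]) is moved to position po of the output buffer Bfo. Read the next page of run i when pi = B; write Bfo to the merged run in file X when Bfo is full.]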
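
Selecting the minimum of the k current elements is the core of the multiway merge. The sketch below uses a binary heap for that selection, which is a common choice but an assumption here, since the slides only say to take the minimum. Page buffering is left out: each run is modeled as a plain sorted Python list, and the name `multiway_merge` is hypothetical.

```python
import heapq

_END = object()  # sentinel marking an exhausted run


def multiway_merge(runs):
    """Merge k sorted runs into one sorted list using a min-heap."""
    iters = [iter(run) for run in runs]
    heap = []
    # Load the first element of every run, tagged with the run index.
    for k, it in enumerate(iters):
        first = next(it, _END)
        if first is not _END:
            heap.append((first, k))
    heapq.heapify(heap)

    merged = []
    while heap:
        value, k = heapq.heappop(heap)  # smallest current element over all runs
        merged.append(value)
        nxt = next(iters[k], _END)      # advance run k
        if nxt is not _END:
            heapq.heappush(heap, (nxt, k))
    return merged


runs = [[1, 2, 5, 10], [7, 9, 13, 19], [3, 4, 8, 15], [6, 11, 12, 17]]
print(multiway_merge(runs))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 17, 19]
```

With a heap, each output element costs O(log k) comparisons instead of the O(k) of a linear scan over the k buffers; the number of page I/Os is the same either way.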

Analysis of TPMMS
  • Phase 1: O(n), Phase 2: O(n)
  • Total: O(n) I/Os!
  • The catch: only files of “limited” size can be sorted
      • Phase 2 can merge a maximum of m-1 runs (one of the m pages is needed for the output buffer)
      • Which means: N/M ≤ m-1
  • How large a file can we sort with TPMMS on a machine with 128 MB of main memory and a disk page size of 16 KB?
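
The question can be answered with a short calculation. The sketch below assumes binary units (128 MB = 128 × 2^20 bytes, 16 KB = 16 × 2^10 bytes) and takes the limit to be at most m-1 runs of M elements each; the function name `tpmms_max_file_bytes` is hypothetical. Working in bytes makes the answer independent of the element size.

```python
def tpmms_max_file_bytes(memory_bytes, page_bytes):
    """Largest file (in bytes) that two-phase multiway merge sort can handle.

    Phase 1 creates runs as large as main memory; Phase 2 can merge at most
    m - 1 of them, because one of the m pages is the output buffer.
    """
    m = memory_bytes // page_bytes    # memory size in pages
    return (m - 1) * memory_bytes     # (m - 1) runs of M elements each


memory = 128 * 2**20   # 128 MB, assuming binary units
page = 16 * 2**10      # 16 KB, assuming binary units

print(memory // page)                                # 8192 pages
print(tpmms_max_file_bytes(memory, page) / 2**30)    # 1023.875 (GB)
```

So m = 8192, up to 8191 runs can be merged, and the limit is 8191 × 128 MB, just under 1 TB.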

General Multiway Merge Sort
  • What if a file is very large or memory is small?
  • General multiway merge sort:
      • Phase 1: the same (do internal sorts)
      • Phase 2: do as many iterations of merging as necessary until only one run remains
  Each iteration repeatedly calls MultiwayMerge(Y, X) to merge groups of m-1 runs until the end of file Y is reached.

Analysis
  [Figure: example with N = (m-1)^3 M. Phase 1 produces runs of size M; each Phase 2 iteration merges groups of m-1 runs, giving runs of size (m-1)M, then (m-1)^2 M, then a single run of size (m-1)^3 M = N.]
  • Phase 1: O(n), each iteration of Phase 2: O(n)
  • How many iterations are there in Phase 2?
      • Number of iterations: log_{m-1}(N/M) = O(log_m n)
  • Total running time: O(n log_m n) I/Os
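
The iteration count can be checked numerically. The helper below is a sketch with the hypothetical name `multiway_sort_ios`; it divides the number of runs by the fan-in m-1 on each iteration (rounding up) and charges 2n I/Os for Phase 1 and for every Phase 2 iteration. It assumes m ≥ 3 so that at least two runs can be merged at a time.

```python
def multiway_sort_ios(n, m):
    """Return (Phase 2 iterations, total page I/Os) for general multiway merge sort.

    n: file size in pages, m: main-memory size in pages (m >= 3).
    """
    if m < 3:
        raise ValueError("need at least 3 pages: two input buffers and one output buffer")
    fan_in = m - 1                       # runs merged at once
    runs = -(-n // m)                    # ceil(n / m) initial runs
    iterations = 0
    while runs > 1:
        runs = -(-runs // fan_in)        # ceil(runs / fan_in)
        iterations += 1
    return iterations, 2 * n * (1 + iterations)


# N = (m-1)^3 M with an assumed m = 5: 64 initial runs, fan-in 4.
print(multiway_sort_ios(n=5 * 64, m=5))         # (3, 2560)

# 8191 initial runs with m = 8192: a single merging pass suffices.
print(multiway_sort_ios(n=8191 * 8192, m=8192)[0])  # 1
```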

Conclusions
  • External sorting can be done in O(n log_m n) I/O operations for any n
      • This is asymptotically optimal
  • In practice, we can usually sort in O(n) I/Os
      • Use two-phase, multiway merge-sort