External Sorting Putting files in order Creative Commons License – Curt Hill


Unbalanced Merging
• The Merge Sort algorithm may be adapted to file sorting
• Given three files:
  – One input file: filea
  – Two temp files: fileb, filec
• Assume filea consists of one-element runs
• Distribute the runs onto fileb and filec
• Merge them back onto filea
• Now filea has two-element runs
• Continue this until done
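The distribute-and-merge loop above can be sketched in Python. This is a minimal illustration, not the deck's own code: Python lists stand in for the three files, and the function names `distribute`, `merge_runs`, and `unbalanced_merge_sort` are assumptions made for this example.

```python
def distribute(filea, run_len):
    """Deal consecutive runs of run_len records alternately onto fileb and filec."""
    fileb, filec = [], []
    for k, start in enumerate(range(0, len(filea), run_len)):
        target = fileb if k % 2 == 0 else filec
        target.extend(filea[start:start + run_len])
    return fileb, filec


def merge_runs(fileb, filec, run_len):
    """Merge the k-th run of fileb with the k-th run of filec, for every k."""
    out = []
    for start in range(0, len(fileb), run_len):
        x = fileb[start:start + run_len]
        y = filec[start:start + run_len]   # may be short or empty at the end
        i = j = 0
        while i < len(x) and j < len(y):
            if x[i] <= y[j]:
                out.append(x[i])
                i += 1
            else:
                out.append(y[j])
                j += 1
        out.extend(x[i:])
        out.extend(y[j:])
    return out


def unbalanced_merge_sort(filea):
    """Repeat distribute + merge, doubling the run length each pass."""
    run_len = 1
    while run_len < len(filea):
        fileb, filec = distribute(filea, run_len)
        filea = merge_runs(fileb, filec, run_len)
        run_len *= 2
    return filea
```

With the seven-record file used in this deck, `distribute([8, 1, 3, 14, 9, 2, 6], 1)` gives the two temp files 8 3 9 6 and 1 14 2, and one merge pass turns them back into 1 8 3 14 2 9 6.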

Distribute Phase
Run length 1
Filea: 8 1 3 14 9 2 6
Fileb: 8 3 9 6
Filec: 1 14 2

Merge Phase
Run length 1
Fileb: 8 3 9 6
Filec: 1 14 2
Filea starting: 1 8
Run length 2
Filea ending: 1 8 3 14 2 9 6

Unbalanced Merging Again
• This requires 2N log₂ N reads and writes
• It ignores the fact that files are read a block at a time rather than a record at a time
• It ignores the fact that internal sorting is much faster than external sorting
• Thus, every merge scheme from here on should first sort a page internally, so that runs are always longer than one record
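To see how much the internal page sort helps, the cost estimate can be turned into a small calculation. This sketch assumes each pass (one distribute plus one merge) reads every record twice and writes it twice, and that the number of passes is the ceiling of log₂ of the initial run count; the function name and the `initial_run` parameter are hypothetical.

```python
def unbalanced_transfers(n_records, initial_run=1):
    """Record reads (and equally many writes) for the unbalanced merge."""
    runs = -(-n_records // initial_run)      # ceiling division: initial run count
    passes = max(runs - 1, 0).bit_length()   # ceil(log2(runs)) for runs >= 1
    return 2 * n_records * passes


one_record_runs = unbalanced_transfers(1024)        # 10 passes
page_sorted_runs = unbalanced_transfers(1024, 64)   # 16 initial runs, 4 passes
```

Under these assumptions, starting from 64-record sorted pages cuts the passes for 1024 records from 10 to 4.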

Observations
• External sorting is all about
  – Internally sorting as much as will fit in memory
  – Merging the sorted chunks until there are no pieces left
• The trick in internal sorting is making the sorts smarter
• The trick in external sorting is making the merges smarter
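The two steps named above (sort what fits, then merge the pieces) can be sketched as follows. This is only an in-memory illustration: a list stands in for the file, `memory_size` is an assumed parameter giving how many records fit in memory, and the standard-library `heapq.merge` does the merging.

```python
import heapq


def external_sort(records, memory_size):
    """Sort chunks that fit in memory, then merge the sorted chunks."""
    # Step 1: internally sort as much as will fit in memory, chunk by chunk.
    runs = [sorted(records[i:i + memory_size])
            for i in range(0, len(records), memory_size)]
    # Step 2: merge the sorted chunks until a single piece is left.
    return list(heapq.merge(*runs))
```

A real external sort would write each sorted chunk to a temporary file and stream the merge from disk; the structure of the two steps stays the same.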

Unbalanced
• The distribute step takes one file and copies it onto two others
• It does nothing to organize the file in the process
• One extra file will allow us to do useful work in this step
• This is called Balanced merging

Balanced Merging
• Distribute the initial records on filea and fileb
  – Usually after an internal sort is performed
• Merge the first pair of runs onto filec, the second pair onto filed, etc.
• Oscillate back and forth between (filea, fileb) and (filec, filed)
• This now takes N log₂ N reads and writes
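A sketch of the balanced scheme, with each file modeled as a Python list of runs. The names `deal_runs`, `balanced_pass`, and `balanced_merge_sort` are made up for this example; the point is that every pass both merges and redistributes, so no separate copy step is needed.

```python
import heapq


def deal_runs(records, run_len=1):
    """Deal sorted runs of run_len records alternately onto two files."""
    files = ([], [])
    for k, start in enumerate(range(0, len(records), run_len)):
        files[k % 2].append(sorted(records[start:start + run_len]))
    return files


def balanced_pass(src1, src2):
    """Merge the k-th runs of both sources, alternating between two targets."""
    targets = ([], [])
    for k in range(max(len(src1), len(src2))):
        x = src1[k] if k < len(src1) else []
        y = src2[k] if k < len(src2) else []
        targets[k % 2].append(list(heapq.merge(x, y)))
    return targets


def balanced_merge_sort(records, run_len=1):
    """Oscillate between the two file pairs until one run remains."""
    a, b = deal_runs(records, run_len)
    while len(a) + len(b) > 1:
        a, b = balanced_pass(a, b)   # the target pair becomes the next source pair
    return a[0] if a else []
```

Starting from filea = 8 3 9 6 and fileb = 1 14 2, the first `balanced_pass` produces filec = 1 8 2 9 and filed = 3 14 6.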

Balanced Merge
Run length 1
Filea: 8 3 9 6
Fileb: 1 14 2
Run length 2
Filec: 1 8 2 9
Filed: 3 14 6

How long is a run?
• In Merge sort a run starts at length 1
• It doubles in each pass
• The Natural thing is to compare adjacent items and decide the run is over when the key drops
• This trick allows an already sorted file to be handled as a single run
• Only in pathological cases does it not help
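Detecting natural runs takes one comparison per record. In this sketch a run ends whenever the next key is smaller than the previous one; the function name `natural_runs` is an assumption.

```python
def natural_runs(records):
    """Split records into maximal non-decreasing runs."""
    runs = []
    for key in records:
        if runs and key >= runs[-1][-1]:
            runs[-1].append(key)    # key did not drop: the run continues
        else:
            runs.append([key])      # key dropped (or first record): new run
    return runs
```

A sorted file comes back as one run, while a strictly descending file, the pathological case, yields only one-record runs.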

Natural Merge
Run length 1
Filea: 8 3 9 6
Fileb: 1 14 2
Run length 2+
Filec: 1 8 14 6
Filed: 2 3 9

Commentary
• Merging is different from sorting
• Comparisons and moves in memory are trivial
• Only reads and writes of blocks are significant
• If we can get the whole file in memory, we should do an internal sort
• If we cannot (the normal case), we should hold only the first block of as many files as we can afford
  – Each first block is replaced by subsequent blocks as it is used up
  – Not 2-way merging, but N-way merging
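The idea of holding one block per file can be sketched with a heap-based N-way merge. Here each run is a list standing in for a sorted file, `block_size` is an assumed parameter, and the function counts block reads as the cost that matters; none of these names come from the deck.

```python
import heapq


def kway_merge(runs, block_size):
    """Merge sorted runs while buffering only one block of each run."""
    buffers = [run[:block_size] for run in runs]   # first block of every run
    next_pos = [block_size] * len(runs)            # where each run's next block starts
    block_reads = sum(1 for buf in buffers if buf)
    heap = [(buf[0], r, 0) for r, buf in enumerate(buffers) if buf]
    heapq.heapify(heap)
    out = []
    while heap:
        key, r, i = heapq.heappop(heap)
        out.append(key)
        i += 1
        if i == len(buffers[r]):                   # block used up: fetch the next one
            buffers[r] = runs[r][next_pos[r]:next_pos[r] + block_size]
            next_pos[r] += block_size
            i = 0
            if buffers[r]:
                block_reads += 1
        if buffers[r]:
            heapq.heappush(heap, (buffers[r][i], r, i))
    return out, block_reads
```

Memory use is bounded by the number of runs times `block_size`, regardless of how long the runs are.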

N-Way Balanced Merging
• Similar process to the one given above, except
  – N files instead of 4 (N/2 for input, N/2 for output)
  – N is even
  – Bigger is better, provided we have memory space
• Example N = 6
  – Distribute input on 3 files
  – Merge 3 files to 3 files until all are sorted
  – Each pass multiplies run length by three, instead of two
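The benefit of a larger N shows up in the number of merge passes. This sketch assumes N work files split evenly into N/2 inputs and N/2 outputs, so each pass divides the run count by N/2, rounding up; `merge_passes` is a hypothetical helper.

```python
def merge_passes(initial_runs, files):
    """Passes needed by an N-way balanced merge using `files` work files."""
    if files < 4 or files % 2:
        raise ValueError("need an even number of files, at least 4")
    ways = files // 2                 # each pass is a (files/2)-way merge
    passes = 0
    while initial_runs > 1:
        initial_runs = -(-initial_runs // ways)   # ceiling division
        passes += 1
    return passes
```

For 129 initial runs this gives 8 passes with four files (2-way merging) and 5 passes with six files (3-way merging).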

Is this the best we can do?
• No
• The more we can merge together, the better
  – A 5-to-1 merge multiplies the run length by 5
  – It does more comparisons, but file accesses, not comparisons, are the driving factor in merges
• Unbalancing the merge leads into the Polyphase sort

Polyphase Merge Distribution
• Suppose we have N work files
• Place runs in N-1 of these
• When this phase is done, we have N-1 files with various runs and one that is unused
• Then we enter the merge phase until complete

Polyphase Merging
• Merge the N-1 files onto the Nth file
• As soon as any of the input run files is exhausted:
  – Call the exhausted file M
  – Finish the current run
  – Close file N
  – Open file M for output
  – Open file N for input

Polyphase Merging
[Figure: work files 1 to 5 in a four-to-one merge, shown for the first merge phase and the second merge phase]

Polyphase comments
• We have the maximal merging ratio, (N-1):1, at all times
• We do not necessarily finish an input file in the current output file
• One problem
  – We do not want two files to end at about the same time
  – This diminishes our merging ratio
• The trick is to distribute the original runs unequally
  – Very specifically unequally

Distribution of runs
• Key: L^c means c runs of length L
Phase     F6      F5      F4      F3      F2      F1      Processed
Distrib   1^31    1^30    1^28    1^24    1^16    -       129
Phase 1   1^15    1^14    1^12    1^8     -       5^16    80
Phase 2   1^7     1^6     1^4     -       9^8     5^8     72
Phase 3   1^3     1^2     -       17^4    9^4     5^4     68
Phase 4   1^1     -       33^2    17^2    9^2     5^2     66
Phase 5   -       65^1    33^1    17^1    9^1     5^1     65
Phase 6   129^1   -       -       -       -       -       129
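The run counts in this table can be reproduced by simulating only the bookkeeping of the polyphase merge, without touching any records. The sketch below tracks how many runs sit on each file; `polyphase_phases` is a made-up name, and the simplification that the input must be a perfect distribution (no dummy runs) is an assumption of this example.

```python
def polyphase_phases(run_counts):
    """Simulate polyphase run counts; exactly one file must start empty."""
    counts = list(run_counts)
    history = [tuple(counts)]
    while sum(counts) > 1:
        out = counts.index(0)                      # the empty file receives output
        inputs = [i for i in range(len(counts)) if i != out]
        step = min(counts[i] for i in inputs)      # merges until one input runs dry
        if step == 0:
            # Two files ended together: this sketch does not add dummy runs.
            raise ValueError("not a perfect distribution")
        for i in inputs:
            counts[i] -= step
        counts[out] = step
        history.append(tuple(counts))
    return history


# Files F6..F1 with the initial counts from the table.
phases = polyphase_phases([31, 30, 28, 24, 16, 0])
```

Each tuple in `phases` corresponds to one row of the table, and exactly one file is exhausted per phase until the last, so every phase is a full five-way merge.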

Polyphase distribution
• We do not want equal distribution of runs on our files
• This would cause multiple files to end at the same time
• Solution: Distribute runs unequally
  – Perfect Fibonacci distributions
  – The exponents of the previous table reflect this sequence
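Perfect distributions can be generated level by level with the usual generalized Fibonacci recurrence: each new level adds the largest count to every other count and keeps the old largest count as the new smallest. The sketch below assumes that recurrence; `perfect_distribution` and its parameters are names chosen for this example.

```python
def perfect_distribution(inputs, level):
    """Run counts for `inputs` input files at the given perfect level."""
    counts = [1] + [0] * (inputs - 1)              # level 0: a single run
    for _ in range(level):
        largest = counts[0]
        counts = [largest + c for c in counts[1:]] + [largest]
    return counts


# Five input files (six work files), six merge phases.
six_file_counts = perfect_distribution(5, 6)
```

With five input files and six levels this gives the counts 31, 30, 28, 24, 16, which sum to 129 runs; with two input files the counts are consecutive Fibonacci numbers.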

Finally
• Merging has exactly the same goal as sorting:
  – Put keyed records into a particular order
• The constraints make the approaches quite different
  – Sorting: compares and moves
  – Merging: reading and writing blocks
• Nobody would suggest the Polyphase merge as a sort