Sorting by the Numbers Sorting Part Four
Question
- Suppose you are given the task of writing an application to sort a big data file. What do you need to know to pick a good solution?
  - File Size = 1 GB
  - Record Size = 250 Bytes
  - Available Memory = ¼ GB
How many Runs? How big is each Run?
- Total Records to Process
  - 1 billion bytes in the file
  - 250 bytes for each record
  - = 4 million records in the file
- Run Size
  - 1 GB file
  - ¼ GB memory
  - = 4 Runs of 1 million records each
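
The run arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `plan_runs` is made up for illustration, and it assumes 1 GB means 1 billion bytes, as the slide does.

```python
import math

def plan_runs(file_bytes, record_bytes, memory_bytes):
    """Return (total_records, run_count, records_per_run) for an external sort."""
    total_records = file_bytes // record_bytes
    # Number of whole records that fit in memory at one time.
    records_per_run = memory_bytes // record_bytes
    # Each run fills memory once, so round up to cover a partial last run.
    run_count = math.ceil(total_records / records_per_run)
    return total_records, run_count, records_per_run

GB = 1_000_000_000  # assumption: decimal gigabyte, matching "1 billion bytes"
print(plan_runs(GB, 250, GB // 4))  # (4000000, 4, 1000000)
```
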
Time to Create the Runs
- Sorting One Run
  - Using either Quicksort or an Ordered Binary Tree
  - N log2 N = 1 million * 20
  - approximately 20 million comparisons of internal memory locations
- Sorting Four Runs
  - 80 million internal memory comparisons
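
As a rough check of the N log2 N estimate, the sketch below evaluates the formula directly. The helper name `sort_compares` is hypothetical, and the formula is only an average-case approximation, not an exact count for any particular sort.

```python
import math

def sort_compares(n):
    """Approximate comparisons needed to sort n records: n * log2(n)."""
    return n * math.log2(n)

one_run = sort_compares(1_000_000)   # log2(1 million) is just under 20
four_runs = 4 * one_run

print(round(one_run / 1e6), "million compares for one run")
print(round(four_runs / 1e6), "million compares for four runs")
```

The exact values come out slightly under 20 million and 80 million, which is why the slide rounds log2 of 1 million up to 20.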
Refresher on Merging Files
- Interleaved keys: File One = 1 3 5 7 9, File Two = 2 4 6 8 10
- Already ordered keys: File One = 1 2 3 4 5, File Two = 6 7 8 9 10
- So, merging 2 files of N random records each requires about 2N compares
- And merging 2 files whose runs were built from a sorted file requires about N compares
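
To see where the 2N and N figures come from, here is a minimal two-way merge that counts its comparisons. The function `merge_count` is an illustrative name, and in-memory lists stand in for the two files.

```python
def merge_count(a, b):
    """Merge two sorted lists; return (merged_list, number_of_comparisons)."""
    merged, compares = [], 0
    i = j = 0
    while i < len(a) and j < len(b):
        compares += 1                 # one key comparison per record written
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # One input is exhausted; the rest is copied without comparing.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged, compares

print(merge_count([1, 3, 5, 7, 9], [2, 4, 6, 8, 10])[1])  # 9, about 2N for N = 5
print(merge_count([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])[1])  # 5, exactly N
```

With interleaved keys nearly every output record costs a comparison (2N - 1 here). When every key in one file precedes every key in the other, the comparisons stop as soon as the first file runs out.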
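Merging the Four Files
- Diagram, chained merge: R1 + R2 -> T1 (2 million compares), T1 + R3 -> T2 (3 million compares), T2 + R4 -> output (4 million compares)
- Diagram, pairwise merge: R1 + R2 -> T1 (2 million compares), R3 + R4 -> T2 (2 million compares), T1 + T2 -> output (4 million compares)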
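
The merge counts on this slide follow from the earlier rule of thumb that merging two files costs roughly one comparison per output record. The sketch below applies that rule to two merge orders. It is an interpretation of the diagram, not a transcription: the function names are invented, and reading the "3 million compares" label as a chained merge is an assumption.

```python
def pairwise_merge_compares(run_sizes):
    """Merge runs two at a time, level by level; return total compares."""
    total, level = 0, list(run_sizes)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            merged = level[i] + level[i + 1]
            total += merged           # about one compare per output record
            next_level.append(merged)
        if len(level) % 2:            # an odd run carries over unmerged
            next_level.append(level[-1])
        level = next_level
    return total

def chained_merge_compares(run_sizes):
    """Merge each run into one growing temp file; return total compares."""
    total, acc = 0, run_sizes[0]
    for size in run_sizes[1:]:
        acc += size
        total += acc
    return total

runs = [1_000_000] * 4
print(pairwise_merge_compares(runs))  # 8000000  (2 + 2 + 4 million)
print(chained_merge_compares(runs))   # 9000000  (2 + 3 + 4 million)
```
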
Total Processing Time
- Time to Create the 4 Runs
  - 80 million comparisons
- Time to Merge the 4 Runs
  - 8 million comparisons
- Assuming a File Read takes just 100 times longer than a Memory Read
  - Total Time = 80 million + (8 million * 100) = 880 million time units
  - Note: we have omitted the time to read the runs into memory and to write the runs to temp files
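
The total can be written as a one-line cost model: in-memory comparisons cost 1 time unit each and merge comparisons, which read from files, cost 100. The function name and the `file_cost` parameter are illustrative, and the model ignores the run read and write time, just as the slide does.

```python
def total_time_units(sort_compares, merge_compares, file_cost=100):
    """Time units = memory compares + merge compares weighted by file cost."""
    return sort_compares + merge_compares * file_cost

print(total_time_units(80_000_000, 8_000_000))  # 880000000
```

Because the merge term is multiplied by 100, the 8 million file comparisons dominate the total even though there are ten times fewer of them.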
Second Example
- 2 Runs of 2 Million Records each
  - Internal Sorting
    - N log2 N = 2 million * 21 = 42 million compares
    - 84 million to create both runs
  - File Merging
    - 4 million compares
  - Total Time
    - 84 million + (4 million * 100) = 484 million time units
Next in this course
- So how much time does it take to access the disk?