A COMPARISON-FREE SORTING ALGORITHM ON CPUs
Saleh Abdel-hafeez, Jordan (JUST); Ann Gordon-Ross, USA (UF); Samer Abu Baker, Jordan (JUST)
Highlights
- Principle Example
- Potential Key Factors
- CPU Simulation
  - Single-threaded (no parallelism)
    - C code (memory locality)
    - Execution time simulations
  - Multi-threaded (parallelism)
    - C code (atomic and semaphore vs. memory)
    - Execution time simulations
- Conclusions
Principle Example
Potential Key Factors
- Two representations
  - Binary
  - One-hot
- N = 2^K
- Fewer computations
  - Memory transpose
  - Memory mapping
- Idea
  - Reduce the size of the one-hot matrix from N×N to N×1
  - Improve locality (spatial and temporal)
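The idea of collapsing the N×N one-hot matrix into an N×1 array can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the authors' C code: it assumes K-bit unsigned keys, and the names `cf_sort`, `count`, and `k_bits` are hypothetical. Each key is used directly as an index into a single array of 2^K counters (Loop 1), and a sequential scan of that array writes the keys back in order (Loop 2). Structurally this resembles a counting sort.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: sorts n keys, each in [0, 2^k_bits), without
 * comparing keys to each other. Returns 0 on success, -1 if the
 * counter array cannot be allocated. */
int cf_sort(uint32_t *data, size_t n, unsigned k_bits)
{
    assert(k_bits >= 1 && k_bits <= 28);   /* keep the table size modest */
    size_t range = (size_t)1 << k_bits;    /* N = 2^K possible key values */

    /* N x 1 array standing in for the N x N one-hot matrix:
     * count[v] records how many times value v occurs. */
    size_t *count = calloc(range, sizeof *count);
    if (count == NULL)
        return -1;

    /* Loop 1: scatter. The key itself is the index (random access). */
    for (size_t i = 0; i < n; i++) {
        assert(data[i] < range);
        count[data[i]]++;
    }

    /* Loop 2: gather. Sequential scan over the counters. */
    size_t out = 0;
    for (size_t v = 0; v < range; v++)
        for (size_t c = count[v]; c > 0; c--)
            data[out++] = (uint32_t)v;

    free(count);
    return 0;
}
```

In this sketch Loop 1 touches `count` in key order (poor spatial locality for random input), while Loop 2 walks it sequentially, which is one plausible reading of the locality point above.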
CPU Single Thread
Loop 1 Time vs. Loop 2 Time (Memory Locality)
[Figure: execution time (sec) vs. 2^K for Loop 1 and Loop 2; a second panel uses a percentage scale.]
Less Dependent on Input Distribution
[Figure: execution time percentage vs. input size 2^n for four input patterns: Random, Reversed, Nearly Sorted, Few Unique.]
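The four input patterns named in this chart (random, reversed, nearly sorted, few unique) could be generated with a helper like the one below. The slides do not say how the test inputs were produced, so everything here is an assumption made for illustration: the name `fill_input`, the roughly 1% random swaps used for "nearly sorted", and the choice of 4 distinct values for "few unique".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum dist { DIST_RANDOM, DIST_REVERSED, DIST_NEARLY_SORTED, DIST_FEW_UNIQUE };

/* Hypothetical helper: fills buf[0..n) with keys in [0, range)
 * following the chosen pattern. Note that rand() may be limited to
 * 15 bits on some platforms, which caps the random key values. */
void fill_input(uint32_t *buf, size_t n, uint32_t range,
                enum dist d, unsigned seed)
{
    assert(range > 0);
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        switch (d) {
        case DIST_RANDOM:
            buf[i] = (uint32_t)rand() % range;
            break;
        case DIST_REVERSED:        /* descending ramp scaled into range */
            buf[i] = (uint32_t)(((uint64_t)(n - 1 - i) * range) / n);
            break;
        case DIST_NEARLY_SORTED:   /* ascending ramp, perturbed below */
            buf[i] = (uint32_t)(((uint64_t)i * range) / n);
            break;
        case DIST_FEW_UNIQUE:      /* assumed: only 4 distinct values */
            buf[i] = ((uint32_t)rand() % 4) * (range / 4);
            break;
        }
    }
    if (d == DIST_NEARLY_SORTED) {
        /* assumed: swap about 1% of the positions at random */
        for (size_t s = 0; s < n / 100; s++) {
            size_t a = (size_t)rand() % n;
            size_t b = (size_t)rand() % n;
            uint32_t t = buf[a];
            buf[a] = buf[b];
            buf[b] = t;
        }
    }
}
```

Because a counter-array sort indexes by key value rather than comparing keys, the order in which keys arrive changes only the memory access pattern, not the number of operations, which fits the small spread the chart title describes.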
CPU Single Thread (Time Simulation)
[Figure: execution time (microseconds) vs. input size 2^n for Comparison-Free, Quick, Merge, and Radix sort.]
CPU Single Thread: Significance
- The fastest
- Minor effect of input data distribution
- One-dimensional memory
- Fewer computations
- Easy to work with
- Less energy and power

Execution time (microseconds) by input size 2^n:

n     Comparison-Free   Quicksort
7                   6          15
8                  10          30
10                 41         140
12                145         602
14                584        2673
16               2317       11409
18               6839       47064
20              31414      148004

Comparison-Free at larger sizes: 2^22: 69519; 2^24: 418684; 2^26: 1828644; 2^28: 7654605.
CPU Multiple Threads (8 Threads, 4 Cores)
CPU Multi-Threaded (Time)
[Figure: execution time (microseconds) vs. 2^K for Parallel 1, Parallel 2, and Single.]
Execution Time vs. Data Sizes
[Figure: execution time (microseconds) vs. 2^K; series: Single, Parallel 1, Parallel 2, Atomic.]

Execution time (microseconds):

2^K     8-thread   Non-threaded
7            345              6
8            333             10
10           363             41
12           386            145
14          1070            584
16          2085           2317
18          7658           6839
20         17309          31414
22         58822          69519
24        234639         418684
26       1084792        1828644
28       4411107        7654605
30      11969863       60934070
32      32481103       2.22E+08
34      88139858       8.12E+08
Memory Usage
[Figure: memory usage vs. input size exponent (7 to 36) for Single, Parallel-Memory, and Parallel-Atomic.]
Comparison with Parallel Sorting Algorithms
- Avoid mutual exclusion (memory blocking)
- Use more memory for the threaded version
- Use atomics for less memory

Execution time (seconds) by data size 2^K:

Algorithm                             14               20      24      26
Comparison-Free                       0.00107/0.0005   0.002   0.235   1.08
[1] 2011, Bitonic sort (CPU & GPU)    0.0012           0.076   1.97    2.23
[2] 2010, Intel radix (CPU)           0.0075           0.025   0.081   0.33
[3] 2009, NVIDIA radix (GPU)          0.008            0.031   0.12    0.27
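The trade-off between atomics on one shared array and extra private memory per thread can be sketched with POSIX threads and C11 atomics. This is an illustrative sketch only, not the authors' code: the name `cf_sort_parallel`, the fixed `NTHREADS`, the equal-sized input slices, and the sequential merge step are all assumptions. It assumes every key is below 2^k_bits.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define NTHREADS 8   /* assumed thread count */

struct job {
    const uint32_t *data;   /* slice of the input read by this thread */
    size_t n;
    atomic_size_t *shared;  /* atomic variant: one table for all threads */
    size_t *priv;           /* memory variant: this thread's own table */
};

static void *count_worker(void *arg)
{
    struct job *j = arg;
    for (size_t i = 0; i < j->n; i++) {
        if (j->shared)      /* no mutex: a lock-free atomic increment */
            atomic_fetch_add_explicit(&j->shared[j->data[i]], 1,
                                      memory_order_relaxed);
        else                /* no sharing: plain increment, more memory */
            j->priv[j->data[i]]++;
    }
    return NULL;
}

/* Hypothetical sketch. Returns 0 on success, -1 on failure
 * (in which case data is left unmodified). */
int cf_sort_parallel(uint32_t *data, size_t n, unsigned k_bits, int use_atomic)
{
    assert(k_bits >= 1 && k_bits <= 24);
    size_t range = (size_t)1 << k_bits;
    atomic_size_t *shared = NULL;
    size_t *priv = NULL;

    if (use_atomic) {
        shared = malloc(range * sizeof *shared);
        if (!shared) return -1;
        for (size_t v = 0; v < range; v++) atomic_init(&shared[v], 0);
    } else {
        priv = calloc(range * NTHREADS, sizeof *priv);
        if (!priv) return -1;
    }

    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    size_t chunk = (n + NTHREADS - 1) / NTHREADS;
    int started = 0;
    for (int t = 0; t < NTHREADS; t++) {
        size_t lo = (size_t)t * chunk;
        if (lo > n) lo = n;
        size_t len = (n - lo < chunk) ? n - lo : chunk;
        jobs[t] = (struct job){ data + lo, len, shared,
                                priv ? priv + (size_t)t * range : NULL };
        if (pthread_create(&tid[t], NULL, count_worker, &jobs[t]) != 0) break;
        started++;
    }
    for (int t = 0; t < started; t++) pthread_join(tid[t], NULL);
    if (started < NTHREADS) { free(shared); free(priv); return -1; }

    /* Gather: sum the per-thread tables (or read the shared one). */
    size_t out = 0;
    for (size_t v = 0; v < range; v++) {
        size_t c = 0;
        if (use_atomic) c = atomic_load(&shared[v]);
        else for (int t = 0; t < NTHREADS; t++) c += priv[(size_t)t * range + v];
        while (c--) data[out++] = (uint32_t)v;
    }
    free(shared);
    free(priv);
    return 0;
}
```

In this sketch the atomic variant allocates one table of 2^K counters, while the private-table variant allocates `NTHREADS` of them, which mirrors the memory-versus-atomics choice listed above. Building it typically requires the compiler's thread option (for example `-pthread` with GCC or Clang).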
CONCLUSION
- The design is novel and is not an incremental variant of other hybrid sorting algorithms (future work); the C code is clear and is available
- Comparison-free, single-threaded
  - The fastest for data sizes < 2^16
- Comparison-free, multi-threaded
  - CPU (simple 4-core): fastest at data size 2^20
  - CPU (advanced multi-core): needs investigation
  - GPU (simple and advanced): needs investigation
- Uses less memory, with lower energy expected