
Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems
Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München

Hardware trends …
• Huge main memory
• Massive processing parallelism
• Non-uniform Memory Access (NUMA)
• Our server: 4 CPUs, 32 cores, 1 TB RAM, 4 NUMA partitions

Main memory database systems
• VoltDB, Hana, MonetDB
• HyPer: real-time business intelligence queries on transactional data*
* http://www-db.in.tum.de/research/projects/HyPer/

How to exploit these hardware trends?
• Parallelize algorithms
• Exploit fast main memory access
  – Kim, Sedlar, Chhugani: Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. VLDB '09
  – Blanas, Li, Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD '11
• AND be aware of fast local vs. slow remote NUMA access

Ignoring NUMA
[Diagram: cores 1-8, spread across NUMA partitions 1-4, all write into one shared hash table, ignoring partition boundaries]

How much difference does NUMA make?
[Chart: scaled execution times for sort, partitioning, and merge join (sequential read), comparing local vs. remote and sequential vs. synchronized access; reported times: 837 ms, 1000 ms, 7440 ms, 12946 ms, 22756 ms, 417344 ms]

The three NUMA commandments
C1: Thou shalt not write thy neighbor's memory randomly -- chunk the data, redistribute, and then sort/work on your data locally.
C2: Thou shalt read thy neighbor's memory only sequentially -- let the prefetcher hide the remote access latency.
C3: Thou shalt not wait for thy neighbors -- don't use fine-grained latching or locking and avoid synchronization points of parallel threads.
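As an illustration of C1, here is a minimal sketch (not from the slides) of NUMA-local chunking on Linux with libnuma; Tuple and run_worker are placeholder names.

```cpp
// Minimal sketch (assumption, not the authors' code) of commandment C1:
// pin each worker to a NUMA node and allocate its chunk on that node, so the
// heavy work (sorting, joining) touches only node-local memory.
// Requires Linux with libnuma (link with -lnuma).
#include <numa.h>
#include <cstdint>
#include <cstddef>

struct Tuple { uint64_t key; uint64_t payload; };

void run_worker(int node, size_t chunk_tuples) {
    if (numa_available() < 0) return;           // NUMA not supported on this system
    numa_run_on_node(node);                     // pin the calling thread to its node
    size_t bytes = chunk_tuples * sizeof(Tuple);
    Tuple* chunk = static_cast<Tuple*>(numa_alloc_onnode(bytes, node)); // node-local memory
    // ... fill, sort, and join the chunk locally (C1); read other nodes' runs
    // only sequentially (C2); avoid latches and synchronization points (C3) ...
    numa_free(chunk, bytes);
}
```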

Basic idea of MPSM
[Diagram: relation R is split into R chunks and relation S into S chunks, one private chunk per worker]

Basic idea of MPSM
• C1: Work locally: sort
• C3: Work independently: sort and merge join
• C2: Access neighbor's data only sequentially
[Diagram: each worker sorts its R chunk and its S chunk locally, then merge joins (MJ) its sorted R chunk with all sorted S chunks]
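The per-worker control flow can be summarized as follows. This is a minimal single-threaded sketch under my own naming (Tuple, Run, sort_phase, join_phase are not from the slides), not the authors' implementation.

```cpp
// Minimal sketch (assumption, not the authors' code) of the basic MPSM phases
// for one worker: sort the private R and S chunks, then merge join the local
// sorted R run against every worker's sorted S run. The join needs only
// sequential reads of remote runs and no synchronization between workers.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };
using Run = std::vector<Tuple>;

// Classic merge join of two sorted runs on `key`.
static void merge_join(const Run& r, const Run& s,
                       std::vector<std::pair<Tuple, Tuple>>& out) {
    size_t i = 0, j = 0;
    while (i < r.size() && j < s.size()) {
        if (r[i].key < s[j].key)      ++i;
        else if (r[i].key > s[j].key) ++j;
        else {
            for (size_t k = j; k < s.size() && s[k].key == r[i].key; ++k)
                out.emplace_back(r[i], s[k]);   // emit all S matches for this R key
            ++i;
        }
    }
}

// Sort phase: each worker sorts its own chunks (C1: local work only).
void sort_phase(Run& myR, Run& myS) {
    auto byKey = [](const Tuple& a, const Tuple& b) { return a.key < b.key; };
    std::sort(myR.begin(), myR.end(), byKey);
    std::sort(myS.begin(), myS.end(), byKey);
}

// Join phase (after a barrier ensuring all S runs are sorted): merge join the
// local R run against every sorted S run (C2: sequential remote reads).
void join_phase(const Run& myR, const std::vector<Run>& allSortedS,
                std::vector<std::pair<Tuple, Tuple>>& out) {
    for (const Run& s : allSortedS) merge_join(myR, s, out);
}
```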

Range partitioning of private input R
• To constrain merge join work
• To provide scalability in the number of parallel workers
• S is implicitly partitioned
[Diagram: the R chunks are range partitioned, the range-partitioned R chunks and the S chunks are sorted, and each worker merge joins (MJ) its sorted R partition with only the relevant parts of the sorted S chunks]

Range partitioning of private input
• Time efficient: branch-free, comparison-free, and synchronization-free
• Space efficient: densely packed, in-place
Both achieved by using radix-clustering and precomputed target partitions to scatter the data to.

Range partitioning of private input (example)
Chunk of worker W1: 19 9 7 3 21 1 17    Chunk of worker W2: 2 23 4 31 8 20 26
Radix clustering on the most significant key bit (19 = 10011, 7 = 00111, 17 = 10001, 2 = 00010) splits the keys at 16:
• Histogram of W1: 4 keys < 16, 3 keys ≥ 16    Histogram of W2: 3 keys < 16, 4 keys ≥ 16
• Prefix sums give the write positions: W1 starts at offset 0 in both target partitions, W2 at offset 4 (< 16) and offset 3 (≥ 16)
• Each worker scatters its tuples to its precomputed positions without synchronization:
  Partition < 16: 9 7 3 1 2 4 8    Partition ≥ 16: 19 21 17 23 31 20 26
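A minimal sketch of this partitioning scheme, assuming two radix partitions as in the example above; the struct and function names are my own, not the authors'.

```cpp
// Minimal sketch (assumption, not the authors' code) of branch-free,
// synchronization-free range partitioning: every worker builds a histogram of
// its chunk over the radix ranges, prefix sums over all workers' histograms
// yield private write offsets, and each worker scatters its tuples into the
// shared target array without locks.
#include <cstdint>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

constexpr int kPartitions = 2;
inline size_t part(uint64_t key) { return key >> 4; }  // MSB of a 5-bit key: 0 if < 16, 1 otherwise

// Phase 1: per-worker histogram of its chunk.
std::vector<size_t> histogram(const std::vector<Tuple>& chunk) {
    std::vector<size_t> h(kPartitions, 0);
    for (const Tuple& t : chunk) ++h[part(t.key)];
    return h;
}

// Phase 2: prefix sums over all workers' histograms give each worker a
// private, non-overlapping start offset per partition in the target array
// (partition 0 is stored first, then partition 1, and so on).
std::vector<std::vector<size_t>> offsets(const std::vector<std::vector<size_t>>& hists) {
    std::vector<std::vector<size_t>> off(hists.size(), std::vector<size_t>(kPartitions));
    size_t base = 0;
    for (int p = 0; p < kPartitions; ++p)
        for (size_t w = 0; w < hists.size(); ++w) {
            off[w][p] = base;
            base += hists[w][p];
        }
    return off;
}

// Phase 3: scatter; each worker writes only to its own precomputed slots,
// so no synchronization is needed. `cursor` is this worker's row of offsets().
void scatter(const std::vector<Tuple>& chunk, std::vector<size_t> cursor,
             std::vector<Tuple>& target) {
    for (const Tuple& t : chunk) target[cursor[part(t.key)]++] = t;
}
```

With the chunks from the example, the resulting target array is 9 7 3 1 2 4 8 | 19 21 17 23 31 20 26, matching the slide.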

Technische Universität München Real C hacker at work …

Skew resilience of MPSM
• Location skew is implicitly handled
• Distribution skew:
  – Dynamically computed partition bounds
  – Determined based on the global data distributions of R and S
  – Cost balancing for sorting R and joining R and S

Skew resilience
1. Global S data distribution
  – Local equi-height histograms (for free)
  – Combined to a CDF
[Chart: the sorted runs S1 = 1 7 10 15 22 31 66 81 and S2 = 2 12 17 25 33 42 78 90 are combined into a CDF of # tuples over key value]
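A sketch of how the local equi-height histograms could be combined into a global CDF, under assumed names and a fixed bucket height; this approximates, rather than exactly reproduces, the authors' procedure.

```cpp
// Minimal sketch (assumption, not the authors' code) of step 1: each worker's
// sorted S run yields an equi-height histogram for free (every h-th key is a
// bucket boundary); merging all boundaries and accumulating the bucket heights
// approximates the global CDF of S (# tuples with key <= boundary).
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct CdfPoint { uint64_t key; size_t tuples; };  // approx. # S tuples with key <= `key`

std::vector<CdfPoint> global_cdf(const std::vector<std::vector<uint64_t>>& sortedRuns,
                                 size_t bucketHeight) {
    std::vector<std::pair<uint64_t, size_t>> boundaries;   // (boundary key, bucket height)
    for (const auto& run : sortedRuns)
        for (size_t i = bucketHeight - 1; i < run.size(); i += bucketHeight)
            boundaries.emplace_back(run[i], bucketHeight);  // every h-th key of a sorted run
    std::sort(boundaries.begin(), boundaries.end());

    std::vector<CdfPoint> cdf;
    size_t cumulative = 0;
    for (const auto& [key, count] : boundaries) {
        cumulative += count;                                // accumulate bucket heights
        cdf.push_back({key, cumulative});
    }
    return cdf;
}
```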

Skew resilience
2. Global R data distribution
  – Local equi-width histograms as before
  – More fine-grained histograms
[Example: R1 chunk 2 13 4 31 8 20 6 (keys in binary, e.g. 2 = 00010, 8 = 01000) yields the histogram 3, 2, 1, 1 over the ranges < 8, [8, 16), [16, 24), ≥ 24]

Skew resilience
3. Compute splitters so that the overall workloads are balanced*: greedily combine buckets such that each thread's costs for sorting R and for joining R and S are balanced
[Chart: the fine-grained R histogram (3, 2, 1, 1 for chunk 2 4 6 13 31 8 20) is combined with the S CDF (# tuples over key value) to determine the splitters]
* Ross and Cieslewicz: Optimal Splitters for Database Partitioning with Size Bounds. ICDT '09
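A sketch of the greedy splitter computation under an assumed cost model (sorting cost |R_i|·log|R_i| plus merge-join cost |R_i| + |S_i|); the per-bucket S counts would come from the CDF of step 1, and all names here are illustrative.

```cpp
// Minimal sketch (assumption, not the authors' code) of step 3: greedily
// combine fine-grained R histogram buckets into one key range per worker so
// that each worker's estimated cost stays close to an equal share.
// rHist[b]: global # R tuples in bucket b; sHist[b]: # S tuples in the same
// key range, derived from the S CDF.
#include <cmath>
#include <cstddef>
#include <vector>

double cost(double r, double s) {
    return (r > 0 ? r * std::log2(r) : 0.0) + r + s;   // sort R + merge join R and S
}

// Returns, per worker, the index of the last histogram bucket of its partition.
std::vector<size_t> splitters(const std::vector<double>& rHist,
                              const std::vector<double>& sHist, size_t workers) {
    double totalR = 0, totalS = 0;
    for (size_t b = 0; b < rHist.size(); ++b) { totalR += rHist[b]; totalS += sHist[b]; }
    double target = cost(totalR / workers, totalS / workers);   // ideal per-worker cost

    std::vector<size_t> bounds;
    double r = 0, s = 0;
    for (size_t b = 0; b < rHist.size() && bounds.size() + 1 < workers; ++b) {
        r += rHist[b]; s += sHist[b];
        if (cost(r, s) >= target) {     // this partition has reached its share: close it
            bounds.push_back(b);
            r = s = 0;
        }
    }
    bounds.push_back(rHist.size() - 1); // last partition takes the remaining buckets
    return bounds;
}
```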

Performance evaluation
• MPSM performance in a nutshell:
  – 160 million tuples joined per second
  – 27 billion tuples joined in less than 3 minutes
  – scales linearly with the number of cores
• Platform HyPer1:
  – Linux server
  – 1 TB RAM
  – 4 CPUs with 8 physical cores each
• Benchmark:
  – Join tables R and S with schema {[joinkey: 64 bit, payload: 64 bit]}
  – Dataset sizes ranging from 50 GB to 400 GB

Execution time comparison
• MPSM, Vectorwise (VW), and Blanas hash join*
• 32 workers
• |R| = 1600 million tuples (25 GB), varying size of S
* S. Blanas, Y. Li, and J. M. Patel: Design and Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs. SIGMOD 2011

Scalability in the number of cores
• MPSM and Vectorwise (VW)
• |R| = 1600 million tuples (25 GB), |S| = 4 · |R|

Location skew
• Location skew in R has no effect because of the repartitioning
• Location skew in S: in the extreme case, all join partners of Ri are found in only one Sj (either local or remote)

Distribution skew: anti-correlated data
[Charts: execution without balanced partitioning vs. with balanced partitioning]


Conclusions
• MPSM is a sort-based parallel join algorithm
• MPSM is NUMA-aware & NUMA-oblivious
• MPSM is space efficient (works in-place)
• MPSM scales linearly in the number of cores
• MPSM is skew resilient
• MPSM outperforms Vectorwise (4x) and Blanas et al.'s hash join (18x)
• MPSM is adaptable for disk-based processing
  – See details in the paper

Massively Parallel Sort-Merge Joins (MPSM) in Main Memory Multi-Core Database Systems
Martina Albutiu, Alfons Kemper, and Thomas Neumann, Technische Universität München
THANK YOU FOR YOUR ATTENTION!