NUMA-aware algorithms: the case of data shuffling
Yinan Li*, Ippokratis Pandis, Rene Mueller, Vijayshankar Raman, Guy Lohman
IBM Almaden Research Center
*University of Wisconsin - Madison

Hardware is a moving target
[Figure: example server platforms. Labels: Intel-based, 2-socket, 4-socket (a), POWER-based, 4-socket (b), Cloud, 8-socket]
• Different degrees of parallelism, numbers of sockets, and memory hierarchies
• Different types of CPUs (SSE, out-of-order vs. in-order, 2- vs. 4- vs. 8-way SMT, …), storage technologies, …
It is very difficult to optimize and maintain data management code for every HW platform.

[Figure: Sockets 0-3, each with its own memory, connected by QPI links; numbered access paths 1-4]

Access path                                   Bandwidth, seq. mem access   Latency, data-dependent random access
                                              (12 threads)                 (1 thread)
1: Local memory access                        24.7 GB/s                    340 cycles/access (~150 ns/access)
2: Remote memory, 1 hop                       10.9 GB/s                    420 cycles/access (~185 ns/access)
3: Remote memory, 2 hops                      10.9 GB/s                    520 cycles/access (~230 ns/access)
3 + 4: Remote memory, 2 hops, cross traffic   5.3 GB/s                     530 cycles/access (~235 ns/access)

NUMA effects => RAM bandwidth is underutilized
Sequential accesses are not the final solution
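To make the bandwidth gap in these measurements concrete, the following minimal sketch expresses each sequential-read bandwidth as a fraction of the local one. The constant and function names are illustrative and do not come from the slides; only the numbers are taken from the measurements above.

```c
#include <assert.h>

/* Sequential-read bandwidths in GB/s (12 threads), copied from the
   measurements above. The identifiers are hypothetical. */
static const double BW_LOCAL         = 24.7;
static const double BW_REMOTE_1HOP   = 10.9;
static const double BW_CROSS_TRAFFIC = 5.3;

/* Fraction of the local sequential bandwidth that an access path achieves. */
static double fraction_of_local(double bandwidth_gbs) {
    assert(bandwidth_gbs >= 0.0);
    return bandwidth_gbs / BW_LOCAL;
}
```

Under these numbers, `fraction_of_local(BW_REMOTE_1HOP)` is roughly 0.44 and `fraction_of_local(BW_CROSS_TRAFFIC)` is roughly 0.21, so even purely sequential remote reads leave more than half of the local bandwidth unused.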

Use case: data shuffling
• Each of the N threads needs to send data to the N-1 other threads
• Common operation:
  • Sort-merge join
  • Partitioned aggregation
  • MapReduce: both Map and Reduce shuffle data
  • Scatter/gather
[Figure: shuffling bandwidth on a 4-socket system: sequential but naïve vs. coordinated NUMA-aware]
Ignoring NUMA leaves performance on the table.

NUMA-aware data mgmt. operations
• Tons of work on SMPs & NUMA 1.0
• Sort-merge join [Albutiu et al. VLDB 2012]
  • Favor sequential accesses over random probes
• OLTP on HW Islands [Porobic et al. VLDB 2012]
  • Should we treat multisocket multicores as a cluster?
There are many different data operations that need similar optimizations.

Need for primitives
• Kernels used frequently in data management operations
  • E.g. sorting, hashing, data shuffling, …
• Highly optimized software solutions
  • Similar to BLAS
  • Optimized by skilled devs for each new HW platform
• Hardware-based solutions
  • Database machines 2.0 (see Bionic DBMSs talk this afternoon)
  • A very important kernel can be burnt into HW
  • Expensive, but orders of magnitude more efficient (perf., energy)
  • Companies like IBM and Oracle can do vertical engineering

Outline
• Introduction
• NUMA 2.0 and related work
• Data shuffling
  • Ring shuffling
  • Thread migration
• Evaluation
• Conclusions

Data shuffling & naïve implementation
• Each of the N threads produces N-1 partitions, one for every other thread
• Each thread needs to read its partitions
• Assume uniform partition sizes
• N * (N-1) transfers
• Naïve implementation
  • Each thread acts autonomously:

      for (thread = 0; thread < N; thread++)
          readMyPartitionFrom(thread);

[Figure: partition layout before and after the shuffle]
How bad can that be?
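The cost of the naïve loop can be illustrated with a small counting sketch. It assumes, purely for illustration, that all threads advance in lock-step and that a thread skips its own local partition; the function names are hypothetical and not taken from the talk.

```c
#include <assert.h>
#include <stdlib.h>

/* Source that reader t pulls from at step s of the naive loop. Every
   reader starts at thread 0 and walks upward, so t does not matter. */
static int naive_source(int t, int s) {
    (void)t;
    return s;
}

/* Total transfers when each of n threads reads from the n-1 others. */
static long total_transfers(int n) {
    return (long)n * (n - 1);
}

/* Worst-case number of readers that hit one source in a single step,
   assuming (hypothetically) that all n threads advance in lock-step. */
static int naive_max_readers(int n) {
    int worst = 0;
    int *hits = calloc((size_t)n, sizeof *hits);
    assert(hits != NULL);
    for (int s = 0; s < n; s++) {
        for (int i = 0; i < n; i++) hits[i] = 0;
        for (int t = 0; t < n; t++) {
            int src = naive_source(t, s);
            if (src != t) hits[src]++;  /* local partition needs no transfer */
        }
        for (int i = 0; i < n; i++)
            if (hits[i] > worst) worst = hits[i];
    }
    free(hits);
    return worst;
}
```

Under these assumptions, `naive_max_readers(n)` returns n-1: in every step all other threads read from the same source, so a single memory channel serves the whole machine while the remaining channels sit idle.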

Shuffling naively in a NUMA system
[Figure: naïve uncoordinated shuffling among threads T0-T7 over steps 1, 2, 3, …, showing usage of QPI and memory paths]
[Plot: bandwidth (GB/s) vs. # threads (1-32) for naïve shuffling, with reference lines for the max memory bandwidth of 1 channel and the aggregate bandwidth of all channels]
• Naïve shuffling stays near the max memory bandwidth of 1 channel
• BUT we bought 4 memory channels and 6 QPIs
• Need to orchestrate threads/transfers to utilize the rest

Ring shuffling
• Devise a global schedule and all threads follow it
• Inner ring: partitions ordered by thread number, socket; stationary
• Outer ring: threads ordered by socket, thread number; rotates
• Can be executed in lock-step or loosely
• Needs:
  • Thread binding & synchronization
  • Control over the location of mem. allocations
[Figure: two concentric rings; inner ring of partitions labeled by socket and partition (s0.p0, s1.p0, …), outer ring of threads labeled by socket and thread (s0.t0, s0.t1, …)]
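The rotating schedule can be sketched as follows. This is a simplified illustration, not the implementation from the talk: it uses a plain rotation over thread ids and leaves out the socket-aware ordering of the two rings, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Source read by thread t at step s (1 <= s <= n-1): the outer ring has
   been rotated by s positions relative to the stationary inner ring. */
static int ring_source(int t, int s, int n) {
    return (t + s) % n;
}

/* Returns 1 if, in every step, each source is read by exactly one thread
   and, over all steps, each thread reads every other thread exactly once. */
static int ring_schedule_is_balanced(int n) {
    int ok = 1;
    int *hits = calloc((size_t)n, sizeof *hits);
    int *seen = calloc((size_t)n * (size_t)n, sizeof *seen);
    assert(hits != NULL && seen != NULL);
    for (int s = 1; s < n && ok; s++) {
        for (int i = 0; i < n; i++) hits[i] = 0;
        for (int t = 0; t < n; t++) {
            int src = ring_source(t, s, n);
            if (src == t) ok = 0;        /* must never be a local read */
            hits[src]++;
            seen[t * n + src]++;
        }
        for (int i = 0; i < n; i++)
            if (hits[i] != 1) ok = 0;    /* one reader per source per step */
    }
    for (int t = 0; t < n && ok; t++)
        for (int u = 0; u < n; u++)
            if (u != t && seen[t * n + u] != 1) ok = 0;
    free(hits);
    free(seen);
    return ok;
}
```

Because each step is a permutation of the sources, no two threads compete for the same partition owner, which is what lets the transfers spread across all memory channels and QPI links instead of one.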

Ring shuffling in action
[Figure: ring shuffling among threads T0-T7 over steps 1, 2, 3, …, showing usage of QPI and memory paths]
[Plot: bandwidth (GB/s) vs. # threads (1-32) for ring shuffling and naïve shuffling, with a reference line for the aggregate bandwidth of all channels]
Orchestrated traffic utilizes the underlying QPI network.

Thread migration instead of shuffling
• Move the computation to the data instead of shuffling the data
• Converts remote accesses into local memory reads
• Choice of migrating only the thread or the thread + its state
• But both options are very sensitive to the amount of thread state
[Plot: bandwidth (GB/s) vs. # threads (1-32) for thread migration, ring shuffling, and naïve shuffling, with a reference line for the aggregate bandwidth of all channels]
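One way to picture thread migration is as the same rotation applied to threads rather than data. The sketch below is an assumption-laden illustration: the slides do not state which schedule migration uses, so it reuses a plain rotation and assumes threads are packed onto sockets in contiguous blocks. All names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Socket whose memory holds the partitions produced by thread `producer`,
   assuming threads are packed onto sockets in blocks of threads_per_socket. */
static int home_socket(int producer, int threads_per_socket) {
    return producer / threads_per_socket;
}

/* Socket on which worker t should run at step s so that the partition it
   consumes, produced by thread (t + s) % n, is a local memory read. */
static int migration_target_socket(int t, int s, int n, int threads_per_socket) {
    return home_socket((t + s) % n, threads_per_socket);
}

/* Returns 1 if every step places exactly threads_per_socket workers on
   each socket, i.e. no socket is oversubscribed by the migrations. */
static int migration_is_balanced(int sockets, int threads_per_socket) {
    int n = sockets * threads_per_socket;
    int ok = 1;
    int *load = calloc((size_t)sockets, sizeof *load);
    assert(load != NULL);
    for (int s = 0; s < n && ok; s++) {
        for (int k = 0; k < sockets; k++) load[k] = 0;
        for (int t = 0; t < n; t++)
            load[migration_target_socket(t, s, n, threads_per_socket)]++;
        for (int k = 0; k < sockets; k++)
            if (load[k] != threads_per_socket) ok = 0;
    }
    free(load);
    return ok;
}
```

The sketch only decides where a worker should run; actually moving it (and, optionally, copying its state) is platform-specific and is where the sensitivity to thread-state size comes in.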

Outline
• Introduction
• NUMA 2.0 and related work
• Data shuffling
• Evaluation
• Conclusions

Shuffling benchmark – peak bandwidth
[Plot, two panels: peak bandwidth (GB/s, 0-120) for Naive, Coord. random (tight), Ring (loose), Ring (tight), Thread migration (loose), and Thread migration (tight); annotated speedups of ~4x and 3x]
• 4 socket: IBM x3850, 4 sockets x 8 cores, Intel X7650 Nehalem-EX, fully connected QPI
• 8 socket: 2 x IBM x3850, 8 sockets x 8 cores, Intel X7650 Nehalem-EX

Exploiting ring shuffling in joins
• Implemented the algorithm of Albutiu et al.
  • Sort-merge-based join implementation
[Plot: scalability (vs. 1 thread) over # threads (1-32) for Total (Ring), Total (Naive), Merge Phase (Ring), and Merge Phase (Naive)]
[Plot: time breakdown (Partition, Sort Dim., Sort Fact, Merge) over # threads (1-32), with naïve shuffling and with ring shuffling]
Small overall perf. improvement because the join is dominated by the sort.

Shuffling vs migration for aggregation
• Partitioning-based aggregation
[Plot: bandwidth (GB/s) on the 4-socket system vs. # distinct keys / partition (1K to 8M) for Migration & copy, Migration, Ring, and Naive]
Potential of thread migration when the thread state is small.

Conclusions
• Hardware is a moving target
• Need for primitives for data management operations
  • Highly optimized SW or HW implementations
  • BLAS for DBMSs
• Data shuffling can be up to 3x faster if NUMA-aware
  • Needs binding of memory allocations, thread scheduling, …
  • Potential of thread migration
• Improved overall performance of optimized joins and aggregations
• Continue investigating primitives, their implementation and exploitation
• Looking for motivated summer interns! [email to ipandis@us.ibm.com]
Questions?

Backup slides

Shuffling data - scalability
[Plot: bandwidth (GB/s) vs. # threads (1-32) for Thread migration, Ring shuffling, and Naive]
• IBM x3850, 4 sockets x 8 cores, fully connected QPI

Shuffling vs migration for aggregation - breakdown
• Partitioning-based aggregation
[Plot, four panels (Naive, Migration, Ring, Migration & copy): time (sec) vs. # distinct groups (64K to 4M), broken down into Read partitions, Upd. hash table, and Copy state]

Naïve vs ring shuffling
[Figure: naïve uncoordinated shuffling vs. coordinated shuffling among threads T0-T7 over iterations 1, 2, 3, …, showing usage of QPI and memory paths]