NUMA-aware algorithms: the case of data shuffling
Yinan Li*, Ippokratis Pandis, Rene Mueller, Vijayshankar Raman, Guy Lohman
IBM Almaden Research Center
*University of Wisconsin - Madison

Hardware is a moving target
[Figure: platform topologies, labeled Intel-based, 2-socket, 4-socket (a), POWER-based, 4-socket (b), Cloud, 8-socket]
• Different degrees of parallelism, # sockets and memory hierarchies
• Different types of CPUs (SSE, out-of-order vs in-order, 2- vs 4- vs 8-way SMT, …), storage technologies, …
Very difficult to optimize & maintain data management code for every HW platform

[Figure: Sockets 0 to 3, Memory, QPI; the numbers 1 to 4 mark the access paths measured below]

Access path                                      | Bandwidth, seq. mem access (12 threads) | Latency, data-dependent random access (1 thread)
-------------------------------------------------|------------------------------------------|--------------------------------------------------
Local memory access (1)                          | 24.7 GB/s                                | 340 cycles/access (~150 ns/access)
Remote memory, 1 hop (2)                         | 10.9 GB/s                                | 420 cycles/access (~185 ns/access)
Remote memory, 2 hops (3)                        | 10.9 GB/s                                | 520 cycles/access (~230 ns/access)
Remote memory, 2 hops (3) with cross traffic (4) | 5.3 GB/s                                 | 530 cycles/access (~235 ns/access)

NUMA effects => underutilize RAM bandwidth
Sequential accesses are not the final solution
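The slowdown factors implied by these measurements can be worked out directly. A minimal sketch that uses only the numbers from the table; the dictionary and variable names are illustrative and not part of the talk.

```python
# Bandwidth (GB/s) and latency (cycles/access) figures copied from the table above.
measurements = {
    "local": (24.7, 340),
    "remote, 1 hop": (10.9, 420),
    "remote, 2 hops": (10.9, 520),
    "remote, 2 hops + cross traffic": (5.3, 530),
}

local_bw, local_lat = measurements["local"]

# Slowdown of each remote path relative to local access:
# (bandwidth factor, latency factor), rounded to two decimals.
slowdown = {
    name: (round(local_bw / bw, 2), round(lat / local_lat, 2))
    for name, (bw, lat) in measurements.items()
    if name != "local"
}

for name, (bw_factor, lat_factor) in slowdown.items():
    print(f"{name}: bandwidth {bw_factor}x lower, latency {lat_factor}x higher")
```

Sequential bandwidth drops by more than 2x after a single hop and by more than 4x under cross traffic, while latency grows by a factor of roughly 1.2 to 1.6.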

Use case: data shuffling
• Each of the N threads needs to send data to the N-1 other threads
• Common operation:
  • Sort-merge join
  • Partitioned aggregation
  • MapReduce
    • Both Map and Reduce shuffle data
  • Scatter/gather
[Chart: bandwidth on a 4-socket system; bars: "Sequential, but naïve" vs "Coordinated NUMA-aware"]
Ignoring NUMA leaves perf. on the table

NUMA-aware data mgmt. operations
• Tons of work on SMPs & NUMA 1.0
• Sort-merge join [Albutiu et al. VLDB 2012]
  • Favor sequential accesses over random probes
• OLTP on HW Islands [Porobic et al. VLDB 2012]
  • Should we treat multisocket multicores as a cluster?
There are many different data operations that need similar optimizations

Need for primitives
• Kernels used frequently in data management operations
  • E.g. sorting, hashing, data shuffling, …
• Highly optimized software solutions
  • Similar to BLAS
  • Optimized by skilled devs per new HW platform
• Hardware-based solutions
  • Database machines 2.0 (see Bionic DBMSs talk this afternoon)
  • If a kernel is very important, it can be burnt into HW
  • Expensive, but orders of magnitude more efficient (perf., energy)
  • Companies like IBM and Oracle can do vertical engineering

Outline
• Introduction
• NUMA 2.0 and related work
• Data shuffling
  • Ring shuffling
  • Thread migration
• Evaluation
• Conclusions

Data shuffling & naïve implementation
• Each of the N threads produces N-1 partitions, one for each of the other threads
• Each thread needs to read its partitions
• Assume uniform sizes of partitions
• N * (N-1) transfers
• Naïve implementation
  • Each thread acting autonomously:
    for (thread = 0; thread < N; thread++) readMyPartitionFrom(thread);
[Figure: partition layout Before and After the Shuffle]
How bad can that be?
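To see why the naïve loop hurts, it helps to count how many threads hit the same source at once. The sketch below is a simplified model, not code from the talk: it assumes all threads advance through the loop in lock-step, and the function names are hypothetical.

```python
from collections import Counter

def naive_schedule(n_threads):
    """Step k of the naive loop: every thread reads its partition from thread k."""
    return [[(reader, step) for reader in range(n_threads)]
            for step in range(n_threads)]

def max_readers_per_source(schedule):
    """Largest number of threads reading from the same source thread in one step."""
    return max(max(Counter(src for _, src in step).values()) for step in schedule)

N = 8
# Under the lock-step assumption, all N threads read from one source in each step.
print(max_readers_per_source(naive_schedule(N)))  # prints 8
```

Under this model every step funnels all N readers through the memory of a single socket, while the other memory channels and interconnect links carry no shuffle traffic.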

Shuffling naively in a NUMA system
[Figure: naïve uncoordinated shuffling; threads T0 to T7 over Step 1, Step 2, Step 3, …; usage of QPI and memory paths]
[Chart: bandwidth (GB/s) vs # threads (1 to 32); series: Naive; reference lines: aggr. BW of all channels, max mem. BW of 1 channel]
Need to orchestrate threads/transfers to utilize the rest
BUT we bought 4 memory channels and 6 QPIs

Ring shuffling
[Figure: two concentric rings; inner ring labeled with partitions (s<socket>.p<n>), outer ring labeled with threads (s<socket>.t<n>)]
• Devise a global schedule and all threads follow it
• Inner ring: partitions ordered by thread number, socket; stationary
• Outer ring: threads ordered by socket, thread number; rotates
• Can be executed in lock-step or loosely
• Needs:
  • Thread binding & synchronization
  • Control location of mem. allocations
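The two rings can be written down as a schedule generator. This is a sketch under stated assumptions rather than the authors' implementation: it reads an inner-ring entry as "the partition held by thread p of socket s", rotates the outer ring by one position per step, and uses hypothetical names (`ring_schedule`, `max_link_load`).

```python
from collections import Counter

def ring_schedule(n_sockets, threads_per_socket):
    """Yield one list of (reader, source) pairs per step; each item is (socket, index)."""
    n = n_sockets * threads_per_socket
    # Inner ring (stationary): partitions ordered by thread number, then socket.
    inner = [(p % n_sockets, p // n_sockets) for p in range(n)]
    # Outer ring (rotating): threads ordered by socket, then thread number.
    outer = [(q // threads_per_socket, q % threads_per_socket) for q in range(n)]
    for step in range(n):
        yield [(outer[q], inner[(q + step) % n]) for q in range(n)]

def max_link_load(step_pairs):
    """Largest number of concurrent transfers on any (reader socket, source socket) pair."""
    return max(Counter((r[0], s[0]) for r, s in step_pairs).values())

steps = list(ring_schedule(n_sockets=4, threads_per_socket=8))
# Every source partition is read by exactly one thread in every step.
assert all(len({src for _, src in step}) == 32 for step in steps)
print(max(max_link_load(step) for step in steps))  # prints 2
```

Because neighbouring inner-ring positions belong to different sockets, the 8 threads of one socket always read from all 4 sockets at once, 2 transfers per socket pair, so the load is spread across every memory channel and link in each step.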

Ring shuffling in action
[Figure: ring shuffling; threads T0 to T7 over Step 1, Step 2, Step 3, …; usage of QPI and memory paths]
[Chart: bandwidth (GB/s) vs # threads (1 to 32); series: Ring shuffling, Naive; reference line: aggr. BW of all channels]
Orchestrated traffic utilizes underlying QPI network

Thread migration instead of shuffling
• Move computation to the data instead of shuffling the data
• Convert accesses to local memory reads
• Choice of migrating only the thread or thread + state
• But both are very sensitive to the amount of thread state
[Chart: bandwidth (GB/s) vs # threads (1 to 32); series: Thread migration, Ring shuffling, Naive; reference line: aggr. BW of all channels]
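The sensitivity to thread state can be illustrated with a back-of-the-envelope model. This toy model is an assumption made for illustration, not the cost model of the talk: it only counts bytes crossing the interconnect, assumes a thread's input is spread evenly over all sockets, and assumes the state is copied once per socket hop. All function names are hypothetical.

```python
def remote_bytes_shuffle(partition_bytes, n_sockets):
    """Bytes one thread pulls over the interconnect when its input is spread
    evenly over all sockets: everything except the local share is remote."""
    return partition_bytes * (n_sockets - 1) / n_sockets

def remote_bytes_migrate(state_bytes, n_sockets):
    """Bytes moved when the thread visits every socket and carries its state
    along: one state copy per hop, and all partition reads become local."""
    return state_bytes * (n_sockets - 1)

def migration_moves_less(partition_bytes, state_bytes, n_sockets=4):
    """True when migrating thread + state moves fewer bytes than shuffling."""
    return (remote_bytes_migrate(state_bytes, n_sockets)
            < remote_bytes_shuffle(partition_bytes, n_sockets))

GiB, MiB = 2**30, 2**20
print(migration_moves_less(partition_bytes=1 * GiB, state_bytes=1 * MiB))    # True
print(migration_moves_less(partition_bytes=1 * GiB, state_bytes=512 * MiB))  # False
```

In this model migration wins only while the state is smaller than the input divided by the number of sockets; real behaviour also depends on bandwidth, latency, and cache effects that the model ignores.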

Outline
• Introduction
• NUMA 2.0 and related work
• Data shuffling
• Evaluation
• Conclusions

Shuffling benchmark – peak bandwidth
[Chart, two panels: bandwidth (GB/s) per method; bars: Naive, Coord. random (tight), Ring (loose), Ring (tight), Thread migration (loose), Thread migration (tight); annotations: ~4x, 3x]
• 4 socket: IBM x3850, 4 sockets x 8 cores, Intel X7650 Nehalem-EX, fully connected QPI
• 8 socket: 2 x IBM x3850, 8 sockets x 8 cores, Intel X7650 Nehalem-EX

Exploiting ring shuffling in joins
• Implemented the algorithm of Albutiu et al.
• Sort-merge-based join implementation
[Chart: scalability (vs. 1 thread) vs # threads (1 to 32); series: Total (Ring), Total (Naive), Merge Phase (Ring), Merge Phase (Naive)]
[Chart: time breakdown (%) vs # threads (1 to 32), w. Naive shuffling and w. Ring shuffling; components: Merge, Sort Fact, Sort Dim., Partition]
Small overall perf. improvement because dominated by sort

Shuffling vs migration for aggregation
• Partitioning-based aggregation
[Chart: bandwidth (GB/s) vs # distinct keys / partition (1K to 8M), 4 socket; series: Migration & copy, Migration, Ring, Naive]
Potential of thread migration when thread state small
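For readers unfamiliar with the operation being measured, here is a minimal single-threaded sketch of partitioning-based aggregation. It is a generic illustration, not the benchmarked implementation: the sum aggregate, the use of Python's built-in `hash`, and all names are assumptions. The per-partition hash table plays the role of the thread state, and it grows with the number of distinct keys per partition.

```python
from collections import defaultdict

def partition(rows, n_partitions):
    """Phase 1: split (key, value) rows so that each key lands in one partition."""
    parts = [[] for _ in range(n_partitions)]
    for key, value in rows:
        parts[hash(key) % n_partitions].append((key, value))
    return parts

def aggregate(part):
    """Phase 2: build a private hash table of per-key sums for one partition."""
    table = defaultdict(int)
    for key, value in part:
        table[key] += value
    return dict(table)

rows = [(k % 5, 1) for k in range(20)]   # 5 distinct keys, 4 rows each
tables = [aggregate(p) for p in partition(rows, n_partitions=4)]
# Keys never overlap across partitions, so the final result is a plain union.
merged = {k: v for t in tables for k, v in t.items()}
print(merged)
```

In a parallel setting, phase 2 is where the shuffle happens: each thread either pulls its partition from every producer (naïve or ring shuffling) or moves to where the partitions reside (thread migration).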

Conclusions
• Hardware is a moving target
• Need for primitives for data management operations
  • Highly optimized SW or HW implementations
  • BLAS for DBMSs
• Data shuffling can be up to 3x faster if NUMA-aware
  • Needs binding of memory allocations, thread scheduling, …
  • Potential of thread migration
  • Improved overall performance of optimized joins and aggregations
• Continue investigating primitives, their implementation and exploitation
• Looking for motivated summer interns! [email to ipandis@us.ibm.com]
Questions?

Backup slides

Shuffling data - scalability
[Chart: bandwidth (GB/s) vs # threads (1 to 32); series: Thread migration, Ring shuffling, Naive]
IBM x3850, 4 sockets x 8 cores, fully connected QPI

Shuffling vs migration for aggregation - breakdown
• Partitioning-based aggregation
[Chart, four panels (Naive, Migration, Ring, Migration & copy): time (sec) vs # distinct groups (64K to 4M); stacked components: Copy state, Upd. hash table, Read partitions]

Naïve vs ring shuffling
[Figure, two panels: naïve uncoordinated shuffling vs coordinated shuffling; threads T0 to T7 over Iteration 1, Iteration 2, Iteration 3, …; usage of QPI and memory paths]