A Comprehensive Review of SIMD Techniques for Data Processing
ADMS Workshop, VLDB 2018, Rio de Janeiro, Brazil
Shasank Chavan, Vice President, In-Memory Database Technologies, Oracle
Weiwei Gong, Dev Manager, Vector Flow Analytics, Oracle
Copyright © 2018, Oracle and/or its affiliates. All rights reserved.
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Talk Agenda
1. History / Background
2. Instruction-Level Intrinsics
3. Early SIMD Adoption for Data Processing
4. Compression and Complex Predicates
5. SIMD Beyond Scans
6. What Does Industry Need
SIMD : History / Background
• SIMD (Single Instruction Multiple Data) technology in microprocessors allows one instruction to process multiple data items at the same time.
• SIMD technology was first used in the late 1960s and early 1970s as the basis for vector supercomputers (Cray, CDC Star-100, ILLIAC IV (1966)).
• SIMD processing shifted from the super-computer market to the desktop market:
– Real-time graphics and gaming, audio/video processing (e.g. MPEG decoding), etc.
– For example, "change the brightness of an image" (i.e. add/sub a value from the R-G-B fields of each pixel).
– VIS (Sun Microsystems), MMX (Intel), AltiVec (IBM), SSE (Intel) – 1999
• Initial adoption of SIMD systems in personal computers was slow:
– FP registers were reused, requiring conversions between FP and MMX registers; compiler support was lacking.
SIMD : Intrinsics (Intel)
• Instruction-level intrinsics are C-style functions representing assembly instruction(s) which compilers inline into the source program.
– E.g. __popcnt() maps to the popcnt instruction, which returns the number of set bits in a register.
• With SSE came 128-bit SIMD registers and a rich set of intrinsics, which started making SIMD programming more mainstream for performance.
• Intel's compiler intrinsics are available online via an interactive guide**:
– MMX (~124 intrinsics): _mm_cmpeq_pi*, _mm_srli_pi*, _mm_add_pi*
– SSE 4.2 (~624 intrinsics): _mm_blendv_epi*, _mm_max_epi*, _mm_shuffle_epi8
– AVX2 (~997 intrinsics): _mm256_srlv_epi*, _mm256_i32gather_epi*, _mm256_permute4x64_epi64
– AVX-512 (~4864 intrinsics): _mm512_conflict_epi*, _mm512_mask_cmp_epu*_mask, _mm512_maskz_compress_epi*
** https://software.intel.com/sites/landingpage/IntrinsicsGuide/
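As a minimal sketch of what the popcnt intrinsic mentioned above computes, here is a portable scalar equivalent (the loop clears one set bit per iteration; the hardware instruction does the whole count in one operation):

```c
#include <stdint.h>

/* Portable popcount: what the popcnt instruction (and the
 * __popcnt() intrinsic) computes in a single instruction. */
static int popcount32(uint32_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

Compilers typically recognize this pattern (or `__builtin_popcount`) and emit the single instruction when the target supports it.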
SIMD : Early Adoption For Data Processing
• CPU and memory performance became the bottleneck for columnar databases (in-memory processing and modern hardware).
• Intel SSE arrived with 128-bit SIMD registers and a rich ISA to replace MMX.
• Jingren Zhou and Kenneth Ross showed in 2002 that SIMD technology can be employed in many database operations:
– "Implementing Database Operations Using SIMD Instructions", SIGMOD 2002
• SIMD implementations for database operations:
– Scans & Filters, Joins, Aggregations, Index Search
SIMD : Early Adoption For Data Processing : Scan Filters
• Parallelize predicate evaluation – load, eval, store/consume result.
• select count(*) from T where a > 10 and b < 20
– [Load] A
– [Load] Temp = 10
– [Compare] A > Temp
– [Load] B, [Compare] B < 20
– [And] the two comparison results
– Mask, store bit-map
(Figure: four 32-bit lanes of A are compared against a register of broadcast 10s, B against 20s, and the two lane-wise results are ANDed into a 4-bit result bit-map.)
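The steps above can be sketched in scalar C that mirrors the lane-wise SIMD kernel: evaluate both predicates per row and emit a result bit-map, one bit per row. A SIMD version performs the same compares 4/8/16 lanes at a time; the 64-row limit here is just to keep the sketch's bit-map in one word.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar sketch of the vectorized scan filter
 * "a > 10 AND b < 20": bit i of the result is set
 * iff row i passes both predicates (assumes n <= 64). */
static uint64_t scan_filter(const int32_t *a, const int32_t *b, size_t n) {
    uint64_t bitmap = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t hit = (uint64_t)((a[i] > 10) & (b[i] < 20));
        bitmap |= hit << i;
    }
    return bitmap;
}
```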
SIMD : Early Adoption For Data Processing : Miscellaneous
• Simple aggregation (min, max, sum) of a column vector is straightforward:
– Initialize a temporary SIMD register T (e.g. all zeroes).
– Load N values into SIMD register A and perform the aggregation with SIMD register T.
– Loop until all values are processed.
– Perform a horizontal aggregation on T to aggregate over the partial aggregates.
• Simple group-by aggregation can be done by sorting the column vector first by grouping key – this avoids conflicts and allows fast aggregation within each group.
• Nested Loop Join:
– The join key extracted from the outer table gets broadcast into SIMD register T.
– Join keys from the inner table are loaded into SIMD register A and compared to T.
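The vertical-then-horizontal aggregation pattern above can be sketched in scalar C, with a small array standing in for the SIMD register T (LANES = 4 is an illustrative register width):

```c
#include <stdint.h>
#include <stddef.h>

#define LANES 4

/* SIMD-style sum: keep LANES partial sums ("register" T),
 * add LANES values per step, then one horizontal add at the end. */
static int64_t simd_style_sum(const int32_t *v, size_t n) {
    int64_t t[LANES] = {0};               /* the temporary register T */
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)    /* vertical (lane-wise) adds */
        for (int l = 0; l < LANES; l++)
            t[l] += v[i + l];
    int64_t sum = 0;                      /* horizontal aggregation */
    for (int l = 0; l < LANES; l++)
        sum += t[l];
    for (; i < n; i++)                    /* scalar tail */
        sum += v[i];
    return sum;
}
```

Min and max follow the same shape, with lane-wise MIN/MAX in the loop and a horizontal reduce at the end.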
SIMD : Columnar Compression : Dictionary Encoding
• In-memory columnar DBs compress data with lightweight compressors.
• Dictionary Encoding is heavily used because it gives excellent compression without requiring data to be decoded / decompressed for many operations.
• Dictionary Encoding format:
– Maintain an order-preserving dictionary of the distinct symbols in a column.
– Map those symbols to codes [0..N].
– Replace the symbols in the column with the codes.
• For additional compression, bit-pack the codes in the value stream (log(N) bits needed) and similarly pack the symbols in the dictionary.
• For more compression, run-length encode the value stream.
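A minimal sketch of the symbol-to-code mapping step, assuming a sorted (order-preserving) dictionary; a real encoder would binary-search the dictionary rather than scan it:

```c
#include <string.h>
#include <stddef.h>

/* Map a symbol to its dictionary code [0..N-1], or -1 if absent.
 * Because the dictionary is order-preserving, range predicates on
 * the original symbols translate directly to ranges on the codes. */
static int dict_code(const char *const *dict, size_t ndict, const char *sym) {
    for (size_t i = 0; i < ndict; i++)
        if (strcmp(dict[i], sym) == 0)
            return (int)i;
    return -1;
}
```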
SIMD : Columnar Compression : Dictionary Unpacking
(Figure: unpacking bit-packed dictionary codes in three SIMD steps – Load, Byte Shuffle, And Mask.)
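A scalar sketch of what the Load / Byte Shuffle / And Mask sequence computes per lane: extract the i-th bit-packed code of `width` bits from a byte stream. The SIMD version shuffles each code's bytes into its own lane, variable-shifts, and ANDs with the width mask; the little-endian bit order here is an assumption for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Extract code i of `width` bits (width <= 25 so a 4-byte window
 * always covers a code). Caller must supply at least 4 readable
 * bytes starting at the code's byte offset. */
static uint32_t unpack_code(const uint8_t *stream, size_t i, unsigned width) {
    size_t bitpos  = i * width;
    size_t byte    = bitpos >> 3;
    unsigned shift = bitpos & 7;
    uint32_t w = (uint32_t)stream[byte]
               | (uint32_t)stream[byte + 1] << 8
               | (uint32_t)stream[byte + 2] << 16
               | (uint32_t)stream[byte + 3] << 24;
    return (w >> shift) & ((1u << width) - 1);
}
```

For example, 3-bit codes [5, 7, 4] pack into bytes {61, 1} (5 | 7<<3 | 4<<6, little-endian).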
SIMD : Compression
• Data compression still matters – working data sets can't fit into memory.
• Many lightweight compression techniques benefit from SIMD:
– Dictionary Encoding, Bit-Packing, Run-Length Encoding
– Integer compression using SIMD has been studied extensively:
• "Fast Integer Compression using SIMD Instructions", DaMoN 2010
• Null Suppression – encode leading zeroes with bits and remove them from the integers:
– Input: 0x00000025, 0x000055EF, 0x006E5A22, 0xFFFF…
– (1) SIMD LZCNT = [3, 2, 1, 0] bytes, SIMD MASK = [11100100]
– (2) Fetch shuffle mask (Shuffle M1, Shuffle M2, …, Shuffle M256), shuffle data, variable bit-shift, stitch in LZCNT:
• b'11 0x25, b'10 0x55EF, b'01 0x6E5A22, b'00 0xFFFF…
– ** EVEN FASTER WITH THE VPCOMPRESS INSTRUCTION **
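Step (1) above hinges on counting leading zero bytes per value; a scalar sketch of that per-lane computation (the 2-bit tag b'11/b'10/b'01/b'00 is exactly this count):

```c
#include <stdint.h>

/* Leading zero BYTES of a 32-bit value -- what a per-lane
 * SIMD leading-zero count gives for null suppression. The
 * compressed size of the value is then 4 - lzb bytes. */
static int leading_zero_bytes(uint32_t v) {
    if (v & 0xFF000000u) return 0;
    if (v & 0x00FF0000u) return 1;
    if (v & 0x0000FF00u) return 2;
    if (v & 0x000000FFu) return 3;
    return 4;
}
```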
SIMD : Decompression
• Decompression is usually more important than compression for in-memory columnar databases focused on analytics (write once, read many).
• Oracle developed a compressor called OZIP which provides compression ratios similar to LZO, but at much faster decompression speeds.
• HW-friendly format that was implemented for the SPARC M7 DAX coprocessor.
• The compressed format looks like Dictionary Encoding:
– A small dictionary of 1024 (high-frequency) symbols is maintained.
– Dictionary symbols are between 1-8 bytes long – nibble lengths stored.
– The uncompressed stream is dictionary-encoded.
• Decompression involves bit-unpacking, gathers, and unaligned stores.
SIMD : OZIP Decompression
• Layout: HEADER | DICTIONARY | SYMBOL LENGTHS | CODES → decompressed buffer
– Dictionary: symbols of 1-8 bytes, max 1024 symbols, packed together.
– Symbol lengths: nibble-sized lengths (e.g. 3 1 5 2 8 1 2 4 4 5 7 4 3 3 4 5 6 …).
– Codes: bit-packed.
• VPCOMPRESS requires an immediate mask; a permutation table is too large to keep: 3^8 * 64B.
SIMD : Complex Predicates
• T. Willhalm soon started work on a more generic framework for evaluating arbitrary complex predicates with SIMD on columnar data:
– "Vectorizing Database Column Scans with Complex Predicates"
• Generate specialized kernels for each bit-width (e.g. 1-32), and integrate decompression (e.g. bit-unpacking) with predicate evaluation.
– E.g. GT_LT range comparison, 5-bit dictionary codes, output found indexes.
• In-list support (where A in ('yak', 'dog', …, 'tiger')):
– This becomes a set-membership problem (generic evaluation using SIMD for complex predicates).
• In-list elements are converted into codes belonging to the column's value domain [0..N].
• For each code in A, check if it's found in the set. If found, return TRUE, else return FALSE.
• The implementation is like Bloom filter evaluation, so more on this later.
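A minimal sketch of the set-membership check over dictionary codes: once the in-list elements are mapped to codes in [0..N], membership is a bit-test into a small bitmap, which the SIMD version performs as a gather/shift/AND per lane:

```c
#include <stdint.h>

/* Mark a code as a member of the in-list set. */
static void set_add(uint64_t *set_bits, uint32_t code) {
    set_bits[code >> 6] |= 1ull << (code & 63);
}

/* Test whether a dictionary code is in the in-list set:
 * one word load, one shift, one AND -- per SIMD lane. */
static int in_list(const uint64_t *set_bits, uint32_t code) {
    return (int)((set_bits[code >> 6] >> (code & 63)) & 1);
}
```

Here `{0x22}` below is a one-word set with codes 1 and 5 present.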
SIMD : Bit-Weaving, Byte-Slicing, Brain-Splitting, …
• Along came Jignesh Patel and crew to re-organize bit-packed dictionary codes vertically instead of horizontally, just to get super-fast scans…
• Assuming you have 3-bit codes [101 111 100 101 010]:
– Store them vertically into 3 segments so that all bits at position k are stored together:
– [11111110] (high bits), [01010001] (middle bits), [11010110] (low bits)
• Great for scans with early termination (reduced loads and comparisons):
– E.g. a match for "000" involves searching only the first segments before returning "No Match".
– Theta comparisons boil down to a sequence of SIMD logical operations (and, not, or, xor).
• Stitching vertical bits back into horizontal form for projection is costly.
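A scalar sketch of the vertical layout and an equality probe with early termination, using the five 3-bit codes [5, 7, 4, 5, 2] from the slide (the segment words here are built for those five codes; the slide's 8-bit segment values come from a larger figure):

```c
#include <stdint.h>
#include <stddef.h>

/* Equality probe over bit-weaved segments: seg[0]/seg[1]/seg[2]
 * hold the high/middle/low bits of each code, code i's bit at
 * bit position i. Returns a match bitmap; the loop terminates
 * early as soon as no candidate rows remain. */
static uint8_t probe_eq3(const uint8_t seg[3], uint8_t code, size_t n) {
    uint8_t all = (uint8_t)((1u << n) - 1), match = all;
    for (int b = 0; b < 3 && match; b++) {
        uint8_t bit = (code >> (2 - b)) & 1;
        match &= bit ? seg[b] : (uint8_t)(~seg[b] & all);
    }
    return match;   /* bit i set => row i equals code */
}
```

For codes [5, 7, 4, 5, 2] the segments are high = 0b01111, middle = 0b10010, low = 0b01011; probing for 0 ("000") dies after two segments, exactly the early-out the slide describes.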
SIMD : Bit-Weaving, Byte-Slicing, Brain-Splitting, …
• Byte-Slicing was introduced to get the best of both worlds:
– Slice vertically at byte granularity; still get blazing-fast SIMD byte comparisons.
– Stitching is simplified, although still expensive.
• However, most of the benefit still really only applies to early termination – i.e. when less work is needed:
– How likely is it that an evaluated slice across all segments indicates early termination?
– How likely will a single slice in a segment terminate early for conditional evaluation?
– Do we need to make blazing-fast scans even faster?
• AVX-512 allows for conditional evaluation – more on this later.
SIMD : Selective Evaluation
• AVX-512 allows most instructions to conditionally / selectively execute the specified operation against elements in the SIMD register:
– A conditional 64-bit bitmap k is provided which indicates which of the 64 bytes in the SIMD register are to participate in the operation.
• Separate code-paths for optimal execution based on selectivity are not needed.
• Example: select count(*) from T where a > 10 and b < 20
– If "a > 10" is highly selective, prior to AVX-512 we may not vectorize "b < 20".
– With AVX-512, we can use the bitmap result of "a > 10" as a conditional mask into the SIMD execution of "b < 20".
• The main benefit is avoiding loads of "b" and stores of the result bit-map (assuming it is zero-initialized).
• Pipelined operations benefit too (select sum(a) from T where b > 20).
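A scalar emulation of the k-mask flow described above: the result mask of "a > 10" drives the evaluation of "b < 20", so lanes already known to fail never touch b. In a real AVX-512 kernel, k is simply passed to the masked compare instruction.

```c
#include <stdint.h>
#include <stddef.h>

/* Masked two-predicate filter: compute k from the first predicate,
 * then evaluate the second predicate only where k is set. */
static uint64_t masked_filter(const int32_t *a, const int32_t *b, size_t n) {
    uint64_t k = 0;                      /* mask from "a > 10" */
    for (size_t i = 0; i < n; i++)
        k |= (uint64_t)(a[i] > 10) << i;
    uint64_t out = 0;
    for (size_t i = 0; i < n; i++)
        if ((k >> i) & 1)                /* dead lanes skip the b load */
            out |= (uint64_t)(b[i] < 20) << i;
    return out;
}
```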
SIMD : Index Search
• Many modern in-memory indexes use SIMD to scan keys in inner nodes.
• Techniques generally include binary searching and/or linear searching.
• Linear search (simd_cmp_GT('SNAKE')):
– Keys: ANT CAT DOG EEL FISH GOAT LION PIG SNAIL TIGER WHALE ZEBRA
– GT comparison mask: 000000000111, LZCNT = 9 → follow Children[9].
• Binary search:
– For wider inner nodes, start in the middle of the keys and linear search for k keys. If lzcnt is k or 0, then search left or right, respectively.
• ART, FAST, etc. all use some variant of the approaches above.
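A scalar stand-in for the linear-search step: compare the probe key against the node's keys (all at once in the SIMD version) and count the leading "not greater" positions, i.e. the LZCNT of the GT mask, to pick the child slot:

```c
#include <string.h>

/* Child slot for an inner-node lookup: the index of the first
 * key strictly greater than the probe. Equals the leading-zero
 * count of the simd_cmp_GT result mask on the slide. */
static int child_slot(const char *const *keys, int nkeys, const char *probe) {
    int i = 0;
    while (i < nkeys && strcmp(keys[i], probe) <= 0)
        i++;
    return i;
}
```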
SIMD : Hash Joins
• Vectorizing hash joins has been looked at in numerous recent papers:
– "Rethinking SIMD Vectorization for In-Memory Databases", SIGMOD 2015
– "Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last", VLDB 2018
• Hash Join has many components that can benefit from SIMD:
– Hashing (multiplicative hashing), partitioning, Bloom filter, probe, gather and project.
• Probe (2 flavors):
– Compare the probe key to all keys in a hash bucket (chaining, linear probing).
– Gather N keys across multiple HT buckets and compare against N probe keys at once:
• Better efficiency of SIMD lanes – proposed by O. Polychroniou.
SIMD : Bloom Filter
• Bloom filters are used in hash joins for fast filtering during the inner table scan.
• Keys from the build table are hashed and bits are set in a (segmented) bitmap.
• Bloom filter evaluation can be vectorized with SIMD using a combination of instructions – loads, shuffles, shifts, variable shifts, gathers, ands, masks:
– O. Polychroniou et al., "Vectorized Bloom filters for advanced SIMD processors"
• A variation of this is used by Oracle for Key Vector filtering:
– "Accelerating Joins and Aggregations on the Oracle In-Memory Database", ICDE 2018
• A Key Vector is just like a Bloom filter except multiple bits are used depending on the number of possible grouping keys – e.g. 1, 2, 4, 8, or 32.
• SIMD processing alone makes IMA run about 2x faster on SSB queries.
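A minimal sketch of a blocked Bloom-filter probe of the kind described above: two cheap hashes of the join key select bits within one 64-bit block, so the SIMD version needs only one gather per lane plus shifts and ANDs. The hash constant and two-bit scheme are illustrative assumptions, not the paper's exact design.

```c
#include <stdint.h>

/* Insert a build-side key: set two bits in one block. */
static void bloom_add(uint64_t *blocks, uint64_t nblocks, uint64_t key) {
    uint64_t h = key * 0x9E3779B97F4A7C15ull;   /* multiplicative hash */
    uint64_t *b = &blocks[(h >> 32) % nblocks];
    *b |= 1ull << (h & 63);
    *b |= 1ull << ((h >> 6) & 63);
}

/* Probe-side test: the key may match only if both bits are set. */
static int bloom_may_contain(const uint64_t *blocks, uint64_t nblocks,
                             uint64_t key) {
    uint64_t h = key * 0x9E3779B97F4A7C15ull;
    uint64_t b = blocks[(h >> 32) % nblocks];
    uint64_t m = (1ull << (h & 63)) | (1ull << ((h >> 6) & 63));
    return (b & m) == m;
}
```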
SIMD : Key Vector Filtration
• Join keys JK1-JK4 from the fact table are loaded into a SIMD register at lane offsets 0, 32, 64, 96 (assuming the join key is a 4-byte value).
• The example shows looking up join keys in a nibble-packed, segmented key vector:
– The dense key from the KV gives a segment index and a nibble index within the segment.
– Shift the LSB of the nibble index (>>1) to get the byte index in the segment: KV_seg[segment][byte].
– Intel AVX-512 VPSRLVW / VPSRLVD / VPSRLVQ (variable bit shift right logical) plus a mask extracts the nibble.
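A scalar sketch of the nibble lookup the slide vectorizes with VPSRLV: byte index = nibble index >> 1, and the low bit of the nibble index picks the high or low nibble. The low-nibble-first ordering here is an assumption for illustration.

```c
#include <stdint.h>

/* Fetch the 4-bit entry at nibble_idx from a nibble-packed
 * key-vector segment: one byte load, a variable shift (the
 * VPSRLV step per lane), and a 4-bit mask. */
static uint8_t kv_lookup(const uint8_t *kv_seg, uint32_t nibble_idx) {
    uint8_t byte   = kv_seg[nibble_idx >> 1];
    unsigned shift = (nibble_idx & 1) * 4;   /* 0 = low, 4 = high nibble */
    return (byte >> shift) & 0xF;
}
```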
SIMD : Group-By and Aggregation
• Group-By and Aggregation clauses are highly suitable for SIMD processing:
– select a, b, sum(c) from T where <blah> group by a, b
• Hash Group-By is the most common implementation method:
– The grouping key is hashed to index into a hash table where results are accumulated.
• Array-based aggregation is preferable when grouping-key cardinality is relatively low and space is available:
– Conflict detection is still required in order to parallelize aggregation across groups.
• Partition-based aggregation is also suitable – contention is avoided at the cost of slow partitioning.
SIMD : Group-By and Aggregation
select sum(A) from T group by X, Y
• Array-based aggregation:
– Grouping columns X and Y are both dictionary-encoded, with cardinalities 10 and 20.
– Create a multi-dimensional Agg[10][20] buffer to store the aggregate results.
– Agg[Unpack[X]][Unpack[Y]] += Gather[Unpack[A]]
• Partition-based aggregation:
– Sort rows by (Unpack[X], Unpack[Y]), then parallel-sum A up to each change in the composite grouping key.
• Frequency-based aggregation (for software-implemented numbers):
– If grouping key cardinality is low, maintain counts of each distinct A symbol within *each* group. Multiply the count (frequency) by the A value in the dictionary.
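The array-based variant above can be sketched directly: the accumulator is a flat CX*CY array indexed by the composite dictionary code, with no hashing. The SIMD version needs conflict detection when two lanes land on the same group; this scalar sketch (with an assumed CX*CY <= 256 bound and a small driver for reading one cell back) sidesteps that.

```c
#include <stdint.h>
#include <stddef.h>

/* Array-based group-by sum: agg is a flattened [cx][cy] buffer,
 * x and y are dictionary codes, a is the measure column. */
static void array_aggregate(const uint8_t *x, const uint8_t *y,
                            const int32_t *a, size_t n,
                            int64_t *agg, size_t cy) {
    for (size_t i = 0; i < n; i++)
        agg[(size_t)x[i] * cy + y[i]] += a[i];
}

/* Driver: aggregate, then return the sum for group (gx, gy).
 * Assumes cx * cy <= 256 for the stack buffer. */
static int64_t group_sum(const uint8_t *x, const uint8_t *y,
                         const int32_t *a, size_t n,
                         size_t cy, size_t gx, size_t gy) {
    int64_t agg[256] = {0};
    array_aggregate(x, y, a, n, agg, cy);
    return agg[gx * cy + gy];
}
```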
SIMD : Semi-Structured Data Search
• Parsing semi-structured data like JSON can be expensive:
– "Filter Before You Parse: Faster Analytics on Raw Data with Sparser", VLDB 2018
• Apply generic filters on the raw byte-stream first for first-pass filtering:
– E.g. where (A like '%fish%cat%') --> A contains "fish" and A contains "cat"
• Search using SIMD by packing multiple copies of the search string into a register, and create N additional registers representing N shifts of the string:
– [FISH, FISH, …], [_FIS, HFIS, …], [__FI, SHFI, …], [___F, ISHF, …]
• Scan through the raw string and apply SIMD comparisons against the N registers.
• Multiple patterns lend themselves to predicate-reordering optimizations. In the end, merge the results found to eliminate false positives.
SIMD : Sorting and Partitioning
• Optimizing sorting and partitioning with SIMD has been extensively studied:
– "A Comprehensive Study of Main-Memory Partitioning and its Application to Large-Scale Comparison- and Radix-Sort", SIGMOD 2014
– "Efficient implementation of sorting on multi-core SIMD CPU architecture", VLDB 2008
• In-register sorting networks can utilize SIMD; sorted runs are then merged:
– Load the input into SIMD registers, e.g. [50, 12, 85, 8], [30, 501, 29, 15], [9, 100, 6, 28], [98, 105, 40, …].
– Perform a series of SIMD MINs and MAXs so that each column of the registers is sorted vertically.
– Shuffle to transpose the columns back into rows of sorted runs.
• More complicated SIMD sequences by O. Polychroniou for partitioning and comparison sorts utilize comparisons, blends, logic operators, etc.
SIMD : Many Research Areas
• Spatial – "Adaptive Geospatial Joins for Modern Hardware"
• String Search – "Exploiting SIMD instructions in current processors to improve classical string algorithms"
• Skyline Computations – "VSkyline: Vectorization for Efficient Skyline Computation"
• Skip Lists – "Parallelizing Skip Lists for In-memory Multi-core Database Systems"
• Set Intersection – "Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions"
SIMD : What Industry Needs (Observations)
• Simplicity to code (auto-vectorization) without hand-tuned specializations.
• Instructions (or intrinsics) mapping to database operations (e.g. sort).
• SIMD implementations for row-major, variable-width data.
• Portable open-source libraries implementing database operations, optimized for next-generation ISAs.
• Focus on the hard problems – making already blazing-fast scans faster just makes 5% of the CPU path length shorter.
SIMD : What Industry Needs (Observations)
• Unless you're running super-simple single-table queries, you're not bound by memory bandwidth. Database queries are still compute-bound for single-threaded execution.
– The workload can be parallelized across cores to overcome this, but it is still unlikely.
• On the flip side, bandwidth-limited queries are not going to run faster with more compute power – making fast scans faster may be good for a research paper, but not really useful.
• SQL query offload to hardware accelerators (ASIC, FPGA, GPU) sounds good for research, but it's a costly investment with questionable gains.
• We need something ground-breaking and revolutionary. We want MAGIC.
– And ideally supported with auto-vectorization (no more hand-tuning).
SIMD : What Industry Needs (Solutions) : SPARC M8 ISA
• Dictionary Unpack (dictunpk):
– Packed codes AAAB BBCC CDDD EEEF FFGG GHHH … unpack to AAA BBB CCC DDD EEE FFF GGG HHH.
• RLE Decompression (burst):
– A run vector (e.g. 2 3 1 2 3 3 7 2) expands a bit vector (e.g. 1 0 1 1 0) into the decompressed bit stream (e.g. 1 1 0 0 0 1 1 1 1 0 0).
• Unaligned Loads / Stores.
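A scalar sketch of what an RLE burst expansion computes: replicate bit i of the bit vector run[i] times. The packed-into-one-word output is just for the sketch; the example values below are chosen to be self-consistent, since the extracted figure on the slide is ambiguous about which runs are consumed.

```c
#include <stdint.h>
#include <stddef.h>

/* Expand a bit vector by a run-length vector, packing the
 * expanded bits LSB-first into a 64-bit word (total run
 * length must be <= 64 for this sketch). */
static uint64_t burst_expand_packed(const uint8_t *bits,
                                    const uint8_t *runs, size_t n) {
    uint64_t out = 0;
    size_t o = 0;
    for (size_t i = 0; i < n; i++)
        for (uint8_t r = 0; r < runs[i]; r++)
            out |= (uint64_t)bits[i] << o++;
    return out;
}
```

E.g. bits [1,0,1,1,0] with runs [2,3,1,3,2] expand to 1 1 0 0 0 1 1 1 1 0 0.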
SIMD : What Industry Needs (Solutions) : Intel ISA
• More optimal in-register aggregation with a cheaper VPCONFLICT:
– Given a vector A full of group-IDs within range [0..N], where N <= 16, shuffle elements in vector B to corresponding positions in vector C:

    pending_elem = 0xFFFF;
    do {
        curr_elem = get_conflict_free_subset(grouping_keys, pending_elem)
        vals = SHUFFLE(values, grouping_keys, curr_elem)
        ADD(sum, vals);
        pending_elem = pending_elem ^ curr_elem
    } while (pending_elem)

    (See https://en.wikichip.org/w/images/d/d5/Intel_Advanced_Vector_Extensions_2015-2016_Support_in_GNU_Compiler_Collection.pdf)
– VBMI (Vector Byte Manipulation Instructions) has (byte-to-byte) cross-lane shuffles.
– Still requires conflict detection, so VPCONFLICT must be made cheaper:
• VPCONFLICT will determine if any two elements in the SIMD register are identical (i.e. conflict), but it has very high latency (and it's not pipeline-able, I believe).
SIMD : What Industry Needs (Solutions) : Intel ISA
• Vector-to-immediate mask creation:
– Oftentimes a permutation table is allocated to maintain shuffle masks.
– Large tables are both space-consuming and expensive to index into (cache misses).
– Special instructions to compute masks from a vector of lengths would be useful:
• Intel has previously indicated that this is extremely expensive to wire up in the SIMD unit.
• Still investigating whether the existing ISA can solve the problem.
• Non-temporal loads:
– Avoid cache pollution throughout the cache hierarchy.
– E.g. streaming in probe-table join keys should not evict the build hash table.
– CPU cache partitioning (Intel's Cache Allocation Technology) is a temporary remedy:
• "Accelerating Concurrent Workloads with CPU Cache Partitioning" (S. Noll, et al.)
SIMD : What Industry Needs : Other Thoughts
• Indexes – Can we parallelize index lookup across multiple keys?
• Group-By – Can we maintain group-by aggregates entirely in register(s)?
• Compression – Can we compress using a vector of lengths (and not a k-mask)?
• Registers – Can we have a larger virtual register by joining N SIMD registers?
• Gather – Can we gather bits instead of words and quad-words?
• Gather – Can we get faster performance rather than just N loads/stores?
SIMD : ASICs, GPUs, FPGAs
• Oracle created specialized hardware for data processing (M7, M8):
– Small investment (1% of die), big benefit over the SPARC ISA (at the time), but…
– Producing the ASIC, maintaining/revving it, and producing software for it is expensive.
– The alternative was to create larger SIMD units (like Intel) – that would take 50% of the die.
• GPU databases like MapD can run table scans blazing fast (300 GB/sec). Line up multiple cards to scale, but beyond that you run into the PCIe bandwidth wall.
– Complex operations can stall threads with branching. Heavy decompression is not really possible currently. It is much easier to implement with general-purpose x86 and try to maximize memory bandwidth with powerful compute.
• FPGAs have larger capacity (onboard HBM and DDR4 channels). Hybrid CPU-FPGA solutions (and CPU-GPU for that matter) are being designed:
– E.g. the build table is constructed on the CPU, and the probe is offloaded to the FPGA.
Thank You