1 SCIENCE PASSION TECHNOLOGY Architecture of DB Systems

1 SCIENCE PASSION TECHNOLOGY Architecture of DB Systems 07 Compilation and Parallelization Matthias Boehm Graz University of Technology, Austria Computer Science and Biomedical Engineering Institute of Interactive Systems and Data Science BMK endowed chair for Data Management Last update: Dec 02, 2020

2 Announcements/Org § #1 Video Recording § Link in Teach. Center & TUbe (lectures will be public) § Optional attendance (independent of COVID) § #2 COVID-19 Restrictions (HS i 5) § Corona Traffic Light: RED § Temporarily webex lectures and recording § #3 Open Ph. D Position (03/2021) mboehm@tugraz. at § Part of Re. Waste F project (Recycling and Recovery of Waste for Future) § System infrastructure for consolidated and federated storage and analysis § Data integration, data provenance, data cleaning, domain-specific analysis 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Overview Query Processing 3 Recap: Overview Query Processing SELECT * FROM Top. Scorer WHERE Count>=4 Parsing AST/IR Name Count James Rodríguez Thomas Müller Robin van Persie Neymar 6 5 4 4 Semantic Analysis IR Query Rewrites Plan Execution IR Plan Optimization Compile Time QEP Plan Caching Runtime 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

4 Agenda § Vectorization and SIMD § Query Compilation § Query Parallelization 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

5 Vectorization and SIMD Instruction-level Parallelism (aka Vectorization) Vectorized Execution Model Cache-friendly / Auto-SIMD 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Vectorization and SIMD 6 Terminology § Flynn’s Classification § SISD, SIMD § (MISD), MIMD [Michael J. Flynn, Kevin W. Rudd: Parallel Architectures. ACM Comput. Surv. 28(1) 1996] Singe Data Multiple Data Singe Instruction SISD (uni-core) SIMD (vector) Multiple Instruction MISD (pipelining) MIMD (multi-core) § Example: SIMD Processing § Streaming SIMD Extensions (SSE) § Process the same operation on multiple elements at a time (packed vs scalar SSE instructions) § Data parallelism (aka: instruction-level parallelism) § Example: VFMADD 132 PD 2009 Nehalem: 128 b (2 x. FP 64) 2012 Sandy Bridge: 256 b (4 x. FP 64) 2017 Skylake: 512 b (8 x. FP 64) c = _mm 512_fmadd_pd(a, b); a b c 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21
![Vectorization and SIMD 7 Background Vector Processors [https: //www. computerhistory. org/tdih/september/28/] § GRAY-1 § Vectorization and SIMD 7 Background Vector Processors [https: //www. computerhistory. org/tdih/september/28/] § GRAY-1 §](http://slidetodoc.com/presentation_image_h2/59592ea6ae2b6201517e92e46d629d58/image-7.jpg)
Vectorization and SIMD 7 Background Vector Processors [https: //www. computerhistory. org/tdih/september/28/] § GRAY-1 § 8 x (64 elements x 8 B) vector registers [Richard M. Russell: The § INT and FP arithmetic @ 80 MHz CRAY-1 Computer System. § Vector and scalar instructions CACM 21(1) 1978] § NEC Vector Engine v 2 20 A/20 B § 8/10 vector cores w/ scalar/vector processing units § Vector width: 256 x 8 B = 16, 384 bit § 1. 6 GHz, 3. 07/6. 14 TFLOPs, 1. 53 TB/s [https: //en. wikichip. org/wiki/nec/microarchitectures/sx-aurora] @ZAMG § Others: CPUs, GPUs, FPGAs, DSPs [https: //www. zamg. ac. at/cms/en/images/ weather/nec/image_view_fullscreen] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Vectorization and SIMD 8 SIMD Data Processing [Jingren Zhou, Kenneth A. Ross: Implementing database operations using SIMD instructions. SIGMOD 2002] § Overview § Process multiple elements at once § Avoid conditional branch instructions § Assuming column-wise storage, and vectors of fixed-sized values § Example Selection § 16 x 32 b x 7 1 2 9 3 8 6 7 3 4 9 2 4 5 6 9 mask = x>=5 1 0 0 1 1 1 V = 77391 SIMD_bitmap(mask) // [0, 2^(S-1)] All match extraction from y 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Vectorization and SIMD 9 SIMD Data Processing, cont. § Example Aggregations § Convert non-matched elements to zero § Aggregate into vector register, final agg/extraction [Jingren Zhou, Kenneth A. Ross: Implementing database operations using SIMD instructions. SIGMOD 2002] § Auto Vectorization § GCC 7. 2 § Clang 5. 0 § ICC 18 [Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, Peter A. Boncz: : Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. PVLDB 11(13) 2018] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Vectorization and SIMD 10 Vectorized Execution (vector-at-a-time) § Motivation § Iterator Model: many function calls, no instruction-level parallelism § Materialized: mem-bandwidth-bound § Hyper-Pipelining § Operators work on vectors § Pipelining of vectors (sub-columns) § Vector sizes according to cache size § Pre-compiled function primitives Generalization of execution strategies [Peter A. Boncz, Marcin Zukowski, Niels Nes: Monet. DB/X 100: Hyper-Pipelining Query Execution. CIDR 2005] for(int i=0; i<n; i++) out[i] = in[i]<L [Marcin Zukowski, Peter A. Boncz, Niels Nes, Sándor Héman: Monet. DB/X 100 - A DBMS In The CPU Cache. IEEE Data Eng. Bull. 28(2), 2005] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

11 Query Compilation Holistic Query Evaluation Data-centric Query Evaluation Compilation and/or Vectorization 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 12 Query Compilation Motivation § Background § Traditional DBMS assume data >> main memory (I/O dominates) § Modern in-memory DBMS CPU/memory efficiency crucial § Example SELECT sum(price*(1+tax)) FROM Orders WHERE oid >= 100 AND oid <= 200 GROUPBY category compile for(int i = 0; i < N; i++) if(oid[i] >= 100 && oid[i] <= 200) ret[category] += price[i]*(1+tax[i]); [Juliusz Sompolski, Marcin Zukowski, Peter A. Boncz: Vectorization vs. compilation in query execution. Da. Mo. N 2011] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 13 Holistic Query Evaluation § Query Processing Architecture [Konstantinos Krikellas, Stratis Viglas, Marcelo Cintra: Generating code for holistic query evaluation. ICDE 2010] [Konstantinos Krikellas: The case for holistic query evaluation, Ph. D Thesis, University of Edinburgh, 2010] § HIQUE: Holistic Integrated Query Engine § Holistic: Query-awareness + HW-awareness § Codegen as underlying principle of efficient query evaluation Codegen and compilation step 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 14 Holistic Query Evaluation, cont. § Code Generation Approach § #1 Data Staging: input tables, selection, projection, pre-processing § #2 Holistic Query Instantiation: join, group-by, order-by // function prototypes. . . σ Q 4 γSUM(S_Qty); Q 3 ⋈P_PID=S_PID S_Date>=‘ 2011 -01 -01‘ Products Q 1 P_Name Sales Q 2 Materialized Intermediates // function implementations Result execute. Query() { tmp 1 = execute. Q 1(…); tmp 2 = execute. Q 2(…); tmp 3 = execute. Q 3(tmp 1, tmp 2, …); return execute. Q 4(tmp 3, …); } … … execute. O 1(…){…} execute. O 2(…){…} execute. O 3(…){…} execute. O 4(…){…} 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 15 Holistic Query Evaluation, cont. § Code Generation Approach, cont. § Types: known attribute types no separate function calls (access, eval) § Size: fixed-length tuples direct access, cache-conscious blocking § Operations: interleaved operations on cached data § #1 Data Staging σ x=7 R 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 16 Holistic Query Evaluation, cont. § #2 Holistic Query Instantiation § Join Teams § Join operators with predicate on same attribute § Single generated function § Alternatives § Holistic nested loop join § Holistic merge join (cooperative staging) § Holistic partitioned join § Holistic hybrid hashsort-merge join M=1 for merge join 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 17 Holistic Query Evaluation, cont. § Runtime Break-Down 26% instructions 37% data access 1% function calls § Compiler Optimizations 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 18 Holistic Query Evaluation, cont. § Comparison TPC-H Queries § Code Generation Overhead Compilation time dominates execution time 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 19 Data-centric Query Evaluation § Motivation § Algebraic operator model useful for reasoning, but not necessarily a good idea for query processing § Code compilation overhead Materialized Intermediates [Thomas Neumann: Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB 2011] ⋈ ⋈ ⋈ R ⋈ ⋈ U ⋈ T S R § Data-centric Processing (not operator-centric) § Keep data in CPU registers as long as possible (no op boundaries) § Data is pushed towards operators (code and data locality) § Queries are compiled into native machine code using LLVM 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21 U T S

Query Compilation Data-centric Query Evaluation, cont. 20 § Example Plan with Pipeline Boundaries § Pipeline breaker: op takes a tuple out of register § Full pipeline breaker: blocking op Compiled Query (not LLVM) ⋈a=b ⋈z=c σx=7 R SELECT * FROM R 1, R 3, (SELECT R 2. z, count(*) FROM R 2 WHERE R 2. y = 3 GROUP BY R 2. z) R 2 WHERE R 1. x = 7 AND R 1. a = R 3. b AND R 2. z = R 3. c γCOUNT(*); Z σy=3 R 2 R 3 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 21 Data-centric Query Evaluation, cont. § Data-Centric Operator Model § Conceptual data-centric operator model, used during compilation § produce(): produce result tuples § consume(attributes, source): receive input tuples § Example σx=7 ⋈a=b R 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 22 Data-centric Query Evaluation, cont. § Example LLVM Fragment: γCOUNT(*); Z(σy=3(R 2)) 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 23 Data-centric Query Evaluation, cont. § Experiments § TPC-CH (extended TPC-C+TPC-H) § Code Quality § Instruction cache misses (L 1 i) § Data cache miss (L 1 d, L 2 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 24 Other Systems w/ Query Compilation § IEEE Data Engineering Bulletin [http: //sites. computer. org/ debull/A 14 mar/issue 1. htm] Hy. Per HIQUE Hekaton Impala Lego. Base/ Sca. Lite 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21
![Query Compilation 25 Specialized Code Generation § Improved Branch Prediction § No-branch: pos+=pred(data[i]) § Query Compilation 25 Specialized Code Generation § Improved Branch Prediction § No-branch: pos+=pred(data[i]) §](http://slidetodoc.com/presentation_image_h2/59592ea6ae2b6201517e92e46d629d58/image-25.jpg)
Query Compilation 25 Specialized Code Generation § Improved Branch Prediction § No-branch: pos+=pred(data[i]) § Hash table lookup [Thomas Neumann: Efficiently Compiling Efficient Query Plans for Modern Hardware. PVLDB 2011] existing entry (~ true) end of chain (~ false) § SIMD Loop Tiling and Fission § Loop tiling (vectorization) for SIMD § Loop fission into parallel and serial ops [Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Carsten Binnig, Ugur Çetintemel, Stan Zdonik: An Architecture for Compiling UDFcentric Workflows. PVLDB 8(12) 2015] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 26 Compilation w/ SIMD, Prefetch, Decompress § Relaxed Operator Fusion (ROF) [Prashanth Menon, Andrew Pavlo, Todd C. Mowry: Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last. PVLDB 11(1) 2017] § Introduce buffered stage boundary for vectorized execution § SIMD operations after boundary (w/ repacking after SIMD ops) § Prefetching before boundary § Data Blocks [Harald Lang el al: Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation. SIGMOD 2016] § Hot and cold (compressed) blocks § SIMD predicated evaluation on blocks, output unpacking into vectors of 8192 tuples § Vector tuples fed into JIT-compiled pipelines 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 27 Compilation vs Vectorized Execution § Motivation [Timo Kersten et al. : Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. PVLDB 11(13) 2018] § Sot. A: compilation or vectorization § Typer: Test data-centric query eval (Hy. Per) § Tectorwise: Test vectorized eval (Vector. Wise) § Selected Results § Neither system clearly dominated § Both with large differences to others 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21
![Query Compilation 28 Excursus: System. DS Codegen Practical, exact, cost-based optimizer [CIDR’ 17] (w/ Query Compilation 28 Excursus: System. DS Codegen Practical, exact, cost-based optimizer [CIDR’ 17] (w/](http://slidetodoc.com/presentation_image_h2/59592ea6ae2b6201517e92e46d629d58/image-28.jpg)
Query Compilation 28 Excursus: System. DS Codegen Practical, exact, cost-based optimizer [CIDR’ 17] (w/ fuse-all heuristic) - Lacked maintainability - Poor plans for complex DAGs and local/distributed operations [Tarek Elgamal, Shangyu Luo, Matthias Boehm, [Matthias Boehm, Berthold Reinwald, Dylan Alexandre V. Evfimievski, Shirish Tatikonda, Hutchison, Prithviraj Sen, Alexandre V. Berthold Reinwald, Prithviraj Sen: SPOOF: Sum. Evfimievski, Niketan Pansare: On Optimizing Product Optimization and Operator Fusion for Operator Fusion Plans for Large-Scale Machine 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, Parallelization Large-Scale Machine Learning. CIDR 2017] Learning inand System. ML. PVLDB 11(12) 2018] Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 29 Excursus: System. DS Codegen – Ex. L 2 SVM § L 2 SVM Inner Loop while(continue. Outer & iter<maxi) { #. . . while(continue. Inner) { out = 1 -Y* (Xw+step_sz*Xd); sv = (out > 0); out = out * sv; g = wd + step_sz*dd - sum(out * Y * Xd); h = dd + sum(Xd * sv * Xd); step_sz = step_sz - g/h; } }. . . # of Vector Intermediates Base (w/o fused ops): 10 Fused (w/ fused ops): 4 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 30 Excursus: System. DS Codegen – Ex. L 2 SVM § Template Skeleton § § T: Cell, MAgg, Row, Outer Data access, blocking Multi-threading Final aggregation public final class TMP 25 extends Spoof. MAgg { public TMP 25() { super(false, Agg. Op. SUM); } protected void genexec(double a, Side. Input[] b, double[] scalars, double[] c, . . . ) { double TMP 11 = get. Value(b[0], row. Index); double TMP 12 = get. Value(b[1], row. Index); double TMP 13 = a * scalars[0]; double TMP 14 = TMP 12 + TMP 13; double TMP 15 = TMP 11 * TMP 14; double TMP 16 = 1 - TMP 15; double TMP 17 = (TMP 16 > 0) ? 1 : 0; double TMP 18 = a * TMP 17; double TMP 19 = TMP 18 * a; double TMP 20 = TMP 16 * TMP 17; double TMP 21 = TMP 20 * TMP 11; double TMP 22 = TMP 21 * a; c[0] += TMP 19; c[1] += TMP 22; # of Vector Intermediates } } Gen (codegen ops): 0 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Compilation 31 Excursus: System. DS Codegen – Ex. MLog. Reg § MLogreg Inner Loop (main expression on feature matrix X) Q = P[, 1: k] * (X %*% v) H = t(X) %*% (Q - P[, 1: k] * row. Sums(Q)) public final class TMP 25 extends Spoof. Row { public TMP 25() { super(Row. Type. COL_AGG_B 1_T, true, 5); } protected void genexec. Dense(double[] a, int ai, Side. Input[] b, double[] c, . . . , int len) { double[] TMP 11 = get. Vector(b[1]. vals(rix), . . . ); double[] TMP 12 = vect. Mat. Mult(a, b[0]. vals(rix), . . . ); “vectorized double[] TMP 13 = vect. Mult(TMP 11, TMP 12, 0, 0, . . . ); row ops” double TMP 14 = vect. Sum(TMP 13, 0, TMP 13. length); double[] TMP 15 = vect. Mult(TMP 11, TMP 14, 0, . . . ); double[] TMP 16 = vect. Minus(TMP 13, TMP 15, 0, 0, . . . ); vect. Outer. Mult. Add(a, TMP 16, c, ai, 0, 0, . . . ); } protected void genexec. Sparse(double[] avals, int[] aix, int ai, Side. Input[] b, . . . , int len) {. . . } } 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

32 Query Parallelization Intra- and Inter-Operator Parallelism Fine-grained Pipeline Parallelism Workload Management / Inter-Query Parallelism 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Parallelization 33 Overview Query Parallelism § Types of Parallelism Inter. Query Inter. Operator Intra. Operator Queries and TXs Bushy Plans Pipelining Data Partitions MT Ops, SIMD Multi-threaded / Distributed § Beware: Danger of Interference § #1 Locks and latches on hot data items increasing TX aborts § #2 Temporary memory/IO requirements (see 03 buffer pool) § #3 CPU and cache interference (e. g. , context switches) § #4 Throughput vs latency vs freshness vs fairness vs priorities Dedicated DB workload management & DB schedulers 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Parallelization 34 Recap: Iterator Model § Volcano Iterator Model Scalable (small memory) High CPI measures [Goetz Graefe: Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng. 1994] § Pipelined & no global knowledge § Open-Next-Close (ONC) interface § Query execution from root node (pull-based) § Example σA=7(R) open() next() EOF close() void open() { R. open(); } void close() { R. close(); } Record next() { while( (r = R. next()) != EOF ) if( p(r) ) //A==7 return r; return EOF; } § Blocking Operators open() next() close() open() next() close() σA=7 EOF R EOF Postgre. SQL: Init(), Get. Next(), Re. Scan(), Mark. Pos(), § Sorting, grouping/aggregation, 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization End() Restore. Pos(), build-phase of (simple) joins of Technology, WS 2020/21 Matthias Boehm, hash Graz University

Query Parallelization 35 Intra- and Inter-Operator Parallelism § Overview [Goetz Graefe: Encapsulation of Parallelism in the Volcano Query Processing System. SIGMOD 1990] § Seamless parallelization in iterator model via dedicated exchange operator § Avoid unnecessary overhead for local subplans γB, count(*) § Inter-Operator Parallelism § § Vertical parallelism in terms of pipelining Open: create new process Next: transfer packets of records (1. . 32, 000) Close: shutdown child processes σ(R) ⋈ γ XCHG ⋈ XCHG σA=7 S R 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Parallelization 36 Intra- and Inter-Operator Parallelism, cont. § [Goetz Graefe: Encapsulation of Parallelism in the Volcano Query Processing System. SIGMOD 1990] XCHG ⋈ XCHG σA=7 R 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21 XCHG S

Query Parallelization 37 Excursus: Map. Reduce – Execution Model Input CSV files (stored in HDFS) CSV File 1 CSV File 2 CSV File 3 Split 11 Map-Phase map task Split 12 map task Split 21 map task Split 22 map task Split 31 Split 32 map task #1 Data Locality (delay sched. , write affinity) #2 Reduced shuffle (combine) #3 Fault tolerance (replication, attempts) [Reduce-Phase] Output Files (HDFS) reduce task Out 1 reduce task Out 2 reduce task Out 3 Shuffle, Merge, [Combine] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Sort, [Combine], [Compress] w/ Matthias Boehm, Graz University of Technology, WS 2020/21 #reducers = 3

Query Parallelization 38 Fine-grained Parallelism § Motivation [Viktor Leis, Peter A. Boncz, Alfons Kemper, Thomas Neumann: Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age. SIGMOD 2014] § Non-uniform memory architecture (NUMA) § Load imbalance / serial fraction due to plan-driven parallelism § Scheduler (dispatcher) § Fixed number of workers to avoid over-provisioning § Morsel: segment of tuples (e. g. , 100 K) § Task: operator pipeline on morsel § Task distribution at runtime w/ static partitioning + work stealing § NUMA data locality § Hybrid Interpreted/compiled § Exchange plans at morsel granularity [André Kohn, Viktor Leis, Thomas Neumann: Adaptive Execution of Compiled Queries. ICDE 2018] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Parallelization 39 Fine-grained Parallelism, cont. § Motivation, cont. [Kayhan Dursun, Carsten Binnig, Ugur Çetintemel, Garret Swart, Weiwei Gong: A Morsel-Driven Query Execution Engine for Heterogeneous Multi. Cores. PVLDB 12(12) 2019] § Dark silicon due to power and thermal constraints § Sparc M 7 platform w/ on-die ASIC, Data Analytics Accelerator (DAX) § Extensions § Pipeline decomposition for function-specific cores § Cost-based work submission to accelerator § DAX: scan&filter, select, semi-join § Similar Abstractions for CPU/GPU balancing 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Parallelization 40 Workload Management § Example: DB 2 Workload Management § Concurrency thresholds for incoming requests § Stop/continue/remap on violated thresholds § Map DB 2 service classes to Linux classes § Linux cgroups (control groups) for resource isolation [https: //www. ibm. com/support/knowledge center/SSEPGG_11. 5. 0/com. ibm. db 2. luw. ad min. wlm. doc/c 0053451. html] [https: //www. ibm. com/support/knowledge center/SSEPGG_11. 5. 0/com. ibm. db 2. luw. ad min. wlm. doc/c 0053465. html] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Parallelization 41 Workload Management, cont. § Example H-Store Volt. DB § Cluster of single-threaded storage and execution engines § No disk-based logging or locking [Robert Kallman et al. : H-store: a highperformance, distributed main memory transaction processing system. PVLDB 1(2) 2008] [Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, Pat Helland: The End of an Architectural Era (It's Time for a Complete Rewrite). VLDB 2007] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

Query Parallelization 42 Workload Management – Prioritization § SAP HANA [Iraklis Psaroudakis, Florian Wolf, Norman May, Thomas Neumann, Alexander Böhm, Anastasia Ailamaki, Kai-Uwe Sattler: Scaling Up Mixed Workloads: A Battle of Data Freshness, Flexibility, and Scheduling. TPCTC 2014] § Main column store and delta CSB tree “[…] the default configuration of SAP § Thread pool for network clients HANA favors analytical throughput § Scheduler for heavy-weight requests over transactional throughput” (single- or multi-task intra-query parallelism) § UDFs via #pragma omp parallel for reduction(+: nnz) for (int i = 0; i < N; i++) { int thread. ID = omp_get_thread_num(); R[i] = foo(A[i], thread. ID); nnz += (R[i]!=0) ? 1 : 0; } § Open. MP (since 1997, Open Multi-Processing) § DOALL parallel loops (independent iterations) § SAP HANA: custom Open. MP backend for intercepting tasks DB job scheduler (w/ priorities) [Florian Wolf, Iraklis Psaroudakis, Norman May, Anastasia Ailamaki, Kai-Uwe Sattler: Extending database task schedulers for multi-threaded application code. SSDBM 2015] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21

43 Summary and Q&A § Vectorization and SIMD § Query Compilation § Query Parallelization § Next Lectures (Part B) § 08 Query Optimization I (normalization, rewrites, unnesting) [Dec 09] § 09 Query Optimization II (cost models, join ordering) [Dec 09] § 10 Adaptive Query Processing [Dec 16] 706. 543 Architecture of Database Systems – 07 Vectorization, Compilation, and Parallelization Matthias Boehm, Graz University of Technology, WS 2020/21
- Slides: 43