IBM Research
Parallel Querying with Non-Dedicated Nodes
Vijayshankar

Properties of a relational database
§ Ease of schema evolution
§ Declarative querying
§ Transparent scalability: does not quite work
Parallel Querying with Non-Dedicated Computers Aug 30 2005 © 2005 IBM Corporation

Today: Partitioning is the basis for parallelism
§ Static partitioning (on the base tables)
§ Dynamic partitioning via exchange operators
§ Claim: partitioning does not handle non-dedicated nodes well
[Figure: partitions L1, O1, Sa; L2, O2, Sb; L3, O3, Sc]

Problems of partitioning
§ Hard to scale incrementally
  – Data must be re-partitioned
  – Disk and CPU must be scaled together
    • DBA must ensure partition-CPU affinity
§ Homogeneity assumptions
  – Same plan runs on each node
  – Identical software needed on all nodes
§ Susceptible to load variations, node failures / stalls, …
  – Response time is dictated by the speed of the slowest processor
  – Bad for transient compute resources
    • E.g., we want the ability to interrupt query work with higher-priority local work
[Figure labels: initial partitioning, exchange]

GOAL: A more graceful scale-out solution
Sacrifice partitioning for scalability
  – Avoid initial partitioning
  – No exchange
New means for work allocation in the absence of partitioning
  – Handles heterogeneity and load variations better
§ Two design features
  – Data In The Network (DITN)
    • Shared files on high-speed networks (e.g., SAN)
  – Intra-fragment parallelism
    • Send SQL fragments to heterogeneous join processors: each performs the same join over a different subset of the cross-product space
    • Easy fault tolerance
    • Can use heterogeneous nodes: whatever is available at that time

Outline
§ Motivation
§ DITN design
§ Experimental results
§ Summary

DITN Architecture
1. Find idle co-processors P1, P2, P3, P4, P5, P6
2. Prepare O, L, C
3. Logically divide O x L x C into work-units Wi
4. In parallel, run SQL queries for Wi at Pi
5. Property: SPJAG(O x L x C) = AG(∪i SPJAG(Wi))
Restrictions (will return to this at the end)
§ Pi cannot use indexes at the information integrator
§ Isolation issues
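The property in step 5 can be checked on a toy example. The sketch below is only an illustration, not the DITN implementation: the tables, the `spjag` and `cut` helpers, and the 2 x 3 grid are all hypothetical, and the aggregate is a SUM, which is distributive (an aggregate such as AVG would first have to be decomposed into SUM and COUNT).

```python
from collections import Counter
from itertools import product

# Hypothetical toy tables: orders(o_key, priority), lineitem(o_key, qty)
orders = [(1, "HIGH"), (2, "LOW"), (3, "HIGH"), (4, "LOW")]
lineitem = [(1, 5), (1, 7), (2, 1), (3, 4), (4, 2), (4, 9)]

def spjag(o_rows, l_rows):
    """Select-project-join-aggregate on one work-unit: SUM(qty) GROUP BY priority."""
    totals = Counter()
    for (o_key, priority), (l_key, qty) in product(o_rows, l_rows):
        if o_key == l_key:  # join predicate
            totals[priority] += qty
    return totals

def cut(rows, parts):
    """Split rows into `parts` contiguous, nearly equal slices."""
    n = len(rows)
    return [rows[i * n // parts:(i + 1) * n // parts] for i in range(parts)]

# Work-units Wi: every cell of a 2 x 3 grid over the cross-product space
work_units = list(product(cut(orders, 2), cut(lineitem, 3)))

# AG step: merge the partial aggregates returned by the co-processors
combined = Counter()
for o_part, l_part in work_units:
    combined.update(spjag(o_part, l_part))  # Counter.update adds counts

full = spjag(orders, lineitem)  # the same query run on one node
print(combined == full)
```

Each work-unit touches only its own pair of slices, so no tuples move between co-processors; the only combining step is the final merge.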

Why Data in the Network
§ Observation: network bandwidth >> query operator bandwidth
  – Network bandwidth: in Gbps (SAN/LAN); scan: 10-100 Mbps; sort: about 10 Mbps
  – Interconnect transfers data faster than query operators can process it
§ But exploiting this fast interconnect via SQL is tricky
  – E.g., ODBC scan: 10x slower than local scan
§ Instead, keep temp files in a shared storage system (e.g., SAN-FS)
  – Allows exploitation of the full network bandwidth
§ Immediate benefits
  – Fast data transfer
  – DBMS doesn't have to worry about disks, I/O parallelism, parallel scans, etc.
  – Independent scaling of CPU and I/O

Work Allocation without Partitioning
§ For each join, we now have to join the off-diagonal rectangles as well
§ Minimize response time: RT = max(RT of each work-unit) = max_{i,j} JoinCost(|Li|, |Oj|)
§ How to optimize the work allocation?
  – Roughly: cut the join hyper-rectangle into n pieces so as to minimize the maximum perimeter
  – Simplification: assume that the join is cut into a grid
    • Choices: number of cuts on each table, size of each cut, allocation of work-units to processors

Allocation to homogeneous processors
§ Theorem: For monotonic JoinCost, RT is minimized when each cut (on a table) is of the same size
§ So allocation is done into rectangles of size |T1|/p1, |T2|/p2, …, |Tn|/pn
§ Theorem: For symmetric JoinCost, RT is minimized when |T1|/p1 = |T2|/p2 = … = |Tn|/pn
§ E.g., with 10 processors, cut Lineitem into 5 parts and Orders into 2
§ Note: cutting each table into the same number of partitions (as is usually done) is sub-optimal
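The Lineitem/Orders example can be reproduced with a small search over the ways of factoring the processor count. This is a sketch under stated assumptions: the linear `join_cost` is a stand-in chosen because it is both monotonic and symmetric (it is not the cost model from the talk), the row counts are the approximate TPC-H scale-factor-1 cardinalities, and `factor_pairs` and `best_cuts` are hypothetical helper names.

```python
def factor_pairs(p):
    """All (p1, p2) with p1 * p2 == p."""
    return [(a, p // a) for a in range(1, p + 1) if p % a == 0]

def join_cost(l_rows, o_rows):
    # Assumed cost model: linear in both input sizes (monotonic and symmetric).
    return l_rows + o_rows

def best_cuts(size_l, size_o, processors, cost=join_cost):
    """Number of equal-size cuts per table that minimizes the per-work-unit cost."""
    return min(
        factor_pairs(processors),
        key=lambda cuts: cost(size_l / cuts[0], size_o / cuts[1]),
    )

# Approximate TPC-H scale-factor-1 cardinalities
L_ROWS, O_ROWS = 6_000_000, 1_500_000

best = best_cuts(L_ROWS, O_ROWS, 10)
print(best)  # (cuts on Lineitem, cuts on Orders)
```

Under this assumed cost the search picks 5 cuts on Lineitem and 2 on Orders, the factorization that brings |L|/p1 and |O|/p2 closest to each other, in line with the second theorem.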

Allocation to heterogeneous co-processors
§ Response time of query: RT = max(RT of each work-unit)
  Choose the size of each work-unit, and the allocation of work-units to co-processors, so as to minimize RT
§ Like a bin-packing problem
  – Solve for the number of cuts on each table, assuming homogeneity
  – Then solve a linear program to find the optimal size of each cut
  – Some approximations are needed to avoid an integer program (see paper)
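To see why cut sizes should follow co-processor speed, consider a deliberately simplified special case, not the linear program from the paper: a single table is cut across the co-processors, and the time for a work-unit is assumed to be its size divided by the co-processor's speed. Minimizing the maximum of size/speed under a fixed total then has a closed form, with sizes proportional to speed. The speeds and helper names below are hypothetical.

```python
def proportional_cuts(table_rows, speeds):
    """Cut sizes proportional to speed: equalizes size/speed across co-processors."""
    total_speed = sum(speeds)
    return [table_rows * s / total_speed for s in speeds]

def response_time(cuts, speeds):
    """RT = max over work-units, assuming time = rows / speed (an assumption)."""
    return max(c / s for c, s in zip(cuts, speeds))

# Hypothetical relative speeds: 2 fast nodes and 2 slow nodes
speeds = [2.0, 2.0, 1.0, 1.0]
table_rows = 6_000_000

asymmetric = proportional_cuts(table_rows, speeds)
symmetric = [table_rows / len(speeds)] * len(speeds)

print(response_time(asymmetric, speeds), response_time(symmetric, speeds))
```

With equal cuts the slow nodes set the response time; with proportional cuts every co-processor finishes at the same moment, which is the intuition behind asymmetric allocation.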

Failure/Stall Resiliency by Work-Unit Reassignment
Without tuple shipping between plans, failure handling is easy
§ If co-processors A, B, C have finished by time X, and co-processor D has not finished by time X(1+f)
  – Take D's work-unit and assign it to the fastest among A, B, C (say A)
  – When either D or A returns, close the cursor on the other
§ Can generalize to a work-stealing scheme
  – E.g., with 10 co-processors, assign each 1/20th of the cross-product space
  – When a co-processor returns with a result, assign it more work
§ Tradeoff: finer work allocation => more flexible work-stealing, BUT more redundant work
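The timeout rule can be sketched as a small timing model. Everything here is illustrative: the function names and the numbers are hypothetical, "fastest" is taken to mean the co-processor that finished earliest, and the rerun duration is simply given rather than measured.

```python
def reassign_on_stall(finish_times, f):
    """Return (deadline, target) for a straggler's work-unit.

    finish_times maps each finished co-processor to its completion time.
    X is the time by which all of them finished; the deadline is X * (1 + f).
    """
    x = max(finish_times.values())
    deadline = x * (1 + f)
    target = min(finish_times, key=finish_times.get)  # earliest finisher
    return deadline, target

def query_completion(finish_times, straggler_finish, rerun_duration, f):
    """Time at which the straggler's work-unit is done, with reassignment."""
    deadline, target = reassign_on_stall(finish_times, f)
    if straggler_finish <= deadline:
        return straggler_finish, None  # no reassignment needed
    # Both copies run; whichever returns first wins and the other is cancelled.
    return min(straggler_finish, deadline + rerun_duration), target

done = {"A": 8.0, "B": 9.0, "C": 10.0}  # hypothetical completion times
finish, helper = query_completion(done, straggler_finish=40.0, rerun_duration=8.0, f=0.5)
print(finish, helper)
```

In this example D would have taken until time 40; the unit is handed to A at the deadline of 15 and completes at 23. The cost is that the work D had already done on the unit is repeated, which is the redundancy the tradeoff bullet refers to.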

Analysis: What do we lose by not partitioning?
§ Say join of L x O x C (TPC-H) with 12 processors: 12 = p1 p2 p3
§ RT without partitioning ~ JoinCost(|L|/p1, |O|/p2, |C|/p3)
§ RT with partitioning ~ JoinCost(|L|/(p1 p2 p3), |O|/(p1 p2 p3), |C|/(p1 p2 p3))
§ At p1=6, p2=2, p3=1, loss in CPU speedup is JoinCost(|L|/6, |O|/2, |C|) ~ 2 JoinCost(|L|/12, |O|/12, |C|/12)
§ Note: I/O speedup is unaffected
Optimization: selective clustering
§ Can close the gap with partitioning further
§ Sort the largest tables of the join (e.g., L and O) on their join column
  – Now, loss is: JoinCost(|L|/12, |O|/12, |C|) / JoinCost(|L|/12, |O|/12, |C|/12)
§ Still avoids exchange => can use heterogeneous, non-dedicated nodes, but causes problems with isolation
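The two loss ratios can be evaluated for any concrete cost function. The sketch below assumes a linear JoinCost (the sum of the input sizes) and approximate TPC-H scale-factor-1 cardinalities; both are assumptions made for illustration, so the resulting figures will not match the factor of about 2 quoted above, which comes from the authors' own cost model.

```python
from math import prod

def join_cost(*sizes):
    # Assumed cost model: linear in the input sizes (not the talk's model).
    return sum(sizes)

def speedup_loss(sizes, cuts, cost=join_cost):
    """Ratio of RT without partitioning to RT with partitioning."""
    p = prod(cuts)  # total number of processors
    without = cost(*(s / c for s, c in zip(sizes, cuts)))
    with_partitioning = cost(*(s / p for s in sizes))
    return without / with_partitioning

# Approximate TPC-H scale-factor-1 cardinalities for L, O, C
sizes = (6_000_000, 1_500_000, 150_000)

loss_grid = speedup_loss(sizes, (6, 2, 1))
# Selective clustering: L and O are each cut 12 ways, C is read whole.
loss_clustered = join_cost(sizes[0] / 12, sizes[1] / 12, sizes[2]) / join_cost(
    *(s / 12 for s in sizes)
)
print(round(loss_grid, 2), round(loss_clustered, 2))
```

Whatever the cost model, the qualitative picture is the same: clustering the two largest tables leaves only the small table C unpartitioned, so the remaining loss is much smaller.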

Lightweight Join Processor
§ Work allocation via query fragments => co-processors can be heterogeneous
§ Need not have a full DBMS; a join processor is enough
§ E.g., a screen saver for join processing
§ We use a trimmed-down version of Apache Derby
  – Parse CSV files
  – Predicates, projections, sort-merge joins, aggregates, group by

Performance degradation due to not partitioning
[Figure: panels labeled O L, S O L, and SOLCNR]
§ At 10 nodes on S x O x L x C x N x R, DITN is about 2.1x slower than PBP (work allocation: L/5, O/2, S, C, N, R)
§ DITN2PART has very little slowdown
  – But needs total clustering
§ Slowdown oscillates due to the discreteness of work allocation

Failure/Stall Resiliency by Work-Unit Reassignment
§ Orders x Lineitem, group by o_orderpriority, 5 co-processors
§ Impose high load on one co-processor as soon as the query begins
§ At 60% load (50% wait), DITN times out and switches to an alternative
[Figure legend: PBP, DITN2PART]

Importance of Asymmetric Allocation
Contrast between DITN-symmetric and DITN-asymmetric
§ Initially 2 fast nodes; then add 4 slow nodes
§ With symmetric allocation, adding slow nodes can slow down the system

Danger of Tying Partition to CPU
§ Repeated execution of O x L
§ Impose 75% CPU load on one of the 5 co-processors during the 3rd iteration
§ PBP continues to use this slow node throughout
§ DITN switches to another node after two iterations

Related Work
§ Parallel query processing
  – Gamma, XPRS, many commercial systems
  – Mostly shared-nothing
  – Shared-disk: IBM Sysplex
    • Queries done via tuple shipping between co-processors
  – Oracle
    • Shared disk, but hash joins done via partitioning (static/dynamic)
§ Mariposa: similar query-fragment-level work allocation
§ Load balancing: Exchange, Flux, River, skew avoidance in hash joins
§ Fault-tolerant exchange (FLUX)
§ Polar*, OGSA-DQP
§ Distributed Eddies
§ Query execution on P2P systems

Summary and Future Work
§ Partitioning-based parallelism does not handle non-dedicated nodes well
§ Proposal: avoid partitioning
  – Share data via the storage system
  – Intra-fragment parallelism instead of exchange
  – Careful work allocation to optimize response time
§ Promising initial results: only 2x slowdown with 10 nodes
Open Questions
§ Index scans: want shared reads without latching
§ Isolation: DITN: uncommitted read; DITN2PART: read-only
§ Scaling to large numbers of nodes
§ Multi-query optimization to reuse shared temp tables