The Communication Complexity of Distributed Set-Joins
Dirk Van Gucht (Indiana University), Ryan Williams (Stanford), David Woodruff (IBM Almaden), Qin Zhang (Indiana University)
Set-Joins
• Given sets A_1, …, A_m from [n] = {1, 2, …, n}
• Given sets B_1, …, B_m from [n]
• Join = {(i, j) for which P(A_i, B_j)}, where P is a predicate
• P might be:
  • set-intersection: |A_i ∩ B_j| > 0
  • set non-intersection: |A_i ∩ B_j| = 0
  • set equality: A_i = B_j
  • set-intersection threshold join: |A_i ∩ B_j| > T
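As a point of reference, the join definition can be spelled out by brute force. The sketch below is only an illustration of the definition, not a protocol from the talk; the names `set_join`, `intersects`, `disjoint`, `equal`, and `threshold` are hypothetical.

```python
def set_join(A, B, P):
    """Return all pairs (i, j), 1-indexed, for which P(A_i, B_j) holds."""
    return [(i + 1, j + 1)
            for i, a in enumerate(A)
            for j, b in enumerate(B)
            if P(a, b)]

# The four predicates listed on the slide.
def intersects(a, b):
    return len(a & b) > 0

def disjoint(a, b):
    return len(a & b) == 0

def equal(a, b):
    return a == b

def threshold(T):
    # Returns the predicate |A_i ∩ B_j| > T for a fixed T.
    return lambda a, b: len(a & b) > T

A = [{1, 2}, {3}]
B = [{2, 3}, {3}]
print(set_join(A, B, intersects))  # [(1, 1), (2, 1), (2, 2)]
```

This takes time proportional to m^2 set comparisons and says nothing about communication; it only fixes what the output of each join is.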
Communication Complexity
• Alice holds A_1, …, A_m; Bob holds B_1, …, B_m
• Alice and Bob want to minimize communication so that together they can output the join
• The protocol can fail with probability 1/100 (over its randomness)
Communication Complexity of Joins
• Set-Intersection, Set Non-Intersection, and Set-Intersection Threshold Join all require Ω(mn) communication
• Reduction from set-disjointness on a universe of size m*n
• Matched by trivial upper bound
• Alice's input {x_1, x_2, …, x_{mn}} is split into A_1 = {x_1, x_2, …, x_n}, A_2 = {x_{n+1}, x_{n+2}, …, x_{2n}}, …, A_m = {x_{(m-1)n+1}, …, x_{mn}}
• Bob's input {y_1, y_2, …, y_{mn}} is split into B_1 = {y_1, y_2, …, y_n}, B_2 = {y_{n+1}, y_{n+2}, …, y_{2n}}, …, B_m = {y_{(m-1)n+1}, …, y_{mn}}
Communication Complexity of Joins
• Set-Equality has communication Θ(m)
• Ω(m) bits transferred to the player outputting the join
• O(m) upper bound due to a set-intersection protocol of [BCKWY]
• Randomly hash sets to [m^2]: A_1, …, A_m become x_1, …, x_m in [m^2], and B_1, …, B_m become y_1, …, y_m in [m^2]
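The hashing step can be sketched as follows. This is a minimal illustration under stated assumptions: a seeded `random.Random` stands in for the shared randomness, the hash is simulated by a lazily filled table, and the [BCKWY] set-intersection protocol is replaced by a direct comparison of the hashed values. The function name `equality_join_by_hashing` is hypothetical.

```python
import random

def equality_join_by_hashing(A, B, seed=0):
    """Hash every set to a value in [m^2], then match equal hash values.

    Equal sets always receive equal hashes, so no true pair is missed;
    distinct sets may collide, so false positives are possible.
    """
    m = max(len(A), len(B))
    rng = random.Random(seed)   # assumption: stands in for shared randomness
    table = {}                  # lazily simulated random hash function

    def h(s):
        key = frozenset(s)
        if key not in table:
            table[key] = rng.randrange(m * m)
        return table[key]

    xs = [h(a) for a in A]      # Alice's values x_1, ..., x_m
    ys = [h(b) for b in B]      # Bob's values y_1, ..., y_m

    # The talk runs the [BCKWY] protocol on xs and ys at this point;
    # this sketch simply compares the hashed values directly.
    positions = {}
    for j, y in enumerate(ys):
        positions.setdefault(y, []).append(j)
    return [(i + 1, j + 1) for i, x in enumerate(xs) for j in positions.get(x, [])]
```

The point of the hashing is that each set is replaced by a single number of O(log m) bits, so the remaining problem no longer depends on n.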
Communication Complexity of Joins
• Can we achieve better communication for joins in practice?
• Sparsity
  • Output: typically the output size |Join(A_1, …, A_m, B_1, …, B_m)| is at most k << m^2
  • Input: typically |A_1|, …, |A_m|, |B_1|, …, |B_m| are each at most s << n
• Can get better bounds in terms of k, s!
• Focus on set-intersection join (will generalize to natural join)
Our Main Results
• O(k^{1/2} n) communication protocol for set-intersection join
  • More efficient than O(mn) for small k!
  • Improves O(kn) communication protocol of [Williams, Yu, SODA '14]
• Matching Ω(k^{1/2} n) lower bound
• For input sparsity s, we show Θ(min(ms, (skn)^{1/2}) + n) communication bound
• Highlights
  • Use 2 rounds of interaction (1 round is impossible!)
  • For each (i, j) in the join, returns all k for which k in A_i and k in B_j (natural join)
  • New algorithms for input/output sparse matrix multiplication
Talk Outline
• Our O(k^{1/2} n) communication protocol
• Extension to solve the natural join problem
• New algorithm for input/output sparse matrix multiplication
• Application to Sparse Transitive-Closure
• An Ω(k^{1/2} n) communication lower bound
• Estimating sizes of set joins
A Protocol for Set-Intersection Join
• A is the m x n matrix whose rows are the indicator vectors of A_1, …, A_m
• B is the n x m matrix whose columns are the indicator vectors of B_1, …, B_m
• Set-Intersection Join = {(i, j) for which C_{i,j} > 0, where C = A*B}
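The matrix view can be checked on a small instance. The sketch below builds the indicator matrices and reads the join off the non-zero entries of C = A*B; it uses plain nested lists and a naive product, and the helper names are hypothetical.

```python
def indicator_rows(sets, n):
    """One 0/1 row of length n per set, over the universe {1, ..., n}."""
    return [[1 if e in s else 0 for e in range(1, n + 1)] for s in sets]

def matmul(X, Y):
    """Naive dense matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def join_from_product(A_sets, B_sets, n):
    A = indicator_rows(A_sets, n)                                  # m x n
    B = [list(col) for col in zip(*indicator_rows(B_sets, n))]     # n x m
    C = matmul(A, B)                                               # C[i][j] = |A_i ∩ B_j|
    join = [(i + 1, j + 1)
            for i, row in enumerate(C)
            for j, c in enumerate(row) if c > 0]
    return C, join

C, join = join_from_product([{1, 2}, {3}], [{2, 3}, {3}], n=3)
print(C)     # [[1, 0], [1, 1]]
print(join)  # [(1, 1), (2, 1), (2, 2)]
```

Each entry C_{i,j} counts the common elements of A_i and B_j, which is why the join is exactly the support of C.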
A Protocol for Set-Intersection Join
• Set-Intersection Join = {(i, j) for which C_{i,j} > 0, where C = A*B}
• A is an m x n matrix, B is an n x m matrix
• Alice computes S*A for a random k x m matrix S and sends S*A to Bob
• [Compressed Sensing] S has the property that if x is a vector containing at most k non-zero entries, then given S*x, one can recover x w.h.p.
A Protocol for Set-Intersection Join
• A is an m x n matrix, B is an n x m matrix
• Alice computes S*A for a random k x m matrix S and sends S*A to Bob
• Bob computes (S*A)*B. Each column of A*B has at most k non-zeros
• From S*A*B, Bob can recover all non-zero entries of A*B
• S can be a random Gaussian matrix with O(log n) bits of precision per entry
A Protocol for Set-Intersection Join
• A is an m x n matrix, B is an n x m matrix
• Problem: sending S*A requires k n log n bits of communication
• Idea: instead let S be a k^{1/2} x m random matrix!
• From S*A*B, can recover each column of A*B with at most k^{1/2} non-zeros
• Number of columns of A*B with more than k^{1/2} non-zeros is at most k^{1/2}
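The bound on the number of dense columns is a counting argument. A short derivation, assuming the join has at most k pairs (so A*B has at most k non-zeros) and writing H for the set of columns of A*B with more than k^{1/2} non-zeros:

```latex
\[
  |H| \cdot k^{1/2}
  \;\le\; \sum_{j \in H} \mathrm{nnz}\big((AB)_j\big)
  \;\le\; \mathrm{nnz}(AB)
  \;\le\; k
  \quad\Longrightarrow\quad
  |H| \;\le\; k^{1/2}.
\]
```

The first inequality holds because every column in H contributes more than k^{1/2} non-zeros to the sum.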
Overall Protocol for Set-Intersection Join
• Alice sends Bob S*A, where S has k^{1/2} rows
• Alice also sends Bob T*A, where T has O(log n) rows [KNW]
• Bob computes T*A*B and finds which columns of A*B have > k^{1/2} non-zeros
• Set H of such columns has size at most k^{1/2}
• Bob computes S*A*B and recovers (A*B)_j for each j not in H
• Bob sends Alice columns B_j for each j in H
• Alice directly computes A*B_j for each j in H
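The control flow of the protocol can be outlined in code. This is a simulation under loud assumptions: the sketches S*A and T*A are not implemented, and both the column-sparsity test and the recovery of light columns are replaced by exact computation. Only the split into heavy columns (handled by Alice) and light columns (handled by Bob) is illustrated; the function names are hypothetical.

```python
def column_nnz(A_sets, B_sets):
    """Non-zeros in each column j of A*B: the number of i with A_i ∩ B_j non-empty."""
    return [sum(1 for a in A_sets if a & b) for b in B_sets]

def protocol_outline(A_sets, B_sets, k):
    # Stand-in for Bob's use of T*A*B: column sparsities, computed exactly here.
    nnz = column_nnz(A_sets, B_sets)
    # Heavy columns have more than k^{1/2} non-zeros (compare squares to avoid floats).
    H = {j for j, c in enumerate(nnz) if c * c > k}

    join = []
    for j, b in enumerate(B_sets):
        if j in H:
            # Heavy column: Bob sends B_j to Alice, who computes A*B_j directly.
            owner_computes = "Alice"
        else:
            # Light column: Bob would decode (A*B)_j from S*A*B; computed exactly here.
            owner_computes = "Bob"
        for i, a in enumerate(A_sets):
            if a & b:
                join.append((i + 1, j + 1))
    return sorted(join), sorted(j + 1 for j in H)

A_sets = [{1}, {1}, {1}, {2}]
B_sets = [{1}, {2}, {3}, {4}]
join, heavy = protocol_outline(A_sets, B_sets, k=4)
print(join)   # [(1, 1), (2, 1), (3, 1), (4, 2)]
print(heavy)  # [1]
```

In this instance k = 4, so a column is heavy when it has more than 2 non-zeros; only column 1 qualifies, which is consistent with the bound of at most k^{1/2} = 2 heavy columns.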
Extension to Solve the Natural Join
• Not enough to just figure out whether C_{i,j} > 0; need to find all witnesses
• A witness is an index k for which A_{i,k} = B_{k,j} = 1
• Can be solved iteratively
  • Bob deletes entries of B one at a time and re-computes S*A*B
  • Alice just sends S*A once
• Since Alice has B_j for columns j in H, she directly computes all witnesses
A New Algorithm for Matrix Multiplication
• Suppose we choose S to be a CountSketch matrix
[Figure: example CountSketch matrix with entries in {0, 1, -1}]
• The same algorithm works to recover C = A*B!
• Time
  • nnz(A) to compute S*A
  • nnz(B) k^{1/2} to compute S*A*B and recover A*B_j for j not in H
  • nnz(A) k^{1/2} to compute A*B_j for j in H
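The nnz(A) running time for S*A comes from the structure of a CountSketch matrix: every column of S has a single non-zero entry, which is +1 or -1. A minimal sketch of applying such an S to a sparse A, assuming A is given as (row, column, value) triples and that the bucket map `h` and sign map `sign` are passed in explicitly (in practice they would be drawn at random); the function name is hypothetical.

```python
def countsketch_apply(h, sign, rows, entries, n):
    """Compute S*A, where S is rows x m with S[h[i]][i] = sign[i] and zeros elsewhere.

    entries: sparse A as 0-indexed (i, col, val) triples; n: number of columns of A.
    Each non-zero of A is touched exactly once, so the loop runs in nnz(A) steps.
    """
    SA = [[0] * n for _ in range(rows)]
    for i, col, val in entries:
        SA[h[i]][col] += sign[i] * val
    return SA

# A = [[1, 0], [0, 1], [1, 1]] and S = [[1, 0, -1], [0, -1, 0]]
h = [0, 1, 0]
sign = [1, -1, -1]
entries = [(0, 0, 1), (1, 1, 1), (2, 0, 1), (2, 1, 1)]
print(countsketch_apply(h, sign, rows=2, entries=entries, n=2))  # [[0, -1], [0, -1]]
```

Recovering the columns of A*B from S*A*B is not shown here; the example only covers the first time bound on the slide.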
Matrix Multiplication Application
• Gives O((nnz(A) + nnz(B)) k^{1/2}) time for matrix multiplication
• [Sparse Transitive Closure Application]
  • Given: a directed graph G on n nodes
  • Transitive-closure is pairs (u, v) of nodes for which v is reachable from u
  • Promise: transitive-closure has O(n) edges
  • Goal: find all edges in the transitive-closure
• Solve by computing A, A^2, A^4, …, A^n, where A is the adjacency matrix
• We obtain an Õ(n^{1.5}) time algorithm. Previous best is n^{1.844} time [BCH]
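The repeated-squaring idea can be sketched with Boolean matrices. This illustration uses a naive cubic-time Boolean product, not the sparse multiplication algorithm of the talk, and it squares A + I (an assumption made here so that paths shorter than the current power are kept). It reports only pairs with u different from v. The function names are hypothetical.

```python
def bool_matmul(X, Y):
    """Naive Boolean product of two n x n matrices."""
    n = len(X)
    return [[any(X[i][l] and Y[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def transitive_closure(adj):
    """Pairs (u, v), u != v, such that v is reachable from u in the graph adj."""
    n = len(adj)
    # R covers paths of length <= 1: the adjacency matrix plus the identity.
    R = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    length = 1
    while length < n:
        R = bool_matmul(R, R)   # now covers paths of length <= 2 * length
        length *= 2
    return [(u, v) for u in range(n) for v in range(n) if u != v and R[u][v]]

# Directed path 0 -> 1 -> 2 -> 3
path = [[0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0]]
print(transitive_closure(path))
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

Only about log n squarings are needed, so the cost is dominated by the matrix products, which is where an output-sensitive multiplication algorithm helps under the promise that the closure is sparse.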
Matching Communication Lower Bound
• Reduction from set-disjointness on a universe of size k^{1/2} n
• Alice's input {x_1, x_2, …, x_{sqrt{k} n}} is split into A_1 = {x_1, x_2, …, x_n}, A_2 = {x_{n+1}, x_{n+2}, …, x_{2n}}, …, A_{sqrt{k}} = {x_{(sqrt{k}-1)n+1}, …, x_{sqrt{k} n}}
• Bob's input {y_1, y_2, …, y_{sqrt{k} n}} is split into B_1 = {y_1, y_2, …, y_n}, B_2 = {y_{n+1}, y_{n+2}, …, y_{2n}}, …, B_{sqrt{k}} = {y_{(sqrt{k}-1)n+1}, …, y_{sqrt{k} n}}
• A*B is a k^{1/2} x k^{1/2} matrix, so it has at most k non-zeros
• A non-zero on the diagonal can decide disjointness
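The reduction can be made concrete: cut each player's disjointness input into blocks of n consecutive universe elements, treat block i as a set over [n], and look at the diagonal of the join. The sketch below assumes r = k^{1/2} is an integer and uses hypothetical helper names.

```python
def blocks(elements, n, r):
    """Split a subset of {1, ..., r*n} into r sets over [n], one per block of n elements."""
    out = [set() for _ in range(r)]
    for e in elements:
        out[(e - 1) // n].add((e - 1) % n + 1)
    return out

def disjoint_via_join(x, y, n, r):
    """Decide whether x and y are disjoint from the diagonal of the set-intersection join."""
    A = blocks(x, n, r)   # Alice's sets A_1, ..., A_r
    B = blocks(y, n, r)   # Bob's sets B_1, ..., B_r
    # (i, i) is in the join exactly when x and y share an element of block i.
    diagonal_hit = any(A[i] & B[i] for i in range(r))
    return not diagonal_hit

print(disjoint_via_join({1, 5}, {2, 5}, n=3, r=2))  # False: both contain element 5
print(disjoint_via_join({1, 5}, {2, 6}, n=3, r=2))  # True
```

Since the join has at most r^2 = k pairs, any protocol for joins of output size k also solves disjointness on a universe of size k^{1/2} n, which gives the lower bound.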
Our Other Results
• Approximate size |Join(A_1, …, A_m, B_1, …, B_m)|
  • Query strategy planning, measuring similarity, etc.
  • Output X = (1±ε) |Join(A_1, …, A_m, B_1, …, B_m)|
• Communication of Approximate Set-Intersection Join
  • [1-way Communication] Upper and lower bounds of n/ε^2
  • [2-way Communication] Lower bound of n/ε^{2/3}
• Results for set-intersection threshold join as well (tight for 1-way)
Conclusion
• Summary
  • Separation of communication complexity of different join types
  • Optimal communication protocol for Set-Intersection and Natural Joins in terms of sparsity
  • New algorithms for sparse input and output matrix multiplication
  • Upper and lower communication bounds for approximating join sizes
• Open Questions
  • Communication bounds not tight for approximating join sizes for most joins
  • For T-threshold join, off by a (large) factor of sqrt{n/T}
  • Multi-player versions and more expressive joins
  • Other applications of our matrix multiplication algorithm