Tight Lower Bounds for the Distinct Elements Problem

David Woodruff, MIT, dpwood@mit.edu. Joint work with Piotr Indyk.

The Problem
Example stream: 4 3 7 3 1 1 0 …
• Stream of elements $a_1, \dots, a_n$, each in $\{1, \dots, m\}$
• Want $F_0$ = # of distinct elements
• Elements in adversarial order
• Algorithms given one pass over stream
• Goal: minimum-space algorithm

A Trivial Algorithm
Stream: 4 3 7 3 1 1 0 … Characteristic vector: 10011011
• Keep $m$-bit characteristic vector $v$ of stream
• $j$ in stream $\iff v_j = 1$
• $F_0 = \mathrm{wt}(10011011) = 5$
• Space = $m$
Can we do better?
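The trivial algorithm can be sketched in a few lines of Python. This is only an illustration: the function name `count_distinct_trivial` is made up here, and elements are treated as 0-based indices into the vector so that the slide's example (which contains a 0) fits in 8 bits.

```python
def count_distinct_trivial(stream, m):
    """Exact F_0 using an m-bit characteristic vector (space = m bits)."""
    v = [0] * m              # v[j] == 1 iff element j has appeared so far
    for a in stream:         # single pass over the stream
        v[a] = 1
    return sum(v)            # F_0 = wt(v)


# The example stream from the slide, with 0-based elements (an assumption).
example = [4, 3, 7, 3, 1, 1, 0]
print(count_distinct_trivial(example, m=8))  # prints 5
```

The space cost is the whole point: the vector has one bit per universe element, regardless of how short the stream is.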

Negative Results
• Any algorithm computing $F_0$ exactly must use $\Omega(m)$ space [AMS 96]
• Any deterministic alg. that outputs $x$ with $|F_0 - x| < \varepsilon F_0$ must use $\Omega(m)$ space [AMS 96]
• What about randomized approximation algorithms?

Rand. Approx. Algorithms for $F_0$
• $O(\log m/\varepsilon^2 + \log m \log 1/\varepsilon)$-space alg. outputs $x$ with $\Pr[|F_0 - x| < \varepsilon F_0] > 3/4$ [BJKST 02]
• Lots of hashing tricks
Is this optimal?
• Previous lower bounds
  • $\Omega(\log m)$ [AMS 96]
  • $\Omega(1/\varepsilon)$ [Bar-Yossef]
• Open Problem of [BJKST 02]: GAP: $1/\varepsilon \ll 1/\varepsilon^2$

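Idea Behind Lower Bounds
• Alice has $x \in \{0,1\}^m$, Bob has $y \in \{0,1\}^m$
• Alice runs the $(1 \pm \varepsilon)$ $F_0$ algorithm $A$ on stream $s(x)$ and sends $S$, the internal state of $A$, to Bob
• Bob resumes $A$ from state $S$ on stream $s(y)$
• Compute $(1 \pm \varepsilon) F_0(s(x) \circ s(y))$ w.p. $> 3/4$
• Idea: If can decide $f(x,y)$ w.p. $> 3/4$, space used by $A$ is at least $f$'s rand. 1-way comm. complexity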
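The simulation argument can be illustrated with a small Python sketch. As a stand-in for the approximation algorithm $A$ it uses the exact characteristic-vector algorithm, which is an assumption made purely for illustration; the names `ExactF0`, `stream_of`, and `one_way_protocol` are hypothetical. What matters is that Alice's single message is exactly the algorithm's state, so the message length equals the space used.

```python
class ExactF0:
    """Stand-in streaming algorithm A (exact F_0, not an approximation)."""

    def __init__(self, m, state=None):
        # Resume from a received state if one is given, else start fresh.
        self.v = list(state) if state is not None else [0] * m

    def process(self, stream):
        for a in stream:
            self.v[a] = 1

    def state(self):
        return tuple(self.v)      # everything A remembers about the stream

    def estimate(self):
        return sum(self.v)


def stream_of(bits):
    """One stream whose characteristic vector is `bits`."""
    return [j for j, b in enumerate(bits) if b]


def one_way_protocol(x, y):
    """Alice runs A on s(x), sends the state; Bob continues on s(y)."""
    m = len(x)
    alice = ExactF0(m)
    alice.process(stream_of(x))
    message = alice.state()               # the only message, Alice -> Bob
    bob = ExactF0(m, state=message)
    bob.process(stream_of(y))
    return bob.estimate(), len(message)   # F_0(s(x) . s(y)), bits sent
```

For example, `one_way_protocol([1, 0, 1, 0], [0, 0, 1, 1])` returns `(3, 4)`: three distinct elements, learned by Bob from a 4-bit message.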

Randomized 1-way comm. complexity
• Boolean function $f: X \times Y \to \{0,1\}$
• Alice has $x \in X$, Bob has $y \in Y$. Bob wants $f(x,y)$
• Only 1 message sent: must be from Alice to Bob
• Comm. cost of protocol = expected length of longest message sent over all inputs
• $\delta$-error randomized 1-way comm. complexity of $f$, $R_\delta(f)$, is comm. cost of optimal protocol computing $f$ w.p. $\ge 1 - \delta$
• How do we lower bound $R_\delta(f)$?

The VC Dimension [KNR]
• $F = \{f : X \to \{0,1\}\}$ family of Boolean functions
• $f \in F$ is a length-$|X|$ "bit string"
• For $S \subseteq X$, shatter coefficient $\mathrm{SC}(f_S)$ of $S$ is $|\{f|_S\}_{f \in F}|$ = # distinct bit strings when $F$ restricted to $S$
• $\mathrm{SC}(F, p) = \max_{S \subseteq X, |S| = p} \mathrm{SC}(f_S)$
• If $\mathrm{SC}(f_S) = 2^{|S|}$, $S$ is shattered by $F$
• VC Dimension of $F$, $\mathrm{VCD}(F)$, = size of largest $S$ shattered by $F$
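For small families these definitions can be checked by brute force. The sketch below assumes each function is given as a tuple of bits indexed by $X = \{0, \dots, n-1\}$; the function names are illustrative, and the running time is exponential, so it is only meant for toy examples.

```python
from itertools import combinations


def shatter_coefficient(family, S):
    """Number of distinct bit strings when the family is restricted to S."""
    return len({tuple(f[i] for i in S) for f in family})


def sc(family, n, p):
    """SC(F, p): the largest shatter coefficient over all S with |S| = p."""
    return max(shatter_coefficient(family, S)
               for S in combinations(range(n), p))


def vc_dimension(family, n):
    """Size of the largest S whose shatter coefficient is 2^|S|."""
    best = 0
    for p in range(1, n + 1):
        if sc(family, n, p) == 2 ** p:
            best = p
    return best


# Threshold functions on 4 points: f_k(i) = 1 iff i >= k, for k = 0..4.
thresholds = [tuple(1 if i >= k else 0 for i in range(4)) for k in range(5)]
```

With this family, every single point is shattered but no pair is (the pattern "1 then 0" never occurs), so `vc_dimension(thresholds, 4)` is 1 while `sc(thresholds, 4, 2)` is 3.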

Shatter Coefficient Theorem
• Notation: For $f: X \times Y \to \{0,1\}$, define $f_X = \{ f_x : Y \to \{0,1\} \mid x \in X \}$, where $f_x(y) = f(x,y)$
• Theorem [BJKS]: For every $f: X \times Y \to \{0,1\}$ and every $p \ge \mathrm{VCD}(f_X)$, $R_{1/4}(f) = \Omega(\log(\mathrm{SC}(f_X, p)))$

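The $\Omega(1/\varepsilon)$ Lower Bound [Bar-Yossef]
• Alice has $x \in_R \{0,1\}^m$, $\mathrm{wt}(x) = m/2$
• Bob has $y \in_R \{0,1\}^m$, $\mathrm{wt}(y) = \varepsilon m$, and:
• Either $\mathrm{wt}(x \wedge y) = 0$ (then $f(x,y) = 0$) OR $\mathrm{wt}(x \wedge y) = \varepsilon m$ (then $f(x,y) = 1$)
• $R_{1/4}(f) = \Omega(\mathrm{VCD}(f_X)) = \Omega(1/\varepsilon)$ [Bar-Yossef]
• $s(x)$, $s(y)$ any streams w/ char. vectors $x$, $y$
• $f(x,y) = 1 \Rightarrow F_0(s(x) \circ s(y)) = m/2$
• $f(x,y) = 0 \Rightarrow F_0(s(x) \circ s(y)) = m/2 + \varepsilon m$
• $(1 + \varepsilon')m/2 < (1 - \varepsilon')(m/2 + \varepsilon m)$ for $\varepsilon' = \Theta(\varepsilon)$
• Hence, can decide $f$ $\Rightarrow$ $F_0$ alg. uses $\Omega(1/\varepsilon)$ space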
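The separation in the last two bullets can be checked numerically. The sketch below builds one concrete instance of each case; the helper names and the specific choice $\varepsilon' = \varepsilon/2$ are assumptions made for illustration (solving the inequality shows that any $\varepsilon' < \varepsilon/(1+\varepsilon)$ works).

```python
def f0_of_union(x, y):
    """F_0 of s(x) followed by s(y): the weight of the bitwise OR."""
    return sum(1 for a, b in zip(x, y) if a or b)


def instance(m, eps, intersecting):
    """x has weight m/2; y has weight eps*m and lies either entirely
    inside the support of x (f = 1) or entirely outside it (f = 0)."""
    k = round(eps * m)
    x = [1] * (m // 2) + [0] * (m - m // 2)
    y = [0] * m
    start = 0 if intersecting else m // 2
    for j in range(start, start + k):
        y[j] = 1
    return x, y


def separated(m, eps):
    """True if a (1 +/- eps') approximation tells the two cases apart."""
    eps_prime = eps / 2                                # one Theta(eps) choice
    low = f0_of_union(*instance(m, eps, True))         # m/2
    high = f0_of_union(*instance(m, eps, False))       # m/2 + eps*m
    return (1 + eps_prime) * low < (1 - eps_prime) * high
```

With `m = 1000` and `eps = 0.1` the two values of $F_0$ are 500 and 600, and the approximation intervals (at most 525 versus at least 570) do not overlap.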

Our Results
• Remainder of talk: $\Omega(1/\varepsilon^2)$ lower bound for $\varepsilon = \Omega(m^{-1/(9+k)})$ for any $k > 0$
• $\Rightarrow$ $O(\log m/\varepsilon^2 + \log m \log 1/\varepsilon)$ upper bound almost optimal
IDEA: Reduce from protocol for computing dot product

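The Promise Problem
• $t = \Theta(1/\varepsilon^2)$, $Y$ = basis of unit vectors of $\mathbb{R}^t$
• Alice has $x \in [0,1]^t$ with $\|x\| = 1$; Bob has $y \in Y$
• Promise Problem $\Pi$: either $\langle x, y \rangle = 0$ (then $f(x,y) = 0$) OR $\langle x, y \rangle = 2/t^{1/2}$ (then $f(x,y) = 1$)
• $X = \{x \in [0,1]^t : \|x\| = 1 \text{ and } \exists y \in Y \text{ s.t. } (x,y) \in \Pi\}$
• We lower bound $R_{1/4}(f)$ via $\mathrm{SC}(f_X, t)$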

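Bounding $\mathrm{SC}(f_X, t)$
Theorem: $\mathrm{SC}(f_X, t/4) = 2^{\Omega(t)}$
Proof:
• $\forall T \subset Y$ s.t. $|T| = t/4$, put $x_T = (2/t^{1/2}) \cdot \sum_{e \in T} e$
• Define $X_1 \subset X$ as $X_1 = \{x_T \mid T \subset Y, |T| = t/4\}$
Claim: $\forall s \in \{0,1\}^t$ w/ $\mathrm{wt}(s) = t/4$, $s \in$ truth tab. of $f_{X_1}$
Proof:
1. Let $s \in \{0,1\}^t$ with 1s in positions $i_1, \dots, i_{t/4}$
2. Put $T = \{e_{i_1}, \dots, e_{i_{t/4}}\}$
3. $\forall e \in T$, $\langle e, x_T \rangle = 2/t^{1/2}$
4. $\forall e \in Y - T$, $\langle e, x_T \rangle = 0$, so $s$ is the truth table of $f_{x_T}$
5. There are $2^{\Omega(t)}$ such $s$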
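The construction of $X_1$ can be replayed for a small $t$. The sketch below is an illustration under assumptions: basis vectors are identified with their indices $0, \dots, t-1$, the helper names are invented, and $t = 8$ is chosen only so that the enumeration is tiny.

```python
import math
from itertools import combinations


def x_T(T, t):
    """x_T = (2 / sqrt(t)) * sum of the basis vectors e_i for i in T."""
    x = [0.0] * t
    for i in T:
        x[i] = 2.0 / math.sqrt(t)
    return x


def truth_table(x, t):
    """Row of f_X for x: entry i is 1 iff <x, e_i> = 2/sqrt(t).
    Since e_i is a basis vector, <x, e_i> is just x[i]."""
    target = 2.0 / math.sqrt(t)
    return tuple(1 if abs(x[i] - target) < 1e-12 else 0 for i in range(t))


t = 8
tables = set()
for T in combinations(range(t), t // 4):
    x = x_T(T, t)
    assert abs(sum(c * c for c in x) - 1.0) < 1e-12   # ||x_T|| = 1
    tables.add(truth_table(x, t))

print(len(tables), math.comb(t, t // 4))  # both counts are C(8, 2) = 28
```

Every weight-$t/4$ string appears exactly once, so the number of distinct truth-table rows is $\binom{t}{t/4}$, which grows as $2^{\Omega(t)}$.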

Bounding $R_{1/4}(f)$
• Corollary: $R_{1/4}(f) = \Omega(t) = \Omega(1/\varepsilon^2)$
• Reduction: we need a protocol computing $f$ with communication = space used by any $(1 \pm \varepsilon)$ $F_0$ approx. alg.

Reduction
• Recall:
  • $\langle x, y \rangle = 0$ if $f(x,y) = 0$
  • $\langle x, y \rangle = 2/t^{1/2}$ if $f(x,y) = 1$
• Goal: Reduce "separation" of $\langle x, y \rangle$ to separation of $F_0(s(x) \circ s(y))$ for streams $s(x)$, $s(y)$ Alice/Bob can derive from $x$, $y$
• Use relation: $\|y - x\|^2 = \|y\|^2 + \|x\|^2 - 2\langle x, y \rangle$
• $f(x,y) = 0 \Rightarrow \|y - x\| = 2^{1/2}$
• $f(x,y) = 1 \Rightarrow \|y - x\| < 2^{1/2}(1 - 1/t^{1/2}) = 2^{1/2}(1 - \Theta(\varepsilon))$
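A quick numeric check of the two distances, as a sketch: it assumes $x = x_T$ with $T$ the first $t/4$ coordinates, and picks one basis vector inside $T$ and one outside it. The function names are illustrative and $t$ should be a multiple of 4.

```python
import math


def dist(x, y):
    """Euclidean distance ||y - x||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def check_separation(t):
    """Return (||y - x|| when f = 0, ||y - x|| when f = 1, the slide's bound)."""
    c = 2.0 / math.sqrt(t)
    x = [c] * (t // 4) + [0.0] * (t - t // 4)   # unit vector x_T
    e_in = [1.0] + [0.0] * (t - 1)              # <x, e_in> = 2/sqrt(t), f = 1
    e_out = [0.0] * (t - 1) + [1.0]             # <x, e_out> = 0, f = 0
    bound = math.sqrt(2) * (1 - 1 / math.sqrt(t))
    return dist(x, e_out), dist(x, e_in), bound
```

For `t = 16` this gives distances $\sqrt{2}$ and $1$, with the bound $\sqrt{2}\,(1 - 1/4) \approx 1.06$ sitting between them.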

Overview of Reduction
Alice starts with $x \in [0,1]^t$, $\|x\| = 1$; Bob starts with $y \in E$.
1. Low-distortion embedding $\phi: l_2^t \to l_1^{\mathrm{poly}(t)}$ gives $\phi(x)$, $\phi(y)$
2. Rational approximation
3. Scale rationals to integers
4. Convert integer coords to unary to get $\{0,1\}$ vectors $x'$, $y'$
• Alice runs the $F_0$ alg. on stream $s(x')$ and sends its state to Bob, who continues it on $s(y')$
• $F_0(s(x') \circ s(y'))$ can decide $f(x,y)$ w.p. $\ge 3/4$

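Embedding $l_2^t$ into $l_1^{\mathrm{poly}(t)}$
• A $(1+\gamma)$-distortion embedding $\phi: l_2^t \to l_1^d$ is a mapping s.t. $\forall p, q \in l_2^t$, $\|p - q\|_2 \le \|\phi(p) - \phi(q)\|_1 \le (1+\gamma)\|p - q\|_2$
• Theorem [FLM 77]: $\forall \gamma$ $\exists$ a $(1+\gamma)$-distortion embedding $\phi: l_2^t \to l_1^d$ with $d = O(t \cdot (\log 1/\gamma)/\gamma^2)$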

Embedding $l_2^t$ into $l_1^d$
• Alice has $x \in [0,1]^t$, $\|x\| = 1$; Bob has $y \in E$
• Low-distortion embedding $\phi: l_2^t \to l_1^d$
• Using Theorem [FLM 77], Alice/Bob get $\phi(x), \phi(y) \in \mathbb{R}^d$ with $d = O(t \cdot (\log 1/\gamma)/\gamma^2)$
• $\gamma$ specified later

Rational Approximation
• $z = z(t): \mathbb{N} \to \mathbb{N}$; assume $z \ge d$
• Approximate each coord. of output of embedding by integer multiple of $1/z$

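Scaling
• Alice (resp. Bob) multiplies each coord. of $\tilde\phi(x)$ (resp. $\tilde\phi(y)$), the rational approximation of the embedded vector, by $z$
• Obtains $s(\tilde\phi(x))$ (resp. $s(\tilde\phi(y))$)
• Claim: coords. are integers in range $[-2z, 2z]$
• Proof:
1. $|\tilde\phi(\cdot)| \le |\phi(\cdot)| + d/z \le 2$
2. $|s(\tilde\phi(\cdot))| = z\,|\tilde\phi(\cdot)|$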
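The rounding and scaling steps are simple enough to sketch directly. The function names below are hypothetical, and rounding to the nearest multiple of $1/z$ is an assumption (the slides only say "integer multiple of $1/z$"); with nearest rounding each coordinate moves by at most $1/(2z)$, so the $l_1$ norm changes by at most $d/z$.

```python
def round_to_multiples(v, z):
    """Approximate each coordinate by the nearest integer multiple of 1/z."""
    return [round(c * z) / z for c in v]


def scale(v_tilde, z):
    """Multiply by z; the result has integer coordinates."""
    return [round(c * z) for c in v_tilde]


def l1(v):
    return sum(abs(c) for c in v)


# Toy embedded vector (made-up numbers) with l_1 norm 1, d = 3, z = 10.
v = [0.31, -0.27, 0.42]
v_tilde = round_to_multiples(v, 10)
s = scale(v_tilde, 10)
print(s)  # prints [3, -3, 4]
```

Each coordinate of `s` is an integer of absolute value at most $z \cdot |\tilde\phi| \le 2z$, which is the range claimed on the slide.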

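Converting to Unary
• For $i = 1$ to $d$:
  • $j \leftarrow s(\tilde\phi(x))_i$
  • Replace $s(\tilde\phi(x))_i$ with $1^{2z+j}\,0^{2z-j}$
• Bob does same for $s(\tilde\phi(y))$
• $x'$, $y'$ denote new length-$4dz$ bitstrings
• $\mathrm{wt}(x') = 2dz + \sum_i s(\tilde\phi(x))_i$, $\mathrm{wt}(y') = 2dz + \sum_i s(\tilde\phi(y))_i$
• $\Delta(x', y') = |s(\tilde\phi(x)) - s(\tilde\phi(y))|$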
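A short sketch of the unary encoding, with hypothetical names `to_unary` and `hamming`. It shows the property the reduction relies on: the Hamming distance between the encodings equals the $l_1$ distance between the integer vectors.

```python
def to_unary(s, z):
    """Replace each integer coordinate j in [-2z, 2z] by 1^(2z+j) 0^(2z-j)."""
    bits = []
    for j in s:
        assert -2 * z <= j <= 2 * z
        bits.extend([1] * (2 * z + j) + [0] * (2 * z - j))
    return bits


def hamming(a, b):
    """Number of positions where the two bitstrings differ."""
    return sum(1 for p, q in zip(a, b) if p != q)


sx, sy, z = [3, -3, 4], [0, 1, -2], 2
xp, yp = to_unary(sx, z), to_unary(sy, z)
l1_distance = sum(abs(a - b) for a, b in zip(sx, sy))
print(len(xp), hamming(xp, yp), l1_distance)  # prints 24 13 13
```

Each coordinate becomes a block of $4z$ bits, so the output length is $4dz$. Two blocks $1^{2z+j}0^{2z-j}$ and $1^{2z+k}0^{2z-k}$ differ in exactly $|j - k|$ positions.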

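Reducing $\Delta(x', y')$ to $F_0$
• Alice (Bob) chooses stream $a_{x'}$ ($a_{y'}$) with char. vector $x'$ ($y'$)
• Lemma: If $\beta_1 < \mathrm{wt}(x'), \mathrm{wt}(y') < \beta_2$, then: $\beta_1 + \Delta(x', y')/2 < F_0(a_{x'} \circ a_{y'}) < \beta_2 + \Delta(x', y')/2$
• Follows from fact: $F_0(a_{x'} \circ a_{y'}) = \mathrm{wt}(x' \vee y')$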
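The lemma rests on the identity $\mathrm{wt}(x' \vee y') = (\mathrm{wt}(x') + \mathrm{wt}(y') + \Delta(x', y'))/2$, which the sketch below verifies exhaustively for short bitstrings. The function names are illustrative only.

```python
from itertools import product


def f0_concat(xp, yp):
    """F_0 of a stream for x' followed by a stream for y': wt(x' OR y')."""
    return sum(1 for a, b in zip(xp, yp) if a or b)


def identity_holds(xp, yp):
    """Check 2 * wt(x' OR y') == wt(x') + wt(y') + Delta(x', y')."""
    delta = sum(1 for a, b in zip(xp, yp) if a != b)
    return 2 * f0_concat(xp, yp) == sum(xp) + sum(yp) + delta


def identity_holds_for_all(n):
    """Exhaustive check over all pairs of n-bit strings."""
    return all(identity_holds(xp, yp)
               for xp in product((0, 1), repeat=n)
               for yp in product((0, 1), repeat=n))
```

Given the identity, bounding both weights strictly between two values immediately bounds $F_0$ between those same values shifted by $\Delta(x', y')/2$, which is the statement of the lemma.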

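Reducing $\Delta(x', y')$ to $F_0$
• Use lemma to bound $F_0(a_{x'} \circ a_{y'})$ in the two cases $f(x,y) = 0$ and $f(x,y) = 1$
• Set $\gamma = \Theta(\varepsilon)$, $z = \Theta(1/\varepsilon^5 \log 1/\varepsilon)$ so that two cases distinguished by $(1 \pm \Theta(\varepsilon))$ $F_0$ alg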

Conclusions
• $a_{x'}$, $a_{y'}$ must be in universe of size $\ge 4zd = \Theta(\log(1/\varepsilon)/\varepsilon^9)$
• Reduction only valid if $4zd \le m$
• $\Omega(1/\varepsilon^2)$ bound for $\varepsilon = \Omega(m^{-1/(9+k)})$ $\forall k > 0$
• Recently lower bound improved to:
  • $\Omega(1/\varepsilon^2)$ for $\varepsilon \ge m^{-1/2}$, which is optimal
  • Find set of vectors directly in Hamming space via involved prob. method argument