Streaming Algorithms for Geometric Problems Piotr Indyk MIT
Data Streams
• A data stream is a (massive) sequence of data
  • Too large to store (on disk, memory, cache, etc.)
• Examples:
  • Network traffic (source/destination)
  • Sensor networks
  • Satellite data feed, etc.
• Approaches:
  • Ignore it
  • Develop algorithms for dealing with such data
Talk Overview
• Computational model
• Example problems
• (Short) history of streaming algorithms
• Streaming algorithms for geometric problems
  • Insertions only
  • Insertions and deletions
• Open problems
Computational Model
• Single pass over the data: e_1, e_2, …, e_n
• Bounded storage
• Fast processing time per element
Related Models
• External Memory:
  • Bounded storage
  • Data stored on disk
  • Random access to blocks of data
• Compact Representations of Data and Communication Complexity
• Read-Once Branching Programs
[Figures: memory and disk; Alice holding x and Bob holding y, computing F(x, y); a branching-program node testing "e_1 = 1?" with Y/N branches]
Classic Examples
• Compute the number of distinct elements:
  • Exactly: Ω(n) bits of space
  • (1+ε)-approximation: O(1/ε^2 · log n) bits [Flajolet-Martin, JCSS’85], …
• Compute the median:
  • Exactly: Ω(n)
  • (50% ± ε)-approximation: O(1/ε · polylog n) [Paterson-Munro, TCS’80], …
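To make the idea of a small-space distinct-count estimate concrete, here is a minimal sketch using a k-minimum-values (KMV) estimator. This is a stand-in chosen for brevity, not the Flajolet-Martin algorithm cited above; the class name `KMVSketch`, the parameter `k`, and the use of BLAKE2b as the hash function are all assumptions made for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest hash values seen so far (hypothetical helper)."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # max-heap via negation: the k smallest hashes
        self._members = set()  # hashes currently stored, to skip repeats

    @staticmethod
    def _hash01(item):
        # Map an item to a pseudo-random value in [0, 1).
        digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2.0 ** 64

    def add(self, item):
        h = self._hash01(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current largest stored hash with the smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))   # fewer than k distinct items: exact
        return (self.k - 1) / (-self._heap[0])
```

The storage is O(k) hash values regardless of stream length; the relative error of this estimator shrinks roughly like 1/sqrt(k), which is in the same spirit as the 1/ε^2 dependence in the bound above.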
Brief History of Streaming Algorithms
• Ancient times [MP’80, FM’85, Morris, ...]
• Middle Ages
• Renaissance [Alon-Matias-Szegedy, STOC’96]
  • Theory
  • DB (Aqua project in Bell Labs)
  • Networking
  • …
• Streaming became mainstream
Theoretical History
• Vector problems:
  • Stream defines an array of numbers
  • Maintain stats of the array, e.g., median
• Metric problems:
  • Clustering
• Graph problems, Text problems
• Geometric Problems [this talk]
Geometric Data Stream Algorithms as Data Structures
• Data structures that support:
  • Insert(p) to P
  • Possibly: Delete(p) from P
  • Compute(P)
• Use space that is sub-linear in |P|
Insertions-only
Metric clustering problems
• k-center [Charikar-Chekuri-Feder-Motwani, STOC’97]
• k-median [Guha-Mishra-Motwani-O’Callaghan, FOCS’00; Meyerson, FOCS’01; Charikar-O’Callaghan-Panigrahy, STOC’03]
• Bounds:
  • poly(k, log n) space
  • O(1)-approximation
k-median/k-center
• k is given
• Goal: choose k medians/centers to minimize:
  • k-median: the sum of the distances from each point to its nearest median
  • k-center: the maximum distance from a point to its nearest center
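The two objectives differ only in how the per-point distances are aggregated. The following is a small sketch of both cost functions for points in the plane; the function names and the sample data are illustrative assumptions, not part of the talk.

```python
import math


def nearest_distance(p, centers):
    # Distance from p to the closest of the chosen centers.
    return min(math.dist(p, c) for c in centers)


def k_median_cost(points, centers):
    # k-median objective: sum of distances to the nearest center.
    return sum(nearest_distance(p, centers) for p in points)


def k_center_cost(points, centers):
    # k-center objective: largest distance to the nearest center.
    return max(nearest_distance(p, centers) for p in points)


# Hypothetical data: four points on a line, k = 2 centers.
points = [(0, 0), (1, 0), (10, 0), (13, 0)]
centers = [(0, 0), (10, 0)]
```

With this data the per-point distances are 0, 1, 0 and 3, so the k-median cost is their sum and the k-center cost is their maximum.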
Geometric Problems
• Diameter, Minimum Enclosing Ball [Agarwal-Har-Peled, SODA’01; Feigenbaum-Kannan-Zhang’02 (Algorithmica); Hershberger-Suri, PODS’04]
• k-center [AHP, SODA’01]
• k-median [Har-Peled-Mazumdar, STOC’04]
• Range searching via ε-approximations:
  • [Suri-Toth-Zhou, SoCG’04]
  • [Bagchi-Chaudhary-Eppstein-Goodrich, SoCG’04]
Dominant Approach: Merge and Reduce
• Main ideas:
  • Design an (off-line) algorithm that computes a “sketch” of the input
    • Small size
    • Sufficient to solve the problem
  • A sketch of sketches is a sketch
Tree Computation
[Figure: tree of sketches computed over the stream points p_1, p_2, …, p_16]
Algorithm
• Space: (sketch size) · log n
• Time: sketch computation time
• Question: where do sketches come from?
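The merge-and-reduce tree can be maintained like a binary counter: at most one sketch is kept per level, and two sketches at the same level are merged and reduced into one sketch at the next level. The code below is a minimal sketch under stated assumptions: `MergeReduce`, `reduce_fn` and `block_size` are hypothetical names, sketches are plain lists, and the demo sketch `extent_sketch` (the minimum and maximum of 1-D values) is chosen only because it composes exactly; the talk's sketches are k-median solutions, ε-approximations and core-sets.

```python
class MergeReduce:
    """Binary-counter maintenance of one sketch per tree level (illustrative)."""

    def __init__(self, reduce_fn, block_size):
        self.reduce_fn = reduce_fn    # maps a list of items to a small list
        self.block_size = block_size
        self.buffer = []              # raw items not yet sketched
        self.levels = []              # levels[i]: None or a sketch of block_size * 2**i items

    def insert(self, item):
        self.buffer.append(item)
        if len(self.buffer) == self.block_size:
            self._carry(self.reduce_fn(self.buffer))
            self.buffer = []

    def _carry(self, sketch):
        i = 0
        while i < len(self.levels) and self.levels[i] is not None:
            # Two sketches at level i: merge, reduce, and move up one level.
            sketch = self.reduce_fn(self.levels[i] + sketch)
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(sketch)
        else:
            self.levels[i] = sketch

    def summary(self):
        # "A sketch of sketches is a sketch": reduce everything currently stored.
        items = list(self.buffer)
        for s in self.levels:
            if s is not None:
                items.extend(s)
        return self.reduce_fn(items) if items else []


def extent_sketch(values):
    # Demo sketch: the extreme values of a 1-D point set.
    return [min(values), max(values)]
```

After n insertions there are about log2(n / block_size) levels, each holding at most one sketch, which matches the (sketch size) · log n space bound stated above.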
Idea 1: solution = sketch
• Consider k-median
• [GMMO’00]: an approximate k-median of approximate weighted k-medians is an approximate k-median
• Result:
  • Constant-depth tree
  • Space: k·n^ε, ε > 0
  • O(1)-approximation
  • Works for any metric space
[Figure: example with k = 3; labels 3, 2, 1]
Use the solution, ctd.
• ε-approximations: find a subset S ⊂ P such that for any rectangle/halfspace/etc. R, |R ∩ S|/|S| = |R ∩ P|/|P| ± ε
• [Matousek]: an approximation of a union of approximations is an approximation
• [BCEG’04]: convert it into a streaming algorithm, applications
  • 1/ε^2 space
• [STZ’04]: better/optimal bounds for rectangles and halfspaces
Idea 2: Core-Sets [AHP’01]
• Assume we want to minimize C_P(o)
• S ⊂ P is an ε-core-set for P if, for any o and any set T: C_{P∪T}(o) < (1+ε) C_{S∪T}(o)
• Note: this must hold for all o, not just the optimal one
Example: Core-set for MEB
• Compute extremal points:
  • Choose “densely” spaced directions v_1, …, v_k
    • I.e., for any u there is a v_i such that u·v_i ≥ ||u||_2 / (1+ε)
  • For each direction, maintain the extremal point
• k = O(1/ε)^((d-1)/2) suffices
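As a rough illustration of maintaining extremal points, the sketch below works in the plane only and uses k evenly spaced unit directions; it then estimates the diameter from the stored points. The class name `DirectionalExtremes`, the even-k requirement and the even angular spacing are assumptions of this example, not the construction from [AHP’01]. With this spacing the estimate is at least cos(π/k) times the true diameter and never exceeds it.

```python
import math


class DirectionalExtremes:
    """Stores, for each of k directions, the point farthest along it (2-D only)."""

    def __init__(self, k=16):
        if k < 2 or k % 2:
            raise ValueError("k must be even so that -v is a direction whenever v is")
        self.dirs = [(math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
                     for i in range(k)]
        self.best = [None] * k        # (score, point) per direction

    def insert(self, p):
        for i, (vx, vy) in enumerate(self.dirs):
            score = p[0] * vx + p[1] * vy
            if self.best[i] is None or score > self.best[i][0]:
                self.best[i] = (score, p)

    def core_points(self):
        return [b[1] for b in self.best if b is not None]

    def diameter_estimate(self):
        pts = self.core_points()
        return max((math.dist(a, b) for a in pts for b in pts), default=0.0)
```

Only k points are stored no matter how many are inserted, and each insertion costs O(k) time; deletions are not supported, since a deleted extremal point cannot be replaced without revisiting the stream.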
Stream Algorithms via Core-sets
• Diameter/MEB/width: O(1/ε)^((d-1)/2) log n space [AHP’01]
• k-center: O(k/ε^d) log n [HP’01]
• k-median: O(k/ε^d) log n [HPM’04]
• Faster algorithms and other results: [Chan, SoCG’04], [Suri-Hershberger’03]
Limitations
• Small core-sets might not exist (see next slide)
• Do not support deletions
Minimum Weight Bi-chromatic Matching
• Estimate the cost of MWBM
Insertions and Deletions
Streaming Algorithms for Vector Problems
• Norm estimation:
  • Stream elements: (i, b), i = 1…m
  • Interpretation: x_i = x_i + b
  • Want to maintain ||x||_p
• Why? Examples:
  • ||x||_p^p = Σ_i |x_i|^p = #non-zero elements in x, as p → 0
  • …
Dimensionality reduction
• L2: Johnson-Lindenstrauss Lemma:
  • x is an m-dimensional vector
  • A is a random k × m matrix (suitably scaled), each entry independently drawn from, e.g., a Gaussian distribution, k = O(log N/ε^2)
  • Then, with probability 1 − 1/N: ||x||_2 ≤ ||Ax||_2 ≤ (1+ε) ||x||_2
• A can be pseudo-random [AMS’96]*
*Using a slightly different method for norm estimation
What it means
• To know ||x||_2, it suffices to know Ax
• Can maintain Ax when the coordinates are incremented: A(x + b·e_i) = Ax + b·A·e_i
• Can maintain the approximate L2-norm of x
• Similar approach works for p ∈ (0, 2] [Indyk, FOCS’00]
[Figure: the product Ax = A · x]
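The linearity argument above translates directly into an update rule: an element (i, b) adds b times column i of A to the stored vector Ax. The following is a minimal sketch, assuming Gaussian entries regenerated on demand from a seeded generator (so A is never stored) and a 1/sqrt(k) normalization of the estimate. The class name `L2Sketch` and the seeding scheme are illustrative assumptions, not the pseudo-random construction of [AMS’96].

```python
import math
import random


class L2Sketch:
    """Maintains Ax for a k x m Gaussian matrix A under updates x_i += b."""

    def __init__(self, k=200, seed=0):
        self.k = k
        self.seed = seed
        self.sketch = [0.0] * k       # the vector Ax

    def _column(self, i):
        # Column i of A, regenerated deterministically from (seed, i).
        rng = random.Random(f"{self.seed}:{i}")
        return [rng.gauss(0.0, 1.0) for _ in range(self.k)]

    def update(self, i, b):
        # A(x + b*e_i) = Ax + b*A*e_i, and A*e_i is column i of A.
        for j, a in enumerate(self._column(i)):
            self.sketch[j] += b * a

    def estimate(self):
        # ||Ax||_2 / sqrt(k) concentrates around ||x||_2.
        return math.sqrt(sum(s * s for s in self.sketch) / self.k)
```

Because the sketch is linear, a decrement is just an update with negative b, so insertions and deletions are handled by the same code path.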
Histograms
• View x as a function x: [1…n] → [1…M]
• Approximate it using a piecewise constant function h with B pieces (buckets)
• Problem can be formulated in 2D as well (buckets become rectangular tiles)
Results: 1D
• [Gilbert-Guha-Indyk-Kotidis-Muthukrishnan-Strauss, STOC’02]:
  • Maintains h with B pieces such that ||x − h||_2 ≤ (1+ε) ||x − h_OPT||_2
  • Under increments/decrements of x
  • Space: poly(B, 1/ε, log n)
  • Time: poly(B, 1/ε, log n)
Results: 2D
• [Thaper-Guha-Indyk-Koudas, SIGMOD’02]:
  • Maintains h with B log(nM) tiles such that ||x − h||_2 ≤ (1+ε) ||x − h_OPT||_2
  • Under increments/decrements of x
  • Space/update time: poly(B, 1/ε, log n)
  • Histogram reconstruction time: poly(B, 1/ε, n)
• [Muthukrishnan-Strauss, FSTTCS’03]:
  • Maintains h with 4B tiles
  • Time: poly(B, 1/ε, log(nM))
General Approach
• Maintain sketches Ax of x
• This allows us to estimate the error of any given h via ||x − h|| ≈ ||Ax − Ah||
• Construct h:
  • Enumeration
  • Greedy
  • Dynamic programming
Minimum Weight Matching
• Estimate the cost of MWM
Minimum Spanning Tree
• Estimate the cost of MST
Facility Location
• Goal: choose a set F of facilities to minimize the sum of the distances to the nearest facility plus the number of facilities times f
• Again, report the cost
Approach
• Assume P ⊂ {1…Δ}^2
• Reduce to vector problems:
  • Impose square grids G_0, …, G_k, with side lengths 2^0, 2^1, …, 2^k, shifted at random
  • For each square cell c in G_i, let n_P(c) be the number of points from P in c
  • The algorithms maintain certain statistics over n_P(·), which allow them to approximately solve the problems
[Figure: grid cells labeled with point counts 2, 3, 1, 1, 1, 5]
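The reduction to vector problems can be pictured as follows: every insertion or deletion of a point changes exactly one counter n_P(c) per grid level by +1 or −1. Below is a minimal sketch that stores the counters exactly in dictionaries; a streaming algorithm would keep only small-space statistics of these vectors instead. The class name `GridCounts`, the single random shift shared by all levels, and the integer-coordinate convention are assumptions of this example.

```python
import random


class GridCounts:
    """Exact cell counts n_P(c) on randomly shifted nested grids (illustrative)."""

    def __init__(self, delta, seed=0):
        self.k = (delta - 1).bit_length()          # smallest k with 2**k >= delta
        rng = random.Random(seed)
        side = 2 ** self.k
        self.shift = (rng.randrange(side), rng.randrange(side))
        self.counts = [dict() for _ in range(self.k + 1)]

    def _cells(self, p):
        x, y = p[0] + self.shift[0], p[1] + self.shift[1]
        # Cell of p in grid G_i, whose cells have side length 2**i.
        return [(x >> i, y >> i) for i in range(self.k + 1)]

    def _update(self, p, sign):
        for i, c in enumerate(self._cells(p)):
            value = self.counts[i].get(c, 0) + sign
            if value == 0:
                self.counts[i].pop(c, None)        # keep only non-zero entries
            else:
                self.counts[i][c] = value

    def insert(self, p):
        self._update(p, +1)

    def delete(self, p):
        # Assumes p was inserted earlier; otherwise counts go negative.
        self._update(p, -1)
```

Each update touches k + 1 counters, one per level, which is why the problems reduce to maintaining statistics of vectors under increments and decrements.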
Estimators
• MST: ∑_i 2^i ∑_{c∈G_i} [n_P(c) > 0]
• MWM: ∑_i 2^i ∑_{c∈G_i} [n_P(c) is odd]
• MWBM: ∑_i 2^i ∑_{c∈G_i} |n_G(c) − n_B(c)|
• Fac. Loc.: ∑_i 2^i ∑_{c∈G_i} min[n_P(c), T_i]
• K-median (const. factor): ∑_i 2^i ∑_{c∈G_i − B(Q, 2^i)} n_P(c)
• Maintain #non-zero entries in n_P [FM’85]
• Maintain L1 difference [I’00]
Results [Indyk’04]
• Problem: MST, MWM, MWBM*, Fac. Loc.
• Appr.: log^2 Δ
• Space: (log Δ + log n)^O(1)
*Follows from Charikar, STOC’02; also Agarwal-Varadarajan, SoCG’04 and Indyk-Thaper’
Results: K-median
• Computation time: O(k) poly(log n + 1/ε); 2 poly(log n + log Δ + k); poly(log n + log Δ + k)
• Approximation: 1+ε; O(1); [1+ε, log n log Δ]
• Space: (k + log Δ + log n)^O(1)
Probabilistic embeddings into HSTs
[Figure: tree T over grid cells with point counts 2, 3, 1, 1, 1, 5]
Known [Bartal, FOCS’96; Charikar-Chekuri-Goel-Guha-Plotkin, STOC’98]:
• ||p − q|| ≤ D_tree(p, q)
• E[D_tree(p, q)] ≤ ||p − q|| · O(log Δ)
MST
[Figure: tree over grid cells with point counts 2, 3, 1, 1, 1, 5]
• E[Cost(MST in T)] ≤ O(log Δ) · Cost(MST)
• Cost(MST in T) ≈ Cost(T)
• How to compute Cost(T)?
  • Sum over all levels i of the #nodes at i, times 2^i
  • Node c exists iff n_i(c) > 0
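The Cost(T) computation described above can be written out offline in a few lines: count the non-empty cells at each level and weight level i by 2^i. This is a minimal sketch that assumes integer coordinates in {0, …, Δ−1} and takes the grid shift as an explicit parameter (zero by default, rather than random) so the result is reproducible; the function name `tree_cost` is hypothetical. A streaming version would replace the exact set of non-empty cells by a small-space estimate of the number of non-zero entries.

```python
def tree_cost(points, delta, shift=(0, 0)):
    """Sum over levels i of 2**i times the number of non-empty cells of G_i."""
    k = (delta - 1).bit_length()      # smallest k with 2**k >= delta
    total = 0
    for i in range(k + 1):
        # A tree node (cell) exists at level i iff it contains some point.
        nonempty = {((x + shift[0]) >> i, (y + shift[1]) >> i) for x, y in points}
        total += (2 ** i) * len(nonempty)
    return total
```

For example, with Δ = 8 and the points (0, 0) and (7, 7), levels 0, 1 and 2 each have two non-empty cells and level 3 has one, giving 2·1 + 2·2 + 2·4 + 1·8 = 22.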
Matching
[Figure: tree with nodes labeled by parities 0/1]
• Algorithm:
  • Match what you can at the current level
  • Odd leftovers wait for the next level
  • Repeat
• Optimal on the HST
• Cost = ∑_i 2^i ∑_{c∈G_i} [n_P(c) is odd]
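The level-by-level algorithm and the parity formula can be checked against each other offline. The sketch below pairs points inside each cell from the finest grid upward, charges 2^i for every leftover point at level i, and compares the total with the parity-based sum. It assumes integer coordinates in {0, …, Δ−1} and an explicit (by default zero) grid shift; the function names are hypothetical.

```python
from collections import defaultdict


def cell(p, i, shift=(0, 0)):
    # Cell of p in the grid with side length 2**i.
    return ((p[0] + shift[0]) >> i, (p[1] + shift[1]) >> i)


def parity_cost(points, delta, shift=(0, 0)):
    """Sum over levels i of 2**i times the number of cells with an odd count."""
    k = (delta - 1).bit_length()
    total = 0
    for i in range(k + 1):
        counts = defaultdict(int)
        for p in points:
            counts[cell(p, i, shift)] += 1
        total += (2 ** i) * sum(1 for n in counts.values() if n % 2 == 1)
    return total


def greedy_matching(points, delta, shift=(0, 0)):
    """Match within cells level by level; odd leftovers wait for the next level."""
    k = (delta - 1).bit_length()
    unmatched, pairs, cost = list(points), [], 0
    for i in range(k + 1):
        groups = defaultdict(list)
        for p in unmatched:
            groups[cell(p, i, shift)].append(p)
        unmatched = []
        for members in groups.values():
            while len(members) >= 2:
                pairs.append((members.pop(), members.pop()))
            if members:                       # one leftover in an odd cell
                unmatched.append(members[0])
                cost += 2 ** i
    return pairs, cost, unmatched
```

The two costs agree because an even number of points is matched away inside every cell, so the number of leftovers in a cell always has the parity of n_P(c). Only the parities are needed, which is what makes the estimator maintainable under insertions and deletions.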
Conclusions
• Algorithms for geometric data streams
  • Insertions-only: merge and reduce
  • Insertions and deletions: randomized linear embeddings
Open Problems
• High dimensions:
  • Diameter:
    • 2^(1/2)-approx, O(d^2 n^(1/2)) space, follows from [Goel-Indyk-Varadarajan, SODA’01]
    • c-approx, O(d n^(1/(c^2 − 1))) [Indyk, SODA’03]
    • Conjecture: 2^(1/2)-approx, O(d polylog n) space
  • Min-width cylinder: 18-approx, O(d) space [Chan’04]
  • Other problems?
Open Problems
• Range queries:
  • General lower bounds? (Not just for approximations)
  • Ω(1/ε^2)-bit bound for general queries follows from LB for dot product [Indyk-Woodruff, FOCS’03], and is tight (for randomized algorithms)
  • What about, e.g., half-space queries? O(1/ε^(4/3)) is known [STZ’04]
  • Other problems [STZ’04]
Open Problems
• Matchings, Facility Location, etc.:
  • Replace log Δ by O(1) or even 1+ε
  • Possible for MST [Frahling-Indyk-Sohler’??]
  • Related to computing bi-chromatic matching [Agarwal-Varadarajan’04]
• Min-sum clustering?
Open Problems
• Better core-sets:
  • k-median: 1/ε^d → 1/ε^((d-1)/2)? Possible for d = 1 [Indyk]
  • k-center: 1/ε^d → 1/ε^((d-1)/2). Possible for k = 1 (this is minimum enclosing ball)
• Insertions and deletions?
  • k-median: poly(log n + log Δ + k + 1/ε) space/time, (1+ε)-approximation?
The End – Thank you!