On Random Sampling over Joins
Surajit Chaudhuri (Microsoft Research), Rajeev Motwani (Stanford University), Vivek Narasayya (Microsoft Research)
Compiled by: Arjun Dasgupta

CONTENTS
• The difficulty of join sampling
• Semantics and algorithms of sampling
• Two previous sampling strategies
• New strategies for join sampling
• Experimental results

$\mathrm{SAMPLE}(R_1 \bowtie R_2, f) \neq \mathrm{SAMPLE}(R_1, f) \bowtie \mathrm{SAMPLE}(R_2, f)$
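The inequality can be checked numerically. The sketch below is an illustration only: it assumes SAMPLE keeps each tuple independently with probability f (Bernoulli sampling, one possible semantics), and the relations and helper names are hypothetical. Under that assumption each join tuple survives only when both of its source tuples survive, so the join of the samples has expected size f² · |R1 ⋈ R2| rather than f · |R1 ⋈ R2|.

```python
import random

def join(r1, r2):
    """Equi-join on the first field of each tuple (assumed join attribute)."""
    return [(a, b, c) for (a, b) in r1 for (a2, c) in r2 if a == a2]

def bernoulli_sample(rel, f, rng):
    """Keep each tuple independently with probability f (assumed semantics)."""
    return [t for t in rel if rng.random() < f]

def mean_join_of_samples_size(r1, r2, f, trials, rng):
    """Average size of SAMPLE(R1, f) join SAMPLE(R2, f) over many trials."""
    total = 0
    for _ in range(trials):
        s1 = bernoulli_sample(r1, f, rng)
        s2 = bernoulli_sample(r2, f, rng)
        total += len(join(s1, s2))
    return total / trials

rng = random.Random(0)
# Hypothetical relations: 20 tuples each, all sharing one join value.
r1 = [("k", i) for i in range(20)]
r2 = [("k", j) for j in range(20)]
full = len(join(r1, r2))  # 20 * 20 join tuples
avg = mean_join_of_samples_size(r1, r2, 0.5, 2000, rng)
print(full, avg)  # avg lands near f*f*full, well below f*full
```

Size is only part of the problem: tuples in the join of the samples also share source tuples, so they are correlated rather than independent draws from the join.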

STRATEGY USED
• Obtain $\mathrm{SAMPLE}(R_1 \bowtie R_2, f)$ from nonuniform samples of $R_1$ and $R_2$

The Difficulty of Join Sampling Example: • Suppose that we have the relations

TECHNIQUES FOR SAMPLING
• Black-Box U1 (unweighted)
• Black-Box U2 (unweighted)
• Black-Box WR1 (weighted)
• Black-Box WR2 (weighted)

Black-Box U2: Given relation $R$ with $n$ tuples, generate an unweighted WR (with-replacement) sample of size $r$.
1. $N \leftarrow 0$.
2. Initialize reservoir array $A[1..r]$ with $r$ dummy values.
3. While tuples are streaming by do begin
   (a) get next tuple $t$;
   (b) $N \leftarrow N + 1$;
   (c) for $j = 1$ to $r$ do set $A[j]$ to $t$ with probability $1/N$
   end.
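The reservoir procedure of Black-Box U2 can be sketched in Python as follows. This is a minimal sketch, not the authors' code: it assumes N is a running count of the tuples seen so far, and the function name `black_box_u2` and the `rng` parameter are hypothetical.

```python
import random

def black_box_u2(stream, r, rng=None):
    """Unweighted with-replacement sample of size r from a one-pass stream.

    Assumption: N counts the tuples seen so far, so each slot is
    overwritten by the N-th tuple with probability 1/N.
    """
    rng = rng or random.Random()
    reservoir = [None] * r          # r dummy values
    n = 0
    for t in stream:                # (a) get next tuple t
        n += 1                      # (b) advance the counter
        for j in range(r):          # (c) each slot decides independently
            if rng.random() < 1.0 / n:
                reservoir[j] = t
    return reservoir

sample = black_box_u2(iter(range(100)), 5, random.Random(1))
print(sample)
```

Because every slot runs its own independent size-1 reservoir, the same tuple can occupy several slots, which is what makes the result a with-replacement sample.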

Black-Box WR2: Given relation $R$ with $n$ tuples, generate a weighted WR sample of size $r$.
1. $W \leftarrow 0$.
2. Initialize reservoir array $A[1..r]$ with $r$ dummy values.
3. While tuples are streaming by do begin
   (a) get next tuple $t$ with weight $w(t)$;
   (b) $W \leftarrow W + w(t)$;
   (c) for $j = 1$ to $r$ do set $A[j]$ to $t$ with probability $w(t)/W$
   end.
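A weighted variant of the reservoir can be sketched the same way. The sketch assumes W is the running total of the weights seen so far; the function name `black_box_wr2`, the `weight` callback, and the choice to skip zero-weight tuples are illustrative assumptions.

```python
import random

def black_box_wr2(stream, r, weight, rng=None):
    """Weighted with-replacement sample of size r from a one-pass stream.

    Assumption: W is the running sum of weights, so each slot is
    overwritten by tuple t with probability w(t)/W.
    """
    rng = rng or random.Random()
    reservoir = [None] * r          # r dummy values
    total = 0.0
    for t in stream:                # (a) get next tuple and its weight
        w = weight(t)
        if w <= 0:
            continue                # zero-weight tuples can never be chosen
        total += w                  # (b) update the running weight
        for j in range(r):          # (c) each slot decides independently
            if rng.random() < w / total:
                reservoir[j] = t
    return reservoir

# Hypothetical weights: "b" is four times as likely as "a" or "c".
weights = {"a": 1, "b": 4, "c": 1}
picked = black_box_wr2(iter("abc"), 6, weights.get, random.Random(2))
print(picked)
```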

The Classification of the Problem:
• Case A: No information is available for either $R_1$ or $R_2$.
• Case B: No information is available for $R_1$, but indexes and/or statistics are available for $R_2$.
• Case C: Indexes/statistics are available for both $R_1$ and $R_2$.

Previous Sampling Strategies
Strategy Naive-Sample:
1. Compute the join $J = R_1 \bowtie R_2$.
2. As the tuples of $J$ stream by, use Black-Box U1 or U2 to produce the sample.
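A minimal sketch of Naive-Sample follows, under stated assumptions: tuples carry the join attribute in position 0, the join is produced by an in-memory hash join, and the sampling step is a with-replacement reservoir in the style of Black-Box U2. The names `join_stream` and `naive_sample` are hypothetical.

```python
import random

def join_stream(r1, r2):
    """Yield every tuple of the equi-join on field 0 (simple hash join)."""
    index = {}
    for t2 in r2:
        index.setdefault(t2[0], []).append(t2)
    for t1 in r1:
        for t2 in index.get(t1[0], []):
            yield (t1, t2)

def naive_sample(r1, r2, r, rng=None):
    """Compute the whole join, then reservoir-sample r tuples from it."""
    rng = rng or random.Random()
    reservoir = [None] * r
    n = 0
    for pair in join_stream(r1, r2):   # every join tuple is produced
        n += 1
        for j in range(r):
            if rng.random() < 1.0 / n:
                reservoir[j] = pair
    return reservoir

# Hypothetical relations; the join attribute is the first field.
r1 = [(1, "x"), (2, "y")]
r2 = [(1, "p"), (1, "q"), (3, "z")]
print(naive_sample(r1, r2, 4, random.Random(0)))
```

The cost is the point of the example: the entire join is computed even when only a handful of sample tuples are wanted.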

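Previous Sampling Strategies
Strategy Olken-Sample:
1. Let $M$ be an upper bound on $m_2(v)$ for all values $v$ of the join attribute $A$, where $m_2(v)$ is the number of tuples of $R_2$ with $A = v$.
2. repeat
   (a) Sample a tuple $t_1 \in R_1$ uniformly at random.
   (b) Sample a random tuple $t_2 \in R_2$ from among all tuples $t \in R_2$ that have $t.A = t_1.A$.
   (c) Output $t_1 \bowtie t_2$ with probability $m_2(t_2.A)/M$, and with the remaining probability reject the sample.
   until $r$ tuples have been produced.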
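The rejection loop of Olken-Sample can be sketched as below. The sketch assumes the join attribute sits in position 0 of each tuple, takes the acceptance probability to be the number of matching R2 tuples divided by M, and uses an in-memory dictionary as a stand-in for an index on R2. The function name and the `max_attempts` safety cap are hypothetical additions.

```python
import random
from collections import defaultdict

def olken_sample(r1, r2, r, rng=None, max_attempts=1_000_000):
    """Rejection sampling over the join; r1 must be a list (random access)."""
    rng = rng or random.Random()
    index = defaultdict(list)            # stand-in for an index on R2
    for t2 in r2:
        index[t2[0]].append(t2)
    if not r1 or not index:
        return []
    m = max(len(group) for group in index.values())  # upper bound M
    out, attempts = [], 0
    while len(out) < r and attempts < max_attempts:
        attempts += 1
        t1 = rng.choice(r1)              # (a) uniform tuple from R1
        matches = index.get(t1[0], [])
        if not matches:
            continue                     # no partner in R2: always rejected
        t2 = rng.choice(matches)         # (b) uniform matching tuple from R2
        if rng.random() < len(matches) / m:
            out.append((t1, t2))         # (c) accept, otherwise reject
    return out

r1 = [(1, "x"), (2, "y"), (3, "w")]
r2 = [(1, "p"), (1, "q"), (2, "s")]
print(olken_sample(r1, r2, 5, random.Random(0)))
```

Every pass of the loop picks a specific join tuple with probability 1/(|R1| · M), which is what makes the accepted tuples uniform over the join; the price is the rejected iterations.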

New Strategies for Join Sampling
Strategy Stream-Sample:
1. Use Black-Box WR1 or WR2 to produce a WR sample $S_1 \subseteq R_1$ of size $r$, where the weight $w(t)$ for a tuple $t \in R_1$ is set to $m_2(t.A)$.
2. While tuples of $S_1$ are streaming by do begin
   (a) get next tuple $t_1$ and let $v = t_1.A$;
   (b) sample a random tuple $t_2 \in R_2$ from among all tuples $t \in R_2$ that have $t.A = v$;
   (c) output $t_1 \bowtie t_2$.
   end.
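Stream-Sample can be sketched as follows, again as an illustration under assumptions: the join attribute is field 0, the weight of an R1 tuple is the number of R2 tuples sharing its join value, and a dictionary stands in for the index on R2. The name `stream_sample` is hypothetical.

```python
import random
from collections import defaultdict

def stream_sample(r1_stream, r2, r, rng=None):
    """Weighted reservoir over R1, then one random R2 partner per sampled tuple."""
    rng = rng or random.Random()
    index = defaultdict(list)            # stand-in for an index on R2
    for t2 in r2:
        index[t2[0]].append(t2)
    # Step 1: weighted WR sample of R1, weight = matching tuples in R2.
    s1 = [None] * r
    total = 0
    for t1 in r1_stream:
        w = len(index.get(t1[0], ()))
        if w == 0:
            continue                     # tuple contributes nothing to the join
        total += w
        for j in range(r):
            if rng.random() < w / total:
                s1[j] = t1
    if total == 0:
        return []                        # empty join
    # Step 2: pair each sampled R1 tuple with a uniform matching R2 tuple.
    return [(t1, rng.choice(index[t1[0]])) for t1 in s1]

r1 = [(1, "x"), (2, "y"), (3, "w")]
r2 = [(1, "p"), (1, "q"), (2, "s")]
print(stream_sample(iter(r1), r2, 5, random.Random(0)))
```

R1 is consumed in a single pass with no index or statistics on it, and no candidate is discarded once it has been paired with an R2 tuple.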

New Strategies for Join Sampling
• Strategy Stream-Sample is more efficient than Olken-Sample:
  1. No information about $R_1$ is required (Case B).
  2. No tuple is rejected after computing the join.
  3. Only one iteration is needed for each output tuple.

New Strategies for Join Sampling
Strategy Group-Sample:
1. Use Black-Box WR1 or WR2 to produce a WR sample $S_1 \subseteq R_1$ of size $r$, where the weight $w(t)$ for a tuple $t \in R_1$ is set to $m_2(t.A)$.
2. Let $S_1$ consist of the tuples $s_1, \ldots, s_r$. Produce $S_2 = S_1 \bowtie R_2$, whose tuples are grouped by the tuples of $S_1$ that generated them.
3. Use $r$ invocations of Black-Box U1 or U2 to sample $r$ tuples, one from each group.
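One way to sketch Group-Sample is shown below. It assumes the join attribute is field 0 and that frequency statistics for R2 are supplied as a dictionary `m2` mapping join value to tuple count; the function name is hypothetical. Instead of materializing the groups, the sketch keeps a size-1 reservoir per sampled R1 tuple while R2 streams by, which selects one tuple uniformly from each group.

```python
import random
from collections import defaultdict

def group_sample(r1_stream, r2_stream, r, m2, rng=None):
    """Weighted sample of R1 using statistics m2, then one uniform pick per group."""
    rng = rng or random.Random()
    # Step 1: weighted WR sample of R1 with weight m2[join value].
    s1 = [None] * r
    total = 0
    for t1 in r1_stream:
        w = m2.get(t1[0], 0)
        if w == 0:
            continue
        total += w
        for j in range(r):
            if rng.random() < w / total:
                s1[j] = t1
    if total == 0:
        return []
    # Step 2: group slots by join value so each R2 tuple finds its groups.
    slots_by_key = defaultdict(list)
    for j, t1 in enumerate(s1):
        slots_by_key[t1[0]].append(j)
    # Step 3: one size-1 reservoir per slot over the streamed R2 tuples.
    seen = [0] * r
    pick = [None] * r
    for t2 in r2_stream:
        for j in slots_by_key.get(t2[0], ()):
            seen[j] += 1
            if rng.random() < 1.0 / seen[j]:
                pick[j] = t2
    return [(s1[j], pick[j]) for j in range(r)]

r1 = [(1, "x"), (2, "y"), (3, "w")]
r2 = [(1, "p"), (1, "q"), (2, "s")]
stats = {1: 2, 2: 1}   # assumed accurate frequency statistics for R2
print(group_sample(iter(r1), iter(r2), 6, stats, random.Random(0)))
```

The sketch relies on the statistics being accurate: if `m2` reports a join value that never occurs in R2, the corresponding slot keeps its dummy partner.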

New Strategy for Join Sampling • Strategy Frequency-Partition-Sample

Experimental Results:

Summary
• The difficulty of join sampling: example.
• The classification of the problem: 3 cases.
• Previous strategies: Naive-Sample, Olken-Sample.
• New strategies: Stream-Sample, Group-Sample, Frequency-Partition-Sample.
• Conclusion: The new strategies are better than the earlier techniques.