More Stream-Mining: Counting How Many Elements, Computing “Moments”

Counting Distinct Elements

• Problem: a data stream consists of elements chosen from a set of size $n$. Maintain a count of the number of distinct elements seen so far.
• Obvious approach: maintain the set of elements seen.
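The obvious approach can be sketched in a few lines. This is only an illustration; the function name `count_distinct_exact` is an assumption, not something from the slides. Its memory use grows with the number of distinct elements, which is what motivates the small-storage estimates that follow.

```python
def count_distinct_exact(stream):
    """Return the exact number of distinct elements by remembering each one."""
    seen = set()          # grows with the number of distinct elements
    for a in stream:
        seen.add(a)
    return len(seen)
```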

Applications

• How many different words are found among the Web pages being crawled at a site?
  ◦ Unusually low or high numbers could indicate artificial pages (spam?).
• How many different Web pages does each customer request in a week?

Using Small Storage

• Real problem: what if we do not have space to store the complete set?
• Estimate the count in an unbiased way.
• Accept that the count may be in error, but limit the probability that the error is large.

Flajolet-Martin* Approach

• Pick a hash function $h$ that maps each of the $n$ elements to $\log_2 n$ bits, uniformly.
  ◦ It is important that the hash function be (almost) a random permutation of the elements.
• For each stream element $a$, let $r(a)$ be the number of trailing 0’s in $h(a)$.
• Record $R$ = the maximum $r(a)$ seen.
• Estimate = $2^R$.

* Really based on a variant due to AMS (Alon, Matias, and Szegedy).
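The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `hash32` is a hypothetical stand-in for $h$ built from BLAKE2b, and it produces a fixed 32 bits rather than $\log_2 n$ bits.

```python
import hashlib

def trailing_zeros(x: int, width: int = 32) -> int:
    """Number of trailing 0 bits in x; a value of 0 counts as `width` zeros."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def hash32(item: str, seed: int = 0) -> int:
    """Hypothetical stand-in for h: a seeded 32-bit hash (assumption)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=4,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")

def fm_estimate(stream, seed: int = 0) -> int:
    """Return 2**R, where R is the maximum r(a) over the stream."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(hash32(a, seed)))
    return 2 ** R
```

Repeated elements hash to the same value, so they never change $R$; only the set of distinct elements matters. A single estimate is always a power of 2 and can be far from the true count.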

Why It Works

• The probability that a given $h(a)$ ends in at least $r$ 0’s is $2^{-r}$.
• If there are $m$ elements in the stream, the probability that $R \ge r$ is $1 - (1 - 2^{-r})^m$.
• If $2^r \gg m$, prob ≈ $m / 2^r$ (small).
• If $2^r \ll m$, prob ≈ 1.
• Thus, $2^R$ will almost always be around $m$.
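The two limiting cases can be checked numerically. The sketch below simply evaluates the formula $1 - (1 - 2^{-r})^m$; the function name `prob_R_at_least` and the sample values of $m$ and $r$ are illustrative assumptions.

```python
def prob_R_at_least(r: int, m: int) -> float:
    """Probability that at least one of m hash values ends in r or more 0's."""
    return 1.0 - (1.0 - 2.0 ** -r) ** m

m = 1000
small = prob_R_at_least(20, m)   # 2**20 is much larger than m
large = prob_R_at_least(3, m)    # 2**3 is much smaller than m
print(small, m / 2 ** 20)        # the two values should be close
print(large)                     # should be very close to 1
```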

Why It Doesn’t Work

• $E(2^R)$ is actually infinite.
  ◦ The probability halves when $R \to R + 1$, but the value doubles.
• That means using many hash functions and getting many samples.
• How are the samples combined?
  ◦ Average? What if there is one very large value?
  ◦ Median? All values are a power of 2.
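The "halves but doubles" remark can be made slightly more explicit. The sketch below assumes an idealized hash with unboundedly many bits and $m$ distinct elements, and uses the approximation $\Pr[R \ge r] \approx m / 2^r$ for $2^r \gg m$.

```latex
% Assumption: idealized hash with unboundedly many bits, m distinct elements.
\begin{align*}
\Pr[R = r] &= \Pr[R \ge r] - \Pr[R \ge r+1]
            \approx \frac{m}{2^{r}} - \frac{m}{2^{r+1}} = \frac{m}{2^{r+1}}
            \quad (2^r \gg m), \\
2^{r} \Pr[R = r] &\approx \frac{m}{2}, \\
E\!\left(2^{R}\right) &= \sum_{r \ge 0} 2^{r} \Pr[R = r] = \infty .
\end{align*}
```

Each large $r$ contributes roughly the same amount to the sum, so the terms never shrink. With a hash of bounded width the sum is finite, but it is still dominated by rare, very large values.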

Solution

• Partition your samples into small groups.
• Take the average of each group.
• Then take the median of the averages.
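A minimal sketch of this combining rule is shown below. The function name `median_of_means` and the choice to let the last group be shorter when the samples do not divide evenly are assumptions made for illustration.

```python
from statistics import mean, median

def median_of_means(samples, group_size):
    """Average the samples within consecutive groups, then take the median."""
    groups = [samples[i:i + group_size]
              for i in range(0, len(samples), group_size)]
    return median(mean(g) for g in groups)

# Nine power-of-2 estimates, one of them wildly too large.
estimates = [2, 4, 8, 8, 4, 4, 1024, 2, 2]
print(median_of_means(estimates, group_size=3))
```

Averaging within groups yields values that are no longer restricted to powers of 2, and the median across groups keeps a single outlier from dominating the result.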

Moments (New Topic)

• Suppose a stream has elements chosen from a set of $n$ values.
• Let $m_i$ be the number of times value $i$ occurs.
• The $k$th moment is the sum of $(m_i)^k$ over all $i$.

Special Cases

• 0th moment = number of different elements in the stream.
  ◦ The problem just considered.
• 1st moment = sum of the numbers of elements = length of the stream.
  ◦ Easy to compute.
• 2nd moment = surprise number = a measure of how uneven the distribution is.
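When the whole stream fits in memory, the moments can be computed exactly. The sketch below is only an illustration of the definition; the function name `moment` is an assumption, and the sum ranges over values that actually occur so that the 0th moment counts distinct elements.

```python
from collections import Counter

def moment(stream, k):
    """Exact k-th moment: sum of (m_i)**k over the values i that occur."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

stream = "abracadabra"            # a:5, b:2, r:2, c:1, d:1
print(moment(stream, 0))          # number of distinct elements
print(moment(stream, 1))          # length of the stream
print(moment(stream, 2))          # surprise number
```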

Example: Surprise Number

• Stream of length 100; 11 values appear.
• Unsurprising: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise # = 910.
• Surprising: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Surprise # = 8,110.
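The two surprise numbers can be verified by hand, assuming the counts are one 10 and ten 9's in the first case and one 90 and ten 1's in the second (the readings consistent with a stream of length 100 over 11 values).

```latex
\begin{align*}
\text{Unsurprising: } & 10^2 + 10 \cdot 9^2 = 100 + 810 = 910, \\
\text{Surprising: }   & 90^2 + 10 \cdot 1^2 = 8100 + 10 = 8110.
\end{align*}
```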

AMS Method

• Works for all moments; gives an unbiased estimate.
• We’ll just concentrate on the 2nd moment.
• Based on calculation of many random variables $X$.
  ◦ Each requires a count in main memory, so the number is limited.

One Random Variable

• Assume the stream has length $n$.
• Pick a random time to start, so that any time is equally likely.
• Let the chosen time have element $a$ in the stream.
• $X = n \times (2 \times (\text{the number of } a\text{’s in the stream starting at the chosen time}) - 1)$.
  ◦ Note: just store the count.
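One such variable can be sketched as below. For simplicity this illustration assumes the whole stream is available as a list of known length, which a real streaming setting would not allow; the function name `ams_sample` and its `rng` parameter are assumptions.

```python
import random

def ams_sample(stream, rng=random):
    """Draw one variable X = n * (2 * count - 1) for a finite, stored stream."""
    n = len(stream)
    t = rng.randrange(n)                 # every start time is equally likely
    a = stream[t]
    # Occurrences of a from time t onward, including position t itself.
    count = sum(1 for x in stream[t:] if x == a)
    return n * (2 * count - 1)

print(ams_sample(list("abracadabra"), random.Random(0)))
```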

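Expected Value of X

• The 2nd moment is $\sum_a (m_a)^2$.
• $E(X) = \frac{1}{n} \sum_{t} n \, (2 c_t - 1)$, where the sum is over all times $t$ and $c_t$ is the number of times the stream element at time $t$ appears from that time on.
• $= \sum_a \frac{1}{n} \, n \, (1 + 3 + 5 + \dots + (2 m_a - 1))$, grouping the times by element: for element $a$, $c_t$ takes each of the values $1, 2, \dots, m_a$ exactly once.
• $= \sum_a (m_a)^2$, since the first $m_a$ odd numbers sum to $(m_a)^2$.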
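The derivation can be checked exhaustively on a small stream by averaging $X$ over every possible start time. The sketch below assumes the stream is stored as a list; the function name `expected_X` is hypothetical, and exact fractions are used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction

def expected_X(stream):
    """Average of n * (2 * c_t - 1) over all start times t, computed exactly."""
    n = len(stream)
    total = 0
    seen_later = Counter()
    # Scan right to left so seen_later[a] is the count of a from time t onward.
    for a in reversed(stream):
        seen_later[a] += 1
        total += n * (2 * seen_later[a] - 1)
    return Fraction(total, n)

stream = list("abracadabra")
second_moment = sum(m * m for m in Counter(stream).values())
print(expected_X(stream) == second_moment)
```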

Combining Samples

• Compute as many variables $X$ as can fit in available memory.
• Average them in groups.
• Take the median of the averages.
• A proper balance of group sizes and number of groups ensures not only the correct expected value, but also that the expected error goes to 0 as the number of samples gets large.

Problem: Streams Never End

• We assumed there was a number $n$, the number of positions in the stream.
• But real streams go on forever, so $n$ is a variable: the number of elements seen so far.

Fixups

1. The variables $X$ have $n$ as a factor, so they need to be scaled as $n$ grows.
2. Suppose we can only store $k$ counts. We must throw some $X$’s out as time goes on.
   ◦ Objective: each $X$ is selected with probability $k / n$.

Solution to (2)

• Choose the first $k$ elements.
• When the $n$th element arrives ($n > k$), choose it with probability $k / n$.
• If you choose it, throw one of the previously stored variables out, with equal probability.
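Both fixups can be combined in one small sketch: store only (element, count) pairs, multiply by the current $n$ when an estimate is requested, and replace stored variables using the selection rule above. The class name `StreamingAMS` and its methods are hypothetical, introduced here only for illustration.

```python
import random

class StreamingAMS:
    """Keep at most k (element, count) pairs for an unbounded stream."""

    def __init__(self, k, rng=None):
        self.k = k
        self.n = 0                        # number of elements seen so far
        self.slots = []                   # each slot is [element, count]
        self.rng = rng or random.Random()

    def add(self, x):
        self.n += 1
        # Stored variables count later occurrences of their own element.
        for slot in self.slots:
            if slot[0] == x:
                slot[1] += 1
        if len(self.slots) < self.k:
            self.slots.append([x, 1])     # the first k positions are always kept
        elif self.rng.random() < self.k / self.n:
            # Chosen with probability k/n: evict a uniformly random variable.
            self.slots[self.rng.randrange(self.k)] = [x, 1]

    def estimates(self):
        """Values X = n * (2 * count - 1), scaled by the current n (fixup 1)."""
        return [self.n * (2 * c - 1) for _, c in self.slots]
```

The returned values would then be averaged in groups and combined by a median, as in the Combining Samples slide.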