Effective Change Detection Using Sampling
Junghoo "John" Cho, Alexandros Ntoulas
UCLA
Junghoo "John" Cho (UCLA Computer Science)

Problem
[Figure: a remote database, a local database, and an application; labels: Polling, Update, Query]
  • Web search engines/crawlers
  • Web archive
  • Data warehouse
  • ...

Existing Approach
  • Round robin
      • Download pages in a round-robin manner
  • Change-frequency based [CLW 98, CGM 00, EMT 01]
      • Estimate the change frequency
      • Adjust download frequency
      • Proven to be optimal

Our Approach
  • Sampling-based
      • Sample k pages from each source
      • Download more pages from the source with more changed samples

Comparison
  • Frequency based
      • Proven to be optimal
      • Change history required
      • Difficult to estimate change frequency
  • Sampling based
      • Can be worse than the frequency-based policy
      • No history/frequency estimation required
      • Experimental comparison later

Questions
  • Are we assuming correlation?
  • How to use sampling results?
      • Proportional vs. Greedy
  • How many samples?
      • Dynamic sample size adjustment?
  • What if we have very limited resources?

Is Correlation Necessary?
  • Random sampling
[Figure: two sources with sampled change fractions 4/5 and 1/5]
  • Correlation is not necessary; only random sampling is
      • More discussion later

Download Model (1)
  • Fixed download cycle
      • Say, once a month
  • Fixed download resources in each cycle
      • Say, 100,000 page downloads every month
  • Goal
      • Download as many changes as we can
      • ChangeRatio = (no. of changed & downloaded pages) / (no. of downloaded pages)
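The ChangeRatio metric defined on this slide can be sketched in a few lines. This is only an illustration: the function name `change_ratio` and the page identifiers are made up here and do not come from the talk.

```python
def change_ratio(downloaded, changed):
    """Fraction of the downloaded pages that had actually changed.

    downloaded: iterable of page ids fetched in this cycle
    changed:    iterable of page ids that changed since the last cycle
    """
    downloaded = set(downloaded)
    if not downloaded:
        return 0.0  # nothing downloaded, so nothing detected
    return len(downloaded & set(changed)) / len(downloaded)


# Hypothetical cycle: 4 pages downloaded, 3 of them had changed.
print(change_ratio(["a1", "a2", "a3", "b1"], ["a1", "a2", "b1", "b7"]))  # 0.75
```

Note that page b7 changed but was never downloaded, so it does not count toward the ratio in either direction.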

Download Model (2)
  • Two-stage sampling policy
      • Sampling stage
      • Download stage
      • Sampling requires page download
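A rough sketch of one two-stage cycle follows. It assumes the download stage spends the leftover budget on the sites whose samples looked most changed, and that the budget covers at least k samples per site; the function name, data layout, and toy sites are all hypothetical.

```python
import random


def two_stage_cycle(sites, changed, k, budget, rng):
    """Run one download cycle and return the list of downloaded page ids.

    sites:   {site name: list of page ids} (non-empty lists assumed)
    changed: set of page ids that changed since the last cycle
    """
    downloaded = []
    estimates = {}
    # Sampling stage: each sampled page is a real download and uses budget.
    for site, pages in sites.items():
        sample = rng.sample(pages, min(k, len(pages)))
        downloaded.extend(sample)
        estimates[site] = sum(p in changed for p in sample) / len(sample)
    # Download stage: spend what is left, most-changed-looking site first.
    remaining = budget - len(downloaded)
    for site in sorted(estimates, key=estimates.get, reverse=True):
        already = set(downloaded)
        rest = [p for p in sites[site] if p not in already]
        take = rest[:max(remaining, 0)]
        downloaded.extend(take)
        remaining -= len(take)
    return downloaded


# Toy data: two sites with 20 pages each; A changes much more than B.
sites = {"A": [f"a{i}" for i in range(20)], "B": [f"b{i}" for i in range(20)]}
changed = {f"a{i}" for i in range(16)} | {f"b{i}" for i in range(4)}
pages = two_stage_cycle(sites, changed, k=5, budget=20, rng=random.Random(0))
print(len(pages))  # 20
```

The sampled pages count against the budget, which is the point of the last bullet: every page spent on sampling is a page not spent in the download stage.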

How to Use Sampling Result?
  • Sites A and B, each with 20 pages
  • 20 total downloads, 5 samples from each site
  • 10 page downloads remaining
[Figure: sampled change fractions: A 4/5, B 1/5]

Proportional Policy
  • Download pages proportionally to the detected changes
      • 8 pages from A, 2 pages from B
[Figure: sampled change fractions: A 4/5, B 1/5]
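The arithmetic behind "8 pages from A, 2 pages from B" can be sketched as below. The function name is hypothetical, and two details are assumptions not stated on the slide: rounding leftovers go to the largest fractional parts, and sites get equal weights when no sample changed.

```python
def proportional_allocation(changed_samples, budget):
    """Split `budget` downloads across sites in proportion to changed samples.

    changed_samples: {site name: number of sampled pages that had changed}
    """
    total = sum(changed_samples.values())
    if total == 0:
        # Assumption: with no detected changes, fall back to equal weights.
        changed_samples = {s: 1 for s in changed_samples}
        total = len(changed_samples)
    exact = {s: budget * c / total for s, c in changed_samples.items()}
    alloc = {s: int(x) for s, x in exact.items()}
    leftover = budget - sum(alloc.values())
    # Hand out pages lost to rounding, largest fractional part first.
    by_fraction = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_fraction[:leftover]:
        alloc[s] += 1
    return alloc


# Slide example: 4 of 5 samples changed in A, 1 of 5 in B, 10 downloads left.
print(proportional_allocation({"A": 4, "B": 1}, 10))  # {'A': 8, 'B': 2}
```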

Greedy Policy
  • Download pages from the sites with the most changes
      • 10 pages from A
[Figure: sampled change fractions: A 4/5, B 1/5]
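A minimal sketch of the greedy allocation, assuming every site was sampled with the same number of pages so that the changed-sample counts can be compared directly. The function name and the `remaining_pages` argument (pages of each site not yet downloaded in the sampling stage) are introduced here for illustration.

```python
def greedy_allocation(changed_samples, remaining_pages, budget):
    """Give the whole budget to the most-changed sites, one site at a time.

    changed_samples: {site name: number of sampled pages that had changed}
    remaining_pages: {site name: pages not yet downloaded this cycle}
    """
    alloc = {site: 0 for site in changed_samples}
    ranked = sorted(changed_samples, key=changed_samples.get, reverse=True)
    for site in ranked:
        if budget == 0:
            break
        take = min(budget, remaining_pages[site])
        alloc[site] = take
        budget -= take  # overflow moves on to the next-best site
    return alloc


# Slide example: 20 pages per site, 5 already sampled, 10 downloads left.
print(greedy_allocation({"A": 4, "B": 1}, {"A": 15, "B": 15}, 10))
# {'A': 10, 'B': 0}
```

If the budget exceeded the 15 unsampled pages of A, the remainder would spill over to B.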

Optimality of Greedy
  • Theorem
      • Greedy is optimal if we make download decisions purely based on sampling results
      • Probabilistic optimality (in terms of expected values)

How Many Samples?
  • Too few samples
      • Inaccurate change estimates
  • Too many samples
      • "Waste" of resources for sampling
  • How to determine the optimal sample size?

Optimal Sample Size
  • Factors to consider
      • Total number of pages that we maintain
      • Number of pages that we can download in the current cycle
      • Number of pages in each Web site
      • Change distribution
          • Scenario 1 -- A: 90/100, B: 10/100
          • Scenario 2 -- A: 60/100, B: 40/100

Change Fraction Distribution
[Figure: plot of f(ρ); axis label: fraction of sites; a point ρ_t is marked]
  • ρ_i: fraction of changed pages in site i
  • f(ρ): distribution of ρ values

Optimal Sample Size
  • N: no. of pages in a site
  • r: no. of pages to download / no. of pages we maintain
  • Analysis is complex
  • [formula not recoverable from the source] is a good rule of thumb

Dynamic Sample Size?
  • Do we need the same sample size for every site?
      • Fraction of changed pages: A = 0, B = 0.45, C = 0.55, D = 1

Adaptive Sampling
  • If the estimated change fraction is high/low enough, make an early decision
  • What does "high enough" mean?
      • Confidence interval above threshold
[Figure: confidence interval around the estimated change fraction of site i, compared with a threshold]
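One way to picture the early-decision rule is sketched below. The slide does not say which confidence interval is used; the Wilson score interval, the 95% level (z = 1.96), and all names in this block are assumptions made for illustration.

```python
import math


def wilson_interval(changed, n, z=1.96):
    """Wilson score interval for a change fraction estimated from n samples."""
    if n == 0:
        return (0.0, 1.0)  # no samples yet, nothing is known
    p = changed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


def early_decision(changed, n, threshold, z=1.96):
    """Stop sampling a site once its interval clears the threshold."""
    low, high = wilson_interval(changed, n, z)
    if low > threshold:
        return "download"       # confidently above the threshold
    if high < threshold:
        return "skip"           # confidently below the threshold
    return "keep sampling"      # interval still straddles the threshold


print(early_decision(10, 10, 0.5))  # download
print(early_decision(0, 10, 0.5))   # skip
print(early_decision(5, 10, 0.5))   # keep sampling
```

Sites far from the threshold are settled after a few samples, while sites close to it keep consuming samples, which is why a single fixed sample size can be wasteful.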

In the Paper
  • More details on
      • Optimal sample size
      • Adaptive policy
      • The cases where resources are too limited for sampling

Experiments
  • 353,000 pages from 252 sites
      • Mostly popular sites
          • Yahoo, CNN, Microsoft, …
      • ~1,400 pages from each site
      • Followed the links in a breadth-first manner
  • Monthly change history for 6 months
      • 5 download cycles
  • In experiments, 100,000 page downloads in each download cycle

Comparison of Policies
[Figure: ChangeRatio of each policy]

Optimal Sample Size
[Figure: ChangeRatio vs. sample size; annotations: "Optimal sample size", "~ 10 through 60", "~ 20"]

Comparison of Long-Term Performance
  • Problem: We have data for only 5 download cycles
  • Solution: Extrapolate the history
[Figure: observed history followed by "?"; label: Repeat]
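Reading the "Repeat" label as "replay the observed cycles in a loop", the extrapolation might look like the sketch below. This interpretation, the function name, and the toy history are assumptions; the slide gives no further detail.

```python
from itertools import cycle, islice


def extrapolate_history(history, n_cycles):
    """Extend a per-cycle change history to n_cycles by repeating it."""
    return list(islice(cycle(history), n_cycles))


# Hypothetical page: changed (1) or unchanged (0) in each of 5 observed cycles.
observed = [1, 0, 1, 1, 0]
print(extrapolate_history(observed, 12))
# [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
```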

Frequency vs. Sampling
[Figure: ChangeRatio vs. download cycle for the Frequency and Greedy policies]

Related Work
  • Frequency-based policy
      • Coffman et al., Journal of Scheduling 1998
      • Cho et al., SIGMOD 2000
      • Edwards et al., WWW 2001
  • Source cooperation
      • Olston et al., SIGMOD 2002

Conclusion
  • Sampling-based policy
      • Great short-term performance
      • No change history required
      • Greedy is easy to implement and shows high performance
  • Frequency-based policy
      • Potentially good long-term performance if the change frequency does not change

Future Work
  • Combination of sampling-based and frequency-based policies
      • Switch to the frequency-based policy after a while
  • Good partitioning for sampling?
      • Site based?
      • Directory based?
      • Content based?
      • Link-structure based?

Questions?