Optimizing Update Frequencies for Decaying Information Simon Razniewski

Optimizing Update Frequencies for Decaying Information Simon Razniewski Free University of Bozen-Bolzano, Italy


2 Motivation
• Addr. Inc. collects addresses and sells them to pharmaceutical/medical technology companies
• Example entries:
▫ Dr. John, 5 Main St., 38274 Hampton
▫ Dr. Miller, 17 Hill St., 45192 Fordham
▫ Dr. Higgs, 9 West St., 82077 Chatham

3 Main activities of Addr. Inc.
1. Discover new addresses
2. Check correctness of existing addresses
▫ Doctors relocate occasionally

4 How to check the correctness of existing addresses?
• Check done by web search or phone calls
▫ Online directories for doctors
▫ Hospital webpages
▫ Homepages of private doctors

5 How many resources to provide to update addresses?


6 Outline
1. The problem
2. Information decay
3. Formula for optimal update frequency
4. Application to caching and crawling
5. Validation

7 1. The problem
• The more employees, the more frequent the updates
▫ But how often should we update each entity?
• What is the optimal update frequency for each entity?
▫ Optimize income (benefit minus cost)
• Cost
▫ Work time per update
▫ E.g. 15 minutes at $20/hr -> $5 per update
• Benefit
▫ $20 per year per up-to-date address
▫ What is the benefit of updating?
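The cost arithmetic on this slide can be written out as a small sketch. The function names are illustrative and not from the talk; only the figures (15 minutes, $20/hr) come from the slide, and the 2-year interval is an arbitrary example value.

```python
def cost_per_update(minutes_per_update: float, hourly_wage: float) -> float:
    """Labour cost of checking one address, in dollars."""
    return minutes_per_update / 60.0 * hourly_wage


def yearly_update_cost(cost: float, interval_years: float) -> float:
    """Yearly cost when an entity is updated once every `interval_years`."""
    return cost / interval_years


c = cost_per_update(15, 20)        # slide figures: 15 minutes at $20/hr
print(c)                           # 5.0
print(yearly_update_cost(c, 2.0))  # example: one update every 2 years
```

The yearly cost falls as the interval grows, which is what has to be traded off against the benefit of keeping addresses up to date.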

8 2. Information decay
• Information value of data gets lost over time
• Similar to radioactive decay

9 Shape of decay curves
• Linear, exponential, geometric, …
• Exponential decay holds for all processes whose changes follow a Poisson process [Cho and Molina, TOIT 2003]
▫ Empirically found to apply to website updates
• Below: Soccer player relocation behaviour
[Figure: decay curves for Manchester United and Bayern München]
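As an illustration of the exponential case: if changes arrive as a Poisson process with rate λ per year, the probability that an entry is still correct t years after its last check is e^(-λt). The sketch below uses hypothetical function names and an assumed example rate of 0.1 changes per year; neither comes from the talk.

```python
import math


def prob_still_correct(rate: float, years: float) -> float:
    """P(no change within `years`) for a Poisson process with `rate` changes/year."""
    return math.exp(-rate * years)


def half_life(rate: float) -> float:
    """Time after which an entry is still correct with probability 0.5."""
    return math.log(2) / rate


rate = 0.1  # assumed example value, not from the slides
for t in (0, 1, 5, 10):
    print(t, round(prob_still_correct(rate, t), 3))
print("half-life (years):", round(half_life(rate), 2))
```

The half-life mirrors the radioactive-decay analogy used earlier in the talk.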

10 Benefit of updating
• Benefit per entity depends on average correctness
▫ $20 per year per up-to-date entity
▫ 1st year: 70% average correctness -> $14 benefit
▫ 2nd year: 30% average correctness -> $6 benefit
▫ …
• The benefit of updating is that a certain average correctness is maintained
[Figure label: Updates]
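The link between average correctness and benefit can be sketched numerically for any decay curve. The helper names below are hypothetical, and the linear curve (losing 0.6 correctness per year) is an assumption chosen only because it averages to 70% over the first year, matching the $14 figure above.

```python
from typing import Callable


def average_correctness(correct_at: Callable[[float], float],
                        interval_years: float,
                        steps: int = 10_000) -> float:
    """Mean of correct_at(t) over [0, interval_years], via the midpoint rule."""
    h = interval_years / steps
    total = sum(correct_at((i + 0.5) * h) for i in range(steps))
    return total * h / interval_years


def yearly_benefit(benefit_per_year: float,
                   correct_at: Callable[[float], float],
                   interval_years: float) -> float:
    """Benefit per year when the entity is refreshed every `interval_years`."""
    return benefit_per_year * average_correctness(correct_at, interval_years)


def linear_decay(t: float) -> float:
    """Assumed decay curve: correctness drops by 0.6 per year, never below 0."""
    return max(0.0, 1.0 - 0.6 * t)


print(round(yearly_benefit(20.0, linear_decay, 1.0), 2))  # yearly updates
```

Because the decay curve is passed in as a function, the same code works for exponential or any other decay shape.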

11 3. The (simple) core formula
• [Formula not preserved in this transcript]
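One way to make the income optimization concrete is sketched below. It is not the formula from the talk: it assumes exponential decay with rate `rate` (changes per year), a benefit `benefit` per year of correctness, a cost `cost` per update, and an update every T years. Under those assumptions the yearly income is B(1 - e^(-λT))/(λT) - C/T, and setting its derivative to zero gives 1 - e^(-λT)(1 + λT) = Cλ/B, which the code solves by bisection. All names are hypothetical.

```python
import math


def yearly_income(interval: float, benefit: float, cost: float, rate: float) -> float:
    """Benefit minus cost per year when updating every `interval` years."""
    avg_correct = (1 - math.exp(-rate * interval)) / (rate * interval)
    return benefit * avg_correct - cost / interval


def optimal_interval(benefit: float, cost: float, rate: float,
                     tol: float = 1e-10) -> float:
    """Interval T solving 1 - exp(-rate*T) * (1 + rate*T) = cost*rate/benefit."""
    target = cost * rate / benefit
    if target >= 1:
        return math.inf  # income is negative for every T: never updating is best

    def g(x: float) -> float:  # x stands for rate * T; g is increasing in x
        return 1 - math.exp(-x) * (1 + x) - target

    lo, hi = 0.0, 1.0
    while g(hi) < 0:  # grow the bracket until it contains the root
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2 / rate


# C=$5 and B=$20/year are from the slides; rate=0.1 is an assumed example value.
t_star = optimal_interval(benefit=20.0, cost=5.0, rate=0.1)
print(t_star, yearly_income(t_star, 20.0, 5.0, 0.1))
```

Bisection is used because the left-hand side of the condition increases monotonically in T, so the root is unique whenever Cλ/B < 1.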

12 Examples for Addr. Inc.
• Setting: C=$5, B=$20/year, exponential decay, relocation frequencies taken from Californian taxpayers
[Figure: yearly income in $ against update frequency (years)]
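A curve of this kind can be approximated by tabulating yearly income over a few candidate update intervals. In the sketch below, C=$5 and B=$20/year come from the slide, while the decay rates are hypothetical stand-ins (the Californian relocation figures are not reproduced here) and the income model assumes exponential decay with an update every T years.

```python
import math

BENEFIT = 20.0  # $ per year per up-to-date address (from the slide)
COST = 5.0      # $ per update (from the slide)
INTERVALS = [0.5, 1, 2, 3, 5, 10]  # candidate update intervals in years


def yearly_income(interval: float, rate: float) -> float:
    """Benefit times average correctness, minus yearly update cost."""
    avg_correct = (1 - math.exp(-rate * interval)) / (rate * interval)
    return BENEFIT * avg_correct - COST / interval


def best_interval(rate: float) -> float:
    """Candidate interval with the highest yearly income."""
    return max(INTERVALS, key=lambda t: yearly_income(t, rate))


for rate in (0.05, 0.10, 0.20):  # hypothetical decay rates per year
    row = [round(yearly_income(t, rate), 2) for t in INTERVALS]
    print(f"rate={rate:.2f}  incomes={row}  best={best_interval(rate)}")
```

Income drops at both ends of the range: very frequent updates are dominated by cost, very rare ones by lost correctness.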

13 Extensions in the paper
• Bulk updates
• Cost of outdated entities
• Different costs for checking and updating an address
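The last extension can be illustrated with a simple assumed cost model, which may differ from the one in the paper: every cycle pays a checking cost, and an additional updating cost only if the address has actually changed since the last check. Exponential decay, the function name, and the dollar values are all assumptions made for this sketch.

```python
import math


def expected_cycle_cost(check_cost: float, update_cost: float,
                        rate: float, interval: float) -> float:
    """Expected cost of one check after `interval` years.

    The update cost is only incurred if the entry changed, which under
    exponential decay happens with probability 1 - exp(-rate * interval).
    """
    prob_changed = 1 - math.exp(-rate * interval)
    return check_cost + prob_changed * update_cost


# Hypothetical split: $1 to check an address, $4 more to correct it.
for t in (0.5, 1, 2, 5):
    print(t, round(expected_cycle_cost(1.0, 4.0, 0.1, t), 3))
```

With this split, frequent checks are cheaper per cycle than a flat per-update cost suggests, because most of them find nothing to correct.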

14 4. Other applications
• Caching
• Web crawling
• Difference to classical work there:
▫ Our focus is on the optimal update frequency
▫ Classical work focuses on the best distribution of a fixed update budget (“1000 pages, 500 crawls/minute, ...”)
• Our approach is more relevant now given scalable cloud resources

15 Caching and crawling
• Cost of an update
▫ Bandwidth or compute time
▫ Crawling: ~0.003 ct/crawl (2015)
• Benefit of an update
▫ Caching: Avoiding repeated computation or lower delay
▫ Crawling: Better search quality
▫ Challenge: How to express in money?

16 5. Validation
• 1. Is it easy to get decay rates? 2. Do they differ?
[Figure: yearly relocation probability; labels: Ph.D. students, Academic, Industrial researchers, Professors]
• 3. Which decay function applies?
▫ Exponential decay for soccer player affiliations
• 4. What can we gain?
▫ Up to 6.5% in a use case of academic advertisement
▫ Up to 31.8% in a web crawling use case [Cho and Molina, TOIT 2003]
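A yearly relocation probability of the kind compared above can be turned into an exponential decay rate: if a fraction p of entities relocates within one year, then 1 - e^(-λ) = p, so λ = -ln(1 - p). The sketch below assumes exponential decay; the group names and probabilities are hypothetical placeholders, since the values from the figure are not preserved.

```python
import math


def rate_from_yearly_probability(p: float) -> float:
    """Decay rate (per year) implied by a yearly change probability p in [0, 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return -math.log(1 - p)


# Hypothetical probabilities, for illustration only.
example_groups = {"group A": 0.30, "group B": 0.10, "group C": 0.03}
for name, p in example_groups.items():
    print(name, round(rate_from_yearly_probability(p), 4))
```

Groups with different yearly probabilities get different rates, and therefore different optimal update frequencies.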

17 Summary
• Framework for finding optimal update frequency for decaying information
▫ Independent of actual decay function
• Focus on address data, but also relevant for crawling and caching
▫ Challenge: Modelling benefit
• Also interesting
▫ Data mining: How to identify relevant attributes (age/profession, …) that allow predicting decay rates