Characterization of Search Engine Caches
Frank McCown & Michael L. Nelson
Old Dominion University, Norfolk, Virginia, USA
Arlington, Virginia, May 22, 2007

Outline
• Preserving and caching the Web
• Lazy preservation
• Search engine sampling experiment

Image credits
• Black hat: http://img.webpronews.com/securitypronews/110705 blackhat.jpg
• Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
• Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

Preservation: Fortress Model
5 easy steps for preservation:
1. Get a lot of $
2. Buy a lot of disks, machines, tapes, etc.
3. Hire an army of staff
4. Load a small amount of data
5. “Look upon my archive ye Mighty, and despair!”
Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt
Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg

How much of the Web is indexed? Internet Archive?
Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)

Alternative Models of Preservation
• Lazy Preservation
  – Let Google, IA et al. preserve your website
• Just-In-Time Preservation
  – Wait for it to disappear first, then recover a “good enough” version
• Shared Infrastructure Preservation
  – Push your content to sites that might preserve it
• Web Server Enhanced Preservation
  – Use Apache modules to create archival-ready resources

Cached Image

Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
Versions shown: canonical, MSN version, Yahoo version, Google version

Crawling the Web and web repositories

• Frank McCown, Amine Benjelloun, and Michael L. Nelson. Brass: A Queueing Manager for Warrick. 7th International Web Archiving Workshop (IWAW 2007). To appear.
• Frank McCown, Norou Diawara, and Michael L. Nelson. Factors Affecting Website Reconstruction from the Web Infrastructure. ACM/IEEE Joint Conference on Digital Libraries (JCDL 2007). To appear.
• Frank McCown and Michael L. Nelson. Evaluation of Crawling Policies for a Web-Repository Crawler. 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006).
• Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. 8th ACM International Workshop on Web Information and Data Management (WIDM 2006).
Available for download at http://www.cs.odu.edu/~fmccown/warrick/

Experiment: Sample Search Engine Caches
• Feb 2006
• Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo
• Randomly selected 1 result from the first 100
• Downloaded the resource and its cached page
• Checked for overlap with the Internet Archive
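The selection step in the experiment above (one random result out of the first 100 per query) can be sketched as follows. This is only an illustration: the function name `pick_sample`, the fixed seed, and the placeholder result list are assumptions, not details from the slides, and the actual search-engine querying is left out.

```python
import random


def pick_sample(results, rng, top_n=100):
    """Pick one result uniformly at random from the first top_n results.

    `results` is a ranked list of URLs for a single query (hypothetical input).
    Returns None when the query produced no results.
    """
    candidates = results[:top_n]
    if not candidates:
        return None
    return rng.choice(candidates)


# Fixed seed only so the sketch is repeatable; the slides do not mention one.
rng = random.Random(2006)

# Placeholder ranked result list standing in for one query's results.
results = [f"http://example.com/page{i}" for i in range(250)]
sample = pick_sample(results, rng)
```

Sampling from the first 100 results rather than taking only the top hit spreads the sample beyond the highest-ranked pages for each query term.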

Web and Cache Overlap

Indexed and Cached Content by Type

Distribution of Top Level Domains

Cached Resource Size Distributions
Figure labels: 976 KB, 1 MB, 977 KB, 215 KB

Cache Freshness
Timeline (figure): crawled and cached (fresh), changed on web server (stale), crawled and cached again (fresh)
Staleness = max(0, Last-Modified HTTP header - cached date)
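A minimal sketch of the staleness formula above, measured in days. The function name `staleness_days` and the example dates are illustrative assumptions; the slides only give the formula itself. The cached date is assumed to be a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def staleness_days(last_modified_header, cached_date):
    """Staleness = max(0, Last-Modified - cached date), expressed in days.

    last_modified_header: value of the HTTP Last-Modified header (a GMT date string).
    cached_date: timezone-aware datetime at which the search engine cached the page.
    """
    last_modified = parsedate_to_datetime(last_modified_header)
    delta = last_modified - cached_date
    return max(0.0, delta.total_seconds() / 86400)


# Page changed three days after it was cached: the cached copy is 3 days stale.
stale = staleness_days(
    "Wed, 15 Feb 2006 12:00:00 GMT",
    datetime(2006, 2, 12, 12, 0, tzinfo=timezone.utc),
)

# Page last changed before it was cached: the cached copy is fresh (staleness 0).
fresh = staleness_days(
    "Wed, 15 Feb 2006 12:00:00 GMT",
    datetime(2006, 2, 17, 12, 0, tzinfo=timezone.utc),
)
```

The `max(0, ...)` clamp means a resource cached after its last modification always has staleness 0, no matter how long ago it was cached.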

Cache Staleness
• 46% of resources had a Last-Modified header
• 71% also had a cached date
• 16% were at least 1 day stale

Distribution of Staleness

Similarity
• Compared live web resource with cached counterpart using shingling
• Shingling: ratio of unique, shared, contiguous subsequences of tokens in a document
• 19% of all resources have identical shingles
• 21% of HTML resources have identical shingles
• Resources shared 72% of their shingles on average
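The shingling comparison described above can be sketched roughly as below. Several details are assumptions, since the slides do not state them: whitespace tokenization, a shingle width of `w=4`, and the use of the Jaccard ratio (shared shingles over all distinct shingles) as the similarity score. The names `shingles` and `resemblance` are illustrative.

```python
def shingles(text, w=4):
    """Return the set of unique contiguous w-token subsequences of `text`."""
    tokens = text.lower().split()  # assumed tokenization: lowercase, split on whitespace
    if len(tokens) < w:
        # Short documents yield a single shingle holding all tokens (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(live_text, cached_text, w=4):
    """Ratio of shared shingles to all distinct shingles (Jaccard ratio, assumed)."""
    a, b = shingles(live_text, w), shingles(cached_text, w)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


identical = resemblance("the quick brown fox jumps", "the quick brown fox jumps")
partial = resemblance("a b c d e", "a b c d f")
disjoint = resemblance("a b c d", "w x y z")
```

Under this definition, "identical shingles" corresponds to a score of 1.0, and a single changed token lowers the score by removing every shingle that spans it.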

Similarity vs. Staleness

Overlap with Internet Archive

Overlap with Internet Archive (continued)

Distribution of Sampled URLs

Conclusions
• Ask is not useful (9% of resources cached)
• Approximately 85% of indexed content is available in SE caches
• All search engines appear to cache TLDs and different MIME types at the same rate
• IA contains only 46% of the resources available in SE caches
• Approximately 7% of indexed resources are missing from SE caches and IA

Thank You
Frank McCown
fmccown@cs.odu.edu
http://www.cs.odu.edu/~fmccown/