Characterization of Search Engine Caches Frank McCown
- Slides: 28
Characterization of Search Engine Caches Frank McCown & Michael L. Nelson Old Dominion University Norfolk, Virginia, USA Arlington, Virginia May 22, 2007 1
Outline • Preserving and caching the Web • Lazy preservation • Search engine sampling experiment 2
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg 3
Preservation: Fortress Model 5 easy steps for preservation: 1. Get a lot of $ 2. Buy a lot of disks, machines, tapes, etc. 3. Hire an army of staff 4. Load a small amount of data 5. “Look upon my archive ye Mighty, and despair!” Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg 4
How much of the Web is indexed? Internet Archive? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05) 5
Alternative Models of Preservation • Lazy Preservation – Let Google, IA et al. preserve your website • Just-In-Time Preservation – Wait for it to disappear first, then recover a “good enough” version • Shared Infrastructure Preservation – Push your content to sites that might preserve it • Web Server Enhanced Preservation – Use Apache modules to create archival-ready resources 6
Cached Image 10
Cached PDF: http://www.fda.gov/cder/about/whatwedo/testtube.pdf (panels: canonical, MSN version, Yahoo version, Google version) 11
Crawling the Web and web repositories 12
• Frank McCown, Amine Benjelloun, and Michael L. Nelson. Brass: A Queueing Manager for Warrick. 7th International Web Archiving Workshop (IWAW 2007). To appear. • Frank McCown, Norou Diawara, and Michael L. Nelson. Factors Affecting Website Reconstruction from the Web Infrastructure. ACM IEEE Joint Conference on Digital Libraries (JCDL 2007). To appear. • Frank McCown and Michael L. Nelson. Evaluation of Crawling Policies for a Web-Repository Crawler. 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006). • Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. 8th ACM International Workshop on Web Information and Data Management (WIDM 2006). Available for download at http://www.cs.odu.edu/~fmccown/warrick/ 13
Experiment: Sample Search Engine Caches • Feb 2006 • Submitted 5200 one-term queries to Ask, Google, MSN, and Yahoo • Randomly selected 1 result from the first 100 • Downloaded the resource and its cached page • Checked for overlap with Internet Archive 14
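The sampling step on this slide (pick one result at random from the first 100) can be sketched as follows. This is only an illustration: the function name, the `limit` parameter, and the seeded random generator are assumptions, and the slides do not say how the queries were issued or how the results were fetched.

```python
import random


def sample_one_result(result_urls, rng, limit=100):
    """Pick one URL uniformly at random from the first `limit` results.

    `result_urls` is an ordered list of result URLs for a single query
    (hypothetical input; fetching results is out of scope here).
    Returns None when the query produced no results.
    """
    top = result_urls[:limit]
    return rng.choice(top) if top else None


# Example with made-up URLs and a seeded generator for repeatability
rng = random.Random(0)
urls = ["http://example.com/%d" % i for i in range(250)]
picked = sample_one_result(urls, rng)
```

Passing the generator in explicitly keeps a sampling run repeatable, which helps when the same sample has to be compared across several search engines.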
Web and Cache Overlap 15
Indexed and Cached Content by Type 16
Distribution of Top Level Domains 17
Cached Resource Size Distributions (figure; labeled values: 976 KB, 1 MB, 977 KB, 215 KB) 19
Cache Freshness • Timeline: fresh when crawled and cached, stale once changed on the web server, fresh again when next crawled and cached • Staleness = max(0, Last-Modified HTTP header – cached date) 21
Cache Staleness • 46% of resources had a Last-Modified header • 71% also had a cached date • 16% were at least 1 day stale 22
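The staleness formula, max(0, Last-Modified – cached date), can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name is hypothetical, the result is expressed in days to match the "at least 1 day stale" figure, and dates without a time zone are assumed to be UTC.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def staleness_days(last_modified_header, cached_date):
    """Return max(0, Last-Modified - cached date), expressed in days."""
    last_modified = parsedate_to_datetime(last_modified_header)
    # Assumption: treat naive datetimes as UTC so the subtraction is valid
    if last_modified.tzinfo is None:
        last_modified = last_modified.replace(tzinfo=timezone.utc)
    if cached_date.tzinfo is None:
        cached_date = cached_date.replace(tzinfo=timezone.utc)
    delta_seconds = (last_modified - cached_date).total_seconds()
    # A cached copy made after the last change is fresh: staleness is 0
    return max(0.0, delta_seconds) / 86400.0


# Example: resource changed two days after the engine cached it
cached = datetime(2006, 2, 20, 12, 0, tzinfo=timezone.utc)
stale = staleness_days("Wed, 22 Feb 2006 12:00:00 GMT", cached)
```

Only resources with both a Last-Modified header and a cached date can be scored this way, which is why the slide reports those two percentages first.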
Distribution of Staleness 23
Similarity • Compared live web resource with cached counterpart using shingling • Shingling – ratio of unique, shared, contiguous subsequences of tokens in a document • 19% of all resources have identical shingles • 21% of HTML resources have identical shingles • Resources shared 72% of their shingles on average 24
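A rough sketch of the shingling comparison is given here. Several details are assumptions, since the slide does not state them: tokens are lowercase word runs, the shingle width `w` defaults to 10, and the ratio is computed as shared unique shingles over all unique shingles (Jaccard resemblance). The function names are hypothetical.

```python
import re


def shingles(text, w=10):
    """Return the set of unique contiguous w-token subsequences of text."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < w:
        # Short documents yield a single shingle holding all tokens
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def shared_shingle_ratio(live, cached, w=10):
    """Ratio of shared unique shingles to all unique shingles."""
    a, b = shingles(live, w), shingles(cached, w)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


# Example: one changed word out of four, with 2-token shingles
ratio = shared_shingle_ratio("a b c d", "a b c e", w=2)
```

A ratio of 1.0 corresponds to the "identical shingles" case on the slide; lower values indicate that the live resource has drifted from the cached copy.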
Similarity vs. Staleness 25
Overlap with Internet Archive 26
Overlap with Internet Archive 27
Distribution of Sampled URLs 28
Conclusions • Ask is not useful (9% of resources cached) • Approximately 85% of indexed content is available in SE caches • All search engines appear to cache TLDs and different MIME types at the same rate • IA contains only 46% of the resources available in SE caches • Approximately 7% of indexed resources are missing from SE caches and IA 29
Thank You Frank McCown fmccown@cs.odu.edu http://www.cs.odu.edu/~fmccown/ 30