Mercator: A Scalable, Extensible Web Crawler
Allan Heydon and Marc Najork
International Journal of World Wide Web, v. 2(4), p. 219-229, Dec. 1999.
Database System Laboratory
May 23, 2006
Sun Woo Kim

Content
- Extensibility
- Crawler traps and other hazards
- Results of an extended crawl
- Conclusions

Extensibility
- Extensibility
  - Extend with new functionality
    - New protocol and processing modules
    - Different versions of most of its major components
- Ingredients
  - Interface: an abstract class
  - Mechanism: a configuration file
  - Infrastructure

Protocol and processing modules
- Abstract Protocol class
  - fetch method: download the document
  - newURL method: parse a given string
- Abstract Analyzer class
  - process method: process the downloaded document appropriately
- Different Analyzer subclasses
  - GifStats
  - TagCounter
  - WebLinter: runs the Weblint program

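The abstract-class pattern on this slide can be sketched roughly as follows. This is a minimal Python illustration, not Mercator's Java code: the method signatures, the `InMemoryProtocol` stand-in, and the use of `html.parser` inside `TagCounter` are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from collections import Counter
from html.parser import HTMLParser


class Protocol(ABC):
    """Abstract protocol module: one subclass per URL scheme."""

    @abstractmethod
    def fetch(self, url: str) -> bytes:
        """Download the document named by url."""

    @abstractmethod
    def new_url(self, text: str) -> str:
        """Parse a given string into a URL for this protocol."""


class Analyzer(ABC):
    """Abstract processing module applied to each downloaded document."""

    @abstractmethod
    def process(self, url: str, content: bytes) -> None:
        """Process the downloaded document."""


class _TagParser(HTMLParser):
    """Helper that increments a shared counter for every start tag."""

    def __init__(self, counts: Counter) -> None:
        super().__init__()
        self.counts = counts

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


class TagCounter(Analyzer):
    """Counts HTML start tags across all processed documents."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def process(self, url: str, content: bytes) -> None:
        _TagParser(self.counts).feed(content.decode("utf-8", errors="replace"))


class InMemoryProtocol(Protocol):
    """Hypothetical protocol that serves pages from a dict (no network)."""

    def __init__(self, pages: dict) -> None:
        self.pages = pages

    def fetch(self, url: str) -> bytes:
        return self.pages[url]

    def new_url(self, text: str) -> str:
        return text.strip()
```

A crawler built this way only talks to `Protocol` and `Analyzer`, so adding a new scheme or a new analysis step means adding a subclass rather than changing the crawl loop.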
Alternative URL frontier
- Drawback on intranet
  - Multiple hosts might be assigned to the same thread
- Solution
  - URL frontier component that dynamically assigns hosts to worker threads
    - Maximizes the number of busy worker threads
    - Is well-suited to host-limited crawls

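One way to picture dynamic host assignment is a frontier that hands a worker the next URL from any host that no other worker is currently using. The sketch below is a single-threaded illustration under that assumption; the class and method names are hypothetical, and a real multi-threaded version would need a lock around the shared state.

```python
from collections import deque
from urllib.parse import urlsplit


class HostAssigningFrontier:
    """Hypothetical frontier: at most one in-flight download per host."""

    def __init__(self) -> None:
        self._queues = {}    # host -> deque of pending URLs
        self._busy = set()   # hosts currently assigned to some worker

    def add(self, url: str) -> None:
        host = urlsplit(url).hostname
        self._queues.setdefault(host, deque()).append(url)

    def acquire(self):
        """Return a URL from a host that no worker is using, or None."""
        for host, queue in self._queues.items():
            if queue and host not in self._busy:
                self._busy.add(host)
                return queue.popleft()
        return None

    def release(self, url: str) -> None:
        """Call when the download of url finishes; frees its host."""
        self._busy.discard(urlsplit(url).hostname)
```

Because hosts are not bound to a fixed thread, an idle worker can always pick up any free host, which is what keeps workers busy when the crawl covers only a few hosts.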
As a random walker
- Random walker
  - Starts at a random page taken from a set of seeds
  - The next page is selected by choosing a random link
- Differences
  - A page may be revisited multiple times
  - Only one link is followed each time
- To support random walking
  - A new URL frontier
    - Records only the URLs discovered in the most recently fetched file
  - Document fingerprint set
    - Never rejects documents as already having been seen

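The random-walk behavior can be sketched as below. The `links` dictionary stands in for fetching and parsing a page, and restarting from a seed at a dead end is an assumption of this example, not something the slide states.

```python
import random


def random_walk(seeds, links, steps, rng):
    """Return the sequence of pages visited by a random walk.

    links maps a page to the list of URLs found in it (a stand-in for
    downloading and link extraction).
    """
    page = rng.choice(seeds)
    visited = [page]
    for _ in range(steps):
        # The frontier holds only the links of the most recently fetched page.
        frontier = links.get(page, [])
        if frontier:
            page = rng.choice(frontier)   # follow exactly one link
        else:
            page = rng.choice(seeds)      # dead end: restart (assumption)
        # No seen-set check: a page may be visited any number of times.
        visited.append(page)
    return visited
```

For example, `random_walk(["a"], {"a": ["b"], "b": ["a"]}, 4, random.Random(0))` alternates between the two pages, revisiting each one.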
URL aliases
- Four causes
  - Host name aliases: canonicalize
    - coke.com and cocacola.com: 203.134.241.178
  - Omitted port numbers: default value 80
  - Alternative paths on the same host: cannot avoid
    - digital.com/index.html and digital.com/home.html
  - Replication across different hosts: cannot avoid
    - Mirror sites
- Cannot avoid: content-seen test

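The two avoidable causes (host name aliases and omitted port numbers) can be handled by canonicalizing each URL before it enters the frontier. The sketch below assumes a hypothetical alias table in place of a real DNS lookup; dropping the fragment is an extra assumption of this example.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80}


def canonicalize(url, host_aliases=None):
    """Return a canonical form of url.

    host_aliases is a hypothetical mapping from alias host names to the
    canonical host name; a crawler would normally derive it from DNS.
    """
    host_aliases = host_aliases or {}
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    host = host_aliases.get(host, host)
    # Fill in an omitted port with the scheme's default.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    netloc = f"{host}:{port}" if port else host
    path = parts.path or "/"
    # The fragment is dropped (assumption: it never changes the document).
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))
```

With `{"coke.com": "cocacola.com"}` as the alias table, `http://Coke.com/index.html` and `http://cocacola.com:80/index.html` map to the same string. Alternative paths and mirror sites still produce distinct URLs, which is why the content-seen test is needed.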
Session IDs embedded in URLs
- Session identifiers
  - To track the browsing behavior of their visitors
  - Create a potentially infinite set of URLs
  - Represent a special case of alternative paths
    - Document fingerprinting technique

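A content-seen test based on document fingerprints might look like the sketch below. It uses a truncated SHA-256 digest purely as a stand-in fingerprint function; the class name and the 8-byte truncation are assumptions of this example.

```python
import hashlib


class ContentSeenTest:
    """Hypothetical fingerprint set keyed on document content, not URL."""

    def __init__(self) -> None:
        self._fingerprints = set()

    def seen_before(self, content: bytes) -> bool:
        """Record content's fingerprint; return True if already present."""
        # Stand-in fingerprint: first 8 bytes of SHA-256 (an assumption).
        fp = hashlib.sha256(content).digest()[:8]
        if fp in self._fingerprints:
            return True
        self._fingerprints.add(fp)
        return False
```

Two URLs that differ only in an embedded session ID but return the same bytes yield the same fingerprint, so the second copy is recognized and its links are not extracted again. Pages that embed the session ID in their own content would still look different to this test.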
Crawler traps
- Crawler trap
  - Causes a crawler to crawl indefinitely
  - Unintentional: symbolic link
  - Intentional: trap using CGI programs
    - Antispam traps, traps to catch search engine crawlers
- Solution
  - No automatic technique
    - But traps are easily noticed
    - Manually exclude the site
      - Using the customizable URL filter

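Manual exclusion through a URL filter can be sketched as a predicate applied to every URL before it is added to the frontier. The class name, the `accept` method, and the subdomain-matching rule are assumptions of this example.

```python
from urllib.parse import urlsplit


class HostExclusionFilter:
    """Hypothetical URL filter that rejects manually listed hosts."""

    def __init__(self, excluded_hosts) -> None:
        self._excluded = {h.lower() for h in excluded_hosts}

    def accept(self, url: str) -> bool:
        """Return False for URLs on an excluded host or its subdomains."""
        host = urlsplit(url).hostname or ""
        return not any(
            host == h or host.endswith("." + h) for h in self._excluded
        )
```

Once an operator notices a trap, adding its host to the excluded list stops the crawler from enqueuing further URLs from that site.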
Performance
- Digital Ultimate Workstation
  - Two 533 MHz Alpha processors
  - 2 GB of RAM and 118 GB of local disk
- Run in May 1999
  - 77.4 million HTTP requests in 8 days
  - 112 docs/sec and 1,682 KB/sec
- CPU cycles
  - 37%: JIT-compiled Java bytecode
  - 19%: Java runtime
  - 44%: Unix kernel

Selected Web statistics (1)
- Relationship between URLs and HTTP requests

    No. of URLs removed            76,732,515
  + No. of robots.txt requests      3,675,634
  - No. of excluded URLs            3,050,768
  = No. of HTTP requests           77,357,381

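The relationship on this slide is simple arithmetic and can be checked directly; the variable names below are chosen for this example only.

```python
urls_removed = 76_732_515      # No. of URLs removed
robots_requests = 3_675_634    # No. of robots.txt requests
excluded_urls = 3_050_768      # No. of excluded URLs

http_requests = urls_removed + robots_requests - excluded_urls
print(f"{http_requests:,}")    # 77,357,381
```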
Selected Web statistics (2)
- Breakdown of HTTP status codes

  Code   Meaning                     Number   Percent
  200    OK                      65,790,953    87.03%
  404    Not found                5,617,491     7.43%
  302    Moved temporarily        2,517,705     3.33%
  301    Moved permanently          842,875     1.12%
  403    Forbidden                  322,042     0.43%
  401    Unauthorized               223,843     0.30%
  500    Internal server error       83,744     0.11%
  406    Not acceptable              81,091     0.11%
  400    Bad request                 65,159     0.09%
         Other                       48,628     0.06%
         Total                   75,593,531    100.0%

  Note: relatively low

Selected Web statistics (3)
- Size of successfully downloaded documents
  [Figure: size of successfully downloaded documents; callout: 80%]

Selected Web statistics (4)
- Distribution of MIME types

  MIME type                      Number   Percent
  text/html                  41,490,044     69.2%
  image/gif                  10,729,326     17.9%
  image/jpeg                  4,846,257      8.1%
  text/plain                    869,911      1.5%
  application/pdf               540,656      0.9%
  audio/x-pn-realaudio          269,384      0.4%
  application/zip               213,089      0.4%
  application/postscript        159,869      0.3%
  other                         829,410      1.4%
  Total                      59,947,946    100.0%

Conclusions
- Use of Java
  - Made implementation easier and more elegant
  - Threads, garbage collection, objects, exceptions, etc.
- Scalability
- Extensibility

Fin.