Lecture 5: Search Engines

Outline
• Search engines: key tools for ecommerce
  – Buyers and sellers must find each other
• How do they work?
• How much do they index?
• How are hits ordered? Can the order be changed?

Search Engines
• Tools for finding information on the Web
  – Problem: “hidden” databases, e.g. the New York Times
• Directory
  – A hand-constructed hierarchy of topics (e.g. Yahoo)
• Search engine
  – A machine-constructed index (usually by keyword)
• So many search engines, we now need search engines to find them: Searchenginecolossus.com

Indexing
• Arrangement of data (a data structure) to permit fast searching
• Which list is easier to search?
  sow fox pig eel yak hen ant cat dog hog
  ant cat dog eel fox hen hog pig sow yak
• Sorting helps. Why?
  – Permits binary search: about log₂ n probes into the list
    • log₂(1 billion) ≈ 30
  – Permits interpolation search: about log₂(log₂ n) probes
    • log₂(log₂(1 billion)) ≈ 5
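The binary-search claim above can be sketched in a few lines of Python, using the ten-word list from the slide (a minimal illustration, not production index code):

```python
from bisect import bisect_left

def binary_search(sorted_words, target):
    """Locate target in a sorted list with about log2(n) probes."""
    i = bisect_left(sorted_words, target)  # each probe halves the range
    if i < len(sorted_words) and sorted_words[i] == target:
        return i
    return -1  # not found

# The unsorted list from the slide, sorted so binary search applies
words = sorted(["sow", "fox", "pig", "eel", "yak", "hen",
                "ant", "cat", "dog", "hog"])
print(binary_search(words, "hog"))  # index of "hog" in the sorted list
```

For a billion entries this is about 30 probes, versus up to a billion for a linear scan of the unsorted list.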

Inverted Files
• A file is a list of words by position
  – First entry is the word in position 1 (the first word)
  – Entry 4562 is the word in position 4562 (the 4562nd word)
  – Last entry is the last word
• An inverted file is a list of positions by word!

  INVERTED FILE
  a         (1, 4, 40)
  entry     (11, 20, 31)
  file      (2, 38)
  list      (5, 41)
  position  (9, 16, 26)
  positions (44)
  word      (14, 19, 24, 29, 35, 45)
  words     (7)
  4562      (21, 27)
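Building an inverted file for a single document is a short exercise: map each word to the list of 1-based positions where it occurs. A sketch using the opening words of the slide's example text:

```python
from collections import defaultdict

def invert(text):
    """Turn a list of words by position into a list of positions by word."""
    index = defaultdict(list)
    for pos, word in enumerate(text.split(), start=1):  # positions start at 1
        index[word].append(pos)
    return dict(index)

idx = invert("a file is a list of words by position")
print(idx["a"])  # every position where "a" occurs
```

Note how a repeated word ("a") accumulates multiple positions, exactly as in the slide's inverted-file listing.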

Inverted Files for Multiple Documents
• Lexicon (word index): for each word, a list of (DOCID, OCCUR) entries, each pointing to that word's positions POS 1, POS 2, . . . within the document
• Example: “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56, . . .
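The multi-document lexicon can be sketched as a nested mapping from word to document to positions; occurrence counts like those quoted for “jezebel” then fall out as list lengths. The tiny corpus below is hypothetical:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {docid: text}. Returns {word: {docid: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for docid, text in docs.items():
        for pos, word in enumerate(text.split(), start=1):
            index[word][docid].append(pos)
    return index

# Hypothetical two-document corpus for illustration
docs = {34: "jezebel " * 6, 44: "jezebel " * 3}
idx = build_index(docs)
print({d: len(p) for d, p in idx["jezebel"].items()})  # occurrences per doc
```

A real engine stores this structure on disk in compressed form, but the word → (docid, positions) shape is the same.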

Search Engine Architecture
• Spider
  – Crawls the web to find pages; follows hyperlinks; never stops
• Indexer
  – Produces data structures for fast searching of all words in the pages
• Retriever
  – Query interface
  – Database lookup to find hits
    • 2 billion documents
    • 4 TB of RAM, many terabytes of disk
  – Ranking

Crawlers (Spiders, Bots)
• Retrieve web pages for indexing by search engines
• Start with an initial page P0; find URLs on P0 and add them to a queue
• When done with P0, pass it to an indexing program, get a page P1 from the queue, and repeat
• Can be specialized (e.g. only look for email addresses)
• Issues
  – Which page to look at next? (special subjects, recency)
  – Avoid overloading a site
  – How deep within a site to go (drill-down)?
  – How frequently to revisit pages?
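The crawl loop described above (a queue of discovered URLs, each fetched page handed to the indexer) is a breadth-first traversal of the link graph. In this sketch, `fetch` and `extract_links` are caller-supplied stand-ins for real HTTP retrieval and HTML link parsing:

```python
from collections import deque

def crawl(start_url, fetch, extract_links, max_pages=100):
    """Breadth-first crawl sketch; stops after max_pages pages."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        page = fetch(url)            # here the page would go to the indexer
        visited.append(url)
        for link in extract_links(page):
            if link not in seen:     # never enqueue a page twice
                seen.add(link)
                queue.append(link)
    return visited

# Toy three-page "web" standing in for real pages and links
web = {"P0": ["P1", "P2"], "P1": ["P0"], "P2": []}
order = crawl("P0", fetch=lambda u: u, extract_links=lambda p: web[p])
print(order)  # breadth-first visit order starting from P0
```

A real crawler adds politeness delays per site and a priority function for "which page next", but the queue-and-seen-set core is the same.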

Query Specification
• Boolean: AND, OR, NOT, PHRASE “ ”, NEAR ~
  – But keyword queries are artificial
• Question-answering (simulated)
  – “Who offers a master’s degree in ecommerce?”
• Date range
• Relevance specification
  – In Altavista, terms can be given importance weights (separate from the query itself)
• Content: multimedia, MP3, .PPT files
• Stemming: eat, eats, eaten, eating, eater (ate!)
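A Boolean AND query can be answered directly from an inverted index by intersecting the posting sets of its terms. The index contents below are hypothetical:

```python
def boolean_and(index, *terms):
    """Documents containing every term: intersect the posting sets."""
    postings = [set(index.get(t, ())) for t in terms]
    return set.intersection(*postings) if postings else set()

index = {                    # hypothetical postings: word -> docids
    "master": {1, 2, 5},
    "degree": {2, 5, 7},
    "ecommerce": {5, 9},
}
print(boolean_and(index, "master", "degree", "ecommerce"))  # {5}
```

OR is a union of the same sets, and NOT a difference against the full document set; PHRASE and NEAR additionally need the word positions stored in the inverted file.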

“Advanced” Query Specification
• Multimedia, e.g. Google
• Date range
• Relevance specification
  – In Altavista, terms can be given importance weights (separate from the query itself)
• Content: multimedia, MP3, .PPT files
• Stemming
• Language
• Search depth (from the site’s front page)

Ranking (Scoring) Hits
• Hits must be presented in some order
• What order?
  – Relevance, recency, popularity, reliability?
• Some ranking methods
  – Presence of keywords in the title of the document
  – Closeness of keywords to the start of the document
  – Frequency of the keyword in the document
  – Link popularity (how many pages point to this one)
• Can the user control the order? Can the page owner?
• Can you find out what order is used?
• Spamdexing: influencing retrieval ranking by altering a web page (puts “spam” in the index)
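The ranking heuristics listed above can be combined into a toy scoring function. The weights here are illustrative only, not any engine's actual values:

```python
def score(doc, query_terms):
    """Toy ranking combining three of the slide's heuristics."""
    words = doc["body"].lower().split()
    s = 0.0
    for term in query_terms:
        if term in doc["title"].lower().split():
            s += 3.0                               # keyword in the title
        if term in words:
            s += 1.0 / (words.index(term) + 1)     # closeness to the start
            s += words.count(term) / len(words)    # frequency in the document
    return s

doc = {"title": "Search Engines", "body": "search engines index the web"}
print(score(doc, ["search"]))
```

Hits would then be sorted by descending score; link popularity (next slides) adds a query-independent term on top of these query-dependent ones.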

Google’s PageRank Algorithm
• Assumption: a link in page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
  ⇒ The “quality” of a page is related to the number of links that point to it (its in-degree)
• Apply recursively: the quality of a page is related to
  – its in-degree, and to
  – the quality of the pages linking to it
  ⇒ the PageRank algorithm (Brin & Page, 1998)
SOURCE: GOOGLE
20-751 ECOMMERCE TECHNOLOGY FALL 2003 COPYRIGHT © 2003 MICHAEL I. SHAMOS

Definition of PageRank
• Consider the following infinite random walk (surfing):
  – Initially the surfer is at a random page
  – At each step, the surfer proceeds
    • to a randomly chosen web page with probability d
    • to a randomly chosen successor of the current page with probability 1−d
• The PageRank of a page p is the fraction of steps the surfer spends at p as the number of steps approaches infinity

PageRank Formula
  PR(p) = d/n + (1−d) · Σ over pages q linking to p of PR(q)/outdegree(q)
where n is the total number of nodes (pages) in the graph
• Google uses a damping factor of about 0.85, i.e. the surfer follows a link about 85% of the time (d ≈ 0.15 in this notation)
• PageRank is a probability distribution over web pages
• The sum of the PageRanks of all pages is 1
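Under the walk defined on the previous slide (jump to a random page with probability d, otherwise follow a random out-link), PageRank can be computed by power iteration: repeatedly redistribute rank along the links until the values settle. A minimal sketch on a hypothetical three-page graph:

```python
def pagerank(links, d=0.15, iters=100):
    """Power iteration for the slide's formulation.
    links: {page: [successor pages]}; d is the random-jump probability."""
    n = len(links)
    ranks = {p: 1.0 / n for p in links}            # start uniform
    for _ in range(iters):
        nxt = {p: d / n for p in links}            # random-jump share
        for p, succs in links.items():
            if succs:
                share = (1 - d) * ranks[p] / len(succs)
                for q in succs:                    # pass rank along out-links
                    nxt[q] += share
            else:                                  # dangling page: spread evenly
                for q in nxt:
                    nxt[q] += (1 - d) * ranks[p] / n
        ranks = nxt
    return ranks

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(ranks)  # ranks sum to 1; C, with two in-links, ranks highest
```

The graph and d value are illustrative; Google's production computation runs over billions of pages with sparse-matrix methods, but the fixed point is the same distribution.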

PageRank Example
[Diagram: pages A (4 out-links) and B (3 out-links) both link to page P]
• PageRank of P = (1−d) · [(PageRank of A)/4 + (PageRank of B)/3] + d/n
PAGERANK CALCULATOR
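Plugging numbers into the slide's formula makes the calculation concrete; all values below are hypothetical, chosen only for illustration:

```python
d, n = 0.15, 1_000           # hypothetical jump probability and graph size
pr_A, pr_B = 0.004, 0.003    # hypothetical ranks of the two referring pages

# A spreads its rank over 4 out-links, B over 3; add the random-jump share
pr_P = (1 - d) * (pr_A / 4 + pr_B / 3) + d / n
print(round(pr_P, 6))
```

P inherits 1/4 of A's link endorsement and 1/3 of B's, scaled by the follow probability 1−d, plus the uniform d/n floor every page receives.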

Link Popularity
• How many pages link to this page?
  – on the whole Web?
  – in our database?
• www.linkpopularity.com
• Link popularity is used for ranking
  – Many measures
  – Number of links in
  – Weighted number of links in (by weight of the referring page)

Search Engine Sizes (Sept. 2, 2003)
[Chart: billions of pages indexed and searches per day (millions) for AllTheWeb (ATW), Altavista (AV), Google (GG), Inktomi (INK), and Teoma (TMA); 250 million searches per day is about 2,900 per second!]
SOURCE: SEARCHENGINEWATCH.COM

Search Engine Usage
[Charts: share by search site; share by engine]
SOURCE: SEARCHENGINEWATCH.COM

Search Engine Disjointness
• Four searches, 10 engines, total of 141 hits on March 6, 2002
SOURCE: SEARCHENGINESHOWDOWN

Search Engine EKG
• Shows activity of the Lycos crawler at one sample site, calafia.com, by number of pages visited during each crawl
SOURCE: SEARCHENGINEWATCH.COM

Search Engine EKG Comparison
SOURCE: SEARCHENGINEWATCH.COM

Search Engine Differences
• Coverage (number of documents)
• Spidering algorithms (visit SpiderCatcher)
  – Frequency, depth of visits
• Indexing policies
• Search interfaces
• Ranking
• One solution: use a metasearcher (search agent)

Metasearchers
• All the engines operate differently. Different
  – sizes
  – query languages
  – crawling algorithms
  – storage policies (stop words, punctuation, fonts)
  – freshness
  – ranking
• Submit the same query to many engines and collect the results
• Metacrawler
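Because each engine ranks differently, a metasearcher must merge several ranked hit lists into one. Summing reciprocal ranks is one simple aggregation rule among many (real metasearchers vary); the engine results below are made up:

```python
from collections import defaultdict

def metasearch(result_lists):
    """Merge ranked hit lists by summing reciprocal ranks:
    a hit at rank r in one list contributes 1/r to its score."""
    scores = defaultdict(float)
    for hits in result_lists:
        for rank, url in enumerate(hits, start=1):
            scores[url] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from three engines for the same query
engines = [["a.com", "b.com", "c.com"],
           ["b.com", "a.com"],
           ["b.com", "d.com", "a.com"]]
print(metasearch(engines))  # b.com wins: ranked highly by every engine
```

A hit that appears near the top of several lists beats one that tops a single list, which is exactly the redundancy a metasearcher exploits.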

Clustering
• Viewing large numbers of unstructured hits is not useful
• Answer: cluster them
• Examples: Vivisimo, Kartoo, iBoogie, SurfWax

Search Spying
• Peeking at queries as they are being submitted
• AllTheWeb
• MetaSpy (spies on Metacrawler)
• AskJeeves
• Epicurious (recipes)
• StockCharts.com
• Yahoo buzz index
• Kanoodle
• IQSeek

Time Spent Per Visitor (minutes) by Search Engine, Jan. 2003
• Up 58% in ONE YEAR!
Legend: AJ = Ask Jeeves, AOL = America Online, AV = Altavista, ELNK = EarthLink, GG = Google, ISP = InfoSpace, LS = LookSmart, LY = Lycos, MSN = Microsoft, NS = Netscape, OVR = Overture, YH = Yahoo
SOURCE: SEARCHENGINEWATCH.COM

Audience Reach by Search Site, Jan. 2003
• Audience reach = % of active surfers visiting during the month. Totals exceed 100% because of overlap
Legend: AJ = Ask Jeeves, AOL = America Online, AV = Altavista, ELNK = EarthLink, GG = Google, ISP = InfoSpace, LS = LookSmart, LY = Lycos, MSN = Microsoft, NS = Netscape, OVR = Overture, YH = Yahoo
SOURCE: SEARCHENGINEWATCH.COM

Robot Exclusion
• You may not want certain pages indexed yet still viewable by browsers, so you can’t simply protect the directory
• Some crawlers conform to the Robot Exclusion Protocol. Compliance is voluntary; one way to enforce it is a firewall
• Crawlers look for the file robots.txt at the highest directory level in the domain. If the domain is www.ecom.cmu.edu, robots.txt goes at www.ecom.cmu.edu/robots.txt
• A specific document can be shielded from crawlers by adding the line:
  <META NAME="ROBOTS" CONTENT="NOINDEX">

Robots Exclusion Protocol
• Format of robots.txt: two fields
  – User-agent to specify a robot
  – Disallow to tell that agent what to ignore
• To exclude all robots from a server:
  User-agent: *
  Disallow: /
• To exclude one robot from two directories:
  User-agent: WebCrawler
  Disallow: /news/
  Disallow: /tmp/
• View the robots.txt specification
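Python's standard library ships a parser for exactly this format; the sketch below feeds it the two-directory example from the slide and checks which crawlers may fetch a page under /news/:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: WebCrawler
Disallow: /news/
Disallow: /tmp/

User-agent: *
Disallow:
""".splitlines())

# The named robot is barred from /news/; everyone else is allowed
print(rp.can_fetch("WebCrawler", "/news/today.html"))  # False
print(rp.can_fetch("OtherBot", "/news/today.html"))    # True
```

A polite crawler calls `can_fetch` before every request; nothing technically stops an impolite one, which is why the slide calls compliance voluntary.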

Key Takeaways
• Engines are a critical Web resource
• Very sophisticated, high technology
• They don’t cover the Web completely
• Spamdexing is a problem
• New paradigms are needed as the Web grows
• What about images, music, video?
  – www.corbis.com, Google Images

Q&A