Search Engines

What Are They?
• Four components
  ◦ A database of references to webpages
  ◦ An indexing robot that crawls the WWW
  ◦ An interface
    ▪ Enables users to submit queries
    ▪ Displays results
  ◦ Information retrieval system
• Each is unique, but they are mostly the same

Database
• Where user's query is matched
• Contains only essential parts of pages
• Only includes pages that were indexed
• Search engines are always out of date

Web Crawler
• A robot that follows links
• Records data it finds
  ◦ Words in the webpage
  ◦ Metadata
  ◦ ALT attributes in IMG tags
• Robots Exclusion Protocol
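
The recording step can be sketched in a few lines of Python. This is only an illustration using the standard library, not how any particular engine's robot works. The class name `PageRecorder`, the helper `allowed`, the sample page, and the sample robots.txt rules are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class PageRecorder(HTMLParser):
    """Collects page words, metadata, ALT text, and outgoing links."""

    def __init__(self):
        super().__init__()
        self.words = []      # words found in text nodes (title and body)
        self.metadata = {}   # <meta name=... content=...> pairs
        self.alt_texts = []  # ALT attributes of <img> tags
        self.links = []      # href targets the robot could follow next

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.metadata[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("alt"):
            self.alt_texts.append(attrs["alt"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        # Naive tokenization: lowercase and split on whitespace.
        self.words.extend(data.lower().split())


def allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules supplied as a string."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


# Hypothetical sample inputs, so the sketch runs without network access.
SAMPLE_PAGE = """<html><head><title>Cat Facts</title>
<meta name="description" content="A page about cats">
</head><body><p>Cats sleep a lot.</p>
<img src="cat.jpg" alt="A sleeping cat">
<a href="/dogs.html">Dogs</a></body></html>"""

SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

recorder = PageRecorder()
recorder.feed(SAMPLE_PAGE)
print(recorder.metadata, recorder.alt_texts, recorder.links)
print(allowed(SAMPLE_ROBOTS, "demo-bot", "http://example.com/private/x.html"))
```

A real robot would fetch pages over HTTP, resolve relative links, and queue them for later visits; those parts are left out here to keep the sketch short.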

Search Engine Interfaces
• Gathers input from users
• Presents results from the IR system
  ◦ Often in ranked order

Search Engine Interfaces
• Input
  ◦ User requirements
    ▪ Search expression, search limits
  ◦ Presentation style
    ▪ Presentation format, search type

Search Engine Interfaces
• Output
  ◦ Results
  ◦ Descriptions
  ◦ Clusters

Example: Visual Clustering Interface

Grokker: Large Example of a Visual Clustering Interface

Search Term Matching
• Trying to find a match in the database
• Two main methods
  ◦ Keyword searching
    ▪ Matching single terms, computing cosine similarity
  ◦ Concept-based searching
    ▪ Examining clusters of words
    ▪ Attempt to determine meaning of query and find records related to that meaning
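
As a rough illustration of keyword searching with the cosine measure, the sketch below turns the query and each record into a bag of single terms and scores them by the cosine of the angle between the two term vectors. The three tiny documents and the function names are invented for the example, and raw term counts are used where a real engine would likely apply some weighting.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag of single terms: maps each lowercase word to its count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return ids of matching documents, best cosine score first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)
            if score > 0]


# Hypothetical database of three records.
DOCS = {
    "d1": "web crawler follows links",
    "d2": "search engine ranks web pages",
    "d3": "cooking pasta at home",
}

print(rank("web search engine", DOCS))
```

Here `d2` shares all three query terms and `d1` shares only one, so `d2` scores higher; `d3` shares none and is left out.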

Basic IR Features
• Boolean operators
  ◦ AND, OR, NOT, grouping
• Extended operators
  ◦ NEAR, ADJACENT, quotation marks (")
• Stop word deletion
• Stemming
• Searching in fields (e.g. host)
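
A minimal sketch of how some of these features fit together, assuming a simple inverted index that maps each term to the set of records containing it. The stop word list and the suffix-stripping `stem` function are deliberately crude stand-ins chosen for this example, not what a production engine uses. Boolean AND, OR and NOT become set intersection, union and difference, and grouping is just Python parentheses.

```python
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "or", "to"}


def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Lowercase, strip punctuation, delete stop words, then stem."""
    words = (w.strip(".,!?\"'()").lower() for w in text.split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def build_index(docs):
    """Inverted index: term -> set of ids of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokens(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def lookup(index, word):
    """Documents matching one query word (stemmed like the index)."""
    return index.get(stem(word.lower()), set())


# Hypothetical records.
DOCS = {
    "d1": "Crawling the web",
    "d2": "Search engines index pages",
    "d3": "A crawler follows links in pages.",
}
index = build_index(DOCS)

both = lookup(index, "pages") & lookup(index, "links")        # pages AND links
either = lookup(index, "crawling") | lookup(index, "engine")  # crawling OR engine
without = lookup(index, "pages") - lookup(index, "crawler")   # pages NOT crawler
print(both, either, without)
```

Because the query words go through the same stemming as the indexed text, a search for "pages" also matches a record that only says "page".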

Ranked Output
• Most SEs produce ranked lists by applying simple rules:
  ◦ Early words are more important
  ◦ Title is very important
  ◦ Frequency of occurrence matters for some
  ◦ Infrequent words matter more
  ◦ Modification date
• Google is different:
  ◦ PageRank™ method based on popularity
  ◦ Links as money
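
The "links as money" idea can be sketched as the textbook power-iteration form of PageRank: each page splits its current rank evenly among the pages it links to. This is a simplified illustration, not Google's production ranking. The damping factor of 0.85, the iteration count, and the four-page link graph are assumptions made for the example, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline amount.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page "pays" an equal share to each page it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D link to C; C and D link to A.
GRAPH = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
ranks = pagerank(GRAPH)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C collects links from three pages and ends up ranked first, while B and D, which nobody links to, keep only the baseline amount.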

Googlebombing
• Google spoofed from the lecture list first hit from 1992
  ◦ Official Google Blog explanation

What about the Invisible Web?
• Also known as the Deep Web
• Documents that are on the WWW but not indexed by search engines
  ◦ Some are available only by submitting forms
  ◦ Some are not generally accessible (in subnets)
  ◦ Some are not in (X)HTML format

The Invisible Web Isn't So Invisible Anymore…
• More search engines parse non-(X)HTML now than before
• Because of awareness of the problem, companies are making more content available using
  ◦ Stable URLs
  ◦ Robot-friendly sitemaps
• But much content is still not indexed
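
One common form of robot-friendly sitemap is an XML file in the sitemaps.org format that lists a site's stable URLs. The sketch below builds such a file with Python's standard library. The function name `build_sitemap` and the example.com URLs and dates are placeholders for illustration, and the slides do not say which sitemap format they have in mind.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last-modified date string) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages with stable URLs.
PAGES = [
    ("http://example.com/", "2024-01-15"),
    ("http://example.com/articles/42", "2024-02-01"),
]
sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```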

But there are still plenty of important yet invisible docs
• How to find them?
  ◦ Many of them are in databases
  ◦ No one search engine covers everything
• Use database tools from the U.'s library
  ◦ Especially for research articles
• Use multiple search engines or a metacrawler
  ◦ Dogpile is the most famous
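
To make the metacrawler idea concrete: it sends one query to several engines and merges the ranked lists that come back. The slides do not describe how any real metacrawler merges results, so the round-robin strategy below is purely an assumption for illustration, and the two result lists are made-up data standing in for real engine responses.

```python
from itertools import zip_longest


def merge_results(*ranked_lists):
    """Interleave ranked lists round-robin, dropping duplicate URLs."""
    merged, seen = [], set()
    # zip_longest yields the 1st hit of every engine, then the 2nd, ...
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical ranked results from two engines for the same query.
ENGINE_A = ["a.example/1", "shared.example", "a.example/2"]
ENGINE_B = ["shared.example", "b.example/1"]

merged = merge_results(ENGINE_A, ENGINE_B)
print(merged)
```

The merged list contains every URL either engine found, which is the point of the advice: no one search engine covers everything.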

Search Engines: A Summary of Practical Advice

How To Succeed With SEs
• As a surfer:
  ◦ If you don't know what you are looking for
    ▪ Use multiple SEs, or a meta-crawler
    ▪ Search within results
  ◦ If you do know what you are looking for
    ▪ Use multiple SEs, or a meta-crawler
    ▪ Use Boolean expressions or search within results
    ▪ Consider specialized engines

How To Succeed With SEs
• As a creator:
  ◦ HTML level
    ▪ Always use ALT attributes with <IMG>, etc.
    ▪ Avoid frames
  ◦ Make it easier to index
    ▪ Don't expect SEs to find your pages
    ▪ Make links between your pages
    ▪ Use metadata
      - Informal: <meta name="description" …>
      - Formal: Dublin Core and others
  ◦ Increase your pages' popularity
    ▪ Don't use systematic reciprocal linking: rings, exchanges, lists
    ▪ The PageRank™ a page passes along each link is inversely proportional to its outdegree

How To Succeed With SEs
• As a creator (cont.)
• For surfers:
  ◦ Use <meta name="description" …>
  ◦ Don't expect surfers to start at top of your hierarchy
    ▪ Don't rely on a hierarchy
    ▪ Include a context map near the top of each page
    ▪ Don't use frames
    ▪ Think through dynamic content implications
    ▪ Stickiness… is for another day