Needle in the Haystack: The Technology of Internet Search

Randy H. Katz
The United Microelectronics Corporation Distinguished Professor
Computer Science Division, EECS Department
University of California, Berkeley, CA 94720-1776 USA
randy@cs.Berkeley.edu

Outline
• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Search is BIG!

And the World is Going Digital


Historical Background: The Perfect Storm
• ARPANet, 1969
• NSFNet, 1985
• Commercial Internet, 1995
• Marc Andreessen: NCSA Mosaic, 1993
• Jim Clark: Netscape
• World Wide Web, 1995
• Tim Berners-Lee: URL/HTTP/HTML, 1989
• Bill Atkinson: Hypercard, 1987
• SGML, 1986
• Ted Nelson: Xanadu, Hypertext, 1965-1990 (Autodesk)
• Vannevar Bush: “As We May Think”, MEMEX, 1945
• Est. $15.5 billion spent on-line from Thanksgiving to Xmas 2004, up 28% since 2003


Information Tsunami
• Bit: Binary digit, either a 0 or 1
• Byte: 8 bits
  – 1 byte: Single character
  – 10 bytes: A single word
  – 100 bytes: Telegram or punched card
• Kilobyte: 1,000 or 10^3 bytes
  – 1 kilobyte: Very short story
  – 2 kilobytes: Typewritten page
  – 10 kilobytes: Encyclopedia page
  – 50 kilobytes: Compressed document image page
  – 100 kilobytes: Low-res photo
  – 200 kilobytes: Box of punched cards
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html

Information Tsunami
• Megabyte: 1,000,000 or 10^6 bytes
  – 1 megabyte: Small novel or 3.5 in floppy disk
  – 2 megabytes: Hi-res photo
  – 5 megabytes: Complete works of Shakespeare
  – 10 megabytes: Minute of hi-fi sound
  – 100 megabytes: 1 m of shelved books
  – 500 megabytes: CD-ROM
• Gigabyte: 1,000,000,000 or 10^9 bytes
  – 1 gigabyte: Pickup truck filled with paper
  – 2 gigabytes: Movie on a DVD
  – 50 gigabytes: Floor of books
  – 100 gigabytes: Floor of academic journals
  – 500 gigabytes: Biggest FTP site
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html

Information Tsunami
• Terabyte: 1,000,000,000,000 or 10^12 bytes
  – 1 terabyte: 50,000 trees made into paper and printed, or 1 day of EOS data
  – 2 terabytes: Academic research library
  – 10 terabytes: Printed collection of the U.S. Library of Congress
  – 50 terabytes: Contents of a large mass storage system
  – 400 terabytes: National Climate Data Center (NOAA) database
• Petabyte: 1,000,000,000,000,000 or 10^15 bytes
  – 1 petabyte: 3 years of Earth Observing System (EOS) data
  – 2 petabytes: All U.S. academic research libraries
  – 8 petabytes: All information available on the Web
  – 200 petabytes: All printed material (2001)
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html

Information Tsunami
• Exabyte: 1,000,000,000,000,000,000 or 10^18 bytes
  – 2 exabytes: Total volume of information generated worldwide annually
  – 5 exabytes: All words ever spoken by humans
• Zettabyte: 1,000,000,000,000,000,000,000 or 10^21 bytes
• Yottabyte: 1,000,000,000,000,000,000,000,000 or 10^24 bytes
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html
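The unit ladder on the Information Tsunami slides climbs by a factor of 1,000 at each step. A minimal Python sketch of that arithmetic follows; the function names `humanize` and `to_bytes` are illustrative and not from the talk, and the sketch assumes the decimal (powers of 10^3) convention the slides use rather than binary multiples.

```python
# Decimal (SI) byte units as used on the slides: each step is a factor of 1,000.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def humanize(num_bytes: int) -> str:
    """Express num_bytes in the largest unit that keeps the value at 1 or more."""
    value = float(num_bytes)
    index = 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    plural = "" if value == 1 else "s"
    return f"{value:g} {UNITS[index]}{plural}"

def to_bytes(amount: int, unit: str) -> int:
    """Convert an amount in a named unit back to bytes (powers of 10^3)."""
    return amount * 1000 ** UNITS.index(unit)

print(humanize(to_bytes(500, "megabyte")))   # the slide's CD-ROM figure
print(humanize(to_bytes(8, "petabyte")))     # the slide's estimate for the Web
# How many printed Library of Congress collections (10 TB each) fit in the
# 2 exabytes the slides give for annual worldwide information?
print(to_bytes(2, "exabyte") // to_bytes(10, "terabyte"))
```

Taking the slides' figures at face value, the last line shows that one year of new information corresponds to 200,000 printed Library of Congress collections.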


Anatomy of a Web Page: Randy’s Home Page
• URL: Uniform Resource Locator
• Images
• Text

Anatomy of a Web Page: Randy’s Home Page
<html>
<head>
<title>Professor Randy Howard Katz University of California Berkeley Computer Science Division Home Page</title>
<meta name="description" content="Home Page of Berkeley Computer Science Professor Randy Howard Katz">
<meta name="keywords" content="Katz Randy Howard Berkeley Professor University California Electrical Engineering Computer Science Department RAID Redundant Arrays Inexpensive Disks SPUR Snoop Wireless Communications Networks Programmable Network Elements">
</head>
<body>
<p><img height="269" src="Randy_2004.jpg" width="182" align="bottom" naturalsizeflag="0">  <img height="269" src="RHK85a.jpg" width="177" align="bottom" naturalsizeflag="0">  </p>
<p><font size="-1">2005 vs. 1985... The hair is grayer, but the smirk remains the same! "...Katz, a thin, almost gaunt man with horn-rimmed glasses magnifying sunken eyes.." --George Johnson, WIRED Magazine, (January 2000), page 150.</font></p>
<p><img src="VISIONAR.JPG" align="bottom"> </p>

• Text
• Images
• Links!

Anatomy of a Web Page: Randy’s Web Page
<hr align="left">
<h1>Professor Randy H. Katz</h1>
<h3>Electrical Engineering and Computer Science Department</h3>
<p><a href="http://www.umc.com.tw/"><img hspace="6" src="UMCLogo.gif" align="left"> </a> <b><font size="+1">The <a href="http://www.umc.com.tw/">United Microelectronics Corporation</a> Distinguished Professor</font></b></p>
<p><font size="-1"><br clear="left"> Ph.D., University of California, Berkeley, 1980. M.S., University of California, Berkeley, 1978. A.B., Cornell University, 1976. </font></p>
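The page anatomy shown on these slides (title, meta keywords, images, links) is what a search engine pulls out of the raw HTML. A minimal sketch using Python's standard `html.parser` module follows. The `PageAnatomy` class and the abbreviated `SAMPLE` markup are hypothetical; the sample is modeled on the slides' page, not copied from the real one.

```python
from html.parser import HTMLParser

class PageAnatomy(HTMLParser):
    """Collect the title, meta tags, image sources, and link targets of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}      # meta name -> content
        self.images = []    # values of <img src=...>
        self.links = []     # values of <a href=...>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Abbreviated, hypothetical sample modeled on the slides' markup.
SAMPLE = """<html><head><title>Professor Randy Howard Katz Home Page</title>
<meta name="keywords" content="Katz Berkeley RAID Wireless">
</head><body>
<p><img src="Randy_2004.jpg" height="269" width="182"></p>
<p>The <a href="http://www.umc.com.tw/">United Microelectronics Corporation</a>
Distinguished Professor</p>
</body></html>"""

page = PageAnatomy()
page.feed(SAMPLE)
print(page.title)
print(page.meta["keywords"].split())
print(page.images, page.links)
```

The extracted links are what a crawler would follow next, and the title and keywords are the terms an indexer would record for the page.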


Anatomy of Web Access
[Diagram, steps (1)-(4). Labels: Naming System (DNS): Name-to-Address Mapping; IP address; Web Page in HTML; Link URL http://www.umc.com.tw/; Web Browser; Taiwan; Web Server]
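The access path in this diagram can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the real machinery: `DNS_TABLE` and `WEB_SERVERS` are hypothetical in-memory stand-ins for the naming system and the web server, the IP address comes from the reserved documentation range, and the mapping of the diagram's step numbers to lookup (1)-(2) and fetch (3)-(4) is inferred from the labels.

```python
from urllib.parse import urlsplit

# Hypothetical stand-ins for the real infrastructure. The address is from the
# 192.0.2.0/24 documentation range, not the site's real address.
DNS_TABLE = {"www.umc.com.tw": "192.0.2.10"}
WEB_SERVERS = {"192.0.2.10": {"/": "<html><body>UMC home page</body></html>"}}

def resolve(hostname: str) -> str:
    """Assumed steps (1)-(2): ask the naming system for the host's IP address."""
    return DNS_TABLE[hostname]

def fetch(url: str) -> str:
    """Assumed steps (3)-(4): request the path from the server, get HTML back."""
    parts = urlsplit(url)
    ip_address = resolve(parts.hostname)
    path = parts.path or "/"
    return WEB_SERVERS[ip_address][path]

print(fetch("http://www.umc.com.tw/"))
```

The point of the split is that the browser never contacts a name: the URL's host part goes to DNS first, and only the resulting address is used to reach the server.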

Anatomy of Web Access: Content Caching
[Diagram, steps (5)-(8). Labels: Naming System (DNS); Origin IP; Web Page in HTML; Content Network DNS; Edge Cache IP; Taiwan; Link URL …/English/about/index.asp; Web Browser; Content Distribution; San Jose Edge Cache; Origin Web Server]
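A rough sketch of the content-caching idea in this diagram: a content-network DNS hands the browser the address of a nearby edge cache, and the edge cache contacts the origin server only when it does not already hold the page. All class names, region names, and the `/about` path below are hypothetical, chosen for illustration.

```python
class OriginServer:
    """Holds the authoritative copy of each page and counts requests."""

    def __init__(self, pages):
        self.pages = pages
        self.requests = 0

    def get(self, path):
        self.requests += 1
        return self.pages[path]

class EdgeCache:
    """Serves pages close to the user, contacting the origin only on a miss."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def get(self, path):
        if path not in self.store:          # cache miss: go to the origin
            self.store[path] = self.origin.get(path)
        return self.store[path]             # otherwise answer locally

def content_dns(client_region, edge_caches, default_region):
    """Content-network DNS: pick the edge cache for the client's region."""
    return edge_caches.get(client_region, edge_caches[default_region])

origin = OriginServer({"/about": "<html>About</html>"})   # e.g. the origin site
edges = {"san-jose": EdgeCache(origin), "taipei": EdgeCache(origin)}

edge = content_dns("san-jose", edges, default_region="taipei")
first = edge.get("/about")    # miss: fetched from the origin
second = edge.get("/about")   # hit: served from the edge cache
print(first == second, origin.requests)
```

Two requests reach the browser but only one reaches the origin, which is the saving that content distribution is after.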


Challenges of Search
• How to find all the pages on the Web?
• How to order the pages by relevance?
• How to make the content on those pages searchable?
• How to keep it all up-to-date?
• Web Crawlers/SpiderBots
  – Network software executing in parallel that follows links in the Web to find content
  – Web pages “scraped” for more links to follow
  – Web revisited on the order of once every two to three days
• Indexers
  – Web pages “scraped” for search terms to build indexes
  – (Google) Page rank algorithm: order a page within the index based (roughly) on how many pages refer to it
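The crawler and indexer roles described on this slide can be sketched together. The example below is a single-threaded toy, whereas real crawlers run in parallel over the network; the four-page `WEB` dictionary is hypothetical data standing in for fetched pages.

```python
import re
from collections import defaultdict, deque

# Hypothetical four-page web: URL -> (text, outgoing links).
WEB = {
    "a.html": ("berkeley computer science", ["b.html", "c.html"]),
    "b.html": ("computer networks", ["c.html"]),
    "c.html": ("search engines index the web", ["a.html"]),
    "d.html": ("unlinked page about search", []),
}

def crawl(seed):
    """Breadth-first crawl: follow links from the seed, return the pages found."""
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        for link in WEB[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

def build_index(urls):
    """Inverted index: search term -> set of URLs whose text contains it."""
    index = defaultdict(set)
    for url in urls:
        for term in re.findall(r"[a-z]+", WEB[url][0].lower()):
            index[term].add(url)
    return index

found = crawl("a.html")
index = build_index(found)
print(sorted(found))
print(sorted(index["computer"]))
print(sorted(index["search"]))
```

Note that `d.html` mentions "search" but never shows up in the results: nothing links to it, so a link-following crawler never finds it. That is the first challenge on the slide in miniature.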

Quick (and Incomplete) History of Search Engines
[Timeline axis: Pre-Web, 1993, 1995, 1997, 1999, 2001, 2003, 2005]
• UMinn: Veronica & Archie services for gopher & ftp
• MIT: Wandex / WWW Wanderer, Aliweb
• CMU: Lycos, 1st commercial search engine
• Stanford: Yahoo! directories
• Battle for popularity: Webcrawler (UWash), HotBot (Wired), Excite (Stanford), Infoseek (ABC), Inktomi (Berkeley), AltaVista (DEC), Google (Stanford)
• Yahoo! acquires Inktomi
• Yahoo! acquires Overture (AlltheWeb, AltaVista)
• Yahoo! deploys joint technology
• a9.com, AlltheWeb, Ask Jeeves, Clusty, Gigablast, Ez2Find, Teoma, WiseNut, GoHook, Walhello, Kartoo

Search Challenges and Issues
• Web growing faster than search engines can index
• Web pages updated frequently, forcing frequent revisits
• Keyword-only searches result in many false positives
• Difficult to index dynamically generated sites: the so-called “invisible web”
• Some search engines order results by financial “placement” considerations rather than relevance
• Some sites trick search engines into displaying them first for some keywords; this pollutes the search results, pushing more relevant links further down


Page Ranking Algorithms
• Web page relevancy
  – Many hits: how to ensure the best/most relevant web pages are presented first in answer to a search?
• Location and Frequency of Keywords
  – Index terms in page title raise its relevance for that term
  – Keywords near “top” of page more relevant than those near the bottom
  – High keyword frequency boosts relevance
• If search engine strategy is known, page developers will “game” the strategy to get their pages ranked higher
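The location-and-frequency heuristics on this slide, and the way they invite gaming, can be illustrated with a toy scoring function. The weights (`title_bonus`, `top_bonus`) and the "top quarter of the page" cutoff are arbitrary assumptions for illustration; no real search engine's formula is implied.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def relevance(term, title, body, title_bonus=3.0, top_fraction=0.25, top_bonus=2.0):
    """Score a page for one term. All weights are illustrative assumptions."""
    term = term.lower()
    words = tokenize(body)
    top_cutoff = max(1, int(len(words) * top_fraction))
    score = 0.0
    if term in tokenize(title):
        score += title_bonus                  # term appears in the page title
    for position, word in enumerate(words):
        if word == term:
            # Occurrences near the top count more; every occurrence adds
            # something, so higher frequency means a higher score.
            score += top_bonus if position < top_cutoff else 1.0
    return score

honest = relevance("raid", "RAID storage",
                   "raid arrays protect data against disk failure")
stuffed = relevance("raid", "raid raid raid", "raid " * 50)
print(honest, stuffed)
```

The keyword-stuffed page wins easily even though it says nothing, which is why a ranking signal that page authors control directly is easy to game.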

Google’s Page Rank Algorithm
• Which is the most important page?

Google’s Page Rank Algorithm
• Googlese from their web page:
  – “PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."”

Google Page Rank Algorithm
• Basic idea:
  – Page’s rank determined by the number of links to the page (also known as citations)
  – If citing page is more important (has a high page rank, i.e. is an authority page), then the pages it cites are more important
  – If citing page has many links, then cited page is less important (normalize for number of links on citing page)
• PR(P) is the page rank of page P; T1, …, TN are the pages that cite P; C(P) is the number of links from page P; d is a “decay factor”, e.g., 0.85. Then:
  PR(P) = (1 − d) + d × (PR(T1)/C(T1) + … + PR(TN)/C(TN))
• See http://www-db.stanford.edu/~backrub/google.html
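Because each page's rank depends on the ranks of the pages citing it, the formula is usually evaluated by repeated substitution until the values settle. The sketch below applies the slide's formula directly. The three-page graph, the starting value of 1.0, and the fixed iteration count are assumptions for illustration, not a description of Google's production system.

```python
def page_rank(links, d=0.85, iterations=100):
    """Iterate PR(P) = (1 - d) + d * sum(PR(T)/C(T)) over the pages T citing P.

    links maps each page to the list of pages it links to, so C(T) is
    len(links[T]).
    """
    pages = list(links)
    rank = {page: 1.0 for page in pages}      # arbitrary starting guess
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            cited_by = [t for t in pages if page in links[t]]
            new_rank[page] = (1 - d) + d * sum(
                rank[t] / len(links[t]) for t in cited_by
            )
        rank = new_rank
    return rank

# Hypothetical graph: A and B both cite C; C cites A; nothing cites B.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

C comes out on top because two pages cite it; A follows because its single citation comes from the highly ranked C; B, which nothing cites, is left with only the base value 1 − d.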

Google Conceptual Architecture

Google Server Architecture
[Diagram labels: Google Web Server; Spell Checker; Ad Server; Index Server; Doc Server (several)]
• Index servers: search term partitioned and mapped to doc list
• Intersect to find document list, sort by page rank
• Document IDs used to extract text from Doc Servers
• Over 100,000 processors (and growing) in Googleplex
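The query path in these bullets (look up each term's document list, intersect, sort by page rank, fetch text by document ID) can be sketched with small in-memory tables. Everything below is hypothetical data; partitioning the index by search term follows the slide's wording, and the real system's layout may differ.

```python
# Hypothetical data: two index-server partitions (split by search term),
# per-document page ranks, and a document server holding the text.
INDEX_SERVERS = [
    {"internet": {1, 2, 4}, "search": {1, 2, 3}},
    {"technology": {2, 3, 4}},
]
PAGE_RANK = {1: 0.9, 2: 1.7, 3: 0.4, 4: 1.2}
DOC_SERVER = {
    1: "Internet search basics",
    2: "The technology of Internet search",
    3: "Search technology vendors",
    4: "Internet technology news",
}

def lookup(term):
    """Ask the partition that owns the term for its document list."""
    for partition in INDEX_SERVERS:
        if term in partition:
            return partition[term]
    return set()

def search(query):
    """Intersect per-term document lists, sort by page rank, fetch the text."""
    doc_lists = [lookup(term) for term in query.lower().split()]
    if not doc_lists:
        return []
    matches = set.intersection(*doc_lists)
    ranked = sorted(matches, key=lambda doc_id: PAGE_RANK[doc_id], reverse=True)
    return [(doc_id, DOC_SERVER[doc_id]) for doc_id in ranked]

print(search("internet search"))
print(search("internet technology search"))
```

Each added query term can only shrink the intersection, so the expensive step of fetching text from the document servers is done for the few surviving document IDs only.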


Fun and Games
• Google Scholar
• Googling Someone
• Google News
• Comparison Shopping
• Google Whacks

Google Scholar

Google Randy

Google Randy Katz
• “Google Index”
• Advertising Placement

Google News

Comparison Shopping

elgooG

Google Whacks

Business Model
• Ad Placement and Click-Thru
• Old data (2002): Google is now market leader in ad revenue
• 2004 revenue through 9/30/04: $2.1B


Top 10 Search Engines
10. DMOZ.org
9. Alltheweb.com
8. KartOO.com
7. MSN.com
6. Dogpile.com
5. AskJeeves.com
4. About.com
3. Yahoo.com
2. Vivisimo.com
1. Google.com

Clustering

Google Video Search

Amazon’s A9

A9’s Yellow Pages

Innovations Now and Yet to Come
• Index ever larger portions of the Web, even beyond traditional web pages, e.g., video
• Better quality/higher relevance searches
• Better presentation of results, e.g., clustering, site information
• Better exploitation of semantic relationships for improved page ranking, more personalization, e.g., user’s zip code
• More services (Web, news groups, blogs, comparison shopping, video/audio, yellow pages, etc.)
• Integrate with desktop machine

Parting Thoughts

“Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?”
T. S. Eliot, “Choruses from the Rock”, Selected Poems, NY: Harvest / Harcourt, 1962, p. 107.

Needle in the Haystack: The Technology of Internet Search
Thanks for Your Patience & Attention!
Questions?