Chapter 5 Ordering Search Results How does Google rank webpages?

Information on Internet • Amount of information on Internet has been growing at a rapid rate since introduction of Web in 1989 • More than 60 trillion (6 × 10^13) unique pages • How does Google keep track of all these pages? • Each search engine has its own database which stores information about all the webpages it knows of • With web growing and evolving as fast as it does, how do search engines keep their databases up to date? • By constantly crawling the web, through programs that automatically follow links from one webpage to the next, adding new pages to the database in the process and updating entries of existing pages
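The crawling idea above can be sketched on a toy, in-memory web. The page names and the `fetch_links` callback are hypothetical stand-ins for real page fetching and link extraction; this is a minimal illustration, not how any production crawler works.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the pages it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}

def crawl(seed, fetch_links):
    """Follow links breadth-first from the seed, storing each page once."""
    database = {}                 # page -> outgoing links found on that page
    frontier = deque([seed])      # pages discovered but not yet visited
    while frontier:
        page = frontier.popleft()
        if page in database:      # already stored, skip it
            continue
        links = fetch_links(page)
        database[page] = links
        frontier.extend(link for link in links if link not in database)
    return database

index = crawl("a.example", lambda page: TOY_WEB.get(page, []))
print(sorted(index))
```

A real crawler would also revisit known pages periodically so that existing entries stay up to date.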

# of Pages Indexed by Google • From when Google was a prototype at Stanford in 1997 to when it stopped publicizing the number of pages in its database on its home page • 24 M in 1997 → 8 B in 2005 → 60 T in 2015

Big Question • If Google has indexed trillions of pages, how is it that when you enter a search query, you can usually find more than enough of what you need within the first few results? – (result) pages aren’t displayed in order of when Google first indexed them – those displayed on the top are important ones that are relevant to query

Ranking by Relevance • Early search engines stored only the title and section headings of pages → enabled quick and inexpensive searches • Issues? • Loss of information and search precision – query “search engine” may not match a page titled “ordering search results” • Full text search (may address such issue) – every word of content on a webpage is stored in the database – allowing search queries to be matched against all content • Search engine WebCrawler, launched in 1994 (acquired by AOL in 1995), did full text search
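The difference between title-only and full text search can be illustrated with a small inverted index (word → set of pages). The page name `ch5` and its text are made up for the example, and the tokenizer is a deliberately crude assumption.

```python
import re
from collections import defaultdict

def words_of(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map every word to the set of page names whose text contains it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in words_of(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the pages that contain every word of the query."""
    words = words_of(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical page: the title alone never mentions "engine".
PAGES = {"ch5": {"title": "Ordering Search Results",
                 "body": "How does a search engine such as Google rank webpages?"}}

title_index = build_index({n: p["title"] for n, p in PAGES.items()})
full_index = build_index({n: p["title"] + " " + p["body"] for n, p in PAGES.items()})

print(search(title_index, "search engine"))  # set()
print(search(full_index, "search engine"))   # {'ch5'}
```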

Big Question If you were designing a search engine, how would you rank webpages?

Ranking by Relevance • For a given search query entered, order the pages that match the query by number of times they contain the query – more occurrences indicate a higher match • Counting occurrences → measuring relevance of a page to query → how strongly the search is associated with the page • Example: # of times the word “Vanilla” appears on each page
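Ranking by occurrence count can be sketched in a few lines. The three pages below are invented for illustration, and counting plain substrings is a simplifying assumption (a real engine would tokenize words).

```python
def relevance(text, query):
    """Relevance = number of (case-insensitive) occurrences of the query."""
    return text.lower().count(query.lower())

# Hypothetical pages for the "Vanilla" example.
PAGES = {
    "recipes": "Vanilla cake with vanilla frosting and vanilla bean.",
    "icecream": "Chocolate and vanilla ice cream.",
    "cars": "Used cars for sale.",
}

def rank_by_relevance(pages, query):
    """Keep only matching pages, most occurrences first."""
    matches = [name for name in pages if relevance(pages[name], query) > 0]
    return sorted(matches, key=lambda name: relevance(pages[name], query), reverse=True)

print(rank_by_relevance(PAGES, "Vanilla"))  # ['recipes', 'icecream']
```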

Is Relevance Enough?

A Platoon of Soldiers Who is most important in a platoon? → its leader → how do you denote the leader by annotating graph?

Importance • Importance score: measuring how important a webpage is, irrespective of content on page and text in search • How are importance scores determined? • Involves looking at network of webpages (termed webgraph) that is induced by hyperlinks connecting pages to each other to form the Web in the first place • Importance of each page will not change depending on search query entered by user, nor based on content that is contained on page • Instead, it is based entirely on structure of graph showing how webpages point to each other • Google search engine employs its own algorithm – PageRank – which solves a huge system of equations to determine importance of each page, and then ranks results from highest to lowest importance

Graph and Webgraph • Webpages are connected to one another through hyperlinks • Connections between webpages can be represented succinctly using a graph • Graph = ({nodes}, {edges}) • Webgraph – node = webpage – (directed) link = indication that one page references another • Webgraph encapsulates structure of web’s connectivity • Structure of the entire Web? – sparse

Importance • What makes a page important? • # of other nodes pointing to it – in-degree (# of incoming links) • In-degree of X = 2 • In-degree of Y = 3 • In-degree of Z = 2 • In-degree of W = 1 • In-degree → importance score • Good enough?
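In-degree is easy to compute once the webgraph is stored as an adjacency list. The figure is not reproduced in this transcript, so the link structure below is an assumption: it is reconstructed to agree with the in-degrees quoted above and with the transition chances (1/2 out of W, 1/3 out of Z, Y linking only to Z) mentioned on later slides.

```python
# Assumed webgraph (reconstructed, not copied from the original figure):
# each page maps to the list of pages it links to.
WEBGRAPH = {
    "W": ["X", "Y"],
    "X": ["Y", "Z"],
    "Y": ["Z"],
    "Z": ["W", "X", "Y"],
}

def in_degrees(graph):
    """Count the incoming links of every page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts

def out_degrees(graph):
    """Count the outgoing links of every page."""
    return {page: len(targets) for page, targets in graph.items()}

print(in_degrees(WEBGRAPH))   # {'W': 1, 'X': 2, 'Y': 3, 'Z': 2}
print(out_degrees(WEBGRAPH))  # {'W': 2, 'X': 2, 'Y': 1, 'Z': 3}
```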

Random Surfer • A surfer is given a web page, and keeps clicking on links randomly: e.g., A → D → E • Surfer may eventually get bored and enter some other address (F) into browser • In PageRank: the percentage of times that a page is visited in this process (relative to total number of visits to all pages) is the page’s importance • Probability of choosing a specific hyperlink on a page – 1/out-degree
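The random surfer can be simulated directly. The webgraph below is an assumed reconstruction of the W, X, Y, Z example (the figure is not in this transcript), and `p_bored`, the chance of typing a random address instead of clicking, is a hypothetical parameter set to 0 by default.

```python
import random

# Assumed W/X/Y/Z webgraph: page -> pages it links to.
WEBGRAPH = {"W": ["X", "Y"], "X": ["Y", "Z"], "Y": ["Z"], "Z": ["W", "X", "Y"]}

def surf(graph, steps, p_bored=0.0, seed=0):
    """Simulate a random surfer and return the fraction of visits per page."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    pages = sorted(graph)
    visits = {page: 0 for page in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < p_bored or not graph[page]:
            page = rng.choice(pages)        # bored (or stuck): jump anywhere
        else:
            page = rng.choice(graph[page])  # each link chosen with chance 1/out-degree
    return {page: visits[page] / steps for page in pages}

freq = surf(WEBGRAPH, 100_000)
for page, share in freq.items():
    print(page, round(share, 3))
```

In this assumed graph Z ends up with the largest share of visits even though Y has the highest in-degree.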

Random Surfer • What’s the chance that surfer will be on a given page in the first place? – depending on in-degree – what else? – importance of those pages that have links pointing to this page • While Z only has two incoming links, one is from Y (most important in terms of in-degree) • If Y is that likely to be visited, surely Z will be at least as likely, because once on Y, the only option surfer has is to click on Z

Quantifying Importance from Webgraphs • Concepts of in-degree, out-degree, and page importance can be visually represented through nodes (W, X, Y, Z) and links of webgraph • Each node assigned an importance score, defined as chance that it will be selected in random surfing (w, x, y, z) • A page spreads its importance across outgoing hyperlinks • X could be selected if surfer is on either W or Z – chance that a transition will happen from W to X? – chance that a transition will occur from Z to X?

Quantifying Importance • Chance that a transition will happen from W to X? – surfer must be on W in the first place: importance of W (w) – surfer must then transition to X when on W: 1/2 – combined chance: w(1/2) • Chance that a transition will occur from Z to X? – surfer must be on Z in the first place: importance of Z (z) – surfer must then transition to X when on Z: 1/3 – combined chance: z(1/3) • Adding both ways of reaching X: x = w/2 + z/3

Quantifying Importance

Solving System of Equations • Importance scores represent probabilities of each node being visited during random surfing process

Quantifying Importance • Webpage’s importance depends upon other webpages’, and these other webpages’ importance is in turn dependent upon the original’s • Such circular logic demands solution to a system of simultaneous equations • Method – label each link as ratio of importance of origin node to its out-degree – at each node, equate its importance to sum of all values on incoming links • number of equations = number of nodes in graph
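The method above (one equation per node, each equating a node's importance to the sum of the values on its incoming links) can be sketched with exact fractions. The webgraph is an assumed reconstruction of the W, X, Y, Z example, and replacing the last equation by "all scores sum to 1" is an added normalization so that the scores read as probabilities.

```python
from fractions import Fraction

# Assumed W/X/Y/Z webgraph: page -> pages it links to.
WEBGRAPH = {"W": ["X", "Y"], "X": ["Y", "Z"], "Y": ["Z"], "Z": ["W", "X", "Y"]}

def importance_equations(graph):
    """Build A and b with A*s = b: score_i - sum(score_j / out-degree_j) = 0."""
    nodes = sorted(graph)
    col = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    b = [Fraction(0)] * n
    for source, targets in graph.items():
        for target in targets:
            A[col[target]][col[source]] -= Fraction(1, len(targets))
    A[-1] = [Fraction(1)] * n      # normalization: scores sum to 1
    b[-1] = Fraction(1)
    return nodes, A, b

def solve(A, b):
    """Gauss-Jordan elimination; assumes the system has one unique solution."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [value / M[c][c] for value in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[-1] for row in M]

nodes, A, b = importance_equations(WEBGRAPH)
scores = dict(zip(nodes, solve(A, b)))
for name in nodes:
    print(name, scores[name])   # W 4/31, X 6/31, Y 9/31, Z 12/31
```

For this assumed graph the equation for X is exactly x = w/2 + z/3, and Z comes out more important than Y.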

Quantifying Importance • For each webpage, three things are equal – importance score of page – sum of incoming importance – sum of outgoing importance
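This three-way equality can be checked numerically. The webgraph is an assumed reconstruction of the W, X, Y, Z example, and the scores 4/31, 6/31, 9/31, 12/31 are what solving the importance equations for that assumed graph gives (normalized to sum to 1); neither is quoted from the slides.

```python
from fractions import Fraction

# Assumed W/X/Y/Z webgraph and its (assumed) solved importance scores.
WEBGRAPH = {"W": ["X", "Y"], "X": ["Y", "Z"], "Y": ["Z"], "Z": ["W", "X", "Y"]}
SCORES = {"W": Fraction(4, 31), "X": Fraction(6, 31),
          "Y": Fraction(9, 31), "Z": Fraction(12, 31)}

def link_values(graph, scores):
    """Label each link with (importance of origin) / (out-degree of origin)."""
    return {(source, target): scores[source] / len(targets)
            for source, targets in graph.items() for target in targets}

values = link_values(WEBGRAPH, SCORES)
incoming = {p: sum(v for (s, t), v in values.items() if t == p) for p in WEBGRAPH}
outgoing = {p: sum(v for (s, t), v in values.items() if s == p) for p in WEBGRAPH}

for page in WEBGRAPH:
    print(page, SCORES[page], incoming[page], outgoing[page])
```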

Quantifying Importance • Whenever random surfer lands on Y, she will go to Z next, but not the other way around • Over time, Z will get more visits than Y, and therefore should be more important

The Math behind Google • Kurt Bryan and Tanya Leise, The $25,000,000,000 Eigenvector: The Linear Algebra behind Google, SIAM Review, 48(3), 569-581, 2006.

Ranking → Linear Algebra Transforms web ranking problem into “standard” Linear Algebra problem → eigenvector
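One way to see the eigenvector connection: if A is the link matrix with A[i][j] = 1/out-degree(j) whenever page j links to page i, the importance vector s satisfies A·s = s, i.e. s is an eigenvector of A with eigenvalue 1. The sketch below finds it by power iteration, which is one common numerical approach and not necessarily the one the slides have in mind; the webgraph is an assumed reconstruction of the W, X, Y, Z example.

```python
# Assumed W/X/Y/Z webgraph: page -> pages it links to.
WEBGRAPH = {"W": ["X", "Y"], "X": ["Y", "Z"], "Y": ["Z"], "Z": ["W", "X", "Y"]}
NODES = sorted(WEBGRAPH)

def link_matrix(graph, nodes):
    """A[i][j] = 1/out-degree(j) if page j links to page i, else 0."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for source, targets in graph.items():
        for target in targets:
            A[index[target]][index[source]] = 1.0 / len(targets)
    return A

def multiply(A, s):
    """Matrix-vector product A*s."""
    return [sum(a * x for a, x in zip(row, s)) for row in A]

def power_iteration(A, iterations=100):
    """Start from equal scores and apply A repeatedly."""
    s = [1.0 / len(A)] * len(A)
    for _ in range(iterations):
        s = multiply(A, s)
    return s

A = link_matrix(WEBGRAPH, NODES)
scores = power_iteration(A)
for name, score in zip(NODES, scores):
    print(name, round(score, 4))
```

Applying A once more to the result leaves it (numerically) unchanged, which is what "eigenvector with eigenvalue 1" means here.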

System of Linear Equations • Three types of solutions of a system of linear equations – one unique solution – infinitely many solutions – no solution

Dangling Nodes • Nodes that don’t point to any other nodes • With dangling node(s), there is no solution – why? • Sum of importance of all outgoing links is equal to the node’s importance • Since V has no outgoing links, v = 0 • ➡ z = 0 (as v = z/4) • ➡ all other nodes = 0 • PageRank solution to this problem – add links from V to all nodes – intuition? If a random surfer lands on a page without any hyperlinks, she would have to enter some other link into browser to keep on going
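The effect of a dangling node, and of the fix, can be observed by repeatedly pushing importance along the links. The graph is an assumption: the W, X, Y, Z example (reconstructed, since the figure is not in this transcript) with an extra link Z → V so that v = z/4. Letting V link to every node including itself is also an assumption about what "all nodes" means.

```python
# Assumed graph: W/X/Y/Z example plus a dangling node V reached from Z.
DANGLING = {"W": ["X", "Y"], "X": ["Y", "Z"], "Y": ["Z"],
            "Z": ["W", "X", "Y", "V"], "V": []}

def balance_step(graph, scores):
    """Each page passes score/out-degree along every outgoing link."""
    new = {page: 0.0 for page in graph}
    for source, targets in graph.items():
        for target in targets:
            new[target] += scores[source] / len(targets)
    return new

def iterate(graph, iterations=300):
    """Start from equal scores and apply the balance step repeatedly."""
    scores = {page: 1.0 / len(graph) for page in graph}
    for _ in range(iterations):
        scores = balance_step(graph, scores)
    return scores

def patch_dangling(graph):
    """Give every page without outgoing links a link to all pages."""
    everyone = list(graph)
    return {page: (targets if targets else everyone)
            for page, targets in graph.items()}

leaky = iterate(DANGLING)                  # importance drains away through V
fixed = iterate(patch_dangling(DANGLING))  # importance is conserved

print(sum(leaky.values()))
print(sum(fixed.values()))
```

Without the fix the total importance shrinks toward 0, matching the all-zero conclusion above; with the fix it stays at 1 and every page keeps a positive score.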

Multiple Connected Components • A connected component is a group of nodes in which any two can reach each other (directly or indirectly), but none can reach any outside of group • A random surfer can’t get from one component to the other • No way of relating importance of nodes in one component to those in the other, which causes the problem to be mathematically underspecified (i.e., many potential solutions) • PageRank solution – randomly choose from all available webpages
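The underspecification, and the effect of occasionally choosing a random page, can be sketched on a made-up graph with two components ({A, B} and {C, D, E}). The jump probability of 0.15 is an assumed value, not taken from the slides, and the sketch assumes every page has at least one outgoing link.

```python
# Hypothetical graph with two components that never link to each other.
TWO_COMPONENTS = {"A": ["B"], "B": ["A"],
                  "C": ["D", "E"], "D": ["E"], "E": ["C"]}

def surf_scores(graph, start, jump=0.15, iterations=200):
    """Follow links with chance 1 - jump, else jump to a random page."""
    n = len(graph)
    scores = {page: (1.0 if page == start else 0.0) for page in graph}
    for _ in range(iterations):
        new = {page: jump / n for page in graph}     # share from random jumps
        for source, targets in graph.items():
            for target in targets:
                new[target] += (1 - jump) * scores[source] / len(targets)
        scores = new
    return scores

# Without jumps the outcome depends on where the surfer starts.
stuck_a = surf_scores(TWO_COMPONENTS, "A", jump=0.0)
stuck_c = surf_scores(TWO_COMPONENTS, "C", jump=0.0)
print(stuck_a["C"], stuck_c["A"])   # 0.0 0.0

# With jumps both starting points lead to (numerically) the same scores.
from_a = surf_scores(TWO_COMPONENTS, "A")
from_c = surf_scores(TWO_COMPONENTS, "C")
print(max(abs(from_a[p] - from_c[p]) for p in TWO_COMPONENTS))
```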