Google PageRank: Basic Principles and Algebraic/Stochastic Interpretation
Laboratory of Intelligent Networks (LINK), Youn-Hee Han
Background
History
- Proposed by Sergey Brin and Lawrence Page (Google's founders) in 1998 at Stanford.
- Algorithm of the first generation of the Google search engine.
- Paper: "The Anatomy of a Large-Scale Hypertextual Web Search Engine".
Target
- Measure the importance of a Web page based on the link structure alone.
- Assign each node a numerical score between 0 and 1: its PageRank.
- Rank Web pages based on their PageRank values.
Good references
- http://en.wikipedia.org/wiki/PageRank
- http://www.emh.co.kr/xhtml/google_pagerank_citation_ranking.html (Korean)
Background: Sergey Brin and Lawrence Page
- Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park in 1993. Currently, he is a Ph.D. candidate in computer science at Stanford University, where he received his M.S. in 1995. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data.
- Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan Ann Arbor in 1995. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human computer interaction, search engines, scalability of information access interfaces, and personal data mining.
- Google Inc.: 09/98 (google.com: 09/97)
Background: Stanford WebBase project (1996 - 1999)
- http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/
- http://dbpubs.stanford.edu:8091/diglib/
The PageRank Citation Ranking: Bringing Order to the Web
- It is a technical report (working paper): Stanford Digital Libraries SIDL-WP-1999-0120.
- From the paper: web size = 150 M web pages.
- 2005: Google claims to index more than 8 B pages.
  - http://blog.searchenginewatch.com/blog/041111-084221
  - http://www.cs.uiowa.edu/~asignori/web-size
    - Claims that the estimated size of the indexable Web was at least 11.5 billion pages as of the end of January 2005.
Background: The Philosophy of PageRank
- PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value.
- In essence, Google interprets a link from page A to page B as a vote, by page A, for page B.
Background: Scenario
- A random surfer begins at a Web page A.
- The surfer executes a random walk from A to a randomly chosen Web page that A hyperlinks to.
- Some nodes are visited more often. Intuitively, these are nodes with many links coming in from other frequently visited nodes.
Idea
- Pages visited more often in this walk are more important.
- "The rank of a page can be interpreted as the probability that a surfer will be at the page after following a large number of forward links."
Basics
- PageRank is based on the link structure of the web.
- Pages = nodes, links = edges.
- Forward links = outlinks.
- Backlinks = inlinks.
- [Figure: A and B are backlinks of C]
Basic Principles about PageRanks
- 1) A link from page A to page B is a vote from A to B.
- 2) Pages with lots of backlinks are important.
  - www.stanford.edu has 23,400 inlinks.
  - www.joe-schmoe.com has 1 inlink.
- 3) Backlinks coming from important pages convey more importance to a page.
- Combining PageRank with text-matching techniques yields highly relevant search results.
Basic Principles about PageRanks
- 3) Backlinks coming from important pages convey more importance to a page.
- [Figure: a page linked by 2 unimportant pages compared with a page linked by 2 important pages; node labels: Taher's Home Page, DB Pub Server, Sep's Home Page, CS 361, Yahoo!, CNN]
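Basic Principles: Design of Equation to get Page Importance
- $\text{importance}(i) = \sum_{j \to i} \frac{\text{importance}(j)}{\text{outlinks}(j)}$
  - importance(i): importance of page i
  - $j \to i$: pages j that link to page i
  - importance(j): importance of page j
  - outlinks(j): number of outlinks from page j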
Basic Principles: Design of Equation to get Page Importance
- [Figure: worked example with nodes Taher, Sep, DB Pub Server and CNN; values shown include 0.05, 0.25 and 0.1, with link weights 1/2 and 1]
Basic Principles: Exact Equation of PageRank
- u, v: web pages
- $B_u$: set of pages pointing (back link) to u
- $N_v$: the number of pages that v points (forward link) to
- d: damping factor
  - Probability that a user keeps clicking links in web pages.
  - Range: 0 to 1
    - 0: the user always types a URL and visits the page at that URL.
    - 1: the user only ever clicks links while surfing.
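The equation on this slide was an image, so the following is only a hedged reconstruction from the symbols defined above. It assumes the normalized form, where N (a symbol not defined on this slide) is the total number of pages, because that form agrees with the Google matrix used in the algebraic section:

```latex
PR(u) = \frac{1-d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{N_v}
```

A variant that writes the first term as $(1-d)$ without dividing by N also appears in the literature; it only rescales the scores.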
Basic Principles: Exact Equation of PageRank
- Example [figure not preserved]
Basic Principles: Iteration
- [Figures from http://www.iprcom.com/papers/pagerank/ and http://en.wikipedia.org/wiki/Pagerank]
Basic Principles: Iteration (another example)
- Initialize all nodes to rank 0.333.
- Propagate ranks across links (multiplying by link weights). [Figure values: 0.5, 0.167, 0.333, 0.167]
- Propagate ranks again across links (multiplying by link weights). [Figure values: 0.333, 0.167, 0.5, 0.167, 0.167]
- After a while... [Figure values: 0.4, 0.2]
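Basic Principles: Algorithm
- Initialize: give every page the same importance (1/N for N pages, as in the iteration example, where three nodes start at 0.333).
- Repeat until convergence: $\text{importance}(i) \leftarrow \sum_{j \to i} \frac{\text{importance}(j)}{\text{outlinks}(j)}$
  - importance(i): importance of page i
  - $j \to i$: pages j that link to page i
  - importance(j): importance of page j
  - outlinks(j): number of outlinks from page j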
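The algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own code: the function name `iterate_importance` and the three-page web are assumptions. The graph in the slides' figure is not preserved; the web used here (A links to B and C, B links to C, C links to A) was chosen because it appears consistent with the values shown in the iteration example (0.333 initially, then 0.5 / 0.167 / 0.333, settling at 0.4 / 0.2 / 0.4).

```python
def iterate_importance(outlinks, num_steps=100):
    """Repeat: importance(i) = sum over pages j linking to i of
    importance(j) / (number of outlinks from j)."""
    pages = sorted(outlinks)
    n = len(pages)
    # Initialize every page to the same importance, 1/N.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(num_steps):
        new_rank = {p: 0.0 for p in pages}
        for j, targets in outlinks.items():
            if not targets:
                continue  # a page without outlinks passes nothing on
            share = rank[j] / len(targets)
            for i in targets:
                new_rank[i] += share
        rank = new_rank
    return rank


# Hypothetical three-page web (an assumption, see the note above).
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
after_one_step = iterate_importance(web, num_steps=1)
ranks = iterate_importance(web)
```

A fixed number of steps is used instead of a convergence test to keep the sketch short; a tolerance on the change between steps would be the more usual stopping rule.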
Algebraic Interpretation
Algebraic Interpretation
Source: How Google Finds Your Needle in the Web's Haystack
- http://www.ams.org/samplings/feature-column/fcarc-pagerank
Hyperlink Matrix
- Suppose that page $P_j$ has $N_j$ links.
- If one of those links is to page $P_i$, then $P_j$ will pass on $1/N_j$ of its importance to $P_i$.
- The importance ranking of $P_i$ is the sum of the contributions from all pages $P_j$ that link to it: $I(P_i) = \sum_{P_j \to P_i} \frac{I(P_j)}{N_j}$
Algebraic Interpretation: Hyperlink Matrix
- Hyperlink matrix $H = [H_{ij}]$, in which the entry in the i-th row and j-th column is $H_{ij} = 1/N_j$ if $P_j$ links to $P_i$, and $H_{ij} = 0$ otherwise.
- Matrix H is stochastic:
  - All entries of H are nonnegative.
  - The sum of the entries in a column is one (unless the corresponding page has no links).
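As a small illustration of this definition, the sketch below builds H as a nested Python list from an adjacency dictionary. The function name `hyperlink_matrix`, the page names and the example web are assumptions made for the example, and each page is assumed to list a given target at most once.

```python
def hyperlink_matrix(outlinks, pages):
    """Return H with H[i][j] = 1/N_j if page j links to page i, else 0."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    H = [[0.0] * n for _ in range(n)]
    for source, targets in outlinks.items():
        j = index[source]
        for target in targets:
            # Page j splits its importance evenly over its N_j outlinks.
            H[index[target]][j] = 1.0 / len(targets)
    return H


# Hypothetical three-page web: A -> B, C; B -> C; C -> A.
pages = ["A", "B", "C"]
H = hyperlink_matrix({"A": ["B", "C"], "B": ["C"], "C": ["A"]}, pages)
column_sums = [sum(H[i][j] for i in range(len(pages))) for j in range(len(pages))]
```

Because every page in this example has at least one outlink, each entry of `column_sums` is one, which is the stochastic property stated above.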
Algebraic Interpretation: Stationary Vector I
- We will also form a vector $I = [I(P_i)]$ whose components are the PageRanks.
- An important condition: $I = HI$
  - The vector I is an eigenvector of the matrix H with eigenvalue 1.
    - We also call I a stationary vector of H.
  - The sum of the entries in the vector I must be one.
Algebraic Interpretation: Stationary Vector I
- With 25 billion web pages, H has about N = 25 billion rows and columns.
- However, most of the entries in H are zero; in fact, studies show that web pages have an average of about 10 links, meaning that, on average, all but 10 entries in every column are zero.
- We will choose a method known as the power method for finding the stationary vector I of the matrix H.
  - We begin by choosing a vector $I^0$,
  - then produce a sequence of vectors $I^k$ by $I^{k+1} = H I^k$.
- General principle: the sequence $I^k$ will converge to the stationary vector I.
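The power method can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the names `mat_vec` and `power_method`, the stopping tolerance, and the 3 x 3 matrix (the hyperlink matrix of a hypothetical web A -> B, C; B -> C; C -> A) are not from the slides. A dense nested list is used only for readability; at web scale the sparsity noted above would have to be exploited.

```python
def mat_vec(H, v):
    """Multiply the square matrix H (list of rows) by the column vector v."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def power_method(H, tol=1e-12, max_iter=1000):
    """Iterate I^{k+1} = H I^k from the uniform vector until the change is small."""
    n = len(H)
    current = [1.0 / n] * n
    for k in range(max_iter):
        nxt = mat_vec(H, current)
        if sum(abs(a - b) for a, b in zip(nxt, current)) < tol:
            return nxt, k + 1
        current = nxt
    return current, max_iter


# Column-stochastic hyperlink matrix of the hypothetical three-page web.
H = [
    [0.0, 0.0, 1.0],  # A receives all of C's importance
    [0.5, 0.0, 0.0],  # B receives half of A's importance
    [0.5, 1.0, 0.0],  # C receives half of A's and all of B's
]
stationary, iterations = power_method(H)
```

For this particular matrix the iteration settles on a vector satisfying I = HI; as the following slides explain, that is not guaranteed for every H.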
Algebraic Interpretation: Stationary Vector I
- [Figure: worked example, not preserved]
Algebraic Interpretation: Three Important Questions
- Does the sequence $I^k$ always converge?
- Is the vector to which it converges independent of the initial vector $I^0$?
- Do the importance rankings contain the information that we want?
The answer to all three questions is "No!" However, we'll see how to modify our method so that we can answer "yes" to all three.
Algebraic Interpretation: Problem 1: Dangling Node
- Consider the following small web consisting of two web pages. [Figure not preserved]
- The importance rating of both pages is zero, which tells us nothing about the relative importance of these pages.
- The problem is that $P_2$ has no links. Pages with no links are called dangling nodes and there are, of course, many of them in the real web.
Algebraic Interpretation: Problem 1: Dangling Node
- To solve it, we pretend that a dangling node has a link to every other page.
- This has the effect of modifying the hyperlink matrix H by replacing the column of zeroes corresponding to a dangling node with a column in which each entry is 1/N. Call the result Q.
- If A is the matrix whose entries are all zero except for the columns corresponding to dangling nodes, in which each entry is 1/N, then Q = H + A. (Every column of Q now sums to one, so Q is stochastic.)
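A small sketch of the dangling-node repair, assuming a two-page web in which $P_1$ links to $P_2$ and $P_2$ has no links (the figure is not preserved, so this web and the helper names `fix_dangling` and `iterate` are assumptions). It first shows the iteration with H collapsing to zero and then the iteration with Q = H + A keeping a total importance of one.

```python
def fix_dangling(H):
    """Return Q = H + A: every all-zero column of H is replaced by 1/N entries."""
    n = len(H)
    Q = [row[:] for row in H]
    for j in range(n):
        if all(H[i][j] == 0.0 for i in range(n)):  # column j is a dangling node
            for i in range(n):
                Q[i][j] = 1.0 / n
    return Q


def iterate(M, v, steps):
    """Apply v <- M v a fixed number of times."""
    for _ in range(steps):
        v = [sum(m * x for m, x in zip(row, v)) for row in M]
    return v


# Assumed two-page web: P1 -> P2, and P2 is dangling.
H = [[0.0, 0.0],
     [1.0, 0.0]]
collapsed = iterate(H, [0.5, 0.5], 5)    # importance leaks away through P2
Q = fix_dangling(H)
repaired = iterate(Q, [0.5, 0.5], 100)   # total importance stays at one
```

With the repair, $P_2$ ends up with twice the importance of $P_1$, which is informative in a way the all-zero result is not.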
Algebraic Interpretation: Problem 2: Smaller Sub-web
- Consider the following web. [Figure not preserved]
- Then Q and I are as follows. [Matrices not preserved]
  - The PageRanks assigned to the first four web pages are zero.
Algebraic Interpretation: Problem 2: Smaller Sub-web
- The problem: the web contains a smaller web within it (shown in the blue box in the figure).
- The matrix Q is reducible if Q can be written in block form as $Q = \begin{bmatrix} * & 0 \\ * & * \end{bmatrix}$
- If the matrix Q is irreducible, we can guarantee that there is a stationary vector I with all positive entries.
Algebraic Interpretation: Problem 2: Smaller Sub-web
- A web is called strongly connected if, given any two pages, there is a way to follow links from the first page to the second.
- Only strongly connected webs provide irreducible matrices Q.
- Clearly, the example is not strongly connected.
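Strong connectivity can be checked mechanically. The sketch below uses one standard approach (not taken from the slides): a web is strongly connected exactly when, from one arbitrary page, every page is reachable along links and also along reversed links. The function names and both example webs are assumptions.

```python
from collections import deque


def reachable(start, adjacency):
    """Return the set of pages reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_strongly_connected(outlinks):
    """True if every page can reach every other page by following links."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    if not pages:
        return True
    reverse = {page: [] for page in pages}
    for source, targets in outlinks.items():
        for target in targets:
            reverse[target].append(source)
    start = next(iter(pages))
    return reachable(start, outlinks) == pages and reachable(start, reverse) == pages


# Hypothetical webs: a 3-cycle, and a web whose pages C and D form a closed sub-web.
cycle = {"A": ["B"], "B": ["C"], "C": ["A"]}
with_sub_web = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": ["C"]}
cycle_ok = is_strongly_connected(cycle)
sub_web_ok = is_strongly_connected(with_sub_web)
```

In the second web a surfer who reaches C or D can never return to A or B, which is the situation the slide describes as a smaller web within the web.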
Algebraic Interpretation: (Revisited) Three Important Questions
- Does the sequence $I^k$ always converge?
- Is the vector to which it converges independent of the initial vector $I^0$?
- Do the importance rankings contain the information that we want?
- In order to answer "yes" to the three questions, the matrix Q should be
  - 1) Stochastic
    - All entries are nonnegative.
    - The sum of the entries in a column is one.
  - 2) Primitive
  - 3) Strongly connected
Algebraic Interpretation: Final Modification
- Two ways to surf the web
  - 1) Follow (click) links: random surf
    - The movement of the random surfer is determined by Q.
  - 2) Type a URL in the browser: randomly choose any other page
    - All pages have an equal chance of being visited by typing.
    - A new matrix $\mathbf{1}$ (the N x N matrix whose entries are all one) is used.
- Google matrix: $G = dQ + \frac{1-d}{N}\mathbf{1}$
  - G is stochastic since it is a combination of stochastic matrices.
  - G is both primitive and irreducible because all the entries of G are positive (for d < 1).
  - Therefore, G has a unique stationary vector I.
Algebraic Interpretation: Final Modification
- Google matrix: $G = d(H + A) + \frac{1-d}{N}\mathbf{1}$
  - The meaning of the parameter d
    - d = 1 ($G = H + A$): we are working only with the original hyperlink structure of the web.
    - d = 0 ($G = \frac{1}{N}\mathbf{1}$): we just type a URL and visit a page.
  - We would like to take d close to 1 so that the hyperlink structure of the web is weighted heavily in the computation.
    - Sergey Brin and Larry Page, the creators of PageRank, chose d = 0.85.
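Putting the pieces together, the power method applied to the Google matrix can be sketched without ever forming the N x N matrix. This sketch assumes $G = d(H + A) + \frac{1-d}{N}\mathbf{1}$ acting on a vector whose entries sum to one; the function name `pagerank`, the tolerance and the two-page example are assumptions, not the production algorithm.

```python
def pagerank(outlinks, d=0.85, tol=1e-12, max_iter=1000):
    """Power iteration I <- G I with G = d(H + A) + (1 - d)/N * ones."""
    pages = sorted(set(outlinks) | {t for ts in outlinks.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # A-term: importance held by dangling pages is spread over all pages.
        dangling = sum(rank[p] for p in pages if not outlinks.get(p))
        # ones-term: (1 - d)/N for every page, since the ranks sum to one.
        base = (1.0 - d) / n + d * dangling / n
        new_rank = {p: base for p in pages}
        # H-term: each page passes d * rank / N_j along each of its links.
        for j in pages:
            targets = outlinks.get(j, [])
            for i in targets:
                new_rank[i] += d * rank[j] / len(targets)
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank


# Assumed two-page web: P1 -> P2, and P2 is a dangling node.
scores = pagerank({"P1": ["P2"], "P2": []})
```

Working from adjacency lists keeps the cost of one iteration proportional to the number of links rather than to N squared, which matters because H is so sparse.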
Algebraic Interpretation: From Wikipedia
- [Figure not preserved]
Stochastic Interpretation
Stochastic Interpretation: PageRank as a Random Walk over the Web
- If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page?
- A Markov chain is a discrete-time stochastic process consisting of N states; each Web page corresponds to a state.
- A Markov chain is characterized by an N x N transition probability matrix P.
Stochastic Interpretation
- Assume a stochastic process [formula not preserved] with values in a set E, called the state space; its elements are called the states of the process.
- Assume the set E is finite or countable.
Stochastic Interpretation: Definitions
- [Slide content not preserved]
Stochastic Interpretation: Definitions
- If state i is recurrent, then it is said to be positive recurrent if, starting in state i, the expected time until the process returns to state i is finite.
- It can be shown that in a finite-state Markov chain, all recurrent states are positive recurrent.
- Positive recurrent, aperiodic states are called ergodic.
Stochastic Interpretation: Limiting Probability (Ross Book, p. 205)
- It can be shown that $\pi_j$, the limiting probability that the process will be in state j at time n, also equals the long-run proportion of time that the process will be in state j.
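The "long-run proportion of time" reading can be illustrated by simulating a random surfer and counting visits. This is a hedged sketch: the surfer model (follow a link with probability d, otherwise jump to a uniformly chosen page, and always jump from a page without links), the function name `simulate_surfer`, the seed and the three-page web are all assumptions made for the example.

```python
import random


def simulate_surfer(outlinks, d=0.85, steps=200_000, seed=0):
    """Return the fraction of steps the simulated surfer spends on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(outlinks)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        targets = outlinks[current]
        if targets and rng.random() < d:
            current = rng.choice(targets)  # click one of the page's links
        else:
            current = rng.choice(pages)    # type a URL: any page, equally likely
        visits[current] += 1
    return {p: visits[p] / steps for p in pages}


# Hypothetical three-page web: A -> B, C; B -> C; C -> A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
frequencies = simulate_surfer(web)
```

With enough steps the visit frequencies should approach the stationary probabilities of the corresponding Markov chain, up to sampling noise; every link target is assumed to appear as a key in `outlinks`.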
Stochastic Interpretation: Limiting Probability (Ross Book, p. 206)
- [Slide content not preserved]
Stochastic Interpretation: Google Matrix G
- Since the matrix Q can be reducible or periodic, the following Google matrix G must be considered to ensure that the steady-state probability exists and is unique: $G = dQ + \frac{1-d}{N}\mathbf{1}$
Stochastic Interpretation: P, the Importance Vector of Web Pages
- The initial importance is chosen according to some probability distribution $P^0 = [p_i]$.
  - $p_i$: the probability that the Markov chain is in state i at the initial time.
- $P^k$: a vector whose i-th component is the probability that the Markov chain is in state i at time k.
  - $(P^{k+1})^T = (P^k)^T G$
  - $(P^k)^T = (P^0)^T G^k$
  - $P \approx P^k$ for large enough k
  - This is the power method.
  - Brin and Page report that 50 - 100 iterations are required to obtain a sufficiently good approximation to P.
  - The calculation is reported to take a few days to complete.
- The stationary distribution P satisfies $P^T = P^T G$ (steady-state behavior).
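The update $(P^{k+1})^T = (P^k)^T G$ multiplies a row vector by G, so for $P^k$ to stay a probability distribution the rows of G must sum to one (the transpose of the column convention used in the algebraic section). The sketch below illustrates this with a hypothetical 2-state transition matrix; the function names and the numbers in `G` are assumptions chosen for the example.

```python
def step_distribution(p, G):
    """One step of (P^{k+1})^T = (P^k)^T G for a row-stochastic matrix G."""
    n = len(p)
    return [sum(p[i] * G[i][j] for i in range(n)) for j in range(n)]


def distribution_after(p0, G, k):
    """Return P^k, i.e. (P^0)^T G^k, by applying the one-step update k times."""
    p = list(p0)
    for _ in range(k):
        p = step_distribution(p, G)
    return p


# Hypothetical 2-state chain; G[i][j] is the probability of moving from i to j.
G = [[0.075, 0.925],
     [0.5, 0.5]]
from_state_0 = distribution_after([1.0, 0.0], G, 100)
from_state_1 = distribution_after([0.0, 1.0], G, 100)
```

Both starting distributions end up at the same vector, illustrating that the limit does not depend on $P^0$ when G has all positive entries.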