The PageRank Citation Ranking: Bringing Order to the Web
Page L., Brin S., Motwani R., Winograd T.
Stanford Digital Library Technologies Project
http://dbpubs.stanford.edu/pub/1999-66
Presented by Soumya Sanyal
CSE-6392 Class Presentation
Instructor: Dr. Gautam Das
Databases and Information Retrieval

University of Texas at Arlington
Outline
Paper Citations and the Web: Motivation
PageRank: Why it should be considered?
More PageRank: Nuts and bolts
PageRank Unleashed: Looking under the hood
Convergence and Random Walks: Why does it work?
Implementation: Getting your hands dirty
Personalized PageRank: The invisible source
Applications: What wasn’t apparent already
Conclusions

Paper Citations and the Web: Motivation
Academic citations link to other well-known papers.
But they are peer reviewed and have quality control.
The web of academic documents is homogeneous in quality, usage, citation and length.
Most web pages link to other web pages as well.
The quality of a web page, though, is subjective to the user.
The importance of a page is a quantity that is not intuitively possible to capture.

Contd.
A user wants to see what is most applicable to her needs first.
The job of the retrieval system is to present the more relevant documents up front.
The notion of quality or relative importance of a web page is therefore magnified.
The average quality experienced by a user is higher than the quality of the average web page.
Notations used:
• Backlinks (inedges): links that point to a certain page
• Forward links (outedges): links that emanate from that page

PageRank: Why it should be considered?
Think of a color palette
– Colors are formed by mixing one or more colors
– The amount and intensity of each color you mix ultimately governs the color of the final mixture, not the number of colors!
Now think of a web page
– A number of back links (inedges) point to this web page
– Say one back link came from Yahoo! and another came from an obscure home page. Compare the importance of the Yahoo! page with the importance of the ‘home page’.
Now say the importance of the Yahoo! page is mapped to the amount (intensity) of one color and that of the ‘home page’ to another color.
What matters is the importance of the back links rather than their number.

More PageRank: Nuts and bolts
For any web page u, let $F_u$ be the set of pages u links to (forward links), $B_u$ the set of pages that link to u (back links), and $N_u = |F_u|$.
$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$
$R(u)$ = rank of page u; c = normalization constant
– Note: c < 1 to cover for pages with no outgoing links
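
The simplified rank definition, R(u) = c · Σ over v in B_u of R(v) / N_v, can be tried out on a toy graph. The sketch below is only an illustration under stated assumptions: the three-page graph, the function name `simplified_rank_step` and the starting ranks are hypothetical and do not come from the slides.

```python
# Hypothetical toy graph: page -> pages it links to (its forward links F_u).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def simplified_rank_step(ranks, links, c=1.0):
    """Apply R(u) = c * sum_{v in B_u} R(v) / N_v once, for every page u."""
    new_ranks = {page: 0.0 for page in links}
    for v, forward in links.items():
        if not forward:
            continue  # a page with no forward links passes no rank on
        share = ranks[v] / len(forward)  # R(v) / N_v
        for u in forward:  # v is a backlink of u
            new_ranks[u] += c * share
    return new_ranks


# Illustrative starting ranks (assumed); they sum to 1.
ranks = {"A": 0.4, "B": 0.2, "C": 0.4}
print(simplified_rank_step(ranks, links))
```

With these assumed starting ranks and c = 1, one step returns the same ranks again, so (0.4, 0.2, 0.4) is a fixed point of the update for this particular graph.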

Contd.
So what does the overall picture look like?
A is a square matrix whose rows and columns correspond to the web pages u and v.

Contd. (Matrices Revisited)
Eigenvectors and eigenvalues
Given a matrix A and a vector R over all the web pages, the dominant eigenvector is the one associated with the maximal eigenvalue.
It can be found by iterating the previous equation until the recurrence converges.
– The eigenvectors associated with one eigenvalue, together with the zero vector, form what is called its eigenspace.

Contd. (A Walk Through Example)
Let’s take an example: $A^T =$ [matrix missing]
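
Contd.
Matrix notation: $R = cAR = MR$
– R: eigenvector of A with eigenvalue $1/c$ (equivalently, an eigenvector of $M = cA$ with eigenvalue 1)
– $Ax = \lambda x \Rightarrow (A - \lambda I)x = 0$, which has a nonzero solution only when $|A - \lambda I| = 0$
R = [missing], A = [missing], Normalized = [missing]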

Contd. (Markov Chains)
Random surfer model
– Description of a random walk through the Web graph
– With M as the transition matrix, the rank of a page is interpreted as the asymptotic probability that a surfer is currently browsing that page
– This notion is fundamental to any Markovian system.
For a discrete version of the above, the following is assumed:
$R_t = M R_{t-1}$
M: transition matrix for a first-order Markov chain (stochastic)
The question is: does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
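
The recurrence R_t = M R_{t-1} can be run directly on a small example. This is a minimal sketch, assuming a column-stochastic convention in which M[u][v] = 1/N_v when page v links to page u; the graph and the helper names `transition_matrix` and `power_iterate` are hypothetical.

```python
def transition_matrix(links, pages):
    """Column-stochastic M (assumed convention): M[u][v] = 1/N_v if v links to u."""
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for v, forward in links.items():
        for u in forward:
            M[index[u]][index[v]] = 1.0 / len(forward)
    return M


def power_iterate(M, R, steps):
    """Repeat R_t = M R_{t-1} for a fixed number of steps."""
    n = len(R)
    for _ in range(steps):
        R = [sum(M[i][j] * R[j] for j in range(n)) for i in range(n)]
    return R


# Hypothetical toy graph with no sinks and no dangling pages.
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
M = transition_matrix(links, pages)

# Two different initial rank vectors, to probe dependence on the start.
from_a = power_iterate(M, [1.0, 0.0, 0.0], steps=200)
from_b = power_iterate(M, [0.0, 1.0, 0.0], steps=200)
print([round(x, 4) for x in from_a])
print([round(x, 4) for x in from_b])
```

On this graph both starting vectors approach (0.4, 0.2, 0.4), which illustrates independence from the initial ranks for a well-behaved chain. It says nothing yet about graphs with sinks or dangling pages.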

Contd. (Issues)
The above equation would converge were it not for a little problem.
This problem is called the ‘Rank Sink’ problem.
– The sink accumulates rank, but never distributes it!

Contd.
In general, many Web pages lack either backlinks or forward links.
This results in dangling links in the graph.
No parent: rank 0
– $M^T$ converges to a matrix whose last column is all zero
No children: no solution
– $M^T$ converges to the zero matrix
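
The ‘no children’ case can be observed numerically: when a page has no forward links, the rank it receives is never passed on, so the total rank in the system shrinks with every iteration. The sketch below assumes c = 1 and a hypothetical three-page graph in which page C has no forward links.

```python
def step(links, ranks):
    """One application of R(u) = sum over backlinks v of R(v) / N_v (c = 1)."""
    new = {p: 0.0 for p in links}
    for v, forward in links.items():
        for u in forward:
            new[u] += ranks[v] / len(forward)
    return new


# Hypothetical graph: C has no forward links, so rank flowing into C is lost.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": []}
ranks = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

totals = []  # total rank in the system after each iteration
for _ in range(20):
    ranks = step(links, ranks)
    totals.append(sum(ranks.values()))

print(totals[0], totals[-1])
```

The total drops from 1 to about 2/3 after the first step and keeps halving, so the ranks tend to zero instead of settling on a useful distribution.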

Contd. (More Random Surfer)
How do we escape from this?
– A: We actually ‘escape’ from it. Say a surfer is randomly clicking and hopping from one page to another. If this surfer keeps coming back to the same set of pages, she will get bored (in reality too) and try to ‘escape’ from this set of pages. Hence, we associate an ‘escape’ factor E to account for this ‘boredom’.
– How do we model this escape probability? We define E to be a vector over all the web pages that accounts for each page’s escape probability.

Contd.
Given this escape vector, how do we associate it with the original model? [equation missing]
In matrix notation: [equation missing]
It can be rewritten as: [equation missing]
Hence [equation missing], where [definition missing]
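
The equations on this slide did not survive extraction. As a hedged reconstruction, the Page et al. paper being presented defines the rank with an escape vector roughly as follows, with R' the modified rank, A the link matrix, and $\mathbf{1}$ the vector of all ones. The slide's own notation may have differed.

```latex
% Rank with the escape vector E, chosen so that c is maximized and ||R'||_1 = 1
R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)

% The same statement in matrix notation
R' = c\,(A R' + E)

% Because ||R'||_1 = 1, the product (E \times \mathbf{1}) R' equals E, so
R' = c\,(A + E \times \mathbf{1})\,R'

% Hence R' is an eigenvector of the matrix (A + E \times \mathbf{1})
```

The third line only uses the normalization: the entries of R' are non-negative and sum to 1, so multiplying the outer product $E \times \mathbf{1}$ by R' gives back E.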

PageRank Unleashed: Looking under the hood
The main algorithm: [listing missing]
• What can we say about d and $\epsilon$?
• d 1 is called the eigengap and it controls the rate of convergence
• $\epsilon$ is the convergence threshold
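
The algorithm listing itself is missing from the extracted slide. The sketch below follows the iteration given in the Page et al. paper as far as it can be reconstructed: propagate rank along links, measure the rank d lost in that step, add d·E back, and stop once the change δ falls below ε. Taking the start vector S equal to E, the toy graph, and the function name `pagerank` are assumptions made for illustration.

```python
def pagerank(links, E, epsilon=1e-10, max_iter=1000):
    """Iterate R <- A R + d E until the L1 change is at most epsilon.

    links: page -> list of pages it links to
    E:     page -> escape weight (assumed here to sum to 1)
    """
    pages = list(links)
    R = dict(E)  # R_0 <- S, with S = E as an assumption
    for _ in range(max_iter):
        # R_{i+1} <- A R_i : each page splits its rank among its forward links
        new = {p: 0.0 for p in pages}
        for v, forward in links.items():
            for u in forward:
                new[u] += R[v] / len(forward)
        # d <- ||R_i||_1 - ||R_{i+1}||_1 : rank lost in this step
        d = sum(R.values()) - sum(new.values())
        # R_{i+1} <- R_{i+1} + d E : hand the lost rank back out through E
        for p in pages:
            new[p] += d * E[p]
        # delta <- ||R_{i+1} - R_i||_1
        delta = sum(abs(new[p] - R[p]) for p in pages)
        R = new
        if delta <= epsilon:  # epsilon is the convergence threshold
            break
    return R


# Hypothetical graph in which C has no forward links.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": []}
E = {p: 1 / 3 for p in links}  # uniform escape vector (assumed)
print(pagerank(links, E))
```

In this form d E returns exactly the rank that would otherwise leak out through pages without forward links, so the total rank stays constant across iterations.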

Convergence and Random Walks: Why does it work?
Irreducible, aperiodic Markov chains with a primitive transition probability matrix
What is the issue all about?
– We need a transition matrix model that is guaranteed to converge, and that converges to a unique stationary distribution vector.

Contd.
Adding the escape vector E allows us to make the original matrix A both primitive and stochastic
– This guarantees convergence
What about the addition of new links?
– Are link analysis algorithms based on eigenvectors stable, in the sense that the results don’t change significantly?
– If the connectivity of a portion of the graph is changed arbitrarily, how will it affect the results of the algorithms?
Ng et al. (2001) IJCAI and Bianchini et al. (2002) WWW’02
• It is possible to perturb a symmetric matrix by a quantity that grows as d 1 that produces a constant perturbation of the dominant eigenvector

Contd.
Convergence experiment(s)
– Expander graphs and d 1 (every subset S has a neighborhood larger than some factor times |S|)
– Rapidly mixing random walk: convergence is guaranteed in time logarithmic in the size of the graph

Implementation: Getting your hands dirty
In 1998
– 24 million web pages
– Crawler builds an index of links
– To do this in 5 days, about 50 Web pages/second need to be crawled
– 11 is the average outdegree, so 550 links/second
– 75 million unique URLs to be compared against
– URLs are hashed to unique integer IDs
– No dangling links are kept initially
– Vector E will help with convergence issues as well
– Weights were kept for 75 million URLs @ 4 bytes/weight (300 MB)
– Access to the link database is linear since it is sorted
’99: 800 million pages; ’00: 2 billion; ’01: 4 billion
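
The figures on this slide can be checked with a few lines of arithmetic. The inputs below come from the slide; the variable names are only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Crawl rate needed to fetch 24 million pages in 5 days.
pages = 24_000_000
days = 5
pages_per_second = pages / (days * SECONDS_PER_DAY)

# Link rate, using the slide's rounded figure of 50 pages/second.
avg_outdegree = 11
links_per_second = 50 * avg_outdegree

# Memory for one 4-byte weight per URL, in megabytes (10**6 bytes).
urls = 75_000_000
bytes_per_weight = 4
weights_mb = urls * bytes_per_weight / 1_000_000

print(round(pages_per_second, 1), links_per_second, weights_mb)
# prints: 55.6 550 300.0
```

The exact crawl rate comes out slightly above the slide's rounded 50 pages/second, while the 550 links/second and 300 MB figures match.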

Personalized PageRank: The invisible source
$\|E\|_1 = 0.15$
– Web pages are valued because they exist!
– Web pages with many related links receive an overly high ranking
The other extreme
– E for just one web page
– Netscape home page and John McCarthy’s home page
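
The effect of concentrating E on a single page can be illustrated with a small experiment. This sketch uses the common damped update R ← (1 − α)·A R + α·E, which is a simplification swapped in for clarity and not necessarily the exact iteration from the paper. The value α = 0.15 is chosen to mirror ‖E‖₁ = 0.15, and the four-page graph and the function name `personalized_rank` are hypothetical.

```python
def personalized_rank(links, E, alpha=0.15, iterations=200):
    """Damped update R <- (1 - alpha) * A R + alpha * E.

    Assumes every page has at least one forward link and that E sums to 1.
    """
    pages = list(links)
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: alpha * E[p] for p in pages}
        for v, forward in links.items():
            for u in forward:
                new[u] += (1 - alpha) * R[v] / len(forward)
        R = new
    return R


# Hypothetical graph; every page has at least one forward link.
links = {"P": ["Q", "R"], "Q": ["P"], "R": ["P", "S"], "S": ["R"]}

uniform = personalized_rank(links, {p: 0.25 for p in links})
focused = personalized_rank(links, {"P": 0.0, "Q": 0.0, "R": 0.0, "S": 1.0})

print({p: round(r, 3) for p, r in uniform.items()})
print({p: round(r, 3) for p, r in focused.items()})
```

With E placed entirely on page S, the rank of S and of the pages close to it rises relative to the uniform case. This is the ‘one web page’ extreme in miniature.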

Applications: What wasn’t apparent already
Estimating Web traffic
– How PageRank corresponds to actual usage
– Internet proxy cache from NLANR compared to PageRank
– 2.6 million pages intersect with PageRank’s indexed 75 million
– Web-based email access is one plausible reason for this disparity
– People look at certain pages but never link to them
Backlink predictor
– PageRank is a better predictor of future citation counts than citation counts themselves.
– Experiment starts out with one URL and no other information
– Goal is to crawl Web pages in the order of their importance
– Importance is an evaluation function based on the citation count (number of backlinks)
– PageRank escapes local minima; citation counts get stuck in them.

Conclusions
In essence, making the importance of one page depend on the importance of its predecessors is like a ‘peer’ review.
NASDAQ, 17th February 2005: $197.41. Need I say more?