The PageRank Citation Ranking: Bringing Order to the Web
Lawrence Page and Sergey Brin, Stanford University, 1998. Presented by Ashish Chawla and Vinit Asher.
Agenda
• Introduction
• Background
• Link Structure
• Propagation of Ranking
• Simplified PageRank Calculation
• Problems in Ranking
• PageRank Definition
• Computing PageRank
• Mathematical Basics
• Implementation Details
• Convergence
• Searching with PageRank
• Personalized PageRank
• Applications
• Conclusion
Introduction
• Challenges in information retrieval on the Web
  - Large number of documents
  - Heterogeneous and unstructured
• The WWW
  - Is hypertext
  - Provides auxiliary information (other than the text of web pages)
• Objective
  - Take advantage of this link structure
Background
• Academic citations
  - Link to other well-known papers
  - Are peer reviewed
  - Have quality control
• Web pages
  - Unlike academic papers, which are fairly homogeneous in quality, usage, citations, and length, web pages vary widely
  - Quality is a subjective measure that depends on the user
  - The importance of a page is a quantity that is hard to capture intuitively
What does a user want?
• The most applicable documents first
• What is the job of a retrieval system?
  - Present the more relevant documents up front
• Notion: quality/importance of web pages
  - Difficult to classify (depends on the user)
• We deal with the overall importance of a page, rather than individual sections of the page
Link Structure
• Forward links
• Backlinks
• The Web has 150 million pages and 1.7 billion links (probably more now)
• Use the concept of citation analysis
  - Highly linked pages are more "important" than pages with few links
Propagation of Ranking
• PageRank: a page has high rank if the sum of the ranks of its backlinks is high
• Notation
  - u: a web page
  - F_u: set of pages u points to (forward links)
  - B_u: set of pages that point to u (backlinks)
  - N_u = |F_u|: number of links from u
  - c: normalization factor
• Simple ranking function: R(u) = c \sum_{v \in B_u} R(v) / N_v
Simplified Page Rank Calculation
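A minimal Python sketch of the simplified calculation, iterating R(u) = c Σ R(v)/N_v over the backlinks of u. The three-page graph, the function name `simplified_pagerank`, and the fixed iteration count are illustrative assumptions, not taken from the slides.

```python
def simplified_pagerank(links, iterations=50):
    """Iterate the simple ranking function on a small graph.

    `links` maps each page to the list of pages it points to (F_u).
    Assumption: every page has at least one outgoing link.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v in pages:
            share = rank[v] / len(links[v])  # R(v) / N_v
            for u in links[v]:
                new_rank[u] += share
        total = sum(new_rank.values())
        # The factor c rescales the ranks so that they sum to 1.
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(toy_links)
```

On this toy graph the ranks settle near 0.4 for A and C and 0.2 for B: B receives only half of A's rank, while C receives the other half plus all of B's.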
Problem in Ranking?
• Rank sink: two web pages point to each other but to no other page, and a third page points to one of them
• During iteration, this loop accumulates rank but never distributes it (since it has no outgoing edges to the rest of the graph)
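The effect of a rank sink can be seen on a small example. The sketch below is only an illustration: the five-page graph and the helper name `propagate` are assumptions. Pages X and Y link only to each other, and page C leaks rank into them.

```python
def propagate(links, rank):
    """One step of the simple ranking function, without rescaling."""
    new_rank = {p: 0.0 for p in links}
    for v, targets in links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank


# Hypothetical web: A -> B -> C -> A is a cycle, C also points to X,
# and X <-> Y form a loop with no links back out (the rank sink).
sink_links = {"A": ["B"], "B": ["C"], "C": ["A", "X"], "X": ["Y"], "Y": ["X"]}
rank = {p: 1.0 / len(sink_links) for p in sink_links}
for _ in range(200):
    rank = propagate(sink_links, rank)

sink_share = rank["X"] + rank["Y"]  # fraction of all rank trapped in the loop
```

After enough iterations nearly all of the rank sits in the X/Y loop, and A, B, and C are left with almost none.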
PageRank Definition
• Let E(u) be some vector over the web pages that corresponds to a source of rank
• Then the PageRank of a set of web pages is an assignment, R', to the web pages which satisfies
  R'(u) = c \sum_{v \in B_u} R'(v) / N_v + c E(u)
  such that c is maximized and ||R'||_1 = 1 (||R'||_1 denotes the L1 norm of R')
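Computing PageRank
  R_0 ← S                              (initialize vector over web pages)
  loop:
    R_{i+1} ← A R_i                    (new ranks: sum of normalized backlink ranks)
    d ← ||R_i||_1 − ||R_{i+1}||_1      (compute normalizing factor)
    R_{i+1} ← R_{i+1} + d E            (add escape term)
    δ ← ||R_{i+1} − R_i||_1            (control parameter)
  while δ > ε                          (stop when converged)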
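A hedged Python sketch of this loop follows. One assumption goes beyond the slide: the link weights are scaled by a damping factor (0.85 here, an illustrative value), so that each step loses some rank and the term d·E has something to redistribute. The function name, the graph, and the uniform E are also illustrative.

```python
def pagerank(links, escape, damping=0.85, epsilon=1e-10, max_iter=1000):
    """Iterate R <- A R + d E until the L1 change drops below epsilon.

    `links` maps each page to the pages it points to; `escape` is the
    vector E and is assumed to sum to 1. `damping` is an assumed parameter.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}            # R_0 <- S
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for v in pages:                                     # R_{i+1} <- A R_i
            for u in links[v]:
                new_rank[u] += damping * rank[v] / len(links[v])
        d = sum(rank.values()) - sum(new_rank.values())     # rank lost this step
        for p in pages:
            new_rank[p] += d * escape[p]                    # R_{i+1} += d E
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta <= epsilon:                                # stop when converged
            break
    return rank


# Hypothetical web containing a rank sink (X <-> Y).
sink_web = {"A": ["B"], "B": ["C"], "C": ["A", "X"], "X": ["Y"], "Y": ["X"]}
uniform_escape = {p: 1.0 / len(sink_web) for p in sink_web}
ranks = pagerank(sink_web, uniform_escape)
```

With the escape term in place the loop X/Y still ranks highly, but A, B, and C keep a nonzero share instead of draining to zero.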
Random Surfer Model
• A random surfer
• Clicks on links at random
• The "surfer" periodically gets bored
Solution to Random Surfer Model
• Escape term: E(u) can be thought of as the random surfer periodically getting bored and jumping to a different page, rather than staying in a loop forever
• We take E to be a vector over all the web pages that accounts for each page's escape probability (a user-defined parameter)
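The random surfer can also be simulated directly. The sketch below is an illustration under stated assumptions: the surfer gets bored with a fixed probability (`bored=0.15` is an assumed value), a bored surfer jumps to a uniformly random page (a uniform E), and the graph and function name are hypothetical.

```python
import random


def random_surfer(links, steps=100_000, bored=0.15, seed=0):
    """Estimate page importance as the fraction of time spent on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < bored or not links[current]:
            current = rng.choice(pages)           # bored: jump to a random page
        else:
            current = rng.choice(links[current])  # otherwise follow a random link
    return {p: n / steps for p, n in visits.items()}


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
visit_share = random_surfer(toy_links)
```

The visit frequencies approximate the ranks the iterative computation would produce for the same graph and the same escape behaviour; B, which has a single backlink shared with C, is visited least.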
Another Problem: Dangling Links
• What are dangling links?
  - Links that point to any page with no outgoing links
  - Links to pages that have not been downloaded yet
• Why is this a problem?
  - We don't know how to distribute their weight
• What do we do?
  - Remove them from the system until the ranks have been calculated
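One way to sketch the removal step in Python is shown below. Removing a dangling page can leave another page without outgoing links, so this version repeats until none remain; that repetition, the function name, and the example graph are assumptions made for illustration.

```python
def remove_dangling(links):
    """Repeatedly drop pages with no outgoing links.

    Returns the remaining link structure and the removed pages in
    removal order. Link targets that were never crawled are treated
    as pages with no outgoing links.
    """
    links = {p: list(targets) for p, targets in links.items()}
    for targets in list(links.values()):
        for t in targets:
            links.setdefault(t, [])  # uncrawled target: no known out-links
    removed = []
    while True:
        dangling = {p for p, targets in links.items() if not targets}
        if not dangling:
            return links, removed
        removed.extend(sorted(dangling))
        links = {
            p: [t for t in targets if t not in dangling]
            for p, targets in links.items()
            if p not in dangling
        }


# Hypothetical crawl: D has no out-links and E was never downloaded.
crawl = {"A": ["B", "C"], "B": ["A", "D"], "C": ["E"], "D": []}
core, removed = remove_dangling(crawl)
```

Here D and E are removed first, which leaves C without outgoing links, so C is removed in the next pass and only A and B remain.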
Mathematical Basics
• What are eigenvectors and eigenvalues?
  - Given a vector v in an n-dimensional vector space, we can linearly transform it using a transformation matrix A. The transformed vector is Av.
  - An eigenvector is a vector that is scaled by a linear transformation, but whose direction is not changed. The scaling factor is the eigenvalue.
  - A matrix can have several eigenvalues, and eigenvectors are unique only up to scaling.
  - We can compute them from Ax = λx, where λ is an eigenvalue of A and x is the corresponding eigenvector.
  - An eigenvector is a vector that "points" in the same direction (has invariant direction cosines) under some transform. The eigenvalue is a number that describes how the magnitude of the eigenvector is scaled by the transform.
Mathematical Basics
• A is a square matrix whose rows and columns correspond to web pages u and v
• Given that A is a matrix and R is a vector over all the web pages, the ranking we want is the dominant eigenvector of A, the one associated with the maximal eigenvalue
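The dominant eigenvector can be approximated by power iteration: multiply repeatedly by the matrix and rescale. The sketch below is a minimal illustration, assuming a small nonnegative matrix (as in PageRank) and L1 normalization; the 2×2 matrix and the function name are hypothetical.

```python
def power_iteration(matrix, iterations=100):
    """Approximate the dominant eigenvalue and eigenvector of `matrix`.

    Assumes a square matrix with nonnegative entries, given as a list of
    rows, whose largest eigenvalue is strictly bigger than the others.
    """
    n = len(matrix)
    x = [1.0] + [0.0] * (n - 1)           # arbitrary start with ||x||_1 = 1
    eigenvalue = 0.0
    for _ in range(iterations):
        y = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(abs(v) for v in y)  # ||A x||_1, since ||x||_1 = 1
        x = [v / eigenvalue for v in y]      # rescale back to L1 norm 1
    return eigenvalue, x


# Hypothetical matrix with eigenvalues 3 and 1.
value, vector = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```

For this matrix the iteration settles on the eigenvalue 3 with eigenvector (0.5, 0.5), which satisfies Ax = λx.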
Example
A^T = [matrix not preserved]
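Example (contd.)
• R = c A R = M R, so R is an eigenvector of A (with eigenvalue 1/c, since A R = (1/c) R)
• A x = λ x, so (A − λI) x = 0, which has a nonzero solution x only when |A − λI| = 0
• A = [matrix not preserved], R = [vector not preserved], normalized R = [vector not preserved]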
Implementation
• The web crawler keeps a database of URLs so that it can discover all URLs on the Web
• To implement PageRank, the web crawler builds an index of the URLs as it crawls
• Problems?
  - Infinitely large sites
  - Incorrect HTML
  - Sites that are down
  - The Web is always changing
PageRank Implementation
• Convert each URL into a unique integer ID
• Sort the link structure by the IDs
• Remove dangling links
• Make an initial assignment of ranks and iterate until convergence
• Add the dangling links back
• Iterate the process again to assign weights to all dangling links
• The link database, A, is normally kept in RAM
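The first two steps above can be sketched as follows. This is an illustration only: the function name, the in-memory dictionary, and the example.com URLs are assumptions, and IDs are simply handed out in order of first appearance.

```python
def build_link_database(url_links):
    """Map URLs to integer IDs and return the links as sorted ID pairs."""
    url_to_id = {}

    def get_id(url):
        # Assign the next unused integer the first time a URL is seen.
        if url not in url_to_id:
            url_to_id[url] = len(url_to_id)
        return url_to_id[url]

    edges = []
    for source, targets in url_links.items():
        source_id = get_id(source)
        for target in targets:
            edges.append((source_id, get_id(target)))
    edges.sort()  # sorted by source ID, then target ID
    return url_to_id, edges


# Hypothetical crawl result.
crawl = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/"],
}
url_ids, edge_list = build_link_database(crawl)
```

Keeping the edges sorted means each pass of the rank computation can read the link structure sequentially.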
Convergence Properties
• PageRank will scale very well for large collections, as the scaling factor is roughly linear in log n
Convergence Properties
• Here we interpret the Web as an expander-like graph
• A graph is said to be an expander if every subset of nodes S has a neighborhood that is larger than some factor α times |S|
• Mathematically, a graph is an expander if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue
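The eigenvalue gap matters because of a standard property of power iteration, stated here as background rather than as a result from the slides. With λ_1 and λ_2 denoting the largest and second-largest eigenvalues in magnitude, R_k the rank vector after k iterations, and R* the limit (symbols introduced here for illustration), the error shrinks roughly geometrically:

```latex
\| R_k - R^{*} \|_1 = O\!\left( \left| \frac{\lambda_2}{\lambda_1} \right|^{k} \right)
```

The larger the gap between λ_1 and λ_2, the smaller this ratio and the fewer iterations are needed.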
Searching with PageRank
• Two search engines were implemented using PageRank
  - Title-based search engine
    - Matches titles of web pages with the given query
    - Ranks the results using PageRank
    - Works well for general queries having a large result set
  - Full-text search engine (Google)
    - Scans the entire document for a match with the given query
    - Performs rank merging
Types of Results
• Information-based result
  - Finds a site that contains a great deal of information
  - Propagates the textual matching score through the link structure
• Common-case result
  - The most commonly used site (often commercial) relevant to the search query
  - PageRank gives a good representation of the common case
Personalized PageRank
• The E vector
  - Corresponds to a distribution over web pages
  - Provides flexibility in adjusting PageRanks
• A uniform E causes highly linked web pages to achieve a very high ranking
• A single-page E causes important pages that are not related to that homepage to receive a low PageRank
• An E consisting of the root-level pages of all web servers is a good compromise between a uniform E and a single-page E
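The effect of the choice of E can be seen on a small symmetric graph. The sketch below is illustrative only: it uses the common damping-factor form of the iteration (0.85 is an assumed value), and the four-page graph and function name are hypothetical.

```python
def personalized_pagerank(links, escape, damping=0.85, iterations=200):
    """Iterate R <- damping * A R + (1 - damping) * E.

    Assumes every page has at least one outgoing link and that the
    entries of `escape` (the vector E) sum to 1.
    """
    rank = dict(escape)  # start from E itself
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * escape[p] for p in links}
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank


# Hypothetical symmetric web: A <-> B and C <-> D, joined by B -> C and D -> A.
web = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C", "A"]}

uniform_e = {p: 1.0 / len(web) for p in web}
single_e = {p: 0.0 for p in web}
single_e["A"] = 1.0  # all escape probability on page A

uniform_ranks = personalized_pagerank(web, uniform_e)
personal_ranks = personalized_pagerank(web, single_e)
```

Under the uniform E, A and C rank equally by symmetry. With E concentrated on A, the rank of A rises and the rank of C falls, which gives a view of the graph from A's perspective.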
Applications
• Estimating web traffic
  - By looking at differences between PageRank and actual usage statistics, it is possible to find things that people often look at but do not want to link to from their web pages
• Backlink predictor
  - Citation counts tend to get stuck in local collections of web pages
  - Using the random surfer model, PageRank quickly finds the site homepage and gives preference to its children, resulting in an efficient, broad search
  - Hence PageRank potentially acts as a better backlink predictor, since it builds up information about the entire website faster
Other Applications
• Spam detection and prevention
• Sorting backlinks by their importance
Issues
• Users are not random walkers
• Starting point distribution (actual usage data as the starting vector)
• Bias towards main pages
• Linkage spam
• No query-specific rank
Conclusion
• PageRank is a global ranking of all web pages, regardless of their content, based solely on their location in the Web's graph structure
• PageRank can be used to separate out a small set of commonly used documents
• The full database is consulted only when this small set is not adequate to answer a query
• Personalized PageRank can be used to create a view of the Web from a particular user's perspective
Google Architecture