The PageRank Citation Ranking: Bringing Order to the Web
Dr. Yingwu Zhu
Overview
• Motivation
• Related work
• PageRank & Random Surfer Model
• Implementation
• Conclusion
Motivation
• Web: heterogeneous and unstructured
• No quality control on the web
• Commercial interest in manipulating rankings
Related Work
• Academic citation analysis
• Link-based analysis
• Clustering methods of link structure
• Hubs & Authorities Model
Backlink
• Link structure of the web
• Approximation of importance / quality
PageRank
• Pages with lots of backlinks are important
• Backlinks coming from important pages convey more importance to a page
PageRank
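The formula for this slide did not survive extraction. As a sketch of what it most likely showed, the simplified ranking from the original PageRank paper is given below. The symbols follow that paper rather than the slide text: $B_u$ is the set of pages linking to $u$, $N_v$ is the number of links going out of $v$, and $c$ is a normalization constant.

```latex
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
```

Each page $v$ splits its rank evenly over its $N_v$ outgoing links, so a backlink from a highly ranked page with few outgoing links is worth the most.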
Two Problems!
• Rank sink
  – Introduce escape terms
• Dangling links
  – Dangling links are simply links that point to any page with no outgoing links
  – They do not affect the rank of any other page directly
  – Ignore them first and add them back later
Rank Sink
• A cycle of pages that some incoming link points to, with no links leaving the cycle
• Problem: the loop accumulates rank but never distributes any rank outside
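To make the rank sink concrete, here is a minimal sketch that iterates the simplified rule (each page splits its rank evenly over its outgoing links) on a hypothetical 4-page graph. The graph, the uniform start and the number of steps are illustrative assumptions, not taken from the slides.

```python
def simplified_step(ranks, out_links):
    """One step of the simplified rule: every page splits its current
    rank evenly over the pages it links to."""
    new_ranks = [0.0] * len(ranks)
    for page, targets in out_links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Hypothetical graph: pages 0 and 1 link to each other, page 1 also links
# into the loop formed by pages 2 and 3, and nothing links out of that loop.
out_links = {0: [1], 1: [0, 2], 2: [3], 3: [2]}

ranks = [0.25] * 4  # uniform start (assumption)
for _ in range(100):
    ranks = simplified_step(ranks, out_links)

sink_share = ranks[2] + ranks[3]
print("rank held by the loop:", round(sink_share, 6))
print("rank left outside the loop:", ranks[0] + ranks[1])
```

Every step moves half of page 1's rank into the loop and nothing comes back, so the rank outside the loop shrinks geometrically while the total stays at 1.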
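Escape Term
• Solution: add a rank source $E(u)$: $R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$, where $B_u$ is the set of pages linking to $u$ and $N_v$ is the number of links out of $v$
• $c$ is maximized and $\|R'\|_1 = 1$
• $E(u)$ is some vector over the web pages: uniform, favorite page, etc.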
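Matrix Notation
• $R' = c(AR' + E)$; since $\|R'\|_1 = 1$, this can be written $R' = c(A + E \times \mathbf{1})R'$, where $\mathbf{1}$ is the row vector of all ones
• $R'$ is the dominant eigenvector of $(A + E \times \mathbf{1})$, and $c$ corresponds to its dominant eigenvalue (strictly, the eigenvalue is $1/c$), because $c$ is maximized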
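Computing PageRank
- $R_0 \leftarrow S$: initialize vector over web pages
loop:
- $R_{i+1} \leftarrow A R_i$: new ranks are the sum of normalized backlink ranks
- $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$: compute normalizing factor
- $R_{i+1} \leftarrow R_{i+1} + dE$: add escape term
- $\delta \leftarrow \|R_{i+1} - R_i\|_1$: control parameter
while $\delta > \epsilon$: stop when converged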
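The loop on this slide can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is a hypothetical 4-page example, the rank source E is uniform, the start vector S is set equal to E, and the names `pagerank`, `out_links`, `eps` and `max_iter` are made up for this sketch.

```python
def pagerank(out_links, n_pages, source=None, eps=1e-10, max_iter=1000):
    """Iterate the loop from the slide until the 1-norm change is <= eps."""
    if source is None:
        source = [1.0 / n_pages] * n_pages  # uniform rank source E (assumption)
    ranks = list(source)  # initialize: start vector S, here equal to E (assumption)
    for _ in range(max_iter):
        # New ranks: each page passes rank / out-degree along each out-link.
        new_ranks = [0.0] * n_pages
        for page, targets in out_links.items():
            if targets:
                share = ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        # Normalizing factor: the rank mass lost in this step, for example
        # at pages without outgoing links (all entries are non-negative,
        # so a plain sum is the 1-norm).
        d = sum(ranks) - sum(new_ranks)
        # Escape term: hand the lost mass back according to E.
        new_ranks = [r + d * e for r, e in zip(new_ranks, source)]
        # Control parameter: 1-norm of the change between iterations.
        delta = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if delta <= eps:
            break
    return ranks


# Hypothetical graph; page 3 has no outgoing links.
out_links = {0: [1, 2, 3], 1: [2], 2: [0], 3: []}
ranks = pagerank(out_links, 4)
print([round(r, 4) for r in ranks])
```

With this form of the loop, d is non-zero only when a step loses rank mass. On a graph where every page has outgoing links, the escape term adds nothing, so a sketch that wants to break rank sinks would need an explicit source term in each step.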
Random Surfer Model
• PageRank corresponds to the probability distribution of a random walk on the web graph
• E(u) can be rephrased as: the random surfer periodically gets bored and jumps to a different page, so is not kept in a loop forever
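A small simulation can illustrate the random surfer. The sketch below counts how often a simulated surfer visits each page of a hypothetical 4-page graph. The probability of following a link (0.85), the uniform jump target, the step count and the fixed seed are all assumptions chosen for illustration.

```python
import random


def random_surfer(out_links, n_pages, steps=100_000, follow_prob=0.85, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    visits = [0] * n_pages
    page = rng.randrange(n_pages)
    for _ in range(steps):
        visits[page] += 1
        targets = out_links.get(page, [])
        if targets and rng.random() < follow_prob:
            page = rng.choice(targets)      # follow a random out-link
        else:
            page = rng.randrange(n_pages)   # bored: jump to a random page
    return [count / steps for count in visits]


# Hypothetical graph with a loop (pages 2 and 3) that has no way out.
out_links = {0: [1], 1: [0, 2], 2: [3], 3: [2]}
freqs = random_surfer(out_links, 4)
print([round(f, 3) for f in freqs])
```

Because the surfer occasionally jumps, pages 0 and 1 keep a non-zero share of visits even though the loop between pages 2 and 3 has no outgoing links.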
Implementation
• Computing resources
  – 24 million pages
  – 75 million URLs
• Memory and disk storage
  – Weight vector (4-byte float per URL)
  – Matrix A (linear access)
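The storage figures above imply a simple back-of-the-envelope estimate for the weight vector. The sketch below uses only the two numbers from the slide. Keeping two vectors in memory at once (current and next iteration) is an assumption of this sketch.

```python
BYTES_PER_WEIGHT = 4      # 4-byte float per URL, as on the slide
N_URLS = 75_000_000       # URLs in the link database, as on the slide

vector_bytes = BYTES_PER_WEIGHT * N_URLS
print(f"one weight vector: {vector_bytes / 1e6:.0f} MB")   # one weight vector: 300 MB

# Assumption: an iteration holds both the current and the next vector.
two_vectors_bytes = 2 * vector_bytes
print(f"two weight vectors: {two_vectors_bytes / 1e6:.0f} MB")
```

The link matrix A is much larger than the weight vector, which is why the slide notes that it is read with linear access rather than held in memory.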
Implementation (cont'd)
• Dealing with dangling links
  – Unique integer ID for each URL
  – Sort and remove dangling links
  – Initial rank assignment
  – Iterate until convergence
  – Add back dangling links and recompute
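The first two steps (integer IDs, then removing dangling links) can be sketched as below. The URLs are hypothetical, and `assign_ids` and `remove_dangling` are illustrative names, not functions from the original system. Removal is repeated because dropping one dangling link can leave another page without outgoing links.

```python
def assign_ids(links):
    """Map each URL to a unique integer ID, in order of first appearance."""
    ids = {}
    for src, dst in links:
        for url in (src, dst):
            ids.setdefault(url, len(ids))
    return ids


def remove_dangling(edges):
    """Repeatedly drop links whose target page has no outgoing links.

    Returns the remaining edges and the removed edges in removal order,
    so the removed links can be added back after the ranks have converged.
    """
    edges = set(edges)
    removed = []
    while True:
        has_out = {src for src, _ in edges}
        dangling = {(s, d) for s, d in edges if d not in has_out}
        if not dangling:
            return edges, removed
        removed.extend(sorted(dangling))
        edges -= dangling


# Hypothetical link list.
links = [("a.example", "b.example"), ("b.example", "a.example"),
         ("b.example", "c.example"), ("c.example", "d.example")]
ids = assign_ids(links)
edges = [(ids[s], ids[d]) for s, d in links]
core, removed = remove_dangling(edges)
print(sorted(core), removed)
```

The rank iteration would then run on `core` only. Adding the removed links back and recomputing is not shown here.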
Convergence Properties (cont'd)
• The number of iterations PageRank needs to converge grows roughly as O(log(|V|)), because the web graph G is rapidly mixing
Personalized PageRank
• Rank source E can be initialized:
  – uniformly over all pages: e.g. copyright warnings, disclaimers and mailing list archives end up with overly high ranking
  – with total weight on a single page, e.g. Netscape, McCarthy: great variation of ranks under different single pages as rank source
  – and everything in between, e.g. server root pages: allows manipulation by commercial interests
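The effect of the rank source can be sketched by running the same iteration with two different choices of E on a hypothetical 4-page graph. The sketch iterates R' = c(AR' + E) and rescales to a 1-norm of 1 after every step. The graph, the total source weight of 0.15 and the name `personalized_pagerank` are assumptions made for illustration.

```python
def personalized_pagerank(out_links, source, iterations=200):
    """Iterate R <- (A R + E), rescaled so the ranks sum to 1."""
    n = len(source)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = list(source)  # start each step from the rank source E
        for page, targets in out_links.items():
            share = ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        total = sum(new_ranks)    # rescaling plays the role of the constant c
        ranks = [r / total for r in new_ranks]
    return ranks


# Hypothetical strongly connected graph; every page has outgoing links.
out_links = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}

uniform_source = [0.15 / 4] * 4            # E spread over all pages
single_source = [0.0, 0.0, 0.0, 0.15]      # all of E on page 3

ranks_uniform = personalized_pagerank(out_links, uniform_source)
ranks_single = personalized_pagerank(out_links, single_source)
print([round(r, 3) for r in ranks_uniform])
print([round(r, 3) for r in ranks_single])
```

Putting all of E on page 3 raises that page's rank compared with the uniform source, which is the sense in which the choice of E personalizes the ranking.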
Issues
• Users are not random walkers
  – Content-based methods
• Starting point distribution
  – Actual usage data as starting vector
• Reinforcing effects / bias towards main pages
• How about traffic to ranking pages?
• No query-specific rank
• Linkage spam
  – PageRank favors pages that managed to get other pages to link to them
  – Linkage is not necessarily a sign of relevancy, only of promotion (advertisement…)
Conclusion
• PageRank is a global ranking based on the web's graph structure
• PageRank uses backlink information to bring order to the web
• PageRank can separate out representative pages as cluster centers
• A great variety of applications