Authoritative Sources in a Hyperlinked Environment, Jon M. Kleinberg, JACM 1999 • Presented by Raman Adaikkalavan, Feb 23, 2005, CSE 6392 • Instructor: Dr. Gautam Das

Overview • Problem – in general • Query Types • Problems of Answering Queries • Authoritative Pages – Broad-topic queries – Iterate Method/Algorithm • Similar Page Queries • Multiple Sets of Hubs and Authorities • Diffusion and Generalization • Evaluation • Comparison • Conclusion

Problem – in general • Searching on the www for discovering pages that are relevant to a given query – Improving Quality of search

Query Types • Does Netscape support the JDK 1.1 code-signing API? – Specific queries • Find information about the Java programming language – Broad-topic queries • Find pages ‘similar’ to java.sun.com – Similar-page queries

Problems with Answering Queries • Specific queries – Scarcity problem: Very few pages that contain required information – Difficult to determine the identity of the pages • Broad-topic queries – Abundance problem: Number of pages that could reasonably be returned as relevant is far too large for a human user to digest – Select a small set of the most “authoritative” or “definitive” ones – pages that are most relevant

Authoritative Pages – Central focus • Given a query, how do we get the small set of authoritative pages corresponding to that query? • How do we accurately model authority in the context of a particular query topic? • Text-based searching/ranking – is it sufficient? Many prominent pages are not sufficiently self-descriptive. – “harvard” – www.harvard.edu – “search engines” – Yahoo, AltaVista, …? – “automobile manufacturers” – Honda, Toyota, …?

Analysis of the Link Structure • Hyperlinks encode a considerable amount of latent human judgment – can it be used to infer authority? – e.g., the creator of page p, by including a link to page q, has in some measure conferred authority on q – but a large number of links are created primarily for navigational purposes (e.g., “back” links) – links to paid advertisements – relevance and popularity • Find pages using the number of in-links – this would consider highly popular pages (e.g., Yahoo.com) as authoritative

Conferral of Authority • Model that consistently identifies relevant, authoritative www pages for broad search topics • Based on the relationship between ‘authorities’ and ‘hubs’ – Authorities: Pages that have relevant information about a given topic – Hubs: Pages that link to many related authorities

Till Now • Authoritative pages • Not only based on text • Using link analysis [Figure: the WWW queried for “Information about Java PL” (a broad-topic query)]

Can We Operate Over Entire WWW? • Specific to a query, i.e., not predefined • Computational costs – should be reduced • Analysis of the link structure: which subgraph of the www should be operated on? • All pages containing the query string? – May be over a million pages – too much computation – Some or most of the best authorities may not belong to this set

Finding Authoritative Pages • Steps – 1: Construct a focused subgraph (S ) of the www; such that • S is relatively small • S is rich in relevant pages • S contains most (or many) of the strongest authorities – 2: Compute Hubs and Authorities from the focused subgraph

Construction of Focused Subgraph [Figure: a topic is sent to a search engine; the t highest-ranked pages form the root set R; R is expanded into the set of pages S by adding forward-link pages and at most d backward-link pages]
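The expansion from root set R to S can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the toy link map `toy_web`, the function name `build_focused_subgraph`, and the choice to apply the cap d per root page (picking the first d in sorted order) are all assumptions made for the example.

```python
from typing import Dict, List, Set


def build_focused_subgraph(root: List[str],
                           out_links: Dict[str, Set[str]],
                           d: int) -> Set[str]:
    """Expand a root set R into S (hypothetical helper for illustration)."""
    # Invert the link map once so in-links can be looked up per page.
    in_links: Dict[str, Set[str]] = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)

    s = set(root)
    for page in root:
        # Forward links: every page that a root page points to.
        s |= out_links.get(page, set())
        # Backward links: at most d pages pointing to this root page.
        # sorted() only makes the toy example deterministic.
        s |= set(sorted(in_links.get(page, set()))[:d])
    return s


# Toy web: r1 and r2 play the role of the search engine's top results.
toy_web = {
    "r1": {"a", "b"},
    "r2": {"b"},
    "h1": {"r1"},
    "h2": {"r1"},
    "h3": {"r1", "r2"},
    "x": {"y"},          # unrelated pages, never reached from R
}
S = build_focused_subgraph(["r1", "r2"], toy_web, d=2)
print(sorted(S))
```

With d = 2, only two of the three pages pointing to r1 are admitted through r1, which is how the cap keeps S small when a root page is very heavily referenced.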

Offsetting Navigational Links • G[S]: the subgraph induced on the pages in S • Types of links – Transverse: between pages with different domain names – Intrinsic: between pages within the same domain name • Delete intrinsic links from G[S], resulting in a graph G • Collusion: a large number of pages from a single domain all point to a single page p, e.g., “This site is designated to…” – Offset with a parameter m (approx. 4–8): allow at most m pages from a single domain to point to any given page p
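A rough sketch of the two filters described above, assuming that a page's "domain name" can be approximated by the host part of its URL. The function names `domain` and `filter_links` and the example URLs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple
from urllib.parse import urlparse


def domain(url: str) -> str:
    # Assumption: the host part of the URL stands in for the domain name.
    return urlparse(url).netloc.lower()


def filter_links(edges: List[Tuple[str, str]], m: int) -> List[Tuple[str, str]]:
    """Drop intrinsic links and cap same-domain in-links per target at m."""
    kept: List[Tuple[str, str]] = []
    seen: Dict[Tuple[str, str], int] = defaultdict(int)  # (source domain, target) -> count
    for src, dst in edges:
        if domain(src) == domain(dst):
            continue  # intrinsic link: same domain, likely navigational
        key = (domain(src), dst)
        if seen[key] >= m:
            continue  # collusion guard: this domain already sent m links to dst
        seen[key] += 1
        kept.append((src, dst))
    return kept


edges = [
    ("http://a.com/1", "http://a.com/2"),    # intrinsic, dropped
    ("http://a.com/1", "http://b.org/"),     # transverse, kept
    ("http://c.net/p1", "http://b.org/"),    # kept
    ("http://c.net/p2", "http://b.org/"),    # kept
    ("http://c.net/p3", "http://b.org/"),    # third link from c.net, dropped when m = 2
]
filtered = filter_links(edges, m=2)
print(filtered)
```

Which of the colluding links survive depends on edge order here; any m of them would serve the same purpose.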

Computing Hubs & Authorities • Goal: Given a query find: – Good sources of content (authorities) – Good sources of links (hubs) FROM: Monika Henzinger, Hyperlink Analysis on the Web

Intuition • Authority comes from in-edges. Being a good hub comes from out-edges. • Better authority comes from in-edges from good hubs. Being a better hub comes from out-edges to good authorities. FROM: Monika Henzinger, Hyperlink Analysis on the Web

Hubs and Authorities • An iterative algorithm – with each page p, we associate • a non-negative authority weight $x^{\langle p \rangle}$ • a non-negative hub weight $y^{\langle p \rangle}$ – weights of each type are normalized so their squares sum to 1: $\sum_{p \in S} (x^{\langle p \rangle})^2 = 1$ and $\sum_{p \in S} (y^{\langle p \rangle})^2 = 1$ • Pages with larger x- and y-values are viewed as “better” authorities and hubs, respectively.

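Hubs and Authorities • If p points to many pages with large x-values, then it should receive a large y-value • If p is pointed to by many pages with large y-values, then it should receive a large x-value • I operation (in-links) updates the authority weights: $x^{\langle p \rangle} \leftarrow \sum_{q : (q, p) \in E} y^{\langle q \rangle}$ • O operation (out-links) updates the hub weights: $y^{\langle p \rangle} \leftarrow \sum_{q : (p, q) \in E} x^{\langle q \rangle}$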

Hubs and Authorities • As one applies Iterate with arbitrarily large k, the sequences $\{x_k\}$ and $\{y_k\}$ converge to fixed points x* and y* • Let G = (V, E), with $V = \{p_1, p_2, \ldots, p_n\}$, and let A denote the adjacency matrix of the graph G: the (i, j)th entry of A is 1 if $(p_i, p_j)$ is an edge of G, and is 0 otherwise. • x* is the principal eigenvector of $A^T A$, and y* is the principal eigenvector of $A A^T$ • The convergence of Iterate is quite rapid (k = 20 is sufficient)
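The Iterate procedure can be sketched as follows: each round sums hub weights over in-links to get authority weights, sums authority weights over out-links to get hub weights, and then normalizes both so their squares sum to 1. The function names, the all-ones starting weights, and the toy graph are assumptions chosen for the example, not details taken from the slides.

```python
import math
from typing import Dict, List, Tuple


def normalize(w: Dict[str, float]) -> Dict[str, float]:
    # Scale so that the squares of the weights sum to 1.
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {p: v / norm for p, v in w.items()} if norm > 0 else dict(w)


def iterate(pages: List[str],
            edges: List[Tuple[str, str]],
            k: int = 20) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (authority weights x, hub weights y) after k rounds."""
    x = {p: 1.0 for p in pages}
    y = {p: 1.0 for p in pages}
    for _ in range(k):
        # I operation: authority of p = sum of hub weights of pages linking to p.
        new_x = {p: 0.0 for p in pages}
        for q, p in edges:
            new_x[p] += y[q]
        # O operation: hub weight of p = sum of authority weights p links to.
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = iterate(pages, edges)
print(sorted(x, key=x.get, reverse=True)[:2])  # strongest authorities first
print(sorted(y, key=y.get, reverse=True)[:2])  # strongest hubs first
```

In this toy graph a1 ends up with a larger authority weight than a2 because both hubs point to it, and h1 ends up as the stronger hub because it points to both authorities.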

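Mini Web (Modified)
[Figure: pages X, Y, Z drawn once as HUBS and once as AUTHORITIES, joined by forward links and backward links]
Rows and columns are ordered X, Y, Z; entry (i, j) is 1 when page i links to page j.
$$M = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}, \qquad M^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$
$$A_i = M^T H_{i-1}, \qquad H_i = M A_i$$
$$A_i = M^T M A_{i-1}, \qquad H_i = M M^T H_{i-1}$$
SOURCE: Vagelis H, Random Walks Presentation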

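Mini Web (Modified)
$$H_i = M M^T H_{i-1}, \qquad A_i = M^T M A_{i-1}$$
$$M^T M = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}, \qquad M M^T = \begin{bmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{bmatrix}$$
[Table: hub and authority weights at iterations 1, 2, 3, …, ∞; the values are not recoverable]
• X is the best hub • With the matrices above, X and Y tie for the highest authority weight, ahead of Z
SOURCE: Vagelis H, Random Walks Presentation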
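The mini-web products and their limiting vectors can be checked numerically. The sketch below assumes the matrix M = [[1, 1, 1], [0, 0, 1], [1, 1, 0]] with rows and columns ordered X, Y, Z, and uses plain power iteration; the helper names are made up for the example.

```python
import math
from typing import List

Matrix = List[List[float]]

# Assumed link matrix: row i, column j is 1 when page i links to page j.
M: Matrix = [[1, 1, 1],
             [0, 0, 1],
             [1, 1, 0]]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def power_iterate(b: Matrix, steps: int = 50) -> List[float]:
    # Repeatedly apply b and renormalize; converges to the principal eigenvector.
    v = [1.0] * len(b)
    for _ in range(steps):
        w = [sum(bij * vj for bij, vj in zip(row, v)) for row in b]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


MT = transpose(M)
MTM = matmul(MT, M)   # drives the authority update A_i = M^T M A_{i-1}
MMT = matmul(M, MT)   # drives the hub update H_i = M M^T H_{i-1}

authority = dict(zip("XYZ", power_iterate(MTM)))
hub = dict(zip("XYZ", power_iterate(MMT)))
print(MTM, MMT)
print(authority, hub)
```

Under these matrices the iteration ranks the hubs X, then Z, then Y, and gives X and Y equal authority weight with Z below them.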

Basic Results – Broad Topic Search

Observations • Just “pure” analysis of link structure – i.e., text-based search only supplies the initial set • Pages legitimately considered authoritative in the context of the www can be found without access to a large-scale index of the www – i.e., global analysis of the full www link structure can be replaced by a local method over a small focused subgraph

Similar-Page Queries • E.g., find pages ‘similar’ to honda.com • Using link analysis to infer a notion of “similarity” among pages • We have found a page p that is of interest, and it is an authoritative page on a topic. – What do users of the WWW consider to be related to p when they create pages and links? • If p is highly referenced? – abundance problem

Similar-Page Queries • In the local region of the link structure near p, what are the strongest authorities? – They can serve as a broad-topic summary of pages related to p • Normal search with a query string: “Find t pages containing the query string” as R, and then get subgraph S • Similar-page search for a page p: “Find t pages pointing to p” as R, and then get subgraph S
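The only change for similar-page queries is how the root set R is chosen. A minimal sketch, assuming a toy link map and a made-up helper `similar_page_root_set`; a real system would obtain the pages pointing to p from a search engine's link index rather than from an in-memory dictionary.

```python
from typing import Dict, List, Set


def similar_page_root_set(p: str,
                          out_links: Dict[str, Set[str]],
                          t: int) -> List[str]:
    """Root set R for a similar-page query: up to t pages that point to p."""
    pointing = sorted(src for src, targets in out_links.items() if p in targets)
    return pointing[:t]  # sorted() only keeps the toy example deterministic


links = {
    "fan1": {"honda.com", "toyota.com"},
    "fan2": {"honda.com"},
    "dealer": {"honda.com", "ford.com"},
    "blog": {"ford.com"},
}
R = similar_page_root_set("honda.com", links, t=2)
print(R)
```

From here the root set is expanded into S and scored with the same hub and authority computation used for broad-topic queries.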

Results – Similar Page Queries

Multiple Sets of Hubs and Authorities • Broad-topic queries: the algorithm returns the most densely linked collection of hubs and authorities • Can we find several densely linked collections of hubs and authorities among the same set S of pages? • Each collection could potentially be relevant to the query topic, but they could be well separated from one another in the graph G: – The query string may have several very different meanings, e.g., “jaguar”, “java”. – The string may arise as a term in the context of multiple technical communities, e.g., “randomized algorithms”. – The string may refer to a highly polarized issue, involving groups that are not likely to link to one another, e.g., “abortion”.

Multiple Sets of Hubs and Authorities • Relevant documents can be grouped into several clusters • For broad-topic queries: x* is the principal eigenvector of $A^T A$, and y* is the principal eigenvector of $A A^T$ • Can we use the non-principal eigenvectors to extract additional densely linked collections of hubs and authorities? – Their positive and negative entries each yield a collection
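A small numerical sketch of how a non-principal eigenvector can separate two communities. Everything here is an assumption made for illustration: the toy graph (two groups of hubs and authorities joined by one bridging hub k), the helper names, and the use of power iteration with deflation to reach the second eigenvector of $A^T A$.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(b: Matrix, v: List[float]) -> List[float]:
    return [sum(bij * vj for bij, vj in zip(row, v)) for row in b]


def dominant_eigenpair(b: Matrix, steps: int = 300) -> Tuple[float, List[float]]:
    # Asymmetric start vector, so it is not orthogonal to the eigenvectors sought here.
    v = [float(i + 1) for i in range(len(b))]
    for _ in range(steps):
        w = matvec(b, v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    lam = sum(vi * wi for vi, wi in zip(v, matvec(b, v)))  # Rayleigh quotient
    return lam, v


def deflate(b: Matrix, lam: float, v: List[float]) -> Matrix:
    # Remove the component along v so the next eigenvector becomes dominant.
    n = len(b)
    return [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]


pages = ["h1", "h2", "g1", "g2", "k", "a1", "a2", "b1", "b2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"),
         ("g1", "b1"), ("g1", "b2"), ("g2", "b1"), ("g2", "b2"),
         ("k", "a1"), ("k", "b1")]          # k bridges the two communities

idx = {p: i for i, p in enumerate(pages)}
n = len(pages)
A = [[0.0] * n for _ in range(n)]
for src, dst in edges:
    A[idx[src]][idx[dst]] = 1.0
AtA = [[sum(A[h][i] * A[h][j] for h in range(n)) for j in range(n)]
       for i in range(n)]

lam1, v1 = dominant_eigenpair(AtA)
lam2, v2 = dominant_eigenpair(deflate(AtA, lam1, v1))
principal = dict(zip(pages, v1))
second = dict(zip(pages, v2))
print({p: round(second[p], 3) for p in ("a1", "a2", "b1", "b2")})
```

In this toy graph the principal eigenvector weights both groups of authorities with the same sign, while the second eigenvector gives the a-pages one sign and the b-pages the other, so its positive and negative ends pick out the two communities.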

Results – Multiple Sets of H & A

Diffusion and Generalization • Diffusion happens – if the query specifies a topic that is not sufficiently broad, there will not be enough relevant pages in G – the most relevant collection in G is then not the “densest” one – as a result, the I and O operations converge to a diffused collection of authorities corresponding to a “broader” topic – this limits the algorithm • The broader topic that supplants the original, too-specific query very often represents a natural generalization of it • This provides a simple way of abstracting a specific query topic to a broader related one.

Results – Diffusion & Generalization

Results – Diffusion & Generalization • The use of non-principal eigenvectors, combined with basic term-matching, can be a simple way to extract collections of authoritative pages that are more relevant to a specific query topic

Evaluation • 26 broad search topics, 37 users • For each topic, took the top 10 pages from AltaVista, the top five hubs and five authorities from Clever, and a random set of 10 pages from Yahoo • The results – For 31% of the topics, Yahoo and Clever received evaluations equivalent to each other – For 50%, Clever received a higher evaluation – For 19%, Yahoo received the higher evaluation

Summary • Answering broad-topic queries • Finding authoritative pages using good hubs and good authorities • Answering similar-page queries by starting with a different root set • Finding multiple sets of hubs and authorities using non-principal eigenvectors • Overcoming diffusion and generalization by using non-principal eigenvectors and basic term matching

PageRank vs. HITS • PageRank – Computation: once for all documents and queries (offline) – Query-independent: requires combination with query-dependent criteria – Hard to spam • HITS – Computation: required for each query – Query-dependent – Relatively easy to spam – Quality depends on quality of start set – Gives hubs as well as authorities FROM: Monika Henzinger, Hyperlink Analysis on the Web

PageRank vs. HITS • PageRank – [Lempel] Not rank-stable: O(1) changes in the graph can change $O(N^2)$ order-relations – [Ng, Zheng, Jordan 01] “Value”-stable: a change in k nodes (with PR values $p_1, \ldots, p_k$) results in p* s.t. … • HITS – Not rank-stable – “Value”-stability depends on the gap g between the largest and second-largest eigenvalue: a change of O(g) nodes results in p* s.t. … FROM: Monika Henzinger, Hyperlink Analysis on the Web

References/Slide Sources • Jon M. Kleinberg, “Authoritative Sources in a Hyperlinked Environment”, JACM 1999 • Monika Henzinger, “Hyperlink Analysis on the Web” • Original mini-web example: http://www.cs.fiu.edu/~vagelis/presentations/RandomWalks.ppt • Vivek B. Tawde, presentation on “Authoritative sources in a hyperlinked environment”

Conclusion • Influential paper – Citeseer – 457 citations – ACM – 115 citations • Published in the same time period as Google's PageRank algorithm

Thank You