Improved Algorithms for Topic Distillation in a Hyperlinked Environment (ACM SIGIR '98)
Ruey-Lung Hsiao, Nov 23, 2000

Topic Distillation on the WWW
• Definition
  – Given a typical user query, find quality documents related to the query topic.
• Characteristics
  – More general than finding a precise query match
  – Not as ambitious as trying to exactly satisfy the user's information need
  – When the query is ambiguous, it should return relevant documents for (some of) the main query topics.

Related Research
• HITS [1]
• Topic Distillation [2]
• Related Page [3]
• Web Community [4]
• Reputation [5]
1. Authoritative sources in a hyperlinked environment, '97
2. Improved Algorithms for Topic Distillation in a Hyperlinked Environment, '98
3. Finding Related Pages in the World Wide Web, '99
4. Inferring Web Communities from link topology, '98
5. What is this page known for? Computing Web Page Reputations, '00

HITS (Hyperlink Induced Topic Search)
• Algorithm
  – Start with a root set S
    • S is relatively small (typically up to 200 pages)
    • S is rich in relevant pages
    • S contains most (or many) of the strongest authorities.
  – Recursively compute the degree of authority and hub for each element, where a(p) is the authority score of page p, h(p) is its hub score, and q → p means that q links to p:
    $a(p) = \sum_{q \to p} h(q)$
    $h(p) = \sum_{p \to q} a(q)$
  – [Figure: set S and set T]
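The two update rules can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names `hits` and `normalize`, the edge-list input format, the fixed iteration count, and the Euclidean normalization after each round are all assumptions made for the example.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (assumed normalization)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=20):
    """Basic HITS on a directed graph given as (source, target) pairs."""
    nodes = {n for edge in edges for n in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all q that link to p
        new_auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all q that p links to
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: two hub pages pointing at two candidate authorities.
edges = [("h1", "a"), ("h2", "a"), ("h1", "b")]
auth, hub = hits(edges)
```

In this toy graph, page `a` is linked from both hubs and ends up with a higher authority score than `b`, while `h1` links to both authorities and ends up with a higher hub score than `h2`.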

HITS (Hyperlink Induced Topic Search)
• Premises
  – The implicit annotation that human creators provide through hyperlinks contains sufficient information to infer authority.
  – Sufficiently broad topics contain embedded communities of hyperlinked pages.
• Problems
  – Mutually reinforcing relationships: certain arrangements of documents "conspire" to dominate the computation.
  – Automatically generated links: no human opinion is expressed by the link.
  – Non-relevant documents: the graph contains documents not relevant to the query topic.

Improved Algorithm
• Improved Connectivity Analysis
  – A set of documents in a mutually reinforcing relationship should have the same influence on a single document as one document would.
    $a(p) = \sum_{q \to p} h(q) \times \mathrm{auth\_wt}(q, p)$
    $h(p) = \sum_{p \to q} a(q) \times \mathrm{hub\_wt}(p, q)$
• Pruning Nodes from the Neighborhood Graph
  – $\mathrm{similarity}(Q, D_j) = \frac{\sum_{i=1}^{t} w_{iq} \times w_{ij}}{\sqrt{\sum_{i=1}^{t} w_{iq}^2 \times \sum_{i=1}^{t} w_{ij}^2}}$
  – Relevance threshold:
    • Median Weight
    • Start Set Median Weight
    • Fixed Fraction of Maximum Weight
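The weighted updates and the similarity-based pruning can be sketched as follows. The slide does not say how auth_wt and hub_wt are chosen; this sketch assumes the fractional scheme in which k edges from one host into the same document each get authority weight 1/k, and l edges from one document into the same host each get hub weight 1/l. All function names, the URL-based `host` helper, and the dictionary-based term vectors are hypothetical, and only the median-weight threshold is shown.

```python
from collections import Counter
from math import sqrt
from statistics import median
from urllib.parse import urlparse


def host(url):
    """Host part of an absolute URL (assumes URLs carry a scheme)."""
    return urlparse(url).netloc


def edge_weights(edges):
    """Assumed scheme: auth_wt = 1/k, hub_wt = 1/l (edges must be unique)."""
    into_doc = Counter((host(q), p) for q, p in edges)
    into_host = Counter((q, host(p)) for q, p in edges)
    auth_wt = {(q, p): 1.0 / into_doc[(host(q), p)] for q, p in edges}
    hub_wt = {(q, p): 1.0 / into_host[(q, host(p))] for q, p in edges}
    return auth_wt, hub_wt


def imp_step(edges, auth, hub, auth_wt, hub_wt):
    """One unnormalized round of the weighted authority and hub updates."""
    new_auth = dict.fromkeys(auth, 0.0)
    for q, p in edges:
        new_auth[p] += hub[q] * auth_wt[(q, p)]
    new_hub = dict.fromkeys(hub, 0.0)
    for p, q in edges:
        new_hub[p] += new_auth[q] * hub_wt[(p, q)]
    return new_auth, new_hub


def cosine_similarity(query_vec, doc_vec):
    """similarity(Q, D_j) over sparse {term: weight} vectors."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    norm = sqrt(sum(w * w for w in query_vec.values())
                * sum(w * w for w in doc_vec.values()))
    return dot / norm if norm else 0.0


def prune_below_median(relevance):
    """Keep nodes whose relevance weight reaches the median weight."""
    threshold = median(relevance.values())
    return {node for node, w in relevance.items() if w >= threshold}
```

With these weights, two pages on the same host that both link to one document contribute as much authority in total as a single page would, which is the effect the first bullet asks for.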

Partial Content Analysis
• Selectively analyze, and prune if needed, the nodes that are most influential in the outcome.
• Query Q formation (uses 30 documents)
  – Heuristic: in_degree + 2*num_query_matches + has_out_links
• Pruning
  – Degree Based Pruning
    • Use 4*in_degree + out_degree as a measure of influence
    • Fetch the top 100 nodes, score them against Q, and prune them if needed.
  – Iterative Pruning
    • Use connectivity analysis itself (imp) to select nodes to prune.
    • Pruning happens over a sequence of rounds; each round runs imp for 10 iterations to get a ranked list.
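Degree based pruning can be sketched as below. The two scoring formulas come from the slide; everything else is an assumption for illustration: the function names, the edge-list graph format, and the `relevance_of` callable, which stands in for fetching a page and scoring it against Q.

```python
from collections import Counter


def start_doc_score(in_degree, num_query_matches, has_out_links):
    """Heuristic from the slide for picking the documents that form Q."""
    return in_degree + 2 * num_query_matches + int(has_out_links)


def influence_scores(edges):
    """4*in_degree + out_degree for every node in the edge list."""
    in_deg = Counter(target for _, target in edges)
    out_deg = Counter(source for source, _ in edges)
    nodes = set(in_deg) | set(out_deg)
    return {n: 4 * in_deg[n] + out_deg[n] for n in nodes}


def degree_based_prune(edges, relevance_of, threshold, top_n=100):
    """Score only the top_n most influential nodes and drop the weak ones.

    relevance_of is a hypothetical callable: node -> relevance weight.
    """
    scores = influence_scores(edges)
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_n]
    pruned = {n for n in candidates if relevance_of(n) < threshold}
    return [(s, t) for s, t in edges if s not in pruned and t not in pruned]


# Hypothetical usage: node "x" is influential but off-topic, so it is pruned.
edges = [("a", "x"), ("b", "x"), ("x", "y")]
relevance = {"a": 1.0, "b": 1.0, "x": 0.0, "y": 1.0}
remaining = degree_based_prune(edges, relevance.get, threshold=0.5, top_n=1)
```

Only the `top_n` candidates are ever passed to `relevance_of`, which is the point of partial content analysis: content is fetched for a bounded number of nodes rather than for the whole graph.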

Evaluation
• Average precision at top 5 and top 10 ranked authority documents
  – Column groups: Without Regulation, With Regulation, Partial
  – Algorithms: base, imp, med, start, max, pca0, pca1
  – Query sets: All, Rare, Popular, each at 5 and at 10
  – Values: 0.52 0.66 0.73 0.65 0.69 0.67 0.72 0.65 0.70 0.72 0.75 0.46 0.58 0.65 0.66 0.62 0.65 0.64 0.67 0.24 0.36 0.64 0.48 0.55 0.60 0.48 0.60 0.18 0.24 0.50 0.43 0.44 0.48 0.54 0.48 0.44 0.64 0.36 0.55 0.60 0.68 0.64 0.55 0.60 0.68 0.88 0.40 0.54 0.57 0.70 0.60 0.58 0.60 0.62 0.64 0.68 0.80
  – 26% 36%
• Average precision at top 5 and top 10 ranked hub documents
  – Column groups: Without Regulation, With Regulation, Partial
  – Algorithms: base, imp, med, start, max, pca0, pca1
  – Query sets: All, Rare, Popular, each at 5 and at 10
  – Values: 0.60 0.74 0.87 0.78 0.75 0.80 0.87 0.77 0.81 0.80 0.56 0.73 0.79 0.70 0.73 0.76 0.81 0.69 0.76 0.74 0.71 0.44 0.64 0.88 0.72 0.60 0.88 0.80 0.56 0.72 0.46 0.60 0.76 0.60 0.64 0.76 0.80 0.66 0.76 0.53 0.63 0.48 0.80 0.72 0.80 1.00 0.68 0.42 0.68 0.74 0.72 0.68 0.70 0.74 0.60 0.76 0.54
  – 23% 33%
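The metric can be sketched as follows. This example assumes that "average precision at top 5 and 10" means precision at rank k averaged over queries, with binary relevance judgments; the function names and the input format are hypothetical, and the example uses made-up document IDs rather than the data behind the figures above.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k ranked documents that were judged relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision_at_k(runs, k):
    """Mean of precision_at_k over (ranked_list, relevant_set) pairs."""
    return sum(precision_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Hypothetical judgments for two queries.
runs = [
    (["d1", "d2", "d3", "d4", "d5"], {"d1", "d3"}),
    (["d6", "d7", "d8", "d9", "d10"], {"d6", "d7", "d8", "d9"}),
]
score_at_5 = average_precision_at_k(runs, 5)
```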

Finding Related Pages in the WWW
• Appears in the 8th WWW Conference
• Definition
  – A related web page is one that addresses the same topic as the original page.
  – For example, www.washingtonpost.com is a page related to www.nytimes.com.
• Algorithms
  – Companion algorithm: derived from HITS.
  – Cocitation algorithm: finds pages that are frequently cocited with the input URL u.
• Evaluation
  – The two proposed algorithms are 73% and 51% better, respectively, than Netscape's "What's Related".

Companion Algorithm
• Takes as input a URL u and consists of four steps:
  – Build a vicinity graph for u.
  – Contract duplicates and near-duplicates in this graph.
  – Compute edge weights based on host-to-host connections.
  – Compute hub/authority scores.

Cocitation Algorithm
• Degree of co-citation
  – The number of common parents of two nodes.
  – [Figure: sibling set of u]
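The degree of co-citation can be computed directly from a link list. This is a minimal sketch under assumptions: the function name `cocitation_ranking`, the (parent, child) edge format, and the `top_n` cutoff are hypothetical, and the sketch ranks every sibling of u without any limit on how many parents or siblings are examined.

```python
from collections import Counter


def cocitation_ranking(edges, u, top_n=10):
    """Rank siblings of u by the number of parents they share with u."""
    parents_of = {}
    children_of = {}
    for parent, child in edges:
        parents_of.setdefault(child, set()).add(parent)
        children_of.setdefault(parent, set()).add(child)
    degree = Counter()
    for parent in parents_of.get(u, ()):
        # Every other child of this parent is co-cited with u once more.
        for sibling in children_of[parent]:
            if sibling != u:
                degree[sibling] += 1
    return degree.most_common(top_n)


# Hypothetical link list: p1 and p2 both link to u.
edges = [("p1", "u"), ("p1", "s1"), ("p2", "u"), ("p2", "s1"), ("p2", "s2")]
related = cocitation_ranking(edges, "u")
```

Here `s1` shares two parents with `u` and `s2` shares one, so `s1` ranks first.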