COMP 5331: Web Databases
Prepared by Raymond Wong
Presented by Raymond Wong (raywong@cse)

How to rank webpages?

Ranking Methods
- HITS Algorithm
- PageRank Algorithm

HITS Algorithm
- HITS is a ranking algorithm that ranks pages as "hubs" and "authorities".
- Each page v has two weights:
  1. Authority weight a(v)
  2. Hub weight h(v)
- A good hub has many outgoing edges to good authorities.
- A good authority has many incoming edges from good hubs.
- Authority weight: $a(v) = \sum_{u \to v} h(u)$, summed over the pages u that link to v.
- Hub weight: $h(v) = \sum_{v \to u} a(u)$, summed over the pages u that v links to.
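
The two weight definitions above can be sketched in a few lines of Python. This is only an illustration, not code from the course: the function name `hits_update`, the edge-list representation, and the small graph over pages p, q, r are assumptions made for the example.

```python
def hits_update(edges, hub):
    """Apply one round of the HITS weight definitions to a directed graph.

    edges: list of (u, v) pairs meaning "page u links to page v".
    hub:   dict mapping each page to its current hub weight h(v).
    Returns (authority, new_hub) as two dicts.
    """
    pages = list(hub)
    # a(v) = sum of h(u) over all pages u that link to v
    authority = {v: 0.0 for v in pages}
    for u, v in edges:
        authority[v] += hub[u]
    # h(v) = sum of a(u) over all pages u that v links to
    new_hub = {v: 0.0 for v in pages}
    for v, u in edges:
        new_hub[v] += authority[u]
    return authority, new_hub


# Hypothetical three-page graph: p links to q and r, q links to r.
edges = [("p", "q"), ("p", "r"), ("q", "r")]
authority, hub = hits_update(edges, {"p": 1.0, "q": 1.0, "r": 1.0})
print(authority)  # r collects the most authority because both p and q link to it
print(hub)        # p is the best hub because it links to both authorities
```

In this toy graph, r ends up with the largest authority weight and p with the largest hub weight, which matches the intuition stated in the bullets.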

HITS Algorithm
- HITS involves two major steps:
  - Step 1: Sampling Step
  - Step 2: Iteration Step

Step 1: Sampling Step
- Given a user query with several terms, collect a set of highly relevant pages, called the base set.
- How to find the base set?
  - Retrieve all webpages that contain the query terms. This set of webpages is called the root set.
  - Next, find the link pages: pages that have a hyperlink to some page in the root set, or that some page in the root set links to.
  - All pages found (the root set plus the link pages) form the base set.
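
The sampling step can be sketched as follows. This is a minimal illustration under stated assumptions: the crawl is a tiny in-memory dictionary, "contains the query terms" is read as "contains every term as a whole word", and the names `build_base_set`, `pages`, and `links` are hypothetical.

```python
def build_base_set(pages, links, query_terms):
    """Return (root_set, base_set) for a query.

    pages: dict mapping a page id to its text.
    links: set of (src, dst) pairs meaning "src has a hyperlink to dst".
    Assumption: a page matches only if it contains every query term.
    """
    terms = [t.lower() for t in query_terms]
    # Root set: pages whose text contains all query terms.
    root = {
        page for page, text in pages.items()
        if all(t in text.lower().split() for t in terms)
    }
    # Link pages: pages linking into the root set, or linked from it.
    link_pages = set()
    for src, dst in links:
        if dst in root:
            link_pages.add(src)
        if src in root:
            link_pages.add(dst)
    return root, root | link_pages


# Hypothetical mini crawl.
pages = {
    "a": "web database ranking",
    "b": "cooking recipes",
    "c": "database systems",
    "d": "sports news",
    "e": "travel",
}
links = {("b", "a"), ("a", "d"), ("e", "d")}
root, base = build_base_set(pages, links, ["database", "ranking"])
print(sorted(root), sorted(base))
```

Page c mentions only one of the two terms and page e links only to a non-root page, so neither enters the base set.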

Step 2: Iteration Step
- Goal: find the pages in the base set that are good hubs and good authorities.

Example pages: N = Netscape, MS = Microsoft, A = Amazon.com

Adjacency matrix M (rows and columns ordered N, MS, A; entry (i, j) is 1 if page i links to page j):

$M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$

Hub weights:
h(N) = a(N) + a(MS) + a(A)
h(MS) = a(A)
h(A) = a(N) + a(MS)

$\begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix} = M \begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix}$

Authority weights:
a(N) = h(N) + h(A)
a(MS) = h(N) + h(A)
a(A) = h(N) + h(MS)

$\begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix} = M^T \begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix}$
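
We have $h = M a$ and $a = M^T h$, where $h$ and $a$ are the hub and authority weight vectors.
Substituting one into the other, we derive $h = M M^T h$ and $a = M^T M a$.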
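
With rows and columns ordered N, MS, A:

$M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$, $M^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$

$M M^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}$, $M^T M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$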
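
The two matrix products can be checked with a short pure-Python sketch. It assumes M is the adjacency matrix implied by the hub equations of the example (N links to N, MS, and A; MS links to A; A links to N and MS); the helper names `transpose` and `matmul` are illustrative.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


# Rows and columns ordered N, MS, A (assumed from the hub equations).
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]
MT = transpose(M)
MMT = matmul(M, MT)   # used to update the hub vector
MTM = matmul(MT, M)   # used to update the authority vector
print(MMT)
print(MTM)
```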

Hub weights are updated with $M M^T$. Each non-normalized row is computed from the previous normalized row, and each normalized row is rescaled so that the sum of all elements in the vector = 3.

Hub (non-normalized)
Iteration No.   N       MS      A
1               1       1       1
2               6       2       4
3               7       2       5
4               7.071   1.929   5.143
5               7.091   1.909   5.182
6               7.096   1.904   5.192
7               7.098   1.902   5.195

Hub (normalized)
Iteration No.   N       MS      A
1               1       1       1
2               1.5     0.5     1
3               1.5     0.429   1.071
4               1.5     0.409   1.091
5               1.5     0.404   1.096
6               1.5     0.402   1.098
7               1.5     0.402   1.098

Result: (h(N), h(MS), h(A)) = (1.5, 0.402, 1.098)
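
The hub tables can be reproduced with a small sketch. It assumes the product matrix M M^T = [[3, 1, 2], [1, 1, 0], [2, 0, 2]] for pages ordered N, MS, A, a starting vector of all ones, and rescaling to a sum of 3 after every step; the function names are illustrative.

```python
# Assumed from the example: M M^T for pages ordered N, MS, A.
MMT = [[3, 1, 2],
       [1, 1, 0],
       [2, 0, 2]]


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def normalize(vector, total=3.0):
    """Rescale the vector so that its elements sum to `total`."""
    s = sum(vector)
    return [total * x / s for x in vector]


def hub_iterations(steps=7):
    """Return a list of (raw, normalized) hub vectors, one pair per iteration."""
    raw = [1.0, 1.0, 1.0]            # iteration 1: all hub weights start at 1
    norm = normalize(raw)
    history = [(raw, norm)]
    for _ in range(steps - 1):
        raw = matvec(MMT, norm)      # update from the previous normalized vector
        norm = normalize(raw)
        history.append((raw, norm))
    return history


for i, (raw, norm) in enumerate(hub_iterations(), start=1):
    print(i, [round(x, 3) for x in raw], [round(x, 3) for x in norm])
```

Normalizing after every step only changes the scale of the vector, not the relative order of the pages, so it keeps the numbers readable without affecting the ranking.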

Authority weights are updated with $M^T M$ in the same way. Each normalized row is rescaled so that the sum of all elements in the vector = 3.

Authority (non-normalized)
Iteration No.   N       MS      A
1               1       1       1
2               5       5       4
3               5.143   5.143   3.857
4               5.182   5.182   3.818
5               5.192   5.192   3.808
6               5.195   5.195   3.805
7               5.196   5.196   3.804

Authority (normalized)
Iteration No.   N       MS      A
1               1       1       1
2               1.071   1.071   0.857
3               1.091   1.091   0.818
4               1.096   1.096   0.808
5               1.098   1.098   0.805
6               1.098   1.098   0.804
7               1.098   1.098   0.804

Result: (h(N), h(MS), h(A)) = (1.5, 0.402, 1.098) and (a(N), a(MS), a(A)) = (1.098, 1.098, 0.804)
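
Instead of fixing the number of iterations, the authority vector can be iterated until it stops changing. The sketch below assumes M^T M = [[2, 2, 1], [2, 2, 1], [1, 1, 2]] for pages ordered N, MS, A; the tolerance `tol`, the cap `max_steps`, and the function name are illustrative choices, not values from the slides.

```python
# Assumed from the example: M^T M for pages ordered N, MS, A.
MTM = [[2, 2, 1],
       [2, 2, 1],
       [1, 1, 2]]


def authority_weights(tol=1e-9, max_steps=1000):
    """Iterate a <- normalize(M^T M a) until the change is below `tol`."""
    a = [1.0, 1.0, 1.0]
    for _ in range(max_steps):
        raw = [sum(m * x for m, x in zip(row, a)) for row in MTM]
        scale = 3.0 / sum(raw)              # keep the elements summing to 3
        new_a = [scale * x for x in raw]
        if max(abs(x - y) for x, y in zip(new_a, a)) < tol:
            return new_a
        a = new_a
    return a


authority = authority_weights()
print([round(x, 3) for x in authority])
```

N and MS always receive the same authority weight here because the first two rows of M^T M are identical.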

How to Rank
Hub: (h(N), h(MS), h(A)) = (1.5, 0.402, 1.098)
Authority: (a(N), a(MS), a(A)) = (1.098, 1.098, 0.804)

Many ways:
- Rank in descending order of hub value only
- Rank in descending order of authority value only
- Rank in descending order of a value computed from both hub and authority (e.g., the sum of the hub value and the authority value)
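
The three ranking options can be compared directly on the example values. This is a minimal sketch; the helper name `rank` is illustrative, and ties are broken by the order in which the pages are listed, which is an arbitrary choice.

```python
hub = {"N": 1.5, "MS": 0.402, "A": 1.098}
authority = {"N": 1.098, "MS": 1.098, "A": 0.804}


def rank(scores):
    """Return page names in descending order of score (stable for ties)."""
    return sorted(scores, key=lambda page: scores[page], reverse=True)


by_hub = rank(hub)
by_authority = rank(authority)
# One possible combination: the sum of the hub and authority values.
combined = {page: hub[page] + authority[page] for page in hub}
by_sum = rank(combined)

print(by_hub, by_authority, by_sum)
```

On this example, ranking by hub and ranking by the sum agree, while ranking by authority places MS above A.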

PageRank Algorithm (Google)
- Disadvantage of HITS: since there are two concepts, namely hubs and authorities, it is unclear which concept is more important for ranking.
- Advantage of PageRank: PageRank involves only one concept for ranking.
- PageRank uses a stochastic approach to rank the pages.

PageRank Algorithm (Google)
N: Netscape, MS: Microsoft, A: Amazon.com
[Figure: link graph over N, MS, A and the corresponding stochastic matrix M, with rows and columns ordered N, MS, A]

PageRank
Iteration No.   N       MS      A
1               1       1       1
2               1       0.5     1.5
3               1.25    0.75    1
4               1.125   0.5     1.375
5               1.25    0.688   1.063
6               1.156   0.531   1.313
...             ...     ...     ...
33              1.20    0.60    1.20

Microsoft (MS) is quite upset with this result. Microsoft decides to link only to itself from now on.
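
The iteration behind this table can be sketched as repeated multiplication by a column-stochastic matrix. The matrix entries are not legible in the slide text, so the matrix below is an assumption: it is the one consistent with the early rows (1, 0.5, 1.5), (1.25, 0.75, 1), (1.125, 0.5, 1.375) and with the limit (1.2, 0.6, 1.2), and corresponds to N linking to itself and A, MS linking to A, and A linking to N and MS.

```python
# Assumed column-stochastic matrix, rows and columns ordered N, MS, A.
# Column j holds the shares of page j's importance passed to each page.
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 1.0, 0.0]]


def pagerank_history(matrix, steps):
    """Return the PageRank vector at iterations 1..steps (start: all ones)."""
    v = [1.0] * len(matrix)
    history = [v]
    for _ in range(steps - 1):
        v = [sum(w * x for w, x in zip(row, v)) for row in matrix]
        history.append(v)
    return history


history = pagerank_history(M, 60)
for i in (1, 2, 3, 4, 33):
    print(i, [round(x, 3) for x in history[i - 1]])
```

Because every column of the assumed matrix sums to 1, the total importance stays at 3 in every iteration.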

PageRank Algorithm (Google)
[Figure: link graph over N, MS, A after MS links only to itself, and the corresponding stochastic matrix M]

PageRank
Iteration No.   N       MS      A
1               1       1       1
2               1       1.5     0.5
3               0.75    1.75    0.5
4               0.625   2       0.375
5               0.5     2.188   0.313
...             ...     ...     ...
40              0       3       0

Microsoft (MS) is happy: it is now the most important page. The others are not happy.
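
The drift of all importance toward MS can be reproduced with the same kind of iteration. As before, the matrix is an assumption reconstructed from the table values: N links to itself and A, MS links only to itself, and A links to N and MS.

```python
# Assumed column-stochastic matrix after MS links only to itself
# (rows and columns ordered N, MS, A).
M_TRAP = [[0.5, 0.0, 0.5],
          [0.0, 1.0, 0.5],
          [0.5, 0.0, 0.0]]


def iterate(matrix, steps):
    """Return the PageRank vector after `steps` iterations (start: all ones)."""
    v = [1.0] * len(matrix)
    for _ in range(steps - 1):
        v = [sum(w * x for w, x in zip(row, v)) for row in matrix]
    return v


for steps in (2, 3, 4, 5, 40):
    print(steps, [round(x, 3) for x in iterate(M_TRAP, steps)])
```

MS keeps everything it receives and passes nothing on, so its share can only grow from one iteration to the next.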

PageRank Algorithm (Google)
- Microsoft has become a spider trap.
- Spider trap: a group of one or more pages with no links out of the group. Such a group will eventually accumulate all the importance of the web.
- How to solve it?

PageRank
Iteration No.   N       MS      A
1               1       1       1
2               1       1.4     0.6
3               0.84    1.56    0.6
4               0.776   1.688   0.536
5               0.725   1.765   0.510
...             ...     ...     ...
20              0.636   1.909   0.455

We have a more reasonable distribution of importance than before.
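
The values in this table are consistent with a "taxed" update in which each page passes on only a fraction of its importance and every page receives a fixed amount in return. The slide text does not state the formula or the factor, so the sketch below is an assumption: it uses v <- 0.8 * M v + 0.2 with the spider-trap matrix, where the factor 0.8 is inferred from the table (for example, 1.4 = 0.8 * 1.5 + 0.2). The names `taxed_pagerank` and `beta` are illustrative.

```python
# Assumed spider-trap matrix (rows and columns ordered N, MS, A).
M_TRAP = [[0.5, 0.0, 0.5],
          [0.0, 1.0, 0.5],
          [0.5, 0.0, 0.0]]


def taxed_pagerank(matrix, beta=0.8, steps=20):
    """Iterate v <- beta * M v + (1 - beta), starting from all ones.

    beta = 0.8 is an inferred value, not one stated in the slides.
    """
    v = [1.0] * len(matrix)
    for _ in range(steps - 1):
        v = [beta * sum(w * x for w, x in zip(row, v)) + (1.0 - beta)
             for row in matrix]
    return v


for steps in (2, 3, 4, 5, 20):
    print(steps, [round(x, 3) for x in taxed_pagerank(M_TRAP, steps=steps)])
```

MS still ends up with the largest share, but N and A keep a nonzero importance because every page is credited 0.2 in each iteration regardless of its links.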