Ranking: Link-based Ranking (2nd generation). Reading 21
Sec. 21.1: The Web as a Directed Graph
[Figure: Page A links to Page B through a hyperlink with anchor text.]
- Assumption 1: a hyperlink between pages denotes author-perceived relevance (a quality signal).
- Assumption 2: the text in the anchor of the hyperlink describes the target page (textual context).
Sec. 21.1.1: Indexing anchor text
- When indexing a document D, include the anchor text from links pointing to D.
- Example: www.ibm.com is the target of links from pages containing the following text:
  - "Armonk, NY-based computer giant IBM announced today"
  - "Joe's computer hardware links: Sun, HP, IBM"
  - "Big Blue today announced record profits for the quarter"
- Anchor text can be scored with a weight that depends on the authority of the anchor page's website.
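The indexing rule above can be sketched in a few lines. This is a minimal illustration, not a production indexer: the function name `build_index`, the whitespace tokenizer, and the `.example` URLs are assumptions made for this example, and the authority-based weighting of anchor text is left out.

```python
from collections import defaultdict

def build_index(pages, links):
    """Inverted index over page bodies plus the anchor text of in-links.

    pages: dict mapping url -> body text
    links: list of (source_url, target_url, anchor_text) triples
    """
    index = defaultdict(set)
    # Index each page under the terms of its own body.
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    # Index each link TARGET under the terms of the anchor text.
    for _source, target, anchor in links:
        for term in anchor.lower().split():
            index[term].add(target)
    return index

# Hypothetical toy collection (URLs other than www.ibm.com are made up).
pages = {
    "www.ibm.com": "welcome to our home page",
    "joes-hardware.example": "computer hardware links Sun HP IBM",
}
links = [
    ("joes-hardware.example", "www.ibm.com", "IBM"),
    ("news.example", "www.ibm.com", "Big Blue"),
]
index = build_index(pages, links)
```

With this index, a query for "blue" retrieves www.ibm.com even though the page body never contains that word.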
Random Walks Paolo Ferragina Dipartimento di Informatica Università di Pisa
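Definitions
- Adjacency matrix A: A(i, j) = 1 if there is an edge from node i to node j, and 0 otherwise.
- Transition matrix P: each row of A is normalized to sum to 1, so P(i, j) is the probability of moving from node i to node j.
- Any edge weighting is possible. Any proposals?
[Figure: a 3-node directed graph with its adjacency matrix A and its transition matrix P; the edges carry probabilities 1, 1, 1/2 and 1/2.]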
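A minimal sketch of how P can be obtained from A by row normalization. The 3-node graph used here is a hypothetical example (it is not necessarily the graph drawn on the slide), and `transition_matrix` is a name chosen for this illustration.

```python
def transition_matrix(adj):
    """Row-normalize an adjacency (or weight) matrix into a transition matrix."""
    P = []
    for row in adj:
        total = sum(row)
        if total == 0:
            # A node without out-links has no well-defined next step.
            raise ValueError("node with no out-links: row cannot be normalized")
        P.append([w / total for w in row])
    return P

# Hypothetical graph with edges 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1.
A = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
P = transition_matrix(A)
```

Because the function only divides each row by its sum, it also works for weighted edges: any non-negative weighting yields a valid transition matrix.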
What is a random walk
[Figure: four snapshots of a walker on the 3-node graph, at t=0, t=1, t=2 and t=3; the edges are labeled with the transition probabilities 1, 1, 1/2 and 1/2.]
Probability Distributions
- $x_t(i)$ = probability that the surfer is at node i at time t.
- $x_{t+1}(i) = \sum_j (\text{probability of being at node } j) \cdot \Pr(j \to i) = \sum_j x_t(j)\, P(j, i)$, that is, $x_{t+1} = x_t P$.
- Unrolling the recurrence: $x_{t+1} = x_t P = x_{t-1} P^2 = x_{t-2} P^3 = \dots = x_0 P^{t+1}$.
- What happens when the surfer keeps walking for a long time? This leads to the so-called stationary distribution.
[Figure: on the 3-node graph at t=2, the row vector $x_t = (0, 0, 1)$ is multiplied by the transition matrix P to give $x_{t+1}$.]
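One step of the recurrence $x_{t+1} = x_t P$ can be sketched as a row-vector times matrix product. The transition matrix below belongs to a hypothetical 3-node graph (edges 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1), chosen only for illustration, and `step` is an assumed helper name.

```python
def step(x, P):
    """Return x * P, i.e. the distribution after one more step of the walk."""
    n = len(P)
    return [sum(x[j] * P[j][i] for j in range(n)) for i in range(n)]

# Hypothetical transition matrix (each row sums to 1).
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
x0 = [0.0, 0.0, 1.0]   # the surfer starts at node 2 with certainty
x1 = step(x0, P)       # distribution at t = 1
x2 = step(x1, P)       # distribution at t = 2, equal to x0 * P^2
```

Starting from node 2, the mass splits evenly between nodes 0 and 1 after one step, and then moves on to nodes 1 and 2.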
Stationary Distribution
- The stationary distribution at a node is related to the proportion of time a random walker spends visiting that node.
- It is reached when the distribution no longer changes: $x_{T+1} = x_T P = 1 \cdot x_T$, i.e. $x_T$ is a left eigenvector of P with eigenvalue 1.
- For "well-behaved" graphs it does not depend on the start distribution $x_0$ in $x_{t+1} = x_0 P^{t+1}$.
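A minimal sketch of finding the stationary distribution by simply iterating $x \leftarrow x P$ until it stops changing. The graph is a hypothetical irreducible and aperiodic 3-node example; the function name, tolerance and iteration cap are assumptions made for this illustration.

```python
def stationary(P, start=None, tol=1e-12, max_iter=10_000):
    """Iterate x <- x * P until the largest change per node drops below tol."""
    n = len(P)
    x = list(start) if start is not None else [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(x[j] * P[j][i] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x  # not converged within max_iter (can happen on a periodic chain)

# Hypothetical graph with edges 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1.
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
pi_uniform = stationary(P)                           # uniform start
pi_from_node2 = stationary(P, start=[0.0, 0.0, 1.0]) # start at node 2
```

For this graph both runs end at the same vector, (0.2, 0.4, 0.4), which can be checked by hand against $\pi = \pi P$.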
Interesting questions
- Does a stationary distribution always exist? Is it unique?
  - Yes, if the graph is "well-behaved", namely the Markov chain is irreducible and aperiodic.
- How fast will the random surfer approach this stationary distribution?
  - Mixing time!
Well-behaved graphs
- Irreducible: there is a path from every node to every other node (the graph is a single strongly connected component, SCC).
[Figure: an irreducible graph next to a graph that is not irreducible.]
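Irreducibility can be tested with two graph traversals: every node must be reachable from node 0, and node 0 must be reachable from every node (checked on the reversed graph). The sketch below assumes the graph is given as out-link lists; the function names are chosen for this example.

```python
def reachable(out_links, start):
    """Set of nodes reachable from start, via an iterative depth-first search."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in out_links[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_irreducible(out_links):
    """True if the directed graph is one strongly connected component."""
    n = len(out_links)
    if n == 0:
        return False
    # Build the reversed graph: an edge u -> v becomes v -> u.
    reverse = [[] for _ in range(n)]
    for u, targets in enumerate(out_links):
        for v in targets:
            reverse[v].append(u)
    return len(reachable(out_links, 0)) == n and len(reachable(reverse, 0)) == n

cycle_with_shortcut = [[1], [2], [0, 1]]  # 0 -> 1 -> 2 -> 0, plus 2 -> 1
chain = [[1], [2], []]                    # 0 -> 1 -> 2, no way back
```

Here `is_irreducible(cycle_with_shortcut)` holds, while `is_irreducible(chain)` does not, because node 0 cannot be reached from node 2.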
Well-behaved graphs
- Aperiodic: the GCD of all cycle lengths is 1. This GCD is also called the period.
[Figure: a graph with period 3 next to an aperiodic graph.]
About undirected graphs
- A connected undirected graph is irreducible.
- A connected non-bipartite undirected graph has a stationary distribution proportional to the degree distribution!
  - This makes sense: the larger the degree of a node, the more likely a random walk is to come back to it.
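The degree claim can be checked numerically on a small example. The graph below (a triangle with one pendant node, hence connected and non-bipartite) is a hypothetical choice; the sketch sets $\pi(i) = \deg(i) / 2m$, where m is the number of edges, and verifies that one step of the walk leaves it unchanged.

```python
# Hypothetical undirected graph: triangle 0-1-2 plus the pendant edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4

neighbors = {i: set() for i in range(n)}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

degree = [len(neighbors[i]) for i in range(n)]

# Candidate stationary distribution: degree divided by the sum of degrees (2m).
pi = [d / sum(degree) for d in degree]

# Random walk on an undirected graph: move to a uniformly chosen neighbor.
P = [[(1.0 / degree[j] if i in neighbors[j] else 0.0) for i in range(n)]
     for j in range(n)]

# One step of the walk starting from pi.
pi_next = [sum(pi[j] * P[j][i] for j in range(n)) for i in range(n)]
```

Since `pi_next` matches `pi`, the degree-proportional vector is indeed a fixed point of the walk on this graph.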
PageRank and HITS
Query-independent ordering
- First generation: using link counts as simple measures of popularity.
- Undirected popularity: each page gets a score given by the number of its in-links plus the number of its out-links (e.g. 3 + 2 = 5).
- Directed popularity: score of a page = number of its in-links (e.g. 3).
- Easy to spam.
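Both first-generation scores are plain degree counts, as the following sketch shows. The link list is a hypothetical example built so that page 0 has 3 in-links and 2 out-links, matching the numbers quoted above; `popularity` is an assumed name.

```python
def popularity(links, n):
    """Return (undirected, directed) popularity scores for pages 0..n-1.

    links: list of (source, target) pairs, one per hyperlink.
    """
    in_links = [0] * n
    out_links = [0] * n
    for source, target in links:
        out_links[source] += 1
        in_links[target] += 1
    undirected = [i + o for i, o in zip(in_links, out_links)]
    directed = in_links
    return undirected, directed

# Hypothetical web of 4 pages: page 0 has 3 in-links and 2 out-links.
links = [(1, 0), (2, 0), (3, 0), (0, 1), (0, 2)]
undirected, directed = popularity(links, 4)
```

The sketch also makes the spam problem visible: anyone can raise a page's score by creating new pages that link to it.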
Second generation: PageRank
- Exploit the graph structure: each link has its own importance!
- PageRank is independent of the query.
- It has many interpretations:
  - Linear algebra: eigenvectors, eigenvalues
  - Markov chains: steady-state probability distribution
  - Social interpretation: a sort of voting scheme
Basic Intuition…
- With probability d: jump to a random neighbor, i.e. follow one of the out-links of the current page.
- With probability 1 - d: random jump to any node.
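Google's PageRank
- $r(i) = d \sum_{j \in B(i)} \frac{r(j)}{\#out(j)} + \frac{1 - d}{N}$, where the second term accounts for the random jump.
- In matrix form, $r = \left[ d\, P^T + (1 - d)\, e\, e^T \right] r$, so r is the principal eigenvector of this matrix.
- B(i): set of pages linking to i.
- #out(j): number of outgoing links from j.
- N: number of pages.
- e: vector whose N components are all equal to $1/\sqrt{N}$.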
PageRank: use in Search Engines
- Preprocessing:
  - Given the graph of links, build the matrix P.
  - Compute its principal eigenvector r.
  - r[i] is the PageRank of page i.
  - We are interested in the relative order.
- At query time:
  - Retrieve the pages containing the query terms.
  - Rank them by their PageRank.
  - The final order is query-independent.
How to compute it
- Fast (approximate) computation:
  - Given the Web graph, build the matrix P.
  - Compute $r = e \cdot P^t$ for t = 0, 1, …
  - r[i] is the PageRank of page i.
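The iterative computation can be sketched as a power iteration that never materializes the full matrix. This is an illustration under stated assumptions, not the slides' exact procedure: d = 0.85 is a commonly used value rather than one given here, d is taken as the probability of following a link, pages without out-links are assumed to jump uniformly at random, and the 4-page graph is hypothetical.

```python
def pagerank(out_links, d=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for PageRank on a graph given as out-link lists."""
    n = len(out_links)
    r = [1.0 / n] * n                    # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [(1.0 - d) / n] * n        # random-jump contribution
        for j, targets in enumerate(out_links):
            if targets:
                share = d * r[j] / len(targets)
                for i in targets:        # follow one of the out-links of j
                    nxt[i] += share
            else:
                share = d * r[j] / n     # assumption: dangling page jumps anywhere
                for i in range(n):
                    nxt[i] += share
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2.
graph = [[1, 2], [2], [0], [2]]
r = pagerank(graph)
ranking = sorted(range(len(graph)), key=lambda i: -r[i])
```

Page 3 has no in-links, so its score is just the random-jump share (1 - d)/N, and it comes last in the ranking.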
(Personalized) PageRank
- Bias the random jump: substitute e = [1 1 … 1] with a preference vector that makes the jump land on preferred pages (topics).
- If e = e_i (the i-th unit vector), then r[j] = relatedness between node j and node i.
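A minimal sketch of the biased random jump: the only change with respect to plain PageRank is that the jump lands according to a preference vector instead of uniformly. The value d = 0.85, the treatment of pages without out-links (they jump according to the preference vector) and the 4-page graph are assumptions made for this illustration.

```python
def personalized_pagerank(out_links, pref, d=0.85, tol=1e-12, max_iter=1000):
    """PageRank whose random jump follows the distribution pref (sums to 1)."""
    n = len(out_links)
    r = list(pref)
    for _ in range(max_iter):
        nxt = [(1.0 - d) * p for p in pref]   # biased random jump
        for j, targets in enumerate(out_links):
            if targets:
                share = d * r[j] / len(targets)
                for i in targets:
                    nxt[i] += share
            else:
                for i in range(n):            # assumption: dangling page jumps via pref
                    nxt[i] += d * r[j] * pref[i]
        if max(abs(a - b) for a, b in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2.
graph = [[1, 2], [2], [0], [2]]
r_from_3 = personalized_pagerank(graph, [0.0, 0.0, 0.0, 1.0])  # e = e_3
r_from_0 = personalized_pagerank(graph, [1.0, 0.0, 0.0, 0.0])  # e = e_0
```

With e = e_0 the walk can never reach page 3 (it has no in-links), so its score is 0; with e = e_3 every jump lands on page 3 and it keeps exactly the jump mass 1 - d.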
CoSimRank to «compare» nodes
- It builds on Personalized PageRank: the start vector p^(0)(i) varies with node i.
- The adjacency matrix A has normalized rows.
- Set d = 1, so there is no teleport step.
- The score can be normalized by multiplying it by (1 - c).
HITS: Hypertext Induced Topic Search
Calculating HITS
- It is query-dependent.
- It produces two scores per page:
  - Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.
  - Hub score: a good hub page for a topic points to many authoritative pages for that topic.
Authority and Hub scores
[Figure: node 1 is pointed to by nodes 2, 3 and 4, and points to nodes 5, 6 and 7.]
- a(1) = h(2) + h(3) + h(4)
- h(1) = a(5) + a(6) + a(7)
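HITS: Link Analysis Computation
- $a = A^T h$ and $h = A a$, where
  - a: vector of authority scores
  - h: vector of hub scores
  - A: adjacency matrix in which $a_{i,j} = 1$ if $i \to j$
- Thus $h = A A^T h$ and $a = A^T A a$ (up to normalization): h is an eigenvector of $A A^T$ and a is an eigenvector of $A^T A$.
- Both $A A^T$ and $A^T A$ are symmetric matrices.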
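The two update rules can be iterated directly on the edge list, normalizing after each update so the scores stay bounded. This is a sketch under assumptions: the fixed number of iterations, the Euclidean normalization and the 5-node graph are choices made for this example, not details from the slides.

```python
import math

def hits(edges, n, iterations=50):
    """Iterate a = A^T h and h = A a, rescaling each vector to unit length."""
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages pointing to j.
        auth = [0.0] * n
        for i, j in edges:
            auth[j] += hub[i]
        norm = math.sqrt(sum(v * v for v in auth)) or 1.0
        auth = [v / norm for v in auth]
        # Hub update: sum the authority scores of the pages that i points to.
        hub = [0.0] * n
        for i, j in edges:
            hub[i] += auth[j]
        norm = math.sqrt(sum(v * v for v in hub)) or 1.0
        hub = [v / norm for v in hub]
    return auth, hub

# Hypothetical graph: pages 0, 1, 2 act as hubs; pages 3, 4 as authorities.
edges = [(0, 3), (0, 4), (1, 3), (2, 3)]
auth, hub = hits(edges, 5)
```

Page 3 gets the top authority score because all three hubs point to it, and page 0 is the best hub because it points to both authorities.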
Weighting links
- Weight a link more if the query occurs in the neighborhood of the link (e.g. in the anchor text).