Network Mining

• Finding Important Nodes
  – PageRank
  – Relative Importance
• Discovering Network Modules
• Inferring Important Paths

PageRank
• Web pages are organized in a network.
  – Each webpage is represented as a node.
  – Each hyperlink is a directed edge.
  – The entire web can be viewed as a directed graph.

PageRank is a numeric value that represents how important a page is on the web.
• Webpage importance
  – A link from one page to another is a vote for the other page: a link from page A to page B is a vote by A for B.
  – If page A is itself more important, its vote for B should carry more weight.
  – The more votes a page receives, the more important it must be.
• How can we model this importance?

The Random Surfer Model
PageRank = a model of user behaviour: a surfer clicks on links at random, with no regard for content.
Intuition: imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps. Occasionally the surfer gets bored and, instead of following an outgoing link from the current page, jumps to another random page. Eventually, the percentage of time spent at each page converges to a fixed value. This value is known as the PageRank of the page.
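
The random-surfer intuition can be tried out with a small simulation. The sketch below is only an illustration under stated assumptions: the three-page graph, the function name `simulate_surfer`, the step count and the seed are all made up for this example, and a surfer on a page without outgoing links is assumed to always jump. Multiplying each visit fraction by the number of pages puts it on the same scale as the PR formula used in the Algorithm slide.

```python
import random

def simulate_surfer(links, d=0.85, steps=200_000, seed=0):
    """Estimate the fraction of time a random surfer spends on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        outlinks = links[current]
        if outlinks and rng.random() < d:
            current = rng.choice(outlinks)   # follow one of the page's links
        else:
            current = rng.choice(pages)      # bored (or stuck): jump to a random page
    return {p: visits[p] / steps for p in pages}

# Hypothetical site: A links to B and C, and both link back to A.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
fractions = simulate_surfer(links)
print(fractions)
```

Page A, which receives both inbound links, should end up with the largest share of the surfer's time.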

PageRank
• Importance Computation
  – The importance of a page is distributed to the pages that it points to.
  – The importance of a page is the aggregation of the importance shares of the pages that point to it.
• If a page has 5 outlinks, its importance is divided into 5 and each link receives a one-fifth share.

Algorithm
From here on, PageRank is abbreviated as “PR”.
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
• PR(A) is the PageRank of page A
• PR(Ti) is the PageRank of each page Ti that links to page A
• C(Ti) is the number of outbound links on page Ti
• d is a damping factor which can be set between 0 and 1 (usually set to 0.85)
With d = 0.85: a page’s PageRank = 0.15 + 0.85 * (a “share” of the PageRank of every page that links to it)
“share” = the linking page’s PageRank divided by the number of outbound links on that page
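
The update rule can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the `links` dictionary (page -> list of pages it links to), the function name `pagerank`, the choice to update all pages simultaneously, and the three-page `example` graph are assumptions made for this sketch.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A."""
    pr = {page: 1.0 for page in links}            # every page starts at PR = 1
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Sum the "shares" PR(T) / C(T) of every page T that links to `page`.
            shares = sum(pr[t] / len(links[t]) for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * shares
        pr = new_pr                               # replace all values at once
    return pr

# Hypothetical graph: A -> B, B -> A and C, C -> A.
example = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
ranks = pagerank(example)
print(ranks)
```

Because every page in this example has at least one outbound link, the values should sum to the number of pages.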

Example: pages A and B link to each other, and each starts with PR = 1.
PR(A) = 0.15 + 0.85 * PR(B)
PR(B) = 0.15 + 0.85 * PR(A)
We can't work out A's PageRank until we know B's PageRank, and we can't work out B's PageRank until we know A's. Iteration resolves this: start from inaccurate values and recompute repeatedly, so that each pass yields more accurate values. 100 iterations are needed to get a good approximation of the PageRank values of the whole web.
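
To see that the starting values do not need to be accurate, the two-page case can be iterated from a deliberately poor guess. This is a sketch: the starting values of 0 and the variable names are assumptions, while d = 0.85 and the 100 passes follow the slide.

```python
d = 0.85
pr_a, pr_b = 0.0, 0.0   # deliberately inaccurate starting values

for _ in range(100):
    # Both right-hand sides are evaluated with the previous pass's values.
    pr_a, pr_b = (1 - d) + d * pr_b, (1 - d) + d * pr_a

print(pr_a, pr_b)
```

Each pass shrinks the remaining error by the factor d, so both values approach 1.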

PageRank of a Site
• The PageRank of a site is the sum of the PageRank of all pages in the site.
• The maximum PageRank of a site is equal to the total number of pages in the site.

Internal Linking
• A website has a maximum amount of PageRank that is distributed between its pages by internal links.
• The maximum amount of PageRank in a site increases as the number of pages in the site increases.
• By linking poorly, it is possible to fail to reach the site’s maximum PageRank, but it is not possible to exceed it.

Internal Linking
[Figure: pages A, B and C with no links between them; each page starts with PR = 1]
The maximum PageRank equals the number of pages in the site, so this site’s maximum PageRank is 3.
PR(A) = 0.15 + 0.85 * (0) = 0.15
PR(B) = 0.15 + 0.85 * (0) = 0.15
PR(C) = 0.15 + 0.85 * (0) = 0.15
Total PageRank in this site = 0.45
The site is wasting most of its potential PageRank!

Internal Linking
[Figure: page A links to page B; page C has no links; each page starts with PR = 1]
PR(A) = 0.15 + 0.85 * (0) = 0.15
PR(B) = 0.15 + 0.85 * (PR(A)/1) = 0.15 + 0.85 * (1) = 1
PR(C) = 0.15 + 0.85 * (0) = 0.15
After 100 iterations...

Internal Linking
PR(A) = 0.15
PR(B) = 0.2775
PR(C) = 0.15
Total PageRank in this site = 0.5775
Slightly better, but still not the best it could be.

Internal Linking
[Figure: page A links to B and C; B and C each link back to A; each page starts with PR = 1]
PR(A) = 0.15 + 0.85 * (PR(B)/1 + PR(C)/1) = 0.15 + 0.85 * (1 + 1) = 0.15 + 1.7 = 1.85
PR(B) = 0.15 + 0.85 * (PR(A)/2) = 0.15 + 0.85 * (0.5) = 0.15 + 0.425 = 0.575
PR(C) = 0.15 + 0.85 * (PR(A)/2) = 0.15 + 0.85 * (0.5) = 0.15 + 0.425 = 0.575
After 100 iterations...
Page A = 1.459459
Page B = 0.7702703
Page C = 0.7702703

Internal Linking
[Figure: every page links to both of the other pages; each page starts with PR = 1]
PR(A) = 0.15 + 0.85 * (PR(B)/2 + PR(C)/2) = 0.15 + 0.85 * (0.5 + 0.5) = 0.15 + 0.85 = 1
PR(B) = 1
PR(C) = 1
Total PageRank in this site = 3
Good linking!

Internal Linking
[Figure: page A links to B and C; B links to A; C links to A and B; each page starts with PR = 1]
PR(A) = 0.15 + 0.85 * (PR(B)/1 + PR(C)/2) = 0.15 + 0.85 * (1 + 0.5) = 0.15 + 1.275 = 1.425
PR(B) = 0.15 + 0.85 * (PR(A)/2 + PR(C)/2) = 0.15 + 0.85 * (0.5 + 0.5) = 0.15 + 0.85 = 1
PR(C) = 0.15 + 0.85 * (PR(A)/2) = 0.15 + 0.85 * (0.5) = 0.15 + 0.425 = 0.575
After 100 iterations...
Page A = 1.298245
Page B = 0.9999999
Page C = 0.7017543

Internal Linking
[Figure: pages A, B and C, each starting with PR = 1]
PR(A) = 1.46
PR(B) = 0.77
PR(C) = 0.77
Total PageRank in this site = 3
No PageRank has been wasted!

Internal Linking
[Figure: pages A, B and C, each starting with PR = 1]
PR(A) = 1.298
PR(B) = 0.999
PR(C) = 0.702
Total PageRank in this site = 3
Compared with the previous structure, Page A and Page C lose some PageRank and Page B gains some.
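
The totals quoted in the internal-linking slides can be checked with a short script. It is a sketch under assumptions: the link structures are inferred from the equations on each slide, and the helper `pagerank` and the dictionary names are invented for this example.

```python
def pagerank(links, d=0.85, iterations=100):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) to all pages, repeatedly."""
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[t] / len(links[t])
                                   for t in links if p in links[t])
              for p in links}
    return pr

# Link structures as inferred from the slides (page -> pages it links to).
structures = {
    "no links":     {"A": [], "B": [], "C": []},
    "A to B only":  {"A": ["B"], "B": [], "C": []},
    "hub":          {"A": ["B", "C"], "B": ["A"], "C": ["A"]},
    "fully linked": {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]},
    "mixed":        {"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]},
}

results = {name: pagerank(links) for name, links in structures.items()}
totals = {name: sum(pr.values()) for name, pr in results.items()}
for name, total in totals.items():
    print(name, round(total, 4))
```

The structures in which every page has an outbound link keep the full total of 3; the ones with pages that link nowhere fall short of it.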

Inbound Links
[Figure: pages A, B, C and D form a loop (A -> B -> C -> D -> A), each starting with PR = 1; an external page X with PR = 10 links to page A]
To see the influence of the damping factor “d”, set it to 0.5 in this example:
PR(A) = 0.5 + 0.5 * (PR(X)/C(X) + PR(D)) = 6.33
PR(B) = 0.5 + 0.5 * PR(A) = 3.67
PR(C) = 0.5 + 0.5 * PR(B) = 2.33
PR(D) = 0.5 + 0.5 * PR(C) = 1.67
Initial effect of the additional inbound link on page A: d * PR(X)/C(X) = 0.5 * 10 / 1 = 5

Inbound Links
[Figure: pages A, B, C and D form a loop (A -> B -> C -> D -> A), each starting with PR = 1; an external page X with PR = 10 links to page A]
Now set the damping factor “d” back to 0.85:
PR(A) = 0.15 + 0.85 * (PR(X)/C(X) + PR(D)) = 18.78
PR(B) = 0.15 + 0.85 * PR(A) = 16.12
PR(C) = 0.15 + 0.85 * PR(B) = 13.85
PR(D) = 0.15 + 0.85 * PR(C) = 11.92
Initial effect of the additional inbound link on page A: d * PR(X)/C(X) = 0.85 * 10 / 1 = 8.5
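
The effect of the damping factor in this example can be reproduced numerically. The sketch below assumes the loop A -> B -> C -> D -> A with PR(X) held fixed at 10 and C(X) = 1, as the equations suggest; the function name `loop_with_inbound_link` and the iteration count are choices made for this illustration.

```python
def loop_with_inbound_link(d, pr_x=10.0, c_x=1, iterations=200):
    """Iterate the four-page loop with an external link X -> A; PR(X) stays fixed."""
    pr_a = pr_b = pr_c = pr_d = 1.0
    for _ in range(iterations):
        # All four right-hand sides use the values from the previous pass.
        pr_a, pr_b, pr_c, pr_d = (
            (1 - d) + d * (pr_x / c_x + pr_d),
            (1 - d) + d * pr_a,
            (1 - d) + d * pr_b,
            (1 - d) + d * pr_c,
        )
    return pr_a, pr_b, pr_c, pr_d

low = loop_with_inbound_link(0.5)
high = loop_with_inbound_link(0.85)
print([round(v, 2) for v in low])
print([round(v, 2) for v in high])
```

With the larger damping factor, the rank injected by page X is attenuated less on each hop around the loop, so all four pages end up with much higher values.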

Outbound Links
[Figure: Site 1 contains pages A and B, which link to each other; Site 2 contains pages C and D, which link to each other; each page starts with PR = 1]
Without a link between the two sites:
PR(A) = 0.15 + 0.85 * PR(B) = 1
PR(B) = 0.15 + 0.85 * PR(A) = 1
PR(C) = 0.15 + 0.85 * PR(D) = 1
PR(D) = 0.15 + 0.85 * PR(C) = 1
After page A adds an outbound link to page C:
PR(A) = 0.15 + 0.85 * PR(B) = 0.43
PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33
PR(C) = 0.15 + 0.85 * (PR(A)/2 + PR(D)) = 1.67
PR(D) = 0.15 + 0.85 * PR(C) = 1.57
Site 1 loses: 0.76 - 2 = -1.24; Site 2 gains: 3.24 - 2 = 1.24
The PageRank benefit for one site equals the PageRank loss of the other.
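
The transfer between the two sites can be checked by iterating the formula on the four-page graph. This is a sketch: the link structure is read off the equations, and the names `pagerank`, `site1` and `site2` are assumptions for illustration. Unrounded, the site totals come out near 0.77 and 3.23, consistent with the rounded figures on the slide.

```python
def pagerank(links, d=0.85, iterations=100):
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[t] / len(links[t])
                                   for t in links if p in links[t])
              for p in links}
    return pr

# Site 1 = {A, B}, Site 2 = {C, D}; A carries the extra outbound link to C.
links = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": ["C"]}
pr = pagerank(links)
site1 = pr["A"] + pr["B"]
site2 = pr["C"] + pr["D"]
print(round(site1, 3), round(site2, 3))
```

The combined total stays at 4, so whatever Site 1 gives up is exactly what Site 2 receives.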

Dangling Links
Dangling links are links that point to a page with no outgoing links.
[Figure: page A links to B and C; B links back to A; C has no outgoing links; each page starts with PR = 1]
PR(A) = 0.15 + 0.85 * PR(B) = 0.43
PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33
PR(C) = 0.15 + 0.85 * PR(A)/2 = 0.33
The total PageRank is only 1.10, roughly one third of the maximum PageRank.
To protect PageRank from the negative effects of dangling links, pages without outbound links are removed from the database until the PageRank values have been computed.

Dangling Links
With the dangling page C included:
PR(A) = 0.15 + 0.85 * PR(B) = 0.43
PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33
PR(C) = 0.15 + 0.85 * PR(A)/2 = 0.33
With page C removed:
PR(A) = 0.15 + 0.85 * PR(B) = 1
PR(B) = 0.15 + 0.85 * PR(A) = 1
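
The loss caused by a dangling page, and the effect of removing it, can be compared directly. In this sketch the graphs `with_c` and `without_c` and the helper `pagerank` are assumptions for illustration; in the second graph, A's link to C has been dropped along with page C.

```python
def pagerank(links, d=0.85, iterations=100):
    pr = {p: 1.0 for p in links}
    for _ in range(iterations):
        # A page with no outlinks never passes the `p in links[t]` filter,
        # so its PageRank is handed on to nobody.
        pr = {p: (1 - d) + d * sum(pr[t] / len(links[t])
                                   for t in links if p in links[t])
              for p in links}
    return pr

with_c = {"A": ["B", "C"], "B": ["A"], "C": []}   # C is a dangling page
without_c = {"A": ["B"], "B": ["A"]}              # C and the link to it removed

total_with = sum(pagerank(with_c).values())
total_without = sum(pagerank(without_c).values())
print(round(total_with, 2), round(total_without, 2))
```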

Speed of Convergence
• Early experiments on Google used 322 million links.
• The PageRank algorithm converged (within a small tolerance) in about 52 iterations.
• The number of iterations required for convergence is empirically O(log n), where n is the number of links.
• Therefore the calculation is quite efficient.

PageRank
• Page A’s PR is based on all its inbound neighbors.
• PR is defined recursively.
• With each new calculation, each page’s PR may change.
  – If B and C point to A, the order in which A, B and C are computed may affect the resulting PR of A.
  – How can the problem be solved?

PageRank
• The PR values of all web pages form a vector P.
• A is a matrix.
  – Each row and each column represents a page.
  – A(u,v) = 1/C(u) if there is a link from u to v.
  – A(u,v) = 0 otherwise.
• P = c A P + c E
  – P = c (A + E × 1) P since ||P||_1 = 1, where 1 is the vector of all ones
  – P is an eigenvector of (A + E × 1).

PageRank
• When the number of pages is small, P can be computed efficiently.
• However, when the number of pages is large, computing P is time-consuming.
  – An iterative method is used for the computation.
  – Any initial value of P is acceptable.
  – After applying the equation 50 times, P converges to a stable state.
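
For a handful of pages, the PR equations can also be solved directly as a linear system instead of by iteration. The sketch below is one way to do that, not the method used by the slides: it rewrites PR(v) = (1 - d) + d * sum(PR(u)/C(u)) as (I - d*M) P = (1 - d) and applies Gauss-Jordan elimination. The function name `solve_pagerank` and the three-page graph are assumptions.

```python
def solve_pagerank(links, d=0.85):
    """Solve (I - d*M) P = (1 - d) exactly, where M[v][u] = 1/C(u) if u links to v."""
    pages = sorted(links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    # Augmented matrix: identity on the left, (1 - d) in the last column.
    aug = [[1.0 if r == c else 0.0 for c in range(n)] + [1 - d] for r in range(n)]
    for u, targets in links.items():
        for v in targets:
            aug[index[v]][index[u]] -= d / len(targets)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return {p: aug[index[p]][n] / aug[index[p]][index[p]] for p in pages}

# Hypothetical graph: A links to B and C, and both link back to A.
exact = solve_pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A"]})
print(exact)
```

Elimination costs on the order of n^3 operations, which is why an iterative method is preferred once the number of pages grows.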

PageRank Algorithm
• R_0 <- S
• Loop:
  – R_{i+1} <- A * R_i
  – d <- ||R_i||_1 - ||R_{i+1}||_1
  – R_{i+1} <- R_{i+1} + d * E
  – dif <- ||R_{i+1} - R_i||_1
• While (dif > t)
(Here d is the rank lost in the current step, not the damping factor, and t is the convergence tolerance.)
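
The loop can be written out in plain Python as follows. This is a sketch under several assumptions that the slide does not state: the matrix `a` is stored so that `a[v][u]` is the share page u passes to page v, `e` is a uniform vector summing to 1, the start vector `s` is uniform, and the example graph (A links to B and C, B links to A, C has no outlinks) is made up. The names `pagerank_l1`, `lost` and `tol` are likewise hypothetical.

```python
def l1(vec):
    """L1 norm of a vector."""
    return sum(abs(x) for x in vec)

def pagerank_l1(a, e, s, tol=1e-10, max_iter=1000):
    r = list(s)                                   # R_0 <- S
    n = len(r)
    for _ in range(max_iter):
        nxt = [sum(a[v][u] * r[u] for u in range(n)) for v in range(n)]  # A * R_i
        lost = l1(r) - l1(nxt)                    # the slide's d: rank lost this step
        nxt = [x + lost * ev for x, ev in zip(nxt, e)]   # hand the lost rank back via E
        dif = l1([x - y for x, y in zip(nxt, r)])
        r = nxt
        if dif <= tol:                            # stop once the change is below t
            break
    return r

# Indices 0, 1, 2 stand for pages A, B, C.
a = [[0.0, 1.0, 0.0],    # A receives all of B's rank
     [0.5, 0.0, 0.0],    # B receives half of A's rank
     [0.5, 0.0, 0.0]]    # C receives half of A's rank and passes nothing on
e = [1 / 3, 1 / 3, 1 / 3]
s = [1 / 3, 1 / 3, 1 / 3]
ranks = pagerank_l1(a, e, s)
print(ranks)
```

In this variant the ranks are normalized to sum to 1, and the rank that would otherwise vanish at the dangling page C is redistributed through E on every pass.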

Examples
[Figure: graph of pages A, B and C]
• P(A) = P(B) = P(C) = 0.15

PageRank
[Figure: graph of pages A, B and C]
• A = 1.3
• B = 1
• C = 0.7