Network Mining Finding Important Nodes Page Rank Relative
- Slides: 30
Network Mining
• Finding Important Nodes
  – PageRank
  – Relative Importance
• Discovering Network Modules
• Inferring Important Paths
PageRank
• Web pages are organized in a network.
  – Each web page is represented as a node.
  – Each hyperlink is a directed edge.
  – The entire web can be viewed as a directed graph.
PageRank is a numeric value that represents how important a page is on the web.
• Web page importance
  – One page linking to another is a vote for the other page: a link from page A to page B is a vote by A for B.
  – If page A is itself more important, then A's vote for B should carry more weight.
  – More votes = the more important the page must be.
• How can we model this importance?
The Random Surfer Model
PageRank = a model of user behaviour. A surfer clicks on links at random, with no regard for content.
Intuition: imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps. Occasionally the surfer gets bored and, instead of following a link out of the current page, jumps to another random page. Eventually, the fraction of time spent at each page converges to a fixed value. This value is known as the PageRank of the page.
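The random surfer can be simulated directly. The sketch below is only an illustration: the three-page web, the 0.85 probability of following a link, the step count and the function name are all assumptions, not values taken from the slides.

```python
import random

def simulate_random_surfer(links, steps=200_000, follow_prob=0.85, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        outlinks = links[current]
        if outlinks and rng.random() < follow_prob:
            current = rng.choice(outlinks)   # follow a random outgoing link
        else:
            current = rng.choice(pages)      # bored: jump to a random page
    return {page: count / steps for page, count in visits.items()}

# Hypothetical three-page web: A links to B and C, and both link back to A.
web = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
time_share = simulate_random_surfer(web)
print(time_share)
```

The fractions sum to 1; multiplying them by the number of pages puts them on the same scale as the PR values used in the worked examples.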
PageRank
• Importance Computation
  – The importance of a page is distributed to the pages that it points to.
  – The importance of a page is the aggregation of the importance shares of the pages that point to it.
• If a page has 5 outlinks, its importance is divided by 5 and each link receives a one-fifth share.
PageRank
Algorithm
From now on we refer to PageRank as “PR”.
PR(A) = (1 - d) + d * ( PR(T1) / C(T1) + … + PR(Tn) / C(Tn) )
• PR(A) is the PageRank of page A
• PR(Ti) is the PageRank of a page Ti that links to page A
• C(Ti) is the number of outbound links on page Ti
• d is a damping factor which can be set between 0 and 1 (usually set to 0.85)
A page’s PageRank = 0.15 + 0.85 * (a “share” of the PageRank of every page that links to it)
“share” = the linking page’s PageRank divided by the number of outbound links on that page
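As a minimal sketch, the formula can be written as one function that computes a single page's PR from the current PR of the pages linking to it. The function name and the pages T1, T2 and Z in the example are hypothetical.

```python
def pagerank_of(page, ranks, outlinks, d=0.85):
    """PR(page) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to page."""
    share_total = sum(
        ranks[source] / len(targets)      # the "share" passed on by each linking page
        for source, targets in outlinks.items()
        if page in targets
    )
    return (1 - d) + d * share_total

# Hypothetical data: T1 has 2 outbound links, T2 has 1; both link to A.
outlinks = {"T1": ["A", "Z"], "T2": ["A"], "A": [], "Z": []}
ranks = {"T1": 1.0, "T2": 1.0, "A": 1.0, "Z": 1.0}
pr_a = pagerank_of("A", ranks, outlinks)
print(pr_a)  # 0.15 + 0.85 * (1/2 + 1/1)
```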
Page A and Page B link to each other; each starts with PR = 1.
PR(A) = 0.15 + 0.85 * PR(B)
PR(B) = 0.15 + 0.85 * PR(A)
We can't work out A's PageRank until we know B's PageRank, and we can't work out B's PageRank until we know A's PageRank.
Iteration is therefore necessary: each pass computes more accurate values from the previous, less accurate ones.
About 100 iterations are needed to get a good approximation of the PageRank values of the whole web.
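The circular dependency between the two pages can be resolved by repeated substitution. In this sketch the starting guess of 0 is an assumption, chosen to show that even a poor guess converges. With the slide's starting value of 1, the values are already at the fixed point.

```python
d = 0.85
pr_a, pr_b = 0.0, 0.0  # deliberately poor starting guess (assumption)

for _ in range(100):
    # Both new values are computed from the previous iteration's values.
    pr_a, pr_b = (1 - d) + d * pr_b, (1 - d) + d * pr_a

print(round(pr_a, 4), round(pr_b, 4))
```

Each pass shrinks the remaining error by a factor of d, so both values approach 1.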
PageRank of a Site
• The PageRank of a site is equal to the sum of the PageRank of all pages in the site.
• The maximum PageRank of a site is equal to the total number of pages in the site.
Internal Linking
• A website has a maximum amount of PageRank, which is distributed between its pages by internal links.
• The maximum amount of PageRank in a site increases as the number of pages in the site increases.
• By linking poorly, a site can fail to reach its maximum PageRank, but it cannot exceed it.
Internal Linking
Pages A, B and C have no links between them; each starts with PR = 1.
A site's maximum PageRank equals the number of pages in it, so this site’s maximum PageRank is 3.
PR(A) = 0.15 + 0.85 * ( 0 ) = 0.15
PR(B) = 0.15 + 0.85 * ( 0 ) = 0.15
PR(C) = 0.15 + 0.85 * ( 0 ) = 0.15
Total PageRank in this site = 0.45
The site is wasting most of its potential PageRank!
Internal Linking
Page A now links to Page B; each page starts with PR = 1.
PR(A) = 0.15 + 0.85 * ( 0 ) = 0.15
PR(B) = 0.15 + 0.85 * ( PR(A)/1 ) = 0.15 + 0.85 * ( 1 ) = 1
PR(C) = 0.15 + 0.85 * ( 0 ) = 0.15
After 100 iterations…
Internal Linking
Page A links to Page B.
PR(A) = 0.15
PR(B) = 0.2775
PR(C) = 0.15
Total PageRank in this site = 0.5775
Slightly better, but still not the best it could be.
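The two poorly linked sites above (no links at all, and a single link from A to B) can be checked with a small iteration helper. This is a sketch under the slides' conventions (d = 0.85, every page starting at PR = 1); the function name and the dictionary layout are assumptions.

```python
def iterate_pagerank(links, d=0.85, iterations=100):
    """Apply the PR formula to every page `iterations` times, starting from PR = 1."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Sum the shares of all pages that link to `page`.
            incoming = sum(
                ranks[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

no_links = iterate_pagerank({"A": [], "B": [], "C": []})
a_to_b = iterate_pagerank({"A": ["B"], "B": [], "C": []})
print(sum(no_links.values()))
print(a_to_b["B"], sum(a_to_b.values()))
```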
Internal Linking
Page A links to Pages B and C, and each of them links back to A; each page starts with PR = 1.
PR(A) = 0.15 + 0.85 * ( PR(B)/1 + PR(C)/1 ) = 0.15 + 0.85 * ( 1 + 1 ) = 0.15 + 1.7 = 1.85
PR(B) = 0.15 + 0.85 * ( PR(A)/2 ) = 0.15 + 0.85 * ( 0.5 ) = 0.15 + 0.425 = 0.575
PR(C) = 0.15 + 0.85 * ( PR(A)/2 ) = 0.15 + 0.85 * ( 0.5 ) = 0.15 + 0.425 = 0.575
After 100 iterations…
Page A = 1.459459
Page B = 0.7702703
Page C = 0.7702703
Internal Linking
Every page links to both of the other pages; each page starts with PR = 1.
PR(A) = 0.15 + 0.85 * ( PR(B)/2 + PR(C)/2 ) = 0.15 + 0.85 * ( 0.5 + 0.5 ) = 0.15 + 0.85 = 1
PR(B) = 1
PR(C) = 1
Total PageRank in this site = 3
Good linking!
Internal Linking
Page A links to Pages B and C; Page B links to A; Page C links to A and B. Each page starts with PR = 1.
PR(A) = 0.15 + 0.85 * ( PR(B)/1 + PR(C)/2 ) = 0.15 + 0.85 * ( 1 + 0.5 ) = 0.15 + 1.275 = 1.425
PR(B) = 0.15 + 0.85 * ( PR(A)/2 + PR(C)/2 ) = 0.15 + 0.85 * ( 0.5 + 0.5 ) = 0.15 + 0.85 = 1
PR(C) = 0.15 + 0.85 * ( PR(A)/2 ) = 0.15 + 0.85 * ( 0.5 ) = 0.15 + 0.425 = 0.575
After 100 iterations…
Page A = 1.298245
Page B = 0.9999999
Page C = 0.7017543
Internal Linking
PR(A) = 1.46
PR(B) = 0.77
PR(C) = 0.77
Total PageRank in this site = 3
No PageRank has been wasted!
Internal Linking
PR(A) = 1.298
PR(B) = 0.999
PR(C) = 0.702
Total PageRank in this site = 3
Compared with the previous structure, Page A and Page C lose some PageRank and Page B gains some.
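The two fully linked structures can be compared side by side. The sketch below assumes the link layouts implied by the equations (A linking to B and C with both linking back, then the same with an extra link from C to B); the names `hub` and `mixed` are hypothetical labels.

```python
def site_pagerank(links, d=0.85, iterations=100):
    """Iterate the PR formula from PR = 1 for every page in a small site."""
    ranks = dict.fromkeys(links, 1.0)
    for _ in range(iterations):
        ranks = {
            page: (1 - d) + d * sum(
                ranks[src] / len(out) for src, out in links.items() if page in out
            )
            for page in links
        }
    return ranks

hub = site_pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A"]})
mixed = site_pagerank({"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]})

for name, ranks in (("hub", hub), ("mixed", mixed)):
    print(name, {p: round(r, 3) for p, r in ranks.items()}, round(sum(ranks.values()), 3))
```

In both layouts every page has at least one outbound link, so the total stays at the site maximum of 3 and the internal links only change how it is shared out.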
Inbound Links
Page X (PR = 10) links to Page A; Pages A, B, C and D link in a loop A → B → C → D → A, each starting with PR = 1.
Try setting the damping factor “d” to 0.5 in this example to see its influence.
PR(A) = 0.5 + 0.5 * ( PR(X) / C(X) + PR(D) ) = 6.33
PR(B) = 0.5 + 0.5 * PR(A) = 3.67
PR(C) = 0.5 + 0.5 * PR(B) = 2.33
PR(D) = 0.5 + 0.5 * PR(C) = 1.67
Initial effect of the additional inbound link of page A: d x PR(X) / C(X) = 0.5 x 10 / 1 = 5
Inbound Links
Page X (PR = 10) links to Page A; Pages A, B, C and D link in a loop A → B → C → D → A, each starting with PR = 1.
Now we set the damping factor “d” back to 0.85.
PR(A) = 0.15 + 0.85 * ( PR(X) / C(X) + PR(D) ) = 18.78
PR(B) = 0.15 + 0.85 * PR(A) = 16.12
PR(C) = 0.15 + 0.85 * PR(B) = 13.85
PR(D) = 0.15 + 0.85 * PR(C) = 11.92
Initial effect of the additional inbound link of page A: d x PR(X) / C(X) = 0.85 x 10 / 1 = 8.5
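The effect of the damping factor on the inbound-link example can be explored with a short sketch. It assumes, as the equations do, that Page X keeps a fixed PR of 10 with a single outbound link, and that A, B, C and D form a loop; the function and parameter names are hypothetical.

```python
def inbound_link_example(d, pr_x=10.0, c_x=1, iterations=200):
    """Loop A -> B -> C -> D -> A, plus an external page X that links to A."""
    pr_a = pr_b = pr_c = pr_d = 1.0
    for _ in range(iterations):
        # All four values are updated together from the previous iteration.
        pr_a, pr_b, pr_c, pr_d = (
            (1 - d) + d * (pr_x / c_x + pr_d),
            (1 - d) + d * pr_a,
            (1 - d) + d * pr_b,
            (1 - d) + d * pr_c,
        )
    return pr_a, pr_b, pr_c, pr_d

low = inbound_link_example(0.5)
high = inbound_link_example(0.85)
print([round(v, 2) for v in low])
print([round(v, 2) for v in high])
```

With the larger damping factor, more of the vote from X survives each hop around the loop, which is why all four pages end up with much higher values.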
Outbound Links
Site 1 consists of Pages A and B, which link to each other; Site 2 consists of Pages C and D, which link to each other. Each page starts with PR = 1.
Without a link between the sites:
PR(A) = 0.15 + 0.85 * PR(B) = 1
PR(B) = 0.15 + 0.85 * PR(A) = 1
PR(C) = 0.15 + 0.85 * PR(D) = 1
PR(D) = 0.15 + 0.85 * PR(C) = 1
With an additional outbound link from Page A to Page C:
PR(A) = 0.15 + 0.85 * PR(B) = 0.43
PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33
PR(C) = 0.15 + 0.85 * ( PR(A)/2 + PR(D) ) = 1.67
PR(D) = 0.15 + 0.85 * PR(C) = 1.57
Site 1 loses: 0.76 - 2 = -1.24. Site 2 gains: 3.24 - 2 = 1.24.
The PageRank benefit for one site equals the PageRank loss of the other.
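The transfer of PageRank between the two sites can be reproduced with a short sketch, assuming the outbound link runs from Page A to Page C as the equations imply. The helper name is hypothetical, and because the slide adds up rounded per-page values, the unrounded totals here differ from its figures in the second decimal place.

```python
def ranks_for(links, d=0.85, iterations=200):
    """Iterate the PR formula from PR = 1 for every page."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        previous = ranks
        ranks = {}
        for page in links:
            shares = [
                previous[src] / len(out)
                for src, out in links.items()
                if page in out
            ]
            ranks[page] = (1 - d) + d * sum(shares)
    return ranks

before = ranks_for({"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"]})
after = ranks_for({"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": ["C"]})

site1_after = after["A"] + after["B"]
site2_after = after["C"] + after["D"]
print(round(site1_after, 2), round(site2_after, 2))
print(round(site1_after + site2_after, 6))  # the combined total is unchanged
```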
Dangling Links
Dangling links are links that point to a page with no outgoing links.
Page A links to Pages B and C; Page B links back to A; Page C has no outbound links. Each page starts with PR = 1.
PR(A) = 0.15 + 0.85 * PR(B) = 0.43
PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33
PR(C) = 0.15 + 0.85 * PR(A)/2 = 0.33
The total PageRank is only 1.10, roughly one third of the maximum PageRank of 3.
To protect PageRank from the negative effects of dangling links, pages without outbound links have to be removed from the database until the PageRank values are computed.
Dangling Links
With the dangling page C included:
PR(A) = 0.15 + 0.85 * PR(B) = 0.43
PR(B) = 0.15 + 0.85 * PR(A)/2 = 0.33
PR(C) = 0.15 + 0.85 * PR(A)/2 = 0.33
With page C removed, so that only Pages A and B remain:
PR(A) = 0.15 + 0.85 * PR(B) = 1
PR(B) = 0.15 + 0.85 * PR(A) = 1
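Removing dangling pages before the computation might be sketched as follows. Repeating the removal until no dangling page is left is an assumption of this sketch (dropping one page can leave another without outbound links), and both function names are hypothetical.

```python
def remove_dangling(links):
    """Repeatedly drop pages with no outbound links, and the links pointing to them."""
    links = {page: list(out) for page, out in links.items()}
    while True:
        dangling = {page for page, out in links.items() if not out}
        if not dangling:
            return links
        links = {
            page: [target for target in out if target not in dangling]
            for page, out in links.items()
            if page not in dangling
        }

def pagerank(links, d=0.85, iterations=100):
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = {
            page: (1 - d) + d * sum(
                ranks[src] / len(out) for src, out in links.items() if page in out
            )
            for page in links
        }
    return ranks

web = {"A": ["B", "C"], "B": ["A"], "C": []}
with_dangling = pagerank(web)
pruned = remove_dangling(web)
without_dangling = pagerank(pruned)
print(round(sum(with_dangling.values()), 2), without_dangling)
```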
Speed of Convergence
• Early experiments on Google used 322 million links.
• The PageRank algorithm converged (within a small tolerance) in about 52 iterations.
• The number of iterations required for convergence is empirically O(log n), where n is the number of links.
• The calculation is therefore quite efficient.
PageRank
• Page A’s PR is based on the PR of all its inbound neighbors.
• PR is defined recursively.
• With each new calculation, a page’s PR may change.
  – If B and C point to A, the order in which A, B and C are computed may affect the resulting PR of A.
  – How can this problem be solved?
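The ordering problem can be seen in a single pass over a three-page site. The sketch below contrasts in-place updates, where the result depends on the order, with one common remedy: computing every new value from the previous iteration's values only. The link layout and all names are assumptions for illustration.

```python
LINKS = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
D = 0.85

def new_rank(page, ranks):
    incoming = sum(ranks[s] / len(out) for s, out in LINKS.items() if page in out)
    return (1 - D) + D * incoming

def in_place_pass(order):
    """Overwrite each PR as soon as it is computed, so later pages see new values."""
    ranks = {p: 1.0 for p in LINKS}
    for page in order:
        ranks[page] = new_rank(page, ranks)
    return ranks

def synchronous_pass():
    """Compute every new PR from the old values only; the order no longer matters."""
    old = {p: 1.0 for p in LINKS}
    return {page: new_rank(page, old) for page in LINKS}

abc = in_place_pass("ABC")
bca = in_place_pass("BCA")
sync = synchronous_pass()
print(abc["A"], bca["A"], sync["A"])
```

After one in-place pass, A's value differs between the two orders, while the synchronous pass gives one well-defined result per iteration.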
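PageRank
• The PR values of all web pages form a vector P.
• X is a matrix.
  – Each row and each column represents a page.
  – X(u,v) = 1/C(u) if there is a link from u to v.
  – X(u,v) = 0 otherwise.
• P = c X^T P + c E, where E is a vector over pages that acts as a source of rank (the transpose is needed because X(u,v) is indexed from the linking page u to the linked page v).
  – P = c (X^T + E 1^T) P since ||P||1 = 1, where 1 is the all-ones vector.
  – P is an eigenvector of (X^T + E 1^T).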
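The matrix form can be checked against the scalar formula on a small site. The sketch below uses the scale of the worked examples (ranks sum to the number of pages rather than 1), takes c = 0.85, and picks a uniform E so that c * E equals 1 - c for every page; those choices, the link layout and the names are assumptions.

```python
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}

# X[u][v] = 1 / C(u) if page u links to page v, else 0.
X = [
    [1 / len(links[u]) if v in links[u] else 0.0 for v in pages]
    for u in pages
]

c = 0.85
E = [(1 - c) / c] * len(pages)  # assumption: uniform source with c * E = 1 - c

def apply_equation(P):
    """One application of P <- c * X^T P + c * E."""
    n = len(pages)
    return [
        c * sum(X[u][v] * P[u] for u in range(n)) + c * E[v]
        for v in range(n)
    ]

P = [1.0] * len(pages)
for _ in range(200):
    P = apply_equation(P)
print([round(value, 6) for value in P])
```

Summing X[u][v] * P[u] over u is the product with the transpose of X, which gathers for each page v the shares sent by the pages that link to it.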
PageRank
• When the number of pages is small, P can be computed efficiently.
• However, when the number of pages is large, computing P directly is time-consuming.
  – An iterative method is used for the computation.
  – Any initial value of P is acceptable.
  – After applying the equation about 50 times, P converges to a stable state.
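PageRank Algorithm
• R0 <- S
• Loop:
  – Ri+1 <- A * Ri
  – d <- ||Ri||1 - ||Ri+1||1
  – Ri+1 <- Ri+1 + d * E
  – dif <- ||Ri+1 - Ri||1
• While (dif > t)
Here S is the initial vector, d is the rank lost in the multiplication (not the damping factor), and t is the convergence tolerance.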
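A minimal Python sketch of this loop follows. It assumes that A is the transposed link matrix scaled by 0.85, so that some rank is lost in each multiplication and re-injected through E, and that S and E are both uniform vectors summing to 1. None of these choices, nor the function names, come from the slides.

```python
def l1_norm(vec):
    return sum(abs(x) for x in vec)

def pagerank_loop(A, S, E, tol=1e-10, max_iter=1000):
    R = list(S)                                                      # R0 <- S
    for _ in range(max_iter):
        R_next = [sum(a * r for a, r in zip(row, R)) for row in A]   # Ri+1 <- A * Ri
        rank_lost = l1_norm(R) - l1_norm(R_next)                     # d <- ||Ri|| - ||Ri+1||
        R_next = [x + rank_lost * e for x, e in zip(R_next, E)]      # Ri+1 <- Ri+1 + d * E
        dif = l1_norm([x - y for x, y in zip(R_next, R)])            # dif <- ||Ri+1 - Ri||
        R = R_next
        if dif <= tol:                                               # stop once dif <= t
            break
    return R

# Hypothetical site: A links to B and C, and both link back to A.
# Row v of the matrix holds the shares that page v receives, scaled by 0.85.
scale = 0.85
A = [
    [0.0, scale * 1.0, scale * 1.0],
    [scale * 0.5, 0.0, 0.0],
    [scale * 0.5, 0.0, 0.0],
]
uniform = [1 / 3] * 3
R = pagerank_loop(A, S=uniform, E=uniform)
print([round(3 * value, 6) for value in R])
```

The result sums to 1; multiplying by the number of pages gives values on the same scale as the worked examples.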
Examples
[Figure: graph of pages A, B and C]
• P(A) = P(B) = P(C) = 0.15
PageRank
[Figure: graph of pages A, B and C]
• A = 1.3
• B = 1
• C = 0.7