Google's PageRank
Google’s PageRank
• Inlinks are “good” (recommendations).
• Inlinks from a “good” site are better than inlinks from a “bad” site...
• ...but inlinks from sites with many outlinks are not as “good”.
• “Good” and “bad” are relative.
[Figure: example web sites xxx, yyyy, pdq, and a b c d e f g]
CompSci 100E, W. Cohen
Google’s PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)
Imagine a “pagehopper” that always either
• follows a random link, or
• jumps to random page
PageRank ranks pages by the amount of time the pagehopper spends on a page:
• or, if there were many pagehoppers, PageRank is the expected “crowd size”
[Figure: example web sites xxx, yyyy, pdq, and a b c d e f g]
PageRank
• Google's PageRank™ algorithm. [Sergey Brin and Larry Page, 1998]
  - Measure popularity of pages based on hyperlink structure of Web.
  - Revolutionized access to world's information.
90-10 Rule
• Model. Web surfer chooses next page:
  - 90% of the time surfer clicks random hyperlink.
  - 10% of the time surfer types a random page.
• Caveat. Crude, but useful, web surfing model.
  - No one chooses links with equal probability.
  - No real potential to surf directly to each page on the web.
  - The 90-10 breakdown is just a guess.
  - It does not take the back button or bookmarks into account.
  - We can only afford to work with a small sample of the web.
  - …
Web Graph Input Format
• Input format.
  - N pages numbered 0 through N-1.
  - Represent each hyperlink with a pair of integers.
[Figure: graph representation]
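The input format above can be read with a few lines of Java. This is a minimal sketch, not the course's code: the class name WebGraph and the method readLinkCounts are hypothetical, and it assumes the input starts with N followed by one pair of integers per hyperlink.

```java
import java.util.Arrays;
import java.util.Scanner;

class WebGraph {
    // Returns counts[i][j] = number of links from page i to page j.
    // Assumption: the text starts with N, followed by integer pairs "from to".
    static int[][] readLinkCounts(String text) {
        Scanner in = new Scanner(text);
        int n = in.nextInt();                 // number of pages
        int[][] counts = new int[n][n];
        while (in.hasNextInt()) {
            int from = in.nextInt();
            int to = in.nextInt();
            counts[from][to]++;               // repeated links are counted
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical 3-page graph with a repeated link from page 1 to page 2.
        String sample = "3\n0 1\n1 2\n1 2\n2 0\n";
        for (int[] row : readLinkCounts(sample)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Keeping a count per (i, j) pair, rather than a boolean, preserves repeated links, which matters once links are turned into probabilities.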
Transition Matrix
• Transition matrix. p[i][j] = prob. that surfer moves from page i to j.
[Figure: transition matrix, annotated "surfer on page 1 goes to page 2 next 38% of the time"]
© Sedgewick & Wayne
Web Graph to Transition Matrix
% java Transition tiny.txt
5 5
[Output: 5-by-5 transition matrix; only some entries survive: 0.02000, 0.92000, 0.38000, 0.47000, 0.20000]
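One way to turn link counts into a transition matrix under the 90-10 rule is sketched below. The class TransitionSketch, the edge counts, and the handling of pages with no outlinks are assumptions for illustration; the contents of tiny.txt are not reproduced in the slides. With N = 5, a lone outlink gives 0.90 + 0.10/5 = 0.92, two of five outlinks give 0.38, one of five gives 0.20, and one of two gives 0.47, which is consistent with the entries visible in the output above.

```java
class TransitionSketch {
    // Builds p[i][j] from link counts using the 90-10 rule.
    static double[][] transition(int[][] counts) {
        int n = counts.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            int outDegree = 0;
            for (int j = 0; j < n; j++) outDegree += counts[i][j];
            for (int j = 0; j < n; j++) {
                // Assumption: a page with no outlinks behaves like a random jump.
                double follow = (outDegree == 0)
                        ? 1.0 / n
                        : (double) counts[i][j] / outDegree;
                p[i][j] = 0.90 * follow + 0.10 / n;
            }
        }
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical 5-page graph: counts[i][j] = number of links from i to j.
        int[][] counts = {
            {0, 1, 0, 0, 0},
            {0, 0, 2, 2, 1},
            {0, 0, 0, 1, 0},
            {1, 0, 0, 0, 0},
            {1, 0, 1, 0, 0}
        };
        double[][] p = transition(counts);
        System.out.println(p.length + " " + p.length);
        for (double[] row : p) {
            for (double x : row) System.out.printf("%8.5f", x);
            System.out.println();
        }
    }
}
```

Every row sums to 1, because the 90% link share and the 10% jump share are each spread over the whole row.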
Monte Carlo Simulation
• Monte Carlo simulation.
  - Surfer starts on page 0.
  - Repeatedly choose next page, according to transition matrix.
  - Calculate how often surfer visits each page.
How? See next slide.
[Figure: transition matrix]
Random Surfer
• Random move. Surfer is on the current page (variable page). How to choose next page j?
  - Row page of transition matrix (p[page]) gives probabilities.
  - Compute cumulative probabilities for row page.
  - Generate random number r between 0.0 and 1.0.
  - Choose page j corresponding to interval where r lies.

    // make one random move
    double r = Math.random();
    double sum = 0.0;
    for (int j = 0; j < N; j++) {
        // find interval containing r
        sum += p[page][j];
        if (r < sum) { page = j; break; }
    }
© Sedgewick & Wayne
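To see the interval lookup with concrete numbers, the sketch below wraps the same loop in a hypothetical helper, nextPage, that takes r as a parameter so the choice is repeatable. The row is illustrative, built from entries that appear in the slides; its cumulative sums are roughly 0.02, 0.04, 0.42, 0.80, 1.00.

```java
class RandomMove {
    // Returns the page j whose cumulative interval contains r (0 <= r < 1).
    static int nextPage(double[] row, double r) {
        double sum = 0.0;
        for (int j = 0; j < row.length; j++) {
            sum += row[j];
            if (r < sum) return j;
        }
        // Guard (an addition to the slide's loop): rounding can leave the
        // total slightly below 1, so fall back to the last page.
        return row.length - 1;
    }

    public static void main(String[] args) {
        // Illustrative row: probabilities of moving from one page to pages 0..4.
        double[] row = {0.02, 0.02, 0.38, 0.38, 0.20};
        double[] samples = {0.01, 0.03, 0.41, 0.50, 0.95};
        for (double r : samples) {
            System.out.println("r = " + r + " -> page " + nextPage(row, r));
        }
    }
}
```

A wider interval catches more random numbers, so page 2 and page 3 are each chosen about 38% of the time.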
Random Surfer: Monte Carlo Simulation

    import java.util.Scanner;

    public class RandomSurfer {
        public static void main(String[] args) {
            Scanner in = new Scanner(System.in);  // assumed: in reads standard input
            int T = Integer.parseInt(args[0]);    // number of moves
            int N = in.nextInt();                 // number of pages
            int page = 0;                         // current page
            double[][] p = new double[N][N];      // transition matrix

            // read in transition matrix
            ...

            // simulate random surfer and count page frequencies
            int[] freq = new int[N];
            for (int t = 0; t < T; t++) {
                // make one random move (see previous slide)
                freq[page]++;
            }

            // print page ranks
            for (int i = 0; i < N; i++) {
                System.out.println(String.format("%8.5f", (double) freq[i] / T));
            }
            System.out.println();
        }
    }
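For a version that runs without any input file, the sketch below hard-codes a transition matrix and puts the simulation in a method. The class SurferSimulation, the seed parameter, and the 5-by-5 matrix are assumptions for illustration; the matrix is chosen to be consistent with the entries visible in the slides, not copied from them.

```java
import java.util.Random;

class SurferSimulation {
    // Simulates the given number of moves starting on page 0 and returns the
    // fraction of moves that ended on each page.
    static double[] simulate(double[][] p, int moves, long seed) {
        int n = p.length;
        Random random = new Random(seed);     // fixed seed for repeatable runs
        int[] freq = new int[n];
        int page = 0;
        for (int t = 0; t < moves; t++) {
            // make one random move: find the interval containing r
            double r = random.nextDouble();
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                sum += p[page][j];
                if (r < sum) { page = j; break; }
            }
            freq[page]++;
        }
        double[] rank = new double[n];
        for (int i = 0; i < n; i++) rank[i] = (double) freq[i] / moves;
        return rank;
    }

    public static void main(String[] args) {
        // Assumed 5-page transition matrix (each row sums to 1).
        double[][] p = {
            {0.02, 0.92, 0.02, 0.02, 0.02},
            {0.02, 0.02, 0.38, 0.38, 0.20},
            {0.02, 0.02, 0.02, 0.92, 0.02},
            {0.92, 0.02, 0.02, 0.02, 0.02},
            {0.47, 0.02, 0.47, 0.02, 0.02}
        };
        double[] rank = simulate(p, 1_000_000, 42L);
        for (double x : rank) System.out.printf("%8.5f%n", x);
    }
}
```

The estimates fluctuate from seed to seed; more moves give steadier values.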
Mathematical Context
• Convergence. For the random surfer model, the fraction of time the surfer spends on each page converges to a unique distribution, independent of the starting page.
  - "page rank"
  - "stationary distribution" of Markov chain
  - "principal eigenvector" of transition matrix
The Power Method
• Q. If the surfer starts on page 0, what is the probability that surfer ends up on page i after one step?
• A. First row of transition matrix.
The Power Method
• Q. If the surfer starts on page 0, what is the probability that surfer ends up on page i after two steps?
• A. Matrix-vector multiplication.
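The two answers above can be checked with a short sketch: one step from page 0 is a vector-matrix product with the vector (1, 0, ..., 0), and two steps apply the product twice. The class PowerStep, the method step, and the 5-by-5 matrix are assumptions for illustration.

```java
class PowerStep {
    // One step of the chain: next[j] = sum over i of rank[i] * p[i][j].
    static double[] step(double[] rank, double[][] p) {
        int n = rank.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                next[j] += rank[i] * p[i][j];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Assumed 5-page transition matrix (each row sums to 1).
        double[][] p = {
            {0.02, 0.92, 0.02, 0.02, 0.02},
            {0.02, 0.02, 0.38, 0.38, 0.20},
            {0.02, 0.02, 0.02, 0.92, 0.02},
            {0.92, 0.02, 0.02, 0.02, 0.02},
            {0.47, 0.02, 0.47, 0.02, 0.02}
        };
        double[] start = {1.0, 0.0, 0.0, 0.0, 0.0};   // surfer starts on page 0
        double[] one = step(start, p);                // equals row 0 of p
        double[] two = step(one, p);                  // distribution after two steps
        for (double x : one) System.out.printf("%8.5f", x);
        System.out.println();
        for (double x : two) System.out.printf("%8.5f", x);
        System.out.println();
    }
}
```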
The Power Method
• Power method. Repeat until page ranks converge.
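A minimal sketch of the repeat-until-converged loop follows. The stopping rule (total absolute change below a tolerance), the iteration cap, the class PowerMethod, and the 5-by-5 matrix are assumptions for illustration; the slide only says to repeat until the page ranks converge.

```java
class PowerMethod {
    // Repeats rank = rank * p until the total change falls below tolerance
    // or maxIterations is reached. Starts with the surfer on page 0.
    static double[] pageRank(double[][] p, double tolerance, int maxIterations) {
        int n = p.length;
        double[] rank = new double[n];
        rank[0] = 1.0;
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[j] += rank[i] * p[i][j];
                }
            }
            double change = 0.0;
            for (int j = 0; j < n; j++) change += Math.abs(next[j] - rank[j]);
            rank = next;
            if (change < tolerance) break;    // assumed convergence test
        }
        return rank;
    }

    public static void main(String[] args) {
        // Assumed 5-page transition matrix (each row sums to 1).
        double[][] p = {
            {0.02, 0.92, 0.02, 0.02, 0.02},
            {0.02, 0.02, 0.38, 0.38, 0.20},
            {0.02, 0.02, 0.02, 0.92, 0.02},
            {0.92, 0.02, 0.02, 0.02, 0.02},
            {0.47, 0.02, 0.47, 0.02, 0.02}
        };
        double[] rank = pageRank(p, 1e-10, 1000);
        for (double x : rank) System.out.printf("%8.5f%n", x);
    }
}
```

Unlike the Monte Carlo estimate, this computation is deterministic: the same matrix always yields the same ranks.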
Mathematical Context
• Convergence. For the random surfer model, the power method iterates converge to a unique distribution, independent of the starting page.
  - "page rank"
  - "stationary distribution" of Markov chain
  - "principal eigenvector" of transition matrix
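In symbols, the three names above describe the same object. The notation here is an assumption, not taken from the slides: P is the transition matrix with entries p_{ij} = p[i][j], and \pi^{(t)} is the row vector of page probabilities after t steps.

```latex
% Power method iteration (row-vector convention, since p_{ij} is the move i -> j)
\[
\pi^{(t+1)} = \pi^{(t)} P,
\qquad
\pi^{(t+1)}_j = \sum_{i=0}^{N-1} \pi^{(t)}_i \, p_{ij}
\]
% The limit is the stationary distribution: a left eigenvector of P with eigenvalue 1
\[
\pi = \pi P,
\qquad
\sum_{i=0}^{N-1} \pi_i = 1
\]
```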
Random Surfer: Scientific Challenges
• Google's PageRank™ algorithm. [Sergey Brin and Larry Page, 1998]
  - Rank importance of pages based on hyperlink structure of web, using 90-10 rule.
  - Revolutionized access to world's information.
• Scientific challenges. Cope with 4 billion-by-4 billion matrix!
  - Need data structures to enable computation.
  - Need linear algebra to fully understand computation.
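As one illustration of the kind of data structure meant here, the sketch below performs a power-method step from lists of outlinks, without ever storing the N-by-N matrix. It is an assumption-laden example, not the approach the slides or Google prescribe: the class SparseStep, the outLinks layout, and the treatment of pages with no outlinks are all hypothetical.

```java
class SparseStep {
    // One power-method step under the 90-10 rule using only the link lists.
    // outLinks[i] holds the targets of page i's links (repeats allowed).
    static double[] step(double[] rank, int[][] outLinks) {
        int n = rank.length;
        double[] next = new double[n];
        double jump = 0.0;   // probability mass spread evenly over all pages
        for (int i = 0; i < n; i++) {
            if (outLinks[i].length == 0) {
                jump += rank[i];              // assumption: no outlinks means a random jump
            } else {
                jump += 0.10 * rank[i];
                double share = 0.90 * rank[i] / outLinks[i].length;
                for (int j : outLinks[i]) next[j] += share;
            }
        }
        for (int j = 0; j < n; j++) next[j] += jump / n;
        return next;
    }

    public static void main(String[] args) {
        // Hypothetical 5-page graph stored as link lists.
        int[][] outLinks = { {1}, {2, 2, 3, 3, 4}, {3}, {0}, {0, 2} };
        double[] rank = {1.0, 0.0, 0.0, 0.0, 0.0};
        for (int t = 0; t < 100; t++) rank = step(rank, outLinks);
        for (double x : rank) System.out.printf("%8.5f%n", x);
    }
}
```

The work per step is proportional to the number of pages plus the number of links, rather than to N squared.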