Crawling and Ranking

HTML (HyperText Markup Language)
• Describes the structure and content of a (web) document
• HTML 4.01: most common version, W3C standard
• XHTML 1.0: XML-ization of HTML 4.01, minor differences
• Validation (http://validator.w3.org/) against a schema. Checks the conformity of a Web page with respect to recommendations, for accessibility:
  – to all graphical browsers (IE, Firefox, Safari, Opera, etc.)
  – to text browsers (lynx, links, w3m, etc.)
  – to all other user agents, including Web crawlers

The HTML language
• Text and tags
• Tags define structure
  – Used, for instance, by a browser to lay out the document
• Header and Body

HTML structure
    <!DOCTYPE html …>
    <html lang="en">
      <head>
        <!-- Header of the document -->
      </head>
      <body>
        <!-- Body of the document -->
      </body>
    </html>

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
        <title>Example XHTML document</title>
      </head>
      <body>
        <p>This is a <a href="http://www.w3.org/">link to the W3C</a></p>
      </body>
    </html>

Header
• Appears between the tags <head> … </head>
• Includes meta-data such as language, encoding, …
• Also includes the document title
• Used (e.g., by the browser) to decide how to interpret the body

Body
• Between <body> … </body> tags
• The body is structured into sections, paragraphs, lists, etc.
    <h1>Title of the page</h1>
    <h2>Title of a main section</h2>
    <h3>Title of a subsection</h3>
    …
• <p> … </p> defines paragraphs
• More block elements such as tables, lists, …

HTTP
• Application protocol
  Client request:
    GET /MarkUp/ HTTP/1.1
    Host: www.google.com
  Server response:
    HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST

GET
URL: http://www.google.com/search?q=BGU
Corresponding HTTP GET request:
    GET /search?q=BGU HTTP/1.1
    Host: www.google.com
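
For concreteness, a minimal sketch of issuing this request with Python's standard http.client module (error handling omitted; in practice google.com may redirect or block such bare requests):

    import http.client

    conn = http.client.HTTPConnection("www.google.com")
    conn.request("GET", "/search?q=BGU")     # method and path (with query string)
    response = conn.getresponse()
    print(response.status, response.reason)  # e.g., 200 OK
    body = response.read()                   # the response body, as bytes
    conn.close()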

POST
• Used for submitting forms
    POST /php/test.php HTTP/1.1
    Host: www.bgu.ac.il
    Content-Type: application/x-www-form-urlencoded
    Content-Length: 100
    …
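
A hedged sketch of submitting such a form with Python's standard urllib; the form field names and values are invented for illustration:

    import urllib.parse
    import urllib.request

    data = urllib.parse.urlencode({"name": "toto", "grade": "100"}).encode()
    req = urllib.request.Request("http://www.bgu.ac.il/php/test.php", data=data)
    # supplying data makes this a POST; urllib sends
    # Content-Type: application/x-www-form-urlencoded by default
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.reason)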

Status codes
• HTTP responses always start with a status code followed by a human-readable message (e.g., 200 OK)
• First digit indicates the class of the response:
    1 Information
    2 Success
    3 Redirection
    4 Client-side error
    5 Server-side error

Authentication
• HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc.
• It can be used instead of plain HTTP to transmit sensitive data
    GET ... HTTP/1.1
    Authorization: Basic dG90bzp0aXRp

Cookies
• Key/value pairs that a server asks a client to store and retransmit with each HTTP request (for a given domain name)
• Can be used to keep information on users between visits
• Often what is stored is a session ID
  – Connected, on the server side, to all session information
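
As an illustration (header values invented), the server sets the cookie in its response, and the client retransmits it on later requests to the same domain:

    HTTP/1.1 200 OK
    Set-Cookie: session_id=a3fWa; Path=/; HttpOnly
    ...

    GET /profile HTTP/1.1
    Host: www.example.com
    Cookie: session_id=a3fWa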

Crawling

Basics of Crawling
• Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web
• Basic crawling algorithm (see the sketch below):
  1. Start from a given URL or set of URLs
  2. Retrieve and process the corresponding page
  3. Discover new URLs (next slide)
  4. Repeat on each found URL
• Problem: the Web is huge!
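
A minimal sketch of this loop in Python, assuming the standard library plus BeautifulSoup for link extraction; politeness, robots.txt, and error handling are left out here:

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)      # URLs waiting to be fetched
        seen = set(seed_urls)            # avoid fetching the same URL twice
        while frontier and len(seen) <= max_pages:
            url = frontier.popleft()
            try:
                html = urlopen(url).read()
            except OSError:
                continue                 # skip unreachable pages
            # ... process/index the page here ...
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])   # resolve relative links
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)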

Discovering new URLs
• Browse the "internet graph" (following, e.g., hyperlinks)
• Site maps (sitemaps.org)

The internet graph
• At least 14.06 billion nodes = pages
• At least 140 billion edges = links
• Lots of "junk"

Graph-browsing algorithms
• Depth-first
• Breadth-first
• Combinations…
• Parallel crawling

Duplicates
• Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
• Trivial duplicates: same resource at the same canonized URL:
    http://example.com:80/toto
    http://example.com/titi/../toto
• Exact duplicates: identification by hashing
• Near-duplicates (timestamps, tip of the day, etc.): more complex!
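
A minimal sketch of exact-duplicate detection by hashing page content; the choice of SHA-1 here is an illustrative assumption, not prescribed by the slides:

    import hashlib

    seen_hashes = {}   # content hash -> first URL seen with that content

    def is_exact_duplicate(url, content: bytes) -> bool:
        digest = hashlib.sha1(content).hexdigest()
        if digest in seen_hashes:
            return True            # same bytes already indexed under another URL
        seen_hashes[digest] = url
        return False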

Near-duplicate detection
• Edit distance
  – Good measure of similarity
  – Does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!)
• Shingles: two documents are similar if they mostly share the same succession of k-grams (sketch below)
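
A minimal sketch of shingling: represent each document by its set of word k-grams and compare the sets with the Jaccard coefficient. The value of k and the example texts are illustrative choices:

    def shingles(text: str, k: int = 3) -> set:
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 1.0

    d1 = shingles("the quick brown fox jumps over the lazy dog")
    d2 = shingles("the quick brown fox jumps over the lazy cat")
    print(jaccard(d1, d2))   # 0.75: the documents share most of their shingles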

Crawling ethics
• robots.txt at the root of a Web server:
    User-agent: *
    Allow: /searchhistory/
    Disallow: /search
• Per-page exclusion (de facto standard):
    <meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard):
    <a href="toto.html" rel="nofollow">Toto</a>
• Avoid Denial of Service (DoS): wait 100 ms–1 s between two repeated requests to the same Web server
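
A hedged sketch of honoring robots.txt with Python's standard-library parser; the URL and user-agent string are made up for illustration:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()                                        # fetch and parse the file
    if rp.can_fetch("MyCrawler", "http://example.com/search"):
        pass  # allowed: fetch the page
    else:
        pass  # disallowed: skip it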

Overview
• Crawl
• Retrieve relevant documents
  – Can you guess how?
• To define relevance, to find relevant docs…
  – We will discuss later
• Rank

Ranking

Why Ranking?
• Huge number of pages
• Huge even if we filter according to relevance
  – Keep only pages that include the keywords
• A lot of the pages are not informative
  – And anyway, it is impossible for users to go through 10K results

When to rank?
• Before retrieving results
  – Advantage: offline!
  – Disadvantage: huge set
• After retrieving results
  – Advantage: smaller set
  – Disadvantage: online, the user is waiting…

How to rank?
• Observation: links are very informative!
• Not just for discovering new sites, but also for estimating the importance of a site
• CNN.com has more links to it than my homepage…
• Quality and efficiency are key factors

Authority and Hubness
• Authority: a site is very authoritative if it receives many citations. A citation from an important site has more weight than a citation from a less-important site
    A(v) = the authority of v
• Hubness: a good hub is a site that links to many authoritative sites
    H(v) = the hubness of v

HITS
• Recursive dependency:
    a(v) = Σ over edges (u,v) of h(u)
    h(v) = Σ over edges (v,u) of a(u)
• Normalize (when?) according to the square root of the sum of squares of the authority/hubness values
• Start by setting all values to 1
  – We could also add bias
• We can show that a(v) and h(v) converge (a small iteration sketch follows below)
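
A minimal sketch of the HITS iteration, assuming the graph is given as a dict mapping each node to the list of nodes it links to (every linked node is assumed to appear as a key):

    from math import sqrt

    def hits(graph: dict, iterations: int = 50):
        nodes = list(graph)
        a = {v: 1.0 for v in nodes}   # authority scores, all start at 1
        h = {v: 1.0 for v in nodes}   # hub scores, all start at 1
        for _ in range(iterations):
            # authority: sum of hub scores of the pages linking to v
            a = {v: sum(h[u] for u in nodes if v in graph[u]) for v in nodes}
            # hubness: sum of authority scores of the pages v links to
            h = {v: sum(a[u] for u in graph[v]) for v in nodes}
            # normalize by the square root of the sum of squares
            na = sqrt(sum(x * x for x in a.values())) or 1.0
            nh = sqrt(sum(x * x for x in h.values())) or 1.0
            a = {v: x / na for v, x in a.items()}
            h = {v: x / nh for v, x in h.items()}
        return a, h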

HITS (cont.)
• Works rather well if applied only to relevant web pages
  – E.g., pages that include the input keywords
• The results are less satisfying if applied to the whole web
• On the other hand, online ranking is a problem

Google PageRank
• Works offline, i.e., computes for every web site a score that can then be used online
• Extremely efficient and high-quality
• The PageRank algorithm that we will describe here appears in [Brin & Page, 1998]

Random Surfer Model
• Consider a "random surfer"
• At each point, the surfer chooses a link and clicks on it
• A link is chosen with uniform distribution
  – A simplifying assumption…
• What is the probability of being, at a random time, at a web page W?

Recursive definition
• If PageRank reflects the probability of being at a web page (PR(W) = P(W)), then
    PR(W) = PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))
  where W1, …, Wn are the pages linking to W, and O(W) is the out-degree of W

Problems
• A random surfer may get stuck in one component of the graph
• May get stuck in loops
• "Rank Sink" problem
  – Many Web pages have no inlinks/outlinks

Damping Factor
• Add some probability d for "jumping" to a random page
• Now
    PR(W) = (1 − d)·[PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))] + d·(1/N)
  where N is the number of pages in the index
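
As a worked instance (the numbers are invented for illustration): suppose W has two in-neighbors, W1 with PR(W1) = 0.2 and out-degree O(W1) = 2, and W2 with PR(W2) = 0.1 and out-degree O(W2) = 1, in an index of N = 10 pages with d = 0.15. Then

    PR(W) = (1 − 0.15)·(0.2/2 + 0.1/1) + 0.15·(1/10)
          = 0.85·0.2 + 0.015
          = 0.185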

How to compute PR?
• Simulation
• Analytical methods
  – Can we solve the equations?

Simulation: A random surfer algorithm (sketch below)
• Start from an arbitrary page
• Toss a coin to decide whether to follow a link or to randomly choose a new page
• Then toss another coin to decide which link to follow / which page to go to
• Keep a record of the frequency of the web pages visited
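
A minimal sketch of the random-surfer simulation, assuming the graph is a dict from page to list of out-links; d is the jump probability from the damping-factor slide:

    import random
    from collections import Counter

    def random_surfer(graph: dict, d: float = 0.15, steps: int = 100_000):
        pages = list(graph)
        current = random.choice(pages)          # start from an arbitrary page
        visits = Counter()
        for _ in range(steps):
            visits[current] += 1
            if random.random() < d or not graph[current]:
                current = random.choice(pages)  # jump to a uniformly random page
            else:
                current = random.choice(graph[current])  # follow a random link
        # visit frequencies approximate the PageRank distribution
        return {p: visits[p] / steps for p in pages}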

Convergence
• Not guaranteed without the damping factor!
• (Partial) intuition: if unlucky, the algorithm may get stuck forever in a connected component
• Claim: with damping, the probability of getting stuck forever is 0
• More difficult claim: with damping, convergence is guaranteed

Markov Chain Monte Carlo (MCMC)
• A class of very useful algorithms for sampling from a given distribution
• We first need to know what a Markov Chain is

Markov Chain
• A finite or countably infinite state machine
  – We will consider the case of finitely many states
• Transitions are associated with probabilities
• Markovian property: given the present state, future choices are independent of the past

MCMC framework
• Construct (explicitly or implicitly) a Markov Chain (MC) that describes the desired distribution
• Perform a random walk on the MC, keeping track of the proportion of state visits
  – Discard samples made before "mixing"
• Return the proportions as an approximation of the desired distribution

Properties of Markov Chains
• A Markov Chain defines a distribution on the different states (P(state) = probability of being in the state at a random time)
• We want conditions for when this distribution is unique, and when a random walk will approximate it

Properties
• Periodicity
  – A state i has period k if any return to state i must occur in multiples of k time steps
  – Aperiodic: period = 1 for all states
• Reducibility
  – An MC is irreducible if it is possible (with positive probability) to eventually get from every state to every state
• Theorem: a finite-state MC has a unique stationary distribution if it is aperiodic and irreducible

Back to PageRank
• The MC is on the Web graph, with the probabilities we have defined
• MCMC is the random-walk algorithm
• Is the MC aperiodic? Irreducible?
• Why?

Problem with MCMC
• In general, no guarantees on convergence time
  – Even for those "nice" MCs
• A lot of work on characterizing "nicer" MCs
  – That allow fast convergence
• In practice, for the web graph it converges rather slowly
  – Why?

A different approach
• Reconsider the equation system
    PR(W) = (1 − d)·[PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))] + d·(1/N)
• A linear equation system!

Transition Matrix
• [Figure: a small example Web graph with its transition matrix T; only scattered matrix entries (0, 0.33, 0.5, 0.25, …) survive the extraction]
• T is a stochastic matrix

Eigenvector!
• PR (column vector) is the right eigenvector of the stochastic transition matrix
  – I.e., the adjacency matrix normalized so that every column sums to 1
• The Perron–Frobenius theorem ensures that such a vector exists
• Unique under the same assumptions as before

Direct solution
• Solving the equation set
  – Via, e.g., Gaussian elimination
• This is time-consuming
• Observation: the matrix is sparse
• So iterative methods work better here
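
For a toy instance, a hedged sketch of the direct approach with NumPy: rewrite the equations as (I − (1−d)·T)·PR = (d/N)·1 and hand them to a dense solver (Gaussian elimination under the hood). The 3-page graph and its column-stochastic matrix T are invented for illustration:

    import numpy as np

    T = np.array([[0.0, 0.5, 1.0],    # column j lists the out-links of page j;
                  [0.5, 0.0, 0.0],    # each column sums to 1
                  [0.5, 0.5, 0.0]])
    d, N = 0.15, T.shape[0]

    A = np.eye(N) - (1 - d) * T
    b = np.full(N, d / N)
    pr = np.linalg.solve(A, b)        # direct solve of the linear system
    print(pr, pr.sum())               # PageRank vector; the entries sum to 1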

Power method (sketch below)
• Start with some arbitrary rank vector R0
• Compute Ri = A·Ri−1
• If we happen to get to the eigenvector, we will stay there
• Theorem: the process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)
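
A minimal sketch of the power iteration, reusing the invented T, d, and N from the direct-solution sketch above; each loop step applies the damped update Ri = A·Ri−1:

    import numpy as np

    T = np.array([[0.0, 0.5, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
    d, N = 0.15, T.shape[0]

    r = np.full(N, 1.0 / N)           # arbitrary start: the uniform vector
    for _ in range(100):
        r = (1 - d) * (T @ r) + d / N  # one iteration of Ri = A·Ri-1
    print(r)                           # converges to the PageRank vector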

Power method (cont.)
• Every iteration is still "expensive"
• But since the matrix is sparse, it becomes feasible
• Still, a lot of tweaks and optimizations are needed to make it work efficiently

Other issues
• Accelerating computation
• Updates
• Distributed PageRank
• Mixed model (incorporating "static" importance)
• Personalized PageRank