Crawling and Ranking
- Slides: 51
HTML (HyperText Markup Language)
• Describes the structure and content of a (web) document
• HTML 4.01: most common version, a W3C standard
• XHTML 1.0: XML-ization of HTML 4.01, minor differences
• Validation (http://validator.w3.org/) against a schema. Checks the conformity of a Web page with respect to recommendations, for accessibility:
– to all graphical browsers (IE, Firefox, Safari, Opera, etc.)
– to text browsers (lynx, links, w3m, etc.)
– to all other user agents, including Web crawlers
The HTML language
• Text and tags
• Tags define structure
– Used for instance by a browser to lay out the document
• Header and Body
HTML structure
<!DOCTYPE html …>
<html lang="en">
<head>
<!-- Header of the document -->
</head>
<body>
<!-- Body of the document -->
</body>
</html>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example XHTML document</title>
</head>
<body>
<p>This is a <a href="http://www.w3.org/">link to the W3C</a></p>
</body>
</html>
Header
• Appears between the tags <head>...</head>
• Includes meta-data such as language, encoding…
• Also includes the document title
• Used by (e.g.) the browser to decipher the body
Body
• Between <body>...</body> tags
• The body is structured into sections, paragraphs, lists, etc.
<h1>Title of the page</h1>
<h2>Title of a main section</h2>
<h3>Title of a subsection</h3>
...
• <p>...</p> defines paragraphs
• More block elements such as table, list…
HTTP
• Application protocol
Client request:
GET /MarkUp/ HTTP/1.1
Host: www.google.com
Server response:
HTTP/1.1 200 OK
• Two main HTTP methods: GET and POST
GET
URL: http://www.google.com/search?q=BGU
Corresponding HTTP GET request:
GET /search?q=BGU HTTP/1.1
Host: www.google.com
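A GET request like the one above can be assembled mechanically from the URL. A minimal sketch with Python's standard library (the helper name `build_get_request` is our own):

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Build the raw HTTP/1.1 GET request line and headers for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # HTTP/1.1 requires the Host header; a blank line ends the headers.
    return f"GET {path} HTTP/1.1\r\nHost: {parts.netloc}\r\n\r\n"
```

For the slide's URL, `build_get_request("http://www.google.com/search?q=BGU")` yields exactly the request shown above.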
POST
• Used for submitting forms
POST /php/test.php HTTP/1.1
Host: www.bgu.ac.il
Content-Type: application/x-www-form-urlencoded
Content-Length: 100
…
Status codes
• The HTTP response always starts with a status code followed by a human-readable message (e.g., 200 OK)
• First digit indicates the class of the response:
1 Information
2 Success
3 Redirection
4 Client-side error
5 Server-side error
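The first-digit rule above translates directly into code; a minimal sketch (the table and function name are our own):

```python
# Class of an HTTP response, keyed by the first digit of the status code.
STATUS_CLASSES = {
    "1": "Information",
    "2": "Success",
    "3": "Redirection",
    "4": "Client-side error",
    "5": "Server-side error",
}

def status_class(code):
    """Map a numeric HTTP status code (e.g. 200) to its class."""
    return STATUS_CLASSES[str(code)[0]]
```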
Authentication
• HTTPS is a variant of HTTP that includes encryption, cryptographic authentication, session tracking, etc.
• It can be used instead to transmit sensitive data
GET ... HTTP/1.1
Authorization: Basic dG90bzp0aXRp
Cookies
• Key/value pairs that a server asks a client to store and retransmit with each HTTP request (for a given domain name)
• Can be used to keep information on users between visits
• Often what is stored is a session ID
– Connected, on the server side, to all session information
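On the server side, the retransmitted Cookie header can be parsed with Python's standard library; a minimal sketch (the cookie names and values are hypothetical):

```python
from http.cookies import SimpleCookie

# Parse a Cookie header as retransmitted by a client with each request.
cookie = SimpleCookie()
cookie.load("session_id=abc123; lang=en")

# The server typically uses the session ID to look up session state.
session_id = cookie["session_id"].value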
Crawling
Basics of Crawling
• Crawlers, (Web) spiders, (Web) robots: autonomous agents that retrieve pages from the Web
• Basic crawling algorithm:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (next slide)
4. Repeat on each found URL
Problem: the Web is huge!
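The four steps above can be sketched as a breadth-first traversal. A toy sketch over an in-memory link graph standing in for the Web (the page names and graph are hypothetical; a real crawler would fetch pages over HTTP):

```python
from collections import deque

# Toy link graph: page -> outgoing links (hypothetical pages).
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}

def crawl(seed):
    """Breadth-first crawl: process a page, discover its URLs, repeat on each new one."""
    seen, frontier, order = {seed}, deque([seed]), []
    while frontier:
        url = frontier.popleft()
        order.append(url)                # step 2: "retrieve and process" the page
        for link in LINKS.get(url, []):  # step 3: discover new URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)    # step 4: repeat on each found URL
    return order
```

The `seen` set is what prevents the crawler from revisiting pages forever on a cyclic graph.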
Discovering new URLs
• Browse the "internet graph" (following e.g. hyperlinks)
• Site maps (sitemap.org)
The internet graph
• At least 14.06 billion nodes = pages
• At least 140 billion edges = links
• Lots of "junk"
Graph-browsing algorithms
• Depth-first
• Breadth-first
• Combinations…
• Parallel crawling
Duplicates
• Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
• Trivial duplicates: same resource at the same canonized URL:
http://example.com:80/toto
http://example.com/titi/../toto
• Exact duplicates: identification by hashing
• Near-duplicates: (timestamps, tip of the day, etc.) more complex!
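Both URLs above canonize to the same address. A minimal canonization sketch (dropping the default port and resolving `..` segments; real crawlers apply many more normalization rules):

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonize(url):
    """Canonize a URL: drop the default HTTP port and normalize the path."""
    parts = urlsplit(url)
    netloc = parts.netloc
    if parts.scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]  # :80 is the default port for http
    # Resolve "." and ".." segments in the path.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))
```

Applied to the slide's two URLs, both come out as `http://example.com/toto`, exposing the trivial duplicate.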
Near-duplicate detection
• Edit distance
– Good measure of similarity
– Does not scale to a large collection of documents (unreasonable to compute the edit distance for every pair!)
• Shingles: two documents are similar if they mostly share the same succession of k-grams
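The shingle idea can be sketched as follows: take the set of word k-grams of each document and compare the sets with Jaccard similarity (a common choice, used here as an assumption; production systems hash and sample the shingles rather than compare full sets):

```python
def shingles(text, k=3):
    """Set of word k-grams (shingles) of a document."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def resemblance(a, b, k=3):
    """Jaccard similarity of the two shingle sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

Two documents differing only in a "tip of the day" share most shingles and score close to 1, while unrelated documents score close to 0.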
Crawling ethics
• robots.txt at the root of a Web server
User-agent: *
Allow: /searchhistory/
Disallow: /search
• Per-page exclusion (de facto standard):
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
• Per-link exclusion (de facto standard):
<a href="toto.html" rel="nofollow">Toto</a>
• Avoid Denial of Service (DoS): wait 100 ms/1 s between two repeated requests to the same Web server
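Python ships a robots.txt parser, so a polite crawler need not interpret the rules by hand. A sketch feeding it the exact rules from the slide:

```python
import urllib.robotparser

# Parse the robots.txt rules shown above and check what may be fetched.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /search",
])

allowed_history = rp.can_fetch("*", "/searchhistory/")  # matched by the Allow rule
allowed_search = rp.can_fetch("*", "/search")           # matched by the Disallow rule
```

In a real crawler, `rp.set_url(".../robots.txt")` and `rp.read()` would fetch the file from the server instead of parsing a literal list.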
Overview
• Crawl
• Retrieve relevant documents
– Can you guess how?
• To define relevance and to find relevant docs…
– We will discuss this later
• Rank
Ranking
Why Ranking?
• Huge number of pages
• Huge even if we filter according to relevance
– Keep only pages that include the keywords
• A lot of the pages are not informative
– And anyway it is impossible for users to go through 10K results
When to rank?
• Before retrieving results
– Advantage: offline!
– Disadvantage: huge set
• After retrieving results
– Advantage: smaller set
– Disadvantage: online, the user is waiting…
How to rank?
• Observation: links are very informative!
• Not just for discovering new sites, but also for estimating the importance of a site
• CNN.com has more links to it than my homepage…
• Quality and efficiency are key factors
Authority and Hubness
• Authority: a site is very authoritative if it receives many citations. A citation from an important site has more weight than a citation from a less-important site
A(v) = the authority of v
• Hubness: a good hub is a site that links to many authoritative sites
H(v) = the hubness of v
HITS
• Recursive dependency:
a(v) = Σ over edges (u,v) of h(u)
h(v) = Σ over edges (v,u) of a(u)
• Normalize (when?) each vector by the square root of the sum of squares of its authority/hubness values (its L2 norm)
• Start by setting all values to 1
– We could also add bias
• We can show that a(v) and h(v) converge
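The recursive updates above can be iterated directly. A minimal sketch (the example graph, node names, and iteration count are illustrative assumptions; real implementations iterate until the values stop changing):

```python
from math import sqrt

def hits(edges, nodes, iters=50):
    """HITS: a(v) sums h(u) over in-edges (u,v); h(v) sums a(u) over out-edges (v,u).
    Both vectors start at 1 and are L2-normalized after each update."""
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(iters):
        a = {v: sum(h[u] for (u, w) in edges if w == v) for v in nodes}
        norm = sqrt(sum(x * x for x in a.values())) or 1.0
        a = {v: x / norm for v, x in a.items()}
        h = {v: sum(a[w] for (u, w) in edges if u == v) for v in nodes}
        norm = sqrt(sum(x * x for x in h.values())) or 1.0
        h = {v: x / norm for v, x in h.items()}
    return a, h

# Toy graph: two hubs point at "auth"; "h1" also points at "other".
a, h = hits(edges=[("h1", "auth"), ("h2", "auth"), ("h1", "other")],
            nodes=["h1", "h2", "auth", "other"])
```

As expected, `auth` gets the highest authority (two hubs cite it) and `h1` the highest hubness (it links to both authorities).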
HITS (cont.)
• Works rather well if applied only on relevant web pages
– E.g. pages that include the input keywords
• The results are less satisfying if applied on the whole web
• On the other hand, online ranking is a problem
Google PageRank
• Works offline, i.e. computes for every web page a score that can then be used online
• Extremely efficient and high-quality
• The PageRank algorithm that we will describe here appears in [Brin & Page, 1998]
Random Surfer Model
• Consider a "random surfer"
• At each point, chooses a link and clicks on it
• A link is chosen with uniform distribution
– A simplifying assumption…
• What is the probability of being, at a random time, at a web page W?
Recursive definition
• If PageRank reflects the probability of being at a web page (PR(W) = P(W)), then
PR(W) = PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))
where W1, …, Wn are the pages linking to W and O(W) is the out-degree of W
Problems
• A random surfer may get stuck in one component of the graph
• May get stuck in loops
• "Rank Sink" problem
– Many Web pages have no inlinks/outlinks
Damping Factor
• Add some probability d for "jumping" to a random page
• Now
PR(W) = (1-d) · [PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))] + d·(1/N)
where N is the number of pages in the index
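The damped equation can be iterated directly over a link graph. A minimal sketch, keeping the slide's convention that d is the jump probability (the example graph and the handling of dangling pages, which jump uniformly, are our own assumptions):

```python
def pagerank(links, d=0.15, iters=100):
    """Iterate PR(W) = (1-d) * sum(PR(Wi)/O(Wi)) + d/N over in-neighbors Wi,
    where d is the probability of a random jump and N the number of pages."""
    nodes = list(links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: d / n for v in nodes}            # mass from random jumps
        for u, outs in links.items():
            if outs:
                share = pr[u] / len(outs)          # PR(u)/O(u) to each outlink
                for w in outs:
                    new[w] += (1 - d) * share
            else:                                  # dangling page: jump uniformly
                for w in nodes:
                    new[w] += (1 - d) * pr[u] / n
        pr = new
    return pr

# Toy graph: "a" is cited by both "b" and "c", so it should rank highest.
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

Since each iteration only redistributes probability mass, the scores always sum to 1.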
How to compute PR? • Simulation • Analytical methods – Can we solve the equations?
Simulation: A random surfer algorithm
• Start from an arbitrary page
• Toss a coin to decide if you want to follow a link or to randomly choose a new page
• Then toss another coin to decide which link to follow / which page to go to
• Keep a record of the frequency of the web pages visited
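The simulation above amounts to a random walk with coin tosses. A minimal sketch (the graph, step count, and seed are illustrative assumptions):

```python
import random

def random_surfer(links, d=0.15, steps=100_000, seed=0):
    """Monte Carlo PageRank estimate: with probability 1-d follow a random
    outlink, otherwise jump to a uniformly random page; count visits."""
    rng = random.Random(seed)
    nodes = list(links)
    visits = {v: 0 for v in nodes}
    page = rng.choice(nodes)                  # start from an arbitrary page
    for _ in range(steps):
        visits[page] += 1                     # record the visit
        outs = links[page]
        if outs and rng.random() > d:
            page = rng.choice(outs)           # coin 2: which link to follow
        else:
            page = rng.choice(nodes)          # coin 1 said: random jump
    return {v: c / steps for v, c in visits.items()}

freq = random_surfer({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

The visit frequencies approximate the same stationary distribution as the iterative computation, just more slowly.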
Convergence • Not guaranteed without the damping factor! • (Partial) intuition: if unlucky, the algorithm may get stuck forever in a connected component • Claim: with damping, the probability of getting stuck forever is 0 • More difficult claim: with damping, convergence is guaranteed
Markov Chain Monte Carlo (MCMC) • A class of very useful algorithms for sampling a given distribution • We first need to know what is a Markov Chain
Markov Chain
• A finite or countably infinite state machine
• We will consider the case of finitely many states
• Transitions are associated with probabilities
• Markovian property: given the present state, future choices are independent of the past
MCMC framework • Construct (explicitly or implicitly) a Markov Chain (MC) that describes the desired distribution • Perform a random walk on the MC, keeping track of the proportion of state visits – Discard samples made before “Mixing” • Return proportion as an approximation of the correct distribution
Properties of Markov Chains • A Markov Chain defines a distribution on the different states (P(state)= probability of being in the state at a random time) • We want conditions on when this distribution is unique, and when will a random walk approximate it
Properties
• Periodicity
– A state i has period k if any return to state i must occur in multiples of k time steps
– Aperiodic: period = 1 for all states
• Reducibility
– An MC is irreducible if it is possible (with positive probability) to eventually get from every state to every state
• Theorem: a finite-state MC has a unique stationary distribution if it is aperiodic and irreducible
Back to PageRank
• The MC is the graph with the transition probabilities we have defined
• MCMC is the random walk algorithm
• Is the MC aperiodic? Irreducible?
• Why?
Problem with MCMC • In general no guarantees on convergence time – Even for those “nice” MCs • A lot of work on characterizing “nicer” MCs – That will allow fast convergence • In practice for the web graph it converges rather slowly – Why?
A different approach
• Reconsider the equation system
PR(W) = (1-d) · [PR(W1)·(1/O(W1)) + … + PR(Wn)·(1/O(Wn))] + d·(1/N)
• A linear equation system!
Transition Matrix
T = (0 0.33 0 0 0.5 0.25 0 0)
A stochastic matrix
Eigenvector!
• PR (column vector) is the right eigenvector of the stochastic transition matrix
– I.e. the adjacency matrix normalized so that every column sums to 1
• The Perron–Frobenius theorem ensures that such a vector exists
• It is unique under the same assumptions as before
Direct solution
• Solving the equation system
– Via e.g. Gaussian elimination
• This is time-consuming
• Observation: the matrix is sparse
• So iterative methods work better here
Power method
• Start with some arbitrary rank vector R0
• Compute Ri = A·Ri-1
• If we happen to get to the eigenvector, we stay there
• Theorem: the process converges to the eigenvector!
• Convergence is in practice pretty fast (~100 iterations)
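The iteration Ri = A·Ri-1 is a few lines of code. A dense minimal sketch (the example matrix is our own, chosen to be column-stochastic; real implementations exploit sparsity instead of iterating over full rows):

```python
def power_method(T, iters=100):
    """Repeatedly apply the column-stochastic matrix T (given as a list of
    rows) to a rank vector; converges to the principal eigenvector."""
    n = len(T)
    r = [1.0 / n] * n                      # arbitrary starting vector R0
    for _ in range(iters):
        r = [sum(T[i][j] * r[j] for j in range(n)) for i in range(n)]  # Ri = T·Ri-1
    return r

# Illustrative 3-page chain: a -> b, b -> {a, c}, c -> a (columns sum to 1).
T = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]
rank = power_method(T)
```

For this chain the stationary distribution is (0.4, 0.4, 0.2), and 100 iterations are far more than enough to reach it.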
Power method (cont. ) • Every iteration is still “expensive” • But since the matrix is sparse it becomes feasible • Still, need a lot of tweaks and optimizations to make it work efficiently
Other issues
• Accelerating computation
• Updates
• Distributed PageRank
• Mixed model (incorporating "static" importance)
• Personalized PageRank