IR Implementation Issues Web Crawlers and Web Search

IR Implementation Issues, Web Crawlers and Web Search Engines University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval 8/28/97 Information Organization and Retrieval

Review
• Boolean Retrieval
• Ranked Retrieval
• Vector Space Model

[Figure: retrieval process diagram with the components Information need, text input, Parse, Query, Collections, Pre-process, Index, and Rank or Match.]

Boolean Model
[Figure: Venn diagram of three terms t1, t2, t3 with documents D1 to D11 placed in regions m1 to m8. Each region mi corresponds to one of the eight conjunctions of t1, t2, t3 in which every term is either present or negated.]

Boolean Searching
• Information need: “Measurement of the width of cracks in prestressed concrete beams”
• Concepts: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P)
• Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
• Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
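The relaxed query above is the OR of every AND-combination of three of the four concepts. A minimal Python sketch of that idea follows; the function names and the three sample documents are hypothetical and only illustrate the difference between the formal and the relaxed query.

```python
from itertools import combinations


def formal_match(doc_concepts, query_concepts):
    """Strict Boolean AND: every query concept must be present."""
    return set(query_concepts) <= set(doc_concepts)


def relaxed_match(doc_concepts, query_concepts, k):
    """OR over every AND-combination of k query concepts."""
    doc = set(doc_concepts)
    return any(set(combo) <= doc for combo in combinations(query_concepts, k))


# C = cracks, B = beams, W = width measurement, P = prestressed concrete
QUERY = ["C", "B", "W", "P"]

# Hypothetical documents, each reduced to the query concepts it mentions.
docs = {
    "doc_a": {"C", "B", "W", "P"},
    "doc_b": {"C", "B", "P"},
    "doc_c": {"C", "W"},
}

formal_hits = [name for name, c in docs.items() if formal_match(c, QUERY)]
relaxed_hits = [name for name, c in docs.items() if relaxed_match(c, QUERY, 3)]
print(formal_hits)
print(relaxed_hits)
```

With k = 3 the four combinations generated are exactly the four AND groups of the relaxed query, so a document missing one concept is still retrieved.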

Boolean Problems
• Disjunctive (OR) queries lead to information overload
• Conjunctive (AND) queries lead to reduced, and commonly zero, results
• Conjunctive queries imply a reduction in Recall

Advantages and Disadvantages of the Boolean Model
Advantages
• Complete expressiveness for any identifiable subset of the collection
• Exact and simple to program
• The whole panoply of Boolean Algebra available
Disadvantages
• Complex query syntax is often misunderstood (if understood at all)
• Problems of Null output and Information Overload
• Output is not ordered in any useful fashion

Boolean Extensions
• Fuzzy Logic
  – Adds weights to each term/concept
  – ta AND tb is interpreted as MIN(w(ta), w(tb))
  – ta OR tb is interpreted as MAX(w(ta), w(tb))
• Proximity/Adjacency operators
  – Interpreted as additional constraints on Boolean AND
• TOPIC system
  – Uses various weighted forms of Boolean logic and proximity information in calculating retrieval status values (RSVs)
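The fuzzy-logic interpretation of AND and OR can be sketched in a few lines of Python. The term weights below are made-up values for a single hypothetical document, not data from any real system.

```python
def fuzzy_and(*weights):
    """ta AND tb is interpreted as the minimum of the term weights."""
    return min(weights)


def fuzzy_or(*weights):
    """ta OR tb is interpreted as the maximum of the term weights."""
    return max(weights)


# Hypothetical term weights w(t) for one document.
w = {"cracks": 0.9, "beams": 0.4, "concrete": 0.0}

and_score = fuzzy_and(w["cracks"], w["beams"])
or_score = fuzzy_or(w["beams"], w["concrete"])
nested = fuzzy_or(fuzzy_and(w["cracks"], w["beams"]), w["concrete"])
print(and_score, or_score, nested)
```

Unlike strict Boolean matching, the result is a graded score, so documents can be ordered by how well they satisfy the query.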

Vector Space Model
• Documents are represented as vectors in term space
  – Terms are usually stems
  – Documents represented by binary vectors of terms
• Queries represented the same as documents
• Query and Document weights are based on length and direction of their vector
• A vector distance measure between the query and documents is used to rank retrieved documents

Documents in Vector Space
[Figure: documents plotted as points in a term space with axes t1, t2, t3.]

Vector Space Documents and Queries
[Figure: documents D1 to D11 plotted in the term space with axes t1, t2, t3.]

Similarity Measures
• Simple matching (coordination level match)
• Dice’s Coefficient
• Jaccard’s Coefficient
• Cosine Coefficient
• Overlap Coefficient
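For binary (set-based) term vectors, these five measures have standard textbook definitions in terms of the set of query terms X and the set of document terms Y. The sketch below uses those standard forms, which may differ in notation from the formulas on the original slide; the sample query and document are hypothetical, and both sets are assumed to be non-empty.

```python
import math


def simple_matching(x, y):
    """Coordination level match: number of shared terms."""
    return len(x & y)


def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    return len(x & y) / len(x | y)


def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))


def overlap(x, y):
    return len(x & y) / min(len(x), len(y))


# Hypothetical query and document term sets.
query = {"cracks", "beams", "concrete"}
doc = {"beams", "concrete", "steel", "bridges"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(query, doc), 4))
```

All five share the same numerator (the number of terms in common) and differ only in how they normalize for the sizes of the two sets.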

Vector Space with Term Weights and Cosine Matching
• Di = (di1, wdi1; di2, wdi2; …; dit, wdit)
• Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit)
• Example: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
[Figure: Q, D1, and D2 plotted as vectors on the axes Term A (horizontal) and Term B (vertical), each running from 0 to 1.0.]
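The example vectors can be ranked with the usual cosine formula (dot product divided by the product of the vector lengths). This is a small sketch using the Q, D1, and D2 values from the slide; the function name is an illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two weighted term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Weights are (Term A, Term B), as in the example above.
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sim_d1 = cosine_similarity(Q, D1)
sim_d2 = cosine_similarity(Q, D2)

# Rank documents by decreasing similarity to the query.
ranking = sorted([("D1", sim_d1), ("D2", sim_d2)], key=lambda p: p[1], reverse=True)
print(round(sim_d1, 2), round(sim_d2, 2), [name for name, _ in ranking])
```

D2 points in nearly the same direction as Q (cosine about 0.98) while D1 is further away (about 0.73), so D2 is ranked first.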

Problems with Vector Space
• There is no real theoretical basis for the assumption of a term space
  – it is more for visualization than having any real basis
  – most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
  – Terms are not independent of all other terms

Today
• Probabilistic Retrieval (Introduction)
• Processing Ranked Queries (the role of inverted files)
• Web Crawlers - Distributed indexing of the WWW
• Probabilistic Retrieval (Details)

Probabilistic Retrieval
• Goes back to the 1960s (Maron and Kuhns)
• Robertson’s “Probability Ranking Principle”
  – Retrieved documents should be ranked in decreasing probability that they are relevant to the user’s query.
  – How to estimate these probabilities?
• Several methods (Model 1, Model 2, Model 3) with different emphases on how estimates are done.

Probabilistic Models: Some Notation
• D = All present and future documents
• Q = All present and future queries
• (Di, Qj) = A document query pair
• x = class of similar documents
• y = class of similar queries
• Relevance (R) is a relation on document query pairs, i.e. a subset of D × Q

Probabilistic Models
• Model 1 -- Probabilistic Indexing, P(R|y, Di)
• Model 2 -- Probabilistic Querying, P(R|Qj, x)
• Model 3 -- Merged Model, P(R|Qj, Di)
• Model 0 -- P(R|y, x)
• Probabilities are estimated based on prior usage or relevance estimation

Probabilistic Models
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Relies on accurate estimates of probabilities for accurate results

Vector and Probabilistic Models
• Support “natural language” queries
• Treat documents and queries the same
• Support relevance feedback searching
• Support ranked retrieval
• Differ primarily in theoretical basis and in how the ranking is calculated
  – Vector assumes relevance
  – Probabilistic relies on relevance judgments or estimates

Web Search Engines
• Most include some version of Vector Space or extended Boolean
• Some offer both “ranked” and Boolean, but not together.
• Some engines (such as those based on the original WAIS) are little more than coordination-level matching for ranked retrieval.

Web Search Engines
• Some engines use added natural language processing techniques to identify concepts
  – Lycos is based on work by Michael Mauldin at CMU
  – Excite’s “concept-based” search may be a development of Latent Semantic Indexing
• Some search engines use probabilistic methods (with proprietary extensions)
  – Inktomi/HotBot uses a form of SLR.

Web Search Engines
• Exact algorithms are not available for commercial WWW search engines
• Many search engines appear to be hybrids offering both ranked and Boolean elements

Web Search Conclusions
• Web Search engines are stretching the performance limits of ranked retrieval algorithms
• Most Web search engines today attempt to combine the best features of ranked and Boolean searching
• There is still a long way to go before All and Only the Relevant web pages are retrieved in response to your query

Web Crawlers
• How do the web search engines get all of the items they index?
• How do you store millions of words from hundreds of sites so that you can find them quickly (and efficiently)?

Depth-First Crawling
[Figure: link graph of pages on Sites 1, 2, 3, 5, and 6, showing the order in which a depth-first crawl visits them.]

Breadth-First Crawling
[Figure: the same link graph of pages on Sites 1, 2, 3, 5, and 6, showing the order in which a breadth-first crawl visits them.]
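The two crawl orders differ only in how the frontier of unvisited links is managed: a stack gives depth-first order and a queue gives breadth-first order. The sketch below runs both over a small in-memory link graph; the page names and links are hypothetical (they are not the graph from the figure), and no network access is involved.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "site1/p1": ["site1/p2", "site2/p1"],
    "site1/p2": ["site1/p3"],
    "site2/p1": ["site3/p1"],
    "site1/p3": [],
    "site3/p1": ["site1/p1"],
}


def crawl(seed, links, depth_first):
    """Return pages in visit order, starting from seed."""
    frontier = deque([seed])
    seen = set()
    order = []
    while frontier:
        # Stack (LIFO) for depth-first, queue (FIFO) for breadth-first.
        page = frontier.pop() if depth_first else frontier.popleft()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        targets = links.get(page, [])
        if depth_first:
            # Push in reverse so the first-listed link is followed first.
            targets = list(reversed(targets))
        for target in targets:
            if target not in seen:
                frontier.append(target)
    return order


dfs_order = crawl("site1/p1", LINKS, depth_first=True)
bfs_order = crawl("site1/p1", LINKS, depth_first=False)
print(dfs_order)
print(bfs_order)
```

Depth-first follows one chain of links to its end before backing up, while breadth-first visits every page one link from the seed before any page two links away. The `seen` set keeps the crawler from looping on the link back to the seed page.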

Inverted Files
• We have seen “Vector files” conceptually. An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

How Inverted Files are Created
• Documents are parsed to extract words (or stems) and these are saved with the Document ID.
  – Doc 1: Now is the time for all good men to come to the aid of their country
  – Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

How Inverted Files are Created
• After all documents have been parsed the inverted file is sorted

How Inverted Files are Created
• Multiple term entries for a single document are merged and frequency information added

How Inverted Files are Created
• The file is split into a Dictionary and a Postings file
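The four steps (parse, sort, merge with frequencies, split) can be sketched in Python using the two example documents. This is an in-memory illustration only; the tokenizer, the variable names, and the (offset, count) layout of the dictionary are assumptions, not the format of any particular system.

```python
import re
from collections import Counter

DOCS = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}


def tokenize(text):
    """Lower-case the text and keep alphabetic words only (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


# Step 1: parse documents into (term, doc_id) pairs.
pairs = [(term, doc_id) for doc_id, text in DOCS.items() for term in tokenize(text)]

# Step 2: sort by term, then by document ID.
pairs.sort()

# Step 3: merge repeated (term, doc_id) entries and record their frequency.
merged = sorted(Counter(pairs).items())  # [((term, doc_id), freq), ...]

# Step 4: split into a dictionary (term -> [offset, number of docs])
# and a flat postings list of (doc_id, freq) entries.
dictionary = {}
postings = []
for (term, doc_id), freq in merged:
    if term not in dictionary:
        dictionary[term] = [len(postings), 0]
    dictionary[term][1] += 1
    postings.append((doc_id, freq))


def lookup(term):
    """Return the (doc_id, freq) postings for a term, or [] if absent."""
    offset, count = dictionary.get(term, (0, 0))
    return postings[offset:offset + count]


print(lookup("country"))
print(lookup("the"))
```

The dictionary stays small enough to search quickly, and each entry points to a contiguous run of postings that can be read in one step.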

Inverted files
• Permit fast search for individual terms
• The search result for each term is a list of document IDs (and optionally, frequency and/or positional information)
• These lists can be used to solve Boolean queries:
  – country: d1, d2
  – manor: d2
  – country AND manor: d2
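Solving a Boolean query then reduces to combining sorted lists of document IDs. The sketch below hard-codes postings for a few terms from the two example documents; the function names are illustrative choices.

```python
# Postings lists (sorted document IDs) for a few terms from Doc 1 and Doc 2.
POSTINGS = {
    "country": [1, 2],
    "manor": [2],
    "time": [1, 2],
    "night": [2],
}


def intersect(a, b):
    """AND: walk two sorted lists in step and keep the IDs found in both."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def union(a, b):
    """OR: every ID that appears in either list, in sorted order."""
    return sorted(set(a) | set(b))


country_and_manor = intersect(POSTINGS["country"], POSTINGS["manor"])
manor_or_time = union(POSTINGS["manor"], POSTINGS["time"])
print(country_and_manor)
print(manor_or_time)
```

Because both lists are sorted, the AND merge touches each posting at most once, which is why inverted files answer conjunctive queries quickly.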

Inverted Files
• Lots of alternative implementations
  – E.g.: Cheshire builds within-document frequency using a hash table during parsing
  – Document IDs and frequency info are stored in a B-tree index keyed by the term.
• See the chapter on inverted files in the reader for other implementations.

Probabilistic Models (Again)
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Relies on accurate estimates of probabilities for accurate results

Probabilistic Models: Logistic Regression
• Estimates for relevance are based on a log-linear model with various statistical measures of document content as independent variables.
• Log odds of relevance is a linear function of attributes.
• Term contributions are summed.
• Probability of relevance is recovered from the log odds by the inverse (logistic) transformation.
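In generic form, a logistic regression model of this kind can be written as follows. The symbols are assumptions made for illustration: X_i are the attribute measures, c_i the fitted coefficients, c_0 an intercept, and n the number of attributes. This is the standard form of the model, not necessarily the exact notation of the original slide.

```latex
\log O(R \mid Q, D) = c_0 + \sum_{i=1}^{n} c_i X_i
\qquad
P(R \mid Q, D) = \frac{1}{1 + e^{-\log O(R \mid Q, D)}}
```

The second equation is the first one inverted: it maps any real-valued log odds back to a probability between 0 and 1.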

Probabilistic Models: Logistic Regression attributes
• Average Absolute Query Frequency
• Query Length
• Average Absolute Document Frequency
• Document Length
• Average Inverse Document Frequency
• Number of Terms in common between query and document -- logged

Probabilistic Models: Logistic Regression
• Probability of relevance is based on logistic regression from a sample set of documents to determine values of the coefficients.
• At retrieval the probability estimate is obtained by applying the fitted coefficients to the 6 X attribute measures shown previously.
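At retrieval time the calculation is a weighted sum followed by the logistic transformation. In the sketch below the intercept, the six coefficients, and the attribute values are all invented for illustration; real coefficients would come from fitting the regression to a sample of documents with relevance judgments.

```python
import math


def probability_of_relevance(coefficients, intercept, attributes):
    """Logistic model: log odds are linear in the attribute measures."""
    if len(coefficients) != len(attributes):
        raise ValueError("need one coefficient per attribute")
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, attributes))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical fitted values, one coefficient per X attribute measure.
INTERCEPT = -3.5
COEFFS = [1.2, -0.3, 0.4, -0.1, 0.8, 1.5]

# Hypothetical attribute measures for two query-document pairs.
strong_match = [0.5, 1.0, 0.6, 2.0, 1.5, 1.1]
weak_match = [0.1, 1.0, 0.2, 2.0, 0.3, 0.0]

p_strong = probability_of_relevance(COEFFS, INTERCEPT, strong_match)
p_weak = probability_of_relevance(COEFFS, INTERCEPT, weak_match)
print(p_strong > p_weak)
```

Documents are then ranked by decreasing estimated probability, as the Probability Ranking Principle requires.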

Probabilistic Models
Advantages
• Strong theoretical basis
• In principle should supply the best predictions of relevance given available information
• Can be implemented similarly to Vector
Disadvantages
• Relevance information is required -- or is “guesstimated”
• Important indicators of relevance may not be terms, though only terms are usually used
• Optimally requires ongoing collection of relevance information