Advanced information retrieval Computer engineering department Chapter 13






































- Slides: 38

Advanced information retrieval Computer engineering department Chapter 13 – Web IR

Searching the web • Introduction (History, Definitions, Aims, Statistics) • Tasks of search engines - Gathering - Indexing - “Searching” (Querying and ranking algorithms) - Document and query management • Metasearch • Browsing and web directory • Users and web search • Research issues

Web Searching and Classical IR TREC Classic IR research 1970 s 1980 s 1990 s then came the web searching 2000 s

Terminology and Definitions • A web page corresponds to a document in traditional IR • Web pages are different in their size and in the types of • • • files that constitute them – text, graphics, sound, video, GIF, JPEG, ASCII, PDF, etc IR on the web considers as collection of documents the part of the web that is publicly indexable – exclude pages that cannot be indexed (authorization, dynamic pages) Location of web pages by navigation Location of web pages by searching (IR on the web)

Challenges for web search • Problem with the data – Distributed data – High percentage of volatile data – Large volume – Unstructured data – Redundant data – Quality of data – Heterogeneous data • Problem faced by users – How to specify a query? – How to interpret answers?

Search engines • Portals – identification of real names – links to books from Amazon. com – send electronic postcards – translation to other languages – search of other media (metadata) – language-specific searches – weather, stock price, traffic • Business model – targeted advertising with revenue from clickthroughs – word of mouth (since no industry standard evaluation) – fast response (<1 s) 24 hours / 7 days a week – filter out spam

Examples Search engines Alta. Vista Excite Google Infoseek Lycos Northern. Light • • • URL www. altavista. com www. excite. com www. google. com www. infoseek. com www. lycos. com www. nlsearch. com 25 -55% of web covered only! Some search engines powered by same IR engine Most based in US, English

Other search engines • Specialised in different countries/languages (e. g. , http: //www. iol. it/) • Specific – Rank according to popularity (e. g. , Direct. Hit) – Topic oriented (e. g. , Search. Broker) – Personnel or institutional pages, electronic mail addresses, images, software applets

Tasks of a web search engine • Document gathering – select the documents to be indexed • Document indexing – represent the content of the selected documents – often 2 indices maintained (full + small for frequent queries) • Searching – represent the user information need into a query – retrieval process (search algorithms, ranking of web pages) • Document and query management – display the results – virtual collection (documents discarded after indexing) vs. physical collection (documents maintained after indexing)

Document gathering • Document gathering = crawling the web • Crawler – Robot, spider, wanderer, walker, knowbot, web search agent – Program that traverses the web to send new or updated pages to be indexed – Run on local server and send requests to remote servers

Crawling the web (1) • Crawling process – Start with set of URLs • Submitted by users or companies • Popular URLs – Breath-first or depth-first – Extract further URLs • Up to 10 millions pages per day • Several crawlers – Problem of redundancy – Web partition robot per partition

Crawling the web (2) • Up-to-date? – Non-submitted pages need up to 2 months to be indexed – Search engine learns change frequency of pages – Popular pages (having many links to them) crawled more frequently • Guideline for robot behaviours – File placed at root of web server – Indicate web pages that should not be indexed – Avoid overloading servers/sites

Typical anatomy of a large-scale crawler.

Document indexing • Document indexing = building the indices • Indices are variant of inverted files – – meta tag analysis stop words removal + stemming position data (for phrase searches) weights • tf x idf; • downweight long URLs (not important page) • upweight terms appearing at the top of the documents, or emphasised terms – use de-spamming techniques • hyperlink information • count link popularity • anchor text from source links • hub and authority value of a page

Maintaining indices over dynamic collections.

Stop-press index • Collection of document in flux – – – Model document modification as deletion followed by insertion Documents in flux represented by a signed record (d, t, s) “s” specifies if “d” has been deleted or inserted. • Getting the final answer to a query – Main index returns a document set D 0. – Stop-press index returns two document sets • D+ : documents not yet indexed in D 0 matching the query • D- : documents matching the query removed from the collection since D 0 was constructed. • Stop-press index getting too large – Rebuild the main index • signed (d, t, s) records are sorted in (t, d, s) order and mergepurged into the master (t, d) records – Stop-press index can be emptied out.

Searching • Querying – – – 1 word or all words must be in the retrieved pages normalisation (stop words removal, stemming, etc) complex queries (date, structure, region, etc) Boolean expressions (advanced search) metadata • Ranking algorithms – – use of web links web page authority analysis • HITS (Hyperlink Induced Topic Search) • Page. Rank (Google)

Use of web links • Web link: represent a relationship between the connected pages • The main difference between standard IR algorithms and web IR algorithms is the massive presence of web links – web links are source of evidence but also source of noise – classical IR: citation-based IR – web track in TREC, 2000, TREC-9: Small Web task (2 GB of web data); Large Web task (100 GB of web data, 18. 5 million documents)

Use of anchor text • represent referenced document – why? • provides more accurate and concise description than the page itself • (probably) contains more significant terms than the page itself – used by ‘WWW Worm’ - one of the first search engines 1994 – representation of images, programs, … • generate page descriptions from anchor text

Algorithms • Query independent page quality – global analysis • Page. Rank (Google): simulates a random walk across the web and computes the “score” of a page as the probability of reaching the page • Query dependent page quality – local analysis • HITS (Hyperlink Induced Topic Search): focusses on broad topic queries that are likely to be answered with too many pages – the more a page is pointed to by other pages, the more popular is the page – popular pages are more likely to include relevant information than nonpopular pages

Page. Rank (1) • Designed by Brin and Page at Stanford University • Used to implement Google • Main idea: – a page has a high rank if the sum of the ranks of its in-links is high • in-link of page p: a link from a page to page p • out-link of a page p: a link from page p to a page – a high Page. Rank page has many in-links or few highly ranked in-links • Retrieval: use cosine product (content, feature, term weight) combined with Page. Rank value

Page. Rank (2) • Random Surfer Model : user randomly navigates – Initially the surfer is at a random page – At each step the surfer proceeds • to a randomly chosen Web page with probability d called the “damping factor” • (e. g. probability of random jump = 0. 2) to a randomly chosen page linked to the current page with probability 1 - d (e. g. probability of following a random outlink = 0. 8) • Process modelled by Markov Chain – Page. Rank PR of a page a = probability that the surfer is at page a on a given time PR(a) = Kd + K(1 -d) i=1, n PR(ai)/C(ai) d set by system K normalisation factor a = page pointed by ai for i=1, n C(ai) = number of outlinks of ai

HITS: Hypertext Induced Topic Search • Originated from Kleinberg, 1997 • Also referred to as the “The Connectivity Analysis Approach” • Broad topic queries produce large sets of retrieved results – abundance problem too many relevant documents – new type of quality measure needed distinguish the most “authoritative” pages high-quality response to a broad query • HITS: for a certain topic, it identifies – good authorities • pages that contain relevant information (good sources of content) – good hubs • page that point to useful pages (good sources of links)

HITS (2) • Intuition – authority comes from inlinks – being a good hub comes from outlinks – better authority comes from inlinks from good hubs – being a better hub comes from outlinks to good authorities • Mutual reinforcement between hubs and authorities – a good authority page is pointed to by many hub pages – a good hub page point to many authority pages

HITS (3) • Set of pages S that are retrieved – sometimes k (e. g. k = 200) top-ranked pages • Set of pages T that point to or are pointed to by retrieved set of pages S – rank pages according to in_degree (number of in-links) - not effective • Set of pages T : – authoritative pages relevant to query should have a large in_degree – considerable overlap in the sets of pages that point to them hubs

Algorithm for HITS (General Principle) • Computation of hub and authority value of a page through the iterative propagation of “authority weight” and “hub weight” • Initially all values equal to 1 • Authority weight of page x(p) – if p is pointed to by many pages with large y-values, then it should receive a large x-value • • x(p) = qi p y(qi) Hub weight of page y(p) – if p points to many pages with large x-values, then it should receive a large y-value y(p) = p qi x(qi) After each computation (iteration), weights are normalised

Topic distillation • Process of finding ‘quality’ (authority - hub) Web documents related to a query topic, given an initial user information need. • Extensions of HITS – ARC (Automatic Resource Compilation) • distance-2 neighbourhood graph • anchor (& surrounding) text used in computation of hub and authority values – SALSA, etc. • Problems with HITS – – mutual reinforcing relationship between hosts automatically generated links non relevant highly connected pages topic drift: generalisation of the query topic

Difference between Page. Rank and HITS • The Page. Rank is computed for all web pages stored in the database and then prior to the query; HITS is performed on the set of retrieved web pages, and for each query. • HITS computes authorities and hubs; Page. Rank computes authorities only. • Page. Rank: non-trivial to compute, HITS: easy to compute, but real-time execution is hard • Implementation details of Page. Rank have been

Document and query management • Results – – – Usually screens of 10 pages Clustering URL, size, date, abstract, etc Various sorting Most similar documents options Query refinement • Virtual collection vs. physical collection – document can change over time – document may be different to the one that has been indexed – broken link

Metasearch (1) • Problems of Web search engines: – – – limited coverage of the publicly indexable Web index different overlapping sections of the Web based on different IR models different results to the same query users do not have the time, knowledge to select the most appropriate search engines with regard to their information need • Possible solution: metasearch engines – Web server that sends query to • Several search engines, Web directories, Databases – Collect results – Unify (merge) them - Data fusion • Aim: better coverage, increase effectiveness

Metasearch (2) • Divided into phases – search engine selection • topic-dependent, past queries, network traffic, … – document selection • how many documents from each search engine? – merging algorithm • utilise rank positions, document retrieval scores, … Metasearcher URL used Meta. Crawler www. metacrawler. com Dogpile www. dogpile. com 25 Savvy. Search www. search. com Sources 13 > 1000

Meta. Crawler

Browsing • Web directory – Catalog, yellow pages, subject categories – Many standard search engines provide subject categories now • Hierarchical taxonomy that classifies human knowledge Arts & Humanities Automotive Business & Economy Computers & Internet Education Employment Entertainment & Leisure Games Government Investing Kids & Family Life & Style News People Philosophy & Religion Politics Science & Technologies Social Science …

Yahoo! • Around one million classified pages • More than 14 country specific directories (Chinese, Danish, French, …) • Manual classification (has its problem) • Acyclic graphs • Search within the classified pages/taxonomy • Problem of coverage

Users and the web (1) • Queries on the web: average values Measure Average value Range Number of words 0 - 393 Number of operators 0 - 958 Repetitions of queries 1 - 1. 5 million Queries per user session 2. 35 0. 41 3. 97 2. 02

Users and the web (2) • Some statistics – Main purpose: research, leisure, business, education, • products and services (e-commerce) • people and company names and home pages • factoids (from any one of a number of documents) • entire, broad documents • mp 3, image, video, audio – – 80% do not modify query 85% look first screen only 64% queries are unique 25% users use single keywords (problem for polysemic words and synonyms) – 10% queries are empty

Users and the web (3) • Results of search engines are identical, independent of – user – context in which the user made the request • adding context information for improving search results it directly focus on the user need answer – explicit context • query + category – implicit context • based on documents edited or viewed by user – personalised search • previous requests and interests, user profile

Research issues • Modelling • Querying • Distributed architecture • Ranking • Indexing • Dynamic pages • Browsing • User interface • Duplicated data