CS 430 INFO 430 Information Retrieval Lecture 13

CS 430 / INFO 430 Information Retrieval Lecture 13 Architecture of Information Retrieval Systems

Course Administration

Assignment 2

Deadline changed to midnight on Sunday, October 9.

There is major electrical work in Upson Hall on Saturday and most of the computers and labs will not be available.

Course Administration

Midterm Examination

Wednesday, October 12, 7:30 to 9:00, Upson B17

The topics to be examined are all lectures and discussion class readings before the midterm break.

See the Web site for a sample paper from a previous year.

See the Web site for instructions about laptop computers.

Course Administration

Discussion Class on October 19

This class will be held in Philips Hall 213.

Notation

Symbols used in the architecture diagrams:

• Docs: documents file
• Catalog: file of catalog records
• Index: searchable index
• UI: user interface service
• User interface
• Human action
• Automatic process
• Physical objects

Single Homogeneous Collection: Full Text Indexing

• Documents and indexes are held on a single computer system (may be several computers).
• Information retrieval uses a full text index, which may be tuned to the specific corpus.

Figure labels: Docs, Build index, Index, Search

Examples: SMART, Lucene
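The build-index and search steps on this slide can be illustrated with a minimal sketch. It assumes a toy in-memory corpus and a simple Boolean AND search; the function names (`tokenize`, `build_index`, `search`) and the sample documents are invented for illustration and are not how SMART or Lucene work internally.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build index: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Search: return ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample corpus.
docs = {
    1: "Information retrieval uses a full text index",
    2: "Catalog records describe physical documents",
    3: "A full text index can be tuned to the corpus",
}
index = build_index(docs)
print(sorted(search(index, "full text index")))  # [1, 3]
```

Tuning to a specific corpus would happen mainly in `tokenize`, for example by changing how terms are split or which terms are dropped.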

Single Homogeneous Collection: Use of Catalog Records

• Documents may be digital or physical objects, e.g., books.
• Documents are described by catalog records generated manually (or sometimes automatically).
• Information retrieval uses an index of catalog records.

Figure labels: Docs, Create catalog, Catalog, Build index, Index, Search

Example: Library catalog

Several Similar Collections: One Computer System

• Several more or less similar collections are held on a single computer system.
• Each collection is indexed separately using the same software, procedures, algorithms, etc. (but tuned for each collection, e.g., different stoplists).

Figure labels: Docs, Build indexes, Index, Search

Example: PubMed
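As a rough sketch of "same software, tuned for each collection", the example below runs one indexing function over two collections, each with its own stoplist. The collection names, documents, and stoplists are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(docs, stoplist):
    """Index one collection, skipping any term in that collection's stoplist."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            if term not in stoplist:
                index[term].add(doc_id)
    return index


# Hypothetical collections. "patient" is too common to be useful in the
# medical collection, but is a meaningful term in the legal one.
collections = {
    "medicine": {
        "docs": {"m1": "The patient study of a cell",
                 "m2": "Patient outcomes in the clinic"},
        "stoplist": {"the", "of", "a", "in", "patient"},
    },
    "law": {
        "docs": {"l1": "The court heard the patient appeal"},
        "stoplist": {"the", "of", "a", "in", "court"},
    },
}

# Same procedure for every collection; only the stoplist differs.
indexes = {
    name: build_index(c["docs"], c["stoplist"])
    for name, c in collections.items()
}

print("patient" in indexes["medicine"])   # False
print(sorted(indexes["law"]["patient"]))  # ['l1']
```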

Distributed Architecture: Standard Search Protocols

Figure labels: Index 1, Index 2

Strict adherence to standards allows any user interface to search any conforming search service.

Standard Search Protocols

Example: Z39.50 Family of Standards for Searching Library Catalogs

The Z39.50 family of standards has proved successful in a tightly knit community, where:

• There is a strong tradition of standardization, with many professionally trained people.
• The categories of material change gradually, allowing a slow-moving standardization process.

The standardization approach has failed where these two criteria are not met.

Historic note: WAIS was based on an early version of Z39.50.

Z39.50 principles

• Servers store a set of databases with searchable indexes.
• Interactions are based on a session.
• The client opens a connection with the server(s), carries out a sequence of interactions and then closes the connection.
• During the course of the session, both the server and the client remember the state of their interaction.

Z39.50: State

• The server carries out the search and builds a result set.
• The server saves the result set.
• Subsequent messages from the client can reference the result set.
• Thus the client can refine a large set with increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database.
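The idea of server-side state can be sketched with a toy in-process model. This is not the Z39.50 wire protocol or any real client library: `ToySession`, its method names, and the sample records are hypothetical stand-ins that only show how saved result sets let later requests avoid searching the whole database again.

```python
class ToySession:
    """In-process stand-in for a stateful search session (not the real protocol)."""

    def __init__(self, records):
        self.records = records    # record id -> text (the "database")
        self.result_sets = {}     # result set name -> list of record ids

    def search(self, name, term, within=None):
        """Run a search, save the matching ids under `name`, return the hit count.

        If `within` names an earlier result set, only that set is scanned,
        so a refinement does not touch the whole database.
        """
        candidates = self.result_sets[within] if within else list(self.records)
        hits = [rid for rid in candidates
                if term.lower() in self.records[rid].lower()]
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name, start, count):
        """Return `count` records from a saved set, starting at 1-based `start`."""
        ids = self.result_sets[name][start - 1:start - 1 + count]
        return [(rid, self.records[rid]) for rid in ids]

    def close(self):
        """End the session and discard the saved state."""
        self.result_sets.clear()


# Hypothetical records.
records = {
    1: "Introduction to information retrieval",
    2: "Library catalogs and information standards",
    3: "Retrieval of catalog records",
}
session = ToySession(records)
n_all = session.search("A", "information")              # builds result set A
n_refined = session.search("B", "catalog", within="A")  # refines A into B
first = session.present("B", 1, 1)                      # fetch one record from B
print(n_all, n_refined, first)
session.close()
```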

Standards

Z39.50 Family of Standards for Searching Library Catalogs

Content: Anglo American Cataloging Rules
Structure of Content: MARC
Encoding Rules: Base Encoding Rules (character sets, separators, etc.)
Message Passing Protocol: Z39.50
Query Format: Bib-1 (Boolean), Type 102 (full text)

In addition, there are the underlying network standards, e.g., the Internet suite of protocols.

Distributed Architecture: Meta-search (Broadcast Search)

• A user interface service broadcasts a query to several indexes and merges the results.
• Can be used with full text or catalogs.

Figure labels: Search, User interface service (UI), Searches, Index 1, Index 2, Index n

Example: Dienst

Distributed Architecture: Broadcast Search

Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet).

Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z39.50, http).

Distributed Architecture: Broadcast Search

Problems with Broadcast Search

• Performance: If any collection does not respond, the Interface Server waits for a time out.
• Recall: If any collection does not respond, documents in that collection are not found.
• Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections.

Broadcast searching is as bad as its weakest link!

Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization.
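The performance and recall problems can be seen in a small sketch. It assumes each collection is just a Python function that returns document ids, and that one of them is slow; all names and the timing values are invented. It deduplicates the merged results but makes no attempt at reconciling rankings, which the slide identifies as the hard part.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def broadcast_search(collections, query, timeout):
    """Send `query` to every collection; keep results from those that answer in time.

    Returns (merged document ids, names of collections that gave no answer).
    """
    pool = ThreadPoolExecutor(max_workers=len(collections))
    futures = {pool.submit(fn, query): name for name, fn in collections.items()}
    # Performance problem: if any collection is slow, we sit here for the full timeout.
    done, not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on collections that are still running

    merged = set()
    missing = [futures[f] for f in not_done]
    for future in done:
        if future.exception() is not None:
            missing.append(futures[future])  # a failed collection also costs recall
        else:
            merged.update(future.result())   # a set removes exact duplicates
    return sorted(merged), sorted(missing)


# Hypothetical collections: two respond at once, one is too slow.
def fast_a(query):
    return ["doc1", "doc2"]


def fast_b(query):
    return ["doc2", "doc3"]


def slow_c(query):
    time.sleep(1.0)
    return ["doc4"]


results, missing = broadcast_search(
    {"A": fast_a, "B": fast_b, "C": slow_c}, "retrieval", timeout=0.2
)
print(results)  # ['doc1', 'doc2', 'doc3']
print(missing)  # ['C']
```

Here "doc4" is never found because collection C missed the deadline, and the caller still had to wait out the full timeout before getting any answer.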

Union Catalog

• Catalog records from several libraries are merged into a single union catalog.
• Information retrieval uses an index of the records in the union catalog.

Figure labels: Docs, Create catalog records, Union Catalog, Build index, Index to Union Catalog, Search

Example: Harvard University's Hollis system

Use of Union Catalogs

Figure labels: Search, Index to Union Catalog, Retrieve, Union Catalog, Docs

Batch indexing: Metadata about all items is accumulated in a central system.

Real-time searching: The user (a) searches the central index, (b) retrieves catalog records, (c) retrieves documents from collections.

Building Union Catalogs

Harvesting

• Each collection makes a copy of its metadata (catalog records) available from a server associated with the collection.
• A search service harvests metadata from all collections on a regular cycle and builds a central search system.

Advantages...

• Can index material from databases without explicit URLs.
• Allows authentication and selection of material.

but...

• Requires that collections have metadata and support a harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting).
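A harvesting cycle can be sketched as follows. The sketch assumes toy in-memory repositories and date-based incremental harvesting; `ToyRepository`, `Harvester`, and the sample records are hypothetical and do not speak any real harvesting protocol.

```python
from datetime import date


class ToyRepository:
    """Stand-in for a collection's metadata server (hypothetical, in-memory)."""

    def __init__(self, records):
        self.records = records  # list of dicts with "id", "title", "datestamp"

    def list_records(self, since=None):
        """Return records whose datestamp is on or after `since` (all if None)."""
        return [r for r in self.records
                if since is None or r["datestamp"] >= since]


class Harvester:
    """Central search service that pulls metadata from every repository."""

    def __init__(self, repositories):
        self.repositories = repositories  # name -> ToyRepository
        self.central = {}                 # (name, record id) -> record
        self.last_harvest = {}            # name -> date of the previous harvest

    def harvest(self, today):
        """One cycle: fetch new or changed records and return how many arrived."""
        fetched = 0
        for name, repo in self.repositories.items():
            since = self.last_harvest.get(name)
            for record in repo.list_records(since=since):
                self.central[(name, record["id"])] = record
                fetched += 1
            self.last_harvest[name] = today
        return fetched


repo = ToyRepository([
    {"id": "x1", "title": "Thesis on indexing", "datestamp": date(2005, 9, 1)},
    {"id": "x2", "title": "Report on crawling", "datestamp": date(2005, 9, 20)},
])
harvester = Harvester({"repoX": repo})

first = harvester.harvest(today=date(2005, 10, 1))   # full harvest
repo.records.append(
    {"id": "x3", "title": "Notes on ranking", "datestamp": date(2005, 10, 5)}
)
second = harvester.harvest(today=date(2005, 10, 8))  # only the new record
print(first, second, len(harvester.central))  # 2 1 3
```

The central store `harvester.central` is what a search service would then index.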

OAI Verbs

• Identify – repository characteristics
• ListMetadataFormats – DC required
• ListSets – repository partitioning
• ListRecords – (selectively) harvest metadata
• ListIdentifiers – (selectively) harvest metadata identifiers
• GetRecord – known item retrieval
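In OAI-PMH each verb is sent as an ordinary HTTP request with the verb and its arguments as query parameters. The sketch below only builds such request URLs; the base URL, set name, and helper name `oai_request_url` are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListRecords", "ListIdentifiers", "GetRecord",
}

# Hypothetical repository address, for illustration only.
BASE_URL = "http://repository.example.org/oai"


def oai_request_url(base_url, verb, arguments=None):
    """Build the request URL for one OAI-PMH verb with optional arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    params.update(arguments or {})
    return base_url + "?" + urlencode(params)


# Ask the repository to describe itself.
print(oai_request_url(BASE_URL, "Identify"))

# Selective harvest: Dublin Core records in one set, changed since a given date.
print(oai_request_url(BASE_URL, "ListRecords", {
    "metadataPrefix": "oai_dc",
    "from": "2005-01-01",
    "set": "physics",
}))
```

The response to each request is an XML document, which a harvester would parse before adding the records to its central store.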

OAI-PMH

Key technical features

• Simple HTTP encoding
• Built on established XML standards
• Multiple metadata formats, but Dublin Core required
• Repository partitioning (sets)
• Selective harvesting (sets and dates)
• Clean partition between core and implementation-specific extensions
  – Multiple item-level metadata
  – Collection level metadata

Open Archives Initiative Protocol for Metadata Harvesting

See: http://www.openarchives.org/

Herbert Van de Sompel and Carl Lagoze, "The Santa Fe Convention of the Open Archives Initiative." D-Lib Magazine, 6(2), 2000. http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html

Web Searching: Architecture

• Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.)
• The central index is implemented as a single system on a very large number of computers.

Figure labels: Docs on Web server, Docs on Web server, Build index, Index to all Web pages, Search

Examples: Google, Yahoo!

Use of Web Search Service

Figure labels: Search, Index to all Web pages, Retrieve, Docs on Web server

Batch indexing: Each Web page is brought to the central location and indexed.

Real-time searching: The user (a) searches the central index, (b) retrieves documents (Web pages) from original location.

Web Searching: Building the Index

Documents are Web pages.

Each document is:

• identified by Web crawling
• copied to a central location
• indexed and added to the central index

After indexing, the documents are usually discarded, but a cached copy may be retained.

Web searching is the topic of Lectures 19-21 and Discussion Classes 9 and 10.
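The three steps (identify by crawling, copy, index) can be sketched against a tiny in-memory "Web". The URLs, page texts, and the `crawl_and_index` helper are all invented for illustration; a real crawler would fetch pages over HTTP and handle politeness, errors, and scale.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": (
        "Information retrieval course",
        ["http://a.example/lecture", "http://b.example/"],
    ),
    "http://a.example/lecture": ("Lecture on web crawling", ["http://a.example/"]),
    "http://b.example/": ("Crawling and indexing the web", []),
    "http://c.example/": ("Unlinked page about retrieval", []),
}


def crawl_and_index(seed, keep_cache=False):
    """Follow links breadth-first from `seed`, indexing each page that is reached."""
    index = defaultdict(set)  # term -> set of URLs
    cache = {}                # URL -> text, filled only if keep_cache is True
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]  # "copy to a central location"
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)    # "indexed and added to the central index"
        if keep_cache:
            cache[url] = text       # otherwise the copy is discarded
        for link in links:          # "identified by Web crawling"
            if link in TOY_WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return index, cache


index, cache = crawl_and_index("http://a.example/")
print(sorted(index["crawling"]))
# ['http://a.example/lecture', 'http://b.example/']
print("unlinked" in index)  # False: no link leads to http://c.example/
```

The page at `http://c.example/` is never indexed because no crawled page links to it, a small instance of the limitation that a crawler can only gather material whose URLs it can discover.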

Web Crawling

Advantages of Web crawling

• Entirely automatic, low cost. Highly efficient at gathering very large amounts of material.

but...

• Can only gather openly accessible materials.
• Cannot gather material in databases unless explicit URLs are known.
• Cannot easily make use of metadata provided by collections.

Standardization: Function Versus Cost of Acceptance

Figure axes: Cost of acceptance, Function. Regions labeled: Few adopters, Many adopters.

Example: Textual Mark-up

Figure axes: Cost of acceptance, Function. Points plotted: SGML, XML, HTML, ASCII.