CS 430 INFO 430 Information Retrieval Lecture 13
- Slides: 28
Lecture 13: Architecture of Information Retrieval Systems

Course Administration
Assignment 2: Deadline changed to midnight on Sunday, October 9. There is major electrical work in Upson Hall on Saturday, and most of the computers and labs will not be available.

Course Administration
Midterm Examination: Wednesday, October 12, 7:30 to 9:00, Upson B17
The topics to be examined are all lectures and discussion class readings before the midterm break.
See the Web site for a sample paper from a previous year.
See the Web site for instructions about laptop computers.

Course Administration
Discussion Class on October 19: This class will be held in Philips Hall 213.

Notation
[Diagram legend: documents file (Docs); file of catalog records (Catalog); searchable index (Index); user interface; user interface service (UI); physical objects; human action; automatic process]

Single Homogeneous Collection: Full Text Indexing
• Documents and indexes are held on a single computer system (may be several computers).
• Information retrieval uses a full text index, which may be tuned to the specific corpus.
[Diagram: Docs, Build index, Index, Search]
Examples: SMART, Lucene

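The build-index and search steps can be sketched in a few lines. This is a minimal illustration, not how SMART or Lucene work internally: the function names, the tiny corpus, and the stoplist are assumptions, and the search is a plain Boolean AND with no ranking.

```python
import re
from collections import defaultdict

def tokenize(text, stoplist=frozenset()):
    """Lowercase the text, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stoplist]

def build_index(docs, stoplist=frozenset()):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text, stoplist):
            index[term].add(doc_id)
    return index

def search(index, query, stoplist=frozenset()):
    """Boolean AND: return the ids of documents containing every query term."""
    terms = tokenize(query, stoplist)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical three-document corpus and stoplist, for illustration only.
STOPLIST = frozenset({"of", "a", "with"})
docs = {
    1: "Architecture of information retrieval systems",
    2: "Full text indexing of a document collection",
    3: "Information retrieval with a full text index",
}
index = build_index(docs, STOPLIST)
print(search(index, "information retrieval", STOPLIST))
```

"Tuning to the corpus" shows up here only as the choice of stoplist; a real system would also tune tokenization, stemming, and term weighting.
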
Single Homogeneous Collection: Use of Catalog Records
• Documents may be digital or physical objects, e.g., books.
• Documents are described by catalog records generated manually (or sometimes automatically).
• Information retrieval uses an index of catalog records.
[Diagram: Docs, Create catalog, Catalog, Build index, Index, Search]
Example: Library catalog

Several Similar Collections: One Computer System
• Several more or less similar collections are held on a single computer system.
• Each collection is indexed separately using the same software, procedures, algorithms, etc. (but tuned for each collection, e.g., different stoplists).
[Diagram: Docs, Build indexes, Index, Search]
Example: PubMed

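A sketch of "same software, different tuning": one indexing function is run over each collection with that collection's own stoplist. The collection names, documents, and stoplists below are invented for illustration and do not describe PubMed.

```python
import re
from collections import defaultdict

def build_index(docs, stoplist):
    """Same indexing code for every collection; only the stoplist differs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            if term not in stoplist:
                index[term].add(doc_id)
    return dict(index)

# Hypothetical collections. A word that is too common to be useful in one
# collection ("study" in the clinical one) may still be useful in another.
collections = {
    "clinical": {
        "docs": {"c1": "The patient study of aspirin",
                 "c2": "Patient outcomes in a cohort study"},
        "stoplist": {"the", "of", "in", "a", "patient", "study"},
    },
    "genetics": {
        "docs": {"g1": "The gene expression study",
                 "g2": "Gene regulation in yeast"},
        "stoplist": {"the", "in", "gene"},
    },
}

# One separate index per collection.
indexes = {name: build_index(c["docs"], c["stoplist"])
           for name, c in collections.items()}

def search(collection_name, term):
    """Look up a single term in the index of one collection."""
    return sorted(indexes[collection_name].get(term.lower(), set()))

print(search("genetics", "study"), search("clinical", "study"))
```
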
Distributed Architecture: Standard Search Protocols
[Diagram: Index 1, Index 2]
Strict adherence to standards allows any user interface to search any conforming search service.

Standard Search Protocols
Example: Z39.50 Family of Standards for Searching Library Catalogs
The Z39.50 family of standards has proved successful in a tightly knit community, where:
• There is a strong tradition of standardization, with many professionally trained people.
• The categories of material change gradually, allowing a slow-moving standardization process.
The standardization approach has failed where these two criteria are not met.
Historic note: WAIS was based on an early version of Z39.50.

Z39.50 Principles
• Servers store a set of databases with searchable indexes.
• Interactions are based on a session.
• The client opens a connection with the server(s), carries out a sequence of interactions, and then closes the connection.
• During the course of the session, both the server and the client remember the state of their interaction.

Z39.50: State
• The server carries out the search and builds a result set.
• The server saves the result set.
• Subsequent messages from the client can reference the result set.
• Thus the client can narrow a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again.

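The idea of a saved result set can be sketched as follows. This is not the Z39.50 protocol or its message formats: the `Session` class, its method names, and the substring matching are assumptions chosen only to show that later requests work on the saved set rather than on the database.

```python
class Session:
    """Hypothetical server-side session that remembers named result sets."""

    def __init__(self, records):
        self.records = records   # record id -> text
        self.result_sets = {}    # result set name -> list of record ids
        self.scans = 0           # number of full-database scans performed

    def search(self, name, term):
        """Scan the whole database once and save the matches under a name."""
        self.scans += 1
        self.result_sets[name] = [
            rid for rid, text in sorted(self.records.items())
            if term in text.lower()
        ]
        return len(self.result_sets[name])

    def refine(self, name, source, term):
        """Narrow a saved result set without rescanning the database."""
        self.result_sets[name] = [
            rid for rid in self.result_sets[source]
            if term in self.records[rid].lower()
        ]
        return len(self.result_sets[name])

    def present(self, name, start, count):
        """Return records from a saved set by position (1-based)."""
        ids = self.result_sets[name][start - 1:start - 1 + count]
        return [(rid, self.records[rid]) for rid in ids]

# Hypothetical catalog records.
session = Session({
    1: "Digital libraries and metadata",
    2: "Metadata harvesting protocols",
    3: "Web crawling",
    4: "Library catalog metadata",
})
print(session.search("A", "metadata"))     # builds and saves result set A
print(session.refine("B", "A", "librar"))  # works on A, not on the database
print(session.present("B", 2, 1))
```
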
Standards: Z39.50 Family of Standards for Searching Library Catalogs
Content: Anglo-American Cataloging Rules
Structure of content: MARC
Encoding rules: Base Encoding Rules (character sets, separators, etc.)
Message passing protocol: Z39.50
Query format: Bib-1 (Boolean), Type-102 (full text)
In addition, there are the underlying network standards, e.g., the Internet suite of protocols.

Distributed Architecture: Meta-search (Broadcast Search)
• A user interface service broadcasts a query to several indexes and merges the results.
• Can be used with full text or catalogs.
[Diagram: Search, UI (user interface service), Searches, Index 1, Index 2, ..., Index n]
Example: Dienst

Distributed Architecture: Broadcast Search
Interface service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet).
Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z39.50, HTTP).

Distributed Architecture: Broadcast Search
Problems with broadcast search
• Performance: If any collection does not respond, the interface server waits for a timeout.
• Recall: If any collection does not respond, documents in that collection are not found.
• Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections.
Broadcast searching is as bad as its weakest link!
Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization.

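The performance, recall, and ranking problems can be seen in a small simulation. Everything here is an assumption made for illustration: the collections are in-process functions with artificial delays, and the merge simply sorts by each collection's own score, which is exactly the step that is hard to justify in practice.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def make_collection(name, hits, delay=0.0):
    """Return a hypothetical search function for one collection."""
    def search(query):
        time.sleep(delay)  # simulate network and server latency
        return [(score, doc, name) for doc, score in hits.get(query, [])]
    return search

def broadcast_search(collections, query, timeout):
    """Send the query to every collection and merge what arrives in time."""
    pool = ThreadPoolExecutor(max_workers=len(collections))
    futures = {pool.submit(search, query): name
               for name, search in collections.items()}
    done, not_done = wait(list(futures), timeout=timeout)
    pool.shutdown(wait=False)  # do not block on collections still running
    merged = []
    for future in done:
        merged.extend(future.result())
    # Naive merge: treats scores from different collections as comparable.
    merged.sort(key=lambda hit: (-hit[0], hit[1], hit[2]))
    missing = sorted(futures[f] for f in not_done)
    return merged, missing

# Hypothetical collections; C is slower than the timeout.
collections = {
    "A": make_collection("A", {"retrieval": [("a1", 0.9), ("a2", 0.4)]}),
    "B": make_collection("B", {"retrieval": [("b1", 0.7)]}),
    "C": make_collection("C", {"retrieval": [("c1", 0.95)]}, delay=0.5),
}
results, missing = broadcast_search(collections, "retrieval", timeout=0.1)
print(results)
print("no response from:", missing)
```

The user waits for the full timeout because of C, and C's document, which carries the highest score, is missing from the merged list.
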
Union Catalog
• Catalog records from several libraries are merged into a single union catalog.
• Information retrieval uses an index of the records in the union catalog.
[Diagram: Docs, Create catalog records, Union Catalog, Build index, Index to Union Catalog, Search]
Example: Harvard University's Hollis system

Use of Union Catalogs
[Diagram: Search, Index to Union Catalog, Retrieve, Union Catalog, Docs]
Batch indexing: Metadata about all items is accumulated in a central system.
Real-time searching: The user (a) searches the central index, (b) retrieves catalog records, (c) retrieves documents from collections.

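The batch and real-time phases can be sketched as below. The library names, record fields, and helper functions are hypothetical; a real union catalog would also have to reconcile duplicate records for the same item, which this sketch ignores.

```python
import re
from collections import defaultdict

# Hypothetical catalog records held by two libraries, and the documents
# themselves, which stay with the library that owns them.
library_catalogs = {
    "lib_a": [{"id": "a-1", "title": "Information Retrieval"},
              {"id": "a-2", "title": "Digital Libraries"}],
    "lib_b": [{"id": "b-1", "title": "Modern Information Systems"}],
}
library_documents = {
    "lib_a": {"a-1": "<full text of a-1>", "a-2": "<full text of a-2>"},
    "lib_b": {"b-1": "<full text of b-1>"},
}

def build_union_catalog(catalogs):
    """Batch step: accumulate all records centrally, keyed by (library, id)."""
    union = {}
    for library, records in catalogs.items():
        for record in records:
            union[(library, record["id"])] = {"library": library, **record}
    return union

def build_index(union):
    """Batch step: index the title words of the union catalog records."""
    index = defaultdict(set)
    for key, record in union.items():
        for term in re.findall(r"[a-z0-9]+", record["title"].lower()):
            index[term].add(key)
    return index

def search(index, union, term):
    """Real-time steps (a) and (b): search the index, return catalog records."""
    return [union[key] for key in sorted(index.get(term.lower(), set()))]

def retrieve_document(record, documents):
    """Real-time step (c): fetch the document from its home collection."""
    return documents[record["library"]][record["id"]]

union = build_union_catalog(library_catalogs)
index = build_index(union)
for record in search(index, union, "information"):
    print(record["library"], record["title"],
          retrieve_document(record, library_documents))
```
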
Building Union Catalogs: Harvesting
• Each collection makes a copy of its metadata (catalog records) available from a server associated with the collection.
• A search service harvests metadata from all collections on a regular cycle and builds a central search system.
Advantages...
• Can index material from databases without explicit URLs.
• Allows authentication and selection of material.
but...
• Requires that collections have metadata and support a harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting).

OAI Verbs
• Identify – repository characteristics
• ListMetadataFormats – DC required
• ListSets – repository partitioning
• ListRecords – (selectively) harvest metadata
• ListIdentifiers – (selectively) harvest metadata identifiers
• GetRecord – known item retrieval

OAI-PMH Key Technical Features
• Simple HTTP encoding
• Built on established XML standards
• Multiple metadata formats, but Dublin Core required
• Repository partitioning (sets)
• Selective harvesting (sets and dates)
• Clean partition between core and implementation-specific extensions
  – Multiple item-level metadata
  – Collection level metadata

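"Simple HTTP encoding" means that a harvesting request is an ordinary URL with the verb and its arguments as query parameters. The sketch below only builds such a URL for selective harvesting by set and date and sends nothing over the network. The base URL and the set name are made up; the parameter names (`verb`, `metadataPrefix`, `set`, `from`, `until`) and the `oai_dc` prefix for Dublin Core follow the OAI-PMH specification.

```python
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="oai_dc",
                     set_spec=None, from_date=None, until_date=None):
    """Build a ListRecords request URL with optional set and date selection."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec       # repository partition to harvest
    if from_date:
        params["from"] = from_date     # only records changed on or after
    if until_date:
        params["until"] = until_date   # only records changed on or before
    return base_url + "?" + urlencode(params)

# Hypothetical repository address and set name.
url = list_records_url("http://repository.example.org/oai",
                       set_spec="physics", from_date="2005-10-01")
print(url)
```

A harvester on a regular cycle would pass the date of its previous run as `from_date`, so that each cycle transfers only the records that changed.
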
Open Archives Initiative Protocol for Metadata Harvesting
See: http://www.openarchives.org/
Herbert Van de Sompel and Carl Lagoze, "The Santa Fe Convention of the Open Archives Initiative." D-Lib Magazine, 6(2), 2000. http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html

Web Searching: Architecture
• Documents stored on many Web servers are indexed in a single central index. (This is similar to a union catalog.)
• The central index is implemented as a single system on a very large number of computers.
[Diagram: Docs on Web servers, Build index, Index to all Web pages, Search]
Examples: Google, Yahoo!

Use of Web Search Service
[Diagram: Search, Index to all Web pages, Retrieve, Docs on Web server]
Batch indexing: Each Web page is brought to the central location and indexed.
Real-time searching: The user (a) searches the central index, (b) retrieves documents (Web pages) from the original location.

Web Searching: Building the Index
Documents are Web pages. Each document is:
• identified by Web crawling
• copied to a central location
• indexed and added to the central index
After indexing, the documents are usually discarded, but a cached copy may be retained.
Web searching is the topic of Lectures 19-21 and Discussion Classes 9 and 10.

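The three steps can be sketched with a toy crawler. To keep the example self-contained, the "Web" is a hypothetical in-memory dictionary and `fetch` stands in for an HTTP request; link extraction by regular expression is a simplification that a real crawler would not rely on.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory Web: URL -> page content.
WEB = {
    "http://example.org/":
        'Course home <a href="http://example.org/a">lectures</a> '
        '<a href="http://example.org/b">readings</a>',
    "http://example.org/a":
        'Lectures on web crawling <a href="http://example.org/">home</a>',
    "http://example.org/b":
        'Readings on indexing <a href="http://example.org/missing">gone</a>',
    "http://example.org/orphan": "Crawling page with no inbound links",
}

def fetch(url):
    """Stand-in for an HTTP GET; returns None for pages that do not exist."""
    return WEB.get(url)

def crawl_and_index(seed, fetch):
    """Breadth-first crawl from a seed URL, indexing each page as it arrives."""
    index = defaultdict(set)
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        page = fetch(url)              # copy the page to the central location
        if page is None:
            continue
        links = re.findall(r'href="([^"]+)"', page)
        text = re.sub(r"<[^>]+>", " ", page)
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)       # index and add to the central index
        for link in links:             # identify further pages by crawling
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # The page copy is not stored: only the index entries are kept.
    return index

index = crawl_and_index("http://example.org/", fetch)
print(sorted(index["lectures"]))
```

The page at `http://example.org/orphan` is never indexed, because no crawled page links to it and its URL is not otherwise known to the crawler.
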
Web Crawling
Advantages of Web crawling
• Entirely automatic, low cost. Highly efficient at gathering very large amounts of material.
but...
• Can only gather openly accessible materials.
• Cannot gather material in databases unless explicit URLs are known.
• Cannot easily make use of metadata provided by collections.

Standardization: Function Versus Cost of Acceptance
[Figure: axes are cost of acceptance and function; regions are labeled "Few adopters" and "Many adopters"]

Example: Textual Mark-up
[Figure: SGML, XML, HTML, and ASCII placed on axes of cost of acceptance and function]