Introduction to CNIDR’s Isearch
Archie Warnock
warnock@clark.net
A/WWW Enterprises

Who is MCNC/CNIDR?
  • MCNC = Microelectronics Center of North Carolina
  • CNIDR = Clearinghouse for Networked Information Discovery and Retrieval
  • Originally funded by NSF to coordinate and produce network information tools
  • Now developing public domain and commercial search/retrieval tools

What is Isearch?
  • Isearch is the successor to freeWAIS
  • Isearch is a sophisticated full-text search and retrieval system
  • Isearch is a component of Isite, an implementation of the NISO standard protocol Z39.50 for information search and retrieval
  • ftp://ftp.cnidr.org/pub/NIDR.tools/Isearch
  • http://vinca.cnidr.org/software/Isearch.html

Terminology - I
  • Client/server - an architecture to allow communications between programs, possibly on different computers
  • Protocol - the communication “language” used by client and server programs
  • HTTP - the protocol used by WWW clients and servers
  • CGI - mechanism to process WWW forms

Terminology - II
  • Query - user-supplied search criteria
  • Full-text search - word-based search of all the text in a document
  • Fielded search - word-based search of text within only certain fields in a document
  • Z39.50 - a standard protocol for network-based document search and retrieval

System Components - I
  • Iindex, the text indexer - builds a searchable version of the document collection
      - Implements fast word-based searching
      - Document parser - recognizes start/end of individual documents
      - Field parser - recognizes start/end of fields within individual documents
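
The indexing steps on this slide can be illustrated with a small sketch. It is not Iindex's actual code (Isearch is written in C++, and the slides do not show its internals); the `<DOC>` record delimiter, the function names, and the sample collection are all assumptions made for illustration. The sketch assumes every word sits inside some field.

```python
import re
from collections import defaultdict

# Hypothetical record format (an assumption, not Iindex's own):
# <DOC>...</DOC> wraps a document, <NAME>...</NAME> wraps a field inside it.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)
WORD_RE = re.compile(r"[a-z0-9]+")


def parse_documents(text):
    """Document parser: return the body of each document in the collection."""
    return [m.group(1) for m in DOC_RE.finditer(text)]


def parse_fields(doc):
    """Field parser: return (field name, field text) pairs for one document."""
    return [(m.group(1).lower(), m.group(2)) for m in FIELD_RE.finditer(doc)]


def build_index(text):
    """Map (field, word) and ("*", word) to the set of matching document ids."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(parse_documents(text)):
        for field, value in parse_fields(doc):
            for word in WORD_RE.findall(value.lower()):
                index[(field, word)].add(doc_id)  # supports fielded search
                index[("*", word)].add(doc_id)    # supports full-text search
    return index


# Made-up two-document collection.
collection = """
<DOC><TITLE>Sparse matrix methods</TITLE><ABSTRACT>Fast solvers for linear systems</ABSTRACT></DOC>
<DOC><TITLE>Linear operators</TITLE><ABSTRACT>Notes on matrix norms</ABSTRACT></DOC>
"""
index = build_index(collection)
print(sorted(index[("*", "matrix")]))      # full-text lookup: any field
print(sorted(index[("title", "matrix")]))  # fielded lookup: title only
```

Because each word maps directly to the documents that contain it, a query becomes a dictionary lookup rather than a scan of the collection, which is one common way to get fast word-based searching.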

System Components - II
  • Isearch, the search engine - searches a document collection based on a user-supplied query
  • Command line
      - Primarily used for testing
  • WWW gateway (using CGI)
      - End-user search interface using forms
  • Z39.50 gateway

Isearch Capabilities
  • Fast full-text search
      - US AIDS Patent Collection - can search ~250,000 patents in < 1 second
  • Fielded search
      - Can restrict searches to title, author, abstract, other fields
  • Relevance ranking
      - Search “hits” are assigned scores & sorted
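
The slide says only that hits are scored and sorted; it does not give Isearch's scoring formula. The sketch below therefore swaps in a simple assumed rule (count the occurrences of the query words) purely to show the score-then-sort idea. The function names and sample documents are hypothetical.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9]+")


def score(query, text):
    """Assumed scoring rule: total occurrences of the query words in the text."""
    counts = Counter(WORD_RE.findall(text.lower()))
    return sum(counts[word] for word in WORD_RE.findall(query.lower()))


def ranked_hits(query, documents):
    """Return (score, doc_id) pairs for matching documents, best score first."""
    hits = [(score(query, text), doc_id) for doc_id, text in documents.items()]
    hits = [hit for hit in hits if hit[0] > 0]  # drop documents that do not match
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))


# Made-up documents for illustration.
documents = {
    "p1": "Matrix norms and matrix inequalities",
    "p2": "Eigenvalues of a sparse matrix",
    "p3": "Graph colouring",
}
for hit_score, doc_id in ranked_hits("matrix norms", documents):
    print(doc_id, hit_score)
```

Ties are broken by document id so that the ordering is deterministic; a real engine would likely weight terms rather than count them equally.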

Isearch Capabilities
  • Word truncation
      - A search for “matri*” matches “matrix” and “matrices”
  • Boolean functions
      - AND, OR and ANDNOT combinations of different fields
  • Customized presentation of results
  • Phrase searching (coming soon)
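
Truncation and the Boolean operators map naturally onto prefix matching and set operations. The following is a minimal sketch under that assumption, not Isearch's query syntax or implementation; the index contents and the `lookup` helper are hypothetical.

```python
# Hypothetical inverted index: (field, word) -> set of document ids.
index = {
    ("title", "matrix"): {1, 2},
    ("title", "matrices"): {3},
    ("title", "matroid"): {4},
    ("author", "smith"): {2, 3, 5},
}


def lookup(field, term):
    """Return ids of documents whose field contains a word matching the term.

    A trailing "*" truncates the term, so "matri*" matches every indexed
    word that starts with "matri".
    """
    hits = set()
    for (indexed_field, word), doc_ids in index.items():
        if indexed_field != field:
            continue
        if term.endswith("*"):
            matched = word.startswith(term[:-1])
        else:
            matched = word == term
        if matched:
            hits |= doc_ids
    return hits


title_hits = lookup("title", "matri*")  # matrix, matrices (not matroid)
author_hits = lookup("author", "smith")

print(sorted(title_hits & author_hits))  # AND
print(sorted(title_hits | author_hits))  # OR
print(sorted(title_hits - author_hits))  # ANDNOT
```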

Isearch Customization
  • What’s needed to customize Isearch?
      - Isearch is written in C++
      - Documents are C++ objects - data & procedures
      - Already have SGML & HTML, among others
      - Object technology allows code reusability, customizing only where differences from existing objects occur

Isearch Customization
  • What’s needed to make arbitrary documents searchable?
      - Code to parse documents
      - Code to parse fields
      - Code to build brief and full result records
  • Yes, it requires programming
  • But many of these are derived from existing procedures
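
The idea of deriving a new document type from an existing one can be sketched as follows. Isearch itself is C++ and its real class names are not given in these slides, so this is a Python illustration only: `TaggedDocument`, `BibliographicRecord`, and every method name are hypothetical. The `BB`, `ATL` and `AU` tags are borrowed from the sample bibliographic record shown later in the deck.

```python
import re


class TaggedDocument:
    """Hypothetical base type for documents marked up with <TAG>...</TAG> pairs."""

    record_tag = "DOC"     # tag that delimits one document
    title_field = "TITLE"  # field shown in the brief record

    def parse_documents(self, text):
        """Return the body of each document in the collection."""
        pattern = r"<{0}>(.*?)</{0}>".format(self.record_tag)
        return re.findall(pattern, text, re.DOTALL)

    def parse_fields(self, doc):
        """Return a dict of leaf field name -> field text for one document."""
        return dict(re.findall(r"<(\w+)>([^<]*)</\1>", doc))

    def brief_record(self, doc):
        """One-line summary of a hit."""
        return self.parse_fields(doc).get(self.title_field, "(untitled)")

    def full_record(self, doc):
        """All fields, one per line."""
        fields = self.parse_fields(doc)
        return "\n".join(f"{name}: {value}" for name, value in fields.items())


class BibliographicRecord(TaggedDocument):
    """Derived type: only the parts that differ from the base are overridden."""

    record_tag = "BB"
    title_field = "ATL"

    def brief_record(self, doc):
        fields = self.parse_fields(doc)
        return f"{fields.get('ATL', '(untitled)')} / {fields.get('AU', '(unknown)')}"


text = "<BB><ATL>Title text</ATL><AUG><AU>Author Name</AU></AUG></BB>"
doc_type = BibliographicRecord()
docs = doc_type.parse_documents(text)
briefs = [doc_type.brief_record(doc) for doc in docs]
print(briefs)
```

The derived class reuses the document parser, the field parser and the full-record builder unchanged, and supplies only its own tags and brief-record layout.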

Customization Example - Linear Algebra
  • Inputs
      - SGML-tagged bibliographic records
      - TeX preprints
  • Requirements
      - Field searching on title, author, abstract
      - Full-text search of preprints
      - WWW-based interface

Customization Example - Linear Algebra
  • End products
      - HTML-tagged “brief records” - title, author and links to full bibliographic records and preprints
      - HTML-formatted bibliographic records for display in WWW browser
      - Preprints for display or retrieval to local storage

Customization Example - Linear Algebra
  • Sample Bibliographic Record

    <BB>
      <AID>####</AID>
      <VOL>##</VOL>
      <ISS>##</ISS>
      <ATL>Title text</ATL>
      <AUG>
        <AU>Author Name</AU>
      </AUG>
      <ABS>Abstract text</ABS>
      <PPX>Preprint.filename</PPX>
      <PGR>###-###</PGR>
    </BB>
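
To make the record-to-brief-record step concrete, here is a rough sketch that parses a record of this shape and emits an HTML “brief record” with title, author and two links. It is an illustration, not the code written for the project: the function names, the `records/` and `preprints/` URL prefixes, and the numeric values filled in for the `#` placeholders are all made up.

```python
import re
from html import escape

# The sample record, with made-up values in place of the "#" placeholders.
SAMPLE = """<BB>
<AID>0001</AID> <VOL>12</VOL> <ISS>3</ISS>
<ATL>Title text</ATL>
<AUG> <AU>Author Name</AU> </AUG>
<ABS>Abstract text</ABS>
<PPX>Preprint.filename</PPX>
<PGR>101-110</PGR>
</BB>"""

# Matches leaf elements only, i.e. tags whose content holds no nested tag.
LEAF_RE = re.compile(r"<(\w+)>([^<]*)</\1>")


def parse_record(record):
    """Collect leaf fields; repeated tags such as <AU> become lists."""
    fields = {}
    for name, value in LEAF_RE.findall(record):
        fields.setdefault(name, []).append(value.strip())
    return fields


def brief_record_html(fields, record_base="records/", preprint_base="preprints/"):
    """Build an HTML list item with title, authors and two links.

    The URL prefixes are hypothetical defaults chosen for this sketch.
    """
    title = escape(fields["ATL"][0])
    authors = escape(", ".join(fields.get("AU", [])))
    record_url = escape(record_base + fields["AID"][0] + ".html", quote=True)
    preprint_url = escape(preprint_base + fields["PPX"][0], quote=True)
    return (
        f"<li><b>{title}</b>, {authors} "
        f'[<a href="{record_url}">record</a>] '
        f'[<a href="{preprint_url}">preprint</a>]</li>'
    )


fields = parse_record(SAMPLE)
brief = brief_record_html(fields)
print(brief)
```

Escaping the field text before it is placed in the HTML keeps characters such as `<` or `&` in a title from breaking the result page.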

Customization Example - Linear Algebra
  • Isearch Modifications
      - ~1 week coding and testing, mostly in developing presentation customizations
      - Additional work to develop ingest and on-the-fly formatting scripts, code deployment at ESI
      - Now have basic code to handle SGML documents using Elsevier DTD