How to Find a Needle in the Haystack


Adrian Stevenson
Learning Technology Services, University of Manchester
Institutional Web Management Workshop 2005
Parallel Session, 4 pm - 5.30 pm, Wednesday 6th July 2005
Combining the strengths of UMIST and The Victoria University of Manchester

Overview
• Introduction to cross searching / metasearch
• The problem – why metasearch?
• JISC Information Environment
• Quick introduction to XML and Web Services
• Metasearch technologies – Z39.50, SRU/SRW, OAI
• Metasearch issues
• NISO Metasearch Initiative
How to Find a Needle in the Haystack IWMW 2005 6th June 2005

Cross Searching
• Cross searching has many names:
  – Metasearch
  – Distributed search
  – Parallel search
  – Federated search
  – Broadcast search
  – Cross-database search
• The common theme is allowing search and retrieval to span multiple databases, sources, platforms, protocols, and vendors at once

The Problem
• Web users such as researchers or tutors frequently require information from a variety of different sources
• Users are required to search many different service interfaces, each with a different look and feel, metadata and subject classifications
• The results are almost always supplied in HTML, which makes them difficult to merge
• Users search many services and portals such as the RDN, zetoc and COPAC, image resources, e-prints, learning objects, external and internal resources
• If a user wants to obtain a local copy of the range of search results, they often have to merge the results themselves, for example by creating a text file

JISC Information Environment
• Cross searching is at the core of the JISC IE
• JISC notes that considerable investment has been made to provide high-quality digital information resources
• But students, lecturers and researchers are faced with a vast and sometimes bewildering range of sources of electronic information
• Each source has its own name, interface, features and search facilities
• Users remain unaware of their existence or fail to discover their value for their own learning, teaching or research
• A key challenge is therefore to achieve a managed, coherent and shared information environment that will overcome these obstacles

JISC: Helping Users find digital information
• Being able to cross-search will considerably simplify users’ interactions with online resources
• This should encourage take-up and greatly improve means of accessing these resources
• Institutions will be able to incorporate these services within their own institutional online environments, presenting local content alongside nationally provided resources
• A second aspect relates to making the Information Environment actually work
• Making the Information Environment work requires the implementation of a range of commonly agreed technical standards and protocols

JISC IE Technical Architecture
• “The JISC Information Environment technical architecture specifies a set of standards and protocols that support the delivery of integrated networked services that allow the end-user to discover, access, use and publish digital and physical resources”

Metasearch Technologies
• Two main approaches:
• Real-time cross searching
  – Z39.50
  – Search and Retrieve URL / Web Service – SRU/SRW
• Harvesting
  – Open Archives Initiative Protocol for Metadata Harvesting – OAI-PMH

Metasearch Technologies
• Other approaches:
• Hybrid
  – Combination of Z39.50, SRU/SRW, and OAI and …
• Screen scraping
  – Parsing the HTML to find patterns or parts of content
  – Screen scraping is an ad hoc technique that depends on a consistent format for the data being scraped
  – Regular expressions are used for screen scraping. Perl has strong support for regular expressions
  – grep
  – Difficult, unreliable and laborious
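
To make the fragility of screen scraping concrete, here is a minimal sketch in Python rather than Perl. The HTML fragment, the `result` class name and the `scrape` helper are all invented for illustration and do not come from any real service. The pattern works only while the markup keeps exactly the same shape.

```python
import re

# Hypothetical HTML as a search service might return it (invented for illustration).
html = """
<ul>
  <li class="result"><a href="/rec/1">Needles in haystacks</a></li>
  <li class="result"><a href="/rec/2">Metasearch basics</a></li>
</ul>
"""

# The pattern is tied to the exact markup above: one class name, one attribute order.
pattern = re.compile(r'<li class="result"><a href="([^"]+)">([^<]+)</a></li>')


def scrape(page: str) -> list[tuple[str, str]]:
    """Return (url, title) pairs for every list item the pattern recognises."""
    return pattern.findall(page)


results = scrape(html)

# A purely cosmetic redesign (one extra attribute) silently breaks the scraper.
redesigned = html.replace('class="result"', 'class="result" id="r"')
broken = scrape(redesigned)

print(results)
print(broken)
```

The second call returns nothing at all, with no error raised, which is the practical meaning of "difficult, unreliable and laborious".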

Z39.50
• ANSI/NISO Z39.50-2003 Information Retrieval: Application Service Definition & Protocol Specification
• The National Information Standards Organization (NISO) is an American National Standards Institute (ANSI) accredited standards developer that serves the library, information, and publishing communities

Z39.50
• Z39.50 is designed to enable communication between computers, typically those used to manage library catalogues
• A portal can send a real-time query to a number of Z39.50-enabled content providers, and a result set is returned to the user
• The AHDS Gateway, physically based in London, uses Z39.50 to query five different databases containing information on archaeology (York), history (Colchester), the performing arts (Glasgow), the visual arts (Newcastle), and textual studies (Oxford)
• The five databases are driven by different database management software and run on a variety of hardware platforms. Z39.50 enables searches across the five sites
• Library OPACs and desktop applications such as EndNote can also be used to search Z39.50 targets

AHDS

Z39.50
• Z39.50 employs a client/server model
• One computer, the client or, in Z39.50 terms, the ‘Origin’, submits a request to another computer, the server or ‘Target’, which then services the request and returns an answer
• Queries can be sent to multiple databases simultaneously to cross search
• Records can be returned in a number of formats or ‘syntaxes’ as requested by the client. These typically include:
  – MARC (Machine Readable Cataloging)
  – SUTRS (Simple Unstructured Text Record Syntax)
  – Raw ASCII text file
  – XML (eXtensible Markup Language)

What is XML? Some possible definitions?
• a technology for the management, display and organisation of data
• a programming language
• a markup language used to describe the structure of data
• not really a language
• a standard for creating languages that meet the XML criteria

XML: elements
  <tag> content </tag>
  <language> English </language>

XML must be well formed
• a root element is required
  <ead> … all your tags and content … </ead>
• closing tags are required
• tags must be properly nested
• case matters
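
Each of these rules can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree` parser, which rejects any document that is not well formed. The sample strings and the `is_well_formed` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One passing sample, then one violation of each rule on the slide.
samples = {
    "ok": "<ead><language>English</language></ead>",
    "two roots": "<ead></ead><ead></ead>",
    "missing closing tag": "<ead><language>English</ead>",
    "badly nested": "<ead><a><b></a></b></ead>",
    "case mismatch": "<ead><Language>English</language></ead>",
}

report = {name: is_well_formed(text) for name, text in samples.items()}

for name, ok in report.items():
    print(f"{name}: {'well formed' if ok else 'rejected'}")
```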

Valid XML
• Valid XML provides consistency and facilitates the exchange of data
• XML must conform to a Document Type Definition (DTD) or Schema to be valid
• Schemas and DTDs specify the elements and attributes and define how they can be used:
  – Sequence of elements
  – Maximum and minimum values
• People can agree to use a common Schema for interchanging data
  – e-learning: IEEE Learning Object Metadata Schema (LOM)

Some Valid XML - EAD (Encoded Archival Description)
  <archdesc level="fonds">
    <did>
      <repository>John Rylands University Library of Manchester</repository>
      <unitid countrycode="GB" repositorycode="0133">GB 0133 NCN</unitid>
      <unittitle>Papers of Norman Nicholson</unittitle>
      <unitdate normal="1899-1987">1899-1987</unitdate>
      <physdesc>
        <extent>0.44 cu.m; 1,201 items</extent>
      </physdesc>
      <langmaterial>
        <language langcode="eng">English</language>
      </langmaterial>
      <origination>Nicholson, Norman Cornthwaite, 1914-1987</origination>
      <note>Created by the John Rylands Library archivist</note>
    </did>
    …
  </archdesc>

Something to remember about XML
• XML does not do anything itself. It is pure information wrapped in XML tags
• You must use other means to send, receive or display the data
• XML is used by XML technologies to:
  – display here like this
  – display there like that
  – extract this data for this purpose
  – extract that data for that purpose
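
As a hedged illustration of "other means", the sketch below takes a trimmed copy of the EAD record shown earlier and uses Python's standard `xml.etree.ElementTree` for two different purposes: a display line and a data extraction. The helper names `catalogue_line` and `language_codes` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the EAD record from the earlier slide.
record = """<archdesc level="fonds">
  <did>
    <repository>John Rylands University Library of Manchester</repository>
    <unittitle>Papers of Norman Nicholson</unittitle>
    <langmaterial><language langcode="eng">English</language></langmaterial>
  </did>
</archdesc>"""

did = ET.fromstring(record).find("did")


def catalogue_line(d: ET.Element) -> str:
    """Display purpose: a one-line human-readable description."""
    return f"{d.findtext('unittitle')} ({d.findtext('repository')})"


def language_codes(d: ET.Element) -> list[str]:
    """Extraction purpose: machine-readable language codes only."""
    return [lang.get("langcode") for lang in d.iter("language")]


display = catalogue_line(did)
codes = language_codes(did)

print(display)
print(codes)
```

The XML itself is unchanged in both cases; only the code that reads it decides what is shown or extracted.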

Why Use XML?
• Because everyone else is!
• International standard, supported by the W3C
• XML is open, licence free and platform neutral
• XML is human and machine readable
• XML documents are text documents

Why Use XML?
• Separation of content and presentation
  – With proprietary systems, content is inextricably bound up with format
• XML does not determine the presentation of the data
  – You can use CSS (stylesheets) or XSLT (Extensible Stylesheet Language Transformations) to present XML data
• The flexibility of XML enables the presentation of merged search results to the user
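
The point about merged results can be sketched as follows. The two result sets and their `<results>`/`<record>`/`<title>` element names are invented for illustration and do not correspond to any particular service. Because both sources return structured XML, merging is a few lines of code, and the same merged data can then be presented in more than one way.

```python
import html
import xml.etree.ElementTree as ET

# Invented result sets from two hypothetical sources.
source_a = """<results>
  <record><title>Metasearch basics</title></record>
  <record><title>Z39.50 explained</title></record>
</results>"""
source_b = """<results>
  <record><title>OAI harvesting</title></record>
  <record><title>Metasearch basics</title></record>
</results>"""


def merge_titles(*documents: str) -> list[str]:
    """Collect titles from every document, dropping duplicates and sorting."""
    titles = set()
    for document in documents:
        for rec in ET.fromstring(document).iter("record"):
            titles.add(rec.findtext("title"))
    return sorted(titles)


def as_text(titles: list[str]) -> str:
    """One presentation: a plain-text list."""
    return "\n".join(f"- {t}" for t in titles)


def as_html(titles: list[str]) -> str:
    """Another presentation of the same data: an HTML list."""
    items = "".join(f"<li>{html.escape(t)}</li>" for t in titles)
    return f"<ul>{items}</ul>"


merged = merge_titles(source_a, source_b)
print(as_text(merged))
print(as_html(merged))
```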

Web Services
• A Web Service is an online application that can be accessed by other applications in machine-to-machine (m2m) interactions
• Web services use XML to achieve this interoperability
  – SOAP
  – WSDL: Web Services Description Language
  – UDDI: Universal Description, Discovery and Integration

What is a Web Service
• A Web Service is a process of some kind, some functionality, for example:
  – A search and retrieve procedure
  – A conversion process
    • Fahrenheit to Centigrade
    • MARC record to Dublin Core record
    • LCSH subject headings to Dewey Decimal Classification numbers

Publicly available Web Services
• Google’s ‘similar pages’
• Amazon’s book connections: ‘customers who bought this also bought this’
• These services can be used in other applications
• The XMethods website has a list of some experimental services
  – http://www.xmethods.net

Creating a Web Service
• Web services can be built for existing applications, or created from scratch
• A key element of a Web Service is an XML file with details of how to interact with the service – the WSDL (Web Services Description Language) file

Zetoc WSDL extract
http://zetoc.mimas.ac.uk/soap/zetocsoap.wsdl
  …
  <complexType name="JournalRequest">
    <sequence>
      <element ref="srw:startRecord" minOccurs="1" maxOccurs="1"/>
      <element ref="bath:any" minOccurs="1" maxOccurs="1" nillable="true"/>
      <element ref="dc:title" minOccurs="1" maxOccurs="1" nillable="true"/>
      <element ref="dc:creator" minOccurs="1" maxOccurs="1" nillable="true"/>
      <element ref="oujnl:jtitle" minOccurs="1" maxOccurs="1" nillable="true"/>
      <element ref="oujnl:issn" minOccurs="1" maxOccurs="1" nillable="true"/>
      <element ref="oujnl:volume" minOccurs="1" maxOccurs="1" nillable="true"/>
      <element ref="oujnl:issue" minOccurs="1" maxOccurs="1" nillable="true"/>
      <element ref="oujnl:spage" minOccurs="1" maxOccurs="1" nillable="true"/>
      <element ref="dcterms:issued" minOccurs="1" maxOccurs="1" nillable="true"/>
    </sequence>
  </complexType>

Interacting with a Web Service
• Once the client application knows how to interact with the service, the client and service communicate using messages encoded in XML
• These messages are frequently expressed in SOAP
• These messages are generally passed over HTTP (but they don’t have to be)

SOAP
• A way of packaging XML information and passing it from one system to another
• Allows one system to make requests of another and to process the reply
• Systems can be completely different, running on different software and hardware

SOAP request
  <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <soap:Body>
      <zetoc:JournalRequest>
        <dc:creator>apps</dc:creator>
        <oujnl:title>materialia</oujnl:title>
        <oujnl:issn>1359-6462</oujnl:issn>
        <oujnl:volume>48</oujnl:volume>
        …
      </zetoc:JournalRequest>
    </soap:Body>
  </soap:Envelope>
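
A request of this shape could be assembled programmatically. The following is a rough sketch using Python's standard `xml.etree.ElementTree`, not the zetoc client itself. The SOAP envelope and Dublin Core namespace URIs are the standard ones; the `zetoc` and `oujnl` namespace URIs are placeholders, because the slide does not show them, and `build_request` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SOAP = "http://schemas.xmlsoap.org/soap/envelope/"
DC = "http://purl.org/dc/elements/1.1/"
# Placeholder URIs: the real zetoc and oujnl namespaces are not given on the slide.
ZETOC = "urn:example:zetoc"
OUJNL = "urn:example:oujnl"

# Ask the serialiser to use the same prefixes as the slide.
for prefix, uri in (("soap", SOAP), ("dc", DC), ("zetoc", ZETOC), ("oujnl", OUJNL)):
    ET.register_namespace(prefix, uri)


def build_request(creator: str, title: str, issn: str, volume: str) -> bytes:
    """Wrap a JournalRequest in a SOAP envelope and return it as UTF-8 XML."""
    envelope = ET.Element(f"{{{SOAP}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP}}}Body")
    request = ET.SubElement(body, f"{{{ZETOC}}}JournalRequest")
    fields = (
        (DC, "creator", creator),
        (OUJNL, "title", title),
        (OUJNL, "issn", issn),
        (OUJNL, "volume", volume),
    )
    for namespace, name, value in fields:
        ET.SubElement(request, f"{{{namespace}}}{name}").text = value
    return ET.tostring(envelope, encoding="utf-8")


message = build_request("apps", "materialia", "1359-6462", "48")
print(message.decode("utf-8"))
```

In practice such a message would be sent as the body of an HTTP POST, and a SOAP toolkit would normally generate it from the WSDL file rather than by hand.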

SOAP response
  HTTP/1.1 200 OK
  Content-Type: text/xml
  <soap:Envelope>
    <soap:Body>
      <zetoc:IdentifierSearchResponse>
        <srw:numberOfRecords>1</srw:numberOfRecords>
        <dc:identifier>RN125218404</dc:identifier>
        <zetoc:type>J</zetoc:type>
        <dc:title>Phase compositions in magnesium-rare earth alloys containing yttrium, gadolinium or dysprosium</dc:title>
        …
      </zetoc:IdentifierSearchResponse>
    </soap:Body>
  </soap:Envelope>

To recap …
• SOAP is a standard used for wrapping XML messages
• The XML that is sent and returned within the SOAP wrapper is determined by the WSDL file for any particular Web Service
• This is all done on a machine-to-machine level – you should never have to see a SOAP message
• However, we can use the XMLSpy editor to see the SOAP messages [demo]

Search Retrieve URL / Web Service (SRU/SRW)
• Takes the core of Z39.50 and re-implements it as a Web Service
• SRU and SRW are XML-based protocols designed to be low-barrier-to-entry solutions for performing searches and information retrieval operations across the internet
• The protocol can be carried in two ways:
  – via SOAP: SRW – Search/Retrieve Web Service
  – as parameters in a URL: SRU – Search/Retrieve by URL
• The primary function of SRU/SRW is to allow a user to search a remote database of records
• This is done via the searchRetrieve operation:
  – the client sends a searchRetrieveRequest and
  – the server responds with a searchRetrieveResponse

Example SRW request
• The most important part is the ‘query’. It contains a Common Query Language (CQL) string
• The request contains other parameters; all of these are optional except for ‘version’

Example SRW response
• Response must contain ‘version’ and ‘number of records’

Some SRU Requests
• SRU requests are URLs with a query string
• ‘Explain’ request, which describes the database/index and functionality:
  http://z3950.loc.gov:7090/voyager
• A simple search for the term "dinosaur":
  http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=dinosaur
• And the first of these records:
  http://z3950.loc.gov:7090/voyager?version=1.1&operation=searchRetrieve&query=dinosaur&maximumRecords=1
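
Because an SRU request is just a URL, building one needs nothing beyond a URL encoder. The sketch below uses Python's standard `urllib.parse` to reproduce the request URLs above; `sru_search_url` is a hypothetical helper name, and the code only constructs URLs without contacting the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL taken from the slide.
BASE = "http://z3950.loc.gov:7090/voyager"


def sru_search_url(query, maximum_records=None, base=BASE, version="1.1"):
    """Build a searchRetrieve URL; the CQL query is percent-encoded as needed."""
    params = {"version": version, "operation": "searchRetrieve", "query": query}
    if maximum_records is not None:
        params["maximumRecords"] = str(maximum_records)
    return f"{base}?{urlencode(params)}"


simple = sru_search_url("dinosaur")
first_only = sru_search_url("dinosaur", maximum_records=1)

# A CQL query containing spaces survives the round trip through URL encoding.
boolean = sru_search_url("dinosaur and bone")
decoded = parse_qs(urlsplit(boolean).query)["query"]

print(simple)
print(first_only)
print(boolean)
```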

Open Archives Initiative (OAI)
• The Open Archives Initiative (OAI) provides a mechanism for sharing metadata records based on HTTP and XML
• Enables metadata records about resources to be ‘harvested’ from multiple distributed services, typically into a central database (which itself may be a Z39.50 target)
• Records are harvested periodically, e.g. once a day or once an hour
• Generally considered to be an elegant, simple and efficient protocol
• 6 request types or ‘verbs’: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords and ListSets
• JORUM Learning Object Repository Service OAI interface at: http://repository.jorum.ac.uk/intralibrary/IntraLibrary-OAI?verb=identify
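
A harvester is essentially a loop of "build a verb URL, fetch, parse the XML". The sketch below shows the first and last of those steps with Python's standard library. The base URL `http://example.org/oai`, the sample response and the helper names are invented for illustration; the namespace URIs are those of OAI-PMH 2.0 and Dublin Core. No network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def oai_url(base: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for one of the six verbs."""
    return f"{base}?{urlencode({'verb': verb, **arguments})}"


# Invented, heavily trimmed ListRecords response from a hypothetical repository.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Learning object one</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest(xml_text: str) -> list[tuple[str, str]]:
    """Return (identifier, title) pairs, ready to load into a central database."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


request = oai_url("http://example.org/oai", "ListRecords", metadataPrefix="oai_dc")
records = harvest(sample)

print(request)
print(records)
```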

Real Time Cross Searching vs. Harvesting
• Delays occur with real-time cross-searching
• The response time for searches sent to multiple search targets tends to be limited by the worst performing target or intervening network delays
• It is very difficult to build flexible browse interfaces based on a distributed set of gateway databases
• OAI harvesting is periodic, so search results may not be accurate and up to date
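
The effect of the worst performing target can be simulated without any real search service. In the sketch below the targets are just `time.sleep` calls with invented delays, queried in parallel with Python's standard `concurrent.futures`. Waiting for everything takes as long as the slowest target; imposing a deadline returns sooner but drops that target's results.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Simulated targets: name -> response delay in seconds (invented values).
TARGETS = {"fast": 0.01, "medium": 0.05, "slow": 0.6}


def query_target(name: str, delay: float) -> str:
    time.sleep(delay)  # stands in for network and search latency
    return f"results from {name}"


def cross_search(deadline=None):
    """Query all targets in parallel.

    Returns the results that arrived before the deadline (all of them when
    deadline is None) and the seconds the caller had to wait for them.
    """
    start = time.monotonic()
    pool = ThreadPoolExecutor(max_workers=len(TARGETS))
    futures = {pool.submit(query_target, n, d): n for n, d in TARGETS.items()}
    done, _pending = wait(futures, timeout=deadline)
    elapsed = time.monotonic() - start
    pool.shutdown(wait=True)  # let any straggler finish before returning
    return {futures[f]: f.result() for f in done}, elapsed


all_results, t_all = cross_search()
partial, t_partial = cross_search(deadline=0.2)

print(sorted(all_results), round(t_all, 2))
print(sorted(partial), round(t_partial, 2))
```

A harvested central index avoids this wait entirely, at the cost of the freshness problem noted in the last bullet.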

OAI - Connect Portal
• ‘Connect’ Learning & Teaching Portal – http://www.connect.ac.uk
• Connect is an HE Academy project (the HE Academy used to be the LTSN – Learning and Teaching Support Network)
• Connect harvests records from HE Academy subject centres around the UK
• Records are harvested by a server at Rutherford Appleton Labs

Connect Portal

Metasearch issues: Metadata
• Format
  – As users are searching cross-domain, it makes sense to use a cross-domain metadata schema. Dublin Core is a good contender for this and is required for use of OAI-PMH
  – However, domains will use their own metadata schemas, such as the IEEE LOM for learning objects
  – Mappings are required to enable cross searching, but some of the semantic richness of the original resource may be lost
• Common Meaning – Semantic issues
  – There needs to be agreement amongst content providers about the meaning of terms such as ‘title’, ‘article’, ‘research paper’, ‘learning object’
  – There will inevitably be difficulties in reaching agreement about the meaning of metadata elements, as they are used differently in different contexts
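
The loss of semantic richness under mapping can be shown with a toy crosswalk. The mapping table below is a simplified illustration of a LOM to Dublin Core crosswalk, not an official one; the dotted field names, the sample record and the `crosswalk` helper are assumptions made for this sketch.

```python
# Simplified, illustrative crosswalk from LOM-style fields to Dublin Core elements.
LOM_TO_DC = {
    "general.title": "title",
    "general.description": "description",
    "general.language": "language",
    "general.keyword": "subject",
}


def crosswalk(lom_record: dict) -> tuple[dict, list]:
    """Map a LOM-style record to Dublin Core; report fields with no DC home."""
    dc, lost = {}, []
    for field, value in lom_record.items():
        target = LOM_TO_DC.get(field)
        if target is None:
            lost.append(field)  # no equivalent element: the detail is dropped
        else:
            dc[target] = value
    return dc, sorted(lost)


# Invented learning-object record.
learning_object = {
    "general.title": "Introduction to metasearch",
    "general.language": "en",
    "educational.context": "higher education",
    "educational.typicalLearningTime": "PT1H",
}

dc_record, lost_fields = crosswalk(learning_object)
print(dc_record)
print(lost_fields)
```

The educational fields are exactly what a tutor might want to search on, yet they have no place in the simple cross-domain record.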

Metasearch issues: Metadata
• Political
  – The decision to make resources more widely available has implications for the organisations concerned:
    • It may be seen as a loss of control or ownership
    • Staff may not possess the skills required to support more complex systems
• Legal
  – The legal requirements of Freedom of Information legislation in several countries are a significant factor in the dissemination of public sector resources
  – The Intellectual Property Rights (IPR) of those providing sources may need to be protected

Why not just use Google?
• Its content is limited to the visible Web
• Limited search functionality
  – Can’t search by specific criteria (metadata) such as ‘publication date’, ‘author’, ‘educational level’
• Little quality control
• Google Scholar?
  – Still a web crawl
  – Evidence that it gives unreliable results

NISO Metasearch Initiative
• The NISO MetaSearch initiative is trying to bring the area of metasearching together around a NISO standard
• “Best Practices for Metasearch” document due out June 15th 2005
• http://www.niso.org/committees/MetaSearch-info.html

Overview
• Introduction to cross searching / metasearch
• The problem – why metasearch?
• JISC Information Environment
• Quick introduction to XML and Web Services
• Metasearch technologies – Z39.50, SRU/SRW, OAI
• Metasearch issues
• NISO Metasearch Initiative

Contact
Adrian Stevenson
Learning Technology Services, Internet Services
University of Manchester
adrian.stevenson [at] manchester.ac.uk
Tel: +44 (0) 161 306 3109