The Open Archives Initiative (OAI) and the Protocol for Metadata Harvesting (OAI-PMH)

CS 431 guest lecture
Simeon Warner
3 March 2004

Origins of the OAI
"The Open Archives Initiative has been set up to create a forum to discuss and solve matters of interoperability between electronic preprint solutions, as a way to promote their global acceptance."
(Paul Ginsparg, Rick Luce & Herbert Van de Sompel, 1999)

What is the OAI now?
"The OAI develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content." (from OAI mission statement)
§ Technological framework around OAI-PMH protocol
§ Application independent
§ Independent of economic model for content
Also … a community and a "brand" (and you need it for an assignment due in April)

Where does the OAI fit?
[Diagram: OAI shown among DEF, Eprints, Search Service, DSpace, DTU, Institutional Repository, arXiv, EPrints, OAIster Search Service. Caption: Scholarly Publishing and Archiving on the Web]

OAI and Open Access
• There is "A" difference
§ Open Archives Initiative
§ Open Access
• The OAI is not tied to a particular political agenda; its focus is technical
• BUT… the OAI provides functionality that is essential for many Open Access proposals

OAI-PMH
PMH -> Protocol for Metadata Harvesting
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
• Simple protocol, just 6 verbs
• Designed to allow harvesting of any XML metadata (schema described)
• For batch-mode use, not interactive use
[Diagram: Service Provider (Harvester) sends protocol requests (GET, POST) to Data Provider (Repository), which returns XML metadata]
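As a rough illustration of the harvester's side of this exchange, the sketch below builds a GET request URL in Python. The helper name `build_request_url` is hypothetical; the base URL is the arXiv one that appears in the sample responses in these slides, and `oai_dc` is the protocol's standard Dublin Core metadata prefix.

```python
from urllib.parse import urlencode


def build_request_url(base_url, verb, **arguments):
    """Return a GET URL for one OAI-PMH request (hypothetical helper)."""
    params = {"verb": verb}
    # Arguments passed as None are treated as "not supplied" and left out.
    params.update({k: v for k, v in arguments.items() if v is not None})
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    return base_url + "?" + urlencode(params)


url = build_request_url(
    "http://arXiv.org/oai2",
    "GetRecord",
    identifier="oai:arXiv:cs/0112017",
    metadataPrefix="oai_dc",
)
print(url)
```

Because `from` is a Python keyword, a caller would have to pass that argument as `**{"from": "2002-01-01"}`; a dictionary parameter would avoid this at the cost of a slightly noisier call.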

OAI for discovery
[Diagram: User, "?", and repositories R1, R2, R3, R4 as information islands]

OAI for discovery
[Diagram: service layer with a search service between the User and repositories R1, R2, R3, R4; metadata harvested by service]

OAI for XYZ
[Diagram: service layer with an XYZ service between the User and repositories R1, R2, R3, R4; global network of resources exposing metadata]

OAI-PMH Data Model
• resource (in the diagram, a sculpture)
• item: has identifier; all available metadata about this sculpture
• records of the item: Dublin Core metadata, MARC 21 metadata, branding metadata
• record: has identifier + metadata format + datestamp

OAI-PMH and HTTP
• Clear separation of OAI-PMH and HTTP: OAI-PMH uses HTTP as transport
• all OK at HTTP level? => 200 OK
• something wrong at OAI-PMH level? => OAI-PMH error (e.g. badVerb)
• HTTP codes 302 (redirect), 503 (retry-after), etc. still available to implementers, but do not represent OAI-PMH events
• Not REST-like

Normal response
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH …>    (namespace info not shown here)
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" … …>http://arXiv.org/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:arXiv:cs/0112017</identifier>
        <datestamp>2001-12-14</datestamp>
        <setSpec>cs</setSpec>
        <setSpec>math</setSpec>
      </header>
      <metadata> ….. </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
Note: no HTTP encoding of the OAI-PMH request

Error/exception response
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH>
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>
• Same schema for all responses, including error responses.
• With errors, only the correct attributes are echoed in <request>.
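Since a protocol-level failure still arrives with HTTP 200, a harvester has to look inside the XML for an error element. The following is a minimal sketch of that check, assuming the standard OAI-PMH 2.0 namespace; the names `OAIError` and `parse_response` are hypothetical, and the sample document is the error response from this slide with the namespace declaration added.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses, in ElementTree's {uri}tag notation.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


class OAIError(Exception):
    """Raised when a response carries an OAI-PMH <error> element."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_response(xml_text):
    """Parse a response body; raise OAIError if it reports a protocol error."""
    root = ET.fromstring(xml_text)
    error = root.find(OAI_NS + "error")
    if error is not None:
        raise OAIError(error.get("code"), (error.text or "").strip())
    return root


SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request>http://arXiv.org/oai2</request>
  <error code="badVerb">ShowMe is not a valid OAI-PMH verb</error>
</OAI-PMH>"""

try:
    parse_response(SAMPLE_ERROR)
except OAIError as exc:
    print(exc.code)
```

This sketch reports only the first error element; a response may carry more than one, so a fuller client might collect them all.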

OAI-PMH verbs
Verb                   Function
(metadata about the repository)
Identify               description of archive
ListMetadataFormats    metadata formats supported by archive
ListSets               sets defined by archive
(harvesting verbs)
ListIdentifiers        OAI unique ids contained in archive
ListRecords            listing of N records
GetRecord              listing of a single record
Most verbs take arguments: dates, sets, ids, metadata formats and resumption token (for flow control)
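The six verbs and their arguments can be written down as a small lookup table. The sketch below follows the argument rules of OAI-PMH 2.0 as I understand them (a resumptionToken, where allowed, must be the only argument); the table layout and the function name `validate_request` are illustrative assumptions, not part of the protocol.

```python
# Required and optional arguments per verb (resumptionToken handled separately).
VERB_ARGUMENTS = {
    "Identify": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets": {"required": set(), "optional": set()},
    "ListIdentifiers": {"required": {"metadataPrefix"},
                        "optional": {"from", "until", "set"}},
    "ListRecords": {"required": {"metadataPrefix"},
                    "optional": {"from", "until", "set"}},
    "GetRecord": {"required": {"identifier", "metadataPrefix"},
                  "optional": set()},
}

# Verbs whose responses may be split into chunks.
TOKEN_VERBS = {"ListSets", "ListIdentifiers", "ListRecords"}


def validate_request(verb, arguments):
    """Return None if the request looks legal, else an OAI-PMH error code."""
    if verb not in VERB_ARGUMENTS:
        return "badVerb"
    names = set(arguments)
    if "resumptionToken" in names:
        # A resumptionToken is exclusive: no other argument may accompany it.
        if verb in TOKEN_VERBS and names == {"resumptionToken"}:
            return None
        return "badArgument"
    spec = VERB_ARGUMENTS[verb]
    if not spec["required"] <= names:
        return "badArgument"  # a required argument is missing
    if names - spec["required"] - spec["optional"]:
        return "badArgument"  # an argument this verb does not accept
    return None


print(validate_request("ShowMe", {}))
```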

Identify verb
Information about the repository; start any harvest with Identify
<Identify>
  <repositoryName>Library of Congress 1</repositoryName>
  <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
  <protocolVersion>2.0</protocolVersion>
  <adminEmail>r.e.gillian@larc.nasa.gov</adminEmail>
  <adminEmail>rgillian@visi.net</adminEmail>
  <deletedRecord>transient</deletedRecord>
  <earliestDatestamp>1990-02-01T00:00Z</earliestDatestamp>
  <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
  <compression>deflate</compression>
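A harvester would typically pull a handful of fields out of the Identify response before doing anything else. The sketch below shows one way to do that with the Python standard library; `parse_identify` is a hypothetical helper, and the sample document is a trimmed, illustrative response (the e-mail address and datestamp are placeholders), not a real repository's output.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_identify(xml_text):
    """Extract the fields a harvester needs from an Identify response."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", NS)
    if identify is None:
        raise ValueError("no <Identify> element in response")

    def text(tag):
        return identify.findtext("oai:" + tag, default=None, namespaces=NS)

    return {
        "baseURL": text("baseURL"),
        "protocolVersion": text("protocolVersion"),
        "deletedRecord": text("deletedRecord"),
        "granularity": text("granularity"),
        # compression is optional and repeatable, so collect every value.
        "compression": [e.text for e in identify.findall("oai:compression", NS)],
    }


SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Library of Congress 1</repositoryName>
    <baseURL>http://memory.loc.gov/cgi-bin/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>1990-02-01T00:00:00Z</earliestDatestamp>
    <deletedRecord>transient</deletedRecord>
    <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
    <compression>deflate</compression>
  </Identify>
</OAI-PMH>"""

info = parse_identify(SAMPLE_IDENTIFY)
print(info["granularity"], info["deletedRecord"])
```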

Identifiers
• Items have identifiers (all records of same item share identifier)
• Identifiers must have URI syntax (defined by RFC, a type in XML schema)
• Unless you can recognize a global URI scheme, identifiers must be assumed to be local to the repository
• Complete identification of a record is baseURL + identifier + metadataPrefix + datestamp
• <provenance> container may be used to express harvesting/transformation history

Datestamps
• All dates/times are UTC, encoded in ISO 8601, Z notation: 1957-03-20T20:30:00Z
• Datestamps may be either full date/time as above or date only (YYYY-MM-DD). Must be consistent over the whole repository; 'granularity' is specified in the Identify response.
• Earlier version of the protocol specified "local time", which caused lots of misunderstandings. Not good for global interoperability!
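To make the UTC requirement concrete, here is a small sketch that converts a time-zone-aware Python datetime into either datestamp form. The function name `to_oai_datestamp` is hypothetical; refusing naive datetimes is a design choice of this sketch, made to avoid repeating the "local time" ambiguity mentioned above.

```python
from datetime import datetime, timedelta, timezone


def to_oai_datestamp(moment, granularity="YYYY-MM-DDThh:mm:ssZ"):
    """Format an aware datetime as a UTC datestamp in the given granularity."""
    if moment.tzinfo is None:
        # A naive datetime has no zone, so it cannot be converted safely.
        raise ValueError("naive datetime: time zone unknown")
    utc = moment.astimezone(timezone.utc)
    if granularity == "YYYY-MM-DD":
        return utc.strftime("%Y-%m-%d")
    return utc.strftime("%Y-%m-%dT%H:%M:%SZ")


# 15:30 at UTC-5 is 20:30 UTC, the example time on this slide.
local = datetime(1957, 3, 20, 15, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(to_oai_datestamp(local))
```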

Harvesting granularity
• mandatory support of YYYY-MM-DD
• optional support of YYYY-MM-DDThh:mm:ssZ (must look at Identify response)
• granularity of the from and until arguments in ListIdentifiers/ListRecords must match
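These rules could be checked on the harvester side before a request is sent. The sketch below is one possible check, using hypothetical helper names (`granularity_of`, `check_range`); it validates only the shape of the datestamps, not whether they are real calendar dates.

```python
import re

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"

_DAY_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
_SECONDS_RE = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def granularity_of(stamp):
    """Classify a datestamp string by its granularity."""
    if _DAY_RE.fullmatch(stamp):
        return DAY
    if _SECONDS_RE.fullmatch(stamp):
        return SECONDS
    raise ValueError(f"not an OAI-PMH datestamp: {stamp!r}")


def check_range(from_, until, repository_granularity):
    """Raise ValueError if from/until break the granularity rules."""
    grains = {granularity_of(s) for s in (from_, until) if s is not None}
    if len(grains) > 1:
        raise ValueError("from and until must have the same granularity")
    if SECONDS in grains and repository_granularity == DAY:
        raise ValueError("repository only supports day granularity")
    return True


print(check_range("2002-01-01", "2002-02-08", DAY))
```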

Sets
• Simple notion of grouping at the item level to support selective harvesting
§ Hierarchical set structure
§ Multiple set membership permitted
§ E.g.: repo has sets A, A:B:C, D, D:E, D:F
  If item 1 is in A:B then it is in A
  If item 2 is in D:E then it is in D, may also be in D:F
  Item 3 may be in no sets at all
• Don't use sets unless you have a good reason (selective harvesting)
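The hierarchy rule in the example (membership in A:B implies membership in A) can be expressed in a few lines. This is a sketch with hypothetical function names, treating a setSpec as a colon-separated path.

```python
def implied_sets(set_spec):
    """Return the setSpec and all of its ancestors, e.g. 'A:B' -> ['A', 'A:B']."""
    parts = set_spec.split(":")
    return [":".join(parts[:i]) for i in range(1, len(parts) + 1)]


def in_set(item_set_specs, requested):
    """True if an item with the given setSpecs belongs to the requested set."""
    return any(requested in implied_sets(spec) for spec in item_set_specs)


print(implied_sets("A:B:C"))
print(in_set(["D:E"], "D"))
```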

resumptionToken
• Protocol supports the notion of partial responses in a very simple way: the response includes a 'token' at the end, which is used to get the next chunk.
• Idempotency of resumptionToken: return same incomplete list when resumptionToken is reissued
§ while no changes occur in the repo: strict
§ while changes occur in the repo: all items with unchanged datestamp
• optional attributes for the resumptionToken: expirationDate, completeListSize, cursor

Record headers
• header contains set membership of item
<record>
  <header>
    <identifier>oai:arXiv:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata> ….. </metadata>
</record>
• eliminates the need for the "double harvest" 1.x required to get all records and all set information

Deleted records
• What happens when a record (or item) is deleted from a repository? Would be nice if harvesters could find out.
• Not necessarily guaranteed in OAI that harvesters will find out. Support made optional because of problems with legacy repositories (practical constraint).
§ Level of support expressed in Identify (no, persistent, transient)
§ Status expressed in header element, <header status="deleted">…</header>
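On the harvester side, the deleted status can be read straight off each header. The sketch below separates live identifiers from deleted ones in a ListIdentifiers response; `split_headers` and the `oai:example:` identifiers are hypothetical names used for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def split_headers(xml_text):
    """Return (live, deleted) identifier lists from a list response."""
    root = ET.fromstring(xml_text)
    live, deleted = [], []
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            live.append(identifier)
    return live, deleted


SAMPLE_LIST = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example:1</identifier>
      <datestamp>2001-12-14</datestamp>
    </header>
    <header status="deleted">
      <identifier>oai:example:2</identifier>
      <datestamp>2002-01-05</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

live, deleted = split_headers(SAMPLE_LIST)
print(live, deleted)
```

Whether the deleted list can be trusted to be complete depends on the deletedRecord level the repository declares in its Identify response.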

Harvesting strategy
• Issue Identify request
§ Check all as expected (validate, version, baseURL, granularity, compression…)
• Check sets/metadata formats as necessary (ListSets, ListMetadataFormats)
• Do harvest; the initial complete harvest is done with no from and until arguments
• Subsequent incremental harvests start from a datestamp that is the responseDate of the last response
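The last two steps of this strategy can be sketched as a function that decides the arguments for the next ListRecords request. The function name `harvest_params` is hypothetical, and truncating the responseDate to a date for day-granularity repositories is an assumption of this sketch about how to keep the from argument legal.

```python
def harvest_params(metadata_prefix, last_response_date=None,
                   granularity="YYYY-MM-DD"):
    """Build ListRecords arguments for a complete or incremental harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_response_date is None:
        # First harvest: no from/until, so everything is returned.
        return params
    stamp = last_response_date
    if granularity == "YYYY-MM-DD":
        # responseDate is always a full date/time; keep only the date part
        # when the repository supports day granularity only.
        stamp = stamp[:10]
    params["from"] = stamp
    return params


print(harvest_params("oai_dc"))
print(harvest_params("oai_dc", "2002-02-08T08:55:46Z"))
```

With day granularity the second harvest re-fetches records changed earlier on the same day, so the harvester should expect duplicates and overwrite by identifier.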

Changing Scholarly Communication
• Traditional journal publishing combines functions: registration, certification, awareness, archiving.
• How about eprints being the starting point of a new value chain in which the raw material, the non-certified eprint, is open access?
• Other functions might be fulfilled by different networked parties. This requires a communication infrastructure: OAI-PMH may be part of this.
• Presentations on OAI and Scholarly Communication at http://www.cs.cornell.edu/people/simeon/talks