OAIster A No Dead Ends Digital Object Service
OAIster: A “No Dead Ends” Digital Object Service Kat Hagedorn OAIster Librarian University of Michigan Libraries October 3, 2003
background • One-year Mellon grant project to test the feasibility of making OAI-enabled metadata for digital objects accessible to the public • Digital Library Production Service at University of Michigan Libraries began work in December 2001 • Publicized as OAIster in February 2002 • Launched in June 2002
highlights • Any audience • Any subject matter • Any format • Freely accessible • No dead ends • One-stop shopping …retrieving the “hidden web”
the protocol • OAI = Open Archives Initiative • OAI-PMH = Open Archives Initiative Protocol for Metadata Harvesting • Designed to make it easy to exchange metadata among interested parties • Consists of 6 HTTP requests to identify repositories / metadata and perform “harvesting”
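As a rough illustration of the six request types, the sketch below builds OAI-PMH request URLs. The verb names and the `metadataPrefix=oai_dc` argument come from the OAI-PMH specification; the base URL and the `build_request` helper are hypothetical and are not part of OAIster or of any harvester named in these slides.

```python
from urllib.parse import urlencode

# The six request types ("verbs") defined by OAI-PMH.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request(base_url, verb, **params):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


# Hypothetical repository address, for illustration only.
BASE = "http://repository.example.org/oai"
print(build_request(BASE, "Identify"))
print(build_request(BASE, "ListRecords", metadataPrefix="oai_dc"))
```

A harvester would typically start with `Identify`, then page through `ListRecords` for the Dublin Core (`oai_dc`) format.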
tool we borrowed • University of Illinois Urbana-Champaign open-source OAI protocol harvester • Java edition for our Unix environment • Worked collaboratively to iron out kinks – resumptionToken / retryAfter – inexplicable kill – bogus records in MySQL table
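The two protocol-level kinks named above, resumption tokens and retry-after waits, can be sketched as a small harvesting loop. This is not the UIUC harvester's code. It assumes a caller-supplied `fetch(params)` function that returns the response XML as a string and raises the hypothetical `RetryLater` exception when the repository asks the client to wait.

```python
import time
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


class RetryLater(Exception):
    """Hypothetical signal: the repository asked us to wait `seconds`."""

    def __init__(self, seconds):
        super().__init__(f"retry after {seconds}s")
        self.seconds = seconds


def harvest(fetch, metadata_prefix="oai_dc", max_retries=3, sleep=time.sleep):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        # Retry the same request when the repository asks us to wait.
        for attempt in range(max_retries + 1):
            try:
                xml_text = fetch(params)
                break
            except RetryLater as wait:
                if attempt == max_retries:
                    raise
                sleep(wait.seconds)
        root = ET.fromstring(xml_text)
        yield from root.iter(OAI_NS + "record")
        # A missing or empty resumptionToken marks the last page.
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            return
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Passing `fetch` and `sleep` in as arguments keeps the loop testable without a network connection.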
development environment • Digital Library Extension Service (DLXS) • Develop open-source middleware and license XPAT search engine for building and mounting digital libraries • Middleware consists of document classes, i.e., Text, Image, Bib, FindAid • Originally designed to make SGML-encoded texts available online
tool we developed • Runs in DLXS environment using BibClass • Current BibClass web templates modified • Additional Java-based transformation tool: – DC metadata records concatenated – No-digital-object records filtered out – Records counted – Conversion from UTF-8 to ISO-8859-1 – XSLT used to transform DC records into BibClass records
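Three of the transformation steps listed above (filtering, counting, and character-set conversion) can be sketched as follows. The actual tool was Java-based and its filtering rule is not given in these slides; this Python sketch assumes a record is a dictionary of Dublin Core fields and treats a record as having a digital object when any `identifier` value looks like a URL. Both are assumptions made for illustration.

```python
def has_digital_object(record):
    """Assumed rule: at least one identifier value looks like a URL."""
    return any(
        value.strip().lower().startswith(("http://", "https://", "ftp://"))
        for value in record.get("identifier", [])
    )


def to_latin1(text):
    """Encode text as ISO-8859-1 bytes.

    Characters outside ISO-8859-1 become numeric character references
    instead of raising an error.
    """
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def prepare(records):
    """Filter out no-digital-object records and count the result."""
    kept = [r for r in records if has_digital_object(r)]
    return kept, len(kept), len(records) - len(kept)


# Hypothetical sample records.
sample = [
    {"title": ["Café scene"], "identifier": ["http://repo.example.org/img/1"]},
    {"title": ["Finding aid only"], "identifier": ["MS-1234"]},
]
kept, n_kept, n_dropped = prepare(sample)
print(n_kept, n_dropped, to_latin1(kept[0]["title"][0]))
```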
system design [Diagram; components shown: UIUC harvester, OAI-enabled DC records, non-OAI-enabled DC records, record storage, XSLT transformation tool, XSL stylesheets (per source type), BibClass indexes, search interface (XPAT)]
result • One place to look for digital objects • Big – 1,484,767 metadata records – 195 institutions (as of August 03) • Popular – Averages 3300 search sessions / month – Picked up in March 03: average 3700 now – 43,894 searches total (through July 03)
www.oaister.org: search
www.oaister.org: limiters
www.oaister.org: sort
www.oaister.org: results
www.oaister.org: repositories
repositories: e.g., – Online Archive of California: manuscripts, photographs, and works of art held in institutions across California – arXiv Eprint Archive: math and physics pre- and post-prints – Sammelpunkt, Elektronisch Archivierte Theorie: archive of philosophical publications – British Women Romantic Poets Project: collection of poems written by British women between 1789 and 1832
repositories: stats • As of July 03, out of 191 repositories… • U.S. and foreign – U.S.: 49% (94) – Foreign: 51% (97) • By subject – Humanities: 26% (50) – Science: 30% (58) – Mixed: 43% (83) • E-prints and pre-prints – Using eprints.org software: 41% (78) – Not using eprints.org software: 58% (110)
major issues encountered • Metadata variation • Records not leading to digital objects • Access restrictions on digital objects described in records • Duplicate records for a single digital object
issue: metadata variation • With more records, users need more restrictions • Consistent metadata needed to facilitate these restrictions • One option: normalization of data
issue: metadata variation • Type: the obvious quick win – 240 metadata values mapped to four generic values (text, image, audio, video) – e.g., audio, sound = audio; motion, animation, newsreels, etc. = video; watercolour, watercolor, slides, etc. = image; article, articles, booklet, diss, story, etc. = text
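A minimal sketch of this kind of type normalization, using only the example values shown on the slide (the full mapping covered 240 values and is not reproduced here). The `normalize_type` helper and its case-insensitive matching are assumptions, not OAIster's actual implementation.

```python
# Partial mapping built from the slide's examples only.
TYPE_MAP = {
    "audio": "audio", "sound": "audio",
    "motion": "video", "animation": "video", "newsreels": "video",
    "watercolour": "image", "watercolor": "image", "slides": "image",
    "article": "text", "articles": "text", "booklet": "text",
    "diss": "text", "story": "text",
}


def normalize_type(value):
    """Map a raw DC type value to a generic value, or None if unmapped."""
    return TYPE_MAP.get(value.strip().lower())


for raw in ["Sound", "newsreels", " Watercolour ", "diss", "hologram"]:
    print(repr(raw), "->", normalize_type(raw))
```

Returning `None` for unmapped values leaves the decision about unknown types (drop, keep raw, or review by hand) to the caller.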
issue: metadata variation • Date: where to begin? – Most records with at least one date – Some records include up to seven dates – No consistent style of date • Subject: out of context, what meaning? – Many records with at least one subject element – But over 100 records with more than 50 subjects – And one record with 1000!
issue: metadata variation • Sample date values <date>2-12-01</date> <date>2002-01-01</date> <date>0000-00-00</date> <date>1822</date> <date>between 1827 and 1833</date> <date>18--?</date> <date>November 13, 1947</date> <date>SEP 1958</date> <date>235 bce</date> <date>Summer, 1948</date>
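To show why date normalization is harder than type normalization, here is one possible first pass over the sample values: pull out any four-digit year and give up on the rest. This is an illustrative assumption, not something OAIster did. It deliberately returns nothing for values such as "2-12-01", "18--?", and "235 bce", which would need separate handling.

```python
import re

# A run of exactly four digits, not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)\d{4}(?!\d)")


def extract_years(value):
    """Return all plausible four-digit years found in a raw date string."""
    return [int(y) for y in YEAR.findall(value) if int(y) != 0]


samples = [
    "2-12-01", "2002-01-01", "0000-00-00", "1822",
    "between 1827 and 1833", "18--?", "November 13, 1947",
    "SEP 1958", "235 bce", "Summer, 1948",
]
for raw in samples:
    print(repr(raw), "->", extract_years(raw))
```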
issue: metadata variation • Sample subject values <subject>30, 51, 52</subject> <subject>1852, Apr. 22. E[veritt] Judson, letter to Philuta [Judson].</subject> <subject>Slavery--United States--Controversial literature</subject> <subject>view of interior with John Henry sculpture</subject> <subject>Particles (Nuclear physics) -Research.</subject>
issue: no digital objects • Some records contain links to further description of digital object • But not the digital object itself • Culling difficult • One option: add explanatory text to site
issue: access restrictions • No records where metadata itself is restricted in use (as far as we know!) • Definitely some records where objects are restricted to licensed users • One option: add explanatory text to site
issue: access restrictions • DC Rights element: often not enough info about viewing restrictions • Currently no protocol method for indicating restricted digital objects (i.e., “yes/no” toggle element) • Need to assess whether users feel informed or frustrated when encountering restricted objects
issue: duplicate records • Two records harvested, different identifiers, same object described and pointed to • Acquired in two ways: – Harvesting of original repository and aggregator – Receiving “static” DC records provided by content creator and harvesting aggregator
issue: duplicate records • Aggregators can contain records not currently available through OAI channels • Aggregators do not always contain all the records of a particular original repository • So, need to harvest both aggregator and original repositories
issue: duplicate records • Harvest records from aggregator • Also receive from original content creator, but as snapshot – e.g., MEO and cogprints – Snapshot before aggregator – Creator unsure all records would be aggregated
issue: duplicate records • Were duplicates to be identified, how to deal with the issue? – Suppress? – Group? – Flag? • So far, not addressed in OAIster
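Since duplicates are defined above as records with different identifiers that point to the same object, one way they could be identified is by grouping records on a normalized object URL. The sketch below is hypothetical: OAIster did not address duplicates at the time, and the record tuples, example identifiers, and normalization rules are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url):
    """Lowercase scheme and host, and drop a trailing slash from the path."""
    parts = urlsplit(url.strip())
    base = f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"
    return f"{base}?{parts.query}" if parts.query else base


def group_duplicates(records):
    """Map each object URL to the record identifiers that point to it.

    `records` is an iterable of (oai_identifier, object_url) pairs.
    Only URLs referenced by more than one record are returned.
    """
    groups = defaultdict(list)
    for oai_id, url in records:
        groups[normalize_url(url)].append(oai_id)
    return {url: ids for url, ids in groups.items() if len(ids) > 1}


# Hypothetical records: one from an original repository, one from an aggregator.
records = [
    ("oai:repo.example:1", "http://Repo.example/obj/1/"),
    ("oai:agg.example:77", "http://repo.example/obj/1"),
    ("oai:repo.example:2", "http://repo.example/obj/2"),
]
print(group_duplicates(records))
```

Grouping leaves all three options open: a service could then suppress, group, or flag the records in each set.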
assessment • Large survey (over 400 respondents) • 2 rounds of face-to-face and remote user testing • Conducted before design and after phase one rollout
assessment: survey • Online journals and reference materials wanted over other digital objects • Difficult to search for information; every service different; where to start • Number of respondents (5%) indicated they were generally successful in finding resources online
assessment: user testing • No short and long record formats: one size fits all • Want clearly defined and labeled AND/OR searching options • Results clear and easy to understand • Want to sort by title, date, institution, resource format…you name it! • Use OAIster for academic, trustworthy, authentic materials
service providers: comparison [Chart: service providers plotted by Usability (low to high) against Content (some to all); labels shown: UIUC, Emory, etc.; Ad hoc; OAIster; DP-9] • Focus on high usability • Focus on all content available • Some service providers have increased functionality (e.g., deduplication, integration of thesauri)
future of OAIster • Make it faster • Advanced searching • Grouping to aid browsing • Saving/emailing/downloading records • Further normalization of data • Handling duplicate records • Collaboration with other services: search, instructional…
current state of protocol • Popular • As Peter Suber says: – “…no other single idea or technology in the [open-source] movement has enjoyed this density of endorsement and adoption in a six month period.” • Data providers over one year: – June 02: 56 repositories / 274,062 records – June 03: 187 repositories / 1,246,953 records – Over three-fold increase for repositories – Over four-fold increase for records
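The growth figures can be checked with a line of arithmetic each. The numbers are taken from the slide; the `fold_increase` helper is only for illustration.

```python
def fold_increase(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


repositories = fold_increase(56, 187)
records = fold_increase(274_062, 1_246_953)

# Both ratios are consistent with the slide's wording:
# repositories grew by a factor between 3 and 4, records between 4 and 5.
print(f"repositories: {repositories:.2f}x, records: {records:.2f}x")
```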
future of protocol • Branching out – HTTP vs. SOAP – DC required vs. highly recommended – Use of OAI in closed environments – Static repository protocol • Need for add-on applications • OAI evangelism
what can you do? • OAI-enable your data – DLXS customer: easiest – Make sure data is UTF-8 / Unicode compliant – Provide as much metadata as you can – Use standard element tags – Develop “sets” for service providers • Let us know you’re ready to be harvested • Keep us informed about changes to the harvesting URL, new data and deleted data, change in contact info
contact info • Kat Hagedorn • University of Michigan Libraries, Digital Library Production Service • khage@umich.edu • http://www.oaister.org/