Dublin Core and metadata: a tutorial. Lorcan Dempsey, Andy Powell, UKOLN, University of Bath (with a little help from our friends) http://www.ukoln.ac.uk/metadata
Questions for you... • Metadata • EAD, CIMI, TEI • PICS, XML, RDF • MARC 856 • Dublin Core • you are: geeks/people with sensible shoes • goers/doers
Overview • UKOLN and metadata • Metadata landscape • Dublin Core • Metadata management • Interoperability • Harvesting • Future
UKOLN and metadata • ROADS • subject gateways • WHOIS++ templates • BIBLINK • CIP for electronic data • Dublin Core (+ MARC) • Desire • WHOIS++, GILS, Dublin Core • Z39.50/WHOIS++ • NewsAgent • current awareness, Ariadne • Dublin Core, DC-dot • MODELS • collection description?? • Agora • PRIDE • Initiatives
Metadata landscape
What is metadata …? • It’s just cataloguing, isn’t it …? • Yes and no … • Data which supports operations carried out on information objects … – discover, buy, ... • In the company of strangers (Brody) • Relieve user of having to have full advance knowledge of characteristics of resources … variety
Metadata model: the library example • Semantics, syntax, content: MARC, ISO 2709, AACR2 • Libraries (Picture by Stu Weibel)
Variety of formal and informal metadata models • Home Pages • Scientific Data • Commerce • Libraries • Geospatial • Internet Commons • Museums • Whatever... (Picture by Stu Weibel)
Variety of operations... • Discovery • Location • Selection • fit for use • Acquire • terms • Manipulate • Exploit • IPR • Document • Contextualise • Preserve • Manage • dates, people, structures, … • Agent/client access • …
Variety of sectors... • Curatorial traditions • ‘cataloguing’/documentation • libraries, archives, text archives, museums, geospatial data, etc • Network resource discovery • directory services, search engines, etc • influence from computer science • Network information management • web developments, W3C, database • sitemap, time to live, ... • pragmatic - market needs, vendor push
Variety of creation models... • Author/creator • web pages? • Repository/site manager • effective disclosure • better management • Third party creator • e.g. eLib subject gateways • Library
Metadata... • Variety of metadata models • syntax, semantics, content • scope • sectors/domains • Variety of operations supported • Variety of creation models • Variety of architectures for disclosure/discovery • Search and retrieve • Disclosure/distribution • Management … complex
Some formats richer… semantics, structure, domain-specific, ...
Dublin core in the metadata landscape
Dublin Core • Operations • resource discovery on the web • Explicitly cross sector/domain • No constraint on creation model or application architecture • Metadata model • Simple element set • focus on semantics - several target syntaxes … simple and intuitive ... [Figure labels: Dublin Core, Museum, MARC, FGDC]
Dublin Core - why success? • Simple • Coincides with strategic needs in each of the sectors we identified – Curatorial: semantic interoperability between richer metadata models – Resource discovery: a simple format for descriptive metadata (DLOs) – Web management: associate metadata with Web resources • Inclusive (countries/domains/traditions) • Stu Weibel
Introduction to Dublin Core
Dublin Core - elements • 15 element core metadata set • Title • Subject • Description • Creator • Publisher • Contributor • Date • Type • Format • Identifier • Source • Language • Relation • Coverage • Rights
Dublin Core - HTML Example
<HTML><HEAD>
<TITLE>UKOLN Home Page</TITLE>
<META NAME="DC.Title" CONTENT="UKOLN: UK Office for Library and Information Networking">
<META NAME="DC.Subject" CONTENT="national centre, network information support, library community, awareness, research, information services, public library networking, bibliographic management, distributed library systems, metadata, resource discovery, conferences, lectures, workshops">
<META NAME="DC.Description" CONTENT="UKOLN is a national centre for support in network information management in the library and information communities. It provides awareness, research and information services">
<META NAME="DC.Creator" CONTENT="Isobel Stark">
</HEAD>...
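The embedding convention on this slide can be sketched in a few lines of Python. This is only an illustration, not a UKOLN tool: the function name `dc_meta_tags` and the dict-based record layout are assumptions made for the example. It turns a record keyed by the 15 element names into `<META NAME="DC.…">` tags, repeating the tag when an element has several values.

```python
from html import escape

# The 15 elements of the unqualified Dublin Core set, as listed in the tutorial.
DC_ELEMENTS = (
    "Title", "Subject", "Description", "Creator", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def dc_meta_tags(record):
    """Render {element: value or list of values} as HTML <META> tags.

    Hypothetical helper for illustration; the record layout is an assumption.
    """
    lines = []
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        if isinstance(values, str):
            values = [values]  # a single value is treated as a one-item list
        for value in values:
            content = escape(value, quote=True)  # keep quotes from breaking the attribute
            lines.append(f'<META NAME="DC.{element}" CONTENT="{content}">')
    return "\n".join(lines)


example = dc_meta_tags({
    "Title": "UKOLN: UK Office for Library and Information Networking",
    "Creator": "Isobel Stark",
})
print(example)
```

Rejecting unknown element names is a design choice of the sketch; a real tool might instead pass local extensions through.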
Management
Data creation: practical issues of using Dublin Core for Internet resource description... • UKOLN metadata system • Requirements • 3 models for metadata management • Implementation at UKOLN
UKOLN metadata system requirements • Easy to use • Work with a variety of methods of creating HTML • Simple migration to future metadata formats • Separate metadata from resource
Managing Dublin Core (1): HTML authoring tool • Embed by hand using HTML or text editor • Pros… • Simple • May be useful for training and familiarisation • Cons… • May not be possible with all editors • Maintenance problems • Easy to make errors
DC-dot • A Web based tool for creating Dublin Core <meta> tags • Automatic generation of some tags based on content of the resource • Forms based editing of tags • Cut-and-paste output into HTML • Conversion to other formats… • SOIF, ROADS/WHOIS++, USMARC, GILS... • Run demo: http://www.ukoln.ac.uk/metadata/dcdot/
Managing Dublin Core (2): Web-site management tool • Use Web-site management tool, for example NetObjects Fusion • Pros… • Use of Web-site management tools likely to increase • Object-oriented database approach • Cons… • Proprietary formats • Early days - too early to evaluate use for metadata yet?
Managing Dublin Core (3): On the fly generation • Hold Dublin Core separately and embed on-the-fly using server-side include (SSI) • Pros… • Separates metadata from resource • Future migration fairly simple • Cons… • Performance • Lack of integration with HTML tools • Server specific
UKOLN metadata system (1) • Embed on-the-fly • Apache SSI script • Store metadata using SOIF records • Use MS-Access as tool to create the records • Associate metadata with resource by co-locating them in the Web server filestore
UKOLN metadata system (2)
Apache syntax for calling server-side script: <!--#exec cmd="getmeta" -->
intro.html (from HTML editor):
<html> <head> <title>…</title> <!--#exec cmd="getmeta" --> </head> ...
intro.html.soif (from MS-Access Database):
@FILE { http://www.ukoln.ac..
keywords{13}: xxx, yyy, zzz
description{14}: blah b
author{13}: Stark, Isobel
...
}
UKOLN metadata system (3) MS-Access front end... • Filename browser • Text boxes • Name choosers • UKOLN specific metadata
UKOLN metadata system (4) [Diagram, steps 1-6: Web robot, UKOLN Web server, intro.html (containing <!--#exec cmd="getmeta" -->), SSI script, intro.html.soif]
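The tutorial does not show the `getmeta` script itself, so the following Python is only a guess at the kind of work such a script does: read the co-located SOIF record and emit `<META>` tags. The function names, the attribute-to-element mapping in `SOIF_TO_META`, the `example.org` URL and the tab after each colon are all assumptions made for this sketch. The one SOIF feature it relies on is the one visible on the slide: the number in braces is the length of the value in bytes.

```python
import re
from html import escape

# Hypothetical mapping from the SOIF attribute names shown on the slide to
# META names; the mapping actually used at UKOLN is not given in the tutorial.
SOIF_TO_META = {
    "keywords": "DC.Subject",
    "description": "DC.Description",
    "author": "DC.Creator",
}

# "name{size}:" followed by one space or tab, then `size` bytes of value.
ATTR_HEADER = re.compile(rb"([^\s{}:]+)\{(\d+)\}:[ \t]")


def parse_soif(data):
    """Parse one SOIF object (bytes) into (template_type, url, attributes)."""
    header, _, body = data.partition(b"\n")
    match = re.fullmatch(rb"@(\S+)\s*\{\s*(\S+)\s*", header)
    if match is None:
        raise ValueError("missing '@TYPE { URL' header line")
    attributes = {}
    pos = 0
    while pos < len(body):
        attr = ATTR_HEADER.match(body, pos)
        if attr is None:
            break  # reached the closing "}"
        size = int(attr.group(2))
        start = attr.end()
        attributes[attr.group(1).decode()] = body[start:start + size].decode()
        pos = start + size + 1  # skip the newline that follows the value
    return match.group(1).decode(), match.group(2).decode(), attributes


def soif_to_meta(data):
    """Return <META> tags for the SOIF attributes that have a known mapping."""
    _, _, attributes = parse_soif(data)
    return [
        f'<META NAME="{SOIF_TO_META[name]}" CONTENT="{escape(value, quote=True)}">'
        for name, value in attributes.items()
        if name in SOIF_TO_META
    ]


sample = (
    b"@FILE { http://www.example.org/intro.html\n"
    b"keywords{13}:\txxx, yyy, zzz\n"
    b"author{13}:\tStark, Isobel\n"
    b"}\n"
)
print("\n".join(soif_to_meta(sample)))
```

Reading values by byte count rather than by line is what lets a SOIF value contain newlines; it also means a wrong count silently corrupts every attribute after it.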
Issues • Performance • Interaction with Web caches • Dublin Core vs Alta Vista style metadata
<META NAME="Description" CONTENT="blah, blah">
<META NAME="Keywords" CONTENT="xxx, yyy, zzz">
• Granularity • Which pages should have metadata?
A short history: Dublin to Helsinki We have borrowed some of this material from Stu Weibel, with permission
Dublin Core Workshop Series ... • DC-1: OCLC/NCSA Metadata Workshop, Mar 1995 • Limited Scope: Discovery of document-like objects • 13 element Dublin Core • Interdisciplinary consensus • DC-2: OCLC/UKOLN Warwick Workshop, April 1996 • Warwick Framework - modularity • Syntax issues
... Dublin Core Workshop Series • DC-3: CNI/OCLC Image Metadata Workshop, Sep 1996 • Images are in scope • 15 element core; some element name changes • DC-4: Canberra Metadata Workshop, Mar 1997 • Minimalists and Structuralists • Canberra Qualifiers (additional information useful for interpretation of metadata)
Dublin Core - qualifiers • Language of element value • Scheme • specifies a context for interpretation
<META NAME="DC.Subject" SCHEME="ddc.21" CONTENT="170.42">
• Sub-element • specifies a facet - narrows
<META NAME="DC.Creator.Address" CONTENT="l.dempsey@ukoln.ac.uk">
DC-5 • DC-5: National Library of Finland/OCLC Workshop, October 1997 – Formal Data Model (expressed in RDF) – many other problems are hereby made simpler – Resource Description Framework – The return of modularity – Finnish finish (of unqualified DC) – minimalist DC is done and will not be changed – Semantics for additional sub-structure – a small number of sub-elements will be established – Closer DC-W3C collaboration
Working groups • Data Model • Sub-elements • date, relationship, source • what is a resource? • 1:1 • RDF • Relationships • Typology • Date
RFCs in preparation • Simple DC semantics (the minimalist position) • Simple DC syntax for embedded HTML • DC semantics with qualifiers • DC syntax with qualifiers • HTML 2.0 • HTML 4.0 • RDF
Dublin Core implementation
Projects • 30 projects; 10 countries http://purl.org/metadata/dublin_core/projects.html • “Interdisciplinary and international recognition as the lingua franca for resource discovery metadata for electronic resources” (Stu Weibel) • Support for use for non-digital objects
The HTML 2.0 “kludge” • Convention for simple embedded metadata • Bootstrapping early Dublin Core deployments • META tags and standard HTML syntax • Useful for simple metadata without qualifiers • Can support Dublin Core qualifiers, but with risks for interoperability and indexing purity
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Information technology -- higher education">
HTML 4.0 - DC influences the web • Richer <META> tag attributes • LANG (language of the metadata) • SCHEME (formal qualifier) • SUB-ELEMENTS (dot syntax extensions) • Allows syntactically “clean” implementation of metadata with qualifiers
<META NAME="DC.Subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">
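A harvester has to cope with both conventions. The Python sketch below is one way to do it, under stated assumptions: the class name `DCExtractor` and the tuple layout are invented for the example, and the regular expression only recognises the `(SCHEME=…)` prefix form shown on the HTML 2.0 slide. It reads `<META>` tags with the standard library's `html.parser`, prefers an explicit `SCHEME` attribute, and falls back to unpacking the kludge from `CONTENT`. Sub-elements are taken from the dot syntax in `NAME`.

```python
import re
from html.parser import HTMLParser

# HTML 2.0 convention from the slide: the scheme is packed into CONTENT.
KLUDGE = re.compile(r"\(SCHEME=([^)]+)\)\s*(.*)", re.DOTALL)


class DCExtractor(HTMLParser):
    """Collect (element, sub_element, scheme, value) tuples from DC META tags.

    Illustrative only; the tuple layout is an assumption of this sketch.
    """

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)  # html.parser lower-cases attribute names
        name = attr.get("name") or ""
        if not name.upper().startswith("DC."):
            return
        parts = name.split(".")[1:]           # drop the "DC" prefix
        element = parts[0]
        sub_element = ".".join(parts[1:]) or None
        scheme = attr.get("scheme")           # HTML 4.0 style, if present
        value = attr.get("content") or ""
        if scheme is None:
            packed = KLUDGE.fullmatch(value)  # HTML 2.0 style fallback
            if packed:
                scheme, value = packed.group(1), packed.group(2)
        self.records.append((element, sub_element, scheme, value))


page = """
<META NAME="DC.Subject" CONTENT="(SCHEME=LCSH) Information technology -- higher education">
<META NAME="DC.Subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">
<META NAME="DC.Creator.Address" CONTENT="l.dempsey@ukoln.ac.uk">
"""
extractor = DCExtractor()
extractor.feed(page)
extractor.close()
for record in extractor.records:
    print(record)
```

Both `DC.Subject` tags come out as the same tuple, which is the point: an indexer that normalises the two forms need not care which HTML version the author targeted.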
Some quick statistics • UK (academic sites only) • Total pages: ~1.5M (a guess!) • Embedded DC: ‘a few hundred’ • Information provided by Dave Beckett http://www.cs.ukc.ac.uk/people/staff/djb1/ • Sweden • Total pages: 1.4M • Embedded DC: ‘a few dozen’ • Information provided by Sigfrid Lundberg http://www.lub.lu.se/nwiPaper/
Interoperability
Interoperability • What do we mean by interoperability? • Issues • Z39.50 and Dublin Core • Metadata registries
Interoperability? • Unify access to data in different domains - Web, library, museums, archives, ... • Issues (in real life these can all get mixed up) • Protocols - Z39.50, WHOIS++, … – gateways • Attribute names - author/creator/... – Semantic interoperability - mapping tables • Format of results – format converters
Protocol Gateways - an example • ZEXI - a Z39.50 to WHOIS++ gateway • Based on CNIDR's Isite • Accepts Z39.50 searches • Converts them to WHOIS++ • Returns SUTRS records http://roads.ukoln.ac.uk/cgi-bin/egwcgi/egwirtcl/targets.egw
Attribute names • Different databases may use different ‘names’ for the same thing • ‘creator’ vs ‘author’ • Need to be able to construct searches that ‘work’ against different databases irrespective of the ‘names’ in use • Dublin Core provides a minimal set of agreed ‘names’ with which we can construct searches
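A mapping table of the kind mentioned here can be sketched as a small Python dictionary. The database names, local field names and the function `translate_query` are all hypothetical, chosen only to show the idea: a search is phrased once in Dublin Core names and rewritten into each database's own names.

```python
# Hypothetical mapping tables: local attribute name -> Dublin Core element.
# Real tables are longer and are rarely this clean.
TO_DC = {
    "library_catalogue": {"author": "Creator", "title": "Title", "subject": "Subject"},
    "web_index": {"creator": "Creator", "title": "Title", "keywords": "Subject"},
}


def translate_query(dc_query, database):
    """Rewrite {DC element: search term} using the local names of `database`."""
    from_dc = {dc: local for local, dc in TO_DC[database].items()}
    return {from_dc[element]: term for element, term in dc_query.items()}


query = {"Creator": "Dempsey", "Subject": "metadata"}
for database in TO_DC:
    print(database, translate_query(query, database))
```

The sketch assumes every local name maps to exactly one element; where two local fields map to the same element, the inverse table would need to hold a list.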
Format of results • Different databases may return results in different formats • USMARC, GRS-1, SUTRS, IAFA, ... • Early stages of searching ideally need results to be returned in single ‘simple’ format • Dublin Core provides a minimal set of agreed data elements with which we can construct results
Z39.50 and DC - searching • Version 2 • Searches phrased in terms of single attribute set only • Either need to – add DC attributes to Bib-1 – map DC to Bib-1 • Version 3 • Multiple attribute sets allowed for searching • New simple DC attribute set to be proposed • Other attributes taken from Bib-1 http://cypress.dev.oclc.org:12345/~rrl/docs/dublincoreandz3950.html
Z39.50 and DC - retrieval • To return Dublin Core ‘records’ using Z39.50… • use GRS-1 (General Record Syntax) • elements are assigned tags • DC elements have been added to tagset-G
Format conversion - issues • Simple to rich, e.g. DC to MARC • May not generate valid rich record without manual enhancement • Use of DC qualifiers required for decent MARC record • Rich to simple, e.g. MARC to DC • Loss of data
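The ‘loss of data’ in a rich-to-simple conversion can be made concrete with a toy converter in Python. Everything here is simplified for illustration: the record is a flat list of (tag, value) pairs with no indicators or subfields, the function name `marc_to_dc` is invented, and `MARC_TO_DC` is a four-entry table rather than a full crosswalk. Fields with no entry in the table are reported separately so the loss is visible.

```python
from collections import defaultdict

# Deliberately tiny, illustrative table; a real crosswalk covers many more
# fields and works at subfield level.
MARC_TO_DC = {
    "100": "Creator",
    "245": "Title",
    "520": "Description",
    "650": "Subject",
}


def marc_to_dc(fields):
    """Convert (tag, value) pairs to DC; return (dc_record, dropped_tags)."""
    dc = defaultdict(list)
    dropped = []
    for tag, value in fields:
        element = MARC_TO_DC.get(tag)
        if element is None:
            dropped.append(tag)  # no home in this table: the data is lost
        else:
            dc[element].append(value)
    return dict(dc), dropped


record = [
    ("100", "Stark, Isobel"),
    ("245", "UKOLN Home Page"),
    ("250", "Draft ed."),
    ("650", "Metadata"),
    ("650", "Resource discovery"),
]
dc_record, dropped = marc_to_dc(record)
print(dc_record)
print("dropped:", dropped)
```

Going the other way, the same table shows why DC to MARC needs qualifiers or manual enhancement: a bare `Creator` value does not say whether it belongs in a personal or a corporate name field.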
Metadata registries • Semantics • Agreement on element meanings • Agreement on enumerated lists • Qualifiers • Thesaurus naming • Publishing existing metadata sets • Re-use by others - prevent duplication of work • e.g. Administrative metadata
Some pointers • Mapping tables http://www.ukoln.ac.uk/metadata/interoperability/ • Software • General http://www.ukoln.ac.uk/metadata/software-tools/ • d2m: Dublin Core to MARC converter http://www.bibsys.no/meta/d2m/ • USEMARCON http://www2.echo.lu/libraries/en/projects/usemarc.html
Harvesting
Harvesting Dublin Core • General Issues • Building a Web index • Harvest and NWI • Building a ‘local’ search engine • Harvest, SWISH-E, Isite, Zebra • DC as cataloguer’s aid
Harvesting - issues • Mappings • Multiple element values • Multiple languages • Complex data values • e.g. DC.Date, DC.Coverage • SCHEMES
Harvesting - issues • Frames • Harvesting non-embedded metadata • HTML 3.2 vs HTML 4.0 • Hidden pages • Controlling the robot
Harvest • Resource discovery suite of tools: robot, summarisers, indexers • SOIF records • Supports a variety of indexers • Supports database brokerage model • CGI based user-interface • UKOLN’s HTML summariser is Dublin Core aware http://www.tardis.ed.ac.uk/harvest/
Nordic Web Index • Custom robot - NWI/Combine • Dublin Core aware • GILS-II records • Indexed using Zebra • Searched using Z39.50 • User interface based on Europagate http://nwi.ub2.lu.se/?lang=uk
Other software • SWISH-E • system for indexing local collections of Web pages or other text files http://sunsite.berkeley.edu/SWISH-E/ • Isite • text indexer (Isearch) and Z39.50 http://www.cnidr.org/ir/isite.html • Zebra • text indexer and Z39.50 http://www.indexdata.dk
DC as cataloguer’s aid • ROADS • Software to create, manage and search Internet resource descriptions • WHOIS++ • Records created manually • ‘Pump-prime’ metadata record with values based on embedded DC using robot http://www.ukoln.ac.uk/roads/
DC as cataloguer’s aid • BIBLINK • Flow of information from publishers to National Bibliographic Agencies • MARC based catalogues of electronic publications • Initial MARC record based on DC description supplied by publisher using email http://www.ukoln.ac.uk/metadata/BIBLINK/
Dublin Core - critique
Limits • In development • Syntax • Simple • Discovery • Document like objects • Weak model • Administrative metadata • Addressed in Helsinki
Futures The material on RDF has been adapted from Stu Weibel’s material, with permission
Dublin Core futures • Internal • Syntax and semantics • External environment
Syntax • HTML 2, HTML 4, RDF, ... • RDF - W3C (World Wide Web Consortium) initiative • “RDF is the realization of the Warwick Framework for the Web” • RDF will be the foundation for an architecture for metadata on the Web • Resource description • Site mapping • Digital signatures • Electronic commerce • Third party rating
RDF: Why is it important? • RDF provides a coherent data model and syntactical framework for ‘plug-n-play’ metadata • the semantics and structure of metadata packages will be determined by stakeholder communities via independently developed and maintained metadata element sets • e.g.: MARC, DC, TEI, GILS, CIMI, Ratings… • Political imperatives for deployment • Software infrastructure will be ubiquitous (and come for free in browsers and servers)
Semantics • Tension • simple vs complex • generic vs specific • interoperability vs selfstanding • Development • relationship • sub-elements • scheme
Environment • ‘Save the time of the user’ • Diverse resources • Broker/middleware/gateway/trading place/… • Variety of protocols and metadata models • DC • simple - volume • ‘shallow’ - interop
Further Information • Dublin Core Home Page http://purl.org/metadata/dublin_core • W3 Metadata Overview and RDF Working Group Home Page http://www.w3.org/Metadata/RDF • UKOLN metadata page http://www.ukoln.ac.uk/metadata/