Taxonomies: Insuring compatibility and crosswalks
Marjorie M. K. Hlava
Access Innovations / Data Harmony
mhlava@accessinn.com

Background
"Underlying the information architecture for web sites and search are taxonomies. The standards for thesauri, taxonomies, ontologies, semantic web and topic maps are converging. Where do they differ and where are they the same? This one-hour talk will cover the ISO, ANSI/NISO, and W3C terminology and controlled vocabulary standards, as well as the differences in the new standards compared to the previous editions. Finally, it will talk about the crosswalks and registries underway between these development communities."

What we will cover today
o Background
o Overview of standards
o Specifics on 3 things
   n NISO Z39.19
   n BSI 8723
   n IFLA
o Thoughts on a registry

Why are taxonomies hot?
o Search doesn’t work
   n Without tagged data
o Websites need them to display information
o To tag navigation back to content

What’s happening to the business?
o Carpetbaggers
o Differences of opinion
o Want to build on existing taxonomies
o Need for standards
o Need for crosswalks
o Need for international communication
o Need for general registries of taxonomies

The Problem – KEEPING UP
o Many players we know and don’t know
o Between controlled vocabulary standards
   n ISO 2788 and 5964, BSI 8723
   n W3C with SKOS and OWL
o Groups developing guidelines and standards
   n Governments worldwide developing and mandating taxonomies
   n Communities
o Increase reuse, mapping, and interoperability between controlled vocabularies

Traditional Standards
o ISO
   n TC 46
      o SC 9
o ANSI
   n NISO
      o Z39.19
o BSI
   n BS 8723
o W3C
   n OWL
   n SKOS
o US Government
   n Office of Management and Budget
o European Union

Thesaurus related
o NISO Z39.19 2006 www.niso.org
o BSI (BS 8723) - the next revised
   n ISO 2788 - Monolingual (1986)
   n ISO 5964 - Multilingual (1985) www.iso.ch/iso/en/ISOOnline.frontpage
o ISO 5127 - Information and documentation Vocabulary
o OWL from W3C
o SKOS - the W3C thesaurus standard

Thesaurus and Indexing Standards – ANSI/NISO
o ANSI/NISO Z39.19-2003 Guidelines for the Construction, Format, and Management of Monolingual Thesauri
o NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
o NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices by James D. Anderson

The standards
o NISO Z39.19 2006 www.niso.org
o BSI (BS 8723) - the next revised
   n ISO 2788 - Monolingual (1986)
   n ISO 5964 - Multilingual (1985) www.iso.ch/iso/en/ISOOnline.frontpage
o ISO 5127 - Information and documentation Vocabulary
o OWL from W3C
o SKOS - the W3C thesaurus standard

Z39.19 - What’s new?
The old standard
o Coverage
   n Documents
o Types of vocabularies
   n Thesauri
o Single BT
o Post-coordinated
o Printed formats
o Monolingual vocabularies
The revised standard
o Coverage
   n Content objects
o Types of vocabularies
   n Lists, synonym rings, taxonomy
o Pre-coordinated
o Web format
o Multilingual vocabularies (general)
o Polyhierarchical
o Interoperability
o Facet analysis

British Standards - BS 8723
Structured vocabularies for information retrieval – Guide
o Part 1: General
o Part 2: Thesauri
o Part 3: Vocabularies other than thesauri
o Part 4: Interoperability between vocabularies
o Part 5: Interoperability with applications

ISO TC 37
Scope of ISO TC 37: Standardization of principles, methods and applications relating to terminology and other language resources.
o TC 37/SC 1 - Principles and methods
o TC 37/SC 2 - Terminography and lexicography
o TC 37/SC 3 - Computer applications for terminology
o TC 37/SC 4 - Language resource management

Other ISO standards: Concept-oriented terminology
o ISO 704:2000 Terminology work - Principles and methods
o ISO 860:1996 Terminology work - Harmonization of concepts and terms
o ISO 1087-1:2000 Terminology work - Vocabulary - Part 1: Theory and application
o ISO 1087-2:2000 Terminology work - Vocabulary - Part 2: Computer applications
o ISO 10241:1992 Preparation and layout of international terminology standards

Sample ISO - Data Categories
o ISO 12200:1999 Computer applications in terminology - Machine-readable terminology interchange format (MARTIF) - Negotiated interchange
o ISO 12616:2002 Translation-oriented terminography
o ISO/TR 12618:1994 Computer aids in terminology - Creation and use of terminological databases and text corpora
o ISO 12620:1999 Computer applications in terminology - Data categories used to create glossaries

ISO Thesaurus and Indexing Standards
o ISO 2788:1986 Documentation - Guidelines for the establishment and development of monolingual thesauri
o ISO 5964:1985 Documentation - Guidelines for the establishment and development of multilingual thesauri
o ISO 5963:1985 Documentation - Methods for examining documents, determining their subjects, and selecting indexing terms
o ISO 999:1996 Information and documentation - Guidelines for the content, organization and presentation of indexes

ISO TC 46/SC 9
o Information and Documentation - Identification and Description
o TC 46 is ISO's Technical Committee (TC) for information and documentation standards.
o SC 9 is the TC 46 Subcommittee (SC) that develops and maintains ISO standards on the identification and description of information resources.

ANSI/NISO Thesaurus and Indexing Standards
o ANSI/NISO Z39.19-2005 Guidelines for the Construction, Format, and Management of Monolingual Thesauri
o NISO Z39.19-200x Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
o NISO TR02-1997 Guidelines for Indexes and Related Information Retrieval Devices by James D. Anderson

Reports to use
o Report on the Workshop on Electronic Thesauri, November 4-5, 1999
  http://www.niso.org/news/events_workshops/thes99rprt.html
o Final Report to the ALCTS/CCS Subject Analysis Committee: Subcommittee on Subject Relationships/Reference Structures, June 1997
  http://archive.ala.org/alcts/organization/ccs/sac/rpt97rev.html

Other links
o http://esw.w3.org/topic/SkosDev/ThesaurusLinks/XmlFormats
o MARC-21 XMLSchema.
o Zthes Z39.50 profile for thesaurus navigation (2001).
o TML thesaurus markup language (1999).
o ADL Thesaurus Protocol XML formats (2002).
o MeSH XML format (2001).
o GEMET XML format (2003).
o APAIS XML thesaurus format, an extension of Zthes (2000).
o Open University thesaurus schemas (2002).
o Soergel XML thesaurus specification (2001).

W3C
o OWL – Web Ontology Language
o RDF – Resource Description Framework
o Topic Maps
o SKOS – Simple Knowledge Organization System
Which community to serve?
Build on the current standard
Might make this link next

Other things to watch
o Other W3C and ISO areas
o Support groups
   n Blogs
   n Communities of Practice
o SIMILE
o Web 2.0 activities
o WSDL – Web Services Description Language

Other Relevant ISO & W3C Standards
• Markup Languages
• Metadata Resources
• Character Coding
• Access Protocols and Interoperability
• Content Creation, Manipulation, and Maintenance
• Authoring Standards
• Text and Content Markup
• Translation Standards
• Terminology and Lexicography Standards
• ISO TC 37 Standards
• Terminology Interchange Standards
• Controlled Language Standards
• Taxonomy and Ontology Standards
• Corpus Management Standards
• Locale-Related Standards
For translation, terminology and applied linguists go to:
http://appling.kent.edu/ResourcePages/LTStandards/Chart/standards.chart.htm#Ontology

SIMILE
o Semantic Interoperability of Metadata and Information in unLike Environments
o Forming a data reference for open source taxonomies

Revised Standards for Controlled Vocabularies
U.S. Standard (NISO Z39.19 - 2005)
British Standard (BS 8723 - 2005)
IFLA Guidelines - 2005

U.S. Standard for Controlled Vocabularies – NISO Z39.19-200x
Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
Some of the slides are based on Emily Fayen's 2004.6 SLA presentation, Margie Hlava’s talk at the 2005 Data Harmony User Group meeting, and Marcia Zeng's presentation at the NKOS Meeting in Denver

A little bit of history…
o ANSI/NISO Z39.19, Guidelines for the Construction, Format, and Management of Monolingual Thesauri – 1993
o The most frequently requested NISO Standard
o In spite of its age, the Standard is still relevant
o 1999: NISO Workshop on Electronic Thesauri
  http://www.niso.org/news/events_workshop/thes99rpt.html
o 2002: NISO initiates revision of Z39.19
o 2004: 1993 edition reaffirmed
o 2005: new standard published

Scope
o Expand beyond thesaurus
o Make more user-friendly
o Explain important concepts
o Explain principles of vocabulary control
o Include electronic information environment
o Include additional user search methods:
   n Browse
   n Navigate
   n Keyword searching
o Expand beyond A & I services
o Include Web applications

The Team:
   n Vivian Bliss – Microsoft
   n Carol Brent – ProQuest
   n John Dickert – DTIC
   n Lynn El-Hoshy – Library of Congress
   n Marjorie Hlava – Access Innovations
   n Stephen Hearn – ALA
   n Sabine Kuhn – Chemical Abstracts Service
   n Pat Kuhr – H. W. Wilson Company
   n Diane McKerlie – DMA Consulting
   n Peter Morville – Semantic Studios
   n Stuart Nelson – National Library of Medicine
   n Allan Savage – National Library of Medicine
   n Diane Vizine-Goetz – OCLC
   n Marcia Lei Zeng – Special Libraries Association

Z39.19 Chapters
1. Introduction
2. Scope
3. Referenced Standards
4. Definitions, Abbreviations, and Acronyms
5. Controlled Vocabularies – Purpose, Concepts, Principles, and Structure
6. Term Choice, Scope, and Form
7. Compound Terms
8. Relationships
9. Displaying Controlled Vocabularies
10. Interoperability
11. Construction, Testing, Maintenance, and Management Systems

Z39.19 - What’s new?
The old standard
o Coverage
   n Documents
o Types of vocabularies
   n Thesauri
o Single BT
o Post-coordinated
o Printed formats
o Monolingual vocabularies
The revised standard
o Coverage
   n Content objects
o Types of vocabularies
   n Lists, synonym rings, taxonomy
o Pre-coordinated
o Web format
o Multilingual vocabularies (general)
o Polyhierarchical
o Interoperability
o Facet analysis

Principles of Controlled Vocabularies
o There are four important principles of vocabulary control that guide their design and development.
   • eliminating ambiguity
   • controlling synonyms
   • establishing relationships among terms where appropriate
   • testing and validation of terms

Types of vocabulary control

Lists
A list is a simple group of terms
Example:
   Alabama
   Alaska
   Arkansas
   California
   Colorado
   . .
Frequently used in Web site pick lists and pull-down menus

Synonym Rings A synonym ring is a list of synonyms or near synonyms that are used interchangeably for retrieval purposes
Synonym Rings -- Examples
Synonym rings are usually found as sets of lists that allow users to access all content containing any of the terms.
-- Frequently used in systems where the content is not indexed or the indexing vocabulary is not controlled
e.g., cholesterol:
   Cholesterol
   Blood Cholesterol
   Serum Cholesterol
   Good Cholesterol
   Bad Cholesterol
   LDL
   . . .

An example from International SEMATECH; a search for Silicon would look like this: Your search was submitted as “SILICON” or “SI”
Synonym Rings are used
o To expand queries for content objects.
   n Any one of these terms retrieves any of the terms in the cluster.
o With unstructured natural language format,
   n the interface draws together similar terms
o With search engines
   n Help control the diversity of the language

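The query expansion described on the synonym ring slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ring contents are taken from the slide examples, while the names `SYNONYM_RINGS`, `expand_query`, and `to_boolean_query` are hypothetical and do not come from Z39.19 or from the SEMATECH system.

```python
# Hypothetical synonym rings built from the slide examples.
SYNONYM_RINGS = [
    {"silicon", "si"},
    {"cholesterol", "blood cholesterol", "serum cholesterol"},
]


def expand_query(term):
    """Return the searched term followed by the other members of its ring."""
    key = term.strip().lower()
    for ring in SYNONYM_RINGS:
        if key in ring:
            # Keep the user's term first, then the remaining ring members.
            return [key] + sorted(ring - {key})
    # Terms outside every ring are searched as entered.
    return [key]


def to_boolean_query(term):
    """Render the expanded terms as an OR query string."""
    return " or ".join(f'"{t.upper()}"' for t in expand_query(term))


print(to_boolean_query("Silicon"))
```

Because no term in a ring is preferred over another, the lookup is symmetric: searching for "SI" expands to the same cluster as searching for "Silicon".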
Taxonomies
A taxonomy is a set of preferred terms, all connected by a hierarchy or polyhierarchy
Example:
   Chemistry
      Organic chemistry
         Polymer chemistry
            Nylon
Frequently used in web navigation systems

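A minimal Python sketch of how a hierarchy or polyhierarchy might be stored and turned into navigation paths. It assumes the slide's example is a single nested chain; the second parent "Synthetic fibers" and the names `BROADER` and `paths_to_top` are hypothetical additions used only to show a term with more than one broader term.

```python
# Each term maps to its broader terms. More than one entry means polyhierarchy.
BROADER = {
    "Chemistry": [],
    "Organic chemistry": ["Chemistry"],
    "Polymer chemistry": ["Organic chemistry"],
    # "Synthetic fibers" is a hypothetical second parent, not from the slides.
    "Nylon": ["Polymer chemistry", "Synthetic fibers"],
    "Synthetic fibers": [],
}


def paths_to_top(term):
    """Return every path from a top term down to `term`."""
    parents = BROADER.get(term, [])
    if not parents:
        return [[term]]
    paths = []
    for parent in parents:
        for path in paths_to_top(parent):
            paths.append(path + [term])
    return paths


for path in paths_to_top("Nylon"):
    print(" > ".join(path))
```

In a web navigation system each returned path could serve as one breadcrumb trail, so a polyhierarchical term is reachable from several places.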
Thesauri
A thesaurus is a controlled vocabulary with multiple types of relationships
Example:
   Rice
      UF paddy
      BT Cereals
      BT Plant products
      NT Brown rice
      RT Rice straw

Thesauri (cont.)
Relationship types:
o Equivalence (Use/Used For) – indicates preferred term in a synonym relationship
o Hierarchy – indicates broader and narrower terms
o Associative – almost unlimited types of relationships may be used - related
The thesaurus is the most complex format for controlled vocabularies and is widely used.

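The three relationship types can be illustrated with the Rice entry from the thesaurus example. The sketch below is an assumption about one reasonable in-memory representation, not a format prescribed by any of the standards: each statement is stored together with its reciprocal (BT/NT, UF/USE, RT/RT), and the names `RECIPROCAL` and `build_thesaurus` are hypothetical.

```python
from collections import defaultdict

# Reciprocal of each relationship tag: equivalence, hierarchy, associative.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


def build_thesaurus(statements):
    """Store each (term, relation, target) statement and its reciprocal."""
    thesaurus = defaultdict(lambda: defaultdict(set))
    for term, rel, target in statements:
        thesaurus[term][rel].add(target)
        thesaurus[target][RECIPROCAL[rel]].add(term)
    return thesaurus


# The Rice entry from the slide.
RICE = [
    ("Rice", "UF", "paddy"),
    ("Rice", "BT", "Cereals"),
    ("Rice", "BT", "Plant products"),
    ("Rice", "NT", "Brown rice"),
    ("Rice", "RT", "Rice straw"),
]

thesaurus = build_thesaurus(RICE)
print(sorted(thesaurus["Rice"]["BT"]))
print(sorted(thesaurus["paddy"]["USE"]))
```

Generating the reciprocal automatically keeps the vocabulary consistent: a lookup on the non-preferred term "paddy" leads back to the preferred term "Rice" without a second manual entry.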
Interoperability
o One of the most important issues from the 1999 workshop
o Question: How to
   n compare indexes
   n perform searches
   n merge databases
  that have been developed using different controlled vocabularies?

Interoperability (cont.)
o Factors Affecting Interoperability
o Multilingual Controlled Vocabularies
o Searching
o Indexing
o Merging Databases
o Merging Controlled Vocabularies
o Achieving Interoperability
o Storage and Maintenance of Relationships among Terms in Multiple Controlled Vocabularies

II. The British Standard
BS 8723: Structured Vocabularies for Information Retrieval – Guide
Slides based on the presentation by Stella G. Dextre Clarke, Alan Gilchrist, and Leonard Will at ISKO 2004, London

Existing BSI/ISO thesaurus standards
o ISO 2788-1986 Guidelines for the establishment and development of monolingual thesauri = BS 5723:1987
o ISO 5964-1985 Guidelines for the establishment and development of multilingual thesauri = BS 6723:1985

What needs updating?
o Printed versus electronic application
o Guidance on management software
o Interoperability:
   n Mapping between thesauri and other types of vocabulary
   n Formats/protocols for data exchange with downstream applications
o Applicability to end-user applications, not just those for information professionals

Outline of new standard
BS 8723: Structured vocabularies for information retrieval – Guide
   n Part 1 - Definitions, symbols and abbreviations
   n Part 2 - Thesauri
   n Part 3 - Vocabularies other than thesauri
   n Part 4 - Interoperability between vocabularies
   n Part 5 - Interoperation between vocabularies and other components of information storage and retrieval systems

Part 3 chapters
o Classification schemes
o Subject heading lists
o Taxonomies
o Ontologies
o Semantic nets (?)
o Search thesauri

Issues for Part 3
o How much guidance is needed on how to build other sorts of vocabulary? Should we describe the idiosyncrasies of existing schemes, even where we judge there is a ‘better’ way?
o Pick out the characteristics of different vocabulary types that govern when and how you can map them. But some of the observable characteristics might not be what we’d recommend.

Part 4: Interoperability between vocabularies
o Huge demand for accessing information indexed with another language and/or vocabulary.
   n ‘Mapping’.
   n The Semantic Web is just one application.
o Includes multilingual thesauri
   n special case of mapping between vocabularies.
o Applies where
   n more than one language or vocabulary is in use,
   n access to all resources is through one vocabulary

Part 4: Interoperability between vocabularies (cont.)
o BS 8723 Part 4 has a wider scope
   n BS 6723 dealt only with multilingual thesauri.
o BS 8723 extends the scope to:
   n thesauri in different dialects of one language
   n different thesauri in a single language
   n situations where a thesaurus interoperates with one or more different types of structured vocabulary, such as classification schemes
   n situations where not all the interoperating vocabularies have the same status and/or function.

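The mapping idea behind Part 4 (access to all resources through one vocabulary) can be sketched as a small crosswalk table in Python. Everything here is a hypothetical illustration: the vocabulary name `vocab_a`, the terms, the match labels "exact" and "broader", and the functions `map_term` and `search_terms` are assumptions made for the example and are not taken from BS 8723.

```python
# (source vocabulary, source term) -> list of (target term, match type).
# All entries are invented for illustration.
CROSSWALK = {
    ("vocab_a", "Rice"): [("Oryza sativa", "exact")],
    ("vocab_a", "Brown rice"): [("Oryza sativa", "broader")],
}


def map_term(vocab, term):
    """Look up the target-vocabulary mappings for one source term."""
    return CROSSWALK.get((vocab, term), [])


def search_terms(vocab, term, allow_inexact=True):
    """Return target terms to search with, optionally exact matches only."""
    return [
        target
        for target, kind in map_term(vocab, term)
        if allow_inexact or kind == "exact"
    ]


print(search_terms("vocab_a", "Rice"))
print(search_terms("vocab_a", "Brown rice", allow_inexact=False))
```

Recording the match type alongside each mapping lets a search interface decide whether an inexact mapping is acceptable, which matters when the interoperating vocabularies do not have the same status or level of detail.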
Part 5: Interoperability with applications
o Vocabularies must work with
   n Search software
   n Content Management Systems
   n Web publishing software, etc.

Build on existing formats and protocols for data exchange
o Z39.50 and Zthes, XML schema
o DTD
o MARC
o SKOS Core Schema
o Topic Map
o ADL gazetteer protocol
o W3C crosswalks
o OMB - Section 207 of e-gov act

Review and Comments
o Request a copy for Parts 1, 2, 3 and 4:
   n Parts 1 and 2 numbered 04/30086620 DC and 04/30094113 DC.
   n The documents may be ordered from BSI Customer Services
      o tel +44(0)208-996-9001 or
      o email orders@bsi-global.com
o Part 5 is out for comment

III. IFLA Guidelines for Multilingual Thesauri
IFLA Classification and Indexing Section
April 2005 released for comments
Published 2005

Add to the ISO 5964 for multilingual thesauri
World-Wide Review of IFLA Guidelines for Multilingual Thesauri
o URL: http://www.ifla.org/VII/s29/pubs/Draftmultilingualthesauri.pdf

IFLA Classification and Indexing Section WG on Guidelines for Multilingual Thesauri
o Chair: Gerhard J. A. Riesthuis (Netherlands)
o Members:
   o Lois Mai Chan (USA)
   o Patrice Landry (Switzerland)
   o Pia Leth (Sweden)
   o Ia McIlwaine (United Kingdom)
   o Martin Kunz (Germany)
   o Dorothy McGarry (USA)
   o Max Naudi (France)
   o Marcia Lei Zeng (USA)

Three approaches in the development of multilingual thesauri:
1. building a new thesaurus from the bottom up
   n starting with one language and adding another language or languages
   n starting with more than one language simultaneously
2. combining existing thesauri
   n merging two or more existing thesauri into one new (multilingual) information retrieval language to be used in indexing and retrieval
   n linking existing thesauri and subject heading languages to each other; using the existing thesauri and/or subject heading languages both in indexing and retrieval
3. translating a thesaurus into one or more other languages

Semantic problems
Semantic problems pertain to equivalence relations between terms used as preferred and non-preferred terms in information retrieval languages.
   n Equivalence relations exist not only within each separate language involved, but also between the languages (intra-language equivalence and inter-language equivalence).
   n Intra-language homonymy and inter-language homonymy are also considered semantic questions.
   n Additional problems pertaining to semantics involve the scope, form and choice of thesaurus terms.

Structural problems
o Structural problems involve hierarchical and associative relations between the terms.
o An important question in this respect is whether the structure should be the same or different for each language.
   n In most if not all cases of linking, the structure will most probably not be the same in all the information retrieval languages involved.
   n In the other approaches mentioned it is possible in principle to apply the same structure to all languages.

Contents covered by the guidelines
o Building multilingual thesauri starting from scratch
   n Structure
   n Morphology and Semantics
o Starting from existing thesauri
   n Merging
   n Linking
o Glossary
o Appendix:
   n An example of a non-symmetrical thesaurus

Examples are in multiple languages
That cranes is a homograph in English does not necessarily mean that equivalent terms in other languages are also homographs. The Dutch term kranen is a homograph too, but with the meanings cranes (lifting equipment) and taps.

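The cranes/kranen case can be made concrete with a short Python sketch that records the senses of each term per language and compares them. The Dutch senses are those given on the slide; the second English sense ("birds") is supplied from general English usage and is an assumption, as are the names `SENSES`, `is_homograph`, and `shared_senses`.

```python
# (language, term) -> set of senses.
SENSES = {
    # "birds" is assumed from general English usage, not stated on the slide.
    ("en", "cranes"): {"birds", "lifting equipment"},
    ("nl", "kranen"): {"lifting equipment", "taps"},
}


def is_homograph(lang, term):
    """A term is a homograph when it carries more than one sense."""
    return len(SENSES.get((lang, term), set())) > 1


def shared_senses(a, b):
    """Senses in which two terms are inter-language equivalents."""
    return SENSES[a] & SENSES[b]


print(is_homograph("en", "cranes"), is_homograph("nl", "kranen"))
print(shared_senses(("en", "cranes"), ("nl", "kranen")))
```

Both terms are homographs, yet they are equivalent in only one sense, which is why a multilingual thesaurus has to establish equivalence per concept (usually with qualifiers) and not per character string.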
What is a taxonomist to do?
o Watch the standards
o Participate in development
o Exceed the guidelines
o Comply with all standards – internationally
o Promote standards participation
And we do – so far!

Controlled vocabularies of all stripes need a place to call home
o Open contribution
o Thesaurus metadata contributions
o Comments on the contributions
o Examples of implementation
o A clearing house to keep track of
   o all the initiatives and suggested standards,
   o a means to allow input from and to those initiatives, and
   o publishing of best practices or lessons learned from implementations
o perhaps a WikiKOS

The Solutions
o Registry?
o NKOS of KOS
o SKOS participants
o KOS typology - Tudhope
o Tesauro.com – Spanish - Salama
o Kent.edu site – Marcia Zeng
o Taxonomy Warehouse – Factiva - Clarke
o UMLS - Unified Medical Language System

More Solutions
o Semantic Interoperability of Metadata and Information in unLike Environments (Open Source)
o UK HILT - Dennis Nicholson

Good starts
o Link to each other
o Include
   n Thesauri
   n Taxonomies
   n Semantic webs
   n Classification systems
   n Subject headings
   n SKOS
   n OWL and Ontologies
   n Other KOS

What about?
o Authority Files
o Other pick lists
o Roget's and other synonym rings
o Dictionaries
o Gazetteers
o Glossaries
o Etc.

Discussion??
Thank you for your attention!
Marjorie M. K. Hlava
Access Innovations / Data Harmony
mhlava@accessinn.com