Automatic cataloging & classification
Eric Childress, OCLC Research
OCLC Members Council Research and New Technologies Interest Group
25 October 2005

The key question
§ Can machines be leveraged for metadata?
  [Diagram: Input → Human labor (status quo) → Output]
  – Baseline metadata
    • Critical data present
    • Accurate tagging
    • Accurate values
  – Ideal: Enriched metadata
§ The answer:
  – Yes…with caveats

Automation approaches
§ Harvesting: Drawing from extant metadata in one or more sources
§ Extraction: Drawing from attributes of the resource and/or content in the resource
§ Both: Integrating harvesting & extraction in metadata generation
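
The three approaches can be sketched on a single HTML page: embedded `<meta>` tags are harvested, a title is extracted from the content itself, and the two are merged. This is a minimal sketch under stated assumptions; the sample page, the field names, and the `generate_metadata` helper are hypothetical and are not taken from any tool named in this presentation.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects embedded <meta> tags (harvesting) and <h1> text (extraction)."""

    def __init__(self):
        super().__init__()
        self.harvested = {}   # metadata already supplied with the resource
        self.extracted = {}   # metadata derived from the content itself
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.harvested[attrs["name"].lower()] = attrs["content"]
        elif tag == "h1" and "title" not in self.extracted:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1 and data.strip():
            self.extracted["title"] = data.strip()


def generate_metadata(html):
    """Merge both sources; harvested values win where both exist (an assumption)."""
    scanner = PageScanner()
    scanner.feed(html)
    return {**scanner.extracted, **scanner.harvested}


# Hypothetical sample page, for illustration only.
SAMPLE = """<html><head>
<meta name="DC.creator" content="Example, Author">
</head><body><h1>Sample report</h1><p>Body text.</p></body></html>"""

record = generate_metadata(SAMPLE)
print(record)
```

Letting harvested values override extracted ones is only one possible policy; a production system might instead keep both and record the provenance of each value.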

Approaches (cont’d)
§ Harvesting & extraction can be integrated with other tactics:
  – Point-of-transaction capture: Manual and/or automatic capture of metadata during the lifecycle of the resource and/or metadata (e.g., the source agency, date of record)
  – Human review/prompting: Integrating human decision-making to address cases machines cannot handle efficiently (e.g., linking name references to the correct authority file when several names are similar)
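
The human review/prompting tactic might look like the following sketch: a name is linked automatically only when one authority heading clearly stands out, and is otherwise routed to a person. The headings, the `link_name` helper, and the threshold and margin values are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical authority headings, for illustration only.
AUTHORITY_FILE = [
    "Smith, John, 1950-",
    "Smith, John, 1962-",
    "Smythe, Joan",
]


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_name(name, headings, threshold=0.6, margin=0.1):
    """Return (heading, needs_review).

    A heading is linked automatically only when it scores at least
    `threshold` and beats the runner-up by at least `margin`;
    every other case is flagged for human review.
    """
    scored = sorted(((similarity(name, h), h) for h in headings), reverse=True)
    candidates = [(s, h) for s, h in scored if s >= threshold]
    if not candidates:
        return None, True          # nothing close enough
    if len(candidates) > 1 and candidates[0][0] - candidates[1][0] < margin:
        return None, True          # several similar names: ask a person
    return candidates[0][1], False


print(link_name("Smith, John", AUTHORITY_FILE))
print(link_name("Smythe, Joan", AUTHORITY_FILE))
```

In this sample, "Smith, John" matches two headings equally well and is sent for review, while "Smythe, Joan" matches one heading exactly and is linked without intervention.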

Harvesting options
§ New record, same database:
  – OCLC “derive” record technique
§ External metadata files:
  – Z39.50/ZING/MXG
  – OAI harvesting
  – Citation tools (e.g., EndNote)
§ Embedded metadata harvesting:
  – Processes structured metadata
  – Various tools (e.g., DC tools list)
§ Many harvesting tools include some extraction features (and vice-versa)
  – Example: InfoLibrarian appliance
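
As one illustration of the OAI harvesting option, the sketch below builds an OAI-PMH `ListRecords` request URL and parses unqualified Dublin Core out of a response. The repository address and the sample response are hypothetical, and no network request is made; a real harvester would also need to handle resumption tokens, errors, and deleted records.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for a repository base URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Turn a ListRecords response into a list of simple dicts."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        fields = {}
        for el in rec.iterfind("oai:metadata/oai_dc:dc/*", NS):
            name = el.tag.split("}", 1)[-1]      # strip the namespace part
            fields.setdefault(name, []).append((el.text or "").strip())
        records.append({"identifier": identifier, **fields})
    return records


# Hypothetical response, for illustration only.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample title</dc:title>
          <dc:creator>Example, Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("http://example.org/oai")
harvested = parse_records(SAMPLE_RESPONSE)
print(url)
print(harvested)
```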

Extraction landscape
§ Many tools from many sources
  – Features vary widely
  – Some are narrow-band (e.g., domain-specific, narrow scope of data work)
  – Standalone or highly integrated in systems (often as part of digital access mgt. systems)
§ Frequently-encountered features:
  – Simple: document statistics, file type
  – Complex: (reliable) language detection, audience level, topics, entities represented, document parts, taxonomy derivation
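
The "simple" end of the feature range can be sketched with the Python standard library alone: file type guessed from the file name, plus a few document statistics. The `simple_features` helper and its output fields are assumptions for illustration; the "complex" features listed above (language detection, topics, entities) need considerably more than this.

```python
import mimetypes
import re


def simple_features(filename, text):
    """Extract a file type guess and basic statistics from a document."""
    file_type, _ = mimetypes.guess_type(filename)   # based on the extension only
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_len = round(sum(map(len, words)) / len(words), 2) if words else 0.0
    return {
        "file_type": file_type,
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_word_length": avg_len,
    }


features = simple_features("report.html", "Metadata matters. Machines can help.")
print(features)
```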

Extraction approaches
§ Information extraction:
  – “Automatically extract structured or semistructured information from unstructured machine-readable documents” (Wikipedia)
§ Natural language processing:
  – “A range of computational techniques for analyzing and representing naturally occurring text (free text) at one or more levels of linguistic analysis (e.g., morphological, syntactic, semantic, pragmatic) for the purpose of achieving human-like language processing for knowledge-intensive applications” (AHIMA)
  – Extracts both explicit & implicit meaning

Some work of interest
§ Library of Congress
§ NSF-funded NSDL projects
§ AMeGA
§ iVia software
§ RLG’s Automatic Exposure

Library of Congress
§ BEAT (Bibliographic Enrichment Advisory Team) activities & projects:
  – MARC records from harvesting:
    • E-CIP
    • Web access to publications in series
  – Numerous enrichment activities:
    • TOCs: E-CIP, ONIX, dTOC project, more
    • Reviews: HNET, Outstanding Reference Sources, HLAS reviews, MARS Best Free Reference Sites
    • Contributor biographical information, ONIX descriptions, sample texts
    • Links to e-versions of various texts
    • Special projects for select LC collections
  – Work with bibliographies & pathfinders

NSDL-related projects (selected)
§ MetaExtract: An NLP System to Automatically Assign Metadata
  – CNLP (Syracuse U) & SIS (Syracuse U)
  – Builds on several previous projects including:
    • Breaking the MetaData Generation Bottleneck [2000-2002]
§ Lenny
  – CNLP (Syracuse U) & U Washington iSchool
  – Application of NLP to automatically generate metadata for course-oriented materials
  – Cornell NSDL group & INFOMINE
  – Orchestrated application of a suite of activities
    • OAI harvesting with metadata augmentation using iVia
    • Loosely-coupled third party services to provide metadata enhancements (correction, augmentation) to metadata destined for a central repository
    • Interactions orchestrated by centralized software application

MetaExtract study findings
§ Auto-generated versus manually-assigned:
  – Comparable
    • Performance in Retrieval
    • Quality of most elements (for Browsing)
  – Better
    • Coverage of metadata elements
§ Auto-generated versus full-text:
  – Comparable
    • Performance in Retrieval
  – Better
    • Enables Fielded searching
    • Enables Browsing of results
  – Provides useful structuring of data

Other projects
§ AMeGA (Automatic Metadata Generation Applications Project)
  – UNC-CH SILS Metadata Research Center
  – Research initiated to fulfill LC Bibliographic Control Action Plan 4.2 (deliver specifications for tools to effect automated processing of Web-based resources)
  – Final report identifies and recommends functionalities for automatic metadata generation applications
§ iVia software
  – Developed by INFOMINE & in use by NSDL, various other digital library projects; LC looking at using iVia
  – Sophisticated open source harvester software that can assign LCSH, LCC
§ Automatic Exposure
  – RLG-led initiative advocates capturing standard technical metadata about digital images automatically, as part of image creation

OCLC activities
§ OCLC Research projects:
  – Automatic classification
  – FRBR-related record harvesting
  – SchemaTrans
§ OCLC production services:
  – OCLC Digital Archive
  – WorldCat link
  – OCLC Connexion

Automatic classification work
§ Scorpion
  – Open source software that implements a system for automatically classifying Web-accessible text documents
  – Incorporated into Connexion extractor
§ FAST as a knowledge base for automatic classification project
  – Evaluated FAST as a database to support automatic classification
§ ePrints-UK project
  – A collaboration with RDN to pilot Web services to classify records by DDC and provide authority control for personal names for RDN eprint metadata records
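
To give a feel for what automatic classification of a text document involves, the toy sketch below ranks candidate classes by how many of their descriptive terms appear in a document. It is not Scorpion's algorithm and does not use real DDC or FAST data; the class table, the `classify` helper, and the scoring rule are all assumptions made for illustration.

```python
import re
from collections import Counter

# Toy knowledge base: class number -> descriptive terms.
# Loosely modeled on DDC top-level classes; the term lists are invented.
CLASSES = {
    "020": "library information science cataloging metadata",
    "510": "mathematics algebra geometry numbers",
    "780": "music composers instruments songs",
}


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text, classes=CLASSES, top_n=2):
    """Rank classes by the number of document tokens matching their terms."""
    doc_terms = Counter(tokenize(text))
    scores = {}
    for number, description in classes.items():
        score = sum(doc_terms[t] for t in set(tokenize(description)))
        if score:
            scores[number] = score
    # Highest score first; ties broken by class number.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


ranked = classify("Cataloging rules and metadata for music library collections")
print(ranked)
```

Returning a ranked list rather than a single class leaves room for the human review step described earlier in the presentation.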

Other OCLC Research activities
§ FRBR-related record harvesting
  – Best elements of all records in workset used to build a “work” record (Fiction Finder)
§ SchemaTrans project
  – Adopts a novel approach to translating structured metadata between schemes
  – Should be friendly to modular augmentation/correction activities
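
The idea of building a "work" record from the best elements of a workset can be sketched as follows. How Fiction Finder actually selects the best elements is not described here; the rule below (most frequent value for single-valued fields, de-duplicated union for list fields), the `build_work_record` helper, and the sample records are assumptions for illustration.

```python
from collections import Counter


def build_work_record(workset, fields=("title", "author", "subjects")):
    """Combine a set of related records into one 'work' record."""
    work = {}
    for field in fields:
        values = [rec[field] for rec in workset if rec.get(field)]
        if not values:
            continue
        if isinstance(values[0], list):
            # List-valued field: union of all values, first-seen order.
            merged = []
            for value_list in values:
                for item in value_list:
                    if item not in merged:
                        merged.append(item)
            work[field] = merged
        else:
            # Single-valued field: keep the most frequently occurring value.
            work[field] = Counter(values).most_common(1)[0][0]
    return work


# Sample records made up for illustration.
WORKSET = [
    {"title": "Moby Dick", "author": "Melville, Herman", "subjects": ["Whaling"]},
    {"title": "Moby Dick", "author": "Melville, Herman, 1819-1891",
     "subjects": ["Whaling", "Sea stories"]},
    {"title": "Moby-Dick", "author": "Melville, Herman, 1819-1891"},
]

work_record = build_work_record(WORKSET)
print(work_record)
```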

OCLC products
§ OCLC Digital Archive
  – Various harvesting options
    • Capture of technical metadata
    • Start descriptive records in Connexion
§ WorldCat link
  – Scheduled ingest of metadata from OAI servers and batch processing into WorldCat
§ OCLC Connexion
  – Extractor processes metadata from web sites
    • Relatively sophisticated harvesting
    • Processes non-canonical metadata
    • Slated for significant upgrade in 2006
  – Rules-aided LCSH assignment while editing bibs
  – Automatic base authority record generation from relevant bibliographic record (NACO)

Links
§ Recommended reading:
  – Liddy, Elizabeth, “Metadata: A Promising Solution” in EDUCAUSE Review, v. 40, n. 3 (May/June 2005)
§ OCLC Research links:
  – Automatic classification projects
  – SchemaTrans
  – ResearchWorks