Automatic cataloging classification Eric Childress OCLC Research OCLC


















Automatic cataloging & classification
Eric Childress, OCLC Research
OCLC Members Council, Research and New Technologies Interest Group
25 October 2005

The key question
§ Status quo: Input → Human labor → Output
§ Metadata:
  – Baseline metadata
    • Critical data present
    • Accurate tagging
    • Accurate values
  – Ideal: Enriched metadata
§ Can machines be leveraged for human labor?
§ The answer:
  – Yes…with caveats

Automation approaches
§ Harvesting: Drawing from extant metadata in one or more sources
§ Extraction: Drawing from attributes of the resource and/or content in the resource
§ Both: Integrating both harvesting & extraction in metadata generation
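The "both" approach can be sketched in a few lines. This is a minimal illustration, assuming an HTML page as the resource: embedded `<meta>` tags are harvested as-is, and the page's `<title>` content is extracted only to fill a gap. The class and function names, the `dc.title` key, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class MetaHarvester(HTMLParser):
    """Collects <meta name=... content=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def generate_metadata(html):
    parser = MetaHarvester()
    parser.feed(html)
    record = dict(parser.meta)  # harvesting: reuse the embedded metadata
    # extraction: fall back to the page content when no title was harvested
    record.setdefault("dc.title", parser.title.strip())
    return record


# Hypothetical sample page, invented for illustration.
sample = """<html><head><title>Annual Report 2004</title>
<meta name="DC.creator" content="Example Library">
</head><body><p>Text.</p></body></html>"""

record = generate_metadata(sample)
print(record)
```

Harvested values take precedence here; a real tool might instead compare the two sources and flag disagreements.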

Approaches (cont’d)
§ Harvesting & extraction can be integrated with other tactics:
  – Point-of-transaction capture: Manual and/or automatic capture of metadata during the lifecycle of resource and/or metadata (e.g., the source agency, date of record)
  – Human review/prompting: Integrating human decision-making to address cases machines cannot handle efficiently (e.g., linking name references to correct authority file when several names are similar)
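The human review/prompting tactic can be illustrated with a small sketch: link a name automatically only when one authority heading is a clear winner, and otherwise hand the candidates to a person. The function name, the thresholds, and the sample headings are assumptions made for illustration, and string similarity from `difflib` stands in for whatever matching a production system would use.

```python
from difflib import SequenceMatcher


def link_name(name, authority_headings, accept=0.9, margin=0.05):
    """Return ("auto", heading) for a clear match, else ("review", candidates)."""
    scored = sorted(
        ((SequenceMatcher(None, name.lower(), h.lower()).ratio(), h)
         for h in authority_headings),
        reverse=True,
    )
    best_score, best = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Auto-link only if the best match is strong AND well ahead of the next one.
    if best_score >= accept and best_score - runner_up >= margin:
        return ("auto", best)
    # Otherwise queue the top candidates for a human decision.
    return ("review", [h for _, h in scored[:3]])


# Hypothetical authority headings, invented for illustration.
headings = ["Smith, John, 1950-", "Smith, John, 1872-1940", "Doe, Jane"]

print(link_name("Smith, John, 1950-", headings))  # exact heading: linked automatically
print(link_name("Smith, John", headings))         # ambiguous: sent to review
```

The two thresholds set how much work reaches a person: raising `accept` or `margin` sends more cases to review.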

Harvesting options
§ New record, same database:
  – OCLC “derive” record technique
§ External metadata files:
  – Z39.50/ZING/MXG
  – OAI harvesting
  – Citation tools (e.g., EndNote)
§ Embedded metadata harvesting:
  – Processes structured metadata
  – Various tools (e.g., DC tools list)
§ Many harvesting tools include some extraction features (and vice-versa)
  – Example: InfoLibrarian appliance
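As an illustration of OAI harvesting, the sketch below parses an OAI-PMH `ListRecords` response carrying unqualified Dublin Core (`oai_dc`). The three namespace URIs are the standard ones; the function name and the inline sample response are invented, and a real harvester would also fetch over HTTP and follow resumption tokens, which is omitted here.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body, invented for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample title</dc:title>
          <dc:creator>Sample, Author</dc:creator>
          <dc:subject>Metadata</dc:subject>
          <dc:subject>Cataloging</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Turn each <record> into a dict of Dublin Core element name -> list of values."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        fields = {
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS)
        }
        dc_root = rec.find("oai:metadata/oai_dc:dc", NS)
        if dc_root is not None:  # deleted records carry no metadata
            for el in dc_root:
                name = el.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
                fields.setdefault(name, []).append((el.text or "").strip())
        records.append(fields)
    return records


records = parse_list_records(SAMPLE_RESPONSE)
print(records)
```

Each Dublin Core element is kept as a list because elements such as `dc:subject` are repeatable.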

Extraction landscape
§ Many tools from many sources
  – Features vary widely
  – Some are narrow-band (e.g., domain-specific, narrow scope of data work)
  – Standalone or highly integrated in systems (often as part of digital access mgt. systems)
§ Frequently-encountered features:
  – Simple: document statistics, file type
  – Complex: (reliable) language detection, audience level, topics, entities represented, document parts, taxonomy derivation
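The "simple" end of the feature list is easy to sketch with the standard library. The example below guesses the file type from the filename and computes a few document statistics. The function name, the feature names, and the crude tokenization rules are assumptions chosen for illustration; the "complex" features listed above need far more than this.

```python
import mimetypes
import re
from collections import Counter


def extract_simple_features(filename, text):
    """Compute file type and basic document statistics for one resource."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    mime_type, _ = mimetypes.guess_type(filename)  # guessed from the extension only
    longer_words = [w for w in words if len(w) > 3]  # crude stand-in for a stopword list
    return {
        "file_type": mime_type,
        "characters": len(text),
        "words": len(words),
        "sentences": len(re.findall(r"[.!?]+", text)),
        "top_terms": [w for w, _ in Counter(longer_words).most_common(3)],
    }


features = extract_simple_features(
    "report.html", "Metadata matters. Metadata helps users find things!"
)
print(features)
```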

Extraction approaches
§ Information extraction:
  – “Automatically extract structured or semistructured information from unstructured machine-readable documents” - Wikipedia
§ Natural language processing
  – “A range of computational techniques for analyzing and representing naturally occurring text (free text) at one or more levels of linguistic analysis (e.g., morphological, syntactic, semantic, pragmatic) for the purpose of achieving human-like language processing for knowledge-intensive applications” - AHIMA
  – Extracts both explicit & implicit meaning

Some work of interest
§ Library of Congress
§ NSF-funded NSDL projects
§ AMeGA
§ iVia software
§ RLG’s Automatic Exposure

Library of Congress
§ BEAT (Bibliographic Enrichment Advisory Team) activities & projects:
  – MARC records from harvesting:
    • E-CIP
    • Web access to publications in series
  – Numerous enrichment activities:
    • TOCs: E-CIP, ONIX, dTOC project, more
    • Reviews: HNET, Outstanding Reference Sources, HLAS reviews, MARS Best Free Reference Sites
    • Contributor biographic information, ONIX descriptions, sample texts
    • Links to e-versions of various texts
    • Special projects for select LC collections
  – Work with bibliographies & pathfinders

NSDL-related projects (selected)
§ MetaExtract: An NLP System to Automatically Assign Metadata
  – CNLP (Syracuse U) & SIS (Syracuse U)
  – Builds on several previous projects including:
    • Breaking the MetaData Generation Bottleneck [2000-2002]
§ Lenny
  – CNLP (Syracuse U) & U Washington iSchool
  – Application of NLP to automatically generate metadata for course-oriented materials
  – Cornell NSDL group & INFOMINE
  – Orchestrated application of a suite of activities
    • OAI harvesting with metadata augmentation using iVia
    • Loosely-coupled third party services to provide metadata enhancements (correction, augmentation) to metadata destined for a central repository
    • Interactions orchestrated by centralized software application

MetaExtract study findings
§ Auto-generated versus manually-assigned:
  – Comparable
    • Performance in Retrieval
    • Quality of most elements (for Browsing)
  – Better
    • Coverage of metadata elements
§ Auto-generated versus full-text:
  – Comparable
    • Performance in Retrieval
  – Better
    • Enables Fielded searching
    • Enables Browsing of results
  – Provides useful structuring of data

Other projects
§ AMeGA (Automatic Metadata Generation Applications Project)
  – UNC-CH SILS Metadata Research Center
  – Research initiated to fulfill LC Bibliographic Control Action Plan 4.2 (deliver specifications for tools to effect automated processing of Web-based resources)
  – Final report identifies and recommends functionalities for automatic metadata generation applications
§ iVia software
  – Developed by INFOMINE & in use by NSDL, various other digital library projects; LC looking at using iVia
  – Sophisticated open source harvester software that can assign LCSH, LCC
§ Automatic Exposure
  – RLG-led initiative advocates capturing standard technical metadata about digital images automatically, as part of image creation

OCLC activities
§ OCLC Research projects:
  – Automatic classification
  – FRBR-related record harvesting
  – SchemaTrans
§ OCLC production services:
  – OCLC Digital Archive
  – WorldCat link
  – OCLC Connexion

Automatic classification work
§ Scorpion
  – Open source software that implements a system for automatically classifying Web-accessible text documents
  – Incorporated into Connexion extractor
§ FAST as a knowledge base for automatic classification project
  – Evaluated FAST as a database to support automatic classification
§ ePrints-UK project
  – A collaboration with RDN to pilot Web services to classify records by DDC and provide authority control for personal names for RDN eprint metadata records
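To make the idea of automatic classification concrete, here is a toy sketch that ranks a document against short class descriptions by term overlap (cosine similarity over word counts). It is not Scorpion's algorithm or knowledge base. The three DDC numbers are real top-level divisions, but the descriptor words attached to them and every function name are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base: DDC number -> invented descriptor words.
CLASS_DESCRIPTIONS = {
    "020": "library information science cataloging classification bibliographic",
    "510": "mathematics algebra geometry numbers calculus",
    "780": "music composers instruments songs orchestra",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, top_n=1):
    """Return up to top_n (class number, score) pairs with a non-zero score."""
    doc = Counter(tokenize(text))
    scored = [
        (cosine(doc, Counter(tokenize(desc))), number)
        for number, desc in CLASS_DESCRIPTIONS.items()
    ]
    scored = sorted((s for s in scored if s[0] > 0), reverse=True)
    return [(number, round(score, 3)) for score, number in scored[:top_n]]


result = classify("A guide to cataloging and classification for the library")
print(result)
```

Documents that share no terms with any class description come back unclassified, which is one natural point at which to route items to human review.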

Other OCLC Research activities
§ FRBR-related record harvesting
  – Best elements of all records in workset used to build a “work” record (Fiction Finder)
§ SchemaTrans project
  – Adopts a novel approach to translating structured metadata between schemes
  – Should be friendly to modular augmentation/correction activities
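Building a “work” record from the best elements of a workset might look roughly like the sketch below. The slide does not say how “best” is decided, so the rule used here (the most frequent non-empty value, with ties going to the longer value) is purely an assumption, as are the function name, the field names, and the sample records.

```python
from collections import Counter


def build_work_record(workset):
    """Pick one value per field across all records in the workset."""
    fields = {}
    for record in workset:
        for name, value in record.items():
            if value:  # ignore empty fields
                fields.setdefault(name, []).append(value)
    work = {}
    for name, values in fields.items():
        counts = Counter(values)
        # Assumed rule: most frequent value wins; ties go to the longer value.
        work[name] = max(counts, key=lambda v: (counts[v], len(v)))
    return work


# Hypothetical workset, invented for illustration.
workset = [
    {"title": "Moby Dick", "author": "Melville, Herman"},
    {"title": "Moby Dick", "author": "Melville, Herman, 1819-1891", "summary": ""},
    {"title": "Moby-Dick; or, The whale", "author": "Melville, Herman, 1819-1891",
     "summary": "A whaling voyage."},
]

work = build_work_record(workset)
print(work)
```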

OCLC products
§ OCLC Digital Archive
  – Various harvesting options
    • Capture of technical metadata
    • Start descriptive records in Connexion
§ WorldCat link
  – Scheduled ingest of metadata from OAI servers and batch processing into WorldCat
§ OCLC Connexion
  – Extractor processes metadata from web sites
    • Relatively sophisticated harvesting
    • Processes non-canonical metadata
    • Slated for significant upgrade in 2006
  – Rules-aided LCSH assignment while editing bibs
  – Automatic base authority record generation from relevant bibliographic record (NACO)

Links
§ Recommended reading:
  – Liddy, Elizabeth, “Metadata: A Promising Solution” in EDUCAUSE Review, v. 40, n. 3 (May/June 2005)
§ OCLC Research links:
  – Automatic classification projects
  – SchemaTrans
  – ResearchWorks