November 30, 2020 Adding Cyrillic Script to WorldCat Jenny Toves, OCLC Cynthia Whitacre, OCLC
Jenny Toves Software Architect Manager, OCLC Cynthia Whitacre Senior Metadata Operations Manager, OCLC
THE IDEA
New Directions in Non-Latin Script Access • ALCTS CaMMS Committee on Cataloging: Asian & African Materials, June 23, 2018 presentations • Barry - The Past, Present, and Future of Original Script • Riley - Extending Access - Supporting Improved Discovery • Smith-Yoshimura - Non-Latin Script Access - Past, Now, Future
The Project • Adding Cyrillic script to WorldCat records is possible through automation • One-to-one correspondence between the Latin script alphabet and the Cyrillic script alphabet • Automated addition is possible for many languages that use Cyrillic script • Russian selected first • English language of cataloging only
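The one-to-one correspondence above can be sketched as a longest-match lookup from romanized text back to Cyrillic. This is a minimal illustration only: the mapping is a partial table in the style of ALA-LC Russian romanization, and the names `LATIN_TO_CYRILLIC` and `to_cyrillic` are hypothetical, not the project's actual code.

```python
import unicodedata

# Illustrative, partial table in the style of ALA-LC Russian romanization.
# Assumption: this is NOT the mapping used in the project described here.
LATIN_TO_CYRILLIC = {
    "shch": "щ", "zh": "ж", "kh": "х", "ch": "ч", "sh": "ш",
    # Ligatured pairs use U+0361 (combining double inverted breve).
    "t\u0361s": "ц", "i\u0361u": "ю", "i\u0361a": "я",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е",
    "\u00eb": "ё", "z": "з", "i": "и", "\u012d": "й", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р",
    "s": "с", "t": "т", "u": "у", "f": "ф", "y": "ы", "\u0117": "э",
    "\u02b9": "ь", "\u02ba": "ъ",
}
MAX_KEY = max(len(key) for key in LATIN_TO_CYRILLIC)


def to_cyrillic(text: str) -> str:
    """Convert romanized text to Cyrillic, trying the longest sequence first."""
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_KEY, 0, -1):
            chunk = text[i:i + size]
            target = LATIN_TO_CYRILLIC.get(chunk.lower())
            if target is not None:
                # Carry over capitalization from the first Latin letter.
                out.append(target.upper() if chunk[0].isupper() else target)
                i += len(chunk)
                break
        else:
            # Spaces, punctuation, digits, and unknown characters pass through.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_cyrillic("Vo\u012dna i mir"))
```

The lookup only works when the romanization carries its diacritics and ligatures. Without the ligature, "ts" is read as two letters (тс) rather than ц, which is the kind of data error the later slides discuss.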
Connecting the two projects • OCLC supplied UCLA with OCNs for all records for Russian-language titles lacking Cyrillic that had UCLA holdings set • Internally, I realized that OCLC Research was undertaking a similar project for all of WorldCat • Proposed that we work together • Shared ideas, strategies, and review of the project design
HOW WE DID IT
Overview of process • Select a set of records with high probability of good transliterated text • Review and correct until satisfied • Add Cyrillic data to OCLC Research copy of WorldCat records • Add enhanced records to production WorldCat
Selected Records • Language of cataloging is English • Single language present in the record (no parallel titles or translations) • Published in the native country • Skipped CONTENTdm records • Discovered some new criteria along the way – AACR2/RDA records only – Cognizant of reform dates
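The selection criteria above could be expressed as a simple record filter. The sketch below is an illustration under assumptions: the `Record` fields, the code values, and the `NATIVE_COUNTRIES` table are hypothetical stand-ins for values derived from MARC data, and the reform-date check is left out because the slides do not spell it out.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # All field names and code values are hypothetical, for illustration only.
    language_of_cataloging: str   # e.g. "eng"
    languages: tuple              # language codes present in the record
    country_of_publication: str   # e.g. "ru"
    description_rules: str        # e.g. "aacr2" or "rda"
    is_contentdm: bool


# Assumption: a per-language set of "native" country codes.
NATIVE_COUNTRIES = {"rus": {"ru"}}


def is_candidate(rec: Record, target_language: str = "rus") -> bool:
    """Return True when a record meets every selection criterion."""
    return (
        rec.language_of_cataloging == "eng"
        and rec.languages == (target_language,)       # single language only
        and rec.country_of_publication in NATIVE_COUNTRIES.get(target_language, set())
        and rec.description_rules in {"aacr2", "rda"}
        and not rec.is_contentdm
    )


good = Record("eng", ("rus",), "ru", "rda", False)
parallel = Record("eng", ("rus", "eng"), "ru", "rda", False)
print(is_candidate(good), is_candidate(parallel))
```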
To correct or … Started by fixing errors in the data – Lots of review time and lots of code trying to implement rules for specific words or word fragments – Less than half of corrected records end up being usable because errors rarely occur one at a time – Brittle solution because it relies on what we know today – If a pattern is wrong 80% of the time - trying to fix the 80% without breaking the 20% - arggh!
… not to correct Slowly realized that corrections might be counterproductive – Skip all records containing a questionable pattern – such as 3 vowels in a row with no marks – Errors rarely occur alone, so it is better to notice the common errors – Yes: we avoid a few good records – But we save reviewer time and coder time – And we avoid doing bad things – And can always revisit those records later
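The "skip, don't fix" rule can be sketched as a pattern check that sets aside any record containing a suspicious sequence. The example below covers only the one pattern named on the slide (three vowels in a row with no marks). The function names and the choice of vowel set are assumptions, not the project's actual rules.

```python
import re
import unicodedata

# Three consecutive plain vowels, none followed by a combining mark
# (U+0300 to U+036F covers diacritics and the ligature tie U+0361).
# Assumption: "aeiou" is the vowel set; the real rules may differ.
UNMARKED_VOWEL_RUN = re.compile(
    r"(?:[aeiou](?![\u0300-\u036f])){3}", re.IGNORECASE
)


def has_unmarked_vowel_run(text: str) -> bool:
    """True if the text has 3 vowels in a row with no diacritics or ligatures."""
    # NFD splits precomposed letters so every mark is a separate code point.
    decomposed = unicodedata.normalize("NFD", text)
    return UNMARKED_VOWEL_RUN.search(decomposed) is not None


def keep_clean(titles):
    """Skip questionable titles instead of trying to repair them."""
    return [t for t in titles if not has_unmarked_vowel_run(t)]


# "Rossiia" lacks the ligature that "Rossii\u0361a" carries, so it is skipped.
print(keep_clean(["Rossiia", "Rossii\u0361a", "Vo\u012dna i mir"]))
```

A check like this will also skip some valid words that happen to contain three plain vowels, which matches the trade-off on the slide: a few good records are avoided in exchange for reviewer and coder time.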
Technology Used • OCLC Research Hadoop Cluster • WorldCat Metadata API • Connexion client macros
Progress and next steps • Russian – 1.1 M records • Ukrainian – 28 k • Bulgarian – reviewing 25 k • Future Cyrillic work: Serbian (35 k), Belarusian (10 k), Macedonian (12 k), Uzbek (10 k)
Statistics for WorldCat • June 2019: 1,632,293 WorldCat records had Cyrillic characters • November 2020: 3,085,920 WorldCat records had Cyrillic characters • 89% increase over that time span, mainly due to this project
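The 89% figure follows from the two counts above, as this short check shows (the variable names are illustrative):

```python
june_2019 = 1_632_293       # records with Cyrillic characters, June 2019
november_2020 = 3_085_920   # records with Cyrillic characters, November 2020

added = november_2020 - june_2019
increase = added / june_2019   # growth relative to the June 2019 count

print(f"{added:,} more records, a {increase:.0%} increase")
```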
Follow-up projects: UCLA - Armenian • 43,000 Armenian titles • Uniform title: Bible. Psalms. Armenian. 1913. • Title: Girkʻ Saghmosatsʻ Dawtʻi. • Published/Distributed: Konstandnupolis : A. B. S., 1913 (Kostandnupolis : Tpagr. A. H. Po yachean)
Model for future projects • Any alphabetic script is a candidate for this • Assumption is that the Latin transliterated script in the record has been Romanized consistently • Where do we go next?
References • Hanging Together Blog postings earlier this year – Cyrillicizing WorldCat Russian Records (https://hangingtogether.org/?p=7658 by Karen Smith-Yoshimura) – Кириллица в WorldCat (https://hangingtogether.org/?p=7868 by Jenny Toves, Mary Haessig, Bryan Baldus)
Jenny Toves tovesj@oclc.org Cynthia Whitacre whitacrc@oclc.org