American University of Beirut Libraries
Al-Ādāb Magazine Archives: Digitization, Preservation and Access
Basma Chebani, Head of Cataloguing and Metadata Services, University Libraries
MELCOM International 39th Annual Conference, Cambridge (United Kingdom), 3-6/7/2017
AUB University Libraries
The American University of Beirut University Libraries (AUB-UL) lead several digitization initiatives to preserve national and AUB cultural heritage, to disseminate information, and to promote knowledge by giving access to the AUB community, scholars, researchers, and the widest possible audience.
Digitization Initiatives - ACO
• Arabic Collections Online (ACO) is a publicly available digital library of public-domain Arabic-language content. The project is managed by New York University Library.
• ACO contributing partners are New York University (NYU), Princeton, Cornell, Columbia, the American University of Beirut, and the American University in Cairo.
• Arabic Collections Online website
• AUB books in Arabic Collections Online
Digitization Initiatives – Oral History
• The Palestinian Oral History Archive (POHA) is an archival collection that contains more than 1,000 hours of testimonies from first-generation Palestinian refugees.
• The project will digitize the magnetic tapes, then index, catalog, preserve, and provide access to the material through a digital platform (DSpace repository).
• The partners of the POHA project are:
  o American University of Beirut University Libraries (AUB-UL)
  o Issam Fares Institute for Public Policy and International Affairs at AUB (IFI)
  o Nakba Archive
  o Arab Resource Center for Popular Arts (AL-JANA)
• The POHA Thesaurus is being constructed according to ISO 25964, the international standard for thesauri.
Digitization Initiatives – Nahda Journals
Indexing the Nahda (Arabic renaissance) journals with Brill
• AUB-UL signed an agreement with Brill Publishers to start a joint project to archive the Arabic Nahda journals (1880-1950s) using the Index Islamicus platform.
• The project proposes indexing the journal content in Arabic and English, developing an ontology of Nahda terms and concepts, and building authority lists of proper names of people, places, and organizations.
Digitization Initiatives – Al-Ādāb
AL-ĀDĀB MAGAZINE ARCHIVE
Archiving Al-Ādāb Magazine
The Agreement
• Al-Ādāb Magazine (مجلة الآداب) and AUB University Libraries agreed in 2014 to digitize, OCR, catalog, and index all issues of the print magazine.
• Al-Ādāb Magazine will publish the archive of digitized materials online through its own content management system (CMS), Drupal.
• AUB University Libraries will keep a preservation copy of all magazine issues.
• AUB University Libraries will archive the digitized materials in its own CMS, the eXtensible Text Framework (XTF).
• University Libraries will provide the AUB community with access to the digitized content through a full-text search web interface.
Why Al-Ādāb Magazine?
• Literary and cultural journal established in 1953.
• Focused on movements in literature and culture in the Arab world.
• Included files (dossiers) on political thought, poetry, the novel, short stories, film criticism, theater, and general culture.
• Stopped publishing in print in 2012, leaving an archive of 60 volumes over 60 years.
• In October 2015, Al-Ādāb Magazine resumed publication as an online edition.
[Figure: cover of the first issue, first publication year (1953)]
Title of the Magazine
• The title varied during the lifetime of Al-Ādāb Magazine:
  o الآداب : مجلة شهرية تعنى بشؤون الفكر
  o الآداب : مجلة ثقافية عربية
• Issued monthly from 1953 to 1980.
• Issued 5 times a year from 1980 to 2011.
• Issued 4 times a year in 2012.
• Ceased publication at the end of autumn 2012.
Al-Ādāb Magazine Archives: CATALOGING AND INDEXING OR DESCRIPTIVE METADATA
Descriptive Metadata
• Descriptive metadata for the content of Al-Ādāb Magazine (author of the article, title, pages, and source citation) at the article level, within the hierarchy:
  o Magazine (top level)
  o Issue level
  o Article level
• Indexing: description of the subjects of the articles (subject headings)
Descriptive metadata tools
• Anglo-American Cataloguing Rules (AACR2) for descriptive cataloging
• Library of Congress Authorities (translated into Arabic)
• The library system (Millennium) for analytical cataloging in MARC format
• Library of Congress Subject Headings authorities (translated into Arabic by University Libraries)
• Detailed indexing in Arabic (2-70 subject headings or keywords for conference and meeting papers)
Metadata Export
• Export the descriptive metadata (cataloging and indexing fields) to different formats:
  o Dublin Core (DC) metadata, 15 elements
  o MARC exchange format (ISO 2709)
  o MARC 21 standard
  o CSV (comma-separated values)
Exported Metadata Fields 1
• System record number (unique identifier)
• Title of the article
• Main author of the article
• Added authors (co-authors and translators)
• Pages
• Source (bibliographic citation): title of the magazine, volume, issue, date
• URL of the digitized article on the server
• Personalities (personal names as subjects)
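As a rough illustration of the exported fields listed above, the following sketch writes article-level records to CSV. The column names and the sample record are hypothetical; the slides do not give the actual Millennium export layout.

```python
import csv
import io

# Hypothetical column names modelled on the exported fields above;
# the real Millennium export layout is not given in the slides.
FIELDS = ["record_number", "title", "main_author", "added_authors",
          "pages", "source", "url", "personal_name_subjects"]

def records_to_csv(records):
    """Serialize article-level records to CSV text.

    Multi-valued fields (lists or tuples) are joined with ' ; '.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {}
        for field in FIELDS:
            value = record.get(field, "")
            if isinstance(value, (list, tuple)):
                value = " ; ".join(value)
            row[field] = value
        writer.writerow(row)
    return buffer.getvalue()

# Placeholder record for illustration only.
sample = {
    "record_number": "b0000001",
    "title": "Sample article title",
    "main_author": "Author, Sample",
    "added_authors": ["Translator, Sample"],
    "pages": "12-15",
    "source": "Sample magazine, vol. 1, no. 1 (1953)",
    "url": "https://example.org/articles/sample.pdf",
    "personal_name_subjects": [],
}
csv_text = records_to_csv([sample])
```

Joining repeated values into one cell keeps one row per article; a real export might instead repeat columns or rows, depending on what the target system expects.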
Al-Ādāb Magazine Archives: CHALLENGES FOR METADATA
Challenges for metadata
• The frequency changed from monthly to bimonthly to quarterly during the life of the magazine.
• The rubrics of the magazine (أبواب المجلة) changed over time.
• Continuations (التتمات) required additional citations and combining the PDF files under one link.
• More subject headings were added to compensate for poor OCR recognition of printed Arabic characters.
Challenge of Censorship
• Two editions of the same issue exist because of censorship in some Arab countries:
  o One for standard distribution
  o One for distribution in a designated country
• Censorship creates a problem at the citation level for the second edition of an issue (2 different citations for the same article).
• The complete edition had to be located (missing issues and missing pages).
• The number of pages varies according to the articles omitted from the censored edition, which affects numbering in the digital files and in the descriptive metadata.
Challenges at Technical level
• Paper quality (yellowed and faded in some issues from the 1950s)
• Binding margins require additional retouching
• Covers are missing due to binding mistakes
• Printing quality varies
Al-Ādāb Magazine Archives: DIGITIZATION
Digitization Process
• Elie Kahale, Head of Digital Initiative Department at AUB-UL, handles digitization and the ingestion of metadata into the content management system.
• The digitization and archiving procedures are a good example of coordination between librarians and information technology specialists at AUB-UL.
Why Digitization?
• Digitization for access:
  o Browse / discover
  o Index and search
  o View / read
• Digitization for preservation:
  o Authenticity: maintain a trustworthy representation of the original document.
  o Integrity: ensure data is saved and retrieved exactly as intended, through ingestion of preservation metadata.
  o Discovery and identification: ability to locate an item using descriptive metadata.
  o Continuous ability to use and access a digital object.
Pre-Digitization – Material Preparation
• Availability of all volumes: two copies of each volume (60 volumes covering 60 years)
• Triage and selection of volumes for image capture
• Censorship problem: 2 editions of the same issue, with different content for different geographical areas
• Integrity: completing all volumes; some covers are missing
[Figure: Digitization - Workflow]
Digitization - Image Capture
• Digitization is the conversion of analog source material, such as a document, image, or audio recording, into a numerical (mainly binary) format that can be accessed through computers and stored on different media such as servers, hard disks, and DVDs.
• Flatbed and book scanners were used to digitize the 60 volumes, generating around 50,000 digital images.
Digitization - Image Capture
• Text content: TIFF, 300 dpi, bitonal
• Covers: TIFF, 300 dpi, 24-bit
Challenges in Image Capture
• Image capture in black and white for the text
• Image capture in grayscale for images inside the text (pages that include images were scanned twice, with different capture settings)
• Image capture in color for the covers
Digital Preservation 1
• Initial steps for digital preservation:
  o Create descriptive and technical metadata.
  o Package the metadata with the digital objects in bags using BagIt, and back up multiple copies (as bitstreams) on a SAN storage server and on tapes.
• BagIt (Library of Congress) is a hierarchical file packaging format designed to support disk-based storage and network transfer of arbitrary digital content (Wikipedia).
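To make the bag structure concrete, here is a minimal sketch of a BagIt-style package built with the Python standard library only: a `data/` payload directory, a `bagit.txt` declaration, and a `manifest-sha256.txt` payload manifest. It is not the tooling used at AUB-UL, which the slides do not name; the directory and file names in the demo are placeholders.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write the two required tag files."""
    bag_dir = Path(bag_dir)
    shutil.copytree(source_dir, bag_dir / "data")
    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8")
    # Payload manifest: one "<checksum>  <relative path>" line per file.
    payload = sorted(p for p in (bag_dir / "data").rglob("*") if p.is_file())
    manifest = "".join(
        f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}\n" for p in payload)
    (bag_dir / "manifest-sha256.txt").write_text(manifest, encoding="utf-8")
    return bag_dir

def verify_bag(bag_dir):
    """Return True when every payload file still matches its manifest checksum."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        if sha256_of(bag_dir / rel_path) != expected:
            return False
    return True

# Demo with placeholder content standing in for scanned page images.
work = Path(tempfile.mkdtemp())
scans = work / "scans"
scans.mkdir()
(scans / "page-001.tif").write_bytes(b"placeholder image bytes")
bag = make_bag(scans, work / "sample-bag")
intact = verify_bag(bag)
```

The checksum manifest is what lets each backup copy be checked for bit-level integrity later: re-running `verify_bag` on any copy detects silent corruption of a payload file.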
Digital Preservation 2
• Next steps: ingest the bags into an Open Archival Information System (OAIS) using a METS document.
• Archivematica is an open source digital preservation system that uses BagIt to create OAIS Archival Information Packages (AIPs).
• In Archivematica 1.4 and higher, fields in the bag-info.txt file are indexed as source metadata in the Archivematica METS file, making their contents searchable in archival storage after the bag transfer has been processed and stored.
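The bag-info.txt file mentioned above is a plain list of `Label: value` lines, where an indented line continues the previous value. The sketch below parses that layout; it only illustrates the file format and is not Archivematica's own code. The sample values are hypothetical.

```python
def parse_bag_info(text):
    """Parse 'Label: value' lines into (label, value) pairs.

    Lines that start with a space or tab continue the previous value.
    Labels may repeat, so a list of pairs is returned instead of a dict.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and entries:
            label, value = entries[-1]
            entries[-1] = (label, value + " " + line.strip())
        else:
            label, _, value = line.partition(":")
            entries.append((label.strip(), value.strip()))
    return entries

# Hypothetical bag-info.txt content for illustration only.
sample_bag_info = (
    "Source-Organization: Example University Libraries\n"
    "External-Description: Digitized issues of a literary magazine,\n"
    "  scanned from print volumes\n"
    "Bagging-Date: 2017-01-01\n"
)
bag_info = parse_bag_info(sample_bag_info)
```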
Al-Ādāb Magazine Archives: OPTICAL CHARACTER RECOGNITION (OCR) FOR ARABIC CHARACTERS
Optical Character Recognition
OCR is a software process that converts digital images of text from pixels into machine-encoded text that can be edited and searched. It turns a picture of text into text itself. In other words, OCR software takes a scanned JPEG 2000 or TIFF image of the printed page and exports something like a .txt or .doc file.
Optical Character Recognition
• OCR advantages:
  o Index and search the generated text
  o Edit the generated text
• For this initiative, we are interested in OCR in order to achieve full-text search in Arabic.
Optical Character Recognition
• OCR accuracy was calculated at the character level: the percentage of characters correctly recognized by the OCR engine out of the total characters on the page.
• Accuracy varied from 84% to 99% in Al-Ādāb, depending on the fonts.
• Older volumes had lower accuracy than newer volumes.
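One common way to compute character-level accuracy is to compare the OCR output against a hand-corrected reference transcription and count the edit operations needed to turn one into the other. The slides do not say how errors were counted in this project, so the sketch below is an assumption about the method, not a description of it.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def character_accuracy(reference, hypothesis):
    """Percentage of reference characters recognized correctly (floored at 0)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 100.0 * (len(reference) - errors) / len(reference))

# One wrong character out of ten gives 90% accuracy.
accuracy = character_accuracy("abcdefghij", "abcdefghiX")
```

Because the measure works on any Unicode string, the same function applies unchanged to Arabic text, where a lost or misplaced dot counts as one substituted character.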
OCR Accuracy at the font level
OCR results for each decade at the character level:
• 1950s: 84.92% - 98.71%
• 1960s: 92.69% - 98.21%
• 1970s: 90.4% - 98.37%
• 1980s: 93% - 99.14%
• 1990s: 95.88% - 98.2%
• 2000s: 95.92% - 99.4%
Large fonts give better results than small fonts.
Optical Character Recognition problems
• In the old Al-Ādāb volumes, different fonts were used in the same volume (year), in the same issue, and even on the same page.
• To increase accuracy, images need to be improved before OCR using the ScanFix application to de-speckle (remove noise) and de-skew (straighten pages).
[Figures: Volume 2, Year 1954 (two slides)]
[Figure: Volume 53, Year 2005 and Volume 2, Year 1954]
Arabic Character recognition challenges
• Arabic is a cursive script: letters are joined, and each letter changes shape according to its position in the word.
• Arabic letters overlap in the 1950s volumes.
• Different font types and sizes appear on the same page.
• Titles are printed in calligraphic fonts.
• The accuracy rate of recognition was calculated at the font level.
• Sakhr Automatic Reader was used for Arabic OCR because it can be taught new characters.
Challenges for Arabic OCRing (cont.)
• Some characters have the same base form and are distinguished only by the position of dots relative to the main character block. Because dots can be treated as noise, OCR tends to remove them:
  فـ قـ   نـ تـ   عـ غـ   ـمـ ـعـ   ر ز   ط ظ   صـ ضـ   سـ شـ
• The space between two connecting Arabic characters can vary in size and shape, and kashida (letter extension) is used for alignment.
• Long vowels (حروف العلة الطويلة: ا ، و ، ي) differ from short vowels (حروف العلة القصيرة), which are written as diacritical marks such as fatḥah, ḍammah, and kasrah. These marks are not always added to the letters, so their use is inconsistent.
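Kashida and inconsistent short-vowel marks also affect searching, since the same word can appear in the OCR text with or without them. A common remedy, sketched below, is to strip both from the text and from the query before indexing. This is an assumption about how such text could be normalized; the slides do not describe the normalization actually applied in XTF.

```python
import re

TATWEEL = "\u0640"  # kashida: stretches a word for justification
# Short-vowel and related marks, from fathatan (U+064B) to sukun (U+0652).
HARAKAT = re.compile("[\u064B-\u0652]")

def normalize_arabic(text):
    """Remove kashida and short-vowel diacritics so variant spellings match."""
    return HARAKAT.sub("", text.replace(TATWEEL, ""))

# The same word with and without kashida normalizes to one form.
stretched = "مـــجـــلـــة"
plain = "مجلة"
same_after_normalizing = normalize_arabic(stretched) == normalize_arabic(plain)
```

Note that this deliberately leaves the letter dots alone: dots distinguish different letters, so they cannot be normalized away without merging unrelated words.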
Why do we need Metadata with OCR?
• Indexing terms improve full-text search results.
• They compensate for the low accuracy of OCR in old volumes.
• Preservation metadata (technical and administrative) guarantees the long-term preservation and accurate retrieval of digital objects (the digitized articles of Al-Ādāb).
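The toy index below shows why subject headings help when OCR misreads a word: the article stays findable through its metadata even though the word is wrong in the OCR text. The records, field names, and the deliberate OCR error ("poctry") are invented for illustration, and the index is far simpler than the one a real search engine builds.

```python
from collections import defaultdict

def build_index(articles, use_subjects=True):
    """Map each lowercase token to the ids of articles that contain it.

    Tokens come from the OCR text and, optionally, the subject headings.
    """
    index = defaultdict(set)
    for article_id, article in articles.items():
        tokens = article["ocr_text"].split()
        if use_subjects:
            for heading in article["subjects"]:
                tokens.extend(heading.split())
        for token in tokens:
            index[token.lower()].add(article_id)
    return index

def search(index, query):
    """Return the ids of articles that match every token in the query."""
    token_sets = [index.get(token.lower(), set()) for token in query.split()]
    return set.intersection(*token_sets) if token_sets else set()

# Invented records; "poctry" simulates a misrecognized character.
articles = {
    "a1": {"ocr_text": "modern Arabic poctry in Beirut",
           "subjects": ["Arabic poetry"]},
    "a2": {"ocr_text": "theater criticism", "subjects": []},
}
hits_with_metadata = search(build_index(articles), "Arabic poetry")
hits_ocr_only = search(build_index(articles, use_subjects=False), "Arabic poetry")
```

With subject headings indexed, the query finds article `a1`; with OCR text alone it finds nothing, because the misread word never matches.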
METADATA AND DIGITAL ARTICLES ARE EMBEDDED IN THE CONTENT MANAGEMENT SYSTEM (CMS) XTF
Metadata mapping to CMS
• Metadata exported from Millennium in MARC format can be mapped to any digital repository or content management system (CMS), such as Drupal, WordPress, Joomla, DSpace, or XTF, by converting the data to Dublin Core XML.
• Keywords can be structured in a CMS as a thesaurus, taxonomy, or controlled vocabulary.
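A MARC-to-Dublin-Core conversion is essentially a lookup from MARC tags to DC element names. The sketch below uses a simplified, assumed crosswalk and a record already reduced to a plain dictionary of tag to values; it does not parse real MARC files, and the mapping actually used at AUB-UL is not given in the slides.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Simplified, assumed crosswalk from MARC tags to Dublin Core elements.
CROSSWALK = {
    "245": "title",        # title statement
    "100": "creator",      # main author
    "700": "contributor",  # added authors (co-authors, translators)
    "600": "subject",      # personal names as subjects
    "650": "subject",      # topical subject headings
    "773": "source",       # host item: magazine, volume, issue, date
    "856": "identifier",   # URL of the digitized article
}

def marc_to_dc(record):
    """Convert a {tag: [values]} record into a Dublin Core XML string."""
    root = ET.Element("record")
    for tag, values in record.items():
        element_name = CROSSWALK.get(tag)
        if element_name is None:
            continue  # tags outside the crosswalk are skipped
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element_name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")

# Placeholder record for illustration only.
sample_record = {
    "245": ["Sample article title"],
    "100": ["Author, Sample"],
    "650": ["Arabic poetry", "Literary criticism"],
    "856": ["https://example.org/articles/sample.pdf"],
    "999": ["local field that is not mapped"],
}
dc_xml = marc_to_dc(sample_record)
```

Repeatable MARC fields become repeated DC elements, which is why each tag maps to a list of values.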
XTF: The Content Management System
• AUB-UL adopted XTF (eXtensible Text Framework).
• XTF (http://xtf.cdlib.org/):
  o Powerful open source platform for access to digital content
  o Developed and maintained by the California Digital Library
  o Can create indexes on any XML element or attribute, including Dublin Core (XMLDC)
• Lucene: a free, open source full-text indexing tool
• Apache Solr: an open source search engine
• XTF provides out-of-the-box support for the following types of documents: Microsoft Word, PDF, web pages (html/htm), XML, encoded plain text, and scanned books from the Internet Archive and HathiTrust.
• The user interface, with search/browse and document views, can be customized.
Searching Metadata and digitized magazine
• XTF allows all the information (images, text, metadata) to be uploaded for full-text search.
• It provides a web interface that gives access to the digitized magazine.
• It provides a viewer/reader feature.
Full text search in CMS
• Full text can be extracted, indexed, and retrieved in the content management system because OCR transforms images of Arabic text into searchable text.
• The Apache Solr search engine in the XTF CMS can retrieve both full text and metadata through the “Simple Search” and “Advanced Search” web interfaces.
SAMPLES OF SEARCH INTERFACE IN THE CONTENT MANAGEMENT SYSTEM (CMS) XTF
[Figure: cover of the last issue, last publication year (2012)]
American University of Beirut Libraries
Thank you
http://www.aub.edu.lb/ulibraries/
Basma Chebani (email: bc01@aub.edu.lb)
Head of Cataloguing and Metadata Services, University Libraries/ Jafet