Million Book Project @ Bibliotheca Alexandrina
Youssef Eldakar
19 November 2006
Bibliotheca Alexandrina
BA Digitization Workflow
Image Processing: Image to Better Image
Image Processing Sequence
§ ScanFix (automated processing)
– Deskew
– Despeckle
– Rotation
§ Adobe Photoshop (manual processing)
– Noise removal
– Black edge removal
– Page resize
– Center text to page
§ ScanFix (automated processing)
– Enhance text quality [Grow & Erode]
§ ACDSee (automated processing)
– Renaming files
– File compression (CCITT Group 4)
ScanFix § Deskew (before/after comparison images)
ScanFix § Despeckle (before/after comparison images)
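To make the despeckle step concrete, here is a minimal sketch of one common way to remove specks from a bitonal page: clear any black pixel that has no black neighbors. This is not ScanFix's actual algorithm, which the slides do not describe; the function name `despeckle` and the `min_neighbors` parameter are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # 1 = black (ink), 0 = white (paper)


def despeckle(img: Image, min_neighbors: int = 1) -> Image:
    """Clear black pixels with fewer than `min_neighbors` black 8-neighbors.

    Illustrative only: real despeckle filters usually work on connected
    components and size thresholds rather than single pixels.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        neighbors += img[ny][nx]
            if neighbors < min_neighbors:
                out[y][x] = 0
    return out


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # isolated speck
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],  # part of a stroke
    [0, 0, 0, 1, 1],
]
cleaned = despeckle(page)
```

The isolated pixel is cleared while the 2x2 stroke fragment is kept, because each of its pixels has three black neighbors.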
ScanFix § Rotation
Photoshop § Noise removal (before/after comparison images)
Photoshop § Black edge removal (before/after comparison images)
Photoshop § Page resize
Photoshop § Center text to page (before/after comparison images)
ScanFix § Enhance text quality: Grow, Erode (horizontal / vertical) (before/after comparison images)
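Grow and erode correspond to the morphological operations usually called dilation and erosion. The sketch below assumes a three-pixel-wide structuring element and applies grow followed by erode along one direction, which fills one-pixel gaps in broken strokes. The order of operations, the element size, and the function names are assumptions for illustration; the slides do not say how ScanFix is configured.

```python
from typing import List

Row = List[int]          # 1 = black (ink), 0 = white (paper)
Image = List[Row]


def _px(row: Row, i: int) -> int:
    """Pixel at index i, treating everything outside the row as white."""
    return row[i] if 0 <= i < len(row) else 0


def grow_row(row: Row) -> Row:
    """Dilation: a pixel becomes black if it or a direct neighbor is black."""
    return [1 if (_px(row, i - 1) or row[i] or _px(row, i + 1)) else 0
            for i in range(len(row))]


def erode_row(row: Row) -> Row:
    """Erosion: a pixel stays black only if it and both neighbors are black."""
    return [1 if (_px(row, i - 1) and row[i] and _px(row, i + 1)) else 0
            for i in range(len(row))]


def transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def grow_then_erode(img: Image, direction: str = "horizontal") -> Image:
    """Apply grow followed by erode along rows or along columns."""
    rows = img if direction == "horizontal" else transpose(img)
    closed = [erode_row(grow_row(r)) for r in rows]
    return closed if direction == "horizontal" else transpose(closed)


broken_stroke = [[0, 1, 1, 0, 1, 1, 0]]
repaired = grow_then_erode(broken_stroke, "horizontal")

broken_stem = [[0], [1], [0], [1], [0]]
repaired_stem = grow_then_erode(broken_stem, "vertical")
```

In both toy inputs the single white pixel inside the stroke is filled while the white margins are left untouched.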
ACDSee § Renaming files
ACDSee § Compression to TIFF (CCITT Group 4)
OCR: Image to Text
OCR - Arabic
§ Poses unique challenges
– Written cursively, with blocks of connected characters; a block of characters can have more than one baseline
– Uses external objects such as dots, 'Hamza', and 'Madda'
– Diacritization
– Characters can have more than one shape according to their position
– Overlapping makes it difficult to determine the spacing
§ Sakhr Automatic Reader is used
§ Tricky with old books
§ Requires learning
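Two of the challenges above, position-dependent letter shapes and diacritization, can be seen directly in Unicode. The sketch below uses only Python's standard `unicodedata` module: it lists the four contextual glyphs of the letter BEH, shows that they all fold back to one abstract letter, and strips combining marks from a vocalized word. The helper name `strip_diacritics` and the sample word are chosen for illustration and do not come from the slides.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the abstract letter

# Unicode keeps the four contextual glyphs of BEH as compatibility characters.
forms = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

for position, glyph in forms.items():
    # Each presentation form folds back to the same abstract letter under NFKC.
    assert unicodedata.normalize("NFKC", glyph) == BEH
    print(position, unicodedata.name(glyph))


def strip_diacritics(text: str) -> str:
    """Drop combining marks (fatha, shadda, ...) and keep the base letters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# KAF, TEH, BEH, each followed by a FATHA (U+064E).
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vocalized)
```

An OCR engine has to go the other way: it sees four different glyph shapes, optionally decorated with marks, and must map them all to the same letter.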
Arabic Script Is Cursive (sample image)
Old, Smudgy, and Stuck Together (sample image)
Use of Diacritics (sample image)
Pre-OCR Text Enhancement
§ Condition of Arabic printings varies
– Old/new
– Light/heavy
– Solid/dot-matrix
§ ScanFix's smoothing and completion features improve recognition accuracy
§ Separate from the actual processing phase
– Must be tested under OCR right away
– OCR specialists have a better feel for "good text"
Text Repair in ScanFix (screenshot)
Font Libraries
§ Improvement of Arabic OCR results through
– Tweaking of OCR engine settings
– Learning
§ Libraries for different fonts have been built to achieve higher recognition rates
§ Databases of character glyphs that describe a particular type of script and improve OCR accuracy
§ Built on a carefully selected and classified high-variety set of scanned images from a batch of about 1000 books that boiled down to 15 font groups
Font Classification
§ Classification criteria:
– Script type
  • TA: Traditional Arabic
  • AR: Arabic Transparent
  • DT: DecoType Naskh and DecoType Naskh extension
– Printing quality: High (H), Medium (M), and Low (L)
– Font size: 1 (largest) to 5 (smallest)
§ "Group X": virtual font to tag unclassifiable printings and handwriting
§ Minimum accuracy number assigned to each group based on testing results
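The classification criteria above map naturally onto a small validated record. The following is a sketch under stated assumptions: the class name `FontGroup`, the `tag` label format, and the optional `min_accuracy` field are hypothetical and only illustrate how the three criteria and the per-group minimum accuracy could be held together. The slides do not say how BA actually names or stores its font groups.

```python
from dataclasses import dataclass
from typing import Optional

# Codes taken from the classification criteria on the slide.
SCRIPT_TYPES = {
    "TA": "Traditional Arabic",
    "AR": "Arabic Transparent",
    "DT": "DecoType Naskh and DecoType Naskh extension",
}
QUALITIES = {"H": "High", "M": "Medium", "L": "Low"}
SIZES = range(1, 6)  # 1 (largest) to 5 (smallest)

GROUP_X = "X"  # virtual group for unclassifiable printings and handwriting


@dataclass(frozen=True)
class FontGroup:
    script: str
    quality: str
    size: int
    min_accuracy: Optional[float] = None  # to be filled in from testing results

    def __post_init__(self) -> None:
        if self.script not in SCRIPT_TYPES:
            raise ValueError(f"unknown script type: {self.script!r}")
        if self.quality not in QUALITIES:
            raise ValueError(f"unknown printing quality: {self.quality!r}")
        if self.size not in SIZES:
            raise ValueError(f"font size must be 1..5, got {self.size!r}")

    @property
    def tag(self) -> str:
        # Hypothetical label format, not BA's naming scheme.
        return f"{self.script}-{self.quality}-{self.size}"


group = FontGroup(script="TA", quality="H", size=1)
```

Validating at construction time keeps a mistyped code from silently creating a new group.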
16 Font Groups (chart)
Learning
§ Train the engine on two representative pages of the book, building upon an initial font file picked from a set of prebuilt font libraries
§ Use a different page to manually calculate OCR accuracy before and after learning
§ Batch-OCR the book using the learned font file and save to ART
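The slides describe the before/after accuracy check as a manual calculation and do not give a formula. One common definition is character accuracy based on edit distance against a hand-typed reference, sketched below. The function names and the sample strings are assumptions for illustration, not BA's procedure or data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters recovered, floored at zero."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1 - edit_distance(reference, ocr_output) / len(reference))


# Placeholder strings standing in for a test page typed by hand and the
# engine output before and after learning.
reference = "digital library"
before = char_accuracy(reference, "digifal librory")
after = char_accuracy(reference, "digital librory")
```

Measuring on a page that was not used for training, as the slide specifies, keeps the comparison from flattering the learned font file.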
Learning in Sakhr's Automatic Reader (screenshot)
VERUS from NovoDynamics
§ Preliminary evaluation on two data sets is promising
– Challenge: difficult to OCR, degraded images
– Normal: known to return acceptable accuracy
§ No learning capabilities, hence no human operators
§ VERUS uses an XML format to store recognition data
§ BA and NovoDynamics entered into a research agreement
Evaluation of VERUS and AR (Sakhr Automatic Reader) (chart)
Encoding: Image on Text
Challenges in Publishing
§ Preservation of layout
§ Searchability of content and metadata
§ Efficient image compression
§ Easy browsing of books
§ Accommodating low-bandwidth users
§ Multilingual text support
§ Multipaging
Image-on-Text
§ Multilayered:
– Visible page image
– Hidden OCR text
§ View the exact original layout while searching and highlighting
§ Supported by some OCR suites only
§ Supported formats: DjVu and PDF
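The two-layer idea can be sketched as a data structure: the page image is what the reader sees, and each recognized word carries the rectangle it occupies on that image, so a search hit can be highlighted in place. The class and field names below are hypothetical and do not reflect how DjVu or PDF store their text layers.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in image pixels


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Page:
    image_path: str    # visible layer: the scanned page image
    words: List[Word]  # hidden layer: OCR text with positions

    def find(self, query: str) -> List[Box]:
        """Return the rectangles to highlight for a case-insensitive match."""
        q = query.casefold()
        return [w.box for w in self.words if w.text.casefold() == q]


page = Page(
    image_path="page_0001.tif",  # hypothetical file name
    words=[
        Word("Million", (10, 10, 90, 30)),
        Word("Book", (100, 10, 150, 30)),
        Word("Project", (160, 10, 240, 30)),
        Word("book", (10, 50, 60, 70)),
    ],
)
hits = page.find("book")
```

Because the viewer draws highlights over the image rather than re-rendering the text, OCR errors never disturb the original layout.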
UDBE
§ Universal Digital Book Encoder
§ A framework that integrates many OCR engines and supports many target formats in one system for encoding image-on-text documents for publishing
§ Made possible through the use of a Common OCR Format (COF)
UDBE
§ Built around a Common OCR Format (COF) (diagram)
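A common intermediate format means each OCR engine needs only one reader and each target format only one writer, rather than one converter per engine/format pair. The sketch below illustrates that design with toy adapters. Everything in it is hypothetical: the slides do not describe COF's fields or UDBE's interfaces, and the engine and format names are placeholders, not real products.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class CofWord:
    text: str
    box: Tuple[int, ...]  # left, top, right, bottom


@dataclass
class CofPage:
    """Stand-in for a Common OCR Format page; the real COF schema is unknown."""
    image_path: str
    words: List[CofWord] = field(default_factory=list)


READERS: Dict[str, Callable[[str], CofPage]] = {}  # engine output -> COF
WRITERS: Dict[str, Callable[[CofPage], str]] = {}  # COF -> target format


def reader(engine: str):
    def register(fn):
        READERS[engine] = fn
        return fn
    return register


def writer(target: str):
    def register(fn):
        WRITERS[target] = fn
        return fn
    return register


@reader("toy-engine")
def read_toy(raw: str) -> CofPage:
    # Toy input: first line is the image path, then "text l t r b" per line.
    lines = raw.splitlines()
    page = CofPage(image_path=lines[0])
    for line in lines[1:]:
        text, *nums = line.split()
        page.words.append(CofWord(text, tuple(int(n) for n in nums)))
    return page


@writer("toy-text")
def write_text(page: CofPage) -> str:
    return " ".join(w.text for w in page.words)


def encode(engine: str, target: str, raw: str) -> str:
    """Route any registered engine output to any registered target format."""
    return WRITERS[target](READERS[engine](raw))


result = encode("toy-engine", "toy-text",
                "p1.tif\nhello 0 0 10 10\nworld 12 0 30 10")
```

Adding a new engine or a new output format is then a single registration, with no change to the other side.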
Performance – Arabic B&W (chart)
Performance – Latin B&W (chart)
Quality Assurance
Q/A - Common Errors
§ No missing cover or pages
§ All pages are in order
§ Text quality
– Curved text
– Pale text
– Toothed text
§ Image quality
– Skew
– Noise
– Page size
– Fingers
– Cut pages and page edges
§ PDF quality
– Image on text
– Searching hits
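The first two checks in the list, missing pages and page order, lend themselves to automation. The sketch below assumes page images are named with a trailing sequence number such as `book_0001.tif`; that naming scheme and the function name are assumptions, since the slides do not show the renaming convention used in ACDSee.

```python
import re
from typing import Dict, List, Union


def check_page_sequence(filenames: List[str],
                        expected_count: int) -> Dict[str, Union[List[int], bool]]:
    """Report missing page numbers and whether files appear in page order."""
    numbers = []
    for name in filenames:
        # Hypothetical convention: digits immediately before the extension.
        match = re.search(r"(\d+)\.\w+$", name)
        if match is None:
            raise ValueError(f"no page number in file name: {name!r}")
        numbers.append(int(match.group(1)))
    missing = sorted(set(range(1, expected_count + 1)) - set(numbers))
    return {"missing": missing, "in_order": numbers == sorted(numbers)}


report = check_page_sequence(
    ["book_0001.tif", "book_0002.tif", "book_0004.tif", "book_0003.tif"],
    expected_count=5,
)
```

Text, image, and PDF quality still need a human eye, but a check like this can run on every book before it reaches the reviewer.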
DAR: Digital Assets Repository
System Architecture (diagram)
DAR - Metadata
§ Descriptive Metadata
§ Administrative Metadata
§ Technical Metadata
DAR Publishing Module
§ Provides access to the repository content through search and browse facilities
§ Multilingual full-text search
§ Functionalities
– Browse the repository contents by Collection, Subject, Creator, and Title
– Search content by an indexed metadata field
– Multilingual full-text search using both exact and morphological matching
– Display brief record information
– Display full record information with links to digital objects
– Display MARC and DC formats
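The difference between exact and morphological matching, listed among the search functionalities, can be shown with a toy example. The suffix stripper below is a deliberately crude English stand-in; real morphological analysis, especially for Arabic, is far more involved, and nothing here reflects the engine DAR actually uses. All names and sample documents are illustrative.

```python
from typing import Dict, List

SUFFIXES = ("ing", "ed", "es", "s")  # toy English suffix list


def toy_stem(word: str) -> str:
    """Strip one known suffix, keeping at least three characters."""
    w = word.casefold()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def search(docs: Dict[int, str], query: str,
           morphological: bool = False) -> List[int]:
    """Return ids of documents containing the query word.

    Exact mode compares case-folded tokens; morphological mode compares
    the toy stems of both the query and each token.
    """
    key = toy_stem if morphological else str.casefold
    target = key(query)
    return [doc_id for doc_id, text in docs.items()
            if any(key(token) == target for token in text.split())]


docs = {
    1: "scanning old books",
    2: "one book per box",
    3: "booked in advance",
}
exact_hits = search(docs, "book")
morph_hits = search(docs, "book", morphological=True)
```

Exact matching finds only the document with the literal word, while the stem-based mode also returns the inflected forms, at the cost of occasional false matches.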
DAR: Future Work
§ Consider MODS and METS standards in the new system data model
§ Enhance the functionalities of the Books Viewer with more security and copyright management
§ Join the Open Source community by building DAR modules with open source technologies and languages
§ Provide support for the currently available digital library interoperability protocols
Books from India: Towards Better Collaboration
Books From India
Language          Number of Books
Arabic            832
Arabic + French   3
Arabic + German   1
Persian           101
French            2
English           1
Spanish           1
German            1
Total             942
Progress
Phase Name   Done as of November 1, 2006   Expected to be finished by   Comments
Cataloging   801                           -                            35 have metadata problems
Processing   742                           November 20, 2006
OCRing       200                           March 1, 2007
Encoding     171                           -                            -
Publishing   171                           -                            -
Metadata Problems (sample images)
Processing (sample images)
OCR Using VERUS or AR?
§ Calculated accuracy for a small sample
– Images processed once with darkening effect and once without
– VERUS likes darkening, AR does not
– Overall, AR won 70% of cases
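The slide does not say how a "case" was scored, so the following is only one plausible way to tally such a comparison: for each sample image, take each engine's better accuracy across the darkened and undarkened versions and count which engine comes out ahead. The accuracy figures are made-up placeholders, not BA's measurements, and the function names are hypothetical.

```python
from typing import Dict, List

Case = Dict[str, Dict[str, float]]  # engine -> preprocessing -> accuracy

# Placeholder numbers for illustration only.
cases: List[Case] = [
    {"AR": {"plain": 0.90, "dark": 0.85}, "VERUS": {"plain": 0.80, "dark": 0.88}},
    {"AR": {"plain": 0.92, "dark": 0.87}, "VERUS": {"plain": 0.78, "dark": 0.84}},
    {"AR": {"plain": 0.70, "dark": 0.65}, "VERUS": {"plain": 0.72, "dark": 0.81}},
    {"AR": {"plain": 0.95, "dark": 0.91}, "VERUS": {"plain": 0.85, "dark": 0.90}},
]


def winner(case: Case) -> str:
    """Engine with the higher accuracy once each uses its best preprocessing."""
    best = {engine: max(scores.values()) for engine, scores in case.items()}
    return max(best, key=best.get)


def win_rate(all_cases: List[Case], engine: str) -> float:
    return sum(winner(c) == engine for c in all_cases) / len(all_cases)


ar_rate = win_rate(cases, "AR")
```

Letting each engine use its preferred preprocessing matters here, since the slide notes that darkening helps VERUS but hurts AR.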