Digitizing Arabic Text: Where are we today? Elizabeth A. S. Beaudin, Yale University Library. Project AMEEL: http://www.library.yale.edu/ameel/
New Haven 41 n 18, 72 w 56; Victor 42 n 56, 78 w 44; Ann Arbor 42 n 17, 83 w 45; Alexandria 31 n 12, 29 e 54; Halle 51 n 29, 11 e 58; Leiden 52 n 10, 4 e 30
Outsourcing: Kirtas APT BookScan 2400
In house … Indus 5002 Book Scanner
Processing (image enhancement): before and after
Sakhr v. 9 – interface
Sakhr – increasing accuracy
VERUS v. 2 – interface
VERUS – increasing accuracy
Comparison of features and uses

Sakhr
• Character by character
• Font libraries
• Output choices: Unicode text, proprietary, HTML
• Learning approach to improving accuracy
• Use of dongle for copyright
• Handles batch processing well
• Desktop and SDK versions

VERUS
• Word by word
• Custom dictionary
• Output choices: searchable PDF or UTF-8 text
• Good with degraded documents
• Use of dongle to track quantity
• Handles mix of languages better
• API version for customization
Accuracy? • Averages • Conditions • Improvement • Decisions
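The slides do not say how accuracy was measured. As a minimal sketch, assuming accuracy is taken as one minus the edit distance between a hand-corrected reference and the OCR output, divided by the reference length, the character-level and word-level figures could be computed as below. The function names and the sample strings are hypothetical and do not come from the project.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def char_accuracy(ref_text, hyp_text):
    # Compares Unicode code points, so Arabic text needs no special handling.
    return accuracy(list(ref_text), list(hyp_text))


def word_accuracy(ref_text, hyp_text):
    # Compares whitespace-separated tokens.
    return accuracy(ref_text.split(), hyp_text.split())


def average_accuracy(pairs, metric):
    """Mean of a metric over (reference, ocr_output) page pairs."""
    scores = [metric(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    reference = "كتاب جديد"   # hypothetical ground truth
    ocr_output = "كتاب جديذ"  # hypothetical OCR result, last letter misread
    print(char_accuracy(reference, ocr_output))
    print(word_accuracy(reference, ocr_output))
```

A single misread letter costs little at the character level but invalidates the whole word at the word level, which is one reason averages from a character-by-character engine and a word-by-word engine are hard to compare directly.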
Software suite

Digitization Workflow
• Scanning: proprietary to Indus and Kirtas
• Processing: PhotoShop, ScanFix, ACDSee, Unifier, SuperEdi
• OCR: Sakhr OCR-Gold and font libraries, VERUS v 2
• Staging and Archiving: ACDSee, customized scripts
• Workflow control: MS Access, customized scripts

Repository Development
Fedora Repository Framework, PHP, MySQL, REST, Java
Full-text repository: FEDORA framework, indexed and searchable in Arabic. http://oacistest.library.yale.edu:8080/fedoragsearch/rest.Ameel
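To illustrate what searching such a repository in Arabic involves at the URL level, here is a minimal sketch that builds a query URL for a Fedora GSearch REST endpoint. The host below is a hypothetical placeholder, not the project's server, and the parameter names (`operation=gfindObjects`, `query`, `hitPageSize`) are assumptions based on the generic GSearch REST interface; they should be checked against the installed version. The point is only that the Arabic query term must be percent-encoded as UTF-8.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; replace with the actual repository host.
BASE_URL = "http://localhost:8080/fedoragsearch/rest"


def build_search_url(query, base_url=BASE_URL, page_size=10):
    """Return a search URL with the query percent-encoded as UTF-8."""
    params = {
        "operation": "gfindObjects",  # assumed GSearch operation name
        "query": query,
        "hitPageSize": page_size,     # assumed paging parameter
    }
    return base_url + "?" + urlencode(params)


def query_from_url(url):
    """Decode the query term back out of a search URL (round-trip check)."""
    return parse_qs(urlsplit(url).query)["query"][0]


if __name__ == "__main__":
    url = build_search_url("كتاب")  # hypothetical search term
    print(url)
    print(query_from_url(url))
```

The sketch only constructs and decodes the URL; it does not contact any server.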
http://www.library.yale.edu/ameel/