OCR With Nikud
Shani Vered Oz Adi
Advisor: Prof. Michael Elhadad
Motivation
• Create a free tool that converts text without nikud into text with nikud.
• Help preserve the language (nikud usage is decreasing).
• NLP Hebrew research: create a Hebrew corpus with nikud.
How To Improve
We want to train Tesseract so that it recognizes the Britannica Hebrew letters and nikud. The way to do this is to create an improved training data file for Tesseract. We used a helpful tool called Moshpytt.
Data Set Distribution + Box Files Example

Letter  Hits    Letter  Hits    Letter  Hits
א       1201    י       153     ע       242
ב       755     כ       1785    פ       520
ג       1333    ך       60      ף       31
ד       212     ל       333     צ       356
ה       469     מ       1020    ץ       48
ו       163     ם       651     ק       192
ז       1720    נ       881     ר       370
ח       108     ן       168     ש       1055
ט       402     ס       522     ת       808
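A per-letter distribution like the one above could be computed directly from the box files. The following is a minimal sketch, not the project's actual script. It assumes the standard Tesseract box file layout of six whitespace-separated fields per line (glyph, left, bottom, right, top, page); the function names `base_letter` and `count_box_glyphs` and the sample lines are hypothetical.

```python
import unicodedata
from collections import Counter


def base_letter(glyph: str) -> str:
    """Return the glyph without combining marks (nikud), so a pointed
    letter is counted under its bare letter."""
    decomposed = unicodedata.normalize("NFD", glyph)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def count_box_glyphs(lines, strip_nikud=True):
    """Count glyph occurrences in box file lines.

    Assumption: each valid line has exactly six fields,
    "<glyph> <left> <bottom> <right> <top> <page>".
    Lines that do not match are skipped.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # blank or malformed line
        glyph = parts[0]
        key = base_letter(glyph) if strip_nikud else glyph
        if key:
            counts[key] += 1
    return counts


# Hypothetical sample: alef, alef with qamats, and bet.
sample_lines = [
    "\u05d0 10 20 30 40 0",
    "\u05d0\u05b8 12 20 32 40 0",
    "\u05d1 50 20 70 40 0",
    "",
    "bad line",
]

letter_hits = count_box_glyphs(sample_lines)
pointed_hits = count_box_glyphs(sample_lines, strip_nikud=False)
```

Counting with `strip_nikud=False` keeps each letter and nikud combination separate, which is one way to spot combinations that are missing from the corpus.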
Project Results
Top 10 Errors:
● Words ending with the letter ד: often recognized with a hirik (ד)
● Mistakes between ש and ש
● כז - ור, כרור instead of כדור
● ה instead of ה and ה
● ה instead of ה
● י instead of י
● ב instead of ב
● The letter ק needs better training
● ך does not exist in the corpus
● Holam haser is missing from the corpus for some letters
● ת instead of ח
● ס instead of ס
● 'ג becomes נ

Confusion Matrix
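A confusion matrix and a ranked error list of this kind could be tallied as sketched below. This is an illustrative sketch only: it assumes the expected and recognized glyphs have already been aligned one to one (the alignment step is not shown), and the names `confusion_counts`, `top_errors`, and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def confusion_counts(pairs):
    """Build a nested count table: matrix[expected][recognized] -> count.

    Assumption: `pairs` is an iterable of (expected, recognized) glyphs
    that were aligned beforehand.
    """
    matrix = defaultdict(Counter)
    for expected, recognized in pairs:
        matrix[expected][recognized] += 1
    return matrix


def top_errors(matrix, n=10):
    """Return the n most frequent misrecognitions as
    ((expected, recognized), count), most frequent first."""
    errors = [
        ((expected, recognized), count)
        for expected, row in matrix.items()
        for recognized, count in row.items()
        if expected != recognized
    ]
    errors.sort(key=lambda item: item[1], reverse=True)
    return errors[:n]


# Hypothetical aligned pairs: (expected glyph, recognized glyph).
sample_pairs = [
    ("ח", "ת"),
    ("ח", "ת"),
    ("ח", "ח"),
    ("ג", "נ"),
]

matrix = confusion_counts(sample_pairs)
worst = top_errors(matrix, n=1)
```

Keeping the full glyph (letter plus nikud) as the key, rather than the bare letter, lets the same table surface nikud-only confusions as well.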