- Slides: 13
OCR With Nikud
Shani Vered Oz Adi
Advisor: Prof. Michael Elhadad
Motivation
• Create a free tool that converts text without nikud into text with nikud.
• Help preserve the language (nikud usage is decreasing).
• NLP Hebrew research: create a Hebrew corpus with nikud.
How To Improve
We want to train Tesseract so that it recognizes the Britannica Hebrew letters and nikud. The way to do this is to create an improved training data file for Tesseract. We used a useful tool called Moshpytt.
Data Set Distribution + Box Files Example

Letter  Hits    Letter  Hits    Letter  Hits
א       1201    י       153     ע       242
ב       755     כ       1785    פ       520
ג       1333    ך       60      ף       31
ד       212     ל       333     צ       356
ה       469     מ       1020    ץ       48
ו       163     ם       651     ק       192
ז       1720    נ       881     ר       370
ח       108     ן       168     ש       1055
ט       402     ס       522     ת       808
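A per-letter hit count like the one above can be derived from Tesseract box files. The sketch below is only an illustration, not the project's actual script: it assumes each box-file line has the form `<symbol> <left> <bottom> <right> <top> <page>`, and the names `base_letter`, `count_box_symbols`, and the sample coordinates are hypothetical.

```python
import unicodedata
from collections import Counter


def base_letter(symbol):
    """Strip combining marks (nikud) and return the remaining base character(s)."""
    decomposed = unicodedata.normalize("NFD", symbol)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_box_symbols(lines):
    """Count glyphs in box-file lines, both by base letter and by full symbol.

    Assumed line format: <symbol> <left> <bottom> <right> <top> <page>
    """
    plain = Counter()       # hits per letter, nikud ignored
    with_nikud = Counter()  # hits per exact letter + nikud combination
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        symbol = parts[0]
        with_nikud[symbol] += 1
        plain[base_letter(symbol)] += 1
    return plain, with_nikud


# Hypothetical sample lines; the coordinates are made up for illustration.
sample = [
    "\u05d0 10 20 30 40 0",        # alef
    "\u05d0\u05b4 35 20 55 40 0",  # alef + hirik
    "\u05d1\u05bc 60 20 80 40 0",  # bet + dagesh
]
plain, with_nikud = count_box_symbols(sample)
```

Keeping two counters separates the plain-letter distribution from the letter-with-nikud distribution, which matches the two views of the data set shown on these slides.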
Data Set Distribution - with nikud

א א א א   4 - - - 2 75 105 94 155 45 46 11 2
מ מ מ מ   18 4 - 2 82 115 64 103 230 - - - 191
ע ע ע ע   4 - - 1 64 137 7 21 88 2 98 3 6
פ פ פ פ   5 14 - - 68 64 21 23 46 - - - 77
ש ש ש ש   7 5 139 46 81 21 243 37 69 - - - 133
ת ת ת     1 10 - 3 100 103 11 9 70 - - ת 136

א מ ע פ ש ת
Project Results
Top 10 Errors:
● Words ending with the letter ד: often recognized with a hirik (ד)
● Confusion between ש and ש
● כז - ור, כרור instead of כדור
● ה instead of ה and ה
● ה instead of ה
● י instead of י
● ב instead of ב
● The letter ק needs better training
● ך does not exist in the corpus
● Holam haser is missing in the corpus for some letters
● ת instead of ח
● ס instead of ס
● 'ג becomes נ
Confusion Matrix
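An error list of this kind can be read off a confusion matrix. The following is a minimal sketch, assuming the OCR output has already been aligned symbol by symbol with the ground truth; the functions `confusion_counts` and `top_errors` and the sample pairs are hypothetical and are not taken from the project.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (expected, recognized) symbol pairs; this is a sparse confusion matrix."""
    counts = Counter()
    for expected, recognized in pairs:
        counts[(expected, recognized)] += 1
    return counts


def top_errors(counts, n=10):
    """Return the n most frequent off-diagonal cells, i.e. the most common mistakes."""
    errors = Counter({pair: c for pair, c in counts.items() if pair[0] != pair[1]})
    return errors.most_common(n)


# Hypothetical aligned pairs (expected, recognized); counts are made up.
pairs = [
    ("\u05d7", "\u05ea"),  # het recognized as tav
    ("\u05d7", "\u05ea"),  # het recognized as tav
    ("\u05d3", "\u05e8"),  # dalet recognized as resh
    ("\u05d0", "\u05d0"),  # alef recognized correctly
]
errors = top_errors(confusion_counts(pairs))
```

Storing only the observed cells keeps the structure small even when every letter and nikud combination is treated as its own class.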
Project Results - Cont.
Overall Accuracy: 90%!

                     Precision  Recall
Plain Letters        95%        93%
Letters With Nikud   82.4%      80.5%
Only Nikud           87.9%      87%
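The slides do not state how these figures were computed. As a reminder of the standard definitions, precision is TP / (TP + FP) and recall is TP / (TP + FN) for a given symbol. The sketch below applies them to aligned (expected, recognized) pairs; the function name `precision_recall` and the sample data are hypothetical.

```python
def precision_recall(pairs, label):
    """Compute precision and recall for one symbol from (expected, recognized) pairs."""
    tp = sum(1 for exp, rec in pairs if exp == label and rec == label)
    fp = sum(1 for exp, rec in pairs if exp != label and rec == label)
    fn = sum(1 for exp, rec in pairs if exp == label and rec != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical aligned pairs (expected, recognized); not project data.
ALEF, BET, GIMEL = "\u05d0", "\u05d1", "\u05d2"
pairs = [
    (ALEF, ALEF),
    (ALEF, BET),   # alef misread as bet
    (BET, BET),
    (BET, BET),
    (GIMEL, BET),  # gimel misread as bet
]
p_bet, r_bet = precision_recall(pairs, BET)
p_alef, r_alef = precision_recall(pairs, ALEF)
```

In this sample, bet is recognized every time it appears (recall 1.0), but half of the symbols reported as bet are wrong (precision 0.5), which shows why the two measures are reported separately.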
Questions?
http://www.cs.bgu.ac.il/~nlpproj