Experiments on Processing Overlapping Parallel Corpora. Mark Fishel and Heiki-Jaan Kaalep, University of Tartu
Outline: • Parallel corpora containing overlapping parts • A method for processing these • Some experiments on JRC-Acquis (Estonian, Latvian, English)
Overlapping parallel corpora • Hunglish and OPUS – Hu-En subtitles • Hunglish and JRC-Acquis – Hu-En legislation texts • Univ. of Tartu corpus and JRC-Acquis – Et-En legislation texts • JRC-Acquis, Vanilla and HunAlign versions – legislation texts
Overlapping parallel corpora • Additional difficulties in handling: – source version differences – encoding differences – format differences • But also potential benefits: – detect alignment errors – raise corpora quality – increase segmentation depth
ParAlign – the method • A method of finding and matching corresponding corpora parts • Enables – combining corpora – detecting potential error spots – increasing alignment depth – evaluating and improving alignment quality
Method based on finding corpora correspondence:
Aligning the corresponding language parts: • Edit distance over the corpora documents – comparing N to M sentences – matching weight = approximate sentence matching • Approximate sentence matching: modified edit distance – replacing a letter with the same letter in a different case is free – inserting/replacing a number is infinitely costly – replacing punctuation is cheap
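The modified edit distance described on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the ParAlign code: the function names, the concrete weights (0.1 for punctuation, 1.0 for everything else), the use of Unicode categories to detect punctuation, and the symmetric treatment of digit insertion and deletion are all assumptions that the slides do not specify.

```python
import math
import unicodedata


def is_punct(c: str) -> bool:
    # Assumption: "punctuation" means any Unicode punctuation category (P*).
    return unicodedata.category(c).startswith("P")


def sub_cost(a: str, b: str) -> float:
    """Cost of replacing character a with b (weights are illustrative)."""
    if a == b or a.lower() == b.lower():
        return 0.0          # same letter, different case: free
    if a.isdigit() or b.isdigit():
        return math.inf     # replacing a number: infinitely costly
    if is_punct(a) and is_punct(b):
        return 0.1          # replacing punctuation: cheap (assumed weight)
    return 1.0              # any other replacement (assumed weight)


def indel_cost(c: str) -> float:
    """Cost of inserting or deleting c; digits are treated symmetrically."""
    return math.inf if c.isdigit() else 1.0


def sentence_distance(s: str, t: str) -> float:
    """Character-level edit distance between two sentences with custom costs."""
    prev = [0.0]
    for b in t:
        prev.append(prev[-1] + indel_cost(b))
    for a in s:
        cur = [prev[0] + indel_cost(a)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + indel_cost(a),       # delete a
                cur[j - 1] + indel_cost(b),    # insert b
                prev[j - 1] + sub_cost(a, b),  # match or replace
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(sentence_distance("Article 5", "article 5"))
    print(sentence_distance("Article 5", "Article 6"))
```

With these weights, sentences that differ only in letter case or punctuation stay close, while sentences whose numbers differ can never be matched, which fits legislation texts where article and paragraph numbers are decisive.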
Aligning the language alignments: • Levenshtein distance
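As a rough illustration of this step, the following Python sketch computes a plain Levenshtein distance between two sequences of alignment units and recovers the edit operations, so that mismatching units can be listed. The representation of a unit as a hashable value, unit costs of 1, and the function name `levenshtein_ops` are assumptions made for the example; the slides do not describe these details.

```python
def levenshtein_ops(a, b):
    """Return (distance, ops) for two sequences of hashable alignment units.

    Each op is (name, index_in_a, index_in_b), where name is one of
    "match", "replace", "delete", "insert"; a missing index is None.
    """
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # unit only in a
                dist[i][j - 1] + 1,        # unit only in b
                dist[i - 1][j - 1] + sub,  # same or differing unit
            )

    # Walk back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if a[i - 1] == b[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                ops.append(("match" if sub == 0 else "replace", i - 1, j - 1))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", None, j - 1))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


if __name__ == "__main__":
    distance, ops = levenshtein_ops(["u1", "u2", "u3"], ["u1", "u3"])
    print(distance, ops)
```

Operations other than "match" mark the places where the two alignments disagree and could be reported as potential error spots.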
ParAlign, the Implementation • Combine corpora, include side with more sentences • Print out all mismatching parts (potential error spots) • Use one corpus as guideline, proof the other one • Available at http://ats.cs.ut.ee/smt/paralign
Method Benefits: • Handles different segmentation levels (M-to-N alignment unit relations) • Insensitive to minor input differences – Encoding – Typing errors –…
Experiment-1 • Univ. of Tartu corpus and JRC-Acquis (English-Estonian) • Overlapping parts found by comparing the CELEX codes • Aim: generate joint corpus
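Finding the overlapping parts by CELEX code amounts to intersecting the document identifiers of the two corpora. A minimal Python sketch, assuming each corpus is available as a dictionary from CELEX code to document content (the data layout, the function name and the sample codes are illustrative assumptions):

```python
def overlapping_documents(corpus_a, corpus_b):
    """Pair up documents that carry the same CELEX code in both corpora."""
    shared = sorted(corpus_a.keys() & corpus_b.keys())
    return {code: (corpus_a[code], corpus_b[code]) for code in shared}


if __name__ == "__main__":
    # Hypothetical sample data; real corpora would be loaded from files.
    tartu = {"31985L0337": "doc A1", "32000L0060": "doc A2"}
    acquis = {"32000L0060": "doc B2", "32004R0883": "doc B3"}
    print(overlapping_documents(tartu, acquis))
```

Only the document pairs returned here would then go through sentence-level matching.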
Results • Joint corpus size: 670,000 alignment units
Segmentation differences
Experiment-2 • JRC-Acquis – English-Estonian – English-Latvian – Estonian-Latvian • Aim: compare alignments produced by Vanilla and HunAlign – almost 100% overlapping
Results (Hun = HunAlign, Van = Vanilla)
              En-Et           En-Lv           Et-Lv
              Hun     Van     Hun     Van     Hun     Van
Matching      83.5%   85.3%   83.8%   86.2%   98.0%   98.2%
Mismatching   15.9%   13.7%   15.5%   12.8%    1.9%    1.6%
Single         0.6%    1.0%    0.7%    1.0%    0.1%    0.2%
Future Work • Other corpora • Optimizing • Test on other domains
Summary • A method for parallel corpora combining/comparing/evaluating/… using overlapping parts • Implementation available • Joint En-Et corpus • Comparison results between HunAlign and Vanilla versions of JRC-Acquis En-Et, En-Lv and Et-Lv parts
Thank You!