Experiments on Processing Overlapping Parallel Corpora
University of Tartu
Mark Fishel and Heiki-Jaan Kaalep

Outline:
• Parallel corpora containing overlapping parts
• A method for processing these
• Some experiments on JRC-Acquis (Estonian, Latvian, English)

Overlapping parallel corpora
• Hunglish and OPUS
  – Hu-En subtitles
• Hunglish and JRC-Acquis
  – Hu-En legislation texts
• Univ. of Tartu corpus and JRC-Acquis
  – Et-En legislation texts
• JRC-Acquis, Vanilla and HunAlign versions
  – legislation texts

Overlapping parallel corpora
• Additional difficulties in handling them:
  – source version differences
  – encoding differences
  – format differences
• But also potential benefits:
  – detect alignment errors
  – raise corpus quality
  – increase segmentation depth

ParAlign – the method
• A method of finding and matching corresponding corpora parts
• Enables
  – combining corpora
  – detecting potential error spots
  – increasing alignment depth
  – evaluating and improving alignment quality

Method based on finding corpora correspondence:

Aligning the corresponding language parts:
• Edit distance over the corpora documents
  – comparing N to M sentences
  – matching weight = approximate sentence matching
• Approximate sentence matching: a modified edit distance
  – replacing a letter with the same letter in a different case is free
  – inserting or replacing a digit is infinitely costly
  – replacing punctuation is cheap
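The approximate sentence matching rules can be illustrated with a small character-level sketch. This is not the ParAlign implementation; it only assumes the three rules named on the slide. The value of `PUNCT_COST`, the unit cost for all other edits, the restriction of the cheap rate to punctuation-for-punctuation replacement, and the symmetric treatment of digit deletion are assumptions made for the example.

```python
import math
import string

# Assumed value: the slide only says punctuation replacement is "cheap".
PUNCT_COST = 0.1


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b."""
    if a == b or a.lower() == b.lower():
        return 0.0  # identical, or same letter in a different case: free
    if a.isdigit() or b.isdigit():
        return math.inf  # digits may never be replaced
    if a in string.punctuation and b in string.punctuation:
        return PUNCT_COST  # assumption: cheap only for punctuation pairs
    return 1.0  # assumed default cost


def indel_cost(c: str) -> float:
    """Cost of inserting or deleting character c.

    Treating deletion of a digit like insertion is an assumption.
    """
    return math.inf if c.isdigit() else 1.0


def approx_distance(s: str, t: str) -> float:
    """Weighted edit distance between two sentences, row by row."""
    prev = [0.0]
    for ch in t:
        prev.append(prev[-1] + indel_cost(ch))
    for a in s:
        cur = [prev[0] + indel_cost(a)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + indel_cost(a),                 # delete a
                cur[j - 1] + indel_cost(b),              # insert b
                prev[j - 1] + substitution_cost(a, b),   # replace a with b
            ))
        prev = cur
    return prev[-1]
```

With these costs, two sentences that differ only in letter case are at distance zero, while sentences that differ in a digit (for example an article number) can never be matched, which is the behaviour one would want for legislation texts.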

Aligning the language alignments:
• Levenshtein distance
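As a rough illustration of comparing two alignments with Levenshtein distance, the sketch below computes the distance between two sequences of alignment units and recovers the edit operations. The representation of an alignment unit as a pair of sentence counts, the function name `levenshtein_ops`, and the operation labels are hypothetical; the slides do not specify how ParAlign represents or labels them.

```python
def levenshtein_ops(a, b):
    """Return (distance, operations) between sequences a and b.

    Operations are "equal", "replace", "delete" (item only in a)
    and "insert" (item only in b), listed in sequence order.
    """
    n, m = len(a), len(b)
    # dist[i][j] is the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # delete a[i-1]
                dist[i][j - 1] + 1,        # insert b[j-1]
                dist[i - 1][j - 1] + sub,  # keep or replace
            )
    # Walk back from the bottom-right corner to recover one optimal path.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if a[i - 1] == b[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                ops.append("equal" if sub == 0 else "replace")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    ops.reverse()
    return dist[n][m], ops


# Hypothetical alignment units: (source sentence count, target sentence count).
alignment_a = [(1, 1), (2, 1), (1, 1)]
alignment_b = [(1, 1), (1, 2), (1, 1)]
distance, operations = levenshtein_ops(alignment_a, alignment_b)
```

Units marked "replace", "insert" or "delete" are the places where the two alignments disagree and would be natural candidates for manual inspection.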

ParAlign, the implementation
• Combine corpora, including the side with more sentences
• Print out all mismatching parts (potential error spots)
• Use one corpus as a guideline to proofread the other
• Available at http://ats.cs.ut.ee/smt/paralign

Method benefits:
• Handles different segmentation levels (M-to-N alignment unit relations)
• Insensitive to minor input differences
  – encoding
  – typing errors
  – …

Experiment-1
• Univ. of Tartu corpus and JRC-Acquis (English-Estonian)
• Overlapping parts found by comparing the CELEX codes
• Aim: generate joint corpus
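Finding the overlapping documents by CELEX code amounts to intersecting the sets of document identifiers of the two corpora. A minimal sketch, assuming each corpus is available as a mapping from CELEX code to document path; the function name, the codes and the paths below are made up for illustration and do not describe the actual corpus layout.

```python
def overlapping_documents(corpus_a, corpus_b):
    """Pair up documents that carry the same CELEX code in both corpora."""
    shared = sorted(corpus_a.keys() & corpus_b.keys())
    return {code: (corpus_a[code], corpus_b[code]) for code in shared}


# Hypothetical indexes: CELEX code -> document path.
tartu_index = {
    "31985R3820": "tartu/doc_0001.txt",
    "31990L0314": "tartu/doc_0002.txt",
}
jrc_index = {
    "31990L0314": "jrc/et/jrc31990L0314.xml",
    "32001D0497": "jrc/et/jrc32001D0497.xml",
}

overlap = overlapping_documents(tartu_index, jrc_index)
```

Only the document pairs in `overlap` need to go through the sentence-level matching; everything else can be copied into the joint corpus unchanged.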

Results
• Joint corpus size: 670,000 alignment units

Segmentation differences

Experiment-2
• JRC-Acquis
  – English-Estonian
  – English-Latvian
  – Estonian-Latvian
• Aim: compare alignments produced by Vanilla and HunAlign
  – almost 100% overlapping

Results (Hun = HunAlign, Van = Vanilla)

              En-Et           En-Lv           Et-Lv
              Hun     Van     Hun     Van     Hun     Van
Matching      83.5%   85.3%   83.8%   86.2%   98.0%   98.2%
Mismatching   15.9%   13.7%   15.5%   12.8%    1.9%    1.6%
Single         0.6%    1.0%    0.7%    1.0%    0.1%    0.2%

Future Work
• Other corpora
• Optimizing
• Test on other domains

Summary
• A method for combining, comparing and evaluating parallel corpora (and more) using their overlapping parts
• Implementation available
• Joint En-Et corpus
• Comparison results between the HunAlign and Vanilla versions of the JRC-Acquis En-Et, En-Lv and Et-Lv parts

Thank You!