Machine Translation with Diverse Data Sources Huda Khayrallah



















































































This talk was presented at the JHU CLSP seminar on March 29, 2019 and at the UPenn Computational Linguistics Seminar on April 8, 2019. It is based on the following papers:
https://aclweb.org/anthology/W18-2705 (bibtex: https://aclweb.org/anthology/W18-2705)
https://aclweb.org/anthology/W18-2709 (bibtex: https://aclweb.org/anthology/W18-2709.bib)
Work with: Brian Thompson, Kevin Duh & Philipp Koehn
Overview
• Review of Neural Machine Translation (NMT)
• Review of Domain Adaptation
• Improving Domain Adaptation
  • Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation [Khayrallah, Thompson, Duh & Koehn 2018]
• Analysis of Noisy Corpora
  • On the Impact of Various Types of Noise on Neural Machine Translation [Khayrallah & Koehn 2018]
Neural Machine Translation
[Diagram, built up over several slides, bottom to top: "Wasch dir die Hände" → Source Embedding → Encoder → Decoder → Softmax → "Wash your hands"; each output word (e.g. "Wash") passes through a Target Embedding back into the Decoder]
NMT loss function: Cross Entropy(Gold Target, Model output)
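The loss on this slide can be illustrated with a minimal sketch for a single target position. The toy vocabulary and the decoder scores below are assumptions made up for illustration; they do not come from the talk.

```python
import math

def softmax(logits):
    """Turn raw decoder scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_dist, model_dist, eps=1e-12):
    """Cross entropy between a target distribution and the model output."""
    return -sum(t * math.log(q + eps) for t, q in zip(target_dist, model_dist))

# Hypothetical toy vocabulary and decoder scores (assumptions, not from the slides).
vocab = ["Wash", "your", "hands", "</s>"]
logits = [2.0, 0.5, 0.1, -1.0]

model_output = softmax(logits)
gold = [1.0, 0.0, 0.0, 0.0]  # one-hot gold target: the word "Wash"

loss = cross_entropy(gold, model_output)
print(f"loss for gold word {vocab[0]!r}: {loss:.4f}")
```

With a one-hot gold target, the cross entropy reduces to the negative log-probability the model assigns to the gold word.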
What do we want to translate?
Developmental toxicity, including dose-dependent delayed foetal ossification and possible teratogenic effects, were observed in rats at doses resulting in subtherapeutic exposures (based on AUC) and in rabbits at doses resulting in exposures 3 and 11 times the mean steady-state AUC at the maximum recommended clinical dose.
The films coated therewith, in particular polycarbonate films coated therewith, have improved properties with regard to scratch resistance, solvent resistance, and reduced oiling effect, said films thus being especially suitable for use in producing plastic parts in film insert molding methods.
General Domain Data
General Domain Data: Would it not be beneficial, in the short term, following the Rotterdam model, to inspect according to a points system in which, for example, account is taken of the ship's age, whether it is single or double-hulled or whether it sails under a flag of convenience.
General Domain Data: Mama always said there's an awful lot you can tell about a person by their shoes.
Domain Mismatch
Translating Russian Patents
General Domain NMT Model
[Diagram: 50M General Domain sentence pairs → General Domain NMT Model]

General Domain NMT Model: Errors due to domain mismatch
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door security door

In-Domain NMT
[Diagram: 30k In-Domain sentence pairs → In-Domain NMT Model]

In-Domain NMT Model: Errors due to lack of data
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door lock for a high degree of protection against coke

Domain Adaptation

Continued Training
[Diagram: 50M General Domain sentence pairs → General Domain NMT Model; then 30k In-domain sentence pairs → Continued Training NMT Model]

Continued Training: Improved performance!
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door lock with increased penetration protection
Russian → English Patents
[Bar chart: BLEU (axis 0 to 40) for General Domain, In-Domain, and Continued Training; annotated gain: +9.3]
Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation Huda Khayrallah, Brian Thompson, Kevin Duh & Philipp Koehn WNMT at ACL 2018
Continued Training
[Diagram: General Domain NMT Model + 30k In-domain sentence pairs → Continued Training NMT Model]

Regularized Continued Training
[Diagram: General Domain NMT Model + 30k In-domain sentence pairs → Regularized Continued Training NMT Model; the General Domain NMT Model appears a second time in the diagram]

Teacher/Student Models
• Word Level Knowledge distillation
• Often used to make smaller/faster models
• Train one model; use it to 'teach' another

Regularized Continued Training
[Diagram: Student: General Domain NMT Model → Regularized Continued Training NMT Model; Teacher: General Domain NMT Model]

NMT loss function: Cross Entropy(Gold Target, CT Model output)

Teacher/Student Loss Function: Cross Entropy(General Model output (teacher), CT Model output (student))
This work: Combine Both
$$(1 - \alpha) \times \text{CrossEnt}(\text{Gold Target}, \text{CT Model output}) + \alpha \times \text{CrossEnt}(\text{General Model output (teacher)}, \text{CT Model output (student)})$$
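The interpolated objective can be sketched for a single target position as follows. This is an illustrative toy, not the authors' implementation: the three-word distributions, the function names, and the interpolation weight values are assumptions chosen for the example.

```python
import math

def cross_entropy(target_dist, model_dist, eps=1e-12):
    """Cross entropy between a target distribution and the model output."""
    return -sum(t * math.log(q + eps) for t, q in zip(target_dist, model_dist))

def regularized_loss(gold, teacher, student, alpha):
    """(1 - alpha) * CE(gold, student) + alpha * CE(teacher, student)."""
    nmt_term = cross_entropy(gold, student)              # usual NMT loss
    teacher_term = cross_entropy(teacher, student)       # teacher/student loss
    return (1 - alpha) * nmt_term + alpha * teacher_term

# Hypothetical distributions over a three-word vocabulary (assumptions).
gold = [1.0, 0.0, 0.0]      # one-hot gold target
teacher = [0.7, 0.2, 0.1]   # general-domain model output (teacher)
student = [0.6, 0.3, 0.1]   # continued-training model output (student)

for alpha in (0.0, 0.5, 1.0):
    value = regularized_loss(gold, teacher, student, alpha)
    print(f"alpha={alpha}: loss={value:.4f}")
```

Setting the weight to 0 recovers plain continued training, while setting it to 1 trains the student only to match the teacher's output distribution.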
Results

Russian → English Patents
[Bar chart: BLEU (axis 0 to 40) for General Domain, In-Domain, Continued Training, and Continued Training w/Reg; annotated gain: +1.2]

English → German Medical
[Bar chart: BLEU (axis 0 to 40) for General Domain, In-Domain, Continued Training, and Continued Training w/Reg; annotated gain: +1.5]

Analysis

Russian → English General (patents)
[Bar chart: BLEU (axis 0 to 30) for General Domain, Continued Training w/Reg, and Continued Training; annotated differences: -9.2 and -18.2]

NAACL 2019
On the Impact of Various Types of Noise on Neural Machine Translation Huda Khayrallah & Philipp Koehn WNMT at ACL 2018 [Outstanding Contribution Award]
More data is better!
[Line plot: BLEU vs. corpus size (English words) for Statistical MT (SMT), Statistical MT with big LM, and Neural MT (NMT); from Koehn & Knowles 2017]

Let's go get more data!

De → En translation

| Training data | NMT | SMT |
| --- | --- | --- |
| WMT17 | 27.2 | 24.0 |
| + raw paracrawl | 17.3 (-9.9) | 25.2 (+1.2) |
| + filtered paracrawl | 32.4 (+5.2) | 25.8 (+1.8) |

Raw Paracrawl
[Charts: NMT and SMT]

Manual Analysis
[Chart categories: Okay, Misaligned sentences, Other Text, Short Segments, 3rd Language, Untranslated, Both German, Both English]
Noise Types
• Misaligned Sentences
• Misordered words
• Wrong Language
• Untranslated Sentences
• Short Segments
Misaligned Sentences

Aligned corpus:
Die Koalas sind süß | The koalas are cute
Die Kängurus springen | The kangaroos jump
Der Koala ist weich | The koala is soft
Das Känguru ist schnell | The kangaroo is fast

Misaligned corpus:
Die Koalas sind süß | The kangaroos jump
Die Kängurus springen | The koala is soft
Der Koala ist weich | The kangaroo is fast
Das Känguru ist schnell | The koalas are cute

[Charts: NMT and SMT]
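The koala example for misaligned sentences can be reproduced with a small sketch that rotates the target side by one position. Rotation is an assumption chosen because it matches the example on the slide; the talk does not state how the misaligned data was actually generated, and the function name `misalign` is hypothetical.

```python
def misalign(pairs, shift=1):
    """Pair each source sentence with a target sentence `shift` rows away."""
    if not pairs:
        return []
    sources = [src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    shift %= len(targets)
    rotated = targets[shift:] + targets[:shift]  # rotate the target column
    return list(zip(sources, rotated))

corpus = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]

noisy = misalign(corpus)
for src, tgt in noisy:
    print(f"{src} | {tgt}")
```

Both sides stay fluent sentences in the correct language, so this kind of noise cannot be detected by looking at either side alone.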
Misordered Words

Misordered Words (source)
Koalas Die sind süß | The koalas are cute
Kängurus springen Die | The kangaroos jump
ist Der weich Koala | The koala is soft
schnell Känguru ist Das | The kangaroo is fast
[Charts: NMT and SMT]

Misordered Words (target)
Die Koalas sind süß | koalas cute are The
Die Kängurus springen | kangaroos The jump
Der Koala ist weich | is The soft koala
Das Känguru ist schnell | fast The is kangaroo
[Charts: NMT and SMT]
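Word-order noise of this kind can be sketched by shuffling the words on one side of each sentence pair. This is a minimal illustration under assumptions: the function names and the fixed seed are hypothetical, and a uniform random shuffle is only one plausible way to produce the permutations shown on the slides.

```python
import random

def misorder_words(sentence, rng):
    """Return the sentence with its words in a random order."""
    words = sentence.split()
    rng.shuffle(words)  # a shuffle may, by chance, leave a short sentence unchanged
    return " ".join(words)

def add_word_order_noise(pairs, side, seed=0):
    """Shuffle the words of the chosen side ('source' or 'target') of each pair."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    noisy = []
    for src, tgt in pairs:
        if side == "source":
            src = misorder_words(src, rng)
        else:
            tgt = misorder_words(tgt, rng)
        noisy.append((src, tgt))
    return noisy

corpus = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
]

source_noise = add_word_order_noise(corpus, "source")
target_noise = add_word_order_noise(corpus, "target")
for src, tgt in source_noise + target_noise:
    print(f"{src} | {tgt}")
```

The noisy side keeps exactly the same bag of words as the clean sentence; only the order is lost.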
Wrong Language

Wrong Language (French source)
Les koalas sont mignons | The koalas are cute
Les kangourous sautent | The kangaroos jump
Le koala est doux | The koala is soft
Le kangourou est rapide | The kangaroo is fast
[Charts: NMT and SMT]

Wrong Language (French target)
Die Koalas sind süß | Les koalas sont mignons
Die Kängurus springen | Les kangourous sautent
Der Koala ist weich | Le koala est doux
Das Känguru ist schnell | Le kangourou est rapide
[Charts: NMT and SMT]
Untranslated

Untranslated (English source)
The koalas are cute
The kangaroos jump
The koala is soft
The kangaroo is fast
[Charts: NMT and SMT]

Untranslated (German target)
Die Koalas sind süß
Die Kängurus springen
Der Koala ist weich
Das Känguru ist schnell
[Charts: NMT and SMT]
Short Segments
Die | The
süß | cute
Känguru | Kangaroo
schnell | fast
[Charts: ≤ 2 words and 3-5 words]
Filtering methods
• Bicleaner [Espla-Gomis & Forcada 2009]
• Zipporah [Xu & Koehn 2017]
• WMT shared task [Koehn, Khayrallah, Heafield & Forcada 2018]
• Dual Conditional Cross-Entropy Filtering [Junczys-Dowmunt 2018]
• Zipporah [Khayrallah, Xu & Koehn 2018]
Questions?