Machine Translation with Diverse Data Sources Huda Khayrallah

This talk was presented at the JHU CLSP seminar on March 29, 2019 and at the UPenn Computational Linguistics Seminar on April 8, 2019.

It is based on the following papers:
  • https://aclweb.org/anthology/W18-2705 (bibtex: https://aclweb.org/anthology/W18-2705)
  • https://aclweb.org/anthology/W18-2709 (bibtex: https://aclweb.org/anthology/W18-2709.bib)

Work with: Brian Thompson, Kevin Duh & Philipp Koehn

Overview
  • Review of Neural Machine Translation (NMT)
  • Review of Domain Adaptation
  • Improving Domain Adaptation
      • Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation [Khayrallah, Thompson, Duh & Koehn 2018]
  • Analysis of Noisy Corpora
      • On the Impact of Various Types of Noise on Neural Machine Translation [Khayrallah & Koehn 2018]

Neural Machine Translation

Example: Wasch dir die Hände → Wash your hands

The model diagram is built up one component at a time, from input to output:
  • Source sentence: Wasch dir die Hände
  • Source Embedding
  • Encoder
  • Decoder
  • Softmax, which produces the first output word: Wash
  • Target Embedding of the word just produced (Wash), used when producing the next word
  • Remaining output: your hands

NMT loss function

$\mathrm{CrossEntropy}(\text{Gold Target},\ \text{Model output})$
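
As a rough illustration of the loss on this slide, the sketch below computes cross entropy between a gold target and a model's softmax output. The toy vocabulary, the probabilities, and the choice to average over output positions are assumptions made for the example, not details taken from the talk.

```python
import math

def cross_entropy(gold_ids, model_probs):
    """Average per-token cross entropy between gold targets and model output.

    gold_ids:    gold target token ids, one per output position
    model_probs: one probability distribution per output position
                 (the softmax output), each a list summing to 1
    """
    total = 0.0
    for gold, probs in zip(gold_ids, model_probs):
        # The gold target is one-hot, so only the gold token's term survives.
        total += -math.log(probs[gold])
    return total / len(gold_ids)

# Hypothetical toy vocabulary: 0 = "Wash", 1 = "your", 2 = "hands"
gold = [0, 1, 2]
probs = [
    [0.7, 0.2, 0.1],  # model is fairly sure of "Wash"
    [0.1, 0.8, 0.1],  # ... of "your"
    [0.2, 0.2, 0.6],  # ... of "hands"
]
loss = cross_entropy(gold, probs)
print(f"cross entropy: {loss:.4f}")
```

The loss shrinks toward zero as the model puts more probability on each gold word.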

What do we want to translate?

Developmental toxicity, including dose-dependent delayed foetal ossification and possible teratogenic effects, were observed in rats at doses resulting in subtherapeutic exposures (based on AUC) and in rabbits at doses resulting in exposures 3 and 11 times the mean steady-state AUC at the maximum recommended clinical dose.

The films coated therewith, in particular polycarbonate films coated therewith, have improved properties with regard to scratch resistance, solvent resistance, and reduced oiling effect, said films thus being especially suitable for use in producing plastic parts in film insert molding methods.

General Domain Data

Would it not be beneficial, in the short term, following the Rotterdam model, to inspect according to a points system in which, for example, account is taken of the ship's age, whether it is single or double-hulled or whether it sails under a flag of convenience.

Mama always said there's an awful lot you can tell about a person by their shoes.

Domain Mismatch

Translating Russian Patents

General Domain NMT Model: 50M general-domain sentence pairs → General Domain NMT Model

General Domain NMT Model: errors due to domain mismatch
  Source: дверной замок повышенной степени защищенности от взлома
  Human: door lock with increased degree of security against burglary
  System: door security door

In-Domain NMT: 30k in-domain sentence pairs → In-Domain NMT Model

In-Domain NMT Model: errors due to lack of data
  Source: дверной замок повышенной степени защищенности от взлома
  Human: door lock with increased degree of security against burglary
  System: door lock for a high degree of protection against coke

Domain Adaptation

Continued Training: 50M general-domain sentence pairs → General Domain NMT Model; then 30k in-domain sentence pairs → Continued Training NMT Model

Continued Training: improved performance!
  Source: дверной замок повышенной степени защищенности от взлома
  Human: door lock with increased degree of security against burglary
  System: door lock with increased penetration protection

[Bar chart: Russian → English Patents. BLEU (axis 0 to 40) for General Domain, In-Domain, and Continued Training; annotation: +9.3]

Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation
Huda Khayrallah, Brian Thompson, Kevin Duh & Philipp Koehn
WNMT at ACL 2018

Continued Training: General Domain NMT Model + 30k in-domain sentence pairs → Continued Training NMT Model

Regularized Continued Training: General Domain NMT Model + 30k in-domain sentence pairs → Regularized Continued Training NMT Model, with the General Domain NMT Model also feeding into training

Teacher/Student Models
  • Word-level knowledge distillation
  • Often used to make smaller/faster models
  • Train one model; use it to ‘teach’ another

Regularized Continued Training
  • Teacher: General Domain NMT Model
  • Student: Regularized Continued Training NMT Model, started from the General Domain NMT Model

NMT loss function

$\mathrm{CrossEntropy}(\text{Gold Target},\ \text{CT Model output})$

Teacher/Student Loss Function

$\mathrm{CrossEntropy}(\text{General Model output (teacher)},\ \text{CT Model output (student)})$

This work: Combine Both

$(1 - \alpha) \times \mathrm{CrossEntropy}(\text{Gold Target},\ \text{CT Model output}) + \alpha \times \mathrm{CrossEntropy}(\text{General Model output (teacher)},\ \text{CT Model output (student)})$
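
The combined objective can be sketched for a single output position as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy distributions, and the name `alpha` for the interpolation weight are assumptions made for the example.

```python
import math

def cross_entropy(target_dist, student_dist):
    """Cross entropy H(target, student) for one output position."""
    return -sum(
        t * math.log(s)
        for t, s in zip(target_dist, student_dist)
        if t > 0.0  # zero-probability target entries contribute nothing
    )

def regularized_loss(gold_id, teacher_dist, student_dist, alpha):
    """(1 - alpha) * CE(gold, student) + alpha * CE(teacher, student)."""
    one_hot = [1.0 if i == gold_id else 0.0 for i in range(len(student_dist))]
    nmt_loss = cross_entropy(one_hot, student_dist)
    teacher_student_loss = cross_entropy(teacher_dist, student_dist)
    return (1.0 - alpha) * nmt_loss + alpha * teacher_student_loss

# Hypothetical distributions over a 3-word vocabulary.
teacher = [0.5, 0.3, 0.2]   # General Model output (teacher)
student = [0.6, 0.3, 0.1]   # CT Model output (student)

for alpha in (0.0, 0.1, 1.0):
    value = regularized_loss(0, teacher, student, alpha)
    print(f"alpha={alpha}: {value:.4f}")
```

With `alpha = 0` this reduces to the ordinary NMT loss, and with `alpha = 1` only the teacher/student term remains.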

Results

[Bar chart: Russian → English Patents. BLEU (axis 0 to 40) for General Domain, In-Domain, Continued Training, and Continued Training w/Reg; annotation: +1.2]

[Bar chart: English → German Medical. BLEU (axis 0 to 40) for General Domain, In-Domain, Continued Training, and Continued Training w/Reg; annotation: +1.5]

Analysis

[Bar chart: Russian → English General (patents). BLEU (axis 0 to 30) for General Domain, Continued Training w/Reg, and Continued Training; annotations: -9.2 and -18.2]

NAACL 2019

On the Impact of Various Types of Noise on Neural Machine Translation
Huda Khayrallah & Philipp Koehn
WNMT at ACL 2018 [Outstanding Contribution Award]

More data is better!

[Plot from Koehn & Knowles 2017: BLEU against corpus size (English words) for Statistical MT (SMT), Statistical MT with big LM, and Neural MT (NMT)]

Let’s go get more data!

De → En translation (BLEU)

                          NMT             SMT
  WMT 17                  27.2            24.0
  + raw paracrawl         17.3 (-9.9)     25.2 (+1.2)
  + filtered paracrawl    32.4 (+5.2)     25.8 (+1.8)

[Charts: Raw Paracrawl, NMT and SMT]

[Chart: Manual Analysis. Categories: Okay, Misaligned sentences, Other Text, Short Segments, 3rd Language, Untranslated, Both German, Both English]

Noise Types
  • Misaligned Sentences
  • Misordered Words
  • Wrong Language
  • Untranslated Sentences
  • Short Segments

Misaligned Sentences

Clean parallel data:
  Die Koalas sind süß      | The koalas are cute
  Die Kängurus springen    | The kangaroos jump
  Der Koala ist weich      | The koala is soft
  Das Känguru ist schnell  | The kangaroo is fast

With misaligned sentences:
  Die Koalas sind süß      | The kangaroos jump
  Die Kängurus springen    | The koala is soft
  Der Koala ist weich      | The kangaroo is fast
  Das Känguru ist schnell  | The koalas are cute

[Charts: Misaligned Sentences, NMT and SMT]
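
One simple way to produce this kind of noise synthetically is to shift the target side against the source side. The sketch below does that; the function name `misalign` and the `shift` parameter are hypothetical, and shifting by one position is just one option that happens to reproduce the slide's example (a random shuffle of the target side would be another).

```python
def misalign(pairs, shift=1):
    """Pair each source sentence with the target of a different sentence.

    pairs: list of (source, target) sentence pairs
    shift: how far to rotate the target side (assumed parameter)
    """
    if not pairs:
        return []
    sources = [src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    shift %= len(targets)
    rotated = targets[shift:] + targets[:shift]
    return list(zip(sources, rotated))

clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]
noisy = misalign(clean)
for src, tgt in noisy:
    print(f"{src} | {tgt}")
```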

Misordered Words

Misordered Words (source)
  Koalas Die sind süß      | The koalas are cute
  Kängurus springen Die    | The kangaroos jump
  ist Der weich Koala      | The koala is soft
  schnell Känguru ist Das  | The kangaroo is fast

[Charts: Misordered Words (source), NMT and SMT]

Misordered Words (target)
  Die Koalas sind süß      | koalas cute are The
  Die Kängurus springen    | kangaroos The jump
  Der Koala ist weich      | is The soft koala
  Das Känguru ist schnell  | fast The is kangaroo

[Charts: Misordered Words (target), NMT and SMT]
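
Misordered-word noise like the examples above can be approximated by shuffling the words of each sentence on one side of the corpus. The following is a sketch under that assumption; the function names, the `side` argument, and the fixed seed are choices made for the example rather than details from the talk.

```python
import random

def misorder_words(sentence, rng):
    """Return the sentence with its words in a random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

def add_misordered_noise(pairs, side, seed=0):
    """Shuffle word order on the chosen side ("source" or "target")."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    rng = random.Random(seed)  # seeded so the noise is reproducible
    noisy = []
    for src, tgt in pairs:
        if side == "source":
            noisy.append((misorder_words(src, rng), tgt))
        else:
            noisy.append((src, misorder_words(tgt, rng)))
    return noisy

clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
]
for src, tgt in add_misordered_noise(clean, side="target"):
    print(f"{src} | {tgt}")
```

A shuffle can occasionally return the original order, especially for very short sentences, so a real pipeline might re-draw in that case.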

Wrong Language

Wrong Language (French source)
  Les koalas sont mignons   | The koalas are cute
  Les kangourous sautent    | The kangaroos jump
  Le koala est doux         | The koala is soft
  Le kangourou est rapide   | The kangaroo is fast

[Charts: Wrong Language (French source), NMT and SMT]

Wrong Language (French target)
  Die Koalas sind süß      | Les koalas sont mignons
  Die Kängurus springen    | Les kangourous sautent
  Der Koala ist weich      | Le koala est doux
  Das Känguru ist schnell  | Le kangourou est rapide

[Charts: Wrong Language (French target), NMT and SMT]

Untranslated

Untranslated (English source): both sides are English
  The koalas are cute
  The kangaroos jump
  The koala is soft
  The kangaroo is fast

[Charts: Untranslated (English source), NMT and SMT]

Untranslated (German target): both sides are German
  Die Koalas sind süß
  Die Kängurus springen
  Der Koala ist weich
  Das Känguru ist schnell

[Charts: Untranslated (German target), NMT and SMT]
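
Untranslated noise can be mimicked by copying one side of each sentence pair onto the other. This sketch assumes that reading of the slides; the function name `make_untranslated` and its `keep` argument are hypothetical.

```python
def make_untranslated(pairs, keep):
    """Copy one side of each pair onto the other side.

    keep="target": the English target replaces the source
                   (the "English source" case, both sides English)
    keep="source": the German source replaces the target
                   (the "German target" case, both sides German)
    """
    if keep == "target":
        return [(tgt, tgt) for _, tgt in pairs]
    if keep == "source":
        return [(src, src) for src, _ in pairs]
    raise ValueError("keep must be 'source' or 'target'")

clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
]
both_english = make_untranslated(clean, keep="target")
both_german = make_untranslated(clean, keep="source")
print(both_english)
print(both_german)
```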

Short Segments

  Die      | The
  süß      | cute
  Känguru  | Kangaroo
  schnell  | fast

[Charts: Short Segments, ≤ 2 words and 3-5 words]
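
The two length ranges on this slide can be expressed as a small bucketing helper. The sketch assumes whitespace tokenization and measures the source side only; both choices, along with the bucket labels, are assumptions made for the example.

```python
def length_bucket(sentence):
    """Assign a sentence to one of the slide's length ranges."""
    n_words = len(sentence.split())
    if n_words <= 2:
        return "<=2 words"
    if n_words <= 5:
        return "3-5 words"
    return "longer"

pairs = [
    ("Die", "The"),
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Das Känguru ist heute sehr schnell", "The kangaroo is very fast today"),
]
buckets = [length_bucket(src) for src, _ in pairs]
print(buckets)
```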

Filtering methods
  • BiCleaner [Espla-Gomis & Forcada 2009]
  • Zipporah [Xu & Koehn 2017]
  • WMT shared task [Koehn, Khayrallah, Heafield & Forcada 2018]
  • Dual Conditional Cross-Entropy Filtering [Junczys-Dowmunt 2018]
  • Zipporah [Khayrallah, Xu & Koehn 2018]

Questions?