Machine Translation with Diverse Data Sources Huda Khayrallah
This talk was presented at the JHU CLSP seminar on March 29, 2019 and at the UPenn Computational Linguistics Seminar on April 8, 2019. It is based on the following papers:
https://aclweb.org/anthology/W18-2705 (bibtex: https://aclweb.org/anthology/W18-2705)
https://aclweb.org/anthology/W18-2709 (bibtex: https://aclweb.org/anthology/W18-2709.bib)
Work with: Brian Thompson, Kevin Duh & Philipp Koehn
Overview
• Review of Neural Machine Translation (NMT)
• Review of Domain Adaptation
• Improving Domain Adaptation
  • Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation [Khayrallah, Thompson, Duh & Koehn 2018]
• Analysis of Noisy Corpora
  • On the Impact of Various Types of Noise on Neural Machine Translation [Khayrallah & Koehn 2018]
Neural Machine Translation
[Figure, built up over several slides: the source sentence "Wasch dir die Hände" passes through Source Embedding → Encoder → Decoder → Softmax to produce "Wash"; each output word is fed back through the Target Embedding, and decoding continues with "your hands".]
NMT loss function: Cross Entropy(Gold Target, Model output)
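The loss on this slide can be sketched for a single output position as follows. This is a minimal illustration, not the talk's implementation: the four-word vocabulary and the probabilities are made up, and a real system sums this quantity over every target position and every sentence in a batch.

```python
import math

def cross_entropy(target_dist, model_dist, eps=1e-12):
    """H(target, model) = -sum_w target(w) * log model(w)."""
    return -sum(t * math.log(max(m, eps)) for t, m in zip(target_dist, model_dist))

# Hypothetical vocabulary for illustration: ["Wash", "your", "hands", "</s>"]
gold = [1.0, 0.0, 0.0, 0.0]          # one-hot gold target: "Wash"
model_output = [0.7, 0.1, 0.1, 0.1]  # made-up softmax output of the model

# With a one-hot gold target, the loss reduces to -log p_model(gold word).
loss = cross_entropy(gold, model_output)
```

Because the gold target is one-hot, only the probability the model assigns to the correct word contributes to the loss.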
What do we want to translate?
Developmental toxicity, including dose-dependent delayed foetal ossification and possible teratogenic effects, were observed in rats at doses resulting in subtherapeutic exposures (based on AUC) and in rabbits at doses resulting in exposures 3 and 11 times the mean steady-state AUC at the maximum recommended clinical dose.
The films coated therewith, in particular polycarbonate films coated therewith, have improved properties with regard to scratch resistance, solvent resistance, and reduced oiling effect, said films thus being especially suitable for use in producing plastic parts in film insert molding methods.
General Domain Data
Would it not be beneficial, in the short term, following the Rotterdam model, to inspect according to a points system in which, for example, account is taken of the ship's age, whether it is single or double-hulled or whether it sails under a flag of convenience.
Mama always said there's an awful lot you can tell about a person by their shoes.
Domain Mismatch
Translating Russian Patents
General Domain NMT Model: 50M general-domain sentence pairs
General Domain NMT Model: errors due to domain mismatch
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door security door
In-Domain NMT Model: 30k in-domain sentence pairs
In-Domain NMT Model: errors due to lack of data
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door lock for a high degree of protection against coke
Domain Adaptation
Continued Training: a General Domain NMT Model is trained on 50M general-domain sentence pairs; training then continues on 30k in-domain sentence pairs, giving the Continued Training NMT Model.
Continued Training: improved performance!
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door lock with increased penetration protection
[Figure: bar chart of BLEU (axis 0 to 40), Russian → English patents, comparing General Domain, In-Domain, and Continued Training; annotated +9.3.]
Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation Huda Khayrallah, Brian Thompson, Kevin Duh & Philipp Koehn WNMT at ACL 2018
Continued Training: General Domain NMT Model + 30k in-domain sentence pairs → Continued Training NMT Model
Regularized Continued Training: General Domain NMT Model + 30k in-domain sentence pairs → Regularized Continued Training NMT Model (the General Domain NMT Model appears a second time in the diagram, as an extra input to training)
Teacher/Student Models
• Word-level knowledge distillation
• Often used to make smaller/faster models
• Train one model; use it to 'teach' another
Regularized Continued Training: the General Domain NMT Model is the teacher; the Regularized Continued Training NMT Model is the student.
NMT loss function: Cross Entropy(Gold Target, CT Model output)
Teacher/Student loss function: Cross Entropy(General Model output (teacher), CT Model output (student))
This work: combine both
$(1 - \alpha) \times \mathrm{CrossEnt}(\text{Gold Target}, \text{CT Model output}) + \alpha \times \mathrm{CrossEnt}(\text{General Model output}, \text{CT Model output})$
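The combined objective can be sketched for one output position as an interpolation of the two cross-entropy terms. This is an illustrative sketch under stated assumptions: the interpolation weight is called `alpha` here (a name chosen for this example), the distributions are made up, and the function names are hypothetical rather than taken from the authors' code.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_w p(w) * log q(w)."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

def regularized_loss(gold, teacher_out, student_out, alpha):
    """(1 - alpha) * CE(gold, student) + alpha * CE(teacher, student)."""
    nmt_term = cross_entropy(gold, student_out)             # fit the in-domain data
    teacher_term = cross_entropy(teacher_out, student_out)  # stay close to the general model
    return (1.0 - alpha) * nmt_term + alpha * teacher_term

# Made-up distributions over a hypothetical 3-word vocabulary.
gold = [1.0, 0.0, 0.0]         # one-hot in-domain reference word
teacher_out = [0.5, 0.4, 0.1]  # general-domain model output (teacher)
student_out = [0.6, 0.3, 0.1]  # continued-training model output (student)

loss = regularized_loss(gold, teacher_out, student_out, alpha=0.1)
```

Setting `alpha` to 0 recovers the plain NMT loss, and setting it to 1 trains the student only to match the teacher; values in between trade off the two terms.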
Results
[Figure: bar chart of BLEU (axis 0 to 40), Russian → English patents, comparing General Domain, In-Domain, Continued Training, and Continued Training w/Reg; annotated +1.2.]
[Figure: bar chart of BLEU (axis 0 to 40), English → German medical, comparing General Domain, In-Domain, Continued Training, and Continued Training w/Reg; annotated +1.5.]
Analysis
[Figure: bar chart of BLEU (axis 0 to 30), Russian → English General (patents), comparing General Domain, Continued Training w/Reg, and Continued Training; annotated -9.2 and -18.2.]
NAACL 2019
On the Impact of Various Types of Noise on Neural Machine Translation Huda Khayrallah & Philipp Koehn WNMT at ACL 2018 [Outstanding Contribution Award]
[Figure from Koehn & Knowles 2017: BLEU against corpus size (English words) for Statistical MT with big LM, Statistical MT (SMT), and Neural MT (NMT).] More data is better!
Let's go get more data!
De→En translation
WMT17: NMT 27.2, SMT 24.0
+ raw Paracrawl: NMT 17.3 (-9.9), SMT 25.2 (+1.2)
+ filtered Paracrawl: NMT 32.4 (+5.2), SMT 25.8 (+1.8)
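The parenthesized differences on this slide are each system's change relative to its WMT17-only score. A short sketch of that arithmetic, using only the numbers shown on the slide (the dictionary layout and function name are choices made for this example):

```python
# Scores as shown on the slide.
scores = {
    "WMT17": {"NMT": 27.2, "SMT": 24.0},
    "+ raw Paracrawl": {"NMT": 17.3, "SMT": 25.2},
    "+ filtered Paracrawl": {"NMT": 32.4, "SMT": 25.8},
}

def deltas_vs_baseline(scores, baseline="WMT17"):
    """Difference of each condition from the baseline row, per system."""
    base = scores[baseline]
    return {
        name: {system: round(value - base[system], 1) for system, value in row.items()}
        for name, row in scores.items()
        if name != baseline
    }

deltas = deltas_vs_baseline(scores)
for name, row in deltas.items():
    print(name, row)
```

Adding raw Paracrawl moves the two systems in opposite directions, while the filtered version helps both.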
[Figure: Raw Paracrawl, NMT and SMT panels.]
Manual Analysis
[Figure: chart with the categories Okay, Misaligned sentences, Other Text, Short Segments, 3rd Language, Untranslated, Both German, Both English.]
Noise Types
• Misaligned Sentences
• Misordered Words
• Wrong Language
• Untranslated Sentences
• Short Segments
Misaligned Sentences
Original:
Die Koalas sind süß | The koalas are cute
Die Kängurus springen | The kangaroos jump
Der Koala ist weich | The koala is soft
Das Känguru ist schnell | The kangaroo is fast
Misaligned:
Die Koalas sind süß | The kangaroos jump
Die Kängurus springen | The koala is soft
Der Koala ist weich | The kangaroo is fast
Das Känguru ist schnell | The koalas are cute
[Figure: Misaligned Sentences, NMT and SMT panels.]
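One simple way to produce the misalignment shown in the example is to rotate the target side by one sentence while leaving the source side in place. The sketch below does exactly that; it is an assumption for illustration and may differ from how the synthetic noise in the paper was generated.

```python
def misalign(pairs, shift=1):
    """Rotate the target side by `shift` so each source gets the wrong target."""
    if not pairs:
        return []
    sources = [src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    k = shift % len(targets)
    rotated = targets[k:] + targets[:k]
    return list(zip(sources, rotated))

pairs = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]

for src, tgt in misalign(pairs):
    print(src, "|", tgt)
```

Each sentence is still fluent on its own side; only the pairing is wrong, which is what makes this noise hard to spot from either side alone.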
Misordered Words
Misordered Words (source)
Koalas Die sind süß | The koalas are cute
Kängurus springen Die | The kangaroos jump
ist Der weich Koala | The koala is soft
schnell Känguru ist Das | The kangaroo is fast
[Figure: Misordered Words (source), NMT and SMT panels.]
Misordered Words (target)
Die Koalas sind süß | koalas cute are The
Die Kängurus springen | kangaroos The jump
Der Koala ist weich | is The soft koala
Das Känguru ist schnell | fast The is kangaroo
[Figure: Misordered Words (target), NMT and SMT panels.]
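Misordered-word noise of this kind can be sketched by shuffling the words within each sentence on one side of the corpus. The function names, the `side` parameter, and the fixed seed are choices made for this example, not details from the talk.

```python
import random

def shuffle_words(sentence, rng):
    """Return the sentence with its whitespace-separated words in random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

def misorder(pairs, side, seed=0):
    """Shuffle words on the chosen side ("source" or "target") of every pair."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    rng = random.Random(seed)  # seeded so the noise is reproducible
    noisy = []
    for src, tgt in pairs:
        if side == "source":
            noisy.append((shuffle_words(src, rng), tgt))
        else:
            noisy.append((src, shuffle_words(tgt, rng)))
    return noisy

pairs = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
]

source_noise = misorder(pairs, "source")
target_noise = misorder(pairs, "target")
```

The bag of words on the noisy side is unchanged, so word-level correspondences survive even though the sentence is no longer fluent.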
Wrong Language
Wrong Language (French source)
Les koalas sont mignons | The koalas are cute
Les kangourous sautent | The kangaroos jump
Le koala est doux | The koala is soft
Le kangourou est rapide | The kangaroo is fast
[Figure: Wrong Language (French source), NMT and SMT panels.]
Wrong Language (French target)
Die Koalas sind süß | Les koalas sont mignons
Die Kängurus springen | Les kangourous sautent
Der Koala ist weich | Le koala est doux
Das Känguru ist schnell | Le kangourou est rapide
[Figure: Wrong Language (French target), NMT and SMT panels.]
Untranslated
Untranslated (English source)
The koalas are cute
The kangaroos jump
The koala is soft
The kangaroo is fast
[Figure: Untranslated (English source), NMT and SMT panels.]
Untranslated (German target)
Die Koalas sind süß
Die Kängurus springen
Der Koala ist weich
Das Känguru ist schnell
[Figure: Untranslated (German target), NMT and SMT panels.]
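A sketch of untranslated noise, under the assumption that an untranslated pair is one where the same sentence appears on both sides: English copied onto the source side for "English source", German copied onto the target side for "German target". The slides list only one language per case, so this reading, along with the function names, is an assumption of this example.

```python
def make_untranslated(pairs, side):
    """Copy one side over the other so both sides hold the same sentence.

    side="source": English on both sides (English source).
    side="target": German on both sides (German target).
    """
    if side == "source":
        return [(tgt, tgt) for _, tgt in pairs]
    if side == "target":
        return [(src, src) for src, _ in pairs]
    raise ValueError("side must be 'source' or 'target'")

def is_copy(src, tgt):
    """Flag a pair whose two sides are identical up to case and outer whitespace."""
    return src.strip().lower() == tgt.strip().lower()

pairs = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
]

english_source = make_untranslated(pairs, "source")
german_target = make_untranslated(pairs, "target")
flagged = [pair for pair in english_source + pairs if is_copy(*pair)]
```

The `is_copy` check is the simplest possible detector; it only catches exact copies and would miss partially translated segments.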
Short Segments
Die | The
süß | cute
Känguru | Kangaroo
schnell | fast
[Figure: Short Segments, ≤ 2 words and 3-5 words panels.]
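The two length bands on this slide can be sketched as a bucketing function over sentence pairs. Counting whitespace-separated words and taking the longer of the two sides are assumptions of this example; the slide does not say which side is measured.

```python
def length_bucket(src, tgt):
    """Bucket a pair by the word count of its longer side."""
    n = max(len(src.split()), len(tgt.split()))
    if n <= 2:
        return "<=2"
    if n <= 5:
        return "3-5"
    return "longer"

pairs = [
    ("Die", "The"),
    ("süß", "cute"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Mama sagte immer viel über Schuhe und Menschen", "Mama always said a lot about shoes and people"),
]

buckets = [length_bucket(src, tgt) for src, tgt in pairs]
```

The last pair is a made-up longer sentence, included only to show the third bucket.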
Filtering methods
• BiCleaner [Espla-Gomis & Forcada 2009]
• Zipporah [Xu & Koehn 2017]
• WMT shared task [Koehn, Khayrallah, Heafield & Forcada 2018]
• Dual Conditional Cross-Entropy Filtering [Junczys-Dowmunt 2018]
• Zipporah [Khayrallah, Xu & Koehn 2018]
Questions?