Knowledge-Rich MT: Chris Dyer, Kevin Gimpel, Waleed Ammar

Knowledge-Rich MT
Chris Dyer, Kevin Gimpel, Waleed Ammar, Noah Smith
November 4, 2011

Outline
• Where are we starting with end-to-end MT?
• Adapting SMT for low-resource scenarios
• What progress have we been making?
• What does Year 2 hold?

Cross-site system comparison

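The SMT baseline
[Diagram: an LM learner builds a language model (LM) from English text; a TM learner builds a translation model (TM) from English-français parallel text; the decoder uses both to turn "S'il vous plaît traduire..." into "Please translate..."]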
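The pipeline in this diagram can be sketched as a toy decoder that picks the output with the best combined TM and LM score. This is a minimal illustration, not the Hiero or Moses decoder: the phrase table, the bigram probabilities, the floor for unseen bigrams, and the function names are all invented for the example, and the search is monotone and exhaustive.

```python
import math
from itertools import product

# Hypothetical toy phrase table: source phrase -> {target phrase: P(target | source)}
PHRASE_TABLE = {
    "s'il vous plaît": {"please": 0.8, "if you please": 0.2},
    "traduire": {"translate": 0.7, "to translate": 0.3},
}

# Hypothetical toy bigram language model: P(word | previous word)
BIGRAM_LM = {
    ("<s>", "please"): 0.5,
    ("please", "translate"): 0.4,
    ("translate", "</s>"): 0.6,
}
UNSEEN = 1e-3  # crude floor for unseen bigrams (an assumption, not real smoothing)


def lm_logprob(words):
    """Log-probability of a word sequence under the toy bigram LM."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(BIGRAM_LM.get(pair, UNSEEN)) for pair in zip(padded, padded[1:]))


def decode(source_phrases):
    """Monotone exhaustive search for the output with the best TM + LM log-score."""
    best_score, best_output = -math.inf, None
    options = [PHRASE_TABLE[phrase].items() for phrase in source_phrases]
    for choice in product(*options):
        tm_score = sum(math.log(prob) for _, prob in choice)
        words = " ".join(target for target, _ in choice).split()
        score = tm_score + lm_logprob(words)
        if score > best_score:
            best_score, best_output = score, " ".join(words)
    return best_output, best_score


if __name__ == "__main__":
    print(decode(["s'il vous plaît", "traduire"]))
```

Real decoders also reorder phrases and prune the search; the sketch only shows how the two learned models are combined at decoding time.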

SMT Baselines
Language pair (system)              BLEU
Kinyarwanda – English (Hiero)        6.8
English – Kinyarwanda (Hiero)        4.7
Malagasy – English (Hiero)          24.3
Malagasy – English (Moses)          24.2
English – Malagasy (Hiero)          25.0
English – Malagasy (Moses)          30.5
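For readers unfamiliar with the metric in this table, here is a minimal sketch of corpus-level BLEU. It assumes one reference per segment, pre-tokenized input, and no smoothing; the function names are illustrative. Standard scoring tools differ in tokenization and reference handling, so this will not reproduce the numbers on the slide.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per segment."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matched[n - 1] += sum((hyp_counts & ref_counts).values())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyp = "the serpent said to the woman".split()
    print(corpus_bleu([hyp], [hyp]))
```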

Let’s make things better.

The problem?
[Diagram: the baseline pipeline again, with an LM learner trained on English text and a TM learner trained on English-français parallel text]

Low-resource!
[Diagram: the same pipeline with Malagasy in place of français; the data is labeled "Small, out of domain"]
Additional knowledge sources shown on the slide:
• Malagasy verbal morphology
• "Partial" language models
• Dependency parses
• Unsupervised model outputs
• Word clusters
  36: dieny, fara, fiompiny, hamoaka, handehanany
  37: adinina, aforeto, ahevao, akaiky, alao,

Year 1 MT Challenge
[Diagram: English text, Malagasy verbal morphology, dependency parses, and word clusters (36: dieny, fara, fiompiny, hamoaka, handehanany; 37: adinina, aforeto, ahevao, akaiky, alao,) feed a translation model that takes "henemana no hana..." as input and produces "something intelligible..."]

Accomplishments
• Better alignments, better translations
• Feature-rich translation
  • 10s of millions of features
  • Diverse knowledge sources
• Phrase dependency translation model
  • Phrase ordering with a dependency model

[Figures: Model 4 vs. CMU]
Similar pattern of improvements, no language-specific features (yet).

Malagasy - English       BLEU
Model 4 - GDA            24.2
Model 4 - GDFA           26.7
CMU - GDFA               26.3
Model 4 + CMU            27.6
(Malagasy - English version 1.0)

What improvements?
the sons of simeon were jemoela , jamin , jakin , and ohada zohara saul , the son of a canaanite woman.
the sons of simeon were jemuel , jamin , ohada , jakin , zohar , and shaul , the son of a canaanite woman.
the sons of simeon : jemuel , jamin , ohad , jakin , zohar , and shaul ( the son of a canaanite woman ).

What improvements?
then the woman said to the serpent , “ no ! you will not die.
now the serpent said to the woman , “ you will not die.
the serpent said to the woman , “ surely you will not die ,

Feature-rich translation
• Discriminative learning on training data
• Learn much sparser features than possible with just a development set
• Update weights to improve translation probability
• Final tuning pass on development set to optimize translation metrics (BLEU, METEOR, etc.)
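The "update weights to improve translation probability" step can be sketched as a stochastic gradient step on the conditional log-likelihood of the reference translation. The slides do not state the exact objective or optimizer, so this is one plausible reading under stated assumptions: a fixed candidate list per source sentence, sparse feature dictionaries, and a plain SGD update. The feature names in the usage example are hypothetical.

```python
import math
from collections import defaultdict


def score(weights, features):
    """Linear model score: dot product of sparse features and weights."""
    return sum(weights[name] * value for name, value in features.items())


def sgd_step(weights, candidates, reference_index, learning_rate=0.1):
    """One gradient step on log P(reference | source) over a candidate list.

    candidates: list of sparse feature dicts, one per candidate translation.
    Returns the probability of the reference before the update.
    """
    scores = [score(weights, feats) for feats in candidates]
    max_score = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    z = sum(exp_scores)
    probs = [e / z for e in exp_scores]

    # Gradient = reference features minus expected features under the model.
    gradient = defaultdict(float)
    for name, value in candidates[reference_index].items():
        gradient[name] += value
    for prob, feats in zip(probs, candidates):
        for name, value in feats.items():
            gradient[name] -= prob * value

    for name, value in gradient.items():
        weights[name] += learning_rate * value
    return probs[reference_index]


if __name__ == "__main__":
    weights = defaultdict(float)
    # Hypothetical features: one dense TM score and one sparse context indicator.
    candidates = [{"tm": -1.0, "ctx:left=the": 1.0}, {"tm": -0.5}]
    for _ in range(3):
        print(sgd_step(weights, candidates, reference_index=0))
```

Because only features that fire on a given example receive updates, the weight vector stays sparse even when the feature space is very large.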

What features?

Contexts give clues to constituents

German - English       BLEU    Features
baseline               25.0    11 / 11
+7-gram                25.0    13 / 13
+Context               25.2    11,194 / 80,006,646
+Context +7-gram       25.4    11,196 / 80,006,648

Phrasal dependency translation model

[Figure: example translations comparing "Phrase-based output" with "Our system"]
Use features from source-side parse

[Chart: % BLEU for Target Syntax Only; Target Syntax + String-to-Tree Rules; + Tree-to-Tree Features]

• Our best results use supervised parsers for both source and target languages
• What about unsupervised parsing?
• We use the dependency model with valence (Klein & Manning, 2004)
• With careful initialization, it gives state-of-the-art results (Gimpel & Smith, 2011):
  • 53.1% attachment accuracy on Penn Treebank
  • 44.4% on Chinese Treebank
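The attachment accuracy figures above measure how often a parser assigns each token its correct head. A minimal sketch of the computation follows, assuming each sentence is given as a list of head indices with 0 for the root. Published evaluations differ in details such as punctuation handling and sentence-length limits, none of which is modeled here, and the function name is illustrative.

```python
def attachment_accuracy(predicted_heads, gold_heads):
    """Fraction of tokens whose predicted head index matches the gold head.

    Each argument is a list of sentences; each sentence is a list of head
    indices, one per token (0 = root, 1-based token positions otherwise).
    """
    correct = total = 0
    for pred, gold in zip(predicted_heads, gold_heads):
        if len(pred) != len(gold):
            raise ValueError("predicted and gold sentences differ in length")
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total if total else 0.0


if __name__ == "__main__":
    predicted = [[2, 0, 2], [0, 1]]
    gold = [[2, 0, 2], [2, 0]]
    print(attachment_accuracy(predicted, gold))
```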

[Chart: % BLEU]

Year 2: “Into other languages”
• Target morphological complexity
• Generate novel word forms
• Leverage morphological resources and machine learning
• Need better language models, not just translation models

Year 2 Challenges
• Generating new word forms means a much larger search space than is usual in MT
• Inference is expensive
• Use “high-recall” linguistic tools to constrain search
• Statistics do the rest
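The last two bullets describe a two-stage idea: over-generate candidate word forms with a high-recall tool, then let statistics choose among them. The sketch below illustrates that division of labor only. The affixes belong to an invented toy language, not Malagasy or Kinyarwanda, the "generator" is a plain cross product rather than a real morphological tool, and the scorer is a crudely smoothed unigram model over a toy monolingual corpus.

```python
import math
from collections import Counter

# Invented affixes for a toy language; a real system would call a morphological generator.
PREFIXES = ["", "a", "o"]
SUFFIXES = ["", "ka", "ta"]


def generate_candidates(stem):
    """High-recall step: over-generate every prefix + stem + suffix combination."""
    return sorted({prefix + stem + suffix for prefix in PREFIXES for suffix in SUFFIXES})


def rank_forms(stem, corpus_tokens, alpha=0.1):
    """Statistical step: rank candidates by add-alpha smoothed unigram log-probability."""
    counts = Counter(corpus_tokens)
    candidates = generate_candidates(stem)
    denominator = len(corpus_tokens) + alpha * len(candidates)
    scored = [(math.log((counts[form] + alpha) / denominator), form) for form in candidates]
    # Highest score first; ties are broken alphabetically so the order is deterministic.
    return [form for _, form in sorted(scored, key=lambda item: (-item[0], item[1]))]


if __name__ == "__main__":
    monolingual = "alomka olom alomka lomta".split()
    print(rank_forms("lom", monolingual))
```

The generator bounds the search space, which is why it needs high recall rather than high precision: a form it never proposes cannot be recovered by the scorer.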

Year 2
• Data requirements
  • Large non-English monolingual corpora
  • Test sets for focus languages