MT for Languages with Limited Resources (11-731)


MT for Languages with Limited Resources
11-731 Machine Translation, April 20, 2011
Based on joint work with: Lori Levin, Jaime Carbonell, Stephan Vogel, Shuly Wintner, Danny Shacham, Katharina Probst, Erik Peterson, Christian Monson, Roberto Aranovich and Ariadna Font-Llitjos

Why Machine Translation for Minority and Indigenous Languages?
• Commercial MT is economically feasible for only a handful of major languages with large resources (corpora, human developers)
• Is there hope for MT for languages with very limited resources?
• Benefits include:
– Better government access to indigenous communities (epidemics, crop failures, etc.)
– Better participation by indigenous communities in information-rich activities (health care, education, government) without giving up their languages
– Language preservation
– Civilian and military applications (disaster relief)

MT for Minority and Indigenous Languages: Challenges
• Minimal amount of parallel text
• Possibly competing standards for orthography/spelling
• Often relatively few trained linguists
• Access to native informants possible
• Need to minimize development time and cost

MT for Low Resource Languages
• Possible Approaches:
– Phrase-based SMT, with whatever small amount of parallel data is available
– Build a rule-based system: requires bilingual experts and resources
– Hybrid approaches, such as the AVENUE Project (Stat-XFER) approach:
• Incorporate acquired manual resources within a general statistical framework
• Augment with targeted elicitation and resource acquisition from bilingual non-experts

CMU Statistical Transfer (Stat-XFER) MT Approach
• Integrate the major strengths of rule-based and statistical MT within a common framework:
– Linguistically rich formalism that can express complex and abstract compositional transfer rules
– Rules can be written by human experts and also acquired automatically from data
– Easy integration of morphological analyzers and generators
– Word and syntactic-phrase correspondences can be automatically acquired from parallel text
– Search-based decoding from statistical MT adapted to find the best translation within the search space: multi-feature scoring, beam search, parameter optimization, etc.
– Framework suitable for both resource-rich and resource-poor language scenarios

Stat-XFER Main Principles
• Framework: Statistical search-based approach with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts
• Automatic word- and phrase-translation lexicon acquisition from parallel data
• Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
• Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
• Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
• XFER + Decoder:
– XFER engine produces a lattice of possible transferred structures at all levels
– Decoder searches and selects the best-scoring combination

Stat-XFER Framework (pipeline):
Source Input → Preprocessing (Morphology) → Transfer Engine (Transfer Rules, Bilingual Lexicon) → Translation Lattice → Second-Stage Decoder (Language Model, Weighted Model Features) → Target Output

Hebrew Input: בשורה הבאה
(Pipeline on the slide: Preprocessing → Morphology → Transfer Engine → Decoder, with an English Language Model feeding the decoder)

Transfer Rules:
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Translation Lexicon:
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1)
 ((X0 NUM) = s)
 ((Y0 lex) = "BULL"))

N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1)
 ((X0 NUM) = s)
 ((Y0 lex) = "LINE"))

Translation Output Lattice:
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)

English Output: in the next line

Transfer Rule Formalism
; SL: the old man, TL: ha-ish ha-zaqen

Type information and part-of-speech/constituent information:
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
; Alignments:
 (X1::Y1)
 (X1::Y3)
 (X2::Y4)
 (X3::Y2)
; x-side constraints:
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
; y-side constraints:
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER))
; xy-constraints, e.g.:
 ((Y1 AGR) = (X1 AGR))
)

Transfer Rule Formalism (II)
; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
 (X1::Y1)
 (X1::Y3)
 (X2::Y4)
 (X3::Y2)
; Value constraints:
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
; Agreement constraints:
 ((Y2 GENDER) = (Y4 GENDER))
)
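The two constraint types can be made concrete with a small sketch. This is illustrative Python only, not the actual Stat-XFER engine; the dictionary-based feature structures and function names are assumptions for exposition.

```python
# Illustrative check of the two constraint types from the rule above
# against dictionary-based feature structures (an assumed representation).

def check_value(fs, index, feature, value):
    # Value constraint, e.g. ((X1 AGR) = *3-SING)
    return fs[index].get(feature) == value

def check_agreement(fs, i, feat_i, j, feat_j):
    # Agreement constraint, e.g. ((Y2 GENDER) = (Y4 GENDER))
    v = fs[i].get(feat_i)
    return v is not None and v == fs[j].get(feat_j)

# x-side feature structures for "the old man"
x = {1: {"AGR": "3-SING", "DEF": "*DEF"},   # the
     2: {},                                  # old
     3: {"AGR": "3-SING", "COUNT": "+"}}     # man

ok = (check_value(x, 1, "AGR", "3-SING")
      and check_value(x, 3, "COUNT", "+")
      and check_agreement(x, 1, "AGR", 3, "AGR"))
print(ok)  # the rule's x-side constraints hold for this phrase
```

A real unification engine would also propagate features upward (e.g. `(X0 = X1)` in the grammar rules later in the deck) rather than only testing them.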

Translation Lexicon: Hebrew-to-English Examples (Semi-manually-developed)

PRO::PRO |: ["ANI"] -> ["I"]
( (X1::Y1)
  ((X0 per) = 1)
  ((X0 num) = s)
  ((X0 case) = nom) )

PRO::PRO |: ["ATH"] -> ["you"]
( (X1::Y1)
  ((X0 per) = 2)
  ((X0 num) = s)
  ((X0 gen) = m)
  ((X0 case) = nom) )

N::N |: ["$&H"] -> ["HOUR"]
( (X1::Y1)
  ((X0 NUM) = s)
  ((Y0 lex) = "HOUR") )

N::N |: ["$&H"] -> ["hours"]
( (X1::Y1)
  ((Y0 NUM) = p)
  ((X0 NUM) = p)
  ((Y0 lex) = "HOUR") )

Translation Lexicon: French-to-English Examples (Automatically-acquired)

DET::DET |: ["le"] -> ["the"]
( (X1::Y1) )

Prep::Prep |: ["dans"] -> ["in"]
( (X1::Y1) )

N::N |: ["principes"] -> ["principles"]
( (X1::Y1) )

N::N |: ["respect"] -> ["accordance"]
( (X1::Y1) )

NP::NP |: ["le respect"] -> ["accordance"]
( )

PP::PP |: ["dans le respect"] -> ["in accordance"]
( )

PP::PP |: ["des principes"] -> ["with the principles"]
( )

Hebrew-English Transfer Grammar Example Rules (Manually-developed)

{NP1,2}
;; SL: $MLH ADWMH
;; TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
( (X2::Y1)
  (X1::Y2)
  ((X1 def) = -)
  ((X1 status) =c absolute)
  ((X1 num) = (X2 num))
  ((X1 gen) = (X2 gen))
  (X0 = X1) )

{NP1,3}
;; SL: H $MLWT H ADWMWT
;; TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
( (X3::Y1)
  (X1::Y2)
  ((X1 def) = +)
  ((X1 status) =c absolute)
  ((X1 num) = (X3 num))
  ((X1 gen) = (X3 gen))
  (X0 = X1) )

French-English Transfer Grammar Example Rules (Automatically-acquired)

{PP,24691}
;; SL: des principes
;; TL: with the principles
PP::PP ["des" N] -> ["with the" N]
( (X1::Y1) )

{PP,312}
;; SL: dans le respect des principes
;; TL: in accordance with the principles
PP::PP [Prep NP] -> [Prep NP]
( (X1::Y1)
  (X2::Y2) )

The Transfer Engine
• Input: source-language input sentence, or source-language confusion network
• Output: lattice representing a collection of translation fragments at all levels supported by transfer rules
• Basic Algorithm: “bottom-up” integrated “parsing-transfer-generation” chart-parser guided by the synchronous transfer rules
– Start with translations of individual words and phrases from the translation lexicon
– Create translations of larger constituents by applying applicable transfer rules to previously created lattice entries
– Beam search controls the exponential combinatorics of the search space, using multiple scoring features
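The bottom-up idea can be sketched in a few lines: start from lexical arcs over input spans and repeatedly apply binary rules to adjacent arcs, adding larger arcs to the chart. This is illustrative Python only; the rule representation and the toy example are assumptions, not the real engine's data structures.

```python
# Toy bottom-up chart: lexical arcs keyed by (start, end, category),
# grown by applying binary "transfer rules" to adjacent spans.
lexical_arcs = {
    (0, 1, "PREP"): "in",
    (1, 2, "DET"): "the",
    (2, 3, "N"): "line",
}

# (lhs, (right-hand categories), target word order)
rules = [
    ("NP", ("DET", "N"), (0, 1)),
    ("PP", ("PREP", "NP"), (0, 1)),
]

arcs = dict(lexical_arcs)
changed = True
while changed:                       # iterate to a fixed point
    changed = False
    for (i, j, c1), t1 in list(arcs.items()):
        for (j2, k, c2), t2 in list(arcs.items()):
            if j2 != j:              # arcs must be adjacent
                continue
            for lhs, (a, b), order in rules:
                if (c1, c2) == (a, b) and (i, k, lhs) not in arcs:
                    parts = [t1, t2]
                    arcs[(i, k, lhs)] = " ".join(parts[p] for p in order)
                    changed = True

print(arcs[(0, 3, "PP")])  # -> in the line
```

The real engine additionally applies unification constraints at each rule application and keeps all competing arcs in the lattice for the decoder, rather than a single parse.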

The Transfer Engine: Some Unique Features
– Works with either learned or manually-developed transfer grammars
– Handles rules with or without unification constraints
– Supports interfacing with servers for morphological analysis and generation
– Can handle ambiguous source-word analyses and/or SL segmentations represented in the form of lattice structures

Hebrew Example (from [Lavie et al., 2004])
• Input word: B$WRH

0     1     2     3     4
|----B$WRH----|
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|

Hebrew Example (from [Lavie et al., 2004])
Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

XFER Output Lattice (fragment; each arc is: start, end, translation, score, source word, target structure)
(28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")
(29 29 "SINCE" -8.20817 "MAZ" "(ADVP,0 (ADV,5 'SINCE'))")
(29 29 "SINCE THEN" -12.0165 "MAZ" "(ADVP,0 (ADV,6 'SINCE THEN'))")
(29 29 "EVER SINCE" -12.5564 "MAZ" "(ADVP,0 (ADV,4 'EVER SINCE'))")
(30 30 "WORKED" -10.9913 "&BD" "(VERB,0 (V,11 'WORKED'))")
(30 30 "FUNCTIONED" -16.0023 "&BD" "(VERB,0 (V,10 'FUNCTIONED'))")
(30 30 "WORSHIPPED" -17.3393 "&BD" "(VERB,0 (V,12 'WORSHIPPED'))")
(30 30 "SERVED" -11.5161 "&BD" "(VERB,0 (V,14 'SERVED'))")
(30 30 "SLAVE" -13.9523 "&BD" "(NP0,0 (N,34 'SLAVE'))")
(30 30 "BONDSMAN" -18.0325 "&BD" "(NP0,0 (N,36 'BONDSMAN'))")
(30 30 "A SLAVE" -16.8671 "&BD" "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,34 'SLAVE')))))")
(30 30 "A BONDSMAN" -21.0649 "&BD" "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,36 'BONDSMAN')))))")

The Lattice Decoder
• Stack decoder, similar to standard statistical MT decoders
• Searches for best-scoring path of non-overlapping lattice arcs
• No reordering during decoding
• Scoring based on log-linear combination of scoring features, with weights trained using Minimum Error Rate Training (MERT)
• Scoring components:
– Statistical Language Model
– Bi-directional MLE phrase and rule scores
– Lexical Probabilities
– Fragmentation: how many arcs to cover the entire translation?
– Length Penalty: how far from expected target length?
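Since there is no reordering, the search for the best path of non-overlapping arcs reduces to a simple dynamic program over input positions. The sketch below is illustrative only: the arc scores stand in for the full log-linear feature combination and are made-up numbers, not real decoder output.

```python
# Best-scoring full-coverage path of non-overlapping arcs over
# positions 0..N (monotone decoding). Scores are toy stand-ins
# for the log-linear combination of features.
N = 4  # input length in positions
# arcs: (start, end, translation, score); higher (less negative) is better
arcs = [
    (0, 2, "IN LINE", -9.0),
    (0, 1, "IN", -2.0),
    (1, 2, "THE", -1.5),
    (2, 4, "NEXT LINE", -4.0),
    (0, 4, "IN THE NEXT LINE", -8.5),
]

best = {0: (0.0, [])}  # position -> (cumulative score, arc texts so far)
for pos in range(1, N + 1):
    candidates = []
    for (i, j, txt, s) in arcs:
        if j == pos and i in best:
            prev_score, prev_path = best[i]
            candidates.append((prev_score + s, prev_path + [txt]))
    if candidates:
        best[pos] = max(candidates)

score, path = best[N]
print(" ".join(path))  # -> IN THE NEXT LINE (via IN + THE + NEXT LINE, -7.5)
```

A stack decoder with a language model cannot decompose this cleanly by position alone (the LM state matters), which is why the real decoder keeps hypothesis stacks rather than a single best entry per position.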

XFER Lattice Decoder
00 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0, Words: 13,13

235 <0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON') (NP,2 (LITERAL 'THE') (NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH')) (NP1,0 (NP0,1 (N,6 'DAY')))))))>
918 <8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,17 'LION'))))) (VERB,0 (V,0 'ATE')) (NP,100 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>
584 <14 17 -30.6607: L ARWXH BWQR (PP,0 (PREP,6 'TO') (NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING')) (NP0,0 (N,27 'MEAL')))))))>

Stat-XFER MT Systems
• General Stat-XFER framework under development for the past nine years
• Systems so far:
– Chinese-to-English
– French-to-English
– Hebrew-to-English
– Urdu-to-English
– German-to-English
– Hindi-to-English
– Dutch-to-English
– Turkish-to-English
– Mapudungun-to-Spanish
– Arabic-to-English
– Brazilian Portuguese-to-English
– English-to-Arabic
– Hebrew-to-Arabic

Learning Transfer-Rules for Languages with Limited Resources
• Rationale:
– Large bilingual corpora not available
– Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool
– Elicitation corpus designed to be typologically comprehensive and compositional
– Transfer-rule engine and rule-learning approach support acquisition of generalized transfer rules from the data

English-Chinese Example (example slide; image not preserved)

English-Hindi Example (example slide; image not preserved)

Spanish-Mapudungun Example (example slide; image not preserved)

English-Arabic Example (example slide; image not preserved)

The Typological Elicitation Corpus
• Translated, aligned by bilingual informant
• Corpus consists of linguistically diverse constructions
• Based on elicitation and documentation work of field linguists (e.g. Comrie 1977, Bouquiaux 1992)
• Organized compositionally: elicit simple structures first, then use them as building blocks
• Goal: minimize size, maximize linguistic coverage

The Structural Elicitation Corpus
• Designed to cover the most common phrase structures of English and learn how these structures map onto their equivalents in other languages
• Constructed using the constituent parse trees from the Penn Treebank
– Extracted and frequency-ranked all rules in parse trees
– Selected top ~200 rules, filtered idiosyncratic cases
– Revised lexical choices within examples
• Goal: minimize size, maximize linguistic coverage of structures

The Structural Elicitation Corpus: Examples

srcsent: in the forest
tgtsent: B H I&R
aligned: ((1,1),(2,2),(3,3))
context:
C-Structure: (<PP> (PREP in-1) (<NP> (DET the-2) (N forest-3)))

srcsent: steps
tgtsent: MDRGWT
aligned: ((1,1))
context:
C-Structure: (<NP> (N steps-1))

srcsent: the boy ate the apple
tgtsent: H ILD AKL AT H TPWX
aligned: ((1,1),(2,2),(3,3),(4,5),(5,6))
context:
C-Structure: (<S> (<NP> (DET the-1) (N boy-2)) (<VP> (V ate-3) (<NP> (DET the-4) (N apple-5))))

srcsent: the first year
tgtsent: H $NH H RA$WNH
aligned: ((1,1 3),(2,4),(3,2))
context:
C-Structure: (<NP> (DET the-1) (<ADJP> (ADJ first-2)) (N year-3))
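A record in this format can be turned into word-aligned pairs with a few lines of code. This is an illustrative sketch, not the project's actual tooling; note that one-to-many links such as (1, 1 3) in the last example would need a small extension beyond what is shown here.

```python
# Turn one elicitation record (1-indexed alignment pairs, as on the
# slide) into aligned (source word, target word) pairs.
record = {
    "srcsent": "the boy ate the apple",
    "tgtsent": "H ILD AKL AT H TPWX",
    "aligned": [(1, 1), (2, 2), (3, 3), (4, 5), (5, 6)],
}

src = record["srcsent"].split()
tgt = record["tgtsent"].split()
pairs = [(src[i - 1], tgt[j - 1]) for (i, j) in record["aligned"]]
print(pairs)
# Note: Hebrew AT (accusative marker, position 4) is unaligned,
# exactly as in the slide's alignment.
```

Unaligned target words (like the object marker AT here) are what make these records useful for learning where function words appear on only one side.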

A Limited Data Scenario for Hindi-to-English
• Conducted during a DARPA “Surprise Language Exercise” (SLE) in June 2003
• Put together a scenario with “miserly” data resources:
– Elicited Data corpus: 17,589 phrases
– Cleaned portion (top 12%) of LDC dictionary: ~2,725 Hindi words (23,612 translation pairs)
– Manually acquired resources during the SLE:
• 500 manual bigram translations
• 72 manually written phrase transfer rules
• 105 manually written postposition rules
• 48 manually written time expression rules
• No additional parallel text!

Examples of Learned Rules (Hindi-to-English)

{NP,14244} ;; Score: 0.0429
NP::NP [N] -> [DET N]
( (X1::Y2) )

{NP,14434} ;; Score: 0.0040
NP::NP [ADJ CONJ ADJ N] -> [ADJ CONJ ADJ N]
( (X1::Y1)
  (X2::Y2)
  (X3::Y3)
  (X4::Y4) )

{PP,4894} ;; Score: 0.0470
PP::PP [NP POSTP] -> [PREP NP]
( (X2::Y1)
  (X1::Y2) )

Manual Transfer Rules: Hindi Example
;; PASSIVE OF SIMPLE PAST (NO AUX) WITH LIGHT VERB
;; passive of 43 (7b)
{VP,28}
VP::VP : [V V V] -> [Aux V]
( (X1::Y2)
  ((x1 form) = root)
  ((x2 type) =c light)
  ((x2 form) = part)
  ((x2 aspect) = perf)
  ((x3 lexwx) = 'j.An.A')
  ((x3 form) = part)
  ((x3 aspect) = perf)
  (x0 = x1)
  ((y1 lex) = be)
  ((y1 tense) = past)
  ((y1 agr num) = (x3 agr num))
  ((y1 agr pers) = (x3 agr pers))
  ((y2 form) = part) )

Manual Transfer Rules: Example
; NP1 ke NP2 -> NP2 of NP1
; Ex: j.Ivana ke eka a.Xy.Aya
; life of (one) chapter
; ==> a chapter of life

{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
( (X1::Y2)
  (X2::Y1)
; ((x2 lexwx) = 'k.A')
)

{NP,13}
NP::NP : [NP1] -> [NP1]
( (X1::Y1) )

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
( (X1::Y2)
  (X2::Y1) )

(The slide also shows tree diagrams pairing the source NP "j.Ivana ke eka a.Xy.Aya" with the target NP "one chapter of life"; the diagrams are not preserved here.)

Manual Grammar Development
• Covers mostly NPs, PPs and VPs (verb complexes)
• ~70 grammar rules, covering basic and recursive NPs and PPs, and verb complexes of main tenses in Hindi (developed in two weeks)

Testing Conditions
• Tested on section of JHU-provided data: 258 sentences with four reference translations
– SMT system (stand-alone)
– EBMT system (stand-alone)
– XFER system (naïve decoding)
– XFER system with “strong” decoder:
• No grammar rules (baseline)
• Manually developed grammar rules
• Automatically learned grammar rules
– XFER+SMT with strong decoder (MEMT)

Results on JHU Test Set

System                          BLEU    M-BLEU  NIST
EBMT                            0.058   0.165   4.22
SMT                             0.093   0.191   4.64
XFER (naïve), man grammar       0.055   0.177   4.46
XFER (strong), no grammar       0.109   0.224   5.29
XFER (strong), learned grammar  0.116   0.231   5.37
XFER (strong), man grammar      0.135   0.243   5.59
XFER+SMT (strong)               0.136   0.243   5.65

Effect of Reordering in the Decoder (results chart; image not preserved)

Observations and Lessons (I)
• XFER with strong decoder outperformed SMT even without any grammar rules in the miserly data scenario
– SMT trained on elicited phrases that are very short
– SMT has insufficient data to train more discriminative translation probabilities
– XFER takes advantage of morphology
• Token coverage without morphology: 0.6989
• Token coverage with morphology: 0.7892
• Manual grammar was somewhat better than automatically learned grammar
– Learned rules were very simple
– Large room for improvement on learning rules
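Token coverage as used here is simply the fraction of input tokens for which the system has at least one translation. A minimal sketch with made-up data (the lexicon and token list below are toy stand-ins, not the experiment's resources):

```python
# Token coverage: fraction of input tokens with at least one
# translation-lexicon entry. Toy data for illustration only.
lexicon = {"B", "H", "$WRH", "IWM"}
tokens = ["B", "H", "$WRH", "XDV", "IWM"]   # one out-of-lexicon token
coverage = sum(t in lexicon for t in tokens) / len(tokens)
print(coverage)  # 0.8
```

With morphological analysis, inflected surface forms map down to base forms that the lexicon does contain, which is exactly why coverage rose from 0.6989 to 0.7892 on the slide.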

Observations and Lessons (II)
• MEMT (XFER and SMT) based on strong decoder produced best results in the miserly scenario
• Reordering within the decoder provided very significant score improvements
– Much room for more sophisticated grammar rules
– Strong decoder can carry some of the reordering “burden”

Modern Hebrew
• Native language of about 3-4 million in Israel
• Semitic language, closely related to Arabic and with similar linguistic properties
– Root+pattern word formation system
– Rich verb and noun morphology
– Particles attach as prefixes to the following word: definite article (H), prepositions (B, K, L, M), coordinating conjunction (W), relativizers ($, K$)…
• Unique alphabet and writing system
– 22 letters represent (mostly) consonants
– Vowels represented (mostly) by diacritics
– Modern texts omit the diacritic vowels, thus an additional level of ambiguity in the “bare” word
– Example: MHGR → mehager, m+hagar, m+h+ger
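The prefix-particle ambiguity can be sketched by enumerating possible particle segmentations of a bare token. This is purely illustrative: a real analyzer also validates the remaining stem against its lexicon and handles particle-ordering restrictions, neither of which is modeled here.

```python
# Enumerate candidate prefix segmentations of a bare Hebrew token,
# using the particle inventory from the slide. Illustrative only:
# no stem validation, no particle-order constraints.
PARTICLES = {"H", "B", "K", "L", "M", "W", "$", "K$"}

def segmentations(word, prefix=()):
    yield prefix + (word,)                       # no (further) split
    for p in PARTICLES:
        if word.startswith(p) and len(word) > len(p):
            yield from segmentations(word[len(p):], prefix + (p,))

segs = sorted(set(segmentations("MHGR")))
print(segs)  # includes ('MHGR',), ('M', 'HGR'), ('M', 'H', 'GR')
```

The three candidates correspond to the slide's mehager / m+hagar / m+h+ger reading of MHGR; picking among them is what the morphological disambiguation step is for.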

Modern Hebrew Spelling
• Two main spelling variants
– “KTIV XASER” (deficient): spelling with the vowel diacritics; words reduce to consonants when the diacritics are removed
– “KTIV MALEH” (full): words with I/O/U vowels are written with long vowels which include a letter
• KTIV MALEH is predominant, but not strictly adhered to even in newspapers and official publications → inconsistent spelling
• Example:
– niqud (spelling): NIQWD, NQD
– When written as NQD, could also be niqed, naqed, nuqad

Challenges for Hebrew MT
• Paucity of existing language resources for Hebrew
– No publicly available broad-coverage morphological analyzer
– No publicly available bilingual lexicons or dictionaries
– No POS-tagged corpus or parse tree-bank corpus for Hebrew
– No large Hebrew/English parallel corpus
• Scenario well suited for the CMU transfer-based MT framework for languages with limited resources

Morphological Analyzer
• We use a publicly available morphological analyzer distributed by the Technion’s Knowledge Center, adapted for our system
• Coverage is reasonable (for nouns, verbs and adjectives)
• Produces all analyses or a disambiguated analysis for each word
• Output format includes lexeme (base form), POS, morphological features
• Output was adapted to our representation needs (POS and feature mappings)

Morphology Example
• Input word: B$WRH

0     1     2     3     4
|----B$WRH----|
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|

Morphology Example
Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Translation Lexicon
• Constructed our own Hebrew-to-English lexicon, based primarily on the existing “Dahan” H-to-E and E-to-H dictionary made available to us, augmented by other public sources
• Coverage is not great but not bad as a start
– Dahan H-to-E is about 15K translation pairs
– Dahan E-to-H is about 7K translation pairs
• Base forms, POS information on both sides
• Converted Dahan into our representation, added entries for missing closed-class entries (pronouns, prepositions, etc.)
• Had to deal with spelling conventions
• Recently augmented with ~50K translation pairs extracted from Wikipedia (mostly proper names and named entities)

Manual Transfer Grammar (human-developed)
• Initially developed by Alon in a couple of days, extended and revised by Nurit over time
• Current grammar has 36 rules:
– 21 NP rules
– one PP rule
– 6 verb complexes and VP rules
– 8 higher-phrase and sentence-level rules
• Captures the most common (mostly local) structural differences between Hebrew and English

Transfer Grammar Example Rules

{NP1,2}
;; SL: $MLH ADWMH
;; TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
( (X2::Y1)
  (X1::Y2)
  ((X1 def) = -)
  ((X1 status) =c absolute)
  ((X1 num) = (X2 num))
  ((X1 gen) = (X2 gen))
  (X0 = X1) )

{NP1,3}
;; SL: H $MLWT H ADWMWT
;; TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
( (X3::Y1)
  (X1::Y2)
  ((X1 def) = +)
  ((X1 status) =c absolute)
  ((X1 num) = (X3 num))
  ((X1 gen) = (X3 gen))
  (X0 = X1) )

Hebrew-to-English MT Prototype
• Initial prototype developed within a two-month intensive effort
• Accomplished:
– Adapted available morphological analyzer
– Constructed a preliminary translation lexicon
– Translated and aligned Elicitation Corpus
– Learned XFER rules
– Developed (small) manual XFER grammar
– System debugging and development
– Evaluated performance on unseen test data using automatic evaluation metrics

Example Translation
• Input:
– לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה
– (Gloss: after debates many decided the-government to-hold referendum in-issue the-withdrawal)
• Output:
– AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL

Noun Phrases – Construct State

החלטת הנשיא הראשון
HXL@T [HNSIA HRA$WN]
decision.3SF-CS the-president.3SM the-first.3SM
THE DECISION OF THE FIRST PRESIDENT

החלטת הנשיא הראשונה
[HXL@T HNSIA] HRA$WNH
decision.3SF-CS the-president.3SM the-first.3SF
THE FIRST DECISION OF THE PRESIDENT

Noun Phrases – Possessives

הנשיא הכריז שהמשימה הראשונה שלו תהיה למצוא פתרון לסכסוך באזורנו
HNSIA HKRIZ $HM$IMH HRA$WNH $LW THIH LMCWA PTRWN LSKSWK BAZWRNW
the-president announced that-the-task.3SF the-first.3SF of-him will.3SF to-find solution to-the-conflict in-region-POSS.1P

Without transfer grammar:
THE PRESIDENT ANNOUNCED THAT THE TASK THE BEST OF HIM WILL BE TO FIND SOLUTION TO THE CONFLICT IN REGION OUR

With transfer grammar:
THE PRESIDENT ANNOUNCED THAT HIS FIRST TASK WILL BE TO FIND A SOLUTION TO THE CONFLICT IN OUR REGION

Subject-Verb Inversion

אתמול הודיעה הממשלה שתערכנה בחירות בחודש הבא
ATMWL HWDI&H HMM$LH $T&RKNH BXIRWT BXWD$ HBA
yesterday announced.3SF the-government.3SF that-will-be-held.3PF elections.3PF in-the-month the-next

Without transfer grammar:
YESTERDAY ANNOUNCED THE GOVERNMENT THAT WILL RESPECT OF THE FREEDOM OF THE MONTH THE NEXT

With transfer grammar:
YESTERDAY THE GOVERNMENT ANNOUNCED THAT ELECTIONS WILL ASSUME IN THE NEXT MONTH

Subject-Verb Inversion

לפני כמה שבועות הודיעה הנהלת המלון שהמלון יסגר בסוף השנה
LPNI KMH $BW&WT HWDI&H HNHLT HMLWN $HMLWN ISGR BSWF H$NH
before several weeks announced.3SF management.3SF.CS the-hotel that-the-hotel.3SM will-be-closed.3SM at-end.3SM.CS the-year

Without transfer grammar:
IN FRONT OF A FEW WEEKS ANNOUNCED ADMINISTRATION THE HOTEL THAT THE HOTEL WILL CLOSE AT THE END THIS YEAR

With transfer grammar:
SEVERAL WEEKS AGO THE MANAGEMENT OF THE HOTEL ANNOUNCED THAT THE HOTEL WILL CLOSE AT THE END OF THE YEAR

Evaluation Results
• Test set of 62 sentences from Haaretz newspaper, 2 reference translations

System    BLEU    NIST    P       R       METEOR
No Gram   0.0616  3.4109  0.4090  0.4427  0.3298
Learned   0.0774  3.5451  0.4189  0.4488  0.3478
Manual    0.1026  3.7789  0.4334  0.4474  0.3617

Current and Future Work
• Issues specific to the Hebrew-to-English system:
– Coverage: further improvements in the translation lexicon and morphological analyzer
– Manual grammar development
– Acquiring/training of word-to-word translation probabilities
– Acquiring/training of a Hebrew language model at a post-morphology level that can help with disambiguation
• General issues related to the XFER framework:
– Discriminative language modeling for MT
– Effective models for assigning scores to transfer rules
– Improved grammar learning
– Merging/integration of manual and acquired grammars

Conclusions
• Test case for the CMU XFER framework for rapid MT prototyping
• Preliminary system was a two-month, three-person effort; we were quite happy with the outcome
• Core concept of XFER + Decoding is very powerful and promising for low-resource MT
• We experienced the main bottlenecks of knowledge acquisition for MT: morphology, translation lexicons, grammar…

Mapudungun-to-Spanish Example
English: I didn't see Maria
Mapudungun: pelafiñ Maria
Spanish: No vi a María

Mapudungun-to-Spanish Example
English: I didn't see Maria
Mapudungun: pelafiñ Maria
  pe-la-fi-ñ Maria
  see-neg-3.obj-1.subj.indicative Maria
Spanish: No vi a María
  neg see.1.subj.past.indicative acc Maria
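The segmentation of pelafiñ into glossed morphemes can be sketched as greedy suffix stripping over a hand-built morpheme table. The SUFFIXES and STEMS tables below are hypothetical and cover only this one example; the actual Mapudungun analyzer is far larger and must handle ambiguous segmentations.

```python
# Hypothetical morpheme tables covering only this example.
SUFFIXES = {
    "la": "neg",
    "fi": "3.obj",
    "ñ": "1.subj.indicative",
}
STEMS = {"pe": "see"}

def gloss(word):
    """Greedily strip known suffixes from the right, then gloss the stem.
    Greedy stripping can mis-segment in general; this is a sketch only."""
    morphemes = []
    while word not in STEMS:
        for suffix, tag in SUFFIXES.items():
            if word.endswith(suffix):
                morphemes.insert(0, (suffix, tag))
                word = word[: -len(suffix)]
                break
        else:
            raise ValueError("cannot segment: " + word)
    morphemes.insert(0, (word, STEMS[word]))
    return morphemes

print(gloss("pelafiñ"))
# [('pe', 'see'), ('la', 'neg'), ('fi', '3.obj'), ('ñ', '1.subj.indicative')]
```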

pe-la-fi-ñ Maria
[Tree diagram: the stem pe is parsed as V]

pe-la-fi-ñ Maria
[Tree diagram: V → pe; VSuff → la, contributing Negation = +]

pe-la-fi-ñ Maria
[Tree diagram: VSuff.G built over VSuff (la); pass all features up]

pe-la-fi-ñ Maria
[Tree diagram: VSuff → fi, contributing object person = 3]

pe-la-fi-ñ Maria
[Tree diagram: VSuff.G combines VSuff.G (la) and VSuff (fi); pass all features up from both children]

pe-la-fi-ñ Maria
[Tree diagram: VSuff → ñ, contributing person = 1, number = sg, mood = ind]

pe-la-fi-ñ Maria
[Tree diagram: top VSuff.G combines VSuff.G (la, fi) and VSuff (ñ); pass all features up from both children]

pe-la-fi-ñ Maria
[Tree diagram: V built from V (pe) and VSuff.G; pass all features up from both children; check that (1) negation = + and (2) tense is undefined]
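The bottom-up feature passing and the constraint checks at the V node can be sketched as unification of small feature dictionaries. This is a simplified illustration of the mechanism, not the XFER engine's actual data structures; the feature names mirror the tree annotations above.

```python
def pass_up(*children):
    """Merge children's feature dicts bottom-up; a conflicting value
    means unification fails and the analysis is rejected."""
    merged = {}
    for feats in children:
        for key, value in feats.items():
            if key in merged and merged[key] != value:
                raise ValueError("unification failure on " + key)
            merged[key] = value
    return merged

# Leaf features, as contributed by each morpheme in the tree:
pe = {}                                             # stem pe (see)
la = {"negation": "+"}                              # suffix -la
fi = {"object_person": 3}                           # suffix -fi
nh = {"person": 1, "number": "sg", "mood": "ind"}   # suffix -ñ

vsuffg = pass_up(la, fi, nh)   # the suffix group node
v = pass_up(pe, vsuffg)        # the V node

# Constraint checks at V: negation = + and tense undefined
assert v["negation"] == "+"
assert "tense" not in v
print(v)
```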

pe-la-fi-ñ Maria
[Tree diagram: NP → N → Maria, contributing person = 3, number = sg, human = +]

pe-la-fi-ñ Maria
[Tree diagram: S built from V and NP; pass features up from V; check that NP is human = +]

Transfer to Spanish: Top-Down
[Tree diagram: the Mapudungun S/VP tree paired with the Spanish S/VP skeleton]

Transfer to Spanish: Top-Down
[Tree diagram: pass all features to the Spanish side; the Spanish VP expands to V "a" NP]

Transfer to Spanish: Top-Down
[Tree diagram: pass all features down the Spanish tree]

Transfer to Spanish: Top-Down
[Tree diagram: pass object features down to the Spanish NP]

Transfer to Spanish: Top-Down
[Tree diagram: the accusative marker "a" on objects is introduced because human = +]

Transfer to Spanish: Top-Down

VP::VP [VBar NP] -> [VBar "a" NP]
((X1::Y1)
 (X2::Y3)
 ((X2 type) = (*NOT* personal))
 ((X2 human) =c +)
 (X0 = X1)
 ((X0 object) = X2)
 (Y0 = X0)
 ((Y0 object) = (X0 object))
 (Y1 = Y0)
 (Y3 = (Y0 object))
 ((Y1 objmarker person) = (Y3 person))
 ((Y1 objmarker number) = (Y3 number))
 ((Y1 objmarker gender) = (Y3 gender)))
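What the VP::VP object-marker rule does can be sketched procedurally: insert the Spanish accusative marker "a" only when the object NP is human = +, and copy the object's agreement features onto the marker. The node representation below (dicts with "surface" and "feats" keys) is invented for illustration and is not the XFER engine's internal format.

```python
def apply_vp_rule(vbar, np):
    """Sketch of VP::VP [VBar NP] -> [VBar "a" NP]: the marker "a" is
    introduced only for human objects, and the object's agreement
    features are copied onto it (cf. the objmarker equations)."""
    if np["feats"].get("human") == "+":
        marker = {"surface": "a",
                  "objmarker": {k: np["feats"][k]
                                for k in ("person", "number", "gender")
                                if k in np["feats"]}}
        return [vbar, marker, np]
    return [vbar, np]

maria = {"surface": "María",
         "feats": {"person": 3, "number": "sg", "human": "+"}}
vbar = {"surface": "no vi"}
print([child["surface"] for child in apply_vp_rule(vbar, maria)])
# ['no vi', 'a', 'María']
```

The =c constraint in the rule is a check (it must already hold on the source side) rather than an assignment, which is why the sketch only tests human = + instead of setting it.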

Transfer to Spanish: Top-Down
[Tree diagram: pass person, number, and mood features to the Spanish verb; the Spanish V expands to "no" V; assign tense = past]

Transfer to Spanish: Top-Down
[Tree diagram: "no" is introduced because negation = +]

Transfer to Spanish: Top-Down
[Tree diagram: the Spanish verb lemma ver is chosen for pe]

Transfer to Spanish: Top-Down
[Tree diagram: ver is inflected as vi, given person = 1, number = sg, mood = indicative, tense = past]
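The last generation step, realizing ver as vi from the accumulated features, can be sketched as a lookup into an inflection table. The FORMS table below is a toy stand-in invented for this example; the actual system relied on a full Spanish morphological generator.

```python
# Toy inflection table for illustration only; a real Spanish
# morphological generator covers the full paradigm.
FORMS = {
    ("ver", 1, "sg", "indicative", "past"): "vi",
    ("ver", 1, "sg", "indicative", "present"): "veo",
}

def inflect(lemma, feats):
    """Realize a surface form from a lemma plus agreement/tense features."""
    key = (lemma, feats["person"], feats["number"], feats["mood"], feats["tense"])
    return FORMS[key]

feats = {"person": 1, "number": "sg", "mood": "indicative", "tense": "past"}
print(inflect("ver", feats))  # vi
```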

Transfer to Spanish: Top-Down
[Tree diagram: pass features over to the Spanish side; NP → N → María]

I Didn't See Maria
[Tree diagram: final aligned trees, Mapudungun "pelafiñ Maria" and Spanish "No vi a María"]