Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation
Chapters 1-4
Shauna Eggers

Outline
• Introduction
• Implicit Syntactic Feature Functions
• Shallow Syntactic Feature Functions
• Deep Syntactic Feature Functions

Introduction
• Motivation
• The Plan
• The Baseline System

Introduction: Motivation
• Statistical MT systems are the current state of the art, but they often make ‘stupid’ syntax errors:
  – Missing content words: "Ukraine condemns US interference in its internal affairs" comes out as "Condemns US interference in its internal affairs"
  – Missing articles: "… he is fully able to activate the team" comes out as "… he is fully able to activate team"
  – Incorrect dependencies: "… particularly those players who cheat the audience" comes out as "… particularly those who cheat the audience players"
• These systems are data-driven, so they capture syntactic properties of language only implicitly, through n-grams and alignment models
• What if we incorporate some explicit syntactic knowledge?

Introduction: The Plan
• Investigate the effect of integrating syntactic rules on the performance of an SMT system
  – Analyze errors in a baseline SMT system
  – Develop syntactically motivated feature functions to target specific errors
  – Observe the effect of each feature on the system score
  – Hope for improved results!
• Measuring improvement: BLEU
  – Is this always an effective/appropriate metric?

Introduction: The Baseline System
• The Model
• Feature Functions
• Training and Test Corpora
• Results

Baseline System: The Model
• Alignment template system
  – Och 2002; Och, Tillmann, Ney 1999; Och, Ney 2004
  – How it works: segment the input sentence into phrases, translate the phrases, and reorder them in the target language
  – Uses a log-linear modelling approach for direct translation
• Basic idea of the experiments: model each syntactic feature as a function and plug it into the model
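The "plug a feature function into the model" idea can be sketched as a weighted sum of feature values used to pick the best hypothesis from an n-best list. This is a minimal illustration, not the workshop system: the feature names (`lm`, `tm`, `word_penalty`), the weights, and the feature values are all made up for the example.

```python
from typing import Dict, List, Tuple

def loglinear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature-function values: sum_m lambda_m * h_m(e, f)."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rerank(nbest: List[Tuple[str, Dict[str, float]]], weights: Dict[str, float]) -> str:
    """Return the hypothesis string with the highest log-linear score."""
    return max(nbest, key=lambda hyp: loglinear_score(hyp[1], weights))[0]

# Hypothetical feature values (log-probabilities and a word count) for two hypotheses.
nbest = [
    ("condemns US interference", {"lm": -12.0, "tm": -8.0, "word_penalty": 3}),
    ("Ukraine condemns US interference", {"lm": -13.0, "tm": -7.5, "word_penalty": 4}),
]
# Hypothetical weights; in a real system they would be tuned on a development set.
weights = {"lm": 1.0, "tm": 1.0, "word_penalty": 0.6}
best = rerank(nbest, weights)
```

Under this view, adding a new syntactic feature only means adding one more key to each hypothesis's feature dictionary and one more weight.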

Fig 1.1: System architecture based on log-linear modelling

Baseline System: Alignment Templates
• Used to do the phrase-based translation
• Sentences e and f are decomposed into K phrase pairs, and a template z is assigned to translate each pair
• Parameters:
  – Segmentation points in both e and f
    • Search for the optimal segmentation is included in the Global Search component of the model (Fig 1.1)
  – Set of templates 1 through K: z_1^K
  – Permutation of templates 1 through K: π_1^K
    • This parameter allows for reordering of phrases
• The z and π parameters are added as “hidden variables” to the model

Fig 1.2: Sample segmentation of e and f and translation into alignment templates

Fig 1.3: Dependencies in the alignment template model

Baseline System: Feature Functions
• Alignment template selection
• Word selection
• Phrase alignment
• Language model features
• Word/phrase penalty
• Phrases from conventional lexicon
• Additional features

Baseline System: Alignment Template Selection
• Feature function: the product of the alignment template probabilities
  – Note that there are no insertions or deletions at the phrase level, only permutations

Baseline System: Word Selection
• Feature function: the product of the word translation probabilities
  – The probabilities are indexed by i and j to include dependence on word positions (übermorgen -> the day after tomorrow should be weighted higher than übermorgen -> after the day tomorrow)
  – E_i is the word class for word e_i
  – A is the matrix of word alignments induced by π_1^K and z_1^K

Baseline System: Phrase Alignment
• Feature function for phrase alignment:
  – Sum over the distance (in the source) of alignment templates that are consecutive in the target
  – Measures the “non-monotonicity” of phrases
    • Takes into account that monotone alignments are very often the correct alignments
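The distance sum described above can be sketched as follows. This is a simplified illustration under stated assumptions: each template is represented only by its inclusive, 0-based source span, the spans are listed in target order, and sentinels at the sentence start and end are included. The function name and data layout are hypothetical, not the baseline system's.

```python
from typing import List, Tuple

def phrase_alignment_distance(spans_in_target_order: List[Tuple[int, int]],
                              source_len: int) -> int:
    """Sum of source-side jumps between templates that are adjacent in the target.

    Each span is (start, end) in inclusive 0-based source positions.
    A fully monotone alignment yields 0.
    """
    total = 0
    prev_end = -1  # sentinel: the position just before the first source word
    for start, end in spans_in_target_order:
        # Distance between where this template starts and where the
        # previous one (in target order) stopped.
        total += abs(start - (prev_end + 1))
        prev_end = end
    # Final jump from the last template to the end of the source sentence.
    total += abs(source_len - (prev_end + 1))
    return total

monotone = phrase_alignment_distance([(0, 1), (2, 4), (5, 5)], 6)
swapped = phrase_alignment_distance([(2, 4), (0, 1), (5, 5)], 6)
```

Here `monotone` is 0, and swapping the first two templates makes the value grow, which is the penalty for non-monotone phrase order.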

Baseline System: Language Model Features
• Standard word-based trigram for the language model feature
• The report mentions a total of four language models; it is not clear what the other three are (or are they just four variations on the trigram?)

Baseline System: Word/Phrase Penalty
• Number of produced words, i.e., length of the target sentence
• Number of produced phrases
  – Can be arranged to prefer long or short phrases (I imagine this means smaller K for longer phrases, and larger K for shorter phrases...?)

Baseline System: Additional Features
• Phrases from conventional lexicon
  – Entries in the Chinese-English lexicon provided by the Linguistic Data Consortium can be potential phrase translation pairs in the alignment template system
  – A feature function is included that counts the number of times each lexical entry is used in training
• The model allows further addition of any number of feature functions, for example, other syntactic features (numbers of verb arguments), semantic features, pragmatic features

Baseline System: Training and Test Corpora
• Three corpora:
  – training corpus (train)
    • 170 M English words
  – development corpus (dev)
    • 993 sentences (~25 K words) in both languages
    • 5765 sentences (~175 K words) for use in post-workshop experiments
  – test corpus (test)
• And:
  – unseen test corpus (blind-test)
    • for experiments on completely unseen data

Baseline System: Preprocessing
• Some additional tweaking was needed to get the system ready to roll
  – Segmentation and POS tagging: used standard tools distributed by LDC
    • Slightly different tag sets are appropriate for English and Chinese data (no NN v. NNS distinction in Chinese, no M for measure words in English)
  – Parsing: used Collins 1999 for English, Bikel 2002 for Chinese
  – Chunking: used the fnTBL chunker
  – Case issues: used an HMM to insert upper case back into baseline system output
  – Tokenization issues: normalized hyphenation and other formatting for i/o into the various systems

Baseline System: The Baseline Result
• BLEU score: 31.6
• This is the score that every experiment will be compared against
• SPOILER! You’re not going to see anything much different from this... (But hold on anyway... here we go!)

Implicit Syntactic Feature Functions
• A Trio for Punctuation
• Specific Word Penalty
• Model 1 Score
• Missing Content Words
• Multi-Sequence Alignment (MSA) of Hypotheses

Implicit Functions: Punctuation
• Problem: Ungrammatical punctuation in hypotheses affects the syntactic quality of the output
• Idea 1: Count of unmatched or empty parens and quotes
  – Feature function penalizes ungrammatical punctuation
• Idea 2: Percent overlap between groups in e and f
  – penalizes word movement around punctuation
  – penalizes punctuation deletion
• Idea 3: Add hypotheses to correct bad punctuation
  – delete unaligned parens and quotes
  – insert an opening paren/quote before the first word aligned to the first Chinese word inside the parens
  – insert a closing paren/quote after the last word aligned to the last Chinese word inside the parens
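Idea 1 can be sketched as a simple counter over tokens. This is an assumption-laden illustration rather than the workshop's feature: it handles only round parentheses and straight double quotes, treats a pair with nothing between it as "empty", and the function name is hypothetical.

```python
def count_bad_punctuation(tokens):
    """Count unmatched or empty parentheses and double quotes in a token list."""
    bad = 0
    open_parens = []   # indices of "(" tokens that are still unmatched
    open_quote = None  # index of the currently open quote, if any
    for i, tok in enumerate(tokens):
        if tok == "(":
            open_parens.append(i)
        elif tok == ")":
            if not open_parens:
                bad += 1                      # closing paren with no opener
            elif open_parens.pop() == i - 1:
                bad += 1                      # empty pair "( )"
        elif tok == '"':
            if open_quote is None:
                open_quote = i
            else:
                if open_quote == i - 1:
                    bad += 1                  # empty quote pair
                open_quote = None
    bad += len(open_parens)                   # openers that were never closed
    if open_quote is not None:
        bad += 1                              # quote that was never closed
    return bad
```

For example, `count_bad_punctuation("he said ( yes".split())` counts one unmatched opener, while a well-formed `( yes )` counts none.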

Punctuation - Results
• BLEU score: no statistically significant improvement
• Ideas 1 and 2 are restricted in their application
  – They have little discriminating power when most of the hypotheses for a Chinese sentence make similar punctuation mistakes
• Idea 2 doesn’t work when punctuation deletion is at the borders of the sentence, or next to another punctuation mark
• Idea 3 hypotheses apparently make only trivial changes to feature function values (and hence to the n-best score)
• Conclusion: Punctuation soundness has little influence on BLEU

Implicit Functions: Specific Word Penalty
• Problem: Errant (i.e., wrongly placed, inserted, or deleted) content words
• Idea: Use counts of the ten most common non-content words as feature functions
  – Individually: 10 counts -> 10 functions
  – Combined into one count, to avoid overfitting
• Results, compared to 31.6 baseline:
  – Using individual counts as features: 31.1
  – Combined into one feature value: 31.7
• Conclusion: BLEU drops with these features
• But! They did find that “that” and “a” were systematically mistranslated more often than other words. Maybe further experiments can be done on a larger list of non-content words
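The two variants (ten separate counts versus one combined count) can be sketched as below. The word list is a placeholder assumption: the slide does not enumerate the ten words used, apart from mentioning “that” and “a”.

```python
from collections import Counter

# Hypothetical list; the slide does not say which ten words were used.
FUNCTION_WORDS = ["the", "of", "to", "and", "a", "in", "that", "is", "for", "on"]

def individual_counts(tokens):
    """One feature value per function word: how often it occurs in the hypothesis."""
    counts = Counter(t.lower() for t in tokens)
    return [counts[w] for w in FUNCTION_WORDS]

def combined_count(tokens):
    """A single feature value: total occurrences of all listed function words."""
    return sum(individual_counts(tokens))

hyp = "The president said that a deal is in the works".split()
per_word = individual_counts(hyp)
total = combined_count(hyp)
```

The individual variant adds ten weights to tune, which is the overfitting risk the slide mentions; the combined variant adds only one.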

Implicit Functions: Model 1 Score
• Idea: Use IBM Model 1 for two feature functions
  – Model 1 gives the sum of all possible alignment probabilities
  – Feature functions: p(f|e) and p(e|f)
  – Trained with a subset of the baseline system's training corpus: 30 M English words
  – Smoothing: constant t(f_j|e_i) = 10^-40 used for unknown words
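A minimal sketch of the Model 1 score, using the standard form p(f|e) = ε / (l+1)^m · Π_j Σ_i t(f_j|e_i) computed in log space. The lexical table here is a toy dictionary, the `NULL` token name and `epsilon` default are illustrative choices, and applying the 10^-40 constant to every pair missing from the table is an assumption about how the smoothing was used.

```python
import math

UNKNOWN_T = 1e-40  # smoothing constant quoted on the slide

def model1_logprob(f_words, e_words, t_table, epsilon=1.0):
    """log p(f | e) under IBM Model 1, summing over all word alignments.

    t_table maps (f_word, e_word) -> t(f|e); missing pairs get UNKNOWN_T.
    """
    e_with_null = ["NULL"] + list(e_words)  # position 0 is the empty word
    # Length/alignment prior: epsilon / (l + 1)^m
    logp = math.log(epsilon) - len(f_words) * math.log(len(e_with_null))
    for f in f_words:
        # Each source word may align to any target word (or NULL).
        logp += math.log(sum(t_table.get((f, e), UNKNOWN_T) for e in e_with_null))
    return logp

# Toy lexical probabilities, invented for the example.
t = {("la", "the"): 0.7, ("maison", "house"): 0.8}
score = model1_logprob(["la", "maison"], ["the", "house"], t)
```

The reverse feature p(e|f) is obtained by swapping the roles of the two sentences and using a table trained in the other direction.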

Model 1 - Results
• Compared to 31.6 baseline:
  – p(f|e) yields 32.5 on average; p(e|f) yields 30.6
  – One of the best-performing features in the workshop
• Breakdown for different training sizes: (numbers for p(e|f) are a little jumpy; may be a bug in the eval script)

Implicit Functions: Missing Content Words
• Problem: Those missing content words are really annoying.
  – Sentences missing content words can have a higher overall probability ranking than those with the correct content words
• Idea: Implement a feature function that counts the number of content words missing in a candidate translation
• Results: 31.9 BLEU score
  – 0.3-point improvement over the 31.6 baseline
  – Comparatively large improvement, yet not statistically significant
• Discussion: The BLEU score is not significantly better, but on manual inspection the adequacy of the resulting sentences is much better. Perhaps BLEU is not the best metric for evaluating this feature function.

Implicit Functions: MSA for Hypotheses
• Problem: No real range of diversity in translation sentences
• Idea: Use Multi-Sequence Alignment (MSA) lattices to recombine subparts of existing hypotheses into new ones. Three features:
  – Path weight of each hypothesis
    • Arc weight = number of hypotheses that agree with that arc
  – Binary feature: Does the arc represent the majority of hypotheses?
  – Number of arcs on which a hypothesis agrees with the consensus path

MSA - Results
• BLEU scores: Meh.
• Conclusion: SMT is not constrained enough to be a very good fit for MSA

Implicit Syntax Results

Shallow Syntactic Feature Functions
• Overview
• Part-Of-Speech and Chunk Tag Counts
• Tag Fertility Models
• Projected POS Language Model
• Aligned POS-Tag Sequences

Shallow Functions: Overview
• Shallow features depend on POS tagging or chunking
• Motivations:
  – Overcome data sparseness
    • Generalize from the behavior of words to the behavior of tags and chunks
  – Make stronger generalizations about syntactic behavior than what is observed in the training corpus
• Disadvantages: It may not be possible to capture more information than is already implicitly modeled in the baseline
  – POS and baseline systems are trained on the same input
  – Chunker output is not at a much higher granularity than alignment templates
• Advantages:
  – Efficiency of POS and chunking systems
  – Decisions are local, so better for noisy hypotheses
  – Lots of available input data (1.3 M tagged parallel sentences available for training)
  – Simpler models allow quicker reaction to problems and contrastive error analysis

Shallow Functions: Part-Of-Speech and Chunk Tag Counts
• Problem: The baseline systematically under- and over-generates certain POS and chunk types
• Idea: Favor sentences with more or fewer of certain tags (depending on under- or over-generation). For example:
  – Number of NPs in English
  – Difference in number of NPs from Chinese to English
  – Number of Chinese N tags translated only to non-N tags in English
• Results: Meh.
• Conclusions:
  – Individual tag-count features are probably already encoded in the trigram models
  – Combined tag-count features do better, maybe because they counteract biases in more sophisticated features
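The three example features can be sketched as follows. This is an illustration under assumptions: chunk and POS tags are given as plain lists, the word alignment is a list of (Chinese index, English index) pairs, "N tag" is taken to mean any tag starting with `N`, and the function names are hypothetical.

```python
from collections import Counter

def tag_count_features(english_chunks, chinese_chunks, tag="NP"):
    """Count of a chunk tag in English and its difference from the Chinese count."""
    e_count = Counter(english_chunks)[tag]
    f_count = Counter(chinese_chunks)[tag]
    return {"e_count": e_count, "diff": f_count - e_count}

def n_to_non_n(chinese_tags, english_tags, alignment):
    """Count Chinese N* tags whose aligned English words all carry non-N tags."""
    count = 0
    for j, f_tag in enumerate(chinese_tags):
        if not f_tag.startswith("N"):
            continue
        aligned = [english_tags[i] for (jj, i) in alignment if jj == j]
        # Unaligned Chinese words are skipped: they were not "translated" at all.
        if aligned and all(not t.startswith("N") for t in aligned):
            count += 1
    return count

feats = tag_count_features(["NP", "VP", "NP"], ["NP", "VP"])
mismatches = n_to_non_n(["NN", "VV", "NN"],
                        ["DT", "NN", "VBD", "JJ"],
                        [(0, 1), (1, 2), (2, 3)])
```

In this toy case the second Chinese noun is aligned only to an adjective, so `mismatches` is 1.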

POS and Chunk counts - Results

Shallow Functions: Tag Fertility Models
• Problem: Tag distribution again
• Idea: Model the expected English tag distribution, with and without given Chinese tags
  – Single feature consisting of the product of various probability distributions for English tags
    • Some bag-o’-tags models, e.g., P(NN_e = 2)
    • Some conditional on Chinese tags, e.g., P(NP_e = 2 | NP_f = 1)
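One way to read the conditional model is as a relative-frequency estimate of "how many English tags of a type appear, given how many Chinese tags of that type appear". The sketch below assumes exactly that; the maximum-likelihood estimator and the input format (one count pair per sentence pair) are illustrative assumptions, not the workshop's parameter estimation.

```python
from collections import Counter

def estimate_fertility(pairs):
    """Estimate P(n_e | n_f) by relative frequency.

    pairs: list of (n_f, n_e), the number of tags of one type on the Chinese
    and English sides of each training sentence pair.
    Returns a dict mapping (n_f, n_e) -> probability.
    """
    joint = Counter(pairs)
    marginal = Counter(n_f for n_f, _ in pairs)
    return {(n_f, n_e): c / marginal[n_f] for (n_f, n_e), c in joint.items()}

# Toy counts: three sentence pairs with one Chinese NP, one with two.
probs = estimate_fertility([(1, 2), (1, 2), (1, 1), (2, 2)])
```

Here `probs[(1, 2)]` plays the role of P(NP_e = 2 | NP_f = 1).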

Tag Fertility - Results
• Not as good as hoped:
• Discussion:
  – Parameter estimation was rather simplistic; obviously related probabilities were calculated independently
  – Fewer free parameters might be tried

Shallow Functions: Projected POS Language Model
• Problem: The word movement model in the baseline system is pretty weak.
• Idea: Since Chinese words are too sparse to model movement, use POS instead
  – Use word alignment to project Chinese POS sequences into English
  – Similar to the HMM alignment model of Vogel, Ney, and Tillmann 1996, but with POS instead of words
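The projection step can be sketched as below: each English position receives the tag of a Chinese word aligned to it, and the resulting tag sequence is scored with an n-gram model. The tie-breaking rule (first aligned Chinese word wins), the `NULL` placeholder for unaligned positions, the bigram order, and the probability floor are all assumptions made for the illustration.

```python
import math

def project_pos(chinese_tags, alignment, english_len, unaligned="NULL"):
    """Project Chinese POS tags onto English positions via a word alignment.

    alignment: list of (chinese_index, english_index) pairs.
    """
    projected = [unaligned] * english_len
    for j, i in sorted(alignment):
        if projected[i] == unaligned:   # keep the first aligned Chinese word
            projected[i] = chinese_tags[j]
    return projected

def bigram_logprob(tags, bigram_probs, floor=1e-6):
    """Score a projected tag sequence with a bigram model over tags."""
    seq = ["<s>"] + list(tags) + ["</s>"]
    return sum(math.log(bigram_probs.get((a, b), floor)) for a, b in zip(seq, seq[1:]))

projected = project_pos(["NN", "VV", "NN"], [(0, 0), (1, 2), (2, 1)], 4)
lm_score = bigram_logprob(["NN"], {("<s>", "NN"): 0.5, ("NN", "</s>"): 0.5})
```

Because the projected sequence follows English word order, a language model over it rewards reorderings that produce tag patterns seen in training.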

Projected POS - Results
• This is a little better...
• Conclusion:
  – Results are better simply because movement handling in the baseline is so poor
  – Strongest-performing of the shallow features
  – Should be investigated further; indicates a possible move from purely word-based models to ones based on shallow syntax

Shallow Functions: Aligned POS-Tag Sequences
• Problem: Alignments in the baseline are computed on the word level; however, lexical item distribution is always sparse
• Idea: Use POS tag sequence alignments instead
  – Replace words in alignment templates with POS tags, and use the following alignment models:
    • Unigram: p(f, e) = product of all p(s_f, s_e)
    • Conditional: p(e, f) = p(f) * product of all p(s_e | s_f)
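The two products can be sketched in log space as follows. The sketch assumes each aligned template has already been reduced to a pair of tag tuples (s_f, s_e), that the probability tables are plain dictionaries, and that unseen pairs receive a small floor value; those choices and the function names are illustrative only.

```python
import math

def unigram_logprob(pairs, joint, floor=1e-6):
    """log p(f, e) = sum over templates of log p(s_f, s_e)."""
    return sum(math.log(joint.get(p, floor)) for p in pairs)

def conditional_logprob(pairs, source_logprob, cond, floor=1e-6):
    """log p(e, f) = log p(f) + sum over templates of log p(s_e | s_f)."""
    return source_logprob + sum(math.log(cond.get(p, floor)) for p in pairs)

# Toy tag-sequence pairs (Chinese tags, English tags) and invented probabilities.
pairs = [(("NN",), ("DT", "NN")), (("VV",), ("VBD",))]
joint = {pairs[0]: 0.1, pairs[1]: 0.2}
cond = {pairs[0]: 0.5, pairs[1]: 0.5}
uni = unigram_logprob(pairs, joint)
con = conditional_logprob(pairs, math.log(0.1), cond)
```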

Aligned POS - Results
• Unigram model: average 31.6
• Conditional model: average 31.4
• Conclusion: Maybe more input is needed for training the models
  – The baseline system does not output alignment information for words translated by rules, so these particular alignments cannot be recovered
  – Performance of these feature functions may improve if the baseline system can be reconfigured to output more alignments

Shallow Syntax Results

Deep Syntactic Feature Functions
• Grammaticality Test of English Parser
  – Parser Probability / Unigram LM Scores
• Syntax-based Translation Models
  – Tree to String
  – Tree to Tree Alignment
• Dependency Tree-to-Tree Alignments

Deep Functions: Overview
• Deep syntactic features depend on parser output
  – Grammaticality is measured by parse trees
• How to use parser output:
  – simple features
  – model-based features
  – dependency-based features
  – other complex features
    • tricky features (Chapter 5, Ethan)

Deep Functions: Grammaticality Test of English Parser
• Idea: Grammatical sentences should have a higher parse probability
  – Try the parse probability of the sentence by itself
  – Try the parse probability of the sentence divided by the unigram probability of the words in the sentence
• Result: Worse than baseline! Guess these probabilities are not really related...
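The second variant, parse probability divided by unigram probability, becomes a difference in log space. The sketch below assumes the parser's log-probability is supplied as a number (no parser is called), that unigram probabilities come from a plain dictionary, and that unseen words get a small floor; the function name is hypothetical.

```python
import math

def normalized_parse_score(parse_logprob, tokens, unigram_probs, floor=1e-7):
    """log( p_parse(sentence) / prod_w p_unigram(w) )."""
    unigram_logprob = sum(math.log(unigram_probs.get(t, floor)) for t in tokens)
    return parse_logprob - unigram_logprob

# Toy numbers, invented for the example.
ratio = normalized_parse_score(math.log(1e-6),
                               ["the", "team"],
                               {"the": 0.05, "team": 0.001})
```

Dividing by the unigram probability is meant to stop sentences from being penalized just for containing rare words, so that the remaining score reflects structure rather than vocabulary.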

Deep Functions: Tree to String Model
• Idea: Incorporate the syntax-based Tree-to-String model as a feature function (Yamada and Knight 2001, 2002)
  – Theta (θ) is the set of reorderings, insertions, and leaf-word translation operations
• Results: Average 31.7 BLEU
• Conclusion:
  – Results are not bad, but this is computationally very expensive! The expense makes it impractical for this model.
  – Try reducing the cost by fragmenting long sentences with a tool called machete; kinks are still being worked out of this tool, but it may be promising

Deep Functions: Tree to Tree Alignment
• Idea: Use the tree alignment probabilities of Gildea (2003) as a feature function
  – Remember, Gildea’s model includes cloning, and many-to-one and one-to-many node mappings
• Experiment:
  – Lexical translation probabilities for leaf nodes were trained using IBM Model 1
  – Some tweaks for performance: max fan-out of 6, max sentence length of 60
• Results: 31.6 BLEU

Deep Functions: Dependency Tree-to-Tree Alignments
• Idea: Try a gaggle of dependency-derived features (listed in the results table, next slide)
  – By representing relationships between words, dependency trees for source and target sentences supposedly have fewer conflicting structures than constituency trees
• Results: Not much different from baseline
• Conclusion: Much of the lack of gain from this approach is probably due to errors in the parsing tools. Fixing these errors would likely improve the results of using this feature.

Dependency Tree Alignments Results