Improving Word-Alignments for Machine Translation Using Phrase-Based Techniques
Mike Rodgers, Sarah Spikes, Ilya Sherman
IBM Model 2 - recap
• Alignments are word-to-word
• Factors considered:
  • the words themselves
  • position within the source and target sentences
• Formally, the probability that the ith word of sentence S aligns with the jth word of sentence T depends on:
  • what S[i] and T[j] are
  • f(i, j, length(S), length(T))
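As a rough illustration of that factorization (not the project's actual code; the table names below are hypothetical), the Model 2 score for a candidate link multiplies a translation term by a position term:

    # Sketch of an IBM Model 2 style score for aligning S[i] to T[j].
    # t_prob and dist_prob stand in for the learned translation and
    # distortion tables; both names are illustrative.
    def model2_score(S, T, i, j, t_prob, dist_prob):
        translation = t_prob[(S[i], T[j])]               # depends on the words themselves
        distortion = dist_prob[(i, j, len(S), len(T))]   # f(i, j, length(S), length(T))
        return translation * distortion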
Introducing Phrases
• Groups of words tend to translate as a unit (i.e., a “phrase”).
• IBM Model 2 has no notion of this.
• We began with a working IBM Model 2 word aligner (from PA 2) and looked at three ways to extend this model using the notion of phrases.
Technique 1: Nearby Neighbors
• Ideal: instead of measuring displacement relative to the diagonal, measure displacement relative to the previous alignment.
• This is hard: to be efficient, EM assumes that all alignments are independent.
  • Referring to “the previous alignment” has no meaning.
• We get around this by means of a weaker dependency.
• For the likelihood of aligning S[i] to T[j]:
  • don't ask whether S[i-1] is actually aligned to T[j-1];
  • ask whether aligning S[i-1] to T[j-1] would be a good alignment.
Technique 1: Nearby Neighbors
• Suppose we have P(S, T, i, j) that returns the probability that S[i] aligns with T[j].
• Define P'(S, T, i, j) = λ1 · P(S, T, i, j) + λ2 · P(S, T, i-1, j-1), with λ1 = 0.95, λ2 = 0.05
• Use this distribution both in the EM phase and in computing final results.
• We also tried a variety of similar models.
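A minimal sketch of this interpolation, assuming some base alignment score function score(S, T, i, j) such as the Model 2 score sketched earlier (the 0.95/0.05 weights are the ones quoted on the slide; everything else is illustrative):

    # Smoothed alignment score: mix the score for (i, j) with the score the
    # model would give the neighboring pair (i-1, j-1).
    LAMBDA_1, LAMBDA_2 = 0.95, 0.05

    def neighbor_smoothed_score(S, T, i, j, score):
        base = score(S, T, i, j)
        if i == 0 or j == 0:
            # No neighboring pair exists at the sentence boundary.
            return base
        return LAMBDA_1 * base + LAMBDA_2 * score(S, T, i - 1, j - 1)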
Technique 1: Nearby Neighbors Results
• When added to Model 2, provided only a slight improvement in the quality of final results.
• Provided a massive speedup in EM convergence
  • pre-encoding information that would otherwise have to be learned
• When added to Model 1, provided a notable improvement in the quality of results
  • the model adds information, but most of that information is already captured by Model 2
Technique 2: Beam Search
• The IBM models had a slightly different solution
• IBM Model 2 penalized alignments of S[i] to T[j] that had higher displacements d(S[i], T[j]) from the diagonal
  • Since phrases tend to move together, each word in the phrase incurs the penalty
• So, IBM Model 4 instead penalizes alignments of S[i] to T[j] that have a high displacement relative to the alignment of S[i − 1]
  • Thus, only the first word in each phrase is penalized.
Technique 2: Beam Search
• But to know where the previous source word was aligned, we need to keep track of each partial alignment for the sentence
• We cannot afford to evaluate every possible alignment (exponential in the length of the sentence)
• Instead, we can maintain a beam of the n best alignments for the previous word.
Technique 2: Beam Search
• To assess a penalty for aligning S[i] to T[j], we compute d'(S[i], T[j]) as the minimal displacement measured
  • either absolutely from the diagonal, or
  • relative to one of the previous n best alignments.
• The two cases represent a new and an old phrase, respectively
• Formally, d'(S[i], T[j]) = min( d(S[i], T[j]), min over 1 ≤ m ≤ n of |d(S[i], T[j]) − d(S[i − 1], T[k_m])| ), where T[k_m] is the mth best alignment for S[i − 1]
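A sketch of that penalty, assuming d measures absolute displacement from the sentence diagonal and beam holds the n best target positions chosen for S[i − 1] (the names and the exact form of the relative term are assumptions, not the project's code):

    # d_prime: the smaller of (a) the absolute displacement from the diagonal
    # ("new phrase") and (b) the displacement relative to each of the n best
    # alignments kept for the previous source word ("old phrase").
    def d_prime(i, j, src_len, tgt_len, beam):
        def d(i_, j_):
            # Absolute displacement of (i_, j_) from the sentence diagonal.
            return abs(j_ - i_ * tgt_len / src_len)

        best = d(i, j)
        for k_m in beam:
            best = min(best, abs(d(i, j) - d(i - 1, k_m)))
        return best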
Technique 2: Beam Search Results
• In practice, n = 2 worked best
  • Gives just enough context without blurring distinctions between phrases
• Resulted in more than a 20% improvement in AER
• Combined with the nearby-neighbors approach, gives a massive speedup as well
Technique 3: Phrase Pre-chunking
• Another idea was to find common phrases in each language and store them as a set
• Take the sentences, and whenever we see one of our common phrases, treat it as a single word and run Model 2 as usual
• Ideally, we would find phrases of any length, taking the most probable phrases over the sentence as our chunks
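For illustration only, a greedy left-to-right pre-chunker over bigrams might look like the sketch below; the phrase set and the token-merging scheme are assumptions about the approach described, not the actual implementation.

    # Replace known bigram phrases with single merged tokens before alignment.
    def prechunk(sentence, phrase_set, sep="_"):
        chunked, i = [], 0
        while i < len(sentence):
            bigram = tuple(sentence[i:i + 2])
            if len(bigram) == 2 and bigram in phrase_set:
                chunked.append(sep.join(bigram))  # treat the bigram as one "word"
                i += 2
            else:
                chunked.append(sentence[i])
                i += 1
        return chunked

For example, prechunk(["the", "prime", "minister", "spoke"], {("prime", "minister")}) would yield ["the", "prime_minister", "spoke"], which the aligner can then treat as a three-word sentence.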
Technique 3: Phrase Pre-chunking Implementation issues
• We began by just using bigrams as our phrases, for simplicity
• However, we found that this did not work well with our pre-existing Model 2 code
  • The function to get the best word alignment expects alignments based on the original sentences' indices
• We need to pre-chunk the sentences to get any meaningful results based on our training
  • This destroys the original indices, so we have to either store the old sentence or reconstruct the indices as we go
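One possible way to handle the index problem, sketched here as an assumption rather than what the project did: have the pre-chunker also record the span of original indices each chunked token covers, then expand chunk-level alignments back into word-level ones.

    # Expand chunk-level alignments back to original word indices.
    # spans[c] is the range of original source indices covered by chunk c,
    # recorded while pre-chunking (hypothetical bookkeeping, for illustration).
    def expand_alignments(chunk_alignment, spans):
        word_alignment = []
        for c, t in chunk_alignment:        # chunk c aligned to target position t
            for orig_i in spans[c]:
                word_alignment.append((orig_i, t))
        return word_alignment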
Technique 3: Phrase Pre-chunking Ideas for Improvement
• Expanding to N-grams
• Finding the best bigrams/N-grams rather than just the first one we see that is in our “good enough” set
  • Once we had a bigram, we tried checking whether the second word and the following word made a “better” bigram, and if so, used that one instead
  • This could potentially be improved upon with better techniques, though it would obviously be more complicated with longer N-grams
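The one-step lookahead described above could be sketched as follows, assuming some bigram scoring function (hypothetical, just to make the idea concrete):

    # Greedy lookahead: prefer the overlapping bigram starting one word later
    # if it is also in the phrase set and scores higher than the one starting here.
    def pick_bigram_start(sentence, i, phrase_set, score):
        here = tuple(sentence[i:i + 2])
        nxt = tuple(sentence[i + 1:i + 3])
        if len(nxt) == 2 and nxt in phrase_set and score(nxt) > score(here):
            return i + 1
        return i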
Summary
• Nearby neighbors approach:
  • Massive speed-up
• Beam Search
  • 20% AER improvement
• Combined neighbors and beam
  • Both improvements were maintained (speed and AER)
• Phrase Pre-Chunking
  • Good idea for further exploration