Sentence Alignment of Parallel and Comparable Bilingual Corpora

David Huggins-Daines (dhuggins@cs.cmu.edu)
January 26, 2006
11-734 Advanced Seminar in Machine Translation
Overview
• The sentence alignment problem
• Fundamental techniques
  • Dynamic programming
  • Length-based and lexical methods
• Comparison of standard techniques
• New techniques using comparable monolingual corpora
Sentence Alignment
• Matching chunks of translated text
• An alignment consists of “beads”
  • Each bead contains zero or more translated sentences from each language
• Goal: find the most likely sequence of beads
• An obvious application of dynamic programming
Sentence Alignment
[diagram: example bead sequence with bead types 2:2, 1:1, and 2:1]
Dynamic Programming
• Familiar technique from speech recognition
• Recursive definition of alignment cost
• Build a matrix of costs and backpointers
  • Axes are the sentences in each language
  • Each cell stores the lowest cost D(i, j) and its associated alignment
• Backtrace to obtain the best alignment
Dynamic Programming
[diagram: cost matrix D(i, j) with axes f1–f5 and e1–e5; cells labeled with bead types 2:2, 1:2, 2:1, 1:0, 0:1]
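The recurrence on the matrix slide can be sketched as follows. This is a minimal illustration, not any particular paper's implementation; `bead_cost` is a hypothetical stand-in for a length-based or lexical scoring model.

```python
# Bead types: (source sentences consumed, target sentences consumed)
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2), (2, 2)]

def align(n_src, n_tgt, bead_cost):
    """Fill D(i, j) = min over bead types of D(i-a, j-b) + cost, storing a
    backpointer per cell, then backtrace to recover the best bead sequence.
    `bead_cost((a, b), i, j)` scores the bead ending at position (i, j)."""
    INF = float("inf")
    D = [[INF] * (n_tgt + 1) for _ in range(n_src + 1)]
    back = [[None] * (n_tgt + 1) for _ in range(n_src + 1)]
    D[0][0] = 0.0
    for i in range(n_src + 1):
        for j in range(n_tgt + 1):
            for a, b in BEADS:
                if i >= a and j >= b and D[i - a][j - b] < INF:
                    cost = D[i - a][j - b] + bead_cost((a, b), i, j)
                    if cost < D[i][j]:
                        D[i][j] = cost
                        back[i][j] = (a, b)
    # Backtrace from the final cell to (0, 0)
    beads, i, j = [], n_src, n_tgt
    while (i, j) != (0, 0):
        a, b = back[i][j]
        beads.append((a, b))
        i, j = i - a, j - b
    return list(reversed(beads))
```

With a cost function that favors 1:1 beads, aligning two sentences against two yields two 1:1 beads, as expected.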
Length-based methods
• Paragraph lengths are highly correlated across languages
• Assume the same holds for sentences
• Length expressed in characters or words
• Gale and Church (1993)
  • Each character in L1 generates a random number of characters in L2
  • Assume these are i.i.d. with distribution Normal(c, σ²)
Gale and Church (1993)
• Normalized length distance δ
  • δ = (l2 − l1·c) / sqrt(l1·σ²) (i.e. a z-score)
• P(match | δ) ∝ P(match) P(δ | match)
  • P(match) estimated from hand-aligned data
  • P(δ | match) estimated as the probability mass of a Normal(0, 1) distribution outside ±|δ|
• DP cost D(i, j) = −log P(match) P(δ | match)
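The cost on this slide is small enough to write out directly. A minimal sketch, using c ≈ 1 and σ² ≈ 6.8 as the character-length parameters reported by Gale and Church; the prior `p_match` would come from hand-aligned data per bead type.

```python
import math

def gale_church_cost(l1, l2, p_match=0.89, c=1.0, s2=6.8):
    """-log P(match) P(delta | match) for one candidate bead, given the
    character lengths l1 and l2 of the source and target chunks."""
    delta = (l2 - l1 * c) / math.sqrt(l1 * s2)  # z-score of length difference
    # Two-tailed Normal(0, 1) mass beyond |delta|, via the error function
    tail = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    tail = max(tail, 1e-300)  # guard against log(0) for extreme deltas
    return -math.log(p_match * tail)
```

Equal-length chunks get δ = 0 and the minimum cost; a chunk twice as long as its counterpart is heavily penalized.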
Gale and Church (1993)
• Precision is often more important than recall for alignment
  • Better alignments rather than more
• Is the alignment score predictive of error?
  • If so, the top-scoring part of the corpus can be used when high precision is needed
  • Top 80% of the corpus has 0.4% error
Lexical methods
• Chen (1993), Kay & Röscheisen (1993)
• Use a simple translation model to estimate P(match)
  • Chen uses beam search
  • Kay & Röscheisen use a “relaxation” algorithm (simulated annealing)
• Lexical methods are typically much slower than length-based methods
  • Not as much of a problem 13 years later
• Iterative refinement is possible
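The “simple translation model” behind these lexical scores is essentially IBM Model 1. A toy sketch, assuming a word-translation table `t` keyed by (target, source) pairs — the table and smoothing floor here are illustrative, not from either paper:

```python
import math

def model1_score(src, tgt, t, epsilon=1.0):
    """Log of IBM Model 1 P(tgt | src): each target word is generated by
    averaging t(f | e) over the source words plus a NULL token."""
    src = ["<NULL>"] + src
    logp = math.log(epsilon) - len(tgt) * math.log(len(src))
    for f in tgt:
        # Smooth unseen pairs with a tiny floor to avoid log(0)
        logp += math.log(sum(t.get((f, e), 1e-9) for e in src))
    return logp
```

A sentence pair whose words translate each other under `t` scores far higher than a mismatched pair, which is what lets the model estimate P(match).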
Comparison
• Gale and Church report 4% alignment error
• Chen reports approximately 0.6% error
• Chen’s algorithm is one order of magnitude slower than Gale and Church’s
  • Kay & Röscheisen’s is several orders of magnitude slower
• ARCADE, Singh and Husain (2005), and Rosen (2005) give comparisons of multiple systems
Non-parallel corpora
• Training data may be out of the evaluation domain
  • Can we use monolingual in-domain data?
• Zhao & Vogel (2002)
  • Comparable stories mined from Xinhua newswire
  • Dynamic-programming alignment
  • Scores from IBM Model 1 and sentence length
  • Combined using maximum likelihood
Non-parallel corpora
• Munteanu, Fraser, and Marcu (2005)
• Discard the “alignment” approach
  • Use a stronger classifier and consider all possible sentence pairs
• Limited to 1:1 alignments
  • Not necessarily a problem in practice
• Classify using word-alignment results
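“Consider all possible sentence pairs” can be sketched in a few lines. This is a schematic of the extraction loop only; `classify` is a hypothetical function standing in for the word-alignment-based classifier, returning P(parallel | pair).

```python
from itertools import product

def candidate_pairs(src_sents, tgt_sents, classify, threshold=0.5):
    """Score every (source, target) sentence pair with a classifier and
    keep those judged to be translations (1:1 pairs only)."""
    return [(s, t) for s, t in product(src_sents, tgt_sents)
            if classify(s, t) >= threshold]
```

Note the quadratic number of pairs — in practice Munteanu et al. first narrow the candidates (e.g. with document matching) before classifying.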
Word alignments
[diagram: example word alignment between a sentence pair]
Fraser et al. (2005)
• Combine word-alignment scores and other scores using a maximum-entropy model
  • Classifies a pair of sentences as translatable or not
• Used the resulting corpus both alone and in combination with a baseline model
• Compared in-domain and out-of-domain training data for the classifier
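For a binary decision, a maximum-entropy model reduces to logistic regression over the features. A minimal sketch of the combination step; the feature names and weights are illustrative, not the ones used in the paper.

```python
import math

def maxent_translatable(features, weights, bias=0.0):
    """Binary maximum-entropy classifier: a weighted sum of feature values
    (e.g. word-alignment coverage, length ratio) squashed through a
    logistic function to give P(translatable | sentence pair)."""
    z = bias + sum(weights[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

With a positive weight on an alignment-coverage feature, pairs whose words align well get probabilities near 1 and poorly aligned pairs fall below the decision threshold.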
Fraser et al. (2005)
• Compared extracted data to a high-quality parallel corpus from the same domain
• In-domain classifier works almost as well as out-of-domain classifier
• Extracted data improves translation, but not as much as the high-quality data
• Training the classifier on the baseline corpus works just as well
Conclusions
• Basic alignment models are all roughly the same
  • Computational efficiency may be more important, hence the popularity of Gale and Church’s algorithm
  • Of course, they also provide source code
• IR techniques allow the use of non-parallel corpora
  • Important for out-of-domain evaluation data
  • Potentially important for new language pairs
  • Still requires a “bootstrap” set to train the classifier