Machine Translation III
Empirical approaches to MT: Example-based MT, Statistical MT
http://personalpages.manchester.ac.uk/staff/harold.somers/LELA30431/chapter50.pdf
http://www.statmt.org/
Introduction
• Empirical approaches: what does that mean?
  – Empirical vs rationalist
  – Data-driven vs rule-driven
• Pure empiricism: statistical MT
• Hybrid empiricism: Example-based MT
Empirical approaches
• Approaches based on pure data
• Contrast with the “rationalist” approach: rule-based systems of the “2nd generation”
• Larger storage, faster processors, and the availability of textual data in huge quantities suggest a data-driven approach may be possible
• “Data” here means just raw text
Flashback
• Early thoughts on MT (Warren Weaver, 1949) included the possibility that translation was like code-breaking (cryptanalysis)
• Weaver – with Claude Shannon – invented “information theory”
• Given enough data, patterns could be identified and applied to new text
Back to the future
• Data-driven approach encouraged by the availability of machine-readable parallel text: notably, at first, the Canadian and Hong Kong Hansards, then EU documents and dual-language web pages
• Two basic approaches:
  – Statistical MT
  – Example-based MT
Example-based MT
• “Translation by analogy”
• First proposed by Nagao (1984) but not implemented until the early 1990s
• Very intuitive: translate text by recognising bits that have been previously translated and sticking them together
  – Cf. the tourist phrasebook approach
Example-based MT
• Like an extension of Translation Memory
• Based on a database of translation examples
• The system finds closely matching previous example(s)
• (Unlike TM) it identifies the corresponding fragments in the target text(s) (alignment)
• And recombines them to give the target text
Example (Sato & Nagao 1990)
Input:   He buys a book on international politics
Matches: He buys a notebook. → Kare wa nōto o kau.
         I read a book on international politics. → Watashi wa kokusai seiji nitsuite kakareta hon o yomu.
Result:  Kare wa kokusai seiji nitsuite kakareta hon o kau.
Learning templates
The monkey ate a peach. → saru wa momo o tabeta.
The man ate a peach. → hito wa momo o tabeta.
  ⇒ monkey ~ saru, man ~ hito
  ⇒ The … ate a peach. → … wa momo o tabeta.
The dog ate a rabbit. → inu wa usagi o tabeta.
  ⇒ dog ~ inu, rabbit ~ usagi
  ⇒ The … ate a … . → … wa … o tabeta.
  ⇒ The dog ate a peach. → inu wa momo o tabeta.
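The abstraction step above can be sketched in a few lines of Python. This is a toy illustration, not the algorithm of any particular EBMT system: `learn_template` is a hypothetical helper that handles only the simplest case, where two equal-length example pairs differ in exactly one word on each side.

```python
# Toy EBMT template learning: given two example pairs that differ in one
# word on each side, record the word correspondences and replace the
# differing words with a shared slot marker "<X>".
def learn_template(pair1, pair2):
    """Each pair is (source_tokens, target_tokens)."""
    (s1, t1), (s2, t2) = pair1, pair2
    if len(s1) != len(s2) or len(t1) != len(t2):
        return None  # this sketch only handles equal-length examples
    s_diff = [i for i, (a, b) in enumerate(zip(s1, s2)) if a != b]
    t_diff = [j for j, (a, b) in enumerate(zip(t1, t2)) if a != b]
    if len(s_diff) != 1 or len(t_diff) != 1:
        return None  # need exactly one differing word per side
    i, j = s_diff[0], t_diff[0]
    lexicon = {(s1[i], t1[j]), (s2[i], t2[j])}
    src_tpl = s1[:i] + ["<X>"] + s1[i + 1:]
    tgt_tpl = t1[:j] + ["<X>"] + t1[j + 1:]
    return lexicon, src_tpl, tgt_tpl

lexicon, src_tpl, tgt_tpl = learn_template(
    ("the monkey ate a peach".split(), "saru wa momo o tabeta".split()),
    ("the man ate a peach".split(), "hito wa momo o tabeta".split()),
)
print(src_tpl)  # ['the', '<X>', 'ate', 'a', 'peach']
print(tgt_tpl)  # ['<X>', 'wa', 'momo', 'o', 'tabeta']
```

New templates can then be instantiated with other lexicon entries, e.g. dog ~ inu, which is how the unseen "The dog ate a peach" becomes translatable.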
Some problems include …
• Source of examples
  – Genuine text or hand-crafted?
• Identifying matching fragments
  – Preprocessed
    • storage implications
    • prejudges what will be useful
  – “On the fly”
    • needs a dictionary
• Partial matching
• Sticking fragments together (boundary friction)
• Conflicting/multiple examples
Partial matching
Input: The operation was interrupted because the file was hidden.
a. The operation was interrupted because the Ctrl-c key was pressed.
b. The specified method failed because the file is hidden.
c. The operation was interrupted by the application.
d. The requested operation cannot be completed because the disk is full.
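One way to rank such partial matches is by surface similarity. The sketch below uses Python's standard-library `difflib` as the similarity measure; this is just one simple metric chosen for illustration — real EBMT systems typically use richer, often linguistically informed measures.

```python
# Rank stored examples by character-level similarity to the input,
# using difflib's ratio (2 * matched_chars / total_chars).
import difflib

examples = [
    "The operation was interrupted because the Ctrl-c key was pressed.",
    "The specified method failed because the file is hidden.",
    "The operation was interrupted by the application.",
    "The requested operation cannot be completed because the disk is full.",
]

def rank_matches(query, candidates):
    scored = [(difflib.SequenceMatcher(None, query, c).ratio(), c)
              for c in candidates]
    return sorted(scored, reverse=True)

query = "The operation was interrupted because the file was hidden."
for score, sentence in rank_matches(query, examples):
    print(f"{score:.2f}  {sentence}")
```

Note that the best surface match is not necessarily the most useful one for translation: a candidate sharing a long prefix may contribute less reusable material than one sharing the right fragment.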
Boundary friction (1)
• Consider again: He buys a book on politics
Matches: He buys a notebook. → Kare wa nōto o kau.
         He buys a pen. → Kare wa pen o kau.
         I read a book on politics. → Watashi wa seiji nitsuite kakareta hon o yomu.
         She wrote a book on politics. → Kanojo wa seiji nitsuite kakareta hon o kaita.
Result:  Kare wa wa seiji nitsuite kakareta hon o o kau.
         Kare wa wa seiji nitsuite kakareta hon o kau.
Boundary friction (2)
Input: The handsome boy entered the room
Matches: The handsome boy ate his breakfast. → Der schöne Junge aß sein Frühstück.
         I saw the handsome boy. → Ich sah den schönen Jungen.
• The second match supplies the accusative den schönen Jungen, but as the subject of the new sentence German requires the nominative der schöne Junge
Competing examples
In closing, I will say that I am sad for workers in the airline industry. → En terminant, je dirai que c’est triste pour les travailleurs et les travailleuses du secteur de l’aviation.
My colleague spoke about the airline industry. → Mon collègue a parlé de l’industrie du transport aérien.
People in the airline industry have become unemployed. → Des gens de l’industrie aérienne sont devenus chômeurs.
This tax will cripple some of the small companies in the airline industry. → Cette surtaxe va nuire aux petits transporteurs aériens.
Results from the Canadian Hansard using TransSearch
Statistical MT
• Pioneered by IBM in the early 1990s
• Spurred on by the greater success in speech recognition of statistical over linguistic rule-based approaches
• The idea is that translation can be modelled as a statistical process
• Seems to work best in a limited domain, where the given data is a good model of future translations
Translation as a probabilistic problem
• For a given SL sentence Si, there are a number of “translations” T of varying probability
• The task is to find, for Si, the sentence Tj for which the probability P(Tj | Si) is highest
Two models
• P(Tj | Si) is a function of two models:
  – The probabilities of the individual words that make up Tj given the individual words in Si – the “translation model”
  – The probability that the individual words that make up Tj are in the appropriate order – the “language model”
Expressed in mathematical terms:

  T̂ = argmax_T P(T | S) = argmax_T P(S | T) · P(T) / P(S)

Since S is given, and P(S) is therefore constant, this can be simplified to:

  T̂ = argmax_T P(S | T) · P(T)

where P(T) is the language model and P(S | T) is the translation model.
So how do we translate?
• For a given input sentence Si we need a practical way to find the Tj that maximizes the formula
• We have to start somewhere, so we start with the translation model: which words look most likely to help us?
• In a systematic way, we keep trying different combinations, together with the language model, until we stop getting improvements
Input sentence → Translation model → Bag of possible words → Language model → Most probable translation
(Seek improvement by trying other combinations)
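The loop above can be sketched as a toy decoder. All probability tables here are invented for illustration, and the search is exhaustive rather than heuristic (real decoders cannot enumerate every combination): each choice of target words is scored by the translation model, each ordering by the language model, and the best product wins.

```python
# Toy decoder: translation-model probability * language-model probability,
# maximized over word choices and word orders. Tables are invented.
import itertools

translation = {"maison": {"house": 0.8, "home": 0.2},
               "bleue": {"blue": 0.9, "blueish": 0.1}}
bigram = {("<s>", "blue"): 0.4, ("blue", "house"): 0.5, ("house", "</s>"): 0.6,
          ("<s>", "house"): 0.3, ("house", "blue"): 0.1, ("blue", "</s>"): 0.2}

def lm_prob(words, default=1e-4):
    """Bigram language-model probability of a word sequence."""
    p = 1.0
    for a, b in zip(["<s>"] + words, words + ["</s>"]):
        p *= bigram.get((a, b), default)
    return p

def decode(source_words):
    """Try every choice of target word and every ordering; keep the best."""
    best_score, best_words = 0.0, None
    choices = [list(translation[w].items()) for w in source_words]
    for picks in itertools.product(*choices):
        tm = 1.0
        words = []
        for w, p in picks:
            tm *= p
            words.append(w)
        for order in itertools.permutations(words):
            score = tm * lm_prob(list(order))
            if score > best_score:
                best_score, best_words = score, list(order)
    return best_score, best_words

score, result = decode(["maison", "bleue"])
print(result)  # ['blue', 'house']
```

Note how the language model both picks the likelier words (house over home) and fixes the order, reversing the French adjective–noun order.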
Where do the models come from?
• All the statistical parameters are precomputed, based on a parallel corpus
• The language model gives probabilities of word sequences (n-grams)
• The translation model is derived from an aligned parallel corpus
The translation model
• Take a sentence-aligned parallel corpus
• Extract the entire vocabulary for both languages
• For every word pair, calculate the probability that they correspond
  – e.g. by comparing distributions
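A crude version of "comparing distributions" can be sketched as co-occurrence counting over a toy corpus (invented for illustration): how often does a target word appear in the translations of sentences containing a given source word? Real systems refine such counts iteratively, e.g. with the EM-trained IBM models.

```python
# Crude word-correspondence scores from a sentence-aligned toy corpus:
# P(t corresponds to s) estimated as co-occurrence(s, t) / count(s).
from collections import Counter, defaultdict

corpus = [
    ("the house".split(), "la maison".split()),
    ("the blue house".split(), "la maison bleue".split()),
    ("the flower".split(), "la fleur".split()),
]

cooc = defaultdict(Counter)
src_count = Counter()
for src, tgt in corpus:
    for s in set(src):
        src_count[s] += 1
        for t in set(tgt):
            cooc[s][t] += 1

def p_correspond(s, t):
    return cooc[s][t] / src_count[s]

print(p_correspond("house", "maison"))  # 1.0: always co-occur
print(p_correspond("house", "fleur"))   # 0.0: never co-occur
```

Note the weakness this exposes: "house" also co-occurs with "la" in every sentence, so raw co-occurrence alone cannot separate true correspondences from frequent function words; that is what the iterative re-estimation is for.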
Some obvious problems
• “Fertility”: not all word correspondences are 1:1
  – Some words have multiple possible translations, e.g. the → {le, la, l’, les}
  – Some words have no translation, e.g. se in il se rase ‘he shaves’
  – Some words are translated by several words, e.g. cheap → peu cher
  – It is not always obvious how to align
The proposal will not now be implemented
Les propositions ne seront pas mises en application maintenant

The ~ Les
proposal ~ propositions
will not ~ ne seront pas   (not ~ ne … pas; will ~ seront)
now ~ maintenant
be ~ (nothing)
implemented ~ mises en application

many:many alignments are not allowed; only 1:n (n ≥ 0), and in practice n < 3
Some word-pair probabilities from the Canadian Hansard

‘the’      P          fertility   P
le         .610       1           .871
la         .178       0           .124
l’         .083       2           .004
les        .023
ce         .012
il         .009
de         .007
à          …
que        …

‘not’      P          fertility   P
pas        .469       2           .758
ne         .460       0           .133
non        .024       1           .106
faux       .002
plus, ce, que, jamais  …

‘hear’     P          fertility   P
bravo      .992       0           .584
entendre   .005       1           .416
entendu    .002
entende    .001
Another problem: distortion
• Notice that corresponding words do not appear in the same order
• The translation model includes probabilities for “distortion”
  – e.g. P(5 | 2): the probability that ws in position 2 will produce a wt in position 5
  – Can be more complex, e.g. P(5 | 2, 4, 6): the probability that ws in position 2 will produce a wt in position 5 when S has 4 words and T has 6
The language model
• Impractical to calculate the probability of every word sequence:
  – Many will be very improbable …
  – because they are ungrammatical
  – or because they happen not to occur in the data
• Probabilities of sequences of n words (“n-grams”) are more practical
  – Bigram model: P(wi | wi–1) ≈ f(wi–1, wi) / f(wi–1)
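The bigram estimate can be computed directly from counts, as in this minimal sketch over an invented three-sentence corpus:

```python
# Maximum-likelihood bigram model: P(w2 | w1) = count(w1 w2) / count(w1),
# with <s> and </s> marking sentence boundaries.
from collections import Counter

sentences = ["the dog ate a peach", "the monkey ate a peach", "the dog ran"]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    words = ["<s>"] + s.split() + ["</s>"]
    unigrams.update(words[:-1])            # every word that starts a bigram
    bigrams.update(zip(words[:-1], words[1:]))

def p_bigram(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p_bigram("the", "dog"))  # 2/3: "the" occurs 3 times, "the dog" twice
print(p_bigram("ate", "a"))    # 1.0: "ate" is always followed by "a"
```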
Sparse data
• Relying on n-grams with a large n risks zero probabilities
• Bigrams are less risky but sometimes not discriminatory enough
  – e.g. I hire men who is good pilots
• 3- or 4-grams allow a nice compromise, and if a 3-gram is previously unseen, we can give it a score based on the component bigrams (“smoothing”)
Put it all together and …?
• To build a statistical MT system we need:
  – An aligned bilingual corpus
  – “Training programs” which extract from the corpora all the statistical data for the models
  – A “decoder” which takes a given input and seeks the output that maximizes the argmax formula, based on a heuristic search algorithm
• Software for this purpose is freely available (e.g. …)
• The claim is that an MT system for a new language pair can be built in a matter of hours
SMT: latest developments
• Nevertheless, quality is limited
• SMT researchers quickly learned (just as in the 1960s) that this crude approach can get them only so far (quite far, actually), but that to go the extra distance you need linguistic knowledge (e.g. morphology, “phrases”, constituents)
• The latest developments aim to incorporate this
• The big difference is that it too can be LEARNED (automatically) from corpora
• So SMT still contrasts with traditional RBMT, where rules are “hand-coded” by linguists