Introduction to Machine Translation Mitch Marcus CIS 530


Some slides adapted from slides by John Hutchins, Bonnie Dorr, and Martha Palmer
CIS 530 - Intro to NLP

Why use computers in translation?
· Too much translation for humans
· Technical materials too boring for humans
· Greater consistency required
· Need results more quickly
· Not everything needs to be top quality
· Reduce costs
· Any one of these may justify machine translation or computer aids

The Early History of NLP (Hutchins): MT in the 1950s and 1960s
· Sponsored by government bodies in USA and USSR (also CIA and KGB)
  • assumed goal was fully automatic quality output (i.e. of publishable quality) [dissemination]
  • actual need was translation for information gathering [assimilation]
· Survey by Bar-Hillel of MT research:
  • criticised assumption of FAHQT (fully automatic high-quality translation) as goal
  • demonstrated ‘non-feasibility’ of FAHQT (without ‘unrealisable’ encyclopedic knowledge bases)
  • advocated “man-machine symbiosis”, i.e. HAMT (human-aided machine translation) and MAHT (machine-aided human translation)
· ALPAC 1966, set up by disillusioned funding agencies
  • compared latest systems with early unedited MT output (IBM-GU demo, 1954), criticised for still needing post-editing
  • advocated machine aids, and no further support of MT research
  • but failed to identify the actual needs of funders [assimilation]
  • therefore failed to see that output of the IBM-USAF Translator and Georgetown systems was used and appreciated

Consequences of ALPAC
· MT research virtually ended in US
· identification of actual needs
  • assimilation vs. dissemination
· recognition that ‘perfectionism’ (FAHQT) had neglected:
  • operational factors and requirements
  • expertise of translators
  • machine aids for translators
· henceforth three strands of MT:
  • translation tools (HAMT, MAHT)
  • operational systems (post-editing, controlled languages, domain-specific systems)
  • research (new approaches, new methods)
· computational linguistics born in the aftermath

Machine Translation (Pass 0 – From Intro Lectures)

Why use computers in translation?
· Too much translation for humans
· Technical materials too boring for humans
· Greater consistency required
· Need results more quickly
· Not everything needs to be top quality
· Reduce costs
· Any one of these may justify machine translation or computer aids
(next several slides adapted from Language Weaver)

Statistical Machine Translation Technology
[Diagram: statistical analysis of Spanish/English bilingual text and of English text. Spanish input: "Que hambre tengo yo" → broken English candidates: "What hunger have I", "Hungry I am so", "I am so hungry", "Have I that hunger", … → output: "I am so hungry"]

How A Statistical MT System Learns

Translating a New Document

Language Weaver v.2.0, v.2.4, v.3.0
Source: Aljazeera, January 8, 2005

Translingual Chat – Instant Messaging
[Screenshot panels: Original, Translation]

Language Weaver (Al Jazeera 8/2007)
LanguageWeaver Demo Website

Language Weaver Hybrid Translation Technology
· Chinese Source Text Sample 1:
�展,一向是衡量一个国家汽�消��状和市�潜力的 “晴雨表”。本届北京国��展有 24个国家的1200余家厂商参展,8天接待40余万名参� 者,�下了中国�展的新��,�人深切地感受到汽�市�启�的信号。 “中国是世界最后一个最大的汽�市� ”。多年来,�句�更多地包含着汽�商人的一种希冀。然而如今,越来越多的事��示着它正在�� �� 。 来自本届�展的一�数据很有�服力。《北京青年�》的一份�����示, 35�以下参�者�占 35%; 62.1%的被�者表示,参��展 主要是�近期��搜集信息,甚至在展会上就有可能��或��合适的�品; 76%的被�者表示最近两年会��私家�。 今年以来,国内��市�的��增��厂家喜上眉梢。据国家��局公布的数字,前 4个月,全国共生��� 26.79万�,增� 27.6%;特 �是 4月份,生 ��� 9万�,同比增� 50.5%,�造了十几年来��月�增�的最高��。从�售看,一季度,全国��生�企�共�售 �� 18.8万�,同比增� 22%,��率达 105%;���存比年初下降 1.1万�,下降幅度近 25%。
· Language Weaver Experimental Syntax MT Sample 1:
The motor show, has always been the 'barometer' of a national car consumption and market potential. The Beijing International Auto Show has more than 1,200 exhibitors from 24 countries and 8 days of receiving more than 40 million visitors, setting a new record in China's auto show, are deeply aware of the automobile market signals. "China is one of the largest automobile market in the world. Over the years, this phrase implies more auto businessmen. But now, more and more facts indicates that it is to become a reality. Data from the Motor Show is very convincing. The Beijing Qingnian Bao Report on-the-spot investigation showed that about 35 percent of 35-year-old visitors, 62.1 percent of the respondents said that the truck was mainly to buy a car in the near future to collect information, even at the exhibition may purchase or suitable products; 76% of respondents indicated in the past two years to buy private cars. Since the beginning of this year, the strong growth of the domestic car market. According to the figures released by the National Bureau of Statistics, in the first four months, the country produced 267,900 vehicles, up 27.6 percent; in particular, in April, the production of 90,000 vehicles, an increase of 50.5% over the same period last year, setting a record high for the monthly output growth over the past 10-odd years. In terms of sales in the first quarter, manufacturing enterprises in the country sold 188,000 cars, up 22 percent over the same period of last year, up 10.5 percent; 11,000 vehicles, dropping by nearly 25 percent lower than the beginning of the year.

Broadcast Monitoring
BBN MAPS & Language Weaver MT

Three MT Approaches: Direct, Transfer, Interlingual (Vauquois triangle)
[Diagram: Vauquois triangle]
· Analysis side, bottom to top: Source Text, Morphological Analysis, Word Structure, Syntactic Structure, Semantic Analysis, Semantic Structure, Semantic Composition, Interlingua
· Generation side, top to bottom: Interlingua, Semantic Decomposition, Semantic Structure, Semantic Generation, Syntactic Structure, Syntactic Generation, Word Structure, Morphological Generation, Target Text
· Links across the triangle: Direct, Syntactic Transfer, Semantic Transfer

Examples of Three Approaches
· Direct:
  • I checked his answers against those of the teacher → Yo comparé sus respuestas a las de la profesora
  • Rule: [check X against Y] → [comparar X a Y]
· Transfer:
  • Ich habe ihn gesehen → I have seen him
  • Rule: [clause agt aux obj pred] → [clause agt aux pred obj]
· Interlingual:
  • I like Mary → Mary me gusta a mí
  • Rep: [BeIdent (I [ATIdent (I, Mary)] Like+ingly)]
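
The transfer rule [clause agt aux obj pred] → [clause agt aux pred obj] can be sketched in a few lines of Python. This is only a toy illustration: the role labels come from the rule above, while the four-word lexicon, the function names, and the assumption that the input clause already fits the source pattern are all hypothetical.

```python
# Toy transfer step for "Ich habe ihn gesehen" -> "I have seen him".
# The lexicon and function names are illustrative assumptions, not a real MT system.

SOURCE_ORDER = ["agt", "aux", "obj", "pred"]   # German clause pattern from the rule
TARGET_ORDER = ["agt", "aux", "pred", "obj"]   # English clause pattern from the rule

# Tiny German-to-English lexicon covering only the example sentence.
LEXICON = {"ich": "I", "habe": "have", "ihn": "him", "gesehen": "seen"}


def analyze(words):
    """Label each source word with a role, assuming the clause fits SOURCE_ORDER."""
    if len(words) != len(SOURCE_ORDER):
        raise ValueError("clause does not match the source pattern")
    return dict(zip(SOURCE_ORDER, words))


def transfer(roles):
    """Reorder the roles per the rule, then substitute target-language words."""
    return [LEXICON[roles[role].lower()] for role in TARGET_ORDER]


def translate(sentence):
    return " ".join(transfer(analyze(sentence.split())))


print(translate("Ich habe ihn gesehen"))
```

A real transfer system would obtain the roles from a parser rather than from word position; the point here is only that the structural rule and the lexical substitution are separate steps.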

Direct MT: Pros and Cons
· Pros
  • Fast
  • Simple
  • Inexpensive
· Cons
  • Unreliable
  • Not powerful
  • Rule proliferation
  • Requires too much context
  • Major restructuring after lexical substitution

Transfer MT: Pros and Cons
· Pros
  • Don’t need to find language-neutral rep
  • No translation rules hidden in lexicon
  • Relatively fast
· Cons
  • N² sets of transfer rules: Difficult to extend
  • Proliferation of language-specific rules in lexicon and syntax
  • Cross-language generalizations lost

Interlingual MT: Pros and Cons
· Pros
  • Portable (avoids N² problem)
  • Lexical rules and structural transformations stated more simply on normalized representation
  • Explanatory Adequacy
· Cons
  • Difficult to deal with terms on primitive level: universals?
  • Must decompose and reassemble concepts
  • Useful information lost (paraphrase)
  • (Is thought really language neutral??)
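
The N² problem is easy to see with a little arithmetic. The sketch below assumes one transfer rule set per ordered language pair and, for the interlingual design, one analyzer plus one generator per language; the function names are illustrative.

```python
# Illustrative arithmetic only; the function names are assumptions.

def transfer_rule_sets(n_languages):
    """One set of transfer rules per ordered language pair: N * (N - 1)."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    """One analyzer into and one generator out of the interlingua per language: 2 * N."""
    return 2 * n_languages


for n in (2, 5, 10, 20):
    print(n, transfer_rule_sets(n), interlingua_modules(n))
```

The transfer count grows quadratically (roughly N²) while the interlingual count grows linearly, which is the sense in which the interlingual design is portable.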

MT Challenges: Ambiguity
· Syntactic Ambiguity
  • I saw the man on the hill with the telescope
· Lexical Ambiguity
  • E: book, S: libro, reservar
· Semantic Ambiguity
  • Homography: ball(E) = pelota, baile(S)
  • Polysemy: kill(E) = matar, acabar(S)
  • Semantic granularity
    - esperar(S) = wait, expect, hope(E)
    - be(E) = ser, estar(S)
    - fish(E) = pez, pescado(S)

MT Challenges: Divergences
• Meaning of two translationally equivalent phrases is distributed differently in the two languages
• Example:
  - English: [RUN INTO ROOM]
  - Spanish: [ENTER IN ROOM RUNNING]

Spanish/Arabic Divergences
Divergence | E/E’ (Spanish) | E/E’ (Arabic)
Categorial | be jealous / have jealousy [tener celos] | when he returns / upon his return [عند رجوعه]
Conflational | float / go floating [ir flotando] | come again / return [عاد]
Structural | enter the house / enter in the house [entrar en la casa] | seek / search for [بحث عن]
Head Swap | run in / enter running [entrar corriendo] | do something quickly / go-quickly in doing something [اسرع]
Thematic | I have a headache / my-head hurts me [me duele la cabeza] | —

Divergence Frequency
· 32% of sentences in UN Spanish/English Corpus (5K)
· 35% of sentences in TREC El Norte Corpus (19K)
· Divergence Types
  • Categorial (X tener hambre <=> X have hunger) [98%]
  • Conflational (X dar puñaladas a Z <=> X stab Z) [83%]
  • Structural (X entrar en Y <=> X enter Y) [35%]
  • Head Swapping (X cruzar Y nadando <=> X swim across Y) [8%]
  • Thematic (X gustar a Y <=> Y like X) [6%]

MT Lexical Choice - WSD
· Iraq lost the battle.
  Ilakuka centwey ciessta.
  [Iraq] [battle] [lost].
· John lost his computer.
  John-i computer-lul ilepelyessta.
  [John] [computer] [misplaced].

WSD with Source Language Semantic Class Constraints
· lose1 (Agent, Patient: competition) <=> ciessta
· lose2 (Agent, Patient: physobj) <=> ilepelyessta
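
The two constraints above amount to a lookup keyed on the semantic class of the verb's Patient. A minimal sketch, assuming a hand-written class table for the two example nouns; the table, the dictionary layout, and the function name are hypothetical.

```python
# Toy lexical choice for English "lose" -> Korean, following the two constraints above.
# The semantic-class table and all names here are illustrative assumptions.

# Semantic class of each possible Patient noun (hand-assigned for the examples).
SEMANTIC_CLASS = {"battle": "competition", "computer": "physobj"}

# (English verb, semantic class of its Patient) -> Korean verb.
SENSES = {
    ("lose", "competition"): "ciessta",      # lose1: fail to win
    ("lose", "physobj"): "ilepelyessta",     # lose2: misplace
}


def choose_translation(verb, patient):
    """Pick the target verb from the Patient's semantic class; None if no constraint applies."""
    patient_class = SEMANTIC_CLASS.get(patient)
    return SENSES.get((verb, patient_class))


print(choose_translation("lose", "battle"))
print(choose_translation("lose", "computer"))
```

Returning None for an unknown Patient makes the gap explicit rather than guessing a sense; a fuller system would need a fallback and a much larger class inventory.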

Lexical Gaps: English to Chinese
· break: ?
· smash: da po - irregular pieces
· shatter: da sui - small pieces
· snap: pie duan - line segments

A Gentle Introduction to Statistical MT: 1949 to 1988

Warren Weaver – 1949 Memorandum I
· Proposes Local Word Sense Disambiguation!
‘If one examines the words in a book, one at a time through an opaque mask with a hole in it one word wide, then it is obviously impossible to determine, one at a time, the meaning of words. "Fast" may mean "rapid"; or it may mean "motionless"; and there is no way of telling which. But, if one lengthens the slit in the opaque mask, until one can see not only the central word in question but also say N words on either side, then, if N is large enough one can unambiguously decide the meaning...’

Warren Weaver – 1949 Memorandum II
· Proposes Interlingua for Machine Translation!
‘Thus it may be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication—the real but as yet undiscovered universal language—and then re-emerge by whatever particular route is convenient.’

Warren Weaver – 1949 Memorandum III
· Proposes Machine Translation using Information Theory!
‘It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the "Chinese code." If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?’
Weaver, W. (1949): ‘Translation’. Repr. in: Locke, W.N. and Booth, A.D. (eds.) Machine translation of languages: fourteen essays (Cambridge, Mass.: Technology Press of the Massachusetts Institute of Technology, 1955), pp. 15-23.

IBM Adopts Statistical MT Approach I (early 1990s)
‘In 1949, Warren Weaver proposed that statistical techniques from the emerging field of information theory might make it possible to use modern digital computers to translate text from one natural language to another automatically. Although Weaver's scheme foundered on the rocky reality of the limited computer resources of the day, a group of IBM researchers in the late 1980's felt that the increase in computer power over the previous forty years made reasonable a new look at the applicability of statistical techniques to translation. Thus the "Candide" project, aimed at developing an experimental machine translation system, was born at IBM TJ Watson Research Center.’

IBM Adopts Statistical MT Approach II
‘The Candide group adopted an information-theoretic perspective on the MT problem, which goes as follows. In speaking a French sentence F, a French speaker originally thought up a sentence E in English, but somewhere in the noisy channel between his brain and mouth, the sentence E got "corrupted" to its French translation F. The task of an MT system is to discover E* = argmax(E') p(F|E') p(E'); that is, the MAP-optimal English sentence, given the observed French sentence. This approach involves constructing a model of likely English sentences, and a model of how English sentences translate to French sentences. Both these tasks are accomplished automatically with the help of a large amount of bilingual text. As wacky as this perspective might sound, it's no stranger than the view that an English sentence gets corrupted into an acoustic signal in passing from the person's brain to his mouth, and this perspective is now essentially universal in automatic speech recognition.’
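
The MAP expression in the quotation follows from Bayes' rule. A short derivation, using the quotation's own symbols (F is the observed French sentence, E' ranges over candidate English sentences):

```latex
\begin{aligned}
E^{*} &= \arg\max_{E'} \; p(E' \mid F) \\
      &= \arg\max_{E'} \; \frac{p(F \mid E')\, p(E')}{p(F)} \\
      &= \arg\max_{E'} \; p(F \mid E')\, p(E')
\end{aligned}
```

The last step drops p(F) because it does not depend on E' and so cannot change which candidate attains the maximum.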

The Channel Model for Machine Translation
(This slide and 3 of the following 4 slides are from the original 1990 IBM MT paper.)

Noisy Channel - Why useful?
· Word reordering in translation handled by P(S)
  • P(S) factor frees P(T|S) from worrying about word order in the “Source” language
· Word choice in translation handled by P(T|S)
  • P(T|S) factor frees P(S) from worrying about picking the right translation
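
The division of labour between the two factors can be illustrated with a toy decoder. The candidate sentences are the ones from the earlier "Que hambre tengo yo" slide; every probability below is a made-up number chosen for illustration, not an estimate from any corpus, and the table and function names are assumptions.

```python
# Toy noisy-channel decoder: pick the S that maximises P(T | S) * P(S).
# All probabilities are invented for illustration only.

observed = "Que hambre tengo yo"   # the observed sentence T

# Channel model P(T | S): scores word choice only and does not judge
# whether S is well-ordered English.
channel = {
    "What hunger have I": 0.30,
    "Have I that hunger": 0.30,
    "Hungry I am so": 0.20,
    "I am so hungry": 0.20,
}

# Language model P(S): scores fluency of S on its own, ignoring T.
language = {
    "What hunger have I": 0.0001,
    "Have I that hunger": 0.0002,
    "Hungry I am so": 0.0005,
    "I am so hungry": 0.0200,
}


def decode(candidates):
    """Return the candidate S with the highest P(T | S) * P(S)."""
    return max(candidates, key=lambda s: channel[s] * language[s])


best = decode(channel)
print(f"{observed} -> {best}")
```

With these numbers the channel model alone slightly prefers the literal renderings, and it is the language model factor that pushes the fluent word order to the top.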

An Alignment
[Figure: word alignment, with labels: distortion, fertility]
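
Since the alignment figure itself is not available here, the idea of fertility can be sketched on an assumed example. The English/French pair and its alignment below are a commonly cited illustration and are an assumption, not necessarily the pair shown on the slide; the variable and function names are hypothetical.

```python
from collections import Counter

# Assumed example pair (not taken from the slide).
english = ["The", "proposal", "will", "not", "now", "be", "implemented"]
french = ["Les", "propositions", "ne", "seront", "pas",
          "mises", "en", "application", "maintenant"]

# alignment[j] is the index into `english` of the word that french[j] is linked to.
alignment = [0, 1, 3, 2, 3, 6, 6, 6, 4]


def fertilities(english_words, links):
    """Fertility of an English word = number of French words linked to it."""
    counts = Counter(links)
    return {word: counts[i] for i, word in enumerate(english_words)}


fert = fertilities(english, alignment)
for word in english:
    linked = [french[j] for j, i in enumerate(alignment) if english[i] == word]
    print(word, fert[word], linked)
```

In this example "not" has fertility 2 (ne, pas), "implemented" has fertility 3, and "be" has fertility 0. Distortion concerns where the linked words land: "now" is fifth in the English sentence but its French counterpart comes last.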

Fertilities and Lexical Probabilities for not

Fertilities and Lexical Probabilities for hear

Schematic of Translation Model
[Figure: stages labelled fertility, null cepts, translation, distortion]
From: What's New in Statistical Machine Translation, Kevin Knight and Philipp Koehn, Tutorial at HLT/NAACL 2003

How do we evaluate MT?
· Human-based Metrics
  • Semantic Invariance
  • Pragmatic Invariance
  • Lexical Invariance
  • Structural Invariance
  • Spatial Invariance
  • Fluency
  • Accuracy: Number of Human Edits required
    - HTER: Human Translation Error Rate
  • “Do you get it?”
· Automatic Metrics: Bleu

BiLingual Evaluation Understudy (BLEU; Papineni, 2001)
· Automatic Technique, but ...
· Requires the pre-existence of Human (Reference) Translations
· Compare n-gram matches between candidate translation and 1 or more reference translations

Bleu Metric
Chinese-English Translation Example:
· Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
· Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
· Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
· Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
· Reference 3: It is the practical guide for the army always to heed the directions of the party.
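
The n-gram comparison can be tried directly on these five sentences. The sketch below computes only the clipped ("modified") n-gram precision that BLEU is built on; it leaves out the brevity penalty and the combination across n-gram orders, so it is not a full BLEU implementation. The tokenizer (lowercase, alphabetic tokens only) and the function names are simplifying assumptions.

```python
import re
from collections import Counter
from fractions import Fraction


def tokenize(sentence):
    """Lowercase and keep alphabetic tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", sentence.lower())


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand_counts = ngrams(tokenize(candidate), n)
    # For each n-gram, the most times it occurs in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(tokenize(ref), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return Fraction(clipped, sum(cand_counts.values()))


references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
candidate_1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
candidate_2 = "It is to insure the troops forever hearing the activity guidebook that party direct."

for name, cand in (("Candidate 1", candidate_1), ("Candidate 2", candidate_2)):
    print(name, modified_precision(cand, references, 1), modified_precision(cand, references, 2))
```

Under this tokenization, Candidate 1 matches 17 of its 18 unigrams and 10 of its 17 bigrams, while Candidate 2 matches 8 of 14 unigrams and 1 of 13 bigrams, which agrees with the intuition that Candidate 1 is the better translation.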
