CSA 405 Advanced Topics in NLP Machine Translation

CSA 405: Advanced Topics in NLP Machine Translation I Introduction to MT Jan 2005 CSA 4050 MT I

Outline
• MT = Machine Translation
• Why MT is important
• What MT is and why MT is difficult
• MT and the Human Translator

Why Machine Translation is Important

Misconceptions about MT
• There was/is an MT system which translated "The spirit is willing, but the flesh is weak" into the Russian equivalent of "The vodka is good, but the steak is lousy", and "hydraulic ram" into the French equivalent of "water goat". MT is useless.
• MT is a waste of time because you will never make a machine that can translate Shakespeare.
• Generally, the quality of translation you can get from an MT system is very low. This makes them useless in practice.
• MT threatens the jobs of translators.
• The Japanese have developed a system that you can talk to on the phone. It translates what you say into Japanese, and translates the other speaker's replies into English.
• There is an amazing South American Indian language with a structure of such logical perfection that it solves the problem of designing MT systems.
• MT systems are machines, and buying an MT system should be very much like buying a car.

Some Facts about MT
• MT is useful. The METEO system has been in daily use since 1977. As of 1990, it was regularly translating around 45 000 words daily. It also produces high-quality output.
• While MT systems sometimes produce howlers, there are many situations where the ability of MT systems to produce reliable, if less than perfect, translations at high speed is valuable.
• MT does not threaten translators' jobs. The limitations of current MT systems are too great. However, MT systems can take over some of the boring, repetitive translation jobs and allow human translators to concentrate on more interesting specialist tasks.
• Speech-to-speech MT is still a research topic. Verbmobil has been developed in Germany.
• Building an MT system is an arduous and time-consuming job, involving the construction of grammars and very large monolingual and bilingual dictionaries. There is no 'magic solution' to this.
• Before an MT system becomes really useful, a user will typically have to invest a considerable amount of effort in customization.

The Place for MT
• Human translators are good at:
  – Getting the right turn of phrase
  – Preserving translation equivalence
• Human translators are bad at:
  – Dictionary look-up
  – Consistency of translation
  – Translation of terminology
• MT can exploit these weaknesses

Implications of Multilinguality
Number of Languages    Number of Language Pairs
2                      2
3                      6
10                     90
20                     380
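The figures in the table are consistent with counting each translation direction separately, so that n languages give n(n − 1) language pairs. A minimal Python sketch of that arithmetic follows; the function name is illustrative and does not come from the slides.

```python
def directed_language_pairs(n_languages: int) -> int:
    """Count (source, target) pairs among n languages, one per direction."""
    return n_languages * (n_languages - 1)


if __name__ == "__main__":
    # Reproduce the rows of the table: 2 -> 2, 3 -> 6, 10 -> 90, 20 -> 380.
    for n in (2, 3, 10, 20):
        print(n, directed_language_pairs(n))
```

The quadratic growth is the point of the slide: every language added to a set of n requires 2n new translation directions.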

MT is important because...
• There are too few human translators.
• Socio-political considerations require it.
• Availability of materials in the appropriate language has significant economic consequences.
• Scientifically, it is still one of the best test areas for language technology.
• Philosophically, it demands practical solutions to old problems (e.g. the role of knowledge and understanding in translation).
  – Example: negatively charged electrons and protons

How much is MT used?
• It is a myth that MT is not used
  – In 2000, MT specialist Scott Bennett said "Altavista's BabelFish ... initiated 1997, is now used a million times per day".
  – In 2001, Softissimo announced that the Internet translation request volume processed by www.reverso.net has now reached several million (Web pages, e-mail, short texts and results of search engine requests) per month on its mail translation portal and the portals of its Internet partners.
  – V. d. Meer (2003): "Every day, portals like Altavista and Google process nearly 10 million requests for automatic translation."
• MT usage is increasing

How much more could it be used?
• The translation/localisation industry has so far focused largely on product documentation.
• This represents less than 20% of all text-based information repositories that need to be localised.
• Corporate decision makers and governments will have to begin supporting multilingual communication initiatives and strategies.

Why Translation is Difficult

What is Translation?
• The process of transforming text from one language into another language.
• A written communication in a second language having the same meaning as the written communication in a first language.
• It is what translators actually do! (Martin Kay)

What Translators Actually Do: An Example of En/Fr Translation
English: As recently as a decade ago it was widely believed that infectious disease was no longer much of a threat in the developed world. The remaining challenges to public health there, it was thought, stemmed from noninfectious conditions such as cancer, heart disease and degenerative diseases.
French: Il y a une dizaine d'années, on croyait que les pays industrialisés étaient débarrassés des risques liés aux maladies infectieuses et que la santé publique n'était menacée que par des maladies comme le cancer, les troubles cardiaques, et les anomalies génétiques.

Problems: style and meaning
• English: Two sentences
  French: One sentence
• English: infectious disease was no longer much of a threat in the developed world
  French: les pays industrialisés étaient débarrassés des risques liés aux maladies infectieuses
• English: The remaining challenges to public health there
  French: la santé publique n'était menacée que
• English: noninfectious conditions
  French: maladies

Problems: Contextual Interpretation
OPEN

Problems: Non-Equivalences, Lexical Gaps
• English: Room
  French: Salle/chambre/pièce
• English: I arrive/am arriving
  French: J'arrive
• English: ? Consumptions?
  French: Consommations
• English: VAT
  French: TVA
• English: ? bits and pieces?
  French: Petites fournitures
• English: I miss you
  French: Tu me manques

Cultural Models
• English: Health Insurance; German: Krankenversicherung; French: Assurance Maladie
• English: stamp; German: entwerten; French: oblitérer

Structural Ambiguity
• I bought a car with four doors/liri
• I forgot how good beer tastes
• Time flies like an arrow
• The councillors refused the women a permit because they advocated/feared violence.

Summary
• Translation is about more than equivalence of meaning.
• Translation may involve the resolution of ambiguity.
• Preservation of intention involves cultural background as well as linguistic knowledge.
• Translation is a hard problem – for humans, let alone machines.

Similarities and Differences Between Languages
Similarities
• Communicative function for survival
• Mechanisms for reference to people, eating, politeness, time
• Syntactic complexity: nouns, verbs
Differences
• Morphology
• Word order and syntactic structures
• Marking of semantic distinctions
• Lexical

Differences in Morphology
• Number of morphemes per word:
  – One morpheme per word (Vietnamese)
  – Many morphemes per word (Maltese)
• Segmentability of morphemes:
  – Agglutinative (Turkish): uygar+las+tir+ama+dik+lar+imiz+dan+mis+siniz+casina, "behaving as if you are among those whom we could not cause to become civilised"
  – Fusion, where a single affix carries multiple morphemes (Russian): stol+om, "with (a) table" (om = SING/INSTR/DECL 1)

Differences in Word Order
• SVO (English): The man kicked the ball
• SOV (Japanese): The man the ball kicked
• Mixed (German): The man (has) the ball kicked must
• VSO (Classical Arabic): Kicked the man the ball
• Free word order (Latin)

Differences in Marking of Semantic Information
• Head marking vs. dependent marking
  – In English the possessive relation is marked on the dependent (the possessor): the man's house
  – In Hungarian it is marked on the head (the thing possessed): the man house-his
  – his house / sa maison
• Direction and manner of motion marking
  – He ran into the room (English)
  – He entered the room running (French)

Lexical Differences: Semantic Granularity

Hutchins & Somers (1992)

Lexical Differences
• Lexical gaps: when a word exists in one language but not in another:
  – Japanese does not have a word corresponding to privacy.
  – English does not have a word for Japanese oyakoko (≈ filial piety).
• Sapir/Whorf hypothesis
  – Language constrains thought
  – Speakers of different languages employ different conceptual systems
  – Impossibility of translation in general

Machine Translation and Human Translators

In the Beginning... was the dream of FAMT
• Fully Automatic (High Quality) Machine Translation (Bar Hillel 1960)
Source Language text → FAHQMT → Target Language text

FAMT
• Basic Characteristics
  – No human intervention
  – Arbitrary text
• Evaluation Criteria
  – Quality of output
  – Cost ($/page)
  – Speed (pages/hour)

Translation Process 1
• Pre-editing
• Translation
• Post-editing
No pre-editing: lots of post-editing!
Lots of pre-editing: no(t much) post-editing!
GARBAGE IN, GARBAGE OUT!!!

Pre-editing
• What constitutes good input?
• Depends on the system.
• Short, simple, grammatical sentences
  – Before: New toner units are held level during installation and, since they do not as supplied contain toner, must be filled prior to installation from a small cartridge.
  – After: Fill the new toner unit with toner from a toner cartridge. Hold the new toner unit level while you put it in the printer.
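One simple kind of pre-editing aid is a checker that flags sentences likely to be too long for the MT system. The sketch below is only an illustration: the 20-word limit, the naive sentence splitter and the function names are assumptions, not properties of any particular system mentioned in these slides.

```python
import re

MAX_WORDS = 20  # hypothetical limit; a real system would set its own


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_long_sentences(text: str, max_words: int = MAX_WORDS) -> list[str]:
    """Return the sentences whose whitespace-separated word count exceeds max_words."""
    return [s for s in split_sentences(text) if len(s.split()) > max_words]


ORIGINAL = (
    "New toner units are held level during installation and, since they do "
    "not as supplied contain toner, must be filled prior to installation "
    "from a small cartridge."
)
PRE_EDITED = (
    "Fill the new toner unit with toner from a toner cartridge. "
    "Hold the new toner unit level while you put it in the printer."
)

if __name__ == "__main__":
    print(flag_long_sentences(ORIGINAL))    # the single 27-word sentence is flagged
    print(flag_long_sentences(PRE_EDITED))  # nothing is flagged
```

Sentence length is only a crude proxy for "short, simple, grammatical"; it says nothing about grammaticality or ambiguity.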

Pre-editing
• Avoidance of ambiguous terms
• Trend towards controlled languages and related tools
  – Spellcheckers
  – Grammar checkers
  – Critiquing systems
• Controlled English: to make English accessible and usable by the greatest number of people. Basic English, cf. Esperanto.
• Main idea: to reduce the number of general words needed for writing anything to a few hundred, from 75000 (average for skilled native speakers), by using operator verbs, e.g. "make perfect" instead of "perfect".
• Xerox offers its technical writers a one-day course; British Aerospace does the same in a few short sessions.

Translation Process 2
• Coordination
• Communication
• In theory, FAMT is meant to usurp the pre-editing, translation, and post-editing phases.
• But even with current technology, no system can be built which satisfies all of FAMT's goals simultaneously.

FAMT Success Story: TAUM METEO
• Written by Chevalier et al. 1978.
• Translation of weather reports from English to French
• Highly constrained subset of English:
  – Small number of senses for each word
  – Restricted syntactic constructions
• System determines whether a given sentence is within its capabilities
• Very fast, very accurate, no post-editing

FAMT: MORAL
• FAMT can work well, but only if we give up one or more of the goals, e.g.
  – Unrestricted text input
  – High quality translation
• This observation has led to research on sublanguages
• And to the use of FALQT

Sublanguages
• Restricted domain of reference
• Restricted purpose and orientation
• Restricted mode of communication (may include bandwidth considerations)
• Community of users sharing specialised knowledge
• See Kittredge (1985) for further details of what computational techniques are applicable to sublanguages

Fully Automatic Low Quality Translation (FALQT)
• Can be used where translation volume is high.
• Where the gist is more important than an accurate translation.
• Where we need to select a small group of documents from a large collection for subsequent high quality translation.
• Must answer the question: could document X in collection Z be about Y?

FAMT is not the only way
• FAMT lies at one extreme of a continuum of ways in which technology can be brought to bear upon the translation problem.
• At the other extreme there are word processing software, fax machines, and even mobile phones.
• Between these two extremes there are other points of interest where technology can radically affect the productivity of the individual translator.

MAHT and HAMT
• Machine Aided Human Translation (MAHT)
• Human Aided Machine Translation (HAMT)
• The essential difference between these two lies not only in the way in which the person is involved but also in the extent of their involvement.

MAHT
• All initiative resides with the human.
• Often based on a text editor with certain translation-specific functionalities such as:
  – Simultaneous access to source and target texts
  – Online access to dictionaries, thesauri, terminological databases, and word concordance tools
  – Identification of and access to secondary materials, such as the texts being worked on and other texts like them in both source and target forms

MAHT - Translation Memories
• Systems consist of a database in which each source sentence of a translation is stored together with the target sentence (this is called a translation memory "unit").
• Any new source sentence is searched for in the database and a match value is calculated.
• When the match value is 100%, the translation of the source sentence from the database is inserted into the text being translated.

MAHT - Translation Memories
• If the match value is below 100% and above a certain user-definable percentage (i.e., a "fuzzy match"), the old translation will be inserted as a translation proposal for the translator to review and edit.
• Sentences with match values below that margin have to be translated from scratch.
• New and changed translation proposals will then be stored in the database for future use.
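The exact/fuzzy/no-match logic described on the translation memory slides can be sketched in a few lines of Python. This is a toy under stated assumptions: the match value is computed with the standard library's difflib.SequenceMatcher (commercial TM tools use their own, typically undisclosed, measures), the 75% threshold is an arbitrary stand-in for the user-definable percentage, and the function names and the sample memory entry are invented for illustration.

```python
from difflib import SequenceMatcher


def match_value(new_source: str, stored_source: str) -> float:
    """Similarity of two source sentences as a percentage (0 to 100)."""
    return 100.0 * SequenceMatcher(None, new_source, stored_source).ratio()


def tm_lookup(new_source: str, memory: dict[str, str], fuzzy_threshold: float = 75.0):
    """Return (kind, proposal, score), where kind is 'exact', 'fuzzy' or 'none'."""
    best_source, best_score = None, 0.0
    for stored_source in memory:
        score = match_value(new_source, stored_source)
        if score > best_score:
            best_source, best_score = stored_source, score
    if best_source is not None and best_score >= 100.0:
        return ("exact", memory[best_source], best_score)  # insert directly
    if best_source is not None and best_score >= fuzzy_threshold:
        return ("fuzzy", memory[best_source], best_score)  # proposal to review and edit
    return ("none", None, best_score)                      # translate from scratch


if __name__ == "__main__":
    # Hypothetical translation memory with a single unit.
    memory = {"The printer is ready.": "L'imprimante est prête."}
    for sentence in ("The printer is ready.",
                     "The printer is not ready.",
                     "Weather tomorrow: snow."):
        print(sentence, "->", tm_lookup(sentence, memory))
```

In this toy memory the fuzzy proposal for "The printer is not ready." is the translation of the affirmative sentence, which is exactly why fuzzy matches are offered for review rather than inserted automatically.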

MAHT - Translation Memories - Advantages
• Avoid redoing translation of repeated material
• Use previous texts as a model for new translations
• Ensure consistency throughout a translation

MAHT - Translation Memories - Drawbacks
• If terminology changes between projects, the content of a TM needs to be updated to reflect these changes.
• Blind faith in exact matches (without validation) can generate incorrect translations, since the context in which the new segment is used is not checked against the context in which the original one was used.

MAHT - Translation Memories - Remarks
• Translation process: TM tools may not easily fit into existing translation or localization processes; they work best where work can be signed off in pieces rather than as a whole.
• Customisation: a TM tool rarely works straight out of the box. Menu adaptation and filters to desktop applications may require significant effort.
• Investment costs are high.
• Setup and maintenance of TMs has to be factored in.
• OpenTag/TMX formats for exchanging TM data between competing systems

MAHT - Other Technology
• Communication/coordination amongst translators
• Integration of internet technologies and web services
• Database technology, smart indexing, and networking
• Improvements can be achieved that are well within the scope of current technology.

HAMT - Human Assisted Machine Translation
• Machine retains the initiative but works in collaboration with a human consultant.
• System translates autonomously until it recognises that a linguistic difficulty of a certain type has arisen, e.g.
  – ambiguity
  – pronoun reference
  – unknown word
  – unrecognised construction
• At this point it seeks help from the consultant.
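The control flow of HAMT, in which the machine proceeds on its own and consults a human only when it detects a difficulty, can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: real systems do not translate word by word, the lexicon is a toy built from the Room = salle/chambre/pièce example earlier in these slides, and ask_human is a hypothetical callback standing in for the consultant.

```python
from typing import Callable

# Callback signature: (kind of difficulty, source word, candidate translations) -> chosen output
Consultant = Callable[[str, str, list[str]], str]


def translate_with_consultant(words: list[str],
                              lexicon: dict[str, list[str]],
                              ask_human: Consultant) -> list[str]:
    """Word-for-word toy translator that defers to a human on difficulties."""
    output = []
    for word in words:
        candidates = lexicon.get(word.lower(), [])
        if len(candidates) == 1:
            output.append(candidates[0])                               # machine keeps the initiative
        elif not candidates:
            output.append(ask_human("unknown word", word, candidates))  # seek help
        else:
            output.append(ask_human("ambiguity", word, candidates))     # seek help
    return output


if __name__ == "__main__":
    toy_lexicon = {"room": ["salle", "chambre", "pièce"], "vat": ["TVA"]}

    def first_or_copy(kind: str, word: str, candidates: list[str]) -> str:
        """Stand-in consultant: pick the first candidate, or copy the word through."""
        print(f"consultant asked about {kind}: {word!r} {candidates}")
        return candidates[0] if candidates else word

    print(translate_with_consultant(["VAT", "room", "toner"], toy_lexicon, first_or_copy))
```

The sketch only covers two of the difficulty types listed on the slide (ambiguity and unknown words); detecting pronoun-reference problems or unrecognised constructions would need real linguistic analysis.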

HAMT - Challenges
• Reliable identification/classification of difficulty.
• Reliable communication of difficulty to user.
• Tradeoff between quality and scope of translation.

HAMT - Advantages
• Modulo the challenges above, a high quality of translation can be guaranteed.
• Speed, if large sections of text can be translated automatically.
• The human consultant need not have all the skills of a human translator; native competence in one or both languages may suffice.

Summary
• Machine Translation is a continuum
  – FAMT
  – HAMT
  – MAHT
• The utility of a given type of system cannot be assessed with very simple criteria.
• The utility function involves at least the human cost, the machine cost, the quality of the result, and the nature of the translation requirements.

Some References
• Jonathan Slocum, Machine Translation: its History, Current Status, and Future Prospects, Proc. ACL 1984, Stanford University, http://acl.ldc.upenn.edu/P/P84-1116.pdf
• Martin Kay, Machine Translation, Computational Linguistics vol. 11, numbers 2-3, 1985.
• Richard Kittredge, Sublanguages, Computational Linguistics vol. 11, numbers 2-3, 1985.