
SYSTRAN Challenges and Recent Advances in Hybrid Machine Translation
Jean Senellart, Jin Yang, Jens Stephan
jyang@systransoft.com
2008 – copyright SYSTRAN

Overview
SYSTRAN – 40 years of innovation
The MT Challenges
SYSTRANLab Projects
Hybrid Engines
From Research to Products
CWMT 08
Conclusions

SYSTRAN
40 years of history
Located in Paris (La Défense) and San Diego
70+ employees: ~20 linguists, ~30 engineers
Including 10 PhDs

Core Technology
Core technology is “Rule-Based”
Based on language description
Analysis – Transfer – Generation paradigm
Build a “syntax tree” based on hierarchical constituents with multi-level relationships
Multi-pass analysis
  • Morphology Analysis
  • Homograph Resolution
  • Clause Boundary
  • Syntagm Identification
  • Syntactic Role Identification
  • …
Rely heavily on linguistic resources


Languages
Chinese, Arabic, Spanish, English, Hindi, Portuguese, Russian, French, Japanese, Urdu, German, Farsi
882 422 358 350 325 250 170 130 125 100 82
Korean, Italian, Ukrainian, Polish, Dutch, Serbo-Croatian, Greek, Czech, Albanian, Slovak
78 62 47 42 23 21 18 12 6 6 3600
22 source languages
70 language pairs
Dictionaries: 200 K to 1 M entries per LP
~6 M reference multi-source / multi-target dictionary

SYSTRAN Activity
Retail products: Windows Desktop Product, SYSTRAN Mobile on PDA, Mac OS Dashboard Widget
Online Services: SYSTRANBox, SYSTRANNet, SYSTRANLinks
Corporate customers: Symantec, Cisco, Verizon, Ford, Daimler, Chemical Abstract…
Institutional Customers: EC and US agencies
Portals - Online Translation: “Babel Fish”, Google, Yahoo!, Microsoft Live, …

MT Challenges: RBMT/SMT Strengths and Weaknesses - I
Rule-Based system builds a translation with available linguistic resources (dictionaries, rules)
Human-built resources
  • Incremental
Track the translation process
  • Predictable output
Some phenomena are hard to formalize
  • Need semantic/pragmatic knowledge
Not designed to deal with exceptions to the rules
  • … which are very frequent

MT Challenges: RBMT/SMT Strengths and Weaknesses - II
Statistical system finds a translation within a choice of many, many possible translations
Very easy to build
  • Automatic training process
Knowledge acquisition is easy…
  • Not limited to predefined linguistic patterns – “phrase”
… but cannot “understand” or generalize information
  • Not even elementary rules
Output is “unpredictable”

MT Challenges: Corpus-Based or Rule-Based Approach?
No conflict between “corpus” and “rule-based” approaches
Possible to learn rules
  • Already learns terminology – monolingual and multilingual
  • Some approaches acquire complex rules
Possible to find the best translation amongst several translations
“Decoding” can be constrained by syntactic restrictions
Linguistic rules but corpus drives!

SYSTRANLab Research Projects: Overview
Toward Hybrid Engines
Collaborations
Statistical Post-Edition
Lattice Decoding
Source Analysis Adaptation
From Research to Products

Research Projects
Resources Acquisition
  • Consolidating a 6 M entry multilingual dictionary
  • Acquiring more from corpus – lexicon and rules
Linguistic Development
  • Entity Recognition with local grammars
  • Autonomous Generation modules
  • Introduction of corpus-based technology
Applications
  • More interactive applications
  • Professional Post-Edition Module (POEM)

SYSTRANLab Research Projects: The Phoenix Project
Collaboration with P. Koehn (University of Edinburgh)
Introduce corpus-based decision modules in SYSTRAN
Specialized modules
  • Word Sense Disambiguation
  • Lattice Generation
  • Preposition / Determiner Choice

SYSTRANLab Research Projects: The Sphinx Project
Collaboration with CNRC
Sequential use of SYSTRAN and statistical engines (Statistical Post-Edition)
GALE (DARPA Project)
Participated in WMT 07, NIST 08
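The sequential setup named on this slide can be sketched in a few lines. This is a minimal illustration, not SYSTRAN's implementation: `build_spe_corpus`, `make_hybrid`, `toy_rbmt` and `memorizing_spe` are hypothetical names, and the "statistical" engine is replaced by a trivial lookup so the example stays runnable. The point it shows is that the post-edition engine is trained on pairs of (rule-based output, reference translation) and is then applied to the rule-based output at translation time.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Translator = Callable[[str], str]

def build_spe_corpus(rbmt: Translator,
                     bitext: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (source, reference) pairs into (RBMT output, reference) pairs.

    The post-edition engine is monolingual in effect: both sides of its
    training corpus are in the target language.
    """
    return [(rbmt(source), reference) for source, reference in bitext]

def make_hybrid(rbmt: Translator, spe: Translator) -> Translator:
    """Chain the two engines: rule-based first, statistical post-edition second."""
    def translate(source: str) -> str:
        return spe(rbmt(source))
    return translate

# --- Toy stand-ins (assumptions, for illustration only) ---
LEXICON: Dict[str, str] = {"bonjour": "hello", "le": "the", "monde": "world"}

def toy_rbmt(source: str) -> str:
    """Word-for-word lookup standing in for a rule-based engine."""
    return " ".join(LEXICON.get(word, word) for word in source.split())

def memorizing_spe(corpus: List[Tuple[str, str]]) -> Translator:
    """Stand-in for a trained SPE model: it only memorizes whole sentences."""
    table = dict(corpus)
    return lambda text: table.get(text, text)

corpus = build_spe_corpus(toy_rbmt, [("bonjour le monde", "hello world")])
hybrid = make_hybrid(toy_rbmt, memorizing_spe(corpus))
print(hybrid("bonjour le monde"))
```

In a real setup the memorizing stand-in would be replaced by a phrase-based model trained on the same (RBMT output, reference) corpus.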

SYSTRANLab Research Projects: The Pegasus Project
Collaboration with H. Schwenk (Université du Maine)
Introduce linguistic knowledge in statistical engines
Participated in WMT 08

SYSTRANLab: Hybrid Engines
Introduce self-learning capability
Learn “post-edition rules”
Deep integration of statistical decision modules
Insert linguistic knowledge in statistical engines

CWMT 08: Chinese-English MT evaluation
Primary: RBMT+SPE
Contrast: RBMT
Started in 1994, 1.2 M terms, S&T focus

             BLEU4   BLEU4-SBP  NIST5   GTM     mWER    mPER    ICT
Primary-a    0.2275  0.2193     7.9180  0.7101  0.7209  0.5085  0.3262
Contrast-b   0.1956  0.1930     7.6356  0.7089  0.7165  0.5123  0.2942
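To put the two BLEU-4 scores in perspective, the gain of the primary run (RBMT+SPE) over the contrast run (RBMT) can be computed directly. The sketch below uses only the first score reported for each run on this slide (0.2275 and 0.1956); the helper name `bleu_gain` is hypothetical.

```python
from typing import Tuple

def bleu_gain(primary: float, contrast: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain) of primary over contrast."""
    absolute = primary - contrast
    return absolute, absolute / contrast

# First BLEU-4 figure of each run, as reported on the slide.
abs_gain, rel_gain = bleu_gain(primary=0.2275, contrast=0.1956)
print(f"absolute: {abs_gain:+.4f} BLEU, relative: {rel_gain:+.1%}")
# prints "absolute: +0.0319 BLEU, relative: +16.3%"
```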

CWMT 08: SPE Usage
SPE module trained on 1.8 M sentences
  • CWMT 08 training data not used
  • Not only translation but also annotation by RBMT: dates, numerals, etc.
Transfer model is filtered
  • Exclusion of “bad rules” by rule-based filtering
  • Examples are “random” quotes, entities appearing
Some expressions are “protected”
  • Constituents will be replaced with placeholders before SPE
  • Translated with RBMT
  • Re-injected in translation after SPE
The SPE model for CWMT 08 is trained using GIZA++, and decoding uses Moses (www.statmt.org/moses)
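The “protected” expressions step can be illustrated with a small sketch. Everything here is an assumption made for illustration: the slide does not say how protected constituents are detected or what the placeholders look like, so the regular expression (ISO-style dates and plain numerals), the `__PHn__` token format, and the function names are hypothetical. The sketch only shows the mechanism: mask the spans in the RBMT output, run post-edition on the masked text, then put the RBMT translations of those spans back.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical definition of a "protected" expression: dates and numerals.
PROTECTED = re.compile(r"\d{4}-\d{2}-\d{2}|\d+(?:\.\d+)?")
PLACEHOLDER = re.compile(r"__PH(\d+)__")

def protect(text: str) -> Tuple[str, List[str]]:
    """Replace each protected span with an indexed placeholder."""
    saved: List[str] = []

    def mask(match) -> str:
        saved.append(match.group(0))
        return f"__PH{len(saved) - 1}__"

    return PROTECTED.sub(mask, text), saved

def reinject(text: str, saved: List[str]) -> str:
    """Put the saved spans back in place of their placeholders."""
    return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], text)

def post_edit_protected(rbmt_output: str, spe: Callable[[str], str]) -> str:
    """Run post-edition on masked text so protected spans cannot be altered."""
    masked, saved = protect(rbmt_output)
    return reinject(spe(masked), saved)

# Toy post-editor (an assumption): it improves wording but would also
# rewrite the numeral if it were allowed to see it.
def careless_spe(text: str) -> str:
    return text.replace("meeting", "conference").replace("3", "three")

rbmt_output = "the meeting is on 2008-11-27 with 3 groups"
print(careless_spe(rbmt_output))
print(post_edit_protected(rbmt_output, careless_spe))
```

The first call lets the post-editor change the numeral; the second keeps the date and numeral exactly as the rule-based engine produced them. A placeholder that the post-editor drops is simply not re-inserted in this sketch, so a production system would need an explicit check for that case.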

Statistical Post-Edition: A Case Study – SYMANTEC – English>Chinese

                                   BLEU   PERFECT  Improv / Degrad
SYSTRAN Raw                        20.89  2        ref
SYSTRAN Cust                       34.49  4.8      -
SYSTRAN Raw + Translation Model    46.86  7.4
SYSTRAN Cust + Translation Model   50.90  10.5     15

Conclusions
Our approach is to start with a rule-based framework
The techniques developed give very competitive results
Major focus on “degradation” control
Learn more advanced post-edition rules
Generic Translation – still a long way to go
  • Bigger still better?
Domain Translation
  • Quality is there – statistics provides adaptation and fluency
  • Need dedicated applications, workflow
Bootstrapping new language pair development