AUTOMATIC TRANSLATION UTILITY Fostering language diversity and participation

Juan Dolio, DR, 11-14 November 2008
Stéphane Bruno, AHTIC/CONSORTIUM CARISNET
sbruno@websystems.ht

LANGUAGE STATS

FACTS
  • English is the dominant language in CIVIC discussions
  • Non-English-speaking members who are not fluent in English (or do not speak it at all) are reluctant to contribute
  • Manual (human) translation of all email and forum communications is impossible and far too costly
  • Systematic human translation would also delay interactions

CIVIC APPROACH TO LANGUAGE DIVERSITY
  • Three official languages: English, French, Spanish
  • All documents and “official” communications are translated into all three languages (the original-language document being the legally binding one?)
  • Simultaneous translation is provided in face-to-face meetings for plenary sessions when the size of the language group and its needs justify the cost
  • Automatic translation of emails is provided to facilitate comprehension and contribution by all language groups

OBJECTIVES OF THE AUTOMATIC TRANSLATION
  • Provide the opportunity for all members to get the essence of all communications in all three official CIVIC languages
  • Make the translation non-disruptive, as seamless and as user-friendly as possible
  • Allow the translation to improve over time
  • Construct a contextual terminology and linguistic environment for CIVIC in its field of intervention

HOW IT WORKS

THE TRANSLATION MECHANISMS
  • When a mail arrives, the software breaks the email into paragraphs
  • The software tries to guess the language of each paragraph
  • If it cannot guess the language, it assumes it is English
  • Then the software preprocesses the paragraph through the knowledgebase
  • Then each paragraph is sent to the translation service (Babelfish) and the result is retrieved for each language pair
  • The resulting paragraph is post-processed
  • Then the email is reconstructed and sent to the mailing list manager
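
The steps above can be sketched roughly as follows. This is a minimal illustration, not the utility's actual code: the stopword-counting language guess, the function names, and the shape of the reconstructed output (one text per language) are all assumptions, and the translation service is passed in as a plain callable rather than tied to any real Babelfish interface.

```python
import re
from typing import Callable, Dict, List

LANGUAGES = ("en", "fr", "es")  # the three official CIVIC languages

# Tiny illustrative stopword lists (an assumption; the real guesser is not described).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "for"},
    "fr": {"le", "les", "et", "est", "des", "une", "pour", "dans"},
    "es": {"el", "los", "y", "es", "del", "una", "para", "por"},
}

Hook = Callable[[str, str, str], str]  # (text, source, target) -> text


def split_paragraphs(body: str) -> List[str]:
    """Break a plain-text email body into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]


def guess_language(paragraph: str) -> str:
    """Guess the language by stopword hits; fall back to English when unsure."""
    words = re.findall(r"[^\W\d_]+", paragraph.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or ranked[0] == ranked[1]:
        return "en"  # no hits or a tie: assume English, as the slide states
    return max(scores, key=scores.get)


def identity(text: str, source: str, target: str) -> str:
    return text


def process_email(body: str, translate: Hook,
                  preprocess: Hook = identity,
                  postprocess: Hook = identity) -> Dict[str, str]:
    """Return one reconstructed body per language (hypothetical output shape)."""
    sections: Dict[str, List[str]] = {lang: [] for lang in LANGUAGES}
    for paragraph in split_paragraphs(body):
        source = guess_language(paragraph)
        for target in LANGUAGES:
            if target == source:
                sections[target].append(paragraph)  # keep the original as-is
                continue
            prepared = preprocess(paragraph, source, target)   # knowledgebase pass
            translated = translate(prepared, source, target)   # external service
            sections[target].append(postprocess(translated, source, target))
    return {lang: "\n\n".join(parts) for lang, parts in sections.items()}
```

Passing the translation service in as a callable keeps the sketch independent of any one provider, which matters given how fragile the dependency on a free web service is.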

INPUT REQUIREMENTS
  • Use simple language constructs
  • Use complete sentences and correct grammar and syntax
  • Avoid abbreviations, metaphors and idiomatic expressions
  • Avoid proverbs and sayings
  • Do not mix languages in the same paragraph (as translation is done paragraph by paragraph, and the language is guessed)

OTHER FEATURES
  • If you want some words not to be translated, enclose them in “*”, like *CIVIC*
  • The knowledgebase lets you record in a database how certain words are to be translated, overriding the translation service; for example, to say that ICT is translated as TIC in French and Spanish, and vice versa
  • This makes it possible to build a lexicon or linguistic construct in the context of CIVIC and ICT4D
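
One possible way to implement these two features is to swap protected words for placeholders before translation and restore them afterwards. The sketch below is an assumption about how this could work, not a description of the actual utility: the placeholder format, the in-memory `KNOWLEDGE_BASE` dictionary, the function names, and the choice to drop the asterisks in the output are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical knowledgebase: (source, target) -> {source term: forced translation}.
KNOWLEDGE_BASE = {
    ("en", "fr"): {"ICT": "TIC"},
    ("en", "es"): {"ICT": "TIC"},
    ("fr", "en"): {"TIC": "ICT"},
    ("es", "en"): {"TIC": "ICT"},
}

# Assumed to pass through the translation service unchanged.
PLACEHOLDER = "XQZ{}XQZ"


def protect(paragraph: str, source: str, target: str) -> Tuple[str, List[str]]:
    """Preprocess: hide *starred* words and knowledgebase terms behind placeholders."""
    kept: List[str] = []

    def stash(value: str) -> str:
        kept.append(value)
        return PLACEHOLDER.format(len(kept) - 1)

    # Words enclosed in asterisks are kept verbatim (asterisks dropped here).
    text = re.sub(r"\*([^*\n]+)\*", lambda m: stash(m.group(1)), paragraph)
    # Knowledgebase terms are replaced by their forced translation.
    for term, forced in KNOWLEDGE_BASE.get((source, target), {}).items():
        pattern = r"\b{}\b".format(re.escape(term))
        text = re.sub(pattern, lambda m, f=forced: stash(f), text)
    return text, kept


def restore(translated: str, kept: List[str]) -> str:
    """Post-process: put the protected words back into the translated text."""
    return re.sub(r"XQZ(\d+)XQZ", lambda m: kept[int(m.group(1))], translated)
```

For example, `protect("*CIVIC* promotes ICT policy.", "en", "fr")` hides both CIVIC and ICT, so the translation service only ever sees the placeholders, and `restore` then reinserts CIVIC and TIC.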

LIMITATIONS
  • The shorter a paragraph is, the less accurate the language guessing. So introductory paragraphs such as greetings or openings, and single-word texts, will usually be translated wrongly or not at all
  • The current version works only with plain-text email messages. The final version will try to convert HTML-formatted emails to plain text before processing them
  • The utility relies on Babelfish without a formal agreement (since it is free), in a way for which Babelfish was not designed. So it is vulnerable to the slightest change on the Babelfish web site
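
The planned HTML-to-plain-text step could look something like the sketch below, which uses only Python's standard `html.parser` module. The class and function names, the list of block-level tags, and the whitespace handling are assumptions for illustration; the presentation does not say how the final version will do the conversion.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, turning block-level tags into paragraph breaks."""

    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "blockquote"}
    SKIP_TAGS = {"script", "style"}  # content that is never shown to the reader

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.chunks.append(" ")  # line breaks inside a paragraph become spaces

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.chunks.append("\n\n")  # end of a block ends the paragraph

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Return paragraphs separated by blank lines, ready for paragraph splitting."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks)
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", raw)]
    return "\n\n".join(p for p in paragraphs if p)
```

Keeping paragraph boundaries as blank lines matters here, because translation and language guessing both work paragraph by paragraph.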

THINGS TO RESOLVE
  • The character encoding issues
  • Who will manage the knowledgebase?
  • How are words entered into the database?
  • How is that decided?