ALAN K. MELBY (FIT, LTAC GLOBAL, AND BYU) DACE DZEGUZE (TAUS) ARLE LOMMEL (CSA RESEARCH)

Trying to Standardize Translation Quality – What Were They Thinking?

Session Overview: They were thinking they should build on the MQM project from QT21 within ASTM International
● What is ASTM WK46396?
● Staying on the road to one error typology
● Defining translation quality
● DQF-MQM error categories and severities
● Process for creating and applying MQM metrics
● Data collection
● Key takeaways
● Note: Assessment and evaluation can be contrasted but are not here

PART 1 What is ASTM WK46396?

A (soon to be) standard that defines
● A taxonomy of translation errors
● A process to move from translation specifications to task-specific analytic metrics that share a common basis
● A scoring method to produce relevant numeric indications of translation quality
● Metrics relevant to any sort of translation as well as evaluation of source-text quality

It is not…
● A single, one-size-fits-all metric
● An automatic approach
● A reference-based score (à la BLEU)
● A holistic metric
● A complete solution to translation quality evaluation

Keep in mind
● MQM metrics can be combined with holistic metrics in a procedure
● Translation quality evaluation fits into a quality management system
● WK46396 depends on ASTM F2575 and a W3C Community Group

PART 2 Staying on the Road

Staying on the road to one error typology with ASTM

Roads that went elsewhere
• SAE J2450 (released in 2001)
  – Pro: maintained by a standards body
  – Con: too narrow
• LISA QA (version 3.1 released in 2006)
  – Pro: more flexible (perhaps too flexible)
  – Con: LISA slid off the road and crashed in 2011

New Roads: Which to Take?
• TAUS DQF (Dynamic Quality Framework)
  – Based on many metrics (incl. LISA QA)
  – Initial study 2011
• DFKI MQM (Multidimensional Quality Metrics)
  – Based on many metrics (incl. LISA QA)
  – From QTLaunchPad, 2012-2014

Harmony!
• DQF and MQM error typologies were harmonized under QT21 in 2015
  – Three-person team from TAUS, DFKI, and LTAC
  – DQF-MQM is the short name for the TAUS subset of the large MQM error typology

A scare, but ASTM to the rescue
• The QT21 project was scheduled to end by January 2018: a parallel to LISA's end?
• TAUS and DFKI agreed to provide IP to ASTM International for long-term maintenance

Involvement in ASTM
• DFKI rep: Aljoscha Burchardt
• TAUS rep: David Koot
• Working Group Chair: Arle Lommel (CSA)
• Others on editorial team from government and industry
• YOU can join ASTM as an individual or a company (www.astm.org/MEMBERSHIP)

PART 3 Defining Translation Quality

Defining Translation Quality

A quality translation demonstrates accuracy and fluency required for the audience and purpose and complies with all other specifications negotiated between the requester and provider, taking into account requester and end-user needs.

What do we mean by “specifications”?
● Defined in ASTM F2575-2014, Section 8
● Cover all aspects of translation projects
  ○ Linguistic Work Product
  ○ Process
  ○ Project Environment
  ○ Relationships
● Focus today is on Linguistic Work Product
  ○ Source-content information
  ○ Target-text requirements

Source-Content Information
● textual characteristics
  ○ source language
  ○ text type
  ○ audience
  ○ purpose
● specialized language
  ○ subject field
  ○ terminology
● volume
● complexity
● origin

Target-Content Requirements
● target language requirements
  ○ target language (locale)
  ○ target terminology
● audience
● purpose
● content correspondence
● register
● format
● style
  ○ style guide
  ○ style relevance
● layout

PART 4 DQF-MQM Error Typology

Error Typology

[Figure: DQF-MQM error typology. Top-level dimensions: Accuracy, Design, Fluency, Internationalization, Locale convention, Style, Verity, Terminology]

… and zooming in even further, first to the left, and then to the right …

PART 5 DQF-MQM Process for creating and applying metrics

Creating and applying metrics
1 State specifications
  Include audience and purpose. See the Translation Parameter and Overt/Covert handouts. MQM uses specifications instead of quality levels.
2 Select relevant dimensions for quality evaluation
  (Terminology, Fluency, Accuracy, etc., shown in blue in the DQF subset below)
3 Finish QE system
  Complete the metric by selecting fine-grained error types. See the scoring card handout for evaluation stages. Train the evaluators to use the metric. Conduct evaluations. Then improve the QE system as needed.

Creating specifications
● Not required every time a metric is used
● Needed for new metrics (create templates for similar projects)
● Provides a way to document and share requirements between stakeholders

Selecting relevant dimensions and error types
● Draw from DQF-MQM where possible; otherwise use full MQM
● Tie error types to specifications
  ○ E.g., if style relevance is high, check Style
  ○ E.g., for transcreation, you would need to check Verity, but probably would not for support content for a consumer device
  ○ For each error type, determine what it is checking in the specifications (it may apply to multiple areas)
● Aim for a minimal set of error types, but be granular enough to meet your needs
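Tying error types to specifications can be sketched in a few lines of Python. This is an illustration only, not part of ASTM WK46396 or the DQF-MQM definition: the `Specs` fields, the `select_dimensions` function, and the rule that Accuracy and Fluency are always checked are assumptions made for the example. Only the dimension names and the Style and Verity rules come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Specs:
    """Hypothetical, heavily reduced set of translation specifications."""
    style_relevance: str = "low"     # "low" or "high"
    is_transcreation: bool = False   # content is adapted rather than translated closely
    has_term_base: bool = False      # requester supplied target terminology


def select_dimensions(specs: Specs) -> list[str]:
    """Return the top-level dimensions to check for these specifications."""
    # Assumption: Accuracy and Fluency are always checked, since the
    # definition of a quality translation names both.
    dimensions = ["Accuracy", "Fluency"]
    if specs.has_term_base:
        dimensions.append("Terminology")
    if specs.style_relevance == "high":
        dimensions.append("Style")       # rule taken from the slide
    if specs.is_transcreation:
        dimensions.append("Verity")      # rule taken from the slide
    return dimensions


# Support content for a consumer device versus a transcreation job.
support_content = Specs(has_term_base=True)
campaign = Specs(style_relevance="high", is_transcreation=True)
print(select_dimensions(support_content))
print(select_dimensions(campaign))
```

Keeping the rules in one function makes it easy to see which specification each dimension is checking, which is the point of the third sub-bullet above.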

Training
● Not enough to give names of error categories and definitions
● Provide guidance (e.g., decision trees, sample tasks and results)
● Testing of evaluators is essential
● Ongoing as new issues arise

Validity
● Validity is a property of a metric
● Are you measuring what you think you are?
  ○ Example: Measuring words/hour is not a valid quality metric
  ○ Example: A quality measure that doesn’t account for the length of the document won’t help you evaluate quality
● Requires verification against independent criteria (e.g., user satisfaction relative to requirements, diversion of support calls, sales conversions…)
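The document-length example can be made concrete with a small Python sketch. It is not the scoring method of ASTM WK46396: the severity names, the weights in `SEVERITY_WEIGHTS`, and the function `penalty_per_100_words` are hypothetical values chosen for illustration.

```python
# Hypothetical severity weights; a real metric would set these from its specifications.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def penalty_per_100_words(errors, word_count):
    """Sum the severity weights of the annotated errors and normalize by length.

    `errors` is a list of (error_type, severity) pairs.
    """
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    total_penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    return 100 * total_penalty / word_count


# The same three errors found in a short and in a long document.
errors = [("Accuracy", "major"), ("Fluency", "minor"), ("Terminology", "minor")]
print(penalty_per_100_words(errors, word_count=200))
print(penalty_per_100_words(errors, word_count=2000))
```

A raw error count would rate both documents identically; the normalized value separates them by a factor of ten, which is why an unnormalized count is hard to defend as a quality measure.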

Reliability
● Reliability is a property of an evaluation system, not a metric
● Two types of reliability:
  ○ Inter-rater reliability: Do multiple trained and competent evaluators obtain the same result within reasonable tolerance?
  ○ Intra-rater reliability: Does the same evaluator consistently produce the same, expected result?
● Test evaluators on translations that have previously been evaluated by experts and compare the results, to make sure a set of new evaluators aren’t all wrong in the same way
● Reasonable tolerance: > 0.7 (Cohen’s kappa)
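For two evaluators who each assign one error category to the same set of segments, Cohen's kappa can be computed as in the sketch below. The function name `cohens_kappa` and the sample labels are made up for illustration; the formula itself is the standard one, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label proportions, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations of ten segments by two evaluators.
rater_a = ["Accuracy", "Accuracy", "Fluency", "Fluency", "Style",
           "Accuracy", "Fluency", "Style", "Accuracy", "Fluency"]
rater_b = ["Accuracy", "Accuracy", "Fluency", "Style", "Style",
           "Accuracy", "Fluency", "Style", "Fluency", "Fluency"]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}, meets 0.7 tolerance: {kappa > 0.7}")
```

Here the two evaluators agree on eight of ten segments, yet kappa falls just short of 0.7 once chance agreement is removed, so raw percentage agreement alone would overstate reliability.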

PART 6 Data Collection

Benchmarking
● Assumes reliable systems
● Requires consistent use of metadata (DQF API provides this)
● Needs stability of systems over time
● Requires collective agreement
  ○ On framework (DQF-MQM)
  ○ On specific metrics (e.g., for medical information leaflets, automotive service manuals)
● ASTM WK46396 will provide the framework for benchmarking
● Proper evaluator training is needed to know whether results are comparable across organizations, that is, usable for benchmarking

PART 7 Key Takeaway: They (QT21) were thinking that MQM needed a home and found one: ASTM International (www.astm.org)

Key takeaways (based on ASTM WK46396)
● Translation quality is well-defined but not static
● General quality management principles apply to translation
● Translation specifications matter
● MQM provides a standardized yet flexible way to evaluate translation quality analytically
● Collect data and start benchmarking (use DQF API integration when feasible; online MQM scorecard also available)
● Test for validity and reliability and train evaluators
● Get involved in the ASTM effort and the TAUS DQF community