<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Estimating the Quality of Translated Medical Texts using Back Translation &amp; Resource Description Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vinay Neekhra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dipti Misra Sharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Language Technology Research Center, Kohli Center on Intelligent Systems, International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad, India</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>How can we effectively estimate the quality of translated texts in the medical field, where back-translation is usually available and/or recommended for sensitive documents? This paper proposes a novel metric, GATE, for the translation quality estimation task, leveraging the Resource Description Framework (RDF) to encode both the semantic and syntactic information of the original and back-translated sentences into RDF graphs. The distance between these graphs is measured to obtain a semantic similarity score that assesses the quality of the translation. Unlike traditional metrics such as BLEU and METEOR, our approach is reference-less, capturing both semantic and syntactic information for a comprehensive assessment of translation quality. Our results correlate better with human judgment, giving a higher Pearson correlation (0.357) than BLEU (0.200), a ~70% improvement over BLEU. Our research shows that, in the field of translation evaluation, existing resources such as back-translation and RDF can be useful.</p>
      </abstract>
      <kwd-group>
        <kwd>Translation Quality Estimation</kwd>
        <kwd>Resource Description Framework (RDF)</kwd>
        <kwd>Back Translation</kwd>
        <kwd>GATE</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A drug trial in the medical domain incorporates a mandatory consent form called a Medical Consent Form (MCF), which informs the patient about the experiment and its potential side effects. There is a legal requirement for the MCF to be in the patient’s mother tongue and for it to be easy to understand. A human translator translates the original MCF into the patient’s mother tongue. As MCFs are sensitive documents, evaluating the quality of translated texts is crucial to ensure faithfulness to the original texts (see Section 1.1 for an example).</p>
      <p>
        One way to evaluate the quality of the translated texts is using back-translation (see Section 3.1),
wherein the translated text is translated back into the original language. The original and back-translated
texts are then compared to estimate the quality of the translation. Back-translation is a prominent way
to assess the quality of translated texts in domains, such as medical documents, where accuracy and
precision are paramount [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Experienced professionals are responsible for carrying out all three procedures (see Figure 1): the initial translation from the source language to the target language, the translation from the target language back to the source language, and finally the comparison between the original and back-translated texts. Our work focuses on reducing the effort of human evaluators who compare the original and back-translated texts by automating the task of evaluating the quality of translated texts.</p>
      <p>
        While human evaluation has traditionally served as a benchmark for assessing translation quality, it
is often expensive, time-consuming, and subjective. As an alternative, automatic evaluation metrics
such as BLEU[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], METEOR[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], etc., have been developed to provide a more efficient and objective means of evaluation, with BLEU being the most commonly used metric (see Section 2 for related work). This field of research, called translation quality estimation (QE), is concerned with evaluating the quality of translated texts when gold-standard translations (called reference texts) are unavailable.
      </p>
      <p>In this paper, we propose a novel translation evaluation metric, GATE (Graphical Assessment for Translation quality Estimation), which leverages back-translation (see Section 3.1) and the Resource Description Framework (RDF) (see Section 3.2). GATE encodes both the semantic and syntactic information of the original and back-translated sentences into RDF graphs, allowing for a reference-less, semantically aware assessment of translation quality.</p>
      <p>
        For sensitive documents in the medical field, such as medical consent forms and qualitative research,
back-translation is a common practice to ensure the faithfulness of translations [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. GATE capitalizes on this by integrating back-translation into its evaluation framework, providing a comprehensive and reliable assessment of translation quality. To estimate the quality of translated texts, we encode the meaning of the sentences into graphs using the Resource Description Framework (RDF) and then compare these graphs to produce a similarity score (see Figure 4). GATE shows a higher correlation (0.357) with human judgment than BLEU (0.200) (see Section 4 for the experiment details). In Section 1.1, we discuss the significance of translation evaluation, highlighting the context and motivation behind our research efforts.
      </p>
      <sec id="sec-1-1">
        <title>1.1. Significance of Translation Evaluation</title>
        <p>Consider the following sentence from a medical consent form for a vaccine trial, where the original consent form is in English and is translated into the patient’s mother tongue (Tamil).</p>
        <p>• Source text: “There are no side effects mentioned previously.”</p>
        <p>To comply with legal requirements, the consent form was translated into Tamil by hospital authorities, resulting in two translated versions. To evaluate the translation quality, the translated MCF was back-translated into English, yielding the following results:</p>
        <p>• Back Translation 1: “No side effects which were mentioned previously.”</p>
        <p>• Back Translation 2: “It has already been mentioned that it does not have any side-effects.”</p>
        <sec id="sec-1-1-2">
          <p>As seen above, the first back-translated sentence is semantically similar to the source text and preserves the original intent. The second back-translated text, on the other hand, conveys that “as previously mentioned, there are no side-effects”, whereas the original intent was that no side-effects have been observed yet, thus raising ethical and legal concerns.</p>
          <p>Thus, it is crucial that translated texts are evaluated for their faithfulness to the original text, especially in the medical domain. In the next subsection, we highlight the contributions of our work.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Contributions</title>
        <p>1. This paper presents a novel approach, GATE, for the translation quality estimation task, utilizing back-translation and leveraging knowledge graphs (namely, the Resource Description Framework) to encode the meaning of the original and back-translated texts and produce a translation quality estimation score.</p>
        <p>2. GATE incorporates both syntactic and semantic information, leading to improved evaluation scores. Our approach is applicable to both machine-translated and human-translated texts. Our experiments demonstrate a better correlation with human judgment, with a Pearson correlation of 0.357 compared to 0.200 for BLEU, the most commonly used metric.</p>
        <p>3. Our approach eliminates the need for reference texts by comparing the source text directly with its back-translated counterpart. This makes our approach reference-less and thus valuable in scenarios where reference texts are not available for translation evaluation (such as medical consent forms).</p>
        <p>4. While our results do not surpass the current state of the art, our metric, GATE, offers distinct advantages: it requires no training, is computationally lightweight, is available for low-resource languages, and operates without the need for extensive training data, unlike neural network-based methods like COMET [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
        <p>The paper is structured as follows: Section 2 reviews related work in the area of translation evaluation, discussing the limitations of existing metrics. Section 3 lays the foundation of our work, providing an overview of back-translation and its significance, introducing knowledge graphs in general, and describing the Resource Description Framework (RDF) and FRED RDF graphs. Section 4 details the experiment design and methodology leading to the creation of GATE. The results of our experiments are presented in Section 5, along with a discussion of the insights gained from our research efforts and the current limitations of our metric. Finally, Sections 6 and 7 conclude the paper and outline directions for future research.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Existing metrics for translation evaluation, such as BLEU[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], METEOR[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], NIST[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and TER[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], have
been widely utilized in the field, with BLEU being the most commonly used among them. BLEU
compares the translated sentence with a reference sentence. It operates on word group matching using
an n-gram model and remains popular due to its simplicity. In contrast, METEOR was developed as a
successor to BLEU to account for synonyms and other variations in language. Usually, the quality of
translation is evaluated at the sentence level, but word and document level QE are also possible [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>However, these metrics have inherent limitations. Many traditional metrics are categorized as n-gram
matching metrics, relying on handcrafted features to estimate translation quality by counting the
number and fraction of n-grams shared between a candidate translation hypothesis and one or more
human references. This restricts their ability to capture nuanced meaning, particularly in complex
and domain-specific texts. They often rely on surface-level similarity measures and may necessitate
reference translations, typically provided by humans as a standard of perfection.</p>
      <p>
        More recent approaches have explored the use of word embeddings as an alternative to n-gram
matching for capturing word semantic similarity. Metrics like BLEU2VEC[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], BERTScore [10], and
COMET[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] create alignments between reference and hypothesis segments in an embedding space to compute a score reflecting semantic similarity. COMET, a notable metric in this domain, has demonstrated remarkable results for translation evaluation. However, the limited availability of word embeddings for low-resource languages remains a significant challenge for training these models.
      </p>
      <p>However, these metrics may still fall short of capturing the full range of nuances reflected in human judgments. Challenges with existing metrics include their reliance on reference texts for comparison, their requirement of semantic exactness at the word level, their susceptibility to differences in lexical structure (such as word order), their tendency to measure semantic relatedness rather than semantic similarity, and their large data requirements for training models, which make them ill-suited for low-resource languages.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminaries</title>
      <sec id="sec-3-1">
        <p>This section lays out the foundation required for our experiment design.</p>
        <sec id="sec-3-1-1">
          <title>3.1. Back Translation</title>
          <p>Back translation is a process where a translated text is translated back into the original (source) language by a different translator [11]. Figure 1 illustrates the translation and back-translation processes between English and French, as depicted by [12].</p>
          <p>
            Back translation is recommended in domains where the content subject to translation is sensitive and needs to be double-checked. The back-translation method is widely used in medical research and clinical trials, as it is required by ethics committees and regulatory authorities in several countries [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. This allows us to compare the back-translated text with the original text to evaluate the
quality of the translation.
          </p>
          <p>The rationale behind using back-translation is that, for sensitive documents in the medical domain, back-translation is a recommended practice to cross-verify that the translation adheres to the intended meaning. Back-translation is usually mandatory for the quality assessment of medical consent forms, so it adds no overhead in this particular scenario, and it is generally recommended for medical, legal, and market-research settings, as well as for government agencies working in public health, safety, and legal matters. We utilize this for translation evaluation, aiming to address the specific needs of these domains to ensure the faithfulness of the translated texts. Our effort is to use already-available back-translation texts for translation evaluation tasks.</p>
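          <p>The round trip can be sketched as follows; <monospace>translate</monospace> is a hypothetical stand-in for a human translator or MT system, backed here by a toy lookup table so the sketch is self-contained:</p>
          <preformat>
```python
# Sketch of the back-translation round trip (not the paper's implementation).
# `translate` is a hypothetical stand-in for a human translator or MT system,
# backed by a toy lookup table for illustration only.
TOY_TABLE = {
    ("en", "fr", "There are no side effects."): "Il n'y a pas d'effets secondaires.",
    ("fr", "en", "Il n'y a pas d'effets secondaires."): "There are no side effects.",
}

def translate(text, src, tgt):
    """Look up the (source-language, target-language, text) triple."""
    return TOY_TABLE[(src, tgt, text)]

original = "There are no side effects."
translated = translate(original, "en", "fr")          # forward translation
back_translated = translate(translated, "fr", "en")   # back translation
# The original and back-translated texts are then compared for faithfulness.
```
          </preformat>
          <p>In practice the forward and back translations are produced by two different translators; any drift between <monospace>original</monospace> and <monospace>back_translated</monospace> signals a potential translation problem.</p>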
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Resource Description Framework</title>
          <p>The Resource Description Framework (RDF) is a W3C standard for data representation on the Web. RDF provides a foundation for encoding information in a structured way for the Semantic Web [13]. It is particularly useful for representing knowledge about entities and the relationships between them.</p>
          <p>3.2.1. Components of RDF</p>
          <p>RDF consists of triples, the fundamental units of information. These triples, also known as RDF triples, form the building blocks for representing knowledge within an RDF graph. Each RDF triple is composed of three elements:</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>1. Subject: the resource (entity) being described (e.g., “The patient”).</p>
        <p>2. Predicate: the property or characteristic of the subject, denoted by directed arrows (e.g., “hasDiagnosisOf”).</p>
        <p>3. Object: the value associated with the predicate for the subject (e.g., “pneumonia”).</p>
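        <p>A minimal sketch of these components (plain Python tuples stand in for a proper RDF library such as rdflib; the URIs are hypothetical, for illustration only):</p>
        <preformat>
```python
# A minimal sketch (not the paper's implementation): an RDF triple as a
# (subject, predicate, object) tuple, and an RDF graph as a set of triples.
# The example.org URIs are hypothetical placeholders.
EX = "http://example.org/"

def triple(subject, predicate, obj):
    """Build an RDF triple of full URIs from short local names."""
    return (EX + subject, EX + predicate, EX + obj)

graph = set()
graph.add(triple("patient", "hasDiagnosisOf", "pneumonia"))

for s, p, o in graph:
    print(s, p, o)
```
        </preformat>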
        <p>In Figure 2, the RDF triple depicts a statement about a patient having a diagnosis of pneumonia. In the context of our research, we leverage RDF to capture the semantics of sentences, enabling a more nuanced evaluation of translation quality compared to traditional metrics.</p>
        <p>3.2.2. FRED RDF Graphs</p>
        <p>Our research is based on RDF graphs provided by FRED (Framework for RDF-based Extraction and Disambiguation) [14] to capture semantic nuances in translated texts. At its core, FRED leverages the Resource Description Framework (RDF) to construct semantic graphs that capture the relationships and entities present in the text. FRED bridges the gap between unstructured text and structured knowledge representation, employing Semantic Web technologies to extract and disambiguate information from textual data. Figure 3 shows the RDF graph for the sentence “An experimental drug is one which has not been approved by FDA.”</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment Design</title>
      <p>We conduct a comparative experiment to evaluate the efficacy of our proposed RDF-based evaluation metric, GATE, in comparison to the baseline metric BLEU and its correlation with human judgment. To obtain baseline BLEU scores, we use iBLEU [15]. The evaluation procedure, outlined in Algorithm 1, explains the comparison of RDF graphs generated through the FRED API, which can be accessed at http://wit.istc.cnr.it/stlab-tools/fred/demo/.</p>
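      <p>The evaluation loop of Algorithm 1 can be sketched as follows; <monospace>fetch_rdf_entities</monospace> is a hypothetical stand-in for a FRED API call that returns the entity set of a sentence’s RDF graph (here a toy stub, so the sketch stays self-contained):</p>
      <preformat>
```python
# Sketch of the evaluation loop (Algorithm 1), assuming a helper that returns
# the entity set of a sentence's RDF graph. In practice that helper would call
# the FRED API; here a toy word-level stub keeps the sketch runnable.
def fetch_rdf_entities(sentence):
    """Hypothetical stand-in for FRED: one 'entity' per distinct word."""
    return {w.strip(".,").lower() for w in sentence.split()}

def gate_score(source, back_translation):
    """Jaccard similarity between the two entity sets (see Section 4.2)."""
    a = fetch_rdf_entities(source)
    b = fetch_rdf_entities(back_translation)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 1.0

pairs = [("A woman peels a potato", "A woman is peeling a potato")]
scores = [gate_score(src, bt) for src, bt in pairs]
```
      </preformat>
      <p>The resulting scores can then be correlated with human judgments, as done in Section 5.</p>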
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>Our experiments were conducted on selected medical consent forms and on sentences from the Semantic Textual Similarity (STS) Benchmark Dataset [16] to evaluate the effectiveness of GATE in capturing semantic similarity compared to BLEU. The medical consent forms dataset has around 250 original sentences, their corresponding translations, and the back-translated texts, all provided by human translators. Due to the limited availability of medical data, we augmented our analysis with the STS benchmark dataset. In total, our experiments were conducted on 500 sentence pairs, with 250 pairs sourced from medical consent forms provided by a medical institute.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Graph comparison and GATE Score</title>
        <p>We compare the source sentence with the back-translated text by constructing RDF graphs for both. The distance between the graphs is measured as the Jaccard similarity coefficient [17] between the entities in the graphs. This way, the similarity between the source and back-translated sentence graphs is normalized between 0 and 1, where 1 denotes an exact match and 0 denotes no similarity. Algorithm 1 outlines the steps in the evaluation process. Specifically, for a source sentence s and its back-translated text b, the GATE Score is calculated as:</p>
        <p>GATE(s, b) = |E(s) ∩ E(b)| / |E(s) ∪ E(b)|</p>
        <p>where E(s) and E(b) denote the sets of entities (nodes) in the RDF graphs of s and b, respectively. For Figure 4, the GATE Score is 8 (the number of common entities) divided by 15 (the total number of unique nodes in both graphs), i.e., ~0.53.</p>
        <p>In the next section, we present the findings of our experiments along with a discussion of the insights gained from our research efforts while also addressing the current limitations of our metric.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results &amp; Discussion</title>
      <p>Our experiment implemented the proposed GATE metric alongside the baseline metric, BLEU. We calculated the Pearson correlation of the BLEU score and the GATE score against human judgment on the experiment dataset. Our results in Table 1 show that GATE achieves a significantly higher correlation with human judgment in translation evaluation tasks than the widely used metric BLEU. Specifically, GATE exhibits a ~70% improvement in correlation on the experiment data, with a Pearson correlation coefficient of 0.357 compared to BLEU’s 0.200. The higher correlation underscores the effectiveness of leveraging RDF graphs in capturing semantic information, thereby improving correlation with human judgments.</p>
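      <p>The Pearson correlation used here is the standard formula (a minimal sketch, not code from the paper; it assumes equal-length, non-constant score lists):</p>
      <preformat>
```python
# Minimal sketch of the Pearson correlation coefficient used to compare
# metric scores with human judgments (standard formula).
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length, non-constant lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```
      </preformat>
      <p>A metric whose scores rise and fall with the human scores yields a value near 1; the values reported above are 0.357 for GATE and 0.200 for BLEU.</p>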
      <p>Table 2 shows examples with corresponding human evaluation scores, GATE scores, and BLEU scores.
These examples serve to highlight GATE’s capability to better reflect human perception of semantic
similarity, as evidenced by its closer alignment with human judgments compared to BLEU scores. In
summary, our findings indicate that integrating RDF graphs with already existing back-translated
texts holds promise for reference-free translation evaluation. This metric can potentially assist human
evaluators who evaluate the translation of sensitive documents using back-translated texts.</p>
      <p>Using RDF graphs for translation evaluation could be helpful because they ‘encode’ real-world semantics, akin to how embeddings work in neural network frameworks (such as COMET), in contrast with metrics based on lexical-level information (such as BLEU). This work has the potential to pave the way for utilizing knowledge graphs in the field of translation evaluation alongside existing resources, such as word embeddings and LLM-based frameworks. Our experiments reinforce this belief, demonstrating that using knowledge graphs to encode meaning is helpful and yields better results than the baseline metric.</p>
      <p>Table 2 (excerpt): source/back-translation pairs with corresponding Human, GATE, and BLEU scores. The pairs include “A man is erasing a chalk board” / “The man is erasing the chalk board”, “Three men are playing guitars” / “Three men on stage are playing guitars”, “A woman is carrying a boy” / “A woman is carrying her baby”, and “A woman peels a potato” / “A woman is peeling a potato.”</p>
      <p>Given that FRED, which generates our RDF graphs, currently supports only English, and our metric compares the graphs of the original and back-translated texts, our metric is presently applicable only where English is the source language. However, the target language can be any other language as long as back-translation is available.</p>
      <p>While our results do not surpass state-of-the-art performance, they serve as a proof of concept, showcasing the effectiveness of leveraging RDF graphs for translation evaluation tasks. As FRED accommodates large sentences as well, our future work will involve working with more extensive real-world translated medical data and testing our methodology on longer sentences to demonstrate its effectiveness comprehensively. These results underscore the advantages of GATE over traditional metrics like BLEU and motivate further validation of GATE’s applicability on real-world data, particularly in domains like medicine, along with continuing our exploration of further improvements to the metric.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we introduce GATE, a novel metric based on the Resource Description Framework (RDF) designed for assessing the quality of translated medical texts for which back-translation is available. To showcase the effectiveness of our metric, we conducted experiments using selected medical data and the STS benchmark dataset, comparing the results against the baseline metric, BLEU, and human judgment scores. Notably, GATE exhibits a stronger correlation with human judgment than BLEU, achieving a higher Pearson correlation coefficient (0.357 compared to BLEU’s 0.200), representing approximately a 70% improvement over BLEU, the most commonly used metric.</p>
      <p>By leveraging back-translation and using RDF graphs to encode both semantic and syntactic information, GATE provides a reference-less and semantically aware assessment of translation quality. In comparison with more advanced Large Language Model (LLM)-based metrics such as COMET, our metric is computationally much lighter. It works for any target language, including low-resource languages, and does not require any training data. Our research shows that, in the field of translation evaluation, existing resources like back-translation and the Resource Description Framework can be helpful in real-world scenarios such as the medical domain.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Future Directions</title>
      <p>As part of future work, we would like to explore:</p>
      <p>1. Conducting further experiments to validate the efficacy of GATE on real-world translated medical data.</p>
      <p>2. Since translation and summarization can both be viewed as natural language generation from a textual context, investigating knowledge graphs such as RDF for evaluating summarization and similar natural language generation tasks beyond translation evaluation.</p>
      <p>3. For calculating the GATE score, experimenting with different formulas incorporating variations in the weights of entities, incoming edges, and outgoing edges.</p>
      <p>4. Addressing the challenge of language dependency in GATE by incorporating multilingual knowledge graphs, since FRED works only with English texts. A primary avenue for future work will be the inclusion of knowledge graphs available in other languages, making GATE language independent.</p>
      <p>5. Developing software similar to iBLEU that integrates the FRED API to facilitate automatic scoring of source and back-translated texts, with enhanced visualization and accessibility of the RDF metric.</p>
      <p>6. [18] shows that back-translation can be useful for improving translation quality for low-resource languages. We aim to combine neural networks with back-translation and knowledge graphs (such as knowledge graph embeddings) to improve our metric, making it suitable for evaluating translated sensitive texts, particularly for low-resource languages.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Our sincere gratitude to the late Prof. Ravi Kothari, on whose suggestion this research work was started. We thank the anonymous reviewers for their time and valuable suggestions for improving the paper. We also express our gratitude to Supriya Ranjan, Bhavesh Neekhra, Mamatha Alugubelly, and others for their invaluable feedback and support.</p>
      <p>[9] A. Tättar, M. Fishel, bleu2vec: the painfully familiar metric on continuous vector space steroids, in: Proceedings of the Second Conference on Machine Translation, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 619–622. URL: https://aclanthology.org/W17-4771. doi:10.18653/v1/W17-4771.</p>
      <p>[10] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, arXiv preprint arXiv:1904.09675 (2019).</p>
      <p>[11] A. Q, What is back translation?, 2021. URL: https://gtelocalize.com/what-is-back-translation/.</p>
      <p>[12] T. H. Trinh, T. Le, P. Hoang, M. Luong, A tutorial on data augmentation by back translation, https://github.com/vietai/dab (2019).</p>
      <p>[13] World Wide Web Consortium, Resource Description Framework (RDF) syntax specification (revised), 1998. URL: https://www.w3.org/TR/PR-rdf-syntax/.</p>
      <p>[14] A. Gangemi, V. Presutti, D. R. Recupero, A. G. Nuzzolese, F. Draicchio, M. Mongiovì, Semantic Web Machine Reading with FRED, Semantic Web 8 (2017) 873–893.</p>
      <p>[15] N. Madnani, iBLEU: Interactively debugging and scoring statistical machine translation systems, in: 2011 IEEE Fifth International Conference on Semantic Computing, 2011, pp. 213–214. doi:10.1109/ICSC.2011.36.</p>
      <p>[16] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, L. Specia, SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, 2017. URL: http://dx.doi.org/10.18653/v1/S17-2001. doi:10.18653/v1/S17-2001.</p>
      <p>[17] Wikipedia contributors, Jaccard similarity, https://en.wikipedia.org/wiki/Jaccard_index, 2023.</p>
      <p>[18] I. Abdulmumin, B. S. Galadanci, A. Isa, Enhanced back-translation for low resource neural machine translation using self-training, in: S. Misra, B. Muhammad-Bello (Eds.), Information and Communication Technology and Applications, Springer International Publishing, Cham, 2021, pp. 355–371.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Grunwald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goldfarb</surname>
          </string-name>
          ,
          <article-title>Back translation for quality control of informed consent forms</article-title>
          2 (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Boore</surname>
          </string-name>
          ,
          <article-title>Translation and back-translation in qualitative nursing research: methodological review</article-title>
          ,
          <source>Journal of Clinical Nursing</source>
          <volume>19</volume>
          (
          <year>2010</year>
          )
          <fpage>234</fpage>
          -
          <lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          METEOR:
          <article-title>An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          ,
          <source>in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: https://aclanthology.org/W05-0909.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G. C.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Farinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Glushkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Coheur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F. T.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <article-title>COMET-22: Unbabel-IST 2022 submission for the metrics shared task</article-title>
          ,
          <source>in: Proceedings of the Seventh Conference on Machine Translation (WMT)</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
          <year>2022</year>
          , pp.
          <fpage>578</fpage>
          -
          <lpage>585</lpage>
          . URL: https://aclanthology.org/2022.wmt-1.52.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Doddington</surname>
          </string-name>
          ,
          <article-title>Automatic evaluation of machine translation quality using n-gram co-occurrence statistics</article-title>
          ,
          <source>in: Proceedings of the second international conference on Human Language Technology Research</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>138</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Snover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dorr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Micciulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Makhoul</surname>
          </string-name>
          ,
          <article-title>A study of translation edit rate with targeted human annotation</article-title>
          ,
          <source>in: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Paetzold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hirst</surname>
          </string-name>
          ,
          <article-title>Quality estimation for machine translation</article-title>
          , volume
          <volume>11</volume>
          , Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tättar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fishel</surname>
          </string-name>
          ,
          <article-title>bleu2vec: the painfully familiar metric on continuous vector space steroids</article-title>
          , in:
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Federmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Yepes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kreutzer</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Second Conference on Machine Translation,</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>