<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards User-Reliable Machine Translation Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sofía García González</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of the Basque Country (UPV/EHU)</institution>
          ,
          <addr-line>Manuel Lardizabal pasealekua, 1, 20018 Donostia-San Sebastian, Gipuzkoa</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>imaxin|software</institution>
          ,
          <addr-line>Rúa Salgueiriños de Abaixo, 11, Santiago de Compostela, 15706, Galicia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The advent of Neural Machine Translation has ushered in a new era of progress in the field of Natural Language Processing. However, Neural Machine Translation is not without its limitations. It is prone to a number of translation errors, including omissions, hallucinations, and other issues that arise from the machine translation process itself. These errors can be problematic in production environments. This thesis will evaluate different methodologies for automatic Machine Translation evaluation and error detection for low-resource languages in an industrial context.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine translation</kwd>
        <kwd>Automatic Evaluation</kwd>
        <kwd>Low Resource Languages</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>This research project is a part-time industrial thesis carried out between the University of the Basque Country
(UPV/EHU) and imaxin|software, a software company from Galicia specialized in Machine Translation
(MT), especially for Spanish, Galician and other low-resource languages (LRL). The main topic of this thesis is MT evaluation.</p>
      <p>
        MT can be defined as any use of computer systems to transform a computerised text written in a
source language into a different computerised text written in a target language, thereby generating what
is known as a raw translation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Like other text generation tasks, this field has evolved considerably since the advent of Transformer
models in 2017 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Neural Machine Translation (NMT) generates
more fluent, more natural and more accurate translations than older methods such as Rule-Based
Machine Translation (RBMT) or Statistical Machine Translation (SMT), even between distant languages
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Additionally, the advent of Large Language Models (LLM) has precipitated a paradigm shift in NMT,
giving rise to novel scenarios that evolve at a rapid pace [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, the errors made by these systems
are challenging for users to detect, particularly because the most significant errors are not translation
errors, but errors in the models themselves. This renders them unreliable for both users and companies
that wish to implement them. This is why MT evaluation methods are of such importance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        MT evaluation can be divided into two categories: human evaluation and automatic evaluation.
The latter has traditionally relied on metrics that compare the translation hypothesis with a reference text,
yielding a corpus-level score. Until recently, however, no
automatic evaluation method was capable of detecting and pointing out translation errors, owing to the intrinsic
difficulty of this task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The novel approaches to MT and MT evaluation are predominantly framed within the context of High
Resource Languages (HRL) such as English. The substantial quantity of data, often annotated, required
to train these types of evaluation systems, or the absence of a base neural model or LLM, renders these
methods ineffective or even detrimental for LRL such as Galician. This is the rationale behind this thesis,
which seeks to address the following research question: "Which is the most effective method for
detecting errors and evaluating machine translation in an industrial context for low-resource
languages?".</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In recent years, Natural Language Processing (NLP) has improved and evolved significantly due to
new architectures and greater computational power [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. NMT has greatly improved on the performance of
RBMT and SMT, particularly since the development of the Transformer architecture [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. More
recently, LLMs such as GPT-3 (https://chatgpt.com/) and GPT-4 (https://openai.com/index/gpt-4/) have also shown promising results in this field, both for
generic translation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and domain adaptation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Machine Translation Errors</title>
        <p>
          Machine translation has become more sophisticated and more fluent, and the advent of
multilingual models has facilitated the development and deployment of MT between distant languages and for LRL.
Nevertheless, despite these advances, machine translation remains imperfect,
particularly because the new types of errors made by these systems often go unnoticed
by the user, despite their potential for causing significant issues. These errors can be categorised
into two distinct types. The first category encompasses errors intrinsic to the system itself, such as
hallucinations and omissions, whereas the second category comprises translation errors, including
those pertaining to grammar, syntax and style [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          Hallucinations can occur during translation when the model produces a sentence that is grammatically
correct but does not accurately convey the meaning of the source sentence. These can be either partial,
where only a portion of the source sentence is inaccurately translated, or total, where the entire
translated sentence is incorrect in content [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Omissions, on the other hand, occur when the model
drops part of the source sentence, so that relevant information from the original
sentence is lost [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. These are the most significant errors, as they can result in the loss of information or a
change of meaning in the machine translation. Syntactic and morphological errors, as well
as the incorrect translation of entities, are inherent to machine translation in general and may also occur in these
types of models.
        </p>
        <p>
          The causes of errors intrinsic to neural systems, such as hallucinations or omissions, are not yet well understood.
Although these phenomena are not common in machine translation, they can give rise to
misunderstandings and problems in production. With regard to the errors inherent in machine
translation, it is possible that many are due to the training corpus itself [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Machine Translation Evaluation</title>
        <p>
          MT evaluation can be divided into two main categories: human evaluation and automatic
evaluation. Human evaluation, while more dependable, is more expensive and time-consuming.
It assesses the adequacy of the translation, that is, whether the meaning of the original sentence is
preserved in the translated one, as well as its fluency, that is, whether the translation is grammatically
and syntactically correct in the target language. Several evaluation
frameworks exist within this domain. They serve as a guide for evaluators, providing a consistent approach to
MT evaluation. Two prominent examples are Multidimensional Quality Metrics (MQM) and
Direct Assessments (DA). These types of annotations have been used to train different evaluation models
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Automatic Evaluation: Metrics</title>
          <p>
            Automatic evaluation, on the other hand, is less costly and time-consuming, but less reliable than human
evaluation. Traditionally, MT system quality has been evaluated with lexical-based metrics. These
metrics provide a score at the sentence or document level by comparing the MT output with a reference
text previously reviewed by linguists or native speakers, known as the gold standard. This comparison between
the gold standard and the MT output can be done in three ways: at token level, using metrics that
compare both documents token by token, such as Bilingual Evaluation Understudy (BLEU) [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], which
has long been the reference metric; at character level, with metrics such as the CHaRacter-level F-score (chrF)
[
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]; and, finally, by measuring the number of edits (insertions, deletions, substitutions or rearrangements)
required to transform the MT output into the reference text. This metric is known
as Translation Edit Rate (TER) [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. Such metrics have been widely criticised, for
two main reasons. Firstly, they rely on a reference text
with which to compare the MT output, which is not available in all contexts. Secondly, natural
language is highly versatile, so they may score as erroneous translations that are in fact correct
simply because they differ from the gold standard [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ].
          </p>
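          <p>As a rough illustration of the edit-based comparison described above, the following sketch computes a simplified word-level edit rate: the number of insertions, deletions and substitutions needed to turn the MT output into the reference, divided by the reference length. It is not the TER implementation used in this work. It assumes whitespace tokenisation and ignores the rearrangement (shift) operation; the function name <monospace>word_edit_rate</monospace> and the example sentences are hypothetical.</p>
          <preformat>
```python
def word_edit_rate(hypothesis: str, reference: str) -> float:
    """Simplified TER: word-level edit distance / reference length (no shifts)."""
    hyp = hypothesis.split()  # assumption: whitespace tokenisation
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # previous[j] = edits needed to turn the hypothesis prefix seen so far into ref[:j]
    previous = list(range(len(ref) + 1))
    for i, hyp_token in enumerate(hyp, start=1):
        current = [i]  # turning i hypothesis tokens into an empty reference: i deletions
        for j, ref_token in enumerate(ref, start=1):
            substitution = previous[j - 1] + (0 if hyp_token == ref_token else 1)
            deletion = previous[j] + 1      # drop hyp_token
            insertion = current[j - 1] + 1  # add ref_token
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical example: one substitution ("a" for "the") and one missing word ("on").
reference = "the minister signed the new agreement on Monday"
hypothesis = "the minister signed a new agreement Monday"
print(word_edit_rate(hypothesis, reference))  # 2 edits / 8 reference tokens = 0.25
```
          </preformat>
          <p>A lower value means fewer edits are needed. The example also shows the weakness noted in the text: a correct paraphrase that differs from the reference would be penalised in the same way as a genuine error.</p>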
          <p>
            More recently, metrics based on embeddings have emerged. These metrics extend beyond a mere
lexical comparison of translations and references, instead enabling a semantic comparison at the word
or sentence embedding level. An example of this type of metric is the Crosslingual Optimized Metric for
Evaluation of Translation (COMET), a multilingual MT evaluation framework that is able to evaluate
MT with different reference-dependent and reference-free models [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
          </p>
          <p>
            Another method for automatic MT evaluation is Quality Estimation (QE), the task of
estimating the quality of MT in real time without the need for reference translations [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. Some
examples of QE metrics are TransQuest (https://huggingface.co/TransQuest/monotransquest-da-en_de-wiki) [
            <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
            ], a bilingual and multilingual framework that uses
XLM-R embeddings of source and hypothesis sentences to predict the QE score, and the wmt-comet-qe-da and
wmt-cometkiwida models (https://github.com/Unbabel/COMET/blob/master/MODELS.md) [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ]. Although not requiring a reference text is an advantage,
QE is often limited to HRL and general domains, as it requires an annotated training corpus.
QE can be conducted at the word, sentence, or document level [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Automatic Evaluation: Error Detection</title>
          <p>
            Several recent studies have examined error detection
in greater depth. For instance, LLMs have shown a high capacity to evaluate machine
translation through error analysis, pointing out errors and even explaining them and suggesting
improvements to the translation. However, they still fail to correlate error analysis with translation
quality metrics [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]. Despite the good quality shown by large LLMs such as GPT-3 or GPT-4, when attempts have
been made to replicate this task with open-source LLMs such as LLAMA (https://llama.meta.com/), the results have deteriorated
[
            <xref ref-type="bibr" rid="ref20">20</xref>
            ].
          </p>
          <p>
            On the other hand, Unbabel (https://unbabel.com/), the company that develops COMET, has released a new model,
XCOMET (https://unbabel.com/xcomet-translation-quality-analysis/), which is capable of detecting linguistic errors and hallucinations and classifying them
according to their severity: minor, major or critical [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. Nevertheless, both the LLMs and XCOMET
have only been tested on HRL, such as English, Chinese or Russian. LRL are less well represented in large
language models and have less data available for training these types of systems.
          </p>
          <p>
            In this context, Dale et al. [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] have produced manually annotated datasets to detect hallucinations
and omissions. These datasets encompass 18 language pairs, including HRL pairs, LRL pairs, and one
zero-shot pair. This work aims to advance the field of hallucination and omission detection and is the
first to produce a human-annotated dataset for this purpose.
          </p>
          <p>
            Finally, another significant contribution to this field is that of Don-Yehiya et al. [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ]. In this work
they investigated the possibility of predicting the quality of a translated sentence based on the
semantic and linguistic characteristics of the source sentence. To this end, they developed a model
capable of determining the intrinsic difficulty of a source sentence. They
conclude that the characteristics of a sentence determine how difficult or easy it is to translate for
different machine translation models. Consequently, it is now possible to predict whether a sentence is
likely to be mistranslated before it is translated. This can be of great assistance to both linguists and
users, but more investigation is needed.
          </p>
          <p>In conclusion, research on error detection in machine translation and other generative tasks is still in its
infancy and is only being developed for HRL. For other contexts, such as LRL and very specific domains,
there is very little training data, and these languages are scarcely represented in the base models. Moreover, both the
language models and neural metrics such as COMET have a high computational
cost, which makes them inefficient to put into production in small companies.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Research Questions and Hypotheses</title>
      <sec id="sec-3-1">
        <title>3.1. Evaluation Metrics</title>
        <p>RQ1: What is the starting point? In order to assess the value of the information that some of the
metrics mentioned in section 2.2.1 can provide, the initial step of this thesis will be to establish, for the
first time, the state of the art in machine translation for the Spanish–Galician and English–Galician pairs.
The first part of this research question has been published at PROPOR 2024. In this first paper, we
conducted an exhaustive evaluation of all existing models for these language pairs, encompassing
both NMT and RBMT systems, across several test datasets from the general, legal, and health
domains. The metrics employed were BLEU, chrF, TER, and COMET, in addition to an error analysis
[22]. The remaining translation direction will be published in due course.</p>
        <p>The results obtained in this initial study indicate a correlation between the lexical-based metrics:
they exhibit homogeneous results, with minimal variation between them, and consistently
follow the same patterns across models. In contrast, COMET yields considerably higher scores than the
lexical-based metrics, with smaller differences between models. From these initial findings, it can be
concluded that the metrics are useful for identifying models that perform poorly and therefore
need further revision and retraining. Conversely, models with high metric scores demonstrate
satisfactory performance, although the metrics themselves provide limited insight into the nuances of
translation quality.</p>
        <p>RQ2: If each sentence is individually evaluated, can we identify a correlation between the
metrics and the errors? Once the corpus-level evaluations have been completed, another question
that must be addressed is whether the sentence-by-sentence evaluation aligns with the error analysis.
This entails determining whether and how the metrics penalise the same types of errors. To answer
this question, the same metrics mentioned in RQ1 will be employed to evaluate the generic corpus on a
sentence-by-sentence basis. The primary hypothesis is that, contingent on the typology (lexical-based,
edit distance-based or embedding-based), each metric will impose a distinct penalty on the errors
identified in the sentences. Consequently, each metric will yield disparate insights.</p>
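        <p>One possible way to quantify the relationship raised in RQ2 is sketched below. It computes a correlation coefficient between sentence-level metric scores and the number of errors annotated in each sentence. The choice of Pearson's coefficient is an assumption made for illustration, as is every value in the example: the scores and error counts are hypothetical and do not come from the experiments described here.</p>
        <preformat>
```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) in (0, 1):
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / (spread_x * spread_y)


# Hypothetical data: one metric score and one manual error count per sentence.
sentence_scores = [0.92, 0.81, 0.40, 0.65, 0.15]
error_counts = [0, 1, 3, 2, 5]
r = pearson(sentence_scores, error_counts)
print(f"Pearson r = {r:.2f}")
```
        </preformat>
        <p>A strongly negative coefficient would suggest that the metric penalises the annotated errors. Repeating the calculation per metric and per error type would show whether each metric penalises different errors, as hypothesised.</p>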
        <p>
          RQ3: Can errors in machine translation be identified prior to their occurrence? Based on
the work of Don-Yehiya et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and the analysis in García and Rigau [22], which revealed
that each translation model consistently produces similar errors, our hypothesis is that it is feasible to
develop a model that can predict the likelihood of certain linguistic phenomena being challenging to
detect in the target language, contingent on the translation model employed.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Error Detection in the Machine Translation</title>
        <p>
          RQ4: What is the most appropriate methodology for the generation of annotated data in an
LRL? One of the significant limitations of a low-resource language such as Galician for the training of
systems capable of detecting errors is the lack of an annotated corpus. Works such as XCOMET [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ],
PreQuEL [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] or the LLMs themselves have required a corpus annotated with errors for their training.
In the case of Galician, it would be necessary to generate this kind of data from scratch. Consequently,
we will adopt either the methodology proposed in Guerreiro et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] to tag data in accordance with
the characteristics of the MQM corpora, or the methodology proposed by Dale et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to generate
annotated corpora with hallucinations and omissions.
        </p>
        <p>RQ5: Is it possible to detect MT errors through interpretable Semantic Textual Similarity
(iSTS)? Semantic textual similarity is defined as the measure of semantic equivalence between two
blocks of text [23]. At WMT12, Castillo and Estrella [24] proposed, for the first time, the use of
semantic textual similarity for machine translation evaluation. They used WordNet to
calculate how equivalent a hypothesis sentence is to a reference and achieved
promising results at system level at the time. Nowadays, metrics such as COMET or BLEURT [25] use sentence and
word embeddings to compare semantic similarity at the word or sentence level.</p>
        <p>In contrast, to the best of our knowledge, there has never been a proposal for the use of interpretable
semantic textual similarity in MT evaluation and error detection. iSTS can be defined as giving meaning
to semantic similarity between short texts [26]. See the example in Figure 1.</p>
        <p>Figure 1: (a) Two similar sentences in English, taken from Lopez-Gazpio et al. [27], used to illustrate the
interpretability layer. (b) Explanation given by the model in Lopez-Gazpio et al. [27] about the differences between the two
sentences in Subfigure 1a: "The two sentences are very similar. Note that ’in bus accident’ is a bit more specific than ’in road
accident’ in this context. Note also that ’12’ and ’10’ are very similar in this context. Note also that ’in
Pakistan’ is a bit more general than ’in NW Pakistan’ in this context."</p>
        <p>In Subfigure 1a, the sentences "12 killed in bus accident in Pakistan" and "10 killed in road accident in NW Pakistan"
differ from each other in two minor respects: the number of people killed and the type of accident.
Although the meaning conveyed by the two sentences is similar, it is not identical. It is precisely
the aspects in which the two sentences differ that are highlighted in the text of Subfigure 1b [27].</p>
        <p>This thesis therefore proposes the extension of iSTS to a multilingual and crosslingual context as an
evaluation approach for machine translation. In this context, the interpretability layer will function as
an error analysis that is able to explain the MT errors to the user. The aim is therefore to develop a
metric capable of identifying discrepancies between languages without the need for a reference text.
The hypothesis is that, with the current development of neural networks and large language models, it
will be possible to expand iSTS to a multilingual context. The objective is to optimise this technique to
achieve a metric similar to XCOMET, but capable of explaining errors and not just classifying them.</p>
        <p>
          RQ6: Can open-source LLMs identify errors in LRL contexts? As previously stated in section
2.2.2, despite the remarkable capacity of large language models such as GPT-3 and GPT-4 to conduct
error analysis on MT evaluation [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], Guerreiro et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] have indicated that open-source LLMs
exhibit inferior performance and fail to attain the quality of proprietary LLMs.
        </p>
        <p>Currently, there are two LLMs available for Galician: Carballo-bloom-1.3B (https://huggingface.co/proxectonos/Carballo-bloom-1.3B) and
Carballo-cerebras-1.3B (https://huggingface.co/proxectonos/Carballo-cerebras-1.3B). However, neither of them is fine-tuned for MT, MT evaluation or other specific tasks. In this thesis we
will fine-tune Carballo-bloom-1.3B, as it is based on FLOR-1.3B (https://huggingface.co/projecte-aina/FLOR-6.3B-Instructed),
which is in turn based on the multilingual LLM Bloom-1.7B (https://huggingface.co/bigscience/bloom-1b7). Consequently, we expect it to
retain the ability to recognise languages other than Galician, such as English or Spanish.
Recent advances in techniques for adapting these models, even in the context of low-resource
languages, suggest that even modest LLMs may yield satisfactory performance.</p>
      </sec>
      <sec id="sec-4-1">
        <title>3.3. Error Correction</title>
        <p>The objective of this study is not only to identify errors in machine translation but also to ascertain the
feasibility of correcting them. This section presents two hypotheses that have been formulated
in this regard.</p>
        <p>RQ7: Can we use multilingual models to translate from a bad MT to a good MT? One
of the research questions to be addressed in this thesis is whether it is possible to use multilingual
models to translate from one language into the same language in order to improve the MT produced by
another model. That is, to provide the model with an MT containing errors and have it translate it into the
same language, in order to rectify the existing errors. The hypothesis is that a multilingual model with
sufficient knowledge of the language will be able to correct a machine translation generated by another
model.</p>
        <p>To this end, we will use the multilingual models M2M100 (https://huggingface.co/facebook/m2m100_418M) [28] and
NLLB200 (https://huggingface.co/facebook/nllb-200-distilled-600M) [29], which include Galician among their languages and were identified as the most
effective in García and Rigau [22].</p>
        <p>RQ8: Is it possible to improve translation models based on error detection? The hypothesis
put forth for this research question is that every model will make certain types of errors. Therefore,
detecting these errors will not only assist the user, but also serve as a basis for retraining in order to
improve the model.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Timeline</title>
      <p>Figure 2: Thesis timeline. Stages shown: RQ1 and RQ2 (first paper published at PROPOR 2024); RQ3, RQ5 and RQ6; RQ4, RQ7 and RQ8.</p>
      <p>Figure 2 illustrates the planning of the thesis. The initial part will encompass an examination of
conventional MT evaluation metrics and will mark the significant shift occurring in the realm of MT
and its evaluation as a result of the advent of LLMs. A preliminary paper on this aspect of the research
has already been published, as mentioned in section 3 for RQ1. By the end of 2024, the initial
part of the thesis is expected to be completed, addressing RQ1 and RQ2.</p>
      <p>On the other hand, as previously discussed in section 3, the publication of the new LLMs for Galician
will be accompanied by experiments aimed at retraining them for MT and MT evaluation. To this end,
it will be necessary to create a corpus annotated with errors and their typology, which will address
RQ4 and RQ6. Similarly, the creation of these corpora will permit the training of systems based
on embeddings to detect errors, thereby providing an answer to RQ5. Furthermore, the detection of
errors prior to translation will be facilitated, thus providing an answer to RQ3. This part of the thesis
is scheduled to take place between mid-2024 and mid-2026. Finally, the final year
will be dedicated to enhancing the systems themselves by identifying errors (RQ8) or improving the
translation process (RQ7).</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In conclusion, the experiments described in Section 3 will be used to determine
the optimal methodology for MT evaluation for an LRL in an industrial context. In the rapidly
evolving landscape of recent years, new technologies
tend to focus on HRL and entail a computational cost that is very high for SMEs. The
objective of this thesis is to conduct a comparative study of the existing technologies and to assess their
potential for implementation in a real-world context.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to express our gratitude to the Nós project members for their assistance and guidance
during the development of the methodological part of the project. Additionally, computational resources
for this research were provided by UPV/EHU and imaxin|software. Finally, we acknowledge the
funding received from the following projects:
(i) DeepKnowledge (PID2021-127777OB-C21) and ERDF A way of making Europe.
(ii) DeepR3 (TED2021-130295B-C31) and European Union NextGeneration EU/PRTR.
(iii) ILENIA (2022/TL22/00215335), the EU-funded NextGenerationEU Recovery, Transformation and
Resilience Plan.</p>
      <p>in advance (2022). URL: http://arxiv.org/abs/2205.09178.</p>
      <p>[22] S. García, G. Rigau, Study of the state of the art Galician machine translation: English-Galician and
Spanish-Galician models, in: P. Gamallo, D. Claro, A. Teixeira, L. Real, M. Garcia, H. G. Oliveira,
R. Amaro (Eds.), Proceedings of the 16th International Conference on Computational Processing
of Portuguese, Association for Computational Linguistics, Santiago de Compostela, Galicia/Spain,
2024, pp. 411–421. URL: https://aclanthology.org/2024.propor-1.42.</p>
      <p>[23] D. Chandrasekaran, V. Mago, Evolution of semantic similarity – a survey (2020). URL:
http://arxiv.org/abs/2004.13820. doi:10.1145/3440755.</p>
      <p>[24] J. Castillo, P. Estrella, Semantic textual similarity for MT evaluation, in: C. Callison-Burch,
P. Koehn, C. Monz, M. Post, R. Soricut, L. Specia (Eds.), Proceedings of the Seventh Workshop on
Statistical Machine Translation, Association for Computational Linguistics, Montréal, Canada,
2012, pp. 52–58. URL: https://aclanthology.org/W12-3103.</p>
      <p>[25] T. Sellam, D. Das, A. P. Parikh, BLEURT: Learning robust metrics for text generation, in: Proceedings
of ACL, 2020.</p>
      <p>[26] A. A. Abafogi, Survey on interpretable semantic textual similarity and its applications, International
Journal of Innovative Technology and Exploring Engineering 10 (2021) 14–18. URL:
https://www.ijitee.org/portfolio-item/B82941210220/. doi:10.35940/ijitee.B8294.0110321.</p>
      <p>[27] I. Lopez-Gazpio, M. Maritxalar, A. Gonzalez-Agirre, G. Rigau, L. Uria, E. Agirre, Interpretable
semantic textual similarity: Finding and explaining differences between sentences (2016). URL:
http://arxiv.org/abs/1612.04868. doi:10.1016/j.knosys.2016.12.013.</p>
      <p>[28] A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Celebi, G. Wenzek,
V. Chaudhary, et al., Beyond English-centric multilingual machine translation, Journal of Machine
Learning Research 22 (2021) 1–48.</p>
      <p>[29] NLLB Team, M. Costa-jussa, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi,
J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault,
G. Gonzalez, P. Hansanti, J. Wang, No language left behind: Scaling human-centered machine
translation (2022). doi:10.48550/arXiv.2207.04672.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Forcada</surname>
          </string-name>
          ,
          <article-title>Building machine translation systems for minor languages: Challenges and effects</article-title>
          ,
          <source>Revista de llengua i dret // Journal of language and law</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Otter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Medina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Kalita</surname>
          </string-name>
          ,
          <article-title>A survey of the usages of deep learning for natural language processing</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>32</volume>
          (
          <year>2021</year>
          )
          <fpage>604</fpage>
          -
          <lpage>624</lpage>
          . doi:10.1109/TNNLS.2020.2979670.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
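          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>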
          ,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          Curran Associates, Inc.,
          <year>2017</year>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Multilingual machine translation with large language models: Empirical results and analysis</article-title>
          (
          <year>2023</year>
          ). URL: http://arxiv.org/abs/2304.04675.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <article-title>A survey on evaluation metrics for machine translation</article-title>
          ,
          <source>Mathematics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          ). doi:10.3390/math11041006.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Prompting large language model for machine translation: A case study</article-title>
          , in:
          <source>International Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>41092</fpage>
          -
          <lpage>41110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Moslem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Haque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kelleher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Way</surname>
          </string-name>
          ,
          <article-title>Adaptive machine translation with large language models</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Nurminen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koponen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Latomaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mikhailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schierl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vanmassenhove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aranberri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nunziatini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Escartín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forcada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Moniz</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 24th Annual Conference of the European Association for Machine Translation, European Association for Machine Translation</source>
          , Tampere, Finland,
          <year>2023</year>
          , pp.
          <fpage>227</fpage>
          -
          <lpage>237</lpage>
          . URL: https://aclanthology.org/2023.eamt-1.22.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Briakou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Martindale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carpuat</surname>
          </string-name>
          ,
          <article-title>Understanding and detecting hallucinations in neural machine translation via model introspection</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>546</fpage>
          -
          <lpage>564</lpage>
          . URL: https://aclanthology.org/2023.tacl-1.32. doi:10.1162/tacl_a_00563.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Voita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hansanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ropers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kalbassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Costa-jussà</surname>
          </string-name>
          ,
          <article-title>Halomi: A manually annotated benchmark for multilingual hallucination and omission detection in machine translation</article-title>
          , in:
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>638</fpage>
          -
          <lpage>653</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Farinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>COMET: A neural framework for MT evaluation</article-title>
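          , in:
          <string-name>
            <given-names>B.</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          (Eds.),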
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>2685</fpage>
          -
          <lpage>2702</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.213. doi:10.18653/v1/2020.emnlp-main.213.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          , in:
          <string-name>
            <given-names>P.</given-names>
            <surname>Isabelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Charniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Popović</surname>
          </string-name>
          ,
          <article-title>chrF: character n-gram F-score for automatic MT evaluation</article-title>
          , in:
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Federmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hokamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Logacheva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pecina</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics</source>
          , Lisbon, Portugal,
          <year>2015</year>
          , pp.
          <fpage>392</fpage>
          -
          <lpage>395</lpage>
          . URL: https://aclanthology.org/W15-3049. doi:10.18653/v1/W15-3049.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Snover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dorr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Micciulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Makhoul</surname>
          </string-name>
          ,
          <article-title>A study of translation edit rate with targeted human annotation</article-title>
          , in:
          <source>Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers</source>
          , Association for Machine Translation in the Americas
          , Cambridge, Massachusetts, USA,
          <year>2006</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>231</lpage>
          . URL: https://aclanthology.org/2006.amta-papers.25.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-k.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Avramidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kocmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F. T.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <article-title>Results of WMT22 metrics shared task: Stop using BLEU - neural metrics are better and more robust</article-title>
          , in:
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Costa-jussà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Federmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fishel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fraser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grundkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Guzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jimeno Yepes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kocmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morishita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Monz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nagata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nakazawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Névéol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Seventh Conference on Machine Translation (WMT)</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
          <year>2022</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>68</lpage>
          . URL: https://aclanthology.org/2022.wmt-1.2.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>From handcrafted features to LLMs: A brief survey for machine translation quality estimation</article-title>
          ,
          <source>arXiv preprint arXiv:2403.14118</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Orasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mitkov</surname>
          </string-name>
          ,
          <article-title>Transquest: Translation quality estimation with cross-lingual transformers</article-title>
          , in:
          <source>Proceedings of the 28th International Conference on Computational Linguistics</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ranasinghe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Orasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mitkov</surname>
          </string-name>
          ,
          <article-title>Transquest at WMT2020: Sentence-level direct assessment</article-title>
          , in:
          <source>Proceedings of the Fifth Conference on Machine Translation</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Treviso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Guerreiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Farinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maroti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G. C.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Glushkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Coheur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F. T.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <article-title>CometKiwi: IST-unbabel 2022 submission for the quality estimation shared task</article-title>
          , in:
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Costa-jussà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Federmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fishel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fraser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grundkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Guzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jimeno Yepes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kocmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morishita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Monz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nagata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nakazawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Névéol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Seventh Conference on Machine Translation (WMT)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          ,
          <publisher-loc>Abu Dhabi, United Arab Emirates (Hybrid)</publisher-loc>
          ,
          <year>2022</year>
          , pp.
          <fpage>634</fpage>
          -
          <lpage>645</lpage>
          . URL: https://aclanthology.org/2022.wmt-1.60.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Guerreiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>van Stigt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Coheur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colombo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F. T.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <article-title>xCOMET: Transparent machine translation evaluation through fine-grained error detection</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.10482.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Don-Yehiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Choshen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Abend</surname>
          </string-name>
          ,
          <article-title>PreQuEL: Quality estimation of machine translation outputs</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>