Towards User-Reliable Machine Translation Evaluation

Sofía García González 1,2
1 University of the Basque Country (UPV/EHU), Manuel Lardizabal pasealekua, 1, 20018 Donostia-San Sebastian, Gipuzkoa (Spain)
2 imaxin|software, Rúa Salgueiriños de Abaixo, 11, Santiago de Compostela, 15706, Galicia (Spain)

Doctoral Symposium on Natural Language Processing, 26 September 2024, Valladolid, Spain.

Abstract
The advent of Neural Machine Translation has ushered in a new era of progress in the field of Natural Language Processing. However, Neural Machine Translation is not without its limitations. It is prone to a number of translation errors, including omissions, hallucinations, and other issues that arise from the machine translation process itself. These errors can be problematic in production environments. This thesis will evaluate different methodologies for automatic Machine Translation evaluation and error detection for low-resource languages in an industrial context.

Keywords
Machine Translation, Automatic Evaluation, Low-Resource Languages

1. Introduction and Motivation

This research project is a part-time industrial thesis between the University of the Basque Country (UPV/EHU) and imaxin|software, a software company from Galicia specialized in Machine Translation (MT), especially for Spanish, Galician and other low-resource languages (LRL). The main topic of this thesis is MT evaluation.

MT can be defined as any use of computer systems to transform a computerised text written in a source language into a different computerised text written in a target language, thereby generating what is known as a raw translation [1]. This field, like other text generation tasks, has evolved tremendously since the advent of Transformer models in 2017 [2]. Neural Machine Translation (NMT) generates more fluent, more natural and more accurate translations than older methods such as Rule-Based Machine Translation (RBMT) or Statistical Machine Translation (SMT), even between distant languages [3]. Additionally, the advent of Large Language Models (LLM) has precipitated a paradigm shift in NMT, giving rise to novel scenarios that evolve at a rapid pace [4].

However, the errors made by these systems are challenging for users to detect, particularly because the most significant errors are not translation errors, but errors in the models themselves. This renders them unreliable for both users and companies that wish to implement them. This is why MT evaluation methods are of such importance [5]. MT evaluation can be divided into two categories: human evaluation and automatic evaluation, which has traditionally involved metrics that compare a reference text with the translation hypothesis, resulting in a corpus-level score. However, until the advent of recent studies, there has been no automatic evaluation method capable of detecting and pointing out translation errors, due to the intrinsic difficulty of this task [5].

The novel approaches to MT and MT evaluation are predominantly framed within the context of High Resource Languages (HRL) such as English. The substantial quantity of data, often annotated, required to train these types of evaluation systems, or the absence of a basic neural or LLM model, render these methods ineffective or even detrimental for LRL such as Galician. This is the rationale behind this thesis, which seeks to address the following research question: "Which is the most effective method for detecting errors and evaluating machine translation in an industrial context for low-resource languages?".
sofia.garcia@imaxin.com (S. G. González)

2. Related Work

In recent years, Natural Language Processing (NLP) has improved and evolved significantly due to new architectures and higher computational power [2]. NMT has greatly improved upon the performance of RBMT and SMT, particularly after the development of the Transformer architecture [3]. More recently, LLMs such as GPT-3 (https://chatgpt.com/) and GPT-4 (https://openai.com/index/gpt-4/) have also shown promising results in this field, both for generic translation [6] and domain adaptation [7].

2.1. Machine Translation Errors

Nowadays, machine translation has become more sophisticated and more fluent, and the advent of multilingual models has facilitated the advancement and deployment of MT between distant languages and LRL. Nevertheless, despite the aforementioned advances, machine translation remains imperfect, particularly because the new types of errors made by these systems are often not identified by the user, despite their potential for causing significant issues. These errors can be categorised into two distinct types. The first category encompasses errors intrinsic to the system itself, such as hallucinations and omissions, whereas the second category comprises translation errors, including those pertaining to grammar, syntax and style [8].

Hallucinations can occur during translation when the model produces a sentence that is grammatically correct but does not accurately convey the meaning of the source sentence. These can be either partial, where only a portion of the source sentence is inaccurately translated, or total, where the entire translated sentence is incorrect in content [8]. On the other hand, omissions can occur if the model forgets part of the source sentence, resulting in the loss of relevant information from the original sentence [9]. These are the most significant errors, as they can result in the loss of information or a change of meaning in the machine translation. Conversely, syntactic and morphological errors, as well as the incorrect translation of entities, are inherent to machine translation and may also occur in this type of model. The causes of errors intrinsic to neural systems, such as hallucinations or omissions, remain unsolved. Although these phenomena do not occur generically in machine translation, they can give rise to misunderstandings and problems when MT is put into production. With regard to the errors inherent in machine translation, it is possible that many may be due to the training corpus itself [8].

2.2. Machine Translation Evaluation

Nowadays, MT evaluation can be divided into two main categories: human evaluation and automatic evaluation. Human evaluation, while more dependable, is more expensive and time-consuming. It assesses the adequacy of the translation, ensuring that the meaning of the original sentence is preserved in the translated one, as well as the fluency, ensuring that the translation is grammatically and syntactically correct in the target language. Within this domain, there exist a variety of evaluation frameworks. These frameworks serve as a guide for evaluators, providing a consistent approach to MT evaluation. Two prominent examples are the Multidimensional Quality Metrics (MQM) and the Direct Assessments (DA). These types of annotations have been used to train different evaluation models [10].
2.2.1. Automatic Evaluation: Metrics

On the other hand, automatic evaluation is less costly and time-consuming, but less reliable than human evaluation. Traditionally, MT system quality has been evaluated with lexical-based metrics. These metrics provide a score at the sentence or document level by comparing the MT output with a reference text previously reviewed by linguists or native speakers, known as the gold standard. This comparison between the gold standard and the MT output can be done at three levels: at token level, using metrics that compare both documents token by token, such as Bilingual Evaluation Understudy (BLEU) [11], which has remained the reference metric to this day; at character level, such as CHaRacter-level F-score (chrF) [12]; and, finally, by measuring the number of changes (insertions, deletions, substitutions or rearrangements) required to transform the MT output into the reference text, as done by Translation Edit Rate (TER) [13]. Such metrics have been the subject of significant criticism, with two key factors identified as the primary sources of this criticism. Firstly, they rely on a reference text with which to compare the MT output, which is not always feasible in all contexts. Secondly, natural language is highly versatile, so they may rate as erroneous translations that are in fact correct, even if they differ from the gold standard [14].

More recently, metrics based on embeddings have emerged. These metrics extend beyond a mere lexical comparison of translations and references, instead enabling a semantic comparison at the word or sentence embedding level. An example of this type of metric is the Crosslingual Optimized Metric for Evaluation of Translation (COMET), a multilingual MT evaluation framework that is able to evaluate MT with different reference-dependent and reference-free models [10]. Another method for automatic MT evaluation is Quality Estimation (QE), the task of estimating the quality of MT in real time without the need for reference translations [15]. Some examples of QE metrics are TransQuest (https://huggingface.co/TransQuest/monotransquest-da-en_de-wiki) [16, 17], a bilingual and multilingual framework that uses XLM-R embeddings of source and hypothesis sentences to predict the QE score, and the wmt-comet-qe-da and wmt-cometkiwi-da models (https://github.com/Unbabel/COMET/blob/master/MODELS.md) [18]. Although not requiring a reference text for evaluation is an advantage, QE is often limited to HRL and general domains, as it requires an annotated training corpus. QE can be conducted at the word, sentence, or document level [15].

2.2.2. Automatic Evaluation: Error Detection

In recent times, a number of studies have been conducted with a view to examining the phenomenon of error detection in greater depth. For instance, LLMs have shown a high capacity to evaluate machine translation through error analysis, pointing out errors and even explaining them and suggesting improvements to the translation. However, they still fail to correlate error analysis with translation quality metrics [19]. Despite the good quality shown by large LLMs such as GPT-3 or GPT-4, when attempts have been made to replicate this task with open-source LLMs such as LLaMA (https://llama.meta.com/), the results have deteriorated [20]. On the other hand, Unbabel (https://unbabel.com/), the company behind COMET, has released a new model, XCOMET (https://unbabel.com/xcomet-translation-quality-analysis/), which is capable of detecting linguistic errors and hallucinations and classifying them according to their severity: minor, serious or critical [20].
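To make the metric families of Section 2.2.1 concrete, the following minimal sketch scores a toy hypothesis against a reference with BLEU, chrF and TER using the sacrebleu library, and with an embedding-based metric using the unbabel-comet package. The example sentences and the chosen COMET checkpoint (Unbabel/wmt22-comet-da) are illustrative assumptions only; they are not the data or models evaluated in this thesis.

```python
# Hedged sketch: corpus-level scoring with lexical metrics (sacrebleu)
# and an embedding-based metric (COMET). Sentences and checkpoint
# names are illustrative assumptions, not the thesis setup.
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["The patient was discharged yesterday."]      # source text
hypotheses = ["O paciente recibiu a alta onte."]          # MT output
references = ["O paciente foi dado de alta onte."]        # gold standard

# Lexical comparison against the gold standard: token-level (BLEU),
# character-level (chrF) and edit-distance-based (TER).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])
print(f"BLEU={bleu.score:.1f}  chrF={chrf.score:.1f}  TER={ter.score:.1f}")

# Embedding-based, reference-dependent evaluation with a COMET model.
model_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(model_path)
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
prediction = comet_model.predict(data, batch_size=8, gpus=0)
print("COMET segment scores:", prediction.scores)
print("COMET system score:", prediction.system_score)
```

A reference-free (QE) checkpoint such as wmt-cometkiwi-da could be queried in the same way, passing only the "src" and "mt" fields for each segment.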
Nevertheless, both the LLMs and XCOMET have only been tested on HRL, such as English, Chinese or Russian. LRL have less presence in large language models and less training data available for training these types of systems. In this context, Dale et al. [9] have produced manually annotated datasets to detect hallucinations and omissions. These datasets encompass 18 language pairs, including HRL pairs, LRL pairs, and one zero-shot pair. This work aims to advance the field of hallucination and omission detection and is the first to produce a human-annotated dataset for this purpose. Finally, another significant contribution to this field is that of Don-Yehiya et al. [21]. In this work they have investigated the possibility of predicting the quality of a translated sentence based on the semantic and linguistic characteristics of the source sentence. To this end, they have developed a model capable of determining the intrinsic difficulty of a source sentence. They have been able to conclude that the characteristics of a sentence determine the difficulty or ease of translation for different machine translation models. Consequently, it is now possible to predict whether a sentence is going to be mistranslated before it occurs. This can also be of great assistance to both linguists and users, but more investigation is needed.

In conclusion, research on error detection in machine translation and other generative tasks is still in its infancy and is only being developed for HRL. For other contexts, such as LRL and very specific domains, there are very few training corpora and these languages have little presence in the base models. Moreover, both the language models and the methods used by neural metrics such as COMET entail a high computational cost, which makes them inefficient to put into production in small companies.

3. Research Questions and Hypotheses

3.1. Evaluation Metrics

RQ1: What is the starting point? In order to assess the value of the information that some of the metrics mentioned in section 2.2.1 can provide, the initial step of this thesis will be to establish, for the first time, the state of the art in machine translation for the Spanish–Galician and English–Galician pairs. The first part of this research question has been published at PROPOR 2024 (https://propor2024.citius.gal/). In this first paper, we have conducted an exhaustive evaluation of all extant models for these language pairs, encompassing both NMT and RBMT systems, across several test datasets pertaining to the general, legal, and health domains. The metrics employed were BLEU, chrF, TER, and COMET, in addition to an error analysis [22]. The remaining translation direction will be published in due course. The results obtained in this initial study indicate a correlation between the lexical-based metrics. This implies that they exhibit homogeneous results, with minimal variation between them, and consistently follow the same patterns across models. In contrast, COMET yields considerably higher results than the lexical-based metrics, with smaller differences between models. From these initial findings, it can be concluded that the metrics are useful for identifying models that exhibit poor performance, indicating the need for further revision and retraining.
Conversely, models with high metric scores demonstrate satisfactory performance, although the metrics themselves provide limited insight into the nuances of translation quality.

RQ2: If each sentence is individually evaluated, can we identify a correlation between the metrics and the errors? Once the corpus-level evaluations have been completed, another question that must be addressed is whether the sentence-by-sentence evaluation aligns with the error analysis. This entails determining whether and how the metrics penalise the same types of errors. To answer this question, the same metrics mentioned in RQ1 will be employed to evaluate the generic corpus on a sentence-by-sentence basis. The primary hypothesis is that, contingent on its typology (lexical-based, edit-distance-based or embedding-based), each metric will impose a distinct penalty on the errors identified in the sentences. Consequently, each metric will yield disparate insights.

RQ3: Can errors in machine translation be identified prior to their occurrence? As evidenced by the work of Don-Yehiya et al. [21] and the analysis made in García and Rigau [22], which revealed that each translation model consistently produces similar errors, our hypothesis is that it is feasible to develop a model that can predict the likelihood of certain linguistic phenomena being challenging to detect in the target language, contingent on the translation model employed.

3.2. Error Detection in Machine Translation

RQ4: What is the most appropriate methodology for the generation of annotated data in an LRL? One of the significant limitations of a low-resource language such as Galician for the training of systems capable of detecting errors is the lack of an annotated corpus. Works such as XCOMET [20], PreQuEL [21] or the LLMs themselves have required an annotated corpus with errors for their training. In the case of Galician, it would be necessary to generate this kind of data from scratch. Consequently, we will adopt the methodologies proposed in Guerreiro et al. [20] to tag data in accordance with the characteristics of the MQM corpora, or the methodology proposed by Dale et al. [9] to generate annotated corpora with hallucinations and omissions.

RQ5: Is it possible to detect MT errors through interpretable Semantic Textual Similarity (iSTS)? Semantic textual similarity is defined as the measure of semantic equivalence between two blocks of text [23]. At WMT12, Castillo and Estrella [24] proposed, for the first time, the use of semantic textual similarity for machine translation evaluation. They proposed the use of WordNet to calculate how equivalent a hypothesis sentence is to a reference. At that time, they achieved promising results at the system level. Nowadays, metrics such as COMET or BLEURT [25] use sentence and word embeddings to compare semantic similarity at the word or sentence level. In contrast, to the best of our knowledge, there has never been a proposal for the use of interpretable semantic textual similarity in MT evaluation and error detection. iSTS can be defined as giving meaning to semantic similarity between short texts [26]. See the example in Figure 1, where the system explains that the two sentences are very similar, that 'in bus accident' is a bit more specific than 'in road accident', that '12' and '10' are very similar, and that 'in Pakistan' is a bit more general than 'in NW Pakistan' in this context.
Figure 1: Figures taken from the Lopez-Gazpio et al. [27] article that explain the interpretable Semantic Textual Similarity approach. (a) Two similar sentences in the English language, taken from Lopez-Gazpio et al. [27], illustrating the interpretability layer. (b) The explanation given by the model in Lopez-Gazpio et al. [27] about the differences between the two sentences in Subfigure 1a.

In Subfigure 1a, '12 killed in bus accident in Pakistan' and '10 killed in road accident in NW Pakistan' differ from each other in two minor respects: the number of people killed and the type of accident. Although the meaning conveyed by the two sentences is similar, it is not identical. It is precisely these aspects in which the two sentences differ that are highlighted in the text of Subfigure 1b [27].

This thesis therefore proposes the extension of iSTS to a multilingual and crosslingual context as an evaluation approach for machine translation. In this context, the interpretability layer will function as an error analysis that is able to explain the MT errors to the user. The aim is therefore to develop a metric capable of identifying discrepancies between languages without the need for a reference text. The hypothesis is that, with the current development of neural networks and large language models, it will be possible to expand iSTS to a multilingual context. The objective is to optimise this technique to achieve a metric similar to XCOMET, but capable of explaining errors and not just classifying them.

RQ6: Can open-source LLMs identify errors in LRL contexts? As previously stated in section 2.2.2, despite the remarkable capacity of large language models such as GPT-3 and GPT-4 to conduct error analysis for MT evaluation [19], Guerreiro et al. [20] have also indicated that open-source LLMs exhibit inferior performance and fail to attain the quality of proprietary LLMs. Currently, there are two LLMs available for Galician: Carballo-bloom-1.3B (https://huggingface.co/proxectonos/Carballo-bloom-1.3B) and Carballo-cerebras-1.3B (https://huggingface.co/proxectonos/Carballo-cerebras-1.3B). However, neither of them is fine-tuned for MT, MT evaluation or other specific tasks. In this thesis we will fine-tune Carballo-bloom-1.3B, as it is based on FLOR-1.3B (https://huggingface.co/projecte-aina/FLOR-6.3B-Instructed), which is in turn based on the multilingual LLM Bloom-1.7B (https://huggingface.co/bigscience/bloom-1b7). Consequently, it is anticipated that it will be able to recognise languages other than Galician, such as English or Spanish. The advent of improved techniques for adapting these models, even in the context of low-resource languages, suggests that even modest LLMs may yield satisfactory performance.

3.3. Error Correction

The objective of this study is not only to identify errors in machine translation but also to ascertain the feasibility of correcting them. The following research questions present two hypotheses that have been formulated in this regard.

RQ7: Can we use multilingual models to translate from a bad MT to a good MT? One of the research questions to be addressed in this thesis is whether it is possible to use multilingual models to translate from one language into the same language in order to improve the MT produced by another model. That is, to provide the model with an MT containing errors and have it translate it into the same language, in order to rectify the existing errors, as sketched below.
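As a rough, non-committal illustration of this setup, the sketch below feeds an invented, deliberately flawed Galician MT output back through the publicly available facebook/m2m100_418M checkpoint, asking the model to "translate" Galician into Galician. Whether such same-language decoding actually repairs errors is exactly what RQ7 investigates; the example sentence and the generation settings are assumptions made only for the sake of the sketch.

```python
# Hedged sketch for RQ7: re-decoding a flawed MT output with a
# multilingual model in the same language (Galician -> Galician).
# The checkpoint is the public facebook/m2m100_418M; the flawed
# input sentence is an invented example.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

def rewrite_in_same_language(mt_output: str, lang: str = "gl") -> str:
    """Ask the multilingual model to re-generate an MT output in the
    same language, in the hope that some errors get corrected."""
    tokenizer.src_lang = lang
    encoded = tokenizer(mt_output, return_tensors="pt")
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(lang),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

flawed_mt = "O paciente foi dado de alta de onte hospital."  # invented flawed MT
print(rewrite_in_same_language(flawed_mt))
```

An analogous experiment could be run with the NLLB200 checkpoint discussed next, using its own tokenizer and language codes.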
The hypothesis is that a multilingual model with sufficient knowledge of the language will be able to correct a machine translation generated by another model. In order to achieve this objective, we will be utilising the multilingual models M2M100 (https://huggingface.co/facebook/m2m100_418M) [28] and NLLB200 (https://huggingface.co/facebook/nllb-200-distilled-600M) [29], which include Galician among their languages and have been identified as the most effective in García and Rigau [22].

RQ8: Is it possible to improve translation models based on error detection? The hypothesis put forth for this research question is that every model makes certain types of errors. Therefore, detecting these errors will not only assist the user, but will also serve as a basis for retraining in order to improve the model.

4. Timeline

Figure 2: Thesis timeline (2023–2027), showing the schedule of RQ1 & RQ2 (with the first paper published at PROPOR 2024), then RQ3, RQ5 & RQ6, and finally RQ4, RQ7 & RQ8.

Figure 2 illustrates the planning of the thesis. The initial part will encompass an examination of conventional MT evaluation metrics and will mark the significant shift occurring in the realm of MT and its evaluation as a result of the advent of LLMs. A preliminary paper on this aspect of the research has already been published, as mentioned in section 3 for RQ1. By the end of 2024, the initial section of the thesis is anticipated to be completed, addressing RQ1 and RQ2. On the other hand, as previously discussed in section 3, the publication of the new LLMs for Galician will be accompanied by experiments aimed at retraining them for MT and MT evaluation. To this end, it will be necessary to create an annotated corpus with errors and their typology, which will address RQ4 and RQ6. Similarly, the creation of these corpora will permit the training of systems based on embeddings to detect errors, thereby providing an answer to RQ5. Furthermore, the detection of errors prior to translation will be facilitated, thus providing an answer to RQ3. This part of the thesis is scheduled to take place between mid-2024 and mid-2026. Finally, the final year will be dedicated to enhancing the systems themselves by identifying errors (RQ8) or improving the translation process (RQ7).

5. Conclusions

In conclusion, the aforementioned experiments in Section 3 will be employed in order to ascertain the optimal methodology for MT evaluation for an LRL in an industrial context. Given the rapidly evolving environment of recent years, the new technologies being generated tend to focus on HRL and entail a very high computational cost for SMEs. The objective of this thesis is to conduct a comparative study of the existing technologies and to assess their potential for implementation in a real-world context.

Acknowledgments

We would like to express our gratitude to the Nós project members for their assistance and guidance during the development of the methodological part of the project. Additionally, computational resources for this research were provided by UPV/EHU and imaxin|software. Finally, we acknowledge the funding received from the following projects: (i) DeepKnowledge (PID2021-127777OB-C21) and ERDF A way of making Europe; (ii) DeepR3 (TED2021-130295B-C31) and European Union NextGenerationEU/PRTR; (iii) ILENIA (2022/TL22/00215335) and the EU-funded NextGenerationEU Recovery, Transformation and Resilience Plan.

References
[1] M. L. Forcada, Building machine translation systems for minor languages: Challenges and effects, Revista de llengua i dret // Journal of Language and Law (2020).
[2] D. W. Otter, J. R. Medina, J. K. Kalita, A survey of the usages of deep learning for natural language processing, IEEE Transactions on Neural Networks and Learning Systems 32 (2021) 604–624. doi:10.1109/TNNLS.2020.2979670.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[4] W. Zhu, H. Liu, Q. Dong, J. Xu, S. Huang, L. Kong, J. Chen, L. Li, Multilingual machine translation with large language models: Empirical results and analysis (2023). URL: http://arxiv.org/abs/2304.04675.
[5] S. Lee, J. Lee, H. Moon, C. Park, J. Seo, S. Eo, S. Koo, H. Lim, A survey on evaluation metrics for machine translation, Mathematics 11 (2023). doi:10.3390/math11041006.
[6] B. Zhang, B. Haddow, A. Birch, Prompting large language model for machine translation: A case study, in: International Conference on Machine Learning, PMLR, 2023, pp. 41092–41110.
[7] Y. Moslem, R. Haque, J. D. Kelleher, A. Way, Adaptive machine translation with large language models, in: M. Nurminen, J. Brenner, M. Koponen, S. Latomaa, M. Mikhailov, F. Schierl, T. Ranasinghe, E. Vanmassenhove, S. A. Vidal, N. Aranberri, M. Nunziatini, C. P. Escartín, M. Forcada, M. Popovic, C. Scarton, H. Moniz (Eds.), Proceedings of the 24th Annual Conference of the European Association for Machine Translation, European Association for Machine Translation, Tampere, Finland, 2023, pp. 227–237. URL: https://aclanthology.org/2023.eamt-1.22.
[8] W. Xu, S. Agrawal, E. Briakou, M. J. Martindale, M. Carpuat, Understanding and detecting hallucinations in neural machine translation via model introspection, Transactions of the Association for Computational Linguistics 11 (2023) 546–564. URL: https://aclanthology.org/2023.tacl-1.32. doi:10.1162/tacl_a_00563.
[9] D. Dale, E. Voita, J. Lam, P. Hansanti, C. Ropers, E. Kalbassi, C. Gao, L. Barrault, M. Costa-jussà, HalOmi: A manually annotated benchmark for multilingual hallucination and omission detection in machine translation, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 638–653.
[10] R. Rei, C. Stewart, A. C. Farinha, A. Lavie, COMET: A neural framework for MT evaluation, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 2685–2702. URL: https://aclanthology.org/2020.emnlp-main.213. doi:10.18653/v1/2020.emnlp-main.213.
[11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
[12] M. Popović, chrF: character n-gram F-score for automatic MT evaluation, in: O. Bojar, R. Chatterjee, C. Federmann, B. Haddow, C. Hokamp, M. Huck, V. Logacheva, P. Pecina (Eds.), Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 392–395. URL: https://aclanthology.org/W15-3049. doi:10.18653/v1/W15-3049.
[13] M. Snover, B. Dorr, R. Schwartz, L. Micciulla, J. Makhoul, A study of translation edit rate with targeted human annotation, in: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Association for Machine Translation in the Americas, Cambridge, Massachusetts, USA, 2006, pp. 223–231. URL: https://aclanthology.org/2006.amta-papers.25.
[14] M. Freitag, R. Rei, N. Mathur, C.-k. Lo, C. Stewart, E. Avramidis, T. Kocmi, G. Foster, A. Lavie, A. F. T. Martins, Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust, in: P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno Yepes, T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, M. Negri, A. Névéol, M. Neves, M. Popel, M. Turchi, M. Zampieri (Eds.), Proceedings of the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 46–68. URL: https://aclanthology.org/2022.wmt-1.2.
[15] H. Zhao, Y. Liu, S. Tao, W. Meng, Y. Chen, X. Geng, C. Su, M. Zhang, H. Yang, From handcrafted features to LLMs: A brief survey for machine translation quality estimation, arXiv preprint arXiv:2403.14118 (2024).
[16] T. Ranasinghe, C. Orasan, R. Mitkov, TransQuest: Translation quality estimation with cross-lingual transformers, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020.
[17] T. Ranasinghe, C. Orasan, R. Mitkov, TransQuest at WMT2020: Sentence-level direct assessment, in: Proceedings of the Fifth Conference on Machine Translation, 2020.
[18] R. Rei, M. Treviso, N. M. Guerreiro, C. Zerva, A. C. Farinha, C. Maroti, J. G. C. de Souza, T. Glushkova, D. Alves, L. Coheur, A. Lavie, A. F. T. Martins, CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task, in: P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno Yepes, T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, M. Negri, A. Névéol, M. Neves, M. Popel, M. Turchi, M. Zampieri (Eds.), Proceedings of the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 634–645. URL: https://aclanthology.org/2022.wmt-1.60.
[19] Q. Lu, B. Qiu, L. Ding, L. Xie, D. Tao, Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT (2023).
[20] N. M. Guerreiro, R. Rei, D. van Stigt, L. Coheur, P. Colombo, A. F. T. Martins, xCOMET: Transparent machine translation evaluation through fine-grained error detection, 2023. arXiv:2310.10482.
[21] S. Don-Yehiya, L. Choshen, O. Abend, PreQuEL: Quality estimation of machine translation outputs in advance (2022). URL: http://arxiv.org/abs/2205.09178.
[22] S. García, G. Rigau, Study of the state of the art Galician machine translation: English-Galician and Spanish-Galician models, in: P. Gamallo, D. Claro, A. Teixeira, L. Real, M. Garcia, H. G. Oliveira, R. Amaro (Eds.), Proceedings of the 16th International Conference on Computational Processing of Portuguese, Association for Computational Linguistics, Santiago de Compostela, Galicia/Spain, 2024, pp. 411–421. URL: https://aclanthology.org/2024.propor-1.42.
[23] D. Chandrasekaran, V. Mago, Evolution of semantic similarity – a survey (2020). URL: http://arxiv.org/abs/2004.13820. doi:10.1145/3440755.
[24] J. Castillo, P. Estrella, Semantic textual similarity for MT evaluation, in: C. Callison-Burch, P. Koehn, C. Monz, M. Post, R. Soricut, L. Specia (Eds.), Proceedings of the Seventh Workshop on Statistical Machine Translation, Association for Computational Linguistics, Montréal, Canada, 2012, pp. 52–58. URL: https://aclanthology.org/W12-3103.
[25] T. Sellam, D. Das, A. P. Parikh, BLEURT: Learning robust metrics for text generation, in: Proceedings of ACL, 2020.
[26] A. A. Abafogi, Survey on interpretable semantic textual similarity and its applications, International Journal of Innovative Technology and Exploring Engineering 10 (2021) 14–18. URL: https://www.ijitee.org/portfolio-item/B82941210220/. doi:10.35940/ijitee.B8294.0110321.
[27] I. Lopez-Gazpio, M. Maritxalar, A. Gonzalez-Agirre, G. Rigau, L. Uria, E. Agirre, Interpretable semantic textual similarity: Finding and explaining differences between sentences (2016). URL: http://arxiv.org/abs/1612.04868. doi:10.1016/j.knosys.2016.12.013.
[28] A. Fan, S. Bhosale, H. Schwenk, Z. Ma, A. El-Kishky, S. Goyal, M. Baines, O. Celebi, G. Wenzek, V. Chaudhary, et al., Beyond English-centric multilingual machine translation, Journal of Machine Learning Research 22 (2021) 1–48.
[29] NLLB Team, M. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. Gonzalez, P. Hansanti, J. Wang, No language left behind: Scaling human-centered machine translation (2022). doi:10.48550/arXiv.2207.04672.