<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Meta-Evaluation of Automatic Machine Translation Metrics between Italian and a Minor Language Variety of German</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Di Natale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Chiocchetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egon Stemle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eurac Research</institution>
          ,
          <addr-line>Viale Druso/Drususallee 1, 39100 Bolzano/Bozen</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Masaryk University</institution>
          ,
          <addr-line>Zerotinovo namesti 9, 602 00 Brno</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>We present the first meta-evaluation of Automatic Machine Translation Evaluation (AMTE) metrics between Italian and South Tyrolean German, a low-resourced standard variety of German. This minor German variety is recognised as a co-official language at the local level and is used by the local public administration and legislature. We evaluate metric agreement with human judgment across translation quality levels, using a dataset of bilingual machine-translated decrees annotated with human-curated error tags. Our findings show that embedding-based metrics perform best for evaluating high-quality translations, while learned neural metrics correlate more strongly with human judgments on lower-quality ranges. We also expose a persistent bias in AMTE against minor language varieties and make suggestions about the design of linguistic resources for envisaged custom metric development.</p>
      </abstract>
      <kwd-group>
<kwd>automatic machine translation evaluation metrics</kwd>
        <kwd>metrics meta-evaluation</kwd>
        <kwd>non-English language combination</kwd>
        <kwd>minor language variety</kwd>
        <kwd>machine translation</kwd>
        <kwd>natural language generation evaluation</kwd>
        <kwd>specialized communication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>South Tyrolean German is a minor standard variety of German with co-official status in the Italian province of Bolzano/Bozen (South Tyrol). The 350,000 German-speaking citizens in South Tyrol have the right to communicate with and access public services in their native language at the local level. Given the increasing integration of AI technologies into everyday life, this context underscores the need to develop bilingual NLP tools tailored to the South Tyrolean variety of German and its use cases, with Machine Translation (MT) among the most pressing fields of research. However, it is well documented that the performance of NLP systems for minor language varieties significantly lags behind both their major counterparts and high-resource languages [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      <p>Interest in generating translations into minor language varieties is growing, yet the lack of validated evaluation metrics hampers accurate monitoring of achieved progress. Most related studies still rely on inadequate, superseded lexical-overlap methods [2]. While the research community has made efforts to adapt neural metrics for under-resourced and dialectal varieties [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>], the development of robust evaluation methods is complicated by the absence of high-quality, sufficiently large labeled datasets – an issue common to all under-resourced varieties [5].</p>
      <p>Knowles et al. [6] have called for a comparative evaluation, as they argue that metrics assign lower scores to minor lexical variants even when no change in meaning exists. In addition, inefficient tokenization methods lead to suboptimal segmentation and reduced adaptability for under-resourced languages [7].</p>
      <p>Prior experiments with adaptive MT for South Tyrol [8, 9] have also employed metrics based on lexical overlap despite their known underperformance compared to neural metrics. This reliance stems from the lack of a thorough, localized evaluation of more advanced metric paradigms and makes a compelling case for a dedicated meta-evaluation study of existing solutions applicable to the South Tyrolean context.</p>
      <p>This work presents the first such MT meta-evaluation study of metrics for the Italian–South Tyrolean German language pair. We conduct our analysis on MT@BZ1, a manually error-annotated corpus of legal texts covering both translation directions, to assess the reliability of current automatic evaluation metrics.</p>
      <sec id="sec-1-1">
        <title>1.1. Automatic Machine Translation Evaluation</title>
        <p>Human evaluation remains the gold standard method for assessing the quality of MT outputs. However, because human annotation is time-consuming, resource-intensive and requires high domain expertise, Automatic Machine Translation Evaluation (AMTE) metrics have garnered increasing attention. These metrics aim to estimate translation quality by comparing a system-generated candidate translation either to the source segment2 in the other language, to a human-produced reference translation in the same language, or to both. In scenarios where the output of only one translation system is available, as in this case, a so-called segment-level evaluation is carried out. It consists of "evaluating metrics based on their ability to rank segments in the same order as human judgments" [10]. The effectiveness of such metrics is commonly measured using ranking correlation coefficients, under the assumption that a reliable metric should consistently assign higher scores to translations deemed superior by human annotators [11].
2In the field of MT, a segment is defined as the minimal translation unit, which in this study corresponds to a sentence.</p>
        <p>Existing metrics can be categorized into three main types:
• String-based: this approach quantifies translation quality by measuring lexical overlap with one or more reference translations. These methods operate at the surface level, comparing exact matches of word or character sequences between the candidate and the reference.
• Embedding-based: these metrics leverage contextualized token embeddings from pretrained language models to compute semantic similarity between the candidate translation and the reference. Semantic alignment is evaluated at the token level using cosine similarity, followed by an F-score aggregation procedure.
• Learned: these metrics are based on transformer architectures that have been fine-tuned via supervised learning to replicate human judgments of machine translation quality, typically using a regression objective to provide a continuous score.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2. Motivation</title>
        <sec id="sec-1-2-1">
          <title>2.1. Social and Linguistic Background of South Tyrol</title>
          <p>South Tyrolean German is the standard variety of German used in the Autonomous Province of Bolzano/Bozen (South Tyrol) in Northern Italy. In South Tyrol, German is a recognized minority language, co-official with Italian. Public administration offices are legally required to use German when interacting with the German-speaking population (Presidential Decree No. 670/1972, Art. 100), which makes up the large majority of South Tyrol's population (69%)3. Consequently, all administrative documents, local legislation, and materials intended for the general public – such as the websites of local public institutions – must be available not only in the national language Italian but also in the minority language German4.
3See the latest census data: https://assets-eu-01.kc-usercontent.com/b5376750-8076-01cf-17d2-d343e29778a7/5deec178-b2a3-4e2d-8795-d37635c7e0f7/pressnote_1160209_mit56_2024.pdf
4There is a third official language, Ladin, spoken by about 20,000 South Tyroleans. We will not deal with Ladin in this paper.</p>
          <p>This multilingual institutional language regime is largely implemented through translation between Italian and German or vice-versa. National legislation is drafted in Italian, and any implementation at the local level creates the need for translation into German. Following quotas in public employment, about two thirds of public administration staff is German-speaking. Consequently, many legal and administrative texts are now originally drafted in German. While the Italian and German versions of, for example, a local law are both official, in case of diverging interpretation the Italian version prevails (Presidential Decree No. 670/1972, Art. 99). This means that a translated text can become the legally binding version. Given the growing use of machine translation, this holds true also for machine-translated or post-edited texts.</p>
          <p>The impressive level of fluency of MT-generated texts poses a challenge for fair quality assessment of MT systems even for human evaluators – especially for those lacking specialized training, who may be outperformed by automated neural metrics [12]. In South Tyrolean public offices, where translation-related tasks are often performed by non-specialists, the rising adoption of MT – frequently without adherence to scientific evaluation protocols [13] – carries the risk of overestimating productivity gains. Without systematic, targeted performance monitoring, critical errors may go unnoticed. As highlighted in the error analysis of a machine-translated legal corpus [14], MT systems often struggle with local legal terminology and are prone to interference from other legal systems using German. For example, kommunale Steuer (municipal tax) is never used in South Tyrol as it would be in Germany; the South Tyrolean term for "municipal tax" is Gemeindesteuer. Such errors can severely compromise translation quality and usability. In high-stakes domains like the legal one, fluency is secondary to semantic precision and legal appropriateness. Critical accuracy errors can distort meaning, making translated laws unpublishable or even harmful. Consequently, there is a clear need for MT evaluation frameworks that attend to the specific requirements of the South Tyrolean administration and population.</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>2.2. Toward the Development of Custom Metrics</title>
          <p>The well-documented challenges of adapting NLP applications to minor language varieties [<xref ref-type="bibr" rid="ref1">1</xref>] also apply to the development of automatic evaluation metrics. Language models are pre-trained on large-scale corpora where major language varieties contribute a disproportionately larger amount of training signal [15], often without explicit annotation of variety or dialect tags. This results in biased representations and undermines the fairness and reliability of evaluation metrics for underrepresented varieties [16]. Current literature has shown that intensive continued pre-training [16] and the use of high-quality, human-annotated datasets spanning a range of translation quality levels [<xref ref-type="bibr" rid="ref4">4</xref>] are essential to improving evaluation performance. Yet, these strategies remain largely impractical at present due to the significant data and resource demands they entail.</p>
          <p>Also, given the high costs of constructing fine-grained, manually annotated datasets, one wants to be sure that the compilation of structured and detailed linguistic resources is empirically justified. While Amrhein et al. [17] argue that the inclusion of reference translations generally improves evaluation reliability, the behavior of existing metrics remains inconsistent, occasionally even counterintuitive. For example, some metrics have been observed to disregard the reference altogether [18], or to produce high scores even when the source text is omitted entirely [6, 10]. As a result, a comprehensive assessment of existing solutions is needed not only to identify the metrics best suited to the context under study, but also to lay the groundwork for envisaged future metric development.</p>
          <p>Moreover, reliable metrics can also advance generation tasks. An emerging trend in natural language generation is to exploit Minimum Bayes Risk (MBR) decoding, which selects the output hypothesis that minimizes expected loss according to a utility function defined by a chosen evaluation metric [19]. This approach can act as a form of style transfer with a reduction in training costs and data requirements. However, using the same metric for both decoding and final evaluation introduces bias, as the system is optimized to reproduce the metric's idiosyncrasies [20]. Even different but highly correlated metrics – especially if they are of the same type – can produce similar biases [21]. Thus, evaluating the robustness of multiple metric paradigms becomes an essential prerequisite to generating text in South Tyrolean German with MBR decoding.</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>3. Challenges of Automatic Machine Translation Evaluation</title>
        <p>Learned metrics have consistently outperformed other evaluation methods in benchmark competitions such as the WMT Metrics Shared Task [22]. However, this finding should not be generalized uncritically. Since neural metrics are predominantly fine-tuned on WMT competition datasets – which represent a limited range of linguistic diversity and domains – their superiority in more specialized evaluation scenarios remains open to question. Knowles et al. [6] raise questions regarding how metrics assess terminological variation within language varieties and call for more thorough research on the subject. Since larger language varieties contribute more training signal during metric development, studies have observed that major linguistic variants tend to be rated more favorably than minor linguistic variants, potentially leading to biased evaluations [16].</p>
        <p>Furthermore, analyses of neural metric performance on non-English language pairs remain limited. As a result, the superiority of neural metrics cannot be indiscriminately generalized to all language combinations [23], with some evidence suggesting that performance may degrade when English is excluded from the evaluation [24].</p>
        <p>Among the major limitations highlighted in the literature is the lack of interpretability inherent to many neural evaluation metrics, largely due to their opaque scoring mechanisms. Their black-box nature hinders an assessment of which metrics are best suited for capturing specific linguistic phenomena and complicates the selection of appropriate metrics for targeted evaluation tasks [25]. In response, recent research has increasingly emphasized evaluation methodologies grounded in human error annotations – particularly those following the MQM (Multidimensional Quality Metrics) framework – which offer fine-grained information on translation quality [12]. These span-level annotations have also been leveraged as a standardized method for deriving quality scores (eliminating the need for direct human scoring in evaluation tasks) [26] and for training more interpretable quality metrics.</p>
        <p>Parallel efforts have also turned to linguistically motivated meta-evaluation test suites and controlled experiments designed to probe metric sensitivity to specific language phenomena [27, 28].</p>
        <p>The specialized nature of the legal domain also raises concerns about the reliability of existing evaluation metrics. Zouhar et al. [29] highlight that learned metrics exhibit a performance drop when applied to out-of-domain data, largely due to their final-stage fine-tuning process. This suggests that current training data effectively optimizes metrics for specific domains but does not generalize well beyond them. As a result, extending these evaluation metrics to other domains – such as the legal domain – may lead to performance degradation compared to the base model.</p>
      </sec>
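<p>The string-based and embedding-based paradigms discussed above can be sketched in miniature. The following toy implementations are illustrative only, not the actual metrics evaluated in this study; the embedding sketch assumes precomputed token vectors rather than a real pretrained model.

```python
from collections import Counter


def ngram_precision(candidate, reference, n=2):
    """String-based sketch: fraction of candidate n-grams found in the reference."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0


def greedy_cosine_f1(cand_vecs, ref_vecs):
    """Embedding-based sketch: greedy token matching by cosine similarity,
    aggregated into precision, recall and F1 (the BERTScore-style recipe)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))
    # precision: each candidate token paired with its most similar reference token
    p = sum(max(cos(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # recall: each reference token paired with its most similar candidate token
    r = sum(max(cos(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * p * r / (p + r)
```

Real string-based metrics add refinements such as brevity penalties and smoothing, and real embedding-based metrics obtain the token vectors from a contextual language model.</p>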
    </sec>
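<p>The MBR decoding strategy mentioned in Section 2.2 can be sketched as follows. This is a minimal illustration, not the setup of this paper: `overlap_f1` is a hypothetical word-overlap utility standing in for a real evaluation metric, and the other hypotheses serve as pseudo-references.

```python
def mbr_decode(hypotheses, utility):
    """Minimum Bayes Risk sketch: return the hypothesis with the highest
    expected utility, using the remaining hypotheses as pseudo-references."""
    def expected_utility(h):
        others = [p for p in hypotheses if p is not h]
        return sum(utility(h, p) for p in others) / len(others)
    return max(hypotheses, key=expected_utility)


def overlap_f1(a, b):
    """Toy utility: word-set F1 overlap between two strings."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    inter = len(sa & sb)
    p, r = inter / len(sa), inter / len(sb)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Replacing `overlap_f1` with a learned metric is exactly the step that imports that metric's idiosyncrasies into the generated output, which is why robustness across metric paradigms matters before using MBR for South Tyrolean German.</p>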
    <sec id="sec-2">
      <title>4. Methodology</title>
      <sec id="sec-2-1">
        <title>4.1. Problem Definition</title>
        <p>We establish two criteria to characterize an effective metric for our use case: the first is absolute agreement, defined as ranking correct translations higher than incorrect ones. We also define relative agreement, that is, the capability to rank translations containing critical mistakes lower than those with milder ones [11].</p>
        <p>To operationalize this differentiation, we partition the dataset for analysis. Absolute agreement is measured on the Whole Dataset – comprising all segments available. To measure relative agreement, we subsample only the segments annotated with at least one mistake, the Mistake-only Dataset.</p>
      </sec>
      <sec id="sec-2-1-1">
        <title>4.2. Dataset and Human Scoring</title>
        <p>We use the MT@BZ corpus [8], a corpus of machine-translated decrees. It comprises source, reference and candidate translations in both language directions (IT→DE and DE→IT). Each segment has been manually annotated for translation errors using a custom error taxonomy. Table 1 offers a glance into the composition of the corpus for each language direction.
Table 1: Composition of the MT@BZ dataset (total segments: 1,509 per direction). Error-annotated segments indicate the number of translations that have been labeled as containing at least one mistake. Exact matches indicate the number of correct translations that are identical to the reference. Other correct segments indicate the number of correct translations that are different from the reference.</p>
        <p>We notice that around 60% of all segments are correct for both language directions. To gain further insight, we compute the BLEU score between reference and candidate sentences. Notably, we find that a very high number of segments labeled as correct receives a perfect BLEU score of 100, indicating exact matches with the reference translations. This outcome has also been observed by Oliver et al. [9] in similar experiments on the same data, and is attributed to the repetitive and formulaic nature of legal language, which often leads to low lexical and syntactic variability.</p>
        <p>To measure correlation across a range of quality levels (as defined in Section 4.1) in the absence of numerical quality scores, we assign severity weights to each error type annotated in the original dataset (see Appendix A). In this manner, we can lay out a hierarchy of type-of-error severity and derive a more granular quality ranking. We apply a penalty for each error in a segment, equal to the severity weight assigned to that error type, according to the Linear Raw Scoring Model presented by Lommel et al. [30]. The sum of penalties is then deducted from a total of 100 and becomes the human score. This score reflects both the presence and severity of translation errors, thereby enabling the computation of rank-based correlation indices between human judgments and automatic metric outputs.</p>
        <p>Given the highly specialized nature of the domain, experts with competence in the South Tyrolean legal framework and German language varieties were consulted to define severity levels for each error type. These levels were established based on both linguistic adequacy and legislative drafting requirements5. For a detailed qualitative analysis of the corpus mistakes, refer to De Camillis and Chiocchetti [14].
5For example, the South Tyrolean public administration is bound by law to use the terminology that is being officially validated by a dedicated Terminology Commission (Presidential Decree No. 574/1988, Art. 6) and to adopt gender-neutral language (Provincial Law No. 51/2010). These constraints are therefore essential quality aspects when translating official documents into this minor language variety of German.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.3. Setup of Selected Metrics</title>
        <p>This section presents the evaluation metrics employed in our study, with details on the tested methods and models provided in Table 2. Following best practices for replicability as recommended by Zouhar et al. [42] for Comet-suite metrics, we include hash codes and model identifiers in the footnotes of the present section.
Table 2: Metrics under evaluation, with their type and input configuration (Src = uses the source; Ref = uses the reference; the header of the third column was lost in extraction).
Metric              Type             Src  Ref  (?)
BLEU                String-based     ✗    ✓    ✗
BLEURT              Learned          ✗    ✓    ✗
BERTScore           Embedding-based  ✗    ✓    ✗
chrF                String-based     ✗    ✓    ✗
COMET-22-DA         Learned          ✓    ✓    ✗
COMET-Kiwi-DA       Learned          ✓    ✗    ✗
COMET-KiwiXL-DA     Learned          ✓    ✗    ✓
MetricX-24-Hybrid   Learned          ✓    ✓    ✓
TER                 String-based     ✗    ✓    ✗
UNITE               Learned          ✓    ✓    ✗
XCOMETXL-DA         Learned          ✓    ✓    ✓</p>
        <p>String-based metrics. BLEU [31] measures modified n-gram precision with a brevity penalty. chrF [<xref ref-type="bibr" rid="ref6">34</xref>] computes overlap over character-level n-grams, offering sensitivity to morphological and orthographic variation. Finally, TER [<xref ref-type="bibr" rid="ref11">39</xref>] estimates the minimum number of edit operations required to transform the candidate into the reference, approximating post-editing effort.</p>
        <p>Embedding-based metrics. We utilize the BERTScore framework6 [<xref ref-type="bibr" rid="ref5">33</xref>], which uses contextual embeddings from pre-trained language models to compute semantic similarity. The framework allows for model selection. Hash identifiers have been generated together with the scores and are provided in the footnotes. In our experiments, we evaluate four encoder backbones: bert-base-multilingual7 (which is the default model), roberta-large-mnli8, deberta-xlarge-mnli9 and bart-large-mnli10. In Table 3, we report the results under their respective model denominations, matched with the aggregated F1 score. We also compute precision and recall individually to highlight asymmetric contributions to the similarity assessment, which will be commented on in Section 5. Precision measures how many of the candidate's tokens are present in the reference, while recall captures how well the reference tokens are matched by the generated candidate.
6https://github.com/Tiiiger/bert_score
7bert-base-multilingual-cased_L9_noidf_version=0.3.12(hug_trans=4.46.2)_fast-tokenizer
8roberta-large-mnli_L19_no-idf_version=0.3.12(hug_trans=4.51.3)
9microsoft/deberta-xlarge-mnli_L40_noidf_version=0.3.12(hug_trans=4.51.3)
10facebook/bart-large-mnli_L11_no-idf_version=0.3.12(hug_trans=4.51.3)</p>
      </sec>
      <sec id="sec-2-3">
        <title>Learned Metrics</title>
        <p>We choose learned metrics trained under different input configurations. We begin with reference-based metrics, which incorporate the reference translation during both training and inference. We select COMET-22-DA11 [<xref ref-type="bibr" rid="ref7">35</xref>] and BLEURT [32], which have been fine-tuned simply using quality scores from human annotators.</p>
        <p>We also consider source-based metrics (also called Quality Estimation or QE metrics), which are trained without access to reference translations. Instead, they learn to predict human quality scores solely from the source sentence and the machine-generated output. We include both COMET-Kiwi-DA12 [<xref ref-type="bibr" rid="ref8">36</xref>] and its larger variant COMET-KiwiXL-DA13 [<xref ref-type="bibr" rid="ref9">37</xref>], which builds on the same architecture but differs in model capacity.</p>
        <p>The unified approach combines both the source and the reference to exploit multi-task interaction. We assess UNITE14 [40]. It jointly leverages the source and the reference as separate input streams during training, then incorporates a last layer to fuse the decomposed scores into the holistic one. We report scores for the source (src) and reference (ref) decompositions.</p>
        <p>We also include error-span metrics, namely XCOMETXL-DA15 [41], MetricX-24-Hybrid-Large and its larger configuration MetricX-24-Hybrid-XL [<xref ref-type="bibr" rid="ref10">38</xref>]. These metrics include a training phase based on error-span labels, according to the MQM error taxonomy. They are trained to predict error spans alongside a penalty score. XCometXL-DA is a hybrid metric that provides additional scores based on four decomposed dimensions: src, ref, the unified approach and MQM annotations. The holistic score is then produced by ensembling the four sub-scores via a forward pass that establishes aggregation weights. Instead, the MetricX model suite only provides a single additional decomposed score, which includes only the source in the evaluation.</p>
        <p>Finally, we explore a variant of XCOMETXL quantized to 8 bits16, motivated by the hypothesis put forward in Zouhar et al. [42] that lower-precision approximations of large metrics can maintain correlation with human judgments while significantly reducing inference costs.
11Python3.8.10|Comet2.2.2|fp32|Unbabel/wmt22-comet-da|1
12Python3.8.10|Comet2.2.2|fp32|Unbabel/wmt22-cometkiwi-da|1</p>
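<p>The severity-based human scoring described in Section 4.2 can be sketched as follows. The severity weights shown here are illustrative placeholders; the actual weights per error type are given in Appendix A.

```python
# Hypothetical severity weights per error type (the study's real weights
# are defined in Appendix A of the paper).
SEVERITY_WEIGHTS = {"critical": 25.0, "major": 10.0, "minor": 5.0}


def human_score(error_annotations):
    """Linear Raw Scoring sketch (Lommel et al.): deduct one severity
    penalty per annotated error from a maximum score of 100."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for severity in error_annotations)
    return 100.0 - penalty
```

A segment with no annotated errors thus scores 100, and each additional or more severe error lowers the score, which is what makes rank-based comparison with metric outputs possible.</p>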
        <sec id="sec-2-3-1">
          <title>4.4. Meta-Evaluation</title>
          <p>In Table 3, we report Accuracy (Acc) [11], a measure computed through pairwise comparisons across the test set. It quantifies the proportion of pairs for which the evaluation metric produces the same relative ordering as the human gold standard (concordant), versus those where the ordering is incorrect (discordant). We follow Deutsch et al. [43] by using a variant of Accuracy adjusted for tie calibration, which artificially creates ties from continuous scores. This procedure is needed in light of the high number of rank ties stemming from the way the human scores are constructed. The Acc value ranges from 0 to 1.
13Python3.8.10|Comet2.2.2|fp32|Unbabel/wmt22-cometkiwiXL-da|1
14Python3.8.10|Comet2.2.2|fp32|Unbabel/unite-mup|1
15Python3.8.10|Comet2.2.3|fp32|Unbabel/XCOMET-XL|1
16Python3.8.10|Comet2.2.2|qint8|Unbabel/XCOMET-XL|1</p>
          <p>We also adopt Spearman's correlation (Rho), ranging from -1 to 1. It offers robustness to outliers and allows us to capture rank-based monotonic relationships even across the markedly different score distributions observed in the metrics evaluated [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
          <p>We decide not to use Pearson's correlation because it assumes a linear relationship between the distributions of the two score groups [43]. The proportional severity weights we assign to different error types are not expected to be linearly replicated by metric outputs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Results</title>
      <sec id="sec-3-1">
        <title>We apply meta-evaluation measures on both the Whole</title>
        <p>Dataset and the Mistake-only Dataset. This addresses
the need to adequately test the metrics on the two
criteria that have been established when defining the
problem in Section 4.1: absolute and relative agreement.</p>
        <p>In Table 3, results are accordingly structured under
two main sections, which separately report metric
performance under each evaluation criterion. For
metrics that generate holistic scores by aggregating
subscores algorithmically, we report the holistic score in
bold, while single decomposed scores are provided in
regular font.</p>
        <p>Metric paradigm performance varies across
quality ranges. Our results reveal a widely diferent
performance pattern across metric paradigms when evaluated
on the Whole Dataset versus the Mistake-only Dataset.</p>
        <p>Surprisingly, both string-based and embedding-based
metrics outperform learned metrics when evaluated on
the Whole Dataset. We explain this with the argument
that string-based metrics – being rule-based – can
reliably detect and reward the high sample of exact matches
with the reference. Embedding-based metrics also
beneift from their ability to capture lexical overlap at a
subword or token level, recognising meaning even when
the wording difers. We attribute the underperformance
of learned metrics primarily to the inherent nature of
their regression-based scoring. Unlike rule-based metrics
that produce deterministic outputs, learned metrics rely
on regression functions that approximate scores based
on distributional patterns in the training data. This can
result in unexpected behavior – for instance, candidate</p>
        <p>Mind the reference. Disaggregating the performance
of learned metrics by input type ofers valuable insights
into which linguistic resources most efectively
contribute to accurate evaluation. Considering the
Mistakeonly Dataset, reference-based scores surpass both
sourcebased and error-span counterparts for COMET, UNITE
and MetricX families. Interestingly, for metrics built on
the unified approach (such as UNITE and XCOMETXL),
the inclusion of both source and reference appears
beneifcial. While the reference remains the primary driver of
correlation, incorporating the source provides a modest
boost to overall score agreement. This suggests that
uniifed models, which incorporate additional layers to weigh
and integrate information streams from both inputs into
the holistic score, may be better suited to capture certain
error types that are only apparent when the source is
considered.</p>
        <p>In general, while source-based metrics trail behind
other learned metric types, they can outperform
embedding-based metrics that rely on reference
translations, especially if we consider models with larger
capacity (MetricX-24-XL-QE, COMET-KiwiXL-DA and
XCOMETXL-src).</p>
        <p>Error-span metrics are misaligned. We assess the
usefulness of error-span annotations in comparison to
other linguistic signals. XCOMETXL-DA-mqm is the only
available decomposed score based exclusively on MQM
error span identification. Considering the Mistake-only
Dataset, we observe a drop compared to related subscores
of the same metric as well as to the smaller configuration
of the same metric (COMET-22-DA). This failure may
be attributable to a misalignment between the MQM
annotation framework used for training such metrics and
our custom error taxonomy used for evaluation. Striving
for consistency over error label criteria across training
and evaluation is thus fundamental for fair assessment.</p>
        <p>Looking at the Whole Dataset, we likewise highlight
that error-span metrics (MetricX and XCOMETXL) are
surpassed by learned metrics that are optimized only for
direct scalar prediction of sentence-level quality, such as
COMET-22-DA, BLEURT and UNITE. As the training
objective of error-span metrics is to regress over error
annotations to estimate penalty weights accordingly,
they may show a proneness to over-correction even in
high-quality segments.</p>
        <p>Minor varieties remain penalised. Focusing on
the target language, we observe that correlations on the
Mistake-only Dataset are generally higher for Italian than
for German. This result is noteworthy given that German
benefits from a larger pool of training data, as it is a more
regularly featured language in the WMT shared tasks,
which contribute most of the metrics training data. We
posit that this discrepancy supports the argument that
generic models tend to embed biases toward dominant
language varieties. In the case of German, it is likely that
the datasets used to train evaluation metrics
predominantly feature standard varieties such as those used in
Germany and at the EU level.</p>
        <p>Moreover, we caution against drawing conclusions
based on the Whole Dataset, where Italian-to-German
translations include nearly twice as many full matches
between reference and candidate as the reverse direction.
This makes the datasets not comparable to each other,
inflating metric performance and simplifying evaluation
for German as the target language.</p>
        <p>Precision or Recall? In Appendix B, we collect
decomposed subscores for embedding-based metrics:
recall and precision. We notice that recall tends to
correlate more strongly with human judgments than
the holistic score and the precision subscore. This
trend may corroborate the importance of the reference
translation: gauging how much of the semantic and syntactic
information contained in the reference transfers to
the candidate may generally serve as a predictor of
legal text quality as conceived of by expert evaluators.
Yet, the negligible edge in the correlation measure is
neither strong nor consistent enough to draw definitive
conclusions. An informed interpretation of the results
would require a qualitative analysis of the amount of
semantic explicitation commonly expressed in the legal
texts of both languages.</p>
        <p>Size matters. When comparing learned metrics of
increasing model size on the Mistake-only Dataset, we
observe a general trend where scaling up benefits
evaluation performance. This is evident in the case of
COMET-Kiwi, where the XL variant consistently outperforms
its smaller counterpart, and for the reference-based scores
of XCOMETXL-DA-ref, which show stronger results
compared to COMET-22-DA. A more nuanced picture
emerges with MetricX, where the XL versions outperform
the Large models only in evaluations into Italian,
suggesting that scaling effects may vary across language
directions, presumably due to the language variety
provenance of additional data.</p>
        <p>The quantized version of XCOMETXL-DA, though
slightly lowering correlation measures compared to its
full-precision counterpart, still outperforms all other
metrics, which confirms previous findings that quantization
can be a viable strategy for reducing computational costs.</p>
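The agreement measure reported as Rho throughout these comparisons, Spearman's rank correlation between metric scores and human judgments, can be computed as the Pearson correlation of the rank vectors. A minimal self-contained implementation (with average ranks for ties) is:

```python
def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(metric_scores, human_scores):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(metric_scores), rank(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice a library routine such as `scipy.stats.spearmanr` would be used; the sketch only makes the computation behind the reported correlations explicit.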
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions</title>
      <p>As an indication for future metric development, we
conclude that reference translations are most crucial for
enhancing evaluation reliability, while source sentences
may contribute marginally but are not essential. We
advise against embarking on the effort of error-span
annotation of large corpora with the aim of training new
metrics: it has notable human and resource costs, and
our results offer no evidence that it yields
commensurate metric improvements. Instead, targeted extensions
of the existing MT@BZ dataset may provide more
cost-effective support for evaluation purposes.</p>
      <p>Given the underperformance of metrics when
evaluating South Tyrolean German as a target language, future
metric adaptation would likely benefit from applying
continued pre-training to generic encoder models on South
Tyrolean German data. This would provide a more
suitable backbone for further fine-tuning learned metrics. To
this end, efforts should be made to compile legal text
corpora in South Tyrolean German, including relevant
terminology. Also, we recommend exploring training
strategies that integrate the strengths of embedding-based
and learned metrics, with the goal of developing evaluation
systems that perform robustly across the full quality
spectrum of machine translation output. From a broader
perspective, we suggest that metric selection in natural
language generation tasks should be guided by a clear
definition of the evaluation objective and the nature of
the task. Learned metrics are more effective when the
task involves detecting and weighing complex linguistic
phenomena that may surface in diverse forms – such as in
summarization or question-answering tasks. In such cases,
the fine-tuning and validation of a custom metric may be
a further convenient step. Conversely, more naive
evaluation methods like the string-based ones are often
appropriate when low variance from a reference is expected,
such as in the presence of named entities. As our findings
show, the two metric paradigms can even be complementary:
embedding- and string-based metrics are well-suited for
evaluating accuracy-related aspects, while learned metrics
can offer global insight into the overall fluency of the
generated text and meaning preservation.</p>
      <p>[5] A. Magueresse, V. Carles, E. Heetderks, Low-resource languages: A review of past work and future challenges, 2020. URL: https://arxiv.org/abs/2006.07264. arXiv:2006.07264.
[6] R. Knowles, S. Larkin, C.-K. Lo, MSLC24: Further challenges for metrics on a wide landscape of translation quality, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 475–491. URL: https://aclanthology.org/2024.wmt-1.34/. doi:10.18653/v1/2024.wmt-1.34.
[12] M. Freitag, G. Foster, D. Grangier, V. Ratnakar, Q. Tan, W. Macherey, Experts, errors, and context: A large-scale study of human evaluation for machine translation, Transactions of the Association for Computational Linguistics 9 (2021) 1460–1474. URL: https://aclanthology.org/2021.tacl-1.87/. doi:10.1162/tacl_a_00437.
[13] F. De Camillis, La traduzione non professionale nelle istituzioni pubbliche dei territori di lingua minoritaria: il caso di studio dell'amministrazione della Provincia autonoma di Bolzano, Ph.D. thesis, Alma Mater Studiorum - Università di Bologna, 2021. URL: https://amsdottorato.unibo.it/id/eprint/9695/.</p>
      <p>[7] V. Dewangan, B. R. S, G. Suri, R. Sonavane, When every token counts: Optimal segmentation for low-resource language models, in: Proceedings of the First Workshop on Language Models for Low-Resource Languages, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2025, pp. 294–308. URL: https://aclanthology.org/2025.loreslm-1.24/.
[8] F. De Camillis, E. W. Stemle, E. Chiocchetti, F. Fernicola, The MT@BZ corpus: machine translation &amp; legal language, in: Proceedings of the 24th Annual Conference of the European Association for Machine Translation, European Association for Machine Translation, Tampere, Finland, 2023, pp. 171–180. URL: https://aclanthology.org/2023.eamt-1.17/.
[9] A. Oliver, S. Alvarez-Vidal, E. Stemle, E. Chiocchetti, Training an NMT system for legal texts of a low-resource language variety South Tyrolean German - Italian, in: Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), European Association for Machine Translation (EAMT), Sheffield, UK, 2024, pp. 573–579. URL: https://aclanthology.org/2024.eamt-1.47/.
[10] S. Perrella, L. Proietti, A. Scirè, E. Barba, R. Navigli, Guardians of the machine translation meta-evaluation: Sentinel metrics fall in!, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 16216–16244. URL: https://aclanthology.org/2024.acl-long.856/. doi:10.18653/v1/2024.acl-long.856.
[11] T. Kocmi, C. Federmann, R. Grundkiewicz, M. Junczys-Dowmunt, H. Matsushita, A. Menezes, To ship or not to ship: An extensive evaluation of automatic metrics for machine translation, in: Proceedings of the Sixth Conference on Machine Translation, Association for Computational Linguistics, Online, 2021, pp. 478–494. URL: https://aclanthology.org/2021.wmt-1.57/.
[14] F. De Camillis, E. Chiocchetti, Machine-translating legal language: error analysis on an Italian-German corpus of decrees, Terminology science &amp; research 27 (2024) 1–27. URL: https://journal-eaft-aet.net/index.php/tsr/article/view/8304/7492.
[15] J. O. Alabi, D. I. Adelani, M. Mosbach, D. Klakow, Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 4336–4349. URL: https://aclanthology.org/2022.coling-1.382/.
[16] J. Sun, T. Sellam, E. Clark, T. Vu, T. Dozat, D. Garrette, A. Siddhant, J. Eisenstein, S. Gehrmann, Dialect-robust evaluation of generated text, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 6010–6028. URL: https://aclanthology.org/2023.acl-long.331/. doi:10.18653/v1/2023.acl-long.331.
[17] C. Amrhein, N. Moghe, L. Guillou, ACES: Translation accuracy challenge sets for evaluating machine translation metrics, in: Proceedings of the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 479–513. URL: https://aclanthology.org/2022.wmt-1.44/.
[18] Y. Yan, T. Wang, C. Zhao, S. Huang, J. Chen, M. Wang, BLEURT has universal translations: An analysis of automatic metrics by minimum risk training, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 5428–5443. URL: https://aclanthology.org/2023.acl-long.297/. doi:10.18653/v1/2023.acl-long.297.</p>
      <p>[19] P. Fernandes, A. Farinhas, R. Rei, J. G. C. de Souza, P. Ogayo, G. Neubig, A. Martins, Quality-aware decoding for neural machine translation, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 1396–1412. URL: https://aclanthology.org/2022.naacl-main.100/. doi:10.18653/v1/2022.naacl-main.100.
[20] G. Kovacs, D. Deutsch, M. Freitag, Mitigating metric bias in minimum Bayes risk decoding, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 1063–1094. URL: https://aclanthology.org/2024.wmt-1.109/. doi:10.18653/v1/2024.wmt-1.109.
[21] J. Pombal, N. M. Guerreiro, R. Rei, A. F. T. Martins, Adding chocolate to mint: Mitigating metric interference in machine translation, 2025. URL: https://arxiv.org/abs/2503.08327. arXiv:2503.08327.
[22] M. Freitag, N. Mathur, D. Deutsch, C.-K. Lo, E. Avramidis, R. Rei, B. Thompson, F. Blain, T. Kocmi, J. Wang, D. I. Adelani, M. Buchicchio, C. Zerva, A. Lavie, Are LLMs breaking MT metrics? Results of the WMT24 metrics shared task, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 47–81. URL: https://aclanthology.org/2024.wmt-1.2/. doi:10.18653/v1/2024.wmt-1.2.
[23] N. Moghe, T. Sherborne, M. Steedman, A. Birch, Extrinsic evaluation of machine translation metrics, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 13060–13078. URL: https://aclanthology.org/2023.acl-long.730/. doi:10.18653/v1/2023.acl-long.730.
[24] S. Agrawal, A. Farajian, P. Fernandes, R. Rei, A. F. T. Martins, Assessing the role of context in chat translation evaluation: Is context helpful and under what conditions?, Transactions of the Association for Computational Linguistics 12 (2024) 1250–1267. URL: https://aclanthology.org/2024.tacl-1.69/. doi:10.1162/tacl_a_00700.
[26] M. Freitag, N. Mathur, C.-k. Lo, E. Avramidis, R. Rei, B. Thompson, T. Kocmi, F. Blain, D. Deutsch, C. Stewart, C. Zerva, S. Castilho, A. Lavie, G. Foster, Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent, in: Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore, 2023, pp. 578–628. URL: https://aclanthology.org/2023.wmt-1.51/. doi:10.18653/v1/2023.wmt-1.51.
[27] N. Moghe, A. Fazla, C. Amrhein, T. Kocmi, M. Steedman, A. Birch, R. Sennrich, L. Guillou, Machine translation meta evaluation through translation accuracy challenge sets, Computational Linguistics 51 (2025) 73–137. URL: https://aclanthology.org/2025.cl-1.4/. doi:10.1162/coli_a_00537.
[28] E. Avramidis, S. Manakhimova, V. Macketanz, S. Möller, Machine translation metrics are better in evaluating linguistic errors on LLMs than on encoder-decoder systems, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 517–528. URL: https://aclanthology.org/2024.wmt-1.37/. doi:10.18653/v1/2024.wmt-1.37.
[29] V. Zouhar, S. Ding, A. Currey, T. Badeka, J. Wang, B. Thompson, Fine-tuned machine translation metrics struggle in unseen domains, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 488–500. URL: https://aclanthology.org/2024.acl-short.45/. doi:10.18653/v1/2024.acl-short.45.
[30] A. Lommel, S. Gladkoff, A. Melby, S. E. Wright, I. Strandvik, K. Gasova, A. Vaasa, A. Benzo, R. Marazzato Sparano, M. Foresi, J. Innis, L. Han, G. Nenadic, The multi-range theory of translation quality measurement: MQM scoring models and statistical quality control, in: Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 2: Presentations), Association for Machine Translation in the Americas, Chicago, USA, 2024, pp. 75–94. URL: https://aclanthology.org/2024.amta-presentations.6/.</p>
      <p>[25] R. Rei, N. M. Guerreiro, M. Treviso, L. Coheur, A. Lavie, A. Martins, The inside story: Towards better understanding of machine translation neural evaluation metrics, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1089–1105. URL: https://aclanthology.org/2023.acl-short.94/. doi:10.18653/v1/2023.acl-short.94.
[31] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.
[32] T. Sellam, D. Das, A. Parikh, BLEURT: Learning robust metrics for text generation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7881–7892. URL: https://aclanthology.org/2020.acl-main.704/. doi:10.18653/v1/2020.acl-main.704.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Custom error weights</title>
      <p>Type of Error</p>
      <p>Penalty Weight</p>
      <p>Accuracy errors
Mistranslation:</p>
      <p>Multiword expressions
Part of Speech
Word Sense Disambiguation
Partial</p>
      <p>Semantically Unrelated
Addition
Omission
Untranslated
Mechanical
Bilingual terminology
Source error</p>
      <p>Fluency errors
Grammar:</p>
      <p>Multiword syntax
Word form
Word order
Extra words</p>
      <p>Missing words
Lexicon:</p>
      <p>Lexical choice</p>
      <p>Non-existing or Foreign Word
Orthography:</p>
      <p>Spelling
Punctuation</p>
      <p>Capitalization
Gender
Inconsistency
Coherence
Multiple fluency errors
Other
</p>
      <p>Declaration on Generative AI</p>
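To make the use of such a weighted error taxonomy concrete, a segment-level quality score can be computed by subtracting length-normalised error penalties. The weight values below are illustrative placeholders only (hypothetical, not taken from the table above), and the error tag names are simplified stand-ins for the taxonomy labels.

```python
# Illustrative penalty weights -- assumed values, not the paper's actual table.
PENALTY_WEIGHTS = {
    "omission": 1.0,
    "mistranslation": 1.0,
    "lexical choice": 0.5,
    "punctuation": 0.1,
}

def segment_score(error_tags, num_tokens, weights=PENALTY_WEIGHTS):
    """MQM-style score: start from 1.0, subtract the summed per-error
    penalties normalised by segment length, and clip to [0, 1].
    Unknown tags fall back to a default weight of 0.5."""
    penalty = sum(weights.get(tag, 0.5) for tag in error_tags)
    return max(0.0, 1.0 - penalty / max(num_tokens, 1))
```

A segment with no annotated errors scores 1.0; heavier or more numerous errors push the score toward 0.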
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Scherrer</surname>
          </string-name>
          ,
          <article-title>Natural language processing for similar languages, varieties, and dialects: A survey</article-title>
          ,
          <source>Natural Language Engineering</source>
          <volume>26</volume>
          (
          <year>2020</year>
          )
          <fpage>595</fpage>
          -
          <lpage>612</lpage>
          . doi:10.1017/S1351324920000492.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. M. I.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anastasopoulos</surname>
          </string-name>
          ,
          <article-title>CODET: A benchmark for contrastive dialectal evaluation of machine translation</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EACL</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational Linguistics, St</article-title>
          .
          <source>Julian's, Malta</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1790</fpage>
          -
          <lpage>1859</lpage>
          . URL: https://aclanthology.org/2024.findings-eacl.125/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Adelani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Masiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          , E. Briakou,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carpuat</surname>
          </string-name>
          , et al.,
          <article-title>AfriMTE and AfriCOMET: Enhancing COMET to embrace under-resourced African languages, in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics</article-title>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>5997</fpage>
          -
          <lpage>6023</lpage>
          . URL: https://aclanthology.org/2024.naacl-long.334/. doi:10.18653/v1/2024.naacl-long.334.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] J. Falcão, C. Borg, N. Aranberri, K. Abela, COMET for low-resource machine translation evaluation: A case study of English-Maltese and Spanish-Basque, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italy, 2024, pp. 3553–3565. URL: https://aclanthology.org/2024.lrec-main.315/.
          … in: Proceedings of the Conference of the Association for Machine Translation in the Americas: Technical Papers, Association for Machine Translation in the Americas, Cambridge, Massachusetts, USA, 2006, pp. 223–231. URL: https://aclanthology.org/2006.amta-papers.25/.
          [40] Y. Wan, D. Liu, B. Yang, H. Zhang, B. Chen, D. Wong, L. Chao, UniTE: Unified translation evaluation, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 8117–8127. URL: https://aclanthology.org/2022.acl-long.558/. doi:10.18653/v1/2022.acl-long.558.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [33] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, 2020. URL: https://arxiv.org/abs/1904.09675. arXiv:1904.09675.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [34] M. Popović, chrF: character n-gram F-score for automatic MT evaluation, in: Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 392–395. URL: https://aclanthology.org/W15-3049/. doi:10.18653/v1/W15-3049.
          [41] N. M. Guerreiro, R. Rei, D. v. Stigt, L. Coheur, P. Colombo, A. F. T. Martins, xCOMET: Transparent machine translation evaluation through fine-grained error detection, Transactions of the Association for Computational Linguistics 12 (2024) 979–995. URL: https://aclanthology.org/2024.tacl-1.54/. doi:10.1162/tacl_a_00683.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [35] R. Rei, C. Stewart, A. C. Farinha, A. Lavie, COMET: A neural framework for MT evaluation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 2685–2702. URL: https://aclanthology.org/2020.emnlp-main.213/. doi:10.18653/v1/2020.emnlp-main.213.
          [42] V. Zouhar, P. Chen, T. K. Lam, N. Moghe, B. Haddow, Pitfalls and outlooks in using COMET, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 1272–1288. URL: https://aclanthology.org/2024.wmt-1.121/. doi:10.18653/v1/2024.wmt-1.121.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [36]
          <string-name><given-names>R.</given-names> <surname>Rei</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Treviso</surname></string-name>
          ,
          <string-name><given-names>N. M.</given-names> <surname>Guerreiro</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Zerva</surname></string-name>
          ,
          <string-name><given-names>A. C.</given-names> <surname>Farinha</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Maroti</surname></string-name>
          ,
          <string-name><given-names>J. G. C.</given-names> <surname>de Souza</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Glushkova</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Alves</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Coheur</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Lavie</surname></string-name>
          ,
          <string-name><given-names>A. F. T.</given-names> <surname>Martins</surname></string-name>
          ,
          <article-title>CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task</article-title>
          ,
          <source>in: Proceedings of the Seventh Conference on Machine Translation (WMT)</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
          <year>2022</year>
          , pp.
          <fpage>634</fpage>
          -
          <lpage>645</lpage>
          . URL: https://aclanthology.org/2022.wmt-1.60/.
          [43]
          <string-name><given-names>D.</given-names> <surname>Deutsch</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Foster</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Freitag</surname></string-name>
          ,
          <article-title>Ties matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>12914</fpage>
          -
          <lpage>12929</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.798/. doi:10.18653/v1/2023.emnlp-main.798.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [37]
          <string-name><given-names>R.</given-names> <surname>Rei</surname></string-name>
          ,
          <string-name><given-names>N. M.</given-names> <surname>Guerreiro</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Pombal</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>van Stigt</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Treviso</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Coheur</surname></string-name>
          ,
          <string-name><given-names>J. G. C.</given-names> <surname>de Souza</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>
          ,
          <article-title>Scaling up CometKiwi: Unbabel-IST 2023 submission for the quality estimation shared task</article-title>
          ,
          <source>in: Proceedings of the Eighth Conference on Machine Translation</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>841</fpage>
          -
          <lpage>848</lpage>
          . URL: https://aclanthology.org/2023.wmt-1.73/. doi:10.18653/v1/2023.wmt-1.73.
          [44]
          <string-name><given-names>S.</given-names> <surname>Agrawal</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Farinhas</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Rei</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>
          ,
          <article-title>Can automatic metrics assess high-quality translations?</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>14491</fpage>
          -
          <lpage>14502</lpage>
          . URL: https://aclanthology.org/2024.emnlp-main.802/. doi:10.18653/v1/2024.emnlp-main.802.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Juraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Finkelstein</surname>
          </string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Freitag</surname></string-name>
          ,
          <article-title>MetricX-24: The Google submission to the WMT 2024 metrics shared task</article-title>
          ,
          <source>in: Proceedings of the Ninth Conference on Machine Translation</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>492</fpage>
          -
          <lpage>504</lpage>
          . URL: https://aclanthology.org/2024.wmt-1.35/. doi:10.18653/v1/2024.wmt-1.35.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>M.</given-names>
            <surname>Snover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dorr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Micciulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Makhoul</surname>
          </string-name>
          ,
          <article-title>A study of translation edit rate with targeted human annotation</article-title>
          ,
          <source>in: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>