<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Meta-Evaluation of Automatic Machine Translation Metrics between Italian and a Minor Language Variety of German</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Di Natale</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Chiocchetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egon Stemle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eurac Research</institution>
          ,
          <addr-line>Viale Druso/Drususallee 1, 39100 Bolzano/Bozen</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Masaryk University</institution>
          ,
          <addr-line>Zerotinovo namesti 9, 602 00 Brno</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>We present the first meta-evaluation of Automatic Machine Translation Evaluation (AMTE) metrics between Italian and South Tyrolean German, a low-resourced standard variety of German. This minor German variety is recognised as a co-official language at the local level and is used by the local public administration and legislature. We evaluate metric agreement with human judgment across translation quality levels, using a dataset of bilingual machine-translated decrees annotated with human-curated error tags. Our findings show that embedding-based metrics perform best for evaluating high-quality translations, while learned neural metrics correlate more strongly with human judgments on lower-quality ranges. We also expose a persistent bias in AMTE against minor language varieties and make suggestions about the design of linguistic resources for envisaged custom metric development.</p>
      </abstract>
      <kwd-group>
<kwd>automatic machine translation evaluation metrics</kwd>
        <kwd>metrics meta-evaluation</kwd>
        <kwd>non-English language combination</kwd>
        <kwd>minor language variety</kwd>
        <kwd>machine translation</kwd>
        <kwd>natural language generation evaluation</kwd>
        <kwd>specialized communication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>South Tyrolean German is a minor standard variety of German with co-official status in the Italian province of Bolzano/Bozen (South Tyrol). The 350,000 German-speaking citizens in South Tyrol have the right to communicate with and access public services in their native language at the local level. Given the increasing integration of AI technologies into everyday life, this context underscores the need to develop bilingual NLP tools tailored to the South Tyrolean variety of German and its use cases, with Machine Translation (MT) among the most pressing fields of research. However, it is well documented that the performance of NLP systems for minor language varieties significantly lags behind both their major counterparts and high-resource languages [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      <p>Interest in generating translations into minor language varieties is growing, yet the lack of validated evaluation metrics hampers accurate monitoring of achieved progress. Most related studies still rely on inadequate, superseded lexical-overlap methods [2]. While the research community has made efforts to adapt neural metrics for under-resourced and dialectal varieties [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>], the development of robust evaluation methods is complicated by the absence of high-quality, sufficiently large labeled datasets – an issue common to all under-resourced varieties [5].</p>
      <p>Knowles et al. [6] have called for a comparative evaluation, as they argue that metrics assign lower scores to minor lexical variants even when no change in meaning exists. In addition, inefficient tokenization methods lead to suboptimal segmentation and reduced adaptability for under-resourced languages [7].</p>
      <p>Prior experiments with adaptive MT for South Tyrol [8, 9] have also employed metrics based on lexical overlap despite their known underperformance compared to neural metrics. This reliance stems from the lack of a thorough, localized evaluation of more advanced metric paradigms and makes a compelling case for a dedicated meta-evaluation study of existing solutions applicable to the South Tyrolean context.</p>
      <p>This work presents the first such MT meta-evaluation study of metrics for the Italian–South Tyrolean German language pair. We conduct our analysis on MT@BZ1, a manually error-annotated corpus of legal texts covering both translation directions, to assess the reliability of current automatic evaluation metrics.</p>
      <sec id="sec-1-1">
        <title>1.1. Automatic Machine Translation Evaluation</title>
        <p>Human evaluation remains the gold standard method for assessing the quality of MT outputs. However, because human annotation is time-consuming, resource-intensive and requires high domain expertise, Automatic Machine Translation Evaluation (AMTE) metrics have garnered increasing attention. These metrics aim to estimate translation quality by comparing a system-generated candidate translation either to the source segment2 in the other language, to a human-produced reference translation in the same language, or to both. In scenarios where the output of only one translation system is available, as in this case, a so-called segment-level evaluation is carried out. It consists of "evaluating metrics based on their ability to rank segments in the same order as human judgments" [10]. The effectiveness of such metrics is commonly measured using ranking correlation coefficients, under the assumption that a reliable metric should consistently assign higher scores to translations deemed superior by human annotators [11].
2In the field of MT, a segment is defined as the minimal translation unit, which in this study corresponds to a sentence.</p>
        <p>Existing metrics can be categorized into three main types:
• String-based: this approach quantifies translation quality by measuring lexical overlap with one or more reference translations. These methods operate at the surface level, comparing exact matches of word or character sequences between the candidate and the reference.
• Embedding-based: these metrics leverage contextualized token embeddings from pretrained language models to compute semantic similarity between the candidate translation and the reference. Semantic alignment is evaluated at the token level using cosine similarity, followed by an F-score aggregation procedure.
• Learned: these metrics are based on transformer architectures that have been fine-tuned via supervised learning to replicate human judgments of machine translation quality, typically using a regression objective to provide a continuous score.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2. Motivation</title>
        <sec id="sec-1-2-1">
          <title>2.1. Social and Linguistic Background of South Tyrol</title>
          <p>South Tyrolean German is the standard variety of German used in the Autonomous Province of Bolzano/Bozen (South Tyrol) in Northern Italy. In South Tyrol, German is a recognized minority language, co-official with Italian. Public administration offices are legally required to use German when interacting with the German-speaking population (Presidential Decree No. 670/1972, Art. 100), which makes up the large majority of South Tyrol's population (69%)3. Consequently, all administrative documents, local legislation, and materials intended for the general public – such as the websites of local public institutions – must be available not only in the national language Italian but also in the minority language German4.
3See the latest census data: https://assets-eu-01.kc-usercontent.com/b5376750-8076-01cf-17d2-d343e29778a7/5deec178-b2a3-4e2d-8795-d37635c7e0f7/pressnote_1160209_mit56_2024.pdf
4There is a third official language, Ladin, spoken by about 20,000 South Tyroleans. We will not deal with Ladin in this paper.</p>
          <p>This multilingual institutional language regime is largely implemented through translation between Italian and German or vice-versa. National legislation is drafted in Italian, and any implementation at the local level creates the need for translation into German. Following quotas in public employment, about two thirds of public administration staff is German-speaking. Consequently, many legal and administrative texts are now originally drafted in German. While the Italian and German versions of, for example, a local law are both official, in case of diverging interpretation the Italian version prevails (Presidential Decree No. 670/1972, Art. 99). This means that a translated text can become the legally binding version. Given the growing use of machine translation, this holds true also for machine-translated or post-edited texts.</p>
          <p>The impressive level of fluency of MT-generated texts poses a challenge for fair quality assessment of MT systems even for human evaluators – especially for those lacking specialized training, who may be outperformed by automated neural metrics [12]. In South Tyrolean public offices, where translation-related tasks are often performed by non-specialists, the rising adoption of MT – frequently without adherence to scientific evaluation protocols [13] – carries the risk of overestimating productivity gains. Without systematic, targeted performance monitoring, critical errors may go unnoticed. As highlighted in the error analysis of a machine-translated legal corpus [14], MT systems often struggle with local legal terminology and are prone to interference from other legal systems using German. For example, kommunale Steuer (municipal tax) is never used in South Tyrol as it would be in Germany; the South Tyrolean term for "municipal tax" is Gemeindesteuer. Such errors can severely compromise translation quality and usability. In high-stakes domains like the legal one, fluency is secondary to semantic precision and legal appropriateness. Critical accuracy errors can distort meaning, making translated laws unpublishable or even harmful. Consequently, there is a clear need for MT evaluation frameworks that attend to the specific requirements of the South Tyrolean administration and population.</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>2.2. Toward the Development of Custom Metrics</title>
          <p>The well-documented challenges of adapting NLP applications to minor language varieties [<xref ref-type="bibr" rid="ref1">1</xref>] also apply to the development of automatic evaluation metrics. Language models are pre-trained on large-scale corpora where major language varieties contribute a disproportionately larger amount of training signal [15], often without explicit annotation of variety or dialect tags. This results in biased representations and undermines the fairness and reliability of evaluation metrics for underrepresented varieties [16]. Current literature has shown that intensive continued pre-training [16] and the use of high-quality, human-annotated datasets spanning a range of translation quality levels [<xref ref-type="bibr" rid="ref4">4</xref>] are essential to improving evaluation performance. Yet, these strategies remain largely impractical at present due to the significant data and resource demands they entail.</p>
          <p>Also, given the high costs of constructing fine-grained, manually annotated datasets, one wants to be sure that the compilation of structured and detailed linguistic resources is empirically justified. While Amrhein et al. [17] argue that the inclusion of reference translations generally improves evaluation reliability, the behavior of existing metrics remains inconsistent, occasionally even counterintuitive. For example, some metrics have been observed to disregard the reference altogether [18], or to produce high scores even when the source text is omitted entirely [6, 10]. As a result, a comprehensive assessment of existing solutions is needed not only to identify the metrics best suited to the context under study, but also to lay the groundwork for envisaged future metric development.</p>
          <p>Moreover, reliable metrics can also advance generation tasks. An emerging trend in natural language generation is to exploit Minimum Bayes Risk (MBR) decoding, which selects the output hypothesis that minimizes expected loss according to a utility function defined by a chosen evaluation metric [19]. This approach can act as a form of style transfer with a reduction in training costs and data requirements. However, using the same metric for both decoding and final evaluation introduces bias, as the system is optimized to reproduce the metric's idiosyncrasies [20]. Even different but highly correlated metrics – especially if they are of the same type – can produce similar biases [21]. Thus, evaluating the robustness of multiple metric paradigms becomes an essential prerequisite to generating text in South Tyrolean German with MBR decoding.</p>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>3. Challenges of Automatic Machine Translation Evaluation</title>
        <p>Learned metrics have consistently outperformed other evaluation methods in benchmark competitions such as the WMT Metrics Shared Task [22]. However, this finding should not be generalized uncritically. Since neural metrics are predominantly fine-tuned on WMT competition datasets – which represent a limited range of linguistic diversity and domains – their superiority in more specialized evaluation scenarios remains open to question. Knowles et al. [6] raise questions regarding how metrics assess terminological variation within language varieties and call for more thorough research on the subject. Since larger language varieties contribute more training signal during metric development, studies have observed that major linguistic variants tend to be rated more favorably than minor linguistic variants, potentially leading to biased evaluations [16].</p>
        <p>Furthermore, analyses of neural metric performance on non-English language pairs remain limited. As a result, the superiority of neural metrics cannot be indiscriminately generalized to all language combinations [23], with some evidence suggesting that performance may degrade when English is excluded from the evaluation [24].</p>
        <p>Among the major limitations highlighted in the literature is the lack of interpretability inherent to many neural evaluation metrics, largely due to their opaque scoring mechanisms. Their black-box nature hinders an assessment of which metrics are best suited for capturing specific linguistic phenomena and complicates the selection of appropriate metrics for targeted evaluation tasks [25]. In response, recent research has increasingly emphasized evaluation methodologies grounded in human error annotations – particularly those following the MQM (Multidimensional Quality Metrics) framework – which offer fine-grained information on translation quality [12]. These span-level annotations have also been leveraged as a standardized method for deriving quality scores (eliminating the need for direct human scoring in evaluation tasks) [26] and for training more interpretable quality metrics.</p>
        <p>Parallel efforts have also turned to linguistically motivated meta-evaluation test suites and controlled experiments designed to probe metric sensitivity to specific language phenomena [27, 28].</p>
        <p>The specialized nature of the legal domain also raises concerns about the reliability of existing evaluation metrics. Zouhar et al. [29] highlight that learned metrics exhibit a performance drop when applied to out-of-domain data, largely due to their final-stage fine-tuning process. This suggests that current training data effectively optimizes metrics for specific domains but does not generalize well beyond them. As a result, extending these evaluation metrics to other domains – such as the legal domain – may lead to performance degradation compared to the base model.</p>
      </sec>
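<p>The string-based and embedding-based paradigms discussed above can be sketched in miniature. The following toy implementations are illustrative only, not the actual metrics evaluated in this study; the embedding sketch assumes precomputed token vectors rather than a real pretrained model.

```python
from collections import Counter


def ngram_precision(candidate, reference, n=2):
    """String-based sketch: fraction of candidate n-grams found in the reference."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0


def greedy_cosine_f1(cand_vecs, ref_vecs):
    """Embedding-based sketch: greedy token matching by cosine similarity,
    aggregated into precision, recall and F1 (the BERTScore-style recipe)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))
    # precision: each candidate token paired with its most similar reference token
    p = sum(max(cos(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # recall: each reference token paired with its most similar candidate token
    r = sum(max(cos(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    return 2 * p * r / (p + r)
```

Real string-based metrics add refinements such as brevity penalties and smoothing, and real embedding-based metrics obtain the token vectors from a contextual language model.</p>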
    </sec>
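<p>The MBR decoding strategy mentioned in Section 2.2 can be sketched as follows. This is a minimal illustration, not the setup of this paper: `overlap_f1` is a hypothetical word-overlap utility standing in for a real evaluation metric, and the other hypotheses serve as pseudo-references.

```python
def mbr_decode(hypotheses, utility):
    """Minimum Bayes Risk sketch: return the hypothesis with the highest
    expected utility, using the remaining hypotheses as pseudo-references."""
    def expected_utility(h):
        others = [p for p in hypotheses if p is not h]
        return sum(utility(h, p) for p in others) / len(others)
    return max(hypotheses, key=expected_utility)


def overlap_f1(a, b):
    """Toy utility: word-set F1 overlap between two strings."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    inter = len(sa & sb)
    p, r = inter / len(sa), inter / len(sb)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Replacing `overlap_f1` with a learned metric is exactly the step that imports that metric's idiosyncrasies into the generated output, which is why robustness across metric paradigms matters before using MBR for South Tyrolean German.</p>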
    <sec id="sec-2">
      <title>4. Methodology</title>
      <sec id="sec-2-1">
        <title>4.1. Problem Definition</title>
        <p>We establish two criteria to characterize an effective metric for our use case: the first is absolute agreement, defined as ranking correct translations higher than incorrect ones. We also define relative agreement, that is, the capability to rank translations containing critical mistakes lower than those with milder ones [11].</p>
        <p>To operationalize this differentiation, we partition the dataset for analysis. Absolute agreement is measured on the Whole Dataset – comprising all segments available. To measure relative agreement, we subsample only the segments annotated with at least one mistake, the Mistake-only Dataset.</p>
      </sec>
      <sec id="sec-2-1-1">
        <title>4.2. Dataset and Human Scoring</title>
        <p>We use the MT@BZ corpus [8], a corpus of machine-translated decrees. It comprises source, reference and candidate translations in both language directions (IT→DE and DE→IT). Each segment has been manually annotated for translation errors using a custom error taxonomy. Table 1 offers a glance into the composition of the corpus for each language direction.
Table 1: Composition of the MT@BZ dataset (total segments: 1,509 per direction). Error-annotated segments indicate the number of translations that have been labeled as containing at least one mistake. Exact matches indicate the number of correct translations that are identical to the reference. Other correct segments indicate the number of correct translations that are different from the reference.</p>
        <p>We notice that around 60% of all segments are correct for both language directions. To gain further insight, we compute the BLEU score between reference and candidate sentences. Notably, we find that a very high number of segments labeled as correct receives a perfect BLEU score of 100, indicating exact matches with the reference translations. This outcome has also been observed by Oliver et al. [9] in similar experiments on the same data, and is attributed to the repetitive and formulaic nature of legal language, which often leads to low lexical and syntactic variability.</p>
        <p>To measure correlation across a range of quality levels (as defined in Section 4.1) in the absence of numerical quality scores, we assign severity weights to each error type annotated in the original dataset (see Appendix A). In this manner, we can lay out a hierarchy of type-of-error severity and derive a more granular quality ranking. We apply a penalty for each error in a segment, equal to the severity weight assigned to that error type, according to the Linear Raw Scoring Model presented by Lommel et al. [30]. The sum of penalties is then deducted from a total of 100 and becomes the human score. This score reflects both the presence and severity of translation errors, thereby enabling the computation of rank-based correlation indices between human judgments and automatic metric outputs.</p>
        <p>Given the highly specialized nature of the domain, experts with competence in the South Tyrolean legal framework and German language varieties were consulted to define severity levels for each error type. These levels were established based on both linguistic adequacy and legislative drafting requirements5. For a detailed qualitative analysis of the corpus mistakes, refer to De Camillis and Chiocchetti [14].
5For example, the South Tyrolean public administration is bound by law to use the terminology that is being officially validated by a dedicated Terminology Commission (Presidential Decree No. 574/1988, Art. 6) and to adopt gender-neutral language (Provincial Law No. 51/2010). These constraints are therefore essential quality aspects when translating official documents into this minor language variety of German.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.3. Setup of Selected Metrics</title>
        <p>This section presents the evaluation metrics employed in our study, with details on the tested methods and models provided in Table 2. Following best practices for replicability as recommended by Zouhar et al. [42] for Comet-suite metrics, we include hash codes and model identifiers in the footnotes of the present section.
Table 2: Metrics under evaluation, with their type and input configuration (Src = uses the source; Ref = uses the reference; the header of the third column was lost in extraction).
Metric              Type             Src  Ref  (?)
BLEU                String-based     ✗    ✓    ✗
BLEURT              Learned          ✗    ✓    ✗
BERTScore           Embedding-based  ✗    ✓    ✗
chrF                String-based     ✗    ✓    ✗
COMET-22-DA         Learned          ✓    ✓    ✗
COMET-Kiwi-DA       Learned          ✓    ✗    ✗
COMET-KiwiXL-DA     Learned          ✓    ✗    ✓
MetricX-24-Hybrid   Learned          ✓    ✓    ✓
TER                 String-based     ✗    ✓    ✗
UNITE               Learned          ✓    ✓    ✗
XCOMETXL-DA         Learned          ✓    ✓    ✓</p>
        <p>String-based metrics. BLEU [31] measures modified n-gram precision with a brevity penalty. chrF [<xref ref-type="bibr" rid="ref6">34</xref>] computes overlap over character-level n-grams, offering sensitivity to morphological and orthographic variation. Finally, TER [<xref ref-type="bibr" rid="ref11">39</xref>] estimates the minimum number of edit operations required to transform the candidate into the reference, approximating post-editing effort.</p>
        <p>Embedding-based metrics. We utilize the BERTScore framework6 [<xref ref-type="bibr" rid="ref5">33</xref>], which uses contextual embeddings from pre-trained language models to compute semantic similarity. The framework allows for model selection. Hash identifiers have been generated together with the scores and are provided in the footnotes. In our experiments, we evaluate four encoder backbones: bert-base-multilingual7 (which is the default model), roberta-large-mnli8, deberta-xlarge-mnli9 and bart-large-mnli10. In Table 3, we report the results under their respective model denominations, matched with the aggregated F1 score. We also compute precision and recall individually to highlight asymmetric contributions to the similarity assessment, which will be commented on in Section 5. Precision measures how many of the candidate's tokens are present in the reference, while recall captures how well the reference tokens are matched by the generated candidate.
6https://github.com/Tiiiger/bert_score
7bert-base-multilingual-cased_L9_noidf_version=0.3.12(hug_trans=4.46.2)_fast-tokenizer
8roberta-large-mnli_L19_no-idf_version=0.3.12(hug_trans=4.51.3)
9microsoft/deberta-xlarge-mnli_L40_noidf_version=0.3.12(hug_trans=4.51.3)
10facebook/bart-large-mnli_L11_no-idf_version=0.3.12(hug_trans=4.51.3)</p>
      </sec>
      <sec id="sec-2-3">
        <title>Learned Metrics</title>
        <p>We choose learned metrics trained under different input configurations. We begin with reference-based metrics, which incorporate the reference translation during both training and inference. We select COMET-22-DA11 [<xref ref-type="bibr" rid="ref7">35</xref>] and BLEURT [32], which have been fine-tuned simply using quality scores from human annotators.</p>
        <p>We also consider source-based metrics (also called Quality Estimation or QE metrics), which are trained without access to reference translations. Instead, they learn to predict human quality scores solely from the source sentence and the machine-generated output. We include both COMET-Kiwi-DA12 [<xref ref-type="bibr" rid="ref8">36</xref>] and its larger variant COMET-KiwiXL-DA13 [<xref ref-type="bibr" rid="ref9">37</xref>], which builds on the same architecture but differs in model capacity.</p>
        <p>The unified approach combines both the source and the reference to exploit multi-task interaction. We assess UNITE14 [40]. It jointly leverages the source and the reference as separate input streams during training, then incorporates a last layer to fuse the decomposed scores into the holistic one. We report scores for the source (src) and reference (ref) decompositions.</p>
        <p>We also include error-span metrics, namely XCOMETXL-DA15 [41], MetricX-24-Hybrid-Large and its larger configuration MetricX-24-Hybrid-XL [<xref ref-type="bibr" rid="ref10">38</xref>]. These metrics include a training phase based on error-span labels, according to the MQM error taxonomy. They are trained to predict error spans alongside a penalty score. XCometXL-DA is a hybrid metric that provides additional scores based on four decomposed dimensions: src, ref, the unified approach and MQM annotations. The holistic score is then produced by ensembling the four sub-scores via a forward pass that establishes aggregation weights. Instead, the MetricX model suite only provides a single additional decomposed score, which includes only the source in the evaluation.</p>
        <p>Finally, we explore a variant of XCOMETXL quantized to 8 bits16, motivated by the hypothesis put forward in Zouhar et al. [42] that lower-precision approximations of large metrics can maintain correlation with human judgments while significantly reducing inference costs.
11Python3.8.10|Comet2.2.2|fp32|Unbabel/wmt22-comet-da|1
12Python3.8.10|Comet2.2.2|fp32|Unbabel/wmt22-cometkiwi-da|1</p>
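<p>The severity-based human scoring described in Section 4.2 can be sketched as follows. The severity weights shown here are illustrative placeholders; the actual weights per error type are given in Appendix A.

```python
# Hypothetical severity weights per error type (the study's real weights
# are defined in Appendix A of the paper).
SEVERITY_WEIGHTS = {"critical": 25.0, "major": 10.0, "minor": 5.0}


def human_score(error_annotations):
    """Linear Raw Scoring sketch (Lommel et al.): deduct one severity
    penalty per annotated error from a maximum score of 100."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for severity in error_annotations)
    return 100.0 - penalty
```

A segment with no annotated errors thus scores 100, and each additional or more severe error lowers the score, which is what makes rank-based comparison with metric outputs possible.</p>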
        <sec id="sec-2-3-1">
          <title>4.4. Meta-Evaluation</title>
          <p>In Table 3, we report Accuracy (Acc) [11], a measure computed through pairwise comparisons across the test set. It quantifies the proportion of pairs for which the evaluation metric produces the same relative ordering as the human gold standard (concordant), versus those where the ordering is incorrect (discordant). We follow Deutsch et al. [43] by using a variant of Accuracy adjusted for tie calibration, which artificially creates ties from continuous scores. This procedure is needed in light of the high number of rank ties stemming from the way the human scores are constructed. The Acc value ranges from 0 to 1.
13Python3.8.10|Comet2.2.2|fp32|Unbabel/wmt22-cometkiwiXL-da|1
14Python3.8.10|Comet2.2.2|fp32|Unbabel/unite-mup|1
15Python3.8.10|Comet2.2.3|fp32|Unbabel/XCOMET-XL|1
16Python3.8.10|Comet2.2.2|qint8|Unbabel/XCOMET-XL|1</p>
          <p>We also adopt Spearman's correlation (Rho), ranging from -1 to 1. It offers robustness to outliers and allows us to capture rank-based monotonic relationships even across the markedly different score distributions observed in the metrics evaluated [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
          <p>We decide not to use Pearson's correlation because it assumes a linear relationship between the distributions of the two score groups [43]. The proportional severity weights we assign to different error types are not expected to be linearly replicated by metric outputs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Results</title>
      <sec id="sec-3-1">
        <title>We apply meta-evaluation measures on both the Whole</title>
        <p>Dataset and the Mistake-only Dataset. This addresses
the need to adequately test the metrics on the two
criteria that have been established when defining the
problem in Section 4.1: absolute and relative agreement.</p>
        <p>In Table 3, results are accordingly structured under
two main sections, which separately report metric
performance under each evaluation criterion. For
metrics that generate holistic scores by aggregating
subscores algorithmically, we report the holistic score in
bold, while single decomposed scores are provided in
regular font.</p>
        <p>Metric paradigm performance varies across
quality ranges. Our results reveal a widely diferent
performance pattern across metric paradigms when evaluated
on the Whole Dataset versus the Mistake-only Dataset.</p>
        <p>Surprisingly, both string-based and embedding-based
metrics outperform learned metrics when evaluated on
the Whole Dataset. We explain this with the argument
that string-based metrics – being rule-based – can
reliably detect and reward the high sample of exact matches
with the reference. Embedding-based metrics also
beneift from their ability to capture lexical overlap at a
subword or token level, recognising meaning even when
the wording difers. We attribute the underperformance
of learned metrics primarily to the inherent nature of
their regression-based scoring. Unlike rule-based metrics
that produce deterministic outputs, learned metrics rely
on regression functions that approximate scores based
on distributional patterns in the training data. This can
result in unexpected behavior – for instance, candidate</p>
        <p>Mind the reference. Disaggregating the performance
of learned metrics by input type ofers valuable insights
into which linguistic resources most efectively
contribute to accurate evaluation. Considering the
Mistakeonly Dataset, reference-based scores surpass both
sourcebased and error-span counterparts for COMET, UNITE
and MetricX families. Interestingly, for metrics built on
the unified approach (such as UNITE and XCOMETXL),
the inclusion of both source and reference appears
beneifcial. While the reference remains the primary driver of
correlation, incorporating the source provides a modest
boost to overall score agreement. This suggests that
uniifed models, which incorporate additional layers to weigh
and integrate information streams from both inputs into
the holistic score, may be better suited to capture certain
error types that are only apparent when the source is
considered.</p>
        <p>In general, while source-based metrics trail behind
other learned metric types, they can outperform
embedding-based metrics that rely on reference
translations, especially if we consider models with larger
capacity (MetricX-24-XL-QE, COMET-KiwiXL-DA and
XCOMETXL-src).</p>
        <p>Error-span metrics are misaligned. We assess the
usefulness of error-span annotations in comparison to
other linguistic signals. XCOMETXL-DA-mqm is the only
available decomposed score based exclusively on MQM
error span identification. Considering the Mistake-only
Dataset, we observe a drop compared to related subscores
of the same metric as well as to the smaller configuration
of the same metric (COMET-22-DA). This failure may
be attributable to a misalignment between the MQM
annotation framework used for training such metrics and
our custom error taxonomy used for evaluation. Striving
for consistency over error label criteria across training
and evaluation is thus fundamental for fair assessment.</p>
        <p>Looking at the Whole Dataset, we likewise highlight
that error-span metrics (MetricX and XCOMETXL) are
surpassed by learned metrics that are optimized only for
direct scalar prediction of sentence-level quality, such as
COMET-22-DA, BLEURT and UNITE. As the training
objective of error-span metrics is to regress over error
annotations to estimate penalty weights accordingly,
they may show a proneness to over-correction even in
high-quality segments.</p>
        <p>Minor varieties remain penalised. Focusing on
the target language, we observe that correlations on the
Mistake-only Dataset are generally higher for Italian than
for German. This result is noteworthy given that German
benefits from a larger pool of training data, as it is a more
regularly featured language in the WMT shared tasks,
which contribute most of the metrics training data. We
posit that this discrepancy supports the argument that
generic models tend to embed biases toward dominant
language varieties. In the case of German, it is likely that
the datasets used to train evaluation metrics
predominantly feature standard varieties such as those used in
Germany and at the EU level.</p>
        <p>Moreover, we caution against drawing conclusions
based on the Whole Dataset, where Italian-to-German
translations include nearly twice as many full matches
between reference and candidate as the reverse direction.
This makes the datasets not comparable to each other,
inflating metric performance and simplifying evaluation
for German as the target language.</p>
        <p>Precision or Recall? In Appendix B, we collect
decomposed subscores for embedding-based metrics:
recall and precision. We notice that recall tends to
correlate more strongly with human judgments than
the holistic score and the precision subscore. This
trend may corroborate the importance of the reference
translation: gauging how much of the semantic and syntactic
information contained in the reference transfers to
the candidate may generally serve as a predictor of
legal text quality as conceived of by expert evaluators.
Yet, the negligible edge in the correlation measure is
neither strong nor consistent enough to draw definitive
conclusions. An informed interpretation of the results
would require a qualitative analysis of the amount of
semantic explicitation commonly expressed in the legal
texts of both languages.</p>
        <p>Size matters. When comparing learned metrics of
increasing model size on the Mistake-only Dataset, we
observe a general trend where scaling up benefits
evaluation performance. This is evident in the case of
COMET-Kiwi, where the XL variant consistently outperforms
its smaller counterpart, and for the reference-based scores
of XCOMETXL-DA-ref, which show stronger results
compared to COMET-22-DA. A more nuanced picture
emerges with MetricX, where the XL versions outperform
the Large models only in evaluations into Italian,
suggesting that scaling effects may vary across language
directions, presumably due to the language variety
provenance of additional data.</p>
        <p>The quantized version of XCOMETXL-DA, though
slightly lowering correlation measures compared to its
full-precision counterpart, still outperforms all other
metrics, which confirms previous findings that quantization
can be a viable strategy for reducing computational costs.</p>
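The agreement measure reported as Rho throughout these comparisons, Spearman's rank correlation between metric scores and human judgments, can be computed as the Pearson correlation of the rank vectors. A minimal self-contained implementation (with average ranks for ties) is:

```python
def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(metric_scores, human_scores):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(metric_scores), rank(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice a library routine such as `scipy.stats.spearmanr` would be used; the sketch only makes the computation behind the reported correlations explicit.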
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusions</title>
      <p>As an indication for future metric development, we
conclude that reference translations are most crucial for
enhancing evaluation reliability, while source sentences
may contribute marginally but are not essential. We
advise against embarking on the effort of error-span
annotation of large corpora with the aim of training new
metrics: it has notable human and resource costs, and
our results offer no evidence that it yields
commensurate metric improvements. Instead, targeted extensions
of the existing MT@BZ dataset may provide more
cost-effective support for evaluation purposes.</p>
      <p>Given the underperformance of metrics when
evaluating South Tyrolean German as a target language, future
metric adaptation would likely benefit from applying
continued pre-training to generic encoder models on South
Tyrolean German data. This would provide a more
suitable backbone for further fine-tuning learned metrics. To
this end, efforts should be made to compile legal text
corpora in South Tyrolean German, including relevant
terminology. Also, we recommend exploring training
strategies that integrate the strengths of embedding-based
and learned metrics, with the goal of developing evaluation
systems that perform robustly across the full quality
spectrum of machine translation output. From a broader
perspective, we suggest that metric selection in natural
language generation tasks should be guided by a clear
definition of the evaluation objective and the nature of
the task. Learned metrics are more effective when the
task involves detecting and weighing complex linguistic
phenomena that may surface in diverse forms – such as in
summarization or question-answering tasks. In such cases,
the fine-tuning and validation of a custom metric may be
a further convenient step. Conversely, more naive
evaluation methods like the string-based ones are often
appropriate when low variance from a reference is expected,
such as in the presence of named entities. As our findings
show, the two metric paradigms can even be complementary:
embedding- and string-based metrics are well-suited for
evaluating accuracy-related aspects, while learned metrics
can offer global insight into the overall fluency of the
generated text and meaning preservation.</p>
      <p>[5] A. Magueresse, V. Carles, E. Heetderks, Low-resource languages: A review of past work and future challenges, 2020. URL: https://arxiv.org/abs/2006.07264. arXiv:2006.07264.
[6] R. Knowles, S. Larkin, C.-K. Lo, MSLC24: Further challenges for metrics on a wide landscape of translation quality, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 475–491. URL: https://aclanthology.org/2024.wmt-1.34/. doi:10.18653/v1/2024.wmt-1.34.
[12] M. Freitag, G. Foster, D. Grangier, V. Ratnakar, Q. Tan, W. Macherey, Experts, errors, and context: A large-scale study of human evaluation for machine translation, Transactions of the Association for Computational Linguistics 9 (2021) 1460–1474. URL: https://aclanthology.org/2021.tacl-1.87/. doi:10.1162/tacl_a_00437.
[13] F. De Camillis, La traduzione non professionale nelle istituzioni pubbliche dei territori di lingua minoritaria: il caso di studio dell'amministrazione della Provincia autonoma di Bolzano, Ph.D. thesis, Alma Mater Studiorum - Università di Bologna, 2021. URL: https://amsdottorato.unibo.it/id/eprint/9695/.</p>
      <p>[7] V. Dewangan, B. R. S, G. Suri, R. Sonavane, When every token counts: Optimal segmentation for low-resource language models, in: Proceedings of the First Workshop on Language Models for Low-Resource Languages, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2025, pp. 294–308. URL: https://aclanthology.org/2025.loreslm-1.24/.
[8] F. De Camillis, E. W. Stemle, E. Chiocchetti, F. Fernicola, The MT@BZ corpus: machine translation &amp; legal language, in: Proceedings of the 24th Annual Conference of the European Association for Machine Translation, European Association for Machine Translation, Tampere, Finland, 2023, pp. 171–180. URL: https://aclanthology.org/2023.eamt-1.17/.
[9] A. Oliver, S. Alvarez-Vidal, E. Stemle, E. Chiocchetti, Training an NMT system for legal texts of a low-resource language variety South Tyrolean German - Italian, in: Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), European Association for Machine Translation (EAMT), Sheffield, UK, 2024, pp. 573–579. URL: https://aclanthology.org/2024.eamt-1.47/.
[10] S. Perrella, L. Proietti, A. Scirè, E. Barba, R. Navigli, Guardians of the machine translation meta-evaluation: Sentinel metrics fall in!, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 16216–16244. URL: https://aclanthology.org/2024.acl-long.856/. doi:10.18653/v1/2024.acl-long.856.
[11] T. Kocmi, C. Federmann, R. Grundkiewicz, M. Junczys-Dowmunt, H. Matsushita, A. Menezes, To ship or not to ship: An extensive evaluation of automatic metrics for machine translation, in: Proceedings of the Sixth Conference on Machine Translation, Association for Computational Linguistics, Online, 2021, pp. 478–494. URL: https://aclanthology.org/2021.wmt-1.57/.
[14] F. De Camillis, E. Chiocchetti, Machine-translating legal language: error analysis on an Italian-German corpus of decrees, Terminology science &amp; research 27 (2024) 1–27. URL: https://journal-eaft-aet.net/index.php/tsr/article/view/8304/7492.
[15] J. O. Alabi, D. I. Adelani, M. Mosbach, D. Klakow, Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 4336–4349. URL: https://aclanthology.org/2022.coling-1.382/.
[16] J. Sun, T. Sellam, E. Clark, T. Vu, T. Dozat, D. Garrette, A. Siddhant, J. Eisenstein, S. Gehrmann, Dialect-robust evaluation of generated text, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 6010–6028. URL: https://aclanthology.org/2023.acl-long.331/. doi:10.18653/v1/2023.acl-long.331.
[17] C. Amrhein, N. Moghe, L. Guillou, ACES: Translation accuracy challenge sets for evaluating machine translation metrics, in: Proceedings of the Seventh Conference on Machine Translation (WMT), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 479–513. URL: https://aclanthology.org/2022.wmt-1.44/.
[18] Y. Yan, T. Wang, C. Zhao, S. Huang, J. Chen, M. Wang, BLEURT has universal translations: An analysis of automatic metrics by minimum risk training, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 5428–5443. URL: https://aclanthology.org/2023.acl-long.297/. doi:10.18653/v1/2023.acl-long.297.</p>
      <p>[19] P. Fernandes, A. Farinhas, R. Rei, J. G. C. de Souza, P. Ogayo, G. Neubig, A. Martins, Quality-aware decoding for neural machine translation, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 1396–1412. URL: https://aclanthology.org/2022.naacl-main.100/. doi:10.18653/v1/2022.naacl-main.100.
[20] G. Kovacs, D. Deutsch, M. Freitag, Mitigating metric bias in minimum Bayes risk decoding, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 1063–1094. URL: https://aclanthology.org/2024.wmt-1.109/. doi:10.18653/v1/2024.wmt-1.109.
[21] J. Pombal, N. M. Guerreiro, R. Rei, A. F. T. Martins, Adding chocolate to mint: Mitigating metric interference in machine translation, 2025. URL: https://arxiv.org/abs/2503.08327. arXiv:2503.08327.
[22] M. Freitag, N. Mathur, D. Deutsch, C.-K. Lo, E. Avramidis, R. Rei, B. Thompson, F. Blain, T. Kocmi, J. Wang, D. I. Adelani, M. Buchicchio, C. Zerva, A. Lavie, Are LLMs breaking MT metrics? Results of the WMT24 metrics shared task, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 47–81. URL: https://aclanthology.org/2024.wmt-1.2/. doi:10.18653/v1/2024.wmt-1.2.
[23] N. Moghe, T. Sherborne, M. Steedman, A. Birch, Extrinsic evaluation of machine translation metrics, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 13060–13078. URL: https://aclanthology.org/2023.acl-long.730/. doi:10.18653/v1/2023.acl-long.730.
[24] S. Agrawal, A. Farajian, P. Fernandes, R. Rei, A. F. T. Martins, Assessing the role of context in chat translation evaluation: Is context helpful and under what conditions?, Transactions of the Association for Computational Linguistics 12 (2024) 1250–1267. URL: https://aclanthology.org/2024.tacl-1.69/. doi:10.1162/tacl_a_00700.
[26] M. Freitag, N. Mathur, C.-k. Lo, E. Avramidis, R. Rei, B. Thompson, T. Kocmi, F. Blain, D. Deutsch, C. Stewart, C. Zerva, S. Castilho, A. Lavie, G. Foster, Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent, in: Proceedings of the Eighth Conference on Machine Translation, Association for Computational Linguistics, Singapore, 2023, pp. 578–628. URL: https://aclanthology.org/2023.wmt-1.51/. doi:10.18653/v1/2023.wmt-1.51.
[27] N. Moghe, A. Fazla, C. Amrhein, T. Kocmi, M. Steedman, A. Birch, R. Sennrich, L. Guillou, Machine translation meta evaluation through translation accuracy challenge sets, Computational Linguistics 51 (2025) 73–137. URL: https://aclanthology.org/2025.cl-1.4/. doi:10.1162/coli_a_00537.
[28] E. Avramidis, S. Manakhimova, V. Macketanz, S. Möller, Machine translation metrics are better in evaluating linguistic errors on LLMs than on encoder-decoder systems, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 517–528. URL: https://aclanthology.org/2024.wmt-1.37/. doi:10.18653/v1/2024.wmt-1.37.
[29] V. Zouhar, S. Ding, A. Currey, T. Badeka, J. Wang, B. Thompson, Fine-tuned machine translation metrics struggle in unseen domains, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 488–500. URL: https://aclanthology.org/2024.acl-short.45/. doi:10.18653/v1/2024.acl-short.45.
[30] A. Lommel, S. Gladkoff, A. Melby, S. E. Wright, I. Strandvik, K. Gasova, A. Vaasa, A. Benzo, R. Marazzato Sparano, M. Foresi, J. Innis, L. Han, G. Nenadic, The multi-range theory of translation quality measurement: MQM scoring models and statistical quality control, in: Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 2: Presentations), Association for Machine Translation in the Americas, Chicago, USA, 2024, pp. 75–94. URL: https://aclanthology.org/2024.amta-presentations.6/.</p>
      <p>[25] R. Rei, N. M. Guerreiro, M. Treviso, L. Coheur, A. Lavie, A. Martins, The inside story: Towards better understanding of machine translation neural evaluation metrics, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1089–1105. URL: https://aclanthology.org/2023.acl-short.94/. doi:10.18653/v1/2023.acl-short.94.
[31] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.
[32] T. Sellam, D. Das, A. Parikh, BLEURT: Learning robust metrics for text generation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7881–7892. URL: https://aclanthology.org/2020.acl-main.704/. doi:10.18653/v1/2020.acl-main.704.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Custom error weights</title>
      <p>Type of Error</p>
      <p>Penalty Weight</p>
      <p>Accuracy errors
Mistranslation:</p>
      <p>Multiword expressions
Part of Speech
Word Sense Disambiguation
Partial</p>
      <p>Semantically Unrelated
Addition
Omission
Untranslated
Mechanical
Bilingual terminology
Source error</p>
      <p>Fluency errors
Grammar:</p>
      <p>Multiword syntax
Word form
Word order
Extra words</p>
      <p>Missing words
Lexicon:</p>
      <p>Lexical choice</p>
      <p>Non-existing or Foreign Word
Orthography:</p>
      <p>Spelling
Punctuation</p>
      <p>Capitalization
Gender
Inconsistency
Coherence
Multiple fluency errors
Other
</p>
      <p>Declaration on Generative AI</p>
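To make the use of such a weighted error taxonomy concrete, a segment-level quality score can be computed by subtracting length-normalised error penalties. The weight values below are illustrative placeholders only (hypothetical, not taken from the table above), and the error tag names are simplified stand-ins for the taxonomy labels.

```python
# Illustrative penalty weights -- assumed values, not the paper's actual table.
PENALTY_WEIGHTS = {
    "omission": 1.0,
    "mistranslation": 1.0,
    "lexical choice": 0.5,
    "punctuation": 0.1,
}

def segment_score(error_tags, num_tokens, weights=PENALTY_WEIGHTS):
    """MQM-style score: start from 1.0, subtract the summed per-error
    penalties normalised by segment length, and clip to [0, 1].
    Unknown tags fall back to a default weight of 0.5."""
    penalty = sum(weights.get(tag, 0.5) for tag in error_tags)
    return max(0.0, 1.0 - penalty / max(num_tokens, 1))
```

A segment with no annotated errors scores 1.0; heavier or more numerous errors push the score toward 0.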
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Scherrer</surname>
          </string-name>
          ,
          <article-title>Natural language processing for similar languages, varieties, and dialects: A survey</article-title>
          ,
          <source>Natural Language Engineering</source>
          <volume>26</volume>
          (
          <year>2020</year>
          )
          <fpage>595</fpage>
          -
          <lpage>612</lpage>
          . doi:10.1017/S1351324920000492.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. M. I.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anastasopoulos</surname>
          </string-name>
          ,
          <article-title>CODET: A benchmark for contrastive dialectal evaluation of machine translation</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EACL</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational Linguistics, St</article-title>
          .
          <source>Julian's, Malta</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1790</fpage>
          -
          <lpage>1859</lpage>
          . URL: https://aclanthology.org/2024.findings-eacl.125/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. I.</given-names>
            <surname>Adelani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Masiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          , E. Briakou,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carpuat</surname>
          </string-name>
          , et al.,
          <article-title>AfriMTE and AfriCOMET: Enhancing COMET to embrace under-resourced African languages, in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics</article-title>
          , Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>5997</fpage>
          -
          <lpage>6023</lpage>
          . URL: https://aclanthology.org/2024.naacl-long.334/. doi:10.18653/v1/2024.naacl-long.334.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] J. Falcão, C. Borg, N. Aranberri, K. Abela, COMET for low-resource machine translation evaluation: A case study of English-Maltese and Spanish-Basque, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italy, 2024, pp. 3553–3565. URL: https://aclanthology.org/2024.lrec-main.315/.
          … in: Proceedings of the Conference of the Association for Machine Translation in the Americas: Technical Papers, Association for Machine Translation in the Americas, Cambridge, Massachusetts, USA, 2006, pp. 223–231. URL: https://aclanthology.org/2006.amta-papers.25/.
          [40] Y. Wan, D. Liu, B. Yang, H. Zhang, B. Chen, D. Wong, L. Chao, UniTE: Unified translation evaluation, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 8117–8127. URL: https://aclanthology.org/2022.acl-long.558/. doi:10.18653/v1/2022.acl-long.558.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [33] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, 2020. URL: https://arxiv.org/abs/1904.09675. arXiv:1904.09675.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [34] M. Popović, chrF: character n-gram F-score for automatic MT evaluation, in: Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 392–395. URL: https://aclanthology.org/W15-3049/. doi:10.18653/v1/W15-3049.
          [41] N. M. Guerreiro, R. Rei, D. v. Stigt, L. Coheur, P. Colombo, A. F. T. Martins, xCOMET: Transparent machine translation evaluation through fine-grained error detection, Transactions of the Association for Computational Linguistics 12 (2024) 979–995. URL: https://aclanthology.org/2024.tacl-1.54/. doi:10.1162/tacl_a_00683.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [35] R. Rei, C. Stewart, A. C. Farinha, A. Lavie, COMET: A neural framework for MT evaluation, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 2685–2702. URL: https://aclanthology.org/2020.emnlp-main.213/. doi:10.18653/v1/2020.emnlp-main.213.
          [42] V. Zouhar, P. Chen, T. K. Lam, N. Moghe, B. Haddow, Pitfalls and outlooks in using COMET, in: Proceedings of the Ninth Conference on Machine Translation, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 1272–1288. URL: https://aclanthology.org/2024.wmt-1.121/. doi:10.18653/v1/2024.wmt-1.121.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [36]
          <string-name><given-names>R.</given-names> <surname>Rei</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Treviso</surname></string-name>
          ,
          <string-name><given-names>N. M.</given-names> <surname>Guerreiro</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Zerva</surname></string-name>
          ,
          <string-name><given-names>A. C.</given-names> <surname>Farinha</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Maroti</surname></string-name>
          ,
          <string-name><given-names>J. G. C.</given-names> <surname>de Souza</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Glushkova</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Alves</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Coheur</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Lavie</surname></string-name>
          ,
          <string-name><given-names>A. F. T.</given-names> <surname>Martins</surname></string-name>
          ,
          <article-title>CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task</article-title>
          ,
          <source>in: Proceedings of the Seventh Conference on Machine Translation (WMT)</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid),
          <year>2022</year>
          , pp.
          <fpage>634</fpage>
          -
          <lpage>645</lpage>
          . URL: https://aclanthology.org/2022.wmt-1.60/.
          [43]
          <string-name><given-names>D.</given-names> <surname>Deutsch</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Foster</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Freitag</surname></string-name>
          ,
          <article-title>Ties matter: Meta-evaluating modern metrics with pairwise accuracy and tie calibration</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>12914</fpage>
          -
          <lpage>12929</lpage>
          . URL: https://aclanthology.org/2023.emnlp-main.798/. doi:10.18653/v1/2023.emnlp-main.798.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [37]
          <string-name><given-names>R.</given-names> <surname>Rei</surname></string-name>
          ,
          <string-name><given-names>N. M.</given-names> <surname>Guerreiro</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Pombal</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>van Stigt</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Treviso</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Coheur</surname></string-name>
          ,
          <string-name><given-names>J. G. C.</given-names> <surname>de Souza</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>
          ,
          <article-title>Scaling up CometKiwi: Unbabel-IST 2023 submission for the quality estimation shared task</article-title>
          ,
          <source>in: Proceedings of the Eighth Conference on Machine Translation</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>841</fpage>
          -
          <lpage>848</lpage>
          . URL: https://aclanthology.org/2023.wmt-1.73/. doi:10.18653/v1/2023.wmt-1.73.
          [44]
          <string-name><given-names>S.</given-names> <surname>Agrawal</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Farinhas</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Rei</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>
          ,
          <article-title>Can automatic metrics assess high-quality translations?</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>14491</fpage>
          -
          <lpage>14502</lpage>
          . URL: https://aclanthology.org/2024.emnlp-main.802/. doi:10.18653/v1/2024.emnlp-main.802.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Juraska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Finkelstein</surname>
          </string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Freitag</surname></string-name>
          ,
          <article-title>MetricX-24: The Google submission to the WMT 2024 metrics shared task</article-title>
          ,
          <source>in: Proceedings of the Ninth Conference on Machine Translation</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>492</fpage>
          -
          <lpage>504</lpage>
          . URL: https://aclanthology.org/2024.wmt-1.35/. doi:10.18653/v1/2024.wmt-1.35.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>M.</given-names>
            <surname>Snover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dorr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Micciulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Makhoul</surname>
          </string-name>
          ,
          <article-title>A study of translation edit rate with targeted human annotation</article-title>
          ,
          <source>in: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>