<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An experiment in error analysis of real-time speech machine translation using the example of the European Parliament's Innovation Partnership⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elisa Di Nuovo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Turin</institution>
          ,
          <addr-line>via Giuseppe Verdi, 8, 10124 Torino (TO) -</addr-line>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, technological progress has made Machine Translation (MT) a reality. Significant improvements have been obtained using deep learning models as opposed to rule-based and statistical MT models. Human evaluation, however, remains under-explored. In 2019 the European Parliament (EP) started an innovation partnership with commercial operators, with the purpose of developing a tool exploiting state-of-the-art, real-time Automatic Speech Recognition (ASR) and MT technologies to make parliamentary plenary sessions accessible to D/deaf and hard of hearing persons. In this paper, we present a quantitative and qualitative error analysis carried out on a test set consisting of 78 short speeches delivered by Members of the EP in 19 languages deployed in the EP prototype by November 2022. The taxonomy used for ASR and MT is adapted from the Multidimensional Quality Metrics framework. Results show that sentence segmentation, which automatic metrics do not consider, is the biggest issue in the ASR output and often affects the MT output.</p>
      </abstract>
      <kwd-group>
        <kwd>Real-time speech machine translation</kwd>
        <kwd>Error analysis</kwd>
        <kwd>Human evaluation</kwd>
        <kwd>Cascade system</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the landscape of language translation has been fundamentally transformed by remarkable technological advancements. Machine Translation (MT), once an ambitious aspiration, has now become a tangible reality. This transformation has been primarily fueled by the advent of deep learning models, specifically neural machine translation and transformers. These cutting-edge models have ushered in a new era of translation, eclipsing the limitations of traditional rule-based and statistical MT methods. Deep learning models use the mechanism called attention to improve performance [1] and have usually been evaluated on offline written translation tasks involving a few language pairs [2].</p>
      <p>Very recently, research has also turned to speech machine translation, tackled as a concatenation of Automatic Speech Recognition (ASR) and MT, or as an end-to-end task (i.e. direct translation of speech in language A into text in language B).<sup>1</sup> In the last evaluation campaign of the 19th International Conference on Spoken Language Translation (IWSLT 2022) [5], one of the eight shared tasks focused on real-time speech translation, addressed as translation of ASR output or directly from the audio source and involving English to German, English to Japanese and English to Mandarin Chinese. A novelty of that campaign is the addition of manual evaluation of real-time outputs.</p>
      <p>Like many natural language processing tasks, MT is difficult to evaluate. One of the reasons for this is the non-deterministic nature of translation, i.e. there is more than one correct way to translate from one language into another. Evaluation in shared tasks is usually carried out by means of automatic metrics, BLEU (Bilingual Evaluation Understudy) [6] being the standard for MT evaluation. This metric tries to overcome the non-deterministic nature of translation using multiple references. However, automatic metrics have several limitations [7].<sup>2</sup> On the other hand, human evaluation, if carried out using fine-grained guidelines to limit subjectivity, can give a clearer indication of the MT output quality. However, being resource expensive (i.e. skilled evaluators are hard to find and have a high cost), it has been used to a limited extent and in small studies. To avoid these limitations, crowdsourced annotators have been used. Unfortunately, crowdsourced annotators are frequently inexperienced. As [10] affirm, crowdsourced human evaluation can be used when MT quality is poor, because it can still provide a useful indication; but, as quality improves, it becomes unfit and leads to erroneous claims.<sup>3</sup></p>
      <p>In 2019 the European Parliament (EP) started an Innovation Partnership with commercial operators, with the purpose of developing a tool that can perform real-time ASR and MT from and into all the 24 official languages of the European Union (EU).<sup>4</sup> This partnership has the aim of making parliamentary plenary sessions accessible in near-real time to D/deaf and hard of hearing persons.<sup>5</sup> The challenges faced by this project are manifold: the high degree of multilingualism, which is highly ambitious considering the technical limits of current MT, particularly in a number of low-resource languages; the presence of non-native accents; the large variety of vocabulary and EU jargon required by the numerous specific domains tackled in the plenaries; the low latency constraints needed to have transcriptions and translations in near-real time; and the required high quality of the output. By November 2022, 19 language models had been developed and made available via a demo interface. These languages are English (EN), French (FR), German (DE), Spanish (ES), Italian (IT), Polish (PL), Greek (EL), Romanian (RO), Dutch (NL), Portuguese (PT), Bulgarian (BG), Czech (CS), Slovak (SK), Croatian (HR), Lithuanian (LT), Finnish (FI), Hungarian (HU), Swedish (SV) and Slovenian (SL).</p>
      <p>In this paper, we present a quantitative and qualitative study (using the Word Error Rate (WER) metric [14] and manual human evaluation) on a test set consisting of short speeches delivered by members of the EP in the 19 languages already deployed in the prototype.<sup>6</sup> The aim of this study is to evaluate the quality of both ASR and MT output and to reflect on the different insights into the same text given by different annotators. The manual human evaluation is based on an error taxonomy adapted from the Multidimensional Quality Metrics (MQM) framework<sup>7</sup> and applied to part of the test set covering 6 languages (EN, FR, ES, IT, RO, DE). The MQM framework, developed in the EU QTLaunchPad and QT21 projects, provides a hierarchy of translation errors that can be adapted according to the application. We devised our taxonomy consisting of different error categories for ASR and MT and 3 severity levels (i.e. neutral, minor and major). We decided to exclude critical errors as in [10]. The remainder of the paper is organised as follows: in Section 2 we describe the methodology applied for automatic and human evaluation; in Section 3 we report the results, quantitatively and qualitatively analysed per language and per annotator; Section 4 concludes the paper.</p>
      <p>CLiC-it 2023: 9th Italian Conference on Computational Linguistics, Nov 30 – Dec 02, 2023, Venice, Italy</p>
      <p>⋆ This study and paper was written while the author was working for the European Parliament, in the Unit in charge of the administration and evaluation of the prototype. This is not the official evaluation methodology employed by the European Parliament for evaluation.</p>
      <p>* The author, as of 1 October 2023, is employed by the Joint Research Centre, European Commission, Ispra (VA), Italy.</p>
      <p>elisa.dinuovo@unito.it (E. D. Nuovo); ORCID 0000-0002-4814-982X (E. D. Nuovo)</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p><sup>1</sup> See for example [3, 4] for a comparison between cascade and end-to-end systems.</p>
      <p><sup>2</sup> See Moorkens et al. [8] on translation quality assessment and Chatzikoumi [9] for a comprehensive review of automatic and human MT evaluation.</p>
      <p><sup>3</sup> One (in)famous claim is that MT has achieved human parity [11, 12, 13].</p>
      <p><sup>4</sup> Specifications of the Innovation Partnership are available here: https://etendering.ted.europa.eu/cft/cft-document.html?docId=58722. All links were last accessed on 13/05/2023.</p>
      <p><sup>5</sup> Deaf with a capital D denotes individuals who are culturally and linguistically Deaf, often due to congenital deafness or early-life hearing loss. They identify with the Deaf community, characterized by its unique culture, sign languages, and traditions. In contrast, deaf (with a lowercase d) is a general term referring to individuals with a hearing impairment, irrespective of their cultural identification or community affiliation. It describes the audiological condition of partial or complete hearing loss, without specifying sign language usage or cultural ties.</p>
      <p><sup>6</sup> This study and paper was written while the author was working for the European Parliament Unit in charge of the prototype management and evaluation. This is not the official evaluation methodology employed by the Parliament to evaluate the prototype.</p>
      <p><sup>7</sup> MQM website available here: https://themqm.org/.</p>
      <sec>
        <title>2. Methodology description</title>
        <p>We evaluated ASR using both an automatic metric (WER) and human evaluation, and MT relying on human evaluation only. Human evaluation for both ASR and MT is carried out under the MQM framework. We describe the procedures and the experimental setup in the subsequent sections.</p>
        <sec>
          <title>2.1. Automatic evaluation</title>
          <p>Automatic evaluation was used only for ASR. The metric used is WER.<sup>8</sup> The test set consists of 92 short speeches (minimum = 01:01; maximum = 05:10; average = 01:39; standard deviation = 00:39) delivered in the March and May 2022 plenaries by members of the EP. The speeches are in the 19 languages deployed in the tool by November 2022. See Table 1 for more details. Languages are ordered according to deployment in the tool.<sup>9</sup></p>
          <p>The gold standard of this test set is made of the verbatim transcription of the speeches (often referred to by its French abbreviation, CRE, Compte Rendu d’Evènement), manually corrected from the published report available on the website of the EP.<sup>10</sup> The corrections are performed by two native speakers per language, and a third annotator is involved to resolve disagreements.</p>
          <p><sup>8</sup> Script written by Dr. Claudio Fantinuoli available here: https://github.com/fantinuoli/WERvisual/blob/main/wer.py.</p>
          <p><sup>9</sup> During Stage 1 of the project 10 language models were deployed; during Stage 2, 9 other language models were added. This order is maintained in Table 1. Stage 1 models were trained during 2020-2021, Stage 2 languages during 2021-2022. Stage 1 ASR models have been updated in August 2022. Both ASR and MT models are developed by Cedat85 consortium.</p>
          <p><sup>10</sup> Each parliamentary sitting is publicly available and the CRE and videos in the original language are available in the EP website: https://www.europarl.europa.eu/plenary/en/debates-video.html.</p>
        </sec>
        <sec>
          <title>2.2. Human evaluation</title>
          <p>The error taxonomy is adapted from the MQM framework and includes different categories for ASR and MT, sharing the same severity scale, i.e. neutral, minor, major (see Figure 1 in Appendix A for the decision tree).</p>
        </sec>
      </sec>
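      <p>As an illustration of the WER metric used for the automatic evaluation in Section 2.1, the following is a minimal sketch of a word-level edit distance divided by the reference length. It is not the script cited in the footnote: the function names and the normalisation step (lowercasing and dropping punctuation) are assumptions made for this example, chosen so that punctuation, and therefore sentence segmentation, does not contribute to the score.</p>
      <preformat preformat-type="code">
```python
import re


def normalise(text: str) -> list[str]:
    """Lowercase, drop punctuation and split on whitespace (assumed normalisation)."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by the reference length."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution and one deletion over four reference words.
print(wer("a b c d", "a x c"))
# An over-segmented transcript differs only in punctuation and casing.
print(wer("Acest lucru necesită un răspuns rapid și unit.",
          "Acest lucru necesită un răspuns. Rapid și unit,"))
```
      </preformat>
      <p>Under this assumed normalisation the second call scores the over-segmented transcript as error-free, which is one way to see why segmentation problems can go unnoticed when only WER is reported.</p>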
      <sec id="sec-1-1">
        <title>Language</title>
        <p>Table 1 (only the Language column is preserved): EN, FR, DE, ES, IT, RO, PL, EL, NL, PT, BG, CS, SK, SL, HR, LT, FI, HU, SV; Total.</p>
        <sec id="sec-1-1-1">
          <p>In Table 2 we report the annotators’ self-reported knowledge of the languages according to the CEFR levels.<sup>11</sup></p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Annotator</title>
        <p>Table 2 (only the Annotator column is preserved): Ann A, Ann B, Ann C, Ann D.</p>
        <p>The error categories used for ASR error annotation are over-segmentation, under-segmentation, lexical substitution, lexical deletion, lexical addition, morpho-syntactic errors (e.g. number agreement, part of speech substitution) and terminology (e.g. named entities and terms). The categories used for MT error annotation are accuracy (e.g. meaning is not rendered in its entirety), punctuation, grammar, register (formality, gender-marked pronouns), terminology (including the presence of non-words, spelling errors or incorrect terms), other and unintelligible. Unintelligible is used to mark segments containing more than 5 major errors [10]. Other should be used in the rare cases in which none of the existing error categories applies. Neutral errors weigh 0 points, minor errors 1 and major errors 5, except for unintelligible, which weighs 5 if minor and 25 if major. These weights are similar to those used in [10].</p>
        <p>We involved four annotators with different backgrounds and knowledge of the languages. All received the annotation guidelines and a training session; after a few annotations, a further meeting was scheduled to clear doubts. For reference, we call them annotator A, B, C and D (henceforth, Ann for annotator). Ann A has a background in Translation studies and is an experienced translator at the EP. Ann B was a trainee at the EP with a master’s degree in Translation and previous experience with the MQM framework. Ann C was a trainee at the EP with a master’s degree in Translation and no previous experience of ASR and MT evaluation. Ann D is a communications assistant at the EP with a background in interpretation, with no experience of the MQM framework but with experience of ASR and MT evaluation.</p>
        <p>The annotated test set consists of 48 documents in 6 languages (EN, IT, FR, ES, RO and DE): 18 automatic transcriptions (3 speeches per language, with an identification number from 1 to 6) and 30 translations (from and into the 6 above-mentioned languages). In Table 3 we report the evaluated task and the languages involved, the number of speeches (with an identification number in brackets, so that they can be identified when used as source and target and in the automatic evaluation results reported in Table 4), and the annotators providing the annotations.</p>
        <p><sup>11</sup> For more details about the CEFR levels see the website: https://www.coe.int/en/web/common-european-framework-reference-languages/level-descriptions.</p>
        <sec>
          <title>3. Results</title>
          <sec>
            <title>3.1. Automatic evaluation</title>
            <p>ASR was evaluated in two different scenarios: first, in sessions with more than one speech but all in the same language; second, in sessions with more than one speech, each in a different language. This is possible because the tool has a feature called Language Identification (LID), which is used to identify the language spoken and subsequently transcribe the audio in the identified language. WER results (computed per session) are reported in Table 4. In the table body, from row 2 to 7, we report the WER obtained in the speeches also undergoing human evaluation. This is the reason why we have multiple rows for the same language (e.g. EN LID on with id code</p>
          </sec>
        </sec>
        <sec id="sec-1-2-1">
          <title>3.2. Human evaluation</title>
          <p>We investigated manual annotation quantitatively and qualitatively. Quantitative evaluation is based on an average score per document and annotator. Qualitative evaluation takes error categories and severities into account.</p>
        </sec>
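        <p>The severity weights given in Section 2.2 (neutral 0, minor 1, major 5; unintelligible 5 if minor and 25 if major) and the per-document average used in the quantitative evaluation can be sketched as follows. This is only an illustration under stated assumptions: the paper does not spell out how the penalties of a segment are combined, so summing them is an assumption here, as are all function and variable names and the treatment of a neutral unintelligible error as 0 points. Higher scores mean more, or more severe, annotated errors.</p>
        <preformat preformat-type="code">
```python
# Severity weights as stated in the text; unintelligible errors weigh five times more.
SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5}
UNINTELLIGIBLE_FACTOR = 5


def error_weight(category: str, severity: str) -> int:
    """Return the penalty points for one annotated error."""
    weight = SEVERITY_WEIGHTS[severity]
    if category == "unintelligible":
        return weight * UNINTELLIGIBLE_FACTOR
    return weight


def segment_score(errors: list[tuple[str, str]]) -> int:
    """Sum the penalties of the (category, severity) pairs of one segment (assumed aggregation)."""
    return sum(error_weight(category, severity) for category, severity in errors)


def document_score(segments: list[list[tuple[str, str]]]) -> float:
    """Average the segment-level scores of one document annotated by one annotator."""
    if not segments:
        raise ValueError("a document needs at least one segment")
    return sum(segment_score(errors) for errors in segments) / len(segments)


# Hypothetical annotation of a three-segment document by one annotator.
annotation = [
    [("over-segmentation", "minor"), ("lexical substitution", "major")],
    [],
    [("unintelligible", "major")],
]
print([segment_score(errors) for errors in annotation])
print(document_score(annotation))
```
        </preformat>
        <p>Computing <monospace>document_score</monospace> separately for each annotator of the same document gives the per-annotator values that the quantitative evaluation compares.</p>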
      </sec>
      <sec id="sec-1-3">
        <title>3.2.1. Quantitative evaluation</title>
        <p>3, indicating the speeches subjected to human evaluation, and then again EN LID on, not subjected to human evaluation).</p>
        <p>Table 3: Human evaluation test set.</p>
        <table-wrap>
          <label>Table 4</label>
          <caption>
            <p>ASR: Averaged WER results. Source id in brackets links the speeches with those in Table 3.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Language (source id)</th><th>LID</th><th>WER</th></tr>
            </thead>
            <tbody>
              <tr><td>All 19, 1 speech each</td><td>On</td><td>6.45</td></tr>
              <tr><td>RO (1)</td><td>On</td><td>2.77</td></tr>
              <tr><td>IT (2)</td><td>On</td><td>3.22</td></tr>
              <tr><td>EN (3) / ES (4)</td><td>On / On</td><td>48..9948 (two overlapping values in the source)</td></tr>
              <tr><td>FR (5)</td><td>On</td><td>8.91</td></tr>
              <tr><td>DE (6)</td><td>On</td><td>7.81</td></tr>
              <tr><td>EN</td><td>Off</td><td>5.25</td></tr>
              <tr><td>EN</td><td>On</td><td>5.48</td></tr>
              <tr><td>IT</td><td>Off</td><td>5.58</td></tr>
              <tr><td>BG</td><td>Off</td><td>5.83</td></tr>
              <tr><td>PL</td><td>Off</td><td>5.05</td></tr>
              <tr><td>PL</td><td>On</td><td>7.80</td></tr>
              <tr><td>HU</td><td>Off</td><td>9.18</td></tr>
              <tr><td>CS / HU</td><td>OOfn</td><td>94..5073 (two overlapping values in the source)</td></tr>
              <tr><td>SK</td><td>Off</td><td>2.52</td></tr>
              <tr><td>SL</td><td>Off</td><td>5.02</td></tr>
              <tr><td>HR</td><td>Off</td><td>5.63</td></tr>
              <tr><td>LT</td><td>Off</td><td>11.14</td></tr>
              <tr><td>FI</td><td>Off</td><td>5.48</td></tr>
              <tr><td>SV</td><td>Off</td><td>10.78</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The results show that LID does not have a big impact on WER (e.g. EN, HU), except for PL (almost 3% WER difference), but the main difference in WER is due to different speeches (e.g. IT, in which the 3 speeches with LID on have a lower WER than the 3 with LID off, or EN</p>
        <p>Same speeches, different annotators. For each annotator, we calculated a score per document by averaging the segment-level scores. Results are shown in Figures 2–6 in Appendix A. In general, ASR output received higher scores than expected, especially in languages in which WER is lower than 5% (i.e. RO, IT, ES). This may be because the WER metric does not take punctuation into account, so over- and under-segmentation issues are not counted. Also, in the WER calculation all errors have the same weight (e.g. a missing negation that completely changes the meaning of the sentence has the same weight as any other missing token). MT output received lower scores than ASR output. This might mean that some errors in ASR are well handled in translation. When both annotators are native speakers of the target language, their scores are more similar. This is the case of Ann A and B in EN-RO (Figure 2), Ann B and C in EN-IT (Figure 4), and Ann B and C in ES-IT (Figure 5). The same applies to Ann C and D (Figure 6) in the annotation of FR-IT MT, although Ann D displays a different annotation behaviour from Ann B and C. In fact, Ann D tends to annotate fewer errors. This could be influenced by their different backgrounds (interpreter vs. translator).</p>
        <p>The annotation scores are the most similar when the annotators are native speakers of the target language, as in the annotation of IT-EN MT (Figure 3) and EN-FR MT (Figure 6). However, monolingual annotators (Ann A and C) show more severity in MT judgement into their native language (Figure 2 RO-IT MT and Figure 4 EN-IT MT) when compared with our IT-RO bilingual annotator (Ann B). This seems in line with research on bilingualism and acceptability, where results show that “bilinguals do not reject ungrammatical items with the same certainty as monolinguals” [15].</p>
        <p>Averaging the scores attributed by the two annotators (ASR and translation into IT using the ASR output as source, except for IT, which is translated into EN), we obtain the following order (from the presumably best output to the worst): ES (average = 17.3), IT (average = 24.5), FR (average = 24.7), EN (average = 25.3), RO (average = 40.8). The order considering WER would be RO, IT, ES, FR, EN.</p>
        <p>Same annotator, different speeches. Here we compare the annotations carried out by Ann B and C. We selected these two annotators because they performed the majority of the annotation task, so it is possible to compare their results in different languages. We report their scores in Figures 7–8, respectively (Appendix A).</p>
        <p>According to Ann B (Figure 7), we can order the languages from the best output to the worst: ES (average = 14.33), IT (average = 27.00), DE (average = 29.44), EN (average = 36.67) and RO (average = 46.33). According to Ann C (Figure 8), the order would be: ES (average = 13.67), IT (average = 18.50), EN (average = 26.33) and FR (average = 27.50).</p>
        <p><sup>12</sup> Please note that this could be due to pure chance and since the test set is small, we do not report statistical tests.</p>
        <sec id="sec-1-3-1">
          <title>3.2.2. Qualitative evaluation</title>
          <p>Each annotator draws a different picture of each text, be it the product of ASR or MT. As far as ASR output is concerned, we report the results in Figures 9–13 in Appendix A.</p>
          <p>Although we did not put major emphasis on over- and under-segmentation errors during training, as they were considered to be straightforward (at least in their identification), the disagreement in annotations suggests the contrary. In fact, different annotators draw opposite pictures of their presence and importance. For example, in Figure 9, we can notice that Ann A weighs over-segmentation more than under-segmentation errors in RO transcription, while Ann B does the opposite. The system’s results on punctuation marking, and on full stop identification in particular, seem to be below state-of-the-art performance [16]. Ann C (Figures 10–13) seems to be more severe about morpho-syntactic errors in ASR. The same errors are annotated as lexical substitutions by the other annotators, as in Example 1.</p>
          <p>(1) REF: Protéger les citoyens de la haine en ligne, voici un bel usage des technologies les plus avancées. Et voici aussi un usage très approprié de l’Union européenne.</p>
          <p>“Protecting citizens from online hate, here is a good use of the most advanced technologies. And here a very appropriate use by the European Union.”</p>
          <p>ASR: Protéger les citoyens de la haine en ligne. Voici un bel usage. Les technologies les plus avancées. En voici aussi un usage très approprié de l’Union européenne.</p>
          <p>“Protecting citizens from online hate. Here is a good use. The most advanced technologies. Here also a very appropriate use by the European Union.”</p>
          <p>FR-IT: Proteggere i cittadini dall’odio online. Qui è un buon uso. Le tecnologie più avanzate. Anche questo è un uso molto appropriato da parte dell’Unione europea.</p>
          <p>“Protecting citizens from online hate. Here is a good use. The most advanced technologies. This is also a very appropriate use by the European Union.”</p>
          <p>Ann C marked the errors in bold in Example 1 as morpho-syntactic errors of a major nature, Ann D as lexical substitution of a minor nature. This is a blurry area, if you consider that both are function words and in other languages could be rendered morphologically. We think that, in a multilingual perspective, these should be treated as morphological, being function words. However, they are probably not major errors, as they do not affect a main idea of the speech (decision tree in Figure 1).</p>
        </sec>
        <sec id="sec-1-3-2">
          <p>As far as MT output is concerned, we report the results in Figures 14–20 in Appendix A. We notice the inappropriate use of the unintelligible category. Unintelligible should mark segments in which it is impossible to understand the message and to identify all the errors that led to the incomprehensible segment. The fact that, on the same set, some annotators used it and others did not is a clear sign of misunderstanding (Figures 15, 17, 18 and 19). In fact, in Example 2, unintelligible is used in a segment in which it is possible to understand the meaning, although there is a minor grammatical error (attaccano ‘they attack’) and a minor accuracy error (relative clause instead of adverbial clause, che ‘that’ substituting per ‘to’).</p>
          <p>(2) REF: [...] state precum Federația Rusă utilizează instrumentele moderne pentru a ataca state, pentru a ataca entități, pentru a pune în pericol democrația europeană, acest lucru necesită un răspuns rapid și unit.</p>
          <p>“[...] countries like the Russian Federation use modern tools to attack states, to attack entities, to endanger European democracy, this requires a rapid and united response.”</p>
          <p>ASR: State precum Federația Rusă utilizează instrumentele moderne pentru a. Ataca state pentru a ataca entități pentru a pune în pericol democrația europeană. Acest lucru necesită un răspuns. Rapid și unit, [...]</p>
          <p>FR-IT: Paesi come la Federazione Russa usano strumenti moderni per Attaccano gli Stati per attaccare entità che mettono in pericolo la democrazia europea. Ciò richiede una risposta. Veloce e unito,</p>
          <p>In Example 2 we also notice over-segmentation errors in the ASR transcription cascading into MT (Acest lucru necesită un răspuns. Rapid și unit). In addition, it seems that Ann B, and in other examples Ann C, annotated the output as if it were a written text and not an oral text transposed into writing. Thus, the reference text is only one of the possible transpositions. This is evident when looking at punctuation. In Example 2, in fact, Ann B not only marked the over-segmentation error dividing the noun răspuns from its modifiers (rapid și unit), but also another over-segmentation error (although marked as minor) because in the reference this sentence is joined to the preceding one with a comma and not divided by a full stop. However, it must be noted that a full stop there is perfectly acceptable.</p>
          <p>Unintelligible errors were also marked when the other annotator only noticed punctuation issues, as shown in Example 3.</p>
          <p>(3) REF: Putin has thrown the world and Europe back to a time we had hoped never to experience again. A crisis of such dignity shows our true colours – if we are on the right side of history or choose the [path] path of destruction.</p>
          <p>ASR: Putin has thrown the world and Europe back to a time, we would hope to never experienced again crisis of such dignity shows our true colours if we are on the right side of history or choose the path path of destruction.</p>
          <p>EN-FR: Poutine a renvoyé le monde et l’Europe à une époque, nous espérons ne plus avoir connu de crise de cette dignité montre nos vraies couleurs si nous sommes du bon côté de l’histoire ou si nous choisissons le chemin de la destruction.</p>
          <p>An actual unintelligible error is instead reported in Example 4. LID errors in this case caused unintelligibility in the translation because the same portions of audio were transcribed in different languages (transcribed as IT, PL, and CS). Perhaps including the information about the source language in the translated output could be useful to reduce the impact that LID errors like these have on MT understanding. Morpho-syntactic errors annotated in the ASR are frequently correct in the MT output. Over-segmentation, instead, in particular when it involves a full stop, remains unchanged in the MT output, as MT models usually mirror the punctuation of the source text.</p>
          <p>(4) REF: [...] Ceux qui ont harcelé et appelé au meurtre sur Internet Samuel Paty, sont-ils, étaient-ils, des vecteurs de liberté d’expression? Poser la question, c’est déjà y apporter une réponse.</p>
          <p>“Were those who harassed and called for the murder of Samuel Paty on the Internet vectors of freedom of expression? To ask the question is to answer it.”</p>
          <p>AneStR. :JCseemux jqeujipoenttich.arEctealép.pSeuléInautemrneeutr.tIrnetseurr- Internet, Samuel Paty. Sont-ils, étaient-ils des vecteurs de liberté d’expression. Poser la question c’est déjà y apporter une réponse.</p>
          <p>EN-IT: Coloro che hanno molestato. Su internet. Internet. Sono una petizione. E ha chiesto omicidio su Internet, Samuel Paty. Sono loro, erano vettori della libertà di espressione. Fare la domanda è già fornire una risposta.</p>
          <p>“Those who harassed. On the Internet. Internet. They are a petition. And called for murder on the internet, Samuel Paty. It’s them, they were vectors of freedom of expression. To ask the question is already to provide an answer.”</p>
        </sec>
        <sec id="sec-1-3-3">
          <title>4. Conclusion</title>
          <p>We presented a quantitative and qualitative evaluation of the tool that has been developed in the context of the EP’s Innovation Partnership. We used the WER score and manual human evaluation to evaluate the quality of ASR, and only human evaluation for MT quality. The average WER is 6.43% in the multilingual test set made of 19 languages deployed by November 2022, which is very low, but it does not take into account segmentation issues. Human evaluation highlighted the need for refining sentence segmentation, especially in languages in which the WER was very low (e.g. RO and IT). This could indicate that WER by itself is not enough to have a clear picture of the quality of the transcription. However, human evaluation remains a highly subjective task, and this concerns all categories, including those considered clear-cut (e.g. sentence segmentation). The annotators’ background has an influence on error severity perception and error identification, and should be investigated in detail. In line with the findings in [17], we also found that annotators’ sensitivity in deepening the error annotation is a main cause of disagreement, in this case due to the attempt to annotate also the consequences of the error. Quantitative results of human evaluation considering the ASR output and its translation into IT (except for IT translated into EN) indicate ES as the qualitatively best output, followed by IT, FR and EN, and RO as the worst output. In general, annotators rated ASR output worse than the MT output. However, this might be a consequence of annotators putting too much emphasis on the provided reference transcription of the speech, not considering that, especially where punctuation is concerned, it is only one of the possible accepted transpositions. Qualitative results highlighted that different annotators draw different pictures of the same speeches and that a second round of annotations would be necessary to reduce disagreement and to clarify the use of error categories, like unintelligible, that were frequently applied improperly.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgments</title>
      <sec id="sec-2-1">
        <p>I want to thank the European Parliament for giving me the opportunity to conduct this study and the annotators for participating and authorising me to use their annotations for research purposes. I thank the anonymous reviewers for their valuable comments, and I apologise if not all of them have been addressed.</p>
        <p>Resources Association (ELRA), Athens, Greece, 2000. URL: http://www.lrec-conf.org/proceedings/lrec2000/pdf/278.pdf.</p>
        <p>[15] J. C. López Otero, On the acceptability of the Spanish DOM among Romanian-Spanish bilinguals, in: A. Mardale, S. Montrul (Eds.), The Acquisition of Differential Object Marking, Trends in Language Acquisition Research, John Benjamins Publishing Company, 2020, pp. 161–181.</p>
        <p>[16] O. Guhr, A.-K. Schumann, F. Bahrmann, H.-J. Böhme, FullStop: Multilingual Deep Models for Punctuation Prediction, in: Swiss Text Analytics Conference, 2021.</p>
        <p>[17] E. Di Nuovo, Introducing VALICO-UD: a parallel, learner Italian treebank for language learning research, Pàtron Editore, 2023.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A. Figures</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          v1/2022.iwslt-1.10.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [6]
          <string-name><given-names>K.</given-names> <surname>Papineni</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Roukos</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Ward</surname></string-name>,
          <string-name><given-names>W.-J.</given-names> <surname>Zhu</surname></string-name>,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>,
          <source>in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>,
          Association for Computational Linguistics, Philadelphia, Pennsylvania, USA,
          <year>2002</year>, pp. <fpage>311</fpage>-<lpage>318</lpage>. URL: https://aclanthology.org/P02-1040.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [1]
          <string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name>,
          <string-name><given-names>A. N.</given-names> <surname>Gomez</surname></string-name>,
          <string-name><given-names>Ł.</given-names> <surname>Kaiser</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>,
          <article-title>Attention is all you need</article-title>,
          <source>in: 31st Conference on Neural Information Processing Systems (NIPS 2017)</source>,
          California, USA, <year>2017</year>. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. doi:10.3115/1073083</source>
          .1073135. [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Federmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          , [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dorr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Olive</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McCary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christianson</surname>
          </string-name>
          , Ma-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>chyna</surname>
          </string-name>
          ,
          <source>Findings of the 2014 workshop on statis- book of Natural Language Processing and Machine</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>tical machine translation</article-title>
          ,
          <source>in: Proceedings of the Translation</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>745</fpage>
          -
          <lpage>843</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          Ninth Workshop on Statistical Machine Transla- [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Moorkens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Castilho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gaspari</surname>
          </string-name>
          , S. Doherty,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Baltimore</surname>
          </string-name>
          , Maryland, USA,
          <year>2014</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>58</lpage>
          . URL: tion: Technologies and applications, Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          https://aclanthology.org/W14-3302. doi:
          <volume>10</volume>
          .3115/ [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Chatzikoumi</surname>
          </string-name>
          , How to evaluate
          <source>machine trans-</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          v1/
          <fpage>W14</fpage>
          -3302.
          <article-title>lation: A review of automated and human metrics</article-title>
          , [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bentivogli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cettolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karakanta</surname>
          </string-name>
          ,
          <source>Natural Language Engineering</source>
          <volume>26</volume>
          (
          <year>2020</year>
          )
          <fpage>137</fpage>
          -
          <lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchi</surname>
          </string-name>
          , Cascade ver- [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ratnakar</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>59th Annual Meeting of the Association for Com- Machine Translation</article-title>
          , in: Transactions of the As-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>putational Linguistics and the 11th International sociation for Computational Linguistics</article-title>
          , volume
          <volume>9</volume>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>Joint Conference on Natural Language Processing</source>
          <year>2021</year>
          , pp.
          <fpage>1460</fpage>
          -
          <lpage>1474</lpage>
          . URL: https://doi.org/10.1162/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computa- tacl_a_00437</source>
          . doi:
          <volume>10</volume>
          .1162/tacl_a_
          <fpage>00437</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>tional Linguistics</surname>
          </string-name>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2873</fpage>
          -
          <lpage>2887</lpage>
          . URL: [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chowdhary</surname>
          </string-name>
          , J. Clark,
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          https://aclanthology.org/
          <year>2021</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>224</volume>
          . doi:10.
          <string-name>
            <surname>C. Federmann</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Junczys-Dowmunt</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2021</year>
          .
          <article-title>acl-long</article-title>
          .224.
          <string-name>
            <given-names>W.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          , T.-Y. Liu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Menezes</surname>
          </string-name>
          , [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sperber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paulik</surname>
          </string-name>
          ,
          <article-title>Speech translation and the T</article-title>
          . Qin,
          <string-name>
            <given-names>F.</given-names>
            <surname>Seide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          , S. Wu,
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          are,
          <source>in: Proceedings of the 58th Annual Meet- Human Parity on Automatic Chinese to English</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>ing of the Association for Computational Linguis</article-title>
          - News
          <string-name>
            <surname>Translation</surname>
          </string-name>
          ,
          <year>2018</year>
          . arXiv:
          <year>1803</year>
          .05567.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
            , Association for Computational Linguistics, On- [12]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Toral</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Castilho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Way</surname>
          </string-name>
          , Attaining the
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>line</surname>
          </string-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>7409</fpage>
          -
          <lpage>7421</lpage>
          . URL: https://aclanthology. unattainable
          <article-title>? reassessing claims of human par-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          org/
          <year>2020</year>
          .acl-main.
          <volume>661</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .
          <article-title>ity in neural machine translation</article-title>
          , in: Proceed-
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>acl-main.661. ings of the Third Conference on Machine Trans</source>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Anastasopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          , L. Bentivogli, lation: Research Papers, Association for Compu-
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>M. Zanon Boito</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Bojar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Cattoni</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Cur- tational
          <string-name>
            <surname>Linguistics</surname>
          </string-name>
          , Brussels, Belgium,
          <year>2018</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>rey</surname>
            , G. Dinu,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Duh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Elbayad</surname>
          </string-name>
          , C. Emmanuel,
          <volume>113</volume>
          -
          <fpage>123</fpage>
          . URL: https://aclanthology.org/W18-6312.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Estève</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Federico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Federmann</surname>
          </string-name>
          , S. Gahbiche, doi:10.18653/v1/
          <fpage>W18</fpage>
          -6312.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>H.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grundkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
            Ja- [13]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Läubli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Sennrich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Volk</surname>
          </string-name>
          , Has machine trans-
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>McNamee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murray</surname>
          </string-name>
          , M. Naˇdejde, S. Nakamura, level evaluation,
          <source>in: Proceedings of the 2018</source>
          Con-
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>doh</surname>
            , M. Turchi,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Virkar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Waibel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , tics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>4791</fpage>
          -
          <lpage>4796</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <source>Findings of the IWSLT</source>
          <year>2022</year>
          https://aclanthology.org/D18-1512. doi:
          <volume>10</volume>
          .18653/
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <article-title>evaluation campaign</article-title>
          ,
          <source>in: Proceedings of the v1/D18-1512.</source>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          19th International Conference on Spoken Lan- [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nießen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Och</surname>
          </string-name>
          , G. Leusch,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          , An evalua-
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <surname>guage Translation (IWSLT</surname>
          </string-name>
          <year>2022</year>
          ),
          <article-title>Association for tion tool for machine translation: Fast evaluation</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <source>person and online)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>98</fpage>
          -
          <lpage>157</lpage>
          . URL: https: International Conference on Language Resources
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          //aclanthology.org/
          <year>2022</year>
          .iwslt-
          <volume>1</volume>
          .10. doi:
          <volume>10</volume>
          .18653/ and Evaluation (LREC'00), European Language
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>