
An experiment in error analysis of real-time speech machine translation using the example of the European Parliament's Innovation Partnership⋆

Elisa Di Nuovo

University of Turin, via Giuseppe Verdi, 8, 10124 Torino (TO), Italy

In recent years, technological progress has made Machine Translation (MT) a reality. Significant improvements have been obtained using deep learning models as opposed to rule-based and statistical MT models. Human evaluation, however, remains under-explored. In 2019 the European Parliament (EP) started an innovation partnership with commercial operators, with the purpose of developing a tool exploiting state-of-the-art, real-time Automatic Speech Recognition (ASR) and MT technologies to make parliamentary plenary sessions accessible to D/deaf and hard of hearing persons. In this paper, we present a quantitative and qualitative error analysis carried out on a test set consisting of 78 short speeches delivered by Members of the EP in the 19 languages deployed in the EP prototype by November 2022. The taxonomy used for ASR and MT is adapted from the Multidimensional Quality Metrics framework. Results show that sentence segmentation, which automatic metrics do not take into account, is the biggest issue in the ASR output and often affects the MT output.

Keywords: Real-time speech machine translation, Error analysis, Human evaluation, Cascade system

1. Introduction

In recent years, the landscape of language translation has been fundamentally transformed by remarkable technological advancements. Machine Translation (MT), once an ambitious aspiration, has now become a tangible reality. This transformation has been primarily fueled by the advent of deep learning models, specifically neural machine translation and transformers. These cutting-edge models have ushered in a new era of translation, eclipsing the limitations of traditional rule-based and statistical MT methods. Deep learning models use the mechanism called attention to improve performance [1] and have usually been evaluated on offline written translation tasks involving a few language pairs [2].

Very recently, research has also expanded its focus to speech machine translation, tackled as a concatenation of Automatic Speech Recognition (ASR) and MT, or as an end-to-end task (i.e. direct translation of speech in language A into text in language B).¹ In the last evaluation campaign of the 19th International Conference on Spoken Language Translation (IWSLT 2022) [5], one of the eight shared tasks focused on real-time speech translation, addressed as translation of ASR output or directly from the audio source and involving English to German, English to Japanese and English to Mandarin Chinese. A novelty of this year's campaign is the addition of manual evaluation of real-time outputs.

Like many natural language processing tasks, MT is difficult to evaluate. One of the reasons for this is the nondeterministic nature of translation, i.e. there is more than one correct way to translate from one language into another. Evaluation in shared tasks is usually carried out by means of automatic metrics, BLEU (Bilingual Evaluation Understudy) [6] being the standard for MT evaluation. This metric tries to overcome the nondeterministic nature of translation using multiple references. However, automatic metrics have several limitations [7].² On the other hand, human evaluation, if carried out using fine-grained guidelines to limit subjectivity, can give a clearer indication of the MT output quality. However, being resource expensive (i.e. skilled evaluators are hard to find and have a high cost), it has been used to a limited extent and in small studies. To avoid these limitations, crowdsourced annotators have been used. Unfortunately, crowdsourced annotators are frequently inexperienced. As [10] affirm, crowdsourced human evaluation can be used when MT quality is poor, because it can still provide a useful indication; but, as quality improves, it becomes unfit and leads to erroneous claims.³

In 2019 the European Parliament (EP) started an Innovation Partnership with commercial operators, with the purpose of developing a tool that can perform real-time ASR and MT from and into all the 24 official languages of the European Union (EU).⁴ This partnership has the aim of making parliamentary plenary sessions accessible in near-real time to D/deaf and hard of hearing persons.⁵ The challenges faced by this project are manifold: the high degree of multilingualism, which is highly ambitious considering the technical limits of current MT, particularly in a number of low-resource languages; the presence of non-native accents; the large variety of vocabulary and EU jargon required by the numerous specific domains tackled in the plenaries; the low latency constraints to have transcriptions and translations in near-real time; and the required high quality of the output. By November 2022, 19 language models had been developed and made available via a demo interface. These languages are English (EN), French (FR), German (DE), Spanish (ES), Italian (IT), Polish (PL), Greek (EL), Romanian (RO), Dutch (NL), Portuguese (PT), Bulgarian (BG), Czech (CS), Slovak (SK), Croatian (HR), Lithuanian (LT), Finnish (FI), Hungarian (HU), Swedish (SV) and Slovenian (SL).

In this paper, we present a quantitative and qualitative study, using the Word Error Rate (WER) metric [14] and manual human evaluation, on a test set consisting of short speeches delivered by members of the EP in the 19 languages already deployed in the prototype.⁶ The aim of this study is to evaluate the quality of both ASR and MT output and to reflect on the different insights into the same text given by different annotators. The manual human evaluation is based on an error taxonomy adapted from the Multidimensional Quality Metrics (MQM) framework⁷ and applied to part of the test set covering 6 languages (EN, FR, ES, IT, RO, DE). The MQM framework, developed in the EU QTLaunchPad and QT21 projects, provides a hierarchy of translation errors that can be adapted according to the application. We devised our taxonomy consisting of different error categories for ASR and MT and 3 severity levels (i.e. neutral, minor and major). We decided to exclude critical errors as in [10]. The remainder of the paper is organised as follows: in Section 2 we describe the methodology applied for automatic and human evaluation; in Section 3 we report the results, quantitatively and qualitatively analysed per language and per annotator; Section 4 concludes the paper.

CLiC-it 2023: 9th Italian Conference on Computational Linguistics, Nov 30 - Dec 02, 2023, Venice, Italy
⋆ This study and paper was written while the author was working for the European Parliament, in the Unit in charge of the administration and evaluation of the prototype. This is not the official evaluation methodology employed by the European Parliament for evaluation.
* The author, as of 1 October 2023, is employed by the Joint Research Centre, European Commission, Ispra (VA), Italy.
elisa.dinuovo@unito.it (E. D. Nuovo); ORCID 0000-0002-4814-982X (E. D. Nuovo)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

¹ See for example the studies in [3, 4] for a comparison between cascade and end-to-end systems.
² See Moorkens et al. [8] on translation quality assessment and Chatzikoumi [9] for a comprehensive review of automatic and human MT evaluation.
³ One (in)famous claim is that MT has achieved human parity [11, 12, 13].
⁴ Specifications of the Innovation Partnership are available here: https://etendering.ted.europa.eu/cft/cft-document.html?docId=58722. All links were last accessed on 13/05/2023.
⁵ Deaf with a capital D denotes individuals who are culturally and linguistically Deaf, often due to congenital deafness or early-life hearing loss. They identify with the Deaf community, characterized by its unique culture, sign languages, and traditions. In contrast, deaf (with a lowercase d) is a general term referring to individuals with a hearing impairment, irrespective of their cultural identification or community affiliation. It describes the audiological condition of partial or complete hearing loss, without specifying sign language usage or cultural ties.
⁶ This study and paper was written while the author was working for the European Parliament Unit in charge of the prototype management and evaluation. This is not the official evaluation methodology employed by the Parliament to evaluate the prototype.
⁷ MQM website available here: https://themqm.org/.

2. Methodology description

We evaluated ASR using both an automatic metric (WER) and human evaluation, and MT relying only on human evaluation. Human evaluation for both ASR and MT is carried out under the MQM framework. We describe the procedures and the experimental setup in the subsequent sections.

2.1. Automatic evaluation

Automatic evaluation was used only for ASR. The metric used is WER.⁸ The test set consists of 92 short speeches (minimum = 01:01; maximum = 05:10; average = 01:39; standard deviation = 00:39) delivered in the March and May 2022 plenaries by members of the EP. The speeches are in the 19 languages deployed in the tool by November 2022. See Table 1 for more details. Languages are ordered according to deployment in the tool.⁹

The gold standard of this test set is made of the verbatim transcription of the speeches (often referred to by its French abbreviation, CRE, Compte Rendu d’Evènement), manually corrected from the published report available on the website of the EP.¹⁰ The corrections are performed by two native speakers per language and a third annotator is involved to solve disagreement.

Language
EN FR DE ES IT RO PL EL NL PT BG CS SK SL HR LT FI HU SV Total

⁸ Script written by Dr. Claudio Fantinuoli available here: https://github.com/fantinuoli/WERvisual/blob/main/wer.py.
⁹ During Stage 1 of the project 10 language models were deployed; during Stage 2, 9 other language models were added. This order is maintained in Table 1. Stage 1 models were trained during 2020-2021, Stage 2 languages during 2021-2022. Stage 1 ASR models were updated in August 2022. Both ASR and MT models are developed by the Cedat85 consortium.
¹⁰ Each parliamentary sitting is publicly available and the CRE and videos in the original language are available on the EP website: https://www.europarl.europa.eu/plenary/en/debates-video.html.
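To make the metric of Section 2.1 concrete, the following is a minimal sketch of a word-level WER computation. It is not the WERvisual script cited in the footnote: the function names `normalise` and `wer` are illustrative, and the normalisation (lowercasing and stripping ASCII punctuation) is an assumption made here to reflect the fact that WER does not take punctuation into account.

```python
import string
from typing import List


def normalise(text: str) -> List[str]:
    """Lowercase, drop ASCII punctuation and split on whitespace.

    This normalisation is an assumption for illustration only.
    """
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# An over-segmented hypothesis is invisible to WER once punctuation is stripped.
over_segmented = wer(
    "this requires a rapid and united response",
    "This requires a. Rapid and united response.",
)
# A dropped negation counts as a single deletion, like any other token.
missing_negation = wer("we do not agree", "we do agree")
```

Under these assumptions the over-segmented hypothesis scores a WER of 0, while the dropped negation costs one error out of four reference words, which illustrates why segmentation problems and meaning-changing errors are not distinguished by the metric.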

2.2. Human evaluation

The error taxonomy is germane to the MQM framework and includes different categories for ASR and MT, sharing the same severity scale, i.e. neutral, minor, major (see Figure 1 in Appendix A for the decision tree). The error categories used for ASR error annotation are over-segmentation, under-segmentation, lexical substitution, lexical deletion, lexical addition, morpho-syntactic errors (e.g. number agreement, part of speech substitution), and terminology (e.g. named entities and terms). The categories used for MT error annotation are accuracy (e.g. meaning is not rendered in its entirety), punctuation, grammar, register (formality, gender-marked pronouns), terminology (including the presence of non-words, spelling errors or incorrect terms), other and unintelligible. Unintelligible is used to mark segments containing more than 5 major errors [10]. Other should be used in rare cases in which none of the existing error categories apply. Neutral errors weigh 0 points, minor errors 1 and major errors 5, except for unintelligible, which weighs 5 if minor and 25 if major. These weights are similar to those used in [10].

We involved four annotators with different backgrounds and knowledge of the languages. All received the annotation guidelines and training. After a few annotations a further meeting was scheduled to clear doubts. For reference, we call them annotator A, B, C and D (henceforth, Ann for annotator). Ann A has a background in Translation studies and is an experienced translator at the EP. Ann B was a trainee at the EP with a master’s degree in Translation and previous experience with the MQM framework. Ann C was a trainee at the EP with a master’s degree in Translation and no previous experience in ASR and MT evaluation. Ann D is a communications assistant at the EP with a background in interpretation, with no experience with the MQM framework, but with experience in ASR and MT evaluation. In Table 2 we report their self-reported knowledge of the languages according to the CEFR levels.¹¹

Annotator
Ann A Ann B Ann C Ann D

The annotated test set consists of 48 documents in 6 languages (EN, IT, FR, ES, RO, and DE): 18 automatic transcriptions (3 speeches per language, with an identification number from 1 to 6) and 30 translations (from and into the 6 above-mentioned languages). In Table 3 we report the evaluated task and the involved languages, the number of speeches (with an identification number between brackets to be able to identify them when used as source and target and also in the automatic evaluation results reported in Table 4), and the annotators providing the annotations.

¹¹ For more details about the CEFR levels see the website: https://www.coe.int/en/web/common-european-framework-reference-languages/level-descriptions.
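The severity weights given for the human evaluation can be sketched as a small scoring routine. This is an illustrative sketch, not the scoring procedure actually used in the study: the names `WEIGHTS`, `UNINTELLIGIBLE_WEIGHTS`, `error_weight`, `segment_score` and `document_score` are hypothetical, the segment score is assumed to be the sum of the weights of its errors, and the document score is assumed to be the mean of the segment-level scores.

```python
from typing import Iterable, List, Tuple

# Severity weights as stated in the text: neutral 0, minor 1, major 5.
WEIGHTS = {"neutral": 0, "minor": 1, "major": 5}
# The unintelligible category weighs 5 if minor and 25 if major.
UNINTELLIGIBLE_WEIGHTS = {"minor": 5, "major": 25}

Error = Tuple[str, str]  # (category, severity)


def error_weight(category: str, severity: str) -> int:
    """Penalty points for one annotated error."""
    if category == "unintelligible" and severity in UNINTELLIGIBLE_WEIGHTS:
        return UNINTELLIGIBLE_WEIGHTS[severity]
    # Assumption: any other combination falls back to the general weights.
    return WEIGHTS[severity]


def segment_score(errors: Iterable[Error]) -> int:
    """Assumed segment score: sum of the weights of the annotated errors."""
    return sum(error_weight(category, severity) for category, severity in errors)


def document_score(segments: List[List[Error]]) -> float:
    """Assumed document score: mean of the segment-level scores."""
    if not segments:
        raise ValueError("a document needs at least one segment")
    return sum(segment_score(segment) for segment in segments) / len(segments)


# Hypothetical annotation of a four-segment document (not taken from the study).
example_document = [
    [("punctuation", "minor"), ("accuracy", "major")],  # 1 + 5 = 6
    [],                                                 # no errors: 0
    [("unintelligible", "major")],                      # 25
    [("grammar", "neutral")],                           # 0
]
example_score = document_score(example_document)
```

With these assumptions a lower score indicates a better output, and a single major unintelligible segment outweighs several ordinary major errors, which is why disagreement on that category has a large effect on document-level scores.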

3. Results

3.1. Automatic evaluation

ASR was evaluated in two different scenarios: first, in sessions with more than one speech but all in the same language; second, in sessions with more than one speech, each in a different language. This is possible because the tool has a feature called Language Identification (LID), which is used to identify the language spoken and subsequently transcribe the audio in the identified language. WER results (computed per session) are reported in Table 4. In the table body, from row 2 to 7, we report the WER obtained in the speeches also undergoing human evaluation. This is the reason why we have multiple rows for the same language (e.g. EN LID on with id code 3, indicating the speeches subjected to human evaluation, and then again EN LID on, not subjected to human evaluation).

Language (source id)     LID   WER
All 19, 1 speech each    On     6.45
RO (1)                   On     2.77
IT (2)                   On     3.22
EN (3)                   On     8.98
ES (4)                   On     4.94
FR (5)                   On     8.91
DE (6)                   On     7.81
EN                       Off    5.25
EN                       On     5.48
IT                       Off    5.58
BG                       Off    5.83
PL                       Off    5.05
PL                       On     7.80
HU                       Off    9.18
HU                       On     9.57
CS                       Off    4.03
SK                       Off    2.52
SL                       Off    5.02
HR                       Off    5.63
LT                       Off   11.14
FI                       Off    5.48
SV                       Off   10.78

Table 4
ASR: Averaged WER results. Source id in brackets links the speeches with those in Table 3.

The results show that LID does not have a big impact on WER (e.g. EN, HU), except for PL (a difference of almost 3 percentage points), but the main difference in WER is due to different speeches (e.g. IT, in which the 3 speeches with LID on have a lower WER than the 3 with LID off, or EN).

¹² Please note that this could be due to pure chance and since the test set is small, we do not report statistical tests.

3.2. Human evaluation

We investigated manual annotation quantitatively and qualitatively. Quantitative evaluation is based on an average score per document and annotator. Qualitative evaluation takes error categories and severities into account.

Table 3
Human evaluation test set.

3.2.1. Quantitative evaluation

Same speeches, different annotators. For each annotator, we calculated a score per document by averaging the segment-level scores. Results are shown in Figures 2–6 in Appendix A. In general, ASR output received higher (i.e. worse) scores than expected, especially in languages in which WER is lower than 5% (i.e. RO, IT, ES). This can be due to the fact that the WER metric does not take punctuation into account, thus over- and under-segmentation issues are not counted. Also, in WER calculation all errors have the same weight (e.g. a missing negation that completely changes the meaning of the sentence has the same weight as any other missing token). MT output received lower scores compared to ASR output. This might mean that some errors in ASR are well handled in translation. When both annotators are native speakers of the target language, their scores are more similar. This is the case of Ann A and B in EN-RO (Figure 2), Ann B and C in EN-IT (Figure 4), Ann B and C in ES-IT (Figure 5). The same applies to Ann C and D (Figure 6) in the annotation of FR-IT MT, although Ann D displays a different annotation behaviour than Ann B and C. In fact, Ann D tends to annotate fewer errors. This could be influenced by their different backgrounds (interpreter vs. translator).

The annotation scores are the most similar when the annotators are native speakers of the target language, as in the annotation of IT-EN MT (Figure 3) and EN-FR MT (Figure 6). However, monolingual annotators (Ann A and C) show more severity in MT judgement into their native language (Figure 2 RO-IT MT and Figure 4 EN-IT MT) when compared with our IT-RO bilingual annotator (Ann B). This seems in line with research on bilingualism and acceptability, where results show that “bilinguals do not reject ungrammatical items with the same certainty as monolinguals” [15].

Averaging the scores attributed by the two annotators (ASR and translation into IT using the ASR output as source, except for IT, which is translated into EN), we obtain the following order (from the presumably best output to the worst): ES (average = 17.3), IT (average = 24.5), FR (average = 24.7), EN (average = 25.3), RO (average = 40.8). The order considering WER would be RO, IT, ES, FR, EN.

Same annotator, different speeches. Here we compare the annotations carried out by Ann B and C. We selected these two annotators because they performed the majority of the annotation task, so it is possible to compare their results in different languages. We report their scores in Figures 7–8, respectively (Appendix A).

According to Ann B (Figure 7), we can order the languages from the best output to the worst: ES (average = 14.33), IT (average = 27.00), DE (average = 29.44), EN (average = 36.67) and RO (average = 46.33). According to Ann C (Figure 8), the order would be: ES (average = 13.67), IT (average = 18.50), EN (average = 26.33) and FR (average = 27.50).
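The mismatch between the human-evaluation ordering and the WER ordering reported for the "same speeches, different annotators" comparison can be reproduced from the published figures. The sketch below only sorts values stated in the text and in Table 4 (rows with source id, LID on); the variable and function names are illustrative, and the rank shift is a simple descriptive difference, not a statistical test.

```python
# Average human-evaluation scores per source language, as reported in the text
# (lower is presumably better).
human_avg = {"ES": 17.3, "IT": 24.5, "FR": 24.7, "EN": 25.3, "RO": 40.8}

# WER (%) of the same speeches, taken from the rows with a source id in Table 4.
wer_pct = {"RO": 2.77, "IT": 3.22, "ES": 4.94, "FR": 8.91, "EN": 8.98}


def ranking(scores):
    """Languages ordered from the lowest (best) to the highest (worst) score."""
    return sorted(scores, key=scores.get)


human_rank = ranking(human_avg)
wer_rank = ranking(wer_pct)

# Positive shift: the language ranks worse under human evaluation than under WER.
rank_shift = {
    lang: human_rank.index(lang) - wer_rank.index(lang) for lang in human_avg
}
```

RO moves from first place by WER to last place by human score, which is the pattern the text attributes to segmentation errors that WER does not count.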

3.2.2. Qualitative evaluation

Each annotator draws a different picture of each text, whether it is the product of ASR or MT. As far as ASR output is concerned, we report the results in Figures 9–13 in Appendix A.

Although we did not put a major emphasis on over- and under-segmentation errors during training, as they were considered to be straightforward (at least in their identification), the disagreement in annotations suggests the contrary. In fact, different annotators draw opposite pictures of their presence and importance. For example, in Figure 9, we can notice that Ann A weighs over-segmentation errors more than under-segmentation errors in the RO transcription, while Ann B does the opposite. The system's results on punctuation marking, and full stop identification in particular, seem to be below state-of-the-art performance [16]. Ann C (Figures 10–13) seems to be more severe about morpho-syntactic errors in ASR. The same errors are annotated as lexical substitutions by the other annotators, as in Example 1.

(1)
REF: Protéger les citoyens de la haine en ligne, voici un bel usage des technologies les plus avancées. Et voici aussi un usage très approprié de l’Union européenne.
“Protecting citizens from online hate, here is a good use of the most advanced technologies. And here a very appropriate use by the European Union.”
ASR: Protéger les citoyens de la haine en ligne. Voici un bel usage. Les technologies les plus avancées. En voici aussi un usage très approprié de l’Union européenne.
“Protecting citizens from online hate. Here is a good use. The most advanced technologies. Here also a very appropriate use by the European Union.”
FR-IT: Proteggere i cittadini dall’odio online. Qui è un buon uso. Le tecnologie più avanzate. Anche questo è un uso molto appropriato da parte dell’Unione europea.
“Protecting citizens from online hate. Here is a good use. The most advanced technologies. This is also a very appropriate use by the European Union.”

Ann C marked the errors in bold in Example 1 as morpho-syntactic errors of a major nature, Ann D as lexical substitutions of a minor nature. This is a blurry area, if you consider that both are function words and in other languages could be rendered morphologically. We think that, in a multilingual perspective, these should be treated as morphological errors, being function words. However, they are probably not major errors, as they do not affect a main idea of the speech (decision tree in Figure 1).

As far as MT output is concerned, we report the results in Figures 14–20 in Appendix A. We notice the inappropriate use of the unintelligible category. Unintelligible should mark segments in which it is impossible to understand the message and to identify all the errors that led to the incomprehensible segment. The fact that, on the same set, some annotators used it and others did not is a clear sign of misunderstanding (Figures 15, 17, 18 and 19). In fact, in Example 2, unintelligible is used in a segment in which it is possible to understand the meaning, although there is a minor grammatical error (attaccano ‘they attack’) and a minor accuracy error (relative clause instead of adverbial clause, che ‘that’ substituting per ‘to’).

(2)
REF: [. . . ] state precum Federația Rusă utilizează instrumentele moderne pentru a ataca state, pentru a ataca entități, pentru a pune în pericol democrația europeană, acest lucru necesită un răspuns rapid și unit.
“[. . . ] countries like the Russian Federation use modern tools to attack states, to attack entities, to endanger European democracy, this requires a rapid and united response.”
ASR: State precum Federația Rusă utilizează instrumentele moderne pentru a. Ataca state pentru a ataca entități pentru a pune în pericol democrația europeană. Acest lucru necesită un răspuns. Rapid și unit, [. . . ]
RO-IT: Paesi come la Federazione Russa usano strumenti moderni per Attaccano gli Stati per attaccare entità che mettono in pericolo la democrazia europea. Ciò richiede una risposta. Veloce e unito,

In Example 2 we also notice over-segmentation errors in the ASR transcription cascading into MT (Acest lucru necesită un răspuns. Rapid și unit). In addition, it seems that Ann B, and in other examples Ann C, annotated the output as if it was a written text and not an oral text transposed into writing, whereas, for an oral text, the reference is only one of the possible transpositions. This is evident looking at punctuation. In Example 2, in fact, Ann B not only marked the over-segmentation error dividing the noun răspuns from its modifiers (rapid și unit), but also another over-segmentation error (although marked as minor) because in the reference this sentence is joined to the preceding one with a comma and not divided by a full stop. However, it must be noted that a full stop there is perfectly acceptable.

Unintelligible errors were also marked when the other annotator only noticed punctuation issues, as shown in Example 3.

(3)
REF: Putin has thrown the world and Europe back to a time we had hoped never to experience again. A crisis of such dignity shows our true colours – if we are on the right side of history or choose the [path] path of destruction.
ASR: Putin has thrown the world and Europe back to a time, we would hope to never experienced again crisis of such dignity shows our true colours if we are on the right side of history or choose the path path of destruction.
EN-FR: Poutine a renvoyé le monde et l’Europe à une époque, nous espérons ne plus avoir connu de crise de cette dignité montre nos vraies couleurs si nous sommes du bon côté de l’histoire ou si nous choisissons le chemin de la destruction.

(4)
REF: [. . . ] Ceux qui ont harcelé et appelé au meurtre sur Internet Samuel Paty, sont-ils, étaient-ils, des vecteurs de liberté d’expression? Poser la question, c’est déjà y apporter une réponse.
“Were those who harassed and called for the murder of Samuel Paty on the Internet vectors of freedom of expression? To ask the question is to answer it.”
ASR: Ceux qui ont harcelé. Su Internet. Internet. Jsem jej petic. Et appelé au meurtre sur Internet, Samuel Paty. Sont-ils, étaient-ils des vecteurs de liberté d’expression. Poser la question c’est déjà y apporter une réponse.
FR-IT: Coloro che hanno molestato. Su internet. Internet. Sono una petizione. E ha chiesto omicidio su Internet, Samuel Paty. Sono loro, erano vettori della libertà di espressione. Fare la domanda è già fornire una risposta.
“Those who harassed. On the Internet. Internet. They are a petition. And called for murder on the internet, Samuel Paty. It’s them, they were vectors of freedom of expression. To ask the question is already to provide an answer.”

An actual unintelligible error is instead reported in Example 4. LID errors in this case caused unintelligibility in the translation because the same portions of audio were transcribed in different languages (transcribed as IT, PL, and CS). Perhaps including the information about the source language in the translated output could be useful to reduce the impact that LID errors like these have on the understanding of the MT output. Morpho-syntactic errors annotated in the ASR are frequently correct in the MT output. Over-segmentation, instead, in particular when it involves a full stop, remains unchanged in the MT output, as MT models usually mirror the punctuation of the source text.

4. Conclusion

We presented a quantitative and qualitative evaluation of the tool that has been developed in the context of the EP’s Innovation Partnership. We used the WER score and manual human evaluation to evaluate the quality of ASR, and only human evaluation for MT quality. The average WER is 6.43% in the multilingual test set made of 19 languages deployed by November 2022, which is very low but does not take into account segmentation issues. Human evaluation highlighted the need for refining sentence segmentation, especially in languages in which the WER was very low (e.g. RO and IT). This could indicate that WER by itself is not enough to have a clear picture of the quality of the transcription. However, human evaluation remains a highly subjective task, and this subjectivity affects all categories, including those considered clear-cut (e.g. sentence segmentation). The annotators’ background has an influence on error severity perception and error identification, and should be investigated in detail. In line with what was found in [17], we also found that annotators’ sensitivity in deepening the error annotation is a main cause of disagreement, in this case due to the attempt to annotate also the consequences of the error. Quantitative results of human evaluation considering the ASR output and its translation into IT (except for IT translated into EN) indicate ES as the qualitatively better output, followed by IT, FR and EN, and RO as the worst output. In general, annotators rated ASR output worse than the MT output. However, this might be a consequence of the attitude of annotators putting too much emphasis on the provided reference transcription of the speech, not considering that, especially where punctuation is concerned, it is only one of the possible accepted transpositions. Qualitative results highlighted that different annotators draw different pictures of the same speeches and that a second round of annotations would be necessary to reduce disagreement and to clarify the use of error categories, like unintelligible, which are frequently improperly applied.

Acknowledgments

I want to thank the European Parliament for giving me the opportunity to conduct this study and the annotators for participating and giving me the authorisation to use their annotations for research purposes. I thank the anonymous reviewers for their precious comments, and I apologise if not all of them have been addressed.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: 31st Conference on Neural Information Processing Systems (NIPS 2017), California, USA, 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

[2] O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R. Soricut, L. Specia, A. Tamchyna, Findings of the 2014 workshop on statistical machine translation, in: Proceedings of the Ninth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Baltimore, Maryland, USA, 2014, pp. 12-58. URL: https://aclanthology.org/W14-3302. doi:10.3115/v1/W14-3302.

[3] L. Bentivogli, M. Cettolo, M. Gaido, A. Karakanta, A. Martinelli, M. Negri, M. Turchi, Cascade versus direct speech translation: Do the differences still make a difference?, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 2873-2887. URL: https://aclanthology.org/2021.acl-long.224. doi:10.18653/v1/2021.acl-long.224.

[4] M. Sperber, M. Paulik, Speech translation and the end-to-end promise: Taking stock of where we are, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7409-7421. URL: https://aclanthology.org/2020.acl-main.661. doi:10.18653/v1/2020.acl-main.661.

[5] A. Anastasopoulos, L. Barrault, L. Bentivogli, M. Zanon Boito, O. Bojar, R. Cattoni, A. Currey, G. Dinu, K. Duh, M. Elbayad, C. Emmanuel, Y. Estève, M. Federico, C. Federmann, S. Gahbiche, H. Gong, R. Grundkiewicz, B. Haddow, B. Hsu, D. Javorský, V. Kloudová, S. Lakew, X. Ma, P. Mathur, P. McNamee, K. Murray, M. Nǎdejde, S. Nakamura, M. Negri, J. Niehues, X. Niu, J. Ortega, J. Pino, E. Salesky, J. Shi, M. Sperber, S. Stüker, K. Sudoh, M. Turchi, Y. Virkar, A. Waibel, C. Wang, S. Watanabe, Findings of the IWSLT 2022 evaluation campaign, in: Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), Association for Computational Linguistics, Dublin, Ireland (in-person and online), 2022, pp. 98-157. URL: https://aclanthology.org/2022.iwslt-1.10. doi:10.18653/v1/2022.iwslt-1.10.

[6] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311-318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.

[7] B. Dorr, J. Olive, J. McCary, C. Christianson, Machine translation evaluation and optimization, in: Handbook of Natural Language Processing and Machine Translation, Springer, 2011, pp. 745-843.

[8] J. Moorkens, S. Castilho, F. Gaspari, S. Doherty, Translation quality assessment: From principles to practice, Machine Translation: Technologies and applications, Springer, 2018.

[9] E. Chatzikoumi, How to evaluate machine translation: A review of automated and human metrics, Natural Language Engineering 26 (2020) 137-161.

[10] M. Freitag, G. Foster, D. Grangier, V. Ratnakar, Q. Tan, W. Macherey, Experts, errors, and context: A large-scale study of human evaluation for Machine Translation, in: Transactions of the Association for Computational Linguistics, volume 9, 2021, pp. 1460-1474. URL: https://doi.org/10.1162/tacl_a_00437. doi:10.1162/tacl_a_00437.

[11] H. Hassan, A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang, M. Junczys-Dowmunt, W. Lewis, M. Li, S. Liu, T.-Y. Liu, R. Luo, A. Menezes, T. Qin, F. Seide, X. Tan, F. Tian, L. Wu, S. Wu, Y. Xia, D. Zhang, Z. Zhang, M. Zhou, Achieving Human Parity on Automatic Chinese to English News Translation, 2018. arXiv:1803.05567.

[12] A. Toral, S. Castilho, K. Hu, A. Way, Attaining the unattainable? Reassessing claims of human parity in neural machine translation, in: Proceedings of the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 113-123. URL: https://aclanthology.org/W18-6312. doi:10.18653/v1/W18-6312.

[13] S. Läubli, R. Sennrich, M. Volk, Has machine translation achieved human parity? A case for document-level evaluation, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 4791-4796. URL: https://aclanthology.org/D18-1512. doi:10.18653/v1/D18-1512.

[14] S. Nießen, F. J. Och, G. Leusch, H. Ney, An evaluation tool for machine translation: Fast evaluation for MT research, in: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00), European Language Resources Association (ELRA), Athens, Greece, 2000. URL: http://www.lrec-conf.org/proceedings/lrec2000/pdf/278.pdf.

[15] J. C. López Otero, On the acceptability of the Spanish DOM among Romanian-Spanish bilinguals, in: A. Mardale, S. Montrul (Eds.), The Acquisition of Differential Object Marking, Trends in Language Acquisition Research, John Benjamins Publishing Company, 2020, pp. 161-181.

[16] O. Guhr, A.-K. Schumann, F. Bahrmann, H.-J. Böhme, FullStop: Multilingual Deep Models for Punctuation Prediction, in: Swiss Text Analytics Conference, 2021.

[17] E. Di Nuovo, Introducing VALICO-UD: a parallel, learner Italian treebank for language learning research, Pàtron Editore, 2023.

A. Figures