1. Introduction

Constructing a Multimodal, Multilingual Translation and Interpreting Corpus: A Modular Pipeline and an Evaluation of ASR for Verbatim Transcription

Alice Fedotova

Adriano Ferraresi

Maja Miličević Petrović

Alberto Barrón-Cedeño

0 0 DIT, Università di Bologna , Corso della Repubblica 136, 47121, Forlì , Italy

This paper presents a novel pipeline for constructing multimodal and multilingual parallel corpora, with a focus on evaluating state-of-the-art automatic speech recognition tools for verbatim transcription. The pipeline was developed during the process of updating the European Parliament Translation and Interpreting Corpus (EPTIC), leveraging recent NLP advancements to automate challenging tasks like multilingual alignment and speech recognition. Our findings indicate that current technologies can streamline corpus construction, with fine-tuning showing promising results in terms of transcription quality compared to out-of-the-box Whisper models. The lowest overall WER achieved for English was 0.180, using a fine-tuned Whisper-small model. As for Italian, the lowest WER (0.152) was obtained by the Whisper Large-v2 model, with the fine-tuned Whisper-small model still outperforming the baseline (0.201 vs. 0.219).

eol>multimodal corpora construction translation and interpreting corpora verbatim automatic speech recognition

1. Introduction

lation of EPTIC were developed ad hoc and aim at reproducing minimal prosodic features, but can still be The present paper introduces a pipeline for the construc- considered an instance of verbatim transcription [ 3, 1 ]; tion of multimodal and multilingual parallel corpora that the issue of what truly constitutes verbatimness is still could be used for translation and interpreting studies an object of debate and will be further discussed. There (TIS), among others. The construction of such resources is fairly widespread agreement on the statement that has been acknowledged as a “formidable task” [ 1 ], which every transcription system reflects a certain methodif automated —as we propose— involves a number of ological approach [4, 5], and that by “choosing not to subtasks such as automatic speech recognition (ASR), transcribe a particular dimension, the researcher has immultilingual sentence alignment, and forced alignment, plicitly decided that the dimension plays no role in the each of which poses its own challenges. Yet tackling these phenomenon in question” [4]. To investigate the characsubtasks also ofers a unique way to evaluate state-of- teristics of Whisper’s [2] transcriptions in English and the-art natural language processing (NLP) tools against Italian, we formulate the following two research quesa unique, multilingual benchmark. In this paper we dis- tions: RQ1 Is it possible to use fine-tuning to adapt the cuss the development of a modular pipeline adaptable for transcription style to the one of an expert annotator? each of these subtasks and address the issue of whether RQ2 What is the impact of speech type (native, nonperforming ASR with OpenAI’s Whisper [2] could be native, interpreted) on transcription quality? suitable for verbatim transcription. We find that satisfactory results can be achieved with

We showcase the utility of this pipeline by expanding automatic speech recognition, although challenges rethe European Parliament Translation and Interpreting main, especially with regards to the verbatimness of Corpus (EPTIC), a multimodal parallel corpus compris- the transcription —a crucial factor in corpora intended ing speeches delivered at the European Parliament along for TIS. Fine-tuning Whisper-small on English data obwith their oficial interpretations and translations [ 1, 3 ]. tains a lower word error rate (WER) of 0.180 compared The transcription conventions adopted for the compi- to Whisper-large v2 (0.194), potentially indicating that ifne-tuning Whisper models holds promise for improvCLiC-it 2024: Tenth Italian Conference on Computational Linguistics, ing their performance in terms of adhering to a certain Dec 04 — 06, 2024, Pisa, Italy transcription style. However, this was not the case when *$Coarlircees.pfeodnodtionvgaa2u@thuonri.bo.it (A. Fedotova); considering the experiments based on Italian. In the adriano.ferraresi@unibo.it (A. Ferraresi); maja.milicevic2@unibo.it Italian scenario, Whisper-large-v2 obtained a WER of (M. Miličević Petrović); a.barron@unibo.it (A. Barrón-Cedeño) 0.152 compared to a WER of 0.201 obtained by the fine0009-0001-4850-0974 (A. Fedotova); 0000-0002-6957-0605 tuned Whisper-small model. It should be noted, however, (A. Ferraresi); 0000-0003-4137-1898 (M. Miličević Petrović); that this constituted an improvement over the baseline 0000-0003-4719-3420 (A. Barrón-Cedeño) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Whisper-small model, which obtained a higher WER of Attribution 4.0 International (CC BY 4.0). 0.219. A significant limitation in the case of fine-tuning performs best overall among the tools considered. in Italian was constituted by the smaller amount of data Despite these advancements, several limitations peravailable for tuning compared to English. Lastly, we find sist in the current research. First, most studies focus that sentence alignment can be facilitated through state- primarily on English, with only some including other of-the-art embedding-based tools, whereas forced align- languages such as Chinese [16]. Furthermore, the field ment can be considered a largely solved problem. This of speech disfluency research faces challenges due to the makes the construction of corpora such as EPTIC more scarcity of publicly available benchmarking datasets, atstreamlined and requiring less human intervention, with tributed to high annotation costs, the clinical nature of wider implications for multilingual corpus construction some tasks, and the use of proprietary datasets [18]. The in the field of TIS and beyond. choice between Wav2Vec and Whisper remains a point of debate, with [8] finding similar results for both after ifne-tuning, while Azure of-the-shelf performed best, fol2. Related Work lowed by Whisper of-the-shelf. Still, [ 17] did not explore ifne-tuning, and [ 8] suggests that fine-tuned models generally perform better. The requirement for punctuation marks in some corpora, such as EPTIC, introduces another consideration in model selection. Wav2Vec does not output punctuation, while Whisper does, potentially influencing its suitability for certain applications. Additionally, while [13] used a large corpus, [15] indicated that Whisper can perform well with less data, highlighting the need for further investigation into optimal data requirements.

Recent advancements in the field of corpus linguistics have led to a multitude of complex multilingual and multimodal corpora, as well as novel approaches to corpus construction. Transcribing spoken data, identifying prosodic features, and aligning parallel texts are some of the tasks that are commonly involved. In this sense, a particularly representative case in point is constituted by interpreting corpora, such as EPIC [6], DIRSI [7], and EPTIC [ 3, 1 ], the latter also including translated texts. Based on data obtained from the European Parliament, these complex corpora require multi-step approaches for gathering and processing parallel, multilingual texts and multimodal 3. Corpus Construction data. Though the construction of translation and interpreting corpora has been largely carried out manually, it The present work is based on the European Parliament can also constitute a unique opportunity for developing Translation and Interpreting Corpus (EPTIC), a multinew tools and benchmarking recent advancements in the modal parallel corpus comprising speeches delivered at ifelds of NLP and ASR. ASR, in particular, has garnered the European Parliament (EP) along with their oficial inincreasing attention due to the time-consuming nature terpretations and translations.1 Within EPTIC, the corpus of spoken data transcription. construction process revolves around individual speech

A related research strand in the field of ASR concerns events, where edited verbatim reports published by the the level of detail of the transcriptions produced by ASR EP and transcriptions of the speeches are accompanied systems, as the task is usually not only to transcribe the by transcriptions of interpretations and oficial transspeech but to make sure that prosodic features, such as lations into other languages. These components form disfluencies, are maintained. [ 8] conducted a comprehen- a multi-parallel corpus, i.e. a corpus containing verbasive comparison of diferent ASR systems and acoustic tim transcriptions of source speeches, oficial verbatim models for disfluency detection and categorization, ex- reports and corresponding target translations and inamining Wav2Vec [9], HuBERT [ 10 ], WavLM [11], Whis- terpretations (quasi parallel at the intermodal level [3]). per [2], and Azure [12]. Their findings indicate that fine- The English partition consists of source English texts tuned models generally outperform their of-the-shelf and their translations into various languages. Corpora counterparts. [13] evaluated pre-trained models, reveal- containing translations in both possible directions (e.g., ing that Whisper-Large achieved the best overall WER from English to French and vice versa) are referred to as and chrF (character -gram F-measure [14]) scores. [15] bidirectional, while those with translations in only one demonstrated the potential of Whisper for adaptation in direction are referred to as unidirectional. Table 1 shows spoken language assessment with limited training data. the languages included and the size of the latest version, In the realm of commercial ASR services, [16] explored EPTIC v2, planned for release by the end of 2024. IBM’s ofering for transcribing English source speeches Our approach to corpus expansion began with a reand their interpretation, reporting an impressively low er- view of previous guidelines for developing EPTIC [ 1, 19 ]. ror rate of 4.7%. [17] conducted a systematic comparison The former procedure first involved obtaining data by of automatic transcription tools, evaluating factors such either scraping texts from the EP website2 or by manas data protection, accuracy, time eficiency, and costs for English and German interviews, and found that Whisper

1https://corpora.dipintra.it/eptic/ 2https://www.europarl.europa.eu/plenary/en/debates-video.html

ually downloading videos and then transcribing them. Transcripts of the original speeches and interpretations were manually adapted following editing conventions to annotate features of orality such as disfluencies and timestamped using Aegisub.3 Then, the texts were automatically segmented into sentences and aligned across languages and modalities, for instance between transcriptions and verbatim reports, with the help of the Intertext Editor alignment tool.4

The creation of the new workflow started with the previous procedure as a basis. It was first subdivided into separate tasks, the main ones being automatic speech recognition, multilingual sentence alignment, and forced alignment. Software selection was based on criteria such as ease of use and setup, compatibility with the Python programming language, linguistic coverage, and compatibility with Sketch Engine, an established corpus query tool for teaching and research [20, 21]. Python v. 3.11.5 was used along with the Poetry5 package manager for portability.6 Next, we discuss the tasks and the considerations made when designing the pipeline.

For this task, we use Bertalign [24]. Unlike predecessors such as Hunalign8 that rely on lexical translation probabilities, Bertalign employs sentence embeddings to identify parallel sentences, providing a more robust approach for handling semantic similarities. We used a version of the tool that has been extended to produce outputs in the Sketch Engine format for corpus indexing [20, 21].

Forced Alignment, the task of automatically aligning audio with transcriptions, is the most mature task for spoken corpora. Although WhisperX performs timestamping during transcription, we experimented with forced alignment on an existing portion of spoken EPTIC data, using the aeneas library, which supports more than thirty languages.9

The pipeline is structured in a modular fashion so as

to maximize reusability. The process begins with the extraction of text and video data from the EP website, using ad-hoc scripts which partially automate scraping of the EP website. Transcription is then performed using WhisperX. To remove mistranscriptions and to ensure adherence to the transcription guidelines, the transcripts undergo manual review to incorporate disfluencies and rectify potential mistranscriptions. Once the texts have been transcribed, they undergo sentence splitting and sentence alignment using Bertalign. Relevant metadata, encompassing session topics, are automatically retrieved from the EP website. The only item requiring manual input is the speech type, which can be defined as impromptu, read out, or mixed. After exporting the alignments in the Intertext format and performing part-ofspeech tagging with Sketch Engine, the texts and metadata are converted to the vertical format required for indexing in Sketch Engine [20, 21].

3https://aegisub.org

4https://wanthalf.saga.cz/intertext 5https://python-poetry.org 6The code is available at https://github.com/TinfFoil/eptic_v2_ pipeline 7https://github.com/m-bain/whisperX Automatic Speech Recognition has seen recent advancements, with the introduction of Whisper [2] and Wav2Vec 2.0 [9]. However, achieving a reasonable level of transcription quality is complex and context-dependent, as it can be interpreted and evaluated diferently depending on the domain, task, and application [22]. We decided to employ the WhisperX7 variant of Whisper, given its documented reliable performance for long-form transcription, which is oftentimes needed when dealing with parliamentary speech [23].

We require an ASR system to produce a verbatim transcription where all words are transcribed, along with disfluencies and extra-linguistic information. However, verbatimness is a broad concept, given the variety of transcription conventions existing in linguistics [17]. Whisper has been observed to produce transcripts “often almost comparable to the final read through of a manual Sentence Alignment involves identifying and align- (verbatim to gisted) transcript” [17], where gisted refers ing parallel sentences, both mono- and multilingually. to a transcription that “omits non-essential information (e.g., filler words, word fragments, repetition of words), and summarizes or grammatically correctly rephrases the audio content” [17]. Hereby, we define a verbatim

4. ASR for Verbatim Transcription: Evaluating Whisper 8https://github.com/danielvarga/hunalign 9https://www.readbeyond.it/aeneas/

such as nativeness influenced the WER. Findings for transcription as a transcription where “all words are tran- these experiments are presented in Table 3, and indicate scribed without additional grammatical corrections [and] a WER of 0.104 for native English speakers, 0.110 for word repetitions, utterances, word interruptions, and non-native speakers, and a notably higher WER of 0.222 elisions are kept” along with some rudimentary extra- for interpreted speech. Similar results were also obtained linguistic contextual information, such as applauses [17]. for Italian, with a WER of 0.131 for native speakers and

As part of our experiments, we tested the HuggingFace 0.188 for interpreted speech, which provides further evirelease10 of the Whisper models. The test set included dence for the finding of interpreted speech being more English, Italian, French, and Slovenian, though further ex- challenging to transcribe [16]. periments were conducted exclusively with English and To further explore the claim that fine-tuning improves Italian due to dataset limitations. We used 7 hours of au- the performance of the model by steering its output todio for English, 5 for Italian, 1.5 hours for French and 1.5 wards a more verbatim transcription, we now present the hours for Slovenian. Besides evaluating the models on the results of a qualitative error analysis. We consider a set of whole set of held-out data, we computed word error rates “markers of verbatimness” based on the definition in [ 17]: (WERs) for diferent speech types: native speech, non- contractions, truncated words, discourse markers, repnative speech, and interpreted speech.11 In addition to etitions, filled pauses and empty pauses. The following experimenting with the out-of-the-box versions of Whis- paragraphs present results that emerge from the analysis, per, we explored fine-tuning Whisper-small for English with examples provided in Table 4. Following [15], we and Italian. To train and test the models, we used 80% furthermore report the recall metric for each category. of the data for training, 10% for validation, and 10% for As for contractions, they are sometimes incorrectly testing. The training parameters for the Whisper-small resolved by the standard Whisper-large-v2 model; finemodel were set to a batch size of 16, a learning rate of 1e- tuning results in improvements. For instance, in the 5, mixed-precision training enabled, and a maximum of example shown in Table 4, the fine-tuned version of 5,000 training steps. Evaluation and saving checkpoints Whisper-small maintains the contraction while the large were enabled every 1,000 steps, optimizing for WER. model does not. Generally, however, Whisper-large-v2

The experimented Whisper models showed a robust shows acceptable performance even when fine-tuning is performance across languages and speech types. Our not performed, as Whisper was trained with unnormalifndings suggest that satisfactory results can be achieved ized transcripts including contractions, punctuation and for Italian, which exhibits a low WER of 0.152, and En- capitalization [2]. glish, with a WER as low as 0.194. The full set of results Truncations are not transcribed by the Whisper modis presented in Table 2, where the fine-tuned model is ref- els out-of-the-box. Fine-tuning shows some promising erenced as Small-FT. This fine-tuned model obtained the results, though truncations are not always transcribed relowest WER for English, performing better than Whisper- liably and transcription errors are sometimes introduced, large-v2, which could indicate that the model is learning as illustrated in Table 4. This is possibly due to the obserto produce a more verbatim transcription. In the case vation in [15] that, being largely trained on speech data of Italian, the fine-tuned model obtains a lower WER with a high level of inverse text normalization (ITN), a compared to the baseline Whisper-small model (0.201 process including disfluency removal, Whisper tends to for the fine-tuned model compared to the WER of 0.219 omit features of orality in favor of readability, which is obtained by the baseline Whisper-small). However, the unfavorable for the purpose of verbatim transcription. lowest WER of 0.152 is obtained by Whisper-large-v2, Discourse markers are mostly transcribed in English, which could be attributed to the lower amount of data even by the baseline Whisper-large-v2. In Italian, disavailable for fine-tuning compared to English. course markers are omitted considerably more often. An Lastly, to address RQ2, we evaluated whether factors example of this is provided in Table 4. This could be attributed to the fact that, even though Whisper models have been trained to produce transcriptions without any significant standardization [ 2], the amount and qual10https://huggingface.co/docs/transformers/en/model_doc/

whisper 11Which can be both into the interpreter’s A or B language. ity of training data for English are likely more extensive and varied compared to Italian, especially when it comes to examples of spontaneous speech. As for repetitions, the example in Table 4 shows both a repetition and a truncation, a common occurrence due to disfluent speech often comprising a combination of both. In the example, the fine-tuned Whisper-small model accurately Table 4 transcribes both disfluencies, while Whisper-large-v2 Transcription examples by disfluency type. For each example, rephrases them into a corrected transcription. Overall, we include (a) the reference transcription, (b) the transcription the baseline Whisper-large-v2 model always omitted repproduced by Whisper-small-FT and (c) by Whisper-large-v2. etitions both in English and Italian. This could be due to Example Transcription Rec EN Rec IT the powerful language model used by Whisper, which has been observed to correct such errors [13].

Contractions The last examples in Table 4 illustrate transcriptions of (a) I’m encouraged that the interim 100.00 – empty and filled pauses. Whereas Whisper-small-FT of(b) lIe’madeernschoipur.a. .ged that the interim 95.40 – ten captures them, the baseline model does not. However, leadership . . . the fine-tuned model’s performance is not consistent, and (c) I am encouraged that the interim 86.30 – occasionally non-existent empty pauses are transcribed leadership . . . by the model. As in the case of truncations, pauses are never transcribed by Whisper-large-v2, likely due to the models having been trained on data processed with ITN.

5. Conclusions and Future Work

This paper presented a novel pipeline for constructing multimodal and multilingual parallel corpora, with a focus on evaluating state-of-the-art automatic speech recognition tools for verbatim transcription. Experiments with Whisper models on EPTIC revealed robust performance across languages and speech types, particularly for English and Italian. However, some limitations remain regarding ASR performance and achieving verbatim transcriptions. Fine-tuning Whisper showed promising reductions in WER, particularly for English, indicating the potential of adapting the model to use a more verbatim style. Yet qualitative analysis revealed inconsistencies in handling disfluencies, truncations, and discourse markers. Furthermore, higher WERs for non-native and interpreted speech underscore remaining challenges.

Future research eforts could explore incorporating additional metrics beyond WER to better capture the degree of verbatimness in the transcriptions, and expanding the Italian dataset to potentially improve the performance of the fine-tuned model. Another avenue for research could include augmenting the dataset with external data containing pairs of audio and verbatim transcripts, most notably the Switchboard corpus introduced in [25]. Other methods besides fine-tuning could be explored to enhance the quality of transcriptions, for instance by leveraging the oficial verbatim reports on the European Parliament’s website. Lastly, a model could be developed for detecting the metadata item relative to the speech type, i.e. impromptu, read out, or mixed, based on textual or multimodal features.

Acknowledgments The work of A. Fedotova is supported by the NextGeneration EU programme, ALMArie CURIE 2021 - Linea SUpER, Ref. CUPJ45F21001470005.

1 ( 2014 ) 7 - 36 . doi:https://doi.org/10.1007/

s40607-014-0009-9 . [22]

Kuhn ,

Kersken ,

Reuter ,

Egger , G. Zim-

Accessible Computing 16 ( 2024 ) 1 - 23 . doi:https:

//doi.org/10.1145/3636513. [23]

Bain ,

Huh , T. Han, A . Zisserman, Whisperx:

audio , arXiv preprint ( 2023 ). URL: https://arxiv.org/

pdf/2303 .00747, retrieved May 20, 2024 . [24]

Lei ,

Zhu , Bertalign: Improved word

Digital Scholarship in the Humanities 38 ( 2022 )

621- 634 . doi:https://doi.org/10.1093/llc/

fqac089. [25]

J. J.

Godfrey ,

E. C.

Holliman , J. McDaniel , Switch-

ume 1 , IEEE Computer Society, 1992 , pp. 517 - 520 .

doi:10 .1109/ICASSP. 1992 . 225858 .