1. Introduction

: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

Sara Papi

Marco Gaido

Luisa Bentivogli

Alessio Brutti

Mauro Cettolo

Roberto Gretter

Marco Matassoni

Mohamed Nabih

Matteo Negri

0 0 Fondazione Bruno Kessler , Via Sommarive 18, 38123 Trento , Italy

2025

The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature-with inaccessible training data and code-poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar eforts in speech processing remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including codebase, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research. The FAMA collection is available at: https://huggingface.co/collections/FBK-MT/fama-683425df3fb2b3171e0cdc9e

eol>speech automatic speech recognition speech translation ASR ST open science open source speech foundation model

1. Introduction

strated the feasibility of training large language models (LLMs) using only open-source (OS) data [12], realizing The development of speech foundation models (SFMs) an open-science1 system [14] for text processing. Howhas significantly advanced speech processing in the last ever, such comprehensive approaches are still lacking in few years, particularly in areas such as automatic speech the field of speech processing. recognition (ASR) and speech translation (ST). Popu- Recent works towards this direction are represented lar SFMs such as OpenAI Whisper [1] and Meta Seam- by OWSM [15] and its subsequent versions [16]. OWSM, lessM4T [2] have been released to the public in various whose model weights and codebase used for the training sizes and with extensive language coverage. However, are released open source, reproduces a Whisper-style these models completely lack comprehensive accessibil- training using publicly available data. Despite repreity to their training codebases and datasets, hindering senting a valuable initiative toward building an opentheir reproducibility and raising concerns about poten- science system, there is still a step missing for creating tial data contamination [3], thereby complicating fair the first SFM of this kind: leveraging only data that is evaluation. not only publicly available but also released under an

In other domains, multiple eforts towards building OS-compliant license [17]. Such efort would allow users models that are more accessible, reproducible, and free complete access and control over the data used at every from proprietary constraints have been made [4, 5, 6, 7, stage of the scientific process, promoting reproducibility 8, 9, 10]. For instance, the OLMO project [11] has demon- [18], fair evaluation [19], and the ability to build upon prior research without any barriers [20]. Besides transCLiC-it 2025: Eleventh Italian Conference on Computational Linguis- parency and collaboration, these eforts also foster users’ tics, September 24 — 26, 2025, Cagliari, Italy trust by ensuring that data is not leveraged to build tools * These authors contributed equally. that can be used under conditions/purposes (e.g., comb$enstpivaop@i@ffbbkk.e.euu(L(S..BPeanptii)v;omgglia);idboru@ttfib@k.febuk.(eMu.(AGa.iBdrou);tti); mercial) for which the data was not intended [14]. cettolo@fbk.eu (M. Cettolo); gretter@fbk.eu (R. Gretter); To fill this gap, we release FAMA,2 the first family matasso@fbk.eu (M. Matassoni); mnabih@fbk.eu (M. Nabih); of large-scale open-science SFMs for English and Italian negri@fbk.eu (M. Negri) trained on over 150k hours of exclusively OS-compliant https://sarapapi.github.io/ (S. Papi); https://mgaido91.github.io/ (M. Gaido)

0000-0002-4494-8886 (S. Papi); 0000-0003-4217-1396 (M. Gaido); 1Open science involves ensuring transparency and accessibility at 0000-0001-7480-2231 (L. Bentivogli); 0000-0003-4146-3071 all stages of the scientific process [ 13], including publishing OS (A. Brutti); 0000-0001-8388-497X (M. Cettolo); 0000-0002-9689-1316 research papers, data, code, and any information needed to replicate (M. Matassoni); 0000-0001-9132-9220 (M. Nabih); the research. 0000-0002-8811-4330 (M. Negri) 2Fama (from the Latin “fari” meaning “to speak”) is the personifica©At2tr0i2b5utCioonpy4r.0igIhnttefornratthioisnpaalp(CerCbByYit4s.0a)u.thors. Use permitted under Creative Commons License tion of the public voice in Roman mythology. speech data. We leverage both already available OS datasets and create a new collection of ASR and ST psuedolabels for Italian and English comprising more than 16k hours of OS-compliant speech, along with automatically generated Italian and English translations for an additional 130k+ hours of speech. We also detail training and evaluation procedures and provide full access to training data to have complete control of the model creation and avoid data contamination issues. FAMA models achieve remarkable results, with up to 4.2 WER and 0.152 COMET improvement on average across languages compared to OWSM and remaining competitive in terms of ASR performance with the Whisper model family while being up to 8 times faster. All the artifacts used for realizing FAMA models, including codebase, datasets, and models themself, are released under OS-compliant licenses, promoting a more responsible creation of models in our community. Our approach would not only facilitate fair evaluation and comparison of SFMs but also encourage broader participation in speech technology development, leading to more inclusive and diverse applications.

CC-BY 4.0 license. The videos are automatically con

verted into wav files with one channel and a sampling rate of 16k Hz. Then, the audio is cleaned from music The artifacts are available at: and non-speech phenomena and segmented using silero [27], a lightweight VAD having low computational reFAMA-medium (878M): quirements. Lastly, to make it suitable for training, the https://hf.co/FBK-MT/fama-medium audio is split using SHAS [28] in segments of around 16 seconds on average. The resulting dataset contains FAMA-small (479M): automatic transcripts, which we created with Whisper https://hf.co/FBK-MT/fama-small large-v3,4 for 14,200 hours of speech for English (en) and 1,828 for Italian (it). Including publicly available data FAMA-medium-asr (878M): (113,951 hours for en, and 22,383 hours for it), the final https://hf.co/FBK-MT/fama-medium-asr ASR training set comprises 128,152 hours of en speech and 24,211 hours of it speech, with a total of 152,363 hours FAMA-small-asr (479M): of speech data, including 48,259 gold-labeled hours. https://hf.co/FBK-MT/fama-small-asr Being composed of speech-transcript pairs, the data mentioned so far is suitable for ASR. For ST, instead, FAMA Training Data: only CoVoST2 and FLEURS contain translations from https://hf.co/datasets/FBK-MT/fama-data and into en and it. For this reason, we automatically translated the transcripts of all the speech data (includFAMA Code: ing the original CoVoST2) with MADLAD-400 3B-MT https://github.com/hlt-mt/FBK-fairseq [29].5 Following [30, 31], we additionally filter out samples based on the ratio between the source and tar2. The FAMA Framework get text lengths (in characters) for each language pair based on their distribution (min = 0.75, max = 1.45 for 2.1. Training and Evaluation Data en-it, and min = 0.65, max = 1.35 for it-en), resulting into 3.41% of data filtering for en-it and 3.12% for it-en.

In compliance with the open-science ideology, we train The final training set (Table 2) comprises the automatiand test our models only on OS-compliant data. The cally translated speech data and the gold CoVoST2 and training set comprises both already publicly available FLEURS datasets, resulting in a total of 147,686 hours for OS datasets, and new pseudolabels created for this work, en-it and it-en. whose list is presented in Table 1. For validation during training, and testing, we use gold

To create the new pseudolabels, we leveraged the labeled benchmarks. ASR evaluation is conducted on speech content of YouTube-Commons,3 a dataset col- CommonVoice, MLS, and VoxPopuli, with CommonVoice lecting YouTube videos released with the permissive

3https://hf.co/datasets/PleIAs/YouTube-Commons 4https://hf.co/openai/whisper-large-v3 5https://hf.co/google/madlad400-3b-mt CommonVoice v18 [21] CoVoST2 [22] - automatic labels LibriSpeech [24] MOSEL [17]

MLS [25] VoxPopuli-ASR [26] YouTube-Commons (our paper) Total (A)

Filtered (A) CoVoST2 [22] - gold labels FLEURS [23]

Total #hours en-it it-en 1746 250 A 420 28 A 358 - A 66,301 21,775 A 44,600 247 A

519 74 A 14,200 1,828 A 128,144 24,202 A 123,777 23,445 A 420 28 G

7 9 G 124,204 23,482 G+A

We train both models using a combination of three losses.

First, a label-smoothed cross-entropy loss (ℒCE) is applied to the decoder output, using the target text as the reference (transcripts for ASR and translations for ST).

Second, a CTC loss [36] is computed using transcripts as reference (ℒCTCsrc) on the output of the 8th encoder layer for small and the 16th for medium. Third, a CTC loss on the final encoder output ( ℒCTCtgt) is applied to predict the target text. The final loss is the weighted sum of the above-mentioned losses:

ℒ = 1ℒCE + 2ℒCTCsrc + 3ℒCTCtgt where 1, 2, 3 = 5.0, 1.0, 2.0, and the label smoothTable 2 ing factor of the CE is 0.1.

ST: List of both publicly available training data and the data FAMA models are trained using a two-stage approach, created in this paper for English-Italian (en-it) and Italian- where the model is pre-trained first on ASR data only English (it-en). “G” stands for gold labels while “A” for auto- (ASR pre-training) and then trained on both ASR and ST matically generated labels (translations). data (ASR+ST training). Both training stages lasted 1M steps, corresponding to ∼ 6 epochs over the training data.

For the ASR pre-training, the learning rate (S1) schedalso serving as the validation set for both en and it. For uler adopted to train the small model is the Noam schedtranslation, we use CoVoST2 for it-en and FLEURS dev uler [33] with a peak of 2e-3 and 25,000 warm-up steps. and test sets for en-it. To cope with convergence issues similar to [16], for the medium model we adopted a piece-wise warm-up on the 2.2. Model Architecture Noam scheduler, with the learning rate first increasing linearly to 2e-5 for 25k steps and then to 2e-4 for an addiFAMA models are two-sized encoder-decoder architec- tional 25k steps, followed by the standard inverse square tures, small and medium. Both models are composed root function. For the ASR+ST training, we sample the of a Conformer encoder [32] and a Transformer decoder ASR target with probability ASR=0.5 and use the ST tar[33]. FAMA small has 12 encoder layers and 6 decoder get otherwise. Training settings are the same as for ASR layers, while FAMA medium has 24 encoder layers and 12 pre-training, except for the learning rate that is set to a decoder layers. Our decision to use an encoder twice as constant value S2=1e-4. Experiments on how ASR and deep as the decoder–unlike Whisper and OWSM, which S2 are determined for the small model are discussed have an equal number of encoder and decoder layers–is in Section 3.1. For the medium model, similarly to the driven by two key motivations: i) since autoregressive ifrst stage, the S2 is scaled down by one order of magnimodels perform multiple decoder passes during output tude compared to the small model, i.e., a constant value generation, a shallower decoder speeds up inference by S2=1e-5 is used. making each pass faster, and ii) since many approaches The optimizer is AdamW with momentum 1, 2 = integrate SFMs with LLMs by leveraging the encoder 0.9, 0.98, a weight decay of 0.001, a dropout of 0.1, and [34], a deeper encoder helps preserve more of the SFMs clip normalization of 10.0. We apply SpecAugment [37] processing capabilities in such integrations. Each layer during both ASR pre-training and ASR+ST training. We has 16 attention heads, an embedding dimension of 1,024, use mini-batches of 10,000 tokens for FAMA small and and an FFN dimension of 4,096. 4,500 for FAMA medium with an update frequency of,

The Conformer encoder is preceded by two 1D convo- respectively, 2 and 6 on 16 NVIDIA A100 GPUs (64GB lutional layers with a stride of 2 and a kernel size of 5. RAM), save checkpoints every 1,000 steps and average The kernel size of the Conformer convolutional module is the last 25 checkpoints to obtain the final model. 31 for both the point- and depth-wise convolutions. The The inference is performed using a single NVIDIA vocabulary is built using a SentencePiece unigram model A100 GPU with a batch size of 80,000 tokens. We use [35] with size 16,000 trained on en and it transcripts. Two beam search with beam 5, unknown penalty of 10,000, extra tokens–<lang:en> and <lang:it>–are added to and no-repeat n-gram size of 5. Additionally, we report indicate whether the target text is in en or it. The in- the results using the joint CTC rescoring [38], leveraging put audio is represented by 80 Mel-filterbank features the CTC on the encoder output with weight 0.2. Both extracted every 10 ms with a window of 25 ms. training and inference are done using the bug-free Conformer implementation [39] available in FBK-fairseq,6 which is built upon fairseq-S2T [40]. ASR performance is evaluated with word error rate (WER) using the jiWER library7 with the text normalized using Whisper normalizer8. ST performance is evaluated using COMET [41] version 2.2.4, with the default Unbabel/wmt22-comet-da model. 2.4. Terms of Comparison As a first term of comparison, we use Whisper [ 1] in both medium9 and large-v3 configurations as the first is comparable with FAMA medium in terms of size and the second–trained on more than 4M hours—is the best performing model of the Whisper family. The comparison is made for en and it ASR and it-en ST, as Whisper does not cover the en-to-many translation directions. Whisper models are released under Apache 2.0 license and, therefore, open weights. For both ASR and ST, we also compare with SeamlessM4T medium10 and v2-large11 covering ASR and both ST language directions [2]. The model is non-commercial and, therefore, not open. We also compare with OWSM v3.1 medium12, the best performing model of the OWSM family, also covering ASR and both ST language directions and released open source [16].

To ensure a fair comparison, we perform the inference with HuggingFace transformers13 version 4.48.1 using the standard settings and beam search with beam 5, except for OWSM, which is not supported on HuggingFace, and for which the original ESPNet14 inference code is used with a beam size of 3.15

3. Results

3.1. Pre-training and Catastrophic

Forgetting the conditions in which this phenomenon arises during the ASR+ST training.

1.8 6https://github.com/hlt-mt/FBK-fairseq 7https://pypi.org/project/jiwer/ 8https://pypi.org/project/whisper-normalizer/ 9https://hf.co/openai/whisper-medium 10https://hf.co/facebook/hf-seamless-m4t-medium 11https://hf.co/facebook/seamless-m4t-v2-large 12https://hf.co/espnet/owsm_v3.1_ebf 13https://pypi.org/project/transformers/ 14https://github.com/espnet/espnet/tree/master/egs2/owsm_v3.1/

s2t1 15We attempted to use a beam size of 5 but the model had out-ofmemory issues even when reducing the batch size.

Figure 1 shows the perplexity (ppl) behavior during the ifrst 100/500k steps of the FAMA small model training on the validation sets. We present the results of diferent systems obtained by varying both the learning rate S2 and the sampling probability ASR discussed in Section 2.3. Lower values of S2 (e.g., 1e-5) lead to worse performance and are not included in the results. Since the computational budget for our experiments is limited, we analyze two cases for the sampling probability: 1) ASR=0.5 to obtain a system equally trained on both ASR and ST tasks, and 2) ASR=0.2 to obtain a system trained Whisper medium Whisper large-v3 OWSM v3.1 medium SeamlessM4T medium SeamlessM4T v2-large FAMA-ASR small

+ joint CTC rescoring FAMA-ASR medium

+ joint CTC rescoring FAMA small

+ joint CTC rescoring FAMA medium + joint CTC rescoring

MLS

AVG

ASR (WER ↓) more on the unseen task during pre-training, i.e., the ST model (FAMA-ASR), obtained after pre-training, and of task. the final ASR+ST model, as well as the results obtained

As we can see from the curves, a S2 of 1e-3 seems through joint CTC rescoring. to be too high for maintaining good ASR performance Looking at the results of FAMA-ASR, we observe that while learning a new task (ST). Both in the case in which the medium model outperforms the small one, with the ST training is more boosted (ASR=0.2) and in the ∼ 0.8 WER improvements on average both with and case in which ASR and ST training is balanced (ASR=0.5), without the joint CTC rescoring. Compared to Whiswe notice a significant increase in the ASR ppl of up per medium, FAMA achieves better results with FAMA to 0.25 that corresponds to a drop in performance of medium outperforming Whisper by 4.4 WER on en and 3-4 WER on both languages – which, moreover, is not 6.4 on it while having a similar number of model paramrecovered later on in the training. Therefore, to avoid eters. Remarkable performance is achieved by FAMA catastrophic forgetting arising just in the first steps, we medium also compared to OWSM v3.1 medium, with imexclude S2=1e-3 and use 1e-4 for the two-stage training. provements of up to 1.1 WER on en and 7.3 on it, but also Regarding the ASR sampling, we look at the behavior compared to Whisper large-v3, where similar WER of the curves for 500k steps (half of the second-stage scores are achieved. Instead, SeamlessM4T models, levertraining) and notice that the ASR ppl curve with ASR=0.5 aging large pretrained models such as wav2vec-BERT slowly approaches the original model ppl value while 2.0 (which is trained on 4.5 million hours) and NLLB the one with ASR=0.2, despite improving, is not able to (which is trained on more than 43 billion sentences), still approach the original ppl value. This is counterbalanced outperform FAMA, with the v2-large scoring an inby a lower (hence, better) ppl of the ASR=0.2 curve on credibly low WER on CommonVoice also compared to ST compared to that of the ASR=0.5 curve. However, this a strong competitor as Whisper large-v3. Looking at diference, which is about ∼ 0.2 ppl, is not reflected in the the ASR results of the final FAMA models, we observe ST performance, which only improves by 0.005 COMET that the WER remained almost unaltered compared to points on average. Instead, the diference in terms of the ASR-only model, as already discussed in Section 3.1. WER is significant, with a quality drop of ∼ 0.8 WER Regarding ST results, we notice that FAMA models outacross en and it. As a result, we conclude that we avoid perform OWSM v3.1 medium, with an improvement of catastrophic forgetting in the two-stage training only by up to 0.141 COMET by FAMA small and 0.152 by FAMA evenly sampling the ASR and ST tasks during the second medium while still struggling to achieve the performance step. of Whisper and SeamlessM4T.

These mixed outcomes–competitive ASR performance 3.2. Comparison with Existing SFMs even against larger non-open models but lower ST performance–demonstrate both the feasibility of buildIn Table 3, we show the results for both ASR and ST of ing high-quality open-science SFMs and the need for iniour FAMA models and SFMs presented in Section 2.4. For tiatives dedicated to creating OS-compliant ST datasets FAMA models, we provide the scores of the ASR-only with human references to bridge the gap with non-open

Whisper medium

Whisper large-v3 SeamlessM4T medium SeamlessM4T v2-large FAMA small FAMA medium 3.3. Computational Time As an additional comparison, we evaluate the throughput of the SFMs on a single NVIDIA A40 40GB. The throughput, measured in xRTF (the inverse of the realtime factor),16 is calculated as the number of seconds Table 5 of processed audio divided by the compute time in sec- Absolute WER quality gaps between female and male subsets, onds. The test set used for this performance evaluation divided into read (Gap R) and spontaneous (Gap S) speech. is CommonVoice on both en and it with a total duration of, respectively, 26.9 and 26.4 hours. For each model, we report the maximum batch size possible spanning in the between female WER and male WER scores obtained on range 2, 4, 8, and 16, as higher values resulted in out-of- CommonVoice 17 and VoxPopuli. memory issues with all models. The results are reported We can observe that FAMA-ASR small obtained the in Table 4. smallest–hence, best–performance gap between male and

We notice that Whisper models are the slowest ones, feminine transcription from read speech, with a gap bewith an average xRTF of 12.1 for medium and 7.2 for ing an order of magnitude smaller than all other models. large-v3, making them ∼ 3-6 times slower than FAMA When moving to the spontaneous speech, instead, Whismedium and ∼ 5-8 than FAMA small. These results can per large-v3 obtains the best result. Overall, Whisper be attributed to the architectural design of Whisper mod- achieves the best average result, followed by FAMA-ASR els that apply an × 2 audio subsampling compared to small and FAMA medium, which are the only models the commonly used × 4 (as in FAMA) and introduce a scoring less than a 1.0 WER diference. All FAMA models lot of padding in shorter sequences to achieve the fixed can outperform Seamless M4T v2-large, achieving an 30-second length. The Seamless models, despite having average gap reduction of 0.16 to 0.52. no extra padding (as FAMA) and a greater audio subsampling of × 8, are ∼ 2 times faster than Whisper ones but still 1.5-3 times slower for, respectively, medium and 4. Conclusions v2-large, compared to FAMA medium and 2-4 compared to FAMA small, making the FAMA model family the fastest by a large margin.

In this paper, we addressed the challenges posed by the

closed nature of existing SFMs, such as limited accessibility to training data and codebases, by introducing FAMA, the first large-scale open-science SFM for English and 3.4. Gender Bias Analysis Italian. Trained on over 150k hours of exclusively OS We also measure the gender bias disparity between male speech, FAMA ensures full transparency, with all artiand female performance using the ASR benchmark pro- facts released under OS-compliant licenses. Additionally, posed by Attanasio et al. [44]. The results are presented in we contributed a new collection of ASR and ST pseuTable 517 and are measured as absolute performance gaps dolabels for about 16k hours of speech data, and more than 130k hours of English and Italian automatic translations. Results show that FAMA models outperform OWSM on both ASR and ST and also achieve comparable ASR results to Whisper while being up to 8 times faster. By providing the community with fully accessible 16https://github.com/NVIDIA/DeepLearningExamples/blob/

master/Kaldi/SpeechRecognition/README.md#metrics 17Results and per-language statistics are available on the original leaderboard: https://huggingface.co/spaces/g8a9/ fair-asr-leaderboard resources, FAMA bridges the gap between advances in speech technology and open science principles, enabling fair evaluation, broader participation, and inclusivity. Future work will focus on extending FAMA to additional languages with the ultimate goal of further expanding the open science ecosystem to speech technologies.

Acknowledgments This paper has received funding from the PNRR project

FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU, and from the European Union’s Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETings BetWEEN People). We acknowledge CINECA for the availability of high-performance computing resources and support. llms, in: First Conference on Language Modeling, 2024. [7] Q. Sun, Y. Luo, S. Li, W. Zhang, W. Liu, OpenOmni:

A collaborative open source tool for building futureready multimodal conversational agents, in: D. I.

Hernandez Farias, T. Hope, M. Li (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics,

Miami, Florida, USA, 2024, pp. 46–52. [8] M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S.

Park, M. Salehi, N. Muennighof, K. Lo, L. Soldaini, et al., Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models, arXiv preprint arXiv:2409.17146 (2024). [9] W. Dai, N. Lee, B. Wang, Z. Yang, Z. Liu, J. Barker,

T. Rintamaki, M. Shoeybi, B. Catanzaro, W. Ping, Nvlm: Open frontier-class multimodal llms, arXiv preprint arXiv:2409.11402 (2024). [10] P. H. Martins, P. Fernandes, J. Alves, N. M. Guerreiro, R. Rei, D. M. Alves, J. Pombal, A. Farajian, M. Faysse, M. Klimaszewski, P. Colombo, B. Had[1] A. Radford, J. W. Kim, T. Xu, G. Brockman, dow, J. G. de Souza, A. Birch, A. F. Martins, Eurollm: C. McLeavey, I. Sutskever, Robust speech recog- Multilingual language models for europe, Procedia nition via large-scale weak supervision, in: Inter- Computer Science 255 (2025) 53–62. Proceedings of national Conference on Machine Learning, PMLR, the Second EuroHPC user day.

2023, pp. 28492–28518. [11] D. Groeneveld, et al., OLMo: Accelerating the sci[2] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, ence of language models, in: L.-W. Ku, A. Martins, N. Dong, P.-A. Duquenne, H. Elsahar, H. Gong, V. Srikumar (Eds.), Proceedings of the 62nd AnK. Hefernan, J. Hofman, et al., Seamlessm4t: Mas- nual Meeting of the Association for Computational sively multilingual & multimodal machine transla- Linguistics (Volume 1: Long Papers), Association tion, arXiv preprint arXiv:2308.11596 (2023). for Computational Linguistics, Bangkok, Thailand, [3] Y. Dong, X. Jiang, H. Liu, Z. Jin, B. Gu, M. Yang, G. Li, 2024, pp. 15789–15809.

Generalization or memorization: Data contamina- [12] L. Soldaini, et al., Dolma: an open corpus of three tion and trustworthy evaluation for large language trillion tokens for language model pretraining remodels, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), search, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Lin- Proceedings of the 62nd Annual Meeting of the Asgui stics: ACL 2024 , Association for Computational sociation for Computational Linguistics (Volume 1: Linguistics, Bangkok, Thailand, 2024, pp. 12039– Long Papers) , Association for Computational Lin12050. gui stics, Bangkok, Thailand, 2024 , pp. 15725–15788. [4] BigScience Workshop, T. L. Scao, A. Fan, C. Akiki, [13] R. Vicente-Saez, C. Martinez-Fuentes, Open sciE. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. ence now: A systematic literature review for an Luccioni, F. Yvon, et al., Bloom: A 176b-parameter integrated definition, Journal of Business Research open-access multilingual language model, arXiv 88 (2018) 428–436.

preprint arXiv:2211.05100 (2022). [14] M. White, I. Haddad, C. Osborne, X.-Y. Y. Liu, A. Ab[5] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, delmonsef, S. Varghese, A. L. Hors, The model K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, openness framework: Promoting completeness and U. S. Prashanth, E. Raf, A. Skowron, L. Sutawika, openness for reproducibility, transparency, and usO. Van Der Wal, Pythia: a suite for analyzing large ability in artificial intelligence, arXiv preprint language models across training and scaling, in: arXiv:2403.13784 (2024).

Proceedings of the 40th International Conference [15] Y. Peng, J. Tian, B. Yan, D. Berrebbi, X. Chang, X. Li, on Machine Learning, ICML’23, JMLR.org, 2023. J. Shi, S. Arora, W. Chen, R. Sharma, W. Zhang, [6] Z. Liu, A. Qiao, W. Neiswanger, H. Wang, B. Tan, Y. Sudo, M. Shakeel, J.-W. Jung, S. Maiti, S. WatanT. Tao, J. Li, Y. Wang, S. Sun, O. Pangarkar, et al., abe, Reproducing whisper-style training using an Llm360: Towards fully transparent open-source open-source toolkit and publicly available data, in: During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order to: Paraphrase and reword, Improve writing style, and Grammar and spelling check. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.

2023

IEEE

Automatic Speech Recognition and Un- [25]

Pratap ,

Xu ,

Sriram , G. Synnaeve, R. Col-

derstanding Workshop (ASRU) , 2023 , pp. 1 - 8 . lobert, MLS:

Large-Scale Multilingual Dataset for [16]

Peng ,

Tian ,

Chen ,

Arora ,

Yan ,

Sudo , Speech Research, in: Proc. Interspeech 2020 , 2020 ,

Shakeel ,

Choi ,

Shi ,

Chang , J. weon pp. 2757 - 2761 .

Jung , S.

Watanabe , Owsm v3 . 1: Better and faster [26]

Wang ,

Riviere ,

Lee ,

Wu , C. Talnikar,

branchformer, in: Interspeech 2024 , 2024 , pp. 352 - Populi : A large-scale multilingual speech corpus

356. for representation learning , semi-supervised learn[17]

Gaido ,

Papi ,

Bentivogli ,

Brutti , M. Cet- ing and interpretation , in: C. Zong , F.

Xia , W.

Li ,

tolo , R. Gretter, M.

Matassoni , M.

Nabih , M.

Ne- R. Navigli (Eds.), Proceedings of the 59th Annual

gri , MOSEL: 950 , 000 hours of speech data for open- Meeting of the Association for Computational Lin-

source speech foundation model training on EU guistics and the 11th

International Joint Conference

languages, in: Y. Al-Onaizan , M.

Bansal , Y.-N.

Chen on Natural Language Processing (Volume 1: Long

(Eds.), Proceedings of the 2024 Conference on Em- Papers) , Association for Computational Linguistics,

pirical Methods in Natural Language Processing , Online, 2021 , pp. 993 - 1003 .

Association for Computational

Linguistics

, Miami, [27]

Team , Silero vad: pre-trained enterprise-grade

Florida , USA, 2024 , pp. 13934 - 13947 . voice activity detector (vad), number detector and [18]

Belz ,

Thomson ,

Reiter ,

Mille , Non- language classifier, https://github.com/snakers4/

repeatable experiments and non-reproducible re- silero-

vad , 2024 .

sults: The reproducibility crisis in human eval- [28]

Tsiamas ,

G. I.

Gállego ,

J. A. R.

Fonollosa , M. R.

Computational

Linguistics: ACL

2023 , Association speech 2022 , 2022 , pp. 106 - 110 .

for Computational

Linguistics

, Toronto, Canada, [29]

Kudugunta ,

Caswell ,

Zhang , X. Garcia,

2023 , pp. 3676 - 3687 . D. Xin , A.

Kusupati , R.

Stella , A.

Bapna , O.

Firat , [19] S.

Balloccu , P.

Schmidtová , M.

Lango , O.

Dusek , Madlad- 400 : a multilingual and document-level

Leak, cheat, repeat: Data contamination and evalua- large audited dataset , in: Proceedings of the 37th

ham , M. Purver (Eds.), Proceedings of the 18th Con- Processing Systems , NIPS '23, Curran Associates

ference of the European Chapter of the Association Inc ., Red

Hook

, NY , USA, 2023 .

for Computational Linguistics (Volume 1 : Long Pa- [30]

Gaido ,

Papi ,

Fucci , G. Fiameni, M. Negri,

St . Julian's, Malta , 2024 , pp. 67 - 93 . tion: FBK@IWSLT2022, in: E. Salesky,

Federico , [20]

Chesbrough , From open science to open inno- M. Costa-jussà (Eds.), Proceedings of the 19th Inter-

Management , ESADE ( 2015 ). tion (IWSLT 2022 ), Association for Computational [21]

Ardila ,

Branson ,

Davis ,

Henretty , Linguistics, Dublin, Ireland (in-person and online),

Kohler , J. Meyer,

Morais ,

Saunders , F. M. 2022 , pp. 177 - 189 .

Tyers , G.

Weber, Common voice: A massively- [31] M. M. I. Alam , A. Anastasopoulos , A case study on

12th Conference on Language Resources and Eval- preprint arXiv:2402 . 01945 ( 2024 ).

uation (LREC 2020 ), 2020 , pp. 4211 - 4215 . [32]

Gulati ,

Qin , C.-C. Chiu , N.

Parmar , Y.

Zhang , [22] C.

Wang , A.

Wu , J.

Gu , J. Pino,

CoVoST 2 and Mas- J. Yu , W. Han, S.

Wang , Z.

Zhang , Y. Wu, R. Pang,

Interspeech 2021 , 2021 , pp. 2247 - 2251 . for Speech Recognition , in: Proc. Interspeech 2020 , [23]

Conneau , M. Ma, S. Khanuja,

Zhang , V. Axel- 2020, pp. 5036 - 5040 .

rod , S. Dalmia, J.

Riesa , C.

Rivera , A.

Bapna , Fleurs: [33] A.

Vaswani , N.

Shazeer , N.

Parmar , J. Uszkoreit,

tations of speech, in: 2022 IEEE Spoken Language tention is all you need , in: I. Guyon,

U. V.

Luxburg ,

Technology Workshop (SLT) , 2023 , pp. 798 - 805 . S. Bengio,

Wallach ,

Fergus , S. Vishwanathan, [24]

Panayotov ,

Chen ,

Povey ,

Khudanpur , Lib- R. Garnett (Eds.), Advances in Neural Informa-

rispeech: An ASR corpus based on public domain tion Processing Systems , volume 30 , 2017 . URL:

audio books , in: 2015 IEEE International Confer- https://proceedings.neurips.cc/paper/2017/file/

ence on Acoustics, Speech and Signal Processing 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

(ICASSP) , 2015 , pp. 5206 - 5210 . [34]

Gaido ,

Papi ,

Negri ,

Bentivogli , Speech

translation with speech foundation models and 2702 .

large language models: What is there and what [42] M. McCloskey , N. J. Cohen , Catastrophic inter-

(Eds.), Proceedings of the 62nd Annual Meeting learning problem , volume 24 of Psychology of Learn-

of the Association for Computational Linguistics ing and Motivation, 1989 .

(Volume 1 : Long

Papers)

, Association for Compu- [43]

Kar , G. Castellucci,

Filice , S. Malmasi,

tational Linguistics , Bangkok, Thailand, 2024 ,

pp. O.

Rokhlenko , Preventing catastrophic forgetting

14760- 14778 . in continual learning of new natural language tasks , [35]

Kudo , J. Richardson, SentencePiece: A simple in : Proceedings of the 28th ACM SIGKDD Confer-

detokenizer for neural text processing , in: E. Blanco, 2022 , pp. 3137 - 3145 .

Lu (Eds.), Proceedings of the 2018 Conference [44]

Attanasio ,

Savoldi ,

Fucci ,

Hovy , Twists,

Computational

Linguistics , Brussels, Belgium, 2018 , Proceedings of the 2024 Conference on Empirical

pp. 66 - 71 . Methods in Natural Language Processing, Associa[36]

Graves ,

Fernández ,

Gomez ,

Schmidhuber , tion for Computational Linguistics , Miami, Florida,

Connectionist temporal classification: Labelling un - USA, 2024 , pp. 21318 - 21340 .

works, in: Proceedings of the 23rd International

Conference on Machine Learning , ICML ' 06 , New

York , NY, USA, 2006 , p. 369 - 376 . [37]

D. S.

Park ,

Chan ,

Zhang , C.- C. Chiu , B. Zoph ,

Recognition , in : Proc. Interspeech 2019 , 2019 , pp.

2613- 2617 . [38]

Yan ,

Dalmia ,

Higuchi ,

Neubig ,

Metze ,

Augenstein (Eds.), Proceedings of the 17th Confer-

tational Linguistics , Dubrovnik, Croatia, 2023 , pp.

1623- 1639 . [39]

Papi ,

Gaido ,

Pilzer ,

Negri , When good

ings of the 62nd Annual Meeting of the Association

for Computational Linguistics (Volume 1 : Long Pa-

Bangkok , Thailand, 2024 , pp. 3657 - 3672 . [40]

Wang ,

Tang ,

Ma ,

Wu ,

Okhonko ,

Pino , fairseq S2T: Fast speech-to-text modeling

with fairseq , in: Proceedings of the 2020 Confer-

strations , 2020 . [41]

Rei ,

Stewart ,

A. C.

Farinha ,

Lavie , COMET:

the 2020 Conference on Empirical Methods in Natu-

Computational

Linguistics , Online, 2020 , pp. 2685 -