<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>: The First Large-Scale Open-Science Speech Foundation Model for English and Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sara Papi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Gaido</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luisa Bentivogli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Brutti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Cettolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Gretter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Matassoni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed Nabih</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Negri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, 38123 Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature-with inaccessible training data and code-poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar eforts in speech processing remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including codebase, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research. The FAMA collection is available at: https://huggingface.co/collections/FBK-MT/fama-683425df3fb2b3171e0cdc9e</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;speech</kwd>
        <kwd>automatic speech recognition</kwd>
        <kwd>speech translation</kwd>
        <kwd>ASR</kwd>
        <kwd>ST</kwd>
        <kwd>open science</kwd>
        <kwd>open source</kwd>
        <kwd>speech foundation model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>strated the feasibility of training large language models
(LLMs) using only open-source (OS) data [12], realizing
The development of speech foundation models (SFMs) an open-science1 system [14] for text processing.
Howhas significantly advanced speech processing in the last ever, such comprehensive approaches are still lacking in
few years, particularly in areas such as automatic speech the field of speech processing.
recognition (ASR) and speech translation (ST). Popu- Recent works towards this direction are represented
lar SFMs such as OpenAI Whisper [1] and Meta Seam- by OWSM [15] and its subsequent versions [16]. OWSM,
lessM4T [2] have been released to the public in various whose model weights and codebase used for the training
sizes and with extensive language coverage. However, are released open source, reproduces a Whisper-style
these models completely lack comprehensive accessibil- training using publicly available data. Despite
repreity to their training codebases and datasets, hindering senting a valuable initiative toward building an
opentheir reproducibility and raising concerns about poten- science system, there is still a step missing for creating
tial data contamination [3], thereby complicating fair the first SFM of this kind: leveraging only data that is
evaluation. not only publicly available but also released under an</p>
      <p>In other domains, multiple eforts towards building OS-compliant license [17]. Such efort would allow users
models that are more accessible, reproducible, and free complete access and control over the data used at every
from proprietary constraints have been made [4, 5, 6, 7, stage of the scientific process, promoting reproducibility
8, 9, 10]. For instance, the OLMO project [11] has demon- [18], fair evaluation [19], and the ability to build upon
prior research without any barriers [20]. Besides
transCLiC-it 2025: Eleventh Italian Conference on Computational Linguis- parency and collaboration, these eforts also foster users’
tics, September 24 — 26, 2025, Cagliari, Italy trust by ensuring that data is not leveraged to build tools
* These authors contributed equally. that can be used under conditions/purposes (e.g.,
comb$enstpivaop@i@ffbbkk.e.euu(L(S..BPeanptii)v;omgglia);idboru@ttfib@k.febuk.(eMu.(AGa.iBdrou);tti); mercial) for which the data was not intended [14].
cettolo@fbk.eu (M. Cettolo); gretter@fbk.eu (R. Gretter); To fill this gap, we release FAMA,2 the first family
matasso@fbk.eu (M. Matassoni); mnabih@fbk.eu (M. Nabih); of large-scale open-science SFMs for English and Italian
negri@fbk.eu (M. Negri) trained on over 150k hours of exclusively OS-compliant
 https://sarapapi.github.io/ (S. Papi); https://mgaido91.github.io/
(M. Gaido)</p>
      <p>0000-0002-4494-8886 (S. Papi); 0000-0003-4217-1396 (M. Gaido); 1Open science involves ensuring transparency and accessibility at
0000-0001-7480-2231 (L. Bentivogli); 0000-0003-4146-3071 all stages of the scientific process [ 13], including publishing OS
(A. Brutti); 0000-0001-8388-497X (M. Cettolo); 0000-0002-9689-1316 research papers, data, code, and any information needed to replicate
(M. Matassoni); 0000-0001-9132-9220 (M. Nabih); the research.
0000-0002-8811-4330 (M. Negri) 2Fama (from the Latin “fari” meaning “to speak”) is the
personifica©At2tr0i2b5utCioonpy4r.0igIhnttefornratthioisnpaalp(CerCbByYit4s.0a)u.thors. Use permitted under Creative Commons License tion of the public voice in Roman mythology.
speech data. We leverage both already available OS
datasets and create a new collection of ASR and ST
psuedolabels for Italian and English comprising more
than 16k hours of OS-compliant speech, along with
automatically generated Italian and English translations for
an additional 130k+ hours of speech. We also detail
training and evaluation procedures and provide full access to
training data to have complete control of the model
creation and avoid data contamination issues. FAMA models
achieve remarkable results, with up to 4.2 WER and 0.152
COMET improvement on average across languages
compared to OWSM and remaining competitive in terms of
ASR performance with the Whisper model family while
being up to 8 times faster. All the artifacts used for
realizing FAMA models, including codebase, datasets, and
models themself, are released under OS-compliant licenses,
promoting a more responsible creation of models in our
community. Our approach would not only facilitate fair
evaluation and comparison of SFMs but also encourage
broader participation in speech technology development,
leading to more inclusive and diverse applications.</p>
      <sec id="sec-1-1">
        <title>CC-BY 4.0 license. The videos are automatically con</title>
        <p>verted into wav files with one channel and a sampling
rate of 16k Hz. Then, the audio is cleaned from music
The artifacts are available at: and non-speech phenomena and segmented using silero
[27], a lightweight VAD having low computational
reFAMA-medium (878M): quirements. Lastly, to make it suitable for training, the
https://hf.co/FBK-MT/fama-medium audio is split using SHAS [28] in segments of around
16 seconds on average. The resulting dataset contains
FAMA-small (479M): automatic transcripts, which we created with Whisper
https://hf.co/FBK-MT/fama-small large-v3,4 for 14,200 hours of speech for English (en)
and 1,828 for Italian (it). Including publicly available data
FAMA-medium-asr (878M): (113,951 hours for en, and 22,383 hours for it), the final
https://hf.co/FBK-MT/fama-medium-asr ASR training set comprises 128,152 hours of en speech
and 24,211 hours of it speech, with a total of 152,363 hours
FAMA-small-asr (479M): of speech data, including 48,259 gold-labeled hours.
https://hf.co/FBK-MT/fama-small-asr Being composed of speech-transcript pairs, the data
mentioned so far is suitable for ASR. For ST, instead,
FAMA Training Data: only CoVoST2 and FLEURS contain translations from
https://hf.co/datasets/FBK-MT/fama-data and into en and it. For this reason, we automatically
translated the transcripts of all the speech data
(includFAMA Code: ing the original CoVoST2) with MADLAD-400 3B-MT
https://github.com/hlt-mt/FBK-fairseq [29].5 Following [30, 31], we additionally filter out
samples based on the ratio  between the source and
tar2. The FAMA Framework get text lengths (in characters) for each language pair
based on their distribution (min = 0.75, max = 1.45 for
2.1. Training and Evaluation Data en-it, and min = 0.65, max = 1.35 for it-en), resulting
into 3.41% of data filtering for en-it and 3.12% for it-en.</p>
        <p>In compliance with the open-science ideology, we train The final training set (Table 2) comprises the
automatiand test our models only on OS-compliant data. The cally translated speech data and the gold CoVoST2 and
training set comprises both already publicly available FLEURS datasets, resulting in a total of 147,686 hours for
OS datasets, and new pseudolabels created for this work, en-it and it-en.
whose list is presented in Table 1. For validation during training, and testing, we use
gold</p>
        <p>To create the new pseudolabels, we leveraged the labeled benchmarks. ASR evaluation is conducted on
speech content of YouTube-Commons,3 a dataset col- CommonVoice, MLS, and VoxPopuli, with CommonVoice
lecting YouTube videos released with the permissive</p>
      </sec>
      <sec id="sec-1-2">
        <title>3https://hf.co/datasets/PleIAs/YouTube-Commons</title>
      </sec>
      <sec id="sec-1-3">
        <title>4https://hf.co/openai/whisper-large-v3 5https://hf.co/google/madlad400-3b-mt</title>
        <sec id="sec-1-3-1">
          <title>CommonVoice v18 [21] CoVoST2 [22] - automatic labels LibriSpeech [24] MOSEL [17]</title>
          <p>MLS [25]
VoxPopuli-ASR [26]
YouTube-Commons (our paper)
Total (A)</p>
          <p>Filtered (A)
CoVoST2 [22] - gold labels
FLEURS [23]</p>
          <p>Total
#hours
en-it it-en
1746 250 A
420 28 A
358 - A
66,301 21,775 A
44,600 247 A</p>
          <p>519 74 A
14,200 1,828 A
128,144 24,202 A
123,777 23,445 A
420 28 G</p>
          <p>7 9 G
124,204 23,482 G+A</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>We train both models using a combination of three losses.</title>
        <p>First, a label-smoothed cross-entropy loss (ℒCE) is
applied to the decoder output, using the target text as the
reference (transcripts for ASR and translations for ST).</p>
        <p>Second, a CTC loss [36] is computed using transcripts as
reference (ℒCTCsrc) on the output of the 8th encoder layer
for small and the 16th for medium. Third, a CTC loss
on the final encoder output ( ℒCTCtgt) is applied to predict
the target text. The final loss is the weighted sum of the
above-mentioned losses:</p>
        <p>ℒ =  1ℒCE +  2ℒCTCsrc +  3ℒCTCtgt
where  1,  2,  3 = 5.0, 1.0, 2.0, and the label
smoothTable 2 ing factor of the CE is 0.1.</p>
        <p>ST: List of both publicly available training data and the data FAMA models are trained using a two-stage approach,
created in this paper for English-Italian (en-it) and Italian- where the model is pre-trained first on ASR data only
English (it-en). “G” stands for gold labels while “A” for auto- (ASR pre-training) and then trained on both ASR and ST
matically generated labels (translations).
data (ASR+ST training). Both training stages lasted 1M
steps, corresponding to ∼ 6 epochs over the training data.</p>
        <p>For the ASR pre-training, the learning rate (S1)
schedalso serving as the validation set for both en and it. For uler adopted to train the small model is the Noam
schedtranslation, we use CoVoST2 for it-en and FLEURS dev uler [33] with a peak of 2e-3 and 25,000 warm-up steps.
and test sets for en-it. To cope with convergence issues similar to [16], for the
medium model we adopted a piece-wise warm-up on the
2.2. Model Architecture Noam scheduler, with the learning rate first increasing
linearly to 2e-5 for 25k steps and then to 2e-4 for an
addiFAMA models are two-sized encoder-decoder architec- tional 25k steps, followed by the standard inverse square
tures, small and medium. Both models are composed root function. For the ASR+ST training, we sample the
of a Conformer encoder [32] and a Transformer decoder ASR target with probability ASR=0.5 and use the ST
tar[33]. FAMA small has 12 encoder layers and 6 decoder get otherwise. Training settings are the same as for ASR
layers, while FAMA medium has 24 encoder layers and 12 pre-training, except for the learning rate that is set to a
decoder layers. Our decision to use an encoder twice as constant value S2=1e-4. Experiments on how ASR and
deep as the decoder–unlike Whisper and OWSM, which S2 are determined for the small model are discussed
have an equal number of encoder and decoder layers–is in Section 3.1. For the medium model, similarly to the
driven by two key motivations: i) since autoregressive ifrst stage, the S2 is scaled down by one order of
magnimodels perform multiple decoder passes during output tude compared to the small model, i.e., a constant value
generation, a shallower decoder speeds up inference by S2=1e-5 is used.
making each pass faster, and ii) since many approaches The optimizer is AdamW with momentum  1,  2 =
integrate SFMs with LLMs by leveraging the encoder 0.9, 0.98, a weight decay of 0.001, a dropout of 0.1, and
[34], a deeper encoder helps preserve more of the SFMs clip normalization of 10.0. We apply SpecAugment [37]
processing capabilities in such integrations. Each layer during both ASR pre-training and ASR+ST training. We
has 16 attention heads, an embedding dimension of 1,024, use mini-batches of 10,000 tokens for FAMA small and
and an FFN dimension of 4,096. 4,500 for FAMA medium with an update frequency of,</p>
        <p>The Conformer encoder is preceded by two 1D convo- respectively, 2 and 6 on 16 NVIDIA A100 GPUs (64GB
lutional layers with a stride of 2 and a kernel size of 5. RAM), save checkpoints every 1,000 steps and average
The kernel size of the Conformer convolutional module is the last 25 checkpoints to obtain the final model.
31 for both the point- and depth-wise convolutions. The The inference is performed using a single NVIDIA
vocabulary is built using a SentencePiece unigram model A100 GPU with a batch size of 80,000 tokens. We use
[35] with size 16,000 trained on en and it transcripts. Two beam search with beam 5, unknown penalty of 10,000,
extra tokens–&lt;lang:en&gt; and &lt;lang:it&gt;–are added to and no-repeat n-gram size of 5. Additionally, we report
indicate whether the target text is in en or it. The in- the results using the joint CTC rescoring [38], leveraging
put audio is represented by 80 Mel-filterbank features the CTC on the encoder output with weight 0.2. Both
extracted every 10 ms with a window of 25 ms. training and inference are done using the bug-free
Conformer implementation [39] available in FBK-fairseq,6
which is built upon fairseq-S2T [40]. ASR performance
is evaluated with word error rate (WER) using the jiWER
library7 with the text normalized using Whisper
normalizer8. ST performance is evaluated using COMET [41]
version 2.2.4, with the default Unbabel/wmt22-comet-da
model.
2.4. Terms of Comparison
As a first term of comparison, we use Whisper [ 1] in both
medium9 and large-v3 configurations as the first is
comparable with FAMA medium in terms of size and the
second–trained on more than 4M hours—is the best
performing model of the Whisper family. The comparison
is made for en and it ASR and it-en ST, as Whisper does
not cover the en-to-many translation directions.
Whisper models are released under Apache 2.0 license and,
therefore, open weights. For both ASR and ST, we also
compare with SeamlessM4T medium10 and v2-large11
covering ASR and both ST language directions [2]. The
model is non-commercial and, therefore, not open. We
also compare with OWSM v3.1 medium12, the best
performing model of the OWSM family, also covering ASR
and both ST language directions and released open source
[16].</p>
        <p>To ensure a fair comparison, we perform the inference
with HuggingFace transformers13 version 4.48.1 using the
standard settings and beam search with beam 5, except
for OWSM, which is not supported on HuggingFace, and
for which the original ESPNet14 inference code is used
with a beam size of 3.15</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Results</title>
      <p>3.1. Pre-training and Catastrophic</p>
      <p>Forgetting
the conditions in which this phenomenon arises during
the ASR+ST training.</p>
      <p>1.8
6https://github.com/hlt-mt/FBK-fairseq
7https://pypi.org/project/jiwer/
8https://pypi.org/project/whisper-normalizer/
9https://hf.co/openai/whisper-medium
10https://hf.co/facebook/hf-seamless-m4t-medium
11https://hf.co/facebook/seamless-m4t-v2-large
12https://hf.co/espnet/owsm_v3.1_ebf
13https://pypi.org/project/transformers/
14https://github.com/espnet/espnet/tree/master/egs2/owsm_v3.1/</p>
      <p>s2t1
15We attempted to use a beam size of 5 but the model had
out-ofmemory issues even when reducing the batch size.</p>
      <p>Figure 1 shows the perplexity (ppl) behavior during the
ifrst 100/500k steps of the FAMA small model training
on the validation sets. We present the results of
diferent systems obtained by varying both the learning rate
S2 and the sampling probability ASR discussed in
Section 2.3. Lower values of S2 (e.g., 1e-5) lead to worse
performance and are not included in the results. Since
the computational budget for our experiments is limited,
we analyze two cases for the sampling probability: 1)
ASR=0.5 to obtain a system equally trained on both ASR
and ST tasks, and 2) ASR=0.2 to obtain a system trained
Whisper medium
Whisper large-v3
OWSM v3.1 medium
SeamlessM4T medium
SeamlessM4T v2-large
FAMA-ASR small</p>
      <p>+ joint CTC rescoring
FAMA-ASR medium</p>
      <p>+ joint CTC rescoring
FAMA small</p>
      <p>+ joint CTC rescoring
FAMA medium
+ joint CTC rescoring</p>
      <p>MLS</p>
      <p>AVG</p>
      <p>ASR (WER ↓)
more on the unseen task during pre-training, i.e., the ST model (FAMA-ASR), obtained after pre-training, and of
task. the final ASR+ST model, as well as the results obtained</p>
      <p>As we can see from the curves, a S2 of 1e-3 seems through joint CTC rescoring.
to be too high for maintaining good ASR performance Looking at the results of FAMA-ASR, we observe that
while learning a new task (ST). Both in the case in which the medium model outperforms the small one, with
the ST training is more boosted (ASR=0.2) and in the ∼ 0.8 WER improvements on average both with and
case in which ASR and ST training is balanced (ASR=0.5), without the joint CTC rescoring. Compared to
Whiswe notice a significant increase in the ASR ppl of up per medium, FAMA achieves better results with FAMA
to 0.25 that corresponds to a drop in performance of medium outperforming Whisper by 4.4 WER on en and
3-4 WER on both languages – which, moreover, is not 6.4 on it while having a similar number of model
paramrecovered later on in the training. Therefore, to avoid eters. Remarkable performance is achieved by FAMA
catastrophic forgetting arising just in the first steps, we medium also compared to OWSM v3.1 medium, with
imexclude S2=1e-3 and use 1e-4 for the two-stage training. provements of up to 1.1 WER on en and 7.3 on it, but also
Regarding the ASR sampling, we look at the behavior compared to Whisper large-v3, where similar WER
of the curves for 500k steps (half of the second-stage scores are achieved. Instead, SeamlessM4T models,
levertraining) and notice that the ASR ppl curve with ASR=0.5 aging large pretrained models such as wav2vec-BERT
slowly approaches the original model ppl value while 2.0 (which is trained on 4.5 million hours) and NLLB
the one with ASR=0.2, despite improving, is not able to (which is trained on more than 43 billion sentences), still
approach the original ppl value. This is counterbalanced outperform FAMA, with the v2-large scoring an
inby a lower (hence, better) ppl of the ASR=0.2 curve on credibly low WER on CommonVoice also compared to
ST compared to that of the ASR=0.5 curve. However, this a strong competitor as Whisper large-v3. Looking at
diference, which is about ∼ 0.2 ppl, is not reflected in the the ASR results of the final FAMA models, we observe
ST performance, which only improves by 0.005 COMET that the WER remained almost unaltered compared to
points on average. Instead, the diference in terms of the ASR-only model, as already discussed in Section 3.1.
WER is significant, with a quality drop of ∼ 0.8 WER Regarding ST results, we notice that FAMA models
outacross en and it. As a result, we conclude that we avoid perform OWSM v3.1 medium, with an improvement of
catastrophic forgetting in the two-stage training only by up to 0.141 COMET by FAMA small and 0.152 by FAMA
evenly sampling the ASR and ST tasks during the second medium while still struggling to achieve the performance
step. of Whisper and SeamlessM4T.</p>
      <p>These mixed outcomes–competitive ASR performance
3.2. Comparison with Existing SFMs even against larger non-open models but lower ST
performance–demonstrate both the feasibility of
buildIn Table 3, we show the results for both ASR and ST of ing high-quality open-science SFMs and the need for
iniour FAMA models and SFMs presented in Section 2.4. For tiatives dedicated to creating OS-compliant ST datasets
FAMA models, we provide the scores of the ASR-only with human references to bridge the gap with non-open</p>
      <sec id="sec-2-1">
        <title>Whisper medium</title>
        <p>Whisper large-v3
SeamlessM4T medium
SeamlessM4T v2-large
FAMA small
FAMA medium
3.3. Computational Time
As an additional comparison, we evaluate the
throughput of the SFMs on a single NVIDIA A40 40GB. The
throughput, measured in xRTF (the inverse of the
realtime factor),16 is calculated as the number of seconds Table 5
of processed audio divided by the compute time in sec- Absolute WER quality gaps between female and male subsets,
onds. The test set used for this performance evaluation divided into read (Gap R) and spontaneous (Gap S) speech.
is CommonVoice on both en and it with a total duration
of, respectively, 26.9 and 26.4 hours. For each model, we
report the maximum batch size possible spanning in the between female WER and male WER scores obtained on
range 2, 4, 8, and 16, as higher values resulted in out-of- CommonVoice 17 and VoxPopuli.
memory issues with all models. The results are reported We can observe that FAMA-ASR small obtained the
in Table 4. smallest–hence, best–performance gap between male and</p>
        <p>We notice that Whisper models are the slowest ones, feminine transcription from read speech, with a gap
bewith an average xRTF of 12.1 for medium and 7.2 for ing an order of magnitude smaller than all other models.
large-v3, making them ∼ 3-6 times slower than FAMA When moving to the spontaneous speech, instead,
Whismedium and ∼ 5-8 than FAMA small. These results can per large-v3 obtains the best result. Overall, Whisper
be attributed to the architectural design of Whisper mod- achieves the best average result, followed by FAMA-ASR
els that apply an × 2 audio subsampling compared to small and FAMA medium, which are the only models
the commonly used × 4 (as in FAMA) and introduce a scoring less than a 1.0 WER diference. All FAMA models
lot of padding in shorter sequences to achieve the fixed can outperform Seamless M4T v2-large, achieving an
30-second length. The Seamless models, despite having average gap reduction of 0.16 to 0.52.
no extra padding (as FAMA) and a greater audio
subsampling of × 8, are ∼ 2 times faster than Whisper ones
but still 1.5-3 times slower for, respectively, medium and 4. Conclusions
v2-large, compared to FAMA medium and 2-4
compared to FAMA small, making the FAMA model family
the fastest by a large margin.</p>
        <sec id="sec-2-1-1">
          <title>In this paper, we addressed the challenges posed by the</title>
          <p>closed nature of existing SFMs, such as limited
accessibility to training data and codebases, by introducing FAMA,
the first large-scale open-science SFM for English and
3.4. Gender Bias Analysis Italian. Trained on over 150k hours of exclusively OS
We also measure the gender bias disparity between male speech, FAMA ensures full transparency, with all
artiand female performance using the ASR benchmark pro- facts released under OS-compliant licenses. Additionally,
posed by Attanasio et al. [44]. The results are presented in we contributed a new collection of ASR and ST
pseuTable 517 and are measured as absolute performance gaps dolabels for about 16k hours of speech data, and more
than 130k hours of English and Italian automatic
translations. Results show that FAMA models outperform
OWSM on both ASR and ST and also achieve
comparable ASR results to Whisper while being up to 8 times
faster. By providing the community with fully accessible
16https://github.com/NVIDIA/DeepLearningExamples/blob/</p>
          <p>master/Kaldi/SpeechRecognition/README.md#metrics
17Results and per-language statistics are available on the
original leaderboard: https://huggingface.co/spaces/g8a9/
fair-asr-leaderboard
resources, FAMA bridges the gap between advances in
speech technology and open science principles, enabling
fair evaluation, broader participation, and inclusivity.
Future work will focus on extending FAMA to additional
languages with the ultimate goal of further expanding
the open science ecosystem to speech technologies.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>This paper has received funding from the PNRR project</title>
        <p>FAIR - Future AI Research (PE00000013), under the NRRP
MUR program funded by the NextGenerationEU, and
from the European Union’s Horizon research and
innovation programme under grant agreement No 101135798,
project Meetween (My Personal AI Mediator for Virtual
MEETings BetWEEN People). We acknowledge CINECA
for the availability of high-performance computing
resources and support.
llms, in: First Conference on Language Modeling,
2024.
[7] Q. Sun, Y. Luo, S. Li, W. Zhang, W. Liu, OpenOmni:</p>
        <p>A collaborative open source tool for building
futureready multimodal conversational agents, in: D. I.</p>
        <p>Hernandez Farias, T. Hope, M. Li (Eds.), Proceedings
of the 2024 Conference on Empirical Methods in
Natural Language Processing: System
Demonstrations, Association for Computational Linguistics,</p>
        <p>Miami, Florida, USA, 2024, pp. 46–52.
[8] M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J. S.</p>
        <p>Park, M. Salehi, N. Muennighof, K. Lo, L. Soldaini,
et al., Molmo and pixmo: Open weights and open
data for state-of-the-art multimodal models, arXiv
preprint arXiv:2409.17146 (2024).
[9] W. Dai, N. Lee, B. Wang, Z. Yang, Z. Liu, J. Barker,</p>
        <p>T. Rintamaki, M. Shoeybi, B. Catanzaro, W. Ping,
Nvlm: Open frontier-class multimodal llms, arXiv
preprint arXiv:2409.11402 (2024).
[10] P. H. Martins, P. Fernandes, J. Alves, N. M.
Guerreiro, R. Rei, D. M. Alves, J. Pombal, A. Farajian,
M. Faysse, M. Klimaszewski, P. Colombo, B.
Had[1] A. Radford, J. W. Kim, T. Xu, G. Brockman, dow, J. G. de Souza, A. Birch, A. F. Martins, Eurollm:
C. McLeavey, I. Sutskever, Robust speech recog- Multilingual language models for europe, Procedia
nition via large-scale weak supervision, in: Inter- Computer Science 255 (2025) 53–62. Proceedings of
national Conference on Machine Learning, PMLR, the Second EuroHPC user day.</p>
        <p>2023, pp. 28492–28518. [11] D. Groeneveld, et al., OLMo: Accelerating the
sci[2] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, ence of language models, in: L.-W. Ku, A. Martins,
N. Dong, P.-A. Duquenne, H. Elsahar, H. Gong, V. Srikumar (Eds.), Proceedings of the 62nd
AnK. Hefernan, J. Hofman, et al., Seamlessm4t: Mas- nual Meeting of the Association for Computational
sively multilingual &amp; multimodal machine transla- Linguistics (Volume 1: Long Papers), Association
tion, arXiv preprint arXiv:2308.11596 (2023). for Computational Linguistics, Bangkok, Thailand,
[3] Y. Dong, X. Jiang, H. Liu, Z. Jin, B. Gu, M. Yang, G. Li, 2024, pp. 15789–15809.</p>
        <p>
          Generalization or memorization: Data contamina- [12] L. Soldaini, et al., Dolma: an open corpus of three
tion and trustworthy evaluation for large language trillion tokens for language model pretraining
remodels, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), search, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
Findings of the Association for Computational Lin- Proceedings of the 62nd Annual Meeting of the
Asgui
          <xref ref-type="bibr" rid="ref24">stics: ACL 2024</xref>
          , Association for Computational sociation for Computational Linguistics
          <xref ref-type="bibr" rid="ref14 ref43 ref47 ref5 ref60">(Volume 1:
Linguistics, Bangkok, Thailand, 2024, pp. 12039– Long Papers)</xref>
          , Association for Computational
Lin12050. gui
          <xref ref-type="bibr" rid="ref24">stics, Bangkok, Thailand, 2024</xref>
          , pp. 15725–15788.
[4] BigScience Workshop, T. L. Scao, A. Fan, C. Akiki, [13] R. Vicente-Saez, C. Martinez-Fuentes, Open
sciE. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. ence now: A systematic literature review for an
Luccioni, F. Yvon, et al., Bloom: A 176b-parameter integrated definition, Journal of Business Research
open-access multilingual language model, arXiv 88 (2018) 428–436.
        </p>
        <p>preprint arXiv:2211.05100 (2022). [14] M. White, I. Haddad, C. Osborne, X.-Y. Y. Liu, A.
Ab[5] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, delmonsef, S. Varghese, A. L. Hors, The model
K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, openness framework: Promoting completeness and
U. S. Prashanth, E. Raf, A. Skowron, L. Sutawika, openness for reproducibility, transparency, and
usO. Van Der Wal, Pythia: a suite for analyzing large ability in artificial intelligence, arXiv preprint
language models across training and scaling, in: arXiv:2403.13784 (2024).</p>
        <p>Proceedings of the 40th International Conference [15] Y. Peng, J. Tian, B. Yan, D. Berrebbi, X. Chang, X. Li,
on Machine Learning, ICML’23, JMLR.org, 2023. J. Shi, S. Arora, W. Chen, R. Sharma, W. Zhang,
[6] Z. Liu, A. Qiao, W. Neiswanger, H. Wang, B. Tan, Y. Sudo, M. Shakeel, J.-W. Jung, S. Maiti, S.
WatanT. Tao, J. Li, Y. Wang, S. Sun, O. Pangarkar, et al., abe, Reproducing whisper-style training using an
Llm360: Towards fully transparent open-source open-source toolkit and publicly available data, in:
During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order
to: Paraphrase and reword, Improve writing style, and Grammar and spelling check. After using
these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          2023
          <string-name>
            <given-names>IEEE</given-names>
            <surname>Automatic</surname>
          </string-name>
          <article-title>Speech Recognition</article-title>
          and Un- [25]
          <string-name>
            <given-names>V.</given-names>
            <surname>Pratap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sriram</surname>
          </string-name>
          , G. Synnaeve, R. Col-
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>derstanding Workshop (ASRU)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . lobert, MLS:
          <string-name>
            <given-names>A</given-names>
            <surname>Large-Scale Multilingual</surname>
          </string-name>
          Dataset for [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sudo</surname>
          </string-name>
          , Speech Research,
          <source>in: Proc. Interspeech</source>
          <year>2020</year>
          ,
          <year>2020</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Shakeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chang</surname>
          </string-name>
          , J. weon pp.
          <fpage>2757</fpage>
          -
          <lpage>2761</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <year>Owsm v3</year>
          .
          <article-title>1: Better</article-title>
          and faster [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riviere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wu</surname>
          </string-name>
          , C. Talnikar,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          branchformer,
          <source>in: Interspeech</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>352</fpage>
          -
          <lpage>Populi</lpage>
          :
          <article-title>A large-scale multilingual speech corpus</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          356.
          <article-title>for representation learning</article-title>
          , semi-supervised learn[17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bentivogli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brutti</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Cet- ing and interpretation</article-title>
          , in: C.
          <string-name>
            <surname>Zong</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>tolo</surname>
            , R. Gretter,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Matassoni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Nabih</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ne- R. Navigli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 59th Annual</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>gri</surname>
          </string-name>
          , MOSEL:
          <volume>950</volume>
          ,
          <article-title>000 hours of speech data for open- Meeting of the Association for Computational Lin-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>source speech foundation model training on EU guistics and the 11th</article-title>
          <source>International Joint Conference</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          languages, in: Y.
          <string-name>
            <surname>Al-Onaizan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>Y.-N.</given-names>
          </string-name>
          <article-title>Chen on Natural Language Processing (Volume 1: Long</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          (Eds.),
          <source>Proceedings of the 2024 Conference on Em- Papers)</source>
          , Association for Computational Linguistics,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>pirical Methods in Natural Language Processing</source>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>993</fpage>
          -
          <lpage>1003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Association for Computational</surname>
            <given-names>Linguistics</given-names>
          </string-name>
          , Miami, [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Silero vad: pre-trained enterprise-grade</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Florida</surname>
          </string-name>
          , USA,
          <year>2024</year>
          , pp.
          <fpage>13934</fpage>
          -
          <lpage>13947</lpage>
          .
          <article-title>voice activity detector (vad), number detector</article-title>
          and [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Belz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mille</surname>
          </string-name>
          , Non- language classifier, https://github.com/snakers4/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>repeatable experiments and non-reproducible re- silero-</article-title>
          <string-name>
            <surname>vad</surname>
          </string-name>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>sults: The reproducibility crisis in human eval-</article-title>
          [28]
          <string-name>
            <given-names>I.</given-names>
            <surname>Tsiamas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Gállego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A. R.</given-names>
            <surname>Fonollosa</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. R.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Computational</surname>
            <given-names>Linguistics: ACL</given-names>
          </string-name>
          <year>2023</year>
          , Association speech
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>106</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>for Computational</surname>
            <given-names>Linguistics</given-names>
          </string-name>
          , Toronto, Canada, [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kudugunta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Caswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , X. Garcia,
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <year>2023</year>
          , pp.
          <fpage>3676</fpage>
          -
          <lpage>3687</lpage>
          . D.
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kusupati</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Stella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bapna</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Firat</surname>
            , [19]
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Balloccu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Schmidtová</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Lango</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Dusek</surname>
          </string-name>
          , Madlad-
          <volume>400</volume>
          :
          <article-title>a multilingual and document-level</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>Leak, cheat, repeat: Data contamination and evalua- large audited dataset</article-title>
          ,
          <source>in: Proceedings of the 37th</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>ham</surname>
          </string-name>
          , M. Purver (Eds.),
          <source>Proceedings of the 18th Con- Processing Systems</source>
          , NIPS '23, Curran Associates
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>ference of the European Chapter of the Association Inc</article-title>
          .,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>for Computational Linguistics</surname>
          </string-name>
          (Volume
          <volume>1</volume>
          : Long Pa- [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fucci</surname>
          </string-name>
          , G. Fiameni, M. Negri,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>St</surname>
          </string-name>
          .
          <source>Julian's, Malta</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>93</lpage>
          . tion: FBK@IWSLT2022, in: E. Salesky,
          <string-name>
            <given-names>M.</given-names>
            <surname>Federico</surname>
          </string-name>
          , [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chesbrough</surname>
          </string-name>
          , From open science to open inno- M. Costa-jussà (Eds.),
          <source>Proceedings of the 19th Inter-</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Management</surname>
            ,
            <given-names>ESADE</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <source>tion (IWSLT</source>
          <year>2022</year>
          ), Association for Computational [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ardila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Branson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Henretty</surname>
          </string-name>
          , Linguistics, Dublin, Ireland (in-person and online),
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Kohler</surname>
          </string-name>
          , J. Meyer,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <surname>F. M.</surname>
          </string-name>
          <year>2022</year>
          , pp.
          <fpage>177</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Tyers</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Weber, Common voice: A massively-</article-title>
          [31]
          <string-name>
            <surname>M. M. I. Alam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Anastasopoulos</surname>
          </string-name>
          ,
          <article-title>A case study on</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>12th Conference on Language Resources and Eval- preprint arXiv:2402</source>
          .
          <year>01945</year>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <source>uation (LREC</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>4211</fpage>
          -
          <lpage>4215</lpage>
          . [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gulati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-C. Chiu</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , [22]
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gu</surname>
            ,
            <given-names>J. Pino,</given-names>
          </string-name>
          <article-title>CoVoST 2</article-title>
          and
          <string-name>
            <surname>Mas- J. Yu</surname>
            , W. Han,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , Y. Wu, R. Pang,
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>Interspeech</source>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>2247</fpage>
          -
          <lpage>2251</lpage>
          .
          <article-title>for Speech Recognition</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2020</year>
          , [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          , M. Ma, S. Khanuja,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , V. Axel- 2020, pp.
          <fpage>5036</fpage>
          -
          <lpage>5040</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>rod</surname>
            , S. Dalmia,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Riesa</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Rivera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bapna</surname>
            , Fleurs: [33]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkoreit,
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <article-title>tations of speech, in: 2022 IEEE Spoken Language tention is all you need</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <source>Technology Workshop (SLT)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>798</fpage>
          -
          <lpage>805</lpage>
          . S. Bengio,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          , S. Vishwanathan, [24]
          <string-name>
            <given-names>V.</given-names>
            <surname>Panayotov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          , Lib- R. Garnett (Eds.),
          <source>Advances in Neural Informa-</source>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <source>rispeech: An ASR corpus based on public domain tion Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <article-title>audio books</article-title>
          , in: 2015 IEEE International Confer- https://proceedings.neurips.cc/paper/2017/file/
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <source>ence on Acoustics, Speech and Signal Processing 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.</source>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <source>(ICASSP)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>5206</fpage>
          -
          <lpage>5210</lpage>
          . [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bentivogli</surname>
          </string-name>
          , Speech
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <article-title>translation with speech foundation models and 2702</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <article-title>large language models: What is there</article-title>
          and what [42]
          <string-name>
            <surname>M. McCloskey</surname>
            ,
            <given-names>N. J.</given-names>
          </string-name>
          <string-name>
            <surname>Cohen</surname>
          </string-name>
          , Catastrophic inter-
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          (Eds.),
          <source>Proceedings of the 62nd Annual Meeting learning problem</source>
          , volume
          <volume>24</volume>
          of Psychology of Learn-
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <article-title>of the Association for Computational Linguistics ing</article-title>
          and Motivation,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          , Association for Compu- [43]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kar</surname>
          </string-name>
          , G. Castellucci,
          <string-name>
            <given-names>S.</given-names>
            <surname>Filice</surname>
          </string-name>
          , S. Malmasi,
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <source>tational Linguistics</source>
          , Bangkok, Thailand,
          <year>2024</year>
          ,
          <string-name>
            <given-names>pp. O.</given-names>
            <surname>Rokhlenko</surname>
          </string-name>
          , Preventing catastrophic forgetting
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          14760-
          <fpage>14778</fpage>
          .
          <article-title>in continual learning of new natural language tasks</article-title>
          , [35]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kudo</surname>
          </string-name>
          , J. Richardson, SentencePiece: A simple in
          <source>: Proceedings of the 28th ACM SIGKDD</source>
          Confer-
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <article-title>detokenizer for neural text processing</article-title>
          , in: E. Blanco,
          <year>2022</year>
          , pp.
          <fpage>3137</fpage>
          -
          <lpage>3145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference</source>
          [44]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Savoldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          , Twists,
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          , Brussels, Belgium,
          <year>2018</year>
          , Proceedings of the 2024 Conference on Empirical
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          pp.
          <fpage>66</fpage>
          -
          <lpage>71</lpage>
          . Methods in Natural Language Processing, Associa[36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>tion for Computational Linguistics</article-title>
          , Miami, Florida,
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <article-title>Connectionist temporal classification: Labelling un</article-title>
          - USA,
          <year>2024</year>
          , pp.
          <fpage>21318</fpage>
          -
          <lpage>21340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          works,
          <source>in: Proceedings of the 23rd International</source>
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <source>Conference on Machine Learning</source>
          , ICML '
          <fpage>06</fpage>
          , New
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          <string-name>
            <surname>York</surname>
          </string-name>
          , NY, USA,
          <year>2006</year>
          , p.
          <fpage>369</fpage>
          -
          <lpage>376</lpage>
          . [37]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C.-
          <string-name>
            <surname>C. Chiu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Zoph</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <string-name>
            <surname>Recognition</surname>
          </string-name>
          , in
          <source>: Proc. Interspeech</source>
          <year>2019</year>
          ,
          <year>2019</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          2613-
          <fpage>2617</fpage>
          . [38]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dalmia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Higuchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Metze</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 17th Confer-</source>
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <string-name>
            <surname>tational Linguistics</surname>
          </string-name>
          , Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          1623-
          <fpage>1639</fpage>
          . [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Papi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pilzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Negri</surname>
          </string-name>
          , When good
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          <article-title>ings of the 62nd Annual Meeting of the Association</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          <string-name>
            <surname>for Computational Linguistics</surname>
          </string-name>
          (Volume
          <volume>1</volume>
          : Long Pa-
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          <string-name>
            <surname>Bangkok</surname>
          </string-name>
          , Thailand,
          <year>2024</year>
          , pp.
          <fpage>3657</fpage>
          -
          <lpage>3672</lpage>
          . [40]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Okhonko</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          , fairseq S2T:
          <article-title>Fast speech-to-text modeling</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          <article-title>with fairseq</article-title>
          ,
          <source>in: Proceedings of the 2020</source>
          Confer-
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          <string-name>
            <surname>strations</surname>
          </string-name>
          ,
          <year>2020</year>
          . [41]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Farinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          , COMET:
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          <source>the 2020 Conference on Empirical Methods in Natu-</source>
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>2685</fpage>
          -
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>