Towards a responsible usage of AI-based Large Acoustic Models for Automatic Speech Recognition: on the importance of data in the self-supervised era.

Vincenzo Norman Vitale (1,2,*,†), Emilia Tanda (1,†) and Francesco Cutugno (1,2,*,†)

1 University of Naples, Federico II, Corso Umberto I, 40, Naples, 80138, Italy
2 UrbanECO Research Center, University of Naples, Federico II, via Tarsia, 31, Naples, 80134, Italy


                                                   Abstract
                                                   The evolution of artificial intelligence models has made them tools of everyday use in many fields. However, the enormous
                                                   capabilities demonstrated by these models have, on the one hand, some apparent costs in terms of money, computational
                                                   resources, or data. On the other hand, there are some hidden costs for end users who rely on models trained by third parties,
                                                   sacrifice awareness and control of a tool, and try to evaluate its performance in their specific contexts. This is the case of
                                                   supervised End-to-End (E2E) ASR systems and self-supervised E2E-ASR, also referred to as Large Acoustic Models (LAM). On
                                                   the one hand, they provide an important starting point for building information systems oriented to speech interaction and,
                                                   on the other hand, are complex to evaluate, use and adapt in specific contexts.

                                                   Keywords
                                                   End-to-End ASR, self-supervised, quality of data, communication style, responsible AI


Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† These authors contributed equally.
Email: vincenzonorman.vitale@unina.it (V. N. Vitale); e.tanda@studenti.unina.it (E. Tanda); cutugno@unina.it (F. Cutugno)
ORCID: 0000-0002-0365-8575 (V. N. Vitale); 0000-0001-9457-6243 (F. Cutugno)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Modern Automatic Speech Recognition (ASR) systems, like other Natural Language Processing (NLP) systems, achieve remarkable performance thanks to the computing potential enabled by Deep Neural Networks (DNN). Indeed, over the last decade, the automatic speech recognition community has made great strides [1, 2, 3], moving from traditional hybrid modelling (Acoustic Model + Language Model) to end-to-end (E2E) modelling that directly translates an input speech sequence into a sequence of output tokens using a single network, and then to self-supervised E2E models, also referred to as Large Acoustic Models (LAMs), that can model speech without the aid of labelled data. These innovations have completely subverted the traditional architectures of ASR systems used in previous decades. In addition, there has been a strong impact on the cost-effectiveness and democratization of ASR systems. On the one hand, the change in architecture has made it more economical to collect and create the data sets necessary for training, which previously required a large number of experts in the field of speech analysis involved in long and expensive processes of manual labelling. On the other hand, it has allowed the creation of a large number of freely available and open-source general-purpose ASRs, bringing these systems within the reach of a greater number of institutions and companies. However, their use remains limited due to the lack of benchmarks oriented towards specific contexts and communication styles. In this work, we analyze the evolution of ASR systems, how the nature of the data used for their training has changed, and the limitations of modern ASR systems. Finally, we propose an initiative to collect high-quality data in Italian for both performance verification and training based on specific communicative styles.

2. The evolution of ASR systems

ASR systems have been the subject of several revolutions, which have impacted their internal architecture and the nature of the data employed for their training. Traditional ASR systems rely on two separate components [9]: the Acoustic Model (AM), which converts the voice signal into a sequence of phones, and the Language Model (LM), which transforms the sequence of phones received from the AM into the most likely and reliable transcription. These two models were initially realised with techniques such as Hidden Markov Models (HMM) or Gaussian Mixture Models (GMM). Then, with the advent of Deep Neural Networks (DNN), both have been realized as supervised DNNs. Still, the output of both components remained the same: the AM produces the most likely sequence of phones given the input voice signal, while the LM provides the most reliable transcription


CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Model                                 Type                   Train Data                                  Year   Decoder   Encoder       Test-clean/other (WER %)
Transformer [4]                       E2E                    970Hr Transcribed                           2020   RNN-T     Transformer   2.0/4.6
Conformer [5]                         E2E                    970Hr Transcribed                           2020   RNN-T     Conformer     1.9/3.9
Wav2vec2.0 [6] with Quantization      E2E Self-Supervised    60KHr Untranscribed + 100Hr Transcribed     2020   CTC       Transformer   1.8/3.3
HuBERT [7] with k-means clustering    E2E Self-Supervised    60KHr Untranscribed + 100Hr Transcribed     2021   CTC       Transformer   1.8/2.9
W2V-BERT [8] with Quantization        E2E Self-Supervised    60KHr Untranscribed + 100Hr Transcribed     2021   RNN-T     Conformer     1.4/2.5

Table 1
Performance of E2E ASR systems on the Librispeech test sets, ordered from the least recent to the most recent. For self-supervised
systems, the algorithm used during the self-supervised pre-training phase is also reported.
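
The error rates in Table 1 are word error rates (WER) on the Librispeech test-clean and test-other sets. As a minimal sketch of how such a figure could be computed for a single utterance, assuming plain whitespace tokenization and no text normalization (both are simplifications, and the function name `word_error_rate` is illustrative, not taken from any of the cited systems):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            )
        prev = curr
    return prev[-1] / len(ref)
```

For instance, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deletion against six reference words. When scoring a whole test set, edit counts and reference lengths are usually summed over all utterances before dividing, rather than averaging per-utterance rates; the same procedure could be applied separately to each communication style.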


given the input sequence of phones. This means that the two components had separate objectives and relied on different kinds of high-quality and costly datasets. On the one hand, the AM needs well-aligned sound-to-phone transcriptions. On the other hand, the LM needs a statistically representative set of phone-to-word samples in order to provide meaningful transcriptions. In both cases, providing data of adequate quality requires hand labelling by highly specialised professionals. This type of ASR system requires tens to hundreds of hours of speech to train the AM and a few million words to train an LM (depending on the context). The aim is to transcribe fairly long sentences with an accuracy linked to specific application contexts.
   The turning point that led to the recent End-to-End ASR (E2E-ASR) [2] was the introduction of the Transformer [10] network architecture, on which most current AI models rely. Compared to traditional systems, in E2E-ASRs the voice signal is directly converted into its corresponding transcription without any intermediate, human-readable format. This evolution results in systems with a single objective that need only one, cheaper dataset for training, since the intermediate phone transcription and the alignment steps have been removed. The Transformer architecture [10] opens up the possibility of building a combination of AM and LM, now referred to as the Encoder and Decoder, which directly maps an unaligned sequence of sounds to its transcription. With a few hundred hours of non-aligned transcribed speech and a supervised learning process, E2E-ASR systems outperform the previous generation on average, with an error of up to 5% in the case of pure Transformer-Encoder systems, or up to 4% in the case of Conformer-Encoder systems [5] (see Table 1 for performance). Clearly, the choice of Decoder implementation strongly impacts E2E ASR performance; this module is usually implemented as a Connectionist Temporal Classification (CTC) model [11] or as a Recurrent Neural Network Transducer (RNN-T) [12]. CTC is a non-auto-regressive speech transcription technique which collapses consecutive identical transcription labels (character, word piece, etc.) into one label unless a special label separates them. The result is a sequence of labels shorter than or equal to the input vector sequence in length. CTC is one of the most widespread decoding techniques. Being non-auto-regressive, it is also considered computationally efficient, as it requires less time and fewer resources in the training and inference phases. Conversely, the RNN-T (also named Transducer) is an auto-regressive speech transcription technique which overcomes CTC's limitations, i.e., its non-auto-regressive nature and its limited label sequence length. An RNN-T can produce label sequences longer than the input vector sequence and models long-term inter-dependencies between transcription elements. A Transducer typically comprises two sub-decoding modules: one that forecasts the next transcription label based on the previous transcriptions (prediction network), and another that combines the encoder and prediction-network outputs to produce a new transcription label (joiner network). These features improve transcription speed and performance with respect to CTC at the expense of more training and computational resources [12].
   Finally, the most recent advancement is the employment of self-supervised training techniques, giving rise to what could be defined as the first truly End-to-End ASR systems, namely Wav2Vec2 [6] and, shortly after, HuBERT [7]. Both are also referred to as Large Acoustic Models (LAMs) [13] because of their training process, which usually involves two main phases. The first is the pre-training phase, during which vast amounts of untranscribed speech data are employed to recognize and discretize hidden acoustic unit representations through processes such as quantization (Wav2Vec2 [6]) directly from the raw audio samples, or clustering (HuBERT [7]) on MFCC features. Then, during the last phase, a transcription module can be trained on smaller datasets (a few hours) to obtain an error rate of about 2% (see Table 1).
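
The CTC collapsing rule described above can be sketched as a greedy post-processing step over per-frame labels. This is an illustrative sketch only: the blank symbol `_`, the function name `ctc_collapse` and the toy label sequence are assumptions, and real decoders operate on network output probabilities rather than on ready-made labels.

```python
from itertools import groupby

BLANK = "_"  # hypothetical separator ("blank") label, chosen for illustration


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels the way CTC post-processing does.

    1. Merge runs of consecutive identical labels into a single label.
    2. Drop the blank label, whose role is to separate genuine repetitions.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Ten input frames collapse to five output labels; the blank between the
# two "l" runs is what keeps the double "l" from being merged into one.
frames = list("hh_e_ll_lo")
print("".join(ctc_collapse(frames)))
```

The output can never be longer than the number of input frames, which is the length limitation that the Transducer is said to overcome.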
3. Self-supervised E2E Solutions (?) to data shortages

Undeniably, by committing the model to learn all parts automatically, E2E-ASRs overcome the difficulties and cost-ineffectiveness of the data preparation and modelling phases of conventional systems, while requiring far more training data [14]. This shift significantly impacted ASR systems. On the one hand, it significantly reduced training data costs while increasing their volume, as shown by the availability of plenty of general-purpose training datasets [1, 3]. On the other hand, thanks to the cheapness of training data, ASR systems are now accessible to a wider public. Clearly, these innovations come at a price, which in this case consists of higher computational costs, longer training times, and loss of modularity [3] compared to traditional ASR systems. Indeed, adapting such a general-purpose E2E-ASR to specific contexts means, in some cases, updating the Decoder (LM) to a special-purpose field or updating the Encoder (AM) to handle a special type of speech, which requires fine-tuning and, in the worst cases, training the model from scratch.
   Then, the advent of Self-supervised systems impacted the adaptability of general-purpose E2E ASR, giving rise to Large Acoustic Models (LAMs). These are basically Encoders trained on vast amounts of non-transcribed data, which is cheaper than the data needed by simple E2E-ASR, and then combined with a Decoder part trained on small quantities of language-specific transcribed data. The result is a large, general-purpose model that can be easily deployed in most contexts. Although such models are publicly available and, therefore, freely adaptable, the necessary computational resources are so prohibitive that they are within the reach of only a few companies and institutions, even for simple fine-tuning.
   A further point to be considered is that the advantages of both simple E2E ASR and Self-supervised ones come at the expense of lower interpretability of the systems' internals, making it difficult to diagnose errors and limiting their usage in critical contexts [3]. However, some studies in the field of eXplainable AI (XAI) [15] try to provide explanations and methodologies for analysing behaviours and phenomena modelled by various E2E ASR systems, aiming to make them more interpretable [16, 17, 18, 19], although these studies still rely on special-purpose data.
   To summarize, although the innovations introduced by E2E and self-supervised E2E systems have allowed their fast diffusion, their industrial and institutional deployment remains subject to limitations [3] which, in some cases, are strongly related to special-purpose data availability. Indeed, employing a general-purpose E2E ASR system in a specific domain requires evaluation and potential fine-tuning or training on domain-specific data, which is usually unavailable. Another aspect to consider is how and to what extent the democratisation of ASR systems has been impacted. In fact, if, on the one hand, it is possible to obtain much more data for the same cost, on the other hand, the same quantity of resources is no longer sufficient, especially for training purposes.

4. High-quality data for context-specific assessment

Clearly, the availability of good-quality and well-categorized data is paramount in the current application landscape. On the one hand, such data is essential to evaluate pre-trained systems in specific contexts with speaking styles related to different communication situations. On the other hand, such data is crucial for training and fine-tuning modern supervised and self-supervised E2E ASR. To this end, the Phoné consortium was born as a voluntary initiative to collect, verify and distribute transcribed and non-transcribed Italian speech datasets in various application contexts. Table 2 shows the current amount of data collected and verified by the consortium to provide Italian institutions and companies with adequate instruments to evaluate these promising tools, which are, however, usually assessed in contexts and communication styles that do not reflect the target ones.
   Currently, data is divided into two macro-categories, namely, Transcribed and Untranscribed, to enable the future training of self-supervised E2E-ASR. Then, datasets are further divided into specific communication styles [20, 21]:
     • Monologic speech involves only one person speaking without interacting with an interlocutor. This type of speech is characterized by consistency and structuring, as it typically consists of lectures, speeches or situations that require preliminary preparation. As a result, the speech appears cohesive and well-organized. The language register tends to be higher and more formal.
     • Dialogic speech involves two or more people in a conversation, characterized by exchanges of messages and information. It is thus configured as a communicative act with a dynamic structure. Unlike monologic speech, dialogic speech does not involve prior preparation; therefore, the speech tends to be simpler from a syntactic point of view, the articulation of words tends to be less precise (hypoarticulation), and it is also characterized by greater conciseness of expression.
     • In Read speech, the speaker reads a written text aloud (as in the case of audiobooks); therefore, this type of speech is characterized by clear pronunciation (there is a tendency towards hyperarticulation), complete syntax and greater coherence and
       cohesion of the text. Furthermore, another feature is given by the modulation of reading speed and the use of strategic pauses and intonations to improve communicative effectiveness.

    Material Type     Speech Type     Minutes
    Transcribed       Monologic           500
    Transcribed       Dialogic            400
    Transcribed       Read                120
    Untranscribed     Monologic         10000
    Untranscribed     Dialogic            500
    Untranscribed     Read               2200

Table 2
List of material collected and verified for the evaluation and training of E2E-ASR systems (both supervised and self-supervised) in specific contexts for the Italian language.

   Beyond ASR-related aspects, the consortium's purposes also extend to other voice-related tasks, which include, but are not limited to, Text-To-Speech (TTS), Speaker Identification (SI), and Speaker Verification (SV).

5. Conclusion

In this work, we present the current panorama of E2E ASR systems, how their data usage evolved along with technological improvements, and the issues that these improvements solved or introduced. Firstly, we observe the significant improvement in models' performance while pointing out issues connected to assessing the models' capacity on specific communication styles and domains. We then observe the shift in model training costs, moving from data, which is becoming cheaper and easier to collect, towards computing resources, which are growing in quantity and cost. We also observe how the advantages introduced by modern E2E (supervised and self-supervised) ASRs come at the expense of an increase in their complexity, which consequently reduces their interpretability. Finally, we propose a voluntary, high-quality data collection initiative to evaluate and train systems on various speech communication styles, enabling more informed use and greater accessibility of E2E-ASR systems.

References

 [1] M. Malik, M. K. Malik, K. Mehmood, I. Makhdoom,
     Automatic speech recognition: a survey, Multimedia
     Tools and Applications 80 (2021) 9411–9457.
 [2] J. Li, et al., Recent advances in end-to-end automatic
     speech recognition, APSIPA Transactions on Signal
     and Information Processing 11 (2022).
 [3] R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter,
     S. Watanabe, End-to-end speech recognition: A
     survey, IEEE/ACM Transactions on Audio, Speech,
     and Language Processing (2023).
 [4] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott,
     S. Koo, S. Kumar, Transformer transducer: A
     streamable speech recognition model with transformer
     encoders and rnn-t loss, in: ICASSP 2020-2020 IEEE
     International Conference on Acoustics, Speech and
     Signal Processing (ICASSP), IEEE, 2020, pp. 7829–7833.
 [5] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang,
     J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al.,
     Conformer: Convolution-augmented transformer
     for speech recognition (2020).
 [6] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec
     2.0: A framework for self-supervised learning of
     speech representations, Advances in neural
     information processing systems 33 (2020) 12449–12460.
 [7] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia,
     R. Salakhutdinov, A. Mohamed, Hubert:
     Self-supervised speech representation learning by
     masked prediction of hidden units, IEEE/ACM
     Transactions on Audio, Speech, and Language
     Processing 29 (2021) 3451–3460.
 [8] Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin,
     R. Pang, Y. Wu, W2v-bert: Combining contrastive
     learning and masked language modeling for
     self-supervised speech pre-training, in: 2021 IEEE
     Automatic Speech Recognition and Understanding
     Workshop (ASRU), IEEE, 2021, pp. 244–250.
 [9] S. Karpagavalli, E. Chandra, A review on automatic
     speech recognition architecture and approaches,
     International Journal of Signal Processing, Image
     Processing and Pattern Recognition 9 (2016) 393–404.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
     L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
     Attention is all you need, Advances in neural
     information processing systems 30 (2017).
[11] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber,
     Connectionist temporal classification: labelling
     unsegmented sequence data with recurrent neural
     networks, in: Proceedings of the 23rd international
     conference on Machine learning, 2006, pp. 369–376.
[12] A. Graves, Sequence transduction with recurrent
     neural networks, arXiv preprint arXiv:1211.3711
     (2012).
[13] L. M. Giordano Orsini, V. N. Vitale, F. Cutugno,
     Large scale acoustic models: A new perspective,
     Sistemi intelligenti (2023). doi:10.1422/108137.
[14] G. Coro, F. V. Massoli, A. Origlia, F. Cutugno,
     Psycho-acoustics inspired automatic speech
     recognition, Computers & Electrical Engineering 93
     (2021) 107238.
[15] D. Gunning, Explainable artificial intelligence
     (xai), Defense advanced research projects agency
     (DARPA), nd Web 2 (2017) 1.
[16] A. Prasad, P. Jyothi, How accents confound: Prob-
     ing for accent information in end-to-end speech
     recognition systems, in: Proceedings of the 58th
     Annual Meeting of the Association for Computa-
     tional Linguistics, 2020, pp. 3739–3753.
[17] D. Ma, N. Ryant, M. Liberman, Probing acoustic
     representations for phonetic properties, in: ICASSP
     2021-2021 IEEE International Conference on Acous-
     tics, Speech and Signal Processing (ICASSP), IEEE,
     2021, pp. 311–315.
[18] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analy-
     sis of a self-supervised speech representation model,
     in: 2021 IEEE Automatic Speech Recognition and
     Understanding Workshop (ASRU), IEEE, 2021, pp.
     914–921.
[19] V. N. Vitale, F. Cutugno, A. Origlia, G. Coro, Ex-
     ploring emergent syllables in end-to-end automatic
     speech recognizers through model explainability
     technique, Neural Computing and Applications
     (2024) 1–27.
[20] M. Nakamura, K. Iwano, S. Furui, Differences be-
     tween acoustic characteristics of spontaneous and
     read speech and their effects on speech recognition
     performance, Computer Speech & Language 22
     (2008) 171–184.
[21] P. Azizova, Linguistic analysis and learning of di-
     alogical speech in literary texts, JETT 14 (2023)
     86–94.