=Paper=
{{Paper
|id=Vol-3762/492
|storemode=property
|title=Towards a responsible usage of AI-based Large Acoustic Models for Automatic Speech Recognition: on the importance of data in the self-supervised era
|pdfUrl=https://ceur-ws.org/Vol-3762/492.pdf
|volume=Vol-3762
|authors=Vincenzo Norman Vitale,Emilia Tanda,Francesco Cutugno
|dblpUrl=https://dblp.org/rec/conf/ital-ia/VitaleTC24
}}
==Towards a responsible usage of AI-based Large Acoustic Models for Automatic Speech Recognition: on the importance of data in the self-supervised era==
Vincenzo Norman Vitale 1,2,*,†, Emilia Tanda 1,†, Francesco Cutugno 1,2,*,†

1 University of Naples Federico II, Corso Umberto I, 40, Naples, 80138, Italy
2 UrbanECO Research Center, University of Naples Federico II, via Tarsia, 31, Naples, 80134, Italy
Abstract
The evolution of artificial intelligence models has made them tools of everyday use in many fields. However, the enormous capabilities demonstrated by these models come, on the one hand, with apparent costs in terms of money, computational resources, and data. On the other hand, there are hidden costs for end users who rely on models trained by third parties, sacrificing awareness of and control over the tool while trying to evaluate its performance in their specific contexts. This is the case for supervised End-to-End (E2E) ASR systems and for self-supervised E2E-ASR, also referred to as Large Acoustic Models (LAMs). On the one hand, they provide an important starting point for building information systems oriented to speech interaction; on the other hand, they are complex to evaluate, use and adapt in specific contexts.
Keywords
End-to-End ASR, self-supervised, quality of data, communication style, responsible AI
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† These authors contributed equally.
vincenzonorman.vitale@unina.it (V. N. Vitale); e.tanda@studenti.unina.it (E. Tanda); cutugno@unina.it (F. Cutugno)
ORCID: 0000-0002-0365-8575 (V. N. Vitale); 0000-0001-9457-6243 (F. Cutugno)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Modern Automatic Speech Recognition (ASR) systems, among other Natural Language Processing (NLP) systems, achieve remarkable performance thanks to the computing potential enabled by Deep Neural Networks (DNNs). Indeed, over the last decade, the automatic speech recognition community has made great strides [1, 2, 3], moving from traditional hybrid modelling (Acoustic Model + Language Model) to end-to-end (E2E) modelling, which directly translates an input speech sequence into a sequence of output tokens using a single network, and then to self-supervised E2E models, also referred to as Large Acoustic Models (LAMs), which can model speech without the aid of labelled data. These revolutionary innovations have completely subverted the traditional architectures of ASR systems used in previous decades. They have also strongly impacted the cost-effectiveness and democratization of ASR systems. On the one hand, the change in architecture has made collecting and creating the datasets necessary for training more economical; previously, this required a large number of experts in speech analysis engaged in long and expensive manual labelling processes. On the other hand, it has allowed the creation of a large number of freely available, open-source, general-purpose ASRs, bringing these systems within the reach of a greater number of institutions and companies. However, their use remains limited due to the lack of benchmarks oriented towards specific contexts and communication styles. In this work, we analyze the evolution of ASR systems, how the nature of the data used for their training has changed, and the limitations of modern ASR systems. Finally, we propose an initiative for collecting high-quality Italian data, aimed at both performance verification and training based on specific communicative styles.

2. The evolution of ASR systems

ASR systems have been the subject of several revolutions, which have impacted both their internal architecture and the nature of the data employed for their training. Traditional ASR systems rely on two separate components [9]: the Acoustic Model (AM), aimed at converting the voice signal into a sequence of phones, and the Language Model (LM), aimed at transforming the sequence of phones received from the AM into the most likely and reliable transcription. These two models were initially realised with techniques such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs). Then, with the advent of Deep Neural Networks (DNNs), both came to be realized as supervised DNNs. Still, the output of both components remained the same: the AM produces the most likely sequence of phones given the input voice signal, while the LM provides the most reliable transcription given the input sequence of phones.
| Model | Type | Train Data | Year | Decoder | Encoder | Test-clean/other |
|---|---|---|---|---|---|---|
| Transformer [4] | E2E | 970Hr Transcribed | 2020 | RNN-T | Transformer | 2.0/4.6 |
| Conformer [5] | E2E | 970Hr Transcribed | 2020 | RNN-T | Conformer | 1.9/3.9 |
| Wav2vec2.0 [6] with Quantization | E2E Self-Supervised | 60KHr Untranscribed + 100Hr Transcribed | 2020 | CTC | Transformer | 1.8/3.3 |
| HuBERT [7] with KNN | E2E Self-Supervised | 60KHr Untranscribed + 100Hr Transcribed | 2021 | CTC | Transformer | 1.8/2.9 |
| W2V-BERT [8] with Quantization | E2E Self-Supervised | 60KHr Untranscribed + 100Hr Transcribed | 2021 | RNN-T | Conformer | 1.4/2.5 |

Table 1
Performance of E2E ASR systems on the Librispeech test sets (WER %, test-clean/test-other), ordered from least to most recent. For self-supervised systems, the algorithm used during the self-supervised pre-training phase is also reported.
This means that the two components had separate objectives and relied on different kinds of high-quality, costly datasets: on the one hand, the AM needs well-aligned sound-to-phone transcriptions; on the other hand, the LM needs a statistically representative set of phone-to-word samples in order to produce meaningful transcriptions. In both cases, providing data of adequate quality requires highly specialised professionals for hand labelling. This type of ASR system requires tens to hundreds of hours of speech to train the AM and a few million words to train the LM (depending on the context). The aim is to transcribe fairly long sentences with an accuracy linked to specific application contexts.

The turning point that led to the recent End-to-End ASR (E2E-ASR) [2] was the introduction of the Transformer [10] network architecture, on which most current AI models rely. Compared to traditional systems, in E2E-ASRs the voice signal is directly converted into its corresponding transcription without any intermediate, human-readable format. This evolution results in systems with a single objective that need only one, cheaper, dataset for training, since the intermediate phone transcription and alignment steps have been removed. The Transformer architecture [10] opens up the possibility of building a combination of AM and LM, now referred to as the Encoder and the Decoder, which directly maps an unaligned sequence of sounds to its transcription. With a few hundred hours of non-aligned transcribed speech and a supervised learning process, E2E-ASR systems outperform the previous generation, with word error rates of at most about 5% for pure Transformer-Encoder systems and about 4% for Conformer-Encoder systems [5] (see Table 1).

Clearly, the choice of Decoder module strongly impacts E2E-ASR performance; this module is usually implemented as a Connectionist Temporal Classification (CTC) model [11] or as a Recurrent Neural Network Transducer (RNN-T) [12]. CTC is a non-auto-regressive speech transcription technique which collapses consecutive, identical transcription labels (characters, word pieces, etc.) into one label, unless a special blank label separates them; the result is a label sequence whose length is at most that of the input vector sequence. CTC is one of the most widespread decoding techniques and, being non-auto-regressive, it is also considered computationally efficient, as it requires less time and fewer resources for training and inference.
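To make the collapse rule concrete, the following minimal greedy CTC decoding sketch (our illustration, not code from any of the cited systems; the blank symbol and the frame labels are hypothetical) turns a sequence of per-frame labels into a transcription:

```python
# Minimal greedy CTC decoding sketch (illustrative only).
# Collapses consecutive identical labels and drops the special blank label.

BLANK = "_"  # hypothetical blank/separator label

def ctc_greedy_decode(frame_labels: list[str]) -> str:
    """Collapse per-frame labels into a final label sequence."""
    output = []
    previous = None
    for label in frame_labels:
        # A repeated label is merged with the previous one unless a blank
        # separates the two occurrences.
        if label != previous and label != BLANK:
            output.append(label)
        previous = label
    return "".join(output)

# Duplicates collapse; the blank preserves the genuine double "l".
print(ctc_greedy_decode(list("hheel_lloo")))  # -> "hello"
```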
Conversely, the RNN-T (also named Transducer) is an auto-regressive speech transcription technique which overcomes CTC's limitations, namely its non-auto-regressive nature and its bounded label sequence length: an RNN-T can produce label sequences longer than the input vector sequence and models the long-term inter-dependency between transcription elements. A Transducer typically comprises two sub-decoding modules: one that forecasts the next transcription label based on the previous transcriptions (the prediction network), and one that combines the encoder and prediction-network outputs to produce a new transcription label (the joiner network). These features improve transcription speed and performance with respect to CTC, at the expense of greater training and computational resource requirements [12].
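The two sub-modules can be sketched as follows; this is a minimal PyTorch illustration under our own assumptions about layer types and sizes (a real Transducer additionally needs the RNN-T loss and a search procedure over the joint output):

```python
# Minimal RNN-T (Transducer) component sketch in PyTorch (illustrative only).
import torch
import torch.nn as nn

class PredictionNetwork(nn.Module):
    """Forecasts the next label from the previously emitted labels."""
    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, previous_labels: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(self.embed(previous_labels))
        return out  # (batch, label_steps, hidden)

class JoinerNetwork(nn.Module):
    """Combines encoder and prediction-network outputs into label logits."""
    def __init__(self, hidden: int = 256, vocab_size: int = 32):
        super().__init__()
        self.proj = nn.Linear(hidden, vocab_size)

    def forward(self, encoder_out: torch.Tensor,
                prediction_out: torch.Tensor) -> torch.Tensor:
        # Broadcast over audio frames (T) and label steps (U): (B, T, U, H).
        joint = encoder_out.unsqueeze(2) + prediction_out.unsqueeze(1)
        return self.proj(torch.tanh(joint))  # logits over the vocabulary
```

Because each emitted label is fed back into the prediction network, the output sequence is not bounded by the number of input frames, which is precisely the property that distinguishes the Transducer from CTC.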
Finally, the most recent advancement is the adoption of self-supervised training techniques, giving rise to what could be defined as the first truly End-to-End ASRs, namely Wav2Vec2 [6] and, shortly after, HuBERT [7]. Both are also referred to as Large Acoustic Models (LAMs) [13] because of their training process, which usually involves two main phases. The first is the pre-training phase, during which vast amounts of untranscribed speech data are employed to recognize and discretize hidden acoustic-unit representations, through processes such as quantization applied directly to the raw audio samples (Wav2Vec2 [6]) or clustering on MFCC features (HuBERT [7]). Then, in the last phase, a transcription module can be trained on much smaller datasets (a few hours) to reach an error rate of about 2% (see Table 1).
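To illustrate how such a pre-trained LAM is used in practice, the sketch below transcribes an audio file with the Hugging Face transformers library; the checkpoint name is one publicly available English example, and the audio file name is hypothetical:

```python
# Transcribing speech with a pre-trained Wav2Vec2 CTC model (sketch).
# Assumes: pip install torch transformers soundfile
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "facebook/wav2vec2-base-960h"  # example public checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

speech, sample_rate = sf.read("sample.wav")  # 16 kHz mono audio expected
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits  # (1, frames, vocab)

# Greedy CTC decoding: per-frame argmax, then collapse repeats and blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```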
3. Self-supervised E2E Solutions (?) to data shortages

Undeniably, by committing the model to learning all of its parts automatically, E2E-ASRs overcome the difficulties and cost-ineffectiveness of the data preparation and modelling phases of conventional systems, while requiring far more training data [14]. This shift significantly impacted ASR systems: on the one hand, it significantly reduced training data costs while increasing data volume, as shown by the availability of plenty of general-purpose training datasets [1, 3]; on the other hand, thanks to the cheapness of training data, ASR systems are now accessible to a wider public. Clearly, these innovations come with some expenses, which in this case consist of higher computational costs, longer training times, and a loss of modularity [3] compared to traditional ASR systems. Indeed, adapting such a general-purpose E2E-ASR to specific contexts means, in some cases, updating the Decoder (LM) to a special-purpose field, or updating the Encoder (AM) to handle a special type of speech, which requires fine-tuning and, in the worst cases, training the model from scratch.
Then, the advent of self-supervised systems impacted the adaptability of general-purpose E2E ASR, giving rise to Large Acoustic Models (LAMs). These are basically Encoders trained on vast amounts of untranscribed data, far cheaper than the data needed by plain E2E-ASR, which are then combined with a Decoder part trained on small quantities of language-specific transcribed data. The result is a large, general-purpose model that can be easily deployed in most contexts. Although such models are publicly available and, therefore, freely adaptable, the necessary computational resources are so prohibitive that they remain within the reach of only a few companies and institutions, even for simple fine-tuning.
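To make the adaptation step concrete, the sketch below shows a typical recipe with the Hugging Face transformers API; this is our illustration under stated assumptions (an example multilingual checkpoint and a hypothetical domain vocabulary size), not the procedure of any cited work:

```python
# Sketch: adapting a pre-trained LAM to a new domain (illustrative only).
# Assumes: pip install torch transformers
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",  # example multilingual checkpoint
    vocab_size=40,                      # hypothetical domain vocabulary size
    ignore_mismatched_sizes=True,       # re-initialize the CTC head
)

# Freeze the convolutional feature encoder so that only the Transformer
# layers and the new CTC head are updated during fine-tuning.
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.1f}M")
# The training loop itself (e.g., transformers.Trainer over a few hours of
# transcribed domain speech) is omitted from this sketch.
```

Even this reduced setup requires GPU memory and time that scale with the hundreds of millions of encoder parameters, which is the cost asymmetry discussed above.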
A further point to consider is that the advantages of both plain E2E ASR and self-supervised systems come at the expense of a lower interpretability of the systems' internals, which makes it difficult to diagnose errors and limits their usage in critical contexts [3]. However, some studies in the field of eXplainable AI (XAI) [15] try to provide explanations and methodologies for analysing the behaviours and phenomena modelled by various E2E ASR systems, aiming to make them more interpretable [16, 17, 18, 19], although these are still based on special-purpose data.

To summarize, although the innovations introduced by E2E and self-supervised E2E systems have enabled their fast diffusion, their industrial and institutional deployment remains subject to limitations [3] which, in some cases, are strongly related to the availability of special-purpose data. Indeed, employing a general-purpose E2E ASR system in a specific domain requires evaluation and potentially fine-tuning or training on domain-specific data, which is usually unavailable. Another aspect to consider is how, and to what extent, the democratisation of ASR systems has been impacted: on the one hand, it is possible to obtain much more data for the same cost; on the other hand, the same quantity of resources is no longer sufficient, especially for training purposes.
4. High-quality data for context-specific assessment

Clearly, the availability of good-quality, well-categorized data is paramount in the current application landscape. On the one hand, such data is essential to evaluate pre-trained systems in specific contexts, with speaking styles related to different communication situations; on the other hand, it is crucial for training and fine-tuning modern supervised and self-supervised E2E ASR. To this end, the Phoné consortium was born as a voluntary initiative to collect, verify and distribute transcribed and untranscribed Italian speech datasets from various application contexts. Table 2 shows the amount of data collected and verified so far by the consortium, whose goal is to provide Italian institutions and companies with adequate instruments to evaluate these promising tools, which are currently assessed in contexts and communication styles that do not reflect the target ones.

Currently, the data is divided into two macro-categories, Transcribed and Untranscribed, to also enable the future training of self-supervised E2E-ASR. Datasets are then further divided into specific communication styles [20, 21]:

• Monologic speech involves only one person speaking, without interaction with an interlocutor. This type of speech is characterized by consistency and structure, as it typically consists of lectures, speeches or other situations that require preliminary preparation. As a result, the speech appears cohesive and well-organized, and the language register tends to be higher and more formal.

• Dialogic speech involves two or more people in a conversation, characterized by exchanges of messages and information. It is thus configured as a communicative act with a dynamic structure. Unlike monologic speech, dialogic speech does not involve prior preparation; the speech therefore tends to be syntactically simpler, the articulation of words tends to be less precise (hypoarticulation), and expression tends to be more concise.

• In Read speech, the speaker reads a written text aloud (as in the case of audiobooks); this type of speech is therefore characterized by clear pronunciation (with a tendency towards hyperarticulation), complete syntax, and greater coherence and cohesion of the text. A further feature is the modulation of reading speed and the use of strategic pauses and intonation to improve communicative effectiveness.
| Material Type | Speech Type | Minutes |
|---|---|---|
| Transcribed | Monologic | 500 |
| Transcribed | Dialogic | 400 |
| Transcribed | Read | 120 |
| Untranscribed | Monologic | 10000 |
| Untranscribed | Dialogic | 500 |
| Untranscribed | Read | 2200 |

Table 2
Material collected and verified for the evaluation and training of E2E-ASR systems (both supervised and self-supervised) in specific contexts for the Italian language.

Beyond the ASR-related aspects, the consortium's purposes also extend to other voice-related tasks, including, but not limited to, Text-To-Speech (TTS), Speaker Identification (SI), and Speaker Verification (SV).
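Once such context-specific test sets are available, evaluating a pre-trained system on them amounts to computing the word error rate (WER) between reference and hypothesis transcriptions. A minimal sketch, assuming the jiwer package and hypothetical transcripts:

```python
# Sketch: scoring an ASR system on a context-specific test set.
# Assumes: pip install jiwer; the transcripts below are made-up examples.
from jiwer import wer

references = [  # ground-truth transcriptions
    "buongiorno a tutti e benvenuti",
    "la lezione di oggi riguarda il parlato spontaneo",
]
hypotheses = [  # the system's output for the same utterances
    "buongiorno a tutti benvenuti",
    "la lezione di oggi riguarda il parlato spontaneo",
]

# WER = (substitutions + deletions + insertions) / reference word count,
# accumulated over the whole test set.
print(f"WER: {wer(references, hypotheses):.2%}")
```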
5. Conclusion

In this work, we presented the current panorama of E2E ASR systems, how their data usage has evolved along with technological improvements, and the issues that these improvements have solved or introduced. Firstly, we observed the significant improvement in model performance, while pointing out the difficulty of assessing model capabilities on specific communication styles and domains. We also observed the shift in model training costs: away from data, which has become cheaper and easier to collect, and towards computing resources, which keep growing in quantity and cost. Then, we observed how the advantages introduced by modern E2E (supervised and self-supervised) ASRs come at the expense of increased complexity, which consequently reduces their interpretability. Finally, we proposed a voluntary, high-quality data collection initiative for evaluating and training systems on various speech communication styles, to enable a more informed use and a greater accessibility of E2E-ASR systems.
References

[1] M. Malik, M. K. Malik, K. Mehmood, I. Makhdoom, Automatic speech recognition: a survey, Multimedia Tools and Applications 80 (2021) 9411–9457.
[2] J. Li, et al., Recent advances in end-to-end automatic speech recognition, APSIPA Transactions on Signal and Information Processing 11 (2022).
[3] R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, S. Watanabe, End-to-end speech recognition: A survey, IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023).
[4] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, S. Kumar, Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7829–7833.
[5] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented transformer for speech recognition (2020).
[6] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460.
[7] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, Hubert: Self-supervised speech representation learning by masked prediction of hidden units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451–3460.
[8] Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, Y. Wu, W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 244–250.
[9] S. Karpagavalli, E. Chandra, A review on automatic speech recognition architecture and approaches, International Journal of Signal Processing, Image Processing and Pattern Recognition 9 (2016) 393–404.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[11] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.
[12] A. Graves, Sequence transduction with recurrent neural networks, arXiv preprint arXiv:1211.3711 (2012).
[13] L. M. Giordano Orsini, V. N. Vitale, F. Cutugno, Large scale acoustic models: A new perspective, Sistemi intelligenti (2023). doi:10.1422/108137.
[14] G. Coro, F. V. Massoli, A. Origlia, F. Cutugno, Psycho-acoustics inspired automatic speech recognition, Computers & Electrical Engineering 93 (2021) 107238.
[15] D. Gunning, Explainable artificial intelligence (xai), Defense Advanced Research Projects Agency (DARPA), nd Web 2 (2017) 1.
[16] A. Prasad, P. Jyothi, How accents confound: Probing for accent information in end-to-end speech recognition systems, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3739–3753.
[17] D. Ma, N. Ryant, M. Liberman, Probing acoustic representations for phonetic properties, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 311–315.
[18] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analysis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914–921.
[19] V. N. Vitale, F. Cutugno, A. Origlia, G. Coro, Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique, Neural Computing and Applications (2024) 1–27.
[20] M. Nakamura, K. Iwano, S. Furui, Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance, Computer Speech & Language 22 (2008) 171–184.
[21] P. Azizova, Linguistic analysis and learning of dialogical speech in literary texts, JETT 14 (2023) 86–94.