Towards a responsible usage of AI-based Large Acoustic Models for Automatic Speech Recognition: on the importance of data in the self-supervised era

Vincenzo Norman Vitale1,2,*,†, Emilia Tanda1,† and Francesco Cutugno1,2,*,†

1 University of Naples, Federico II, Corso Umberto I, 40, Naples, 80138, Italy
2 UrbanECO Research Center, University of Naples, Federico II, via Tarsia, 31, Naples, 80134, Italy

Abstract
The evolution of artificial intelligence models has made them tools of everyday use in many fields. However, the enormous capabilities demonstrated by these models have, on the one hand, some apparent costs in terms of money, computational resources, or data. On the other hand, there are some hidden costs for end users who rely on models trained by third parties, sacrifice awareness and control of the tool, and still have to evaluate its performance in their specific contexts. This is the case of supervised End-to-End (E2E) ASR systems and self-supervised E2E-ASR, also referred to as Large Acoustic Models (LAM). They provide an important starting point for building information systems oriented to speech interaction, but they are complex to evaluate, use and adapt in specific contexts.

Keywords
End-to-End ASR, self-supervised, quality of data, communication style, responsible AI

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† These authors contributed equally.
Email: vincenzonorman.vitale@unina.it (V. N. Vitale); e.tanda@studenti.unina.it (E. Tanda); cutugno@unina.it (F. Cutugno)
ORCID: 0000-0002-0365-8575 (V. N. Vitale); 0000-0001-9457-6243 (F. Cutugno)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Modern Automatic Speech Recognition (ASR) systems, like other Natural Language Processing (NLP) systems, achieve remarkable performance thanks to the computing potential enabled by Deep Neural Networks (DNN). Indeed, over the last decade, the automatic speech recognition community has made great strides [1, 2, 3], moving from traditional hybrid modelling (Acoustic Model + Language Model) to end-to-end (E2E) modelling that directly translates an input speech sequence into a sequence of output tokens using a single network, and then to self-supervised E2E models, also referred to as Large Acoustic Models (LAMs), that can model speech without the aid of labelled data. These innovations have completely subverted the traditional architectures of ASR systems used in previous decades. In addition, there has also been a strong impact on the cost-effectiveness and democratization of ASR systems. On the one hand, the change in architecture has made it more economical to collect and create the data sets necessary for training, which previously required a large number of experts in the field of speech analysis, involved in long and expensive processes of manual labelling. On the other hand, it has allowed the creation of a large number of freely available and open-source general-purpose ASRs, bringing these systems within the reach of a greater number of institutions and companies. However, their use remains limited due to the lack of benchmarks oriented towards specific contexts and communication styles.

In this work, we analyze the evolution of ASR systems, how the nature of the data used for their training has changed, and the limitations of modern ASR systems. Finally, we propose an initiative aimed at collecting high-quality data in Italian for both performance verification and training based on specific communicative styles.

2. The evolution of ASR systems

ASR systems have been the subject of several revolutions, which have impacted their internal architecture and the nature of the data employed for their training. Traditional ASR systems rely on two separate components [9]: the Acoustic Model (AM), which converts the voice signal into a sequence of phones, and the Language Model (LM), which transforms the sequence of phones received from the AM into the most likely and reliable transcription. These two models were initially realised with techniques such as Hidden Markov Models (HMM) or Gaussian Mixture Models (GMM). Then, with the advent of Deep Neural Networks (DNN), both were realized as supervised DNNs. Still, the output of each component remained the same: the AM produces the most likely sequence of phones given the input voice signal, while the LM provides the most reliable transcription given the input sequence of phones. This means that the two components had separate objectives and relied on different kinds of high-quality and costly datasets. On the one hand, the AM needs well-aligned sound-to-phone transcriptions. On the other hand, the LM needs a statistically representative set of phone-to-word samples in order to provide meaningful transcriptions. In both cases, providing data of adequate quality requires hand labelling by highly specialised professionals. This type of ASR system requires tens to hundreds of hours of speech to train the AM and a few million words to train an LM (depending on the context). The aim is to transcribe fairly long sentences with an accuracy linked to specific application contexts.

The turning point that led to the recent End-to-End ASR (E2E-ASR) [2] was the introduction of the Transformer [10] network architecture, on which most current AI models rely. In contrast to traditional systems, in E2E-ASRs the voice signal is directly converted into its corresponding transcription without any intermediate, human-readable format. This evolution results in systems with a single objective that need only one, cheaper dataset to be trained, since the intermediate phone transcription and the alignment steps have been removed. The Transformer architecture [10] opens up the possibility of building a combination of AM and LM, now referred to as the Encoder and Decoder, which directly maps an unaligned sequence of sounds to its transcription. With a few hundred hours of non-aligned transcribed speech and a supervised learning process, E2E-ASR systems outperform the previous generation on average, with an error rate of up to 5% in the case of pure Transformer-Encoder systems, or up to 4% in the case of Conformer-Encoder systems [5] (see Table 1 for performance).

Clearly, the choice of Decoder implementation strongly impacts E2E ASR performance. This module is usually implemented as a Connectionist Temporal Classification (CTC) model [11] or as a Recurrent Neural Network Transducer (RNN-T) [12]. CTC is a non-auto-regressive speech transcription technique which collapses consecutive identical transcription labels (character, word piece, etc.) into one label unless a special label separates them. The result is a sequence of labels whose length is shorter than or equal to that of the input vector sequence. CTC is one of the most widespread decoding techniques. Being non-auto-regressive, it is also considered computationally efficient, as it requires less time and fewer resources in the training and inference phases. Conversely, the RNN-T (also named Transducer) is an auto-regressive speech transcription technique which overcomes CTC's limitations, i.e., its non-auto-regressive nature and its limited label sequence length. An RNN-T can produce label sequences longer than the input vector sequence, and it models long-term inter-dependencies between transcription elements. A Transducer typically comprises two sub-decoding modules: one that forecasts the next transcription label based on the previous transcriptions (prediction network), and another that combines the encoder and prediction-network outputs to produce a new transcription label (joiner network). These features improve transcription speed and performance with respect to CTC, at the expense of more training and computational resources [12].

Finally, the most recent advancement consists of the employment of self-supervised training techniques, giving rise to what could be defined as the first truly End-to-End ASR, namely Wav2Vec2 [6] and, after a while, HuBERT [7]. Both are also referred to as Large Acoustic Models (LAM) [13] because of their training process, which usually involves two main phases. The first one is the pre-training phase, during which vast amounts of untranscribed speech data are employed in order to recognize and discretize representations of hidden acoustic units, through processes such as quantization applied directly to the raw audio samples (Wav2Vec2 [6]) or clustering on MFCC features (HuBERT [7]). Then, during the last phase, a transcription module can be trained on smaller datasets (a few hours) in order to obtain an error rate of about 2% (see Table 1).

| Model | Type | Train Data | Year | Decoder | Encoder | WER % (test-clean/test-other) |
|---|---|---|---|---|---|---|
| Transformer [4] | E2E | 970 h transcribed | 2020 | RNN-T | Transformer | 2.0/4.6 |
| Conformer [5] | E2E | 970 h transcribed | 2020 | RNN-T | Conformer | 1.9/3.9 |
| Wav2vec2.0 [6] with quantization | E2E self-supervised | 60k h untranscribed + 100 h transcribed | 2020 | CTC | Transformer | 1.8/3.3 |
| HuBERT [7] with k-means clustering | E2E self-supervised | 60k h untranscribed + 100 h transcribed | 2021 | CTC | Transformer | 1.8/2.9 |
| W2V-BERT [8] with quantization | E2E self-supervised | 60k h untranscribed + 100 h transcribed | 2021 | RNN-T | Conformer | 1.4/2.5 |

Table 1
Performance of E2E ASR systems on the LibriSpeech test sets, ordered from the least recent to the most recent. For self-supervised systems, the algorithm used during the self-supervised pre-training phase is also reported.

3. Self-supervised E2E Solutions (?) to data shortages

Undeniably, by committing the model to learn all parts automatically, E2E-ASRs overcome the difficulties and cost-ineffectiveness of the data preparation and modelling phases of conventional systems, while requiring far more training data [14]. This shift significantly impacted ASR systems. On the one hand, it reduced training data costs while increasing data volume, as shown by the availability of plenty of general-purpose training datasets [1, 3]. On the other hand, thanks to the lower cost of training data, ASR systems are now accessible to a wider public. Clearly, these innovations come at a price, which in this case consists of higher computational costs, longer training times, and loss of modularity [3] compared to traditional ASR systems. Indeed, adapting such a general-purpose E2E-ASR to specific contexts means, in some cases, updating the Decoder (LM) to a special-purpose field or updating the Encoder (AM) to handle a special type of speech, which requires fine-tuning and, in the worst cases, training the model from scratch.

Then, the advent of self-supervised systems impacted the adaptability of general-purpose E2E ASR, giving rise to Large Acoustic Models (LAMs). These are essentially Encoders trained on vast amounts of non-transcribed datasets, which are cheaper than the data needed by simple E2E-ASR, and then combined with a Decoder part trained on small quantities of language-specific transcribed data. The result is a large, general-purpose model that can be easily deployed in most contexts. Although these models are publicly available and, therefore, freely adaptable, the necessary computational resources are so prohibitive that they are within the reach of only a few companies and institutions, even for simple fine-tuning.

A further point to be considered is that the advantages of both simple and self-supervised E2E ASR come at the expense of lower interpretability of the systems' internals, making it difficult to diagnose errors and limiting their usage in critical contexts [3]. However, some studies in the field of eXplainable AI (XAI) [15] try to provide explanations and methodologies for analysing behaviours and phenomena modelled by various E2E ASR systems, aiming to make them more interpretable [16, 17, 18, 19], though still relying on special-purpose data.

To summarize, although the innovations introduced by E2E and self-supervised E2E systems have allowed their fast diffusion, their industrial and institutional deployment remains subject to limitations [3] which, in some cases, are strongly related to the availability of special-purpose data. Indeed, employing a general-purpose E2E ASR system in a specific domain requires evaluation and potential fine-tuning or training on domain-specific data, which is usually unavailable. Another aspect to consider is how and to what extent the democratisation of ASR systems has been impacted. In fact, while it is possible to obtain much more data for the same cost, the same quantity of resources is no longer sufficient, especially for training purposes.

4. High-quality data for context-specific assessment

Clearly, the availability of good-quality and well-categorized data is paramount in the current application landscape. On the one hand, such data is essential to evaluate pre-trained systems in specific contexts with speaking styles related to different communication situations. On the other hand, such data is crucial for training and fine-tuning modern supervised and self-supervised E2E ASR. To this end, the Phoné consortium was born as a voluntary initiative to collect, verify and distribute transcribed and non-transcribed Italian speech datasets in various application contexts. Table 2 shows the current amount of data collected and verified by the consortium to provide Italian institutions and companies with adequate instruments to evaluate these promising tools, which are, however, assessed in contexts and communication styles that do not reflect the target ones.

Currently, data is divided into two macro-categories, namely Transcribed and Untranscribed, to enable the future training of self-supervised E2E-ASR. Then, datasets are further divided into specific communication styles [20, 21]:

• Monologic speech involves only one person speaking without interacting with an interlocutor. This type of speech is characterized by consistency and structuring, as it typically consists of lectures, speeches or situations that require preliminary preparation. As a result, the speech appears cohesive and well-organized. The language register tends to be higher and more formal.
• Dialogic speech involves two or more people in a conversation, characterized by exchanges of messages and information. It is thus configured as a communicative act with a dynamic structure. Unlike monologic speech, dialogic speech does not involve prior preparation; therefore, the speech tends to be simpler from a syntactic point of view, the articulation of words tends to be less precise (hypoarticulation), and it is also characterized by greater conciseness of expression.
• In Read speech, the speaker reads a written text aloud (as in the case of audiobooks); therefore, this type of speech is characterized by clear pronunciation (there is a tendency towards hyperarticulation), complete syntax and greater coherence and cohesion of the text.
  Furthermore, read speech is marked by the modulation of reading speed and the use of strategic pauses and intonation to improve communicative effectiveness.

| Material Type | Speech Type | Minutes |
|---|---|---|
| Transcribed | Monologic | 500 |
| Transcribed | Dialogic | 400 |
| Transcribed | Read | 120 |
| Untranscribed | Monologic | 10000 |
| Untranscribed | Dialogic | 500 |
| Untranscribed | Read | 2200 |

Table 2
List of material collected and verified for the evaluation and training of E2E-ASR systems (both supervised and self-supervised) in specific contexts for the Italian language.

Beyond ASR-related aspects, the consortium's purposes also extend to other voice-related tasks, which include, but are not limited to, Text-To-Speech (TTS), Speaker Identification (SI), and Speaker Verification (SV).

5. Conclusion

In this work, we present the current panorama of E2E ASR systems, how their data usage evolved along with technological improvements, and the issues that these improvements solved or introduced. Firstly, we observe the significant improvement in model performance, while pointing out issues connected to assessing model capabilities in specific communication styles and domains. We also observe a shift in model training costs: data are becoming cheaper and easier to collect, while the required computing resources are growing in quantity and cost. Then, we observe how the advantages introduced by modern E2E (supervised and self-supervised) ASRs come at the expense of an increase in their complexity, which consequently reduces their interpretability. Finally, we propose a voluntary, high-quality data collection initiative to evaluate and train systems on various speech communication styles, to enable more informed use and greater accessibility of E2E-ASR systems.

References

[1] M. Malik, M. K. Malik, K. Mehmood, I. Makhdoom, Automatic speech recognition: a survey, Multimedia Tools and Applications 80 (2021) 9411–9457.
[2] J. Li, et al., Recent advances in end-to-end automatic speech recognition, APSIPA Transactions on Signal and Information Processing 11 (2022).
[3] R. Prabhavalkar, T. Hori, T. N. Sainath, R. Schlüter, S. Watanabe, End-to-end speech recognition: A survey, IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023).
[4] Q. Zhang, H. Lu, H. Sak, A. Tripathi, E. McDermott, S. Koo, S. Kumar, Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 7829–7833.
[5] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented transformer for speech recognition (2020).
[6] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in neural information processing systems 33 (2020) 12449–12460.
[7] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, Hubert: Self-supervised speech representation learning by masked prediction of hidden units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451–3460.
[8] Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, Y. Wu, W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 244–250.
[9] S. Karpagavalli, E. Chandra, A review on automatic speech recognition architecture and approaches, International Journal of Signal Processing, Image Processing and Pattern Recognition 9 (2016) 393–404.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[11] A. Graves, S. Fernández, F. Gomez, J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, in: Proceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376.
[12] A. Graves, Sequence transduction with recurrent neural networks, arXiv preprint arXiv:1211.3711 (2012).
[13] L. M. Giordano Orsini, V. N. Vitale, F. Cutugno, Large scale acoustic models: A new perspective, Sistemi intelligenti (2023). doi:10.1422/108137.
[14] G. Coro, F. V. Massoli, A. Origlia, F. Cutugno, Psycho-acoustics inspired automatic speech recognition, Computers & Electrical Engineering 93 (2021) 107238.
[15] D. Gunning, Explainable artificial intelligence (xai), Defense advanced research projects agency (DARPA), nd Web 2 (2017) 1.
[16] A. Prasad, P. Jyothi, How accents confound: Probing for accent information in end-to-end speech recognition systems, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3739–3753.
[17] D. Ma, N. Ryant, M. Liberman, Probing acoustic representations for phonetic properties, in: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 311–315.
[18] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analysis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914–921.
[19] V. N. Vitale, F. Cutugno, A. Origlia, G. Coro, Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique, Neural Computing and Applications (2024) 1–27.
[20] M. Nakamura, K. Iwano, S. Furui, Differences between acoustic characteristics of spontaneous and read speech and their effects on speech recognition performance, Computer Speech & Language 22 (2008) 171–184.
[21] P. Azizova, Linguistic analysis and learning of dialogical speech in literary texts, JETT 14 (2023) 86–94.
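
As a supplementary illustration of the CTC decoding rule described in Section 2 (consecutive identical labels are collapsed into one unless a special label separates them), the following minimal Python sketch shows the collapsing step only. The function name `ctc_collapse` and the `_` symbol used for the special separator label are assumptions made for this illustration; they are not taken from any of the cited systems, and a real decoder would first pick the per-frame labels from the network outputs.

```python
from itertools import groupby

# Hypothetical symbol standing in for the special CTC separator label.
BLANK = "_"


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence into a transcription.

    frame_labels: one label per input frame (e.g. characters).
    Returns a list of labels never longer than the input sequence.
    """
    # Step 1: merge each run of consecutive identical labels into one label.
    merged = [label for label, _run in groupby(frame_labels)]
    # Step 2: drop the separator label, which only marks boundaries.
    return [label for label in merged if label != blank]


if __name__ == "__main__":
    frames = list("cc_aa_tt__t")
    # The two runs of "t" are kept apart because a separator lies between them.
    print("".join(ctc_collapse(frames)))  # prints "catt"
```

The output can never be longer than the frame sequence, which is the length limitation of CTC mentioned in Section 2 and the one the RNN-T is designed to remove.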