<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Toward Optimised Datasets to Fine-tune ASR Systems Leveraging Less but More Informative Speech</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Loredana Schettino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Norman Vitale</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Vietti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>Piazza Università, 1, 39100 Bolzano</addr-line>
          ,
<country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Naples Federico II</institution>
          ,
          <addr-line>C.so Umberto I, 40, 80138 Napoli</addr-line>
          ,
<country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Modern Automatic Speech Recognition (ASR) systems, based on Deep Neural Networks (DNNs), have achieved remarkable performance by modelling huge quantities of speech data. However, recent studies have shown that fine-tuning pre-trained models, despite providing a powerful solution in low-resource settings, lacks robustness across different speech styles, and that this is not just related to the amount of training data, but to substantial differences in phonetic-prosodic characteristics. Therefore, this study aims to explore how modern E2E ASR systems’ performance is affected by the amount of training data and the type of speech data, and which acoustic-phonetic features most markedly exert an influence. To this aim, a k-fold cross-validation was performed by fine-tuning a pre-trained FastConformer model with datasets varying in type of speech data and size. Then, we performed a correlation analysis between the values of the acoustic characteristics of the data and the recognition scores. The analyses allow the identification of an optimal combination of speech data type and amount of training data. Also, results show that using both more spontaneous and more controlled speech can be beneficial, provided that the speech rate is contained.</p>
      </abstract>
      <kwd-group>
<kwd>Speech style</kwd>
        <kwd>ASR</kwd>
<kwd>Sample Efficiency</kwd>
        <kwd>Acoustic Features</kwd>
        <kwd>K-fold Cross-Validation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Spoken language is intrinsically variable. Speech produced to convey a message can vary widely depending on several internal and external factors, such as the communicative and contextual situation, the formality of the exchange, and the speaker’s disposition and individual choices of the forms and phonetic realisations deemed most appropriate and functional to convey the intended message given the specific conditions of production and reception [1]. Thus, speech variability can be described as the synergetic contribution of linguistic, contextual, and social factors [2], which results in different types of speech, often referred to as speech styles, characterised by varying levels of spontaneity, fluency, speaking rate, prosodic variation, and degree of phonic specification [3, 1].</p>
      <p>Modern ASR systems, based on Deep Neural Networks (DNNs), have achieved remarkable performance by modelling the linguistic and acoustic features of spoken language. However, these systems implicitly learn to model only a small proportion of the possible variation that characterises spoken language. As a result, error rates increase with the degree of linguistic and phonetic variation of the data considered. In fact, while most benchmarks consist of read or rather controlled speech productions, the interest in ASR applications in real contexts, such as human-machine interaction or the transcription of spontaneous conversation, led to the evaluation of ASR performance in different, less controlled and more spontaneous scenarios, which resulted in different performance values for other types of data, e.g., lower for more spontaneous datasets [4]. In particular, a recent study on the evaluation of ASR systems, based on state-of-the-art supervised, self-supervised, and weakly supervised End-to-End models, on Italian speech [5, 6] showed consistent performance differences across speech types: dialogic, monologic, and read speech, namely increasing performance from dialogic speech to monologic speech and from the latter to read speech.</p>
      <p>
        Efforts devoted to overcoming this issue often consist of building complex and costly models that require large amounts of data and computational resources. However, this can be problematic, especially when working with so-called “low-resource languages”. Different studies have provided evidence that a powerful solution is provided by fine-tuning pre-trained models (see [7]). However, [<xref ref-type="bibr" rid="ref1">8</xref>] adopted this approach in a study on low-resource speech recognition and showed not only a lack of robustness in Word Error Rate (WER) distributions across different speakers and conversation contexts, but also that this was not related to the amount of training data, but to substantial differences in prosody, pronunciation and utterance length. This led to acknowledging that using more data and more complex techniques is not sufficient to address the problem of automatically recognising different types of data. Rather, we need to investigate how different types of data and their specific acoustic-prosodic features affect the performance of ASR systems to address this robustness issue [7].
      </p>
      <p>Based on this body of research, this work aims to contribute to the study of how different types of speech data are modelled and how this affects the robustness of the model, toward the definition of an optimal dataset to obtain robust recognition systems.</p>
      <p>The present study aims to contribute to this line of research by developing and validating a method to address the following research questions (RQs):
RQ1. If modern E2E ASR systems’ performance is affected by the amount of training data and the type of speech data, can we identify the optimal combination of speech data type and amount of training data?
RQ2. Which acoustic-phonetic characteristics affect modern E2E ASR performance the most? To what extent?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        To investigate how data characterised by different features (data type) and varying amounts of training data (training data time) can affect the fine-tuning of modern ASR models, our method includes a K-fold cross-validation procedure [<xref ref-type="bibr" rid="ref8">15</xref>]. This technique is used when there is a limited amount of data and provides insight into the model’s performance across different data subsets. It consists of splitting the data into subsets (folds) and training different models, as many as the number of folds, each time considering a different combination of folds as training (potentially validation) and test sets. The approach follows these key steps, sketched in code below:
        • selection of data with different speech characteristics;
        • fold splitting according to training-specific criteria, i.e., speech type and training fold size (minutes);
        • selection of a pre-trained model for fine-tuning;
        • evaluating model performance for the selected datasets;
        • fine-tuning the pre-trained model by training it on the different folds;
        • comparison of the performance of the fine-tuned models;
        • Word Error Rate – acoustic features correlation analysis.
      </p>
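      <p>To make the fold-splitting step concrete, the following Python sketch builds speaker-grouped folds and size-limited training sub-folds. It is a minimal illustration under assumed data structures (a list of utterance dictionaries with speaker and duration fields), not the exact pipeline used in this study.</p>
      <preformat>
# Sketch of speaker-grouped fold construction. The utterance layout
# (dicts with "speaker" and "duration" keys) is an assumed placeholder.
from collections import defaultdict

SUB_FOLD_MINUTES = [15, 30, 60, 120, None]  # None = whole speaker fold (~3 h)

def build_folds(utterances):
    """Group utterances by speaker, then cut size-limited training sub-folds."""
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt)

    folds = {}
    for speaker, utts in by_speaker.items():   # one fold per speaker
        for minutes in SUB_FOLD_MINUTES:       # five sub-fold sizes
            budget = float("inf") if minutes is None else minutes * 60.0
            subset, total = [], 0.0
            for utt in utts:
                if total + utt["duration"] > budget:
                    break
                subset.append(utt)
                total += utt["duration"]
            folds[f"{speaker}_{minutes or 'all'}"] = subset  # e.g. "G03_120"
    return folds

# Keeping each speaker's samples in a single split means a model fine-tuned
# on one speaker's sub-fold is always tested on other, unseen speakers.
</preformat>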
      <sec id="sec-3-0">
        <title>3.1. Data</title>
        <p>Given the methodological focus of this study, we decided to work with a well-known, restricted dataset to gain clearer insights into the effectiveness of the method and the findings. Hence, we selected data from a corpus that was the object of previous phonetic studies [<xref ref-type="bibr" rid="ref9 ref10">16, 17</xref>], namely the CHROME corpus [<xref ref-type="bibr" rid="ref11">18</xref>]. The corpus comprises approximately 10 hours of speech produced by three female expert museum guides (G) leading visits at the San Martino Charterhouse (in Naples). It consists of Neapolitan Italian, informative, semi-monologic, semi-spontaneous speech characterised by a high degree of discourse planning and an asymmetrical relationship between the interlocutors. The three speakers show idiosyncratic speech styles [19]. In particular, they use different speech rates and different “hesitation strategies”. G01 produces approximately 159 words per minute and seems to privilege an “on the fly” production, using several non-lexical fillers (eeh, ehm) and prolongations to cover speech planning time; G02 shows a higher speech rate, producing about 174 words per minute, where utterances are juxtaposed to each other as she tends to avoid silent pauses altogether, avoid prolongations and non-lexical fillers, and prefer lexical fillers instead; G03 adopts a more controlled, “rhetorical” style, with a lower speech rate of about 146 words per minute, mainly using lexical fillers and silent pauses.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Data Preparation</title>
        <p>Using the text annotation in TextGrid format [20], the dataset was split into Inter-Pausal Units based on pauses longer than 250 ms. This resulted in utterances with a mean duration of 4.81 seconds (standard deviation = 2.88, maximum length = 30 s). The text was normalised by removing special characters, but leaving the annotation of segmental phenomena such as fillers (eeh, ehm, mh) and prolongations (e.g., laaa); a sketch of this segmentation step is given below. The final dataset consists of slightly more than 3 h and 27881 tokens for G01, about 3 h and a half and 39145 tokens for G02, and about 3 h and 29341 tokens for G03. G02 shows a higher speech rate than both G01 and G03. See Table 1 for total duration, tokens and speech rate (SR), and mean (m) and standard deviation (sd) of utterance duration and tokens.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Modelling</title>
        <p>Selecting an appropriate pre-trained model is a critical decision that influences the success of subsequent downstream tasks. While many high-performing models are available, such as Whisper or Phi-4, our selection was guided by several practical requirements: language-specific support for Italian, computational efficiency, and public availability to ensure experimental reproducibility and democratic access. Accordingly, we chose the FastConformer model pre-trained on Italian by Nvidia [21]. The FastConformer is an efficient variant of the Conformer architecture, designed to significantly reduce the computational cost and latency of the standard Conformer model while maintaining high accuracy. This makes it particularly suitable for real-time speech recognition tasks. Furthermore, the architecture is highly scalable, and indeed, FastConformer is at the core of top-performing Nvidia ASR systems like Canary and Parakeet.</p>
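        <p>As an illustration, a checkpoint of this family can be loaded and run with Nvidia’s NeMo toolkit as sketched below; the model identifier shown is an assumed example, not necessarily the exact checkpoint used in this study.</p>
        <preformat>
# Minimal NeMo sketch: load a pre-trained Italian FastConformer checkpoint
# and transcribe an audio file. The model name is an assumed identifier.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/stt_it_fastconformer_hybrid_large_pc"
)
hypotheses = model.transcribe(["utterance_001.wav"])  # hypothetical file
print(hypotheses[0])
</preformat>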
        <p>The Group K-fold is a variation of k-fold cross-validation intended for scenarios where the data has a predefined group structure. The key constraint is to ensure that the same group is not represented across different splits, namely the training, validation and test sets. In our case, samples from the same speaker are grouped in the same split. This method prevents data leakage by ensuring that the model generalises to new, unseen groups, not just to new samples from existing groups. The corpus is split into three folds, one per speaker and idiosyncratic speech style (data set type), and these were further split into five sub-folds of different sizes (split size), resulting in the 15 different fold combinations described in Figure 1.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Evaluation and correlation analysis</title>
        <p>The model performance across the different folds was evaluated considering the Word Error Rate (WER) computed at the utterance level. Model comparison was conducted based on WER mean and distribution values per fold to observe which model performed better across the considered folds; a sketch of this computation follows.</p>
        <p>Then, a correlation analysis between data characteristics and WER was performed to examine the influence of acoustic features on the performance of the different time folds. Feature values were automatically extracted for each utterance employing the OpenSmile toolkit [22]. The extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPSv02) [23], i.e., a restricted set of features based on interdisciplinary evidence and theoretical significance, was selected as the feature set. The study focuses, in particular, on the features that could be considered the most relevant, as reported in previous literature [7] and by inspection of the data.</p>
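        <p>The extraction and correlation step can be sketched as follows with the opensmile Python package and SciPy; the file names and WER values are hypothetical placeholders, and the selected column is one example functional from the eGeMAPSv02 set.</p>
        <preformat>
# eGeMAPSv02 functionals per utterance, correlated with utterance WERs.
# File names and WER values are placeholders for illustration.
import opensmile
from scipy.stats import pearsonr

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,  # one vector per file
)

files = ["utt_001.wav", "utt_002.wav", "utt_003.wav"]
wer_per_utt = [0.10, 0.25, 0.40]

features = smile.process_files(files)  # pandas DataFrame, one row per file
r, p = pearsonr(features["F0semitoneFrom27.5Hz_sma3nz_amean"], wer_per_utt)
print(f"mean F0 vs WER: Pearson r = {r:.2f} (p = {p:.3f})")
</preformat>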
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Model performance and comparison</title>
        <p>The analysis starts by evaluating the model’s baseline performance on the defined datasets before applying k-fold cross-validation, to establish a reference for comparison. The selected model performs worse on the G01 dataset (mWER = 0.51, sd = 0.32) than on the G03 dataset (mWER = 0.40, sd = 0.26) and the G02 dataset (mWER = 0.39, sd = 0.26); see the first three rows of Table 2. The overall mean WER across the different data type sets is 0.43 (sd = 0.26).</p>
        <p>Then, we observe the model’s performance on each fold. Figure 2 and Table 2 show the mean WERs per train set data type and size. The mean WERs across the data type sets (purple line) reach lower values than the baseline (red dashed line) already after fine-tuning with the smallest 15’ sets (mWER_15 = 0.32, mWER_30 = 0.22, mWER_60 = 0.18, mWER_120 = 0.16, mWER_all = 0.16). The values decrease as the size of the training set increases. However, the magnitude of the WER difference between subsequent size groups progressively diminishes until it becomes trivial between the models trained on 60’ of speech and those trained on the entire datasets (about 3 h).</p>
        <p>We then consider the mean WER values grouped by train set data type. Although the models trained on G01, G02 and G03 data all perform better than the baseline, we observe that the models trained on G02 data perform worse than the others, with WERs closer to the overall baseline. In particular, the models trained on G02 are tested on G03 and are closer to the G03 baseline (mWER = 0.40). Instead, the models trained on G03 and tested on G01 show a larger difference with the G01 baseline (mWER = 0.51) than the difference between the models trained on G01 and tested on G02 and the G02 baseline (mWER = 0.39).</p>
        <p>Considering both the contribution of the train set data type and the size to the model performance improvement, the optimal fold is G03_120.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Features Correlation with WER</title>
        <p>To explore how different datasets affect model performance, we observe which features correlate with the WER of the trained models. The heatmap in Figure 4.2 shows the Pearson coefficients resulting from the correlation between a selection of relevant acoustic features and the WER for each model. The colour of each tile represents the direction of the correlation, while its intensity indicates the strength of the correlation. Red denotes a positive correlation, meaning higher feature values correspond to higher WER, whereas blue indicates a negative correlation, where higher feature values align with lower WER. White represents a weak or no correlation.</p>
        <p>We observe negative correlations between the WER values and both the utterance duration and the number of tokens. The correlation becomes weaker, but still noticeable, with increasing train set data size, and the same trend is observed for each dataset. An opposite trend is observed for the speech rate values: the latter correlate with WERs positively and increasingly with train set size. However, this trend is considerably stronger for the models trained on data from the G01 dataset (and tested on the G02 dataset). Weaker correlations are observed for the mean values of F0, especially for the G02 and G03 models, with the strength slightly increasing with the size of the training set. Rather constantly weak correlations can be observed between median loudness, MFCC4 in voiced regions and the WER values. Still rather constant but slightly stronger is the correlation between loudness peaks per second and WERs for the models trained on the G02 dataset.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusions</title>
      <p>This study contributes to investigations of how the performance of modern E2E ASR models is affected by the type and amount of speech data used for training, and aims to define a way to identify an optimal combination of type and amount of speech data. The investigation is supported by the observation of how different speech acoustic features contribute to the model performance.</p>
      <p>The FastConformer WER on the selected semi-monologic, semi-spontaneous data presents overall lower values than the evaluation provided by a previous study on Italian monologic data, i.e., 12.8 WER [6]. More specifically, lower recognition scores are reported for G01 speech, characterised by a more spontaneous speech style, including more features such as non-lexical fillers and prolongations than the other speakers, which is in line with the literature [<xref ref-type="bibr" rid="ref5 ref6">12, 13</xref>].</p>
      <p>The cross-fold evaluation shows that the models’ performance improves with train set size; however, the magnitude of the improvement gradually decreases until becoming trivial between the models trained on 120 minutes and those trained on about 3 hours of speech. This finding supports the claim that simply increasing the size of the training set is not always beneficial and not always enough to guarantee better performance. Although this trend stands across all datasets, variation can still be observed.</p>
      <p>The models trained on speech produced by the second guide (G02) perform worse than the others, with recognition scores closer to the overall baseline. In particular, the models trained on G02 speech, which is characterised by a higher speech rate and fewer pauses, are tested on G03 speech and achieve a smaller improvement over the G03 baseline compared to the models trained on G03 speech, showing a more controlled speech style, and on G01 speech, defined by a more spontaneous speech style. It is particularly worth noticing that the models trained on G03 and tested on G01 show the best recognition scores over all size folds, thus overcoming the G01 baseline disadvantage. This seems to indicate that some speech data are more informative than others and may even overcome recognition issues related to more spontaneous and conversational speech styles; however, studies in this direction should be further developed.</p>
      <p>Considering both the contribution of the train set data type and size to the model performance improvement, the dataset that optimises the combination of data type and amount is the one containing 120 minutes, i.e., two-thirds of the available dataset, of the more controlled, but still spontaneous, speech produced by G03 (RQ1).</p>
      <p>In line with the literature [7], correlations between recognition scores and utterance durational features emerge. More specifically, higher length values (in terms of utterance tokens and duration) correlate with lower recognition errors, which indicates that providing a wider context enhances recognition. Conversely, higher speech rates hinder recognition. However, this effect is more or less mitigated according to the speech type in the training set (RQ2). This finding, as well as the constant and weak correlations observed for the other acoustic features, deserves further attention and needs to be explored in future works.</p>
      <p>Overall, these findings show that using both more spontaneous speech and more controlled speech can be beneficial to fine-tune a pre-trained model, provided that the speech rate is not too high. More detailed analyses will be performed considering the values of the acoustic characteristics and their variation to gain deeper insight.</p>
      <p>This study provides evidence corroborating the idea that less but more informative data can be used to fine-tune pre-trained models, which could be useful for fine-tuning in low-resource scenarios. Furthermore, the use of the FastConformer highlights the value of architectures that offer a favorable trade-off between performance and computational resources. These models present a viable alternative for deployment on resource-constrained, privacy-oriented devices. At the same time, they can be quickly adapted to different low-resourced contexts, standing in practical contrast to larger-scale yet resource-demanding models.</p>
      <p>In this study, we prioritised methodological soundness and understanding over immediate broad applicability. We selected a known dataset restricted in size and speaker diversity to enhance the interpretability of the results, verify the method’s core effectiveness and establish a solid foundation for scaling to larger, more diverse corpora. Future work will be devoted to further exploring this direction by considering larger datasets that maximise differences in the acoustic-phonetic features that were observed to be relevant for the modelling.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for grammar and spelling checking. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Linke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Garner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuppler</surname>
          </string-name>
          , Con- 2018, pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
          <article-title>versational speech recognition needs data? exper-</article-title>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Schettino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Betz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cutugno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <article-title>Hesitaiments with austrian german, in: Proceedings of tions and individual variability in Italian tourist the Thirteenth Language Resources and Evaluation guides' speech</article-title>
          , in: C.
          <string-name>
            <surname>Bernardasci</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dipino</surname>
          </string-name>
          , Conference,
          <year>2022</year>
          , pp.
          <fpage>4684</fpage>
          -
          <lpage>4691</lpage>
          . D.
          <string-name>
            <surname>Garassino</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Negrinelli</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Pellegrino</surname>
          </string-name>
          , S. Schmid
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kamper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goldwater</surname>
          </string-name>
          , Multilin- (Eds.),
          <article-title>Speaker Individuality in Phonetics and gual and unsupervised subword modeling for zero- Speech Sciences: Speech Technology and Forensic resource languages</article-title>
          ,
          <source>Computer Speech &amp; Language Applications</source>
          ,
          <source>STUDI AISV 8</source>
          ,
          <string-name>
            <surname>Milano</surname>
            <given-names>:</given-names>
          </string-name>
          <article-title>Oficinaven65 (</article-title>
          <year>2021</year>
          )
          <article-title>101098</article-title>
          . tuno,
          <year>2021</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>262</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kamper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matusevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goldwater</surname>
          </string-name>
          , Improved [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boersma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weenink</surname>
          </string-name>
          ,
          <article-title>Praat: doing phonetics by acoustic word embeddings for zero-resource lan</article-title>
          - computer [computer program].
          <source>version 5</source>
          .3. 51,
          <article-title>Onguages using multilingual transfer</article-title>
          , IEEE/ACM line: http://www. praat. org/retrieved, last viewed Transactions on Audio, Speech, and
          <string-name>
            <surname>Language</surname>
          </string-name>
          Pro- on
          <volume>12</volume>
          (
          <year>1999</year>
          -
          <fpage>2022</fpage>
          ).
          <source>cessing 29</source>
          (
          <year>2021</year>
          )
          <fpage>1107</fpage>
          -
          <lpage>1118</lpage>
          . [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rekesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Koluguri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kriman</surname>
          </string-name>
          , S. Majumdar,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bartelds</surname>
          </string-name>
          , N. San, B.
          <string-name>
            <surname>McDonnell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Noroozi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Hrinchuk</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Puvvada</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Wieling</surname>
          </string-name>
          ,
          <article-title>Making more of little data: Improving A</article-title>
          .
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Balam</surname>
          </string-name>
          ,
          <article-title>Fast conformer with linearly low-resource automatic speech recognition using scalable attention for eficient speech recognition, data augmentation, in: Proceedings of the 61st in: 2023 IEEE Automatic Speech Recognition and Annual Meeting of the Association for Computa-</article-title>
          Understanding
          <string-name>
            <surname>Workshop</surname>
          </string-name>
          (ASRU), IEEE,
          <year>2023</year>
          , pp.
          <source>tional Linguistics</source>
          Volume
          <volume>1</volume>
          :
          <string-name>
            <given-names>Long</given-names>
            <surname>Papers</surname>
          </string-name>
          ,
          <year>2023</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
          <fpage>715</fpage>
          -
          <lpage>729</lpage>
          . [22]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wöllmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          , Opensmile: the
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuppler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Adda-Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Morales</surname>
          </string-name>
          <article-title>- munich versatile and fast open-source audio feaCordovilla, Pronunciation variation in read and con- ture extractor, in: Proceedings of the 18th ACM versational austrian german</article-title>
          ., in: INTERSPEECH, international conference on Multimedia,
          <year>2010</year>
          , pp.
          <year>2014</year>
          , pp.
          <fpage>1453</fpage>
          -
          <lpage>1457</lpage>
          .
          <fpage>1459</fpage>
          -
          <lpage>1462</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liesenfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dingemanse</surname>
          </string-name>
          , Evaluation [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Scherer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          , J. Sundberg,
          <article-title>of automatic speech recognition for conversational E</article-title>
          . André,
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Y.</given-names>
            <surname>Devillers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Epps</surname>
          </string-name>
          , P. Laukka,
          <article-title>speech in dutch, english and german: What goes S. S. Narayanan, The geneva minimalistic acoustic missing?, in: Proceedings of the 18th Conference parameter set (gemaps) for voice research and afon Natural Language Processing (KONVENS 2022), fective computing</article-title>
          ,
          <source>IEEE transactions on afective 2022</source>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>143</lpage>
          . computing 7 (
          <year>2015</year>
          )
          <fpage>190</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Goldwater</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase asr error rates</article-title>
          ,
          <source>in: Proceedings of ACL-08: HLT, Association for Computational Linguistics</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>380</fpage>
          -
          <lpage>388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Burkov</surname>
          </string-name>
          ,
          <article-title>The hundred-page machine learning book</article-title>
          , volume
          <volume>1</volume>
          ,
          <string-name>
            <given-names>Andriy</given-names>
            <surname>Burkov</surname>
          </string-name>
          Quebec City,
          <string-name>
            <surname>QC</surname>
          </string-name>
          , Canada,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Schettino</surname>
          </string-name>
          ,
          <article-title>The role of disfluencies in Italian discourse. Modelling and speech synthesis applications</article-title>
          ,
          <source>Ph.D. thesis, Ph. D. dissertation, Universita degli Studi di Salerno</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Vitale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schettino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cutugno</surname>
          </string-name>
          ,
          <article-title>Rich speech signal: exploring and exploiting end-to-end automatic speech recognizers' ability to model hesitation phenomena</article-title>
          ,
          <source>in: 25th Annual Conference of the International Speech Communication Association (INTERSPEECH</source>
          <year>2024</year>
          ), ISCA,
          <year>2024</year>
          , pp.
          <fpage>222</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Origlia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Savy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cutugno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Alfano</surname>
          </string-name>
          ,
          <string-name>
            <surname>F. D'Errico</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Vincze</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Cataldo</surname>
          </string-name>
          ,
          <article-title>An audiovisual corpus of guided tours in cultural sites: Data collection protocols in the CHROME project</article-title>
          ,
          <source>in: Proceedings of the 2018 AVI-CH Workshop on Advanced Visual Interfaces for Cultural Heritage</source>
          , volume
          <year>2091</year>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>