<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Divergence-aware Approaches to Mitigate Subgroup Disparities in Speech Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alkis Koudounas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eliana Pastor</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flavio Giobergia</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Baralis</string-name>
        </contrib>
        <aff>Politecnico di Torino, Italy</aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Speech models often struggle with performance inconsistencies across different subgroups, leading to degraded accuracy for certain speaker demographics, accents, or recording conditions. These discrepancies may originate from multiple causes, such as imbalanced training data, suboptimal representation learning, and limitations in model generalization. Addressing these issues improves model robustness and reliability in real-world applications. We propose to mitigate performance disparities of subgroups that underperform, i.e., exhibit divergence, relative to overall model performance. We tackle the performance disparities both via in-processing solutions, i.e., implementing mitigation measures during model development, and a post-processing one, refining already trained models. As in-processing solutions, we propose three approaches: divergence-aware regularization, targeted data augmentation, and contrastive learning (CLUES). Each method improves model learning in different ways: divergence-aware regularization adjusts training to focus on low-performing subgroups, targeted data augmentation generates synthetic variations to enhance model robustness, while CLUES refines latent representations. The post-processing strategy introduces a divergence-aware data acquisition method to prioritize acquiring real-world samples from underperforming subgroups.</p>
      </abstract>
      <kwd-group>
        <kwd>bias mitigation</kwd>
        <kwd>spoken language understanding</kwd>
        <kwd>speech processing</kwd>
        <kwd>data acquisition</kwd>
        <kwd>divergence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ISSN1613-0073</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Speech models are widely used in modern applications, including virtual assistants, transcription
services, and accessibility tools [<xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>]. These models must handle a wide range of speech
variations, including different accents, speaking styles, and recording conditions [<xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>].
Despite their advancements, these models often exhibit performance disparities across different
population subgroups. Studies have shown that factors such as gender, accent, speaking rate,
and recording conditions can significantly impact the accuracy of these systems [<xref ref-type="bibr" rid="ref8 ref9 ref10 ref11 ref12 ref13 ref14 ref15 ref16">8, 9, 10, 11, 12, 13, 14, 15, 16</xref>]. These inconsistencies reduce the reliability of speech models and limit their
ability to perform well across diverse real-world conditions. Several factors may contribute to
these disparities. Differences in data distribution can lead to imbalanced learning, where models
become more accurate for certain types of speech while struggling with others. Inadequate
representation learning fails to capture the full spectrum of speech variations. Models may
also struggle to generalize when encountering speech characteristics underrepresented during
training. Addressing these issues improves speech model robustness and ensures
that they perform consistently across different conditions.
      </p>
      <p>
        Various methods have been proposed to address these challenges. Many approaches rely on
manually identifying specific speech characteristics that might cause performance issues [<xref ref-type="bibr" rid="ref17">17</xref>].
Some approaches use data augmentation [<xref ref-type="bibr" rid="ref18">18</xref>], generating synthetic variations to improve model
robustness. Others explore domain adaptation [<xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>], fine-tuning models on datasets that
better represent specific speech characteristics. Adversarial training has also been used to
make models more invariant to certain variations in speech [<xref ref-type="bibr" rid="ref18">18</xref>]. While these techniques
have improved fairness and robustness, they may overlook unexpected subgroups that emerge
only after model evaluation. Moreover, performance disparities often occur at the intersection
of multiple speech characteristics, making it difficult to address all sources of inconsistency
through predefined subgroup selection alone.
      </p>
      <p>Recent research has explored automated subgroup identification, using clustering techniques
to detect data patterns where models underperform [<xref ref-type="bibr" rid="ref12">12</xref>]. While these data-driven approaches
help identify performance gaps, they often lack interpretability and do not clearly describe the
underlying problems. Consequently, they provide neither insights into the specific sources of
performance inconsistencies nor guidance for data acquisition for model improvement.</p>
      <p>
        Our paper presents a framework addressing these limitations through four complementary
methods. We propose to mitigate the performance disparities within data subgroups that deviate
significantly, i.e., exhibit a divergence, from the overall model performance. We propose both
post-processing and in-processing approaches. For post-processing, i.e., improving already
trained models [<xref ref-type="bibr" rid="ref21">21</xref>], we propose a targeted data acquisition strategy to collect new real-world samples
to fine-tune a pre-trained model, mitigating its disparities [<xref ref-type="bibr" rid="ref22">22</xref>]. In-processing involves the
implementation of mitigation measures during the model development phase [<xref ref-type="bibr" rid="ref21">21</xref>]. As
in-processing solutions, we propose three techniques [<xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>]: divergence-aware regularization, targeted
data augmentation, and contrastive learning. Divergence-aware regularization modifies the
model loss function to emphasize underperforming subgroups during training. Targeted data
augmentation increases the representation of these subgroups by applying transformations to
existing samples. Finally, contrastive learning refines the model’s internal representations by
grouping similar samples closer together in latent space.
      </p>
      <p>
        To evaluate these methods, we conduct experiments on two spoken language understanding
datasets: Fluent Speech Commands (FSC) in English [<xref ref-type="bibr" rid="ref25">25</xref>] and ITALIC [<xref ref-type="bibr" rid="ref26">26</xref>] in Italian. We
fine-tune transformer-based speech models and measure their performance using overall accuracy,
subgroup performance divergence, and latent space analysis. Our results provide insights into
the effectiveness of each method in reducing bias and improving performance.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Problematic Subgroup Identification on Interpretable</title>
    </sec>
    <sec id="sec-4">
      <title>Metadata</title>
      <p>
        Speech models often exhibit inconsistent performance across diferent speaker groups. To
address this issue, it is necessary first to identify and analyze these subgroups systematically. A
challenge in subgroup identification is ensuring that the subgroups are interpretable, meaning
they provide clear insights into why performance disparities occur. For instance, “young men
in noisy scenarios” is an interpretable subgroup, allowing both understanding and intervention.
To achieve this identification, we leverage the techniques of2[
        <xref ref-type="bibr" rid="ref28 ref29 ref7">7, 28, 29</xref>
        ] that define subgroups
as interpretable combinations of metadata such as speaker demographics, recording conditions,
and task characteristics. In the following, we outline the definition of interpretable metadata
and then the automatic identification of subgroups.
      </p>
      <p>Interpretable Metadata. Speech datasets typically include a variety of metadata attributes
that can influence model performance. Demographic attributes such as gender, age, and
native language are among the most common factors affecting recognition accuracy. Beyond
demographics, speech characteristics such as speaking rate and silence duration also impact
recognition performance [<xref ref-type="bibr" rid="ref30">30</xref>]. Faster speech or heavily accented pronunciation may introduce
additional challenges, especially if the training data lacks sufficient diversity. In addition to
speaker characteristics, recording conditions also contribute to subgroup disparities. Factors
such as background noise, microphone type, and reverberation levels can create variations in
audio quality, affecting model predictions. A model trained primarily on clean audio data may
struggle when encountering noisy environments, leading to disparate performance outcomes
for speakers who record in less controlled conditions. Task-specific metadata, such as intent
categories in spoken language understanding, also play a role in subgroup performance. Certain
intents or command structures may be more frequently represented in training data, resulting
in better recognition accuracy compared to less frequent or more complex intent formulations.</p>
      <p>
        Automatic subgroup identification. To systematically extract underperforming subgroups,
we adopt DivExplorer [<xref ref-type="bibr" rid="ref31 ref32 ref33">31, 32, 33</xref>]. DivExplorer identifies underperforming and interpretable
subgroups by analyzing metadata attributes and measuring performance divergence, which
quantifies how much a subgroup’s performance deviates from the overall model performance.
      </p>
      <p>Specifically, let D denote the dataset and A the set of metadata attributes. An item is defined
as an attribute-value pair. For example, gender=female or speaking rate=high are items. A
subgroup corresponds to the subset of data instances that satisfy one or more such items,
represented as an itemset I. Given a statistic f (e.g., accuracy or error rate), the divergence Δf(I)
of a subgroup identified by the itemset I is defined as: Δf(I) = f(I) − f(D). A high negative divergence
value indicates that the subgroup is significantly underperforming compared to the dataset
as a whole. To ensure statistical reliability, subgroup discovery is constrained by a minimum
support threshold, which filters out small subgroups where performance estimates may be
unreliable. The subgroups are extracted by augmenting frequent pattern mining techniques,
such as FP-Growth or Apriori, over the defined interpretable metadata, to also compute the
divergence during the extraction process. By identifying subgroups with significant divergence,
this method provides a structured way to analyze and mitigate performance inconsistencies. The
identified subgroups inform post-processing targeted data acquisition (§3.1) and in-processing
techniques via regularization (§3.2), data augmentation (§3.3), and contrastive learning (§3.4).</p>
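      <p>To make the divergence computation concrete, the following Python sketch enumerates single-item subgroups over hypothetical metadata columns and computes Δf(I) for accuracy. It is a simplified illustration: DivExplorer itself mines itemsets of arbitrary length with frequent pattern mining, whereas here only single items are considered; column names are assumptions and the 0.03 support threshold mirrors the setup of §4.1.</p>
      <preformat>
import pandas as pd

def subgroup_divergences(df, meta_cols, correct_col="correct", min_support=0.03):
    """Accuracy divergence for every single-item subgroup.

    df          : per-utterance evaluation results (one row per utterance)
    meta_cols   : interpretable metadata columns (e.g., gender, noise level)
    correct_col : boolean column, True if the model prediction is correct
    """
    overall_acc = df[correct_col].mean()
    rows = []
    for col in meta_cols:
        for value, group in df.groupby(col):
            support = len(group) / len(df)
            if support &lt; min_support:  # skip unreliable small subgroups
                continue
            rows.append({"itemset": f"{col}={value}",
                         "support": support,
                         "divergence": group[correct_col].mean() - overall_acc})
    # most negative divergence first, i.e., the most problematic subgroups
    return pd.DataFrame(rows).sort_values("divergence")

# results = subgroup_divergences(eval_df, ["gender", "age", "noise_level"])
</preformat>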
    </sec>
    <sec id="sec-5">
      <title>3. Bias Mitigation Methods</title>
      <p>Bias in speech models arises when performance varies significantly across different subgroups,
often due to imbalanced representation in training data. To mitigate these disparities, various
techniques have been proposed, broadly categorized into post-processing methods, which refine
a trained model, and in-processing methods, which modify the training process itself.
Post-processing methods are useful when fairness issues emerge after deployment, as they adjust
model predictions or incorporate new data without requiring full retraining. In-processing
methods, on the other hand, introduce fairness-aware mechanisms directly into the learning
process to ensure balanced performance from the outset.</p>
      <p>This study covers four bias mitigation techniques: one post-processing method and three
in-processing ones. For post-processing, we use targeted data acquisition, which enhances fairness
by collecting additional subgroup-specific data. In-processing approaches include
divergence-aware regularization, which modifies the loss function to prioritize underperforming subgroups;
targeted data augmentation, which increases subgroup diversity through synthetic
transformations; and contrastive learning (CLUES) to refine latent representations for improved fairness.</p>
      <sec id="sec-5-1">
        <title>3.1. Post-Processing: Targeted Data Acquisition</title>
        <p>Targeted data acquisition is a post-processing approach that improves subgroup performance
by supplementing the training set with additional real-world examples from underperforming
subgroups. This method identifies performance disparities after model deployment and retrains
the model with newly collected data.</p>
        <p>The process begins by evaluating the trained model to identify subgroups with significantly
lower accuracy compared to the overall dataset. These subgroups are interpretable, ensuring
that their characteristics are clearly defined. By guaranteeing interpretability, we can perform
targeted data acquisition to acquire new speech samples that better represent them. These
additional samples are then integrated into the dataset, and the model undergoes additional
fine-tuning to improve its ability to generalize across all subgroups.</p>
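        <p>As a minimal sketch of the selection step, assuming a metadata-annotated acquisition pool (pool_df) and the divergence table (results) from the earlier snippet in §2, with an illustrative per-subgroup budget, the acquisition could look as follows:</p>
        <preformat>
import pandas as pd

def acquire_for_subgroups(pool_df, divergence_df, k=2, budget=200):
    """Select new samples from an annotated pool that match the
    top-k most divergent (worst-performing) subgroups."""
    picked = []
    for itemset in divergence_df.nsmallest(k, "divergence")["itemset"]:
        col, value = itemset.split("=", 1)
        matches = pool_df[pool_df[col].astype(str) == value]
        picked.append(matches.sample(min(budget, len(matches))))
    return pd.concat(picked).drop_duplicates()

# new_samples = acquire_for_subgroups(pool_df, results, k=2)
# train_df = pd.concat([train_df, new_samples])  # then fine-tune the model
</preformat>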
        <p>One of the key advantages of targeted data acquisition is its reliance on real-world speech
variations rather than artificial data. This ensures that the model learns from natural speech
patterns, accents, and recording conditions that were previously underrepresented. However,
this method requires significant resources for data collection, annotation, and model retraining.
Despite these challenges, targeted data acquisition is particularly valuable in deployed systems,
where performance problems across groups only become apparent after real-world use.</p>
      </sec>
      <sec id="sec-5-2">
        <title>3.2. In-Processing: Divergence-Aware Regularization</title>
        <p>Traditional model learning objectives optimize for overall performance, often overlooking
subgroup disparities. Divergence-aware regularization is an in-processing technique that
directly modifies the training process to improve subgroup learning. This approach dynamically adjusts
the loss function to focus on underperforming subgroups, ensuring they receive increased
attention during training. In this method, the training procedure continuously monitors performance
across different subgroups. If a subgroup exhibits significantly lower accuracy compared to the
overall dataset, its samples are assigned higher loss weights during training. By amplifying
the contribution of these samples, the model is encouraged to learn representations that better
capture subgroup-specific variations.</p>
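        <p>A minimal PyTorch sketch of this idea follows; the weighting scheme (w = 1 + |Δ| for negatively divergent subgroups) is one plausible choice for illustration, not necessarily the paper’s exact formulation.</p>
        <preformat>
import torch
import torch.nn.functional as F

def divergence_weighted_ce(logits, targets, subgroup_ids, divergence_by_group):
    """Cross-entropy where samples from underperforming subgroups
    receive a larger weight (w = 1 + |divergence| when divergence is negative)."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.tensor(
        [1.0 + max(0.0, -divergence_by_group.get(int(g), 0.0))
         for g in subgroup_ids],
        device=logits.device)
    return (weights * per_sample).mean()
</preformat>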
        <p>Divergence-aware regularization is an effective solution for bias mitigation without requiring
additional data. Since it operates directly on the training loss, it improves subgroup performance
without altering the dataset size or introducing synthetic transformations.</p>
      </sec>
      <sec id="sec-5-3">
        <title>3.3. In-Processing: Targeted Data Augmentation</title>
        <p>Targeted data augmentation is another in-processing method that improves subgroup
performance by artificially incrementing the training data for underperforming subgroups. Instead of
collecting new samples, this approach applies synthetic transformations to the existing data
to increase subgroup representation. Several augmentation techniques are commonly used in
speech processing, including time stretching, which alters the speed of speech, pitch shifting,
which changes the speaker’s tone, and noise injection, which simulates different recording
environments. These transformations create diverse variations of the same speech sample,
allowing the model to become more robust to variations in speaking style, accent, or background
noise. In our context, once the underperforming subgroups are identified, we apply targeted
data augmentation techniques to increase their presence in the training set.</p>
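        <p>As an illustration, a sample from an underperforming subgroup could be augmented along these three axes with librosa; parameter ranges here are illustrative assumptions, not the paper’s configuration.</p>
        <preformat>
import numpy as np
import librosa

def augment(waveform, sr, rng=None):
    """Create time-stretched, pitch-shifted, and noisy variants
    of one utterance from an underperforming subgroup."""
    rng = rng if rng is not None else np.random.default_rng()
    stretched = librosa.effects.time_stretch(waveform, rate=rng.uniform(0.9, 1.1))
    shifted = librosa.effects.pitch_shift(waveform, sr=sr, n_steps=rng.uniform(-2, 2))
    noisy = waveform + 0.005 * rng.standard_normal(len(waveform))
    return [stretched, shifted, noisy]

# y, sr = librosa.load("utterance.wav", sr=16000)
# extra_training_samples = augment(y, sr)
</preformat>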
        <p>One key advantage of this approach is its efficiency, as augmentation can be applied easily to
existing samples. However, this method does not introduce truly new linguistic or demographic
diversity, as it only manipulates existing samples. Despite this limitation, it serves as a
cost-effective way to improve model robustness for underperforming subgroups.</p>
      </sec>
      <sec id="sec-5-4">
        <title>3.4. In-Processing: Contrastive Learning (CLUES)</title>
        <p>Contrastive learning has gained attention as an effective technique for refining the latent space
representations of deep learning models. The CLUES (Contrastive Learning framework for
Underperforming Subgroups) method applies contrastive loss to guide the model in learning more
structured and subgroup-aware representations. Unlike regularization or data augmentation,
which focus on altering training behavior, contrastive learning reshapes the model’s internal
feature space to better distinguish between subgroups.</p>
        <p>CLUES operates at three levels of contrastive learning. At the task level, it ensures that
samples belonging to the same class are grouped closely together while separating samples
from different classes. At the subgroup level, it clusters samples from the same subgroup while
pushing apart those from different subgroups. Finally, at the error level, it groups correctly
classified samples separately from misclassified ones within each subgroup. By optimizing
these three objectives, CLUES improves how the model encodes subgroup-specific information,
leading to improved subgroup performance.</p>
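        <p>The three objectives can be instantiated by applying a standard supervised contrastive loss to three label sets, as in the sketch below. This is a plausible simplification of the CLUES objective for illustration, not its exact formulation; the weights w are hypothetical.</p>
        <preformat>
import torch
import torch.nn.functional as F

def sup_contrastive(z, labels, temperature=0.1):
    """Supervised contrastive loss: pull same-label embeddings
    together, push different-label embeddings apart."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) &amp; ~self_mask
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss[pos.any(dim=1)].mean()  # anchors with at least one positive

def clues_style_loss(z, intent, subgroup, is_correct, w=(1.0, 1.0, 1.0)):
    """Task-, subgroup-, and error-level contrastive terms; the error
    level separates correct and wrong predictions within each subgroup."""
    error_labels = subgroup * 2 + is_correct.long()
    return (w[0] * sup_contrastive(z, intent)
            + w[1] * sup_contrastive(z, subgroup)
            + w[2] * sup_contrastive(z, error_labels))
</preformat>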
        <p>A key advantage of CLUES is that it improves model representations at the subgroup level
without requiring additional data, simply by restructuring the way data is represented. By explicitly
shaping the latent space, CLUES reduces overlap between subgroup distributions, preventing
the model from learning biased or entangled representations. However, contrastive learning
introduces additional computational complexity. Despite this, experimental results show that
CLUES provides the most effective technique for mitigating bias and improving performance.</p>
        <p>Summary. Post-processing and in-processing methods offer distinct strategies for addressing
bias in speech models. Targeted data acquisition, as a post-processing method, enhances
subgroup performance by incorporating real-world samples into model fine-tuning. In contrast,
in-processing methods adjust the training process to achieve improvement at the subgroup level
without external data collection. The selection of an appropriate bias mitigation method depends
on the specific requirements of the application, including the available data and computational
constraints. In the next section, we outline the experimental setup used to evaluate
these methods and analyze their effect on subgroup and overall model performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Results and Analysis</title>
      <p>This section presents the results of applying the four bias mitigation methods, analyzing their
impact on overall model performance, subgroup fairness, and latent space representations.</p>
      <sec id="sec-6-1">
        <title>4.1. Experimental setup</title>
        <p>
          Dataset and models. We conduct experiments on two spoken language understanding datasets:
Fluent Speech Commands (FSC) [<xref ref-type="bibr" rid="ref25">25</xref>] in English and ITALIC [<xref ref-type="bibr" rid="ref26">26</xref>] in Italian. These datasets
contain labeled utterances categorized by intent. The data is split into training, validation,
and test sets, ensuring that speakers do not overlap between splits. To test the scenario of
data acquisition of unseen samples, we also tested a configuration in which we use part of
the original train set for actual training and a part for the data acquisition, denoted as
held-out. We fine-tune wav2vec 2.0 [<xref ref-type="bibr" rid="ref34">34</xref>] for FSC and XLS-R [<xref ref-type="bibr" rid="ref35">35</xref>] for ITALIC. For our subgroup
extraction with DivExplorer, we explored all subgroups with a minimum frequency of 0.03,
following [<xref ref-type="bibr" rid="ref22">22</xref>]. For both post-processing data acquisition and in-processing data augmentation,
the hyperparameter K defines the top-K most challenging subgroups to target. We report
the results for K=2. Complete results with sensitivity analysis, ablation studies, and evaluations
on emotion recognition and automatic speech recognition tasks are available in [<xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>].
        </p>
        <p>Metrics. We evaluate accuracy and macro F1 score to measure overall performance. For
subgroup performance, we evaluate the maximum subgroup divergence (Δmax), the average
divergence of the top-10 underperforming subgroups (Δtop-10), and the average divergence in
absolute terms (|Δ|). We also performed a latent space analysis using the Silhouette Score to assess
how well the model distinguishes between subgroups when adopting CLUES.</p>
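        <p>Given a divergence table such as the one sketched in §2 (the results DataFrame is hypothetical), these subgroup metrics reduce to a few lines:</p>
        <preformat>
# `results` has one row per subgroup with a signed "divergence" column
worst = results["divergence"].min()  # most negative = maximum divergence
top10_avg = results.nsmallest(10, "divergence")["divergence"].mean()
abs_avg = results["divergence"].abs().mean()
print(f"max divergence: {worst:.2%}, top-10 avg: {top10_avg:.2%}, |avg|: {abs_avg:.2%}")
</preformat>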
        <p>
          Baselines. We compare our mitigation methods, when using our automatic identification
approach, against a set of alternative baselines that aim to identify challenging samples for model
improvement. The random baseline selects samples randomly, serving as a control to highlight
the effectiveness of subgroup-based selection. The clustering baseline follows [<xref ref-type="bibr" rid="ref12">12</xref>], where
challenging subgroups are identified using K-means clustering applied to acoustic embeddings.
The clusters with the lowest performance are then used to determine the most challenging
samples. The KNN baseline employs a K-Nearest Neighbors approach, where an utterance is
considered challenging if its nearest neighbors in the validation set are frequently misclassified,
with K optimized per dataset. Finally, the error-driven baseline, close to [<xref ref-type="bibr" rid="ref36">36</xref>], selects misclassified
instances from the held-out set and incorporates them into training. We evaluate this baseline
only for post-processing since training loss inherently accounts for errors during learning.
        </p>
        <p>[Table 1: overall and subgroup performance on FSC and ITALIC for each strategy (data acquisition, targeted data augmentation, regularization, CLUES), comparing the original model with the random, KNN, clustering, error-driven, and DivExplorer-based identification variants.]</p>
      </sec>
      <sec id="sec-6-2">
        <title>4.2. Experimental results</title>
        <p>Overall Performance. We report the results in Table 1. The four proposed bias mitigation
methods lead to varying degrees of improvement in model accuracy and fairness.
Divergence-aware regularization and contrastive learning (CLUES) achieve the highest overall accuracy while
allowing the highest reductions in subgroup disparities. Coupling any strategy with our
identification methodology generally achieves the best results (highlighted in light yellow in Table 1).</p>
        <p>On the FSC dataset, when using the full training set (original - all), the baseline wav2vec 2.0
model achieves an accuracy of 93.42% and an F1 macro score of 93.11%, but exhibits high
divergence across subgroups. After applying mitigation strategies, CLUES improves overall accuracy
to 98.79% and reduces subgroup divergence significantly. Divergence-aware regularization
similarly enhances subgroup performance while maintaining a competitive accuracy of 98.5%,
while targeted data augmentation yields more moderate improvements, particularly benefiting
subgroups with lower representation. On ITALIC, the baseline XLS-R model achieves 73.22%
F1 Macro (original - all). Divergence-aware regularization and contrastive learning improve
overall performance: to 74.85% for the former, and to 76.10% and 76.72% when using CLUES
coupled with clustering or with our identification approach based on DivExplorer.</p>
        <p>Subgroup Performance. We assess how well the methods reduce subgroup disparities.
Before mitigation, the baseline FSC model on overall data has a Δmax of 53.18% (i.e., the least
accurate subgroup performs significantly worse than the global accuracy of 93.42%). CLUES
reduces this divergence the most, down to 17.58%. Divergence-aware regularization also
substantially reduces Δmax, to 24.49%, confirming its effectiveness in addressing subgroup imbalances.
For ITALIC, the baseline XLS-R model on overall data starts with a Δmax of 47.54%. Contrastive
learning and divergence-aware regularization reduce this gap to 40.15% and 30.10%, respectively.</p>
        <p>Latent Space Analysis. We use the Silhouette Score to investigate the impact of bias
mitigation on the latent space representations. A higher Silhouette Score indicates that the model
better separates subgroups, reflecting improved internal representations of speech variations.</p>
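        <p>For reference, this analysis can be reproduced with scikit-learn, assuming per-utterance latent embeddings and subgroup labels (both variable names are hypothetical):</p>
        <preformat>
from sklearn.metrics import silhouette_score

# embeddings: (n_samples, dim) array of latent representations
# subgroup_labels: integer subgroup id per sample
score = silhouette_score(embeddings, subgroup_labels)
print(f"Silhouette Score: {score:.3f}")  # higher = better-separated subgroups
</preformat>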
        <p>The baseline FSC model achieves a Silhouette Score of 0.737. CLUES improves this to 0.894,
demonstrating that targeting subgroup representation learning significantly enhances the
model’s ability to distinguish between subgroups. A similar pattern is observed on the ITALIC
dataset, where contrastive learning improves the Silhouette Score from 0.319 to 0.539. This suggests
that models trained with contrastive objectives learn more structured and subgroup-aware
representations, contributing to improvements in subgroup performance. A complete analysis
of model representations can be found in [<xref ref-type="bibr" rid="ref24">24</xref>].</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Conclusions</title>
      <p>This paper outlined a framework for improving speech model performance by identifying
and mitigating subgroup disparities, leveraging interpretable metadata to systematically
detect underperforming, i.e., divergent, subgroups. We explored four mitigation techniques: the
post-processing targeted data acquisition and the in-processing divergence-aware
regularization, targeted data augmentation, and contrastive learning (CLUES). Each method addressed
performance inconsistencies differently, either by enhancing model training, refining latent
representations, or incorporating subgroup-specific data. The experimental results show that
CLUES and divergence-aware regularization are the most effective in reducing subgroup
disparities. Moreover, CLUES enhances latent space representations. The findings highlight the
value of adopting divergence-aware subgroup identification in speech model development.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is partially supported by FAIR - Future Artificial Intelligence Research, which
received funding from the European Union NextGenerationEU (PIANO NAZIONALE DI RIPRESA
E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1555
11/10/2022, PE00000013), and by the spoke “FutureHPC &amp; BigData” of the ICSC - Centro Nazionale
di Ricerca in High-Performance Computing, Big Data and Quantum Computing, funded by
the European Union - NextGenerationEU. This manuscript reflects only the authors’ views
and opinions; neither the European Union nor the European Commission can be considered
responsible for them.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly and ChatGPT to check grammar
and spelling and to paraphrase and reword. After using these tools, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
      <p>Scale, in: Proc. Interspeech 2022, 2022. do1i:0.21437/Interspeech.2022-143.
[36] R. Magar, A. B. Farimani, Learning from mistakes: Sampling strategies to eficiently
train machine learning models for material property prediction, Computational Materials
Science 224 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.-w.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-I. J.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chang</surname>
          </string-name>
          , G.-T. Lin,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          , W.-C. Tseng,
          <string-name>
            <given-names>K.-t.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , H. yi Lee,
          <source>SUPERB: Speech Processing Universal PERformance Benchmark, in: Proc. Interspeech</source>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>1194</fpage>
          -
          <lpage>1198</lpage>
          .
          doi:10.21437/Interspeech.2021-1775.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Siniscalchi</surname>
          </string-name>
          ,
          <article-title>Speech analysis of language varieties in italy</article-title>
          ,
          <source>in: Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING</article-title>
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>15147</fpage>
          -
          <lpage>15159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          , T. Xu,
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McLeavey</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Robust speech recognition via large-scale weak supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>28492</fpage>
          -
          <lpage>28518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          , G. Ciravegna,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fantini</surname>
          </string-name>
          , E. Crosetti, G. Succo,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cerquitelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          , et al.,
          <article-title>Voice disorder analysis: a transformer-based approach</article-title>
          , in: INTERSPEECH,
          ISCA
          ,
          <year>2024</year>
          , pp.
          <fpage>3040</fpage>
          -
          <lpage>3044</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. La</given-names>
            <surname>Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          , E. Baralis,
          <article-title>Transformer-based non-verbal emotion recognition: Exploring model portability across speakers' genders</article-title>
          ,
          <source>in: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge</source>
          , MuSe ’22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          . URL: https://doi.org/10.1145/3551876.3554801. doi:10.1145/3551876.3554801.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Foundation model assisted automatic speech emotion recognition: Transcribing, annotating, and augmenting</article-title>
          ,
          <source>in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>12116</fpage>
          -
          <lpage>12120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. La</given-names>
            <surname>Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Siniscalchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>voc2vec: A foundation model for non-verbal vocalization</article-title>
          ,
          <source>in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/ICASSP49660.2025.10890672.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bajorek</surname>
          </string-name>
          ,
          <article-title>Voice recognition still has significant race and gender biases</article-title>
          ,
          <source>Harvard Business Review</source>
          <volume>10</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koenecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nudell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Quartey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mengesha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Toups</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Rickford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <article-title>Racial disparities in automated speech recognition</article-title>
          ,
          <source>Proc. of the National Academy of Sciences</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mengesha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heldreth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lahav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sublewski</surname>
          </string-name>
          , E. Tuennerman,
          <article-title>“I don't think these devices are very culturally sensitive.” - Impact of automated speech recognition errors on African Americans</article-title>
          ,
          <source>Frontiers in Artificial Intelligence</source>
          (
          <year>2021</year>
          )
          <fpage>169</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Picheny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sarı</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chitkara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alvarado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hazirbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Saraf</surname>
          </string-name>
          ,
          <article-title>Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions</article-title>
          ,
          <source>in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>6162</fpage>
          -
          <lpage>6166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dheram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saboowala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shetty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolcke</surname>
          </string-name>
          ,
          <article-title>Toward fairness in speech recognition: Discovery and mitigation of performance disparities</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>1268</fpage>
          -
          <lpage>1272</lpage>
          .
          doi:10.21437/Interspeech.2022-10816.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-E.</given-names>
            <surname>Veliche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Model-based approach for measuring the fairness in asr</article-title>
          , in: ICASSP, IEEE,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mazzia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gueudre</surname>
          </string-name>
          , E. Reale,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cumani</surname>
          </string-name>
          , L. De Alfaro,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amberti</surname>
          </string-name>
          ,
          <article-title>Leveraging confidence models for identifying challenging data subgroups in speech models</article-title>
          ,
          <source>in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>134</fpage>
          -
          <lpage>138</lpage>
          . doi:10.1109/ICASSPW62465.2024.10626001.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Halpern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kudina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Scharenborg</surname>
          </string-name>
          ,
          <article-title>Towards inclusive automatic speech recognition</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>84</volume>
          (
          <year>2024</year>
          )
          <fpage>101567</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <article-title>Houston we have a divergence: A subgroup performance analysis of asr models</article-title>
          ,
          <source>in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>812</fpage>
          -
          <lpage>813</lpage>
          .
          doi:10.1109/ICASSPW62465.2024.10626156.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.</given-names>
            <surname>Niebuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Michaud</surname>
          </string-name>
          ,
          <article-title>Speech data acquisition: the underestimated challenge</article-title>
          ,
          <source>KALIPHO - Kieler Arbeiten zur Linguistik und Phonetik</source>
          <volume>3</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Halpern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Scharenborg</surname>
          </string-name>
          ,
          <article-title>Mitigating bias against non-native accents</article-title>
          ,
          <source>in: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH</source>
          , volume
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>3168</fpage>
          -
          <lpage>3172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>I.-E.</given-names>
            <surname>Veliche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Improving fairness and robustness in end-to-end speech recognition through unsupervised clustering</article-title>
          ,
          <source>in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , G. Sun,
          <string-name>
            <given-names>R.</given-names>
            <surname>Langman</surname>
          </string-name>
          , E. Han,
          <string-name>
            <given-names>J.</given-names>
            <surname>Droppo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolcke</surname>
          </string-name>
          ,
          <article-title>Improving fairness in speaker verification via group-adapted fusion network</article-title>
          , in: ICASSP, IEEE,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mehrabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morstatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galstyan</surname>
          </string-name>
          ,
          <article-title>A survey on bias and fairness in machine learning</article-title>
          ,
          <source>ACM computing surveys (CSUR) 54</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          , E. Pastor, G. Attanasio, L. de Alfaro, E. Baralis,
          <article-title>Prioritizing data acquisition for end-to-end speech model improvement</article-title>
          ,
          <source>in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/ICASSP48485.2024.10446326.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          , L. de Alfaro, E. Baralis,
          <article-title>Mitigating subgroup disparities in speech models: A divergence-aware dual strategy</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech and Language Processing</source>
          <volume>33</volume>
          (
          <year>2025</year>
          )
          <fpage>883</fpage>
          -
          <lpage>895</lpage>
          .
          doi:10.1109/TASLPRO.2025.3539429.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>A contrastive learning approach to mitigate bias in speech models</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>827</fpage>
          -
          <lpage>831</lpage>
          .
          doi:10.21437/Interspeech.2024-1219.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lugosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ravanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ignoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Speech model pre-training for end-to-end spoken language understanding</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2019</year>
          , 20th Annual Conference of the International Speech Communication Association,
          <year>2019</year>
          , pp.
          <fpage>814</fpage>
          -
          <lpage>818</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. La</given-names>
            <surname>Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Colomba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          , E. Baralis,
          <article-title>ITALIC: An Italian Intent Classification Dataset</article-title>
          ,
          <source>in: Proc. INTERSPEECH</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>2153</fpage>
          -
          <lpage>2157</lpage>
          . doi:10.21437/Interspeech.2023-1980.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          , E. Pastor,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mazzia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gueudre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          , L. de Alfaro, E. Baralis,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amberti</surname>
          </string-name>
          ,
          <article-title>Exploring subgroup performance in end-to-end speech models</article-title>
          ,
          <source>in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/ICASSP49357.2023.10095284.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          , E. Pastor,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mazzia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gueudre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cumani</surname>
          </string-name>
          , L. de Alfaro, E. Baralis,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amberti</surname>
          </string-name>
          ,
          <article-title>Towards comprehensive subgroup performance analysis in speech models</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>32</volume>
          (
          <year>2024</year>
          )
          <fpage>1468</fpage>
          -
          <lpage>1480</lpage>
          .
          doi:10.1109/TASLP.2024.3363447.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          , E. Baralis,
          <article-title>Assessing speech model performance: A subgroup perspective</article-title>
          ,
          <source>in: SEBD 2024: 32nd Symposium on Advanced Database System</source>
          , volume
          <volume>3741</volume>
          , CEUR Workshop Proceedings,
          <year>2024</year>
          , pp.
          <fpage>101</fpage>
          -
          <lpage>111</lpage>
          . URL: https://ceur-ws.org/Vol-3741/paper64.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          , E. Baralis,
          <article-title>Explaining speech classification models via word-level audio segments and paralinguistic features</article-title>
          ,
          <source>in: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>2221</fpage>
          -
          <lpage>2238</lpage>
          . URL: https://aclanthology.org/2024.eacl-long.136/.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          , L. de Alfaro, E. Baralis,
          <article-title>Looking for trouble: Analyzing classifier behavior via pattern divergence</article-title>
          ,
          <source>in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD '21</source>
          , ACM
          ,
          <year>2021</year>
          , pp.
          <fpage>1400</fpage>
          -
          <lpage>1412</lpage>
          . doi:10.1145/3448016.3457284.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gavgavian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          , L. de Alfaro,
          <article-title>How divergent is your data?</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>2835</fpage>
          -
          <lpage>2838</lpage>
          . URL: https://doi.org/10.14778/3476311.3476357. doi:10.14778/3476311.3476357.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          , L. de Alfaro,
          <article-title>A hierarchical approach to anomalous subgroup discovery</article-title>
          ,
          <source>in: 2023 IEEE 39th international conference on data engineering (ICDE)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>2647</fpage>
          -
          <lpage>2659</lpage>
          . doi:10.1109/ICDE55515.2023.00203.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>