
SEBD

1613-0073

Alkis Koudounas

Eliana Pastor

Elena Baralis

Politecnico di Torino, Turin, Italy

2024

32 23 26

Spoken language understanding (SLU) models are commonly evaluated based on overall performance or predefined subgroups, often overlooking the potential insights gained from more comprehensive subgroup analyses. Conducting a more thorough analysis at the subgroup level can reveal valuable insights into the variations in speech system performance across different subgroups. Yet, identifying interpretable subgroups in raw speech data poses inherent challenges.

Subgroup identification, Model bias analysis, Bias mitigation, Speech representation, E2E-SLU models

CEUR ceur-ws.org

1. Introduction

Intelligent systems with speech recognition, transcription, and comprehension capabilities are increasingly common across various domains, including virtual assistants [1, 2], customer service [3, 4], and healthcare [5, 6]. However, current evaluation paradigms for these systems predominantly focus on aggregate performance metrics, overlooking potential disparities across different groups [7, 8, 9]. Furthermore, the proliferation of large pre-trained neural models using self-supervised learning poses challenges for interpretability and identification of performance disparities through conventional methodologies [10, 11]. These issues highlight the need for a comprehensive evaluation framework that captures subgroup-level effects to enable responsible assessment of speech technologies, identifying and mitigating unintended harms.

Recent literature has highlighted issues of model bias and unequal treatment across data subgroups [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. A data subgroup refers to a subset of instances demonstrating similar characteristics within the latent space or common attribute values (e.g., utterances spoken by female speakers). Previous approaches have typically focused on predefined subgroups based on protected attributes or features of interest known a priori. Specifically, these works targeted identifying bias within specific demographic traits, such as skin tone [12], ethnicity [16], or combinations of metadata, such as demographics and geolocation [15], as well as gender, age, and accents [13] or gender, age, and skin tones [14]. However, such categorizations often necessitate human expertise and preclude the exploration of unanticipated yet significant subgroups.

In this work, we propose an automated method for identifying critical subgroups to address these limitations. Unlike existing clustering-based speaker embedding techniques [15, 18], our approach facilitates intersectional analysis, enabling us to explore the combined impacts of multiple attributes. Speech data frequently includes additional metadata about the speaker (e.g., the gender) or task (e.g., the emotion associated with a sentence). Other features, such as speaking rate, signal-to-noise ratio, and number of words, can be extracted from the audio or transcripts. The latter are essential for capturing subtle nuances that could significantly affect model performance. By combining such metadata values, we can identify interpretable data subgroups.

Research questions. This study investigates bias in speech model performance across data subgroups, mainly focusing on spoken language understanding (SLU). We automatically identify combinations of metadata values that exhibit the highest (i) intra-model performance gaps, indicating significant performance differences between the overall dataset and specific data subgroups, and (ii) cross-model performance gaps, signifying notable differences in subgroup performance among different models. Our approach enables the identification of data subgroups where a model exhibits lower performance compared to the overall behavior. We leverage this interpretable identification of critical subgroups for a targeted data acquisition strategy to enhance performance and mitigate model biases. Therefore, this work addresses the following research questions (RQs): (RQ1) “How can we automatically identify and characterize the most critical subgroups for an SLU model?”, (RQ2) “How does model size or architecture impact subgroup performance?”, and (RQ3) “How does adopting a subgroup-guided data acquisition strategy influence the overall model and subgroup performance compared to an indiscriminate approach?”.

Our approach. We introduce a novel task-, model-, and dataset-agnostic methodology for automating the characterization and comparison of data subgroups induced by metadata attributes. We identify all “frequent subgroups,” i.e., those exceeding a certain support threshold (e.g., at least 0.1% of the dataset), that exhibit maximal disparities in intra- and cross-model performance. We provide end-users with interpretable representations of such critical subgroups within a given speech task and model and further use this information to mitigate the model's inherent biases.

The primary contributions of this work are: (i) a novel framework for analyzing SLU models by identifying subgroups exhibiting large performance gaps; (ii) insights into the effects of model size at the subgroup level; and (iii) a subgroup-guided targeted data acquisition approach to enhance model performance both overall and across subgroups.

We conduct comprehensive experiments across three speech tasks (Automatic Speech Recognition (ASR), Intent Classification (IC), Emotion Recognition (ER)), three datasets (LibriSpeech [24], FSC [25], and IEMOCAP [26]), and for the transformer-based speech model wav2vec 2.0 [27]. Our experimental results demonstrate that our subgroup-level analysis reveals distinctive performance patterns in data subpopulations. We further show that our subgroup-guided acquisition approach consistently improves performance both overall and on subgroups compared to an indiscriminate strategy, even when acquiring a subset of the data.

2. Methodology

Our approach examines model performance at the subgroup level, where a subgroup is defined as a subset of the data characterized by specific metadata values, and denoted as an itemset. This metadata covers mixed factors, including speaker traits (e.g., gender, age), speech features (e.g., speaking rate, number of pauses), and task-specific attributes (e.g., intents, labels). For instance, the subgroup {gender = male, age ∈ [41-65]} signifies utterances from male speakers aged 41 to 65.

Our analysis of subgroup behavior leverages two key concepts: intra-model divergence and cross-model performance gap. The former indicates the disparity in model performance between a subgroup and the entire dataset, revealing subgroups associated with performance variations, be it below-average, above-average, or equivalent. We will also leverage this aspect to guide the data acquisition strategy. Conversely, the latter quantifies the performance differences between two models on the same subgroup, facilitating comparative assessments at the subgroup level.

2.1. Itemsets through interpretable metadata

We analyze speech model behavior by slicing data into interpretable subgroups. We define interpretable metadata as attributes understandable by humans, e.g., speaker age or gender or utterance noise level. For instance, “old men in noisy scenarios” is an interpretable subgroup.

Metadata Description. Identifying interpretable subgroups in raw speech data poses intrinsic challenges. To overcome this issue, we enrich speech data with interpretable metadata from various domains, providing a human-understandable description of utterances. These metadata can be inherent to the dataset or derived from utterances/transcriptions. Examples of such metadata attributes include: (i) speaker demographics, like gender or age, (ii) task-specific features, like intent or emotion associated with an utterance, (iii) recording conditions, such as environment type and noise level, and (iv) speech features, such as speaking rate and duration of silences.

Items and Itemsets. Let $D$ represent our dataset and $A$ denote its metadata attribute set. An item represents an attribute equality $a = v$, where $a$ is an attribute in $A$, and $v$ is its value. We only focus on discretized attributes, thus continuous-valued attributes are discretized before applying our techniques. Examples of items include gender = male and age ∈ [41-65], if gender and age are attributes. A subgroup corresponding to an item denotes the dataset portion satisfying it. We ensure that subgroups form a dataset partition for each attribute. For example, the age ranges must not overlap within the age attribute, and collectively, they must cover all potential age ranges.

Items facilitate the selection of data subsets based on single attributes, while itemsets allow slicing across multiple attributes. An itemset comprises zero or more items, each including a different attribute. For instance, an itemset like {gender = female, age ∈ [22, 40]} defines a subgroup based on the gender and age attributes. We define data subgroups via itemsets, enabling an interpretable subgroup definition. The support of an itemset denotes the fraction of the dataset it covers. For instance, an itemset with support 0.02 represents 2% of the dataset. The empty itemset (∅) corresponds to the entire dataset and has a support of 1. An itemset is frequent if its support exceeds a minimum threshold $s$.
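As a minimal sketch of the definitions above (not the authors' implementation), the support of an itemset can be computed by counting the utterances that satisfy all of its items. The names `Row`, `Itemset`, `covers`, `support`, and `is_frequent`, as well as the toy metadata, are hypothetical and chosen only for illustration; the threshold is treated here as "at least", which is an assumption.

```python
from typing import Dict, List

Row = Dict[str, str]      # one utterance: attribute -> discretized value
Itemset = Dict[str, str]  # attribute -> required value; {} is the empty itemset


def covers(itemset: Itemset, row: Row) -> bool:
    """True if the row satisfies every item (attribute = value) of the itemset."""
    return all(row.get(attr) == value for attr, value in itemset.items())


def support(itemset: Itemset, rows: List[Row]) -> float:
    """Fraction of the dataset covered by the itemset."""
    return sum(covers(itemset, row) for row in rows) / len(rows)


def is_frequent(itemset: Itemset, rows: List[Row], min_support: float) -> bool:
    """Assumption: an itemset is frequent when its support is at least min_support."""
    return support(itemset, rows) >= min_support


# Hypothetical toy metadata, for illustration only.
rows = [
    {"gender": "female", "age": "22-40"},
    {"gender": "female", "age": "41-65"},
    {"gender": "male", "age": "22-40"},
    {"gender": "male", "age": "41-65"},
]

young_women = {"gender": "female", "age": "22-40"}
print(support({}, rows))           # the empty itemset covers the whole dataset
print(support(young_women, rows))  # one utterance out of four
```

Representing an itemset as a dictionary keyed by attribute enforces that each item refers to a different attribute, as required by the definition.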

2.2. Intra and cross-model performance gaps

We aim to identify subgroups exhibiting performance disparities compared to the overall dataset. We rely on DivExplorer [22, 28] to extract all frequent itemsets above a specified support threshold. While subgroups grow exponentially with the number of attributes, many extracted itemsets may have minimal or zero support, making them less relevant for subgroup performance analysis. Performance statistics for subgroups with low support may also suffer from statistical fluctuations. Therefore, to ensure operational significance, we only focus on the subgroups surpassing a given threshold (e.g., comprising at least 0.1% of the dataset), called frequent itemsets, which tend to be more limited.

We employ the concept of subgroup divergence (i.e., intra-model performance gap) as introduced in [22]. It quantifies the difference in performance between a subgroup and the entire dataset. Let $f$ represent a generic statistic for a downstream SLU task. For a model $M$ and a subgroup (i.e., itemset) $I$, $f(I, M)$ denotes the average statistic value (e.g., accuracy, error rate) of the model on the subgroup. We define the divergence of itemset $I$ for model $M$ as the difference between the model performance over $I$ and the performance over the entire dataset:

$$\Delta_f(I, M) = f(I, M) - f(\emptyset, M) \tag{1}$$

A higher divergence (in absolute terms) indicates a more significant variation in subgroup performance compared to the overall dataset.
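The divergence definition can be illustrated with a short sketch, assuming one score per utterance (here 1 for a correct prediction and 0 otherwise, so that the statistic is accuracy). The function names and the toy data are hypothetical; this is not DivExplorer's implementation, which computes the statistic for all frequent itemsets at once.

```python
from typing import Dict, List

Row = Dict[str, str]
Itemset = Dict[str, str]


def covers(itemset: Itemset, row: Row) -> bool:
    """True if the row satisfies every item of the itemset."""
    return all(row.get(attr) == value for attr, value in itemset.items())


def subgroup_mean(itemset: Itemset, rows: List[Row], scores: List[float]) -> float:
    """Average per-utterance score of one model over the subgroup."""
    selected = [s for row, s in zip(rows, scores) if covers(itemset, row)]
    if not selected:
        raise ValueError("itemset has zero support")
    return sum(selected) / len(selected)


def divergence(itemset: Itemset, rows: List[Row], scores: List[float]) -> float:
    """Subgroup statistic minus the statistic over the entire dataset (empty itemset)."""
    return subgroup_mean(itemset, rows, scores) - subgroup_mean({}, rows, scores)


# Hypothetical toy data: metadata and per-utterance correctness of a single model.
rows = [
    {"gender": "female"},
    {"gender": "female"},
    {"gender": "male"},
    {"gender": "male"},
]
scores = [1.0, 1.0, 0.0, 1.0]

print(divergence({"gender": "male"}, rows, scores))    # negative: below-average accuracy
print(divergence({"gender": "female"}, rows, scores))  # positive: above-average accuracy
```

With accuracy as the statistic, a negative value marks a subgroup on which the model performs worse than on the dataset as a whole; for an error rate such as WER the sign reads the other way.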

Assessing performance discrepancies at the subgroup level is also crucial for model comparison. We introduce the concept of cross-model performance gap, which measures the performance difference between two models on a specific subgroup. This gap can be used to compare different models, characterized by different size, architecture, or pre-training objective. Specifically, given two models $M_1$ and $M_2$, the performance gap from model $M_1$ to model $M_2$ for the itemset $I$ is defined as the change in performance on $I$ obtained by replacing $M_1$ with $M_2$:

$$\mathrm{gap}_f(I, M_1, M_2) = f(I, M_2) - f(I, M_1) \tag{2}$$

The definitions of intra- and cross-model gaps apply to generic SLU models for any task, enabling assessment of subgroup performance for a given dataset annotated via metadata. This methodology thus remains task-, model-, and dataset-agnostic. To evaluate the statistical significance, we employ Welch's t-test to test the hypothesis that the means of the statistic $f$ are equal for (i) the subgroup $I$ and the entire population $D$, and (ii) the two models $M_1$ and $M_2$.
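The following sketch, under the assumption that per-utterance scores of both models on the same subgroup are available, computes the cross-model gap and the Welch t statistic with the Python standard library. The function names and the toy scores are hypothetical, and the degrees of freedom and p-value are deliberately left out.

```python
from math import sqrt
from statistics import mean, variance
from typing import List


def performance_gap(scores_m1: List[float], scores_m2: List[float]) -> float:
    """Change in the subgroup statistic obtained by replacing model 1 with model 2."""
    return mean(scores_m2) - mean(scores_m1)


def welch_t(x: List[float], y: List[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances.

    Needs at least two values per sample and a non-zero pooled standard error.
    """
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


# Hypothetical per-utterance correctness of two models on the same subgroup.
subgroup_m1 = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
subgroup_m2 = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0]

gap = performance_gap(subgroup_m1, subgroup_m2)
t_stat = welch_t(subgroup_m2, subgroup_m1)
print(f"gap = {gap:.3f}, t = {t_stat:.3f}")
```

The same `welch_t` helper covers the intra-model case by passing the subgroup scores and the scores over the entire dataset as the two samples.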

2.3. Local contribution through Shapley values

After identifying itemsets exhibiting significant divergence or gap, we seek to understand the contribution of each item to these metrics. We employ game theory concepts to provide local insights into subgroup behavior.

The local contribution quantifies the role of each item within an itemset in influencing its divergence or gap, using Shapley values. In this framework, items within an itemset are akin to team members, and the divergence or gap metric represents the team's total score. Specifically, for an item $\alpha$ within itemset $I$ and a metric of interest $g(I)$, i.e., divergence or gap, the Shapley value $\phi(\alpha, I)$ measures how much $\alpha$ contributes to $g(I)$, with $\sum_{\alpha \in I} \phi(\alpha, I) = g(I)$. More details on this local as well as the global contribution can be found in [17, 22, 29].
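To make the local contribution concrete, the sketch below computes exact Shapley values by averaging each item's marginal effect over all subsets of the remaining items. It assumes that the metric (divergence or gap) is already known for every sub-itemset and is zero for the empty itemset; the function name, the lookup table, and its values are hypothetical and are not results from the paper.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Tuple

Item = Tuple[str, str]  # (attribute, value)


def shapley_values(
    itemset: FrozenSet[Item],
    metric: Callable[[FrozenSet[Item]], float],
) -> Dict[Item, float]:
    """Exact Shapley value of each item with respect to the metric of the itemset."""
    n = len(itemset)
    values: Dict[Item, float] = {}
    for item in itemset:
        others = sorted(itemset - {item})
        total = 0.0
        for size in range(n):  # sizes 0 .. n-1 of subsets of the other items
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_item = frozenset(subset)
                total += weight * (metric(without_item | {item}) - metric(without_item))
        values[item] = total
    return values


# Hypothetical divergences of an itemset and all of its sub-itemsets.
male: Item = ("gender", "male")
fast: Item = ("speaking_rate", "high")
table = {
    frozenset(): 0.0,
    frozenset({male}): -0.10,
    frozenset({fast}): -0.05,
    frozenset({male, fast}): -0.30,
}

contributions = shapley_values(frozenset({male, fast}), table.__getitem__)
print(contributions)
```

The enumeration is exponential in the itemset length, which is acceptable here because interpretable itemsets contain only a handful of items.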

2.4. Subgroup-guided Data Acquisition

After evaluating the performance of a given speech model, our objective is to improve it both overall and across different subpopulations. We identify the critical subgroups (i.e., itemsets) characterized by negative divergence, representing challenging scenarios for the model. We implement a pruning procedure to eliminate redundancy among such subgroups, following [22]. Specifically, when encountering two subgroups, $I$ and $J$, where $J$ includes $I$ along with an additional metadata condition, we retain only the more general subgroup, $I$, if the absolute difference in their divergences is below a predefined threshold. This approach is based on the rationale that $I$ adequately captures the divergence exhibited by $J$, as the extra metadata in $J$ only marginally affects the divergence. Pruning the critical subgroups yields a more concise representation, forcing the data acquisition process to focus on the most pertinent attributes.
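The pruning rule described above can be sketched as follows: a subgroup is dropped when some retained candidate with exactly one item fewer has nearly the same divergence. This is a simplified reading of the rule under stated assumptions, not the pruning code of [22]; `prune_redundant`, `epsilon`, and the toy divergences are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Item = Tuple[str, str]  # (attribute, value)


def prune_redundant(
    divergences: Dict[FrozenSet[Item], float],
    epsilon: float,
) -> Dict[FrozenSet[Item], float]:
    """Keep an itemset unless a parent with one item fewer has a divergence within epsilon."""
    kept: Dict[FrozenSet[Item], float] = {}
    for itemset, div in divergences.items():
        redundant = any(
            parent < itemset                      # parent is a proper subset
            and len(itemset) - len(parent) == 1   # differing by one metadata condition
            and abs(div - parent_div) < epsilon   # extra item barely changes divergence
            for parent, parent_div in divergences.items()
        )
        if not redundant:
            kept[itemset] = div
    return kept


# Hypothetical critical subgroups with their accuracy divergences.
male = frozenset({("gender", "male")})
male_fast = male | {("speaking_rate", "high")}
male_young = male | {("age", "22-40")}
critical = {male: -0.20, male_fast: -0.21, male_young: -0.35}

pruned = prune_redundant(critical, epsilon=0.05)
print(sorted(len(itemset) for itemset in pruned))
```

In this toy example the speaking-rate condition adds almost nothing beyond gender and is pruned, while the age condition changes the divergence substantially and is kept.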

We prioritize data acquisition efforts on the top-$K$ critical subgroups with the highest negative divergence in accuracy and retrain the model with additional data belonging to these subgroups. The parameter $K$ allows us to control the data acquisition process and observe its impact on model performance overall and within subgroups. Further details can be found in [30].
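A minimal sketch of the selection step, assuming the critical subgroups and their divergences are already available and that held-out utterances carry the same discretized metadata: it ranks subgroups by negative divergence and returns the indices of held-out samples covered by the top ones. The function `select_for_acquisition`, the parameter `k`, and the toy data are hypothetical; retraining itself is out of scope.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Itemset = Dict[str, str]


def covers(itemset: Itemset, row: Row) -> bool:
    """True if the row satisfies every item of the itemset."""
    return all(row.get(attr) == value for attr, value in itemset.items())


def select_for_acquisition(
    critical: List[Tuple[Itemset, float]],
    held_out: List[Row],
    k: int,
) -> List[int]:
    """Indices of held-out samples that fall in any of the k most negative subgroups."""
    negative = [pair for pair in critical if pair[1] < 0]
    top_k = sorted(negative, key=lambda pair: pair[1])[:k]  # most negative first
    return [
        index
        for index, row in enumerate(held_out)
        if any(covers(itemset, row) for itemset, _ in top_k)
    ]


# Hypothetical critical subgroups (itemset, accuracy divergence) and held-out metadata.
critical = [
    ({"gender": "male", "speaking_rate": "high"}, -0.31),
    ({"age": "41-65"}, -0.12),
    ({"gender": "female"}, -0.04),
]
held_out = [
    {"gender": "male", "age": "22-40", "speaking_rate": "high"},
    {"gender": "female", "age": "41-65", "speaking_rate": "low"},
    {"gender": "female", "age": "22-40", "speaking_rate": "low"},
    {"gender": "male", "age": "22-40", "speaking_rate": "low"},
]

print(select_for_acquisition(critical, held_out, k=2))
```

Increasing `k` widens the set of acquired samples, which is how the trade-off between targeted and indiscriminate acquisition can be explored.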

3. Results and Discussion

We assess the effectiveness of our methodology by (i) analyzing its ability to identify sources of errors, (ii) examining the influence of factors such as model size, architecture, and pre-training objective on subgroup-level performance, and (iii) evaluating the effect of using subgroup-level information to guide a data acquisition strategy in enhancing model performance and mitigating biases. Please refer to [17, 29, 30] for a complete set of the results.

[Figure 1, panel titles: (a) w2v2-b, Δacc = -31.22%; (b) w2v2-b, Δacc = -0.65%.]

Metadata. We enrich the datasets with various metadata categories. We first incorporate demographic attributes of speakers where available, including gender, age, and country. We also consider unique metadata pertinent to each task if available, i.e., intent for FSC, and emotion and arousal labels for IEMOCAP. We finally extract from the raw signal utterance/transcription attributes such as silence duration (total and trimmed), word count, speaking rate (words per second), signal-to-noise ratio, and spectral flatness. The trimmed duration excludes initial and final pauses, while the total silence duration includes the entire utterance without any pauses. As the frequency and duration of intermediate pauses had little effect on model performance across all datasets, except for LibriSpeech, we chose to retain them for this dataset only.

Continuous attributes like speaking rate or utterance duration require discretization into fixed ranges. Using frequency-based discretization, we thus discretize this metadata into three ranges labeled as “low,” “medium,” and “high.”

RQ1: Model understanding at the subgroup level. We focus on the performance of the wav2vec 2.0 base model [27] across all datasets. Table 1 shows the subgroups with the largest negative and positive divergence, indicating critical scenarios for each dataset. The divergence values associated with these subgroups are statistically significant (with $t > 2$, as per Siegel's rule of thumb [31]). For FSC and IEMOCAP, we evaluate model accuracy across various data subgroups, where higher accuracy indicates better performance. A negative divergence signifies accuracy below the average, while a positive divergence indicates above-average accuracy.

[Figure 2, panel captions: (a) FSC. Performance improvement for 63.75% of subgroups, decrease for 31.89% of them. (d) LibriSpeech. Performance improvement for 99.25% of subgroups, decrease for 0.75% of them.]

For instance, for FSC, the wav2vec 2.0 base model exhibits its poorest performance for the subgroup characterized by speakers aged 22-40, male gender, no specified location, high speaking rate, and high total silence (Table 1, first block), with a divergence of $\Delta_{acc} = -31.2\%$. Analyzing sensitive attributes like gender is crucial, as evidenced by the significant impact observed. Specifically, female speakers achieve higher accuracy within the identified subgroup than males when all other metadata values remain constant. This trend is further confirmed by the Shapley values illustrated in Figure 1(a)-(b), where the male gender is associated with lower accuracy. In contrast, the female gender exhibits a positive impact.

Conversely, the analysis also reveals subgroups with above-average performance. For example, the model correctly predicts all utterances associated with the subgroup of speakers aged 22-40 with a low speaking rate, long duration, and “washroom” as the target location.

Similar assessments can be made for other datasets. For LibriSpeech, we study the Word Error Rate (WER); a positive WER divergence (i.e., higher than overall) signifies lower performance.

RQ2: Model comparison at the subgroup level. We compare different model performances at the overall and subgroup levels, detecting which subpopulations benefit the most from model changes. We analyze here how increasing the size of such models affects their performance at both levels. For changes in architecture and pre-training objective, please refer to [29].

Larger models tend to be more accurate overall, and [32] claims that larger models are also fairer. However, performance for specific subgroups is complex and depends on the dataset/task. We specifically examine how scaling up the wav2vec 2.0 model influences performance across datasets, with Table 2 summarizing the performance gap in terms of the highest performance improvement and decrease, and Figure 2 illustrating the distribution of this gap across subgroups.

While a larger model size enhances both overall and subgroup WER in the LibriSpeech dataset, it diminishes performance at both levels for IEMOCAP. We further reveal varying subgroup impacts on FSC, indicating that certain groups benefit more from a larger model size than others. Nonetheless, more than 30% of the explored subgroups show decreased performance when scaling up the size. These findings emphasize the importance of analyzing subgroup-specific outcomes when evaluating the effectiveness of larger models.

RQ3: Subgroup-guided data acquisition. We use the identified critical subgroups to guide a targeted data acquisition to improve model performance and mitigate its biases. We discuss the results for FSC. Further outcomes on ITALIC [33], an IC dataset in Italian, can be found in [30].

We partition our dataset into training, held-out, validation, and test sets, employing an 80-20 split for training and held-out data, respectively, while retaining the original validation and test sets. We first identify critical subgroups using the validation set, then acquire data samples from the held-out set, and retrain the model with these samples. Evaluation on the test set (Table 3) reveals consistently superior performance across overall and subgroup-level metrics, compared to baseline methods such as indiscriminate random and clustering-guided acquisition [15], where samples are selected from the acoustic embedding clusters with subpar performance.

Selecting only the top 2 critical subgroups leads to significant performance improvements at both overall and subgroup levels. Specifically, it achieves the best F1 score and accuracy performance, as well as the lowest maximum divergence ($\Delta^{-}_{\max}$) and the lowest average divergence for the top-10 ($\Delta^{-}_{10}$), 20 ($\Delta^{-}_{20}$), and 50 ($\Delta^{-}_{50}$) subgroups with the highest negative divergence. While performance slightly lowers when increasing the number $K$ of critical subgroups, it remains significantly better than the original model performance and the one obtained when adding all available data. The lowest average absolute divergence is found with $K = 5$ critical subgroups, indicating reduced performance disparities across subgroups.

Overall, these results underscore the effectiveness of targeted data acquisition in mitigating performance disparities and improving model robustness across diverse subgroups.

4. Conclusion

This study presents a novel methodology for evaluating spoken language understanding (SLU) system performance by analyzing model bias at the subgroup level. We enrich raw speech data by extracting metadata that include speaker demographics, task- and signal-related features to allow the definition of human-interpretable subgroups. By automating the detection of performance disparities within subgroups, our approach enhances error analysis, facilitates model comparison, and mitigates biases, thus improving overall performance. This versatile methodology demonstrates effectiveness across various speech tasks, datasets, and model sizes, offering insights into which subgroups benefit most from system enhancements and contributing to the development of more inclusive and effective speech technologies.

[1] Sarikaya, P. A. Crook, Marin, Jeong, J.-P. Robichaud, Celikyilmaz, Y.-B. Kim, Rochette, O. Z. Khan, Liu, et al., An overview of end-to-end language understanding and dialog management for personal digital assistants, in: 2016 IEEE Spoken Language Technology Workshop (SLT), IEEE, 2016, pp. 391–397.

[2] Terzopoulos, Satratzemi, Voice assistants and smart speakers in everyday life and in education, Informatics in Education 19 (2020) 473–490.

[3] Nuruzzaman, O. K. Hussain, A survey on chatbot implementation in customer service industry through deep neural networks, in: 2018 IEEE 15th International Conference on e-Business Engineering (ICEBE), IEEE, 2018, pp. 54–61.

[4] Scheidt, Chung, Making a case for speech analytics to improve customer service quality: Vision, implementation, and evaluation, International Journal of Information Management 45 (2019) 223–232. URL: https://www.sciencedirect.com/science/article/pii/S0268401217309441. doi:10.1016/j.ijinfomgt.2018.01.002.

[5] Latif, Qadir, Qayyum, Usama, Younis, Speech technology for healthcare: Opportunities, challenges, and state of the art, IEEE Reviews in Biomedical Engineering (2020).

[6] M. La Quatra, L. Vaiani, A. Koudounas, L. Cagliero, Garza, E. Baralis, How much attention should we pay to mosquitoes?, in: Proceedings of the 30th ACM International Conference on Multimedia, MM '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 7135–7139. URL: https://doi.org/10.1145/3503161.3551594. doi:10.1145/3503161.3551594.

[7] Turian, Shier, H. R. Khan, Raj, B. W. Schuller, C. J. Steinmetz, Malloy, G. Tzanetakis, G. Velarde, McNally, et al., HEAR: Holistic evaluation of audio representations, in: NeurIPS 2021 Competitions and Demonstrations Track, PMLR, 2022, pp. 125–145.

[8] S. wen Yang, P.-H. Chi, Y.-S. Chuang, C.-I. J. Lai, Lakhotia, Y. Y. Lin, A. T. Liu, Shi, Chang, G.-T. Lin, T.-H. Huang, W.-C. Tseng, K. tik Lee, D.-R. Liu, Huang, Dong, S.-W. Li, Watanabe, Mohamed, H. yi Lee, SUPERB: Speech Processing Universal PERformance Benchmark, in: Proc. Interspeech 2021, 2021, pp. 1194–1198. doi:10.21437/Interspeech.2021-1775.

[9] M. La Quatra, A. Koudounas, L. Vaiani, E. Baralis, Garza, L. Cagliero, S. M. Siniscalchi, Benchmarking representations for speech, music, and acoustic events, in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2024.

[10] W. X. Zhao, Zhou, Li, Tang, Wang, Hou, Min, Zhang, Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023).

[11] Singh, J. P. Inala, Galley, Caruana, Gao, Rethinking interpretability in the era of large language models, arXiv preprint arXiv:2402.01761 (2024).

[12] Koenecke, Nam, Lake, Nudell, Quartey, Mengesha, Toups, J. R. Rickford, Jurafsky, Goel, Racial disparities in automated speech recognition, Proc. of the National Academy of Sciences 117 (2020) 7684–7689.

[13] Feng, Kudina, B. M. Halpern, Scharenborg, Quantifying bias in automatic speech recognition, arXiv preprint arXiv:2103.15122 (2021).

[14] Liu, Picheny, Sarı, Chitkara, Xiao, Zhang, Chou, Alvarado, Hazirbas, Saraf, Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions, in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 6162–6166.

[15] Dheram, Ramakrishnan, Raju, I.-F. Chen, King, Powell, Saboowala, Shetty, Stolcke, Toward fairness in speech recognition: Discovery and mitigation of performance disparities, in: Proc. Interspeech 2022, 2022, pp. 1268–1272. doi:10.21437/Interspeech.2022-10816.

[16] L.-F. Lai, Holliday, Exploring Sources of Racial Bias in Automatic Speech Recognition through the Lens of Rhythmic Variation, in: Proc. INTERSPEECH 2023, 2023, pp. 1284–1288. doi:10.21437/Interspeech.2023-159.

[17] A. Koudounas, E. Pastor, G. Attanasio, V. Mazzia, M. Giollo, T. Gueudre, L. Cagliero, L. de Alfaro, E. Baralis, D. Amberti, Exploring subgroup performance in end-to-end speech models, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi:10.1109/ICASSP49357.2023.10095284.

[18] I.-E. Veliche, Fung, Improving fairness and robustness in end-to-end speech recognition through unsupervised clustering, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2023, pp. 1–5.

[19] A. Koudounas, Giobergia, Houston we have a divergence: A subgroup performance analysis of ASR models, arXiv preprint arXiv:2404.07226 (2024).

[20] A. Koudounas, Giobergia, E. Baralis, Bad exoplanet! Explaining degraded performance when reconstructing exoplanets atmospheric parameters, in: NeurIPS 2023 AI for Science Workshop, 2023.

[21] Shahbazi, Lin, Asudeh, Jagadish, Representation bias in data: a survey on identification and resolution techniques, ACM Computing Surveys 55 (2023) 1–39.

[22] E. Pastor, L. de Alfaro, E. Baralis, Looking for trouble: Analyzing classifier behavior via pattern divergence, in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD '21, ACM, 2021, pp. 1400–1412. doi:10.1145/3448016.3457284.

[23] E. Pastor, E. Baralis, L. de Alfaro, A hierarchical approach to anomalous subgroup discovery, in: 39th IEEE International Conference on Data Engineering, ICDE 2023, IEEE, 2023, pp. 2647–2659. doi:10.1109/ICDE55515.2023.00203.

[24] Panayotov, Chen, Povey, Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210. doi:10.1109/ICASSP.2015.7178964.

[25] Lugosch, Ravanelli, Ignoto, V. S. Tomar, Bengio, Speech model pre-training for end-to-end spoken language understanding, in: Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, 2019, pp. 814–818.

[26] Busso, Bulut, C.-C. Lee, E. A. Kazemzadeh, E. M. Provost, Kim, J. N. Chang, Lee, S. S. Narayanan, IEMOCAP: interactive emotional dyadic motion capture database, Language Resources and Evaluation 42 (2008) 335–359.

[27] Baevski, Zhou, Mohamed, Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 12449–12460.

[28] E. Pastor, Gavgavian, E. Baralis, L. de Alfaro, How divergent is your data?, Proc. VLDB Endow. 14 (2021) 2835–2838. doi:10.14778/3476311.3476357.

[29] A. Koudounas, E. Pastor, G. Attanasio, V. Mazzia, M. Giollo, T. Gueudre, E. Reale, L. Cagliero, S. Cumani, L. de Alfaro, E. Baralis, D. Amberti, Towards comprehensive subgroup performance analysis in speech models, IEEE/ACM Transactions on Audio, Speech, and Language Processing 32 (2024) 1468–1480. doi:10.1109/TASLP.2024.3363447.

[30] A. Koudounas, E. Pastor, G. Attanasio, L. de Alfaro, E. Baralis, Prioritizing data acquisition for end-to-end speech model improvement, in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 7000–7004. doi:10.1109/ICASSP48485.2024.10446326.

[31] A. F. Siegel, Chapter 10 - hypothesis testing: Deciding between reality and coincidence, in: A. F. Siegel (Ed.), Practical Business Statistics (Sixth Edition), sixth edition ed., Springer Science & Business Media, 2012, pp. 249–287. doi:10.1016/B978-0-12-385208-3.00010-9.

[32] Y. Sheng, J. Yang, Y. Wu, K. Mao, Y. Shi, J. Hu, W. Jiang, L. Yang, The larger the fairer? Small neural networks can achieve fairness for edge devices, arXiv preprint arXiv:2202.11317 (2022).

[33] A. Koudounas, M. La Quatra, L. Vaiani, L. Colomba, G. Attanasio, E. Pastor, L. Cagliero, E. Baralis, ITALIC: An Italian Intent Classification Dataset, in: Proc. INTERSPEECH 2023, 2023, pp. 2153–2157. doi:10.21437/Interspeech.2023-1980.