<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Divergence-aware Approaches to Mitigate Subgroup Disparities in Speech Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alkis Koudounas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eliana Pastor</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flavio Giobergia</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Baralis</string-name>
        </contrib>
        <aff>Politecnico di Torino, Italy</aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Speech models often struggle with performance inconsistencies across different subgroups, leading to degraded accuracy for certain speaker demographics, accents, or recording conditions. These discrepancies may originate from multiple causes, such as imbalanced training data, suboptimal representation learning, and limitations in model generalization. Addressing these issues improves model robustness and reliability in real-world applications. We propose to mitigate performance disparities of subgroups that underperform, i.e., exhibit divergence, relative to overall model performance. We tackle the performance disparities both via in-processing solutions, i.e., implementing mitigation measures during model development, and a post-processing one, refining already trained models. As in-processing solutions, we propose three approaches: divergence-aware regularization, targeted data augmentation, and contrastive learning (CLUES). Each method improves model learning in different ways: divergence-aware regularization adjusts training to focus on low-performing subgroups, targeted data augmentation generates synthetic variations to enhance model robustness, while CLUES refines latent representations. The post-processing strategy introduces a divergence-aware data acquisition method to prioritize acquiring real-world samples from underperforming subgroups.</p>
      </abstract>
      <kwd-group>
        <kwd>bias mitigation</kwd>
        <kwd>spoken language understanding</kwd>
        <kwd>speech processing</kwd>
        <kwd>data acquisition</kwd>
        <kwd>divergence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ISSN1613-0073</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Speech models are widely used in modern applications, including virtual assistants, transcription
services, and accessibility tools [<xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>]. These models must handle a wide range of speech
variations, including different accents, speaking styles, and recording conditions [<xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>].
Despite their advancements, these models often exhibit performance disparities across different
population subgroups. Studies have shown that factors such as gender, accent, speaking rate,
and recording conditions can significantly impact the accuracy of these systems [<xref ref-type="bibr" rid="ref8 ref9 ref10 ref11 ref12 ref13 ref14 ref15 ref16">8, 9, 10, 11, 12, 13, 14, 15, 16</xref>]. These inconsistencies reduce the reliability of speech models and limit their
ability to perform well across diverse real-world conditions. Several factors may contribute to
these disparities. Differences in data distribution can lead to imbalanced learning, where models
become more accurate for certain types of speech while struggling with others. Inadequate
representation learning fails to capture the full spectrum of speech variations. Models may
also struggle to generalize when encountering speech characteristics underrepresented during
training. Addressing these issues improves speech model robustness and ensures
that they perform consistently across different conditions.
      </p>
      <p>
        Various methods have been proposed to address these challenges. Many approaches rely on
manually identifying specific speech characteristics that might cause performance issues [<xref ref-type="bibr" rid="ref17">17</xref>].
Some approaches use data augmentation [<xref ref-type="bibr" rid="ref18">18</xref>], generating synthetic variations to improve model
robustness. Others explore domain adaptation [<xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>], fine-tuning models on datasets that
better represent specific speech characteristics. Adversarial training has also been used to
make models more invariant to certain variations in speech [<xref ref-type="bibr" rid="ref18">18</xref>]. While these techniques
have improved fairness and robustness, they may overlook unexpected subgroups that emerge
only after model evaluation. Moreover, performance disparities often occur at the intersection
of multiple speech characteristics, making it difficult to address all sources of inconsistency
through predefined subgroup selection alone.
      </p>
      <p>Recent research has explored automated subgroup identification, using clustering techniques
to detect data patterns where models underperform [<xref ref-type="bibr" rid="ref12">12</xref>]. While these data-driven approaches
help identify performance gaps, they often lack interpretability and do not clearly describe the
underlying problems. Consequently, they provide neither insights into the specific sources of
performance inconsistencies nor guidance for data acquisition for model improvement.</p>
      <p>
        Our paper presents a framework addressing these limitations through four complementary
methods. We propose to mitigate the performance disparities within data subgroups that deviate
significantly, i.e., exhibit a divergence, from the overall model performance. We propose both
post-processing and in-processing approaches. For post-processing, i.e., improving already
trained models [<xref ref-type="bibr" rid="ref21">21</xref>], we propose a targeted data acquisition strategy to collect new real-world samples
to fine-tune a pre-trained model, mitigating its disparities [<xref ref-type="bibr" rid="ref22">22</xref>]. In-processing involves the
implementation of mitigation measures during the model development phase [<xref ref-type="bibr" rid="ref21">21</xref>]. As
in-processing solutions, we propose three techniques [<xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>]: divergence-aware regularization, targeted
data augmentation, and contrastive learning. Divergence-aware regularization modifies the
model loss function to emphasize underperforming subgroups during training. Targeted data
augmentation increases the representation of these subgroups by applying transformations to
existing samples. Finally, contrastive learning refines the model’s internal representations by
grouping similar samples closer together in latent space.
      </p>
      <p>
        To evaluate these methods, we conduct experiments on two spoken language understanding
datasets: Fluent Speech Commands (FSC) in English [<xref ref-type="bibr" rid="ref25">25</xref>] and ITALIC [<xref ref-type="bibr" rid="ref26">26</xref>] in Italian. We
fine-tune transformer-based speech models and measure their performance using overall accuracy,
subgroup performance divergence, and latent space analysis. Our results provide insights into
the effectiveness of each method in reducing bias and improving performance.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Problematic Subgroup Identification on Interpretable</title>
    </sec>
    <sec id="sec-4">
      <title>Metadata</title>
      <p>
        Speech models often exhibit inconsistent performance across diferent speaker groups. To
address this issue, it is necessary first to identify and analyze these subgroups systematically. A
challenge in subgroup identification is ensuring that the subgroups are interpretable, meaning
they provide clear insights into why performance disparities occur. For instance, “young men
in noisy scenarios” is an interpretable subgroup, allowing both understanding and intervention.
To achieve this identification, we leverage the techniques of2[
        <xref ref-type="bibr" rid="ref28 ref29 ref7">7, 28, 29</xref>
        ] that define subgroups
as interpretable combinations of metadata such as speaker demographics, recording conditions,
and task characteristics. In the following, we outline the definition of interpretable metadata
and then the automatic identification of subgroups.
      </p>
      <p>Interpretable Metadata. Speech datasets typically include a variety of metadata attributes
that can influence model performance. Demographic attributes such as gender, age, and
native language are among the most common factors affecting recognition accuracy. Beyond
demographics, speech characteristics such as speaking rate and silence duration also impact
recognition performance [<xref ref-type="bibr" rid="ref30">30</xref>]. Faster speech or heavily accented pronunciation may introduce
additional challenges, especially if the training data lacks sufficient diversity. In addition to
speaker characteristics, recording conditions also contribute to subgroup disparities. Factors
such as background noise, microphone type, and reverberation levels can create variations in
audio quality, affecting model predictions. A model trained primarily on clean audio data may
struggle when encountering noisy environments, leading to disparate performance outcomes
for speakers who record in less controlled conditions. Task-specific metadata, such as intent
categories in spoken language understanding, also play a role in subgroup performance. Certain
intents or command structures may be more frequently represented in training data, resulting
in better recognition accuracy compared to less frequent or more complex intent formulations.</p>
      <p>
        Automatic subgroup identification. To systematically extract underperforming subgroups,
we adopt DivExplorer [<xref ref-type="bibr" rid="ref31 ref32 ref33">31, 32, 33</xref>]. DivExplorer identifies underperforming and interpretable
subgroups by analyzing metadata attributes and measuring performance divergence, which
quantifies how much a subgroup’s performance deviates from the overall model performance.
      </p>
      <p>Specifically, let D denote the dataset and A the set of metadata attributes. An item is defined
as an attribute-value pair. For example, gender=female or speaking rate=high are items. A
subgroup corresponds to the subset of data instances that satisfy one or more such items,
represented as an itemset I. Given a statistic f (e.g., accuracy or error rate), the divergence Δf(I)
of a subgroup identified by the itemset I is defined as: Δf(I) = f(I) − f(D). A high negative divergence
value indicates that the subgroup is significantly underperforming compared to the dataset
as a whole. To ensure statistical reliability, subgroup discovery is constrained by a minimum
support threshold, which filters out small subgroups where performance estimates may be
unreliable. The subgroups are extracted by augmenting frequent pattern mining techniques,
such as FP-Growth or Apriori, over the defined interpretable metadata, to also compute the
divergence during the extraction process. By identifying subgroups with significant divergence,
this method provides a structured way to analyze and mitigate performance inconsistencies. The
identified subgroups inform post-processing targeted data acquisition (§3.1) and in-processing
techniques via regularization (§3.2), data augmentation (§3.3), and contrastive learning (§3.4).</p>
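      <p>To make the divergence computation concrete, the following Python sketch enumerates single-item subgroups over hypothetical metadata columns and computes Δf(I) for accuracy. It is a simplified illustration: DivExplorer itself mines itemsets of arbitrary length with frequent pattern mining, whereas here only single items are considered; column names are assumptions and the 0.03 support threshold mirrors the setup of §4.1.</p>
      <preformat>
import pandas as pd

def subgroup_divergences(df, meta_cols, correct_col="correct", min_support=0.03):
    """Accuracy divergence for every single-item subgroup.

    df          : per-utterance evaluation results (one row per utterance)
    meta_cols   : interpretable metadata columns (e.g., gender, noise level)
    correct_col : boolean column, True if the model prediction is correct
    """
    overall_acc = df[correct_col].mean()
    rows = []
    for col in meta_cols:
        for value, group in df.groupby(col):
            support = len(group) / len(df)
            if support &lt; min_support:  # skip unreliable small subgroups
                continue
            rows.append({"itemset": f"{col}={value}",
                         "support": support,
                         "divergence": group[correct_col].mean() - overall_acc})
    # most negative divergence first, i.e., the most problematic subgroups
    return pd.DataFrame(rows).sort_values("divergence")

# results = subgroup_divergences(eval_df, ["gender", "age", "noise_level"])
</preformat>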
    </sec>
    <sec id="sec-5">
      <title>3. Bias Mitigation Methods</title>
      <p>Bias in speech models arises when performance varies significantly across different subgroups,
often due to imbalanced representation in training data. To mitigate these disparities, various
techniques have been proposed, broadly categorized into post-processing methods, which refine
a trained model, and in-processing methods, which modify the training process itself.
Post-processing methods are useful when fairness issues emerge after deployment, as they adjust
model predictions or incorporate new data without requiring full retraining. In-processing
methods, on the other hand, introduce fairness-aware mechanisms directly into the learning
process to ensure balanced performance from the outset.</p>
      <p>This study covers four bias mitigation techniques: one post-processing method and three
in-processing ones. For post-processing, we use targeted data acquisition, which enhances fairness
by collecting additional subgroup-specific data. In-processing approaches include
divergence-aware regularization, which modifies the loss function to prioritize underperforming subgroups;
targeted data augmentation, which increases subgroup diversity through synthetic
transformations; and contrastive learning (CLUES) to refine latent representations for improved fairness.</p>
      <sec id="sec-5-1">
        <title>3.1. Post-Processing: Targeted Data Acquisition</title>
        <p>Targeted data acquisition is a post-processing approach that improves subgroup performance
by supplementing the training set with additional real-world examples from underperforming
subgroups. This method identifies performance disparities after model deployment and retrains
the model with newly collected data.</p>
        <p>The process begins by evaluating the trained model to identify subgroups with significantly
lower accuracy compared to the overall dataset. These subgroups are interpretable, ensuring
that their characteristics are clearly defined. By guaranteeing interpretability, we can perform
targeted data acquisition to acquire new speech samples that better represent them. These
additional samples are then integrated into the dataset, and the model undergoes additional
fine-tuning to improve its ability to generalize across all subgroups.</p>
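        <p>As a minimal sketch of the selection step, assuming a metadata-annotated acquisition pool (pool_df) and the divergence table (results) from the earlier snippet in §2, with an illustrative per-subgroup budget, the acquisition could look as follows:</p>
        <preformat>
import pandas as pd

def acquire_for_subgroups(pool_df, divergence_df, k=2, budget=200):
    """Select new samples from an annotated pool that match the
    top-k most divergent (worst-performing) subgroups."""
    picked = []
    for itemset in divergence_df.nsmallest(k, "divergence")["itemset"]:
        col, value = itemset.split("=", 1)
        matches = pool_df[pool_df[col].astype(str) == value]
        picked.append(matches.sample(min(budget, len(matches))))
    return pd.concat(picked).drop_duplicates()

# new_samples = acquire_for_subgroups(pool_df, results, k=2)
# train_df = pd.concat([train_df, new_samples])  # then fine-tune the model
</preformat>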
        <p>One of the key advantages of targeted data acquisition is its reliance on real-world speech
variations rather than artificial data. This ensures that the model learns from natural speech
patterns, accents, and recording conditions that were previously underrepresented. However,
this method requires significant resources for data collection, annotation, and model retraining.
Despite these challenges, targeted data acquisition is particularly valuable in deployed systems,
where performance problems across groups only become apparent after real-world use.</p>
      </sec>
      <sec id="sec-5-2">
        <title>3.2. In-Processing: Divergence-Aware Regularization</title>
        <p>Traditional model learning objectives optimize for overall performance, often overlooking
subgroup disparities. Divergence-aware regularization is an in-processing technique that
directly modifies the training process to improve subgroup learning. This approach dynamically adjusts
the loss function to focus on underperforming subgroups, ensuring they receive increased
attention during training. In this method, the training procedure continuously monitors performance
across different subgroups. If a subgroup exhibits significantly lower accuracy compared to the
overall dataset, its samples are assigned higher loss weights during training. By amplifying
the contribution of these samples, the model is encouraged to learn representations that better
capture subgroup-specific variations.</p>
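        <p>A minimal PyTorch sketch of this idea follows; the weighting scheme (w = 1 + |Δ| for negatively divergent subgroups) is one plausible choice for illustration, not necessarily the paper’s exact formulation.</p>
        <preformat>
import torch
import torch.nn.functional as F

def divergence_weighted_ce(logits, targets, subgroup_ids, divergence_by_group):
    """Cross-entropy where samples from underperforming subgroups
    receive a larger weight (w = 1 + |divergence| when divergence is negative)."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.tensor(
        [1.0 + max(0.0, -divergence_by_group.get(int(g), 0.0))
         for g in subgroup_ids],
        device=logits.device)
    return (weights * per_sample).mean()
</preformat>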
        <p>Divergence-aware regularization is an effective solution for bias mitigation without requiring
additional data. Since it operates directly on the training loss, it improves subgroup performance
without altering the dataset size or introducing synthetic transformations.</p>
      </sec>
      <sec id="sec-5-3">
        <title>3.3. In-Processing: Targeted Data Augmentation</title>
        <p>Targeted data augmentation is another in-processing method that improves subgroup
performance by artificially incrementing the training data for underperforming subgroups. Instead of
collecting new samples, this approach applies synthetic transformations to the existing data
to increase subgroup representation. Several augmentation techniques are commonly used in
speech processing, including time stretching, which alters the speed of speech, pitch shifting,
which changes the speaker’s tone, and noise injection, which simulates different recording
environments. These transformations create diverse variations of the same speech sample,
allowing the model to become more robust to variations in speaking style, accent, or background
noise. In our context, once the underperforming subgroups are identified, we apply targeted
data augmentation techniques to increase their presence in the training set.</p>
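        <p>As an illustration, a sample from an underperforming subgroup could be augmented along these three axes with librosa; parameter ranges here are illustrative assumptions, not the paper’s configuration.</p>
        <preformat>
import numpy as np
import librosa

def augment(waveform, sr, rng=None):
    """Create time-stretched, pitch-shifted, and noisy variants
    of one utterance from an underperforming subgroup."""
    rng = rng if rng is not None else np.random.default_rng()
    stretched = librosa.effects.time_stretch(waveform, rate=rng.uniform(0.9, 1.1))
    shifted = librosa.effects.pitch_shift(waveform, sr=sr, n_steps=rng.uniform(-2, 2))
    noisy = waveform + 0.005 * rng.standard_normal(len(waveform))
    return [stretched, shifted, noisy]

# y, sr = librosa.load("utterance.wav", sr=16000)
# extra_training_samples = augment(y, sr)
</preformat>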
        <p>One key advantage of this approach is its efficiency, as augmentation can be applied easily to
existing samples. However, this method does not introduce truly new linguistic or demographic
diversity, as it only manipulates existing samples. Despite this limitation, it serves as a
cost-effective way to improve model robustness for underperforming subgroups.</p>
      </sec>
      <sec id="sec-5-4">
        <title>3.4. In-Processing: Contrastive Learning (CLUES)</title>
        <p>Contrastive learning has gained attention as an effective technique for refining the latent space
representations of deep learning models. The CLUES (Contrastive Learning framework for
Underperforming Subgroups) method applies contrastive loss to guide the model in learning more
structured and subgroup-aware representations. Unlike regularization or data augmentation,
which focus on altering training behavior, contrastive learning reshapes the model’s internal
feature space to better distinguish between subgroups.</p>
        <p>CLUES operates at three levels of contrastive learning. At the task level, it ensures that
samples belonging to the same class are grouped closely together while separating samples
from different classes. At the subgroup level, it clusters samples from the same subgroup while
pushing apart those from different subgroups. Finally, at the error level, it groups correctly
classified samples separately from misclassified ones within each subgroup. By optimizing
these three objectives, CLUES improves how the model encodes subgroup-specific information,
leading to improved subgroup performance.</p>
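        <p>The three objectives can be instantiated by applying a standard supervised contrastive loss to three label sets, as in the sketch below. This is a plausible simplification of the CLUES objective for illustration, not its exact formulation; the weights w are hypothetical.</p>
        <preformat>
import torch
import torch.nn.functional as F

def sup_contrastive(z, labels, temperature=0.1):
    """Supervised contrastive loss: pull same-label embeddings
    together, push different-label embeddings apart."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) &amp; ~self_mask
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss[pos.any(dim=1)].mean()  # anchors with at least one positive

def clues_style_loss(z, intent, subgroup, is_correct, w=(1.0, 1.0, 1.0)):
    """Task-, subgroup-, and error-level contrastive terms; the error
    level separates correct and wrong predictions within each subgroup."""
    error_labels = subgroup * 2 + is_correct.long()
    return (w[0] * sup_contrastive(z, intent)
            + w[1] * sup_contrastive(z, subgroup)
            + w[2] * sup_contrastive(z, error_labels))
</preformat>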
        <p>A key advantage of CLUES is that it improves model representations at the subgroup level
without requiring additional data, simply by restructuring the way data is represented. By explicitly
shaping the latent space, CLUES reduces overlap between subgroup distributions, preventing
the model from learning biased or entangled representations. However, contrastive learning
introduces additional computational complexity. Despite this, experimental results show that
CLUES provides the most effective technique for mitigating bias and improving performance.</p>
        <p>Summary. Post-processing and in-processing methods offer distinct strategies for addressing
bias in speech models. Targeted data acquisition, as a post-processing method, enhances
subgroup performance by incorporating real-world samples into model fine-tuning. In contrast,
in-processing methods adjust the training process to achieve improvement at the subgroup level
without external data collection. The selection of an appropriate bias mitigation method depends
on the specific requirements of the application, including the available data and computational
constraints. In the next section, we outline the experimental setup used to evaluate
these methods and analyze their effect on subgroup and overall model performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4. Results and Analysis</title>
      <p>This section presents the results of applying the four bias mitigation methods, analyzing their
impact on overall model performance, subgroup fairness, and latent space representations.</p>
      <sec id="sec-6-1">
        <title>4.1. Experimental setup</title>
        <p>
          Dataset and models. We conduct experiments on two spoken language understanding datasets:
Fluent Speech Commands (FSC) [<xref ref-type="bibr" rid="ref25">25</xref>] in English and ITALIC [<xref ref-type="bibr" rid="ref26">26</xref>] in Italian. These datasets
contain labeled utterances categorized by intent. The data is split into training, validation,
and test sets, ensuring that speakers do not overlap between splits. To test the scenario of
data acquisition of unseen samples, we also tested a configuration in which we use part of
the original train set for actual training and a part for the data acquisition, denoted as
held-out. We fine-tune wav2vec 2.0 [<xref ref-type="bibr" rid="ref34">34</xref>] for FSC and XLS-R [<xref ref-type="bibr" rid="ref35">35</xref>] for ITALIC. For our subgroup
extraction with DivExplorer, we explored all subgroups with a minimum frequency of 0.03,
following [<xref ref-type="bibr" rid="ref22">22</xref>]. For both post-processing data acquisition and in-processing data augmentation,
the hyperparameter K defines the top-K most challenging subgroups to target. We report
the results for K=2. Complete results with sensitivity analysis, ablation studies, and evaluations
on emotion recognition and automatic speech recognition tasks are available in [<xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>].
        </p>
        <p>Metrics. We evaluate accuracy and macro F1 score to measure overall performance. For
subgroup performance, we evaluate the maximum subgroup divergence (Δmax), the average
divergence of the top-10 underperforming subgroups (Δtop-10), and the average divergence in
absolute terms (|Δ|). We also performed a latent space analysis using the Silhouette Score to assess
how well the model distinguishes between subgroups when adopting CLUES.</p>
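        <p>Given a divergence table such as the one sketched in §2 (the results DataFrame is hypothetical), these subgroup metrics reduce to a few lines:</p>
        <preformat>
# `results` has one row per subgroup with a signed "divergence" column
worst = results["divergence"].min()  # most negative = maximum divergence
top10_avg = results.nsmallest(10, "divergence")["divergence"].mean()
abs_avg = results["divergence"].abs().mean()
print(f"max divergence: {worst:.2%}, top-10 avg: {top10_avg:.2%}, |avg|: {abs_avg:.2%}")
</preformat>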
        <p>
          Baselines. We compare our mitigation methods, when using our automatic identification
approach, against a set of alternative baselines that aim to identify challenging samples for model
improvement. The random baseline selects samples randomly, serving as a control to highlight
the effectiveness of subgroup-based selection. The clustering baseline follows [<xref ref-type="bibr" rid="ref12">12</xref>], where
challenging subgroups are identified using K-means clustering applied to acoustic embeddings.
The clusters with the lowest performance are then used to determine the most challenging
samples. The KNN baseline employs a K-Nearest Neighbors approach, where an utterance is
considered challenging if its nearest neighbors in the validation set are frequently misclassified,
with K optimized per dataset. Finally, the error-driven baseline, close to [<xref ref-type="bibr" rid="ref36">36</xref>], selects misclassified
instances from the held-out set and incorporates them into training. We evaluate this baseline
only for post-processing since training loss inherently accounts for errors during learning.
        </p>
        <p>[Table 1: overall and subgroup performance on FSC and ITALIC for each strategy (data acquisition, targeted data augmentation, regularization, CLUES), comparing the original model with the random, KNN, clustering, error-driven, and DivExplorer-based identification variants.]</p>
      </sec>
      <sec id="sec-6-2">
        <title>4.2. Experimental results</title>
        <p>Overall Performance. We report the results in Table 1. The four proposed bias mitigation
methods lead to varying degrees of improvement in model accuracy and fairness.
Divergence-aware regularization and contrastive learning (CLUES) achieve the highest overall accuracy while
allowing the highest reductions in subgroup disparities. Coupling any strategy with our
identification methodology generally achieves the best results (highlighted in light yellow in Table 1).</p>
        <p>On the FSC dataset, when using the full training set (original - all), the baseline wav2vec 2.0
model achieves an accuracy of 93.42% and an F1 macro score of 93.11%, but exhibits high
divergence across subgroups. After applying mitigation strategies, CLUES improves overall accuracy
to 98.79% and reduces subgroup divergence significantly. Divergence-aware regularization
similarly enhances subgroup performance while maintaining a competitive accuracy of 98.5%,
while targeted data augmentation yields more moderate improvements, particularly benefiting
subgroups with lower representation. On ITALIC, the baseline XLS-R model achieves 73.22%
F1 Macro (original - all). Divergence-aware regularization and contrastive learning improve
overall performance: to 74.85% for the former, and to 76.10% and 76.72% when using CLUES
coupled with clustering or with our identification approach based on DivExplorer.</p>
        <p>Subgroup Performance. We assess how well the methods reduce subgroup disparities.
Before mitigation, the baseline FSC model on overall data has a Δmax of 53.18% (i.e., the least
accurate subgroup performs significantly worse than the global accuracy of 93.42%). CLUES
reduces this divergence the most, down to 17.58%. Divergence-aware regularization also
substantially reduces Δmax, to 24.49%, confirming its effectiveness in addressing subgroup imbalances.
For ITALIC, the baseline XLS-R model on overall data starts with a Δmax of 47.54%. Contrastive
learning and divergence-aware regularization reduce this gap to 40.15% and 30.10%, respectively.</p>
        <p>Latent Space Analysis. We use the Silhouette Score to investigate the impact of bias
mitigation on the latent space representations. A higher Silhouette Score indicates that the model
better separates subgroups, reflecting improved internal representations of speech variations.</p>
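        <p>For reference, this analysis can be reproduced with scikit-learn, assuming per-utterance latent embeddings and subgroup labels (both variable names are hypothetical):</p>
        <preformat>
from sklearn.metrics import silhouette_score

# embeddings: (n_samples, dim) array of latent representations
# subgroup_labels: integer subgroup id per sample
score = silhouette_score(embeddings, subgroup_labels)
print(f"Silhouette Score: {score:.3f}")  # higher = better-separated subgroups
</preformat>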
        <p>The baseline FSC model achieves a Silhouette Score of 0.737. CLUES improves this to 0.894,
demonstrating that targeting subgroup representation learning significantly enhances the
model’s ability to distinguish between subgroups. A similar pattern is observed on the ITALIC
dataset, where contrastive learning improves the Silhouette Score from 0.319 to 0.539. This suggests
that models trained with contrastive objectives learn more structured and subgroup-aware
representations, contributing to improvements in subgroup performance. A complete analysis
of model representations can be found in [<xref ref-type="bibr" rid="ref24">24</xref>].</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Conclusions</title>
      <p>This paper outlined a framework for improving speech model performance by identifying
and mitigating subgroup disparities, leveraging interpretable metadata to systematically
detect underperforming, i.e., divergent, subgroups. We explored four mitigation techniques: the
post-processing targeted data acquisition and the in-processing divergence-aware
regularization, targeted data augmentation, and contrastive learning (CLUES). Each method addressed
performance inconsistencies differently, either by enhancing model training, refining latent
representations, or incorporating subgroup-specific data. The experimental results show that
CLUES and divergence-aware regularization are the most effective in reducing subgroup
disparities. Moreover, CLUES enhances latent space representations. The findings highlight the
value of adopting divergence-aware subgroup identification in speech model development.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is partially supported by FAIR - Future Artificial Intelligence Research, which
received funding from the European Union NextGenerationEU (PIANO NAZIONALE DI RIPRESA
E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1555
11/10/2022, PE00000013), and by the spoke “FutureHPC &amp; BigData” of the ICSC - Centro Nazionale
di Ricerca in High-Performance Computing, Big Data and Quantum Computing, funded by
the European Union - NextGenerationEU. This manuscript reflects only the authors’ views
and opinions; neither the European Union nor the European Commission can be considered
responsible for them.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly and ChatGPT to check grammar
and spelling and to paraphrase and reword. After using these tools, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
      <p>Scale, in: Proc. Interspeech 2022, 2022. do1i:0.21437/Interspeech.2022-143.
[36] R. Magar, A. B. Farimani, Learning from mistakes: Sampling strategies to eficiently
train machine learning models for material property prediction, Computational Materials
Science 224 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.-w.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-I. J.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chang</surname>
          </string-name>
          , G.-T. Lin,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          , W.-C. Tseng,
          <string-name>
            <given-names>K.-t.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , H. yi Lee,
          <source>SUPERB: Speech Processing Universal PERformance Benchmark, in: Proc. Interspeech</source>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>1194</fpage>
          -
          <lpage>1198</lpage>
          .
          doi:10.21437/Interspeech.2021-1775.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Siniscalchi</surname>
          </string-name>
          ,
          <article-title>Speech analysis of language varieties in italy</article-title>
          ,
          <source>in: Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING</article-title>
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>15147</fpage>
          -
          <lpage>15159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          , T. Xu,
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McLeavey</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Robust speech recognition via large-scale weak supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>28492</fpage>
          -
          <lpage>28518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          , G. Ciravegna,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fantini</surname>
          </string-name>
          , E. Crosetti, G. Succo,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cerquitelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          , et al.,
          <article-title>Voice disorder analysis: a transformer-based approach</article-title>
          , in: INTERSPEECH,
          ISCA
          ,
          <year>2024</year>
          , pp.
          <fpage>3040</fpage>
          -
          <lpage>3044</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. La</given-names>
            <surname>Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          , E. Baralis,
          <article-title>Transformer-based non-verbal emotion recognition: Exploring model portability across speakers' genders</article-title>
          ,
          <source>in: Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge</source>
          , MuSe ’22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>89</fpage>
          -
          <lpage>94</lpage>
          . URL: https://doi.org/10.1145/3551876.3554801. doi:10.1145/3551876.3554801.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Foundation model assisted automatic speech emotion recognition: Transcribing, annotating, and augmenting</article-title>
          ,
          <source>in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>12116</fpage>
          -
          <lpage>12120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. La</given-names>
            <surname>Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Siniscalchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>voc2vec: A foundation model for non-verbal vocalization</article-title>
          ,
          <source>in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/ICASSP49660.2025.10890672.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Bajorek</surname>
          </string-name>
          ,
          <article-title>Voice recognition still has significant race and gender biases</article-title>
          ,
          <source>Harvard Business Review</source>
          <volume>10</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koenecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nudell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Quartey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mengesha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Toups</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Rickford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <article-title>Racial disparities in automated speech recognition</article-title>
          ,
          <source>Proc. of the National Academy of Sciences</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mengesha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heldreth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lahav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sublewski</surname>
          </string-name>
          , E. Tuennerman,
          <article-title>“I don't think these devices are very culturally sensitive.” - Impact of automated speech recognition errors on African Americans</article-title>
          ,
          <source>Frontiers in Artificial Intelligence</source>
          (
          <year>2021</year>
          )
          <fpage>169</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Picheny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sarı</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chitkara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alvarado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hazirbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Saraf</surname>
          </string-name>
          ,
          <article-title>Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions</article-title>
          ,
          <source>in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>6162</fpage>
          -
          <lpage>6166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dheram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saboowala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shetty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolcke</surname>
          </string-name>
          ,
          <article-title>Toward fairness in speech recognition: Discovery and mitigation of performance disparities</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>1268</fpage>
          -
          <lpage>1272</lpage>
          .
          doi:10.21437/Interspeech.2022-10816.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-E.</given-names>
            <surname>Veliche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Model-based approach for measuring the fairness in asr</article-title>
          , in: ICASSP, IEEE,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mazzia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gueudre</surname>
          </string-name>
          , E. Reale,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cumani</surname>
          </string-name>
          , L. De Alfaro,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amberti</surname>
          </string-name>
          ,
          <article-title>Leveraging confidence models for identifying challenging data subgroups in speech models</article-title>
          ,
          <source>in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>134</fpage>
          -
          <lpage>138</lpage>
          . doi:10.1109/ICASSPW62465.2024.10626001.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Halpern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kudina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Scharenborg</surname>
          </string-name>
          ,
          <article-title>Towards inclusive automatic speech recognition</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>84</volume>
          (
          <year>2024</year>
          )
          <fpage>101567</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <article-title>Houston we have a divergence: A subgroup performance analysis of asr models</article-title>
          ,
          <source>in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>812</fpage>
          -
          <lpage>813</lpage>
          .
          doi:10.1109/ICASSPW62465.2024.10626156.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.</given-names>
            <surname>Niebuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Michaud</surname>
          </string-name>
          ,
          <article-title>Speech data acquisition: the underestimated challenge</article-title>
          ,
          <source>KALIPHO - Kieler Arbeiten zur Linguistik und Phonetik</source>
          <volume>3</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Halpern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Scharenborg</surname>
          </string-name>
          ,
          <article-title>Mitigating bias against non-native accents</article-title>
          ,
          <source>in: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH</source>
          , volume
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>3168</fpage>
          -
          <lpage>3172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>I.-E.</given-names>
            <surname>Veliche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Improving fairness and robustness in end-to-end speech recognition through unsupervised clustering</article-title>
          ,
          <source>in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , G. Sun,
          <string-name>
            <given-names>R.</given-names>
            <surname>Langman</surname>
          </string-name>
          , E. Han,
          <string-name>
            <given-names>J.</given-names>
            <surname>Droppo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolcke</surname>
          </string-name>
          ,
          <article-title>Improving fairness in speaker verification via group-adapted fusion network</article-title>
          , in: ICASSP, IEEE,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mehrabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morstatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galstyan</surname>
          </string-name>
          ,
          <article-title>A survey on bias and fairness in machine learning</article-title>
          ,
          <source>ACM computing surveys (CSUR) 54</source>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          , E. Pastor, G. Attanasio, L. de Alfaro, E. Baralis,
          <article-title>Prioritizing data acquisition for end-to-end speech model improvement</article-title>
          ,
          <source>in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/ICASSP48485.2024.10446326.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          , L. de Alfaro, E. Baralis,
          <article-title>Mitigating subgroup disparities in speech models: A divergence-aware dual strategy</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech and Language Processing</source>
          <volume>33</volume>
          (
          <year>2025</year>
          )
          <fpage>883</fpage>
          -
          <lpage>895</lpage>
          .
          doi:10.1109/TASLPRO.2025.3539429.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>A contrastive learning approach to mitigate bias in speech models</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>827</fpage>
          -
          <lpage>831</lpage>
          .
          doi:10.21437/Interspeech.2024-1219.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lugosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ravanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ignoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Speech model pre-training for end-to-end spoken language understanding</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2019</year>
          , 20th Annual Conference of the International Speech Communication Association,
          <year>2019</year>
          , pp.
          <fpage>814</fpage>
          -
          <lpage>818</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. La</given-names>
            <surname>Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Colomba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          , E. Baralis,
          <article-title>ITALIC: An Italian Intent Classification Dataset</article-title>
          ,
          <source>in: Proc. INTERSPEECH</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>2153</fpage>
          -
          <lpage>2157</lpage>
          . doi:10.21437/Interspeech.2023-1980.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          , E. Pastor,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mazzia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gueudre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          , L. de Alfaro, E. Baralis,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amberti</surname>
          </string-name>
          ,
          <article-title>Exploring subgroup performance in end-to-end speech models</article-title>
          ,
          <source>in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/ICASSP49357.2023.10095284.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          , E. Pastor,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mazzia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gueudre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cumani</surname>
          </string-name>
          , L. de Alfaro, E. Baralis,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amberti</surname>
          </string-name>
          ,
          <article-title>Towards comprehensive subgroup performance analysis in speech models</article-title>
          ,
          <source>IEEE/ACM Transactions on Audio, Speech, and Language Processing</source>
          <volume>32</volume>
          (
          <year>2024</year>
          )
          <fpage>1468</fpage>
          -
          <lpage>1480</lpage>
          .
          doi:10.1109/TASLP.2024.3363447.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          , E. Baralis,
          <article-title>Assessing speech model performance: A subgroup perspective</article-title>
          ,
          <source>in: SEBD 2024: 32nd Symposium on Advanced Database System</source>
          , volume
          <volume>3741</volume>
          , CEUR Workshop Proceedings,
          <year>2024</year>
          , pp.
          <fpage>101</fpage>
          -
          <lpage>111</lpage>
          . URL: https://ceur-ws.org/Vol-3741/paper64.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          , E. Baralis,
          <article-title>Explaining speech classification models via word-level audio segments and paralinguistic features</article-title>
          ,
          <source>in: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>2221</fpage>
          -
          <lpage>2238</lpage>
          . URL: https://aclanthology.org/2024.eacl-long.136/.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          , L. de Alfaro, E. Baralis,
          <article-title>Looking for trouble: Analyzing classifier behavior via pattern divergence</article-title>
          ,
          <source>in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD '21</source>
          , ACM
          ,
          <year>2021</year>
          , pp.
          <fpage>1400</fpage>
          -
          <lpage>1412</lpage>
          . doi:10.1145/3448016.3457284.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gavgavian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          , L. de Alfaro,
          <article-title>How divergent is your data?</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>2835</fpage>
          -
          <lpage>2838</lpage>
          . URL: https://doi.org/10.14778/3476311.3476357. doi:10.14778/3476311.3476357.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          , L. de Alfaro,
          <article-title>A hierarchical approach to anomalous subgroup discovery</article-title>
          ,
          <source>in: 2023 IEEE 39th international conference on data engineering (ICDE)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>2647</fpage>
          -
          <lpage>2659</lpage>
          . doi:10.1109/ICDE55515.2023.00203.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>