<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEBD</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alkis Koudounas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eliana Pastor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Baralis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>32</volume>
      <fpage>23</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>Spoken language understanding (SLU) models are commonly evaluated based on overall performance or predefined subgroups, often overlooking the insights that a more comprehensive subgroup analysis can provide. A thorough analysis at the subgroup level can reveal how speech system performance varies across different subgroups. Yet, identifying interpretable subgroups in raw speech data poses inherent challenges.</p>
      </abstract>
      <kwd-group>
        <kwd>Subgroup identification</kwd>
        <kwd>Model bias analysis</kwd>
        <kwd>Bias mitigation</kwd>
        <kwd>Speech representation</kwd>
        <kwd>E2E-SLU models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Intelligent systems with speech recognition, transcription, and comprehension capabilities
are increasingly common across various domains, including virtual assistants [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], customer service [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], and healthcare [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. However, current evaluation paradigms for these systems
predominantly focus on aggregate performance metrics, overlooking potential disparities across
different groups [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ]. Furthermore, the proliferation of large pre-trained neural models using
self-supervised learning poses challenges for interpretability and identification of performance
disparities through conventional methodologies [10,
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. These issues highlight the need for a
comprehensive evaluation framework that captures subgroup-level effects to enable responsible
assessment of speech technologies, identifying and mitigating unintended harms.
      </p>
      <p>
        Recent literature has highlighted issues of model bias and unequal treatment across data
subgroups [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15 ref16 ref17 ref18 ref19 ref20 ref21 ref22 ref23">12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23</xref>
        ]. A data subgroup refers to a subset of
instances demonstrating similar characteristics within the latent space or common attribute
values (e.g., utterances spoken by female speakers). Previous approaches have typically focused
on predefined subgroups based on protected attributes or features of interest known a priori.
Specifically, these works targeted identifying bias within specific demographic traits, such
as skin tone [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], ethnicity [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], or combinations of metadata, such as demographics and
geolocation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], as well as gender, age, and accents [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or gender, age, and skin tone [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
However, such categorizations often necessitate human expertise and preclude the exploration
of unanticipated yet significant subgroups.
      </p>
      <p>
        In this work, we propose an automated method for identifying critical subgroups to address
these limitations. Unlike existing clustering-based speaker embedding techniques [
        <xref ref-type="bibr" rid="ref15 ref18">15, 18</xref>
        ], our
approach facilitates intersectional analysis, enabling us to explore the combined impacts of
multiple attributes. Speech data frequently includes additional metadata about the speaker (e.g.,
the gender) or task (e.g., the emotion associated with a sentence). Other features, such as speaking
rate, signal-to-noise ratio, and number of words, can be extracted from the audio or transcripts.
The latter are essential for capturing subtle nuances that could significantly affect model
performance. By combining such metadata values, we can identify interpretable data subgroups.
      </p>
      <p>
        Research questions. This study investigates bias in speech model performance across data
subgroups, mainly focusing on spoken language understanding (SLU). We automatically identify
combinations of metadata values that exhibit the highest: (i) intra-model performance gaps,
indicating significant performance differences between the overall dataset and specific data
subgroups, and (ii) cross-model performance gaps, signifying notable differences in subgroup
performance among different models. Our approach enables the identification of data subgroups
where a model exhibits lower performance compared to the overall behavior. We leverage
this interpretable identification of critical subgroups for a targeted data acquisition strategy to
enhance performance and mitigate model biases. Therefore, this work addresses the following
research questions (RQs): (RQ1) “How can we automatically identify and characterize the most
critical subgroups for an SLU model?”, (RQ2) “How does model size or architecture impact
subgroup performance?”, and (RQ3) “How does adopting a subgroup-guided data acquisition
strategy influence the overall model and subgroup performance compared to an indiscriminate
approach?”.
      </p>
      <p>Our approach. We introduce a novel task-, model-, and dataset-agnostic methodology for
automating the characterization and comparison of data subgroups induced by metadata attributes.
We identify all “frequent subgroups,” i.e., those exceeding a certain support threshold (e.g., at
least 0.1% of the dataset), that exhibit maximal disparities in intra- and cross-model performance.
We provide end-users with interpretable representations of such critical subgroups within a
given speech task and model and further use this information to mitigate model biases.</p>
      <p>The primary contributions of this work are: (i) a novel framework for analyzing SLU models
by identifying subgroups exhibiting large performance gaps; (ii) insights into the effects of
model size at the subgroup level; and (iii) a subgroup-guided targeted data acquisition approach
to enhance model performance both overall and across subgroups.</p>
      <p>
        We conduct comprehensive experiments across three speech tasks (Automatic Speech
Recognition (ASR), Intent Classification (IC), Emotion Recognition (ER)), three datasets
(LibriSpeech [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], FSC [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and IEMOCAP [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]), and the transformer-based speech model
wav2vec 2.0 [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Our experimental results demonstrate that our subgroup-level analysis reveals
distinctive performance patterns in data subpopulations. We further show that our
subgroup-guided acquisition approach consistently improves performance both overall and on subgroups
compared to an indiscriminate strategy, even when acquiring a subset of the data.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Methodology</title>
      <p>Our approach examines model performance at the subgroup level, where a subgroup is defined
as a subset of the data characterized by specific metadata values, and denoted as itemset. This
metadata covers mixed factors, including speaker traits (e.g., gender, age), speech features (e.g.,
speaking rate, number of pauses), and task-specific attributes (e.g., intents, labels). For instance,
the subgroup {gender = male, age ∈ [41-65]} signifies utterances from male speakers aged 41 to 65.</p>
      <p>Our analysis of subgroup behavior leverages two key concepts: intra-model divergence and
cross-model performance gap. The former indicates the disparity in model performance between
a subgroup and the entire dataset, revealing subgroups associated with performance variations,
be it below-average, above-average, or equivalent. We will also leverage this aspect to guide the
data acquisition strategy. Conversely, the latter quantifies the performance differences between
two models on the same subgroup, facilitating comparative assessments at the subgroup level.</p>
      <sec id="sec-3-1">
        <title>2.1. Itemsets through interpretable metadata</title>
        <p>We analyze speech model behavior by slicing data into interpretable subgroups. We define
interpretable metadata as attributes understandable by humans, e.g., speaker age or gender or
utterance noise level. For instance, “old men in noisy scenarios” is an interpretable subgroup.</p>
        <p>Metadata Description. Identifying interpretable subgroups in raw speech data poses intrinsic
challenges. To overcome this issue, we enrich speech data with interpretable metadata from
various domains, providing a human-understandable description of utterances. These metadata can be
inherent to the dataset or derived from utterances/transcriptions. Examples of such metadata
attributes include: (i) speaker demographics, like gender or age, (ii) task-specific features, like
intent or emotion associated with an utterance, (iii) recording conditions, such as environment
type and noise level, and (iv) speech features, such as speaking rate and duration of silences.</p>
        <p>Items and Itemsets. Let D represent our dataset and A denote its metadata attribute set. An
item represents an attribute equality a = v, where a is an attribute in A, and v is its value. We
only focus on discretized attributes, thus continuous-valued attributes are discretized before
applying our techniques. Examples of items include gender = male and age ∈ [41-65], if
gender and age are attributes. A subgroup corresponding to an item denotes the dataset portion
satisfying it. We ensure that subgroups form a dataset partition for each attribute. For example,
the age ranges must not overlap within the age attribute, and collectively, they must cover all
potential age ranges.</p>
        <p>
          Items facilitate the selection of data subsets based on single attributes, while itemsets allow
slicing across multiple attributes. An itemset comprises zero or more items, each including
a different attribute. For instance, an itemset like {gender = female, age ∈ [22-40]} defines
a subgroup based on the gender and age attributes. We define data subgroups via itemsets,
enabling an interpretable subgroup definition. The support of an itemset denotes the fraction of
the dataset it covers. For instance, an itemset with support 0.02 represents 2% of the dataset.
The empty itemset (∅) corresponds to the entire dataset and has a support of 1. An itemset is
frequent if its support exceeds a minimum support threshold.
        </p>
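        <p>The notions of item, itemset, and support can be illustrated with a minimal Python sketch. The toy utterances, attribute names, and helper functions below are hypothetical and only mirror the definitions above; they are not the implementation used in this work. Following the "at least 0.1% of the dataset" example, the frequency check treats the threshold as inclusive.</p>
        <preformat>
```python
# Hypothetical toy data: each utterance is a dict of discretized metadata.
utterances = [
    {"gender": "female", "age": "22-40", "speaking_rate": "high"},
    {"gender": "female", "age": "41-65", "speaking_rate": "low"},
    {"gender": "male", "age": "22-40", "speaking_rate": "high"},
    {"gender": "male", "age": "22-40", "speaking_rate": "low"},
]


def covers(itemset, utterance):
    """True if the utterance satisfies every attribute = value item."""
    return all(utterance.get(attr) == value for attr, value in itemset.items())


def support(itemset, data):
    """Fraction of the dataset covered by the itemset (empty itemset gives 1.0)."""
    return sum(covers(itemset, u) for u in data) / len(data)


def is_frequent(itemset, data, min_support):
    """An itemset is frequent when its support reaches the minimum threshold."""
    return support(itemset, data) >= min_support


print(support({}, utterances))
print(support({"gender": "male", "age": "22-40"}, utterances))
```
        </preformat>
        <p>Representing an itemset as a dictionary enforces that each attribute appears at most once, which matches the requirement that every item in an itemset refers to a different attribute.</p>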
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Intra and cross-model performance gaps</title>
        <p>
          We aim to identify subgroups exhibiting performance disparities compared to the overall dataset.
We rely on DivExplorer [
          <xref ref-type="bibr" rid="ref22 ref28">22, 28</xref>
          ] to extract all frequent itemsets above a specified support
threshold. The number of possible subgroups grows exponentially with the number of attributes, yet many
candidate itemsets have minimal or zero support, making them less relevant for subgroup performance
analysis. Performance statistics for subgroups with low support may also suffer from statistical
fluctuations. Therefore, to ensure operational significance, we only focus on the subgroups
surpassing a given threshold (e.g., comprising at least 0.1% of the dataset), called frequent
itemsets, which tend to be more limited in number.
        </p>
        <p>
          We employ the concept of subgroup divergence (i.e., intra-model performance gap) as
introduced in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. It quantifies the difference in performance between a subgroup and the entire
dataset. Let f represent a generic statistic for a downstream SLU task. For a model M and a
subgroup (i.e., itemset) I, f(I, M) denotes the average statistic value (e.g., accuracy, error rate) of
the model on the subgroup. We define the divergence of itemset I for model M as the difference
between the model performance over I and the performance over the entire dataset:
          <disp-formula>
            <label>(1)</label>
            <tex-math>\Delta_f(I, M) = f(I, M) - f(\emptyset, M)</tex-math>
          </disp-formula>
          A higher divergence (in absolute terms) indicates a more significant variation in subgroup
performance compared to the overall dataset.
        </p>
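        <p>As an illustration of Equation (1), the following minimal sketch computes the divergence of a subgroup when the statistic is accuracy. The records and function names are hypothetical; the sketch assumes per-utterance 0/1 correctness outcomes for a single model and an itemset that covers at least one utterance.</p>
        <preformat>
```python
from statistics import mean

# Hypothetical per-utterance records: metadata plus a 0/1 correctness outcome.
records = [
    ({"gender": "male", "age": "22-40"}, 0),
    ({"gender": "male", "age": "22-40"}, 1),
    ({"gender": "male", "age": "41-65"}, 1),
    ({"gender": "female", "age": "22-40"}, 1),
    ({"gender": "female", "age": "41-65"}, 1),
]


def statistic(itemset, data):
    """Average outcome f(I, M) over the utterances covered by the itemset."""
    covered = [outcome for meta, outcome in data
               if all(meta.get(a) == v for a, v in itemset.items())]
    return mean(covered)


def divergence(itemset, data):
    """Divergence of the itemset: f(I, M) minus f(empty itemset, M)."""
    return statistic(itemset, data) - statistic({}, data)


# Accuracy is 0.5 on {male, 22-40} against 0.8 overall, so the value is negative.
print(divergence({"gender": "male", "age": "22-40"}, records))
```
        </preformat>
        <p>With accuracy as the statistic, a negative value flags a subgroup on which the model performs below its overall level, while the empty itemset has zero divergence by construction.</p>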
        <p>Assessing performance discrepancies at the subgroup level is also crucial for model
comparison. We introduce the concept of cross-model performance gap, which measures the
performance difference between two models on a specific subgroup. This gap can be used to
compare different models, characterized by different size, architecture, or pre-training objective.
Specifically, given two models M<sub>1</sub> and M<sub>2</sub>, the performance gap from model M<sub>1</sub> to model M<sub>2</sub> for
the itemset I is defined as the change in performance on I obtained by replacing M<sub>1</sub> with M<sub>2</sub>:
          <disp-formula>
            <label>(2)</label>
            <tex-math>\mathrm{gap}_f(I, M_1, M_2) = f(I, M_2) - f(I, M_1)</tex-math>
          </disp-formula>
          The definitions of intra- and cross-model gaps apply to generic SLU models for any task,
enabling assessment of subgroup performance for a given dataset annotated via metadata.
This methodology thus remains task-, model-, and dataset-agnostic. To evaluate the statistical
significance, we employ Welch’s t-test to test the hypothesis that the means of the statistic f
are equal for (i) the subgroup I and the entire population D, and (ii) the two models M<sub>1</sub> and M<sub>2</sub>.</p>
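        <p>A minimal sketch of the cross-model gap of Equation (2) and of the Welch t statistic follows. The outcome lists are hypothetical, the helper names are illustrative, and the statistic is computed by hand with the Python standard library rather than with any specific statistical package; it assumes both samples have at least two elements and non-zero pooled variance.</p>
        <preformat>
```python
from math import sqrt
from statistics import mean, variance


def performance_gap(outcomes_m1, outcomes_m2):
    """Gap from model M1 to model M2 on one subgroup: f(I, M2) - f(I, M1)."""
    return mean(outcomes_m2) - mean(outcomes_m1)


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical 0/1 correctness of two models on the same subgroup utterances.
m1 = [0, 0, 1, 0, 1, 0, 0, 1]
m2 = [1, 1, 1, 0, 1, 1, 1, 1]

gap = performance_gap(m1, m2)
t_stat = welch_t(m2, m1)
print(gap, t_stat)
```
        </preformat>
        <p>The same welch_t helper applies to the intra-model case by passing the outcomes of the subgroup and those of the entire dataset as the two samples.</p>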
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Local contribution through Shapley values</title>
        <p>After identifying itemsets exhibiting significant divergence or gap, we seek to understand the
contribution of each item to these metrics. We employ game theory concepts to provide local
insights into subgroup behavior.</p>
        <p>
          The local contribution quantifies the role of each item within an itemset in influencing its
divergence or gap, using Shapley values. In this framework, items within an itemset are akin to
team members, and the divergence or gap metric represents the team’s total score. Specifically,
for an item i within itemset I and a metric of interest m(I), i.e., divergence or gap, the Shapley
value φ(i, I) measures how much i contributes to m(I), with Σ<sub>i∈I</sub> φ(i, I) = m(I). More details on
this local as well as the global contribution can be found in [
          <xref ref-type="bibr" rid="ref17 ref22">17, 22</xref>
          , 29].
        </p>
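        <p>The local contribution can be sketched with the standard exact Shapley formula, which averages the marginal effect of adding an item over all subsets of the remaining items. The code below is a generic textbook computation under the assumption that the metric is available for every subset of the itemset; the toy divergence values and names are hypothetical.</p>
        <preformat>
```python
from itertools import combinations
from math import factorial


def shapley_values(itemset, metric):
    """Exact Shapley value of each item; metric maps a frozenset of items to a number."""
    items = list(itemset)
    n = len(items)
    values = {}
    for item in items:
        others = [x for x in items if x != item]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (metric(coalition | {item}) - metric(coalition))
        values[item] = total
    return values


# Hypothetical divergences for every subset of a two-item itemset.
male = ("gender", "male")
fast = ("speaking_rate", "high")
toy_divergence = {
    frozenset(): 0.0,
    frozenset({male}): -0.10,
    frozenset({fast}): -0.04,
    frozenset({male, fast}): -0.30,
}

phi = shapley_values({male, fast}, toy_divergence.__getitem__)
print(phi)
```
        </preformat>
        <p>Because the empty itemset has zero divergence, the item contributions sum to the divergence of the full itemset, in line with the property stated above. The exact computation enumerates all subsets and is therefore practical only for short itemsets.</p>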
      </sec>
      <sec id="sec-3-4">
        <title>2.4. Subgroup-guided Data Acquisition</title>
        <p>After evaluating the performance of a given speech model, our objective is to improve it both
overall and across different subpopulations. We identify the critical subgroups (i.e., itemsets)
characterized by negative divergence, representing challenging scenarios for the model. We
implement a pruning procedure to eliminate redundancy among such subgroups, following [22].
Specifically, when encountering two subgroups, I and I′, where I′ includes I along with an
additional metadata condition, we retain only the more general subgroup, I, if the absolute
difference in their divergences is below a predefined threshold. This approach is based on the
rationale that I adequately captures the divergence exhibited by I′, as the extra metadata in I′
only marginally affects the divergence. Pruning the critical subgroups yields a more concise
representation, forcing the data acquisition process to focus on the most pertinent attributes.</p>
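        <p>The redundancy pruning described above can be sketched as follows. This is a simplified illustration, not the procedure of [22] itself: it assumes a dictionary from itemsets (as frozensets of items) to divergences, and the function name and the threshold name epsilon are hypothetical.</p>
        <preformat>
```python
def prune_redundant(divergences, epsilon):
    """Drop an itemset when a more general one has a nearly identical divergence.

    divergences maps a frozenset of items to its divergence; an itemset is
    redundant if removing one of its items yields an itemset in the map whose
    divergence differs by less than epsilon in absolute value.
    """
    kept = {}
    for itemset, div in divergences.items():
        redundant = False
        for item in itemset:
            parent = itemset - {item}
            if parent in divergences and epsilon > abs(div - divergences[parent]):
                redundant = True
                break
        if not redundant:
            kept[itemset] = div
    return kept


# Hypothetical critical subgroups with their divergences.
male = ("gender", "male")
fast = ("speaking_rate", "high")
critical = {
    frozenset({male}): -0.20,
    frozenset({male, fast}): -0.21,
    frozenset({fast}): -0.02,
}

pruned = prune_redundant(critical, epsilon=0.05)
print(sorted(len(itemset) for itemset in pruned))
```
        </preformat>
        <p>In this toy example, the two-item subgroup is discarded because adding the speaking-rate condition to {gender = male} changes the divergence by only 0.01, so the more general subgroup already explains the behavior.</p>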
        <p>We prioritize data acquisition efforts on the top-K critical subgroups with the highest negative
divergence in accuracy and retrain the model with additional data belonging to these subgroups.
The parameter K allows us to control the data acquisition process and observe its impact on
model performance overall and within subgroups. Further details can be found in [30].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Results and Discussion</title>
      <p>
        We assess the effectiveness of our methodology by (i) analyzing its ability to identify sources of
errors, (ii) examining the influence of factors such as model size, architecture, and pre-training
objective on subgroup-level performance, and (iii) evaluating the effect of using subgroup-level
information to guide a data acquisition strategy in enhancing model performance and mitigating
biases. Please refer to [
        <xref ref-type="bibr" rid="ref17">17</xref>
        , 29, 30] for a complete set of the results.
      </p>
      <p>Figure 1. (a) w2v2-b. Δacc = -31.22%; (b) w2v2-b. Δacc = -0.65%.</p>
      <p>Metadata. We enrich the datasets with various metadata categories. We first incorporate
demographic attributes of speakers where available, including gender, age, and country. We also
consider unique metadata pertinent to each task if available, i.e., intent for FSC, and emotion
and arousal labels for IEMOCAP. We finally extract from the raw signal utterance/transcription
attributes such as silence duration (total and trimmed), word count, speaking rate (words per
second), signal-to-noise ratio, and spectral flatness. The trimmed duration excludes initial and
final pauses, while the total silence duration includes the entire utterance without any pauses.
As the frequency and duration of intermediate pauses had little effect on model performance
across all datasets, except for LibriSpeech, we chose to retain them for this dataset only.</p>
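      <p>Continuous attributes such as the speaking rate must be discretized before itemset mining, as noted in Section 2.1. A minimal sketch of one possible equal-frequency discretization into three labeled ranges is shown below. It is rank-based for brevity, so tied values may fall into different bins; the function name and the sample values are hypothetical.</p>
      <preformat>
```python
def equal_frequency_bins(values, labels=("low", "medium", "high")):
    """Label each value so that bins hold roughly the same number of items."""
    # Indices of the values sorted in ascending order.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    for rank, idx in enumerate(order):
        # Map the rank of each value to one of the equally sized bins.
        result[idx] = labels[rank * len(labels) // len(values)]
    return result


# Hypothetical speaking rates in words per second.
rates = [2.1, 3.4, 1.8, 4.0, 2.9, 3.1]
binned = equal_frequency_bins(rates)
print(binned)
```
      </preformat>
      <p>Each resulting label then acts as an item value, e.g., speaking_rate = high, and the three ranges partition the attribute as required for itemset definition.</p>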
      <p>
        Continuous attributes like speaking rate or utterance duration require discretization into
fixed ranges. Using frequency-based discretization, we thus discretize this metadata into three
ranges labeled as “low,” “medium,” and “high.”
      </p>
      <p>
        RQ1: Model understanding at the subgroup level. We focus on the performance of the
wav2vec 2.0 base model [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] across all datasets. Table 1 shows the subgroups with the largest
negative and positive divergence, indicating critical scenarios for each dataset. The divergence
values associated with these subgroups are statistically significant (with t &gt; 2, as per Siegel’s
rule of thumb [31]). For FSC and IEMOCAP, we evaluate model accuracy across various data
subgroups, where higher accuracy indicates better performance. A negative divergence signifies
accuracy below the average, while a positive divergence indicates above-average accuracy.
      </p>
      <p>(a) FSC. Performance improvement for 63.75% of subgroups, decrease for 31.89% of them.</p>
      <p>(c) IEMOCAP. Performance
improvement for 11.21% of subgroups,
decrease for 85.85% of them.</p>
      <p>(d) LibriSpeech. Performance
improvement for 99.25% of subgroups,
decrease for 0.75% of them.</p>
      <p>For instance, for FSC, the wav2vec 2.0 base model exhibits its poorest performance for
the subgroup characterized by speakers aged 22-40, male gender, no specified location, high
speaking rate, and high total silence (Table 1, first block), with a divergence of Δacc = −31.2%.
Analyzing sensitive attributes like gender is crucial, as evidenced by the significant impact
observed. Specifically, female speakers achieve higher accuracy within the identified subgroup
than males when all other metadata values remain constant. This trend is further confirmed by
the Shapley values illustrated in Figure 1(a)-(b), where the male gender is associated with lower
accuracy. In contrast, the female gender exhibits a positive impact.</p>
      <p>Conversely, the analysis also reveals subgroups with above-average performance. For example,
the model correctly predicts all utterances associated with the subgroup of speakers aged 22-40
with a low speaking rate, long duration, and “washroom” as the target location.</p>
      <p>Similar assessments can be made for other datasets. For LibriSpeech, we study the Word Error
Rate (WER); a positive WER divergence (i.e., higher than overall) signifies lower performance.</p>
      <p>RQ2: Model comparison at the subgroup level. We compare the performance of different models
at the overall and subgroup levels, detecting which subpopulations benefit the most from model
changes. We analyze here how increasing the size of such models affects their performance at
both levels. For changes in architecture and pre-training objective, please refer to [29].</p>
      <p>Larger models tend to be more accurate overall, and [32] claims that larger models are also
fairer. However, performance for specific subgroups is complex and depends on the dataset/task.
We specifically examine how scaling up the wav2vec 2.0 model influences performance across
datasets, with Table 2 summarizing the performance gap in terms of the highest performance
improvement and decrease, and Figure 2 illustrating the distribution of this gap across subgroups.</p>
      <p>While a larger model size improves both overall and subgroup WER in the LibriSpeech
dataset, it diminishes performance at both levels for IEMOCAP. We further reveal varying
subgroup impacts on FSC, indicating that certain groups benefit more from a larger model size
than others. Nonetheless, performance decreases for more than 30% of the explored subgroups when
scaling up the size. These findings emphasize the importance of analyzing subgroup-specific
outcomes when evaluating the effectiveness of larger models.</p>
      <p>RQ3: Subgroup-guided data acquisition. We use the identified critical subgroups to guide a
targeted data acquisition to improve model performance and mitigate its biases. We discuss the
results for FSC. Further outcomes on ITALIC [33], an IC dataset in Italian, can be found in [30].</p>
      <p>We partition our dataset into training, held-out, validation, and test sets, employing an 80-20
split for training and held-out data, respectively, while retaining the original validation and test
sets. We first identify critical subgroups using the validation set, then acquire data samples from
the held-out set, and retrain the model with these samples. Evaluation on the test set (Table 3)
reveals consistently superior performance across overall and subgroup-level metrics, compared
to baseline methods such as indiscriminate random and clustering-guided acquisition [15],
where samples are selected from the acoustic embedding clusters with subpar performance.</p>
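      <p>The acquisition protocol above can be sketched in a few lines. The sketch assumes that each sample carries a metadata dictionary and that subgroup divergences have already been measured on the validation set; all function names, the seed, and the toy data are hypothetical, and model retraining is left out.</p>
      <preformat>
```python
import random


def split_train_heldout(samples, heldout_fraction=0.2, seed=0):
    """Shuffle and split the original training data into train and held-out parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_heldout = int(len(shuffled) * heldout_fraction)
    return shuffled[n_heldout:], shuffled[:n_heldout]


def top_k_critical(subgroups, k):
    """Return the k itemsets with the most negative divergence."""
    negative = [(div, idx) for idx, (_, div) in enumerate(subgroups) if 0 > div]
    return [subgroups[idx][0] for _, idx in sorted(negative)[:k]]


def acquire(heldout, critical_itemsets):
    """Select held-out samples whose metadata matches any critical subgroup."""
    def covers(itemset, meta):
        return all(meta.get(a) == v for a, v in itemset.items())
    return [s for s in heldout
            if any(covers(itemset, s["meta"]) for itemset in critical_itemsets)]


# Hypothetical samples and validation-set divergences.
samples = [{"id": i, "meta": {"gender": "male" if i % 2 else "female"}}
           for i in range(10)]
validation_divergences = [({"gender": "male"}, -0.3), ({"gender": "female"}, 0.1)]

train, heldout = split_train_heldout(samples)
critical = top_k_critical(validation_divergences, k=1)
new_samples = acquire(heldout, critical)
print(len(train), len(heldout), len(new_samples))
```
      </preformat>
      <p>The selected samples would then be added to the training split before retraining, while the validation and test sets stay untouched so that the comparison with random acquisition remains fair.</p>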
      <p>Selecting only the top-2 critical subgroups leads to significant performance improvements
at both overall and subgroup levels. Specifically, it achieves the best F1 score and accuracy,
as well as the lowest maximum negative divergence and the lowest average
divergence for the top-10, top-20, and top-50 subgroups with the highest
negative divergence. While performance slightly lowers when increasing the number K of
critical subgroups, it remains significantly better than the original model performance and the
one obtained when adding all available data. The lowest average absolute divergence is found
with K = 5 critical subgroups, indicating reduced performance disparities across subgroups.</p>
      <p>Overall, these results underscore the effectiveness of targeted data acquisition in mitigating
performance disparities and improving model robustness across diverse subgroups.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>This study presents a novel methodology for evaluating spoken language understanding (SLU)
system performance by analyzing model bias at the subgroup level. We enrich raw speech data
by extracting metadata that include speaker demographics, task- and signal-related features
to allow the definition of human-interpretable subgroups. By automating the detection of
performance disparities within subgroups, our approach enhances error analysis, facilitates
model comparison, and mitigates biases, thus improving overall performance. This versatile
methodology demonstrates effectiveness across various speech tasks, datasets, and model sizes,
offering insights into which subgroups benefit most from system enhancements and contributing
to the development of more inclusive and effective speech technologies.</p>
      <p>Endow. 14 (2021) 2835–2838. doi:10.14778/3476311.3476357.
[29] A. Koudounas, E. Pastor, G. Attanasio, V. Mazzia, M. Giollo, T. Gueudre, E. Reale, L. Cagliero,
S. Cumani, L. de Alfaro, E. Baralis, D. Amberti, Towards comprehensive subgroup
performance analysis in speech models, IEEE/ACM Transactions on Audio, Speech, and
Language Processing 32 (2024) 1468–1480. doi:10.1109/TASLP.2024.3363447.
[30] A. Koudounas, E. Pastor, G. Attanasio, L. de Alfaro, E. Baralis, Prioritizing data acquisition
for end-to-end speech model improvement, in: ICASSP 2024 - 2024 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 7000–7004.
doi:10.1109/ICASSP48485.2024.10446326.
[31] A. F. Siegel, Chapter 10 - hypothesis testing: Deciding between reality and coincidence, in:
A. F. Siegel (Ed.), Practical Business Statistics (Sixth Edition), sixth edition ed., Springer
Science &amp; Business Media, 2012, pp. 249–287. doi:10.1016/B978-0-12-385208-3.00010-9.
[32] Y. Sheng, J. Yang, Y. Wu, K. Mao, Y. Shi, J. Hu, W. Jiang, L. Yang, The larger the fairer? small
neural networks can achieve fairness for edge devices, arXiv preprint arXiv:2202.11317
(2022).
[33] A. Koudounas, M. La Quatra, L. Vaiani, L. Colomba, G. Attanasio, E. Pastor, L. Cagliero,
E. Baralis, ITALIC: An Italian Intent Classification Dataset, in: Proc. INTERSPEECH 2023,
2023, pp. 2153–2157. doi:10.21437/Interspeech.2023-1980.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sarikaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Crook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Robichaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rochette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. Z.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>An overview of end-to-end language understanding and dialog management for personal digital assistants</article-title>
          ,
          <source>in: 2016 IEEE Spoken Language Technology Workshop (SLT)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>391</fpage>
          -
          <lpage>397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Terzopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Satratzemi</surname>
          </string-name>
          ,
          <article-title>Voice assistants and smart speakers in everyday life and in education</article-title>
          ,
          <source>Informatics in Education</source>
          <volume>19</volume>
          (
          <year>2020</year>
          )
          <fpage>473</fpage>
          -
          <lpage>490</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nuruzzaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. K.</given-names>
            <surname>Hussain</surname>
          </string-name>
          ,
          <article-title>A survey on chatbot implementation in customer service industry through deep neural networks</article-title>
          ,
          <source>in: 2018 IEEE 15th International Conference on e-Business Engineering (ICEBE)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Scheidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <article-title>Making a case for speech analytics to improve customer service quality: Vision, implementation, and evaluation</article-title>
          ,
          <source>International Journal of Information Management</source>
          <volume>45</volume>
          (
          <year>2019</year>
          )
          <fpage>223</fpage>
          -
          <lpage>232</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0268401217309441. doi:10.1016/j.ijinfomgt.2018.01.002.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Latif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Qayyum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Usama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Younis</surname>
          </string-name>
          ,
          <article-title>Speech technology for healthcare: Opportunities, challenges, and state of the art</article-title>
          ,
          <source>IEEE Reviews in Biomedical Engineering</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>How much attention should we pay to mosquitoes?</article-title>
          ,
          <source>in: Proceedings of the 30th ACM International Conference on Multimedia, MM '22</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>7135</fpage>
          -
          <lpage>7139</lpage>
          . URL: https://doi.org/10.1145/3503161.3551594. doi:10.1145/3503161.3551594.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Turian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Steinmetz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Malloy</surname>
          </string-name>
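          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tzanetakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Velarde</surname>
          </string-name>
          ,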
          <string-name>
            <given-names>K.</given-names>
            <surname>McNally</surname>
          </string-name>
          , et al.,
          <article-title>HEAR: Holistic evaluation of audio representations</article-title>
          ,
          <source>in: NeurIPS 2021 Competitions and Demonstrations Track</source>
          , PMLR,
          <year>2022</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.-w.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-I. J.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lakhotia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tseng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-t.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>SUPERB: Speech Processing Universal PERformance Benchmark</article-title>
          ,
          <source>in: Proc. Interspeech 2021</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1194</fpage>
          -
          <lpage>1198</lpage>
          .
          doi:10.21437/Interspeech.2021-1775.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>La Quatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vaiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Siniscalchi</surname>
          </string-name>
          ,
          <article-title>Benchmarking representations for speech, music, and acoustic events</article-title>
          ,
          <source>in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          , et al.,
          <article-title>A survey of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.18223</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Inala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Galley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Rethinking interpretability in the era of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2402.01761</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koenecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nudell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Quartey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mengesha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Toups</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Rickford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <article-title>Racial disparities in automated speech recognition</article-title>
          ,
          <source>Proc. of the National Academy of Sciences</source>
          <volume>117</volume>
          (
          <year>2020</year>
          )
          <fpage>7684</fpage>
          -
          <lpage>7689</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kudina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Halpern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Scharenborg</surname>
          </string-name>
          ,
          <article-title>Quantifying bias in automatic speech recognition</article-title>
          ,
          <source>arXiv preprint arXiv:2103.15122</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Picheny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sarı</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chitkara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alvarado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hazirbas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Saraf</surname>
          </string-name>
          ,
          <article-title>Towards measuring fairness in speech recognition: Casual conversations dataset transcriptions</article-title>
          ,
          <source>in: ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>6162</fpage>
          -
          <lpage>6166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dheram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saboowala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shetty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolcke</surname>
          </string-name>
          ,
          <article-title>Toward fairness in speech recognition: Discovery and mitigation of performance disparities</article-title>
          ,
          <source>in: Proc. Interspeech 2022</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1268</fpage>
          -
          <lpage>1272</lpage>
          .
          doi:10.21437/Interspeech.2022-10816.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.-F.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Holliday</surname>
          </string-name>
          ,
          <article-title>Exploring Sources of Racial Bias in Automatic Speech Recognition through the Lens of Rhythmic Variation</article-title>
          ,
          <source>in: Proc. INTERSPEECH 2023</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1284</fpage>
          -
          <lpage>1288</lpage>
          . doi:10.21437/Interspeech.2023-159.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mazzia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giollo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gueudre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cagliero</surname>
          </string-name>
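          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>de Alfaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,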
          <string-name>
            <given-names>D.</given-names>
            <surname>Amberti</surname>
          </string-name>
          ,
          <article-title>Exploring subgroup performance in end-to-end speech models</article-title>
          ,
          <source>in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/ICASSP49357.2023.10095284.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.-E.</given-names>
            <surname>Veliche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Improving fairness and robustness in end-to-end speech recognition through unsupervised clustering</article-title>
          ,
          <source>in: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <article-title>Houston we have a divergence: A subgroup performance analysis of ASR models</article-title>
          ,
          <source>arXiv preprint arXiv:2404.07226</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Koudounas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giobergia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <article-title>Bad exoplanet! explaining degraded performance when reconstructing exoplanets atmospheric parameters</article-title>
          ,
          <source>in: NeurIPS 2023 AI for Science Workshop</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>N.</given-names>
            <surname>Shahbazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          ,
          <article-title>Representation bias in data: a survey on identification and resolution techniques</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
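          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>de Alfaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,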
          <article-title>Looking for trouble: Analyzing classifier behavior via pattern divergence</article-title>
          ,
          <source>in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD '21</source>
          ,
          ACM,
          <year>2021</year>
          , pp.
          <fpage>1400</fpage>
          -
          <lpage>1412</lpage>
          . doi:10.1145/3448016.3457284.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>de Alfaro</surname>
          </string-name>
          ,
          <article-title>A hierarchical approach to anomalous subgroup discovery</article-title>
          ,
          <source>in: 39th IEEE International Conference on Data Engineering, ICDE 2023</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>2647</fpage>
          -
          <lpage>2659</lpage>
          . doi:10.1109/ICDE55515.2023.00203.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>V.</given-names>
            <surname>Panayotov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khudanpur</surname>
          </string-name>
          ,
          <article-title>Librispeech: An ASR corpus based on public domain audio books</article-title>
          ,
          <source>in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>5206</fpage>
          -
          <lpage>5210</lpage>
          .
          doi:10.1109/ICASSP.2015.7178964.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lugosch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ravanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ignoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Speech model pre-training for end-to-end spoken language understanding</article-title>
          ,
          <source>in: Interspeech 2019, 20th Annual Conference of the International Speech Communication Association</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>814</fpage>
          -
          <lpage>818</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bulut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Kazemzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Provost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>IEMOCAP: Interactive emotional dyadic motion capture database</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>42</volume>
          (
          <year>2008</year>
          )
          <fpage>335</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>12449</fpage>
          -
          <lpage>12460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gavgavian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Baralis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>de Alfaro</surname>
          </string-name>
          ,
          <article-title>How divergent is your data?</article-title>
          ,
          <source>Proc. VLDB</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>