Explainability for Speech Models: On the Challenges of Acoustic Feature Selection

Dennis Fucci1,2, Beatrice Savoldi2, Marco Gaido2, Matteo Negri2, Mauro Cettolo2 and Luisa Bentivogli2
1 University of Trento, Via Calepina, 14, 38122 Trento TN, Italy
2 Fondazione Bruno Kessler, Via Sommarive, 18, 38123 Trento TN, Italy

Abstract
Spurred by the demand for transparency and interpretability in Artificial Intelligence (AI), the field of eXplainable AI (XAI) has experienced significant growth, marked by both theoretical reflections and technical advancements. While various XAI techniques, especially feature attribution methods, have been extensively explored across diverse tasks, their adaptation to the speech modality is comparatively lagging behind. We argue that a key challenge in feature attribution for speech processing lies in identifying informative acoustic features. In this paper, we discuss the key challenges in selecting features for speech explanations. In light of existing research, we also highlight current gaps and propose future avenues to enhance the depth and informativeness of explanations for speech.

Keywords
Speech Models, Explainability, Feature Attribution

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
dfucci@fbk.eu (D. Fucci); bsavoldi@fbk.eu (B. Savoldi); mgaido@fbk.eu (M. Gaido); negri@fbk.eu (M. Negri); cettolo@fbk.eu (M. Cettolo); bentivo@fbk.eu (L. Bentivogli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

"Models are only as interpretable as their features." [1]

Spoken language—as perhaps our most natural form of interaction—is the foundational element of many technologies we interact with in our daily lives [2], from virtual assistants to voice dictation [3, 4, 5]. More recently, the emergence of highly capable speech foundation models [6, 7, 8, 9] has also facilitated and expanded the adoption of speech technologies on an unprecedented multilingual scale. In light of this proliferation, a need arises to prioritize transparency and interpretability, qualities already demanded in the growing landscape of Machine Learning (ML).

As a response, the field of eXplainable AI (XAI) has risen to prominence, with the aim of facilitating understanding of the rationale behind model decisions and fostering users' trust [10, 11, 12, 13]. XAI is also reinforced by the establishment of norms and legal frameworks, as seen in the European Union's General Data Protection Regulation, which enshrines the 'right to explanation', and the AI Act, which emphasizes transparency as a pivotal component of ML applications [14].

XAI encompasses various tasks and methods, such as identifying relevant model components for specific predictions, understanding the information processed by these components, and determining which input elements guide the model's predictions [15]. The latter task is the focus of feature attribution methods, which provide intuitive explanations by visualizing which input elements (e.g., pixels in an image or words in a sentence) have influenced the model's predictions. These methods assign a score to each input feature, quantifying its importance or contribution to the output: higher scores indicate greater importance of the corresponding input features for generating the output [16, 17, 18, 19]. They can help identify potential causes of errors and unexpected behaviors, as well as analyze the model's response to specific input properties. Overall, these explainability methods present the reasons why models make specific predictions by establishing a connection between input and output as a form of intuitive explanation for humans, thereby enhancing interpretability.1

1 Despite numerous efforts to differentiate the closely related concepts of explainability and interpretability, no consensus exists in the literature on their definitions [20]. In this paper, we adopt a perspective similar to that of Saeed and Omlin [21], where explainability refers to the process of extracting insights from a model's workings through specific techniques, while interpretability refers to the understanding process of those insights, crucial to make them actionable.

Over time, ongoing efforts have aimed to refine feature attribution techniques and provide more effective explanations [22, 23]. However, it is essential to recognize that the effectiveness of feature attribution explanations relies not only on the techniques themselves but also on the informativeness of the input features used as explanatory variables. If an explanation highlights unintelligible or poorly informative features, it does little to enhance the understanding of the model's behavior [1]. This can undermine key principles in XAI, such as accuracy—the property of correctly reflecting the factors that led the model to a specific decision, including all relevant information—and meaningfulness—the property of offering explanations that are comprehensible to the user [24].2

2 The properties of accuracy and meaningfulness can be associated with those of faithfulness and plausibility, respectively [25, 26].

In fields involving images or texts, feature representations are typically constrained to pixels and words, respectively. For speech, however, multiple input representations can be adopted, each emphasizing different acoustic aspects. Indeed, a sequence of speech elements does not only convey the meaning of what is said (like words in a text) but also bears a wealth of additional information useful for both human understanding and automatic processing (e.g., intonation, loudness, speaking rate). Consequently, when employing feature attribution methods, the resulting explanations can vary significantly in shape and focus on more or less informative characteristics depending on the type of speech representation used. To date, research on feature attribution for speech is notably limited to a few applications—including classification [27, 28] and generative tasks [29, 30, 31, 32]—which offer a somewhat fragmented picture in the choice of speech representations, thus providing limited insight into the relation between the features considered and the explanations built upon them.

In light of the above, this paper reflects on the impact of the chosen acoustic features in explaining the rationale behind speech models, aiming at a deeper understanding of the trade-offs associated with acoustic features. By first offering a gentle introduction to the rich and multidimensional nature of speech and its digital representation, we identify current gaps and potential avenues for effectively incorporating this multidimensionality into XAI for speech models. Our discussion focuses on two critical factors: i) the amount of information these features provide about the model's behavior, which influences the richness of the explanations, and ii) the level of detail of such information, which determines the granularity of the explanations. We also explore how these aspects impact both the accuracy and meaningfulness of the explanations, ultimately shaping their overall interpretability.

2. The Correlates of Speech

To gain deeper insight into the complexities of defining informative features in speech, we explore key characteristics of speech and their implications for modeling.

Speech is a multifaceted phenomenon. It is grounded in the materiality of sound to convey linguistic content (i.e., what is said), which is modulated by several paralinguistic cues (i.e., how it is said) entailing extensive variation—also for single individual speakers [33]. As such, it comprises several dimensions, which are hard to pin down individually, but collectively amount to what we intuitively and simply perceive as spoken language.

From a linguistic perspective, the spoken communication system consists of the combination of phonemes,3 which are regarded as the smallest meaningful units of sound [34, 35]. Physically, it involves the continuous flow of sounds shaped by the movements of our phonatory organs, transmitted as sound waves [36]. Perceptually, we process speech through three primary dimensions [37]: i) time, or the sequential occurrence of sounds;4 ii) intensity, corresponding to the energy level of the wave due to the strength of molecular vibration, which we perceive as loudness; iii) frequency, regarding the rate of vibrations produced by the vocal cords—interpreted as pitch—and whose modulation is responsible for shaping the type of speech sound.

3 Throughout the paper, we use the abstract category of phoneme to denote individual speech sounds. However, when discussing their actual realizations, it is more accurate to refer to them as phones [34].
4 E.g., the order of sounds between /pAt/ (pot) and /tAp/ (top) differentiates two words.

These three elements, known as acoustic correlates [38], are specific to both speakers and phonemes. For example, speakers possess unique characteristics, including pitch and speaking rate [33], and also exhibit high variability stemming from various sociodemographic factors such as gender, age, and dialect [39]. In these cases, the speech content needs to be disentangled from the variability of its delivery. Conversely, language sounds exhibit variability in duration—e.g., /i/ in ship and sheep—and are distinguished by specific frequency ranges [36]. The frequency dimension also plays a vital role in shaping suprasegmental aspects of speech—broader phenomena that span multiple segments—such as intonation, obtained by varying pitch [40]. Pitch, for instance, has a distinctive function in tonal languages, where it is used to distinguish lexical or grammatical meaning [41]. But even in non-tonal languages, these prosodic elements are indispensable for delivering different meanings and intents, as the reader can perceive by reading out loud two contrastive sentences such as "You got the joke right" and "You got the joke, right?", where pauses and prosody play pivotal roles.

All these factors add to the multidimensionality of speech, which feature engineering strives to encapsulate and which cannot be overlooked in the explanatory process.

3. Speech Representations

While various representations are used to encode speech in a digital format, three main types are commonly given as input to state-of-the-art speech models (for a review, see [42, 43]): waveforms, spectrograms, and mel-frequency cepstral coefficients (MFCCs), which are shown in Figure 1.

The waveform serves as the most fundamental representation of a signal, comprising sequences of samples (e.g., 16,000 per second), each indicating the amplitude of the signal at a specific point in time—essentially, the fluctuations in air pressure over time. This type of representation is leveraged by models like Wav2vec [6].
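As a concrete illustration of the sampling just described, the following minimal NumPy sketch (not tied to any specific toolkit, and using a synthetic tone rather than recorded speech) builds one second of a 440 Hz signal at a 16,000 Hz sampling rate; each array entry is the signal's amplitude at one instant:

```python
import numpy as np

# A one-second, 440 Hz tone "recorded" at 16,000 samples per second.
# Each entry of `waveform` is the signal's amplitude at one point in
# time, mirroring the fluctuations in air pressure described above.
sr = 16_000                    # sampling rate (samples per second)
t = np.arange(sr) / sr         # 16,000 time points spanning one second
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)

print(waveform.shape)          # (16000,): one amplitude value per sample
print(round(float(waveform.max()), 3))   # 0.5: the peak amplitude
```

Even this toy example makes the granularity issue tangible: a single second of audio already yields 16,000 individual input values.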
The spectrogram results from feature engineering operations that decompose the speech signal into its frequencies, presenting a 2D visualization of frequency distributions over time. These representations are commonly depicted as heatmaps, where color intensity corresponds to the energy of a specific frequency at a given moment. The time unit in spectrograms is a fixed-length window of a few milliseconds (e.g., 25), commonly referred to as a frame, within which a given number of waveform samples are encompassed. Notably, the articulation of sounds produces time-frequency patterns which are visible as darker regions [36]. Prominent examples of state-of-the-art models leveraging spectrograms are Whisper [9] and SeamlessM4T [44].

The MFCCs offer another 2D representation where each coefficient captures important details about how the frequency content of the signal changes over time. Like spectrograms, MFCCs offer information about both frequency and time, but in a more compact form. MFCCs are commonly used in the implementation of ASR models within popular toolkits like Kaldi5 [45] and Mozilla DeepSpeech6.

5 See https://kaldi-asr.org/doc/feat.html.
6 See https://deepspeech.readthedocs.io/en/master/DeepSpeech.html.

Figure 1: Schematic illustration of the primary speech representations used by state-of-the-art speech models for the utterance "This is a waveform". The features were computed using Librosa 0.10.1 [47].

Overall, though different in nature, these three types of representations are all effectively exploited by current speech models.7 For human understanding, however, they vary in terms of informativeness with respect to the acoustic correlates discussed in §2. Indeed, although both intensity and frequency are somewhat discernible in waveforms, qualitative distinctions of patterns specific to pitch or phoneme frequencies are rarely feasible [36]. Comparatively, spectrograms and MFCCs are richer and more descriptive, because they capture the multiple dimensions of time, frequency, and intensity in finer detail. Still, spectrograms are more conducive to phonetic analyses, given the established knowledge in analyzing frequency patterns over time within this representation [36]. In contrast, MFCCs are rarely used for phonetic analysis [46].

7 We are not aware of any recent study attributing higher system performance to the representation used.

Overall, while weighing the informativeness and selection of speech representations requires a certain expertise in speech processing, being aware of the trade-offs they intrinsically entail is crucial for carefully conducting XAI examination in speech. Indeed, it is precisely upon such input features—and their trade-offs—that explanations are built.
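To make the relationship between these representations concrete, here is a schematic NumPy-only sketch: it slices a waveform into 25 ms frames with a 10 ms hop, takes magnitude FFTs to obtain a spectrogram, and then compresses each frame with a DCT of the log spectrum. Note that the last step is a simplified stand-in for true MFCC extraction, which also applies a mel filterbank (toolkits such as Librosa implement the full pipeline); the signal is a synthetic tone, not speech.

```python
import numpy as np

def spectrogram(wave, sr=16_000, frame_ms=25, hop_ms=10):
    """Slice the waveform into overlapping frames and take each frame's
    magnitude FFT: frequency content over time."""
    flen, hop = sr * frame_ms // 1000, sr * hop_ms // 1000
    n_frames = 1 + (len(wave) - flen) // hop
    frames = np.stack([wave[i * hop : i * hop + flen] for i in range(n_frames)])
    frames = frames * np.hanning(flen)            # taper frame edges
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, n_frames)

def cepstral_coeffs(spec, n_coeffs=13):
    """Toy cepstral features: a DCT of the log spectrum compresses each
    frame into a few coefficients (real MFCCs first apply a mel filterbank)."""
    n_bins = spec.shape[0]
    k = np.arange(n_bins)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1) / (2 * n_bins))
    return basis @ np.log(spec + 1e-10)           # shape: (n_coeffs, n_frames)

sr = 16_000
wave = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s, 440 Hz tone
spec = spectrogram(wave)
print(spec.shape)                         # (201, 98): 201 freq bins x 98 frames
print(int(np.argmax(spec.mean(axis=1))))  # bin 11 -> 11 * 40 Hz = 440 Hz
print(cepstral_coeffs(spec).shape)        # (13, 98): more compact per frame
```

The shapes illustrate the trade-off discussed above: 16,000 raw samples become 98 frames of 201 frequency bins, and then 98 frames of just 13 coefficients, with each step trading temporal or spectral detail for compactness.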
4. Richness of Explanations

Considering the foregoing, there is a causal relationship whereby the explanatory possibilities in speech XAI are inherently limited by the richness of the audio features used, specifically the dimensions they encapsulate. This limitation directly constrains the richness of the resulting explanations. Also, owing to the compatibility of current models with various representation types, the explanations generated are inevitably confined by the specific input features provided to the model. To exemplify, if models process audio as waveforms—which poorly represent the frequency dimension for human understanding—explanations accounting for such a correlate will be out of reach. In fact, previous works by Wu et al. [31] and Wu et al. [32], based on waveforms, solely focus on the temporal dimension to explain ASR. In these cases, to avoid limiting the understanding of the models' behavior to one single dimension, it would be advisable to explore alternative techniques that offer deeper insights into how models process other acoustic correlates. For instance, Pastor et al. [28] integrated counterfactual explanations to specifically investigate whether selected paralinguistic features such as pitch, speaking rate, and background noise were influential for the model's prediction. Additionally, various techniques exist to analyze how models extract relevant patterns from waveforms through convolutions [48, 49, 50].

When the selected input features represent multiple dimensions, as in the case of spectrograms or MFCCs, the decision to account for only one of these dimensions becomes arbitrary. For example, two models tested by Wu et al. [31], namely DeepSpeech [51] and Sphinx [52], are fed with spectrograms and MFCCs, respectively. However, explanations based on raw waveforms are provided for these models. This inconsistency between the features used in the explanations and those used by the models inevitably offers only a partial overview of the models' behavior and limits the exploration of important acoustic aspects. This, in turn, can impact the accuracy of the explanations, which ideally should encompass all relevant information.

To prioritize explanation accuracy and conduct analyses that consider the crucial role of acoustic correlates such as frequency, it is advisable to take into account all dimensions embedded in the speech representation. This approach is exemplified by the works of Markert et al. [30], who provide explanations that account for the most influential elements in MFCCs, as well as Trinh and Mandel [29] and Becker et al. [27], who base the explanations on spectrograms. In the work by Markert et al. [30], however, it is challenging to connect the results with specific acoustic parameters due to the complexity of analyzing MFCCs (see §3), which significantly undermines the meaningfulness of the explanations. In contrast, explanations using spectrograms offer valuable insights into how machines process speech, producing both accurate and meaningful results. For instance, Trinh and Mandel [29] demonstrated that neural ASR models focus on high-energy time-frequency regions for transcription, while Becker et al. [27] found that lower frequency ranges, typically associated with pitch, exhibit higher attribution scores in speaker gender classification tasks, showing some alignment with human speech processing. However, interpreting these insights requires specialized expertise, which can reduce the meaningfulness of explanations for non-experts. This highlights that, even in speech, the balance between accuracy and meaningfulness can vary depending on the context [24].

5. Granularity of Explanations

Another critical factor concerning the informativeness of input features is the level of granularity at which the features are considered during the explanatory process. This decision affects the level of detail in the resulting explanations and, consequently, their accuracy—as more detailed explanations may more accurately reflect the model's behavior—and their meaningfulness—as detailed and comprehensive explanations can be more difficult to interpret [12, 24].

In the time domain, for example, input features are highly fine-grained. As discussed in §3, spectrograms typically contain frames spanning tens of milliseconds, capturing detailed frequency content within each frame, whereas waveforms are composed of samples taken at much shorter time intervals—for instance, as mentioned in §3, there can be 16,000 samples in just one second. This level of detail poses great challenges for (human) comprehension, particularly for a broader audience, since mapping groups of frames/samples in an explanation to recognizable speech units is highly time-consuming and requires specialized expertise.

Accordingly, to address this issue and make explanations for speech more broadly accessible, previous works have leveraged textual transcripts within the explanation process. More specifically, Wu et al. [32] and Pastor et al. [28] resort to the alignment of audio to text, for individual phonemes and words respectively, and apply explainability techniques to such units. While this approach helps decipher the contribution of input features based on more intuitive linguistic units, it diverges from how current models process speech features in small frames and samples [43]. This divergence risks overlooking the model's behavior and compromises the accuracy and effectiveness of the explanations. For instance, whether ASR systems rely on shorter or longer time intervals than individual words remains unclear [29]. Therefore, analyzing this aspect requires a more granular approach at the time level.

In light of the above, explanations should be obtained with low-level units to avoid biasing explanations towards human understanding. The use of audio-transcript alignment to aid the analysis of explanations can be very useful, but it should occur downstream of the explanation process, not upstream. In this way, we can maximize the use of all available units to generate detailed and accurate explanations, and then aggregate scores from individual frames or samples to create more compact representations at the level of phonemes or words, ensuring flexibility in the meaningfulness of the explanations according to specific needs. This bottom-up approach mirrors practices in the text domain, providing adaptability in defining attribution units that can range from subwords to words or phrases [53, 54].
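The bottom-up aggregation just described can be illustrated with a toy example. All names and numbers below are hypothetical: frame-level attribution scores (one per 10 ms spectrogram frame) are assumed to have already been computed by some attribution method, and a word-level audio-text alignment is applied only downstream, to pool them into word scores:

```python
import numpy as np

def aggregate(frame_scores, alignment, hop_ms=10):
    """Pool fine-grained frame-level attribution scores into word-level
    scores, downstream of the explanation process. `alignment` maps each
    word to its (start_ms, end_ms) span in the audio."""
    word_scores = {}
    for word, (start_ms, end_ms) in alignment:
        lo, hi = start_ms // hop_ms, end_ms // hop_ms
        word_scores[word] = float(frame_scores[lo:hi].mean())
    return word_scores

# Hypothetical 1-second utterance: 100 frame-level scores (synthetic,
# peaking around 600 ms) and a word alignment given in milliseconds.
frame_scores = np.exp(-((np.arange(100) - 60) ** 2) / 200)
alignment = [("you", (0, 300)), ("got", (300, 550)),
             ("the", (550, 700)), ("joke", (700, 1000))]
word_scores = aggregate(frame_scores, alignment)
print(max(word_scores, key=word_scores.get))  # "the": its span covers the peak
```

Because pooling happens after attribution, the same frame-level scores can be re-aggregated at phoneme, word, or phrase level as needed, without recomputing the explanation.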
6. Conclusion

This paper has examined the role of acoustic features and their selection in explaining speech models. More specifically, we considered a specific subfield of XAI, namely feature attribution, which connects input features to outputs as a form of explanation. Previous research has not explicitly addressed how to incorporate features into the explanation process within the speech domain, where input is encoded in more varied ways compared to other fields, such as text. This has led to diverse approaches, each with different implications for what can and cannot be explained about model behavior, and with the risk of not fully or accurately representing the model's functioning.

By discussing the key characteristics of speech and the properties of the most widely adopted acoustic features, we argue that explanations should ideally encompass all available dimensions, particularly time and frequency, as both are essential for a comprehensive understanding of the models' rationale. We have also discussed the challenges associated with aligning explanations at high granularity with human understanding, emphasizing solutions that provide flexibility in the analysis, allowing for adjustments between more or less detail as needed.

Building on these insights, our ongoing research focuses on developing feature attribution techniques that operate on spectrograms at the finest possible unit level, integrating both time and frequency dimensions. Our aim is to generate explanations that are accurate and meaningful for experts, as well as adaptable for non-expert users. More broadly, we hope that our reflections will be beneficial and thought-provoking for researchers currently working in, or entering, the field of XAI for speech models, thereby contributing to a deeper understanding of the rationale behind these models.

7. Limitations

While exploring the relationship between the informativeness of speech features and explanations, we have deliberately not delved into the needs of specific stakeholders for XAI applications. Indeed, different stakeholders present varying needs [55, 56], and considering them is a research avenue of paramount importance for the growth of XAI. However, as XAI for speech is a nascent area of investigation still in its infancy, we prioritized fundamental methodological and design decisions that favor a comprehensive and detailed understanding of the model's rationale at a low level. Accordingly, our reflections might be more appealing for a range of users who engage with speech models and possess expertise in machine learning and/or speech analysis, ranging from developers to speech therapists assisted by speech models [56].

The balance of richness and granularity—which also relates to the interplay between accuracy and meaningfulness—is also relevant to common users who interact with speech technologies. However, investigating how explanations can be effectively communicated to and understood by these users in the context of daily speech technology use exceeds the scope of this paper and warrants further exploration.

8. Acknowledgments

We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU. This paper has also received funding from the European Union's Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People).

References

[1] A. Zytek, I. Arnaldo, D. Liu, L. Berti-Equille, K. Veeramachaneni, The Need for Interpretable Features: Motivation and Taxonomy, SIGKDD Explor. Newsl. 24 (2022) 1–13. URL: https://doi.org/10.1145/3544903.3544905. doi:10.1145/3544903.3544905.
[2] C. Munteanu, M. Jones, S. Oviatt, S. Brewster, G. Penn, S. Whittaker, N. Rajput, A. Nanavati, We need to talk: HCI and the delicate topic of spoken language interaction, in: CHI '13 Extended Abstracts on Human Factors in Computing Systems, CHI EA '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 2459–2464. URL: https://doi.org/10.1145/2468356.2468803. doi:10.1145/2468356.2468803.
[3] H. Feng, K. Fawaz, K. G. Shin, Continuous Authentication for Voice Assistants, in: Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, MobiCom '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 343–355. URL: https://doi.org/10.1145/3117811.3117823. doi:10.1145/3117811.3117823.
[4] P. Cheng, U. Roedig, Personal Voice Assistant Security and Privacy—A Survey, Proceedings of the IEEE 110 (2022) 476–507. doi:10.1109/JPROC.2022.3153167.
[5] S. Malodia, N. Islam, P. Kaur, A. Dhir, Why Do People Use Artificial Intelligence (AI)-Enabled Voice Assistants?, IEEE Transactions on Engineering Management 71 (2024) 491–505. doi:10.1109/TEM.2021.3117884.
[6] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Curran Associates Inc., Red Hook, NY, USA, 2020, pp. 12449–12460.
[7] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451–3460. URL: https://doi.org/10.1109/TASLP.2021.3122291. doi:10.1109/TASLP.2021.3122291.
[8] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, F. Wei, WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518. doi:10.1109/JSTSP.2022.3188113.
[9] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust Speech Recognition via Large-Scale Weak Supervision, in: Proceedings of the 40th International Conference on Machine Learning, ICML'23, JMLR.org, 2023, pp. 28492–28518.
[10] F. Doshi-Velez, B. Kim, Towards A Rigorous Science of Interpretable Machine Learning, 2017. arXiv:1702.08608.
[11] D. V. Carvalho, E. M. Pereira, J. S. Cardoso, Machine Learning Interpretability: A Survey on Methods and Metrics, Electronics 8 (2019). URL: https://www.mdpi.com/2079-9292/8/8/832. doi:10.3390/electronics8080832.
[12] G. Vilone, L. Longo, Notions of explainability and evaluation approaches for explainable artificial intelligence, Information Fusion 76 (2021) 89–106. URL: https://www.sciencedirect.com/science/article/pii/S1566253521001093. doi:10.1016/j.inffus.2021.05.009.
[13] R. Pradhan, J. Zhu, B. Glavic, B. Salimi, Interpretable Data-Based Explanations for Fairness Debugging, in: Proceedings of the 2022 International Conference on Management of Data, SIGMOD '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 247–261. URL: https://doi.org/10.1145/3514221.3517886. doi:10.1145/3514221.3517886.
[14] C. Panigutti, R. Hamon, I. Hupont, D. Fernandez Llorca, D. Fano Yela, H. Junklewitz, S. Scalzo, G. Mazzini, I. Sanchez, J. Soler Garrido, E. Gomez, The role of explainable AI in the context of the AI Act, in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 1139–1150. URL: https://doi.org/10.1145/3593013.3594069. doi:10.1145/3593013.3594069.
[15] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-jussà, A Primer on the Inner Workings of Transformer-based Language Models, 2024. arXiv:2405.00208.
[16] M. Ancona, E. Ceolini, C. Öztireli, M. Gross, Gradient-Based Attribution Methods, in: W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, K.-R. Müller (Eds.), Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer International Publishing, Cham, 2019, pp. 169–191. URL: https://doi.org/10.1007/978-3-030-28954-6_9. doi:10.1007/978-3-030-28954-6_9.
[17] W. Samek, K.-R. Müller, Towards Explainable Artificial Intelligence, in: W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, K.-R. Müller (Eds.), Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer International Publishing, Cham, 2019, pp. 5–22. URL: https://doi.org/10.1007/978-3-030-28954-6_1. doi:10.1007/978-3-030-28954-6_1.
[18] S. Agarwal, S. Jabbari, C. Agarwal, S. Upadhyay, S. Wu, H. Lakkaraju, Towards the Unification and Robustness of Perturbation and Gradient Based Explanations, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 110–119. URL: https://proceedings.mlr.press/v139/agarwal21c.html.
[19] M. Ivanovs, R. Kadikis, K. Ozols, Perturbation-based methods for explaining deep neural networks: A survey, Pattern Recognition Letters 150 (2021) 228–234. URL: https://www.sciencedirect.com/science/article/pii/S0167865521002440. doi:10.1016/j.patrec.2021.06.030.
[20] F. K. Došilović, M. Brčić, N. Hlupić, Explainable Artificial Intelligence: A Survey, in: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2018, pp. 210–215. doi:10.23919/MIPRO.2018.8400040.
[21] W. Saeed, C. Omlin, Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities, Knowledge-Based Systems 263 (2023) 110273. URL: https://www.sciencedirect.com/science/article/pii/S0950705123000230. doi:10.1016/j.knosys.2023.110273.
[22] Y. Zhou, S. Booth, M. T. Ribeiro, J. Shah, Do Feature Attribution Methods Correctly Attribute Features?, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022) 9623–9633. URL: https://ojs.aaai.org/index.php/AAAI/article/view/21196. doi:10.1609/aaai.v36i9.21196.
[23] D. Qin, G. Amariucai, D. Qiao, Y. Guan, S. Fu, A Comprehensive and Reliable Feature Attribution Method: Double-sided Remove and Reconstruct (DoRaR), 2023. arXiv:2310.17945.
[24] P. J. Phillips, C. Hahn, P. Fontana, A. Yates, K. K. Greene, D. Broniatowski, M. A. Przybocki, Four Principles of Explainable Artificial Intelligence, 2021. URL: https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=933399. doi:10.6028/NIST.IR.8312.
[25] A. Jacovi, Y. Goldberg, Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4198–4205. URL: https://aclanthology.org/2020.acl-main.386. doi:10.18653/v1/2020.acl-main.386.
[26] Q. Lyu, M. Apidianaki, C. Callison-Burch, Towards Faithful Model Explanation in NLP: A Survey, Computational Linguistics 50 (2024) 657–723. URL: https://aclanthology.org/2024.cl-2.6. doi:10.1162/coli_a_00511.
[27] S. Becker, J. Vielhaben, M. Ackermann, K.-R. Müller, S. Lapuschkin, W. Samek, AudioMNIST: Exploring Explainable Artificial Intelligence for audio analysis on a simple benchmark, Journal of the Franklin Institute 361 (2024) 418–428. URL: https://www.sciencedirect.com/science/article/pii/S0016003223007536. doi:10.1016/j.jfranklin.2023.11.038.
[28] E. Pastor, A. Koudounas, G. Attanasio, D. Hovy, E. Baralis, Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian's, Malta, 2024, pp. 2221–2238. URL: https://aclanthology.org/2024.eacl-long.136.
[29] V. A. Trinh, M. Mandel, Directly Comparing the Listening Strategies of Humans and Machines, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 312–323. doi:10.1109/TASLP.2020.3040545.
[30] K. Markert, R. Parracone, M. Kulakov, P. Sperl, C.-Y. Kao, K. Böttinger, Visualizing Automatic Speech Recognition – Means for a Better Understanding?, in: Proc. 2021 ISCA Symposium on Security and Privacy in Speech Communication, 2021, pp. 14–20. doi:10.21437/SPSC.2021-4.
[31] X. Wu, P. Bell, A. Rajan, Explanations for Automatic Speech Recognition, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi:10.1109/ICASSP49357.2023.10094635.
[32] X. Wu, P. Bell, A. Rajan, Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition, in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10296–10300. doi:10.1109/ICASSP48485.2024.10445989.
[33] N. Audibert, C. Fougeron, Intra-speaker phonetic variation in read speech: comparison with inter-speaker variability in a controlled population, in: Interspeech 2022, ISCA, Incheon, South Korea, 2022, pp. 4755–4759. URL: https://hal.science/hal-03852142. doi:10.21437/Interspeech.2022-10965.
[34] J. Clark, C. Yallop, An Introduction to Phonetics and Phonology, B. Blackwell, Oxford, UK, 1990.
[35] G. Yule, The Study of Language, 7 ed., Cambridge University Press, 2020.
[36] K. N. Stevens, Acoustic Phonetics, The MIT Press, 2000.
[37] N. H. van Schijndel, T. Houtgast, J. M. Festen, Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners, The Journal of the Acoustical Society of America 110 (2001) 529–542. URL: https://doi.org/10.1121/1.1378345. doi:10.1121/1.1378345.
[38] K. N. Stevens, Acoustic correlates of some phonetic categories, The Journal of the Acoustical Society of America 68 (1980) 836–842. doi:10.1121/1.384823.
[39] J. Honey, Sociophonology, John Wiley & Sons, Ltd, 2017, pp. 92–106. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781405166256.ch6. doi:10.1002/9781405166256.ch6.
[40] D. Hirst, Speech Prosody: from Acoustics to Interpretation, Springer Berlin, Heidelberg, 2024.
[41] C. T. Best, The Diversity of Tone Languages and the Roles of Pitch Variation in Non-tone Languages: Considerations for Tone Perception Research, Frontiers in Psychology 10 (2019). URL: https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00364. doi:10.3389/fpsyg.2019.00364.
[42] F. Alías, J. C. Socoró, X. Sevillano, A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds, Applied Sciences 6 (2016). URL: https://www.mdpi.com/2076-3417/6/5/143. doi:10.3390/app6050143.
[43] A. Mehrish, N. Majumder, R. Bharadwaj, R. Mihalcea, S. Poria, A review of deep learning techniques for speech processing, Information Fusion 99 (2023) 101869. URL: https://www.sciencedirect.
bofo, P. Biberstein, N. D. Sergin, R. Hennequin, R. Naktinis, beantowel, T. Kim, J. P. Åsen, J. Lim, A. Malins, D. Hereñú, S. van der Struijk, L. Nickel, J. Wu, Z. Wang, T. Gates, M. Vollrath, A. Sarroff, Xiao-Ming, A. Porter, S. Kranzler, Voodoohop, M. D. Gangi, H. Jinoz, C. Guerrero, A. Mazhar, toddrme2178, Z. Baratz, A. Kostin, X. Zhuang, C. T. Lo, P. Campr, E. Semeniuc, M. Biswal, S.
Moura, com/science/article/pii/S1566253523001859. P. Brossier, H. Lee, W. Pimenta, librosa/librosa: doi:https://doi.org/10.1016/j.inffus. 0.10.1, 2023. URL: https://doi.org/10.5281/zenodo. 2023.101869. 8252662. doi:10.5281/zenodo.8252662. [44] S. Communication, L. Barrault, Y.-A. Chung, M. C. [48] M. Ravanelli, Y. Bengio, Interpretable Meglioli, D. Dale, N. Dong, P.-A. Duquenne, H. Elsa- Convolutional Filters with SincNet, 2019. har, H. Gong, K. Heffernan, J. Hoffman, C. Klaiber, arXiv:1811.09725. P. Li, D. Licht, J. Maillard, A. Rakotoarison, K. R. [49] M. Angrick, C. Herff, G. Johnson, J. Shih, D. Krusien- Sadagopan, G. Wenzek, E. Ye, B. Akula, P.-J. Chen, ski, T. Schultz, Interpretation of convolu- N. E. Hachem, B. Ellis, G. M. Gonzalez, J. Haa- tional neural networks for speech spectrogram heim, P. Hansanti, R. Howes, B. Huang, M.-J. regression from intracranial recordings, Neu- Hwang, H. Inaguma, S. Jain, E. Kalbassi, A. Kallet, rocomput. 342 (2019) 145–151. URL: https://doi. I. Kulikov, J. Lam, D. Li, X. Ma, R. Mavlyutov, org/10.1016/j.neucom.2018.10.080. doi:10.1016/j. B. Peloquin, M. Ramadan, A. Ramakrishnan, A. Sun, neucom.2018.10.080. K. Tran, T. Tran, I. Tufanov, V. Vogeti, C. Wood, [50] H. Fayyazi, Y. Shekofteh, IIRI-Net: An interpretable Y. Yang, B. Yu, P. Andrews, C. Balioglu, M. R. Costa- convolutional front-end inspired by IIR filters for jussà, O. Celebi, M. Elbayad, C. Gao, F. Guzmán, speaker identification, Neurocomput. 558 (2023). J. Kao, A. Lee, A. Mourachko, J. Pino, S. Popuri, URL: https://doi.org/10.1016/j.neucom.2023.126767. C. Ropers, S. Saleem, H. Schwenk, P. Tomasello, doi:10.1016/j.neucom.2023.126767. C. Wang, J. Wang, S. Wang, SeamlessM4T: Mas- [51] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Di- sively Multilingual & Multimodal Machine Trans- amos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, lation, 2023. URL: https://arxiv.org/abs/2308.11596. A. Coates, A. Y. Ng, Deep Speech: Scaling up end-to- arXiv:2308.11596. 
end speech recognition, 2014. arXiv:1412.5567. [45] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, [52] P. Lamere, P. Kwok, W. Walker, E. Gouvea, R. Singh, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, B. Raj, P. Wolf, Design of the CMU Sphinx- Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, 4 Decoder, in: Proc. 8th European Confer- K. Vesely, The Kaldi Speech Recognition Toolkit, in: ence on Speech Communication and Technology IEEE 2011 Workshop on Automatic Speech Recogni- (Eurospeech 2003), 2003, pp. 1181–1184. doi:10. tion and Understanding, IEEE Signal Processing So- 21437/Eurospeech.2003-382. ciety, 2011, pp. 1–4. IEEE Catalog No.: CFP11SRW- [53] G. Sarti, N. Feldhus, L. Sickert, O. van der Wal, Inseq: USB. An Interpretability Toolkit for Sequence Generation [46] K. Ikarous, The Encoding of Vowel Features in Models, in: D. Bollegala, R. Huang, A. Ritter (Eds.), Mel-Frequency Cepstral Coefficients, in: A. Vietti, Proceedings of the 61st Annual Meeting of the As- L. Spreafico, D. Mereu, V. Galatà (Eds.), Il parlato nel sociation for Computational Linguistics (Volume 3: contesto naturale [Speech in the natural context], System Demonstrations), Association for Computa- Officinaventuno, Milano, 2018, p. 9–18. URL: https: tional Linguistics, Toronto, Canada, 2023, pp. 421– //doi.org/10.17469/O2104AISV000001. 435. URL: https://aclanthology.org/2023.acl-demo. [47] B. McFee, M. McVicar, D. Faronbi, I. Roman, 40. doi:10.18653/v1/2023.acl-demo.40. M. Gover, S. Balke, S. Seyfarth, A. Malek, C. Raffel, [54] V. Miglani, A. Yang, A. Markosyan, D. Garcia- V. Lostanlen, B. van Niekirk, D. Lee, F. Cwitkowitz, Olano, N. Kokhlikyan, Using Captum to Ex- F. Zalkow, O. Nieto, D. Ellis, J. Mason, K. Lee, plain Generative Language Models, in: L. Tan, B. Steers, E. Halvachs, C. Thomé, F. Robert- D. Milajevs, G. Chauhan, J. Gwinnup, E. Rippeth Stöter, R. Bittner, Z. Wei, A. Weiss, E. Batten- (Eds.), Proceedings of the 3rd Workshop for Nat- berg, K. Choi, R. Yamamoto, C. Carr, A. 
Metsai, ural Language Processing Open Source Software S. Sullivan, P. Friesch, A. Krishnakumar, S. Hidaka, (NLP-OSS 2023), Association for Computational S. Kowalik, F. Keller, D. Mazur, A. Chabot-Leclerc, Linguistics, Singapore, 2023, pp. 165–173. URL: C. Hawthorne, C. Ramaprasad, M. Keum, J. Gomez, https://aclanthology.org/2023.nlposs-1.19. doi:10. W. Monroe, V. A. Morozov, K. Eliasi, nullmighty- 18653/v1/2023.nlposs-1.19. [55] M. Langer, D. Oster, T. Speith, H. Hermanns, L. Kästner, E. Schmidt, A. Sesing, K. Baum, What do we want from Explainable Artificial Intelligence (XAI)? – A stakeholder perspective on XAI and a conceptual model guiding interdisciplinary XAI research, Artificial Intelligence 296 (2021) 103473. URL: https://www.sciencedirect.com/science/ article/pii/S0004370221000242. doi:https://doi. org/10.1016/j.artint.2021.103473. [56] M. Calvano, A. Curci, A. Pagano, A. Piccinno, Speech Therapy Supported by AI and Smart Assistants, in: R. Kadgien, A. Jedlitschka, A. Janes, V. Lenarduzzi, X. Li (Eds.), Product- Focused Software Process Improvement, Springer Nature Switzerland, Cham, 2024, pp. 97–104.