=Paper=
{{Paper
|id=Vol-3878/44_main_long
|storemode=property
|title=Explainability for Speech Models: On the Challenges of Acoustic Feature Selection
|pdfUrl=https://ceur-ws.org/Vol-3878/44_main_long.pdf
|volume=Vol-3878
|authors=Dennis Fucci,Beatrice Savoldi,Marco Gaido,Matteo Negri,Mauro Cettolo,Luisa Bentivogli
|dblpUrl=https://dblp.org/rec/conf/clic-it/FucciSGNCB24
}}
==Explainability for Speech Models: On the Challenges of Acoustic Feature Selection==
Dennis Fucci¹,², Beatrice Savoldi², Marco Gaido², Matteo Negri², Mauro Cettolo² and Luisa Bentivogli²

¹ University of Trento, Via Calepina, 14, 38122 Trento TN, Italy
² Fondazione Bruno Kessler, Via Sommarive, 18, 38123 Trento TN, Italy
Abstract
Spurred by the demand for transparency and interpretability in Artificial Intelligence (AI), the field of eXplainable AI (XAI)
has experienced significant growth, marked by both theoretical reflections and technical advancements. While various XAI
techniques, especially feature attribution methods, have been extensively explored across diverse tasks, their adaptation
for the speech modality comparatively lags behind. We argue that a key challenge in feature attribution for speech
processing lies in identifying informative acoustic features. In this paper, we discuss the key challenges in selecting the
features for speech explanations. Also, in light of existing research, we highlight current gaps and propose future avenues to
enhance the depth and informativeness of explanations for speech.
Keywords
Speech Models, Explainability, Feature Attribution
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
dfucci@fbk.eu (D. Fucci); bsavoldi@fbk.eu (B. Savoldi); mgaido@fbk.eu (M. Gaido); negri@fbk.eu (M. Negri); cettolo@fbk.eu (M. Cettolo); bentivo@fbk.eu (L. Bentivogli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Models are only as interpretable as their features. [1]

Spoken language—as perhaps our most natural form of interaction—is the foundational element of many technologies we interact with in our daily lives [2], from virtual assistants to voice dictation [3, 4, 5]. More recently, the emergence of highly capable speech foundation models [6, 7, 8, 9] has also facilitated and expanded the adoption of speech technologies on an unprecedented multilingual scale. In light of this proliferation, a need arises to prioritize transparency and interpretability, qualities already demanded in the growing landscape of Machine Learning (ML).

As a response, the field of eXplainable AI (XAI) has risen prominently, with the aim of facilitating understanding of the rationale behind model decisions and fostering users' trust [10, 11, 12, 13]. XAI is also reinforced by the establishment of norms and legal frameworks, as seen in the European Union's General Data Protection Regulation, which enshrines the 'right to explanation', and the AI Act, which emphasizes transparency as a pivotal component of ML applications [14].

XAI encompasses various tasks and methods, such as identifying relevant model components for specific predictions, understanding the information processed by these components, and determining which input elements guide the model's predictions [15]. The latter task is the focus of feature attribution methods, which provide intuitive explanations by visualizing which input elements (e.g., pixels in an image or words in a sentence) have influenced the model's predictions. These methods assign a score to each input feature, quantifying its importance or contribution to the output: higher scores indicate greater importance of the corresponding input features for generating the output [16, 17, 18, 19]. They can help identify potential causes for errors and unexpected behaviors, as well as analyze the model's response to specific input properties. Overall, these explainability methods serve to present the reason why models make specific predictions by establishing a connection between input and output as a form of intuitive explanation for humans, thereby enhancing interpretability.¹

¹ Despite numerous efforts to differentiate the closely related concepts of explainability and interpretability, no consensus exists in the literature on their definitions [20]. In this paper, we adopt a perspective similar to that of Saeed and Omlin [21], where explainability refers to the process of extracting insights from a model's workings through specific techniques, while interpretability refers to the understanding process of those insights, crucial to make them actionable.

Over time, ongoing efforts have aimed to refine feature attribution techniques and provide more effective explanations [22, 23]. However, it is essential to recognize that the effectiveness of feature attribution explanations relies not only on the techniques themselves but also on the informativeness of the input features used as explanatory variables. If an explanation highlights unintelligible or poorly informative features, it does little to enhance the understanding of the model's behavior [1].
This can undermine key principles in XAI, such as accuracy—the property of correctly reflecting the factors that led the model to a specific decision, including all relevant information—and meaningfulness—the property of offering explanations that are comprehensible to the user [24].²

² The properties of accuracy and meaningfulness can be associated with those of faithfulness and plausibility, respectively [25, 26].

In fields involving images or texts, feature representations are typically constrained to pixels and words, respectively. However, for speech, multiple input representations can be adopted, each emphasizing different acoustic aspects. Indeed, a sequence of speech elements does not only convey the meaning of what is said (like words in a text) but also bears a wealth of additional information useful for both human understanding and automatic processing (e.g., intonation, loudness, speaking rate). Consequently, when employing feature attribution methods, the resulting explanations can vary significantly in shape and focus on more or less informative characteristics depending on the type of speech representation used. To date, research on feature attribution for speech is notably limited to few applications—including classification [27, 28] and generative tasks [29, 30, 31, 32]—which offer a somewhat fragmented picture in the choice of speech representations, thus providing limited insights on the relation between the features considered and the explanations based upon them.

In light of the above, this paper reflects on the impact of the chosen acoustic features in explaining the rationale behind speech models, aiming to gain a deeper understanding of the trade-offs associated with acoustic features. By first offering a gentle introduction to the rich and multidimensional nature of speech and its digital representation, we identify current gaps and potential avenues for effectively incorporating this multidimensionality into XAI for speech models. Our discussion will focus on two critical factors: i) the amount of information these features provide about the model's behavior, which influences the richness of the explanations, and ii) the level of detail of such information, which determines the granularity of the explanations. We will also explore how these aspects impact both the accuracy and meaningfulness of the explanations, ultimately shaping their overall interpretability.

2. The Correlates of Speech

To gain deeper insight into the complexities of defining informative features in speech, we explore key characteristics of speech and their implications for modeling.

Speech is a multifaceted phenomenon. It is grounded on the materiality of sound to convey linguistic content (i.e., what is said), which is modulated depending on several paralinguistic cues (i.e., how it is said) entailing extensive variation—also for single individual speakers [33]. As such, it comprises several dimensions, which are hard to pin down individually, but collectively amount to what we intuitively and simply perceive as spoken language.

From a linguistic perspective, the spoken communication system consists of the combination of phonemes,³ which are regarded as the smallest meaningful units of sounds [34, 35]. Physically, it involves the continuous flow of sounds shaped by the movements of our phonatory organs, transmitted as sound waves [36]. Perceptually, we process speech through three primary dimensions [37]: i) time, or the sequential occurrence of sounds;⁴ ii) intensity, corresponding to the energy level of the wave due to the strength of molecular vibration, which we perceive as loudness; iii) frequency, regarding the rate of vibrations produced by the vocal cords—interpreted as pitch—and whose modulation is responsible for shaping the type of speech sound.

³ Throughout the paper, we use the abstract category of phoneme to denote individual speech sounds. However, when discussing their actual realizations, it is more accurate to refer to them as phones [34].
⁴ E.g., the order of sounds between /pɑt/ (pot) and /tɑp/ (top) differentiates two words.

These three elements, known as acoustic correlates [38], are specific to both speakers and phonemes. For example, speakers possess unique characteristics, including pitch and speaking rate [33], and also exhibit high variability stemming from various sociodemographic factors such as gender, age, and dialect [39]. In these cases, the speech content needs to be disentangled from the variability in its delivery. Conversely, language sounds exhibit variability in duration—e.g., /i/ in ship and sheep—and are distinguished by specific frequency ranges [36]. The frequency dimension also plays a vital role in shaping suprasegmental aspects of speech—broader phenomena that span multiple segments—such as intonation, obtained by varying pitch [40]. Pitch, for instance, has a distinctive function in tonal languages, where it is used to distinguish lexical or grammatical meaning [41]. But even in non-tonal languages, these prosodic elements are indispensable to delivering different meanings and intents, as the reader can perceive by reading out loud two contrastive sentences such as: "You got the joke right" and "You got the joke, right?", where pauses and prosody play pivotal roles.

All these factors add to the multidimensionality of speech, which feature engineering strives to encapsulate and that cannot be overlooked in the explanatory process.

3. Speech Representations

While various representations are used to encode speech in a digital format, three main types are commonly given
as input to state-of-the-art speech models (for a review,
see [42, 43]). Namely, waveforms, spectrograms, and
mel-frequency cepstral coefficients (MFCCs), which are
shown in Figure 1.
The waveform serves as the most fundamental repre-
sentation of a signal, comprising sequences of samples
(e.g., 16,000 per second), each indicating the amplitude
of the signal at a specific point in time—essentially, the
fluctuations in air pressure over time. This type of repre-
sentation is leveraged by models like Wav2vec [6].
The spectrogram results from feature engineering
operations that decompose the speech signal into its
frequencies, presenting a 2D visualization of frequency
distributions over time. These representations are com-
monly depicted as heatmaps, where color intensity cor-
responds to the energy of a specific frequency at a given
moment. The time unit in spectrograms is represented
by a fixed-length window of a few milliseconds (e.g., 25),
commonly referred to as a frame, within which a given
number of waveform samples are encompassed. Notably,
the articulation of sounds produces time-frequency pat-
terns which are visible as darker regions [36]. Prominent
examples of state-of-the-art models leveraging spectro-
grams are Whisper [9] and SeamlessM4T [44].
The MFCCs offer another 2D representation where each coefficient captures important details about how the frequency content of the signal changes over time. Like spectrograms, MFCCs offer information about both frequency and time, but in a more compact form. MFCCs are commonly used in the implementation of ASR models within popular toolkits like Kaldi⁵ [45] and Mozilla DeepSpeech⁶.

Figure 1: Schematic illustration of the primary speech representations used by state-of-the-art speech models for the utterance "This is a waveform". The features were computed using Librosa 0.10.1 [47].

⁵ See https://kaldi-asr.org/doc/feat.html.
⁶ See https://deepspeech.readthedocs.io/en/master/DeepSpeech.html.

Overall, though different in nature, these three types of representations are all effectively exploited by current speech models.⁷ For human understanding, however, they actually vary in terms of informativeness with respect to the acoustic correlates discussed in §2. Indeed, although both intensity and frequency are somewhat discernible in waveforms, qualitative distinctions of patterns specific to pitch or phoneme frequencies are rarely feasible [36]. Comparatively, spectrograms and MFCCs are richer and more descriptive, because they capture the multiple dimensions of time, frequency, and intensity with finer detail. Still, spectrograms are more conducive to phonetic analyses, given the established knowledge in analyzing frequency patterns over time within this representation [36]. In contrast, MFCCs are rarely used for phonetic analysis [46].

⁷ We are not aware of any recent study attributing higher system performance to the representation used.

Overall, while weighing the informativeness and selection of speech representations requires a certain expertise in speech processing, being aware of the trade-offs they intrinsically entail is crucial for carefully conducting XAI examination in speech. Indeed, it is precisely upon such input features—and their trade-offs—that explanations are built.
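To make the three representations concrete, the sketch below computes all of them for a single utterance with Librosa, the library used to produce Figure 1. The audio file name and the frame settings (25 ms window, 10 ms hop) are illustrative assumptions, not values taken from the paper.

```python
# Sketch: computing the three representations discussed above with Librosa.
# The audio path and frame settings are illustrative assumptions.
import librosa
import numpy as np

# Waveform: one amplitude sample per time step (here 16,000 per second).
waveform, sr = librosa.load("utterance.wav", sr=16000)

n_fft = int(0.025 * sr)        # 25 ms analysis window (one frame)
hop_length = int(0.010 * sr)   # 10 ms stride between frames

# Spectrogram: energy per frequency bin per frame (log scale for display).
stft = librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)
spectrogram = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# MFCCs: a more compact 2D descriptor of the spectral envelope per frame.
mfccs = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13,
                             n_fft=n_fft, hop_length=hop_length)

print(waveform.shape)     # (num_samples,)
print(spectrogram.shape)  # (n_fft // 2 + 1, num_frames)
print(mfccs.shape)        # (13, num_frames)
```

The printed shapes make the informativeness trade-off tangible: the waveform is a single long sequence, while the spectrogram and MFCCs are 2D grids over time and frequency, with the MFCCs far more compact.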
4. Richness of Explanations

Considering the foregoing, there is a causal relationship wherein explanatory possibilities in speech XAI are inherently limited by the richness of the audio features used, specifically the dimensions they encapsulate. This limitation directly correlates with the richness of the resulting explanations. Also, owing to the compatibility of current models with various representation types, the explanations generated are inevitably confined by the specific input features provided to the model. To exemplify, if models process audio as waveforms—which poorly represent the frequency dimension for human understanding—explanations accounting for such a correlate will be out of reach. In fact, previous works by Wu et al. [31] and Wu et al. [32], based on waveforms, solely focus on the temporal dimension to explain ASR. In these cases, to avoid limiting the understanding of the models' behavior to one single dimension, it would be advisable to explore alternative techniques that offer deeper insights into how models process other acoustic correlates. For instance, Pastor et al. [28] integrated counterfactual explanations to specifically investigate whether selected paralinguistic features such as pitch, speaking rate, and background noise were influential for the model's prediction. Additionally, various techniques exist to analyze how models extract relevant patterns from waveforms through convolutions [48, 49, 50].

When the selected input features represent multiple dimensions, as in the case of spectrograms or MFCCs, the decision to only account for one of these dimensions becomes arbitrary. For example, two models tested by Wu et al. [31], namely, DeepSpeech [51] and Sphinx [52], are fed with spectrograms and MFCCs, respectively. However, explanations based on raw waveforms are provided for these models. This inconsistency between the features used in explanations and those used by the models inevitably offers only a partial overview of the models' behavior and limits the exploration of important acoustic aspects. This, in turn, can impact the accuracy of the explanations, which ideally should encompass all relevant information.

To prioritize explanation accuracy and conduct analyses considering the crucial role of acoustic correlates such as frequency, it is advisable to take into account all dimensions embedded in the speech representation. This approach is exemplified by the works of Markert et al. [30], who provide explanations that account for the most influential elements in MFCCs, as well as Trinh and Mandel [29] and Becker et al. [27], who base the explanations on spectrograms. In the work by Markert et al. [30], however, it is challenging to connect the results with specific acoustic parameters due to the complexity of analyzing MFCCs (see §3), which significantly undermines the meaningfulness of the explanations. In contrast, explanations using spectrograms offer valuable insights into how machines process speech, producing both accurate and meaningful results. For instance, Trinh and Mandel [29] demonstrated that neural ASR models focus on high-energy time-frequency regions for transcription, while Becker et al. [27] found that lower frequency ranges, typically associated with pitch, exhibit higher attribution scores in speaker gender classification tasks, showing some alignment with human speech processing. However, interpreting these insights requires specialized expertise, which can reduce the meaningfulness of explanations for non-experts. This highlights that, even in speech, the balance between accuracy and meaningfulness can vary depending on the context [24].
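To ground what "accounting for all dimensions" can mean in practice, below is a minimal sketch of one way to obtain attributions that jointly cover time and frequency: a perturbation-based occlusion pass over spectrogram patches. The `model` callable and the patch sizes are hypothetical stand-ins; the cited works use their own, more sophisticated techniques.

```python
# Sketch: occlusion-based attribution over time-frequency patches.
# `model` is a hypothetical classifier taking a (freq, time) spectrogram
# and returning a probability for a target class; patch sizes are
# illustrative assumptions.
import numpy as np

def occlusion_attribution(model, spec: np.ndarray,
                          f_patch: int = 8, t_patch: int = 10) -> np.ndarray:
    """Score each time-frequency patch by the output drop when it is masked."""
    base_score = model(spec)                  # unperturbed prediction
    attributions = np.zeros_like(spec)
    n_freq, n_time = spec.shape
    for f0 in range(0, n_freq, f_patch):
        for t0 in range(0, n_time, t_patch):
            masked = spec.copy()
            # Replace the patch with the spectrogram mean (a neutral baseline).
            masked[f0:f0 + f_patch, t0:t0 + t_patch] = spec.mean()
            drop = base_score - model(masked)
            attributions[f0:f0 + f_patch, t0:t0 + t_patch] = drop
    return attributions  # same shape as `spec`: time AND frequency covered
```

Because the attribution map shares the spectrogram's shape, high-scoring regions can be read along both axes, e.g., a low-frequency band (pitch) or a short time interval, rather than along time alone.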
5. Granularity of Explanations

Another critical factor concerning the informativeness of input features is the level of granularity at which the features are considered during the explanatory process. This decision affects the level of detail in the resulting explanations and, consequently, their accuracy—as more detailed explanations may more accurately reflect the model's behavior—and their meaningfulness—as detailed and comprehensive explanations can be more difficult to interpret [12, 24].

In the time domain, for example, input features are highly fine-grained. As discussed in §3, spectrograms typically contain frames spanning tens of milliseconds, capturing detailed frequency content within each frame, whereas waveforms are composed of samples taken at much shorter time intervals—for instance, as mentioned in §3, there can be 16,000 samples in just one second. This level of detail poses great challenges for (human) comprehension, particularly for a broader audience, since mapping groups of frames/samples in an explanation to recognizable speech units is highly time-consuming and requires specialized expertise.

Accordingly, to address the issue and make explanations for speech more broadly accessible, previous works have leveraged textual transcripts within the explanation process. More specifically, Wu et al. [32] and Pastor et al. [28] resort to the alignment of audio to text, for individual phonemes and words, respectively, and apply explainability techniques to such units. While this approach helps decipher the contribution of input features based on more intuitive linguistic units, it diverges from how current models process speech features in small frames and samples [43]. This divergence risks overlooking the model's behavior and compromises the accuracy and effectiveness of the explanations. For instance, whether ASR systems rely on shorter or longer time intervals than individual words remains unclear [29]. Therefore, analyzing this aspect requires a more granular approach at the time level.

In light of the above, explanations should be obtained with low-level units to avoid biasing explanations towards human understanding. The use of audio-transcript alignment to aid analysis of explanations can be very useful but should occur downstream of the explanation process, not upstream. In this way, we can maximize the use of all available units to generate detailed and accurate explanations, and then aggregate scores from individual frames or samples to create more compact representations at the level of phonemes or words, ensuring flexibility in the meaningfulness of the explanations according to specific needs. This bottom-up approach mirrors practices in the text domain, providing adaptability in defining attribution units that can range from subwords to words or phrases [53, 54].
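As a minimal sketch of this bottom-up workflow, the code below assumes that frame-level attribution scores (one per 10 ms frame) and a word-level time alignment are already available, and performs the aggregation strictly downstream; all names and values are illustrative assumptions.

```python
# Sketch: downstream aggregation of frame-level attribution scores into
# word-level scores, given word time alignments. The hop size and the
# input data are illustrative assumptions.
from typing import List, Tuple

def aggregate_to_words(frame_scores: List[float],
                       word_spans: List[Tuple[str, float, float]],
                       hop_s: float = 0.010) -> List[Tuple[str, float]]:
    """Sum per-frame scores within each word's [start, end) interval (seconds)."""
    word_scores = []
    for word, start, end in word_spans:
        first = round(start / hop_s)   # first frame covered by the word
        last = round(end / hop_s)      # frame just past the word
        word_scores.append((word, sum(frame_scores[first:last])))
    return word_scores

# Usage: fine-grained scores are computed first, aggregation happens after.
scores = [0.1, 0.4, 0.9, 0.8, 0.2, 0.05, 0.3, 0.7]       # one score per 10 ms frame
alignment = [("this", 0.00, 0.04), ("is", 0.04, 0.08)]   # word, start, end (s)
print(aggregate_to_words(scores, alignment))
```

The key design point is the ordering: the explanation itself stays at the frame level, and the word view is only a lossy summary produced on demand, so both expert and non-expert presentations can be derived from the same underlying scores.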
6. Conclusion

This paper has examined the role of acoustic features and their selection for explaining speech models. More specifically, we considered a specific subfield of XAI, namely, feature attribution, which connects input features to outputs as a form of explanation. Previous research has not explicitly addressed how to incorporate features into the explanation process within the speech domain, where input is encoded in more varied ways compared to other fields, such as text. This has led to diverse approaches, each with different implications for what can and cannot be explained about model behavior, and with the risk of not fully or accurately representing the model's functioning.

By discussing the key characteristics of speech and the properties of the most adopted acoustic features, we argue that explanations should ideally encompass all available dimensions, particularly time and frequency, as both are essential for a comprehensive understanding of the models' rationale. We have also discussed challenges associated with aligning explanations at high granularity with human understanding, emphasizing solutions that provide flexibility in the analysis, allowing for adjustments between more or less detail as needed.

Building on these insights, our ongoing research focuses on developing feature attribution techniques that operate on spectrograms at the finest possible unit level, integrating both time and frequency dimensions. Our aim is to generate explanations that are accurate and meaningful for experts, as well as adaptable for non-expert users. More broadly, we hope that our reflections will be beneficial and thought-provoking for researchers currently working in, or entering, the field of XAI for speech models, thereby contributing to a deeper understanding of the rationale behind these models.

7. Limitations

While exploring the relationship between the informativeness of speech features and explanations, we have deliberately not delved into the needs of specific stakeholders for XAI applications. Indeed, different stakeholders present varying needs [55, 56], and considering them is a research avenue of paramount importance for the growth of XAI. However, as a nascent area of investigation, XAI for speech is still in its infancy; we have thus prioritized fundamental methodological and design decisions that target a comprehensive and detailed low-level understanding of a model's rationale. Accordingly, our reflections might be more appealing for a range of users who engage with speech models and possess expertise in machine learning and/or speech analysis, ranging from developers to speech therapists assisted by speech models [56].

The balance of richness and granularity—which also relates to the interplay between accuracy and meaningfulness—is also relevant to common users who interact with speech technologies. However, investigating how explanations can be effectively communicated to and understood by these users in the context of daily speech technology use exceeds the scope of this paper and warrants further exploration.

8. Acknowledgments

We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU. This paper has also received funding from the European Union's Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People).

References

[1] A. Zytek, I. Arnaldo, D. Liu, L. Berti-Equille, K. Veeramachaneni, The Need for Interpretable Features: Motivation and Taxonomy, SIGKDD Explor. Newsl. 24 (2022) 1–13. URL: https://doi.org/10.1145/3544903.3544905. doi:10.1145/3544903.3544905.
[2] C. Munteanu, M. Jones, S. Oviatt, S. Brewster, G. Penn, S. Whittaker, N. Rajput, A. Nanavati, We need to talk: HCI and the delicate topic of spoken language interaction, in: CHI '13 Extended Abstracts on Human Factors in Computing Systems, CHI EA '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 2459–2464. URL: https://doi.org/10.1145/2468356.2468803. doi:10.1145/2468356.2468803.
[3] H. Feng, K. Fawaz, K. G. Shin, Continuous Authentication for Voice Assistants, in: Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, MobiCom '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 343–355. URL: https://doi.org/10.1145/3117811.3117823. doi:10.1145/3117811.3117823.
[4] P. Cheng, U. Roedig, Personal Voice Assistant Security and Privacy—A Survey, Proceedings of the IEEE 110 (2022) 476–507. doi:10.1109/JPROC.2022.3153167.
[5] S. Malodia, N. Islam, P. Kaur, A. Dhir, Why Do People Use Artificial Intelligence (AI)-Enabled Voice Assistants?, IEEE Transactions on Engineering Management 71 (2024) 491–505. doi:10.1109/TEM.2021.3117884.
[6] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Curran Associates Inc., Red Hook, NY, USA, 2020, pp. 12449–12460.
[7] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451–3460. URL: https://doi.org/10.1109/TASLP.2021.3122291. doi:10.1109/TASLP.2021.3122291.
[8] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, F. Wei, WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518. doi:10.1109/JSTSP.2022.3188113.
[9] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust Speech Recognition via Large-Scale Weak Supervision, in: Proceedings of the 40th International Conference on Machine Learning, ICML'23, JMLR.org, 2023, pp. 28492–28518.
[10] F. Doshi-Velez, B. Kim, Towards A Rigorous Science of Interpretable Machine Learning, 2017. arXiv:1702.08608.
[11] D. V. Carvalho, E. M. Pereira, J. S. Cardoso, Machine Learning Interpretability: A Survey on Methods and Metrics, Electronics 8 (2019). URL: https://www.mdpi.com/2079-9292/8/8/832. doi:10.3390/electronics8080832.
[12] G. Vilone, L. Longo, Notions of explainability and evaluation approaches for explainable artificial intelligence, Information Fusion 76 (2021) 89–106. URL: https://www.sciencedirect.com/science/article/pii/S1566253521001093. doi:10.1016/j.inffus.2021.05.009.
[13] R. Pradhan, J. Zhu, B. Glavic, B. Salimi, Interpretable Data-Based Explanations for Fairness Debugging, in: Proceedings of the 2022 International Conference on Management of Data, SIGMOD '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 247–261. URL: https://doi.org/10.1145/3514221.3517886. doi:10.1145/3514221.3517886.
[14] C. Panigutti, R. Hamon, I. Hupont, D. Fernandez Llorca, D. Fano Yela, H. Junklewitz, S. Scalzo, G. Mazzini, I. Sanchez, J. Soler Garrido, E. Gomez, The role of explainable AI in the context of the AI Act, in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 1139–1150. URL: https://doi.org/10.1145/3593013.3594069. doi:10.1145/3593013.3594069.
[15] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-jussà, A Primer on the Inner Workings of Transformer-based Language Models, 2024. arXiv:2405.00208.
[16] M. Ancona, E. Ceolini, C. Öztireli, M. Gross, Gradient-Based Attribution Methods, in: W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, K.-R. Müller (Eds.), Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer International Publishing, Cham, 2019, pp. 169–191. URL: https://doi.org/10.1007/978-3-030-28954-6_9. doi:10.1007/978-3-030-28954-6_9.
[17] W. Samek, K.-R. Müller, Towards Explainable Artificial Intelligence, in: W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, K.-R. Müller (Eds.), Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer International Publishing, Cham, 2019, pp. 5–22. URL: https://doi.org/10.1007/978-3-030-28954-6_1. doi:10.1007/978-3-030-28954-6_1.
[18] S. Agarwal, S. Jabbari, C. Agarwal, S. Upadhyay, S. Wu, H. Lakkaraju, Towards the Unification and Robustness of Perturbation and Gradient Based Explanations, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 110–119. URL: https://proceedings.mlr.press/v139/agarwal21c.html.
[19] M. Ivanovs, R. Kadikis, K. Ozols, Perturbation-based methods for explaining deep neural networks: A survey, Pattern Recognition Letters 150 (2021) 228–234. URL: https://www.sciencedirect.com/science/article/pii/S0167865521002440. doi:10.1016/j.patrec.2021.06.030.
[20] F. K. Došilović, M. Brčić, N. Hlupić, Explainable Artificial Intelligence: A Survey, in: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2018, pp. 210–215. doi:10.23919/MIPRO.2018.8400040.
[21] W. Saeed, C. Omlin, Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities, Knowledge-Based Systems 263 (2023) 110273. URL: https://www.sciencedirect.com/science/article/pii/S0950705123000230. doi:10.1016/j.knosys.2023.110273.
[22] Y. Zhou, S. Booth, M. T. Ribeiro, J. Shah, Do Feature Attribution Methods Correctly Attribute Features?, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022) 9623–9633. URL: https://ojs.aaai.org/index.php/AAAI/article/view/21196. doi:10.1609/aaai.v36i9.21196.
[23] D. Qin, G. Amariucai, D. Qiao, Y. Guan, S. Fu, A Comprehensive and Reliable Feature Attribution Method: Double-sided Remove and Reconstruct (DoRaR), 2023. arXiv:2310.17945.
[24] P. J. Phillips, C. Hahn, P. Fontana, A. Yates, K. K. Greene, D. Broniatowski, M. A. Przybocki, Four Principles of Explainable Artificial Intelligence, 2021. URL: https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=933399. doi:10.6028/NIST.IR.8312.
[25] A. Jacovi, Y. Goldberg, Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4198–4205. URL: https://aclanthology.org/2020.acl-main.386. doi:10.18653/v1/2020.acl-main.386.
[26] Q. Lyu, M. Apidianaki, C. Callison-Burch, Towards Faithful Model Explanation in NLP: A Survey, Computational Linguistics 50 (2024) 657–723. URL: https://aclanthology.org/2024.cl-2.6. doi:10.1162/coli_a_00511.
[27] S. Becker, J. Vielhaben, M. Ackermann, K.-R. Müller, S. Lapuschkin, W. Samek, AudioMNIST: Exploring Explainable Artificial Intelligence for audio analysis on a simple benchmark, Journal of the Franklin Institute 361 (2024) 418–428. URL: https://www.sciencedirect.com/science/article/pii/S0016003223007536. doi:10.1016/j.jfranklin.2023.11.038.
[28] E. Pastor, A. Koudounas, G. Attanasio, D. Hovy, E. Baralis, Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian's, Malta, 2024, pp. 2221–2238. URL: https://aclanthology.org/2024.eacl-long.136.
[29] V. A. Trinh, M. Mandel, Directly Comparing the Listening Strategies of Humans and Machines, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 312–323. doi:10.1109/TASLP.2020.3040545.
[30] K. Markert, R. Parracone, M. Kulakov, P. Sperl, C.-Y. Kao, K. Böttinger, Visualizing Automatic Speech Recognition – Means for a Better Understanding?, in: Proc. 2021 ISCA Symposium on Security and Privacy in Speech Communication, 2021, pp. 14–20. doi:10.21437/SPSC.2021-4.
[31] X. Wu, P. Bell, A. Rajan, Explanations for Automatic Speech Recognition, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi:10.1109/ICASSP49357.2023.10094635.
[32] X. Wu, P. Bell, A. Rajan, Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition, in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10296–10300. doi:10.1109/ICASSP48485.2024.10445989.
[33] N. Audibert, C. Fougeron, Intra-speaker phonetic variation in read speech: comparison with inter-speaker variability in a controlled population, in: Interspeech 2022, ISCA, Incheon, South Korea, 2022, pp. 4755–4759. URL: https://hal.science/hal-03852142. doi:10.21437/Interspeech.2022-10965.
[34] J. Clark, C. Yallop, An Introduction to Phonetics and Phonology, B. Blackwell, Oxford, UK, 1990.
[35] G. Yule, The Study of Language, 7 ed., Cambridge University Press, 2020.
[36] K. N. Stevens, Acoustic Phonetics, The MIT Press, 2000.
[37] N. H. van Schijndel, T. Houtgast, J. M. Festen, Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners, The Journal of the Acoustical Society of America 110 (2001) 529–542. URL: https://doi.org/10.1121/1.1378345. doi:10.1121/1.1378345.
[38] K. N. Stevens, Acoustic correlates of some phonetic categories, The Journal of the Acoustical Society of America 68 (1980) 836–842. doi:10.1121/1.384823.
[39] J. Honey, Sociophonology, John Wiley & Sons, Ltd, 2017, pp. 92–106. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781405166256.ch6. doi:10.1002/9781405166256.ch6.
[40] D. Hirst, Speech Prosody: from Acoustics to Interpretation, Springer Berlin, Heidelberg, 2024.
[41] C. T. Best, The Diversity of Tone Languages and the Roles of Pitch Variation in Non-tone Languages: Considerations for Tone Perception Research, Frontiers in Psychology 10 (2019). URL: https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00364. doi:10.3389/fpsyg.2019.00364.
[42] F. Alías, J. C. Socoró, X. Sevillano, A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds, Applied Sciences 6 (2016). URL: https://www.mdpi.com/2076-3417/6/5/143. doi:10.3390/app6050143.
[43] A. Mehrish, N. Majumder, R. Bharadwaj, R. Mihalcea, S. Poria, A review of deep learning techniques for speech processing, Information Fusion 99 (2023) 101869. URL: https://www.sciencedirect.com/science/article/pii/S1566253523001859. doi:10.1016/j.inffus.2023.101869.
[44] S. Communication, L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, P.-A. Duquenne, H. Elsahar, H. Gong, K. Heffernan, J. Hoffman, C. Klaiber, P. Li, D. Licht, J. Maillard, A. Rakotoarison, K. R. Sadagopan, G. Wenzek, E. Ye, B. Akula, P.-J. Chen, N. E. Hachem, B. Ellis, G. M. Gonzalez, J. Haaheim, P. Hansanti, R. Howes, B. Huang, M.-J. Hwang, H. Inaguma, S. Jain, E. Kalbassi, A. Kallet, I. Kulikov, J. Lam, D. Li, X. Ma, R. Mavlyutov, B. Peloquin, M. Ramadan, A. Ramakrishnan, A. Sun, K. Tran, T. Tran, I. Tufanov, V. Vogeti, C. Wood, Y. Yang, B. Yu, P. Andrews, C. Balioglu, M. R. Costa-jussà, O. Celebi, M. Elbayad, C. Gao, F. Guzmán, J. Kao, A. Lee, A. Mourachko, J. Pino, S. Popuri, C. Ropers, S. Saleem, H. Schwenk, P. Tomasello, C. Wang, J. Wang, S. Wang, SeamlessM4T: Massively Multilingual & Multimodal Machine Translation, 2023. URL: https://arxiv.org/abs/2308.11596. arXiv:2308.11596.
[45] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The Kaldi Speech Recognition Toolkit, in: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011, pp. 1–4. IEEE Catalog No.: CFP11SRW-USB.
[46] K. Ikarous, The Encoding of Vowel Features in Mel-Frequency Cepstral Coefficients, in: A. Vietti, L. Spreafico, D. Mereu, V. Galatà (Eds.), Il parlato nel contesto naturale [Speech in the natural context], Officinaventuno, Milano, 2018, pp. 9–18. URL: https://doi.org/10.17469/O2104AISV000001.
[47] B. McFee, M. McVicar, D. Faronbi, I. Roman, M. Gover, S. Balke, S. Seyfarth, A. Malek, C. Raffel, V. Lostanlen, B. van Niekirk, D. Lee, F. Cwitkowitz, F. Zalkow, O. Nieto, D. Ellis, J. Mason, K. Lee, B. Steers, E. Halvachs, C. Thomé, F. Robert-Stöter, R. Bittner, Z. Wei, A. Weiss, E. Battenberg, K. Choi, R. Yamamoto, C. Carr, A. Metsai, S. Sullivan, P. Friesch, A. Krishnakumar, S. Hidaka, S. Kowalik, F. Keller, D. Mazur, A. Chabot-Leclerc, C. Hawthorne, C. Ramaprasad, M. Keum, J. Gomez, W. Monroe, V. A. Morozov, K. Eliasi, nullmightybofo, P. Biberstein, N. D. Sergin, R. Hennequin, R. Naktinis, beantowel, T. Kim, J. P. Åsen, J. Lim, A. Malins, D. Hereñú, S. van der Struijk, L. Nickel, J. Wu, Z. Wang, T. Gates, M. Vollrath, A. Sarroff, Xiao-Ming, A. Porter, S. Kranzler, Voodoohop, M. D. Gangi, H. Jinoz, C. Guerrero, A. Mazhar, toddrme2178, Z. Baratz, A. Kostin, X. Zhuang, C. T. Lo, P. Campr, E. Semeniuc, M. Biswal, S. Moura, P. Brossier, H. Lee, W. Pimenta, librosa/librosa: 0.10.1, 2023. URL: https://doi.org/10.5281/zenodo.8252662. doi:10.5281/zenodo.8252662.
[48] M. Ravanelli, Y. Bengio, Interpretable Convolutional Filters with SincNet, 2019. arXiv:1811.09725.
[49] M. Angrick, C. Herff, G. Johnson, J. Shih, D. Krusienski, T. Schultz, Interpretation of convolutional neural networks for speech spectrogram regression from intracranial recordings, Neurocomputing 342 (2019) 145–151. URL: https://doi.org/10.1016/j.neucom.2018.10.080. doi:10.1016/j.neucom.2018.10.080.
[50] H. Fayyazi, Y. Shekofteh, IIRI-Net: An interpretable convolutional front-end inspired by IIR filters for speaker identification, Neurocomputing 558 (2023). URL: https://doi.org/10.1016/j.neucom.2023.126767. doi:10.1016/j.neucom.2023.126767.
[51] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Y. Ng, Deep Speech: Scaling up end-to-end speech recognition, 2014. arXiv:1412.5567.
[52] P. Lamere, P. Kwok, W. Walker, E. Gouvea, R. Singh, B. Raj, P. Wolf, Design of the CMU Sphinx-4 Decoder, in: Proc. 8th European Conference on Speech Communication and Technology (Eurospeech 2003), 2003, pp. 1181–1184. doi:10.21437/Eurospeech.2003-382.
[53] G. Sarti, N. Feldhus, L. Sickert, O. van der Wal, Inseq: An Interpretability Toolkit for Sequence Generation Models, in: D. Bollegala, R. Huang, A. Ritter (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 421–435. URL: https://aclanthology.org/2023.acl-demo.40. doi:10.18653/v1/2023.acl-demo.40.
[54] V. Miglani, A. Yang, A. Markosyan, D. Garcia-Olano, N. Kokhlikyan, Using Captum to Explain Generative Language Models, in: L. Tan, D. Milajevs, G. Chauhan, J. Gwinnup, E. Rippeth (Eds.), Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), Association for Computational Linguistics, Singapore, 2023, pp. 165–173. URL: https://aclanthology.org/2023.nlposs-1.19. doi:10.18653/v1/2023.nlposs-1.19.
[55] M. Langer, D. Oster, T. Speith, H. Hermanns, L. Kästner, E. Schmidt, A. Sesing, K. Baum, What do we want from Explainable Artificial Intelligence (XAI)? – A stakeholder perspective on XAI and a conceptual model guiding interdisciplinary XAI research, Artificial Intelligence 296 (2021) 103473. URL: https://www.sciencedirect.com/science/article/pii/S0004370221000242. doi:10.1016/j.artint.2021.103473.
[56] M. Calvano, A. Curci, A. Pagano, A. Piccinno, Speech Therapy Supported by AI and Smart Assistants, in: R. Kadgien, A. Jedlitschka, A. Janes, V. Lenarduzzi, X. Li (Eds.), Product-Focused Software Process Improvement, Springer Nature Switzerland, Cham, 2024, pp. 97–104.