Explainability for Speech Models: On the Challenges of Acoustic Feature Selection

Dennis Fucci1,2, Beatrice Savoldi2, Marco Gaido2, Matteo Negri2, Mauro Cettolo2 and Luisa Bentivogli2
1 University of Trento, Via Calepina, 14, 38122 Trento TN, Italy
2 Fondazione Bruno Kessler, Via Sommarive, 18, 38123 Trento TN, Italy

Abstract
Spurred by the demand for transparency and interpretability in Artificial Intelligence (AI), the field of eXplainable AI (XAI) has experienced significant growth, marked by both theoretical reflections and technical advancements. While various XAI techniques, especially feature attribution methods, have been extensively explored across diverse tasks, their adaptation to the speech modality is comparatively lagging behind. We argue that a key challenge in feature attribution for speech processing lies in identifying informative acoustic features. In this paper, we discuss the key challenges in selecting features for speech explanations. In light of existing research, we also highlight current gaps and propose future avenues to enhance the depth and informativeness of explanations for speech.

Keywords
Speech Models, Explainability, Feature Attribution

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
dfucci@fbk.eu (D. Fucci); bsavoldi@fbk.eu (B. Savoldi); mgaido@fbk.eu (M. Gaido); negri@fbk.eu (M. Negri); cettolo@fbk.eu (M. Cettolo); bentivo@fbk.eu (L. Bentivogli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

"Models are only as interpretable as their features." [1]

Spoken language—as perhaps our most natural form of interaction—is the foundational element of many technologies we interact with in our daily lives [2], from virtual assistants to voice dictation [3, 4, 5]. More recently, the emergence of highly capable speech foundation models [6, 7, 8, 9] has also facilitated and expanded the adoption of speech technologies on an unprecedented multilingual scale. In light of this proliferation, a need arises to prioritize transparency and interpretability, qualities already demanded in the growing landscape of Machine Learning (ML).

As a response, the field of eXplainable AI (XAI) has risen to prominence, with the aim of facilitating understanding of the rationale behind model decisions and fostering users' trust [10, 11, 12, 13]. XAI is also reinforced by the establishment of norms and legal frameworks, as seen in the European Union's General Data Protection Regulation, which enshrines the 'right to explanation', and the AI Act, which emphasizes transparency as a pivotal component of ML applications [14].

XAI encompasses various tasks and methods, such as identifying relevant model components for specific predictions, understanding the information processed by these components, and determining which input elements guide the model's predictions [15]. The latter task is the focus of feature attribution methods, which provide intuitive explanations by visualizing which input elements (e.g., pixels in an image or words in a sentence) have influenced the model's predictions. These methods assign a score to each input feature, quantifying its importance or contribution to the output: higher scores indicate greater importance of the corresponding input features for generating the output [16, 17, 18, 19]. They can help identify potential causes of errors and unexpected behaviors, as well as analyze the model's response to specific input properties. Overall, these explainability methods present the reasons why models make specific predictions by establishing a connection between input and output as a form of intuitive explanation for humans, thereby enhancing interpretability.1

1 Despite numerous efforts to differentiate the closely related concepts of explainability and interpretability, no consensus exists in the literature on their definitions [20]. In this paper, we adopt a perspective similar to that of Saeed and Omlin [21], where explainability refers to the process of extracting insights from a model's workings through specific techniques, while interpretability refers to the understanding process of those insights, crucial to make them actionable.

Over time, ongoing efforts have aimed to refine feature attribution techniques and provide more effective explanations [22, 23]. However, it is essential to recognize that the effectiveness of feature attribution explanations relies not only on the techniques themselves but also on the informativeness of the input features used as explanatory variables. If an explanation highlights unintelligible or poorly informative features, it does little to enhance the understanding of the model's behavior [1]. This can undermine key principles in XAI, such as accuracy—the property of correctly reflecting the factors that led the model to a specific decision, including all relevant information—and meaningfulness—the property of offering explanations that are comprehensible to the user [24].2

2 The properties of accuracy and meaningfulness can be associated with those of faithfulness and plausibility, respectively [25, 26].

In fields involving images or texts, feature representations are typically constrained to pixels and words, respectively. For speech, however, multiple input representations can be adopted, each emphasizing different acoustic aspects. Indeed, a sequence of speech elements does not only convey the meaning of what is said (like words in a text) but also bears a wealth of additional information useful for both human understanding and automatic processing (e.g., intonation, loudness, speaking rate). Consequently, when employing feature attribution methods, the resulting explanations can vary significantly in shape and focus on more or less informative characteristics depending on the type of speech representation used. To date, research on feature attribution for speech is notably limited to a few applications—including classification [27, 28] and generative tasks [29, 30, 31, 32]—which offer a somewhat fragmented picture in the choice of speech representations, thus providing limited insight into the relation between the features considered and the explanations built upon them.

In light of the above, this paper reflects on the impact of the chosen acoustic features in explaining the rationale behind speech models, aiming at a deeper understanding of the trade-offs associated with acoustic features. By first offering a gentle introduction to the rich and multidimensional nature of speech and its digital representation, we identify current gaps and potential avenues for effectively incorporating this multidimensionality into XAI for speech models. Our discussion focuses on two critical factors: i) the amount of information these features provide about the model's behavior, which influences the richness of the explanations, and ii) the level of detail of such information, which determines the granularity of the explanations. We also explore how these aspects impact both the accuracy and meaningfulness of the explanations, ultimately shaping their overall interpretability.

2. The Correlates of Speech

To gain deeper insight into the complexities of defining informative features in speech, we explore key characteristics of speech and their implications for modeling.

Speech is a multifaceted phenomenon. It is grounded in the materiality of sound to convey linguistic content (i.e., what is said), which is modulated by several paralinguistic cues (i.e., how it is said) entailing extensive variation—also for single individual speakers [33]. As such, it comprises several dimensions, which are hard to pin down individually, but collectively amount to what we intuitively and simply perceive as spoken language.

From a linguistic perspective, the spoken communication system consists of the combination of phonemes,3 which are regarded as the smallest meaningful units of sound [34, 35]. Physically, it involves the continuous flow of sounds shaped by the movements of our phonatory organs, transmitted as sound waves [36]. Perceptually, we process speech through three primary dimensions [37]: i) time, or the sequential occurrence of sounds;4 ii) intensity, corresponding to the energy level of the wave due to the strength of molecular vibration, which we perceive as loudness; iii) frequency, regarding the rate of vibrations produced by the vocal cords—interpreted as pitch—and whose modulation is responsible for shaping the type of speech sound.

3 Throughout the paper, we use the abstract category of phoneme to denote individual speech sounds. However, when discussing their actual realizations, it is more accurate to refer to them as phones [34].
4 E.g., the order of sounds between /pAt/ (pot) and /tAp/ (top) differentiates two words.

These three elements, known as acoustic correlates [38], are specific to both speakers and phonemes. For example, speakers possess unique characteristics, including pitch and speaking rate [33], and also exhibit high variability stemming from various sociodemographic factors such as gender, age, and dialect [39]. In these cases, the speech content needs to be disentangled from the variability of its delivery. Conversely, language sounds exhibit variability in duration—e.g., /i/ in ship and sheep—and are distinguished by specific frequency ranges [36]. The frequency dimension also plays a vital role in shaping suprasegmental aspects of speech—broader phenomena that span multiple segments—such as intonation, obtained by varying pitch [40]. Pitch, for instance, has a distinctive function in tonal languages, where it is used to distinguish lexical or grammatical meaning [41]. But even in non-tonal languages, these prosodic elements are indispensable for delivering different meanings and intents, as the reader can perceive by reading out loud two contrastive sentences such as "You got the joke right" and "You got the joke, right?", where pauses and prosody play pivotal roles.

All these factors add to the multidimensionality of speech, which feature engineering strives to encapsulate and which cannot be overlooked in the explanatory process.

3. Speech Representations

While various representations are used to encode speech in a digital format, three main types are commonly given as input to state-of-the-art speech models (for a review, see [42, 43]): waveforms, spectrograms, and mel-frequency cepstral coefficients (MFCCs), which are shown in Figure 1.

The waveform serves as the most fundamental representation of a signal, comprising sequences of samples (e.g., 16,000 per second), each indicating the amplitude of the signal at a specific point in time—essentially, the fluctuations in air pressure over time. This type of representation is leveraged by models like Wav2vec [6].
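As a concrete illustration of the sampling just described, the following minimal NumPy sketch (not tied to any specific toolkit, and using a synthetic tone rather than recorded speech) builds one second of a 440 Hz signal at a 16,000 Hz sampling rate; each array entry is the signal's amplitude at one instant:

```python
import numpy as np

# A one-second, 440 Hz tone "recorded" at 16,000 samples per second.
# Each entry of `waveform` is the signal's amplitude at one point in
# time, mirroring the fluctuations in air pressure described above.
sr = 16_000                    # sampling rate (samples per second)
t = np.arange(sr) / sr         # 16,000 time points spanning one second
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)

print(waveform.shape)          # (16000,): one amplitude value per sample
print(round(float(waveform.max()), 3))   # 0.5: the peak amplitude
```

Even this toy example makes the granularity issue tangible: a single second of audio already yields 16,000 individual input values.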
The spectrogram results from feature engineering operations that decompose the speech signal into its frequencies, presenting a 2D visualization of frequency distributions over time. These representations are commonly depicted as heatmaps, where color intensity corresponds to the energy of a specific frequency at a given moment. The time unit in spectrograms is a fixed-length window of a few milliseconds (e.g., 25), commonly referred to as a frame, within which a given number of waveform samples are encompassed. Notably, the articulation of sounds produces time-frequency patterns which are visible as darker regions [36]. Prominent examples of state-of-the-art models leveraging spectrograms are Whisper [9] and SeamlessM4T [44].

The MFCCs offer another 2D representation where each coefficient captures important details about how the frequency content of the signal changes over time. Like spectrograms, MFCCs offer information about both frequency and time, but in a more compact form. MFCCs are commonly used in the implementation of ASR models within popular toolkits like Kaldi5 [45] and Mozilla DeepSpeech6.

5 See https://kaldi-asr.org/doc/feat.html.
6 See https://deepspeech.readthedocs.io/en/master/DeepSpeech.html.

Figure 1: Schematic illustration of the primary speech representations used by state-of-the-art speech models for the utterance "This is a waveform". The features were computed using Librosa 0.10.1 [47].

Overall, though different in nature, these three types of representations are all effectively exploited by current speech models.7 For human understanding, however, they vary in terms of informativeness with respect to the acoustic correlates discussed in §2. Indeed, although both intensity and frequency are somewhat discernible in waveforms, qualitative distinctions of patterns specific to pitch or phoneme frequencies are rarely feasible [36]. Comparatively, spectrograms and MFCCs are richer and more descriptive, because they capture the multiple dimensions of time, frequency, and intensity in finer detail. Still, spectrograms are more conducive to phonetic analyses, given the established knowledge in analyzing frequency patterns over time within this representation [36]. In contrast, MFCCs are rarely used for phonetic analysis [46].

7 We are not aware of any recent study attributing higher system performance to the representation used.

Overall, while weighing the informativeness and selection of speech representations requires a certain expertise in speech processing, being aware of the trade-offs they intrinsically entail is crucial for carefully conducting XAI examination in speech. Indeed, it is precisely upon such input features—and their trade-offs—that explanations are built.
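To make the relationship between these representations concrete, here is a schematic NumPy-only sketch: it slices a waveform into 25 ms frames with a 10 ms hop, takes magnitude FFTs to obtain a spectrogram, and then compresses each frame with a DCT of the log spectrum. Note that the last step is a simplified stand-in for true MFCC extraction, which also applies a mel filterbank (toolkits such as Librosa implement the full pipeline); the signal is a synthetic tone, not speech.

```python
import numpy as np

def spectrogram(wave, sr=16_000, frame_ms=25, hop_ms=10):
    """Slice the waveform into overlapping frames and take each frame's
    magnitude FFT: frequency content over time."""
    flen, hop = sr * frame_ms // 1000, sr * hop_ms // 1000
    n_frames = 1 + (len(wave) - flen) // hop
    frames = np.stack([wave[i * hop : i * hop + flen] for i in range(n_frames)])
    frames = frames * np.hanning(flen)            # taper frame edges
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, n_frames)

def cepstral_coeffs(spec, n_coeffs=13):
    """Toy cepstral features: a DCT of the log spectrum compresses each
    frame into a few coefficients (real MFCCs first apply a mel filterbank)."""
    n_bins = spec.shape[0]
    k = np.arange(n_bins)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1) / (2 * n_bins))
    return basis @ np.log(spec + 1e-10)           # shape: (n_coeffs, n_frames)

sr = 16_000
wave = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s, 440 Hz tone
spec = spectrogram(wave)
print(spec.shape)                         # (201, 98): 201 freq bins x 98 frames
print(int(np.argmax(spec.mean(axis=1))))  # bin 11 -> 11 * 40 Hz = 440 Hz
print(cepstral_coeffs(spec).shape)        # (13, 98): more compact per frame
```

The shapes illustrate the trade-off discussed above: 16,000 raw samples become 98 frames of 201 frequency bins, and then 98 frames of just 13 coefficients, with each step trading temporal or spectral detail for compactness.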
4. Richness of Explanations

Considering the foregoing, there is a causal relationship whereby the explanatory possibilities in speech XAI are inherently limited by the richness of the audio features used, specifically the dimensions they encapsulate. This limitation directly constrains the richness of the resulting explanations. Also, owing to the compatibility of current models with various representation types, the explanations generated are inevitably confined by the specific input features provided to the model. To exemplify, if models process audio as waveforms—which poorly represent the frequency dimension for human understanding—explanations accounting for such a correlate will be out of reach. In fact, previous works by Wu et al. [31] and Wu et al. [32], based on waveforms, solely focus on the temporal dimension to explain ASR. In these cases, to avoid limiting the understanding of the models' behavior to one single dimension, it would be advisable to explore alternative techniques that offer deeper insights into how models process other acoustic correlates. For instance, Pastor et al. [28] integrated counterfactual explanations to specifically investigate whether selected paralinguistic features such as pitch, speaking rate, and background noise were influential for the model's prediction. Additionally, various techniques exist to analyze how models extract relevant patterns from waveforms through convolutions [48, 49, 50].

When the selected input features represent multiple dimensions, as in the case of spectrograms or MFCCs, the decision to account for only one of these dimensions becomes arbitrary. For example, two models tested by Wu et al. [31], namely DeepSpeech [51] and Sphinx [52], are fed with spectrograms and MFCCs, respectively. However, explanations based on raw waveforms are provided for these models. This inconsistency between the features used in the explanations and those used by the models inevitably offers only a partial overview of the models' behavior and limits the exploration of important acoustic aspects. This, in turn, can impact the accuracy of the explanations, which ideally should encompass all relevant information.

To prioritize explanation accuracy and conduct analyses that consider the crucial role of acoustic correlates such as frequency, it is advisable to take into account all dimensions embedded in the speech representation. This approach is exemplified by the works of Markert et al. [30], who provide explanations that account for the most influential elements in MFCCs, as well as Trinh and Mandel [29] and Becker et al. [27], who base the explanations on spectrograms. In the work by Markert et al. [30], however, it is challenging to connect the results with specific acoustic parameters due to the complexity of analyzing MFCCs (see §3), which significantly undermines the meaningfulness of the explanations. In contrast, explanations using spectrograms offer valuable insights into how machines process speech, producing both accurate and meaningful results. For instance, Trinh and Mandel [29] demonstrated that neural ASR models focus on high-energy time-frequency regions for transcription, while Becker et al. [27] found that lower frequency ranges, typically associated with pitch, exhibit higher attribution scores in speaker gender classification tasks, showing some alignment with human speech processing. However, interpreting these insights requires specialized expertise, which can reduce the meaningfulness of explanations for non-experts. This highlights that, even in speech, the balance between accuracy and meaningfulness can vary depending on the context [24].

5. Granularity of Explanations

Another critical factor concerning the informativeness of input features is the level of granularity at which the features are considered during the explanatory process. This decision affects the level of detail in the resulting explanations and, consequently, their accuracy—as more detailed explanations may more accurately reflect the model's behavior—and their meaningfulness—as detailed and comprehensive explanations can be more difficult to interpret [12, 24].

In the time domain, for example, input features are highly fine-grained. As discussed in §3, spectrograms typically contain frames spanning tens of milliseconds, capturing detailed frequency content within each frame, whereas waveforms are composed of samples taken at much shorter time intervals—for instance, as mentioned in §3, there can be 16,000 samples in just one second. This level of detail poses great challenges for (human) comprehension, particularly for a broader audience, since mapping groups of frames/samples in an explanation to recognizable speech units is highly time-consuming and requires specialized expertise.

Accordingly, to address this issue and make explanations for speech more broadly accessible, previous works have leveraged textual transcripts within the explanation process. More specifically, Wu et al. [32] and Pastor et al. [28] resort to the alignment of audio to text, for individual phonemes and words respectively, and apply explainability techniques to such units. While this approach helps decipher the contribution of input features based on more intuitive linguistic units, it diverges from how current models process speech features in small frames and samples [43]. This divergence risks overlooking the model's behavior and compromises the accuracy and effectiveness of the explanations. For instance, whether ASR systems rely on shorter or longer time intervals than individual words remains unclear [29]. Therefore, analyzing this aspect requires a more granular approach at the time level.

In light of the above, explanations should be obtained with low-level units to avoid biasing explanations towards human understanding. The use of audio-transcript alignment to aid the analysis of explanations can be very useful, but it should occur downstream of the explanation process, not upstream. In this way, we can maximize the use of all available units to generate detailed and accurate explanations, and then aggregate scores from individual frames or samples to create more compact representations at the level of phonemes or words, ensuring flexibility in the meaningfulness of the explanations according to specific needs. This bottom-up approach mirrors practices in the text domain, providing adaptability in defining attribution units that can range from subwords to words or phrases [53, 54].
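The bottom-up aggregation just described can be illustrated with a toy example. All names and numbers below are hypothetical: frame-level attribution scores (one per 10 ms spectrogram frame) are assumed to have already been computed by some attribution method, and a word-level audio-text alignment is applied only downstream, to pool them into word scores:

```python
import numpy as np

def aggregate(frame_scores, alignment, hop_ms=10):
    """Pool fine-grained frame-level attribution scores into word-level
    scores, downstream of the explanation process. `alignment` maps each
    word to its (start_ms, end_ms) span in the audio."""
    word_scores = {}
    for word, (start_ms, end_ms) in alignment:
        lo, hi = start_ms // hop_ms, end_ms // hop_ms
        word_scores[word] = float(frame_scores[lo:hi].mean())
    return word_scores

# Hypothetical 1-second utterance: 100 frame-level scores (synthetic,
# peaking around 600 ms) and a word alignment given in milliseconds.
frame_scores = np.exp(-((np.arange(100) - 60) ** 2) / 200)
alignment = [("you", (0, 300)), ("got", (300, 550)),
             ("the", (550, 700)), ("joke", (700, 1000))]
word_scores = aggregate(frame_scores, alignment)
print(max(word_scores, key=word_scores.get))  # "the": its span covers the peak
```

Because pooling happens after attribution, the same frame-level scores can be re-aggregated at phoneme, word, or phrase level as needed, without recomputing the explanation.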
6. Conclusion

This paper has examined the role of acoustic features and their selection in explaining speech models. More specifically, we considered a specific subfield of XAI, namely feature attribution, which connects input features to outputs as a form of explanation. Previous research has not explicitly addressed how to incorporate features into the explanation process within the speech domain, where input is encoded in more varied ways compared to other fields, such as text. This has led to diverse approaches, each with different implications for what can and cannot be explained about model behavior, and with the risk of not fully or accurately representing the model's functioning.

By discussing the key characteristics of speech and the properties of the most widely adopted acoustic features, we argue that explanations should ideally encompass all available dimensions, particularly time and frequency, as both are essential for a comprehensive understanding of the models' rationale. We have also discussed the challenges associated with aligning explanations at high granularity with human understanding, emphasizing solutions that provide flexibility in the analysis, allowing for adjustments between more or less detail as needed.

Building on these insights, our ongoing research focuses on developing feature attribution techniques that operate on spectrograms at the finest possible unit level, integrating both time and frequency dimensions. Our aim is to generate explanations that are accurate and meaningful for experts, as well as adaptable for non-expert users. More broadly, we hope that our reflections will be beneficial and thought-provoking for researchers currently working in, or entering, the field of XAI for speech models, thereby contributing to a deeper understanding of the rationale behind these models.

7. Limitations

While exploring the relationship between the informativeness of speech features and explanations, we have deliberately not delved into the needs of specific stakeholders for XAI applications. Indeed, different stakeholders present varying needs [55, 56], and considering them is a research avenue of paramount importance for the growth of XAI. However, as XAI for speech is a nascent area of investigation still in its infancy, we prioritized fundamental methodological and design decisions that favor a comprehensive and detailed understanding of the model's rationale at a low level. Accordingly, our reflections might be more appealing for a range of users who engage with speech models and possess expertise in machine learning and/or speech analysis, ranging from developers to speech therapists assisted by speech models [56].

The balance of richness and granularity—which also relates to the interplay between accuracy and meaningfulness—is also relevant to common users who interact with speech technologies. However, investigating how explanations can be effectively communicated to and understood by these users in the context of daily speech technology use exceeds the scope of this paper and warrants further exploration.

8. Acknowledgments

We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU. This paper has also received funding from the European Union's Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People).

References

[1] A. Zytek, I. Arnaldo, D. Liu, L. Berti-Equille, K. Veeramachaneni, The Need for Interpretable Features: Motivation and Taxonomy, SIGKDD Explor. Newsl. 24 (2022) 1–13. URL: https://doi.org/10.1145/3544903.3544905. doi:10.1145/3544903.3544905.
[2] C. Munteanu, M. Jones, S. Oviatt, S. Brewster, G. Penn, S. Whittaker, N. Rajput, A. Nanavati, We need to talk: HCI and the delicate topic of spoken language interaction, in: CHI '13 Extended Abstracts on Human Factors in Computing Systems, CHI EA '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 2459–2464. URL: https://doi.org/10.1145/2468356.2468803. doi:10.1145/2468356.2468803.
[3] H. Feng, K. Fawaz, K. G. Shin, Continuous Authentication for Voice Assistants, in: Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, MobiCom '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 343–355. URL: https://doi.org/10.1145/3117811.3117823. doi:10.1145/3117811.3117823.
[4] P. Cheng, U. Roedig, Personal Voice Assistant Security and Privacy—A Survey, Proceedings of the IEEE 110 (2022) 476–507. doi:10.1109/JPROC.2022.3153167.
[5] S. Malodia, N. Islam, P. Kaur, A. Dhir, Why Do People Use Artificial Intelligence (AI)-Enabled Voice Assistants?, IEEE Transactions on Engineering Management 71 (2024) 491–505. doi:10.1109/TEM.2021.3117884.
[6] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Curran Associates Inc., Red Hook, NY, USA, 2020, pp. 12449–12460.
[7] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451–3460. URL: https://doi.org/10.1109/TASLP.2021.3122291. doi:10.1109/TASLP.2021.3122291.
[8] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, F. Wei, WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1505–1518. doi:10.1109/JSTSP.2022.3188113.
[9] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust Speech Recognition via Large-Scale Weak Supervision, in: Proceedings of the 40th International Conference on Machine Learning, ICML'23, JMLR.org, 2023, pp. 28492–28518.
[10] F. Doshi-Velez, B. Kim, Towards A Rigorous Science of Interpretable Machine Learning, 2017. arXiv:1702.08608.
[11] D. V. Carvalho, E. M. Pereira, J. S. Cardoso, Machine Learning Interpretability: A Survey on Methods and Metrics, Electronics 8 (2019). URL: https://www.mdpi.com/2079-9292/8/8/832. doi:10.3390/electronics8080832.
[12] G. Vilone, L. Longo, Notions of explainability and evaluation approaches for explainable artificial intelligence, Information Fusion 76 (2021) 89–106. URL: https://www.sciencedirect.com/science/article/pii/S1566253521001093. doi:10.1016/j.inffus.2021.05.009.
[13] R. Pradhan, J. Zhu, B. Glavic, B. Salimi, Interpretable Data-Based Explanations for Fairness Debugging, in: Proceedings of the 2022 International Conference on Management of Data, SIGMOD '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 247–261. URL: https://doi.org/10.1145/3514221.3517886. doi:10.1145/3514221.3517886.
[14] C. Panigutti, R. Hamon, I. Hupont, D. Fernandez Llorca, D. Fano Yela, H. Junklewitz, S. Scalzo, G. Mazzini, I. Sanchez, J. Soler Garrido, E. Gomez, The role of explainable AI in the context of the AI Act, in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 1139–1150. URL: https://doi.org/10.1145/3593013.3594069. doi:10.1145/3593013.3594069.
[15] J. Ferrando, G. Sarti, A. Bisazza, M. R. Costa-jussà, A Primer on the Inner Workings of Transformer-based Language Models, 2024. arXiv:2405.00208.
[16] M. Ancona, E. Ceolini, C. Öztireli, M. Gross, Gradient-Based Attribution Methods, in: W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, K.-R. Müller (Eds.), Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer International Publishing, Cham, 2019, pp. 169–191. URL: https://doi.org/10.1007/978-3-030-28954-6_9. doi:10.1007/978-3-030-28954-6_9.
[17] W. Samek, K.-R. Müller, Towards Explainable Artificial Intelligence, in: W. Samek, G. Montavon, A. Vedaldi, L. K. Hansen, K.-R. Müller (Eds.), Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Springer International Publishing, Cham, 2019, pp. 5–22. URL: https://doi.org/10.1007/978-3-030-28954-6_1. doi:10.1007/978-3-030-28954-6_1.
[18] S. Agarwal, S. Jabbari, C. Agarwal, S. Upadhyay, S. Wu, H. Lakkaraju, Towards the Unification and Robustness of Perturbation and Gradient Based Explanations, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 110–119. URL: https://proceedings.mlr.press/v139/agarwal21c.html.
[19] M. Ivanovs, R. Kadikis, K. Ozols, Perturbation-based methods for explaining deep neural networks: A survey, Pattern Recognition Letters 150 (2021) 228–234. URL: https://www.sciencedirect.com/science/article/pii/S0167865521002440. doi:10.1016/j.patrec.2021.06.030.
[20] F. K. Došilović, M. Brčić, N. Hlupić, Explainable Artificial Intelligence: A Survey, in: 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2018, pp. 210–215. doi:10.23919/MIPRO.2018.8400040.
[21] W. Saeed, C. Omlin, Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities, Knowledge-Based Systems 263 (2023) 110273. URL: https://www.sciencedirect.com/science/article/pii/S0950705123000230. doi:10.1016/j.knosys.2023.110273.
[22] Y. Zhou, S. Booth, M. T. Ribeiro, J. Shah, Do Feature Attribution Methods Correctly Attribute Features?, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022) 9623–9633. URL: https://ojs.aaai.org/index.php/AAAI/article/view/21196. doi:10.1609/aaai.v36i9.21196.
[23] D. Qin, G. Amariucai, D. Qiao, Y. Guan, S. Fu, A Comprehensive and Reliable Feature Attribution Method: Double-sided Remove and Reconstruct (DoRaR), 2023. arXiv:2310.17945.
[24] P. J. Phillips, C. Hahn, P. Fontana, A. Yates, K. K. Greene, D. Broniatowski, M. A. Przybocki, Four Principles of Explainable Artificial Intelligence, 2021. URL: https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=933399. doi:10.6028/NIST.IR.8312.
[25] A. Jacovi, Y. Goldberg, Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4198–4205. URL: https://aclanthology.org/2020.acl-main.386. doi:10.18653/v1/2020.acl-main.386.
[26] Q. Lyu, M. Apidianaki, C. Callison-Burch, Towards Faithful Model Explanation in NLP: A Survey, Computational Linguistics 50 (2024) 657–723. URL: https://aclanthology.org/2024.cl-2.6. doi:10.1162/coli_a_00511.
[27] S. Becker, J. Vielhaben, M. Ackermann, K.-R. Müller, S. Lapuschkin, W. Samek, AudioMNIST: Exploring Explainable Artificial Intelligence for audio analysis on a simple benchmark, Journal of the Franklin Institute 361 (2024) 418–428. URL: https://www.sciencedirect.com/science/article/pii/S0016003223007536. doi:10.1016/j.jfranklin.2023.11.038.
[28] E. Pastor, A. Koudounas, G. Attanasio, D. Hovy, E. Baralis, Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian's, Malta, 2024, pp. 2221–2238. URL: https://aclanthology.org/2024.eacl-long.136.
[29] V. A. Trinh, M. Mandel, Directly Comparing the Listening Strategies of Humans and Machines, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 312–323. doi:10.1109/TASLP.2020.3040545.
[30] K. Markert, R. Parracone, M. Kulakov, P. Sperl, C.-Y. Kao, K. Böttinger, Visualizing Automatic Speech Recognition – Means for a Better Understanding?, in: Proc. 2021 ISCA Symposium on Security and Privacy in Speech Communication, 2021, pp. 14–20. doi:10.21437/SPSC.2021-4.
[31] X. Wu, P. Bell, A. Rajan, Explanations for Automatic Speech Recognition, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi:10.1109/ICASSP49357.2023.10094635.
[32] X. Wu, P. Bell, A. Rajan, Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition, in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10296–10300. doi:10.1109/ICASSP48485.2024.10445989.
[33] N. Audibert, C. Fougeron, Intra-speaker phonetic variation in read speech: comparison with inter-speaker variability in a controlled population, in: Interspeech 2022, ISCA, Incheon, South Korea, 2022, pp. 4755–4759. URL: https://hal.science/hal-03852142. doi:10.21437/Interspeech.2022-10965.
[34] J. Clark, C. Yallop, An Introduction to Phonetics and Phonology, B. Blackwell, Oxford, UK, 1990.
[35] G. Yule, The Study of Language, 7 ed., Cambridge University Press, 2020.
[36] K. N. Stevens, Acoustic Phonetics, The MIT Press, 2000.
[37] N. H. van Schijndel, T. Houtgast, J. M. Festen, Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners, The Journal of the Acoustical Society of America 110 (2001) 529–542. URL: https://doi.org/10.1121/1.1378345. doi:10.1121/1.1378345.
[38] K. N. Stevens, Acoustic correlates of some phonetic categories, The Journal of the Acoustical Society of America 68 (1980) 836–842. doi:10.1121/1.384823.
[39] J. Honey, Sociophonology, John Wiley & Sons, Ltd, 2017, pp. 92–106. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/9781405166256.ch6. doi:10.1002/9781405166256.ch6.
[40] D. Hirst, Speech Prosody: from Acoustics to Interpretation, Springer Berlin, Heidelberg, 2024.
[41] C. T. Best, The Diversity of Tone Languages and the Roles of Pitch Variation in Non-tone Languages: Considerations for Tone Perception Research, Frontiers in Psychology 10 (2019). URL: https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00364. doi:10.3389/fpsyg.2019.00364.
[42] F. Alías, J. C. Socoró, X. Sevillano, A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds, Applied Sciences 6 (2016). URL: https://www.mdpi.com/2076-3417/6/5/143. doi:10.3390/app6050143.
[43] A. Mehrish, N. Majumder, R. Bharadwaj, R. Mihalcea, S. Poria, A review of deep learning techniques for speech processing, Information Fusion 99 (2023) 101869. URL: https://www.sciencedirect.
bofo, P. Biberstein, N. D. Sergin, R. Hennequin, R. Naktinis, beantowel, T. Kim, J. P. Åsen, J. Lim, A. Malins, D. Hereñú, S. van der Struijk, L. Nickel, J. Wu, Z. Wang, T. Gates, M. Vollrath, A. Sarroff, Xiao-Ming, A. Porter, S. Kranzler, Voodoohop, M. D. Gangi, H. Jinoz, C. Guerrero, A. Mazhar, toddrme2178, Z. Baratz, A. Kostin, X. Zhuang, C. T. Lo, P. Campr, E. Semeniuc, M. Biswal, S.
Moura, com/science/article/pii/S1566253523001859. P. Brossier, H. Lee, W. Pimenta, librosa/librosa: doi:https://doi.org/10.1016/j.inffus. 0.10.1, 2023. URL: https://doi.org/10.5281/zenodo. 2023.101869. 8252662. doi:10.5281/zenodo.8252662. [44] S. Communication, L. Barrault, Y.-A. Chung, M. C. [48] M. Ravanelli, Y. Bengio, Interpretable Meglioli, D. Dale, N. Dong, P.-A. Duquenne, H. Elsa- Convolutional Filters with SincNet, 2019. har, H. Gong, K. Heffernan, J. Hoffman, C. Klaiber, arXiv:1811.09725. P. Li, D. Licht, J. Maillard, A. Rakotoarison, K. R. [49] M. Angrick, C. Herff, G. Johnson, J. Shih, D. Krusien- Sadagopan, G. Wenzek, E. Ye, B. Akula, P.-J. Chen, ski, T. Schultz, Interpretation of convolu- N. E. Hachem, B. Ellis, G. M. Gonzalez, J. Haa- tional neural networks for speech spectrogram heim, P. Hansanti, R. Howes, B. Huang, M.-J. regression from intracranial recordings, Neu- Hwang, H. Inaguma, S. Jain, E. Kalbassi, A. Kallet, rocomput. 342 (2019) 145–151. URL: https://doi. I. Kulikov, J. Lam, D. Li, X. Ma, R. Mavlyutov, org/10.1016/j.neucom.2018.10.080. doi:10.1016/j. B. Peloquin, M. Ramadan, A. Ramakrishnan, A. Sun, neucom.2018.10.080. K. Tran, T. Tran, I. Tufanov, V. Vogeti, C. Wood, [50] H. Fayyazi, Y. Shekofteh, IIRI-Net: An interpretable Y. Yang, B. Yu, P. Andrews, C. Balioglu, M. R. Costa- convolutional front-end inspired by IIR filters for jussà, O. Celebi, M. Elbayad, C. Gao, F. Guzmán, speaker identification, Neurocomput. 558 (2023). J. Kao, A. Lee, A. Mourachko, J. Pino, S. Popuri, URL: https://doi.org/10.1016/j.neucom.2023.126767. C. Ropers, S. Saleem, H. Schwenk, P. Tomasello, doi:10.1016/j.neucom.2023.126767. C. Wang, J. Wang, S. Wang, SeamlessM4T: Mas- [51] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Di- sively Multilingual & Multimodal Machine Trans- amos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, lation, 2023. URL: https://arxiv.org/abs/2308.11596. A. Coates, A. Y. Ng, Deep Speech: Scaling up end-to- arXiv:2308.11596. 
end speech recognition, 2014. arXiv:1412.5567. [45] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, [52] P. Lamere, P. Kwok, W. Walker, E. Gouvea, R. Singh, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, B. Raj, P. Wolf, Design of the CMU Sphinx- Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, 4 Decoder, in: Proc. 8th European Confer- K. Vesely, The Kaldi Speech Recognition Toolkit, in: ence on Speech Communication and Technology IEEE 2011 Workshop on Automatic Speech Recogni- (Eurospeech 2003), 2003, pp. 1181–1184. doi:10. tion and Understanding, IEEE Signal Processing So- 21437/Eurospeech.2003-382. ciety, 2011, pp. 1–4. IEEE Catalog No.: CFP11SRW- [53] G. Sarti, N. Feldhus, L. Sickert, O. van der Wal, Inseq: USB. An Interpretability Toolkit for Sequence Generation [46] K. Ikarous, The Encoding of Vowel Features in Models, in: D. Bollegala, R. Huang, A. Ritter (Eds.), Mel-Frequency Cepstral Coefficients, in: A. Vietti, Proceedings of the 61st Annual Meeting of the As- L. Spreafico, D. Mereu, V. Galatà (Eds.), Il parlato nel sociation for Computational Linguistics (Volume 3: contesto naturale [Speech in the natural context], System Demonstrations), Association for Computa- Officinaventuno, Milano, 2018, p. 9–18. URL: https: tional Linguistics, Toronto, Canada, 2023, pp. 421– //doi.org/10.17469/O2104AISV000001. 435. URL: https://aclanthology.org/2023.acl-demo. [47] B. McFee, M. McVicar, D. Faronbi, I. Roman, 40. doi:10.18653/v1/2023.acl-demo.40. M. Gover, S. Balke, S. Seyfarth, A. Malek, C. Raffel, [54] V. Miglani, A. Yang, A. Markosyan, D. Garcia- V. Lostanlen, B. van Niekirk, D. Lee, F. Cwitkowitz, Olano, N. Kokhlikyan, Using Captum to Ex- F. Zalkow, O. Nieto, D. Ellis, J. Mason, K. Lee, plain Generative Language Models, in: L. Tan, B. Steers, E. Halvachs, C. Thomé, F. Robert- D. Milajevs, G. Chauhan, J. Gwinnup, E. Rippeth Stöter, R. Bittner, Z. Wei, A. Weiss, E. Batten- (Eds.), Proceedings of the 3rd Workshop for Nat- berg, K. Choi, R. Yamamoto, C. Carr, A. 
Metsai, ural Language Processing Open Source Software S. Sullivan, P. Friesch, A. Krishnakumar, S. Hidaka, (NLP-OSS 2023), Association for Computational S. Kowalik, F. Keller, D. Mazur, A. Chabot-Leclerc, Linguistics, Singapore, 2023, pp. 165–173. URL: C. Hawthorne, C. Ramaprasad, M. Keum, J. Gomez, https://aclanthology.org/2023.nlposs-1.19. doi:10. W. Monroe, V. A. Morozov, K. Eliasi, nullmighty- 18653/v1/2023.nlposs-1.19. [55] M. Langer, D. Oster, T. Speith, H. Hermanns, L. Kästner, E. Schmidt, A. Sesing, K. Baum, What do we want from Explainable Artificial Intelligence (XAI)? – A stakeholder perspective on XAI and a conceptual model guiding interdisciplinary XAI research, Artificial Intelligence 296 (2021) 103473. URL: https://www.sciencedirect.com/science/ article/pii/S0004370221000242. doi:https://doi. org/10.1016/j.artint.2021.103473. [56] M. Calvano, A. Curci, A. Pagano, A. Piccinno, Speech Therapy Supported by AI and Smart Assistants, in: R. Kadgien, A. Jedlitschka, A. Janes, V. Lenarduzzi, X. Li (Eds.), Product- Focused Software Process Improvement, Springer Nature Switzerland, Cham, 2024, pp. 97–104.