<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Explainability for Speech Models: On the Challenges of Acoustic Feature Selection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Dennis</forename><surname>Fucci</surname></persName>
							<email>dfucci@fbk.eu</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Trento</orgName>
								<address>
									<addrLine>Via Calepina, 14</addrLine>
									<postCode>38122</postCode>
									<settlement>Trento</settlement>
									<region>TN</region>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive, 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Trento</settlement>
									<region>TN</region>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Beatrice</forename><surname>Savoldi</surname></persName>
							<email>bsavoldi@fbk.eu</email>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive, 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Trento</settlement>
									<region>TN</region>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Gaido</surname></persName>
							<email>mgaido@fbk.eu</email>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive, 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Trento</settlement>
									<region>TN</region>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matteo</forename><surname>Negri</surname></persName>
							<email>negri@fbk.eu</email>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive, 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Trento</settlement>
									<region>TN</region>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mauro</forename><surname>Cettolo</surname></persName>
							<email>cettolo@fbk.eu</email>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive, 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Trento</settlement>
									<region>TN</region>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luisa</forename><surname>Bentivogli</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Fondazione Bruno Kessler</orgName>
								<address>
									<addrLine>Via Sommarive, 18</addrLine>
									<postCode>38123</postCode>
									<settlement>Trento</settlement>
									<region>TN</region>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Explainability for Speech Models: On the Challenges of Acoustic Feature Selection</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C3838FC2307220E849F624C154382F44</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Speech Models</term>
					<term>Explainability</term>
					<term>Feature Attribution</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Spurred by the demand for transparency and interpretability in Artificial Intelligence (AI), the field of eXplainable AI (XAI) has experienced significant growth, marked by both theoretical reflections and technical advancements. While various XAI techniques, especially feature attribution methods, have been extensively explored across diverse tasks, their adaptation for the speech modality is comparatively lagging behind. We argue that a key challenge in feature attribution for speech processing lies in identifying informative acoustic features. In this paper, we discuss the key challenges in selecting the features for speech explanations. Also, in light of existing research, we highlight current gaps and propose future avenues to enhance the depth and informativeness of explanations for speech.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Models are only as interpretable as their features. <ref type="bibr" target="#b0">[1]</ref> Spoken language, as perhaps our most natural form of interaction, is the foundational element of many technologies we interact with in our daily lives <ref type="bibr" target="#b1">[2]</ref>, from virtual assistants to voice dictation <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>. More recently, the emergence of highly capable speech foundation models <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref> has also facilitated and expanded the adoption of speech technologies on an unprecedented multilingual scale. In light of this proliferation, the need arises to prioritize transparency and interpretability, qualities already demanded in the growing landscape of Machine Learning (ML).</p><p>In response, the field of eXplainable AI (XAI) has risen to prominence, with the aim of facilitating understanding of the rationale behind model decisions and fostering users' trust <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. XAI is also reinforced by the establishment of norms and legal frameworks, as seen in the European Union's General Data Protection Regulation, which enshrines the 'right to explanation', and the AI Act, which emphasizes transparency as a pivotal component of ML applications <ref type="bibr" target="#b13">[14]</ref>.</p><p>XAI encompasses various tasks and methods, such as identifying the model components relevant to specific predictions, understanding the information processed by these components, and determining which input elements guide the model's predictions <ref type="bibr" target="#b14">[15]</ref>. 
The latter task is the focus of feature attribution methods, which provide intuitive explanations by visualizing which input elements (e.g., pixels in an image or words in a sentence) have influenced the model's predictions. These methods assign a score to each input feature, quantifying its importance or contribution to the output: higher scores indicate greater importance of the corresponding input features for generating the output <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19]</ref>. They can help identify potential causes of errors and unexpected behaviors, as well as analyze the model's response to specific input properties. Overall, these explainability methods present the reasons why models make specific predictions by establishing a connection between input and output as a form of intuitive explanation for humans, thereby enhancing interpretability.<note place="foot" n="1">Despite numerous efforts to differentiate the closely related concepts of explainability and interpretability, no consensus exists in the literature on their definitions <ref type="bibr" target="#b19">[20]</ref>. In this paper, we adopt a perspective similar to that of Saeed and Omlin <ref type="bibr" target="#b20">[21]</ref>, where explainability refers to the process of extracting insights from a model's workings through specific techniques, while interpretability refers to the process of understanding those insights, which is crucial to make them actionable.</note> Over time, ongoing efforts have aimed to refine feature attribution techniques and provide more effective explanations <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref>. However, it is essential to recognize that the effectiveness of feature attribution explanations relies not only on the techniques themselves but also on the informativeness of the input features used as explanatory variables. If an explanation highlights unintelligible or poorly informative features, it does little to enhance the understanding of the model's behavior <ref type="bibr" target="#b0">[1]</ref>. This can undermine key principles in XAI, such as accuracy (the property of correctly reflecting the factors that led the model to a specific decision, including all relevant information) and meaningfulness (the property of offering explanations that are comprehensible to the user) <ref type="bibr" target="#b23">[24]</ref>.<ref type="foot" target="#foot_0">2</ref> In fields involving images or texts, feature representations are typically constrained to pixels and words, respectively. However, for speech, multiple input representations can be adopted, each emphasizing different acoustic aspects. Indeed, a sequence of speech elements does not only convey the meaning of what is said (like words in a text) but also bears a wealth of additional information useful for both human understanding and automatic processing (e.g., intonation, loudness, speaking rate). Consequently, when employing feature attribution methods, the resulting explanations can vary significantly in shape and focus on more or less informative characteristics depending on the type of speech representation used. 
To date, research on feature attribution for speech is notably limited to a few applications, including classification <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref> and generative tasks <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b29">30,</ref><ref type="bibr" target="#b30">31,</ref><ref type="bibr" target="#b31">32]</ref>, which offer a somewhat fragmented picture in the choice of speech representations, thus providing limited insight into the relation between the features considered and the explanations built upon them.</p><p>In light of the above, this paper reflects on the impact of the chosen acoustic features in explaining the rationale behind speech models, aiming to gain a deeper understanding of the trade-offs associated with acoustic features. By first offering a gentle introduction to the rich and multidimensional nature of speech and its digital representation, we identify current gaps and potential avenues for effectively incorporating this multidimensionality into XAI for speech models. Our discussion focuses on two critical factors: i) the amount of information these features provide about the model's behavior, which influences the richness of the explanations, and ii) the level of detail of such information, which determines the granularity of the explanations. We also explore how these aspects impact both the accuracy and meaningfulness of the explanations, ultimately shaping their overall interpretability.</p></div>
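As a minimal illustration of the scoring principle described above, consider a linear model: for an output computed as a weighted sum, the product of each weight and input value is the exact contribution of that feature to the output (a simple instance of the gradient-times-input idea). The weights and inputs below are purely hypothetical and not tied to any speech model.

```python
import numpy as np

# Hypothetical linear "model": output = w @ x. For linear models, the
# gradient-times-input score w_i * x_i is the exact contribution of each
# input feature to the output.
w = np.array([0.5, -1.0, 2.0])   # illustrative weights
x = np.array([2.0, 1.0, 0.5])    # illustrative input features

output = float(w @ x)
attributions = w * x             # one importance score per input feature

# The scores decompose the output exactly (the "completeness" property).
print(attributions.tolist(), output)  # [1.0, -1.0, 1.0] 1.0
```

For nonlinear neural models this exactness no longer holds, which is precisely why the refined attribution techniques cited above exist.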
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The Correlates of Speech</head><p>To gain deeper insight into the complexities of defining informative features in speech, we explore key characteristics of speech and their implications for modeling.</p><p>Speech is a multifaceted phenomenon. It is grounded in the materiality of sound to convey linguistic content (i.e., what is said), which is modulated by several paralinguistic cues (i.e., how it is said) entailing extensive variation, even for individual speakers <ref type="bibr" target="#b32">[33]</ref>. As such, it comprises several dimensions, which are hard to pin down individually but collectively amount to what we intuitively and simply perceive as spoken language.</p><p>From a linguistic perspective, the spoken communication system consists of combinations of phonemes,<ref type="foot" target="#foot_1">3</ref> which are regarded as the smallest meaningful units of sound <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b34">35]</ref>. Physically, it involves the continuous flow of sounds shaped by the movements of our phonatory organs, transmitted as sound waves <ref type="bibr" target="#b35">[36]</ref>. Perceptually, we process speech through three primary dimensions <ref type="bibr" target="#b36">[37]</ref>: i) time, or the sequential occurrence of sounds;<ref type="foot" target="#foot_2">4</ref> ii) intensity, corresponding to the energy level of the wave due to the strength of molecular vibration, which we perceive as loudness; iii) frequency, regarding the rate of vibrations produced by the vocal cords, interpreted as pitch, whose modulation is responsible for shaping the type of speech sound.</p><p>These three elements, known as acoustic correlates <ref type="bibr" target="#b37">[38]</ref>, are specific to both speakers and phonemes. 
For example, speakers possess unique characteristics, including pitch and speaking rate <ref type="bibr" target="#b32">[33]</ref>, and also exhibit high variability stemming from various sociodemographic factors such as gender, age, and dialect <ref type="bibr" target="#b38">[39]</ref>. In these cases, the speech content needs to be disentangled from the variability in its delivery. Conversely, language sounds exhibit variability in duration (e.g., the vowels in ship and sheep) and are distinguished by specific frequency ranges <ref type="bibr" target="#b35">[36]</ref>. The frequency dimension also plays a vital role in shaping suprasegmental aspects of speech (broader phenomena that span multiple segments), such as intonation, obtained by varying pitch <ref type="bibr" target="#b40">[40]</ref>. Pitch, for instance, has a distinctive function in tonal languages, where it is used to distinguish lexical or grammatical meaning <ref type="bibr" target="#b41">[41]</ref>. But even in non-tonal languages, these prosodic elements are indispensable for delivering different meanings and intents, as the reader can perceive by reading aloud two contrastive sentences such as "You got the joke right" and "You got the joke, right?", where pauses and prosody play pivotal roles.</p><p>All these factors add to the multidimensionality of speech, which feature engineering strives to encapsulate and which cannot be overlooked in the explanatory process.</p></div>
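The three acoustic correlates can be sketched numerically on a synthetic tone. This is a deliberately simplified illustration (pure numpy, illustrative values only); real speech would call for far more robust estimators, e.g., of pitch.

```python
import numpy as np

sr = 16_000                                   # sampling rate (samples per second)
t = np.arange(sr) / sr
y = 0.3 * np.sin(2 * np.pi * 220.0 * t)       # 1-second synthetic 220 Hz tone

# i) time: the duration spanned by the sample sequence
duration_s = len(y) / sr

# ii) intensity: energy of the wave, perceived as loudness (RMS amplitude here)
rms = float(np.sqrt(np.mean(y ** 2)))

# iii) frequency: dominant rate of vibration, perceived as pitch
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / sr)
pitch_hz = float(freqs[np.argmax(spectrum)])

print(duration_s, round(rms, 3), pitch_hz)    # 1.0 0.212 220.0
```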
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Speech Representations</head><p>While various representations are used to encode speech in a digital format, three main types are commonly given as input to state-of-the-art speech models (for a review, see <ref type="bibr" target="#b42">[42,</ref><ref type="bibr" target="#b43">43]</ref>): namely, waveforms, spectrograms, and mel-frequency cepstral coefficients (MFCCs), shown in Figure <ref type="figure">1</ref>.</p><p>The waveform is the most fundamental representation of a signal, comprising a sequence of samples (e.g., 16,000 per second), each indicating the amplitude of the signal, essentially the fluctuations in air pressure, at a specific point in time. This type of representation is leveraged by models like Wav2vec <ref type="bibr" target="#b5">[6]</ref>.</p><p>The spectrogram results from feature engineering operations that decompose the speech signal into its frequencies, presenting a 2D visualization of frequency distributions over time. These representations are commonly depicted as heatmaps, where color intensity corresponds to the energy of a specific frequency at a given moment. The time unit in spectrograms is a fixed-length window of a few milliseconds (e.g., 25), commonly referred to as a frame, within which a given number of waveform samples are encompassed. Notably, the articulation of sounds produces time-frequency patterns which are visible as darker regions <ref type="bibr" target="#b35">[36]</ref>. Prominent examples of state-of-the-art models leveraging spectrograms are Whisper <ref type="bibr" target="#b8">[9]</ref> and SeamlessM4T <ref type="bibr" target="#b44">[44]</ref>.</p><p>MFCCs offer another 2D representation in which each coefficient captures important details about how the frequency content of the signal changes over time. Like spectrograms, MFCCs provide information about both frequency and time, but in a more compact form. 
MFCCs are commonly used in the implementation of ASR models within popular toolkits like Kaldi<ref type="foot" target="#foot_3">5</ref> <ref type="bibr" target="#b45">[45]</ref> and Mozilla DeepSpeech<ref type="foot" target="#foot_4">6</ref>.</p><p>Overall, though different in nature, these three types of representations are all effectively exploited by current speech models.<ref type="foot" target="#foot_5">7</ref> For human understanding, however, they vary in terms of informativeness with respect to the acoustic correlates discussed in §2. Indeed, although both intensity and frequency are somewhat discernible in waveforms, qualitative distinctions of patterns specific to pitch or phoneme frequencies are rarely feasible <ref type="bibr" target="#b35">[36]</ref>. Comparatively, spectrograms and MFCCs are richer and more descriptive, because they capture the multiple dimensions of time, frequency, and intensity in finer detail. Still, spectrograms are more conducive to phonetic analyses, given the established knowledge in analyzing frequency patterns over time within this representation <ref type="bibr" target="#b35">[36]</ref>. In contrast, MFCCs are rarely used for phonetic analysis <ref type="bibr" target="#b46">[46]</ref>.</p><p>Figure <ref type="figure">1</ref>: Schematic illustration of the primary speech representations used by state-of-the-art speech models for the utterance "This is a waveform". The features were computed using Librosa 0.10.1 <ref type="bibr" target="#b47">[47]</ref>.</p><p>Overall, while weighing the informativeness and selection of speech representations requires a certain expertise in speech processing, being aware of the trade-offs they intrinsically entail is crucial for carefully conducting XAI examinations in speech. Indeed, it is precisely upon such input features, and their trade-offs, that explanations are built.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Richness of Explanations</head><p>Considering the foregoing, there is a causal relationship wherein the explanatory possibilities in speech XAI are inherently limited by the richness of the audio features used, specifically the dimensions they encapsulate. This limitation directly correlates with the richness of the resulting explanations. Also, owing to the compatibility of current models with various representation types, the explanations generated are inevitably confined by the specific input features provided to the model. To exemplify, if models process audio as waveforms, which poorly represent the frequency dimension for human understanding, explanations accounting for that correlate will be out of reach. In fact, previous works by Wu et al. <ref type="bibr" target="#b30">[31]</ref> and Wu et al. <ref type="bibr" target="#b31">[32]</ref>, based on waveforms, solely focus on the temporal dimension to explain ASR.</p><p>In these cases, to avoid limiting the understanding of the models' behavior to one single dimension, it would be advisable to explore alternative techniques that offer deeper insights into how models process other acoustic correlates. For instance, Pastor et al. <ref type="bibr" target="#b27">[28]</ref> integrated counterfactual explanations to specifically investigate whether selected paralinguistic features such as pitch, speaking rate, and background noise influenced the model's prediction. Additionally, various techniques exist to analyze how models extract relevant patterns from waveforms through convolutions <ref type="bibr" target="#b49">[48,</ref><ref type="bibr" target="#b50">49,</ref><ref type="bibr" target="#b51">50]</ref>.</p><p>When the selected input features represent multiple dimensions, as in the case of spectrograms or MFCCs, the decision to account for only one of these dimensions becomes arbitrary. For example, two models tested by Wu et al. 
<ref type="bibr" target="#b30">[31]</ref>, namely, DeepSpeech <ref type="bibr" target="#b52">[51]</ref> and Sphinx <ref type="bibr" target="#b53">[52]</ref>, are fed with spectrograms and MFCCs, respectively. However, explanations based on raw waveforms are provided for these models. This inconsistency between the features used in explanations and those used by the models inevitably offers only a partial overview of the models' behavior and limits the exploration of important acoustic aspects. This, in turn, can impact the accuracy of the explanations, which ideally should encompass all relevant information.</p><p>To prioritize explanation accuracy and conduct analyses considering the crucial role of acoustic correlates such as frequency, it is advisable to take into account all dimensions embedded in the speech representation. This approach is exemplified by the works of Markert et al. <ref type="bibr" target="#b29">[30]</ref>, who provide explanations that account for the most influential elements in MFCCs, as well as Trinh and Mandel <ref type="bibr" target="#b28">[29]</ref> and Becker et al. <ref type="bibr" target="#b26">[27]</ref>, who base the explanations on spectrograms. In the work by Markert et al. <ref type="bibr" target="#b29">[30]</ref>, however, it is challenging to connect the results with specific acoustic parameters due to the complexity of analyzing MFCCs (see §3), which significantly undermines the meaningfulness of the explanations. In contrast, explanations using spectrograms offer valuable insights into how machines process speech, producing both accurate and meaningful results. For instance, Trinh and Mandel <ref type="bibr" target="#b28">[29]</ref> demonstrated that neural ASR models focus on high-energy time-frequency regions for transcription, while Becker et al. 
<ref type="bibr" target="#b26">[27]</ref> found that lower frequency ranges, typically associated with pitch, exhibit higher attribution scores in speaker gender classification tasks, showing some alignment with human speech processing. However, interpreting these insights requires specialized expertise, which can reduce the meaningfulness of explanations for non-experts. This highlights that, even in speech, the balance between accuracy and meaningfulness can vary depending on the context <ref type="bibr" target="#b23">[24]</ref>.</p></div>
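As a sketch of how spectrogram-based attributions of this kind can be produced, occlusion assigns each time-frequency patch a score equal to the output drop observed when that patch is masked. The "model" below is a toy stand-in scorer (it simply sums low-frequency energy), not any of the cited systems, and the spectrogram is random; the sketch only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in "model": scores a (freq, time) spectrogram by the energy in its
# lowest 20 frequency bins -- a hypothetical classifier, not a real speech model.
def model_score(spec):
    return spec[:20].sum()

spec = rng.random((201, 101))   # hypothetical magnitude spectrogram

# Occlusion-based attribution: zero out one time-frequency patch at a time and
# record how much the score drops; bigger drops mean more important patches.
def occlusion_map(spec, patch=(20, 20)):
    base = model_score(spec)
    attr = np.zeros_like(spec)
    for f in range(0, spec.shape[0], patch[0]):
        for t in range(0, spec.shape[1], patch[1]):
            occluded = spec.copy()
            occluded[f:f + patch[0], t:t + patch[1]] = 0.0
            attr[f:f + patch[0], t:t + patch[1]] = base - model_score(occluded)
    return attr

attr = occlusion_map(spec)
# Only the low-frequency patches (the only region the toy model reads) receive
# positive scores; every other patch scores zero.
print(bool(attr[:20].min() > 0), bool(np.allclose(attr[20:], 0.0)))  # True True
```

The resulting attribution map lives in the same time-frequency space as the spectrogram itself, which is what makes such explanations amenable to phonetic reading.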
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Granularity of Explanations</head><p>Another critical factor concerning the informativeness of input features is the level of granularity at which the features are considered during the explanatory process. This decision affects the level of detail in the resulting explanations and, consequently, their accuracy (as more detailed explanations may more accurately reflect the model's behavior) and their meaningfulness (as detailed and comprehensive explanations can be more difficult to interpret) <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b23">24]</ref>.</p><p>In the time domain, for example, input features are highly fine-grained. As discussed in §3, spectrograms typically contain frames spanning tens of milliseconds, capturing detailed frequency content within each frame, whereas waveforms are composed of samples taken at much shorter time intervals: as mentioned in §3, there can be 16,000 samples in just one second. This level of detail poses great challenges for (human) comprehension, particularly for a broader audience, since mapping groups of frames or samples in an explanation to recognizable speech units is highly time-consuming and requires specialized expertise.</p><p>Accordingly, to address this issue and make explanations for speech more broadly accessible, previous works have leveraged textual transcripts within the explanation process. More specifically, Wu et al. <ref type="bibr" target="#b31">[32]</ref> and Pastor et al. <ref type="bibr" target="#b27">[28]</ref> resort to the alignment of audio to text, at the level of individual phonemes and words, respectively, and apply explainability techniques to such units. While this approach helps decipher the contribution of input features based on more intuitive linguistic units, it diverges from how current models process speech features in small frames and samples <ref type="bibr" target="#b43">[43]</ref>. 
This divergence risks overlooking the model's behavior and compromises the accuracy and effectiveness of the explanations. For instance, whether ASR systems rely on shorter or longer time intervals than individual words remains unclear <ref type="bibr" target="#b28">[29]</ref>. Analyzing this aspect therefore requires a more granular approach at the time level.</p><p>In light of the above, explanations should be obtained with low-level units to avoid biasing them towards human understanding. The use of audio-transcript alignment to aid the analysis of explanations can be very useful, but it should occur downstream of the explanation process, not upstream. In this way, we can maximize the use of all available units to generate detailed and accurate explanations, and then aggregate the scores of individual frames or samples to create more compact representations at the level of phonemes or words, ensuring flexibility in the meaningfulness of the explanations according to specific needs. This bottom-up approach mirrors practices in the text domain, providing adaptability in defining attribution units that can range from subwords to words or phrases <ref type="bibr" target="#b54">[53,</ref><ref type="bibr" target="#b55">54]</ref>.</p></div>
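The downstream aggregation advocated above can be sketched as follows. The frame-level scores are illustrative values, and the word-to-frame alignment is hypothetical (in practice it would come from a forced aligner).

```python
import numpy as np

# Frame-level attribution scores (e.g., one per 10 ms frame), as produced by
# any feature attribution method; the values here are purely illustrative.
frame_scores = np.array([0.1, 0.3, 0.9, 0.8, 0.2, 0.1, 0.7, 0.6])

# Hypothetical audio-transcript alignment: word -> [start_frame, end_frame).
alignment = {"you": (0, 2), "got": (2, 5), "it": (5, 8)}

# Aggregate fine-grained scores inside each word span (mean here; sum or max
# are equally valid choices depending on the analysis).
word_scores = {w: float(frame_scores[s:e].mean()) for w, (s, e) in alignment.items()}
print({w: round(v, 2) for w, v in word_scores.items()})
# {'you': 0.2, 'got': 0.63, 'it': 0.47}
```

Because the fine-grained scores are computed first, the same explanation can be re-aggregated at phoneme, word, or phrase level without re-running the attribution method.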
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>This paper has examined the role of acoustic features and their selection for explaining speech models. More specifically, we considered a specific subfield of XAI, namely, feature attribution, which connects input features to outputs as a form of explanation. Previous research has not explicitly addressed how to incorporate features into the explanation process within the speech domain, where input is encoded in more varied ways compared to other fields, such as text. This has led to diverse approaches, each with different implications for what can and cannot be explained about model behavior, and with the risk of not fully or accurately representing the model's functioning.</p><p>By discussing the key characteristics of speech and the properties of the most adopted acoustic features, we argue that explanations should ideally encompass all available dimensions, particularly time and frequency, as both are essential for a comprehensive understanding of the models' rationale. We have also discussed challenges associated with aligning explanations at high granularity with human understanding, emphasizing solutions that provide flexibility in the analysis, allowing for adjustments between more or less detail as needed.</p><p>Building on these insights, our ongoing research focuses on developing feature attribution techniques that operate on spectrograms at the finest possible unit level, integrating both time and frequency dimensions. Our aim is to generate explanations that are accurate and meaningful for experts, as well as adaptable for non-expert users. More broadly, we hope that our reflections will be beneficial and thought-provoking for researchers currently working in, or entering, the field of XAI for speech models, thereby contributing to a deeper understanding of the rationale behind these models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Limitations</head><p>While exploring the relationship between the informativeness of speech features and explanations, we have deliberately not delved into the needs of specific stakeholders for XAI applications. Indeed, different stakeholders present varying needs <ref type="bibr" target="#b56">[55,</ref><ref type="bibr" target="#b57">56]</ref>, and considering them is a research avenue of paramount importance for the growth of XAI. However, as XAI for speech is still a nascent area of investigation, we prioritized more fundamental methodological and design decisions that support a comprehensive and detailed low-level understanding of a model's rationale. Accordingly, our reflections might be most appealing to users who engage with speech models and possess expertise in machine learning and/or speech analysis, ranging from developers to speech therapists assisted by speech models <ref type="bibr" target="#b57">[56]</ref>.</p><p>The balance of richness and granularity, which also relates to the interplay between accuracy and meaningfulness, is also relevant to everyday users who interact with speech technologies. However, investigating how explanations can be effectively communicated to and understood by these users in the context of daily speech technology use exceeds the scope of this paper and warrants further exploration.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="3,304.30,84.19,200.01,272.23" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">The properties of accuracy and meaningfulness can be associated with those of faithfulness and plausibility, respectively<ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b25">26]</ref>.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">Throughout the paper, we use the abstract category of phoneme to denote individual speech sounds. However, when discussing their actual realizations, it is more accurate to refer to them as phones<ref type="bibr" target="#b33">[34]</ref>.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">E.g., the order of sounds in /pAt/ (pot) vs. /tAp/ (top) differentiates two words.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">See https://kaldi-asr.org/doc/feat.html.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">See https://deepspeech.readthedocs.io/en/master/DeepSpeech. html.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">We are not aware of any recent study attributing higher system performance to the choice of representation.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Acknowledgments</head><p>We acknowledge the support of the PNRR project FAIR -Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU. This paper has also received funding from the European Union's Horizon research and innovation programme under grant agreement No 101135798, project Meetween (My Personal AI Mediator for Virtual MEETtings BetWEEN People).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The Need for Interpretable Features: Motivation and Taxonomy</title>
		<author>
			<persName><forename type="first">A</forename><surname>Zytek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Arnaldo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Berti-Equille</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Veeramachaneni</surname></persName>
		</author>
		<idno type="DOI">10.1145/3544903.3544905</idno>
		<ptr target="https://doi.org/10.1145/3544903.3544905" />
	</analytic>
	<monogr>
		<title level="j">SIGKDD Explor. Newsl</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="1" to="13" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">We need to talk: HCI and the delicate topic of spoken language interaction</title>
		<author>
			<persName><forename type="first">C</forename><surname>Munteanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Oviatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brewster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Penn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Whittaker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rajput</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nanavati</surname></persName>
		</author>
		<idno type="DOI">10.1145/2468356.2468803</idno>
		<ptr target="https://doi.org/10.1145/2468356.2468803" />
	</analytic>
	<monogr>
		<title level="m">CHI &apos;13 Extended Abstracts on Human Factors in Computing Systems, CHI EA &apos;13</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="2459" to="2464" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Continuous Authentication for Voice Assistants</title>
		<author>
			<persName><forename type="first">H</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Fawaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">G</forename><surname>Shin</surname></persName>
		</author>
		<idno type="DOI">10.1145/3117811.3117823</idno>
		<ptr target="https://doi.org/10.1145/3117811.3117823" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, MobiCom &apos;17</title>
				<meeting>the 23rd Annual International Conference on Mobile Computing and Networking, MobiCom &apos;17<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="343" to="355" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Personal Voice Assistant Security and Privacy-A Survey</title>
		<author>
			<persName><forename type="first">P</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Roedig</surname></persName>
		</author>
		<idno type="DOI">10.1109/JPROC.2022.3153167</idno>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the IEEE</title>
		<imprint>
			<biblScope unit="volume">110</biblScope>
			<biblScope unit="page" from="476" to="507" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Why Do People Use Artificial Intelligence (AI)-Enabled Voice Assistants?</title>
		<author>
			<persName><forename type="first">S</forename><surname>Malodia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Islam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kaur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dhir</surname></persName>
		</author>
		<idno type="DOI">10.1109/TEM.2021.3117884</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Engineering Management</title>
		<imprint>
			<biblScope unit="volume">71</biblScope>
			<biblScope unit="page" from="491" to="505" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS&apos;20</title>
				<meeting>the 34th International Conference on Neural Information Processing Systems, NIPS&apos;20<address><addrLine>Red Hook, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Curran Associates Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="12449" to="12460" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units</title>
		<author>
			<persName><forename type="first">W.-N</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bolte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-H</forename><forename type="middle">H</forename><surname>Tsai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lakhotia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<idno type="DOI">10.1109/TASLP.2021.3122291</idno>
		<ptr target="https://doi.org/10.1109/TASLP.2021.3122291" />
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="3451" to="3460" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kanda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yoshioka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<idno type="DOI">10.1109/JSTSP.2022.3188113</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Journal of Selected Topics in Signal Processing</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="1505" to="1518" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Robust Speech Recognition via Large-Scale Weak Supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Brockman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mcleavey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<idno>JMLR.org</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th International Conference on Machine Learning, ICML&apos;23</title>
				<meeting>the 40th International Conference on Machine Learning, ICML&apos;23</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="28492" to="28518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Towards A Rigorous Science of Interpretable Machine Learning</title>
		<author>
			<persName><forename type="first">F</forename><surname>Doshi-Velez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Kim</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1702.08608</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Machine Learning Interpretability: A Survey on Methods and Metrics</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">V</forename><surname>Carvalho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Pereira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Cardoso</surname></persName>
		</author>
		<idno type="DOI">10.3390/electronics8080832</idno>
		<ptr target="https://www.mdpi.com/2079-9292/8/8/832" />
	</analytic>
	<monogr>
		<title level="j">Electronics</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Notions of explainability and evaluation approaches for explainable artificial intelligence</title>
		<author>
			<persName><forename type="first">G</forename><surname>Vilone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Longo</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.inffus.2021.05.009</idno>
		<ptr target="https://doi.org/10.1016/j.inffus.2021.05.009" />
	</analytic>
	<monogr>
		<title level="j">Information Fusion</title>
		<imprint>
			<biblScope unit="volume">76</biblScope>
			<biblScope unit="page" from="89" to="106" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Interpretable Data-Based Explanations for Fairness Debugging</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pradhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Glavic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Salimi</surname></persName>
		</author>
		<idno type="DOI">10.1145/3514221.3517886</idno>
		<ptr target="https://doi.org/10.1145/3514221.3517886" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 International Conference on Management of Data, SIGMOD &apos;22</title>
				<meeting>the 2022 International Conference on Management of Data, SIGMOD &apos;22<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="247" to="261" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The role of explainable AI in the context of the AI Act</title>
		<author>
			<persName><forename type="first">C</forename><surname>Panigutti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hamon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Hupont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fernandez Llorca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fano Yela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Junklewitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Scalzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mazzini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sanchez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Soler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Garrido</surname></persName>
		</author>
		<author>
			<persName><surname>Gomez</surname></persName>
		</author>
		<idno type="DOI">10.1145/3593013.3594069</idno>
		<ptr target="https://doi.org/10.1145/3593013.3594069" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT &apos;23</title>
				<meeting>the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT &apos;23<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1139" to="1150" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ferrando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bisazza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Costa-jussà</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2405.00208</idno>
		<title level="m">A Primer on the Inner Workings of Transformer-based Language Models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Gradient-Based Attribution Methods</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ancona</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ceolini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Öztireli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gross</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-28954-6_9</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-28954-6_9" />
	</analytic>
	<monogr>
		<title level="m">Explainable AI: Interpreting, Explaining and Visualizing Deep Learning</title>
				<editor>
			<persName><forename type="first">W</forename><surname>Samek</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Montavon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Vedaldi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><forename type="middle">K</forename><surname>Hansen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K.-R</forename><surname>Müller</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="169" to="191" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Towards Explainable Artificial Intelligence</title>
		<author>
			<persName><forename type="first">W</forename><surname>Samek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K.-R</forename><surname>Müller</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-28954-6_1</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-28954-6_1" />
	</analytic>
	<monogr>
		<title level="m">Explainable AI: Interpreting, Explaining and Visualizing Deep Learning</title>
				<editor>
			<persName><forename type="first">W</forename><surname>Samek</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Montavon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Vedaldi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><forename type="middle">K</forename><surname>Hansen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K.-R</forename><surname>Müller</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="5" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Towards the Unification and Robustness of Perturbation and Gradient Based Explanations</title>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jabbari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Upadhyay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lakkaraju</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v139/agarwal21c.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 38th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Meila</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</editor>
		<meeting>the 38th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">139</biblScope>
			<biblScope unit="page" from="110" to="119" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Perturbation-based methods for explaining deep neural networks: A survey</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ivanovs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kadikis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ozols</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.patrec.2021.06.030</idno>
		<ptr target="https://doi.org/10.1016/j.patrec.2021.06.030" />
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">150</biblScope>
			<biblScope unit="page" from="228" to="234" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Explainable Artificial Intelligence: A Survey</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">K</forename><surname>Došilović</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brčić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hlupić</surname></persName>
		</author>
		<idno type="DOI">10.23919/MIPRO.2018.8400040</idno>
	</analytic>
	<monogr>
		<title level="m">International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)</title>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="210" to="215" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities</title>
		<author>
			<persName><forename type="first">W</forename><surname>Saeed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Omlin</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.knosys.2023.110273</idno>
		<ptr target="https://doi.org/10.1016/j.knosys.2023.110273" />
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="volume">263</biblScope>
			<biblScope unit="page">110273</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Do Feature Attribution Methods Correctly Attribute Features?</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Booth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Ribeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shah</surname></persName>
		</author>
		<idno type="DOI">10.1609/aaai.v36i9.21196</idno>
		<ptr target="https://ojs.aaai.org/index.php/AAAI/article/view/21196" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="9623" to="9633" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Amariucai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Qiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.17945</idno>
		<title level="m">A Comprehensive and Reliable Feature Attribution Method: Double-sided Remove and Reconstruct (DoRaR)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Phillips</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hahn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fontana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Greene</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Broniatowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Przybocki</surname></persName>
		</author>
		<idno type="DOI">10.6028/NIST.IR.8312</idno>
		<ptr target="https://doi.org/10.6028/NIST.IR.8312" />
		<title level="m">Four Principles of Explainable Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?</title>
		<author>
			<persName><forename type="first">A</forename><surname>Jacovi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.386</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.386" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4198" to="4205" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Towards Faithful Model Explanation in NLP: A Survey</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Apidianaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Callison-Burch</surname></persName>
		</author>
		<idno type="DOI">10.1162/coli_a_00511</idno>
		<ptr target="https://aclanthology.org/2024.cl-2.6" />
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">50</biblScope>
			<biblScope unit="page" from="657" to="723" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">AudioMNIST: Exploring Explainable Artificial Intelligence for audio analysis on a simple benchmark</title>
		<author>
			<persName><forename type="first">S</forename><surname>Becker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vielhaben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ackermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K.-R</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lapuschkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Samek</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.jfranklin.2023.11.038</idno>
		<ptr target="https://doi.org/10.1016/j.jfranklin.2023.11.038" />
	</analytic>
	<monogr>
		<title level="j">Journal of the Franklin Institute</title>
		<imprint>
			<biblScope unit="volume">361</biblScope>
			<biblScope unit="page" from="418" to="428" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features</title>
		<author>
			<persName><forename type="first">E</forename><surname>Pastor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Koudounas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Attanasio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Baralis</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.eacl-long.136" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">Y</forename><surname>Graham</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Purver</surname></persName>
		</editor>
		<meeting>the 18th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>St. Julian&apos;s, Malta</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2221" to="2238" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Directly Comparing the Listening Strategies of Humans and Machines</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">A</forename><surname>Trinh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mandel</surname></persName>
		</author>
		<idno type="DOI">10.1109/TASLP.2020.3040545</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="312" to="323" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Visualizing Automatic Speech Recognition - Means for a Better Understanding?</title>
		<author>
			<persName><forename type="first">K</forename><surname>Markert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Parracone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kulakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sperl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Kao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Böttinger</surname></persName>
		</author>
		<idno type="DOI">10.21437/SPSC.2021-4</idno>
	</analytic>
	<monogr>
		<title level="m">Proc. 2021 ISCA Symposium on Security and Privacy in Speech Communication</title>
				<meeting>2021 ISCA Symposium on Security and Privacy in Speech Communication</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="14" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Explanations for Automatic Speech Recognition</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rajan</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICASSP49357.2023.10094635</idno>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Can We Trust Explainable AI Methods on ASR? An Evaluation on Phoneme Recognition</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rajan</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICASSP48485.2024.10445989</idno>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="10296" to="10300" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Intra-speaker phonetic variation in read speech: comparison with inter-speaker variability in a controlled population</title>
		<author>
			<persName><forename type="first">N</forename><surname>Audibert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fougeron</surname></persName>
		</author>
		<idno type="DOI">10.21437/Interspeech.2022-10965</idno>
		<ptr target="https://hal.science/hal-03852142" />
	</analytic>
	<monogr>
		<title level="m">Interspeech 2022</title>
		<meeting><address><addrLine>Incheon, South Korea</addrLine></address></meeting>
		<imprint>
			<publisher>ISCA</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="4755" to="4759" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<title level="m" type="main">An Introduction to Phonetics and Phonology</title>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yallop</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1990">1990</date>
			<publisher>B. Blackwell</publisher>
			<pubPlace>Oxford, UK</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Yule</surname></persName>
		</author>
		<title level="m">The Study of Language</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note>7 ed</note>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Acoustic Phonetics</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">N</forename><surname>Stevens</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2000">2000</date>
			<publisher>The MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Effects of degradation of intensity, time, or frequency content on speech intelligibility for normal-hearing and hearing-impaired listeners</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">H</forename><surname>Van Schijndel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Houtgast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Festen</surname></persName>
		</author>
		<idno type="DOI">10.1121/1.1378345</idno>
		<ptr target="https://doi.org/10.1121/1.1378345" />
	</analytic>
	<monogr>
		<title level="j">The Journal of the Acoustical Society of America</title>
		<imprint>
			<biblScope unit="volume">110</biblScope>
			<biblScope unit="page" from="529" to="542" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Acoustic correlates of some phonetic categories</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">N</forename><surname>Stevens</surname></persName>
		</author>
		<idno type="DOI">10.1121/1.384823</idno>
	</analytic>
	<monogr>
		<title level="j">The Journal of the Acoustical Society of America</title>
		<imprint>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="page" from="836" to="842" />
			<date type="published" when="1980">1980</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Honey</surname></persName>
		</author>
		<idno type="DOI">10.1002/9781405166256.ch6</idno>
		<ptr target="https://doi.org/10.1002/9781405166256.ch6" />
		<title level="m">Sociophonology</title>
				<imprint>
			<publisher>John Wiley &amp; Sons, Ltd</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="92" to="106" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<title level="m" type="main">Speech Prosody: from Acoustics to Interpretation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hirst</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
			<publisher>Springer</publisher>
			<pubPlace>Berlin, Heidelberg</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">The Diversity of Tone Languages and the Roles of Pitch Variation in Non-tone Languages: Considerations for Tone Perception Research</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">T</forename><surname>Best</surname></persName>
		</author>
		<idno type="DOI">10.3389/fpsyg.2019.00364</idno>
		<ptr target="https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2019.00364" />
	</analytic>
	<monogr>
		<title level="j">Frontiers in Psychology</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">A Review of Physical and Perceptual Feature Extraction Techniques for Speech, Music and Environmental Sounds</title>
		<author>
			<persName><forename type="first">F</forename><surname>Alías</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Socoró</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sevillano</surname></persName>
		</author>
		<idno type="DOI">10.3390/app6050143</idno>
		<ptr target="https://www.mdpi.com/2076-3417/6/5/143" />
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">A review of deep learning techniques for speech processing</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mehrish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bharadwaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.inffus.2023.101869</idno>
		<ptr target="https://doi.org/10.1016/j.inffus.2023.101869" />
	</analytic>
	<monogr>
		<title level="j">Information Fusion</title>
		<imprint>
			<biblScope unit="volume">99</biblScope>
			<biblScope unit="page">101869</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<monogr>
		<author>
			<orgName type="collaboration">Seamless Communication</orgName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Barrault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-A</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Meglioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-A</forename><surname>Duquenne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Elsahar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Heffernan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Klaiber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Licht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Maillard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rakotoarison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Sadagopan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Akula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">E</forename><surname>Hachem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Haaheim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hansanti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Howes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-J</forename><surname>Hwang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Inaguma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kalbassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kallet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kulikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mavlyutov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peloquin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ramadan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramakrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Tufanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vogeti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Andrews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Balioglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Costa-jussà</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Celebi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Elbayad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mourachko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Popuri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ropers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saleem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tomasello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.11596</idno>
		<ptr target="https://arxiv.org/abs/2308.11596" />
		<title level="m">SeamlessM4T: Massively Multilingual &amp; Multimodal Machine Translation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">The Kaldi Speech Recognition Toolkit</title>
		<author>
			<persName><forename type="first">D</forename><surname>Povey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ghoshal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Boulianne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Burget</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Glembek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hannemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Motlicek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Schwarz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Silovsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Stemmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Vesely</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 2011 Workshop on Automatic Speech Recognition and Understanding</title>
		<imprint>
			<publisher>IEEE Signal Processing Society</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
	<note>IEEE Catalog No.: CFP11SRW-USB</note>
</biblStruct>

<biblStruct xml:id="b46">
	<analytic>
		<title level="a" type="main">The Encoding of Vowel Features in Mel-Frequency Cepstral Coefficients</title>
		<author>
			<persName><forename type="first">K</forename><surname>Ikarous</surname></persName>
		</author>
		<idno type="DOI">10.17469/O2104AISV000001</idno>
		<ptr target="https://doi.org/10.17469/O2104AISV000001" />
	</analytic>
	<monogr>
		<title level="m">Il parlato nel contesto naturale [Speech in the natural context]</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Vietti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Spreafico</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Mereu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Galatà</surname></persName>
		</editor>
		<meeting><address><addrLine>Officinaventuno, Milano</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="9" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<title level="m" type="main">librosa</title>
		<author>
			<persName><forename type="first">B</forename><surname>McFee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>McVicar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Faronbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Roman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gover</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Balke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Seyfarth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Malek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Lostanlen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>van Niekerk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cwitkowitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zalkow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Nieto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ellis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mason</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Steers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Halvachs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Thomé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F.-R</forename><surname>Stöter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bittner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Battenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Yamamoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Carr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Metsai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sullivan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Friesch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krishnakumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hidaka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kowalik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Keller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mazur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chabot-Leclerc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hawthorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ramaprasad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Keum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Monroe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">A</forename><surname>Morozov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Eliasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nullmightybofo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Biberstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sergin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hennequin</surname></persName>
		</author>
		<author>
			<persName><surname>Naktinis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Beantowel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">P</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Åsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Malins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hereñú</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Der Struijk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nickel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vollrath</surname></persName>
		</author>
		<author>
			<persName><surname>Sarroff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Xiao-Ming</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Porter</surname></persName>
		</author>
		<author>
			<persName><surname>Kranzler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Voodoohop</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gangi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jinoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guerrero</surname></persName>
		</author>
		<author>
			<persName><surname>Mazhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tod-Drme2178</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Baratz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Kostin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">T</forename><surname>Zhuang</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title/>
		<author>
			<persName><forename type="first">P</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Campr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Semeniuc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biswal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Moura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Brossier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><surname>Pimenta</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.8252662</idno>
		<ptr target="https://doi.org/10.5281/zenodo.8252662" />
	</analytic>
	<monogr>
		<title level="j">librosa/librosa</title>
		<imprint>
			<biblScope unit="volume">0</biblScope>
			<biblScope unit="issue">10</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Ravanelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1811.09725</idno>
		<title level="m">Interpretable Convolutional Filters with SincNet</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">Interpretation of convolutional neural networks for speech spectrogram regression from intracranial recordings</title>
		<author>
			<persName><forename type="first">M</forename><surname>Angrick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Herff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Krusienski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schultz</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.neucom.2018.10.080</idno>
		<ptr target="https://doi.org/10.1016/j.neucom.2018.10.080" />
	</analytic>
	<monogr>
		<title level="j">Neurocomput</title>
		<imprint>
			<biblScope unit="volume">342</biblScope>
			<biblScope unit="page" from="145" to="151" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<analytic>
		<title level="a" type="main">IIRI-Net: An interpretable convolutional front-end inspired by IIR filters for speaker identification</title>
		<author>
			<persName><forename type="first">H</forename><surname>Fayyazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shekofteh</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.neucom.2023.126767</idno>
		<ptr target="https://doi.org/10.1016/j.neucom.2023.126767" />
	</analytic>
	<monogr>
		<title level="j">Neurocomput</title>
		<imprint>
			<biblScope unit="volume">558</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Hannun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Case</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Casper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Catanzaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Diamos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Elsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Prenger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Satheesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sengupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Coates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.5567</idno>
		<title level="m">Deep Speech: Scaling up end-to-end speech recognition</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b53">
	<analytic>
		<title level="a" type="main">Design of the CMU Sphinx-4 Decoder</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lamere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kwok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gouvea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Raj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wolf</surname></persName>
		</author>
		<idno type="DOI">10.21437/Eurospeech.2003-382</idno>
	</analytic>
	<monogr>
		<title level="m">Proc. 8th European Conference on Speech Communication and Technology (Eurospeech 2003)</title>
				<meeting>8th European Conference on Speech Communication and Technology (Eurospeech 2003)</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="1181" to="1184" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b54">
	<analytic>
		<title level="a" type="main">Inseq: An Interpretability Toolkit for Sequence Generation Models</title>
		<author>
			<persName><forename type="first">G</forename><surname>Sarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Feldhus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sickert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Van Der Wal</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-demo.40</idno>
		<ptr target="https://aclanthology.org/2023.acl-demo.40" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Bollegala</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ritter</surname></persName>
		</editor>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="421" to="435" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b55">
	<analytic>
		<title level="a" type="main">Using Captum to Explain Generative Language Models</title>
		<author>
			<persName><forename type="first">V</forename><surname>Miglani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Markosyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Garcia-Olano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kokhlikyan</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.nlposs-1.19</idno>
		<ptr target="https://aclanthology.org/2023.nlposs-1.19" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Tan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Milajevs</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Chauhan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Gwinnup</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Rippeth</surname></persName>
		</editor>
		<meeting>the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="165" to="173" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b56">
	<analytic>
		<title level="a" type="main">What do we want from Explainable Artificial Intelligence (XAI)? -A stakeholder perspective on XAI and a conceptual model guiding interdisciplinary XAI research</title>
		<author>
			<persName><forename type="first">M</forename><surname>Langer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Oster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Speith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hermanns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kästner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sesing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Baum</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.artint.2021.103473</idno>
		<ptr target="https://doi.org/10.1016/j.artint.2021.103473" />
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">296</biblScope>
			<biblScope unit="page">103473</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b57">
	<analytic>
		<title level="a" type="main">Speech Therapy Supported by AI and Smart Assistants</title>
		<author>
			<persName><forename type="first">M</forename><surname>Calvano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Curci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pagano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piccinno</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Product-Focused Software Process Improvement</title>
				<editor>
			<persName><forename type="first">R</forename><surname>Kadgien</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Jedlitschka</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Janes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Lenarduzzi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham, Switzerland</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Nature</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="97" to="104" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
