1. Introduction

Management

10.1109/TEM

Explainability for Speech Models: On the Challenges of Acoustic Feature Selection

Dennis Fucci

0 1

Beatrice Savoldi

Marco Gaido

Matteo Negri

Mauro Cettolo

Luisa Bentivogli

0 0 Fondazione Bruno Kessler , Via Sommarive, 18, 38123 Trento TN , Italy 1 University of Trento , Via Calepina, 14, 38122 Trento TN , Italy

2023

71 2024 491 505

Spurred by the demand for transparency and interpretability in Artificial Intelligence (AI), the field of eXplainable AI (XAI) has experienced significant growth, marked by both theoretical reflections and technical advancements. While various XAI techniques, especially feature attribution methods, have been extensively explored across diverse tasks, their adaptation for the speech modality is comparatively lagging behind. We argue that a key challenge in feature attribution for speech processing lies in identifying informative acoustic features. In this paper, we discuss the key challenges in selecting the features for speech explanations. Also, in light of existing research, we highlight current gaps and propose future avenues to enhance the depth and informativeness of explanations for speech.

eol>Speech Models Explainability Feature Attribution

1. Introduction

1Despite numerous eforts to diferentiate the closely related concepts of explainability and interpretability, no consensus exists in the literature on their definitions [ 20]. In this paper, we adopt a perspective similar to that of Saeed and Omlin [21], where explainability refers to the process of extracting insights from a model’s workings through specific techniques, while interpretability refers to the understanding process of those insights, crucial to make them actionable. [ 1 ]. This can undermine key principles in XAI, such as several paralinguistic cues (i.e. how is said) entailing accuracy—the property of correctly reflecting the factors extensive variation—also for single individual speakers that led the model to a specific decision including all [33]. As such, it comprises several dimensions, which are relevant information—and meaningfulness—the property hard to pin down individually, but collectively amount of ofering explanations that are comprehensible to the to what we intuitively and simply perceive as spoken user [24].2 language.

In fields involving images or texts, feature representa- From a linguistic perspective, the spoken communications are typically constrained to pixels and words, re- tion system consists of the combination of phonemes,3 spectively. However, for speech, multiple input represen- which are regarded as the smallest meaningful units tations can be adopted, each emphasizing diferent acous- of sounds [34, 35]. Physically, it involves the contintic aspects. Indeed, a sequence of speech elements does uous flow of sounds shaped by the movements of our not only convey the meaning of what is said (like words phonatory organs, transmitted as sound waves [36]. Perin a text) but also bears a wealth of additional information ceptually, we process speech through three primary diuseful for both human understanding and automatic pro- mensions [37]: i) time, or the sequential occurrence of cessing (e.g. intonation, loudness, speaking rate). Con- sounds;4 ii) intensity, corresponding to the energy level sequently, when employing feature attribution methods, of the wave due to the strength of molecular vibration, the resulting explanations can vary significantly in shape which we perceived as loudness; iii) frequency, regardand focus on more or less informative characteristics de- ing the rate of vibrations produced by the vocal cords— pending on the type of speech representation used. To interpreted as pitch—and whose modulation is responsidate, research on feature attribution for speech is no- ble for shaping the type of speech sound. tably limited to few applications—including classification These three elements, known as acoustic correlates [38], [27, 28] and generative tasks [29, 30, 31, 32]—which ofer are specific to both speakers and phonemes. For example, a somewhat fragmented picture in the choice of speech speakers possess unique characteristics, including pitch representations, thus providing limited insights on the and speaking rate [33], and also exhibit high variability relation between the features considered and the expla- stemming from various sociodemographic factors such nations based upon them. as gender, age, and dialect [39]. In these cases, the speech

In light of the above, this paper reflects on the im- content needs to be disentangled from the variability in pact of the chosen acoustic features in explaining the its delivery. Conversely, language sounds exhibit variabilrationale behind speech models, aiming to gain a deeper ity in duration—e.g., /i/ in ship and sheep—and are distinunderstanding of the trade-ofs associated with acoustic guished by specific frequency ranges [ 36]. The frequency features. By first ofering a gentle introduction to the dimension also plays a vital role in shaping suprasegmenrich and multidimensional nature of speech and its digital tal aspects of speech—broader phenomena that span mulrepresentation, we identify current gaps and potential tiple segments—such as intonation, obtained by varying avenues for efectively incorporating this multidimen- pitch [40]. Pitch, for instance, has a distinctive function sionality into XAI for speech models. Our discussion will in tonal languages, where it is used to distinguish lexifocus on two critical factors: i) the amount of information cal or grammatical meaning [41]. But even in non-tonal these features provide about the model’s behavior, which languages, these prosodic elements are indispensable to influences the richness of the explanations, and ii) the delivering diferent meanings and intents, as the reader level of detail of such information, which determines the can perceive by reading out loud two contrastive sengranularity of the explanations. We will also explore how tences such as: “You got the joke right” and “You got the these aspects impact both the accuracy and meaningful- joke, right?”, where pauses and prosody play pivotal roles. ness of the explanations, ultimately shaping their overall All these factors add to the multidimensionality of interpretability. speech, which feature engineering strives to encapsulate and that cannot be overlooked in the explanatory process.

2. The Correlates of Speech To gain deeper insight into the complexities of defining

informative features in speech, we explore key characteristics of speech and their implications for modeling.

Speech is a multifaceted phenomenon. It is grounded on the materiality of sound to convey linguistic content (i.e. what is said), which is modulated depending on

2The properties of accuracy and meaningfulness can be associated

with those of faithfulness and plausibility, respectively [25, 26].

3. Speech Representations

While various representations are used to encode speech in a digital format, three main types are commonly given 3Throughout the paper, we use the abstract category of phoneme to denote individual speech sounds. However, when discussing their actual realizations, it is more accurate to refer to them as phones [34]. 4E.g. the order of sounds between /pAt/ (pot) or /tAp/ (top) diferentiates two words. as input to state-of-the-art speech models (for a review, see [42, 43]). Namely, waveforms, spectrograms, and mel-frequency cepstral coeficients (MFCCs), which are shown in Figure 1.

The waveform serves as the most fundamental representation of a signal, comprising sequences of samples (e.g., 16, 000 per second), each indicating the amplitude of the signal at a specific point in time—essentially, the lfuctuations in air pressure over time. This type of representation is leveraged by models like Wav2vec [6].

The spectrogram results from feature engineering operations that decompose the speech signal into its frequencies, presenting a 2D visualization of frequency distributions over time. These representations are commonly depicted as heatmaps, where color intensity corresponds to the energy of a specific frequency at a given moment. The time unit in spectrograms is represented by a fixed-length window of a few milliseconds (e.g., 25), commonly referred to as a frame, whithin which a given number of waveform samples are encompassed. Notably, the articulation of sounds produces time-frequency patterns which are visible as darker regions [36]. Prominent examples of state-of-the-art models leveraging spectrograms are Whisper [9] and SeamlessM4T [44].

The MFCCs ofer another 2D representation where each coeficient captures important details about how Figure 1: Schematic illustration of the primary speech repthe frequency content of the signal changes over time. resentations used by state-of-the-art speech models for the Like spectrograms, MFCCs ofer information about both utterance “This is a waveform”. The features were computed frequency and time, but in a more compact form. MFCCs using Librosa 0.10.1 [47]. are commonly used in the implementation of ASR models within popular toolkits like Kaldi5 [45] and Mozilla DeepSpeech6.

Overall, though diferent in nature, these three types tise in speech processing, being aware of the trade-ofs of representations are all efectively exploited by current they intrinsically entail is crucial for carefully conducting speech models.7 For human understanding, however, XAI examination in speech. Indeed, it is precisely upon they actually vary in terms of informativeness with re- such input features—and their trade-ofs—that explanaspect to the acoustic correlates discussed in §2. Indeed, tions are built. although both intensity and frequency are somewhat discernible in waveforms, qualitative distinctions of pat- 4. Richness of Explanations terns specific to pitch or phoneme frequencies are rarely feasible [36]. Comparatively, spectrograms and MFCCs Considering the foregoing, there is a causal relationship are richer and more descriptive, because they capture wherein explanatory possibilities in speech XAI are inthe multiple dimensions of time, frequency, and intensity herently limited by the richness of the audio features with finer detail. Still, spectrograms are more conducive used, specifically the dimensions they encapsulate. This to phonetic analyses, given the established knowledge limitation directly correlates with the richness of the in analyzing frequency patterns over time within this resulting explanations. Also, owing to the compatibilrepresentation [36] In contrast, MFCCs are rarely used ity of current models with various representation types, for phonetic analysis [46]. the explanations generated are inevitably confined by

Overall, while weighting the informativeness and se- the specific input features provided to the model. To exlection of speech representations requires a certain exper- emplify, if models process audio as waverfoms—which poorly represent the frequency dimension for human 5See https://kaldi-asr.org/doc/feat.html. understanding—explanations accounting for such a cor6Shetmel. https://deepspeech.readthedocs.io/en/master/DeepSpeech. relate will be out of reach. In fact, previous works by 7We are not aware of any recent study attributing higher systems Wu et al. [31] and Wu et al. [32], based on waveforms performance depending on the used representation. solely focus on the temporal dimension to explain ASR.

5. Granularity of Explanations

In these cases, to avoid limiting the understanding of the models’ behavior to one single dimension it would be advisable to explore alternative techniques that ofer Another critical factor concerning the informativeness deeper insights into how models process other acous- of input features is the level of granularity at which tic correlates. For instance, Pastor et al. [28] integrated the features are considered during the explanatory procounterfactual explanations to specifically investigate cess. This decision afects the level of detail in the resultwhether selected paralinguistic features such as pitch, ing explanations and, consequently, accuracy—as more speaking rate, and background noise were influent for detailed explanations may more accurately reflect the the model’s prediction. Additionally, various techniques model’s behavior—and their meaningfulness—as detailed exist to analyze how models extract relevant patterns and comprehensive explanations can be more dificult to from waveforms through convolutions [48, 49, 50]. interpret [12, 24].

When the selected input features represent multiple In the time domain, for example, input features are dimensions, as in the case of spectrograms or MFCCs, the highly fine-grained. As discussed in §2, spectrograms decision to only account for one of these dimensions be- typically contain frames spanning tens of milliseconds, comes arbitrary. For example, two models tested by Wu capturing detailed frequency content within each frame, et al. [31], namely, DeepSpeech [51] and Sphinx [52], are whereas waveforms are composed of samples taken at fed with spectrograms and MFCCs, respectively. How- much shorter time intervals—for instance, as mentioned ever, explanations based on raw waveforms are provided in §2, there can be 16, 000 samples in just one second. for these models. This inconsistency between the fea- This level of detail poses great challenges for (human) tures used in explanations and those used by the models comprehension, particularly for a broader audience, since inevitably ofers only a partial overview of the models’ mapping groups of frames/samples in an explanation to behavior and limits the exploration of important acoustic recognizable speech units is highly time-consuming and aspects. This, in turn, can impact the accuracy of the ex- requires specialized expertise. planations, which ideally should encompass all relevant Accordingly, to address the issue and make explanainformation. tions for speech more broadly accessible, previous works

To prioritize explanation accuracy and conduct anal- have leveraged textual transcripts within the explanation yses considering the crucial role of acoustic correlates process. More specifically, Wu et al. [32] and Pastor et al. such as frequency, it is advisable to take into account all [28] resort to the alignment of audio to text, either for dimensions embedded in the speech representation. individual phonemes or words, respectively, and apply exThis approach is exemplified by the works of Markert plainability techniques to such units. While this approach et al. [30], who provide explanations that account for helps decipher the contribution of input features based the most influential elements in MFCCs, as well as Trinh on more intuitive linguistic units, it diverges from how and Mandel [29] and Becker et al. [27], who base the current models process speech features in small frames explanations on spectrograms. In the work by Mark- and samples [43]. This divergence risks overlooking the ert et al. [30], however, it is challenging to connect the model’s behavior and compromises the accuracy and efresults with specific acoustic parameters due to the com- fectiveness of the explanations. For instance, whether plexity of analyzing MFCCs (see §3), which significantly ASR systems rely on shorter or longer time intervals than undermines the meaningfulness of the explanations. In individual words remains unclear [29]. Therefore, anacontrast, explanations using spectrograms ofer valuable lyzing this aspect requires a more granular approach at insights into how machines process speech, producing the time level. both accurate and meaningful results. For instance, Trinh In light of the above, explanations should be oband Mandel [29] demonstrated that neural ASR models tained with low-level units to avoid biasing explanafocus on high-energy time-frequency regions for tran- tions towards human understanding. The use of audioscription, while Becker et al. [27] found that lower fre- transcript alignment to aid analysis of explanations can quency ranges, typically associated with pitch, exhibit be very useful but should occur downstream of the explahigher attribution scores in speaker gender classification nation process, not upstream. In this way, we can maxitasks [27], showing some alignment with human speech mize the use of all available units to generate detailed and processing. However, interpreting these insights requires accurate explanations, and then aggregate scores from specialized expertise, which can reduce the meaningful- individual frames or samples to create more compact ness of explanations for non-experts. This highlights representations at the level of phonemes or words, ensurthat, even in speech, the balance between accuracy and ing flexibility in the meaningfulness of the explanations meaningfulness can vary depending on the context [24]. according to specific needs. This bottom-up approach mirrors practices in the text domain, providing adaptability in defining attribution units that can range from subwords to words or phrases [53, 54].

6. Conclusion

speech models [56].

The balance of richness and granularity—which This paper has examined the role of acoustic features and also relates to the interplay between accuracy and their selection for explaining speech models. More specif- meaningfulness—is also relevant to common users who ically, we considered a specific subfield of XAI, namely, interact with speech technologies. However, investigatfeature attribution, which connects input features to out- ing how explanations can be efectively communicated puts as a form of explanation. Previous research has not to and understood by these users in the context of daily explicitly addressed how to incorporate features into the speech technology use exceeds the scope of this paper explanation process within the speech domain, where and warrants further exploration. input is encoded in more varied ways compared to other ifelds, such as text. This has led to diverse approaches, each with diferent implications for what can and can- 8. Acknowledgments not be explained about model behavior, and with the risk of not fully or accurately representing the model’s We acknowledge the support of the PNRR project FAIR functioning. Future AI Research (PE00000013), under the NRRP MUR

By discussing the key characteristics of speech and program funded by the NextGenerationEU. This paper the properties of the most adopted acoustic features, we has also received funding from the European Union’s argue that explanations should ideally encompass all Horizon research and innovation programme under grant available dimensions, particularly time and frequency, as agreement No 101135798, project Meetween (My Perboth are essential for a comprehensive understanding of sonal AI Mediator for Virtual MEETtings BetWEEN Peothe models’ rationale. We have also discussed challenges ple). associated with aligning explanations at high granularity with human understanding, emphasizing solutions that References provide flexibility in the analysis, allowing for adjustments between more or less detail as needed.

Building on these insights, our ongoing research focuses on developing feature attribution techniques that operate on spectrograms at the finest possible unit level, integrating both time and frequency dimensions. Our aim is to generate explanations that are accurate and meaningful for experts, as well as adaptable for non-expert users. More broadly, we hope that our reflections will be beneficial and thought-provoking for researchers currently working in, or entering, the field of XAI for speech models, thereby contributing to a deeper understanding of the rationale behind these models.

7. Limitations

While exploring the relationship between the informativeness of speech features and explanations, we have deliberately not delved into the needs of specific stakeholders for XAI applications. Indeed, diferent stakeholders present varying needs [55, 56], and to consider them is a research avenue of paramount importance for the growth of XAI. As a nascent area of investigations, however, XAI for speech is still relatively in its infancy, we thus prioritized more fundamental methodological and design decisions which prioritize a comprehensive and detailed understanding at a low level of model’s rationale. Accordingly, our reflections might be more appealing for a range of users who engage with speech models and possess expertise in machine learning and/or speech analysis, ranging from developers to speech therapists assisted by

[1]

Zytek ,

Arnaldo ,

Liu ,

Berti-Equille ,

Veeramachaneni , The Need for Interpretable Features: Motivation and Taxonomy, SIGKDD Explor . Newsl. 24 ( 2022 ) 1 - 13 . URL: https://doi.org/ 10.1145/3544903.3544905. doi: 10 .1145/3544903. 3544905.

[2]

Munteanu ,

Jones ,

Oviatt ,

Brewster , G. Penn,

Whittaker ,

Rajput ,

Nanavati , We need to talk: HCI and the delicate topic of spoken language interaction , in: CHI '13 Extended Abstracts on Human Factors in Computing Systems, CHI EA '13 , Association for Computing Machinery, New York, NY, USA, 2013 , pp. 2459 - 2464 . URL: https://doi.org/10.1145/2468356.2468803. doi: 10 .1145/2468356.2468803.

[3]

Feng ,

Fawaz ,

K. G.

Shin , Continuous Authentication for Voice Assistants , in: Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking , MobiCom '17, Association for Computing Machinery, New York, NY, USA, 2017 , pp. 343 - 355 . URL: https://doi.org/ 10.1145/3117811.3117823. doi: 10 .1145/3117811. 3117823.

[4]

Cheng , U. Roedig, Personal Voice Assistant Security and Privacy-A Survey , Proceedings of the IEEE 110 ( 2022 ) 476 - 507 . doi: 10 .1109/JPROC. 2022 . 3153167 .

[5]

Malodia ,

Islam ,

Kaur ,

Dhir , Why Do People Use Artificial Intelligence (AI)-Enabled Voice Assistants?, IEEE Transactions on Engineering