<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Management</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/TEM</article-id>
      <title-group>
        <article-title>Explainability for Speech Models: On the Challenges of Acoustic Feature Selection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dennis Fucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Beatrice Savoldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Gaido</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Negri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Cettolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luisa Bentivogli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive, 18, 38123 Trento TN</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>Via Calepina, 14, 38122 Trento TN</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>71</volume>
      <issue>2024</issue>
      <fpage>491</fpage>
      <lpage>505</lpage>
      <abstract>
        <p>Spurred by the demand for transparency and interpretability in Artificial Intelligence (AI), the field of eXplainable AI (XAI) has experienced significant growth, marked by both theoretical reflections and technical advancements. While various XAI techniques, especially feature attribution methods, have been extensively explored across diverse tasks, their adaptation for the speech modality is comparatively lagging behind. We argue that a key challenge in feature attribution for speech processing lies in identifying informative acoustic features. In this paper, we discuss the key challenges in selecting the features for speech explanations. Also, in light of existing research, we highlight current gaps and propose future avenues to enhance the depth and informativeness of explanations for speech.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Speech Models</kwd>
        <kwd>Explainability</kwd>
        <kwd>Feature Attribution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        1Despite numerous eforts to diferentiate the closely related
concepts of explainability and interpretability, no consensus exists in
the literature on their definitions [ 20]. In this paper, we adopt a
perspective similar to that of Saeed and Omlin [21], where
explainability refers to the process of extracting insights from a model’s
workings through specific techniques, while interpretability refers
to the understanding process of those insights, crucial to make
them actionable.
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This can undermine key principles in XAI, such as several paralinguistic cues (i.e. how is said) entailing
accuracy—the property of correctly reflecting the factors extensive variation—also for single individual speakers
that led the model to a specific decision including all [33]. As such, it comprises several dimensions, which are
relevant information—and meaningfulness—the property hard to pin down individually, but collectively amount
of ofering explanations that are comprehensible to the to what we intuitively and simply perceive as spoken
user [24].2 language.
      </p>
      <p>In fields involving images or texts, feature representa- From a linguistic perspective, the spoken
communications are typically constrained to pixels and words, re- tion system consists of the combination of phonemes,3
spectively. However, for speech, multiple input represen- which are regarded as the smallest meaningful units
tations can be adopted, each emphasizing diferent acous- of sounds [34, 35]. Physically, it involves the
contintic aspects. Indeed, a sequence of speech elements does uous flow of sounds shaped by the movements of our
not only convey the meaning of what is said (like words phonatory organs, transmitted as sound waves [36].
Perin a text) but also bears a wealth of additional information ceptually, we process speech through three primary
diuseful for both human understanding and automatic pro- mensions [37]: i) time, or the sequential occurrence of
cessing (e.g. intonation, loudness, speaking rate). Con- sounds;4 ii) intensity, corresponding to the energy level
sequently, when employing feature attribution methods, of the wave due to the strength of molecular vibration,
the resulting explanations can vary significantly in shape which we perceived as loudness; iii) frequency,
regardand focus on more or less informative characteristics de- ing the rate of vibrations produced by the vocal cords—
pending on the type of speech representation used. To interpreted as pitch—and whose modulation is
responsidate, research on feature attribution for speech is no- ble for shaping the type of speech sound.
tably limited to few applications—including classification These three elements, known as acoustic correlates [38],
[27, 28] and generative tasks [29, 30, 31, 32]—which ofer are specific to both speakers and phonemes. For example,
a somewhat fragmented picture in the choice of speech speakers possess unique characteristics, including pitch
representations, thus providing limited insights on the and speaking rate [33], and also exhibit high variability
relation between the features considered and the expla- stemming from various sociodemographic factors such
nations based upon them. as gender, age, and dialect [39]. In these cases, the speech</p>
      <p>In light of the above, this paper reflects on the im- content needs to be disentangled from the variability in
pact of the chosen acoustic features in explaining the its delivery. Conversely, language sounds exhibit
variabilrationale behind speech models, aiming to gain a deeper ity in duration—e.g., /i/ in ship and sheep—and are
distinunderstanding of the trade-ofs associated with acoustic guished by specific frequency ranges [ 36]. The frequency
features. By first ofering a gentle introduction to the dimension also plays a vital role in shaping
suprasegmenrich and multidimensional nature of speech and its digital tal aspects of speech—broader phenomena that span
mulrepresentation, we identify current gaps and potential tiple segments—such as intonation, obtained by varying
avenues for efectively incorporating this multidimen- pitch [40]. Pitch, for instance, has a distinctive function
sionality into XAI for speech models. Our discussion will in tonal languages, where it is used to distinguish
lexifocus on two critical factors: i) the amount of information cal or grammatical meaning [41]. But even in non-tonal
these features provide about the model’s behavior, which languages, these prosodic elements are indispensable to
influences the richness of the explanations, and ii) the delivering diferent meanings and intents, as the reader
level of detail of such information, which determines the can perceive by reading out loud two contrastive
sengranularity of the explanations. We will also explore how tences such as: “You got the joke right” and “You got the
these aspects impact both the accuracy and meaningful- joke, right?”, where pauses and prosody play pivotal roles.
ness of the explanations, ultimately shaping their overall All these factors add to the multidimensionality of
interpretability. speech, which feature engineering strives to encapsulate
and that cannot be overlooked in the explanatory process.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The Correlates of Speech</title>
      <sec id="sec-2-1">
        <title>To gain deeper insight into the complexities of defining</title>
        <p>informative features in speech, we explore key
characteristics of speech and their implications for modeling.</p>
        <p>Speech is a multifaceted phenomenon. It is grounded
on the materiality of sound to convey linguistic content
(i.e. what is said), which is modulated depending on</p>
      </sec>
      <sec id="sec-2-2">
        <title>2The properties of accuracy and meaningfulness can be associated</title>
        <p>with those of faithfulness and plausibility, respectively [25, 26].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Speech Representations</title>
      <p>While various representations are used to encode speech
in a digital format, three main types are commonly given
3Throughout the paper, we use the abstract category of phoneme to
denote individual speech sounds. However, when discussing their
actual realizations, it is more accurate to refer to them as phones
[34].
4E.g. the order of sounds between /pAt/ (pot) or /tAp/ (top)
diferentiates two words.
as input to state-of-the-art speech models (for a review,
see [42, 43]). Namely, waveforms, spectrograms, and
mel-frequency cepstral coeficients (MFCCs), which are
shown in Figure 1.</p>
      <p>The waveform serves as the most fundamental
representation of a signal, comprising sequences of samples
(e.g., 16, 000 per second), each indicating the amplitude
of the signal at a specific point in time—essentially, the
lfuctuations in air pressure over time. This type of
representation is leveraged by models like Wav2vec [6].</p>
      <p>The spectrogram results from feature engineering
operations that decompose the speech signal into its
frequencies, presenting a 2D visualization of frequency
distributions over time. These representations are
commonly depicted as heatmaps, where color intensity
corresponds to the energy of a specific frequency at a given
moment. The time unit in spectrograms is represented
by a fixed-length window of a few milliseconds (e.g., 25),
commonly referred to as a frame, whithin which a given
number of waveform samples are encompassed. Notably,
the articulation of sounds produces time-frequency
patterns which are visible as darker regions [36]. Prominent
examples of state-of-the-art models leveraging
spectrograms are Whisper [9] and SeamlessM4T [44].</p>
      <p>The MFCCs ofer another 2D representation where
each coeficient captures important details about how Figure 1: Schematic illustration of the primary speech
repthe frequency content of the signal changes over time. resentations used by state-of-the-art speech models for the
Like spectrograms, MFCCs ofer information about both utterance “This is a waveform”. The features were computed
frequency and time, but in a more compact form. MFCCs using Librosa 0.10.1 [47].
are commonly used in the implementation of ASR
models within popular toolkits like Kaldi5 [45] and Mozilla
DeepSpeech6.</p>
      <p>Overall, though diferent in nature, these three types tise in speech processing, being aware of the trade-ofs
of representations are all efectively exploited by current they intrinsically entail is crucial for carefully conducting
speech models.7 For human understanding, however, XAI examination in speech. Indeed, it is precisely upon
they actually vary in terms of informativeness with re- such input features—and their trade-ofs—that
explanaspect to the acoustic correlates discussed in §2. Indeed, tions are built.
although both intensity and frequency are somewhat
discernible in waveforms, qualitative distinctions of pat- 4. Richness of Explanations
terns specific to pitch or phoneme frequencies are rarely
feasible [36]. Comparatively, spectrograms and MFCCs Considering the foregoing, there is a causal relationship
are richer and more descriptive, because they capture wherein explanatory possibilities in speech XAI are
inthe multiple dimensions of time, frequency, and intensity herently limited by the richness of the audio features
with finer detail. Still, spectrograms are more conducive used, specifically the dimensions they encapsulate. This
to phonetic analyses, given the established knowledge limitation directly correlates with the richness of the
in analyzing frequency patterns over time within this resulting explanations. Also, owing to the
compatibilrepresentation [36] In contrast, MFCCs are rarely used ity of current models with various representation types,
for phonetic analysis [46]. the explanations generated are inevitably confined by</p>
      <p>Overall, while weighting the informativeness and se- the specific input features provided to the model. To
exlection of speech representations requires a certain exper- emplify, if models process audio as waverfoms—which
poorly represent the frequency dimension for human
5See https://kaldi-asr.org/doc/feat.html. understanding—explanations accounting for such a
cor6Shetmel. https://deepspeech.readthedocs.io/en/master/DeepSpeech. relate will be out of reach. In fact, previous works by
7We are not aware of any recent study attributing higher systems Wu et al. [31] and Wu et al. [32], based on waveforms
performance depending on the used representation. solely focus on the temporal dimension to explain ASR.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Granularity of Explanations</title>
      <p>In these cases, to avoid limiting the understanding of the
models’ behavior to one single dimension it would be
advisable to explore alternative techniques that ofer Another critical factor concerning the informativeness
deeper insights into how models process other acous- of input features is the level of granularity at which
tic correlates. For instance, Pastor et al. [28] integrated the features are considered during the explanatory
procounterfactual explanations to specifically investigate cess. This decision afects the level of detail in the
resultwhether selected paralinguistic features such as pitch, ing explanations and, consequently, accuracy—as more
speaking rate, and background noise were influent for detailed explanations may more accurately reflect the
the model’s prediction. Additionally, various techniques model’s behavior—and their meaningfulness—as detailed
exist to analyze how models extract relevant patterns and comprehensive explanations can be more dificult to
from waveforms through convolutions [48, 49, 50]. interpret [12, 24].</p>
      <p>When the selected input features represent multiple In the time domain, for example, input features are
dimensions, as in the case of spectrograms or MFCCs, the highly fine-grained. As discussed in §2, spectrograms
decision to only account for one of these dimensions be- typically contain frames spanning tens of milliseconds,
comes arbitrary. For example, two models tested by Wu capturing detailed frequency content within each frame,
et al. [31], namely, DeepSpeech [51] and Sphinx [52], are whereas waveforms are composed of samples taken at
fed with spectrograms and MFCCs, respectively. How- much shorter time intervals—for instance, as mentioned
ever, explanations based on raw waveforms are provided in §2, there can be 16, 000 samples in just one second.
for these models. This inconsistency between the fea- This level of detail poses great challenges for (human)
tures used in explanations and those used by the models comprehension, particularly for a broader audience, since
inevitably ofers only a partial overview of the models’ mapping groups of frames/samples in an explanation to
behavior and limits the exploration of important acoustic recognizable speech units is highly time-consuming and
aspects. This, in turn, can impact the accuracy of the ex- requires specialized expertise.
planations, which ideally should encompass all relevant Accordingly, to address the issue and make
explanainformation. tions for speech more broadly accessible, previous works</p>
      <p>To prioritize explanation accuracy and conduct anal- have leveraged textual transcripts within the explanation
yses considering the crucial role of acoustic correlates process. More specifically, Wu et al. [32] and Pastor et al.
such as frequency, it is advisable to take into account all [28] resort to the alignment of audio to text, either for
dimensions embedded in the speech representation. individual phonemes or words, respectively, and apply
exThis approach is exemplified by the works of Markert plainability techniques to such units. While this approach
et al. [30], who provide explanations that account for helps decipher the contribution of input features based
the most influential elements in MFCCs, as well as Trinh on more intuitive linguistic units, it diverges from how
and Mandel [29] and Becker et al. [27], who base the current models process speech features in small frames
explanations on spectrograms. In the work by Mark- and samples [43]. This divergence risks overlooking the
ert et al. [30], however, it is challenging to connect the model’s behavior and compromises the accuracy and
efresults with specific acoustic parameters due to the com- fectiveness of the explanations. For instance, whether
plexity of analyzing MFCCs (see §3), which significantly ASR systems rely on shorter or longer time intervals than
undermines the meaningfulness of the explanations. In individual words remains unclear [29]. Therefore,
anacontrast, explanations using spectrograms ofer valuable lyzing this aspect requires a more granular approach at
insights into how machines process speech, producing the time level.
both accurate and meaningful results. For instance, Trinh In light of the above, explanations should be
oband Mandel [29] demonstrated that neural ASR models tained with low-level units to avoid biasing
explanafocus on high-energy time-frequency regions for tran- tions towards human understanding. The use of
audioscription, while Becker et al. [27] found that lower fre- transcript alignment to aid analysis of explanations can
quency ranges, typically associated with pitch, exhibit be very useful but should occur downstream of the
explahigher attribution scores in speaker gender classification nation process, not upstream. In this way, we can
maxitasks [27], showing some alignment with human speech mize the use of all available units to generate detailed and
processing. However, interpreting these insights requires accurate explanations, and then aggregate scores from
specialized expertise, which can reduce the meaningful- individual frames or samples to create more compact
ness of explanations for non-experts. This highlights representations at the level of phonemes or words,
ensurthat, even in speech, the balance between accuracy and ing flexibility in the meaningfulness of the explanations
meaningfulness can vary depending on the context [24]. according to specific needs. This bottom-up approach
mirrors practices in the text domain, providing
adaptability in defining attribution units that can range from
subwords to words or phrases [53, 54].</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>speech models [56].</p>
      <p>The balance of richness and granularity—which
This paper has examined the role of acoustic features and also relates to the interplay between accuracy and
their selection for explaining speech models. More specif- meaningfulness—is also relevant to common users who
ically, we considered a specific subfield of XAI, namely, interact with speech technologies. However,
investigatfeature attribution, which connects input features to out- ing how explanations can be efectively communicated
puts as a form of explanation. Previous research has not to and understood by these users in the context of daily
explicitly addressed how to incorporate features into the speech technology use exceeds the scope of this paper
explanation process within the speech domain, where and warrants further exploration.
input is encoded in more varied ways compared to other
ifelds, such as text. This has led to diverse approaches,
each with diferent implications for what can and can- 8. Acknowledgments
not be explained about model behavior, and with the
risk of not fully or accurately representing the model’s We acknowledge the support of the PNRR project FAIR
functioning. Future AI Research (PE00000013), under the NRRP MUR</p>
      <p>By discussing the key characteristics of speech and program funded by the NextGenerationEU. This paper
the properties of the most adopted acoustic features, we has also received funding from the European Union’s
argue that explanations should ideally encompass all Horizon research and innovation programme under grant
available dimensions, particularly time and frequency, as agreement No 101135798, project Meetween (My
Perboth are essential for a comprehensive understanding of sonal AI Mediator for Virtual MEETtings BetWEEN
Peothe models’ rationale. We have also discussed challenges ple).
associated with aligning explanations at high granularity
with human understanding, emphasizing solutions that References
provide flexibility in the analysis, allowing for
adjustments between more or less detail as needed.</p>
      <p>Building on these insights, our ongoing research
focuses on developing feature attribution techniques that
operate on spectrograms at the finest possible unit level,
integrating both time and frequency dimensions. Our aim
is to generate explanations that are accurate and
meaningful for experts, as well as adaptable for non-expert
users. More broadly, we hope that our reflections will
be beneficial and thought-provoking for researchers
currently working in, or entering, the field of XAI for speech
models, thereby contributing to a deeper understanding
of the rationale behind these models.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Limitations</title>
      <p>While exploring the relationship between the
informativeness of speech features and explanations, we have
deliberately not delved into the needs of specific
stakeholders for XAI applications. Indeed, diferent
stakeholders present varying needs [55, 56], and to consider them
is a research avenue of paramount importance for the
growth of XAI. As a nascent area of investigations,
however, XAI for speech is still relatively in its infancy, we
thus prioritized more fundamental methodological and
design decisions which prioritize a comprehensive and
detailed understanding at a low level of model’s rationale.
Accordingly, our reflections might be more appealing for
a range of users who engage with speech models and
possess expertise in machine learning and/or speech analysis,
ranging from developers to speech therapists assisted by</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zytek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Arnaldo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Berti-Equille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          ,
          <article-title>The Need for Interpretable Features: Motivation and Taxonomy, SIGKDD Explor</article-title>
          . Newsl.
          <volume>24</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . URL: https://doi.org/ 10.1145/3544903.3544905. doi:
          <volume>10</volume>
          .1145/3544903. 3544905.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Munteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oviatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brewster</surname>
          </string-name>
          , G. Penn,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whittaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rajput</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nanavati</surname>
          </string-name>
          ,
          <article-title>We need to talk: HCI and the delicate topic of spoken language interaction</article-title>
          ,
          <source>in: CHI '13 Extended Abstracts on Human Factors in Computing Systems, CHI EA '13</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2013</year>
          , pp.
          <fpage>2459</fpage>
          -
          <lpage>2464</lpage>
          . URL: https://doi.org/10.1145/2468356.2468803. doi:
          <volume>10</volume>
          .1145/2468356.2468803.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fawaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <article-title>Continuous Authentication for Voice Assistants</article-title>
          ,
          <source>in: Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking</source>
          , MobiCom '17,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>343</fpage>
          -
          <lpage>355</lpage>
          . URL: https://doi.org/ 10.1145/3117811.3117823. doi:
          <volume>10</volume>
          .1145/3117811. 3117823.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , U. Roedig, Personal Voice Assistant Security and
          <string-name>
            <surname>Privacy-A Survey</surname>
          </string-name>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>110</volume>
          (
          <year>2022</year>
          )
          <fpage>476</fpage>
          -
          <lpage>507</lpage>
          . doi:
          <volume>10</volume>
          .1109/JPROC.
          <year>2022</year>
          .
          <volume>3153167</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Malodia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhir</surname>
          </string-name>
          ,
          <string-name>
            <surname>Why Do People Use Artificial Intelligence (AI)-Enabled Voice</surname>
          </string-name>
          Assistants?, IEEE Transactions on Engineering
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>