<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>It makes sense! Exploring user preferences for AI explanations on video</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesc Xavier Gaya-Morey</string-name>
          <email>francesc-xavier.gaya@uib.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Maria Buades-Rubio</string-name>
          <email>josemaria.buades@uib.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Scott MacKenzie</string-name>
          <email>mack@yorku.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Manresa-Yee</string-name>
          <email>cristina.manresa@uib.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universitat de les Illes Balears</institution>
          ,
          <addr-line>Carretera de Valldemossa, km 7.5, 07122, Palma, Illes Balears</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>York University</institution>
          ,
          <addr-line>4700 Keele St, North York, ON M3J 1P3</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Many explainable artificial intelligence methods exist; however, there is a lack of user evaluations on explainability or trustworthiness. Consequently, it remains unclear which XAI methods are appropriate based on users and their preferences. We present an evaluation with expert users of six video removal-based XAI methods applied across three networks and two datasets. Experts consistently preferred the video-adapted RISE method, while identifying the video-adapted univariate predictors method as the least preferred. These findings provide insight for researchers and practitioners on the preferred XAI methods to use with videos, while also expanding the understanding of XAI methods from a human perspective.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;explainable artificial intelligence</kwd>
        <kwd>evaluation</kwd>
        <kwd>human-centered XAI</kwd>
        <kwd>video-based XAI methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The increasing presence of AI-driven systems across many domains [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] highlights the need for
explainability. Since the appearance of deep learning, numerous explainable AI (XAI) methods have emerged
with the goal of explaining the AI rationale to a human user [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] .
      </p>
      <p>
        While developing novel XAI methods is important, research is needed on human-centered XAI
(HCXAI) since explanations must align with specific users and in diferent contexts. That is, integrating
the human factor into the research and development of XAI [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is paramount.
      </p>
      <p>
        Although social sciences routinely explore human explanation processes, research in XAI often relies
on researchers’ intuition for a “good” explanation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and overlooks human understanding, recipient
profiles, and contextual factors around the explanation. On the one hand, recent studies highlight a gap,
revealing under-utilization of human-centered methods from human-computer interaction (HCI) in the
design of XAI systems [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. Research shows that human-centered approaches in XAI are relevant for
user-driven technical choices, identifying pitfalls in XAI methods, and providing conceptual frameworks
for human-compatible XAI [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. On the other hand, the literature shows a lack of frameworks, methods,
and metrics to evaluate whether XAI methods provide adequate explainability to humans [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ].
The few XAI user evaluations that exist often lack insight from cognitive or social sciences [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and
do not follow a standard procedure for measuring, quantifying, and comparing the explainability of
AI systems [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Further supporting this observation, Wells and Bednarz [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] conducted a systematic
review examining XAI studies through a user-focused lens. Their findings revealed that many studies
did not involve users. Even when user testing was conducted, key details were often omitted, such as
the number of participants, recruitment methods, or participants’ level of expertise in machine learning.
This limits the transparency and reproducibility of evaluations.
      </p>
      <p>
        Studies empirically evaluating XAI methods with specific tasks, users, and contexts show diferent
needs and preferences of users [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]. When addressing visual data (viz., images or videos),
evaluations of XAI methods with users are found for images [
        <xref ref-type="bibr" rid="ref18 ref19 ref20 ref21">18, 19, 20, 21</xref>
        ], but to the best of our
knowledge, there is no work addressing videos. Therefore, research is needed to fully understand the
impact of XAI methods for video and the efectiveness of the explanations.
      </p>
      <p>In this work, we describe a quantitative study with experts assessing six video removal-based XAI
methods applied across three networks and two datasets. To achieve this, we adapted six widely used
XAI methods, originally for image-based local explanations, to the video domain. To reveal diferences,
we generated explanations for three networks representing varied approaches: transformers and
convolutional models. Additionally, we utilized two publicly available human-action datasets: one
recorded in a controlled environment and another comprising videos from YouTube.</p>
      <p>We quantify user preferences of the six types of XAI methods. Our findings show an agreement both
with the preferred and the least preferred methods. In light of the findings, researchers and practitioners
have concrete design implications for user-driven choices for XAI methods.</p>
      <p>The paper is organized as follows: Section 2 provides a review of the key related concepts. Section 3
details the AI-driven system, including the datasets, neural networks, and XAI methods used. Section 4
outlines the methodology, covering the participants, apparatus, procedure, and design of the study. The
results are presented in Section 5, followed by a discussion in Section 6. Finally, Section 7 concludes
and highlights potential directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Human-centered explainable artificial intelligence</title>
        <p>
          XAI refers to methods and techniques that make the decisions and inner workings of AI models more
transparent and understandable to humans. As AI systems are increasingly complex, particularly in
deep learning, understanding how they arrive at outcomes is more dificult. XAI addresses this by
providing insight into model behavior, thus allowing users to interpret, trust, and validate AI-driven
decisions [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ].
        </p>
        <p>
          Explainability techniques are generally categorized along several key dimensions: data versus model
focus, direct versus post-hoc explanations, global versus local scope, and static versus interactive
presentation [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. First, explanations may aim to clarify either the properties of the input data or the
behavior of the model itself. When explaining the model, the distinction is between directly interpretable
models (e.g., linear regression, tree-based methods) and post-hoc explainability, which is applied after
the model is trained. Additionally, explanations may target individual predictions (local) or the model’s
behavior as a whole (global). Finally, explanations can be static or, as recommended by Miller [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
designed to support interactive user engagement for deeper understanding.
        </p>
        <p>
          Human-centered XAI builds on the foundation of traditional XAI by focusing not just on technical
explainability, but also on aligning explanations with human needs, cognitive processes, and context
[
          <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
          ]. Rather than assuming that an explanation is suficient, human-centered XAI emphasizes
usability, interpretability, and relevance for diverse users, including non-experts [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. The goal is
to create explanations that are intuitive and context-aware and thereby support decision-making to
improve the collaboration between humans and AI systems [
          <xref ref-type="bibr" rid="ref26 ref9">9, 26</xref>
          ]. This perspective recognizes that the
efectiveness of an explanation depends as much on the user as on the model itself.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. XAI applied to video data</title>
        <p>
          While image-based XAI methods are extensively studied (e.g., [
          <xref ref-type="bibr" rid="ref27 ref28 ref29">27, 28, 29</xref>
          ]), video-based XAI methods,
particularly model-agnostic ones, remain relatively underexplored due to the unique challenges posed
by video. However, model-agnostic methods are valuable because they ofer flexibility and broad
applicability for real-world scenarios. For an up-to-date and comprehensive overview of XAI methods
designed for video data, we refer the reader to the review by Gaya-Morey et al. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. User studies for visual data</title>
        <p>
          Most existing work on XAI focuses on algorithmic metrics, often overlooking how actual users interpret,
trust, or benefit from these explanations [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Consequently, relatively few user studies evaluate image
and video explanations with human participants. This gap is especially pronounced in the video domain,
where the temporal dimension adds complexity to human interpretation. Evaluating explanations with
real users is crucial for understanding their practical utility, improving design choices, and ensuring
that such systems align with human reasoning and decision-making.
        </p>
        <p>
          Regarding user studies on XAI images, Aechtner et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] studied users’ perception on local vs.
global explanations, showing the preference for AI novices toward local explanations. Manresa et al.
[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] also studied local vs. global explanations, engaging 104 users on aspects such as perceived trust or
understanding. Higher scores were observed for combinations of both explanations. Alqaraawi et al.
[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] also studied the performance of saliency maps in a user study. They reported a preference for
LRP and noted the limited help of explanations in predicting the network’s output for new images
or in identifying image features the system is sensitive to. Selvaraju et al. [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] explored whether
Grad-CAM explanations helped users establish appropriate trust in predictions. Their results showed
that Grad-CAM enabled untrained users to successfully diferentiate a “stronger” deep network from a
“weaker” one, even when they produced identical predictions.
        </p>
        <p>In our review of the literature for XAI applied to video, we did not find any work evaluating or
comparing diferent explanations on videos from a human perspective.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. AI-driven system and XAI Methods</title>
      <p>To evaluate user preferences for video XAI methods, we created a sample set combining three networks,
two datasets, and six XAI methods. This allowed us to introduce variation and thereby identify influences
across the three networks.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>
          We selected two datasets with distinct characteristics to train the models and evaluate the XAI methods:
Kinetics 400 [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] and EtriActivity3D [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ].
        </p>
        <p>The Kinetics 400 dataset is a large-scale collection of YouTube videos covering 400 human action
categories, with at least 400 video clips per class. The dataset focuses on diverse human activities,
including both interactions between people and interactions with objects. It features a wide variety of
participants, environments, and objects, alongside challenges such as camera motion and video edits
within the same clip, contributing to its complexity.</p>
        <p>In contrast, EtriActivity3D is a more specialized dataset containing 112,620 video samples across 55
activity classes. It focuses on everyday tasks performed by 100 individuals, half of whom are over 64
years old, providing insight into older demographics. The videos were captured in home environments
across multiple rooms and from eight fixed cameras, ensuring a stable, unedited recording for each clip.
This controlled setup allows for a consistent evaluation, free from the variations introduced by camera
movements or edits.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Neural networks</title>
        <p>
          We used three networks: TimeSformer [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], TANet [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], and TPN [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. These networks represent
diferent architectural approaches—transformers and convolutional models—allowing us to explore both
similarities and diferences in user evaluations. The choice of networks is justified by their performance
in action classification tasks and their public availability within the MMAction2 PyTorch-based
opensource toolbox for video analysis [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ].
        </p>
        <p>TimeSformer, a variant of the Vision Transformer, captures spatio-temporal features by processing
frame-level patches. TANet incorporates a Temporal Adaptive Module (TAM) within its 2D CNN
framework, enabling the capture of both short-term and long-term temporal dynamics using a two-level
adaptive mechanism. The Temporal Pyramid Network (TPN), on the other hand, extracts and integrates
spatial, temporal, and semantic information using hierarchical rescaling; that is, enhancing performance
for tasks with temporal variability. For both TANet and TPN, we used the ResNet50 architecture as the
backbone.</p>
        <p>For the Kinetics 400 dataset, we utilized pre-trained weights available in the MMAction2 framework.
For the EtriActivity3D dataset, we nfie-tuned the networks using the pre-trained Kinetics 400 weights,
training them for 10 epochs with 5-fold cross-validation.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. XAI methods</title>
        <p>
          Since we adopted networks with varying architectures, we opted for model-agnostic XAI methods,
which generate explanations independent of the underlying model. Specifically, we employed publicly
available1 video adaptations of widely used model-agnostic XAI methods, including LIME (Video LIME)
[
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], Kernel-SHAP (Video Kernel-SHAP) [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], RISE (Video RISE) [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], occlusion sensitivity (Video SOS)
[38], LOCO (Video LOCO) [39], and univariate predictors (Video UP) [40]. Aligned with the previously
mentioned dimensions of XAI, these methods aim to explain the model and are characterized by being
post-hoc, local in scope, and static in presentation [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
        <p>The operation of these methods involves four main steps: (1) segmenting the input video into regions
consisting of pixels from diferent frames, (2) occluding these regions and passing the modified video
through the model, where predictions change based on the occluded regions, (3) summarizing the
relevance of each region to the target prediction, and (4) visualizing the explanations. The exact
parameters used in each step depend on the method. The application of these XAI methods to explain a
model’s prediction for a given video produces an explanation in the form of a video, within which each
pixel represents the relevance of the corresponding pixel in the original video. Figure 1 displays an
explanation example using each method.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Video explanations</title>
        <p>For the evaluation, we selected a random sample of 30 videos from each dataset—Kinetics 400 and
EtriActivity3D. For consistency, we included only videos that were correctly classified by all three
networks used in the study. If a video was misclassified by any network, it was replaced with another
randomly selected video. To ensure a fair comparison, we enforced equal conditions for the methods,
such as the number of features, samples, and occlusion type.</p>
        <p>Each of the 30 videos from both datasets was processed through the three networks—TimeSformer,
TPN, and TANet. For every prediction, explanations were generated using all six XAI methods described
earlier. This resulted in 30 × 6 × 3 × 2 = 1, 080 explanations across the experiment.</p>
        <p>To enhance the clarity and interpretability of the explanations, only the top 30% most relevant regions
were retained, filtering out less significant areas. Additionally, we applied histogram stretching to
ensure the explanations utilized the full range of the color spectrum, making the visualizations more
distinct. Furthermore, negative relevance values were removed for two main reasons: to simplify the
information presented to users assessing the explanations and to standardize outputs across all XAI
methods, as not all methods provide both positive and negative relevance scores.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <sec id="sec-4-1">
        <title>Explanations were presented to the users to assess their preferences.</title>
        <sec id="sec-4-1-1">
          <title>4.1. Participants</title>
          <p>Six volunteer participants (three female) from the local university were recruited. Ages ranged from
24 to 47 years (mean = 34.7, SD = 10.4). The experts have extensive experience in both AI and XAI,
with their expertise grounded in years of specialized research and practical applications. Two of the
experts, the younger ones, worked in AI for at least three to four years and have spent the past two
years working in XAI. The more experienced experts have an extensive background both in AI and XAI,
having worked in the latter area for a minimum of five years. Further, three of the experts focus their
research in HCI. Their research spans a broad spectrum, including computer vision and deep learning
applied to HCI problems. Although all participants had experience with AI and XAI, their familiarity
did not extend to all XAI methods.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.2. Apparatus</title>
          <p>A user interface was developed to display the video, its associated class, the explanation, and a
corresponding color map to aid users in their evaluation. See Figure 2. The interface included the 1,080
explanations, with each screen displaying only one type of explanation.</p>
          <p>The question posed to the users during evaluation was: “Do the highlighted regions used in the
classification seem reasonable?” The response options ranged from -2 ( strongly disagree) to +2 (strongly
agree). Thus, negative scores highlight the deviation of participants from the explanation. The query
sought to determine whether the highlighted regions align with users’ perceptions when identifying
specific actions in the video. To mitigate any potential bias, explanations are presented in random
order and without information regarding the network, dataset, or XAI method. This ensures a “blind”
evaluation.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>4.3. Procedure</title>
          <p>The study was conducted using a laptop with the program installed locally, which participants were
allowed to take home. Each participant was tasked with evaluating 1,080 explanations, a process that
required approximately 3 to 4 hours. To accommodate this, participants were given the flexibility to
pause and resume the evaluation at their convenience.</p>
          <p>The explanations were presented to all participants in the same order, with the next explanation
automatically displayed after one was assessed. However, participants had the flexibility to navigate
freely between explanations, enabling them to revisit, reassess, and update their scores as needed.</p>
          <p>For each method, the mean score was computed for all participants for each XAI method. In addition,
we created aggregated barplots of the participant scores by method, dataset, and network and analyzed
statistical significance of the diferent factors.</p>
        </sec>
        <sec id="sec-4-1-4">
          <title>4.4. Design</title>
          <p>The study was a 6 × 3 × 2 within-subjects design with the following independent variables and levels:
• XAI method (Video RISE, Video Kernel-SHAP, Video LOCO, Video LIME, Video SOS, Video UP)
• Network (TimeSformer, TANet, TPN)
• Dataset (EtriActivity3D, Kinetics400)</p>
          <p>The dependent variable was the score for reasonableness on a 5-point Likert scale from -2 (strongly
disagree) to 2 (strongly agree).</p>
          <p>The total number of trials was 6,480 (= 6 participants × 6 XAI methods × 3 networks × 2 datasets ×
30 videos per condition).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>We now present the results for the participant assessments by the conditions tested. The grand mean
over all 6,480 explanations was 0.292. To the question of interest, this represents an overall participant
response between 0 (neutral) and 1 (mildly agree). Thus, there was a general tendency for participants
to feel the explanations from the XAI methods leaned toward “reasonable.” By XAI method, the means
were -0.806 (Video UP), -0.093 (Video LOCO), 0.368 (Video SOS), 0.356 (Video Kernel-SHAP), 0.540
(Video LIME), and 1.390 (Video RISE). By network, the means were 0.200 (TANet), 0.389 (TimeSformer),
and 0.288 (TPN). By dataset, the means were 0.096 (EtriActivity3D) and 0.489 (Kinetics 400). Further
analyses by combinations of these conditions are now described. The user-based evaluation results
are presented in Figures 3 and 4. Figure 3 illustrates the average scores per method, while Figure 4
aggregates the scores by dataset, network, and method.</p>
      <p>A three-way ANOVA was conducted to evaluate the efects of Dataset, Neural Network architecture,
and XAI Method on user ratings. Significant main efects were found for Dataset ( 1,6444 = 162.83),
Network (2,6444 = 12.49), and XAI Method (5,6444 = 369.43). Additionally, significant interactions
were observed between Dataset × Network (2,6444 = 10.98), Dataset × XAI Method (5,6444 = 43.17),
Network × XAI Method (10,6444 = 9.06), and the three-way interaction Dataset × Network × XAI
Method (10,6444 = 5.61). In all cases,  &lt; .001. These results indicate that user perception depends not
only on individual factors but also on their combinations, with the XAI Method showing the strongest
efect on ratings.</p>
      <p>To assess the explanatory power of diferent factors, we computed 2 for three models: one including
all factors (Dataset, Network, and XAI Method), one considering only the XAI Method, and one including
only Dataset and Network. The full model achieved 2 = .273, indicating that the factors together
explain 27.3% of the variance in user ratings. The model considering only the XAI Method yielded
2 = .208, confirming that XAI Method is the most influential factor. Conversely, the model including
only Dataset and Network yielded only 2 = .024, suggesting that these factors alone contribute
minimally to explaining user ratings. Additionally, the full model had the lowest AIC (21,216) and BIC
(21,460), indicating the best balance between goodness-of-fit and model complexity.</p>
      <p>
        A post hoc Tukey HSD test was conducted to analyze pairwise diferences between XAI Methods.
The results revealed significant diferences in most comparisons (  &lt; .05), except between Video
Kernel-SHAP and Video SOS ( = .999), where no significant diference was found. The Video RISE
method consistently obtained significantly higher ratings compared to other methods, with the largest
diferences observed against Video UP (mean diference = 2.20,  &lt; .001). Conversely, Video UP received
significantly lower ratings than all other methods. These findings confirm that the choice of XAI Method
strongly influences user ratings. Video RISE exhibits the most favorable results, attaining an average
score of 1.39 within the range [
        <xref ref-type="bibr" rid="ref2">-2, 2</xref>
        ]. In the second position, is Video LIME (0.54), closely followed by
Video SOS (0.37) and Video Kernel SHAP (0.36). Conversely, Video LOCO scores poorly (-0.09), and
Video UP receives a score of -0.81, the lowest score.
      </p>
      <p>A post hoc Tukey HSD test was also performed to analyze pairwise diferences between the three
neural network architectures. The results revealed that TimeSformer received significantly higher
ratings than TANet (mean diference = 0.19,  &lt; .001). However, the diferences between TPN and
TANet (mean diference = 0.09,  = .119) and between TimeSformer and TPN (mean diference = 0.10,
 = .056) were not statistically significant.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The preference for Video RISE by experts suggests that placing important regions over the image made
sense to the users (see Figure 5, first row, third column, explanation for the “brushing hair” class). Also,
the smooth explanations displayed by Video RISE, without hard edges, were favored over other methods.
This observation prompts the question of whether introducing smoothness in other methods, such as
through a Gaussian filter, would positively influence the quality of the explanation according to users.
While Video RISE consistently achieved superior results across datasets and networks, the performance
of other methods varied depending on these factors. For instance, Video UP scored approximately one
point higher on Kinetics 400 than on EtriActivity3D, and Video SOS performed better on TimeSformer
than on other networks. This suggests that certain XAI methods may be better suited to specific neural
networks or data characteristics.</p>
      <p>The dataset also influenced user ratings. On average, scores for Kinetics 400 were 0.39 points higher
than those for EtriActivity3D, with the ANOVA confirming this diference as significant ( 1,6444 = 162.8,
 &lt; .001). We attribute the diference to dataset complexity: Kinetics 400 features more challenging videos
with camera movements, cuts, and a broader range of action classes, making explanation generation
more dificult. In contrast, EtriActivity3D ofers a simpler context for identifying important regions for
classification, which likely influenced user scores.</p>
      <p>Regarding network selection, we observed significant diferences in average user scores between two
models: TimeSformer (average score = 0.39) and TANet (average score = 0.20). However, no significant
diference was found between TPN (average score = 0.29) and either of the other two. This demonstrates
that, even when trained under identical conditions, architectural diferences between models impacts
user evaluations. For example, the explanations with Video UP and Video SOS consistently received
higher scores when generated for TimeSformer compared to the other two networks, as shown in
Figure 3. Consequently, to ensure a fair assessment of XAI methods, experiments should include
multiple networks representing diverse architectural designs.</p>
      <p>A limitation of this study is the participant sample size. However, the unanimous agreement among
participants on both the best and worst explanations strengthens our confidence in the findings. Figure 5
presents examples of explanations that unanimously received the highest and lowest scores.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Although numerous XAI methods exist for generating explanations, selecting the most suitable method
remains a challenge for both researchers and practitioners. This study marks a step forward in
understanding how users perceive six well-known local XAI methods (including LIME, SHAP, or RISE)
when adapted to the video domain. By applying explanations on diverse datasets and networks, we
explored the influence of those factors. Remarkably, and although the sample of experts is small, there
was consensus: Video RISE was preferred by participants while Video UP received the lowest scores.</p>
      <p>User studies to evaluate XAI methods are essential for gaining insights into how users interact
with and interpret explanations from AI systems. This knowledge can guide technical decisions based
on users’ explainability preferences, and help in choosing an appropriate XAI method for real world
applications. However, the studies are time-consuming and costly, requiring significant resources to
gather meaningful data. To speedup the evaluation process, automatic metrics such as area under
the curve (AUC) can ofer more eficient ways to assess XAI methods. However, an ongoing debate
persists on whether the performance of XAI methods through objective metrics should take precedence
over user preferences in determining their efectiveness or application. Nevertheless, we believe it is
fundamental to test how well automatic metrics align with the user perspective. Bridging this gap will
ensure the evaluation process remains both efective and representative of real-world user experiences.</p>
      <p>Future work will involve evaluations with a larger sample of users to further validate and test our
ifndings. Additionally, incorporating greater participant diversity—such as variations in AI knowledge,
age, and other demographics—will provide deeper understanding of XAI methods from a human
perspective.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is part of the Project PID2023-149079OB-I00 (EXPLAINME) funded by MICIU/AEI/10.13039/
501100011033/ and ERDF, EU and of Project PID2022-136779OB-C32 (PLEISAR) funded by MICIU/ AEI
/10.13039/501100011033/ and FEDER, EU. F. X. Gaya-Morey was supported by an FPU scholarship from
the Ministry of European Funds, University and Culture of the Government of the Balearic Islands.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on generative AI</title>
      <sec id="sec-9-1">
        <title>The authors have not employed any Generative AI tools.</title>
        <p>[38] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings
of the 13th European Conference on Computer Vision – ECCV ’14 (LNCS 8689), Springer, Cham,
2014, pp. 818–833. doi:10.1007/978-3-319-10590-1_53.
[39] J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, Distribution-free predictive inference
for regression, Journal of the American Statistical Association 113 (2018) 1094–1111. doi:10.1080/
01621459.2017.1307116.
[40] I. Guyon, A. Elisseef, An introduction to variable and feature selection, Journal of Machine
Learning Research 3 (2003) 1157–1182. URL: https://www.jmlr.org/papers/volume3/guyon03a/
guyon03a.pdf.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Abbas</surname>
          </string-name>
          ,
          <article-title>A survey on deep learning and its applications</article-title>
          ,
          <source>Computer Science Review</source>
          <volume>40</volume>
          (
          <year>2021</year>
          )
          <article-title>100379</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.cosrev.
          <year>2021</year>
          .
          <volume>100379</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barredo Arrieta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Díaz-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Del</given-names>
            <surname>Ser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bennetot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tabik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barbado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gil-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Benjamins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chatila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <article-title>Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI</article-title>
          ,
          <source>Information Fusion</source>
          <volume>58</volume>
          (
          <year>2020</year>
          )
          <fpage>82</fpage>
          -
          <lpage>115</lpage>
          . doi:/10.1016/j.inffus.
          <year>2019</year>
          .
          <volume>12</volume>
          .012.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Adadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Berrada</surname>
          </string-name>
          ,
          <article-title>Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)</article-title>
          ,
          <source>IEEE Access 6</source>
          (
          <year>2018</year>
          )
          <fpage>52138</fpage>
          -
          <lpage>52160</lpage>
          . doi:
          <volume>10</volume>
          .1109/ACCESS.
          <year>2018</year>
          .
          <volume>2870052</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Schoonderwoerd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jorritsma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Neerincx</surname>
          </string-name>
          , K. van den Bosch,
          <string-name>
            <surname>Human-centered</surname>
            <given-names>XAI</given-names>
          </string-name>
          :
          <article-title>Developing design patterns for explanations of clinical decision support systems</article-title>
          ,
          <source>International Journal of Human-Computer Studies</source>
          <volume>154</volume>
          (
          <year>2021</year>
          )
          <article-title>102684</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.ijhcs.
          <year>2021</year>
          .
          <volume>102684</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Explanation in artificial intelligence: Insights from the social sciences</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>267</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.artint.
          <year>2018</year>
          .
          <volume>07</volume>
          .007.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leemann</surname>
          </string-name>
          , T.-T. Nguyen,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fiedler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Unhelkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seidel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kasneci</surname>
          </string-name>
          , E. Kasneci,
          <article-title>Towards human-centered explainable AI: A survey of user studies for model explanations</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis &amp; Machine Intelligence</source>
          <volume>46</volume>
          (
          <year>2024</year>
          )
          <fpage>2104</fpage>
          -
          <lpage>2122</lpage>
          . doi:
          <volume>10</volume>
          .1109/ TPAMI.
          <year>2023</year>
          .
          <volume>3331846</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Uusitalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lensu</surname>
          </string-name>
          ,
          <article-title>A unified and practical user-centric framework for explainable artificial intelligence</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>283</volume>
          (
          <year>2024</year>
          )
          <article-title>111107</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.knosys.
          <year>2023</year>
          .
          <volume>111107</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohseni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zarei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Ragan</surname>
          </string-name>
          ,
          <article-title>A multidisciplinary survey and framework for design and evaluation of explainable AI systems</article-title>
          ,
          <source>ACM Transactions on Interactive Intelligent Systems</source>
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <volume>24</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          :
          <fpage>45</fpage>
          . doi:
          <volume>10</volume>
          .1145/3387166.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Varshney</surname>
          </string-name>
          ,
          <article-title>Human-centered explainable AI (XAI): From algorithms to user experiences</article-title>
          ,
          <year>2022</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2110.10790.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Floridi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cowls</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beltrametti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chatila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chazerand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dignum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luetge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Madelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Pagallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schafer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Valcke</surname>
          </string-name>
          ,
          <string-name>
            <surname>E. Vayena,</surname>
          </string-name>
          <article-title>AI4People: An ethical framework for a good AI society: Opportunities, risks, principles, and recommendations</article-title>
          ,
          <source>Minds and Machines</source>
          <volume>28</volume>
          (
          <year>2018</year>
          )
          <fpage>689</fpage>
          -
          <lpage>707</lpage>
          . doi:
          <volume>10</volume>
          .1007/s11023-018-9482-5.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Hofman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Mueller</surname>
          </string-name>
          , G. Klein,
          <string-name>
            <given-names>J.</given-names>
            <surname>Litman</surname>
          </string-name>
          ,
          <article-title>Metrics for explainable AI: Challenges and prospects</article-title>
          ,
          <year>2019</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1812</year>
          .
          <volume>04608</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Miró-Nicolau</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. J. i Capó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moyà-Alcover</surname>
          </string-name>
          ,
          <article-title>Assessing fidelity in XAI post-hoc techniques: A comparative study with ground truth explanations datasets</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>335</volume>
          (
          <year>2024</year>
          )
          <article-title>104179</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.artint.
          <year>2024</year>
          .
          <volume>104179</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Burkart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <article-title>A survey on the explainability of supervised machine learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>70</volume>
          (
          <year>2021</year>
          )
          <fpage>245</fpage>
          -
          <lpage>317</lpage>
          . doi:
          <volume>10</volume>
          .1613/jair.1.12228.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wells</surname>
          </string-name>
          , T. Bednarz,
          <article-title>Explainable AI and reinforcement learning: Asystematic review of current approaches and trends</article-title>
          ,
          <source>Frontiers in Artificial Intelligance</source>
          <volume>4</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . doi:
          <volume>10</volume>
          .3389/frai.
          <year>2021</year>
          .
          <volume>550030</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dodge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K. E.</given-names>
            <surname>Bellamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dugan</surname>
          </string-name>
          ,
          <article-title>Explaining models: An empirical study of how explanations impact fairness judgment</article-title>
          ,
          <source>in: Proceedings of the 24th International Conference on Intelligent User Interfaces - IUI '19</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York,
          <year>2019</year>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>285</lpage>
          . doi:
          <volume>10</volume>
          . 1145/3301275.3302310.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>U.</given-names>
            <surname>Ehsan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Passi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Riedl</surname>
          </string-name>
          ,
          <article-title>The who in XAI: How AI background shapes perceptions of AI explanations</article-title>
          ,
          <source>in: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems - CHI '24</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York,
          <year>2024</year>
          , pp.
          <volume>316</volume>
          .
          <fpage>1</fpage>
          -
          <lpage>316</lpage>
          .32. doi:
          <volume>10</volume>
          .1145/3613904.3642474.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Szymanski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Millecamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Verbert</surname>
          </string-name>
          ,
          <article-title>Visual, textual or hybrid: The efect of user expertise on diferent explanations</article-title>
          ,
          <source>in: Proceedings of the 26th International Conference on Intelligent User Interfaces - IUI '21</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York,
          <year>2021</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>119</lpage>
          . doi:
          <volume>10</volume>
          .1145/3397481.3450662.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alqaraawi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuessler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Weiß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Costanza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Berthouze</surname>
          </string-name>
          ,
          <article-title>Evaluating saliency map explanations for convolutional neural networks: A user study</article-title>
          ,
          <source>in: Proceedings of the 25th International Conference on Intelligent User Interfaces - IUI '20</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York,
          <year>2020</year>
          , p.
          <fpage>275</fpage>
          -
          <lpage>285</lpage>
          . doi:
          <volume>10</volume>
          .1145/3377325.3377519.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Manresa-Yee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. X.</given-names>
            <surname>Gaya-Morey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Buades</surname>
          </string-name>
          ,
          <article-title>Impact of explanations for trustworthy and transparent artificial intelligence</article-title>
          ,
          <source>in: Proceedings of the XXIII International Conference on Human Computer Interaction- Interacción '23</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .1145/3612783. 3612798.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Heimerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Weitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>André</surname>
          </string-name>
          ,
          <article-title>Unraveling ML models of emotion with NOVA: Multilevel explainable AI for non-experts</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>1</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . doi:10.1109/TAFFC.2020.3043603.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Aechtner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cabrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Katwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Onghena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Valenzuela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wilbik</surname>
          </string-name>
          ,
          <article-title>Comparing user perception of explanations developed with XAI methods</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Fuzzy Systems - FUZZ-IEEE '22</source>
          ,
          <publisher-name>IEEE</publisher-name>
          , New York,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . doi:10.1109/FUZZ-IEEE55066.2022.9882743.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>V.</given-names>
            <surname>Arya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K. E.</given-names>
            <surname>Bellamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhurandhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Houde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Luss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mojsilović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mourad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pedemonte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raghavendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Richards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sattigeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shanmugam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Varshney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>AI Explainability 360: An extensible toolkit for understanding data and machine learning models</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . URL: http://jmlr.org/papers/v21/19-1035.html.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gruen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Questioning the AI: Informing design practices for explainable AI user experiences</article-title>
          ,
          <source>in: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems - CHI '20</source>
          ,
          <publisher-name>ACM</publisher-name>
          , New York,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . doi:10.1145/3313831.3376590.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Barda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Horvat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hochheiser</surname>
          </string-name>
          ,
          <article-title>A qualitative research framework for the design of user-centered displays of explanations for machine learning model predictions in healthcare</article-title>
          ,
          <source>BMC Medical Informatics and Decision Making</source>
          <volume>20</volume>
          (
          <year>2020</year>
          ). doi:10.1186/s12911-020-01276-x.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Braga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rosado</surname>
          </string-name>
          ,
          <article-title>XAI systems evaluation: A review of human and computer-centred methods</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>12</volume>
          (
          <year>2022</year>
          ). doi:10.3390/app12199423.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>U.</given-names>
            <surname>Ehsan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wintersberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Daumé III</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Riener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Riedl</surname>
          </string-name>
          ,
          <article-title>Human-centered explainable AI (HCXAI): Beyond opening the black-box of AI</article-title>
          ,
          <source>in: Extended Abstracts of the ACM SIGCHI Conference on Human Factors in Computing Systems - CHI EA '22</source>
          ,
          <publisher-name>ACM</publisher-name>
          , New York,
          <year>2022</year>
          . doi:10.1145/3491101.3503727.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>"Why should I trust you?": Explaining the predictions of any classifier</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16</source>
          ,
          <publisher-name>ACM</publisher-name>
          , New York,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          . doi:10.1145/2939672.2939778.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>V.</given-names>
            <surname>Petsiuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saenko</surname>
          </string-name>
          ,
          <article-title>RISE: Randomized input sampling for explanation of black-box models</article-title>
          ,
          <source>in: Proceedings of the British Machine Vision Conference - BMVC '18</source>
          , Newcastle, UK,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>151</lpage>
          . doi:10.48550/arXiv.1806.07421.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Neural Information Processing Systems - NIPS '17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>4768</fpage>
          -
          <lpage>4777</lpage>
          . URL: https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>F. X.</given-names>
            <surname>Gaya-Morey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Buades-Rubio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. S.</given-names>
            <surname>MacKenzie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manresa-Yee</surname>
          </string-name>
          ,
          <article-title>Revex: A unified framework for removal-based explainable artificial intelligence in video</article-title>
          ,
          <year>2024</year>
          . doi:10.48550/arXiv.2401.11796, submitted for publication.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <article-title>Grad-CAM: Visual explanations from deep networks via gradient-based localization</article-title>
          ,
          <source>in: Proceedings of the International Conference on Computer Vision - ICCV '17</source>
          ,
          <publisher-name>IEEE</publisher-name>
          , New York,
          <year>2017</year>
          , pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          . doi:10.1109/ICCV.2017.74.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hillier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Back</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Natsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suleyman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>The Kinetics human action video dataset</article-title>
          ,
          <year>2017</year>
          . doi:10.48550/arXiv.1705.06950.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>ETRI-Activity3D: A large-scale RGB-D dataset for robots to recognize daily activities of the elderly</article-title>
          ,
          <source>in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems - IROS '20</source>
          ,
          <publisher-name>IEEE</publisher-name>
          , New York,
          <year>2020</year>
          , pp.
          <fpage>10990</fpage>
          -
          <lpage>10997</lpage>
          . doi:10.1109/IROS45743.2020.9341160.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bertasius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          ,
          <article-title>Is space-time attention all you need for video understanding?</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning Research - PMLR '21</source>
          , volume
          <volume>139</volume>
          , ML Research Press, Maastricht University, Netherlands,
          <year>2021</year>
          , pp.
          <fpage>813</fpage>
          -
          <lpage>824</lpage>
          . URL: https://proceedings.mlr.press/v139/bertasius21a/bertasius21a-supp.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>TAM: Temporal Adaptive Module for video recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision - ICCV '21</source>
          ,
          <publisher-name>IEEE</publisher-name>
          , New York,
          <year>2021</year>
          , pp.
          <fpage>13688</fpage>
          -
          <lpage>13698</lpage>
          . doi:10.1109/ICCV48922.2021.01345.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Temporal pyramid network for action recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition - CVPR '20</source>
          ,
          <publisher-name>IEEE</publisher-name>
          , New York,
          <year>2020</year>
          , pp.
          <fpage>591</fpage>
          -
          <lpage>600</lpage>
          . doi:10.1109/CVPR42600.2020.00067.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          OpenMMLab,
          <article-title>OpenMMLab's next generation video understanding toolbox and benchmark</article-title>
          , URL: https://github.com/open-mmlab/mmaction2,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>