=Paper=
{{Paper
|id=Vol-3793/paper5
|storemode=property
|title=Faithful Attention Explainer: Verbalizing Decisions Based on Discriminative Features
|pdfUrl=https://ceur-ws.org/Vol-3793/paper_5.pdf
|volume=Vol-3793
|authors=Yao Rong,David Scheerer,Enkelejda Kasneci
|dblpUrl=https://dblp.org/rec/conf/xai/RongSK24
}}
==Faithful Attention Explainer: Verbalizing Decisions Based on Discriminative Features==
Yao Rong¹,*, David Scheerer² and Enkelejda Kasneci¹
¹ Technical University of Munich, Arcisstraße 21, 80333 Munich, Germany
² University of Tübingen, Sand 14, 72076 Tübingen, Germany
Abstract
In recent years, model explanation methods have been designed to interpret model decisions faithfully
and intuitively so that users can easily understand them. In this paper, we propose a framework, Faithful
Attention Explainer (FAE), capable of generating faithful textual explanations regarding the attended-to
features. Towards this goal, we deploy an attention module that takes the visual feature maps from the
classifier for sentence generation. Furthermore, our method successfully learns the association between
features and words, which allows a novel attention enforcement module for attention explanation. Our
model achieves promising performance in caption quality metrics and a faithful decision-relevance
metric on two datasets (CUB and ACT-X). In addition, we show that FAE can interpret gaze-based human
attention, as human gaze indicates the discriminative features that humans use for decision-making,
demonstrating the potential of deploying human gaze for advanced human-AI interaction.
Keywords
Explainable AI (XAI), Saliency Map, Faithfulness, Visual Explanation, Textual Explanations
1. Introduction
Explainable AI (XAI) models are increasingly used, especially in safety-critical applications such
as automatic medical diagnosis [1, 2, 3]. An explanation of a decision should be understandable
to humans [4] and include the objects or features responsible for the decision made by the
model, i.e., it should be faithful to the model decision [5, 6, 7]. In image-based applications, two modalities are
typically used in model explanations: visual and textual explanations [8]. Several related works
in this context [9, 10, 11, 12, 13] reveal the areas that are discriminative (salient) for the neural network in
decision-making by means of saliency maps. Such saliency maps visualize the post-hoc attention
of a deep neural network. However, humans often prefer textual justifications of model decisions
since they provide easier access to the causal reasoning conveyed by models
[6, 14]. In this work, we introduce a novel method, “Faithful Attention Explainer” (FAE),
which generates faithful textual explanations according to the decision made by the classifier.
Late-breaking work, Demos and Doctoral Consortium, colocated with The 2nd World Conference on eXplainable Artificial
Intelligence: July 17–19, 2024, Valletta, Malta
* Corresponding author.
$ yao.rong@tum.de (Y. Rong); david.scheerer@student.uni-tuebingen.de (D. Scheerer); enkelejda.kasneci@tum.de
(E. Kasneci)
0000-0002-6031-3741 (Y. Rong); 0000-0003-3146-4484 (E. Kasneci)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Figure 1: FAE generates faithful explanations (Top). Using attention enforcement, FAE generates a sentence further explaining the attended-to area in GradCAM (Bottom).
As shown by an example in Figure 1, the explanation of our model includes the object
"skateboard", which is used for the action classification (shown in GradCAM [15]). When
we give the GradCAM as the extrinsic attention, the model describes more of the area,
such as "standing on a skateboard" and "going down a flight of stairs". Similarly, human attention
also conveys the potential to explain our decisions [16]. It is visualized in the saliency-map
style and compared to models' post-hoc attention maps in solving visual question answering
and classification tasks [17, 18]. In this context, the language model should also
be able to generate a faithful explanation based on human attention. Providing human attention
interpretation can help study the human attention mechanism and better integrate it into com-
puter vision applications. To summarize, this work proposes a novel framework, FAE, which
generates faithful textual explanations based on attention maps (from models or humans).
2. Related Work
Attention models for generating textual descriptions are known to be highly effective [19, 20, 21,
22, 23]. For example, [20] proposes an attention model consisting of linear layers to localize the
relevant area in the image for sentence generation. However, the attention model may ground the
current word to a wrong region since its hidden state contains only information about past
words [19]. To solve this problem, [24, 25] use extra supervision for correct visual grounding,
while [19] proposes the Prophet Attention (PA) model, which takes both future and past words
into account to recreate attention weights and thus does not require extra
supervision. Inspired by the PA model, we incorporate future words (generated after the current
word) to ground the current word in the image in our attention model. Generating faithful
explanations for classifiers is more than image captioning [6, 5] since the generated sentence
must rationalize the decision and include discriminative features for the distinctive output
class. To generate sentences conditioned on classifiers, previous works [26, 6, 5, 8, 14] use
features from the corresponding classifier and feed them into an LSTM layer to generate textual
explanations. However, these explanations may not be faithful to each sample since they are
trained to be discriminative at the class level and can thus mention features that are not visible in
that image [5]. Going beyond previous work, our framework utilizes an attention module for
word grounding directly.
3. Methodology
Our FAE generates textual explanations for image classifiers, i.e., FAE verbalizes classification
decisions by creating sentences containing words related to image regions that have been
important to the decision of the classifier.
Figure 2: Overview of Faithful Attention Explainer. The encoder is omitted for simplicity but the output features $V^f$ and $V^i$ from the encoder are denoted. The embedding layer is used to transform words into embeddings. Left: the attention model and decoder are illustrated; the attention model produces attention $\alpha$ based on the previous sequence. Right: the attention alignment is used to produce $\hat{\alpha}$ based on the generated sequence $\hat{y}_{t:T}$, which tries to align $\alpha$ with $\hat{\alpha}$.
In this section, we explain the details of each module
in FAE and introduce the Attention Enforcement algorithm in detail. Our network approach
follows an Encoder-Decoder framework. The goal of FAE is to take the image $x \in \mathbb{R}^{H \times W \times C}$
and to predict the class label as well as to create a textual explanation $\hat{\mathbf{y}}$ as a sequence of 1-of-$N$
words:

$\hat{\mathbf{y}} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T\}, \quad \hat{y}_t \in \mathbb{R}^N \qquad (1)$

where $T$ denotes the length of the output and $\hat{y}_t$ is the predicted word at step $t$. FAE exploits the
class-discriminative feature vector $V \in \mathbb{R}^{h \times w \times c}$ from the classifier (also used as the encoder
$\Sigma(\cdot)$). $\Sigma$ is a deep convolutional neural network and can extract several visual feature vectors $V$
from different layers of the input image $x$. Taking ResNet101 as an example, $V^f$ is the feature
map after the last residual block, while $V^i$ can be a set of feature maps taken from the final
layer of the first, second, and third blocks. For each step $t$, the attention model $f_{Att}(\cdot)$ computes
attention maps $\alpha_t$ based on the decoder's (an LSTM model) hidden state $h_{t-1}$ and the feature
vector from the encoder. The output of the attention module $V_t^a$ is given to the decoder and
guides it towards important areas relevant for the explanation: the attention-weighted average
of focus features $V_t^a$ is

$V_t^a = \frac{1}{K} \sum_{j=1}^{K} \alpha_{t,j} V_j^f. \qquad (2)$
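For illustration, the following is a minimal PyTorch-style sketch of an additive attention module in the spirit of [20] together with the weighted average of Eq. (2). The class name, layer sizes, and the flattening of the feature map into $K = h \cdot w$ regions are our own illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Sketch of f_Att: scores K spatial regions given the previous decoder state."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)    # projects V^f (K regions x c)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # projects h_{t-1}
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, V_f: torch.Tensor, h_prev: torch.Tensor):
        # V_f: (B, K, feat_dim) flattened feature map, h_prev: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(V_f) + self.hidden_proj(h_prev).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)            # (B, K) attention weights
        # Eq. (2): attention-weighted average of the focus features (mean gives the 1/K factor).
        V_a = (alpha.unsqueeze(-1) * V_f).mean(dim=1)      # (B, feat_dim)
        return V_a, alpha
```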
Figure 2 (Left) illustrates this architecture that contains the attention module for generating
textual explanations. We follow the method proposed in [20] to build and train this model. The
attention model computes its weights based on the previous hidden state of the LSTM, which
is generated using the previous input word; as a result, the attention weights are also based only on
the previous word. To tackle this challenge, we introduce a module called attention alignment.
Inside the module, we make use of future knowledge (words) to adjust the attention map for
the current word. To do so, a Bidirectional LSTM (BiLSTM) [27] is employed to encode the
generated sequence. The attention model described above is used to regenerate
new attention weights $\hat{\alpha}_t$ based on the hidden state $\hat{h}_{t-1}$ from the BiLSTM. Specifically, we obtain
$\hat{h}_{t-1}$ by concatenating the hidden states from the forward and backward paths (and halving the
dimension). Figure 2 (Right) illustrates the attention alignment.
$\hat{\alpha}_t = f_{Att}(\hat{h}_{t-1}, V^f), \qquad \hat{V}_t^a = \frac{1}{K} \sum_{j=1}^{K} \hat{\alpha}_{t,j} V_j^f \qquad (3)$

As a regularization to the training loss, we use the L1 norm between the newly grounded
attention weights $\hat{\alpha}$ and the ones generated by the attention model $\alpha$:

$\mathcal{L}_\alpha(\theta) = \sum_{t=1}^{T} ||\hat{\alpha}_t - \alpha_t|| \qquad (4)$
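The alignment step of Eqs. (3)-(4) can be sketched as follows. This is a simplified illustration assuming the generated words have already been embedded; the function name, the exact indexing of time steps, and the helper objects (`attention`, `bilstm`) are our assumptions rather than the authors' code.

```python
def attention_alignment_loss(word_embeds, V_f, alphas, attention, bilstm):
    """Sketch of Eqs. (3)-(4): re-ground each word with future context and align attentions.

    word_embeds: (B, T, word_dim) embeddings of the generated sequence y_hat
    V_f:         (B, K, feat_dim) encoder feature map
    alphas:      (B, T, K) attention weights produced during decoding
    attention:   the attention module f_Att (e.g., AdditiveAttention above)
    bilstm:      nn.LSTM(word_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
    """
    # Encode the whole generated sequence; forward and backward states are concatenated,
    # so hidden_dim // 2 per direction keeps the overall dimension unchanged.
    h_hat, _ = bilstm(word_embeds)                       # (B, T, hidden_dim)

    loss = 0.0
    for t in range(word_embeds.size(1)):
        # Eq. (3): regenerate attention from the BiLSTM state, which knows future words.
        _, alpha_hat_t = attention(V_f, h_hat[:, t])
        # Eq. (4): L1 distance between re-grounded and originally generated attention weights.
        loss = loss + (alpha_hat_t - alphas[:, t]).abs().sum(dim=1).mean()
    return loss
```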
Moreover, the attention can also be given by users, i.e., by replacing the learned attention weights
with other attention maps $\epsilon$, e.g., GradCAM or human gaze, during inference. We refer to this as
Attention Enforcement (AE). Concretely, we generate the focus feature $V^\epsilon$:

$V^\epsilon = \frac{1}{K} \sum_{j=1}^{K} \mathrm{Softmax}(\epsilon)_j \, V_j^f \qquad (5)$
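At inference time, Attention Enforcement amounts to substituting the learned weights with a softmax-normalized external map. Below is a minimal sketch under the same assumptions as above, in particular that the external map $\epsilon$ has already been resized and flattened to the $K$ feature-map locations.

```python
import torch.nn.functional as F

def attention_enforcement(V_f, epsilon):
    """Sketch of Eq. (5): build the focus feature from an external attention map.

    V_f:     (B, K, feat_dim) encoder feature map
    epsilon: (B, K) external saliency, e.g., a GradCAM or human-gaze map
             resized and flattened to the K spatial locations
    """
    weights = F.softmax(epsilon, dim=1)                 # normalize the external map
    V_eps = (weights.unsqueeze(-1) * V_f).mean(dim=1)   # weighted average, as in Eq. (2)
    return V_eps                                        # fed to the decoder instead of V_t^a
```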
4. Experiments
Metrics. To evaluate and compare our model with other works, we use the following metrics:
BLEU-4, ROUGE-L, METEOR, and CIDEr. These metrics measure the similarity between generated
sentences and their ground truth. However, they only indicate sentence quality on a
linguistic level and give no insight into the faithfulness of the generated explanations. Therefore,
inspired by [5], we measure the Faithful Explanation Rate (FER) of generated explanations compared to ground-truth
sentences. Specifically, for an image $x$, the discriminative visual regions
used in the model's decision are identified with the help of GradCAM [15]. Using the part
annotations, the decision-related part/object $y_o$ can be identified (the part that is closest to the
maximum value in GradCAM). Noun-phrases of that part in all ground-truth sentences are
extracted to form a set $\{g_1, g_2, \ldots, g_M\}$, where $g_i$ denotes a noun-phrase. For the generated
sequence $\hat{\mathbf{y}}$, we check whether $y_o$ appears in $\hat{\mathbf{y}}$; if not, the hit rate is 0. If yes, we extract the
corresponding noun-phrase $\hat{g}$, compare the word hit rate of $\hat{g}$ with all possible $g_i$, and
use the best one as the FER score.
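Since FER is only described procedurally, the following is a hedged sketch of how such a per-image score could be computed. The helper for noun-phrase extraction and the word-overlap definition of the hit rate are our own assumptions, not the authors' exact implementation.

```python
def fer_score(generated, ground_truths, decision_part):
    """Sketch of the Faithful Explanation Rate for one image.

    generated:     generated explanation, e.g. "this bird has a white belly and breast"
    ground_truths: list of ground-truth explanations for the image
    decision_part: decision-related part/object y_o found via GradCAM, e.g. "belly"
    """
    if decision_part not in generated.split():
        return 0.0                                  # y_o is not mentioned at all

    def noun_phrase(sentence, head):
        # Placeholder: a real implementation would use a chunker/parser to
        # extract the noun phrase containing `head`.
        words = sentence.split()
        i = words.index(head)
        return set(words[max(0, i - 2): i + 1])     # crude window around the head word

    g_hat = noun_phrase(generated, decision_part)
    candidates = [noun_phrase(gt, decision_part)
                  for gt in ground_truths if decision_part in gt.split()]
    if not candidates:
        return 0.0
    # Word hit rate of the generated noun phrase against each ground-truth phrase;
    # keep the best match as the FER score for this image.
    return max(len(g_hat & g) / len(g) for g in candidates)
```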
Datasets. We use two datasets for our experiments: the CUB-200-2011 dataset (CUB) and
the Action Explanation Dataset (ACT-X). CUB contains 11,788 images of birds distributed across
200 species [28]. Each image has ten explanations of the visual appearance collected by [29].
ACT-X [8] has 397 classes of activities and in total 18,030 images selected from [30]. For each
image, three explanations are provided. We follow the provided train and test splits on both
datasets. When evaluating the FER score, we use the part annotations on CUB and object-level
annotations on ACT-X. The object-level annotation on ACT-X, denoted MPII-ANO, covers only
a subset of ACT-X (150 images with 600 object classes) provided by [5].
Table 1 (Left): Comparison with other methods on CUB and ACT-X in standard sentence quality metrics.

Dataset | Method         | Backbone   | BLEU-4 | METEOR | CIDEr
CUB     | GVE [6]        | VGG        | -      | 29.20  | 56.70
CUB     | InterpNET [26] | VGG        | 62.30  | 37.90  | 82.10
CUB     | SAT [20]       | ResNet-101 | 57.14  | 36.71  | 61.80
CUB     | FAE (Ours)     | ResNet-50  | 57.94  | 36.33  | 55.98
CUB     | FAE (Ours)     | ResNet-101 | 60.19  | 38.13  | 66.36
ACT-X   | GVE [6]        | VGG        | 12.90  | 15.90  | 12.40
ACT-X   | PJ-X [8]       | ResNet-152 | 24.50  | 21.50  | 58.70
ACT-X   | SAT [20]       | ResNet-101 | 25.63  | 24.53  | 50.39
ACT-X   | FAE (Ours)     | ResNet-50  | 26.66  | 24.37  | 57.19
ACT-X   | FAE (Ours)     | ResNet-101 | 27.06  | 25.33  | 66.17

Table 1 (Right): FER score on CUB and MPII-ANO. The first block contains methods without Attention Enforcement (AE); the second block uses AE. ResNet-101 is used as the backbone for all models.

Method        | CUB   | MPII-ANO
SAT [20]      | 37.43 | 26.32
FAE (Ours)    | 39.42 | 28.40
SAT-AE [20]   | 38.54 | 26.84
FAE-AE (Ours) | 44.33 | 29.76
4.1. Quantitative Results
We first compare our model with other state-of-the-art approaches in terms of the linguistic quality
of generated explanations. In Table 1 (Left), we compare our FAE using two backbones with
InterpNET [26], Generating Visual Explanations (GVE) [6], and the Pointing and Justification
Explanation (PJ-X) model [8]. On CUB, our model (using a ResNet-101 backbone) outperforms
GVE, e.g., in the CIDEr metric our model achieves 66.36 while GVE achieves 56.70. Compared
to InterpNET, however, our model surpasses it only in METEOR. A possible reason is that
InterpNET deploys richer features (8192-dim compact bilinear features), two extra hidden layers,
and two stacked LSTM layers, which introduce higher computational costs and make the results
hard to reproduce. Results on ACT-X are shown in the second block. Our model (ResNet-101)
achieves higher scores in all three metrics than the other methods. Besides the linguistic quality,
FER scores are shown in Table 1 (Right). We compare our framework with SAT since Attention
Enforcement (AE) can also be applied to it. For a fair comparison, we evaluate both under
the same settings. In the first block, where no AE is used, FAE achieves the best performance:
39.42 on CUB and 28.40 on MPII-ANO, which validates that FAE excels at faithful
explanation generation. When using GradCAM attention enforcement, both SAT and FAE
improve their FER scores, while FAE surpasses SAT on both datasets. The improvement from
AE in both models validates the generalizability of AE.
4.2. Qualitative Results
We give GradCAM maps as extrinsic attention maps to guide FAE with AE to focus on
the area highlighted in the attention map. Two generated sentence examples are illustrated in
Figure 3. After applying the enforcement in the first example, the explanation incorporates the
part "a white belly", which was missing before. When applying the enforcement on the MPII-ANO
dataset, the effect is different: since GradCAM highlights a large area on the boat and in
the background (on the sea), the sentence after the enforcement describes the relation between
objects correctly: the man is standing "on the boat" instead of "in front of a boat". The results
show that our FAE can provide explanations that are faithful and human-understandable with
respect to not only intrinsic but also extrinsic attention maps.

Figure 3: Illustration of using attention enforcement on CUB and MPII-ANO. Left: images and extrinsic saliency maps are shown. Middle: frames denote the step where enforcement is activated. Right: sentences generated by FAE with and without attention enforcement. The top two examples use GradCAM from the classifier as extrinsic attention maps, while the bottom one uses human gaze maps.

Additionally, we try a different source of extrinsic attention for AE: Human Attention (HA).
We evaluate FAE with HA enforcement on the CUB test set and use the HA maps provided in
CUB Gaze-based Human Attention (CUB-GHA) [18]. This dataset was built by tracking the eye
fixations of humans while they were shown a bird image and asked to focus on the distinctive
features of that species. For each image there are multiple attention maps, and each attention
map represents an eye fixation. The bottom example in Figure 3 shows the HA attention maps.
When we apply AE using HA as extrinsic attention information, the sentence describes the two
attended areas: "a white belly" in the first fixation area and "breast with brown and black spots"
in the second. This setting confirms that our method can produce accurate textual explanations
focusing on user attention, demonstrating the generalizability of our proposed framework.
5. Discussion
Large Language Models (LLMs), such as the GPT series, have demonstrated their sophisticated
abilities in understanding and generating explanations. Recent advancements enable these
models to analyze multimodal data. For example, the GPT-4 model can create textual explanations
from an input image. To evaluate its effectiveness, we tested GPT-4 with
two types of images: an original image and a saliency map highlighting human attention, as
illustrated in Figure 4. GPT-4 successfully generated an analysis of the areas most salient
to human gaze. However, we observe a problem in the generated textual explanations: the
model fails to correctly identify the area where the user focused. For example, it mistook the
belly/breast area for the head. Such mistakes reflect a common weakness of these
models: hallucination. To harness the power of language models, for future work we consider
fine-tuning a smaller general-purpose language model to generate textual explanations based on the
areas of users' gaze attention. This approach can enhance the possibilities for intuitive and
direct interaction between humans and AI systems through gaze-based communication.
FAE with Human Attention:
This bird has a white belly and breast.
GPT-4:
… The saliency map suggests that people typically focus on the head
because it provides essential visual cues, …
Figure 4: Comparison of our method and GPT-4 in generating textual explanations.
6. Conclusion
In this paper, we propose a novel framework, FAE, that generates decision explanations
faithful to intrinsic attention, i.e., attention generated by an attention model based on visual features
from the classifier. Our results on the CUB and ACT-X datasets confirm the high
faithfulness and quality of the explanations provided by FAE. Moreover, we extend FAE with
Attention Enforcement and can thus interpret extrinsic attention, e.g., human attention. Looking
ahead, our method expands opportunities for natural and straightforward communication
between humans and AI systems via gaze-driven interactions.
References
[1] H. H. Pham, T. T. Le, D. Q. Tran, D. T. Ngo, H. Q. Nguyen, Interpreting chest x-rays via cnns
that exploit hierarchical disease dependencies and uncertainty labels, Neurocomputing
(2021).
[2] E. Tjoa, C. Guan, A survey on explainable artificial intelligence (XAI): towards medical
XAI, CoRR (2019). URL: http://arxiv.org/abs/1907.07374. arXiv:1907.07374.
[3] Y. Rong, N. Castner, E. Bozkir, E. Kasneci, User trust on an explainable ai-based medical
diagnosis support system, arXiv preprint arXiv:2204.12230 (2022).
[4] Y. Rong, T. Leemann, T.-T. Nguyen, L. Fiedler, P. Qian, V. Unhelkar, T. Seidel, G. Kasneci,
E. Kasneci, Towards human-centered explainable ai: A survey of user studies for model
explanations, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[5] S. Wickramanayake, W. Hsu, M. Lee, Flex: Faithful linguistic explanations for neural net
based model decisions, in: AAAI, 2019.
[6] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, T. Darrell, Generating visual
explanations, CoRR (2016). URL: http://arxiv.org/abs/1603.08507. arXiv:1603.08507.
[7] Y. Rong, T. Leemann, V. Borisov, G. Kasneci, E. Kasneci, A consistent and efficient evaluation
strategy for attribution methods, in: International Conference on Machine Learning, PMLR,
2022, pp. 18770–18795.
[8] D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, M. Rohrbach,
Multimodal explanations: Justifying decisions and pointing to the evidence, CoRR (2018).
URL: http://arxiv.org/abs/1802.08129. arXiv:1802.08129.
[9] V. Petsiuk, A. Das, K. Saenko, Rise: Randomized input sampling for explanation of black-
box models, BMVC (2018).
[10] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for
discriminative localization, in: CVPR, 2016.
[11] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual
explanations from deep networks via gradient-based localization, in: ICCV, 2017.
[12] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: ICML, 2017.
[13] A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating
activation differences, in: ICML, 2017.
[14] J. Kim, A. Rohrbach, T. Darrell, J. F. Canny, Z. Akata, Textual explanations for self-driving
vehicles, CoRR (2018). URL: http://arxiv.org/abs/1807.11546. arXiv:1807.11546.
[15] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, D. Batra, Grad-cam: Why
did you say that? visual explanations from deep networks via gradient-based localization,
CoRR (2016).
[16] M. I. Posner, S. E. Petersen, The attention system of the human brain, Annual review of
neuroscience (1990).
[17] A. Das, H. Agrawal, L. Zitnick, D. Parikh, D. Batra, Human attention in visual question
answering: Do humans and deep networks look at the same regions?, Computer Vision
and Image Understanding (2017).
[18] Y. Rong, W. Xu, Z. Akata, E. Kasneci, Human attention in fine-grained classification, arXiv
preprint arXiv:2111.01628 (2021).
[19] F. Liu, X. Ren, X. Wu, S. Ge, W. Fan, Y. Zou, X. Sun, Prophet attention: Predicting attention
with future attention, in: NeurIPS, 2020. URL: https://proceedings.neurips.cc/paper/2020/
file/13fe9d84310e77f13a6d184dbf1232f3-Paper.pdf.
[20] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio,
Show, attend and tell: Neural image caption generation with visual attention, CoRR (2015).
URL: http://arxiv.org/abs/1502.03044. arXiv:1502.03044.
[21] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, T.-S. Chua, Sca-cnn: Spatial and
channel-wise attention in convolutional networks for image captioning, in: CVPR, 2017.
[22] J. Lu, C. Xiong, D. Parikh, R. Socher, Knowing when to look: Adaptive attention via a
visual sentinel for image captioning, in: CVPR, 2017.
[23] Q. You, H. Jin, Z. Wang, C. Fang, J. Luo, Image captioning with semantic attention, in:
CVPR, 2016.
[24] C. Liu, J. Mao, F. Sha, A. Yuille, Attention correctness in neural image captioning, in:
AAAI, 2017.
[25] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, M. Rohrbach, Grounded video description, in:
CVPR, 2019.
[26] S. Barratt, Interpnet: Neural introspection for interpretable deep learning, ArXiv (2017).
[27] M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE transactions on
Signal Processing (1997).
[28] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011
Dataset, Technical Report, California Institute of Technology, 2011.
[29] S. E. Reed, Z. Akata, B. Schiele, H. Lee, Learning deep representations of fine-grained visual
descriptions, CoRR (2016). URL: http://arxiv.org/abs/1605.05395. arXiv:1605.05395.
[30] M. Andriluka, L. Pishchulin, P. Gehler, B. Schiele, 2d human pose estimation: New
benchmark and state of the art analysis, in: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2014.