<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Natural Language Explanations for Visual Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yan Zhou</string-name>
          <email>yanzho@uio.no</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Baifan Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ingrid C. Yu</string-name>
          <email>ingridcy@ifi.uio.no</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Oslo Metropolitan University</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics, University of Oslo</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>35</volume>
      <fpage>8312</fpage>
      <lpage>8322</lpage>
      <abstract>
        <p>Visual Question Answering (VQA) is a challenging task that requires reasoning over both visual and textual information. Recently, there has been growing interest in enhancing VQA with Natural Language Explanations (NLEs) to improve transparency and trust. While existing methods leverage powerful language models for explanation generation, many score high on lexical-level text similarity rather than capturing the underlying reasoning process. In this work, we propose NMN-BART, a novel architecture that combines Neural Module Networks (NMNs) with the pretrained BART language model, using cross-modal fusion to bridge visual semantics and textual reasoning. We evaluate NMN-BART on the VQA-X dataset, where it significantly outperforms baselines on semantic-based metrics, despite lower scores on lexical similarity metrics. This suggests that our method excels at capturing the meaningful content of the explanations, rather than matching the references in wording. A case study with human evaluation further verifies our finding that our method produces semantically rich and persuasive explanations.</p>
      </abstract>
      <kwd-group>
        <kwd>visual question answering with explanations</kwd>
        <kwd>natural language explanation</kwd>
        <kwd>neural module network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Background. Visual Question Answering (VQA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a challenging task at the intersection of computer
vision and natural language processing, where an accurate answer needs to be generated by reasoning
over both visual and textual information given an image and a corresponding question. This task is
crucial as it mirrors real-world scenarios where machines must integrate multimodal data, enabling
more natural human-computer interactions.
      </p>
      <p>
        Recently, there has been growing interest in enhancing VQA with Natural Language Explanations
(NLEs) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], leading to the task of Visual Question Answering with Explanation (VQA-E), which aims to
produce both an accurate answer and a human-understandable explanation for that answer (Figure 1).
While the full VQA-E task involves predicting both answers and explanations, in this work we focus
on the generation of NLEs, assuming access to the correct answer, following prior work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This
targeted setup allows us to isolate and evaluate the model’s ability to produce semantically meaningful
and interpretable justifications, an essential component for enhancing transparency and user trust in
real-world applications such as medical diagnostics and autonomous driving. We further discuss the
rationales in Section 3.
      </p>
      <p>
        Challenges. Generating natural language explanations for VQA presents several challenges. A key
difficulty is aligning visual and textual modalities: the system must interpret image content and link it
to linguistic constructs to produce coherent explanations. Common approaches use vision-language
models that separate answer prediction and explanation generation [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], while end-to-end models [6]
attempt joint prediction of both. These models typically combine vision encoders like CLIP [7, 8, 6]
with transformer-based language models [
        <xref ref-type="bibr" rid="ref3">6, 3, 9</xref>
        ]. However, prior studies suggest that such models
may rely on dataset biases or visual “shortcuts” [10], enabling them to achieve high answer accuracy
without fully capturing the semantic content of the visual scene. This can lead to shallow or unfaithful
explanations that fail to reflect the actual reasoning process.
      </p>
      <p>Figure 1: An example of visual question answering with natural language explanations. Answer: "Eating." Explanations: "he is biting a vine of leaves with his mouth", "he is very hungry at the moment", "the giraffe is pulling food from the basket on the pole."</p>
      <sec id="sec-1-2">
        <p>Another limitation lies in generating explanations that faithfully reflect the reasoning process. Many
multimodal models fail to explicitly connect reasoning steps with the generated explanation. Neural
Module Networks (NMNs) [11, 10, 12] offer a structured reasoning framework by decomposing questions
into interpretable modules. While effective for answer prediction in VQA tasks, NMNs remain
underexplored for explanation generation, leaving their potential to produce reasoning-aligned outputs
largely untapped.</p>
        <p>
          Finally, evaluating NLEs remains a challenge. Many works [
          <xref ref-type="bibr" rid="ref3">6, 3, 9, 8</xref>
          ] try to achieve high scores on
n-gram-based metrics such as BLEU and ROUGE. These metrics focus on lexical comparison, often
insufficient for reflecting semantic alignment. Even semantically oriented metrics such as METEOR [13]
or SPICE [14] still compare generated explanations against reference texts, and may overlook extra
relevant information introduced in the generated explanation. Thus, new evaluation strategies are
needed to assess explanation quality beyond similarity to references.
        </p>
        <p>Contributions. This work presents our ongoing research on a novel architecture, NMN-BART, which
integrates NMN with the pretrained language model BART using a cross-modal fusion module. This
cross-modal fusion module integrates reasoning over Scene Graph (SG) representations with textual
information from the language model, facilitating the generation of explanations grounded in both visual
and textual reasoning. The compositional and explainable mechanism of NMN enables NMN-BART to
model a deeper reasoning process over the semantic relationships among the question, answer, and visual
content, thereby producing explanations that are both semantically richer and more interpretable.</p>
        <p>We evaluate NMN-BART on the VQA-X dataset, where our model achieves performance comparable
to state-of-the-art methods. Notably, NMN-BART scores significantly higher on semantic-based
metrics such as METEOR and SPICE, while exhibiting lower performance on n-gram based metrics
such as BLEU and ROUGE-L. We interpret these findings as evidence that our approach is capable of
generating explanations with enhanced semantic understanding, even when the generated text diverges
from the reference in terms of exact phrasing. Case studies with human evaluation further confirm
that NMN-BART generates rich and meaningful explanations, sometimes providing more information
than the reference explanations. Our main contributions are summarised as follows:
• We propose NMN-BART, a novel architecture that combines the reasoning capabilities of NMN with
the text generation of BART, for generating natural language explanations in VQA.
• We demonstrate through extensive experiments and evaluations on the VQA-X dataset that our
approach yields explanations with improved semantic quality, as evidenced by semantic-based metrics.
• We adopt and adapt metrics for representative case studies and human evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Visual question answering with explanation (VQA-E). Recent advances in VQA emphasise the
importance of generating natural language explanations (NLEs) to justify answers and improve
interpretability. Existing methods can be broadly grouped into two types: (i) generating explanations
without conditioning on the answers, and (ii) generating explanations conditioned on the answers.</p>
      <p>Category (i) approaches jointly predict answers and explanations. For example, NLX-GPT [6] frames
the task as unified text generation. It integrates a CLIP image encoder with a distilled GPT-2 decoder,
allowing the explanation to be generated as part of the reasoning process.</p>
      <p>
        Category (ii) methods typically decouple the VQA and explanation generation stages. They first
predict an answer, then condition the explanation on the answer, question, and image features [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">5, 4, 3, 9</xref>
        ].
For instance, the Rational Transformer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] combines GPT-2 with outputs from object detection, situation
recognition, and commonsense inference to generate rationales for complex visual reasoning tasks.
e-UG [9] similarly uses GPT-2, conditioning explanations on various combinations of visual and textual
features. S3C [8] improves explanation generation by incorporating answer scores as rewards in a self-critical
learning framework, using CLIP-based encoders and prompt-based templates to guide the generation.
Reasoning-enhanced Vision-Language (VL) models. Recent work has shown that incorporating
explicit reasoning into language models improves performance on complex tasks [15, 16]. However,
due to the complexity of aligning and integrating cross-modal information, VL models still struggle to
capture visual reasoning effectively.
      </p>
      <p>
        One line of research adopts Neural Module Networks (NMNs) [11], which dynamically compose
neural modules based on the input question. To mitigate the vision-to-reasoning shortcut in NMNs,
XNM [10] employs scene graphs for visual reasoning, instead of using “low-level” visual perception,
especially in datasets like CLEVR [17] with ground-truth scene graph annotations. For datasets without
scene graphs, such as VQA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], XNM constructs scene graph representations from visual features to
enable dynamic reasoning [18]. We adopt this strategy with the VQA-X [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] dataset.
      </p>
      <p>Inspired by NMN’s compositional reasoning, recent methods explore code generation [19, 20] or large
language models (LLMs) [21] for step-by-step visual reasoning. Chain-of-Thought (CoT) prompting has
also been applied in VL reasoning. For instance, CCoT [22] generates scene graphs via LLMs and uses
them in prompts to extract compositional knowledge. These approaches offer flexibility and strong
generalisation, with their zero-shot performance avoiding task-specific training or fine-tuning.</p>
      <p>In this work, we integrate an NMN [18, 23] with BART, a transformer-based language model [24],
leveraging NMN for compositional reasoning over scene graph representations and BART for natural
language generation. This guides explanation generation and helps the model capture richer semantic
relations among the image, question, and answer.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Formulation and Rationale</title>
      <p>In Visual Question Answering with Explanation (VQA-E), the objective is to develop a model that can
answer questions about images and generate textual explanations of the answers, by understanding
and reasoning over both visual and textual information. We follow the general task formulation in [9],
which denotes visual information as v (e.g., an image) and textual information as t (e.g., a question);
the objective of VQA-E is to learn a function ℱ to predict the answer a to the question and the explanation
e that justifies the answer a: (a, e) = ℱ(v, t).</p>
      <p>
        There are generally two paradigms for achieving this. One class of approaches generates the
answer and explanation simultaneously, without conditioning the explanation on the answer.
Another class adopts a post-hoc (after the fact) strategy, where the answer is first determined or given,
and the explanation is then generated conditioned on that answer. The task is then decomposed into
a = ℱ_a(v, t), e = ℱ_e(v, t, a). Some works generate both answers and explanations (conditioned or
unconditioned on the answer) and then filter out explanations where the answers are incorrect during
evaluation, a setting referred to as the filtered setting [8]. In this work, we adopt the post-hoc strategy
and condition the explanation generation on the answer. We follow a design choice similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], in
which a given answer a is provided to the explanation generation as additional input alongside v and t. This
design offers several advantages:
• No dataset filtering: It allows evaluation over the full dataset without filtering for only correct
predictions, preserving the diversity and difficulty of the original examples.
• Bias mitigation: It avoids the biases introduced by the filtered setting, which removes cases where the predicted
answer is incorrect; these are often difficult or ambiguous cases, and potentially informative and
important for assessing explanation quality.
• Focused generation: By receiving the answer as input, the model can concentrate on elaborating,
contextualising, and justifying the answer, leading to more relevant and detailed explanations.
      </p>
      <p>Based on these rationales, we formulate our task as learning a function ℱ_e that generates a natural
language explanation e from the image v, the question t, and the given answer a (Eq. 1). Here, t and e
are natural language sentences, and a is typically a word or a short phrase. The explanation e consists
of one or more sentences that provide a human-understandable rationale for the given answer.
e = ℱ_e(v, t, a)
(1)</p>
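      <p>For illustration, the two paradigms can be sketched as function compositions in Python. This is a toy sketch of the control flow only; the predictor names and stub outputs below are hypothetical, not the actual models.</p>

```python
# Minimal sketch of the two VQA-E paradigms. The predictors are illustrative
# stubs standing in for learned models, not the actual NMN-BART system.

def joint_paradigm(predict_joint, v, t):
    """Paradigm 1: answer and explanation produced together, (a, e) = F(v, t)."""
    a, e = predict_joint(v, t)
    return a, e

def post_hoc_paradigm(predict_answer, explain, v, t, a=None):
    """Paradigm 2 (ours): a = F_a(v, t) unless a is given, then e = F_e(v, t, a)."""
    if a is None:
        a = predict_answer(v, t)
    e = explain(v, t, a)
    return a, e

# Toy stubs to exercise the control flow.
stub_answer = lambda v, t: "skateboarding"
stub_explain = lambda v, t, a: "he is doing a trick on a " + a.replace("ing", "")

a, e = post_hoc_paradigm(stub_answer, stub_explain, v="image", t="What is the man doing?")
```

      <p>Passing a known answer via the a argument corresponds to our setting, where the ground-truth answer is supplied as input.</p>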
    </sec>
    <sec id="sec-4">
      <title>4. Our Approach: NMN-BART</title>
      <p>NMN-BART consists of (1) visual preprocessing, (2) an NMN-BART encoder, and (3) a BART decoder
(Figure 2). Visual preprocessing transforms the image into scene graph representations g; the NMN-BART
encoder takes g and the text (question t and answer a) as input and produces a fused encoding,
which is then processed by the BART decoder to generate the explanation e.</p>
      <p>Visual preprocessing. Here the image v is transformed into scene graph representations g. Visual
features are extracted from images using the Bottom-Up Attention model [25], a pretrained visual model
that employs a Faster R-CNN detector trained on the Visual Genome dataset. These visual features,
termed visual Region-of-Interest (RoI) features, serve as the visual foundation for the construction of g.</p>
      <p>To structure visual information for compositional reasoning, we convert these region-level features
into scene graph representations following [10]. A scene graph is a structured representation of an
image that encodes objects (nodes), their attributes, and pairwise relations (edges) between them. In
our approach, each node v_i in the graph corresponds to a detected object and is constructed from the
visual RoI features. Each edge e_ij represents a spatial or semantic relation between two objects and is
constructed by concatenating the visual RoI features of the two connected nodes: e_ij = [v_i; v_j].</p>
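      <p>The node and edge construction can be sketched in a few lines of Python. This is a toy illustration under the assumption that RoI features are plain vectors; the feature dimension here is made up (real Bottom-Up Attention features are much larger).</p>

```python
# Toy sketch: building scene graph node and edge features from RoI vectors.
# Real RoI features come from Bottom-Up Attention; here they are random lists.
import random

def build_scene_graph(roi_features):
    """Nodes are the RoI features; each directed edge is the concatenation [v_i; v_j]."""
    nodes = list(roi_features)
    edges = {}
    for i, v_i in enumerate(nodes):
        for j, v_j in enumerate(nodes):
            if i != j:
                edges[(i, j)] = v_i + v_j  # list concatenation implements [v_i; v_j]
    return nodes, edges

dim = 8  # hypothetical feature dimension for illustration
rois = [[random.random() for _ in range(dim)] for _ in range(3)]
nodes, edges = build_scene_graph(rois)
```

      <p>With n detected objects this yields n(n-1) directed edges, each of twice the node dimension, matching the concatenation-based edge definition above.</p>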
      <p>This graph-based representation provides a structured abstraction over raw pixel data, which enables
NMN to reason over entities and relations in a compositional manner, facilitating interpretable reasoning
for the explanation generation.</p>
      <p>NMN-BART Encoder. The NMN-BART encoder can be seen as a BART encoder enhanced with an NMN
(Figure 2). The BART encoder layers transform text into contextualised feature representations. The
NMN processes these text features along with the scene graph representations to learn neural modules,
producing intermediate results referred to as module outputs. The cross-modal fusion layers then
combine these module outputs with the output of the previous encoder or fusion layer to form the fused
encoding H, which is used by the BART decoder to generate explanations. The overall functionality of the
NMN-BART encoder can be summarised as: H = NMN-BART-Encoder(g, t, a).</p>
      <p>BART encoder layers. We use a pretrained BART as the foundational language model, leveraging its
encoder to process the textual inputs (questions and answers). We choose BART, a transformer-based
sequence-to-sequence model, because it combines the strengths of bidirectional and autoregressive
transformers, making it effective for natural language generation tasks [24]. The text inputs t, a are
tokenised into a sequence of tokens W = {w_0, w_1, ..., w_n} = Tokeniser(t, a) and embedded into dense
vector representations H^0 = Embed(W), which are then passed to the first encoder layer. The output
representation of each encoder layer l is computed as:</p>
      <p>H^l = {h_0^l, h_1^l, …, h_n^l} = Encoder-Layer({h_0^{l−1}, h_1^{l−1}, …, h_n^{l−1}}), for l = 1, …, L
(2)
where Encoder-Layer(.) is a single transformer encoder layer, {h_0^l, h_1^l, …, h_n^l} denotes the hidden
states H^l of the l-th layer, and L is the number of encoder layers of the pretrained BART model.
Neural Module Network. The NMN takes two inputs: 1) textual features from t and a, including the text
embedding H^0 and the hidden states from the first encoder layer H^1; 2) the scene graph representations g.</p>
      <p>The NMN processes these inputs to produce intermediate outputs O_k from the reasoning steps on the scene
graph, where k denotes the reasoning step, expressed as:</p>
      <p>{O_1, O_2, O_3} = NMN(H^0, H^1, g)
(3)
where O_k = {o_0^k, o_1^k, …, o_d^k}, k = 1, 2, 3, represents the intermediate module output at steps 1, 2, and 3,
with d being the dimension of the module output.</p>
      <p>We choose NMN because it enables fully differentiable training via back-propagation without expert
supervision of reasoning steps. Following StackNMN [18] for the modular reasoning process and [10]
for reasoning over scene graphs, we summarise the key components of StackNMN here and refer to [18]
for technical details. StackNMN consists of three components: 1) The Layout Controller converts
text information into a temporal distribution over module weights, segmenting reasoning into steps. 2)
Module weights are assigned to a sequence of neural modules, each designed to perform a reasoning
step. These modules include Find, Transform, And, Or, Filter, Scene, Answer, Compare, and NoOp, and
operate on scene graph representations. The module outputs are visual attention maps or score vectors
over possible answers. 3) A differentiable memory stack stores and accesses intermediate module
outputs during execution.</p>
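      <p>The soft module selection that makes StackNMN differentiable can be sketched as a weighted average of per-module outputs. The sketch below is a pure-Python toy with made-up module behaviours; real modules attend over scene graph nodes and the weights come from the layout controller.</p>

```python
# Sketch of one soft reasoning step: the layout controller emits weights over
# modules, and the step output is the weighted sum of all module outputs,
# which keeps the whole computation differentiable.

def soft_step(modules, weights, attention):
    out = [0.0] * len(attention)
    for name, w in weights.items():
        module_out = modules[name](attention)
        out = [o + w * m for o, m in zip(out, module_out)]
    return out

# Toy modules over an attention vector; names follow StackNMN.
modules = {
    "Find":      lambda att: [1.0 if a > 0.5 else 0.0 for a in att],  # toy re-attention
    "Transform": lambda att: att[1:] + att[:1],                       # toy attention shift
    "NoOp":      lambda att: att,                                     # pass-through
}
weights = {"Find": 0.7, "Transform": 0.2, "NoOp": 0.1}  # from the layout controller
attention = [0.9, 0.2, 0.6]
step_out = soft_step(modules, weights, attention)
```

      <p>Because every module runs at every step and is mixed by soft weights, gradients flow through all modules, which is what allows training without expert supervision of the reasoning layout.</p>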
      <p>Cross-modal fusion. The fusion layers integrate visual and textual information from the NMN (module
outputs) with the hidden states from the previous encoder or fusion layer. Each fusion layer corresponds
to one neural module reasoning step. For the l-th fusion layer (right side of Figure 2), we concatenate
the first token node from the previous layer h_0^{l−1} with the last token node of the module output o_d^{l−1},
and feed it into an MLP (Eq. 4). The resulting fused token h̃_0^{l−1} is then propagated to all nodes in the
fusion layer output through the self-attention mechanism of the transformer (Eq. 5, Eq. 6). One fusion
layer can be summarised in Eq. 7. The final output of the fusion layers is the fused encoding H.
[h̃_0^{l−1}; õ_d^{l−1}] = MLP([h_0^{l−1}; o_d^{l−1}])
(4)
H̃^{l−1} = [h̃_0^{l−1}, h_1^{l−1}, …, h_n^{l−1}]
(5)
H^l = SelfAttention(H̃^{l−1})
(6)
H^l = Fusion-Layer([H^{l−1}; O^{l−1}])
(7)</p>
      <p>BART decoder. The fused encoding H from the last fusion layer is passed to the BART decoder to
generate explanations: e = BART-Decoder(H).</p>
      <p>Training scheme. We initialise the encoder, fusion, and decoder layers with a pretrained BART
model. Visual preprocessing (scene graph generation) is computed in advance. The entire NMN-BART
encoder and the BART decoder are trained end-to-end using cross-entropy loss [24], with inputs t, a,
precomputed g, and reference explanation e.</p>
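      <p>The shape of one fusion layer (Eqs. 4–7) can be sketched in pure Python. The "MLP" and "self-attention" below are toy stand-ins with the right input/output shapes, not the learned components; dimensions are made up for illustration.</p>

```python
# Sketch of one cross-modal fusion layer: fuse the first text token with the
# last module-output token via an MLP (Eq. 4), rebuild the token sequence
# (Eq. 5), and mix across tokens with attention (Eq. 6).

def mlp(vec):                      # stand-in for the learned one-hidden-layer MLP
    return [0.5 * x for x in vec]  # keeps the input dimension

def self_attention(tokens):        # stand-in: every token becomes the mean token
    n = len(tokens)
    mean = [sum(t[k] for t in tokens) / n for k in range(len(tokens[0]))]
    return [mean[:] for _ in tokens]

def fusion_layer(hidden, module_out):
    h0, om = hidden[0], module_out[-1]
    fused = mlp(h0 + om)             # Eq. 4: MLP([h_0; o_d])
    h0_tilde = fused[:len(h0)]       # keep the fused text-token half
    mixed = [h0_tilde] + hidden[1:]  # Eq. 5: fused first token, rest unchanged
    return self_attention(mixed)     # Eq. 6: propagate to all token nodes

hidden = [[1.0, 2.0], [3.0, 4.0]]    # toy hidden states (2 tokens, dim 2)
module_out = [[5.0, 6.0]]            # toy module output (1 token, dim 2)
new_hidden = fusion_layer(hidden, module_out)
```

      <p>The point of the sketch is the information flow: visual evidence enters through a single fused token and the attention step spreads it to every position, which is how each reasoning step conditions the whole text representation.</p>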
    </sec>
    <sec id="sec-5">
      <title>5. Experiment</title>
      <p>
        Dataset. We evaluate our method on the VQA-X dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], an established benchmark that extends
the Visual Question Answering (VQA) dataset with human-written explanations. VQA-X contains 33k
question-answer pairs over 28k images sourced from the MSCOCO dataset [26]. Each question averages
7.5 words, and each explanation around 11 words, with a vocabulary size of approximately 10k. The
data is split into training (29k), validation (1.4k), and test (1.9k) sets. Each question may have multiple
valid answers. The scale and diversity of VQA-X make it well suited for assessing both answer accuracy
and the quality of generated explanations.
      </p>
      <p>
        Baselines. We compare our NMN-BART model against representative baselines, categorised by their
use of answer conditioning (more discussion see Section 3). Not answer-conditioned: NLX-GPT [6]
generates explanations without relying on the predicted answer. Other methods are answer-conditioned,
where RVT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and our model NMN-BART receive a given answer that guides explanation generation,
while the other methods apply a filtered setting: keeping only those explanations whose answers are correctly
predicted by the model, on the assumption that explanations supporting incorrect answers are invalid and should
be excluded from evaluation [6]. These methods include PJ-X [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], FME [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], e-UG [9], and S3C [8], where
most methods condition the explanation directly on answers, and in S3C the explanation is rewarded
by the correct answer using reinforcement learning.
      </p>
      <p>Automatic evaluation metrics. We use two types of automatic metrics for explanation evaluation:
N-gram-based metrics: These metrics assess lexical-level similarity, measuring word and phrase overlap
without considering deeper meaning. We use BLEU-1 [27] to measure the unigram precision, capturing
word overlap between the generated and reference texts, with a brevity penalty for short outputs.
ROUGE-L [28] uses the Longest Common Subsequence to assess sentence-level structural similarity.
Both are primarily sensitive to n-gram overlap.</p>
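      <p>For concreteness, both n-gram metrics can be sketched in pure Python. These are simplified single-reference versions for illustration only, not the official implementations used in our experiments.</p>

```python
# Sketch of BLEU-1 (clipped unigram precision with brevity penalty) and
# ROUGE-L (LCS-based F-score), each over one candidate and one reference.
import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    cc, rc = Counter(cand), Counter(ref)
    overlap = sum(min(n, rc[w]) for w, n in cc.items())  # clipped unigram matches
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

def rouge_l(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    # dynamic-programming longest common subsequence
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, cw in enumerate(cand):
        for j, rw in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cw == rw else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

score_b = bleu1("the cat sat", "the cat sat on the mat")
score_r = rouge_l("the cat sat", "the cat sat on the mat")
```

      <p>Note how a perfectly precise but short candidate is still penalised: BLEU-1's brevity penalty and ROUGE-L's recall both drop when the candidate covers only part of the reference, which is exactly the lexical-overlap behaviour discussed above.</p>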
      <p>Semantic-based metrics: These metrics evaluate a deeper semantic alignment between the generated
and reference text. METEOR [13] goes beyond n-gram matching by incorporating stemming, synonym
matching, and paraphrase tables. SPICE [14] converts both explanations into scene graphs and evaluates
the alignment of objects, attributes, and relationships.</p>
      <p>Implementation. We use the pretrained BART-base from Facebook [29] as our language backbone.
The encoder consists of 6 layers, with 3 of them configured as fusion layers corresponding to the three
NMN modules. The model is trained for 15 epochs (17,730 steps) using a batch size of 24 and a learning
rate of 2e-5, on a single Quadro GV100 GPU for approximately 4.8 hours. For cross-modal fusion,
we apply a one-hidden-layer MLP that projects the concatenation of token nodes from LM and NMN
outputs to a 288-dimensional space, with a dropout rate of 0.2.</p>
      <sec id="sec-5-1">
        <title>5.1. Results and Discussion</title>
        <p>Results. The results of all methods on the VQA-X dataset are summarised in Table 1. The methods
are categorised first by whether the explanation generation is not answer-conditioned or answer-conditioned.
Among the latter, the methods are further categorised by applying the filtered setting
(keeping only explanations whose answers are correct) [8] or by using a given answer (a given answer is
an input for generating the explanation). It can be observed that recent methods, such as NLX-GPT and
S3C, achieve high scores on BLEU-1 and ROUGE-L, indicating strong lexical-level similarity with the
reference explanations. Our approach obtains substantially higher scores on semantic-based metrics,
achieving a METEOR of 38.5 and a SPICE of 25.9, improvements of 61% and 13%, respectively, compared
to the second-best baseline S3C. On the other hand, NMN-BART has BLEU-1 (36.8) and ROUGE-L
(31.3) scores that are notably lower than many baselines.</p>
        <p>Discussion. By comparing the two types of metrics across all methods, we can conclude that
semantic-based metrics are generally more challenging than n-gram-based metrics. On average, most methods
score high on n-gram-based metrics but relatively low on semantic-based ones (100 is a perfect score).</p>
        <p>The n-gram-based metrics primarily focus on word or phrase overlap. BLEU-1 captures unigram
overlap, while ROUGE-L focuses on the longest common subsequence. From the results, we can infer that
NLX-GPT and S3C generate explanations that closely overlap with the reference, while NMN-BART
performs less well on lexical overlap. However, this does not imply worse performance for NMN-BART;
rather, it reflects that NMN-BART generates lexical content that differs more from the reference.</p>
        <p>METEOR extends n-gram overlap by incorporating semantic alignment. It also considers recall
and penalises brevity, meaning that explanations lacking key information or being overly short will
score lower. METEOR thus balances lexical similarity and semantic alignment, capturing subtleties
that n-gram metrics might miss. Observing that methods such as NLX-GPT and S3C score higher on
n-gram-based metrics but lower on METEOR, we postulate that their explanations tend to
be shorter or omit important details, whereas NMN-BART generates explanations that are relatively
longer and cover more of the important information overlapping with the reference.</p>
        <p>SPICE, designed for evaluating image captions, constructs scene graphs from both generated and
reference texts, comparing entities, relationships, and attributes. It considers both precision and recall,
and is robust to lexical variation. As the most challenging metric, SPICE scores tend to be low for all
methods. Methods with high n-gram-based scores also tend to perform poorly on SPICE, probably
because they generate shorter explanations with less coverage of key information. Despite scoring
relatively high on METEOR, NMN-BART also exhibits a relatively low SPICE score. It may be that
the specific way SPICE constructs scene graphs affects the evaluation, or it may indicate that the
explanations of NMN-BART contain more information than the reference, affecting the alignment.</p>
        <p>In conclusion, NMN-BART demonstrates superior performance in generating semantically aligned
explanations compared to the baselines. To verify the underlying reasons we postulated, we conduct
case studies and human evaluations (Section 5.2).</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Case Study with Human Evaluation</title>
        <p>Case study. To further understand the automatic evaluation results and verify our interpretation, we
analyse several representative cases (Figure 3). These cases show that NMN-BART indeed generates
explanations with good semantic alignment with references, despite mismatches at the lexical level. In
some cases, the generated explanations even contain more relevant information than the reference.</p>
        <p>Case 1: Skateboarding. For the question “What is the man doing?” with the answer “skateboarding,” the
reference explanations describe actions such as riding, balancing, or performing tricks on a skateboard.
Our model-generated explanations, while differing in wording, capture key information by describing
trick performance and movement on a ramp. The generated explanations also provide additional
information, such as details describing the trick: jumping, edge, ramp, wheels. In particular, the
predicted explanations contain objects not in the reference, such as ramp, wheels, and edge, and
additional relations, e.g., (he, is_jumping, trick) and (trick, off, edge).</p>
        <p>Figure 3 (a): Question: "What is the man doing?" Answer: "skateboarding". Reference explanations: "he is riding his skateboard on a skate rail", "he is doing a trick on his board", "he is balancing on a board on top of a rail". Predicted explanations: "he is doing a trick on a skateboard", "he is jumping a trick off the edge of a ramp on top of a board that has wheels attached to his skateboard", "he is at the edge and he is on the edge on a ramp".</p>
        <p>Figure 3 (b): Question: "What are the people doing?" Answer: "snowboarding". Reference explanations: "it is their favorite sport", "they all are riding on a snowboard", "they are on a snow-filled mountain with boards under their feet". Predicted explanations: "they are on a snowboard going down a snowy hill", "they have snowboards strapped to their feet and are on snowboards", "they're attached to boards and are touching the snow".</p>
        <p>Figure 3 (c): Question: "Is the sink clean?" Answer: "clean". Reference explanations: "there is nothing in the sink", "it is white and does not have any residue inside it", "it is sparkling white with no dirty spots".</p>
        <p>Figure 4: Human evaluation scores comparing reference and predicted explanations across the three cases using
seven metrics (metric definitions in Section 5.2). Higher scores reflect better performance.</p>
        <p>Case 2: Snowboarding. When asked “What are the people doing?” with the answer “snowboarding,”
the reference explanations focus on the general activity of riding snowboards on a snowy mountain.
Our model similarly identifies the key elements of the scene, describing that the individuals are moving
downhill with snowboards strapped to their feet. Notably, the reference explanation “it is their favorite
sport” cannot be directly seen from the image; it is a rather subjective interpretation. The predicted
explanations have more details, such as (they, going_down, hill) and (they, are_touching, snow).</p>
        <p>Case 3: Sink Cleanliness. For the question “Is the sink clean?” with the answer “clean,” the reference
explanations describe the absence of dirt or residue. A key challenge here is handling negation: allowing
negation in the predicate, as in (sink, covered_with_no, dirt), would lead to an unbounded number of
possible tail entities. To address this, we treat covered_with_no_dirt as a relevant attribute in
this context, which captures the intended meaning of cleanliness, rather than treating dirt or grime
as entities. With this design, the generated explanation is semantically correct and aligns with the
reference in detecting relevant attributes, even though it adds extra contextual details, such as
mentioning a toilet that is not directly visible in the image.</p>
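This treatment of negation can be sketched as a normalisation step applied to each extracted triple. The helper name, negation markers, and triple format below are illustrative assumptions, not the paper's implementation:

```python
NEGATION_MARKERS = ("no", "not", "without")

def normalize_triple(head, predicate, tail):
    """Fold a negated predicate and its tail into a single attribute
    of the head entity, avoiding an unbounded space of negated tails:
    (sink, covered_with_no, dirt) -> (sink, has_attribute, covered_with_no_dirt)."""
    tokens = predicate.lower().split("_")
    if any(marker in tokens for marker in NEGATION_MARKERS):
        return (head, "has_attribute", f"{predicate}_{tail}")
    return (head, predicate, tail)
```

Triples without a negation marker, such as (they, are_touching, snow), pass through unchanged.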
        <p>Human evaluation metrics for case study. We adapt the human evaluation metrics of [30], combining
content-based and subjective metrics, as outlined below:
• Entity: the number of relevant distinct entities in the explanations.
• Relation: the number of relevant distinct relations in the explanations.
• Attribute: the number of relevant distinct attributes in the explanations.
• Relevance: whether the explanation contains only information relevant to the question and answer; a
subjective score ranging from 0 to 5, with lower scores penalising superfluous information.
• Informativeness: whether the explanation adds relevant information beyond the Q&amp;A; a subjective
score ranging from 0 to 5, with higher scores indicating more relevant information.
• Argument strength: the degree to which the explanation supports the answer, reflecting the number
and strength of the correct arguments in the explanation; a subjective score ranging from 0 to 5,
with higher scores indicating stronger arguments.
• Satisfaction: subjective satisfaction with the explanation; a subjective score ranging from 0 to 5,
with higher scores indicating higher satisfaction.</p>
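For bookkeeping during the case study, the seven metrics above could be recorded per explanation roughly as follows. The field names and the 0-5 range check are illustrative, not part of the published evaluation protocol:

```python
from dataclasses import dataclass

@dataclass
class ExplanationScores:
    # Content-based counts of relevant distinct items
    entity: int
    relation: int
    attribute: int
    # Subjective ratings on a 0-5 scale
    relevance: int
    informativeness: int
    argument_strength: int
    satisfaction: int

    def __post_init__(self):
        # Enforce the 0-5 range for the subjective metrics
        for name in ("relevance", "informativeness",
                     "argument_strength", "satisfaction"):
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be in 0-5, got {value}")
```

Keeping the content-based counts separate from the bounded subjective ratings mirrors the split in Figure 4's panels.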
        <p>The human evaluation results comparing the predicted and reference explanations across the three
cases, using both content-based and subjective metrics, are presented in Figure 4. On content-based
coverage (Entity, Relation, Attribute), the predicted explanations capture comparable or more relevant
information than the references, excelling in particular on Entity and Relation. Subjective
evaluations of explanation quality (Relevance, Informativeness, Argument Strength, and
Satisfaction) show that the predicted explanations achieve competitive or superior performance in most
cases, with multiple perfect scores (5). Based on these results, we conclude that our
approach produces explanations that are semantically rich and achieve high human satisfaction (4 to 5),
even when the generated text deviates from the reference explanations at the lexical level.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Outlook</title>
      <p>In this paper, we introduce our ongoing work on NMN-BART, a novel architecture that combines
Neural Module Networks with BART to generate natural language explanations for visual question
answering. Our model leverages compositional reasoning over scene graphs to capture deeper semantic
relationships between the image, question, and answer, producing explanations that are semantically
rich, persuasive, and highly satisfying to human evaluators. Experiments on the VQA-X dataset demonstrate
that our method significantly outperforms baselines in capturing semantic content, despite lower lexical
alignment with the references. Future work will focus on testing NMN-BART on additional datasets,
performing larger-scale human evaluations, and developing automatic metrics that better align with
human rationales, ultimately contributing to more transparent and interpretable AI.</p>
      <p>Declaration on Generative AI. The authors have used ChatGPT to assist with the polishing of
human-authored text. The authors take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , et al.,
          <source>VQA: Visual Question Answering</source>
          , in: ICCV, IEEE,
          <year>2015</year>
          , pp.
          <fpage>2425</fpage>
          -
          <lpage>2433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Kancheti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Balasubramanian</surname>
          </string-name>
          ,
          <article-title>Beyond VQA: Generating Multi-word Answers and Rationales to Visual Questions</article-title>
          , in: CVPRW, IEEE,
          <year>2021</year>
          , pp.
          <fpage>1623</fpage>
          -
          <lpage>1632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          , et al.,
          <article-title>Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs</article-title>
          , in: Findings of EMNLP, ACL,
          <year>2020</year>
          , pp.
          <fpage>2810</fpage>
          -
          <lpage>2829</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          ,
          <article-title>Faithful Multimodal Explanation for Visual Question Answering</article-title>
          , in: ACL Workshop, ACL,
          <year>2019</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Akata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          , et al.,
          <article-title>Multimodal Explanations: Justifying Decisions and Pointing to the Evidence</article-title>
          , in: CVPR, IEEE,
          <year>2018</year>
          , pp.
          <fpage>8779</fpage>
          -
          <lpage>8788</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>