<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Natural Language Explanations for Visual Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yan Zhou</string-name>
          <email>yanzho@uio.no</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Baifan Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ingrid C. Yu</string-name>
          <email>ingridcy@ifi.uio.no</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Oslo Metropolitan University</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics, University of Oslo</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>35</volume>
      <fpage>8312</fpage>
      <lpage>8322</lpage>
      <abstract>
        <p>Visual Question Answering (VQA) is a challenging task that requires reasoning over both visual and textual information. Recently, there has been growing interest in enhancing VQA with Natural Language Explanations (NLEs) to improve transparency and trust. While existing methods leverage powerful language models for explanation generation, many score high on lexical-level text similarity rather than capturing the underlying reasoning process. In this work, we propose NMN-BART, a novel architecture that combines Neural Module Networks (NMNs) with the pretrained BART language model, using cross-modal fusion to bridge visual semantics and textual reasoning. We evaluate NMN-BART on the VQA-X dataset, where it significantly outperforms baselines on semantic-based metrics, despite lower scores on lexical similarity metrics. This suggests that our method excels at capturing the meaningful content of the explanations, rather than matching the references in wording. A case study with human evaluation further verifies our finding that our method produces semantically rich and persuasive explanations.</p>
      </abstract>
      <kwd-group>
        <kwd>visual question answering with explanations</kwd>
        <kwd>natural language explanation</kwd>
        <kwd>neural module network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Background. Visual Question Answering (VQA) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a challenging task at the intersection of computer
vision and natural language processing, where an accurate answer needs to be generated by reasoning
over both visual and textual information given an image and a corresponding question. This task is
crucial as it mirrors real-world scenarios where machines must integrate multimodal data, enabling
more natural human-computer interactions.
      </p>
      <p>
        Recently, there has been growing interest in enhancing VQA with Natural Language Explanations
(NLEs) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], leading to the task of Visual Question Answering with Explanation (VQA-E), which aims to
produce both an accurate answer and a human-understandable explanation for that answer (Figure 1).
While the full VQA-E task involves predicting both answers and explanations, in this work we focus
on the generation of NLEs, assuming access to the correct answer, following prior work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This
targeted setup allows us to isolate and evaluate the model’s ability to produce semantically meaningful
and interpretable justifications, an essential component for enhancing transparency and user trust in
real-world applications such as medical diagnostics and autonomous driving. We further discuss the
rationales in Section 3.
      </p>
      <p>
        Challenges. Generating natural language explanations for VQA presents several challenges. A key
difficulty is aligning visual and textual modalities: the system must interpret image content and link it
to linguistic constructs to produce coherent explanations. Common approaches use vision-language
models that separate answer prediction and explanation generation [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], while end-to-end models [6]
attempt joint prediction of both. These models typically combine vision encoders like CLIP [7, 8, 6]
with transformer-based language models [
        <xref ref-type="bibr" rid="ref3">6, 3, 9</xref>
        ]. However, prior studies suggest that such models
may rely on dataset biases or visual “shortcuts” [10], enabling them to achieve high answer accuracy
without fully capturing the semantic content of the visual scene. This can lead to shallow or unfaithful
explanations that fail to reflect the actual reasoning process.
      </p>
      <p>Figure 1: An example of visual question answering with natural language explanations. Answer: "Eating." Explanations: "he is biting a vine of leaves with his mouth", "he is very hungry at the moment", "the giraffe is pulling food from the basket on the pole."</p>
      <sec id="sec-1-2">
        <p>Another limitation lies in generating explanations that faithfully reflect the reasoning process. Many
multimodal models fail to explicitly connect reasoning steps with the generated explanation. Neural
Module Networks (NMNs) [11, 10, 12] offer a structured reasoning framework by decomposing questions
into interpretable modules. While effective for answer prediction in VQA tasks, NMNs remain
underexplored for explanation generation, leaving their potential to produce reasoning-aligned outputs
largely untapped.</p>
        <p>
          Finally, evaluating NLEs remains a challenge. Many works [
          <xref ref-type="bibr" rid="ref3">6, 3, 9, 8</xref>
          ] try to achieve high scores on
n-gram-based metrics such as BLEU and ROUGE. These metrics focus on lexical comparison, often
insufficient for reflecting semantic alignment. Even semantically oriented metrics such as METEOR [13]
or SPICE [14] still compare generated explanations against reference texts, and may overlook extra
relevant information introduced in the generated explanation. Thus, new evaluation strategies are
needed to assess explanation quality beyond similarity to references.
        </p>
        <p>Contributions. This work presents our ongoing research on a novel architecture, NMN-BART, which
integrates NMN with the pretrained language model BART using a cross-modal fusion module. This
cross-modal fusion module integrates reasoning over Scene Graph (SG) representations with textual
information from the language model, facilitating the generation of explanations grounded in both visual
and textual reasoning. The compositional and explainable mechanism of NMN enables NMN-BART to
model a deeper reasoning process over the semantic relationships among the question, answer, and visual
content, thereby producing explanations that are both semantically richer and more interpretable.</p>
        <p>We evaluate NMN-BART on the VQA-X dataset, where our model achieves performance comparable
to state-of-the-art methods. Notably, NMN-BART scores significantly higher on semantic-based
metrics such as METEOR and SPICE, while exhibiting lower performance on n-gram based metrics
such as BLEU and ROUGE-L. We interpret these findings as evidence that our approach is capable of
generating explanations with enhanced semantic understanding, even when the generated text diverges
from the reference in terms of exact phrasing. Case studies with human evaluation further confirm
that NMN-BART generates rich and meaningful explanations, sometimes providing more information
than the reference explanations. Our main contributions are summarised as follows:
• We propose NMN-BART, a novel architecture that combines the reasoning capabilities of NMN with
the text generation of BART, for generating natural language explanations in VQA.
• We demonstrate through extensive experiments and evaluations on the VQA-X dataset that our
approach yields explanations with improved semantic quality, as evidenced by semantic-based metrics.
• We adopt and adapt metrics for representative case studies and human evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Visual question answering with explanation (VQA-E). Recent advances in VQA emphasise the
importance of generating natural language explanations (NLEs) to justify answers and improve
interpretability. Existing methods can be broadly grouped into two types: (i) generating explanations
without conditioning on the answers, and (ii) generating explanations conditioned on the answers.</p>
      <p>Category (i) approaches jointly predict answers and explanations. For example, NLX-GPT [6] frames
the task as unified text generation. It integrates a CLIP image encoder with a distilled GPT-2 decoder,
allowing the explanation to be generated as part of the reasoning process.</p>
      <p>
        Category (ii) methods typically decouple the VQA and explanation generation stages. They first
predict an answer, then condition the explanation on the answer, question, and image features [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">5, 4, 3, 9</xref>
        ].
For instance, the Rational Transformer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] combines GPT-2 with outputs from object detection, situation
recognition, and commonsense inference to generate rationales for complex visual reasoning tasks.
e-UG [9] similarly uses GPT-2, conditioning explanations on various combinations of visual and textual
features. S3C [8] improves explanation generation by incorporating answer scores as rewards in a self-critical
learning framework, using CLIP-based encoders and prompt-based templates to guide the generation.
Reasoning-enhanced Vision-Language (VL) models. Recent work has shown that incorporating
explicit reasoning into language models improves performance on complex tasks [15, 16]. However,
due to the complexity of aligning and integrating cross-modal information, VL models still struggle to
capture visual reasoning effectively.
      </p>
      <p>
        One line of research adopts Neural Module Networks (NMNs) [11], which dynamically compose
neural modules based on the input question. To mitigate the vision-to-reasoning shortcut in NMNs,
XNM [10] employs scene graphs for visual reasoning, instead of using “low-level” visual perception,
especially in datasets like CLEVR [17] with ground-truth scene graph annotations. For datasets without
scene graphs, such as VQA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], XNM constructs scene graph representations from visual features to
enable dynamic reasoning [18]. We adopt this strategy with the VQA-X [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] dataset.
      </p>
      <p>Inspired by NMN’s compositional reasoning, recent methods explore code generation [19, 20] or large
language models (LLMs) [21] for step-by-step visual reasoning. Chain-of-Thought (CoT) prompting has
also been applied in VL reasoning. For instance, CCoT [22] generates scene graphs via LLMs and uses
them in prompts to extract compositional knowledge. These approaches offer flexibility and strong
generalisation, with their zero-shot performance avoiding task-specific training or fine-tuning.</p>
      <p>In this work, we integrate an NMN [18, 23] with BART, a transformer-based language model [24],
leveraging NMN for compositional reasoning over scene graph representations and BART for natural
language generation. This guides explanation generation and helps the model capture richer semantic
relations among the image, question, and answer.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Formulation and Rationale</title>
      <p>In Visual Question Answering with Explanation (VQA-E), the objective is to develop a model that can
answer questions about images and generate textual explanations of the answers, by understanding
and reasoning over both visual and textual information. We follow the general task formulation in [9],
which denotes visual information as v (e.g., an image) and textual information as t (e.g., a question);
the objective of VQA-E is to learn a function ℱ to predict the answer a to the question and the explanation
e that justifies the answer a: (a, e) = ℱ(v, t).</p>
      <p>
        There are generally two paradigms for achieving this. One class of approaches generates the
answer and explanation simultaneously, without conditioning the explanation on the answer.
Another class adopts a post-hoc (after the fact) strategy, where the answer is first determined or given,
and the explanation is then generated conditioned on that answer. The task is then decomposed into
a = ℱ_a(v, t), e = ℱ_e(v, t, a). Some works generate both answers and explanations (conditioned or
unconditioned on the answer) and then filter out explanations where the answers are incorrect during
evaluation, a setting referred to as the filtered setting [8]. In this work, we adopt the post-hoc strategy
and condition the explanation generation on the answer. We follow a design choice similar to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], in
which a given answer a is provided to the explanation generation as additional input alongside v and t. This
design offers several advantages:
• No dataset filtering: It allows evaluation over the full dataset without filtering for only correct
predictions, preserving the diversity and difficulty of the original examples.
• Bias mitigation: It avoids the biases introduced by the filtered setting, which removes cases where the predicted
answer is incorrect; these are often difficult or ambiguous cases, and potentially informative and
important for assessing explanation quality.
• Focused generation: By receiving the answer as input, the model can concentrate on elaborating,
contextualising, and justifying the answer, leading to more relevant and detailed explanations.
      </p>
      <p>Based on these rationales, we formulate our task as learning a function ℱ_e that generates a natural
language explanation e from the image v, the question t, and the given answer a (Eq. 1). Here, t and e
are natural language sentences, and a is typically a word or a short phrase. The explanation e consists
of one or more sentences that provide a human-understandable rationale for the given answer.
e = ℱ_e(v, t, a)
(1)</p>
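      <p>For illustration, the two paradigms can be sketched as function compositions in Python. This is a toy sketch of the control flow only; the predictor names and stub outputs below are hypothetical, not the actual models.</p>

```python
# Minimal sketch of the two VQA-E paradigms. The predictors are illustrative
# stubs standing in for learned models, not the actual NMN-BART system.

def joint_paradigm(predict_joint, v, t):
    """Paradigm 1: answer and explanation produced together, (a, e) = F(v, t)."""
    a, e = predict_joint(v, t)
    return a, e

def post_hoc_paradigm(predict_answer, explain, v, t, a=None):
    """Paradigm 2 (ours): a = F_a(v, t) unless a is given, then e = F_e(v, t, a)."""
    if a is None:
        a = predict_answer(v, t)
    e = explain(v, t, a)
    return a, e

# Toy stubs to exercise the control flow.
stub_answer = lambda v, t: "skateboarding"
stub_explain = lambda v, t, a: "he is doing a trick on a " + a.replace("ing", "")

a, e = post_hoc_paradigm(stub_answer, stub_explain, v="image", t="What is the man doing?")
```

      <p>Passing a known answer via the a argument corresponds to our setting, where the ground-truth answer is supplied as input.</p>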
    </sec>
    <sec id="sec-4">
      <title>4. Our Approach: NMN-BART</title>
      <p>NMN-BART consists of (1) visual preprocessing, (2) an NMN-BART encoder, and (3) a BART decoder
(Figure 2). Visual preprocessing transforms the image into scene graph representations g; the NMN-BART
encoder takes g and the text (question t and answer a) as input and produces a fused encoding,
which is then processed by the BART decoder to generate the explanation e.</p>
      <p>Visual preprocessing. Here the image v is transformed into scene graph representations g. Visual
features are extracted from images using the Bottom-Up Attention model [25], a pretrained visual model
that employs a Faster R-CNN detector trained on the Visual Genome dataset. These visual features,
termed visual Region-of-Interest (RoI) features, serve as the visual foundation for the construction of g.</p>
      <p>To structure visual information for compositional reasoning, we convert these region-level features
into scene graph representations following [10]. A scene graph is a structured representation of an
image that encodes objects (nodes), their attributes, and pairwise relations (edges) between them. In
our approach, each node v_i in the graph corresponds to a detected object and is constructed from the
visual RoI features. Each edge e_ij represents a spatial or semantic relation between two objects and is
constructed by concatenating the visual RoI features of the two connected nodes: e_ij = [v_i; v_j].</p>
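      <p>The node and edge construction can be sketched in a few lines of Python. This is a toy illustration under the assumption that RoI features are plain vectors; the feature dimension here is made up (real Bottom-Up Attention features are much larger).</p>

```python
# Toy sketch: building scene graph node and edge features from RoI vectors.
# Real RoI features come from Bottom-Up Attention; here they are random lists.
import random

def build_scene_graph(roi_features):
    """Nodes are the RoI features; each directed edge is the concatenation [v_i; v_j]."""
    nodes = list(roi_features)
    edges = {}
    for i, v_i in enumerate(nodes):
        for j, v_j in enumerate(nodes):
            if i != j:
                edges[(i, j)] = v_i + v_j  # list concatenation implements [v_i; v_j]
    return nodes, edges

dim = 8  # hypothetical feature dimension for illustration
rois = [[random.random() for _ in range(dim)] for _ in range(3)]
nodes, edges = build_scene_graph(rois)
```

      <p>With n detected objects this yields n(n-1) directed edges, each of twice the node dimension, matching the concatenation-based edge definition above.</p>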
      <p>This graph-based representation provides a structured abstraction over raw pixel data, which enables
NMN to reason over entities and relations in a compositional manner, facilitating interpretable reasoning
for the explanation generation.</p>
      <p>NMN-BART Encoder. The NMN-BART encoder can be seen as a BART encoder enhanced with an NMN
(Figure 2). The BART encoder layers transform text into contextualised feature representations. The
NMN processes these text features along with the scene graph representations to learn neural modules,
producing intermediate results referred to as module outputs. The cross-modal fusion layers then
combine these module outputs with the output of the previous encoder or fusion layer to form the fused
encoding H, which is used by the BART decoder to generate explanations. The overall functionality of the
NMN-BART encoder can be summarised as: H = NMN-BART-Encoder(g, t, a).</p>
      <p>BART encoder layers. We use a pretrained BART as the foundational language model, leveraging its
encoder to process the textual inputs (questions and answers). We choose BART, a transformer-based
sequence-to-sequence model, because it combines the strengths of bidirectional and autoregressive
transformers, making it effective for natural language generation tasks [24]. The text inputs t, a are
tokenised into a sequence of tokens W = {w_0, w_1, ..., w_n} = Tokeniser(t, a) and embedded into dense
vector representations H^0 = Embed(W), which are then passed to the first encoder layer. The output
representation of each encoder layer l is computed as:</p>
      <p>H^l = {h_0^l, h_1^l, …, h_n^l} = Encoder-Layer({h_0^{l−1}, h_1^{l−1}, …, h_n^{l−1}}), for l = 1, …, L
(2)
where Encoder-Layer(.) is a single transformer encoder layer, {h_0^l, h_1^l, …, h_n^l} denotes the hidden
states H^l of the l-th layer, and L is the number of encoder layers of the pretrained BART model.
Neural Module Network. The NMN takes two inputs: 1) textual features from t and a, including the text
embedding H^0 and the hidden states from the first encoder layer H^1; 2) the scene graph representations g.</p>
      <p>The NMN processes these inputs to produce intermediate outputs O_k from the reasoning steps on the scene
graph, where k denotes the reasoning step, expressed as:</p>
      <p>{O_1, O_2, O_3} = NMN(H^0, H^1, g)
(3)
where O_k = {o_0^k, o_1^k, …, o_d^k}, k = 1, 2, 3, represents the intermediate module output at steps 1, 2, and 3,
with d being the dimension of the module output.</p>
      <p>We choose NMN because it enables fully differentiable training via back-propagation without expert
supervision of reasoning steps. Following StackNMN [18] for the modular reasoning process and [10]
for reasoning over scene graphs, we summarise the key components of StackNMN here and refer to [18]
for technical details. StackNMN consists of three components: 1) The Layout Controller converts
text information into a temporal distribution over module weights, segmenting reasoning into steps. 2)
Module weights are assigned to a sequence of neural modules, each designed to perform a reasoning
step. These modules include Find, Transform, And, Or, Filter, Scene, Answer, Compare, and NoOp, and
operate on scene graph representations. The module outputs are visual attention maps or score vectors
over possible answers. 3) A differentiable memory stack stores and accesses intermediate module
outputs during execution.</p>
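      <p>The soft module selection that makes StackNMN differentiable can be sketched as a weighted average of per-module outputs. The sketch below is a pure-Python toy with made-up module behaviours; real modules attend over scene graph nodes and the weights come from the layout controller.</p>

```python
# Sketch of one soft reasoning step: the layout controller emits weights over
# modules, and the step output is the weighted sum of all module outputs,
# which keeps the whole computation differentiable.

def soft_step(modules, weights, attention):
    out = [0.0] * len(attention)
    for name, w in weights.items():
        module_out = modules[name](attention)
        out = [o + w * m for o, m in zip(out, module_out)]
    return out

# Toy modules over an attention vector; names follow StackNMN.
modules = {
    "Find":      lambda att: [1.0 if a > 0.5 else 0.0 for a in att],  # toy re-attention
    "Transform": lambda att: att[1:] + att[:1],                       # toy attention shift
    "NoOp":      lambda att: att,                                     # pass-through
}
weights = {"Find": 0.7, "Transform": 0.2, "NoOp": 0.1}  # from the layout controller
attention = [0.9, 0.2, 0.6]
step_out = soft_step(modules, weights, attention)
```

      <p>Because every module runs at every step and is mixed by soft weights, gradients flow through all modules, which is what allows training without expert supervision of the reasoning layout.</p>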
      <p>Cross-modal fusion. The fusion layers integrate visual and textual information from the NMN (module
outputs) with the hidden states from the previous encoder or fusion layer. Each fusion layer corresponds
to one neural module reasoning step. For the l-th fusion layer (right side of Figure 2), we concatenate
the first token node from the previous layer h_0^{l−1} with the last token node of the module output o_d^{l−1},
and feed it into an MLP (Eq. 4). The resulting fused token h̃_0^{l−1} is then propagated to all nodes in the
fusion layer output through the self-attention mechanism of the transformer (Eq. 5, Eq. 6). One fusion
layer can be summarised in Eq. 7. The final output of the fusion layers is the fused encoding H.
[h̃_0^{l−1}; õ_d^{l−1}] = MLP([h_0^{l−1}; o_d^{l−1}])
(4)
H̃^{l−1} = [h̃_0^{l−1}, h_1^{l−1}, …, h_n^{l−1}]
(5)
H^l = SelfAttention(H̃^{l−1})
(6)
H^l = Fusion-Layer([H^{l−1}; O^{l−1}])
(7)</p>
      <p>BART decoder. The fused encoding H from the last fusion layer is passed to the BART decoder to
generate explanations: e = BART-Decoder(H).</p>
      <p>Training scheme. We initialise the encoder, fusion, and decoder layers with a pretrained BART
model. Visual preprocessing (scene graph generation) is computed in advance. The entire NMN-BART
encoder and the BART decoder are trained end-to-end using cross-entropy loss [24], with inputs t, a,
precomputed g, and reference explanation e.</p>
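      <p>The shape of one fusion layer (Eqs. 4–7) can be sketched in pure Python. The "MLP" and "self-attention" below are toy stand-ins with the right input/output shapes, not the learned components; dimensions are made up for illustration.</p>

```python
# Sketch of one cross-modal fusion layer: fuse the first text token with the
# last module-output token via an MLP (Eq. 4), rebuild the token sequence
# (Eq. 5), and mix across tokens with attention (Eq. 6).

def mlp(vec):                      # stand-in for the learned one-hidden-layer MLP
    return [0.5 * x for x in vec]  # keeps the input dimension

def self_attention(tokens):        # stand-in: every token becomes the mean token
    n = len(tokens)
    mean = [sum(t[k] for t in tokens) / n for k in range(len(tokens[0]))]
    return [mean[:] for _ in tokens]

def fusion_layer(hidden, module_out):
    h0, om = hidden[0], module_out[-1]
    fused = mlp(h0 + om)             # Eq. 4: MLP([h_0; o_d])
    h0_tilde = fused[:len(h0)]       # keep the fused text-token half
    mixed = [h0_tilde] + hidden[1:]  # Eq. 5: fused first token, rest unchanged
    return self_attention(mixed)     # Eq. 6: propagate to all token nodes

hidden = [[1.0, 2.0], [3.0, 4.0]]    # toy hidden states (2 tokens, dim 2)
module_out = [[5.0, 6.0]]            # toy module output (1 token, dim 2)
new_hidden = fusion_layer(hidden, module_out)
```

      <p>The point of the sketch is the information flow: visual evidence enters through a single fused token and the attention step spreads it to every position, which is how each reasoning step conditions the whole text representation.</p>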
    </sec>
    <sec id="sec-5">
      <title>5. Experiment</title>
      <p>
        Dataset. We evaluate our method on the VQA-X dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], an established benchmark that extends
the Visual Question Answering (VQA) dataset with human-written explanations. VQA-X contains 33k
question-answer pairs over 28k images sourced from the MSCOCO dataset [26]. Each question averages
7.5 words, and each explanation around 11 words, with a vocabulary size of approximately 10k. The
data is split into training (29k), validation (1.4k), and test (1.9k) sets. Each question may have multiple
valid answers. The scale and diversity of VQA-X make it well suited for assessing both answer accuracy
and the quality of generated explanations.
      </p>
      <p>
        Baselines. We compare our NMN-BART model against representative baselines, categorised by their
use of answer conditioning (more discussion see Section 3). Not answer-conditioned: NLX-GPT [6]
generates explanations without relying on the predicted answer. Other methods are answer-conditioned,
where RVT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and our model NMN-BART receive a given answer that guides explanation generation,
while the other methods apply a filtered setting: keeping only those explanations whose answers are correctly
predicted by the model, on the assumption that explanations supporting incorrect answers are invalid and should
be excluded from evaluation [6]. These methods include PJ-X [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], FME [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], e-UG [9], and S3C [8], where
most methods condition the explanation directly on answers, and in S3C the explanation is rewarded
by the correct answer using reinforcement learning.
      </p>
      <p>Automatic evaluation metrics. We use two types of automatic metrics for explanation evaluation:
N-gram-based metrics: These metrics assess lexical-level similarity, measuring word and phrase overlap
without considering deeper meaning. We use BLEU-1 [27] to measure the unigram precision, capturing
word overlap between the generated and reference texts, with a brevity penalty for short outputs.
ROUGE-L [28] uses the Longest Common Subsequence to assess sentence-level structural similarity.
Both are primarily sensitive to n-gram overlap.</p>
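      <p>For concreteness, both n-gram metrics can be sketched in pure Python. These are simplified single-reference versions for illustration only, not the official implementations used in our experiments.</p>

```python
# Sketch of BLEU-1 (clipped unigram precision with brevity penalty) and
# ROUGE-L (LCS-based F-score), each over one candidate and one reference.
import math
from collections import Counter

def bleu1(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    cc, rc = Counter(cand), Counter(ref)
    overlap = sum(min(n, rc[w]) for w, n in cc.items())  # clipped unigram matches
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

def rouge_l(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    # dynamic-programming longest common subsequence
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, cw in enumerate(cand):
        for j, rw in enumerate(ref):
            dp[i + 1][j + 1] = dp[i][j] + 1 if cw == rw else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

score_b = bleu1("the cat sat", "the cat sat on the mat")
score_r = rouge_l("the cat sat", "the cat sat on the mat")
```

      <p>Note how a perfectly precise but short candidate is still penalised: BLEU-1's brevity penalty and ROUGE-L's recall both drop when the candidate covers only part of the reference, which is exactly the lexical-overlap behaviour discussed above.</p>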
      <p>Semantic-based metrics: These metrics evaluate a deeper semantic alignment between the generated
and reference text. METEOR [13] goes beyond n-gram matching by incorporating stemming, synonym
matching, and paraphrase tables. SPICE [14] converts both explanations into scene graphs and evaluates
the alignment of objects, attributes, and relationships.</p>
      <p>Implementation. We use the pretrained BART-base from Facebook [29] as our language backbone.
The encoder consists of 6 layers, with 3 of them configured as fusion layers corresponding to the three
NMN modules. The model is trained for 15 epochs (17,730 steps) using a batch size of 24 and a learning
rate of 2e-5, on a single Quadro GV100 GPU for approximately 4.8 hours. For cross-modal fusion,
we apply a one-hidden-layer MLP that projects the concatenation of token nodes from LM and NMN
outputs to a 288-dimensional space, with a dropout rate of 0.2.</p>
      <sec id="sec-5-1">
        <title>5.1. Results and Discussion</title>
        <p>Results. The results of all methods on the VQA-X dataset are summarised in Table 1. The methods
are categorised first by whether the explanation generation is not answer-conditioned or answer-conditioned.
Among the latter, the methods are further categorised by applying the filtered setting
(keeping only explanations whose answers are correct) [8] or by using a given answer (a given answer is
an input for generating the explanation). It can be observed that recent methods, such as NLX-GPT and
S3C, achieve high scores on BLEU-1 and ROUGE-L, indicating strong lexical-level similarity with the
reference explanations. Our approach obtains substantially higher scores on semantic-based metrics,
achieving a METEOR of 38.5 and a SPICE of 25.9, improvements of 61% and 13%, respectively, compared
to the second-best baseline S3C. On the other hand, NMN-BART has BLEU-1 (36.8) and ROUGE-L
(31.3) scores that are notably lower than many baselines.</p>
        <p>Discussion. By comparing the two types of metrics across all methods, we can conclude that
semantic-based metrics are generally more challenging than n-gram-based metrics. On average, most methods
score high on n-gram-based metrics but relatively low on semantic-based ones (100 is a perfect score).</p>
        <p>The n-gram-based metrics primarily focus on word or phrase overlap. BLEU-1 captures unigram
overlap, while ROUGE-L focuses on the longest common subsequence. From the results, we can infer that
NLX-GPT and S3C generate explanations that closely overlap with the reference, while NMN-BART
performs less well on lexical overlap. However, this does not imply worse performance for NMN-BART;
rather, it reflects that NMN-BART generates lexical content that differs more from the reference.</p>
        <p>METEOR extends n-gram overlap by incorporating semantic alignment. It also considers recall
and penalises brevity, meaning that explanations lacking key information or being overly short will
score lower. METEOR thus balances lexical similarity and semantic alignment, capturing subtleties
that n-gram metrics might miss. Observing that methods such as NLX-GPT and S3C score higher on
n-gram-based metrics but lower on METEOR, we postulate that their explanations tend to
be shorter or omit important details, whereas NMN-BART generates explanations that are relatively
longer and cover more of the important information overlapping with the reference.</p>
        <p>SPICE, designed for evaluating image captions, constructs scene graphs from both generated and
reference texts, comparing entities, relationships, and attributes. It considers both precision and recall,
and is robust to lexical variation. As the most challenging metric, SPICE scores tend to be low for all
methods. Methods with high n-gram-based scores also tend to perform poorly on SPICE, probably
because they generate shorter explanations with less coverage of key information. Despite scoring
relatively high on METEOR, NMN-BART also exhibits a relatively low SPICE score. It may be that
the specific way SPICE constructs scene graphs affects the evaluation, or it may indicate that the
explanations of NMN-BART contain more information than the reference, affecting the alignment.</p>
        <p>In conclusion, NMN-BART demonstrates superior performance in generating semantically aligned
explanations compared to the baselines. To verify the underlying reasons we postulated, we conduct
case studies and human evaluations (Section 5.2).</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Case Study with Human Evaluation</title>
        <p>Case study. To further understand the automatic evaluation results and verify our interpretation, we
analyse several representative cases (Figure 3). These cases show that NMN-BART indeed generates
explanations with good semantic alignment with references, despite mismatches at the lexical level. In
some cases, the generated explanations even contain more relevant information than the reference.</p>
        <p>Case 1: Skateboarding. For the question “What is the man doing?” with the answer “skateboarding,” the
reference explanations describe actions such as riding, balancing, or performing tricks on a skateboard.
Our model-generated explanations, while differing in wording, capture key information by describing
trick performance and movement on a ramp. The generated explanations also provide additional
information, such as details describing the trick: jumping, edge, ramp, wheels. In particular, the
predicted explanations contain objects not in the reference, such as ramp, wheels, and edge, and
additional relations, e.g., (he, is_jumping, trick) and (trick, off, edge).</p>
        <p>Figure 3 (a): Question: "What is the man doing?" Answer: "skateboarding". Reference explanations: "he is riding his skateboard on a skate rail", "he is doing a trick on his board", "he is balancing on a board on top of a rail". Predicted explanations: "he is doing a trick on a skateboard", "he is jumping a trick off the edge of a ramp on top of a board that has wheels attached to his skateboard", "he is at the edge and he is on the edge on a ramp".</p>
        <p>Figure 3 (b): Question: "What are the people doing?" Answer: "snowboarding". Reference explanations: "it is their favorite sport", "they all are riding on a snowboard", "they are on a snow-filled mountain with boards under their feet". Predicted explanations: "they are on a snowboard going down a snowy hill", "they have snowboards strapped to their feet and are on snowboards", "they're attached to boards and are touching the snow".</p>
        <p>Figure 3 (c): Question: "Is the sink clean?" Answer: "clean". Reference explanations: "there is nothing in the sink", "it is white and does not have any residue inside it", "it is sparkling white with no dirty spots".</p>
        <p>Figure 4: Human evaluation scores comparing reference and predicted explanations across the three cases using
seven metrics (metric definitions in Section 5.2). Higher scores reflect better performance.</p>
        <p>Case 2: Snowboarding. When asked “What are the people doing?” with the answer “snowboarding,”
the reference explanations focus on the general activity of riding snowboards on a snowy mountain.
Our model similarly identifies the key elements of the scene, describing that the individuals are moving
downhill with snowboards strapped to their feet. Notably, the reference explanation “it is their favorite
sport” cannot be directly seen from the image; it is a rather subjective interpretation. The predicted
explanations have more details, such as (they, going_down, hill) and (they, are_touching, snow).</p>
        <p>Case 3: Sink Cleanliness. For the question “Is the sink clean?” with the answer “clean,” the reference
explanations describe the absence of dirt or residue. A key challenge here is handling negation: allowing
negation in the predicate, as in (sink, covered_with_no, dirt), would lead to an unbounded number of
possible tail entities. To address this, we treat covered_with_no_dirt as a relevant attribute in
this context, which captures the intended meaning of cleanliness, rather than treating dirt or grime
as entities. With this design, the generated explanation is semantically correct and aligns with the
reference in detecting relevant attributes, even though it adds extra contextual details, such as
mentioning a toilet that is not directly visible in the image.</p>
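This treatment of negation can be sketched as a normalisation step applied to each extracted triple. The helper name, negation markers, and triple format below are illustrative assumptions, not the paper's implementation:

```python
NEGATION_MARKERS = ("no", "not", "without")

def normalize_triple(head, predicate, tail):
    """Fold a negated predicate and its tail into a single attribute
    of the head entity, avoiding an unbounded space of negated tails:
    (sink, covered_with_no, dirt) -> (sink, has_attribute, covered_with_no_dirt)."""
    tokens = predicate.lower().split("_")
    if any(marker in tokens for marker in NEGATION_MARKERS):
        return (head, "has_attribute", f"{predicate}_{tail}")
    return (head, predicate, tail)
```

Triples without a negation marker, such as (they, are_touching, snow), pass through unchanged.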
        <p>Human evaluation metrics for case study. We adapt the human evaluation metrics of [30], combining
content-based and subjective metrics, as outlined below:
• Entity: the number of relevant distinct entities in the explanations.
• Relation: the number of relevant distinct relations in the explanations.
• Attribute: the number of relevant distinct attributes in the explanations.
• Relevance: whether the explanation contains only information relevant to the question and answer; a
subjective score ranging from 0 to 5, with lower scores penalising superfluous information.
• Informativeness: whether the explanation adds relevant information beyond the Q&amp;A; a subjective
score ranging from 0 to 5, with higher scores indicating more relevant information.
• Argument strength: the degree to which the explanation supports the answer, reflecting the number
and strength of the correct arguments in the explanation; a subjective score ranging from 0 to 5,
with higher scores indicating stronger arguments.
• Satisfaction: subjective satisfaction with the explanation; a subjective score ranging from 0 to 5,
with higher scores indicating higher satisfaction.</p>
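For bookkeeping during the case study, the seven metrics above could be recorded per explanation roughly as follows. The field names and the 0-5 range check are illustrative, not part of the published evaluation protocol:

```python
from dataclasses import dataclass

@dataclass
class ExplanationScores:
    # Content-based counts of relevant distinct items
    entity: int
    relation: int
    attribute: int
    # Subjective ratings on a 0-5 scale
    relevance: int
    informativeness: int
    argument_strength: int
    satisfaction: int

    def __post_init__(self):
        # Enforce the 0-5 range for the subjective metrics
        for name in ("relevance", "informativeness",
                     "argument_strength", "satisfaction"):
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be in 0-5, got {value}")
```

Keeping the content-based counts separate from the bounded subjective ratings mirrors the split in Figure 4's panels.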
        <p>The human evaluation results comparing the predicted and reference explanations across the three
cases, using both content-based and subjective metrics, are presented in Figure 4. On content-based
coverage (Entity, Relation, Attribute), the predicted explanations capture comparable or more relevant
information than the references, excelling in particular on Entity and Relation. Subjective
evaluations of explanation quality (Relevance, Informativeness, Argument Strength, and
Satisfaction) show that the predicted explanations achieve competitive or superior performance in most
cases, with multiple perfect scores (5). Based on these results, we conclude that our
approach produces explanations that are semantically rich and achieve high human satisfaction (4 to 5),
even when the generated text deviates from the reference explanations at the lexical level.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Outlook</title>
      <p>In this paper, we introduce our ongoing work on NMN-BART, a novel architecture that combines
Neural Module Networks with BART to generate natural language explanations for visual question
answering. Our model leverages compositional reasoning over scene graphs to capture deeper semantic
relationships between the image, question, and answer, producing explanations that are semantically
rich, persuasive, and highly satisfying to human evaluators. Experiments on the VQA-X dataset demonstrate
that our method significantly outperforms baselines in capturing semantic content, despite lower lexical
alignment with the references. Future work will focus on testing NMN-BART on additional datasets,
performing larger-scale human evaluations, and developing automatic metrics that better align with
human rationales, ultimately contributing to more transparent and interpretable AI.</p>
      <p>Declaration on Generative AI. The authors have used ChatGPT to assist with the polishing of
human-authored text. The authors take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , et al.,
          <source>VQA: Visual Question Answering</source>
          , in: ICCV, IEEE,
          <year>2015</year>
          , pp.
          <fpage>2425</fpage>
          -
          <lpage>2433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Kancheti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Balasubramanian</surname>
          </string-name>
          ,
          <article-title>Beyond VQA: Generating Multi-word Answers and Rationales to Visual Questions</article-title>
          , in: CVPRW, IEEE,
          <year>2021</year>
          , pp.
          <fpage>1623</fpage>
          -
          <lpage>1632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          , et al.,
          <article-title>Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs</article-title>
          , in: Findings of EMNLP, ACL,
          <year>2020</year>
          , pp.
          <fpage>2810</fpage>
          -
          <lpage>2829</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          ,
          <article-title>Faithful Multimodal Explanation for Visual Question Answering</article-title>
          , in: ACL Workshop, ACL,
          <year>2019</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Akata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          , et al.,
          <article-title>Multimodal Explanations: Justifying Decisions and Pointing to the Evidence</article-title>
          , in: CVPR, IEEE,
          <year>2018</year>
          , pp.
          <fpage>8779</fpage>
          -
          <lpage>8788</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>