Visual Causal Question and Answering with
                                Knowledge Graph Link Prediction⋆
                                Utkarshani Jaimini1,* , Cory Henson2 and Amit Sheth1
                                1
                                    Artificial Intelligence Institute, University of South Carolina, Columbia, SC, USA
                                2
                                    Bosch Center for Artificial Intelligence, Pittsburgh, PA, USA


                                               Abstract
                                               The ability to answer causal questions is important for any system that requires robust scene under-
                                               standing. In this demonstration, we develop a prototype system that leverages our causal link prediction
                                               framework, CausalLP. CausalLP framework uses a visual causal knowledge graph and associated knowl-
                                               edge graph embedding for two visual causal question and answering tasks- (i) causal explanation and (ii)
                                               causal prediction. In the live demonstration sessions, the participants will be invited to test the efficiency
                                               and effectiveness of the system for visual causal question and answering.

                                               Keywords
                                               Visual causal knowledge graph, causal explanation, causal prediction, causal link prediction


                                1. Introduction
                                Answering questions about scenes often requires knowledge of the causal relations between
                                events. As an example, consider a scene in which a yellow ball collides with a blue cylinder, as
                                depicted in Figure 1. Several questions may be asked about this collision event, including:

                                        • Question: What is the cause of the collision? Answer: The red cube collides with the
                                          yellow ball.
                                        • Question: What is the effect of the collision? Answer: The blue cylinder moves.

                                   The first question type is referred to as a causal explanation; i.e. what is the cause of an
                                event. The second question type is referred to as a causal prediction; i.e. what is the effect of an
                                event. The ability to answer these types of causal questions is important for any system that
                                requires robust scene understanding. In this demo, we will show how these types of questions
                                are answered with the Causal Link Prediction (CausalLP) framework [1]. The information about
                                objects and events occurring in the scene are represented in a knowledge graph (KG) along with
                                the their associated causal relation. The link prediction techniques are used to infer new causal
                                relations between events. These newly inferred causal links serve as answers to the explanation
                                and prediction questions.
                                Posters, Demos, and Industry Tracks at ISWC 2024, November 13–15, 2024, Baltimore, USA
                                ⋆
                                  You can use this document as the template for preparing your publication. We recommend using the latest version
                                  of the ceurart style.
                                *
                                  Corresponding author.
                                $ ujaimini@email.sc.edu (U. Jaimini); cory.henson@us.bosch.com (C. Henson); amit@sc.edu (A. Sheth)
                                 0000-0002-1168-0684 (U. Jaimini); 0000-0003-3875-3705 (C. Henson); 0000-0002-0021-5293 (A. Sheth)
                                             © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
Figure 1: The center frame shows a video scene of a collision event, Event[T], between the yellow
ball and blue cylinder. To the left, Event[T-1], shows a prior collision event between the red cube and
yellow ball that caused Event[T]. To the right, Event[T+1], shows a subsequent event of the blue cylinder
moving that is caused by Event[T].


   The recent work in event level visual causal questions and answering focuses on the task of
causal reasoning by discovering visual-linguistic causal patterns, temporal causal structures,
and object-level causal relationship between object and language semantics [2, 3, 4]. To the best
of our knowledge, the proposed CausalLP framework is the first attempt towards incorporating
weights between the events (i.e. weighted causal relations) with the knowledge graph embedding
(KGE) for visual causal question and answering.


2. Demonstration
The demonstration1 of CausalLP focuses on showcasing key functionalities along with the
benefits of using KG link prediction for the visual causal question and answering task [1]. This
approach is applied to the CLEVRER [5] and CLEVRER-Humans [6], visual causal reasoning
benchmark datasets to answer questions about video scenes with objects moving and interacting
in a simulated environment. These datasets contains over 1000 simulated video scenes, annotated
with information about the events, the participating objects, the causal relations between events,
and the weights for each relation (i.e. weighted causal relation). The CLEVRER-Humans dataset
provides information about the causal relations between events in the form of a Causal Event
Graph (CEG). A CEG is constructed for each video through human annotators working with
Mechanical Turk. For more information about the CLEVRER-Humans dataset, see [6].
   Figure 2 shows an example with the interactive Python interface, where CausalLP is able to
answer causal explanation and causal prediction questions about an event in the video scene.
As shown in Figure 2 (A), the user can choose a target video in order to ask causal explanation
and causal prediction question. Figure 2 (B) lists the events that occur in the video. Figure 2
(C) shows how a user can ask an explanation question about an event and display the result,
such as What is the cause of the yellow ball hits the light blue cylinder. The event is caused by
a comeFrom event. Figure 2 (D) shows how a user can ask a prediction question for an event
and display the result, such as What is the effect of the gray ball enter from the left?. This event
causes a Hit event in subsequent frames.
   To perform the question and answering task with CausalLP, two models were trained for the
explanation and prediction questions. The training and testing data were selected by splitting

1
    https://drive.google.com/file/d/1P3D3HIppZFsabsknLVq-4GwqLUciCcWQ/view?usp=sharing
Figure 2: Visual Causal Question and Answering demonstration system. In the window (A) the user
can choose a target video for causal explanation and causal prediction question. (B) lists the events that
occur in the video. (C), and (D) are causal explanation and causal prediction questions and answering
windows respectively.


the causal relations for each video scene based on their temporal positioning [1]. For the
explanation model, the first few events in each scene are removed from the training data and
only used for testing. For the prediction model, on the other hand, the final few events in each
scene are removed from the training data and used for testing. With this setup, the initial events
in each scene serve as answers to explanation questions while the final events serve as answers
to prediction questions. Evaluation results of the CausalLP approach with the CLEVRER and
CLEVRER-Humans datasets, as used in this demonstration, are promising. Using DistMult
alone to train the KGE, i.e. without weights, results in an MRR score of 0.37. On the other hand,
using DistMult together with FocusE, i.e. with weights, results in an MRR score of 0.56. On
an average across all the models (i.e., TransE, DistMult, HolE, ComplEx), integrating weights
(i.e. weighted causal relations) leads to a +75% MRR score improvement. Additionally, adding
knowledge about the types of events and participating objects improves MRR score by +31%.


3. Conclusion and future work
In this paper, we present the CausalLP framework and demonstrate its use for a visual question
and answering task. Specifically, causal explanation and prediction questions are answered
based on video scenes from the CLEVRER and CLEVRER-Humans benchmark datasets. The
proposed framework can be used for problems which involve cause and effect associations such
as root cause analysis at time of system failure, cause and effect of a collision understanding in
the autonomous driving systems, and trajectory prediction of a vehicle after a collision. In the
future, we aim to extend the CausalLP for answering counterfactual "What if" questions.
Acknowledgments
This work is supported in part by NSF grants #2133842, "EAGER: Advancing Neuro-symbolic
AI with Deep Knowledge Infused Learning", and #2119654, "RII Track 2 FEC: Enabling Factory
to Factory (F2F) Networking for Future Manufacturing". Any opinions, findings, conclusions, or
recommendations expressed in this material are those of the authors and do not necessarily
reflect the views of the NSF.


References
[1] U. Jaimini, C. Henson, A. P. Sheth, Causallp: Learning causal relations with weighted
    knowledge graph link prediction, arXiv preprint arXiv:2405.02327 (2024).
[2] J. Xiao, X. Shang, A. Yao, T.-S. Chua, Next-qa: Next phase of question-answering to
    explaining temporal actions, in: Proceedings of the IEEE/CVF conference on computer
    vision and pattern recognition, 2021, pp. 9777–9786.
[3] C. Zang, H. Wang, M. Pei, W. Liang, Discovering the real association: Multimodal causal
    reasoning in video question answering, in: Proceedings of the IEEE/CVF Conference on
    Computer Vision and Pattern Recognition, 2023, pp. 19027–19036.
[4] Y. Liu, G. Li, L. Lin, Cross-modal causal relational reasoning for event-level visual question
    answering, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[5] K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, J. B. Tenenbaum, Clevrer: Collision
    events for video representation and reasoning, in: International Conference on Learning
    Representations, 2019.
[6] J. Mao, X. Yang, X. Zhang, N. Goodman, J. Wu, Clevrer-humans: Describing physical and
    causal events the human way, Advances in Neural Information Processing Systems 35
    (2022) 7755–7768.