<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Causal Question and Answering with Knowledge Graph Link Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Utkarshani Jaimini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cory Henson</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amit Sheth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence Institute, University of South Carolina</institution>
          ,
          <addr-line>Columbia, SC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bosch Center for Artificial Intelligence</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The ability to answer causal questions is important for any system that requires robust scene understanding. In this demonstration, we develop a prototype system that leverages our causal link prediction framework, CausalLP. The CausalLP framework uses a visual causal knowledge graph and an associated knowledge graph embedding for two visual causal question answering tasks: (i) causal explanation and (ii) causal prediction. In the live demonstration sessions, participants will be invited to test the efficiency and effectiveness of the system for visual causal question answering.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual causal knowledge graph</kwd>
        <kwd>causal explanation</kwd>
        <kwd>causal prediction</kwd>
        <kwd>causal link prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Answering questions about scenes often requires knowledge of the causal relations between
events. As an example, consider a scene in which a yellow ball collides with a blue cylinder, as
depicted in Figure 1. Several causal questions may be asked about this collision event, such as what caused the collision and what effects it will have.</p>
      <p>
        Recent work on event-level visual causal question answering focuses on the task of
causal reasoning by discovering visual-linguistic causal patterns, temporal causal structures,
and object-level causal relationships between objects and language semantics [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. To the best
of our knowledge, the proposed CausalLP framework is the first attempt to incorporate
weights between events (i.e., weighted causal relations) into knowledge graph embeddings
(KGE) for visual causal question answering.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Demonstration</title>
      <p>
        The demonstration of CausalLP (video: https://drive.google.com/file/d/1P3D3HIppZFsabsknLVq-4GwqLUciCcWQ/view?usp=sharing) focuses on showcasing key functionalities along with the
benefits of using KG link prediction for the visual causal question answering task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This
approach is applied to the CLEVRER [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and CLEVRER-Humans [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] visual causal reasoning
benchmark datasets to answer questions about video scenes in which objects move and interact
in a simulated environment. These datasets contain over 1000 simulated video scenes, annotated
with information about the events, the participating objects, the causal relations between events,
and the weights for each relation (i.e., weighted causal relations). The CLEVRER-Humans dataset
provides information about the causal relations between events in the form of a Causal Event
Graph (CEG). A CEG is constructed for each video by human annotators recruited through
Amazon Mechanical Turk. For more information about the CLEVRER-Humans dataset, see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
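The weighted causal relations described above can be pictured as edges of a small knowledge graph. The sketch below is illustrative only; the event names, the `causes` relation label, and the weights are hypothetical stand-ins, not values from the CLEVRER-Humans dataset.

```python
# Hypothetical sketch: encoding a Causal Event Graph (CEG) as weighted
# causal triples. Names and weights are illustrative, not dataset values.
from dataclasses import dataclass

@dataclass(frozen=True)
class WeightedTriple:
    head: str      # cause event
    relation: str  # causal relation label
    tail: str      # effect event
    weight: float  # annotator-derived causal strength

def ceg_to_triples(ceg_edges):
    """Flatten a CEG's weighted edges into (cause, 'causes', effect, weight) triples."""
    return [WeightedTriple(c, "causes", e, w) for (c, e, w) in ceg_edges]

# Toy CEG for one video scene: (cause, effect, weight).
edges = [
    ("gray_ball_enters_left", "yellow_ball_hits_blue_cylinder", 0.9),
    ("yellow_ball_hits_blue_cylinder", "blue_cylinder_moves", 0.7),
]
triples = ceg_to_triples(edges)
```

In this representation, the weight on each triple is what a weight-aware embedding method (such as FocusE, discussed below) can exploit during training.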
      <p>Figure 2 shows an example of the interactive Python interface, in which CausalLP
answers causal explanation and causal prediction questions about an event in a video scene.
As shown in Figure 2 (A), the user first chooses a target video about which to ask causal explanation
and causal prediction questions. Figure 2 (B) lists the events that occur in the video. Figure 2
(C) shows how a user can ask an explanation question about an event and display the result,
such as What is the cause of the yellow ball hitting the light blue cylinder? The event is caused by
a comeFrom event. Figure 2 (D) shows how a user can ask a prediction question about an event
and display the result, such as What is the effect of the gray ball entering from the left? This event
causes a Hit event in subsequent frames.</p>
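The two question types map naturally onto knowledge graph link prediction: an explanation question asks for the missing head of (?, causes, event), while a prediction question asks for the missing tail of (event, causes, ?). The sketch below illustrates this mapping with a toy scoring function standing in for a trained KGE model; all event names and the `rank_candidates` helper are hypothetical, not the demo's actual code.

```python
# Illustrative sketch: explanation and prediction questions as head- and
# tail-prediction queries over a causal KG. The scorer is a toy stand-in
# for a trained knowledge graph embedding model.
def rank_candidates(score_fn, query, candidates):
    """Rank candidate entities for the missing slot of a (head, rel, tail) query."""
    head, rel, tail = query
    scored = []
    for c in candidates:
        triple = (c, rel, tail) if head is None else (head, rel, c)
        scored.append((score_fn(*triple), c))
    return [c for _, c in sorted(scored, reverse=True)]

# Toy scorer: known causal pairs receive a high score.
known = {("comeFrom_event", "causes", "yellow_ball_hits_blue_cylinder"),
         ("gray_ball_enters_left", "causes", "hit_event")}
score = lambda h, r, t: 1.0 if (h, r, t) in known else 0.0

# Explanation: what caused the collision?  (head prediction)
cause = rank_candidates(score, (None, "causes", "yellow_ball_hits_blue_cylinder"),
                        ["comeFrom_event", "hit_event"])[0]
# Prediction: what does the gray ball's entry cause?  (tail prediction)
effect = rank_candidates(score, ("gray_ball_enters_left", "causes", None),
                         ["hit_event", "comeFrom_event"])[0]
```

With a real KGE model, `score_fn` would be the embedding scoring function (e.g., a DistMult score), and the candidate set would be all events in the scene.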
      <p>
        To perform the question answering task with CausalLP, two models were trained for the
explanation and prediction questions. The training and testing data were selected by splitting
the causal relations for each video scene based on their temporal positioning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For the
explanation model, the first few events in each scene are removed from the training data and
used only for testing. For the prediction model, conversely, the final few events in each
scene are removed from the training data and used for testing. With this setup, the initial events
in each scene serve as answers to explanation questions, while the final events serve as answers
to prediction questions. Evaluation results of the CausalLP approach on the CLEVRER and
CLEVRER-Humans datasets, as used in this demonstration, are promising. Using DistMult
alone to train the KGE, i.e., without weights, yields an MRR of 0.37. On the other hand,
using DistMult together with FocusE, i.e., with weights, yields an MRR of 0.56. On
average across all models evaluated (TransE, DistMult, HolE, ComplEx), integrating weights
(i.e., weighted causal relations) improves MRR by +75%. Additionally, adding
knowledge about the types of events and participating objects improves MRR by a further +31%.
      </p>
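The temporal train/test split described above can be sketched as follows, under the assumption that each scene's causal relations are ordered by the time at which the cause event occurs. The function name, the `mode` values, and the toy scene are illustrative, not the framework's actual API.

```python
# Minimal sketch of the temporal split: hold out the earliest relations
# for the explanation model, or the latest for the prediction model.
def temporal_split(relations, n_holdout, mode):
    """Split a scene's time-ordered (cause, effect) relations into train/test.

    mode="explanation": first n_holdout relations become test answers.
    mode="prediction":  final n_holdout relations become test answers.
    """
    if mode == "explanation":
        return relations[n_holdout:], relations[:n_holdout]
    elif mode == "prediction":
        return relations[:-n_holdout], relations[-n_holdout:]
    raise ValueError(f"unknown mode: {mode}")

# Toy scene: four causal relations ordered by time of occurrence.
scene = [("enter", "hit"), ("hit", "bounce"), ("bounce", "exit"), ("exit", "stop")]
train_e, test_e = temporal_split(scene, 1, "explanation")
train_p, test_p = temporal_split(scene, 1, "prediction")
```

Training the explanation model on `train_e` and querying the held-out initial event reproduces, in miniature, how initial events serve as answers to explanation questions and final events as answers to prediction questions.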
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion and future work</title>
      <p>In this paper, we present the CausalLP framework and demonstrate its use for a visual question
answering task. Specifically, causal explanation and prediction questions are answered
based on video scenes from the CLEVRER and CLEVRER-Humans benchmark datasets. The
proposed framework can be applied to problems that involve cause-and-effect associations, such
as root cause analysis at the time of a system failure, understanding the cause and effect of a collision
in autonomous driving systems, and predicting the trajectory of a vehicle after a collision. In the
future, we aim to extend CausalLP to answer counterfactual "What if" questions.</p>
      <p>This work is supported in part by NSF grants #2133842, "EAGER: Advancing Neuro-symbolic
AI with Deep Knowledge Infused Learning", and #2119654, "RII Track 2 FEC: Enabling Factory
to Factory (F2F) Networking for Future Manufacturing". Any opinions, findings, conclusions, or
recommendations expressed in this material are those of the authors and do not necessarily
reflect the views of the NSF.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>U.</given-names>
            <surname>Jaimini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Henson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <article-title>Causallp: Learning causal relations with weighted knowledge graph link prediction</article-title>
          ,
          <source>arXiv preprint arXiv:2405.02327</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-S.</given-names>
            <surname>Chua</surname>
          </string-name>
          ,
          <article-title>Next-qa: Next phase of question-answering to explaining temporal actions</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>9777</fpage>
          -
          <lpage>9786</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Discovering the real association: Multimodal causal reasoning in video question answering</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19027</fpage>
          -
          <lpage>19036</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Cross-modal causal relational reasoning for event-level visual question answering</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <article-title>Clevrer: Collision events for video representation and reasoning</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Clevrer-humans: Describing physical and causal events the human way</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>7755</fpage>
          -
          <lpage>7768</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>