<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Causal Question and Answering with Knowledge Graph Link Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Utkarshani Jaimini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cory Henson</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amit Sheth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence Institute, University of South Carolina</institution>
          ,
          <addr-line>Columbia, SC</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Bosch Center for Artificial Intelligence</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The ability to answer causal questions is important for any system that requires robust scene understanding. In this demonstration, we develop a prototype system that leverages our causal link prediction framework, CausalLP. The CausalLP framework uses a visual causal knowledge graph and an associated knowledge graph embedding for two visual causal question answering tasks: (i) causal explanation and (ii) causal prediction. In the live demonstration sessions, participants will be invited to test the efficiency and effectiveness of the system for visual causal question answering.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual causal knowledge graph</kwd>
        <kwd>causal explanation</kwd>
        <kwd>causal prediction</kwd>
        <kwd>causal link prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Answering questions about scenes often requires knowledge of the causal relations between
events. As an example, consider a scene in which a yellow ball collides with a blue cylinder, as
depicted in Figure 1. Several causal questions may be asked about this collision event, such as what caused the collision and what effects it will have.</p>
      <p>
        Recent work on event-level visual causal question answering focuses on the task of
causal reasoning by discovering visual-linguistic causal patterns, temporal causal structures,
and object-level causal relationships between objects and language semantics [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. To the best
of our knowledge, the proposed CausalLP framework is the first attempt to incorporate
weights between events (i.e., weighted causal relations) into knowledge graph embeddings
(KGE) for visual causal question answering.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Demonstration</title>
      <p>
        The demonstration of CausalLP (video: https://drive.google.com/file/d/1P3D3HIppZFsabsknLVq-4GwqLUciCcWQ/view?usp=sharing) focuses on showcasing key functionalities along with the
benefits of using KG link prediction for the visual causal question answering task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This
approach is applied to the CLEVRER [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and CLEVRER-Humans [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] visual causal reasoning
benchmark datasets to answer questions about video scenes in which objects move and interact
in a simulated environment. These datasets contain over 1000 simulated video scenes, annotated
with information about the events, the participating objects, the causal relations between events,
and the weights for each relation (i.e., weighted causal relations). The CLEVRER-Humans dataset
provides information about the causal relations between events in the form of a Causal Event
Graph (CEG). A CEG is constructed for each video by human annotators recruited through
Amazon Mechanical Turk. For more information about the CLEVRER-Humans dataset, see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
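The weighted causal relations described above can be pictured as edges of a small knowledge graph. The sketch below is illustrative only; the event names, the `causes` relation label, and the weights are hypothetical stand-ins, not values from the CLEVRER-Humans dataset.

```python
# Hypothetical sketch: encoding a Causal Event Graph (CEG) as weighted
# causal triples. Names and weights are illustrative, not dataset values.
from dataclasses import dataclass

@dataclass(frozen=True)
class WeightedTriple:
    head: str      # cause event
    relation: str  # causal relation label
    tail: str      # effect event
    weight: float  # annotator-derived causal strength

def ceg_to_triples(ceg_edges):
    """Flatten a CEG's weighted edges into (cause, 'causes', effect, weight) triples."""
    return [WeightedTriple(c, "causes", e, w) for (c, e, w) in ceg_edges]

# Toy CEG for one video scene: (cause, effect, weight).
edges = [
    ("gray_ball_enters_left", "yellow_ball_hits_blue_cylinder", 0.9),
    ("yellow_ball_hits_blue_cylinder", "blue_cylinder_moves", 0.7),
]
triples = ceg_to_triples(edges)
```

In this representation, the weight on each triple is what a weight-aware embedding method (such as FocusE, discussed below) can exploit during training.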
      <p>Figure 2 shows an example of the interactive Python interface, in which CausalLP
answers causal explanation and causal prediction questions about an event in a video scene.
As shown in Figure 2 (A), the user first chooses a target video about which to ask causal explanation
and causal prediction questions. Figure 2 (B) lists the events that occur in the video. Figure 2
(C) shows how a user can ask an explanation question about an event and display the result,
such as What is the cause of the yellow ball hitting the light blue cylinder? The event is caused by
a comeFrom event. Figure 2 (D) shows how a user can ask a prediction question about an event
and display the result, such as What is the effect of the gray ball entering from the left? This event
causes a Hit event in subsequent frames.</p>
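The two question types map naturally onto knowledge graph link prediction: an explanation question asks for the missing head of (?, causes, event), while a prediction question asks for the missing tail of (event, causes, ?). The sketch below illustrates this mapping with a toy scoring function standing in for a trained KGE model; all event names and the `rank_candidates` helper are hypothetical, not the demo's actual code.

```python
# Illustrative sketch: explanation and prediction questions as head- and
# tail-prediction queries over a causal KG. The scorer is a toy stand-in
# for a trained knowledge graph embedding model.
def rank_candidates(score_fn, query, candidates):
    """Rank candidate entities for the missing slot of a (head, rel, tail) query."""
    head, rel, tail = query
    scored = []
    for c in candidates:
        triple = (c, rel, tail) if head is None else (head, rel, c)
        scored.append((score_fn(*triple), c))
    return [c for _, c in sorted(scored, reverse=True)]

# Toy scorer: known causal pairs receive a high score.
known = {("comeFrom_event", "causes", "yellow_ball_hits_blue_cylinder"),
         ("gray_ball_enters_left", "causes", "hit_event")}
score = lambda h, r, t: 1.0 if (h, r, t) in known else 0.0

# Explanation: what caused the collision?  (head prediction)
cause = rank_candidates(score, (None, "causes", "yellow_ball_hits_blue_cylinder"),
                        ["comeFrom_event", "hit_event"])[0]
# Prediction: what does the gray ball's entry cause?  (tail prediction)
effect = rank_candidates(score, ("gray_ball_enters_left", "causes", None),
                         ["hit_event", "comeFrom_event"])[0]
```

With a real KGE model, `score_fn` would be the embedding scoring function (e.g., a DistMult score), and the candidate set would be all events in the scene.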
      <p>
        To perform the question answering task with CausalLP, two models were trained for the
explanation and prediction questions. The training and testing data were selected by splitting
the causal relations for each video scene based on their temporal positioning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For the
explanation model, the first few events in each scene are removed from the training data and
used only for testing. For the prediction model, conversely, the final few events in each
scene are removed from the training data and used for testing. With this setup, the initial events
in each scene serve as answers to explanation questions, while the final events serve as answers
to prediction questions. Evaluation results of the CausalLP approach on the CLEVRER and
CLEVRER-Humans datasets, as used in this demonstration, are promising. Using DistMult
alone to train the KGE, i.e., without weights, yields an MRR of 0.37. On the other hand,
using DistMult together with FocusE, i.e., with weights, yields an MRR of 0.56. On
average across all models evaluated (TransE, DistMult, HolE, ComplEx), integrating weights
(i.e., weighted causal relations) improves MRR by +75%. Additionally, adding
knowledge about the types of events and participating objects improves MRR by a further +31%.
      </p>
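The temporal train/test split described above can be sketched as follows, under the assumption that each scene's causal relations are ordered by the time at which the cause event occurs. The function name, the `mode` values, and the toy scene are illustrative, not the framework's actual API.

```python
# Minimal sketch of the temporal split: hold out the earliest relations
# for the explanation model, or the latest for the prediction model.
def temporal_split(relations, n_holdout, mode):
    """Split a scene's time-ordered (cause, effect) relations into train/test.

    mode="explanation": first n_holdout relations become test answers.
    mode="prediction":  final n_holdout relations become test answers.
    """
    if mode == "explanation":
        return relations[n_holdout:], relations[:n_holdout]
    elif mode == "prediction":
        return relations[:-n_holdout], relations[-n_holdout:]
    raise ValueError(f"unknown mode: {mode}")

# Toy scene: four causal relations ordered by time of occurrence.
scene = [("enter", "hit"), ("hit", "bounce"), ("bounce", "exit"), ("exit", "stop")]
train_e, test_e = temporal_split(scene, 1, "explanation")
train_p, test_p = temporal_split(scene, 1, "prediction")
```

Training the explanation model on `train_e` and querying the held-out initial event reproduces, in miniature, how initial events serve as answers to explanation questions and final events as answers to prediction questions.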
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion and future work</title>
      <p>In this paper, we present the CausalLP framework and demonstrate its use for a visual question
answering task. Specifically, causal explanation and prediction questions are answered
based on video scenes from the CLEVRER and CLEVRER-Humans benchmark datasets. The
proposed framework can be applied to problems that involve cause-and-effect associations, such
as root cause analysis at the time of a system failure, understanding the cause and effect of a collision
in autonomous driving systems, and predicting the trajectory of a vehicle after a collision. In the
future, we aim to extend CausalLP to answer counterfactual "What if" questions.</p>
      <p>This work is supported in part by NSF grants #2133842, "EAGER: Advancing Neuro-symbolic
AI with Deep Knowledge Infused Learning", and #2119654, "RII Track 2 FEC: Enabling Factory
to Factory (F2F) Networking for Future Manufacturing". Any opinions, findings, conclusions, or
recommendations expressed in this material are those of the authors and do not necessarily
reflect the views of the NSF.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>U.</given-names>
            <surname>Jaimini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Henson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <article-title>Causallp: Learning causal relations with weighted knowledge graph link prediction</article-title>
          ,
          <source>arXiv preprint arXiv:2405.02327</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-S.</given-names>
            <surname>Chua</surname>
          </string-name>
          ,
          <article-title>Next-qa: Next phase of question-answering to explaining temporal actions</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>9777</fpage>
          -
          <lpage>9786</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Discovering the real association: Multimodal causal reasoning in video question answering</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19027</fpage>
          -
          <lpage>19036</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Cross-modal causal relational reasoning for event-level visual question answering</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <article-title>Clevrer: Collision events for video representation and reasoning</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Clevrer-humans: Describing physical and causal events the human way</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>7755</fpage>
          -
          <lpage>7768</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>