<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshops, October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Neuro-symbolic Visual Reasoning for Multimedia Event Processing: Overview, Prospects and Challenges</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muhammad Jaleed Khan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward Curry</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SFI Centre for Research Training in Artificial Intelligence, Data Science Institute, National University of Ireland Galway</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>20</lpage>
      <abstract>
<p>Efficient multimedia event processing is a key enabler for real-time and complex decision making over streaming media. The need for expressive queries to detect high-level, human-understandable spatial and temporal events in multimedia streams is inevitable due to the explosive growth of multimedia data in smart cities and on the internet. Recent work in stream reasoning, event processing and visual reasoning inspires the integration of visual and commonsense reasoning into multimedia event processing, which would enhance multimedia event processing in terms of the expressivity of event rules and queries. This can be achieved through the careful integration of knowledge about entities, relations and rules from rich knowledge bases via reasoning over multimedia streams within an event processing engine. The prospects of neuro-symbolic visual reasoning within multimedia event processing are promising; however, there are several associated challenges, which are highlighted in this paper.</p>
      </abstract>
      <kwd-group>
        <kwd>multimedia event processing</kwd>
        <kwd>visual reasoning</kwd>
        <kwd>commonsense reasoning</kwd>
        <kwd>video stream processing</kwd>
        <kwd>spatiotemporal events</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Internet of multimedia things (IoMT), data analytics and artificial intelligence are continuously improving smart cities and urban environments with their ever-increasing applications, ranging from traffic management to public safety. As middleware between the internet of things and real-time applications, complex event processing (CEP) systems process structured data streams from multiple producers and detect complex events queried by subscribers in real time. The enormous increase in image and video content from surveillance cameras and other sources in IoMT applications has posed several challenges in the real-time processing of multimedia events, which motivated researchers in this area to extend the existing CEP engines and to devise new CEP frameworks that support unstructured multimedia streams. Over the past few years, several efforts have been made to mitigate the challenges in multimedia event processing by developing techniques for the extension of existing CEP engines to multimedia events [<xref ref-type="bibr" rid="ref1">1</xref>] and the development of end-to-end CEP frameworks for multimedia streams [<xref ref-type="bibr" rid="ref2">2</xref>]. On the other hand, research in computer vision has focused on complementing object detection with human-like visual reasoning, which allows the prediction of meaningful and useful semantic relations among detected objects based on analogy and commonsense (CS) knowledge [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>].</p>
      <sec id="sec-1-1">
        <title>Emerging from the semantic web, stream</title>
        <p>
          ing data is conventionally modelled
accord2. Background ing to RDF [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a graph representation. The
real-time processing of RDF streams is
perIn this paper, we discuss the background, formed in time-dependent windows that
conprospects and challenges related to leverag- trol the access to the stream, each
containing the existing visual and commonsense rea- ing a small part of the stream over which
soning to enhance multimedia event process- a task needs to be performed at a certain
ing in terms of its applicability and expres- time instant. Reasoning is performed by
apsivity of multimedia event queries. The mo- plying RDF Schema rules to the graph
ustivation for development of an end-to-end ing SPARQL query language or its variants.
multimedia event processing system sup- Reasoning over knowledge graphs (KG)
proporting automated reasoning over multime- vides new relations among entities to
endia streams comes from its potential real- rich the knowledge graph and improve its
time applications in smart cities, internet and applicability [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Neuro-symbolic
computsports. Fig. 1 shows an example of traf- ing combines symbolic and statistical
apifc congestion event detected using visual proaches, i.e. knowledge is represented in
and commonsense reasoning over the objects symbolic form, whereas learning and
reasonand relations among the objects in the video ing are performed by DNN [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which has
stream. A conceptual level design and a mo- shown its eficacy in object detection [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] as
tivational example of a novel CEP framework well as enhanced feature learning via
knowlsupporting visual and commonsense reason- edge infusion in DNN layers from knowledge
ing is presented in Fig. 2. bases [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Temporal KG allows time-aware
        </p>
        <p>
          This section presents a review of the re- representation and tracking of entities and
cent work in stream reasoning, multimedia relations [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
event processing and visual reasoning that
could be complementary within a proposed
neuro-symbolic multimedia event processing
system with support for visual reasoning.
        </p>
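        <p>A minimal sketch of this window-based processing is given below in Python using the rdflib library; the example vocabulary, the toy detections and the vehicle query are assumptions introduced purely to illustrate how RDFS-style rules and a SPARQL query could be applied to the triples that fall inside a single time window.</p>
        <preformat>
# Sketch: reasoning over one time window of an RDF stream with rdflib.
# The EX vocabulary and the triples below are illustrative assumptions,
# not part of any published multimedia event processing system.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/mm-events#")

def window_graph(triples):
    """Build an RDF graph for the triples observed in one window."""
    g = Graph()
    g.bind("ex", EX)
    # A toy RDFS hierarchy: cars and buses are both vehicles.
    g.add((EX.Car, RDFS.subClassOf, EX.Vehicle))
    g.add((EX.Bus, RDFS.subClassOf, EX.Vehicle))
    for s, p, o in triples:
        g.add((s, p, o))
    return g

# Detections emitted by upstream producers during the current window.
window = [
    (EX.obj1, RDF.type, EX.Car),
    (EX.obj2, RDF.type, EX.Bus),
    (EX.obj1, EX.onSameRoadAs, EX.obj2),
]

g = window_graph(window)

# SPARQL query over the window graph: find pairs of vehicles on the same
# road. The subclass property path stands in for RDFS-style reasoning.
q = """
    SELECT ?a ?b WHERE {
        ?a a/rdfs:subClassOf* ex:Vehicle .
        ?b a/rdfs:subClassOf* ex:Vehicle .
        ?a ex:onSameRoadAs ?b .
    }
"""
for a, b in g.query(q, initNs={"ex": EX, "rdfs": RDFS}):
    print(f"vehicles co-located in this window: {a} {b}")
        </preformat>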
        <sec id="sec-1-1-1">
          <title>2.2. Multimedia Event</title>
        </sec>
        <sec id="sec-1-1-2">
          <title>Representation and</title>
        </sec>
        <sec id="sec-1-1-3">
          <title>Processing</title>
        <p>CEP engines inherently lacked support for unstructured multimedia events, which was mitigated by a generalized approach for handling multimedia events as native events in CEP engines, as presented in [<xref ref-type="bibr" rid="ref1">1</xref>]. Angsuchotmetee et al. [<xref ref-type="bibr" rid="ref14">14</xref>] presented an ontological approach for modeling complex events and multimedia data with syntactic and semantic interoperability in multimedia sensor networks, which allows subscribers to define application-specific complex events while keeping the low-level network representation generic. Aslam et al. [<xref ref-type="bibr" rid="ref15">15</xref>] leveraged domain adaptation and online transfer learning in multimedia event processing to extend support to unknown events. A knowledge graph is suitable for semantic representation of and reasoning over video streams due to its scalability and maintainability [<xref ref-type="bibr" rid="ref16">16</xref>], as demonstrated in [<xref ref-type="bibr" rid="ref5">5</xref>]. VidCEP [<xref ref-type="bibr" rid="ref2">2</xref>], a CEP framework for the detection of spatiotemporal video events expressed by subscriber-defined queries, includes a graph-based representation, the Video Event Query Language (VEQL) and a complex event matcher for video data.</p>
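        <p>The following sketch illustrates the graph-based idea behind such representations: each frame's detections become nodes, pairwise spatial relations become edges, and a subscriber-style pattern is matched against the graph. The node labels, relation names and the simple matcher are illustrative assumptions and do not reproduce the published VEKG or VEQL implementations.</p>
        <preformat>
# Sketch of a graph-based event representation in the spirit of VEKG/VidCEP.
# Labels, relation names and the matching routine are illustrative
# assumptions; they do not reproduce the published VEQL implementation.
import networkx as nx

def frame_to_graph(detections, relations):
    """Objects become nodes, pairwise spatial relations become edges."""
    g = nx.DiGraph()
    for obj_id, label in detections:
        g.add_node(obj_id, label=label)
    for subj, rel, obj in relations:
        g.add_edge(subj, obj, relation=rel)
    return g

def match_pattern(g, subj_label, relation, obj_label):
    """Return (subject, object) pairs whose labels and edge satisfy the pattern."""
    hits = []
    for u, v, data in g.edges(data=True):
        if (data["relation"] == relation
                and g.nodes[u]["label"] == subj_label
                and g.nodes[v]["label"] == obj_label):
            hits.append((u, v))
    return hits

# One video frame: two cars, one person, plus spatial relations from a detector.
frame = frame_to_graph(
    detections=[("o1", "car"), ("o2", "car"), ("o3", "person")],
    relations=[("o1", "behind", "o2"), ("o3", "near", "o2")],
)

# A subscriber-style pattern: "car behind car", e.g. one clause of a
# congestion query expressed over the graph representation.
print(match_pattern(frame, "car", "behind", "car"))   # [('o1', 'o2')]
        </preformat>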
        </sec>
      <sec id="sec-1-1-2">
        <title>2.3. Visual and Commonsense Reasoning</title>
        <p>In addition to the objects and their attributes in images, the detection of relations among these objects is crucial for scene understanding, for which compositional models [<xref ref-type="bibr" rid="ref17">17</xref>], visual phrase models [<xref ref-type="bibr" rid="ref18">18</xref>] and DNN-based relational networks [<xref ref-type="bibr" rid="ref19">19</xref>] are available. Visual and semantic embeddings aid large-scale visual relation detection; for example, Zhang et al. [<xref ref-type="bibr" rid="ref4">4</xref>] employed both visual and textual features to leverage the interactions between objects for relation detection. Similarly, Peyre et al. [<xref ref-type="bibr" rid="ref3">3</xref>] added a visual phrase embedding space during learning to enable analogical reasoning for unseen relations and to improve robustness to appearance variations of visual relations. Table 1 presents some knowledge bases publicly available for visual reasoning. Wan et al. [<xref ref-type="bibr" rid="ref7">7</xref>] proposed the use of a commonsense knowledge graph along with visual features to enhance visual relation detection. Rajani et al. [<xref ref-type="bibr" rid="ref20">20</xref>] leverage human reasoning and language models to generate human-like explanations for DNN-based commonsense question answering. Various commonsense reasoning methods and datasets are available for visual commonsense reasoning [<xref ref-type="bibr" rid="ref21">21</xref>] and story completion [<xref ref-type="bibr" rid="ref22">22</xref>].</p>
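        <p>As a rough illustration of how commonsense knowledge could complement a purely visual relation detector, the sketch below re-ranks candidate predicates for a detected object pair using a tiny, hand-written set of commonsense triples; the triples, scores and weighting rule are assumptions made for illustration and are not the method of any cited work.</p>
        <preformat>
# Sketch: combining visual relation scores with a commonsense prior.
# The detector scores, the mini commonsense KB and the weighting are
# illustrative assumptions, not the method of any cited paper.

# Commonsense triples (subject label, predicate, object label).
COMMONSENSE = {
    ("person", "rides", "bicycle"),
    ("person", "sits on", "chair"),
    ("car", "parked on", "road"),
}

def commonsense_prior(subj, pred, obj):
    """1.0 if the triple is plausible according to the KB, else a small value."""
    return 1.0 if (subj, pred, obj) in COMMONSENSE else 0.1

def rerank(subj, obj, visual_scores, weight=0.5):
    """Blend detector confidence with the commonsense prior for each predicate."""
    blended = {
        pred: (1 - weight) * score + weight * commonsense_prior(subj, pred, obj)
        for pred, score in visual_scores.items()
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

# Visual-only scores slightly prefer the implausible predicate "wears";
# the commonsense prior pushes "rides" back to the top.
print(rerank("person", "bicycle", {"rides": 0.45, "wears": 0.55}))
        </preformat>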
    <sec id="sec-3">
      <title>Reasoning in</title>
    </sec>
    <sec id="sec-4">
      <title>Multimedia Event</title>
    </sec>
    <sec id="sec-5">
      <title>Processing</title>
        <p>The current multimedia event representation methods use a knowledge graph to represent the detected objects, their attributes and the relations among the objects in video streams. Pre-defined spatial-temporal rules are used to form relations among the objects. However, the complex relations that exist among real-world objects also depend on semantic facts and situational variables that cannot be explicitly specified as rules for every possible event. The statistical reasoning methods and knowledge bases discussed in Section 2 have great potential to complement the rule-based relation formation in multimedia event processing by injecting semantic knowledge and reasoning to extract more semantically meaningful relations among objects. This advancement will allow subscribers to define abstract or high-level, human-understandable event query rules that can be decomposed into spatial and temporal patterns. The spatiotemporal matching of the queried high-level events will be performed on the objects, the rule-based relations and the relations extracted using visual reasoning. The subscriber will be instantly notified of the high-level event as a combined detection of those spatiotemporal patterns. The idea of developing an end-to-end multimedia event processing system supporting visual reasoning over video streams (Fig. 2) poses several challenges that are discussed in the next section. This novel approach will give more expressive power to subscribers in querying complex events in multimedia streams, and will thus increase the scope of real-time applications of multimedia event processing in smart city applications as well as in internet media streaming applications.</p>
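        <p>To make this decomposition concrete, the sketch below treats a hypothetical traffic congestion subscription as a spatial pattern (many vehicles with small spacing) that must persist over a temporal window before the subscriber is notified; the thresholds, the window length and the helper functions are assumptions for illustration only.</p>
        <preformat>
# Sketch: decomposing a high-level "traffic congestion" subscription into a
# spatial pattern checked per frame and a temporal pattern checked over a
# sliding window. Thresholds, window sizes and field names are illustrative
# assumptions, not constructs of any published query language.
from collections import deque

def congested_frame(objects, min_vehicles=5, max_avg_gap=30.0):
    """Spatial pattern: many vehicles with a small average spacing (in pixels)."""
    vehicles = [o for o in objects if o["label"] in ("car", "bus", "truck")]
    if len(vehicles) &lt; min_vehicles:
        return False
    xs = sorted(o["x"] for o in vehicles)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return sum(gaps) / len(gaps) &lt;= max_avg_gap

def monitor(frame_stream, window_size=10, min_hits=8):
    """Temporal pattern: the spatial pattern holds in most frames of a window."""
    window = deque(maxlen=window_size)
    for timestamp, objects in frame_stream:
        window.append(congested_frame(objects))
        if len(window) == window_size and sum(window) &gt;= min_hits:
            yield timestamp  # notify the subscriber of a congestion event

# frame_stream is expected to yield (timestamp, detections) pairs, where
# detections is a list of dicts such as {"label": "car", "x": 12.0}.
        </preformat>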
      <sec id="sec-5-1">
        <title>3.2. Challenges</title>
        <p>1. Suitable representation for reasoning: It is crucial to select a generalized and scalable model to represent events and to effectively perform automated reasoning that derives more meaningful and expressive spatiotemporal events.</p>
        <p>2. Expressive query definition and matching: Providing a generic and human-friendly format to subscribers for writing expressive and high-level queries would require new constructs. Matching queries against the low-level events and relations, along with reasoning via knowledge bases, requires efficient retrieval within the complex event matcher. Real-world complex events can share similar patterns, occur as a cluster of similar events or occur in a hierarchical manner, which requires generalized, adaptive and scalable spatiotemporal constructs to query such events.</p>
        <p>3. Labeling and training samples of visual relations: There can be a large number of objects and possible relations among them in images, which can result in a large number of categories of relations. It is difficult to obtain balanced categories of relations in the training data. For example, Visual Genome [<xref ref-type="bibr" rid="ref6">6</xref>] has a huge number of relations with unbalanced instances of each relation.</p>
        <p>4. Consistent integration of knowledge bases: The object labels in datasets for object detection and the entity labels in knowledge bases (e.g. person, human, man) are not always the same. Similarly, knowledge bases have different labels for the same entity, and different names for the same attribute (e.g. birthPlace and placeOfBirth) or relation (e.g. 'at left' and 'to left of'). This can cause inconsistency or redundancy while integrating relations from the knowledge bases, as illustrated in the sketch following these challenges. It is important to select a knowledge base and a dataset that are consistent and suitable for the combined use of both object detection and visual reasoning.</p>
        <p>5. Supporting rare or unseen visual relations: Apart from the common relations, very rare or unseen relations among objects also appear in certain scenes. It is nearly impossible to collect sufficient training samples for all possible seen and unseen relations. Handling such relations while evaluating the models is also a challenge.</p>
        <p>6. Temporal processing of objects and relations: The recent methods on this subject address complex inference tasks by decomposing images or scenes into objects and visual relations among the objects. The temporal events and the temporal tracking of the detected objects and predicted relations have not been explored much, which is crucial for spatiotemporal event processing.</p>
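        <p>The following sketch illustrates the label-consistency issue raised in challenge 4 by normalizing detector labels and knowledge-base names to shared canonical forms before relations are merged; the synonym tables are invented for illustration, and a real integration would require a curated alignment.</p>
        <preformat>
# Sketch: normalizing labels before integrating detector output with
# knowledge-base relations. The synonym tables are illustrative assumptions.

ENTITY_SYNONYMS = {
    "person": {"person", "human", "man", "woman", "pedestrian"},
    "car": {"car", "automobile", "motorcar"},
}

ATTRIBUTE_SYNONYMS = {
    "birthPlace": {"birthPlace", "placeOfBirth"},
    "leftOf": {"at left", "to left of", "leftOf"},
}

def canonical(label, synonym_table):
    """Map a raw label to its canonical form, or keep it unchanged."""
    for canon, variants in synonym_table.items():
        if label in variants:
            return canon
    return label

def merge_relation(triple):
    """Rewrite a (subject, predicate, object) triple into canonical vocabulary."""
    s, p, o = triple
    return (canonical(s, ENTITY_SYNONYMS),
            canonical(p, ATTRIBUTE_SYNONYMS),
            canonical(o, ENTITY_SYNONYMS))

# Two knowledge bases describe the same fact with different labels;
# after normalization the triples become identical and are not duplicated.
print(merge_relation(("human", "to left of", "automobile")))
print(merge_relation(("person", "at left", "car")))
# Both print: ('person', 'leftOf', 'car')
        </preformat>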
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <sec id="sec-6-1">
        <title>This work was conducted with the financial</title>
        <p>support of the Science Foundation Ireland</p>
      </sec>
      <sec id="sec-6-2">
        <title>Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          , E. Curry,
          <article-title>Towards a generalized approach for deep neural network based event processing for the internet of multimedia things</article-title>
          ,
          <source>IEEE Access 6</source>
          (
          <year>2018</year>
          )
          <fpage>25573</fpage>
          -
          <lpage>25587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yadav</surname>
          </string-name>
          , E. Curry, Vidcep:
          <article-title>Complex event processing framework to detect spatiotemporal patterns in video streams</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Big Data (Big Data)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>2513</fpage>
          -
          <lpage>2522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Peyre</surname>
          </string-name>
          , I. Laptev,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          ,
          <article-title>Detecting unseen visual relations using analogies</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1981</fpage>
          -
          <lpage>1990</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elgammal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          ,
          <article-title>Large-scale visual relationship understanding</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>9185</fpage>
          -
          <lpage>9194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yadav</surname>
          </string-name>
          , E. Curry, Vekg:
          <article-title>Video event knowledge graph to represent video streams for complex event pattern matching</article-title>
          ,
          <source>in: 2019 First International Conference on Graph Computing (GC)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          , et al.,
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>123</volume>
          (
          <year>2017</year>
          )
          <fpage>32</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>Iterative visual relationship detection via commonsense knowledge graph</article-title>
          , in: Joint International Semantic Technology Conference, Springer,
          <year>2019</year>
          , pp.
          <fpage>210</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <article-title>RDF 1.1 Concepts and Abstract Syntax (</article-title>
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <article-title>A review: Knowledge reasoning over knowledge graph</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>141</volume>
          (
          <year>2020</year>
          )
          <fpage>112948</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>Hybrid reasoning in knowledge graphs: Combing symbolic reasoning and statistical reasoning</article-title>
          , Semantic Web (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekhar</surname>
          </string-name>
          ,
          <article-title>Object detection meets knowledge graphs (</article-title>
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>U.</given-names>
            <surname>Kursuncu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <article-title>Knowledge infused learning (k-il): Towards deep incorporation of knowledge in deep learning</article-title>
          , arXiv preprint arXiv:
          <year>1912</year>
          .
          <volume>00512</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Durán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumančić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Niepert</surname>
          </string-name>
          ,
          <article-title>Learning sequence encoders for temporal knowledge graph completion</article-title>
          , arXiv preprint arXiv:
          <year>1809</year>
          .
          <volume>03202</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Angsuchotmetee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chbeir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cardinale</surname>
          </string-name>
          ,
          <article-title>Mssn-onto: An ontology-based approach for flexible event processing in multimedia sensor networks</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>108</volume>
          (
          <year>2020</year>
          )
          <fpage>1140</fpage>
          -
          <lpage>1158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          , E. Curry,
          <article-title>Reducing response time for multimedia event processing using domain adaptation</article-title>
          ,
          <source>in: Proceedings of the 2020 International Conference on Multimedia Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>261</fpage>
          -
          <lpage>265</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ritrovato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vento</surname>
          </string-name>
          ,
          <article-title>On the use of semantic technologies for video analysis</article-title>
          ,
          <source>Journal of Ambient Intelligence and Humanized Computing</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Vip-cnn:
          <article-title>Visual phrase guided convolutional neural network</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1347</fpage>
          -
          <lpage>1356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sadeghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Kumar Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , Viske:
          <article-title>Visual knowledge extraction and question answering by visual verification of relation phrases</article-title>
          ,
          <source>in: CVPR</source>
          <year>2015</year>
          , pp.
          <fpage>1456</fpage>
          -
          <lpage>1464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Detecting visual relationships with deep relational networks</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and Pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3076</fpage>
          -
          <lpage>3086</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Explain yourself! leveraging language models for commonsense reasoning</article-title>
          , arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>02361</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>From recognition to cognition: Visual commonsense reasoning</article-title>
          ,
          <source>in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , Hellaswag:
          <article-title>Can a machine really finish your sentence?</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuznetsova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Alldrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uijlings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Krasin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pont-Tuset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kamali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Popov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Malloci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          , et al.,
          <source>The open images dataset v4</source>
          ,
          <source>International Journal of Computer Vision</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Tanon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <article-title>Yago 4: A reasonable knowledge base</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Ronchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Perona</surname>
          </string-name>
          ,
          <article-title>Describing common human visual actions in images</article-title>
          ,
          <source>arXiv preprint arXiv:1506.02203</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>