<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VGStore: A Multimodal Extension to SPARQL for Querying RDF Scene Graph</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yanzeng Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zilong Zheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenjuan Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lei Zou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Beijing Institute for General Artificial Intelligence (BIGAI)</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Wangxuan Institute of Computer Technology (WICT), Peking University</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semantic Web technology has successfully facilitated many RDF models with rich data representation methods. It also has the potential to represent and store multimodal knowledge bases such as multimodal scene graphs. However, most existing query languages, especially SPARQL, barely explore implicit multimodal relationships such as semantic similarity and spatial relations. We first explored this issue by organizing a large-scale scene graph dataset, namely Visual Genome, in an RDF graph database. Based on the proposed RDF-stored multimodal scene graph, we extended SPARQL queries to answer questions that require relational reasoning about color, spatial relations, etc. A further demo (i.e., VGStore) shows the effectiveness of customized queries and of displaying multimodal data.</p>
      </abstract>
      <kwd-group>
        <kwd>SPARQL</kwd>
        <kwd>RDF</kwd>
        <kwd>Multimodal</kwd>
        <kwd>KBQA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Figure 1: (a) the designed ontology of RDF-VG, linking Image, Region, Object, and Synset through the relations hasRegion, hasObject, inSynset, hasAttr, and hasName; (b) properties of the defined classes, where x, y, w, and h are typed rdfs:int and name/description are rdfs:Literal; (c) the custom multimodal predicates primaryColor (?mm.color), semanticSimilarity (?mm.sim), relativePosition (?mm.pos), and count (?mm.count) computed over visual blocks.]</p>
      <p>
Semantic parsing (SP)-based KBQA methods have rarely been applied to multimodal datasets. Although RDF has sufficient representation capability to describe multimodal data, the lack of multimodal semantic relationships in standard SPARQL has become a major challenge in applying SP-based KBQA methods to multimodal domains. Researchers have attempted to extend SPARQL for this purpose. For example, SPARQL-MM [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
proposed custom aggregation functions to access media fragments. However, previous works still suffer from limited extensibility and vague semantic representation, resulting in few practical applications. Specifically, the custom aggregation functions introduced in SPARQL-MM are easy for humans to understand but hard for machines to understand and express, because they add significant extra complexity to the query statement: if multimodal query statements in SPARQL-MM are used as query conditions, they inevitably lead to union or nested queries, whereas SP-based methods will only support simple queries in the foreseeable future.
        </p>
      <p>In this demo, we designed an ontology to organize the multimodal scene graph and store it in RDF. Furthermore, we implemented semantic multimodal SPARQL queries by extending the SPARQL engine, enabling it to answer questions involving multimodal information, such as visual and spatial reasoning.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Storing Visual Genome with RDF</title>
      <p>
        Visual Genome (VG) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a large-scale dataset of fine-grained scene graphs, with rich annotations of images, regions, and objects, as well as their relations (the dataset can be explored at http://visualgenome.org/VGViz/explore). A synset from WordNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is introduced to link different scene graphs via the lexical relations between the literals of the object relations. In addition, VG provides 1,445,332 relevant questions for 108,077 images, which are difficult to answer with traditional SP-based KBQA methods because the SPARQL engine does not support the arithmetic operations needed to answer these multimodal questions.
      </p>
      <p>
        For querying convenience, we formalize the elements of VG in RDF. Fig. 1(a) shows the designed ontology of RDF-stored VG (RDF-VG); due to space limitations, the details of the data processing and ontology organization are provided in the code repository. Fig. 1(b) demonstrates the properties of the defined classes in RDF-VG. The properties (x, y, w, h) determine the visual block of a region or object by cropping the image. We store RDF-VG in gStore [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is a graph-oriented RDF data management system supporting complex SPARQL queries on graph data.
      </p>
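      <p>As a concrete illustration of this schema, the following SPARQL 1.1 INSERT DATA sketch shows how an image, one of its regions, and an object could be encoded in RDF-VG. The vg: namespace, the entity IRIs, and the literal values are hypothetical placeholders rather than the identifiers used in the released RDF-VG data, and xsd:int is used where Fig. 1(b) writes rdfs:int.</p>
      <preformat>
PREFIX vg:  &lt;http://example.org/vg#&gt;                 # hypothetical namespace
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;

INSERT DATA {
  # an image and one of its annotated regions
  vg:image_1   a vg:Image ;
               vg:hasRegion vg:region_7 .

  # the region's visual block (x, y, w, h) and textual description
  vg:region_7  a vg:Region ;
               vg:x "32"^^xsd:int ;  vg:y "64"^^xsd:int ;
               vg:w "128"^^xsd:int ; vg:h "96"^^xsd:int ;
               vg:description "a cat sitting on a chair" ;
               vg:hasObject vg:object_42 .

  # an object with its own visual block, name, and synset link
  vg:object_42 a vg:Object ;
               vg:hasName "cat" ;
               vg:x "40"^^xsd:int ; vg:y "70"^^xsd:int ;
               vg:w "60"^^xsd:int ; vg:h "80"^^xsd:int ;
               vg:inSynset vg:synset_cat_n_01 .
}
      </preformat>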
    </sec>
    <sec id="sec-3">
      <title>3. Querying Multimodal Information via SPARQL</title>
      <p>
        Traditional SPARQL engines (such as our backbone, gStore) cannot perform queries involving multimodality and thus cannot directly answer such questions (e.g., what color is this cat?). Therefore, we developed the VGStore extension, based on the standard SPARQL grammar and pyparsing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], to parse custom predicates (as Fig. 1(c) shows) for arbitrary query patterns, enabling traditional graph databases to perform multimodal queries by passing the extra computing requirements through to third-party tools (e.g., OpenCV, Torch). The architecture of VGStore is shown in Fig. 2 (Left).
      </p>
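      <p>As an illustrative sketch (not necessarily the exact syntax accepted by VGStore), a multimodal question such as "What color is this cat?" could be phrased with the primaryColor predicate and the customized ?mm.color variable from Fig. 1(c); the vg: namespace is again a hypothetical placeholder. Variables of the form ?mm.* are deliberately non-standard and are only recognized by the VGStore parser, not by a plain SPARQL engine.</p>
      <preformat>
PREFIX vg: &lt;http://example.org/vg#&gt;        # hypothetical namespace

SELECT ?obj ?mm.color WHERE {
  ?img    a vg:Image ;
          vg:hasRegion ?region .
  ?region vg:hasObject ?obj .
  ?obj    vg:hasName "cat" .
  # non-standard clause: VGStore dispatches the object's visual block
  # to a vision tool and binds the dominant color to ?mm.color
  ?obj    vg:primaryColor ?mm.color .
}
      </preformat>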
      <p>VGStore analyzes, matches, and replaces the clauses in the original SPARQL query that contain custom predicates. It recursively replaces all non-standard query patterns with standard SPARQL syntax, temporarily recording each replacement on an operation stack. The resulting standard SPARQL query is then handed over to the backbone graph database for execution.</p>
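      <p>To make this rewriting step concrete, the sketch below shows one plausible transformation (our reading of the mechanism described above, not necessarily the literal output of VGStore) for the primaryColor clause of the previous example.</p>
      <preformat>
PREFIX vg: &lt;http://example.org/vg#&gt;        # hypothetical namespace

# Non-standard clause recorded on the operation stack:
#   ?obj vg:primaryColor ?mm.color .
#
# Rewritten into standard SPARQL that only retrieves the visual block,
# which the backbone gStore engine can execute directly:
SELECT ?obj ?x ?y ?w ?h WHERE {
  ?obj a vg:Object ;
       vg:hasName "cat" ;
       vg:x ?x ; vg:y ?y ; vg:w ?w ; vg:h ?h .
}
# After execution the staged operation is popped from the stack: the
# (x, y, w, h) block is cropped from the image, handed to a third-party
# vision tool, and the tool's output is bound back to ?mm.color.
      </preformat>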
      <p>[Figure 2: Left, the architecture of VGStore; Right, the execution workflow of a multimodal query, with steps such as sorting results via customized variables (e.g., ORDER BY DESC(?mm.count)), returning the query response, and rendering the result gallery.]</p>
      <p>After obtaining the result, the inverse operations are performed successively according to the staged replacement operations on the stack, so that the intent of the original SPARQL query is finally restored. Fig. 2 (Right) demonstrates how a multimodal query containing a non-standard custom predicate variable is executed. Table 1 illustrates part of the query patterns supported by VGStore, covering questions involving color, counting, and relative position in the VG question-answer dataset, which account for 15.0%, 11.4%, and 7.0% of all questions, respectively. Other simple questions (e.g., “What is this?”) can be expressed and queried in native SPARQL directly, while the remaining non-factual questions (e.g., “What is this man’s motivation?”) and inference questions (e.g., “When was the picture taken?”) are out of scope for this demonstration.</p>
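      <p>For instance, a counting question such as "How many cats are in this image?" could be expressed with the counting pattern and the customized sort variable mentioned above; the vg: namespace and the exact spelling of the counting predicate are illustrative assumptions.</p>
      <preformat>
PREFIX vg: &lt;http://example.org/vg#&gt;        # hypothetical namespace

SELECT ?img ?mm.count WHERE {
  ?img    a vg:Image ;
          vg:hasRegion ?region .
  ?region vg:hasObject ?obj .
  ?obj    vg:hasName "cat" ;
          # custom counting predicate from Fig. 1(c): VGStore counts the
          # matching visual blocks per image and binds the total
          vg:count ?mm.count .
}
ORDER BY DESC(?mm.count)
      </preformat>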
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and Next Step</title>
      <p>Although VGStore successfully supports multimodal queries by extending virtual predicates, it still has some limitations. VGStore is written in Python, which adds runtime latency; this performance loss could be reduced by supporting the extensions natively in the SPARQL engine. In addition, when VGStore queries large-scale data with multiple extended query statements, severe performance problems arise. This drives us to schedule and parallelize the third-party tool calls.</p>
      <p>VGStore currently supports only a few basic query patterns (as listed in Table 1) specialized for the RDF-VG dataset; it has not been adapted to other VQA datasets, nor does it support richer query patterns. Therefore, our next steps include supporting more query patterns and extending the applicability of VGStore to more graph databases with large-scale multimodal graphs.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Demonstration</title>
      <p>This paper presented VGStore, an extension to SPARQL for querying multimodal information on RDF. The demo showcased the web user interface of VGStore for issuing multimodal SPARQL queries on RDF-VG, as shown in Fig. 3. With this demo, we provided a potential solution to the main challenge of SP-based multimodal KBQA and laid a foundation for multimodal knowledge graphs. Our code for the RDF-VG builder, the preliminary parser, and the demonstration frontend is available at https://github.com/pkumod/VGStore.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by NSFC under grants 61932001 and U20A20174. The corresponding author of this paper is Lei Zou (zoulei@pku.edu.cn). We sincerely thank the reviewers for their valuable comments and advice.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>Crake: Causal-enhanced table-filler for question answering over large scale knowledge base</article-title>
          , in:
          <source>Findings of the Association for Computational Linguistics: NAACL 2022</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1787</fpage>
          -
          <lpage>1798</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kurz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schafert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schlegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Stegmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kosch</surname>
          </string-name>
          ,
          <article-title>SPARQL-MM - extending SPARQL to media fragments</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2014</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          , et al.,
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>
          ,
          <source>International Journal of Computer Vision 123</source>
          (
          <year>2017</year>
          )
          <fpage>32</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>WordNet: a lexical database for English</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <year>1995</year>
          )
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>Redesign of the gStore system</article-title>
          ,
          <source>Frontiers of Computer Science 12</source>
          (
          <year>2018</year>
          )
          <fpage>623</fpage>
          -
          <lpage>641</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>McGuire</surname>
          </string-name>
          ,
          <article-title>Getting started with pyparsing</article-title>
          ,
          <publisher-name>O'Reilly Media, Inc.</publisher-name>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>