<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Athens, Greece. Corresponding author: yy@dbcls.rois.ac.jp (Y. Yamamoto)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards Semantic Data Management of Visual Computing Datasets: Increasing Usability of MetaVD</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yasunori Yamamoto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shusaku Egami</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuya Yoshikawa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ken Fukuda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chiba Institute of Technology</institution>
          ,
          <addr-line>Chiba</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Advanced Industrial Science and Technology (AIST)</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Research Organization of Information and Systems</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>MetaVD is a meta video dataset that interlinks six existing computer vision datasets, including Charades and Kinetics-700. While MetaVD contributes to enhancing video recognition performance, we found two issues. First, some is-a relationships defined and linked by MetaVD are inconsistent, including one cycle of is-a relationships. Second, all concepts in MetaVD come from the six datasets, and therefore some of them are not well arranged semantically, which may lead to inefficient training of video recognition models. Here, we propose a knowledge graph dataset in Resource Description Framework (RDF), built by RDFizing MetaVD itself and linking MetaVD concepts to those in the CommonSense Knowledge Graph (CSKG). This linking makes it easier to detect inconsistent concept relationships. Furthermore, it allows us to link MetaVD concepts to conceptually higher ones. We then issued several SPARQL queries against the dataset to evaluate its feasibility. The RDF dataset and SPARQL queries mentioned in this extended abstract are downloadable from https://github.com/aistairc/MetaVD-CSKG.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Video Datasets</kwd>
        <kwd>RDFization</kwd>
        <kwd>Data Refine</kwd>
        <kwd>Data Linking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>
          Many video datasets for human action recognition have been published, such as UCF101 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and Kinetics-700 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Each of them covers a specific domain, and recognition performance is poor when a model trained on one dataset is applied to the domain of another.
        </p>
      </sec>
      <sec id="sec-1-2">
        <p>
          MetaVD [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] tackled this issue by interlinking the concepts of six popular datasets for human action recognition. Its authors defined three relation types (equality, similarity, and hierarchy) and then manually annotated 568,015 relation labels in total on pairs of concepts obtained from the original datasets. After interlinking the datasets, they proposed two methods of utilizing the interlinks and confirmed that video recognition performance increased.
        </p>
      </sec>
      <sec id="sec-1-3">
        <p>
          However, we found two issues in MetaVD. First, there are several semantically inconsistent relationships. For example, MetaVD contains the following relationships: Playing_ice_hockey is-a hit, and hit is-a Volleyball. In addition, three is-a relationships form a cycle. Since the is-a relationship reflects a semantically hierarchical and directional relation between concepts of the original datasets, a concept in such a cycle is claimed to be more general than itself, which is inconsistent. Second, MetaVD interlinks the six datasets, and all of its concepts come from them. Therefore, there are cases where a more general concept could be used to group a set of concepts that share a common trait, which may lead to better recognition performance [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ]. For example, MetaVD contains playing_ice_hockey, playing_lacrosse, and playing_basketball, which can be grouped under the concept playing_game.
        </p>
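        <p>A cycle of is-a relationships like the one mentioned above can be found mechanically once the relations are available as (child, parent) pairs. The following is a minimal Python sketch under that assumption; it is not the procedure used in this work. The function name find_isa_cycle and the concepts concept_a, concept_b, and concept_c are hypothetical placeholders, whereas the first two pairs are the relations quoted in the text.</p>
        <preformat>
```python
def find_isa_cycle(edges):
    """Return one cycle in a list of (child, parent) is-a pairs, or None."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, []).append(parent)

    done = set()  # nodes whose outgoing paths are fully explored

    def visit(node, path):
        if node in path:
            # Reached a node already on the current path: report the loop.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        for parent in graph.get(node, []):
            cycle = visit(parent, path + [node])
            if cycle:
                return cycle
        done.add(node)
        return None

    for start in list(graph):
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


# The first two pairs are quoted in the text; the last three are
# hypothetical placeholders that form a three-relation cycle.
isa_edges = [
    ("Playing_ice_hockey", "hit"),
    ("hit", "Volleyball"),
    ("concept_a", "concept_b"),
    ("concept_b", "concept_c"),
    ("concept_c", "concept_a"),
]
cycle = find_isa_cycle(isa_edges)
print(cycle)
```
        </preformat>
        <p>The sketch reports only the first cycle it meets; the two quoted relations alone contain no cycle, so they are flagged only by an abstraction-based check.</p>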
      </sec>
      <sec id="sec-1-4">
        <p>
          We propose integrating MetaVD with the CommonSense Knowledge Graph (CSKG) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] after RDFizing MetaVD. CSKG is a commonsense knowledge graph that amalgamates seven widely recognized sources, such as ConceptNet and Visual Genome, and we found that almost all MetaVD concepts can be linked with ConceptNet concepts. The conversion of MetaVD into RDF allows semantic validation of the dataset via the SPARQL query language [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Furthermore, SPARQL makes it possible to generate user-specific subsets of MetaVD, which are required to train special-purpose human action recognition models tailored to distinct MetaVD user applications, such as those in the sports field.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <sec id="sec-2-1">
        <p>We use three datasets: MetaVD<sup>1</sup>, ConceptNet<sup>2</sup>, and CSKG<sup>3</sup>. They are downloadable in CSV, SQLite, and TSV formats, respectively.</p>
        <sec id="sec-2-1-1">
          <title>2.1. MetaVD to ConceptNet</title>
          <p>As ConceptNet has a large number of concepts that can be linked to those in MetaVD, we first attempted to align the two. Of the 966 MetaVD concepts, 397 were fully aligned and 560 were partially aligned. "Fully aligned" here means that both concepts exactly match each other after the normalization described below. The seven concepts that could not be aligned include Powerbocking, Shotput, and barbequing. "Partially aligned" here means that only some of the words in a multi-word MetaVD concept, such as Turning on a light, were aligned. In addition, since some concepts were aligned to the same word, such as accordion for playing_accordion [Kinetics-700] and Playing_accordion [ActivityNet], the total number of concepts is 868.</p>
          <p>We used ConceptNet Numberbatch<sup>4</sup> to obtain the corresponding ConceptNet IDs. The version we obtained is 19.08 English. We extracted headwords from it and constructed an index based on them. Before looking up a term, we normalized it as follows:</p>
          <list list-type="bullet">
            <list-item><p>camel case is split into multiple words</p></list-item>
            <list-item><p>all letters are converted to lowercase</p></list-item>
            <list-item><p>dashes are replaced with underscores</p></list-item>
            <list-item><p>a term that expresses multiple senses separated by slashes is expanded into one concept per sense</p></list-item>
          </list>
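          <p>The normalization steps listed above can be sketched in Python as follows. This is an illustrative reading of the four rules, not the code used in this work. It assumes that the words of a camel-case label are joined with underscores, as ConceptNet terms are; the function name normalize and the label jump-rope/SkippingRope are made up for the example, while BandMarching and Playing_accordion are existing dataset labels.</p>
          <preformat>
```python
import re


def normalize(label):
    """Return the normalized lookup keys for one MetaVD concept label."""
    keys = []
    # A slash separates alternative senses; each becomes its own key.
    for sense in label.split("/"):
        # Split camel case: insert "_" between a lowercase letter or digit
        # and the uppercase letter that follows it (assumed joiner).
        words = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", sense.strip())
        # Lowercase everything and turn dashes into underscores.
        keys.append(words.lower().replace("-", "_"))
    return keys


print(normalize("BandMarching"))
print(normalize("Playing_accordion"))
# Made-up label that exercises the dash and slash rules together.
print(normalize("jump-rope/SkippingRope"))
```
          </preformat>
          <p>Splitting on slashes first and applying the other rules to each sense gives the same keys as applying the rules in the listed order.</p>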
        </sec>
        <sec id="sec-2-1-4">
          <p><sup>1</sup> https://github.com/STAIR-Lab-CIT/metavd</p>
        </sec>
        <sec id="sec-2-1-5">
          <p><sup>2</sup> https://github.com/ldtoolkit/conceptnet-lite</p>
        </sec>
        <sec id="sec-2-1-6">
          <p><sup>3</sup> https://zenodo.org/record/4331372</p>
        </sec>
        <sec id="sec-2-1-7">
          <p><sup>4</sup> https://github.com/commonsense/conceptnet-numberbatch</p>
          <p>[Figure 1: Relationships between MetaVD and CSKG. Labels recoverable from the figure: source datasets UCF101, HMDB51, STAIR Actions, ActivityNet, Charades, and Kinetics-700; MetaVD relations equal, similar, and is-a; example concepts knitting, crocheting, hit, playing_ice_hockey, skateboarding, and playing_accordion; CSKG sources ConceptNet, Roget, Visual Genome, WordNet, ATOMIC, Wikidata, and FrameNet.]</p>
        </sec>
        <sec id="sec-2-1-8">
          <title>2.2. MetaVD to CSKG</title>
          <p>Next, for each concept, we obtained its ConceptNet ID, such as /c/en/accordion, and retrieved the corresponding CSKG edges and nodes. These CSKG edges link the ConceptNet IDs to CSKG nodes. Note that the retrieved CSKG nodes are within one hop of the ConceptNet IDs. Fig. 1 shows the relationships between CSKG and MetaVD in our work. We linked MetaVD nodes to CSKG nodes. Taking accordion as an example, the following statements can be obtained.</p>
          <list list-type="order">
            <list-item><p>Accordion is an instrument.</p></list-item>
            <list-item><p>Accordion is a man-made object.</p></list-item>
            <list-item><p>Accordion is a musical instrument.</p></list-item>
          </list>
          <p>
            We used the Knowledge Graph Toolkit (KGTK) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] to retrieve the designated data from CSKG. More specifically, for each ConceptNet concept linked from MetaVD, we issued a query to retrieve nodes whose IDs begin with that concept. For example, we issued a query that retrieves nodes whose IDs match the regular expression /c/en/accordion(/.+)?, along with their edges and the nodes linked by these edges. The resulting node IDs were the following.
          </p>
          <list list-type="bullet">
            <list-item><p>/c/en/accordion</p></list-item>
            <list-item><p>/c/en/accordion/n</p></list-item>
            <list-item><p>/c/en/accordion/n/wn/artifact</p></list-item>
            <list-item><p>/c/en/accordion/v</p></list-item>
          </list>
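          <p>The effect of the regular expression above can be checked with a few lines of plain Python. This sketch only illustrates the ID matching; it is not a KGTK command and does not retrieve edges. The helper name cskg_id_pattern and the two non-matching IDs at the end of the list are assumptions made for the example.</p>
          <preformat>
```python
import re


def cskg_id_pattern(concept_id):
    """Compile a pattern matching concept_id and its sense-specific IDs."""
    return re.compile(re.escape(concept_id) + r"(/.+)?")


# The first four IDs are the ones listed in the text; the last two are
# made-up near misses that must not match.
node_ids = [
    "/c/en/accordion",
    "/c/en/accordion/n",
    "/c/en/accordion/n/wn/artifact",
    "/c/en/accordion/v",
    "/c/en/accordionist",
    "/c/en/piano_accordion",
]
pattern = cskg_id_pattern("/c/en/accordion")
matched = [node_id for node_id in node_ids if pattern.fullmatch(node_id)]
print(matched)
```
          </preformat>
          <p>Matching the whole ID rather than a prefix is what keeps a longer headword that merely starts with the same letters from being picked up.</p>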
        </sec>
        <sec id="sec-2-1-14">
          <p>As a result, we obtained 831 CSKG nodes linked to MetaVD. In addition, 47,293 nodes and 90,506 edges linked from these nodes were retrieved.</p>
        </sec>
        <sec id="sec-2-1-15">
          <title>2.3. Degree of abstraction</title>
          <p>Although it is not trivial to measure the granularity of each MetaVD concept, we assume that the ConceptNet graph structure can provide supporting evidence. A ConceptNet node often has IsA relationships, such as accordion IsA instrument, and semantically higher concepts have more child nodes linked by the IsA relationship. For example, accordion has three children, whereas instrument has 68.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.4. RDFization</title>
        <p>We used TogoDB<sup>5</sup> to build the MetaVD RDF dataset together with the subset of CSKG linked to MetaVD. TogoDB accepts table data in CSV or TSV format and provides a GUI-based environment in which we can edit the configuration that defines how RDF data are generated from a given table. We took a general approach to RDFizing a table: each row becomes a set of triples whose subject is derived from the cell value of the ID column, and the other column names and their cell values denote its properties. The RDF dataset for the MetaVD-CSKG data was built in the same way. The numbers of triples for MetaVD and MetaVD-CSKG were 25,218 and 1,190,904, respectively. We also built an ancillary RDF dataset that links MetaVD data IDs to ConceptNet IDs, including those for partial words of MetaVD concepts. We loaded the RDF datasets into Fuseki<sup>6</sup> version 4.7.0.</p>
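        <p>The row-to-triples approach described above can be sketched as follows. This is a simplified illustration and does not reproduce TogoDB or its configuration. The namespace http://example.org/metavd/, the column names, and the two table rows are hypothetical; only the two is-a relations in the rows are taken from the text.</p>
        <preformat>
```python
import csv
import io

# Hypothetical namespace, used only for illustration.
BASE = "http://example.org/metavd/"


def row_to_triples(row, id_column="id"):
    """Turn one table row (a dict) into (subject, predicate, object) triples."""
    # The ID cell gives the subject; every other column is a property.
    subject = BASE + row[id_column]
    return [
        (subject, BASE + column, value)
        for column, value in row.items()
        if column != id_column and value != ""
    ]


# A made-up two-row table in CSV form; these are not actual MetaVD records.
table = io.StringIO(
    "id,from_action,to_action,relation\n"
    "r1,Playing_ice_hockey,hit,is-a\n"
    "r2,hit,Volleyball,is-a\n"
)
triples = [t for row in csv.DictReader(table) for t in row_to_triples(row)]
for triple in triples:
    print(triple)
```
        </preformat>
        <p>Each row with three non-empty property columns yields three triples, so the triple count grows with both the number of rows and the number of columns. Skipping empty cells is a choice made for this sketch.</p>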
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Use Cases</title>
      <sec id="sec-3-1">
        <p>We constructed several SPARQL queries to verify that the RDF dataset is useful for the two purposes mentioned above.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.1. Inconsistency Checking</title>
        <p>The current MetaVD appears to contain some semantically inconsistent is-a relationships. Here, "inconsistent" is meant in terms of the degree of abstraction obtained by counting the children of a concept in ConceptNet. For example, Volleyball has one child and hit has 13, yet hit is-a Volleyball in MetaVD; that is, the child concept appears to be more general than its parent.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.2. MetaVD subsetting for customization</title>
        <p>Another use case is to make a MetaVD subset for training a human action recognition model tailored to a specific purpose. As a feasibility study, we issued a SPARQL query to retrieve all MetaVD concepts related to intelligent agent activity, which returned 200 results. The results include Playing_ice_hockey from ActivityNet, roller_skating from Kinetics-700, and playing_guitar from STAIR Actions.</p>
      </sec>
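      <p>The logic behind the two use cases in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the SPARQL queries issued in this work. The function names flag_inconsistent and concepts_under and all pairs in toy_edges are hypothetical; the child counts for Volleyball and hit and the two MetaVD is-a pairs are the ones quoted in the text.</p>
      <preformat>
```python
def flag_inconsistent(isa_edges, child_counts):
    """Return is-a pairs whose child looks more general than its parent.

    child_counts maps a lowercase concept to its number of IsA children
    in ConceptNet; pairs with an unknown count are skipped.
    """
    flagged = []
    for child, parent in isa_edges:
        c = child_counts.get(child.lower())
        p = child_counts.get(parent.lower())
        if c is not None and p is not None and c > p:
            flagged.append((child, parent))
    return flagged


def concepts_under(target, isa_edges):
    """Return every concept that reaches target by following is-a links."""
    parents = {}
    for child, parent in isa_edges:
        parents.setdefault(child, set()).add(parent)

    def reaches(node, seen):
        if node == target:
            return True
        if node in seen:
            return False
        seen.add(node)
        return any(reaches(p, seen) for p in parents.get(node, ()))

    return sorted(c for c in parents if c != target and reaches(c, set()))


# Counts and pairs quoted in the text.
counts = {"volleyball": 1, "hit": 13}
metavd_isa = [("Playing_ice_hockey", "hit"), ("hit", "Volleyball")]
flagged = flag_inconsistent(metavd_isa, counts)
print(flagged)

# Hypothetical is-a pairs, invented only to exercise the subsetting step.
toy_edges = [
    ("playing_guitar", "playing"),
    ("playing", "intelligent_agent_activity"),
    ("roller_skating", "intelligent_agent_activity"),
    ("accordion", "instrument"),
]
subset = concepts_under("intelligent_agent_activity", toy_edges)
print(subset)
```
      </preformat>
      <p>A pair is flagged only when both counts are known, so Playing_ice_hockey is-a hit is left undecided here because no count for Playing_ice_hockey is given in the text.</p>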
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <sec id="sec-4-1">
        <p>We built an RDF dataset consisting of MetaVD and the subset of CSKG linked to MetaVD. In addition, we defined a degree of abstraction based on the ConceptNet graph structure. Using these data, we propose a way of showing potentially semantically inconsistent relationships in MetaVD. We also propose a method of making a MetaVD subset in terms of a given abstract concept, such as intelligent agent activity. This method enables us to train a video recognition model for a specific purpose.</p>
        <p>On the other hand, we still need to evaluate the results. There are 1010 relationships defined in MetaVD, and we are considering whether we can check all of them manually. Future work includes utilizing the CSKG graph structure to take into account datasets other than ConceptNet. In addition, we will consider a method of suggesting relations in MetaVD based on the CSKG relationships.</p>
      </sec>
      <sec id="sec-4-3">
        <p><sup>5</sup> http://togodb.org/</p>
      </sec>
      <sec id="sec-4-4">
        <p><sup>6</sup> https://jena.apache.org/</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <p>This paper is based on results obtained from a project, JPNP20006, commissioned by the New Energy and Industrial Technology Development Organization (NEDO), and JSPS KAKENHI Grant Number JP22K18008.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Soomro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>UCF101: A dataset of 101 human actions classes from videos in the wild</article-title>
          ,
          <source>arXiv preprint arXiv:1212.0402</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Quo vadis, action recognition? a new model and the kinetics dataset</article-title>
          ,
          <source>in: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6299</fpage>
          -
          <lpage>6308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoshikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shigeto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Takeuchi</surname>
          </string-name>
          ,
          <article-title>MetaVD: A meta video dataset for enhancing human action recognition datasets</article-title>
          ,
          <source>Computer Vision and Image Understanding</source>
          <volume>212</volume>
          (
          <year>2021</year>
          )
          <elocation-id>103276</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.cviu.2021.103276</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Makarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.-E.</given-names>
            <surname>Ganea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pavllo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Greeff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <article-title>Hierarchical image classification using entailment cone embeddings</article-title>
          ,
          <source>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
          (
          <year>2020</year>
          )
          <fpage>3649</fpage>
          -
          <lpage>3658</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yamazaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ohara</surname>
          </string-name>
          ,
          <article-title>Hierarchical image classification with conceptual hierarchies generated via lexical databases</article-title>
          ,
          <source>in: Proc. of the 29th International Workshop on Frontiers of Computer Vision</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ilievski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szekely</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>CSKG: The CommonSense Knowledge Graph</article-title>
          ,
          <source>Extended Semantic Web Conference (ESWC)</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Prud'hommeaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seaborne</surname>
          </string-name>
          ,
          <article-title>SPARQL Query Language for RDF</article-title>
          ,
          <source>W3C Recommendation</source>
          ,
          <year>2008</year>
          . URL: http://www.w3.org/TR/rdf-sparql-query/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ilievski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chalupsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Divvala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szekely</surname>
          </string-name>
          ,
          <article-title>KGTK: A toolkit for large knowledge graph manipulation and analysis</article-title>
          , in:
          <source>International Semantic Web Conference</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>278</fpage>
          -
          <lpage>293</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>