<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Hyperknowledge Approach to Support Dataset Engineering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcio Moreno</string-name>
          <email>mmoreno@br.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Polyana Bezerra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodrigo Costa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vítor Nascimento</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elton Soares</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcelo Machado</string-name>
          <email>marcelo.machadog@ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research, Brazil</institution>
          ,
          <addr-line>Av Pasteur 146 Rio de Janeiro - RJ</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The use of machine learning has become a common approach for solving complex problems across multiple application domains. As its usage often requires training and validation of models with large and heterogeneous datasets, the engineering of these datasets becomes a critical task, although in many cases it does not follow any well-defined process. In this demonstration paper, we present a novel approach to dataset engineering, which comprises the construction, structuring, understanding, and reuse of datasets from a semantic perspective. Our approach uses a hybrid conceptual model called Hyperknowledge, which can semantically describe both symbolic and non-symbolic nodes, including representing the datasets' structure and enabling dataset retrieval/creation queries.</p>
      </abstract>
      <kwd-group>
        <kwd>Hyperknowledge</kwd>
        <kwd>Hybrid Knowledge Representation</kwd>
        <kwd>Hyperlinked Knowledge Graph</kwd>
        <kwd>HyQL</kwd>
        <kwd>Multimodal data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As the popularity of Machine Learning (ML) tasks increases, so does the amount
of data used to train and test models. The effectiveness of such algorithms is related
to the quality and variety of the data applied during the training stage [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The
continuous growth in size and heterogeneity of datasets used in ML tasks makes
what we call Dataset Engineering (DE) a key step for effective data exploitation,
leading to more effective models. Here we define DE as the process of handling
data through structuring and traceability. The process of DE can be roughly
divided into three main tasks: (i) representation; (ii) retrieval; and (iii) creation.
The motivation behind these tasks is to prepare data for further use in an ML
task. A few works have approached some of the aforementioned tasks [
        <xref ref-type="bibr" rid="ref1 ref2 ref4">1, 2, 4</xref>
        ].
However, to the best of our knowledge, none of them deals with the lifecycle
of data, and few of them tackle the structuring of datasets in the ML context.
Hence, we propose a knowledge-oriented approach for DE in the ML domain
and describe how this approach can support DE tasks in this domain, as well as
its advantages. For the sake of this discussion, we rely on Hyperknowledge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
(HKW) for data structuring and HyQL (Hyperknowledge Query Language) for
both the retrieval and creation of datasets.
      </p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        HKW is a conceptual model that can represent, in the same description
framework, high-level semantic concepts and unstructured data. By semantic
concepts, we mean high-level descriptions of linked data, such as facts
about subjects, or specialized knowledge from a given domain, formally described
by ontologies [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. By unstructured data, we mean raw data, such as multimedia
content (image, audio, text, video) or executable content (ML models, programs,
etc.). Traditionally, those two modalities of information are represented
separately, and engineers create ad hoc links to combine them in a system when
needed. HKW fills this representational gap by providing a higher-level
framework while also promoting reasoning over cross-modal types of information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Dataset Engineering with Hyperknowledge</title>
      <p>To support the understanding of the examples illustrated in the remainder of
this paper, we present a simplified formalization of the HyQL grammar<sup>2</sup> and its
processing engine. First, it is important to understand the basic behavior of
HyQL when selecting HKW entities using the SELECT clause. A basic selection
returns the concept passed in the clause and any other concept that is linked
to this concept by a link that contains a connector of type hierarchical<sup>3</sup>.
Figure 1(a) illustrates the instanceOf relationship (i.e., the connector is declared
as hierarchical) between Dog and Animal. Figure 1(b) depicts a query that selects
animals; its result contains both the Animal node and the Dog node.</p>
      <p>Fig. 1. Representation of the relationship between concepts Animal and Dog (a) and
a query to select all animals (including Animal) from the base (b).</p>
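      <p>The selection behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HyQL or its processing engine: the triple-based store, the function name select, the Cat and Bone nodes, the likes connector, and the choice to follow hierarchical links transitively are all hypothetical.</p>
      <preformat>
```python
from collections import defaultdict

# Hypothetical in-memory stand-in for an HKW base: each link is a
# (source, connector, target) triple, and each connector has a type.
# Only Dog, Animal and instanceOf come from the paper's Figure 1.
CONNECTOR_TYPES = {"instanceOf": "hierarchical", "likes": "other"}
LINKS = [
    ("Dog", "instanceOf", "Animal"),
    ("Cat", "instanceOf", "Animal"),
    ("Dog", "likes", "Bone"),
]


def select(concept, links=LINKS, connector_types=CONNECTOR_TYPES):
    """Return the concept plus every node reachable through hierarchical links."""
    # Index hierarchical links by their target so that the more specific
    # concepts (the sources) can be looked up from the more general one.
    children = defaultdict(list)
    for source, connector, target in links:
        if connector_types.get(connector) == "hierarchical":
            children[target].append(source)

    # Assumption: hierarchical links are followed transitively.
    selected, frontier = {concept}, [concept]
    while frontier:
        current = frontier.pop()
        for child in children[current]:
            if child not in selected:
                selected.add(child)
                frontier.append(child)
    return selected


print(sorted(select("Animal")))
```
      </preformat>
      <p>In this sketch, selecting Animal yields Animal together with Dog and Cat, whereas Bone is left out because the likes connector is not hierarchical.</p>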
      <p>Dataset Representation: The dataset representation task consists of
structuring data and its metadata in a common knowledge framework (e.g., an ontology).
It enhances dataset querying and reuse capabilities, which
are important properties for supporting the other DE tasks. As an illustrative
example, consider the dataset ontology depicted in Figure 2. It shows the concept
dataset modelled together with its data and metadata concepts, i.e., datatype and class
(red arrows), and possible instances of each represented concept (blue arrows).</p>
      <p><sup>2</sup> Accessible at https://ibm.box.com/v/iswc2021-hyql-grammar. For clarity, we have
suppressed the handling of spaces, comments, and lower case keywords.</p>
      <p><sup>3</sup> HKW connectors have types. The hierarchical type can be used to represent
taxonomical relationships, such as instantiations and specializations of concepts.</p>
      <p>HKW can be used for describing this ontology. Nodes can represent the
ontology's classes and instances (rounded rectangles in the figure) and links can
represent the relationships among these entities (arrows in the figure). The
description of each dataset can be organized in contexts, promoting a better
organization of the data. Likewise, Figure 3 illustrates an example of how HKW
can be used for structuring the information of datasets. In this example, the
Cleansed Dataset A is described as a context having three other nested contexts:
Training, Validation, and Testing. In each of these contexts, there are
descriptions of their content using nodes and links. For instance, the example describes that the
training dataset contains an image (Image1) that has a cat and a dog.</p>
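      <p>As a rough illustration of this nesting, the following Python sketch models a context as a named container of links and nested contexts. It is a hypothetical stand-in rather than the HKW data model or any KES interface: the Context class, its walk method, and the contains connector name are assumptions made only for this example.</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical container: a named context holding links and nested contexts."""
    name: str
    links: list = field(default_factory=list)     # (source, connector, target)
    children: list = field(default_factory=list)  # nested Context objects

    def walk(self, path=()):
        """Yield (path, link) pairs for this context and all nested contexts."""
        here = path + (self.name,)
        for link in self.links:
            yield here, link
        for child in self.children:
            yield from child.walk(here)


# The training context states that Image1 has a cat and a dog.
training = Context("Training", links=[
    ("Image1", "contains", "Cat"),
    ("Image1", "contains", "Dog"),
])
dataset_a = Context("Cleansed Dataset A", children=[
    training, Context("Validation"), Context("Testing"),
])

for path, (source, connector, target) in dataset_a.walk():
    print(" / ".join(path), ":", source, connector, target)
```
      </preformat>
      <p>Keeping the path of enclosing contexts next to each link is one simple way to preserve where a statement was made, which is what allows queries to be scoped to a given dataset split.</p>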
      <p>Dataset Retrieval: Suppose one is given the task of classifying wild animals
in images. To perform such a task, one could query the knowledge base (KB) to retrieve all
datasets that contain images using lines 1 to 2 of the query depicted in Figure 4.
However, such a query is too broad, as it returns datasets having any type of image.</p>
      <p>An HKW entity can have anchors that represent part of its content. For instance, a
node representing an image may have several anchors, and one can define relationships
using these anchors, that is, assign a meaning to fragments of the image. If we
assume that our KB has images of animals and that this information is expressed in
terms of HKW relationships, one could define an additional constraint for this
query, which is depicted in line 3 of Figure 4. If we have a more detailed ontology,
one can be even more specific in the construction of the query.</p>
      <p>Fig. 4. Query to return all contexts defined as dataset that contain images with animals. The FROM
clause reduces the search scope to entities contained in HKW contexts that are instances of Dataset.</p>
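      <p>The difference between the broad query and the constrained one can be sketched in Python as two filters over a toy knowledge base. This is not the HyQL query of Figure 4: the dictionary layout, the context names, the region anchors, and both function names are hypothetical, and the set of animal concepts is assumed to come from a hierarchical selection.</p>
      <preformat>
```python
# Hypothetical knowledge base: contexts, their types, and the images they hold.
# Each image maps anchor names (image fragments) to the concept they depict.
CONTEXTS = {
    "PetsDataset": {"type": "Dataset", "images": {
        "Image1": {"region1": "Dog", "region2": "Cat"},
    }},
    "StreetDataset": {"type": "Dataset", "images": {
        "Image2": {"region1": "Car"},
    }},
    "Scratchpad": {"type": "Notes", "images": {
        "Image3": {"region1": "Dog"},
    }},
}
ANIMALS = {"Animal", "Dog", "Cat"}  # assumed result of a hierarchical selection


def datasets_with_images(contexts):
    """Broad query: every context typed as Dataset that holds at least one image."""
    return [name for name, ctx in contexts.items()
            if ctx["type"] == "Dataset" and ctx["images"]]


def datasets_with_animals(contexts, animals=ANIMALS):
    """Narrower query: additionally require an anchor that depicts an animal."""
    return [name for name in datasets_with_images(contexts)
            if any(concept in animals
                   for anchors in contexts[name]["images"].values()
                   for concept in anchors.values())]


print(datasets_with_images(CONTEXTS))
print(datasets_with_animals(CONTEXTS))
```
      </preformat>
      <p>The first filter plays the role of restricting the scope to contexts that are instances of Dataset, and the second adds the anchor-level constraint, so Scratchpad is excluded by type and StreetDataset by content.</p>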
      <p>Dataset Creation: Dataset creation is one of the DE tasks
that has been drawing the most attention from the research community, mostly
because of its importance and difficulty. The creation of new datasets with data
from a variety of sources is difficult and can lead to errors. We argue that a
full-fledged knowledge-oriented approach for DE should support the creation of
datasets through queries, which can avoid representation errors by applying a
common ontology to describe different instances of datasets. For example, in the
query depicted in Figure 5(a), every image from CopiedDataset is inserted into
NewDataset. A practical use of this query is when one wants to reuse images from
a preexisting dataset in a new one. Another example is depicted in Figure 5(b),
where a new dataset is created from a selection of data with specific features.</p>
      <p>Fig. 5. Query that copies the content of CopiedDataset to a new context (a) and
query that copies the images of dogs from DatasetOfImages (b).</p>
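      <p>The two creation scenarios can be mimicked with a small Python sketch. It does not reproduce the HyQL queries of Figure 5: the dictionary-based store, the create_dataset helper, its keep parameter, the image names, and the DogDataset context name are hypothetical choices made for illustration.</p>
      <preformat>
```python
import copy

# Hypothetical store: context name -> {image name -> set of depicted concepts}.
store = {
    "CopiedDataset": {"Image1": {"Dog", "Cat"}, "Image2": {"Car"}},
    "DatasetOfImages": {"Image3": {"Dog"}, "Image4": {"Bird"}},
}


def create_dataset(store, new_name, source_name, keep=lambda concepts: True):
    """Create a new context holding copies of the source images that pass `keep`."""
    if new_name in store:
        raise ValueError(f"context {new_name!r} already exists")
    store[new_name] = {
        image: copy.deepcopy(concepts)
        for image, concepts in store[source_name].items()
        if keep(concepts)
    }
    return store[new_name]


# (a) copy every image of CopiedDataset into NewDataset
create_dataset(store, "NewDataset", "CopiedDataset")
# (b) copy only the images that depict a dog (DogDataset is a hypothetical name)
create_dataset(store, "DogDataset", "DatasetOfImages", keep=lambda c: "Dog" in c)

print(sorted(store["NewDataset"]))
print(sorted(store["DogDataset"]))
```
      </preformat>
      <p>Because both datasets are produced by the same helper over the same representation, the new contexts follow the structure of their sources, which is the kind of consistency that query-based creation over a common ontology is meant to provide.</p>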
    </sec>
    <sec id="sec-3">
      <title>Demonstration</title>
      <p>
        The main goal of this demo is to show HKW features that support dataset
engineering tasks through KES and HyQL. To this end, we use the Pascal
VOC2012 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the Playing for Benchmarks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] image datasets to show how
HKW can support dataset representation, retrieval, and creation tasks. The
images will be exploited to demonstrate HKW's capability of integrating symbolic
knowledge with non-symbolic content. The ontologies illustrate
the support for semantic queries for dataset retrieval. Finally, queries performed
using KES will demonstrate HKW's ability to answer dataset creation queries.
      </p>
      <p>Demo video 1: Dataset representation and integration with HKW ontology.
https://ibm.box.com/v/iswc2021-dataseteng-video1</p>
      <p>Demo video 2: Semantically enriched dataset retrieval and media visualization.
https://ibm.box.com/v/iswc2021-dataseteng-video2</p>
      <p>Demo video 3: Dataset creation using HyQL dataset engineering functions.
https://ibm.box.com/v/iswc2021-dataseteng-video3</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper, we presented a knowledge-based approach for DE in the ML
domain. The main contributions of this work are the proposal and demonstration
of this approach, which comprises core DE tasks such as dataset representation,
retrieval, and creation. Data scientists could use the approach described here to find
new datasets, balance, clean, and resample them, and select specific features,
which saves time and promotes better data exploitation.</p>
      <p>The main drawback of our approach is the need to annotate datasets to
achieve richer descriptions. The (semi-)automatic annotation of datasets is therefore
left as future work, which would save more of data scientists' time.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chun</surname>
            ,
            <given-names>D.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jun</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chao</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cong</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Data engineering in information system construction</article-title>
          .
          <source>In: 2012 IEEE Symposium on Robotics and Applications (ISRA)</source>
          . pp.
          <fpage>135</fpage>
          –
          <lpage>137</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rekatsinas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Data integration and machine learning: A natural synergy</article-title>
          .
          <source>In: Proceedings of the 2018 international conference on management of data</source>
          . pp.
          <fpage>1645</fpage>
          –
          <lpage>1650</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Everingham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eslami</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>C.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winn</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The pascal visual object classes challenge: A retrospective</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>111</volume>
          (
          <issue>1</issue>
          ),
          <fpage>98</fpage>
          –
          <lpage>136</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kunze</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Dataset retrieval</article-title>
          .
          <source>In: 2013 IEEE Seventh International Conference on Semantic Computing</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandao</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cerqueira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Extending hypermedia conceptual models to support hyperknowledge specifications</article-title>
          .
          <source>International Journal of Semantic Computing</source>
          <volume>11</volume>
          (
          <issue>01</issue>
          ),
          <fpage>43</fpage>
          –
          <lpage>64</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayder</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koltun</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Playing for benchmarks</article-title>
          .
          <source>In: IEEE International Conference on Computer Vision, ICCV 2017</source>
          . pp.
          <fpage>2232</fpage>
          –
          <lpage>2241</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>