<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Template-Based Approach for Annotating Long-Tail Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Garijo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ke-Thia Yao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amandeep Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Szekely</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Sciences Institute, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>An increasing amount of data is shared on the Web through heterogeneous spreadsheets and CSV files. In order to homogenize and query these data, the scientific community has developed Extract, Transform and Load (ETL) tools and services that help make these files machine readable in Knowledge Graphs (KGs). However, tabular data may be complex, and the level of expertise required by existing ETL tools makes it difficult for users to describe their own data. In this paper we propose a simple annotation schema to guide users when transforming complex tables into KGs. We have implemented our approach by extending T2WML, a table annotation tool designed to help users annotate their data and upload the results to a public KG. We have evaluated our effort with six non-expert users, obtaining promising preliminary results.</p>
      </abstract>
      <kwd-group>
        <kwd>Dataset annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>An increasing amount of data is shared on the Web by multiple organizations
using Excel and CSV formats. Content creators usually prefer tabular
data because it is simple for humans to generate, manipulate and visualize, and
there is a significant number of tools to help explore and edit the contents of
spreadsheets. These data need to be properly understood by others, and hence
documentation (e.g., variables captured, provenance, usage notes, etc.) is usually
included in auxiliary files or in the spreadsheets themselves. As a result, many of
these spreadsheets have comments, clarifications, notes and references to other
files explaining how to interpret the information contained in them.</p>
      <p>
        In order to convert tabular data to a machine readable format, the Semantic
Web community has created Extract, Transform and Load (ETL) tools (e.g.,
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) and mapping languages (e.g., [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ]) that help translate spreadsheets into
Knowledge Graphs (KGs). However, these tools and languages require significant
expertise when transforming heterogeneous tabular data with comments, incomplete
values or interrelated columns, making it difficult for
domain experts to integrate their own datasets with existing KGs.
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        In this paper we describe an approach to help non-experts transform their
data into a structured representation through dataset annotations. Our
contributions include 1) a dataset annotation schema that helps generate templates
for translating datasets into KGs; 2) an extension of the T2WML dataset
annotation tool [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to accommodate the proposed schema; and 3) an approach to
upload annotated datasets to a registry once the dataset annotation is complete.
      </p>
      <p>In order to assess our approach, we conducted a preliminary evaluation with 6
users unfamiliar with Knowledge Representation or Semantic Web technologies,
who were able to describe and integrate their annotated datasets as a KG.</p>
    </sec>
    <sec id="sec-2">
      <title>Challenges in Long-Tail Dataset Annotation</title>
      <p>We focus on datasets that are not straightforward to map into a structured
representation. Consider, for example, Table 1, which depicts food prices in
different regions of Ethiopia at different points in time. The table has a time series
with the price of different items at different dates, a repeated column with
the item being described (which should be ignored), the item category, and additional information
about the region where each item was produced. The dataset also has some
missing values and labels marked as "unknown", which we may want to skip.
This dataset is representative of many open datasets with statistical or time series
information, and presents some interesting challenges:</p>
      <list list-type="bullet">
        <list-item>
          <p>The main subject of the annotation is not clear: The table describes
the price of an item in a location at a particular time. One possibility would
be to assert that the subject of the triple is the item (e.g., Sorghum), with
the price column as the object and the rest of the columns as qualifiers.
Alternatively, we could use the country (or the administrative name) as the
main subject, as it is relevant for creating aggregates. Finally, we could also
generate a blank node or URI to link together the contents of all columns.</p>
        </list-item>
        <list-item>
          <p>Repeated columns and incomplete cell values: Spreadsheets contain
empty values, cell values (or columns) that need to be ignored, and comments
(especially at the beginning and end) that complicate processing the data.</p>
        </list-item>
        <list-item>
          <p>Distinguishing variables from qualifiers: In some cases, it may be
difficult to tell whether a column is the object associated with a subject
or whether it qualifies other values. For example, if Table 1 contained a
"quality" column, it could be interpreted as a new variable, or as a qualifier
indicating the quality of the information source.</p>
        </list-item>
      </list>
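      <p>The handling of repeated columns and incomplete cell values can be illustrated with a minimal sketch. It assumes a small, made-up CSV sample loosely modelled on the food-price table described above; the column names, the set of ignored columns and the markers treated as missing are all hypothetical and are not part of T2WML.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io

# Hypothetical sample loosely modelled on the food-price table described above.
SAMPLE_CSV = """item,item_copy,category,region,date,price
Sorghum,Sorghum,cereal,Amhara,2019-01-01,12.5
Sorghum,Sorghum,cereal,unknown,2019-02-01,13.0
Maize,Maize,cereal,Oromia,2019-01-01,
Maize,Maize,cereal,Oromia,2019-02-01,9.75
"""

IGNORED_COLUMNS = {"item_copy"}    # repeated column that carries no new information
MISSING_MARKERS = {"", "unknown"}  # cell values treated as missing (assumption)


def clean_rows(text):
    """Yield rows without ignored columns, skipping rows that have missing cells."""
    for row in csv.DictReader(io.StringIO(text)):
        kept = {k: v.strip() for k, v in row.items() if k not in IGNORED_COLUMNS}
        if any(v.lower() in MISSING_MARKERS for v in kept.values()):
            continue
        yield kept


rows = list(clean_rows(SAMPLE_CSV))
for row in rows:
    print(row)
```
]]></preformat>
      <p>Dropping whole rows is only one possible policy; a real annotation may prefer to keep a row and skip just the affected statement.</p>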
      <p>Other problems that frequently occur include complex headers that
sometimes join the meaning of two columns (e.g., values and units, location and
country, etc.); comments in certain parts of the file; or critical missing
information, which is provided outside the file contents. For example, there are cases where
the year in which the file was produced is part of the title of the CSV instead of
a column with a constant name.</p>
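      <p>As an illustration of the last case, the following sketch recovers a year from a file title and attaches it to every row as a constant column. The file name, the row contents and the assumption that the title contains a single four-digit year between 1900 and 2099 are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re


def year_from_title(title):
    """Return the first four-digit year (1900-2099) found in a file title, or None."""
    match = re.search(r"(?<!\d)(19|20)\d{2}(?!\d)", title)
    return int(match.group(0)) if match else None


def add_constant_year(rows, title):
    """Attach the year taken from the title to every row as a constant column.

    If no year is found, the column is still added with the value None.
    """
    year = year_from_title(title)
    return [dict(row, year=year) for row in rows]


# Hypothetical file name and rows, for illustration only.
rows = [{"region": "Amhara", "price": "12.5"}]
print(add_constant_year(rows, "food_prices_ethiopia_2019.csv"))
```
]]></preformat>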
      <p>Together, these challenges make the automated annotation of datasets a hard
problem. We need an approach that incorporates feedback from content
creators or domain experts who are familiar with these datasets, but do not
necessarily know Semantic Web technologies or mapping languages.</p>
    </sec>
    <sec id="sec-3">
      <title>Using Annotation Templates to Structure Datasets</title>
      <p>Our approach has three main elements: an annotation schema, which we use to
create mapping templates (Section 3.1); an extension of the T2WML tool to use
the proposed vocabulary when converting datasets into KGs (Section 3.2); and
an approach to integrate the mapped results with a reference KG (Section 3.3).</p>
      <sec id="sec-3-1">
        <title>A Schema to Describe Variable Metadata</title>
        <p>We have created a simple annotation schema<sup>1</sup> by adding a set of headers to the
start of the spreadsheet, as shown in Table 2. The schema was designed to capture
basic metadata and to be easy to understand by content creators unfamiliar with
Semantic Web technologies. We therefore capture 1) the dataset identifier to
be used when referring to the dataset; 2) the role of each column, i.e., whether
it is a variable, a unit or a qualifier (location, time or other); 3) the type of
each column, i.e., whether the column should be the main subject, the format
used to represent dates, whether the variables to annotate are numbers or
strings, etc.; 4) the column description, in case users need to clarify any of the
columns for those reusing the data; 5) the variable name represented in
a column, as in some cases the headers used are difficult to understand; 6) the
variable unit; and 7) the header row where the original dataset headers start.</p>
        <p>An example of our schema is shown in Table 2, which annotates Table 1.
As the example shows, it is not necessary to complete all headers when the
information is unknown or missing.</p>
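        <p>To make the idea of annotation headers concrete, the sketch below represents an annotated sheet as a list of rows whose first cell carries a label, and splits it into a dataset identifier, per-column annotations and data rows. The row labels, the cell values and the <monospace>parse_annotations</monospace> helper are illustrative assumptions based on the description above; they are not the exact layout or code used by T2WML.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass

# Annotation rows are stacked above the original header row; the first cell of
# each row holds a label. Labels and values are illustrative assumptions.
ANNOTATED_SHEET = [
    ["dataset", "food_prices", "", "", "", ""],
    ["role", "qualifier", "location", "time", "variable", "unit"],
    ["type", "string", "main subject", "%Y-%m-%d", "number", "string"],
    ["description", "", "", "", "price of the item", ""],
    ["name", "item", "region", "date", "price", "currency"],
    ["header", "Item", "Region", "Date", "Price", "Currency"],
    ["", "Sorghum", "Amhara", "2019-01-01", "12.5", "ETB"],
]

ANNOTATION_LABELS = ("role", "type", "description", "name", "unit")


@dataclass
class ColumnAnnotation:
    header: str
    role: str = ""
    type: str = ""
    description: str = ""
    name: str = ""
    unit: str = ""


def parse_annotations(sheet):
    """Split an annotated sheet into (dataset id, column annotations, data rows)."""
    by_label = {}
    header_index = None
    for index, row in enumerate(sheet):
        label = row[0].strip().lower()
        if label == "header":
            header_index = index
            break
        by_label[label] = row[1:]
    if header_index is None:
        raise ValueError("no 'header' row found")
    headers = sheet[header_index][1:]
    dataset_id = by_label.get("dataset", [""])[0]
    columns = []
    for position, header in enumerate(headers):
        fields = {}
        for label in ANNOTATION_LABELS:
            values = by_label.get(label, [])
            # Annotation rows are optional and may be shorter than the header row.
            fields[label] = values[position] if position < len(values) else ""
        columns.append(ColumnAnnotation(header=header, **fields))
    data_rows = [row[1:] for row in sheet[header_index + 1:]]
    return dataset_id, columns, data_rows


dataset_id, columns, data_rows = parse_annotations(ANNOTATED_SHEET)
print(dataset_id, columns[3], data_rows[0])
```
]]></preformat>
        <p>The sample sheet omits the unit row on purpose, to show that missing annotation rows simply leave the corresponding field empty.</p>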
      </sec>
      <sec id="sec-3-2">
        <title>Extending the Table to Wikidata Mapping Language Tool</title>
        <p>
          We have implemented our approach by extending the Table to Wikidata
Mapping Language Tool (T2WML) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. T2WML is designed to 1) map data in
arbitrary data layouts used in Excel and CSV files without the need for complex
preprocessing steps to transform tables into a canonical "Database"
representation; 2) enable users who are not familiar with RDF to map spreadsheets
and CSV files to KGs; and 3) integrate mapping and entity linking so that the
resulting output is linked to a reference KG.
        </p>
        <sec id="sec-3-2-1">
          <p><sup>1</sup> https://t2wml-annotation.readthedocs.io/en/latest/</p>
          <p>
            T2WML is designed for the Wikidata data model [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. The main building
block in this model is a statement, which consists of a subject, a predicate, an
object, qualifiers and references. The subject, predicate and object parts mirror
their RDF counterparts. The qualifiers are predicate/object pairs that provide
context information about a subject/predicate/object triple. For example, an
employment relation between a person and an organization can be qualified to
record the period of time when the person was employed at that organization.
          </p>
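          <p>The statement structure described above can be sketched as a small data class. This is a simplified illustration rather than the Wikidata or T2WML implementation: qualifiers and references are reduced to plain predicate/object pairs, and the identifiers used in the employment example are placeholders, not real Wikidata items or properties.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Statement:
    """A subject/predicate/object triple plus qualifiers and references."""
    subject: str
    predicate: str
    obj: str
    qualifiers: List[Tuple[str, str]] = field(default_factory=list)
    references: List[Tuple[str, str]] = field(default_factory=list)

    def triple(self):
        """Return the part of the statement that mirrors an RDF triple."""
        return (self.subject, self.predicate, self.obj)


# Placeholder identifiers, not real Wikidata items or properties.
employment = Statement(
    subject="person:alice",
    predicate="employer",
    obj="org:example-institute",
    qualifiers=[("start time", "2015-01-01"), ("end time", "2019-12-31")],
)
print(employment.triple())
print(dict(employment.qualifiers))
```
]]></preformat>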
          <p>
            Figure 1 shows how the T2WML extension would process a dataset similar to
the one shown in Table 2. T2WML recognizes the different headers annotated in
the spreadsheet to generate a template YAML following the T2WML mapping
language [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. Mapped results can be previewed on the bottom right of the
screen, under "Output". This way, users can see how the automatically proposed
mappings will process the dataset and edit them if needed.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Uploading Annotated Results to a Public Knowledge Graph</title>
        <p>Once users finish annotating a dataset, they can export their results in a
structured format like RDF. However, creating a KG with this information still requires
significant expertise. Therefore, we have created the USC Datamart, a
catalog which includes 1) key metadata (i.e., creator, variables included,
etc.) of the datasets uploaded by users; and 2) the contents of those annotated
datasets (with variables and their qualifiers, such as location, date, units, etc.). We
have extended T2WML to allow uploading the structured results into the USC
Datamart through a dedicated API<sup>2</sup>, enabling users to share their results online
(see the Upload to Datamart button in Figure 1). Each dataset has its own id
and can be updated with new data. This way, if a time series consists of a
set of spreadsheets with the same structure for different regions, they can all be
uploaded using a similar mapping template and the same dataset id.</p>
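        <p>The reuse of one mapping template and one dataset id across several spreadsheets can be sketched as follows. The sketch uses an in-memory dictionary as a stand-in for a dataset registry and does not call the Datamart API; the template structure, the <monospace>upload</monospace> function and the sample rows are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

# In-memory stand-in for a dataset registry; this does NOT call the Datamart API.
registry = defaultdict(list)

# Hypothetical template: maps spreadsheet headers to a variable and its qualifiers.
TEMPLATE = {"variable": "Price", "qualifiers": {"location": "Region", "time": "Date"}}


def apply_template(rows, template):
    """Turn spreadsheet rows into records holding a value and its qualifiers."""
    records = []
    for row in rows:
        record = {
            "variable": template["variable"].lower(),
            "value": float(row[template["variable"]]),
        }
        for qualifier, header in template["qualifiers"].items():
            record[qualifier] = row[header]
        records.append(record)
    return records


def upload(dataset_id, rows, template=TEMPLATE):
    """Append the mapped rows to the dataset, creating it on first use."""
    registry[dataset_id].extend(apply_template(rows, template))
    return len(registry[dataset_id])


# Two spreadsheets with the same structure for different regions.
amhara = [{"Region": "Amhara", "Date": "2019-01-01", "Price": "12.5"}]
oromia = [{"Region": "Oromia", "Date": "2019-01-01", "Price": "9.75"}]
upload("food_prices", amhara)
total = upload("food_prices", oromia)
print(total, registry["food_prices"])
```
]]></preformat>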
        <p>With the USC Datamart, users may retrieve dataset metadata (e.g., to find
out which variables a dataset includes, or the time period they cover) and,
once they find the desired information, download it as a table for their
own analysis. A usage example of the Datamart API can be seen online.<sup>3</sup></p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Preliminary Evaluation</title>
      <p>In order to assess our approach, we performed a preliminary evaluation with
six users. None of these users were familiar with Semantic Web technologies or
mapping languages, but three of them had expertise in data science and scripting
languages like Python or R. All users received one hour of training in T2WML
to understand the main capabilities of the tool and the annotation schema.</p>
      <p>The goal of the evaluation was to assess whether users could understand the
proposed schema and use it in T2WML to annotate and upload datasets similar to
the one described in Table 1 (with their corresponding challenges). The
evaluation included three datasets with different indicators (economic, demographic,
production, etc.) in African countries. Each dataset was assigned to two different
users. As a result, all users were able to upload their datasets successfully to the
USC Datamart, with on-the-fly corrections for one of the datasets where the
temporal information was part of the title of the file, instead of in its contents.</p>
      <p>When asked for feedback, users reported that the proposed annotation
approach was preferable to creating their own scripts for data cleaning, but they
noted that it can be difficult to 1) align their own terminology to Wikidata
and 2) understand the difference between a variable and its corresponding
qualifiers. This means that while our approach successfully tackled the first two
challenges described in Section 2 (annotating the main subject and incomplete
columns), additional work is required to guide users in the annotation process.
We are improving tutorials and documentation to address these issues.</p>
      <sec id="sec-4-1">
        <p><sup>2</sup> https://github.com/usc-isi-i2/datamart-api</p>
        <p><sup>3</sup> https://tinyurl.com/y2lygs5v</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        A significant number of tools (e.g., [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]) and mapping languages (e.g., [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ])
have been created by the community to help users map their datasets into KGs.
In this work we created a schema to help guide users in the dataset annotation
process without having to learn the complexity of existing tools or languages.
      </p>
      <p>
        Other work has focused on automated table understanding to label the
structure of tables without requiring users to annotate datasets themselves (e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
This work is highly relevant to our own, and we plan to extend our approach
in this direction, giving users the ability to correct the annotations proposed
automatically. In this paper we aimed to ensure that users understand the proposed
schema and to provide an end-to-end process (from annotation to upload)
within a single tool (T2WML).
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>In this paper we have described our approach for allowing content creators to
describe their own datasets to transform them into structured KGs. Our
preliminary results show that users are able to understand and use our schema for
annotating their datasets easily, enabling them to create and populate an existing
KG. Our next step will focus on incorporating table understanding approaches
which will make the process easier for users describing their own data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander Sande</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colpaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>RML: a generic language for integrated RDF mappings of heterogeneous data</article-title>
          .
          <source>In: Proceedings of the 7th Workshop on Linked Data on the Web. CEUR Workshop Proceedings</source>
          , vol.
          <volume>1184</volume>
          (
          <year>Apr 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ermilov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stadler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>CSV2RDF: User-driven CSV to RDF mass conversion framework</article-title>
          .
          <source>In: Proceedings of the ISEM</source>
          . vol.
          <volume>13</volume>
          , pp.
          <fpage>04</fpage>
          –
          <lpage>06</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ghasemi-Gol</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pujara</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Learning cell embeddings for understanding table layouts</article-title>
          .
          <source>Knowledge and Information Systems</source>
          (Sep
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taheriyan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muslea</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Karma: A system for mapping structured sources into the semantic web</article-title>
          .
          <source>In: Extended Semantic Web Conference</source>
          . pp.
          <fpage>430</fpage>
          –
          <lpage>434</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Heyvaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Declarative Rules for Linked Data Generation at your Fingertips!</article-title>
          <source>In: Proceedings of the 15th ESWC: Posters and Demos</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Szekely</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garijo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhatia</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pujara</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>T2WML: Table to Wikidata mapping language</article-title>
          .
          <source>In: Proceedings of the 10th International Conference on Knowledge Capture</source>
          . pp.
          <fpage>267</fpage>
          –
          <lpage>270</lpage>
          . K-CAP '19,
          <publisher-name>ACM</publisher-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Vrandečić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krötzsch</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Wikidata: A free collaborative knowledge base</article-title>
          .
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <issue>10</issue>
          ),
          <fpage>78</fpage>
          –
          <lpage>85</lpage>
          (Sep
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>