     A Template-Based Approach for Annotating
                Long-Tail Datasets

        Daniel Garijo, Ke-Thia Yao, Amandeep Singh, and Pedro Szekely*

                           Information Sciences Institute,
                          University of Southern California
                   {dgarijo, kyao, amandeep, szekely}@isi.edu



        Abstract. An increasing amount of data is shared on the Web through
        heterogeneous spreadsheets and CSV files. In order to homogenize and
        query these data, the scientific community has developed Extract, Trans-
        form and Load (ETL) tools and services that help make these files ma-
        chine readable as Knowledge Graphs (KGs). However, tabular data may
        be complex, and the level of expertise required by existing ETL tools
        makes it difficult for users to describe their own data. In this paper we
        propose a simple annotation schema to guide users when transforming
        complex tables into KGs. We have implemented our approach by extend-
        ing T2WML, a table annotation tool designed to help users annotate
        their data and upload the results to a public KG. We have evaluated our
        effort with six non-expert users, obtaining promising preliminary results.

        Keywords: Dataset annotation · Metadata · Knowledge Graph.


1     Introduction
An increasing amount of data is shared on the Web by multiple organizations
using Excel and CSV formats. Content creators usually prefer to use tabular
data because it is simple for humans to generate, manipulate and visualize, and
there is a significant number of tools to help explore and edit the contents of
spreadsheets. These data need to be properly understood by others, and hence
documentation (e.g., variables captured, provenance, usage notes, etc.) is usually
included in auxiliary files or the spreadsheets themselves. As a result, many of
these spreadsheets have comments, clarifications, notes and references to other
files explaining how to interpret the information contained in them.
     In order to convert tabular data to a machine readable format, the Semantic
Web community has created Extract, Transform and Load (ETL) tools (e.g.,
[4]) and mapping languages (e.g., [1, 5]) that help translate spreadsheets into
Knowledge Graphs. However, these tools and languages require significant exper-
tise when transforming heterogeneous tabular data with comments, incomplete
values or interrelated columns, making it difficult for
domain experts to integrate their own datasets with existing KGs.
*
    Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0)

    In this paper we describe an approach to help non-experts transform their
data into a structured representation through dataset annotations. Our contri-
butions include 1) a dataset annotation schema that helps generate templates
for translating datasets into KGs; 2) an extension of the T2WML dataset an-
notation tool [6] to accommodate the proposed schema; and 3) an approach to
upload annotated datasets to a registry once the dataset annotation is complete.
    In order to assess our approach, we conducted a preliminary evaluation with six
users unfamiliar with Knowledge Representation or Semantic Web technologies,
who were able to describe and integrate their annotated datasets as a KG.


2      Challenges in Long-Tail Dataset Annotation
We focus on those datasets that are not straightforward to map into a structured
representation. Consider for example Table 1, which depicts the food prices in
different regions of Ethiopia at different points in time. The table has a time series
for the price value of different items at different dates, a repeated column with
the item being described (labeled “ignore”), the item category and different infor-
mation about the region where that item was produced. The dataset also has some
missing values and labels marked as “unknown”, which we may want to skip.
This dataset is representative of many open datasets with statistical/time series
information, and presents some interesting challenges:

    – The main subject of the annotation is not clear: The table describes
      the price of an item in a location at a particular time. One possibility would
      be to assert that the subject of the triple is the item (e.g., Sorghum), having
      the price column as the object; and the rest of the columns as qualifiers.
      Alternatively, we could use the country (or the administrative region name)
      as the main subject, which is convenient for creating aggregates. Finally, we
      could also generate a blank node or URI to link together the contents of all
      columns. A sketch contrasting the first two options follows this list.
    – Repeated columns and incomplete cell values: Spreadsheets contain
      empty values, cell values (or columns) that need to be ignored, and comments
      (especially at the beginning and end) that complicate processing the data.
    – Distinguishing variables from qualifiers: In some cases, it may be dif-
      ficult to distinguish whether a column is the object associated with a subject
      or whether it qualifies other values. For example, if Table 1 contained a
      “quality” column, it could be interpreted as a new variable, or as a qualifier
      indicating the quality of the information source.
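    The following minimal sketch illustrates the first two options above, phrasing
the first row of Table 1 as a statement with either the item or the administrative
region as the main subject. It uses plain Python dictionaries, and predicate names
such as food_price are placeholders rather than actual KG identifiers:

    # First row of Table 1 (the repeated "ignore" column is dropped).
    row = {"date": "7/15/2000", "item": "Sorghum", "name": "Wholesale",
           "category": "cereals/tubers", "price": 238, "curr": "ETB",
           "country": "Ethiopia", "admname": "Addis Ababa"}

    # Option 1: the item is the subject; the other columns qualify its price.
    item_centric = {
        "subject": row["item"],
        "predicate": "food_price",       # placeholder predicate name
        "object": row["price"],
        "qualifiers": {"date": row["date"], "currency": row["curr"],
                       "location": row["admname"]},
    }

    # Option 2: the administrative region is the subject, which is more
    # convenient when building aggregates per region.
    region_centric = {
        "subject": row["admname"],
        "predicate": "food_price",
        "object": row["price"],
        "qualifiers": {"date": row["date"], "currency": row["curr"],
                       "item": row["item"]},
    }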

    Other problems that frequently occur include complex headers that some-
times join the meaning of two columns (e.g., values and units, location and
country, etc.); comments in certain parts of the file; or critical information that
is missing from the file and provided externally. For example, there are cases where
the year in which the file was produced is part of the title of the CSV instead of
appearing in a column with a constant name.
    All these issues make the automated annotation of datasets a challenging
problem. We need an approach for incorporating user feedback from content
creators or domain experts who are familiar with these datasets, but do not
necessarily know Semantic Web technologies or mapping languages.


          Table 1. Example of a dataset with food prices in Ethiopia

      date      ignore  item    name      category       price curr country  admname
      7/15/2000 Sorghum Sorghum Wholesale cereals/tubers 238   ETB  Ethiopia Addis Ababa
      7/15/2001 Rice    Rice    Retail    cereals/tubers 19    ETB  Ethiopia Afar
      7/15/2002 Rice    Rice    Retail    cereals/tubers 18    ETB  Ethiopia unknown
      7/15/2003 Sorghum Sorghum Retail    cereals/tubers       ETB  Ethiopia Amhara


3     Using Annotation Templates to Structure Datasets
Our approach has three main elements: an annotation schema, which we use to
create mapping templates (Section 3.1); an extension of the T2WML tool to use
the proposed vocabulary when converting datasets into KGs (Section 3.2); and
an approach to integrate the mapped results with a reference KG (Section 3.3).

3.1    A Schema to Describe Variable Metadata
We have created a simple annotation schema1 by adding a set of headers to the
start of the spreadsheet, as shown in Table 2. The schema was designed to capture
basic metadata and to be easy to understand by content creators unfamiliar with
Semantic Web technologies. Therefore we capture 1) the dataset identifier to
be used when referring to the dataset; 2) the role of each column, i.e., whether
it is a variable, a unit or a qualifier (location, time or other); 3) the type of
each column, i.e., whether the column should be the main subject, the format
used to represent dates, whether the variables to annotate are numbers or
strings, etc.; 4) the column description, in case users need to clarify any of the
columns to the persons reusing the data; 5) the variable name represented in
a column, as in some cases the original headers are difficult to understand; 6) the
variable unit; and 7) the row where the original dataset headers start.
    An example of our schema is represented in Table 2 by annotating Table 1.
As shown in the example, it is not necessary to complete all headers if the
information is unknown or missing.
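    As a rough illustration, the annotation block can be read with a few lines of
Python. The sketch below assumes the annotated file is a CSV in which each
annotation row carries its label (dataset, role, type, etc.) in the first cell, as in
Table 2; the file name is hypothetical:

    import csv

    ANNOTATION_LABELS = {"dataset", "role", "type", "desc.", "name",
                         "unit", "header"}

    def read_annotated_csv(path):
        """Split an annotated CSV into its annotation block and data rows."""
        annotations, data = {}, []
        with open(path, newline="") as f:
            for row in csv.reader(f):
                label = row[0] if row else ""
                if label in ANNOTATION_LABELS:
                    annotations[label] = row[1:]  # one entry per data column
                else:
                    data.append(row)              # regular data row
        return annotations, data

    annotations, data = read_annotated_csv("eth_food_prices.csv")
    # e.g. annotations["role"] == ["time", "qualifier", "qualifier",
    #                              "variable", "unit", "location", "location"]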

3.2    Extending the Table to Wikidata Mapping Language Tool
We have implemented our approach by extending the Table to Wikidata Map-
ping Language Tool (T2WML) [6]. T2WML is designed to 1) map data in ar-
bitrary data layouts used in Excel and CSV files without the need of complex
preprocessing steps to transform tables into a canonical “Database” represen-
tation; 2) Enable users who are not familiar with RDF to map spreadsheets
and CSV files to KGs; and 3) Integrate mapping and entity linking so that the
resulting output is linked to a reference KG.
1
    https://t2wml-annotation.readthedocs.io/en/latest/

             Table 2. Example of a dataset using our proposed schema

      dataset Eth-FoodPrices
      role    time       qualifier         qualifier           variable           unit   location  location
      type    %m/%d/%Y   string            string              number                    country   main subject
      desc.              Name of the crop                      Price in Ethiopia
      name               Crop name                             food price
      unit
      header  date       item              category            price              curr.  country   admname
              7/15/2000  Sorghum           cereals and tubers  238                ETB    Ethiopia  Addis Ababa
              7/15/2001  Rice              cereals and tubers  19                 ETB    Ethiopia  Afar




Fig. 1. T2WML screenshot with the annotation schema and mapping template (right).
Users can click on the CSV cells to preview their results on the bottom right.



    T2WML is designed for the Wikidata data model [7]. The main building
block in this model is a statement, which consists of a subject, a predicate, an
object, qualifiers and references. The subject, predicate and object parts mirror
their RDF counterparts. The qualifiers are predicate/object pairs that provide
context information about a subject/predicate/object triple. For example, an
employment relation between a person and an organization can be qualified to
record the period of time when the person was employed at that organization.
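    For illustration, such a qualified statement can be represented as follows (a
sketch with invented values; real Wikidata statements use P- and Q-identifiers):

    # The employment example above as a qualified statement.
    statement = {
        "subject": "Jane Doe",
        "predicate": "employer",
        "object": "Example Corp",
        "qualifiers": [
            {"predicate": "start_time", "object": "2015-01-01"},
            {"predicate": "end_time",   "object": "2019-06-30"},
        ],
        "references": ["https://example.org/hr-records"],  # invented source
    }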
    Figure 1 shows how the T2WML extension would process a dataset similar to
the one shown in Table 2. T2WML recognizes the different headers annotated in
the spreadsheet to generate a YAML template following the T2WML mapping
language [6]. Mapped results can be previewed on the bottom right of the
screen, under “Output”. This way, users can see how the automatically proposed
mappings will process the dataset and edit them if needed.
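    The generated template pairs a cell region with a per-cell statement pattern.
The sketch below only illustrates this general shape and is not the exact T2WML
grammar (see [6] and the T2WML documentation); column letters refer to the
annotated spreadsheet of Table 2, where column A holds the row labels and
column E holds the price:

    import yaml  # PyYAML

    # Illustrative mapping template: one statement per price cell, with the
    # admname column (H) as main subject and date/currency as qualifiers.
    template = {
        "statementMapping": {
            "region": {"left": "E", "right": "E", "top": 8, "bottom": 9},
            "template": {
                "subject": "=value[H, $row]",
                "property": "food_price",          # placeholder property
                "value": "=value[$col, $row]",
                "qualifiers": [
                    {"property": "point_in_time", "value": "=value[B, $row]"},
                    {"property": "currency", "value": "=value[F, $row]"},
                ],
            },
        }
    }
    print(yaml.dump(template, sort_keys=False))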

3.3    Uploading Annotated Results to a Public Knowledge Graph

Once users finish annotating a dataset, they can export their results in a struc-
tured format like RDF. However, creating a KG with this information still re-
quires significant expertise. Therefore, we have created the USC Datamart, a
catalog which includes 1) key dataset metadata (e.g., creator, variables included,
etc.) of the datasets uploaded by users; and 2) the contents of those annotated
datasets (with variables and their qualifiers like location, date, units, etc.). We
have extended T2WML to allow uploading the structured results into the USC
Datamart through a dedicated API2 , enabling users to share their results online
(see the Upload to Datamart button in Figure 1). Each dataset has its own id,
which can be updated with new data. This way, if a time series consists of a
set of spreadsheets with the same structure for different regions, they can all be
uploaded using a similar mapping template and the same dataset id.
    With the USC Datamart, users may retrieve dataset metadata (e.g., to find
out which variables a dataset includes, or the time period they cover) and
once they find the desired information they can download it as a table for their
own analysis. A usage example of the Datamart API can be seen online.3
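    A minimal retrieval sketch against a running datamart-api instance is shown
below. The base URL and endpoint paths are assumptions for illustration only;
the actual routes are documented in the datamart-api repository:

    import requests

    BASE = "http://localhost:12543"  # assumed local deployment URL

    # 1) Inspect the variables of an uploaded dataset (assumed endpoint).
    meta = requests.get(f"{BASE}/metadata/datasets/Eth-FoodPrices/variables")
    for variable in meta.json():
        print(variable)

    # 2) Download one variable as a CSV table for local analysis.
    resp = requests.get(f"{BASE}/datasets/Eth-FoodPrices/variables/food_price")
    with open("food_price.csv", "w") as f:
        f.write(resp.text)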


4     Preliminary Evaluation

In order to assess our approach, we performed a preliminary evaluation with
six users. None of these users were familiar with Semantic Web technologies or
mapping languages, but three of them had expertise in data science and scripting
languages like Python or R. All users received one hour of training in T2WML
to understand the main capabilities of the tool and the annotation schema.
    The goal of the evaluation was to assess whether users could understand the pro-
posed schema and use it in T2WML to annotate and upload datasets similar to
the one described in Table 1 (with their corresponding challenges). The evalu-
ation included three datasets with different indicators (economic, demographic,
production, etc.) in African countries. Each dataset was assigned to two different
users. All six users were able to upload their datasets successfully to the USC
Datamart, with on-the-fly corrections for one of the datasets, where the
temporal information was part of the title of the file instead of in its contents.
    When asked for feedback, users reported that the proposed annotation ap-
proach was preferable to creating their own scripts for data cleaning, but they
noted that it can be difficult to 1) align their own terminology with Wikidata
and 2) understand the difference between a variable and its corresponding
qualifiers. This means that while our approach successfully tackled the first two
challenges described in Section 2 (annotating the main subject and incomplete
columns), additional work is required to guide users in the annotation process.
We are improving tutorials and documentation to address these issues.
2
    https://github.com/usc-isi-i2/datamart-api
3
    https://tinyurl.com/y2lygs5v

5    Related Work

A significant number of tools (e.g., [4, 5]) and mapping languages (e.g., [1, 2])
have been created by the community to help users map their datasets into KGs.
In this work we created a schema to help guide users in the dataset annotation
process without having to learn the complexity of existing tools or languages.
    Other work has focused on automated table understanding to label the struc-
ture of tables without requiring users to annotate datasets themselves (e.g., [3]).
This work is very relevant to our own, and we plan to expand our approach
in this direction (giving users the ability to correct automatically proposed
annotations). In this paper we aimed to ensure that users understood the
proposed schema and to have an end-to-end process (from annotation to upload)
incorporated in a single tool (T2WML).


6    Conclusions and Future Work
In this paper we have described our approach for allowing content creators to
describe their own datasets and transform them into structured KGs. Our pre-
liminary results show that users are able to understand and use our schema to
annotate their datasets easily, enabling them to populate an existing KG. Our
next step will focus on incorporating table understanding approaches to make
the process easier for users describing their own data.


References
1. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de
   Walle, R.: RML: a generic language for integrated RDF mappings of heterogeneous
   data. In: Proceedings of the 7th Workshop on Linked Data on the Web. CEUR
   Workshop Proceedings, vol. 1184 (Apr 2014)
2. Ermilov, I., Auer, S., Stadler, C.: CSV2RDF: User-driven CSV to RDF mass con-
   version framework. In: Proceedings of the ISEM. vol. 13, pp. 04–06 (2013)
3. Ghasemi-Gol, M., Pujara, J., Szekely, P.: Learning cell embeddings for understand-
   ing table layouts. Knowledge and Information Systems (Sep 2020)
4. Gupta, S., Szekely, P., Knoblock, C.A., Goel, A., Taheriyan, M., Muslea, M.: Karma:
   A system for mapping structured sources into the semantic web. In: Extended Se-
   mantic Web Conference. pp. 430–434. Springer (2012)
5. Heyvaert, P., De Meester, B., Dimou, A., Verborgh, R.: Declarative Rules for Linked
   Data Generation at your Fingertips! In: Proceedings of the 15th ESWC: Posters and
   Demos (2018)
6. Szekely, P., Garijo, D., Bhatia, D., Wu, J., Yao, Y., Pujara, J.: T2WML: Table to
   wikidata mapping language. In: Proceedings of the 10th International Conference
   on Knowledge Capture. pp. 267–270. K-CAP ’19, ACM (2019)
7. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledge base. Com-
   mun. ACM 57(10), 78–85 (Sep 2014)