Morph-CSV: Virtual Knowledge Graph
               Access for Tabular Data

               David Chaves-Fraga, Luis Pozo-Gilo, Jhon Toledo,
                      Edna Ruckhaus, and Oscar Corcho

        Ontology Engineering Group, Universidad Politécnica de Madrid, Spain
           dchaves@fi.upm.es, luis.pozo@upm.es, ja.toledo@upm.es,
                    eruckhaus@fi.upm.es, ocorcho@fi.upm.es


        Abstract. Virtual knowledge graph access has traditionally focused on
        providing ontology-based access to relational databases (RDB), proposing
        SPARQL-to-SQL query translation techniques and optimizations.
        With the advent of mapping languages or annotations such as RML or
        CSVW, these techniques have been applied over tabular data by con-
        sidering each source as a single table that can be loaded into an RDB.
        However, such techniques do not take into account those characteristics
        that are normally present in real-world CSV files (e.g., normalization,
        constraints, joins). In this paper we present Morph-CSV, a framework
        for enhancing virtual knowledge graph access over a set of CSV files by
        using a combination of CSVW annotations and RML mappings with
        FnO transformation functions. Exploiting these inputs, the framework
        creates an enriched RDB representation of the CSV files together with
        the corresponding R2RML mappings, enabling the use of existing query
        translation (SPARQL-to-SQL) techniques and tools.

        Keywords: Knowledge Graphs · CSV · RML · CSVW


1     Introduction

Semi-structured data formats, and particularly spreadsheets in the form of CSV
or Excel files, are one of the most widely-used formats to publish data on the
Web. There are several reasons why tabular formats are so popular for data
publication. First, they are easy to generate by data providers. In many cases,
they are even used as one of the main ways to manage data inside organiza-
tions. Second, they are easy to consume with common office tools (e.g., Excel,
LibreOffice) and there are advanced tools that can be used to process them (e.g.,
OpenRefine, Tableau). However, more advanced consumers (e.g., application de-
velopers, knowledge workers) often have to face some relevant challenges when
consuming tabular data: there is no standard way to query them, as there is for
other data formats such as RDB, JSON or XML; data are difficult to integrate,
since data constraints and relationships across different files are not explicit;
and data are often difficult to understand, since column names are generally
heterogeneous.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).

     Some of these challenges may be addressed by following a Semantic Web
approach. Virtual knowledge graphs (VKG) provide a unified view of and common
access to a set of data sources based on ontologies and mappings, translating
SPARQL queries into queries that are supported by the underlying source.
Although current proposals [7] support querying such formats, they treat each
source as if it were a single non-normalized RDB table with no keys or
integrity constraints, which are important elements that SPARQL-to-SQL engines
use for efficient querying. Several languages have been proposed to specify
annotations that deal with the heterogeneity of tabular datasets, such as CSVW
[8] metadata and RML+FnO [5] mapping rules, but engines and systems still have
to take them into account in their VKG access pipeline.
     In this demo we present Morph-CSV, an open source engine¹ that extends
the typical VKG workflow to enhance performance and query completeness over
tabular datasets. Our approach exploits the information in CSVW annotations
and RML+FnO mappings to obtain details on the underlying schema, required
transformation functions, missing information, etc., and pushes their
application down directly to the tabular dataset. It generates and populates
an enriched and normalized RDB schema from the CSV files, and translates
RML+FnO into an equivalent function-free R2RML mapping document [4], so that
existing SPARQL-to-SQL optimizations can be used to query them. Finally, we
describe two real use cases from the transport and biomedical domains where
Morph-CSV is applied to enhance virtual KG access.
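
The push-down of transformation functions mentioned above can be illustrated
with a minimal Python sketch. It is not the Morph-CSV implementation: the
dictionary-based term map, the FUNCTIONS registry and the push_down helper are
hypothetical simplifications (real RML+FnO mappings are RDF documents). The
sketch only shows the idea that a function result is materialized as a new
column, so that the rewritten mapping can reference that column and remain
function-free.

```python
import csv
import io

# Hypothetical, simplified stand-in for an RML+FnO term map: the object of a
# triple is computed by applying a function to a CSV column.
term_map = {"predicate": "ex:name", "function": "upper", "input": "stop_name"}

# Toy function registry (an assumption; FnO functions are declared in RDF).
FUNCTIONS = {"upper": str.upper, "strip": str.strip}


def push_down(rows, term_map):
    """Materialize the function result as a new column and return the rows
    together with a function-free term map that references that column."""
    fn = FUNCTIONS[term_map["function"]]
    new_col = f'{term_map["input"]}_{term_map["function"]}'
    new_rows = [{**row, new_col: fn(row[term_map["input"]])} for row in rows]
    plain_map = {"predicate": term_map["predicate"], "column": new_col}
    return new_rows, plain_map


# Illustrative data only; not taken from any real dataset.
raw = "stop_id,stop_name\n1,atocha\n2,sol\n"
rows = list(csv.DictReader(io.StringIO(raw)))

new_rows, plain_map = push_down(rows, term_map)
print(plain_map)
print(new_rows[0])
```

Because the function is applied once over the data, a SPARQL-to-SQL engine that
receives the rewritten mapping never needs to evaluate it at query time.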


2     Tabular Annotations for VKG: RML+FnO and CSVW

There are specific challenges in querying tabular datasets using a VKG access
approach that have not been tackled by existing techniques. The selection of the
sources needed to answer a query, the normalization or heterogeneity of the
dataset, and the absence of indexes all affect the performance and completeness
of SPARQL-to-SQL engines. To deal with these challenges, RML [6] extends the
R2RML W3C Recommendation to support formats beyond relational databases, such
as XML, CSV and JSON. Recently, RML has been integrated with the Function
Ontology (FnO) to support other types of transformations [5]. Additionally,
CSVW [8] is a W3C Recommendation that provides metadata annotations for tabular
data on the web. Table 1 summarizes the properties of these specifications and
the challenge(s) that each one addresses. The manual and ad-hoc preparation of
a tabular dataset for VKG access is usually the most time-consuming and least
reproducible task. Exploiting available standard and declarative annotations
allows this preparation to be generalized and automated, while ensuring query
completeness and improving the performance of SPARQL-to-SQL techniques.
¹ https://doi.org/10.5281/zenodo.3572132

Table 1. Properties of CSVW and RML+FnO that can be used to address the
challenges of dealing with tabular data when constructing virtual knowledge graphs

      Challenges                                 Relevant Properties
      Describe the corresponding concept         rr:class, csvw:propertyUrl
      Describe the corresponding property        rr:predicateMap, csvw:propertyUrl
      Add header to a file                       csvw:rowTitles
      Column datatype                            csvw:datatype
      Constraining values                        csvw:minimum, csvw:maximum
      Specify the format of a column             csvw:format
      Specify a join                             rr:refObjectMap, csvw:foreignKeys
      Transform value                            fnml:functionValue
      Support for multiple values in one cell    csvw:separator
      Primary key                                csvw:primaryKey
      Default for missing values                 csvw:default
      Specify NULL values                        csvw:null
      Specify NOT NULL constraint                csvw:required
      Specify columns to be transformed          rr:reference, rr:template
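
As an illustration of how some of the properties in Table 1 could be turned
into relational constraints, the following Python sketch derives a CREATE TABLE
statement from a small annotation. The dictionary layout, the SQL_TYPES mapping
and the to_ddl helper are assumptions made for this example: the annotation
only mimics the roles of csvw:datatype, csvw:required, csvw:primaryKey and
csvw:foreignKeys and does not follow the JSON-LD syntax of CSVW.

```python
import sqlite3

# Hypothetical annotation in the spirit of CSVW (a plain dict, not the
# JSON-LD syntax defined by the W3C Recommendation).
metadata = {
    "table": "trips",
    "columns": [
        {"name": "trip_id", "datatype": "string", "required": True},
        {"name": "route_id", "datatype": "string", "required": True},
        {"name": "direction", "datatype": "integer"},
    ],
    "primaryKey": "trip_id",
    "foreignKeys": [
        {"column": "route_id", "refTable": "routes", "refColumn": "route_id"}
    ],
}

# Assumed datatype mapping; unknown datatypes fall back to TEXT.
SQL_TYPES = {"string": "TEXT", "integer": "INTEGER", "decimal": "REAL"}


def to_ddl(meta):
    """Build a CREATE TABLE statement from the annotation."""
    parts = []
    for col in meta["columns"]:
        line = f'{col["name"]} {SQL_TYPES.get(col["datatype"], "TEXT")}'
        if col.get("required"):
            line += " NOT NULL"
        parts.append(line)
    if "primaryKey" in meta:
        parts.append(f'PRIMARY KEY ({meta["primaryKey"]})')
    for fk in meta.get("foreignKeys", []):
        parts.append(
            f'FOREIGN KEY ({fk["column"]}) '
            f'REFERENCES {fk["refTable"]}({fk["refColumn"]})'
        )
    return f'CREATE TABLE {meta["table"]} (\n  ' + ",\n  ".join(parts) + "\n);"


ddl = to_ddl(metadata)
print(ddl)

# Check that the generated statement is accepted by an in-memory SQLite DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE routes (route_id TEXT PRIMARY KEY)")
conn.execute(ddl)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

Keys and NOT NULL constraints of this kind are the information that
SPARQL-to-SQL engines can exploit and that is absent when each CSV file is
loaded as a plain, unconstrained table.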


3     The Morph-CSV engine
The Morph-CSV² open source engine exploits the typical inputs of a VKG process
(query, metadata and mappings) to improve performance and query completeness
over tabular sources, dealing with the challenges identified above. In more
detail, it extends the starting phase of a typical VKG access workflow: it
selects the relevant sources for an input query, extracts implicit constraints
from RML+FnO [5] mappings and CSVW [8] metadata, pushes their application down
directly to the selected sources and, finally, generates enriched inputs for a
SPARQL-to-SQL process (R2RML mappings and an RDB instance). The architecture
of Morph-CSV is shown in Figure 1, which presents the steps that exploit
declarative annotations to enhance SPARQL query translation over tabular data:
i) Source Selection: Using the SPARQL query and the mapping rules, the engine
selects only the sources (and the columns inside each source) that are relevant
to answer the input query. ii) Normalization: Two functions perform data
normalization. The first one handles multi-valued columns, while the second
one handles multiple entities in the same source. iii) Data Preparation: In
this step, three different functions are executed. First, the engine performs
all substitutions, such as default values, NULL values and date formats; then,
it creates a new column in the specific source by applying the transformation
function defined in RML+FnO; and finally, it removes all duplicates in the raw
data. iv) Mapping Translation: The mapping rules are translated, according to
the generated data, from RML+FnO to a standard R2RML document [4]. v) Schema
Creation and Load: An optimized SQL schema is generated by applying integrity
constraints (PK-FK), and the selected data sources are loaded.

   [Figure 1: workflow diagram. Morph-CSV steps: Extracting Constraints, Source
   Selection, Normalization & Data Preparation, Mapping Translation. RDB step:
   Schema Creation & Load, followed by a SPARQL-to-SQL Engine. The inputs shown
   include RML+FnO mappings.]

         Fig. 1. Proposed workflow to enhance VKG access over tabular data

² https://morph.oeg.fi.upm.es/tool/morph-csv
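
The normalization, data preparation and schema creation steps described in this
section can be sketched in a few lines of Python. This is a toy illustration
under stated assumptions, not the engine's code: the CSV content, the table and
column names, and the SEPARATOR and DEFAULT_NAME constants (which play the
roles of csvw:separator and csvw:default) are invented for the example, and
SQLite stands in for the RDB instance.

```python
import csv
import io
import sqlite3

# Illustrative data only: one multi-valued column, one duplicated row and
# one missing value.
RAW = (
    "stop_id,name,routes\n"
    "1,Atocha,L1;C3\n"
    "2,Sol,L1;L2;L3\n"
    "2,Sol,L1;L2;L3\n"
    "3,,L2\n"
)
SEPARATOR = ";"           # plays the role of csvw:separator
DEFAULT_NAME = "unknown"  # plays the role of csvw:default

rows = list(csv.DictReader(io.StringIO(RAW)))

# Normalization: move the multi-valued cell into a separate link table.
stops_raw = [(r["stop_id"], r["name"]) for r in rows]
links_raw = [(r["stop_id"], route)
             for r in rows for route in r["routes"].split(SEPARATOR)]

# Data preparation: substitute defaults, cast types, and remove duplicates
# (dict.fromkeys keeps the first occurrence and preserves order).
stops = list(dict.fromkeys(
    (int(sid), name or DEFAULT_NAME) for sid, name in stops_raw))
links = list(dict.fromkeys((int(sid), route) for sid, route in links_raw))

# Schema creation and load: PK-FK constraints are declared explicitly.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE stops (stop_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE stop_routes ("
    " stop_id INTEGER NOT NULL REFERENCES stops(stop_id),"
    " route TEXT NOT NULL,"
    " PRIMARY KEY (stop_id, route))")
conn.executemany("INSERT INTO stops VALUES (?, ?)", stops)
conn.executemany("INSERT INTO stop_routes VALUES (?, ?)", links)

result = conn.execute(
    "SELECT s.name, COUNT(*) FROM stops s"
    " JOIN stop_routes sr ON sr.stop_id = s.stop_id"
    " GROUP BY s.stop_id ORDER BY s.stop_id").fetchall()
print(result)  # [('Atocha', 2), ('Sol', 3), ('unknown', 1)]
```

Loading the file as a single unconstrained table would instead keep the
duplicated row and the unsplit "L1;L2;L3" cell, which is the kind of situation
that can lead to incomplete or incorrect query answers.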


4     Use Cases

In this demo we run Morph-CSV over two real use cases:

1. Transport National Access Points (NAP). Since 2019, most European
   countries are required to publish transport data in accessible open query
   points called National Access Points or NAP³. The main issues related to
   accessing transport data across Europe will be how to deal with the
   heterogeneity of these access points and data formats, and how to query
   them efficiently. Using the de-facto standard for publishing open transport
   data, GTFS⁴, which is composed of a set of tabular sources, our engine
   will exploit RML+FnO and CSVW annotations to enable efficient and complete
   access to GTFS data through SPARQL queries.
2. Virtual KG over Bio2RDF. Bio2RDF [1] is one of the most popular
   projects that integrates and publishes biomedical datasets using Semantic
   Web technologies. Although its community has actively contributed to the
   generation of these datasets, the integration is performed using ad-hoc
   programming scripts, which negatively affects the maintainability of the
   project; as a result, SPARQL queries may return outdated results. Taking
   the original tabular sources of Bio2RDF, Morph-CSV constructs a virtual KG
   over them following a declarative approach, thereby improving the
   maintainability of the project and ensuring up-to-date results from the
   SPARQL queries.
³ https://ec.europa.eu/transport/themes/its/road/action_plan/nap_en
⁴ https://developers.google.com/transit/gtfs

   The obtained results are shown in a landing page⁵ and in a video⁶. Besides
the real use cases, we present the results obtained with Morph-CSV in terms of
performance and completeness, using two virtual knowledge graph benchmarks
from the state of the art (BSBM [2] and GTFS-Madrid-Bench [3]) and two
well-known open source SPARQL-to-SQL engines (Morph-RDB and Ontop).


5     Conclusions and Future Work
Morph-CSV enhances virtual knowledge graph access over heterogeneous CSV
files. It takes as input a set of CSV files, CSVW annotations, and an RML+FnO
mapping, and generates as output an enriched RDB instance with data from
the CSV files together with R2RML mappings, so that they can be used by any
state-of-the-art R2RML-compliant OBDA engine. As part of our future work,
we will improve its performance with new optimizations in the query-translation
process. We will also extend it for other types of data (e.g., XML, JSON).

Acknowledgements. The work presented in this paper is supported by the
Spanish Ministerio de Economía, Industria y Competitividad and EU FEDER
funds under the DATOS 4.0: RETOS Y SOLUCIONES - UPM Spanish national
project (TIN2016-78011-C4-4-R) and by an FPI grant (BES-2017-082511).


References
1. Belleau, F., Nolin, M.A., Tourigny, N., Rigault, P., Morissette, J.: Bio2RDF: to-
   wards a mashup to build bioinformatics knowledge systems. Journal of biomedical
   informatics 41(5), 706–716 (2008)
2. Bizer, C., Schultz, A.: The Berlin SPARQL Benchmark. International Journal on
   Semantic Web and Information Systems (IJSWIS) 5(2), 1–24 (2009)
3. Chaves-Fraga, D., Priyatna, F., Cimmino, A., Toledo, J., Ruckhaus, E., Corcho, O.:
   GTFS-Madrid-Bench: A Benchmark for Virtual Knowledge Graph Access in the
   Transport Domain. Journal of Web Semantics 65 (2020)
4. Corcho, O., Priyatna, F., Chaves-Fraga, D.: Towards a new generation of ontology
   based data access. Semantic Web 11, 153–160 (2020)
5. De Meester, B., Maroy, W., Dimou, A., Verborgh, R., Mannens, E.: Declarative
   data transformations for Linked Data generation: the case of DBpedia. In: European
   Semantic Web Conference. pp. 33–48. Springer (2017)
6. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de
   Walle, R.: RML: a generic language for integrated RDF mappings of heterogeneous
   data. In: LDOW (2014)
7. Priyatna, F., Corcho, O., Sequeda, J.: Formalisation and experiences of R2RML-
   based SPARQL to SQL query translation using morph. In: Proceedings of the 23rd
   international conference on World wide web. pp. 479–490 (2014)
8. Tennison, J., Kellogg, G., Herman, I.: Model for tabular data and metadata on the
   web. W3C recommendation. World Wide Web Consortium (W3C) (2015)

⁵ https://morph.oeg.fi.upm.es/demo/morph-csv
⁶ https://youtu.be/yzskzFSAMzA