<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Morph-CSV: Virtual Knowledge Graph Access for Tabular Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Chaves-Fraga</string-name>
          <email>dchaves@fi.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Pozo-Gilo</string-name>
          <email>luis.pozo@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jhon Toledo</string-name>
          <email>ja.toledo@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edna Ruckhaus</string-name>
          <email>eruckhaus@fi.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Corcho</string-name>
          <email>ocorcho@fi.upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ontology Engineering Group, Universidad Politecnica de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Virtual knowledge graph access has traditionally focused on providing ontology-based access to relational databases (RDB), proposing SPARQL-to-SQL query translation techniques and optimizations. With the advent of mapping languages and annotations such as RML and CSVW, these techniques have been applied to tabular data by treating each source as a single table that can be loaded into an RDB. However, such techniques do not take into account characteristics that are normally present in real-world CSV files (e.g., normalization, constraints, joins). In this paper we present Morph-CSV, a framework for enhancing virtual knowledge graph access over a set of CSV files by using a combination of CSVW annotations and RML mappings with FnO transformation functions. Exploiting these inputs, the framework creates an enriched RDB representation of the CSV files together with the corresponding R2RML mappings, enabling the use of existing query translation (SPARQL-to-SQL) techniques and tools.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs</kwd>
        <kwd>CSV</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Semi-structured data formats, and particularly spreadsheets in the form of CSV
or Excel files, are among the most widely used formats for publishing data on the
Web. There are several reasons why tabular formats are so popular for data
publication. First, they are easy for data providers to generate; in many cases,
they are even one of the main ways to manage data inside organizations.
Second, they are easy to consume with common office tools (e.g., Excel,
LibreOffice), and there are advanced tools that can be used to process them (e.g.,
OpenRefine, Tableau). However, more advanced consumers (e.g., application
developers, knowledge workers) often face relevant challenges when
consuming tabular data: there is no standard way to query them, as there is
for other data formats such as RDB, JSON or XML; data
are difficult to integrate, since data constraints and relationships across different
files are not explicit; and data are often difficult to understand, since column names
are generally heterogeneous.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Some of these challenges can be addressed by following a Semantic Web approach.
Virtual knowledge graphs (VKG) provide a unified view of, and common access to,
a set of data sources based on ontologies and mappings, translating SPARQL
queries into queries that are supported by the underlying source. Although
current proposals [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provide support for querying this kind of format, they treat
each source as if it were a single non-normalized RDB table with no keys or
integrity constraints, which are important elements used by SPARQL-to-SQL
engines for efficient querying. Several languages have been proposed to specify
annotations that deal with the heterogeneity of tabular datasets, such as CSVW
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] metadata and RML+FnO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] mapping rules, but engines and systems still have to
take them into account in their VKG access pipeline.
      </p>
      <p>
        In this demo we present Morph-CSV, an open source engine<sup>1</sup> that extends
the typical VKG workflow to enhance performance and query completeness over
tabular datasets. Our approach exploits the information in CSVW
annotations and RML+FnO mappings to obtain details on the underlying
schema, required transformation functions, missing information, etc., and pushes
their application down directly to the tabular dataset. It generates and
populates an enriched and normalized RDB schema from the CSV files, and
translates RML+FnO to an equivalent function-free R2RML mapping document [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
so that existing SPARQL-to-SQL optimizations can be used to query them.
Finally, we describe two real use cases from the transport and biomedical domains
where Morph-CSV is applied to enhance virtual KG access.
      </p>
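      <p>As a minimal sketch of what "function-free" can mean in practice, the Python snippet below materializes the result of a transformation function as a new column and rewrites a mapping rule so that it only references that column. The dictionary-based rule format, the function name to_upper, and the column names are assumptions made for illustration; they are not RML, FnO or R2RML syntax, and they do not reproduce the actual implementation of Morph-CSV.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io

# Hypothetical CSV content standing in for one tabular source.
RAW = "id,name\n1,madrid\n2,sevilla\n"

# Hypothetical, simplified mapping rule: the object value is computed by a function.
mapping = {"predicate": "ex:label", "function": "to_upper", "input": "name"}

# Hypothetical registry of transformation functions.
FUNCTIONS = {"to_upper": str.upper}


def push_down(rows, rule):
    """Apply the rule's function to every row and store the result in a new column.

    Returns the enriched rows and an equivalent rule that only references a column.
    """
    new_column = f"{rule['input']}_{rule['function']}"
    func = FUNCTIONS[rule["function"]]
    enriched = [{**row, new_column: func(row[rule["input"]])} for row in rows]
    function_free = {"predicate": rule["predicate"], "column": new_column}
    return enriched, function_free


rows = list(csv.DictReader(io.StringIO(RAW)))
enriched_rows, function_free_rule = push_down(rows, mapping)
```
]]></preformat>
      <p>After this step the rewritten rule contains no function call, so under these assumptions it could be expressed with plain column references in a relational mapping.</p>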
    </sec>
    <sec id="sec-2">
      <title>Tabular Annotations for VKG: RML+FnO and CSVW</title>
      <p>
        There are specific challenges in querying tabular datasets using a VKG access
approach that have not been tackled by existing techniques. The selection of the
sources needed to answer a query, the normalization and heterogeneity of the dataset, and
the absence of indexes affect the performance and completeness of
SPARQL-to-SQL engines. To deal with these challenges, RML [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] extends the R2RML W3C
Recommendation to provide support beyond relational databases, for formats such as XML,
CSV and JSON. Recently, RML has been integrated with the Function
Ontology (FnO) to support other types of transformations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Additionally, CSVW [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a W3C Recommendation that provides metadata annotations
for tabular data on the Web. Table 1 summarizes the properties of these
two specifications and their related challenge(s). The manual, ad hoc
preparation of a tabular dataset for VKG access is usually the most time-consuming
and least reproducible task. Exploiting available standard and declarative
annotations allows this preparation to be generalized and automated, while ensuring query
completeness and improving the performance of SPARQL-to-SQL techniques.
      </p>
      <p>
        The Morph-CSV<sup>2</sup> open source engine exploits the typical inputs of a VKG
process (query, metadata and mappings) to improve performance and query
completeness over tabular sources, dealing with the challenges identified above. In more
detail, it extends the starting phase of a typical VKG access workflow: it
selects the relevant sources for an input query, extracts implicit constraints from
RML+FnO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] mappings and CSVW [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] metadata, pushes their
application down directly to the selected sources and, finally, generates enriched inputs
for a SPARQL-to-SQL process (R2RML mappings and an RDB instance). The
architecture of Morph-CSV is shown in Figure 1, where we present the steps to
exploit declarative annotations for enhancing SPARQL query translation over
tabular data: i) Source Selection: Using the SPARQL query and the
mapping rules, the engine selects only the sources (and the columns inside each
source) that are relevant to answer the input query. ii) Normalization: Two
functions for data normalization are implemented. The first
handles multi-valued columns, while the second handles
multiple entities in the same source. iii) Data Preparation: Three
functions are executed in this step. First, the engine performs all substitutions, such as
default values, NULL values and date formats; then, it creates a new column in
the specific source by applying the transformation function defined in RML+FnO;
finally, it removes all duplicates from the raw data. iv) Mapping
Translation: The mapping rules are translated from RML+FnO to a standard R2RML document [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], in accordance with the generated data. v) Schema
Creation and Load: An optimized SQL schema is generated applying integrity
constraints (PK-FK), and the selected data sources are loaded.
      </p>
      <p>[Figure 1: Architecture of Morph-CSV. Components recoverable from the figure: Source Selection; Extracting Constraints; Normalization and Data Preparation; Mapping Translation; RDB Schema Creation and Load; SPARQL-to-SQL Engine. Inputs labelled in the figure include RML+FnO.]</p>
      <p>1 https://doi.org/10.5281/zenodo.3572132</p>
      <p>2 https://morph.oeg.fi.upm.es/tool/morph-csv</p>
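      <p>The normalization, data preparation, and schema creation and load steps can be illustrated with a small, self-contained sketch that uses Python's standard csv and sqlite3 modules. The sample data, the "|" separator, the default value, and the table and column names are assumptions for illustration only; in Morph-CSV such details are taken from the CSVW annotations and RML+FnO mappings, and the engine's actual implementation may differ.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io
import sqlite3

# Hypothetical source: 'langs' is a multi-valued column, and one row is duplicated.
RAW = (
    "id,name,langs\n"
    "1,Ana,es|en\n"
    "2,Bob,\n"
    "1,Ana,es|en\n"
)
SEPARATOR = "|"       # assumed to be declared in the annotations
DEFAULT_LANG = "und"  # assumed default value for empty cells

rows = list(csv.DictReader(io.StringIO(RAW)))

# Data preparation: remove duplicate rows while preserving order.
unique_rows = list({tuple(r.items()): r for r in rows}.values())

# Normalization: move the multi-valued column into its own table.
people = [(int(r["id"]), r["name"]) for r in unique_rows]
person_langs = [
    (int(r["id"]), lang)
    for r in unique_rows
    # Data preparation: substitute the default for an empty cell.
    for lang in (r["langs"] or DEFAULT_LANG).split(SEPARATOR)
]

# Schema creation and load: explicit primary and foreign keys.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE person_lang ("
    " person_id INTEGER NOT NULL REFERENCES person(id),"
    " lang TEXT NOT NULL,"
    " PRIMARY KEY (person_id, lang))"
)
conn.executemany("INSERT INTO person VALUES (?, ?)", people)
conn.executemany("INSERT INTO person_lang VALUES (?, ?)", person_langs)
conn.commit()

# A join over the declared keys, of the kind a SPARQL-to-SQL engine could emit.
result = conn.execute(
    "SELECT p.name, l.lang FROM person p"
    " JOIN person_lang l ON l.person_id = p.id"
    " ORDER BY p.id, l.lang"
).fetchall()
```
]]></preformat>
      <p>In this sketch the multi-valued column becomes a separate table linked by a foreign key, so the join runs over declared keys instead of string operations on a single non-normalized table.</p>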
      <sec id="sec-2-1">
        <title>Use Cases</title>
        <p>
          In this demo we run Morph-CSV over two real use cases:
        </p>
        <p>
          1. Transport National Access Points (NAP). Since 2019, most European
countries have been required to publish transport data in accessible open query
points called National Access Points (NAP)<sup>3</sup>. The main issues related to
accessing transport data across Europe are how to deal with the
heterogeneity of these access points and data formats, and how to query them
efficiently. Using the de facto standard for publishing open transport data,
GTFS<sup>4</sup>, which is composed of a set of tabular sources, our engine
exploits RML+FnO and CSVW annotations to enable efficient and complete
access to GTFS data through SPARQL queries.
        </p>
        <p>
          2. Virtual KG over Bio2RDF. Bio2RDF [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is one of the most popular
projects that integrate and publish biomedical datasets using Semantic
Web technologies. Although its community has actively contributed to the
generation of these datasets, the integration is performed using ad hoc
programming scripts, which negatively affects the maintainability of the project;
as a result, SPARQL queries may return outdated results. Taking the
original tabular sources of Bio2RDF, Morph-CSV constructs a virtual KG over
them following a declarative approach, thereby improving the maintainability
of the project and ensuring up-to-date results from the SPARQL queries.
        </p>
        <p>3 https://ec.europa.eu/transport/themes/its/road/action_plan/nap_en</p>
        <p>4 https://developers.google.com/transit/gtfs</p>
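        <p>For the GTFS use case, source selection can be pictured as follows: given the predicates used in a SPARQL query and the mapping rules, only the files and columns that those predicates refer to need to be loaded. The sketch below uses a hypothetical, simplified mapping table and hypothetical predicate names; the file and column names follow the public GTFS specification, but the selection logic is an illustration under these assumptions and not the actual algorithm of Morph-CSV.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict

# Hypothetical, simplified mapping: predicate -> (source file, column).
MAPPING = {
    "gtfs:shortName": ("routes.txt", "route_short_name"),
    "gtfs:longName": ("routes.txt", "route_long_name"),
    "gtfs:headsign": ("trips.txt", "trip_headsign"),
    "gtfs:route": ("trips.txt", "route_id"),
    "foaf:name": ("stops.txt", "stop_name"),
}

# Hypothetical key columns that are always kept so that joins stay possible.
KEYS = {"routes.txt": "route_id", "trips.txt": "trip_id", "stops.txt": "stop_id"}


def select_sources(query_predicates):
    """Return, per source file, the sorted columns needed to answer the query."""
    selected = defaultdict(set)
    for predicate in query_predicates:
        if predicate in MAPPING:
            source, column = MAPPING[predicate]
            selected[source].update({column, KEYS[source]})
    return {source: sorted(columns) for source, columns in selected.items()}


# Predicates taken from the triple patterns of a hypothetical SPARQL query.
needed = select_sources(["gtfs:headsign", "gtfs:route"])
```
]]></preformat>
        <p>Here only trips.txt is selected, with three of its columns, so the remaining GTFS files would not need to be loaded for this hypothetical query.</p>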
        <p>
          The results obtained are shown on a landing page<sup>5</sup> and in a video<sup>6</sup>. Besides
the real use cases, we present the results obtained with Morph-CSV in terms of performance and
completeness, using two virtual knowledge graph benchmarks
from the state of the art (BSBM [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and GTFS-Madrid-Bench [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]), and two well-known
open source SPARQL-to-SQL engines (Morph-RDB and Ontop).
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Conclusions and Future Work</title>
        <p>Morph-CSV enhances virtual knowledge graph access over heterogeneous CSV
files. It takes as input a set of CSV files, CSVW annotations, and an RML+FnO
mapping, and generates as output an enriched RDB instance with the data from
the CSV files together with R2RML mappings, so that they can be used by any
state-of-the-art R2RML-compliant OBDA engine. As part of our future work,
we will improve its performance with new optimizations in the query translation
process. We will also extend it to other types of data (e.g., XML, JSON).</p>
        <p>Acknowledgements. The work presented in this paper is supported by the
Spanish Ministerio de Economía, Industria y Competitividad and EU FEDER
funds under the DATOS 4.0: RETOS Y SOLUCIONES - UPM Spanish national
project (TIN2016-78011-C4-4-R) and by an FPI grant (BES-2017-082511).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <p>5 https://morph.oeg.fi.upm.es/demo/morph-csv</p>
      <p>6 https://youtu.be/yzskzFSAMzA</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Belleau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nolin</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tourigny</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigault</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morissette</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Bio2RDF: towards a mashup to build bioinformatics knowledge systems</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>41</volume>
          (
          <issue>5</issue>
          ),
          <fpage>706</fpage>
          –
          <lpage>716</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schultz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The Berlin SPARQL Benchmark</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>24</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chaves-Fraga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Priyatna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimmino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toledo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruckhaus</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>GTFS-Madrid-Bench: A Benchmark for Virtual Knowledge Graph Access in the Transport Domain</article-title>
          .
          <source>Journal of Web Semantics</source>
          <volume>65</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Priyatna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaves-Fraga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Towards a new generation of ontology based data access</article-title>
          .
          <source>Semantic Web</source>
          <volume>11</volume>
          ,
          <fpage>153</fpage>
          –
          <lpage>160</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>De Meester</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maroy</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Declarative data transformations for Linked Data generation: the case of DBpedia</article-title>
          . In: European Semantic Web Conference. pp.
          <fpage>33</fpage>
          –
          <lpage>48</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander Sande</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colpaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>RML: a generic language for integrated RDF mappings of heterogeneous data</article-title>
          .
          In:
          <source>LDOW</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Priyatna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sequeda</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Formalisation and experiences of R2RML-based SPARQL to SQL query translation using morph</article-title>
          .
          In:
          <source>Proceedings of the 23rd International Conference on World Wide Web</source>
          . pp.
          <fpage>479</fpage>
          –
          <lpage>490</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tennison</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kellogg</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herman</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Model for tabular data and metadata on the web</article-title>
          .
          <source>W3C recommendation. World Wide Web Consortium (W3C)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>