<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mapping Spreadsheets to RDF: Supporting Excel in RML</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus Schroder</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Jilek</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Dengel</string-name>
          <email>andreas.dengelg@dfki.de</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Dept.</institution>
          ,
          <addr-line>TU Kaiserslautern</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Smart Data &amp; Knowledge Services Dept., DFKI GmbH</institution>
          ,
          <addr-line>Kaiserslautern</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The RDF Mapping Language (RML) enables, among other formats, the mapping of tabular data as Comma-Separated Values (CSV) les to RDF graphs. Unfortunately, the widely used spreadsheet format is currently neglected by its speci cation and well-known implementations. Therefore, we extended RML Mapper to support Microsoft Excel spreadsheet les and demonstrate its capabilities in an interactive online demo. Our approach allows to access various meta data of spreadsheet cells in typical RML maps. Some experimental features for more speci c use cases are also provided. The implementation code is publicly available in a GitHub fork.</p>
      </abstract>
      <kwd-group>
        <kwd>Spreadsheet</kwd>
        <kwd>Excel</kwd>
        <kwd>RML</kwd>
        <kwd>RDF</kwd>
        <kwd>Knowledge Graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As soon as knowledge graphs have to be constructed from given data, the
use of mapping languages is a common practice. Such languages let users de ne
declarative rules to map various input data formats to a complex interconnected
graph. In the Semantic Web community such graphs are usually modeled with
the Resource Description Framework (RDF) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and (semi-)structured data
is mapped with the RDF Mapping Language (RML) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] { a superset of the
W3C-recommended mapping language R2RML [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The speci cation of RML
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] describes how semi-structured data, like XML or JSON, and tables in form
of Comma-Separated Values (CSV) can be mapped properly. However, tabular
data can also be found in other le types, prominently in spreadsheets.
      </p>
      <p>Since the spreadsheet methodology enables a well understood, easy and fast
possibility to enter data, they are widely used by knowledge workers, especially in
the industrial sector. In contrast to simply structured CSV les, spreadsheets can
model complex workbooks containing multiple sheets with cells having rich meta
data. Besides its content, a single cell may store additional information such as its
appearance (colors, styles and borders) or cell comments. How these cells should
be lled or styled by users is not predetermined by spreadsheet applications
in general. As a consequence, in practice, cells can contain inconsistent and
unstructured content which can be arbitrarily arranged in a sheet. That is why a
mapping of such data to RDF can become a challenging task.</p>
      <p>
        Although spreadsheets are frequently used in industry, well-known RML-based
mappers such as RML Mapper3, CARML4, RocketRML [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or SDM-RDFizer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
currently do not support them natively. To x this issue, in this demo paper we
extended RML Mapper to support Microsoft Excel spreadsheet les. We have
chosen RML Mapper because its code structure lets us easily integrate our new
code. Natively supporting spreadsheets in RML has several advantages. There is
no need anymore to preprocess and transform spreadsheets in a format that can
be handled by a mapping tool (e.g. CSV). Compared to CSV-based processing
tools (like csvtk5), we are able to exploit all aspects of spreadsheets in a mapping.
This also eliminates extraneous conversion e orts for mapping experts and let
them focus on de ning proper rules. Having RML rules that directly refer to
spreadsheet data eases the communication with data providers about how their
spreadsheets will be mapped to a knowledge graph. Additionally, practitioners of
RML who already learned the language become able to map spreadsheets with
almost no extra e ort.
      </p>
      <p>In summary, this demo paper presents the following contributions to RML:
{ an implementation to access Excel spreadsheets via an RML logical source
in order to iterate (and also lter) cells in sheets
{ a mechanism to retrieve cell meta data by relative and absolute distances in
an RML reference formulation
{ two experimental features to map from one cell (a) information of di erent
properties and (b) multiple complex entities
{ a prototypical implementation of the features in RML Mapper
In the next section, we describe in detail how we realized the Excel support in
RML.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Supporting Excel in RML</title>
      <p>Our approach is implemented as an extension to the RML Mapper tool in a
separate GitHub fork6. For demonstration purpose, we additionally provide an
interactive demo page7 where visitors can try various mapping examples.</p>
      <p>At its core, our component utilizes the Apache POI8 library to read Microsoft
Excel spreadsheets as a rst step, but it is conceivable in future to support also
3 https://github.com/RMLio/rmlmapper-java
4 https://github.com/carml/carml
5 https://github.com/shenwei356/csvtk
6
https://github.com/mschroeder-github/rmlmapper-java/tree/mschroederfeatures
7 http://www.dfki.uni-kl.de/~mschroeder/demo/excel-rml
8 https://poi.apache.org
other spreadsheets such as LibreO ce or Google Spreadsheets by using
appropriate APIs. In order to refer to spreadsheet contents as a logical source, a small
spreadsheet ontology9 (usually pre xed with ss) was designed. As demonstrated
in Listing 1.1, using a spreadsheet reference formulation (Line 2), a workbook
source should name a spreadsheet le (Line 5), a referring sheet by name (Line
6) and a range of cells in the sheet (Line 7). A triples map that uses this logical
source will iterate over single cells in the given range.</p>
      <p>Listing 1.1: Exemplary RML de nition of a spreadsheet le as a logical source.
1 [ a rml : L o g i c a l S o u r c e ;
2 rml : r e f e r e n c e F o r m u l a t i o n q l : S p r e a d s h e e t ;
3 rml : s o u r c e [
4 a s s : Workbook ;
5 s s : u r l " workbook . x l s x " ;
6 s s : sheetName " Papers " ;
7 s s : range "A2 : A5" ;
8 s s : j a v a S c r i p t F i l t e r "/Knownnw∗ / . t e s t ( v a l u e S t r i n g ) " # o p t i o n a l
9 ]
10 ]</p>
      <p>Optionally, a lter can be added that picks certain cells based on a JavaScript
program (Line 8). Regarding our example, a regular expression is used to iterate
over only those cells which contain the phrase \Know". Using the expressive
script language and variables that represent cell meta data, appropriate lter
procedures can be realized. However, it is also conceivable to implement it as an
FnO function instead, which receives cell meta data and returns a boolean value.</p>
      <p>
        Listing 1.2: Demonstration of how cell meta data can be accessed in RML maps.
1 r r : subjectMap [
2 r r : template " http : / / example . org /f a d d r e s s g"
3 ] ;
4 r r : predicateObjectMap [
5 r r : predicateMap [
6 r r : template " http : / / example . org / f [
        <xref ref-type="bibr" rid="ref2">2 , 0</xref>
        ] . v a l u e S t r i n g g"
7 ] ;
8 r r : objectMap [
9 rml : r e f e r e n c e " ( 2 , 0 ) . valueNumeric "
10 ]
11 ]
      </p>
      <p>In various RML maps, meta data of the currently iterated cell can be retrieved, as
demonstrated in Listing 1.2. For example in Line 2, a template expression inserts
the cell's address (like \A2") to form a unique URI for a subject resource. However,
often it is required to refer to nearby cells in order to construct appropriate
statements. To refer to other cells relatively from the current cell, we introduce
a parenthesis notation, like (column,row). A column shift (x-axis of the sheet)
and a row shift (y-axis of the sheet) allows to reach any cell relative to the cell
representing the subject resource. Regarding Line 9 of our example, numeric
values which are located two columns away on the right are used as objects in
the mapped statements. In a similar way, cells can be referenced absolutely by
using square brackets (Line 6). We use relative and absolute distances, because
9 http://www.dfki.uni-kl.de/~mschroeder/ld/ss
spreadsheets in general can contain arbitrarily structured tables without any
anchor points like column names (in contrast to CSVs).</p>
      <p>As already mentioned, spreadsheet cells have several meta data values that
need to be retrievable in a mapping through variables (written in the following
with monospaced font). A cell's location in a sheet can be accessed either as a usual
spreadsheet address or as column and row indices. Since a cell (if not empty)
stores either a string value or a numeric value, various possibilities are given to
access its content: valueNumeric retrieves a oating point value (valueInt an
integer value), valueBoolean a boolean value, valueFormula if a cell contains a
formula (valueError to get its possible error code) and valueString retrieves
its text content. Besides content, one could also be interested in the appearance of
a cell in a sheet: since it can be colored, backgroundColor and foregroundColor
can be queried as a hexadecimal RGB value. Regarding the used font, fontColor,
fontName and fontSize are available. On a more ne-grained level, cells can
also store formatted text which is returned by valueRichText. The formatted
text is represented in an HTML syntax to enable usual XML-parsers to access
tags and attributes, for example,
\ b¡ i¡ font face='Arial' color='# 0000'¡red, italic and bold /font¡ /i¡ /b¡".</p>
      <p>If spreadsheets were completed in an inconsistent manner, it happens that
cells were unintentionally lled with di erent data types. For such cases, one can
use the value variable to always obtain a string representation of a cell's content
regardless of its cell type. However, if this is not su cient because more details
are needed, the json variable could be used to retrieve a JavaScript Object
Notation (JSON) representation of a cell. The JSON object contains, besides the
cell type, various data types mentioned above.</p>
      <p>In our demo implementation, we do not retrieve all aspects of a spreadsheet
(e.g. border thickness), yet they could easily be added to our code if desired.
2.1</p>
      <p>Experimental Features
We also would like to propose some experimental features that we found useful in
our use cases. However, the o cial introduction of these extensions would require
to change the RML speci cation on some points.</p>
      <p>
        Multiple Di erent Properties in a Cell. In our use cases, we frequently
experienced that users record multiple information in one cell. Each piece of
information ij P I then corresponded to a di erent property pj P P . Following the
current RML speci cation, one has to de ne for each pj P P a separate
predicateobject map. Instead, we propose a shortcut such that only one predicate-object
map is needed. This is done by allowing that an RDF list of properties pj P P
can be passed in a predicate map. By implementing a suitable Function Ontology
(FnO) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] procedure, we extract information pieces ij P I from a cell's content
and return them as an object list oj P O. Usually, RML provides that a Cartesian
product is formed between predicates and objects which is P O. However,
in our case we need to zip predicate and object lists such that the following
set is made: tppj; ojqu. Thus, instead of all possible pairs of the P O matrix,
only the diagonal ones are selected. This new behavior is activated by adding a
ss:zip true statement to a predicate-object map.
      </p>
      <p>Multiple Complex Entities in a Cell. Similarly to the previous observation,
we often discovered that users mention several entities in a single cell. A prominent
example is a list of persons having rst and last names, for instance book authors.
Represented as an RDF graph, such complex entities potentially require several
statements to be fully expressed. An FnO function that is able to perform entity
extraction would need to return an arbitrary large RDF graph in a certain
serialization format (e.g. Turtle) instead of a single value. In order to integrate
such a return value in an object map, we de ne a new term type ss:Graph.
Once this term type is chosen, the returning RDF graph is parsed and added
to the emerging knowledge graph. To prevent that all returning resources are
mapped in the object map, a special ss:SelectedObjects resource together
with rr:object statements should be returned by the function, too. This way, all
resources are added to the knowledge graph, yet only selected ones are mapped.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>In the past, several similar approaches were implemented that map spreadsheets
to RDF using a language. Domain speci c languages (DSL) other than RML are
provided to let knowledge engineers express how input data shall be mapped to
a graph structure.</p>
      <p>
        Spread2RDF10 uses a Ruby-internal language while M 2 (Mapping Master)
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] builds upon a compact syntax for OWL ontologies (Manchester syntax).
Sheet2RDF [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] uses a special ProjEction of Annotations Rule Language (PEARL),
whereas XLWrap [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] utilizes template graphs together with a special expression
language to refer to contents of sheets. A di erent approach is followed by
TabLinker11 which requires that input spreadsheets are annotated in advance
with certain styles.
      </p>
      <p>With our implementation, we combine the feature of mapping spreadsheets
with the advantages of the well speci ed RML approach. Those who already
used the language and need to process spreadsheets do not have to look for an
alternative anymore.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Outlook</title>
      <p>In this paper we discussed that spreadsheets, despite being widely used, are still
not supported by the RDF Mapping Language (RML). We therefore proposed a
rst solution which extends the already existing RML Mapper tool with necessary
components. Additionally, experimental features were introduced too. A publicly
available web page12 demonstrates the new capabilities.
10 https://github.com/marcelotto/spread2rdf
11 https://github.com/Data2Semantics/TabLinker
12 http://www.dfki.uni-kl.de/~mschroeder/demo/excel-rml</p>
      <p>
        In the future, we plan to integrate our proposed solution in the RML speci
cation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] so that other RML tools may support spreadsheets too. Further, we intend
to run several performance tests (e.g. memory consumption) and compliance
tests with our current reference implementation. This could be included in the
R2RML implementation report13.
      </p>
      <p>Acknowledgements This work was funded by the BMBF project SensAI (grant
no. 01IW20007).
13 https://github.com/kg-construct/r2rml-implementation-report</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sande</surname>
            ,
            <given-names>M.V.</given-names>
          </string-name>
          :
          <article-title>RDF Mapping Language (RML)</article-title>
          . https://rml.io/specs/ rml/ (
          <year>2020</year>
          ), accessed:
          <fpage>2021</fpage>
          -02-01
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander</surname>
            <given-names>Sande</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Mannens</surname>
          </string-name>
          , E., Van de Walle, R.:
          <article-title>RML: a generic language for integrated RDF mappings of heterogeneous data</article-title>
          .
          <source>In: Proceedings of the 7th Workshop on Linked Data on the Web. CEUR Workshop Proceedings</source>
          , vol.
          <volume>1184</volume>
          (
          <year>Apr 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fiorelli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lorenzetti</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pazienza</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stellato</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turbati</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Sheet2rdf: a exible and dynamic spreadsheet import&amp;lifting framework for RDF</article-title>
          .
          <source>In: Current Approaches in Applied Arti cial Intelligence - 28th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE</source>
          <year>2015</year>
          , Seoul, South Korea, June 10-12,
          <year>2015</year>
          ,
          <source>Proceedings. Lecture Notes in Computer Science</source>
          , vol.
          <volume>9101</volume>
          , pp.
          <volume>131</volume>
          {
          <fpage>140</fpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Iglesias</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jozashoori</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaves-Fraga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Collarana</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Sdmrd zer: An RML interpreter for the e cient creation of RDF knowledge graphs</article-title>
          .
          <source>In: CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Ireland, October 19-23</source>
          ,
          <year>2020</year>
          . pp.
          <volume>3039</volume>
          {
          <fpage>3046</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Langegger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Wo , W.:
          <article-title>Xlwrap - querying and integrating arbitrary spreadsheets with SPARQL</article-title>
          .
          <source>In: The Semantic Web - ISWC</source>
          <year>2009</year>
          , 8th International Semantic Web Conference,
          <string-name>
            <surname>ISWC</surname>
          </string-name>
          <year>2009</year>
          ,
          <article-title>Chantilly</article-title>
          ,
          <string-name>
            <surname>VA</surname>
          </string-name>
          , USA, October
          <volume>25</volume>
          -
          <issue>29</issue>
          ,
          <year>2009</year>
          .
          <source>Proceedings. Lecture Notes in Computer Science</source>
          , vol.
          <volume>5823</volume>
          , pp.
          <volume>359</volume>
          {
          <fpage>374</fpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Meester</surname>
            ,
            <given-names>B.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>An ontology to semantically declare and describe functions</article-title>
          .
          <source>In: The Semantic Web - ESWC 2016 Satellite Events</source>
          , Heraklion, Crete, Greece, May 29 - June 2,
          <year>2016</year>
          ,
          <source>Revised Selected Papers. Lecture Notes in Computer Science</source>
          , vol.
          <volume>9989</volume>
          , pp.
          <volume>46</volume>
          {
          <issue>49</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>O</given-names>
            <surname>'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            ,
            <surname>Halaschek-Wiener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Musen</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.A.</surname>
          </string-name>
          :
          <article-title>M2: A language for mapping spreadsheets to OWL. OWL: Experiences and Directions (OWLED)</article-title>
          ,
          <source>Sixth International Workshop</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Simsek</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          , Karle, E.,
          <string-name>
            <surname>Fensel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Rocketrml - A nodejs implementation of a use case speci c RML mapper</article-title>
          .
          <source>In: Joint Proceedings of the 1st International Workshop on Knowledge Graph Building and 1st International Workshop on Large Scale RDF Analytics (ESWC</source>
          <year>2019</year>
          ), Portoroz, Slovenia, June 3,
          <year>2019</year>
          .
          <source>CEUR Workshop Proceedings</source>
          , vol.
          <volume>2489</volume>
          , pp.
          <volume>46</volume>
          {
          <fpage>53</fpage>
          .
          <string-name>
            <surname>CEUR-WS.org</surname>
          </string-name>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. World Wide Web Consortium:
          <article-title>R2RML: RDB to RDF mapping language</article-title>
          . https: //www.w3.org/TR/r2rml/ (
          <year>2012</year>
          ), accessed:
          <fpage>2021</fpage>
          -02-01
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <source>World Wide Web Consortium: RDF 1.1 Primer</source>
          . https://www.w3.org/TR/rdf11- primer/ (
          <year>2014</year>
          ), accessed:
          <fpage>2021</fpage>
          -02-01
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>