=Paper= {{Paper |id=Vol-2873/paper3 |storemode=property |title=Mapping Spreadsheets to RDF: Supporting Excel in RML |pdfUrl=https://ceur-ws.org/Vol-2873/paper3.pdf |volume=Vol-2873 |authors=Markus Schröder,Christian Jilek,Andreas Dengel |dblpUrl=https://dblp.org/rec/conf/esws/SchroderJ021 }} ==Mapping Spreadsheets to RDF: Supporting Excel in RML== https://ceur-ws.org/Vol-2873/paper3.pdf
                Mapping Spreadsheets to RDF:
                 Supporting Excel in RML

              Markus Schröder, Christian Jilek, and Andreas Dengel
1
    Smart Data & Knowledge Services Dept., DFKI GmbH, Kaiserslautern, Germany
              2
                Computer Science Dept., TU Kaiserslautern, Germany
        {markus.schroeder, christian.jilek, andreas.dengel}@dfki.de



        Abstract. The RDF Mapping Language (RML) enables, among other
        formats, the mapping of tabular data as Comma-Separated Values (CSV)
        files to RDF graphs. Unfortunately, the widely used spreadsheet format
        is currently neglected by its specification and well-known implementa-
        tions. Therefore, we extended RML Mapper to support Microsoft Excel
        spreadsheet files and demonstrate its capabilities in an interactive online
        demo. Our approach allows to access various meta data of spreadsheet
        cells in typical RML maps. Some experimental features for more specific
        use cases are also provided. The implementation code is publicly available
        in a GitHub fork.

        Keywords: Spreadsheet · Excel · RML · RDF · Knowledge Graph


1     Introduction


    As soon as knowledge graphs have to be constructed from given data, the
use of mapping languages is a common practice. Such languages let users define
declarative rules to map various input data formats to a complex interconnected
graph. In the Semantic Web community such graphs are usually modeled with
the Resource Description Framework (RDF) [10] and (semi-)structured data
is mapped with the RDF Mapping Language (RML) [2] – a superset of the
W3C-recommended mapping language R2RML [9]. The specification of RML
[1] describes how semi-structured data, like XML or JSON, and tables in form
of Comma-Separated Values (CSV) can be mapped properly. However, tabular
data can also be found in other file types, prominently in spreadsheets.
    Since the spreadsheet methodology enables a well understood, easy and fast
possibility to enter data, they are widely used by knowledge workers, especially in
the industrial sector. In contrast to simply structured CSV files, spreadsheets can
model complex workbooks containing multiple sheets with cells having rich meta
data. Besides its content, a single cell may store additional information such as its
appearance (colors, styles and borders) or cell comments. How these cells should
    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).
         Markus Schröder, Christian Jilek, and Andreas Dengel

be filled or styled by users is not predetermined by spreadsheet applications
in general. As a consequence, in practice, cells can contain inconsistent and
unstructured content which can be arbitrarily arranged in a sheet. That is why a
mapping of such data to RDF can become a challenging task.
   Although spreadsheets are frequently used in industry, well-known RML-based
mappers such as RML Mapper3 , CARML4 , RocketRML [8] or SDM-RDFizer [4]
currently do not support them natively. To fix this issue, in this demo paper we
extended RML Mapper to support Microsoft Excel spreadsheet files. We have
chosen RML Mapper because its code structure lets us easily integrate our new
code. Natively supporting spreadsheets in RML has several advantages. There is
no need anymore to preprocess and transform spreadsheets in a format that can
be handled by a mapping tool (e.g. CSV). Compared to CSV-based processing
tools (like csvtk5 ), we are able to exploit all aspects of spreadsheets in a mapping.
This also eliminates extraneous conversion efforts for mapping experts and let
them focus on defining proper rules. Having RML rules that directly refer to
spreadsheet data eases the communication with data providers about how their
spreadsheets will be mapped to a knowledge graph. Additionally, practitioners of
RML who already learned the language become able to map spreadsheets with
almost no extra effort.
   In summary, this demo paper presents the following contributions to RML:
 – an implementation to access Excel spreadsheets via an RML logical source
   in order to iterate (and also filter) cells in sheets
 – a mechanism to retrieve cell meta data by relative and absolute distances in
   an RML reference formulation
 – two experimental features to map from one cell (a) information of different
   properties and (b) multiple complex entities
 – a prototypical implementation of the features in RML Mapper
In the next section, we describe in detail how we realized the Excel support in
RML.


2    Supporting Excel in RML
Our approach is implemented as an extension to the RML Mapper tool in a
separate GitHub fork6 . For demonstration purpose, we additionally provide an
interactive demo page7 where visitors can try various mapping examples.
    At its core, our component utilizes the Apache POI8 library to read Microsoft
Excel spreadsheets as a first step, but it is conceivable in future to support also
3
  https://github.com/RMLio/rmlmapper-java
4
  https://github.com/carml/carml
5
  https://github.com/shenwei356/csvtk
6
  https://github.com/mschroeder-github/rmlmapper-java/tree/mschroeder-
  features
7
  http://www.dfki.uni-kl.de/~mschroeder/demo/excel-rml
8
  https://poi.apache.org
                                  Mapping Spreadsheets to RDF: Supporting Excel in RML

     other spreadsheets such as LibreOffice or Google Spreadsheets by using appropri-
     ate APIs. In order to refer to spreadsheet contents as a logical source, a small
     spreadsheet ontology9 (usually prefixed with ss) was designed. As demonstrated
     in Listing 1.1, using a spreadsheet reference formulation (Line 2), a workbook
     source should name a spreadsheet file (Line 5), a referring sheet by name (Line
     6) and a range of cells in the sheet (Line 7). A triples map that uses this logical
     source will iterate over single cells in the given range.

     Listing 1.1: Exemplary RML definition of a spreadsheet file as a logical source.
 1   [ a rml : L o g i c a l S o u r c e ;
 2     rml : r e f e r e n c e F o r m u l a t i o n q l : S p r e a d s h e e t ;
 3     rml : s o u r c e [
 4       a s s : Workbook ;
 5       s s : u r l ” workbook . x l s x ” ;
 6       s s : sheetName ” P a p e r s ” ;
 7       s s : r a n g e ”A2 : A5” ;
 8       s s : j a v a S c r i p t F i l t e r ”/Know\\w∗ / . t e s t ( v a l u e S t r i n g ) ” # o p t i o n a l
 9     ]
10   ]


     Optionally, a filter can be added that picks certain cells based on a JavaScript
     program (Line 8). Regarding our example, a regular expression is used to iterate
     over only those cells which contain the phrase “Know”. Using the expressive
     script language and variables that represent cell meta data, appropriate filter
     procedures can be realized. However, it is also conceivable to implement it as an
     Fn O function instead, which receives cell meta data and returns a boolean value.

     Listing 1.2: Demonstration of how cell meta data can be accessed in RML maps.
 1   r r : subjectMap [
 2       r r : t e m p l a t e ” h t t p : / / example . o r g /{ a d d r e s s }”
 3   ] ;
 4   r r : predicateObjectMap [
 5       r r : predicateMap              [
 6           r r : t e m p l a t e ” h t t p : / / example . o r g / { [ 2 , 0 ] . v a l u e S t r i n g }”
 7       ] ;
 8       r r : objectMap [
 9           rml : r e f e r e n c e ” ( 2 , 0 ) . v a l u e N u m e r i c ”
10       ]
11   ]


     In various RML maps, meta data of the currently iterated cell can be retrieved, as
     demonstrated in Listing 1.2. For example in Line 2, a template expression inserts
     the cell’s address (like “A2”) to form a unique URI for a subject resource. However,
     often it is required to refer to nearby cells in order to construct appropriate
     statements. To refer to other cells relatively from the current cell, we introduce
     a parenthesis notation, like (column,row). A column shift (x-axis of the sheet)
     and a row shift (y-axis of the sheet) allows to reach any cell relative to the cell
     representing the subject resource. Regarding Line 9 of our example, numeric
     values which are located two columns away on the right are used as objects in
     the mapped statements. In a similar way, cells can be referenced absolutely by
     using square brackets (Line 6). We use relative and absolute distances, because
     9
         http://www.dfki.uni-kl.de/~mschroeder/ld/ss
         Markus Schröder, Christian Jilek, and Andreas Dengel

spreadsheets in general can contain arbitrarily structured tables without any
anchor points like column names (in contrast to CSVs).
    As already mentioned, spreadsheet cells have several meta data values that
need to be retrievable in a mapping through variables (written in the following
with monospaced font). A cell’s location in a sheet can be accessed either as a usual
spreadsheet address or as column and row indices. Since a cell (if not empty)
stores either a string value or a numeric value, various possibilities are given to
access its content: valueNumeric retrieves a floating point value (valueInt an
integer value), valueBoolean a boolean value, valueFormula if a cell contains a
formula (valueError to get its possible error code) and valueString retrieves
its text content. Besides content, one could also be interested in the appearance of
a cell in a sheet: since it can be colored, backgroundColor and foregroundColor
can be queried as a hexadecimal RGB value. Regarding the used font, fontColor,
fontName and fontSize are available. On a more fine-grained level, cells can
also store formatted text which is returned by valueRichText. The formatted
text is represented in an HTML syntax to enable usual XML-parsers to access
tags and attributes, for example,
“ăbąăiąăfont face=’Arial’ color=’#ff0000’ąred, italic and boldă/fontąă/iąă/bą”.
    If spreadsheets were completed in an inconsistent manner, it happens that
cells were unintentionally filled with different data types. For such cases, one can
use the value variable to always obtain a string representation of a cell’s content
regardless of its cell type. However, if this is not sufficient because more details
are needed, the json variable could be used to retrieve a JavaScript Object
Notation (JSON) representation of a cell. The JSON object contains, besides the
cell type, various data types mentioned above.
    In our demo implementation, we do not retrieve all aspects of a spreadsheet
(e.g. border thickness), yet they could easily be added to our code if desired.


2.1   Experimental Features

We also would like to propose some experimental features that we found useful in
our use cases. However, the official introduction of these extensions would require
to change the RML specification on some points.
Multiple Different Properties in a Cell. In our use cases, we frequently
experienced that users record multiple information in one cell. Each piece of
information ij P I then corresponded to a different property pj P P . Following the
current RML specification, one has to define for each pj P P a separate predicate-
object map. Instead, we propose a shortcut such that only one predicate-object
map is needed. This is done by allowing that an RDF list of properties pj P P
can be passed in a predicate map. By implementing a suitable Function Ontology
(Fn O) [6] procedure, we extract information pieces ij P I from a cell’s content
and return them as an object list oj P O. Usually, RML provides that a Cartesian
product is formed between predicates and objects which is P ˆ O. However,
in our case we need to zip predicate and object lists such that the following
set is made: tppj , oj qu. Thus, instead of all possible pairs of the P ˆ O matrix,
                   Mapping Spreadsheets to RDF: Supporting Excel in RML

only the diagonal ones are selected. This new behavior is activated by adding a
ss:zip true statement to a predicate-object map.
Multiple Complex Entities in a Cell. Similarly to the previous observation,
we often discovered that users mention several entities in a single cell. A prominent
example is a list of persons having first and last names, for instance book authors.
Represented as an RDF graph, such complex entities potentially require several
statements to be fully expressed. An Fn O function that is able to perform entity
extraction would need to return an arbitrary large RDF graph in a certain
serialization format (e.g. Turtle) instead of a single value. In order to integrate
such a return value in an object map, we define a new term type ss:Graph.
Once this term type is chosen, the returning RDF graph is parsed and added
to the emerging knowledge graph. To prevent that all returning resources are
mapped in the object map, a special ss:SelectedObjects resource together
with rr:object statements should be returned by the function, too. This way, all
resources are added to the knowledge graph, yet only selected ones are mapped.


3    Related Work

In the past, several similar approaches were implemented that map spreadsheets
to RDF using a language. Domain specific languages (DSL) other than RML are
provided to let knowledge engineers express how input data shall be mapped to
a graph structure.
    Spread2RDF10 uses a Ruby-internal language while M 2 (Mapping Master)
[7] builds upon a compact syntax for OWL ontologies (Manchester syntax).
Sheet2RDF [3] uses a special ProjEction of Annotations Rule Language (PEARL),
whereas XLWrap [5] utilizes template graphs together with a special expression
language to refer to contents of sheets. A different approach is followed by
TabLinker11 which requires that input spreadsheets are annotated in advance
with certain styles.
    With our implementation, we combine the feature of mapping spreadsheets
with the advantages of the well specified RML approach. Those who already
used the language and need to process spreadsheets do not have to look for an
alternative anymore.


4    Conclusion and Outlook

In this paper we discussed that spreadsheets, despite being widely used, are still
not supported by the RDF Mapping Language (RML). We therefore proposed a
first solution which extends the already existing RML Mapper tool with necessary
components. Additionally, experimental features were introduced too. A publicly
available web page12 demonstrates the new capabilities.
10
   https://github.com/marcelotto/spread2rdf
11
   https://github.com/Data2Semantics/TabLinker
12
   http://www.dfki.uni-kl.de/~mschroeder/demo/excel-rml
          Markus Schröder, Christian Jilek, and Andreas Dengel

    In the future, we plan to integrate our proposed solution in the RML specifica-
tion [1] so that other RML tools may support spreadsheets too. Further, we intend
to run several performance tests (e.g. memory consumption) and compliance
tests with our current reference implementation. This could be included in the
R2RML implementation report13 .

Acknowledgements This work was funded by the BMBF project SensAI (grant
no. 01IW20007).


References
 1. Dimou, A., Sande, M.V.: RDF Mapping Language (RML). https://rml.io/specs/
    rml/ (2020), accessed: 2021-02-01
 2. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de
    Walle, R.: RML: a generic language for integrated RDF mappings of heterogeneous
    data. In: Proceedings of the 7th Workshop on Linked Data on the Web. CEUR
    Workshop Proceedings, vol. 1184 (Apr 2014)
 3. Fiorelli, M., Lorenzetti, T., Pazienza, M.T., Stellato, A., Turbati, A.: Sheet2rdf: a
    flexible and dynamic spreadsheet import&lifting framework for RDF. In: Current
    Approaches in Applied Artificial Intelligence - 28th International Conference on
    Industrial, Engineering and Other Applications of Applied Intelligent Systems,
    IEA/AIE 2015, Seoul, South Korea, June 10-12, 2015, Proceedings. Lecture Notes
    in Computer Science, vol. 9101, pp. 131–140. Springer (2015)
 4. Iglesias, E., Jozashoori, S., Chaves-Fraga, D., Collarana, D., Vidal, M.: Sdm-
    rdfizer: An RML interpreter for the efficient creation of RDF knowledge graphs. In:
    CIKM ’20: The 29th ACM International Conference on Information and Knowledge
    Management, Ireland, October 19-23, 2020. pp. 3039–3046. ACM (2020)
 5. Langegger, A., Wöß, W.: Xlwrap - querying and integrating arbitrary spreadsheets
    with SPARQL. In: The Semantic Web - ISWC 2009, 8th International Semantic Web
    Conference, ISWC 2009, Chantilly, VA, USA, October 25-29, 2009. Proceedings.
    Lecture Notes in Computer Science, vol. 5823, pp. 359–374. Springer (2009)
 6. Meester, B.D., Dimou, A., Verborgh, R., Mannens, E.: An ontology to semantically
    declare and describe functions. In: The Semantic Web - ESWC 2016 Satellite Events,
    Heraklion, Crete, Greece, May 29 - June 2, 2016, Revised Selected Papers. Lecture
    Notes in Computer Science, vol. 9989, pp. 46–49 (2016)
 7. O’Connor, M.J., Halaschek-Wiener, C., Musen, M.A.: M2: A language for map-
    ping spreadsheets to OWL. OWL: Experiences and Directions (OWLED), Sixth
    International Workshop (2010)
 8. Simsek, U., Kärle, E., Fensel, D.: Rocketrml - A nodejs implementation of a use
    case specific RML mapper. In: Joint Proceedings of the 1st International Workshop
    on Knowledge Graph Building and 1st International Workshop on Large Scale
    RDF Analytics (ESWC 2019), Portorož, Slovenia, June 3, 2019. CEUR Workshop
    Proceedings, vol. 2489, pp. 46–53. CEUR-WS.org (2019)
 9. World Wide Web Consortium: R2RML: RDB to RDF mapping language. https:
    //www.w3.org/TR/r2rml/ (2012), accessed: 2021-02-01
10. World Wide Web Consortium: RDF 1.1 Primer. https://www.w3.org/TR/rdf11-
    primer/ (2014), accessed: 2021-02-01

13
     https://github.com/kg-construct/r2rml-implementation-report