=Paper= {{Paper |id=Vol-2489/paper5 |storemode=property |title=RocketRML - A NodeJS Implementation of a Use Case Specific RML Mapper |pdfUrl=https://ceur-ws.org/Vol-2489/paper5.pdf |volume=Vol-2489 |authors=Umutcan Şimşek,Elias Kärle,Dieter Fensel |dblpUrl=https://dblp.org/rec/conf/esws/SimsekKF19 }} ==RocketRML - A NodeJS Implementation of a Use Case Specific RML Mapper== https://ceur-ws.org/Vol-2489/paper5.pdf
    RocketRML - A NodeJS Implementation of a
          Use-Case Specific RML Mapper

    Umutcan Şimşek[0000−0001−6459−474X], Elias Kärle[0000−0002−2686−3221], and
                                   Dieter Fensel

                      Semantic Technology Institute Innsbruck
               Department of Computer Science, University of Innsbruck
                           firstname.lastname@sti2.at



        Abstract. The creation of Linked Data from raw data sources is, in
        theory, no rocket science (pun intended). Depending on the nature of the
        input and the mapping technology in use, it can become a quite tedious
        task. For our work on mapping real-life touristic data to the schema.org
        vocabulary, we used RML but soon found that the existing Java
        mapper implementations reached their limits and were not sufficient for
        our use cases. In this system paper, we describe a new implementation of
        an RML mapper. Written with the JavaScript-based NodeJS framework
        it performs quite well for our use cases where we work with large XML
        and JSON files. The performance testing and the execution of the RML
        test cases have shown that the implementation has great potential to
        perform heavy mapping tasks in reasonable time, but comes with some
        limitations regarding JOINs, Named Graphs and inputs other than XML
        and JSON - which is fine at the moment, due to the nature of the given
        use cases.¹

        Keywords: RML · RML Mapper · RDF generation · NodeJS


1     Introduction

During our work on the semantify.it platform [3] we implemented mappings
from different data sources to schema.org programmatically. When we started
our work on the Tyrolean Tourism Knowledge Graph [4], the number of data
sources, data providers and use cases grew, and it quickly turned out that the
programmatic approach does not scale. In a literature review we found that
RML [2] looked very promising and would fit our needs perfectly. As an extension
of R2RML, RML not only supports relational database inputs, but also other
sources like XML and JSON. While working with real-life data from touristic IT
solution providers, we encountered the challenge that the input data may exceed
500MB. A list of hotel room offers in a region for a given time span, or a list of
events in a given region for half a year, is a lot of data to process. We soon
found that existing RML mapper implementations reached a performance limit
that made them infeasible for our use cases.

1
    Copyright ©2019 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).

    For another project of ours, the MindLab² project, we identified an
additional requirement: some of the data we have do not contain the primary
and foreign keys necessary for joins (e.g. a local business and its address). After we
collected requirements from different use cases, we decided to implement an RML
mapper that covers our needs. The requirements for the new implementation were
(in arbitrary order):
    – supporting XML and JSON input primarily, then expanding to other formats
    – handling nested objects that do not have any fields to join
    – working with larger files (e.g. >500MB)
    – integrating with our existing NodeJS infrastructure
    In this paper we describe RocketRML, a use case specific NodeJS implementation
of the RML mapper. The implementation does not fully cover the RML
specification. It does not, for example, (yet) support JOINs or Named Graphs.
It introduces two additional features compared to the standard RML Mapper
implementation, namely a global language tag for string literals and the mapping
of nested objects where no identifiers exist.
    The remainder of the paper is structured as follows: Section 2 describes our
tool, its limitations and customizations, Section 3 describes the results of running
our mapper against the RML test cases³, and Section 4 discusses the implementation
and our next steps and concludes the paper.

2      Tool Presentation
RocketRML⁴ is a NodeJS implementation of the RML mapper. It supports the
subset of the RML specification that is needed for our use cases described in
Section 1. It covers most of the functionality the RML Mapper⁵ provides. In
this section, we explain the current limitations and deviations of our implementation
compared to the standard RML Mapper implementation, as well as the results of our
preliminary performance tests.

2.1     Limitations
No support for JOINs The main reason for currently not supporting JOINs
lies in the data we obtain from a good portion of IT solution providers in the
tourism field: the objects are typically nested and do not have any
field that could serve as a joining point. Therefore applying joins between two
mappings (e.g. joining hotels and their rooms) is not possible without exploiting
the structure of the objects (i.e. how they are nested). For this purpose, we
customized the way iterators work in our implementation (see Section 2.3).
2
  https://mindlab.ai
3
  https://github.com/RMLio/rml-test-cases
4
  https://github.com/semantifyit/RML-mapper
5
  https://github.com/RMLio/rmlmapper-java

No support for Named Graphs Although we make heavy use of named graphs
[1] for provenance tracking and versioning purposes, in our use case, the named
graphs and provenance information are not part of generating RDF from a raw
data source at the moment. Therefore RocketRML currently does not support
generating quads.


Only JSON and XML formats are supported in a logical source In all of our current
use cases, the logical sources are JSON and XML files. Therefore we currently
support only these two formats as input. This means that relational-database-specific
features like SQL views as logical sources are also not supported. We will
add support for new logical sources (e.g. CSV files) as we need them.


Only JavaScript function implementations are supported We support the function
extension of RML; however, the function implementation must be provided
in JavaScript.
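
As a rough illustration of what providing a function implementation in JavaScript can look like, the sketch below keeps a plain object that maps function IRIs to JavaScript functions and applies one of them to a value. The IRIs under `http://example.com/functions#`, the `functions` object and the `applyFunction` helper are hypothetical names chosen for this example; they are not taken from the RocketRML API.

```javascript
// Hypothetical registry: function IRI -> JavaScript implementation.
const functions = {
  'http://example.com/functions#toUpperCase': (args) => String(args[0]).toUpperCase(),
  'http://example.com/functions#concat': (args) => args.join(''),
};

// Look up the implementation for an IRI and call it with the given arguments.
function applyFunction(iri, args) {
  const impl = functions[iri];
  if (typeof impl !== 'function') {
    throw new Error(`No JavaScript implementation registered for ${iri}`);
  }
  return impl(args);
}

console.log(applyFunction('http://example.com/functions#toUpperCase', ['Seefeld']));
```

A mapper that only looks functions up by IRI in such a table can stay independent of what the functions do, which is one reason to restrict implementations to the host language.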


2.2    Performance tests

One motivation for developing RocketRML was the performance issues we had
with large files. These were mainly due to the external libraries used in the
Java-based implementations to parse the input files. We did a preliminary performance
test to compare three implementations, namely the legacy RML Mapper
(RML-Mapper), RML Mapper Java (rmlmapper-java) and RocketRML (Figures 1
and 2)⁶. We measured how the time required for mapping changes as the number of
objects to map increases. We tested all implementations with the same array of
randomly generated objects for both XML and JSON inputs⁷. For each object,
the same mapping file was used. Each JSON and XML object produces 5
triples. The tests were run on a Lenovo T470s laptop with 16GB RAM
and an Intel Core i7 2.7 GHz quad-core CPU. The results show that RocketRML
runs significantly faster for our use case. It can also be seen that RocketRML
performs especially well with JSON input, due to the native JSON support of
NodeJS. In fact, we convert the mapping files to JSON-LD at the beginning for
easier manipulation. Additionally, the generated RDF data is initially in JSON-LD
format. Another reason we can think of is the lack of certain features like
JOINs, which removes the overhead of separately mapping all objects and
then joining the relevant ones. On top of that, the Java implementations may be
performing poorly due to the limitations of the external libraries used for parsing
input files and applying JSONPath and XPath queries. Such components could
be tested separately to isolate the bottleneck.
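
The measurement idea described above (time versus number of objects, five triples per object) can be sketched as follows. This is only an assumed harness: `generateObjects`, `mapObject` and `timeMapping` are hypothetical helpers, the object shape is invented for the example, and `mapObject` is a trivial placeholder rather than any of the three mappers that were compared.

```javascript
// Generate n random objects (hypothetical shape, for illustration only).
function generateObjects(n) {
  const objects = [];
  for (let i = 0; i < n; i += 1) {
    objects.push({
      id: i,
      name: `name-${Math.random().toString(36).slice(2)}`,
      street: `street-${i}`,
      postcode: String(1000 + (i % 9000)),
      city: `city-${i % 100}`,
    });
  }
  return objects;
}

// Placeholder mapper: emits 5 triples per object, mirroring the test setup.
function mapObject(obj) {
  const subject = `http://example.com/object/${obj.id}`;
  return [
    [subject, 'rdf:type', 'schema:Thing'],
    [subject, 'schema:name', obj.name],
    [subject, 'schema:streetAddress', obj.street],
    [subject, 'schema:postalCode', obj.postcode],
    [subject, 'schema:addressLocality', obj.city],
  ];
}

// Time the mapping of n objects and report the elapsed milliseconds.
function timeMapping(n) {
  const objects = generateObjects(n);
  const start = process.hrtime.bigint();
  let tripleCount = 0;
  for (const obj of objects) {
    tripleCount += mapObject(obj).length;
  }
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { n, tripleCount, elapsedMs };
}

for (const n of [1000, 10000, 100000]) {
  console.log(timeMapping(n));
}
```

Generating the input outside the timed region keeps the measurement focused on the mapping step itself.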

6
    See here for detailed test results
7
    Similar to the generation in
    https://github.com/semantifyit/RML-mapper/blob/master/tests/performanceTest.js



    [
      {
        "name": "Gschwandtkopflifte",
        "type": "SkiResort",
        "contactDetails": [
          {
            "address": {
              "street": "Gschwandtkopf 700",
              "postcode": "6100",
              "city": "Seefeld",
              "type": "Office"
            }
          },
          {
            "address": {
              "street": "Gschwandtkopf 702",
              "postcode": "6100",
              "city": "Seefeld",
              "type": "Lifte"
            }
          }
        ]
      }
    ]



Listing 1: An example data snippet in JSON format from an IT solution provider


2.3       Customizations

In this section we talk about our iterator implementation in detail. Additionally,
we explain the small implementation tweaks we made to cover some needs of our
use case.


Custom Iterator Implementation In our use case, the raw data mostly
comes from IT solution providers in the tourism domain. We have cases where
the objects represented in the data do not have any fields to join on; instead, the
parent and child objects are nested. Since we prefer not to use RDF containers
for nested objects, an implementation with nested term mappings as in xR2RML
[5] would not solve this issue. Therefore we needed to customize how iterators
are interpreted in the mapper, in order to link instances of different types in
the RDF output based on the nested structure of the input file.
    For example, the data in Listing 1 shows an array that contains SkiResort
objects that have multiple Address objects. The relationship between SkiResort
and Address is only provided by the nested structure of the JSON elements. In a
typical mapping file, a SkiResortMapping and an AddressMapping
with the iterators $.* and $.*.contactDetails.*.address would be defined, and a join
condition would specify on which fields the two resulting RDF graphs could be
joined. Since our data do not have such fields, the output of the mapping would
be wrong when there are multiple SkiResort objects with different addresses in
the array⁸. In order to overcome this issue, we customized the way iterators are
interpreted in our mapper (Algorithm 1).
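
The problem can be reproduced in plain JavaScript without any RML tooling. The sketch below emulates the two iterators by hand over an array with two resorts; the second resort is made-up sample data added only to expose the ambiguity, and the variable names are assumptions of this example.

```javascript
const input = [
  {
    name: 'Gschwandtkopflifte',
    type: 'SkiResort',
    contactDetails: [
      { address: { street: 'Gschwandtkopf 700', postcode: '6100', city: 'Seefeld', type: 'Office' } },
      { address: { street: 'Gschwandtkopf 702', postcode: '6100', city: 'Seefeld', type: 'Lifte' } },
    ],
  },
  {
    // Made-up second resort, not taken from any real data source.
    name: 'Example Resort',
    type: 'SkiResort',
    contactDetails: [
      { address: { street: 'Example Street 1', postcode: '0000', city: 'Exampletown', type: 'Office' } },
    ],
  },
];

// Hand-written equivalent of the iterator $.*
const resorts = input.map((r) => ({ name: r.name }));

// Hand-written equivalent of the iterator $.*.contactDetails.*.address
const addresses = input.flatMap((r) => r.contactDetails.map((c) => c.address));

// The flat address list no longer records which resort each address came from,
// and neither side carries a key that a join condition could compare.
const sharedKeys = Object.keys(resorts[0]).filter((k) => k in addresses[0]);
console.log(resorts.length, addresses.length, sharedKeys);
```

Once both selections are flattened, the only remaining information about which address belongs to which resort was the nesting itself, which is why the iterator handling has to preserve it.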



Algorithm 1 Custom iterator algorithm
    result ← {}
    function map(mapping, iterator, input, result)
        input ← input.select(iterator)
        result ← subjectMapping(mapping, input, result)
        for all pOM ∈ mapping.getPredicateObjectMappings() do
            predicate ← pOM.getPredicate()
            if pOM.parentTripleMap then
                childMapping ← pOM.parentTripleMap.getMapping()
                source ← childMapping.getLogicalSource()
                nestedIterator ← childMapping.getSubIterator(iterator)
                result[predicate] ← map(childMapping, nestedIterator, input, result)
            else
                ▷ reference, template, constant...
                result[predicate] ← doMapping(pOM, iterator, input, result)
        return result




    The main goal of the mapping algorithm with the customized iterator han-
dling is to recursively generate a JSON-LD object according to the mapping file.
The algorithm starts with a base mapping, which is explicitly specified before
running the mapper. After the subject mapping is done, the mapping function
iterates over all predicate-object mappings. Whenever a parent triple mapping
is encountered, it is processed recursively by the iterator of the nested mapping
and the result is attached to the parent JSON-LD object on the corresponding
predicate.
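
A much simplified JavaScript sketch of this recursion is given below. It is not the RocketRML source: the mapping description is a hypothetical plain-object format rather than RML, the sub-iterator is modelled as a `select` function instead of a JSONPath expression, subject generation is omitted, and the schema.org property choices are assumptions made for the example.

```javascript
// Hypothetical, simplified mapping description (not RML syntax).
const addressMapping = {
  type: 'schema:PostalAddress',
  predicateObjectMaps: [
    { predicate: 'schema:streetAddress', reference: 'street' },
    { predicate: 'schema:postalCode', reference: 'postcode' },
    { predicate: 'schema:addressLocality', reference: 'city' },
  ],
};

const skiResortMapping = {
  type: 'schema:SkiResort',
  predicateObjectMaps: [
    { predicate: 'schema:name', reference: 'name' },
    {
      predicate: 'schema:address',
      parentTriplesMap: addressMapping,
      // Sub-iterator, evaluated relative to the current resort object.
      select: (resort) => resort.contactDetails.map((c) => c.address),
    },
  ],
};

// Recursively build one JSON-LD-like object per input object.
function mapObject(mapping, obj) {
  const result = { '@type': mapping.type };
  for (const pom of mapping.predicateObjectMaps) {
    if (pom.parentTriplesMap) {
      // Nested mapping: iterate only over the children of the current object.
      result[pom.predicate] = pom
        .select(obj)
        .map((child) => mapObject(pom.parentTriplesMap, child));
    } else {
      // Simple reference mapping.
      result[pom.predicate] = obj[pom.reference];
    }
  }
  return result;
}

// The data from Listing 1.
const input = [
  {
    name: 'Gschwandtkopflifte',
    type: 'SkiResort',
    contactDetails: [
      { address: { street: 'Gschwandtkopf 700', postcode: '6100', city: 'Seefeld', type: 'Office' } },
      { address: { street: 'Gschwandtkopf 702', postcode: '6100', city: 'Seefeld', type: 'Lifte' } },
    ],
  },
];

const output = input.map((resort) => mapObject(skiResortMapping, resort));
console.log(JSON.stringify(output, null, 2));
```

Because the nested mapping only ever sees the children of the object currently being mapped, each address ends up under its own resort without any join condition.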



Other Customizations The data in the tourism domain often comes with
many string-valued properties in different languages. This requires attaching
a language tag to many string values, which can be a tedious task in a
large mapping file. As a workaround, our mapper has a global language option
that attaches the specified language tag to every string literal
during the mapping process.
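
One possible way to realize such an option on a JSON-LD-like result is sketched below. The `addLanguageTag` helper is a hypothetical name, and the paper does not describe how RocketRML implements the option internally; the sketch only shows the effect of tagging every string literal while leaving JSON-LD keywords such as `@type` untouched.

```javascript
// Recursively wrap every plain string value in a JSON-LD value object
// that carries the given language tag.
function addLanguageTag(node, language) {
  if (typeof node === 'string') {
    return { '@value': node, '@language': language };
  }
  if (Array.isArray(node)) {
    return node.map((item) => addLanguageTag(item, language));
  }
  if (node !== null && typeof node === 'object') {
    const tagged = {};
    for (const [key, value] of Object.entries(node)) {
      // Values of JSON-LD keywords (@id, @type, ...) are identifiers, not literals.
      tagged[key] = key.startsWith('@') ? value : addLanguageTag(value, language);
    }
    return tagged;
  }
  return node; // numbers, booleans and null stay untouched
}

const doc = {
  '@type': 'schema:SkiResort',
  'schema:name': 'Gschwandtkopflifte',
  'schema:address': {
    '@type': 'schema:PostalAddress',
    'schema:addressLocality': 'Seefeld',
  },
};

const taggedDoc = addLanguageTag(doc, 'de');
console.log(JSON.stringify(taggedDoc, null, 2));
```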

8
    It would still be possible to use joins in cases where only the parent has an ID field,
    by traversing from the child to the parent. This requires a JSONPath implementation
    that supports such traversal.




Fig. 1. Performance comparison of three different implementations for sources in JSON
format




Fig. 2. Performance comparison of three different implementations for sources in XML
format

3     Results of the Test Cases
Our implementation passes all the test cases for JSON and XML format except
the ones that require joins and consider named graphs⁹. Table 1 gives a summary
of the failed tests. The first group fails because of the lack of named graph
support. Note that some of the tests that contain graph mappings actually create
triples in the default graph; therefore they produce the same output as our
implementation. However, we still consider them as failed tests since we do not
support the graph mapping. The second group fails because of the lack of JOIN
support. Although our implementation can handle nested objects with the custom
iterator implementation, we cannot handle two sources that are conceptually
related but are not in the same tree (e.g. students and sports they practice are
in different files) at the moment.

    Test Case         Reason for Failure
    RMLTC006a-*       No Named Graph Support
    RMLTC007e_h-*     No Named Graph Support
    RMLTC008a-XML     No Named Graph Support
    RMLTC009a-XML     No JOIN Support
    RMLTC009b-*       No JOIN Support

Table 1. A summary of the failed test cases. The asterisk (*) indicates both JSON
and XML formats for the same test case. The underscore (_) indicates a range of test
cases (e.g. from e to h).


4     Conclusion and Discussion
Generating RDF data from various (semi-)structured data is a crucial task for
endeavours like building knowledge graphs. Choosing a mapping framework for
this purpose is not only about the performance of the tool, but also about the
convenience and usability of the mapping language. We found RML convenient
in terms of mapping language as well as amount of available documentation
and examples. RML allows us to create RDF data from heterogeneous tourism
related data sources in a reusable and a rather scalable way. Due to the nature of
our use cases we could not use RML as it is. With RocketRML we have created a
new implementation of an RML mapper which performs well for certain
use cases. Due to its current limitations, it does not fully cover the RML specification.
    For our future work on the mapper, we are implementing JOINs in order to
increase our coverage of the RML specification and support some of our future use
cases that will require joins. However, the reality of a good portion of our data
sources will not change, so we still need to support the case where there are no
fields to join on. Therefore we are going to generate artificial unique identifiers for
objects during the mapping process and join them similarly to the standard RML
implementation. We will then observe how the tool’s performance is affected by
the implementation of JOIN support.
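
The idea of artificial identifiers can be illustrated with a small sketch. It is only an assumption about how such a scheme might look, not the planned implementation: the `_id` and `_parentId` key names, the `assignArtificialIds` and `join` helpers, and the second resort in the sample data are all invented for this example.

```javascript
// Walk the nested input once, give each resort an artificial id and copy it
// into every nested address as a foreign key.
function assignArtificialIds(resorts) {
  let counter = 0;
  const parents = [];
  const children = [];
  for (const resort of resorts) {
    counter += 1;
    const id = `_generated_${counter}`;
    parents.push({ _id: id, name: resort.name });
    for (const detail of resort.contactDetails) {
      children.push({ _parentId: id, ...detail.address });
    }
  }
  return { parents, children };
}

// Join on the generated keys, as a conventional join condition would.
function join(parents, children) {
  return parents.map((p) => ({
    name: p.name,
    addresses: children.filter((c) => c._parentId === p._id),
  }));
}

const sample = [
  {
    name: 'Gschwandtkopflifte',
    contactDetails: [
      { address: { street: 'Gschwandtkopf 700', city: 'Seefeld' } },
      { address: { street: 'Gschwandtkopf 702', city: 'Seefeld' } },
    ],
  },
  {
    // Made-up second resort, for illustration only.
    name: 'Example Resort',
    contactDetails: [{ address: { street: 'Example Street 1', city: 'Exampletown' } }],
  },
];

const { parents, children } = assignArtificialIds(sample);
const joined = join(parents, children);
console.log(JSON.stringify(joined, null, 2));
```

With generated keys in place, the nested case reduces to an ordinary join, at the cost of one extra pass over the input and a lookup per parent.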
    Our use cases also showed that having the input file’s name hardcoded in
the mapping file is not always practical. Sometimes it is required to use the
same mapping file for different input files at runtime. A standard way to
parameterize the input file for logical sources could be useful.

9
    Full results available online.

    Moreover, we will implement more performance tests that consider
simple, flat file structures as well as deeply nested XML and JSON files. We
will run those tests on our implementation as well as on other implementations and
publish the results.


Acknowledgements
This work is partially supported by the MindLab project¹⁰. Umutcan Şimşek is
also supported by the 2018 netidee¹¹ grant. The authors would like to thank
all our developers, especially Thibault Gerrier and Philipp Häusle, for their
implementation, support and helpful comments. We would also like to thank
Ioan Toma and Jürgen Umbrich from Onlim GmbH for fruitful discussions.


References
1. Carroll, J.J., Bizer, C., Hayes, P., Stickler, P.: Named graphs, provenance
   and trust. In: Proceedings of the 14th International Conference
   on World Wide Web. pp. 613–622. WWW ’05, ACM, New York, NY,
   USA (2005). https://doi.org/10.1145/1060745.1060835,
   http://doi.acm.org/10.1145/1060745.1060835
2. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de
   Walle, R.: RML: A Generic Language for Integrated RDF Mappings of Heterogeneous
   Data. In: Proceedings of the 7th Workshop on Linked Data on the Web (Apr
   2014), http://events.linkeddata.org/ldow2014/papers/ldow2014_paper_01.pdf
3. Kärle, E., Şimşek, U., Fensel, D.: semantify.it, a Platform for Creation, Publication
   and Distribution of Semantic Annotations. In: SEMAPRO 2017: The Eleventh
   International Conference on Advances in Semantic Processing. pp. 22–30. New York:
   Curran Associates, Inc. (Jun 2017), http://arxiv.org/abs/1706.10067
4. Kärle, E., Şimşek, U., Panasiuk, O., Fensel, D.: Building an ecosystem for the
   tyrolean tourism knowledge graph. In: Pautasso, C., Sánchez-Figueroa, F., Systä, K.,
   Murillo Rodríguez, J.M. (eds.) Current Trends in Web Engineering. pp. 260–267.
   Springer International Publishing, Cham (2018)
5. Michel, F., Djimenou, L., Faron-Zucker, C., Montagnat, J.: Translation of
   relational and non-relational databases into RDF with xr2rml. In: Monfort,
   V., Krempels, K., Majchrzak, T.A., Turk, Z. (eds.) WEBIST 2015 -
   Proceedings of the 11th International Conference on Web Information Systems
   and Technologies, Lisbon, Portugal, 20-22 May, 2015. pp. 443–454.
   SciTePress (2015). https://doi.org/10.5220/0005448304430454




10
     https://mindlab.ai
11
     https://netidee.at