=Paper=
{{Paper
|id=Vol-3718/paper2
|storemode=property
|title=RML-view-to-CSV: A Proof-of-Concept Implementation for RML Logical Views
|pdfUrl=https://ceur-ws.org/Vol-3718/paper2.pdf
|volume=Vol-3718
|authors=Els de Vleeschauwer,Pano Maria,Ben De Meester,Pieter Colpaert
|dblpUrl=https://dblp.org/rec/conf/kgcw/VleeschauwerMMC24
}}
==RML-view-to-CSV: A Proof-of-Concept Implementation for RML Logical Views==
<pdf width="1500px">https://ceur-ws.org/Vol-3718/paper2.pdf</pdf>
<pre>
                                RML-view-to-CSV: A Proof-of-Concept
                                Implementation for RML Logical Views
                                Els de Vleeschauwer¹, Pano Maria², Ben De Meester¹ and Pieter Colpaert¹
                                ¹ IDLab, Dept. Electronics & Information Systems, Ghent University – imec, Belgium
                                ² Skemu, Schiedam, The Netherlands


                                                                         Abstract
                                                                         Although the W3C Community Group on Knowledge Graph Construction (KGC)’s work on the modular
                                                                         RDF Mapping Language (RML) specification has taken great strides, open issues and respective solution
                                                                         proposals remain. Some of these issues are (i) inability to handle hierarchy in nested data, (ii) limited
                                                                         join functionality, and (iii) inability to handle mixed data formats. To combat these issues, the RML
                                                                         Logical Views module is proposed. However, proper but efficient validation of this module requires
                                                                         an implementation that allows short development cycles. In this workshop paper, we propose a proof-
                                                                         of-concept RML Logical Views implementation, independent of and complementary to existing RML
                                                                         mapping engines. Our proof-of-concept covers three important features of the new RML Logical
                                                                         Views module: (i) flattening of nested data, (ii) extended joining of data sources, and (iii) handling
                                                                         mixed data formats. Our implementation supports one nested source format (JSON) and one tabular
                                                                         source format (CSV), and can be used independently, as preprocessor, by any RML Engine. With this
                                                                         implementation, we successfully executed the available relevant test cases of the RML Logical Views
                                                                         module. Additionally, we measured the knowledge graph construction times on GTFS-Madrid-Bench.
                                                                         To accomplish this we added an option to our implementation that replaces referencing object maps
                                                                         with joins in RML Logical Views. When we included our implementation in the knowledge graph
                                                                         construction pipeline, we noticed considerable execution time reductions. We conclude that the RML
                                                                         Logical Views specification can be implemented, and can solve needs that were not yet solvable by RML.
                                                                         The current implementation can already be realized as a modular part of a knowledge graph construction
                                                                         process. Although boosting performance was not the aim of our work, our implementation reduces the
                                                                         execution time of GTFS-Madrid-Bench scale 100 by 16%, 33%, and 39% when combined respectively with
                                                                         SDM-RDFizer or RPT/Sansa, Morph-KGC, and Carml. RMLStreamer, when used alone, times out after
                                                                         two hours on this task, but, in conjunction with our implementation, completes it in 236 seconds. We
                                                                         hope this proof-of-concept inspires the developers of existing RML engines to integrate the RML Logical
                                                                         Views module and benefit from its features.

                                                                         Keywords
                                                                         RML Logical View, flattening, joining, mixed content, proof-of-concept


                                KGCW’24: 5th International Workshop on Knowledge Graph Construction, May 27, 2024, Crete, GRE
                                Email: els.devleeschauwer@ugent.be (E. de Vleeschauwer); pano@skemu.com (P. Maria); ben.demeester@ugent.be
                                (B. De Meester); pieter.colpaert@ugent.be (P. Colpaert)
                                Web: https://skemu.com (P. Maria); https://ben.de-meester.org/#me (B. De Meester); https://pietercolpaert.be/#me
                                (P. Colpaert)
                                ORCID: 0000-0002-8630-3947 (E. de Vleeschauwer); 0009-0000-2598-1894 (P. Maria); 0000-0003-0248-0987 (B. De
                                Meester); 0000-0001-6917-2167 (P. Colpaert)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


1. Introduction
The W3C Community Group on Knowledge Graph Construction (KGC)1 works on a declarative
approach to construct RDF graph data from existing, heterogeneous data sources. The group
recently proposed a new modular specification, ontology, and accompanying SHACL shapes for
the RDF Mapping Language (RML)2 , including novel features which increase its expressiveness
and empower practitioners to define mapping rules for constructing RDF graph data that were
previously unattainable [1]. Nevertheless, challenges such as handling hierarchy of nested data,
more flexible joining (also across data hierarchies), and handling data sources that mix source
formats (e.g., a table that contains a column storing data as JSON) remain unsolved.
   As of July 2023, a dedicated task force of the W3C Community Group on KGC works on an
additional RML module: RML Logical Views3 . This module aims to resolve the aforementioned
challenges by allowing mapping authors to specify a logical view: a flattened, source
format-agnostic view over one or more existing data sources.
   RML Logical Views are new in RML, and still under development. The specification4 is not
finalized and still subject to change. Hence, there are no implementations available to validate
the feasibility of the theoretical concepts. Proper but efficient validation of the RML Logical
Views module requires an implementation that allows short development cycles.
   A proof-of-concept implementation can validate if the proposed RML constructions are
implementable, and reveal ambiguities in the specification. This allows for corrective iterations
during the development of the specification, and can also support the creation of test cases.
   In this workshop paper, we present RML-view-to-CSV: a proof-of-concept implementation
made available under MIT license and designed following state-of-the-art best practices. RML-
view-to-CSV5 materializes all RML Logical Views specified in a given set of RML mapping
rules as CSV files, and rewrites these RML mapping rules to RML mapping rules without
RML Logical Views. Any RML engine that supports CSV files can then use these resulting
RML mapping rules and the generated CSV files to construct RDF graph data using existing
RML constructs. The implementation supports two source file formats (CSV and JSON) and
supports the following features: flattening of nested data, handling of mixed data formats, and
more flexible joining of data sources (also across data hierarchies). We added two optional
functionalities: the elimination of referencing object maps by moving the joins to logical views,
and the optimization of logical views based on the linked triples maps. The first option allows us
to test our implementation with existing RML benchmarks. Combined with the second option,
we obtained considerable execution time reductions for the knowledge graph construction
pipeline.
   After discussing related work (Section 2) and explaining the main principles of logical views
and how they are specified using RML Logical Views (Section 3), we describe our approach and
the implemented features (Section 4). Finally, we evaluate (Section 5), and conclude (Section 6).


1 https://www.w3.org/community/kg-construct/
2 https://w3id.org/rml/portal/
3 https://github.com/kg-construct/rml-lv
4 https://kg-construct.github.io/rml-lv/dev.html
5 https://github.com/RMLio/rml-view-to-csv
2. Related work
In this section, we first list solutions that were formulated in the past and influenced the devel-
opment of RML Logical Views (Section 2.1). Then, we describe similar modular implementation
approaches (Section 2.2).

2.1. Related proposals to extend RML
Partial solutions that also apply the concept of a (logical) view were formulated in the past
to address (parts of) the challenges presented in this paper [2, 3]. RML Fields is a proposed
solution to deal with nested data and mixed content in RML [2]. The proposal includes a
language construct and algorithm, but no implementation. RML Logical Views is a revised
extension of this proposal. Another solution is expanding RML’s existing SQL views to general
tabular sources instead of solely relational databases [3], to better support transformation
functions, complex joins, and mixed content in RML. These views are formulated as SQL queries
over tabular sources. The proposal is implemented as an extension of Morph-KGC6 [4]: a
state-of-the-art RML mapping engine implemented using the pandas Python package7 [5].
   Other proposed solutions include Facade-X [6], which directly maps the data source into
RDF graph data (so not a flattened, but a graph-based source-agnostic view) and thereby allows,
e.g., joining over hierarchy via an iteration index8 ; and xR2RML, which (among others)
supports mixed syntax paths for handling data sources that mix source formats [7].

2.2. RML engine module implementations
State-of-the-art RML engine module implementations typically provide (integrated or separate)
preprocessing steps. SDM-RDFizer introduces a preprocessing step for grouping mapping
rules, which leads to optimizations due to parallelization [8]. FunMap [9] is an interpreter
of RML+FnO (the RML module that allows describing data transformations in the mapping
process) that converts a data integration system defined using RML+FnO into an equivalent
one where RML mappings are function-free.


3. RML Logical Views Primer
Within RML (https://w3id.org/rml/portal), RDF triple generation is defined using triples maps.
Within these triples maps, expressions are used to fill in data values in specified places in the
triples. The data from which these triples are generated is currently specified using a logical source:
a construct to describe how to extract logical iterations out of heterogeneous data sources,
e.g. extracting individual rows from CSV files or extracting specific objects from JSON data
from a Web API response. Reference formulations allow specifying the manner to create logical
iterations and expressions (e.g. using the JSONPath reference formulation to handle JSON data).


6 https://github.com/morph-kgc/morph-kgc/releases/tag/2.3.0
7 https://pandas.pydata.org/
8 https://github.com/kg-construct/mapping-challenges/issues/43
   For example, when mapping person data from a local CSV file that contains a person’s
(national) ID and name, the logical source describes how to extract rows as logical iterations
from that CSV file, using a default CSV reference formulation. On these logical iterations,
specific ID and name expressions are evaluated to create triples, using the ID expression as part
of the subject identifier, and the name expression as a literal object.
   With RML Logical Views, the W3C Community Group on KGC aims to describe a virtual
flattened view on top of a logical source. A logical view allows defining fields where each field
is an expression on its parent in terms of a reference formulation. A field’s parent is either
the logical iteration of the logical source when a field is defined at root-level of the logical
view, or another field. The result of evaluating a field’s expression is a list of values called an
iteration sequence [2]. The reference formulation of the expression of the field is the reference
formulation of its parent, unless it is specified on the field. Each field has a declared name which
is an alphanumeric string, and a field name which is the concatenation of the name of the parent
field, a dot, and the field’s declared name.
   The evaluation of the fields of a logical view on a given logical iteration forms a new view
iteration, which is the natural, full outer join of the logical iteration of the logical source and all
the fields’ iteration sequences in order of the defined field hierarchy. Following this, expression
result values derived along the same root-to-leaf path in the input data’s tree structure will end
up in the same view iteration [2].
   To reference a field value while generating triples in a triples map, its field name can be used.
This approach gives a mapping author the control to map a hierarchical source in a variety of
ways whilst still being able to respect the context of the source hierarchy.
   A logical view can be further extended with fields from one or more other logical views by
defining joins with other logical views. Since the logical iterations of a logical view can be
represented as a relation in relational algebra, we can also define joins in terms of relational
algebra (left join, inner join). These joins will define which field values from a parent logical
view will be incorporated in a logical iteration of the child logical view.
   In this section, we give an example-based explanation of the main RML Logical Views func-
tionalities. The examples9 (Figures 1 to 3) include an RML mapping illustrating the declarative
description of a logical view, the source data for the logical source used as source for the logical
view, and an intermediate representation of the logical view. The headers of the columns of this
intermediate representation can be used as reference expressions in expression maps.

3.1. Flattening of nested data structures
Problem: References to nested data structures, like JSON or XML, may return multiple values.
These values can be composite: they may again contain multiple values. RML defines mapping
constructs that produce results by combining the results of other mapping constructs in a specific
order. For example, a triples map combines the results of a subject map and a predicate-object
map in that order. Another example is a template expression, which combines character strings
and zero or more reference expressions in declared order. When mapping constructs produce
multiple results, the combining mapping constructs will apply an n-ary Cartesian product10
over the sets of results, maintaining the order of the mapping constructs. In the case of nested
data structures, this may cause the generation of results that do not match the source hierarchy,
i.e. do not follow the root-to-leaf paths in the source data, since values are combined irrespective
of it.

9 Prefixes are omitted but can be found on https://prefix.cc
10 https://w3id.org/rml/core/spec#dfn-n-ary-cartesian-product

    :jsonSource a rml:LogicalSource ;
      rml:source "json_data.json" ;
      rml:referenceFormulation rml:JSONPath ;
      rml:iterator "$.people[*]" .
    :jsonView a rml:LogicalView ;
      rml:onLogicalSource :jsonSource ;
      rml:field [
        rml:fieldName "name" ;
        rml:reference "$.name" ; ] ;
      rml:field [
        rml:fieldName "item" ;
        rml:reference "$.items[*]" ;
        rml:field [
          rml:fieldName "type" ;
          rml:reference "$.type" ; ] ;
        rml:field [
          rml:fieldName "weight" ;
          rml:reference "$.weight" ; ] ; ] .

               (a) mapping_json_view.ttl (part 1)

    { "people": [
        { "name": "alice",
          "items": [
            { "type": "sword",
              "weight": 1500 },
            { "type": "shield",
              "weight": 2500 }
            ] },
        { "name": "bob",
          "items": [
            { "type": "flower",
              "weight": 15 }
            ] }
        ] }

               (b) json_data.json

    name     item                                     item.type   item.weight
    alice    {"type": "sword", "weight": 1500}        sword              1500
    alice    {"type": "shield", "weight": 2500}       shield             2500
    bob      {"type": "flower", "weight": 15}         flower               15

               (c) Intermediate representation of :jsonView
Figure 1: Example of a logical view on a logical source containing nested JSON data

   Furthermore, there is varying expressiveness in data source expression and query languages,
and many languages have limited support for hierarchy traversal. For example, JSONPath has
no operator to refer to an ancestor in the document hierarchy.
   This limits the ability of RML to map nested data.
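
The effect of the Cartesian product can be illustrated with a small sketch. It is not part of any RML engine; it only assumes that two references into the same nested array are evaluated independently and then combined, as described above. The variable names are illustrative.

```python
from itertools import product

# One logical iteration of "$.people[*]" from Figure 1b (alice).
alice = {
    "name": "alice",
    "items": [
        {"type": "sword", "weight": 1500},
        {"type": "shield", "weight": 2500},
    ],
}

# Evaluating "$.items[*].type" and "$.items[*].weight" independently
# yields two unrelated multi-valued results.
types = [item["type"] for item in alice["items"]]
weights = [item["weight"] for item in alice["items"]]

# Combining them with a Cartesian product ignores the source hierarchy.
combined = list(product(types, weights))

# Pairs that actually occur along one root-to-leaf path in the source.
in_source = [(item["type"], item["weight"]) for item in alice["items"]]

# Pairs produced by the product that do not exist in the source data.
spurious = [pair for pair in combined if pair not in in_source]
```

Here `combined` holds four pairs although the source only contains two, so half of the generated combinations (for instance a sword weighing 2500) do not follow a root-to-leaf path.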
   Solution: Fields can be defined in a hierarchy following the source document to produce
iterations from which source values along the same root-to-leaf path can be referenced. The
view iteration can be seen as flattening the hierarchy of the source document (Figure 1).
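
A minimal sketch of this flattening for the example of Figure 1 is given below. It is hard-coded for the fields of :jsonView and is not the RML-view-to-CSV code; the function name `flatten_json_view` is hypothetical, and keeping a person without items as a row with empty item fields is our reading of the outer-join semantics described in Section 3.

```python
import json

# Source data from Figure 1b.
DOCUMENT = {
    "people": [
        {"name": "alice", "items": [
            {"type": "sword", "weight": 1500},
            {"type": "shield", "weight": 2500}]},
        {"name": "bob", "items": [
            {"type": "flower", "weight": 15}]},
    ]
}


def flatten_json_view(document):
    """Return the view iterations of :jsonView as a list of dicts."""
    rows = []
    for person in document["people"]:          # iterator "$.people[*]"
        # Assumption: a person without items still yields one row.
        items = person.get("items") or [None]
        for item in items:                     # field "item": "$.items[*]"
            rows.append({
                "name": person.get("name"),    # field "name": "$.name"
                "item": None if item is None else json.dumps(item),
                # Nested fields are named <parent name>.<declared name>.
                "item.type": None if item is None else item.get("type"),
                "item.weight": None if item is None else item.get("weight"),
            })
    return rows


view_rows = flatten_json_view(DOCUMENT)
```

Each row of `view_rows` corresponds to one line of the intermediate representation in Figure 1c, so a type and a weight in the same row always stem from the same item.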

3.2. Handling of mixed data formats
Problem: Data in one format can contain multiple or composite values stored in another
format11 [2], e.g. a CSV dataset could contain columns containing JSON values.

11 https://webusers.i3s.unice.fr/~fmichel/xr2rml_specification_v5.1.html#_Toc466307454

    :mixedSource a rml:LogicalSource ;
      rml:source "./mixed_data.csv";
      rml:referenceFormulation rml:CSV .
    :mixedView a rml:LogicalView ;
      rml:onLogicalSource :mixedSource;
      rml:field [
        rml:fieldName "name" ;
        rml:reference "name" ; ] ;
      rml:field [
        rml:fieldName "item" ;
        rml:reference "item" ;
        rml:referenceFormulation rml:JSONPath ;
        rml:field [
          rml:fieldName "type" ;
          rml:reference "$.type" ; ] ;
        rml:field [
          rml:fieldName "weight";
          rml:reference "$.weight" ;] ; ] .

     (a) mapping_mixed_view.ttl

    name,item
    alice,"{""type"":""sword"",""weight"":1500}"
    alice,"{""type"":""shield"",""weight"":2500}"
    bob,"{""type"":""flower"",""weight"": 15 }"

     (b) mixed_data.csv

    name    item                                 item.type   item.weight
    alice   {"type": "sword", "weight": 1500}    sword       1500
    alice   {"type": "shield", "weight": 2500}   shield      2500
    bob     {"type": "flower", "weight": 15}     flower      15

     (c) Intermediate representation of :mixedView
Figure 2: Example of a logical view on a logical source containing CSV data with embedded JSON values



   To define the expected form of references to input data, RML employs the notion of a reference
formulation, which is a property of every logical source. However, currently a logical source
is limited to having a single reference formulation, meaning mixed format data can only be
referenced using a query language that supports just one of the formats.
   Solution: For every field, the reference formulation can be adapted (Figure 2). The default
reference formulation for a field is the reference formulation of its parent, or of the source of
the logical view for the fields at the root of the iteration. Thus, every iteration level can iterate
over data in a different format.
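
The change of reference formulation per field can be sketched as follows. This is an illustration only, not the RML-view-to-CSV code: the function name `mixed_view` is hypothetical, the fields are hard-coded for :mixedView, and the inline data uses the values shown in Figure 2c.

```python
import csv
import io
import json

# CSV data with a JSON-valued "item" column, with values as in Figure 2c.
MIXED_CSV = '''name,item
alice,"{""type"":""sword"",""weight"":1500}"
alice,"{""type"":""shield"",""weight"":2500}"
bob,"{""type"":""flower"",""weight"": 15 }"
'''


def mixed_view(csv_text):
    """Return the view iterations of :mixedView as a list of dicts."""
    rows = []
    # Root-level fields use the reference formulation of the source (CSV).
    for record in csv.DictReader(io.StringIO(csv_text)):
        # The "item" field switches to JSON for itself and its nested fields.
        item = json.loads(record["item"])
        rows.append({
            "name": record["name"],
            "item": record["item"],
            "item.type": item["type"],       # "$.type"
            "item.weight": item["weight"],   # "$.weight"
        })
    return rows


mixed_rows = mixed_view(MIXED_CSV)
```

The outer loop reads the data as CSV, while the nested fields are evaluated on the parsed JSON value of the `item` column.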

3.3. Extended joining of data sources
Problem: RML restricts join operations to referencing object maps. Since a referencing object
map can only generate an object that is an IRI or blank node subject as specified by a parent
triples map, it is not possible to combine data from two sources in one term, use data from a join
on another position than the object, or generate a literal using data from a join [3]. Moreover,
RML cannot join correctly across hierarchies [2].
   Solution: A logical view can be extended with fields from one or more other logical views as
a result of a join operation (Figure 3). The logical iteration will be adapted according to the type
of join specified (left join or inner join). Any needed flattening of hierarchical data is done in
the logical view before applying the join operation. Data that originally comes from a different
data source is thus treated equally in a joined logical view, allowing more flexibility as to where
to apply that joined data.

    :csvSource a rml:LogicalSource ;
      rml:source "./csv_data.csv";
      rml:referenceFormulation rml:CSV .
    :joinedView a rml:LogicalView ;
      rml:onLogicalSource :csvSource;
      rml:field [
        rml:fieldName "name" ;
        rml:reference "name" ;
      ] ;
      rml:field [
        rml:fieldName "birthyear" ;
        rml:reference "birthyear" ;
      ] ;
      rml:leftJoin [
        rml:parentLogicalView :jsonView ;
        rml:joinCondition [
          rml:parent "name" ;
          rml:child "name" ;
        ] ;
        rml:field [
          rml:fieldName "item_type" ;
          rml:reference "item.type" ;
        ] ; ] .

                (a) mapping.ttl (part 3)

    name,birthyear
    alice,1995
    bob,1999
    tobias,2005

                (b) csv_data.csv

    name      birthyear    item_type
    alice     1995         sword
    alice     1995         shield
    bob       1999         flower
    tobias    2005

                (c) Intermediate representation of :joinedView
Figure 3: Example of a logical view on a logical source containing CSV data and extended with data
from the logical view from Figure 1
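
The left join of Figure 3 can be sketched on already flattened view iterations as follows. The helper `left_join` and its parameters are hypothetical and only illustrate the relational semantics; they do not reflect the actual implementation.

```python
def left_join(child_rows, parent_rows, child_key, parent_key, fields):
    """Extend child rows with fields from matching parent rows.

    `fields` maps a new field name to the parent field it is taken from.
    Child rows without a match are kept, with the new fields set to None.
    """
    index = {}
    for parent in parent_rows:
        index.setdefault(parent[parent_key], []).append(parent)

    joined = []
    for child in child_rows:
        matches = index.get(child[child_key])
        if not matches:
            joined.append({**child, **{new: None for new in fields}})
            continue
        for parent in matches:
            extra = {new: parent[old] for new, old in fields.items()}
            joined.append({**child, **extra})
    return joined


# View iterations over csv_data.csv (Figure 3b) and of :jsonView (Figure 1c).
csv_rows = [
    {"name": "alice", "birthyear": "1995"},
    {"name": "bob", "birthyear": "1999"},
    {"name": "tobias", "birthyear": "2005"},
]
json_view_rows = [
    {"name": "alice", "item.type": "sword"},
    {"name": "alice", "item.type": "shield"},
    {"name": "bob", "item.type": "flower"},
]

joined_rows = left_join(csv_rows, json_view_rows, "name", "name",
                        {"item_type": "item.type"})
```

The result has four rows, matching Figure 3c: tobias is kept with an empty `item_type`. Skipping unmatched child rows instead of padding them would give the inner join variant.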


4. Approach and Implementation
We have designed our proof-of-concept implementation as a standalone application, independent
of and complementary to existing RML mapping engines, that can be used as a preprocessing
step in a knowledge graph construction pipeline (Figure 4). As such, we designed our proof-of-
concept following the state-of-the-art best practices of, amongst others, FunMap [9], i.e. we
rely on a set of lossless rewriting rules to push down and materialize the execution of RML
Logical Views in the initial step of the knowledge graph construction process.
   Our implementation, named RML-view-to-CSV, is available online12 under the permissive
MIT license. Our code is built on top of pandas [5], a Python library with data frames for
manipulation of structured data sets.
   At the moment of writing, our implementation supports one nested source format (JSON),
one tabular source format (CSV), and the three important features of the new RML Logical
Views module: flattening of nested data, more flexible joining of data sources (also across data
hierarchies), and handling of mixed data formats. Its main functionality is the materialization of
logical views (Section 4.1). We added two optional functionalities: the elimination of referencing
object maps (Section 4.2) and the optimization of logical views based on the linked triples maps
(Section 4.3).

12 https://github.com/RMLio/rml-view-to-csv/ (v0.0.0), https://doi.org/10.5281/zenodo.11045497

    RML mapping + source data (CSV, JSON)
        -> RML-view-to-CSV
        -> RML mapping without logical views + materialized logical views (CSV)
        -> RML engine
        -> RDF knowledge graph

Figure 4: RML-view-to-CSV materializes all logical views specified in a given set of RML mapping rules
as CSV files, and rewrites these RML mapping rules to RML mapping rules without logical views. The
original source data, the materialized logical views and the new RML mapping without logical views are
input for an RML engine to generate RDF graph data.

    :jsonSource a rml:LogicalSource; rml:source "json_data.json";
        rml:referenceFormulation rml:JSONPath; rml:iterator "$.people[*]".

    :jsonView a rml:LogicalSource; rml:source "./view0.csv";
        rml:referenceFormulation rml:CSV.

                                            (a) MappingWithoutViews.ttl

    name,item,item.type,item.weight
    alice,"{""type"": ""sword"", ""weight"": 1500}",sword,1500
    alice,"{""type"": ""shield"", ""weight"": 2500}",shield,2500
    bob,"{""type"": ""flower"", ""weight"": 15}",flower,15

                                                   (b) view0.csv
Figure 5: Output of RML-view-to-CSV when using Figure 1a and Figure 1b as input

4.1. Materialization of logical views
RML-view-to-CSV takes as input a given set of RML mapping rules and the source data used
in these mapping rules. It produces CSV files with the intermediate representation of every
logical view specified in the RML mapping rules, and a new set of RML mapping rules in which
all RML Logical Views are replaced by logical sources (Figure 5).
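
The materialization step for one view can be sketched with the Python standard library as follows. This is an illustration under assumptions: the actual implementation builds on pandas, the function name `materialize` is hypothetical, and the sketch returns the CSV text instead of writing a file such as view0.csv.

```python
import csv
import io


def materialize(rows, fieldnames):
    """Serialize view iterations as CSV text, one column per field name."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# View iterations of :jsonView (Figure 1c).
rows = [
    {"name": "alice", "item": '{"type": "sword", "weight": 1500}',
     "item.type": "sword", "item.weight": 1500},
    {"name": "alice", "item": '{"type": "shield", "weight": 2500}',
     "item.type": "shield", "item.weight": 2500},
    {"name": "bob", "item": '{"type": "flower", "weight": 15}',
     "item.type": "flower", "item.weight": 15},
]

view0_csv = materialize(rows, ["name", "item", "item.type", "item.weight"])
```

The CSV writer quotes the JSON-valued `item` column and doubles the embedded quotes, which gives the same layout as view0.csv in Figure 5b.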

4.2. Elimination of referencing object maps
With the introduction of logical views, joins of data sources can be expressed within the
triples map (via referencing object maps) or within the logical view. We added an option
to RML-view-to-CSV to delegate the execution of joins expressed in triples maps to logical
views. With this option selected, RML-view-to-CSV rewrites the RML mapping rules before
materializing the logical views.

    :map_services1_0 a rml:TriplesMap; rml:logicalSource :source_5;
        rml:subjectMap [rml:template "http://gtfs/services/{service_id}"].

    :map_trips_0 a rml:TriplesMap; rml:logicalSource :source_1;
        rml:subjectMap [rml:template "http://gtfs/trips/{trip_id}"];
        rml:predicateObjectMap [
            rml:predicateMap [rml:constant gtfs:service];
            rml:objectMap [
                rml:parentTriplesMap :map_services1_0;
                rml:joinCondition [rml:child "service_id"; rml:parent "service_id"]]].

                                (a) MappingWithRefObjectMap.ttl

    :new_map_0 a rml:TriplesMap; rml:logicalSource :new_child_view_0;
        rml:subjectMap [rml:template "http://gtfs/trips/{trip_id}"];
        rml:predicateObjectMap [
            rml:predicateMap [rml:constant gtfs:service];
            rml:objectMap [rml:template "http://gtfs/services/{service_id_from_parent}"]] .

    :new_child_view_0 a rml:LogicalView; rml:onLogicalSource :source_1;
        rml:field
            [rml:fieldName "trip_id"; rml:reference "trip_id"],
            [rml:fieldName "service_id"; rml:reference "service_id"];
        rml:leftJoin [a rml:ViewJoin; rml:parentLogicalView :new_parent_view_0;
            rml:joinCondition [rml:child "service_id"; rml:parent "service_id"];
            rml:field [rml:fieldName "service_id_from_parent"; rml:reference "service_id"]].

    :new_parent_view_0 a rml:LogicalView; rml:onLogicalSource :source_5;
        rml:field [rml:fieldName "service_id"; rml:reference "service_id"].

                               (b) MappingWithoutRefObjectMap.ttl
Figure 6: The referencing object map from MappingWithRefObjectMap.ttl is replaced by an equivalent
combination of two new logical views and one new triples map in MappingWithoutRefObjectMap.ttl.

   It is known that self-join elimination must be performed for time-efficient execution of
RML mappings, e.g. the mappings used in GTFS-Madrid-Bench [10, 4, 8]. Therefore,
RML-view-to-CSV first eliminates unnecessary self-joins. A referencing object map is replaced
with a simple object map based on the subject map of the parent triples map when (i) the same
logical source is used for the child and the parent triples map, (ii) all involved join conditions
use the same references for both parent and child, and (iii) either the subject map of the parent
triples map, or the subject map, predicate map and graph map of the child triples map, only
mention a subset of the references used in the join conditions.
   All remaining referencing object maps are rewritten as an equivalent combination of two
new logical views and a new triples map (Figure 6).
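
The self-join conditions described above can be sketched as a small predicate. This is our reading of the conditions, not the actual implementation; the function name and the way references are passed in (plain lists of reference names) are assumptions made for illustration.

```python
def is_unnecessary_self_join(child_source, parent_source, join_conditions,
                             parent_subject_refs, child_term_refs):
    """Check whether a referencing object map is an unnecessary self-join.

    join_conditions     -- list of (child_reference, parent_reference) pairs
    parent_subject_refs -- references used in the parent's subject map
    child_term_refs     -- references used in the child's subject map,
                           predicate map and graph map
    """
    # (i) child and parent triples map use the same logical source.
    if child_source != parent_source:
        return False
    # (ii) every join condition compares a reference with itself.
    if any(child != parent for child, parent in join_conditions):
        return False
    # (iii) one side only mentions references used in the join conditions.
    join_refs = {child for child, _ in join_conditions}
    return (set(parent_subject_refs) <= join_refs
            or set(child_term_refs) <= join_refs)


# The join of Figure 6a spans two different sources, so it is not eliminated
# but rewritten into logical views instead.
figure_6_check = is_unnecessary_self_join(
    ":source_1", ":source_5", [("service_id", "service_id")],
    ["service_id"], ["trip_id"])
```

For the mapping of Figure 6a, `figure_6_check` is `False` because the child and parent triples maps read from different logical sources.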
      Test case      Description
      RMLLVTC0001    logical view over JSON source
      RMLLVTC0002    logical view over JSON source with flattening of nested data
      RMLLVTC0003    logical view over CSV source, extended with data from a logical view over JSON source using a left join
      RMLLVTC0004    logical view over CSV source, extended with data from a logical view over JSON source using an inner join
      RMLLVTC0005    logical view over CSV source, extended with data from a logical view over JSON source using an inner join, and use of references to field indexes

Table 1
Description of the test cases included in the RML Logical Views module at the moment of writing.


   This option allows us to test our proof-of-concept implementation with existing RML
benchmarks, even though these benchmarks do not include RML Logical Views yet, as logical
views are new in RML.

4.3. Optimization of logical views
By default, RML-view-to-CSV does not take the content of triples maps into account. The
materialized logical views represent all fields and logical iterations of the declared RML Logical
Views. Thanks to this behaviour, we can verify if the processing of the logical views remains
aligned with the specification.
   However, logical views can contain fields and logical iterations that are not used by any
triples map. As the size of the source data impacts the knowledge graph construction process,
we added an option in RML-view-to-CSV to eliminate unnecessary fields and logical iterations.
With this option selected, RML-view-to-CSV first removes fields that are not used in any triples
map linked to the logical view. Then, all duplicate logical iterations are removed, except when
any linked triples map produces blank nodes that are not based on a field from the logical
view (in that case a new unique identifier is generated for every logical iteration, so
seemingly duplicate logical iterations still yield distinct blank nodes).
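
The two optimization steps can be sketched with pandas as follows. This is a minimal
illustration, not the actual implementation: the function optimize_view, its parameters and the
sample data are hypothetical, and the decision whether a linked triples map produces blank nodes
that are not based on a view field is assumed to be made elsewhere and passed in as a flag.

```python
import pandas as pd


def optimize_view(view: pd.DataFrame, used_fields: list,
                  has_fieldless_blank_nodes: bool) -> pd.DataFrame:
    """Drop unused fields, then duplicate logical iterations when that is safe."""
    # Step 1: keep only the fields referenced by a linked triples map.
    reduced = view[used_fields]
    # Step 2: duplicates must be kept when a linked triples map creates blank
    # nodes that do not depend on any field of the view.
    if has_fieldless_blank_nodes:
        return reduced
    return reduced.drop_duplicates().reset_index(drop=True)


# Hypothetical materialized view with one field ("note") that no triples map uses.
view = pd.DataFrame({
    "stop_id": ["s1", "s1", "s2"],
    "stop_name": ["A", "A", "B"],
    "note": ["x", "y", "z"],
})
optimized = optimize_view(view, ["stop_id", "stop_name"], has_fieldless_blank_nodes=False)
```

Removing the unused field first matters: the first two iterations only become duplicates once
the "note" column is gone.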


5. Evaluation
We tested our proof-of-concept implementation against the test cases defined in the RML Logical
Views module, and added an evaluation on the GTFS-Madrid-Bench [11].

5.1. Test cases in the RML Logical Views module
At the moment of writing, the RML Logical Views module includes five test cases13 (Table 1). With
RML-view-to-CSV as preprocessor to RMLMapper, we generated the expected output for the relevant
test cases (i.e. test cases RMLLVTC0001 to RMLLVTC0004). We excluded test case RMLLVTC0005
as it makes use of field indexes, which we have not yet included in our implementation due to
13
     https://github.com/kg-construct/rml-lv/tree/main/test-cases
[Figure 7 plot: two horizontal bar charts, GTFS-Madrid-Bench scale 10 and scale 100, showing
execution time (s) per RML engine (CARML, Morph-KGC, RMLMapper, RMLStreamer, RPT/Sansa,
SDM-Rdfizer) for three pipelines: RML-view-to-CSV (optimize) & RML engine, RML-view-to-CSV &
RML engine, and RML engine only.]


Figure 7: GTFS-Madrid-Bench scale 10 and 100 executed with three pipelines (RML engine only,
RML-view-to-CSV and RML engine, and RML-view-to-CSV with optimization and RML engine) and six
RML engines (Carml, Morph-KGC, RMLMapper, RMLStreamer, RPT/Sansa, and SDM-Rdfizer); average of
five runs, time-out after one hour. The pipelines with RML-view-to-CSV with optimization as
preprocessor have the lowest execution time.


a lack of clarity about the expected behaviour14. During the evaluation of the test cases, we
noticed and corrected human mistakes in the mappings and expected output15. This confirms the
need for and benefit of a proof-of-concept implementation during the development phase of a new
RML module: it helps to spot ambiguities in the specification, ontology and shapes, as well as
errors in the test cases, at an early development stage.

5.2. GTFS-Madrid-Bench
We tested our proof-of-concept implementation on the GTFS-Madrid-Bench, comparing the
execution of joins by our implementation with the execution of joins by existing RML engines,
i.e. Carml, Morph-KGC, RMLMapper, RMLStreamer, RPT/Sansa, and SDM-Rdfizer16. Our test
setup included three pipelines: RML engine only, RML-view-to-CSV and RML engine, and
RML-view-to-CSV with optimization (Section 4.3) and RML engine. In the latter two pipelines,
RML-view-to-CSV first eliminates all referencing object maps in the GTFS-Madrid-Bench mapping,
14
   https://github.com/kg-construct/rml-lv/issues/20
15
   https://github.com/kg-construct/rml-lv/pull/22
16
   https://github.com/carml/carml-jar (v1.3.0), https://github.com/morph-kgc/morph-kgc (v2.6.4),
   https://github.com/RMLio/rmlmapper-java (v6.3.0), https://github.com/RMLio/RMLStreamer (v2.5.0),
   https://github.com/SmartDataAnalytics/RdfProcessingToolkit/ (v1.9.5), and
   https://github.com/SDM-TIB/SDM-RDFizer (v4.7.3.5) respectively.
      RML-view-to-CSV execution time (s)
      scale        RML-view-to-CSV (optimize)    RML-view-to-CSV
      GTFS 1       4                             3
      GTFS 10      5                             12
      GTFS 100     16                            102


Figure 8: Execution time of RML-view-to-CSV (first step of the pipelines in Figure 7). With the
optimization option, RML-view-to-CSV eliminates unnecessary logical view fields and duplicate logical
iterations. This reduces its execution time by a factor of six on the largest scale measured.


rewriting them as logical views whenever applicable (Section 4.2). We measured the knowledge
graph construction time per pipeline (i.e. including the execution time of RML-view-to-CSV
where applicable) and per RML engine for scales 1, 10 and 100 of the GTFS-Madrid-Bench, with
CSV as source format, using a device with the following specifications: 2 x Hexacore Intel E5645
(2.4GHz) CPU, 24GB RAM, 1x 250GB hard disk. All experiments were performed five times and
the average of the measurements is reported (Figure 7). The test scripts are available online17.
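
The measurement procedure (five runs, average reported, one-hour time-out) can be sketched as
follows. This is a minimal illustration and not the published test scripts: the function
average_runtime and the placeholder command are hypothetical, and the sketch assumes that a
whole pipeline is started as one command and that the time-out applies to that command.

```python
import statistics
import subprocess
import sys
import time


def average_runtime(command, runs=5, timeout_s=3600):
    """Mean wall-clock time in seconds over `runs` runs, or None on a time-out."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        try:
            # check=True raises CalledProcessError if the command itself fails.
            subprocess.run(command, check=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return None  # no output within the time limit
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


if __name__ == "__main__":
    # Placeholder command; a real run would start a pipeline script instead.
    print(average_runtime([sys.executable, "-c", "pass"], runs=2))
```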
   RMLMapper and RMLStreamer cannot generate any output for the GTFS-Madrid-Bench
within one hour when these RML engines execute the joins themselves. However, when
delegating the joins to RML-view-to-CSV, these mapping engines were able to generate correct
output, with timings similar to or better than those of a state-of-the-art RML engine like
Morph-KGC. We note that RMLMapper still cannot handle GTFS scale 100: RMLMapper loads all
data in memory during mapping, and the testing device ran out of memory at this scale. We also
note that the elimination of unnecessary fields and duplicate logical iterations
(RML-view-to-CSV with optimization) reduces the execution time of RML-view-to-CSV by a
factor of six for scale 100 (Figure 8), and leads to the fastest pipelines in combination with all
tested engines.
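
The factor of six follows directly from the values in Figure 8. The short calculation below
assumes that the first bar of each pair in Figure 8 is the run with optimization and the second
bar the run without; the variable names are illustrative.

```python
# Execution times of RML-view-to-CSV in seconds as read from Figure 8,
# stored per scale as (without optimization, with optimization).
timings = {"GTFS 1": (3, 4), "GTFS 10": (12, 5), "GTFS 100": (102, 16)}

# Speed-up factor = time without optimization / time with optimization.
speedups = {
    scale: plain / optimized
    for scale, (plain, optimized) in timings.items()
}

for scale, factor in speedups.items():
    print(f"{scale}: {factor:.1f}x")
```

Under this reading, the optimization pays off from scale 10 onwards, while at scale 1 the two
variants differ by about one second.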
   The combination of RML-view-to-CSV with optimization and RMLStreamer emerges as the
most efficient approach18. This showcases the potential of modular mapping engines that delegate
each task to the most suitable framework: the dataframes from pandas (used in
RML-view-to-CSV) are optimized for data transformations and joins, while the streaming and
parallelization of Flink enable RMLStreamer to create RDF graph data with execution time and
CPU usage that scale linearly with the size of the input data, at constant memory
usage [12].


6. Conclusion
In this paper, we show how the RML Logical Views specification can be implemented and can
address needs that RML could not yet meet. The implementation can be realized as a modular

17
     10.5281/zenodo.10987733
18
     The current set-up, with materialization as a preprocessing step, prevents RMLStreamer from
     using this optimization for streaming data.
part of a knowledge graph construction process.
   Our proof-of-concept, RML-view-to-CSV, used as a preprocessor to any RML engine that supports
CSV input, not only helped to validate and improve the RML Logical Views module, but
also showed performance gains in our benchmarks for handling joins between CSV sources. The
approach showcases the potential of modular mapping engines, which allow the use of specialized
data structures and delegate each task to the most suitable framework, offering best-of-breed
performance.
   The RML Logical Views module is still under development. Its finalization, including formal
definitions and more features (e.g. indexes per field, data transformation functions, groups
and aggregations), is future work. We intend to gradually integrate the additional features in
RML-view-to-CSV, as they are discussed in the W3C Community Group on KGC and described
in the RML Logical Views specification. Furthermore, we will investigate whether the detected
performance improvements hold as well for sources other than CSV.
   We will continue to share our code as inspiration for developers who want to implement
RML Logical Views directly in RML engines once this new RML module has been finalized.


Acknowledgments
The described research activities were supported by SolidLab Vlaanderen (Flemish Government,
EWI and RRF project VV023/10), and the European Union’s Horizon Europe research and
innovation program under grant agreement no. 101058682 (Onto-DESIDE). The authors want
to thank David Chaves-Fraga for the discussions that were a source of inspiration for this
proof-of-concept implementation.


References
 [1] A. Iglesias-Molina, D. Van Assche, J. Arenas-Guerrero, B. De Meester, C. Debruyne, S. Joza-
     shoori, P. Maria, F. Michel, D. Chaves-Fraga, A. Dimou, The RML Ontology: A Community-
     Driven Modular Redesign After a Decade of Experience in Mapping Heterogeneous Data
     to RDF, in: Proceedings of the International Semantic Web Conference (ISWC), Lecture
     Notes in Computer Science, Springer, Cham, 2023. doi:10.1007/978-3-031-47243-5_9.
 [2] T. Delva, D. Van Assche, P. Heyvaert, B. De Meester, A. Dimou, Integrating nested data
     into knowledge graphs with RML fields, in: D. Chaves-Fraga, A. Dimou, P. Heyvaert,
     F. Priyatna, J. Sequeda (Eds.), Proceedings of the 2nd International Workshop on Knowledge
     Graph Construction co-located with 18th Extended Semantic Web Conference (ESWC
     2021), volume 2873, CEUR, 2021. URL: http://ceur-ws.org/Vol-2873/paper9.pdf.
 [3] J. Arenas-Guerrero, A. Alobaid, M. Navas-Loro, M. S. Pérez, O. Corcho, Boosting knowledge
     graph generation from tabular data with RML views, in: The Semantic Web, Springer
     Nature Switzerland, 2023, pp. 484–501. doi:10.1007/978-3-031-33455-9_29.
 [4] J. Arenas-Guerrero, D. Chaves-Fraga, J. Toledo, M. S. Pérez, O. Corcho, Morph-KGC:
     Scalable knowledge graph materialization with mapping partitions, Semantic Web (2022)
     1–20. doi:10.3233/sw-223135.
 [5] W. McKinney, pandas: a foundational Python library for data analysis and statistics (2011).
 [6] E. Daga, L. Asprino, P. Mulholland, A. Gangemi, Facade-X: An Opinionated Approach
     to SPARQL Anything, in: Further with Knowledge Graphs – Proceedings of the 17th
     International Conference on Semantic Systems, 6–9 September 2021, Amsterdam, The
     Netherlands, volume 53 of Studies on the Semantic Web, IOS Press, 2021, pp. 58–73.
     doi:10.3233/SSW210035.
 [7] F. Michel, L. Djimenou, C. Faron-Zucker, J. Montagnat, Translation of Heterogeneous
     Databases into RDF, and Application to the Construction of a SKOS Taxonomical Reference,
     in: International Conference on Web Information Systems and Technologies, Springer,
     2015, pp. 275–296. doi:10.1007/978-3-319-30996-5_14.
 [8] E. Iglesias, S. Jozashoori, M.-E. Vidal, Scaling up knowledge graph creation to large and
     heterogeneous data sources, Journal of Web Semantics 75 (2023). URL:
     http://arxiv.org/abs/2201.09694. doi:10.1016/j.websem.2022.100755.
 [9] S. Jozashoori, D. Chaves-Fraga, E. Iglesias, M.-E. Vidal, O. Corcho, Funmap: Efficient
     execution of functional mappings for knowledge graph creation, in: International Semantic
     Web Conference, Springer, 2020, pp. 276–293. doi:10.1007/978-3-030-62419-4_16.
[10] C. Stadler, L. Bühmann, L.-P. Meyer, M. Martin, Scaling RML and SPARQL-based Knowledge
     Graph Construction with Apache Spark, in: Knowledge Graph Construction Workshop,
     co-located with ESWC, 2023.
[11] D. Chaves-Fraga, F. Priyatna, A. Cimmino, J. Toledo, E. Ruckhaus, O. Corcho,
     GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain,
     Journal of Web Semantics 65 (2020) 100596. doi:10.1016/j.websem.2020.100596.
[12] E. de Vleeschauwer, G. Haesendonck, D. Van Assche, B. De Meester, RMLStreamer with
     Reference Conditions in the KGCW Challenge 2023, in: Knowledge Graph Construction
     Workshop, co-located with ESWC, 2023.

</pre>