<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Implementation for RML Logical Views</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Els de Vleeschauwer</string-name>
          <email>els.devleeschauwer@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pano Maria</string-name>
          <email>pano@skemu.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ben De Meester</string-name>
          <email>ben.demeester@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pieter Colpaert</string-name>
          <email>pieter.colpaert@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IDLab, Dept. Electronics &amp; Information Systems, Ghent University - imec</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>KGCW'24: 5th International Workshop on Knowledge Graph Construction</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Skemu</institution>
          ,
          <addr-line>Schiedam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Although the W3C Community Group on Knowledge Graph Construction (KGC)'s work on the modular RDF Mapping Language (RML) specification has taken great strides, open issues and respective solution proposals remain. Some of these issues are (i) inability to handle hierarchy in nested data, (ii) limited join functionality, and (iii) inability to handle mixed data formats. To combat these issues, the RML Logical Views module is proposed. However, proper but efficient validation of this module requires an implementation that allows short development cycles. In this workshop paper, we propose a proof-of-concept RML Logical Views implementation, independent of and complementary to existing RML mapping engines. Our proof-of-concept covers three important features of the new RML Logical Views module: (i) flattening of nested data, (ii) extended joining of data sources, and (iii) handling mixed data formats. Our implementation supports one nested source format (JSON) and one tabular source format (CSV), and can be used independently, as a preprocessor, by any RML engine. With this implementation, we successfully executed the available relevant test cases of the RML Logical Views module. Additionally, we measured the knowledge graph construction times on GTFS-Madrid-Bench. To accomplish this, we added an option to our implementation that replaces referencing object maps with joins in RML Logical Views. When we included our implementation in the knowledge graph construction pipeline, we noticed considerable execution time reductions. We conclude that the RML Logical Views specification can be implemented, and can solve needs that were not yet solvable by RML. The current implementation can already be realized as a modular part of a knowledge graph construction process.
Although boosting performance was not the aim of our work, our implementation reduces the execution time of GTFS-Madrid-Bench scale 100 by 16%, 33%, and 39% when combined respectively with SDM-RDFizer or RPT/Sansa, Morph-KGC, and Carml. RMLStreamer, when used alone, times out after two hours on this task, but, in conjunction with our implementation, completes it in 236 seconds. We hope this proof-of-concept inspires the developers of existing RML engines to integrate the RML Logical Views module and benefit from its features.</p>
      </abstract>
      <kwd-group>
        <kwd>RML Logical View</kwd>
        <kwd>flattening</kwd>
        <kwd>joining</kwd>
        <kwd>mixed content</kwd>
        <kwd>proof-of-concept</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
      <p>https://skemu.com (P. Maria); https://ben.de-meester.org/#me (B. De Meester); https://pietercolpaert.be/#me (P. Colpaert)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The W3C Community Group on Knowledge Graph Construction (KGC)<sup>1</sup> works on a declarative
approach to construct RDF graph data from existing, heterogeneous data sources. The group
recently proposed a new modular specification, ontology, and accompanying SHACL shapes for
the RDF Mapping Language (RML)<sup>2</sup>, including novel features that increase its expressiveness
and empower practitioners to define mapping rules for constructing RDF graph data that were
previously unattainable [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Nevertheless, challenges such as handling hierarchy of nested data,
more flexible joining (also across data hierarchies), and handling data sources that mix source
formats (e.g., a table that contains a column storing data as JSON) remain unsolved.
      </p>
      <p>As of July 2023, a dedicated task force of the W3C Community Group on KGC works on an
additional RML module: RML Logical Views<sup>3</sup>. This module aims to resolve the aforementioned
challenges by allowing mapping authors to specify a logical view: a flattened, source format-agnostic view over
one or more existing data sources.</p>
      <p>RML Logical Views are new in RML, and still under development. The specification<sup>4</sup> is not
finalized and still subject to change. Hence, there are no implementations available to validate
the feasibility of the theoretical concepts. Proper but efficient validation of the RML Logical
Views module requires an implementation that allows short development cycles.</p>
      <p>A proof-of-concept implementation can validate if the proposed RML constructions are
implementable, and reveal ambiguities in the specification. This allows for corrective iterations
during the development of the specification, and can also support the creation of test cases.</p>
      <p>In this workshop paper, we present RML-view-to-CSV: a proof-of-concept implementation
made available under MIT license and designed following state-of-the-art best practices.
RML-view-to-CSV<sup>5</sup> materializes all RML Logical Views specified in a given set of RML mapping
rules as CSV files, and rewrites these RML mapping rules to RML mapping rules without
RML Logical Views. Any RML engine that supports CSV files can then use these resulting
RML mapping rules and the generated CSV files to construct RDF graph data using existing
RML constructs. The implementation supports two source file formats (CSV and JSON) and
supports the following features: flattening of nested data, handling of mixed data formats, and
more flexible joining of data sources (also across data hierarchies). We added two optional
functionalities: the elimination of referencing object maps by moving the joins to logical views,
and the optimization of logical views based on the linked triples maps. The first option allows us
to test our implementation with existing RML benchmarks. Combined with the second option,
we obtained considerable execution time reductions for the knowledge graph construction
pipeline.</p>
      <p>After discussing related work (Section 2) and explaining the main principles of logical views
and how they are specified using RML Logical Views (Section 3), we describe our approach and
the implemented features (Section 4). Finally, we evaluate (Section 5), and conclude (Section 6).</p>
      <sec id="sec-2-1">
        <title>Footnotes</title>
        <p>1 https://www.w3.org/community/kg-construct/
2 https://w3id.org/rml/portal/
3 https://github.com/kg-construct/rml-lv
4 https://kg-construct.github.io/rml-lv/dev.html
5 https://github.com/RMLio/rml-view-to-csv</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>In this section, we first list solutions that were formulated in the past and influenced the
development of RML Logical Views (Section 2.1). Then, we describe similar modular implementation
approaches (Section 2.2).</p>
      <sec id="sec-3-1">
        <title>2.1. Related proposals to extend RML</title>
        <p>
          Partial solutions that also apply the concept of a (logical) view were formulated in the past
to address (parts of) the challenges presented in this paper [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ]. RML Fields is a proposed
solution to deal with nested data and mixed content in RML [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The proposal includes a
language construct and algorithm, but no implementation. RML Logical Views is a revised
extension of this proposal. Another solution is expanding RML’s existing SQL views to general
tabular sources instead of solely relational databases [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], to better support transformation
functions, complex joins, and mixed content in RML. These views are formulated as SQL queries
over tabular sources. The proposal is implemented as an extension of Morph-KGC<sup>6</sup> [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: a
state-of-the-art RML mapping engine implemented using the pandas Python package<sup>7</sup> [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>Other proposed solutions include Facade-X [6], which directly maps the data source into
RDF graph data (so not a flattened, but a graph-based source-agnostic view), which allows, e.g.,
joining over hierarchy via an iteration index<sup>8</sup>; and xR2RML, which (among others)
supports mixed syntax paths for handling data sources that mix source formats [7].</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. RML engine module implementations</title>
        <p>State-of-the-art RML engine module implementations typically provide (integrated or separate)
preprocessing steps. SDM-RDFizer introduces a preprocessing step for grouping mapping
rules, which leads to optimizations due to parallelization [8]. FunMap [9] is an interpreter
of RML+FnO (the RML module that allows describing data transformations in the mapping
process) that converts a data integration system defined using RML+FnO into an equivalent
one where RML mappings are function-free.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. RML Logical Views Primer</title>
      <p>Within RML (https://w3id.org/rml/portal), RDF triple generation is defined using triples maps.
Within these triples maps, expressions are used to fill in data values in specified places in the
triples. The data on which these triples are generated is currently specified using a logical source:
a construct to describe how to extract logical iterations out of heterogeneous data sources,
e.g. extracting individual rows from CSV files or extracting specific objects from JSON data
from a Web API response. Reference formulations allow specifying the manner to create logical
iterations and expressions (e.g. using the JSONPath reference formulation to handle JSON data).</p>
      <sec id="sec-4-1">
        <title>6 https://github.com/morph-kgc/morph-kgc/releases/tag/2.3.0; 7 https://pandas.pydata.org/; 8 https://github.com/kg-construct/mapping-challenges/issues/43</title>
        <p>For example, when mapping person data from a local CSV file that contains a person’s
(national) ID and name, the logical source describes how to extract rows as logical iterations
from that CSV file, using a default CSV reference formulation. On these logical iterations,
specific ID and name expressions are evaluated to create triples, using the ID expression as part
of the subject identifier, and the name expression as a literal object.</p>
        <p>
          With RML Logical Views, the W3C Community Group on KGC aims to describe a virtual
flattened view on top of a logical source. A logical view allows defining fields where each field
is an expression on its parent in terms of a reference formulation. A field’s parent is either
the logical iteration of the logical source when a field is defined at root-level of the logical
view, or another field. The result of evaluating a field’s expression is a list of values called an
iteration sequence [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The reference formulation of the expression of the field is the reference
formulation of its parent, unless it is specified on the field. Each field has a declared name which
is an alphanumeric string, and a field name which is the concatenation of the name of the parent
field, a dot, and the field’s declared name.
        </p>
        <p>
          The evaluation of the fields of a logical view on a given logical iteration forms a new view
iteration, which is the natural, full outer join of the logical iteration of the logical source and all
the fields’ iteration sequences in order of the defined field hierarchy. As a result, expression
result values derived along the same root-to-leaf path in the input data’s tree structure will end
up in the same view iteration [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>To reference a field value while generating triples in a triples map, its field name can be used.
This approach gives a mapping author the control to map a hierarchical source in a variety of
ways whilst still being able to respect the context of the source hierarchy.</p>
        <p>A logical view can be further extended with fields from one or more other logical views by
defining joins with other logical views. Since the logical iterations of a logical view can be
represented as a relation in relational algebra, we can also define joins in terms of relational
algebra (left join, inner join). These joins will define which field values from a parent logical
view will be incorporated in a logical iteration of the child logical view.</p>
        <p>In this section, we give an example-based explanation of the main RML Logical Views
functionalities. The examples<sup>9</sup> (Figures 1 to 3) include an RML mapping illustrating the declarative
description of a logical view, the source data for the logical source used as source for the logical
view, and an intermediate representation of the logical view. The headers of the columns of this
intermediate representation can be used as reference expression in expression maps.</p>
        <sec id="sec-4-1-1">
          <title>3.1. Flattening of nested data structures</title>
          <p>Problem: References to nested data structures, like JSON or XML, may return multiple values.
These values can be composite: they may again contain multiple values. RML defines mapping
constructs that produce results by combining the results of other mapping constructs in a specific
order. For example, a triples map combines the results of a subject map and a predicate-object
map in that order. Another example is a template expression, which combines character strings
and zero or more reference expressions in declared order. When mapping constructs produce
multiple results, the combining mapping constructs will apply an n-ary Cartesian product<sup>10</sup>
over the sets of results, maintaining the order of the mapping constructs. In the case of nested
data structures, this may cause the generation of results that do not match the source hierarchy,
i.e. do not follow the root-to-leaf paths in the source data, since values are combined irrespective
of it.</p>
          <p>9 Prefixes are omitted but can be found on https://prefix.cc
10 https://w3id.org/rml/core/spec#dfn-n-ary-cartesian-product</p>
          <p>Figure 1 (a) mapping_json_view.ttl (part 1)</p>
          <preformat>
:jsonSource a rml:LogicalSource ;
  rml:source "json_data.json" ;
  rml:referenceFormulation rml:JSONPath ;
  rml:iterator "$.people[*]" .

:jsonView a rml:LogicalView ;
  rml:onLogicalSource :jsonSource ;
  rml:field [
    rml:fieldName "name" ;
    rml:reference "$.name" ; ] ;
  rml:field [
    rml:fieldName "item" ;
    rml:reference "$.items[*]" ;
    rml:field [
      rml:fieldName "type" ;
      rml:reference "$.type" ; ] ;
    rml:field [
      rml:fieldName "weight" ;
      rml:reference "$.weight" ; ] ; ] .
</preformat>
          <p>Figure 1 (b) json_data.json</p>
          <preformat>
{ "people": [
  { "name": "alice",
    "items": [
      { "type": "sword",
        "weight": 1500 },
      { "type": "shield",
        "weight": 2500 }
    ] },
  { "name": "bob",
    "items": [
      { "type": "flower",
        "weight": 15 }
    ] }
] }
</preformat>
          <p>Figure 1 (c) Intermediate representation of :jsonView</p>
          <table-wrap>
            <table>
              <thead>
                <tr><th>name</th><th>item</th><th>item.type</th><th>item.weight</th></tr>
              </thead>
              <tbody>
                <tr><td>alice</td><td>{"type": "sword", "weight": 1500}</td><td>sword</td><td>1500</td></tr>
                <tr><td>alice</td><td>{"type": "shield", "weight": 2500}</td><td>shield</td><td>2500</td></tr>
                <tr><td>bob</td><td>{"type": "flower", "weight": 15}</td><td>flower</td><td>15</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>Furthermore, there is varying expressiveness in data source expression and query languages,
and many languages have limited support for hierarchy traversal. For example, JSONPath has
no operator to refer to an ancestor in the document hierarchy.</p>
          <p>This limits the ability of RML to map nested data.</p>
          <p>Solution: Fields can be defined in a hierarchy following the source document to produce
iterations from which source values along the same root-to-leaf path can be referenced. The
view iteration can be seen as flattening the hierarchy of the source document (Figure 1).</p>
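          <p>The flattening described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of RML-view-to-CSV: the function name flatten_people is hypothetical, the field hierarchy of :jsonView is hard-coded instead of being read from the mapping, and only the standard library is used. A person without items produces no row in this sketch; how the specification treats that case is not covered here.</p>
          <preformat>
```python
import json

# Source data copied from the json_data.json example in this section.
SOURCE = json.loads("""
{ "people": [
  { "name": "alice",
    "items": [ { "type": "sword", "weight": 1500 },
               { "type": "shield", "weight": 2500 } ] },
  { "name": "bob",
    "items": [ { "type": "flower", "weight": 15 } ] }
] }
""")


def flatten_people(source):
    """Return one flat dict (view iteration) per root-to-leaf path.

    Hypothetical helper: the nesting of the loops mirrors the field
    hierarchy name / item / item.type / item.weight of :jsonView.
    """
    rows = []
    for person in source["people"]:            # iterator $.people[*]
        for item in person.get("items", []):   # field "item": $.items[*]
            rows.append({
                "name": person["name"],        # field "name": $.name
                "item": json.dumps(item),      # composite value kept as text
                "item.type": item["type"],     # nested field "type"
                "item.weight": item["weight"], # nested field "weight"
            })
    return rows


VIEW = flatten_people(SOURCE)
for row in VIEW:
    print(row["name"], row["item.type"], row["item.weight"])
```
</preformat>
          <p>Each resulting row keeps values that lie on the same root-to-leaf path together, which is what prevents the unintended combinations produced by a plain Cartesian product.</p>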
        </sec>
        <sec id="sec-4-1-2">
          <title>3.2. Handling of mixed data formats</title>
          <p>
            Problem: Data in one format can contain multiple or composite values stored in another
format<sup>11</sup> [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], e.g. a CSV dataset could contain columns containing JSON values.
          </p>
          <p>11 https://webusers.i3s.unice.fr/~fmichel/xr2rml_specification_v5.1.html#_Toc466307454
          </p>
          <p>To define the expected form of references to input data, RML employs the notion of a reference
formulation that is a property of every logical source. However, currently a logical source
is limited to having a single reference formulation, meaning mixed format data can only be
referenced using a query language that supports just one of the formats.</p>
          <p>Solution: For every field, the reference formulation can be adapted (Figure 2). The default
reference formulation for a field is the reference formulation of its parent, or of the source of
the logical view for the fields at the root of the iteration. Thus, every iteration level can iterate
over data in a different format.</p>
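          <p>As an illustration of switching the reference formulation per field, the following sketch reads a CSV source and evaluates a nested field over a column that stores JSON. The data, the column names, and the function name mixed_view are hypothetical and not taken from the figures of this paper; only the Python standard library is used.</p>
          <preformat>
```python
import csv
import io
import json

# Hypothetical CSV source whose "items" column stores a JSON array.
CSV_TEXT = '''name,items
alice,"[{""type"": ""sword""}, {""type"": ""shield""}]"
bob,"[{""type"": ""flower""}]"
'''


def mixed_view(csv_text):
    """Flatten a CSV source that embeds JSON in one of its columns."""
    rows = []
    # Root level: rows are extracted with a CSV reference formulation.
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Nested level: the "items" value is parsed as JSON and iterated,
        # comparable to evaluating the JSONPath expression $[*] on it.
        for item in json.loads(record["items"]):
            rows.append({"name": record["name"], "item.type": item["type"]})
    return rows


MIXED = mixed_view(CSV_TEXT)
```
</preformat>
          <p>The outer loop and the inner loop each use the parser that matches the format at that level, which mirrors the idea that every iteration level can declare its own reference formulation.</p>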
        </sec>
        <sec id="sec-4-1-3">
          <title>3.3. Extended joining of data sources</title>
          <p>
            Problem: RML restricts join operations to referencing object maps. Since a referencing object
map can only generate an object that is an IRI or blank node subject as specified by a parent
triples map, it is not possible to combine data from two sources in one term, use data from a join
on another position than the object, or generate a literal using data from a join [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. Moreover,
RML cannot join correctly across hierarchies [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
          </p>
          <p>Solution: A logical view can be extended with fields from one or more other logical views as
a result of a join operation (Figure 3). The logical iteration will be adapted according to the type
of join specified (left join or inner join). Any needed flattening of hierarchical data is done in
the logical view before applying the join operation. Data that originally comes from a different
data source is thus treated equally in a joined logical view, allowing more flexibility as to where
to apply that joined data.</p>
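          <p>The difference between the two join types can be sketched as follows. The function join_views, its parameters, and the sample data are hypothetical and serve only as an illustration over already flattened view iterations; the sketch assumes a single equality join condition and uses only the Python standard library.</p>
          <preformat>
```python
def join_views(child, parent, child_key, parent_key, parent_fields, how="left"):
    """Extend child view iterations with fields from matching parent iterations.

    Hypothetical helper: `child` and `parent` are lists of flat dicts.
    """
    if how not in ("left", "inner"):
        raise ValueError("how must be 'left' or 'inner'")
    # Index the parent iterations by their join key.
    index = {}
    for p in parent:
        index.setdefault(p[parent_key], []).append(p)
    result = []
    for c in child:
        matches = index.get(c[child_key], [])
        if not matches and how == "left":
            # Left join: keep the child iteration, joined fields stay empty.
            result.append({**c, **{f: None for f in parent_fields}})
        for p in matches:
            result.append({**c, **{f: p[f] for f in parent_fields}})
    return result


# Hypothetical flattened views: a CSV-based view and a JSON-based view.
PEOPLE = [
    {"name": "alice", "birthyear": "1995"},
    {"name": "bob", "birthyear": "1999"},
    {"name": "tobias", "birthyear": "2005"},
]
ITEMS = [
    {"name": "alice", "item.type": "sword"},
    {"name": "alice", "item.type": "shield"},
    {"name": "bob", "item.type": "flower"},
]

LEFT = join_views(PEOPLE, ITEMS, "name", "name", ["item.type"], how="left")
INNER = join_views(PEOPLE, ITEMS, "name", "name", ["item.type"], how="inner")
```
</preformat>
          <p>With the left join, the iteration for "tobias" is kept with an empty item.type; with the inner join it is dropped. In both cases the joined field can afterwards be referenced like any other field of the view.</p>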
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Approach and Implementation</title>
      <p>We have designed our proof-of-concept implementation as a standalone application, independent
of and complementary to existing RML mapping engines, that can be used as a preprocessing
step in a knowledge graph construction pipeline (Figure 4). As such, we designed our
proof-of-concept following the state-of-the-art best practices of, amongst others, FunMap [9], i.e. we
rely on a set of lossless rewriting rules to push down and materialize the execution of RML
Logical Views in the initial step of the knowledge graph construction process.</p>
      <p>
        Our implementation, named RML-view-to-CSV, is available online<sup>12</sup> under the permissive
MIT license. Our code is built on top of pandas [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a Python library with data frames for
manipulation of structured data sets.
      </p>
      <p>At the moment of writing, our implementation supports one nested source format (JSON),
one tabular source format (CSV), and the three important features of the new RML Logical
Views module: flattening of nested data, more flexible joining of data sources (also across data
hierarchies), and handling of mixed data formats. Its main functionality is the materialization of
logical views (Section 4.1). We added two optional functionalities: the elimination of referencing
object maps (Section 4.2) and the optimization of logical views based on the linked triples maps
(Section 4.3).</p>
      <p>12 https://github.com/RMLio/rml-view-to-csv/ (v0.0.0), https://doi.org/10.5281/zenodo.11045497</p>
      <p>[Figure 4: pipeline diagram. An RML mapping and source data (CSV, JSON) are the input of RML-view-to-CSV, which produces an RML mapping without logical views and materialized logical views (CSV); an RML engine uses these to produce the RDF knowledge graph.]</p>
      <sec id="sec-5-1">
        <title>4.1. Materialization of logical views</title>
        <p>RML-view-to-CSV takes as input a given set of RML mapping rules and the source data used
in these mapping rules. It produces CSV files with the intermediate representation of every
logical view specified in the RML mapping rules, and a new set of RML mapping rules in which
all RML Logical Views are replaced by logical sources (Figure 5).</p>
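        <p>The materialization step itself can be pictured as writing the view iterations to a CSV file whose header holds the field names. The sketch below is an assumption-laden illustration with the Python standard library (the actual implementation builds on pandas); the function name materialize_csv and the sample rows are hypothetical.</p>
        <preformat>
```python
import csv
import io


def materialize_csv(rows, field_names):
    """Serialize view iterations as CSV text, one column per field name."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_names, lineterminator="\n")
    writer.writeheader()      # the header cells are the field names of the view
    writer.writerows(rows)    # one CSV row per view iteration
    return buffer.getvalue()


# Hypothetical view iterations of an already flattened logical view.
ROWS = [
    {"name": "alice", "item.type": "sword"},
    {"name": "bob", "item.type": "flower"},
]
CSV_OUT = materialize_csv(ROWS, ["name", "item.type"])
print(CSV_OUT)
```
</preformat>
        <p>In the rewritten mapping, a logical source pointing at such a file can then take the place of the logical view, and the column headers serve as references.</p>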
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Elimination of referencing object maps</title>
        <p>With the introduction of logical views, joins of data sources can be expressed within the
triples map (via referencing object maps) or within the logical view. We added an option
to RML-view-to-CSV to delegate the execution of joins expressed in triples maps to logical
views. With this option selected, RML-view-to-CSV rewrites the RML mapping rules before
materializing the logical views.</p>
        <p>
          It is known that self-join elimination must be performed for time-efficient execution of
RML mappings, e.g. the mappings used in the GTFS-Madrid-Benchmark [
          <xref ref-type="bibr" rid="ref4">10, 4, 8</xref>
          ]. Therefore,
RML-view-to-CSV first eliminates unnecessary self-joins. A self-join is unnecessary when
(i) the same logical source is used for the child and the parent triples map, (ii) all involved
join conditions use the same references for both parent and child, and (iii) either the subject
map of the parent triples map, or the subject map, predicate map, and graph map of the child
triples map, only mention a subset of the references used in the join conditions. In that case,
the referencing object map is replaced with a simple object map based on the subject map of
the parent triples map.
        </p>
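        <p>The three conditions for an unnecessary self-join can be read as a simple predicate. The sketch below is one possible reading, not the code of RML-view-to-CSV: the dictionary shape used for triples maps (keys source, subject_refs, predicate_refs, graph_refs), the function name, and the sample values are all hypothetical.</p>
        <preformat>
```python
def is_unnecessary_self_join(child, parent, join_conditions):
    """Check the self-join elimination conditions on a simplified model.

    `child` and `parent` are dicts describing triples maps (hypothetical
    shape); `join_conditions` is a list of (child_reference,
    parent_reference) pairs.
    """
    # (i) Child and parent triples map use the same logical source.
    same_source = child["source"] == parent["source"]
    # (ii) Every join condition uses the same reference on both sides.
    same_refs = all(c == p for c, p in join_conditions)
    join_refs = {c for c, _ in join_conditions}
    # (iii) Either the parent subject map, or the child subject, predicate
    # and graph maps, only mention references used in the join conditions.
    parent_ok = set(parent["subject_refs"]).issubset(join_refs)
    child_refs = (set(child["subject_refs"])
                  | set(child["predicate_refs"])
                  | set(child["graph_refs"]))
    child_ok = child_refs.issubset(join_refs)
    return same_source and same_refs and (parent_ok or child_ok)


# Hypothetical triples maps over the same CSV source.
CHILD = {"source": "trips.csv", "subject_refs": ["trip_id"],
         "predicate_refs": [], "graph_refs": []}
PARENT = {"source": "trips.csv", "subject_refs": ["service_id"]}
OTHER_PARENT = {"source": "calendar.csv", "subject_refs": ["service_id"]}

SAME = is_unnecessary_self_join(CHILD, PARENT, [("service_id", "service_id")])
DIFFERENT = is_unnecessary_self_join(CHILD, OTHER_PARENT,
                                     [("service_id", "service_id")])
```
</preformat>
        <p>In the first call the parent subject only depends on the join reference, so the object can be generated directly from the child iteration; in the second call the sources differ and a real join remains necessary.</p>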
        <p>All remaining referencing object maps are rewritten as an equivalent combination of two
new logical views and a new triples map (Figure 6).</p>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>test case number</th><th>test case description</th></tr>
            </thead>
            <tbody>
              <tr><td>RMLLVTC0001</td><td>logical view over JSON source</td></tr>
              <tr><td>RMLLVTC0002</td><td>logical view over JSON source with flattening of nested data</td></tr>
              <tr><td>RMLLVTC0003</td><td>logical view over CSV source, extended with data from a logical view over JSON source using a left join</td></tr>
              <tr><td>RMLLVTC0004</td><td>logical view over CSV source, extended with data from a logical view over JSON source using an inner join</td></tr>
              <tr><td>RMLLVTC0005</td><td>logical view over CSV source, extended with data from a logical view over JSON source using an inner join, and use of references to field indexes</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>This option allows us to test our proof-of-concept implementation with existing RML
benchmarks, although these existing benchmarks do not include RML Logical Views yet, as logical
views are new in RML.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Optimization of logical views</title>
        <p>By default, RML-view-to-CSV does not take the content of triples maps into account. The
materialized logical views represent all fields and logical iterations of the declared RML Logical
Views. Thanks to this behaviour, we can verify if the processing of the logical views remains
aligned with the specification.</p>
        <p>However, logical views can contain fields and logical iterations that are not used by any
triples map. As the size of the source data impacts the knowledge graph construction process,
we added an option in RML-view-to-CSV to eliminate unnecessary fields and logical iterations.
With this option selected, RML-view-to-CSV first removes fields that are not used in any triples
map linked to the logical view. Then, all duplicate logical iterations are removed, except when
any linked triples map produces blank nodes that are not based on a field from the logical
view (in this latter case a new unique identifier is always generated, so the logical iterations
are not duplicates).</p>
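        <p>The two optimization steps can be sketched as a projection followed by duplicate removal. This is a simplified illustration with the Python standard library, not the pandas-based code of RML-view-to-CSV; the function name optimize_view, its parameters, and the sample rows are hypothetical.</p>
        <preformat>
```python
def optimize_view(rows, used_fields, has_independent_blank_nodes=False):
    """Drop unused fields, then drop duplicate logical iterations.

    `used_fields` lists the fields referenced by the linked triples maps.
    When a linked triples map generates blank nodes that are not based on
    a field, duplicates are kept, since every iteration yields a new node.
    """
    # Step 1: keep only the fields used by a linked triples map.
    projected = [{f: row[f] for f in used_fields} for row in rows]
    if has_independent_blank_nodes:
        return projected
    # Step 2: remove duplicate iterations, preserving the original order.
    seen = set()
    unique = []
    for row in projected:
        key = tuple(row[f] for f in used_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical view iterations; only "name" is used by the triples maps.
ROWS = [
    {"name": "alice", "item.type": "sword"},
    {"name": "alice", "item.type": "shield"},
    {"name": "bob", "item.type": "flower"},
]
OPTIMIZED = optimize_view(ROWS, ["name"])
KEPT = optimize_view(ROWS, ["name"], has_independent_blank_nodes=True)
```
</preformat>
        <p>Removing fields first matters: iterations that differ only in an unused field become identical after the projection and can then be collapsed.</p>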
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Evaluation</title>
      <p>We tested our proof-of-concept implementation against the test cases defined in the RML Logical
Views module, and added an evaluation on the GTFS-Madrid-Bench [11].</p>
      <sec id="sec-6-1">
        <title>5.1. Test cases in the RML Logical Views module</title>
        <p>At the moment of writing, the RML Logical Views module includes 5 test cases<sup>13</sup> (Table 1). With
RML-view-to-CSV as preprocessor to RMLMapper, we generated the expected output for the
relevant test cases (i.e. test cases RMLLVTC0001 to RMLLVTC0004). We excluded test case RMLLVTC0005
as it makes use of field indexes, which we have not yet included in our implementation due to
a lack of clarity about the expected behaviour<sup>14</sup>. During the evaluation of the test cases, we noticed
and corrected human mistakes in the mappings and expected output<sup>15</sup>. This confirms the
need and benefit of a proof-of-concept implementation during the development phase of a new
RML module: a proof-of-concept implementation helps to spot ambiguities in the specification,
ontology and shapes as well as errors in the test cases in an early development stage.</p>
        <p>13 https://github.com/kg-construct/rml-lv/tree/main/test-cases</p>
        <p>[Figure 7: execution time (s) on GTFS-Madrid-Bench scale 10 and scale 100 for CARML, Morph-KGC, RMLMapper, RPT/Sansa, SDM-RDFizer, and RMLStreamer, comparing three pipelines: RML engine, RML-view-to-CSV &amp; RML engine, and RML-view-to-CSV (optimize) &amp; RML engine.]</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. GTFS-Madrid-Bench</title>
        <p>We tested our proof-of-concept implementation on the GTFS-Madrid-Bench, comparing the
execution of joins by our implementation versus the execution of joins by existing RML engines,
i.e. Carml, Morph-KGC, RMLMapper, RMLStreamer, RPT/Sansa, and SDM-RDFizer<sup>16</sup>. Our test
setup included three pipelines: RML engine only, RML-view-to-CSV and RML engine, and
RML-view-to-CSV with optimization (Section 4.3) and RML engine. In the latter two pipelines,
RML-view-to-CSV first eliminates all referencing object maps in the GTFS-Madrid-Bench mapping,
rewriting them as logical views whenever applicable (Section 4.2). We measured the knowledge
graph construction time per pipeline (i.e. including the execution time of RML-view-to-CSV
where applicable) and per RML engine for scales 1, 10 and 100 of the GTFS-Madrid-Bench, with
CSV as source format, using a device with the following specifications: 2 x Hexacore Intel E5645
(2.4GHz) CPU, 24GB RAM, 1x 250GB hard disk. All experiments were performed 5 times and
the average of the measurements is reported (Figure 7). The test scripts are available online<sup>17</sup>.</p>
        <p>14 https://github.com/kg-construct/rml-lv/issues/20
15 https://github.com/kg-construct/rml-lv/pull/22
16 https://github.com/carml/carml-jar (V1.3.0), https://github.com/morph-kgc/morph-kgc (v2.6.4), https://github.com/RMLio/rmlmapper-java (v6.3.0), https://github.com/RMLio/RMLStreamer (v2.5.0), https://github.com/SDM-TIB/SDM-RDFizer (v4.7.3.5), and https://github.com/SmartDataAnalytics/RdfProcessingToolkit/ (v.1.9.5) respectively.</p>
        <p>[Figure 8: RML-view-to-CSV execution time (s) for GTFS 1, GTFS 10, and GTFS 100.]</p>
        <p>RMLMapper and RMLStreamer cannot generate any output for the GTFS-Madrid-Bench
within one hour when we count on these RML engines to execute the joins. However, when
delegating the joins to RML-view-to-CSV, these mapping engines were able to generate correct
output, with timings similar to or better than those of a state-of-the-art RML engine like Morph-KGC.
We note that RMLMapper still cannot handle GTFS scale 100: RMLMapper loads all data
in memory during mapping, and the testing device ran out of memory during GTFS scale
100. We also note that the elimination of unnecessary fields and duplicate logical iterations
(RML-view-to-CSV with optimization) reduces the execution time of RML-view-to-CSV by a
factor of six for scale 100 (Figure 8), and leads to the fastest pipelines in combination with all
tested engines.</p>
        <p>The combination of RML-view-to-CSV with optimization and RMLStreamer emerges as the
most efficient approach<sup>18</sup>. This showcases the potential of modular mapping engines, delegating
each task to the most suitable framework: the data frames from pandas (used in
RML-view-to-CSV) are optimized for data transformations and joins, while the streaming and parallelization
of Flink enable RMLStreamer to create RDF graph data with a linear scaling of execution time
and CPU usage, proportional to the size of the input data, while maintaining a constant memory
usage [12].</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>In this paper, we show how the RML Logical Views specification can be implemented and can
solve needs that were not solvable yet by RML. The implementation can be realized as a modular
part of a knowledge graph construction process.</p>
      <p>17 doi:10.5281/zenodo.10987733
18 The current set-up of using a preprocessing materialization step currently prevents RMLStreamer from using this
optimization for streaming data.</p>
      <p>Our proof-of-concept, RML-view-to-CSV, used as a preprocessor to any RML engine (that supports
CSV input), not only helped to validate and improve the RML Logical Views module; benchmarks
also show performance gains for handling joins between CSV sources. The modular
approach showcases the potential of modular mapping engines, allowing the use of specialized
data structures and delegating each task to the most suitable framework, offering best-of-breed
performance enhancements.</p>
      <p>The RML Logical Views module is still under development. Its finalization, including formal
definitions and more features (e.g. indexes per field, data transformation functions, groups
and aggregations), is future work. We intend to gradually integrate the additional features in
RML-view-to-CSV, as they are discussed in the W3C Community Group on KGC and described
in the RML Logical Views specification. Furthermore, we will investigate whether the observed
performance improvements also hold for sources other than CSV.</p>
      <p>We will continue to share our code as inspiration for developers who want to implement
RML Logical Views directly in RML engines once this new RML module has been finalized.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The described research activities were supported by SolidLab Vlaanderen (Flemish Government,
EWI and RRF project VV023/10), and the European Union’s Horizon Europe research and
innovation program under grant agreement no. 101058682 (Onto-DESIDE). The authors want
to thank David Chaves-Fraga for the discussions that were a source of inspiration for this
proof-of-concept implementation.</p>
      <p>[6] E. Daga, L. Asprino, P. Mulholland, A. Gangemi, Facade-X: An Opinionated Approach
to SPARQL Anything, in: Further with Knowledge Graphs – Proceedings of the 17th
International Conference on Semantic Systems, 6–9 September 2021, Amsterdam, The
Netherlands, volume 53 of Studies on the Semantic Web, IOS Press, 2021, pp. 58–73.
doi:10.3233/SSW210035.</p>
      <p>[7] F. Michel, L. Djimenou, C. Faron-Zucker, J. Montagnat, Translation of Heterogeneous
Databases into RDF, and Application to the Construction of a SKOS Taxonomical Reference,
in: International Conference on Web Information Systems and Technologies, Springer,
2015, pp. 275–296. doi:10.1007/978-3-319-30996-5_14.</p>
      <p>[8] E. Iglesias, S. Jozashoori, M.-E. Vidal, Scaling up knowledge graph creation to large and
heterogeneous data sources, Journal of Web Semantics 75 (2023).
URL: http://arxiv.org/abs/2201.09694. doi:10.1016/j.websem.2022.100755.</p>
      <p>[9] S. Jozashoori, D. Chaves-Fraga, E. Iglesias, M.-E. Vidal, O. Corcho, FunMap: Efficient
execution of functional mappings for knowledge graph creation, in: International Semantic
Web Conference, Springer, 2020, pp. 276–293. doi:10.1007/978-3-030-62419-4_16.</p>
      <p>[10] C. Stadler, L. Bühmann, L.-P. Meyer, M. Martin, Scaling RML and SPARQL-based Knowledge
Graph Construction with Apache Spark, in: Knowledge Graph Construction Workshop,
co-located with ESWC, 2023.</p>
      <p>[11] D. Chaves-Fraga, F. Priyatna, A. Cimmino, J. Toledo, E. Ruckhaus, O. Corcho,
GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain, Journal
of Web Semantics 65 (2020) 100596. doi:10.1016/j.websem.2020.100596.</p>
      <p>[12] E. de Vleeschauwer, G. Haesendonck, D. Van Assche, B. De Meester, RMLStreamer with
Reference Conditions in the KGCW Challenge 2023, in: Knowledge Graph Construction
Workshop, co-located with ESWC, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Assche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Arenas-Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Debruyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jozashoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>The RML Ontology: A Community-Driven Modular Redesign After a Decade of Experience in Mapping Heterogeneous Data to RDF</article-title>
          , in:
          <source>Proceedings of the International Semantic Web Conference (ISWC), Lecture Notes in Computer Science</source>
          , Springer, Cham,
          <year>2023</year>
          . doi:10.1007/978-3-031-47243-5_9.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Delva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Assche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heyvaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>Integrating nested data into knowledge graphs with RML fields</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heyvaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Priyatna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2nd International Workshop on Knowledge Graph Construction co-located with 18th Extended Semantic Web Conference (ESWC 2021)</source>
          , volume
          <volume>2873</volume>
          , CEUR,
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-2873/paper9.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Arenas-Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alobaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Navas-Loro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Boosting knowledge graph generation from tabular data with RML views</article-title>
          , in:
          <source>The Semantic Web</source>
          , Springer Nature Switzerland,
          <year>2023</year>
          , pp.
          <fpage>484</fpage>
          -
          <lpage>501</lpage>
          . doi:10.1007/978-3-031-33455-9_29.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Arenas-Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Morph-KGC: Scalable knowledge graph materialization with mapping partitions</article-title>
          ,
          <source>Semantic Web</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . doi:10.3233/sw-223135.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>McKinney</surname>
          </string-name>
          ,
          <article-title>pandas: a foundational Python library for data analysis and statistics</article-title>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>