=Paper=
{{Paper
|id=Vol-3718/paper7
|storemode=property
|title=RMLStreamer supported by RML-view-to-CSV in the performance track of the KGCW Challenge 2024
|pdfUrl=https://ceur-ws.org/Vol-3718/paper7.pdf
|volume=Vol-3718
|authors=Els de Vleeschauwer,Ben De Meester
|dblpUrl=https://dblp.org/rec/conf/kgcw/VleeschauwerM24
}}
==RMLStreamer supported by RML-view-to-CSV in the performance track of the KGCW Challenge 2024==
RMLStreamer supported by RML-view-to-CSV in the
performance track of the KGCW Challenge 2024
Els de Vleeschauwer1 , Ben De Meester1
1
IDLab, Dept. Electronics & Information Systems, Ghent University – imec, Belgium
Abstract
This paper presents the results of the performance track of the Knowledge Graph Construction
Workshop 2024 Challenge with RMLStreamer, an RML mapping engine that processes all
data in a streaming fashion. On mappings without joins, RMLStreamer scales well regarding
execution time and CPU usage, while maintaining a constant memory usage. To optimize
the processing of the joins, we added RML-view-to-CSV as a first step to our knowledge
graph construction pipeline. RML-view-to-CSV is a proof-of-concept implementation for RML
Logical Views, i.e. flattened, source format-agnostic views over one or more existing data
sources. RML-view-to-CSV can additionally rewrite referencing object maps as logical views,
before it materializes the logical views as CSV files. The combination of RML-view-to-CSV and
RMLStreamer emerges as an efficient approach, showcasing the potential of modular mapping
engines that delegate each task to the most suitable framework.
Keywords
RMLStreamer, RML-view-to-CSV, challenge, knowledge graph construction
1. Introduction
The Knowledge Graph Construction Workshop (KGCW) 2024 Challenge1 consists of two
tracks: (i) a conformance track, that aims to spark development of implementations for
the new RML specifications and improve the test-cases, and (2) a performance track, that
wants to encourage the implementation of optimizations not only for execution time but
also for CPU and memory usage. The conformance track covers the same experiments as
the KGCW 2023 Challenge2 , consisting of two parts: (i) knowledge graph construction
(KGC) parameters to evaluate individual parameters, e.g. joins and duplicates, with
artificial data, and (ii) GTFS-Madrid-Bench [1] to focus on real-life use cases based on
public transport data from Madrid. In contrast to the previous edition, all participants
now conduct the experiments on identical virtual machines provided by Orange3 . This
ensures that the results from all participants can be directly compared.
KGCW’24: 5th International Workshop on Knowledge Graph Construction, May 27, 2024, Crete, GRE
Envelope-Open els.devleeschauwer@ugent.be (E. de Vleeschauwer); ben.demeester@ugent.be (B. De Meester)
GLOBE https://ben.de-meester.org/#me (B. De Meester)
Orcid 0000-0002-8630-3947 (E. de Vleeschauwer); 0000-0003-0248-0987 (B. De Meester)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).
CEUR
CEUR Workshop Proceedings (CEUR-WS.org)
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073
1
https://doi.org/10.5281/zenodo.10721875
2
https://doi.org/10.5281/zenodo.7689310
3
https://hellofuture.orange.com/en/
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
CHALLENGE INPUT: RML mapping
RML mapping without joins
OUTPUT KGC pipeline:
RML-view-to-CSV RMLStreamer
RDF knowledge graph
CHALLENGE INPUT: materialized joins
source data (CSV files)
Figure 1: Knowledge graph construction pipeline: joins are resolved by RML-view-to-CSV to reduce
the knowledge graph construction execution time and the size of the resulting RDF knowledge graph.
In this paper, we present the results of the performance track for RMLStreamer [2], an
RML mapping engine which processes all data in a streaming fashion, in combination
with RML-view-to-CSV4 , a proof-of-concept implementation for RML Logical Views [3].
Section 2 describes the components of our knowledge graph construction pipeline.
Section 3 discusses the setup used to execute the challenge’s experiments. We present
our results in Section 4 and our conclusion in Section 5.
2. Knowledge Graph Construction Pipeline
Our knowledge graph construction pipeline (Figure 1) consists of two components: (i)
RMLStreamer executes the RML mapping rules in a streaming fashion [2], and (ii)
RML-view-to-CSV is a proof-of-concept implementation for RML Logical Views, that
can resolve joins between data sources [3].
2.1. RMLStreamer
RMLStreamer executes RML mapping rules to generate high quality Linked Data from
multiple originally (semi-)structured data sources in a streaming way. It handles big
input files and continuous data streams like sensor data without consuming more memory
when the input data size increases. It leverages Apache Flink to scale vertically across
multiple CPU cores and horizontally across multiple machines. In the challenge, we use
RMLStreamer version v2.5.0 with an embedded Flink version in a Docker container5 .
The challenge results of 2023 show that, on mapping tasks without joins, RMLStreamer
scales well regarding execution time and CPU usage, while maintaining a constant memory
usage [4].
However, joins can significantly prolongue the execution time and increase the size of
the output of RMLStreamer, as RMLStreamer does not eliminate self-joins or duplicates.
As a result, RMLStreamer needs more than three hours to execute the first scale of
the GTFS-Madrid-Bench, generating an output of 105 GB. Therefore, we delegate the
execution of joins to RML-view-to-CSV.
4
https://github.com/RMLio/rml-view-to-csv/
5
https://zenodo.org/records/7998156
2.2. RML-view-to-CSV
RML-view-to-CSV is a proof-of-concept implementation for RML Logical Views, a new
RML module that is still under development. In our previous paper [3] we elaborated
on both RML Logical Views and RML-view-to-CSV. In this section, we highlight the
aspects used for the challenge.
RML Logical Views6 allow specifying a logical view: a flattened, source format-agnostic
view over one or more existing data sources. A view over multiple data sources can be
created by joining a logical view with other logical views.
RML-view-to-CSV resolves logical views as a first step in the KGC pipeline. It includes
an option to delegate the execution of joins expressed in triples maps to logical views.
First, RML-view-to-CSV identifies and optimizes redundant self-joins, and eliminates the
remaining referencing object maps from the mapping, replacing them by equivalent logical
views. Afterwards, it materializes the logical views as CSV files. During this process, RML-
view-to-CSV also takes the related triples maps into account, eliminating redundant fields
and duplicate logical iterations. Finally, it rewrites the mapping accordingly, replacing
the logical views as logical sources over the materialized logical views.
At this moment, RML-view-to-CSV supports one nested source format (JSON) and
one tabular source format (CSV). Slight adaptations to the experiment setup were needed
to overcome this limitation.
3. Experiment setup
The KGCW 2024 Challenge provides CSV files as source data, mapping files, queries,
baseline results (i.e. the expected set of triples and query results), an example pipeline
based on the MySQL, RMLMapper, and Virtuoso for reaching those results, and a tool
for executing the example pipeline.
The following adaptations were made to the provided experiments to enable execution
with our KGC pipeline. (i) In the provided end-to-end pipelines, the CSV files are loaded
into a relational database. As RML-view-to-CSV does not support SQL (yet), we used
the CSV files directly to construct the knowledge graphs and adapted the mapping
files accordingly. (ii) As RML-view-to-CSV does not support XML (yet), we replaced
the XML files in the GTFS-Madrid-Bench heterogeneity experiments by JSON files.
We conducted two heterogeneity experiments: one experiment with only JSON source
data and one experiment with 50 % CSV and 50 % JSON source data, both on scale
100. (iii) We added a condition to the mapping files to recognize the string NULL in the
provided CSV files as an empty value. The GTFS experiments were executed with our
KGC pipeline as shown in Figure 1. For the KGC parameters experiments without joins,
we skipped the preprocessing step with RML-view-to-CSV, as it is a redundant step for
these experiments. For the KGC parameter experiments with joins, we tested both the
pipeline with and without RML-view-to-CSV as preprocessor, to measure also the impact
of RML-view-to-CSV for those experiments.
6
https://github.com/kg-construct/rml-lv
We compared our experiments’ results to ensure that our output is correct with
respect to the baseline results of the challenge. For the first part of the challenge (KGC
parameters), where the output of RMLStreamer is not loaded into a triples store, we
deduplicated the output results as RMLStreamer cannot eliminate duplicates by itself.
After deduplication, we compared the number of triples to the baseline results of the
challenge. For the second part of the challenge (GTFS-Madrid-Bench) we compared the
number of query results to the baseline.
All experiments were executed on the virtual machine provided by Orange, with
following specifications: 4 vCPUs, 16 GB RAM and 140GiB SSD storage running on
Ubuntu 22.04.3 LTS. The challenge execution tool configures the Java heap space to 50
% of the available memory. All experiments were performed 5 times, and the experiment
with the median of the measurements is reported. All files needed to reproduce the
conducted experiments are available on Zenodo7 .
4. Results
In Figure 2 and Table 1 we included the measured execution time, CPU time, and maximal
memory usage of the knowledge graph construction pipeline for selected experiments.
The complete overview of the results, as it was submitted to the KGCW 2024 Challenge,
is available on Zenodo8 .
First, we verify how RMLStreamer behaves when the size of the expected output in-
creases. This is best illustrated by the GTFS-Madrid-Bench scale experiments. Figure 2a
and Figure 2b show that both RML-view-to-CSV and RMLStreamer scale towards a linear
trend. Note that the reported metrics include the startup time of RML-view-to-CSV
and RMLStreamer (no separate measurements in the challenge execution tool), that is
independent of data size and has a higher impact on the lower scales. The execution time
and CPU time increases with a factor of ten, the same factor as the data size for higher
scales. The peak RAM memory (fig. 2c) measured is similar for all scales when using
RMLStreamer. RMLStreamer has a constant memory usage independent of the data size,
because it processes everything in a streaming way. This ensures a stable performance
independent of the data size. As long as there is space to store the output, RMLStreamer
can continue its knowledge graph construction process. For RML-view-to-CSV we note
an increasing use of memory for the higher GTFS scales. Nevertheless, it was still able
to handle scale 1000 without reaching the memory limitations of the provided hardware.
The measurements for the KGC parameter experiments confirm these observations
(Table 1 Section 1). RMLStreamer shows linear scaling of execution time and CPU usage,
proportional to the size of the input data, in combination with a constant memory usage.
Second, we evaluate the impact of the format of the data input. Replacing the
CSV source data by JSON data increases the execution time of RMLStreamer with a
factor of two. The difference in execution time for RMLStreamer is the consequence of
RMLStreamer chunking CSV files and processing the chunks in parallel. This is not the
7
https://doi.org/10.5281/zenodo.11100801
8
https://doi.org/10.5281/zenodo.11100801
5 RML-view-to-CSV 6 RML-view-to-CSV
GTFS-Madrid-Bench scale (CSV) 1 29 1 76
GTFS-Madrid-Bench scale (CSV)
RMLStreamer RMLStreamer
34 82
Total KGC pipeline Total KGC pipeline
7 7
10 79 10 272
86 279
19 18
100 565 100 2.194
584 2.213
190 185
1.000 5.766 1000 22.918
5.956 23.103
1 10 100 1.000 10.000 1 10 100 1.000 10.000 100.000
Execution time (s) CPU time (s)
(a) Linear trend: the execution time of both (b) Linear trend: the CPU usage of both
RML-view-to-CSV and RMLStreamer RML-view-to-CSV and RMLStreamer
increases with the same factor as the increases with the same factor as the
data size for higher scales. data size for higher scales.
RML-view-to-CSV RMLStreamer
175
8,0 Execution
1.325
time (s)
7,0 1.500
Peak RAM memory (GB)
6,3 6,4 6,3
6,0
178
5,0 CPU
4,0 2.003
4,0 time (s)
2.180
3,0 2,6
7,5 RML-view-to-CSV
2,0 Peak RAM
6,3 RMLStreamer
1,0 0,5 0,5 0,7 memory (GB) Total KGC pipeline
0,0
1 10 100 1.000 1 10 100 1.000 10.000
GTFS-Madrid-Bench scale (CSV) GTFS-Madrid-Bench scale 100 (JSON)
(c) The memory consumption of RML-view- (d) When using JSON data as source instead
to-CSV increases, while RMLStreamer of CSV data, the performance impact
has a constant memory usage indepen- is higher on RML-view-to-CSV than on
dent of data size. RMLStreamer.
Figure 2: Metrics of the knowledge graph construction pipeline for the GTFS-Madrid-Bench experi-
ments.
case for the JSON formats yet. The performance impact of nested data is higher for
RML-view-to-CSV. The execution time and CPU usage of RML-view-to-CSV increases
with a factor of ten, and its memory consumption with a factor of three (fig. 2d).
Third, we investigate the impact of duplicates and empty values in the input data. As
RMLStreamer does not eliminate duplicates and the duplicate tests all start with the
same amount of input data, there is no noticeable difference in performance between
the experiments with and without duplicates, whilst a duplication elimination could
result in a much better performance for tests with duplicates (Table 1 Section 2). In the
source data of the experiments with empty values (Table 1 Section 3), this string NULL is
representing an empty value in a CSV file. We had to add a condition to the mappings
of those experiments to recognise this string as an empty values. The execution of this
condition increased the execution time and CPU usage with 50% when all rows contain
Execution (s) CPU (s) Peak RAM (GB) Output (triples)
1. Records
10K rows 20 columns 22 49 1,8 200.000
100K rows 20 columns 41 120 6,1 2.000.000
1M rows 20 columns 187 694 6,1 20.000.000
10M rows 20 columns 1.769 6.918 6,2 200.000.000
2. Duplicates
100 percent 43 128 6,1 20
0 percent 44 130 6,1 2.000.000
3. Empty values
100 percent 66 223 6,1 0
0 percent 46 147 6,1 2.000.000
4a. Joins (without support of RML-view-to-CSV)
1-1 0 percent 41 120 6.3 0
5-5 100 percent 110 396 6.4 2.500.000
4b. Joins (with support of RML-view-to-CSV)
1-1 0 percent 37 88 6,1 0
5-5 100 percent 88 242 6,1 2.500.000
Table 1
Metrics of the knowledge graph construction step for selected KGC parameter experiments
empty values.
Last, we comment on the KGC parameter experiments that include joins (Table 1
Section 4). We executed these experiments without and with RML-view-to-CSV as
preprocessor and noticed that RML-view-to-CSV reduced the total execution time and
CPU usage with respectively 20% and 39% for the experiments containing most joins.
5. Conclusion
The KGCW 2024 Challenge results confirms the observations of the KGCW 2023 Chal-
lenge: RMLStreamer has a linear scaling of execution time and CPU usage, proportional
to the size of the input data, while maintaining a constant memory usage. Scalability is
the main strength of RMLStreamer.
The main weakness of RMLStreamer is its inefficient implementation of join opera-
tions (e.g. GTFS-Madrid-Bench experiments with joins cannot be handled properly by
RMLStreamer). When delegating this task to RML-view-to-CSV as a preprocessor, we
resolve this weakness and build a reliable and performing knowledge graph construction
pipeline.
The challenge results reveal of slower performance of RML-view-to-CSV when handling
nested data. As RML Logical Views enable the flattening of nested data, the implemen-
tation of RML-view-to-CSV is processing all JSON fields separately. A more efficient
implementation for nested data sources is a challenge for future implementations, after
the RML Logical View specification is finalized.
Comparing to the results of KGCW Challenge 2023 [4], we see a direct relation between
the performance of RMLStreamer, and the available CPU cores and RAM memory of the
virtual machines used for the experiments. RMLStreamer is now up to a factor of three
slower and uses 30% less memory. The virtual machine used for the KGCW Challenge
2023 had 12 CPU cores and 24 GB RAM; for the KGCW Challenge 2024, all experiments
were conducted on a virtual machine with 4 CPU cores and 16 GB RAM. We concluded
that RMLStreamer takes full advantage of the number of available CPU cores and of the
available memory.
At the moment of writing, we have no insight in the results of the other engines
participating in the performance track of the KGCW Challenge 2024. We are looking
forward to the comparison of the challenge results.
Acknowledgments
The described research activities were supported by SolidLab Vlaanderen (Flemish
Government, EWI and RRF project VV023/10), and the European Unions Horizon
Europe research and innovation program under grant agreement no. 101058682 (Onto-
DESIDE).
References
[1] D. Chaves-Fraga, F. Priyatna, A. Cimmino, J. Toledo, E. Ruckhaus, O. Corcho,
Gtfs-madrid-bench: A benchmark for virtual knowledge graph access in the transport
domain, Journal of Web Semantics 65 (2020) 100596. doi:10.1016/j.websem.2020.
100596 .
[2] Sitt Min Oo, G. Haesendonck, B. De Meester, A. Dimou, RMLStreamer-SISO: An
RDF Stream Generator from Streaming Heterogeneous Data, in: U. Sattler, A. Hogan,
M. Keet, V. Presutti, J. P. A. Almeida, H. Takeda, P. Monnin, G. Pirrò, C. d’Amato
(Eds.), The Semantic Web – ISWC 2022, Springer, Springer International Publishing,
Cham, 2022, pp. 697–713. doi:10.1007/978- 3- 031- 19433- 7_40 .
[3] E. de Vleeschauwer, B. De Meester, P. Colpaert, RML-view-to-CSV: A Proof-
of-Concept Implementation for RML Logical Views, in: Proceedings of the 5th
International Workshop on Knowledge Graph Construction (KGCW 2024) co-located
with 20th Extended Semantic Web Conference (ESWC 2024), 2024.
[4] E. de Vleeschauwer, G. Haesendonck, D. Van Assche, B. D. Meester, B. De Meester,
RMLStreamer with Reference Conditions in the KGCW Challenge 2023, in: Proceed-
ings of the 4rd International Workshop on Knowledge Graph Construction (KGCW
2023) co-located with 20th Extended Semantic Web Conference (ESWC 2023), 2023.