<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>with Reference Conditions in the KGCW Challenge 2023</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Els de Vleeschauwer</string-name>
          <email>els.devleeschauwer@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerald Haesendonck</string-name>
          <email>gerald.haesendonck@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dylan Van Assche</string-name>
          <email>dylan.vanassche@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ben De Meester</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IDLab, Dept. Electronics &amp; Information Systems, Ghent University - imec</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>3</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>The Knowledge Graph Construction Workshop (KGCW) 2023 Challenge aims to be a competitive challenge for knowledge graph construction systems, encouraging optimizations not only for execution time but also for CPU and memory usage. We participated in this challenge with RMLStreamer, an RML mapping engine which processes all data in a streaming fashion. For the second part of the challenge, which is based on the GTFS-Madrid-Bench, we added RMLLooseGenerator as a first step. RMLLooseGenerator is a proof-of-concept implementation that simulates the effect of using reference conditions in RML mapping rules. In previous work we showed that using reference conditions in the GTFS-Madrid-Bench mapping file results in exactly the same graph output, while join operations are executed faster. The challenge results show that RMLStreamer scales well regarding execution time and CPU usage, while maintaining a constant memory usage. Therefore it received the Scalability Award in the KGCW 2023 Challenge. The challenge also highlighted some weaknesses of RMLStreamer: no support for relational databases, an inefficient implementation of join operations, and a longer execution time when handling nested sources such as JSON and XML files. After the challenge, RMLStreamer was extended with support for relational databases. In the future, we will investigate how to further optimize the handling of joins and nested sources.</p>
      </abstract>
      <kwd-group>
        <kwd>RMLStreamer</kwd>
        <kwd>challenge</kwd>
        <kwd>knowledge graph construction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Knowledge graph construction from heterogeneous data has seen a lot of uptake in the
last decade, from compliance to performance optimizations with respect to execution
time [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. However, metrics other than execution time, e.g. CPU or memory usage, are often
not considered when benchmarking knowledge graph construction systems.
      </p>
      <p>Author homepages: https://dylanvanassche.be/ (D. Van Assche); https://ben.de-meester.org/#me (B. De Meester)
      </p>
      <p>2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073</p>
      <sec id="sec-2-2">
        <p>
          The KGCW 2023 Challenge aims to encourage optimizations not only for execution time, but also
for CPU and memory usage. The challenge consists of two
parts: (i) knowledge graph construction (KGC) parameters, which evaluates the impact of individual
parameters, e.g. joins and duplicates, with artificial data, and (ii) GTFS-Madrid-Bench [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] to
focus on real-life use cases based on public transport data from Madrid.
        </p>
        <p>
          In this paper, we present the results of the KGCW 2023 Challenge for RMLStreamer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ],
an RML mapping engine which processes all data in a streaming fashion, in combination
with RMLLooseGenerator, a proof-of-concept implementation that simulates the effect of
using reference conditions [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>Section 2 describes the components of our knowledge graph construction pipeline.
Section 3 discusses the setup we used to execute the challenge’s experiments. Section 4
explains how we ran the other RML engines we compare against. We present our results in
Section 5 and our conclusion in Section 6.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Knowledge Graph Construction Pipeline</title>
      <p>
        Our knowledge graph construction pipeline consists of two parts: (i) RMLLooseGenerator
emulates the effect of reference conditions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and (ii) RMLStreamer executes the RML
mapping rules in a streaming fashion [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>2.1. RMLLooseGenerator</title>
        <p>
          RMLLooseGenerator is a proof-of-concept implementation for simulating the effect of
reference conditions [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] in RML mapping rules. Reference conditions enable the reuse of
another RML triples map’s subject map without joining the logical sources. If a subject
map referenced by an RML join has no references other than those mentioned in the join
conditions, RMLLooseGenerator interprets these join conditions as reference
conditions. It generates a new mapping file in which those joins are replaced by
appropriately adjusted object maps that craft the URIs directly. For example, in the mapping of the
GTFS-Madrid-Bench it replaces rr:objectMap [rr:parentTriplesMap &lt;#agency&gt;; rr:joinCondition
[rr:child "agency_id"; rr:parent "agency_id"]] by rr:objectMap [rr:template
"http://transport.linkeddata.es/madrid/agency/{agency_id}"]. The new mapping file
can be processed by any RML engine.
        </p>
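        <p>The rewriting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the RMLLooseGenerator implementation: the dictionary layout of the object map and the function names rewrite_join and template_references are assumptions made for this example, and details such as term types or escaped braces in templates are ignored.</p>
        <preformat><![CDATA[
```python
import re

# Matches one reference in an RML template, e.g. "{agency_id}".
REFERENCE = re.compile(r"\{([^{}]+)\}")


def template_references(template):
    """Return the set of reference names used in an RML template string."""
    return set(REFERENCE.findall(template))


def rewrite_join(object_map, parent_subject_template):
    """Replace a join-based object map by a template-based one when possible.

    object_map is a hypothetical dictionary of the form
    {"parentTriplesMap": name, "joinConditions": [(child, parent), ...]}.
    parent_subject_template is the subject template of the referenced triples map.
    A new dictionary is returned; the input is left unchanged.
    """
    parent_to_child = {parent: child for child, parent in object_map["joinConditions"]}
    refs = template_references(parent_subject_template)

    # Only rewrite when every reference of the parent's subject template
    # is mentioned in a join condition.
    if not refs <= set(parent_to_child):
        return dict(object_map)

    # Substitute each parent reference by the matching child reference.
    new_template = REFERENCE.sub(
        lambda match: "{" + parent_to_child[match.group(1)] + "}",
        parent_subject_template,
    )
    return {"template": new_template}


agency_join = {
    "parentTriplesMap": "#agency",
    "joinConditions": [("agency_id", "agency_id")],
}
agency_subject = "http://transport.linkeddata.es/madrid/agency/{agency_id}"
rewritten = rewrite_join(agency_join, agency_subject)
```
]]></preformat>
        <p>In this sketch a join is left untouched as soon as the parent subject template uses a reference that is not covered by a join condition, which mirrors the precondition stated in the text.</p>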
        <p>
          RMLLooseGenerator is used only for the second part of the challenge (fig. 2), which
is based on the GTFS-Madrid-Bench. RMLStreamer does not eliminate self-joins or
duplicates: with the original mapping it needs more than three hours to execute the first scale of the
GTFS-Madrid-Bench, generating an output of 105 GB. However, when all join conditions
of the GTFS-Madrid-Bench mapping are re-interpreted as reference conditions, RMLStreamer needs only
33 seconds, generating an output of 76 MB, while the resulting knowledge graph remains
semantically identical [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. RMLStreamer</title>
        <p>RMLStreamer executes RML mapping rules to generate high-quality Linked Data from
multiple (semi-)structured data sources. It
processes all data in a streaming fashion, so it handles big input files and continuous data
streams such as sensor data without consuming more memory when the input data size
increases. RMLStreamer leverages Apache Flink to scale vertically across multiple CPU
cores and horizontally across multiple machines. In the challenge we use RMLStreamer
version 2.4.2 with an embedded Flink version in a Docker container.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Experiment setup</title>
      <p>The KGCW 2023 Challenge provides CSV files as source datasets, mapping files, queries,
baseline results (i.e. the expected set of triples and query results), an example pipeline
based on the RMLMapper, MySQL, and Virtuoso for reaching those results, and a tool
for executing the example pipeline.</p>
      <p>We made the following adaptations to the provided experiments to enable execution
with RMLStreamer. (i) In the provided end-to-end pipelines the CSV files are loaded
into a relational database. As RMLStreamer did not yet support SQL when
the challenge was executed, we used the CSV files directly to construct
the knowledge graphs. (ii) We replaced the JSON iterator [*] with $[*] in the mapping
files. (iii) We added a condition to the mapping files to recognize the string NULL in the
provided CSV files as an empty value.</p>
      <p>We compared the results of our experiments with the baseline results of the challenge
to ensure that our output is correct. For the first part of the challenge (KGC
parameters), where the output of RMLStreamer is not loaded into a triple store, we
deduplicated the output, as RMLStreamer cannot eliminate duplicates by itself.
After deduplication we compared the number of triples to the baseline results of the
challenge. For the second part of the challenge (GTFS-Madrid-Bench) we compared the
number of query results to the baseline.</p>
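      <p>The deduplicate-and-count check for the first part of the challenge can be illustrated as follows. This is only a sketch under the assumption that the output is serialized as N-Triples with one triple per line; the function names are hypothetical, and all distinct lines are kept in memory, so very large outputs would need an external sort or a similar approach instead.</p>
      <preformat><![CDATA[
```python
def count_unique_triples(lines):
    """Count distinct N-Triples lines, ignoring blank lines and comments."""
    seen = set()
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            seen.add(stripped)
    return len(seen)


def matches_baseline(lines, expected_count):
    """Check the deduplicated triple count against the expected baseline count."""
    return count_unique_triples(lines) == expected_count


# Hypothetical engine output containing one duplicate triple.
sample_output = [
    '<http://example.org/s1> <http://example.org/p> "a" .\n',
    '<http://example.org/s1> <http://example.org/p> "a" .\n',
    '<http://example.org/s2> <http://example.org/p> "b" .\n',
]
unique_count = count_unique_triples(sample_output)
```
]]></preformat>
      <p>Comparing counts only detects missing or superfluous triples; it does not prove that the triples themselves are identical to the baseline.</p>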
      <p>Figure 1 and Figure 2 visualize the resulting experiment setups.</p>
      <p>We executed all experiments on an Intel Xeon CPU E5645 (12 cores with
Hyper-Threading, 2.4 GHz) with 24 GB RAM and a 250 GB HDD. The challenge execution tool
configures the Java heap space to 50 % of the available memory. All experiments were
performed five times and the run with the median execution time is reported. All
files needed to reproduce the conducted experiments are available on Zenodo.</p>
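      <p>Selecting the run to report can be sketched as below. The record layout (a dictionary with a run number and an execution_time_s field) and the timings are made up for illustration; the challenge execution tool may store its measurements differently.</p>
      <preformat><![CDATA[
```python
import statistics


def median_run(runs):
    """Return the run whose execution time is the median over all runs.

    median_low always returns one of the measured values, so a matching run
    exists even when the number of runs is even.
    """
    times = [run["execution_time_s"] for run in runs]
    median_time = statistics.median_low(times)
    return next(run for run in runs if run["execution_time_s"] == median_time)


# Five hypothetical repetitions of the same experiment.
example_runs = [
    {"run": 1, "execution_time_s": 34.0},
    {"run": 2, "execution_time_s": 31.5},
    {"run": 3, "execution_time_s": 40.2},
    {"run": 4, "execution_time_s": 33.1},
    {"run": 5, "execution_time_s": 36.8},
]
reported = median_run(example_runs)
```
]]></preformat>
      <p>Reporting the complete median run, rather than separate medians per metric, keeps the execution time, CPU and memory figures consistent with one actual execution.</p>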
    </sec>
    <sec id="sec-5">
      <title>4. Comparison with other RML engines</title>
      <p>
        We also executed the experiments in our setup (section 3) with RMLMapper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for
both parts of the challenge, and with Morph-KGC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for the GTFS-Madrid-Bench
part. This way, we can compare the results of RMLStreamer with
other implementations on the same setup.
      </p>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <p>In Figure 3 and Table 1 we included the measured execution time, CPU and memory usage
of the knowledge graph construction pipeline (RMLLooseGenerator and RMLStreamer)
for selected experiments. The complete overview of the results, as it was submitted to
the KGCW 2023 Challenge, is available on Zenodo.</p>
      <p>Figure 3: (a) Linear trend: the execution time of RMLStreamer increases with the same
factor as the data size for higher scales. (b) Linear trend: the CPU usage of RMLStreamer
increases with the same factor as the data size for higher scales. (c) RMLStreamer has a
constant memory usage independent of data size. (d) The execution time of RMLStreamer
increases with a factor 10 when including JSON and XML sources.</p>
      <p>First, we verify how RMLStreamer behaves when the size of the input data increases.
This is best illustrated by the GTFS-Madrid-Bench scale experiments. Figure 3a and
Figure 3b show that RMLStreamer tends towards linear scaling. Note that the reported
metrics include the execution time of RMLLooseGenerator (8 seconds on average)
and the startup process of RMLStreamer (not measured separately by the
challenge execution tool, but from the logs of RMLStreamer we conclude that it takes on
average 17 seconds for the GTFS-Madrid-Bench cases). These processes are independent
of the data size. For higher scales, the execution time increases with the same factor as the data size,
whereas RMLMapper and Morph-KGC cannot complete the experiments for
scale 100 and scale 1000, respectively. At scale 100, RMLStreamer is two times faster than
the state-of-the-art RML engine Morph-KGC (fig. 3a). RMLStreamer uses more CPU
(fig. 3b) than RMLMapper and Morph-KGC because it maximizes parallelism
over all given slots. The CPU usage also scales linearly with the size of the input data for
higher scales. The measured peak RAM usage (fig. 3c) is similar for all scales when
using RMLStreamer, whereas RMLMapper and Morph-KGC hit the limits of the available
memory and fail to complete all GTFS-Madrid-Bench experiments. RMLStreamer has a
constant memory usage, independent of the data size. We assume this is mostly due to
the fact that it processes everything in a streaming way, which is the case to a lesser extent
for RMLMapper and Morph-KGC.</p>
      <p>The measurements for the KGC parameter experiments show similar trends: linear
scaling of execution time and CPU usage, proportional to the size of the input data, in
combination with a constant memory usage. Table 1 shows the results of the experiments
with the lowest and the highest measured values, which are dubbed easiest and hardest
experiments in the table respectively.</p>
      <p>Second, we evaluate the impact of the format of the input data. Adding nested sources,
such as JSON and XML files, increases the execution time of RMLStreamer by a
factor of ten (fig. 3d). The difference in execution time is the consequence of RMLStreamer
chunking CSV files and processing the chunks in parallel. This is not yet the case for the
XML and JSON formats.</p>
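      <p>The general idea of chunking tabular input and mapping the chunks in parallel can be sketched as follows. This is a simplified illustration only: RMLStreamer relies on Apache Flink for this, whereas the sketch uses a Python thread pool, reads the whole CSV into memory, and applies a made-up mapping rule with example.org IRIs. All function names and parameters are assumptions.</p>
      <preformat><![CDATA[
```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def chunk_rows(rows, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def map_chunk(chunk):
    """Map every record of a chunk to one N-Triples line (hypothetical rule)."""
    return [
        '<http://example.org/stop/%s> <http://example.org/name> "%s" .'
        % (row["id"], row["name"])
        for row in chunk
    ]


def map_csv_in_parallel(csv_text, chunk_size=2, workers=4):
    """Map CSV records to triples, one chunk per worker task."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields the results in the order of the submitted chunks.
        results = pool.map(map_chunk, chunk_rows(rows, chunk_size))
        return [triple for chunk in results for triple in chunk]


triples = map_csv_in_parallel("id,name\n1,A\n2,B\n3,C\n")
```
]]></preformat>
      <p>Each CSV record can be mapped independently of the others, which is what makes this kind of splitting straightforward for tabular input.</p>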
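      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th />
              <th>Execution time (s)</th>
              <th>CPU (s)</th>
              <th>Peak RAM (GB)</th>
              <th>Output (triples)</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td colspan="5">1. Easiest experiment: records 10K rows 20 columns</td>
            </tr>
            <tr>
              <td>RMLMapper</td>
              <td>8</td>
              <td>15</td>
              <td>1.4</td>
              <td>200,000</td>
            </tr>
            <tr>
              <td>RMLStreamer</td>
              <td>21</td>
              <td>107</td>
              <td>2.4</td>
              <td>200,000</td>
            </tr>
            <tr>
              <td colspan="5">2. Hardest experiment for RMLMapper: properties 1M rows 30 columns</td>
            </tr>
            <tr>
              <td>RMLMapper</td>
              <td>262</td>
              <td>646</td>
              <td>13.2</td>
              <td>20,000,000</td>
            </tr>
            <tr>
              <td>RMLStreamer</td>
              <td>120</td>
              <td>2,247</td>
              <td>9.2</td>
              <td>20,000,000</td>
            </tr>
            <tr>
              <td colspan="5">3. Hardest experiment for RMLStreamer: records 10M rows 20 columns</td>
            </tr>
            <tr>
              <td>RMLMapper</td>
              <td>out of memory</td>
              <td>out of memory</td>
              <td>out of memory</td>
              <td>out of memory</td>
            </tr>
            <tr>
              <td>RMLStreamer</td>
              <td>653</td>
              <td>14,360</td>
              <td>9.2</td>
              <td>200,000,000</td>
            </tr>
            <tr>
              <td colspan="5">4. Easiest experiment with joins: join 1-1 0%</td>
            </tr>
            <tr>
              <td>RMLMapper</td>
              <td>out of memory</td>
              <td>out of memory</td>
              <td>out of memory</td>
              <td>out of memory</td>
            </tr>
            <tr>
              <td>RMLStreamer8</td>
              <td>219</td>
              <td>2,409</td>
              <td>9.3</td>
              <td>0</td>
            </tr>
            <tr>
              <td>RMLStreamer</td>
              <td>43</td>
              <td>371</td>
              <td>9.4</td>
              <td>0</td>
            </tr>
            <tr>
              <td colspan="5">5. Hardest experiment with joins: join 5-5 100%</td>
            </tr>
            <tr>
              <td>RMLMapper</td>
              <td>out of memory</td>
              <td>out of memory</td>
              <td>out of memory</td>
              <td>out of memory</td>
            </tr>
            <tr>
              <td>RMLStreamer8</td>
              <td>434</td>
              <td>4,145</td>
              <td>9.3</td>
              <td>2,500,000</td>
            </tr>
            <tr>
              <td>RMLStreamer</td>
              <td>66</td>
              <td>1,093</td>
              <td>9.5</td>
              <td>2,500,000</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>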
      <p>Last, we comment on the experiments including joins. At the time of the challenge
we limited RMLStreamer to eight task slots (referred to as RMLStreamer8 in Table 1)
for the execution of experiments including joins, because RMLStreamer reported an
error and failed to start processing some mapping files including joins (e.g. the original
GTFS-Madrid-Bench mapping with joins). We assumed that this error appeared with any
mapping file that includes joins. Further investigation afterwards revealed that this error
is triggered by mappings with a large number of mapping rules (i.e. a mapping with
two triples maps and one join operation does not result in an error). Hence, the limitation of
task slots is not required for the experiments with joins in the first part of the challenge.
For completeness we added the results of those join experiments with all task slots. Using
all 24 task slots on the hardest experiment with joins decreases the execution time by a
factor of 16, and the CPU usage by a factor of nine, compared to the results registered
during the challenge using RMLStreamer8. The decreased execution time was in line
with our expectations; the reason for the decreased CPU usage, however, requires further
investigation.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>The KGCW 2023 Challenge results show that RMLStreamer has a linear scaling of
execution time and CPU usage, proportional to the size of the input data, while maintaining
a constant memory usage. Therefore it received the Scalability Award in the KGCW
2023 Challenge.</p>
      <p>We noted the following improvement areas for RMLStreamer: (i) a lack of support for
relational databases, (ii) an inefficient implementation of join operations (e.g.
GTFS-Madrid-Bench experiments with joins cannot be handled properly by RMLStreamer),
and (iii) a longer execution time when handling nested sources such as JSON and XML
files.</p>
      <p>Additionally we noticed that there are cases where the number of task slots used
by RMLStreamer needs to be limited. At the time of the challenge this could only be
achieved by modifying RMLStreamer’s code.</p>
      <p>Support for changing the number of task slots dynamically was implemented in
RMLStreamer 2.5.0, together with support for SQL databases.</p>
      <p>For future work we will investigate further optimizations for execution of joins and
nested sources.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The described research activities were supported by SolidLab Vlaanderen (Flemish
Government, EWI and RRF project VV023/10), the imec ICON project AI4Foodlogistics
(Agentschap Innoveren en Ondernemen project nr. HBC.2020.3097), and funded by the
Special Research Fund of Ghent University under grant BOF20/DOC/132.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Priyatna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ruckhaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>65</volume>
          (
          <year>2020</year>
          )
          <elocation-id>100596</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.websem.2020.100596</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Arenas-Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Morph-KGC: Scalable knowledge graph materialization with mapping partitions</article-title>
          ,
          <source>Semantic Web</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.3233/sw-223135</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jozashoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Collarana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <article-title>SDM-RDFizer: An RML Interpreter for the Efficient Creation of RDF Knowledge Graphs</article-title>
          , in:
          <source>Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management</source>
          ,
          <year>2020</year>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3340531.3412881</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Sitt</given-names>
            <surname>Min Oo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Haesendonck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>RMLStreamer-SISO: An RDF Stream Generator from Streaming Heterogeneous Data</article-title>
          , in: U. Sattler,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Keet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Presutti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P. A.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Takeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Monnin</surname>
          </string-name>
          , G. Pirrò, C. d'Amato (Eds.),
          <source>The Semantic Web - ISWC 2022</source>
          , Springer, Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>697</fpage>
          -
          <lpage>713</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-031-19433-7_40</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>de Vleeschauwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min Oo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <article-title>Reference conditions: Relating mapping rules without joining</article-title>
          , in:
          <source>Proceedings of the 4th International Workshop on Knowledge Graph Construction (KGCW 2023) co-located with 20th Extended Semantic Web Conference (ESWC 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vander Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          , E. Mannens, R. Van de Walle,
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          ,
          <source>in: Proceedings of the 7th Workshop on Linked Data on the Web</source>
          , volume
          <volume>1184</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>