<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Workshop on Knowledge Graph Construction</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>KGCW2024 Challenge Report: RDFProcessingToolkit</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claus Stadler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Bin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Applied Informatics (InfAI)</institution>
          ,
          <addr-line>Leipzig</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>27</volume>
      <issue>2024</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>This is the report of the participation of the RDFProcessingToolkit (RPT) in the KGCW2024 Challenge at ESWC 2024. The RPT system processes RML specifications by translating them into a series of extended SPARQL CONSTRUCT queries. The necessary SPARQL extensions are provided as plugins for the Apache Jena framework. This year's challenge comprises a performance and a conformance track. For the performance track, a homogeneous environment was kindly provided by the workshop organizers in order to facilitate comparability of measurements. In this track, we mainly adapted the setup from our last year's participation. For the conformance track, we updated our system with support for the rml-core module of the upcoming RML revision. We also report on the issues and shortcomings we encountered as a base for future improvements.</p>
      </abstract>
      <kwd-group>
        <kwd>RML</kwd>
        <kwd>SPARQL</kwd>
        <kwd>RDF</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Big data</kwd>
        <kwd>Semantic Query Optimization</kwd>
        <kwd>Apache Spark</kwd>
        <kwd>Challenge</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Sansa engine, built on Apache Spark, supports parallel ingestion of CSV and JSON source data and the execution of SPARQL algebra operations, such as JOIN and
DISTINCT. The Sansa engine is not a general-purpose SPARQL engine. By leveraging Spark’s
map-reduce processing model, it is best suited for extract-transform-load (ETL) workloads, which
include RML mapping execution. RPT/Sansa [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] refers to the use of RPT with the Sansa engine.
      </p>
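      <p>The translation idea described above can be sketched in a few lines of Python. The sketch below is purely illustrative, not RPT's actual implementation or extension vocabulary: it renders a drastically simplified triples map (a source file, a subject IRI template, and predicate/column pairs) as the text of a SPARQL CONSTRUCT query, with the per-record source access left as a comment since that part relies on engine-specific SPARQL extensions:</p>

```python
# Illustrative sketch only: a (much simplified) RML triples map rendered as a
# SPARQL CONSTRUCT query string. Function and parameter names are assumptions
# for this example, not RPT's API; the WHERE clause comment stands in for the
# engine-specific extension that binds one solution per source record.

def triples_map_to_construct(source_file, subject_template, predicate_columns):
    """predicate_columns: list of (predicate IRI, source column name) pairs."""
    construct_body = "\n".join(
        f"  ?s <{pred}> ?{col} ." for pred, col in predicate_columns
    )
    id_col = predicate_columns[0][1]  # first column doubles as the IRI suffix here
    where_body = (
        f"  # one solution per record of {source_file} (engine-specific extension)\n"
        f'  BIND(IRI(CONCAT("{subject_template}", STR(?{id_col}))) AS ?s)'
    )
    return f"CONSTRUCT {{\n{construct_body}\n}}\nWHERE {{\n{where_body}\n}}"

query = triples_map_to_construct(
    "students.csv",
    "http://example.org/student/",
    [("http://xmlns.com/foaf/0.1/name", "name")],
)
print(query)
```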
      <p>
        The challenge was divided into a Performance and a Conformance part. The challenge
description and output files were published on Zenodo [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The remainder of this report is
structured as follows: In Section 2 we report on our setup for the participation in the performance
challenge. In Section 3 we provide insights into how we added support for the core module of
the upcoming revised RML specification. We conclude this report in Section 4.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Performance Track</title>
      <p>
        In this section, we report on our setup and results for the performance track. Overall, since
our last year’s participation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], there were no major updates to our system that targeted
performance. However, we had a series of bug fixes and maintenance upgrades, the most
significant one being the upgrade of the code base to Jena 5.
      </p>
      <p>The tasks of the performance challenge were also the same as last year’s. One notable addition
was that the RML mapping files were also provided in the upcoming RML revision. However,
we did not evaluate against those because we completed our work on the conformance part
only after the performance evaluations.</p>
      <p>A significant improvement over last year’s organization was that uniform hardware (virtual
machines) was kindly provided to all participants by the challenge organizers. This allowed
evaluation of all participating systems on similar hardware, so that the results are expected to
be much more comparable than the year before. The specifications reported by the VM were: 4
virtual cores (Intel(R) Xeon(R) Gold 6161 CPU), 136 GB VMware virtual disk, 16 GiB RAM.</p>
      <p>The benchmark tool’s implementation differed from last year’s, and we forked it in order to
deal with the following issues:
• We needed to add support for configuring the working directory of the
Docker container running our RPT tool.
• We needed to make it possible to pass Java options to our Docker container, especially for
setting the maximum heap memory (-Xmx).
• The benchmark tool failed to extract system information on SUSE Linux systems.
At the time of writing, the changes are tracked in our fork7 and there is an open pull request to
the benchmark tool.8</p>
      <p>We employed the same two workarounds as last year: instead of reading tabular data from
an SQL database, we ingest the corresponding CSV files directly. Due to the lack of parallel
ingestion of XML data in RPT/Sansa, we adjusted the affected task pipelines by adding an
extra XML-to-JSON conversion step (which is counted towards the total time) and by manually
changing the XPath expressions in the RML files to corresponding JSONPath ones. With this
workaround, we could leverage RPT/Sansa’s parallel data ingestion and produce estimated
measures despite the lack of proper XML support.
7https://github.com/AKSW/kg-construct-challenge-tool/tree/code-on-new
8https://github.com/kg-construct/challenge-tool/pull/4</p>
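      <p>The XML-to-JSON pre-processing step described above can be sketched with the Python standard library. This is a simplified stand-in for the actual conversion, under the assumption that each record element flattens to one JSON object per line, so that a JSONPath such as $.stop_id can replace an XPath such as //stop/stop_id:</p>

```python
# Minimal sketch of an XML-to-JSON record conversion: flatten every
# <record_tag> element of an XML document into one JSON object whose keys are
# the child element names. Real pipelines would handle attributes, nesting,
# and streaming, which this illustration deliberately omits.
import json
import xml.etree.ElementTree as ET

def xml_records_to_json_lines(xml_text, record_tag):
    """Convert every <record_tag> element into a JSON line (child tag -> text)."""
    root = ET.fromstring(xml_text)
    lines = []
    for rec in root.iter(record_tag):
        obj = {child.tag: child.text for child in rec}
        lines.append(json.dumps(obj))
    return lines

xml_text = "<stops><stop><stop_id>1</stop_id><name>Sol</name></stop></stops>"
print(xml_records_to_json_lines(xml_text, "stop"))
# → ['{"stop_id": "1", "name": "Sol"}']
```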
      <p>
        The performance results for RPT/Sansa are consistent with those from last year, although the
execution times were generally higher due to the weaker hardware. The hardware specs from
the year before were: 32 threads (AMD Ryzen 9 5950X 16-core CPU), 4 TB SSD, 128 GiB RAM
(60 GiB assigned to the JVM). Figure 1 shows a juxtaposition of this year’s results for the
GTFS-Madrid-Bench [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with those of last year. The complete recent as well as past measurements for
RPT/Sansa can be found in our evaluation results repository.9
      </p>
      <p>Also, we incorrectly allocated only 5 GB of heap memory to our Java process, although 16
GB of RAM would have been available. However, this also shows that our system effectively
leverages Apache Spark in order to run with limited resources. A noteworthy finding is related
to the GTFS-Madrid-bench: the amounts of RAM and disk space were insufficient to process the
GTFS1000 task without compression. To get around this problem, we experimented with two
approaches: (1) a hard disk volume compressed with lzo and (2) using Spark’s built-in bzip2
compression. The following commands were used:
# Commands to create and mount a compressed filesystem image
fallocate -l 60G ./filesystem.btrfs
mkfs.btrfs ./filesystem.btrfs
mount -o loop,compress=lzo ./filesystem.btrfs /compressed-fs
# Spark's built-in compression enabled via system properties
JAVA_OPTS="-Dspark.hadoop.mapred.output.compress=true \
-Dspark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec" \
rpt sansa query converted-rml.rq --out-file data.nt.bz2</p>
      <p>The file system compression setup turned out to be significantly faster: 4 178 s vs. 10 368 s. The
shorter duration was obtained by running Spark with a 5 GB JVM heap memory limit, whereas
the longer one was obtained with a limit of 14 GB, which means that more RAM was available
when the compression was done in the JVM. The main reason for the better performance
of file-system-level compression is most likely that lzo is designed to sacrifice compression
ratio for speed and is thus generally faster than bzip2.10 Also, we would expect the implementation
of a compression algorithm in the file system driver to be faster than in Java (even with JIT
compilation). However, a more in-depth evaluation, such as by configuring Apache Spark with
an lzo codec, is future work.
9https://github.com/AKSW/RdfProcessingToolkit-Resources/tree/main/2024-05-27-KGCW-at-ESWC</p>
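      <p>The speed-versus-ratio tradeoff discussed above can be illustrated with the Python standard library. Since lzo has no stdlib binding, zlib at its fastest level stands in here for a "fast, lighter" codec versus bzip2 at its strongest level; the data and codecs are stand-ins, not those of the actual evaluation:</p>

```python
# Rough stdlib illustration of the compression speed-vs-ratio tradeoff:
# zlib level 1 (fast, weaker compression) as a stand-in for lzo, versus
# bzip2 level 9 (slow, stronger compression), on repetitive CSV-like data.
import bz2
import time
import zlib

data = b"stop_id,stop_name\n" + b"1234,Some Stop Name\n" * 50000

t0 = time.perf_counter()
fast = zlib.compress(data, level=1)
t_fast = time.perf_counter() - t0

t0 = time.perf_counter()
small = bz2.compress(data, compresslevel=9)
t_slow = time.perf_counter() - t0

print(f"zlib-1 : {len(fast)} bytes in {t_fast:.4f}s")
print(f"bzip2-9: {len(small)} bytes in {t_slow:.4f}s")
# Typically bzip2 produces the smaller output but takes considerably longer.
```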
    </sec>
    <sec id="sec-3">
      <title>3. Conformance Track</title>
      <p>The challenge of the conformance track is to establish conformance with the upcoming revised
RML specification.11 At the time of writing, the revision did not have a final official name, so
we refer to it as RML2 in the following.</p>
      <p>[Figure 2: Excerpt of the revised class hierarchy. A common ITriplesMap interface (with getAbstractSource() and getPredicateObjectMaps()) is specialized for R2RML (TriplesMapR2rml with LogicalTable), RML1 (TriplesMapRml1 with LogicalSourceRml1), and RML2 (TriplesMapRml2 with LogicalSourceRml2); analogous interface hierarchies exist for predicate-object maps and logical sources.]</p>
      <p>Some of the major changes introduced by RML2 are as follows:
• RML2 is now a model on its own and no longer an extension of R2RML. As a consequence,
most ontology elements now reside in the namespace http://w3id.org/rml/.
• The specification is now modular, with modules for (a) the core (RML-Core), (b) sources
and targets (RML-IO), (c) containers and collections (RML-CC), (d) RDF-star generation
(RML-Star), and (e) functions (RML-FNML). A logical views module is being worked on.
10https://linuxaria.com/article/linux-compressors-comparison-on-centos-6-5-x86-64-lzo-vs-lz4-vs-gzip-vs-bzip2-vs-lzma
11https://kg-construct.github.io/rml-resources/portal/</p>
      <p>In this work, we only attempted to establish conformance with RML-Core. In order to add
support for RML2, we needed to decide whether to rewrite our engine from scratch or whether
to generalize the interfaces and classes used to capture the RML model. We decided to take the
latter approach. An excerpt of the revised class hierarchy is shown in Figure 2. We integrated the
conformance test suite as unit tests using JUnit12 and Testcontainers.13 One issue we encountered
was that the benchmark tool would download version 0.9.0 of the conformance test suite rather
than v1.0.0, which caused issues in running the PostgreSQL tests. As a remedy, we downloaded
version 1.0.0 manually. This has since been fixed. Furthermore, we were not
able to generate the results.zip for the test cases with the benchmark tool because it would
terminate abnormally. The reason has yet to be investigated. Of the 238 test cases of rml-core,
we were able to establish conformance in 236 cases. The failing test cases were 9a-mysql and
9b-mysql. These test cases involve a join between an int and a varchar column. RPT first maps
the MySQL types of the columns to xsd:int and xsd:string, respectively, before executing the join
in SPARQL. Since these types are incompatible in SPARQL, the join fails. While the test case
is arguably not ideal, the solution to mitigate the issue in the future is to push the join down to the
database. Interestingly, the corresponding PostgreSQL test cases have the column types fixed
because PostgreSQL refuses to execute such joins.</p>
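      <p>The failing join above can be illustrated with a toy example. Once column values become datatyped RDF literals, an integer 9 and a string "9" are no longer equal join keys, so a hash join on literal equality produces no results; the data model here is a simplified stand-in, not RPT's internals:</p>

```python
# Simplified illustration of the failing conformance tests: after mapping
# MySQL columns to typed literals, a join on exact (datatype, lexical form)
# equality cannot match an xsd:integer "9" against an xsd:string "9".
def typed_join(left, right):
    """Hash join on exact literal equality; keys are (datatype, lexical form)."""
    index = {row["key"]: row for row in right}
    return [(l, index[l["key"]]) for l in left if l["key"] in index]

left = [{"key": ("xsd:integer", "9"), "name": "row-a"}]
right = [{"key": ("xsd:string", "9"), "label": "row-b"}]
print(typed_join(left, right))
# → []  (same lexical form, but incompatible datatypes: no match)
```

Pushing the join down to the database sidesteps this, because the comparison then happens on the source column types before any literal typing takes place.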
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and Future Work</title>
      <p>A significant improvement in this year’s performance setup was that the participants could
evaluate their systems in a homogeneous environment. Although the benchmark tool is capable
of collecting measurements for various performance metrics, it still lacks the functionality to
generate summary reports. The performance results showed that our tool is robust under
limited resources, and in the course of our evaluation it turned out that lzo compression at
the file system level clearly outperformed Spark’s built-in bzip2 support. As future work, we
aim to add proper support for parallel ingestion of XML data. We will analyze to what extent
the remaining RML2 modules can be translated to SPARQL elements. For example, while
support for RML-FNML should be fairly easy to add, support for RML-IO will require an extra
RDF vocabulary for describing how to transfer the results of SPARQL queries to specified
destinations.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The authors acknowledge the financial support by the German Federal Ministry for Economic
Affairs and Climate Action in the project Coypu (project number 01MK21007A) and by the
German Federal Ministry for Digital and Transport in the project Moby Dex (project number
19F2266A).
12https://junit.org
13https://testcontainers.com/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bühmann</surname>
          </string-name>
          , L.-P. Meyer, M. Martin,
          <article-title>Scaling RML and SPARQL-based knowledge graph construction with Apache Spark</article-title>
          ,
          <source>in: Proceedings of the 4th International Workshop on Knowledge Graph Construction, ESWC</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Assche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Şimşek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          ,
          <source>KGCW 2024 Challenge @ ESWC 2024</source>
          ,
          <year>2024</year>
          . doi:10.5281/zenodo.10973433.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          , L. Bühmann,
          <article-title>KGCW2023 Challenge Report RDFProcessingToolkit/Sansa</article-title>
          ,
          <source>in: KGCW Challenge @ ESWC</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Priyatna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ruckhaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>65</volume>
          (
          <year>2020</year>
          )
          <fpage>100596</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>