<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Workshop on Knowledge Graph Construction, May</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Scaling RML and SPARQL-based Knowledge Graph Construction with Apache Spark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claus Stadler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenz Bühmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars-Peter Meyer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Martin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Applied Informatics (InfAI)</institution>
          ,
          <addr-line>Leipzig</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>28</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Approaches for the construction of knowledge graphs from heterogeneous data sources range from ad-hoc scripts to dedicated mapping languages. Two common foundations are RML and SPARQL. So far, both approaches are treated as different: On the one hand there are tools specifically for processing RML, whereas on the other hand there are tools that extend SPARQL in order to incorporate additional data sources. In this work, we first show how this gap can be bridged by translating RML to a sequence of SPARQL CONSTRUCT queries and introduce the necessary SPARQL extensions. In a subsequent step, we employ techniques to optimize SPARQL query workloads as well as individual query execution times in order to obtain an optimized sequence of queries with respect to the order and uniqueness of the generated triples. Finally, we present a corresponding SPARQL query execution engine based on the Apache Spark Big Data framework. In our evaluation on benchmarks we show that our approach is capable of achieving RML mapping execution performance that surpasses the current state of the art.</p>
      </abstract>
      <kwd-group>
        <kwd>RML</kwd>
        <kwd>SPARQL</kwd>
        <kwd>RDF</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Big data</kwd>
        <kwd>Semantic Query Optimization</kwd>
        <kwd>Apache Spark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>https://w3id.org/aksw/norse#. The implementations are part of our JenaX resource<sup>1</sup>, which
is available on Maven Central. It features several unofficial extensions for the Apache Jena
framework. (3) We furthermore present an Apache Spark-based SPARQL engine that executes
NORSE-enhanced SPARQL by leveraging its massively parallel processing model. We show that,
in terms of performance and scalability, this approach surpasses the state of the art in several
scenarios. It is worth noting that these results are not solely due to Apache Spark but
also to the optimizations we applied.</p>
      <p>The remainder of this paper is structured as follows: We present related work in Section 2. The
translation of RML models to extended SPARQL CONSTRUCT queries is described in Section 3.
Optimizations of query workloads with respect to the uniqueness and ordering of the produced
RDF triples and/or quads are shown in Section 4. In Section 5 we present our implementations of
(1) the conversion of RML to SPARQL, (2) the NORSE SPARQL extensions, and (3) a SPARQL
engine on Apache Spark using the SANSA Big Data RDF framework. Subsequently,
in Section 6 we present an evaluation of our approach based on the GTFS-Madrid-Bench and
one dataset of the SDM-Genomic-Datasets. We conclude our paper in Section 7.</p>
      <p>The implementation of our approach is part of our RDF Processing Toolkit (RPT)<sup>2</sup>, which is
based on JenaX.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In this section we provide an overview of contemporary SPARQL- and RML-based knowledge
graph construction approaches as well as a brief summary of Apache Spark. Since many
mapping languages exist [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ], a discussion of general concepts and translations between
them can be found in [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. SPARQL-based Mapping Approaches</title>
        <p>
          SPARQL is a W3C standard for processing (loading, retrieving, transforming and updating) RDF
data<sup>3</sup>. SPARQL engines can be leveraged to build advanced features on top. Two prominent
representatives of the category of SPARQL-based mapping approaches are SPARQL-Generate[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
and SPARQL Anything[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. SPARQL Anything extends SERVICE evaluation such that references
to remote non-RDF data can be made. The data is converted to an opinionated RDF graph (as
per documentation) which then serves as the base for evaluating the remainder of the SERVICE
clause. SPARQL-Generate features a SPARQL-based template language which can produce
output beyond what is possible with conventional SPARQL.
        </p>
        <p>Our JenaX project compares to these approaches as follows: JenaX provides (among other
things) SPARQL plugins for the Apache Jena ecosystem that allow for processing heterogeneous
data within the SPARQL syntax already supported by the framework. As part of this effort
we contributed a plugin system to facilitate interoperability of custom SERVICE execution
implementations<sup>4</sup>. The SPARQL extensions for processing RML sources, as described in this
paper, are built on this system. While some of JenaX’s SPARQL extension functions for XML,
JSON and CSV processing conceptually overlap with those provided by SPARQL-Generate,
there are differences in the implementations. For example, JenaX provides dedicated RDF
datatype implementations that internally retain JSON and XML data in an object model, whereas
SPARQL-Generate (as of version 2.0.12) falls back to string representations. In order to avoid
clashes we use IRIs in the norse namespace for our implementations.</p>
        <p><sup>1</sup> https://github.com/Scaseco/jenax</p>
        <p><sup>2</sup> https://github.com/SmartDataAnalytics/RdfProcessingToolkit</p>
        <p><sup>3</sup> https://www.w3.org/TR/sparql11-query/</p>
        <p><sup>4</sup> https://github.com/apache/jena/pull/1388</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. RML Processors and Benchmarks</title>
        <p>
          R2RML is a W3C standard and vocabulary for mapping relational data to RDF<sup>5</sup>. On the one
hand, these mappings can be used in ETL processes to dump databases as RDF. On the other
hand, the same mappings can be used in SPARQL-to-SQL rewriting, a.k.a. OBDA
(ontology-based data access). Because R2RML is considered quite verbose, several alternatives have been
developed, such as the Stardog Mapping Syntax (SMS, currently in version 2), the Ontop Mapping
Language[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and the Sparqlification Mapping Language[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          RML<sup>6</sup> is an extension of R2RML which adds additional vocabulary for mapping non-relational
data[
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]. In essence, these additional declarations allow for expressing a mapping of
non-relational data (such as XML and JSON) into a relational model, where RDF
tuples are generated from each row. Like R2RML, it suffers from verbosity, for which reason simplified models
were derived, such as YARRRML[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. A mapping translation between ShExML[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and RML is
presented by [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>
          There exist several RML processors [
          <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
          ]<sup>7</sup> for RML, the well-known extension of the W3C
standard R2RML. In this paper we compare our approach against the following processors: SDM-RDFizer<sup>8</sup>
is an RML processor implemented in Python with optimized data structures and operators. It is
developed with scalability and complex data in mind[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. CARML<sup>9</sup> and RMLMapper<sup>10</sup> are Java
implementations that operate single-threaded at the time of writing. Morph-KGC<sup>11</sup> is an RML
processor implemented in Python that supports partitioning RML assertions[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] for parallel
execution.
        </p>
        <p>
          For measuring the performance of the RML processors we use the following benchmarks:
The GTFS-Madrid-Bench<sup>12</sup> was introduced by [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. It is based on data from the subway network
of Madrid, and the benchmark data can be scaled up. A survey on RML tools[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] conducted
in 2021 evaluated 3 virtualizers and 6 materializers on the GTFS-Madrid-Bench. The
SDM-Genomic-Datasets benchmark<sup>13</sup> was introduced by [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. This benchmark is motivated by
the biomedical domain and is based on the Catalogue Of Somatic Mutations In Cancer<sup>14</sup>.
        </p>
        <p><sup>5</sup> https://www.w3.org/TR/r2rml/</p>
        <p><sup>6</sup> https://rml.io/specs/rml/</p>
        <p><sup>7</sup> https://github.com/kg-construct/awesome-kgc-tools</p>
        <p><sup>8</sup> https://github.com/SDM-TIB/SDM-RDFizer</p>
        <p><sup>9</sup> https://github.com/carml/carml</p>
        <p><sup>10</sup> https://github.com/RMLio/rmlmapper-java</p>
        <p><sup>11</sup> https://github.com/morph-kgc/morph-kgc</p>
        <p><sup>12</sup> https://github.com/oeg-upm/gtfs-bench</p>
        <p><sup>13</sup> https://figshare.com/articles/dataset/SDM-Genomic-Datasets/14838342/1</p>
        <p><sup>14</sup> https://cancer.sanger.ac.uk/cosmic</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Apache Spark And SANSA</title>
        <p>
          Apache Spark<sup>15</sup> is a framework for highly parallel data processing. It can scale workload execution from
a single node to large clusters. Apache Spark advanced Hadoop’s MapReduce paradigm with
an abstraction called "resilient distributed datasets" (RDDs). The SANSA framework[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] is an
effort to enable various forms of RDF processing on Apache Spark.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Translating RML to SPARQL</title>
      <p>In this section we describe our approach to translate RML to SPARQL. For this purpose we first
briefly summarize the notion of a SPARQL CONSTRUCT query.</p>
      <sec id="sec-3-1">
        <title>3.1. CONSTRUCT Queries</title>
        <p>A CONSTRUCT query has the form CONSTRUCT { template } WHERE { pattern }. Without
loss of generality, for this work we assume generalized RDF<sup>16</sup>. Let there be the pairwise disjoint
sets of IRIs <italic>I</italic>, blank nodes <italic>B</italic> and literals <italic>L</italic>. The set of RDF terms is defined as <italic>T</italic> := <italic>I</italic> ∪ <italic>B</italic> ∪ <italic>L</italic>.
Furthermore, let there be another set of SPARQL variables <italic>V</italic>. We define the set of SPARQL
terms <italic>S</italic> := <italic>T</italic> ∪ <italic>V</italic>. A concrete triple is an element of <italic>T</italic> × <italic>T</italic> × <italic>T</italic>, whereas a triple pattern is an
element of <italic>S</italic> × <italic>S</italic> × <italic>S</italic>. Likewise, a concrete quad is an element of <italic>T</italic> × <italic>T</italic> × <italic>T</italic> × <italic>T</italic>, whereas a
quad pattern is an element of <italic>S</italic> × <italic>S</italic> × <italic>S</italic> × <italic>S</italic>. The current SPARQL standard only allows for
a CONSTRUCT template to specify the creation of triples using triple patterns. However, the
importance of this issue has been noted<sup>17</sup> and several engines already support the production of
quads as well. The approach presented in the following can be used in either setting, so instead
of talking about a triple and quad (pattern) we generally speak of a tuple (pattern). A CONSTRUCT
query’s template is thus made up of a set of tuple patterns. Substituting all variables of these
tuple patterns with RDF terms produces a set of concrete tuples.</p>
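        <p>The template instantiation described above can be sketched in a few lines of Python. This is an illustration only and not part of the authors' tooling: the function name instantiate, the representation of variables as strings starting with "?", and the example terms are assumptions made for this sketch. Tuples whose variables remain unbound are skipped, as in standard CONSTRUCT evaluation.</p>
        <preformat>
```python
def instantiate(template, bindings):
    """Substitute variables in tuple patterns; skip tuples with unbound variables."""
    result = set()
    for binding in bindings:
        for pattern in template:
            terms = []
            for term in pattern:
                if term.startswith("?"):
                    # Variables are looked up in the binding (None if unbound).
                    term = binding.get(term)
                terms.append(term)
            if None not in terms:
                result.add(tuple(terms))
    return result


# Hypothetical template with two triple patterns.
template = [("?s", "rdf:type", "ex:Stop"), ("?s", "ex:name", "?name")]
bindings = [
    {"?s": "ex:stop1", "?name": '"Sol"'},
    {"?s": "ex:stop2"},  # ?name is unbound, so only the type triple is produced
]
tuples = instantiate(template, bindings)
```
        </preformat>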
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Translating RML Logical Sources</title>
        <p>The two main issues that need to be solved are how to translate (1) RML sources and (2) RML
references to SPARQL elements. RML sources conceptually emit a set of records whose attribute
access is specified via rml:reference. On the SPARQL side, the SERVICE clause can be used
to generate a set of bindings based on its contained pattern. We can thus introduce a special
SERVICE IRI norse:rml.source which contains a graph pattern that represents an RML source.
In addition, we add a triple pattern with the special predicate norse:output in
order to bind the source records as RDF terms to a SPARQL variable. For this purpose, we introduce
custom JSON and XML<sup>18</sup> datatypes, namely norse:json and norse:xml, as well as corresponding
functions, to capture JSON and XML data efficiently.</p>
        <p><sup>15</sup> https://spark.apache.org/</p>
        <p><sup>16</sup> https://www.w3.org/TR/rdf11-concepts/#section-generalized-rdf</p>
        <p><sup>17</sup> https://github.com/w3c/sparql-12/issues/31</p>
        <p><sup>18</sup> Jena’s implementation of the rdf:XMLLiteral datatype only stores XML as a string, which is not suited for efficient
XPath evaluation.</p>
        <preformat>&lt;map_stops_0&gt; a rr:TriplesMap ;
  rml:logicalSource [ a rml:LogicalSource ;
    rml:referenceFormulation ql:CSV ;
    rml:source "STOPS.csv"
  ] ;
  rr:subjectMap [ a rr:SubjectMap ;
    rr:template "http://example.org/stops/{stop_id}"
  ] ;
  rr:predicateObjectMap [ a rr:PredicateObjectMap ;
    rr:predicateMap [ a rr:PredicateMap ;
      rr:constant wgs84:long
    ] ;
    rr:objectMap [ a rr:ObjectMap ;
      rml:reference "stop_lon" ;
      rr:datatype xsd:double ;
      rr:termType rr:Literal</preformat>
        <p>[…] condition is transformed into a natural join of SPARQL graph patterns where the same variable (?jc0)
is bound on both sides.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Translating RML TermMaps</title>
        <p>RML TermMaps specify how to map the referenced data to RDF terms. SPARQL operates at
the level of bindings where variables are bound to RDF terms. Hence, we can represent RML
TermMaps in SPARQL by using BIND to define variables as expressions over a source’s data.
SPARQL provides the functions IRI, STRDT, STRLANG and BNODE<sup>19</sup> for the construction of RDF
terms. Consequently, every TriplesMap’s term map can be represented using a freshly allocated
variable that is bound to a corresponding definition using a SPARQL BIND statement. A summary
for mapping RML term maps to SPARQL is shown in Figure 2. The function access is thereby a
placeholder that needs to be replaced with a concrete variant based on the type of the logical
source (e.g. XML, JSON, CSV) as explained in Section 3.4.</p>
        <p><sup>19</sup> Unfortunately, the standard BNODE function is not deterministic, so SPARQL-based knowledge graph construction
tools typically either alter the semantics or provide an alternative function.</p>
        <p>• [ rml:reference "ref" ] → BIND(access(?source, "ref") AS ?v0)</p>
        <p>• [ rml:reference "ref" ; rr:termType rr:IRI ] → BIND(IRI(access(?source, "ref")) AS ?v0)</p>
        <p>• [ rml:reference "ref" ; rr:termType rr:BlankNode ] → BIND(BNODE(access(?source, "ref")) AS ?v0)</p>
        <p>• [ rml:reference "ref" ; rr:datatype xsd:float ] → BIND(STRDT(access(?source, "ref"), xsd:float) AS ?v0)</p>
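        <p>The term map rules above can be mimicked by a small string-generating function. The following Python sketch is hypothetical: the dictionary keys, the function name term_map_to_bind and the language branch (based on rr:language and STRLANG, which the rules above do not list) are assumptions for illustration and do not reflect the JenaX implementation.</p>
        <preformat>
```python
def term_map_to_bind(term_map, var="?v0", source="?source"):
    """Translate a reference-valued term map (given as a dict) to a BIND string."""
    # 'access' stays a placeholder, as in the rules above.
    expr = f'access({source}, "{term_map["reference"]}")'
    term_type = term_map.get("termType")
    if term_type == "rr:IRI":
        expr = f"IRI({expr})"
    elif term_type == "rr:BlankNode":
        expr = f"BNODE({expr})"
    elif "datatype" in term_map:
        expr = f'STRDT({expr}, {term_map["datatype"]})'
    elif "language" in term_map:
        # Assumption: language-tagged literals are built with STRLANG.
        expr = f'STRLANG({expr}, "{term_map["language"]}")'
    return f"BIND({expr} AS {var})"


bind = term_map_to_bind({"reference": "stop_lon", "datatype": "xsd:double"})
```
        </preformat>
        <p>For the object map of the stop_lon example this yields a BIND that wraps the access placeholder in STRDT with xsd:double.</p>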
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Translating RML References</title>
        <p>The concrete expression of the access function depends on the logical source’s format. Because
the format is specified, we can rewrite access with the following concrete functions, where
REF is substituted with the reference expression string.</p>
        <p>• JSON: norse:json.path(?x, "$['REF']"). If the result of the JSON path evaluation is
a primitive JSON value then it is converted to an RDF term. JSON null is treated as
“unbound”. For JSON arrays and objects an RDF term of type norse:json is returned.</p>
        <p>• CSV: In our approach we represent CSV rows as JSON documents and thus access could
be performed using the aforementioned norse:json.path function. If headers are absent
then every row is represented as a JSON array, otherwise every row is turned into a
JSON object whose keys are the CSV headers. However, in order to avoid the overhead of
JSON path evaluation we also introduce the function norse:json.get(?obj, "REF") for
accessing a JSON object’s immediate keys directly.</p>
        <p>• XML: norse:xml.text(norse:xml.path(?xmlNode, "//:REF")). The result of an XPath
evaluation is generally another XML node, such as &lt;lon&gt;42.5&lt;/lon&gt;. The function
norse:xml.text extracts an XML element’s content as text, in this example 42.5.</p>
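        <p>As a rough illustration of the format-dependent rewriting, the Python sketch below emits the concrete access expression per source format and shows how CSV rows might be represented as JSON documents. The helper names access_expr and csv_rows_as_json are hypothetical; only the norse function names and argument shapes are taken from the text above, and the JSON conversion uses Python's standard csv and json modules rather than the JenaX datatypes.</p>
        <preformat>
```python
import csv
import io
import json


def access_expr(fmt, var, ref):
    """Return the SPARQL expression that replaces access(var, ref) for a format."""
    if fmt == "JSON":
        return f"norse:json.path({var}, \"$['{ref}']\")"
    if fmt == "CSV":
        # CSV rows are JSON objects, so immediate key access suffices.
        return f'norse:json.get({var}, "{ref}")'
    if fmt == "XML":
        return f'norse:xml.text(norse:xml.path({var}, "//:{ref}"))'
    raise ValueError(f"unsupported format: {fmt}")


def csv_rows_as_json(text, has_header=True):
    """Represent each CSV row as a JSON object (with headers) or array (without)."""
    if has_header:
        rows = csv.DictReader(io.StringIO(text))
    else:
        rows = csv.reader(io.StringIO(text))
    return [json.dumps(row) for row in rows]


with_header = csv_rows_as_json("stop_id,stop_lon\nA,42.5\n")
without_header = csv_rows_as_json("A,42.5\n", has_header=False)
```
        </preformat>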
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Translating RefObjectMaps (Joins)</title>
        <p>
          Joins in RML are declared using rr:RefObjectMap. The outcome of the translation of an RML
join is a CONSTRUCT query which involves a natural join based on the references to the
sources that act as child and parent, as shown in Section 3.2. Every rr:RefObjectMap results in
an independent CONSTRUCT query with only one tuple pattern in its template.
        </p>
        <sec>
          <title>3.5.1. Duplicate-Reducing Self-Join Elimination</title>
          <p>
            For time-efficient execution of RML mappings, such as the ones used in the
GTFS-Madrid-Benchmark, it is known that a form of self-join elimination must be performed[
            <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
            ]. An RML
join condition can generally be omitted if the following conditions are met:
          </p>
          <p>• The same logical source is used for the child and the parent TriplesMap.</p>
          <p>• All involved join conditions use the same reference expression for both parent and child,
such as rr:parent "ref" ; rr:child "ref".</p>
          <p>• Either of the subject maps only mentions a subset of the references used in the join.</p>
          <p>In such a case a referencing object map can be replaced with a simple object map based on the
referenced TriplesMap’s subject map. The underlying principle is sketched as follows. Let <italic>R</italic> be
the TriplesMap’s logical relation. Let <italic>C</italic> and <italic>P</italic> be the sets of attributes referenced by the child
and parent subject maps, respectively. Let <italic>J</italic> be the set of joining attributes. Then the following
transformation can be applied if the condition <italic>C</italic> ⊆ <italic>J</italic> or <italic>P</italic> ⊆ <italic>J</italic> is met (c and p are SQL aliases):</p>
          <p>SELECT DISTINCT <italic>C</italic> ∪ <italic>P</italic> FROM <italic>R</italic> c JOIN <italic>R</italic> p USING (<italic>J</italic>) → SELECT DISTINCT <italic>C</italic> ∪ <italic>P</italic> FROM <italic>R</italic></p>
          <p>If the condition is met but DISTINCT is omitted then the JOIN can only introduce additional
duplicates. Applying the transformation then reduces the duplicates to only those present in <italic>R</italic>.</p>
        </sec>
        <sec>
          <title>4. Optimizing SPARQL CONSTRUCT Query Workloads</title>
          <p>By transforming RML mappings into a set of SPARQL queries, the problem of efficient RML
mapping execution becomes one of workload optimization of a set of SPARQL CONSTRUCT
queries. The essential optimization goals are to efficiently produce tuples that are unique and/or
ordered: For RDF data it is desirable to avoid duplicates as they needlessly increase size and
processing time. Sorted RDF data eases inspection of the available information and assessment
of fitness for use. It also enables efficient lookups using e.g. binary search. In the remainder of
this section we detail our employed optimization procedure.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>4.1. Merging CONSTRUCT Queries using LATERAL</title>
        <p>There are two main issues with SPARQL 1.1 CONSTRUCT queries for the purpose of producing
sorted and unique knowledge graph output:</p>
        <p>• Although ORDER BY and/or DISTINCT can be used with CONSTRUCT queries, these
solution modifiers only affect the underlying bindings and not the produced tuples. This
is especially an issue when a CONSTRUCT query’s template mentions multiple RDF tuple
patterns. With SPARQL 1.1 there is no generic procedure to compute unique tuples in an
efficient way that only has to evaluate the query pattern once.</p>
        <p>• While multiple SELECT queries can be combined with UNION, no such operator exists
for CONSTRUCT queries.</p>
        <p>These two issues make it difficult to devise a general procedure to efficiently combine tuples
generated by a set of CONSTRUCT queries. A recent effort towards the next version of the
SPARQL specification is the introduction of the LATERAL keyword, which is already supported
by a few SPARQL engines<sup>20</sup>. The keyword’s corresponding operation first evaluates the
left-hand side. Each obtained binding is then used to substitute all (in-scope) variables on the
right-hand side before the substituted right-hand side is evaluated:</p>
        <p>[[Lateral(left, right)]] := { μ<sub>1</sub> ∪ μ<sub>2</sub> | μ<sub>1</sub> ∈ [[left]] and μ<sub>2</sub> ∈ [[subst(right, μ<sub>1</sub>)]] }</p>
        <p>With this keyword it is now possible to “normalize” any CONSTRUCT query into an equivalent
one with a canonical template of the form GRAPH ?g { ?s ?p ?o } for quad-based approaches
or ?s ?p ?o for triple-based ones. Without loss of generality, any clashes in variable naming
can be resolved with appropriate renaming. This way, a set of normalized CONSTRUCT queries
can be UNION’d simply by creating a UNION of their graph patterns and adding the uniform
template. The operations ORDER BY and DISTINCT can be applied likewise. The general
CONSTRUCT-to-LATERAL rewrite is described in Figure 3. Note that DEFAULT is thereby an
implementation-dependent constant for the default graph<sup>21</sup>. Given a set of CONSTRUCT queries,
a generic merge can be accomplished based on their lateral form as shown in Figure 5.</p>
        <preformat>CONSTRUCT {
  s1 p1 o1
  ...
  GRAPH gn { sn pn on }
} WHERE {
  PATTERN
}</preformat>
        <preformat>CONSTRUCT { GRAPH ?g { ?s ?p ?o } }
WHERE {
  SELECT DISTINCT ?g ?s ?p ?o {
    PATTERN
    LATERAL {
      { BIND(DEFAULT AS ?g)
        BIND(s1 AS ?s) BIND(p1 AS ?p) BIND(o1 AS ?o) }
      UNION
      ...
      UNION
      { BIND(gn AS ?g)
        BIND(sn AS ?s) BIND(pn AS ?p) BIND(on AS ?o) }
    }
  }
} ORDER BY ?s ?p ?o ?g</preformat>
        <p>Figure 3: Rewrite of a general CONSTRUCT query (first listing) into its normalized lateral form (second listing).</p>
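        <p>The semantics of the lateral operation and of the normalized form can be emulated over in-memory bindings. The Python sketch below is a simplification under stated assumptions: bindings are plain dictionaries, template quads are (graph, subject, predicate, object) tuples with None standing for the default graph, pattern variables are assumed not to clash with ?g, ?s, ?p and ?o, and all function names are hypothetical.</p>
        <preformat>
```python
def lateral(left, right_eval):
    """Evaluate Lateral: feed each left binding into the right-hand side."""
    result = []
    for mu1 in left:
        for mu2 in right_eval(mu1):
            result.append({**mu1, **mu2})
    return result


def resolve(term, mu):
    """Replace a variable (string starting with '?') by its bound value."""
    return mu.get(term) if term.startswith("?") else term


def normalized_quads(pattern_bindings, template, default_graph="DEFAULT"):
    """Emulate the normalized query: one UNION branch per template quad."""
    def right_eval(mu):
        rows = []
        for g, s, p, o in template:
            row = {
                "?g": default_graph if g is None else resolve(g, mu),
                "?s": resolve(s, mu),
                "?p": resolve(p, mu),
                "?o": resolve(o, mu),
            }
            if None not in row.values():  # skip branches with unbound terms
                rows.append(row)
        return rows

    merged = lateral(pattern_bindings, right_eval)
    # DISTINCT over (?s, ?p, ?o, ?g), then ORDER BY ?s ?p ?o ?g.
    return sorted({(b["?s"], b["?p"], b["?o"], b["?g"]) for b in merged})


pattern_bindings = [
    {"?stop": "ex:s1", "?lon": "1.0"},
    {"?stop": "ex:s1", "?lon": "1.0"},  # duplicate binding
    {"?stop": "ex:s2", "?lon": "2.0"},
]
template = [
    (None, "?stop", "rdf:type", "ex:Stop"),
    ("ex:g", "?stop", "ex:lon", "?lon"),
]
quads = normalized_quads(pattern_bindings, template)
```
        </preformat>
        <p>In this sketch the pattern is evaluated once, and both duplicate removal and ordering operate on the produced tuples rather than on the underlying bindings.</p>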
      </sec>
      <sec id="sec-3-7">
        <title>4.2. Partitioning Mappings</title>
        <p>In Section 3 we showed how to translate RML TriplesMaps into a set of SPARQL CONSTRUCT
queries. Furthermore, we described how a set of CONSTRUCT queries can be combined into
a single one using the novel LATERAL keyword. This tooling is already sufficient to produce
a single CONSTRUCT query from any RML document where DISTINCT and ORDER BY are
applied at the top of its SPARQL algebra expression. However, if it is known that two queries
produce disjoint sets of RDF tuples then DISTINCT (and possibly ORDER BY) can be applied
independently and their results can be UNION’d. As this leads to operations on less data, it
can significantly improve performance.</p>
        <p>In order to achieve this goal it is necessary to obtain a description of the possible set of RDF
tuples that can be created from a CONSTRUCT query. For this purpose we introduce a model
where the set(s) of possible RDF terms (produced by a tuple’s component) are represented as
intervals. For brevity, we only focus on sorting RDF terms based on the lexical space of their
N-Quads serialization.</p>
        <p><sup>21</sup> See discussion https://github.com/w3c/sparql-12/issues/43</p>
        <p>[Figure 4: (a) IRI ranges for &lt;http://transport.linkeddata.es/madrid/metro/: [calendar_date_rule/ .. calendar_date_rule0) with 3 canonical queries, [calendar_rules/ .. calendar_date_rule0) with 10 canonical queries, [feed/ .. feed0) with 7 canonical queries. (b) Abstract intervals i1 to i5; the lexical literal range starts with ", the lexical IRI range starts with &lt;, the lexical blank node range starts with _:]</p>
        <p>For example, from an expression such as BIND(IRI(CONCAT("gtfsbench/", ?id)) AS ?x)
we can derive that ?x may be any of (1) unbound<sup>22</sup> or (2) an IRI with a string value in the interval
["gtfsbench/" .. "gtfsbench0") (under lexicographic order), where [ denotes a closed boundary
and ) an open one, and “0” is the successor character of “/” in (the ASCII subset of) UTF-8.</p>
        <p>Given a tuple of a CONSTRUCT template, we can thus determine a set of possible values for each
of its components. If the template has multiple quads then we can take the
component-wise union of the intervals in order to obtain a single description of its producible quads. If a
variable’s set of values is unknown we can gracefully represent it as an interval covering the
complete range such as (−∞ .. +∞). This way, we can “project” every CONSTRUCT query
to an interval. Figure 4 (a) shows a concrete projection based on a subset of the mappings
of the GTFS-Madrid-Bench. Each interval corresponds to one or more CONSTRUCT queries.
Figure 4 (b) shows an abstract example where intervals overlap. A set of queries with overlapping
ranges forms a partition and can be merged as shown in Figure 5 for the sake of applying
DISTINCT and ORDER BY. Extending this approach to SPARQL’s own term ordering is possible but more complex
because the definition of intervals then needs to consider the RDF term types and RDF literal
datatypes.</p>
        <p><sup>22</sup> This is the case if ?id is not a string because CONCAT only allows for string arguments.</p>
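        <p>The interval model can be illustrated with a short Python sketch. It is a simplification and not the authors' implementation: each query is reduced to a single lexical prefix, intervals are half-open pairs of strings compared by code point, and the names prefix_interval and partition as well as the example queries are hypothetical.</p>
        <preformat>
```python
def prefix_interval(prefix):
    """Half-open interval [prefix .. successor) covering all strings with this prefix."""
    # Assumes a non-empty prefix whose last character has a successor.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return (prefix, upper)


def partition(named_intervals):
    """Group query names whose intervals overlap (transitively)."""
    groups = []
    current, current_hi = [], None
    ordered = sorted(named_intervals.items(), key=lambda kv: kv[1])
    for name, (lo, hi) in ordered:
        # Half-open intervals overlap if the running upper bound exceeds lo.
        if current and current_hi > lo:
            current.append(name)
            current_hi = max(current_hi, hi)
        else:
            if current:
                groups.append(current)
            current, current_hi = [name], hi
    if current:
        groups.append(current)
    return groups


interval = prefix_interval("gtfsbench/")
queries = {
    "q_feed": prefix_interval("feed/"),
    "q_stops": prefix_interval("stops/"),
    "q_stop_a": prefix_interval("stops/a"),
}
groups = partition(queries)
```
        </preformat>
        <p>Queries in different groups produce disjoint term ranges, so DISTINCT and ORDER BY can be applied per group; queries within one group would be merged first.</p>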
      </sec>
      <sec id="sec-3-8">
        <title>4.3. Optimizing DISTINCT by Pulling Up BINDs</title>
        <p>A shortcoming of the generated queries is that the DISTINCT operation runs over variables
that may be assigned to constants. By “pulling” such definitions up in the algebra, DISTINCT
can operate on significantly less data, which in general increases performance by
lowering the computational overhead. Figure 6 shows an example of rewrite rules we use for
optimization. Note that EXTEND is the algebraic counterpart of the BIND syntax<sup>23</sup>. Also note that
more sophisticated rules can be devised to split expressions such as CONCAT(const, ...) into
a constant and variable part where the constant part can be pulled up.</p>
        <p>• DISTINCT(EXTEND(var, constant, subOp)) → EXTEND(var, constant, DISTINCT(subOp))</p>
        <p>• UNION(EXTEND(var, constant, left), EXTEND(var, constant, right)) → EXTEND(var, constant, UNION(left, right))</p>
        <p>• EXTEND(var, non-constant-expr, EXTEND(var, constant, subOp)) → EXTEND(var, constant, EXTEND(var, non-constant-expr, subOp))</p>
        <p>Figure 6: A brief excerpt of algebra rewrite rules used to pull EXTEND up.</p>
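        <p>The first two rewrite rules of Figure 6 can be sketched as a recursive transformation over a toy algebra. In the following Python sketch, operators are nested tuples, constants are tagged as ("const", value), and the names pull_up and is_const_extend are hypothetical; it is meant to illustrate the idea, not to mirror Jena's algebra classes.</p>
        <preformat>
```python
def is_const_extend(op):
    """True for EXTEND nodes whose expression is a constant."""
    return op[0] == "extend" and op[2][0] == "const"


def pull_up(op):
    """Apply the DISTINCT and UNION rules of Figure 6 bottom-up."""
    kind = op[0]
    if kind == "distinct":
        sub = pull_up(op[1])
        if is_const_extend(sub):
            _, var, const, inner = sub
            # DISTINCT(EXTEND(v, c, x)) becomes EXTEND(v, c, DISTINCT(x))
            return ("extend", var, const, pull_up(("distinct", inner)))
        return ("distinct", sub)
    if kind == "union":
        left, right = pull_up(op[1]), pull_up(op[2])
        same = is_const_extend(left) and is_const_extend(right)
        if same and left[1:3] == right[1:3]:
            # Both sides bind the same variable to the same constant.
            merged = pull_up(("union", left[3], right[3]))
            return ("extend", left[1], left[2], merged)
        return ("union", left, right)
    if kind == "extend":
        return ("extend", op[1], op[2], pull_up(op[3]))
    return op  # leaf operators are left unchanged


rdf_type = ("const", "rdf:type")
plan = ("distinct", ("union",
                     ("extend", "?p", rdf_type, ("scan", "A")),
                     ("extend", "?p", rdf_type, ("scan", "B"))))
optimized = pull_up(plan)
```
        </preformat>
        <p>After the rewrite, DISTINCT no longer has to process the constant ?p column.</p>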
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Implementation</title>
      <p>In this section we provide a brief overview of our related implementations: the NORSE SPARQL
extensions, the implementation of the SANSA binding engine (SaBiNe) for evaluating SPARQL
on Apache Spark, and finally the RDF Processing Toolkit (RPT), which bundles all components
together – including the RML-to-SPARQL tooling – into a single command line toolkit.</p>
      <p>NORSE SPARQL Extensions and RPT. JenaX is our project of unofficial extensions for the
Apache Jena project. Among its features are the NORSE SPARQL extensions. Adding the plugin
module as a Maven dependency enhances a Jena-based project with the datatypes and functions
for processing CSV, XML and JSON<sup>24</sup>.</p>
      <p>Evaluating SPARQL with SANSA and Apache Spark. Our approach to evaluating SPARQL
in Spark is a direct one: A SPARQL result set is represented as an RDD&lt;Binding&gt;. On this basis
we present a translation function [[.]] that recursively translates SPARQL algebra operations
to operations on (Java) RDDs. The SANSA Framework thereby provides several features that
enable use of functionality from Apache Jena with Apache Spark, such as serializers for SPARQL
bindings and algebra expressions. Figure 7 shows an excerpt for the evaluation of the SPARQL
operations most relevant to RML execution on Apache Spark.</p>
      <p><sup>23</sup> https://www.w3.org/TR/sparql11-query/#sparqlAlgebra</p>
      <p><sup>24</sup> https://mvnrepository.com/artifact/org.aksw.jenax/jenax-arq-plugins-bundle</p>
      <p>• [[SERVICE norse:rml.source {[ ... norse:output ?s ]}]] := Create an RDD&lt;Binding&gt;
where ?s is bound to records of the specified RML source.</p>
      <p>• [[FILTER(subOp, expr)]] := [[subOp]].filter(μ → exprEval(expr, μ) == true)</p>
      <p>• [[JOIN(left, right)]] := [[left]].mapToPair(μ<sub>1</sub> → ⟨Π<sub>J</sub>(μ<sub>1</sub>), μ<sub>1</sub>⟩).join([[right]].mapToPair(μ<sub>2</sub> → ⟨Π<sub>J</sub>(μ<sub>2</sub>), μ<sub>2</sub>⟩)).map(⟨key, ⟨μ<sub>1</sub>, μ<sub>2</sub>⟩⟩ → μ<sub>1</sub> ∪ μ<sub>2</sub>)
where J is the set of join variables vars(left) ∩ vars(right) and Π<sub>J</sub>(μ) is the projection of a
binding to these variables.</p>
      <p>• [[PROJECT(subOp, vars)]] := [[subOp]].map(μ → Π<sub>vars</sub>(μ))</p>
      <p>• [[DISTINCT(subOp)]] := [[subOp]].distinct()</p>
      <p>• [[LATERAL(left, right)]] := if right has no basic graph patterns then
[[left]].mapPartitions(μ → {μ ∪ μ′ | ∀μ′ ∈ convEval(subst(right, μ))})
where convEval is conventional SPARQL evaluation into a (Java) collection of bindings rather
than a Spark RDD.</p>
      <p>• [[EXTEND(var, expr, subOp)]] := [[subOp]].map(μ → μ ∪ {var → exprEval(expr, μ)})</p>
      <p>Figure 7: Excerpt of the translation of SPARQL algebra operations to operations on RDDs.</p>
      <p>RDF Processing Toolkit. RPT is the integration project that provides a powerful frontend
for both Jena’s ARQ and SANSA’s SPARQL engines. Both engines support the NORSE and the
RML extensions; however, only the latter supports parallelization. Example usage of the tooling
is shown in Listing 1.</p>
      <p>Listing 1: Example for using RPT to translate and execute RML</p>
      <preformat>rpt rmltk rml to sparql mapping.rml.ttl &gt; raw.rq
rpt rmltk optimize workload raw.rq --no-order &gt; mapping.rq
JAVA_OPTS="-Xmx16g" rpt integrate mapping.rq --out-file rpt-arq.nt
JAVA_OPTS="-Xmx16g" rpt sansa query mapping.rq --out-file rpt-sansa.nt</preformat>
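      <p>To make the JOIN rule of Figure 7 more tangible, the following plain-Python sketch emulates the mapToPair/join/map pipeline on lists of bindings. It does not use Spark or SANSA; the function names are hypothetical, bindings are dictionaries, and the join variables are derived from the bindings themselves under the assumption that every binding of a side binds the same variables.</p>
      <preformat>
```python
from collections import defaultdict


def project(binding, variables):
    """Projection of a binding to the given variables, as a hashable key."""
    return tuple(sorted((v, binding[v]) for v in variables))


def join(left, right):
    """Hash join mirroring the mapToPair/join/map pipeline of the JOIN rule."""
    # Join variables: vars(left) intersected with vars(right).
    join_vars = set().union(*left).intersection(set().union(*right))
    # Corresponds to mapToPair on the right side: key each binding.
    index = defaultdict(list)
    for mu2 in right:
        index[project(mu2, join_vars)].append(mu2)
    # Corresponds to join followed by map: merge bindings with equal keys.
    return [
        {**mu1, **mu2}
        for mu1 in left
        for mu2 in index.get(project(mu1, join_vars), [])
    ]


stops = [
    {"?jc0": "A", "?s": "ex:stop/A"},
    {"?jc0": "B", "?s": "ex:stop/B"},
]
times = [
    {"?jc0": "A", "?t": "08:00"},
    {"?jc0": "A", "?t": "09:00"},
    {"?jc0": "C", "?t": "10:00"},
]
joined = join(stops, times)
```
      </preformat>
      <p>In Spark, the keyed records are additionally shuffled across partitions so that equal keys meet on the same executor, which this single-process sketch leaves out.</p>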
    </sec>
    <sec id="sec-5">
      <title>6. Evaluation</title>
      <p>
        We evaluate our approach on the GTFS-Madrid-Bench and one of the largest datasets of the
SDM-Genomic-Datasets<sup>25</sup>. For this purpose we converted the benchmarks’ RML files to extended
SPARQL and ran them using Jena’s ARQ and SANSA’s SPARQL engine as shown in Listing 1. In
a first step we evaluated several RML tools on a server with 128GB RAM, an AMD Ryzen 9 5950X
16-core CPU and SSD storage running Ubuntu 20.04. In order to establish comparability, we
used all tools’ native unique output feature<sup>26</sup>. The results for the scale factors 1, 10, 100 and 300 are
shown in Figure 8. We also attempted to evaluate RocketRML; however, we ran into memory
issues with it<sup>27</sup>. As for RMLStreamer[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], on the one hand it requires a Flink setup and on the
other hand the initially observed execution behavior suggested that it lacks self-join elimination,
similar to RMLMapper.
      </p>
      <p><sup>25</sup> The files used are 75percent_of_records_with_duplicate_and_each_duplicate_being_repeated_20times.csv and
4POM_Normal.ttl</p>
      <p>In a subsequent step, we evaluated the fastest approaches which are the ones that rely on
parallel processing, namely Morph_KGC and RPT/SANSA. For this evaluation we needed a
machine with more RAM and its specs were: Ubuntu 22.04, 2x Intel(R) Xeon(R) CPU E5-2683 v4
@ 2.10GHz (totalling to 64 threads) and 512GB DDR4 RAM at 2133 MHz. In order to avoid I/O
bounds in parallel processing, we performed the experiments for both tools with the benchmark
datasets served from the default RAM drive /dev/shm. With this machine it was possible to
scale up to factor 1000. In addition, we evaluated the tools on the SDM-Genomic-Dataset
as this includes a workload that does not involve joins but many duplicates. As can be seen
from Figure 9 the execution times for both tools on both workloads converge to scaling linearly.
On smaller sizes Morph_KGC outperforms RPT/SANSA. With increasing data scale the Apache
Spark-based approach gains an advantage. However, on the workload that is mainly about
duplicate removal the performance benefit is quite small considering CPU usage: Morph’s
average CPU usage in both scenarios is roughly around 400% whereas RPT/SANSA’s is around
4000%. This means that the latter requires almost 10 times the CPU resources of the former
in order to accomplish the same task. There are many aspects that can cause this significant
26The only exception was Carml for which we appended a sort -u step
27https://github.com/semantifyit/RocketRML/issues/44
(a)
(b)
diference: As a primary source we suspect Apache Spark’s processing model for DISTINCT
which relies on hash partitioning and shufling of data which involves (de-)serialization. This
introduces a significant overhead when compared to e.g. simply keeping records in an
inmemory hash set. Also, Jena introduces additional overhead by having to parse all RDF literals
for expression evaluation. The benefit is, that invalid literals are reported whenever they are
produced during knowledge graph construction. A performance issue which we detected
and fixed during profiling was that Jena would needlessly materialize literals during SPARQL
evaluation28. Overall, further investigations are necessary to assess the impact of all relevant
aspects on the performance in detail.</p>
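      <p>The two duplicate-removal strategies contrasted above can be sketched as follows. This is a minimal, single-process Python illustration under stated assumptions, not Spark's or Jena's actual implementation: the function names, the tab-separated serialization and the CRC32-based partitioner are hypothetical choices made for this example only.</p>
      <preformat>
```python
import zlib


def serialize(triple):
    """Encode a triple of strings as bytes (hypothetical wire format)."""
    return "\t".join(triple).encode("utf-8")


def deserialize(payload):
    """Decode bytes produced by serialize() back into a triple."""
    return tuple(payload.decode("utf-8").split("\t"))


def distinct_hash_set(triples):
    """Single-process DISTINCT: keep every seen triple in one in-memory set."""
    seen = set()
    result = []
    for triple in triples:
        if triple not in seen:
            seen.add(triple)
            result.append(triple)
    return result


def distinct_partitioned(triples, num_partitions=4):
    """DISTINCT in the style of a hash-partitioned shuffle (illustration only)."""
    # "Map side": serialize each record and route it to a partition by hash.
    partitions = [[] for _ in range(num_partitions)]
    for triple in triples:
        payload = serialize(triple)
        partitions[zlib.crc32(payload) % num_partitions].append(payload)
    # "Reduce side": deserialize and deduplicate each partition independently.
    # Equal triples serialize identically, so they always share a partition.
    result = []
    for partition in partitions:
        result.extend({deserialize(payload) for payload in partition})
    return result


triples = [("s1", "p", "o1"), ("s2", "p", "o2"), ("s1", "p", "o1")] * 3
local = distinct_hash_set(triples)
shuffled = distinct_partitioned(triples)
print(len(local), len(shuffled))  # prints "2 2"
```
      </preformat>
      <p>Both functions return the same two distinct triples. The partitioned variant additionally encodes, routes and decodes every record, which is the kind of per-record overhead suspected above; in a distributed setting the routing step would also involve network transfer.</p>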
    </sec>
    <sec id="sec-6">
      <title>7. Conclusions and Future Work</title>
      <p>In this work we showed that, by converting RML to SPARQL, we can leverage suitably
enhanced SPARQL engines for the task of knowledge graph construction. We further showed
that transforming CONSTRUCT queries to their “lateral” form makes it possible to
“merge” CONSTRUCT queries and remove duplicates, which has direct applications in knowledge
graph construction. Using query workload analysis we can push down DISTINCT operations
such that this expensive operation can be computed on smaller RDF graphs. We showed that
the same query workload can be executed on different engines yielding the same result sets,
albeit with significantly different performance characteristics. By leveraging a Big Data
framework this approach can outperform state-of-the-art approaches. We emphasize that,
as part of this work, we contributed the SERVICE extension plugin system as well as minor
performance improvements to Apache Jena. One direction of future work is to optimize the
generated SPARQL algebra further so as to minimize the amount of data that has to be processed
in DISTINCT and ORDER BY operations. Also, as shown in the evaluation, the improved overall
performance comes at the cost of significantly higher resource usage, for which we plan an in-depth
investigation of the reasons and possible mitigation approaches such as using custom Spark
operator implementations. Furthermore, we identify the need for the standardization of SPARQL
for heterogeneous data, as this would not only make it possible to transform RML to SPARQL in
a truly interoperable way, but also provide a common ground for query and query workload
optimization.</p>
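      <p>The effect of pushing DISTINCT down to smaller graphs can be illustrated with a small sketch. It assumes, purely for illustration, two CONSTRUCT queries whose outputs use different constant predicates and therefore cannot produce the same triple; the example data and the helper name distinct are hypothetical and do not reproduce the workload analysis itself.</p>
      <preformat>
```python
def distinct(triples):
    """Order-preserving duplicate removal."""
    return list(dict.fromkeys(triples))


# Hypothetical outputs of two CONSTRUCT queries with different constant predicates.
names = [("ex:a", "ex:name", "A"), ("ex:a", "ex:name", "A"), ("ex:b", "ex:name", "B")]
stops = [("ex:a", "ex:stop", "S1"), ("ex:a", "ex:stop", "S1")]

# The predicate sets are disjoint, so no triple can occur in both outputs.
assert {p for _, p, _ in names}.isdisjoint({p for _, p, _ in stops})

global_distinct = distinct(names + stops)        # one DISTINCT over the union
pushed_down = distinct(names) + distinct(stops)  # two DISTINCTs over smaller inputs

print(global_distinct == pushed_down)  # prints "True"
```
      </preformat>
      <p>Under this disjointness assumption each DISTINCT only has to track the triples of a single query, so the state kept for duplicate detection is bounded by the largest individual output rather than by the whole graph.</p>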
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors acknowledge the financial support by the German Federal Ministry for Economic
Affairs and Energy in the project Coypu (project number 01MK21007A) and by the German Federal
Ministry of Education and Research in the project StahlDigital (project number 13XP5116B).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heyvaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>Mapping languages: Analysis of comparative characteristics</article-title>
          ,
          <source>in: KGB@ESWC</source>
          ,
          <year>2019</year>
          . URL: http://ceur-ws.org/Vol-2489/paper4.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <article-title>Knowledge Graph Construction from Heterogeneous Data Sources exploiting Declarative Mapping Rules</article-title>
          , Ph.D. thesis, Universidad Politécnica de Madrid,
          <year>2021</year>
          . doi:10.20868/UPM.thesis.67890.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ruckhaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>An ontological approach for representing declarative mapping languages</article-title>
          ,
          <source>Semantic Web</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          . doi:10.3233/sw-223224.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Assche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Delva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Haesendonck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heyvaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>Declarative RDF graph generation from heterogeneous (semi-)structured data: A systematic literature review</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>75</volume>
          (
          <year>2023</year>
          )
          <elocation-id>100753</elocation-id>
          . doi:10.1016/j.websem.2022.100753.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Priyatna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <article-title>Towards a new generation of ontology based data access</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>153</fpage>
          -
          <lpage>160</lpage>
          . doi:10.3233/sw-190384.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ó.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Devising mapping interoperability with mapping translation</article-title>
          ,
          <source>in: KGCW@ESWC</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3141/paper6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lefrançois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bakerally</surname>
          </string-name>
          ,
          <article-title>A SPARQL extension for generating RDF from heterogeneous formats</article-title>
          ,
          <source>in: The Semantic Web: 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28-June 1, 2017, Proceedings, Part I</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>50</lpage>
          . doi:10.1007/978-3-319-58068-5_3.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Asprino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Daga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulholland</surname>
          </string-name>
          ,
          <article-title>Knowledge graph construction with a façade: A unified method to access heterogeneous data sources on the web</article-title>
          ,
          <source>ACM Trans. Internet Technol.</source>
          (
          <year>2022</year>
          ). doi:10.1145/3555312.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cogrel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Komla-Ebri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kontchakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rezk</surname>
          </string-name>
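          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodriguez-Muro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <article-title>Ontop: Answering SPARQL queries over relational databases</article-title>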
          ,
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <year>2017</year>
          )
          <fpage>471</fpage>
          -
          <lpage>487</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Unbehauen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Westphal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sherif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>Simplified RDB2RDF mapping</article-title>
          ,
          <source>LDOW@WWW</source>
          <volume>1409</volume>
          (
          <year>2015</year>
          ). URL: https://ceur-ws.org/Vol-1409/paper-09.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vander Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mannens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Van de Walle</surname>
          </string-name>
          ,
          <article-title>RML: A generic language for integrated RDF mappings of heterogeneous data</article-title>
          ,
          <source>in: Proceedings of the Workshop on Linked Data on the Web (LDOW) 2014, co-located with the 23rd International WWW Conference</source>
          ,
          <year>2014</year>
          . URL: https://ceur-ws.org/Vol-1184/ldow2014_paper_01.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>R2RML and RML comparison for RDF generation, their rules validation and inconsistency resolution</article-title>
          ,
          <source>ArXiv</source>
          (
          <year>2020</year>
          ). doi:10.48550/ARXIV.2005.06293.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Heyvaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <article-title>Declarative rules for linked data generation at your fingertips!</article-title>
          ,
          <source>in: The Semantic Web: ESWC 2018 Satellite Events, Heraklion, Crete, Greece, June 3-7, 2018, Revised Selected Papers 15</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>García-González</surname>
          </string-name>
          ,
          <article-title>A ShExML perspective on mapping challenges: Already solved ones, language modifications and future required actions</article-title>
          (
          <year>2021</year>
          ). URL: http://ceur-ws.org/Vol-2873/paper2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>García-González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>Why to tie to a single data mapping language? Enabling a transformation from ShExML to RML</article-title>
          ,
          <source>in: International Conference on Semantic Systems</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3235/paper11.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Arenas-Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pozo-Gilo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Donà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ó.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <article-title>Knowledge graph construction with R2RML and RML: An ETL system-based overview</article-title>
          ,
          <source>in: KGCW@ESWC</source>
          ,
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-2873/paper11.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>E.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jozashoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <article-title>Scaling up knowledge graph creation to large and heterogeneous data sources</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>75</volume>
          (
          <year>2023</year>
          )
          <elocation-id>100755</elocation-id>
          . doi:10.1016/j.websem.2022.100755.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jozashoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Collarana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <article-title>SDM-RDFizer: An RML interpreter for the efficient creation of RDF knowledge graphs</article-title>
          ,
          <source>in: Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management</source>
          , ACM,
          <year>2020</year>
          . doi:10.1145/3340531.3412881.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Arenas-Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Morph-KGC: Scalable knowledge graph materialization with mapping partitions</article-title>
          ,
          <source>Semantic Web</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . doi:10.3233/SW-223135.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Priyatna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ruckhaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>GTFS-Madrid-Bench: A benchmark for virtual knowledge graph access in the transport domain</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>65</volume>
          (
          <year>2020</year>
          )
          <elocation-id>100596</elocation-id>
          . doi:10.1016/j.websem.2020.100596.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , et al.,
          <article-title>Distributed semantic analytics using the SANSA stack</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2017: 16th International Semantic Web Conference, Vienna, Austria, October 21-25, 2017, Proceedings, Part II 16</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
          . doi:10.1007/978-3-319-68204-4_15.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Haesendonck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Maroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heyvaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>Parallel RDF generation from heterogeneous big data</article-title>
          ,
          <source>in: Proceedings of the International Workshop on Semantic Big Data</source>
          , ACM,
          <year>2019</year>
          . doi:10.1145/3323878.3325802.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>