=Paper=
{{Paper
|id=Vol-3718/paper12
|storemode=property
|title=Results for Knowledge Graph Creation Challenge 2024: SDM-RDFizer
|pdfUrl=https://ceur-ws.org/Vol-3718/paper12.pdf
|volume=Vol-3718
|authors=Enrique Antonio Iglesias,Maria-Esther Vidal
|dblpUrl=https://dblp.org/rec/conf/kgcw/IglesiasV24
}}
==Results for Knowledge Graph Creation Challenge 2024: SDM-RDFizer==
Enrique Iglesias<sup>1,2,*</sup>, Maria-Esther Vidal<sup>1,2,3</sup>
<sup>1</sup> L3S Research Center, Hannover, Germany
<sup>2</sup> Leibniz University of Hannover, Hannover, Germany
<sup>3</sup> TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
Abstract
The volume of data generated in recent years has increased drastically, necessitating a unified schema to
integrate multiple data sources into a single format. The RDF Mapping Language (RML) was developed to
define the structure of knowledge graphs (KGs). Over time, various extensions have been introduced to
enhance RML’s functionality, creating a need for a new specification that consolidates these extensions.
Track 1 of the KGCW 2024 Challenge dataset addresses this need by providing a comprehensive set of
test cases to ensure that knowledge graph creation engines comply with the updated RML specification.
This paper reports on the conformance evaluation of SDM-RDFizer using this dataset, highlighting its
capabilities and areas for improvement in achieving full RML compliance.
Keywords
Knowledge Graph Creation, Data Integration System, RDF Mapping Languages
1. Introduction
The substantial surge in data volume has led to the increasing use of knowledge graphs (KGs) to
integrate multiple data sources in different formats. Consequently, various mapping languages
have emerged to define KGs; some of the more well-known are R2RML and its extension,
RDF Mapping Language (RML) [1]. Both of these languages adhere to the rules established
by the Resource Description Framework (RDF)<sup>1</sup>. Over time, new extensions for RML have
been developed, adding functionalities like the execution of functions for value transformation
(RML+FnO [2]) and the use of RDF-Star (RML-Star [3]). A new specification for RML<sup>2</sup> has been
defined to incorporate all these extensions and to formally remove references to R2RML.
The Track 1 dataset of the KGCW 2024 Challenge covers many test cases, including basic cases,
functions, RML-Star, remote sources, and specific outputs. These test cases ensure that existing
KG creation tools can incorporate the new RML specification and achieve full compliance. This
report presents the updates needed to integrate the new specification into SDM-RDFizer and
evaluates the results using the Track 1 dataset.
This paper is organized into three additional sections. Section 2 provides an overview of
SDM-RDFizer, including its techniques, data structures, and physical operators for optimizing KG
creation. Section 3 details the results of the challenge, including the dataset definition and the
necessary improvements to meet the test cases. Finally, Section 4 presents the conclusions and
future steps for SDM-RDFizer.
KGCW’24: 5th International Workshop on Knowledge Graph Construction, May 27, 2024, Crete, GRE
<sup>*</sup> Corresponding author.
<sup>†</sup> These authors contributed equally.
Email: iglesias@l3s.de (E. Iglesias); maria.vidal@tib.eu (M. Vidal)
ORCID: 0000-0002-8734-3123 (E. Iglesias); 0000-0003-1160-8727 (M. Vidal)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
<sup>1</sup> https://www.w3.org/TR/2004/REC-rdf-primer-20040210/
<sup>2</sup> https://kg-construct.github.io/rml-core/spec/docs/
2. SDM-RDFizer
SDM-RDFizer [4] is an RML-compliant KG creation engine. It comprises
two modules: Triples Maps Planning (TMP) and Triples Maps Execution (TME). Each
module has its own data structures that optimize different aspects of the KG creation
process. TMP determines the execution order of the triples maps (TMs) to keep memory usage
to a minimum. TME generates the KG following the order established by TMP. Several novel
operators are defined to transform different types of TMs: the Simple Object Map (SOM) operator
executes rml:template and rml:reference, the Object Reference Map (ORM) operator executes parent
triples maps, and the Object Join Map (OJM) operator executes joins. Each generated triple is
compared against the corresponding Predicate Tuple Table (PTT) to determine whether it is a
duplicate, and the Dictionary Table (DT) compresses the resources stored in the PTT. The
Predicate Join Tuple Table (PJTT) stores the results of executing joins. SDM-RDFizer is publicly
available on GitHub<sup>3</sup>.
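The interplay between the PTT and the DT can be illustrated with a minimal sketch. It is not the SDM-RDFizer implementation: the class names, method names, and the choice of integer identifiers are assumptions made only to show how a dictionary can compress resources while a per-predicate table detects duplicates.

```python
from collections import defaultdict


class DictionaryTable:
    """Hypothetical DT: maps each resource string to a compact integer id."""

    def __init__(self):
        self._ids = {}

    def encode(self, resource):
        # The first time a resource is seen it receives the next free integer.
        return self._ids.setdefault(resource, len(self._ids))


class PredicateTupleTable:
    """Hypothetical PTT: per predicate, the (subject, object) pairs already emitted."""

    def __init__(self, dictionary):
        self._dictionary = dictionary
        self._tuples = defaultdict(set)

    def add(self, subject, predicate, obj):
        """Return True if the triple is new, False if it is a duplicate."""
        key = (self._dictionary.encode(subject), self._dictionary.encode(obj))
        seen = self._tuples[predicate]
        if key in seen:
            return False
        seen.add(key)
        return True


dt = DictionaryTable()
ptt = PredicateTupleTable(dt)
candidates = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:bob"),  # duplicate of the first triple
    ("ex:bob", "foaf:knows", "ex:alice"),
]
# Only triples that the PTT has not seen before are kept.
unique = [triple for triple in candidates if ptt.add(*triple)]
```

Storing pairs of small integers instead of full IRIs is what keeps the duplicate check cheap in memory in this sketch.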
3. Test Cases of the Knowledge Graph Creation Challenge
Track 1 of the KGCW 2024 Challenge<sup>4</sup> aims to inspire new methods and techniques for
incorporating the new RML specification into existing KG creation engines. The dataset comprises
five sets of test cases:
RML-Core: This set includes basic test cases originally defined in the RML test cases<sup>5</sup> to test the compliance of KG creation engines. These cases have been updated to reflect the new specification and use CSV, JSON, and XML files and relational databases (MySQL and Postgres) as data sources.
RML-FNML: This set contains test cases that use functions to transform data, employing a series of pre-defined functions to execute these transformations.
RML-Star: This set incorporates RDF-Star<sup>6</sup> test cases, generated from the RML-Star test cases<sup>7</sup> in accordance with the new specification.
RML-IO: This set includes a wide variety of remote data sources such as endpoints, compressed files, and JSON and XML files. It also defines outputs for specific properties in various formats like Turtle, RDF/JSON, and JSON-LD, and in multiple compressed formats like Zip and Tar. The name reflects its focus on Input and Output.
RML-CC: This set covers collections and containers.
This work presents the results of executing RML-Core, RML-FNML, RML-Star, and RML-IO
with SDM-RDFizer. RML-CC will be incorporated at a later date. Table 1 shows the total number
of test cases in the dataset and which cases were passed and failed by SDM-RDFizer. The full
results are available on GitHub<sup>8</sup>.
{| class="wikitable"
|+ Table 1: Test cases of the KGCW 2024 Challenge Track 1 dataset.
! Set !! # of Test Cases !! # of Passed Cases !! # of Failed Cases
|-
| RML-Core || 238 || 238 || 0
|-
| RML-FNML || 14 || 14 || 0
|-
| RML-Star || 18 || 18 || 0
|-
| RML-IO || 67 || 65 || 2
|-
| RML-CC || 29 || 0 || 29
|-
| Total || 366 || 335 || 31
|}
<sup>3</sup> https://github.com/SDM-TIB/SDM-RDFizer
<sup>4</sup> https://zenodo.org/records/10973433
<sup>5</sup> https://rml.io/test-cases/
<sup>6</sup> https://www.w3.org/2021/12/rdf-star.html
<sup>7</sup> https://zenodo.org/records/6518802
3.1. Results of RML-Core
RML-Core comprises test cases covering the fundamentals of RML, such as the definition of
classes, the use of rml:template and rml:reference, the execution of parent triples maps and
joins, and the definition of data types, languages, and graphs. SDM-RDFizer implements three
operators to execute different types of mappings.
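As an illustration of what rml:template and rml:reference produce for a single record, the following sketch expands a template and resolves a reference. The function names are hypothetical, and the handling is simplified: escaped braces in templates are not supported, and percent-encoding with `urllib.parse.quote` is used as an approximation of IRI-safe encoding.

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def expand_template(template, row):
    """Fill each {name} placeholder with the percent-encoded value from `row`.

    Returns None when a referenced value is missing, so that no term is produced.
    """
    missing = False

    def substitute(match):
        nonlocal missing
        value = row.get(match.group(1))
        if value is None:
            missing = True
            return ""
        return quote(str(value), safe="")

    result = PLACEHOLDER.sub(substitute, template)
    return None if missing else result


def resolve_reference(reference, row):
    """Return the referenced value unchanged, or None if it is absent."""
    return row.get(reference)


row = {"id": "42", "name": "Ada Lovelace"}
subject = expand_template("http://example.com/person/{id}", row)
name_iri = expand_template("http://example.com/name/{name}", row)
literal = resolve_reference("name", row)
no_term = expand_template("http://example.com/city/{city}", row)
```

A template yields an IRI built from one or more values, whereas a reference passes the raw value through, which is why only the template path applies encoding here.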
To extract data from various data sources, SDM-RDFizer employs several Python libraries: csv
for CSV files, json for JSON files, xml for XML files, mysql-connector for connecting to MySQL
databases, and psycopg2 for connecting to Postgres databases.
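A minimal sketch of how the standard-library readers named above can turn file-based sources into uniform records is given below. The helper names are hypothetical, the inputs are inline strings rather than files, and the `key` and `tag` parameters are simplified stand-ins for a JSONPath or XPath iterator. The database connectors are omitted because they need a running server.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def rows_from_csv(text):
    """Read CSV text into one dictionary per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text, key):
    """Return the list stored under `key` (a stand-in for a JSONPath iterator)."""
    return json.loads(text)[key]


def rows_from_xml(text, tag):
    """Return one dictionary per <tag> element (a stand-in for an XPath iterator)."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in element} for element in root.iter(tag)]


csv_rows = rows_from_csv("id,name\n1,Ada\n2,Alan\n")
json_rows = rows_from_json('{"people": [{"id": "1", "name": "Ada"}]}', "people")
xml_rows = rows_from_xml(
    "<people><person><id>1</id><name>Ada</name></person></people>", "person"
)
```

Normalizing every source to a list of dictionaries lets the same mapping logic run regardless of the input format.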
To parse the new specification, the parser query is updated to replace the rml prefix with its
new namespace, replace all mentions of the R2RML namespace with the RML namespace, and
update the definition of rml:logicalSource to include the use of the rml:path and rml:root clauses,
replacing rml:query with rml:iterator. The rdflib library is used to execute the parser query.
SDM-RDFizer passes all 238 test cases in this set.
3.2. Results of RML-FNML
RML-FNML consists of test cases that use functions to transform values, including tasks like
replacing strings, transforming strings to lower and upper case, concatenating strings, and
more. These test cases demonstrate the use of RML+FnO in the new specification. SDM-RDFizer
incorporates strategies from FunMap [5] to execute functions. FunMap is a TM translator that
converts TMs containing functions and their corresponding data sources into TMs without
functions, reflecting the execution of the functions.
To parse TMs with functions, a new parser query is explicitly defined for extracting the functions
from the TMs. This allows for proper handling of nested functions, as each function is extracted
individually, and nested functions are called from the function’s parameters.
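The idea of calling nested functions from a function's parameters can be sketched as a small recursive evaluator. This is an illustrative assumption, not the engine's code: the dictionary layout of a call, the function names in `FUNCTIONS`, and the `evaluate` helper are all hypothetical.

```python
# Hypothetical function library; the names are illustrative only.
FUNCTIONS = {
    "toUpperCase": lambda value: value.upper(),
    "toLowerCase": lambda value: value.lower(),
    "concat": lambda left, right: left + right,
    "replace": lambda value, old, new: value.replace(old, new),
}


def evaluate(node, row):
    """Evaluate a parameter that is a constant, a reference, or a nested call."""
    if "function" in node:
        # Parameters are evaluated first, so inner calls run before outer ones.
        arguments = [evaluate(parameter, row) for parameter in node["parameters"]]
        return FUNCTIONS[node["function"]](*arguments)
    if "reference" in node:
        return row[node["reference"]]
    return node["constant"]


# toUpperCase(concat(given, replace(family, " ", "-")))
call = {
    "function": "toUpperCase",
    "parameters": [
        {
            "function": "concat",
            "parameters": [
                {"reference": "given"},
                {
                    "function": "replace",
                    "parameters": [
                        {"reference": "family"},
                        {"constant": " "},
                        {"constant": "-"},
                    ],
                },
            ],
        }
    ],
}
value = evaluate(call, {"given": "Ada", "family": "King Lovelace"})
```

Because each function is a separate node, arbitrary nesting depth needs no special handling beyond the recursion.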
SDM-RDFizer passes all 14 test cases in this set.
<sup>8</sup> https://github.com/SDM-TIB/SDM-RDFizer/tree/master/kgcw_2024_challenge
3.3. Results of RML-Star
RML-Star comprises test cases that use RDF-Star, an extension of RDF that introduces a new term,
the quoted triple, which can be used as either the subject or the object of a triple. Therefore,
RML-Star introduces rml:quotedTriplesMap, enabling the definition of quoted triples in a KG.
SDM-RDFizer implements a new operator to generate quoted triples, allowing for recursive
application since quoted triples can contain other quoted triples.
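The recursive nature of quoted triples can be shown with a short sketch. The `Triple` class and the two helper functions are hypothetical and only serialize already-built terms in an N-Triples-star-like notation; they do not reproduce the operator used by SDM-RDFizer.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Triple:
    """A triple whose subject or object may itself be a (quoted) triple."""

    subject: Union[str, "Triple"]
    predicate: str
    object: Union[str, "Triple"]


def serialize(term):
    """Render a term; quoted triples are wrapped in << >> and may nest."""
    if isinstance(term, Triple):
        return f"<< {serialize(term.subject)} {term.predicate} {serialize(term.object)} >>"
    return term


def to_statement(triple):
    """Render an asserted triple as a single line ending with a period."""
    return f"{serialize(triple.subject)} {triple.predicate} {serialize(triple.object)} ."


inner = Triple("<http://ex.org/alice>", "<http://ex.org/knows>", "<http://ex.org/bob>")
middle = Triple(inner, "<http://ex.org/since>", '"2020"')
outer = Triple(middle, "<http://ex.org/source>", "<http://ex.org/survey>")
line = to_statement(outer)
```

Here `inner` is quoted inside `middle`, which is in turn quoted as the subject of `outer`, so the serializer descends two levels before reaching plain terms.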
Another challenge in these test cases was using joins in the rml:subjectMap. These cases were
handled similarly to joins in the rml:objectMap, using the OJM operator for join execution and
PJTT for storing the results. The parser query was expanded to recognize rml:quotedTriplesMap.
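A join that stores its results in an index, which is the role the PJTT plays, might look like the sketch below. The function names, the dictionary-of-lists layout, and the use of `str.format` for template expansion are assumptions for illustration. The same index can serve a join whether the joined term ends up in the subject or the object position.

```python
from collections import defaultdict


def build_pjtt(parent_rows, parent_key, parent_template):
    """Index parent subjects by join value (a stand-in for the PJTT)."""
    table = defaultdict(list)
    for row in parent_rows:
        table[row[parent_key]].append(parent_template.format(**row))
    return table


def join_terms(child_rows, child_key, pjtt):
    """Yield (child_row, parent_subject) pairs whose join values match."""
    for row in child_rows:
        for parent_subject in pjtt.get(row[child_key], []):
            yield row, parent_subject


parents = [
    {"dept_id": "d1", "dept": "Physics"},
    {"dept_id": "d2", "dept": "Math"},
]
children = [
    {"name": "Ada", "dept_id": "d2"},
    {"name": "Alan", "dept_id": "d3"},  # no matching parent, produces nothing
]
pjtt = build_pjtt(parents, "dept_id", "http://example.com/dept/{dept_id}")
matches = [(child["name"], parent) for child, parent in join_terms(children, "dept_id", pjtt)]
```

Building the index once and probing it per child record avoids comparing every child with every parent.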
SDM-RDFizer passes all 18 test cases in this set.
3.4. Results of RML-IO
RML-IO consists of test cases that cover a wide range of remote data sources, including
compressed files, JSON and XML files, and data extracted from SPARQL endpoints. Some cases
introduce the concept of writing certain triples to specific output files, which may need to
be compressed or serialized in different formats, such as JSON-LD or Turtle.
SDM-RDFizer uses the requests library to collect data from remote sources. For SPARQL
endpoints, the SPARQLWrapper library connects and executes SPARQL queries, with the results
converted to a format similar to CSV. When dealing with compressed files, SDM-RDFizer
downloads the file locally and decompresses it using the appropriate library for the format, such as
the zip library for Zip files.
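Two of these steps can be sketched with the Python standard library alone. The example below assumes that the archive bytes and the SPARQL response have already been fetched, so no network call is made. It uses the `zipfile` module and the standard SPARQL JSON results layout (`head.vars` and `results.bindings`). The helper names are hypothetical and this is not the code used by SDM-RDFizer.

```python
import csv
import io
import json
import zipfile


def read_zip_member(archive_bytes, member):
    """Decompress one member of an in-memory Zip archive and return its text."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(member).decode("utf-8")


def sparql_json_to_rows(payload):
    """Flatten SPARQL JSON results into CSV-like dictionaries (one per solution)."""
    document = json.loads(payload)
    variables = document["head"]["vars"]
    return [
        # Unbound variables are represented as None.
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in document["results"]["bindings"]
    ]


# Build a small archive in memory instead of downloading one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("people.csv", "id,name\n1,Ada\n")
csv_text = read_zip_member(buffer.getvalue(), "people.csv")
zip_rows = list(csv.DictReader(io.StringIO(csv_text)))

payload = json.dumps({
    "head": {"vars": ["id", "name"]},
    "results": {"bindings": [
        {"id": {"type": "literal", "value": "1"},
         "name": {"type": "literal", "value": "Ada"}},
        {"id": {"type": "literal", "value": "2"}},
    ]},
})
sparql_rows = sparql_json_to_rows(payload)
```

Both helpers end in the same row-of-dictionaries shape, so decompressed files and endpoint results can feed the same mapping step.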
These test cases introduce the concept of defining an alternate output within the TM. It
can be specified in the rml:subjectMap, rml:predicateMap, rml:objectMap, rml:languageMap,
rml:datatypeMap, or rml:graphMap. Where the alternate output is defined determines which
triples are written to it. If defined in the rml:subjectMap, all valid triples generated from this TM are
sent to that particular output. If the alternate output is defined elsewhere, only triples with that
specific property are written there. Any remaining triples not destined for an alternate
output are sent to the original output file, which is defined at the start of the SDM-RDFizer
execution. When handling these alternative output files, SDM-RDFizer prioritizes generating
them over the standard output and can compress them. Finally, SDM-RDFizer converts the
output files into various RDF formats, such as RDF/XML and JSON-LD, using the rdflib library.
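The routing rules described above can be sketched as follows. This is a simplification under stated assumptions: the `route` function and its parameters are hypothetical, targets defined outside the subject map are keyed by predicate only, and a triple that matches both a subject-level and a property-level target is written to both, which the text does not specify.

```python
from collections import defaultdict

DEFAULT_OUTPUT = "default.nt"  # hypothetical name of the original output file


def route(triples, subject_target=None, property_targets=None):
    """Group triples by the output file they should be written to."""
    property_targets = property_targets or {}
    outputs = defaultdict(list)
    for triple in triples:
        _, predicate, _ = triple
        targets = set()
        if subject_target is not None:
            # A target on the subject map receives every triple of the TM.
            targets.add(subject_target)
        if predicate in property_targets:
            # A target defined elsewhere receives only triples with that property.
            targets.add(property_targets[predicate])
        # Triples without any alternate target fall back to the default output.
        for target in targets or {DEFAULT_OUTPUT}:
            outputs[target].append(triple)
    return dict(outputs)


triples = [
    ("ex:a", "foaf:name", '"Ada"'),
    ("ex:a", "foaf:age", '"36"'),
]
by_property = route(triples, property_targets={"foaf:name": "names.ttl"})
by_subject = route(triples, subject_target="all.nt")
```

In `by_property` only the name triple leaves the default file, while in `by_subject` the default file receives nothing because every triple has an alternate target.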
SDM-RDFizer successfully executes 65 of the 67 test cases. The two failures occurred because
SDM-RDFizer cannot upload the generated triples into a SPARQL endpoint.
4. Conclusions
The KGCW 2024 Challenge Track 1 dataset evaluates the compliance of state-of-the-art engines
with the new formulation of RML. It comprises 366 test cases across five categories: RML-Core,
RML-IO, RML-FNML, RML-Star, and RML-CC. SDM-RDFizer successfully executes 335 of these
test cases, fully covering RML-Core, RML-FNML, RML-Star, and all but two from RML-IO.
To achieve this, SDM-RDFizer incorporates a new parser query, strategies from FunMap, and
an operator for generating inner triples for RML-Star. Moving forward, SDM-RDFizer aims
to address the remaining RML-CC cases by incorporating new methods. For this purpose, a
new operator will be defined, which will behave differently depending on whether it is
transforming a list, bag, or sequence. Additionally, a new data structure will be developed for
intermediate results in RML-Star to avoid repeated inner triple generation, and an optimized
parser will be implemented to manage the increasing complexity.
With these improvements, SDM-RDFizer is set to become a fully compliant RML engine.
Acknowledgments
This work has been partially supported by the Federal Ministry for Economic Affairs and Energy
of Germany (BMWK) in the project CoyPu (project number 01MK21007[A-L]). Maria-Esther Vidal is
partially funded by the Leibniz Association in the "Leibniz Best Minds: Programme for Women
Professors", project TrustKG-Transforming Data in Trustable Insights, with grant P99/2020.
References
[1] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van de Walle, RML:
A Generic Language for Integrated RDF Mappings of Heterogeneous Data, in: Workshop
on Linked Data on the Web, 2014.
[2] B. De Meester, A. Dimou, R. Verborgh, E. Mannens, An ontology to semantically declare
and describe functions, in: European Semantic Web Conference, Springer, 2016, pp. 46–49.
[3] T. Delva, J. Arenas-Guerrero, A. Iglesias-Molina, Ó. Corcho, D. Chaves-Fraga, A. Dimou,
RML-star: A declarative mapping language for RDF-star generation, in: O. Seneviratne, C. Pesquita,
J. Sequeda, L. Etcheverry (Eds.), Proceedings of the ISWC 2021 Posters, Demos and Industry
Tracks: From Novel Ideas to Industrial Practice co-located with 20th International Semantic
Web Conference (ISWC 2021), Virtual Conference, October 24-28, 2021, volume 2980 of CEUR
Workshop Proceedings, CEUR-WS.org, 2021. URL: https://ceur-ws.org/Vol-2980/paper374.pdf.
[4] E. Iglesias, S. Jozashoori, D. Chaves-Fraga, D. Collarana, M.-E. Vidal, SDM-RDFizer: An
RML Interpreter for the Efficient Creation of RDF Knowledge Graphs, in: CIKM, 2020.
doi:10.1145/3340531.3412881.
[5] S. Jozashoori, D. Chaves-Fraga, E. Iglesias, M. Vidal, Ó. Corcho, FunMap: Efficient execution
of functional mappings for knowledge graph creation, in: J. Z. Pan, V. A. M. Tamma,
C. d’Amato, K. Janowicz, B. Fu, A. Polleres, O. Seneviratne, L. Kagal (Eds.), The Semantic
Web - ISWC 2020 - 19th International Semantic Web Conference, Athens, Greece,
November 2-6, 2020, Proceedings, Part I, volume 12506 of Lecture Notes in Computer
Science, Springer, 2020, pp. 276–293. URL: https://doi.org/10.1007/978-3-030-62419-4_16.
doi:10.1007/978-3-030-62419-4_16.