Declarative Description of Knowledge Graphs
Construction Automation: Status & Challenges
David Chaves-Fraga1,2,3 , Anastasia Dimou1,3,4
1
KU Leuven, Department of Computer Science, Sint-Katelijne-Waver, Belgium
2
Universidad Politécnica de Madrid, Campus de Montegancedo, Boadilla del Monte, Spain
3
Flanders Make – DTAI-FET
4
Leuven.AI – KU Leuven institute for AI, B-3000 Leuven, Belgium
Abstract
Nowadays, Knowledge Graphs (KG) are among the most powerful mechanisms to represent knowledge
and integrate data from multiple domains. However, most of the available data sources are still described
in heterogeneous data structures, schemes, and formats. The conversion of these sources into the
desirable KG requires manual and time-consuming tasks, such as programming translation scripts,
defining declarative mapping rules, etc. In this vision paper, we analyze the trends regarding the
automation of KG construction but also the use of mapping languages for the same process, and align
the two by analyzing their tasks and a few exemplary tools. Our aim is not to have a complete study
but to investigate if there is potential in this direction and, if so, to discuss what challenges we need to
address to guarantee the maintainability, explainability, and reproducibility of the KG construction.
Keywords
Knowledge Graphs, Automation, Explainable AI, Declarative Rules
1. Introduction
A lot of works on knowledge graph (KG) construction are focused on defining mapping languages
to declaratively describe the transformation process, and on optimizing the execution of such
declarative rules. The mapping languages rely on either dedicated syntaxes, such as the family
of languages around the W3C recommended R2RML1 (e.g., RML [1] or R2RML-F [2]), or on
re-purposing existing specifications, such as query languages like the W3C recommended
SPARQL2 (e.g., SPARQL-Generate [3] or SPARQL-Anything [4]), or constraints languages like
ShEx3 (e.g., ShExML [5, 6]).
Despite the plethora of mapping languages and the increasing number of optimizations
for the execution of the declarative rules, these rules are still defined through a manual and
time-consuming process, affecting negatively their adoption. Different solutions were proposed
to automate the definition of mapping rules that describe how a KG should be constructed.
KGCW’22: International Workshop on Knolwedge Graph Construction, May 30, 2021, Creete, GRE
Envelope-Open david.chaves@upm.es (D. Chaves-Fraga); anastasia.dimou@kuleuven.be (A. Dimou)
Orcid 0000-0003-3236-2789 (D. Chaves-Fraga); 0000-0003-2138-7972 (A. Dimou)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Workshop
Proceedings CEUR Workshop Proceedings (CEUR-WS.org)
http://ceur-ws.org
ISSN 1613-0073
1
http://www.w3.org/TR/r2rml/
2
https://www.w3.org/TR/sparql11-overview/
3
https://shex.io/
On the one hand, MIRROR [7], D2RQ [8] and Ontop [9] follow a similar approach, extracting
from the RDB schema a target ontology and the mapping correspondences. On the other hand,
AutoMap4OBDA [10] and BootOX [11] consider an input ontology and generate actual R2RML
mappings from the RDB. However, these solutions are focused on declarative solutions only
for relational databases, while recent solutions investigate non-declarative automation of KG
construction.
Beyond relational databases, the recent SemTab challenge4 presents a set of tabular datasets [12]
with the aim of matching them automatically to external KGs, such as DBpedia and Wikidata.
The proposed solutions [13, 14, 15] address the problem using different techniques, such as
heuristic rules, fuzzy searching over the KGs or knowledge graph embeddings. Although their
final objective is the same (to obtain high precision and recall results) and they perform similar
procedures, each solution implements its own workflow and addresses each proposed task
by SemTab in different ways. Hence, making a fair and fine-grained comparison among the
different solutions to understand how they obtain the actual results is not an easy task.
In this vision paper, we align tasks followed by solutions for the automation of the semantic
table annotation with concepts of existing declarative solutions. We indicatively select and
analyze a few tools for the automation of KG construction and identify common steps. We
discuss whether they can be declaratively described relying on existing mapping languages, and
what the challenges are to proceed in this direction. We consider the RDF Mapping Language
(RML) [1] as a high-level and general representation to describe schema transformations and its
extension, the Function Ontology (FnO) [16] to describe data transformations.
Our objective is not to present a complete study but to investigate if there is potential in this
direction. By describing the steps followed by different solutions in a more fine-grained and
standard manner, we make the steps comparable, and we can better discuss what challenges we
need to address to guarantee the maintainability, explainability, and reproducibility of the KG
construction, as well as to ensure the provenance of each performed task.
2. Task alignment with mapping languages
We analyze the different steps of the SemTab challenge, inspect the relationship between the
SemTab challenge tasks, and align them with concepts from the declarative construction of
RDF graphs (Figure 1). To achieve this, we include the relationship between each of the tasks
and their potential declarations within a mapping language. We consider the RML mapping
language because it is commonly used and the authors are more familiar with it, but we are
confident that the other mapping languages could express the same concepts. Before we proceed
with the alignment, we give a small introduction on the SemTab challenge and RML:
SemTab challenge The SemTab challenge consists of three tasks: (i) cell to KG entity
matching (CEA), which matches cells to individuals; (ii) column to KG class matching (CTA),
which matches cells to classes; and (iii) column pair to KG property matching (CPA), which
captures the relationships between pairs of columns.
4
https://www.cs.ox.ac.uk/isg/challenges/sem-tab/
restaurant
(Q11707)
CTA
Col0 Col1 Col2 Col3
Union Depot Tudor Revival Tudor Revival Arch.
Union Depot Spier & Rohns 1902-01-01
(Q7885655) architecture (Q7851317)
The Dorchester The Owen Art Art Deco
CEA 1931-01-01 CEA (Q173782)
(Q2749941) Dorchester Williams Deco
Willow Tearooms Willow Charles Rennie Art Nouveau
(Q1537781) 1903-01-01 Art Nouveau (Q34636)
Tearooms Mackintosh
architectural style
CPA (P149)
1 mappings:
2 triplesMap1: 1 @prefix wdt: .
3 sources: 2 @prefix wd: .
4 - ["input_table.csv~csv"] 3
5 s: CEA_FUNCTION($(Col0)) 4 wd:Q7885655 a wd:Q11707; wdt:P149 wd:Q7851317 .
6 po: 5 wd:Q2749941 a wd:Q11707; wdt:P149 wd:Q173782 .
7 - [a, CTA_FUNCTION($(Col0))] 6 wd:Q1537781 a wd:Q11707; wdt:P149 wd:Q34636 .
8 - [CPA_FUNCTION($(Col0), $(Col3)), CEA_FUNCTION($(Col3))~iri]
Figure 1: Automation tasks alignment within declarative mapping language. Example extracted
from SemTab 2021 challenge, where the CEA, CTA and CPA tasks are aligned with a declarative
construction of a knowledge graph using the RML mapping language (YARRRML serialisation).
RML The RDF mapping language (RML), a superset of the W3C recommended R2RML,
expresses schema transformations from heterogeneous data to RDF. An RML mapping contains
one or more Triple Maps which on their own turn contain a Subject Map to generate the subjects
of the RDF triples, and zero or more Predicate Object Maps with pairs of Predicate and Object
Maps to generate the predicates and the objects respectively for each incoming data record. RML
was aligned with the Function Ontology (FnO) [16] to describe the data transformations which
are required to construct the desired RDF graph, ensuring that the functions are independent
from any implementation.
We analyze how the different tasks of the challenge contribute in constructing a part of an
RDF triple, and we align these tasks with the corresponding concepts of the RML mapping
language that construct the same part of an RDF triple.
Cell-Entity Annotation (CEA): This task identifies the URI of an entity from a cell. In the
target RDF graph, this is the subject or the object of the RDF triple. In Fig. 1, the Col0 values
are used to obtain the subjects of the triples while the Col3 values generate the objects (both
green colored in the RDF extract of Fig. 1). If a declarative approach is considered to generate
these triples, for example in RML, the rr:subjectMap property is used (line 5 of RML doc in
Fig. 1), which declares how the subjects of the triples are generated and the rr:objectMap (line
8 of RML doc in Fig. 1), when the expected objects are in the form of URIs.
Column-Type Annotation (CTA): This task predicts the common class of a set of items
given a column from the table. SemTab assumes that a table only generates one kind of entity
(i.e. the first column is used for CTA). In Figure 1, we can observe that the URIs retrieved
using Col0 are considered for obtaining the corresponding shared concept (i.e., restaurant)
(red colored in the RDF extract of Fig. 1). Declaring the class in RML can be done through the
shortcut rr:class property within the rr:SubjectMap or using a rr:predicateObjectMap
with a rdf:type fixed predicate (line 7 of RML doc in Fig. 1).
Columns-Property Annotation (CPA): This task aims to predict the property that relates
the CTA column (subjects) to the rest of the columns. Fig. 1 shows a CPA task that relates Col0
with Col3 through the property architectural style (wdt:P149 , yellow colored in the RDF
extract). In RML, the predicates of the triples are declared using the rr:predicateMap property
(line 8 of RML doc in Fig. 1), and unlike typical mapping rules, where it is usually assumed that
predicates are constants (as they are declared in the input ontology), the predicates depend on
the data, hence they are dynamically defined.
Based on the aforementioned analysis, we conclude that the tasks performed to automate the
KG construction can be aligned with concepts from declarative mapping languages. The CEA
task is aligned with the RDF term construction for the subject or the object of the RDF triple,
the CTA task assigns the class and the CPA task aligns with the Predicate and Object Map.
3. Comparing semantic tabular matching systems
In this section, we analyze in detail the steps performed by some of the tools proposed for
solving the SemTab challenge. The comparative analysis among the three selected engines
(summarized in Table 1), is not meant to be exhaustive. We aim to identify if there are common
steps and functions that the engines perform to accomplish the challenge’s tasks and ultimately
if it is possible and desired to declaratively describe them with mapping languages.
3.1. Selected Systems
We indicatively selected the systems that: (i) obtained good results in the SemTab 2021 chal-
lenge5 ; and (ii) have the source code openly available. Therefore, we included in this comparison
JenTab [14], MTab [13] and MantisTable V [17]. The use of different terminologies for describing
similar tasks (e.g., majority vote in Mantis V is referred as frequency) and the complexity of the
proposed workflows, where the results from one of the task influence the others in a iterative
way, create difficulties to compare the approaches and reproduce their results.
JenTab6 participated in SemTab 2020 and 2021, and it was always positioned among the top
five solutions for most rounds. It follows a heuristic-based approach proposing the CFS (Create,
Filter, Select) approach for all tasks and with different configurations and workflows.
MTab7 participated in all SemTab editions, winning the first prize in 2019 and 2020. Apart
from the support of multilingual datasets, MTab implements several approaches for performing
the entity search (i.e. CEA): keyword search, fuzzy search, and aggregation search8 .
5
https://www.cs.ox.ac.uk/isg/challenges/sem-tab/2021
6
https://github.com/fusion-jena/JenTab
7
https://github.com/phucty/mtab_tool
8
https://mtab.app/mtabes/docs
MantisTable V9 is an extended and improved version of MantisTable [18]. Similarly to
JenTab, MantisTable has also participated in SemTab 2020 and 2021 editions. It implements a
set of heuristic rules (similar as JenTab) and complex string similarity functions for the entity
recognition task (like MTab). Additionally, it provides a general and efficient tool (LamAPI) to
fetch the necessary data for all SemTab tasks, independently of the target KG.
3.2. Observations
The systems we inspected follow the same steps: they perform a preprocessing step, and setup
lookup and datatype prediction services. Then the CEA task is performed followed by the CTA
and CPA tasks which depend on the CEA task. Given that the systems follow the same steps,
we could map the three main tasks (CEA, CPA, CTA) to the Create-Filter-Select (CFS) procedure
proposed by JenTab (see Table 1).
We observe similarities in most tasks among the engines. The subtasks performed in the
preprocessing step, are very similar in the three engines. Preprocessing tasks include several
functions, such as fixing encoding issues, removing HTML tags or special characters, and
detecting missing white spaces (see Table 1), and they usually delegate them to third-party
libraries (e.g., ftfy10 ). We observe similar tasks are performed when declarative solutions are
used for cleaning and preparing the data. These preprocessing tasks are described with FnO
in the case of RML and executed either together with the schema transformations or as a
preprocessing task too.
The same occurs for the datatype prediction, where regular expressions are often used to
detect if cell values are entities or literals, and what type of literals (string, date, or numbers). In
the case of declarative solutions, this datatype inspection task is performed manually. However,
adjusting the datatype is possible by relying on functions for data transformations.
Most of them also incorporate a lookup step to retrieve the necessary data from the KGs (e.g.,
using SPARQL queries), including similarity functions or fuzzy search. The search engine for
the KG lookups in JenTab and Mantis V is ElasticSearch, although the former implements the
Jaro Winkler distance [19] while the latter embeds it in a more efficient engine and exploits
its query capabilities. Lookups were also incorporated in the case of declarative solutions [20],
where lookup services retrieve a URI to identify an entity instead of assigning a new one.
As far as the actual tasks are concerned, each engine performs its own approach for the CEA,
CTA, and CPA tasks, although we also find some similarities. The most important ones that
are implemented in the three engines are: (i) the Levenshtein distance [21] to filter candidates
and (ii) the majority vote (called frequency in Mantis V) to select the final annotations. We
believe that the use of declarative approaches, such as the Function Ontology [16] for describing
common functions (e.g., Levenshtein), could make the solutions more comparable. It would
also be clearer if they perform the same function and more explainable, as current solutions
for the automation of KG construction act like blackboxes: neither their implementations are
open sourced nor the declarative descriptions of what they execute are available. Providing at
least declarative descriptions of the tasks performed would enhance the transparency of these
solutions.
9
https://bitbucket.org/disco_unimib/mantistable-v/
10
https://pypi.org/project/ftfy/
Table 1
Tasks comparison among different SemTab solutions
JenTab MTab Mantis V
ElasticSearch on top of KG WikiGraph Generation
KG Lookup LamAPI(ElasticSearch, Mongo and Python)
SPARQL Queries Ad-hoc API
Fix encoding Y Y N
Special characters Y N Y
Preprocessing Restore
Y N Y
missing spaces
Remove
N Y N
HTML tags
Remove
N Y N
non-cell-values
Cell values identification (literal, entity)
REGEX REGEX for datatypes exceeding a threshold
Datatype SpaCy models for potential types
Type-based cleaning Entity columns that do not exceed the threshold
Majority vote to define column type
Different query Keword search (BM25)
CREATE LamAPI lookup with IB similarity
rewriting techniques Fuzzy search (Levenshtein distance)
CEA
Levenshtein distance Filter and hashing (Symetric Delete) Levenshtein confidence score for entities
FILTER
(among others) Context similarities by row Literals ad-hoc confidence score
SELECT Levenshtein distance Highest context similarity Highest confidence score
CREATE Types from CEA Types from CEA Types from CEA
CTA Remove the less
FILTER - -
popular types
SELECT Majority vote Majority vote Majority vote
Cell annotations (CEA) and Aggregate all properties
CREATE Properties from CEA lookups
fuzzy match for data properties from CEA by row
CPA
FILTER - - -
SELECT Majority vote Majority vote Majority vote
4. Challenges for a declarative automation of KG Construction
We identify a set of challenges to be addressed to declaratively describe solutions for automatic
KG construction. These challenges can be divided into two categories: technical and conceptual.
On the technical side, there is a major difference between the solutions for the automation of
KG construction and the execution of declarative KG construction solutions: The solutions for
automatic KG construction rely on iterative processes that continuously refine and improves a
task, while the different tasks influence each other. To the contrary, the declarative KG construc-
tion is a linear process that is executed only once. Not all declarative rules are executed linearly,
solutions that restructure [6] or parallelize them [22, 23] are increasingly encountered, but no
iterative solutions were proposed so far. Thus, if the solutions for automatic KG construction
are declaratively described, their iterative execution needs to be described as well. How do we
do that with the mapping languages? When is it meaningful?
Besides the overall execution process, the iteration patterns are different. The solutions for
automatic KG construction are applied to all directions, both per column and per row, and even
combined. To the contrary, the declarative solutions are applied only per row, and the mapping
languages are designed under this assumption. Should the mapping languages be extended to
support more iteration patterns? If so, would the rml:iteration for RML and the relevant
constructs in the other mapping languages be sufficient or more adjustments are required?
The solutions for automatic KG construction rely on interrelated tasks which may produce
intermediate representations, e.g., probabilistic methods, and their results impact the rest tasks.
The declarative KG construction solutions then need to deal with dynamic and recursive
steps (e.g., intermediate representation of the input data sources and mapping rules, multiple
function execution, etc.) that can negatively impact the generation process. Hence, declaratively
describing is a challenge. Should the mapping languages be further extended then?
On the conceptual side, there are two major differences with respect to the training data and
target KGs. In most real projects, the input data and sometimes the target ontology are only
provided, but there is neither similar data to train the solutions nor existing KGs to target that
can be used to find entities or to predict the relationships. In the past, alternative approaches for
KG construction were discussed depending on what is available where the process starts (e.g.,
data, ontologies, target KGs), but it is not investigates neither how these editing approaches
affect the KG generation nor its automation. How are the automated solutions proposed for the
SemTab but not only affected by the lack of training data and target ontologies and KGs. How
does this affect their declarative representation?
While relying on ontology matching techniques between existing KGs (e.g., DBPedia, Wiki-
data) and the target ontology or exploiting NLP approaches between ontology and input sources
documentation could be a solution for the latter, would it be realistic given that most ontologies
are not aligned and not all of them provide documentation?
5. Conclusions and Future Work
In this paper, we analyze the KG construction solutions and compare the automatic with the
declarative. While the tasks can be aligned with respect to what they achieve, their execution is
fundamentally different and a direct alignment is not feasible.
Automatic solutions for KG construction are required to facilitate the adoption of KGs, but
there are also merits when the automation tasks are declaratively described, with respect to
maintenability, sustainability, and reproducibility. However, directly aligning the automatic
solutions with the declarative solutions might be technically and conceptually challenging
considering their different execution and iteration patterns. Extending the existing mapping
languages would be a solution, but it would also require to address the identified challenges
and not only. Would such extensions be feasible and desired or would they lead them beyond
their purpose? Although, mapping languages are not the only approach to have declarative
descriptions. Declarative descriptions of workflows emerge as well. Would that be a more viable
solution? If so, would the automatic and declarative solutions keep on growing in different
directions? These are questions that would be nice to reflect and discuss during the workshop.
Acknowledgments
David Chaves-Fraga is supported by the Spanish Minister of Universities (Ministerio de Universi-
dades) and by the NextGenerationEU funds through the Margarita Salas postdoctoral fellowship.
Both authors are supported by Flanders Make.
References
[1] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van de Walle, RML:
a generic language for integrated RDF mappings of heterogeneous data, in: Ldow, 2014.
[2] C. Debruyne, D. O’Sullivan, R2RML-F: towards sharing and executing domain logic in
R2RML mappings, in: LDOW@ WWW, 2016.
[3] M. Lefrançois, A. Zimmermann, N. Bakerally, A SPARQL extension for generating RDF
from heterogeneous formats, in: European Semantic Web Conference, Springer, 2017, pp.
35–50.
[4] E. Daga, L. Asprino, P. Mulholland, A. Gangemi, Facade-X: an opinionated approach to
SPARQL anything, Studies on the Semantic Web 53 (2021) 58–73.
[5] E. Iglesias, S. Jozashoori, D. Chaves-Fraga, D. Collarana, M.-E. Vidal, SDM-RDFizer: An
RML Interpreter for the Efficient Creation of RDF Knowledge Graphs, in: Proceedings of
the 29th ACM International Conference on Information & Knowledge Management, 2020,
pp. 3039–3046.
[6] S. Jozashoori, D. Chaves-Fraga, E. Iglesias, M.-E. Vidal, O. Corcho, Funmap: Efficient
execution of functional mappings for knowledge graph creation, in: International Semantic
Web Conference, Springer, 2020, pp. 276–293.
[7] L. F. d. Medeiros, F. Priyatna, O. Corcho, MIRROR: Automatic R2RML mapping generation
from relational databases, in: International Conference on Web Engineering, Springer,
2015, pp. 326–343.
[8] C. Bizer, A. Seaborne, D2RQ-treating non-RDF databases as virtual RDF graphs, in:
Proceedings of the 3rd international semantic web conference (ISWC2004), volume 2004,
Springer Hiroshima, 2004.
[9] D. Calvanese, B. Cogrel, S. Komla-Ebri, R. Kontchakov, D. Lanti, M. Rezk, M. Rodriguez-
Muro, G. Xiao, Ontop: Answering SPARQL queries over relational databases, Semantic
Web 8 (2017) 471–487.
[10] Á. Sicilia, G. Nemirovski, AutoMap4OBDA: Automated generation of R2RML mappings
for OBDA, in: European Knowledge Acquisition Workshop, Springer, 2016, pp. 577–592.
[11] E. Jiménez-Ruiz, E. Kharlamov, D. Zheleznyakov, I. Horrocks, C. Pinkel, M. G. Skjæveland,
E. Thorstensen, J. Mora, Bootox: Bootstrapping OWL 2 ontologies and R2RML mappings
from relational databases, in: International Semantic Web Conference (P&D), 2015.
[12] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, Semtab 2019: Re-
sources to benchmark tabular data to knowledge graph matching systems, in: European
Semantic Web Conference, Springer, 2020, pp. 514–530.
[13] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, H. Takeda, SemTab 2021: Tabular
Data Annotation with MTab Tool, SemTab@ ISWC (2021) 92–101.
[14] N. Abdelmageed, S. Schindler, JenTab Meets SemTab 2021’s New Challenges, in: SemTab@
ISWC, 2021, pp. 42–53.
[15] V.-P. Huynh, J. Liu, Y. Chabot, F. Deuzé, T. Labbé, P. Monnin, R. Troncy, DAGOBAH: Table
and Graph Contexts For Efficient Semantic Annotation Of Tabular Data, in: SemTab@
ISWC, 2021, pp. 19–31.
[16] B. De Meester, T. Seymoens, A. Dimou, R. Verborgh, Implementation-independent function
reuse, Future Generation Computer Systems 110 (2020) 946–959.
[17] R. Avogadro, M. Cremaschi, MantisTable V: a novel and efficient approach to Semantic
Table Interpretation, SemTab@ ISWC (2021) 79–91.
[18] M. Cremaschi, F. De Paoli, A. Rula, B. Spahiu, A fully automated approach to a complete
semantic table interpretation, Future Generation Computer Systems 112 (2020) 478–500.
[19] W. E. Winkler, String comparator metrics and enhanced decision rules in the Fellegi-Sunter
model of record linkage (1990).
[20] S. Jozashoori, A. Sakor, E. Iglesias, M.-E. Vidal, EABlock: A Declarative Entity Alignment
Block for Knowledge Graph Creation Pipelines, in: Proceedings of the 37th ACM/SIGAPP
Symposium On Applied Computing, 2022.
[21] V. I. Levenshtein, et al., Binary codes capable of correcting deletions, insertions, and
reversals, in: Soviet physics doklady, volume 10, Soviet Union, 1966, pp. 707–710.
[22] G. Haesendonck, W. Maroy, P. Heyvaert, R. Verborgh, A. Dimou, Parallel RDF generation
from heterogeneous big data, in: Proceedings of the International Workshop on Semantic
Big Data, 2019, pp. 1–6.
[23] J. Arenas-Guerrero, D. Chaves-Fraga, J. Toledo, M. S. Pérez, O. Corcho, Morph-kgc: Scalable
knowledge graph materialization with mapping partitions, Semantic Web (2022).