=Paper=
{{Paper
|id=Vol-3632/ISWC2023_paper_412
|storemode=property
|title=Towards Algebraic Mapping Operators for Knowledge Graph Construction
|pdfUrl=https://ceur-ws.org/Vol-3632/ISWC2023_paper_412.pdf
|volume=Vol-3632
|authors=Sitt Min Oo,Ben De Meester,Ruben Taelman,Pieter Colpaert
|dblpUrl=https://dblp.org/rec/conf/semweb/OoMTC23
}}
==Towards Algebraic Mapping Operators for Knowledge Graph Construction==
Towards Algebraic Mapping Operators for
Knowledge Graph Construction
Sitt Min Oo1,∗ , Ben De Meester1 , Ruben Taelman1 and Pieter Colpaert1
1
IDLab, Department of Electronics and Information Systems, Ghent University – imec, Technologiepark-Zwijnaarde 122,
9052 Ghent, Belgium
Abstract
Declarative knowledge graph construction has matured to the point where state of the art techniques are
focusing on optimizing the mapping processes. However, these optimization techniques use the syntax
of the mapping language without considering the impact of the semantics. As a result, it is difficult
to compare different engines fairly due to the obscurity in their semantic differences. In this poster
paper, we propose an initial set of algebraic mapping operators to define the operational semantics of
mapping processes, and provide a first step towards a theoretical foundation for mapping languages.
We translated a series of RML documents to algebraic mapping operators to show the feasibility of our
approach. We believe that further pursuing these initial results will lead to greater interoperability of
mapping engines and languages, intensify requirements analysis for the upcoming RML standardization
work, and an improved developer experience for all current and future mapping engines.
1. Introduction
Several mapping engines exist to generate RDF Knowledge Graphs (KG) from heterogeneous
data sources [1, 2, 3]. Each mapping engine has its own operational semantics depending
on the software architecture and the mapping language it supports. This leads to redundant
implementation of similar operations and incompatibility with the other engines, especially
in terms of optimization techniques. For example, SDM-RDFizer [1] relies on Triples Maps (an
RML concept [4]) to optimize deduplication and joins, which is incompatible with SPARQL-
Generate [3] where SPARQL is being used (i.e. no notion of Triples Maps).
In the domain of knowledge graph querying, algebraic operators form the foundation for
the query semantics through formalization [5]. Semantic formalization enables a) execution
consistency across different query engine implementations, b) identification of redundant and
contradicting notions, c) analysis of complexity and expressiveness, and d) more portable
algorithms, enabling easy inheritance from existing algorithms with similar semantics. Thus,
having an equivalent set of algebraic operators for the mapping process will lay the foundation
to formalize the mapping process, clarifying the operational semantics and improving the
ISWC 2023 Posters and Demos: 22nd International Semantic Web Conference, November 6–10, 2023, Athens, Greece
∗
Corresponding author.
Envelope-Open x.sittminoo@ugent.be (S. M. Oo); ben.demeester@ugent.be (B. D. Meester); ruben.taelman@ugent.be
(R. Taelman); pieter.colpaert@ugent.be (P. Colpaert)
Orcid 0000-0002-0877-7063 (S. M. Oo); 0000-0001-7116-9338 (B. D. Meester); 0000-0002-9421-8566 (R. Taelman);
0000-0002-9421-8566 (P. Colpaert)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073
CEUR Workshop Proceedings (CEUR-WS.org)
1
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
Sitt Min Oo et al. CEUR Workshop Proceedings 1–5
interoperability of mapping engines.
There are solutions which provide an ontology for representing the different mapping lan-
guages [6] or provide a language-independent template for RDF knowledge graph construc-
tion [7] to increase interoperability between mapping engines. Nonetheless, the aforementioned
solutions do not provide a theoretical foundation for generic mapping languages since they
capture the language syntax instead of the semantics.
In this poster paper, we introduce an initial set of algebraic mapping operators and define
their semantics. We apply this initial set to RML. We translated a series of RML documents to
algebraic mapping operators to validate our approach.
2. Definition
We first introduce terminologies. For our initial work, we reuse some of the terminologies from
SPARQL algebra, allowing us to align our mapping algebra with SPARQL algebra in the future
to study expressiveness. We can reuse following definitions. A solution mapping 𝜇 is a partial
function mapping from 𝑉, a set of variables, to Τ, a set of terms, provided 𝑉 ∩ Τ = ∅. Τ = 𝐼 ∪ 𝐵 ∪ 𝐿
where 𝐼, 𝐵, and 𝐿 are disjoint, infinite sets of IRIs, blank nodes, and literals respectively.
Mapping languages enable users to fragment the generated data into different data sinks
(e.g. multiple files or web sockets). In order to future proof the mapping algebra, we need to
introduce the fragment. A fragment, 𝑓 ∈ 𝐹, is a grouping of the multiset of solution mappings.
It can be seen as a generic sink: a file, a database, or a logical fragment such as a specific social
context, e.g. information about a person only known by friends.
A mapping tuple 𝜔 is the core of our mapping algebra; a partial function which maps fragments
to multiset of solution mappings: 𝜔 ∶ 𝐹 → Ω with Ω a multiset of solution mappings. A multiset
of mapping tuples is 𝜉. This mapping of fragments to multiset of solution mappings enable
us to group solution mappings based on some abstract concept. For example, we could have
mapping tuples where solution mappings are grouped according to some social context (e.g.
personal information and friend’s information) (Table 1). Currently, grouping solution mappings
according to fragments can not be achieved with SPARQL’s definition of group algebra.
Table 1
Two mapping tuples describing information related to John. The tuples are fragmented according to
personal information about John and information about John’s friends.
Multiset of Solution Mappings
Fragment Solution Mapping ?name ?age ?email
𝑓𝑝𝑒𝑟𝑠𝑜𝑛𝑎𝑙 𝜇1 John Doe 23 john.doe@example.com
𝜇2 Susan Sue 25 susan.sue@example.com
𝑓𝑓 𝑟𝑖𝑒𝑛𝑑𝑠
𝜇3 Alice Joe 26 alice.joe@example.com
We now define the initial set of algebraic mapping operators computing on 𝜉: Source, Project,
Extend, and Serialize. We do not yet include the fragmentation operator where 𝜉 could be
recursively fragmented nor the join operator. The algebraic mapping operators take 𝜉 as input
2
Sitt Min Oo et al. CEUR Workshop Proceedings 1–5
and produces 𝜉 as output unless otherwise stated.
Source operator The Source operator is needed to generate the mapping tuples from hetero-
geneous data sources used in the downstream operators for mapping. The source operator is
the leaf node operator in the mapping plan and does not have 𝜉 as input.
Given a configuration, 𝐶, a source operator generates a multi-set, 𝜉, of mapping tuples, 𝜔’s,
where a default fragment 𝑓0 is mapped to a multiset of solution mappings Ω. 𝜇 ∈ Ω is generated
by flattening the data records which are derived by iterating over the data source. For example,
𝜇 derived from a CSV row is a partial function from the headers of the CSV to the corresponding
data values in the CSV row. We define:
𝑓0 = a default fragment
Ω = {𝜇 ∣ 𝜇 = flattened data record} (1)
Source(C) = {𝜔 ∣ 𝜔 ∶ 𝑓0 → Ω}
Project operator The Project operator restricts the variables in the solution mapping, needed
to efficiently process the mapping tuples. For example, RML’s single iteration of CSV contains
all columns for a data record, but implicitly projects the required references for the mapping
process. It is similar to the SPARQL algebra counterpart. Let {𝑎0 , … , 𝑎𝑛 } ∈ 𝑃, be a set of projection
attributes. We define:
Project(𝜇, P) = 𝜇 restricted to attribute variables in P
Project(Ω, P) = {Project(𝜇, P) ∣ 𝜇 ∈ Ω}
(2)
Project(𝜔, P) = 𝑓 → Project(Ω, P)
Project(𝜉, P) = {Project(𝜔, P) ∣ 𝜔 ∈ 𝜉 }
Extend operator The Extend operator derives new attributes for a solution mapping. For
example, to include body mass index (BMI) of a person in the output, we derive the BMI from
the height and weight attributes of a person. In RML, this is equivalent to the template and
constant Term Maps, where new values, not existing in the data records, are generated. The
Extend operator derives a value, by executing an expression 𝑒𝑥𝑝𝑟 on the solution mapping,
and coupled it to new variable 𝑣 not in the domain of the solution mapping. If evaluating the
expression causes an error and the variable is not in the domain of the solution mapping, the
extend operator behaves like an identity operator. It is undefined if the variable restriction is
violated. We define:
Extend(𝜇, v, expr) = 𝜇 ∪ {(𝑣, 𝑣𝑎𝑙𝑢𝑒) ∣ 𝑣 ∉ dom(𝜇) and value = expr(𝜇)}
Extend(Ω, v, expr) = {Extend(𝜇, v, expr) ∣ 𝜇 ∈ Ω}
(3)
Extend(𝜔, v, expr) = 𝑓 → Extend(Ω, v, expr)
Extend(𝜉, v, expr) = {Extend(𝜔, v, expr) ∣ 𝜔 ∈ 𝜉 }
Serialize operator The Serialize operator serializes the mapping tuples into the specified
format. This is the core functionality of mapping engines: to generate data in a specific format
3
Sitt Min Oo et al. CEUR Workshop Proceedings 1–5
from some data. For example, RML defines the data format implicitly using the Term Maps. The
Serialize operator is the root node operator.
The Serialize operator generates data in a specific data format by replacing query variables
in the template with the values from the input solution mappings. Each solution mapping
generates one data item in the specified format. Given a string template 𝜏. We define:
Serialize(Ω,𝜏) = {𝜏 (𝜇) ∣ 𝜇 ∈ Ω, 𝜏 (𝜇) = variables in 𝜏 substituted with 𝜇}
Serialize(𝜔, 𝜏) = 𝑓 → Serialize(Ω, 𝜏) (4)
Serialize(𝜉, 𝜏) = {Serialize(𝜔, 𝜏) ∣ 𝜔 ∈ 𝜉 }
3. Preliminary Results
We implemented these proposed semantics in a proof-of-concept algebra interpreter for RML
mapping rules: https://github.com/s-minoo/meamer-rs . We translated several RML documents
(without joins) to a tree of mapping algebraic operators: a mapping plan. This provides an
initial validation of our proposed semantics, and showcases its potential: multiple mapping
plans of different complexity are translated, allowing for inspection and optimization proposals.
4. Conclusion
This poster paper presents an initial set of algebraic mapping operators (Source, Project, Extend,
Serialize) and proof-of-concept implementation which can already be used to describe a subset
of mapping processes represented in RML. The generated mapping plans show the potential
to define optimization rules based on the semantics and not syntax of the mapping language.
Furthermore, working with algebraic mapping operators enables us to “rewrite” the mapping
plan, generated using the algebraic mapping operators, to optimize the mapping process. For
example, we could push the Projection operator close to the Source operator, to filter out
unnecessary data unused in the output, to reduce memory usage during the knowledge graph
construction.
As future work, we plan to extend and refine the algebraic operators, to be used as a theoretical
foundation for mapping languages. We plan to provide a generic mapping framework: the
reference implementation of these algebraic operators. Users could use this framework to easily
create a mapping engine using their choice of mapping language. Finally, we plan to conduct
an empirical study on the existing mapping optimization techniques and translate them to
optimization rules using these algebraic operators.
Acknowledgments
The described research activities were supported by SolidLab Vlaanderen (Flemish Government,
EWI and RRF project VV023/10), and the imec ICON project BoB (Agentschap Innoveren en
Ondernemen project nr. HBC.2021.0658).
4
Sitt Min Oo et al. CEUR Workshop Proceedings 1–5
References
[1] E. Iglesias, S. Jozashoori, M.-E. Vidal, Scaling up knowledge graph creation to large and
heterogeneous data sources, Journal of Web Semantics (2023).
[2] Sitt Min Oo, G. Haesendonck, B. De Meester, A. Dimou, RMLStreamer-SISO: An RDF Stream
Generator from Streaming Heterogeneous Data, in: The Semantic Web – ISWC, 2022.
[3] M. Lefrançois, A. Zimmermann, N. Bakerally, A SPARQL extension for generating RDF
from heterogeneous formats, in: The Semantic Web – ISWC, 2017.
[4] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van de Walle, RML:
A Generic Language for Integrated RDF Mappings of Heterogeneous Data, in: LDOW, 2014.
URL: http://ceur-ws.org/Vol-1184/ldow2014_paper_01.pdf.
[5] J. Pérez, M. Arenas, C. Gutierrez, Semantics and complexity of SPARQL, ACM Trans.
Database Syst. (2009).
[6] A. Iglesias-Molina, A. Cimmino, E. Ruckhaus, D. Chaves-Fraga, R. García-Castro, O. Corcho,
An ontological approach for representing declarative mapping languages, Semantic Web
Journal (2022).
[7] A. Iglesias-Molina, D. Chaves-Fraga, F. Priyatna, O. Corcho, Towards the definition of a
language-independent mapping template for knowledge graph creation, in: Third Interna-
tional Workshop on Capturing Scientific Knowledge (SciKnow19), 2019.
5