Towards Algebraic Mapping Operators for Knowledge Graph Construction

Sitt Min Oo1,∗, Ben De Meester1, Ruben Taelman1 and Pieter Colpaert1

1 IDLab, Department of Electronics and Information Systems, Ghent University – imec, Technologiepark-Zwijnaarde 122, 9052 Ghent, Belgium

Abstract
Declarative knowledge graph construction has matured to the point where state-of-the-art techniques focus on optimizing the mapping process. However, these optimization techniques rely on the syntax of the mapping language without considering the impact of its semantics. As a result, it is difficult to compare different engines fairly, because their semantic differences remain obscure. In this poster paper, we propose an initial set of algebraic mapping operators to define the operational semantics of mapping processes, and provide a first step towards a theoretical foundation for mapping languages. We translated a series of RML documents to algebraic mapping operators to show the feasibility of our approach. We believe that further pursuing these initial results will lead to greater interoperability of mapping engines and languages, a more intensive requirements analysis for the upcoming RML standardization work, and an improved developer experience for all current and future mapping engines.

1. Introduction
Several mapping engines exist to generate RDF Knowledge Graphs (KG) from heterogeneous data sources [1, 2, 3]. Each mapping engine has its own operational semantics, depending on its software architecture and the mapping language it supports. This leads to redundant implementations of similar operations and to incompatibility with other engines, especially in terms of optimization techniques. For example, SDM-RDFizer [1] relies on Triples Maps (an RML concept [4]) to optimize deduplication and joins, which is incompatible with SPARQL-Generate [3], where SPARQL is used instead (i.e. there is no notion of Triples Maps).
In the domain of knowledge graph querying, algebraic operators form the foundation of the query semantics through formalization [5]. Semantic formalization enables a) execution consistency across different query engine implementations, b) identification of redundant and contradicting notions, c) analysis of complexity and expressiveness, and d) more portable algorithms, since existing algorithms with similar semantics can be inherited easily. Thus, an equivalent set of algebraic operators for the mapping process will lay the foundation to formalize that process, clarifying the operational semantics and improving the interoperability of mapping engines.

ISWC 2023 Posters and Demos: 22nd International Semantic Web Conference, November 6–10, 2023, Athens, Greece
∗ Corresponding author.
Email: x.sittminoo@ugent.be (S. M. Oo); ben.demeester@ugent.be (B. D. Meester); ruben.taelman@ugent.be (R. Taelman); pieter.colpaert@ugent.be (P. Colpaert)
ORCID: 0000-0002-0877-7063 (S. M. Oo); 0000-0001-7116-9338 (B. D. Meester); 0000-0002-9421-8566 (R. Taelman); 0000-0002-9421-8566 (P. Colpaert)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

There are solutions which provide an ontology for representing the different mapping languages [6] or provide a language-independent template for RDF knowledge graph construction [7] to increase interoperability between mapping engines. Nonetheless, these solutions do not provide a theoretical foundation for generic mapping languages, since they capture the syntax of the language instead of its semantics. In this poster paper, we introduce an initial set of algebraic mapping operators and define their semantics.
We apply this initial set to RML: we translated a series of RML documents to algebraic mapping operators to validate our approach.

2. Definition
We first introduce the terminology. For our initial work, we reuse some terminology from the SPARQL algebra, which allows us to align our mapping algebra with the SPARQL algebra in the future to study expressiveness. We reuse the following definitions. A solution mapping $\mu$ is a partial function from $V$, a set of variables, to $T$, a set of terms, provided $V \cap T = \emptyset$. $T = I \cup B \cup L$, where $I$, $B$, and $L$ are disjoint, infinite sets of IRIs, blank nodes, and literals respectively.

Mapping languages enable users to fragment the generated data into different data sinks (e.g. multiple files or web sockets). To future-proof the mapping algebra, we therefore introduce the fragment. A fragment, $f \in F$, is a grouping of the multiset of solution mappings. It can be seen as a generic sink: a file, a database, or a logical fragment such as a specific social context, e.g. information about a person only known by friends.

A mapping tuple $\omega$ is the core of our mapping algebra: a partial function which maps fragments to multisets of solution mappings, $\omega : F \to \Omega$, with $\Omega$ a multiset of solution mappings. A multiset of mapping tuples is denoted $\xi$. This mapping of fragments to multisets of solution mappings enables us to group solution mappings based on some abstract concept. For example, we could have mapping tuples where solution mappings are grouped according to some social context (e.g. personal information and friends' information) (Table 1). Currently, grouping solution mappings according to fragments cannot be achieved with SPARQL's definition of the group algebra.

Table 1: Two mapping tuples describing information related to John. The tuples are fragmented according to personal information about John and information about John's friends.
Fragment        | Multiset of Solution Mappings
                | Solution Mapping | ?name     | ?age | ?email
----------------+------------------+-----------+------+----------------------
$f_{personal}$  | $\mu_1$          | John Doe  | 23   | john.doe@example.com
                | $\mu_2$          | Susan Sue | 25   | susan.sue@example.com
$f_{friends}$   | $\mu_3$          | Alice Joe | 26   | alice.joe@example.com

We now define the initial set of algebraic mapping operators computing on $\xi$: Source, Project, Extend, and Serialize. We do not yet include the join operator, nor the fragmentation operator with which $\xi$ could be recursively fragmented. The algebraic mapping operators take $\xi$ as input and produce $\xi$ as output unless otherwise stated.

Source operator. The Source operator generates, from heterogeneous data sources, the mapping tuples that the downstream operators consume. It is the leaf node operator in the mapping plan and does not have $\xi$ as input. Given a configuration $C$, a Source operator generates a multiset $\xi$ of mapping tuples $\omega$, in which a default fragment $f_0$ is mapped to a multiset of solution mappings $\Omega$. Each $\mu \in \Omega$ is generated by flattening a data record obtained by iterating over the data source. For example, a $\mu$ derived from a CSV row is a partial function from the headers of the CSV to the corresponding data values in that row. We define:

\begin{equation}
\begin{aligned}
f_0 &= \text{a default fragment} \\
\Omega &= \{\mu \mid \mu = \text{flattened data record}\} \\
\mathrm{Source}(C) &= \{\omega \mid \omega : f_0 \to \Omega\}
\end{aligned}
\tag{1}
\end{equation}

Project operator. The Project operator restricts the variables in a solution mapping, which is needed to process the mapping tuples efficiently. For example, a single RML iteration over a CSV file yields all columns of a data record, but RML implicitly projects only the references required for the mapping process. The operator is similar to its SPARQL algebra counterpart. Let $P = \{a_0, \dots, a_n\}$ be a set of projection attributes.
We define:

\begin{equation}
\begin{aligned}
\mathrm{Project}(\mu, P) &= \mu \text{ restricted to the attribute variables in } P \\
\mathrm{Project}(\Omega, P) &= \{\mathrm{Project}(\mu, P) \mid \mu \in \Omega\} \\
\mathrm{Project}(\omega, P) &= f \to \mathrm{Project}(\Omega, P) \\
\mathrm{Project}(\xi, P) &= \{\mathrm{Project}(\omega, P) \mid \omega \in \xi\}
\end{aligned}
\tag{2}
\end{equation}

Extend operator. The Extend operator derives new attributes for a solution mapping. For example, to include the body mass index (BMI) of a person in the output, we derive the BMI from the height and weight attributes of that person. In RML, this is equivalent to the template and constant Term Maps, where new values, not existing in the data records, are generated. The Extend operator derives a value by evaluating an expression $\mathit{expr}$ on the solution mapping, and couples it to a new variable $v$ that is not in the domain of the solution mapping. If evaluating the expression causes an error and the variable is not in the domain of the solution mapping, the Extend operator behaves like an identity operator. It is undefined if the variable restriction is violated. We define:

\begin{equation}
\begin{aligned}
\mathrm{Extend}(\mu, v, \mathit{expr}) &= \mu \cup \{(v, \mathit{value}) \mid v \notin \mathrm{dom}(\mu) \text{ and } \mathit{value} = \mathit{expr}(\mu)\} \\
\mathrm{Extend}(\Omega, v, \mathit{expr}) &= \{\mathrm{Extend}(\mu, v, \mathit{expr}) \mid \mu \in \Omega\} \\
\mathrm{Extend}(\omega, v, \mathit{expr}) &= f \to \mathrm{Extend}(\Omega, v, \mathit{expr}) \\
\mathrm{Extend}(\xi, v, \mathit{expr}) &= \{\mathrm{Extend}(\omega, v, \mathit{expr}) \mid \omega \in \xi\}
\end{aligned}
\tag{3}
\end{equation}

Serialize operator. The Serialize operator serializes the mapping tuples into the specified format. This is the core functionality of mapping engines: to generate data in a specific format from some data. For example, RML defines the data format implicitly using the Term Maps. The Serialize operator is the root node operator. It generates data in a specific data format by replacing the query variables in a template with the values from the input solution mappings. Each solution mapping generates one data item in the specified format. Given a string template $\tau$, we define:

\begin{equation}
\begin{aligned}
\mathrm{Serialize}(\Omega, \tau) &= \{\tau(\mu) \mid \mu \in \Omega,\ \tau(\mu) = \text{variables in } \tau \text{ substituted with } \mu\} \\
\mathrm{Serialize}(\omega, \tau) &= f \to \mathrm{Serialize}(\Omega, \tau) \\
\mathrm{Serialize}(\xi, \tau) &= \{\mathrm{Serialize}(\omega, \tau) \mid \omega \in \xi\}
\end{aligned}
\tag{4}
\end{equation}

3.
Preliminary Results
We implemented the proposed semantics in a proof-of-concept algebra interpreter for RML mapping rules: https://github.com/s-minoo/meamer-rs. We translated several RML documents (without joins) to a tree of algebraic mapping operators: a mapping plan. This provides an initial validation of our proposed semantics and showcases its potential: mapping plans of different complexity can be generated, which allows them to be inspected and optimizations to be proposed.

4. Conclusion
This poster paper presents an initial set of algebraic mapping operators (Source, Project, Extend, Serialize) and a proof-of-concept implementation, which can already be used to describe a subset of the mapping processes represented in RML. The generated mapping plans show the potential to define optimization rules based on the semantics, and not the syntax, of the mapping language. Furthermore, working with algebraic mapping operators enables us to "rewrite" a mapping plan to optimize the mapping process. For example, we could push the Project operator close to the Source operator to filter out data that is unused in the output, reducing memory usage during knowledge graph construction.

As future work, we plan to extend and refine the algebraic operators so that they can be used as a theoretical foundation for mapping languages. We plan to provide a generic mapping framework: the reference implementation of these algebraic operators. Users could use this framework to easily create a mapping engine for their mapping language of choice. Finally, we plan to conduct an empirical study of the existing mapping optimization techniques and translate them to optimization rules using these algebraic operators.

Acknowledgments
The described research activities were supported by SolidLab Vlaanderen (Flemish Government, EWI and RRF project VV023/10), and the imec ICON project BoB (Agentschap Innoveren en Ondernemen project nr. HBC.2021.0658).
References
[1] E. Iglesias, S. Jozashoori, M.-E. Vidal, Scaling up knowledge graph creation to large and heterogeneous data sources, Journal of Web Semantics (2023).
[2] S. M. Oo, G. Haesendonck, B. De Meester, A. Dimou, RMLStreamer-SISO: An RDF Stream Generator from Streaming Heterogeneous Data, in: The Semantic Web – ISWC, 2022.
[3] M. Lefrançois, A. Zimmermann, N. Bakerally, A SPARQL extension for generating RDF from heterogeneous formats, in: The Semantic Web – ISWC, 2017.
[4] A. Dimou, M. Vander Sande, P. Colpaert, R. Verborgh, E. Mannens, R. Van de Walle, RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data, in: LDOW, 2014. URL: http://ceur-ws.org/Vol-1184/ldow2014_paper_01.pdf.
[5] J. Pérez, M. Arenas, C. Gutierrez, Semantics and complexity of SPARQL, ACM Trans. Database Syst. (2009).
[6] A. Iglesias-Molina, A. Cimmino, E. Ruckhaus, D. Chaves-Fraga, R. García-Castro, O. Corcho, An ontological approach for representing declarative mapping languages, Semantic Web Journal (2022).
[7] A. Iglesias-Molina, D. Chaves-Fraga, F. Priyatna, O. Corcho, Towards the definition of a language-independent mapping template for knowledge graph creation, in: Third International Workshop on Capturing Scientific Knowledge (SciKnow19), 2019.
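As an illustration of the four operators defined in Section 2 (Source, Project, Extend, Serialize), the following is a minimal Python sketch of a small mapping plan. It is not the meamer-rs implementation, and several choices are assumptions made only for this sketch: a solution mapping is encoded as a `dict`, a multiset as a `list`, and a mapping tuple as a `dict` from a fragment name to its list of solution mappings; the template marks variables as `?name`; the case that the paper leaves undefined (extending with an already bound variable) raises an error; and the sample records, attribute names, and `example.com` IRIs are hypothetical.

```python
import re
from typing import Any, Callable, Dict, List, Set

# Hypothetical encodings used only in this sketch.
Mu = Dict[str, Any]              # solution mapping: variable -> value
Omega = List[Mu]                 # multiset of solution mappings
MappingTuple = Dict[str, Omega]  # mapping tuple: fragment -> Omega
Xi = List[MappingTuple]          # multiset of mapping tuples

DEFAULT_FRAGMENT = "f0"


def source(records: List[Mu]) -> Xi:
    """Source: map the default fragment to the flattened data records."""
    return [{DEFAULT_FRAGMENT: [dict(r) for r in records]}]


def project(xi: Xi, attrs: Set[str]) -> Xi:
    """Project: restrict every solution mapping to the variables in attrs."""
    return [
        {f: [{v: x for v, x in mu.items() if v in attrs} for mu in omega]
         for f, omega in w.items()}
        for w in xi
    ]


def extend_mu(mu: Mu, var: str, expr: Callable[[Mu], Any]) -> Mu:
    """Extend a single solution mapping with var = expr(mu)."""
    if var in mu:
        # Undefined in the paper; raising is an assumption of this sketch.
        raise ValueError(f"variable {var!r} is already bound")
    try:
        value = expr(mu)
    except Exception:
        return dict(mu)  # expression error: behave like the identity
    return {**mu, var: value}


def extend(xi: Xi, var: str, expr: Callable[[Mu], Any]) -> Xi:
    """Extend: apply extend_mu to every solution mapping in every fragment."""
    return [
        {f: [extend_mu(mu, var, expr) for mu in omega] for f, omega in w.items()}
        for w in xi
    ]


def serialize(xi: Xi, template: str) -> List[Dict[str, List[str]]]:
    """Serialize: substitute each ?variable in the template with its value.

    A variable that is unbound in a solution mapping raises KeyError here;
    the paper does not specify this case.
    """
    def fill(mu: Mu) -> str:
        return re.sub(r"\?(\w+)", lambda m: str(mu[m.group(1)]), template)

    return [{f: [fill(mu) for mu in omega] for f, omega in w.items()} for w in xi]


# A small mapping plan: Source -> Project -> Extend -> Serialize.
records = [
    {"id": "john", "height": 2.0, "weight": 80.0, "email": "john.doe@example.com"},
    {"id": "alice", "height": 1.5, "weight": 54.0, "email": "alice.joe@example.com"},
]
xi = source(records)
xi = project(xi, {"id", "height", "weight"})  # drops the unused email column
xi = extend(xi, "bmi", lambda mu: mu["weight"] / mu["height"] ** 2)
out = serialize(xi, '<http://example.com/person/?id> <http://example.com/bmi> "?bmi" .')
print(out)
```

In this plan the Project step is placed directly after the Source step, so the unused `email` attribute is dropped before any further processing, in the spirit of the projection rewrite mentioned in the conclusion.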