<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Operators for Knowledge Graph Construction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sitt Min Oo</string-name>
          <email>x.sittminoo@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ben De Meester</string-name>
          <email>ben.demeester@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Taelman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pieter Colpaert</string-name>
          <email>pieter.colpaert@ugent.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>9052 Ghent</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IDLab, Department of Electronics and Information Systems, Ghent University - imec</institution>
          ,
          <addr-line>Technologiepark-Zwijnaarde 122</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Declarative knowledge graph construction has matured to the point where state-of-the-art techniques focus on optimizing the mapping processes. However, these optimization techniques use the syntax of the mapping language without considering the impact of the semantics. As a result, it is difficult to compare different engines fairly, because their semantic differences are obscured. In this poster paper, we propose an initial set of algebraic mapping operators to define the operational semantics of mapping processes, and provide a first step towards a theoretical foundation for mapping languages. We translated a series of RML documents to algebraic mapping operators to show the feasibility of our approach. We believe that further pursuing these initial results will lead to greater interoperability of mapping engines and languages, more intensive requirements analysis for the upcoming RML standardization work, and an improved developer experience for all current and future mapping engines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Several mapping engines exist to generate RDF Knowledge Graphs (KG) from heterogeneous
data sources [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Each mapping engine has its own operational semantics depending
on the software architecture and the mapping language it supports. This leads to redundant
implementation of similar operations and incompatibility with the other engines, especially
in terms of optimization techniques. For example, SDM-RDFizer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] relies on Triples Maps (an
RML concept [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) to optimize deduplication and joins, which is incompatible with
SPARQLGenerate [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] where SPARQL is being used (i.e. no notion of Triples Maps).
      </p>
      <p>
        In the domain of knowledge graph querying, algebraic operators form the foundation for
the query semantics through formalization [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Semantic formalization enables a) execution
consistency across different query engine implementations, b) identification of redundant and
contradicting notions, c) analysis of complexity and expressiveness, and d) more portable
algorithms, enabling easy inheritance from existing algorithms with similar semantics. Thus,
having an equivalent set of algebraic operators for the mapping process will lay the foundation
to formalize the mapping process, clarifying the operational semantics and improving the
interoperability of mapping engines.
      </p>
      <p>
        There are solutions which provide an ontology for representing the different mapping
languages [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or provide a language-independent template for RDF knowledge graph
construction [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to increase interoperability between mapping engines. Nonetheless, the aforementioned
solutions do not provide a theoretical foundation for generic mapping languages since they
capture the language syntax instead of the semantics.
      </p>
      <p>In this poster paper, we introduce an initial set of algebraic mapping operators and define
their semantics. We apply this initial set to RML. We translated a series of RML documents to
algebraic mapping operators to validate our approach.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Definition</title>
      <p>We first introduce terminology. For our initial work, we reuse some of the terminology of
SPARQL algebra, allowing us to align our mapping algebra with SPARQL algebra in the future
to study expressiveness. We reuse the following definitions. A solution mapping μ is a partial
function from V, a set of variables, to T, a set of terms, provided V ∩ T = ∅. T = I ∪ B ∪ L,
where I, B, and L are disjoint, infinite sets of IRIs, blank nodes, and literals respectively.</p>
      <p>Mapping languages enable users to fragment the generated data into different data sinks
(e.g. multiple files or web sockets). To future-proof the mapping algebra, we
introduce the fragment. A fragment, f ∈ F, is a grouping of the multiset of solution mappings.
It can be seen as a generic sink: a file, a database, or a logical fragment such as a specific social
context, e.g. information about a person only known by friends.</p>
      <p>A mapping tuple t is the core of our mapping algebra: a partial function which maps fragments
to multisets of solution mappings, t ∶ F → Ω, with Ω a multiset of solution mappings. A multiset
of mapping tuples is denoted M. This mapping of fragments to multisets of solution mappings enables
us to group solution mappings based on some abstract concept. For example, we could have
mapping tuples where solution mappings are grouped according to some social context (e.g.
personal information and friend’s information) (Table 1). Currently, grouping solution mappings
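according to fragments cannot be achieved with SPARQL’s definition of group algebra.
      </p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <tbody>
            <tr>
              <td>1</td>
              <td>John Doe 23</td>
              <td>john.doe@example.com</td>
            </tr>
            <tr>
              <td>2</td>
              <td>Susan Sue 25</td>
              <td>susan.sue@example.com</td>
            </tr>
            <tr>
              <td>3</td>
              <td>Alice Joe 26</td>
              <td>alice.joe@example.com</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-2">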
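        <p>The data model described above can be illustrated with a minimal Python sketch. It is not taken from the proof-of-concept implementation: plain strings stand in for terms, lists stand in for multisets, and the fragment names ("personal", "friends"), the variable names ("name", "age", "email"), and the assignment of the Table 1 values to fragments are all assumptions made for illustration.</p>
        <preformat>
```python
from typing import Dict, List

# A solution mapping: a partial function from variables to terms.
# Assumption: plain strings stand in for IRIs, blank nodes, and literals.
SolutionMapping = Dict[str, str]

# A mapping tuple: a partial function from fragments to multisets of
# solution mappings. Lists stand in for multisets, so duplicates are kept.
MappingTuple = Dict[str, List[SolutionMapping]]

# Hypothetical fragment and variable names, loosely following Table 1.
mapping_tuple: MappingTuple = {
    "personal": [{"name": "John Doe", "age": "23"}],
    "friends": [{"email": "john.doe@example.com"}],
}


def solutions_in(mt: MappingTuple, fragment: str) -> List[SolutionMapping]:
    """Return the solution mappings of a fragment.

    The mapping tuple is a partial function, so an unknown fragment
    yields an empty multiset instead of an error.
    """
    return mt.get(fragment, [])
```
        </preformat>
        <p>Here, solutions_in(mapping_tuple, "friends") returns only the solution mappings grouped under the hypothetical "friends" fragment, which is the kind of grouping by social context that the mapping tuple is meant to express.</p>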
        <p>We now define the initial set of algebraic mapping operators computing on M: Source, Project,
Extend, and Serialize. We do not yet include the fragmentation operator, where f could be
recursively fragmented, nor the join operator. The algebraic mapping operators take M as input
and produce M as output unless otherwise stated.</p>
        <p>Source operator. The Source operator generates, from heterogeneous data sources, the mapping
tuples used by the downstream operators for mapping. The Source operator is
the leaf node operator in the mapping plan and does not have M as input.</p>
        <p>Given a configuration c, a Source operator generates a multiset M of mapping tuples t,
where a default fragment f<sub>0</sub> is mapped to a multiset of solution mappings Ω. Each μ ∈ Ω is generated
by flattening the data records which are derived by iterating over the data source. For example,
a μ derived from a CSV row is a partial function from the headers of the CSV to the corresponding
data values in the CSV row. We define:
          <disp-formula>
            <label>(1)</label>
            <tex-math><![CDATA[
\begin{aligned}
f_0 &= \text{a default fragment} \\
\Omega &= \{\, \mu \mid \mu \text{ is a flattened data record} \,\} \\
\mathrm{Source}(c) &= \{\, f_0 \to \Omega \,\}
\end{aligned}
]]></tex-math>
          </disp-formula>
        </p>
        <p>Project operator. The Project operator restricts the variables in the solution mapping, which is needed
to efficiently process the mapping tuples. For example, a single RML iteration over a CSV source contains
all columns of a data record, but RML implicitly projects only the references required by the mapping
process. The operator is similar to its SPARQL algebra counterpart. Let P = {v<sub>0</sub>, …, v<sub>n</sub>} ⊆ V be a set of projection
attributes. We define:</p>
        <p>
          <disp-formula>
            <label>(2)</label>
            <tex-math><![CDATA[
\begin{aligned}
\mathrm{Project}(\mu, P) &= \mu \text{ restricted to attribute variables in } P \\
\mathrm{Project}(\Omega, P) &= \{\, \mathrm{Project}(\mu, P) \mid \mu \in \Omega \,\} \\
\mathrm{Project}(t, P) &= f \to \mathrm{Project}(\Omega, P) \\
\mathrm{Project}(M, P) &= \{\, \mathrm{Project}(t, P) \mid t \in M \,\}
\end{aligned}
]]></tex-math>
          </disp-formula>
        </p>
        <p>Extend operator. The Extend operator derives new attributes for a solution mapping. For
example, to include the body mass index (BMI) of a person in the output, we derive the BMI from
the height and weight attributes of that person. In RML, this is equivalent to the template and
constant Term Maps, where new values, not existing in the data records, are generated. The
Extend operator derives a value by evaluating an expression expr on the solution mapping μ,
and couples it to a new variable v not in the domain of μ. If evaluating the
expression causes an error and v is not in the domain of μ, the
Extend operator behaves like an identity operator. It is undefined if the variable restriction is
violated. We define:</p>
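        <p>
          <disp-formula>
            <label>(3)</label>
            <tex-math><![CDATA[
\begin{aligned}
\mathrm{Extend}(\mu, v, \mathit{expr}) &= \mu \cup \{\, (v, \mathit{value}) \mid v \notin \mathrm{dom}(\mu) \text{ and } \mathit{value} = \mathit{expr}(\mu) \,\} \\
\mathrm{Extend}(\Omega, v, \mathit{expr}) &= \{\, \mathrm{Extend}(\mu, v, \mathit{expr}) \mid \mu \in \Omega \,\} \\
\mathrm{Extend}(t, v, \mathit{expr}) &= f \to \mathrm{Extend}(\Omega, v, \mathit{expr}) \\
\mathrm{Extend}(M, v, \mathit{expr}) &= \{\, \mathrm{Extend}(t, v, \mathit{expr}) \mid t \in M \,\}
\end{aligned}
]]></tex-math>
          </disp-formula>
        </p>
        <p>Serialize operator. The Serialize operator serializes the mapping tuples into the specified
format. This is the core functionality of mapping engines: generating data in a specific format
from input data. For example, RML defines the data format implicitly using the Term Maps. The
Serialize operator is the root node operator.</p>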
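        <p>The Serialize operator generates data in a specific data format by replacing the variables
in a template with the values from the input solution mappings. Each solution mapping
generates one data item in the specified format. Given a string template s, we define:
          <disp-formula>
            <label>(4)</label>
            <tex-math><![CDATA[
\begin{aligned}
\mathrm{Serialize}(\Omega, s) &= \{\, s(\mu) \mid \mu \in \Omega,\ s(\mu) = \text{variables in } s \text{ substituted with } \mu \,\} \\
\mathrm{Serialize}(t, s) &= f \to \mathrm{Serialize}(\Omega, s) \\
\mathrm{Serialize}(M, s) &= \{\, \mathrm{Serialize}(t, s) \mid t \in M \,\}
\end{aligned}
]]></tex-math>
          </disp-formula>
        </p>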
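        <p>The four operators can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the proof-of-concept implementation: the configuration of the Source operator is reduced to a CSV string, lists stand in for multisets, the default fragment is named "f0", templates use Python's str.format placeholders, and the case that the definition leaves undefined (extending with an already bound variable) is reported with a ValueError. All function and variable names are hypothetical.</p>
        <preformat>
```python
import csv
import io
from typing import Callable, Dict, List

SolutionMapping = Dict[str, str]
MappingTuple = Dict[str, List[SolutionMapping]]  # fragment to multiset (list)
DEFAULT_FRAGMENT = "f0"  # assumed name for the default fragment


def source(csv_text: str) -> List[MappingTuple]:
    """Source: flatten every CSV row into a solution mapping (header to value)
    and map the default fragment to the resulting multiset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    omega = [dict(row) for row in reader]
    return [{DEFAULT_FRAGMENT: omega}]


def lift(step: Callable[[SolutionMapping], object], tuples: List[MappingTuple]) -> list:
    """Apply step to every solution mapping while keeping fragments unchanged."""
    return [
        {fragment: [step(mu) for mu in omega] for fragment, omega in mt.items()}
        for mt in tuples
    ]


def project(tuples: List[MappingTuple], variables: List[str]) -> list:
    """Project: restrict every solution mapping to the given variables."""
    keep = set(variables)
    return lift(lambda mu: {v: x for v, x in mu.items() if v in keep}, tuples)


def extend(tuples: List[MappingTuple], var: str,
           expr: Callable[[SolutionMapping], str]) -> list:
    """Extend: bind var to expr(mu); act as the identity if expr fails."""
    def step(mu: SolutionMapping) -> SolutionMapping:
        if var in mu:
            # Undefined in the definition; this sketch chooses to raise.
            raise ValueError("variable already bound: " + var)
        try:
            return {**mu, var: expr(mu)}
        except Exception:
            return dict(mu)
    return lift(step, tuples)


def serialize(tuples: List[MappingTuple], template: str) -> list:
    """Serialize: substitute the template placeholders with the bound values."""
    return lift(lambda mu: template.format(**mu), tuples)


# A small mapping plan: Source, then Project, then Extend, then Serialize.
CSV_TEXT = "id,name,age\n1,John Doe,23\n2,Susan Sue,25\n"
projected = project(source(CSV_TEXT), ["id", "name"])
extended = extend(projected, "subject", lambda mu: "ex:person" + mu["id"])
output = serialize(extended, '{subject} foaf:name "{name}" .')
```
        </preformat>
        <p>In this sketch, output holds one mapping tuple whose default fragment maps to two strings, one per CSV row. The helper lift mirrors the way each definition is raised from a single solution mapping to Ω, to a mapping tuple, and finally to M.</p>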
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Preliminary Results</title>
      <p>We implemented the proposed semantics in a proof-of-concept algebra interpreter for RML
mapping rules: https://github.com/s-minoo/meamer-rs. We translated several RML documents
(without joins) to a tree of algebraic mapping operators: a mapping plan. This provides an
initial validation of our proposed semantics and showcases its potential: multiple mapping
plans of different complexity are translated, allowing for inspection and optimization proposals.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>This poster paper presents an initial set of algebraic mapping operators (Source, Project, Extend,
Serialize) and a proof-of-concept implementation which can already be used to describe a subset
of the mapping processes represented in RML. The generated mapping plans show the potential
to define optimization rules based on the semantics rather than the syntax of the mapping language.
Furthermore, working with algebraic mapping operators enables us to “rewrite” the mapping
plan to optimize the mapping process. For
example, we could push the Project operator close to the Source operator to filter out
data that is unused in the output, reducing memory usage during knowledge graph
construction.</p>
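      <p>The projection rewrite mentioned above can be sketched as follows. This is a hedged illustration, not an optimization rule taken from the implementation: it works directly on a multiset of solution mappings (a list of dictionaries), omits the error handling of the Extend operator, and uses hypothetical records and variable names. The count of variable bindings is only a rough proxy for memory usage.</p>
      <preformat>
```python
from typing import Callable, Dict, List

SolutionMapping = Dict[str, str]


def project(omega: List[SolutionMapping], variables: List[str]) -> List[SolutionMapping]:
    """Restrict every solution mapping to the given variables."""
    keep = set(variables)
    return [{v: x for v, x in mu.items() if v in keep} for mu in omega]


def extend(omega: List[SolutionMapping], var: str,
           expr: Callable[[SolutionMapping], str]) -> List[SolutionMapping]:
    """Bind var to expr(mu) in every solution mapping (error handling omitted)."""
    return [{**mu, var: expr(mu)} for mu in omega]


def bindings(omega: List[SolutionMapping]) -> int:
    """Count variable bindings as a rough proxy for memory usage."""
    return sum(len(mu) for mu in omega)


# Hypothetical records, as a Source operator might produce them.
records = [
    {"id": "1", "name": "John Doe", "age": "23", "email": "john.doe@example.com"},
    {"id": "2", "name": "Susan Sue", "age": "25", "email": "susan.sue@example.com"},
]


def make_subject(mu: SolutionMapping) -> str:
    return "ex:person" + mu["id"]


# Original plan: Extend on the full records, Project only at the end.
wide = extend(records, "subject", make_subject)
original = project(wide, ["subject", "name"])

# Rewritten plan: Project first, keeping only what the expression ("id")
# and the output ("name") need, then Extend and apply the final Project.
narrow = extend(project(records, ["id", "name"]), "subject", make_subject)
rewritten = project(narrow, ["subject", "name"])
```
      </preformat>
      <p>Both plans yield the same result, but the intermediate multiset of the rewritten plan carries fewer bindings (bindings(narrow) versus bindings(wide)). The rewrite is only valid when the early projection keeps every variable that the expression reads.</p>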
      <p>As future work, we plan to extend and refine the algebraic operators, to be used as a theoretical
foundation for mapping languages. We plan to provide a generic mapping framework: the
reference implementation of these algebraic operators. Users could use this framework to easily
create a mapping engine using their choice of mapping language. Finally, we plan to conduct
an empirical study on the existing mapping optimization techniques and translate them to
optimization rules using these algebraic operators.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The described research activities were supported by SolidLab Vlaanderen (Flemish Government,
EWI and RRF project VV023/10), and the imec ICON project BoB (Agentschap Innoveren en
Ondernemen project nr. HBC.2021.0658).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jozashoori</surname>
          </string-name>
          , M.-E. Vidal,
          <article-title>Scaling up knowledge graph creation to large and heterogeneous data sources</article-title>
          ,
          <source>Journal of Web Semantics</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sitt</given-names>
            <surname>Min Oo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Haesendonck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>RMLStreamer-SISO: An RDF Stream Generator from Streaming Heterogeneous Data</article-title>
          ,
          <source>in: The Semantic Web - ISWC</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lefrançois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bakerally</surname>
          </string-name>
          ,
          <article-title>A SPARQL extension for generating RDF from heterogeneous formats</article-title>
          ,
          <source>in: The Semantic Web - ISWC</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vander Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          , E. Mannens, R. Van de Walle,
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          , in:
          <source>LDOW</source>
          ,
          <year>2014</year>
          . URL: http://ceur-ws.org/Vol-1184/ldow2014_paper_01.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <article-title>Semantics and complexity of SPARQL</article-title>
          ,
          <source>ACM Trans. Database Syst.</source>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ruckhaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>García-Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>An ontological approach for representing declarative mapping languages</article-title>
          ,
          <source>Semantic Web Journal</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Priyatna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Towards the definition of a language-independent mapping template for knowledge graph creation</article-title>
          ,
          <source>in: Third International Workshop on Capturing Scientific Knowledge (SciKnow19)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>