<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Declarative RDF Construction from In-Memory Data Structures with RML</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ioannis Dasoulas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Chaves-Fraga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Garijo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Dimou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KU Leuven, Department of Computer Science</institution>
          ,
          <addr-line>B-2860, Sint-Katelijne-Waver</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leuven.AI - KU Leuven institute for AI</institution>
          ,
          <addr-line>B-3000 Leuven</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>Campus de Montegancedo, Boadilla del Monte</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge graphs are often constructed from heterogeneous data sources using declarative mapping languages. Mapping languages define rules that apply ontology terms to raw data and thereby describe how a knowledge graph should be constructed from these data. While most mapping languages and systems support knowledge graph construction from different data formats, e.g., CSV, XML or JSON, and different types of data sources, e.g., files, Web APIs or databases, there is still no support for mapping in-memory data structures, i.e., data temporarily stored in RAM, to knowledge graphs. Currently, such data must first be written to HDD, locally or in the cloud, before RDF construction systems can access them and construct a knowledge graph. However, writing these data to HDD and reading them back is a computationally expensive and redundant task. In this paper, we propose a method to construct RDF graphs from data produced by a software process and stored in RAM. We introduce an extension of RML's Logical Source to describe data structures produced by software, and exemplify our proposal with Python data structures. We extend a well-known RML system, Morph-KGC, to show the feasibility of our method and validate this extension with two use cases: OpenML, which translates machine learning executions into RDF, and SOMEF, which extracts software metadata from its documentation and converts them to triples. This proposal simplifies the construction of RDF graphs from in-memory data structures stored temporarily in RAM and enables the integration of data stored both in RAM and HDD.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Mapping Languages</kwd>
        <kwd>RML</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Graphs represented with the Resource Description Framework (RDF) and knowledge graphs
(KGs), in general, have become increasingly popular as a means of representing and analyzing
complex data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Mapping languages, e.g., the Relational to RDF mapping language (R2RML) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and its extension, the RDF mapping language (RML) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], describe the construction of RDF graphs through
a set of declarative rules, defined by the languages’ syntax. These rules define how the data
should be represented as RDF graphs using terms from ontologies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and are executed by RDF
construction systems that construct the RDF graph.
      </p>
      <p>
        While existing mapping languages and systems support KG construction from heterogeneous
data sources, in-memory data structures are not fully supported. In-memory data structures are
produced by a software program or application and are temporarily stored in Random Access
Memory (RAM) (e.g., Spark DataFrames, C++ linked lists, in-memory databases, etc.) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As
mapping languages assume that data sources are stored in the Hard Disk Drive (HDD), current
KG construction systems cannot construct RDF graphs from in-memory data structures. As a
result, in-memory data structures are first stored locally or in the cloud, and then the KG
construction systems load the data, either sequentially, e.g., RMLMapper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or ShExML [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
or in parallel, e.g., RMLStreamer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or Morph-KGC [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This process can be computationally
expensive, as data is converted from one structure to another, written to and read from HDD [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
While supporting in-memory Relational Databases (RDBs) (e.g., SQLite<sup>2</sup> or Redis<sup>3</sup>) is merely a
matter of configuration, as they use the same connectivity interfaces as regular RDBs, the same
does not hold for NoSQL Databases (DBs). In-memory NoSQL DBs (e.g., TinyDB<sup>4</sup> or ZODB<sup>5</sup>)
access in-memory data structures in a different way than regular NoSQL DBs.
      </p>
      <p>
        In this paper, we propose a method for constructing RDF graphs from in-memory data
structures by extending RML to describe data produced by software, independently of their
internal structure. We use the Software Description Ontology (SDO) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to annotate the metadata
about the software and the data that the software produces, and exemplify this extension for
Python dictionaries and Pandas<sup>6</sup> DataFrames. Our solution leverages data generated at run time
to construct RDF graphs without first having to store them, and combines in-memory and locally
stored heterogeneous data, which was not possible so far.
      </p>
      <p>
        We extend Morph-KGC [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a scalable Python KG construction system that outperforms
state-of-the-art systems in terms of execution time. Morph-KGC-RAM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], our extension, is a
Python package that adds support for integrating in-memory data structures, using the
proposed RML syntax extension. Our approach requires the software that generates the data
and the KG construction system to run within the same process and share physical resources.
After the data structures are generated and processed, the KG construction package can use
them in combination with declarative mapping rules to transform them into RDF.
      </p>
      <p>
        We validate our extension with two use cases from the data mining domain where an RDF
graph is constructed from Python-based software. The OpenML [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] use case extracts metadata
about machine learning experiments and translates them into RDF, while the SOMEF [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] use
case extracts software metadata from its documentation and converts them to triples.
      </p>
      <p>The contributions of this paper are the following: (i) an extension of RML’s Logical Source
to describe data sources produced by software; (ii) a preliminary proof-of-concept of our
extension for Python data structures; and (iii) a demonstration of two use cases for in-memory
data integration to RDF. The remainder of the paper is structured as follows: Section 2 describes
related work. Section 3 presents our extension of RML using the Software Description Ontology
(SDO). Section 4 describes the Morph-KGC extension and discusses the use cases
in which the extension was used. Section 5 outlines our conclusions and plans for future work.</p>
      <p>2 https://sqlite.org/index.html
3 https://redis.io
4 https://tinydb.readthedocs.io/en/latest/
5 https://zodb.org/en/latest/index.html
6 https://pandas.pydata.org</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Different software packages provide an abstraction layer for users to construct RDF graphs from
in-memory data structures using ad-hoc scripts. KGLab<sup>7</sup>, for example, is a software package
based on RDFLib<sup>8</sup> to construct KGs from Python data structures. Nevertheless, these packages
do not support the transformation of in-memory data structures to RDF graphs using mapping
languages, requiring users to specify their own rules for integrating data into RDF graphs.
While these solutions work well for small-scale projects, they are harder to maintain and scale [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Mapping languages provide a standard formalization for KG construction; they are more
interoperable than ad-hoc solutions and highly reusable for different types of data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Over the
past decades, several mapping languages have been proposed to describe the construction of
KGs from heterogeneous data sources in a declarative way [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. R2RML [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and RML [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are two
of the most popular languages, supported by a large number of KG construction systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
KG construction systems are software implementations that process data sources and mapping
rules to generate RDF graphs. Despite the large number of KG construction systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], none
of them supports in-memory data structures so far.
      </p>
      <p>
        To describe in-memory data structures, we investigated ontologies which describe different
aspects of the software that produces these data structures, e.g., projects, components, and
processes. The Description of a Project Ontology (DOAP)<sup>9</sup> is a popular ontology for describing
software projects. DOAP focuses on software organization and version control, without
providing detailed annotations about data. Thus, it cannot be used to describe information about data
structures and the characteristics of the software that produces them. The Software Ontology
(SWO)<sup>10</sup> is a comprehensive ontology for software artifacts, covering topics such as software
applications, libraries, and operating systems. Originally created for bioinformatics-related
software, SWO reuses high-level Open Biomedical Ontologies to annotate software components,
but it does not describe the expected content of data. In addition, SWO contains thorough
taxonomies about data transformation techniques and programming languages, but defines
them as classes, which reduces its flexibility and makes it difficult to reuse these concepts
as instances<sup>10</sup>. The Core Software Ontology (CSO)<sup>11</sup> and its extension, the Core Ontology of
Software Components (COSC) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], provide concepts for describing software services and
components. However, they are not actively documented and require users to
adapt to some unique formalizations they provide, which makes the ontologies hard to reuse.
OntoSoft [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and its extension OntoSoft-VFF [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] capture scientific software metadata from a
scientist’s perspective, focusing mostly on provenance and ease of use, by integrating scientific
questions in the descriptions of the ontologies’ properties. The Software Description Ontology
(SD) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] provides a lightweight set of concepts and relationships to describe software data, their
metadata, and their parameters and features, including classes for
software data, their features and variable presentations. SD favors the reuse of existing vocabularies
such as Schema.org<sup>12</sup>, Codemeta<sup>13</sup> and the W3C Data Cube standard [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which improves
interoperability. Furthermore, it is a well-documented ontology, recently tested in RDF systems, such as
MODFLOW<sup>14</sup>, whereas other well-known software ontologies (e.g., Software Ontology (SWO)
and Core Software Ontology (CSO)) are not actively supported. Hence, we concluded
that SD is suitable for annotating in-memory data structures and the software that produces
them, due to its lightweight nature and preference for reusing existing core ontologies.
      </p>
      <p>7 https://derwen.ai/docs/kgl/
8 https://rdflib.readthedocs.io/en/stable/
9 http://usefulinc.com/ns/doap
10 http://purl.obolibrary.org/obo/swo.owl
11 http://km.aifb.kit.edu/sites/cos/</p>
    </sec>
    <sec id="sec-3">
      <title>3. RML Extension</title>
      <p>
        RML [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] describes customized mapping rules that define how RDF graphs are
constructed from heterogeneous data formats (e.g., data in relational databases, data in CSV
or JSON format, etc.) and sources (e.g., databases, files or Web APIs). The data source is
described in the Logical Source (rml:LogicalSource), and the reference formulation
(rml:ReferenceFormulation) defines how to access the data retrieved from the data source
(e.g., rr:SQL2008 for relational databases, ql:CSV for CSV files, ql:JSONPath for JSON files).
      </p>
      <p>In this work, we extend RML’s Logical Source to provide a way of defining RML mapping
rules for in-memory data structures. We introduce a new type of source, the in-memory data
structure, as well as new reference formulations, e.g., for Python dictionaries and Pandas<sup>16</sup> DataFrames,
to express how the retrieved data structures should be accessed.</p>
      <p>Our proposed approach provides a uniform way to annotate data structures for different
types of software, built in different programming languages. We annotate not only the
data structure, but also the software that produces this data. KG construction systems can
leverage information about the software and its dependencies to identify possible compatibility
issues with the software that produces the data structures. For example, systems can evaluate if
they support data structures from the programming language version "Python3.9" or "Java20",
since there might be small variations for a data structure between different versions of a
programming language. Software descriptions help systems detect possible conflicts between
their dependencies and the software that produces the data structures, not only regarding
programming languages but also software versions. For example, systems can evaluate whether
they support data structures from the software package version that satisfies the requirement
"pandas&gt;=1.1.0" in Python or "org.apache.commons,commons-lang3,3.12.0" in Java.</p>
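      <p>The compatibility check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of Morph-KGC-RAM: the function names, the installed_packages mapping and the restriction to the operators "&gt;=", "==" and "&gt;" are all hypothetical, and versions are compared as plain integer tuples, so "1.5" sorts before "1.5.0".</p>
      <p>
```python
import operator
import re

# Comparison operators handled by this sketch (a deliberate subset of real
# requirement syntax; anything else is rejected).
_OPERATORS = {">=": operator.ge, "==": operator.eq, ">": operator.gt}
_REQUIREMENT = re.compile(
    r"^\s*([A-Za-z0-9_.\-]+)\s*(>=|==|>)\s*([0-9]+(?:\.[0-9]+)*)\s*$"
)


def parse_version(text):
    """Turn a dotted version such as '1.5.3' into a comparable tuple (1, 5, 3)."""
    return tuple(int(part) for part in text.split("."))


def satisfies(requirement, installed):
    """Check an sd:softwareRequirements string against installed packages.

    `installed` is a hypothetical mapping from lower-case package name to
    version string, e.g. collected by the KG construction system at start-up.
    """
    match = _REQUIREMENT.match(requirement)
    if match is None:
        raise ValueError(f"unsupported requirement: {requirement!r}")
    name, op, wanted = match.groups()
    if name.lower() not in installed:
        return False
    have = parse_version(installed[name.lower()])
    return _OPERATORS[op](have, parse_version(wanted))


installed_packages = {"pandas": "1.5.3"}
print(satisfies("pandas>=1.1.0", installed_packages))  # True
print(satisfies("pandas>=2.0.0", installed_packages))  # False
```
      </p>
      <p>A system could run such a check before mapping and report a conflict instead of failing while parsing the data structure.</p>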
      <p>Furthermore, software descriptions, such as software dependencies, also increase the
reusability and interoperability of RML mapping rules with other software. Using this information,
third parties can better understand how the data structures are generated and the
requirements of the software used, e.g., understand what programming language and dependencies to
use, and replicate the complete pipeline of KG construction from in-memory data structures,
having more information about the setup they need to construct the KG.</p>
      <p>Listing 1: SD description of Pandas data structures.
1 &lt;#Logical-Source&gt; a rml:LogicalSource;
2 rml:source &lt;#In-memory-Source&gt;;
3 rml:referenceFormulation ql:DataFrame.
4 &lt;#In-memory-Source&gt; a sd:DatasetSpecification;
5 sd:name "output_dataframe";
6 sd:hasDataTransformation &lt;#Data-Transformation-Software&gt;.
7 &lt;#Data-Transformation-Software&gt; a sd:DataTransformation;
8 sd:name "DataFrame creation application";
9 sd:hasSourceCode &lt;#Software-Source-Code&gt;;
10 sd:softwareRequirements "pandas&gt;=1.1.0";
11 sd:license "MIT".
12 &lt;#Software-Source-Code&gt; a sd:SourceCode;
13 sd:programmingLanguage "Python3.9".
14 ql:DataFrame a rml:ReferenceFormulation; kg4di:definedBy "Pandas".</p>
      <p>12 https://schema.org
13 https://codemeta.github.io/
14 https://models.mint.isi.edu/models/explore/MODFLOW/</p>
      <p>In the remainder of this section, we describe how we use the Software Description Ontology
as Logical Source and the Reference Formulations that we introduce for the data structures.</p>
      <sec id="sec-3-1">
        <title>Software Description Ontology as Logical Source</title>
        <p>We leverage the SD ontology<sup>15</sup> [<xref ref-type="bibr" rid="ref9">9</xref>] to
describe the software that produces the in-memory data structures as output.</p>
        <p>In-memory data structures are described as sd:DatasetSpecification (Listing 1: lines
4-5), an SD class designed to describe the input and output types of software applications and models.
A name representing the data structure is defined (sd:name). This name can be used by the
KG construction system to identify which data structure should be mapped to RDF (Section
4.1). The software that produces the data structure is described as sd:DataTransformation
(Listing 1: lines 6-11). This can be a software application, a function or a concrete software
script. We describe common information about the software such as its name, its dependencies,
the source code it uses and its license. Depending on the implementation, more software
annotations can be added using SD (e.g., software inputs, parameters, versions). The source
code of the software is described as sd:SourceCode (Listing 1: lines 12-13).</p>
        <p>Reference Formulation for in-memory data structures. We introduce new reference
formulations for Python data structures: ql:Dictionary for Python dictionaries and
ql:DataFrame (Listing 1: line 3) for Pandas<sup>16</sup> DataFrames. For in-memory JSON data (Python
strings in JSON format), we use the existing ql:JSONPath reference formulation,
since this data structure shares the structure of JSON files. Using this information, a system can
decide how it can process and parse the data structure.</p>
        <p>15 https://w3id.org/okn/o/sd/
16 https://pandas.pydata.org</p>
        <p>Listing 2: SD description of Java data structures.
&lt;#Logical-Source&gt; a rml:LogicalSource;
rml:source &lt;#In-memory-Source&gt;;
rml:referenceFormulation ql:LinkedList.
&lt;#In-memory-Source&gt; a sd:DatasetSpecification;
sd:name "output_linked_list";
sd:hasDataTransformation &lt;#Data-Transformation-Software&gt;.
&lt;#Data-Transformation-Software&gt; a sd:DataTransformation;
sd:name "Linked list creation application";
sd:hasSourceCode &lt;#Software-Source-Code&gt;;
sd:softwareRequirements "org.apache.commons,commons-lang3,3.12.0";
sd:license "MIT".
&lt;#Software-Source-Code&gt; a sd:SourceCode;
sd:programmingLanguage "Java20".
ql:LinkedList a rml:ReferenceFormulation; kg4di:definedBy "Java".</p>
        <p>The reference formulation contains the name of the data structure (e.g., dictionary). The
programming language (e.g., C#) or the software library (e.g., PySpark<sup>17</sup>) that defines the data
structure is also specified<sup>18</sup> (Listing 1: line 14). Different programming languages and software
libraries provide different ways of organizing and storing data in computer memory, even
though they may refer to the data structures in the same way. For example, both Python and C#
provide data structures that they refer to as ‘dictionary’. Still, the data structures are different,
as each programming language has a unique way of organizing and storing data. Even if the
programming language is common, there might be different software packages that produce
different data structures but refer to them using the same name. For example, Pandas<sup>16</sup> and
PySpark<sup>17</sup> are Python libraries that produce data structures which they refer to as ‘DataFrames’.
While both data structures share some similarities in their basic structure and operations (such
as indexing and filtering), they have different underlying implementations and are designed to
handle different types of data. As a result, each of them requires a distinct description.</p>
        <p>Discussion. We showcase how our approach can construct RDF from in-memory data
structures for Python using compatible dependencies, but it can be extended to data structures
of other programming languages. Our approach assumes that the software that generates the
data is written in the same programming language as the KG construction system, and that they run within
the same process and share compatible software dependencies and physical resources. The
SD ontology allows describing any software or package in different programming languages and,
ultimately, any data structure they yield. As long as a reference formulation indicates how to
refer to these data structures, and corresponding parsers exist to interpret these references, any
in-memory data structure can be addressed. For example, in Listing 2, a Java linked list and
the software that produced it are declaratively described. This description can be used by a
Java-based KG construction system.</p>
        <p>17 https://spark.apache.org/docs/latest/api/python/
18 https://w3id.org/kg4di/definedBy</p>
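        <p>The role of the reference formulation and its kg4di:definedBy value can be sketched as a parser lookup keyed by both. The registry, the decorator and the handler names below are hypothetical and only illustrate the idea that the same structure name may need different parsers depending on where it is defined; pairing ql:JSONPath with "Python" for in-memory JSON strings is likewise an assumption of this sketch.</p>
        <p>
```python
import json

# Registry keyed by (reference formulation, kg4di:definedBy value).
# The registry and the handler names are illustrative assumptions.
PARSERS = {}


def register(formulation, defined_by):
    """Decorator that registers a parser for one kind of in-memory structure."""
    def decorator(func):
        PARSERS[(formulation, defined_by)] = func
        return func
    return decorator


@register("ql:Dictionary", "Python")
def parse_python_dict(data):
    # A single dictionary is treated as one record.
    return [data]


@register("ql:JSONPath", "Python")
def parse_json_string(data):
    # In-memory JSON arrives as a Python string; a top-level array
    # yields one record per element.
    loaded = json.loads(data)
    return loaded if isinstance(loaded, list) else [loaded]


def load_records(formulation, defined_by, data):
    """Select a parser from the declared reference formulation and its origin."""
    try:
        parser = PARSERS[(formulation, defined_by)]
    except KeyError:
        raise ValueError(
            f"no parser for {formulation} defined by {defined_by}"
        ) from None
    return parser(data)


print(load_records("ql:Dictionary", "Python", {"Id": 1}))
print(load_records("ql:JSONPath", "Python", '[{"Id": 1}, {"Id": 2}]'))
```
        </p>
        <p>Because the key includes the defining language or library, a request for ("ql:DataFrame", "PySpark") fails explicitly here instead of being handled by a parser written for another kind of DataFrame.</p>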
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Validation</title>
      <p>
        To validate our approach, we implemented our proposed solution in Morph-KGC<sup>19</sup>,
creating the Morph-KGC-RAM [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] extension (Section 4.1). We applied our approach to two use
cases (Section 4.2): Open Machine Learning (OpenML)<sup>20</sup> [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], on the extraction of machine learning
experiment metadata and their transformation to RDF, and the Software Metadata Extraction
Framework (SOMEF)<sup>21</sup> [18], on the extraction of scientific software metadata from files.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Implementation</title>
        <p>
          We extended Morph-KGC<sup>19</sup> with Morph-KGC-RAM, which implements our
proposed solution. Morph-KGC is a Python software library that can be used at run time by
other Python-based software and, in most cases, outperforms state-of-the-art systems in
execution time [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          Morph-KGC expects a configuration file (.ini) that specifies which mapping rules will be used.
We extended Morph-KGC to also accept Python objects that represent input data sources for the
construction of KGs. The Python data structures currently supported are Python dictionaries,
Pandas DataFrames and Python strings in JSON format. Listing 3 shows how the extension
can be used to map a Python data structure into RDF, using the RML rules from Listing 1.
        </p>
        <p>Listing 3: Morph-KGC extension that uses the RML file from Listing 1 to generate a knowledge
graph from the Python data structure ‘my_data’
import pandas as pd
import morph_kgc
# The dictionary key matches the sd:name declared in the mapping rules (Listing 1)
my_data = pd.DataFrame({'Id': [1, 2, 3], 'Username': ["@jude", "@emily", "@wayne"]})
graph = morph_kgc.materialize('./config.ini', {"output_dataframe": my_data})</p>
        <p>Using this extension, KGs can be constructed from multiple heterogeneous data sources stored
in HDD, multiple in-memory data structures stored in RAM, or their combination. Following
the CI/CD development of Morph-KGC, we integrated 72 test cases<sup>22</sup> for KG construction with
Python data structures and examples<sup>23</sup> for constructing KGs using Python data structures.</p>
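        <p>How a system might combine sources in RAM and in HDD can be sketched as a lookup by name: a name that matches a key of the dictionary passed by the caller is served from memory, and any other name is read from disk. The function resolve_source and its CSV fallback are hypothetical and do not reproduce Morph-KGC's internal code; they only illustrate the lookup by sd:name described in Section 3.</p>
        <p>
```python
import csv


def resolve_source(source_name, in_memory_sources):
    """Return the rows of a logical source as a list of dictionaries.

    If `source_name` matches a key of `in_memory_sources` (the sd:name of a
    described data structure), the object in RAM is used directly; otherwise
    the name is treated as the path of a CSV file on disk.
    """
    if source_name in in_memory_sources:
        return list(in_memory_sources[source_name])
    with open(source_name, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Hypothetical in-memory rows standing in for the DataFrame of Listing 3.
users = [
    {"Id": 1, "Username": "@jude"},
    {"Id": 2, "Username": "@emily"},
]
rows = resolve_source("output_dataframe", {"output_dataframe": users})
print(rows[0]["Username"])  # @jude
```
        </p>
        <p>With this design, the same mapping document can mix file-based and in-memory logical sources, since only the resolution step differs between them.</p>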
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Use Cases</title>
        <p>We demonstrate two use cases from the data mining domain for which we used
Morph-KGC-RAM: OpenML and SOMEF. We discuss each one in more detail.</p>
        <p>OpenML KG: Constructing a Machine Learning Experiment Knowledge Graph.
OpenML<sup>20</sup> is an open platform for sharing datasets, algorithms and experiments related to
machine learning. In this work, we used OpenML’s Python API<sup>24</sup> to connect to the platform
and download machine learning experiment data and related metadata, storing them as Python
dictionaries and Pandas DataFrames. We used Morph-KGC-RAM to map data from thousands
of experiments into RDF, without having to store them first in the form of a database or a local
file. An indicative example of the work on the use case can be seen in Listings 4 and 5. In Listing
4, OpenML’s Python API is used to load some metadata about datasets of machine learning
experiments in a Pandas DataFrame. Then, with Morph-KGC-RAM, the DataFrame is mapped
into RDF, leveraging the mapping rules from Listing 5.<sup>23</sup></p>
        <p>19 https://github.com/morph-kgc/morph-kgc
20 https://www.openml.org
21 https://somef.readthedocs.io/en/latest/
22 https://github.com/morph-kgc/morph-kgc/tree/main/test/rml-in-memory
23 https://github.com/morph-kgc/morph-kgc/tree/main/examples
24 https://openml.github.io/openml-python/main/</p>
        <p>Listing 4: Mapping OpenML dataset metadata into RDF using Morph-KGC-RAM
import openml
import pandas as pd
import morph_kgc
# Load metadata of ten OpenML datasets and turn it into a DataFrame
dataset_list = openml.datasets.list_datasets(size=10)
dataset_df = pd.DataFrame.from_dict(dataset_list, orient="index").reset_index()
# "df1" matches the sd:name used in the mapping rules of Listing 5
g_rdflib = morph_kgc.materialize('./config.ini', {"df1": dataset_df})</p>
        <p>Listing 5: RML Mapping rules for KG construction using OpenML Pandas DataFrames.
&lt;Datasets_Map&gt; a rr:TriplesMap;
rml:logicalSource [ rml:source [
a sd:DatasetSpecification; sd:name "df1"; sd:hasDataTransformation [
sd:hasSoftwareRequirements "Pandas&gt;=1.1.0";
sd:hasSourceCode [ sd:programmingLanguage "Python3.9"; ]; ]; ];
rml:referenceFormulation ql:DataFrame; ];
rr:subjectMap [ rr:class mls:Dataset;
rr:template "http://mldata.com/resource/openml/dataset{did}"; ].
ql:DataFrame a rml:ReferenceFormulation; kg4di:definedBy "Pandas".</p>
        <p>SOMEF: Creating structured metadata from software repositories. The SOftware
Metadata Extraction Framework (SOMEF) is a Python engine designed to process software code
repositories and represent their metadata as an RDF graph. SOMEF extracts these metadata
properties using different techniques (regular expressions, supervised classification) and APIs
(GitHub API, GitLab API), conflating them in a single, homogeneous record. Internally, the
data is stored in a dictionary which is then translated to RDF. In Listing 6, a Python dictionary
containing software metadata is translated into RDF using Morph-KGC-RAM. In Listing 7, a
simple example of RML rules for the SOMEF use case is demonstrated.</p>
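        <p>The SOMEF mapping rules address values inside the nested dictionary with dotted references such as application_domain.*.result.value. The text does not describe how Morph-KGC-RAM evaluates these references, so the following is only a simplified sketch under the assumption that each step selects a dictionary key and that "*" iterates over the items of a list. The function resolve_reference and the sample dictionary are hypothetical.</p>
        <p>
```python
def resolve_reference(data, reference):
    """Collect every value reached by a dotted reference.

    Each step selects a key of a dictionary; the step '*' fans out over
    the items of a list. Steps that do not match are silently dropped.
    """
    current = [data]
    for step in reference.split("."):
        following = []
        for node in current:
            if step == "*" and isinstance(node, list):
                following.extend(node)
            elif isinstance(node, dict) and step in node:
                following.append(node[step])
        current = following
    return current


# Hypothetical, heavily trimmed stand-in for a SOMEF metadata dictionary.
somef_like = {
    "full_name": [{"result": {"value": "example/repo"}}],
    "application_domain": [
        {"result": {"value": "Semantic web"}},
        {"result": {"value": "Machine learning"}},
    ],
}

print(resolve_reference(somef_like, "application_domain.*.result.value"))
# ['Semantic web', 'Machine learning']
```
        </p>
        <p>Each value returned for a reference would then yield one object for the corresponding predicate, while the subject IRI is built from the value reached through full_name.</p>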
        <p>Listing 6: Mapping SOMEF software metadata into RDF using Morph-KGC-RAM
import json, pandas as pd, morph_kgc
# somef_dictionary is a SOMEF metadata dictionary
g_rdflib = morph_kgc.materialize('./config.ini', {"dict1": somef_dictionary})</p>
        <p>Listing 7: Mapping rules for KG construction using JSON strings of SOMEF software metadata.
&lt;Software_Map&gt; a rr:TriplesMap;
rml:logicalSource [ rml:source [
a sd:DatasetSpecification; sd:name "dict1";
sd:hasDataTransformation [ sd:hasSourceCode [
sd:programmingLanguage "Python3.9"; ]; ]; ];
rml:referenceFormulation ql:Dictionary; rml:iterator "$"; ];
rr:subjectMap [
rr:template "https://www.w3id.org/okn/i/Repo/{full_name.*.result.value}"; ];
rr:predicateObjectMap [ rr:predicate emi:applicationDomain;
rr:objectMap [ rml:reference "application_domain.*.result.value"; ]; ].
ql:Dictionary a rml:ReferenceFormulation; kg4di:definedBy "Python".</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>In this paper, we present an extension of RML’s Logical Source to describe in-memory data
structures. This extension enables the definition of mapping rules for in-memory data structures
stored in RAM. Moreover, we implement our proposal for Python dictionaries and Pandas
DataFrames by extending Morph-KGC into Morph-KGC-RAM, a system that constructs RDF
from heterogeneous data and Python data structures without having to store them first. We
validate our approach over two use cases from the data mining domain, confirming that our
approach constitutes a simple setup for KG construction from in-memory data structures.</p>
      <p>The performance of our approach in terms of speed and efficiency has not yet been compared
to current workflows, where in-memory data structures are first stored for the KG construction
system to access them. In the future, we plan to evaluate systems that use this
extension to map in-memory data structures into RDF against systems that map data
structures to RDF after first storing them in HDD, to measure the efficiency of our approach.
Furthermore, we plan to evaluate our approach with more use cases and explore additional data
structures from different programming languages that can be used for KG construction. Finally,
we plan to apply our approach to other KG construction systems, based on other well-known
programming languages, such as Java and JavaScript.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This research was partially supported by Flanders Make, the strategic research centre for the
manufacturing industry, and Flanders Innovation and Entrepreneurship (VLAIO). David
Chaves-Fraga and Daniel Garijo are supported by the Madrid Government (Comunidad de
Madrid-Spain) under the Multiannual Agreement with Universidad Politécnica de Madrid in the
line Support for R&amp;D projects for Beatriz Galindo researchers, in the context of the V PRICIT.</p>
      <p>[18] A. Kelley, D. Garijo, A Framework for Creating Knowledge Graphs of Scientific Software
Metadata, Quantitative Science Studies (2021) 1–37. doi:10.1162/qss_a_00167.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>A.</given-names> <surname>Hogan</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Blomqvist</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Cochez</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>d'Amato</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>de Melo</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Gutierrez</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Kirrane</surname></string-name>,
          <string-name><given-names>J. E. L.</given-names> <surname>Gayo</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Navigli</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Neumaier</surname></string-name>,
          <string-name><given-names>A.-C. N.</given-names> <surname>Ngomo</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Polleres</surname></string-name>,
          <string-name><given-names>S. M.</given-names> <surname>Rashid</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rula</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Schmelzeisen</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Sequeda</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Staab</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Zimmermann</surname></string-name>,
          <article-title>Knowledge graphs</article-title>,
          <source>ACM Computing Surveys</source>
          (<year>2021</year>). doi:10.1145/3447772.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sundara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <article-title>R2RML: RDB to RDF Mapping Language</article-title>
          , W3C Recommendation,
          <source>World Wide Web Consortium (W3C)</source>
          ,
          <year>2012</year>
          . URL: http://www.w3.org/TR/r2rml/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vander Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          , E. Mannens, R. Van de Walle,
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          ,
          <source>in: Proceedings of the 7th Workshop on Linked Data on the Web</source>
          , volume
          <volume>1184</volume>
          , CEUR Workshop Proceedings,
          <year>2014</year>
          . URL: http://ceur-ws.org/Vol-1184/ldow2014_paper_01.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Assche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Delva</surname>
          </string-name>
          , G. Haesendonck,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heyvaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>Declarative RDF graph generation from heterogeneous (semi-)structured data: A systematic literature review</article-title>
          ,
          <source>Journal of Web Semantics</source>
          (
          <year>2022</year>
          )
          <fpage>100753</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wendell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Armbrust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rosen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Venkataraman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          , et al.,
          <article-title>Apache Spark: a unified engine for big data processing</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>59</volume>
          (
          <year>2016</year>
          )
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>García-González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernández-Álvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Labra Gayo</surname>
          </string-name>
          ,
          <article-title>ShExML: An heterogeneous data mapping language based on ShEx</article-title>
          ,
          <source>in: European Knowledge Acquisition Workshop</source>
          , EKAW, volume
          <volume>2262</volume>
          ,
          <year>2018</year>
          . URL: https://ceur-ws.org/Vol-2262/ekaw-poster-08.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Haesendonck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Maroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heyvaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <article-title>Parallel RDF generation from heterogeneous big data</article-title>
          ,
          <source>in: Proceedings of the International Workshop on Semantic Big Data</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Arenas-Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Morph-KGC: Scalable knowledge graph materialization with mapping partitions</article-title>
          ,
          <source>Semantic Web</source>
          (
          <year>2022</year>
          ). doi:10.3233/SW-223135.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Osorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ratnakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gil</surname>
          </string-name>
          ,
          <article-title>OKG-Soft: An Open Knowledge Graph with Machine Readable Scientific Software Metadata</article-title>
          ,
          <source>in: 15th International Conference on eScience (eScience)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>349</fpage>
          -
          <lpage>358</lpage>
          . doi:10.1109/eScience.2019.00046.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Dasoulas</surname>
          </string-name>
          ,
          <source>morph-kgc/morph-kgc: 2.5.0</source>
          ,
          <year>2023</year>
          . URL: https://doi.org/10.5281/zenodo.7829223. doi:10.5281/zenodo.7829223.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanschoren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Van Rijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bischl</surname>
          </string-name>
          , L. Torgo,
          <article-title>OpenML: networked science in machine learning</article-title>
          ,
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>15</volume>
          (
          <year>2014</year>
          )
          <fpage>49</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          , S. Fakhraei,
          <article-title>SoMEF: A framework for capturing scientific software metadata from its documentation</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Big Data (Big Data)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>3032</fpage>
          -
          <lpage>3037</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Iglesias-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Priyatna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Enhancing the Maintainability of the Bio2RDF Project Using Declarative Mappings</article-title>
          ,
          <source>in: SWAT4HCLS</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Oberle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lamparter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Grimm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          ,
          <article-title>Towards ontologies for formalizing modularization and communication in large software systems</article-title>
          ,
          <source>Applied Ontology</source>
          <volume>1</volume>
          (
          <year>2006</year>
          )
          <fpage>163</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ratnakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <article-title>OntoSoft: Capturing Scientific Software Metadata</article-title>
          ,
          <source>in: Proceedings of the 8th International Conference on Knowledge Capture, K-CAP 2015</source>
          , Association for Computing Machinery, NY, USA,
          <year>2015</year>
          . doi:10.1145/2815833.2816955.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L. A. M. C.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bauzer Medeiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gil</surname>
          </string-name>
          ,
          <article-title>Semantic software metadata for workflow exploration and evolution</article-title>
          ,
          <source>in: 2018 IEEE 14th International Conference on e-Science (e-Science)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>431</fpage>
          -
          <lpage>441</lpage>
          . doi:10.1109/eScience.2018.00132.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          ,
          <article-title>The RDF Data Cube Vocabulary</article-title>
          , W3C Recommendation,
          <source>World Wide Web Consortium (W3C)</source>
          ,
          <year>2014</year>
          . URL: https://www.w3.org/TR/vocab-data-cube/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>