<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OntoPy: a framework to integrate different file types</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pedro Paulo Rezende Silva Domingos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Maria Parente de Oliveira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Tecnológico de Aeronáutica (ITA)</institution>
          ,
          <addr-line>São José dos Campos, SP</addr-line>
          ,
          <country country="BR">Brasil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces OntoPy, a framework for ontology-based data integration that aims to provide an accessible approach to integrating various types of data sources. OntoPy addresses scenarios in which organizations are still structuring their data and seek exploratory analyses to derive insights for decision-making. Unlike existing tools such as Ontop, which focus on structured databases and require commercial software and plugins, OntoPy leverages the data science-friendly features of Python to simplify the integration of unstructured data in formats like Parquet, CSV, and XLSX. The framework offers a more versatile and efficient method for bridging the semantic heterogeneity of data and ontologies. By emphasizing the relationship between ontology classes and properties and the attributes available in data sources, OntoPy facilitates a seamless and effective data integration process. Initial tests have shown promising results, with adequate performance in handling larger data volumes compared to other tools such as Morph-KGC. OntoPy has been successfully applied to querying data for diesel-electric locomotive maintenance, handling complex queries across heterogeneous data sources with good performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Data integration</kwd>
        <kwd>ontology</kwd>
        <kwd>mapping</kwd>
        <kwd>databases</kwd>
        <kwd>framework</kwd>
        <kwd>software engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data integration is a recurring problem in data management and remains a significant challenge today [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is estimated that 50 to 80% of a data scientist’s time is dedicated to manipulating, integrating, and preparing data for effective use [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this context, ontologies emerge as an important tool for semantically modeling concepts and relationships in data domains and for integrating different data sources.
      </p>
      <p>
        An ontology can be defined as a formal representation of a set of concepts within a domain and the
relationships between those concepts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The Ontology-Based Data Access (OBDA) approach bridges
the semantics of ontologies and data heterogeneity, with mature and widely adopted solutions like
Ontop [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for integrating different data sources.
      </p>
      <p>However, in many application scenarios the available data is not structured, and often resides in files in formats such as Parquet, CSV, and XLSX. In such scenarios, structuring multiple files only to then use a framework like Ontop can mean unnecessary effort spent on data sources that may not be used later on. Ontop can access various types of files through federated databases, but this requires commercial software and plugins and adds complexity in establishing that structure. Ontop is also written in Java, so it does not make use of the data science-focused features found in languages like Python. There is therefore an opportunity to make the OBDA approach and the use of ontologies more accessible, both in terms of application structure and programming language.</p>
      <p>The most widespread mapping standards require specific knowledge to develop. Because the data scenario in question is still in the process of being structured, the use of ontologies for data integration is assumed to be at a similarly early stage. This process can therefore become more accessible by focusing exclusively on the relationship between the classes and properties of the defined ontology and the attributes available in the data sources.</p>
      <p>This paper therefore presents the OntoPy framework, developed for integrating different types of data sources through ontologies in an accessible manner. It addresses the scenario in which an organization is in the process of structuring its data and seeks exploratory analyses to better understand the value that data can provide in decision-making processes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Ontop is the most widespread tool among OBDA solutions. However, it deals with relational databases, either already structured or accessible through federated databases generated by third-party applications. The proposed framework seeks greater versatility, since it targets a scenario in which an organization is still structuring its data.</p>
      <p>The related works show that most solutions operate on relational databases or on files such as CSV, JSON, XML, and RDF (Table 1). Among them, Morph-KGC stands out, mainly because of its similarity to the method proposed here. It can access various types of file systems, as well as pandas DataFrames, and can also generate results in that format. The main difference from the present proposal is that Morph-KGC takes only a TTL file as input, focused on the mapping, which may or may not contain additional statements from an ontology. With a single TTL file, some concepts and practices of using ontologies for data integration are not met, such as reusability, sharing, and portability across multiple platforms, as well as increased maintainability and reliability [16].</p>
      <p>With one dedicated file for the ontology and another for the mapping, as in Ontop, MASTRO, and the proposed OntoPy framework, the OWL file would be the same for all three solutions, and only the mapping would require adjustments. The effort these adjustments demand becomes more evident in extensive ontologies or when dealing with multiple different data sources.</p>
    </sec>
    <sec id="sec-3">
      <title>3. OntoPy framework</title>
      <p>The research method is based primarily on the development of the OntoPy framework for data integration. The framework was required to:
• Be developed in Python, given its widespread use in data science applications, its growing community, and its ongoing advancements;
• Load an ontology in OWL format, following W3C standards;
• Load a mapping file in JSON format, to make this step more accessible in applications where those involved are in the initial stages of learning how to use ontologies for data integration;
• Materialize data from different types of files based on the loaded ontology;
• Provide the materialized knowledge graph in memory at run time for SPARQL queries.</p>
      <p>The code was developed on top of the Owlready2 library, which is used to access the ontology and convert the loaded data into triples (Figure 1). In its current state, the framework can load CSV, XLSX, and Parquet files; other file types will be considered in the future. OntoPy also operates on data held in pandas DataFrame objects, which allows the user to pre-load data sources of other types through Python code and pass them to the framework as DataFrames. The OntoPy framework, its proposal, and an implementation case in a freight railway transportation company are available on GitHub<sup>1</sup>.</p>
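      <p>The materialization step described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not OntoPy's implementation and uses neither Owlready2 nor SPARQL: the TripleStore class, its add_rows and match methods, and the sample rows are hypothetical names and made-up values, and the pattern matching is only a simplified stand-in for a SPARQL query over the in-memory graph.</p>
      <preformat>
```python
from typing import Iterable, Optional

Triple = tuple[str, str, str]


class TripleStore:
    """Tiny in-memory triple store (illustrative stand-in, not Owlready2)."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()

    def add_rows(self, rows: Iterable[dict], subject_col: str,
                 subject_class: str, predicates: dict[str, str]) -> None:
        """Materialize tabular rows as (subject, predicate, object) triples."""
        for row in rows:
            subject = f"{subject_class}/{row[subject_col]}"
            self._triples.add((subject, "rdf:type", subject_class))
            for column, predicate in predicates.items():
                value = row.get(column)
                if value is not None:
                    self._triples.add((subject, predicate, str(value)))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> list[Triple]:
        """Return triples matching a pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


# Made-up rows, shaped as they might arrive from a CSV, XLSX, or Parquet
# file (for example via pandas.DataFrame.to_dict("records")).
rows = [
    {"Equipment": "L001", "Model": "ModelA"},
    {"Equipment": "L002", "Model": "ModelB"},
]

store = TripleStore()
store.add_rows(rows, subject_col="Equipment", subject_class="Locomotive",
               predicates={"Model": "hasModel"})

# Pattern query: every triple whose predicate is hasModel.
models = store.match(p="hasModel")
print(models)
```
</preformat>
      <p>Keeping the graph in memory, as in this sketch, makes queries available right after loading, at the cost of memory that grows with the number of materialized triples.</p>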
    </sec>
    <sec id="sec-4">
      <title>4. The mapping file</title>
      <p>The proposed framework for accessing different file systems also introduces a mapping file in JSON format. The aim of this approach is to explore an alternative that makes the mapping simpler to understand and to create. Besides serving as a guide for constructing triples, the mapping file defines which data should be used in the application. Therefore, even if a data source and the ontology have many attributes, the mapping file may include only the minimum necessary to form a triple.</p>
      <p>Listing 1: Structure of the mapping files for OntoPy in JSON format</p>
      <preformat>
 1  {
 2    "test_database" :
 3    {
 4      "data_source_path": "C:\\test.csv",
 5      "separator": ";",
 6      "decimal": ",",
 7      "triples":
 8      [
 9        {
10          "subject":
11          {
12            "data_source_attribute_name": "Equipment",
13            "ontology_subject_name": "Locomotive"
14          },
15          "predicates_and_objects":
16          [
17            {
18              "data_source_attribute_name": "Model",
19              "ontology_predicate_name": "hasModel",
20              "ontology_object_name": "Locomotive_model"
    ...,
</preformat>
      <p><sup>1</sup> https://github.com/pedropdomingos/OntoPy</p>
      <p>The indices below refer to the line numbers of Listing 1:
• Index 2: Identifier for the mapped data source;
• Index 4: Path of the data source;
• Index 5: Character defining the column separation in CSV files;
• Index 6: Character defining the decimal separation in numerical values in CSV files;
• Index 12: Identification of the attribute in the data source treated as the subject;
• Index 13: Identification of the corresponding ontology class that the subject attribute refers to;
• Index 18: Identification of the attribute in the data source treated as the object;
• Index 19: Identification of the corresponding ontology property responsible for the relationship between the subject and the object;
• Index 20: Identification of the corresponding ontology class that the object attribute refers to;
• Index 23: Next mapping of the predicate for the corresponding subject;
• Index 27: Next sequence of triples with the mapping of a subject and its predicates;
• Index 31: Next data source to be mapped.</p>
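      <p>To make the role of each mapping field concrete, the following minimal sketch reads a mapping shaped like the visible part of Listing 1 and applies it to a small CSV. It is an illustration under stated assumptions, not OntoPy's code: the way the elided lines of the listing are closed, the relative path, the inline CSV rows, and the function triples_from_mapping are all hypothetical. Only the Python standard library is used.</p>
      <preformat>
```python
import csv
import io
import json

# Mapping mirroring the visible part of Listing 1; how the elided lines
# continue is an assumption made only for this sketch.
MAPPING_JSON = """
{
  "test_database": {
    "data_source_path": "test.csv",
    "separator": ";",
    "decimal": ",",
    "triples": [
      {
        "subject": {
          "data_source_attribute_name": "Equipment",
          "ontology_subject_name": "Locomotive"
        },
        "predicates_and_objects": [
          {
            "data_source_attribute_name": "Model",
            "ontology_predicate_name": "hasModel",
            "ontology_object_name": "Locomotive_model"
          }
        ]
      }
    ]
  }
}
"""

# Inline stand-in for the CSV file named in the mapping (made-up rows).
# The Depot column is deliberately left unmapped.
CSV_TEXT = "Equipment;Model;Depot\nL001;ModelA;North\nL002;ModelB;South\n"


def triples_from_mapping(source_cfg, rows):
    """Yield (subject, predicate, object) triples for one mapped source.

    Subjects and objects are (ontology class name, attribute value) pairs.
    """
    for row in rows:
        for block in source_cfg["triples"]:
            subj = block["subject"]
            subject_id = (subj["ontology_subject_name"],
                          row[subj["data_source_attribute_name"]])
            for po in block["predicates_and_objects"]:
                object_id = (po["ontology_object_name"],
                             row[po["data_source_attribute_name"]])
                yield (subject_id, po["ontology_predicate_name"], object_id)


mapping = json.loads(MAPPING_JSON)
cfg = mapping["test_database"]
reader = csv.DictReader(io.StringIO(CSV_TEXT), delimiter=cfg["separator"])
triples = list(triples_from_mapping(cfg, reader))
for triple in triples:
    print(triple)
```
</preformat>
      <p>Because only the attributes named in the mapping are read, the unmapped Depot column produces no triples, which matches the idea that the mapping may include only the minimum necessary to form a triple.</p>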
    </sec>
    <sec id="sec-5">
      <title>5. Tests</title>
      <p>A comparative performance test was conducted against Morph-KGC, precisely because of its similarities with the proposed OntoPy framework. The idea was to take a simple, easy-to-use test from Morph-KGC’s GitHub repository and adapt it to OntoPy in order to compare the performance of the two frameworks. Among the available tests, one based on the number of Instagram followers is very accessible because it uses pandas DataFrames as the data to integrate. To generate the data, a script was developed that produces identifiers from zero up to the desired amount, repeats data such as first name, last name, and username on the platform, and draws random numbers for the number of followers. The comparison was limited to two measurements: the processing time for inserting data from the data sources into the ontology or knowledge graph, and the time for performing a simple SPARQL query. The query returned the user identifiers and their respective numbers of followers.</p>
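      <p>The data generation and timing procedure described above can be sketched as follows. This is a hedged illustration, not the script used in the experiment: the column names, the placeholder name values, the follower range, the seed, and the functions generate_users, timed, and load_and_query are assumptions, and load_and_query is a plain-Python stand-in for loading a knowledge graph and running the SPARQL query.</p>
      <preformat>
```python
import random
import time


def generate_users(n, seed=0):
    """Build n synthetic rows: sequential ids, repeated names, random followers."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    return [
        {
            "Id": i,
            "FirstName": "Jane",      # placeholder repeated on every row
            "LastName": "Doe",
            "Username": "jane_doe",
            "Followers": rng.randint(0, 1_000_000),
        }
        for i in range(n)
    ]


def timed(func, *args):
    """Return (result, elapsed_seconds) for a single call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def load_and_query(rows):
    """Stand-in for materialize-then-query: index rows, list (id, followers)."""
    graph = {row["Id"]: row for row in rows}
    return [(uid, row["Followers"]) for uid, row in graph.items()]


rows = generate_users(1000)
answers, seconds = timed(load_and_query, rows)
print(f"{len(answers)} results in {seconds:.6f} s")
```
</preformat>
      <p>In a comparison of this kind, the same generated rows would be passed to each framework so that differences in elapsed time reflect the frameworks rather than the data.</p>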
      <p>The initial tests of OntoPy have shown promising results. In the comparative test with Morph-KGC, OntoPy stood out in performance in scenarios with a large volume of data (Table 2), which are the conditions under which the need for performance becomes most apparent. Both frameworks returned the same query results, which also supports the reliability of the proposed approach.</p>
      <p>The complete implementation for a real case, a company in the railway sector focused on diesel-electric locomotives, together with the application of a query, can be seen in the project’s GitHub repository mentioned earlier.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and results</title>
      <p>OntoPy has been used for querying data in the domain of diesel-electric locomotive maintenance. It
has been possible to perform queries that are challenging due to the heterogeneity of the data sources,
both in terms of attributes and structures, and it has shown good performance. It is understood that
there is room for improvement, particularly in the stage of populating the materialized ontology with
instances, but the solution has already yielded consistent results.</p>
      <p>The efficiency of the framework may be improved by using threads and multiprocessing resources. Additionally, the use of the Owlready2 library should be re-evaluated in terms of performance.</p>
      <p>[15] S. Kamm, N. Jazdi, M. Weyrich, Knowledge discovery in heterogeneous and unstructured data of industry 4.0 systems: challenges and approaches, Procedia CIRP 104 (2021) 975–980.</p>
      <p>[16] H. Li, Ontology-Driven Data Access and Data Integration with an Application in the Materials Design Domain, Ph.D. thesis, Linköping University Electronic Press, 2022.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          ,
          <article-title>Using ontologies for semantic data integration</article-title>
          ,
          <source>A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years</source>
          (
          <year>2018</year>
          )
          <fpage>187</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dorneanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , H. Ruan,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heshmat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Arellano-Garcia</surname>
          </string-name>
          ,
          <article-title>Big data and machine learning: A roadmap towards smart plants</article-title>
          ,
          <source>Frontiers of Engineering Management</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>623</fpage>
          -
          <lpage>639</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lanti</surname>
          </string-name>
          ,
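          <string-name>
            <given-names>T. M.</given-names>
            <surname>De Farias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mosca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,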
          <article-title>Accessing scientific data through knowledge graphs with ontop</article-title>
          ,
          <source>Patterns</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>100346</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cogrel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Komla-Ebri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kontchakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rezk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodriguez-Muro</surname>
          </string-name>
          , G. Xiao,
          <article-title>Ontop: Answering SPARQL queries over relational databases</article-title>
          ,
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <year>2017</year>
          )
          <fpage>471</fpage>
          -
          <lpage>487</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , G. De Giacomo,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lembo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenzerini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodriguez-Muro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rosati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ruzzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Savo</surname>
          </string-name>
          ,
          <article-title>The mastro system for ontology-based data access</article-title>
          ,
          <source>Semantic Web</source>
          <volume>2</volume>
          (
          <year>2011</year>
          )
          <fpage>43</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Arenas-Guerrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Morph-KGC: Scalable knowledge graph materialization with mapping partitions</article-title>
          ,
          <source>Semantic Web</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . doi:10.3233/SW-223135.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaves-Fraga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pozo-Gilo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Toledo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ruckhaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <article-title>Morph-CSV: Virtual knowledge graph access for tabular data</article-title>
          , in: ISWC (Demos/Industry),
          <year>2020</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Priyatna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <article-title>Formalisation and experiences of r2rml-based sparql to sql query translation using morph</article-title>
          ,
          <source>in: Proceedings of the 23rd international conference on World wide web</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>479</fpage>
          -
          <lpage>490</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Djimenou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Zucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Montagnat</surname>
          </string-name>
          ,
          <article-title>Translation of relational and non-relational databases into rdf with xr2rml</article-title>
          ,
          <source>in: 11th International Conference on Web Information Systems and Technologies (WEBIST'15)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>443</fpage>
          -
          <lpage>454</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Lamy</surname>
          </string-name>
          ,
          <article-title>Ontologies with Python: Programming OWL 2.0 Ontologies with Python and Owlready2</article-title>
          , Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vander Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          , E. Mannens, R. Van de Walle,
          <article-title>RML: A generic language for integrated RDF mappings of heterogeneous data</article-title>
          ,
          <source>LDOW</source>
          <volume>1184</volume>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Slepicka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Szekely</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          ,
          <article-title>KR2RML: An alternative interpretation of R2RML for heterogenous sources</article-title>
          , in: COLD,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mauri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Calbimonte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dell'Aglio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balduini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brambilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Della Valle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          ,
          <article-title>TripleWave: Spreading RDF streams on the web</article-title>
          , in:
          <source>The Semantic Web - ISWC 2016: 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part II</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>140</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Daga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Asprino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulholland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          , et al.,
          <article-title>Facade-x: an opinionated approach to sparql anything</article-title>
          ,
          <source>Studies on the Semantic Web</source>
          <volume>53</volume>
          (
          <year>2021</year>
          )
          <fpage>58</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>