<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Knowledge Graph Approach for Dataset Metadata Harmonisation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paula Peña-Larena</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María del Carmen Rodríguez-Hernández</string-name>
          <email>rdelhoyo@ita.es</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis García-Garcés</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rosa M. Montañés-Salas</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael del-Hoyo-Alonso</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff3">
          <label>3</label>
          <institution>Technological Institute of Aragon (ITA)</institution>
          ,
          <addr-line>María de Luna 7-8, Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The importance of data in Europe's economy, industry, and society is widely recognized, with data-driven innovation playing a crucial role in fostering competitiveness. To make more data available for business use, the concept of data spaces has emerged as a key strategy. In this context, delivering high-quality data-driven services has become imperative. This paper presents an automatic semantic approach for harmonising data for seamless integration by enriching a reference ontology using metadata and textual content. Natural Language Processing (NLP) techniques and transformer-based linguistic models are employed for this purpose, but our approach goes further by leveraging additional knowledge bases to identify and incorporate new interconnected concepts and relationships into a knowledge graph. As part of the European Data Innovation Hub initiative, our approach's effectiveness in automating data harmonisation and enriching the knowledge domain is validated by experimental results derived from extensive dataset analysis and assessment of the generated knowledge graph's quality.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge graph</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transformers</kwd>
        <kwd>Data spaces</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Data spaces offer different levels of information depending on the availability and accessibility of the data, providing a single access point to a common data space. For this reason, it is necessary to devise a strategy or methodology for dynamically harmonising data sources, datasets and metadata.</p>
      <p>This work presents a scalable semantic technique for automatically harmonising and enriching datasets from data catalogues and textual content. It utilizes a reference ontology to model datasets, extracting metadata through an API and leveraging textual data, such as the "description" property, through semantic and NLP techniques and transformer-based linguistic models. The result is a structured knowledge graph that not only maps metadata but also enriches it by extracting additional insights from textual information. Furthermore, the approach facilitates enhancing descriptions of entities by integrating concepts from other ontologies, like DBpedia, related to the named entities extracted from unstructured text.</p>
      <p>
        Considering the significant challenge of enhancing knowledge graphs automatically [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] without human supervision, semantic approaches have been adopted. A novel harmonisation algorithm, detailed in several steps (Figure 1), has been proposed to enrich the IDS Model by leveraging metadata or textual data from datasets to extract additional insights. Typically, knowledge graphs are constructed manually, which demands extensive time and expertise from domain specialists. Hence, the importance and implications of this challenge are profound, particularly given the complexity of automatically building a knowledge graph from unstructured content. Moreover, the quality evaluation of the resulting knowledge graph is crucial, as it directly influences the provision of superior content.
      </p>
      <p>This paper is organized as follows. In Section 2, the
proposed approach is described. The reference ontology
used is described in Section 3. We discuss experimental
results in Section 4. Finally, in Section 5, conclusions and
future work are presented.</p>
    </sec>
    <sec>
      <title>2. Datasets Metadata Harmonisation Approach</title>
      <p>In the current landscape, the delivery of data-driven services requires a strategic approach to harmonise and standardise data, ensuring seamless integration across diverse sources and domains. These sources cover a broad spectrum of regions and application domains (open, industrial, personal, research, and more). As a result, datasets often exhibit a multitude of formats, data types and licensing arrangements. In addition, various metadata schemas (e.g., Dublin Core, VoID, OMV, DCAT, MOD) are applied to describe these datasets, adding another layer of complexity to the data landscape.</p>
      <p>In this context, the approach followed by ITA involves adding a semantic layer to enhance data harmonisation and enrich datasets with additional insights extracted from text. A dataset is conceptualized as a DataResource using an RDF/OWL ontology based on the widely recognized International Data Spaces (IDS) Information Model. This ontology delegates domain modelling to shared vocabularies and data schemas (DCAT, SKOS, FOAF, Owl-Time, among others). This model pursues the goal of establishing an ecosystem that facilitates secure, trusted, and semantically interoperable data exchange. It defines essential concepts for describing actors within a data space, their interactions, the resources exchanged and data usage restrictions.</p>
      <p>Figure 1: Harmonisation algorithm.</p>
      <p>Our algorithm discriminates between metadata and textual data obtained through an API from datasets. The dataset information encompasses various attributes such as name, whether it is a repository, description, domain, location, license, formats, privacy, publisher, language, etc. To enhance the metadata pre-processing and mapping stages, we introduce NLP techniques aimed at identifying semantic similarities and retrieving knowledge graph properties to improve information filtering.</p>
      <p>During the metadata mapping process, a search is carried out for the data type properties within the reference ontology that are most akin to a given metadata property, particularly those related to the DataResource class and its super-classes. If the input metadata matches any property in the reference ontology, the knowledge graph is enriched with the property description. Otherwise, knowledge-based semantic similarity metrics are applied to compute similarity scores between concepts, words, and entities. If these scores surpass a predefined threshold, the knowledge graph is enriched with the metadata-provided information. Certain datasets include a property that describes their contents, typically denoted as the description property. In such cases, the process starts with the extraction of textual data, followed by pre-processing steps such as parsing, segmentation into sentences, and augmentation with metadata attributes like Part-of-Speech (POS) tags.</p>
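      <p>The mapping rule described above can be sketched as follows. This is a minimal illustration: difflib string similarity stands in for the knowledge-based semantic similarity metrics, and the property list is an illustrative subset, not the actual IDS Information Model properties.</p>
      <preformat>
```python
from difflib import SequenceMatcher

# Illustrative subset of reference-ontology data type properties for
# DataResource and its super-classes (not the actual IDS property list).
ONTOLOGY_PROPERTIES = ["title", "description", "language", "publisher", "license"]

def map_metadata_property(metadata_key, threshold=0.7):
    """Return the most akin ontology property, or None.

    Exact matches are accepted directly; otherwise the best similarity
    score must surpass the predefined threshold, mirroring the rule in
    the text (string similarity stands in for semantic similarity).
    """
    if metadata_key in ONTOLOGY_PROPERTIES:
        return metadata_key  # direct match: enrich with the property description
    best, score = None, 0.0
    for prop in ONTOLOGY_PROPERTIES:
        s = SequenceMatcher(None, metadata_key.lower(), prop).ratio()
        if s > score:
            best, score = prop, s
    return best if score >= threshold else None

print(map_metadata_property("licence"))  # close variant maps to "license"
print(map_metadata_property("foo"))      # below threshold: None
```
      </preformat>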
      <p>
        In the matching step, attention mechanisms are used to infer candidate triplets, leveraging transformer-based language models like BERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the first instance, we have used bert-base-uncased for English text and bert-base-spanish for Spanish text from the Huggingface library.
      </p>
      <p>An attention mechanism replicates human selective attention in neural networks, focusing on pertinent information while disregarding irrelevant details. It operates by converting two sentences into a matrix, where words from one sentence form columns and words from the other form rows, and then identifies relevant contextual relationships. The attention weights, learned by pre-trained transformer-based models during training, capture connections between terms in the text (see Figure 2). These weights suggest potential RDF triplets (subject, predicate, object), which are further refined through filtering using constraints such as a threshold or frequency (STEP 3). Next, using an inverted index, suitable classes are identified to populate the knowledge graph stored in a Neo4j graph database.</p>
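      <p>The matching and filtering steps can be illustrated with a toy example. The attention weights below are hand-set stand-ins for weights obtained from a pre-trained BERT model, and the noun-verb-noun constraint is a simplification of the POS-based filtering described above.</p>
      <preformat>
```python
# Toy sketch of STEPs 2-3: attention weights connect terms of a sentence;
# strongly connected noun-verb-noun combinations become candidate RDF
# triplets, filtered by a weight threshold. Weights are hand-set here,
# standing in for attention extracted from a pre-trained BERT model.

words = ["dataset", "contains", "housing", "data"]
pos = ["NOUN", "VERB", "NOUN", "NOUN"]  # POS tags added during pre-processing
attention = [  # attention[i][j]: how strongly word i attends to word j
    [0.05, 0.70, 0.15, 0.10],
    [0.30, 0.05, 0.55, 0.10],
    [0.20, 0.30, 0.05, 0.45],
    [0.25, 0.25, 0.45, 0.05],
]

def candidate_triplets(words, pos, attention, threshold=0.4):
    """Keep (subject, predicate, object) candidates whose subject-predicate
    and predicate-object attention links both surpass the threshold."""
    out = []
    n = len(words)
    for s in range(n):
        for p in range(n):
            for o in range(n):
                keep = (
                    len({s, p, o}) == 3
                    and pos[s] == "NOUN" and pos[p] == "VERB" and pos[o] == "NOUN"
                    and attention[s][p] >= threshold
                    and attention[p][o] >= threshold
                )
                if keep:
                    out.append((words[s], words[p], words[o]))
    return out

print(candidate_triplets(words, pos, attention))  # [('dataset', 'contains', 'housing')]
```
      </preformat>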
      <p>The process applies, among others, the cosine similarity algorithm to DBpedia classes. If values differ, the path similarity semantic metric, which calculates the shortest path between two words in WordNet's representation (synsets), is used. When candidate triplet values cannot be mapped to the IDS reference ontology or DBpedia, a multilingual Named-Entity Recognition (NER) model is integrated. This model searches for nodes in additional ontologies related to the NER predictions, generating new relationships and expanding the knowledge graph.</p>
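      <p>The path similarity metric can be sketched over a miniature taxonomy standing in for WordNet's hypernym graph; the class names below are illustrative, not WordNet synsets.</p>
      <preformat>
```python
from collections import deque

# Miniature hypernym taxonomy standing in for WordNet synsets. WordNet's
# path similarity is 1 / (1 + length of the shortest path between two
# synsets in the hypernym graph).
TAXONOMY = {
    "entity": ["resource", "organization"],
    "resource": ["dataset", "document"],
    "organization": ["council"],
    "dataset": [],
    "document": [],
    "council": [],
}

def path_similarity(a, b):
    """Return 1 / (1 + shortest undirected path between a and b)."""
    graph = {}
    for parent, children in TAXONOMY.items():
        for child in children:
            graph.setdefault(parent, set()).add(child)
            graph.setdefault(child, set()).add(parent)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1.0 / (1.0 + dist)
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0  # unreachable: no similarity

print(path_similarity("dataset", "document"))  # siblings, distance 2
print(path_similarity("dataset", "council"))   # distance 4
```
      </preformat>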
      <p>In STEP 5, harmonised metadata and enriched
information might be transferred to another system, such as
CKAN, for further use. CKAN serves as an open-source
Data Management System (DMS), facilitating the
publication, sharing, and utilization of data in hubs and portals.</p>
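      <p>STEP 5 could be sketched against CKAN's Action API (package_create) as follows; the endpoint, API key, and field mapping are placeholders for illustration, not the project's actual configuration.</p>
      <preformat>
```python
import json
from urllib import request

CKAN_URL = "https://demo.ckan.org"  # illustrative endpoint
API_KEY = "REPLACE-WITH-API-KEY"    # placeholder

def to_ckan_package(resource):
    """Map a harmonised DataResource record onto CKAN's package schema."""
    return {
        "name": resource["name"].lower().replace(" ", "-"),  # CKAN slug
        "title": resource["name"],
        "notes": resource.get("description", ""),
        "license_id": resource.get("license", "notspecified"),
    }

def publish(resource):
    """POST the package to CKAN's package_create action (STEP 5)."""
    body = json.dumps(to_ckan_package(resource)).encode()
    req = request.Request(
        CKAN_URL + "/api/3/action/package_create",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": API_KEY},
    )
    return request.urlopen(req)  # CKAN returns the created package as JSON

package = to_ckan_package({
    "name": "Barcelona Territory",
    "description": "Town planning datasets",
    "license": "cc-by",
})
print(package)
```
      </preformat>
      <p>A real deployment would typically use the ckanapi client library and add error handling around the HTTP call.</p>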
    </sec>
    <sec id="sec-2">
      <title>3. Reference ontology</title>
      <p>A reference ontology is composed of classes (types of entities), relationships (connections between classes), and attributes (properties of entities). By using a reference ontology framework, real-world data can be integrated to form a knowledge graph. For the scenario involving the EuHubs4Data project, our base knowledge graph is built on the IDS Information Model ontology, which allows modeling datasets as DataResource entities and incorporates various metadata attributes such as dataset representation, dates, language, and source. This ontology establishes fundamental concepts and extends them using external ontologies like DCAT, SKOS, FOAF, Owl-Time, and PROV. It also enables the definition of metadata for resources, such as datasets in our case. This reference model provides a structured framework for modeling datasets, as illustrated in Figure 4.</p>
      <p>Figure 4: IDS Information Model.</p>
      <p>In addition, our knowledge graph remains dynamic, as it continuously incorporates new concepts and relationships identified from various external sources, including DBpedia and other ontologies, aligning with NER prediction labels like organization, person, location, and time. Furthermore, our approach can be extended to integrate with other collective knowledge bases. With the approach presented before, alternative reference ontologies could be employed in different contexts (i.e., Tourism).</p>
    </sec>
    <sec>
      <title>4. Experimental Results</title>
      <p>To evaluate our automated approach for harmonizing and enriching datasets through knowledge graph generation, 231 datasets from the catalogue (https://euhubs4data.eu/datasets/) provided by EuHubs4Data project members and open data sources have been used. This collection comprises 83 repositories and 148 single datasets exhibiting diversity in domains, coverage, formats, personal data protection levels, available URLs (public or private), and languages.</p>
      <p>Comparing basic metrics between the reference ontology and the automatically enriched knowledge graph (such as the number of classes, instances, attributes, and class relationships), as presented in Table 1, reveals notable differences. The automated knowledge graph has grown significantly in its population of elements. This enrichment is attributed to extracting additional insights from textual information within dataset metadata properties and to leveraging supplementary knowledge repositories, vocabularies, and ontologies (i.e., DBpedia).</p>
      <p>In an experiment showcasing our approach, we examine results from a subset of 18 datasets within the catalogue containing the term Barcelona, mapped to the knowledge graph as DataResource. These datasets cover various topics, such as Barcelona-Territory, Barcelona-Population, Barcelona-Administration, Barcelona-Economy and Business, and Barcelona-City and services, with descriptions ranging from housing and town planning to demographics and education.</p>
      <p>Through analysis of the datasets' metadata and textual content, we observe the enrichment of the knowledge graph with new classes and relationships. Notably, these datasets are associated with the new concept of the "Barcelona City Council," which in turn links to concepts like Housing, Urban_planning, Demography, Education, and more. This demonstrates the ability of our approach to capture and represent relevant concepts and relationships from datasets, facilitating a deeper understanding of the data landscape.</p>
      <p>Similarly, as depicted in Figure 5, the concept of human resources is linked to the previously mentioned Barcelona-Administration and Barcelona City Council classes, along with other new concepts, particularly Open data. Upon further exploration, it becomes evident that the graph is enhanced with additional sub-graphs derived from insights extracted from other datasets or repositories regarding the Open data concept, using the aforementioned knowledge bases and additional ontologies. In addition, by navigating through classes such as Energy, linked to the Open Data class, one can access further graphs concerning resource items. As a result, the enriched knowledge graph, which establishes connections not previously defined, is stored in a Neo4j graph database. This enables data retrieval and visualization through applications such as the web application developed by us (https://euhub4data-graphs.itainnova.es/).</p>
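      <p>Exploring the enriched graph could look as follows with the Neo4j Python driver; the connection details, node labels, and relationship types are illustrative, not the actual schema of our database, and running the query requires a live Neo4j instance.</p>
      <preformat>
```python
# Illustrative Cypher query: find datasets linked to a given concept
# (labels and relationship type are assumptions for this sketch).
QUERY = (
    "MATCH (d:DataResource)-[:RELATED_TO]->(c:Concept {name: $name}) "
    "RETURN d.name AS dataset"
)

def datasets_linked_to(concept, uri="bolt://localhost:7687",
                       auth=("neo4j", "password")):
    """Run the query against a running Neo4j instance (pip install neo4j)."""
    from neo4j import GraphDatabase
    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _, _ = driver.execute_query(QUERY, name=concept)
        return [record["dataset"] for record in records]

# datasets_linked_to("Barcelona City Council")  # requires a running database
print(QUERY)
```
      </preformat>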
      <p>Moreover, to foster higher quality and better outcomes, we have worked on assessing the quality of the generated knowledge graph. In recent years, ontology quality assessment has become a key aspect of ontology development and reuse.</p>
      <p>
        The results achieved allow the expert to identify areas that might require further refinement. While various approaches exist [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], in our case the oQuARE framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is used. The oQuARE framework employs diverse metrics across various dimensions (structural, operability, reliability, transferability, functional adequacy, maintainability, and compatibility) to evaluate ontology quality.
      </p>
      <p>Figure 6: Quality assessment results.</p>
      <p>Each dimension's assessment relies on its quality sub-characteristics, which are evaluated using associated metrics. For instance, the structural dimension is assessed based on cohesion, consistency, formal relations support, formalization, redundancy, and tangledness, involving metrics like ANOnto, PROnto, TMOnto, and LCOMOnto.</p>
      <p>To evaluate ontology quality, well-known metrics (LCOMOnto, WMCOnto, DITOnto, NACOnto, NOCOnto, CBOOnto, RFCOnto, NOMOnto, RROnto, AROnto, INROnto, CROnto, ANOnto, TMOnto) are depicted in a radar chart (Figure 6), comparing scores between the reference ontology (IDS Model) and the enriched knowledge graph. Additionally, metrics associated with the oQuaRE dimensions are represented in another chart, with calculations similar to those in an available tool (http://sele.inf.um.es/ontology-metrics).</p>
      <p>The comparison between the two ontologies reveals notable quality discrepancies. The reference ontology achieved higher scores than the automatically generated ontology. While some metrics related to attribute and class richness, as well as object coupling, yielded similar values, overall the automatic ontology exhibited areas requiring improvement.</p>
      <p>
        Moreover, using the OOPS! tool [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to identify issues in ontologies uncovered shared pitfalls between the reference and the enriched ontology. These included missing domain or range in properties, misused ontology annotations, undeclared equivalent classes, multiple classes with identical labels, and untyped properties. Addressing these issues in the reference ontology before enriching the knowledge graph could enhance overall quality. The quality of results delivered by data-driven services depends on the quality of the ontological model built.
      </p>
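      <p>Two of the structural metrics can be sketched over a toy ontology; the formulas below are simplified readings of annotation richness (ANOnto) and relationship richness (RROnto), not the oQuARE implementation, and the ontology contents are invented for illustration.</p>
      <preformat>
```python
# Toy ontology: each class records its annotation count and its superclass.
classes = {
    "DataResource": {"annotations": 2, "subclass_of": None},
    "Dataset": {"annotations": 1, "subclass_of": "DataResource"},
    "Repository": {"annotations": 0, "subclass_of": "DataResource"},
}
object_properties = ["publishedBy", "relatedTo", "hasLicense"]

def anonto(classes):
    """Simplified annotation richness: mean annotations per class."""
    return sum(c["annotations"] for c in classes.values()) / len(classes)

def rronto(classes, object_properties):
    """Simplified relationship richness: object properties over
    subclass links plus object properties."""
    subclass_links = sum(1 for c in classes.values() if c["subclass_of"])
    return len(object_properties) / (subclass_links + len(object_properties))

print(anonto(classes))                     # 3 annotations / 3 classes = 1.0
print(rronto(classes, object_properties))  # 3 / (2 + 3) = 0.6
```
      </preformat>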
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions and Future Work</title>
      <p>In recent years, the exponential growth of data production has underlined the significance and impact of data in the economy, industry, and society. Social trends towards openness and sharing further emphasise the transformative potential of data in the global economy and society.</p>
      <p>Providing quality data-driven services has thus become a critical business strategy. Data comes from multiple sources, across different application domains, regions, data types, formats and licences. Given these diverse sources, formats, and licenses, there is a need for a strategy to harmonise and standardise data for seamless integration. This paper introduces a semantic approach based on the automatic generation of a knowledge graph from metadata and textual content. Leveraging NLP techniques, transformer-based linguistic models and the support of additional knowledge bases and ontologies, the approach aims to facilitate interoperability between data-driven sources by extracting knowledge from texts and enriching a reference ontology automatically. It follows a series of steps including information extraction, pre-processing of textual content, matching to infer candidate triplets, filtering these triplets, and the mapping and integration of triplets or metadata into the knowledge graph.</p>
      <p>The experimental results demonstrate an innovative approach for automatically building knowledge graphs from natural language text within a specific domain.</p>
      <p>However, the quality assessment conducted, combining different well-known quality metrics, has identified defects or errors in the ontology-learning process that need to be addressed in order to improve the quality of the knowledge graph and thus the quality of the results delivered by data-driven services.</p>
      <p>As future work, we aim to extract knowledge from various textual sources and automatically generate a high-quality knowledge graph. This graph will serve as a semantic layer for data harmonisation and interoperability that might empower recommendation systems with semantic understanding, data quality assurance, cross-domain integration, context-awareness, explainability and adaptive learning capabilities. By harnessing the semantic richness of the knowledge graph, recommendation systems could provide more accurate, relevant, insightful and personalized recommendations that enhance user satisfaction and drive engagement across diverse domains and contexts (Tourism, Health,...). In addition, with the emergence of new large language models like GPT-4, there is an opportunity to explore and test models to assess the approach and adapt it to specific use cases.</p>
    </sec>
    <sec>
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the Department of Big Data and Cognitive Systems at the Technological Institute of Aragon, by the IODIDE group of the Government of Aragon, grant number T1720R, and by the European Regional Development Fund (ERDF).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Glennon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kolding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundbland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Micheletti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Raczko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Freitaso</surname>
          </string-name>
          ,
          <source>European DATA Market Study 2021-2023. D2.4 Second Report on Facts and Figures</source>
          ,
          <source>Technical Report, IDC and Lisbon Council</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>BDVA</surname>
          </string-name>
          ,
          <article-title>Big Data Value cPPP</article-title>
          ,
          <source>Technical Report, Big Data Value Association</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mertens</surname>
          </string-name>
          ,
          <source>The Data Spaces Radar</source>
          ,
          <source>Technical Report, International Data Spaces Association</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mellal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Guerram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bouhalassa</surname>
          </string-name>
          ,
          <article-title>An Approach for Automatic Ontology Enrichment from Texts</article-title>
          , Informatica (Slovenia)
          <volume>45</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Amani</surname>
          </string-name>
          <string-name>
            <surname>Drissi</surname>
          </string-name>
          , Ahmed Khemiri,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chbeir</surname>
          </string-name>
          ,
          <article-title>A New Automatic Ontology Construction Method Based on Machine Learning Techniques: Application on Financial Corpus</article-title>
          ,
          <source>Proceedings of the 13th International Conference on Management of Digital EcoSystems</source>
          (
          <year>2021</year>
          )
          <fpage>57</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Relphormer:
          <article-title>Relational graph transformer for knowledge graph representations</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>566</volume>
          (
          <year>2023</year>
          )
          <fpage>127044</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <source>Proceedings Conference of the North American Chapter of the Association for Computational Linguistics</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Roldán-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ruano-Ordás</surname>
          </string-name>
          , V. BastoFernandes,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Méndez</surname>
          </string-name>
          ,
          <article-title>An ontology knowledge inspection methodology for quality assessment and continuous improvement</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          <volume>133</volume>
          (
          <year>2021</year>
          )
          <fpage>101889</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Duque-Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Fernández-Breis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iniesta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          , et al.,
          <article-title>Evaluation of the OQuaRE framework for ontology quality</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>40</volume>
          (
          <year>2013</year>
          )
          <fpage>2696</fpage>
          -
          <lpage>2703</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Poveda-Villalón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gomez-Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Suárez-Figueroa</surname>
          </string-name>
          ,
          <article-title>OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology Evaluation</article-title>
          ,
          <source>International Journal on Semantic Web and Information Systems (IJSWIS) 10</source>
          (
          <year>2014</year>
          )
          <fpage>7</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>