<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Knowledge Graph Approach for Dataset Metadata Harmonisation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paula Peña-Larena</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María del Carmen Rodríguez-Hernández</string-name>
          <email>rdelhoyo@ita.es</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis García-Garcés</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rosa M. Montañés-Salas</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael del-Hoyo-Alonso</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff3">
          <label>3</label>
          <institution>Technological Institute of Aragon (ITA)</institution>
          ,
          <addr-line>María de Luna 7-8, Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The importance of data in Europe's economy, industry, and society is widely recognized, with data-driven innovation playing a crucial role in fostering competitiveness. To make more data available for business use, the concept of data spaces has emerged as a key strategy. In this context, delivering high-quality data-driven services has become imperative. This paper presents an automatic semantic approach for harmonising data for seamless integration by enriching a reference ontology using metadata and textual content. Natural Language Processing (NLP) techniques and transformer-based linguistic models are employed for this purpose, but our approach goes further by leveraging additional knowledge bases to identify and incorporate new interconnected concepts and relationships into a knowledge graph. As part of the European Data Innovation Hub initiative, our approach's effectiveness in automating data harmonisation and enriching the knowledge domain is validated by experimental results derived from extensive dataset analysis and assessment of the generated knowledge graph's quality.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge graph</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transformers</kwd>
        <kwd>Data spaces</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Data spaces offer different levels of information depending on the availability and accessibility of the data, providing a single access point to a common data space. For this reason, it is necessary to devise a strategy or methodology for dynamically harmonising data sources, datasets and metadata.</p>
      <p>This work presents a scalable semantic technique for automatically harmonising and enriching datasets from data catalogues and textual content. It utilizes a reference ontology to model datasets, extracting metadata through an API and leveraging textual data, such as the "description" property, through semantic and NLP techniques and transformer-based linguistic models. The result is a structured knowledge graph that not only maps metadata but also enriches it by extracting additional insights from textual information. Furthermore, the approach facilitates enhancing descriptions of entities by integrating concepts from other ontologies, like DBpedia, related to the named entities extracted from unstructured text.</p>
      <p>
        Considering the significant challenge of enhancing knowledge graphs automatically [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] without human supervision, semantic approaches have been adopted. A novel harmonisation algorithm, detailed in several steps (Figure 1), has been proposed to enrich the IDS Model by leveraging metadata or textual data from datasets to extract additional insights. Typically, knowledge graphs are constructed manually, which demands extensive time and expertise from domain specialists. Hence, the importance and implications of this challenge are profound, particularly given the complexity of automatically building a knowledge graph from unstructured content. Moreover, the quality evaluation of the resulting knowledge graph is crucial, as it directly influences the provision of superior content.
      </p>
      <p>This paper is organized as follows. In Section 2, the
proposed approach is described. The reference ontology
used is described in Section 3. We discuss experimental
results in Section 4. Finally, in Section 5, conclusions and
future work are presented.</p>
    </sec>
    <sec>
      <title>2. Datasets Metadata Harmonisation Approach</title>
      <p>In the current landscape, the delivery of data-driven services requires a strategic approach to harmonise and standardise data, ensuring seamless integration across diverse sources and domains. These sources cover a broad spectrum of regions and application domains (open, industrial, personal, research, and more). As a result, datasets often exhibit a multitude of formats, data types and licensing arrangements. In addition, various metadata schemas (e.g., Dublin Core, VoID, OMV, DCAT, MOD) are applied to describe these datasets, adding another layer of complexity to the data landscape.</p>
      <p>In this context, the approach followed by ITA involves adding a semantic layer to enhance data harmonisation and enrich datasets with additional insights extracted from text. A dataset is conceptualized as a DataResource using an RDF/OWL ontology based on the widely recognized International Data Spaces (IDS) Information Model. This ontology delegates domain modelling to shared vocabularies and data schemas (DCAT, SKOS, FOAF, Owl-Time, among others). This model pursues the goal of establishing an ecosystem that facilitates secure, trusted, and semantically interoperable data exchange. It defines essential concepts for describing actors within a data space, their interactions, the resources exchanged and data usage restrictions.</p>
      <p>Figure 1: Harmonisation algorithm.</p>
      <p>Our algorithm discriminates between metadata and textual data obtained through an API from datasets. The dataset information encompasses various attributes such as name, whether it is a repository, description, domain, location, license, formats, privacy, publisher, language, etc. To enhance the metadata pre-processing and mapping stages, we introduce NLP techniques aimed at identifying semantic similarities and retrieving knowledge graph properties to improve information filtering.</p>
      <p>During the metadata mapping process, a search is carried out for the data type properties within the reference ontology that are most akin to a given metadata property, particularly those related to the DataResource class and its super-classes. If the input metadata matches any property in the reference ontology, the knowledge graph is enriched with the property description. Otherwise, knowledge-based semantic similarity metrics are applied to compute similarity scores between concepts, words, and entities. If these scores surpass a predefined threshold, the knowledge graph is enriched with the metadata-provided information. Certain datasets include a property that describes their contents, typically denoted as the description property. In such cases, the process starts with the extraction of textual data, followed by pre-processing steps such as parsing, segmentation into sentences, and augmentation with metadata attributes like Part-of-Speech (POS) tags.</p>
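      <p>The mapping rule described above can be sketched as follows. This is a minimal illustration: difflib string similarity stands in for the knowledge-based semantic similarity metrics, and the property list is an illustrative subset, not the actual IDS Information Model properties.</p>
      <preformat>
```python
from difflib import SequenceMatcher

# Illustrative subset of reference-ontology data type properties for
# DataResource and its super-classes (not the actual IDS property list).
ONTOLOGY_PROPERTIES = ["title", "description", "language", "publisher", "license"]

def map_metadata_property(metadata_key, threshold=0.7):
    """Return the most akin ontology property, or None.

    Exact matches are accepted directly; otherwise the best similarity
    score must surpass the predefined threshold, mirroring the rule in
    the text (string similarity stands in for semantic similarity).
    """
    if metadata_key in ONTOLOGY_PROPERTIES:
        return metadata_key  # direct match: enrich with the property description
    best, score = None, 0.0
    for prop in ONTOLOGY_PROPERTIES:
        s = SequenceMatcher(None, metadata_key.lower(), prop).ratio()
        if s > score:
            best, score = prop, s
    return best if score >= threshold else None

print(map_metadata_property("licence"))  # close variant maps to "license"
print(map_metadata_property("foo"))      # below threshold: None
```
      </preformat>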
      <p>
        In the matching step, attention mechanisms are used to infer candidate triplets, leveraging transformer-based language models like BERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the first instance, we have used bert-base-uncased for English text and bert-base-spanish for Spanish text from the Huggingface library.
      </p>
      <p>An attention mechanism replicates human selective attention in neural networks, focusing on pertinent information while disregarding irrelevant details. It operates by converting two sentences into a matrix, where words from one sentence form columns and words from the other form rows, and then identifies relevant contextual relationships. The attention weights, learned by pre-trained transformer-based models during training, capture connections between terms in the text (see Figure 2). These weights suggest potential RDF triplets (subject, predicate, object), which are further refined through filtering using constraints such as a threshold or frequency (STEP 3). Next, using an inverted index, suitable classes are identified to populate the knowledge graph stored in a Neo4j graph database.</p>
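      <p>The matching and filtering steps can be illustrated with a toy example. The attention weights below are hand-set stand-ins for weights obtained from a pre-trained BERT model, and the noun-verb-noun constraint is a simplification of the POS-based filtering described above.</p>
      <preformat>
```python
# Toy sketch of STEPs 2-3: attention weights connect terms of a sentence;
# strongly connected noun-verb-noun combinations become candidate RDF
# triplets, filtered by a weight threshold. Weights are hand-set here,
# standing in for attention extracted from a pre-trained BERT model.

words = ["dataset", "contains", "housing", "data"]
pos = ["NOUN", "VERB", "NOUN", "NOUN"]  # POS tags added during pre-processing
attention = [  # attention[i][j]: how strongly word i attends to word j
    [0.05, 0.70, 0.15, 0.10],
    [0.30, 0.05, 0.55, 0.10],
    [0.20, 0.30, 0.05, 0.45],
    [0.25, 0.25, 0.45, 0.05],
]

def candidate_triplets(words, pos, attention, threshold=0.4):
    """Keep (subject, predicate, object) candidates whose subject-predicate
    and predicate-object attention links both surpass the threshold."""
    out = []
    n = len(words)
    for s in range(n):
        for p in range(n):
            for o in range(n):
                keep = (
                    len({s, p, o}) == 3
                    and pos[s] == "NOUN" and pos[p] == "VERB" and pos[o] == "NOUN"
                    and attention[s][p] >= threshold
                    and attention[p][o] >= threshold
                )
                if keep:
                    out.append((words[s], words[p], words[o]))
    return out

print(candidate_triplets(words, pos, attention))  # [('dataset', 'contains', 'housing')]
```
      </preformat>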
      <p>The process applies, among others, the cosine similarity algorithm to DBpedia classes. If values differ, the path similarity semantic metric, which calculates the shortest path between two words in WordNet's representation (synsets), is used. When candidate triplet values cannot be mapped to the IDS reference ontology or DBpedia, a multilingual Named-Entity Recognition (NER) model is integrated. This model searches for nodes in additional ontologies related to the NER predictions, generating new relationships and expanding the knowledge graph.</p>
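      <p>The path similarity metric can be sketched over a miniature taxonomy standing in for WordNet's hypernym graph; the class names below are illustrative, not WordNet synsets.</p>
      <preformat>
```python
from collections import deque

# Miniature hypernym taxonomy standing in for WordNet synsets. WordNet's
# path similarity is 1 / (1 + length of the shortest path between two
# synsets in the hypernym graph).
TAXONOMY = {
    "entity": ["resource", "organization"],
    "resource": ["dataset", "document"],
    "organization": ["council"],
    "dataset": [],
    "document": [],
    "council": [],
}

def path_similarity(a, b):
    """Return 1 / (1 + shortest undirected path between a and b)."""
    graph = {}
    for parent, children in TAXONOMY.items():
        for child in children:
            graph.setdefault(parent, set()).add(child)
            graph.setdefault(child, set()).add(parent)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1.0 / (1.0 + dist)
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0  # unreachable: no similarity

print(path_similarity("dataset", "document"))  # siblings, distance 2
print(path_similarity("dataset", "council"))   # distance 4
```
      </preformat>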
      <p>In STEP 5, harmonised metadata and enriched
information might be transferred to another system, such as
CKAN, for further use. CKAN serves as an open-source
Data Management System (DMS), facilitating the
publication, sharing, and utilization of data in hubs and portals.</p>
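      <p>STEP 5 could be sketched against CKAN's Action API (package_create) as follows; the endpoint, API key, and field mapping are placeholders for illustration, not the project's actual configuration.</p>
      <preformat>
```python
import json
from urllib import request

CKAN_URL = "https://demo.ckan.org"  # illustrative endpoint
API_KEY = "REPLACE-WITH-API-KEY"    # placeholder

def to_ckan_package(resource):
    """Map a harmonised DataResource record onto CKAN's package schema."""
    return {
        "name": resource["name"].lower().replace(" ", "-"),  # CKAN slug
        "title": resource["name"],
        "notes": resource.get("description", ""),
        "license_id": resource.get("license", "notspecified"),
    }

def publish(resource):
    """POST the package to CKAN's package_create action (STEP 5)."""
    body = json.dumps(to_ckan_package(resource)).encode()
    req = request.Request(
        CKAN_URL + "/api/3/action/package_create",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": API_KEY},
    )
    return request.urlopen(req)  # CKAN returns the created package as JSON

package = to_ckan_package({
    "name": "Barcelona Territory",
    "description": "Town planning datasets",
    "license": "cc-by",
})
print(package)
```
      </preformat>
      <p>A real deployment would typically use the ckanapi client library and add error handling around the HTTP call.</p>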
    </sec>
    <sec id="sec-2">
      <title>3. Reference ontology</title>
      <p>A reference ontology is composed of classes (types of entities), relationships (connections between classes), and attributes (properties of entities). By using a reference ontology framework, real-world data can be integrated to form a knowledge graph. For the scenario involving the EuHubs4Data project, our base knowledge graph is built on the IDS Information Model ontology, which allows modeling datasets as DataResource entities and incorporates various metadata attributes such as dataset representation, dates, language, and source. This ontology establishes fundamental concepts and extends them using external ontologies like DCAT, SKOS, FOAF, Owl-Time, and PROV. It also enables the definition of metadata for resources, such as datasets in our case. This reference model provides a structured framework for modeling datasets, as illustrated in Figure 4.</p>
      <p>Figure 4: IDS Information Model.</p>
      <p>In addition, our knowledge graph remains dynamic, as it continuously incorporates new concepts and relationships identified from various external sources, including DBpedia and other ontologies, aligning with NER prediction labels like organization, person, location, and time. Furthermore, our approach can be extended to integrate with other collective knowledge bases. With the approach presented before, alternative reference ontologies could be employed in different contexts (i.e., Tourism).</p>
    </sec>
    <sec>
      <title>4. Experimental Results</title>
      <p>To evaluate our automated approach for harmonizing and enriching datasets through knowledge graph generation, 231 datasets from the catalogue (https://euhubs4data.eu/datasets/) provided by EuHubs4Data project members and open data sources have been used. This collection comprises 83 repositories and 148 single datasets exhibiting diversity in domains, coverage, formats, personal data protection levels, available URLs (public or private), and languages.</p>
      <p>Comparing basic metrics between the reference ontology and the automatically enriched knowledge graph (such as the number of classes, instances, attributes, and class relationships), as presented in Table 1, reveals notable differences. The automated knowledge graph has grown significantly in its population of elements. This enrichment is attributed to extracting additional insights from textual information within dataset metadata properties and to leveraging supplementary knowledge repositories, vocabularies, and ontologies (i.e., DBpedia).</p>
      <p>In an experiment showcasing our approach, we examine results from a subset of 18 datasets within the catalogue containing the term Barcelona, mapped to the knowledge graph as DataResource. These datasets cover various topics, such as Barcelona-Territory, Barcelona-Population, Barcelona-Administration, Barcelona-Economy and Business, and Barcelona-City and services, with descriptions ranging from housing and town planning to demographics and education.</p>
      <p>Through analysis of the datasets' metadata and textual content, we observe the enrichment of the knowledge graph with new classes and relationships. Notably, these datasets are associated with the new concept of the "Barcelona City Council," which in turn links to concepts like Housing, Urban_planning, Demography, Education, and more. This demonstrates the ability of our approach to capture and represent relevant concepts and relationships from datasets, facilitating a deeper understanding of the data landscape.</p>
      <p>Similarly, as depicted in Figure 5, the concept of human resources is linked to the previously mentioned Barcelona-Administration and Barcelona City Council classes, along with other new concepts, particularly Open data. Upon further exploration, it becomes evident that the graph is enhanced with additional sub-graphs derived from insights extracted from other datasets or repositories regarding the Open data concept, using the aforementioned knowledge bases and additional ontologies. In addition, by navigating through classes such as Energy, linked to the Open Data class, one can access further graphs concerning resource items. As a result, the enriched knowledge graph, which establishes connections not previously defined, is stored in a Neo4j graph database. This enables data retrieval and visualization through applications such as the web application developed by us (https://euhub4data-graphs.itainnova.es/).</p>
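      <p>Exploring the enriched graph could look as follows with the Neo4j Python driver; the connection details, node labels, and relationship types are illustrative, not the actual schema of our database, and running the query requires a live Neo4j instance.</p>
      <preformat>
```python
# Illustrative Cypher query: find datasets linked to a given concept
# (labels and relationship type are assumptions for this sketch).
QUERY = (
    "MATCH (d:DataResource)-[:RELATED_TO]->(c:Concept {name: $name}) "
    "RETURN d.name AS dataset"
)

def datasets_linked_to(concept, uri="bolt://localhost:7687",
                       auth=("neo4j", "password")):
    """Run the query against a running Neo4j instance (pip install neo4j)."""
    from neo4j import GraphDatabase
    with GraphDatabase.driver(uri, auth=auth) as driver:
        records, _, _ = driver.execute_query(QUERY, name=concept)
        return [record["dataset"] for record in records]

# datasets_linked_to("Barcelona City Council")  # requires a running database
print(QUERY)
```
      </preformat>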
      <p>Moreover, to foster higher quality and better outcomes, we have worked on assessing the quality of the generated knowledge graph. In recent years, ontology quality assessment has become a key aspect of ontology development and reuse.</p>
      <p>
        The results achieved allow the expert to identify areas that might require further refinement. While various approaches exist [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], in our case the oQuARE framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is used. The oQuARE framework employs diverse metrics across various dimensions (structural, operability, reliability, transferability, functional adequacy, maintainability, and compatibility) to evaluate ontology quality.
      </p>
      <p>Figure 6: Quality assessment results.</p>
      <p>Each dimension's assessment relies on its quality sub-characteristics, which are evaluated using associated metrics. For instance, the structural dimension is assessed based on cohesion, consistency, formal relations support, formalization, redundancy, and tangledness, involving metrics like ANOnto, PROnto, TMOnto, and LCOMOnto.</p>
      <p>To evaluate ontology quality, well-known metrics (LCOMOnto, WMCOnto, DITOnto, NACOnto, NOCOnto, CBOOnto, RFCOnto, NOMOnto, RROnto, AROnto, INROnto, CROnto, ANOnto, TMOnto) are depicted in a radar chart (Figure 6), comparing scores between the reference ontology (IDS Model) and the enriched knowledge graph. Additionally, metrics associated with the oQuaRE dimensions are represented in another chart, with calculations similar to those in an available tool (http://sele.inf.um.es/ontology-metrics).</p>
      <p>The comparison between the two ontologies reveals notable quality discrepancies. The reference ontology achieved higher scores than the automatically generated ontology. While some metrics related to attribute and class richness, as well as object coupling, yielded similar values, overall the automatic ontology exhibited areas requiring improvement.</p>
      <p>
        Moreover, using the OOPS! tool [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to identify issues in ontologies uncovered shared pitfalls between the reference and the enriched ontology. These included missing domain or range in properties, misused ontology annotations, undeclared equivalent classes, multiple classes with identical labels, and untyped properties. Addressing these issues in the reference ontology before enriching the knowledge graph could enhance overall quality. The quality of results delivered by data-driven services depends on the quality of the ontological model built.
      </p>
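      <p>Two of the structural metrics can be sketched over a toy ontology; the formulas below are simplified readings of annotation richness (ANOnto) and relationship richness (RROnto), not the oQuARE implementation, and the ontology contents are invented for illustration.</p>
      <preformat>
```python
# Toy ontology: each class records its annotation count and its superclass.
classes = {
    "DataResource": {"annotations": 2, "subclass_of": None},
    "Dataset": {"annotations": 1, "subclass_of": "DataResource"},
    "Repository": {"annotations": 0, "subclass_of": "DataResource"},
}
object_properties = ["publishedBy", "relatedTo", "hasLicense"]

def anonto(classes):
    """Simplified annotation richness: mean annotations per class."""
    return sum(c["annotations"] for c in classes.values()) / len(classes)

def rronto(classes, object_properties):
    """Simplified relationship richness: object properties over
    subclass links plus object properties."""
    subclass_links = sum(1 for c in classes.values() if c["subclass_of"])
    return len(object_properties) / (subclass_links + len(object_properties))

print(anonto(classes))                     # 3 annotations / 3 classes = 1.0
print(rronto(classes, object_properties))  # 3 / (2 + 3) = 0.6
```
      </preformat>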
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions and Future Work</title>
      <p>In recent years, the exponential growth of data production has underlined the significance and impact of data in the economy, industry, and society. Social trends towards openness and sharing further emphasise the transformative potential of data in the global economy and society.</p>
      <p>Providing quality data-driven services has thus become a critical business strategy. Data comes from multiple sources, across different application domains, regions, data types, formats and licences. Given these diverse sources, formats, and licenses, there is a need for a strategy to harmonise and standardise data for seamless integration. This paper introduces a semantic approach based on the automatic generation of a knowledge graph from metadata and textual content. Leveraging NLP techniques, transformer-based linguistic models and the support of additional knowledge bases and ontologies, the approach aims to facilitate interoperability between data-driven sources by extracting knowledge from texts and enriching a reference ontology automatically. It follows a series of steps including information extraction, pre-processing of textual content, matching to infer candidate triplets, filtering these triplets, and the mapping and integration of triplets or metadata into the knowledge graph.</p>
      <p>The experimental results demonstrate an innovative approach for automatically building knowledge graphs from natural language text within a specific domain.</p>
      <p>However, the quality assessment conducted, combining different well-known quality metrics, has identified defects or errors in the ontology-learning process that need to be addressed in order to improve the quality of the knowledge graph and thus the quality of the results delivered by data-driven services.</p>
      <p>As future work, we aim to extract knowledge from various textual sources and automatically generate a high-quality knowledge graph. This graph will serve as a semantic layer for data harmonisation and interoperability that might empower recommendation systems with semantic understanding, data quality assurance, cross-domain integration, context-awareness, explainability and adaptive learning capabilities. By harnessing the semantic richness of the knowledge graph, recommendation systems could provide more accurate, relevant, insightful and personalized recommendations that enhance user satisfaction and drive engagement across diverse domains and contexts (Tourism, Health,...). In addition, with the emergence of new large language models like GPT-4, there is an opportunity to explore and test models to assess the approach and adapt it to specific use cases.</p>
    </sec>
    <sec>
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the Department of Big Data and Cognitive Systems at the Technological Institute of Aragon, by the IODIDE group of the Government of Aragon, grant number T1720R, and by the European Regional Development Fund (ERDF).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Glennon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kolding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundbland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Micheletti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Raczko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Freitaso</surname>
          </string-name>
          ,
          <source>European DATA Market Study 2021-2023. D2.4 Second Report on Facts and Figures</source>
          ,
          <source>Technical Report, IDC and Lisbon Council</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>BDVA</surname>
          </string-name>
          ,
          <article-title>Big Data Value cPPP</article-title>
          ,
          <source>Technical Report, Big Data Value Association</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mertens</surname>
          </string-name>
          ,
          <source>The Data Spaces Radar</source>
          ,
          <source>Technical Report, International Data Spaces Association</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mellal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Guerram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bouhalassa</surname>
          </string-name>
          ,
          <article-title>An Approach for Automatic Ontology Enrichment from Texts</article-title>
          , Informatica (Slovenia)
          <volume>45</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Amani</surname>
          </string-name>
          <string-name>
            <surname>Drissi</surname>
          </string-name>
          , Ahmed Khemiri,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chbeir</surname>
          </string-name>
          ,
          <article-title>A New Automatic Ontology Construction Method Based on Machine Learning Techniques: Application on Financial Corpus</article-title>
          ,
          <source>Proceedings of the 13th International Conference on Management of Digital EcoSystems</source>
          (
          <year>2021</year>
          )
          <fpage>57</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Relphormer:
          <article-title>Relational graph transformer for knowledge graph representations</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>566</volume>
          (
          <year>2023</year>
          )
          <fpage>127044</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <source>Proceedings Conference of the North American Chapter of the Association for Computational Linguistics</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Roldán-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ruano-Ordás</surname>
          </string-name>
          , V. BastoFernandes,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Méndez</surname>
          </string-name>
          ,
          <article-title>An ontology knowledge inspection methodology for quality assessment and continuous improvement</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          <volume>133</volume>
          (
          <year>2021</year>
          )
          <fpage>101889</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Duque-Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Fernández-Breis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iniesta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          , et al.,
          <article-title>Evaluation of the OQuaRE framework for ontology quality</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>40</volume>
          (
          <year>2013</year>
          )
          <fpage>2696</fpage>
          -
          <lpage>2703</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Poveda-Villalón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gomez-Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Suárez-Figueroa</surname>
          </string-name>
          ,
          <article-title>OOPS! (OntOlogy Pitfall Scanner!): An On-line Tool for Ontology Evaluation</article-title>
          ,
          <source>International Journal on Semantic Web and Information Systems (IJSWIS) 10</source>
          (
          <year>2014</year>
          )
          <fpage>7</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>