1. Introduction

The Knowledge Graph Lifecycle in NTT DATA

Javier Flores

Emmanuel Jamin

Sergi Nadal

Oscar Romero

NTT Data, Barcelona, Spain; Universitat Politècnica de Catalunya, Barcelona, Spain

The Semantic Business Unit (SEMBU) in NTT DATA aims to increase the semantic interoperability and accessibility of European institutions' data projects by following Linked Open Data (LOD) principles to build controlled vocabularies and produce Knowledge Graphs (KGs). One of its most notable projects revolves around the CORDIS portal, which publishes information about research and innovation projects funded by the European Commission. SEMBU pursues two main goals: (i) expose semantic data related to CORDIS via a SPARQL endpoint that facilitates access to and reuse of quality scientific data, and (ii) design an efficient, incremental, and automated KG lifecycle to be used as a reference in other data projects. To that end, we have adopted state-of-the-art semantic technologies to support the creation and management of the KG, with the goal of centralizing knowledge and providing an overall view of data assets that improves data governance, maintenance, and external interaction by data consumers. We have also identified some limitations of these technologies, which are being tackled via an industrial PhD. This paper reports our experience, the obstacles encountered, and our proposals for generating and maintaining the CORDIS KG.

2. CORDIS KG lifecycle

The CORDIS KG is generated using the lifecycle depicted in Figure 1, which accounts for business needs and for the need to integrate incrementally with legacy systems. The remainder of this section describes that lifecycle.

Collection. XML files are collected from the CORDIS portal and materialized into RDF triples. Each file is automatically analyzed and mapped to resources of the EURIO ontology via RML mappings. Since files evolve in both schema and data, manually updating the mappings is a laborious and error-prone task. We thus proposed a largely automated process that bootstraps the file schema and automatically updates the RML mappings.
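To make the collection step more concrete, the following is a minimal sketch of how a file schema could be bootstrapped from an XML file and checked against an existing mapping. It is not the RML-based process used in CORDIS: the sample XML, the `http://example.org/` IRIs, the `MAPPING` dictionary, and the function names are all hypothetical stand-ins, and a plain Python dictionary replaces the RML mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real CORDIS XML schema is not reproduced here.
SAMPLE_XML = """
<project>
  <id>101</id>
  <title>Example project</title>
  <budget>1000</budget>
</project>
"""

# Hypothetical path-to-predicate mapping (a stand-in for an RML mapping).
MAPPING = {
    "project/id": "http://example.org/vocab#identifier",
    "project/title": "http://example.org/vocab#title",
}


def leaf_paths(root):
    """Bootstrap the file schema: yield (path, text) for every leaf element."""
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            yield path, (node.text or "").strip()
        # Push children in reverse so they are visited in document order.
        for child in reversed(children):
            stack.append((child, f"{path}/{child.tag}"))


def materialize(xml_text, mapping, subject):
    """Return (triples, unmapped_paths) for one XML document."""
    root = ET.fromstring(xml_text.strip())
    triples, unmapped = [], set()
    for path, value in leaf_paths(root):
        predicate = mapping.get(path)
        if predicate is None:
            # The schema evolved: this path has no mapping rule yet.
            unmapped.add(path)
        else:
            triples.append((subject, predicate, value))
    return triples, unmapped
```

Calling `materialize(SAMPLE_XML, MAPPING, "http://example.org/project/101")` yields the triples covered by the mapping together with the set of paths that still need a rule, which is the signal an automated update of the mappings would act on.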

Figure 1: Overview of the implemented KG lifecycle in CORDIS (incremental phases: 01 Collect, 02 Clean, 03 Enrich, 04 Link, 05 Refactor, 06 Publish).

Clean. A specific set of quality rules and SHACL shapes acts as an entity resolution system. Inconsistencies and errors are resolved by manual validation, which determines the best strategy to solve each conflict. Given the manual effort involved, learning techniques that combine attribute-level data could improve the detection of similarities and discrepancies between entities and suggest resolution strategies.
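As an illustration of what combining attribute-level data can mean, the sketch below scores pairs of entity records by averaging a string similarity over their shared attributes and reports pairs above a threshold as duplicate candidates. It deliberately uses plain string matching from the Python standard library rather than a learning technique, and the record layout, the function names, and the 0.85 threshold are assumptions made for this example only.

```python
from difflib import SequenceMatcher


def attribute_similarity(a, b):
    """Mean string similarity over the attributes both records share."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, str(a[key]).lower(), str(b[key]).lower()).ratio()
        for key in shared
    ]
    return sum(scores) / len(scores)


def candidate_duplicates(records, threshold=0.85):
    """Return (i, j, score) for every record pair scoring at or above threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = attribute_similarity(records[i], records[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Hypothetical organisation records with a spelling variant.
records = [
    {"name": "Universitat Politecnica de Catalunya", "country": "ES"},
    {"name": "Universitat Politècnica de Catalunya", "country": "ES"},
    {"name": "NTT Data", "country": "JP"},
]
candidates = candidate_duplicates(records)
```

Here only the first two records are flagged as a candidate pair. A learned model could replace `attribute_similarity` while keeping the same pairwise structure, and the flagged pairs would still go to manual validation.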

Enrich. Unstructured data related to project funding is analyzed using named entity recognition tools to identify relevant information (e.g., organisations and people) and link it to EURIO resources. Here, CRF-NER has facilitated this step thanks to its generic infrastructure, which allows us to reuse resources already created within the KG and to enrich them efficiently.

Link. LOD repositories are explored via schema and instance alignment tools (e.g., LogMap, Alignment API, and AML) to enhance the added value of the KG. However, the low precision of alignment tools caused by lexical similarity bias (e.g., syntactically similar elements that are semantically different), and their inability to scale up to large volumes of data, have led to manual approaches. To overcome these limitations, graph data profiles can be compared using learning techniques to predict their expected similarity and improve precision.

Refactor. The CORDIS KG is manually updated using hints acquired in previous steps, leading to different KG versions. Tracing the changes made to schemata and instances is crucial for supporting evolution transparency between versions. Thus, we propose a versioning mechanism focused on entities and individual changes, inspired by the PAV ontology. Moreover, automating refactoring hints via learning techniques is an exciting direction that could maximize the linking of LOD repositories and reduce human effort.
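The idea of entity-focused versioning can be sketched as a diff between two KG snapshots, grouped by subject, plus version metadata expressed with the PAV properties `pav:version` and `pav:previousVersion`. This is only an illustration under simple assumptions, not the mechanism proposed for CORDIS: triples are plain Python tuples, and the `e:` identifiers, graph IRIs, and function names are hypothetical.

```python
PAV = "http://purl.org/pav/"  # namespace of the PAV ontology


def diff_versions(old, new):
    """Return the triples added and removed between two KG snapshots."""
    old, new = set(old), set(new)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def changes_by_entity(delta):
    """Group individual changes by subject so each entity can be traced."""
    grouped = {}
    for kind in ("added", "removed"):
        for subject, predicate, obj in delta[kind]:
            entry = grouped.setdefault(subject, {"added": [], "removed": []})
            entry[kind].append((predicate, obj))
    return grouped


def version_metadata(graph_iri, version, previous_iri=None):
    """Describe a KG version with pav:version and pav:previousVersion."""
    triples = [(graph_iri, PAV + "version", version)]
    if previous_iri is not None:
        triples.append((graph_iri, PAV + "previousVersion", previous_iri))
    return triples


# Hypothetical snapshots of two KG versions.
v1 = [("e:p1", "e:title", "A"), ("e:p1", "e:status", "open")]
v2 = [("e:p1", "e:title", "A"), ("e:p1", "e:status", "closed"),
      ("e:p2", "e:title", "B")]

delta = diff_versions(v1, v2)
per_entity = changes_by_entity(delta)
meta = version_metadata("http://example.org/kg/v2", "2",
                        "http://example.org/kg/v1")
```

Grouping the delta by subject keeps the record of what changed attached to each entity, rather than only to the graph as a whole.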

Publish. Finally, the publication of the enriched KG consists of (i) ensuring best practices for publishing ontologies on the web using FOOPS!, and (ii) generating the proper documentation using WIDOCO. Adhering to the lifecycle described above, the Publications Office of the European Union plans to release the first version of the CORDIS KG through a SPARQL endpoint in December 2022.

In the on-site presentation, we will present each of the phases, the tools used, and the level of automation that can be achieved. The complete lifecycle will be exemplified with real data samples.