=Paper=
{{Paper
|id=Vol-3254/paper402
|storemode=property
|title=The Knowledge Graph Lifecycle in NTT DATA
|pdfUrl=https://ceur-ws.org/Vol-3254/paper402.pdf
|volume=Vol-3254
|authors=Javier Flores,Emmanuel Jamin,Sergi Nadal,Oscar Romero
|dblpUrl=https://dblp.org/rec/conf/semweb/0002JN022
}}
==The Knowledge Graph Lifecycle in NTT DATA==
Javier Flores<sup>1,*</sup>, Emmanuel Jamin<sup>2</sup>, Sergi Nadal<sup>1</sup> and Oscar Romero<sup>1</sup>

<sup>1</sup> Universitat Politècnica de Catalunya, Barcelona, Spain

<sup>2</sup> NTT Data, Barcelona, Spain

===1. Introduction===
The Semantic Business Unit (SEMBU) in NTT DATA aims to increase the semantic interoperability and accessibility of European institutions’ data projects by following Linked Open Data (LOD) principles to build controlled vocabularies and produce Knowledge Graphs (KGs). One of its most notable projects revolves around the CORDIS portal<sup>1</sup>, which publishes information about research and innovation projects funded by the European Commission. SEMBU pursues two main goals: (i) expose semantic data related to CORDIS via a SPARQL endpoint that facilitates access to and reuse of quality scientific data, and (ii) design an efficient, incremental, and automated KG lifecycle to be used as a reference in other data projects.

To that end, we have adopted state-of-the-art semantic technologies to support the creation and management of the KG, with the goal of centralizing knowledge and providing an overall view of data assets that improves data governance, maintenance, and external interaction by data consumers. We have also identified some limitations of these technologies, which are being tackled via an industrial PhD. This paper reports our experience, the obstacles encountered, and our proposals for generating and maintaining the CORDIS KG.

===2. CORDIS KG lifecycle===
The CORDIS KG is generated using the lifecycle depicted in Figure 1, which considers both business needs and the need for incremental integration with legacy systems. The remainder of this section describes that lifecycle.

Collection. XML files are collected from the CORDIS portal and materialized into RDF triples. Each file is automatically analyzed and mapped to resources of the EURIO ontology<sup>2</sup> via RML mappings. Since files evolve in both schema and data, manually updating the mappings is a laborious and error-prone task.
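As a rough illustration of the materialization described in the Collection step, the following minimal sketch turns one XML record into N-Triples using only the Python standard library. It is not the RML-based pipeline used for CORDIS: the XML layout, the `example.org` namespaces, and the `FIELD_TO_PROPERTY` table are all hypothetical stand-ins for the real CORDIS files, EURIO IRIs, and RML mappings.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces: placeholders, NOT the real EURIO or CORDIS IRIs.
ONT = "http://example.org/ontology#"
BASE = "http://example.org/project/"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Hypothetical element-to-property table, standing in for RML mappings.
FIELD_TO_PROPERTY = {
    "title": ONT + "title",
    "acronym": ONT + "acronym",
}


def escape_literal(text):
    """Escape characters that are special inside an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def project_to_ntriples(xml_text):
    """Turn one <project> record into a list of N-Triples lines."""
    root = ET.fromstring(xml_text)
    subject = "<" + BASE + root.attrib["id"] + ">"
    triples = [f"{subject} {RDF_TYPE} <{ONT}Project> ."]
    for child in root:
        prop = FIELD_TO_PROPERTY.get(child.tag)
        if prop is None or child.text is None:
            # Unmapped element: the kind of schema change that would
            # require a mapping update.
            continue
        value = escape_literal(child.text.strip())
        triples.append(f'{subject} <{prop}> "{value}" .')
    return triples


# Made-up sample record, for illustration only.
SAMPLE = """<project id="p1">
  <title>Sample "KG" project</title>
  <acronym>SKG</acronym>
  <budget>1000</budget>
</project>"""

if __name__ == "__main__":
    for line in project_to_ntriples(SAMPLE):
        print(line)
```

In this sketch the `budget` element is silently skipped because the table has no entry for it, which hints at why mappings that are maintained by hand fall behind when the source schema changes.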
We therefore proposed a largely automated process that bootstraps the file schema and automatically updates the RML mappings.

The 21st International Semantic Web Conference, October 23–27, 2022, Hangzhou, CN

⋆ This work was partly funded by the Spanish Ministerio de Ciencia e Innovación under project PID2020-117191RB-I00 (DOGO4ML). Javier Flores is supported by contract 2020-DI-027 of the Industrial Doctorate Program of the Government of Catalonia and by a CONACYT scholarship. Sergi Nadal is partly supported by the Spanish Ministerio de Ciencia e Innovación, as well as the European Union - NextGenerationEU, under project FJC2020-045809-I / AEI/10.13039/501100011033.

Email: jflores@essi.upc.edu (J. Flores, * corresponding author); emmanueljeanjacques.jamin@nttdata.com (E. Jamin); snadal@essi.upc.edu (S. Nadal); oromero@essi.upc.edu (O. Romero)

© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

<sup>1</sup> cordis.europa.eu

<sup>2</sup> op.europa.eu/en/web/eu-vocabularies/eurio

Figure 1: Overview of the implemented KG lifecycle in CORDIS (figure labels: Incremental; 01 Collect, 02 Clean, 03 Enrich, 04 Link, 05 Refactor, 06 Publish).

Clean. A specific set of quality rules and SHACL shapes acts as an entity resolution system. Inconsistencies and errors are resolved through manual validation, which determines the best strategy for each conflict. Given the manual effort involved, learning techniques that combine attribute-level data could improve the detection of similarities and discrepancies between entities and suggest resolution strategies.

Enrich. Unstructured data related to project funding is analyzed using named entity recognition tools to identify relevant entities (e.g., organisations and people) and link them to EURIO resources.
Here, CRF-NER<sup>3</sup> has facilitated this step thanks to its generic infrastructure, which allows resources already created within the KG to be reused and efficiently enriched.

Link. LOD repositories are explored via schema and instance alignment tools (e.g., LogMap, Alignment API, and AML) to enhance the added value of the KG. However, the low precision of alignment tools, caused by lexical similarity bias (e.g., syntactically similar elements that are semantically different), and their inability to scale to large volumes of data have led to manual approaches. To overcome these limitations, graph data profiles can be compared using learning techniques to predict their expected similarity and improve precision.

Refactor. The CORDIS KG is manually updated using hints acquired in the previous steps, leading to different KG versions. Tracing the changes made to schemata and instances is crucial for transparent evolution between versions. Thus, we propose a versioning mechanism focused on entities and individual changes, inspired by the PAV ontology<sup>4</sup>. Moreover, automatically generating refactoring hints via learning techniques is an exciting direction that could maximize the linking of LOD repositories and reduce human effort.

Publish. Finally, publishing the enriched KG consists of (i) ensuring best practices for publishing ontologies on the web using FOOPS!, and (ii) generating the proper documentation using WIDOCO.

Adhering to the lifecycle described above, the Publications Office of the European Union plans to release the first version of the CORDIS KG through a SPARQL endpoint in December 2022. In the on-site presentation, we will present each of the phases, the tools used, and the level of automation that can be achieved. The complete lifecycle will be exemplified with real data samples.

<sup>3</sup> nlp.stanford.edu/software/CRF-NER.html

<sup>4</sup> pav-ontology.github.io/pav/
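The entity-focused versioning idea from the Refactor step could look roughly like the sketch below. It is only one possible reading, not the mechanism proposed in the paper: it diffs two KG snapshots per entity and describes each new entity version with the PAV terms `pav:version`, `pav:previousVersion`, and `pav:createdOn`. The `example.org` IRIs, the `/v<number>` version IRI scheme, and the sample data are hypothetical.

```python
from datetime import datetime

PAV = "http://purl.org/pav/"  # PAV ontology namespace
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def changed_entities(old, new):
    """Group added and removed triples by subject.

    `old` and `new` are sets of (subject, predicate, object) string tuples.
    Entities without any change do not appear in the result.
    """
    changes = {}
    for s, p, o in new - old:
        changes.setdefault(s, {"added": [], "removed": []})["added"].append((p, o))
    for s, p, o in old - new:
        changes.setdefault(s, {"added": [], "removed": []})["removed"].append((p, o))
    return changes


def version_metadata(entity, old_version, new_version, created_on):
    """Describe a new version of `entity` as N-Triples lines using PAV terms.

    The '<entity>/v<number>' IRI scheme is an assumption of this sketch.
    """
    new_iri = f"<{entity}/v{new_version}>"
    old_iri = f"<{entity}/v{old_version}>"
    stamp = created_on.isoformat()
    return [
        f'{new_iri} <{PAV}version> "{new_version}" .',
        f"{new_iri} <{PAV}previousVersion> {old_iri} .",
        f'{new_iri} <{PAV}createdOn> "{stamp}"^^<{XSD_DATETIME}> .',
    ]


# Made-up snapshots of a tiny KG, for illustration only.
NAME = "http://example.org/ontology#name"
V1 = {
    ("http://example.org/org/1", NAME, "ACME"),
    ("http://example.org/org/2", NAME, "Globex"),
}
V2 = {
    ("http://example.org/org/1", NAME, "ACME Corp"),
    ("http://example.org/org/2", NAME, "Globex"),
}

if __name__ == "__main__":
    for entity, delta in changed_entities(V1, V2).items():
        print(entity, delta)
        for line in version_metadata(entity, 1, 2, datetime(2022, 10, 23)):
            print(line)
```

Only the entity whose name changed receives a new version record here, which keeps the version history at the level of individual entities rather than whole-graph snapshots.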