<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Knowledge Graph Lifecycle in NTT DATA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Javier Flores</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanuel Jamin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergi Nadal</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Romero</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>NTT DATA</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Semantic Business Unit (SEMBU) in NTT DATA aims to increase the semantic interoperability and accessibility of European institutions' data projects by following Linked Open Data (LOD) principles to build controlled vocabularies and produce Knowledge Graphs (KGs). One of its most notable projects revolves around the CORDIS portal, which publishes information about research and innovation projects funded by the European Commission. SEMBU pursues two main goals: (i) to expose semantic data related to CORDIS via a SPARQL endpoint that facilitates access to and reuse of quality scientific data, and (ii) to design an efficient, incremental, and automated KG lifecycle that can serve as a reference for other data projects. To that end, we have adopted state-of-the-art semantic technologies to support the creation and management of the KG, with the goal of centralizing knowledge and providing an overall view of data assets that improves data governance, maintenance, and external interaction by data consumers. We have also identified some limitations of these technologies, which are being tackled via an industrial PhD. This paper reports our experience, the obstacles encountered, and our proposals for generating and maintaining the CORDIS KG.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. CORDIS KG lifecycle</title>
      <p>The CORDIS KG is generated using the lifecycle depicted in Figure 1, which accounts for
business needs and for the need to integrate incrementally with legacy systems. The remainder of
this section describes that lifecycle.</p>
      <p>Collection. XML files are collected from the CORDIS portal and materialized into RDF triples.
Each file is automatically analyzed and mapped to resources of the EURIO ontology via RML
mappings. Since the files evolve in both schema and data, manually updating the mappings is a
laborious and error-prone task. We thus proposed a largely automated process that bootstraps
the file schema and automatically updates the RML mappings.</p>
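      <p>The idea of bootstrapping a file schema and updating the mappings can be sketched as follows. This is a minimal illustration under stated assumptions, not the process used for CORDIS: the sample XML, the element names, and the namespace http://example.org/vocab# are hypothetical, and the mappings are kept as a plain Python dictionary from element paths to predicate IRIs rather than as RML documents.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical sample file; real CORDIS exports have a different, richer schema.
SAMPLE_XML = """<project>
  <rcn>12345</rcn>
  <acronym>DEMO</acronym>
  <organization>
    <name>Example University</name>
  </organization>
</project>"""

# Placeholder namespace for illustration only (not the EURIO namespace).
EX = "http://example.org/vocab#"


def bootstrap_schema(xml_text):
    """Return the sorted list of leaf-element paths found in the document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        children = list(node)
        if not children:
            paths.add(path)  # a leaf element carries a literal value
        for child in children:
            walk(child, path)

    walk(root, "")
    return sorted(paths)


def update_mappings(existing, paths):
    """Keep existing mappings and add a placeholder for every unmapped path."""
    updated = dict(existing)
    for path in paths:
        if path not in updated:
            updated[path] = EX + path.strip("/").replace("/", "_")
    return updated


paths = bootstrap_schema(SAMPLE_XML)
# One path is already mapped by hand; the other two are bootstrapped.
mappings = update_mappings({"/project/rcn": EX + "identifier"}, paths)
for path, predicate in sorted(mappings.items()):
    print(path, "->", predicate)
```
]]></preformat>
      <p>Mappings that already exist are left untouched, so only paths that newly appear in an evolved file need attention; in a real pipeline the placeholder predicates would be replaced by terms from the target ontology.</p>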
      <p>Figure 1: Overview of the implemented KG lifecycle in CORDIS. Phases shown (incremental): 01 Collect, 02 Clean, 03 Enrich, 04 Link, 05 Refactor, 06 Publish.</p>
      <p>Clean. A dedicated set of quality rules and SHACL shapes acts as an entity resolution system.
Inconsistencies and errors are resolved through manual validation, which determines the best
strategy for each conflict. Given the manual effort involved, learning techniques that combine
attribute-level data could improve the detection of similarities and discrepancies between
entities and suggest resolution strategies.</p>
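      <p>As a rough illustration of how a quality rule can surface candidates for manual validation, the sketch below groups records by a normalized name and flags groups whose attributes disagree. It is plain Python standing in for a rule, not SHACL, and the records, field names, and the choice of the name as the matching key are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical organisation records; identifiers and fields are illustrative.
records = [
    {"id": "org-1", "name": "Example University", "country": "ES"},
    {"id": "org-2", "name": "example  university", "country": "ES"},
    {"id": "org-3", "name": "Example University", "country": "FR"},
    {"id": "org-4", "name": "Sample Institute", "country": "DE"},
]


def normalise(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def find_conflicts(records, key="name", attributes=("country",)):
    """Group records by normalised key and report attributes that disagree."""
    groups = defaultdict(list)
    for record in records:
        groups[normalise(record[key])].append(record)

    conflicts = {}
    for group_key, members in groups.items():
        if len(members) < 2:
            continue  # a single record cannot conflict with itself
        differing = [
            attr for attr in attributes
            if len({member[attr] for member in members}) > 1
        ]
        if differing:
            conflicts[group_key] = {
                "ids": [member["id"] for member in members],
                "attributes": differing,
            }
    return conflicts


conflicts = find_conflicts(records)
for group_key, details in conflicts.items():
    print(group_key, details)
```
]]></preformat>
      <p>Each flagged group is a candidate for review: a curator decides whether the records denote the same entity and which attribute value to keep.</p>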
      <p>Enrich. Unstructured data related to project funding is analyzed using named entity
recognition tools to identify relevant information (e.g., organisations and people) and link it
to EURIO resources. Here, CRF-NER has facilitated this step thanks to its generic infrastructure,
which allows resources already created within the KG to be reused and efficiently enriched.</p>
      <p>Link. LOD repositories are explored via schema and instance alignment tools (e.g., LogMap,
Alignment API, and AML) to enhance the added value of the KG. However, the low precision
of these tools, caused by lexical similarity bias (e.g., syntactically similar elements that are
semantically different), and their inability to scale to large volumes of data have led to manual
approaches. To overcome these limitations, graph data profiles can be compared using learning
techniques to predict their expected similarity and improve precision.</p>
      <p>Refactor. The CORDIS KG is manually updated using hints acquired in the previous steps, leading
to different KG versions. Tracing the changes made to schemata and instances is crucial for
transparent evolution between versions. Thus, we propose a versioning mechanism
focused on entities and individual changes, inspired by the PAV ontology. Moreover, generating
refactoring hints automatically via learning techniques is a promising direction that could
maximize the linking of LOD repositories and reduce human effort.</p>
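      <p>Entity-level change tracking between two KG versions can be sketched as a set difference over triples, grouped by subject. This is a minimal sketch under stated assumptions rather than the proposed mechanism itself: the triples and the ex: prefix are hypothetical, and the keys version and previousVersion merely echo the PAV properties of the same names.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Two hypothetical KG versions as sets of (subject, predicate, object) triples.
V1 = {
    ("ex:project1", "ex:title", "Demo"),
    ("ex:project1", "ex:status", "open"),
    ("ex:org1", "ex:name", "Example University"),
}
V2 = {
    ("ex:project1", "ex:title", "Demo"),
    ("ex:project1", "ex:status", "closed"),
    ("ex:org1", "ex:name", "Example University"),
    ("ex:org2", "ex:name", "Sample Institute"),
}


def diff_versions(old, new):
    """Return the triples added to and removed from the old version."""
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def changes_by_entity(diff):
    """Group individual triple changes by the entity (subject) they affect."""
    per_entity = defaultdict(lambda: {"added": [], "removed": []})
    for kind in ("added", "removed"):
        for subject, predicate, obj in diff[kind]:
            per_entity[subject][kind].append((predicate, obj))
    return dict(per_entity)


def version_record(version, previous, old, new):
    """Build a change record; key names echo pav:version / pav:previousVersion."""
    return {
        "version": version,
        "previousVersion": previous,
        "changes": changes_by_entity(diff_versions(old, new)),
    }


record = version_record("2", "1", V1, V2)
for entity, change in sorted(record["changes"].items()):
    print(entity, change)
```
]]></preformat>
      <p>Unchanged entities do not appear in the record, so each version stores only what differs from its predecessor, which keeps the history readable per entity.</p>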
      <p>Publish. Finally, the publication of the enriched KG consists of (i) ensuring best practices for
publishing ontologies on the web using FOOPS!, and (ii) generating the proper documentation
using WIDOCO. Adhering to the lifecycle described above, the Publications Office of the
European Union plans to release the first version of the CORDIS KG through a SPARQL endpoint
in December 2022.</p>
      <p>In the on-site presentation, we will present each of the phases, the tools used, and the level of
automation that can be achieved. The complete lifecycle will be exemplified with real
data samples.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>