<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Wikidata to Bootstrap an Enterprise Knowledge Graph: How to Stay on Topic?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucas Jarnac</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierre Monnin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Orange</institution>
          ,
          <addr-line>Belfort</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>An Enterprise Knowledge Graph (EKG) is a major asset for companies as it supports several downstream tasks, e.g., business vocabulary sharing, search, or question answering. The bootstrap of an EKG can rely on internal business data sets or open knowledge graphs. In this paper, we consider the latter approach. We propose to build the nucleus of an EKG class hierarchy by mapping internal business terms to Wikidata entities and performing an expansion along the ontology hierarchy. This nucleus thus contains additional business terms that will, in turn, support further knowledge extraction approaches from texts or tables. However, since Wikidata contains numerous classes, there is a need to limit this expansion by pruning classes unrelated to the business topics of interest. To this aim, we propose to rely on the distance between node embeddings and node degree. We show that considering the embedding of a class as the centroid of the embeddings of its instances improves pruning and that node degree is a necessary feature to consider.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graph</kwd>
        <kwd>Pruning</kwd>
        <kwd>Node Degree</kwd>
        <kwd>Graph Embedding</kwd>
        <kwd>Distance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Knowledge graphs (KGs) provide a structured representation of data and knowledge in which
entities are represented as nodes, and relations between them are represented as edges [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
Enterprise Knowledge Graphs (EKGs) are major assets of companies since they support various
downstream applications including knowledge/vocabulary sharing and reuse, data integration,
information system unification, search, or question answering [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. That is why several
companies such as Google, Microsoft, Amazon, Facebook, or IBM have built their own knowledge
graphs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Building an EKG is an iterative and continuous process that can be carried out with
various approaches. For example, it is possible to feed the KG with data and knowledge extracted
from relational databases [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Other approaches rely on automatic knowledge extraction from
semi-structured or textual data, such as the AutoKnow system for the Amazon Product Graph [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
In such a view, it is common to first build a high-quality nucleus of the KG with entities and
categories extracted from premium sources. This nucleus will then support automatic knowledge
extraction systems applied on a wider variety of data sources [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In this paper, we consider this first step of building a high-quality nucleus of an EKG. In
particular, we focus on bootstrapping its ontology hierarchy. To this aim, we assume that we
have at our disposal a repository of business terms. Building the nucleus can be achieved by
mapping these terms to entities of an existing KG and integrating parts of its knowledge into the
EKG. Here, we propose to perform an expansion along the ontology hierarchy of the existing
KG. We retrieve the direct classes of mapped entities, their super- and sub-classes, and integrate
them in the EKG (see Figure 1). Thus, the EKG nucleus contains additional business terms
related to the original terms. Choosing the existing KG to integrate highly depends on the target
business domains. In our work, we consider Wikidata [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a large, free, and collaborative KG of
the Wikimedia Foundation. It is a general-purpose KG that can be seen as a premium source [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
However, due to its large size (more than 90 million items), there is a need to limit the expansion
to avoid integrating terms unrelated to the target business domains and the original business
terms. To illustrate, from 839 original terms related to Orange business domains, it is possible
to retrieve more than 2.5 million sub-classes.
      </p>
      <p>
        In our work, we propose to carry out this limitation by pruning unrelated classes during the
expansion along the Wikidata ontology hierarchy. To do so, we rely on node degree (i.e., number
of incoming and outgoing edges) and the distance between node embeddings which are learned
by graph embedding models. The latter encode graph structures (e.g., nodes, edges) into a
low-dimensional vector space that preserves as much as possible the properties of the graph [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Graph embeddings have shown impressive performance in various tasks such as link prediction,
matching, classification, or recommendation [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. Inspired by these successful approaches,
we aim at investigating in this preliminary work whether distance in the embedding space can
be leveraged to prune unrelated classes. Because entities and relations are represented as
low-dimensional vectors, an embedding-based pruning could alleviate computational complexity
issues that are one disadvantage of most pruning methods, according to Faralli et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Additionally, the continuous aspect of graph embeddings compared to the discrete aspect of
graph structures may make it possible to capture approximate relatedness [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which would, in turn,
improve pruning results. Such an approximate relatedness captured in the embedding space may
also alleviate issues related to misuse of P31 (“instance of”) and P279 (“subclass of”) properties
in the ontology hierarchy of Wikidata [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]. We evaluate our approach by building the
nucleus of an Orange Knowledge Graph based on a repository of Orange business terms [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
In particular, we test different definitions of distance in the embedding space as well as different
distance and degree thresholds. We then evaluate the quality of our pruning approach by
manually labeling kept and pruned classes.
      </p>
      <p>This paper is organized as follows. In Section 2, we present related work about pruning in
knowledge graphs. Our pruning approach is detailed in Section 3 and evaluated on our Orange
use case in Section 4. We provide a discussion of our results in Section 5 as well as future
research directions in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In recent years, several methods for pruning in KGs have been proposed. Faralli et al. distinguish
two categories of pruning: soft pruning and aggressive pruning [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Soft pruning requires
human input of relevant taxonomic concepts whereas aggressive pruning relies on the topology
of the graph. In their paper, they propose an aggressive pruning method to extract a Directed
Acyclic Graph from a potentially noisy and cyclic knowledge graph, given a set of input nodes.
      </p>
      <p>
        Pruning approaches for KGs represent major assets to reduce the computational complexity
of downstream applications, e.g., recommender or question-answering systems, by allowing
them to focus on relevant entities. For example, Tian et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] propose a knowledge pruning
method based on a Recurrent Graph Convolutional Network to limit information augmentation
in a news recommendation perspective. A perception layer is used on each entity related to an
original entity to compute the relevance between them, where only entities with a relevance
larger than zero are kept. Xian et al. [17] integrate a user-conditional action pruning in a
KG-based reinforcement learning model for a recommendation system. This pruning action
relies on a scoring function that keeps the promising edges and takes into account the starting
user.
      </p>
      <p>To speed up a link prediction task, Joshi et al. [18] propose to frame it as a query-answering
problem and use star-shaped subgraphs that group similar entities. In particular, they learn
subgraph and query embeddings, and then compute a likelihood score between them. This
score determines which subgraph will be explored to predict the correct entity by pruning
irrelevant subgraphs. Regarding question-answering, Lu et al. [19] prune irrelevant paths with
both a tail-based pruning and a path-based pruning. Tail-based pruning removes paths whose
tail relation is not the relation that leads to the answer. However, this pruning is not adapted
to questions involving new relations. That is why the authors propose a path-based pruning
that consists in removing paths that do not involve the domain types of all the relations in the
correct path.</p>
      <p>Alternatively, distance in the embedding space has already been leveraged in several ways.
In link prediction approaches, distance can be used as a means to measure prediction error. For
example, in TransE [20], the distance between h + r and t is used to evaluate the plausibility
of a triple ⟨h, r, t⟩, where h, r, and t are the embeddings of the head entity, the relation, and
the tail entity. Distance can also represent the semantic relatedness between entities to
match [21]. Inspired by these approaches, we study in this paper whether distance in the
embedding space can represent topic relatedness between classes, and thus can be used for
pruning unrelated classes.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Expanding and Pruning from Original Business Terms</title>
      <p>As aforementioned, we consider that we have at our disposal a set of original business terms
mapped to entities of Wikidata. We then perform an expansion along the ontology hierarchy of
Wikidata (Section 3.1). To keep retrieved classes related to target business domains and original
business terms, this expansion is controlled with node degree and distance in the embedding
space (Section 3.2).</p>
      <sec id="sec-3-1">
        <title>3.1. Expansion Along the Ontology Hierarchy</title>
        <p>Figure 1 illustrates the expansion along the hierarchy of Wikidata classes. Each original business
term is mapped to a Wikidata entity represented by the starting QID. We first retrieve its direct
classes, following P31 (“instance of”) edges. For each of these direct classes, we then retrieve all
their super-classes by following P279 (“subclass of”) edges up to the root. We also retrieve their
sub-classes by following reversed P279 edges up to the deepest reachable Wikidata classes. To
avoid a high number of SPARQL queries, we use a local hashmap that is built from the Wikidata
dump and contains the adjacency lists of all entities in Wikidata. This significantly reduces
the execution time of the expansion.</p>
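        <p>The expansion over the local adjacency structure can be sketched as follows. This is an illustrative sketch under our own assumptions: the dictionary-based adjacency maps and the function names stand in for the hashmap built from the Wikidata dump, and are not the actual implementation.</p>

```python
from collections import deque

def expand_hierarchy(start_qid, p31, p279, p279_inv):
    """Expand along the Wikidata ontology hierarchy from a starting QID.

    p31: entity QID -> its direct classes ("instance of").
    p279: class QID -> its super-classes ("subclass of").
    p279_inv: reversed P279 adjacency (class QID -> its sub-classes).
    """
    direct = set(p31.get(start_qid, []))

    def closure(seeds, adjacency):
        # Breadth-first traversal following one direction of P279 edges.
        seen, queue = set(), deque(seeds)
        while queue:
            cls = queue.popleft()
            for nxt in adjacency.get(cls, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    supers = closure(direct, p279)      # up to the root
    subs = closure(direct, p279_inv)    # down to the deepest classes
    return direct, supers, subs
```

        <p>Reading adjacency from local dictionaries instead of issuing one SPARQL query per class is what keeps the traversal fast.</p>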
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Pruning Unrelated Classes</title>
        <p>To prune unrelated classes when traversing the hierarchy of Wikidata classes, we propose to
rely on node degree and distance in the embedding space. A class c that does not respect the
thresholds described below is not traversed and not integrated in the EKG nucleus. This applies
to direct classes, super- and sub-classes.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Node Degree</title>
          <p>The degree of a node is defined as the sum of its incoming and outgoing edges. In our case, we
compute the degree of Wikidata classes by only considering incoming and outgoing P31 and
P279 edges since they are the only edges traversed in our expansion. For example, the degree of
cl1 in Figure 1 is equal to 4. We assume that classes with a high degree will cause a deviation
from the original business domains and terms. Indeed, classes with a high degree may be too
general and will lead to adding numerous super- or sub-classes to the EKG nucleus. We thus
view the degree of a class as indicative of its concreteness and specificity.</p>
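          <p>Restricting the degree to P31/P279 edges can be sketched as below, assuming the edges are available as (subject, property, object) triples. The naive scan is for illustration only; in practice the counts would be read from the local adjacency structure of Section 3.1.</p>

```python
def p31_p279_degree(qid, edges):
    # Degree restricted to P31/P279 edges, counting both incoming and
    # outgoing occurrences of `qid` in the triple list.
    return sum(
        1
        for s, p, o in edges
        if p in ("P31", "P279") and qid in (s, o)
    )
```
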
          <p>To prune such classes, we use the two following thresholds. First, an absolute degree threshold
θ_degree-abs, configured as an input parameter, allows us to prune classes c such that degree(c) &gt;
θ_degree-abs. However, this threshold may not always be applicable. Consider classes reached at a
specific expansion level. It is possible for some of them to have much higher degrees than the
other classes of the same level without these degrees being higher than θ_degree-abs. In this case,
we consider these classes with relatively higher degrees as anomalies. We prune them with an
approach commonly used in anomaly detection. We use a relative degree threshold θ_degree-rel,
computed at each expansion level and defined as follows:
θ_degree-rel = Q3 + α × (Q3 − Q1)
(1)
where Q1 and Q3 are the first and third quartiles of the degrees of classes reached at a given
expansion level. The α coefficient is an input parameter that controls how much a class
degree is allowed to deviate from the third quartile, in terms of number of interquartile ranges.
We then prune classes c such that degree(c) &gt; θ_degree-rel. We compute this relative threshold
at each expansion level since we assume degree may vary depending on the level but should
be consistent at a given level. It should be noted that we apply this threshold if and only if
the maximum degree at an expansion level exceeds a parameter δ. This parameter allows us to
retrieve all classes reached at an expansion level if they all have a low degree, regardless of
discrepancies in their degrees.</p>
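          <p>The per-level degree pruning can be sketched as follows. The symbols are reconstructions of the paper's thresholds (an absolute threshold, an IQR-based relative threshold with coefficient alpha, and a gate delta on the maximum degree); the function and parameter names are ours, for illustration.</p>

```python
import statistics

def relative_degree_threshold(degrees, alpha):
    # Equation (1): Q3 + alpha * (Q3 - Q1), computed per expansion level.
    q1, _, q3 = statistics.quantiles(degrees, n=4)
    return q3 + alpha * (q3 - q1)

def prune_level(classes_with_degree, alpha, theta_abs, delta):
    """Apply absolute then relative degree pruning to one expansion level.

    The relative threshold is applied only if the maximum degree at this
    level exceeds the gate `delta`.
    """
    # Absolute threshold: drop classes whose degree exceeds theta_abs.
    kept = {c: d for c, d in classes_with_degree.items() if d <= theta_abs}
    if kept and max(kept.values()) > delta:
        theta_rel = relative_degree_threshold(list(kept.values()), alpha)
        kept = {c: d for c, d in kept.items() if d <= theta_rel}
    return kept
```

          <p>With a level of mostly low-degree classes and one high-degree anomaly, the interquartile range collapses and the anomaly is pruned; if all degrees stay below the gate, the whole level is kept.</p>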
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Distances in the Embedding Space</title>
          <p>We also assume that distance in the embedding space represents topic relatedness between
classes. Hence, classes whose embeddings are far from the embedding of the
starting QID may deviate from the original business domains and terms. To prune them, for
each starting QID, we define a relative distance threshold θ_dist-rel as follows:
θ_dist-rel = β × μ_dist-cl
(2)
To compute this threshold, we first retrieve all the direct classes cl of the starting QID. We
assume that these direct classes are closely related to the starting QID. Hence, they can serve
as a basis to measure the remoteness of other classes in the embedding space. We compute
the distances between their embeddings and the embedding of the starting QID. Direct classes
whose distance to the starting QID is greater than the third quartile of distances plus the
interquartile range are removed. These classes are abnormally far from the starting QID, and
thus may not constitute a correct basis for remoteness. Then, we compute the mean μ_dist-cl of
the distances between the embeddings of the remaining direct classes and the embedding of the
starting QID. In the expansion, we prune all classes c whose distance to the embedding of the
starting QID is greater than θ_dist-rel. β is an input coefficient that
controls the allowed range of distances.</p>
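          <p>The computation of this relative distance threshold can be sketched as follows. The notation (β as the coefficient, Q3 + IQR as the outlier cutoff) reconstructs the paper's description; the function signature and the pluggable distance argument are illustrative assumptions.</p>

```python
import statistics

def distance_threshold(start_emb, direct_class_embs, beta, dist):
    """Equation (2): beta times the mean distance from the starting QID
    to its non-outlier direct classes.

    `dist(a, b)` is one of the two distance definitions of Section 3.2.2.
    """
    distances = sorted(dist(start_emb, e) for e in direct_class_embs)
    q1, _, q3 = statistics.quantiles(distances, n=4)
    # Direct classes farther than Q3 + IQR are an abnormal basis: drop them.
    cutoff = q3 + (q3 - q1)
    basis = [d for d in distances if d <= cutoff]
    return beta * statistics.mean(basis)
```
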
          <p>Since classes can have instances, we propose the two following definitions for the distance
between a class and the starting QID:
Definition 1 (Distance 1). The distance between a class and the starting QID is the Euclidean
distance between their embeddings.</p>
          <p>Definition 2 (Distance 2). The distance between a class and the starting QID is the Euclidean
distance between the centroids of the embeddings of their respective instances. In case the class or
the starting QID has no instance, its embedding is used instead.</p>
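          <p>Distance 2, with its fallback when a node has no instance, can be sketched as below; the plain-list embeddings and the `instances` mapping are illustrative stand-ins for the pre-trained embeddings and the P31 adjacency.</p>

```python
import math

def centroid(vectors):
    # Component-wise mean of a non-empty list of equal-length vectors.
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance2(qid_a, qid_b, embeddings, instances):
    """Distance 2: Euclidean distance between the centroids of the
    embeddings of each node's instances; a node with no instance is
    represented by its own embedding instead."""
    def point(qid):
        inst = instances.get(qid, [])
        if inst:
            return centroid([embeddings[i] for i in inst])
        return embeddings[qid]
    return euclidean(point(qid_a), point(qid_b))
```

          <p>Distance 1 is simply `euclidean(embeddings[qid_a], embeddings[qid_b])`, i.e., the special case where neither node has instances.</p>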
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We experimented with our approach using the pre-trained embeddings of Wikidata available in
PyTorch-BigGraph [22]¹. These embeddings were learned for more than 78,000,000 entities of
the 2019-03-06 version of Wikidata.</p>
      <p>¹ https://torchbiggraph.readthedocs.io/en/latest/pretrained_embeddings.html</p>
      <sec id="sec-4-1">
        <title>4.1. Validating Distance as an Indicator of Topic Relatedness</title>
        <p>
          We wanted to validate our hypothesis that distance in the embedding space can be used as
an indicator of topic relatedness between classes. To this aim, we checked that the distance
between embeddings of classes increases with the number of P279 edges between them. That is
to say, distant classes in the ontology hierarchy should also be distant in the embedding space.
The Wikidata hierarchy is known to present some issues related to misuse of P31 and P279
properties [
          <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
          ]. That is why we expected a large range of distances between embeddings
of classes separated by the same number of P279 edges. This also motivated our choice not
to rely on the number of edges for pruning.
        </p>
        <p>We retrieved all direct and indirect sub-classes of the root class of Wikidata (Q35120)². Then,
for each class c, similarly to our expansion (Section 3.1), we retrieved all its super-classes and
sub-classes, the number of traversed P279 edges to reach each of them, and computed the
distance between their embeddings and the embedding of c. This allowed us to compute the
distributions of distances between classes in the embedding space w.r.t. the number of P279
edges between them. It is noteworthy that we performed a breadth-first search to retrieve
all super- and sub-classes of a class c. Consequently, only the lowest number of P279 edges
between two classes was taken into account.</p>
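        <p>Recording only the lowest number of P279 edges between two classes follows directly from breadth-first search, as in this minimal sketch (the adjacency dictionary and function name are illustrative):</p>

```python
from collections import deque

def min_p279_hops(start, adjacency):
    # BFS over P279 edges: each class is first reached at its shallowest
    # depth, so `hops` holds the minimum number of traversed edges.
    hops = {start: 0}
    queue = deque([start])
    while queue:
        cls = queue.popleft()
        for nxt in adjacency.get(cls, []):
            if nxt not in hops:
                hops[nxt] = hops[cls] + 1
                queue.append(nxt)
    return hops
```
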
        <p>We conducted this experiment with the two distances presented in Section 3.2. We retrieved
2,615,380 sub-classes of the root class. Figure 2 depicts the distributions of distances obtained
with 54% of these sub-classes, which accounts for around 55.8 million pairwise distances³. We
can see that distance 2 seems to better capture the distance in the ontology hierarchy than
distance 1. With distance 1, the median distance oscillates whereas, with distance 2, the
median distance consistently increases with the number of P279 edges between classes. Hence,
our hypothesis may only be valid with distance 2. As expected, large ranges of distances exist
between classes separated by the same number of P279 edges. This can be caused by misuse of
the P279 property and different levels of representation granularity in different parts of the
ontology hierarchy. Such a result confirms our choice not to rely on the number of P279 edges
between classes for pruning.</p>
        <p>Let us illustrate the difference between the two distances outlined in Figure 2 with an example.
Consider the original business term “Microsoft SharePoint” as the starting QID. We performed
the expansion as described in Section 3.1 and we computed the distances between “Microsoft
SharePoint” and its direct classes. Figure 3 displays these distances. We notice that, with
distance 1, distances in the embedding space are scattered, contrary to distance 2. This leads to
important discrepancies. For example, the “Content Management System” class (green circle) is
a direct class and directly characterizes “Microsoft SharePoint”. However, with distance 1, they
are far from each other in the embedding space, which could potentially lead to incorrectly
pruning “Content Management System”. On the contrary, distance 2 correctly leads to a close
proximity between this class and the starting QID.</p>
        <p>² https://www.wikidata.org/wiki/Q35120</p>
        <p>³ Our computations are limited due to time constraints.</p>
        <p>[Figure 2a: distribution of embedding distances w.r.t. the number of traversed P279 edges (1–25), with distance 1.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Building the Nucleus of the Orange Knowledge Graph</title>
        <p>We applied our approach to build the nucleus of the Orange Knowledge Graph. Orange is a
multi-service operator (e.g., telecommunications, video on demand, music, banking,
cybersecurity) with more than 140,000 employees and a heterogeneous client portfolio (e.g., individuals,
companies). To face such a diversity of activities, an internal repository contains public and
internal business terms together with their textual definitions.</p>
        <p>839 business terms were manually matched with their corresponding Wikidata entities before
applying the expansion mechanism described in Section 3.1. This matching was manually
performed by AI researchers from Orange who identified “same as” alignments between terms
in the internal repository and Wikidata entities. To help them in their task, each term was
presented with candidate entities based on string similarity between terms and labels of entities.
The expansion retrieved 393 distinct direct classes, 946 distinct super-classes, and 2,560,426
distinct sub-classes. From these results, we chose to focus only on pruning sub-classes since
their large number may indicate classes unrelated to the original business terms.</p>
        <p>To illustrate the evolution of the two characteristics of interest for our pruning approach,
we show in Figure 4a the maximum, mean, and minimum distances (with the distance 2 definition)
of reached classes at each level of expansion from our 839 starting QIDs. Similarly, Figure 4b
depicts the maximum, mean, and minimum degrees of reached classes at each level of expansion
from our 839 starting QIDs. We notice that the mean distance only increases slightly through
the expansion w.r.t. its initial value at expansion level 1. Such a stability suggests that
distances of direct classes are good representatives of close distances and can effectively serve
as a basis to evaluate remoteness in the embedding space. This validates our definition of θ_dist-rel
(Equation (2)) as dependent upon the mean distance of direct classes. Figure 4b illustrates that
many generic classes are retrieved. For example, “Galaxy”, “Human”, “Taxon”, and “Star” are
retrieved from “Linux”. However, sub-classes retrieved at the next expansion level from such
generic classes may be unrelated to the original business terms. Interestingly, these peaks in
node degrees are not associated with peaks in distances. This observation confirms the need
for a degree-based pruning besides a distance-based pruning. In particular, the fixed degree
threshold allows us to tackle the displayed degree peaks.</p>
        <p>We also performed six expansion and pruning experiments from the 839 original business
terms with different configurations. We fixed θ_degree-abs = 200, α = 1.5, and δ = 20, but tested
different values of β ∈ {1.2, 1.25, 1.3} and the two definitions of distance (1 and 2). We
manually labeled the pruned and kept classes to evaluate the performance of our approach.
Results are presented in Table 1. Each row presents the results of one experiment in which all
defined thresholds are applied. Since configurations differ between expansions, classes may
be differently pruned in some expansions due to distance thresholds. This leads to different
explorations of the ontology hierarchy, and thus different numbers for degree-based pruning.
The left part of Table 1 presents the number of pruned classes and the resulting precision per
pruning threshold, as well as global results. It should be noted that some classes were pruned by
exceeding both the θ_degree-rel and θ_dist-rel thresholds (represented in column (4)). The right part
of Table 1 presents the number of kept classes and the resulting precision. Results are discussed
below. In Table 1, (1) stands for pruning with θ_degree-rel; (2) stands for pruning with θ_degree-abs;
(3) stands for pruning with θ_dist-rel; (4) stands for pruning with both θ_degree-rel and θ_dist-rel;
(5) stands for global pruning. Values in bold indicate the best precision.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>
        Regarding pruning precision, we note in Table 1 that distance 2 obtains a better precision than
distance 1 for distance-based pruning. This result was expected given the better organization of
the embedding space with distance 2 than with distance 1, as described in Section 4.1. Our
experiments thus indicate that a distance based on centroids of class instances better captures
the relatedness between classes. Distance 2 may therefore better cope with the “instance of”
and “subclass of” properties not always being used correctly in the hierarchy, which impacts
the quality of the Wikidata ontology [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
        ]. Table 1 also shows that distance 1 reaches a better precision for degree-based pruning,
but this only applies to a reduced number of classes. Consequently, the better performance
of distance 1 on these thresholds has little impact on the global pruning precision, whose best
result is obtained with distance 2.
      </p>
      <p>We notice that more classes are pruned with distance 2 than with distance 1. Similarly, more
classes are kept with distance 2. This indicates that distance 2 leads to a different and more
extensive hierarchical exploration than distance 1. This more extensive exploration comes at the
expense of the precision of kept classes, which is lower with distance 2 in all configurations.
Conversely, the more restrained exploration with distance 1 comes at the expense of the precision
of pruned classes, which is lower in all configurations. However, the precision of kept classes
with distance 2 is only lower by 5 to 7 points for 510 to 827 additionally kept classes. This
means that many of the additionally kept classes are correct. Hence, distance 2 appears to be a
better setting that offers a further enrichment of the EKG nucleus without an important
precision decrease. In both cases, our pruning approach reduces the number of integrated classes
in the EKG nucleus from more than 2.5 million (without pruning) to around 2,000 with a good
precision.</p>
      <p>Regarding distance in the embedding space, it should be noted that requiring a user to
provide a distance threshold is not a trivial task since such a distance may not be interpretable.
This assessment is in line with the need for semantic embeddings that give humans a meaning
and an interpretation of distance in the embedding space [23]. In this perspective, Wikidata
could serve as a benchmark. It would also be interesting to validate our hypothesis of
topic-relatedness representativity not only with super- or sub-classes as in Section 4 but with
siblings, which are not currently considered in our approach.</p>
      <p>Table 1 shows that the precisions of pruned and kept classes are inter-dependent. Indeed,
increasing the distance coefficient β improves the precision of pruned classes at the expense of
the precision of kept classes. This means that a higher distance threshold removes fewer classes
but with a higher precision. However, some of the additionally retrieved classes are unrelated
to the original business terms. This shows that the decision boundary between classes to prune
and classes to keep is not perfectly captured by our approach and the two considered features,
i.e., node degree and distance in the embedding space. A trade-off must thus be found to obtain
good precision for both pruned and kept classes.</p>
      <p>One can wonder about the performance of other embedding models for a pruning based on
Euclidean distance since they may organize the embedding space differently. Future work
on performance comparison should also consider symbolic pruning approaches and take into
account scalability and execution time besides precision. Going a step further, we could envision
a machine learning model that learns or adapts graph embeddings to better prune classes. Such
a model could, for example, use Graph Convolutional Networks (GCN) with the Soft Nearest
Neighbor Loss (SNNL) as in [21]. To reduce user input, only pruned classes could be labeled,
which would lead to a semi-supervised pruning task. Finally, to further expand the EKG nucleus,
instances of kept classes could also be evaluated with our thresholds before being integrated.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we aimed at building a nucleus of the class hierarchy of an Enterprise Knowledge
Graph for Orange. To this aim, we mapped original business terms available in an internal
repository to their corresponding entities in Wikidata. Then, we performed a hierarchical
expansion along the ontology hierarchy to integrate super- and sub-classes. Due to the high
number of traversed classes, we proposed to limit this expansion with a pruning approach
relying on node degree and distance in the embedding space. Our experiments showed that
considering the embedding of a class as the centroid of the embeddings of its instances improves
distance pruning. Results also highlighted that node degree is an effective and necessary feature
that cannot be substituted by distance in the embedding space. In future work, we aim to
confirm these results with different graph embedding models and to investigate the learning of
graph embeddings specific to the pruning task. The EKG nucleus could also be further enriched
by matching Orange with similar corporations and retrieving their Wikidata statements. Such
statements could also undergo a pruning process similar to the one described in this paper.
[17] Y. Xian, Z. Fu, S. Muthukrishnan, G. de Melo, Y. Zhang, Reinforcement knowledge graph
reasoning for explainable recommendation, in: Proceedings of the 42nd International
ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR
2019, Paris, France, July 21-25, 2019, ACM, 2019, pp. 285–294.
[18] U. Joshi, J. Urbani, Searching for embeddings in a haystack: Link prediction on knowledge
graphs with subgraph pruning, in: WWW ’20: The Web Conference 2020, Taipei, Taiwan,
April 20-24, 2020, ACM / IW3C2, 2020, pp. 2817–2823.
[19] J. Lu, Z. Zhang, X. Yang, J. Feng, Eficient subgraph pruning &amp; embedding for multi-relation
QA over knowledge graph, in: International Joint Conference on Neural Networks, IJCNN
2021, Shenzhen, China, July 18-22, 2021, IEEE, 2021, pp. 1–8.
[20] A. Bordes, N. Usunier, A. García-Durán, J. Weston, O. Yakhnenko, Translating embeddings
for modeling multi-relational data, in: Advances in Neural Information Processing Systems
26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings
of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp.
2787–2795.
[21] P. Monnin, C. Raïssi, A. Napoli, A. Coulet, Discovering alignment relations with graph
convolutional networks: A biomedical case study, Semantic Web 13 (2022) 379–398.
[22] A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, A. Peysakhovich, PyTorch-BigGraph:
A large-scale graph embedding system, in: Proceedings of Machine Learning
and Systems 2019, MLSys 2019, Stanford, CA, USA, March 31 - April 2, 2019, mlsys.org,
2019.
[23] H. Paulheim, Make embeddings semantic again!, in: Proceedings of the ISWC 2018 Posters
&amp; Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th International
Semantic Web Conference (ISWC 2018), Monterey, USA, October 8th to 12th, 2018, volume
2180 of CEUR Workshop Proceedings, CEUR-WS.org, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge Graphs</article-title>
          ,
          <source>Synthesis Lectures on Data, Semantics, and Knowledge</source>
          , Morgan &amp; Claypool Publishers,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Industry-scale knowledge graphs: lessons and challenges</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>62</volume>
          (
          <year>2019</year>
          )
          <fpage>36</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Galkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Scerri</surname>
          </string-name>
          ,
          <article-title>Enterprise knowledge graphs: A backbone of linked enterprise data</article-title>
          , in: 2016 IEEE/WIC/ACM International Conference on Web Intelligence, WI
          <year>2016</year>
          , Omaha, NE, USA, October 13-16, 2016, IEEE Computer Society, 2016, pp.
          <fpage>497</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lassila</surname>
          </string-name>
          ,
          <article-title>Designing and Building Enterprise Knowledge Graphs</article-title>
          ,
          <source>Synthesis Lectures on Data, Semantics, and Knowledge</source>
          , Morgan &amp; Claypool Publishers,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          , et al.,
          <article-title>AutoKnow: Self-driving knowledge collection for products of thousands of types</article-title>
          ,
          in:
          <source>KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          , Virtual Event, CA, USA, August 23-27,
          <year>2020</year>
          , ACM, 2020
          , pp.
          <fpage>2724</fpage>
          -
          <lpage>2734</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <article-title>Machine knowledge: Creation and curation of comprehensive knowledge bases</article-title>
          ,
          <source>Foundations and Trends Databases</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>108</fpage>
          -
          <lpage>490</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandecic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of graph embedding: Problems, techniques, and applications</article-title>
          ,
          <source>IEEE Trans. Knowl. Data Eng</source>
          .
          <volume>30</volume>
          (
          <year>2018</year>
          )
          <fpage>1616</fpage>
          -
          <lpage>1637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          , E. Cambria,
          <string-name>
            <given-names>P.</given-names>
            <surname>Marttinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A survey on knowledge graphs: Representation, acquisition, and applications</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>33</volume>
          (
          <year>2022</year>
          )
          <fpage>494</fpage>
          -
          <lpage>514</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Faralli</surname>
          </string-name>
          , I. Finocchi,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Velardi</surname>
          </string-name>
          ,
          <article-title>Efficient pruning of large knowledge graphs</article-title>
          ,
          <source>in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19</source>
          ,
          <year>2018</year>
          , Stockholm, Sweden, ijcai.org,
          <year>2018</year>
          , pp.
          <fpage>4055</fpage>
          -
          <lpage>4063</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. V.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <article-title>Towards A model theory for distributed representations</article-title>
          , in: 2015 AAAI Spring Symposia, Stanford University, Palo Alto, California, USA, March 22-25,
          <year>2015</year>
          , AAAI Press, 2015
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Brasileiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P. A.</given-names>
            <surname>Almeida</surname>
          </string-name>
          , V. A. de Carvalho, G. Guizzardi,
          <article-title>Applying a multi-level modeling theory to assess taxonomic hierarchies in Wikidata</article-title>
          ,
          in:
          <source>Proceedings of the 25th International Conference on World Wide Web, WWW 2016</source>
          , Montreal, Canada, April 11-15,
          <year>2016</year>
          , Companion Volume, ACM, 2016
          , pp.
          <fpage>975</fpage>
          -
          <lpage>980</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Piscopo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Phethean</surname>
          </string-name>
          , E. Simperl,
          <article-title>What Makes a Good Collaborative Knowledge Graph: Group Composition and Quality in Wikidata</article-title>
          , in: Social Informatics - 9th International Conference,
          <source>SocInfo 2017</source>
          , Oxford, UK, September 13-15,
          <year>2017</year>
          , Proceedings, Part I, volume
          <volume>10539</volume>
          of Lecture Notes in Computer Science, Springer, 2017
          , pp.
          <fpage>305</fpage>
          -
          <lpage>322</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shenoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ilievski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garijo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Szekely</surname>
          </string-name>
          ,
          <article-title>A study of the quality of Wikidata</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>72</volume>
          (
          <year>2022</year>
          )
          <fpage>100679</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Monnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Deuzé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Labbé</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <article-title>A framework for automatically interpreting tabular data at Orange</article-title>
          , in:
          <source>Proceedings of the ISWC 2021 Posters, Demos and Industry Tracks: From Novel Ideas to Industrial Practice co-located with 20th International Semantic Web Conference (ISWC 2021)</source>
          , Virtual Conference, October 24-28,
          <year>2021</year>
          , volume
          <volume>2980</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org, 2021
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Joint knowledge pruning and recurrent graph convolution for news recommendation</article-title>
          ,
          in:
          <source>SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Virtual Event, Canada, July 11-15,
          <year>2021</year>
          , ACM, 2021
          , pp.
          <fpage>51</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>