Wikidata to Bootstrap an Enterprise Knowledge
Graph: How to Stay on Topic?
Lucas Jarnac1 , Pierre Monnin1
1 Orange, Belfort, France


                                         Abstract
                                         An Enterprise Knowledge Graph (EKG) is a major asset for companies as it supports several downstream
                                         tasks, e.g., business vocabulary sharing, search, or question answering. The bootstrap of an EKG can
                                         rely on internal business data sets or open knowledge graphs. In this paper, we consider the latter
                                         approach. We propose to build the nucleus of an EKG class hierarchy by mapping internal business
                                         terms to Wikidata entities and performing an expansion along the ontology hierarchy. This nucleus thus
                                         contains additional business terms that will, in turn, support further knowledge extraction approaches
                                         from texts or tables. However, since Wikidata contains numerous classes, there is a need to limit this
                                         expansion by pruning classes unrelated to the business topics of interest. To this aim, we propose to rely
                                         on the distance between node embeddings and node degree. We show that considering the embedding
                                         of a class as the centroid of the embeddings of its instances improves pruning and that node degree is a
                                         necessary feature to consider.

                                         Keywords
                                         Knowledge Graph, Pruning, Node Degree, Graph Embedding, Distance




1. Introduction
Knowledge graphs (KGs) provide a structured representation of data and knowledge in which
entities are represented as nodes, and relations between them are represented as edges [1, 2].
Enterprise Knowledge Graphs (EKGs) are major assets of companies since they support various
downstream applications including knowledge/vocabulary sharing and reuse, data integration,
information system unification, search, or question answering [2, 3, 4]. That is why several
companies such as Google, Microsoft, Amazon, Facebook, or IBM have built their own knowledge
graphs [2]. Building an EKG is an iterative and continuous process that can be carried out with
various approaches. For example, it is possible to feed the KG with data and knowledge extracted
from relational databases [4]. Other approaches rely on automatic knowledge extraction from
semi-structured or textual data, such as the AutoKnow system for the Amazon Product Graph [5].
In such a view, it is common to first build a high-quality nucleus of the KG with entities and
categories extracted from premium sources. This nucleus will then support automatic knowledge
extraction systems applied on a wider variety of data sources [6].


Wikidata’22: Wikidata workshop at ISWC 2022
lucas.jarnac@orange.com (L. Jarnac); pierre.monnin@orange.com (P. Monnin)
https://pmonnin.github.io (P. Monnin)
ORCID: 0000-0002-2819-2679 (L. Jarnac); 0000-0002-2017-8426 (P. Monnin)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                       CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
   In this paper, we consider this first step of building a high-quality nucleus of an EKG. In
particular, we focus on bootstrapping its ontology hierarchy. To this aim, we assume that we
have at our disposal a repository of business terms. Building the nucleus can be achieved by
mapping these terms to entities of an existing KG and integrating parts of its knowledge into the
EKG. Here, we propose to perform an expansion along the ontology hierarchy of the existing
KG. We retrieve the direct classes of mapped entities, their super- and sub-classes, and integrate
them in the EKG (see Figure 1). Thus, the EKG nucleus contains additional business terms
related to the original terms. Choosing the existing KG to integrate highly depends on the target
business domains. In our work, we consider Wikidata [7], a large, free, and collaborative KG of
the Wikimedia Foundation. It is a general-purpose KG that can be seen as a premium source [6].
However, due to its large size (more than 90 million items), there is a need to limit the expansion
to avoid integrating terms unrelated to the target business domains and the original business
terms. To illustrate, from 839 original terms related to Orange business domains, it is possible
to retrieve more than 2.5 million sub-classes.
   In our work, we propose to carry out this limitation by pruning unrelated classes during the
expansion along the Wikidata ontology hierarchy. To do so, we rely on node degree (i.e., number
of incoming and outgoing edges) and the distance between node embeddings which are learned
by graph embedding models. The latter encode graph structures (e.g., nodes, edges) into a
low-dimensional vector space that preserves as much as possible the properties of the graph [8].
Graph embeddings have shown impressive performance in various tasks such as link prediction,
matching, classification, or recommendation [8, 9]. Inspired by these successful approaches,
we aim at investigating in this preliminary work whether distance in the embedding space can
be leveraged to prune unrelated classes. Because entities and relations are represented as
low-dimensional vectors, an embedding-based pruning could alleviate the computational
complexity issues that are one disadvantage of most pruning methods, according to Faralli
et al. [10]. Additionally, the continuous nature of graph embeddings, compared to the discrete
nature of graph structures, may make it possible to capture approximate relatedness [11], which would, in turn,
improve pruning results. Such an approximate relatedness captured in the embedding space may
also alleviate issues related to misuse of P31 (“instance of”) and P279 (“subclass of”) properties
in the ontology hierarchy of Wikidata [12, 13, 14]. We evaluate our approach by building the
nucleus of an Orange Knowledge Graph based on a repository of Orange business terms [15].
In particular, we test different definitions of distance in the embedding space as well as different
distance and degree thresholds. We then evaluate the quality of our pruning approach by
manually labeling kept and pruned classes.
   This paper is organized as follows. In Section 2, we present related work about pruning in
knowledge graphs. Our pruning approach is detailed in Section 3 and evaluated on our Orange
use case in Section 4. We provide a discussion of our results in Section 5 as well as future
research directions in Section 6.


2. Related Work
In recent years, several methods for pruning in KGs have been proposed. Faralli et al. distinguish
two categories of pruning: soft pruning and aggressive pruning [10]. Soft pruning requires
human input of relevant taxonomic concepts whereas aggressive pruning relies on the topology
of the graph. In their paper, they propose an aggressive pruning method to extract a Directed
Acyclic Graph from a potentially noisy and cyclic knowledge graph, given a set of input nodes.
   Pruning approaches for KGs represent major assets to reduce the computational complexity
of downstream applications, e.g., recommender or question-answering systems, by allowing
them to focus on relevant entities. For example, Tian et al. [16] propose a knowledge pruning
method based on a Recurrent Graph Convolutional Network to limit information augmentation
in a news recommendation perspective. A perception layer is used on each entity related to an
original entity to compute the relevance between them, where only entities with a relevance
larger than zero are kept. Xian et al. [17] integrate a user-conditional action pruning in a
KG-based reinforcement learning model for a recommendation system. This pruning action
relies on a scoring function that keeps the promising edges and takes into account the starting
user.
   To speed up a link prediction task, Joshi et al. [18] propose to frame it as a query-answering
problem and use star-shaped subgraphs that group similar entities. In particular, they learn
subgraph and query embeddings, and then compute a likelihood score between them. This
score determines which subgraph will be explored to predict the correct entity by pruning
irrelevant subgraphs. Regarding question-answering, Lu et al. [19] prune irrelevant paths with
both a tail-based pruning and a path-based pruning. Tail-based pruning removes paths whose
tail relation is not the relation that leads to the answer. However, this pruning is not adapted
to questions involving new relations. That is why, the authors propose a path-based pruning
that consists in removing paths that do not involve the domain types of all the relations in the
correct path.
   Alternatively, distance in the embedding space has already been leveraged in several ways.
In link prediction approaches, distance can be used as a means to measure prediction error. For
example, in TransE [20], the distance between $\vec{s} + \vec{p}$ and $\vec{o}$ is used to evaluate the plausibility
of a triple ⟨𝑠, 𝑝, 𝑜⟩ being true, where $\vec{s}$, $\vec{p}$, and $\vec{o}$ are the embeddings of 𝑠, 𝑝, and 𝑜. Distance
can also represent the semantic relatedness between entities to match [21]. Inspired by these
approaches, we study in this paper whether distance in the embedding space can represent
topic relatedness between classes, and thus can be used for pruning unrelated classes.
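As a hedged illustration of such a distance-based plausibility score (a minimal NumPy sketch, not the original TransE implementation), the score of a triple under TransE can be written as:

```python
import numpy as np

def transe_score(s, p, o):
    """TransE plausibility sketch: the lower ||s + p - o||, the more
    plausible the triple <s, p, o> is considered to be."""
    return float(np.linalg.norm(np.asarray(s) + np.asarray(p) - np.asarray(o)))
```

A perfectly plausible triple has a score of zero, since the translated subject embedding lands exactly on the object embedding.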


3. Expanding and Pruning from Original Business Terms
As aforementioned, we consider that we have at our disposal a set of original business terms
mapped to entities of Wikidata. We then perform an expansion along the ontology hierarchy of
Wikidata (Section 3.1). To keep retrieved classes related to target business domains and original
business terms, this expansion is controlled with node degree and distance in the embedding
space (Section 3.2).

3.1. Expansion Along the Ontology Hierarchy
Figure 1 illustrates the expansion along the hierarchy of Wikidata classes. Each original business
term is mapped to a Wikidata entity represented by the starting QID. We first retrieve its direct
classes, following P31 (“instance of”) edges. For each of these direct classes, we then retrieve all
Figure 1: Illustration of the expansion along the ontology hierarchy starting from an original business
term mapped to a Wikidata entity (starting QID). We first retrieve its direct classes following P31 edges
(cl𝑖 ), and then their super-classes (sup𝑗 ) and sub-classes (sub𝑘 ) following P279 edges.



their super-classes by following P279 (“subclass of”) edges up to the root. We also retrieve their
sub-classes by following reversed P279 edges up to the deepest reachable Wikidata classes. To
avoid a high number of SPARQL queries, we use a local hashmap that is built from the Wikidata
dump and contains the adjacency lists of all entities in Wikidata. This significantly reduces
the execution time of the expansion.
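The expansion can be sketched as two breadth-first traversals over the local hashmap. The layout of the adjacency structure below (keys "P31", "P279", "P279_rev") is a hypothetical choice for illustration; the paper does not specify its exact format.

```python
from collections import deque

def expand(start_qid, adjacency):
    """Expansion sketch: retrieve direct classes (P31), then all
    super-classes (P279) and sub-classes (reversed P279)."""
    direct_classes = set(adjacency.get(start_qid, {}).get("P31", []))
    supers, subs = set(), set()
    # Breadth-first traversal of P279 edges up to the root ...
    queue = deque(direct_classes)
    while queue:
        c = queue.popleft()
        for sup in adjacency.get(c, {}).get("P279", []):
            if sup not in supers:
                supers.add(sup)
                queue.append(sup)
    # ... and of reversed P279 edges down to the deepest reachable classes.
    queue = deque(direct_classes)
    while queue:
        c = queue.popleft()
        for sub in adjacency.get(c, {}).get("P279_rev", []):
            if sub not in subs:
                subs.add(sub)
                queue.append(sub)
    return direct_classes, supers, subs
```

Reading from an in-memory dictionary built once from the dump avoids issuing one SPARQL query per traversed class.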

3.2. Pruning Unrelated Classes
To prune unrelated classes when traversing the hierarchy of Wikidata classes, we propose to
rely on node degree and distance in the embedding space. A class c that does not respect the
thresholds described below is not traversed and not integrated in the EKG nucleus. This applies
to direct classes, super- and sub-classes.

3.2.1. Node Degree
The degree of a node is defined as the sum of its incoming and outgoing edges. In our case, we
compute the degree of Wikidata classes by only considering incoming and outgoing P31 and
P279 edges since they are the only edges traversed in our expansion. For example, the degree of
cl1 in Figure 1 is equal to 4. We assume that classes with a high degree will cause a deviation
from the original business domains and terms. Indeed, classes with a high degree may be too
general and will lead to adding numerous super- or sub-classes to the EKG nucleus. We thus
view the degree of a class as indicative of its concreteness and specificity.
   To prune such classes, we use the two following thresholds. First, an absolute degree threshold
𝜏degree-abs configured as an input parameter is used to prune classes c such that degree(c) >
𝜏degree-abs . However, this threshold may not always be applicable. Consider classes reached at a
specific expansion level. It is possible for some of them to have much higher degrees than the
other classes of the same level without these degrees being higher than 𝜏degree-abs . In this case,
we consider these classes with relative higher degrees as anomalies. We prune them with an
approach commonly used in anomaly detection. We use a relative degree threshold 𝜏degree-rel
computed at each expansion level and defined as follows:

                                 𝜏degree-rel = 𝑄3 + 𝛼 × (𝑄3 − 𝑄1 )                                    (1)
We compute the first (𝑄1 ) and third (𝑄3 ) quartile of the degree of classes reached at a given
expansion level. The 𝛼 coefficient is an input parameter that controls how far a class degree
may deviate from the third quartile, in multiples of the interquartile range.
We then prune classes c such that degree(c) > 𝜏degree-rel . We compute this relative threshold
at each expansion level since we assume degree may vary depending on the level but should
be consistent at a given level. It should be noted that we apply this threshold if and only if
the maximum degree at an expansion level exceeds a parameter 𝛾. This parameter ensures that
all classes reached at an expansion level are retrieved when they all have a low degree, regardless
of discrepancies among their degrees.
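The combination of the absolute and relative degree thresholds can be sketched as follows (assuming NumPy; the default parameter values are those used in Section 4, and the function name is hypothetical):

```python
import numpy as np

def degree_pruned(degrees, tau_abs=200, alpha=1.5, gamma=20):
    """Return the indices of classes pruned at one expansion level
    based on node degree."""
    degrees = np.asarray(degrees)
    # Absolute threshold: prune classes with degree > tau_abs.
    pruned = set(np.flatnonzero(degrees > tau_abs))
    # Relative threshold, applied only if the maximum degree exceeds gamma.
    if degrees.max() > gamma:
        q1, q3 = np.percentile(degrees, [25, 75])
        tau_rel = q3 + alpha * (q3 - q1)  # Equation (1)
        pruned |= set(np.flatnonzero(degrees > tau_rel))
    return pruned
```

The quartiles are recomputed at each expansion level, so a degree that is an outlier at one level may be unremarkable at another.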

3.2.2. Distances in the Embedding Space
We also assume that distance in the embedding space represents topic relatedness between
classes. Hence, classes with high distance between their embeddings and the embedding of the
starting QID may deviate from the original business domains and terms. To prune them, for
each starting QID, we define a relative distance threshold 𝜏dist-rel as follows:
                                            𝜏dist-rel = 𝛽 × 𝜇dist-cl                             (2)
To compute this threshold, we first retrieve all the direct classes cl of the starting QID. We
assume that these direct classes are closely related to the starting QID. Hence, they can serve
as a basis to measure the remoteness of other classes in the embedding space. We compute
the distances between their embeddings and the embedding of the starting QID. Direct classes
whose distance with the starting QID is greater than the third quartile of distances plus the
interquartile range are removed. These classes are abnormally far from the starting QID, and
thus may not constitute a correct basis for remoteness. Then, we compute the mean 𝜇dist-cl of
the distances between the embeddings of the remaining direct classes and the embedding of the
starting QID. In the expansion, we prune all classes c whose distance between their embeddings
and the embedding of the starting QID is greater than 𝜏dist-rel . 𝛽 is an input coefficient that
controls the allowed range of distances.
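As a minimal sketch of this computation (assuming NumPy; the function name and argument layout are hypothetical), the threshold of Equation (2) could be derived as:

```python
import numpy as np

def dist_threshold(start_emb, direct_class_embs, beta=1.2):
    """Compute tau_dist_rel (Equation (2)) for one starting QID.
    Direct classes abnormally far from the starting QID (beyond
    Q3 + IQR) are discarded before computing the mean mu_dist_cl."""
    dists = np.linalg.norm(np.asarray(direct_class_embs) - np.asarray(start_emb), axis=1)
    q1, q3 = np.percentile(dists, [25, 75])
    kept = dists[dists <= q3 + (q3 - q1)]  # remove outlier direct classes
    mu = kept.mean()                       # mu_dist_cl
    return beta * mu                       # tau_dist_rel
```

During the expansion, any class whose distance to the starting QID exceeds the returned threshold is pruned.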
   Since classes can have instances, we propose the two following definitions for the distance
between a class and the starting QID:
Definition 1 (Distance 𝒟1 ). The distance between a class and the starting QID is the Euclidean
distance between their embeddings.
Definition 2 (Distance 𝒟2 ). The distance between a class and the starting QID is the Euclidean
distance between the centroids of the embeddings of their respective instances. In case the class or
the starting QID has no instance, its embedding is used instead.
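The two definitions can be sketched as follows (assuming NumPy; the helper names are hypothetical):

```python
import numpy as np

def d1(emb_a, emb_b):
    """Distance D1: Euclidean distance between the two embeddings."""
    return float(np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b)))

def d2(emb_a, instance_embs_a, emb_b, instance_embs_b):
    """Distance D2: Euclidean distance between the centroids of the
    instance embeddings; when an entity has no instance, its own
    embedding is used instead (as in Definition 2)."""
    ca = np.mean(instance_embs_a, axis=0) if len(instance_embs_a) else np.asarray(emb_a)
    cb = np.mean(instance_embs_b, axis=0) if len(instance_embs_b) else np.asarray(emb_b)
    return float(np.linalg.norm(ca - cb))
```

With 𝒟2, a class is thus located where its instances cluster rather than where its own (possibly under-trained) embedding happens to lie.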


4. Experiments
We experimented with our approach using the pre-trained embeddings of Wikidata available in
PyTorch-BigGraph [22]1 . These embeddings were learned for more than 78,000,000 entities of
the 2019-03-06 version of Wikidata.
   1
       https://torchbiggraph.readthedocs.io/en/latest/pretrained_embeddings.html
4.1. Validating Distance as an Indicator of Topic Relatedness
We wanted to validate our hypothesis that distance in the embedding space can be used as
an indicator of topic relatedness between classes. To this aim, we checked that the distance
between embeddings of classes increases with the number of P279 edges between them. That is
to say, distant classes in the ontology hierarchy should also be distant in the embedding space.
The Wikidata hierarchy is known to present some issues related to misuse of P31 and P279
properties [12, 13, 14]. That is why, we expected a large range of distances between embeddings
of classes separated by the same number of P279 edges. This is also the reason motivating not
to rely on the number of edges for pruning.
   We retrieved all direct and indirect sub-classes of the root class of Wikidata (Q351202 ). Then,
for each class c, similarly to our expansion (Section 3.1), we retrieved all its super-classes and
sub-classes, the number of traversed P279 edges to reach each of them, and computed the
distance between their embeddings and the embedding of c. This allowed us to compute the
distributions of distances between classes in the embedding space w.r.t. the number of P279
edges between them. It is noteworthy that we performed a breadth-first search to retrieve
all super- and sub-classes of a class c. Consequently, only the lowest number of P279 edges
between two classes was taken into account.
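This validation step can be sketched as a breadth-first search per class, grouping pairwise distances by number of P279 hops (a simplified sketch with hypothetical names; here `adjacency` holds P279 neighbors in both directions and `dist` is either 𝒟1 or 𝒟2):

```python
from collections import defaultdict, deque

def distance_by_hops(classes, adjacency, dist):
    """For each number of P279 hops, collect the embedding distances
    between class pairs. Breadth-first search guarantees that only the
    lowest number of P279 edges between two classes is counted."""
    per_hops = defaultdict(list)
    for c in classes:
        seen, queue = {c}, deque([(c, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops > 0:
                per_hops[hops].append(dist(c, node))
            for nb in adjacency.get(node, ()):  # P279 neighbors, both directions
                if nb not in seen:
                    seen.add(nb)
                    queue.append((nb, hops + 1))
    return per_hops
```

The resulting per-hop distributions are what Figure 2 summarizes as box plots.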
   We conducted this experiment with the two distances presented in Section 3.2. We retrieved
2,615,380 sub-classes of the root class. Figure 2 depicts the distributions of distances obtained
with 54% of these sub-classes, which accounts for around 55.8 million pairwise distances3 . We
can see that distance 𝒟2 seems to better capture the distance in the ontology hierarchy than
distance 𝒟1 . With distance 𝒟1 , the median distance oscillates whereas, with distance 𝒟2 , the
median distance consistently increases with the number of P279 edges between classes. Hence,
our hypothesis may only be valid with 𝒟2 . As expected, large ranges of distances exist between
classes separated by the same number of P279 edges. This can be caused by misuse of the P279
property and different levels of representation granularity in different parts of the ontology
hierarchy. Such a result confirms our choice not to rely on number of P279 edges between
classes for pruning.
   Let us illustrate the difference between the two distances outlined in Figure 2 with an example.
Consider the original business term “Microsoft SharePoint” as the starting QID. We performed
the expansion as described in Section 3.1 and we computed the distances between “Microsoft
SharePoint” and its direct classes. Figure 3 displays these distances. We notice that, with
𝒟1 , distances in the embedding space are scattered, contrary to 𝒟2 . This leads to significant
discrepancies. For example, the “Content Management System” class (green circle) is a direct
class and directly characterizes “Microsoft SharePoint”. However, with 𝒟1 , they are far from
each other in the embedding space, which could potentially lead to incorrectly pruning “Content
Management System”. On the contrary, 𝒟2 correctly leads to a close proximity between this
class and the starting QID.




   2
       https://www.wikidata.org/wiki/Q35120
   3
       Our computations are limited due to time constraints.
 [Figure 2 here: two box-plot panels, (a) With 𝒟1 and (b) With 𝒟2 ; x-axis: number of traversed P279 edges, y-axis: embedding distance.]
 Figure 2: Distributions of distances between embeddings of classes w.r.t. the number of P279 edges
 between them.

 [Figure 3 here: distances (0–6) between “Microsoft SharePoint” and its direct classes, plotted on one axis for 𝒟1 and one for 𝒟2 .]

 Figure 3: Distances between the original business term “Microsoft SharePoint” and its direct classes
 (expansion level 1): social software ( ), document management system ( ), Enterprise Content Man-
 agement system ( ), server software ( ), and content management system ( ). With 𝒟1 , direct classes
 are dispersed. For example, “content management system” is far from the starting QID. On the contrary,
 𝒟2 leads to a better organization of distances.


 4.2. Building the Nucleus of the Orange Knowledge Graph
We applied our approach to build the nucleus of the Orange Knowledge Graph. Orange is a
multi-service operator (e.g., telecommunications, video on demand, music, banking, cybersecu-
rity) with more than 140,000 employees and a heterogeneous client portfolio (e.g., individuals,
companies). To face such a diversity of activities, an internal repository contains public and
internal business terms together with their textual definitions.
   839 business terms were manually matched with their corresponding Wikidata entities before
applying the expansion mechanism described in Section 3.1. This matching was manually
performed by AI researchers from Orange who identified “same as” alignments between terms
in the internal repository and Wikidata entities. To help them in their task, each term was
presented with candidate entities based on string similarity between terms and labels of entities.
The expansion led to the retrieval of 393 distinct direct classes, 946 distinct super-classes, and
2,560,426 distinct sub-classes. From these results, we chose to focus only on pruning sub-classes
since their large number may indicate classes unrelated to the original business terms.
 [Figure 4 here: two panels per expansion level (1–26): (a) Distances from business terms (𝒟2 definition) and (b) Node degrees (scale 1e7).]
   Figure 4: Maximum ( ), mean ( ), and minimum ( ) distances and node degrees of retrieved classes
   per level of expansion from 839 original business terms.


    To illustrate the evolution of the two characteristics of interest of our pruning approach,
 we show in Figure 4a the maximum, mean, and minimum distances (with the 𝒟2 definition)
 of reached classes at each level of expansion from our 839 starting QIDs. Similarly, Figure 4b
 depicts the maximum, mean, and minimum degrees of reached classes at each level of expansion
 from our 839 starting QIDs. We notice that the mean distance only increases slightly through
 the expansion w.r.t. its initial value at expansion level 1. This stability suggests that the
 distances of direct classes are good representatives of close distances and can effectively serve
 as a basis to evaluate remoteness in the embedding space. This validates our definition of 𝜏dist-rel
 (Equation (2)) as dependent upon the mean distance of direct classes. Figure 4b illustrates that
 many generic classes are retrieved. For example, “Galaxy”, “Human”, “Taxon”, and “Star” are
 retrieved from “Linux”. However, sub-classes retrieved at the next expansion level from such
 generic classes may be unrelated to the original business terms. Interestingly, these peaks in
 node degrees are not associated with peaks in distances. This observation confirms the need
 for a degree-based pruning alongside a distance-based pruning. In particular, the fixed degree
 threshold allows tackling the displayed degree peaks.
    We also performed six expansion and pruning experiments from the 839 original business
 terms with different configurations. We fixed 𝜏degree-abs = 200, 𝛼 = 1.5, and 𝛾 = 20 but tested
 with different values of 𝛽 ∈ {1.2, 1.25, 1.3} and the two definitions of distance 𝒟1 and 𝒟2 . We
 manually labeled the pruned and kept classes to evaluate the performance of our approach.
 Results are presented in Table 1. Each row presents the results of one experiment in which all
 defined thresholds are applied. Since configurations differ between expansions, classes may
 be differently pruned in some expansions due to distance thresholds. This leads to different
 explorations of the ontology hierarchy, and thus different numbers for degree-based pruning.
 The left part of Table 1 presents the number of pruned classes and the resulting precision per
 pruning threshold as well as global results. It should be noted that some classes were pruned by
 exceeding both the 𝜏degree-rel and 𝜏dist-rel thresholds (represented in columns (4)). The right part
 of Table 1 presents the number of kept classes and the resulting precision. Results are discussed
 below.
Table 1
Number and precision of pruned and kept classes with the two distances and different configurations
for 𝛽. (1) stands for pruning with 𝜏degree-rel ; (2) stands for pruning with 𝜏degree-abs ; (3) stands for pruning
with 𝜏dist-rel ; (4) stands for pruning with both 𝜏degree-rel and 𝜏dist-rel ; (5) stands for global pruning. Values
in bold indicate the best precision.
                   # Pruned classes                       Precision
          𝛽     (1)  (2)   (3)   (4)   (5)    (1)   (2)   (3)   (4)   (5)   # Kept classes  Precision
       1.2       11   45  1,344  182  1,582   0.91  0.87  0.69  0.78  0.71      1,135         0.86
   𝒟1  1.25      19   46  1,289  183  1,537   0.89  0.87  0.73  0.79  0.74      1,293         0.83
       1.3       25   46  1,224  189  1,484   0.88  0.87  0.75  0.79  0.76      1,484         0.77
       1.2      184   60  2,108  118  2,470   0.74  0.85  0.75  0.92  0.76      1,645         0.81
   𝒟2  1.25     213   64  2,032  110  2,419   0.77  0.84  0.79  0.92  0.79      1,931         0.76
       1.3      250   65  1,917  100  2,332   0.79  0.83  0.84  0.94  0.84      2,311         0.71


5. Discussion
Regarding pruning precision, we note in Table 1 that 𝒟2 obtains a better precision than 𝒟1 for
distance-based pruning. This result was expected given the better organization of the embedding
space with 𝒟2 than with 𝒟1 , as described in Section 4.1. It appears through our experiments
that a distance based on centroids of class instances seems to better carry the relatedness
between classes. 𝒟2 may thus better cope with “instance of” and “subclass of” properties
not always being used correctly in the hierarchy, which impacts the quality of the Wikidata
ontology [12, 13, 14]. Table 1 shows that 𝒟1 reaches a better precision for degree-based pruning
but this only applies to a reduced number of classes. Consequently, the better performance
of 𝒟1 on these thresholds has little impact on global pruning precision, whose best result is
obtained with 𝒟2 .
   We notice that more classes are pruned with 𝒟2 than with 𝒟1 . Similarly, more classes are kept
with 𝒟2 . This indicates that 𝒟2 leads to a different and more extensive hierarchical exploration
than 𝒟1 . This more extensive exploration comes at the expense of the precision of kept classes,
which is lower with 𝒟2 in all configurations. Conversely, the more restrained exploration with 𝒟1
comes at the expense of the precision of pruned classes, which is lower in all configurations. However,
precision of kept classes with 𝒟2 is only lower by 5 to 7 points for 510 to 827 additionally
kept classes. This means that many of the additionally kept classes are correct. Hence, 𝒟2
appears to be a better setting that offers a further enrichment of the EKG nucleus without an
important precision decrease. In both cases, our pruning approach reduces the number of
integrated classes in the EKG nucleus from more than 2.5 million (without pruning) to around
2,000 with a good precision.
   Regarding distance in the embedding space, it should be noted that requiring a user to
provide a distance threshold is not a trivial task, since such a distance may not be interpretable.
This observation is in line with the call for semantic embeddings that make distances in the
embedding space meaningful and interpretable to humans [23]. In this perspective, Wikidata
could serve as a benchmark. It would also be interesting to validate our hypothesis that
embeddings capture topic relatedness not only with super- or sub-classes as in Section 4 but
also with siblings, which are not currently considered in our approach.
   Table 1 shows that the precisions of pruned and kept classes are interdependent. Indeed,
increasing the distance coefficient 𝛽 improves the precision of pruned classes at the expense of
the precision of kept classes. This means that a higher distance threshold prunes fewer classes
but with a higher precision. However, some of the additionally retrieved classes are unrelated
to the original business terms. This shows that the decision boundary between classes to prune
and classes to keep is not perfectly captured by our approach and the two considered features,
i.e., node degree and distance in the embedding space. A trade-off must thus be found to obtain
a good precision for both pruned and kept classes.
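To make the trade-off concrete, here is a hypothetical sketch of a pruning rule combining the two features; the thresholds, the way 𝛽 scales a reference distance, and the conjunction of the two criteria are assumptions for illustration, not necessarily the exact rule used in the paper:

```python
def keep_class(degree: int, distance: float,
               degree_threshold: int, base_distance: float, beta: float) -> bool:
    """Hypothetical pruning rule: keep a class if it is sufficiently connected
    AND close enough (within beta * base_distance) to the seed business terms.
    An illustration only; the paper's exact combination may differ."""
    return degree >= degree_threshold and distance <= beta * base_distance

# Increasing beta loosens the distance criterion: more classes are kept, so the
# classes that are still pruned are pruned with higher precision, while the
# precision of kept classes may decrease.
kept = keep_class(degree=120, distance=3.2,
                  degree_threshold=50, base_distance=2.0, beta=2.0)
pruned = not keep_class(degree=3, distance=9.0,
                        degree_threshold=50, base_distance=2.0, beta=2.0)
```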
   One can wonder about the performance of other embedding models for a pruning based on
Euclidean distance, since they may organize the embedding space differently. Future work
on performance comparison should also consider symbolic pruning approaches and take into
account scalability and execution time besides precision. Going a step further, we could envision
a machine learning model that learns or adapts graph embeddings to better prune classes. Such
a model could, for example, use Graph Convolutional Networks (GCN) with the Soft Nearest
Neighbor Loss (SNNL) as in [21]. To reduce user input, only pruned classes could be labeled,
which would lead to a semi-supervised pruning task. Finally, to further expand the EKG nucleus,
instances of kept classes could also be evaluated with our thresholds before being integrated.
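As an illustration of the SNNL mentioned above, the following is a minimal numpy sketch of the loss itself; pairing it with a GCN as in [21] is left out, and the toy data and temperature are assumptions:

```python
import numpy as np

def soft_nearest_neighbor_loss(embeddings, labels, temperature=1.0):
    """Soft Nearest Neighbor Loss over a batch of embeddings (numpy sketch).
    Lower values mean same-label points are closer together than
    different-label points."""
    x = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    # Pairwise squared Euclidean distances, turned into similarities.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    sim = np.exp(-d2 / temperature)
    np.fill_diagonal(sim, 0.0)               # exclude self-pairs
    same = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(same, 0.0)
    num = (sim * same).sum(axis=1)           # same-label neighborhood mass
    den = sim.sum(axis=1)                    # total neighborhood mass
    mask = num > 0                           # skip points with no same-label peer
    return float(-np.mean(np.log(num[mask] / den[mask])))

# Toy check: well-separated label clusters yield a lower loss than mixed labels.
points = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.0, 10.1]])
loss_clustered = soft_nearest_neighbor_loss(points, [0, 0, 1, 1])
loss_mixed = soft_nearest_neighbor_loss(points, [0, 1, 0, 1])
```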


6. Conclusion
In this paper, we aimed at building a nucleus of the class hierarchy of an Enterprise Knowledge
Graph for Orange. To this aim, we mapped original business terms available in an internal
repository to their corresponding entities in Wikidata. Then, we performed a hierarchical
expansion along the ontology hierarchy to integrate super- and sub-classes. Due to the high
number of traversed classes, we proposed to limit this expansion with a pruning approach
relying on node degree and distance in the embedding space. Our experiments showed that
considering the embedding of a class as the centroid of the embeddings of its instances improves
distance pruning. Results also highlighted that node degree is an effective and necessary feature
that cannot be substituted by distance in the embedding space. In future work, we aim to
confirm these results with different graph embedding models and to investigate the learning of
graph embeddings specific to the pruning task. The EKG nucleus could also be further enriched
by matching Orange with similar corporations and retrieving their Wikidata statements. Such
statements could also undergo a pruning process similar to the one described in this paper.


References
 [1] A. Hogan, et al., Knowledge Graphs, Synthesis Lectures on Data, Semantics, and Knowledge,
     Morgan & Claypool Publishers, 2021.
 [2] N. F. Noy, Y. Gao, A. Jain, A. Narayanan, A. Patterson, J. Taylor, Industry-scale knowledge
     graphs: lessons and challenges, Commun. ACM 62 (2019) 36–43.
 [3] M. Galkin, S. Auer, S. Scerri, Enterprise knowledge graphs: A backbone of linked enterprise
     data, in: 2016 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2016,
     Omaha, NE, USA, October 13-16, 2016, IEEE Computer Society, 2016, pp. 497–502.
 [4] J. Sequeda, O. Lassila, Designing and Building Enterprise Knowledge Graphs, Synthesis
     Lectures on Data, Semantics, and Knowledge, Morgan & Claypool Publishers, 2021.
 [5] X. L. Dong, et al., AutoKnow: Self-driving knowledge collection for products of thousands
     of types, in: KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and
     Data Mining, Virtual Event, CA, USA, August 23-27, 2020, ACM, 2020, pp. 2724–2734.
 [6] G. Weikum, X. L. Dong, S. Razniewski, F. M. Suchanek, Machine knowledge: Creation and
     curation of comprehensive knowledge bases, Foundations and Trends Databases 10 (2021)
     108–490.
 [7] D. Vrandecic, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications
     of the ACM 57 (2014) 78–85.
 [8] H. Cai, V. W. Zheng, K. C. Chang, A comprehensive survey of graph embedding: Problems,
     techniques, and applications, IEEE Trans. Knowl. Data Eng. 30 (2018) 1616–1637.
 [9] S. Ji, S. Pan, E. Cambria, P. Marttinen, P. S. Yu, A survey on knowledge graphs: Represen-
     tation, acquisition, and applications, IEEE Transactions on Neural Networks and Learning
     Systems 33 (2022) 494–514.
[10] S. Faralli, I. Finocchi, S. P. Ponzetto, P. Velardi, Efficient pruning of large knowledge
     graphs, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial
     Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, ijcai.org, 2018, pp. 4055–
     4063.
[11] R. V. Guha, Towards A model theory for distributed representations, in: 2015 AAAI Spring
     Symposia, Stanford University, Palo Alto, California, USA, March 22-25, 2015, AAAI Press,
     2015.
[12] F. Brasileiro, J. P. A. Almeida, V. A. de Carvalho, G. Guizzardi, Applying a multi-level
     modeling theory to assess taxonomic hierarchies in Wikidata, in: Proceedings of the 25th
     International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15,
     2016, Companion Volume, ACM, 2016, pp. 975–980.
[13] A. Piscopo, C. Phethean, E. Simperl, What Makes a Good Collaborative Knowledge Graph:
     Group Composition and Quality in Wikidata, in: Social Informatics - 9th International
     Conference, SocInfo 2017, Oxford, UK, September 13-15, 2017, Proceedings, Part I, volume
     10539 of Lecture Notes in Computer Science, Springer, 2017, pp. 305–322.
[14] K. Shenoy, F. Ilievski, D. Garijo, D. Schwabe, P. A. Szekely, A study of the quality of
     Wikidata, Journal of Web Semantics 72 (2022) 100679.
[15] Y. Chabot, P. Monnin, F. Deuzé, V. Huynh, T. Labbé, J. Liu, R. Troncy, A framework for
     automatically interpreting tabular data at Orange, in: Proceedings of the ISWC 2021 Posters,
     Demos and Industry Tracks: From Novel Ideas to Industrial Practice co-located with 20th
     International Semantic Web Conference (ISWC 2021), Virtual Conference, October 24-28,
     2021, volume 2980 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.
[16] Y. Tian, Y. Yang, X. Ren, P. Wang, F. Wu, Q. Wang, C. Li, Joint knowledge pruning and
     recurrent graph convolution for news recommendation, in: SIGIR ’21: The 44th Interna-
     tional ACM SIGIR Conference on Research and Development in Information Retrieval,
     Virtual Event, Canada, July 11-15, 2021, ACM, 2021, pp. 51–60.
[17] Y. Xian, Z. Fu, S. Muthukrishnan, G. de Melo, Y. Zhang, Reinforcement knowledge graph
     reasoning for explainable recommendation, in: Proceedings of the 42nd International
     ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR
     2019, Paris, France, July 21-25, 2019, ACM, 2019, pp. 285–294.
[18] U. Joshi, J. Urbani, Searching for embeddings in a haystack: Link prediction on knowledge
     graphs with subgraph pruning, in: WWW ’20: The Web Conference 2020, Taipei, Taiwan,
     April 20-24, 2020, ACM / IW3C2, 2020, pp. 2817–2823.
[19] J. Lu, Z. Zhang, X. Yang, J. Feng, Efficient subgraph pruning & embedding for multi-relation
     QA over knowledge graph, in: International Joint Conference on Neural Networks, IJCNN
     2021, Shenzhen, China, July 18-22, 2021, IEEE, 2021, pp. 1–8.
[20] A. Bordes, N. Usunier, A. García-Durán, J. Weston, O. Yakhnenko, Translating embeddings
     for modeling multi-relational data, in: Advances in Neural Information Processing Systems
     26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings
     of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp.
     2787–2795.
[21] P. Monnin, C. Raïssi, A. Napoli, A. Coulet, Discovering alignment relations with graph
     convolutional networks: A biomedical case study, Semantic Web 13 (2022) 379–398.
[22] A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, A. Peysakhovich, Pytorch-
     biggraph: A large scale graph embedding system, in: Proceedings of Machine Learning
     and Systems 2019, MLSys 2019, Stanford, CA, USA, March 31 - April 2, 2019, mlsys.org,
     2019.
[23] H. Paulheim, Make embeddings semantic again!, in: Proceedings of the ISWC 2018 Posters
     & Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th International
     Semantic Web Conference (ISWC 2018), Monterey, USA, October 8th to 12th, 2018, volume
     2180 of CEUR Workshop Proceedings, CEUR-WS.org, 2018.