Enriching Wikidata with Semantified
                  Wikipedia Hyperlinks

                        Armand Boschin and Thomas Bonald

                   Télécom Paris, Institut Polytechnique de Paris
                {armand.boschin,thomas.bonald}@telecom-paris.fr


        Abstract. We propose a novel approach to enrich Wikidata with the
        textual content of Wikipedia. Specifically, we leverage knowledge graph
        (KG) embedding models to classify the hyperlinks between Wikipedia
        articles and predict the corresponding facts. For instance, we would like
        to complete the triple (Berlin, *, Germany) with the relation capital of,
        given a hyperlink from Berlin to Germany in Wikipedia. While existing
        KG embedding models can be used for this task of relation prediction,
        they were not explicitly designed for it and their performance is not satis-
        factory. In this paper, we propose two methods that greatly improve the
        performance of these models on this task: first, a new negative sampling
        method that balances the roles of entities and relations during training;
        second, a method to exploit the types of entities in the selection of candi-
        date relations. We obtain accuracy scores as high as 94% on the popular
        FB15k237 dataset and 75% on WDV5, an extraction of Wikidata. The
        efficiency of the approach is illustrated on some Wikipedia pages, where
        new facts unknown to Wikidata are predicted by our method.

        Keywords: Wikidata · knowledge graph · embedding · relation predic-
        tion · negative sampling · relation typing · machine learning


1     Introduction

In the recent years, Wikipedia has become the largest open-source collection of
knowledge. Its textual content is however mostly unstructured, the structured
information being mainly limited to the content of infoboxes (e.g., place and date
of birth for articles on humans). The hyperlinks make another structure which
is not fully integrated in Wikidata yet. The main challenge is that, in order to
know the meaning of an hyperlink, an agent needs to read the text in which the
hyperlink is embedded. While some hyperlinks do not correspond to relevant
facts, we claim that this is a rich source of information to complete Wikidata.
To illustrate this, one can look at the level 5 of Wikipedia vital articles1 , that
is about 40,000 pages serving as a centralized watchlist to track the quality of
the most important articles. These Wikipedia pages are linked by slightly more
than 3 million hyperlinks, which is far more than the approximately 200,000
1
    https://en.wikipedia.org/wiki/Wikipedia:Vital articles/Level/5


Copyright c 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
2         A. Boschin, T. Bonald

facts linking the corresponding entities in Wikidata. For instance, there is a link
from the page Henri Poincaré to the page Optics in Wikipedia. This suggests
the existence of a relation linking the two entities, here field of work. This fact
is not present in Wikidata.
    Formally, a KG consists of a set of vertices called entities (e.g., person, place,
date, concept) linked by directed edges in the form of triples (h, r, t) where h
(resp. t) is the head (resp. the tail) entity and r is a relation carrying the semantic
nature of the edge. When a triple is known to be true, it is called a fact.
    In this paper, we address the issue of relation prediction: finding the relation
linking some given head and tail entities. For instance, we would like to complete
the triple (Berlin, *, Germany) with the relation capital of, assuming the fact
is not in the KG. This task is also known as the semantification of a link. For
this, we leverage the embedding of the entities and relations of the KG to com-
pute scores on possible triples. Though most existing works on embeddings have
focused on the task of link prediction, that is, completing either the triple (*,
capital of, Germany) (head prediction) or (Berlin, capital of, *) (tail prediction),
we show that embeddings for relation prediction can also perform notably well.
We propose two techniques for that: first, we adapt the training of the models
by balancing the role of entities and relations in the negative sampling step and
then we use the types of entities to filter candidate relations.
    These techniques prove very efficient, allowing a simple embedding model
like TransE [3] to reach accuracy of 94% on the popular FB15k237 dataset and
75% on WDV5, an extraction of Wikidata based on the level 5 of Wikipedia
vital articles. This suggests that Wikidata can be significantly enriched by the
semantification of Wikipedia hyperlinks.


Contributions. The main contributions of this work are the following:

    • An approach to enrich Wikidata by the semantification of the hyperlinks of
      Wikipedia.
    • A novel negative sampling technique for improving the ability of KG em-
      bedding models to predict relations, without affecting their performance on
      link prediction.
    • A novel filtering technique for relation prediction where candidate relations
      are selected through the types of the head and tail entities.
    • A new dataset, WDV5, consisting of the facts between entities of Wikidata
      corresponding to the level 5 of Wikipedia vital articles2 .


2      Related work

KG embedding. A KG embedding model is defined as a function f that com-
putes a score for any triple (h, r, t) using some vector representations of h, t and
2
    https://en.wikipedia.org/wiki/Wikipedia:Vital articles/Level/5
                   Enriching Wikidata with Semantified Wikipedia Hyperlinks              3

r. By extension, the vectors representing entities and relations are called embed-
dings. KG embeddings have been specifically designed for link prediction3 : given
an entity h (resp. t) and a relation r, the model is used to predict an entity t
(resp. h) so that the fact (h, r, t) is the most likely to be true. This prediction is
done by selecting the entity giving the highest score among all entities.
    There are three categories of models depending on the form of the scoring
function f and thus on the way entities and relations interact in the vector space
(see [20] for more details):

    • Linear models, where h, r, t are linked by a linear relation in the vector space.
      Some projections can be added to increase the expressiveness of the model.
      Examples of such models include TransE [3] and TransH [22].
    • Bilinear models, where the relation r is a bilinear form of h and t in the
      vector space. Examples include RESCAL [10] and ComplEx [18].
    • Deep models, based on neural networks, possibly including attention mech-
      anisms [9,21]. These models give state-of-the-art performance in link predic-
      tion but are usually heavy and hard to train and prone to over-fitting.


Negative sampling. The scoring function f of an embedding model is expected
to discriminate facts from false statements and thus needs to be trained with
both. Since most KGs do not record false statements, training is usually done
under the Closed World Assumption (CWA), i.e., unknown triples are considered
as false. This may seem contradictory as the model is then used to predict
unknown facts, that are expected to be true. This is however the only way to
learn meaningful scoring functions f . The random generation of false statements
is known as Negative Sampling (NS). It has a major impact on the performance
of the trained model [7].
    Given some known fact (h, r, t), the usual way to create a false statement
from it (under the CWA) is to randomly choose either the head entity or the
tail entity and to replace it with another random entity of the KG [3]. This
technique was improved in [22] by using a Bernoulli parameter (see Section 3).
The replacement of the relation is rarely considered. It is mentioned in [23] but
not precisely described nor studied, as it is not the main focus of that article.
We propose a modification of the Bernoulli NS technique to include random
replacement of the relation, to get high performance in both link prediction and
relation prediction.


Type Filtering. Most KGs assign one or several type(s) to each entity through
a rdf:type relation (e.g., the P31: “instance of ” relation in Wikidata). The
types of entities have mainly been used in link prediction, either to enforce type
constraints in negative sampling or to select the candidate entities [8,23].
3
    Note that some authors refer to this task as relation prediction, see [9] for instance.
    We make a clear distinction between link prediction (head or tail entity unknown)
    and relation prediction (relation unknown).
4        A. Boschin, T. Bonald

    KBs can also enforce type constraints on relations via rdfs:domain and
rdfs:range constraints. In relation prediction, selecting candidate relations with
these constraints seems natural but they can be missing or too coarse grained
making the filtering either too restrictive or with no effect. In Wikidata, relation
constraints are hints for the editors, not firm restrictions4 . We propose a method
to infer such constraints simply from the rdf:type relation of the KG at hand
and use the resulting constraints in relation prediction. We show that it has a
major impact on performance.

NLP for relation prediction There are two main tasks tackled by NLP meth-
ods. The first is relation prediction, also known as relation extraction, consisting
in predicting the semantic relation linking two entities using sentences describing
these entities. The best performing models rely on deep neural networks with
attention mechanisms [19,12,25]. The second task is entity linking, that is link-
ing relations of a KG to plain text surface forms. Some interesting articles are
[14,24]. Our task of relation prediction in KG is different as it relies on the graph
structure of the KG only and on not any textual content. A method combining
both approaches is left as an interesting perspective for future research.

Relation linking

Hyperlink semantification. Very few works exist on the semantification of
Wikipedia hyperlinks using the graph structure of the KG only. The approach
of Galarraga et al. [5] is based on rule mining. A limit of this method is that it
can only predict relations for entities matching the body of the mined rule. Our
technique based on KG embedding applies to all links.


3     Background
Bernoulli Negative Sampling. The usual negative sampling technique (noted
BerNS) relies on relation-specific Bernoulli distributions to choose between the
head or the tail which entity of a fact should be replaced to maximize the prob-
ability of the resulting triple to be false [22]. Formally, a Bernoulli parameter pr
is computed for each relation r as follows:
                                             ρrt,h
                                  pr =                ,
                                         ρt,h + ρrh,t
                                          r


where ρrt,h (resp. ρrh,t ) is the average number of tail entity per head entity (resp.
head entity per tail entity) among all known facts involving r. This parameter
pr is the probability to replace the head entity of the fact.
    As an example, consider “author of ” which is a one-to-many relation (one
author and many potential books). In that case, the head entity (an author)
4
    https://www.wikidata.org/wiki/Help:Property constraints portal
                  Enriching Wikidata with Semantified Wikipedia Hyperlinks             5

should be more likely replaced than the tail entity (a book), yielding a false
statement with greater probability.


Model training. Training a model comes down to finding its parameters (the
embeddings) so that the scoring function gives high scores to facts and low
scores to false statements. Given a training set of facts denoted (h, r, t), the
corresponding false statements (h0 , r, t0 ) are generated by NS. Then for each pair
of facts (h, r, t) and (h0 , r, t0 ), a loss measuring the gap between the corresponding
scores is computed, `(f (h, r, t), f (h0 , r, t0 )). This loss ` should be high for close
scores. Examples include the logistic loss and the margin loss [16]. Minimizing
the overall loss (e.g., by gradient descent [13]) gives a scoring function f that is
expected to discriminate facts from false statements.


4     Relation prediction

4.1   Approach

Our approach relies on the following techniques.


KG Embedding. KG embedding models can be used for relation prediction the
same way they are used in the aforementioned link prediction: given two entities h
and t, an embedding model and its scoring function f , the relations of the graph
can be ranked by decreasing order of scores: f (h, r1 , t) > f (h, r2 , t) > · · · >
f (h, rk , t). The relation r1 is then predicted, corresponding to the fact (h, r1 , t).
Note that this method applies to the case of undirected links, by ranking the
scores of the predictions for both directed links (h, t) and (t, h). This is especially
useful when some relations have no reciprocal.


Balanced Negative Sampling. Simple experiments show that off-the-shelf
linear models like TransE perform really badly in relation prediction (3% of
Hit@1 on FB15k237, see Table 2a). This suggests that the representation of re-
lations is not as good as that of entities. It turns out that entities and relations
play similar roles in the training procedure except for the NS step. Usually, only
the entities are randomly replaced to get false statements (see Section 3). We
propose a simple modification of BerNS to balance the roles of entities and re-
lations during training. Rather than just replacing one of the two entities of a
known fact, we replace the relation with some probability p, and apply BerNS
otherwise (See Algorithm 1). This new method is called Balanced Negative Sam-
pling (BalNS). The default value for p is set to 21 . Experiments have shown that
the value of the parameter has no major impact on the performances of the
approach as long as it is greater than 0.1.
6            A. Boschin, T. Bonald


    Algorithm 1: Balanced Negative Sampling (BalNS).
      Input: (h, r, t), a fact
      Input: p, probability to replace the relation
      Output: (h0 , r0 , t0 ), a false statement
      Data: T , the facts in the KG
      Data: pr , Bernoulli parameter for relation r
        0    0 0
    1 (h , r , t ) ← (h, r, t)
                  0 0 0
    2 while (h , r , t ) ∈ T do
    3      u ← uniform random variable on [0, 1]
    4      if u < p then
    5           r0 ← random relation
    6       else
    7           (h0 , t0 ) ← BerNS(h, r, t)

    8   return (h0 , r0 , t0 )


Type Filtering for relation prediction. Another key technique to improve
the quality of relation prediction is through Type Filtering (TF). An entity e
is said to have the type t if the fact (e, rdf:type, t) is known. To predict the
relation linking h and t, only relations that are known to link entities of the
type(s) of h to entities of the type(s) of t should be considered. Formally, we say
that a relation r links type a to type b if there exists some known fact (h, r, t)
with the head entity h of type a and the tail entity t of type b. Now for predicting
the relation missing in (h, ∗, t), we propose to consider as candidates only the
relations r linking any type of h to any type of t. The corresponding algorithm
for relation prediction is described in Algorithm 2. Observe that if either the
head entity h and/or the tail entity t is not typed, the candidate relations are
then the relations that are involved in a training fact with either h as a head
entity or t as a tail entity. In the end, if no relation meet any constraint, there
is no filtering, i.e., all relations are selected. Regarding speed, this step has no
significant impact on the global computation time with proper index: linking
entities to their types and types to possible relations.
                     Enriching Wikidata with Semantified Wikipedia Hyperlinks                   7


    Algorithm 2: Relation prediction with Type Filtering (TF).
   Input: h, t, entities
   Input: f , scoring function
   Output: r, relation linking h to t
   Data: T , the facts in the KG
   Data: R, the relations in the KG
 1 A ← types of h
 2 B ← types of t
 3 if |A| > 0 and |B| > 0 then
 4     R ← {r : ∀(a, b) ∈ A × B, ∃h0 , t0 : type(h0 ) = a, type(t0 ) = b, (h0 , r, t0 ) ∈ T }
 5    else
 6        R ← {r ∈ R : ∃e : (h, r, e) ∈ T } ∪ {r ∈ R : ∃e : (e, r, t) ∈ T }
 7    if |R| = 0 then
 8        R←R
 9    r ← arg max({f (h, r, t), r ∈ R})
10    return r


To summarize, our approach relies on the following steps:

 1. Training the model (e.g., TransE or ComplEx) with BalNS (Algorithm 1).
 2. Predicting relations with TF (Algorithm 2).


4.2      Evaluation

Given an embedding model trained with BalNS and some known fact (h, r, t) of
a test set, all relations selected by TF are ranked by decreasing score. The rank
of the true relation r is recorded as the recovery rank (if the true relation r is not
selected by TF, the rank is set to the maximum). Usual metrics of link prediction
like Mean Reciprocal Rank (MRR: average of the inverses of the recovery ranks)
and Hit@k (proportion of tests in which the recovery rank is at most k) can then
be reported. In the filtered setting, any relation that is ranked better than r and
that is known to lead to a fact (i.e., in the training set) is discarded, so that the
model is not penalized for predicting known facts that are simply more likely
than the target one.


5      Experiments

The experiments aim at assessing the performance of our approach on existing
KGs and at showing its practical interest on a real-world task, i.e., the seman-
tification of Wikipedia hyperlinks. All experiments can be reproduced using the
publicly available code5 and data 6 .
5
     https://gitlab.telecom-paris.fr/aboschin/hyperlinks-semantification
6
     https://netset.telecom-paris.fr/pages/wikivitals+.html
8         A. Boschin, T. Bonald

Datasets The datasets used in the experiments are shown in Table 1.

        Dataset     Entities/nodes Facts/edges Relations Types Typed entities
        FB15k237            14,541     310,116       237    73         2,719
        WDV5                39,062     231,744       607 1,206        22,883
        Wikivitals+         39,062 3,008,116

             Table 1: Key features of the datasets used in experiments.


    One of the most common datasets used to evaluate the quality of KG em-
beddings is a subset of Freebase called FB15k237 [17]. The typing relation from
Freebase is however not included in it and resources are no longer available online
since the discontinuation of the Freebase project [2]. Types were then imported
from Wikidata using a matching between the two KBs. Attention was paid to
prevent data leakage by removing any imported fact that could match an ex-
isting validation or test fact. For comparability reasons, the new facts were not
used to train the embedding models, only for the TF step. Only 18,6% of entities
are typed, see Table 1.
    We also introduce WDV5, a new dataset containing the facts linking entities
of Wikidata corresponding to the level 5 of Wikipedia vital articles (see Section
1). To type entities, only typing facts included in the dataset are used (i.e., all
types are entities of WDV5). In particular, not all entities are typed (only 56%,
see Table 1). It is important to note that WDV5 is a raw extract from Wikidata,
without any pre-processing. As such, we expect the corresponding experiments
to be more representative of real use cases than those based FB15k237.
    For the semantification of Wikipedia hyperlinks, we use Wikivitals+, an ex-
traction of the level 5 of Wikipedia vital articles and the hyperlinks between
them. We only keep the pages that have a corresponding Wikidata entity. This
dataset provides many hyperlinks that are natural candidates for true facts, after
relation prediction.


Baseline In order to measure the impact of using a KG embedding model for
ranking the candidates selected by TF, we compare our approach to a simple
baseline that ranks the candidate relations by popularity in the training set, in
number of facts.


Process Two off-the-shelf embedding models were chosen for the experiments:

    • TransE [3], the simplest linear model, intuitive and fast to train and apply.
    • ComplEx [18], the best bilinear model, with twice more parameters, longer
      to train and apply.

   The models were trained using the Adam algorithm for optimization [6],
dropout for regularization [15] and early-stopping with 100 epochs of patience
(on the filtered validation MRR for link prediction). All experiments were done
                 Enriching Wikidata with Semantified Wikipedia Hyperlinks           9

using Python 3.8, PyTorch 1.7.0 [11], TorchKGE 0.16.25 [4] pytorch-ignite 0.4.4
and a Nvidia Titan V GPU powered with Cuda 10.1. The hyper-parameters of
the embedding models were tuned using hyperopt 0.2.5. The possible values
along with those chosen are listed in the provided supplemental material.
    In the case of FB15k237, the split between train, validation and test sets is
set by Toutanova et al. [17]. For WDV5, we split the dataset at random with
80% of the facts for training, 10% for validation (for choosing hyper-parameters)
and 10% for testing. The reported metrics are averaged over 6 distinct random
splits and independent training procedures.


6     Results

6.1     Performance

The results for relation prediction are shown in Table 2 with metrics computed in
a filtered setting, for different variants of the model so as to assess the respective
gains of the proposed techniques:

    • Original: The base model (either TransE or ComplEx) trained with BerNS.
    • BalNS: The base model trained with BalNS.
    • TF: The base model evaluated with Type Filtering (TF).
    • BalNS & TF: The base model trained with BalNS and evaluated with TF.

    The original version of TransE is not efficient on FB15k237 (only 3% of
Hit@1). ComplEx performs however notably well on the same dataset (89% of
Hit@1). It seems less sensitive to the unbalanced role of entities and relations
during training. We suspect however that the score of ComplEx on FB15k237
mainly results from over-fitting due to lack of new datasets in the KG embedding
literature over the past few years and over-engineering of FB15k237 (it is the
second version of the subset). This has already been argued in [1] and it is
confirmed by the fact that TransE and ComplEx have almost the same scores
(around 45% of Hit@1) on the new dataset WDV5 which is a raw extraction
from Wikidata.


Balanced Negative Sampling. Training with BalNS has a strong impact on
the relation prediction performance of the models: training TransE on FB15k237
with BalNS rather than BerNS increases the Hit@1 from 3% to 91%. This con-
firms the intuition that the relation embeddings were not well trained. The dif-
ference is less impressive for ComplEx on FB15k237 but the original ComplEx
model performs already quite well on this dataset. On WDV5, there is a big
increase in Hit@1 for both models: 24% for TransE and 28% for ComplEx.


Type filtering. TF has a strong impact on the performance of the models.
Looking at Hit@1 on WDV5, TransE goes from 45% to 58% and ComplEx goes
from 45% to 76%. Note that Type Filtering alone (the baseline) performs almost
10     A. Boschin, T. Bonald

                  Base model Variant       MRR Hit@1 Hit@5
                               Original   0.061   0.033 0.049
                               BalNS      0.940   0.914 0.972
                    TransE
                               TF         0.405   0.184 0.744
                               BalNS & TF 0.957   0.935 0.983
                           Original   0.928 0.894 0.967
                           BalNS      0.956 0.934 0.982
                   ComplEx
                           TF         0.953 0.927 0.983
                           BalNS & TF 0.961 0.943 0.983
                          Baseline         0.153 0.050 0.262
                                  (a) FB15k237

     Base model Variant              MRR          Hit@1           Hit@5
                Original       0.556 ± 0.116   0.458 ± 0.128   0.664 ± 0.102
                BalNS          0.779 ± 0.006   0.697 ± 0.017   0.881 ± 0.012
       TransE
                TF             0.711 ± 0.063   0.588 ± 0.092   0.872 ± 0.020
                BalNS & TF     0.821 ± 0.002   0.754 ± 0.003   0.903 ± 0.002
              Original    0.546 ± 0.078 0.454 ± 0.108 0.649 ± 0.052
              BalNS       0.816 ± 0.037 0.734 ± 0.059 0.917 ± 0.011
      ComplEx
              TF          0.826 ± 0.009 0.765 ± 0.013 0.902 ± 0.004
              BalNS & TF 0.827 ± 0.024 0.762 ± 0.041 0.910 ± 0.003
             Baseline          0.516 ± 0.006   0.416 ± 0.006   0.618 ± 0.006
                    (b) WDV5 (mean ± standard deviation).


       Table 2: Results of relation prediction on FB15k237 and WDV5.


as well as the original embedding models. It is however largely beaten by the
combination of TF with scoring by an embedding model. The gain of using and
embedding model is very important.


Complete model. The combination of BalNS and TF gives the best results.
On FB15k237, the increase in performance of TransE is impressive (Hit@1 from
3% to 94%) and makes this model almost as efficient as ComplEx. This is ob-
tained through additional facts imported from Wikidata for TF but the scores of
TransE simply trained with BalNS (and without TF) are already close to those
of ComplEx.
    On WDV5, all performance metrics are significantly improved by our ap-
proach. Both models that perform similarly in their original form remain close.
On average, ComplEx beats TransE by 1% in Hit@1 but the scores of TransE
are much more stable from one split to the other, as shown by the lower standard
deviation. The intervals of fluctuation of MRR and Hit@1 tend to be reduced
                   Enriching Wikidata with Semantified Wikipedia Hyperlinks     11

if the model is trained with BalNS. This is particularly true for TransE, whose
standard deviation for each metric is very small.
    It is remarkable to get almost identical performance with TransE and Com-
plEx, knowing that TransE has half the number of parameters of ComplEx, is
more geometrically intuitive and requires 6 times less operations for each gradi-
ent descent step during training.

6.2     Application to Wikipedia Hyperlinks
In order to predict the relation associated to a hyperlink, we use the TransE em-
bedding of WDV5 trained with BalNS and applied using TF. When two pages
are linked and the corresponding Wikidata entities are involved in a fact of
Wikidata (110,311 out of 3,008,116 hyperlinks), we can compare the predicted
relation to the ground-truth. We obtain 84% of accuracy. This good score is
expected as the model is trained on WDV5 facts and some hyperlinks indeed
correspond to existing facts. However, it is interesting to look at cases where
the prediction is different from the true fact. We have observed that the model
can hardly predict directed relations (e.g., parent-child) or semantically close
relations (e.g., employer and educated at for links between scholars and univer-
sities). This is not surprising as the only available data is the structure of the
KG. Some other mistakes come from the embedding model itself, for example
headquarter location always has a lower score than twinned administrative body
for some reason, making the headquarter predictions all wrong.
    In Table 3, we report for two pages the semantified hyperlinks that got the
highest scores. It is reassuring to see that most of the resulting facts are true,
many of them being however already known in Wikidata. A few mistakes could
be avoided using a little bit of context (i.e., text information) but these results
suggest that our approach is able to correctly semantify many links.
    It seems however difficult to produce automatically a full dataset in this way.
First, the scores of embedding models are usually not normalized so comparing
them works fine when done locally (e.g., looking at the links of a particular
page) but comparing the scores of the three million possible facts is not feasible.
Second, many facts that get a high score are very likely but require additional
information not present in data. For example the three most likely facts resulting
from semantified hyperlinks of Wikivitals+ are:
    • (Serbia, member of, World Trade Organization): Serbia’s application is still
      under review.
    • (Taiwan, member of, World Trade Organization): Taiwan is already a mem-
      ber of the WTO through the Chinese Taipei but not in its name.
    • (Kosovo, member of, Interpol ): Kosovo’s application was rejected in 2018.
      Clearly, some additional textual content is needed in these cases.

7      Conclusion
We have proposed a novel approach to relation prediction by KG embedding. Our
approach is based on two key ideas: Balanced Negative Sampling in the training
12       A. Boschin, T. Bonald

         Head     Predicted Relation           Tail         Score Evaluation
        Allergy     health specialty       Immunology -0.728593
        Allergy        has effect           Rhinorrhea    -0.839387
        Allergy        has cause             Allergen     -0.844231
        Allergy     health specialty        Pediatrics    -0.972245
        Allergy     health specialty    Internal medicine -1.022068
        Allergy       instance of             Disease     -1.022585
        Allergy drug used for treatment     Adrenaline    -1.040919
        Allergy medical examinations        Blood test    -1.165760
        Allergy drug used for treatment       Aspirin     -1.172504
        Allergy     health specialty       Hematology     -1.219639
        Allergy   possible treatment        Medication    -1.221141
        Allergy       symptoms           Abdominal pain -1.228720
        Allergy drug used for treatment      Penicillin   -1.235292
        Allergy       subclass of            Pollution    -1.283537
        Allergy     health specialty         Statistics   -1.297987
        Allergy     health specialty      Epidemiology -1.301585
        Allergy          afflicts        Immune system -1.374023
        Allergy          afflicts             Blood       -1.376165
        Allergy   possible treatment        Antibiotic    -1.406908
        Allergy       symptoms                 Itch       -1.416243
                  (a) Top-20 facts predicted from the page Allergy.

         Head        Predicted Relation          Tail           Score Evaluation
     Henri Poincaré       employer      University of Paris -0.358521
     Henri Poincaré       employer     École Polytechnique -0.412040
     Henri Poincaré     occupation        Mathematician      -0.414613
     Henri Poincaré place of death             Paris         -0.492610
     Henri Poincaré     occupation            Engineer       -0.537747
     Henri Poincaré    field of work      Number theory      -0.561962
     Henri Poincaré      student of      Charles Hermite -0.592644
     Henri Poincaré    field of work       Epistemology      -0.605310
     Henri Poincaré    field of work         Topology        -0.678377
     Henri Poincaré    field of work    Algebraic geometry -0.712538
     Henri Poincaré      student of      Wilhelm Wundt       -0.738560
     Henri Poincaré    field of work           Optics        -0.738636
     Henri Poincaré   notable work      Poincaré conjecture -0.739285
     Henri Poincaré    field of work   Philosophy of science -0.791077
     Henri Poincaré    field of work       Metaphysics       -0.822683
     Henri Poincaré   place of birth      Nancy, France      -0.824190
     Henri Poincaré different from      Raymond Poincaré -0.830676
     Henri Poincaré      student of      Karl Weierstrass -0.843734
     Henri Poincaré    field of work         Set theory      -0.855588
     Henri Poincaré    field of work    Celestial mechanics -0.877295
              (b) Top-20 facts predicted from the page Henri Poincaré.


Table 3: Relation prediction applied to the semantification of Wikipedia hyper-
links (green = true fact unknown by Wikidata, blue = fact already known by
Wikidata, red = false statement).
                Enriching Wikidata with Semantified Wikipedia Hyperlinks       13

of the embedding model, and Type Filtering to select candidate relations. We
have shown that this approach performs well using an embedding model as simple
as TransE, opening the way to robust and explainable predictions. Our results
suggest that the model can be used to enrich Wikidata, by the semantification
of Wikipedia hyperlinks associated with known entities.
    This approach is however not yet fully automatable and performance still
needs to be increased for that goal. For future work, we would like to further
improve our negative sampling technique by replacing the relation with some
probability that depends on the considered fact (h, r, t), instead of some fixed
probability. It seems also necessary to integrate some context from textual data
for example (like the description of the relations and the articles themselves) in
order to help the embedding model in its choices. A fully automatized process
of enriching Wikidata with semantified Wikipedia hyperlinks seems however not
out of reach.
14      A. Boschin, T. Bonald

References
 1. Akrami, F., Saeef, M.S., Zhang, Q., Hu, W., Li, C.: Realistic re-evaluation of
    knowledge graph completion methods: An experimental study. In: Proceedings of
    the 2020 ACM SIGMOD International Conference on Management of Data. p.
    1995–2010. SIGMOD ’20, Association for Computing Machinery, New York, NY,
    USA (2020). https://doi.org/10.1145/3318464.3380599
 2. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: A collabo-
    ratively created graph database for structuring human knowledge. In: Proceedings
    of the 2008 ACM SIGMOD International Conference on Management of Data. p.
    1247–1250. SIGMOD ’08, Association for Computing Machinery, New York, NY,
    USA (2008). https://doi.org/10.1145/1376616.1376746
 3. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translat-
    ing Embeddings for Modeling Multi-relational Data. In: Burges, C.J.C., Bottou,
    L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural In-
    formation Processing Systems 26, pp. 2787–2795. Curran Associates, Inc. (2013)
 4. Boschin, A.: TorchKGE: Knowledge Graph Embedding in Python and PyTorch.
    KDD-IWKG 2020 p. 6 (Aug 2020)
 5. Galárraga, L., Symeonidou, D., Moissinac, J.C.: Rule Mining for Semantifying
    Wikilinks. In: LDOW@WWW (2015)
 6. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio,
    Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations,
    ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings
    (2015)
 7. Kotnis, B., Nastase, V.: Analysis of the impact of negative sampling on link pre-
    diction in knowledge graphs. arXiv preprint arXiv:1708.06816 (2017)
 8. Krompaß, D., Baier, S., Tresp, V.: Type-constrained representation learning in
    knowledge graphs. In: Arenas, M., Corcho, O., Simperl, E., Strohmaier, M.,
    d’Aquin, M., Srinivas, K., Groth, P., Dumontier, M., Heflin, J., Thirunarayan, K.,
    Thirunarayan, K., Staab, S. (eds.) The Semantic Web - ISWC 2015. pp. 640–655.
    Springer International Publishing, Cham (2015)
 9. Nathani, D., Chauhan, J., Sharma, C., Kaul, M.: Learning attention-based em-
    beddings for relation prediction in knowledge graphs. In: Proceedings of the 57th
    Annual Meeting of the Association for Computational Linguistics (2019)
10. Nickel, M., Tresp, V., Kriegel, H.P.: A Three-way Model for Collective Learning
    on Multi-relational Data. In: Proceedings of the 28th International Conference on
    International Conference on Machine Learning. pp. 809–816. ICML’11, Omnipress,
    Bellevue, WA, USA (2011)
11. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
    Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In:
    Proceedings of the 31st Conference on Neural Information Processing Systems.
    Long Beach, CA, USA (Oct 2017)
12. Peng, N., Poon, H., Quirk, C., Toutanova, K., Yih, W.t.: Cross-sentence n-ary
    relation extraction with graph LSTMs. Transactions of the Association for Com-
    putational Linguistics 5, 101–115 (2017). https://doi.org/10.1162/tacl a 00049
13. Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint
    arXiv:1609.04747 (2016)
14. Sakor, A., Singh, K., Patel, A., Vidal, M.E.: Falcon 2.0: An entity and
    relation linking tool over wikidata. In: Proceedings of the 29th ACM
    International Conference on Information and Knowledge Management. p.
                  Enriching Wikidata with Semantified Wikipedia Hyperlinks            15

    3141–3148. CIKM ’20, Association for Computing Machinery, New York, NY,
    USA (2020). https://doi.org/10.1145/3340531.3412777, https://doi.org/10.1145/
    3340531.3412777
15. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
    Dropout: A simple way to prevent neural networks from overfitting. Journal of
    Machine Learning Research 15(56), 1929–1958 (2014)
16. Suchanek, F.M., Lajus, J., Boschin, A., Weikum, G.: Knowledge Representation
    and Rule Mining in Entity-Centric Knowledge Bases. In: Krötzsch, M., Stepanova,
    D. (eds.) Reasoning Web. Explainable Artificial Intelligence: 15th International
    Summer School 2019, Bolzano, Italy, September 20–24, 2019, Tutorial Lectures, pp.
    110–152. Lecture Notes in Computer Science, Springer International Publishing,
    Cham (Sep 2019)
17. Toutanova, K., Chen, D., Pantel, P., Poon, H., Choudhury, P., Gamon, M.: Repre-
    senting Text for Joint Embedding of Text and Knowledge Bases. In: Proceedings of
    the 2015 Conference on Empirical Methods in Natural Language Processing. pp.
    1499–1509. Association for Computational Linguistics, Lisbon, Portugal (2015).
    https://doi.org/10.18653/v1/D15-1174
18. Trouillon, T., Dance, C.R., Welbl, J., Riedel, S., Gaussier, E., Bouchard, G.: Knowl-
    edge Graph Completion via Complex Tensor Factorization. arXiv:1702.06879 [cs,
    math, stat] (Feb 2017)
19. Wang, L., Cao, Z., de Melo, G., Liu, Z.: Relation Classification via Multi-
    Level Attention CNNs. In: Proceedings of the 54th Annual Meeting of the As-
    sociation for Computational Linguistics (Volume 1: Long Papers). pp. 1298–
    1307. Association for Computational Linguistics, Berlin, Germany (Aug 2016).
    https://doi.org/10.18653/v1/P16-1123
20. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge Graph Embed-
    ding: A Survey of Approaches and Applications. IEEE Transactions
    on Knowledge and Data Engineering 29(12), 2724–2743 (Dec 2017).
    https://doi.org/10.1109/TKDE.2017.2754499
21. Wang, R., Li, B., Hu, S., Du, W., Zhang, M.: Knowledge Graph Embedding
    via Graph Attenuated Attention Networks. IEEE Access 8, 5212–5224 (2020).
    https://doi.org/10.1109/ACCESS.2019.2963367, conference Name: IEEE Access
22. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by trans-
    lating on hyperplanes. In: Proceedings of the Twenty-Eighth AAAI Conference on
    Artificial Intelligence. p. 1112–1119. AAAI’14, AAAI Press (2014)
23. Xie, R., Liu, Z., Sun, M.: Representation learning of knowledge graphs with hier-
    archical types. In: Proceedings of the Twenty-Fifth International Joint Conference
    on Artificial Intelligence. p. 2965–2971. IJCAI’16, AAAI Press (2016)
24. Yang, X., Ren, S., Li, Y., Shen, K., Li, Z., Wang, G.: Relation linking for wikidata
    using bag of distribution representation. In: Huang, X., Jiang, J., Zhao, D., Feng,
    Y., Hong, Y. (eds.) Natural Language Processing and Chinese Computing. pp.
    652–661. Springer International Publishing, Cham (2018)
25. Zhang, Y., Qi, P., Manning, C.D.: Graph convolution over pruned depen-
    dency trees improves relation extraction. In: Proceedings of the 2018 Confer-
    ence on Empirical Methods in Natural Language Processing. pp. 2205–2215.
    Association for Computational Linguistics, Brussels, Belgium (Oct-Nov 2018).
    https://doi.org/10.18653/v1/D18-1244