Identifying Surprising Facts in Wikidata
Nicholas Klein1 , Filip Ilievski1 , Hayden Freedman2 and Pedro Szekely1
1 Information Sciences Institute, University of Southern California
2 University of California, Irvine


Abstract
To expand their boundary of knowledge, to solve a certain task, or merely to entertain themselves, people on the Web often hunt for surprising facts. While large repositories of structured knowledge like Wikidata hold the promise of providing surprising facts to their users and developers, no mechanism currently exists to identify surprising facts in Wikidata. In this paper, we study the ability of popular embedding models to estimate the surprise level of a fact in Wikidata. We formulate a novel task of Surprising Fact Identification, and we create two associated benchmarks: Trivia and Survey. We propose two scalable methods, based on outlier detection and link prediction, to estimate surprise scores for any statement in a graph like Wikidata. We evaluate our methods with various embedding models on the two benchmarks. We further analyze the predictions for outlier and non-outlier facts to investigate to what extent link prediction models regress to the mean.




1. Introduction
Surprise is a measure of how unexpected a certain statement is. For example, while it is expected
that Vladimir Putin speaks Russian, it is more surprising that he can speak German too. It
is expected that Putin is a politician, but it may be surprising that he has studied law. To
expand their boundary of knowledge, to solve a certain task, or merely to entertain themselves,
people on the Web often hunt for surprising facts [1]. It is therefore more valuable for users to
be presented with a surprising fact rather than a trivial one. Recognizing this phenomenon,
popular sites on the Web readily provide contextually relevant facts that are fun or surprising.1
   The expansion of AI technologies and the vast amount of available knowledge provide an
opening for identifying surprising facts automatically. Prior work by Tsurel et al. [1] has
investigated the ability of statistical AI methods to extract surprising facts from Wikipedia.
Meanwhile, very large and curated structured knowledge sources like Wikidata [2] have emerged,
providing over a billion facts about nearly one hundred million entities. While Wikidata’s
knowledge is intuitively suitable to support human exploration and AI reasoning, the provided
information can easily get overwhelming for both humans and machines. Wikidata contains
some information (e.g., ranks and temporal qualifiers) that enables prioritizing entity statements
for the same property. It also has the Curious Facts Dashboard [3], which presents potentially
incorrect entries for user review.

Wikidata’22: Wikidata workshop at ISWC 2022
nmklein@usc.edu (N. Klein); ilievski@isi.edu (F. Ilievski); hfreedma@uci.edu (H. Freedman); pszekely@isi.edu (P. Szekely)
                                       Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




                  1
                      E.g., https://www.historyhit.com/facts-about-vladimir-putin/, accessed May 20, 2022.
However, this system uses a rule-based reasoner and tends to surface problems such as property constraint violations, rather than facts that are ontologically
correct but surprising. The emergence of large structured knowledge graphs and representation
learning methods [4, 5] brings up the question: can we apply state-of-the-art representation
learning techniques to effectively identify surprising facts in Wikidata?
   In this paper, we study the ability of popular embedding models, like TransE [4], ComplEx [5],
and language models like BERT [6], to identify the surprise level of a fact in Wikidata. We
formulate Surprising Fact Identification, a novel task where a system has to estimate how
surprising a Wikidata fact is. We create two benchmarks for this task: Trivia, where systems
rank the facts per question according to their surprise, and Survey, where the goal is to provide
a global ranking of 70 facts. We propose two scalable methods based on embedding models that can estimate a surprise score for any statement in a graph like Wikidata. We evaluate a range of embedding model variants on Surprising Fact Identification over the two benchmarks. Hypothesizing that embedding-based link prediction models regress to the
mean, we also analyze whether their predictions for a given subject entity tend towards the
values that are most commonly correct for similar entities.


2. Surprising Fact Identification
2.1. Task Definition
We define surprise as a measure of how unusual or unexpected a certain statement is. We
formulate the task of Surprising Fact Identification as a ranking task. For a given set of facts
𝐹, we want to assign a ranking to each of the facts 𝑓 ∈ 𝐹 that corresponds to the relative surprise a human would likely experience at learning that 𝑓 is true. Each fact is of the form 𝑓 =
(𝑒, [(𝑝1 , π‘œ1 ), ..., (𝑝𝑛 , π‘œπ‘› )]), consisting of an entity 𝑒 that is the subject of the fact and a non-empty
list of predicate-object pairs (𝑝𝑖 , π‘œπ‘– ). A fact 𝑓 indicates that each of the triples < 𝑒, 𝑝𝑖 , π‘œπ‘– > is true
for 𝑖 ∈ [1, 𝑛].
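For concreteness, the fact structure can be sketched as a small data type; the class and the Wikidata identifiers in the example are illustrative assumptions on our part (Q7747 for Vladimir Putin, P1412 for languages spoken, Q188 for German), not part of the released code.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Fact:
    """A fact (e, [(p_1, o_1), ..., (p_n, o_n)]) about a subject entity e."""
    entity: str                   # subject entity e, e.g. a Wikidata QID
    pairs: List[Tuple[str, str]]  # non-empty list of predicate-object pairs (p_i, o_i)

    def triples(self) -> List[Tuple[str, str, str]]:
        # The fact asserts that every triple <e, p_i, o_i> is true.
        return [(self.entity, p, o) for p, o in self.pairs]

# "Putin speaks German" as a single-triple fact (identifiers assumed, see above)
putin_german = Fact(entity="Q7747", pairs=[("P1412", "Q188")])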

2.2. Benchmarks
We collect two Surprising Fact Identification benchmarks: Wikidata-MC-Trivia-118 and Wikidata-
Survey-FunFacts-70.
   Wikidata-MC-Trivia-118 This benchmark concretely defines surprise of a fact 𝑓 as the
probability that a human would not guess 𝑓 to be true. It consists of 24 multiple-choice questions
with a total of 118 candidate answers, each having an inferred human surprise score. Most
questions have five candidate answers, and questions can have more than one correct answer.
The questions are about 22 unique entities: 14 humans, 3 sovereign states, 3 paintings, 1
film, and 1 business. Of the 24 questions, 7 are single-answer (e.g., β€œWhat is the birthplace
of Boris Johnson?”) and 6 ask about numeric values (e.g., β€œHow many children does Arnold
Schwarzenegger have?”). Each question presents a single entity and property, and each candidate
answer is a potential object of that entity-property pair. Thus, all facts in this benchmark consist
of a single triple. We collected this benchmark with a quiz administered via Google Forms to
a group of 26 researchers who participated voluntarily. A screenshot of the quiz is shown in Figure 1. Surprise scores were inferred as the percentage of participants who did not choose each candidate answer.2

Figure 1: Screenshot of the quiz on Google Forms.

    2
      We expect that if a true statement is considered untrue by many participants, this is indicative of a surprising fact. We expect that statements that are simply unknown would be guessed randomly by participants, leading to an unsurprising score. We do acknowledge that a more principled annotation of this data may need to distinguish between interesting, surprising, and unknown facts [7].
   The quiz was designed to include a uniform distribution of surprising-correct answers,
unsurprising-correct answers, and incorrect (distracting) answers. We began by selecting a set
of well-known entities from several classes. To help ensure we would find entities that have
surprising facts, we ran simple Google searches of the form β€œβŸ¨entity⟩ with surprising facts”.
The set of candidate facts to pose as questions for each entity was refined by utilizing prior work
that identifies entities’ most salient facts [8]. Next, surprising-correct answers were selected by manually picking seemingly surprising facts from the bottom half of each entity’s facts, ordered by the frequency of the object for the class-property pair. The selected facts were converted
into questions. Facts for the same property were combined into the same question, serving as
multiple correct answers. Non-surprising correct answers for the questions were generated by
filling in any other answers that are correct per Wikidata. Distracting answers were generated by sampling answers that are incorrect for the entity according to Wikidata, weighted by the frequency with which they occur for the given property across all entities of the entity’s
class. We fact-checked all answers using Google, omitting questions with disputable answers.
Finally, we trimmed down the number of candidate answers for each question to a maximum of
5, aiming to keep the distribution even amongst incorrect answers, correct answers expected to
be surprising, and correct answers expected to be unsurprising.
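A hypothetical sketch of the distractor sampling step described above; the function and input structures are ours, under the assumption that candidate distractors are weighted by how often they occur as the object of the property across entities of the subject’s class.

import numpy as np

def sample_distractors(candidates, object_counts, k=2, seed=0):
    """Sample k distracting answers without replacement.

    candidates:    objects that are incorrect for the entity according to Wikidata.
    object_counts: object -> number of entities of the subject's class that have
                   this object for the given property.
    """
    rng = np.random.default_rng(seed)
    weights = np.array([object_counts.get(c, 0) + 1 for c in candidates], dtype=float)
    probs = weights / weights.sum()                       # normalize to a distribution
    return list(rng.choice(candidates, size=k, replace=False, p=probs))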
    Wikidata-Survey-FunFacts-70 consists of 70 facts, each consisting of an entity and a non-
empty list of predicate-object pairs. Each fact is labeled with three scores indicating the degree
to which humans find the fact surprising, think the fact is a good trivia question, and would say
they knew the fact. The facts are about 47 different humans and only include entity-valued
properties. This benchmark was created by mapping a subset of the rows from the FunFacts
benchmark [1] for Wikipedia. Tsurel et al. [1] collected the facts by several statistical methods
and asked crowd workers to judge the degree of fact surprise, fact trivia-worthiness, and whether
they knew the fact before reading it. The scores were then averaged across workers. The
original data contains 362 natural language facts about 109 different humans. As the facts
in the original dataset are natural language, not all of them can be mapped to Wikidata (e.g.,
β€œEinstein offered and was called on to give judgments and opinions on matters often unrelated
to theoretical physics or mathematics”). We attempted to map 113 of these 362 facts, of which 70 could be successfully mapped to Wikidata.

2.3. Connection to Prior Studies
Prakash et al. [9] gather trivia facts about Wikipedia entities from IMDb and propose an algo-
rithm to estimate trivia-worthiness of a fact. Tsurel et al. [1] evaluate methods for identification
of surprising, trivia-worthy, and unknown facts from articles in Wikipedia. As, to our knowledge, no such work exists for knowledge graphs like Wikidata, we adapt the benchmark introduced in [1] for surprising fact identification in Wikidata. Serban et al. [10] automatically generate natural language questions from Freebase with a neural network, yet it is unclear whether these questions
concern surprising facts. Prior work has provided methods for estimating the quality of individual Wikidata facts, considering aspects such as veracity [11, 12]. Quality estimation and triple
classification are orthogonal tasks to ours, and it is unclear how to apply quality estimation
methods to identify surprising facts.
   Notably, Surprising Fact Identification is related to the popular task of Link Prediction,
where the goal is to correctly predict the object of a subject-predicate pair. Previous work
has explored Link Prediction in Wikidata in various capacities. Safavi et al. [13] perform a
comprehensive evaluation of the calibration of several embedding models. Wu et al. [14] use a
rule-based approach to link prediction which is evaluated over several large knowledge bases,
including Wikidata. Rosso et al. [15] introduce an embedding-based predictive model that uses
graph triples together with their associated key-value pairs to restrict or disambiguate facts
in Wikidata. Joshi et al. [16] propose an embedding-based model that automatically selects a pool of promising entities to reduce the search space and the computational load.
Recognizing this connection and opportunity to build upon prior research, we incorporate link
prediction models like TransE [4] and ComplEx [5] into our surprise identification methods,
and we perform further analysis of the inherent ability of the models to estimate surprise.


3. Method
We describe two novel methods that score the surprise of a Wikidata fact, based on statistical outlier detection and on link prediction, respectively.

3.1. Identifying Surprise with Statistical Outlier Detection (SOD)
Our first method measures whether the entity 𝑒 is an atypical subject (outlier) for the fact’s property-object pairs. We estimate this by comparing 𝑒 to the set 𝐸𝑓 of entities 𝑒 β€² β‰  𝑒 for which < 𝑒 β€² , 𝑝𝑖 , π‘œπ‘– > is true for all 𝑖 ∈ [1, 𝑛]. For example, to estimate whether < 𝑃𝑒𝑑𝑖𝑛, π‘™π‘Žπ‘›π‘”π‘’π‘Žπ‘”π‘’, π‘…π‘’π‘ π‘ π‘–π‘Žπ‘› > is an outlier, we would compare Putin to the entities which speak Russian according to Wikidata.
   We compute the surprise score for a statement as a ratio between the distance between 𝑒 and
the entities in 𝐸𝑓 , and the dispersion of entities within 𝐸𝑓 : π‘†π‘’π‘Ÿπ‘π‘Ÿπ‘–π‘ π‘’π‘œπ‘’π‘‘π‘™π‘–π‘’π‘Ÿ (𝑓 ) = π‘‘π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’(𝑒, 𝐸𝑓 )/π‘‘π‘–π‘ π‘π‘’π‘Ÿπ‘ π‘–π‘œπ‘›(𝐸𝑓 ).
We utilize embeddings to represent each entity in Wikidata. π‘‘π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’(𝑒, 𝐸𝑓 ) is a measure of dis-
tance from the entity 𝑒 to the entities in 𝐸𝑓 in the embedding space. We experimented with
two such formulations: π‘‘π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’π‘ (𝑒, 𝐸𝑓 ) computes the cosine distance from 𝑒 to the centroid of 𝐸𝑓
and π‘‘π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’π‘Žπ‘ (𝑒, 𝐸𝑓 ) computes the average pairwise cosine distance between 𝑒 and each 𝑒 β€² ∈ 𝐸𝑓 .
π‘‘π‘–π‘ π‘π‘’π‘Ÿπ‘ π‘–π‘œπ‘›(𝐸𝑓 ) is a measure of how spread out 𝐸𝑓 is in the embedding space, where we again
experiment with two analogous formulations: π‘‘π‘–π‘ π‘π‘’π‘Ÿπ‘ π‘–π‘œπ‘›π‘ (𝐸𝑓 ) computes the average cosine
distance from each 𝑒 β€² ∈ 𝐸𝑓 to the centroid of 𝐸𝑓 and π‘‘π‘–π‘ π‘π‘’π‘Ÿπ‘ π‘–π‘œπ‘›π‘Žπ‘ (𝐸𝑓 ) computes the average
pairwise distance between 𝑒 β€² and 𝑒 β€³ for 𝑒 β€² , 𝑒 β€³ ∈ 𝐸𝑓 ∢ 𝑒 β€² β‰  𝑒 β€³ .
   Intuitively, the distance term causes an entity 𝑒 to be found more surprising for a fact 𝑓 if it
is dissimilar from entities that have fact 𝑓. The dispersion term normalizes this dissimilarity,
decreasing our surprise for facts that many diverse entities have (e.g., β€œlanguage written, spoken,
or signed = English”) and increasing our surprise for facts that typically belong to very similar
entities (e.g., β€œOccupation = Basketball Player”). For the edge case where |𝐸𝑓 | < 2, there are too
few entities in Wikidata that have all predicate-object pairs of 𝑓 for us to compute a dispersion
score, so we assign 𝑓 the maximum surprise score (∞).
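As an illustration, a minimal sketch of the centroid-based variant (distance and dispersion both computed against the centroid), assuming the entity embeddings are available as NumPy vectors; the helper names are ours, not from the released code.

import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def surprise_outlier(e_vec, E_f):
    """Surprise_outlier(f) = distance_c(e, E_f) / dispersion_c(E_f).

    e_vec: embedding of the subject entity e.
    E_f:   (k, d) matrix of embeddings of the other entities for which all
           predicate-object pairs of the fact hold.
    """
    if len(E_f) < 2:
        return float("inf")                   # edge case |E_f| < 2: maximum surprise
    centroid = E_f.mean(axis=0)
    distance = cosine_distance(e_vec, centroid)                        # distance_c(e, E_f)
    dispersion = np.mean([cosine_distance(v, centroid) for v in E_f])  # dispersion_c(E_f)
    return distance / dispersion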

3.2. Identifying Surprise via Link Prediction (LP)
Another approach to identifying surprising facts uses link prediction. Graph embedding
models like TransE and ComplEx define ways of operating on their representations of entities
and properties to yield predictions for where in the embedding space the corresponding subject
or object entity resides. This gives us two potential methods to compute surprise scores: we
define π‘†π‘’π‘Ÿπ‘π‘Ÿπ‘–π‘ π‘’πΏπ‘ƒβˆ’π‘™β„Žπ‘  (𝑓 ) as the aggregated distance from each π‘œπ‘– ∈ 𝑓 to the predicted locations
of the objects of < 𝑒, 𝑝𝑖 >, and we define π‘†π‘’π‘Ÿπ‘π‘Ÿπ‘–π‘ π‘’πΏπ‘ƒβˆ’π‘Ÿβ„Žπ‘  (𝑓 ) as the aggregated distances from
𝑒 to each of the predicted locations of the subjects of < 𝑝𝑖 , π‘œπ‘– >. There are several additional
settings for these methods: the distances in both the lhs and rhs methods can be aggregated
using either max or avg; the rhs method can be modified to instead compute the distance from
𝑒 to the centroid of the predicted object locations; and various distance functions can be used
here, including cosine, L2, and negative dot-product.
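A sketch of the two scores with TransE, whose translation operator predicts the object of <e, p> as emb(e) + emb(p) and the subject of <p, o> as emb(o) - emb(p); the embedding lookup tables, the fact layout, and the default aggregation are assumptions on our part, and cosine distance is used as one of the options listed above.

import numpy as np
from scipy.spatial.distance import cosine  # cosine distance = 1 - cosine similarity

def surprise_lp_lhs(fact, ent_emb, rel_emb, agg=np.mean):
    """Aggregated distance from each true object o_i to the predicted object of <e, p_i>."""
    e, pairs = fact
    dists = [cosine(ent_emb[e] + rel_emb[p], ent_emb[o]) for p, o in pairs]
    return agg(dists)

def surprise_lp_rhs(fact, ent_emb, rel_emb, agg=np.mean):
    """Aggregated distance from the subject e to the predicted subject of each <p_i, o_i>."""
    e, pairs = fact
    dists = [cosine(ent_emb[e], ent_emb[o] - rel_emb[p]) for p, o in pairs]
    return agg(dists)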

3.3. Experimental Setup
Evaluation protocol We evaluate our methods’ ability to identify surprising facts by measuring the correlation between their surprise scores and the human scores available in each benchmark, using Spearman’s 𝜌 and Kendall’s 𝜏. For Wikidata-MC-Trivia-118, we evaluate our methods’
surprise scores by measuring their correlation with the inferred human surprise scores. To
mimic the format that this benchmark was created in, we measure the average correlation
of answers’ surprise scores within each question. We additionally report separate results for
the subsets of questions that ask about entity-valued and numeric-valued properties. For
Wikidata-Survey-FunFacts-70, we evaluate our methods’ surprise scores by measuring their
correlation with the crowdsourced scores for goodTrivia, surprise, and knew. A successful
surprise identification method will give scores that correlate positively with goodTrivia and
surprise, and negatively with knew. As this benchmark was created by presenting each fact to a
human annotator independently, we measure the global correlation over all facts.
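The two protocols can be summarized with the following sketch, using SciPy’s correlation functions; the input layout is an assumption on our part.

import numpy as np
from scipy.stats import spearmanr, kendalltau

def trivia_correlation(questions):
    """Trivia protocol: average per-question correlation between system and human scores.

    questions: list of (system_scores, human_scores) pairs, one pair per question.
    """
    rhos = [spearmanr(s, h).correlation for s, h in questions]
    taus = [kendalltau(s, h).correlation for s, h in questions]
    return float(np.mean(rhos)), float(np.mean(taus))

def survey_correlation(system_scores, human_scores):
    """Survey protocol: a single global correlation over all 70 facts."""
    return (spearmanr(system_scores, human_scores).correlation,
            kendalltau(system_scores, human_scores).correlation)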
   Models We experiment with nine embedding models: (1) BERT text embeddings that we
computed by automatically generating a node description based on seven properties and using
sentence-transformers to encode the resulting sentence; 3 (2, 3) TransE and ComplEx graph
embeddings computed directly on Wikidata; (4, 5) TransE and ComplEx graph embeddings that
have been computed over a graph derived from Wikidata by abstractive summarization into
β€˜profiles’ (P-TransE and P-ComplEx) [8]; (6, 7, 8) H, A, and S random-walk-based graph embed-
dings, designed to capture similarity based on homophily, numeric attributes, and structure,
respectively; (9) kNN-P-TransE supervised TransE embeddings, based on a kNN model for each
property.
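As a rough sketch of the BERT text-embedding setup; the description template and the specific sentence-transformers model are illustrative assumptions, since the paper only specifies the seven properties listed in the footnote.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence-transformers encoder

def node_description(label, property_values):
    """Build a short textual description from (property label, object label) pairs."""
    return label + ". " + "; ".join(f"{p}: {o}" for p, o in property_values)

sentence = node_description("Vladimir Putin",
                            [("instance of", "human"), ("occupation", "politician")])
embedding = model.encode(sentence)  # dense vector later used by the SOD method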
   Method details We evaluate our outlier-based method with the eight unsupervised embed-
ding models. We limit the size of 𝐸𝑓 for each fact to 10,000 by sampling. We evaluate our LP
method with the embedding models TransE, ComplEx, P-TransE, P-ComplEx, and kNN-P-TransE,
as these allow for unsupervised link prediction via translation and complex-diagonal operators.
Vanilla TransE and ComplEx are not trained on numeric-valued edges and therefore cannot be
used directly for LP on numeric-valued properties. Because the P- variants were trained on a
graph that replaces numeric values with nodes corresponding to ranges, they can be used for
LP on numeric-valued properties. Meanwhile, as the profile graph omits information from the
original graph, the profile (P-) embeddings lack information for 19 of the facts in the Wikidata-
Survey-FunFacts-70 dataset, while the kNN model lacks information for 10 facts. For these
missing facts, we use the frequency baseline as a fallback strategy. For the Wikidata-MC-Trivia-118 benchmark, we report π‘†π‘’π‘Ÿπ‘π‘Ÿπ‘–π‘ π‘’πΏπ‘ƒβˆ’π‘™β„Žπ‘  for the vanilla and the supervised graph embeddings, and π‘†π‘’π‘Ÿπ‘π‘Ÿπ‘–π‘ π‘’πΏπ‘ƒβˆ’π‘Ÿβ„Žπ‘  for the profile-graph embeddings. For the Wikidata-Survey-FunFacts-70 benchmark, we use π‘†π‘’π‘Ÿπ‘π‘Ÿπ‘–π‘ π‘’πΏπ‘ƒβˆ’π‘Ÿβ„Žπ‘  for the unsupervised LP models, aggregating predictions by measuring the distance from their centroid to the given fact’s subject entity. Analogously, we use π‘†π‘’π‘Ÿπ‘π‘Ÿπ‘–π‘ π‘’πΏπ‘ƒβˆ’π‘™β„Žπ‘  for the supervised LP model. We use cosine distance for all LP methods.
    3
      Properties used: P31 (instance of), P279 (subclass of), P106 (occupation), P39 (position held), P1382 (partially coincident with), P373 (Commons Category), and P452 (industry).
Table 1
Results on Wikidata-MC-Trivia-118.
                              Qnode facts       Numeric facts       All facts
             Methods          Rho      Tau      Rho      Tau        Rho      Tau
 Baselines   random          -0.003   -0.002    0.024    0.019      0.003    0.003
             frequency        0.043    0.055    0.134    0.129      0.066    0.074
 SOD         BERT             0.574    0.502    0.490    0.394      0.553    0.475
             ComplEx          0.556    0.468    0.455    0.382      0.531    0.447
             TransE           0.638    0.526    0.326    0.228      0.560    0.452
             P-ComplEx        0.429    0.381    0.056    0.075      0.335    0.305
             P-TransE         0.421    0.382    0.211    0.137      0.368    0.320
             H-RandomWalk     0.594    0.505    0.401    0.348      0.546    0.466
             A-RandomWalk     0.081    0.088    0.171    0.152      0.103    0.104
             S-RandomWalk    -0.025   -0.067    0.381    0.286      0.076    0.021
 LP          ComplEx          0.475    0.413    -        -          -        -
             TransE           0.442    0.391    -        -          -        -
             P-ComplEx        0.392    0.346    0.219    0.180      0.348    0.304
             P-TransE         0.246    0.211    0.422    0.376      0.290    0.252
             kNN-P-TransE     0.109    0.099    0.639    0.565      0.242    0.215


  We define two baselines: random and frequency. The frequency baseline assigns as the surprise score of a fact the percentage of Wikidata entities to which the fact does not apply.
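In other words, a sketch of the frequency baseline, assuming the counts have been precomputed from Wikidata:

def frequency_surprise(n_entities_with_fact, n_entities_total):
    """Share of Wikidata entities to which the fact does not apply; higher = more surprising."""
    # e.g., a fact held by 12 of 1,000,000 entities gets a surprise score of 0.999988
    return 1.0 - n_entities_with_fact / n_entities_total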


4. Results
The results on both benchmarks in Tables 1 and 2 show that our SOD methods identify surprising facts better than the baselines and our LP methods. On the Trivia benchmark, SOD
yields the highest correlation scores with the BERT, TransE, and H embeddings, followed closely
by the ComplEx embeddings. On the Survey benchmark, the H-RandomWalk embeddings
perform the best on all three crowd-sourced scores, followed by ComplEx and TransE on
the β€œgoodTrivia” and β€œsurprising” scores, and S-RandomWalk and BERT on β€œKnew”. The
performance of the baselines is low on the Trivia benchmark, while the frequency baseline
performs notably better on the Survey benchmark, giving scores that are competitive with
most of our models. We think that this is due to the presence of compound facts in the Survey
benchmark, whose combination reveals unlikely facts that align with human judgments of
trivia-worthiness or surprise. Yet, several of the SOD methods improve over the frequency
baseline on this benchmark as well, revealing that the subject entity embeddings provide key
additional information that is not captured by the predicate-object frequencies.
   Curiously, link prediction with the same embedding models performs consistently worse
than the SOD methods overall and on the entity-valued facts, though this trend is reversed
for the P-TransE and P-ComplEx on the numeric facts. This shows that, while the embedding
models may contain valuable information, the method that uses them to identify surprising facts plays a key role in the final performance.
Table 2
Results on Wikidata-Survey-FunFacts-70. The P-ComplEx and P-TransE models with the LP method
lack scores for 19 of the 70 facts, and the kNN-P-TransE model lacks scores for 10 facts. For those facts,
we fall back to the frequency baseline to fill in a surprise score. We mark these results with an asterisk
(*).
                              GoodTrivia (↑)     Surprising (↑)     Knew (↓)
             Methods          Rho      Tau       Rho      Tau       Rho      Tau
 Baselines   random          -0.005   -0.003    -0.005   -0.004     0.005    0.004
             frequency        0.328    0.236     0.329    0.238    -0.410   -0.304
 SOD         BERT             0.396    0.271     0.345    0.235    -0.307   -0.218
             ComplEx          0.478    0.345     0.373    0.261    -0.247   -0.177
             TransE           0.460    0.314     0.350    0.237    -0.244   -0.180
             P-ComplEx        0.230    0.160     0.137    0.094    -0.074   -0.057
             P-TransE         0.353    0.250     0.257    0.178    -0.144   -0.107
             H-RandomWalk     0.540    0.355     0.521    0.359    -0.499   -0.366
             A-RandomWalk     0.311    0.224     0.240    0.169    -0.199   -0.110
             S-RandomWalk     0.331    0.229     0.255    0.183    -0.333   -0.243
 LP          ComplEx          0.146    0.113     0.148    0.111    -0.170   -0.126
             TransE           0.197    0.141     0.162    0.114    -0.109   -0.069
             P-ComplEx        0.086*   0.058*    0.018*   0.009*    0.087*   0.056*
             P-TransE         0.183*   0.145*    0.174*   0.131*   -0.080*  -0.050*
             kNN-P-TransE     0.076*   0.061*    0.048*   0.031*   -0.073*  -0.060*

Table 3
MRR results on the Trivia dataset for TransE-based Link Prediction models. We show MRR results for
the entire benchmark, only on the correct/incorrect triples, and only on the (non-)outliers subsets. For
each question, we take the two facts closest to the centroid to be non-outliers, and the two facts furthest
from the centroid to be outliers.
             Subset          TransE    P-TransE    kNN-P-TransE
             All             0.0009    0.0006      0.2213
             Correct         0.0012    0.0004      0.2617
             Incorrect       0.0005    0.0008      0.1787
             Non-Outliers    0.0011    0.0005      0.2752
             Outliers        0.0007    0.0007      0.1493



   While LP performs consistently better than random
on both benchmarks, it has a relatively poor performance on the Survey benchmark, which
may again be attributed to the presence of compound facts in this benchmark. Link prediction
predicts a single subject vector at a time for each predicate-object pair, and we account for compound facts afterwards by computing the centroid of these predictions. This centroid is an estimate of where in the embedding space an entity that has all of the predicate-object pairs would lie. The estimate may be poor when the predicted subject vectors are far apart (as for facts that combine predicate-object pairs that rarely co-occur), so the LP method cannot directly and effectively capture the interactions between predicate-object pairs that humans rely on when judging whether a fact is surprising.
   Hypothesizing that LP methods regress to the mean of the entities with similar facts, we
study the behavior of the embedding models further by analyzing the LP MRR of a fact in
relation to the centroid of the cluster. The results are shown in Table 3. The MRR for two of the three models is higher on the correct subset than on the incorrect one, and higher on the non-outlier facts than on the outliers. We observe that across every subset of the data, the supervised embedding model obtains the highest MRR. Interestingly, while the supervised model performs the best on the link prediction task, its performance on the surprise task is the lowest of the three models on this benchmark. The unsupervised version, P-TransE, obtains comparably low MRR values on both the outliers and the non-outliers, yet its performance on the surprise task is notably higher. This suggests that LP models are optimized to predict values that are typical rather than surprising. Based on these results, we hypothesize that more accurate link prediction models are not necessarily better at identifying surprising facts; subsequent research is needed to further analyze their behavior.
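For reference, a minimal sketch of how the MRR in Table 3 can be computed for a TransE-style model; ranking against an explicit candidate set and using Euclidean distance are simplifying assumptions for illustration.

import numpy as np

def mean_reciprocal_rank(ranks):
    """MRR over the 1-based ranks of the true objects."""
    return float(np.mean([1.0 / r for r in ranks]))

def transe_object_ranks(triples, ent_emb, rel_emb, candidates):
    """Rank the true object of each <e, p, o> by distance to the prediction emb(e) + emb(p)."""
    ranks = []
    for e, p, o in triples:
        pred = ent_emb[e] + rel_emb[p]
        ranked = sorted(candidates, key=lambda c: np.linalg.norm(pred - ent_emb[c]))
        ranks.append(ranked.index(o) + 1)
    return ranks

# e.g., mean_reciprocal_rank(transe_object_ranks(outlier_triples, ent_emb, rel_emb, all_entities))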


5. Conclusions
In this paper, we studied the challenge of identifying surprising facts in Wikidata automatically.
Inspired by earlier work on Wikipedia, we formulated a novel task of Surprising Fact Identification
where an AI system has to mimic the surprise estimation of humans. We developed two novel
benchmarks to evaluate representation learning models on this task. We proposed two generic
methods to identify surprising facts based on statistical outlier detection and link prediction. Our
experiments revealed that the best performance on both datasets was obtained with different
variants of the outlier detection method. While link prediction methods performed relatively
well on single facts, their performance on compound facts was much lower than that of the outlier detection methods and the frequency baseline. Further analysis revealed that link prediction
models are optimized to predict typical values rather than surprising ones, as they tend to
regress to the centroid of the entities with similar facts.
   This paper represents a first investigation of identifying surprising facts in Wikidata by using
automatic methods. However, the significance of our findings is limited by the small size of
the benchmarks, which cannot be expected to be representative of the scale of Wikidata. We have found it challenging to create a large evaluation dataset; a key task for future work is to develop a more representative evaluation set. Future work should investigate the robustness of
the obtained results by evaluating on such a larger dataset, and it should develop novel surprise
methods that, for example, leverage entity embeddings in ways that focus on capturing surprise.
We make our code and data publicly available to facilitate subsequent work on identifying
surprising facts in Wikidata: https://github.com/usc-isi-i2/surprising-facts.


Acknowledgments
This material is based on research sponsored by Air Force Research Laboratory under agreement
number FA8750-20-2-10002. The U.S. Government is authorized to reproduce and distribute
reprints for Governmental purposes notwithstanding any copyright notation thereon. The
views and conclusions contained herein are those of the authors and should not be interpreted
as necessarily representing the official policies or endorsements, either expressed or implied, of
Air Force Research Laboratory or the U.S. Government.


References
 [1] D. Tsurel, D. Pelleg, I. Guy, D. Shahaf, Fun facts: Automatic trivia fact extraction from
     wikipedia, Proceedings of the Tenth ACM International Conference on Web Search and
     Data Mining (2017).
 [2] D. VrandečiΔ‡, M. KrΓΆtzsch, Wikidata: a free collaborative knowledgebase, Communications
     of the ACM 57 (2014) 78–85.
 [3] Wikidata, Curious Facts dashboard. URL: https://wikidata-analytics.wmcloud.org/app/CuriousFacts.
 [4] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings
     for modeling multi-relational data, Advances in neural information processing systems 26
     (2013).
 [5] T. Trouillon, J. Welbl, S. Riedel, Γ‰. Gaussier, G. Bouchard, Complex embeddings for
     simple link prediction, in: International conference on machine learning, PMLR, 2016, pp.
     2071–2080.
 [6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
 [7] H. Arnaout, S. Razniewski, G. Weikum, J. Z. Pan, Negative statements considered useful,
     Journal of Web Semantics 71 (2021) 100661.
 [8] N. Klein, F. Ilievski, P. Szekely, Generating explainable abstractions for wikidata entities,
     in: Proceedings of the 11th on Knowledge Capture Conference, 2021, pp. 89–96.
 [9] A. Prakash, M. K. Chinnakotla, D. Patel, P. Garg, Did you know?β€”mining interesting trivia
     for entities from wikipedia, in: Twenty-Fourth International Joint Conference on Artificial
     Intelligence, 2015.
[10] I. V. Serban, A. GarcΓ­a-DurΓ‘n, C. Gulcehre, S. Ahn, S. Chandar, A. Courville, Y. Bengio,
     Generating factoid questions with recurrent neural networks: The 30m factoid question-
     answer corpus, arXiv preprint arXiv:1603.06807 (2016).
[11] A. Piscopo, C. Phethean, E. Simperl, What makes a good collaborative knowledge graph:
     group composition and quality in wikidata, in: International Conference on Social Infor-
     matics, Springer, 2017, pp. 305–322.
[12] A. Piscopo, L.-A. Kaffee, C. Phethean, E. Simperl, Provenance information in a collaborative
     knowledge graph: an evaluation of wikidata external references, in: International semantic
     web conference, Springer, 2017, pp. 542–558.
[13] T. Safavi, D. Koutra, E. Meij, Evaluating the calibration of knowledge graph embeddings
     for trustworthy link prediction, arXiv preprint arXiv:2004.01168 (2020).
[14] H. Wu, Z. Wang, X. Zhang, P. G. Omran, Z. Feng, K. Wang, A system for reasoning-based
     link prediction in large knowledge graphs., in: ISWC Satellites, 2019, pp. 121–124.
[15] P. Rosso, D. Yang, P. CudrΓ©-Mauroux, Beyond triplets: hyper-relational knowledge graph
     embedding for link prediction, in: Proceedings of The Web Conference 2020, 2020, pp.
     1885–1896.
[16] U. Joshi, J. Urbani, Searching for embeddings in a haystack: Link prediction on knowledge
     graphs with subgraph pruning, in: Proceedings of The Web Conference 2020, 2020, pp.
     2817–2823.