Prompting as Probing: Using Language Models for
Knowledge Base Construction
Dimitrios Alivanistos1,2,4 , Selene Báez Santamaría1,2 , Michael Cochez1,2,4 ,
Jan-Christoph Kalo1,2,5 , Emile van Krieken1,2 and Thiviyan Thanapalasingam1,2,3
1
  Authors are listed in alphabetical order to denote equal contributions.
2
  Vrije Universiteit Amsterdam
3
  Universiteit van Amsterdam
4
  Discovery Lab, Elsevier, The Netherlands
5
  DReaMS Lab, Huawei, The Netherlands


                                         Abstract
                                         Language Models (LMs) have proven to be useful in various downstream applications, such as sum-
                                         marisation, translation, question answering and text classification. LMs are becoming increasingly
                                         important tools in Artificial Intelligence, because of the vast quantity of information they can store.
                                         In this work, we present ProP (Prompting as Probing), which utilizes GPT-3, a large Language Model
                                         originally proposed by OpenAI in 2020, to perform the task of Knowledge Base Construction (KBC).
                                         ProP implements a multi-step approach that combines a variety of prompting techniques to achieve
                                         this. Our results show that manual prompt curation is essential, that the LM must be encouraged to
                                         give answer sets of variable lengths, in particular including empty answer sets, that true/false questions
                                         are a useful device to increase precision on suggestions generated by the LM, that the size of the LM
                                         is a crucial factor, and that a dictionary of entity aliases improves the LM score. Our evaluation study
                                         indicates that these proposed techniques can substantially enhance the quality of the final predictions:
                                         ProP won track 2 of the LM-KBC competition, outperforming the baseline by 36.4 percentage points.
                                         Our implementation is available on https://github.com/HEmile/iswc-challenge.




1. Introduction
Language Models (LMs) have been at the center of attention, presented as a recent success story
of Artificial Intelligence. LMs have shown great promise across a wide range of domains in a
variety of different tasks, such as Text Classification [1], Financial Sentiment Analysis [2], and
Protein Binding Site Prediction [3]). In recent years, prompt engineering for LMs has become a
research field in itself, with a plethora of papers working on LM understanding (e.g. [4]).
   Natural Language Processing (NLP) researchers have recently investigated whether LMs
could potentially be used as Knowledge Bases, by querying for particular information. In Petroni
et al. [5], the LAMA dataset for probing relational facts from Wikidata in LMs was presented.

LM-KBC’22: Knowledge Base Construction from Pre-trained Language Models, Challenge at ISWC 2022
$ d.alivanistos@vu.nl (D. Alivanistos); s.baezsantamaria@vu.nl (S. B. Santamaría); m.cochez@vu.nl (M. Cochez);
j.c.kalo@vu.nl (J. Kalo); e.van.krieken@vu.nl (E. v. Krieken); t.thanapalasingam@uva.nl (T. Thanapalasingam)
€ https://dimitrisalivas.github.io/ (D. Alivanistos); https://selbaez.github.io/ (S. B. Santamaría);
https://www.cochez.nl/ (M. Cochez); https://research.vu.nl/en/persons/jan-christoph-kalo (J. Kalo);
https://emilevankrieken.com/ (E. v. Krieken); https://thiviyansingam.com/ (T. Thanapalasingam)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
The authors show that the masked LM BERT can complete Wikidata facts with a precision of
around 32%. Several follow-up papers have pushed this number to almost 50% [6, 7]. While the
prediction quality on LAMA is promising, others have argued that LMs should not be used as
knowledge graphs, but rather to support the augmentation and curation of KGs [8].
   In this paper, we describe ProP, the system we implemented for the “Knowledge Base Con-
struction from Pre-trained Language Models” challenge at ISWC 20221 . The task is to predict
possible objects of a triple, where the subject and relation are given. For example, given the
string "Ronnie James Dio" as the subject and the relation PersonInstrument, an LM needs to
predict the answers "bass guitar" or "guitar", and "trumpet", as the objects of the triple. In
contrast to the LAMA dataset [5], the LM-KBC challenge dataset contains questions for which
there are no answers or where multiple answers are valid. Bakel et al. [9] argue for precision and
recall metrics over the common ranking-based metrics for query answering. Since the
LM-KBC dataset requires predicting a set of answers and in some cases even empty answer sets,
we choose precision and recall metrics to evaluate our system.
ProP uses the GPT-3 model, a large LM proposed by OpenAI in 2020 [10]. For each relation
type, we engineered prompts that, when given to the language model, probe it to respond with
the set of objects for that relation. The components of ProP can be divided into two categories:
prompt generation focuses on generating the correct prompts that yield the desired answers
for the questions in the LM-KBC dataset, and post-processing aims to enhance the quality of
the predictions.


2. Related Work
The development of large pre-trained Language Models (LMs) has led to substantial improve-
ments in NLP research. It was shown that extensive pre-training on large text corpora encodes
large amounts of linguistic and factual knowledge into a language model that can help to
improve the performance on various downstream tasks (see [11] for a recent overview).
   LM as KG Petroni et al. [5], and later others [8], asked the question to what extent language
models can replace or support the creation and curation of knowledge graphs. Petroni et al.
proposed the LAMA dataset for probing relational knowledge in language models by using
masked language models to complete cloze-style sentences. As an example, the language model
BERT can complete the sentence “Paris is the capital of [MASK]" with the word “France". In
this case, it is assumed that the model knows about, or can predict, the triple (Paris, capitalOf,
France).
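As an illustration of this cloze-style probing, a masked LM can be queried through the HuggingFace fill-mask pipeline. This is a minimal sketch, not the LAMA evaluation code; the choice of bert-base-cased and the number of predictions printed are our own, arbitrary assumptions.

from transformers import pipeline

# Query a masked language model with a cloze-style sentence (illustrative sketch only).
fill_mask = pipeline("fill-mask", model="bert-base-cased")

for prediction in fill_mask("Paris is the capital of [MASK].")[:3]:
    # Each prediction contains the proposed token and the model's confidence score.
    print(prediction["token_str"], round(prediction["score"], 3))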
   While the original paper relied on manually designed prompts for probing the language
model, various follow-up works have shown the superiority of automatically learning prompts.
Methods can mine prompts from large text corpora and pick the best prompts by applying
them to a training dataset as demonstrated in [12, 13]. Prompts can also be directly learned via
backpropagation: BERTESE [14] and AutoPrompt [15] show how prompts can be learned to
improve the performance on LAMA. The probing performance can be pushed even further by
either directly learning continuous embeddings for the prompts [7, 16] or by directly fine-tuning
    1
      LM-KBC, https://lm-kbc.github.io/. This work is a submission in the open track (Track 2) in which LMs of any
size can be used.
the LM on the training data [17]. Similar to our work, FewShot-LAMA [18] evaluates few-shot
learning on the original LAMA dataset. The authors show that a combination of few-shot
examples with learned prompts achieves the best probing results.
   Since the publication of the LAMA dataset, a large variety of other probing datasets for factual
knowledge in LMs have been created. LAMA-UHN is a more difficult version of LAMA [19].
TimeLAMA adds a time component to facts that the model is probed for [20]. Furthermore,
BioLAMA [21] and MedLAMA [22] are domain-specific probing datasets for biological and
medical knowledge.
   Most existing approaches have in common that they only probe the language models for
entities with a label consisting of a single token from the language model’s vocabulary. Thus,
the prediction of long, complex entity names is mostly unexplored. Furthermore, most existing
works have only asked language models to complete a triple with a single prediction, even
though some triples might actually allow for multiple possible predictions. Both these aspects
substantially change the challenge of knowledge graph construction.
   LM for KG While probing language models has been heavily studied in the NLP community,
the idea of using language models to support knowledge graph curation is not sufficiently
studied [8]. Some works have shown how LMs in combination with knowledge graphs can
be used to complete query results [23]. Other works have looked into how to use language
models to identify errors in knowledge graphs [24], or have studied how to weight KG triples
from ConceptNet with language models to measure semantic similarity. Biswas et al. [25] have
shown that language models can be used to perform entity typing by predicting the class using
language models.
   KG-BERT is a system that is most similar to what is required for the KBC Challenge [26]. A
standard BERT model is trained on serialized knowledge graph data to perform link prediction
on standard link prediction benchmark datasets. KG-BERT’s performance on link prediction is
comparable to many state-of-the-art systems that use knowledge graph embeddings for this
task.
   Similar tasks This KBC Challenge task is similar to Factual Q&A with Language models,
where the goal is to respond to questions that fall outside the scope of a knowledge base. This
shared task differs in that the responses need to include 0 to k answers. Moreover, in the shared
task, the factual questions are generated from triples, thus including variation in how a triple
might be phrased as a question.


3. The LM-KBC Challenge
The LM-KBC dataset contains triples of the form (𝑠, 𝑝, 𝑂), where 𝑠 is the textual representation
of the subject, 𝑝 is one of 12 different predicates and 𝑂 the (possibly empty) set of textual
representations of object entities for prediction. The subjects and objects are of various types.
After learning from a training set of such triples, given a new subject and one of the known
relations, the task is to predict the complete set of objects.
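The official scoring script is provided by the challenge organisers; the following minimal sketch only illustrates how such set-valued predictions can be scored with per-instance precision, recall and F1 (macro-averaged per relation). The treatment of a correctly predicted empty set as a perfect answer is our assumption about the convention, and the example values are hypothetical.

def score_instance(predicted: set, gold: set):
    # Illustrative per-instance scoring for set-valued answers (not the official LM-KBC script).
    if not predicted and not gold:
        return 1.0, 1.0, 1.0  # assumed convention: a correctly predicted empty set is a perfect answer
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: one of two predictions is correct, one gold object is missed.
print(score_instance({"guitar", "piano"}, {"guitar", "trumpet"}))  # (0.5, 0.5, 0.5)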
3.1. The LM-KBC Dataset
For each of the 12 relations, the number of unique subject-entities in the train, dev, and test sets
are 100, 50, and 50 respectively. We include detailed distributions of the cardinality (the number
of object-entities) for each relation type (Appendix, Figures 2 and 3). Table 4 in the Appendix
shows the aggregated statistics about the number of object-entities, as well as the number of
alternative labels per object-entity in the development set. Certain relation types have a much
higher average cardinality (e.g. PersonProfession=7.42 or StateSharesBorderState=5.6) than others
(e.g. CompanyParentOrganization=0.32, PersonPlaceOfDeath=0.50). We also note that only five
of the relations allow for empty answer sets. For example, relations associated with a person’s
death (PersonPlaceOfDeath and PersonCauseOfDeath) often contain empty answer sets, because
many persons in the dataset are still alive. In these cases, a model needs to be able to predict
empty answer sets correctly.

3.2. The Baseline
The baseline model is a masking-based approach that uses the popular BERT model [27] in
3 variants (base-cased, large-cased and RoBERTa [28]). The BERT model is tasked with doing
prompt completion to predict object-entities for a given subject-relation pair. The prompts used
by the baseline approach have been customised for the different relation types.


4. Our Method
Previous studies have shown that prompt formulation plays a critical role in the performance
of LMs on downstream applications [29, 4] and this also applies to our work. We investigated
different prompting approaches using few-shot learning with GPT-3 [10] through OpenAI’s API2 .
In this Section, we first describe how we generate the prompts (prompt generation phase), and
then how we use additional components in our pipeline to post-process the completions and
enhance the quality of the predictions (post-processing phase).

4.1. Prompt Generation
For each relation in the dataset, we manually curate a set of prompt templates consisting of
four representative examples3 selected from the training split. We use these relation-specific
templates to generate prompts for every subject entity by replacing a placeholder token in the
final line of the template with the subject entity of interest. We task GPT-3 with completing the
prompts, evaluate the completions, and compute the macro-precision, macro-recall and
macro-F1 score for each relation.


    2
       https://openai.com/api (temperature=0, max_tokens=100, top_p=1, frequency_penalty=0, presence_penalty=0,
logprobs=1)
     3
       This is an arbitrary number of training examples. Since few-shot learners are efficient at learning from a
handful of examples [10], including an excessive number of training examples may not necessarily lead to any
improved precision or recall. Therefore, we did not study the effects of varying the number of training examples in
the prompts.
   We ensured that the training examples for few-shot learning included all of the following: (i)
questions with answer sets of variable lengths to inform the LM that it can generate multiple
answers; (ii) questions with empty answer sets to ensure that the LM returns nothing when
there is no correct answer to a given question; (iii) a fixed question-answer order, where we
provide the question and then immediately the answer so that the LM learns the style of the
format, and (iv) the answer set formatted as a list to ensure we can efficiently parse the answer
set from the LM. We did not study the order of these examples, but hypothesize that this is not
hugely important to GPT-3 as it can handle long-range dependencies well.
   We formulate the questions either in natural language or in the form of a triple. We hand-
designed the natural language questions and did not compare them in a structured manner
to alternatives. However, we tried out several variations in OpenAI GPT-3 playground to get
an intuition of what style of questions is effective; we found that shorter and simpler questions
usually work better. We include the prompt templates used in Section 7.2 of the Appendix. In our work,
we investigate the use of both prompting styles and compare natural language prompts with
triple-based prompts for the different relations.
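As a concrete illustration, a relation-specific template (here the ChemicalCompoundElement template from Section 7.2.5) can be instantiated and sent to the legacy OpenAI Completion API with the decoding parameters listed in footnote 2. This is a minimal sketch: the engine name, the stop sequence and the parsing are our assumptions, not necessarily the exact ProP implementation.

import ast
import openai  # legacy (pre-1.0) OpenAI client; assumes openai.api_key is set in the environment

# Triple-style few-shot template; the final line carries the {subject_entity} placeholder.
TEMPLATE = """Water ChemicalCompoundElement: ['Hydrogen', 'Oxygen']

Bismuth subsalicylate ChemicalCompoundElement: ['Bismuth']

Sodium Bicarbonate ChemicalCompoundElement: ['Hydrogen', 'Oxygen', 'Sodium', 'Carbon']

Aspirin ChemicalCompoundElement: ['Oxygen', 'Carbon', 'Hydrogen']

{subject_entity} ChemicalCompoundElement:"""

def predict_objects(subject_entity: str) -> list:
    prompt = TEMPLATE.format(subject_entity=subject_entity)
    response = openai.Completion.create(
        engine="davinci",  # engine name is an assumption; Section 5.1.3 compares the GPT-3 variants
        prompt=prompt,
        temperature=0, max_tokens=100, top_p=1,
        frequency_penalty=0, presence_penalty=0, logprobs=1,
        stop="\n",  # assumption: cut the completion at the end of the answer list
    )
    completion = response["choices"][0]["text"].strip()
    # The in-context examples teach the model to answer with a Python-style list,
    # so the completion can be parsed directly; parsing failures are not handled here.
    return list(ast.literal_eval(completion))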

4.2. Empty Answer Sets
There are some questions in the dataset for which the correct answer is an empty answer set.
For instance, there are no countries that share borders with Fiji. The empty answer set can be
represented as either an empty list [], or as a string within a list [‘None’]. We have observed
(see Table 2) that the way that empty sets are represented affects the precision and recall of our
approach. Allowing the explicit answer ‘None’ encourages the LM to exclude answers that it is
uncertain about.
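When the explicit ‘None’ marker is used, the post-processing has to map it back to an empty answer set. A minimal sketch of this normalisation follows; the exact handling in ProP may differ.

def normalise_answer_set(parsed_answers: list) -> set:
    # Map both the empty list and the explicit 'None' marker to an empty answer set,
    # and strip surrounding whitespace from the remaining answers.
    return {a.strip() for a in parsed_answers if a and a.strip().lower() != "none"}

print(normalise_answer_set(["None"]))              # set()
print(normalise_answer_set([]))                    # set()
print(normalise_answer_set(["Guitar", "Piano"]))   # {'Guitar', 'Piano'}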

4.3. Fact probing
Our initial results indicated that the recall of our approach was high, but the precision for
certain relations was low. Therefore, we add a post-processing step called fact probing to ProP’s
pipeline, in which we ask the LM whether each completion it proposed in the previous step
is correct. Inspired by maieutic prompting [30], we create a simple
prompt, where we translate each predicted completion into a natural language fact. Then we
ask the LM to predict whether this fact is true or false. One example of a fact-probing prompt is
Niger neighbours Libya TRUE for the CountryBordersWithCountry relation. We ensure that the
LM only predicts either TRUE or FALSE by adding a true and a false example to the prompt.
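A minimal sketch of such a fact-probing step for CountryBordersWithCountry, with one true and one false in-context example so that the model only answers TRUE or FALSE. The verbalisation, the in-context facts, the candidate list and the API parameters are illustrative assumptions rather than the exact ProP prompts.

import openai  # legacy (pre-1.0) OpenAI client; assumes openai.api_key is set

# Hypothetical probing template for CountryBordersWithCountry.
PROBE_TEMPLATE = """Niger neighbours Libya TRUE
Portugal neighbours Germany FALSE
{subject} neighbours {candidate}"""

def probe_fact(subject: str, candidate: str) -> bool:
    prompt = PROBE_TEMPLATE.format(subject=subject, candidate=candidate)
    response = openai.Completion.create(
        engine="davinci", prompt=prompt, temperature=0, max_tokens=5,
    )
    # Keep the candidate only if the model labels the verbalised fact as TRUE.
    return response["choices"][0]["text"].strip().upper().startswith("TRUE")

# Drop candidate objects that the LM judges to be false.
candidates = ["Libya", "Benin", "Portugal"]
kept = [c for c in candidates if probe_fact("Niger", c)]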


5. Results
In this Section, we analyse the ProP pipeline we built for generating prompts and evaluate the
contribution of each component. We explain how we combine the best-performing components
to yield a prediction that obtains a high macro F1-score on the test split.
5.1. Prompt Fine-Tuning
5.1.1. Natural Language vs Triple-based Prompts
Table 1 shows the quality of the predictions for the natural language prompts and triple-based
prompts. We note that in terms of F1, the comparison between these two prompt styles is mixed,
with F1 being higher for the triple-style prompts in seven out of twelve cases. Unpacking the
F1 into recall and precision shows that the triple-style prompts yield higher precision, while
natural language prompts yield higher recall. Overall, the triple-style prompts do yield a higher
F1 when averaged over each relation. Our intuition is that natural language prompts contain
certain words that badly influence the precision of the predictions. It is, however, difficult to
study this systematically, as the space of possible word combinations in the prompts is very large.
Triple-based prompts circumvent this problem because they only contain the relevant terms in
the prompts for the subject entities and relations that are required to predict the object entities.

Table 1
Precision, Recall, and F1-score for predictions generated using natural language and triple-based prompts
across the different relations. Results are on the dev dataset. Values are rounded to three decimal places.
Best scores are in bold.
   Relation Type                                Precision                Recall                 F1

   Method                              Triple        Natural    Triple      Natural    Triple    Natural
                                                     Language               Language             Language

   ChemicalCompoundElement             0.976         0.895      0.919       0.885      0.940     0.884
   CompanyParentOrganization           0.587         0.385      0.600       0.400      0.590     0.388
   CountryBordersWithCountry           0.865         0.809      0.733       0.800      0.766     0.785
   CountryOfficialLanguage             0.933         0.798      0.810       0.882      0.833     0.785
   PersonCauseOfDeath                  0.560         0.500      0.550       0.500      0.553     0.500
   PersonEmployer                      0.261         0.273      0.267       0.323      0.226     0.262
   PersonInstrument                    0.547         0.489      0.508       0.458      0.502     0.446
   PersonLanguage                      0.840         0.750      0.894       0.932      0.827     0.793
   PersonPlaceOfDeath                  0.820         0.840      0.820       0.840      0.820     0.840
   PersonProfession                    0.669         0.713      0.527       0.535      0.556     0.581
   RiverBasinsCountry                  0.845         0.820      0.868       0.863      0.832     0.822
   StateSharesBorderState              0.587         0.628      0.407       0.462      0.472     0.522

   Average over all relations          0.707         0.658      0.658       0.657      0.660     0.634




5.1.2. Empty vs None
As explained in Section 3.1, five out of twelve relations allow for empty answer sets. We
experiment with the different ways to represent such empty answer sets, and Table 2 shows
the results. Three relations get a performance boost when prompted with ‘NONE’ (Compa-
nyParentOrganization, PersonCauseOfDeath, PersonInstrument), while the other two relations
perform better when using empty lists (CountryBordersWithCountry, PersonPlaceOfDeath). For
subsequent experiments, we modified the prompt of each relation to use the best-performing
representation.
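For reference, the per-relation choice that follows from Table 2 can be written down as a small configuration; this is a sketch, and ProP’s actual configuration format may differ.

# Best-performing empty-answer representation per relation on the dev set (Table 2).
EMPTY_ANSWER_REPRESENTATION = {
    "CompanyParentOrganization": "['None']",
    "CountryBordersWithCountry": "[]",
    "PersonCauseOfDeath": "['None']",
    "PersonInstrument": "['None']",
    "PersonPlaceOfDeath": "[]",
}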
Table 2
Precision, recall, and F1-score for the Davinci model across the different relations. Results are on the
dev dataset. We only include those relations which allow for empty answer sets. Values are rounded
to three decimal places. Best scores are in bold.
    Relation Type                                      Precision             Recall              F1

    Method                                     Empty         None    Empty       None    Empty        None

    CompanyParentOrganization                  0.587         0.767   0.600       0.780   0.590        0.770
    CountryBordersWithCountry                  0.865         0.826   0.733       0.719   0.766        0.749
    PersonCauseOfDeath                         0.560         0.600   0.550       0.590   0.553        0.593
    PersonInstrument                           0.547         0.600   0.508       0.561   0.502        0.568
    PersonPlaceOfDeath                         0.820         0.780   0.820       0.780   0.820        0.780

    Average over all relations                 0.685         0.697   0.674       0.685   0.657        0.669



5.1.3. Language Model Size
Usually, the size of LMs is measured by the number of learnable parameters in the model.
However, the OpenAI API does not report the total number of parameters, but only the size of
the embedding dimensions for the tokens 4 . We assume there is a positive correlation between
the token dimension size and the total number of GPT-3 parameters. Figure 1 shows our results,
and we can see that as the language model size increases, the F1-score also increases. This
suggests that a larger LM gives better performance on KBC.




Figure 1: The F1-scores of GPT-3 models with different numbers of parameters (the embedding
dimension of each model is given in brackets). We observe that the F1-score increases almost in
proportion to the model size.




    4
        https://beta.openai.com/docs/models/gpt-3
5.2. Post-Processing Predictions
Up to this point, we have discussed how we generate the optimal prompts for the different
relation types. Once the LM produces the completions using these optimal prompt techniques,
we can employ two additional steps to enhance the precision and recall of our predictions. Table
3 shows the results of including fact probing and entity aliases in our system.

5.2.1. Fact probing
We found that fact probing has a different impact on different relation types. This difference could
stem from the cardinalities of the relation types. For example, the relation PersonPlaceOfDeath,
which has at most one correct answer, should show a larger improvement than StateSharesBorderState,
which has a higher cardinality. We found that fact probing helped to boost the predictions
of five relations (CompanyParentOrganization, CountryOfficialLanguage, PersonCauseOfDeath,
PersonInstrument, PersonLanguage). We only apply fact probing to these relations. On the dev
set, the precision of fact probing is 0.737, and 0.608 among the predictions removed by fact
probing. That is, in 60.8% of the cases in which fact probing filtered a prediction, it correctly
removed a prediction that was not in the ground truth set.

Table 3
Precision, recall, and F1-score for predictions generated using different post-processing techniques on
the development (dev) and test sets. We round values to three decimal places. Best F1-scores are in
bold.
          Method                                       Precision    Recall        F1

          BERT Baseline (dev)                          0.349        0.295         0.309
          BERT Baseline + fact probing (dev)           0.357        0.304         0.317
          BERT Baseline + fact probing + alias (dev)   0.357        0.304         0.317

          GPT3-Davinci (dev)                           0.736        0.699         0.697
          GPT3-Davinci + fact probing (dev)            0.741        0.692         0.698
          GPT3-Davinci + fact probing + alias (dev)    0.755        0.709         0.712

          GPT3-Davinci (test)                          0.782        0.701         0.679
          GPT3-Davinci + fact probing (test)           0.798        0.690         0.676
          GPT3-Davinci + fact probing + alias (test)   0.813        0.704         0.689




5.2.2. Entity aliases
As we discussed in Section 3.1, the predictions from the language model are sometimes correct,
but not according to what the gold standard expects. Whether this is problematic depends on
the final use of the predictive model. In interactive use, this would not be an issue because
the user will be able to disambiguate. For actual KBC, the system will have to disambiguate
what exact entity it predicted. Here, however, we only check whether the text generated by the
model corresponds to one of the gold standard alternatives.
  While experimenting, we noticed that in the training and development datasets the names of
entities often correspond with the labels of entities in Wikidata5 . On Wikidata, these entities
also have aliases, and we wanted to know whether we can improve our system by looking up
the aliases on Wikidata. This lookup does not use language models, so it is not included as part of
the ProP pipeline, as it would violate the terms of the LM-KBC challenge. Instead, we perform
it as an ablation study.
   The alias-fetcher works as follows. First, we extract a set of types which could be relevant
for the specific relation types. For example, country (Q6256) is relevant for RiverBasinsCountry.
Then, for each relation type, we extract all correctly typed entities, their aliases, and their claim
counts. Next, we take the prediction of the LM and check whether there is an entity with that
label for that relation. If so, we retain the prediction.
   Otherwise, we check whether the prediction is equal to any alias. There could be multiple
entities for which this is the case. Therefore, we pick the label of the entity with the most claims
on Wikidata. The assumption is that it is more likely the answer if it is an ‘interesting’ entity
and that these have more claims on Wikidata.
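A sketch of this lookup, assuming the correctly typed Wikidata entities with their aliases and claim counts have already been extracted into a per-relation list (the extraction step itself is not shown, and the data structure is our own illustration, not the ProP code):

from dataclasses import dataclass, field

@dataclass
class WikidataEntity:
    label: str
    aliases: set = field(default_factory=set)
    claim_count: int = 0

def normalise_prediction(prediction: str, entities: list) -> str:
    """Map an LM prediction to a canonical label among the relation's candidate entities."""
    # 1. If the prediction already matches an entity label, retain it unchanged.
    if any(prediction == entity.label for entity in entities):
        return prediction
    # 2. Otherwise, collect all entities that list the prediction among their aliases.
    matches = [entity for entity in entities if prediction in entity.aliases]
    if not matches:
        return prediction  # no matching alias found; leave the prediction as-is
    # 3. If several entities share the alias, prefer the one with the most claims on Wikidata,
    #    assuming that 'interesting' entities carry more claims.
    return max(matches, key=lambda entity: entity.claim_count).label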
   We observe that for four relation types, the changes in the scores are insignificant. For the
eight other relations, we see that the F1 score goes up slightly. Overall, this results in an average
improvement of the F1 score of 0.014 (Table 3) on the development set. On the test set, we
notice a similar improvement. This experiment is not extensive enough to derive definitive
conclusions, but it appears to be useful to use structured data to augment the predictions of an
LM.

5.2.3. Contemporaneity of LMs
We found that questions regarding recent events, particularly those that occurred after 2020,
did not yield good predictions by GPT-3 (see 7.3 in the Appendix). This is in line with related
findings around LMs and was confirmed by OpenAI 6 . Two examples of this include Facebook,
Inc. changing its name to Meta Platforms, Inc. (in October 2021), and the country of Swaziland
changing its name to Eswatini (in 2018). We also observed similar problems with several instances
from the following relations: PersonProfession, PersonCauseOfDeath, and PersonPlaceOfDeath.
It is worth noting that it is not important when the model was trained, but whether the training
data contains up-to-date information.

5.3. Future Work
A natural continuation of this work revolves around improving the individual steps in our
pipeline (e.g. fact probing) and their performance, which would directly reflect on the overall
macro-F1 score of our approach. Additionally, we could experiment with inverting our pipeline
and allow the LM to generate the best prompts by providing the ground truth as input. For exam-
ple, we could explore techniques that automatically learn prompts, similar to AutoPrompt [15]
and OptiPrompt [6], but ideally with a method that requires fewer resources.
   In terms of additional components that make use of LMs, we considered developing meta-
prompts as in [29]. We think it would be interesting to study what meta-prompts can be
    5
        https://www.wikidata.org
    6
        https://beta.openai.com/docs/guides/embeddings/limitations-risks
developed for KBC. Is there a set of specific patterns that work better than others? Finally, we
could modify our alias-fetcher to use the LM to generate well-known aliases for both entities
and relations in the training data. This approach could act as a diversification factor, and we
believe it will have more freedom in its choice of aliases.
   A related direction is data augmentation, which differs from prompt tuning in the following way:
while prompt tuning looks for the optimal prompt to increase the performance on a specific task,
data augmentation acts as a diversification mechanism for our existing prompts. By employing a more diverse set
of prompts we can increase our performance, especially the recall. We base our hypothesis
on the fact that knowledge is expressed diversely in the training data (e.g. ambiguity), and we
believe this should be considered when prompting an LM.
   Furthermore, it would be interesting to further investigate if huge language models are
required to perform knowledge graph construction and how to achieve the best prediction
performance for the lowest costs.


6. Conclusion
We introduced ProP, our "Prompting as Probing" approach to performing knowledge base con-
struction using a high-capacity pre-trained LM. We showed how we developed different modular
components that utilise both the LM and the data provided by the organisers to improve ProP’s
performance, such as the fact probing and the alias fetcher components.
well-known techniques around prompt engineering and optimisation and analysed the effect
of different prompt formulations on the final performance. However, we conclude that the
parameter count of the GPT-3 models is the most significant contributor to performance. Our
ProP pipeline outperforms the baseline by 36.4 percentage points on the test split.
   Our approach not only obtains a high macro F1-score against the ground truth; we suspect its
actual score is even higher, because in several cases where a result was counted as incorrect,
the ground truth was either incomplete or used aliases that refer to the same entity as our
prediction. Overall, we conclude that language models can be used to augment Knowledge
Bases, and we emphasise the difficulty of evaluating question-answering tasks where simple
string matching does not suffice.

Supplemental Material Statement: Code and data are publicly available from https://github.
com/HEmile/iswc-challenge.

Acknowledgements
We thank Frank van Harmelen for his insightful comments. This research was funded by the
Vrije Universiteit Amsterdam and the Netherlands Organisation for Scientific Research (NWO)
via the Spinoza grant (SPI 63-260) awarded to Piek Vossen, the Hybrid Intelligence Centre via the
Zwaartekracht grant (024.004.022), Elsevier’s Discovery Lab, and Huawei’s DReaMS Lab.
References
 [1] D. Daza, M. Cochez, P. Groth, Inductive entity representations from text via link prediction,
     in: Proceedings of the Web Conference 2021, 2021, pp. 798–808.
 [2] D. Araci, Finbert: Financial sentiment analysis with pre-trained language models, arXiv
     preprint arXiv:1908.10063 (2019).
 [3] A. Elnaggar, M. Heinzinger, C. Dallago, G. Rihawi, Y. Wang, et al., ProtTrans: Towards
     cracking the language of life’s code through self-supervised deep learning and high per-
     formance computing, arXiv preprint arXiv:2007.06225 (2020).
 [4] T. Sorensen, J. Robinson, C. M. Rytting, A. G. Shaw, K. J. Rogers, A. P. Delorey, M. Khalil,
     N. Fulda, D. Wingate, An information-theoretic approach to prompt engineering without
     ground truth labels, 2022.
 [5] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, A. Miller, Language
     Models as Knowledge Bases?, in: Proc. of the 2019 Conference on Empirical Methods
     in Natural Language Processing and the 9th International Joint Conference on Natural
     Language Processing (EMNLP-IJCNLP), 2019, pp. 2463–2473.
 [6] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li,
     X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar,
     T. Wang, L. Zettlemoyer, Opt: Open pre-trained transformer language models, 2022.
     arXiv:2205.01068.
 [7] G. Qin, J. Eisner, Learning how to ask: Querying LMs with mixtures of soft prompts, in:
     Proceedings of the 2021 Conference of the North American Chapter of the Association for
     Computational Linguistics: Human Language Technologies, 2021, pp. 5203–5212.
 [8] S. Razniewski, A. Yates, N. Kassner, G. Weikum, Language Models As or For Knowledge
     Bases (2021). arXiv:2110.04888.
 [9] R. v. Bakel, T. Aleksiev, D. Daza, D. Alivanistos, M. Cochez, Approximate knowledge graph
     query answering: from ranking to binary classification, in: International Workshop on
     Graph Structures for Knowledge Representation and Reasoning, Springer, Cham, 2020, pp.
     107–124.
[10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
     P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in
     neural information processing systems 33 (2020) 1877–1901.
[11] H. Li, Language models: past, present, and future, Commun. ACM 65 (2022) 56–63.
[12] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How Can We Know What Language Models Know?,
     in: Transactions of the Association for Computational Linguistics 2020 (TACL), volume 8,
     2020, pp. 423–438.
[13] Z. Bouraoui, J. Camacho-Collados, S. Schockaert, Inducing Relational Knowledge from
     BERT, in: Proc. of the Thirty-Fourth Conference on Artificial Intelligence, AAAI’20, 2020.
[14] A. Haviv, J. Berant, A. Globerson, BERTese: Learning to speak to BERT, in: Proceedings
     of the 16th Conference of the European Chapter of the Association for Computational
     Linguistics: Main Volume, 2021, pp. 3618–3623.
[15] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, S. Singh, AutoPrompt: Eliciting Knowl-
     edge from Language Models with Automatically Generated Prompts (2020) 4222–4235.
     arXiv:2010.15980.
[16] Z. Zhong, D. Friedman, D. Chen, Factual probing is [MASK]: Learning vs. learning to
     recall (2021) 5017–5033.
[17] L. Fichtel, J.-C. Kalo, W.-T. Balke, Prompt Tuning or Fine-Tuning - Investigating Relational
     Knowledge in Pre-Trained Language Models (2021) 1–15.
[18] T. He, K. Cho, J. Glass, An Empirical Study on Few-shot Knowledge Probing for Pretrained
     Language Models (2021). arXiv:2109.02772.
[19] N. Poerner, U. Waltinger, H. Schütze, BERT is Not a Knowledge Base (Yet): Factual
     Knowledge vs. Name-Based Reasoning in Unsupervised QA (2019). arXiv:1911.03681.
[20] B. Dhingra, J. R. Cole, J. M. Eisenschlos, D. Gillick, J. Eisenstein, W. W. Cohen, Time-Aware
     Language Models as Temporal Knowledge Bases (2021). arXiv:2106.15110.
[21] M. Sung, J. Lee, S. Yi, M. Jeon, S. Kim, J. Kang, Can Language Models be Biomedical
     Knowledge Bases? (2021) 4723–4734. arXiv:2109.07154.
[22] Z. Meng, F. Liu, E. Shareghi, Y. Su, C. Collins, N. Collier, Rewire-then-Probe: A Con-
     trastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models (2021).
     arXiv:2110.08173.
[23] J.-C. Kalo, L. Fichtel, P. Ehler, W.-T. Balke, KnowlyBERT - Hybrid Query Answering over
     Language Models and Knowledge Graphs, in: Proceedings of the International Semantic
     Web Conference (ISWC), 2020, pp. 294–310.
[24] H. Arnaout, T.-K. Tran, D. Stepanova, M. H. Gad-Elrab, S. Razniewski, G. Weikum, Utilizing
     language model probes for knowledge graph repair, in: Wiki Workshop 2022, 2022.
[25] R. Biswas, R. Sofronova, M. Alam, N. Heist, H. Paulheim, H. Sack, Do Judge an Entity
     by Its Name! Entity Typing Using Language Models, in: The Semantic Web: ESWC 2021
     Satellite Events, 2021, pp. 65–70.
[26] L. Yao, C. Mao, Y. Luo, Kg-bert: Bert for knowledge graph completion, 2019.
     arXiv:1909.03193.
[27] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
     V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint
     arXiv:1907.11692 (2019).
[29] L. Reynolds, K. McDonell, Prompt programming for large language models: Beyond the
     few-shot paradigm, in: Extended Abstracts of the 2021 CHI Conference on Human Factors
     in Computing Systems, 2021, pp. 1–7.
[30] J. Jung, L. Qin, S. Welleck, F. Brahman, C. Bhagavatula, R. L. Bras, Y. Choi, Maieutic
     prompting: Logically consistent reasoning with recursive explanations, arXiv preprint
     arXiv:2205.11822 (2022).
7. Appendix
7.1. Dataset statistics
Here, we provide statistics about the LM-KBC dataset for the training and development split.
The statistics of the test split are unknown, because the test split is not public. We assume that
the instances from the test split are sampled from a similar data distribution.

Table 4
The mean and standard deviation (std) of the number of object-entities per relation. Since some object-
entities have alternative labels, we also count the alternative labels. The values are rounded to 2 decimal
places.
                                                        Number of Object Entities
                                                        mean          std

                            Relation Type

                            CompanyParentOrganization    0.32                0.47
                            PersonPlaceOfDeath           0.50                0.51
                            PersonCauseOfDeath           0.52                0.54
                            PersonLanguage               1.70                1.18
                            PersonInstrument             1.86                2.81
                            CountryOfficialLanguage      2.06                2.47
                            PersonEmployer               2.14                1.65
                            RiverBasinsCountry           2.28                1.67
                            ChemicalCompoundElement      3.38                1.12
                            CountryBordersWithCountry    4.04                2.69
                            StateSharesBorderState       5.62                2.92
                            PersonProfession             7.42                4.85




7.1.1. Problems arising from Alternative Labels
The LM-KBC challenge does not include entity linking. Instead, predicted entities are scored
against a list of their aliases in the LM-KBC dataset. However, we noticed these lists are
often incomplete. For example, for the "National Aeronautics and Space Administration", the
extremely common and widely used abbreviation "NASA" is not included in the list of aliases.
Another example occurs when the model predicts Aluminum (US and Canadian English) while
the ground truth only contains Aluminium (British English, the internationally adopted spelling),
so a lower score is obtained. Hence, if the model predicts Aluminum or NASA, the predictions
are deemed incorrect.

7.2. Prompts
Here, we show the templates we used to generate the prompts for the different relations. In our
templates, we use {subject_entity} to refer to the head entity for which we are predicting
the tail entities. The generated prompts were used for the following models: Ada, Babbage,
Curie and Davinci.
7.2.1. CountryBordersWithCountry
Which countries neighbour Dominica?
[’Venezuela’]

Which countries neighbour North Korea?
[’South Korea’, ’China’, ’Russia’]

Which countries neighbour Serbia?
[’Montenegro’, ’Kosovo’, ’Bosnia and Herzegovina’, ’Hungary’,
’Croatia’, ’Bulgaria’, ’Macedonia’, ’Albania’, ’Romania’]

Which countries neighbour Fiji?
[]

Which countries neighbour {subject_entity}?


7.2.2. CountryOfficialLanguage
Suriname CountryOfficialLanguage: [’Dutch’]

Canada CountryOfficialLanguage: [’English’, ’French’]

Singapore CountryOfficialLanguage: [’English’, ’Malay’, ’Mandarin’,
’Tamil’]

Sri Lanka CountryOfficialLanguage: [’Sinhala’, ’Tamil’]

{subject_entity} CountryOfficialLanguage:


7.2.3. StateSharesBorderState
San Marino StateSharesBorderState: [’San Leo’, ’Acquaviva’,
’Borgo Maggiore’, ’Chiesanuova’, ’Fiorentino’]

Whales StateSharesBorderState: [’England’]

Liguria StateSharesBorderState: [’Tuscany’, ’Auvergne-Rhoone-Alpes’,
’Piedmont’, ’Emilia-Romagna’]

Mecklenberg-Western Pomerania StateSharesBorderState: [’Brandenburg’,
’Pomeranian’, ’Schleswig-Holstein’, ’Lower Saxony’]

{subject_entity} StateSharesBorderState:
7.2.4. RiverBasinsCountry
Drava RiverBasinsCountry: [’Hungary’, ’Italy’, ’Austria’,
’Slovenia’, ’Croatia’]

Huai river RiverBasinsCountry: [’China’]

Paraná river RiverBasinsCountry: [’Bolivia’, ’Paraguay’,
’Argentina’, ’Brazil’]

Oise RiverBasinsCountry: [’Belgium’, ’France’]

{subject_entity} RiverBasinsCountry:


7.2.5. ChemicalCompoundElement
Water ChemicalCompoundElement: [’Hydrogen’, ’Oxygen’]

Bismuth subsalicylate ChemicalCompoundElement: [’Bismuth’]

Sodium Bicarbonate ChemicalCompoundElement: [’Hydrogen’, ’Oxygen’,
’Sodium’, ’Carbon’]

Aspirin ChemicalCompoundElement: [’Oxygen’, ’Carbon’, ’Hydrogen’]

{subject_entity} ChemicalCompoundElement:


7.2.6. PersonLanguage
Aamir Khan PersonLanguage: [’Hindi’, ’English’, ’Urdu’]

Pharrell Williams PersonLanguage: [’English’]

Xabi Alonso PersonLanguage: [’German’, ’Basque’, ’Spanish’, ’English’]

Shakira PersonLanguage: [’Catalan’, ’English’, ’Portuguese’, ’Spanish’,
’Italian’, ’French’]

{subject_entity} PersonLanguage:


7.2.7. PersonProfession
What is Danny DeVito’s profession?
[’Comedian’, ’Film Director’, ’Voice Actor’, ’Actor’, ’Film Producer’,
’Film Actor’, ’Dub Actor’, ’Activist’, ’Television Actor’]
What is David Guetta’s profession?
[’DJ’]

What is Gary Lineker’s profession?
[’Commentator’, ’Association Football Player’, ’Journalist’,
’Broadcaster’]

What is Gwyneth Paltrow’s profession?
[’Film Actor’,’Musician’]

What is {subject_entity}’s profession?


7.2.8. PersonInstrument
Liam Gallagher PersonInstrument: [’Maraca’, ’Guitar’]

Jay Park PersonInstrument: [’None’]

Axl Rose PersonInstrument: [’Guitar’, ’Piano’, ’Pander’, ’Bass’]

Neil Young PersonInstrument: [’Guitar’]

{subject_entity} PersonInstrument:


7.2.9. PersonEmployer
Where is or was Susan Wojcicki employed?
[’Google’]

Where is or was Steve Wozniak employed?
[’Apple Inc’, ’Hewlett-Packard’, ’University of Technology Sydney’, ’Atari’]

Where is or was Yukio Hatoyama employed?
[’Senshu University’,’Tokyo Institute of Technology’]

Where is or was Yahtzee Croshaw employed?
[’PC Gamer’, ’Hyper’, ’Escapist’]

Where is or was {subject_entity} employed?


7.2.10. PersonPlaceOfDeath
What is the place of death of Barack Obama?
[]
What is the place of death of Ennio Morricone?
[’Rome’]

What is the place of death of Elon Musk?
[]

What is the place of death of Prince?
[’Chanhassen’]

What is the place of death of {subject_entity}?


7.2.11. PersonCauseOfDeath
André Leon Talley PersonCauseOfDeath: [’Infarction’]

Angela Merkel PersonCauseOfDeath: [’None’]

Bob Saget PersonCauseOfDeath: [’Injury’, ’Blunt Trauma’]

Jamal Khashoggi PersonCauseOfDeath: [’Murder’]

{subject_entity} PersonCauseOfDeath:


7.2.12. CompanyParentOrganization
Microsoft CompanyParentOrganization: [’None’]

Sony CompanyParentOrganization: [’Sony Group’]

Saab CompanyParentOrganization: [’Saab Group’, ’Saab-Scania’,
’Spyker N.V.’, ’National Electric Vehicle Sweden’, ’General Motors’]

Max Motors CompanyParentOrganization: [’None’]

{subject_entity} CompanyParentOrganization:


7.3. Failure cases
Here, we list three failure examples for each relation for the Davinci model. A comprehensive list
of failure cases can be found under https://github.com/HEmile/iswc-challenge/tree/main/failure_cases.

7.3.1. CountryBordersWithCountry
SubjectEntity: Bahrain
Ground Truth: [’iran’, ’saudi arabia’]
GPT-3 Prediction: [’qatar’, ’saudi arabia’, ’united arab emirates’]

SubjectEntity: Barbados
Ground Truth: []
GPT-3 Prediction: [’trinidad and tobago’]

SubjectEntity: Cuba
Ground Truth: [’united states of america’, ’usa’]
GPT-3 Prediction: [’bahamas’, ’haiti’, ’jamaica’,
’turks and caicos islands’, ’united states’]


7.3.2. CountryOfficialLanguage
SubjectEntity: Afghanistan
Ground Truth: [’arabic’, ’baluchi’, ’dari’, ’nuristani’, ’pamir’,
’pashayi’, ’pashto’, ’turkmen’, ’uzbek’]
GPT-3 Prediction: [’dari’, ’pashto’]

SubjectEntity: Botswana
Ground Truth: [’english’]
GPT-3 Prediction: [’setswana’]

SubjectEntity: Zimbabwe
Ground Truth: [’barwe’, ’chewa’, ’english’, ’kalanga’, ’khoisan’,
’nambya’, ’ndau’, ’ndebele’, ’northern ndebele’, ’sesotho’, ’shona’,
’tonga’, ’tsonga’, ’tswana’, ’venda’, ’xhosa’]
GPT-3 Prediction: [’chewa’, ’english’, ’ndebele’, ’shangaan’, ’shona’,
’sotho’, ’tonga’, ’venda’]


7.3.3. StateSharesBorderState
SubjectEntity: Andalusia
Ground Truth: [’beja’, ’castile-la mancha’, ’extremadura’, ’faro’,
’gibraltar’, ’murcia’, ’region of murcia’]
GPT-3 Prediction: [’castilla-la mancha’, ’ceuta’, ’extremadura’,
’melilla’, ’murcia’]

SubjectEntity: Obwalden
Ground Truth: [’canton of bern’, ’canton of lucerne’, ’lucerne’,
’nidwalden’, ’schwyz’, ’uri’]
GPT-3 Prediction: [’berne’, ’uri’, ’valais’]

SubjectEntity: Zagreb
Ground Truth: [’krapina-zagorje’, ’krapina-zagorje county’,
’zagreb’, ’zagreb county’]
GPT-3 Prediction: [’bosnia and herzegovina’, ’serbia’, ’slovenia’]


7.3.4. RiverBasinsCountry
SubjectEntity: Didi Liakhvi river
Ground Truth: [’georgia’]
GPT-3 Prediction: [’georgia’, ’russia’]

SubjectEntity: Dniester
Ground Truth: [’moldova’, ’ukraine’]
GPT-3 Prediction: [’moldova’, ’romania’, ’ukraine’]

SubjectEntity: San
Ground Truth: [’poland’, ’ukraine’]
GPT-3 Prediction: [’mexico’]


7.3.5. ChemicalCompoundElement
SubjectEntity: Aluminum Iodide
Ground Truth: [’aluminium’, ’iodine’]
GPT-3 Prediction: [’aluminum’, ’iodine’]

SubjectEntity: Atracurium Besylate
Ground Truth: [’carbon’, ’hydrogen’, ’nitrogen’, ’oxygen’, ’sulfur’]
GPT-3 Prediction: [’atracurium’, ’hydrogen’, ’oxygen’]

SubjectEntity: Flavin Adenine Dinucleotide
Ground Truth: [’carbon’, ’hydrogen’, ’nitrogen’, ’oxygen’, ’phosphorus’]
GPT-3 Prediction: [’adenine’, ’dinucleotide’, ’flavin’]


7.3.6. PersonLanguage
SubjectEntity: A. R. Rahman
Ground Truth: [’hindi’, ’tamil’, ’telugu’]
GPT-3 Prediction: [’bengali’, ’hindi’, ’kannada’, ’malayalam’,
’marathi’, ’tamil’, ’telugu’]

SubjectEntity: Akon
Ground Truth: [’english’]
GPT-3 Prediction: [’english’, ’french’, ’wolof’]

SubjectEntity: Genelia D’Souza
Ground Truth: [’hindi’, ’malayalam’, ’tamil’, ’telugu’]
GPT-3 Prediction: [’bengali’, ’english’, ’hindi’, ’marathi’, ’telugu’]


7.3.7. PersonProfession
SubjectEntity: Elon Musk
Ground Truth: [’engineer’, ’entrepreneur’, ’inventor’, ’programmer’]
GPT-3 Prediction: [’businessperson’, ’engineer’, ’inventor’,
’investor’, ’philanthropist’, ’product designer’]

SubjectEntity: J. K. Rowling
Ground Truth: [’author’, "children’s writer", ’executive producer’,
’film producer’, ’novelist’, ’producer’, ’producer’, ’screenwriter’,
’writer’, ’writer’]
GPT-3 Prediction: [’author’, ’film producer’, ’philanthropist’,
’screenwriter’]

SubjectEntity: Lady Gaga
Ground Truth: [’singer’, ’songwriter’]
GPT-3 Prediction: [’actress’, ’dancer’, ’philanthropist’,
’record producer’, ’singer’, ’songwriter’]


7.3.8. PersonInstrument
SubjectEntity: Ronan Keating
Ground Truth: []
GPT-3 Prediction: [’piano’]

SubjectEntity: Taeyang
Ground Truth: []
GPT-3 Prediction: [’guitar’]

SubjectEntity: Chester Bennington
Ground Truth: []
GPT-3 Prediction: [’guitar’, ’piano’]


7.3.9. PersonEmployer
SubjectEntity: Kent Beck
Ground Truth: [’meta platforms’]
GPT-3 Prediction: [’facebook’, ’three rivers institute’]

SubjectEntity: Serena Williams
Ground Truth: [’unicef’]
GPT-3 Prediction: [’tennis’]
SubjectEntity: Guido van Rossum
Ground Truth: [’microsoft’]
GPT-3 Prediction: [’dropbox’, ’google’]


7.3.10. PersonPlaceOfDeath
SubjectEntity: Avicii
Ground Truth: [’muscat’]
GPT-3 Prediction: [’muscat, oman’]

SubjectEntity: John Coltrane
Ground Truth: [’huntington’]
GPT-3 Prediction: [’new york city’]

SubjectEntity: Rachel Caine
Ground Truth: [’texas’]
GPT-3 Prediction: [’’]


7.3.11. PersonCauseOfDeath
SubjectEntity: Ahmed Zewail
Ground Truth: [’lymphoma’, ’spinal cord lymphoma’]
GPT-3 Prediction: [’cancer’]

SubjectEntity: Avicii
Ground Truth: [’exsanguination’]
GPT-3 Prediction: [’suicide’]

SubjectEntity: Ennio Morricone
Ground Truth: [’femoral fracture’, ’fracture’]
GPT-3 Prediction: [’’]


7.3.12. CompanyParentOrganization
SubjectEntity: Aston Martin lagonda
Ground Truth: []
GPT-3 Prediction: [’aston martin lagonda global holdings plc’]

SubjectEntity: Austro-Daimler
Ground Truth: []
GPT-3 Prediction: [’daimler ag’]

SubjectEntity: Hyundai Motor Company
Ground Truth: [’hyundai’]
GPT-3 Prediction: [’hyundai motor group’]
7.4. Language Model Size
Table 5 shows the values of the scaling experiments. These values were used to produce Figure
1.

Table 5
Precision, Recall and F1-score for the GPT-3 models with varying embedding dimensions. Best scores
are in bold.
        Method                              Precision       Recall          F1-score
        Baseline (BERT)                     0.175           0.129           0.140
        Ada                                 0.180           0.194           0.161
        Babbage                             0.325           0.263           0.269
        Curie                               0.378           0.375           0.343
        Davinci                             0.707           0.694           0.677
Figure 2: The number of answers per relation type for the training set provided by the organisers.
Figure 3: The number of answers per relation type for the development set provided by the organisers.