<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prompting as Probing: Using Language Models for Knowledge Base Construction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dimitrios Alivanistos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Selene Báez Santamaría</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Cochez</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan-Christoph Kalo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emile van Krieken</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thiviyan Thanapalasingam</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DReaMS Lab</institution>
          ,
          <addr-line>Huawei</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Discovery Lab</institution>
          ,
          <addr-line>Elsevier</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vrije Universiteit Amsterdam</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Language Models (LMs) have proven to be useful in various downstream applications, such as summarisation, translation, question answering and text classification. LMs are becoming increasingly important tools in Artificial Intelligence, because of the vast quantity of information they can store. In this work, we present ProP (Prompting as Probing), which utilizes GPT-3, a large Language Model originally proposed by OpenAI in 2020, to perform the task of Knowledge Base Construction (KBC). ProP implements a multi-step approach that combines a variety of prompting techniques to achieve this. Our results show that manual prompt curation is essential, that the LM must be encouraged to give answer sets of variable lengths, in particular including empty answer sets, that true/false questions are a useful device to increase precision on suggestions generated by the LM, that the size of the LM is a crucial factor, and that a dictionary of entity aliases improves the LM score. Our evaluation study indicates that these proposed techniques can substantially enhance the quality of the final predictions: ProP won track 2 of the LM-KBC competition, outperforming the baseline by 36.4 percentage points. Our implementation is available on https://github.com/HEmile/iswc-challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Language Models (LMs) have been at the center of attention, presented as a recent success story
of Artificial Intelligence. LMs have shown great promise across a wide range of domains in a
variety of different tasks, such as Text Classification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Financial Sentiment Analysis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and
Protein Binding Site Prediction [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In recent years, prompt engineering for LMs has become a
research field in itself, with a plethora of papers working on LM understanding (e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
      <p>
        Natural Language Processing (NLP) researchers have recently investigated whether LMs
could potentially be used as Knowledge Bases, by querying for particular information. In Petroni
et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the LAMA dataset for probing relational facts from Wikidata in LMs was presented.
The authors show that the masked LM BERT can complete Wikidata facts with a precision of
around 32%. Several follow-up papers have pushed this number to almost 50% [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. While the
prediction quality on LAMA is promising, others have argued that LMs should not be used as
knowledge graphs, but rather to support the augmentation and curation of KGs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In this paper, we describe ProP, the system we implemented for the “Knowledge Base
Construction from Pre-trained Language Models” (LM-KBC) challenge at ISWC 2022 (https://lm-kbc.github.io/). Our submission is in the open track (Track 2), in which LMs of any size may be used. The task is to predict
possible objects of a triple, where the subject and relation are given. For example, given the
string "Ronnie James Dio" as the subject and the relation PersonInstrument, an LM needs to
predict the answers "bass guitar" or "guitar", and "trumpet", as the objects of the triple. In
contrast to the LAMA dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the LM-KBC challenge dataset contains questions for which
there are no answers or for which multiple answers are valid. Bakel et al. argue for precision and
recall metrics over the common ranking-based metrics for query answering [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Since the
LM-KBC dataset requires predicting a set of answers and in some cases even empty answer sets,
we choose precision and recall metrics to evaluate our system.
      </p>
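<p>Concretely, the set-based scoring can be sketched as follows. This is a minimal Python sketch under our own conventions (e.g. treating two empty sets as a perfect prediction); the official LM-KBC scorer may differ in details.</p>
<p>
```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one subject-relation pair.

    Convention (our assumption): if both the predicted and the gold answer
    sets are empty, the prediction is counted as perfect."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0, 1.0, 1.0
    hits = len(predicted.intersection(gold))
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def macro_f1(rows):
    """Macro-average the per-pair F1 over all (predicted, gold) pairs."""
    scores = [precision_recall_f1(pred, gold)[2] for pred, gold in rows]
    return sum(scores) / len(scores)
```
</p>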
      <p>
        ProP uses the GPT-3 model, a large LM proposed by OpenAI in 2020 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. For each relation
type, we engineered prompts that, when given to the language model, probe it to respond with
the set of objects for that relation. The components of ProP can be divided into two categories:
prompt generation focuses on generating the correct prompts that yield the desired answers
for the questions in the LM-KBC dataset, and post-processing aims to enhance the quality of
the predictions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The development of large pre-trained Language Models (LMs) has led to substantial
improvements in NLP research. It was shown that extensive pre-training on large text corpora encodes
large amounts of linguistic and factual knowledge into a language model that can help to
improve the performance on various downstream tasks (see [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for a recent overview).
      </p>
      <p>
        LM as KG: Petroni et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and later others [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], asked the question to what extent language
models can replace or support the creation and curation of knowledge graphs. Petroni et al.
proposed the LAMA dataset for probing relational knowledge in language models by using
masked language models to complete cloze-style sentences. As an example, the language model
BERT can complete the sentence “Paris is the capital of [MASK]" with the word “France". In
this case, it is assumed that the model knows about, or can predict, the triple (Paris, capitalOf,
France).
      </p>
      <p>
        While the original paper relied on manually designed prompts for probing the language
model, various follow-up works have shown the superiority of automatically learning prompts.
Methods can mine prompts from large text corpora and pick the best prompts by applying
them to a training dataset as demonstrated in [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. Prompts can also be directly learned via
backpropagation: BERTESE [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and AutoPrompt [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] show how prompts can be learned to
improve the performance on LAMA. The probing performance can be pushed even further by
either directly learning continuous embeddings for the prompts [
        <xref ref-type="bibr" rid="ref16 ref7">7, 16</xref>
        ] or by directly fine-tuning
the LM on the training data [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Similar to our work, FewShot-LAMA [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] evaluates few-shot learning
on the original LAMA dataset. The authors show that combining few-shot
examples with learned prompts achieves the best probing results.
      </p>
      <p>
        Since the publication of the LAMA dataset, a large variety of other probing datasets for factual
knowledge in LMs have been created. LAMA-UHN is a more difficult version of LAMA [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
TimeLAMA adds a time component to facts that the model is probed for [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Furthermore,
BioLAMA [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and MedLAMA [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] are domain-specific probing datasets for biological and
medical knowledge.
      </p>
      <p>Most existing approaches have in common that they only probe the language models for
entities with a label consisting of a single token from the language model’s vocabulary. Thus,
the prediction of long, complex entity names is mostly unexplored. Furthermore, most existing
works have only asked language models to complete a triple with a single prediction, even
though some triples might actually allow for multiple possible predictions. Both these aspects
substantially change the challenge of knowledge graph construction.</p>
      <p>
        LM for KG: While probing language models has been heavily studied in the NLP community,
the idea of using language models to support knowledge graph curation has not been sufficiently
studied [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Some works have shown how LMs in combination with knowledge graphs can
be used to complete query results [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Other works have looked into how to use language
models to identify errors in knowledge graphs [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], or have studied how to weight KG triples
from ConceptNet with language models to measure semantic similarity. Biswas et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] have
shown that language models can be used to perform entity typing by predicting the class using
language models.
      </p>
      <p>
        KG-BERT is a system that is most similar to what is required for the KBC Challenge [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. A
standard BERT model is trained on serialized knowledge graph data to perform link prediction
on standard link prediction benchmark datasets. KG-BERT’s performance on link prediction is
comparable to many state-of-the-art systems that use knowledge graph embeddings for this
task.
      </p>
      <p>Similar tasks: This KBC Challenge task is similar to factual Q&amp;A with language models,
where the goal is to respond to questions that fall outside the scope of a knowledge base. The
shared task differs in that the responses need to include 0 to k answers. Moreover, in the shared
task, the factual questions are generated from triples, thus including variation in how a triple
might be phrased as a question.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The LM-KBC Challenge</title>
      <p>The LM-KBC dataset contains triples of the form (s, r, O), where s is the textual representation
of the subject, r is one of 12 different predicates, and O is the (possibly empty) set of textual
representations of object entities to predict. The subjects and objects are of various types.
After learning from a training set of such triples, given a new subject and one of the known
relations, the task is to predict the complete set of objects.</p>
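<p>To make the input format concrete, a single record can be pictured as the jsonl line below. The field names are illustrative assumptions for exposition, not taken from the official release files.</p>
<p>
```python
import json

# One record of the (assumed) jsonl layout: a subject string, one of the
# 12 relations, and the possibly empty list of gold object labels.
line = ('{"SubjectEntity": "Ronnie James Dio", '
        '"Relation": "PersonInstrument", '
        '"ObjectEntities": ["bass guitar", "guitar", "trumpet"]}')
row = json.loads(line)
subject, relation, objects = (row["SubjectEntity"],
                              row["Relation"],
                              row["ObjectEntities"])
```
</p>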
      <sec id="sec-3-1">
        <title>3.1. The LM-KBC Dataset</title>
        <p>For each of the 12 relations, the number of unique subject-entities in the train, dev, and test sets
are 100, 50, and 50 respectively. We include detailed distributions of the cardinality (the number
of object-entities) for each relation type (Appendix, Figures 2 and 3). Table 4 in the Appendix
shows the aggregated statistics about the number of object-entities, as well as the number of
alternative labels per object-entities in the development set. Certain relation types have a much
higher average cardinality (e.g. PersonProfession=7.42 or StateSharesBorderState=5.6) than others
(e.g. CompanyParentOrganization=0.32, PersonPlaceOfDeath=0.50). We also note that only five
of the relations allow for empty answer sets. For example, relations associated with a person’s
death (PersonPlaceOfDeath and PersonCauseOfDeath) often contain empty answer sets, because
many persons in the dataset are still alive. In these cases, a model needs to be able to predict
empty answer sets correctly.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. The Baseline</title>
        <p>
          The baseline model is a masking-based approach that uses the popular BERT model [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] in
three variants (base-cased, large-cased, and RoBERTa [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]). The BERT model is tasked with doing
prompt completion to predict object-entities for a given subject-relation pair. The prompts used
by the baseline approach have been customised for the different relation types.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Our Method</title>
      <p>
        Previous studies have shown that prompt formulation plays a critical role in the performance
of LMs on downstream applications [
        <xref ref-type="bibr" rid="ref29 ref4">29, 4</xref>
        ] and this also applies to our work. We investigated
different prompting approaches using few-shot learning with GPT-3 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] via OpenAI's API². In this section, we first describe how we generate the prompts (prompt generation phase),
and then how we use the different components of our pipeline to further refine the
prompts for enhanced performance (post-processing phase).
      </p>
      <sec id="sec-4-1">
        <title>4.1. Prompt Generation</title>
        <p>For each relation in the dataset, we manually curate a set of prompt templates consisting of
four representative examples³ selected from the training split. We use these relation-specific
templates to generate prompts for every subject entity by replacing a placeholder token in the
final line of the template with the subject entity of interest. We task GPT-3 with completing the
prompts, evaluate the completions, and compute the macro-precision, macro-recall and
macro-F1 score for each relation.</p>
        <p>²https://openai.com/api (temperature=0, max_tokens=100, top_p=1, frequency_penalty=0, presence_penalty=0,
logprobs=1)</p>
        <p>
          ³This is an arbitrary number of training examples. Since few-shot learners are efficient at learning from a
handful of examples [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], including an excessive number of training examples may not necessarily lead to
improved precision or recall. Therefore, we did not study the effects of varying the number of training examples in
the prompts.
        </p>
        <p>We ensured that the training examples for few-shot learning included all of the following: (i)
questions with answer sets of variable lengths to inform the LM that it can generate multiple
answers; (ii) questions with empty answer sets to ensure that the LM returns nothing when
there is no correct answer to a given question; (iii) a fixed question-answer order, where we
provide the question and then immediately the answer so that the LM learns the style of the
format, and (iv) the answer set formatted as a list to ensure we can efficiently parse the answer
set from the LM. We did not study the order of these examples, but hypothesize that this is not
hugely important to GPT-3 as it can handle long-range dependencies well.</p>
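<p>The construction above can be sketched as follows. This is a simplified sketch (three in-context examples instead of the four we actually use, shown for the CountryBordersWithCountry relation); the helper names are our own.</p>
<p>
```python
import ast

# In-context examples cover variable-length and empty answer lists, in a
# fixed question-then-answer order, with answers formatted as Python lists.
EXAMPLES = [
    ("Which countries neighbour Dominica?", ["Venezuela"]),
    ("Which countries neighbour North Korea?", ["South Korea", "China", "Russia"]),
    ("Which countries neighbour Fiji?", []),  # empty answer set
]
QUERY = "Which countries neighbour {subject_entity}?"


def build_prompt(subject_entity):
    """Fill the placeholder in the final line of the template."""
    lines = []
    for question, answers in EXAMPLES:
        lines.append(question)
        lines.append(str(answers))
    lines.append(QUERY.format(subject_entity=subject_entity))
    return "\n".join(lines)


def parse_completion(text):
    """Parse the list-formatted completion, e.g. the string ['Hungary']."""
    try:
        answers = ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        return []
    return [str(a) for a in answers]
```
</p>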
        <p>We formulate the questions either in natural language or in the form of a triple. We
hand-designed the natural language questions and did not compare them in a structured manner
to alternatives. However, we tried out several variations in the OpenAI GPT-3 playground to get
an intuition of what style of questions is effective; we found that shorter, simpler questions
usually work best. We include the prompt templates used in Section 7.2 of the Appendix. In our work,
we investigate the use of both prompting styles and compare natural language prompts with
triple-based prompts for the different relations.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Empty Answer sets</title>
        <p>There are some questions in the dataset for which the correct answer is an empty answer set.
For instance, there are no countries that share borders with Fiji. The empty answer set can be
represented as either an empty list [], or as a string within a list [‘None’]. We have observed
(see Table 2) that the way empty sets are represented affects the precision and recall of our
approach. Allowing the explicit answer ‘None’ encourages the LM to exclude answers that it is
uncertain about.</p>
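<p>When scoring, both representations must be collapsed to the same empty set. A minimal sketch of this normalisation (the per-relation choice of representation is made beforehand):</p>
<p>
```python
def normalize_answers(answers):
    """Collapse both empty-set encodings, [] and ['None'], to an empty set
    before scoring, and strip surrounding whitespace from each answer."""
    cleaned = {answer.strip() for answer in answers}
    cleaned.discard("None")
    return cleaned
```
</p>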
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Fact probing</title>
        <p>
          Our initial results indicated that the recall of our approach was high, but the precision for
certain relations was low. Therefore, we add a post-processing step called fact probing to ProP’s
pipeline. In this step, we ask the LM whether each completion it proposed in
the previous step is correct. Inspired by maieutic prompting [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], we create a simple
prompt, where we translate each predicted completion into a natural language fact. Then we
ask the LM to predict whether this fact is true or false. One example of a fact-probing prompt is
Niger neighbours Libya TRUE for the CountryBordersWithCountry relation. We ensure that the
LM only predicts either TRUE or FALSE by adding a true and a false example to the prompt.
        </p>
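<p>The filter described above can be sketched as follows. Here `ask_lm` is a stand-in for the GPT-3 call, and `to_fact` turns a (subject, candidate) pair into a natural-language statement; both helper names are our own.</p>
<p>
```python
# One true and one false example pin the model to the TRUE/FALSE format.
def probe_prompt(fact):
    return "\n".join([
        "Niger neighbours Libya TRUE",
        "Niger neighbours Spain FALSE",
        fact,
    ])


def filter_candidates(subject, candidates, to_fact, ask_lm):
    """Keep only the candidate objects that the LM judges TRUE."""
    kept = []
    for candidate in candidates:
        verdict = ask_lm(probe_prompt(to_fact(subject, candidate)))
        if verdict.strip() == "TRUE":
            kept.append(candidate)
    return kept
```
</p>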
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this Section, we analyse the ProP pipeline we built for generating prompts and evaluate the
contribution of each component. We explain how we combine the best-performing components
to yield a prediction that obtains a high macro F1-score on the test split.</p>
      <sec id="sec-5-1">
        <title>5.1. Prompt-Fine Tuning</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Natural Language vs Triple-based Prompts</title>
          <p>Table 1 shows the quality of the predictions for the natural language prompts and triple-based
prompts. We note that on F1, the performance between these two prompt styles is mixed, with
F1 being higher on the triple-style prompts in only seven out of twelve cases. Unpacking the
F1 into recall and precision shows that the triple-style prompts yield higher precision, while
natural language prompts yield higher recall. Overall, the triple-style prompts do yield a higher
F1 when averaged over each relation. Our intuition is that natural language prompts contain
certain words that negatively influence the precision of the predictions. It is, however, difficult to
study this systematically, as the space of possible word combinations in the prompts is very large.
Triple-based prompts circumvent this problem because they contain only the terms needed to
predict the object entities: the subject entity and the relation.
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Empty vs None</title>
          <p>As explained in Section 3.1, five out of twelve relations allow for empty answer sets. We
experiment with the different ways to represent such empty answer sets, and Table 2 shows
the results. Three relations get a performance boost when prompted with ‘NONE’
(CompanyParentOrganization, PersonCauseOfDeath, PersonInstrument), while the other two relations
perform better when using empty lists (CountryBordersWithCountry, PersonPlaceOfDeath). For
subsequent experiments, we modified the prompt of each relation to use the best-performing
representation.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.1.3. Language Model Size</title>
          <p>Usually, the size of LMs is measured by the number of learnable parameters in the model.
However, the OpenAI API does not quantify the number of total parameters but only the size of
the embedding dimensions for the tokens⁴. We assume there is a positive correlation between
the token dimension size and the total number of GPT-3 parameters. Figure 1 shows our results:
as the language model size increases, the F1-score also increases. This
suggests that a larger LM gives better performance on KBC.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Post-Processing Predictions</title>
        <p>Up to this point, we have discussed how we generate the optimal prompts for the different
relation types. Once the LM produces the completions using these optimal prompt techniques,
we can employ two additional steps to enhance the precision and recall of our predictions. Table
3 shows the results of including fact probing and entity aliases in our system.</p>
        <sec id="sec-5-2-1">
          <title>5.2.1. Fact probing</title>
          <p>We found that fact probing has a different impact on different relation types. This difference could
stem from the cardinalities of the relation types. For example, the relation PersonPlaceOfDeath,
which has only one correct answer, should show a larger improvement than StateSharesBorderState,
which has a higher cardinality. We found that fact probing helped to boost the predictions
of five relations (CompanyParentOrganization, CountryOfficialLanguage, PersonCauseOfDeath,
PersonInstrument, PersonLanguage). We only apply fact probing to these relations. On the dev
set, the precision of fact probing is 0.737, and the precision among the predictions removed by fact
probing is 0.608. That is, in 60.8% of the cases where fact probing filtered a prediction, it correctly
removed a prediction that was not in the ground truth set.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Entity aliases</title>
          <p>As we discussed in Section 3.1, the predictions from the language model are sometimes correct,
but not according to what the gold standard expects. Whether this is problematic depends on
the final use of the predictive model. In interactive use, this would not be an issue because
the user will be able to disambiguate. For actual KBC, the system will have to disambiguate
what exact entity it predicted. Here, however, we only check whether the text generated by the
model corresponds to one of the gold standard alternatives.</p>
          <p>While experimenting, we noticed that in the training and development datasets the names of
entities often correspond with the labels of entities in Wikidata (https://www.wikidata.org). On Wikidata, these entities
also have aliases, and we wanted to know whether we could improve our system by looking up
the aliases on Wikidata. This lookup does not use language models, so it is not included as part of
the ProP pipeline, as including it would violate the terms of the LM-KBC challenge. Instead, we perform
it as an ablation study.</p>
          <p>The alias-fetcher works as follows. First, we extract a set of types which could be relevant
for the specific relation types. For example, country (Q6256) is relevant for RiverBasinsCountry.
Then, for each relation type, we extract all correctly typed entities, their aliases, and claim
count. Then, we take the prediction of the LM, and check whether there is an entity with that
label for that relation. If so, we retain the prediction.</p>
          <p>Otherwise, we check whether the prediction is equal to any alias. There could be multiple
entities for which this is the case. Therefore, we pick the label of the entity with the most claims
on Wikidata. The assumption is that an ‘interesting’ entity is more likely to be the intended answer,
and that such entities have more claims on Wikidata.</p>
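<p>The lookup described above can be sketched as follows. The record layout (label, aliases, claim count) is our assumption about the table pre-extracted from Wikidata per relation type.</p>
<p>
```python
def resolve(prediction, entities):
    """Map a raw LM prediction to a canonical entity label, if possible."""
    labels = {e["label"] for e in entities}
    if prediction in labels:
        return prediction                 # known label: keep the prediction
    matches = [e for e in entities if prediction in e["aliases"]]
    if matches:
        # Several entities can share an alias; prefer the one with the most
        # claims on Wikidata, i.e. the more "interesting" entity.
        return max(matches, key=lambda e: e["claim_count"])["label"]
    return prediction                     # unknown string: leave unchanged
```
</p>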
          <p>We observe that for four relation types, the changes in the scores are insignificant. For the
eight other relations, we see that the F1 score goes up slightly. Overall, this results in an average
improvement of the F1 score by 0.014 (Table 3) on the development set. On the test set, we
notice a similar improvement. This experiment is not extensive enough to draw definitive
conclusions, but it suggests that structured data can usefully augment the predictions of an
LM.</p>
        </sec>
        <sec id="sec-5-2-3">
          <title>5.2.3. Contemporaneity of LMs</title>
          <p>We found that questions regarding recent events, particularly those that occurred after 2020,
did not yield good predictions by GPT-3 (see 7.3 in the Appendix). This is in line with related
findings around LMs and was confirmed by OpenAI (https://beta.openai.com/docs/guides/embeddings/limitations-risks). Two examples of this are Facebook,
Inc. changing its name to Meta Platforms, Inc. (in October 2021), and the country of Swaziland
changing its name to Eswatini (in 2018). We also observed similar problems with several instances
from the following relations: PersonProfession, PersonCauseOfDeath, and PersonPlaceOfDeath.
It is worth noting that what matters is not when the model was trained, but whether its training
data contains up-to-date information.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Future Work</title>
        <p>
          A natural continuation of this work revolves around improving the individual steps in our
pipeline (e.g. fact probing), which would directly improve the overall
macro-F1 score of our approach. Additionally, we could experiment with inverting our pipeline
and allow the LM to generate the best prompts by providing the ground truth as input. For
example, we could explore techniques that automatically learn prompts, similar to AutoPrompt [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
and OptiPrompt [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], but ideally with a method that requires fewer resources.
        </p>
        <p>
          In terms of additional components that make use of LMs, we considered developing
meta-prompts as in [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. We think it would be interesting to study what meta-prompts can be
developed for KBC. Is there a set of specific patterns that work better than others? Finally, we
could modify our alias-fetcher to use the LM to generate well-known aliases for both entities
and relations in the training data. This approach could act as a diversification factor, and we
believe it will have more freedom in its choice of aliases.
        </p>
        <p>Data augmentation differs from prompt tuning in the following way: while prompt tuning
searches for the optimal prompt to increase the performance on a specific task, data augmentation
acts as a diversification mechanism for our existing prompts. By employing a more diverse set
of prompts we can increase our performance, especially the recall. We base our hypothesis
on the fact that knowledge is expressed diversely in the training data (e.g. ambiguity), and we
believe this should be considered when prompting an LM.</p>
        <p>Furthermore, it would be interesting to further investigate if huge language models are
required to perform knowledge graph construction and how to achieve the best prediction
performance for the lowest costs.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We introduced ProP, our "Prompting as Probing" approach to performing knowledge base
construction using a high-capacity pre-trained LM. We showed how we developed different modular
components that utilise both the LM and the data provided by the organisers to improve ProP’s
performance, such as the fact probing and alias-fetcher components. We also investigated
well-known techniques around prompt engineering and optimisation and analysed the effect
of different prompt formulations on the final performance. However, we conclude that the
parameter count of the GPT-3 models is the most significant contributor to performance. Our
ProP pipeline outperforms the baseline by 36.4 percentage points on the test split.</p>
      <p>Our approach not only obtains a high macro F1-score on the ground truth; its actual
score is likely higher, because in several cases where a result was counted as incorrect,
the ground truth was either incomplete or used aliases that refer to the same entity as our
prediction. Overall, we conclude that language models can be used to augment Knowledge
Bases, and we emphasise the difficulty of evaluating question-answering tasks where simple
string matching does not suffice.</p>
      <p>Supplemental Material Statement: Code and data are publicly available from https://github.
com/HEmile/iswc-challenge.</p>
      <sec id="sec-6-1">
        <title>Acknowledgements</title>
        <p>We thank Frank van Harmelen for his insightful comments. This research was funded by the
Vrije Universiteit Amsterdam and the Netherlands Organisation for Scientific Research (NWO)
via the Spinoza grant (SPI 63-260) awarded to Piek Vossen, the Hybrid Intelligence Centre via the
Zwaartekracht grant (024.004.022), Elsevier’s Discovery Lab, and Huawei’s DReaMS Lab.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Appendix</title>
      <sec id="sec-7-1">
        <title>7.1. Dataset statistics</title>
        <p>Here, we provide statistics about the LM-KBC dataset for the training and development splits.
The statistics of the test split are unknown, because the test split is not public. We assume that
the instances of the test split are sampled from a similar data distribution.</p>
        <sec id="sec-7-1-1">
          <title>7.1.1. Problems arising from Alternative Labels</title>
          <p>The LM-KBC challenge does not include entity linking. Instead, predicted entities are scored
against a list of their aliases in the LM-KBC dataset. However, we noticed that these lists are
often incomplete. For example, for the "National Aeronautics and Space Administration", the
extremely common and widely used abbreviation "NASA" is not included in the list of aliases.
Similarly, when the model predicts Aluminum (US and Canadian English) but the ground truth
only contains Aluminium (British English), the prediction receives a lower score. Hence, if the
model predicts Aluminum or NASA, its predictions are deemed incorrect even though they refer
to the correct entities.</p>
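<p>The failure mode above can be made concrete with a minimal sketch of alias-based scoring. This is an illustration only, not the official evaluation code: the alias list below is hypothetical, and case-insensitive exact matching is an assumption.</p>

```python
# Minimal sketch of alias-based scoring without entity linking.
# The alias list is illustrative, not the official LM-KBC data,
# and case-insensitive exact matching is an assumption.
def is_correct(prediction: str, aliases: list[str]) -> bool:
    # Exact string match against the listed aliases.
    normalized = {a.strip().lower() for a in aliases}
    return prediction.strip().lower() in normalized

aliases = ["National Aeronautics and Space Administration"]  # "NASA" missing
print(is_correct("NASA", aliases))  # False, although it denotes the same entity
print(is_correct("National Aeronautics and Space Administration", aliases))  # True
```

<p>Any prediction that uses an alias absent from the list, such as "NASA" or "Aluminum", is scored as incorrect under this scheme.</p>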
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Prompts</title>
        <p>Here, we show the templates we used to generate the prompts for the different relations. In each
template, {subject_entity} refers to the head entity for which we predict the tail entities.
The generated prompts were used for the following models: Ada, Babbage,
Curie and Davinci.</p>
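<p>As a sketch of how these templates are instantiated: the helper below is hypothetical (the paper does not show its code), but it mirrors the few-shot format of the subsections, simply substituting the {subject_entity} placeholder with a concrete head entity.</p>

```python
# Hypothetical helper: instantiate a few-shot prompt by substituting
# the {subject_entity} placeholder used throughout this appendix.
TEMPLATE = (
    "Which countries neighbour Dominica?\n"
    "['Venezuela']\n"
    "Which countries neighbour Fiji?\n"
    "[]\n"
    "Which countries neighbour {subject_entity}?"
)

def build_prompt(template: str, subject_entity: str) -> str:
    return template.replace("{subject_entity}", subject_entity)

prompt = build_prompt(TEMPLATE, "Portugal")
print(prompt.splitlines()[-1])  # Which countries neighbour Portugal?
```

<p>The completed prompt ends with the unanswered question, so the model's continuation is read off as the predicted list of tail entities.</p>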
        <sec id="sec-7-2-1">
          <title>7.2.1. CountryBordersWithCountry</title>
          <p>Which countries neighbour Dominica?
[’Venezuela’]
Which countries neighbour North Korea?
[’South Korea’, ’China’, ’Russia’]
Which countries neighbour Serbia?
[’Montenegro’, ’Kosovo’, ’Bosnia and Herzegovina’, ’Hungary’,
’Croatia’, ’Bulgaria’, ’Macedonia’, ’Albania’, ’Romania’]
Which countries neighbour Fiji?
[]
Which countries neighbour {subject_entity}?</p>
        </sec>
        <sec id="sec-7-2-2">
          <title>7.2.2. CountryOfficialLanguage</title>
          <p>Suriname CountryOfficialLanguage: [’Dutch’]
Canada CountryOfficialLanguage: [’English’, ’French’]
Singapore CountryOfficialLanguage: [’English’, ’Malay’, ’Mandarin’,
’Tamil’]
Sri Lanka CountryOfficialLanguage: [’Sinhala’, ’Tamil’]
{subject_entity} CountryOfficialLanguage:</p>
        </sec>
        <sec id="sec-7-2-3">
          <title>7.2.3. StateSharesBorderState</title>
          <p>San Marino StateSharesBorderState: [’San Leo’, ’Acquaviva’,
’Borgo Maggiore’, ’Chiesanuova’, ’Fiorentino’]
Whales StateSharesBorderState: [’England’]
Liguria StateSharesBorderState: [’Tuscany’, ’Auvergne-Rhoone-Alpes’,
’Piedmont’, ’Emilia-Romagna’]
Mecklenberg-Western Pomerania StateSharesBorderState: [’Brandenburg’,
’Pomeranian’, ’Schleswig-Holstein’, ’Lower Saxony’]
{subject_entity} StateSharesBorderState:</p>
        </sec>
        <sec id="sec-7-2-4">
          <title>7.2.4. RiverBasinsCountry</title>
          <p>Drava RiverBasinsCountry: [’Hungary’, ’Italy’, ’Austria’,
’Slovenia’, ’Croatia’]
Huai river RiverBasinsCountry: [’China’]
Paraná river RiverBasinsCountry: [’Bolivia’, ’Paraguay’,
’Argentina’, ’Brazil’]
Oise RiverBasinsCountry: [’Belgium’, ’France’]
{subject_entity} RiverBasinsCountry:</p>
        </sec>
        <sec id="sec-7-2-5">
          <title>7.2.5. ChemicalCompoundElement</title>
          <p>Water ChemicalCompoundElement: [’Hydrogen’, ’Oxygen’]
Bismuth subsalicylate ChemicalCompoundElement: [’Bismuth’]
Sodium Bicarbonate ChemicalCompoundElement: [’Hydrogen’, ’Oxygen’,
’Sodium’, ’Carbon’]
Aspirin ChemicalCompoundElement: [’Oxygen’, ’Carbon’, ’Hydrogen’]
{subject_entity} ChemicalCompoundElement:</p>
        </sec>
        <sec id="sec-7-2-6">
          <title>7.2.6. PersonLanguage</title>
          <p>Aamir Khan PersonLanguage: [’Hindi’, ’English’, ’Urdu’]
Pharrell Williams PersonLanguage: [’English’]
Xabi Alonso PersonLanguage: [’German’, ’Basque’, ’Spanish’, ’English’]
Shakira PersonLanguage: [’Catalan’, ’English’, ’Portuguese’, ’Spanish’,
’Italian’, ’French’]
{subject_entity} PersonLanguage:</p>
        </sec>
        <sec id="sec-7-2-7">
          <title>7.2.7. PersonProfession</title>
          <p>What is Danny DeVito’s profession?
[’Comedian’, ’Film Director’, ’Voice Actor’, ’Actor’, ’Film Producer’,
’Film Actor’, ’Dub Actor’, ’Activist’, ’Television Actor’]
What is David Guetta’s profession?
[’DJ’]
What is Gary Lineker’s profession?
[’Commentator’, ’Association Football Player’, ’Journalist’,
’Broadcaster’]
What is Gwyneth Paltrow’s profession?
[’Film Actor’,’Musician’]
What is {subject_entity}’s profession?</p>
        </sec>
        <sec id="sec-7-2-8">
          <title>7.2.8. PersonInstrument</title>
          <p>Liam Gallagher PersonInstrument: [’Maraca’, ’Guitar’]
Jay Park PersonInstrument: [’None’]
Axl Rose PersonInstrument: [’Guitar’, ’Piano’, ’Pander’, ’Bass’]
Neil Young PersonInstrument: [’Guitar’]
{subject_entity} PersonInstrument:</p>
        </sec>
        <sec id="sec-7-2-9">
          <title>7.2.9. PersonEmployer</title>
          <p>Where is or was Susan Wojcicki employed?
[’Google’]
Where is or was Steve Wozniak employed?
[’Apple Inc’, ’Hewlett-Packard’, ’University of Technology Sydney’, ’Atari’]
Where is or was Yukio Hatoyama employed?
[’Senshu University’,’Tokyo Institute of Technology’]
Where is or was Yahtzee Croshaw employed?
[’PC Gamer’, ’Hyper’, ’Escapist’]
Where is or was {subject_entity} employed?</p>
        </sec>
        <sec id="sec-7-2-10">
          <title>7.2.10. PersonPlaceOfDeath</title>
          <p>What is the place of death of Barack Obama?
[]
What is the place of death of Ennio Morricone?
[’Rome’]
What is the place of death of Elon Musk?
[]
What is the place of death of Prince?
[’Chanhassen’]
What is the place of death of {subject_entity}?</p>
        </sec>
        <sec id="sec-7-2-11">
          <title>7.2.11. PersonCauseOfDeath</title>
          <p>André Leon Talley PersonCauseOfDeath: [’Infarction’]
Angela Merkel PersonCauseOfDeath: [’None’]
Bob Saget PersonCauseOfDeath: [’Injury’, ’Blunt Trauma’]
Jamal Khashoggi PersonCauseOfDeath: [’Murder’]
{subject_entity} PersonCauseOfDeath:</p>
        </sec>
        <sec id="sec-7-2-12">
          <title>7.2.12. CompanyParentOrganization</title>
          <p>Microsoft CompanyParentOrganization: [’None’]
Sony CompanyParentOrganization: [’Sony Group’]
Saab CompanyParentOrganization: [’Saab Group’, ’Saab-Scania’,
’Spyker N.V.’, ’National Electric Vehicle Sweden’, ’General Motors’]
Max Motors CompanyParentOrganization: [’None’]
{subject_entity} CompanyParentOrganization:</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Failure cases</title>
        <p>Here, we list three failure examples for each relation for the Davinci model. A comprehensive list
of failure cases can be found under https://github.com/HEmile/iswc-challenge/tree/main/failure_cases.</p>
        <sec id="sec-7-3-1">
          <title>7.3.1. CountryBordersWithCountry</title>
          <p>SubjectEntity: Bahrain
Ground Truth: [’iran’, ’saudi arabia’]
GPT-3 Prediction: [’qatar’, ’saudi arabia’, ’united arab emirates’]
SubjectEntity: Barbados
Ground Truth: []
GPT-3 Prediction: [’trinidad and tobago’]
SubjectEntity: Cuba
Ground Truth: [’united states of america’, ’usa’]
GPT-3 Prediction: [’bahamas’, ’haiti’, ’jamaica’,
’turks and caicos islands’, ’united states’]</p>
        </sec>
        <sec id="sec-7-3-2">
          <title>7.3.2. CountryOfficialLanguage</title>
          <p>SubjectEntity: Afghanistan
Ground Truth: [’arabic’, ’baluchi’, ’dari’, ’nuristani’, ’pamir’,
’pashayi’, ’pashto’, ’turkmen’, ’uzbek’]
GPT-3 Prediction: [’dari’, ’pashto’]
SubjectEntity: Botswana
Ground Truth: [’english’]
GPT-3 Prediction: [’setswana’]
SubjectEntity: Zimbabwe
Ground Truth: [’barwe’, ’chewa’, ’english’, ’kalanga’, ’khoisan’,
’nambya’, ’ndau’, ’ndebele’, ’northern ndebele’, ’sesotho’, ’shona’,
’tonga’, ’tsonga’, ’tswana’, ’venda’, ’xhosa’]
GPT-3 Prediction: [’chewa’, ’english’, ’ndebele’, ’shangaan’, ’shona’,
’sotho’, ’tonga’, ’venda’]</p>
        </sec>
        <sec id="sec-7-3-3">
          <title>7.3.3. StateSharesBorderState</title>
          <p>SubjectEntity: Andalusia
Ground Truth: [’beja’, ’castile-la mancha’, ’extremadura’, ’faro’,
’gibraltar’, ’murcia’, ’region of murcia’]
GPT-3 Prediction: [’castilla-la mancha’, ’ceuta’, ’extremadura’,
’melilla’, ’murcia’]
SubjectEntity: Obwalden
Ground Truth: [’canton of bern’, ’canton of lucerne’, ’lucerne’,
’nidwalden’, ’schwyz’, ’uri’]
GPT-3 Prediction: [’berne’, ’uri’, ’valais’]
SubjectEntity: Zagreb
Ground Truth: [’krapina-zagorje’, ’krapina-zagorje county’,
’zagreb’, ’zagreb county’]
GPT-3 Prediction: [’bosnia and herzegovina’, ’serbia’, ’slovenia’]</p>
        </sec>
        <sec id="sec-7-3-4">
          <title>7.3.4. RiverBasinsCountry</title>
          <p>SubjectEntity: Didi Liakhvi river
Ground Truth: [’georgia’]
GPT-3 Prediction: [’georgia’, ’russia’]
SubjectEntity: Dniester
Ground Truth: [’moldova’, ’ukraine’]
GPT-3 Prediction: [’moldova’, ’romania’, ’ukraine’]
SubjectEntity: San
Ground Truth: [’poland’, ’ukraine’]
GPT-3 Prediction: [’mexico’]</p>
        </sec>
        <sec id="sec-7-3-5">
          <title>7.3.5. ChemicalCompoundElement</title>
          <p>SubjectEntity: Aluminum Iodide
Ground Truth: [’aluminium’, ’iodine’]
GPT-3 Prediction: [’aluminum’, ’iodine’]
SubjectEntity: Atracurium Besylate
Ground Truth: [’carbon’, ’hydrogen’, ’nitrogen’, ’oxygen’, ’sulfur’]
GPT-3 Prediction: [’atracurium’, ’hydrogen’, ’oxygen’]
SubjectEntity: Flavin Adenine Dinucleotide
Ground Truth: [’carbon’, ’hydrogen’, ’nitrogen’, ’oxygen’, ’phosphorus’]
GPT-3 Prediction: [’adenine’, ’dinucleotide’, ’flavin’]</p>
        </sec>
        <sec id="sec-7-3-6">
          <title>7.3.6. PersonLanguage</title>
          <p>SubjectEntity: A. R. Rahman
Ground Truth: [’hindi’, ’tamil’, ’telugu’]
GPT-3 Prediction: [’bengali’, ’hindi’, ’kannada’, ’malayalam’,
’marathi’, ’tamil’, ’telugu’]
SubjectEntity: Akon
Ground Truth: [’english’]
GPT-3 Prediction: [’english’, ’french’, ’wolof’]
SubjectEntity: Genelia D’Souza
Ground Truth: [’hindi’, ’malayalam’, ’tamil’, ’telugu’]
GPT-3 Prediction: [’bengali’, ’english’, ’hindi’, ’marathi’, ’telugu’]</p>
        </sec>
        <sec id="sec-7-3-7">
          <title>7.3.7. PersonProfession</title>
          <p>SubjectEntity: Elon Musk
Ground Truth: [’engineer’, ’entrepreneur’, ’inventor’, ’programmer’]
GPT-3 Prediction: [’businessperson’, ’engineer’, ’inventor’,
’investor’, ’philanthropist’, ’product designer’]
SubjectEntity: J. K. Rowling
Ground Truth: [’author’, "children’s writer", ’executive producer’,
’film producer’, ’novelist’, ’producer’, ’producer’, ’screenwriter’,
’writer’, ’writer’]
GPT-3 Prediction: [’author’, ’film producer’, ’philanthropist’,
’screenwriter’]
SubjectEntity: Lady Gaga
Ground Truth: [’singer’, ’songwriter’]
GPT-3 Prediction: [’actress’, ’dancer’, ’philanthropist’,
’record producer’, ’singer’, ’songwriter’]</p>
        </sec>
        <sec id="sec-7-3-8">
          <title>7.3.8. PersonInstrument</title>
          <p>SubjectEntity: Ronan Keating
Ground Truth: []
GPT-3 Prediction: [’piano’]
SubjectEntity: Taeyang
Ground Truth: []
GPT-3 Prediction: [’guitar’]
SubjectEntity: Chester Bennington
Ground Truth: []
GPT-3 Prediction: [’guitar’, ’piano’]</p>
        </sec>
        <sec id="sec-7-3-9">
          <title>7.3.9. PersonEmployer</title>
          <p>SubjectEntity: Kent Beck
Ground Truth: [’meta platforms’]
GPT-3 Prediction: [’facebook’, ’three rivers institute’]
SubjectEntity: Serena Williams
Ground Truth: [’unicef’]
GPT-3 Prediction: [’tennis’]
SubjectEntity: Guido van Rossum
Ground Truth: [’microsoft’]
GPT-3 Prediction: [’dropbox’, ’google’]</p>
        </sec>
        <sec id="sec-7-3-10">
          <title>7.3.10. PersonPlaceOfDeath</title>
          <p>SubjectEntity: Avicii
Ground Truth: [’muscat’]
GPT-3 Prediction: [’muscat, oman’]
SubjectEntity: John Coltrane
Ground Truth: [’huntington’]
GPT-3 Prediction: [’new york city’]
SubjectEntity: Rachel Caine
Ground Truth: [’texas’]
GPT-3 Prediction: [’’]</p>
        </sec>
        <sec id="sec-7-3-11">
          <title>7.3.11. PersonCauseOfDeath</title>
          <p>SubjectEntity: Ahmed Zewail
Ground Truth: [’lymphoma’, ’spinal cord lymphoma’]
GPT-3 Prediction: [’cancer’]
SubjectEntity: Avicii
Ground Truth: [’exsanguination’]
GPT-3 Prediction: [’suicide’]
SubjectEntity: Ennio Morricone
Ground Truth: [’femoral fracture’, ’fracture’]
GPT-3 Prediction: [’’]</p>
        </sec>
        <sec id="sec-7-3-12">
          <title>7.3.12. CompanyParentOrganization</title>
          <p>SubjectEntity: Aston Martin lagonda
Ground Truth: []
GPT-3 Prediction: [’aston martin lagonda global holdings plc’]
SubjectEntity: Austro-Daimler
Ground Truth: []
GPT-3 Prediction: [’daimler ag’]
SubjectEntity: Hyundai Motor Company
Ground Truth: [’hyundai’]
GPT-3 Prediction: [’hyundai motor group’]</p>
        </sec>
      </sec>
      <sec id="sec-7-4">
        <title>7.4. Language Model Size</title>
        <p>Table 5 shows the values of the scaling experiments. These values were used to produce Figure 1.</p>
        <p>Figure 2: The number of answers per relation type for the training set provided by the organisers.</p>
        <p>Figure 3: The number of answers per relation type for the development set provided by the organisers.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Daza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <article-title>Inductive entity representations from text via link prediction</article-title>
          ,
          <source>in: Proceedings of the Web Conference</source>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>798</fpage>
          -
          <lpage>808</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Araci</surname>
          </string-name>
          ,
          <article-title>FinBERT: Financial sentiment analysis with pre-trained language models</article-title>
          , arXiv preprint arXiv:1908.10063 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnaggar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dallago</surname>
          </string-name>
          , G. Rihawi,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing</article-title>
          , arXiv preprint arXiv:2007.06225 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sorensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Rytting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Delorey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khalil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fulda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wingate</surname>
          </string-name>
          ,
          <article-title>An information-theoretic approach to prompt engineering without ground truth labels</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language Models as Knowledge Bases?</article-title>
          ,
          <source>in: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Roller,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dewan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mihaylov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Simig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Koura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>OPT: Open pre-trained transformer language models</article-title>
          ,
          <year>2022</year>
          . arXiv:2205.01068.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisner</surname>
          </string-name>
          ,
          <article-title>Learning how to ask: Querying LMs with mixtures of soft prompts</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5203</fpage>
          -
          <lpage>5212</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kassner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>Language Models As or For Knowledge Bases</article-title>
          (
          <year>2021</year>
          ). arXiv:2110.04888.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. v.</given-names>
            <surname>Bakel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aleksiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Daza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alivanistos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          ,
          <article-title>Approximate knowledge graph query answering: from ranking to binary classification</article-title>
          ,
          <source>in: International Workshop on Graph Structures for Knowledge Representation and Reasoning</source>
          , Springer, Cham,
          <year>2020</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Language models: past, present, and future</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>65</volume>
          (
          <year>2022</year>
          )
          <fpage>56</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Araki</surname>
          </string-name>
          , G. Neubig,
          <article-title>How Can We Know What Language Models Know?</article-title>
          ,
          <source>in: Transactions of the Association for Computational Linguistics 2020 (TACL)</source>
          , volume
          <volume>8</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>423</fpage>
          -
          <lpage>438</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bouraoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Camacho-Collados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          ,
          <article-title>Inducing Relational Knowledge from BERT</article-title>
          ,
          <source>in: Proc. of the Thirty-Fourth Conference on Artificial Intelligence</source>
          , AAAI'
          <fpage>20</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Haviv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Globerson</surname>
          </string-name>
          ,
          <article-title>BERTese: Learning to speak to BERT</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3618</fpage>
          -
          <lpage>3623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Razeghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Logan IV</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts</article-title>
          (
          <year>2020</year>
          )
          <fpage>4222</fpage>
          -
          <lpage>4235</lpage>
          . arXiv:2010.15980.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Factual probing is [MASK]: Learning vs. learning to recall</article-title>
          (
          <year>2021</year>
          )
          <fpage>5017</fpage>
          -
          <lpage>5033</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Fichtel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          , W.-T. Balke,
          <article-title>Prompt Tuning or Fine-Tuning - Investigating Relational Knowledge in Pre-Trained Language Models</article-title>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>An Empirical Study on Few-shot Knowledge Probing for Pretrained Language Models</article-title>
          (
          <year>2021</year>
          ). arXiv:2109.02772.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>N.</given-names>
            <surname>Poerner</surname>
          </string-name>
          , U. Waltinger,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>BERT is Not a Knowledge Base (Yet): Factual Knowledge vs. Name-Based Reasoning in Unsupervised QA</article-title>
          (
          <year>2019</year>
          ). arXiv:
          <year>1911</year>
          .03681.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Eisenschlos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gillick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <article-title>Time-Aware Language Models as Temporal Knowledge Bases</article-title>
          (
          <year>2021</year>
          ). arXiv:2106.15110.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Can Language Models be Biomedical Knowledge Bases?</article-title>
          (
          <year>2021</year>
          )
          <fpage>4723</fpage>
          -
          <lpage>4734</lpage>
          . arXiv:2109.07154.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shareghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <article-title>Rewire-then-Probe: A Contrastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models</article-title>
          (
          <year>2021</year>
          ). arXiv:2110.08173.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fichtel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ehler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-T.</given-names>
            <surname>Balke</surname>
          </string-name>
          ,
          <article-title>KnowlyBERT - Hybrid Query Answering over Language Models and Knowledge Graphs</article-title>
          ,
          <source>in: Proceedings of the International Semantic Web Conference (ISWC)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>294</fpage>
          -
          <lpage>310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Arnaout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-K.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stepanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Gad-Elrab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>Utilizing language model probes for knowledge graph repair</article-title>
          ,
          <source>in: Wiki Workshop 2022</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sofronova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sack</surname>
          </string-name>
          ,
          <article-title>Do Judge an Entity by Its Name! Entity Typing Using Language Models</article-title>
          ,
          <source>in: The Semantic Web: ESWC 2021 Satellite Events</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>KG-BERT: BERT for knowledge graph completion</article-title>
          ,
          <year>2019</year>
          . arXiv:1909.03193.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>L.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McDonell</surname>
          </string-name>
          ,
          <article-title>Prompt programming for large language models: Beyond the few-shot paradigm</article-title>
          ,
          <source>in: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Welleck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Maieutic prompting: Logically consistent reasoning with recursive explanations</article-title>
          ,
          <source>arXiv preprint arXiv:2205.11822</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>