<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LM-KBC: Knowledge Base Construction from Pre-trained Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sneha Singhania</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tuan-Phong Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Razniewski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Max Planck Institute for Informatics</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Pre-trained Language Models (LMs) have advanced a range of semantic tasks and have also shown promise for extracting the factual knowledge encoded in them. Although several works have explored this ability in the LM probing setting, the viability of knowledge base construction from LMs has not yet been explored. In light of this, we hosted the LM-KBC challenge at the 21st International Semantic Web Conference (ISWC 2022). Participants were asked to build actual knowledge bases from LMs, for a given set of subjects and relations. In crucial difference to existing probing benchmarks like LAMA [1], we made no simplifying assumptions on relation cardinalities, i.e., a subject-entity could stand in relation with zero, one, or many object-entities. Furthermore, submitted systems were required to go beyond just ranking the predictions and materialize the outputs, which we evaluated using the established KB metrics of precision, recall, and F1-score. The challenge had two tracks: (1) a BERT-type LM track with low computational requirements and (2) an open track, where participants could use any LM of their choice. In this first edition of the challenge, we received a total of five submissions, four for track 1 and one for track 2. We present the contributions and insights of our peer-reviewed submissions and lay out possible paths for future work. The challenge website is https://lm-kbc.github.io.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Models</kwd>
        <kwd>Knowledge Base Construction</kwd>
        <kwd>Prompt Learning</kwd>
        <kwd>Language Model Probing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. About LM-KBC</title>
      <p>
        Previous approaches to KB construction utilized unstructured text [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], crowdsourcing, or
semi-structured resources [
        <xref ref-type="bibr" rid="ref10 ref11 ref5">10, 5, 11</xref>
        ]. In the seminal LAMA paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Petroni et al. showed
that LMs achieved encouraging results in masked knowledge ranking tasks—ranking
candidate objects for a given subject-relation pair. Despite much follow-up work reporting further
advancements [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
        ], as well as criticism [
        <xref ref-type="bibr" rid="ref16">16, 17, 18, 19, 20</xref>
        ], the prospect of using
LMs for KB construction remains under-explored. The LAMA benchmark, and its variants, are
not suited to investigate actual KB construction since they (i) evaluate on randomly sampled
subject-object pairs, thus missing out on assessing per-subject recall, and on deciding whether a
subject has objects at all, (ii) focus on single word object-entities due to the limitation of single
masked token prediction specification of the underlying LM, and (iii) only evaluate a model’s
ranking abilities, but do not force it to make deliberate accept/reject decisions. Knowledge base
construction is a task different from ranking: it requires challenging decisions on how to obtain
recall in the long tail [21, 22] and how to decide acceptance thresholds.
      </p>
      <p>In our challenge, we invited participants to present LM-based systems for actual KB
construction, with three main challenges:
1. Variance in the number of true objects per subject-relation pair. For example, Germany
shares borders with 9 countries, whereas Vietnam borders only 3 countries. Thus, systems
had to make decisions on how many objects to retain.
2. Instances without any true object. For example, Apple has no parent organization, while
Google is owned by Alphabet. Thus, systems had to make decisions on whether to output
any objects at all.
3. Materialization. Systems were required to output lists of objects for each subject-relation
pair, hence had to make deliberate binary retain/discard decisions on candidates and
could not hide behind ranking metrics.</p>
      <p>We evaluated the resulting KBs using the established precision, recall, and F1 metrics.
Task Description Given an input tuple of a subject-entity s and a relation r, the task is to
generate the complete set of correct object-entities {o1, o2, . . . , ok}, using language model probing.</p>
      <p>For example, as shown in Table 1, for a given input consisting of a subject-entity and relation
pair, when BERT is probed using the sample prompt, we obtain the listed top predictions
with likelihoods for the placeholder position “[MASK]”. The last column gives the correct
ground-truth objects. The crux of the task is that across various subject-relation pairs, there is no optimal
way to make accept/reject decisions using a uniform threshold on the LM’s likelihood. The
problem exists even within a single relation: if we retain only predictions with at least 10.7% likelihood,
Germany’s neighbour Belgium (2.2%) would be dropped. Conversely, if the threshold is lowered to
2.2%, India (10.1%) would wrongly be asserted as Vietnam’s neighbour.</p>
      <p>BERT-style models only annotate outputs with these problematic relative likelihoods over
candidates; nevertheless, participating systems needed to decide which and how
many of the candidates to retain. Participants were allowed to paraphrase the input prompts
manually or through existing prompt engineering techniques [23, 24], and could even form
prompt ensembles [25] for final predictions.
</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Sample inputs, prompts, BERT’s top predictions with likelihoods, and the ground-truth objects.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Input</th>
              <th>Prompt</th>
              <th>LM Prediction &amp; Likelihood</th>
              <th>Ground truth</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Vietnam, shares-border</td>
              <td>Vietnam shares a land border with [MASK].</td>
              <td>Cambodia, 12.1%; China, 10.7%; India, 10.1%</td>
              <td>China, Cambodia, Laos</td>
            </tr>
            <tr>
              <td>Germany, shares-border</td>
              <td>Germany shares a land border with [MASK].</td>
              <td>Austria, 17.7%; ...; Belgium, 2.2%</td>
              <td>Austria, ..., Belgium</td>
            </tr>
            <tr>
              <td>Carbon dioxide, consists-of</td>
              <td>Carbon dioxide consists of [MASK].</td>
              <td>Oxygen, 20.8%; Water, 14%; Nitrogen, 11.5%</td>
              <td>Carbon, Oxygen</td>
            </tr>
            <tr>
              <td>Angela Merkel, speaks-language</td>
              <td>Angela Merkel can speak in [MASK].</td>
              <td>German, 89.1%; English, 5.3%; Italian, 0.5%</td>
              <td>German, English, Russian</td>
            </tr>
            <tr>
              <td>Elon Musk, place-of-death</td>
              <td>Elon Musk died in [MASK].</td>
              <td>office, 4.8%; prison, 3%; Chicago, 2.8%</td>
              <td>∅</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The challenge had two tracks:
1. BERT track, where only computationally modest BERT-type models were allowed;
2. Open track, where any language model, including autoregressive or generative models, could
be used.</p>
      <p>
        Using a public training dataset, participants were allowed to prompt-engineer, retrain, fine-tune,
use context examples (e.g., for GPT-3 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), or use additional textual data (e.g., Wikipedia snippets
as prompt context), to optimize their output.
      </p>
      <p>LM-KBC22 Dataset We curated a dataset comprising 12 relations, each with a set of
subjects and a complete list of ground-truth objects per subject-relation pair. For each relation,
a maximum of 100 subjects were provided for training, another 50 for validation and testing,
while a third set of 50 was withheld (private test) for the challenge evaluation. Table 2 gives more
details on our released dataset. The relations were chosen so as to ensure diversity, and the
subject-entities were of different types, e.g., person, country, organization. To further increase
realism, 5 relations also contained subjects without any correct ground-truth objects (e.g., Apple
having no parent organization). We provided aliases for ground-truth objects that are known
under multiple names, and outputting any one of them was sufficient. In particular, to facilitate
the usage of LMs like BERT (which are constrained by single-token predictions), we provided a
valid single-token form for multi-token object-entities, wherever such a form was meaningful.
Evaluation For each test instance, the predictions submitted by participating systems were
evaluated by calculating precision, recall, and F1 metrics against the ground-truth values. Let P
be the list of predicted object-entities for a test subject-entity and G be its corresponding
list of ground-truth objects:</p>
      <p>Precision = |P ∩ G| / |P|, Recall = |P ∩ G| / |G|, F1 = (2 × Precision × Recall) / (Precision + Recall).</p>
      <p>When P is empty and G is not, precision = 1 and recall = 0, leading to F1 = 0. On
the other hand, when G is empty, recall = 1, but precision = 1 only when P is also empty, else
precision = 0, leading to an F1-score of either 1 or 0. Scores were macro-averaged across subjects and
across relations, and systems were ranked by the final macro-F1-score. Participants could submit
their system predictions on CodaLab at https://codalab.lisn.upsaclay.fr/competitions/5815 to get
scores on the private test dataset and check their submission ranking on the leaderboard.</p>
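These per-instance conventions can be captured in a short scoring sketch (the function name is ours, not the official evaluation script):

```python
# Per-instance precision/recall/F1 for LM-KBC, following the conventions in the
# text: empty predictions get precision 1; empty ground truth gets recall 1.
def score_instance(pred, gold):
    """pred, gold: collections of object strings for one subject-relation pair."""
    pred, gold = set(pred), set(gold)
    precision = len(pred & gold) / len(pred) if pred else 1.0
    recall = len(pred & gold) / len(gold) if gold else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Predicting nothing for a no-object subject scores a perfect F1 of 1.0:
print(score_instance([], []))            # (1.0, 1.0, 1.0)
print(score_instance([], ["Laos"]))      # (1.0, 0.0, 0.0)
print(score_instance(["China", "India"], ["China", "Laos"]))  # (0.5, 0.5, 0.5)
```

Note that the empty-prediction convention is what makes the empty-list baseline discussed below score non-zero overall.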
      <p>To ease participation, we released a baseline implementation that probed the BERT language
model using one sample prompt per relation, like “China shares border with [MASK]”, and
selected as outputs the object-entities predicted in the [MASK] position with a likelihood greater
than or equal to 0.5. This baseline achieved a 31.08% F1-score on the hidden test dataset. We also
submitted a second baseline on CodaLab, where the prediction list P for all test instances was
empty. This baseline achieved an 18% F1-score, highlighting that predicting nothing is also a
plausible baseline with non-zero F1-scores, since in realistic KBC scenarios subjects without objects
do occur. We also released a Jupyter notebook for getting started at
https://github.com/lmkbc/dataset/blob/main/getting_started.ipynb, where the baseline is explained and modularized.</p>
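A minimal sketch of such a baseline's selection step (the helper name is ours and the commented pipeline call illustrates one plausible way to obtain candidates; neither is the released baseline code):

```python
# Sketch of the baseline idea: probe a fill-mask LM with one prompt per relation
# and keep every [MASK] candidate whose likelihood is >= 0.5.
def baseline_select(candidates, threshold=0.5):
    """candidates: list of (object_string, likelihood) pairs from the LM."""
    return [obj for obj, score in candidates if score >= threshold]

# With a real model, candidates could come from, e.g.:
#   from transformers import pipeline
#   unmasker = pipeline("fill-mask", model="bert-base-cased")
#   candidates = [(p["token_str"], p["score"])
#                 for p in unmasker("China shares border with [MASK].", top_k=20)]

# On Merkel's speaks-language predictions from Table 1, only "German" survives:
print(baseline_select([("German", 0.891), ("English", 0.053), ("Italian", 0.005)]))
```

The 0.5 cutoff is very conservative; as the table discussion shows, no single cutoff works across relations.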
    </sec>
    <sec id="sec-2">
      <title>2. System Submissions</title>
      <p>The challenge received five submissions—four based on the BERT model (track 1) and one based
on GPT-3 (track 2). Below we list the contributions and main insights of each participating
system.
[Track 1 Winner]: Task-specific Pre-training and Prompt Decomposition for
Knowledge Graph Population with Language Models
Tianyi Li, Wenyu Huang, Nikos Papasarantopoulos, Pavlos Vougiouklis, Jeff Z. Pan</p>
      <p>The authors present a system that performed task-specific pre-training of BERT, employed
prompt decomposition for progressive generation of candidate objects, and used adaptive
thresholds for final candidate object selection. They collected additional knowledge triples from
the Wikidata KB and further pre-trained BERT on the masked token prediction objective. They
formulated the input as a cloze-style prompt and masked the object-entity, ensuring that the
model knows what to recover during prediction. In this modified pre-training step, they also
experimented with additionally masking tokens (window size of 1 or 2) appearing in the vicinity
of the object-entity; however, this did not lead to a gain in overall performance. They also
showed that task-specific pre-training of BERT for a specific relation performed better than
pre-training for all relations.</p>
      <p>Following Jiang et al. [25], they mined prompts from Wikipedia and used the top-20 retrieved
sentences as potential prompts. These top-20 prompts were used in an ensemble fashion
with averaged voting for the final object-entity prediction. For the shares-border relation
with state-type subject-entities, they proposed a pre-condition prompt, “[SUBJECT], as a
place, is a [MASK]”, which generated the exact type (state, province, department, city, or region)
for the subject-entity, leading to a gain in performance.</p>
      <p>Finally, for candidate selection, they proposed sticky thresholds, which essentially selected
a candidate in the ranked list if its likelihood was at least 80% of the previous candidate’s
likelihood. This system scored a 55.01% F1-score on the private test dataset and won track 1 of
the challenge. The code for this system is available at github.com/Teddy-Li/LMKBC-Track1.
[Track 2 Winner]: Prompting as Probing: Using Language Models for Knowledge Base
Construction
Dimitrios Alivanistos, Selene Baez Santamaria, Michael Cochez, Jan-Christoph Kalo, Thiviyan
Thanapalasingam, Emile van Krieken</p>
      <p>The authors present the Prompting as Probing (ProP) system, which probes the GPT-3 model
in a few-shot setting for KB construction. ProP combines various prompting techniques,
including careful manual prompt creation and question-style prompts for checking the veracity
of GPT-3-generated claims. Since the GPT-3 model performs well with in-context examples
illustrating the task, the ProP system uses four representative examples from the training set for
each relation and lets the model generate once the subject-entity of interest is mentioned
at the end.</p>
      <p>Their context examples had the following properties: 1) answer sets of varying length were
used to force the model to generate multiple objects; 2) subjects with an empty answer set were included
whenever possible; 3) examples were formulated as question-answer pairs, e.g., “Which countries
neighbour Dominica? [‘Venezuela’]”, to enforce learning the task style; 4) answer lists were formatted
as lists to accurately post-process the generations. Following Jung et al. [26], ProP has a
post-processing step called fact probing, which checks the veracity of GPT-3-generated answers.
In fact probing, they probe GPT-3 by converting the previous generations into a natural-language
fact prompt and ask the model to generate either True or False, leading to a high gain in
performance.</p>
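The context-example format described above can be sketched roughly as follows (`build_prompt`, the example data, and the template are our illustrative names, not ProP's actual code):

```python
# Illustrative sketch of a few-shot prompt in the style the authors describe:
# question-answer context examples with list-formatted, variable-length answer
# sets (including an empty one), ending with the query subject.
def build_prompt(examples, subject, question_template):
    lines = []
    for ex_subject, ex_objects in examples:
        # Each context example: question followed by its answer list.
        lines.append(f"{question_template.format(subject=ex_subject)} {ex_objects!r}")
    # The query ends the prompt; the LM is left to generate the answer list.
    lines.append(question_template.format(subject=subject))
    return "\n".join(lines)

prompt = build_prompt(
    examples=[("Dominica", ["Venezuela"]), ("Iceland", [])],  # Iceland: empty set
    subject="Vietnam",
    question_template="Which countries neighbour {subject}?",
)
print(prompt)
```

The list-typed answers make the generated continuation easy to parse back into an object list, which is the point of property 4 above.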
      <p>Finally, they also experimented with GPT-3 models differing in size (Ada &lt; Babbage &lt; Curie
&lt; Davinci) and found that performance increased analogously with model size. ProP won
track 2 of the challenge, with an F1-score of 67.56% on the private test dataset. Their code is
available at github.com/HEmile/iswc-challenge.</p>
      <p>Knowledge Base Construction from Pre-trained Language Models by Prompt Learning
Xiao Ning, Remzi Celebi
The authors used manual prompts, designed based on three automated sources, and also tried
ensemble learning for generating the final predictions. The descriptive information from Wikidata
is used in the following three ways for designing prompts: 1) a “middle-word” strategy, which
selects the words occurring between subject and object as a prompt, 2) “dependency-based”,
which uses the syntactic structure or dependency path of the description as the prompt, and 3)
“paraphrasing-based”, where the original prompts are paraphrased using semantically similar
expressions. Each of these prompts is used to probe the BERT-large model, and for a given
subject, the five most frequent and likely objects are selected from the ensemble. Before selecting
the top-5 objects, the candidate list is post-processed by removing stopwords. They also treated
the threshold for candidate selection as a hyper-parameter and tuned it on the training dataset
for each relation separately. This system obtained a 49.35% F1-score on the private test dataset.
Their code is available at github.com/xiao-nx/LMKBC_2022.</p>
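The frequency-based ensemble selection could look roughly like this (the function name and stopword list are illustrative, not from the authors' code):

```python
# A rough sketch of ensemble voting as described: pool each prompt's candidate
# objects, drop stopwords, and keep the five most frequent objects.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and"}  # illustrative placeholder list

def ensemble_top5(per_prompt_candidates):
    """per_prompt_candidates: one candidate list per prompt in the ensemble."""
    votes = Counter(
        obj for candidates in per_prompt_candidates
        for obj in candidates if obj.lower() not in STOPWORDS
    )
    return [obj for obj, _ in votes.most_common(5)]

print(ensemble_top5([
    ["Oxygen", "the", "Water"],
    ["Oxygen", "Carbon"],
    ["Carbon", "Oxygen", "Nitrogen"],
]))  # "Oxygen" ranks first with three votes; "the" is filtered out
```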
      <p>Prompt Design and Answer Processing for Knowledge Base Construction from
Pretrained Language Models (KBC-LM)
Xiao Fang, Alex Kalinowski, Haoran Zhao, Ziao You, Huhao Zhang, Yuan An
The authors propose manual prompts for each relation and probe the BERT-large model. They
used the semantics and domain knowledge of each relation to craft the prompts carefully. Uniquely,
they used the intuition behind word co-occurrences in a context to design the prompts for the
place-of-death and cause-of-death relations. The system first checked the relative
likelihoods of the dead and alive tokens using a question prompt, “[SUBJECT] (is|has) [MASK]”, and
then probed the model for the original relation only when the dead token had a higher probability.
This simple and intuitive idea led to an overall gain in performance. For the plays-instrument
relation, the authors observed that changing the article from ‘an’ to ‘a’ in the prompt improved
the performance, although ‘an’ was grammatically correct. Similarly, even for other relations, the
authors tried to reason out the relationship between the subject and objects in question for optimal
prompt design. They achieved a 49.27% F1-score on the private test dataset. Their code is
available at github.com/anyuanay/KBC-LM-Drexel.</p>
      <p>Manual Prompt Generation for Language Model Probing
Sumit Dalal, Abhisek Sharma, Sarika Jain, Mayank Dave
The authors experiment with various manual prompts and thresholds for candidate selection
for each relation while probing the BERT model. Notably, they also checked whether retaining more
candidates in the object list (100, 150, 180, or 200) has an effect on the overall performance. They
also created an ensemble of their manually crafted prompts, finally achieving an F1-score of
33.7% on the private test dataset.</p>
      <p>[Table 3: final challenge leaderboard, listing the CodaLab users (doctor_who, Teddy487, Xiao, xf49, anonuser123, SumitDalal, abhiseksharma, chitrank) and the two baselines, alongside the participating papers (Alivanistos et al., Li et al., Ning and Celebi, Fang et al., Dalal et al.).]</p>
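The dead-or-alive pre-check that Fang et al. describe for the place-of-death relation can be sketched as follows (a toy sketch with stub scorers; all names are ours, not the authors' code):

```python
# Sketch of the two-stage probe: first compare the likelihoods of "dead" vs.
# "alive" in a status prompt, and only probe the place-of-death prompt when
# "dead" wins; otherwise output the empty set.
def place_of_death(subject, token_likelihood, probe_relation):
    """token_likelihood(prompt, token) -> float; probe_relation(subject) -> list."""
    status_prompt = f"{subject} is [MASK]"
    if token_likelihood(status_prompt, "dead") > token_likelihood(status_prompt, "alive"):
        return probe_relation(subject)
    return []  # living subject: no place of death

# With stub scorers standing in for BERT, a living subject yields the empty set:
scores = {("Elon Musk is [MASK]", "alive"): 0.6, ("Elon Musk is [MASK]", "dead"): 0.1}
print(place_of_death("Elon Musk",
                     lambda p, t: scores.get((p, t), 0.0),
                     lambda s: ["Chicago"]))  # []
```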
    </sec>
    <sec id="sec-3">
      <title>3. Discussion</title>
      <p>The first edition of our LM-KBC challenge received encouraging uptake, with five teams going
past the finish line and submitting both code and system descriptions. Table 3 presents the final
leaderboard of our challenge.</p>
      <sec id="sec-3-1">
        <title>3.1. Main Observations</title>
        <p>The main findings across all the submissions towards KB construction using existing language
models are:
1. Designing optimal prompts is crucial for effective knowledge elicitation from
LMs. The majority of the submissions focused on manual prompt engineering, tuning
prompts using domain knowledge and training data. Prompt choices, sometimes even just
based on small syntactic variations, had a major impact on overall system performance,
and all teams reported that variations there gave huge gains in evaluation metrics.
2. Relation-specific tuning of LMs leads to better performance compared to iteratively
tuning a single LM on all relations. This may appear surprising insofar as language models
are generally held to be multi-task learners. Still, it may be explained by the significant
topical and distributional differences between relations, where transfer of learning results
was not beneficial.
3. Relation-specific thresholding is necessary. Given that LMs rely heavily on word
co-occurrences and patterns during the training stage, their confidence scores varied widely for
object-entity prediction, and a fixed threshold across all relations for candidate selection
is inadequate.
4. Subjects without objects are challenging, and few systems identified them with high
accuracy. For example, even the best-performing system incorrectly predicted some object
for 10% of those subjects. Further research on how to identify whether objects exist for a
given subject-relation pair at all appears necessary.</p>
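Observation 3 suggests tuning a separate likelihood threshold per relation rather than using one global cutoff; a rough sketch of such tuning on training data (all names and the threshold grid are illustrative):

```python
# Pick, for one relation, the likelihood threshold that maximizes average F1
# over that relation's training instances.
def f1(pred, gold):
    pred, gold = set(pred), set(gold)
    p = len(pred & gold) / len(pred) if pred else 1.0
    r = len(pred & gold) / len(gold) if gold else 1.0
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def tune_threshold(instances, grid=(0.01, 0.05, 0.1, 0.2, 0.5)):
    """instances: list of (candidates, gold); candidates are (object, likelihood)."""
    def avg_f1(t):
        return sum(f1([o for o, s in cands if s >= t], gold)
                   for cands, gold in instances) / len(instances)
    return max(grid, key=avg_f1)

# For shares-border, only a low threshold keeps Germany's neighbour Belgium (2.2%):
best = tune_threshold([
    ([("Austria", 0.177), ("Belgium", 0.022)], ["Austria", "Belgium"]),
    ([("Cambodia", 0.121), ("India", 0.101)], ["Cambodia"]),
])
print(best)  # 0.01
```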
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Challenge Extensions</title>
        <p>Deciding on the challenge complexity required navigating a trade-off between ease of access
and realism. Several avenues for extension are:
1. Including entity disambiguation: We consciously decided not to require resolution
to specific entity identifiers, but to match only on string labels, in order to keep the
challenge pure (not requiring pipelined systems). Yet this also creates some challenges in
evaluation, such as when lists of aliases are long, or labels are ambiguous (e.g., should
Korea be accepted as a correct birth place for someone born in South Korea?). Evaluating
systems on disambiguated identifiers is a possible extension, for example, by using an
entity-aware LM as default [27].
2. Expanding training data size: The LM-KBC22 dataset contains 100 samples per relation,
which is too little for most supervised approaches. Providing more training data could
open the challenge to more machine-learning-centric approaches.
3. Other metrics: Our evaluation focused on macro-averaged F1-scores, which give equal
weight to precision and recall. It might be interesting to explore other trade-offs, as for
KBs, precision is often far more critical than recall. Also, as subjects with no objects
dominate many domains (e.g., very few people hold political offices), a higher share of,
or more weight on, no-object subjects might be interesting.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Reviewing Process</title>
      <p>All papers received 2-3 single-blind peer reviews. The following researchers contributed reviews:</p>
    </sec>
    <sec id="sec-5">
      <title>5. Acknowledgements</title>
      <p>We thank the semantic web challenge chairs, Catia Pesquita and Daniele Dell’Aglio, for helping
us host a successful first edition of our challenge. We very much appreciate all the effort by
the PC members and thank them for their timely and detailed reviews. Finally, we thank the
participating teams for their enthusiasm and contributions.</p>
      <p>[17] N. Kassner, H. Schütze, Negated and misprimed probes for pretrained language models:
Birds can talk, but cannot fly, in: ACL, 2020, pp. 7811–7818.
[18] S. Razniewski, A. Yates, N. Kassner, G. Weikum, Language models as or for knowledge
bases, DL4KG (2021).
[19] B. Cao, H. Lin, X. Han, L. Sun, L. Yan, M. Liao, T. Xue, J. Xu, Knowledgeable or educated
guess? revisiting language models as knowledge bases, in: ACL, 2021, pp. 1860–1874.
[20] T.-P. Nguyen, S. Razniewski, Materialized knowledge bases from commonsense
transformers, CSRR (2022).
[21] S. Razniewski, F. Suchanek, W. Nutt, But what do we actually know?, in: AKBC, 2016, pp. 40–44.
[22] S. Singhania, S. Razniewski, G. Weikum, Predicting document coverage for relation
extraction, TACL (2022).
[23] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, S. Singh, AutoPrompt: Eliciting knowledge
from language models with automatically generated prompts, in: EMNLP, 2020, pp. 4222–4235.
[24] Z. Zhong, D. Friedman, D. Chen, Factual probing is [MASK]: Learning vs. learning to
recall, in: NAACL, 2021, pp. 5017–5033.
[25] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How can we know what language models know?,
TACL (2020) 423–438.
[26] J. Jung, L. Qin, S. Welleck, F. Brahman, C. Bhagavatula, R. L. Bras, Y. Choi, Maieutic
prompting: Logically consistent reasoning with recursive explanations, CoRR (2022).
[27] N. De Cao, G. Izacard, S. Riedel, F. Petroni, Autoregressive entity retrieval, in: ICLR, 2020.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          ,
          <source>in: EMNLP-IJCNLP</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: NAACL</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          , in: NeurIPS,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: A free collaborative knowledgebase</article-title>
          , Commun. ACM (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          ,
          <article-title>Dbpedia: A nucleus for a web of open data</article-title>
          ,
          <source>in: ISWC</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          , G. Kasneci, G. Weikum,
          <article-title>Yago: a core of semantic knowledge</article-title>
          ,
          <source>in: WWW</source>
          ,
          <year>2007</year>
          , p.
          <fpage>697</fpage>
          -
          <lpage>706</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Speer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Havasi</surname>
          </string-name>
          ,
          <article-title>Conceptnet 5.5: An open multilingual graph of general knowledge</article-title>
          ,
          <source>in: AAAI</source>
          ,
          <year>2017</year>
          , p.
          <fpage>4444</fpage>
          -
          <lpage>4451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nakashole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          , G. Weikum,
          <article-title>Scalable knowledge harvesting with high precision and high recall</article-title>
          ,
          <source>in: WSDM</source>
          ,
          <year>2011</year>
          , p.
          <fpage>227</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          , E. Gabrilovich, G. Heitz,
          <string-name>
            <given-names>W.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Strohmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Knowledge vault: A web-scale approach to probabilistic knowledge fusion</article-title>
          ,
          <source>in: KDD</source>
          ,
          <year>2014</year>
          , p.
          <fpage>601</fpage>
          -
          <lpage>610</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bollacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paritosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sturge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Freebase: A collaboratively created graph database for structuring human knowledge</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Berberich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lewis-Kelham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>de Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>Yago2: Exploring and querying world knowledge in time, space, context, and many languages</article-title>
          ,
          <source>in: WWW</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pasupat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Retrieval augmented language model pre-training</article-title>
          ,
          <source>in: ICML</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3929</fpage>
          -
          <lpage>3938</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <article-title>How much knowledge can you pack into the parameters of a language model?</article-title>
          ,
          <source>in: EMNLP</source>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>5418</fpage>
          -
          <lpage>5426</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>BertNet: Harvesting knowledge graphs from pretrained language models</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Arnaout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-K.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stepanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Gad-Elrab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>Utilizing language model probes for knowledge graph repair</article-title>
          ,
          <source>in: Wiki Workshop</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>McCoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <article-title>Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3428</fpage>
          -
          <lpage>3448</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>