<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Limits of Zero-shot Probing on Object Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shrestha Ghosh</string-name>
          <email>ghoshs@mpi-inf.mpg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Max Planck Institute for Informatics</institution>
          ,
          <addr-line>Saarbrücken 66125</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saarland University</institution>
          ,
          <addr-line>Saarbrücken 66125</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>In this work, we present Minimal Probe, one of the participating systems of the LM-KBC challenge. LLMs are trained on huge amounts of general knowledge and are known to store knowledge in their parameters; they have been probed for factual and commonsense knowledge. Minimal Probe aims to analyze how LLMs perform in a low-resource setting. By careful prompt construction and intuitive answer cleaning, we show that LLMs can be used to extract multiple objects for a given subject and relation, without any demonstrations. Our system performs equally well on precision and recall, surpassing the baseline by more than 40% on F1. Minimal Probe achieves an average F1 score of 0.608 on the hidden test set, only 9.2% behind the winning team, which does use demonstrations. The code and results are available at https://github.com/ghoshs/LM-KBC2023.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Extracting knowledge from Large Language Models (LLMs) for Knowledge Bases (KBs) has
been tackled in recent literature [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Predicting facts for entities where the relation cardinality
is not fixed is still under-explored. For natural language processing tasks, LLMs are more commonly
used with demonstration-aided prompting [2, 3] and have been shown to exhibit emergent abilities
with increasing parameter counts, such as performing tasks from few-shot prompts [4]. Newer models
trained on code and text are more capable of generating structured data [5, 6].
      </p>
      <p>Prompts consist of the task description and optional demonstrations instantiating the task
input/output, followed by the input for which the model is expected to provide an answer.
The prompting technique with no demonstrations is called zero-shot prompting, and when
demonstrations are provided, it is called few-shot or k-shot prompting. The zero-shot setting
is interesting since it allows us to determine how much knowledge is already stored in LLMs
and how effectively we can retrieve this parameterized knowledge. Previous works have
shown LLMs to be good zero-shot reasoners [7].</p>
      <sec id="sec-2-1">
        <title>1.1. LM-KBC Challenge</title>
        <p>This is the second edition of the LM-KBC challenge, which addresses the task of KB construction
using LLMs [8]. Here, the LLMs provide an alternative to the traditional information extraction
from unstructured text. As in the first edition [9], the main task is to predict all objects given
a subject and a relation. Since the number of objects is not fixed and could also be zero,
using LLMs is especially challenging. New additions to the current challenge included entity
disambiguation to Wikidata and predicting relation cardinality as an integer.</p>
        <p>Two major observations from the first edition of the LM-KBC challenge [9] are that larger
models have higher performance and that triple-based prompts perform better than natural
language prompts [10].</p>
      </sec>
      <sec id="sec-2-2">
        <title>1.2. Related Work</title>
        <p>
          The potential of LLMs as KBs was first explored by Petroni et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], where they introduced
the LAMA probe to test knowledge stored in LLMs. This prompted a recent body of work on
probing LLMs for factual knowledge and KB curation tasks [11, 12, 13, 14, 15, 16]. There are
still challenges to using LLMs for and as KBs which need more focus [17].
        </p>
        <p>Prompting is an important part of using LLMs for downstream tasks [18]. Zero-shot
prompting has received some attention in recent literature for task generalization [19] and
chain-of-thought prompting [7]. Early experiments with GPT-4 have shown remarkable improvements on
certain tasks with zero-shot prompts as compared to previous models [20].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Approach</title>
      <p>Through this challenge, this work investigates how much information can be retrieved from
LLMs using zero-shot prompts on the GPT family of models. With our system, named Minimal
Probe, we focus on two aspects: prompt design and answer post-processing.</p>
      <sec id="sec-3-1">
        <title>2.1. Prompt Design</title>
        <p>The prompt consists of three parts: i) a task description, ii) optional demonstrations, and iii) the
task itself. We discuss each part in turn.</p>
        <p>Task description. The challenge presents a baseline (GPT3 (curie-002) + NED) with no task
description and only demonstrations, which achieves 21% F1. We hypothesize that a task
description guides the model to produce more reliable output, and formulate a simple task
description where the model is asked for exactly what is required in the output, i.e., a list of values.</p>
        <p>Example 1. A prompt with the task description on the first line and the task on the next line.</p>
        <p>Please fill the empty list, if necessary, to create a correct fact. Return a valid tuple.
(“Paraguay”, “borders country”, [])</p>
        <p>We use the task description as an alternative to providing demonstrations, probing the
LLMs in a zero-shot way.</p>
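        <p>To make the construction concrete, below is a minimal sketch (in Python) of how such a zero-shot prompt can be assembled; the helper name build_prompt is ours, and the instruction wording follows Example 1.</p>
        <preformat>
# Minimal sketch of zero-shot prompt assembly (helper name is ours).
TASK_DESCRIPTION = (
    "Please fill the empty list, if necessary, to create a correct fact. "
    "Return a valid tuple."
)

def build_prompt(subject, relation_phrase):
    # Task description on the first line, the task tuple on the next.
    return f'{TASK_DESCRIPTION}\n("{subject}", "{relation_phrase}", [])'

# build_prompt("Paraguay", "borders country") reproduces Example 1.
        </preformat>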
        <p>Demonstrations. Demonstrations serve the purpose of laying the format of the task. As
mentioned earlier, with just some sample demonstrations in the prompt, GPT3 (curie-002)
achieves an F1 score of 21%. In the case where an LLM is prompted directly with the task
without any task description or demonstrations, the output structure becomes very unreliable.
The output could be anything from a comma-separated string to a paragraph about the subject
entity. This is probably because the model has seen little, if any, text in the prompt
format, and hence cannot make predictions in a consistent format. This work probes the limits of LLMs for KB
construction without using demonstrations.</p>
        <p>Task. The task itself can be divided into two components: the format and the content. As
shown in Example 1, the model is provided with an instruction to fill an empty list in a tuple
format. Editing and inserting capabilities were introduced to the GPT family of LLMs in March
2022 [21] and are carried forward into the newer models. The format guides the LLM to return a
similar response. As a result, the predictions can be automatically parsed with minimal effort.</p>
        <p>Another departure from the baseline is in presenting the relation. Even though LLMs return
acceptable answers when probed with the provided relation as is, i.e., in camel-case, we probe
the model with relations in natural language, but without the subject type, to see how well
the model performs. The relations in their current form have certain inconsistencies. Some
object types are in the singular (CountryBordersCountry) and some in the plural
(CountryHasStates). The relation RiverBasinsCountry has no verb. We paraphrased the relations to
have plural object types and added missing verbs. Some relations, which were more generic,
returned results of granularity different from those expected in the evaluation. For instance,
the relation CompoundHasParts was changed to the more specific has elements. The relations
CountryHasStates and StateBordersState were modified such that the object type “state”
was replaced by “provinces”.</p>
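        <p>The paraphrases can be kept in a simple lookup table, sketched below; only the substitutions described above are grounded in the text, and the remaining phrasings are hypothetical placeholders following the same pattern.</p>
        <preformat>
# Illustrative relation paraphrase table. Only "has elements" and the
# state-to-provinces substitutions are stated in the text; the other
# phrasings are hypothetical.
RELATION_PHRASES = {
    "CompoundHasParts": "has elements",              # more specific phrasing
    "CountryHasStates": "has provinces",             # "states" replaced by "provinces"
    "StateBordersState": "borders provinces",        # "state" replaced by "provinces"
    "CountryBordersCountry": "borders countries",    # hypothetical: pluralized object type
    "RiverBasinsCountry": "flows through countries", # hypothetical: added verb
}
        </preformat>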
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Answer Post-processing</title>
        <p>Extraction. The careful prompt design keeps the LLM response from deviating from the expected
output. As a result, minimal parsing is required to extract the object entities. The response is
parsed with the abstract syntax tree library of Python (https://docs.python.org/3/library/ast.html) using its literal_eval function. In case
of a parse error, a regex pattern is used to extract all word groups within double quotes. We
keep all matched patterns which do not match the task subject and relation. Once we have all
the surface forms, except for the integers, we use the entity search service of the MediaWiki
action API (the wbsearchentities action; https://www.mediawiki.org/wiki/API:Main_page) to link the surface forms to Wikidata entities.</p>
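        <p>A minimal sketch of the extraction and linking steps under the assumptions above; the function names are ours, and the endpoint is the standard Wikidata one for the wbsearchentities action.</p>
        <preformat>
import ast
import re

import requests

def extract_surface_forms(response, subject, relation):
    """Parse the model response; fall back to a regex on parse errors."""
    try:
        # Expected case: a parseable tuple such as
        # ("Paraguay", "borders country", ["Argentina", "Bolivia", "Brazil"])
        _, _, objects = ast.literal_eval(response.strip())
        return list(objects)
    except (ValueError, SyntaxError):
        # Fallback: all double-quoted word groups, minus subject and relation.
        matches = re.findall(r'"([^"]+)"', response)
        return [m for m in matches if m not in (subject, relation)]

def link_to_wikidata(surface_form):
    """Link a surface form to a Wikidata entity via wbsearchentities."""
    params = {
        "action": "wbsearchentities",
        "search": surface_form,
        "language": "en",
        "format": "json",
    }
    result = requests.get("https://www.wikidata.org/w/api.php", params=params).json()
    hits = result.get("search", [])
    return hits[0]["id"] if hits else None  # first hit, e.g. "Q414" for Argentina
        </preformat>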
        <p>Cleaning. Another pass of processing is required after linking surface forms to Wikidata
entities to make the object entities compatible with the desired format of the challenge.</p>
        <p>Firstly, integer objects are encoded as strings. Secondly, if an object is not an integer, we
check whether non-entity literals, such as “unknown”, “N/A”, “none”, “false” and their variants, have
been predicted. If yes, the corresponding object IDs, if any, are removed, since the current
evaluation treats them as empty values. Thirdly, object IDs which are the same as the subject IDs
are removed. Since none of the relations is reflexive, this operation can be safely performed.</p>
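        <p>A minimal sketch of this relation-agnostic cleaning pass, assuming the objects arrive as (surface form, linked Wikidata ID) pairs; the set of literals and the function name are ours.</p>
        <preformat>
# Relation-agnostic cleaning pass (sketch; names are illustrative).
NON_ENTITY_LITERALS = {"unknown", "n/a", "none", "false"}

def clean_objects(objects, subject_id):
    """objects: list of (surface form, linked Wikidata ID or None) pairs."""
    cleaned = []
    for surface, qid in objects:
        if isinstance(surface, int):
            cleaned.append(str(surface))  # integers are encoded as strings
        elif str(surface).strip().lower() in NON_ENTITY_LITERALS:
            continue                      # normalized to an empty value
        elif qid == subject_id:
            continue                      # drop reflexive answers
        elif qid is not None:
            cleaned.append(qid)
    return cleaned
        </preformat>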
        <p>Next, a relation-specific clean-up is applied to certain relations. For the
PersonHasNobelPrize relation, it was observed that the majority of the answers returned were the fields of the
award, which was not enough to disambiguate the objects. Since the object range of Nobel Prizes
comprises six awards, we mapped each category’s surface form (the field names) to its Wikidata
ID. Whenever the named-entity disambiguation failed, i.e., the object entity did not match any
of the six award entities, we directly looked up the Wikidata ID from its object surface. This
is a pragmatic choice when the object range is limited. In the case of CityLocatedAtRiver,
whenever the returned object did not belong to the class river, we re-ran the KB linking search
by appending the object type, here river, to the surface form.</p>
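        <p>The relation-specific clean-up can be sketched as follows; the category-to-QID map is abbreviated (the QIDs should be verified against Wikidata), link_to_wikidata comes from the extraction sketch above, and the river type check goes through Wikidata’s public SPARQL endpoint, with Q4022 assumed to be the “river” class.</p>
        <preformat>
import requests

# Relation-specific clean-up (sketch).
NOBEL_PRIZE_QIDS = {
    "physics": "Q38104",  # Nobel Prize in Physics (to be verified)
    # ... the remaining five categories mapped analogously
}

def clean_nobel_prize(surface, linked_qid):
    # Keep the ID if disambiguation hit one of the six award entities;
    # otherwise look the award up directly from the field name.
    if linked_qid in NOBEL_PRIZE_QIDS.values():
        return linked_qid
    return NOBEL_PRIZE_QIDS.get(surface.strip().lower())

def is_river(qid):
    # ASK whether the entity is an instance of (a subclass of) river.
    query = "ASK { wd:%s wdt:P31/wdt:P279* wd:Q4022 }" % qid
    result = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
    ).json()
    return result["boolean"]

def clean_city_river(surface, linked_qid):
    # Re-run the entity search with the object type appended when the
    # first hit is not a river.
    if is_river(linked_qid):
        return linked_qid
    return link_to_wikidata(surface + " river")
        </preformat>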
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Results</title>
      <p>The main results on the validation set are presented in Table 1 and Table 2. Based on these
results, we determine the final configuration to be used on the test set.</p>
      <p>Setup. We compare the performance of two models trained with human feedback:
GPT-3.5-turbo and GPT-4 [22]. The model parameters used are temperature set to 0 and max tokens
set to 200.</p>
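      <p>As a minimal sketch, the models can be queried with these parameters through the OpenAI chat completions API as it was available in 2023; the helper name query_model is ours, and the exact call depends on the client library version.</p>
      <preformat>
import openai

def query_model(prompt, model="gpt-4"):
    """Query a chat model with the decoding parameters described above."""
    response = openai.ChatCompletion.create(
        model=model,  # or "gpt-3.5-turbo"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic decoding
        max_tokens=200,  # cap on the response length
    )
    return response["choices"][0]["message"]["content"]
      </preformat>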
      <p>Larger Models. As observed in the previous edition of the challenge, increasing the model
size increased the overall performance. To confirm this, we evaluate the LM-KBC baseline
on bigger models, without making any other changes. Table 1 shows the performance of the
LM-KBC baseline increases when run with larger models. There is a big jump within the
completion models when switching from text-curie-001 to text-davinci-003, with a gain
of more than 20% in F1. The gain stabilizes to about 5%-10% for the larger models.</p>
      <p>Answer cleaning. The cleaning component of answer post-processing improves the F1 by
more than 10% in our Minimal Probe. This shows that the model outputs can still be noisy and
just employing a larger model is not enough. Relation-specific cleaning of the objects, as was
done for the Nobel Prize categories, increased the precision to 99%. Another observation was
that the objects predicted for the parent organization of companies often included the company itself, especially
when the answer was None.</p>
      <p>The cleaning component, which deals with normalizing unknown values to an empty value,
also boosts the performance. It is interesting to note that the LLMs distinguish between not
knowing, i.e., “unknown”, and knowing that there are no objects, i.e., “none” or “false”. However,
for this challenge, both meanings are merged into an empty value.</p>
      <p>Relation Modifications. In Table 2, we highlight how GPT-4 performs when the relations
are presented i) in their original camel-cased format as given in the challenge, ii) in natural
language format without the subject type, and iii) after paraphrasing. We observe that F1 increases
for most relations after modifications. For those relations where the original format scores the highest,
PersonHasAutobiography and CountryHasOfficialLanguage, the other configurations are
not far behind. The most impacted is the relation CompoundHasParts, where the F1 increases
by 50% points. For the test set, we keep the relations in their natural language format, but
perform paraphrasing only for certain relations.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Discussion</title>
      <p>In this section, we present some error analysis on the validation data and provide anecdotes on
the pros and cons of using LLMs for KB curation. We observe that the majority of errors stem
from relation-specific idiosyncrasies. In general, the relations dealing with people’s professions
and employers were the most difficult to predict.</p>
      <sec id="sec-5-1">
        <title>4.0.1. Granularity</title>
        <p>With no other information except the task description and the task itself, it is sometimes difficult
to control the granularity of the type of the objects returned. Wikidata provides property
constraints, such as value-type, to restrict the class of objects. For instance, the objects for the
country of citizenship relation should always be of the type country, but this is not always the case. This
was especially true for the relation PersonHasProfession, where the model predicts {“actor”,
“producer”, “director”}, which can be considered sibling occupations, but the ground truth has
“actor” as well as other subclass occupations, like “film actor”, “television actor”, “voice actor”
and so on. These are particularly difficult to elicit unless provided with additional context.</p>
        <p>Another example is the relation CompanyHasParentOrganisation, where the model often
returned objects linked to the subject via the Wikidata relation “owned by”, which is semantically
correct but not considered in the ground truth labels. Interestingly, the model never predicts
“voice” as an object for the relation PersonPlaysInstrument. CityLocatedAtRiver also
suffers from the problem of granularity, sometimes returning water bodies instead of, or along
with, rivers.</p>
        <p>[Table 2 appeared here in the original layout: per-relation scores of GPT-4 for the 21
challenge relations, from BandHasMember to StateBordersState, plus their average; only the
relation column is recoverable from the extraction.]</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.0.2. Cardinality</title>
        <p>The LLMs get almost 50% of the cardinalities wrong for both the relations
PersonHasNumberOfChildren and SeriesHasNumberOfEpisodes. In the former relation, 50% of the errors are
off by one; for the latter, there is no visible pattern to the errors.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.0.3. Recency</title>
        <p>The relation PersonHasEmployer can be misleading for the LLM, since it returns only recent
employers, whereas the expectation is all employers, including past ones.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.0.4. Ground Truth Mismatch</title>
        <p>Very rarely does the model return an answer which does not match the ground truth but can be
corroborated in Wikidata or Wikipedia. One such triple concerns the number of episodes
of the Perry Mason series, which the model predicts as 271, as opposed to the ground truth
value of 15. Another example is the company AGL Resources, which has outdated
information on Wikidata and more recent information, including its parent organization, on
Wikipedia (https://en.wikipedia.org/wiki/Southern_Company_Gas). These instances are rare and generally due to changes not yet reflected in Wikidata.</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.0.5. Object Type</title>
        <p>The relations most affected by disambiguation errors were PersonHasNobelPrize and
CityLocatedAtRiver, where the Wikidata objects returned were of types very different from the type
expected for the relation. For instance, the Wikidata objects returned for the Nobel Prize were not of the
type award, but of the branch of science. In the instance of a city located at rivers, a common
error occurred when the river name coincided with the name of a city and the Wikidata object
returned was of the type city. There was a drastic increase in the F1 score, by more than 20%
points, once this was handled.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>The LM-KBC challenge explores the problem of KB construction using LLMs. Through our
system, we show that LLMs not only improve with size, but are also capable of predicting
in low-resource settings, such as with zero demonstrations. We show additionally that answer
post-processing and relation-specific modifications can greatly improve fact prediction. Minimal
Probe improves the baseline F1 by 42%, while maintaining the average precision and recall
across relations at 65%. According to the official leaderboard on the hidden test set, Minimal
Probe was only 9.2% points behind the winning system, which relied on demonstrations for
final predictions.</p>
      <p>Future work includes exploring whether further meta information, such as object
granularity or relation cardinality (e.g., the average number of objects), can improve object recall.
Further, as evident from Table 2, relation surface forms are important. Worse representations
may lead to poor predictions. Additionally, verification of the predicted objects remains a
challenge. There are different design choices to be made as well: on the one hand, an LLM as a retriever
is less expensive, but answer verification is computation-heavy; on the other hand, an LLM as an
information extractor is expensive and limited by the model’s context.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <title>I thank the reviewers for their helpful suggestions.</title>
      </sec>
      <sec id="sec-7-2">
        <title>3https://en.wikipedia.org/wiki/Southern_Company_Gas</title>
        <p>[2] T. Schick, H. Schütze, Few-shot text generation with natural language instructions, in:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
2021, pp. 390–402.
[3] T. Schick, H. Schütze, It’s not just size that matters: Small language models are also
few-shot learners, in: Proceedings of the 2021 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, 2021,
pp. 2339–2352.
[4] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma,
D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, W. Fedus,
Emergent abilities of large language models, Transactions on Machine Learning Research
(2022). URL: https://openreview.net/forum?id=yzkSU5zdwD, survey Certification.
[5] A. Madaan, S. Zhou, U. Alon, Y. Yang, G. Neubig, Language models of code are few-shot
commonsense learners, in: Proceedings of the 2022 Conference on Empirical Methods in
Natural Language Processing, 2022, pp. 1384–1403.
[6] X. Wang, S. Li, H. Ji, Code4struct: Code generation for few-shot event structure
prediction, in: Proceedings of the 61st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), 2023, pp. 3640–3663.
[7] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot
reasoners, Advances in neural information processing systems 35 (2022) 22199–22213.
[8] S. Singhania, J.-C. Kalo, S. Razniewski, J. Z. Pan, Lm-kbc: Knowledge base construction
from pre-trained language models, semantic web challenge @ iswc, CEUR-WS (2023). URL:
https://lm-kbc.github.io/challenge2023/.
[9] S. Singhania, T.-P. Nguyen, S. Razniewski, Lm-kbc: Knowledge base construction from
pretrained language models, CEUR-WS (2022). URL: https://lm-kbc.github.io/challenge2022/.
[10] D. Alivanistos, S. B. Santamaría, M. Cochez, J. C. Kalo, E. van Krieken, T. Thanapalasingam,
Prompting as probing: Using language models for knowledge base construction, in: 2022
Semantic Web Challenge on Knowledge Base Construction from Pre-Trained Language
Models, LM-KBC 2022, CEUR-WS.org, 2022, pp. 11–34.
[11] T. Safavi, D. Koutra, Relational world knowledge representation in contextual language
models: A review, in: Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing, 2021, pp. 1053–1067.
[12] R. Zhao, F. Zhao, G. Xu, S. Zhang, H. Jin, Can language models serve as temporal knowledge
bases?, in: Findings of the Association for Computational Linguistics: EMNLP 2022, 2022,
pp. 2024–2037.
[13] S. Hao, B. Tan, K. Tang, B. Ni, X. Shao, H. Zhang, E. Xing, Z. Hu, Bertnet: Harvesting
knowledge graphs with arbitrary relations from pretrained language models, in: Findings
of the Association for Computational Linguistics: ACL 2023, 2023, pp. 5000–5015.
[14] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How can we know what language models know?,
Transactions of the Association for Computational Linguistics 8 (2020) 423–438.
[15] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, S. Singh, Autoprompt: Eliciting knowledge
from language models with automatically generated prompts, in: Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp.
4222–4235.
[16] B. Veseli, S. Singhania, S. Razniewski, G. Weikum, Evaluating language models for
knowledge base completion, in: European Semantic Web Conference, Springer, 2023, pp. 227–243.
[17] S. Razniewski, A. Yates, N. Kassner, G. Weikum, Language models as or for knowledge
bases, arXiv preprint arXiv:2110.04888 (2021).
[18] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A
systematic survey of prompting methods in natural language processing, ACM Computing
Surveys 55 (2023) 1–35.
[19] C. Zhou, J. He, X. Ma, T. Berg-Kirkpatrick, G. Neubig, Prompt consistency for zero-shot
task generalization, in: Findings of the Association for Computational Linguistics: EMNLP
2022, 2022, pp. 2613–2626.
[20] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee,
Y. Li, S. Lundberg, et al., Sparks of artificial general intelligence: Early experiments with
gpt-4, arXiv preprint arXiv:2303.12712 (2023).
[21] M. Bavarian, A. Jiang, H. Jun, H. Pondé, New gpt-3 capabilities: Edit &amp; insert, 2022. URL:
https://openai.com/blog/gpt-3-edit-insert.
[22] OpenAI, Gpt-4 technical report, 2023. arXiv:2303.08774.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>