<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can large language models generate salient negative statements?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hiba Arnaout</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Razniewski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bosch Center for AI</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Max Planck Institute for Informatics</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We examine the ability of large language models (LLMs) to generate salient (interesting) negative statements about real-world entities, an emerging research topic of the last few years. We probe the LLMs using zero- and k-shot unconstrained probes, and compare with traditional methods for negation generation, i.e., pattern-based textual extractions and knowledge-graph-based inferences, as well as crowdsourced gold statements. We measure the correctness and salience of the generated lists about subjects from different domains. Our evaluation shows that guided probes do in fact improve the quality of generated negatives, compared to the zero-shot variant. Nevertheless, using both prompts, LLMs still struggle with the notion of factuality of negatives, frequently generating many ambiguous statements, or statements with negative keywords but a positive meaning.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Text</title>
        <p>KG</p>
      </sec>
      <sec id="sec-1-2">
        <title>ChatGPT 0-shot</title>
      </sec>
      <sec id="sec-1-3">
        <title>ChatGPT k-shot</title>
      </sec>
      <sec id="sec-1-4">
        <title>Alpaca 0-shot</title>
      </sec>
      <sec id="sec-1-5">
        <title>Alpaca k-shot</title>
        <p>Human</p>
        <p>Sample negative statements (Table 1):
didn’t make his high school team
doesn’t have social media
isn’t a basketball coach
didn’t play as a power forward
didn’t invent basketball
didn’t only play basketball (positive)
never played for a team outside the u.s.
didn’t play for the bulls exclusively (positive)
didn’t play for chicago until 84 (positive)
didn’t win a championship for the lakers
wasn’t the youngest player in the nba
didn’t win an oscar
didn’t buy stakes in the chicago bulls
never coached the chicago bulls</p>
        <p>
lists of statements (biographic summaries) about subjects, where the statements are truly
negative, but also salient, unexpected, or normally mistaken as true positives. To compile
these lists, different data sources and methodologies have been explored. In [
          <xref ref-type="bibr" rid="ref2">2, 3</xref>
          ], using
web-scale knowledge graphs, candidate salient negatives are derived from existing positive
statements about highly related entities. The computation relies on the local closed-world
assumption, an assumption of completeness over identified relevant subgraphs, coupled with
ranking metrics such as relative frequencies. Similarly, [4] explores graph embeddings to
generate candidate negative statements, which are then scored using a fine-tuned language
model (LM), by descending order of negativity. Textual sources have been explored in [5],
where commonsense negative statements are extracted, by mining query logs, using
predefined patterns. [6] makes use of the edit history of large collaborative encyclopedias, namely
Wikipedia, by looking at edited sentences where only an entity or a number is changed. The
old version of the sentence is then considered an interesting negative statement.</p>
        <p>LLMs for Negative Statements Generation. Recently, LMs have been examined for
their ability to store factual knowledge about general topics [7, 8]. With LMs such as BERT [9],
this was done via masked probing, e.g., “Paris is the capital of [MASK]” generates france
as the top prediction. With large LMs (LLMs), such as GPT-3 [10], autoregressive generation
from textual prompts is the standard, e.g., the prompt “Complete the following. Paris is..” yields the
completion the capital of France. A few papers focused on the ability of these models to store
and understand negative knowledge [
          <xref ref-type="bibr" rid="ref2">11, 2, 12</xref>
          ]. In [11], using masked probing, authors found
that LMs, such as BERT, struggle to understand negation, predicting fly for the probe “Birds
cannot [MASK]”. In [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], methods to infer negative statements from knowledge graphs and text
have been compared on a more specific negation task, namely generating salient negative
commonsense statements. Results of these models are compared to ones using GPT-3. Even
though it performs better than BERT-like models [11], GPT-3 was not able to beat the SOTA
model (inferences from KGs), either on the true negativity of statements or on their salience.
More recently, [12] studies advanced LLMs, such as ChatGPT [13], on their ability to store
negative knowledge in constrained text generation and question answering tasks. The main finding
is that the LLM’s beliefs contradict each other across the two tasks. For instance, LLMs
generate the sentence “Lions live in the ocean”, but answer “No” when asked “Do lions live in
the ocean?”. [12] is an important step towards examining LLMs’ understanding of the falseness
of statements; however, it differs from our study in four main ways: (i) our prompts are not
constrained to commonsense knowledge; (ii) not constrained to puzzles around a set of words,
but allowed to generate arbitrary subject-relevant statements; (iii) our comparison includes
SOTA baselines from KG and text, not just LLMs; (iv) our study evaluates also the salience of
outputs, not just their correctness.
        </p>
        <p>We summarize our contributions as follows.</p>
        <p>• We design constraint-free prompts for LLM-based negation generation, where we only
instantiate the input subject.
• We examine LLMs’ understanding of salient factual negation, finding that, even though
they struggle with the notion of true negativity (-18% in correctness compared to SOTA
model), on truly negative statements, the guided few-shot ChatGPT variant ranks first
among models in salience.
• We study both encyclopedic and commonsense domains, finding that it is more challenging
for LLMs to generate longer lists of salient commonsense negatives. For instance, the
zero-shot ChatGPT variant shows a decrease of 22% in correctness@5 (compared to @1)
for commonsense subjects. No decrease is observed for encyclopedic subjects.
• We compare the LLM-generated negative statements to existing SOTA methods, from
text [5] and knowledge graphs [3].
• We measure the quality of the negative statements over two aspects, the correctness (true
negativity) and salience (interestingness).</p>
        <p>The data generated can be downloaded at: https://www.mpi-inf.mpg.de/fileadmin/inf/d5/research/negation_in_KBs/data.csv.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Probe Construction</title>
      <p>Given a subject, we probe the LLM to generate a list of salient negative statements about it.
Zero-shot Probe. In this probe, we test the performance of the LLM without providing any
samples in our instructions.</p>
      <p>Write a list of [n] salient factual negated statement about [SUBJECT].</p>
      <p>The goal is to inspect the model’s interpretation of the notion of salient negation without any
prior examples or definitions.</p>
      <p>Guided Few-shot Probe. In this probe, we guide the model with both definitions and examples
(for in-context learning).</p>
      <p>A salient factual negated statement about an entity means that the statement doesn’t hold
in reality. Moreover, the negated statement is either surprising, unexpected, or useful to the
reader. For example:
[EXAMPLE1]
[...]</p>
      <p>Given this definition and examples, write a list of [n] salient factual negated statement
about [SUBJECT].</p>
      <p>In the following sample, we show a 4-shot probe with 2 salient and 2 nonsalient samples
about different types of subjects, and request 3 salient negative statements about lebanon
(LLM=ChatGPT).</p>
      <p>A salient factual negated statement about an entity means that the statement doesn’t hold
in reality. Moreover, the negated statement is either surprising, unexpected, or useful to the
reader. For example:
penguins can’t fly.</p>
      <p>tom cruise never won an oscar.</p>
      <sec id="sec-2-1">
        <title>On the other hand, the following examples are factual negated statements that are not salient:</title>
        <p>penguins can’t run for presidency.</p>
        <p>tom cruise never won the nba best player award.</p>
        <p>Given this definition and examples, write a list of 3 salient factual negated statement about
lebanon.</p>
        <p>Answer:
1. is not a desert country.
2. is not an oil-rich country.
3. is not a landlocked country.</p>
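        <p>The two probes above are plain string templates, so they can be assembled programmatically. The following is a minimal Python sketch, not the authors' code: the names DEFINITION, zero_shot_probe, and few_shot_probe are hypothetical, and the wording is copied from the templates in this section (with a straight apostrophe in place of the typographic one). The sketch assumes at least one salient example is given; the wording of a probe with only nonsalient examples is not shown here.</p>
        <preformat>
```python
# Hypothetical helper names; the template wording follows Section 2.
DEFINITION = (
    "A salient factual negated statement about an entity means that the "
    "statement doesn't hold in reality. Moreover, the negated statement is "
    "either surprising, unexpected, or useful to the reader."
)

NONSALIENT_HEADER = (
    "On the other hand, the following examples are factual negated "
    "statements that are not salient:"
)


def zero_shot_probe(subject, n):
    """Instantiate the zero-shot template with the list size and subject."""
    return f"Write a list of {n} salient factual negated statement about {subject}."


def few_shot_probe(subject, n, salient, nonsalient=()):
    """Build the guided probe: definition, examples, then the request."""
    lines = [DEFINITION + " For example:"]
    lines.extend(salient)
    if nonsalient:
        # Nonsalient examples are optional (e.g. the 3:0 ratio omits them).
        lines.append(NONSALIENT_HEADER)
        lines.extend(nonsalient)
    lines.append(
        "Given this definition and examples, write a list of "
        f"{n} salient factual negated statement about {subject}."
    )
    return "\n".join(lines)


probe = few_shot_probe(
    "lebanon",
    3,
    salient=["penguins can't fly.", "tom cruise never won an oscar."],
    nonsalient=[
        "penguins can't run for presidency.",
        "tom cruise never won the nba best player award.",
    ],
)
```
        </preformat>
        <p>With these arguments, probe reproduces the 4-shot sample shown above; how the string is sent to a model is left out, since it depends on the client used.</p>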
        <sec id="sec-2-1-1">
          <title>Model</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>Text Extractions</title>
        </sec>
        <sec id="sec-2-1-3">
          <title>KG Inferences</title>
        </sec>
        <sec id="sec-2-1-4">
          <title>ChatGPT 0-shot</title>
        </sec>
        <sec id="sec-2-1-5">
          <title>ChatGPT k-shot</title>
        </sec>
        <sec id="sec-2-1-6">
          <title>Alpaca 0-shot</title>
        </sec>
        <sec id="sec-2-1-7">
          <title>Alpaca k-shot</title>
          <p>Human</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>In Section 3, we experiment with different numbers of samples and different salient:nonsalient ratios (see Appendix D).</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>Data. We consider 50 subjects: 25 encyclopedic entities, such as elon musk, and 25 commonsense
concepts, such as jogging (full list in Appendix A). Our intuition behind these choices is diversity:
(i) in types, e.g., activities, occupations, people; and (ii) in popularity, e.g., tom cruise (a famous
hollywood actor) and peri gilpin (a lesser-known tv actor).</p>
      <p>Table 3 (samples per correctness label). Share of statements labeled Positive Meaning, by method:
Text Extractions 0
KG Inferences 0
ChatGPT 0-shot 0.11
ChatGPT k-shot 0.07
Alpaca 0-shot 0.09
Alpaca k-shot 0.10
Human 0.14
Sample statement (Positive Meaning): lebanon isn’t devoid of historical sites</p>
      <sec id="sec-3-1">
        <title>Model</title>
      </sec>
      <sec id="sec-3-2">
        <title>Text Extractions</title>
      </sec>
      <sec id="sec-3-3">
        <title>KG Inferences</title>
      </sec>
      <sec id="sec-3-4">
        <title>ChatGPT 0-shot</title>
      </sec>
      <sec id="sec-3-5">
        <title>ChatGPT k-shot</title>
      </sec>
      <sec id="sec-3-6">
        <title>Alpaca 0-shot</title>
      </sec>
      <sec id="sec-3-7">
        <title>Alpaca k-shot</title>
        <p>Human
Sample Statement
rabbits can’t vomit
the beatles didn’t tour
avocado isn’t bad</p>
        <p>Methods. To compile lists of negative statements about these subjects, we consider:
• Text Extractions: The pattern-based method [5] relies on a handful of manually crafted
patterns, in the form of why-questions, to extract interesting negative statements from rich
query logs, e.g., “why doesn’t amazon..” with the completion “accept paypal”. We instantiate
the query-log API with Google and Bing, merge the results, and rank by frequency.
• KG Inferences: The peer-based negation inference methodology [3] relies on a given KG
to identify highly related entities to the input entity (called peers). Positive statements
about these peers are used to infer candidate negatives, which are finally ranked using
statistical metrics, such as relative frequency, e.g., “unlike similar physicists, such as max
planck and albert einstein, stephen hawking never won the nobel prize in physics”. We
instantiate the KG with Wikidata [14] and Ascent [15] for encyclopedic and commonsense
subjects, respectively.
• ChatGPT 0-shot: The zero-shot probe introduced in Section 2 is submitted to
ChatGPT [13] (May 2023 version).
• ChatGPT k-shot: The few-shot probe in Section 2, with k=3 (salient:nonsalient 3:0), is
submitted to ChatGPT.
• Alpaca 0-shot: The zero-shot probe introduced in Section 2 is submitted to Alpaca-13B, a
model fine-tuned from LLaMA on instruction-following demonstrations by Stanford [16].
• Alpaca k-shot: The few-shot probe from Section 2, with k=3 (salient:nonsalient 3:0), is
submitted to Alpaca-13B.</p>
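        <p>The peer-based KG inference described in the Methods list can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the implementation of [3]: the peers are passed in rather than computed, the KG is a toy dictionary, the function name infer_negatives is hypothetical, and the only ranking signal is the relative frequency of a statement among the peers.</p>
        <preformat>
```python
from collections import Counter


def infer_negatives(subject, kg, peers, top_k=5):
    """Rank statements that hold for peers but are absent for the subject.

    kg maps an entity to a set of (relation, object) pairs. Absence from the
    subject's set is treated as a candidate negative (local closed-world
    assumption over the peer group).
    """
    if not peers:
        return []
    known = kg.get(subject, set())
    counts = Counter()
    for peer in peers:
        for statement in kg.get(peer, set()):
            if statement not in known:
                counts[statement] += 1
    # Sort by descending peer frequency, then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(statement, count / len(peers)) for statement, count in ranked[:top_k]]


# Toy KG for illustration only; it is not an excerpt of Wikidata.
toy_kg = {
    "stephen hawking": {("occupation", "physicist")},
    "max planck": {
        ("occupation", "physicist"),
        ("award received", "nobel prize in physics"),
    },
    "albert einstein": {
        ("occupation", "physicist"),
        ("award received", "nobel prize in physics"),
        ("instrument", "violin"),
    },
}

candidates = infer_negatives(
    "stephen hawking", toy_kg, ["max planck", "albert einstein"]
)
```
        </preformat>
        <p>In this toy setting the award statement is shared by both peers and therefore ranks above the statement held by only one of them, mirroring the stephen hawking example in the text.</p>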
        <sec id="sec-3-7-1">
          <title>To ensure reproducibility, the randomness (temperature) for all LLM variants is set to 0.</title>
          <p>• Human<sup>2</sup>: We ask MTurkers to write lists of salient negative statements about a given
subject. We show them examples of what a salient negative statement looks like. We
collect, for each subject, two lists of statements from two workers. The performance is
later measured as the average of the two.</p>
        </sec>
        <sec id="sec-3-7-2">
          <title>Metrics. For the returned statements, we measure:</title>
          <p>• Correctness: The true negativity (is it actually false?) and factuality of a statement (is
it a judgeable statement?), e.g., not an opinion. We allow the labels: correct, incorrect,
ambiguous, or positive meaning. Samples are shown in Table 3.
• Salience: The unexpectedness, informativeness, or interestingness of a statement. We
allow: salient (1), somehow salient (0.5), and nonsalient (0).</p>
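          <p>One plausible way to turn these annotations into the @k scores reported in this section is sketched below in Python. The exact aggregation is an assumption: the function names are hypothetical, correctness@k is taken as the share of the top-k statements labeled correct, and salience@k is averaged only over the correct statements among the top k, following the remark that salience is computed over correct statements.</p>
          <preformat>
```python
def correctness_at_k(labels, k):
    """Share of the top-k statements whose label is 'correct'."""
    top = labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label == "correct") / len(top)


def salience_at_k(labels, scores, k):
    """Mean salience (1, 0.5, or 0) over the correct statements in the top k."""
    kept = [
        score
        for label, score in zip(labels[:k], scores[:k])
        if label == "correct"
    ]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)


# Illustrative annotations for one ranked list (not data from the study).
example_labels = ["correct", "positive meaning", "correct", "ambiguous", "incorrect"]
example_scores = [1, 0, 0.5, 0, 0]
```
          </preformat>
          <p>Restricting salience to correct statements keeps the two metrics separate: a list is not rewarded for statements that are interesting but false.</p>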
          <p>Results are annotated for salience by 2 domain experts<sup>3</sup>, with inter-annotator agreement
= 60%. Correctness, the more straightforward metric of the two, was annotated by 1 of the
domain experts.</p>
          <p><sup>2</sup>We are aware of the risk that workers might use LLMs to generate these statements. In the absence of reliable
detection tools for this newly emerging problem, we rely on our personal judgement as well as string matching to
discard untrustworthy answers. In particular, any response that matches the exact wording of one of the responses
of the LLM baselines, or any near-duplicate among the human generations, was rejected.</p>
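          <p>The string-matching filter described in this footnote could look roughly like the following Python sketch. The details are assumptions rather than the procedure actually used: the similarity measure (difflib's SequenceMatcher), the 0.9 threshold, the normalization, and the choice to keep the first of several near-duplicates are all illustrative, and the function names are hypothetical.</p>
          <preformat>
```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace before comparing strings."""
    return " ".join(text.lower().split())


def is_near_duplicate(a, b, threshold=0.9):
    """True if the normalized strings are at least `threshold` similar."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def filter_responses(human_responses, llm_responses, threshold=0.9):
    """Drop exact matches with LLM outputs and near-duplicate human answers."""
    llm_seen = {normalize(response) for response in llm_responses}
    kept = []
    for response in human_responses:
        if normalize(response) in llm_seen:
            continue  # same wording as an LLM baseline
        if any(is_near_duplicate(response, other, threshold) for other in kept):
            continue  # near-duplicate of an answer that was already kept
        kept.append(response)
    return kept


kept_responses = filter_responses(
    [
        "isn't a social media platform",
        "doesn't have a billion members",
        "Doesn't have a  billion members",
        "wasn't founded by mark zuckerberg",
    ],
    ["Isn't a social media platform"],
)
```
          </preformat>
          <p>Such a filter only catches verbatim or lightly edited copies; paraphrased LLM output would still require the manual judgement mentioned above.</p>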
          <p><sup>3</sup>Experts on the topic of salient negative knowledge at web-scale.</p>
          <p>Table 4 rows (sal:nonsal): 3:0, 0:3, 3:3, 10:10.</p>
          <p>
True Factuality and Negativity of Statements. Results for correctness are shown in Table
2, and investigated further in Table 3. The KG inferences model ranks first on correctness
overall. This is due to the factuality of KG statements. KG triples, especially encyclopedic
ones, are expressed using precise and well-defined relations, such as award received. Moreover,
they have been curated using manual and automated techniques, and hence, their truthfulness
is easy to verify. In addition, both ChatGPT variants perform significantly better
than the Alpaca variants on correctness in both domains, by a margin of up to 36%
in correctness@1. We also notice that, for both Alpaca and ChatGPT, the few-shot probes
perform better than the zero-shot probes, with an improvement of 16% for Alpaca and 5% for
ChatGPT. Finally, we find that many of the statements generated by humans and LLMs were
actually statements with negative keywords but a positive meaning, such as lebanon isn’t devoid
of historical sites, with up to 14% of generated statements for the former and 11% for the latter.</p>
        </sec>
        <sec id="sec-3-7-3">
          <title>More samples are in Appendix C.</title>
          <p>Salience of Truly Negative Statements. Results for salience are shown in Table 2. This
metric is only computed over (previously annotated) correct statements. The best performances
are shared between the KG inferences model and ChatGPT’s few-shot variant. Though not
performing comparably well overall, the text extractions model ranks first on salience of
encyclopedic subjects @3 and @5. This is especially apparent for prominent entities, which are frequently
queried using famous search engines. Again, ChatGPT’s variants significantly outperform
Alpaca’s on the notion of salience, with up to 23% improvement in salience@1, maintaining
the same level of quality for both types of subjects. Sample results from all models are shown
in Table 1 and Appendix E. An experiment on the quality of generated negatives over two
popularity levels, namely prominent and long tail subjects, is in Appendix B.</p>
          <p>Effect of k Value on LLM’s Few-shot Probe. We examine the LLM using different numbers
of samples for in-context learning. We consider a subset of 5 entities (3 encyclopedic and 2
commonsense), and assess the performance of the few-shot ChatGPT using different values
of k, with different salient:nonsalient ratios. Results are in Table 4. Adding a small but equal
number of salient and nonsalient samples (3:3) improves the correctness by 8%, compared to
only adding salient samples (3:0), but at the expense of salience, which drops by
14%. Adding only nonsalient samples (0:3) compromises both metrics. Finally, adding a larger
but equal number of salient and nonsalient samples (10:10) does not result in any improvements.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Take-home Lessons &amp; Open Issues</title>
      <p>In this paper, we perform a systematic evaluation of LLMs’ ability to generate salient negative
statements. We assess them against existing methods and crowdsourced statements. We find that
LLMs’ few-shot probes show promising results in salience@1. Moreover, we find that ChatGPT
outperforms Alpaca on this task, in both correctness and salience. One of the remaining
limitations, however, is the ability of LLMs to recognize truly negative factual statements, as
opposed to ambiguous, or seemingly negative statements with positive meaning. We hope that
this study, as well as the following observations, gives insights to future researchers on this topic.</p>
      <p>Prompt Engineering. There is a wide consensus that LLMs are very powerful when asked
for information in the right manner. In our task, we notice that the wording, especially
of the zero-shot probe, changes the results dramatically. For instance, using the expressions
negative statements, negated statements, and negation statements returns completely different
responses. The probe with the word negated (alone, without salient factual) returns
obviously true statements with negative keywords added to them, e.g., “stephen hawking was not
a physicist”. The probe with the word negative does not return any results, but an apology from
the AI about not being able to give bad statements about individuals. On this and other tasks,
designing intuitive prompts and studying the ability of LLMs to understand them is the most
important part of the process [17].</p>
      <p>The Notion of Salient Negation. Assessing the truthfulness of statements is one thing, but
assessing the salience of negatives is more challenging. Salience is a subjective metric. For
instance, for a basketball fan, the fact that jordan did not star in the film space jam 2 (the first
was built around him) is a big deal. For others, the salience is not obvious. In addition to
the expertise of the reader, the nature of the reader is also important. In other words, are these negations
generated for a human reader, or to equip machines with better negative knowledge? For
instance, what might not appear salient to a human can be important for improving the reasoning
skills of a chatbot. In this study, we assume that the reader is a human, who usually has a higher
standard for what is interesting than a machine. Generally, the design of experiments should take
into consideration downstream applications and information about the end-user.</p>
      <p>Maintenance. Ideally, models must always keep track of real-world changes which affect
the truthfulness of statements, coverage of emerging entities, etc. This is relatively easy in
collaborative knowledge graphs, which are updated on a daily basis. For LLMs, the process of
re-training is much more expensive. For example, in May 2023, ChatGPT still generates the statement
brendan fraser has never won an oscar, which is no longer true, due to his win in 2023 (the
training of the model was completed in September 2021).
</p>
      <p>[3] H. Arnaout, S. Razniewski, G. Weikum, Enriching knowledge bases with interesting negative statements, in: AKBC, 2020.</p>
      <p>[4] T. Safavi, J. Zhu, D. Koutra, NegatER: Unsupervised Discovery of Negatives in Commonsense Knowledge Bases, in: EMNLP, 2021.</p>
      <p>[5] J. Romero, S. Razniewski, K. Pal, J. Z. Pan, A. Sakhadeo, G. Weikum, Commonsense properties from query logs and question answering forums, in: CIKM, 2019.</p>
      <p>[6] G. Karagiannis, I. Trummer, S. Jo, S. Khandelwal, X. Wang, C. Yu, Mining an "anti-knowledge base" from Wikipedia updates with applications to fact checking and beyond, PVLDB (2019).</p>
      <p>[7] N. Lee, B. Z. Li, S. Wang, W.-t. Yih, H. Ma, M. Khabsa, Language models as fact checkers?, in: ACL, FEVER workshop, 2020.</p>
      <p>[8] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, A. Miller, Language models as knowledge bases?, in: EMNLP, 2019.</p>
      <p>[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL, 2019.</p>
      <p>[10] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models are Unsupervised Multitask Learners, OpenAI technical report (2019).</p>
      <p>[11] N. Kassner, H. Schütze, Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly, in: ACL, 2020.</p>
      <p>[12] J. Chen, W. Shi, Z. Fu, S. Cheng, L. Li, Y. Xiao, Say what you mean! large language models speak too positively about negative commonsense knowledge, arXiv (2023).</p>
      <p>[13] OpenAI, Introducing chatgpt, https://openai.com/blog/chatgpt, 2022.</p>
      <p>[14] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledge base, CACM (2014).</p>
      <p>[15] T. Nguyen, S. Razniewski, J. Romero, G. Weikum, Refined commonsense knowledge from large-scale web contents, TKDE (2022).</p>
      <p>[16] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Alpaca: A strong, replicable instruction-following model, https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023.</p>
      <p>[17] J. Jang, S. Ye, M. Seo, Can large language models truly understand prompts? a case study with negated prompts, in: Proceedings of The 1st Transfer Learning for Natural Language Processing Workshop, 2023.</p>
    </sec>
    <sec id="sec-5">
      <title>A. Encyclopedic and Commonsense Subjects</title>
      <p>We consider 50 subjects of different domains, namely encyclopedic and commonsense, and of different
popularity, namely prominent and long tail (see Table 5).</p>
      <sec id="sec-5-1">
        <title>Encyclopedic</title>
        <p>Prominent: stephen hawking, michael jordan, lebanon, michelle obama, microsoft, china, amazon, albert einstein, the beatles, elon musk, angela merkel, taxi driver, taj mahal, white house, eat pray love, tom cruise, brendan fraser, the godfather, my cousin vinny, mercedes-benz group, gmc, linkedin
Long tail: peri gilpin, caramel, ubisoft
Commonsense
Prominent: elephant, soup, lawyer, acne, mother, gorilla, pancake, newspaper, jaguar, avocado, garlic, chef, salad, rabbit, jogging, cufflink, strudel, librarian, armchair
Long tail: tabbouleh, breadfruit, kitchenette, hockey stick, basketball court, coffee table</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>B. Prominent and Long Tail Subjects</title>
      <p>We recompute the quality of negatives (@5) over two levels of subject popularity, namely
prominent and long tail. Figure 1 indicates a significant decrease in both salience and correctness
for long tail subjects with the text-based method, which drops to only 1% on salience. With query
logs as the corpus, this is expected: users query prominent/trendy subjects much more frequently than long
tail ones. We find the human-written statements for both popularity levels comparable, with a
slight advantage for prominent subjects. Similarly, the KG inferences model shows comparable
results, with a slight advantage for prominent subjects in correctness, and for long tail subjects in
salience. Finally, we find an unexpected improvement, for all LLM variants, of long tail subjects
over prominent ones, in both metrics. One interpretation could be the large amount of noisy
web sources (the main data source for training LLMs) about famous entities. For example, tabbouleh
(long tail) is a specific instance of salad (prominent). While negatives about the former are more
clear-cut, e.g., tabbouleh isn’t made with rice but bulgur, negatives about the latter seem more
unfocused, e.g., salad isn’t always a healthy choice.</p>
    </sec>
    <sec id="sec-7">
      <title>C. Negative Statement with Positive Meaning</title>
      <p>As shown in Table 3, many of the LLM-generated and crowdsourced statements are in fact
positive. Some of the recurring expressions which convey a positive meaning using negative
keywords:</p>
      <p>Figure 1 (axes: Salience, Correctness; series: Text Extractions, KG Inferences, ChatGPT 0-shot, ChatGPT k-shot, Alpaca 0-shot, Alpaca k-shot, Human).</p>
      <sec id="sec-7-1">
        <title>Expression: not exclusively (15 statements)</title>
        <p>Example: Amazon did not exclusively focus on selling its own products.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Expression: not without (2)</title>
        <p>Example: Strudel is not tasty without sugar.</p>
        <p>Expression: not just (9)
Example: Acne is not just a teenage problem.</p>
        <p>Expression: not only (20)
Example: Librarians do not only work in public libraries.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Expression: not limited to (5)</title>
        <p>Example: Coffee tables are not limited to indoor use.</p>
      </sec>
      <sec id="sec-7-4">
        <title>Expression: not solely (7)</title>
        <p>Example: GMC does not solely operate in the United States.
Expression: not all (10)</p>
        <p>Example: Not all librarians are women.</p>
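        <p>The recurring expressions listed in this appendix lend themselves to a simple pattern check. The Python sketch below is a heuristic illustration, not the annotation procedure used in the study (labels were assigned manually): the names KEYWORDS, PATTERN, expand_contractions, and positive_meaning_expression are hypothetical, and the tolerance of up to two words between "not" and the keyword is an assumption chosen so that "not tasty without" is still recognized.</p>
        <preformat>
```python
import re

# Keywords taken from the expressions listed in this appendix.
KEYWORDS = ["exclusively", "without", "just", "only", "limited to", "solely", "all"]

# "not", then up to two intervening words (assumed tolerance), then a keyword.
PATTERN = re.compile(
    r"\bnot\b(?:\s+\w+){0,2}?\s+(" + "|".join(KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Rewrite n't as ' not' so that contracted negations are matched too."""
    return re.sub(r"n['’]t\b", " not", text)


def positive_meaning_expression(statement):
    """Return the matched expression (e.g. 'not only'), or None if absent."""
    match = PATTERN.search(expand_contractions(statement))
    if match is None:
        return None
    return "not " + match.group(1).lower()


flagged = positive_meaning_expression("Librarians do not only work in public libraries.")
```
        </preformat>
        <p>A match only flags a candidate for review; statements where the keyword is far from the negation, such as didn’t play for the bulls exclusively, are missed by this pattern.</p>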
      </sec>
    </sec>
    <sec id="sec-8">
      <title>D. k-shot In-context Learning Probe</title>
      <p>In this probe, k=6 (3:3); LLM=ChatGPT.</p>
      <p>A salient factual negated statement about an entity means that the statement doesn’t hold
in reality. Moreover, the negated statement is either surprising, unexpected, or useful to the
reader. For example:
penguins can’t fly.
istanbul isn’t the capital of turkey.</p>
      <p>tom cruise never won an oscar.</p>
      <sec id="sec-8-1">
        <title>On the other hand, the following examples are factual negated statements that are not salient:</title>
        <p>penguins can’t run for presidency.
istanbul isn’t the capital of france.</p>
        <p>tom cruise never won the nba best player award.</p>
        <p>Given this definition and examples, write a list of 3 salient factual negated statement about
microsoft.</p>
        <p>Answer:
1. is not primarily a dating platform.
2. does not charge users a fee to create an account.</p>
        <p>3. does not allow users to post anonymous content.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>E. Sample Results</title>
      <p>The following tables show the top results about linkedin, chef, and angela merkel, respectively:</p>
      <p>Top Negative Statements (linkedin)
Methods: Text, KG, ChatGPT 0-shot, ChatGPT k-shot, Alpaca 0-shot, Alpaca k-shot, Human
isn’t working
isn’t loading
isn’t headquartered in san francisco
isn’t a software company
isn’t designed for sharing personal content
doesn’t permit users to buy followers
isn’t used for online dating
doesn’t allow users to post pictures of their pets
doesn’t have a user-friendly interface
doesn’t provide any value to its users
isn’t a social media platform
doesn’t own the content posted on its platform
doesn’t have a billion members
wasn’t founded by mark zuckerberg</p>
      <p>Top Negative Statements (chef)
Methods: Text, KG, ChatGPT 0-shot, ChatGPT k-shot, Alpaca 0-shot, Alpaca k-shot, Human
doesn’t wear hat
doesn’t eat their own food
doesn’t take orders
doesn’t bring drinks
didn’t use any garlic
didn’t win any cooking competitions
doesn’t just cook food
not all have formal culinary training
don’t need to have an understanding of nutrition
don’t need to have good knife skills
don’t need to be certified
don’t usually work with raw ingredients
doesn’t wash the dishes
doesn’t always wear the chef’s hat</p>
      <p>Top Negative Statements (angela merkel)
Methods: Text, ChatGPT 0-shot, ChatGPT k-shot, Alpaca 0-shot, Alpaca k-shot
didn’t listen to donald trump
doesn’t deserve to be honoured by germany
isn’t on twitter
isn’t a lawyer
isn’t a native german speaker
didn’t originally pursue a career in politics
has never been married
is not a member of the SPD
isn’t a member of the CDU
isn’t a scientist
isn’t the first female chancellor of germany
isn’t from east germany</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Arnaout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <article-title>Completeness, recall, and negation in open-world knowledge bases: A survey</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Arnaout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>UnCommonSense: Informative negative knowledge about everyday concepts</article-title>
          ,
          in:
          <source>CIKM</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>