<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bradley P. Allen</string-name>
          <email>b.p.allen@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul T. Groth</string-name>
          <email>p.t.groth@uva.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The 23rd International Semantic Web Conference</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Amsterdam</institution>
          ,
          <addr-line>Science Park 900, 1098 XH Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Evaluating large language models (LLMs) for tasks like fact extraction in support of knowledge graph construction frequently involves computing accuracy metrics using a ground truth benchmark based on a knowledge graph (KG). These evaluations assume that errors represent factual disagreements. However, human discourse frequently features metalinguistic disagreement, where agents differ not on facts but on the meaning of the language used to express them. Given the complexity of natural language processing and generation using LLMs, we ask: do metalinguistic disagreements occur between LLMs and KGs? Based on an investigation using the T-REx knowledge alignment dataset, we hypothesize that metalinguistic disagreement does in fact occur between LLMs and KGs, with potential relevance for the practice of knowledge graph engineering. We propose a benchmark for evaluating the detection of factual and metalinguistic disagreements between LLMs and KGs. An initial proof of concept of such a benchmark is available on GitHub.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge</kwd>
        <kwd>large language models</kwd>
        <kwd>knowledge graphs</kwd>
        <kwd>fact checking</kwd>
        <kwd>metalinguistic disagreement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent years have seen a surge of interest in the use of LLMs for purposes of knowledge engineering
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. LLMs are being used to perform text classification, sentiment analysis, and natural language
inference, exploiting next-token prediction to generate text that can be transformed into the type of
symbolic outputs normally produced in these tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Increasing emphasis is being placed on the
use of LLMs in knowledge graph construction [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The results have been encouraging, but a major
concern that has emerged is the impact of hallucination, which is defined as the presence of factually
incorrect or unjustified assertions in the output of LLMs [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Benchmarks such as SHROOM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
and WildHallucinations [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have been developed to evaluate the ability to detect hallucination when it
occurs in LLM output.
      </p>
      <p>
        A number of mechanisms have been proposed to mitigate hallucination in LLMs through the use of
knowledge from a variety of sources, including natural language text, KGs, and rules, to ground [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] an
LLM. Retrieval-augmented generation (RAG) is a specific version of this approach that has attracted
a great deal of interest, particularly in the context of commercial applications [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Such
knowledge-enhanced LLMs [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] show improvements in the performance of natural language understanding and
generation tasks. However, even with such improvements, knowledge-enhanced LLMs still produce
errors as measured using common evaluation metrics (e.g. F1 measures for classification). These
evaluation metrics are calculated by measuring the difference between an LLM’s output and ground
truth as provided in fact checking benchmarks such as LAMA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], KAMEL [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and FActScore [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>The errors reported by these metrics are typically assumed to stem from disagreement about facts. But
there is another way in which these differences can arise.</p>
      <p>
        Metalinguistic disagreement [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
        ] occurs
when people argue about the meaning or use of words rather than about facts or ideas. In contrast, a
factual disagreement is about what is actually true in the world. Examples of factual disagreement are
debating whether a tomato is healthier than an apple, or debating whether Sarah is taller than John; in
contrast, examples of metalinguistic disagreement are arguing whether a tomato should be called a
fruit or a vegetable, or arguing about what height qualifies as “tall” when describing a person.
      </p>
      <p>Consider the following scenario: a knowledge-enhanced LLM generates an output that contradicts
ground truth provided by a KG. This is used as evidence that the LLM has committed a factual error in
its output. However, in producing its output, the knowledge-enhanced LLM has provided a rationale
that indicates that there is a disagreement about the meaning of a term that has led to the output. Can
this occur in practice? Our hypothesis is that it does.</p>
      <p>
        Why would this matter? Factual disagreements can be resolved through knowledge graph refinement
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] or through few-shot in-context learning that provides the correct facts to the LLM; however,
metalinguistic disagreements may require ontology engineering to address representational issues with
a knowledge graph, or the engineering of prompts that incorporate natural language intensional
definitions of concepts for an LLM [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Data governance [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] also acknowledges the importance of
establishing metalinguistic agreement of intensional definitions of concepts and relations in natural
language and their realization in databases and database schemas; for example, the FAIR principles
[
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ] specifically urge clear documentation of metadata aligning natural language concepts and
metadata in scientific data resources. We therefore argue that distinguishing factual from metalinguistic
disagreement between LLMs and KGs is relevant to the practice of knowledge graph and ontology
engineering.
      </p>
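      <p>To illustrate the kind of prompt engineering mentioned above, the following minimal sketch shows how an intensional definition of a concept could be supplied to an LLM in natural language before it is asked to evaluate a triple. The definition text, triple, and the build_prompt helper are illustrative assumptions made for this example; they are not the prompts used in the experiments reported below.</p>
      <preformat>
# Illustrative sketch: the definition text and helper function are assumptions,
# not the prompts used in the experiments described in this paper.

INTENSIONAL_DEFINITION = (
    "Definition: in this knowledge graph, 'fruit' denotes the seed-bearing "
    "structure that develops from the flower of a plant, regardless of "
    "culinary usage."
)

def build_prompt(subject, predicate, obj, context):
    """Compose a zero-shot prompt that fixes the intended meaning of a concept
    before asking the LLM for a truth value."""
    return (
        f"{INTENSIONAL_DEFINITION}\n\n"
        f"Context: {context}\n\n"
        f"Using the definition above, is the statement "
        f"'{subject} {predicate} {obj}' true or false? "
        "Explain your reasoning, then answer TRUE or FALSE."
    )

print(build_prompt(
    "tomato", "instance of", "fruit",
    "Botanically, the tomato is the edible berry of the plant Solanum lycopersicum.",
))
      </preformat>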
    </sec>
    <sec id="sec-2">
      <title>2. Evidence for the occurrence of metalinguistic disagreement in</title>
    </sec>
    <sec id="sec-3">
      <title>LLMs</title>
      <p>To test our hypothesis that metalinguistic disagreement is a detectable phenomenon, we conducted a
simple experiment by fact checking a set of knowledge graph triples aligned with natural language text
using an LLM, and then estimating the rate at which metalinguistic disagreement occurs when the LLM
determines the triple is not true.</p>
      <p>
        We randomly sampled 100 Wikipedia abstracts from the 10,000-document sample provided in the
T-REx dataset, a dataset of large-scale alignments between Wikipedia abstracts and Wikidata triples.
T-REx has been widely used in the evaluation of LLM-based fact checking and extraction for knowledge
graphs [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. From the total set of triples aligned with the documents in that sample, we then sampled
250 triples. We then defined a zero-shot chain-of-thought classifier [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] to assign a truth value to an
aligned triple, providing the Wikipedia abstract with which it is aligned as context in the LLM prompt
[
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The classifier was executed to obtain a rationale and a truth value for each of the 250 sampled
triples and aligned abstracts, and each result was then processed by a second zero-shot chain-of-thought
classifier (using gpt-4o-2024-05-13) acting as an LLM-as-a-judge [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], to classify whether the
truth-value-assigning classifier’s rationale indicated a metalinguistic disagreement. Processing required a
total of 2 inference API calls per alignment, per LLM. Evaluations whose statistics are reported below
were conducted during the period from 1 July 2024 to 8 July 2024. Costs incurred through calls to
language model APIs totalled less than $100 USD. Code and data used in the experiments are available
in a GitHub repository (https://github.com/bradleypallen/trex-metalinguistic-disagreement).
      </p>
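      <p>A minimal sketch of this two-stage procedure is given below, using the OpenAI chat completions API for illustration; other providers are called analogously. The prompt texts, answer parsing, and function names are simplified assumptions made for this sketch, not a verbatim reproduction of the code in the GitHub repository.</p>
      <preformat>
# Sketch of the two-stage classification procedure (illustrative; the prompts
# and answer parsing are simplified assumptions, not the repository code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt, model="gpt-4o-2024-05-13"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content

def classify_triple(triple, abstract, model):
    """Stage 1: zero-shot chain-of-thought truth-value assignment, with the
    aligned Wikipedia abstract provided as context."""
    s, p, o = triple
    prompt = (
        f"Context: {abstract}\n\n"
        f"Knowledge graph triple: Subject: {s}, Predicate: {p}, Object: {o}.\n"
        "Is this triple true or false given the context? "
        "Let's think step by step, then answer TRUE or FALSE."
    )
    rationale = ask(prompt, model)
    verdict = "FALSE" if "FALSE" in rationale.upper() else "TRUE"
    return {"triple": triple, "rationale": rationale, "verdict": verdict}

def judge_metalinguistic(result, judge_model="gpt-4o-2024-05-13"):
    """Stage 2: an LLM-as-a-judge decides whether the rationale for a negative
    finding reflects a metalinguistic rather than a factual disagreement."""
    prompt = (
        "A classifier judged a knowledge graph triple to be false and gave "
        "this rationale:\n\n"
        f"{result['rationale']}\n\n"
        "Does the rationale indicate a disagreement about the meaning or use "
        "of a term (a metalinguistic disagreement) rather than a disagreement "
        "about facts? Let's think step by step, then answer YES or NO."
    )
    return "YES" in ask(prompt, judge_model).upper()
      </preformat>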
      <p>As shown in Table 1, over the 9 LLMs evaluated, false negative rates over the 250 sampled T-REx
triples ranged between 0.104 and 0.504 with a mean of 0.246, and the rate of metalinguistic disagreements
between the classifier and Wikidata (i.e., the number of detected metalinguistic disagreements divided
by the number of evaluated alignments) ranged between 0.04 and 0.264 with a mean of 0.097. To
illustrate the nature of disagreements detected, Table 2 shows two examples of false negatives from the
experiment which exhibit metalinguistic disagreement. The complete set of rationales and classifications
is available in the GitHub repository.</p>
      <p>[Table 1: false negative rate and metalinguistic disagreement rate (MDR) for each of the 9 LLMs
evaluated: gpt-4o-2024-05-13, gpt-4-0125-preview, claude-3-haiku-20240307, claude-3-opus-20240229,
claude-3-5-sonnet-20240620, meta-llama/Meta-Llama-3-70b-Instruct, gpt-3.5-turbo,
mistralai/Mixtral-8x7B-Instruct-v0.1, and mistralai/Mistral-7B-Instruct-v0.3.]</p>
      <p>[Table 2: rationales for two negative findings that exhibit metalinguistic disagreement.]</p>
      <p>Rationale 1: Based on the information provided in the hypothetical world, the knowledge graph
triple “Subject: James II of England, Predicate: followed by, Object: Mary II of England”
is false. The passage states that after James II was deposed in 1688, he was replaced
by his daughter Mary II, ruling jointly with her husband William III. This means that
Mary II did not immediately follow James II, but rather replaced him as the monarch.
The predicate “followed by” implies that Mary II succeeded James II directly, which is
not the case.</p>
      <p>Rationale 2: Based on the information provided, the knowledge graph triple “Subject: chocolate,
Predicate: made from material, Object: sugar” is false. The passage clearly states that
chocolate is made from the seeds of the cacao tree, specifically Theobroma cacao, and
that it is roasted and ground to produce the chocolate. There is no mention of sugar
being a material that chocolate is made from. In fact, the passage states that sugar is
added to chocolate to create sweet chocolate, but it is not a primary ingredient that
chocolate is made from. Therefore, the given knowledge graph triple is false.</p>
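      <p>For clarity, the two rates reported above can be computed from per-alignment results as in the following sketch; the record field names are assumptions about the result format, not the schema used in the repository.</p>
      <preformat>
# Sketch of the reported statistics (field names are illustrative assumptions).

def false_negative_rate(results):
    """Fraction of evaluated alignments (all asserted true by Wikidata) that
    the classifier judged to be false."""
    negatives = [r for r in results if r["verdict"] == "FALSE"]
    return len(negatives) / len(results)

def metalinguistic_disagreement_rate(results):
    """Number of detected metalinguistic disagreements divided by the number
    of evaluated alignments."""
    detected = [r for r in results if r.get("metalinguistic_disagreement")]
    return len(detected) / len(results)
      </preformat>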
    </sec>
    <sec id="sec-4">
      <title>3. Proposed benchmark</title>
      <p>We argue that the above results suggest that metalinguistic disagreement between knowledge graphs
and LLMs can occur during fact-checking tasks. However, there are some significant shortcomings in
the above approach:
• Lack of human validation. The detection of metalinguistic disagreement relies on using an
LLM-as-a-judge, which may not be a reliable substitute for human judgment [
        <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
        ]. This
introduces the possibility that the detected “disagreements” are artifacts of how different LLMs
process and generate language, rather than true metalinguistic disagreements. Human review at
scale is needed to validate the results. Without this, it is difficult to determine whether what the
LLMs identify as metalinguistic disagreements aligns with human judgments.
• Possible conflation with other error types. What is interpreted as metalinguistic disagreement
could potentially be other types of errors or inconsistencies in LLM outputs, such as hallucinations
or context misinterpretations.
• Limited sample size. The experiment uses a relatively small sample of 250 triples. A larger-scale
study is needed to draw more robust conclusions.</p>
      <p>We argue that by creating a benchmark metalinguistic disagreement detection dataset that addresses
these limitations, we could more confidently assess the occurrence and nature of metalinguistic
disagreements in LLM-based fact-checking. This would provide a stronger foundation for investigating
our hypothesis and advancing our understanding of how LLMs interpret and disagree about meaning
in knowledge graph engineering contexts.</p>
      <p>Specific requirements for such a benchmark include:
• Human-annotated examples. A set of fact-checking instances annotated by human experts to
identify clear cases of metalinguistic disagreement, factual disagreement, and agreement. This
would serve as a gold standard for evaluation.
• Inter-annotator agreement metrics. Support the evaluation of system performance using
inter-annotator agreement metrics that incorporate knowledge graph ground truth and human
annotations to measure the degrees of inter-agent factual and metalinguistic agreement (a sketch
of such a metric follows this list).
• Multiple knowledge graph sources. Use triples from different knowledge graphs spanning
multiple knowledge domains to account for variations in how relations and concepts are defined
across sources, and to test if metalinguistic disagreements are more prevalent in certain areas.
• Contextual information. Provide relevant context for each fact-checking instance, similar to
the Wikipedia abstracts used by T-REx.
• Examples with ambiguity, temporal aspects, and gradable predicates. Deliberately include
examples with potential for ambiguity or multiple interpretations to probe the boundaries of
metalinguistic disagreement, examples where the truth value of a statement might change over
time, to explore how temporal context affects metalinguistic understanding, and examples with
gradable predicates (e.g., “tall”, “fast”) that might be more prone to metalinguistic disagreement.
• Negative examples. Include clear cases where no metalinguistic disagreement should occur, to
test for false positives.</p>
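      <p>As an illustration of the inter-annotator agreement requirement above, the following sketch computes Cohen’s kappa between two annotators (for instance, a human annotator and an LLM judge) over a shared three-way labeling of instances; the label scheme is an assumption made for the purposes of the example.</p>
      <preformat>
# Sketch: agreement between two annotators over three-way labels
# (the label scheme is an illustrative assumption).
from sklearn.metrics import cohen_kappa_score

LABELS = ["agreement", "factual_disagreement", "metalinguistic_disagreement"]

def pairwise_kappa(annotator_a, annotator_b):
    """Cohen's kappa between two annotators' labels for the same instances."""
    return cohen_kappa_score(annotator_a, annotator_b, labels=LABELS)

# Example usage with toy labels:
human = ["agreement", "metalinguistic_disagreement", "factual_disagreement"]
llm_judge = ["agreement", "metalinguistic_disagreement", "metalinguistic_disagreement"]
print(pairwise_kappa(human, llm_judge))
      </preformat>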
      <p>
        As an initial next step towards this objective, we plan to extend the dataset used in the initial
experiments described above, in a manner similar to that used in the design and implementation of the
SHROOM hallucination detection benchmark [
        <xref ref-type="bibr" rid="ref6 ref28">6, 28</xref>
        ], through crowdsourcing to incorporate human
annotation and by increasing the size of the sample of knowledge alignments from the T-REx dataset.
Human annotators will be presented with a summary of a Wikipedia page and a statement generated
from the Wikidata knowledge graph triple for each alignment, and the annotator must indicate whether
they disagree with the statement, and if so, whether they disagree on the factuality of the statement or
the meaning of any of the terms used in the statement.
      </p>
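      <p>The sketch below shows one possible structure for the annotation records that such a benchmark could collect per alignment; the field names are assumptions made for illustration, not a finalized schema.</p>
      <preformat>
# One possible per-alignment annotation record for the proposed benchmark
# (field names are illustrative assumptions, not a finalized schema).
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AnnotationRecord:
    wikipedia_summary: str   # summary of the Wikipedia page shown to the annotator
    statement: str           # natural language statement generated from the Wikidata triple
    triple: Tuple[str, str, str]  # (subject, predicate, object) identifiers
    annotator_id: str
    disagrees: bool          # does the annotator disagree with the statement?
    disagreement_type: Optional[str] = None  # "factual" or "metalinguistic" when disagrees is True
      </preformat>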
      <p>In conclusion, we anticipate that such a benchmark can not only shed light on the nature and
frequency of metalinguistic disagreements between LLMs and KGs, but also contribute to the ongoing
debate about LLMs’ capacity for generating meaningful statements. Some have argued that LLMs are
incapable of understanding meaning in the way humans do [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Others are exploring ways in which
LLMs might be capable of at least some limited or partial forms of meaning as a consequence of either
the model’s pre-training or its grounding through in-context learning [
        <xref ref-type="bibr" rid="ref30 ref31 ref32 ref33 ref34">30, 31, 32, 33, 34</xref>
        ]. We believe
that the proposed benchmark can contribute to a more nuanced view of the epistemic status of LLMs
relative to KGs based on two-component semantics [
        <xref ref-type="bibr" rid="ref35 ref36 ref37">35, 36, 37</xref>
        ], and support experimental work in
determining whether or not LLMs can generate meaningful statements or be claimed to have beliefs
[
        <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
        ].</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work was partially supported by EU’s Horizon Europe research and innovation programme within
the ENEXA project (grant agreement no. 101070305). The authors wish to thank Frank van Harmelen,
Levin Hornischer, Filip Ilievski, Jan-Christoph Kalo, Aybüke Özgün, Lise Stork, and Klim Zaporojets for
discussions and suggestions that have been invaluable in refining this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Stork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <article-title>Knowledge Engineering Using Large Language Models</article-title>
          ,
          <source>Transactions on Graph Data and Knowledge</source>
          <volume>1</volume>
          (
          <year>2023</year>
          ) 3:
          <fpage>1</fpage>
          -3:
          <lpage>19</lpage>
          . URL: https://drops.dagstuhl.de/entities/document/10.4230/TGDK.1.1.3. doi:10.4230/TGDK.1.1.3.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          , G. Neubig,
          <article-title>Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>55</volume>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.1145/3560815. doi:10.1145/3560815.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Koutsiana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nwachukwu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meroño-Peñuela</surname>
          </string-name>
          , E. Simperl,
          <article-title>Knowledge Prompting: How Knowledge Engineers Use Large Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2408.08878</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , et al.,
          <article-title>A survey on hallucination in large language models: Principles, taxonomy</article-title>
          , challenges, and open questions,
          <source>arXiv preprint arXiv:2311.05232</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mickus</surname>
          </string-name>
          , E. Zosa,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vázquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vahtola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Segonne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raganato</surname>
          </string-name>
          , M. Apidianaki,
          <article-title>SemEval-2024 Task 6: SHROOM, a shared-task on hallucinations and related observable overgeneration mistakes</article-title>
          , in:
          <source>Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)</source>
          , Association for Computational Linguistics
          , Mexico City, Mexico,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ravichander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chandu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          , et al.,
          <article-title>WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries</article-title>
          ,
          <source>arXiv preprint arXiv:2407.17468</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Harnad</surname>
          </string-name>
          , Language Writ Large: LLMs, ChatGPT, Grounding, Meaning and Understanding,
          <source>arXiv preprint arXiv:2402.02243</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2312.10997</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A Survey of Knowledge Enhanced Pre-Trained Language Models</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>1413</fpage>
          -
          <lpage>1430</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <article-title>Language models as knowledge bases?</article-title>
          ,
          <source>arXiv preprint arXiv:1909.01066</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fichtel</surname>
          </string-name>
          ,
          <article-title>KAMEL: Knowledge Analysis with Multitoken Entities in Language Models</article-title>
          , in: AKBC,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , H. Hajishirzi,
          <article-title>FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation</article-title>
          ,
          <source>arXiv preprint arXiv:2305.14251</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Plunkett</surname>
          </string-name>
          , T. Sundell, Varieties of metalinguistic negotiation,
          <source>Topoi</source>
          <volume>42</volume>
          (
          <year>2023</year>
          )
          <fpage>983</fpage>
          -
          <lpage>999</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Plunkett</surname>
          </string-name>
          , T. Sundell,
          <article-title>Disagreement and the semantics of normative and evaluative terms</article-title>
          ,
          <source>Philosophers’ Imprint</source>
          <volume>13</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Rudolph</surname>
          </string-name>
          , Contested metalinguistic negotiation,
          <source>Synthese</source>
          <volume>202</volume>
          (
          <year>2023</year>
          )
          <fpage>90</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <article-title>Knowledge graph refinement: A survey of approaches and evaluation methods</article-title>
          ,
          <source>Semantic web 8</source>
          (
          <year>2017</year>
          )
          <fpage>489</fpage>
          -
          <lpage>508</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <article-title>Conceptual Engineering Using Large Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.03749</source>
          (
          <year>2023</year>
          ). arXiv:2312.03749.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V.</given-names>
            <surname>Khatri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <article-title>Designing data governance</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>53</volume>
          (
          <year>2010</year>
          )
          <fpage>148</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Aalbersberg</surname>
          </string-name>
          , G. Appleton,
          <string-name>
            <given-names>M.</given-names>
            <surname>Axton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Blomberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Boiten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. B. da Silva</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Bourne</surname>
          </string-name>
          , et al.,
          <article-title>The FAIR Guiding Principles for scientific data management and stewardship</article-title>
          ,
          <source>Scientific data 3</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Vogt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Strömert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matentzoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Karam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Konrad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prinz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baum</surname>
          </string-name>
          ,
          <article-title>FAIR 2.0: Extending the FAIR Guiding Principles to Address Semantic Interoperability</article-title>
          ,
          <source>arXiv preprint arXiv:2405.03345</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>H.</given-names>
            <surname>Elsahar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vougiouklis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Remaci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gravier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Laforest</surname>
          </string-name>
          , E. Simperl,
          <article-title>T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples</article-title>
          , in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018),
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, Advances in Neural Information Processing Systems 35 (2022) 22199-22213.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] B. P. Allen, P. T. Groth, Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models, in: European Semantic Web Conference, 2024. arXiv:2404.17000, to appear.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C.-H. Chiang, H.-y. Lee, Can large language models be an alternative to human evaluations?, arXiv preprint arXiv:2305.01937 (2023).</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] A. Bavaresco, R. Bernardi, L. Bertolazzi, D. Elliott, R. Fernández, A. Gatt, E. Ghaleb, M. Giulianelli, M. Hanna, A. Koller, et al., LLMs instead of human judges? A large scale empirical study across 20 NLP evaluation tasks, arXiv preprint arXiv:2406.18403 (2024).</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] A. S. Thakur, K. Choudhary, V. S. Ramayapally, S. Vaidyanathan, D. Hupkes, Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, arXiv preprint arXiv:2406.12624 (2024).</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] B. Allen, F. Polat, P. Groth, SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection, in: Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Association for Computational Linguistics, Mexico City, Mexico, 2024. URL: https://doi.org/10.48550/arXiv.2404.03732.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] E. M. Bender, A. Koller, Climbing towards NLU: On meaning, form, and understanding in the age of data, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5185-5198.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] M. Mandelkern, T. Linzen, Do Language Models’ Words Refer?, arXiv preprint arXiv:2308.05576 (2024).</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] H. Lederman, K. Mahowald, Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs, arXiv preprint arXiv:2401.04854 (2024).</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] B. A. Levinstein, D. A. Herrmann, Still no lie detector for language models: Probing empirical and conceptual roadblocks, Philosophical Studies (2024) 1-27.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] G. Baggio, E. Murphy, On the referential capacity of language models: An internalist rejoinder to Mandelkern &amp; Linzen, arXiv preprint arXiv:2406.00159 (2024).</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] J. Grindrod, Large language models and linguistic intentionality, Synthese 204 (2024) 71.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] F. Berto, Topics of Thought: The Logic of Knowledge, Belief, Imagination, Oxford University Press, 2022.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] P. Hawke, Theories of aboutness, Australasian Journal of Philosophy 96 (2018) 697-723.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] P. Hawke, L. Hornischer, F. Berto, Truth, topicality, and transparency: one-component versus two-component semantics, Linguistics and Philosophy (2024) 1-23.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] D. A. Herrmann, B. A. Levinstein, Standards for Belief Representations in LLMs, arXiv preprint arXiv:2405.21030 (2024).</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] J. Harding, Operationalising representation in natural language processing, The British Journal for the Philosophy of Science (2023). To appear.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>