<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Knowledge Graphs to Mitigate Counting Hallucinations: A Case Study of Wikidata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fariz Darari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaycent G. Ongris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Berty C. L. Tobing</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Douglas R. Faisal</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>On Lee</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Science</institution>
          ,
          <addr-line>Universitas Indonesia, Depok 16424</addr-line>
          ,
          <country country="ID">Indonesia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>GDP Labs</institution>
          ,
          <addr-line>Jakarta 12950</addr-line>
          ,
          <country country="ID">Indonesia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>When answering questions, language models (LMs) often ground their responses in unstructured textual sources. However, structured sources such as knowledge graphs (KGs) often contain valuable counting or cardinality facts, e.g., the number of children a person has, the number of seasons in a TV series, or the number of branches a company operates. Leveraging KGs can help LMs reduce hallucinations in count-based queries, such as "How many children does X have?" or "How many branches does company Y have?". This work introduces the problem of counting hallucinations and proposes a novel LM-based QA approach that integrates structured counting knowledge from KGs like Wikidata to address these shortcomings. We also introduce the first benchmark dataset for counting QA, comprising over 10,000 entries with more than 30,000 counting questions. Through our experiments, we compare QA accuracy in various scenarios: using no structured counting knowledge at all and using our KGQA methods without vs. with perfect entity extraction. We also examine how performance differs between a smaller language model and a larger, more advanced model. The results on Wikidata show that incorporating structured counting knowledge leads to a substantial improvement in accuracy, with more than a 60% gain even without perfect entity extraction. This highlights the effectiveness and promise of our approach for advancing future KGQA research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The use of language models (LMs) has skyrocketed in recent years, powering applications
across a wide range of domains such as travel, journalism, code writing, data analysis, and even
agriculture [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This widespread adoption is underscored by the fact that AI agents and bots
accounted for over 51% of web traffic in 2024, surpassing human activity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This surge in bot
traffic, largely driven by AI agents crawling websites to gather data, highlights the growing
dependence on LMs for fast and accurate responses.
      </p>
      <p>
        However, a persistent issue with LMs is hallucination, that is, the tendency to generate
plausible-sounding but factually incorrect information [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This becomes especially problematic
when users seek accurate and verifiable answers. To mitigate this, LMs should be grounded
in textual sources such as internal documents and web content. Moreover, LMs can rely on
structured sources, namely knowledge graphs (KGs), to ground model outputs in curated facts
and ontologies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As reliance on LMs grows, grounding them in structured, verifiable sources
such as KGs becomes increasingly important to ensure factual consistency and trustworthiness.
      </p>
      <p>Among the many question types that challenge LMs, counting questions, which ask about the
number of entities associated with a subject, are both frequent and error-prone. These questions
usually begin with "How many..." and do not involve explicit units, relying instead on context
(e.g., "How many branches does company X have?"). KGs, particularly those like Wikidata,
contain structured representations of such cardinality or counting facts, often encoded through
properties like "number of children" or "number of branches." In fact, Wikidata includes nearly
90 properties prefixed with "number of..." and further analysis reveals that Wikidata incorporates
over 160 properties related to cardinalities (see https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all).</p>
      <p>This paper is motivated by the readily available counting knowledge within Wikidata (https://www.wikidata.org/),
one of the largest and most widely-used KGs. The central question we address is how to
effectively utilize this resource to answer natural language questions concerning counts, which
are often susceptible to hallucination. Specifically, we aim to mitigate the problem of counting
hallucinations, that is, the generation of inaccurate numerical answers to counting questions, by
leveraging KGs. To the best of our knowledge, this issue has not been systematically investigated
within the framework of LM-based knowledge graph question answering (KGQA).</p>
      <p>To illustrate a case of counting hallucination, we conducted a preliminary experiment using
the language model Gemma 3 4B (https://deepmind.google/models/gemma/). When asked about the number of children of Neneh Cherry
(a Swedish singer-songwriter), Gemma incorrectly stated that she has four children, even listing
names of (fictional) children and their partners (e.g., Lyric Vanessa Cherry with partner Eric
Salmon). We then posed the same question to a more advanced model, Claude Sonnet 4, and
obtained the same incorrect answer (i.e., four children). However, upon verification, we found
that Neneh Cherry actually has three children. Notably, the correct count of her children is
readily available on her Wikidata page (https://www.wikidata.org/wiki/Q233342). We argue
that this information, along with a wealth of other counting knowledge present in Wikidata,
can be further leveraged to enhance KGQA.</p>
      <p>This paper directly addresses this issue. Our contributions are:
1. We introduce the problem of counting hallucinations in LM responses.
2. We investigate how counting facts are represented in a KG, using Wikidata as a case
study.
3. We propose methods for extracting and integrating counting knowledge into KGQA
systems.
4. We construct the first benchmark dataset specifically tailored for counting question
answering.
5. We evaluate our approaches on a large-scale KG (that is, Wikidata) and analyze their
behavior and performance.</p>
      <p>We hope our findings provide practical guidance for designing LM-based QA systems,
especially for questions involving counts. We also expect the approach to transfer to other KGs
beyond Wikidata, including enterprise and domain-specific KGs.</p>
      <sec id="sec-1-1">
        <title>1.1. Scope and Outline</title>
        <p>More broadly, we study an LM-based KGQA approach for simple, single-hop count questions
on Wikidata. We focus on relations that already encode a count (e.g., number of children, number
of seasons), rather than computing counts via entity enumeration or multi-hop aggregation.
This focus reflects practical and ethical considerations: in many settings, entity-level records
(e.g., a person’s children or a company’s branches) may be withheld for privacy or policy
reasons, while an aggregate count is curated and published as a first-class property. Pre-encoded
counts also reduce brittleness relative to COUNT-style queries, which are more sensitive to
KG incompleteness and duplication, and typically require different robustness mechanisms.
Accordingly, multi-hop and compositional counting are out of scope here but complementary
to our setting.</p>
        <p>The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3
details our methodology for identifying and integrating counting facts. Section 4 presents our
results and analysis. Section 5 discusses the implications and limitations of our findings. Finally,
Section 6 concludes the paper and outlines future directions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Knowledge Graphs and RDF</title>
        <p>
          A knowledge graph (KG) is a graph-based data structure designed to capture real-world
knowledge. It consists of nodes representing entities and edges representing the relationships
connecting them [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. KGs can be modeled in the Resource Description Framework (RDF) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. An RDF
graph is a set of RDF triples, consisting of three components: the subject, predicate, and object,
written &lt;S, P, O&gt;. For simplicity, we do not distinguish between RDF terms (literals, IRIs,
and blank nodes). Examples of triples are &lt;Donald Trump, child, Barron Trump&gt;
and &lt;Donald Trump, number of children, 5&gt;. Syntactically, RDF graphs can be
serialized in various formats, such as RDF/XML, Turtle, and JSON-LD. In this paper, we use the
terms knowledge graphs and RDF graphs interchangeably.
        </p>
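          <p>To make the triple model concrete, the following minimal Python sketch (our illustration, not a standard RDF library API) represents an RDF graph as a set of (subject, predicate, object) tuples and matches a SPARQL-style triple pattern in which variables start with "?":</p>
          <p>
```python
# An RDF graph sketched as a set of (S, P, O) tuples, mirroring the
# example triples discussed in the text.
graph = {
    ("DonaldTrump", "child", "BarronTrump"),
    ("DonaldTrump", "child", "IvankaTrump"),
    ("DonaldTrump", "numberOfChildren", 5),
}

def is_var(term):
    """A pattern term is a variable if it starts with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")

def match(pattern, graph):
    """Return one variable-binding dict per triple matching the pattern."""
    results = []
    for triple in graph:
        binding = {}
        for p_term, t_term in zip(pattern, triple):
            if is_var(p_term):
                if p_term in binding and binding[p_term] != t_term:
                    break  # same variable bound to two different values
                binding[p_term] = t_term
            elif p_term != t_term:
                break  # constant term does not match
        else:
            results.append(binding)
    return results

children = match(("DonaldTrump", "child", "?child"), graph)
```
</p>
          <p>Real RDF stores distinguish IRIs, literals, and blank nodes; this sketch deliberately ignores that distinction, as the text does.</p>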
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Wikidata KG</title>
        <p>
          Wikidata, a multilingual "Wikipedia for data", functions as a KG by centralizing and managing
Wikipedia’s factual information, paving the way for new applications through its structured
and integrated data [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Following an open KG approach, Wikidata’s data is openly licensed,
permitting free access, use, modification, and sharing with minimal attribution and openness
requirements. While providing an RDF representation, Wikidata utilizes URIs optimized for
persistence and language neutrality, potentially at the cost of human readability [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. SPARQL Queries</title>
        <p>
          SPARQL [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] serves as the standard query language for retrieving information from KGs. Its core
component, the basic graph pattern, consists of triple patterns that resemble RDF triples but
allow variables in the subject, predicate, and object positions. These patterns identify matching
subgraphs within the RDF data by substituting variables with RDF terms, resulting in a graph
equivalent to the matched portion. For example, the following SPARQL query retrieves all
children of Donald Trump:
        </p>
        <sec id="sec-2-3-1">
          <title>SPARQL Query</title>
          <p># we assume a default namespace for terms used here
SELECT ?child WHERE {
:DonaldTrump :child ?child }</p>
          <p>When executed against a KG, the results of the preceding query would likely include all five
of Donald Trump’s children: Donald Trump Jr., Ivanka Trump, Eric Trump, Tiffany Trump, and
Barron Trump.</p>
          <p>A SPARQL query can also incorporate aggregate functions, such as the COUNT feature. The
following example demonstrates how to count the number of children of Donald Trump:</p>
        </sec>
        <sec id="sec-2-3-1b">
          <title>SPARQL Query</title>
          <p>SELECT (COUNT(?child) AS ?cnt) WHERE {
:DonaldTrump :child ?child }</p>
          <p>A Wikidata-oriented implementation of the above SPARQL query is shown below:</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>SPARQL Query</title>
          <p>SELECT (COUNT(?child) AS ?cnt) WHERE {
wd:Q22686 wdt:P40 ?child }</p>
        </sec>
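        <sec id="sec-2-3-2c">
          <title>Python Sketch</title>
          <p>As an aside, such Wikidata-oriented queries can be issued programmatically against the public query service at https://query.wikidata.org/sparql, which predefines the wd:/wdt: prefixes. The sketch below builds the request URL and parses the standard JSON results format; the response shown is hardcoded so the example is self-contained (no network access), and the helper names are ours.</p>
          <p>
```python
import json
from urllib.parse import urlencode

WDQS = "https://query.wikidata.org/sparql"

def count_query(qid, pid):
    # Build a COUNT query like the one above for any entity/property pair.
    return f"SELECT (COUNT(?x) AS ?cnt) WHERE {{ wd:{qid} wdt:{pid} ?x }}"

def request_url(query):
    # WDQS accepts GET requests with the query and an output format parameter.
    return WDQS + "?" + urlencode({"query": query, "format": "json"})

def parse_count(response_text):
    # Extract the single count value from a WDQS JSON response.
    data = json.loads(response_text)
    return int(data["results"]["bindings"][0]["cnt"]["value"])

url = request_url(count_query("Q22686", "P40"))

# Hardcoded sample in the WDQS JSON results shape, standing in for a
# live response to the query above.
sample = '{"results": {"bindings": [{"cnt": {"value": "5"}}]}}'
count = parse_count(sample)
```
</p>
        </sec>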
        <sec id="sec-2-3-3">
          <title>SPARQL Query</title>
          <p>SELECT ?numOfChildren WHERE {
wd:Q22686 wdt:P1971 ?numOfChildren }</p>
          <p>Wikidata also provides properties well-suited for directly representing counting information.
In the case of children, the query above uses the "number of children" property (P1971), which
readily provides the count for Donald Trump’s children, returning five. In this paper, we
concentrate on such counting properties.</p>
        </sec>
      </sec>
      <sec id="sec-2-4b">
        <title>2.4. KGQA</title>
        <p>
          Knowledge graph question answering (KGQA) leverages the structured knowledge within a KG
to answer questions posed in natural language [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Put simply, KGQA involves using the triples
stored in a KG to address natural language queries. For example, to answer "Where is Jack
Ma’s birthplace?", a KG containing the triple &lt;Jack Ma, born in, Hangzhou&gt; can be
queried to retrieve the entity Hangzhou. In practice, prominent open KGs such as Wikidata
and DBpedia, as well as enterprise KGs maintained by organizations and companies, serve as
valuable sources for KGQA. This work focuses on Wikidata, one of the largest and most popular
KGs in the world. While various KGQA datasets exist, including
SimpleQuestionsWikidata [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], QALD-9-plus [12], QALD-10 [13], and the Semantic answer type
and relation prediction task (SMART) dataset [14], none of them specifically focuses on
counting questions, as their design caters to more general QA tasks.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.5. Hallucination</title>
        <p>Language models (LMs) struggle with ambiguous language or concepts that fall outside their
well-defined knowledge, resulting in outputs that can appear convincing but are factually
wrong or off-topic [15]. This issue, commonly referred to as "hallucination," diminishes the
trustworthiness of these models. Hallucinations can be categorized as stemming from data,
training, or inference [16]. This work focuses on the first cause, specifically, limitations at the
knowledge boundary (e.g., long-tail knowledge, up-to-date information, and copyright-sensitive
content). The knowledge-boundary problem is one area where KGs can offer mitigation by
expanding the knowledge reach of LMs. Furthermore, by exposing LMs to KGs, the issue of
fact-conflicting hallucination [17], where LMs generate content inconsistent with established
world knowledge, can be further mitigated. This is because KGs provide a factual grounding for
the processing performed by LMs.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Problem Statement</title>
        <p>In this work, we define cardinality/counting hallucination as the generation of an incorrect
numerical quantity by a language model (LM) when referring to the number of instances of a
person or object, as compared to the actual, verifiable number in the real world.</p>
        <p>Key elements of this definition are as follows:
1. Incorrect numerical quantity: The core issue is a discrepancy between the model’s output
and the true real-world count.
2. Countable entities: This hallucination specifically applies to countable items or
individuals.
3. LM generation: The incorrect quantity originates from the LM’s output.
4. Verifiable reference: A true, real-world value serves as the benchmark for correctness.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Counting Knowledge in Wikidata</title>
        <p>We conduct an analysis of Wikidata properties to identify which ones can be categorized as
counting properties. Wikidata contains over 12,000 properties, among which we focus on
those of type Quantity and WikibaseItem, totaling more than 2,300 properties. We exclude
other property types, such as CommonsMedia, ExternalId, and EntitySchema, as they are not
appropriate for representing counting information. We then perform a thorough manual review
of these properties to determine which can be considered counting properties, based on the
criterion that such properties must represent a countable number (i.e., a non-negative integer)
of entities that can be enumerated in real-world contexts. In the end, we identify a total of
169 counting properties, ranging from number of episodes (P1113) and (number of) employees
(P1128), to number of participants (P1132) and number of wins (P1355). The full list of the
counting properties is available at: https://s.id/wd-counting-properties.</p>
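        <p>The automatic pre-filtering step described above can be sketched as follows; the final 169 properties were chosen by manual review, so this code (with illustrative, not real, dump records) only reproduces the datatype filter:</p>
        <p>
```python
# Keep only Quantity/WikibaseItem properties as counting-property
# candidates; other datatypes (CommonsMedia, ExternalId, ...) are dropped.
ALLOWED_TYPES = {"Quantity", "WikibaseItem"}

properties = [  # illustrative records, not a real Wikidata property dump
    {"id": "P1113", "label": "number of episodes", "datatype": "Quantity"},
    {"id": "P18",   "label": "image",              "datatype": "CommonsMedia"},
    {"id": "P1128", "label": "employees",          "datatype": "Quantity"},
    {"id": "P214",  "label": "VIAF ID",            "datatype": "ExternalId"},
]

def candidate_counting_properties(props):
    """Pre-filter: retain properties whose datatype can encode a count."""
    return [p for p in props if p["datatype"] in ALLOWED_TYPES]

candidates = candidate_counting_properties(properties)
```
</p>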
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Approaches</title>
        <p>To address counting hallucinations, we propose two methods for integrating knowledge from
knowledge graphs (KGs) into LMs: Text-to-SPARQL and KG-to-Text. Importantly, both
approaches are LM-agnostic, meaning they can be applied with any LM, regardless of type or vendor.
The prompts used in these two methods are accessible at: https://s.id/counting-qa-prompts. In
this work, we focus on simple counting questions in which the subject entity and the target
count property are already given (e.g., "How many children does X have?"), and the relation is
encoded as a count in the KG. For broader question types, an initial step to validate or classify
the intent would be necessary before applying either method. This intent detection could be
carried out using conventional classification models or LM-based approaches. However, this
aspect falls outside the scope of the present study.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Text-to-SPARQL</title>
          <p>The Text-to-SPARQL approach, as displayed in Figure 1, outlines the process of transforming a
natural language (NL) counting question into a structured SPARQL query to retrieve answers
from a KG. It begins with question understanding by identifying the entities and properties
involved. The main entity is first extracted using an LM, which is prompted to isolate the most
specific proper noun mentioned in the question. This extracted entity name is then resolved to
its corresponding identifier in Wikidata using the wbsearchentities API
(https://www.wikidata.org/w/api.php?action=help&amp;modules=wbsearchentities), which returns
the most relevant entity ID (QID). To facilitate property identification, we build a vector database
(i.e., Chroma, https://www.trychroma.com/) from Wikidata property IDs, labels, and descriptions,
embedding these fields with the sentence-transformers model all-MiniLM-L6-v2 to capture
semantic similarity. We index those Wikidata properties whose type is either Quantity or
WikibaseItem, as discussed in Subsection 3.2. At query time, the input NL question is matched
against this indexed collection to retrieve the most relevant properties. Once the appropriate
entity ID (QID) and property ID (PID) have been obtained, an LM is prompted with pre-defined
examples and instructed to generate a SPARQL query that conforms to the structure of Wikidata.
This query is then executed against Wikidata, and the result is converted into a human-readable
answer.</p>
          <p>The following is the pseudocode for the Text-to-SPARQL approach.</p>
          <p>text_to_sparql_qa function
# Assume that the global variables langmodel, vector_db, and wikidata_endpoint exist

def text_to_sparql_qa(question):
    entity_name = extract_main_entity(question, langmodel)
    entity_wikidata_id = link_entity(entity_name)
    property_candidates = search_properties(question, vector_db)
    sparql_query = generate_sparql(question, entity_name, entity_wikidata_id,
                                   property_candidates, langmodel)
    return execute(sparql_query, wikidata_endpoint)</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. KG-to-Text</title>
          <p>The KG-to-Text pipeline, illustrated in Figure 2, answers NL counting questions using verbalized
information from a KG (i.e., Wikidata). It begins by extracting the main entity from the input
question and mapping it to its corresponding Wikidata identifier, using the same extraction and
linking mechanism as described in the Text-to-SPARQL approach. Next, a SPARQL query is
executed to retrieve all direct (truthy) triples associated with the identified entity. Each retrieved
triple is then verbalized into human-readable form by converting the subject, predicate, and
object into their English labels. These verbalized triples serve as contextual input to the LM,
which subsequently generates the final answer to the counting question.</p>
          <p>The following is the pseudocode for the KG-to-Text approach.</p>
          <p>kg_to_text_qa function
# Assume that the global variables langmodel and wikidata_endpoint exist

def kg_to_text_qa(question):
    entity_name = extract_main_entity(question, langmodel)
    entity_wikidata_id = link_entity(entity_name)
    entity_context = verbalize(entity_wikidata_id)
    return ask(question, entity_context, langmodel)</p>
        </sec>
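        <sec id="sec-3-3-3">
          <title>Python Sketch</title>
          <p>The verbalization step of KG-to-Text can be sketched as follows. This is a simplified, self-contained illustration: the label table is hardcoded here, whereas the actual pipeline fetches English labels from Wikidata.</p>
          <p>
```python
# Render (subject, predicate, object) triples as short English lines that
# can be supplied to the LM as context.

LABELS = {  # illustrative label table; the pipeline uses Wikidata labels
    "Q22686": "Donald Trump",
    "P1971": "number of children",
}

def verbalize_triples(triples, labels):
    """Convert each triple's IDs to labels and join them into context lines."""
    lines = []
    for s, p, o in triples:
        subj = labels.get(s, s)
        pred = labels.get(p, p)
        obj = labels.get(o, str(o))  # literals (e.g. counts) kept as text
        lines.append(f"{subj} {pred}: {obj}")
    return "\n".join(lines)

context = verbalize_triples([("Q22686", "P1971", 5)], LABELS)
```
</p>
        </sec>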
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Dataset</title>
        <p>As mentioned in the KGQA part of Section 2, currently no dedicated dataset exists for counting
question answering. We have developed such a dataset in a semi-automated manner using
Wikidata and the DeepSeek LM through the following steps:
1. First, we collect all properties available in Wikidata and select only those that inherently
provide counting information, as elaborated in Subsection 3.2.
2. From this refined set of counting properties, we sample entities that possess these
properties and extract their corresponding count values.
3. Subsequently, we generate natural language (NL) questions for each entity-property pair,
where the true answer to each question is drawn from the corresponding count value in
Wikidata. The generation process is supported by an LM; in our case, we rely on
DeepSeek-V3-0324. The prompt for the generation is openly available at:
https://s.id/counting-qa-prompts.
4. Finally, all the generated NL questions and their corresponding numerical answers are
compiled into a single dataset, which we have named the Counting QA dataset.</p>
        <p>Currently, our Counting QA dataset contains over 10,000 entries, each accompanied by three
NL question variations, totaling more than 30,000 questions focused on counting tasks. A
sample of the dataset is shown in Table 1, while the full dataset is publicly available at: https:
//s.id/counting-qa-dataset.</p>
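        <p>The shape of a dataset entry can be sketched with simple templates; note this is only an illustrative stand-in, since in the actual pipeline the three question phrasings are generated by DeepSeek-V3-0324 rather than by fixed templates:</p>
        <p>
```python
# Each entry pairs an entity and a counting property with the true count
# from Wikidata, plus three NL question variations.
TEMPLATES = [
    "What is the {prop} of {entity}?",
    "Can you tell me the {prop} of {entity}?",
    "According to the records, what is the {prop} of {entity}?",
]

def make_entry(entity_label, property_label, count):
    return {
        "entity": entity_label,
        "property": property_label,
        "answer": count,
        "questions": [
            t.format(prop=property_label, entity=entity_label)
            for t in TEMPLATES
        ],
    }

entry = make_entry("Neneh Cherry", "number of children", 3)
```
</p>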
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Experiment Setup</title>
        <p>We aim to evaluate our LM-based KGQA approaches, namely Text-to-SPARQL and KG-to-Text,
using the Counting QA dataset. Additionally, we include a simple baseline (naive) method
that relies solely on the pre-trained knowledge of LMs, without leveraging external structured
information. To examine the impact of structured counting knowledge across different types
of LMs, we select two representatives: Llama3.2 3B as a smaller LM and DeepSeek-V3-0324
as a more advanced one. To enable a modular analysis of our experimental setup, we also
introduce scenarios with perfect entity extraction, in which the correct entities are provided
directly, assuming flawless entity extraction and linking. This experiment design supports
clearer identification of potential sources of error. For evaluation, we use accuracy as the metric,
defined as the number of correct predictions, where the predicted count equals the true count,
divided by the total number of predictions. We choose accuracy due to its simplicity and ease
of interpretation.</p>
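        <p>The metric reduces to a few lines of code; this sketch assumes predicted and true counts are aligned lists of integers:</p>
        <p>
```python
def accuracy(predicted_counts, true_counts):
    """Fraction of predictions whose count exactly equals the true count."""
    assert len(predicted_counts) == len(true_counts)
    correct = sum(p == t for p, t in zip(predicted_counts, true_counts))
    return correct / len(true_counts)

# e.g. three of four predicted counts match the gold counts
score = accuracy([3, 4, 2, 7], [3, 4, 2, 5])
```
</p>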
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>Table 2 presents the results of our experiments across different approaches and LMs. Due to time
constraints, the experiments were performed on a randomly selected subset of 1,000 entries from
our Counting QA dataset, using only the first question variation for each entry. We consider
this sample size sufficiently representative of the full dataset. To ensure consistency across all
experiments, we applied a fixed random seed so that the same test set was used throughout.</p>
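      <p>The seeded sampling can be sketched as follows; the seed value and entry IDs below are illustrative, not the ones used in our experiments:</p>
      <p>
```python
import random

def sample_test_set(entry_ids, k=1000, seed=42):
    """Draw a reproducible k-entry subset using a locally seeded RNG."""
    rng = random.Random(seed)  # local RNG; global random state untouched
    return rng.sample(entry_ids, k)

ids = list(range(10000))
subset_a = sample_test_set(ids)
subset_b = sample_test_set(ids)
# Same seed, same population: both calls yield the identical subset.
```
</p>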
      <p>The experimental results show how different approaches perform across two language models,
Llama3.2 3B and DeepSeek-V3-0324. Overall, incorporating KGs significantly improves accuracy
in answering counting questions, hence reducing counting hallucinations.</p>
      <p>Starting with the simplest baseline, the Naive approach achieves the lowest performance,
reaching only 2.60% accuracy with Llama3.2 and 14.10% with DeepSeek. This reveals the severe
limitations of relying on an unguided LM output to provide precise numerical facts.</p>
      <p>The Text-to-SPARQL approach, which issues explicit KG queries based on extracted entities,
leads to a notable improvement. Accuracy rises to 45.90% for Llama3.2 and 65.30% for DeepSeek.
This result highlights the benefits of using structured queries over KGs, though it remains
somewhat limited by the quality of entity and property extraction.</p>
      <p>The KG-to-Text approach performs even better. This method retrieves and verbalizes triples
from the KG and supplies them as context for the LM to answer. It achieves 62.70% accuracy
with Llama3.2 and 77.60% with DeepSeek. These outcomes indicate that providing relevant
context from the KG helps keep the model anchored to the correct numerical facts.</p>
      <p>When perfect entity extraction is introduced, meaning the entity extraction and linking are
assumed to be flawless, the results improve dramatically. Under this scenario, Text-to-SPARQL
with perfect entity extraction achieves 65.40% accuracy with Llama3.2 and 82.80% with DeepSeek.
The KG-to-Text approach with perfect entity extraction reaches nearly perfect performance,
achieving 98.60% with Llama3.2 and 99.10% with DeepSeek. This emphasizes how essential
accurate entity identification is for maximizing the benefits of integrating KGs.</p>
      <p>Overall, these findings show that combining KGs with language models greatly enhances the
reliability of counting question answers. They also demonstrate that while a more advanced
model like DeepSeek consistently outperforms the smaller Llama3.2, both experience significant
gains when supported by KGs and precise entity extraction and linking. This confirms the strong
potential of this combined approach to reduce counting errors.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. KG Potential in Mitigating Counting Hallucination</title>
        <p>Our experiments clearly highlight the potential of KGs to mitigate counting hallucinations
in LMs. By grounding the LM’s generation on structured facts from a KG like Wikidata, we
observe a significant reduction in incorrect answers for counting questions. Methods such as
Text-to-SPARQL and KG-to-Text demonstrate substantial improvements over naive approaches,
achieving accuracies as high as 77.60% (without perfect entity extraction) and 99.10% (with
perfect entity extraction), compared to the naive baseline, which stays below 15% accuracy.
This underscores the value of integrating external structured knowledge into the QA process,
effectively anchoring the model’s output in verifiable data and substantially reducing the
tendency to hallucinate or guess quantities.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Completeness and Timeliness of KGs</title>
        <p>However, the effectiveness of KGs in preventing hallucinations is inherently tied to the
completeness and timeliness of the underlying graph. Even large-scale KGs like Wikidata may
lack up-to-date entries for rapidly evolving facts or could have incomplete coverage in certain
domains, which can limit their ability to serve as an authoritative grounding source. For instance,
newly formed organizations or emerging events might not yet be captured, leading to either a
fallback to LM-only generation or outdated responses. Thus, while KGs offer strong potential
to counter hallucinations for well-established facts, their utility can be constrained by the scope
and update frequency of the graph.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Dependence on Entity Extraction and Linking</title>
        <p>A critical dependency in KG-based approaches is the accurate extraction and linking of textual
mentions to their corresponding entities (QIDs) within the KG. The effectiveness of the
downstream retrieval or reasoning process heavily depends on this step. Our analysis reveals that
errors in surface form recognition, ambiguities (e.g., multiple people sharing the same name),
or insufficient disambiguation can lead to retrieving irrelevant or incomplete facts, directly
impacting the final answer’s correctness. This dependency is further amplified in structured
approaches like Text-to-SPARQL, where the entity ID is used to craft explicit queries. Enhancing
entity extraction and linking through context-aware disambiguation, type filtering, and robust
fallback mechanisms remains essential to fully leverage the benefits of KG integration.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Scale and Latency</title>
        <p>Next, we observe trade-offs in computational scale and latency across different approaches. Using
the DeepSeek API provides access to state-of-the-art capabilities with relatively fast inference
times, but incurs significant monetary costs, especially at scale. In contrast, running local models
like Llama3.2 can help avoid API expenses and reduce operational costs when deployed on
low-specification hardware, although this often results in slower processing speeds. However,
when Llama3.2 is executed on high-specification hardware, whether locally or in the cloud, it
can achieve fast inference comparable to frontier models, though at increased infrastructure cost.
This suggests that while KGs help reduce hallucinations, practical deployment also requires
balancing computational resources, budget constraints, and expected performance, especially
when integrating with diverse LM infrastructures.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Knowledge Graph Construction</title>
        <p>The effectiveness of our approach fundamentally depends on the availability of KGs that contain
counting-related information. In this work, we leverage Wikidata, a general-purpose KG, as
the primary source for counting knowledge. However, in specific domains such as personal
or enterprise settings, similar knowledge might not be readily available. In these contexts, the
applicability of our methods hinges on the existence or creation of relevant KGs. When such
graphs are absent, users may need to construct them, either manually through domain expertise
or automatically using information extraction techniques [18]. Counting knowledge itself can
be incorporated in several ways. It can be added manually by domain experts, extracted directly
from textual sources using dedicated methods [19], or derived through aggregation of existing
KG properties to synthesize new properties that reflect countable facts.</p>
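<p>The last option, deriving counting knowledge by aggregating existing KG properties, can be sketched as follows. The triples and property names below are hypothetical toy data for illustration, not actual Wikidata content or the paper&#8217;s pipeline:</p>
<preformat>
def derive_count_facts(triples, prop):
    """Aggregate (subject, property, object) triples into counting facts:
    for each subject, count its distinct objects under the given property."""
    objects_per_subject = {}
    for s, p, o in triples:
        if p == prop:
            objects_per_subject.setdefault(s, set()).add(o)
    # Each entry is a synthesized counting fact, e.g., ("Q1", "child", 2)
    return {s: len(objs) for s, objs in objects_per_subject.items()}

# Hypothetical toy triples (Wikidata-style identifiers, made-up content)
triples = [
    ("Q1", "child", "Q10"),
    ("Q1", "child", "Q11"),
    ("Q1", "child", "Q11"),  # duplicate assertion, counted once
    ("Q2", "child", "Q20"),
]

print(derive_count_facts(triples, "child"))  # {'Q1': 2, 'Q2': 1}
</preformat>
<p>Counting over distinct objects mirrors SPARQL&#8217;s COUNT(DISTINCT ?o) aggregation, so duplicated assertions do not inflate the synthesized counts.</p>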
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this paper, we have introduced the problem of counting hallucinations in language model
(LM) outputs. Through a detailed study of Wikidata, we have investigated how counting facts
are represented in a KG and have identified 169 properties that encapsulate countable entities,
ranging from the number of episodes and number of children to the number of employees and
more. We have proposed two primary LM-based KGQA methods for leveraging this structured
knowledge: Text-to-SPARQL, which converts natural language counting questions into SPARQL
queries, and KG-to-Text, which verbalizes knowledge triples associated with an entity and injects
them into the LM’s context for question answering (QA). To support systematic evaluation, we
have constructed the first benchmark dataset tailored specifically for counting QA, consisting
of more than 10,000 entries and 30,000 counting question variations. Experimental results have
demonstrated that combining KGs with LMs substantially reduces counting hallucinations.</p>
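<p>To make the KG-to-Text step concrete, the sketch below verbalizes counting triples and injects them into an LM prompt. The verbalization template, prompt format, and example values are illustrative assumptions, not the exact implementation evaluated in this paper:</p>
<preformat>
def verbalize_triples(entity_label, count_triples):
    """Turn (count-property label, value) pairs for an entity into plain
    sentences suitable for injection into an LM's context."""
    return " ".join(
        f"The {prop} of {entity_label} is {value}." for prop, value in count_triples
    )

def build_prompt(question, context):
    # Hypothetical prompt template: verbalized KG facts first, then the question.
    return f"Context: {context}\nQuestion: {question}\nAnswer with a number."

# Illustrative counting triples for a TV series entity
count_triples = [("number of seasons", 5), ("number of episodes", 62)]
context = verbalize_triples("Breaking Bad", count_triples)
print(build_prompt("How many seasons does Breaking Bad have?", context))
</preformat>
<p>Grounding the answer in verbalized counts like these is what lets the LM avoid guessing the number from parametric memory alone.</p>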
      <p>Future work may focus on enhancing the robustness of entity extraction and linking, which
serve as core components of our framework. Another promising direction is fine-tuning LMs
on the constructed counting QA dataset to improve their alignment with structured numerical
knowledge. Finally, it would be valuable to extend this approach beyond Wikidata, exploring
its applicability to other types of KGs, including personal and enterprise-level graphs.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the authors used ChatGPT and Gemini for grammar and
spelling checking and paraphrasing. After using these tools, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
        <p>[11] D. Diefenbach, T. P. Tanon, K. D. Singh, P. Maret, Question Answering Benchmarks
for Wikidata, in: International Workshop on the Semantic Web, 2017. URL:
https://api.semanticscholar.org/CorpusID:24052929.
[12] A. Perevalov, D. Diefenbach, R. Usbeck, A. Both, QALD-9-plus: A Multilingual Dataset
for Question Answering over DBpedia and Wikidata Translated by Native Speakers, 2022.
URL: https://arxiv.org/abs/2202.00120.
[13] R. Usbeck, X. Yan, A. Perevalov, L. Jiang, J. Schulz, A. Kraft, C. Möller, J. Huang, J. Reineke,
A.-C. N. Ngomo, M. Saleem, A. Both, QALD-10 – The 10th challenge on question answering
over linked data: Shifting from DBpedia to Wikidata as a KG for KGQA, Semantic Web 15
(2024) 2193–2207. doi:10.3233/SW-233471.
[14] N. Mihindukulasooriya, M. Dubey, A. Gliozzo, J. Lehmann, A.-C. N. Ngomo, R. Usbeck,
G. Rossiello, U. Kumar, Semantic Answer Type and Relation Prediction Task (SMART 2021),
2022. URL: https://arxiv.org/abs/2112.07606.
[15] G. Agrawal, T. Kumarage, Z. Alghamdi, H. Liu, Can Knowledge Graphs Reduce
Hallucinations in LLMs?: A Survey, in: Proceedings of the 2024 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies
(Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico,
2024, pp. 3947–3960. doi:10.18653/v1/2024.naacl-long.219.
[16] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin,
T. Liu, A Survey on Hallucination in Large Language Models: Principles, Taxonomy,
Challenges, and Open Questions, ACM Trans. Inf. Syst. 43 (2025). doi:10.1145/3703155.
[17] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang,
A. T. Luu, W. Bi, F. Shi, S. Shi, Siren’s Song in the AI Ocean: A Survey on Hallucination in
Large Language Models, 2023. URL: https://arxiv.org/abs/2309.01219.
[18] L. Zhong, J. Wu, Q. Li, H. Peng, X. Wu, A Comprehensive Survey on Automatic Knowledge
Graph Construction, ACM Comput. Surv. 56 (2023). doi:10.1145/3618295.
[19] P. Mirza, S. Razniewski, F. Darari, G. Weikum, Enriching Knowledge Bases with Counting
Quantifiers, in: The Semantic Web – ISWC 2018, Springer International Publishing, Cham,
2018, pp. 179–197.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[1] S. Sharma, A year ago, it was just ChatGPT - now LLMs are everywhere, 2023. URL: https://venturebeat.com/ai/a-year-ago-it-was-just-chatgpt-now-llms-are-everywhere/, accessed: 2025-07-17.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>[2] Thales Group, Artificial Intelligence fuels rise of hard-to-detect bots that now make up more than half of global internet traffic, according to the 2025 Imperva Bad Bot Report, 2025. URL: https://cpl.thalesgroup.com/about-us/newsroom/2025-imperva-bad-bot-report-ai-internet-trafic, accessed: 2025-07-17.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>[3] P. Sahoo, P. Meharia, A. Ghosh, S. Saha, V. Jain, A. Chadha, A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models, in: Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 11709–11724. doi:10.18653/v1/2024.findings-emnlp.685.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>[4] M. Asjad, Integrating Graph Structures into Language Models: A Comprehensive Study of GraphRAG, 2024. URL: https://www.marktechpost.com/2024/08/24/integrating-graph-structures-into-language-models-a-comprehensive-study-of-graphrag/, accessed: 2025-07-17.</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>[5] A. Hogan, E. Blomqvist, M. Cochez, C. d'Amato, G. de Melo, C. Gutiérrez, S. Kirrane, J. E. Labra Gayo, R. Navigli, S. Neumaier, A.-C. Ngonga Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen, J. F. Sequeda, S. Staab, A. Zimmermann, Knowledge Graphs, Synthesis Lectures on Data, Semantics, and Knowledge, Springer, 2021. doi:10.2200/S01125ED1V01Y202109DSK022.</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>[6] R. Cyganiak, D. Wood, M. Lanthaler, RDF 1.1 Concepts and Abstract Syntax, 2014. URL: https://www.w3.org/TR/rdf11-concepts/, accessed: 2025-07-17.</mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>[7] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Commun. ACM 57 (2014) 78–85. doi:10.1145/2629489.</mixed-citation>
      </ref>
      <ref id="ref8">
<mixed-citation>[8] S. Harris, A. Seaborne, SPARQL 1.1 Query Language, 2013. URL: https://www.w3.org/TR/sparql11-query/, accessed: 2025-07-17.</mixed-citation>
      </ref>
      <ref id="ref9">
<mixed-citation>[9] T. T. Procko, O. Ochoa, Graph Retrieval-Augmented Generation for Large Language Models: A Survey, in: 2024 Conference on AI, Science, Engineering, and Technology (AIxSET), 2024, pp. 166–169. doi:10.1109/AIxSET62544.2024.00030.</mixed-citation>
      </ref>
      <ref id="ref10">
<mixed-citation>[10] J. Pang, Y. Zhang, J. Deng, X. Zhu, A Survey on Information Retrieval Method for Knowledge Graph Complex Question Answering, in: 2022 China Automation Congress (CAC), 2022, pp. 1059–1064. doi:10.1109/CAC57257.2022.10055934.</mixed-citation>
      </ref>
      <ref id="ref11">
<mixed-citation>[11] D. Diefenbach, T. P. Tanon, K. D. Singh, P. Maret, Question Answering Benchmarks for Wikidata, in: International Workshop on the Semantic Web, 2017. URL: https://api.semanticscholar.org/CorpusID:24052929.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>