<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Large Language Models on Wikipedia Graph Navigation: Insights from the WikiGame</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Margiotta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Croce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Basili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <addr-line>Via del Politecnico 1, 00133, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Reveal s.r.l.</institution>
          ,
          <addr-line>Via Kenia 21, 00144, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) are believed to encode substantial structural and factual knowledge from resources such as Wikipedia, yet the extent to which they can exploit this internalized information for graph-based reasoning tasks remains unclear. We present a systematic evaluation of LLM navigation strategies in the context of the WikiGame, a task requiring players to reach a target Wikipedia page by traversing internal hyperlinks. We introduce a controlled experimental protocol that compares human and model performance across multiple settings, including both “blind” navigation (without access to outgoing links) and “link-aware” navigation (where available links are provided at each step). Using a large-scale dataset of human gameplay, we benchmark state-of-the-art LLMs (GPT-4, Llama 3.1) on identical start-goal pairs, measuring success rate, path efficiency, and error typologies. Our results show that while LLMs can match or surpass human accuracy under certain conditions, they exhibit qualitatively different strategies and characteristic failure modes, such as generating structurally invalid paths. Our findings highlight both the potential and the current limitations of LLMs in structured reasoning tasks, and propose a reproducible, game-based framework for assessing their ability to generalize beyond memorization.</p>
      </abstract>
      <kwd-group>
<kwd>WikiGame</kwd>
        <kwd>Wikipedia</kwd>
        <kwd>navigation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>reasoning</kwd>
        <kwd>human-machine comparison</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. Introduction</title>
      <p>Large Language Models (LLMs) have demonstrated remarkable progress across a wide range of linguistic, reasoning, and knowledge-intensive tasks [<xref ref-type="bibr" rid="ref1">1, 2</xref>]. This progress is commonly attributed to pre-training on massive, web-scale corpora that include not only unstructured text, but also highly structured resources such as Wikipedia [3]. As a result, there is increasing speculation that LLMs may implicitly acquire not just isolated facts, but also the latent structure, the network of hyperlinks, conceptual proximity, and topological organization, of sources like Wikipedia [4].</p>
<p>However, it remains an open question what it truly means for an LLM to “internalize” a knowledge graph. Does the model simply memorize page-level facts and frequent co-occurrences, or does it develop an operational understanding of the underlying relational structure, enabling it to solve combinatorial navigation tasks that it has not directly memorized [3, 5]? Addressing these questions is essential for assessing the actual capabilities and limitations of LLMs, especially as they are increasingly applied in scenarios that require reasoning beyond surface-level retrieval.</p>
<p>In this work, we address these questions through the WikiGame (https://www.thewikigame.com/), also known as Wikispeedia [6]: a human-invented challenge where the objective is to navigate from a given Wikipedia start page to a target page, using only internal hyperlinks and as few clicks as possible. Crucially, success in the WikiGame is not a matter of simple recall: it requires sequential link selection, conceptual inference, and a practical understanding of the Wikipedia graph’s structure. Human players bring background knowledge, associative reasoning, and an ability to generalize; LLMs, in contrast, are tested on their capacity to replicate this process, whether via latent recall, combinatorial reasoning, or structural generalization.</p>
      <p>For example, consider the challenge of navigating from Germanium (a chemical element) to Rock (geology). While these concepts are related at a high level, Wikipedia’s hyperlink structure does not provide a direct or trivial path between them. A successful player must identify and traverse a plausible sequence of intermediate pages, such as: Germanium → Mineral → Earth’s crust → Rock (geology), avoiding shortcuts that may appear semantically valid but do not correspond to actual Wikipedia links. This task exemplifies the combinatorial complexity and the need for real structural knowledge, rather than rote memorization of facts.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. ORCID: 0000-0001-9111-1950 (D. Croce); 0000-0001-5140-0694 (R. Basili). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
<p>To rigorously investigate these capacities, we construct a large-scale dataset of human WikiGame sessions (approximately 4,000 start-goal pairs), annotate them with empirical difficulty (success rate), and define a controlled evaluation framework spanning several experimental conditions. We benchmark two state-of-the-art LLMs (Llama 3.1 [7] and GPT-4 [2]) in three settings characterized by an increasing amount of information: (i) Blind Navigation, where the model is given only the names of the start and end pages and must generate a navigation path without any additional guidance; (ii) Chain-of-Thought Reasoning, where the model is asked to explicitly explain the rationale behind each navigational step [8]; and (iii) Link-Aware Navigation, where, at each step, the model is provided with the full list of outgoing links from the current Wikipedia page, thus closely simulating the experience and options available to a human player.</p>
<p>For each configuration, we assess not only overall success, such as path optimality, but also analyze failure modes, including invalid links and hallucinated pages. This allows us to explore whether LLM navigation relies on memorization, structural reasoning, or search-like strategies. While large models can match or exceed human performance in some settings, their errors often stem from structural hallucinations, revealing the limits between latent knowledge and true reasoning. Our work offers a reproducible benchmark and a diagnostic framework for evaluating how LLMs internalize knowledge graphs, with implications for model evaluation and the distinction between memorization and generalization.</p>
      <p>In the rest of the paper, we review related work in
Section 2, define the WikiGame task in Section 3, present
experiments and results in Section 4, and conclude with
key findings and future perspectives in Section 5.</p>
<p>Our Contribution. In contrast to previous research, our study offers a systematic comparison between humans and state-of-the-art LLMs on identical WikiGame challenges. By varying the information available to the models (blind vs. link-aware settings) and evaluating not only success rates but also the nature of errors (e.g., invalid links, hallucinated pages), we provide new insights into the mechanisms that underlie LLM navigation strategies. This framework enables us to directly probe the extent to which LLMs genuinely reason about Wikipedia’s structure versus relying on rote memorization or surface heuristics.</p>
      <p>2. Related Work</p>
      <p>LLMs as Knowledge Graph Navigators. The question of whether Large Language Models can serve as implicit knowledge bases [3], and, more deeply, whether they internalize the structural and relational properties of graph-based resources, has received increasing attention. While early benchmarks focused on factual recall or simple question answering [<xref ref-type="bibr" rid="ref1">3, 1</xref>], more recent work explores reasoning, pathfinding, and multi-hop navigation on graph-structured data.</p>
      <p>Navigation in Wikipedia and the WikiGame. Wikipedia, as a richly interlinked graph, has served as a challenging environment for both algorithmic agents and neural models. Zaheer et al. [4] train agents to imitate random walks on Wikipedia, showing that neural policies can learn to reach distant targets by leveraging graph regularities. However, their focus is on synthetic agent trajectories and does not systematically benchmark human or LLM strategies. Graph-based neural architectures such as Relational Graph Convolutional Networks have also been evaluated on multi-hop reasoning tasks over Wikipedia subgraphs [5], highlighting the importance of both symbolic and learned relational information for effective pathfinding. Synthetic data approaches [9] attempt to reproduce human navigation on Wikipedia, showing that clickstream-inspired trajectories can approximate real user behavior, but do not address the capacity of LLMs to navigate the graph or compare them directly to human performance. The WikiGame itself (and variants such as Wikispeedia [6]) has long been a benchmark for human semantic navigation, but only recently have researchers begun to systematically evaluate LLMs on this task.</p>
      <p>Generalization vs Memorization in LLMs. A core research question is whether LLMs’ strong performance on navigation reflects generalization from distributed knowledge or mere memorization of surface patterns and co-occurrences [3]. Prior work has highlighted both the strengths and limitations of LLMs in knowledge-intensive tasks, but comprehensive, human-comparable evaluation on graph navigation remains scarce.</p>
    </sec>
    <sec id="sec-2">
<title>3. WikiGame as a Probe for LLM Reasoning</title>
      <p>In this section, we formalize the WikiGame as a graph navigation task and motivate its value as a benchmark for large language models. We outline our experimental protocol for evaluating LLM reasoning under different information settings and introduce metrics to distinguish memorization, structural generalization, and explicit reasoning. These methodological choices establish a solid foundation for analyzing the strategies and limitations of both human and model-based Wikipedia navigation. All software and datasets are publicly available on GitHub at https://github.com/crux82/wikigame-llm-eval.</p>
      <p>3.1. From Encyclopedia to Graph: Formalizing Wikipedia Navigation</p>
      <p>Wikipedia can naturally be represented as a directed graph, where each node corresponds to an article and each directed edge to a hyperlink from one article to another. Formally, let 𝒢 = (𝒱, ℰ) denote the Wikipedia hyperlink graph, with 𝒱 the set of pages and ℰ the set of directed edges, such that (u, v) ∈ ℰ if v is hyperlinked within the text of u.</p>
      <p>Given this structure, the WikiGame can be formulated as a pathfinding problem: starting from a source node s (the Start page), the agent must reach a target node t (the End page) by traversing a sequence of nodes (v₀ = s, v₁, . . . , vₙ = t), such that each consecutive pair (vᵢ, vᵢ₊₁) corresponds to an existing edge in ℰ.</p>
      <p>The challenge lies not only in finding any path from s to t, but in selecting paths that are plausible and efficient, i.e., minimizing the number of steps, in line with typical game objectives and human strategies. At each step, the agent’s possible actions are constrained to the outgoing links from the current page, which (depending on the experimental condition) may or may not be explicitly visible to the agent.</p>
      <p>This formalization allows us to cast the WikiGame as a sequential decision-making problem over a partially observable and large-scale real-world graph. Crucially, success requires not only factual knowledge, but also structural reasoning and the ability to generalize over Wikipedia’s highly interconnected topology, making it a compelling testbed for both human and artificial agents.</p>
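<p>The formalization above maps directly onto standard graph code. The sketch below builds a tiny, hypothetical fragment of the hyperlink graph (an adjacency map, not the real Wikipedia dump) and recovers a shortest link path with breadth-first search:</p>

```python
from collections import deque

# Hypothetical toy fragment of the Wikipedia hyperlink graph, for
# illustration only: page title -> set of pages it links to.
GRAPH = {
    "Germanium": {"Mineral", "Chemical element"},
    "Chemical element": {"Mineral"},
    "Mineral": {"Earth's crust"},
    "Earth's crust": {"Rock (geology)"},
    "Rock (geology)": set(),
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns a minimum-length link path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(GRAPH, "Germanium", "Rock (geology)"))
```

On this toy graph the search recovers exactly the intermediate-page sequence used as the running example in the introduction.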
    </sec>
    <sec id="sec-3">
<title>3.2. Probing LLM Competence: Experimental Paradigms</title>
      <p>We evaluate LLMs under three progressively informative settings, each designed to probe a different aspect of their reasoning and navigation abilities.</p>
      <p>Blind Navigation (Direct Path Prediction). In the blind setting, the model is presented only with the titles of the start node (s) and end node (t), and is asked to output a plausible sequence of Wikipedia page titles forming a path from s to t. Crucially, at no step does the model observe the set of valid outgoing links from any node. This setting tests whether LLMs can retrieve or reconstruct complex multi-step relations from internalized knowledge, probing their ability to generalize, rather than simply recall isolated facts. Of particular interest here is whether errors reflect “hallucinated” nodes (page titles not present in Wikipedia) or “hallucinated” links (pairs of existing pages for which no hyperlink exists in the actual graph). Such distinctions shed light on whether the model’s apparent knowledge is structural or superficial. The precise prompt is in Appendix A.</p>
      <p>Blind Navigation with Chain-of-Thought Reasoning. This mode extends the previous setting by requiring the model to articulate, in natural language, the reasoning behind each navigational step. The sequence of justifications offers a window into the intermediate representations and planning strategies of the model, helping us distinguish whether successful paths arise from semantically-grounded reasoning or from statistical shortcuts. Moreover, Chain-of-Thought (CoT) supervision [8] enables us to quantify the impact of explicit reasoning on path quality and error rates. As before, the model is not exposed to outgoing links at any point. The prompt design for this condition is detailed in Appendix B.</p>
      <p>Link-Aware Navigation (Stepwise Choice). Finally, the link-aware mode simulates the actual gameplay experience: at each step, the model receives the set of outgoing links from the current node, and is requested to select the next node (page) to traverse. This setting directly tests the model’s ability to reason under stepwise constraints, avoid invalid transitions, and make locally grounded decisions. Notably, this scenario also allows for a direct comparison to human strategies, since the action space at each step is identical to what a player would see. Here, the primary sources of error are choices of suboptimal but valid links, and the rate of hallucinated steps should, in principle, be minimized. See Appendix C for the full prompt.</p>
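<p>The link-aware protocol reduces to a simple control loop. In the sketch below, choose_next is a stub standing in for the per-step LLM call; the greedy policy and the toy graph are hypothetical, for illustration only:</p>

```python
def navigate_link_aware(graph, start, goal, choose_next, max_steps=20):
    """Stepwise link-aware navigation: at each step the agent sees only
    the outgoing links of the current page and must pick one of them."""
    path = [start]
    current = start
    for _ in range(max_steps):
        if current == goal:
            return path
        options = sorted(graph.get(current, ()))
        if not options:
            return None          # dead end: no outgoing links
        current = choose_next(current, goal, options)
        if current not in options:
            return None          # invalid choice (impossible for a human player)
        path.append(current)
    return path if current == goal else None

# Stub policy standing in for the LLM (hypothetical): take the goal if it
# is directly linked, otherwise the first option alphabetically.
def greedy_stub(current, goal, options):
    return goal if goal in options else options[0]

toy = {"Germanium": {"Mineral"}, "Mineral": {"Rock (geology)"},
       "Rock (geology)": set()}
print(navigate_link_aware(toy, "Germanium", "Rock (geology)", greedy_stub))
# ['Germanium', 'Mineral', 'Rock (geology)']
```

Because the next page is always drawn from the visible options, hallucinated steps are ruled out by construction, mirroring the design of this experimental condition.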
    </sec>
    <sec id="sec-5">
<title>3.3. Evaluation Metrics: Dissecting Navigational Behavior</title>
      <p>To assess the navigation and reasoning abilities of LLMs in the WikiGame, we employ complementary evaluation metrics that capture different aspects of task performance, including memorization, generalization, and strategy.</p>
      <p>Success Rate. The most immediate measure is the success rate, defined as the proportion of WikiGame instances in which the agent (human or LLM, under a given strategy) successfully reaches the target node t starting from node s via a valid sequence of Wikipedia links. This metric provides a high-level view of navigational ability, aggregating all sources of error into a single outcome variable. High success rates in the blind setting, for instance, may indicate substantial memorization or internalized global structure, while improvements in CoT or link-aware settings can reveal the role of explicit reasoning or contextual cues. Contrasting success across these modes helps disentangle whether LLMs rely on static recall, reasoning over implicit knowledge, or dynamic use of available context.</p>
      <p>Mean Path Length (with Standard Deviation). Beyond mere task completion, we consider the efficiency of navigation. For all successful paths, we compute the average number of steps required to reach the goal, along with the standard deviation to capture variability across trials. Shorter average path lengths may suggest direct or globally informed strategies (possibly indicative of internalized conceptual proximity or shortcut-finding), while longer or more variable paths can reveal hesitancy, local search, or lack of structural insight. Comparing path lengths between humans and LLMs, and among settings, provides a window into differences in search strategy and the quality of graph representations.</p>
      <p>Invalid Link Rate. For model-based solutions, we compute the percentage of navigation attempts in which a transition is made between two existing Wikipedia pages, but the selected edge does not actually exist among the outgoing links of the current page (i.e., (vᵢ, vᵢ₊₁) ∉ ℰ, even though vᵢ₊₁ ∈ 𝒱). This error mode is critical for probing the distinction between true structural generalization and shallow recall: frequent invalid links imply that the model has learned about entities but not their actual connectivity, while low rates suggest a more faithful reconstruction of Wikipedia’s hyperlink topology. Notably, we expect invalid link errors to be most revealing in the blind setting, where the temptation to hallucinate plausible (but non-existent) transitions is highest.</p>
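<p>Aggregating success rate and mean path length over a set of games is straightforward; in the sketch below the per-game outcomes are hypothetical placeholders:</p>

```python
from statistics import mean, stdev

# Hypothetical per-game outcomes: (reached_goal, path_length_if_successful).
trials = [(True, 4), (True, 6), (False, None), (True, 5), (False, None)]

success_rate = sum(ok for ok, _ in trials) / len(trials)
lengths = [n for ok, n in trials if ok]   # lengths of successful paths only

print(f"success rate: {success_rate:.0%}")                         # 60%
print(f"path length: {mean(lengths):.1f} +/- {stdev(lengths):.1f} steps")
```

As in the metric definition, only successful paths contribute to the mean and standard deviation of path length.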
<p>Invalid Page Rate. Complementary to the above, we also measure the proportion of model-generated paths in which one or more nodes vᵢ do not correspond to any real Wikipedia page (vᵢ ∉ 𝒱). This captures a distinct failure mode (hallucination of nonexistent entities) which can arise from overgeneralization or semantic drift. Tracking this error across different strategies (e.g., whether it is reduced by explicit reasoning or by access to real links) informs our understanding of the interplay between LLM world knowledge and task-specific prompt structure.</p>
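<p>The two error types reduce to two set-membership tests against 𝒢 = (𝒱, ℰ). A minimal checker, over a hypothetical toy graph, might look like this:</p>

```python
def classify_path(graph, path, goal):
    """Check a proposed path against the hyperlink graph: report a
    hallucinated page, a hallucinated link, a miss, or a success."""
    for page in path:
        if page not in graph:
            return "invalid_page"      # node not in V
    for u, v in zip(path, path[1:]):
        if v not in graph[u]:
            return "invalid_link"      # both pages exist, edge not in E
    return "success" if path and path[-1] == goal else "failure"

TOY = {"Germanium": {"Mineral"}, "Mineral": {"Rock (geology)"},
       "Rock (geology)": set()}
# Both pages exist, but no direct Germanium -> Rock (geology) edge:
print(classify_path(TOY, ["Germanium", "Rock (geology)"], "Rock (geology)"))
# invalid_link
```

Note that the page check runs first: a path through a nonexistent page is counted as an invalid page rather than an invalid link, matching the distinction drawn above.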
      <sec id="sec-5-1">
        <title>4. Experimental Evaluation</title>
        <sec id="sec-5-1-1">
          <title>4.1. Experimental Setup</title>
<p>We begin by collecting a large corpus of human gameplay data from the public WikiGame platform (thewikigame.com), which assigns users random start-goal Wikipedia pairs and records navigation attempts, both successful and unsuccessful. Using a custom scraping tool, we continuously harvested game records over several weeks, yielding over 4,000 unique games, each annotated with start and target page, number of attempts, completion count, and aggregated success rate. This broad base allows for a detailed analysis of game difficulty: as shown in Figure 1, the distribution of human success is highly skewed, with only a handful of games approaching high completion rates and the majority posing a real challenge to human intuition and knowledge.</p>
          <p>Figure 1: Distribution of human success rates across the 4,000 collected WikiGame instances. Each bar shows the percentage of games whose completion rate by human players falls within the corresponding interval. Most games are far from trivial, with only a small fraction of tasks having high human success rates.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
<p>To ensure both representativeness and feasibility for LLM evaluation, we structured our experimental dataset by difficulty, grouping games based on their human success rate: Medium (50% ≤ success rate &lt; 75%), Hard (25% ≤ success rate &lt; 50%), Very Hard (1% ≤ success rate &lt; 25%), and Impossible (success rate = 0%). The Easy category (success rate &gt; 75%) was excluded, as it contained only 6 games. From each bin, we selected the 30 most-played games, resulting in a diverse set of 120 start-goal pairs that accurately reflects the real distribution of task difficulty, while keeping the evaluation manageable.</p>
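<p>The binning rule can be written down directly. This is a sketch: rates are percentages, and sub-1% rates (which the bin definitions above leave unassigned) are folded into Very Hard for simplicity:</p>

```python
def difficulty_bin(human_success_rate):
    """Map a game's human success rate (in percent) to its difficulty bin."""
    if human_success_rate == 0:
        return "Impossible"
    if human_success_rate >= 75:
        return "Easy"        # excluded from the benchmark (only 6 such games)
    if human_success_rate >= 50:
        return "Medium"
    if human_success_rate >= 25:
        return "Hard"
    return "Very Hard"       # rates from just above 0 up to 25

print([difficulty_bin(r) for r in (0, 10, 30, 60, 80)])
# ['Impossible', 'Very Hard', 'Hard', 'Medium', 'Easy']
```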
<p>For the model-based experiments, we selected a panel of LLMs that exemplifies the diversity of current architectures, scales, and access paradigms. Our evaluation includes the latest proprietary GPT-4 models accessed via the OpenAI API: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, and gpt-4o-mini, chosen to cover a spectrum from flagship large-scale models to compact and cost-efficient variants (see https://platform.openai.com/docs/models/gpt-4.1, https://platform.openai.com/docs/models/gpt-4.1-mini, https://platform.openai.com/docs/models/gpt-4.1-nano, and https://platform.openai.com/docs/models/gpt-4o-mini). For the open-weight evaluation, we used Meta’s Llama 3.1-8B-Instruct (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), deployed locally to ensure experimental control and reproducibility. This experimental design allows for direct comparison across proprietary versus open models, and across varying model sizes and training data coverage. The GPT-4 family was selected to probe the limits of proprietary large-scale models with extensive training on web data (including Wikipedia), while the Llama variant allows us to assess the capabilities of an open, smaller-scale architecture under more restricted computational resources (local GPU).</p>
      <p>As can be seen in Figure 2, all models were evaluated under the three experimental modes introduced in Section 3.2: Blind Navigation, Blind Navigation with Chain-of-Thought, and Link-Aware Navigation. For the OpenAI models, API calls were made with deterministic decoding (temperature = 0) to ensure reproducibility. For Llama, inference was run on local GPUs using a greedy decoding strategy, avoiding any probabilistic sampling and thus ensuring fully reproducible outputs. Notably, the Blind modes required a single API call per game, while the Link-Aware mode demanded one call per navigation step, increasing both API cost and computational resources, a key reason for limiting the test set to 120 games. All model outputs were automatically checked for structural validity (i.e., presence of only real Wikipedia page names and hyperlinks), with detailed error metrics collected as described in Section 3.3. To ensure transparency and reproducibility, all data, code, and prompt templates are publicly available.</p>
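<p>The cost asymmetry between the modes follows directly from the call pattern just described; the average step count below is a hypothetical figure used only to illustrate the arithmetic:</p>

```python
def api_calls(mode, n_games, avg_steps_per_game):
    """Blind and CoT modes need one call per game; Link-Aware needs one
    call per navigation step (avg_steps_per_game is hypothetical)."""
    return n_games * (1 if mode in ("blind", "cot") else avg_steps_per_game)

print(api_calls("blind", 120, 6))       # 120
print(api_calls("link_aware", 120, 6))  # 720
```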
<p>Figure 2: Success rates achieved by humans and different LLMs on the 120 WikiGame tasks, ordered by difficulty (based on human success rates), from easiest on the left to most difficult on the right, and by experimental mode. LLM performance is shown for the No Reasoning, Chain-of-Thought (CoT), and Link-aware conditions; the human baseline is reported for comparison.</p>
      <p>4.2. Results and Discussion</p>
<p>Comparing Human and Model Success. We present a comparative analysis of human participants and Large Language Models (LLMs) across the full set of WikiGame tasks, stratified by difficulty. As shown in Figure 1, human success rates decrease steadily as task difficulty increases, from approximately 56% on Medium tasks to 0% on the Impossible category. Looking instead at Figure 2, in the Blind settings (No Reasoning and Chain-of-Thought), only the largest model (GPT-4.1) approaches or matches human performance, particularly on the less difficult games. When models are provided with explicit link information (Link-aware mode), success rates increase dramatically for all GPT-based models, with GPT-4.1 achieving perfect accuracy (100%) even on the hardest tasks. In contrast, smaller models and Llama 3.1-8B exhibit lower overall performance and are especially challenged as difficulty rises.</p>
      <p>Table 1 details these trends, confirming the strong advantage of large-scale LLMs when given access to link structure, and quantifying the performance gap between model families and sizes, as well as with respect to humans. Notably, large models like GPT-4.1 nearly match or even exceed human accuracy on medium and hard games, even when required to hallucinate plausible paths without structural information, demonstrating substantial internalized knowledge of Wikipedia’s structure. However, this capacity rapidly diminishes for smaller models and Llama 3.1-8B, underscoring the importance of both scale and training diversity for generalization in this combinatorial setting. The Link-aware condition reveals that explicit access to local structure allows even smaller models to become more competitive, and often enables large LLMs to outperform humans on the most difficult tasks. These results highlight that while large LLMs internalize part of Wikipedia’s global structure, their ability to generalize without explicit context remains limited; access to structural cues is critical for bridging the gap between memorization and robust reasoning.</p>
      <p>Remarkably, GPT-4.1 achieves success rates in the Blind setting that are nearly indistinguishable from those of human players, despite not having access to the outgoing links at each step, an advantage always available to humans. This surprising alignment suggests that GPT-4.1 has internalized a substantial portion of Wikipedia’s structure, likely as a result of large-scale pretraining. Such performance raises the question of whether these models are simply memorizing large parts of Wikipedia’s link graph or have developed more generalizable strategies for navigation. In any case, the fact that a model can solve the task as well as humans, even when deprived of crucial contextual information, highlights both the strengths and the unresolved boundaries of current LLM capabilities. (Links to the dataset and code repository will be provided after acceptance.)</p>
<p>Error Analysis: Invalid Links and Hallucinated Pages. We further analyze model behavior by quantifying two principal categories of structural error: invalid links, i.e., transitions between real Wikipedia pages that are not connected in the actual hyperlink graph, and invalid (hallucinated) pages, i.e., nodes that do not exist in Wikipedia.</p>
<p>Tables 2 and 3 summarize the error rates for all models, difficulty levels, and information settings.</p>
<p>Table 1: Detailed success rates (%) of each model and human participants, across all difficulty categories and experimental settings (No Reasoning, Chain-of-Thought, Link-aware).</p>
      <p>Table 2: Invalid Link Rate: percentage of navigation paths containing at least one transition between two existing Wikipedia pages for which no hyperlink actually exists. Results are reported for all models, difficulty bins, and experimental modes.</p>
      <p>Invalid links represent the dominant failure mode across all models, especially in the Blind and Chain-of-Thought (CoT) conditions. Here, smaller models such as GPT-4.1-nano and Llama 3.1-8B often exceed 70–80% invalid link rates, while the best-performing model (GPT-4.1) remains substantially lower but is still affected by increasing task difficulty. Interestingly, generating explicit reasoning with CoT prompts only marginally reduces invalid link errors, and in some cases may even exacerbate them, suggesting that stepwise justifications do not systematically enhance structural fidelity.</p>
<p>By contrast, providing local link information (Link-aware mode) yields dramatic improvements for all GPT-based models, with invalid link rates dropping to near zero in most settings, regardless of difficulty. This highlights the centrality of explicit structural cues for accurate graph traversal. Notably, Llama 3.1-8B still struggles with invalid links even in the Link-aware setting, indicating architectural and training limitations not overcome by local information alone.</p>
      <p>The generation of nonexistent Wikipedia pages is a less frequent, but still important, error type. Invalid page rates remain below 10% for most models and settings, with higher incidences concentrated among smaller models and in the most challenging tasks. The GPT-4 family is notably conservative, rarely hallucinating new pages, while Llama 3.1-8B and smaller GPT variants are more prone to this error, particularly under Blind conditions.</p>
<p>CoT reasoning occasionally increases invalid page rates, perhaps reflecting a tendency toward overgeneration in less robust models. Together, these results illustrate that the core challenge for LLMs in blind navigation is not the invention of entirely new entities, but rather the generation of plausible-yet-nonexistent links between real Wikipedia pages. Invalid link rates are highly sensitive to both model scale and the availability of local context, whereas invalid page rates remain a secondary but informative indicator of robustness. The error patterns reinforce that, while large LLMs have internalized significant aspects of Wikipedia’s structure, their global knowledge is incomplete and patchy, most evident when explicit structural feedback is absent.</p>
      <p>Navigation Efficiency. Table 4 reports the average path lengths for each model and human participants across task difficulty and experimental mode, revealing a marked distinction in navigation efficiency. In both the Blind (No Reasoning) and Chain-of-Thought (CoT) settings, all language models produce navigation paths that are, on average, substantially shorter than those of human players. For instance, on Medium and Hard tasks, humans typically require around 5.5 and 6.6 steps respectively, whereas top-performing LLMs such as GPT-4.1 solve the same tasks in just 3–4 steps. This pattern suggests that, when unconstrained by real hyperlink options, LLMs tend to “jump” directly to the goal, likely exploiting their internal representations of semantic relatedness and making aggressive, shortcut-like connections not accessible to humans.</p>
      <p>In contrast, when models are placed in the Link-aware mode (where only valid outgoing links are visible at each step), average path lengths increase and can even approach or exceed human averages, particularly for more difficult games. This shift reflects a more conservative and locally grounded navigation style: restricted to real options, models avoid risky or speculative moves and instead opt for safer, if longer, paths. The difference is especially evident in smaller models (e.g., Llama 3.1-8B), which show much greater variance and, in some cases, excessively long solutions as task complexity grows.</p>
      <p>Interestingly, while shorter paths might seem opti- tion of Wikipedia’s “semantic landscape” that is broad
mal, this eficiency in Blind settings often arises from but incomplete, with brittle spots where the true
hyperthe use of invalid or hallucinated links, as indicated in link structure diverges from distributional similarity. A
our previous error analysis. By contrast, the slightly further finding concerns the limits of Chain-of-Thought
longer paths produced in Link-aware mode are typically prompting in structurally constrained tasks. While
vermore faithful to Wikipedia’s structure, and thus better balized reasoning can support performance on factoid
reflect human-like and valid solutions. Consequently, or arithmetic challenges, in navigation it sometimes
enpath length should always be interpreted alongside error courages overgeneration or speculative shortcuts,
highrates: eficiency alone does not guarantee correctness, lighting the limits of purely linguistic supervision for
and valid navigation sometimes demands a willingness inherently graph-based reasoning problems.
to take longer, but legal, routes through the graph. A further nuance emerges from our error analysis: not
all invalid links proposed by LLMs are necessarily
misKey Insights and Open Challenges. Beyond quanti- takes in a semantic sense. In several cases, especially in
tative gains, our study reveals several less obvious but cru- the Blind navigation setting, the models generate
transicial insights into LLM navigation and reasoning. Larger tions between pages that are not currently hyperlinked
GPT models, by virtue of scale and pretraining diversity, in Wikipedia, but which would be both meaningful and
are able to recombine fragments of Wikipedia knowl- contextually appropriate. This phenomenon highlights
edge into plausible multi-step paths, even when direct a subtle limitation of the evaluation protocol itself: the
supervision for these specific routes is unlikely. This Wikipedia graph, while vast, is not exhaustive, and may
compositional ability is especially evident in challenging omit reasonable connections that a knowledgeable agent
settings, where models often leverage high-trafic "hub" could plausibly infer. Consequently, some LLM
“hallucipages as implicit waypoints—a behavior rarely observed nations” may in fact surface gaps in the existing
knowledge structure rather than true model failures. This
ambiguity complicates the strict interpretation of invalid link
rates: high-performing models may occasionally reveal
“missing links” that reflect creative generalization rather
than simple error.</p>
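To make the joint reading of efficiency and correctness concrete, the following sketch computes the validity rate together with the mean path length restricted to structurally valid paths. The record layout and sample values are illustrative, not our experimental data.

```python
# Joint view of efficiency and correctness: mean path length computed only
# over structurally valid paths, alongside the validity rate.
from statistics import mean

def summarize(records):
    """records: list of dicts with 'length' (number of hops) and 'valid' (bool)."""
    valid = [r for r in records if r["valid"]]
    return {
        "validity_rate": len(valid) / len(records),
        "mean_length_valid": mean(r["length"] for r in valid) if valid else None,
        "mean_length_all": mean(r["length"] for r in records),
    }

# Illustrative runs: a short path may still be invalid (hallucinated link).
runs = [
    {"length": 3, "valid": False},  # short but uses a nonexistent hyperlink
    {"length": 3, "valid": True},
    {"length": 6, "valid": True},
]
print(summarize(runs))
```

Reporting mean length only over valid paths avoids rewarding the "shortcut" behavior discussed above.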
      <p>We acknowledge financial support from the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU, and support from Project ECS 0000024 Rome Technopole - CUP B83C22002820006, NRRP Mission 4 Component 2 Investment 1.5, funded by the European Union - NextGenerationEU.</p>
      <sec id="sec-6-1">
        <title>A. Prompt: Blind - No Reasoning</title>
        <p>This prompt instructs the model to generate a direct navigation path from a given start Wikipedia page to a target page, using as few steps as possible. The model must output only the sequence of page titles (separated by "-&gt;"), each title standing in for the link that would be clicked, with no explanation or reasoning, simulating the most basic WikiGame navigation scenario without any access to outgoing links or intermediate guidance.</p>
        <p>How to play:
A start page and an end page on Wikipedia are selected. These can be chosen randomly or decided by the players.
Starting from the Start_Node, you must click only on internal links found within the main body of the article to reach the End_Node.</p>
        <p>Your task:
The user will provide a Start_Node and an End_Node.</p>
        <p>You must generate a path from the start to the end, trying to use the fewest possible link hops.</p>
        <p>Do not explain anything.</p>
        <p>The only output should be:
- A line containing ###
- A single line with the names of the pages in the path, separated by -&gt; (e.g., Page1 -&gt; Page2 -&gt; Page3)
Expected output format:
###
Page1 -&gt; Page2 -&gt; Page3 -&gt; Page4
Important:
- Page1 -&gt; Page2 -&gt; Page3 -&gt; Page4 is only an example of the output format; do not use it as the solution
- Write only the page titles separated by -&gt;.
- Do not include any reasoning or explanation.
- Do not write anything before or after the final line.</p>
        <p>- Start your output with ### on a line by itself.</p>
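A response in this format can be checked mechanically. The sketch below (the function name and sample string are ours, not part of the experimental pipeline) extracts the page sequence from the line following the ### delimiter:

```python
def parse_blind_output(text: str) -> list[str]:
    """Extract the page sequence from the expected output format:
    a line containing ### followed by 'Page1 -> Page2 -> ...'."""
    marker = "###"
    if marker not in text:
        raise ValueError("missing ### delimiter")
    tail = text.split(marker, 1)[1].strip()
    # The path is the first non-empty line after the delimiter.
    path_line = tail.splitlines()[0]
    return [title.strip() for title in path_line.split("->")]

out = "###\nGermanium -> Metalloid -> Silicon"
print(parse_blind_output(out))  # ['Germanium', 'Metalloid', 'Silicon']
```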
      </sec>
      <sec id="sec-6-2">
        <title>B. Prompt: Blind - Reasoning</title>
        <p>This prompt requires the model to solve the WikiGame navigation task while explicitly articulating the reasoning
behind each step. At every hop, the model must briefly explain its choice, and only after completing the path, output
the full solution as a sequence of page titles. This setting aims to probe the model’s internal reasoning process and
to assess whether explanation improves path validity or plausibility.</p>
        <p>How to play:
A start page and an end page on Wikipedia are selected. These can be chosen randomly or decided by the players.
Starting from the Start_Node, you must click only on internal links found within the main body of the article to reach the End_Node.</p>
        <p>Your task:
Solve the path from the Start_Node to the End_Node using as few steps as possible.</p>
        <p>At each step, you must explain why you’re clicking on the chosen link.</p>
        <p>Once you’ve reached the destination, write the full path using -&gt; between page names.</p>
        <p>Instructions:
You will be given two page names: Start_Node and End_Node.</p>
        <p>Starting from Start_Node, find a path to reach End_Node.</p>
        <p>At each step, explain briefly why you’re choosing that link.</p>
        <p>When you reach the destination:
- First, think through an Explanation of how to reach the End_Node from the Start_Node
- Then write a line with just ###
- Finally write the full path as a list of link names separated by -&gt;
- Do not include any text before or after the final path
Important:
- Do not skip the ### line before the full path.
- Do not add explanations after the ### section.
- The final line must contain only Wikipedia page titles separated by -&gt;, nothing else.
- The final line must contain all the page titles, in the order chosen during the Explanation.
- The final line must start with the Start_Node and finish with the End_Node (without explanation or suffix)
Expected output format:
Explanation:
1. I start at "Page 1" (Start_Node) and click on "Page 2" because ...
2. From "Page 2", I click on "Page 3" because ...
3. From "Page 3", I go to "Page 4" (End_Node) which is the final goal because ...
###</p>
        <p>Page1 -&gt; Page2 -&gt; Page3 -&gt; Page4</p>
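The explanation and the final path in this format can be separated by splitting on the last ### delimiter. The helper name and sample response below are illustrative, not the paper's actual tooling:

```python
def parse_reasoning_output(text: str) -> tuple:
    """Split a CoT response into (explanation, path): the path is the
    first non-empty line after the last ### delimiter."""
    explanation, _, tail = text.rpartition("###")
    lines = [line for line in tail.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no path after ### delimiter")
    path = [t.strip() for t in lines[0].split("->")]
    return explanation.strip(), path

resp = (
    "Explanation:\n"
    '1. I start at "Clock" and click on "Time" because ...\n'
    "###\n"
    "Clock -> Time -> Computing"
)
expl, path = parse_reasoning_output(resp)
print(path)  # ['Clock', 'Time', 'Computing']
```

Splitting on the last delimiter tolerates a ### appearing accidentally inside the reasoning text.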
      </sec>
      <sec id="sec-6-3">
        <title>C. Prompt: Link-Aware</title>
        <p>In this prompt, the model is presented at each step with the explicit list of outgoing links from the current Wikipedia page and must choose one to move closer to the target page. No reasoning or explanation is required (only the chosen page name is output), thus closely mimicking the human decision process in an actual WikiGame session with visible navigation options. This mode directly tests the model’s ability to select valid and effective next steps when provided with local link context.</p>
        <p>How to play:
A start page and an end page on Wikipedia are selected. These can be chosen randomly or decided by the players.
Starting from the Start_Node, you must click only on internal links found within the main body of the article to reach the End_Node.</p>
        <p>Your task:
The user will provide a Start_Node, an End_Node, and a List_Link_From_Start_Node: a list of page names linked from the Start_Node.</p>
        <p>You must make a single choice of one page name from those proposed in List_Link_From_Start_Node; the page you choose must bring you as close as possible to the End_Node.</p>
        <p>At every step, make the choice that best moves you toward the End_Node.</p>
        <p>Do not explain anything.</p>
        <p>The only output should be:
- A line containing ###
- The unique page name choice, only one from the list List_Link_From_Start_Node
- A final line containing @@@
Very Important Instructions:
- Write only the chosen page title.
- You must choose the page from the List_Link_From_Start_Node.
- Do not include any reasoning or explanation.
- Start your output with ### on a line by itself.
- After the page name choice, write a last line with @@@.
- Do not write the same page name as the Start_Node; you will lose.
- Do not write a page name that is not in the List_Link_From_Start_Node.
- Do not change the case of the page name; write it exactly as it appears in the List_Link_From_Start_Node.</p>
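The Link-aware protocol can be sketched as a driver loop that queries a chooser (standing in for the LLM call at each step) and rejects any choice outside the current link list. The graph fragment and the greedy chooser below are illustrative only:

```python
# Driver loop for the Link-aware setting: at each step the agent must pick
# exactly one page from the current page's outgoing links.
def navigate(start, goal, links, choose, max_steps=20):
    current, path = start, [start]
    for _ in range(max_steps):
        if current == goal:
            return path
        options = links.get(current, set())
        nxt = choose(current, goal, options)
        if nxt not in options:  # reject out-of-list (invalid) choices
            raise ValueError(f"{nxt!r} is not a link on {current!r}")
        current = nxt
        path.append(nxt)
    return None  # step budget exhausted

# Illustrative graph fragment; real runs query Wikipedia's link structure.
graph = {
    "Clock": {"Time"},
    "Time": {"Measurement", "Computing"},
}
greedy = lambda cur, goal, opts: goal if goal in opts else sorted(opts)[0]
print(navigate("Clock", "Computing", graph, greedy))  # ['Clock', 'Time', 'Computing']
```

The membership check mirrors the prompt's constraint that the model "must choose the page from the List_Link_From_Start_Node".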
      </sec>
      <sec id="sec-6-4">
        <title>D. Error Analysis</title>
        <p>To illustrate typical model errors and their underlying causes, we present a qualitative analysis of failed navigation
attempts in the Blind settings (No Reasoning and CoT), focusing on the most frequent error type: Invalid Link, where
a transition is generated between two existing Wikipedia pages, but the corresponding hyperlink does not exist.
Case 1: Semantic Plausibility without Structural Support
Task: Medium difficulty, gpt-4o-mini (Blind - No Reasoning)
Start: Germanium End: Rock_(geology)
Generated Path: Germanium → Metalloid → Silicon → Rock_(geology)
Error: No link from Silicon to Rock_(geology).</p>
        <p>While Silicon and Rock_(geology) are closely related semantically, the Wikipedia page for Silicon does not
link directly to Rock_(geology). In contrast, human players typically reach the target via longer, structurally
valid paths, e.g., through Mineral, Earth’s crust, or Solid.</p>
        <p>Case 2: Link Hallucination from Mentioned but Unlinked Entities
Task: Very Hard, gpt-4.1 (Blind - Reasoning)
Start: Clock End: Computing
Generated Path: Clock → Computer → Computing
Error: No link from Clock to Computer.</p>
        <p>The model identifies Computer as conceptually relevant (and mentioned in the Clock article text), but this mention
is not a hyperlink. Human solutions tend to traverse more granular technical or historical intermediates, yielding
longer but valid paths.</p>
        <p>Case 3: Overgeneralization of Conceptual Connections
Task: Very Hard, gpt-4o-mini (Blind - Reasoning)
Start: Clock End: Computing
Generated Path: Clock → Time → Measurement → Computing
Error: No link from Measurement to Computing.</p>
        <p>Computing is present as a term within Measurement, but not as a direct hyperlink. The LLM overgeneralizes the
apparent connection, skipping intermediate concepts that human players usually include.</p>
        <p>Case 4: Surface Similarity versus Structural Reality
Task: Medium, gpt-4o-mini (Blind - No Reasoning)
Start: Diesel_engine End: Electric_charge
Generated Path: Diesel_engine → Internal_combustion_engine → Electric_vehicle →
Electric_charge
Error: No link from Electric_vehicle to Electric_charge.</p>
        <p>Although Electric_vehicle is strongly associated with Electric_charge in meaning, Wikipedia’s link
structure does not provide a direct connection. Human players reliably reach the target via technical or physical
intermediates such as Spark plug, Electric current, or Piezoelectricity.</p>
        <p>Summary. Across these cases, LLMs display a tendency to infer links based on high-level conceptual associations
or textual mentions rather than strictly adhering to Wikipedia’s hyperlink structure. This behavior is particularly
evident in Blind settings, where models must rely on internalized world knowledge. In contrast, human players
favor longer but structurally valid paths. These examples highlight a key challenge for LLM-based graph navigation:
distinguishing plausible but invalid shortcuts from topologically feasible solutions.</p>
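The invalid-link versus invalid-page typology used in these cases can be operationalized as a hop-by-hop check of a generated path against a snapshot of the link graph. The following sketch uses a toy graph fragment mirroring Case 1; a page is taken to exist if and only if it appears as a key in the snapshot:

```python
# Classify each hop of a generated path against a (partial) snapshot of
# Wikipedia's link graph. Both the graph and the path are illustrative.
from typing import Dict, List, Set

def classify_path(path: List[str], links: Dict[str, Set[str]]) -> List[str]:
    """Label each hop as 'valid', 'invalid_link' (both pages exist but no
    hyperlink connects them), or 'invalid_page' (target page does not exist)."""
    labels = []
    for src, dst in zip(path, path[1:]):
        if dst not in links:                    # hallucinated, nonexistent page
            labels.append("invalid_page")
        elif dst not in links.get(src, set()):  # real page, missing hyperlink
            labels.append("invalid_link")
        else:
            labels.append("valid")
    return labels

# Toy fragment reproducing Case 1: no direct link Silicon -> Rock_(geology).
graph = {
    "Germanium": {"Metalloid"},
    "Metalloid": {"Silicon"},
    "Silicon": {"Mineral"},
    "Mineral": {"Rock_(geology)"},
    "Rock_(geology)": set(),
}
print(classify_path(["Germanium", "Metalloid", "Silicon", "Rock_(geology)"], graph))
# ['valid', 'valid', 'invalid_link']
```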
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>Our study has several limitations: the range of start-goal pairs and models is constrained by cost, and our metrics focus on structural correctness rather than semantic nuance or user experience. Expanding to additional architectures, larger model dimensions, multilingual Wikis, or richer evaluation criteria represents important future work. In summary, LLMs show strong but imperfect generalization beyond memorization, with qualitative strategy differences persisting relative to humans. Future research should probe broader model families, alternative domains, and hybrid approaches that combine LLM reasoning with explicit graph traversal, as well as deeper comparisons of human and model navigation strategies.</p>
    </sec>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, et al., Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, et al. (Eds.), Advances in Neural Information Processing Systems, volume 33, 2020.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] I. Staliūnaitė, P. J. Gorinski, I. Iacobacci, Relational graph convolutional neural networks for multihop reasoning: A comparative study, arXiv preprint arXiv:2210.06418 (2022).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. West, J. Pineau, D. Precup, Wikispeedia: An online game for inferring semantic distances between concepts, in: IJCAI, volume 9, 2009, pp. 1598-1603.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. G. et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, D. Zhou, Chain of thought prompting elicits reasoning in large language models, CoRR abs/2201.11903 (2022). URL: https://arxiv.org/abs/2201.11903. arXiv:2201.11903.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Arora, M. Gerlach, T. Piccardi, A. García-Durán, et al., in: Proceedings of the ACM International Conference on Web Search and Data Mining, WSDM ’22, ACM, 2022, pp. 16-26. doi:10.1145/3488560.3498496.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>